Pre-processing Data

How about ‘up-sampling’, where we make every cuisine in the list the same size as the largest one? This can be done by adding recipes to each smaller cuisine through random sampling with replacement. However, for this exercise, up-sampling a cuisine from 16 recipes to 290 recipes would produce a lot of duplicated recipes. This can lead to overfitting for cuisines such as ‘Brazilian’, ‘Russian’, etc.

A better way to handle this is to use a mixture of ‘up-sampling’ and ‘down-sampling’. That is, we fix a sample size that we think is large enough to give a useful training set while still limiting the risk of overfitting. We then ‘up-sample’ the smaller cuisines to this size and ‘down-sample’ the larger cuisines to the same size.
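The mixed resampling described above can be sketched with the Python standard library alone. This is only a minimal illustration, not the code used for the experiment: the function name `balance_by_resampling`, the toy recipe strings, and the placeholder label ‘Largest’ are all assumptions made for the example. Only the sizes 16 and 290 and the target of 100 come from this article.

```python
import random
from collections import defaultdict

def balance_by_resampling(recipes, labels, target_size=100, seed=0):
    """Return a resampled copy where every cuisine has target_size recipes.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the recipes by their cuisine label.
    by_cuisine = defaultdict(list)
    for recipe, cuisine in zip(recipes, labels):
        by_cuisine[cuisine].append(recipe)

    balanced_recipes, balanced_labels = [], []
    for cuisine, items in by_cuisine.items():
        if len(items) >= target_size:
            # Down-sampling: draw target_size recipes without replacement.
            sampled = rng.sample(items, target_size)
        else:
            # Up-sampling: keep every original recipe, then add random
            # duplicates (drawn with replacement) to reach target_size.
            extra = rng.choices(items, k=target_size - len(items))
            sampled = items + extra
        balanced_recipes.extend(sampled)
        balanced_labels.extend([cuisine] * target_size)
    return balanced_recipes, balanced_labels

# Toy data: one large cuisine (290 recipes) and one small one (16 recipes).
recipes = [f"largest_{i}" for i in range(290)] + [f"brazilian_{i}" for i in range(16)]
labels = ["Largest"] * 290 + ["Brazilian"] * 16

X_bal, y_bal = balance_by_resampling(recipes, labels, target_size=100)
```

With a target of 100, each of the 16 small-cuisine recipes appears about 6 times on average, instead of about 18 times when up-sampling all the way to 290. That is the reason a middle value reduces duplication.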

We constructed a Decision Tree with depth = 2 on this small training sample. To deal with the imbalanced data, we used a mixture of ‘up-sampling’ and ‘down-sampling’ to bring each cuisine to 100 recipes. Here is the result of the classification on the evaluation set:

Figure: Imbalanced data result

We can see that without balancing the recipes in the training data, a small cuisine such as ‘Brazilian’ has an accuracy of 0%. After performing ‘up-sampling’ and ‘down-sampling’, the model is able to give a better classification.
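To see why a per-cuisine accuracy matters, here is a small sketch that computes accuracy separately for each class. The helper `per_class_accuracy`, the placeholder label ‘Largest’, and the toy predictions are made up for illustration and are not the results from the experiment above.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cuisine: correct[cuisine] / total[cuisine] for cuisine in total}

# Toy evaluation set: a model that always predicts the majority cuisine.
y_true = ["Largest"] * 8 + ["Brazilian"] * 2
y_pred = ["Largest"] * 10

scores = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scores)   # per-cuisine accuracy
print(overall)  # overall accuracy
```

In this toy case the overall accuracy is 0.8, while the accuracy for the small cuisine is 0.0. An overall score can therefore hide a class that the model never predicts correctly.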
