How about ‘up-sampling’, where we grow every cuisine to the size of the largest one? This can be done by adding recipes to each smaller cuisine through random sampling with replacement. For this exercise, however, up-sizing a sample of 16 recipes to 290 recipes would produce a lot of duplicated recipes: each original recipe would appear about 18 times on average. This can lead to overfitting for small cuisines such as ‘Brazilian’, ‘Russian’, etc.
A better way to handle this is a mixture of ‘up-sampling’ and ‘down-sampling’. That is, we fix a target sample size that we think is large enough to give a useful training set while still limiting the risk of overfitting. We then ‘up-sample’ the smaller cuisines to this value and ‘down-sample’ the larger cuisines to the same value.
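The mixed strategy can be sketched in a few lines of plain Python. This is only an illustration, not the code used for the experiment: the function name `resample_to_fixed_size`, the `(cuisine, recipe)` tuple layout and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def resample_to_fixed_size(recipes, target_size, seed=0):
    """Return a list with exactly `target_size` recipes per cuisine.

    `recipes` is assumed to be a list of (cuisine, recipe) pairs.
    """
    rng = random.Random(seed)

    # Group the recipes by cuisine.
    by_cuisine = defaultdict(list)
    for cuisine, recipe in recipes:
        by_cuisine[cuisine].append(recipe)

    balanced = []
    for cuisine, items in by_cuisine.items():
        if len(items) >= target_size:
            # Down-sampling: draw without replacement, so no duplicates.
            chosen = rng.sample(items, target_size)
        else:
            # Up-sampling: keep every original recipe, then fill the gap
            # with random draws with replacement (these are duplicates).
            extra = rng.choices(items, k=target_size - len(items))
            chosen = items + extra
        balanced.extend((cuisine, recipe) for recipe in chosen)

    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): one large cuisine and one small cuisine.
toy = [("large", f"large_{i}") for i in range(290)]
toy += [("small", f"small_{i}") for i in range(16)]

balanced = resample_to_fixed_size(toy, target_size=100)
counts = Counter(cuisine for cuisine, _ in balanced)
unique_small = len({r for c, r in balanced if c == "small"})
unique_large = len({r for c, r in balanced if c == "large"})
print(counts, unique_small, unique_large)
```

Note that the small cuisine still contains only 16 distinct recipes after up-sampling. A smaller target size keeps the amount of duplication lower than up-sampling all the way to the largest cuisine would.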
We constructed a Decision Tree with depth = 2 on this small training set. To deal with the imbalanced data, we used a mixture of ‘up-sampling’ and ‘down-sampling’ to bring every cuisine to 100 recipes. Here is the result of the classification on the evaluation set:
We can see that without balancing the recipes in the training data, a small cuisine such as ‘Brazilian’ has an accuracy of 0%. After performing ‘up-sampling’ and ‘down-sampling’, the model is able to give a better classification.
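Accuracy per cuisine is what exposes this problem, because the overall accuracy can look fine while a small cuisine is never predicted correctly. Below is a minimal sketch of how it could be computed. The helper `per_class_accuracy` and the labels are hypothetical and are not the results of the experiment above.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true label."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical evaluation labels: a model that always predicts the large class.
y_true = ["large_cuisine"] * 8 + ["Brazilian"] * 2
y_pred = ["large_cuisine"] * 10

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_accuracy(y_true, y_pred)
print(overall, scores)
```

In this toy case the overall accuracy is 80%, yet every ‘Brazilian’ sample is misclassified, which is the pattern an unbalanced training set tends to produce.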