Pre-processing Data

In data science, before building a predictive model from a particular data set, it is important to explore and pre-process the data. In this blog, we will illustrate some typical steps of data pre-processing.

In this exercise, we will build a simple Decision Tree model to classify a recipe's cuisine from its list of ingredients. The data for this exercise can be downloaded from:

https://www.kaggle.com/kaggle/recipe-ingredients-dataset

Through this exercise, we will show the importance of data pre-processing. This blog is organized as follows:

  1. Data Exploration and Pre-processing.
  2. Imbalanced Data.

1.  Data Exploration and Pre-processing

When you are given a data set, it is important to explore and analyze it before constructing a predictive model. Let us first explore this data set.

Looking at the first 10 items of this data set, we observe that, even within a single cuisine, the lists of ingredients can differ considerably.
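A minimal sketch of loading and inspecting the data, assuming the train.json file from the Kaggle page above (each record has id, cuisine, and ingredients fields):

```python
import pandas as pd

# train.json is assumed to be the file from the Kaggle link above:
# a JSON array of records with "id", "cuisine" and "ingredients" fields.
recipes = pd.read_json("train.json")

# Inspect the first 10 recipes.
print(recipes.head(10))
```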

From this data set, we find that there are 20 different cuisines and that the distribution of recipes is far from uniform. For example, Italian recipes make up 19.7% of the data set, while only 1.17% of the recipes are Brazilian.
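This distribution can be computed directly from the cuisine column (a minimal sketch, reusing the recipes DataFrame loaded above):

```python
# Percentage of recipes per cuisine; expect roughly 19.7% for 'italian'
# and about 1.17% for 'brazilian'.
cuisine_share = recipes["cuisine"].value_counts(normalize=True) * 100
print(cuisine_share.round(2))
```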

[Figure: distribution of recipes across the 20 cuisines in the recipe data set]

Now, let us explore this data set further and look at the top 15 ingredients.
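One way to compute these counts is a simple counter over all ingredient lists (a minimal sketch, reusing the recipes DataFrame from above):

```python
from collections import Counter

# Count how often each ingredient appears across all recipes.
ingredient_counts = Counter(
    ingredient
    for ingredient_list in recipes["ingredients"]
    for ingredient in ingredient_list
)
print(ingredient_counts.most_common(15))
```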

[Figure: top 15 most common ingredients]

Looking at the top 15 ingredients, we see that they include "salt", "water", "sugar", etc. These are all generic ingredients that can be found in every cuisine. Intuitively, removing them from the classification model should not affect its accuracy.

For the classification model, we would prefer each cuisine's recipes to contain ingredients unique to that country; this helps the model identify which cuisine a recipe comes from.

After removing all the generic ingredients (salt, water, sugar, etc.) from the data set, we look at the top 15 ingredients again.
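A sketch of this filtering step is below; the stop list is only an illustrative assumption and should be refined with domain knowledge:

```python
# Hand-picked stop list of generic ingredients (illustrative only).
GENERIC_INGREDIENTS = {"salt", "water", "sugar", "vegetable oil", "olive oil"}

recipes["ingredients"] = recipes["ingredients"].apply(
    lambda ingredient_list: [i for i in ingredient_list
                             if i not in GENERIC_INGREDIENTS]
)
```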

[Figure: top 15 most common ingredients after removing generic ingredients]

It looks like we could remove a few more ingredients, but the decision of which ones to remove is best left to someone with more domain knowledge in cooking. For example, one country may use 'onion' in its recipes while another uses 'red onion', so it is better not to filter out too many generic ingredients.

Now, let us look at the distribution of the number of ingredients per recipe in the data set.

[Figure: distribution of the number of ingredients per recipe]

Some recipes have only 1 or 2 ingredients, while others have up to 60. It is probably best to remove the recipes with very few ingredients from the data set, as such short ingredient lists may not be representative enough for the classification model. What is the minimum number of ingredients required to classify a cuisine? The short answer is that no one knows. It is best to experiment: remove recipes with 1, 2, 3, etc. ingredients, re-train the model each time, and compare the accuracies to decide which threshold works best for your model.
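The sketch below shows both steps; the threshold is a hypothetical starting value, not a recommendation:

```python
# Number of ingredients per recipe.
recipes["n_ingredients"] = recipes["ingredients"].apply(len)
print(recipes["n_ingredients"].describe())

# Drop recipes with fewer than MIN_INGREDIENTS ingredients; re-train the
# model for each candidate threshold and compare accuracies.
MIN_INGREDIENTS = 3  # illustrative; try 2, 3, 4, ...
recipes = recipes[recipes["n_ingredients"] >= MIN_INGREDIENTS]
```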

The ingredients in each recipe are all words, so in order to do further pre-processing, we will need some NLP (Natural Language Processing).
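As a minimal first pass, the hypothetical helper normalize_ingredient below lower-cases names and strips punctuation and digits; a fuller pipeline might also lemmatize, e.g. with NLTK:

```python
import re

def normalize_ingredient(name: str) -> str:
    """Lower-case an ingredient name, drop punctuation/digits, squeeze spaces."""
    name = name.lower()
    name = re.sub(r"[^a-z ]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

recipes["ingredients"] = recipes["ingredients"].apply(
    lambda ingredient_list: [normalize_ingredient(i) for i in ingredient_list]
)
```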

Binomial Theorem

Can you expand $(x+y)^{2}$? You would probably find that quite easy: $(x+y)^{2} = x^{2} + 2xy + y^{2}$.

How about the expansion of $(x+y)^{10}$? That is no longer so easy. However, if we use the Binomial Theorem, this expansion becomes an easy problem.

The Binomial Theorem is a very intriguing topic in mathematics with a wide range of applications.

Theorem

Let $x$ and $y$ be real numbers (or complex numbers, or polynomials). For any positive integer $n$, we have:

$$\begin{align*} (x+y)^{n} &= \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n}\\ &= \sum_{k=0}^{n} \binom{n}{k} x^{n-k}y^{k} \end{align*}$$

where,

$$\begin{align*} \binom{n}{k} = \frac{n!}{k!(n-k)!} \end{align*}$$
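For example, the theorem lets us read off any coefficient of $(x+y)^{10}$ without multiplying out ten factors; the term in $x^{7}y^{3}$ is:

$$\begin{align*} \binom{10}{3}x^{7}y^{3} = \frac{10!}{3!\,7!}\,x^{7}y^{3} = 120\,x^{7}y^{3}. \end{align*}$$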

Proof:

We will use proof by induction. The base case $n=1$ is obvious. Now suppose that the theorem holds for the case $n-1$, that is, assume that:

$$\begin{align*} (x+y)^{n-1} = \binom{n-1}{0}x^{n-1} + \binom{n-1}{1}x^{n-2}y + \dots + \binom{n-1}{n-2}xy^{n-2} + \binom{n-1}{n-1}y^{n-1} \end{align*}$$

We then need to show that the formula also holds for $n$:

$$\begin{align*} (x+y)^{n} = \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n} \end{align*}$$

Let us consider the left-hand side of the equation above:

$$\begin{align*} (x+y)^{n} &= (x+y) (x+y)^{n-1} \\ &= (x+y) \bigg( \binom{n-1}{0}x^{n-1} + \binom{n-1}{1}x^{n-2}y + \dots \\ &+ \binom{n-1}{n-2}xy^{n-2} + \binom{n-1}{n-1}y^{n-1}\bigg) \\ &= x^{n} + \bigg( \binom{n-1}{0} + \binom{n-1}{1}\bigg) x^{n-1}y + \bigg( \binom{n-1}{1} + \binom{n-1}{2}\bigg) x^{n-2}y^{2} + \dots \\ &+\bigg( \binom{n-1}{n-2} + \binom{n-1}{n-1}\bigg) xy^{n-1} + y^{n} \end{align*}$$

We can now apply Pascal's identity:

$$\begin{align*} \binom{n-1}{k-1} + \binom{n-1}{k} = \binom{n}{k} \end{align*}$$
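For instance, with $n = 5$ and $k = 2$:

$$\begin{align*} \binom{4}{1} + \binom{4}{2} = 4 + 6 = 10 = \binom{5}{2}. \end{align*}$$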

Each matched pair of coefficients collapses, and the expansion above simplifies to:

$$\begin{align*} (x+y)^{n} &= x^{n} + \binom{n}{1}x^{n-1}y + \binom{n}{2}x^{n-2}y^{2} + \dots + \binom{n}{n-1}xy^{n-1} +y^{n}\\ &= \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n} \end{align*}$$

as desired.

Example 1: Power rule in Calculus

In calculus, we constantly use the power rule $\frac{d}{dx} x^{n} = n x^{n-1}$.

We can prove this rule using the Binomial Theorem.

Proof:

Recall that the derivative of a function $f(x)$ is defined as:

$$\begin{align*} \frac{d}{dx} f(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \end{align*}$$

Let $n$ be a positive integer and let $f(x) = x^{n}$. The derivative of $f(x)$ is:

$$\begin{align*} \frac{d}{dx} x^{n} &= \lim_{h \rightarrow 0} \frac{(x+h)^{n} - x^{n}}{h}\\ &= \lim_{h \rightarrow 0} \frac{\bigg( \binom{n}{0} x^{n} + \binom{n}{1}x^{n-1}h + \dots + \binom{n}{n} h^{n} \bigg) - x^{n}}{h}\\ & = \lim_{h \rightarrow 0} \frac{ \binom{n}{1}x^{n-1}h + \binom{n}{2}x^{n-2}h^{2}+ \dots + \binom{n}{n} h^{n}}{h}\\ & = \lim_{h \rightarrow 0} \bigg( \binom{n}{1}x^{n-1} + \binom{n}{2}x^{n-2}h+ \dots + \binom{n}{n} h^{n-1} \bigg)\\ &= \binom{n}{1}x^{n-1}\\ &= n x^{n-1}. \end{align*}$$

After dividing by $h$, every term except the first still contains a factor of $h$, so all of them vanish as $h \rightarrow 0$.

Example 2: Binomial Distribution

Let $X$ be the number of heads in a sequence of $n$ independent coin tosses. $X$ is usually modeled by the binomial distribution. Let $p \in [0,1]$ be the probability that a head shows up in a toss, and let $k = 0,1,\dots,n$. The probability that there are $k$ heads in the sequence of $n$ tosses is:

$$\begin{align*} P(X = k) = \binom{n}{k} p^{k}(1-p)^{n-k} \end{align*}$$

We know that all the probabilities must sum to 1. To show this, we can use the Binomial Theorem. We have:

$$\begin{align*} \sum_{k=0}^{n} P(X = k) &= \sum_{k=0}^{n} \binom{n}{k} p^{k}(1-p)^{n-k}\\ & = (p + 1 - p)^{n}\\ &= 1. \end{align*}$$
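We can also check this numerically (a small sketch; the values of n and p below are arbitrary):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The probabilities should sum to 1 (up to floating-point error).
n, p = 10, 0.3  # arbitrary example values
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))  # -> 1.0
```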

Please also check our other articles on Gaussian Samples, N-gram language models, Bayesian models, and Monte Carlo methods for more statistics knowledge.
