Introduction to Feature Engineering


In a modeling process, there are 3 core concepts that will always exist:

  • Data.
  • Features.
  • Type of model and its corresponding parameters.

Between the data and the model, features are a measurable representation of the data: they are the format in which the model processes it. Creating features from data is therefore a required step in any modeling process. Moreover, apart from improving data quality and choosing a better model, building better features is another way to improve model performance. Domain knowledge about the data also shapes feature creation, either by imposing requirements or by suggesting which direction to take. Knowing how to create features, how to do it well, and how to use domain knowledge along the way is therefore a skill that makes good results easier to reach.

This leads to the main purpose of this blog post: an introduction to feature engineering. We hope it helps readers get a sense of the process and provides a starting point for anyone who would like to study feature engineering further.

What is feature engineering?

Feature engineering can be described as the process of transforming information in the data into features that effectively improve the model’s performance, possibly with the help of domain knowledge.

Created features are not always useful, and a large number of features can easily lead to overfitting or the curse of dimensionality. Feature engineering therefore goes hand in hand with feature selection, so that only contributing features are used by the model. Regularization and kernel methods can also help keep the number of features under control.

Feature engineering process

Since the modeling process is iterative, so is feature engineering. The process can be summarized as follows: after data gathering and preprocessing, data analysis and assumptions serve as the basis for an initial feature set; after testing that set through the model's results, we decide whether further feature engineering is needed.

The steps of a feature engineering process are:

  • Look at the data to decide which features to start with.
  • Create the features.
  • Check how much the features contribute to the result.
  • If the result is still unsatisfactory, evaluate and set a new direction for feature engineering.

Mindset and consideration for feature engineering


Feature engineering is not only about methods for creating features. Its principles have also evolved into best practices and intuition for the feature creation process, which help avoid mistakes and boost overall performance.

Because data and use cases are situational, the best feature engineering practices are mostly found through trial and analysis. There is no fully systematic way to do it, but there are underlying reasons why practitioners tend to do things in a certain way. The right mindset is to actively run experiments or study past precedents, look for the deeper principles, and in the end shape the intuition needed to carry out the work.


From “Feature Engineering and Selection: A Practical Approach for Predictive Models” (Kuhn & Johnson, 2019), some points worth noting:

  • Overfitting is also a concern when feature engineering goes wrong and introduces features that are relevant to the current dataset but have no relationship with the outcome once new data is included.
  • To find the link between predictors and the outcome to be predicted, supervised or unsupervised data analysis can be used.
  • Given the “No Free Lunch” theorem, trying several types of models to see which one works best is the best course of action.
  • Other concepts related to this topic: model vs. modeling process, model bias and variance, experience-driven vs. empirically driven modeling, and big data.

They also work through another example that illustrates these points:

  • A trial-and-error process is needed to find the best option.
  • The interaction between models and features is complex and unpredictable. However, the effect of the feature set may be more significant than the effect of the choice of model.
  • With the right set of features, strong performance can be reached across different types of models.

Introducing feature engineering techniques through an example

Different data types have their own feature engineering techniques. This part of the blog post introduces the techniques by data type, so that when you come across a given data type, the related techniques can be identified immediately.

Performance Metrics for Weather Images Forecasting

In a typical Machine Learning project, one would need to find out how good or bad their models are by measuring the models’ performance on a test dataset, using some statistical metrics.

Various performance metrics are used for different problems, depending on what needs to be optimized by the models. For this blog, we will focus on the evaluation metrics that are used in weather forecasting, based on radar images.

The major problem that we need to overcome in our forecasting model is to quickly, precisely, and accurately predict the movement of rain clouds in a short period of time. If heavy rain is predicted as showers or – even worse – as cloudy weather without rain, the consequences could be serious for the users of our prediction model. If the rain is going to stop in the next few minutes, incorrect forecasting – that predicts that rainfall would continue with high intensity – may cause little to no harm; however, the prediction model is no longer useful.

A good model should tackle as many of these issues as possible. We believe that the following measures may help us identify which model is better.

Performance Metrics

  • Root Mean Square Error (RMSE): This is a broad measure of accuracy, expressed as an average error across forecast-observation pairs. Formally, for N pairs of forecast values F_i and observed values O_i, it is defined as $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (F_i - O_i)^2}$. This measure tells us how much the predicted intensity differs from the ground-truth observation.

Figure 1: RMSE of 2 models across a 60-minute forecast.

Figure 1 shows an example of the RMSE of 2 different models over a 60-minute forecast.

The RMSE of Model 1 increases over time. Model 2, on the other hand, appears to have a smaller RMSE, which makes it the better of the two models by this measure.

Before defining the next metric, we need to recall the confusion matrix (Figure 2). Each column of the matrix represents the instances in an actual class, while each row represents the instances in a predicted class, or vice versa [1]. Using the confusion matrix, we can count the number of False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN).


Figure 2: Definition of a Confusion Matrix.

  • Hit Rate (H): The fraction of observed events that are forecast correctly, also known as the Probability of Detection. It tells us what proportion of the points that actually had rain were predicted by the algorithm as having rain. It ranges from 0 to 1.

$H = \frac{TP}{TP + FN}$

Figure 3: Hit rate of 2 models across a 60-minute forecast.

In Figure 3, the Hit Rate of both models is good in the first 20 minutes. Model 2 has a higher value than Model 1 (a higher probability of detecting rain). Therefore, Model 2 is the better model based on this measure.

  • False Alarm Ratio (FAR): The fraction of “yes” forecasts that were wrong. It is calculated as follows:

$\mathrm{FAR} = \frac{FP}{TP + FP}$

In weather forecasting, false alarms do not lead to serious consequences. Even so, a model with a high FAR is not ideal.

  • Bias (B): This measure compares the number of points predicted as having rain with the total number of points that actually had rain. Specifically, $B = \frac{TP + FP}{TP + FN}$.
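
The four measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the sample intensities, and the 0.5 rain/no-rain threshold are hypothetical and do not come from the original forecasting pipeline.

```python
import math

def rmse(forecast, observed):
    """Root mean square error over paired forecast/observation values."""
    n = len(forecast)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)

def confusion_counts(forecast, observed, threshold):
    """Count TP, FP, FN, TN after thresholding intensities into rain / no rain."""
    tp = fp = fn = tn = 0
    for f, o in zip(forecast, observed):
        pred_rain, true_rain = f >= threshold, o >= threshold
        if pred_rain and true_rain:
            tp += 1
        elif pred_rain:
            fp += 1
        elif true_rain:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def hit_rate(tp, fn):
    """Fraction of observed rain points that were forecast as rain."""
    return tp / (tp + fn)

def false_alarm_ratio(tp, fp):
    """Fraction of rain forecasts that were wrong."""
    return fp / (tp + fp)

def bias(tp, fp, fn):
    """Number of forecast rain points divided by number of observed rain points."""
    return (tp + fp) / (tp + fn)

# Hypothetical rain intensities for six pixels (assumed values, not real radar data).
observed = [0.0, 0.2, 1.5, 3.0, 0.0, 2.0]
forecast = [0.0, 1.2, 1.0, 2.5, 0.8, 0.1]

tp, fp, fn, tn = confusion_counts(forecast, observed, threshold=0.5)
print("RMSE:", rmse(forecast, observed))
print("H:", hit_rate(tp, fn), "FAR:", false_alarm_ratio(tp, fp), "B:", bias(tp, fp, fn))
```

A Bias above 1 means the model forecasts rain more often than it is observed, and a value below 1 means it forecasts rain less often.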


Investigating Methods of Handling Missing Data

Handling Missing Data – Abstract

The article discusses various types of missing data and how to handle them. Through some experiments, we demonstrate how prediction results are affected by the amount and type of missing data as well as by the method used to handle it.

  1. Introduction – Handling Missing Data

For any real data set, missing data is almost unavoidable. There are many possible reasons for this, including changes in the design of data collection, the precision of the data that users entered, the unwillingness of survey participants to answer some questions, etc. Detecting and handling these missing values is part of the data wrangling process.

There are 3 major types of missing data:

  • Missing Completely at Random (MCAR): this is the truly random case. A value is missing purely by chance, and the missingness is correlated neither with the missing values themselves nor with the values of other variables.


  • Missing at Random (MAR): the propensity for a value to be missing is not related to the missing value itself, but to some of the observed data. For example, in a market research survey, interviewers in some cities forgot, for whatever reason, to ask about the interviewee's income, so the ratio of missing income values in these cities is higher than in others. We can consider this Missing at Random.


  • Missing Not at Random (MNAR): this is the highly biased case. The missingness is related to the value of the missing observation itself. In some cases, the dataset should be re-collected to avoid this type of missingness. For example, interviewees with a high income who refuse to state their figure cause this type of missingness.

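
The difference between the three mechanisms is easiest to see when we generate the missingness ourselves. The sketch below is a minimal illustration, assuming a hypothetical survey with `city` and `income` fields; the function names and probabilities are made up for the example.

```python
import random

def mask_mcar(rows, column, ratio, rng):
    """MCAR: every row loses `column` with the same probability."""
    return [{**r, column: None} if rng.random() < ratio else dict(r) for r in rows]

def mask_mar(rows, column, group_column, ratio_by_group, rng):
    """MAR: the probability depends on another, fully observed column."""
    return [
        {**r, column: None} if rng.random() < ratio_by_group[r[group_column]] else dict(r)
        for r in rows
    ]

def mask_mnar(rows, column, cutoff, ratio_above, rng):
    """MNAR: the probability depends on the very value that goes missing."""
    return [
        {**r, column: None} if r[column] > cutoff and rng.random() < ratio_above else dict(r)
        for r in rows
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical survey answers (city, income); not real data.
survey = [{"city": c, "income": inc} for c, inc in [("A", 30), ("A", 90), ("B", 40), ("B", 95)]]

mcar = mask_mcar(survey, "income", 0.5, rng)                         # same chance everywhere
mar = mask_mar(survey, "income", "city", {"A": 0.8, "B": 0.1}, rng)  # depends on the city
mnar = mask_mnar(survey, "income", 80, 0.9, rng)                     # depends on income itself
```

In the MAR case the observed `city` column explains who is missing, while in the MNAR case nothing that remains in the data does.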

  2. Handling Missing Data


Ignoring

If you know the missing data is MCAR, you can simply ignore it. You do not handle anything yourself; instead, the library (such as XGBoost) handles the missing values for you by choosing an appropriate method. Technically, this can be counted as a case of one of the other methods, depending on the circumstances.

Removing (Deletion)

  • Column deletion: another simple way to handle missing data is to remove the attribute entirely (column deletion). It can be applied when the ratio of missing records is high (at least 60%, though this is not a fixed rule) and the variable is insignificant.
  • Row deletion: if the missing values are MCAR and the missing ratio is not very high, we can drop the entire record (row). This method is known as listwise deletion. If the missingness is not MCAR, however, this method can introduce bias into the dataset.
  • Pairwise deletion: instead of completely removing records with unknown values, we maximize data usage by omitting them only when necessary, i.e. only from the computations that involve the missing variable. Pairwise deletion can be seen as a way to reduce the data loss caused by listwise deletion.
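
To make the three deletion strategies concrete, here is a small sketch on a hypothetical passenger table. The records, column names, and helper functions are assumptions for illustration only; the 60% threshold is the rule of thumb mentioned above.

```python
def missing_ratio(rows, column):
    """Share of rows in which `column` is missing (None)."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_sparse_columns(rows, max_missing_ratio=0.6):
    """Column deletion: drop columns whose missing ratio exceeds the threshold."""
    keep = [c for c in rows[0] if missing_ratio(rows, c) <= max_missing_ratio]
    return [{c: r[c] for c in keep} for r in rows]

def listwise_delete(rows):
    """Row (listwise) deletion: keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_mean(rows, column):
    """Pairwise deletion for one statistic: use every row where `column` is present."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values)

# Hypothetical records; None marks a missing value.
passengers = [
    {"age": 22.0, "fare": 7.25, "cabin": None},
    {"age": None, "fare": 71.28, "cabin": "C85"},
    {"age": 26.0, "fare": None, "cabin": None},
    {"age": 35.0, "fare": 53.10, "cabin": None},
]

without_cabin = drop_sparse_columns(passengers)  # cabin is 75% missing, so it is dropped
complete_rows = listwise_delete(without_cabin)   # only fully observed rows survive
mean_age = pairwise_mean(without_cabin, "age")   # uses 3 rows
mean_fare = pairwise_mean(without_cabin, "fare") # uses a different set of 3 rows
```

Note that listwise deletion on the original table would leave no rows at all, while the pairwise statistics each still use three of the four records.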

Imputation (Fill-in)

  • Imputation with median/mean/mode values: these values are commonly used to fill the missing positions. Most of the time the mean is used, which keeps the mean of the attribute unchanged after processing. For a categorical variable, the most frequent value (mode) can be used instead. This kind of imputation can decrease the variance of the attribute. We can extend it by adding a boolean column that records whether a value comes from imputation or from the original dataset (some documents call this technique marking imputed values). One must be careful with this method, however: if the data is not missing at random, using the mean can introduce outliers into the data.
  • Algorithm-based imputation: instead of imputing a constant, we can model the variable with missing values as a function of the other features. A regression algorithm can then predict the missing values, under some assumptions:
      • If linear regression is used, we must assume that the variables have a linear relationship.
      • If missing values are predicted based on the order of a highly correlated column, the process is called hot-deck imputation.
  • KNN imputation: this method can be seen as a variant of median/mean/mode imputation, but instead of computing these values across all observations, it computes them only among the K nearest observations. One question to think about is how to measure the distance between observations.
  • Multivariate Imputation by Chained Equations: instead of imputing the values of each column separately, we repeatedly estimate the missing values of each variable based on the distribution of the other variables. The process repeats until the data becomes stable. This approach has two settings: single and multiple data sets (the latter is also referred to as Multiple Imputation by Chained Equations, MICE).


One iteration of MICE
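
As a rough illustration of two of the imputation ideas above, the following sketch implements mean imputation with an "imputed" marker and a very small KNN imputation. It is a simplified stand-in, not a library implementation: the Euclidean distance, the assumption that the distance features are complete, and the sample rows are all choices made for this example.

```python
import math

def mean_impute(values):
    """Fill None with the mean of the observed values and flag imputed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    flags = [v is None for v in values]  # True where the value was imputed
    return filled, flags

def knn_impute(rows, target, features, k):
    """Fill a missing `target` with the mean target of the k nearest complete rows.

    Distance is Euclidean over `features`, which are assumed to be fully observed
    and on comparable scales (an assumption of this sketch).
    """
    donors = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        if r[target] is not None:
            result.append(dict(r))
            continue
        point = [r[f] for f in features]
        nearest = sorted(donors, key=lambda d: math.dist(point, [d[f] for f in features]))[:k]
        result.append({**r, target: sum(d[target] for d in nearest) / len(nearest)})
    return result

# Hypothetical values for illustration.
ages, imputed_flags = mean_impute([20.0, None, 40.0])

rows = [
    {"pclass": 1, "fare": 80.0, "age": 40.0},
    {"pclass": 1, "fare": 75.0, "age": 44.0},
    {"pclass": 3, "fare": 8.0, "age": 22.0},
    {"pclass": 3, "fare": 7.0, "age": None},
]
filled_k1 = knn_impute(rows, "age", ["pclass", "fare"], k=1)
filled_k2 = knn_impute(rows, "age", ["pclass", "fare"], k=2)
```

With `k=1` the missing age is copied from the single most similar passenger, while larger `k` averages over more neighbours and moves the result toward the global mean.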

  3. Experiment

We use the Titanic dataset for the experiment, which is familiar to most data scientists. The original data consists of 12 variables, including both categorical and numerical variables. The original task is to predict whether each passenger survived or not.

We perform the classification task with Logistic Regression (fixed across trials). In each experiment, we simulate missing data by removing some existing values from some features of the input data. There are 2 ways of removing data: completely at random (MCAR Generator) and at random (MAR Generator). With the MAR Generator, values are removed in each trial at different ratios depending on the values of another feature (specifically Pclass, a variable highly correlated with the Survived status). We track how accuracy changes across the different settings. For cross-validation, we apply K-Fold with K=5.
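
For readers unfamiliar with K-Fold, the splitting step can be sketched as follows. This is a generic illustration, not the code used in the experiment: the function name, the seed, and the sample size of 20 are assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each sample is used for testing exactly once across the 5 folds.
splits = list(k_fold_indices(n_samples=20, k=5))
for train, test in splits:
    print(len(train), "training rows,", len(test), "test rows")
```

The reported accuracy of a setting is then typically the average of the accuracies measured on the 5 test folds.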

In experiment 1, we observe how accuracy changes when we remove different amounts of data from some features.

In experiment 2, we generate missing data using the MCAR and MAR Generators and use 2 MCAR-compatible methods to handle it. We want to find out whether these methods decrease the accuracy of the classifier.

  4. Results and Discussion

Effect of the Amount of Missing Data

In this experiment, we look for the relationship (not the correlation coefficient, but correlation in the general sense) between the amount of missing data and the output of the learning model, as well as the method used to handle it. We do this by masking different ratios of a few columns under the MCAR setting.

| Missing ratio (%) | Masking Sex, Dropping Title | Diff. | Masking Age, Dropping Title | Diff. | Masking Age, Sex, Dropping Title | Diff. | Masking Age, Sex, Keeping Title | Diff. |
|---|---|---|---|---|---|---|---|---|
| 0 | 81.04 | 0 | 81.04 | 0 | 81.07 | 0 | 82.21 | 0 |
| 20 | 77.23 | -3.81 | 81.19 | 0.15 | 77.53 | -3.54 | 81.83 | -0.38 |
| 40 | 75.17 | -5.87 | 80.84 | -0.2 | 75.41 | -5.66 | 81.87 | -0.34 |
| 60 | 73.96 | -7.08 | 80.29 | -0.75 | 73.93 | -7.14 | 82.32 | 0.11 |
| 80 | 71.95 | -9.09 | 79.58 | -1.46 | 71.92 | -9.15 | 82.69 | 0.48 |
| 99 | 71.48 | -9.56 | 79.5 | -1.54 | 71 | -10.07 | 82.98 | 0.77 |

Figure 3: Effect of the missing ratio on accuracy. The "Diff." column to the right of each accuracy column shows the difference between the current setting and the original (0%) setting.

As can be seen, the more values are removed, the more the accuracy decreases. But this happens only under some settings.

The quantity of missing data has a significant effect only if the feature carries “unique” information. When the Title feature (extracted from Name) is present, the missing values in the Sex column do not decrease the performance of the model, even with 99% missing data. This is because most values of the Title column (Mr, Mrs, Ms, Dr…) imply the information in the Sex column.

When some features are important and highly correlated with the features that have missing values, the effect of the missing data becomes negligible. One lesson is that, despite its simplicity, removing an entire variable should be considered in many cases, especially if some features correlate highly with the feature that has missing values. This is valuable when we do not want to sacrifice performance or waste effort to gain only a small portion of accuracy (around 1%).

Effect of the Missing Generator and Handling Method

In this experiment, we use the MCAR and MAR simulators to create modified datasets. Each removal method is applied to numerical columns (Age and Fare). We then handle the missing values with Mean Imputation (which is why we chose numerical features for removal) and Listwise Deletion, both of which are compatible with the MCAR setting, and observe the difference in accuracy.

Handling by Mean Imputation

| Missing ratio (%) | MCAR Missing Generator (Age) | MAR Missing Generator (Age) | Difference |
|---|---|---|---|
| 0 | 81 | 81 | 0 |
| 20 | 80.97 | 80.99 | -0.02 |
| 40 | 80.72 | 80.7 | 0.02 |
| 60 | 80.04 | 80.38 | -0.34 |

Handling by Listwise Deletion

| Missing ratio (%) | MCAR Missing Generator (Age) | MAR Missing Generator (Age) | Difference |
|---|---|---|---|
| 0 | 79.24 | 79.24 | 0 |
| 20 | 78.69 | 77.85 | 0.84 |
| 40 | 78.81 | 76.59 | 2.22 |
| 60 | 80.65 | 77.34 | 3.31 |

Figure 4: Different Missing Generators with different MCAR handling methods.

Once again, we notice that with Mean Imputation there is no significant improvement when we use the MCAR Missing Generator instead of the MAR one. Although Mean Imputation (which is considered an MCAR-compatible handling method) can distort the correlation between features in the case of the MAR Missing Generator, the classification task still achieves comparable accuracy.

On the other hand, with Listwise Deletion, the classifier accuracy is higher when the handling method matches the generator (MCAR Missing Generator). This can be explained by the fact that listwise deletion also throws away data from the other variables. In the MAR Generator case, rows are therefore removed by a non-random mechanism (they are still removed randomly in the MCAR Generator case), which worsens the classifier’s accuracy. Note that in one column, accuracy increases at the 60% setting. This happens because removing more rows makes both the training and testing folds smaller. We should not interpret it as the model improving when the missing ratio increases.

  5. Recap

All methods of handling missing data can be helpful, but the choice really depends on the circumstances. To choose well, data scientists should understand the process that generated the dataset and have knowledge of the domain.

Considering the correlation between features is important when deciding whether missing data should be handled, simply ignored, or deleted from the dataset.

There are other aspects of handling missing data that we wanted to show, but due to time and resource limitations we have not run these experiments yet. We would like to experiment with more complex methods such as algorithm-based handling, and to compare the effects across different datasets. We hope to come back to these problems some day.



Data Science Blog

Please check our other Data Science Blog

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

AI / Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab
Please also visit Vietnam AI Lab

Causal inference and potential outcome framework

In this blog, we introduce basic concepts in causal inference and the potential outcome framework.

1. Causality terminology

  1. Unit: The fundamental notion is that causality is tied to an action (or manipulation, treatment, or intervention), applied to a unit. A unit here can be a physical object, a firm, an individual person, at a particular point in time. The same physical object or person at a different time is a different unit. For instance, when you have a headache and you decide to take an aspirin to relieve your headache, you could also have chosen not to take the aspirin, or you could have chosen to take an alternative medicine. In this framework, articulating with precision the nature and timing of the action sometimes requires a certain amount of imagination. For example, if we define race solely in terms of skin color, the action might be a pill that alters only skin color. Such a pill may not currently exist (but, then, neither did surgical procedures for heart transplants hundreds of years ago), but we can still imagine such an action.
  2. Active treatment vs. Control treatment: Often, one of these actions corresponds to a more active treatment (e.g., taking an aspirin) in contrast to a more passive action (e.g., not taking the aspirin). We refer to the first action as the active treatment and the second action as the control treatment.
  3. Potential Outcome: given a unit and a set of actions, we associate each action-unit pair with a potential outcome. We refer to these outcomes as potential outcomes because only one will ultimately be realized and therefore possibly observed: the potential outcome corresponding to the action actually taken. The other potential outcomes cannot be observed because the corresponding actions that would lead to them being realized were not taken.
  4. Causal Effect: The causal effect of one action or treatment relative to another involves the comparison of these potential outcomes, one realized and the others not realized and therefore not observable.

Suppose we have a ‘treatment’ variable A with two levels, 1 and 0, and an outcome variable Y with two levels: 1 (death) and 0 (survival). The treatment A has a causal effect on an individual’s outcome Y if the potential outcomes under a = 1 and a = 0 are different. The causal effect of the treatment involves the comparison of these potential outcomes. That is, A has a causal effect on Y if:

Y^{a=1} ≠ Y^{a=0}

For example, consider the case of a single unit, I, at a particular point in time, contemplating whether or not to take an aspirin for my headache. That is, there are two treatment levels, taking an aspirin, and not taking an aspirin. There are therefore two potential outcomes, Y(Aspirin) and Y(No Aspirin), one for each level of the treatment.

Table 1 illustrates this situation, assuming the values Y(Aspirin) = No Headache and Y(No Aspirin) = Headache.

  5. Fundamental problem of causal inference: There are two important aspects of the definition of a causal effect. First, the definition of the causal effect depends on the potential outcomes, but it does not depend on which outcome is actually observed. Specifically, whether I take an aspirin (and am therefore unable to observe the state of my headache with no aspirin) or do not take an aspirin (and am thus unable to observe the outcome with an aspirin) does not affect the definition of the causal effect. Second, the causal effect is the comparison of potential outcomes, for the same unit, at the same moment in time post-treatment. In particular, the causal effect is not defined in terms of comparisons of outcomes at different times, as in a before-and-after comparison of my headache before and after deciding to take or not to take the aspirin. “The fundamental problem of causal inference” (Holland, 1986, p.947) is therefore the problem that at most one of the potential outcomes can be realized and thus observed. If the action you take is Aspirin, you observe Y(Aspirin) and will never know the value of Y(No Aspirin) because you cannot go back in time.
  6. Causal Estimands / Average Treatment Effect: Consider a population of units, indexed by i = 1,…,N. Each unit in this population can be exposed to one of a set of treatments.
  • Let Ti (or Wi elsewhere) denote the set of treatments to which unit i can be exposed.

Ti = T = {0, 1}

  • For each unit i, and for each treatment in the common set of treatments, there are corresponding potential outcomes Yi(0) and Yi(1).
  • Comparisons of Yi(1) and Yi(0) are unit-level causal effects:

Yi(1) – Yi(0)
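
As a small worked example, the sketch below computes unit-level causal effects and their average (the average treatment effect) for a made-up population. The numbers are hypothetical, and the calculation is only possible here because the sketch pretends that both potential outcomes of every unit are known, which never happens in practice.

```python
# Hypothetical potential outcomes (Y_i(0), Y_i(1)) for four units.
potential_outcomes = [(0, 1), (1, 1), (0, 0), (0, 1)]

# Unit-level causal effects: Y_i(1) - Y_i(0).
unit_effects = [y1 - y0 for y0, y1 in potential_outcomes]

# Average treatment effect: the mean of the unit-level effects over the population.
average_treatment_effect = sum(unit_effects) / len(unit_effects)

print(unit_effects)              # [1, 0, 0, 1]
print(average_treatment_effect)  # 0.5
```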

2. Potential Outcomes Framework

2.1       Introduction

The potential outcome framework, formalized for randomized experiments by Neyman (1923) and developed for observational settings by Rubin (1974), defines for all individuals such potential outcomes, only some of which are subsequently observed. This framework dominates applications in epidemiology, medical statistics, and economics, stating the conditions under which causal effects can be estimated in rigorous mathematical language.

The potential outcomes approach was designed to quantify the magnitude of the causal effect of a factor on an outcome, NOT to determine whether it is actually a cause or not. Its goal is to estimate the effects of causes, not the causes of effects. Quantitative counterfactual inference helps us predict what would happen under different circumstances, but is agnostic in saying which is a cause or not.

2.2       Counterfactual

A potential outcome is the value of the outcome corresponding to one level of a treatment. Suppose we have a ‘treatment’ variable X with two levels: 1 (treat) and 0 (not treat) and an outcome variable Y with two levels: 1 (death) and 0 (survival). If we expose a subject, we observe Y1 but we do not observe Y0. Indeed, Y0 is the value we would have observed if the subject had not been exposed. The unobserved variable is called a counterfactual. The variables (Y0, Y1) are also called potential outcomes. We have enlarged our set of variables from (X, Y) to (X, Y, Y0, Y1). A small dataset might look like this:

Table: example data set with columns X, Y, Y0, Y1.

The asterisks indicate unobserved variables. Causal questions involve the distribution p(y0, y1) of the potential outcomes. We can interpret p(y1) as p(y|set X = 1) and we can interpret p(y0) as p(y|set X = 0). For each unit, we can observe at most one of the two potential outcomes, the other is missing (counterfactual).

Causal inference under the potential outcome framework is essentially a missing data problem. Suppose now that X is a binary variable that represents some exposure. So X = 1 means the subject was exposed and X = 0 means the subject was not exposed. We can address the problem of predicting Y from X by estimating E(Y|X = x). To address causal questions, we introduce counterfactuals. Let Y1 denote the response if the subject is exposed. Let Y0 denote the response if the subject is not exposed. Then

Y = Y1 if X = 1 and Y = Y0 if X = 0, or more succinctly Y = X·Y1 + (1 – X)·Y0

Potential outcomes and assignments jointly determine the values of the observed and missing outcomes:

Y_obs = X·Y1 + (1 – X)·Y0,   Y_mis = (1 – X)·Y1 + X·Y0
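
The relationship between assignment, potential outcomes, and what we actually get to see can be sketched as follows. The units and their values are hypothetical, and the asterisk convention simply mirrors the way unobserved values are marked in the text.

```python
def observed_and_missing(x, y0, y1):
    """Split the two potential outcomes into the observed and the missing one."""
    y_obs = x * y1 + (1 - x) * y0
    y_mis = (1 - x) * y1 + x * y0
    return y_obs, y_mis

def observed_row(x, y0, y1):
    """Return (X, Y, Y0, Y1) as an analyst would see it: the counterfactual is '*'."""
    y_obs, _ = observed_and_missing(x, y0, y1)
    return (x, y_obs, "*" if x == 1 else y0, y1 if x == 1 else "*")

# Hypothetical units given as (X, Y0, Y1); in reality only one outcome is ever known.
units = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
table = [observed_row(*u) for u in units]

for row in table:
    print(row)
```

Seen this way, estimating causal effects amounts to reasoning about the values hidden behind the asterisks.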

It is impossible to observe the counterfactual for a given individual or set of individuals. Instead, evaluators must compare outcomes for two otherwise similar sets of beneficiaries who are and are not exposed to the intervention, with the latter group representing the counterfactual.

2.3       Confounding

In some cases, it is not feasible or ethical to do a randomized experiment and we must use data from observational (non-randomized) studies. Smoking and lung cancer is an example. Can we estimate causal parameters from observational (non-randomized) studies? The answer is: sort of.

In an observational study, the treated and untreated groups will not be comparable. Maybe the healthy people chose to take the treatment and the unhealthy people didn’t. In other words, X is not independent of (Y0, Y1). The treatment may have no effect, but we would still see a strong association between Y and X. In other words, α (the association between X and Y) may be large even though θ (the causal effect) = 0.

Here is a simplified example. Suppose X denotes whether someone takes vitamins and Y is some binary health outcome (with Y = 1 meaning “healthy”):

Table: potential outcomes (X, Y, Y0, Y1) for the vitamin example.

In this example, there are only two types of people: healthy and unhealthy. The healthy people have (Y0, Y1) = (1,1). These people are healthy whether or not they take vitamins. The unhealthy people have (Y0, Y1) = (0,0). These people are unhealthy whether or not they take vitamins.

The observed data are:

Table: observed data (X, Y) for the vitamin example.

In this example, θ = 0 but α = 1. The problem is that people who choose to take vitamins are different from people who choose not to take vitamins. That’s just another way of saying that X is not independent of (Y0, Y1).
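
The vitamin example can be checked numerically. The sketch below builds a hypothetical population that matches the description (healthy people take vitamins, unhealthy people do not) and computes both the causal effect and the association. The group sizes are made up, and measuring the association as a difference of conditional means is an assumption of this sketch.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical population: 4 healthy vitamin takers, 4 unhealthy non-takers.
population = (
    [{"x": 1, "y0": 1, "y1": 1} for _ in range(4)]
    + [{"x": 0, "y0": 0, "y1": 0} for _ in range(4)]
)

# Causal effect: mean outcome if everyone were exposed minus mean outcome if no one were.
causal_effect = mean([p["y1"] for p in population]) - mean([p["y0"] for p in population])

# Observed outcome: Y1 for those who take vitamins, Y0 for those who do not.
observed = [(p["x"], p["y1"] if p["x"] == 1 else p["y0"]) for p in population]

# Association: mean observed outcome among takers minus mean among non-takers.
association = mean([y for x, y in observed if x == 1]) - mean([y for x, y in observed if x == 0])

print(causal_effect)  # 0.0
print(association)    # 1.0
```

Vitamins change nobody's outcome here, yet the observed data alone would suggest a perfect relationship.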

To account for the differences in the groups, we can measure confounding variables. These are the variables that affect both X and Y. These variables explain why the two groups of people are different. In other words, these variables account for the dependence between X and (Y0, Y1). By definition, there are no such variables in a randomized experiment. The hope is that if we measure enough confounding variables Z, then perhaps the treated and untreated groups will be comparable, conditional on Z. This means that (Y0, Y1) is independent of X conditional on Z.

2.4       Measuring the Average Causal Effect

The mean treatment effect or mean causal effect is defined by

θ = E(Y1) – E(Y0) = E(Y|set X=1) – E(Y|set X=0)

The parameter θ has the following interpretation: θ is the mean response if we exposed everyone minus the mean response if we exposed no one.

When exposure is randomly assigned, the estimator for this parameter is the difference in means between the exposed and unexposed groups.
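
A short simulation can illustrate the difference-in-means estimator under random assignment. Everything in the sketch is assumed for illustration: the population size, the outcome probabilities of 0.3 and 0.5, and the fixed seed.

```python
import random

def difference_in_means(x, y):
    """Mean observed outcome among the exposed minus that among the unexposed."""
    treated = [yi for xi, yi in zip(x, y) if xi == 1]
    control = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical potential outcomes: exposure raises P(Y = 1) from 0.3 to 0.5.
y0 = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
y1 = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
true_effect = sum(y1) / n - sum(y0) / n  # known only because this is a simulation

# Randomized assignment: X is independent of (Y0, Y1).
x = [rng.randint(0, 1) for _ in range(n)]
y = [y1[i] if x[i] == 1 else y0[i] for i in range(n)]  # one outcome observed per unit

estimate = difference_in_means(x, y)
print("true effect:", true_effect, "estimate:", estimate)
```

Because assignment is random, the two groups are comparable and the estimate lands close to the true effect. With self-selected exposure, as in the vitamin example, the same estimator would measure the association instead.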


Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If.
Imbens, G., & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference.



Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has likewise emerged. The ability of embeddings to learn complex, underlying relationships between words is also their greatest caveat:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a more general sense and a good embedding for a specific downstream task. Although some methods for evaluation, such as word similarity/analogy tasks have been proposed, most remain somewhat controversial as to their validity as well as relevancy to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluation: intrinsic, where one typically employs specifically designed, human-moderated data sets, and extrinsic, whereby embeddings are tested on simpler, proxy NLP tasks to estimate their performance.

For both types, it is as yet unclear to what extent good performance correlates with actually useful embeddings. In many, if not most state-of-the-art neural networks, the embeddings are trained alongside the model to tailor them to the task at hand.

Here, we want to evaluate two different embeddings (Skipgram and CBOW)  trained on a Japanese text corpus (300K) with the aim of assessing which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al.,2018)

Embedding: Train 2 models, Skipgram and CBOW, with vector size 300 and a 40K vocabulary, using FastText (Athiwaratkun et al., 2018).

Japanese is a non-space separated language and needs to be segmented as part of the preprocessing. This can be done using morphological analyzers, such as MeCab (Kudo, 2006), or language-independent algorithms, such as SentencePiece (Kudo et al., 2018). As the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate given specific target applications.

In order to tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we segment our texts into “subwords” using SentencePiece. It typically returns smaller units, which do not align with the “word” segmentations returned by MeCab.

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different word types by human evaluators.

However, preliminary vocabulary comparisons showed that, because of differences in segmentation, there was little to no overlap between the words in our word embeddings and those in the data set. For instance, the largest common group occurred in nouns: only 50 out of 1000 total noun comparison pairs.
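The overlap check described above can be sketched in a few lines. This is only an illustration: the toy vocabulary and word pairs below are made up, and the function name `count_covered_pairs` is hypothetical. A pair counts as covered only when both of its words appear in the embedding vocabulary exactly as written in the evaluation set.

```python
def count_covered_pairs(pairs, vocabulary):
    """Count evaluation pairs whose two words are both in the embedding vocabulary."""
    vocab = set(vocabulary)
    return sum(1 for w1, w2 in pairs if w1 in vocab and w2 in vocab)


# Hypothetical toy data. SentencePiece marks word starts with "▁",
# so many subword units do not match the word-level entries of the data set.
subword_vocab = ["▁東京", "大学", "▁猫", "犬", "▁食べ", "る"]
noun_pairs = [("猫", "犬"), ("大学", "犬"), ("東京", "大学")]

covered = count_covered_pairs(noun_pairs, subword_vocab)
coverage = covered / len(noun_pairs)
print(f"{covered}/{len(noun_pairs)} pairs covered ({coverage:.1%})")
```

With a coverage as low as the 50 out of 1000 noun pairs reported above, a correlation computed on the covered pairs alone would say little about the embeddings as a whole.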

So instead, we propose a naïve approach for comparing two word embeddings: a Synonym Vector Mapping Approach.

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words. 

Pre-processing Data

In data science, before building a predictive model from a particular data set, it is important to explore and pre-process the data. In this blog, we will illustrate some typical steps in data pre-processing.

In this particular exercise, we will build a simple Decision Tree model to classify the food cuisine from the list of ingredients. The data for this exercise can be taken from:

Through this exercise, we will show the importance of data pre-processing. This blog is organized as follows:

  1. Data Exploration and Pre-processing.
  2. Imbalanced Data.

1.  Data Exploration and Pre-processing

When you are given a set of data, it is important to explore and analyze it before constructing a predictive model. Let us first explore this data set.


From the first 10 items of this data set, we observe that, for a given cuisine, the list of ingredients may differ from recipe to recipe.

From this data set, we find that there are 20 different cuisines and that the distribution of recipes is not uniform. For example, recipes from ‘Italian’ cuisine make up 19.7% of the data set, while only 1.17% of the recipes come from ‘Brazilian’ cuisine.

recipe dataset
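The distribution above can be computed with a simple counter. This is a minimal sketch, assuming each recipe is a dictionary with a "cuisine" key and an "ingredients" list; that layout, the toy recipes, and the function name `cuisine_distribution` are illustrative assumptions, not the actual data set.

```python
from collections import Counter


def cuisine_distribution(recipes):
    """Return the share of recipes per cuisine, in percent, most common first."""
    counts = Counter(recipe["cuisine"] for recipe in recipes)
    total = sum(counts.values())
    return {cuisine: 100.0 * n / total for cuisine, n in counts.most_common()}


# Hypothetical toy data standing in for the real recipe data set.
recipes = [
    {"cuisine": "italian", "ingredients": ["olive oil", "tomato", "salt"]},
    {"cuisine": "italian", "ingredients": ["pasta", "garlic"]},
    {"cuisine": "italian", "ingredients": ["basil", "mozzarella"]},
    {"cuisine": "brazilian", "ingredients": ["black beans", "rice"]},
]

distribution = cuisine_distribution(recipes)
for cuisine, share in distribution.items():
    print(f"{cuisine}: {share:.2f}%")
```

A strongly skewed result here is an early hint that the classes are imbalanced.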

Now, let us explore this data set further. Let us look at the top 15 ingredients.

top15 ingredients

If we look at the top 15 ingredients, we will see that they include “salt”, “water”, “sugar”, etc. They are all generic and can be found in every cuisine. Intuitively, if we remove these ingredients from the classification model, the accuracy of the classification should not be affected.

In the classification model, we would prefer the recipes in each cuisine to have ingredients that are unique to that cuisine. This helps the model identify which cuisine a recipe comes from.

After removing all the generic ingredients (salt, water, sugar, etc.) from the data set, we look at the top 15 ingredients again.

top15 ingredients
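The counting and filtering steps above can be sketched as follows. The recipe layout (a dictionary with an "ingredients" list), the toy data, and the exact contents of the `GENERIC` set are assumptions made for illustration; the real list of generic ingredients would come from inspecting the actual top 15.

```python
from collections import Counter

# Assumed set of generic ingredients, taken from the examples in the text.
GENERIC = {"salt", "water", "sugar"}


def top_ingredients(recipes, k=15):
    """Return the k most frequent ingredients with their counts."""
    counts = Counter(i for recipe in recipes for i in recipe["ingredients"])
    return counts.most_common(k)


def remove_generic(recipes, generic=GENERIC):
    """Return copies of the recipes with the generic ingredients dropped."""
    return [
        {**recipe, "ingredients": [i for i in recipe["ingredients"] if i not in generic]}
        for recipe in recipes
    ]


# Hypothetical toy data.
recipes = [
    {"cuisine": "italian", "ingredients": ["salt", "olive oil", "tomato"]},
    {"cuisine": "mexican", "ingredients": ["salt", "water", "tortilla"]},
    {"cuisine": "italian", "ingredients": ["sugar", "olive oil", "salt"]},
]

print(top_ingredients(recipes, k=3))
filtered = remove_generic(recipes)
print(top_ingredients(filtered, k=3))
```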

It looks like we could remove a few more ingredients, but the decision on which ones to remove is probably best left to someone with more domain knowledge in cooking. For example, some countries may use ‘onion’ in their recipes, while others may use ‘red onion’. So it is better not to filter out too many generic ingredients.

Now, we look at the distribution of ingredients in each recipe in the data set.


Some recipes have only 1 or 2 ingredients, while some have up to 60. It is probably best to remove recipes with so few ingredients from the data set, as they may not be representative enough for the classification model. What is the minimum number of ingredients required to classify the cuisine? The short answer is that no one knows. It is best to experiment: remove recipes with 1, 2, 3, etc. ingredients, re-train the model, and compare the accuracy to decide which threshold works best for your model.
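The experiment above could be organized around a small filtering helper. This is a sketch under the same assumed recipe layout as before (a dictionary with an "ingredients" list); the toy data is made up, and the training and evaluation step is only indicated by a comment because it depends on your model.

```python
def filter_by_min_ingredients(recipes, min_count):
    """Keep only recipes that list at least min_count ingredients."""
    return [r for r in recipes if len(r["ingredients"]) >= min_count]


# Hypothetical toy data.
recipes = [
    {"cuisine": "indian", "ingredients": ["rice"]},
    {"cuisine": "italian", "ingredients": ["pasta", "garlic"]},
    {"cuisine": "mexican", "ingredients": ["tortilla", "beans", "chili", "lime"]},
]

for min_count in (1, 2, 3):
    kept = filter_by_min_ingredients(recipes, min_count)
    print(f"min_count={min_count}: {len(kept)} recipes kept")
    # Re-train the model on `kept` here and record its accuracy,
    # then compare the accuracies across thresholds.
```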

The ingredients in the recipes are all words, so to do further pre-processing we will need some NLP (Natural Language Processing).

AWS Lake Formation – Data Lake (Setup)

Overview: Building a Data Lake with AWS (Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without having first to structure it. And you can run different types of analytics to better guide decision-making—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control data lake access and audit those who access data. The AWS Glue Data Catalog integrates data access policies, ensuring compliance regardless of the data’s origin.

Set up the S3 bucket and put the dataset.


Set up Data Lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself a data lake administrator to allow access to any Lake Formation resource.

Create a data lake administrator

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.

Register an Amazon S3 path

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the datasetsample00 table definitions.

  • For Database, enter datasetsample00-db.
  • For Location, enter your S3 bucket/datasetsample00.
  • For New tables in this database, do not select Grant All to Everyone.

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the datasetsample00-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.

Grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake using a data location:

  • For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: samplecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select datasetsample00-db.
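For readers who prefer scripting over the console, the same crawler settings could be applied through the AWS SDK for Python. This is a hedged sketch rather than part of the original walkthrough: it assumes you pass in a Glue client created with `boto3.client("glue")`, that your credentials allow `create_crawler` and `start_crawler`, and that the role and bucket named above exist in your account. The function name `create_and_start_crawler` is hypothetical.

```python
# Settings copied from the console configuration listed above.
CRAWLER_NAME = "samplecrawler"
CRAWLER_ROLE = "AWSGlueServiceRoleDefault"
DATABASE_NAME = "datasetsample00-db"
INCLUDE_PATH = "s3://datalake-hiennu-ap-northeast-1/datasetsample00"


def create_and_start_crawler(glue_client):
    """Create the crawler with a single S3 target and run it once on demand."""
    glue_client.create_crawler(
        Name=CRAWLER_NAME,
        Role=CRAWLER_ROLE,
        DatabaseName=DATABASE_NAME,
        Targets={"S3Targets": [{"Path": INCLUDE_PATH}]},
    )
    # No schedule is configured, so the crawler only runs when started explicitly.
    glue_client.start_crawler(Name=CRAWLER_NAME)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   create_and_start_crawler(boto3.client("glue"))
```

Passing the client in as a parameter keeps the function easy to try against a stub before touching a real account.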

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db database.
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;
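The same preview query could also be submitted from code. The sketch below is an illustration under assumptions: it expects an Athena client created with `boto3.client("athena")`, and the `output_location` argument is a hypothetical S3 path where Athena writes query results, which you need to supply yourself. The function name `preview_table` is also hypothetical.

```python
PREVIEW_QUERY = "SELECT * FROM datasetsample00 limit 10;"
DATABASE_NAME = "datasetsample00-db"


def preview_table(athena_client, output_location):
    """Submit the preview query and return the ID of the query execution."""
    response = athena_client.start_query_execution(
        QueryString=PREVIEW_QUERY,
        QueryExecutionContext={"Database": DATABASE_NAME},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (requires boto3 and AWS credentials; the bucket path is a placeholder):
#   import boto3
#   query_id = preview_table(boto3.client("athena"), "s3://your-results-bucket/athena/")
```

The call returns immediately; the rows can be fetched with `get_query_results` once the query has finished.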

Combinatorial Optimization: From Supervised Learning to Reinforcement Learning – Part 1

Recently, I was asked to solve an interesting problem: the array sorting problem.


Input: An array A containing n unique integer elements. The length n of A ranges from 2 to 10.

Output: An array B containing the elements of A sorted in ascending order. The length of array B must be the same as the length of array A.

Examples: A = [3,0] -> B = [0,3], A = [1,3,2] -> B = [1,2,3], A = [5,9,1,3,7] -> B = [1,3,5,7,9]

Array sorting is not a new problem. There are many sorting algorithms such as Straight Insertion, Shell Sort, Bubble Sort, Quick Sort, Selection Sort, Heap Sort, etc. The problem above becomes much more interesting if we consider it as a Combinatorial Optimization problem. Here, various Machine Learning approaches can be applied.

Combinatorial Optimization

“Combinatorial Optimization is a category of problems which requires optimizing a function over a combination of discrete objects and the solutions are constrained. Examples include finding shortest paths in a graph, maximizing value in the Knapsack problem and finding boolean settings that satisfy a set of constraints. Many of these problems are NP-Hard, which means that no polynomial time solution can be developed for them. Instead, we can only produce approximations in polynomial time that are guaranteed to be some factor worse than the true optimal solution.”

Source: Recent Advances in Neural Program Synthesis (

Traditional solvers often rely on handcrafted designs to make decisions. In recent years, many Machine Learning (ML) techniques have been used to solve combinatorial optimization problems. The related technologies range from supervised learning techniques to modern reinforcement learning techniques.

Using the array sorting problem above, we will see how the problem can be solved with different ML techniques.


In this series, we will start with some supervised techniques, then apply neuro-evolution, and finally use some modern RL techniques.

Part 1: Supervised learning: Gradient Boosting, Fully Connected Neural Networks, SeqtoSeq.

Part 2: Deep Neuro-Evolution: NEAT, Evolution Strategies, Genetic Algorithms.

Part 3: Reinforcement Learning: Deep Q-Network, Actor-Critic, PPO with Pointer network and Attention based model.

Code for Part 1:

(Note: Enable Colab GPU to speed up running time)

Supervised learning

Supervised machine learning algorithms are designed to learn by example. If we want to use Supervised learning, we have to have data.

First, we will generate 3 data sets of different sizes: 1000, 5000, and 50000. Then we will train some models on these data sets. Finally, we will compare their sorting abilities after the learning process.

1. Generate training data:

How do we generate data? One possible approach: if we treat each element of the input list as a feature and each element of the sorted list as a label, we can easily convert the data to tabular form. Lists shorter than 10 elements are padded with -1, as in the sample below.

   In1  In2  In3  In4  In5  In6  In7  In8  In9  In10
0   -1   -1   -1   -1   -1    7    0    1    2     3
1    1    7    6    3    4    5    0    2    8     9
2   -1   -1    6    2    7    4    1    0    3     8

  Out1 Out2 Out3 Out4 Out5 Out6 Out7 Out8 Out9 Out10
0   -1   -1   -1   -1   -1    0    1    2    3     7
1    0    1    2    3    4    5    6    7    8     9
2   -1   -1    0    1    2    3    4    6    7     8

Then we can use any multi-label regression or multi-label classification model on this training data.
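A data generator for this tabular form could look like the sketch below. Two details are assumptions inferred from the sample rows rather than stated in the text: values are drawn from the integers 0 to 9, and shorter arrays are left-padded with -1 up to length 10. The function names `make_example` and `make_dataset` are hypothetical.

```python
import random

MAX_LEN = 10  # maximum array length from the problem statement
PAD = -1      # padding value seen in the sample rows (assumed convention)


def make_example(rng):
    """Return one (input row, label row) pair, both left-padded to MAX_LEN."""
    n = rng.randint(2, MAX_LEN)               # array length between 2 and 10
    values = rng.sample(range(MAX_LEN), n)    # n unique integers from 0..9 (assumed range)
    pad = [PAD] * (MAX_LEN - n)
    return pad + values, pad + sorted(values)


def make_dataset(size, seed=0):
    """Generate `size` examples as two parallel lists of rows."""
    rng = random.Random(seed)
    examples = [make_example(rng) for _ in range(size)]
    inputs = [x for x, _ in examples]
    labels = [y for _, y in examples]
    return inputs, labels


inputs, labels = make_dataset(1000)
print(inputs[0], "->", labels[0])
```

Each row of `inputs` maps to columns In1 to In10 and each row of `labels` to columns Out1 to Out10.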

2. Multi-label regression

For this tabular data set, I will use 2 common techniques: gradient boosting (using the XGBoost library) and simple fully connected neural networks (FCNNs).

from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

AWS – Batch Process and Data Analytics


Today, I will describe our current reference architecture for batch processing and data analytics in our sales report system. Our mission is to create a big data analytics system that integrates Machine Learning features for insights and prediction. Nowadays, many AI companies research and apply machine learning features in their systems to improve their services. We are also researching and applying machine learning features such as NLP, forecasting, and OCR in our system, which gives us the opportunity to provide better service for our customers.

Reference Architecture for Batch Data Processing
  1. Data Source
    We have various sources from multiple systems, both on-premises and in the cloud, with large datasets and unpredictable, frequent updates.
  2. Data lake storage
    We use S3 as our data lake storage, which accepts any type and volume of data. That makes it easy to scale our system.
  3. Machine Learning
    We focus on Machine Learning to build great AI solutions from our dataset. Machine learning models produce predictions for insights and are integrated directly into our system as microservices.
  4. Compute
    Compute is the most important part of our system. We choose the compute service that fits each job, either infrastructure-based or serverless, to optimize cost and performance.
    – We use AWS Lambda functions for small jobs such as calling AI services, processing small datasets, and integration.
    – We use AWS Glue ETL to build ETL pipelines, and we build custom pipelines with AWS Step Functions.
    – We also provide web and API services for end users; this system is built on a microservices architecture and hosted on AWS Fargate.
  5. Report datastore
    After processing the data and exporting insights from the predictions, we store the data in DynamoDB and RDS for visualization and for the built-in AWS insight features.
  6. Data analytics with visualization and insights
    We use AWS QuickSight and insights for data analytics.

For each dashboard, we have a specific data processing pipeline that processes and prepares the data for visualization and insights. Here is our standard pipeline, orchestrated with AWS Step Functions. We want to keep everything as basic as possible. We use AWS Events for scheduling and Lambda functions as the interacting functions.
– Advantages: Serverless, and easy to develop and scale.
– Disadvantages: Heavily dependent on AWS services, and it becomes challenging when a workload exceeds the limits of AWS services, such as the Lambda function processing time.
– Most common use case: Batch data processing pipeline.
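To give a feel for what such an orchestrated pipeline might look like, here is a rough sketch of a Step Functions state machine definition written as a Python dictionary. It is not our production definition: the three stage names and the Lambda ARNs are hypothetical placeholders, and only the general shape (a chain of Task states in Amazon States Language) is the point.

```python
import json

# Hypothetical Lambda function ARNs; replace REGION, ACCOUNT_ID and the names with your own.
EXTRACT_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:extract-data"
TRANSFORM_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:transform-data"
LOAD_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:load-report-data"

# A linear three-step pipeline expressed in Amazon States Language.
definition = {
    "Comment": "Sketch of a batch data processing pipeline",
    "StartAt": "Extract",
    "States": {
        "Extract": {"Type": "Task", "Resource": EXTRACT_ARN, "Next": "Transform"},
        "Transform": {"Type": "Task", "Resource": TRANSFORM_ARN, "Next": "Load"},
        "Load": {"Type": "Task", "Resource": LOAD_ARN, "End": True},
    },
}

# The JSON string is what a state machine definition would contain.
print(json.dumps(definition, indent=2))
```

A scheduled event would then start an execution of this state machine, and each Lambda function has to finish within the Lambda time limit mentioned above.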

Data processing pipeline
  • Uses serverless compute for data processing and a microservices architecture
  • Easy to develop, deploy, and scale the system without modification
  • Flexibility to use built-in services and to build custom Machine Learning models with AWS SageMaker
  • Runs separately from the data source systems
  • Effective data lake processing

This architecture is basic, and I hope you can take something from it. We have focused on describing our current system; it will grow beyond our current design with more features in the future, so we should choose flexible solutions that are easy to maintain and replace. When you choose an architecture or service to solve your problems, consider which service best fits your current situation; no architecture or service can solve every problem, because the right choice depends on the specific problem. And avoid the case of a solution looking for a problem 😀

— — — — — — — — — — — — — — — —


In NLP, encoding text is at the heart of understanding language. There are many implementations of word embeddings, such as GloVe, Word2vec, and fastText. However, these embeddings are only useful at the word level and may not perform well when we want to encode sentences or, in general, anything longer than one word. In this post, we would like to introduce one of the SOTA models for this task: the Universal Sentence Encoder.


The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors (embedding vectors, or just embeddings). These vectors are supposed to capture the semantics of the text. But why do we even need them?

A vector is an array of numbers of a particular dimension. With the vectors in hand, it’s much easier for computers to work on textual data. For example, we can tell whether two data points are similar just by calculating the distance between their embedding vectors.
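As a small illustration of comparing embedding vectors, the sketch below uses cosine similarity, which is one common choice of measure (the text above does not prescribe a specific one). The 3-dimensional vectors are made-up toy values, not real USE outputs, which have 512 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for three sentences.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

In this toy setup the first pair scores much higher than the second, which is the behavior we would hope for from embeddings that capture meaning.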



The embedding vectors, in turn, can be used for other downstream NLP tasks such as text classification, semantic similarity, clustering…

2. USE architecture

It comes in two variations, whose main difference lies in the embedding part. One uses the encoder part of the famous Transformer architecture; the other uses a Deep Averaging Network (DAN).

2.1 Transformer encoder

The Transformer architecture is designed to handle sequential data, but it does not have to process the data in order like RNN-based architectures do. Due to this feature, the Transformer allows for much more parallelization than RNNs and therefore reduces training times. It uses the attention mechanism to compute context-aware representations of words in a sentence, taking into account both the ordering and the significance of all the other words. The encoder takes a lowercased PTB-tokenized string as input and outputs a fixed-length encoding vector for each sentence by computing the element-wise sum of the representations at each word position.

Universal Sentence Encoder uses only the encoder branch of Transformer to take advantage of its strong embedding capacity.



2.2 Deep Averaging Network (DAN):

DAN is a simple neural network that takes the average of the embeddings for words and bi-grams and then passes the “combined” vector through a feedforward deep neural network (DNN) to produce sentence embeddings. Similar to the Transformer encoder, DAN takes a lowercased PTB-tokenized string as input and outputs a 512-dimensional sentence embedding.
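The averaging idea can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the embedding lookup is a seeded random stand-in instead of learned embeddings, whitespace splitting stands in for PTB tokenization, the weights are random and untrained, and the dimension is 4 instead of 512. The function names are hypothetical.

```python
import math
import random

DIM = 4  # toy dimension; the real model outputs 512-dimensional embeddings


def embed(token):
    """Stand-in for a learned embedding table: a fixed random vector per token."""
    rng = random.Random(token)
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense(x, weights, biases):
    """One fully connected layer with a tanh activation."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def dan_encode(sentence, layers):
    """Average word and bi-gram embeddings, then apply the feedforward layers."""
    words = sentence.lower().split()  # simplification of PTB tokenization
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    hidden = average([embed(token) for token in words + bigrams])
    for weights, biases in layers:
        hidden = dense(hidden, weights, biases)
    return hidden


# Two untrained layers with random weights, just to make the sketch runnable.
rng = random.Random(0)
layers = [
    ([[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)], [0.0] * DIM)
    for _ in range(2)
]

vec = dan_encode("The quick brown fox", layers)
print(vec)
```

Because the inputs are averaged before the network sees them, the cost of the deep part does not grow with sentence length, which is where the speed advantage of this variant comes from.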



The two variants trade off accuracy against computational cost. The variant with the Transformer encoder has higher accuracy but is computationally more intensive. The variant with the DAN encoder is computationally less expensive, with slightly lower accuracy.

3. How was it trained?

The key idea for training this model is to make the model work for generic tasks such as:

  • Modified Skip-thought
  • Conversational input-response prediction
  • Natural language inference.

3.1 Modified skip-thought:

Given a sentence, the model needs to predict the sentences around it.



3.2 Conversational input-response prediction:

In this task, the model needs to predict the correct response for a given input among a list of correct responses and other randomly sampled responses.
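One way to picture this task is as a ranking problem over candidate responses. The sketch below is an assumption about the scoring rather than a description of the exact training setup: it scores each candidate by the dot product between the input embedding and the response embedding and turns the scores into probabilities with a softmax. The 2-dimensional vectors are made-up toy values, and `rank_responses` is a hypothetical name.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def rank_responses(input_vec, candidate_vecs):
    """Return the index of the best-scoring candidate and all probabilities."""
    scores = [dot(input_vec, candidate) for candidate in candidate_vecs]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical toy embeddings: one input and three candidate responses.
input_vec = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

best, probs = rank_responses(input_vec, candidates)
print("best candidate:", best, "probabilities:", probs)
```

During training, the encoder would be pushed to give the correct response a higher probability than the randomly sampled ones.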