Do even the best DNN models understand language?

New advances, new excitement

Without any doubt, Deep Neural Networks (DNNs) have recently brought huge improvements to the NLP world. Stories about AI models that can write articles like a human, or write the code for a website like a real developer, reach mainstream media frequently. Many of these achievements would have sounded surreal just a few years ago.


One of the most influential models is BERT (Bidirectional Encoder Representations from Transformers), created by Google in 2018. Google claimed that with BERT it could now understand searches better than ever before. Not stopping there, Google took it further, saying that embedding this model into its core Search Engine (SE) represented “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”. Impressed by the bold claim, I took the liberty of checking how the SE handles a COVID-related inquiry like the one below.

Screenshot of a COVID-related inquiry on Google


Figure 1: The Search Engine doesn’t just give out locations where vaccine shots are provided; it also suggests who is eligible for the shots. This result could not come from a purely keyword-based search mechanism. And yes, so far, the result seems to justify their confident claim.

However, BERT was not the only champion in the game. Another powerful language model, released more recently, came with advantages of its own: GPT-3. OpenAI built the model with 175 billion parameters, 100 times more than its predecessor GPT-2. Thanks to this large number of parameters and the extensive dataset it was trained on, GPT-3 performs impressively on downstream NLP tasks without fine-tuning. Here is an article from MTI Review written by this gigantic model.

Screenshot of an article on MTI Blog

Figure 2: The italicized part is the input fed to the model, serving as a prompt. The article talks about a unicorn in fluent English and with a high level of confidence, almost indistinguishable from human writing. I would have been convinced the piece of writing was genuine if I did not know the creature does not exist.


Many people were astounded by the generated text, and indeed it speaks to the remarkable effectiveness of these computational systems. It seems, for reasons that are not crystal clear, that the models understand language. If that were true, it would be the first step toward AI that thinks like humans. Unsurprisingly, the media took the news by storm. People started to talk about societal impacts such as parts of the workforce being replaced by AI systems. Some even went further, saying humans might be in danger 😉 But really, are we there yet?

Do the models understand language?

So, are the models really that great? Are they capable of understanding language, or are they somehow gaming the whole system? A series of recent papers has claimed that models like BERT do not understand language in any meaningful way. One reason for their outstanding results might lie in their training and testing datasets.

Introduction to Feature Engineering


In a modeling process, there are 3 core concepts that will always exist:

  • Data.
  • Features.
  • Type of model and its corresponding parameters.

From data to model, features are a measurable representation of the data: they are the format in which the data is processed by the model, so creating features from data is an unavoidable part of any modeling process. Moreover, to improve model performance, apart from enhancing data quality and performing model selection, building better features is also an option. At the same time, domain knowledge shapes the feature-creation process, either by imposing requirements or by suggesting directions for carrying it out. Therefore, knowing how to create features, how to do it well, and how to bring domain knowledge into the process is a helpful skill for reaching good results more easily.

This leads us to the main purpose of this blog post: an introduction to feature engineering. We hope this post helps you get a sense of how the process works and provides a starting point for anyone who would like to explore feature engineering further.

What is feature engineering?

Feature engineering can be described as the process of transforming information in the data into features that effectively enhance the model’s performance, possibly with domain knowledge as an aid.

Since created features are not always useful, and a large number of features can easily cause overfitting or the curse of dimensionality, feature engineering usually goes hand in hand with feature selection, which keeps only the features that actually contribute to the model. Apart from that, regularization or the kernel trick can also help limit the growth of the number of features.
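To make the creation-then-selection idea concrete, here is a minimal sketch in plain Python. The raw fields (width, height, price), the derived features, and the correlation threshold are illustrative assumptions, not taken from any particular dataset; a simple Pearson-correlation filter stands in for a full feature-selection method.

```python
def make_features(record):
    """Derive candidate features from one raw record (a dict of numbers)."""
    width, height, price = record["width"], record["height"], record["price"]
    return {
        "area": width * height,                      # interaction feature
        "aspect_ratio": width / height,              # ratio feature
        "price_per_area": price / (width * height),  # domain-inspired feature
        "is_wide": 1.0 if width > height else 0.0,   # thresholded flag
    }

def pearson(xs, ys):
    """Plain Pearson correlation, used here as a cheap relevance filter."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(rows, target, threshold=0.3):
    """Keep only features whose |correlation| with the target passes the bar."""
    feats = [make_features(r) for r in rows]
    return [name for name in feats[0]
            if abs(pearson([f[name] for f in feats], target)) >= threshold]
```

Filter-style selection like this is only one option; wrapper methods, regularization, or the kernel trick mentioned above can serve the same purpose.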

Feature engineering process.

Since modeling is an iterative process, feature engineering is iterative as well. The process can be summarized like this: after data gathering and preprocessing, data analysis and assumptions serve as the base for an initial feature set; then, after testing the features through the model’s results, we decide whether further feature engineering is needed.

The steps of a feature engineering process are:

  • Look at the data to choose a direction for which features to start with.
  • Create the features.
  • Check how effective the created features are for the result.
  • If the result is still unsatisfactory, evaluate it and set a new direction for further feature engineering.

Mindset and consideration for feature engineering


Feature engineering is not just a collection of methods for creating features; its principles have also evolved into best practices and intuition for the feature-creation process, to avoid mistakes and boost performance as a whole.

As data and use cases are situational, good feature engineering practice is mostly achieved through trials and analysis. There is no fully systematic way to do it, but there are underlying reasons why practitioners mainly do things a certain way. The mindset for feature engineering is to actively run experiments or study past precedents, look for deeper principles, and eventually shape the intuition needed to carry out the work.


From “Feature Engineering and Selection: A Practical Approach for Predictive Models” (Kuhn & Johnson, 2019), some points worth noting:

  • Overfitting is also a concern when failed feature engineering introduces features that are relevant to the current dataset but have no relationship with the outcome once new data is included.
  • To find the link between predictors and the outcomes to be predicted, supervised or unsupervised data analysis can be done.
  • Considering the “No Free Lunch” theorem, trying several types of models to see which one works best is the best course of action.
  • Other concepts related to this topic: model vs. modeling process, model bias and variance, experience-driven vs. empirically driven modeling, and big data.

They also showcase another example to illustrate these points:

  • A trial-and-error process is needed to find the best combination of model and features.
  • The interaction between models and features is complex and somewhat unpredictable; however, the effect of the feature set may be more significant than the effect of choosing a different model.
  • Given the same set of right features, good performance can be achieved regardless of the model type.

Introducing feature engineering techniques by data type

Different data types have their own corresponding feature engineering techniques. This part of the post introduces the techniques by linking them to data types, so that when you come across a given data type, the related techniques can be identified immediately.

Performance Metrics for Weather Images Forecasting

In a typical Machine Learning project, one would need to find out how good or bad their models are by measuring the models’ performance on a test dataset, using some statistical metrics.

Various performance metrics are used for different problems, depending on what needs to be optimized by the models. For this blog, we will focus on the evaluation metrics that are used in weather forecasting, based on radar images.

The major problem we need to overcome in our forecasting model is to quickly, precisely, and accurately predict the movement of rain clouds over a short period. If heavy rain is predicted as showers or – even worse – as cloudy weather without rain, the consequences could be serious for the users of our prediction model. If the rain is going to stop in the next few minutes, an incorrect forecast that predicts continued heavy rainfall may cause little direct harm; however, such a prediction model is no longer useful.

A good model should tackle as many of these issues as possible. We believe that the following measures may help us identify which model is better.

Performance Metrics

  • Root Mean Square Error (RMSE): a broad measure of accuracy in terms of the average error across forecast-observation pairs. Formally, it is defined as:

RMSE = sqrt( (1/N) · Σᵢ (Fᵢ − Oᵢ)² )

where Fᵢ and Oᵢ are the forecast and observed intensities of pair i. This measure tells us how much the predicted intensity differs from the ground-truth observation.

Figure 1: RMSE of 2 models across a 60-minute forecast.

Figure 1 shows an example of the RMSE of two different models over a 60-minute forecast. The RMSE of Model 1 increases over time. Model 2, on the other hand, keeps a smaller RMSE, which suggests it is the better of the two models.
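As a sketch of how this metric could be computed, here is a plain-Python RMSE over forecast-observation pairs; in practice the inputs would be flattened radar image arrays, and computing it once per lead time would give a curve like the one in Figure 1. The function name is ours.

```python
import math

def rmse(observed, predicted):
    """Root Mean Square Error over forecast-observation pairs."""
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sq_err / len(observed))
```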

Before defining the next metric, we need to recall the Confusion Matrix (Figure 2). Each column of the matrix represents the instances in an actual class while each row represents the instances in a predicted class, or vice versa [1]. By using the Confusion Matrix, we can count the number of False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN).

Figure 2: Definition of a Confusion Matrix.
  • Hit Rate (H): The fraction of observed events that are forecast correctly, also known as the Probability of Detection. It tells us what proportion of the points that had rain were predicted by the algorithm as having rain. It ranges from 0 to 1 and is calculated as:

H = TP / (TP + FN)

Figure 3: Hit Rate of 2 models across 60 forecast times.

From Figure 3, the Hit Rate of both models is good in the first 20 minutes. Model 2 has a higher value than Model 1 (a higher probability of detecting rain); therefore, Model 2 is the better model based on this measure.

  • False Alarm Ratio (FAR): The fraction of “yes” forecasts that were wrong. It is calculated as follows:

FAR = FP / (TP + FP)

Even though false alarms in weather forecasting do not usually lead to serious consequences, a model with a high FAR is not ideal.

  • Bias (B): This measure compares the number of points predicted as having rain with the total number of actual rain points. Specifically:

B = (TP + FP) / (TP + FN)

An unbiased forecast has B = 1; B > 1 means rain is forecast more often than it is observed.
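The three contingency-based measures above can be sketched in a few lines of Python. The rain/no-rain intensity threshold and the helper names are illustrative assumptions.

```python
def contingency_counts(observed, predicted, threshold=0.1):
    """Binarize rain/no-rain at an intensity threshold and tally the confusion matrix."""
    tp = fp = fn = tn = 0
    for o, p in zip(observed, predicted):
        obs_rain, pred_rain = o >= threshold, p >= threshold
        if pred_rain and obs_rain:
            tp += 1          # forecast rain, rain observed
        elif pred_rain:
            fp += 1          # false alarm
        elif obs_rain:
            fn += 1          # missed rain
        else:
            tn += 1          # correct "no rain"
    return tp, fp, fn, tn

def hit_rate(tp, fn):
    """H = TP / (TP + FN): fraction of observed rain events forecast correctly."""
    return tp / (tp + fn)

def false_alarm_ratio(tp, fp):
    """FAR = FP / (TP + FP): fraction of 'rain' forecasts that were wrong."""
    return fp / (tp + fp)

def bias(tp, fp, fn):
    """B = (TP + FP) / (TP + FN): forecast rain points vs. actual rain points."""
    return (tp + fp) / (tp + fn)
```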


Investigating Methods of Handling Missing Data

Handling Missing Data – Abstract

This article discusses various types of missing data and how to handle them. Through some experiments, we demonstrate how prediction results are affected by the amount and type of missing data, as well as by the method used to handle it.

  1. Introduction – Handling Missing Data

For any real dataset, missing data is almost unavoidable. There are many possible reasons for this phenomenon, including changes in the design of data collection, the precision of data that users entered, the unwillingness of survey participants to answer some questions, etc. Detecting and handling these missing values is part of the data wrangling process.

There are 3 major types of Missing data:

  • Missing Completely at Random (MCAR): the fully random case. Records are missing purely at random, and there is no correlation between the missing values and the values of any other variable.

  • Missing at Random (MAR): the propensity for a value to be missing is not related to the missing value itself, but to some of the observed data. For example, in a market research survey, interviewers in some cities may have forgotten to ask about the income of interviewees, which leads to a higher ratio of missing income values in those cities than in the others. We can consider this Missing at Random.

  • Missing Not at Random (MNAR): the highly biased case. The missingness is related to the value of the missing observation itself. For example, interviewees with high income refusing to reveal it would cause this type of missingness. In some cases, the dataset should be re-collected to avoid this type of missing data.

  2. Handling Missing Data


Ignoring

Yeah, you can just ignore it, if you know the missing data is MCAR. Although you do not do anything yourself, the library (such as XGBoost) does the work for you by choosing an appropriate method. So technically, we can count this method as a case of the other methods, depending on the circumstances.

Removing (Deletion)

  • Column deletion: another simple way to handle missing data is to remove the attribute entirely (column deletion). It can be applied when the missing ratio is high (typically at least 60%, though this is not a fixed rule) and the variable is insignificant.
  • Row deletion: if the missing values are MCAR and the missing ratio is not very high, we can drop the entire record (row). This method is known as listwise deletion. But if the missingness is not MCAR, this method could introduce bias into the dataset.
  • Pairwise deletion: instead of completely removing incomplete records, we maximize data usage by omitting them only when necessary. Pairwise deletion can be considered a method to reduce the data loss caused by listwise deletion.
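A minimal sketch of listwise vs. pairwise deletion on rows stored as dicts, with None marking a missing value; the column names are illustrative.

```python
def listwise_delete(rows, columns):
    """Drop any row with a missing value in the listed columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def pairwise_values(rows, col_a, col_b):
    """Keep a row only when BOTH columns of the current pair are present,
    so each pairwise statistic uses as much data as possible."""
    return [(r[col_a], r[col_b]) for r in rows
            if r[col_a] is not None and r[col_b] is not None]
```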

Imputation (Fill-in)

  • Imputation with Median/Mean/Mode values: these values are often used to fill missing positions. Most of the time, the mean is used; by using the mean, we keep the mean of the attribute unchanged after processing. For a categorical variable, the most frequent value (mode) can be used instead. Note that imputation tends to decrease the variance of the attribute. We can also extend imputation by adding a boolean flag indicating whether each value comes from imputation or from the original data (this technique is called marking imputed values in some documents). However, one must be careful with this method: if the data is not missing at random, using the mean can introduce outliers into the data.
  • Algorithm-based Imputation: instead of using a constant for imputing missing values, we can model variables with missing values as a function of other features, and a regression algorithm can predict them under some assumptions.
      • If linear regression is used, we must assume that the variables have a linear relationship.
      • If missing values are predicted based on the order of highly correlated columns, the process is called hot-deck imputation.
  • KNN Imputation: this method can be considered a variant of median/mean/mode imputation, but instead of calculating these values across all observations, it does so only among the K nearest observations. One question we should think about is how to measure the distance between observations.


  • Multivariate Imputation by Chained Equations: instead of imputing each column separately, we repeatedly estimate missing values based on the distributions of the other variables. The process repeats until the data becomes stable. This approach has two settings: a single data set, and multiple data sets (the latter is also known as Multiple Imputation by Chained Equations – MICE).

One iteration of MICE
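For the single-data-set setting, one iterate-until-stable loop might look like the following NumPy sketch. This is our simplified illustration of the chained-equations idea (plain least-squares fits and a fixed iteration count), not a full MICE implementation.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Single-data-set sketch of chained-equations imputation.
    X: 2-D float array with np.nan marking missing cells."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]          # initial fill: column means
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(X, j, axis=1)        # predict column j from the rest
            A = np.column_stack([others, np.ones(len(X))])  # add intercept term
            obs = ~missing[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[missing[:, j], j] = A[missing[:, j]] @ coef   # refresh the fills
    return X
```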
  3. Experiment

We are using the Titanic dataset for the experiment, which is quite familiar to most data scientists. The original data consists of 12 variables, including categorical and numerical ones. The original task is to predict whether each passenger survived.

We will do a classification task with Logistic Regression (fixed across trials). In each experiment, we simulate missing data by removing some existing values from some features of the input data. There are two ways of removing data: completely at random (MCAR Generator) and at random (MAR Generator). With the MAR Generator, in each trial, values are removed at a different ratio based on the values of other features (in particular, based on Pclass, a variable highly correlated with Survived status). We track the change in accuracy across different settings. For cross-validation, we apply K-Fold with K=5.

In experiment 1, we observe how the accuracy changes when we remove different amounts of data from some features.

In experiment 2, we generate missing data using the MCAR and MAR Generators and use two MCAR-compatible methods to handle them. We will find out whether these methods decrease the accuracy of the classification model.
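The two generators can be sketched as follows; the grouping variable plays the role of Pclass, and the per-group removal ratios are illustrative assumptions.

```python
import random

def mcar_mask(values, ratio, seed=0):
    """MCAR: every cell has the same chance of being removed."""
    rng = random.Random(seed)
    return [None if rng.random() < ratio else v for v in values]

def mar_mask(values, group, ratio_by_group, seed=0):
    """MAR: the removal chance depends on another, fully observed variable
    (here playing the role of Pclass), not on the value being removed."""
    rng = random.Random(seed)
    return [None if rng.random() < ratio_by_group[g] else v
            for v, g in zip(values, group)]
```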

  4. Results and Discussion

The Effect of the Amount of Missing Data

In this experiment, we try to find the correlation (not the correlation coefficient, but correlation in a general sense) between the amount of missing data and the output of the learning models, as well as the method used to handle it. We do this by masking different ratios of a few columns with an MCAR setting.

Missing ratio (%)   Masking Sex, Dropping Title   Masking Age, Dropping Title   Masking Age, Sex, Dropping Title   Masking Age, Sex, Keeping Title
 0                  81.04   (0)                   81.04   (0)                   81.07   (0)                        82.21   (0)
20                  77.23   (-3.81)               81.19   (+0.15)               77.53   (-3.54)                    81.83   (-0.38)
40                  75.17   (-5.87)               80.84   (-0.20)               75.41   (-5.66)                    81.87   (-0.34)
60                  73.96   (-7.08)               80.29   (-0.75)               73.93   (-7.14)                    82.32   (+0.11)
80                  71.95   (-9.09)               79.58   (-1.46)               71.92   (-9.15)                    82.69   (+0.48)
99                  71.48   (-9.56)               79.50   (-1.54)               71.00   (-10.07)                   82.98   (+0.77)


Figure 3: Effect of the missing ratio. Next to each accuracy, the difference from the original (0%) setting is shown.

As can be seen, the more values are removed, the more the accuracy decreases, but only under some settings.

The amount of missing data matters significantly only if the feature carries “unique” information. With the presence of the Title feature (extracted from Name), missing values in the Sex column do not decrease the model’s performance, even with 99% of the data missing. This is because most values of the Title column (Mr, Mrs, Ms, Dr, …) already convey the information in the Sex column.

When features that are important and highly correlated with the missing feature are present, the effect of missing data becomes negligible. One thing we can learn is that, despite its simplicity, removing an entire variable should be considered in many cases, especially when other features highly correlate with the missing one. This is valuable if we do not want to waste effort to gain only a small portion of accuracy (around 1%).

The Effect of the Missing Generator and the Handling Method

In this experiment, we use the MCAR and MAR simulators to create modified datasets. Each removal method is applied to the numerical columns (Age and Fare). Then we use Mean Imputation (which is why we chose numerical features for removal) and Listwise Deletion, both compatible with the MCAR setting, to handle the missing values and observe the difference in accuracy.

Handling by Mean Imputation

Missing ratio (%)   MCAR Missing Generator (Age)   MAR Missing Generator (Age)   Difference
 0                  81.00                          81.00                          0.00
20                  80.97                          80.99                         -0.02
40                  80.72                          80.70                          0.02
60                  80.04                          80.38                         -0.34

Handling by Listwise Deletion

Missing ratio (%)   MCAR Missing Generator (Age)   MAR Missing Generator (Age)   Difference
 0                  79.24                          79.24                          0.00
20                  78.69                          77.85                          0.84
40                  78.81                          76.59                          2.22
60                  80.65                          77.34                          3.31

Figure 4: Different Missing Generators with different MCAR-compatible handling methods.

Once again, we notice that with Mean Imputation there is no significant difference when we use the MCAR Missing Generator instead of the MAR one. Although Mean Imputation (considered an MCAR-compatible handling method) can distort the correlation between features in the case of the MAR Missing Generator, the classification task still achieves comparable accuracy.

On the other hand, when using Listwise Deletion, the classifier’s accuracy is higher when the handling method and the generator are matched (the MCAR Missing Generator). This can be explained as follows: with listwise deletion, we also throw away data from other variables. In the MAR Generator case, rows are removed by a non-random mechanism (whereas they are still removed randomly in the MCAR Generator case), which worsens the classifier’s accuracy. Note that in one column there is an increase at the 60% setting. This happens because, by removing more rows, both the training and testing folds become smaller; we should not consider this an improvement of the model as the missing ratio increases.

  5. Recap

All methods of handling missing data may be helpful, but the choice depends on the circumstances. To choose well, data scientists should understand the process that generated the dataset, as well as the domain.

Considering the correlation between features is important when deciding whether missing data should be handled, ignored, or deleted from the dataset.

There are also some aspects of handling missing data we wanted to show you, but due to time and resource limitations we have not run those experiments yet. We would like to try more complex methods, such as algorithm-based handling, and to compare the effects across different datasets. We hope to come back to these problems someday.



Data Science Blog

Please check our other Data Science Blog

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

AI / Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab
Please also visit Vietnam AI Lab

Causal inference and potential outcome framework

In this blog, we would like to introduce basic concepts in causal inference and the potential outcome framework.

1. Causality terminology

  1. Unit: The fundamental notion is that causality is tied to an action (or manipulation, treatment, or intervention) applied to a unit. A unit here can be a physical object, a firm, or a person, at a particular point in time. The same physical object or person at a different time is a different unit. For instance, when you have a headache and decide to take an aspirin to relieve it, you could also have chosen not to take the aspirin, or to take an alternative medicine. In this framework, articulating with precision the nature and timing of the action sometimes requires a certain amount of imagination. For example, if we define race solely in terms of skin color, the action might be a pill that alters only skin color. Such a pill may not currently exist (but, then, neither did surgical procedures for heart transplants hundreds of years ago), yet we can still imagine such an action.
  2. Active treatment vs. Control treatment: Often, one of these actions corresponds to a more active treatment (e.g., taking an aspirin) in contrast to a more passive action (e.g., not taking the aspirin). We refer to the first action as the active treatment and the second as the control treatment.
  3. Potential Outcome: given a unit and a set of actions, we associate each action-unit pair with a potential outcome. We refer to these outcomes as potential outcomes because only one will ultimately be realized and therefore possibly observed: the one corresponding to the action taken. The other potential outcomes cannot be observed because the actions that would have realized them were not taken.
  4. Causal Effect: The causal effect of one action or treatment relative to another involves the comparison of these potential outcomes, one realized and the others not realized and therefore not observable.

Suppose we have a ‘treatment’ variable A with two levels, 1 and 0, and an outcome variable Y with two levels: 1 (death) and 0 (survival). Treatment A has a causal effect on an individual’s outcome Y if the potential outcomes under a = 1 and a = 0 are different. The causal effect of the treatment involves the comparison of these potential outcomes. A has a causal effect on Y if:

Y(a = 1) ≠ Y(a = 0)

For example, consider the case of a single unit, I, at a particular point in time, contemplating whether or not to take an aspirin for my headache. That is, there are two treatment levels: taking an aspirin, and not taking an aspirin. There are therefore two potential outcomes, Y(Aspirin) and Y(No Aspirin), one for each level of the treatment.

Table 1 illustrates this situation, assuming the values Y(Aspirin) = No Headache and Y(No Aspirin) = Headache.

  5. A fundamental problem of causal inference: There are two important aspects of the definition of a causal effect. First, the definition depends on the potential outcomes, but not on which outcome is actually observed. Specifically, whether I take the aspirin (and am therefore unable to observe the state of my headache without it) or do not take it (and am thus unable to observe the outcome with it) does not affect the definition of the causal effect. Second, the causal effect is the comparison of potential outcomes for the same unit, at the same moment in time post-treatment. In particular, it is not defined in terms of comparisons of outcomes at different times, as in a before-and-after comparison of my headache before and after deciding to take or not take the aspirin. “The fundamental problem of causal inference” (Holland, 1986, p. 947) is therefore that at most one of the potential outcomes can be realized and thus observed. If the action you take is Aspirin, you observe Y(Aspirin) and will never know the value of Y(No Aspirin), because you cannot go back in time.
  6. Causal Estimands / Average Treatment Effect: Consider a population of units indexed by i = 1, …, N. Each unit in this population can be exposed to one of a set of treatments.
  • Let Ti (or Wi elsewhere) denote the set of treatments to which unit i can be exposed.

Ti = T = {0, 1}

  • For each unit i, and for each treatment in the common set of treatments, there are corresponding potential outcomes Yi(0) and Yi(1).
  • Comparisons of Yi(1) and Yi(0) are unit-level causal effects:

Yi(1) – Yi(0)

2. Potential Outcomes Framework

2.1 Introduction

The potential outcome framework, formalized for randomized experiments by Neyman (1923) and developed for observational settings by Rubin (1974), defines such potential outcomes for all individuals, only some of which are subsequently observed. This framework dominates applications in epidemiology, medical statistics, and economics, stating in rigorous mathematical language the conditions under which causal effects can be estimated.

The potential outcomes approach was designed to quantify the magnitude of the causal effect of a factor on an outcome, NOT to determine whether the factor is a cause. Its goal is to estimate the effects of a “cause”, not the causes of an effect. Quantitative counterfactual inference helps us predict what would happen under different circumstances, but it is agnostic about what is or is not a cause.

2.2 Counterfactual

The potential outcome is the value corresponding to a given level of treatment. Suppose we have a ‘treatment’ variable X with two levels, 1 (treat) and 0 (not treat), and an outcome variable Y with two levels: 1 (death) and 0 (survival). If we expose a subject, we observe Y1 but we do not observe Y0. Indeed, Y0 is the value we would have observed if the subject had not been exposed. The unobserved variable is called a counterfactual. The variables (Y0, Y1) are also called potential outcomes. We have enlarged our set of variables from (X, Y) to (X, Y, Y0, Y1). A small dataset might look like this:

X   Y   Y0   Y1
1   1   *    1
1   0   *    0
0   0   0    *
0   1   1    *

The asterisks indicate unobserved variables. Causal questions involve the distribution p(y0, y1) of the potential outcomes. We can interpret p(y1) as p(y|set X = 1) and we can interpret p(y0) as p(y|set X = 0). For each unit, we can observe at most one of the two potential outcomes, the other is missing (counterfactual).

Causal inference under the potential outcome framework is essentially a missing data problem. Suppose now that X is a binary variable representing some exposure, so X = 1 means the subject was exposed and X = 0 means the subject was not. We can address the problem of predicting Y from X by estimating E(Y | X = x). To address causal questions, we introduce counterfactuals: let Y1 denote the response if the subject is exposed and Y0 the response if the subject is not exposed. Then:

Y = X·Y1 + (1 − X)·Y0

Potential outcomes and assignments jointly determine the values of the observed and missing outcomes:

Y_obs = X·Y1 + (1 − X)·Y0,    Y_mis = X·Y0 + (1 − X)·Y1

Since it is impossible to observe the counterfactual for a given individual or set of individuals, evaluators must instead compare outcomes for two otherwise similar sets of beneficiaries who are and are not exposed to the intervention, with the latter group representing the counterfactual.

2.3 Confounding

In some cases, it is not feasible or ethical to do a randomized experiment, and we must use data from observational (non-randomized) studies; smoking and lung cancer is an example. Can we estimate causal parameters from observational studies? The answer is: sort of.

In an observational study, the treated and untreated groups will generally not be comparable. Maybe the healthy people chose to take the treatment and the unhealthy people didn’t. In other words, X is not independent of (Y0, Y1). The treatment may have no effect, yet we would still see a strong association between Y and X. In other words, a (correlation) may be large even though q (causation) = 0.

Here is a simplified example. Suppose X denotes whether someone takes vitamins and Y is some binary health outcome (with Y = 1 meaning “healthy”)

X   (Y0, Y1)
1   (1, 1)    healthy
0   (0, 0)    unhealthy

In this example, there are only two types of people: healthy and unhealthy. The healthy people have (Y0, Y1) = (1,1). These people are healthy whether or not they take vitamins. The unhealthy people have (Y0, Y1)= (0,0). These people are unhealthy whether or not they take vitamins.

The observed data are:

X   Y
1   1
0   0

In this example, q = 0 but a = 1. The problem is that people who choose to take vitamins are different from people who choose not to take vitamins. That’s just another way of saying that X is not independent of (Y0, Y1).
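This vitamin example can be reproduced in a few lines of Python: the association a compares observed outcomes across the self-selected groups, while the causal effect q uses both potential outcomes, which are available here only because the data is simulated. The group sizes are illustrative.

```python
# Two types of people: the healthy, who all choose vitamins, and the
# unhealthy, who do not. Potential outcomes (y0, y1) are known here
# only because the data is simulated.
people = ([{"x": 1, "y0": 1, "y1": 1}] * 50      # healthy, take vitamins
          + [{"x": 0, "y0": 0, "y1": 0}] * 50)   # unhealthy, do not

def mean(vals):
    return sum(vals) / len(vals)

def observed_y(p):
    """Only the potential outcome matching the chosen treatment is observed."""
    return p["y1"] if p["x"] == 1 else p["y0"]

# Association a: compare observed outcomes across the self-selected groups.
a = (mean([observed_y(p) for p in people if p["x"] == 1])
     - mean([observed_y(p) for p in people if p["x"] == 0]))

# Causal effect q: compare BOTH potential outcomes for every unit.
q = mean([p["y1"] - p["y0"] for p in people])

print(a, q)  # a = 1.0 (strong association), q = 0.0 (no causal effect)
```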

To account for the differences between the groups, we can measure confounding variables: variables Z that affect both X and Y. These variables explain why the two groups of people are different; in other words, they account for the dependence between X and Y. By definition, there are no such variables in a randomized experiment. The hope is that if we measure enough confounding variables, the treated and untreated groups will be comparable conditional on Z. This means that X is independent of (Y0, Y1) conditional on Z.

2.4 Measuring the Average Causal Effect

The mean treatment effect or mean causal effect is defined by

q = E(Y1) – E(Y0) = E(Y | set X = 1) – E(Y | set X = 0)

The parameter q has the following interpretation: q is the mean response if we exposed everyone minus the mean response if we exposed no one.

In a randomized experiment, the natural estimator for this parameter is the difference in means: the average observed outcome among the treated minus the average observed outcome among the untreated.
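To make the vitamin example concrete, here is a small simulation (a hypothetical sketch; all the probabilities are chosen purely for illustration) in which the causal effect q is exactly zero but the difference-in-means association a is large, because healthy people are more likely to choose the treatment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two types of people: healthy have (Y0, Y1) = (1, 1), unhealthy have (0, 0).
healthy = rng.random(n) < 0.5
y0 = healthy.astype(int)
y1 = healthy.astype(int)

# Confounding: healthy people are far more likely to take vitamins.
x = (rng.random(n) < np.where(healthy, 0.9, 0.1)).astype(int)
y = np.where(x == 1, y1, y0)  # observed outcome

q = y1.mean() - y0.mean()                # causal effect: exactly 0
a = y[x == 1].mean() - y[x == 0].mean()  # association: large
```

Running this gives q = 0 while a is far from zero: the association comes entirely from who chose the treatment, not from the treatment itself.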





Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has likewise emerged. The ability of embeddings to learn complex, underlying relationships between words is also their greatest weakness:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a general sense and a good embedding for a specific downstream task. Although some methods for evaluation, such as word similarity/analogy tasks, have been proposed, most remain somewhat controversial as to their validity as well as their relevance to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluation: intrinsic, where one typically employs specifically designed, human-moderated data sets, and extrinsic, whereby embeddings are tested on simpler proxy NLP tasks to estimate their performance.

For both types, it is yet unclear to what extent good performance correlates with actual useful embeddings. In many, if not most state-of-the-art neural networks, the embeddings are trained alongside the model to tailor to the task at hand.

Here, we want to evaluate two different embeddings (Skip-gram and CBOW)  trained on a Japanese text corpus (300K) to assess which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al.,2018)

Embedding: Train 2 models: Skip-gram and CBOW, vector size: 300, 40K vocabulary size using FastText (Athiwaratkun et al., 2018).

Japanese is a non-space separated language and needs to be segmented as part of the preprocessing. This can be done using morphological analyzers, such as Mecab (Kudo, 2006), or language-independent algorithms, such as SentencePiece (Kudo et al., 2018). As the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate given specific target applications.

To tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we are segmenting our texts into “subwords” using SentencePiece. These typically return smaller units and do not align with “word” segmentations returned by Mecab.
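SentencePiece implements subword algorithms such as byte-pair encoding (BPE) and a unigram language model. As a rough illustration of the BPE idea only (a toy sketch, not SentencePiece’s actual implementation), the merge loop looks like this: start from characters and repeatedly merge the most frequent adjacent pair of symbols.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def bpe(word_freqs, num_merges):
    """Toy BPE: split words into characters, then merge the top pair repeatedly."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    for _ in range(num_merges):
        a, b = most_frequent_pair(words)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return words

# After two merges, "low" becomes a single subword unit
vocab = bpe({"low": 5, "lower": 2}, num_merges=2)
```

Because merges are driven by corpus statistics, the resulting units need not align with any dictionary notion of a “word” — which is exactly why segmentations from SentencePiece and Mecab diverge.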

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different words types by human evaluators.

However, preliminary vocabulary comparisons showed that because of differences in the segmentation, there was little to no overlap between the words in our word embeddings and those in the data set.  For instance, the largest common group occurred in nouns: only 50 out of 1000 total noun comparison pairs.

So instead, we are going to propose a naïve approach to compare the two word embeddings using a synonym vector mapping approach.

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words. 
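A minimal sketch of the mapping idea, using random vectors as stand-ins for the trained Skip-gram/CBOW vectors (the five rows, the 300 dimensions, and the 0.1 noise scale are illustrative assumptions, not values from our models): for each word vector from the full texts, we look up its nearest neighbour among the summary vectors by cosine similarity.

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(42)
# Stand-ins for 5 word vectors from the full texts (dimension 300, as in our setup)
full_vecs = rng.normal(size=(5, 300))
# Stand-ins for near-synonym vectors from the summaries: close but not identical
summary_vecs = full_vecs + rng.normal(scale=0.1, size=(5, 300))

sims = cosine_sim_matrix(full_vecs, summary_vecs)
best_match = sims.argmax(axis=1)  # index of the nearest summary vector per word
```

If the embedding has learned the paraphrase relationship, each full-text vector should map back to its own summary counterpart, even when the surface forms differ.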

Pre-processing Data

In Data Science, before building a predictive model from a particular data set, it is important to explore the data and perform pre-processing. In this blog, we will illustrate some typical steps of data pre-processing.

In this particular exercise, we will build a simple Decision Tree model to classify the food cuisine from the list of ingredients. The data for this exercise can be taken from:

From this exercise, we will show the importance of data pre-processing. This blog is organized as follows:

  1. Data Exploration and Pre-processing.
  2. Imbalanced Data.

1.  Data Exploration and Pre-processing

When you are given a set of data, it is important to explore and analyze them before constructing a predictive model. Let us first explore this data set.


From the first 10 items of this data set, we observe that even within a particular cuisine, the lists of ingredients may differ.

From this data set, we can see that there are 20 different cuisines and that the distribution of recipes is not uniform. For example, recipes from the ‘Italian’ cuisine make up 19.7% of the data set, while only 1.17% of the recipes come from the ‘Brazilian’ cuisine.

Figure: distribution of recipes per cuisine
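The exploration above can be sketched with pandas (the tiny DataFrame here is a made-up stand-in for the real recipes data set, whose column names may differ):

```python
import pandas as pd

# Made-up stand-in for the recipes data set
recipes = pd.DataFrame({
    "cuisine": ["italian", "italian", "brazilian", "mexican"],
    "ingredients": [
        ["salt", "tomato", "basil"],
        ["water", "pasta", "salt"],
        ["sugar", "cassava"],
        ["salt", "corn", "chili"],
    ],
})

# Share of each cuisine in the data set, in percent
dist = recipes["cuisine"].value_counts(normalize=True) * 100

# Most common ingredients across all recipes
top_ingredients = recipes["ingredients"].explode().value_counts()
```

`value_counts(normalize=True)` gives the cuisine percentages directly, and `explode()` flattens the ingredient lists so the same count can be taken over individual ingredients.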

Now, let us explore further into this data set. Let us look at the top 15 ingredients

top 15 ingredients

If we look at the top 15 ingredients, we see that they include “salt”, “water”, “sugar”, etc. These are all generic and can be found in every cuisine. Intuitively, if we remove these ingredients from the classification model, the accuracy of the classification should not be affected.

In the classification model, we would prefer the recipes in each cuisine to have ingredients unique to that cuisine. This helps the model easily identify which cuisine a recipe comes from.

After removing all the generic ingredients (salt, water, sugar, etc.) from the data set, we look at the top 15 ingredients again.

top 15 ingredients

It looks like we could remove more ingredients, but the decision of which ones to remove is best left to someone with more domain knowledge in cooking. For example, one country may use ‘onion’ in its recipes while another uses ‘red onion’. So it is better not to filter out too many generic ingredients.

Now, we look at the distribution of ingredients in each recipe in the data set.


Some recipes have only 1 or 2 ingredients, while some have up to 60. It is probably best to remove the recipes with very few ingredients from the data set, as the number of ingredients may not be representative enough for the classification model. What is the minimum number of ingredients required to classify the cuisine? The short answer is that no one knows. It is best to experiment: remove recipes with 1, 2, 3, etc. ingredients, re-train the model, and compare the accuracy to decide what works best for your model.
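The two clean-up steps above — dropping generic ingredients, then dropping recipes left with too few ingredients — can be sketched as follows; the toy data, the generic-ingredient list, and the threshold of 3 are all illustrative choices:

```python
import pandas as pd

GENERIC = {"salt", "water", "sugar"}  # generic ingredients found in every cuisine

recipes = pd.DataFrame({  # made-up stand-in for the real data set
    "cuisine": ["italian", "brazilian"],
    "ingredients": [["salt", "tomato", "basil", "pasta"], ["sugar", "cassava"]],
})

# Step 1: remove the generic ingredients from every recipe
recipes["ingredients"] = recipes["ingredients"].map(
    lambda ings: [i for i in ings if i not in GENERIC])

# Step 2: drop recipes with too few remaining ingredients (threshold to tune)
min_ingredients = 3
filtered = recipes[recipes["ingredients"].str.len() >= min_ingredients]
```

Re-running the training with different `min_ingredients` values is exactly the experiment suggested above.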

The ingredients in the recipes are all words, so for further pre-processing we will need some NLP (Natural Language Processing) techniques.

AWS Lake Formation – Data Lake(Setup)

Overview for Build a Data lake with AWS(Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without first having to structure it. And you can run different types of analytics to better guide decision-making—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control data lake access and audit those who access data. The AWS Glue Data Catalog integrates data access policies, ensuring compliance regardless of the data’s origin.

Set up the S3 bucket and put the dataset.


Set up Data Lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself as a data lake administrator to allow access to any Lake Formation resource.

Create a data lake administrator

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.

Register an Amazon S3 path

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the datasetsample00 table definitions.

  •      For Database, enter datasetsample00-db
  •      For Location, enter your S3 bucket path followed by /datasetsample00.
  •      For New tables in this database, do not select Grant All to Everyone.

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the datasetsample00-db database. For the IAM role, select your user and AWSGlueServiceRoleDefault.

Grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake using a data location:

  • For the IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: samplecrawler.
  • Datastores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another datastore: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on-demand: Select this field.
  • Database: Select datasetsample00-db.

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For the IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;

Combinatorial Optimization: From Supervised Learning to Reinforcement Learning – Part 1

Recently, I was asked to solve an interesting problem: the array sorting problem.

Input: an array A containing n unique elements, whose values are integers. The length n of A ranges from 2 to 10.

Output: a sorted array B in ascending order. The length of array B must be the same as the length of array A.

Examples: A = [3,0] -> B = [0,3]; A = [1,3,2] -> B = [1,2,3]; A = [5,9,1,3,7] -> B = [1,3,5,7,9]

Array sorting is not a new problem. There are many sorting algorithms, such as straight insertion, Shell sort, bubble sort, quick sort, selection sort, and heap sort. The problem above becomes much more interesting if we consider it as a combinatorial optimization problem, to which various machine learning approaches can be applied.

Combinatorial Optimization

“Combinatorial Optimization is a category of problems which requires optimizing a function over a combination of discrete objects and the solutions are constrained. Examples include finding the shortest paths in a graph, maximizing value in the Knapsack problem, and finding boolean settings that satisfy a set of constraints. Many of these problems are NP-Hard, which means that no polynomial-time solution can be developed for them. Instead, we can only produce approximations in polynomial time that are guaranteed to be some factor worse than the true optimal solution.”

Source: Recent Advances in Neural Program Synthesis

Traditional solvers often rely on handcrafted heuristics to make decisions. In recent years, many machine learning (ML) techniques have been used to solve combinatorial optimization problems, ranging from supervised learning to modern reinforcement learning.

Using the above sorting problem, we will see how it can be solved with different ML techniques.


In this series, we will start with some supervised techniques, then apply neuro-evolution, and finally use some modern RL techniques.

Part 1: Supervised learning: Gradient Boosting, Fully Connected Neural Networks, Seq2Seq.

Part 2: Deep Neuro-Evolution: NEAT, Evolution Strategies, Genetic Algorithms.

Part 3: Reinforcement Learning: Deep Q-Network, Actor-Critic, PPO with Pointer network and Attention-based model.

Code for Part 1:

(Note: Enable Colab GPU to speed up running time)

Supervised learning

Supervised machine learning algorithms are designed to learn by example. If we want to use supervised learning, we need data.

First, we will generate 3 data sets of different sizes: 1000, 5000, and 50000. We will then train several models on these data sets and compare their sorting abilities after the learning process.

1. Generate training data

How do we generate the data? One possible approach: if we consider each element of the input list as a feature and each element of the sorted list as a label, we can easily convert the data to a tabular form.

    In1  In2  In3  In4  In5  In6  In7  In8  In9  In10
0    -1   -1   -1   -1   -1    7    0    1    2     3
1     1    7    6    3    4    5    0    2    8     9
2    -1   -1    6    2    7    4    1    0    3     8

    Out1 Out2 Out3 Out4 Out5 Out6 Out7 Out8 Out9 Out10
0    -1   -1   -1   -1   -1    0    1    2    3     7
1     0    1    2    3    4    5    6    7    8     9
2    -1   -1    0    1    2    3    4    6    7     8

(Inputs shorter than 10 elements are left-padded with -1, and the padding is kept in the same positions in the sorted outputs.)

Then we can use any multi-label regression or multi-label classification model on this training data.

2. Multi-label regression

For this tabular dataset, I will use 2 common techniques: gradient boosting (using the XGB library) and simple fully connected neural networks (FCNNs).

from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor
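A runnable sketch of the whole setup, from data generation to the multi-label regressor. To keep the example dependency-free, it substitutes scikit-learn’s GradientBoostingRegressor for the XGBRegressor imported above (swap XGBRegressor back in for the real experiments); the data-set size and estimator count are also cut down for speed:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

MAX_LEN, PAD = 10, -1

def make_dataset(n_samples, rng):
    """Arrays of unique integers (length 2..10), left-padded with -1,
    paired with their left-padded sorted versions."""
    X = np.full((n_samples, MAX_LEN), PAD)
    y = np.full((n_samples, MAX_LEN), PAD)
    for i in range(n_samples):
        length = rng.integers(2, MAX_LEN + 1)
        vals = rng.choice(10, size=length, replace=False)
        X[i, MAX_LEN - length:] = vals
        y[i, MAX_LEN - length:] = np.sort(vals)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_dataset(1000, rng)

# One gradient-boosting regressor per output position (multi-label regression)
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50))
model.fit(X_train, y_train)

# Predictions are continuous, so round them back to integers
pred = np.rint(model.predict(X_train[:5])).astype(int)
```

Note that each output position is modelled independently here, so nothing forces the predicted values to be a permutation of the input — one reason the sequence models in later parts are a better fit for this problem.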

AWS – Batch Process and Data Analytics


Today, I will describe our current reference architecture for batch processing and data analytics for a sales report system. Our mission is to create a big data analytics system that incorporates machine learning features for insights and prediction. Nowadays, many AI companies are researching machine learning features and applying them to their systems to improve their services. We are also working to research and apply machine learning features such as NLP, forecasting, and OCR to our system, which gives us the opportunity to provide better service to our customers.

Reference architecture for batch data processing
  1. Data Source
    We have various sources from multiple systems, both on-premise and on-cloud, with large datasets and unpredictably frequent updates.
  2. Data lake storage
    We use S3 as our data lake storage, which supports unlimited data types and volumes and lets us scale the system easily.
  3. Machine Learning
    We focus on machine learning to build AI solutions from our dataset. Machine learning models predict insights and integrate directly with our system as microservices.
  4. Compute
    Compute services are the most important part of our system. We choose the best-fitting services, whether provisioned infrastructure or serverless, to optimize cost and performance.
    – We use AWS Lambda functions for small jobs such as calling AI services, processing small datasets, and integration.
    – We use AWS Glue ETL to build ETL pipelines, and AWS Step Functions to build custom pipelines.
    – We also provide web and API services for end users; that system is built on a microservices architecture and uses AWS Fargate for hosting.
  5. Report datastore
    After processing the data and exporting insights from predictions, we store the data in DynamoDB and RDS for visualization and AWS built-in insight features.
  6. Data analytics with visualization and insights
    We use Amazon QuickSight and its insight features for data analytics.

Data Processing Pipeline: AWS Event + AWS Step Function + AWS Lambda

Each dashboard has its own data processing pipeline to process and prepare data for visualization and insights. Here is our standard pipeline, orchestrated with AWS Step Functions. We want to keep everything as basic as possible: we use AWS Events for scheduling and Lambda functions for the processing steps.
– Advantages: serverless, and easy to develop and scale.

– Disadvantages: strong dependence on AWS services, and challenges when a job exceeds service limits such as the Lambda function execution time.
– Typical use case: batch data processing pipelines.

Data processing pipeline

What are the strong points in our system?

  • Serverless compute for data processing and a microservices architecture
  • Easy to develop, deploy, and scale the system without modification
  • Flexibility to use built-in services or build custom machine learning models with AWS SageMaker
  • Decoupled from the data source systems at run time
  • Effective data lake processing


This architecture is basic, and I hope you can take something from it. I have focused on describing our current system; it will grow beyond this design with more features in the future, so we should choose flexible solutions that are easy to maintain and replace. When you choose an architecture or service to solve your problems, consider which service best fits your current situation: no architecture or service can solve every problem, as it always depends on the specific case. And avoid hunting for problems to fit a ready-made solution 😀

— — — — — — — — — — — — — — — —