Do even the best DNN models understand language?

New advances, new excitement

Without any doubt, Deep Neural Networks (DNNs) have brought huge improvements to the NLP world recently. News that an AI model built on DNNs can write articles like a human, or write the code for a website like a real developer, reaches mainstream media frequently. Many of these achievements would have sounded surreal just a few years ago.


One of the most influential models is BERT (Bidirectional Encoder Representations from Transformers), created by Google in 2018. Google claimed that with BERT it can now understand searches better than ever before. Not stopping there, it went even further, saying that embedding this model into its core Search Engine (SE) represented “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”. Impressed by the bold claim, I took the liberty of checking how the SE handles a COVID-related inquiry like the one below.

Screenshot of a COVID-related inquiry on Google


Figure 1: The Search Engine doesn’t just list locations where vaccine shots are provided; it also suggests who is eligible to get them. This result cannot come from a keyword-based search mechanism. And yes, so far, the result seems to justify Google's confident claim.

However, BERT was not the only champion in the game. Another powerful language model, released more recently, came with its own advantages: GPT-3. OpenAI built the model with 175 billion parameters, 100 times more than its predecessor GPT-2. Thanks to this large number of parameters and the extensive dataset it was trained on, GPT-3 performs impressively on downstream NLP tasks without fine-tuning. Here is an article from MIT Technology Review written by this gigantic model.

Screenshot of an article on the MIT Technology Review blog

Figure 2: The italicized part is the input fed to the model, serving as a prompt. The article talks about unicorns in fluent English and with a high level of confidence, almost indistinguishable from human writing. I would have been convinced the piece of writing was genuine if I did not know the creature does not exist.


Many people were astounded by the generated text, and indeed it speaks to the remarkable effectiveness of these computational systems. It seems, for reasons that are not crystal clear, that the models understand language. If that were true, it would be the first step toward AI thinking like humans. Unsurprisingly, the media took the news by storm. People started to talk about societal impacts such as the workforce being replaced by AI systems. Some even went further, saying humans might be in danger 😉 But really, are we there yet?

Do the models understand language?

So, are the models really that great? Are they capable of understanding language, or are they somehow gaming the whole system? A series of recent papers claims that models like BERT don’t understand language in any meaningful way. One of the reasons for their outstanding results might lie in their training and testing datasets.

Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has likewise emerged. The ability of embeddings to learn complex, underlying relationships between words is also their greatest caveat:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a general sense and a good embedding for a specific downstream task. Although some methods for evaluation, such as word similarity/analogy tasks, have been proposed, most remain somewhat controversial as to their validity as well as their relevance to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluation: intrinsic, where one typically employs specifically designed, human-moderated data sets, and extrinsic, whereby embeddings are tested on simpler, proxy NLP tasks to estimate their performance.

For both types, it is still unclear to what extent good performance correlates with actually useful embeddings. In many, if not most, state-of-the-art neural networks, the embeddings are trained alongside the model to tailor them to the task at hand.

Here, we want to evaluate two different embeddings (Skip-gram and CBOW)  trained on a Japanese text corpus (300K) to assess which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al.,2018)

Embedding: Train 2 models: Skip-gram and CBOW, vector size: 300, 40K vocabulary size using FastText (Athiwaratkun et al., 2018).

Japanese is a non-space separated language and needs to be segmented as part of the preprocessing. This can be done using morphological analyzers, such as Mecab (Kudo, 2006), or language-independent algorithms, such as SentencePiece (Kudo et al., 2018). As the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate given specific target applications.

To tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we are segmenting our texts into “subwords” using SentencePiece. These typically return smaller units and do not align with “word” segmentations returned by Mecab.
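Why subwords reduce OOV can be illustrated with a tiny, purely hypothetical greedy segmenter. This is not SentencePiece's actual algorithm (SentencePiece learns its units from data); it is only a sketch of the idea, with made-up toy vocabularies.

```python
# Hypothetical toy vocabularies (not real SentencePiece output).
word_vocab = {"東京", "大学", "言語"}
subword_vocab = {"東", "京", "大", "学", "言", "語", "学者"}

def word_lookup(token):
    # Whole-word model: anything unseen collapses to a single <UNK>.
    return [token] if token in word_vocab else ["<UNK>"]

def subword_segment(token):
    # Greedy longest-match segmentation into known subword units.
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in subword_vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            pieces.append("<UNK>")
            i += 1
    return pieces

print(word_lookup("言語学者"))      # ['<UNK>'] -- the unseen compound is lost
print(subword_segment("言語学者"))  # ['言', '語', '学者'] -- still covered
```

The unseen compound "言語学者" (linguist) is unrepresentable for the word-level vocabulary but decomposes cleanly into known subword units.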

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different words types by human evaluators.

However, preliminary vocabulary comparisons showed that because of differences in the segmentation, there was little to no overlap between the words in our word embeddings and those in the data set.  For instance, the largest common group occurred in nouns: only 50 out of 1000 total noun comparison pairs.

So instead, we propose a naïve approach to compare the two word embeddings using a Synonym Vector Mapping Approach.

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words. 
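The core of such a mapping can be sketched in a few lines of numpy: for a word from the summary, find the nearest full-text word by cosine similarity. The vectors and Japanese words below are toy, hand-made examples; in practice the vectors would come from the trained FastText models.

```python
import numpy as np

# Toy embeddings: word -> vector (real vectors would be 300-dimensional).
full_text_vocab = {
    "車": np.array([0.9, 0.1, 0.0]),    # "car"
    "速度": np.array([0.1, 0.9, 0.2]),  # "speed"
}
summary_vocab = {
    "自動車": np.array([0.88, 0.15, 0.05]),  # "automobile", near-synonym of 車
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def map_synonym(vec, vocab):
    # Map a summary-word vector to its nearest full-text word.
    return max(vocab, key=lambda w: cosine(vec, vocab[w]))

print(map_synonym(summary_vocab["自動車"], full_text_vocab))  # 車
```

An embedding that reliably places such paraphrase pairs close together would score well on this task, which is exactly the ability we want to compare between Skip-gram and CBOW.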

Pre-processing Data

In Data Science, before building a predictive model from a particular data set, it is important to explore and pre-process the data. In this blog, we will illustrate some typical steps in data pre-processing.

In this particular exercise, we will build a simple Decision Tree model to classify a food's cuisine from its list of ingredients. The data for this exercise can be taken from:

From this exercise, we will show the importance of data pre-processing. This blog is organized as follows:

  1. Data Exploration and Pre-processing.
  2. Imbalanced Data.

1.  Data Exploration and Pre-processing

When you are given a set of data, it is important to explore and analyze them before constructing a predictive model. Let us first explore this data set.

From the first 10 items of this data set, we observe that, even within a particular cuisine, the lists of ingredients may differ.

From this data set, we can see that there are 20 different cuisines and that the distribution of recipes is not uniform. For example, recipes from ‘Italian’ cuisine make up 19.7% of the data set, while only 1.17% of the recipes come from ‘Brazilian’ cuisine.

Screenshot of the recipes data set
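A quick way to inspect this imbalance is pandas' `value_counts`. The tiny DataFrame below is a toy stand-in for the real recipes data set (whose actual proportions are the 19.7% and 1.17% quoted above).

```python
import pandas as pd

# Toy stand-in for the recipes data set (the real one has 20 cuisines).
df = pd.DataFrame({
    "cuisine": ["italian"] * 6 + ["mexican"] * 3 + ["brazilian"] * 1,
    "ingredients": [["salt", "tomato"]] * 10,
})

# Share of recipes per cuisine, as percentages.
dist = df["cuisine"].value_counts(normalize=True) * 100
print(dist.round(1).to_dict())  # {'italian': 60.0, 'mexican': 30.0, 'brazilian': 10.0}
```

A distribution this skewed is what motivates the "Imbalanced Data" discussion later on.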

Now, let us explore this data set further by looking at the top 15 ingredients.

top 15 ingredients

If we look at the top 15 ingredients, we see that they include “salt”, “water”, “sugar”, etc. These are all generic and can be found in every cuisine. Intuitively, if we remove these ingredients from the classification model, the accuracy of the classification should not be affected.

In the classification model, we would prefer the recipes in each cuisine to have ingredients unique to that country. This helps the model easily identify which cuisine a recipe comes from.

After removing all the generic ingredients (salt, water, sugar, etc.) from the data set, we look at the top 15 ingredients again.

top 15 ingredients

It looks like we could remove more ingredients, but the decision of which ones to remove is properly left to someone with a bit more domain knowledge in cooking. For example, some countries may use ‘onion’ in their recipes, while others use ‘red onion’. So it is better not to filter out too many generic ingredients.
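The filtering step itself is straightforward. Below is a minimal sketch with toy recipes and a hand-picked stop list; as noted above, the real stop list is a judgment call best made with cooking domain knowledge.

```python
from collections import Counter

# Toy recipes; the real data set has thousands across 20 cuisines.
recipes = [
    ["salt", "water", "basil", "tomato"],
    ["salt", "sugar", "tortilla", "beans"],
    ["salt", "basil", "olive oil"],
]

# Hand-picked stop list of generic ingredients (a judgment call).
generic = {"salt", "water", "sugar"}

filtered = [[ing for ing in r if ing not in generic] for r in recipes]

# Recount the most common ingredients after filtering.
top = Counter(ing for r in filtered for ing in r).most_common(3)
print(top)  # [('basil', 2), ('tomato', 1), ('tortilla', 1)]
```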

Now, we look at the distribution of ingredients in each recipe in the data set.


Some recipes have only 1 or 2 ingredients, while some have up to 60. It is probably best to remove recipes with very few ingredients from the data set, as the number of ingredients may not be representative enough for the classification model. What is the minimum number of ingredients required to classify a cuisine? The short answer is that no one knows. It is best to experiment: remove recipes with 1, 2, 3, etc. ingredients, re-train the model, and compare the accuracy to decide which threshold works best for your model.
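That threshold experiment can be set up with a small helper like the one below (toy data; in a real run you would re-train and score the model inside the loop).

```python
# Toy recipes with varying ingredient counts.
recipes = [
    {"cuisine": "italian", "ingredients": ["basil", "tomato", "olive oil", "garlic"]},
    {"cuisine": "mexican", "ingredients": ["tortilla"]},
    {"cuisine": "japanese", "ingredients": ["rice", "nori", "soy sauce"]},
]

def filter_short_recipes(data, min_ingredients):
    # Drop recipes whose ingredient list is too short to be informative.
    return [r for r in data if len(r["ingredients"]) >= min_ingredients]

# Try several thresholds; in practice, re-train the model on each
# filtered set and keep the threshold with the best accuracy.
for k in (1, 2, 3):
    kept = filter_short_recipes(recipes, k)
    print(k, len(kept))
```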

The ingredients in the recipes are all words, so for further pre-processing we will need some NLP (Natural Language Processing).
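Putting the pieces together, the Decision Tree itself can be sketched with scikit-learn. A bag-of-ingredients encoding via `MultiLabelBinarizer` is one simple choice for turning each ingredient list into a 0/1 feature vector; the training data below is a toy stand-in for the real data set.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data standing in for the real recipes data set.
ingredients = [
    ["basil", "tomato", "olive oil"],
    ["tortilla", "beans", "salsa"],
    ["rice", "nori", "soy sauce"],
    ["tomato", "mozzarella", "basil"],
]
cuisines = ["italian", "mexican", "japanese", "italian"]

# One-hot "bag of ingredients" features.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(ingredients)

clf = DecisionTreeClassifier(random_state=0).fit(X, cuisines)

# Predict the cuisine of an unseen recipe.
new_recipe = mlb.transform([["basil", "tomato"]])
print(clf.predict(new_recipe))
```

With the pre-processing steps above (generic-ingredient removal, short-recipe filtering) applied first, the same pipeline runs unchanged on the cleaned data.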


In NLP, encoding text is at the heart of understanding language. There are many word-embedding implementations such as GloVe, Word2vec, and fastText. However, these embeddings work at the word level and may not perform well when we want to encode sentences or, in general, anything longer than one word. In this post, we introduce one of the state-of-the-art models for this task: the Universal Sentence Encoder.


The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors (embedding vectors, or just embeddings). These vectors are supposed to capture the textual semantics. But why do we even need them?

A vector is an array of numbers of a particular dimension. With vectors in hand, it’s much easier for computers to work on textual data. For example, we can tell whether two data points are similar just by calculating the distance between their embedding vectors.
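Cosine similarity is one common such distance. A minimal numpy sketch, with tiny made-up 4-dimensional "embeddings" (real USE vectors are 512-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors: "cat" and "kitten" point in similar directions, "car" doesn't.
cat = np.array([0.8, 0.1, 0.0, 0.1])
kitten = np.array([0.75, 0.2, 0.05, 0.1])
car = np.array([0.0, 0.1, 0.9, 0.2])

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```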



The embedding vectors can in turn be used for other downstream NLP tasks such as text classification, semantic similarity, and clustering.

2. USE architecture

USE comes in two variations, with the main difference residing in the encoder part. One is equipped with the encoder branch of the famous Transformer architecture; the other uses a Deep Averaging Network (DAN).

2.1 Transformer encoder

The Transformer architecture is designed to handle sequential data, but not in order like RNN-based architectures. It uses the attention mechanism to compute context-aware representations of the words in a sentence, taking into account both the ordering and the significance of all the other words. The encoder takes a lowercased PTB-tokenized string as input and outputs a fixed-length encoding vector for each sentence by computing the element-wise sum of the representations at each word position. Because of this design, the Transformer allows for much more parallelization than RNNs and therefore reduced training times.

Universal Sentence Encoder uses only the encoder branch of Transformer to take advantage of its strong embedding capacity.


2.2 Deep Averaging Network (DAN):

DAN is a simple neural network that takes the average of the embeddings of words and bi-grams and then passes the “combined” vector through a feedforward deep neural network (DNN) to produce the sentence embedding. Like the Transformer encoder, DAN takes a lowercased PTB-tokenized string as input and outputs a 512-dimensional sentence embedding.
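The DAN idea is simple enough to sketch in numpy. The vocabulary, embedding table, and weights below are toy and random (the real model's weights are learned); only the two-step structure (average, then feedforward) and the 512-dimensional output mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM, HIDDEN_DIM, OUT_DIM = 16, 64, 512

# Toy word-embedding table and feedforward weights (random, untrained).
vocab = {"the": 0, "quick": 1, "fox": 2}
embeddings = rng.normal(size=(len(vocab), EMBED_DIM))
W1 = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))
W2 = rng.normal(size=(HIDDEN_DIM, OUT_DIM))

def dan_encode(tokens):
    # Step 1: average the word embeddings into one "combined" vector.
    avg = embeddings[[vocab[t] for t in tokens]].mean(axis=0)
    # Step 2: pass the average through a feedforward network.
    hidden = np.tanh(avg @ W1)
    return np.tanh(hidden @ W2)

sentence_vec = dan_encode(["the", "quick", "fox"])
print(sentence_vec.shape)  # (512,)
```

Because the cost is dominated by one averaging pass and two matrix multiplies, the encoding time grows only linearly with sentence length, which is the source of DAN's speed advantage discussed next.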


The two variants trade off accuracy against computational cost. The one with the Transformer encoder has higher accuracy but is computationally more intensive; the one with the DAN encoder is computationally less expensive, at the cost of slightly lower accuracy.

3. How was it trained?

The key idea for training this model is to make the model work for generic tasks such as:

  • Modified Skip-thought
  • Conversational input-response prediction
  • Natural language inference.

3.1 Modified skip-thought:

Given a sentence, the model needs to predict the sentences around it.