In the paper “Probing Neural Network Comprehension of Natural Language Arguments”, the authors argue that exploiting linguistic cues plays a major part in the models’ performance. They altered the evaluation dataset in a way that should make no difference to a system that actually follows the argument. They started by removing negations such as “not” and “cannot” while keeping the reasoning of the sentences intact. For example, “it is not raining, therefore, I can go for a run” was changed to “it is raining, therefore, I cannot go for a run”. As a result, the SOTA models struggled badly, performing almost as poorly as a random system.
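To make the idea of a “cue” more concrete, here is a minimal toy sketch. It is not the paper’s procedure or data: the sentences, the `cue_only_choice` function and the way the mirrored copies are built are all my own assumptions for illustration. The baseline never reads the argument; it just picks whichever option contains the word “not”. The mirrored copies simply flip which option is marked correct, standing in for whatever edit makes the other option the right one.

```python
def cue_only_choice(option_a, option_b, cue="not"):
    """Pick option 0 or 1 by looking only at whether the cue word is present."""
    a_has = cue in option_a.lower().split()
    b_has = cue in option_b.lower().split()
    if a_has and not b_has:
        return 0
    if b_has and not a_has:
        return 1
    return 0  # the cue gives no signal: fall back to a fixed guess


def accuracy(dataset):
    """Fraction of examples where the cue-only baseline picks the labeled option."""
    correct = sum(cue_only_choice(a, b) == label for a, b, label in dataset)
    return correct / len(dataset)


# Made-up toy data: (option A, option B, index of the correct option).
# The correct option always contains "not", so the cue is a perfect shortcut.
original = [
    ("rain does not stop runners", "rain stops runners", 0),
    ("the shop is open", "the shop is not open", 1),
    ("tickets are not refundable", "tickets are refundable", 0),
    ("the bridge is safe", "the bridge is not safe", 1),
]

# Mirrored copies: same options, opposite correct answer, so the cue now
# appears equally often with the right and the wrong option.
mirrored = [(a, b, 1 - label) for a, b, label in original]
balanced = original + mirrored

print("cue-only accuracy, original:", accuracy(original))
print("cue-only accuracy, balanced:", accuracy(balanced))
```

On this toy data the shortcut is perfect on the original set and no better than a coin flip once the cue is balanced, which is the kind of collapse described above.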
In another paper, “Right for the Wrong Reasons”, the authors hypothesized three syntactic heuristics (lexical overlap, subsequence, and constituent) that these models might use to score high on the NLI task. For details of the heuristics, please refer to the figure below.
Figure 3: The authors showed that syntactic overlap between the premise and the hypothesis significantly affects the NLP models’ predictions. Thomas McCoy and Ellie Pavlick, 2020.
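As a rough illustration of the simplest of these heuristics, lexical overlap, here is a small sketch. It is my own simplified version, not the authors’ code: the `tokens` and `lexical_overlap_predict` helpers and the example sentences are assumptions made for this post. The rule predicts “entailment” whenever every word of the hypothesis also appears in the premise, regardless of word order.

```python
import re


def tokens(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_overlap_predict(premise, hypothesis):
    """Predict 'entailment' if every hypothesis word also occurs in the premise."""
    premise_words = set(tokens(premise))
    hypothesis_words = tokens(hypothesis)
    if hypothesis_words and all(w in premise_words for w in hypothesis_words):
        return "entailment"
    return "non-entailment"


# Illustrative (premise, hypothesis, gold label) triples, written for this post.
examples = [
    ("The doctor was paid by the actor.", "The doctor paid the actor.", "non-entailment"),
    ("The actor paid the doctor and left.", "The actor paid the doctor.", "entailment"),
    ("The judge saw the lawyer.", "The lawyer saw the judge.", "non-entailment"),
]

for premise, hypothesis, gold in examples:
    predicted = lexical_overlap_predict(premise, hypothesis)
    print(f"{hypothesis!r}: predicted={predicted}, gold={gold}")
```

The heuristic gets the second pair right and the other two wrong, because it only checks which words are shared and ignores who did what to whom.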
The point is pretty clear here. BERT-like models seem to use statistical “cues” as a crude heuristic to get better results, which is far from the true “reasoning” or “inference” skills needed to understand human language.
And it is not just the flaws in the datasets; we need to be concerned about the way we train the models as well. Up to now, we have tended to assume that a model with more parameters, trained on more text data in an “optimal” way, has a better capability to “understand” language. In the paper “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”, Emily Bender and Alexander Koller investigated whether LMs such as GPT-3 or BERT can ever learn to “understand” language, no matter how much data we feed them.
They started by defining form and meaning. According to their paper, form is the identifiable, physical part of a language: the marks and symbols that represent the language, such as the symbols on a page or the pixels and bytes of a webpage. Meaning is how these elements relate to things in the external world.
It turns out that, up to now, we have not figured out how to feed knowledge of the outside world to the models effectively. It’s like someone trying to learn a new language by reading books without knowing how the content of the books connects to the world, like learning the word “apple” without ever having seen an actual apple. To a certain extent, these models lack common sense about the world. Without such knowledge, from a theoretical point of view, it’s impossible for any DNN model to truly understand language.
We may question the NLP models’ ability to understand language like a human being, but we cannot deny how significantly they have evolved in the past few years. If the datasets are not “hard” enough to train and properly evaluate the models, we can always construct more advanced ones. NA, SuperGLUE, or XTREME are a few examples. If the models lack common sense knowledge, a new type of architecture, such as a multi-modal one, can be devised and improved to handle such a new type of input. Even if they are “cheating” to gain more performance, that shows how far they have come. For me, understanding their limits is just a way to make them better, and the future of applying DNN models to NLP tasks remains very bright. 😉
P.S.: This post is by no means a comprehensive review of the topic 😊