Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has likewise emerged. The ability of embeddings to learn complex, underlying relationships between words is also their greatest caveat:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a more general sense and a good embedding for a specific downstream task. Although some methods for evaluation, such as word similarity/analogy tasks have been proposed, most remain somewhat controversial as to their validity as well as relevancy to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluations:  intrinsic where one typically employs specifically designed, human-moderated data sets and extrinsic whereby embeddings are tested on simpler, proxy NLP tasks to estimate their performance.

For both types, it is yet unclear to what extent good performance correlates with actual useful embeddings. In many, if not most state-of-the-art neural networks, the embeddings are trained alongside the model to tailor to the task at hand.

Here, we want to evaluate two different embeddings (Skipgram and CBOW)  trained on a Japanese text corpus (300K) with the aim of assessing which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al.,2018)

Embedding: Train 2 models: Skipgram and CBOW, vector size: 300, 40K vocabulary size using FastText (Athiwaratkun et al., 2018).

Japanese is a non-space separated language and needs to be segmented as part of the preprocessing. This can be done using morphological analyzers, such as Mecab (Kudo, 2006) or language-independent algorithms, such as SentencePiece (Kudo et al., 2018). As the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate given specific target applications.

In order to tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we are segmenting our texts into “subwords” using SentencePiece. These typically return smaller units and do not align with “word” segmentations returned by Mecab.

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different word types by human evaluators.

However, preliminary vocabulary comparisons showed that because of differences in segmentation, there was little to no overlap between the words in our word embeddings and those in the data set.  For instance, the largest common group occurred in nouns: only 50 out of 1000 total noun comparison pairs.

So instead we are going to propose a naïve approach to compare two word embeddings using a Synonym Vector Mapping Approach

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words. 

Bias in Data Science – the Good, the Bad and the Avoidable !?

In recent years, there have been a few prominent examples of accidental bias in machine-learning applications, such as smart phones’ beauty filters (that essentially ended up whitening skin) [1] or Microsoft’s from-innocent-teen-to-racist-in-24-hours chatbot [2,3]. Examples such as these obviously fell victim to inherently biased data being fed into algorithms too complex to allow for much transparency. Hidden bias continues to be an issue on ubiquitous social media platforms, such as Instagram, whose curators appear to profess themselves both regretful AND baffled [4]. Unfortunately, any model will somewhat regurgitate what it has been fed and interventions at this level of model complexity may prove tricky.  

Interestingly, bias itself does not need to be harmful and is often built into a model’s design on purpose, either to address only a subset of the overall population or to model a real-world state, for instance when predicting house prices from its size and number of bedrooms, the model’s bias parameter often represents the average house price in the data set. Thus, we need to distinguish between conscious and unconscious bias in data analysis. Additionally, there is the factor of intent, i.e. is the person conducting the analysis well-intentioned to follow good scientific method or trying to manipulate it to achieve a particular outcome.

In the following, I will only discuss aspects of unintentional and unconscious bias, meaning bias hidden from the researcher or data scientist introducing it. This is by no means an exhaustive discussion, but merely a highlight of some pervasive aspects:

A. Data availability bias

B. Coherent theory bias

C. Method availability/popularity bias

A. Data availability bias

A. The problem of scientists’ selecting their data out of convenience rather than suitability or representativeness for the current task has been around for a while [4], e.g. the ideal data set may not be available in machine-readable format or would require higher costs and more time for processing, in short, several obstacles to doing an analysis quickly. For instance, in the area of Natural Language Processing, the major European languages, like English, French and German etc. tend to receive more attention because both data and tools to analyze them are widely available. Similarly, psychology research has mostly focused on so-called WEIRD societies (White, Educated, Industrialized, Rich, Democratic) [5] and out of convenience often targets the even smaller population of “North American college students” that unsurprisingly have been found to not represent human populations at large.

B. Coherent theory bias

B. Various studies suggest that we as people strongly favour truths that fit into our pre-existent world view, and why would scientists be exempt from this? Thus, it appears when people analyze data they are often biased by their underlying beliefs about the outcome and are then less likely to yield to unexpected non-significant results [6]. This does not include scientists disregarding new evidence because of conflicting interests [7]. This phenomenon is commonly referred to as confirmation bias or, more fittingly, “myside” bias.

C. Method availability/popularity bias

C. There is a tendency of hailing new trendy algorithms as one-fits-all solutions for whatever task or application. The solution is presented before examining the problem and its actual requirements. While more complex models are often more powerful, this comes at the cost of interpretability, which in some cases is not advisable. Additionally, some methods, both simple and complex ones, enjoy popularity primarily because they come ready-to-use in one’s favourite programming language.

Going forward… 

We as data scientists should:

a. Select our data carefully with our objective in mind. Get to know our data and its limitations.

b. Be honest with ourselves about possible emotional investment in our analyses’ outcomes and resulting conflicts.

c. Examine the problem and its (theoretical) solutions BEFORE making any model design choices.


[1] https://www.theguardian.com/technology/2017/apr/25/faceapp-apologises-for-racist-filter-which-lightens-users-skintone(last accessed 21.10.2020)

 [2] https://www.bbc.com/news/technology-35902104(last accessed 21.10.2020)

 [3] https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation(last accessed 21.10.2020)

[4] https://www.theguardian.com/technology/2020/aug/09/instagrams-censorship-of-black-models-photo-shoot-reignites-claims-of-race-bias-nyome-nicholas-williams(last accessed 21.10.2020)

[5] Joseph Rudman (2003) Cherry Picking in Nontraditional Authorship Attribution Studies, CHANCE, 16:2,26-32, DOI: 10.1080/09332480.2003.10554845

[6] Henrich, Joseph; Heine, Steven J., and Norenzayan, Ara. The Weirdest People in the World? Behavioral and Brain Sciences, 33(2-3):61–83, 2010. doi: 10.1017/S0140525X0999152X.

[7] Hewitt CE, Mitchell N, Torgerson DJ. Heed the data when results are not significant. BMJ. 2008;336(7634):23-25. doi:10.1136/bmj.39379.359560.AD

[8] Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab