Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has likewise emerged. The ability of embeddings to learn complex, underlying relationships between words is also their greatest caveat:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a more general sense and a good embedding for a specific downstream task. Although some methods for evaluation, such as word similarity/analogy tasks have been proposed, most remain somewhat controversial as to their validity as well as relevancy to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluations:  intrinsic where one typically employs specifically designed, human-moderated data sets, and extrinsic whereby embeddings are tested on simpler, proxy NLP tasks to estimate their performance.

For both types, it is yet unclear to what extent good performance correlates with actual useful embeddings. In many, if not most state-of-the-art neural networks, the embeddings are trained alongside the model to tailor to the task at hand.

Here, we want to evaluate two different embeddings (Skip-gram and CBOW)  trained on a Japanese text corpus (300K) to assess which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al.,2018)

Embedding: Train 2 models: Skip-gram and CBOW, vector size: 300, 40K vocabulary size using FastText (Athiwaratkun et al., 2018).

Japanese is a non-space separated language and needs to be segmented as part of the preprocessing. This can be done using morphological analyzers, such as Mecab (Kudo, 2006), or language-independent algorithms, such as SentencePiece (Kudo et al., 2018). As the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate given specific target applications.

To tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we are segmenting our texts into “subwords” using SentencePiece. These typically return smaller units and do not align with “word” segmentations returned by Mecab.

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different words types by human evaluators.

However, preliminary vocabulary comparisons showed that because of differences in the segmentation, there was little to no overlap between the words in our word embeddings and those in the data set.  For instance, the largest common group occurred in nouns: only 50 out of 1000 total noun comparison pairs.

So instead we are going to propose a naïve approach to compare two-word embeddings using a Synonym Vector Mapping Approach

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words.