Word Embeddings are one of the most interesting aspects of the Natural Language Processing field. When I first came across them, it was intriguing to see a simple recipe of unsupervised training on a bunch of text yield representations that show signs of syntactic and semantic understanding.
In this post, we will explore a word embedding algorithm called "FastText" that was introduced by Bojanowski et al. and understand how it enhances the Word2Vec algorithm from 2013.
Suppose we have the following words and we want to represent them as vectors so that they can be used in Machine Learning models.
Ronaldo, Messi, Dicaprio
A simple idea could be to perform a one-hot encoding of the words, where each word gets a unique position.
We can see that this sparse representation doesn't capture any relationship between the words and every word is isolated from each other.
Maybe we could do something better. We know Ronaldo and Messi are footballers while Dicaprio is an actor. Let's use our world knowledge and create manual features to represent the words better.
This is better than the previous one-hot-encoding because related items are closer in space.
But manually doing this for every possible word is not scalable. If we designed features based on our world knowledge of the relationship between words, can we replicate the same with a neural network?
Can we have neural networks comb through a large corpus of text and generate word representations automatically?
This is the intention behind the research in word-embedding algorithms.
In 2013, Mikolov et al. introduced an efficient method to learn vector representations of words from large amounts of unstructured text data. The paper was an execution of this idea from Distributional Semantics.
You shall know a word by the company it keeps - J.R. Firth 1957
Since similar words appear in a similar context, Mikolov et al. used this insight to formulate two tasks for representation learning.
The first was called "Continuous Bag of Words" where need to predict the center words given the neighbor words.
The second task was called "Skip-gram" where we need to predict the neighbor words given a center word.
While Word2Vec was a game-changer for NLP, we will see how there was still some room for improvement:
Out of Vocabulary(OOV) Words:
In Word2Vec, an embedding is created for each word. As such, it can't handle any words it has not encountered during its training.
For words with same radicals such as “eat” and “eaten”, Word2Vec doesn’t do any parameter sharing. Each word is learned uniquely based on the context it appears in. Thus, there is scope for utilizing the internal structure of the word to make the process more efficient.
To solve the above challenges, Bojanowski et al. proposed a new embedding method called FastText. Their key insight was to use the internal structure of a word to improve vector representations obtained from the skip-gram method.
The modification to the skip-gram method is applied as follows:
For a word, we generate character n-grams of length 3 to 6 present in it.
- We take a word and add angular brackets to denote the beginning and end of a word
Then, we generate character n-grams of length n. For example, for the word "eating", character n-grams of length 3 can be generated by sliding a window of 3 characters from the start of the angular bracket till the ending angular bracket is reached. Here, we shift the window one step each time.
Word Length(n) Character n-grams eating 3 eating 4 eating 5 eating 6
Since there can be huge number of unique n-grams, we apply hashing to bound the memory requirements. Instead of learning an embedding for each unique n-gram, we learn total B embeddings where B denotes the bucket size. The paper used a bucket of a size of 2 million.
Each character n-gram is hashed to an integer between 1 to B. Though this could result in collisions, it helps control the vocabulary size. The paper uses the FNV-1a variant of the Fowler-Noll-Vo hashing function to hash character sequences to integer values.
To understand the pre-training, let's take a simple toy example. We have a sentence with a center word "eating" and need to predict the context words "am" and "food".
We take dot product between the center word and the actual context words and apply sigmoid function to get a match score between 0 and 1.
FastText improves performance on syntactic word analogy tasks significantly for morphologically rich language like Czech and German.
word2vec-skipgram word2vec-cbow fasttext Czech 52.8 55.0 77.8 German 44.5 45.0 56.4 English 70.1 69.9 74.9 Italian 51.5 51.8 62.7
word2vec-skipgram word2vec-cbow fasttext Czech 25.7 27.6 27.5 German 66.5 66.8 62.3 English 78.5 78.2 77.8 Italian 52.3 54.7 52.3
FastText is 1.5 times slower to train than regular skipgram due to added overhead of n-grams.
Using sub-word information with character-ngrams has better performance than CBOW and skip-gram baselines on word-similarity task. Representing out-of-vocab words by summing their sub-words has better performance than assigning null vectors.
skipgram cbow fasttext(null OOV) fasttext(char-ngrams for OOV) Arabic WS353 51 52 54 55 GUR350 61 62 64 70 German GUR65 78 78 81 81 ZG222 35 38 41 44 English RW 43 43 46 47 WS353 72 73 71 71 Spanish WS353 57 58 58 59 French RG65 70 69 75 75 Romanian WS353 48 52 51 54 Russian HJ 69 60 60 66
- Piotr Bojanowski et al., "Enriching Word Vectors with Subword Information"
- Armand Joulin et al., "Bag of Tricks for Efficient Text Classification"
- Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"
If you enjoyed this blog post, feel free to connect with me on Twitter where I share new blog posts every week.