Explanations, visualisations and formulas:
- Jurafsky-Martin 6
- Lena Voita: Word embeddings
- Jay Alammar: The Illustrated Word2vec
- Xin Rong: word2vec Parameter Learning Explained
In one-hot encodings no meaning is represented: all words are equally distant from each other, each word is one dimension in a high-dimensional space, and each vector is orthogonal to all the others.
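A minimal sketch of this with a made-up three-word vocabulary: every pair of distinct one-hot vectors has dot product 0 and exactly the same distance, so the geometry tells us nothing about meaning.

```python
# Toy illustration (hypothetical 3-word vocabulary): one-hot vectors
# carry no similarity information - every pair of distinct words is
# orthogonal and equally far apart.
import numpy as np

vocab = ["cat", "dog", "car"]
one_hot = np.eye(len(vocab))           # one dimension per word

for i, w1 in enumerate(vocab):
    for j, w2 in enumerate(vocab):
        if i < j:
            dot = one_hot[i] @ one_hot[j]                    # always 0.0
            dist = np.linalg.norm(one_hot[i] - one_hot[j])   # always sqrt(2)
            print(f"{w1} vs {w2}: dot={dot:.1f}, distance={dist:.2f}")
```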
When we embed words in a space, each word becomes a data point in that space. The number of dimensions can be much lower, since we no longer need one dimension per word, and words with similar meanings end up positioned close to each other in the embedding space.
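By contrast, here is a toy sketch with hand-picked (not learned) 2-dimensional vectors, just to show what "close in the embedding space" means in terms of cosine similarity.

```python
# Hand-picked 2-d embeddings (made-up numbers, not learned) to show that
# in a dense space similar words can sit close together while unrelated
# words sit far apart.
import numpy as np

emb = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([0.85, 0.75]),
    "car": np.array([-0.7, 0.6]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))   # close to 1: similar meaning
print(cosine(emb["cat"], emb["car"]))   # much lower: dissimilar
```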
We regard all words that appear in a text from two sides. On one side, each word carries its own distinct meaning, which we want to represent; when we look at a word from this side, we call it a target. On the other side, each word is also a feature, a component of the meaning of other words; when a word describes other words, we call it a context. Which role a word plays depends on where we are currently looking.
Important: Each word has a target and a context representation.
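A small sketch of the two roles, assuming a sliding window of size 2 (the window size is an arbitrary choice here): the same word shows up once as the target and again as the context of its neighbours.

```python
# Extract (target, context) pairs with a sliding window; the same word
# appears in both roles depending on which position we are looking at.
sentence = "the cat sat on the mat".split()
window = 2  # hypothetical context window size

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))  # (target, context)

print(pairs[:5])
# 'cat' is the target in ('cat', 'the'), ('cat', 'sat'), ...
# and later the context in ('sat', 'cat'), ('on', 'cat'), ...
```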
Used before NNs became the main paradigm in NLP. Continuous representations of words are learnt in three steps:
word2vec is a neural language model. We calculate the low-dimensional (dense) vectors directly from text data by training a minimal feed-forward neural network with a single hidden layer. The task of the network is to predict one word given another word. More precisely, the network can be trained in two ways: to predict a context word given the target word (skip-gram), or to predict the target word given its context words (CBOW).
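Below is a minimal sketch of the skip-gram direction with a plain softmax output; the vocabulary size, embedding dimensionality and random initialisation are placeholders. The "hidden layer" is just a row lookup in the target matrix, and the output layer scores every word in the vocabulary.

```python
# Sketch of skip-gram with softmax: predict a context word from a target word.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                    # toy vocabulary and embedding size
W_target = rng.normal(scale=0.1, size=(V, d))   # target (input) embeddings
W_context = rng.normal(scale=0.1, size=(d, V))  # context (output) embeddings

def forward(target_id):
    h = W_target[target_id]        # "hidden layer" = embedding lookup
    scores = h @ W_context         # one score per word in the vocabulary
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()         # p(c | t) via softmax

probs = forward(target_id=3)
print(probs.shape, probs.sum())    # (10,) and a total of ~1.0
```

Training would nudge W_target and W_context so that observed (target, context) pairs get high probability; the optimisation loop is omitted here.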
Important: The set of labels is huge: each word is a label!
In the negative-sampling version of skip-gram, we replace the objective p(c|t) with p(yes|t,c). So, instead of predicting the context word, we only decide whether the target-context relation holds between the two words. We go from thousands of labels, one per vocabulary word, to only 2. But for this to work, we need examples of both cases, yes and no: we get the yes examples from the training text, and we randomly sample the no examples.
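A minimal sketch of this binary reformulation, with made-up indices: the positive pair is assumed to come from the training text, the negative pairs are drawn at random, and a sigmoid over the dot product plays the role of p(yes|t,c). The number of negatives per positive (k) is a hyperparameter; 2 is an arbitrary choice here.

```python
# Positive pairs come from the text, negative pairs are sampled at random;
# a sigmoid over the dot product gives p(yes | t, c).
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10, 4, 2
W_target = rng.normal(scale=0.1, size=(V, d))
W_context = rng.normal(scale=0.1, size=(V, d))

def p_yes(t, c):
    return 1.0 / (1.0 + np.exp(-(W_target[t] @ W_context[c])))

positive = (3, 7)                                              # observed in the text
negatives = [(3, int(c)) for c in rng.integers(0, V, size=k)]  # random "no" pairs

print("yes pair:", p_yes(*positive))
for t, c in negatives:
    print("no pair: ", p_yes(t, c))
```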
The purpose of statistical LMs was to ensure the right order of words, that is, the grammar of a language. Neural LMs were initially trained for the same reason, but the fact that they learn the order of words via internal continuous representations made them a good tool for learning the meanings of words. The weight matrices that a neural LM (such as word2vec) learns by predicting the target or the context words contain information about word meaning. We can store these weights and then use them as features for all kinds of classification tasks. This is why we say that the weights, our representations, are the meanings of words!
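A short sketch of this reuse, with placeholder weights standing in for a trained matrix: each word's row is looked up in the stored target matrix, and the rows are averaged into one fixed-size feature vector that any downstream classifier could consume.

```python
# Reusing stored embedding weights as features for a downstream task.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
W_target = rng.normal(size=(len(vocab), 4))   # stands in for trained weights

def sentence_features(tokens):
    vecs = [W_target[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0)              # one dense feature vector

features = sentence_features("the cat sat".split())
print(features.shape)                          # (4,) - ready for any classifier
```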