6. Text encoding with Transformer NNs
Explanations, formulas, visualisations:
Encoder-decoder framework (similar to earlier seq2seq)
- One huge neural network
- The encoder part provides the representation of the input
- The decoder part generates a new sequence, given the representation of the input (see the code sketch below)
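A minimal code sketch of the encoder-decoder framework, assuming PyTorch's nn.Transformer; all sizes (vocabulary, d_model, sequence lengths) are toy assumptions, not the actual lecture model:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 64, 1000
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randint(0, vocab_size, (1, 7))  # input sequence: 7 subword ids
tgt = torch.randint(0, vocab_size, (1, 5))  # output sequence generated so far

# The encoder builds a representation of the input; the decoder attends to it
# while producing a representation for each output position.
out = model(embed(src), embed(tgt))
print(out.shape)  # torch.Size([1, 5, 64])
```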
Similar to word2vec: Training with self-supervision
- Masked language modelling as a training goal (objective, task)
- Cross-entropy (which compares probability distributions) as the loss function (illustrated below)
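A toy illustration of the masked language modelling objective and the cross-entropy loss; the logits are random and the target id is arbitrary, the point is only to show what is compared with what:

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
# Suppose one subword of "the cat [MASK] on the mat" was hidden; the encoder
# output at that position is projected to a score for every vocabulary entry.
logits = torch.randn(1, vocab_size)   # predicted scores over the vocabulary
target = torch.tensor([421])          # id of the hidden subword (arbitrary here)

# Cross-entropy compares the predicted distribution with the true one-hot one;
# training lowers this loss by raising the probability of the correct subword.
loss = F.cross_entropy(logits, target)
print(loss.item())
```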
Difference from word2vec: Better, contextual, “dynamic” (sub)word vectors
- We represent sequences of symbols (subwords) rather than single words
- The result of text encoding with Transformers is a representation for each subword segment of the given sentence. This representation is dynamic because it depends on the whole sentence, as opposed to “static” representations (e.g. word2vec).
- With the self-attention mechanism we can extract more information from the context and select the most relevant context elements (see the example below).
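A sketch of this “dynamic” behaviour, assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint (both are assumptions, not part of the lecture material):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per subword
    return hidden[tokens.index(word)]

# The same word gets different vectors in different sentences,
# unlike a single static word2vec vector.
v1 = vector_for("she sat by the river bank", "bank")
v2 = vector_for("he deposited money at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```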
Reasons for the large number of parameters
- Instead of extracting one vector per word (as in word2vec), we store and reuse the whole network with all its weights
- Multi-head attention: the attention mechanism is repeated several times in parallel, each head with its own parameters (varied initialisations)
- Stacked encoder layers (self-attention + FFNNs): the whole encoding process is repeated several times to achieve good results (a rough parameter count is sketched below)
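A rough illustration of where the parameters go, using PyTorch's stock encoder modules with roughly BERT-base sizes (d_model=768, 12 heads, 12 layers); the exact counts are an assumption, not a figure from the lecture:

```python
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, batch_first=True)
stack = nn.TransformerEncoder(layer, num_layers=12)  # 12 stacked encoder layers

per_layer = sum(p.numel() for p in layer.parameters())
total = sum(p.numel() for p in stack.parameters())
print(per_layer)  # ~7 million: multi-head attention + feed-forward sublayers
print(total)      # ~85 million for the stack alone, before embeddings
```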
Subword tokenization: similar to previous seq2seq, different from word2vec
(More details in the following lectures)
- Control over the size of the vocabulary
- Dealing with unknown words (see the tokenization example below)
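A small example of subword tokenization, again assuming the Hugging Face bert-base-uncased tokenizer (a WordPiece vocabulary of about 30k subwords):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or unseen word is split into known subword pieces instead of
# becoming a single out-of-vocabulary token.
print(tok.tokenize("untranslatability"))
# e.g. ['un', '##tra', '##ns', '##lata', '##bility'] -- the exact split depends on the vocabulary
```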
Generalised attention: different from previous seq2seq
(More details in the following lectures)
- The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
- Generalised as self-attention, this mechanism finds the most relevant contexts for encoding the input.
- It increases parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of (sub)words that can be processed at once; the order of words can be disregarded.
- Positional encoding: an additional function needed to compensate for disregarding the order of words (see the sketch below)
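A minimal sketch of single-head scaled dot-product self-attention and sinusoidal positional encoding; the dimensions and random weights are toy assumptions:

```python
import math
import torch

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # relevance of every position to every other
    return weights @ V                       # context-weighted representation per position

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # order information, since attention itself
    pe[:, 1::2] = torch.cos(pos * div)   # ignores word positions
    return pe

d_model, seq_len = 16, 7
X = torch.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # torch.Size([7, 16]): one vector per position
```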