10. Cross-lingual transfer and multilingual NLP
Explanations and visualisations
Reasons for studying multilingual NLP
- societal and economic: people prefer to interact with technology in their native language or accent; the more varieties covered, the more users reached
- linguistic and machine learning: bigger challenges lead to better approaches, e.g. subword tokenisation
- cultural and normative: better representation of real-world knowledge
- cognitive: learn interlingual abstractions
Language vectors
- One-hot encodings are used as language IDs
- Typological feature values can be treated as vectors, but the raw features in existing databases need to be processed first; the two most common methods are conversion into binary values and interpolation (filling in missing values), see the sketch after this list
- Vectors learned from text samples are called language model (LM) vectors: typically a special token is appended to each sentence (the same token for all sentences of a given language), and this token is expected to end up holding a representation of that language
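A minimal sketch of the database-processing step, assuming made-up WALS-style categorical features: each feature is converted to one-hot (binary) values, and missing values are then filled in with the per-feature mean across languages, one simple form of interpolation.

```python
import numpy as np

# Hypothetical WALS-style categorical features for three languages;
# None marks a value that is missing from the database.
features = {
    "eng": {"word_order": "SVO", "has_cases": "no"},
    "fin": {"word_order": "SVO", "has_cases": "yes"},
    "quy": {"word_order": None,  "has_cases": "yes"},
}
categories = {"word_order": ["SOV", "SVO", "VSO"], "has_cases": ["no", "yes"]}

def binarise(lang_feats):
    """One-hot encode each categorical feature; missing values become NaN."""
    vec = []
    for name, values in categories.items():
        observed = lang_feats[name]
        vec.extend(np.nan if observed is None else float(observed == v) for v in values)
    return np.array(vec)

matrix = np.vstack([binarise(f) for f in features.values()])

# "Interpolation": replace each missing value with that feature's mean over
# the languages that do have it.
filled = np.where(np.isnan(matrix), np.nanmean(matrix, axis=0), matrix)
print(filled)  # one feature vector per language, no gaps left
```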
Multilingual data sets
- Universal Dependencies (UD): 106 languages, 20 families; bias towards Eurasia recognised but not intended
- Bible 100: 103 languages, 30 families; majority non-Indo-European
- mBERT: 97 languages, 15 families; top 100 Wikipedias by size, plus Thai and Mongolian
- XTREME: 40 languages, 14 families; diversity
- XGLUE: 19 languages, 7 families
- XNLI: 15 languages, 7 families; spans families, includes low-resource languages
- XCOPA: 11 languages, 11 families; maximum diversity
- TyDiQA: 11 languages, 10 families; typological diversity
- XQuAD: 12 languages, 6 families; extension to new languages
Multilingual pre-trained models
BERT-type
- mBERT was the first, trained on the top 100 Wikipedia languages plus a few arbitrary ones
- XLM-R, a RoBERTa-based model, is currently the most popular starting point for multilingual experiments (see the loading sketch below)
GPT-type
Full Transformers
Other pre-trained models are typically trained for a single language or a group of languages (e.g. Indic BERT, AraBERT, BERTić)
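As an illustration, a minimal sketch of loading the two checkpoints mentioned above with the Hugging Face transformers library (assuming transformers and PyTorch are installed; the checkpoint names are the standard Hub identifiers for mBERT and XLM-R base):

```python
from transformers import AutoModel, AutoTokenizer

sentence = "Hyvää huomenta!"  # any sentence in one of the covered languages

for checkpoint in ["bert-base-multilingual-cased", "xlm-roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    batch = tokenizer(sentence, return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # contextual subword embeddings
    print(checkpoint, tokenizer.tokenize(sentence), hidden.shape)
```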
Transfer across languages
- a pre-trained LLM needs to be selected for a given (language, task) pair
- there is a trade-off between the size of the training data and its closeness to the target language
- often English-only BERT base works best even if the target language is very distant
- an interesting example is the Helsinki team's solution to the AmericasNLP task of translating from Spanish into low-resource languages: for each Spanish-Target pair, train on English-Spanish for 90% of the training time, then continue on Spanish-Target for the remaining 10%; this gave the best results for all target languages (see the sketch after this list)
- the pre-trained model can be a bare LM or one trained for a specific task
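A hypothetical sketch of that two-stage schedule (not the Helsinki team's actual code; the batch streams and the train_step callback are placeholders): roughly 90% of the update budget goes to the high-resource English-Spanish pair, the rest to the Spanish-Target pair.

```python
import itertools

def two_stage_training(total_steps, eng_spa_batches, spa_tgt_batches,
                       train_step, transfer_fraction=0.9):
    """Train mostly on the high-resource pair, then continue on the target pair."""
    switch_at = int(total_steps * transfer_fraction)
    for step in range(total_steps):
        stream = eng_spa_batches if step < switch_at else spa_tgt_batches
        train_step(next(stream))  # one optimiser update on the chosen pair

# Usage with dummy data streams and a no-op training step:
eng_spa = itertools.cycle([("an example", "un ejemplo")])
spa_tgt = itertools.cycle([("un ejemplo", "<target-language sentence>")])
two_stage_training(10, eng_spa, spa_tgt, train_step=lambda batch: None)
```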
Language similarity and sampling
- Are the languages included in data sets and models representative of all structural types?
- How to get a representative sample? How to compare languages? (one simple comparison is sketched below)
- What are good transfer-target pairs? Why?
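One simple way to approach the last two questions is to rank candidate transfer languages by the similarity of their language vectors (typological or learned); a sketch with made-up vectors and cosine similarity:

```python
import numpy as np

# Made-up language vectors (e.g. the filled typological vectors from above,
# or LM-learned language embeddings).
vectors = {
    "eng": np.array([0.0, 1.0, 0.0, 1.0, 0.0]),
    "fin": np.array([0.0, 1.0, 0.0, 0.0, 1.0]),
    "quy": np.array([1.0, 0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "quy"
ranking = sorted(((cosine(vectors[target], v), lang)
                  for lang, v in vectors.items() if lang != target), reverse=True)
print(ranking)  # candidate transfer languages, most similar first
```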