11. Cross-lingual transfer and multilingual NLP
Explanations and visualisations



Cross-lingual transfer
- In principle, we can pre-train a model on one language and use it to process texts in another language
- Most often, we pre-train models on multiple languages and then use this multilingual model to process texts in any given single language
- Target language is the language in which we want to perform a task (process texts)
- Transfer language is the language in which we have a lot of labelled data: we fine-tune a pre-trained multilingual model on the transfer language and then apply it, often zero-shot, to a different target language
- Double transfer: when we transfer a model across languages and fine-tune it for a given task, there are two transfer steps, one across languages and one across tasks
- (Continued) training: if we have some unlabelled data in the target language, we can continue training the model with the pre-training objective before fine-tuning it for a given task
- Zero-shot: we attempt to perform the task in the target language without fine-tuning or continued training in that language (see the sketch after this list)
- The pre-trained model can be a bare LM or a model already trained/fine-tuned for a specific task
- An interesting example is the Helsinki team's solution to the AmericasNLP task of translating from Spanish into low-resource languages: for each Spanish-TARGET pair, train on Spanish-English for 90% of the training time, then continue on Spanish-TARGET for the remaining 10%; this gave the best results for all TARGET languages
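As a concrete illustration of double transfer and zero-shot evaluation, here is a minimal sketch assuming the Hugging Face transformers and datasets libraries, XLM-R as the pre-trained multilingual model, and the XNLI data set with English as the transfer language and Swahili as the target language; the identifiers, split sizes and hyperparameters are illustrative and version-dependent.

```python
# Sketch of zero-shot cross-lingual transfer (double transfer):
# fine-tune a multilingual model on the transfer language, evaluate on the target language.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"            # pre-trained multilingual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    # XNLI examples are premise/hypothesis pairs with a three-way label
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

train = load_dataset("xnli", "en", split="train[:2000]").map(encode, batched=True)  # transfer language
test = load_dataset("xnli", "sw", split="test").map(encode, batched=True)           # target language

args = TrainingArguments(output_dir="xnli-en-to-sw",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer)
trainer.train()                      # transfer across tasks: fine-tune on English NLI only
print(trainer.evaluate(test))        # transfer across languages: zero-shot evaluation on Swahili
```

Adding a compute_metrics function to the Trainer would report accuracy instead of only the evaluation loss; continued training would insert a masked-language-modelling step on unlabelled Swahili text before the fine-tuning step.
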
Labelled vs. unlabelled data

Source https://aclanthology.org/2020.acl-main.560.pdf
Examples of unlabelled multilingual data sets:
- Bible 100: 103 languages, 30 families, majority non-Indo-European
- mBERT (Wikipedia): 97 languages, 15 families, the 100 largest Wikipedias plus Thai and Mongolian
Examples of parallel data sets (for machine translation), which are considered labelled:
- OPUS, 744 languages
- FLORES+ (by NLLB), 222 languages
Examples of labelled multilingual data sets:
- Text parsing tasks: Universal Dependencies (UD) v2.17 has 186 languages
- End-user tasks:
- XTREME, 40 languages (used for evaluating XLM-R)
- XGLUE, 19 languages
- XNLI, 15 languages
- XCOPA, 11 languages
- TyDiQA, 11 languages
- XQuAD, 12 languages
- MMLU-ProX, 29 languages
- [Okapi](https://github.com/microsoft/Okapi), 26 languages (actually 27)
- BenchMAX, 17 languages
Many multilingual data sets are created from a selection of data taken from Common Crawl.
Examples of multilingual pre-trained models
- BERT-type
- mBERT was the first, trained on the 100 largest Wikipedias plus a few additional languages
- XLM-R, a RoBERTa-based model, currently the most popular starting point for multilingual experiments (see the sketch below)
- GPT-type: BLOOMZ, Falcon, Phi, Llama, Gemma, Mistral, SILMA, Qwen, Apertus, OLMo
- Full encoder-decoder Transformers: mT5
- Dedicated machine-translation encoder-decoder models: NLLB
Other pre-trained models are typically trained for a single language or a group of languages (e.g. Indic BERT, AraBERT, BERTić)
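A quick way to try the encoder-type checkpoints above is a fill-mask query, sketched below with the commonly used Hugging Face hub identifiers for mBERT and XLM-R; the model names and the Croatian prompt are illustrative assumptions.

```python
# Both models share one subword vocabulary across all their pre-training languages,
# so the same checkpoint handles prompts in different languages.
from transformers import pipeline

mbert = pipeline("fill-mask", model="bert-base-multilingual-cased")   # mBERT, uses [MASK]
xlmr = pipeline("fill-mask", model="xlm-roberta-base")                # XLM-R, uses <mask>

print(mbert("Paris is the capital of [MASK].")[0]["token_str"])
print(xlmr("Pariz je glavni grad <mask>.")[0]["token_str"])           # the same query in Croatian
```
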
Multilingual tokenisation and script issues

Source https://tiktokenizer.vercel.app
- Multilingual tokenisers are trained over all languages in the data set; the resulting vocabulary is shared across them
- Since data sizes differ greatly across languages, the shared vocabulary is mostly filled with frequent words from languages with a lot of data (above all English)
- The structure of a language also impacts the tokenisation: if words are shorter, they are more frequent and have a better chance of being included in the multilingual vocabulary
- The script has an impact too: non-Latin scripts are rare in the data, so words written in non-Latin scripts are overall rare and have little chance of being included in the multilingual vocabulary
- Oversegmentation happens when the shared vocabulary does not contain longer substrings of an input word, so the tokeniser segments the word into many very short substrings
- Fertility measures into how many subwords a tokeniser splits words on average; it is higher when words are long, rare, or written in a non-Latin script
- Parity measures how many (more) tokens a sentence in a given language has compared to its English equivalent
- Subsampling is a technique for balancing the frequency of items in a multilingual data set: for example, we try to learn the tokenisation vocabulary from roughly the same amount of data for every language (see the sketch after this list)
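A minimal sketch of fertility, parity and subsampling, assuming the Hugging Face tokenizer for XLM-R; the example sentences, their translations and the smoothing exponent are illustrative.

```python
# Fertility and parity under a shared multilingual vocabulary, plus exponentially
# smoothed sampling probabilities for balancing languages when training a tokeniser.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

def fertility(sentence: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = sentence.split()
    return sum(len(tok.tokenize(w)) for w in words) / len(words)

def parity(sentence: str, english_equivalent: str) -> float:
    """Token count of a sentence relative to its English equivalent."""
    return len(tok.tokenize(sentence)) / len(tok.tokenize(english_equivalent))

def smoothed_sampling_probs(corpus_sizes, alpha=0.3):
    """Subsampling: p_i proportional to n_i**alpha up-samples low-resource languages."""
    weights = [n ** alpha for n in corpus_sizes]
    return [w / sum(weights) for w in weights]

en = "The cat sat on the mat."
fi = "Kissa istui matolla."              # Finnish: longer, morphologically rich words
ru = "Кошка сидела на коврике."          # Russian: non-Latin (Cyrillic) script

for name, sent in [("en", en), ("fi", fi), ("ru", ru)]:
    print(name, "fertility:", round(fertility(sent), 2),
          "parity vs. en:", round(parity(sent, en), 2))

print(smoothed_sampling_probs([1_000_000, 10_000, 1_000]))  # sampling weights for 3 corpora
```
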
Language similarity and transfer

- If the target language is seen in pre-training, the performance will be better
- There is a trade-off between the size of the transfer-language training data and its closeness to the target language
- It is not easy to predict which transfer-target pairs will work well
- Often English-only BERT base works best even if the target language is very distant from English
Language features
- To measure distances between languages, we represent each language as a vector
- One-hot encodings serve only as language IDs (they carry no information about similarity between languages)
- Three types of features:
- Metadata: number of speakers, geographical location, resource availability
- Typology: phylogenetic classification (which language family) and structural descriptions from typological databases such as WALS, Glottolog, URIEL (derived from WALS, Glottolog and some other sources) and Grambank; the features in the existing databases need to be processed (converted into binary features)
- Text-based: e.g. character-level type-token ratio, word-level type-token ratio, average distance between space characters, median word length, text entropy, vectors learned from text samples (using a special language token); see the sketch after this list
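A minimal sketch of a few text-based features and of a cosine distance between the resulting language vectors; the tiny corpus snippets and the feature selection are illustrative, and real experiments would use much larger text samples.

```python
# Represent each language by a small vector of surface statistics computed from raw text,
# then compare languages with a cosine distance between these vectors.
import math
from collections import Counter

def text_features(text: str) -> list[float]:
    words = text.split()
    chars = text.replace(" ", "")
    word_ttr = len(set(words)) / len(words)                  # word-level type-token ratio
    char_ttr = len(set(chars)) / len(chars)                  # character-level type-token ratio
    mean_word_len = sum(len(w) for w in words) / len(words)  # average distance between spaces
    counts = Counter(chars)
    entropy = -sum(c / len(chars) * math.log2(c / len(chars)) for c in counts.values())
    return [word_ttr, char_ttr, mean_word_len, entropy]

def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms

en = "the cat sat on the mat and the dog sat next to the cat"
fi = "kissa istui matolla ja koira istui kissan vieressä"
print(cosine_distance(text_features(en), text_features(fi)))
```
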
Benefits of multilingual NLP
- Linguistic and machine learning: bigger challenges lead to better approaches, e.g. subword tokenisation
- Cultural and normative: better representation of real-world knowledge
- Cognitive: learn interlingual abstractions