
11. Cross-lingual transfer and multilingual NLP

Explanations and visualisations

 

finetune

continuepretrain

test

 

Cross-lingual transfer

 

Labelled vs. unlabelled data

state-fate

Source https://aclanthology.org/2020.acl-main.560.pdf

 

Examples of unlabelled multilingual data sets:

Examples of parallel data (used for machine translation and treated as labelled data):

Examples of labelled multilingual data sets:

Many multilingual data sets are created from a selection of data taken from Common Crawl.

 

Examples of multilingual pre-trained models

Other pre-trained models are typically trained for a single language or a group of languages (e.g. IndicBERT, AraBERT, BERTić).

 

Multilingual tokenisation and script issues

tiktoken

Source https://tiktokenizer.vercel.app
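One reason scripts matter for tokenisation can be sketched with the standard library alone. Byte-level tokenisers such as the ones tiktoken implements start from the UTF-8 bytes of the text, and non-Latin scripts need more bytes per character, so the same word tends to start from a longer sequence before any merges are applied. The helper name `utf8_profile` and the sample words below are illustrative assumptions, not part of the course material, and byte counts are only a rough proxy for real token counts.

```python
def utf8_profile(text: str) -> dict:
    """Return character count, UTF-8 byte count and bytes per character."""
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))  # raw input length for a byte-level tokeniser
    return {
        "chars": n_chars,
        "bytes": n_bytes,
        "bytes_per_char": n_bytes / n_chars,
    }


# The word "language" in four languages and scripts (illustrative samples).
samples = {
    "English (Latin)": "language",
    "Serbian (Cyrillic)": "језик",
    "Hindi (Devanagari)": "भाषा",
    "Chinese (Han)": "语言",
}

if __name__ == "__main__":
    for label, word in samples.items():
        p = utf8_profile(word)
        print(f"{label:20} chars={p['chars']:2} bytes={p['bytes']:2} "
              f"bytes/char={p['bytes_per_char']:.1f}")
```

Latin letters take one byte each, Cyrillic two, and Devanagari and Han characters three. How many tokens a trained tokeniser actually produces also depends on how well each language was represented in the tokeniser's training data, which the tiktokenizer page linked above lets you inspect interactively.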

Language similarity and transfer

 

mentions

 

Language features

 

Benefits of multilingual NLP