Explanations and visualisations:

Source: http://transfersoflearning.blogspot.com/2010/03/positive-transfer.html
Before Large Language Models (LLMs)
- We shared trained models that were strictly task-specific
- Models could be trained with joint or multi-task learning: find the parameters that minimise two (or more) losses jointly, e.g. a PoS-tagging loss and a lemmatisation loss
- Neural networks could be designed with parameter sharing: some parts of the network are shared between the two tasks (their combined loss is used to update the shared weights), while other parts are task-specific (only that task's loss updates their weights); see the sketch below
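A minimal sketch of such parameter sharing in PyTorch (the layer sizes, tag-set sizes and the summed loss are illustrative assumptions, not a specific published architecture):

```python
import torch
import torch.nn as nn

class SharedTagger(nn.Module):
    """Shared encoder with two task-specific heads (PoS tags and lemma classes)."""
    def __init__(self, vocab_size=10_000, emb_dim=128, hidden=256,
                 n_pos_tags=17, n_lemma_classes=500):
        super().__init__()
        # shared part: updated by the combined loss of both tasks
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # task-specific parts: each updated only by its own loss
        self.pos_head = nn.Linear(2 * hidden, n_pos_tags)
        self.lemma_head = nn.Linear(2 * hidden, n_lemma_classes)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.pos_head(states), self.lemma_head(states)

model = SharedTagger()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# one joint training step on a toy batch (random data, just to show the flow)
tokens = torch.randint(0, 10_000, (4, 12))       # 4 sentences, 12 tokens each
pos_gold = torch.randint(0, 17, (4, 12))
lemma_gold = torch.randint(0, 500, (4, 12))

pos_logits, lemma_logits = model(tokens)
loss = loss_fn(pos_logits.transpose(1, 2), pos_gold) + \
       loss_fn(lemma_logits.transpose(1, 2), lemma_gold)  # combined loss updates the shared weights
loss.backward()
optimizer.step()
```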
With LLMs

- Consecutive (sequential) transfer: first pre-train an LLM, then train the main model with the LLM as input
- An LLM is trained on raw text with self-supervised learning
- The main model is trained on labelled data (most of the time) with supervised learning
- How the pre-trained weights are passed to the main model depends on whether the task is token-level or sentence-level (see the sketch below)
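A hedged sketch of the "LLM as input" step using the Hugging Face transformers library (the checkpoint name and the [CLS]-pooling choice are just common defaults): a token-level task keeps one vector per token, a sentence-level task typically pools them into a single vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
llm = AutoModel.from_pretrained("bert-base-cased")   # the pre-trained LLM

batch = tokenizer(["Transfer learning is useful."], return_tensors="pt")
with torch.no_grad():
    out = llm(**batch)

token_vectors = out.last_hidden_state     # (1, n_tokens, 768): input to a token-level model, e.g. a tagger
sentence_vector = token_vectors[:, 0, :]  # [CLS] vector: a common input to a sentence-level classifier
```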
Options for how to use LLMs
- Fine-tuning: This term is typically used in cases where the learning objective (task) used for pre-training the LLM is different from the objective (task) on which the main model is trained. For example, a typical task for pre-training LLMs is masked language modelling (MLM), while the main model is trained for text classification or sentiment analysis. All weights are updated with the main model’s loss.
- Continued training: This term is used in cases where a pre-trained LLM is used to improve text representation on a new domain or a new language. In this case, the learning objective is the same for the pre-trained LLM and the main LM (e.g. we use MLM in both cases), but pre-training and main training are done on different data sets. All weights are updated with the main model’s loss.
- One- and few-shot learning, also known as meta-learning: the pre-trained LLM performs the main task without updating its parameters; it learns to classify new examples by relying on a similarity function. This terminology is very new and not yet summarised in a textbook; an overview of terms and references can be found in Timo Schick’s PhD thesis.
- Adaptation: the pre-trained weights are kept fixed and only a small component is trained to use the pre-trained weights for a new task (see the sketch after this list).
- Zero-shot classification: an unsupervised setting in which the task is performed without updating the LLM’s parameters.
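A sketch of how these options differ in practice, using the transformers library (the checkpoint names, the label count and the freeze-by-name trick are illustrative assumptions, not the only way to do it):

```python
from transformers import AutoModelForSequenceClassification, pipeline

# Fine-tuning: a new task head is added and ALL weights receive gradients from the main task's loss.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Adaptation: keep the pre-trained weights fixed, train only the small new component
# (in BERT-style models the new sequence-classification head is named "classifier").
for name, param in model.named_parameters():
    if not name.startswith("classifier"):
        param.requires_grad = False

# Zero-shot classification: no parameter updates at all; an off-the-shelf NLI model
# scores candidate labels directly.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot("I loved this film!", candidate_labels=["positive", "negative"]))
```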
LLM model types

Source: https://www.interconnects.ai/p/llm-development-paths
- model architecture types:
- only the encoder part of Transformers (e.g. BERT), used for classification, both token-level and sentence-level
- only the decoder part of Transformers (e.g. GPT), used for text generation, now known as Generative AI
- full encoder-decoder Transformers (e.g. T5), used for machine translation, but also for generation (see the loading sketch after this list).
- training objective:
- masked language modelling (e.g. BERT, RoBERTa)
- discriminating between alternative fillings of slots (e.g. ELECTRA)
- reconstructing the input after permutations (e.g. XLNet)
- model size
- smaller models are trained with fewer parameters, typically via knowledge distillation (e.g. DistilBERT)
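Each architecture type maps onto a different transformers auto-class; a minimal loading sketch (the checkpoints named here are just well-known public examples):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # encoder-only: token- and sentence-level classification
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # decoder-only: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encoder-decoder: translation and other seq2seq generation
```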
Encoder-type LLM examples:
- BERT (Bidirectional Encoder Representations from Transformers), info from J&M:
- An English-only subword vocabulary consisting of 30,000 tokens generated using the WordPiece algorithm
- Input context window N=512 tokens, and model dimensionality d=768
- L=12 layers of transformer blocks, each with A=12 (bidirectional) multihead attention layers
- 100M parameters
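Those figures can be checked directly from the published checkpoint, assuming the transformers library is installed (the exact count comes out around 110M because embeddings and pooler are included, i.e. the ~100M order of magnitude above):

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.vocab_size,               # ~30,000 WordPiece tokens
      config.max_position_embeddings,  # N = 512
      config.hidden_size,              # d = 768
      config.num_hidden_layers,        # L = 12
      config.num_attention_heads)      # A = 12

model = AutoModel.from_pretrained("bert-base-uncased")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```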
- RoBERTa (A Robustly Optimized BERT Pretraining Approach), info from Hugging Face:
- The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by <mask>.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of cases, the masked tokens are left as is.
- Contrary to BERT, the masking is done dynamically during pretraining (i.e. it changes at each epoch and is not fixed); see the sketch below.
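A minimal sketch of that 15% / 80-10-10 procedure in PyTorch (an illustrative re-implementation, not RoBERTa’s training code; in practice transformers’ DataCollatorForLanguageModeling applies the same scheme on the fly for every batch, which is what makes the masking dynamic):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """BERT/RoBERTa-style masking: 15% of tokens, split 80/10/10."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # choose ~15% of positions as prediction targets
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # non-masked positions are ignored by the loss

    # 80% of the chosen positions -> the mask token
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # 10% -> a random token (half of the remaining 20%)
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]

    # the remaining 10% are left unchanged
    return input_ids, labels

# calling this freshly on every batch means the mask pattern changes each epoch
```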
- XLM-RoBERTa, info from J&M:
- A multilingual subword vocabulary with 250,000 tokens generated using the SentencePiece Unigram LM algorithm.
- Input context window N=512 tokens, and model dimensionality d=1024.
- L=24 layers of transformer blocks, with A=16 multihead attention layers each
- 550M parameters
- ELECTRA, info from Hugging Face:
- ELECTRA modifies the pretraining objective of traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models, a generator and a discriminator. The generator replaces some tokens with plausible alternatives and the discriminator (the model you’ll actually use) learns to detect which tokens are original and which were replaced. This training approach is very efficient and scales to larger models while using considerably less compute.
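A sketch of using the discriminator for replaced-token detection, adapted from the standard transformers usage of ElectraForPreTraining (the corrupted sentence is written by hand here rather than produced by a generator):

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# a sentence in which one token has been replaced by a plausible alternative
corrupted = "The quick brown fox ate over the lazy dog"
inputs = tokenizer(corrupted, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits   # one score per token

predictions = (logits > 0).long()             # 1 = token was replaced, 0 = original
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
               predictions[0].tolist())))
```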
LLM source data and type
- English (Wikipedia and BooksCorpus): bert-base, cased and uncased
- French: FlauBERT (BERT), CamemBERT (RoBERTa), variants
- Bosnian, Croatian, Montenegrin, Serbian: BERTić (ELECTRA), trained on almost all the texts available online
- many, many more!