nlp_intro

2. NLP tasks, data sets, benchmarks

Explanations and visualisations

Crash Course Linguistics #2, #3, #4

Universal Dependencies, CoNLL-U format

Jurafsky-Martin 2.4

Text parsing

Because language is compositional, text parsing is performed at several levels.

Tokenisation

Here we decide what the units of processing are. In the CoNLL-like formats, tokenisation is deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? What about punctuation?

splits
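
The questions above (what is a word, what about punctuation) can be made concrete with a minimal sketch of a rule-based tokeniser. The pattern `TOKEN_RE` and the function `tokenise` are hypothetical names chosen for illustration, and the rule itself (keep internal apostrophes and hyphens, split off every other punctuation mark) is only one of many possible conventions, not the one used by any particular treebank.

```python
import re

# Hypothetical toy rule: a token is either
#   - a run of word characters, optionally joined by internal apostrophes or hyphens, or
#   - a single character that is neither a word character nor whitespace (punctuation).
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise(text: str) -> list[str]:
    """Return the tokens of `text` in order; whitespace is discarded."""
    return TOKEN_RE.findall(text)


tokens = tokenise("She splits logs.")
# Each element of `tokens` would occupy one row in a CoNLL-like file.
```

Under this convention "Don't" stays one token, while other schemes split it into two; that choice is exactly the decision tokenisation has to make.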

Lemmatisation

Mapping different word forms to a single canonical form (the lemma), e.g. journaux -> journal. This can be very difficult for some languages due to:

- non-concatenative morphology
- no clear difference between derivation and inflection
- no clear word boundaries (e.g. Chinese)
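
As a minimal sketch, lemmatisation can be pictured as a lookup from word forms to lemmas. The lexicon `LEMMA_LEXICON` and the function `lemmatise` below are hypothetical and hand-written for illustration; real lemmatisers rely on large lexicons, morphological rules or trained models, and usually need the context or the part of speech to disambiguate.

```python
# Hypothetical, hand-written lexicon mapping word forms to lemmas.
LEMMA_LEXICON = {
    "journaux": "journal",  # irregular French plural from the example above
    "splits": "split",
    "went": "go",
    "mice": "mouse",
}


def lemmatise(token: str) -> str:
    """Return the lemma of `token`, falling back to the lowercased form itself."""
    form = token.lower()
    return LEMMA_LEXICON.get(form, form)


lemma = lemmatise("journaux")
```

The fallback to the form itself is an assumption of this sketch: it is adequate for forms that already are lemmas and silently wrong for any inflected form missing from the lexicon.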

Morphology

splits splits splits

Derivation

splits

Part-of-speech (PoS) tagging or morphosyntactic description (MSD)

Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), the category alone is not enough and we need additional features, called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number and MASCULINE gender.

splits
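
In the CoNLL-U format, the PoS tag and the morphological features sit in separate tab-separated columns of each token row. The sketch below splits one such row into its ten columns and turns the features column into a dictionary. The function names `parse_feats` and `parse_token_line` are hypothetical, the Latin example row is written by hand for illustration rather than taken from a treebank, and the sketch ignores comment lines, multiword tokens and empty nodes.

```python
# The ten columns of a CoNLL-U token row, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_feats(feats: str) -> dict[str, str]:
    """Turn 'Case=Acc|Number=Sing' into {'Case': 'Acc', 'Number': 'Sing'}."""
    if feats == "_":  # underscore marks an empty column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token_line(line: str) -> dict:
    """Parse one tab-separated token row into a dictionary keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    token = dict(zip(CONLLU_FIELDS, values))
    token["feats"] = parse_feats(token["feats"])
    return token


# Hand-written illustrative row: a Latin noun in the accusative singular masculine.
line = "\t".join(
    ["1", "puerum", "puer", "NOUN", "_",
     "Case=Acc|Gender=Masc|Number=Sing", "2", "obj", "_", "_"]
)
token = parse_token_line(line)
```

Here `token["upos"]` carries the coarse category and `token["feats"]` the additional morphosyntactic information.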

Syntactic parsing

How tokens combine into phrases and sentences:

No labels

splits

Constituent analysis

splits
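
A constituent analysis groups tokens into nested, labelled phrases. One simple way to hold such a tree in code, sketched here under the assumption of a hand-made analysis of the sentence "She splits logs", is a nested tuple whose first element is the label. The labels and the helper names `leaves` and `to_brackets` are illustrative choices, not a standard.

```python
# A node is (label, child, child, ...); a leaf is a plain string (the word).
# Hand-made illustrative analysis, not the output of a parser.
tree = (
    "S",
    ("NP", ("PRON", "She")),
    ("VP", ("VERB", "splits"), ("NP", ("NOUN", "logs"))),
)


def leaves(node) -> list[str]:
    """Return the words under `node`, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # skip the label at index 0
        words.extend(leaves(child))
    return words


def to_brackets(node) -> str:
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


bracketed = to_brackets(tree)
```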

Dependency analysis

splits
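
A dependency analysis instead links each token directly to its head, with a label on the link. The sketch below stores a hand-made analysis of "She splits logs" as (id, form, head, relation) tuples, following the CoNLL-U convention that head 0 marks the root. The helper names `dependents` and `find_root` are hypothetical.

```python
# Each token: (id, form, head id, relation label); head 0 means "root".
# Hand-made illustrative analysis, not the output of a parser.
sentence = [
    (1, "She", 2, "nsubj"),
    (2, "splits", 0, "root"),
    (3, "logs", 2, "obj"),
]


def dependents(sentence) -> dict[int, list[int]]:
    """Map each head id to the ids of the tokens that depend on it."""
    children: dict[int, list[int]] = {}
    for token_id, _form, head, _rel in sentence:
        children.setdefault(head, []).append(token_id)
    return children


def find_root(sentence) -> str:
    """Return the form of the single token whose head is 0."""
    roots = [form for _id, form, head, _rel in sentence if head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


children = dependents(sentence)
root = find_root(sentence)
```

Unlike the constituent tree, this structure has no phrase nodes: every node is a token, and the verb heads both its subject and its object.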

End-user tasks

Examples in the HuggingFace tutorial:
- sentiment analysis: given a short text, is it positive or negative?
- named entity recognition: given a token, is it an ordinary word or does it refer to a specific real entity?
- question answering: given a question and a text snippet, which segments of the text answer the question?
- mask filling: given a sentence with empty slots, which tokens best fit the empty slots?
- translation
- summarisation
- text generation

Famous (old) NLU benchmarks and data sets:
- GLUE
- SQuAD
- SNLI