nlp_intro

2. NLP tasks, data sets, benchmarks

Explanations and visualisations

Text parsing

Because language is compositional, text parsing is performed at several levels.

Tokenisation

Tokenisation decides what the units of processing are. In CoNLL-like formats, this amounts to deciding what goes in each row. Traditionally, each word is treated as one token, but what counts as a word? And how should punctuation be handled?
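
To make the "what is a word?" question concrete, here is a minimal sketch of a rule-based tokeniser. The regular expression and the function name tokenise are assumptions chosen for illustration, not a standard; real tokenisers use language-specific rules.

```python
import re

def tokenise(text):
    # One token per run of word characters (\w+) or per single
    # non-space, non-word character such as punctuation ([^\w\s]).
    return re.findall(r"\w+|[^\w\s]", text)

# Print one token per row, as in a CoNLL-like layout (index, form).
for index, token in enumerate(tokenise("Don't panic, it's fine."), start=1):
    print(index, token, sep="\t")
```

Under this rule "Don't" is split into three tokens (Don, ', t), which shows that even a simple decision about apostrophes changes what ends up in each row.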

Lemmatisation

Mapping different word forms to a single canonical form (the lemma), e.g. journaux -> journal. This can be very difficult for some languages due to:

Morphology

Derivation
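
As a rough illustration of lemmatisation, the sketch below uses a lookup table. The lexicon LEMMAS and the function lemmatise are hypothetical and tiny; the fallback to the lowercased form is an assumption that fails exactly where morphology and derivation make the task hard.

```python
# Hypothetical toy lexicon mapping inflected French forms to lemmas.
LEMMAS = {
    "journaux": "journal",
    "chevaux": "cheval",
    "travaux": "travail",
}

def lemmatise(token):
    # Normalise case, then look the form up; unknown forms are
    # returned unchanged (a deliberately naive fallback).
    form = token.lower()
    return LEMMAS.get(form, form)

print([lemmatise(t) for t in ["Journaux", "travaux", "table"]])
```

A table like this cannot generalise to unseen forms, which is why practical lemmatisers combine lexicons with morphological rules or learned models.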

Part-of-speech (PoS) tagging or morphosyntactic description (MSD)

Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), a PoS tag alone is not enough and we need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number, MASCULINE gender.
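
One way to store such a description is as a string of attribute=value pairs next to the PoS tag. The sketch below assumes the Universal Dependencies style of notation (Case=Acc|Gender=Masc|Number=Sing, with _ for "no features"); the helper parse_feats and the example token are hypothetical.

```python
def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Hypothetical annotated token: Latin 'lupum' ("wolf"), accusative singular.
token = {
    "form": "lupum",
    "lemma": "lupus",
    "upos": "NOUN",
    "feats": "Case=Acc|Gender=Masc|Number=Sing",
}

print(token["form"], token["upos"], parse_feats(token["feats"]))
```

Keeping the coarse tag and the features separate lets a tagger predict NOUN even when it is unsure about case or gender.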

Syntactic parsing

How tokens combine into phrases and sentences:

No labels

Constituent analysis
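
A constituent analysis groups tokens into nested, labelled phrases. As a minimal sketch, the tree below is written as nested tuples; the sentence, the labels (S, NP, VP, DET) and both helper functions are illustrative assumptions.

```python
# Tree as nested tuples: (label, child, child, ...); leaves are strings.
tree = ("S",
        ("NP", ("DET", "the"), ("NOUN", "cat")),
        ("VP", ("VERB", "sleeps")))

def to_brackets(node):
    # Render the tree in the usual bracketed notation.
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + " ".join([label] + [to_brackets(c) for c in children]) + ")"

def leaves(node):
    # Collect the tokens covered by a constituent, left to right.
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

print(to_brackets(tree))
print(leaves(tree[1]))  # tokens covered by the NP
```

Each inner tuple is one constituent, so the phrase structure is explicit in the nesting itself.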

Dependency analysis
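
A dependency analysis links each token to its head with a labelled relation, with no phrase nodes in between. The sketch below stores one row per token, in the spirit of CoNLL-like formats; the sentence, the relation names (det, nsubj, root) and the helper arcs are assumptions for illustration.

```python
# One row per token: (id, form, head id, relation); head 0 marks the root.
sentence = [
    (1, "the", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "sleeps", 0, "root"),
]

def arcs(rows):
    # Resolve head ids to word forms and return (head, relation, dependent).
    forms = {idx: form for idx, form, _, _ in rows}
    forms[0] = "ROOT"
    return [(forms[head], rel, form) for _, form, head, rel in rows]

for head, rel, dependent in arcs(sentence):
    print(f"{head} --{rel}--> {dependent}")
```

Because every token has exactly one head, the whole analysis fits in two extra columns per row.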

End-user tasks