Explanations and visualisations
- Crash Course Linguistics #2, #3, #4
- Universal Dependencies, CoNLL-U format
- Jurafsky-Martin 2.4
Because language is compositional, text parsing is performed at several levels.
Tokenisation decides what the units of processing are. In CoNLL-like formats, this amounts to deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? And what about punctuation?
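To make the "what is a word?" question concrete, here is a minimal sketch of a rule-based tokeniser. The pattern and the function names `tokenise` and `to_rows` are illustrative assumptions, not a standard: the sketch keeps contractions and hyphenated words together and gives each punctuation mark its own row.

```python
import re

# Hypothetical toy rule: a run of word characters, optionally joined by
# apostrophes or hyphens, is one token; any other non-space character
# (e.g. punctuation) is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise(text: str) -> list[str]:
    """Split text into tokens according to the toy rule above."""
    return TOKEN_PATTERN.findall(text)


def to_rows(tokens: list[str]) -> list[str]:
    """Lay tokens out one per row, numbered from 1, as in CoNLL-like formats."""
    return [f"{i}\t{tok}" for i, tok in enumerate(tokens, start=1)]


if __name__ == "__main__":
    for row in to_rows(tokenise("Don't panic, it's well-known!")):
        print(row)
```

Other conventions make different choices, for instance splitting contractions such as "Don't" into two tokens; changing the pattern changes what ends up in each row.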
Lemmatisation maps different word forms to a single canonical form (the lemma), e.g. journaux -> journal. It can be very difficult for some languages due to:
- Morphology
- Derivation
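The mapping from word forms to a canonical form can be sketched as a lookup table backed by suffix rules. Everything below (the `EXCEPTIONS` table, the `SUFFIX_RULES` list and the `lemmatise` function) is a hypothetical toy for a handful of French nouns, not a real lemmatiser.

```python
# Hypothetical exception lexicon: irregular forms are looked up first.
EXCEPTIONS = {"yeux": "œil", "tuyaux": "tuyau"}

# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("aux", "al"), ("s", "")]


def lemmatise(form: str) -> str:
    """Return a lemma for a word form using exceptions, then suffix rules."""
    word = form.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require some stem to remain so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for form in ["journaux", "chevaux", "tuyaux", "chats"]:
        print(form, "->", lemmatise(form))
```

The exception table hints at why the task is hard: without it, the "-aux" rule would wrongly turn "tuyaux" into "tuyal", and every such irregularity has to be listed or learned.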
Part-of-speech tagging classifies tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number and MASCULINE gender.
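In the CoNLL-U format mentioned above, the category sits in the UPOS column and the additional features in the FEATS column, written as `Name=Value` pairs joined by `|`. The sketch below reads both from a single row; the Latin row is a hand-written illustration (unfilled columns are `_`), and `parse_feats` is a hypothetical helper name.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a FEATS string such as 'Case=Acc|Number=Sing' into a dict."""
    if feats == "_":  # an underscore marks an empty field
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
row = "1\tlupum\tlupus\tNOUN\t_\tCase=Acc|Gender=Masc|Number=Sing\t_\t_\t_\t_"

columns = row.split("\t")
upos = columns[3]                 # coarse category
feats = parse_feats(columns[5])   # morphological features

if __name__ == "__main__":
    print(upos, feats)
```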
Syntactic parsing describes how tokens combine into phrases and sentences:
- No labels
- Constituent analysis
- Dependency analysis
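The difference between a constituent analysis and a dependency analysis is easiest to see as data structures. The following is a sketch for the sentence "The cat sleeps"; the tuple layouts and the helper names `leaves` and `dependents` are assumptions chosen for illustration, with relation labels in the Universal Dependencies style.

```python
# Constituent analysis: nested (label, child, ...) tuples.
# A leaf is a (tag, word) pair.
constituents = (
    "S",
    ("NP", ("DET", "The"), ("NOUN", "cat")),
    ("VP", ("VERB", "sleeps")),
)


def leaves(node) -> list[str]:
    """Collect the words of a constituent tree from left to right."""
    _label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


# Dependency analysis: one (id, form, head id, relation) entry per token.
# Head id 0 marks the root of the sentence.
dependencies = [
    (1, "The", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "sleeps", 0, "root"),
]


def dependents(head_id: int) -> list[str]:
    """Return the forms of all tokens directly attached to the given head."""
    return [form for _, form, head, _ in dependencies if head == head_id]


if __name__ == "__main__":
    print(leaves(constituents))
    print(dependents(3))
```

The constituent tree introduces phrase nodes (NP, VP) that are not tokens themselves, while the dependency analysis links tokens directly to one another and labels each link.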