12. What is knowledge about language?
Explanations and visualisations
- Crash Course Linguistics #1, #14
- Revisiting research training in linguistics: theory, logic, method, Unit 01_01, Unit 01_02, Unit 01_03, Unit 01_04, Unit 02_01
- U. Shlonsky and G. Bocci, Syntactic cartography
- S. Piantadosi, Zipf’s word frequency law in natural language: A critical review and future directions
Grammar vs. linguistics
- Grammar is what we study in school: the rules of a single language
- Linguistics is about explaining why grammars are the way they are
  - What is common to all grammars?
  - What kinds of rules may or may not exist?
- Big gap between theory and grammar
- Most of what is called “linguistics” is about particular grammars (descriptive, not theoretical)
Most important terminology in grammar and linguistics, with the corresponding NLP tasks
- Phonetics describes the physical properties of speech sounds (place of articulation, pitch, duration, intonation, etc.); phonology describes rules over abstract sound representations
- Morphology describes the rules of word formation (derivation, inflection)
  - NLP tasks: stemming, lemmatisation, subword tokenization (see the stemming vs. lemmatisation sketch after this list)
- Syntax describes the rules of sentence formation (dependencies between words); sentence structure is represented as constituency or dependency trees (see the dependency sketch after this list)
- Semantics deals with meanings, which can be lexical (meaning of words) or propositional (meaning of sentences)
  - NLP tasks: word sense disambiguation, semantic role labelling, co-reference resolution
- Pragmatics deals with meaning in context: how we understand meanings that are not stated explicitly
  - NLP task: intent classification
- Discourse analysis describes the rules for combining sentences into higher-level structures
  - NLP task: dialogue and interactive systems (chatbots), identifying discourse relations
- Sociolinguistics: linguistic differences between social groups (e.g. young vs. old speakers, men vs. women, degrees of education)
  - NLP task: computational sociolinguistics (e.g. classifying social media users)
- Psycholinguistics, neurolinguistics: where is language located in the brain? Which structures are harder for the brain to process?
  - NLP task: cognitive modelling (simulating human language processing)
- Language acquisition: how language develops in children
  - NLP task: cognitive modelling (simulating human language processing)
- Second language acquisition: how a foreign language is learnt
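To make the morphology tasks concrete, here is a minimal stemming vs. lemmatisation sketch using NLTK; it assumes NLTK is installed (`pip install nltk`), and the word list and part-of-speech tags are illustrative.

```python
# Stemming vs. lemmatisation with NLTK (assumes nltk is installed).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatiser needs the WordNet lexicon

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Illustrative (word, part-of-speech) pairs: v = verb, n = noun, a = adjective.
for word, pos in [("studies", "v"), ("running", "v"), ("mice", "n"), ("better", "a")]:
    stem = stemmer.stem(word)  # heuristic suffix stripping; "studi" is not a real word
    lemma = lemmatizer.lemmatize(word, pos=pos)  # dictionary form: study, run, mouse, good
    print(f"{word:8} stem: {stem:8} lemma: {lemma}")
```

Stemming is fast but lossy; lemmatisation needs lexical knowledge (here, WordNet) and a part-of-speech tag to pick the right dictionary form.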
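The dependency view of syntax can likewise be shown without running a parser. In the sketch below the tree for one sentence is hand-written as (head, relation, dependent) triples; the relation labels follow the Universal Dependencies style, but the analysis is written by hand, not produced by any tool.

```python
# A hand-written dependency tree for "The cat chased a mouse",
# stored as (head, relation, dependent) triples over 0-based word indices.
sentence = ["The", "cat", "chased", "a", "mouse"]

dependencies = [
    (1, "det",   0),  # cat    -> The
    (2, "nsubj", 1),  # chased -> cat   (subject)
    (2, "obj",   4),  # chased -> mouse (object)
    (4, "det",   3),  # mouse  -> a
]
# "chased" (index 2) is the root: it appears only as a head, never as a dependent.

for head, rel, dep in dependencies:
    print(f"{sentence[head]} --{rel}--> {sentence[dep]}")
```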
Most important theoretical problems
- Arbitrariness of the sign: there is no natural connection between a word and its meaning
  - but the Bouba/kiki effect shows some connection
  - what is the mapping between meaning and form?
- Double articulation (duality of patterning): meaningless units (sounds) are merged into meaningful ones (morphemes, words), and meaningful units are merged into higher-order meaningful units (phrases, sentences)
  - sometimes only the latter is regarded as language proper
  - relevant to the question of what the smallest units of language are
  - relevant to the question of the form of language competence: is it just one operation (merge)?
- Displacement: we can talk about things we don’t see (the absent, the past, the hypothetical)
  - but it seems that we don’t use this freedom all the time; a short video about that (by me): What you see is what you say, or is it?
  - relevant to the question of what knowledge can be extracted from texts
- Innateness: are we born with a specialised language faculty, or is it all just general cognition?
  - a famous puzzle: Poverty of the stimulus
  - distantly relevant to the generalisation and universality of NLP models
  - related to the question of language universals
Information theory: text measures and quantitative laws
- Shannon entropy and complexity: a text is modelled as a sequence of outcomes of a stochastic process; its entropy H = −Σ p(x) log₂ p(x) measures unpredictability
- Zipf-Mandelbrot Law: the relation between a word’s frequency and its frequency rank is universal, roughly f(r) ∝ 1/(r + β)^α (see the sketch after this list)
- Zipf’s Law of Abbreviation: the more frequent a word, the shorter it tends to be
- Menzerath-Altmann Law: the bigger the whole, the smaller the parts (e.g. the longer a word in syllables, the shorter its syllables)
- Uniform information density, constant information rate: the tendency to keep the amount of information per unit of speech or text roughly constant
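All of these quantities can be estimated from raw text. Below is a minimal sketch using only the Python standard library: it computes the unigram Shannon entropy, prints the top of the rank-frequency table (the input to the Zipf-Mandelbrot Law), and illustrates the Law of Abbreviation; the toy `text` string stands in for a real corpus.

```python
# Corpus statistics from raw text (standard library only).
import math
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat on the rug "
    "the cat saw the dog and the dog saw the cat"
)
words = text.split()
counts = Counter(words)
total = len(words)

# Unigram Shannon entropy: H = -sum_x p(x) * log2 p(x).
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"unigram entropy: {entropy:.3f} bits per word")

# Rank-frequency table: Zipf-Mandelbrot predicts freq ~ 1 / (rank + beta)^alpha.
print("rank  freq  word")
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(f"{rank:4d}  {freq:4d}  {word}")

# Law of Abbreviation: on a real corpus, frequent words tend to be shorter.
mean_len_frequent = sum(len(w) for w, _ in counts.most_common(3)) / 3
print(f"mean length of the 3 most frequent words: {mean_len_frequent:.1f}")
```

On a corpus of realistic size, plotting log frequency against log rank gives the near-straight line characteristic of Zipfian distributions; the toy text here is only meant to show the mechanics.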
Linguistics vs. NLP
- Symbolic rule-based methods relied heavily on hand-written grammars
- Statistical methods used annotated texts; the Penn Treebank served as the model for many other corpora, a role now played by Universal Dependencies
- Both rules and annotations are lightly formalised grammars, not scientific theory
- LLMs and self-supervised learning often work without any explicit linguistic knowledge
- Popular question: is there still any room for linguistics in NLP?