Part VI · NLP & Large Language Models · Chapter 01

The fundamentals of natural language processing, the set of linguistic abstractions and engineering primitives that every NLP system — classical or neural, embedding or transformer — sits on top of.

Language is the medium in which humans do most of their thinking, most of their arguing, and nearly all of their record-keeping; making computers do useful things with it has been an obsession of the field since the 1950s. The particular difficulty is not that language is complicated — though it is — but that its complication is layered in a way that resists any single abstraction. A sentence is a sequence of characters, but also a sequence of morphemes, a sequence of words, a parse tree, a set of predicate-argument relations, a discourse move, and a context-laden speech act, all at once. Different downstream tasks need different layers. A search engine mostly needs tokens and statistics; a grammar checker needs morphological analysis and syntax; a summariser needs discourse structure; a translator needs everything. Modern neural NLP has compressed much of this stack into end-to-end learned representations — a transformer trained on trillions of tokens ends up implicitly handling morphology, syntax, and pragmatics — but even in the LLM era the classical primitives still matter: every tokeniser descends from subword segmentation research, every retrieval system relies on linguistic normalisation, every dataset curation effort leans on POS taggers and parsers, and every failure case traces back to one of the layers the model did or did not learn. This chapter lays the foundation. We walk through the levels of linguistic structure (orthographic, lexical, morphological, syntactic, semantic, discursive), the tools that operate on each (tokenisers, morphological analysers, stemmers and lemmatisers, POS taggers, parsers, semantic role labellers, coreference resolvers), and the treebanks and annotated resources that made modern NLP possible. The goal is not exhaustive coverage — that belongs to a textbook — but a working map of the concepts every LLM engineer eventually needs to reach for.

How to read this chapter

Sections one and two establish the frame. Section one asks what NLP actually is — why processing language is harder than processing images or tabular data, what the essential difficulties are (ambiguity, compositionality, productivity, context-dependence), and how the field has oscillated between symbolic, statistical, and neural approaches over its seventy-year history. Section two lays out the levels of linguistic structure — orthography, morphology, syntax, semantics, pragmatics, discourse — the layered abstraction that organises everything that follows and that every NLP engineer eventually has to reason about, even when the chosen model elides those layers end-to-end.

Sections three through six cover the bottom of the stack: how raw character streams become linguistic units. Section three is tokenisation — the apparently trivial operation of splitting text into tokens, which turns out to be the site of an enormous amount of modern engineering and the place where multilingual failures most often begin. Section four is subword tokenisation — BPE, WordPiece, SentencePiece, Unigram LM — the algorithms that replaced word-level vocabularies in every serious modern system and that quietly power every transformer tokeniser from GPT-2 to Llama. Section five is text normalisation — the cleaning step that makes tokens comparable: Unicode normalisation, case folding, punctuation handling, and the historical baggage of encoding choices that still trips up pipelines today. Section six is morphology — the structure of words themselves, the distinction between inflection and derivation, and why morphologically rich languages (Turkish, Finnish, Arabic) have always been a source of trouble for NLP systems built around English assumptions.

Sections seven through eleven ascend through the classical NLP pipeline. Section seven is stemming and lemmatisation — the two families of morphological normalisation, the difference between a crude string rule (Porter) and a lexicon-backed analysis (WordNet-based lemmatisers), and why modern retrieval systems still use them. Section eight is part-of-speech tagging — the first structural annotation, the Penn Treebank tagset, the historical succession of HMM → MEMM → CRF → BiLSTM → BERT taggers, and the role POS tags play today in pipelines that don't call themselves classical. Section nine is syntax and parsing as a conceptual layer — constituency versus dependency, the formal grammar traditions, and why syntax became a secondary concern in the transformer era without ever quite going away. Section ten is dependency parsing — head-dependent relations, Universal Dependencies, transition-based and graph-based parsers, and the practical tools (spaCy, Stanza) that make dependency parses a commodity. Section eleven is constituency parsing — phrase-structure trees, CKY and chart parsing, PCFGs, and the newer neural approaches that recovered constituency parsing's relevance for resource-rich tasks.

Sections twelve through fifteen move up to meaning. Section twelve is lexical semantics — word senses, WordNet, hypernymy, synonymy, the distinction between type and token meaning, and the classical problem of word-sense disambiguation. Section thirteen is semantic role labelling — the predicate-argument layer (PropBank, FrameNet), who-did-what-to-whom annotation, and why SRL is still a useful diagnostic for what a model actually understands. Section fourteen is coreference resolution — the problem of figuring out which noun phrases refer to the same entity (mentions, clusters, chains), and the progression from rule-based systems through Mention-Pair models to end-to-end neural coreference. Section fifteen is discourse and pragmatics — the layer above the sentence: Rhetorical Structure Theory, discourse connectives, coherence relations, and the pragmatic inferences (implicature, presupposition, speech acts) that language users perform automatically and machines still struggle with.

Sections sixteen through eighteen close the chapter. Section sixteen covers multilingual NLP — the assumptions English-centric NLP smuggled into the field, the diversity of the world's languages, low-resource settings, and the modern multilingual models (mBERT, XLM-R, NLLB) that partially address it. Section seventeen is treebanks and annotated resources — the Penn Treebank, OntoNotes, Universal Dependencies, the decades of annotation labour that made statistical NLP possible and that modern evaluation still relies on. The closing section, on where it compounds in ML, places NLP fundamentals in the wider field: as the intellectual substrate of every language model, as the toolkit that data curation and evaluation still use daily, and as the set of abstractions that lets us say what a transformer is and is not doing when it processes a sentence.

Contents

  1. Why NLP is hard — Ambiguity, productivity, context, the things that make language not tabular
  2. Levels of linguistic structure — Orthography, morphology, syntax, semantics, pragmatics, discourse
  3. Tokenisation — Words, whitespace, punctuation, scripts, and the ways it goes wrong
  4. Subword tokenisation — BPE, WordPiece, SentencePiece, Unigram LM — the modern default
  5. Text normalisation — Unicode, case folding, punctuation, encoding pitfalls
  6. Morphology — Inflection, derivation, agglutinative vs. analytic languages
  7. Stemming and lemmatisation — Porter, Snowball, WordNet lemmatisers, retrieval normalisation
  8. Part-of-speech tagging — Penn Treebank tagset, HMM → BiLSTM → BERT taggers
  9. Syntax and parsing — Constituency vs. dependency, grammar formalisms, why syntax still matters
  10. Dependency parsing — Head-dependent relations, Universal Dependencies, transition-based parsers
  11. Constituency parsing — Phrase structure, PCFGs, CKY, neural chart parsers
  12. Lexical semantics — Word senses, WordNet, hypernymy, polysemy, WSD
  13. Semantic role labelling — PropBank, FrameNet, who-did-what-to-whom annotation
  14. Coreference resolution — Mentions, chains, end-to-end neural coreference
  15. Discourse and pragmatics — RST, coherence relations, implicature, speech acts
  16. Multilingual NLP — English-centric assumptions, low-resource settings, mBERT, XLM-R, NLLB
  17. Treebanks and annotated resources — Penn Treebank, OntoNotes, UD, the annotation substrate
  18. Where it compounds in ML — Tokenisers, data curation, evaluation, the substrate under every LLM
§1

Why NLP is hard

Language looks like a sequence of characters but behaves like a hierarchy of interlocking abstractions, and nearly every operation an NLP system performs has to reconcile the two.

Natural language processing is the discipline of getting computers to do useful things with human language — parse it, generate it, translate it, answer questions in it, retrieve documents with it, extract structured facts from it. Its difficulty is often underestimated at first encounter and often overestimated at second. The first encounter says: "language is just strings; we have strings solved." The second says: "language is infinitely complex; no finite system can capture it." Both miss the point. Language is tractable in the sense that most of its structure is learnable from examples, but intractable in the sense that no single level of abstraction — character, word, sentence, paragraph — is sufficient on its own. Every NLP system has to pick an abstraction, and every abstraction leaks.

Four properties make NLP hard in a way that image classification and tabular prediction are not. First, ambiguity is pervasive at every level: the same character string ("bank") can denote a river edge or a financial institution; the same parse tree can attach a prepositional phrase to different heads ("I saw the man with the telescope"); the same pronoun can refer to different antecedents ("the trophy didn't fit in the suitcase because it was too big"). Second, productivity means speakers routinely produce sentences no one has ever uttered before — the set of possible English sentences is not just large but unbounded — so any system that works only on the training distribution is a system that will not work. Third, compositionality means that meaning is built up from parts in systematic ways, but the rules of composition are themselves context-dependent and riddled with exceptions; you cannot reliably read the meaning of a sentence off a lookup of its words. Fourth, language is context-dependent in a way that most data is not: the same utterance can mean opposite things depending on who says it, when, to whom, and in response to what.

These properties drove the field through three broad eras. The symbolic era (1950s–1980s) tried to encode linguistic knowledge as explicit rules: hand-written grammars, ontologies, inference engines. It produced genuine intellectual progress — the syntactic theories of Chomsky, the formal semantics of Montague — but struggled with coverage and ambiguity. The statistical era (late 1980s–early 2010s) replaced rules with probabilities learned from annotated corpora: n-gram language models, HMM taggers, PCFG parsers, phrase-based statistical machine translation. It dramatically improved coverage and robustness at the cost of shallower linguistic insight. The neural era (2013 onward) collapsed many of the pipeline stages into end-to-end learned representations: word embeddings, recurrent models, and — decisively — transformers trained on trillions of tokens. The neural era has produced systems that handle many NLP tasks without explicit linguistic supervision, but it has not made the underlying abstractions obsolete; it has made them implicit. When an LLM fails, it fails in a way that is usually best diagnosed by reaching back into the classical toolkit.

Key idea. NLP is difficult because language is multi-level, ambiguous, productive, compositional, and context-dependent. The history of the field is a series of moves between making those properties explicit (symbolic), making them statistical (statistical era), and making them implicit (neural era). The levels themselves never go away — they are the substrate of every system that processes language.

§2

Levels of linguistic structure

Language is organised as a layered abstraction — orthography, phonology, morphology, syntax, semantics, pragmatics, discourse — and understanding what any NLP system does usually amounts to identifying which of these layers it handles and which it assumes away.

The levels are the organising skeleton of the field. From the bottom up: orthography is the system of written symbols — letters, Chinese characters, Devanagari graphemes, emoji — and the encoding conventions (Unicode, UTF-8) that let computers represent them. Just below orthography sits phonology, the sound structure of speech, which matters for speech recognition, text-to-speech, and any task involving rhyme, rhythm, or pronunciation (historical spelling pipelines included). Above orthography is morphology, the structure of words — how walk relates to walks, walked, walking, and how unhappiness is built from un-, happy, and -ness.

Above morphology is syntax, the rules and patterns by which words combine into larger units — phrases, clauses, sentences. Syntax is often described in two overlapping ways: constituency (the sentence is a tree of phrases: noun phrases, verb phrases, prepositional phrases) and dependency (the sentence is a graph of head-dependent relations between words). Above syntax is semantics, the study of meaning — lexical semantics (what words mean), compositional semantics (how meanings combine), and formal semantics (the logical structure of propositions). Above semantics sits pragmatics, which deals with how meaning is modulated by context, speaker intention, and conversational norms: implicature, presupposition, speech acts, politeness. And above pragmatics is discourse, which spans multiple sentences — coherence, cohesion, rhetorical relations, referential chains, genre.

The levels are not a strict hierarchy; they interact. Morphology and syntax are deeply entangled in agglutinative languages where a single "word" encodes tense, aspect, number, person, and evidentiality. Semantics and pragmatics blur continuously — is the utterance "can you pass the salt?" a literal question about ability, or a request for salt? Discourse relations reach into syntax when the interpretation of a pronoun depends on a parse three sentences back. NLP systems nonetheless lean heavily on the layered picture because it gives them a vocabulary for talking about what they are trying to do and where they are likely to fail.

A useful way to read modern systems against this ladder: a bag-of-words classifier handles orthography and (often badly) morphology, ignores syntax and semantics, and has no concept of discourse. A phrase-based statistical MT system in 2010 worked from explicit word alignments and phrase tables, with syntax-based variants adding an explicit parse and at most a partial semantic layer. A transformer LLM in 2025 has implicit representations at every level — a probe can extract POS tags from its early layers, constituency trees from its middle layers, and coreference chains from its later layers — but those representations are never exposed as an API, only as the substrate for the next-token prediction that everything ultimately flows through. When the LLM fails on a particular example, the failure is nearly always traceable to one of the layers the model did not learn well enough for this input.

The layered picture is sometimes criticised as a hangover from symbolic NLP that modern systems should move past. The criticism is half right. Modern systems do blur the boundaries, and end-to-end training gets you places that pipeline systems never reached. But the layers remain useful for three things: describing what a task actually requires; diagnosing what a failing system is missing; and designing the annotation, evaluation, and probing tools that the field still runs on.

§3

Tokenisation

Tokenisation — splitting a text into the discrete units a model will process — looks trivial and is not. It is the site of most multilingual failure, a surprising amount of model performance, and almost every downstream evaluation glitch.

Tokenisation is the operation of breaking a character stream into tokens — the discrete units on which the rest of the pipeline operates. In English and other space-delimited languages, a first attempt is to split on whitespace: "the cat sat" → ["the", "cat", "sat"]. This almost works but immediately runs into problems. Punctuation attaches to words (cat,, cat., "cat"); contractions fuse tokens (don't, it's); hyphenated compounds are not obviously one token or two (state-of-the-art); URLs, email addresses, and code snippets defy the whitespace convention entirely.
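The failure modes above are easy to see in a few lines of Python. Both functions here are illustrative sketches, not any standard library's tokeniser; the regex rule set is a deliberately minimal stand-in for the decades of conventions a real tokeniser encodes:

```python
import re

def whitespace_tokenise(text):
    """Naive first attempt: split on runs of whitespace."""
    return text.split()

def rule_tokenise(text):
    """Slightly better: split punctuation off word edges, but keep
    word-internal apostrophes and hyphens attached."""
    return re.findall(r"\w+(?:['\-]\w+)*|[^\w\s]", text)

print(whitespace_tokenise("The cat sat, didn't it?"))
# ['The', 'cat', 'sat,', "didn't", 'it?']  -- punctuation stays fused
print(rule_tokenise("The cat sat, didn't it?"))
# ['The', 'cat', 'sat', ',', "didn't", 'it', '?']
```

Even the improved version is wrong for abbreviations (Dr. becomes two tokens), URLs, and any language that does not delimit words with spaces — which is exactly why real tokenisers grew into large rule systems.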

Classical NLP developed elaborate rule-based tokenisers — the Penn Treebank tokeniser, Stanford's PTBTokenizer, the spaCy tokeniser — that encode decades of conventions: split punctuation from words, expand contractions, handle common abbreviations (Dr., U.S.A.), preserve URLs. These rule sets are language-specific and dataset-specific: the Penn Treebank convention for tokenising English news text is different from what you want for tweets, code, chemistry literature, or Chinese. For languages without whitespace between words — Chinese, Japanese, Thai — tokenisation is a substantive subtask often requiring a separate learned segmenter, and for morphologically rich languages with heavy clitics (Hebrew, Arabic) tokenisation blurs into morphological analysis.

Modern neural NLP has largely replaced word-level tokenisation with subword tokenisation (next section) — a move that sidesteps most of these problems but introduces new ones. When you hear practitioners say "tokenisation" today, they usually mean "whatever the tokeniser of my transformer does", which is nearly always a variant of BPE, WordPiece, SentencePiece, or Unigram LM, and which produces tokens that do not correspond to linguistic words in any clean way. The tokens ▁nat, ural, ▁lang, uage are a reasonable segmentation of "natural language" to a modern LLM tokeniser, and unreasonable by any classical standard.

Tokenisation matters more than it looks. Two models of the same architecture trained on the same data with different tokenisers can differ by several points on downstream evaluations, because the tokeniser determines which strings collapse to the same token, how context windows are counted, and how rare words are handled. Tokenisation is also the place where most multilingual problems originate: a tokeniser trained primarily on English text will split a Burmese or Amharic word into dozens of byte-level fragments, inflating its cost and degrading its representation. Every serious multilingual system audits the tokeniser before the model.
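The fragmentation problem is visible even before any tokeniser runs, at the byte level: a byte-level tokeniser's worst case is one token per UTF-8 byte, and scripts outside the Latin range cost two to four bytes per character. A small sketch:

```python
# The same "amount" of text consumes very different byte budgets by script.
# A byte-level tokeniser with no learned merges for a script degrades
# towards one token per byte, inflating cost for non-Latin languages.
for word in ["hello", "héllo", "ሰላም"]:  # English, accented Latin, Amharic
    n_chars = len(word)
    n_bytes = len(word.encode("utf-8"))
    print(f"{word!r}: {n_chars} chars, {n_bytes} UTF-8 bytes")
# 'hello': 5 chars, 5 bytes; 'héllo': 5 chars, 6 bytes; 'ሰላም': 3 chars, 9 bytes
```

A tokeniser whose merge table was trained mostly on English will have learned multi-byte merges for English words but not for Ethiopic script, so the byte counts above approximate the token counts a user actually pays for.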

Key idea. Tokenisation is the boundary between character streams and model units. It looks mechanical but encodes substantive decisions about language, script, and domain. Historically done with rules; now done with learned subword schemes; always consequential. Audit the tokeniser before you diagnose the model.

§4

Subword tokenisation

Modern LLM tokenisers split text into subword units — pieces smaller than words but typically larger than characters — chosen by a data-driven algorithm. The specific algorithm matters less than the fact that all of them eliminate the out-of-vocabulary problem.

Word-level vocabularies force a hard choice: either cap the vocabulary and emit <UNK> for anything rare (destroying information for the long tail), or let the vocabulary grow with the corpus (blowing up the embedding matrix and never quite covering novel forms). Character-level models solve the OOV problem but produce absurdly long sequences and force the model to rediscover word-like structure from scratch. Subword tokenisation is the compromise: a fixed-size vocabulary of frequently occurring character sequences, chosen so that common words are single tokens and rare or novel words are decomposed into shorter pieces, with individual bytes or characters as the ultimate fallback.

Four algorithms dominate practice. Byte Pair Encoding (BPE), introduced for NLP by Sennrich et al. 2016 (adapted from a 1994 compression algorithm), starts with a character vocabulary and greedily merges the most frequent adjacent pair until the vocabulary reaches a target size. WordPiece (Schuster and Nakajima 2012, popularised by BERT) is a close variant that scores merge candidates by how much they improve a unigram language model rather than raw frequency. Unigram language model tokenisation (Kudo 2018) takes the opposite direction: start with an oversized vocabulary and iteratively remove tokens whose removal least hurts a unigram model of the training corpus, producing a vocabulary that supports probabilistic (non-deterministic) segmentation. SentencePiece (Kudo and Richardson 2018) is the widely used implementation that runs BPE or Unigram LM on raw text (no whitespace pre-tokenisation) and treats spaces as a special character (rendered as ▁ in the output), making it language-agnostic by construction.

Two practical details often bite. First, most modern tokenisers operate on UTF-8 bytes rather than Unicode code points, giving them a 256-symbol alphabet that covers every possible input without modification; this is the "byte-level BPE" popularised by GPT-2 and now ubiquitous. Second, the vocabulary size is a real hyperparameter with downstream consequences: smaller vocabularies mean longer sequences (more compute and context-window cost per document); larger vocabularies mean shorter sequences but a bigger embedding matrix and rarer updates to each token. Typical modern vocabularies range from 32K (Llama, T5) to 100K+ (GPT-4, Claude), with the upper end trading memory for efficiency on multilingual text.

The most subtle issue with subword tokenisation is that the learned units are only loosely aligned with linguistic morphemes. A good tokeniser might split unhappiness into un, happi, ness — close to the true morphological decomposition — but will just as happily split happiness into hap, piness if that is what corpus frequencies produce. This makes subword tokenisation robust and general, and also opaque: you cannot read morphology off the token stream, and you cannot in general rely on a fixed mapping between tokens and meaningful linguistic units. For applications that need explicit morphological structure (linguistic analysis, certain low-resource tasks), a lemmatiser or morphological analyser still beats the tokeniser every time.

A useful diagnostic when debugging a model is to look at its tokenisation of the failing example. Multi-digit numbers split into single digits, code indentation turned into token-per-space, proper nouns exploded into byte fragments — all of these change the effective context length and the per-token gradient signal, and they are often the first thing to notice when a model is mysteriously struggling on a particular input.

§5

Text normalisation

Normalisation is the set of operations — Unicode canonicalisation, case folding, punctuation handling, whitespace regularisation — that make superficially different strings comparable. It is invisible when it works and everywhere when it does not.

Text normalisation turns a noisy, historically contingent character stream into something a downstream model or index can handle uniformly. The operations are mundane but accumulate. Unicode normalisation resolves the fact that the same visible character can be encoded in multiple ways: the letter é can be a single codepoint (U+00E9) or the combination of e (U+0065) and a combining acute accent (U+0301), and string equality in any naive pipeline will treat them as distinct. Unicode defines four normalisation forms (NFC, NFD, NFKC, NFKD); most modern pipelines use NFC (composed) for storage and NFKC (compatibility composed) when they want to collapse visually similar variants.
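The é example can be checked directly with Python's standard-library `unicodedata` module, which implements all four normalisation forms:

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single codepoint
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# Visually identical, but naive string equality treats them as distinct.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFC composes, NFD decomposes; either gives a canonical form to compare on.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

The practical rule: normalise both sides to the same form before any equality test, hash, or index lookup.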

Case folding — lowercasing, or the locale-aware generalisation — was a mandatory step in most classical IR and statistical NLP pipelines and is still standard for search and matching, though the convention has shifted in neural NLP: modern transformers keep case because capitalisation is informative (it distinguishes Apple the company from apple the fruit, marks sentence boundaries, signals emphasis) and because large pretraining corpora contain both cases in abundance. The BERT family famously shipped both bert-base-uncased and bert-base-cased variants, and the jury on which was better for which task never quite came in.
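The difference between plain lowercasing and full Unicode case folding is worth seeing concretely; Python exposes both as `str.lower()` and `str.casefold()`:

```python
# str.lower() is not the whole story: case folding handles characters
# whose lowercase form is not a one-to-one mapping.
print("Straße".lower())     # 'straße'  -- German ß survives lower()
print("Straße".casefold())  # 'strasse' -- casefold expands ß to ss

# Locale pitfalls remain: Turkish dotted capital İ (U+0130) lowercases
# to 'i' plus a combining dot above, so the string length changes.
print(len("İ"), len("İ".lower()))  # 1 2
```

This is why "case folding — lowercasing, or the locale-aware generalisation" is a real distinction: a matcher that uses `lower()` will miss "STRASSE" against "Straße", and a Turkish-aware pipeline needs the İ/i and I/ı mappings that generic folding gets wrong.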

Other normalisation steps: stripping or regularising punctuation; collapsing repeated whitespace; normalising line endings; handling diacritics (fold accents or preserve them — a different answer in French search than in Vietnamese retrieval); expanding contractions; resolving ligatures (ﬁ → fi); normalising number formats (1,000,000 vs 1 000 000 vs 1000000); handling Chinese/Japanese width variants (full-width vs half-width ASCII); normalising emoji sequences and zero-width joiners.
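Several of these folds — ligatures, width variants, superscripts — are exactly what the NFKC "compatibility" normalisation form performs, again via the standard-library `unicodedata`:

```python
import unicodedata

# NFKC folds compatibility variants into their plain equivalents.
for s in ["ﬁle",         # 'fi' ligature (U+FB01) + 'le'
          "ＡＢＣ１２３",  # full-width ASCII, common in Japanese text
          "²"]:           # superscript two
    print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
# 'ﬁle' -> 'file',  'ＡＢＣ１２３' -> 'ABC123',  '²' -> '2'
```

NFKC is lossy by design — the superscript in "m²" carried meaning that "m2" loses — which is why the section's point stands: compatibility folding is a policy choice, appropriate for search indexes and dangerous for legal or scientific text.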

The practical lesson is that normalisation is a policy, not a default. Every application has to choose what to collapse and what to preserve, and the choice depends on the task: a search engine wants aggressive folding so that queries match documents; a legal document pipeline wants conservative normalisation so that whitespace variants and typography differences are preserved as evidence; a training pipeline for a multilingual LLM wants consistent Unicode normalisation across all source languages to avoid representing the same word as multiple token sequences. Mismatched normalisation between training and inference — a training pipeline that applies NFKC, an inference pipeline that does not — is a classic source of mysterious quality regressions.

A useful rule of thumb: if you are building a retrieval or matching system, normalise aggressively and audit for locale correctness. If you are building a generative system, normalise conservatively and log everything; the model will often recover from noise, but it cannot recover from lost information you silently discarded before training.

§6

Morphology

Morphology is the structure of words — how walks relates to walk, how unhappiness is built from pieces, and why morphologically rich languages have always embarrassed NLP systems engineered around English.

Morphology studies the internal structure of words and the rules by which they are formed. The smallest meaningful unit is the morpheme: cats contains two (cat + -s), unhappinesses contains four (un- + happy + -ness + -es). Morphemes come in two flavours: free morphemes can stand alone as words (cat, happy); bound morphemes only appear attached to something else (-s, un-, -ness). And morphological processes come in two broad families: inflection, which modifies a word to express grammatical features (tense, number, case, person) without changing its part of speech (walkwalked, catcats), and derivation, which changes a word's meaning or part of speech by combining morphemes (happyhappiness, happyunhappy).

Languages vary enormously in how much morphological work they do. English is relatively analytic — most grammatical information is carried by word order and function words rather than inflection — which is part of why it became the default testbed for NLP and part of why NLP tools designed for English fail badly elsewhere. Agglutinative languages (Turkish, Finnish, Hungarian, Swahili) string many morphemes together into single words: the Turkish evlerinizden means "from your houses" and decomposes into ev (house) + -ler (plural) + -iniz (2nd-person plural possessive) + -den (ablative). Fusional languages (Russian, Latin, Greek) encode multiple grammatical features in a single morpheme whose form is hard to decompose. Polysynthetic languages (Inuktitut, Mohawk) pack what English would express as a full sentence into a single morphologically complex word.

The NLP consequence is that a vocabulary-based approach that works for English (a few hundred thousand word forms covers most text) explodes for Turkish (a single verb stem can have thousands of inflected forms) and breaks entirely for polysynthetic languages. Before subword tokenisation, NLP systems for morphologically rich languages relied on explicit morphological analysers — finite-state transducers, two-level morphology — that could segment a surface form into its underlying morphemes. These systems were sophisticated and labour-intensive; the Xerox finite-state toolkit and the Helsinki University HFST library encode the careful work of linguists over decades. Subword tokenisation has reduced the dependence on explicit morphology for modern pretrained models, but it has not eliminated it: applications that need accurate morphological analysis (grammar checkers, language learning tools, linguistic research) still use the classical tools, and modern multilingual benchmarks still reveal the gap between English-like and morphologically complex languages.

A related phenomenon is compounding: the formation of new words by concatenating existing ones. English does it with spaces (ice cream, high school) and hyphens (state-of-the-art); German does it without any orthographic separation, which is why Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz is a famous real-world word. Compounding interacts with tokenisation in the obvious way: a tokeniser trained on English under-segments German compounds and over-segments English phrases, with downstream consequences for everything from search to summarisation to the context window a user can fit into a single query.

Even in the LLM era, morphology remains the primary explanation for performance gaps across languages. When a multilingual model underperforms on Turkish or Finnish relative to English with comparable data scale, the first suspect is tokenisation; the second is morphology, and specifically that the model has to learn the morphology implicitly from subword statistics rather than inheriting it from a structured analyser.

§7

Stemming and lemmatisation

Stemming crudely chops off endings; lemmatisation maps a surface form to its dictionary form. Both collapse morphological variation so that running, runs, and ran are treated as the same thing — a cheap, classical, surprisingly useful reduction.

Stemming is the process of reducing a word to a rough base form by stripping suffixes according to a fixed set of rules. Lemmatisation does the same job more carefully: it maps a surface form to its lemma — the canonical dictionary form — using morphological analysis and a lexicon. For running, a stemmer might produce run or runn depending on its rules; a lemmatiser will produce run because it consults a lexicon that knows running is a form of the verb run. The distinction matters: stemmers are fast and language-independent in principle but produce non-words; lemmatisers produce real words but need a lexicon and often a POS tagger as input (since the lemma of saw is see if it's a verb and saw if it's a noun).

The Porter stemmer (Martin Porter, 1980) is the classic English stemmer and the one most practitioners cite. It applies a sequence of rule-based suffix rewrites in phases: -sses → -ss and -ies → -i, then -ational → -ate, then strip -ness, -ity, and so on. It is simple, fast, deterministic, and widely reimplemented. The Snowball framework (Porter himself, early 2000s) extended the approach to a dozen languages with a small DSL for writing stemmers. Lancaster is a more aggressive English stemmer; Krovetz is a lighter one that consults a lexicon to avoid obvious mistakes.
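The phased-rewrite idea is simple enough to sketch. The toy stemmer below is not the Porter algorithm — the real one adds "measure" conditions that check how much stem remains before a rule may fire — but it shows the mechanism: ordered phases, at most one rule firing per phase, longest suffix tried first:

```python
# Illustrative phases loosely modelled on Porter's; rules and coverage
# are deliberately minimal. ("ss", "ss") is a real Porter trick: a no-op
# rule that blocks the later ("s", "") rule from firing on -ss words.
PHASES = [
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],
    [("ational", "ate"), ("izer", "ize"), ("fulness", "ful")],
    [("ness", ""), ("ful", ""), ("ity", "")],
]

def toy_stem(word):
    for phase in PHASES:
        for suffix, replacement in phase:
            if word.endswith(suffix):
                word = word[: len(word) - len(suffix)] + replacement
                break  # at most one rule fires per phase
    return word

for w in ["caresses", "ponies", "relational", "happiness"]:
    print(w, "->", toy_stem(w))
# caresses -> caress, ponies -> poni, relational -> relate, happiness -> happi
```

Note the outputs "poni" and "happi": stemmers produce non-words by design, which is fine for a retrieval index and wrong for anything user-facing — exactly the stemming/lemmatisation trade-off described above.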

Lemmatisation typically leans on WordNet (for English) or on finite-state morphological analysers (for other languages). A WordNet lemmatiser looks up each inflected form in the WordNet lexicon and returns the registered lemma; if the input's POS is known, the lookup is unambiguous. spaCy, Stanza, and CoreNLP ship production-quality lemmatisers for dozens of languages. The quality gap between stemming and lemmatisation is especially large in morphologically rich languages, where rule-based stemming tends to over- or under-strip.

Stemming and lemmatisation remain in heavy use despite the neural turn, primarily in three places. Retrieval systems use them to collapse morphological variants so that a query for cats matches documents containing cat; most production search stacks (Elasticsearch, Solr, classical BM25 pipelines) have a stemming or lemmatisation layer. Data cleaning and analysis pipelines — topic modelling, keyword extraction, corpus linguistics — use them to reduce sparsity. Linguistic tooling (grammar checkers, language learning apps, dictionaries) relies on lemmatisation to connect surface forms to canonical entries. What has changed in the neural era is that transformers, because they operate on subwords and have dense representations, do not need explicit stemming — they learn morphological equivalence from context. But for every task where the output is a sparse index, a list of keywords, or a human-readable analysis, the classical tools still shine.

The clearest argument for understanding stemming and lemmatisation today is that they force you to think explicitly about morphological collapse — which surface forms should be treated as equivalent for your task. Even when a neural model does the collapsing implicitly, having a model of what it should be collapsing is the right mental tool for debugging failure cases.

§8

Part-of-speech tagging

POS tagging assigns each token a grammatical category — noun, verb, adjective, adverb, determiner, preposition. It is the gateway to syntax, the oldest sequence-labelling task in NLP, and the place the field first worked out how to learn from data.

Part-of-speech (POS) tagging is the task of assigning each word in a sentence a grammatical category — a tag — from a predefined inventory. The canonical English inventory is the Penn Treebank tagset, with 36 tags plus 12 punctuation tags: NN (common noun), NNS (plural noun), NNP (proper noun), VB (verb base), VBD (verb past), VBG (verb gerund), JJ (adjective), RB (adverb), DT (determiner), IN (preposition), and so on. The Universal POS tagset (Petrov, Das, McDonald 2012) gives a coarser 17-tag inventory that is consistent across languages and is the tag layer used by Universal Dependencies.

POS tagging is the first non-trivial NLP problem that was systematically tackled with machine learning, and the succession of model families that attacked it doubles as a miniature history of the field. The earliest systems were rule-based, relying on hand-written disambiguation rules over morphological and contextual features. The 1980s brought the Hidden Markov Model, treating tags as latent states and words as emissions; Brill's transformation-based tagger (1995) added a rule-learning twist. The 1990s and 2000s brought discriminative approaches: Maximum Entropy Markov Models (McCallum 2000) and Conditional Random Fields (Lafferty 2001), which model the full sequence jointly. The 2010s brought BiLSTM taggers (Huang et al. 2015) that combined a bidirectional LSTM over words with a CRF output layer, reaching ~97% token accuracy on English newswire — which turned out to be close to the inter-annotator ceiling. The 2018-onward era uses BERT-style contextual embeddings, either as features into a lightweight tagger or as the base for direct fine-tuning; these systems essentially saturate the benchmark.
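The HMM formulation is compact enough to sketch. The transition and emission probabilities below are invented rather than estimated from a treebank, and the tagset is cut down to three tags, but the Viterbi decoding is the standard dynamic programme: at each position, keep the best-scoring path into each tag.

```python
import math

# Toy HMM tagger: tags are hidden states, words are emissions.
# All probabilities are invented for illustration.
TAGS = ["DT", "NN", "VBD"]
START = {"DT": 0.8, "NN": 0.1, "VBD": 0.1}
TRANS = {                      # P(next_tag | tag)
    "DT":  {"DT": 0.01, "NN": 0.9,  "VBD": 0.09},
    "NN":  {"DT": 0.1,  "NN": 0.3,  "VBD": 0.6},
    "VBD": {"DT": 0.6,  "NN": 0.3,  "VBD": 0.1},
}
EMIT = {                       # P(word | tag)
    "DT":  {"the": 0.7, "a": 0.3},
    "NN":  {"cat": 0.4, "mat": 0.4, "saw": 0.2},
    "VBD": {"sat": 0.5, "saw": 0.5},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # scores[t] = (log-prob of best path ending in tag t, that path)
    scores = {t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], 1e-9)), [t])
              for t in TAGS}
    for word in words[1:]:
        new = {}
        for t in TAGS:
            emit = math.log(EMIT[t].get(word, 1e-9))
            best_prev = max(TAGS, key=lambda p: scores[p][0] + math.log(TRANS[p][t]))
            lp, path = scores[best_prev]
            new[t] = (lp + math.log(TRANS[best_prev][t]) + emit, path + [t])
        scores = new
    return max(scores.values())[1]

print(viterbi(["the", "cat", "sat"]))   # -> ['DT', 'NN', 'VBD']
```

The same skeleton, with probabilities estimated by counting over the Penn Treebank and a proper unknown-word model, is essentially the 1980s-era tagger.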

A clarifying fact: English POS tagging is a nearly-solved problem in the sense that modern systems hit 97–98% token accuracy on Penn Treebank-style text, which is roughly the rate at which human annotators agree with one another. The remaining disagreement is mostly about hard cases — to (preposition or infinitive marker?), that (determiner, complementiser, or relative pronoun?), -ing forms (gerund or participle?) — which resist any amount of data because the categories themselves are fuzzy. The problem remains hard for low-resource languages, for social-media text, for domain-specific corpora, and for languages whose tagsets are less well-developed than English.

In 2026 POS tags are rarely exposed to users, but they persist invisibly in the infrastructure: named-entity recognition systems use POS features; lemmatisers need them to disambiguate; parsers consume them; information extraction pipelines rely on them; and the annotation effort that produced POS-tagged corpora — the Penn Treebank most of all — is still the foundation on which a lot of modern benchmarks rest. When a transformer is probed to see what it has implicitly learned, POS information is reliably present in the early layers; the layers above build on it.

Key idea. POS tagging is the foundational sequence-labelling task: assign one category per token. It is solved at the textbook level for well-resourced languages and unsolved for low-resource ones. It persists in modern systems as an implicit feature in pretrained models and an explicit layer in classical pipelines.

§9

Syntax and parsing

Syntax is the combinatorial structure of language — how words combine into phrases, phrases into clauses, clauses into sentences. Parsing is the task of recovering that structure. The two dominant formalisms, constituency and dependency, are two views of the same underlying organisation.

Syntax is the study of how words combine into larger units, and of the combinatorial rules that license some combinations and forbid others. The sentence "the cat sat on the mat" is syntactically well-formed; "the cat on mat the sat" is not; "colourless green ideas sleep furiously" is syntactically well-formed but semantically anomalous, and the point of that famous Chomsky example (1957) is that syntax is a level separable from meaning. Parsing is the computational task of recovering the syntactic structure of a sentence from its surface form.

Two dominant formalisms frame the task. Constituency grammar, rooted in American structuralist linguistics and formalised by Chomsky, analyses a sentence as a tree of nested phrases: a sentence is a noun phrase and a verb phrase, a verb phrase is a verb and a noun phrase, a noun phrase is a determiner and a noun, and so on down to the leaves. The resulting tree is called a phrase-structure tree, and parsing in this formalism is recovering that tree. Dependency grammar, rooted in European traditions (Tesnière, 1959) and dominant in modern computational practice, analyses a sentence as a graph of head-dependent relations between words: sat is the root; cat depends on sat as subject; the depends on cat as determiner; on depends on sat as a prepositional modifier; mat depends on on as its object. The two formalisms are not incompatible — you can often convert between them — but they highlight different structural properties and suit different downstream uses.

Syntax mattered enormously to NLP from the 1960s through the 2010s. Early MT systems relied on syntactic transfer; statistical MT used tree-to-string and tree-to-tree models; semantic role labelling and coreference resolution consumed parse trees as input; information extraction from text leaned on syntactic patterns. The decade-long effort to build high-accuracy statistical parsers — the Collins parser (1999), the Charniak parser (2000), the Stanford parser (Klein and Manning 2003), the Berkeley parser (Petrov et al. 2006) — was one of the field's flagship projects. By 2014 the Stanford neural dependency parser (Chen and Manning) hit state of the art with a surprisingly small neural network, and parsing accuracy for English has continued to climb with contextualised embeddings and attention-based architectures.

The neural turn quietly demoted explicit parsing. End-to-end models — neural MT, BERT, transformer LLMs — perform their downstream tasks without consuming parse trees, and in many cases outperform pipelined systems that did. But parsing has not become obsolete; it has become infrastructural. Modern systems use parsers for data curation, for linguistic probing, for structured information extraction, and for applications (grammar checking, language learning, legal analysis) where the parse itself is the product. Meanwhile, probing studies (Hewitt and Manning 2019, Tenney et al. 2019) have shown that transformer hidden states contain rich implicit syntactic information, recoverable with modest linear probes — a vindication of the intuition that syntax did not go away, it just went inside.

Whether to use an explicit parser in a modern pipeline comes down to a simple question: does the downstream consumer want a structured object, or does it want vectors? If the answer is vectors, a pretrained transformer is usually the right choice; if the answer is a parse tree a human or symbolic downstream system will consume, a dedicated parser (spaCy, Stanza, Berkeley Neural Parser) is still the right tool.

§10

Dependency parsing

Dependency parsing recovers the head-dependent graph of a sentence. Its universality, efficiency, and the Universal Dependencies project have made it the parsing formalism of choice for practical, multilingual NLP.

In dependency parsing, each word in a sentence (except a designated root) is connected to exactly one other word — its head — by an arc labelled with a grammatical relation: nsubj for subject, obj for object, det for determiner, amod for adjectival modifier, case for case-marking preposition or postposition, and so on. The resulting structure is a rooted labelled tree over the words. For "the cat sat on the mat", the dependency tree has sat at the root, with cat as subject, mat as a prepositional modifier via on, and determiners attached to their nouns.

The Universal Dependencies project (Nivre et al. 2016) has been the driving force behind the modern dominance of dependency parsing. UD defines a single cross-linguistically consistent annotation scheme — a universal POS tagset, a universal morphological feature inventory, a universal set of dependency relations — and has treebanks for more than 150 languages (as of the UD 2.x series). The consistency matters for practical NLP: a parser trained on Universal Dependencies gives comparable output regardless of language, which is essential for multilingual pipelines and for transfer across language boundaries.

Two families of parsing algorithms dominate. Transition-based parsers (Nivre 2003) cast parsing as a sequence of shift, reduce, and arc-addition actions applied to a stack-and-buffer configuration; at each step a classifier predicts the next action, and the sequence of actions constructs the tree. They run in linear time and are the basis of modern incremental parsers (MaltParser, Stanford Neural Dependency Parser, spaCy's parser). Graph-based parsers (McDonald et al. 2005) score every possible arc and then find the highest-scoring spanning tree using algorithms like Chu-Liu-Edmonds. They run in cubic time in the worst case but often produce better trees, especially for long-distance dependencies; the biaffine parser (Dozat and Manning 2017) is a widely used modern graph-based architecture.
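The transition-based mechanics can be sketched by replaying a hand-written oracle action sequence for "the cat sat"; in a real parser (MaltParser, spaCy's parser) a trained classifier predicts each action from features of the stack-and-buffer configuration, but the machinery that turns actions into arcs is exactly this.

```python
# A minimal arc-standard transition system (a sketch of the Nivre-style setup).
# Here we replay a hand-written oracle action sequence instead of predicting it.

def parse(n_words, actions):
    """Apply shift/left-arc/right-arc actions; return {dependent: (head, label)}."""
    stack, buffer = [0], list(range(1, n_words + 1))   # 0 is the artificial ROOT
    arcs = {}
    for act, label in actions:
        if act == "SHIFT":                  # move next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT":                 # head = top, dependent = second-from-top
            dep = stack.pop(-2)
            arcs[dep] = (stack[-1], label)
        elif act == "RIGHT":                # head = second-from-top, dependent = top
            dep = stack.pop()
            arcs[dep] = (stack[-1], label)
    return arcs

# Oracle sequence for "the(1) cat(2) sat(3)":
actions = [("SHIFT", None), ("SHIFT", None), ("LEFT", "det"),
           ("SHIFT", None), ("LEFT", "nsubj"), ("RIGHT", "root")]
print(parse(3, actions))   # -> {1: (2, 'det'), 2: (3, 'nsubj'), 3: (0, 'root')}
```

Each word gets exactly one head, and the whole parse takes one action per arc plus one shift per word, which is where the linear-time guarantee comes from.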

Practical tooling for dependency parsing is mature. spaCy offers fast transition-based parsers for English and dozens of other languages, designed for production throughput. Stanza (Stanford NLP Group) provides UD-compliant parsers for over 70 languages with carefully curated per-language models. UDPipe is another widely used cross-language implementation. The Berkeley Neural Parser handles both constituency and dependency with transformer-based scoring. For applications that need parse output — grammar checking, linguistic research, structured IE — these tools give accurate, efficient results on general-domain text; their accuracy degrades on noisy or domain-specific inputs, and they still struggle with languages that have small treebanks. In modern LLM-based systems, dependency parses are not typically produced at inference time, but they are heavily used for data processing, filtering, and evaluation.

A useful property of dependency parses is that they make semantic roles shallowly visible: the nsubj and obj of a verb usually line up with the actor and patient. This is only an approximation — true semantic role labelling needs its own layer (§13) — but it is often a good-enough approximation for lightweight IE, and a reason dependency parses show up in a lot of pipelines that do not call themselves parsing pipelines.

§11

Constituency parsing

Constituency parsing recovers the phrase-structure tree of a sentence. Older, more formally elaborate, and now largely displaced by dependency parsing for general-purpose use — but still the right tool for tasks where nested phrase structure is what you actually need.

Constituency parsing (also called phrase-structure parsing) recovers a tree in which the leaves are words, the internal nodes are phrase categories (NP, VP, PP, S, SBAR, …), and the children of each internal node are the constituents that combine to form it. The tree for "the cat sat on the mat" has S dominating an NP ("the cat") and a VP; the VP dominates a V ("sat") and a PP; the PP dominates a P ("on") and an NP ("the mat"); the NPs dominate determiners and nouns. Constituency parsing is older than dependency parsing in the computational tradition, more tightly tied to formal linguistic theory, and more expressive in certain ways (it captures nested phrase relationships that dependency does not naturally encode).

The historical foundation is the context-free grammar (CFG) and its probabilistic extension, the probabilistic context-free grammar (PCFG). A PCFG assigns a probability to each production rule (S → NP VP, VP → V NP, NP → DT NN, …), and parsing becomes the task of finding the highest-probability tree that generates the sentence. The classic algorithm is CKY (Cocke, Kasami, Younger — independently rediscovered in the 1960s), a dynamic-programming bottom-up chart parser that runs in O(n³) time for a sentence of length n. Vanilla PCFGs underperform because the independence assumptions of CFG miss important context; the statistical parsing era (1995–2010) found dozens of ways to re-introduce context: lexicalisation (Collins 1999), parent annotation and horizontal Markovisation, latent subcategories learned by the EM-trained Berkeley parser (Petrov et al. 2006).
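CKY itself is short enough to sketch. The toy grammar (in Chomsky normal form) and its probabilities below are invented; the chart fill is the standard O(n³) dynamic programme: score every span bottom-up, trying every split point and every binary rule.

```python
from collections import defaultdict

# Minimal CKY recogniser/scorer over a toy PCFG in Chomsky normal form.
RULES = [                         # (lhs, rhs, probability) -- invented numbers
    ("S",  ("NP", "VP"), 1.0),
    ("VP", ("V", "NP"),  1.0),
    ("NP", ("DT", "N"),  1.0),
]
LEX = {"the": [("DT", 1.0)], "cat": [("N", 0.5)], "mat": [("N", 0.5)],
       "saw": [("V", 1.0)]}

def cky(words):
    """Return the probability of the best parse whose root is S (0 if none)."""
    n = len(words)
    chart = defaultdict(dict)      # chart[(i, j)][label] = best probability
    for i, w in enumerate(words):  # width-1 spans come from the lexicon
        for label, p in LEX.get(w, []):
            chart[(i, i + 1)][label] = p
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):              # try every split point
                for lhs, (b, c), p in RULES:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        cand = p * chart[(i, k)][b] * chart[(k, j)][c]
                        if cand > chart[(i, j)].get(lhs, 0.0):
                            chart[(i, j)][lhs] = cand
    return chart[(0, n)].get("S", 0.0)

print(cky("the cat saw the mat".split()))   # -> 0.25 (= 1.0 * 0.5 * 1.0 * 1.0 * 0.5)
```

Adding backpointers to the chart turns this scorer into a full parser; lexicalisation and subcategory splitting change what the labels are, not the shape of the algorithm.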

Neural constituency parsing has had two decisive moments. First, self-attention and sentence-level contextual encoders (ELMo, BERT) dramatically improved scoring accuracy; the Berkeley Neural Parser (Kitaev and Klein 2018) combined self-attention with chart decoding to reach high-90s F1 on the Penn Treebank. Second, the realisation that a well-pretrained transformer produces representations from which constituency structure can be recovered by linear probes (Tenney et al. 2019) showed that the information is present even without explicit supervision — though for exposed-tree output a dedicated parser still wins.

When is constituency parsing preferable to dependency parsing today? Three situations: when the downstream task explicitly consumes phrase-level structure (coordination ellipsis, certain semantic parsing formalisms, QA systems with tree-based features); when the linguistic analysis being performed is theoretical and needs phrase constituents rather than word-level relations; and when working in formal syntax (Combinatory Categorial Grammar, Head-Driven Phrase Structure Grammar, Tree-Adjoining Grammar) where constituency is the native abstraction. For most general-purpose pipelines in 2026, dependency parsing is the default, and constituency parsing is reached for when phrase boundaries specifically are what the task needs.

A small but important subtlety: constituency and dependency trees for the same sentence encode overlapping but not identical information. Converting between them is a well-studied exercise; there are standard head-finding rules that turn a phrase-structure tree into a dependency tree, and deeper head-lexicalisation schemes that turn a dependency tree into something close to a constituency tree. In practice, the two formalisms coexist in the field, and the choice between them is almost always downstream-task-driven.

§12

Lexical semantics

Lexical semantics studies what words mean — how the same word can have multiple senses, how senses relate to each other, and how the vocabulary of a language forms a structured network of meaning. WordNet is the canonical resource; distributional semantics is the modern alternative.

Lexical semantics is the subfield of semantics concerned with the meaning of individual words, and with the relationships between word meanings that make a vocabulary into a structured system rather than an arbitrary list. The central phenomenon is polysemy: most common words have multiple related senses. Bank can denote a financial institution, the land beside a river, a pile of material, a row of something (a bank of switches), a curve in a road (the road banks to the left), and a billiard shot against a cushion. These are not accidental homographs — the relationships between senses have their own regular structures (metaphor, metonymy, specialisation, broadening).

The classical computational resource for lexical semantics is WordNet (Fellbaum 1998, based on Miller's work at Princeton from the 1980s). WordNet organises English nouns, verbs, adjectives, and adverbs into synsets — sets of synonymous senses. Synsets are linked by lexical relations: hypernymy (a dog is a kind of mammal), hyponymy (the inverse), meronymy (a wheel is part of a car), antonymy (hot is the opposite of cold). The result is a large graph — over 100 000 synsets for English — that provides structured lexical knowledge queryable by any downstream system. Multilingual extensions (EuroWordNet, Open Multilingual WordNet) cover dozens of languages.

The operation that most directly consumes lexical-semantic resources is word-sense disambiguation (WSD): given a word in context, decide which of its registered senses is meant. WSD was a classic NLP benchmark for decades. The performance trajectory mirrors the rest of the field: rule-based methods in the 1980s, supervised learning on sense-annotated corpora (SemCor) in the 1990s, semi-supervised and knowledge-based methods in the 2000s, and now contextual embeddings — a BERT vector for bank in a particular context is close to the vectors of other bank instances with the same sense, even without explicit sense supervision. Modern WSD systems that combine contextual embeddings with WordNet glosses (GlossBERT, SensEmBERT) perform at or near the inter-annotator agreement ceiling on standard benchmarks.
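The canonical knowledge-based baseline for WSD is the Lesk algorithm (Lesk 1986): pick the sense whose dictionary gloss shares the most words with the sentence context. A simplified sketch, with hand-written glosses standing in for WordNet glosses:

```python
# Simplified Lesk word-sense disambiguation. The glosses are hand-written
# stand-ins for WordNet glosses; real implementations also expand glosses
# with examples and related synsets.

GLOSSES = {
    "bank/finance": "a financial institution that accepts deposits and makes loans",
    "bank/river":   "sloping land beside a body of water such as a river",
}

def lesk(word_senses, context):
    """Return the sense with the largest gloss/context word overlap."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(GLOSSES[sense].split()))
    return max(word_senses, key=overlap)

print(lesk(["bank/finance", "bank/river"],
           "she sat on the bank of the river and watched the water"))
# -> "bank/river"
```

Modern contextual-embedding WSD does the same comparison in vector space: the BERT vector of the target word is matched against encoded glosses rather than against bags of overlapping words.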

Alongside the knowledge-based tradition of WordNet, the distributional tradition has always offered an alternative: meaning is not a discrete set of senses but a continuous position in a vector space defined by co-occurrence. The distributional hypothesis (Harris 1954; Firth 1957, "you shall know a word by the company it keeps") motivated decades of work on vector-space models, from Latent Semantic Analysis (Deerwester et al. 1990) to word2vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014), and ultimately to the contextual embeddings of modern transformers. Chapter 03 of this part (Word Embeddings & Distributional Semantics) will take up the distributional side in full; here we note that lexical semantics today lives across both traditions — WordNet and embedding spaces are complementary, not competing, resources.

A useful diagnostic: when an LLM confuses two unrelated meanings of a polysemous word (the classic bank-river-vs-bank-money confusion in a long-context query), the failure is a lexical-semantic one, and the right conceptual tool for talking about it is sense disambiguation, not tokenisation or syntax. The tools of lexical semantics have a long history of handling exactly this kind of confusion.

§13

Semantic role labelling

Semantic role labelling annotates who did what to whom in a sentence: agents, patients, instruments, beneficiaries, and the predicates that relate them. It is a shallow semantic layer, less ambitious than full logical form, and useful precisely because of that.

Semantic role labelling (SRL) is the task of identifying, for each verb or predicate in a sentence, the phrases that play specific semantic roles with respect to it. Given the sentence "Mary sold the book to John yesterday", an SRL system identifies sold as the predicate and labels Mary as the Agent (the seller), the book as the Theme (the thing sold), John as the Recipient, and yesterday as a temporal modifier. SRL is often described as the "who did what to whom, when, where, why, how" layer — a shallow but structured semantics that sits above syntax and below full logical form.

Two annotation schemes dominate. PropBank (Palmer, Gildea, and Kingsbury 2005) annotates predicates with numbered arguments (ARG0, ARG1, ARG2, …) whose interpretation is defined per-predicate: ARG0 is the prototypical agent, ARG1 is the prototypical patient, and the higher-numbered arguments depend on the verb's frame. PropBank was built on top of the Penn Treebank and provides roughly a million annotated predicate instances. FrameNet (Baker, Fillmore, and Lowe 1998) takes a different approach: it groups predicates into frames (conceptual scenarios like Commercial_transaction or Motion) and labels arguments with frame-specific semantic roles (Seller, Buyer, Goods). FrameNet is theoretically richer; PropBank is more widely used in NLP systems because of its coverage and its integration with syntactic treebanks.
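A crude but instructive baseline reads PropBank-style shallow roles straight off a dependency parse: nsubj maps to ARG0, obj to ARG1, iobj/obl to ARG2. This is only an approximation (passives, unaccusatives, and many verb frames break it, which is exactly why SRL is trained on PropBank rather than derived by rule), and the dependency arcs below are hand-written rather than produced by a parser.

```python
# Heuristic sketch: shallow PropBank-style roles from UD-style dependency arcs.
# The relation->role mapping is a rough approximation, not a real SRL system.

def shallow_roles(deps, predicate):
    """deps: list of (dependent, relation, head). Return {role: word}."""
    mapping = {"nsubj": "ARG0", "obj": "ARG1", "iobj": "ARG2", "obl": "ARG2"}
    return {mapping[rel]: dep
            for dep, rel, head in deps
            if head == predicate and rel in mapping}

# "Mary sold the book to John" (hand-written UD-style arcs):
deps = [("Mary", "nsubj", "sold"), ("book", "obj", "sold"),
        ("John", "obl", "sold"), ("the", "det", "book"), ("to", "case", "John")]
print(shallow_roles(deps, "sold"))
# -> {'ARG0': 'Mary', 'ARG1': 'book', 'ARG2': 'John'}
```

Running the same mapping on "the book was sold by Mary" would wrongly label the book as ARG0, which is the cleanest demonstration of why the semantic layer is a separate task from the syntactic one.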

SRL performance followed the usual trajectory. Early systems used pipelines of syntactic parsing, predicate identification, argument identification, and argument classification, with engineered features. The CoNLL-2005 and CoNLL-2008 shared tasks established benchmarks. Neural SRL arrived in the mid-2010s (He et al. 2017 used deep BiLSTMs), and contemporary systems built on BERT or similar encoders reach F1 scores in the high 80s on PropBank, approaching the human ceiling. The 2017-era observation that good SRL could be learned without explicit syntactic input — "deep learning SRL without a parser" — was a meaningful moment for the pipelined view of NLP.

Why SRL remains useful in the LLM era: it provides a compact, structured representation of event semantics that is good for downstream information extraction, QA, and event coreference. It is also a standard probing task for understanding what LLMs know — if a model produces output that systematically confuses agent and patient, it is failing at a specific layer, and SRL gives the vocabulary for naming that layer. In production, SRL is less common than named-entity recognition or relation extraction, but it shows up in knowledge-graph construction pipelines, in semantic search systems that care about event structure, and in evaluation suites for narrative understanding.

SRL is a good example of a classical NLP abstraction that modern end-to-end systems do implicitly. A transformer does not produce explicit SRL output, but its representations encode argument-role relationships well enough that a probing classifier can extract them with high accuracy. The classical task definition remains the clearest way to specify what the model is implicitly computing.

§14

Coreference resolution

Coreference resolution figures out which noun phrases refer to the same entity — Mary, she, the teacher, her. It is a central subtask for discourse understanding, a perennial benchmark, and a reliable place to catch LLM failures.

Coreference resolution is the task of identifying which expressions in a document refer to the same real-world entity. In the passage "Mary saw the book on the table. She picked it up. The teacher was pleased with her.", a coreference system should cluster {Mary, She, her}, {the book, it}, and identify the teacher as coreferent with Mary (assuming the intended reading). Each cluster corresponds to one entity; each mention is a noun phrase that refers to it. The output of a coreference system is a partition of the mentions in a document into entity clusters.

Coreference is harder than it looks. It requires syntactic parsing (to identify mention boundaries), morphological agreement (gender, number), syntactic constraints (Binding Theory — pronouns and their antecedents cannot stand in certain configurations), semantic plausibility (does the candidate antecedent fit the verbal context?), and world knowledge (the Winograd Schema Challenge: "The trophy didn't fit in the suitcase because it was too big" — what is "it"?). No single cue is sufficient; coreference is a place where the different linguistic levels collide.

Three eras of systems. Rule-based systems (Hobbs algorithm 1978, Lappin and Leass 1994) encoded linguistic constraints as explicit rules, performed well on easy cases, and struggled with ambiguity. Statistical systems (2000s-2010s) framed coreference as a mention-pair classification problem — for each pair of mentions, is the later one referring to the same entity as the earlier? — and trained classifiers on annotated corpora (OntoNotes 5.0, with approximately 1.5 million words of coreference-annotated text). End-to-end neural systems (Lee et al. 2017 "end-to-end neural coreference resolution") jointly identified mentions and clustered them without an explicit mention-detection step; later systems incorporated BERT-style contextual encoders (Joshi et al. 2019 SpanBERT, Dobrovolskii 2021 word-level coreference) and reached F1 scores in the low 80s on the OntoNotes benchmark.
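The mention-pair formulation can be sketched with hand-annotated features: for each pronoun, search backwards for the nearest preceding non-pronoun mention that agrees in gender and number. Real systems replace the agreement check with a learned pairwise classifier and add a clustering step; the mentions and features below are hand-written for illustration.

```python
# Toy mention-pair coreference: nearest preceding mention with agreeing
# gender and number. Features are hand-annotated, not extracted from text.

def resolve(mentions):
    """mentions: list of (text, kind, gender, number) in document order.
    Return {pronoun_index: antecedent_index}."""
    links = {}
    for i, (text, kind, gender, number) in enumerate(mentions):
        if kind != "pronoun":
            continue
        for j in range(i - 1, -1, -1):               # nearest-first search
            _, akind, agender, anumber = mentions[j]
            if akind != "pronoun" and agender == gender and anumber == number:
                links[i] = j
                break
    return links

mentions = [("Mary", "np", "fem", "sg"),       # 0
            ("the book", "np", "neut", "sg"),  # 1
            ("She", "pronoun", "fem", "sg"),   # 2
            ("it", "pronoun", "neut", "sg")]   # 3
print(resolve(mentions))   # -> {2: 0, 3: 1}
```

The toy gets the easy passage from the text right and would fail on every Winograd-style example, where agreement and recency are uninformative and entity properties decide the reading. That gap is the whole history of the task.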

Coreference remains a favourite probe for LLMs. It is a task where one can construct carefully crafted adversarial examples (the Winograd Schema Challenge, and its modern descendant WinoGrande) that require commonsense reasoning about entity properties. Modern LLMs do well on standard coreference benchmarks but continue to fail on genuinely hard cases — a failure mode that correlates with failures in longer-context tasks where tracking who is who matters. Explicit coreference output is used in information extraction pipelines, knowledge-graph construction, dialogue systems (who is the user talking about?), and any downstream task where entity identity matters across sentences.

Key idea. Coreference is the task of linking referring expressions to a common entity. It is structural (syntax-driven), lexical (word-sense-driven), and world-knowledge-driven all at once, which makes it a reliable diagnostic for a model's discourse competence.

§15

Discourse and pragmatics

Discourse is the structure above the sentence; pragmatics is the meaning beyond the literal. Together they cover the relations, inferences, and speech acts that make a sequence of sentences into a coherent text — and that naïve models silently fail on.

Sentences do not stand alone. A well-formed text has cohesion (surface linguistic ties — pronouns, lexical repetition, conjunctions) and coherence (logical and rhetorical relations between utterances). The field calling itself discourse analysis studies these relations. The most widely used computational framework is Rhetorical Structure Theory (RST) (Mann and Thompson 1988), which analyses a text as a tree whose leaves are elementary discourse units (roughly clauses) and whose internal nodes encode rhetorical relations: Elaboration, Contrast, Cause, Evidence, Background. The RST Discourse Treebank (Carlson et al. 2003) is the standard corpus. An alternative framework, the Penn Discourse Treebank (Prasad et al. 2008), focuses on discourse connectives — but, because, however, although — and the relations they signal between adjacent clauses.
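The explicit-connective half of PDTB-style shallow discourse parsing can be sketched as connective lookup. The inventory below is a tiny hand-picked subset (the real PDTB distinguishes around a hundred connectives, each with a hierarchy of sense tags, and many connectives are ambiguous between senses); implicit relations, signalled by no connective at all, are the genuinely hard part that no lookup can touch.

```python
# Sketch of explicit-connective discourse relation tagging, PDTB-style.
# The connective->relation table is a tiny invented subset of the real inventory.

CONNECTIVES = {
    "but": "Comparison.Contrast",
    "however": "Comparison.Contrast",
    "because": "Contingency.Cause",
    "although": "Comparison.Concession",
    "then": "Temporal.Asynchronous",
}

def tag_relations(sentence):
    """Return (connective, relation) pairs found in the sentence."""
    tokens = sentence.lower().replace(",", " ").split()
    return [(t, CONNECTIVES[t]) for t in tokens if t in CONNECTIVES]

print(tag_relations("The experiment failed because the sample was contaminated"))
# -> [('because', 'Contingency.Cause')]
```

Even this trivial tagger illustrates the framing: discourse relations hold between clause-sized units, and explicit connectives are the surface evidence for them.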

Pragmatics is the study of meaning in context. Its central concepts are old: implicature (Grice 1975) — the way speakers convey more than they literally say ("some of the students passed" implicates "not all"); presupposition — the background assumptions that utterances take for granted ("the king of France is bald" presupposes the existence of a king of France); speech acts (Austin 1962, Searle 1969) — the actions speakers perform in saying things (assertion, question, request, promise). These distinctions were developed in philosophy of language long before computational NLP, and they are the vocabulary for talking about the gap between what an utterance says and what a competent listener takes it to mean.

Computational discourse and pragmatics have always been the parts of NLP that lag the hardest. Annotated resources are smaller and less consistent than treebanks for syntax; the phenomena are themselves less crisply defined; and the payoff for downstream tasks is often ambiguous — most applications can do well on many benchmarks while getting pragmatics wrong. Classical systems for discourse parsing (RST parsers, PDTB-style shallow discourse parsers) never reached the accuracy of syntactic parsers, and a lot of pragmatic phenomena were handled, when they were handled at all, with task-specific heuristics.

LLMs have changed the picture in an interesting way. Because they are trained on vast amounts of coherent text, transformers have acquired implicit discourse and pragmatic competence: they complete texts with appropriate rhetorical connectives, they track presuppositions across paragraphs, they recognise speech acts well enough to respond appropriately in dialogue. They also make distinctive pragmatic mistakes — over-literal responses to indirect requests, misreadings of sarcasm and irony, failures of shared-knowledge reasoning — that classical pragmatic theory predicted. The field is now in an unfamiliar position: the classical abstractions describe phenomena that modern models partially but imperfectly internalise, without an obvious path to explicit supervision that would close the remaining gap.

When an LLM gives an answer that is literally accurate but pragmatically wrong — answering the factual content of a question when the user was asking for help, or treating a rhetorical question as a real one — the failure is pragmatic. The classical vocabulary (implicature, speech act, presupposition) is still the clearest way to describe and diagnose the problem, even when the system being diagnosed has no explicit pragmatic module.

§16

Multilingual NLP

NLP is historically English-centric, and most of its methods, tools, and benchmarks smuggle English-shaped assumptions into their foundations. Multilingual NLP is the subfield that pushes back — and by now, the honest default for any new work.

The world has roughly seven thousand languages. NLP tools, benchmarks, and pretrained models cover a tiny fraction of them well. A classical 2016-era survey would say NLP tools existed for about a hundred languages at any usable quality; by 2025 that number had grown, largely through multilingual pretraining, but the long tail — languages with fewer than a million speakers, or with no substantial digital footprint, or with writing systems under-supported by Unicode and fonts — remains sparsely covered. The field calls these low-resource languages, and the set of techniques designed for them low-resource NLP.

Four axes of difficulty recur. Script and tokenisation: tokenisers trained on English-heavy corpora produce poor splits for scripts they rarely saw (a Bengali sentence might be 5× as many tokens as the equivalent English); byte-level fallbacks help but do not fully close the gap. Morphology: as covered in §6, agglutinative and polysynthetic languages have word-form counts that blow up English assumptions about vocabulary size and word boundary. Data availability: annotated corpora for syntax, semantics, and coreference exist primarily for English and a small set of European and East Asian languages; most languages have at best a POS-tagged corpus of a few thousand sentences. Evaluation: many benchmarks are English-only or machine-translated from English, which introduces translation artefacts and under-represents the linguistic phenomena particular to other languages.

Multilingual pretrained models are the current centre of gravity. mBERT (Devlin et al. 2019) was BERT pretrained on Wikipedia in 104 languages with no explicit cross-lingual supervision; it turned out to produce usable cross-lingual transfer, a finding that surprised the field. XLM-R (Conneau et al. 2020) scaled up the approach to 100 languages of CommonCrawl and decisively outperformed mBERT on cross-lingual benchmarks. Meta's NLLB-200 (2022) trained a 54-billion-parameter translation model covering 200 languages, with heavy investment in low-resource data mining. Modern multilingual LLMs (Llama, Mistral, Claude, GPT-4) cover dozens to hundreds of languages with quality that varies sharply by resource level.

Practical guidance for building multilingual NLP in 2026: audit the tokeniser for your target languages before anything else; prefer models pretrained specifically multilingually over English-heavy ones; use Universal Dependencies treebanks where they exist; be sceptical of benchmarks that are translated rather than natively produced; and recognise that cross-lingual transfer — pretraining on many languages and fine-tuning on English-only task data — works surprisingly well for typologically similar languages and surprisingly badly for distant ones. The multilingual frontier is less about language count and more about the quality of coverage on the long tail.
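The tokeniser audit can be made concrete as a fertility check: tokenise a small parallel corpus and compare per-language token counts against a reference language. A sketch of the metric; the character-level tokeniser in the demo is a stand-in, and a real audit would plug in the model's own tokeniser (a SentencePiece or Hugging Face `tokenizers` instance) and natively produced parallel text:

```python
from typing import Callable, Dict, List

def fertility(tokenize: Callable[[str], List[str]],
              parallel: Dict[str, List[str]],
              reference: str = "en") -> Dict[str, float]:
    """Token count per language, relative to a reference language.

    `parallel` maps language codes to aligned sentence lists. A fertility
    well above 1.0 means the tokeniser fragments that language harder than
    the reference, inflating sequence length, latency, and cost.
    """
    counts = {lang: sum(len(tokenize(s)) for s in sents)
              for lang, sents in parallel.items()}
    return {lang: count / counts[reference] for lang, count in counts.items()}

# Character-level stand-in tokeniser (`list` splits a string into characters);
# swap in the real tokeniser's encode function to audit an actual model.
ratios = fertility(list, {"en": ["the cat sat"], "de": ["die katze sass"]})
print(ratios)  # 'de' comes out above 1.0 under this stand-in
```

The same function, driven by the real tokeniser over a few hundred aligned sentences, is usually enough to spot a target language that will cost several times the English token budget.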

A humbling fact: many of the "universal" properties NLP researchers thought they had discovered about language — that word order is mostly SVO, that syntax is mostly projective, that word boundaries are marked by whitespace — are actually English-specific or Indo-European-specific. Multilingual work is one of the best ways to calibrate which claims about "language" are really claims about the small handful of languages that most NLP history was done in.

§17

Treebanks and annotated resources

The Penn Treebank, OntoNotes, Universal Dependencies, and the dozens of other annotated corpora are the substrate on which statistical NLP was built and on which modern NLP still trains its evaluators. Annotation is slow, expensive, consequential, and easy to underestimate.

A treebank is a corpus of sentences annotated with their syntactic structure. More broadly, an annotated corpus is a corpus in which some linguistic phenomenon has been labelled by humans according to a documented scheme. The annotated corpora that NLP has built over the last forty years are the field's most durable contribution: they outlast specific models, they are the benchmark against which progress is measured, and they are the only way to evaluate what a system claims to understand.

The Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) is the single most influential annotated resource in NLP history. It annotates about a million words of Wall Street Journal text with POS tags and constituency parse trees, following a carefully documented set of guidelines (the Penn Treebank II bracketing guidelines ran to over 300 pages). Essentially every syntactic parsing paper from 1995 to 2015 used a Penn Treebank split (train: sections 02-21, dev: 22, test: 23) as its evaluation, and the benchmark is still used for comparison even in the transformer era. The cost of producing it — roughly fifteen person-years of trained linguistic annotators — set a template for subsequent projects.
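The canonical WSJ split is mechanical enough to write down. A sketch, assuming the standard convention described above; sections outside the canonical split (00, 01, 24) were used inconsistently across papers, so they are labelled `other` here:

```python
def ptb_split(section: int) -> str:
    """Canonical WSJ experimental split: train 02-21, dev 22, test 23."""
    if not 0 <= section <= 24:
        raise ValueError(f"WSJ has sections 00-24, got {section}")
    if 2 <= section <= 21:
        return "train"
    if section == 22:
        return "dev"
    if section == 23:
        return "test"
    return "other"  # sections 00, 01, 24: conventions varied by paper

print(sorted(s for s in range(25) if ptb_split(s) == "train"))  # 02 through 21
```

Two decades of parsing numbers are comparable largely because everyone froze this one mapping.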

A generation of subsequent resources extended the picture. OntoNotes (Weischedel et al. 2011) added POS tags, parse trees, predicate-argument structure (PropBank), word senses (OntoNotes senses, linked to WordNet), coreference chains, and named entities over a 2.6-million-word corpus spanning news, broadcast, and telephone conversation in English, Chinese, and Arabic. PropBank, NomBank, and FrameNet added semantic-role annotations (§13). Universal Dependencies (§10) has extended dependency annotation to more than 150 languages with consistent guidelines. Named-entity corpora — CoNLL-2003, OntoNotes — supported the NER benchmarks. Natural-language inference corpora — SNLI, MultiNLI — supplied the entailment benchmarks that dominated 2017–2019 evaluation. GLUE and SuperGLUE packaged multiple tasks into single benchmarks. MMLU, BIG-Bench, and HELM are the LLM-era successors.

What is easy to miss from inside the neural revolution is how much the modern benchmark landscape still rests on work done in the 1990s and 2000s. The annotation guidelines, the annotator-agreement studies, the error analyses, and the evaluation protocols were developed on Penn-Treebank-era corpora and have propagated into every generation of benchmarks since. When a modern LLM paper reports an F1 score on CoNLL-2003, or a labelled attachment score on Universal Dependencies, or a coreference F1 on OntoNotes, it is drawing on an annotation substrate whose intellectual design predates the model by two or three decades.

A practical consequence: the annotation scheme you evaluate against silently shapes what counts as success. PTB POS tags carry English-newswire assumptions; OntoNotes coreference has conventions about generic NPs that disagree with other schemes; Universal Dependencies made deliberate, sometimes controversial, choices about what to annotate as head and what as dependent. Reading the annotation guidelines is frequently the highest-leverage thing a researcher can do on a new benchmark, and is nearly always skipped.

§18

Where it compounds in ML

NLP fundamentals compound through every modern language model. Tokenisers are subword segmenters; data curation depends on POS taggers and parsers; evaluation still runs on treebank-era benchmarks; failure diagnosis reaches into lexical semantics and pragmatics. The substrate outlasts any particular architecture.

The premise of this chapter is that language has layered structure, and any useful system has to touch those layers — explicitly in classical pipelines, implicitly in modern end-to-end models. The premise of Part VI is that the rest of the curriculum follows: Chapter 02 revisits the classical-NLP toolkit that this chapter introduced, showing how statistical methods used these primitives to build systems that worked at scale before neural methods arrived. Chapter 03 takes up word embeddings and distributional semantics, the bridge between the discrete lexical semantics covered here and the continuous vector spaces of modern models. Chapter 04 covers the transformer architecture — the substrate of every modern language model — from the self-attention mechanism up through the full encoder-decoder and decoder-only variants. Chapter 05 turns to pretraining paradigms — masked LM, causal LM, denoising objectives, encoder vs. decoder architectures. Chapters 06 through 10 cover the LLM era proper: scale and emergence, instruction tuning and alignment, parameter-efficient fine-tuning, retrieval-augmented generation, and LLM evaluation.

Four places where this chapter's material compounds most directly. Tokenisers are the frontier: every modern model starts with a subword tokeniser (§4), and the tokeniser is the first thing to audit when diagnosing multilingual or domain-specific failures. Data curation leans on classical tools: deduplication, quality filtering, language identification, PII removal, profanity filtering, and the heuristics that shape a pretraining corpus all use POS taggers (§8), parsers (§§9–11), and language models of various sorts. Without the classical-NLP tooling, the pretraining datasets that modern LLMs depend on could not be built. Evaluation still relies on treebank-era benchmarks (§17): when an LLM paper reports on POS tagging, NER, coreference, or syntactic probing, it is measuring against annotations that linguists built decades ago. Failure analysis reaches into every level of the stack: a model that confuses agent and patient is failing at SRL; a model that gets pronouns wrong is failing at coreference; a model that misses irony is failing at pragmatics.

A less obvious compounding: the framing of language in layers has shaped how the field thinks, and that framing persists even when the models that implement it have collapsed the layers. When a modern practitioner says "the model hallucinated" or "the model failed to ground" or "the model got the quantifier scope wrong", they are reaching for linguistic distinctions that this chapter's toolkit makes precise. The vocabulary of classical NLP is the vocabulary in which the problems of modern NLP are still described, even when the implementations have moved on.

The next chapter, Classical NLP, takes up the statistical methods that filled the gap between rule-based systems and neural ones — n-gram language models, naive Bayes and logistic-regression classifiers, Hidden Markov Models and Conditional Random Fields, the IBM models for statistical machine translation, latent Dirichlet allocation for topic modelling, and the pipelines (spaCy, NLTK, Stanford CoreNLP) that packaged these methods for production use. These techniques are not historical curiosities: they still power production search, classification, and extraction systems, and they are the natural baseline against which any neural claim has to be measured.

Further reading

The literature of NLP fundamentals is deep and uneven — the textbooks are old and excellent, the foundational papers are scattered across four decades and a dozen venues, and the modern reading list has to cover both the classical primitives (tokenisation, POS, parsing, coreference) and the way those primitives re-emerge inside transformer internals. The list below leans on anchor textbooks for the big picture, foundational papers for the primitives themselves, modern extensions for subword tokenisation and multilingual pretraining, and software entries for the libraries that make all of this operational.

Anchor textbooks

Textbook
Dan Jurafsky, James H. Martin. Speech and Language Processing. 3rd edition draft, Stanford, ongoing.

The canonical NLP textbook. Covers every topic in this chapter at textbook depth, with the authority of thirty years of the field running through it. The freely available drafts are the single most useful resource a serious NLP student can bookmark.

Textbook
Christopher D. Manning, Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

The classic statistical-NLP reference. Dated in its model choices but unsurpassed in its exposition of the primitives: counting, smoothing, HMMs, PCFGs, information-theoretic methods. Still the book to read to understand why the field moved the way it did.

Textbook
Emily M. Bender, Alex Lascarides. Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics. Morgan & Claypool, 2019.

A short, dense overview of the linguistic concepts that NLP practitioners routinely misuse. Companion to Bender's earlier volume on morphology and syntax; together they are the best linguistic briefing for a practitioner with an engineering background.

Textbook
Yoav Goldberg. Neural Network Methods for Natural Language Processing. Morgan & Claypool, 2017.

The bridging text between classical NLP and the neural era. Covers word embeddings, RNNs, CNNs for text, and structured prediction with neural methods. Predates transformers but provides exactly the vocabulary and intuition for reading the modern literature.

Textbook
Christiane Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, 1998.

The canonical reference for WordNet — its structure, its relations, and the linguistic principles behind it. Twenty-plus years old and still the right starting point for anyone building or using a lexical-semantic resource.

Textbook
Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python. O'Reilly, 2009.

The NLTK book. Predates modern deep learning by a decade but remains the best hands-on introduction to classical NLP primitives — tokenisation, tagging, parsing, WordNet, corpus linguistics — with exercises that make the concepts stick.

Foundational papers

Paper
Noam Chomsky. Syntactic Structures. Mouton, 1957.

The 118-page book that inaugurated generative grammar and, more broadly, the modern study of syntax. Not directly a computational paper, but the intellectual ancestor of every constituency-based parsing formalism since.

Paper
Zellig S. Harris. Distributional Structure. Word 10, 1954.

The short paper that stated the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. The intellectual foundation of every vector-space semantics, from LSA to word2vec to contextual embeddings.

Paper
Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 1993.

The paper introducing the Penn Treebank. The single most influential annotated resource in NLP history; the benchmark for POS tagging, constituency parsing, and by extension much of the statistical-NLP era.

Paper
Martin F. Porter. An Algorithm for Suffix Stripping. Program 14(3), 1980.

The original Porter stemmer paper. A deliberately simple rule-based system for English stemming that has been reimplemented more often than any other NLP algorithm. The pattern of "deterministic rules that work surprisingly well" defined a generation of IR practice.

Paper
Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 1989.

The definitive tutorial on HMMs. Written for speech recognition but quickly adopted across NLP for POS tagging, named-entity recognition, and any sequence-labelling task. The standard reference for the Baum–Welch and Viterbi algorithms.

Paper
Michael Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis / Computational Linguistics, 1999–2003.

The Collins parser, one of the defining works of the statistical parsing era. Introduced lexicalised PCFGs with head-driven conditioning and held state-of-the-art on Penn Treebank parsing for most of a decade.

Paper
Dan Klein, Christopher D. Manning. Accurate Unlexicalized Parsing. ACL 2003.

The counter-claim to Collins: careful refinement of unlexicalised PCFGs — parent annotation, horizontal Markovisation, state splitting — can match lexicalised parsers. The paper that made the Stanford Parser a widely used tool.

Paper
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein. Learning Accurate, Compact, and Interpretable Tree Annotation. ACL 2006.

The Berkeley parser. Learns latent subcategories of nonterminals via split-merge EM, producing the most accurate unlexicalised PCFG of its era. The final flourishing of purely generative statistical parsing before discriminative and neural methods took over.

Paper
Joakim Nivre. An Efficient Algorithm for Projective Dependency Parsing. IWPT 2003.

The paper that introduced transition-based (shift-reduce) dependency parsing to the computational literature. The basis of MaltParser and of essentially every fast incremental dependency parser since.

Paper
Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič. Non-projective Dependency Parsing using Spanning Tree Algorithms. HLT/EMNLP 2005.

The graph-based alternative to Nivre's transition-based approach. Frames parsing as finding the maximum spanning tree over a scored arc graph. The intellectual ancestor of the Dozat–Manning biaffine parser.

Paper
Martha Palmer, Daniel Gildea, Paul Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 2005.

The PropBank paper. Lays out the annotation scheme for verbs, roles, and frame files, and establishes semantic role labelling as a tractable NLP subtask. The foundation of essentially all subsequent SRL research.

Paper
Collin F. Baker, Charles J. Fillmore, John B. Lowe. The Berkeley FrameNet Project. COLING-ACL 1998.

The FrameNet paper. Extends Fillmore's frame semantics into a computational resource. Richer than PropBank in its theoretical commitments; less widely used in practice but perennially influential.

Paper
William C. Mann, Sandra A. Thompson. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text, 1988.

The foundational paper for RST, the dominant computational framework for discourse structure. Formalises the idea that texts are trees of elementary discourse units linked by rhetorical relations.

Paper
H. Paul Grice. Logic and Conversation. Syntax and Semantics vol. 3, 1975.

The founding paper of pragmatics as a formal field. Introduces the Cooperative Principle and the four maxims (quantity, quality, relation, manner), and the theory of conversational implicature that still organises pragmatic analysis.

Paper
John L. Austin. How to Do Things with Words. Harvard University Press, 1962.

The book that introduced speech-act theory. Performative utterances, locutionary/illocutionary/perlocutionary distinctions, and the philosophical framework that pragmatics has argued about ever since.

Paper
Joakim Nivre et al. Universal Dependencies v1. LREC 2016.

The introduction of Universal Dependencies, the annotation scheme that now covers 150+ languages. The paper lays out the design principles — consistent POS tagset, morphological features, dependency relations — and the project that has sustained them since.

Modern extensions & deep dives

Paper
Rico Sennrich, Barry Haddow, Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. ACL 2016.

The paper that introduced BPE to NLP. Short, clear, and responsible for the subword tokenisation scheme that now underlies every major transformer. A model of how to introduce a small, obviously useful idea to a field.

Paper
Taku Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.

The Unigram LM tokeniser paper. Introduces probabilistic segmentation and the idea that sampling multiple tokenisations during training improves robustness. The basis of SentencePiece's Unigram mode.

Paper
Taku Kudo, John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018 (demo).

The SentencePiece system paper. Language-agnostic subword tokenisation that operates directly on raw text. Widely adopted as the tokeniser of choice for multilingual and non-English systems.

Paper
Timothy Dozat, Christopher D. Manning. Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.

The biaffine attention parser. Replaces feature engineering with neural scoring of arc candidates; reaches very high accuracy with clean architecture. The basis of most modern graph-based dependency parsers.

Paper
Kenton Lee, Luheng He, Mike Lewis, Luke Zettlemoyer. End-to-End Neural Coreference Resolution. EMNLP 2017.

The paper that reframed coreference as an end-to-end learnable task, jointly identifying mentions and clustering them. Every modern neural coreference system traces back to this one.

Paper
Luheng He, Kenton Lee, Mike Lewis, Luke Zettlemoyer. Deep Semantic Role Labeling: What Works and What's Next. ACL 2017.

The deep-BiLSTM SRL paper that established that neural methods without syntactic input could match pipelined systems. A pivotal moment for the "do we really need parsers?" question.

Paper
Nikita Kitaev, Dan Klein. Constituency Parsing with a Self-Attentive Encoder. ACL 2018.

The Berkeley Neural Parser. Combines self-attention encoding with chart decoding to reach state-of-the-art on constituency parsing. A clean demonstration that transformer-style encoders are compatible with classical structured decoding.

Paper
John Hewitt, Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. NAACL 2019.

The paper that showed, with striking clarity, that BERT representations linearly encode parse-tree distances. A landmark in the probing literature and the strongest single piece of evidence that transformers learn classical syntactic structure implicitly.

Paper
Ian Tenney, Dipanjan Das, Ellie Pavlick. BERT Rediscovers the Classical NLP Pipeline. ACL 2019.

The companion probing paper. Shows that different layers of BERT encode different linguistic abstractions — POS and morphology in early layers, syntax in middle layers, semantics and coreference in later layers — recapitulating the classical pipeline internally.

Paper
Telmo Pires, Eva Schlinger, Dan Garrette. How multilingual is Multilingual BERT? ACL 2019.

The paper that characterised mBERT's cross-lingual transfer abilities and limitations. Essential reading for understanding what multilingual pretraining does and does not give you for free.

Paper
Alexis Conneau et al. Unsupervised Cross-lingual Representation Learning at Scale (XLM-R). ACL 2020.

The XLM-R paper. Scales multilingual pretraining to 100 languages of CommonCrawl; decisively outperforms mBERT on cross-lingual benchmarks; sets the template for every subsequent multilingual encoder.

Paper
NLLB Team. No Language Left Behind: Scaling Human-Centered Machine Translation. Meta AI, 2022.

The NLLB-200 paper. 200 languages, heavy investment in data mining and human evaluation for low-resource languages, and the most ambitious single multilingual translation project to date.

Paper
Emily M. Bender, Alexander Koller. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020.

The "octopus paper". A polemical but widely read argument that pretraining on form alone cannot produce genuine meaning. Required reading for the philosophical debates around what LLMs do and do not understand.

Survey
Sabrina J. Mielke et al. Between Words and Characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv:2112.10508, 2021.

A thoughtful survey of the entire tokenisation landscape — word-level, character-level, byte-level, subword — and the trade-offs among them. The best single reference for understanding why tokenisers look the way they do.

Paper
Emily M. Bender et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021.

The "stochastic parrots" paper. Controversial, influential, and still the starting point for discussions of scale, environmental cost, data provenance, and the sociolinguistic assumptions baked into large pretraining corpora.

Software & systems

Software
Matthew Honnibal, Ines Montani. spaCy: Industrial-Strength Natural Language Processing. Explosion AI, 2015–present.

The production NLP library. Fast, well-designed, and covers the full classical pipeline (tokenisation, POS tagging, parsing, NER, lemmatisation) for dozens of languages. The default choice for engineers who need classical NLP in a production system.

Software
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning. Stanza: A Python NLP Toolkit for Many Human Languages. ACL 2020 (demo).

The Stanford NLP Group's Python toolkit, successor to CoreNLP. UD-compliant parsers for 70+ languages, with carefully curated per-language models. The tool of choice for academic NLP that needs non-English coverage.

Software
Steven Bird, Edward Loper, Ewan Klein. NLTK: The Natural Language Toolkit. 2001–present.

The long-running Python NLP library. Broader in pedagogical coverage than spaCy (includes classic algorithms, corpus readers, WordNet interfaces) and less optimised for production. The teaching tool of choice for intro NLP courses.

Software
Taku Kudo, John Richardson. SentencePiece. Google, 2018–present.

The language-agnostic subword tokeniser that powers most modern multilingual LLMs. Supports BPE and Unigram LM training, operates on raw text, and has the cleanest semantics of any widely used tokeniser.

Software
Hugging Face. tokenizers. Hugging Face, 2019–present.

The Rust-backed tokeniser library that ships with Transformers. Implements BPE, WordPiece, Unigram, byte-level BPE, and the full tokeniser training pipeline with good throughput. The de facto standard for training new tokenisers today.

Software
Universal Dependencies Consortium. Universal Dependencies treebanks. universaldependencies.org, 2015–present.

The living collection of UD treebanks, now covering 150+ languages at varying sizes and quality levels. The single most important non-proprietary multilingual NLP resource and the substrate for most cross-lingual parsing research.