Before BERT there was TF-IDF. Before GPT there were n-gram language models. Before the transformer architecture turned every NLP task into matrix multiplications on contextual embeddings, a generation of researchers built working language technology out of counts, sparse features, and carefully parameterised probabilistic models. That era — roughly 1988 through 2015 — produced the statistical machine translation systems that first made MT usable, the spam filters that made email survivable, the information extraction pipelines that populated the earliest knowledge graphs, and the search engines that indexed the web. Its methods — bag of words, TF-IDF, n-gram language models, naive Bayes, logistic regression, HMMs and CRFs for sequence labelling, PCFGs for parsing, LDA for topic modelling, IBM-model phrase-based SMT — look simple next to a modern transformer, and on many tasks they still win. Spam filters run naive Bayes. Full-text search runs BM25. Most of the named-entity extraction that feeds production IE systems runs CRF-over-features pipelines. And the modern neural stack did not replace the classical toolkit so much as stack on top of it: every retrieval-augmented LLM is a BM25 or dense-retrieval front end on a transformer back end; every tokeniser is the direct descendant of classical subword segmentation; every benchmark is scored with classical-era metrics like precision, recall, F1, BLEU, and ROUGE. This chapter walks through the classical NLP toolkit as a coherent body of engineering technique. 
We cover text representations (bag of words, TF-IDF, n-grams), the probabilistic models built on top of them (n-gram LMs, naive Bayes, logistic regression, maximum entropy), the sequence models that did POS tagging and NER (HMMs and CRFs), the extraction and translation pipelines that industrialised the field (NER, relation extraction, statistical MT), the unsupervised methods that let you say something about a corpus without labels (LDA, classical IR), the evaluation metrics that the whole stack is still scored by, and the software pipelines (NLTK, CoreNLP, OpenNLP, spaCy) that packaged it all for production.
Sections one through five cover text representations and the language models built directly on top of them. Section one makes the case for studying classical NLP in 2026 — not as history, but as the set of techniques that still dominate production for most NLP tasks and that define what any neural method has to beat. Section two introduces bag of words — the simplest text representation, its surprising durability, and the sparse-feature intuition that runs through the whole classical stack. Section three is TF-IDF — the weighting scheme that turns raw counts into informative features, the foundation of most classical search, and the single most reused formula in the NLP toolkit. Section four covers n-grams — contiguous sequences of tokens, the representation that captured local context before embeddings did. Section five is the n-gram language model — Shannon's original idea of language as a Markov chain, the smoothing methods (add-k, Good-Turing, Kneser-Ney) that made it work, and the perplexity measure that is still the right first evaluation for any generative system.
Sections six through eight cover the classification side of the field. Section six is naive Bayes — the generative classifier that, despite its patently wrong independence assumption, continues to beat more elaborate models on small-data text classification and is still the canonical method for spam filtering. Section seven is logistic regression (and, equivalently, maximum entropy) — the discriminative workhorse of classical text classification, the model every modern text classifier is ultimately compared against, and the conceptual bridge to neural models. Section eight is maximum entropy and log-linear models more broadly — the unifying framework that subsumes logistic regression and underpins MEMMs and CRFs; historically important because it set the intellectual template for discriminative sequence models.
Sections nine through twelve cover sequence labelling and structured extraction. Section nine is Hidden Markov Models — the generative sequence model that dominated POS tagging and early NER, and whose Baum-Welch and Viterbi algorithms are still the right reference for anyone learning sequence modelling. Section ten is Conditional Random Fields — the discriminative successor to HMMs, the standard model for sequence labelling from 2001 until BiLSTM-CRF hybrids in 2015, and still the workhorse under many production NER systems. Section eleven is named entity recognition — the classical IE task that productionised sequence labelling, the CoNLL-2003 benchmark that shaped a decade of research, and the pipeline (feature templates, gazetteers, CRF decoder) that still ships in most real systems. Section twelve is relation extraction and open IE — the task of turning free text into structured (subject, relation, object) triples, the methods that fed classical knowledge-base construction, and the open-IE systems (ReVerb, Ollie, OpenIE 5) that tried to do it without a fixed schema.
Sections thirteen through fifteen cover the methods that worked without supervision, and the translation systems that dominated the field's most visible application. Section thirteen is topic modelling — the Latent Dirichlet Allocation family and the broader question of what it means to find themes in a corpus without labels, the probabilistic graphical model template that influenced a generation of papers, and the reason LDA is still the default baseline for exploratory text analysis. Section fourteen is classical information retrieval — the vector-space model, the BM25 scoring function that still dominates text search, and the evaluation framework (precision, recall, MAP, nDCG) that shaped how the field thinks about ranking. Section fifteen is statistical machine translation — the IBM models, phrase-based SMT, the decoder search that made MT feasible at Google scale, and the story of a subfield that was completely displaced by neural methods in a five-year window but whose conceptual infrastructure remains.
Sections sixteen through eighteen close the chapter. Section sixteen covers evaluation — precision, recall, F1, accuracy, BLEU, ROUGE, perplexity, and the methodological arguments (train-test splits, cross-validation, significance testing) that classical NLP got right and that modern LLM work sometimes gets wrong. Section seventeen surveys pipelines and toolkits — NLTK, Stanford CoreNLP, OpenNLP, spaCy, Gensim, scikit-learn — the software through which most people still encounter classical NLP and that defined how the field packaged its methods. The closing section places classical NLP in the wider landscape: as the substrate under most production NLP systems, as the baseline every neural claim has to beat, as the toolkit that RAG systems and data-cleaning pipelines still run on, and as the intellectual foundation on which the field's evaluation, tooling, and engineering habits are built.
Classical NLP is not a history chapter. It is the baseline every neural claim has to beat, the stack under every production retrieval system, and the engineering toolkit that still powers most text classification, extraction, and search that runs on real traffic.
The distinction between "classical" and "neural" NLP is less clean than the narrative suggests. The classical era produced statistical methods — bag of words, TF-IDF, n-gram models, naive Bayes, logistic regression, HMMs, CRFs, LDA, BM25, phrase-based SMT — that dominated the field from roughly 1988 to roughly 2015. The neural era produced embedding-based, recurrent, and transformer-based methods that have largely supplanted the older approaches on benchmarks and on the research frontier. But "largely supplanted" is not "entirely supplanted", and the sites where the classical methods still win — or at least still pay — are more numerous than a benchmarks-focused reading would suggest.
Four places where classical NLP is still the right answer. Small-data text classification: for a domain-specific classifier with a few hundred or few thousand training examples, naive Bayes or a linear SVM over TF-IDF features often matches or beats a fine-tuned transformer, at a thousandth the cost and with interpretable weights. Full-text search: BM25 is the ranking function of nearly every production search engine, from Elasticsearch and Solr to custom in-house systems; modern dense-retrieval systems are typically hybrids of BM25 and neural encoders, not replacements. Production extraction at scale: CRF-based NER, rule-and-gazetteer pipelines, and hand-tuned pattern extractors run most of the entity extraction in real industrial systems, often for reasons of latency, throughput, or interpretability that neural methods have not solved. Data pipelines for training neural models: the filters, deduplicators, quality scorers, and language identifiers that curate the trillion-token corpora LLMs are trained on are almost entirely built from classical components — logistic regression classifiers, MinHash deduplication, n-gram language models, naive Bayes quality filters.
There is a second, less tangible reason classical NLP matters: it is the intellectual background without which the neural methods are hard to evaluate. When a paper claims a transformer "understands" negation, or "handles" long-range dependencies, or "reasons" about quantities, the only way to assess the claim is against the classical methods' baselines on the same task. And when a production system mysteriously regresses, the diagnostic toolkit — held-out sets, confusion matrices, precision-recall curves, feature ablations — is the classical toolkit. The modern practitioner who has never trained a logistic regression on TF-IDF features, never computed perplexity of an n-gram model, never seen the exact difference between precision and recall on a held-out set, is consistently a worse debugger than one who has.
Key idea. Classical NLP persists because it is cheaper, more interpretable, often more accurate on small data, and still the right substrate for retrieval, classification, and extraction at production scale. It is also the evaluation framework against which neural claims are (or should be) measured. This chapter treats it as a live engineering toolkit, not a historical curiosity.
Represent a document as a sparse vector of word counts. Ignore word order, grammar, and meaning. It sounds comically simplistic, and yet it is one of the most durable ideas in NLP.
The bag-of-words (BoW) representation treats a document as an unordered multiset of its tokens: the vector x for a document has one entry per vocabulary word, and the value is how many times that word appears in the document. Word order is discarded; syntax is discarded; any notion of linguistic structure is discarded. What remains is a fixed-length numeric representation (sparse, because most words do not appear in most documents) that any downstream machine-learning method can consume.
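As a minimal illustration (the tiny corpus and the `bag_of_words` helper are ad hoc, not taken from any toolkit), the construction fits in a few lines of Python:

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real pipelines use proper tokenisers.
    return text.lower().split()

def bag_of_words(doc, vocab):
    # Sparse representation: only words in the vocabulary that occur get an entry.
    return dict(Counter(w for w in tokenize(doc) if w in vocab))

corpus = ["the cat sat on the mat", "the dog bit the man", "man bites dog"]
vocab = {w for doc in corpus for w in tokenize(doc)}

vec = bag_of_words("dog bites man", vocab)
# Word order is discarded: "dog bites man" and "man bites dog" have the same bag.
same = bag_of_words("man bites dog", vocab) == vec
```

The `vocab` set stands in for the frequency-capped vocabulary a real system would build in a first pass over the corpus.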
The durability of BoW is the central surprise of statistical NLP. Discarding word order should, intuitively, be devastating — the sentence "dog bites man" has the same bag as "man bites dog" — and yet for a wide range of tasks the loss is small to negligible. Sentiment classification of movie reviews, spam filtering, topic classification, document clustering, authorship attribution: all of these work well with BoW features and a simple classifier, often matching much more elaborate models. The reason is that for tasks where the content of a document matters more than its construction, the bag of words captures enough of the content signal. "Dog bites man" and "man bites dog" differ in news value but contain almost the same content, and a content-focused classifier does not need to distinguish them.
BoW comes with a small family of design decisions that matter in practice. Vocabulary selection: the full set of word types in a corpus is typically hundreds of thousands to millions, most of them too rare to be useful; vocabularies are usually capped by frequency or by a minimum document-frequency threshold. Stopword removal: function words (the, of, and, a) contribute little to most content tasks and are often dropped, though for some tasks (authorship attribution, stylistic analysis) they are the most informative features. Normalisation: lowercasing, stemming, lemmatisation all reduce the effective vocabulary by collapsing surface forms. Binary vs count: for some tasks, whether a word appears is enough; for others, how many times it appears is informative. The choice of BoW variant is almost always task-specific and is typically settled by cross-validation rather than principle.
The obvious limitation is the loss of order and structure. A BoW representation cannot distinguish "I loved it" from "I did not love it"; cannot handle negation, scope, quantification, or any phenomenon that depends on how words combine. The n-gram extensions (§4) partly address this by including pairs and triples of adjacent tokens. The deeper limitation — that words in isolation lose much of their meaning — is what word embeddings eventually addressed, and what contextual embeddings finally overcame. But the conceptual frame of "document as vector, classifier applies to vector" has persisted: every modern text classifier, including transformer-based ones, is ultimately doing a more sophisticated version of what BoW started.
The practical rule of thumb: if you can frame a task as "does this document contain content X or content Y", BoW + a linear classifier will give you a strong baseline with minimal effort. If the task involves compositional meaning — negation, comparison, reasoning — BoW is inadequate and you will need order-sensitive methods. Try the simple thing first; most of the time it is sufficient.
Raw counts over-weight common words. TF-IDF corrects for this by weighting each term's in-document frequency by the inverse of how common the term is across documents. It is the single most reused formula in classical NLP.
Term frequency–inverse document frequency is a weighting scheme that takes a bag-of-words vector and rescales each entry by two factors: how frequent the term is in this document (the TF component) and how rare the term is across the corpus (the IDF component). A word that appears often in one document and rarely across the corpus gets a high score; a word that appears frequently across all documents (the, of, and) gets a low score, regardless of how many times it appears in any single document. The intuition — that a word's informativeness for a document comes from it being both present here and absent from the average document — is the key insight, and it turns out to be remarkably robust.
Several formulations coexist. The simplest: TF is the raw count of the term in the document, IDF is log(N / df(t)) where N is the total number of documents and df(t) is the number of documents containing term t, and TF-IDF is their product. In practice, TF is often dampened with a logarithm (1 + log(tf)) so that the fiftieth occurrence of a word doesn't count fifty times as much as the first, and IDF is sometimes smoothed to avoid division by zero for unseen terms. The cosine similarity between two TF-IDF vectors — the dot product divided by the product of the norms — is the default similarity measure in classical IR and in many text-clustering pipelines.
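A sketch of one common formulation (damped TF, log IDF, cosine similarity) in plain Python; the toy documents and function names are illustrative:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document,
    # using damped TF (1 + log tf) and IDF = log(N / df(t)).
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * math.log(N / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    # Dot product over shared terms, divided by the product of the norms.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["stock", "prices", "fell"]]
vecs = tf_idf(docs)
sim_related = cosine(vecs[0], vecs[1])      # share "the" and "sat"
sim_unrelated = cosine(vecs[0], vecs[2])    # share nothing
```

Note that with this formulation a term appearing in every document gets IDF = log(1) = 0, which is exactly the down-weighting of the and of described above.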
TF-IDF has three durable applications. Information retrieval: rank documents by cosine similarity to a TF-IDF-weighted query vector; this is the vector-space model of classical IR (§14), and the backbone of pre-BM25 search engines. Text classification: use TF-IDF features as input to a linear classifier (logistic regression, SVM); this remains a strong baseline for most classification tasks. Keyword extraction: the highest-TF-IDF terms in a document are a reasonable approximation to its distinctive vocabulary, which is why TF-IDF still shows up in keyphrase-extraction pipelines.
The limitations are the usual BoW limitations amplified. TF-IDF does not capture synonymy (car and automobile have orthogonal TF-IDF vectors), polysemy (bank gets one weight across all senses), or word order. But it has one property that has kept it relevant: the features it produces are interpretable and auditable. You can read off which terms are driving a classification, which query terms matter for a retrieval, which words dominate a document cluster. This transparency matters in regulated settings, in debugging, and in any system where you need to explain a ranking to a user or a regulator. It is part of the reason TF-IDF has outlived the specific models it was paired with: every subsequent improvement has had to do more to justify its opacity.
BM25 (§14) is a TF-IDF descendant that adds document-length normalisation and TF saturation, and is the ranking function of most production search today. If you have any reason to care about search quality, TF-IDF is where to start and BM25 is where to end up.
An n-gram is a contiguous sequence of n tokens — unigram, bigram, trigram, 4-gram, 5-gram. Using n-grams as features recovers a slice of local word-order structure that bag of words throws away, and for many tasks the slice is enough.
A unigram is a single token; a bigram is two consecutive tokens; a trigram is three; a 4-gram is four; and so on. Given a sentence "the cat sat on the mat", its bigrams are (the, cat), (cat, sat), (sat, on), (on, the), (the, mat) and its trigrams are (the, cat, sat), (cat, sat, on), (sat, on, the), (on, the, mat). N-grams are the classical method of encoding local context without committing to any syntactic analysis: they capture which words appear next to each other, in what order.
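The extraction itself is a one-liner; this sketch reproduces the bigrams and trigrams of the example sentence:

```python
def ngrams(tokens, n):
    # All contiguous length-n slices of the token sequence, in order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```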
As features, n-grams are typically used in two ways. The first is as additional features on top of unigrams in a classifier: a text classifier trained on unigrams + bigrams can distinguish "not good" from "very good" in ways that pure unigram features cannot. The second is as the basis for n-gram language models (§5), where the n-gram's own frequency becomes the probability of seeing that sequence. The two uses share a common combinatorial challenge — the space of n-grams grows roughly as the vocabulary size to the nth power, and for n > 3 the vast majority of possible n-grams never occur in any corpus — so most n-gram-based systems cap n at 2 or 3 for classification and 3 to 5 for language modelling, with smoothing methods to handle unseen n-grams.
N-gram features have a mechanical appeal: they are language-independent, trivial to compute, and scale linearly in corpus size. They also have real limitations. They explode in dimensionality; they do not generalise across paraphrases (a classifier trained on "very good" has seen nothing about "exceptionally good"); and they lose the signal of long-range dependency entirely. For these reasons n-grams have been gradually displaced in neural-era NLP by embeddings, which share parameters across related words. But the n-gram concept is still present in the modern stack, sometimes in disguise: convolutional models for text (Kim 2014) are essentially learning to fire on interesting n-grams, and attention-based models can often be interpreted as learning to attend over specific n-gram patterns.
A practical observation from decades of text classification: adding bigrams on top of unigrams typically helps by a point or two of F1; adding trigrams often hurts because the sparsity overwhelms the added signal. The best working setup for most classification tasks is TF-IDF over unigrams + bigrams, capped at some frequency threshold, fed into a logistic regression or linear SVM. This is a strong baseline; try it on any new classification problem before reaching for anything more elaborate.
Character n-grams — where the "tokens" are individual characters rather than words — are a useful specialisation for tasks where morphology matters (language identification, handling misspellings, author attribution, code classification). Byte-level n-grams generalise further and are language-agnostic. Both show up in the data-pipeline tooling used to curate training corpora for modern LLMs.
Shannon's idea: assign a probability to every sequence of tokens, and approximate it with a Markov chain over n-grams. The model is simple, wrong in knowable ways, and still the right mental model for thinking about what a language model does.
A language model assigns a probability to every possible sequence of tokens. The goal is to answer questions like "what is the probability of this sentence under natural English?" (for tasks like speech recognition rescoring or spelling correction) and "what is the probability of the next token given the tokens so far?" (for generation and for anything that needs to score continuations). Classical language modelling approximates this with an n-gram model: assume that the probability of the next token depends only on the previous n-1 tokens, and estimate those conditional probabilities from corpus counts. A bigram model estimates P(w_i | w_{i-1}) as count(w_{i-1}, w_i) / count(w_{i-1}); a trigram model conditions on the previous two tokens; and so on.
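The maximum-likelihood estimate can be computed directly from counts. A minimal sketch (the `<s>` sentence-start marker and the helper names are illustrative conventions, not a standard API):

```python
from collections import Counter

def train_bigram_mle(sentences):
    # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split()
        unigrams.update(toks[:-1])          # contexts: every token that precedes another
        bigrams.update(zip(toks, toks[1:]))
    def p(w, prev):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

p = train_bigram_mle(["the cat sat", "the cat ran", "a dog ran"])
```

The unsmoothed estimate assigns probability zero to any unseen bigram, which is exactly the sparsity problem the next paragraphs address.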
Shannon (1948) introduced this line of thinking in his paper on the mathematical theory of communication, where he computed n-gram approximations to English and showed, famously, that higher-order n-gram models produce text that looks more and more English-like. The n-gram model was the workhorse of language modelling for the half-century that followed, dominating speech recognition, statistical MT, and spelling correction. Its reign ended in the mid-2010s when neural language models (first feed-forward, then LSTM, then transformer) took over; but n-gram models are still the right pedagogical starting point and still the right baseline for research claims about language modelling.
The central practical problem is sparsity: most n-grams never appear in any training corpus, even one of billions of tokens. A naive n-gram model would assign probability zero to any sentence containing an unseen n-gram, which is both obviously wrong and computationally fatal (log-probabilities blow up to negative infinity). Smoothing methods solve this by redistributing probability mass from seen n-grams to unseen ones. Add-one (Laplace) smoothing pretends every possible n-gram has been seen once; it is simple and catastrophically over-smooths. Good-Turing smoothing (Good 1953, brought into NLP through Katz's 1987 backoff model) estimates the total probability of unseen n-grams from the number of n-grams seen exactly once. Kneser-Ney smoothing (Kneser and Ney 1995) combines absolute discounting with a lower-order continuation model based on how many distinct contexts a word appears in, and was the dominant smoothing method until neural LMs took over. Modified Kneser-Ney (Chen and Goodman 1998) is the textbook state of the art.
N-gram language models are evaluated by perplexity: the exponentiated average negative log-likelihood per token on held-out text. Lower is better; a perplexity of 100 means the model is, on average, as uncertain as if it had to choose uniformly among 100 tokens. Perplexity is still the right first evaluation for any generative language model, and most research reports on LLMs still mention it. What changed in the neural era is the magnitude: state-of-the-art n-gram models on English newswire had perplexities in the 100-200 range; modern transformers can reach single digits on the same text. The reason is that neural models share parameters across related contexts (they know that "the cat sat on the" and "the dog sat on the" are similar), while n-gram models treat every n-gram independently.
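The two ideas fit together in a few lines. This sketch scores held-out sentences with an add-one-smoothed bigram model (add-one is used only because it is the simplest smoother; a real system would use Kneser-Ney, and the toy data is illustrative):

```python
import math
from collections import Counter

def perplexity(test_sentences, train_sentences):
    # Add-one-smoothed bigram model: P(w | prev) = (c(prev, w) + 1) / (c(prev) + V).
    # Perplexity = exp(mean negative log-likelihood per predicted token).
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for s in train_sentences:
        toks = ["<s>"] + s.split()
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)
    nll, n_tokens = 0.0, 0
    for s in test_sentences:
        toks = ["<s>"] + s.split()
        for prev, w in zip(toks, toks[1:]):
            prob = (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
            nll -= math.log(prob)
            n_tokens += 1
    return math.exp(nll / n_tokens)

train = ["the cat sat", "the cat ran", "the dog sat"]
ppl_seen = perplexity(["the cat sat"], train)   # fluent under the model
ppl_odd = perplexity(["sat dog the"], train)    # same words, scrambled order
```

The scrambled sentence gets a higher perplexity than the fluent one, which is the behaviour the metric is designed to capture.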
Key idea. An n-gram language model approximates the distribution over token sequences with an (n-1)th-order Markov chain. Its simplicity, its sparsity problem, and the smoothing methods that fixed it are the intellectual foundation on which every later language model is built. Modern LLMs are more sophisticated but they are doing conceptually the same job: scoring continuations.
Naive Bayes assumes features are conditionally independent given the class. The assumption is almost always wrong. The classifier works anyway, and for small-data text classification it is often the best thing you can do.
Naive Bayes is a generative classifier derived from Bayes' rule with one aggressive simplification: assume that features are conditionally independent given the class. Given a document represented as a bag-of-words vector x = (x_1, …, x_V) and a set of classes C, the classifier computes P(C | x) ∝ P(C) · ∏ P(x_i | C), where P(C) is the class prior (frequency of class C in training) and P(x_i | C) is the likelihood of word x_i appearing in a class-C document. At prediction time, pick the class that maximises the posterior. At training time, estimate the priors and likelihoods as maximum-likelihood counts from the training corpus, plus a small smoothing constant (add-one or add-α) to handle unseen words.
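The whole train-and-predict loop is short enough to sketch in full. This is a multinomial naive Bayes with add-α smoothing, computed in log space to avoid underflow (the spam/ham toy data and helper names are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    # labeled_docs: list of (token_list, label). Returns a predict function.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(doc)
        vocab.update(doc)
    V, N = len(vocab), len(labeled_docs)
    def predict(doc):
        scores = {}
        for c in class_counts:
            total = sum(word_counts[c].values())
            score = math.log(class_counts[c] / N)          # log prior
            for w in doc:                                  # log likelihoods, smoothed
                score += math.log((word_counts[c][w] + alpha) / (total + alpha * V))
            scores[c] = score
        return max(scores, key=scores.get)
    return predict

train = [("buy cheap pills now".split(), "spam"),
         ("cheap meds buy now".split(), "spam"),
         ("meeting agenda attached".split(), "ham"),
         ("see agenda for the meeting".split(), "ham")]
predict = train_nb(train)
```

Training is pure counting, which is why naive Bayes trains in a single pass over the data.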
The independence assumption is obviously wrong: words in a document are correlated, sometimes strongly so. And yet naive Bayes works — sometimes at or above the accuracy of more elaborate models. The reason is that naive Bayes only needs to get the decision boundary right: its predictions depend on which class has the higher posterior probability, not on whether the posterior is well calibrated. As long as the independence assumption introduces roughly equal bias across classes, the decision boundary can be approximately correct even when the probabilities themselves are wildly off. This is one of the most durable results in applied statistics: a wrong model with well-placed errors can out-predict a correct model with poorly-placed ones.
Two variants dominate text classification. Multinomial naive Bayes treats a document as a sample from a class-conditional multinomial distribution over words; the likelihood of a document is the product of word likelihoods, each raised to its count. Bernoulli naive Bayes treats each word as a binary feature (present or absent) and models each as class-conditional Bernoulli. Multinomial NB usually wins on long documents with many words; Bernoulli NB is often better on short documents where presence matters more than count. Both are trivially fast to train and to classify.
Naive Bayes has two applications that have defined it. The first is spam filtering: Paul Graham's 2002 essay "A Plan for Spam" popularised naive Bayes as the method of choice for email spam classification, and it became the backbone of spam filters industry-wide for the decade that followed. Modern spam systems have moved partly to neural methods but still often include a naive Bayes component. The second is small-data text classification: when you have hundreds or low thousands of labelled examples, a naive Bayes classifier over unigram + bigram features often matches or beats a fine-tuned transformer, at a tiny fraction of the cost. The modern recommendation is to always try naive Bayes as a baseline, not because it will be the best, but because the gap to fancier methods tells you how much the task actually needs fancier methods.
Naive Bayes is sometimes criticised for producing badly calibrated probabilities — predictions of 0.99 or 0.01 where the true probability is 0.7 or 0.3. The criticism is correct but often irrelevant. If the downstream use is ranking or decision-making, calibration doesn't matter. If it is fusion with other model outputs, calibration can be recovered via a simple post-hoc isotonic regression or Platt scaling.
Logistic regression is a discriminative linear classifier that models P(y | x) directly. It dominated text classification from the early 2000s onward because it handles correlated features gracefully, accepts arbitrary real-valued features, and produces calibrated probabilities.
Logistic regression is the canonical discriminative classifier for text. Where naive Bayes models the joint P(x, y) and inverts to get P(y | x), logistic regression models P(y | x) directly as a linear function of the features passed through a logistic (or, for multiclass, softmax) nonlinearity. For binary classification, P(y = 1 | x) = σ(w · x + b), where σ is the sigmoid. For multiclass, replace the sigmoid with softmax and learn a weight vector per class. Training maximises the log-likelihood of the training data under this model — equivalently, minimises cross-entropy loss — using gradient-based optimisation. The objective is convex, so any local optimum is global.
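A minimal sketch of binary logistic regression trained by batch gradient ascent with L2 regularisation (the toy data, learning rate, and epoch count are illustrative; production systems use LBFGS or SGD on sparse features):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=300, l2=0.01):
    # Maximise the log-likelihood: the gradient for each example is (y - p) * x.
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj + lr * (gj - l2 * wj) for wj, gj in zip(w, gw)]  # L2 shrinks weights
        b += lr * gb
    return w, b

# Toy 2-feature data: the label depends on feature 0 minus feature 1.
X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5 else 0
         for xi in X]
```

Because the objective is convex, this simple loop and a sophisticated optimiser reach the same solution, just at different speeds.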
The practical advantage over naive Bayes is that logistic regression does not assume feature independence. If two features are highly correlated (the words good and great, say), naive Bayes double-counts their evidence; logistic regression learns that they are redundant and adjusts the weights accordingly. The practical cost is that logistic regression requires iterative optimisation (LBFGS, stochastic gradient descent, or coordinate descent) rather than closed-form counting, and is more prone to overfitting. Regularisation is essential in high-dimensional feature spaces: L2 regularisation (ridge) penalises large weights and improves generalisation; L1 regularisation (lasso) drives many weights to exactly zero, producing sparse models.
Logistic regression's second advantage is feature flexibility. You can throw in any real-valued feature you like: word counts, TF-IDF weights, boolean gazetteer hits, character n-grams, POS bigrams, dependency paths, sentiment lexicon counts, topic scores. The classifier will figure out which features help and by how much. This is the feature-engineering loop that defined applied NLP for 15 years: devise a clever feature, add it to the model, measure the lift on a held-out set, iterate. By the late 2000s, the best-performing classifiers routinely used tens of thousands of hand-crafted features. Deep learning would eventually replace this loop with learned features, but the discipline of feature engineering remains invaluable for small-data problems and for understanding what a model has actually learned.
One important connection: logistic regression is equivalent to a one-layer neural network with a softmax output and no hidden layers. Every deep classifier you train today has a logistic (or softmax) regression at its top, operating on learned rather than hand-crafted features. Understanding logistic regression deeply is therefore not historical trivia — it is understanding the final layer of every text classifier in current use.
Practical recipe for a strong text classifier baseline: TF-IDF-weighted unigrams and bigrams, L2-regularised logistic regression, cross-validated regularisation strength. This will beat most neural baselines on datasets with fewer than ~10,000 training examples and many tasks with more.
Maximum entropy models — identical to multiclass logistic regression — were the computational-linguistics community's weapon of choice through the 2000s. The maxent framing generalises naturally to structured prediction and provides a clean information-theoretic justification.
Under a different name and a different intellectual tradition, maximum entropy modelling is the same object as multiclass logistic regression. The name comes from the guiding principle: given a set of constraints on the model's expected feature values (usually the empirical averages in the training data), choose the probability distribution that has maximum entropy subject to satisfying those constraints. Intuitively: match what the data tells you, but otherwise assume as little as possible. Lagrangian duality turns this constrained optimisation into an unconstrained one that has the log-linear form P(y | x) ∝ exp(w · φ(x, y)), where φ(x, y) is a joint feature function over inputs and outputs. Training maximises log-likelihood, same as logistic regression.
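The log-linear form is easy to make concrete. This sketch evaluates P(y | x) ∝ exp(w · φ(x, y)) with an indicator feature per (word, label) pair; the weights are hand-set for illustration, where a real system would learn them by maximising log-likelihood:

```python
import math

def phi(x, y):
    # Joint feature function: one binary indicator per (word, label) pair.
    return {(w, y): 1.0 for w in x}

def maxent_probs(x, labels, weights):
    # P(y | x) = exp(w . phi(x, y)) / Z, normalised over the label set.
    scores = {y: math.exp(sum(weights.get(f, 0.0) * v
                              for f, v in phi(x, y).items()))
              for y in labels}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

# Hypothetical weights: "goal" is evidence for sports, "election" for politics.
weights = {("goal", "sports"): 2.0, ("election", "politics"): 2.0}
p = maxent_probs(["goal", "scored"], ["sports", "politics"], weights)
```

With y a single class this is exactly multiclass logistic regression; the point of the φ(x, y) formulation is that nothing changes on paper when y becomes a sequence or a tree, only the cost of computing Z.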
Log-linear models — the general family of which maxent is one instance — underpin a remarkable amount of classical NLP. Most structured prediction models for tagging, parsing, and semantic role labelling through the 2000s were log-linear in one form or another, trained by generalised iterative scaling, LBFGS, or stochastic gradient descent. Berger, Della Pietra, and Della Pietra's 1996 paper "A Maximum Entropy Approach to Natural Language Processing" is the canonical reference, and Adam Berger's accompanying online tutorial is still the cleanest introduction.
The maxent framing has one real advantage over "logistic regression": it generalises cleanly to structured outputs. If y is not a single class but a structured object — a sequence, a tree, a graph — you can still define a log-linear distribution P(y | x) ∝ exp(w · φ(x, y)) and maximise regularised log-likelihood. The computational challenge becomes computing the partition function (the sum or integral over all possible structures) efficiently. For linear-chain sequences, dynamic programming (forward-backward) solves this in polynomial time. This is the mathematical move that leads directly from maxent to Conditional Random Fields (§10).
HMMs are the classical model for sequence labelling. A chain of hidden states emits observed words; given the observations, you recover the most likely hidden sequence. This is how POS tagging, speech recognition, and NER were done for two decades.
A Hidden Markov Model consists of a set of hidden states (e.g. part-of-speech tags or named-entity types), a set of observations (usually words), and three families of probabilities: initial-state probabilities P(s_1), transition probabilities P(s_t | s_{t-1}), and emission probabilities P(o_t | s_t). The Markov assumption says that each state depends only on the previous state; the emission assumption says that each observation depends only on the current state. These assumptions are obviously wrong — word choice depends on more than the immediately preceding tag — but they buy you tractability: you can compute the probability of any observation sequence in O(T · K²) time for a length-T sequence with K states via the forward algorithm, recover the most likely hidden sequence in the same time via the Viterbi algorithm, and train the model from unlabelled sequences via Baum-Welch (a specialisation of EM to HMMs).
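The Viterbi recursion is short enough to show in full. The sketch below works in log space to avoid underflow; the two-tag model and its probabilities are invented purely for illustration:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for `obs` under an HMM (log space)."""
    # V[t][s] = (best log score of any path ending in state s at time t, that path)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            # Choose the best predecessor for state s at time t.
            score, path = max(
                (V[t - 1][p][0] + math.log(trans_p[p][s]) +
                 math.log(emit_p[s][obs[t]]), V[t - 1][p][1])
                for p in states)
            V[t][s] = (score, path + [s])
    return max(V[-1].values())[1]

# Invented two-tag toy model: determiners vs nouns.
states = ["DET", "NOUN"]
start = {"DET": 0.8, "NOUN": 0.2}
trans = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.5, "NOUN": 0.5}}
emit = {"DET": {"the": 0.9, "dog": 0.05, "cat": 0.05},
        "NOUN": {"the": 0.1, "dog": 0.5, "cat": 0.4}}
print(viterbi(["the", "dog"], states, start, trans, emit))  # ['DET', 'NOUN']
```

The double loop over states at each time step is where the O(T · K²) cost comes from; the forward algorithm has exactly the same shape with max replaced by sum.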
For labelled data, training is trivial: estimate transition and emission probabilities by counting (with smoothing). For unlabelled data, Baum-Welch alternates between computing expected state occupancies (forward-backward) and re-estimating parameters from those expectations. Convergence is to a local optimum of the likelihood, which is sensitive to initialisation. In practice, labelled training almost always wins, and Baum-Welch is used mainly as a pretraining step or for semi-supervised variants.
The HMM's defining application was part-of-speech tagging. The Brown corpus, tagged by hand in the 1960s–70s, provided the training data; HMM taggers like Charniak's 1993 system reached 95–96% per-token accuracy, roughly the level of human inter-annotator agreement. HMMs were then exported to speech recognition (where they dominated from the 1980s until deep learning displaced them in the 2010s), biological sequence analysis (gene finding, protein alignment), and named-entity recognition as an early baseline. The critical limitation is the emission model: P(word | tag) treats each word as an independent atom, so the model has no way to exploit morphology, prefixes, suffixes, or context beyond the adjacent tag. MEMMs and CRFs were invented specifically to escape this limitation.
HMMs come in two training modes. Supervised HMMs are estimated from labelled data by direct counting — fast, reliable, and the default for tagging. Unsupervised HMMs (Baum-Welch from raw text) are used when labels are expensive, as in speech recognition acoustic modelling or some biological sequence problems. The algorithms are the same; only the training objective changes.
CRFs extend logistic regression to sequence labelling. They model the conditional distribution P(y | x) over the entire label sequence, condition on arbitrary observation features, and escape the label-bias problem of MEMMs. For a decade, they were the undisputed state of the art for tagging, chunking, and NER.
The Conditional Random Field, introduced by Lafferty, McCallum, and Pereira in 2001, is a discriminative model for sequence labelling. A linear-chain CRF defines P(y | x) ∝ exp(∑_t w · φ(y_{t-1}, y_t, x, t)), where φ is a joint feature function over adjacent labels and the entire observation sequence. Because conditioning is on the whole x, the feature function can look arbitrarily far into the observations: surrounding words, prefixes, suffixes, shape features, gazetteer hits, anything computable from the observation sequence. Training maximises the conditional log-likelihood, computed efficiently via a forward-backward variant that respects the dependency structure.
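The global normalisation is what the forward recursion buys you. The stdlib-only sketch below (with invented per-position and transition scores) computes the log partition function log Z two ways, by the forward recursion and by brute-force enumeration over all K^T label sequences, and shows they agree:

```python
import itertools
import math

def log_partition_forward(scores, trans):
    """log Z for a linear-chain CRF via the forward recursion.

    scores[t][y]  : score for label y at position t
    trans[a][b]   : transition score from label a to label b
    """
    K = len(scores[0])
    alpha = list(scores[0])                              # position 0
    for t in range(1, len(scores)):
        alpha = [scores[t][y] +
                 math.log(sum(math.exp(alpha[p] + trans[p][y])
                              for p in range(K)))
                 for y in range(K)]
    return math.log(sum(math.exp(a) for a in alpha))

def log_partition_brute(scores, trans):
    """log Z by explicit enumeration of every label sequence."""
    T, K = len(scores), len(scores[0])
    total = 0.0
    for ys in itertools.product(range(K), repeat=T):
        s = sum(scores[t][y] for t, y in enumerate(ys))
        s += sum(trans[a][b] for a, b in zip(ys, ys[1:]))
        total += math.exp(s)
    return math.log(total)

scores = [[0.5, -0.2, 0.1], [1.0, 0.3, -0.4], [0.2, 0.2, 0.9]]  # T=3, K=3
trans = [[0.1, -0.3, 0.2], [0.4, 0.0, -0.1], [-0.2, 0.3, 0.1]]
print(log_partition_forward(scores, trans))
```

The brute-force version costs O(K^T); the forward recursion costs O(T · K²), which is what makes exact training of linear-chain CRFs tractable.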
CRFs improved on two predecessors. Relative to HMMs, they are discriminative — directly modelling P(y | x) rather than the joint — and accept arbitrary observation features. Relative to Maximum Entropy Markov Models (a locally normalised variant), CRFs use a global normalisation that avoids the label bias problem: MEMMs must decide at each step how to distribute probability among next labels using only local information, which can cause them to under-favour rare-but-correct transitions. CRFs normalise over the entire sequence, so no such bias appears.
CRFs became the standard model for a long list of tagging-style tasks: part-of-speech tagging on morphologically rich languages, chunking (identifying noun and verb phrases), named-entity recognition, semantic role labelling, biomedical entity tagging, Chinese word segmentation, and more. The typical CRF-based NER system used several hundred to several thousand feature templates producing hundreds of thousands of sparse features, trained with L2-regularised LBFGS on CoNLL-2003 or OntoNotes. Systems like the Stanford CRF-based NER tagger were the workhorses of applied NLP from the mid-2000s until BERT displaced them in 2018–2019.
The modern equivalent is a neural sequence model (BiLSTM or transformer) with a CRF output layer. Even in 2026, the CRF layer remains a useful final stage over neural features for structured prediction tasks where transitions between labels matter (BIO tagging schemes, for example). The neural encoder learns features; the CRF layer enforces legal label sequences. The name CRF now denotes a training objective and a decoding strategy as often as it denotes a complete standalone model.
Named Entity Recognition is the task of identifying spans of text that refer to persons, organisations, locations, dates, amounts, and other typed entities. It is the canonical sequence-labelling task and the bridge from raw text to structured information.
Named Entity Recognition (NER) identifies spans of text that refer to entities of interest and assigns each span a type. Typical types for newswire are PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY, and PERCENT — the "MUC types" from the Message Understanding Conferences of the 1990s. Biomedical NER adds GENE, PROTEIN, DISEASE, CHEMICAL. Legal NER adds STATUTE, JUDGE, CASE. Financial NER adds TICKER, INSTRUMENT, RATIO. Each domain requires its own tagset, its own annotated data, and often its own gazetteers.
The standard formulation treats NER as sequence labelling. Each token receives a tag that encodes both the entity type and the position within the span, using a BIO tagging scheme: B-PER for the first token of a PERSON span, I-PER for subsequent tokens, O for tokens outside any entity. Variants like BIOES add explicit end and single-token tags. A CRF or neural sequence model learns to assign these tags from features of the surrounding tokens. The span structure is then recovered by post-processing the tag sequence.
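The post-processing step is mechanical but worth seeing once. A minimal BIO decoder (stdlib only; the lenient handling of stray I- tags is one common convention, not the only one):

```python
def bio_to_spans(tags):
    """Recover (type, start, end) spans from a BIO tag sequence.

    `end` is exclusive. An I- tag without a matching B- (a common model
    error) is treated as opening a new span, a standard lenient repair.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:          # close the span in progress
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag == "O" and etype is not None:
            spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:                  # span running to end of sentence
        spans.append((etype, start, len(tags)))
    return spans

tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_spans(tags))   # [('PER', 0, 2), ('ORG', 3, 6)]
```

The inverse operation, spans to tags, is equally simple, which is why BIO remains the default interchange format between sequence models and span-level evaluation.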
The canonical benchmark is CoNLL-2003, a corpus of Reuters newswire annotated with four types (PER, ORG, LOC, MISC) in English and German. For over a decade, progress on CoNLL-2003 tracked progress in NER more broadly. Classical CRF systems with rich feature templates — word shape, orthography, gazetteer hits, prefix/suffix n-grams, POS tags, chunk tags, dependency edges, context windows — reached F1 scores in the high 80s to low 90s. Adding word embeddings as features (the 2011 Collobert-Weston SENNA system) pushed that higher; bidirectional LSTMs with a CRF output layer (Lample et al. 2016, Chiu & Nichols 2016) pushed to 91–92; fine-tuned BERT (2018–2019) to 93–94; and modern T5/GPT-scale systems approach but rarely exceed 95. The headroom has shrunk to annotator noise.
Even in the age of LLMs, dedicated NER systems remain the production-grade solution for high-throughput entity extraction, because they are 100x faster and 100x cheaper than calling a language model. A typical modern pipeline uses a fine-tuned transformer encoder with a CRF or linear output layer, trained on domain-specific annotated data, deployed with INT8 quantisation. The feature-engineering tradition of classical NER lives on as a fallback for low-resource languages and specialised domains where labelled data is scarce and gazetteers are the strongest signal.
Gazetteers — lists of known entities like country names, company tickers, drug names — were central to classical NER. A binary feature "this word appears in the country gazetteer" was often the single strongest signal for LOC tags. Gazetteer construction is labour-intensive but pays for itself on any domain-specific application; a good company-name gazetteer can move NER F1 by several points in financial text.
Having identified entities, the next step is identifying relations between them: Acme acquired Widget, Paris is the capital of France, aspirin treats headache. Relation extraction populates knowledge bases, supports question answering, and builds the scaffolding for structured reasoning.
Relation extraction classifies ordered pairs of entities (or entity mentions) into relation types drawn from a schema. The typical setup: given a sentence with two marked entity spans, predict a relation label from a fixed vocabulary or the special label NONE. Schemas range from small and curated (ACE's 18 relation types, TACRED's 41) to large and open-ended (Wikidata's thousands of properties; OpenIE's free-form argument-predicate triples).
Classical relation extraction followed three paradigms. Pattern-based extraction used hand-written lexico-syntactic patterns — "X, the CEO of Y", "X was born in Y" — to match surface-form relations. Hearst's 1992 "automatic acquisition of hyponyms" paper pioneered this with patterns like "X such as Y, Z, and W" for hypernymy. Pattern systems like KnowItAll (2004) and TextRunner (2007) scaled up to web-scale extraction. Supervised classification trained a classifier (SVM, logistic regression, or MaxEnt) over features derived from the dependency path between entities, the surface words between them, named-entity tags, and WordNet features. Systems like Zhou et al. 2005 and the Stanford relation extraction system were the workhorses. Distantly supervised extraction (Mintz et al. 2009) used an existing knowledge base (Freebase, Wikipedia infoboxes) as a noisy source of training labels: any sentence containing two entities with a known relation in the KB is treated as a positive example for that relation.
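A flavour of the pattern-based paradigm, as a stdlib regex sketch. This is a deliberately crude rendering of the "X such as Y, Z, and W" idea: Hearst's actual patterns operated over NP chunks, not raw tokens, so the single-word capture groups here are a simplifying assumption:

```python
import re

# Simplified Hearst-style pattern: a single-word hypernym followed by
# "such as" and a comma-separated list of single-word hyponyms.
PATTERN = re.compile(r"(\w+) such as ([\w ,]+)")

def hearst_hyponyms(text):
    triples = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        # Split the list on ", and", "," and "and" (longest first).
        for hyponym in re.split(r", and |, | and ", m.group(2)):
            if hyponym:
                triples.append((hyponym.strip(), "is-a", hypernym))
    return triples

text = "The works of authors such as Herrick, Goldsmith, and Shakespeare."
print(hearst_hyponyms(text))
```

Even this toy version illustrates the paradigm's economics: precision is high when the pattern fires, recall is bounded by how often the corpus happens to use that exact construction, and scaling up means accumulating many such patterns.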
Open information extraction — ReVerb (2011), Ollie (2012), OpenIE (2015) and their successors — dispensed with a fixed schema entirely. Instead of classifying into relation types, these systems extract arbitrary predicate-argument triples from text: "Acme; acquired; Widget in a $200M deal". The relations are whatever surface phrases the text uses. The result is a flexible but noisy triple store suitable for downstream question answering or further schema mapping. OpenIE systems are fast, schema-free, and domain-general; their tradeoff is that relations are not normalised (acquired, bought, purchased, took over are all distinct strings) and must be clustered or linked to a canonical vocabulary post-hoc.
The broader task of information extraction glues NER, coreference resolution, relation extraction, and event extraction together into end-to-end structured output: given an article, produce a JSON record describing who did what to whom, where, and when. Classical IE pipelines — NIST's MUC program through the 1990s, ACE through the 2000s, TAC-KBP through the 2010s — demonstrated that reasonable structured extraction was achievable on newswire but brittle to domain shift. Modern LLMs have changed the picture: a GPT-4-class model can often perform schema-directed extraction zero-shot at accuracies that classical pipelines needed thousands of annotated examples to reach. The classical pipeline remains useful when throughput, cost, or determinism matters more than flexibility.
Topic models — LSA, pLSA, LDA — discover latent themes from an unlabelled corpus. Each document becomes a mixture of topics, each topic a distribution over words. Topic models gave NLP its first tractable unsupervised models of document content and still serve as a baseline for exploratory corpus analysis.
The governing intuition of topic modelling is that each document is "about" a small number of topics, and each topic is characterised by a distribution over words. A finance article might be 60% MARKETS and 40% CORPORATE; a MARKETS topic might put high probability on stock, bond, yield, index, trader; a CORPORATE topic might put high probability on earnings, revenue, guidance, CEO, shareholder. Topic models fit these distributions from raw document-word counts, with no labels.
Latent Semantic Analysis (LSA, Deerwester et al. 1990) was the first workable formulation. Take the term-document matrix, apply truncated Singular Value Decomposition to produce a low-rank approximation, and use the resulting dense vectors as topic representations. LSA is fast, linear, and conceptually clean, but produces topics that can be hard to interpret and has no probabilistic account of noise. Probabilistic Latent Semantic Analysis (pLSA, Hofmann 1999) reformulated LSA as a proper probabilistic mixture model, but lacked a principled generative story for documents outside the training set. Latent Dirichlet Allocation (LDA, Blei, Ng & Jordan 2003) fixed this by placing Dirichlet priors on the document-topic and topic-word distributions, yielding a fully generative model that supports inference on unseen documents and has a clean Bayesian interpretation. LDA became the default topic model for the next decade.
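The LSA step is a one-liner on top of an SVD. The sketch below (NumPy assumed available; the six-term, four-document count matrix is invented) projects documents into a rank-2 latent space and checks that same-theme documents end up close:

```python
import numpy as np

# Tiny term-document count matrix (terms x documents); invented counts.
# Documents 0-1 are "markets" documents, 2-3 are "corporate" documents.
terms = ["stock", "bond", "trader", "earnings", "ceo", "revenue"]
X = np.array([[3, 2, 0, 0],    # stock
              [2, 3, 0, 1],    # bond
              [1, 2, 0, 0],    # trader
              [0, 0, 3, 2],    # earnings
              [0, 1, 2, 3],    # ceo
              [0, 0, 2, 2]],   # revenue
             dtype=float)

# Truncated SVD: keep only the k strongest singular triplets.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T    # documents in k-dim latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

The same-theme pair scores far higher than the cross-theme pair, which is LSA's whole trick: the low-rank projection collapses co-occurring terms onto shared latent dimensions.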
Inference in LDA is intractable in closed form, so practitioners use Gibbs sampling (Griffiths & Steyvers 2004) or variational Bayes. Collapsed Gibbs sampling is surprisingly simple to implement and gives good results on moderate corpora; variational methods scale better but produce biased estimates. Popular extensions include Correlated Topic Models (topics that co-occur), Dynamic Topic Models (topics that drift over time), Author-Topic Models, and Hierarchical LDA (tree-structured topic hierarchies).
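Collapsed Gibbs sampling really is simple to implement. The stdlib-only sketch below fits LDA on an invented toy corpus; the conditional it samples from is the standard collapsed form, proportional to (n_dk + α) · (n_kw + β) / (n_k + Vβ):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (toy sketch; no optimisations)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]                 # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]    # topic-word counts
    nk = [0] * K                                  # topic totals
    z = []                                        # topic assignment per token
    for d, doc in enumerate(docs):                # random initialisation
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                       # remove token's current topic
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Collapsed conditional: (ndk + a) * (nkw + b) / (nk + V*b)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

docs = [["stock", "bond", "trader"], ["bond", "stock", "yield"],
        ["earnings", "ceo", "revenue"], ["ceo", "revenue", "guidance"]] * 3
ndk, nkw = lda_gibbs(docs, K=2)
print(ndk[0], ndk[2])   # doc-topic counts for a markets and a corporate doc
```

On this clearly separable corpus the sampler typically pushes the two vocabularies into distinct topics within a couple of hundred sweeps; production implementations add burn-in, thinning, and hyperparameter optimisation on top of exactly this loop.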
Topic models are most valuable for exploratory corpus analysis: given a large unlabelled collection, what is it about? A 50-topic LDA over a news archive or a scientific literature corpus often produces human-interpretable clusters that invite further investigation. They are less valuable as document representations for downstream classification, where simpler approaches like TF-IDF or dense embeddings typically match or beat topic-based features. The modern alternatives — sentence embeddings and cluster analysis (BERTopic, top2vec) — provide finer-grained and higher-quality topical structure, but LDA retains a niche for its interpretability, its small computational footprint, and its theoretical cleanliness.
Information retrieval — matching queries against a document collection and ranking results — predates modern NLP and still underpins it. Classical IR contributes vocabulary (TF-IDF, inverted index, relevance), benchmarks (TREC), and the core retrieval algorithm (BM25) that anchors modern retrieval-augmented generation.
Information retrieval is the computer-science discipline of matching a user's query against a document collection. Modern NLP inherited most of its intuitions about word weighting, term statistics, and evaluation from classical IR. The canonical setup is the ranked retrieval model: given a query, return a ranked list of documents from the collection, ordered by predicted relevance. Evaluation metrics are almost exclusively ranking-based — precision at cutoff k, mean average precision, normalised discounted cumulative gain — rather than classification-based.
The dominant ranking function of the last 30 years is BM25 (Best Match 25), a refinement of the probabilistic retrieval framework developed by Stephen Robertson and Karen Spärck Jones. Given a query Q and document D, BM25 computes score(Q, D) = ∑_{q ∈ Q} IDF(q) · [(f(q, D) · (k+1)) / (f(q, D) + k · (1 - b + b · |D|/avgdl))], where f(q, D) is the term frequency in the document, |D| is the document length, avgdl is the average document length, and k and b are tuning parameters (typically k ≈ 1.2, b ≈ 0.75). The formula compactly captures three intuitions: rare terms matter more (the IDF factor), term frequency saturates (the (f · (k+1)) / (f + k) shape), and long documents should be penalised (the length normalisation term).
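The formula, together with the inverted index that makes it fast, fits in a short stdlib-only class. The IDF variant below is the common smoothed Robertson-Spärck Jones form; the three-document corpus is a toy:

```python
import math
from collections import Counter, defaultdict

class BM25:
    """Minimal BM25 over an inverted index (k = 1.2, b = 0.75 as in the text)."""

    def __init__(self, docs, k=1.2, b=0.75):
        self.k, self.b = k, b
        self.N = len(docs)
        self.doc_len = [len(d) for d in docs]
        self.avgdl = sum(self.doc_len) / self.N
        self.index = defaultdict(list)        # term -> [(doc_id, tf), ...]
        for i, doc in enumerate(docs):
            for term, tf in Counter(doc).items():
                self.index[term].append((i, tf))

    def idf(self, term):
        df = len(self.index[term])
        # Smoothed Robertson-Sparck Jones IDF.
        return math.log((self.N - df + 0.5) / (df + 0.5) + 1)

    def search(self, query):
        scores = defaultdict(float)           # accumulate over posting lists
        for term in query:
            idf = self.idf(term)
            for doc_id, tf in self.index[term]:
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * tf * (self.k + 1) / (tf + self.k * norm)
        return sorted(scores.items(), key=lambda x: -x[1])

docs = [["the", "cat", "sat"], ["the", "dog", "barked"],
        ["the", "cat", "and", "the", "dog"]]
engine = BM25(docs)
print(engine.search(["cat"]))
```

Note how the length normalisation plays out: both matching documents contain "cat" once, but the shorter one scores higher, exactly the third intuition the formula encodes.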
BM25 is extraordinary for three reasons. First, it is computable in microseconds against billions of documents via inverted indexing — a data structure that maps each term to the list of documents containing it, along with the required term-frequency statistics. A query fetches each term's posting list and accumulates scores. Second, BM25 remains a strong baseline three decades after its introduction; modern dense-retrieval systems routinely use BM25 as a complement rather than a replacement. Third, BM25 requires no training data. It has two hyperparameters; that is all. For the cost of a line of code, you get a retrieval system that, on most tasks, beats a naive embedding-based retriever trained on tens of thousands of query-document pairs.
Modern retrieval-augmented generation pipelines — the LLM sends a query to a retriever, receives a short list of relevant documents, and conditions its answer on them — almost always include BM25 as either the primary retriever or a hybrid complement to dense retrieval. Elasticsearch, OpenSearch, Vespa, Tantivy, and Pyserini all ship BM25 out of the box. The classical retrieval stack — inverted index, BM25 scoring, ranked result list — is arguably the single most important component of classical NLP that remains unreplaced in 2026.
The TREC evaluations (Text REtrieval Conferences), running since 1992, defined the IR research agenda for decades. They established the evaluation protocol — build a collection, pool top results from multiple systems, have human assessors judge relevance, evaluate all systems against the pooled judgments — that transfers directly to modern retrieval benchmarks like BEIR and MS MARCO.
Statistical machine translation — word alignments, phrase tables, noisy-channel decoders — dominated MT from the early 1990s until neural MT overtook it around 2015. The SMT era produced parallel corpora, evaluation metrics (BLEU), and a generation of researchers who would later build the neural systems that replaced their work.
Statistical machine translation (SMT) recast translation as a probabilistic inference problem. Given a source sentence f in the source language, find the target sentence e that maximises P(e | f). IBM researchers Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer reframed this via the noisy channel: P(e | f) ∝ P(f | e) · P(e), where P(e) is a target-side language model and P(f | e) is a translation model. This decomposition proved productive: the language model could be an n-gram model trained on monolingual target text (abundant), while the translation model was estimated from parallel text (scarcer but available).
The foundational IBM work (the "IBM Models 1–5") introduced a series of increasingly sophisticated word alignment models, each learning to align source and target words from parallel sentences. IBM Model 1 assumed uniform alignment probabilities; Model 2 added positional preferences; later models added fertility (one source word can align to multiple targets), distortion (word-order rearrangement), and more. Alignments were learned by EM. These models produced poor translations directly, but their alignments were the critical input to phrase-based translation.
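IBM Model 1's EM loop is compact enough to show whole. The sketch below (stdlib only, toy parallel corpus; the NULL source token of the real model is omitted for brevity) estimates lexical translation probabilities t(f | e):

```python
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """EM for IBM Model 1 lexical translation probabilities t(f | e).

    `pairs` are (source_tokens, target_tokens) sentence pairs. The NULL
    token of the full model is omitted here for brevity.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialisation
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:                      # E-step: expected alignments
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normaliser over this sentence
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():           # M-step: renormalise
            t[(f, e)] = c / total[e]
    return t

pairs = [("la maison".split(), "the house".split()),
         ("la fleur".split(), "the flower".split()),
         ("la maison bleue".split(), "the blue house".split())]
t = ibm_model1(pairs)
print(round(t[("maison", "house")], 3), round(t[("la", "the")], 3))
```

After a handful of EM iterations the co-occurrence statistics disambiguate the alignments: "la" attaches to "the", "maison" to "house", with no supervision beyond the sentence pairing itself.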
Phrase-based SMT (Koehn, Och & Marcu 2003; Moses, 2007) translated contiguous multi-word phrases rather than individual words. A phrase table extracted from aligned parallel data mapped source phrases to target phrases with probabilities. A decoder searched over segmentations of the source sentence and phrase-translation choices, scoring candidate translations by a weighted combination of phrase probabilities, language model probability, distortion penalty, and word penalty. The search was beam-limited because the space is exponential. Moses, the open-source phrase-based MT toolkit, became the research and production backbone of SMT for a decade.
Two SMT legacies endure. The first is the BLEU metric (Papineni et al. 2002), which compares n-gram overlap between a candidate translation and one or more reference translations. BLEU is a blunt instrument: it gives no credit for paraphrase or synonymy, it handles length only through a crude brevity penalty, and its correlation with human judgment is imperfect. But it is fast, reproducible, and unbiased enough to have driven a decade of research progress. It remains the default metric for MT and related generation tasks even in 2026, alongside newer alternatives like chrF, COMET, and learned metrics. The second legacy is the set of parallel corpora — Europarl, UN proceedings, OPUS — that SMT researchers built up and that continue to train every large language model. Neural MT did not replace the data infrastructure; it inherited it.
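The modified-precision-plus-brevity-penalty core of BLEU is a few lines of stdlib Python. This is a simplified sentence-level sketch against a single reference, with no smoothing; real BLEU is corpus-level and supports multiple references:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clipped (modified) n-gram precision: credit capped by reference counts.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        if overlap == 0:
            return 0.0                        # unsmoothed: any zero kills the score
        log_prec += math.log(overlap / sum(cand.values())) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))                          # identical sentences score 1.0
print(bleu("the cat sat on mat".split(), ref))
```

The hard zero on any missing n-gram order is why production implementations apply smoothing at the sentence level; at the corpus level the counts are pooled first, which mostly avoids the problem.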
Every classical NLP task came with a benchmark, a metric, and a leaderboard. Understanding these is understanding the incentive gradient that shaped the field — and the reference frame against which every modern method is still compared.
The CoNLL shared tasks, running annually since 1999, defined the standard evaluation for chunking, NER, dependency parsing, semantic role labelling, coreference resolution, and more. Each shared task released a corpus, specified train/dev/test splits, defined a metric, and held a competition. The resulting datasets — CoNLL-2000 chunking, CoNLL-2003 NER, CoNLL-2005 SRL, CoNLL-2009 and 2017/2018 parsing — became the reference benchmarks for their tasks for decades. Modern neural systems are still evaluated on CoNLL-2003 for English NER; in 2026, reporting a system's CoNLL-2003 F1 is the fastest way to contextualise its quality.
Task-specific metrics each encode a choice about what "good" means. Accuracy — fraction of labels correct — is standard for POS tagging, where nearly all tokens have a valid label. F1 score — harmonic mean of precision and recall — is standard for NER and chunking, where most tokens are "outside" any span and accuracy on O tokens alone would be misleading. Unlabelled and labelled attachment score (UAS and LAS) are standard for dependency parsing: fraction of tokens whose predicted head (and, for LAS, whose predicted dependency label) matches the gold tree. B-cubed, MUC, and CEAF are the standard metrics for coreference, each differing in how they weight mention-level vs. entity-level agreement. BLEU, chrF, and METEOR are standard for machine translation. Each metric has known failure modes, and modern papers usually report several.
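As one concrete instance from this catalogue, span-level F1 for NER is a set intersection: a predicted span counts only if its type and both boundaries match a gold span exactly, in the spirit of the CoNLL scorer. A stdlib sketch (the spans below are invented):

```python
def span_f1(gold, pred):
    """Exact-match span precision, recall, and F1.

    `gold` and `pred` are sets of (type, start, end) spans; a predicted
    span is correct only if type and both boundaries match exactly.
    """
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("PER", 0, 2), ("ORG", 5, 7), ("LOC", 9, 10)}
pred = {("PER", 0, 2), ("ORG", 5, 6)}   # one exact hit, one boundary miss
print(span_f1(gold, pred))
```

The boundary miss on the ORG span earns zero credit, which is the exact-match metric's most-criticised property and the reason partial-credit variants exist for some tasks.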
The GLUE (2018) and SuperGLUE (2019) benchmarks aggregated a set of classification and reading-comprehension tasks into a single suite, becoming the de facto proving ground for pretrained transformers. They are the natural bridge between classical evaluation and modern LLM evaluation, now supplanted by the broader BIG-bench, MMLU, HELM, and ZeroEval suites. Even so, many production decisions in 2026 hinge on a CoNLL-2003 F1 or a CoNLL-2009 LAS, because those numbers are the shared vocabulary for task-specific quality.
Classical NLP was a pipeline discipline. Raw text passed through tokenisation, sentence splitting, POS tagging, parsing, NER, coreference, and relation extraction, each step consuming the output of the previous. Understanding pipeline design is understanding why certain modern architectures exist and what problems they were built to avoid.
A classical NLP pipeline composes specialised components in sequence. The canonical stack looks like this: sentence splitter → tokeniser → morphological analyser or stemmer → part-of-speech tagger → named-entity recogniser → dependency or constituency parser → coreference resolver → relation extractor. Each component is typically a standalone model trained on its own annotated corpus. The pipeline architecture made NLP modular: researchers could focus on improving one component while consuming the rest as black boxes.
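The architecture can be sketched abstractly as stage functions threading a shared document record, with failures recorded rather than fatal. This is a generic stdlib illustration of the pattern, not any particular toolkit's API, and the tagger and NER stages are trivial placeholders:

```python
def run_pipeline(text, stages):
    """Compose pipeline stages over a shared document dict.

    Each stage reads fields written by earlier stages and adds its own;
    a failing stage is logged rather than crashing the whole pipeline
    (a sketch of the graceful-degradation pattern).
    """
    doc = {"text": text, "errors": []}
    for stage in stages:
        try:
            stage(doc)
        except Exception as e:                  # degrade, don't die
            doc["errors"].append((stage.__name__, str(e)))
    return doc

def tokenise(doc):
    doc["tokens"] = doc["text"].split()

def tag(doc):
    # Placeholder tagger: capitalised tokens are proper nouns.
    doc["tags"] = ["PROPN" if t[0].isupper() else "NOUN"
                   for t in doc["tokens"]]

def ner(doc):
    # Placeholder NER consuming the tagger's output: each proper noun
    # becomes a single-token entity span.
    doc["entities"] = [(i, i + 1, "ENT")
                       for i, t in enumerate(doc["tokens"])
                       if doc["tags"][i] == "PROPN"]

doc = run_pipeline("Acme acquired Widget", [tokenise, tag, ner])
print(doc["entities"])   # [(0, 1, 'ENT'), (2, 3, 'ENT')]
```

The dependency of ner on tag's output is exactly where the error-propagation problem discussed below lives: a wrong tag flows silently downstream into a wrong entity.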
Representative classical pipelines include Stanford CoreNLP (a Java toolkit combining tokenisation, tagging, parsing, NER, sentiment, coreference, and relation extraction), spaCy (a Python toolkit optimised for production throughput), NLTK (a pedagogically oriented Python toolkit with many components and many legacy models), OpenNLP (an Apache toolkit favoured in JVM contexts), Flair (a PyTorch-based toolkit combining classical task decomposition with modern neural models), and Stanza (formerly StanfordNLP, the neural successor to CoreNLP). Most are still maintained and widely deployed in 2026.
Pipelines have two well-known weaknesses. The first is error propagation: a tokenisation mistake causes a tagging mistake, which causes a parsing mistake, which causes an NER mistake, which causes a relation-extraction mistake. Errors compound multiplicatively. The second is impedance mismatch between components: a parser trained on Wall Street Journal assumes perfect POS tags and reacts badly to errors, especially when they come from a tagger trained on a different genre. A decade of research went into joint models (single models performing several pipeline stages simultaneously) and into error-aware downstream components; much of it is now obsolete, because end-to-end neural systems simply sidestep the problem.
Pipelines remain pragmatically superior in many production settings. They are modular, interpretable, fast, and admit component-level upgrades. When you care about latency, throughput, cost, or interpretability — as most production systems do — a spaCy or Stanza pipeline is often the correct choice over a large language model, even in 2026. The right question is rarely "pipeline vs LLM" in the abstract; it is "what combination of pipeline components and LLM calls meets the product's quality, latency, and cost constraints". The classical pipeline is a tool that continues to pay its rent.
Modern "agentic" NLP systems often rediscover the pipeline architecture with different names. A tool-calling LLM that extracts entities, looks them up in a database, and generates a summary is, structurally, a pipeline — with an LLM at several steps, but a pipeline nonetheless. Engineering lessons from two decades of classical NLP pipelines — error handling, confidence propagation, graceful degradation — transfer directly to LLM-based systems and are often rediscovered painfully.
Classical NLP's techniques have not gone away. They live inside modern systems as preprocessing, as baselines, as features, and as fall-backs when neural methods cannot pay their own freight. Knowing when to reach for them — and when to skip them — separates working ML engineers from people who have read one too many blog posts.
Several classical techniques remain essential components of modern systems. BM25 is the retrieval workhorse for retrieval-augmented generation; no production LLM system reaches for neural retrieval alone. TF-IDF is the default featurisation for small-data text classification, clustering, and nearest-neighbour similarity — it runs in milliseconds on millions of documents and often beats neural baselines when training data is scarce. Inverted indexes back every serious search engine and most vector-database hybrids. Regular expressions and hand-written patterns handle the high-precision slice of most information extraction problems (dates, phone numbers, amounts, ticker symbols, chemical formulae) more reliably than any neural model.
Several classical techniques survive as baselines. Naive Bayes and logistic regression over bag-of-words or TF-IDF features are the first baselines a competent ML engineer runs on any new text classification problem: they take 30 seconds, cost almost nothing, and set a floor that tells you how much headroom fancier models have. Linear-chain CRFs and their neural descendants are the baseline for sequence labelling. HMMs remain the textbook introduction to sequence modelling and continue to serve niche applications in bioinformatics and low-resource speech. LDA is the baseline topic model and the default tool for exploratory corpus analysis.
Classical methods shine when labelled data is scarce. Under a few thousand labelled examples, a carefully feature-engineered logistic regression or CRF will often beat a fine-tuned transformer and almost always beat a zero-shot LLM. Under a few hundred labelled examples, a handful of regular-expression patterns plus a small gazetteer will often beat both. The scaling crossover — the training-set size at which neural methods start to dominate — depends on the task, the domain, and the quality of the features, but it is rarely below a few hundred examples and sometimes above tens of thousands. A working ML practitioner learns to estimate which regime a problem lives in before reaching for the biggest available model.
Classical NLP also provides the evaluation discipline that modern NLP inherited. The notion of train/dev/test splits, the obsession with held-out performance, the precision/recall/F1 taxonomy, the BLEU/chrF/COMET tradition for generation, the practice of reporting significance tests and confidence intervals, the insistence on reproducibility — all of these are inheritances from classical NLP evaluation practice. When modern LLM evaluations periodically fall into methodological errors that the CoNLL shared-task tradition had already catalogued by 2005, the fix is usually to re-read the relevant classical paper. The techniques of classical NLP may be partly obsolete; the habits of mind are not.
The honest summary, from 2026: classical NLP is not a rival to LLMs, it is a complement. A production NLP system is almost always a hybrid — LLMs for flexible reasoning and generation, BM25 for retrieval, fine-tuned transformers for high-throughput classification and extraction, regular expressions and gazetteers for high-precision cases, and a pipeline architecture with confidence-based fall-backs gluing them together. The engineer who understands classical NLP writes better hybrid systems, because they know which tool to reach for and why. The engineer who only knows LLMs writes expensive systems that silently fail on the long tail. Both kinds of engineer exist in 2026; the first is more valuable.
Classical NLP has a well-consolidated literature: a handful of canonical textbooks, a dense cluster of foundational papers from the 1990s and 2000s, a scattered set of modern references that tie the classical methods to neural systems, and a set of mature open-source toolkits that remain in production use. The list below leans on textbooks for the overall framing, papers for the exact formulations and justifications, modern extensions for the hybrid classical-neural systems that define current practice, and software entries for the libraries that still run billions of classical NLP operations per day.
The defining textbook of classical NLP. Careful derivations of n-gram language models, HMMs, log-linear models, word-sense disambiguation, collocations, and statistical parsing. Out of date on neural methods, but everything in it about classical statistics remains the exposition to beat.
The broadest and most continuously updated NLP textbook. The early chapters are the best textbook treatment of classical topics (tokenisation, naive Bayes, logistic regression, HMMs, CRFs, IR, MT evaluation). The later chapters track the neural era. The freely available drafts are the default reading.
The reference textbook on classical IR. Inverted indexes, Boolean and probabilistic retrieval, BM25, vector space models, evaluation protocols. Every modern retrieval engineer — especially those building RAG pipelines — should own a copy.
The canonical SMT textbook. IBM models, phrase-based translation, decoding, evaluation, discriminative training. A gravestone for a displaced paradigm and an invaluable reference for anyone working with parallel corpora or evaluating modern MT against classical baselines.
The NLTK book. Dated in its tooling but unbeaten as a hands-on introduction to classical NLP primitives — tokenisation, tagging, parsing, NER, classification — with runnable Python code throughout. Free online.
A modern textbook that covers classical and neural NLP side by side. Particularly strong on structured prediction (CRFs, parsers, tagging) and on the connection between classical discriminative models and their neural successors. The right textbook for a student entering NLP in the transformer era who still needs the classical foundations.
The paper that introduced inverse document frequency. Short, lucid, and the intellectual ancestor of TF-IDF, BM25, and every term-weighting scheme since. The modern RAG stack begins here.
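The weighting scheme the paper introduced takes a few lines to state in code. The sketch below, in plain Python over a toy hypothetical corpus, uses the classic formulation tf(t, d) · log(N / df(t)); real implementations (scikit-learn's TfidfVectorizer, Lucene) use smoothed variants of the same idea.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenised documents, using the
    classic formulation: tf(t, d) * log(N / df(t)), where df(t) is
    the number of documents containing term t."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus: "the" occurs in every document, so its idf (and weight) is zero.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
```

The zero weight for corpus-wide terms is the whole point of the paper: frequency within a document signals relevance only when the term is rare across documents.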
The IBM Models 1-5 paper. The foundational work that inaugurated statistical MT and implicitly defined the formal vocabulary of the field for the next two decades. Dense but rewarding.
The paper that popularised maxent in NLP. Clean derivation of the log-linear form from the maximum-entropy principle, with applications to word-sense disambiguation, sentence boundaries, and translation. The classical primer on log-linear modelling.
The definitive empirical comparison of n-gram smoothing methods — Good-Turing, Katz back-off, Kneser-Ney, modified Kneser-Ney. Establishes Kneser-Ney as the method to use; a generation of MT and ASR systems trusted this paper's recommendations.
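The core of the recommended method fits in a short function. The sketch below implements interpolated Kneser-Ney for bigrams only, in plain Python on a toy corpus; it is an illustration of the discount-plus-continuation structure, not a replacement for a real LM toolkit (which would also handle trigrams, unseen contexts, and modified discounts).

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram model (sketch). Returns a
    function prob(v, w) ~ P(w | v), combining an absolute-discounted
    bigram estimate with a continuation-probability backoff."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])                  # c(v)
    continuations = Counter(w for _, w in bigrams)  # distinct types ending in w
    followers = Counter(v for v, _ in bigrams)      # distinct types starting with v
    total_types = len(bigrams)

    def prob(v, w):
        # Continuation probability: how many contexts does w follow?
        p_cont = continuations[w] / total_types
        if context[v] == 0:
            return p_cont
        discounted = max(bigrams[(v, w)] - d, 0) / context[v]
        backoff_mass = d * followers[v] / context[v]
        return discounted + backoff_mass * p_cont
    return prob

p = kneser_ney_bigram("the cat sat on the mat".split())
```

The continuation count is the paper's key insight: a word like "Francisco" may be frequent, but if it only ever follows "San" it should receive little backoff mass.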
The paper that introduced LSA. The first workable method for recovering latent semantic structure from a term-document matrix, and the intellectual forerunner of word2vec and modern embeddings.
The LDA paper. A fully generative Bayesian topic model that became the default tool for unsupervised document analysis for over a decade. Still the right starting point for any unsupervised study of a text corpus.
The CRF paper. Defined the dominant sequence-labelling framework for over a decade and remains the correct starting point for understanding modern neural sequence labelling, which still uses CRF layers on top of learned features.
The authoritative reference on BM25 and its derivation from the probabilistic relevance framework. If you intend to use BM25 in production — and you probably should — this is the paper that explains why the formula looks the way it does.
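As a concrete companion to the derivation, here is the standard Okapi BM25 scoring formula in plain Python over a toy hypothetical corpus. The idf smoothing shown is the Lucene-style variant; the k1 and b defaults are the conventional ones, and production systems compute all of this against an inverted index rather than by scanning documents.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenised document against a query.
    Saturating term frequency (k1) and length normalisation (b)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # Lucene-style smoothing
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["cheap", "flights", "to", "rome"],
        ["cheap", "hotels", "in", "rome"],
        ["train", "timetable", "milan"]]
scores = [bm25_score(["cheap", "rome"], d, docs) for d in docs]
```

The saturation in the term-frequency factor is what the probabilistic derivation justifies: the tenth occurrence of a query term adds far less evidence than the first.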
The HMM tutorial every generation of NLP and speech students learns from. Forward-backward, Viterbi, Baum-Welch, and the "three basic problems" framing — all covered with exemplary clarity. Indispensable.
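Of the three basic problems, decoding is the one every tagger relied on. Below is the Viterbi algorithm in plain Python, run on a hypothetical two-state POS toy model; the tutorial's version works in log space and adds scaling, which any real implementation needs to avoid underflow.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most-likely hidden state sequence for an HMM ('problem 2' in
    the tutorial's framing), via dynamic programming plus backtrace."""
    # V[t][s] = (best probability of reaching s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            prob = V[t - 1][best_prev][0] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            V[t][s] = (prob, best_prev)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy tagger: transition and emission tables are made-up illustration values.
states = ["DET", "NOUN"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
emit_p = {"DET": {"the": 0.9, "dog": 0.1}, "NOUN": {"the": 0.1, "dog": 0.9}}
tags = viterbi(["the", "dog"], states, start_p, trans_p, emit_p)
```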
The BLEU paper. The automatic metric that made rapid MT experimentation possible and has been the de facto evaluation for translation and, more recently, many generation tasks for over two decades.
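The metric's arithmetic is simple enough to sketch. The function below computes a sentence-level BLEU approximation in plain Python: clipped n-gram precisions combined by geometric mean, times a brevity penalty. The real metric is corpus-level, supports multiple references, and (in sentence-level variants) smooths zero counts, so treat this only as an illustration of the core computation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty. No smoothing, one reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(cand.values())) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec)
```

Clipping is what stops a candidate from gaming unigram precision by repeating a reference word; the brevity penalty is what stops it from gaming precision by being very short.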
The phrase-based SMT paper. Defined the phrase-extraction and decoding procedures that underpinned Moses and a decade of production MT systems. The logical sequel to the IBM word-based models.
The Hearst patterns paper. Introduced the idea of using hand-written lexico-syntactic patterns ("X such as Y, Z, and W") to bootstrap relational information from raw text. The conceptual ancestor of modern information extraction.
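The flavour of the method is easy to reproduce. The sketch below implements one of the original patterns ("NP such as NP, NP and NP") as a word-level regex in plain Python; Hearst's patterns match over noun-phrase chunks rather than raw tokens, so the last-word "head noun" guess here is a crude stand-in for real chunking.

```python
import re

# Simplified "X such as Y, Z and W" pattern over plain text.
PATTERN = re.compile(
    r"(\w+(?:\s\w+)?)\s+such as\s+((?:\w+(?:,\s|\sand\s|\sor\s)?)+)")

def hearst_such_as(text):
    """Extract (hyponym, hypernym) pairs from 'such as' constructions."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1).split()[-1]  # crude head-noun guess
        for hyponym in re.split(r",\s|\sand\s|\sor\s", m.group(2)):
            if hyponym:
                pairs.append((hyponym.strip(), hypernym))
    return pairs

pairs = hearst_such_as("He plays instruments such as guitar, piano and violin.")
```

Even this toy version shows the trade the paper made explicit: very high precision on the constructions it matches, at the cost of recall on everything it does not.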
The paper that defined distant supervision. Using a knowledge base as a noisy source of training labels opened up relation extraction to vastly larger corpora and remains influential in modern knowledge-base population.
A foundational cross-disciplinary application of HMMs to biological sequence analysis. Illustrates the template — labelling positions along a linear sequence with hidden types — that classical NLP inherited and generalised.
The structured perceptron paper. A dramatic simplification of discriminative training for sequence models that influenced CRF training strategies and remains a standard pedagogical reference.
The pLSA paper. The conceptual bridge between LSA and LDA, recasting latent semantic analysis as a proper mixture model and motivating the subsequent Bayesian formulation that became standard.
The fastText paper. Demonstrated that a bag-of-words classifier with character n-gram features and a shallow softmax over learned embeddings matches deep models on most text-classification tasks at a fraction of the cost. The modern heir to naive Bayes and logistic regression.
The BiLSTM-CRF NER paper. The bridging architecture that combined neural feature learning with a classical CRF output layer, setting the template followed by nearly every subsequent NER system through 2020.
A careful dissection of what word embeddings actually add over classical PMI-based count vectors. Conclusion: most of the "magic" is hyperparameter choice, not the embedding step itself. Essential reading for curing several common misunderstandings about neural text features.
The Faiss paper. The technical foundation for modern dense retrieval, and the bridge from classical inverted-index retrieval to vector-space retrieval at scale. Read this with the Manning/Raghavan/Schütze IR textbook for a complete picture.
The late-interaction retrieval paper. A modern approach that combines the efficiency of inverted indexes with the expressiveness of contextual embeddings. Read alongside BM25 to see what has changed — and what has not — in retrieval since 1994.
The paper behind BERTopic, the most-used modern successor to LDA. Clusters sentence embeddings and ranks topic terms by a class-based variant of TF-IDF. The purest example of a classical tool (TF-IDF) providing the finishing stage for a neural pipeline.
The IE literature has moved from classical pipelines to neural end-to-end systems to schema-directed LLM prompting. The ACL Anthology survey series tracks this evolution; read one recent survey with one pre-2015 survey to see the trajectory.
The sentence-transformers paper. The modern replacement for TF-IDF-based semantic similarity. Worth reading against the classical LSA and pLSA papers to see what problem sentence embeddings actually solve.
A retrospective by one of the founders of the MUC program. The clearest short account of how classical IE succeeded, where it failed, and what the transition to neural methods did and did not change.
The DPR paper. The canonical demonstration that a fine-tuned bi-encoder retriever can outperform BM25 — and, equally important, that BM25 remains a tough baseline and a productive hybrid complement.
A representative modern sequence-labelling architecture. The explicit CRF layer on top of neural features is the clearest illustration of how classical methods survive inside neural systems.
An early example of neural question answering over knowledge bases populated by classical IE. Read as a pre-history of modern RAG over structured knowledge.
A case study in domain adaptation for classical NLP tasks (NER, relation extraction) that shows how classical task decomposition persists inside a neural pipeline. Representative of the clinical/biomedical NLP tradition that still leans heavily on classical methods.
A practical demonstration that classical sparse retrieval (BM25) plus a neural re-ranker is the right architecture for most retrieval systems. The lesson continues to generalise to RAG pipelines.
A practical recipe for compressing transformer-based classifiers to the same latency envelope as classical logistic regression — and the route by which many "classical" production systems were quietly upgraded to neural ones without losing their latency budget.
The RAG paper. The architectural template that re-promoted classical retrieval to a first-class component of modern language models. Every serious RAG system traces its lineage to this paper and, through it, back to Spärck Jones and Robertson.
The teaching toolkit. Classical NLP primitives — tokenisation, tagging, chunking, parsing, NER, classification — with pedagogically clear implementations. Not the fastest option for production, but the best option for learning.
The production NLP pipeline. Fast tokenisation, tagging, parsing, NER and rule-matching with mature multilingual support. Still the default choice when you need classical NLP components at production throughput.
The JVM-era and neural-era descendants of the Stanford NLP group's research systems. Coverage of classical tasks from tokenisation through coreference, with models maintained for many languages. The canonical "classical-on-neural-substrate" toolkit.
The de facto implementation of every classical text-classification pipeline: TfidfVectorizer, CountVectorizer, LogisticRegression, MultinomialNB, LinearSVC, LatentDirichletAllocation. If you want a competent baseline on any text task, this is the first stop.
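To see what the simplest of those classes computes, here is multinomial naive Bayes with Laplace smoothing written out in plain Python on a made-up three-document corpus. This is a sketch of the arithmetic behind CountVectorizer plus MultinomialNB, not a substitute for them; scikit-learn's versions add sparse matrices, vectorised maths, and a real API.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes with Laplace (add-alpha) smoothing.
    Returns a predict(tokens) -> label function."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    def predict(doc):
        best, best_lp = None, float("-inf")
        for y in class_counts:
            lp = math.log(class_counts[y] / len(labels))       # log prior
            total = sum(word_counts[y].values()) + alpha * len(vocab)
            for w in doc:
                if w in vocab:                                 # ignore unseen words
                    lp += math.log((word_counts[y][w] + alpha) / total)
            if lp > best_lp:
                best, best_lp = y, lp
        return best
    return predict

# Hypothetical toy training set.
clf = train_multinomial_nb(
    [["free", "money", "now"], ["meeting", "at", "noon"], ["free", "prize"]],
    ["spam", "ham", "spam"])
```

Working in log space and smoothing every count are the two details that make the textbook formula usable in practice; both carry over unchanged to the scikit-learn implementation.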
The reference implementation of LSA, LDA, and word2vec for Python. Out-of-core training, streaming corpora, and mature topic-model inference. Still the default tool for a quick LDA pass over a large corpus.
The production search infrastructure that runs BM25 against trillions of documents daily. If you are building a RAG system, your retriever is almost certainly one of these — or a direct descendant.