"You shall know a word by the company it keeps." J.R. Firth's 1957 aphorism sat in the semantics literature for half a century as a suggestive but unimplemented slogan. Then in the early 1990s computational linguists began counting — building matrices of word co-occurrences across large corpora, reducing them with singular value decomposition, and finding that the resulting vectors captured real semantic structure. In 2013 Tomas Mikolov's word2vec paper compressed that machinery into a neural architecture so fast and so effective that within eighteen months every serious NLP system in the world was using pretrained word vectors. GloVe followed in 2014. FastText in 2017. ELMo and BERT brought contextualised embeddings in 2018. Sentence-transformers and the modern embedding ecosystem followed. By 2020 the notion that everything — words, sentences, paragraphs, documents, images, audio clips, user preferences, database rows — could be usefully represented as a vector in some learned space had become the governing metaphor of applied machine learning. Retrieval-augmented generation runs on embedding spaces. Recommender systems run on embedding spaces. Semantic search, clustering, anomaly detection, near-duplicate detection, cross-lingual transfer, multi-modal models — all are embedding-space problems. This chapter walks through the technology: the distributional hypothesis itself, the classical count-based methods (PMI, LSA, LSI), the neural breakthroughs (word2vec, GloVe, fastText), the geometry that makes embedding spaces interesting (analogies, clusters, linear subspaces of bias), the evaluation protocols (intrinsic and extrinsic), the move from static to contextualised embeddings (ELMo, BERT), the modern sentence-level models (Sentence-BERT, universal encoders), and the engineering stack (approximate nearest neighbours, vector databases, hybrid retrieval) that operationalised embeddings at Internet scale.
Sections one through four establish the intellectual foundation. Section one argues why embeddings matter — the move from sparse one-hot vectors to dense distributed representations was not a minor engineering improvement but a phase change that enabled most of what came next. Section two introduces the distributional hypothesis, the linguistic and philosophical claim that word meaning is constituted by the contexts in which a word appears. Section three covers count-based distributional semantics — the pre-neural ancestors of word2vec, built from co-occurrence matrices and pointwise mutual information, including LSA, HAL, and the PPMI family. Section four introduces word2vec — Tomas Mikolov's 2013 paper that made dense embeddings fast, scalable, and ubiquitous.
Sections five through eight cover the word2vec family and its immediate descendants. Section five unpacks the two word2vec architectures — skip-gram and CBOW — and what each optimises. Section six covers negative sampling and hierarchical softmax, the training tricks without which word2vec would not have worked at corpus scale. Section seven is GloVe — Pennington, Socher, and Manning's 2014 reformulation that combined global co-occurrence statistics with the local-context objectives of word2vec. Section eight is fastText — Bojanowski, Grave, Joulin, and Mikolov's 2017 extension to character n-grams, which finally handled morphology and out-of-vocabulary words.
Sections nine through twelve look at what embeddings actually contain. Section nine covers the geometry of embedding spaces — the famous "king - man + woman ≈ queen" analogy, semantic clusters, linear subspaces, and what these phenomena do and do not tell us about what the model has learned. Section ten is evaluation — intrinsic metrics (word similarity, analogy, outlier detection) and extrinsic evaluation (downstream task performance), and the long-running argument about which one matters. Section eleven covers cross-lingual and multilingual embeddings — aligning monolingual spaces, MUSE, Procrustes rotation, and the path from single-language embeddings to universal cross-lingual representations. Section twelve addresses bias and fairness in embedding spaces — the Bolukbasi et al. 2016 demonstration that word vectors encode gender, racial, and other social biases, and the subsequent debate over debiasing methods.
Sections thirteen through sixteen cover the transition from static to contextualised embeddings, and the sentence-level extensions. Section thirteen explains why static embeddings were not enough — polysemy, syntactic role, and the kinds of meaning that depend on context. Section fourteen is ELMo — Peters et al. 2018, the bi-LSTM contextualiser that introduced the paradigm. Section fifteen is BERT-era contextual embeddings — the transformer turn, how to extract embeddings from a pretrained encoder, and the interpretability literature that emerged around it. Section sixteen covers sentence and document embeddings — Sentence-BERT, SimCSE, universal sentence encoders, and the contrastive-learning revolution that made modern semantic search possible.
Sections seventeen and eighteen close the chapter. Section seventeen covers the engineering stack — approximate nearest-neighbour search (FAISS, HNSW, ScaNN), vector databases (Pinecone, Weaviate, Milvus, Qdrant), hybrid retrieval, and the practical considerations that separate a toy embedding project from a production system serving billions of queries. Section eighteen closes by placing word embeddings in the broader machine-learning landscape: as the first successful demonstration of representation learning, as the conceptual ancestor of every modern foundation model, as the bridge from symbolic NLP to continuous ML, and as the engineering substrate on which retrieval-augmented generation, recommendation, and multi-modal learning all now run.
Before embeddings, words in ML were one-hot vectors: a vocabulary of 50,000 words meant a 50,000-dimensional vector with a single 1 and 49,999 zeros. Every pair of distinct words was exactly orthogonal. Cat and kitten were no more similar than cat and thermodynamics. The move to dense embeddings fixed this, and in doing so enabled nearly everything that came after.
The classical representation of text in machine learning was the one-hot vector: enumerate the vocabulary, and represent each word as the corresponding basis vector. With a 50,000-word vocabulary this is a 50,000-dimensional vector. Every pair of distinct words is orthogonal — their dot product is exactly zero. This means that in any linear model built over one-hot features, the similarity between two words is whatever the model learns it to be from scratch, with no prior information. If you have never seen the word kitten in your training data, your model knows literally nothing about it, not even that it is probably similar to cat. Classical sparse-feature models (TF-IDF + logistic regression, CRFs over bag-of-words) live and die by this property. They are effective when training data is abundant and vocabulary is bounded, and hopeless otherwise.
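The orthogonality problem is easy to see concretely. A minimal sketch in Python — the three-word vocabulary and its indices are invented for illustration:

```python
import numpy as np

# Toy three-word vocabulary; words and indices invented for illustration.
vocab = {"cat": 0, "kitten": 1, "thermodynamics": 2}
V = len(vocab)

def one_hot(word):
    """Return the one-hot basis vector for a word."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

# Every pair of distinct words is exactly orthogonal: dot product zero.
# The representation carries no hint that cat and kitten are related.
print(one_hot("cat") @ one_hot("kitten"))           # 0.0
print(one_hot("cat") @ one_hot("thermodynamics"))   # 0.0
```

The dot products are identical for every pair of distinct words, which is precisely the problem: similarity is not a property of the representation at all.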
Dense embeddings — learned low-dimensional real-valued vectors, typically 50 to 1,000 dimensions — break this curse by making similarity a property of the representation itself. In a well-trained embedding space, cat and kitten land near each other, cat and dog land a little further, and cat and thermodynamics land in a different region of the space entirely. A classifier trained on top of embeddings therefore enjoys a form of generalisation across vocabulary that is impossible with one-hot features. This is the core value proposition: embeddings share statistical strength across related words, so models require less labelled data to reach good performance and are robust to words they have never seen in training.
Dense embeddings also solve several engineering problems that sparse features create. They keep downstream models small — a 300-dimensional embedding input is two orders of magnitude smaller than a 50,000-dimensional one-hot. They compose naturally with the neural architectures that became dominant in 2013–2015. They are differentiable: you can backpropagate through an embedding layer, fine-tuning the representation for your task. And crucially, they allow pretraining: learn embeddings once on a billion words of raw text, then reuse them for every downstream task. This decoupling — expensive pretraining followed by cheap task-specific fine-tuning — is the template that every subsequent foundation model has followed.
The consequences went well beyond NLP. Once it became clear that dense vectors could encode rich semantic structure, the same idea was applied to users (collaborative filtering embeddings), to items in a catalogue (product2vec), to nodes in graphs (DeepWalk, node2vec), to images (CNN feature layers), to audio clips, to molecules, and eventually to multi-modal joint spaces (CLIP). "Everything is a vector" became the governing metaphor of applied ML for a decade. Retrieval-augmented generation, semantic search, recommender systems, multi-modal foundation models — all trace their architectural lineage back to the moment when NLP successfully learned to represent words as points in a continuous space.
The shift from sparse to dense representations is the single most consequential methodological change in applied ML of the last two decades. Every subsequent development — transformers, LLMs, foundation models, retrieval, multi-modal learning — builds on the assumption that meaning can be encoded as geometry in a learned vector space. This chapter is about how we got to that assumption and what it costs us.
"You shall know a word by the company it keeps." J.R. Firth's 1957 slogan is the linguistic commitment underlying every embedding method ever built. Words that appear in similar contexts tend to have similar meanings; if we can quantify context, we can quantify meaning.
The distributional hypothesis is the claim that the meaning of a word is largely determined by the linguistic contexts in which it appears. If cat and kitten appear in similar surrounding word distributions — both frequently occur with purr, whiskers, lap, fur, meow — then they share meaning. If cat and thermodynamics appear in dissimilar distributions, they do not. The hypothesis does not claim that distribution fully determines meaning (it famously fails on synonyms vs antonyms: good and bad appear in almost identical contexts), but it claims enough: the distributional signal is strong enough to carry most of the semantic structure we care about, from an engineering perspective.
The idea has two parallel lineages. In British linguistics, John Rupert Firth formulated the slogan "You shall know a word by the company it keeps" in a 1957 volume of Studies in Linguistic Analysis. His student Michael Halliday developed the idea into systemic-functional linguistics. In American structuralist linguistics, Zellig Harris — Noam Chomsky's doctoral advisor — argued in 1954's "Distributional Structure" that the structure of a language could in principle be derived entirely from the distribution of its elements. Harris's work influenced generations of computational linguists and remains the canonical citation for the hypothesis in its operationalised form.
Translating the hypothesis into engineering requires three choices. First, what counts as context? The standard choice is a fixed-width window of surrounding words — the five words on either side of the target, say — but alternatives include syntactic dependency context (the words that are grammatically related to the target), document context (the whole document in which the target appears), or document-level topic context. Second, how do we represent the relationship between a word and its contexts? The classical answer is a co-occurrence count matrix; word2vec's answer is a learned parameter matrix. Third, how do we reduce the high-dimensional co-occurrence signal to a useful low-dimensional embedding? LSA uses SVD, word2vec uses neural training, GloVe uses weighted matrix factorisation. All three choices can vary independently, producing a zoo of methods that all share the same distributional-hypothesis commitment.
The hypothesis has limits. Words with low frequency have too little distributional evidence to localise reliably. Words with multiple distinct senses (polysemy: bank = river bank or financial institution; bat = animal or sports equipment) are badly served by a single vector that averages over senses. Words that differ only in polarity (good vs bad, hot vs cold) appear in such similar contexts that pure distributional methods cannot separate them. These limits motivated contextualised embeddings (§13–§15) and various sense-disambiguation extensions. But for a vast range of applications, the distributional hypothesis — operationalised as dense vectors trained on raw text — is simply the right model. Most of what we have discovered about word meaning since 2013 has come from pushing this hypothesis harder, not from abandoning it.
Long before word2vec, computational linguists were building distributional representations by counting. Co-occurrence matrices, PMI weighting, and truncated SVD produced word vectors that captured real semantic structure — and the best count-based methods remain competitive with neural embeddings on many intrinsic tasks.
The simplest count-based embedding is the term-context co-occurrence matrix. For each pair of words (w, c), count how many times w appears within a window around c across the corpus. This gives a |V| × |V| matrix whose rows are word vectors — each row is the distribution of contexts in which the corresponding word appears. Already this crude representation captures semantic information: words with similar meanings have similar rows. But the raw counts are biased: a stopword like the co-occurs with nearly everything, so using raw counts as weights effectively treats the as the strongest signal of every word's meaning, which is the opposite of useful.
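Building the term-context counts is a few lines of code. A sketch, assuming a toy whitespace-tokenised corpus and a symmetric window of two:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within a symmetric window of the given width."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

corpus = "the cat sat on the mat the cat saw the dog".split()
counts = cooccurrence_counts(corpus, window=2)
# Even in this toy corpus, 'cat' co-occurs with 'the' three times — the
# stopword dominates the raw counts, which is exactly the bias PMI corrects.
```

Because the window is symmetric, the resulting matrix is symmetric: counts[(w, c)] equals counts[(c, w)].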
The standard fix is Pointwise Mutual Information (PMI): PMI(w, c) = log[P(w, c) / (P(w) · P(c))]. PMI measures whether a word and context co-occur more or less than chance would predict. Under independence, P(w, c) = P(w) · P(c) and PMI is zero. If cat co-occurs with purr far more than chance, PMI is strongly positive; if it co-occurs with the at chance levels, PMI is zero. The practical variant Positive PMI (PPMI) replaces negative values with zero, since negative PMI corresponds to "co-occurs less than chance" — a signal that is usually noisy for rare word-context pairs. PPMI-weighted co-occurrence matrices are the canonical form of classical distributional semantics.
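The PPMI transform can be sketched directly from the formula, operating on a dense count matrix — a toy 2×2 example here; real matrices are large and sparse:

```python
import numpy as np

def ppmi(C):
    """PPMI transform of a dense co-occurrence count matrix C (words x contexts)."""
    total = C.sum()
    Pw = C.sum(axis=1, keepdims=True) / total   # P(w): row marginals
    Pc = C.sum(axis=0, keepdims=True) / total   # P(c): column marginals
    Pwc = C / total                             # P(w, c): joint
    with np.errstate(divide="ignore"):
        pmi = np.log(Pwc / (Pw * Pc))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts: treat PMI as 0
    return np.maximum(pmi, 0.0)                 # clip negatives: PPMI

# Toy counts: word 0 strongly prefers context 0, word 1 prefers context 1.
C = np.array([[10.0, 0.0],
              [1.0, 5.0]])
M = ppmi(C)
```

The diagonal entries come out positive (more co-occurrence than chance) and the off-diagonal entries are zeroed out, either because the count was zero or because the pair co-occurred less than chance.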
PPMI matrices are still sparse and high-dimensional, so practitioners apply dimensionality reduction — usually truncated Singular Value Decomposition. Take the top k singular vectors (typical k: 50, 100, 300) to produce a dense |V| × k embedding matrix. This method has a name: Latent Semantic Analysis (LSA, Deerwester et al. 1990). LSA was the dominant word-embedding method for about twenty years. Its close cousins include Hyperspace Analogue to Language (HAL, Lund & Burgess 1996), which uses directional windows; Random Indexing, which approximates LSA without SVD; and GloVe (§7), which can be read as a particular weighted matrix factorisation of a log-co-occurrence matrix.
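The final LSA step — truncated SVD of the weighted matrix — is essentially a one-liner with numpy; a random matrix stands in for a real PPMI matrix here:

```python
import numpy as np

def lsa_embeddings(M, k):
    """Dense k-dimensional word vectors via truncated SVD of a weighted matrix M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep the top-k left singular vectors, scaled by their singular values.
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
M = rng.random((20, 50))        # stand-in for a real |V| x |V| PPMI matrix
E = lsa_embeddings(M, k=5)      # 20 "words", 5 dimensions each
```

At real vocabulary sizes one would use a sparse, truncated solver rather than a full dense SVD, but the shape of the computation is the same.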
Count-based methods have two under-appreciated virtues. They are inspectable: you can look at the raw co-occurrence counts and the PPMI weights and understand exactly where each dimension's signal came from, in a way that a neural embedding does not permit. And they are competitive: Levy, Goldberg, and Dagan's 2015 paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" showed that carefully tuned count-based methods match or exceed word2vec on most intrinsic evaluations, when both are given the same vocabulary and the same hyperparameter budget. The advantage of neural methods is not really that they capture more semantic signal than counting does; it is that they are faster to train at scale and easier to integrate into downstream neural systems.
The Levy et al. 2015 paper is the right corrective to a common misconception: that word2vec discovered a new kind of semantic signal. It did not. It discovered a dramatically more efficient way to extract signal that counting had been extracting for decades. When you read that a neural model has "learned" some semantic structure, remember that most of that structure is there for the counting in the underlying co-occurrence statistics.
Tomas Mikolov's 2013 paper "Efficient Estimation of Word Representations in Vector Space" was the moment dense embeddings became a standard tool for NLP. The architectures were shallow, the training objective was simple, and the resulting vectors were good enough to fuel a generation of applications.
The word2vec paper (Mikolov, Chen, Corrado, Dean 2013) introduced two architectures — Continuous Bag-of-Words (CBOW) and Skip-gram — for learning dense word vectors from raw text. Both architectures are shallow neural networks: a single hidden layer, linear activations on the input and output sides, trained to predict context words from target words or vice versa. The hidden layer weights are the embeddings. There is no deep network, no convolutions, no attention — just a single learned matrix representing each word as a row, plus a second learned matrix representing each word as a context. After training, one of these matrices (usually the input-side one) is used as the embedding matrix.
The key insight is that the task — predict context words from a centre word, or vice versa — is a pretext. The paper's authors did not care about the accuracy of these predictions. They cared about what the hidden-layer weights learned in the process. Training the model to predict that purr appears near cat, and that it appears near kitten, and that it does not appear near thermodynamics, forces the hidden-layer representations of cat, kitten, and thermodynamics into positions that reflect their distributional similarity. Meaning falls out of prediction as a by-product.
What made word2vec a breakthrough was engineering, not algorithmic novelty. The objective — predict context given word — was a special case of Bengio et al.'s 2003 neural language model, which had been around for a decade. The distributional framing was a century old. What Mikolov's team got right was making the computation so cheap that you could train on billions of words on a single machine. The architectural simplification to a single linear layer, the use of negative sampling (§6) instead of full softmax, and a careful implementation in C that exploited multi-core CPUs turned a days-long training run on 100M words into a minutes-long run on a billion. Suddenly anyone with a corpus could train embeddings, and anyone with pretrained embeddings could use them for downstream tasks.
The impact on applied NLP was immediate and overwhelming. Within eighteen months of the 2013 release, pretrained word2vec vectors (especially the "GoogleNews-vectors-negative300.bin" file, trained on 100 billion words of news text) had become a standard input to nearly every serious NLP system. Sentiment classifiers, document classifiers, NER taggers, parsers, machine translation systems — all improved by initialising their lookup tables from word2vec. The dramatic accuracy gains, combined with the recipe's simplicity (download a 1.5GB binary, look up each word), caused most other word-representation methods to be abandoned in production over the course of two or three years. word2vec was the event that converted the NLP community to embeddings as a default tool.
Word2vec ships with two architectures that invert the same task. CBOW predicts a centre word from its surrounding context; skip-gram predicts context words from a centre word. The choice matters: skip-gram produces better embeddings for rare words, CBOW trains faster.
Continuous Bag-of-Words (CBOW) takes a window of context words and tries to predict the word at the centre. Given the context the ___ sat on the mat, the model should predict cat with higher probability than thermodynamics. Training averages the embeddings of the context words, passes the average through a softmax, and compares the predicted distribution against the one-hot target word. Gradient descent adjusts the input-side embeddings of the context words and the output-side weights of the centre word. The architecture is called "bag of words" because the context is treated as an unordered set; the order within the window does not matter.
Skip-gram inverts this: given a centre word, predict each of the surrounding context words. For every (centre, context) pair in a training window, the model computes P(context | centre) and adjusts the embeddings to raise the probability of the observed context and lower the probability of unobserved contexts (via negative sampling, §6). Skip-gram generates many more training examples per sentence than CBOW, because each centre word spawns one training example per context position. This makes it slower to train but gives more learning signal per rare word.
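The difference in how the two architectures slice a sentence into training examples can be made concrete. A sketch of the pair-generation step only — the models and training loop are omitted:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each centre word spawns one (centre, context) pair per position."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one example per position, an unordered context bag plus the centre."""
    examples = []
    for i, centre in enumerate(tokens):
        ctx = [tokens[j] for j in range(max(0, i - window),
                                        min(len(tokens), i + window + 1)) if j != i]
        examples.append((ctx, centre))
    return examples

sent = "the cat sat on the mat".split()
# Skip-gram yields 18 training pairs from this six-word sentence;
# CBOW yields 6 examples, one per centre position.
```

The three-to-one ratio in example counts is exactly why skip-gram trains slower but gives each word more learning signal.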
The practical tradeoff is well documented. CBOW trains 2–3x faster than skip-gram for the same corpus. It produces slightly better embeddings for frequent words. It averages over the context, which makes the signal from any individual context word relatively weak, and it disfavours rare words because they receive proportionally fewer training updates. Skip-gram is slower but treats every word equally as a training target, so it learns better embeddings for rare words and captures sharper semantic distinctions. For most modern applications the skip-gram model is preferred; CBOW survives mostly as a baseline and for applications where training time is the binding constraint.
A second important hyperparameter is the window size — how many words on each side of the target count as context. Small windows (2–5) capture syntactic similarity: the embeddings of nouns cluster together, verbs cluster together, adjectives cluster together, because these classes have similar immediate contexts. Large windows (10–20) capture topical similarity: words from the same domain cluster together regardless of part of speech. A window of 5 is standard; a window of 10 is typical for topic-oriented applications. The choice is rarely critical in practice, but it is worth knowing that "similar" can mean very different things depending on this hyperparameter.
A subtle point: skip-gram and CBOW do not actually produce the same embeddings even at convergence. The objectives are different (predicting context from centre vs centre from context), so the two methods learn different geometries on the same corpus. In practice the differences are small for downstream applications, but they are not identical.
The softmax over a 50,000-word vocabulary is too expensive to compute at every training step. Word2vec's speed came from two clever workarounds: negative sampling, which replaces the full softmax with a set of binary classifications against random non-contexts; and hierarchical softmax, which uses a binary tree to reduce softmax cost from O(|V|) to O(log |V|).
Consider the naive skip-gram training objective: for each (centre, context) pair, maximise log P(context | centre), where the conditional is computed via a softmax over the full vocabulary. A gradient step requires computing the softmax denominator, which is a sum over all |V| vocabulary items — a dot product against every word's output vector. For a 50,000-word vocabulary and a corpus of a billion words, that is on the order of 50 trillion dot products per training pass, before counting the gradients. This is prohibitively expensive. Mikolov's team addressed it with two different techniques, either of which makes training tractable.
Hierarchical softmax replaces the flat vocabulary softmax with a binary tree in which each leaf corresponds to a vocabulary word. Each internal node has a learned vector, and the probability of a leaf is the product of binary decisions along the path from root to leaf. With a balanced tree over a |V|-word vocabulary, the path length is log |V| rather than |V|. A Huffman-coded tree (frequent words near the top) reduces the expected path length further. Training is tractable because each update touches only O(log |V|) tree nodes rather than all |V| output vectors.
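The path-product computation can be sketched as follows; the tree, node vectors, and sign convention here are toy stand-ins, not a real Huffman tree:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaf_probability(h, path):
    """P(word | centre) under hierarchical softmax: a product of binary
    decisions along the root-to-leaf path.
    h: hidden vector for the centre word.
    path: (node_vector, direction) pairs; direction is +1 or -1 depending on
    whether the path branches left or right (the sign convention is arbitrary)."""
    p = 1.0
    for node_vec, direction in path:
        p *= sigmoid(direction * (node_vec @ h))
    return p

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)
# A hypothetical two-node root-to-leaf path in a toy tree.
path = [(rng.normal(size=d), +1), (rng.normal(size=d), -1)]
p = leaf_probability(h, path)
```

Because the left and right branch probabilities at each node sum to one, the leaf probabilities form a properly normalised distribution over the vocabulary — the property negative sampling gives up.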
Negative sampling takes a different approach. Reformulate the training objective as a binary classification: for each observed (centre, context) pair, predict "yes, this is a real context", and for k randomly sampled (centre, random word) pairs, predict "no, this is a fake context". The random contexts are called negative samples. This turns the expensive softmax into k + 1 cheap sigmoid computations (typically k = 5 to 20). The objective is a noise-contrastive estimation of the full softmax, and at its optimum it factorises a related but not identical quantity — the word-context PMI matrix shifted by log k, as Levy and Goldberg showed in 2014.
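The per-pair loss then needs only a handful of sigmoids. A sketch, with random vectors standing in for learned embeddings and k = 5 negatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w_centre, c_pos, c_negs):
    """Skip-gram negative-sampling loss for one observed (centre, context) pair:
    -log sigmoid(c_pos . w) minus, for each negative, log sigmoid(-c_neg . w)."""
    loss = -np.log(sigmoid(c_pos @ w_centre))
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-(c_neg @ w_centre)))
    return loss

rng = np.random.default_rng(0)
d, k = 16, 5                                   # embedding size, negatives per pair
w = rng.normal(size=d)                         # centre-word vector (random stand-in)
pos = rng.normal(size=d)                       # observed context vector
negs = [rng.normal(size=d) for _ in range(k)]  # k sampled negative contexts
loss = neg_sampling_loss(w, pos, negs)         # k + 1 sigmoids, no |V|-wide softmax
```

Gradient descent on this loss pulls the observed context vector towards the centre word and pushes the sampled negatives away.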
The negative-sample distribution matters. Drawing negatives uniformly at random over-samples rare words and under-samples common ones; drawing from the empirical unigram distribution does the reverse. Mikolov's team found empirically that raising the unigram distribution to the 3/4 power struck a sweet spot, boosting rare words modestly. This strange-looking heuristic — sample negatives from P_unigram(w)^0.75 — propagated into every subsequent embedding method. It is a small example of a general lesson: the engineering choices that get a method to work at scale often matter more than the algorithmic framing, and they often get absorbed into the subsequent literature as "what everyone does".
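The heuristic is one line to implement, and its effect on a toy frequency table is easy to verify:

```python
import numpy as np

def negative_sampling_distribution(freqs, power=0.75):
    """Mikolov's heuristic: sample negatives proportionally to unigram^0.75."""
    weights = np.asarray(freqs, dtype=float) ** power
    return weights / weights.sum()

# Toy counts: a frequent word, a mid-frequency word, a rare word.
freqs = [1000, 100, 1]
p_unigram = np.asarray(freqs, dtype=float) / sum(freqs)
p_neg = negative_sampling_distribution(freqs)
# The 3/4 power boosts the rare word relative to its raw frequency
# and shrinks the head of the distribution.
```

Compared with the raw unigram distribution, the rare word's sampling probability goes up severalfold while the most frequent word's goes down — the "modest boost" the text describes.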
In practice, most word2vec implementations use negative sampling rather than hierarchical softmax: it is simpler to implement, easier to tune, and produces slightly better embeddings on standard benchmarks. Hierarchical softmax retains a niche in settings where exact probabilities are required downstream (it gives a proper normalised distribution; negative sampling gives point scores). The broader idea — replace an expensive softmax with a noise-contrastive objective — has become ubiquitous in modern contrastive learning, from SimCSE to CLIP. Word2vec's speedup trick turned out to be a fundamental technique.
Pennington, Socher, and Manning's 2014 GloVe paper reformulated embedding learning as weighted matrix factorisation over global co-occurrence counts. It sat between LSA's global-statistics approach and word2vec's local-context approach, and briefly matched or beat both.
GloVe (Global Vectors for Word Representation; Pennington, Socher, Manning 2014) was motivated by a simple observation: word2vec explicitly uses only local context windows, ignoring the global co-occurrence statistics that LSA captures; LSA uses global co-occurrence statistics but produces embeddings via SVD that do not optimise for the kind of semantic signal word2vec was finding. GloVe was an attempt to combine the advantages of both. The method first constructs the global word-context co-occurrence matrix X, then learns word vectors and context vectors whose dot product approximates log X_{ij}, weighted by a function that downweights very frequent co-occurrences.
The GloVe objective is J = ∑_{i,j} f(X_{ij}) · (w_i · c_j + b_i + b_j - log X_{ij})², where w_i and c_j are word and context vectors, b_i and b_j are bias terms, and f is a weighting function that grows sublinearly, saturates at a cutoff x_max = 100 (so extremely frequent co-occurrences do not dominate), and vanishes at x = 0 (so we do not try to fit zero co-occurrences, whose log is undefined). This is a weighted least-squares factorisation of the log-co-occurrence matrix. Training uses AdaGrad on the loss. The final embedding is typically w + c — the sum of the word and context vectors — which empirically performs slightly better than either alone.
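The weighting function f is simple to write down; a sketch using the paper's published constants (x_max = 100, α = 3/4):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting: f(x) = (x / x_max)^alpha below the cutoff, 1 above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

w = glove_weight([0.0, 10.0, 100.0, 10000.0])
# f vanishes at zero, grows sublinearly, and saturates at x_max:
# a pair seen 10,000 times gets no more weight than one seen 100 times.
```

The cap is what keeps pairs involving stopwords from dominating the least-squares objective, playing the same role PMI weighting plays in the count-based methods.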
The derivation starts from the observation that ratios of co-occurrence probabilities encode semantic information. For a probe word k, the ratio P(k | ice) / P(k | steam) is large when k = solid (ice is solid; steam is not), small when k = gas (steam is a gas; ice is not), and close to 1 when k = water or k = fashion (related to both, or to neither). GloVe's loss is derived by asking what form of function of word vectors would make dot products reproduce these ratios, and working backwards. The details of the derivation are less important than the conclusion: a weighted matrix factorisation with this weighting function falls out naturally from the distributional-ratio starting point.
When GloVe was released in 2014, it outperformed word2vec on most word-similarity and analogy benchmarks by a modest margin. A year of follow-up work mostly equalised the methods: with equivalent hyperparameter tuning and corpus size, the two methods produce similar-quality embeddings for downstream use. In practice, GloVe embeddings trained on Wikipedia and Gigaword (6 billion tokens, 400,000-word vocabulary) and released as pretrained 50-, 100-, 200-, or 300-dimensional vectors became a widely used alternative to word2vec's GoogleNews vectors, alongside larger releases trained on Common Crawl (42 and 840 billion tokens). By 2018, both sets had been largely displaced by ELMo and BERT contextual embeddings, but they remain useful as baselines and for settings where static embeddings suffice.
Word2vec and GloVe share a weakness: they treat each word as an atomic unit, so they cannot represent words they have not seen in training. FastText fixes this by embedding character n-grams instead, which gives morphology-aware representations and graceful out-of-vocabulary behaviour.
fastText (Bojanowski, Grave, Joulin, Mikolov 2017) extends the word2vec skip-gram architecture by representing each word as a bag of character n-grams. The word cat, with n = 3, becomes the n-grams <ca, cat, at>, plus the word itself as a special token <cat>. Each n-gram has its own embedding; the word embedding is the sum of its n-gram embeddings. Training is skip-gram with negative sampling, just as in word2vec, but with this compositional embedding at every step.
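The n-gram decomposition is straightforward to implement. A sketch restricted to n = 3, reproducing the cat example above (the real library uses n = 3 to 6 and hashes n-grams into a fixed-size table):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers, fastText-style,
    plus the whole bounded token kept as a separate special unit."""
    bounded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(bounded) - n + 1):
            grams.add(bounded[i:i + n])
    grams.discard(bounded)   # the full token is handled as its own unit
    return grams, bounded

grams, whole = char_ngrams("cat", n_min=3, n_max=3)
# grams == {"<ca", "cat", "at>"}, whole == "<cat>"
# The word's vector is the sum of the embeddings of these units.
```

The boundary markers matter: "<ca" (word-initial) and "cat" (word-internal) get distinct embeddings, which is how the model distinguishes prefixes and suffixes from mid-word substrings.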
The immediate practical benefit is out-of-vocabulary handling. Classical word2vec has no representation for a word it did not see in training — it cannot produce an embedding for pre-industrial if that exact token never appeared in the training corpus. FastText constructs the embedding from n-gram embeddings that were seen in training, so it can produce reasonable vectors for any string. This is especially valuable for morphologically rich languages (Finnish, Turkish, Arabic, Russian) where a single lemma can surface as dozens of inflected forms, and also for any domain with many neologisms, typos, or technical terms.
The second benefit is morphology-aware similarity. The embeddings of walk, walked, walks, walking, and walker all share n-grams and therefore share parts of their vector representations. A classical word2vec model has to learn these similarities from context; fastText has them partially baked into the architecture. For morphologically rich languages the improvement on downstream tasks can be several percentage points, particularly for rare words.
fastText also ships a separate text classifier — a simple linear model that averages fastText word embeddings and trains a softmax classifier on top. Despite the architectural simplicity, Joulin et al.'s 2017 "Bag of Tricks for Efficient Text Classification" paper showed that this classifier matches or beats much deeper models on many standard text classification benchmarks, at a fraction of the training cost. The practical recipe — fastText features, linear classifier, 1 GB of RAM, trains in minutes — has remained a strong baseline for over five years.
The fastText library, released by Facebook AI Research, is still actively maintained and widely used. In 2026, for problems involving static word embeddings on morphologically rich languages or short-text classification, fastText is often the right tool — not because it is the most accurate option (a fine-tuned BERT will usually beat it) but because it is the Pareto-optimal combination of accuracy, training cost, inference cost, and operational simplicity. Pretrained fastText vectors are available for 157 languages.
Word2vec produced one of the most famous results in modern ML: king - man + woman ≈ queen. The analogy discovery was not just a party trick — it was the first clear evidence that learned embedding spaces encode semantic relationships as geometric regularities, opening a decade of work on what is actually in those vectors.
The analogy structure of word2vec embeddings was the first result to make clear that something genuinely interesting was happening in the learned space. Mikolov and colleagues showed that analogies like "man is to woman as king is to queen" could be solved by vector arithmetic: compute vec(king) - vec(man) + vec(woman), find the nearest embedding vector, and the answer is often queen. The same held for analogies across categories: Paris : France :: Rome : Italy, walking : walked :: swimming : swam, big : bigger :: small : smaller. The fact that semantic relationships showed up as consistent vector offsets — geometric translations in the embedding space — was striking and unexpected.
The effect is real but over-reported. The "king - man + woman" example works because queen is excluded from the candidate set during the nearest-neighbour lookup — if you include the original three words, king is often closer to the target vector than queen is. The accuracy on large analogy benchmarks like Google's analogy dataset is 60–75% for good embedding methods, not 95%. Many analogies that look like they should work do not; many of those that do work succeed because of the particular method of nearest-neighbour search used (3CosAdd vs 3CosMul). The general lesson is that embedding spaces encode semantic relationships imperfectly but in a way that sometimes shows up as linear structure, not that there is a perfect grid of meaning embedded in the vectors.
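The 3CosAdd procedure and the exclusion caveat can be made concrete with a toy vocabulary (hand-picked illustrative vectors, not trained embeddings):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def unit(v):
    return v / np.linalg.norm(v)

def analogy(a, b, c, exclude=True):
    """Solve a : b :: c : ? by 3CosAdd: nearest neighbour (by cosine) to
    vec(b) - vec(a) + vec(c). The query words are excluded from the
    candidate set -- without this, the answer is frequently one of the
    inputs rather than the intended fourth word."""
    target = unit(emb[b] - emb[a] + emb[c])
    banned = {a, b, c} if exclude else set()
    scores = {w: float(unit(v) @ target) for w, v in emb.items()
              if w not in banned}
    return max(scores, key=scores.get)

ans = analogy("man", "king", "woman")   # king - man + woman -> ?
```

3CosMul differs only in how the three cosine terms are combined (multiplicatively, with a small epsilon), which is enough to change the benchmark numbers noticeably.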
Beyond analogies, embedding spaces have other interesting geometric properties. Semantic clusters are robust: countries cluster together, animals cluster together, emotions cluster together, sports verbs cluster together. Linear subspaces sometimes encode specific axes of meaning — a royalty axis, a gender axis, a temperature axis. The norm of a vector often correlates loosely with word frequency or semantic specificity. The dimensionality of the space has been studied in isolation: 300 dimensions appears to be a sweet spot for English word embeddings, with diminishing returns above 500 and degradation below about 50.
The geometric structure is not free. Much of it emerges only with careful embedding training and corpus preprocessing; on small or idiosyncratic corpora, the structure can be weak or absent. Cross-lingual analogy structure is especially fragile — it requires alignment methods (§11). And none of the geometric structure survives naive averaging across senses: the vector for bank lies somewhere between the financial-institution region and the river-edge region of the space, and is not really "close" to either. This last problem is what contextualised embeddings (§13–§15) exist to fix.
A recurring theme in the interpretability literature: the structure we perceive in embedding spaces often reflects the structure of the corpus rather than any deep property of language. Embeddings trained on news text reflect news-text concerns; embeddings trained on biomedical text place different things near each other. This is not a bug — it is what "learning distributional semantics from a corpus" means — but it is easy to forget when reading about what embeddings "understand".
How do you know if an embedding is good? There are two broad answers — intrinsic evaluation measures the embedding against a task defined directly on word pairs; extrinsic evaluation measures downstream task performance. Each has its uses, and the community has argued about which one matters for over a decade.
Intrinsic evaluation measures the embedding directly, without a downstream task. The two most common intrinsic metrics are word similarity (the correlation between model-predicted similarity and human-annotated similarity over standard word pairs) and analogy accuracy (the fraction of a : b :: c : ? analogies the model gets right). Standard word-similarity datasets include WordSim-353, SimLex-999, RareWord, MEN, and SimVerb-3500. The standard analogy dataset is Google's, consisting of roughly 20,000 analogies across syntactic and semantic categories. These intrinsic tests are fast to run and their results are interpretable, which is why they dominated the embedding literature from 2013 to about 2017.
Extrinsic evaluation measures what actually matters: does this embedding make a downstream system better? You hold the rest of the pipeline fixed — a particular NER tagger, say, or a sentiment classifier — and compare F1 scores or accuracy with different embedding inputs. This is more informative in one way (it measures the thing you actually care about) and less informative in another (performance can depend heavily on the downstream task, fine-tuning details, and hyperparameters, making cross-paper comparison difficult). Modern papers typically report a mix of both kinds of evaluation.
The two kinds of evaluation can disagree, sometimes dramatically. Schnabel, Labutov, Mimno, and Joachims's 2015 paper "Evaluation methods for unsupervised word embeddings" showed that the ranking of embedding methods changes depending on the evaluation metric, and that intrinsic metrics correlate only weakly with downstream task performance. Faruqui, Tsvetkov, Rastogi, and Dyer's 2016 "Problems With Evaluation of Word Embeddings Using Word Similarity Tasks" argued that word-similarity benchmarks are so noisy and task-dependent that they should not be used as the primary evaluation. These papers did not settle the argument but they established a discipline of reporting multiple metrics and not reading too much into any single score.
A practical consequence: do not trust benchmark numbers in isolation. A good embedding for a particular downstream task is the one that performs best on that task, and the only way to know is to try it. Intrinsic benchmarks are a useful filter — a model that scores 0.2 on WordSim-353 is probably not worth downstream evaluation — but they are not a substitute for the downstream evaluation itself. This lesson transfers directly to the current generation of contextual and sentence embeddings, where the same gap between intrinsic and extrinsic performance shows up in benchmarks like MTEB.
If every language has its own embedding space, can we align them into a shared one where translations land near each other? Yes, and the techniques involved — orthogonal alignment, Procrustes rotation, adversarial training — are among the cleverest in the embedding literature.
Monolingual embeddings trained on English and monolingual embeddings trained on French share no coordinate system. The English cat and the French chat are both points in 300-dimensional spaces, but those spaces are unrelated, so the two points have no meaningful relationship. Cross-lingual embeddings address this by aligning monolingual spaces so that translation pairs land near each other. The simplest approach uses a dictionary of known translation pairs (say 5,000 pairs) to learn a single linear map from the source-language space into the target-language space. Mikolov, Le, and Sutskever showed in 2013 that even an unconstrained linear map works remarkably well; later work showed that constraining the map to be orthogonal — the orthogonal Procrustes solution — works better still, because an orthogonal transformation preserves within-language geometry exactly while placing translations near each other.
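The orthogonal Procrustes problem has a two-line closed form via the SVD. The sketch below verifies it on synthetic data in which the "target language" really is a rotation of the "source language":

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: the orthogonal W minimising ||X W - Y||_F
    has the closed form W = U V^T, where U S V^T is the SVD of X^T Y.
    X, Y: source/target vectors for n dictionary pairs, shape (n, dim)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: hide a true rotation Q, then recover it from the
# dictionary pairs alone.
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8))                   # "English" dictionary vectors
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # hidden true rotation
Y = X @ Q                                          # "French" dictionary vectors
W = procrustes_align(X, Y)
```

Because W is orthogonal, it preserves all within-language distances and angles — exactly the property the alignment argument above relies on.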
The approach was refined in multiple directions. Artetxe, Labaka, and Agirre's 2016 paper showed that simple preprocessing — length normalisation and mean centering — combined with an orthogonality constraint on the mapping improves alignment while preserving monolingual geometry. MUSE (Conneau, Lample, Ranzato, Denoyer, Jégou 2018) introduced unsupervised alignment: use adversarial training to find a rotation that makes the source-space distribution look like the target-space distribution, without any dictionary. MUSE's results on language pairs like English-Spanish, English-French, and English-German were striking — alignment quality close to supervised methods, with no bilingual data at all. The technique does not work for all language pairs (it fails on distant languages like English-Hindi or English-Arabic), but where it works, it demonstrated that cross-lingual geometry is a property of the language pair itself, not something that has to be supervised.
A different approach aligns at training time rather than post-hoc. Multilingual BERT (mBERT, 2018) and XLM-R (2019) train a single encoder on concatenated text from 100+ languages, sharing a vocabulary of subword tokens and a set of parameters across all of them. The resulting encoder produces contextual embeddings that share a cross-lingual space without explicit alignment. Pires, Schlinger, and Garrette's 2019 "How multilingual is multilingual BERT?" paper demonstrated that mBERT supports zero-shot cross-lingual transfer — fine-tuning on English task data and testing on German, Spanish, or Chinese — with surprisingly strong results, especially for syntax-heavy tasks.
The practical value of multilingual embeddings is large. Cross-lingual retrieval (query in one language, results in another) runs on them. Cross-lingual classification lets you train once and deploy across languages. Language identification, machine translation, and low-resource NLP all depend on representations that span language boundaries. The underlying geometric question — how similar are the conceptual spaces of different languages? — also bears on deep questions about linguistic universals and the distributional hypothesis itself. The empirical answer, broadly, is "more similar than most linguists expected before about 2013", with the caveat that the similarity breaks down for culture-bound and low-resource languages.
If word2vec learns that king - man + woman ≈ queen, what else does it learn? Unsurprisingly, it learns all the social biases present in its training corpus — and it encodes them in geometric form, making them easy to detect and difficult to remove.
The foundational paper is Bolukbasi, Chang, Zou, Saligrama, and Kalai's 2016 "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings". The authors showed that GoogleNews word2vec embeddings encoded gender stereotypes in their geometry: man : computer programmer :: woman : homemaker, father : doctor :: mother : nurse, man : strong :: woman : pretty. The effects were systematic, large, and consistent with empirical patterns of bias in the training corpus — which was, after all, Google News. If news articles historically associated programming with men and nursing with women, the embeddings learned and encoded that association.
The paper did not just diagnose bias; it proposed a debiasing procedure. Project the embeddings onto a gender direction (defined by the difference vector between gender-paired word sets, e.g. he - she, man - woman, boy - girl), then subtract the projection from words that should not be gender-marked (programmer, nurse, doctor), while preserving the projection for words that should (he, mother, actor). The debiased embeddings reduced gender-stereotypical analogy performance while preserving performance on downstream tasks and on gender-neutral semantic tasks.
The field's reaction was mixed. Gonen and Goldberg's 2019 paper "Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but do not Remove Them" argued that Bolukbasi-style debiasing does not actually remove bias — it just makes bias harder to detect. Gender-stereotypical words still cluster together after debiasing, they just no longer align with a single gender direction. The authors concluded that bias is so deeply woven into the distributional signal that local geometric fixes cannot remove it; the training data itself would need to be changed.
The broader lesson is about what embedding methods are and are not doing. An embedding trained on a corpus encodes the statistical regularities of that corpus, including regularities we would prefer it not to learn. The mathematical machinery is value-neutral — it will learn that programmer is associated with masculine contexts if the training corpus reflects that association, and with feminine contexts if it does not. "Debiasing" a single set of vectors is a cosmetic fix. Real debiasing requires attention to the training data, the representational form, and the downstream decision systems that consume the embeddings. This applies equally to contextual embeddings and to LLMs, where the same biases show up in amplified form.
Bias in embeddings is not only about social stereotypes. Technical embeddings show analogous effects — scientific-literature embeddings reflect citation politics, code embeddings reflect the demographics of open-source contribution, medical embeddings reflect the historical sample composition of clinical trials. Wherever a corpus has asymmetry, the embedding will encode it. Being aware of this is one of the most important skills an ML practitioner can develop.
A single vector per word cannot capture that bank means a financial institution in one sentence and a river edge in another. The fix was contextualised embeddings: one vector per token, computed as a function of the surrounding sentence. The paradigm shift reshaped NLP in 2018.
Static word embeddings — word2vec, GloVe, fastText — give each word a single fixed vector, learned once from the training corpus and served up to every downstream system. This is fine for words with stable meanings (thermodynamics, Paris, eleven) and terrible for words with context-dependent meanings. The word bank has at least three senses (financial institution, side of a river, act of tilting an aircraft) that appear in different contexts and mean different things. A single vector for bank averages over all three senses, landing somewhere in the middle of the three clusters and being a good representation of none. Polysemy — the phenomenon of a single word having multiple distinct meanings — is not a rare edge case; it affects a substantial fraction of high-frequency vocabulary.
The problem extends beyond discrete sense disambiguation. The word run can be a verb or a noun; it can describe physical running, operating software, managing a business, a streak of luck, or a sequence of cards. The part of speech varies with context, and so does the fine-grained meaning. Classical workarounds included training separate embeddings per part of speech, or doing explicit word-sense disambiguation as a preprocessing step, or simply ignoring the problem and hoping downstream models would recover from the noise. None of these was fully satisfying.
The contextualised embedding approach computes a fresh vector for each token as a function of its surrounding sentence. The word bank in "I deposited the check at the bank" gets a vector near the financial-institution cluster; the same word in "We sat on the bank of the river" gets a vector near the river-edge cluster. One word, many possible vectors, depending on the context in which it appears. This requires architectures that can actually process sequential context — bidirectional RNNs and, later, transformers — which is why contextual embeddings had to wait until those architectures were mature.
The conceptual move was not new: the idea that word meaning depends on sentential context is as old as semantics. What was new was the demonstration that a large pretrained encoder could produce useful context-dependent vectors automatically, without requiring explicit sense annotation. ELMo (2018) demonstrated the approach with bidirectional LSTMs. BERT (2018) and its successors generalised it to transformers and to dozens of downstream tasks. Within two years, static embeddings had been largely displaced in research and in many production settings by contextual ones. The remaining role for static embeddings is where speed and simplicity matter more than accuracy — quick similarity lookup, word-level clustering, and as a default input to lightweight production systems.
Peters et al.'s 2018 ELMo paper was the proof of concept that contextualised embeddings could replace static ones on downstream tasks. A bi-directional LSTM language model, trained on raw text, produced per-token vectors that substantially improved state-of-the-art across six NLP benchmarks.
ELMo — Embeddings from Language Models (Peters, Neumann, Iyyer, Gardner, Clark, Lee, Zettlemoyer 2018) — was the paper that mainstreamed contextualised embeddings. The architecture was a two-layer bi-directional LSTM trained as a language model on a billion-word corpus. The forward LSTM predicts the next token given the leftward context; the backward LSTM predicts the previous token given the rightward context. After training, ELMo provides per-token embeddings by combining the hidden states from both LSTMs across both layers: three vectors per token (the character-based initial embedding plus the two LSTM layer outputs), which downstream tasks mix via a learned convex combination.
The key property of ELMo embeddings was that they were contextual: the same word in different sentences received different vectors. A sentence containing bank near money and deposit produced a different vector for bank than one containing bank near river and bridge. This alone was a substantial departure from static embeddings. The other novelty was the layer-mixing: the authors found that different downstream tasks benefit from different mixtures of the LSTM layers. Lower-layer outputs encoded more syntactic information (useful for parsing), higher-layer outputs encoded more semantic information (useful for entailment, SRL). A learned attention over layers gave each downstream task access to the mixture it needed.
The empirical results were striking. Drop-in replacement of static embeddings with ELMo improved state-of-the-art on six NLP benchmarks: reading comprehension (SQuAD), textual entailment (SNLI), semantic role labelling (OntoNotes), coreference resolution (OntoNotes), named entity recognition (CoNLL-2003), and sentiment analysis (SST). The relative error reductions ranged from roughly 6% to 20% across tasks — large gains for dropping in one component. For a year, ELMo was state of the art across most of NLP.
ELMo's reign was short. Within six months of its release, BERT appeared and took state-of-the-art on nearly every ELMo-dominated benchmark by a larger margin. The transformer architecture's parallelisability let BERT train on more data, scale to larger models, and use masked language modelling instead of the next-token prediction that ELMo's LSTMs were constrained to. But ELMo was the paper that demonstrated the principle: pretrain a large bidirectional encoder on raw text, extract contextual embeddings, plug them into downstream tasks, and watch everything improve. The BERT paper explicitly acknowledges ELMo as the inspiration; the two together defined the pretrain-then-fine-tune paradigm that became universal in NLP.
ELMo is mostly of historical interest in 2026 — for production you would use a BERT-family or LLM-family encoder. But the paper remains worth reading, both as a piece of ML history and as a demonstration of how architectural choices (bi-directional LSTM, layer-mixing) translate into empirical claims. It is one of the cleanest "here is an idea, here is the architecture that implements it, here are the results that validate it" papers in NLP.
Transformer-based encoders — BERT, RoBERTa, ELECTRA, DeBERTa — made contextualised embeddings the default NLP representation. They are used both as fine-tunable pretrained models and as feature extractors for tasks where fine-tuning is impractical or expensive.
BERT (Devlin, Chang, Lee, Toutanova 2018) is covered in more depth in the Transformer Architecture and Pretraining Paradigms chapters; here we care about BERT as an embedding model. Given a pretrained BERT encoder, you can pass a sentence through it and extract per-token hidden states as contextual embeddings. The standard choices are: the last layer only (for semantic tasks); the last four layers concatenated or summed (a common compromise); or a learned combination across all layers (the ELMo-style approach). The [CLS] token's final representation is often used as a sentence-level embedding — though §16 covers better sentence embedding methods.
Fine-tuning vs feature extraction is a practical choice. Fine-tuning updates all BERT parameters along with a task-specific output head; it achieves the best downstream accuracy but requires storing a separate fine-tuned copy of the full model per task, which is expensive at scale. Feature extraction keeps BERT frozen and uses its outputs as fixed features for a lightweight task-specific head; it is cheaper, shares one encoder across many tasks, and accepts some accuracy loss. A common compromise is lightweight fine-tuning via LoRA or other parameter-efficient adapters (covered in the Fine-Tuning chapter). For most production workloads in 2026, the choice is driven by throughput and memory budgets, not by theory.
The probing literature grew up around BERT-style embeddings. Researchers ask: what linguistic structure do these contextual vectors actually encode? Tenney, Xia, Chen et al.'s 2019 "What do you learn from context?" paper showed that BERT's layers encode a recognisable progression: lower layers capture surface features (casing, position), middle layers capture syntax (POS, dependency relations, constituents), and higher layers capture semantics (SRL, coreference, entailment). Hewitt and Manning's 2019 "A Structural Probe for Finding Syntax in Word Representations" showed that BERT's hidden states encode dependency-tree structure as linear subspaces — there is a linear transformation of BERT embeddings after which the squared Euclidean distance between two tokens' embeddings correlates with their syntactic distance in the dependency tree. This was striking: the model learned structured linguistic knowledge without ever being supervised on syntactic annotations.
The subsequent literature refined the picture. Probing work on BERT, RoBERTa, T5, and GPT-family models established a consistent picture: large pretrained transformers encode a lot of linguistic structure, most of which the model was not explicitly trained to learn, and the richness of the encoded structure scales roughly with model size and pretraining corpus size. This is part of why modern LLMs can do zero-shot linguistic tasks reasonably well. It also explains why modern sentence embeddings — covered next — are so much better than classical ones: the underlying token embeddings have dramatically more structured information to aggregate.
Most applications care about sentence or document similarity, not word similarity. Sentence-BERT, SimCSE, and the universal encoder family produce dense vectors for multi-word text that support semantic search, clustering, and similarity at Internet scale.
A pretrained BERT produces per-token embeddings, not sentence embeddings. Naive approaches — averaging the token embeddings, using the [CLS] token's final hidden state — produce embeddings that perform poorly on sentence-similarity benchmarks despite BERT's superiority on per-token tasks. Reimers and Gurevych's 2019 Sentence-BERT paper diagnosed this and fixed it: fine-tune BERT with a sentence-similarity objective (e.g. cosine similarity between embeddings of paraphrases, squared error against human-labelled similarity scores) to produce embeddings that actually support similarity computations. With the resulting model, finding the most similar pair among 10,000 sentences takes about five seconds, versus the roughly 65 hours required to score every pair with BERT as a cross-encoder — a speedup of several orders of magnitude, with better accuracy on similarity benchmarks.
The SimCSE paper (Gao, Yao, Chen 2021) pushed this further with contrastive learning. The key insight was that dropout itself, applied twice to the same sentence, produces two different but semantically-equivalent representations. Train a model to pull these representations together in embedding space and push apart representations of different sentences. This single idea — contrastive learning with dropout as the data augmentation — produced sentence embeddings that beat Sentence-BERT on most benchmarks without requiring labelled similarity data. It is a gorgeous example of how the right training objective can extract better representations from an already-pretrained model.
Beyond these specific methods, sentence embedding has become its own active subfield. Universal Sentence Encoder (Cer et al. 2018) produced task-agnostic sentence embeddings. InferSent (Conneau et al. 2017) was an early paper showing that supervised training on natural language inference produces broadly useful sentence embeddings. More recent models — E5 (Wang et al. 2022), GTE (Li et al. 2023), BGE (Xiao et al. 2023), nomic-embed (2024), Voyage (2024) — use mixtures of contrastive and generative objectives on large high-quality datasets to produce embeddings optimised for the full pipeline of retrieval and ranking that modern applications need. The MTEB benchmark (Muennighoff et al. 2022) is the default shared evaluation across these models, covering retrieval, classification, clustering, STS, reranking, and summarisation.
For practical use in 2026, the choice of sentence embedding depends on the task. For general-purpose English semantic search, current top models (the OpenAI, Voyage, Cohere, and various open-source embeddings on the Hugging Face MTEB leaderboard) cluster within a few points of each other. For multilingual applications, dedicated multilingual models outperform English-first ones. For specialised domains (legal, biomedical, code), domain-finetuned or domain-pretrained embeddings help substantially. The practical playbook is: start with a strong general-purpose model, evaluate on your retrieval or classification task of interest, try domain adaptation if headroom remains, and fine-tune with contrastive loss on in-domain positive/negative pairs if you have labelled data.
Having computed a billion embeddings, how do you find the nearest neighbours of a query vector in real time? The answer is approximate nearest neighbour search, and the infrastructure — FAISS, HNSW, ScaNN, and the vector databases built on them — is what makes embedding-based applications possible at scale.
The core operation in every embedding-based application is nearest-neighbour search: given a query vector, find the k closest vectors in a corpus of N. Exact search is linear in N: compute all N distances, sort, return top k. For small N this is fine. For N = 100 million, at 768 dimensions, it is prohibitive — each query requires roughly 75 billion floating-point operations. Production systems therefore use approximate nearest neighbour (ANN) search: trade a small amount of recall for dramatic speedup.
The major ANN algorithm families are each a compromise of build time, memory footprint, query latency, and recall. IVF (Inverted File Index) partitions the vector space into Voronoi cells using k-means, then searches only cells near the query. HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2016) builds a multi-layer graph where each vector has edges to its nearest neighbours at multiple resolutions; queries descend through layers to find their neighbourhood. Product Quantization (Jégou, Douze, Schmid 2011) compresses each vector into a small code that enables approximate distance computation without decompression, trading memory for speed. ScaNN (Guo et al. 2020) combines quantisation with learned pruning. In practice HNSW is the default for most applications with a few million to a billion vectors; product quantisation or hybrid IVF-PQ is used beyond that scale.
The FAISS library (Johnson, Douze, Jégou 2017), released by Facebook, is the de facto reference implementation for ANN search. It ships every major algorithm with both CPU and GPU implementations and is the backbone of nearly every vector database. The FAISS paper is worth reading not just for the algorithms but for the systems-engineering discipline: careful attention to memory layout, cache locality, SIMD utilisation, and GPU parallelism. ANN search is a data-structure problem as much as it is a numerical one, and FAISS exemplifies the engineering required to make it fast.
On top of the ANN algorithms sit vector databases — specialised data stores designed to manage millions to billions of embeddings with metadata filtering, CRUD operations, sharding, replication, and integration with the rest of the application stack. Pinecone, Weaviate, Milvus, Qdrant, Vespa, and the vector-search extensions of classical databases (PostgreSQL's pgvector, Elasticsearch's dense_vector, MongoDB Atlas Vector Search) all occupy this space. The differences between them are largely operational: indexing algorithm choices, scaling patterns, filtering support, hybrid (vector + BM25) retrieval capabilities, and hosted-vs-self-hosted economics.
The practical lessons for running embedding-based systems at scale are largely generic. Build smaller indexes: most applications do better with 1M curated embeddings than with 100M unfiltered ones. Use hybrid retrieval — BM25 + dense — rather than dense alone, because the two methods have complementary failure modes. Tune the recall/latency tradeoff per application: chat applications tolerate 50ms latency; autocomplete does not. Monitor drift: as embedding models improve, old indexes become stale and need to be rebuilt. And budget for the re-ranker: ANN returns approximate nearest neighbours which are usually worth re-ranking with a more expensive but more accurate model before presenting to the user.
Word embeddings turned out to be the first instance of a general ML principle: learn dense distributed representations of anything you want to reason about, and most of the downstream modelling becomes easier. That principle, more than any specific embedding method, is the lasting contribution of the field.
The lasting consequence of word embeddings is the confirmation that representation learning works. Before 2013 the mainstream view in ML was that you engineered features for your task and fed them into a classifier; the feature-engineering stage was where domain expertise lived. Word embeddings demonstrated that a generic self-supervised objective — predict context from word — could produce features at least as good as hand-crafted ones, and in many cases substantially better. This lesson generalised with remarkable speed: within five years, learned representations had displaced hand-crafted features in vision (via CNN backbones), in speech (via learned spectrogram features), in recommendations (via user and item embeddings), and in many tabular settings. The modern foundation-model paradigm is the logical conclusion of this shift: learn a single huge representation of everything, then adapt it to specific tasks as needed.
Embeddings sit at the core of retrieval-augmented generation — the architecture through which LLMs access information beyond their training data. A RAG system embeds documents in a corpus, stores them in a vector index, embeds incoming queries, retrieves the top-k nearest documents, and conditions the LLM's output on the retrieved text. The quality of the embeddings largely determines the quality of the retrieval, which largely determines the quality of the final answer. Most RAG improvement work is really embedding improvement work: better embedding models, better chunking strategies, better hybrid scoring, better re-ranking.
Embeddings power recommendation systems. User and item embeddings, trained on interaction data, support collaborative filtering at a scale that classical matrix factorisation cannot touch. Two-tower architectures — one tower embeds users, one embeds items, similarity scores are dot products — are the production recommendation paradigm at most large consumer companies, from YouTube to Pinterest to TikTok. The techniques are direct descendants of word2vec's two-matrix structure.
Embeddings unify multi-modal learning. CLIP (Radford et al. 2021) demonstrated that joint training of image and text encoders could produce a shared embedding space where images and their captions land near each other. This enabled zero-shot image classification, text-to-image retrieval, and the foundations of text-to-image generation. Subsequent models extend the idea to audio (CLAP), video (VideoCLIP), structured data (TabTransformer embeddings), proteins (ESM embeddings), and molecules. Cross-modal reasoning in 2026 is fundamentally a shared-embedding-space problem.
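CLIP's zero-shot classification step is geometrically simple once both encoders exist: embed the image, embed one text prompt per class, and softmax over the scaled cosine similarities. The sketch below fabricates tiny hand-made vectors in place of the real encoders, so only the scoring logic reflects CLIP.

```python
import numpy as np

def normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hand-made 3-d stand-ins for CLIP's text and image encoder outputs.
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_emb = normalise(np.array([
    [1.0, 0.0, 0.0],    # "dog" direction
    [0.0, 1.0, 0.0],    # "cat" direction
    [0.0, 0.0, 1.0],    # "car" direction
]))
image_emb = normalise(np.array([0.1, 0.9, 0.1]))   # an image near the cat prompt

# Zero-shot classification: cosine similarity, scaled, then softmax.
logits = 100.0 * text_emb @ image_emb   # CLIP scales by a learned temperature
probs = np.exp(logits) / np.exp(logits).sum()
print(class_prompts[int(np.argmax(probs))])   # "a photo of a cat"
```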
And embeddings are a universal debugging tool. When a system misbehaves, embedding the inputs and looking at clusters often reveals what is going wrong. When a pipeline is under-performing, examining nearest neighbours of misclassified examples often exposes the distributional mismatch. When a new dataset arrives, t-SNE or UMAP of its embeddings is usually the first step in understanding it. The shift from symbolic to continuous representations has not only changed how ML systems work — it has changed how ML practitioners think. A working ML engineer in 2026 reaches for embeddings as reflexively as a working statistician of 1990 reached for histograms and scatter plots. They are the primary way we see our data.
The literature on word embeddings is unusually well-shaped: a small set of breakthrough papers, a dense body of follow-up theory and evaluation, and a mature ecosystem of software that remains in daily use. The list below leans on textbooks for framing, on the foundational papers for the core ideas, on modern extensions for the move to contextual and sentence embeddings, and on software entries for the libraries and vector databases that now execute billions of embedding operations every day.
Chapter 6 ("Vector Semantics & Embeddings") is the single best textbook treatment of this chapter's material. Covers the full arc from count-based methods through word2vec, GloVe, fastText, and contextual embeddings with careful exposition of training objectives and evaluation. Freely available online.
Chapter 10 ("Distributional Semantics") and Chapter 11 ("Word Embeddings") of this compact volume are written by one of the field's sharpest researchers, and they remain the cleanest short exposition of the theoretical relationships between count-based methods, word2vec, GloVe, and matrix factorisation.
Chapter 14 ("Distributed and distributional semantics") places embedding methods in the broader context of lexical semantics. Particularly strong on evaluation and on the relationship between distributional and formal semantics.
Predates neural embeddings but covers the count-based foundations — PMI, SVD, collocations, LSA — with a depth and rigour that the neural literature sometimes lacks. Chapter 8 ("Lexical Acquisition") is especially valuable as background.
Chapter 5 ("Word Embeddings") of Koehn's NMT book provides a practitioner-oriented tour through embedding methods with an eye to the machine translation setting that drove much of the early neural NLP work.
A readable survey of the static-embedding literature up to 2019, with comparative summaries of word2vec, GloVe, fastText, and their many extensions. A good map of the pre-BERT landscape for readers arriving late to the territory.
The origin of the distributional hypothesis in American structuralist linguistics. Short, philosophically clear, and the paper every subsequent embedding paper implicitly cites. Worth reading to see how far the idea precedes its operationalisation.
The LSA paper. The first workable method for producing dense distributional embeddings, still a strong baseline, and the direct intellectual ancestor of every embedding method that followed. Required reading.
The paper that introduced the idea of training neural networks to jointly learn word representations and language models. Predates word2vec by a decade and is the conceptual template that word2vec compressed into an efficient shape.
The word2vec paper. The moment dense embeddings became mainstream. Clear, short, and full of practical engineering insight about how to make the method scale.
The companion paper to word2vec introducing negative sampling, the phrase-learning extension, and the analogy evaluation. The engineering depth here is what made word2vec practically usable at scale.
The GloVe paper. Reformulates embedding learning as weighted matrix factorisation over log co-occurrence counts. Worth reading for the derivation from ratios of co-occurrence probabilities — one of the cleanest pieces of motivation in the embedding literature.
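The objective is easy to state concretely. A numpy sketch of the GloVe loss, with my own function name and the paper's weighting function f(x) = min(x/x_max, 1)^alpha:

```python
import numpy as np

def glove_loss(W, Wc, b, bc, X, x_max=100.0, alpha=0.75):
    """Weighted least squares over log co-occurrence counts:
    sum_ij f(X_ij) * (w_i . wc_j + b_i + bc_j - log X_ij)^2,
    where f damps very frequent pairs and zeroes absent ones."""
    mask = X > 0
    f = np.where(mask, np.minimum(X / x_max, 1.0) ** alpha, 0.0)
    logX = np.where(mask, np.log(np.where(mask, X, 1.0)), 0.0)  # guard log(0)
    resid = W @ Wc.T + b[:, None] + bc[None, :] - logX
    return float((f * resid ** 2).sum())   # absent pairs contribute zero
```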
The fastText paper. Extends word2vec with character n-gram embeddings, handling morphology and out-of-vocabulary words. The right reading for anyone working on morphologically rich languages or on short, noisy text.
The paper that showed skip-gram with negative sampling implicitly factorises a word-context PMI matrix shifted down by log k, where k is the number of negative samples. A beautiful result that demystifies word2vec and connects it cleanly to the count-based tradition.
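The count-based side of that equivalence is short enough to write out: build a PPMI matrix from co-occurrence counts, then truncate its SVD to get dense vectors. A toy numpy sketch (the counts are invented):

```python
import numpy as np

# Invented word-context co-occurrence counts (rows: words, cols: contexts).
C = np.array([[8.0, 1.0, 0.0],
              [1.0, 6.0, 2.0],
              [0.0, 2.0, 7.0]])

p_wc = C / C.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

# Positive PMI: max(0, log p(w,c) / (p(w) p(c))).  Levy and Goldberg show
# that SGNS with k negatives factorises this matrix shifted down by log k.
with np.errstate(divide="ignore"):       # zero counts give -inf, clipped below
    ppmi = np.maximum(np.log(p_wc / (p_w * p_c)), 0.0)

# Truncated SVD turns the sparse PPMI rows into dense word vectors,
# the count-based analogue of the word2vec embedding matrix.
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :2] * S[:2]          # 2-dimensional embeddings
```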
The methodological cure for a lot of neural-embedding mystification. Shows that count-based methods with the right hyperparameters match word2vec and GloVe on most intrinsic evaluations. The paper that brought theoretical humility to the embedding literature.
The foundational paper on bias in word embeddings. Demonstrates that gender stereotypes are encoded as linear subspaces in word2vec embeddings and proposes a geometric debiasing method. The starting point for the subsequent literature on algorithmic fairness in NLP.
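The core geometric move of that debiasing method is a single projection. A minimal numpy sketch with hypothetical vectors; in practice the bias direction is estimated from difference vectors of pairs like he/she via PCA.

```python
import numpy as np

def remove_component(v: np.ndarray, bias_dir: np.ndarray) -> np.ndarray:
    """Subtract the projection of v onto a bias direction, leaving a
    vector orthogonal to it (the 'neutralise' step of hard debiasing)."""
    b = bias_dir / np.linalg.norm(bias_dir)
    return v - (v @ b) * b

# Hypothetical 3-d vectors, purely illustrative.
gender_dir = np.array([1.0, 0.0, 0.0])
engineer = np.array([0.4, 0.7, 0.2])

neutralised = remove_component(engineer, gender_dir)
print(float(neutralised @ gender_dir))    # 0.0: no component left along the axis
```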
The necessary counterpoint to Bolukbasi et al. Shows that post-hoc debiasing does not actually remove the bias, just hides it. A reminder that representation-level fixes to corpus-level problems have sharp limits.
The paper that introduced orthogonal Procrustes alignment of monolingual embedding spaces. A small but foundational piece of engineering that opened the door to cross-lingual NLP.
The MUSE paper. Shows that monolingual embedding spaces can be aligned across languages without any bilingual data, using adversarial training. A striking result about the geometric similarity of conceptual spaces across languages.
The ELMo paper. The first widely successful demonstration that contextualised embeddings — one vector per token, computed from the full sentence — could improve downstream NLP across multiple tasks. The paper that set up the BERT revolution that followed later that same year.
A careful empirical study of how different evaluation protocols rank embedding methods differently. The authoritative reference for why intrinsic benchmarks should not be trusted in isolation.
The HAL (Hyperspace Analogue to Language) paper. An important and under-cited count-based distributional method that preceded LSA's dominance and introduced directional co-occurrence windows.
The Sentence-BERT paper. Demonstrated that fine-tuning BERT with a sentence-similarity objective produces sentence vectors that are markedly more accurate for similarity comparison than raw BERT outputs, and orders of magnitude cheaper to compare, since each sentence is encoded once rather than jointly with every candidate pair. The basis of modern semantic search.
The SimCSE paper. A beautifully simple recipe — contrastive learning with dropout as data augmentation — that produces state-of-the-art sentence embeddings. The template most subsequent sentence-embedding models follow.
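The loss at the heart of SimCSE-style training is in-batch contrastive cross-entropy. A numpy sketch (the function name is mine; real implementations run this on GPU inside the training loop):

```python
import numpy as np

def in_batch_contrastive_loss(z1, z2, temperature=0.05):
    """z1[i] and z2[i] are two encodings of the same sentence (in SimCSE,
    two forward passes with different dropout masks); every other row of
    the batch acts as a negative.  Cross-entropy pulls the diagonal up."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sims = z1 @ z2.T / temperature                 # (B, B) cosine similarities
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))     # positives on the diagonal
```

Aligned pairs should score a much lower loss than a batch whose positives have been shuffled out of position.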
The CLIP paper. The demonstration that joint embedding spaces across modalities could be learned from unaligned text-image pairs. The foundation of modern multi-modal models and the extension of the embedding metaphor beyond language.
The MTEB paper. The current reference benchmark for sentence and document embedding models, covering retrieval, classification, clustering, and reranking across dozens of tasks. The default way the field compares modern embedding models.
The E5 embedding paper. Demonstrated that scaling up contrastive pre-training data could produce embeddings that beat hand-tuned Sentence-BERT variants across most benchmarks. A key milestone in the move from task-specific embedding training to general-purpose pretrained embeddings.
The FAISS paper. The technical foundation for practical embedding-based retrieval. Read this alongside the HNSW and ScaNN papers to understand what it takes to make embedding search fast at Internet scale.
The HNSW paper. The graph-based ANN algorithm that dominates production vector search. Read alongside the FAISS paper for a complete picture of the modern retrieval infrastructure.
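Both papers accelerate the same primitive: exact k-nearest-neighbour search by inner product, which is trivially correct but linear in the collection size. A numpy sketch of that exact baseline (names and toy data mine) makes clear what the approximate indexes are approximating:

```python
import numpy as np

def exact_knn(queries: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by inner product: an O(N*D) scan per query.  FAISS and
    HNSW trade a little recall for orders of magnitude less work."""
    scores = queries @ index.T                  # (Q, N) similarity matrix
    return np.argsort(-scores, axis=1)[:, :k]   # top-k indices per query

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 32))              # 100 database vectors
queries = index[:5] + 0.01 * rng.normal(size=(5, 32))   # perturbed copies

nbrs = exact_knn(queries, index, k=1)
print(nbrs.ravel().tolist())                    # each query finds its source
```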
A careful probing study of BERT-style contextual embeddings. Demonstrates the layered structure of learned linguistic knowledge — syntax in middle layers, semantics in top layers — and remains the canonical reference for what contextual embeddings actually encode.
The "syntax lives in a linear subspace of BERT" paper. A beautifully executed probing study showing that dependency parse structure is encoded geometrically in contextual embeddings. Helped found the interpretability subfield that examines pretrained transformers.
The RAG paper. The architectural template through which embeddings became the retrieval backbone for modern LLMs. The paper that re-promoted embedding quality to a first-order concern for generative AI systems.
The late-interaction retrieval paper. A hybrid approach that keeps the token-level granularity of BERT embeddings while supporting efficient ANN-style retrieval. One of the cleverest recent embedding architectures and a common choice for demanding retrieval applications.
The DPR paper. The canonical demonstration that a fine-tuned bi-encoder can outperform BM25 on open-domain QA retrieval, while also showing just how tough a baseline BM25 remains. The practical foundation of modern dense retrieval.
The InferSent paper. An early and influential demonstration that supervised training on a single high-quality task (natural language inference) could produce broadly useful sentence embeddings. A precursor of the contrastive-learning revolution that followed.
The paper that extended word2vec's training recipe from words to graph nodes. The intellectual ancestor of the large family of graph-embedding methods and an important data point about the generality of the representation-learning idea.
The canonical analysis paper on mBERT's cross-lingual transfer capabilities. Answers the question "does a single multilingual encoder actually produce a shared cross-lingual embedding space?" — mostly yes, with important caveats.
A recent survey of RAG architectures that threads through the role of embeddings at every stage — query encoding, passage encoding, re-ranking, long-context tradeoffs. The right bridge between this chapter and the RAG chapter that comes later in Part VI.
The reference Python implementation of word2vec, fastText, and LSA/LSI, with loaders for pretrained GloVe vectors. Out-of-core training, streaming corpora, and mature inference. Still the default tool for training or using classical static embeddings in Python.
The official fastText library plus the pretrained embeddings for 157 languages. The fastest way to get morphology-aware static embeddings or a strong text-classification baseline with minimal engineering.
The standard Python library for modern sentence embeddings. Simple API, integration with Hugging Face's model hub, and dozens of pretrained models ready to deploy. The usual first stop for any semantic-search or sentence-similarity application.
The de facto reference implementation for ANN search. CPU and GPU implementations of every major algorithm (IVF, HNSW, PQ, ScaNN-style), with careful attention to systems-level performance. The engine under most vector databases.
The library through which most people now access pretrained contextual and sentence embeddings. Thousands of models, a uniform API, and direct integration with the broader Hugging Face ecosystem. The operational backbone of the modern embedding workflow.
The current generation of open-source vector databases. Each makes different tradeoffs on indexing algorithm, metadata filtering, scaling pattern, and operational model. For any serious embedding-based application at scale, evaluating at least two of these is worth the engineering time.