Search & Information Retrieval: finding what you asked for.

Search is the oldest serious application of information retrieval and the most-used product on the internet — Google handles ~9 billion queries a day, every enterprise stores billions of documents in search-indexed databases, and the rise of retrieval-augmented LLMs has made search the substrate of how AI systems access knowledge. The methodology has gone through three eras: classical lexical retrieval (TF-IDF, BM25) running on inverted indexes; learning-to-rank with hand-engineered features in the 2010s; and dense neural retrieval with transformer-based encoders since 2019. Modern production search blends all three. This chapter develops the retrieval problem, the major method families (lexical, dense, learned-sparse, cross-encoder rerankers), the engineering of multi-stage search pipelines, and the deployment realities of running search at scale.

Prerequisites & orientation

This chapter assumes basic probability and linear algebra (Part I Ch 04 and Ch 01), neural-network fundamentals (Part V Ch 01–02), and familiarity with transformer-based encoders (Part VI Ch 02–03). The recommender-systems material of Ch 01 is the natural complement — modern recommendation and search increasingly use the same architectural building blocks (two-tower retrieval, deep rerankers, learning-to-rank). The classical-NLP material of Part VI Ch 01 provides the linguistic-preprocessing background used in lexical retrieval. No prior exposure to information-retrieval literature is assumed.

Two threads run through the chapter. The first is the multi-stage architecture: production search systems combine fast lexical retrieval, dense neural retrieval, and heavy cross-encoder reranking into a pipeline that trades off recall, precision, and latency. The methodology of the chapter follows this layering. The second is the relevance-vs-engagement tension: search is technically about returning relevant documents but commercially often about returning engaging ones, and the metrics, training data, and evaluation procedures all reflect this tension. The chapter is organised so the foundational lexical methods come first, then the learning-to-rank machinery, then the dense-retrieval revolution, then the modern hybrids that combine everything.

01

Why Search Is Its Own Field

Search looks superficially like classification — given a query, score every document, return the highest. The reasons it has its own field, with its own benchmarks, its own conferences, and its own engineering culture, are scale and structure: corpora can have billions of documents, queries are highly variable in form and intent, and the cost of a missed-but-relevant document differs sharply from the cost of an irrelevant top result. Information retrieval is the discipline that takes these constraints seriously.

The retrieval problem

The classical IR problem: given a query q and a corpus of documents D = {d1, ..., dN}, return the K documents most relevant to q, ordered by relevance. The complications are layered. The corpus is large — Google indexes hundreds of billions of pages, an enterprise document store may have hundreds of millions, even a single repository may have millions of files. The relevance is subtle — different users with the same query want different things, and the ground-truth signal is noisy and biased. The latency budget is brutal — web search returns results in tens of milliseconds; enterprise search must be fast enough that users will use it.

Recall and precision

Two metrics define the field. Recall is the fraction of relevant documents that appear in the returned list — did the system find the answer? Precision is the fraction of returned documents that are relevant — did the system return junk? The two trade off: pulling in more documents increases recall but typically reduces precision, and the right operating point depends on the application. Web search prioritises top-K precision (most users only look at the first page); enterprise document discovery often prioritises recall (a missed regulatory document is worse than ten extras).

The position-aware metrics — NDCG, MRR — refine this by weighting top-ranked documents more heavily, which matches user behaviour where almost no one looks past the top few. The chapter's evaluation section develops this in detail, but the conceptual takeaway is that search quality is rank-aware: where a relevant document appears matters as much as whether it appears.

Query understanding is half the problem

"Italian restaurants near me" and "Italian restaurants nearby" should return the same results; "jaguar" might mean the car, the animal, or the operating system. Real search systems spend substantial machinery on query understanding before retrieval even starts: spell correction, query expansion (adding synonyms), entity recognition (identifying named entities like "Italian"), intent classification (informational vs. transactional vs. navigational), and increasingly, LLM-based query rewriting that produces multiple variants and combines results. The retriever's job is easier when the query is well-formed; the query-understanding stage is what makes it well-formed.

The recall-then-precision pattern

Production search systems organise around a simple architectural rule: a fast-but-imprecise retrieval stage selects a candidate set, then a slow-but-accurate ranking stage orders them. This is the same two-stage pattern as in recommendation (Ch 01 Section 7), but the trade-offs are slightly different. Search retrieval is more I/O-bound (the inverted index lives on disk, accessing it dominates latency), and the ranking stage is more CPU-bound (modern cross-encoder rerankers run a transformer per (query, doc) pair). The methodology of the chapter follows this layering — Sections 2, 4, and 6 develop retrieval methods; Sections 3 and 5 develop ranking methods; Section 7 puts them together.

Why Search Matters Now More Than Ever

Search has been a foundational technology for thirty years, but its importance is growing rather than shrinking in the LLM era. Retrieval-augmented generation uses search as the substrate that grounds LLM outputs in factual content; agentic systems use search to access knowledge and tools; modern enterprise AI deployment is mostly RAG plus search. The chapter's methodology — especially the dense-retrieval and reranker material — is increasingly the substrate of the AI stack, not a separate concern.

02

Lexical Retrieval: TF-IDF and BM25

For thirty years, the dominant retrieval method was lexical: score documents by how often the query terms appear, with discounts for common terms and length. TF-IDF and BM25 are the canonical instances, and despite the rise of dense retrieval, they remain the workhorse of production search. A well-tuned BM25 retriever is shockingly competitive with neural methods on out-of-domain queries and is the right starting baseline for any new search problem.

The inverted index

The data structure that makes lexical retrieval fast at scale is the inverted index. Instead of scanning documents one by one for query terms, the index maps each term to the list of documents containing it (a "posting list"), with positions and frequencies. A query of K terms is processed by intersecting (or unioning) K posting lists — much faster than O(N·M) per-document scanning. The inverted index is the substrate of every classical search engine and is conceptually simple enough to implement in a few hundred lines of code; production implementations like Lucene/Elasticsearch and Tantivy add compression, skip lists, and many engineering layers but the core data structure is unchanged from the 1970s.
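
To make the structure concrete, here is a minimal sketch in Python of building posting lists and answering a conjunctive (AND) query; the tokenisation is deliberately naive, and a real engine adds compression, skip lists, and positional postings on top of exactly this shape.

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # sort posting lists by doc_id so they can be merged/intersected efficiently
    return {t: sorted(postings.items()) for t, postings in index.items()}

def conjunctive_query(index, query):
    """Return the doc_ids containing every query term (AND semantics)."""
    postings = [dict(index.get(t, [])) for t in query.lower().split()]
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return result

docs = ["the quick brown fox", "the lazy dog", "quick brown dogs bark"]
index = build_inverted_index(docs)
print(conjunctive_query(index, "quick brown"))   # {0, 2}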

TF-IDF

The classical scoring function is TF-IDF — term frequency times inverse document frequency. The intuition: a term is valuable if it appears often in the document (TF) but rarely in the corpus (IDF). The product weights both — a query term that's frequent in this document but rare overall (like "schistosomiasis") gets a high score; one that's frequent everywhere ("the") gets a low score.

TF-IDF score
tfidf(t, d) = tf(t, d) · log( N / df(t) )
tf(t, d) is the count of term t in document d (sometimes log-normalised); df(t) is the number of documents containing t; N is the corpus size. The document score for query q is the sum over query terms: score(q, d) = Σ_{t ∈ q} tfidf(t, d). The formulation is shockingly simple, yet it captured most of the relevance signal until BM25 refined it.

BM25: the modern lexical workhorse

BM25 (Robertson and Walker 1994) is the standard lexical scoring function in 2026 and the right baseline for any retrieval comparison. It refines TF-IDF in three ways: TF saturates (a term appearing 50 times is not 50× more relevant than once); document length is normalised (longer documents get a discount because they have more chances to contain query terms); and the parameters are tunable.

BM25 scoring function
score(q, d) = Σ_{t ∈ q} IDF(t) · tf(t, d) · (k1 + 1) / ( tf(t, d) + k1 · (1 − b + b · |d| / avg|d|) )
k1 ∈ [1.2, 2.0] controls term-frequency saturation; b ∈ [0, 1] controls document-length normalisation (b=0 turns it off, b=1 fully normalises). IDF(t) is the standard inverse-document-frequency. The default values k1=1.2, b=0.75 are the right starting point and rarely need much tuning.
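
A minimal sketch of the scoring function above over an in-memory, pre-tokenised corpus; it reproduces the k1 and b behaviour (using the common Robertson/Lucene IDF variant) but none of the inverted-index machinery that makes production BM25 fast.

import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs: list of token lists. Returns one BM25 score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))            # document frequency per term
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)  # length-normalised denominator
            s += idf.get(t, 0.0) * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores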

BM25 has remained the gold-standard lexical retriever for thirty years. The 2019–2024 wave of dense-retrieval research has produced methods that beat BM25 in-domain (when training data is available) but often lose to it on out-of-domain queries (the BEIR benchmark, Thakur et al. 2021, made this systematic). For applications where the query distribution is unfamiliar or the training data is absent, BM25 is genuinely the best choice.

Lexical preprocessing

Before BM25 can score anything, both queries and documents must be processed into terms. Standard pipelines apply: tokenisation (split on whitespace and punctuation), case-folding (convert to lowercase), stemming (reduce words to a root form — "running" and "runs" both become "run"; the Porter stemmer is canonical), stop-word removal (drop high-frequency words like "the" and "is"), and language detection (route to language-specific stemmers). Each step is a hyperparameter; production search engines typically run multiple variants and let later stages combine them.
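
A sketch of the pipeline using NLTK's Porter stemmer; the stop-word set here is a tiny illustrative stand-in for a real list, and production analysers add language detection and per-language variants.

import re
from nltk.stem import PorterStemmer   # pip install nltk

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}   # illustrative subset only
stemmer = PorterStemmer()

def analyse(text):
    """Tokenise, case-fold, drop stop words, stem: the classic lexical pipeline."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenise + case-fold
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(analyse("The runner is running quickly"))   # ['runner', 'run', 'quickli']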

Where lexical methods break

BM25 has known weaknesses. It cannot match synonyms (a query for "automobile" misses documents that say "car"); it has no understanding of word order or syntactic structure; it treats every term independently. The standard fixes — query expansion via thesaurus, learned synonyms, or LLM-generated rewrites — partially close the gap. The deeper fix is dense retrieval (Section 4), which represents queries and documents as vectors in a semantic space where synonyms are nearby. The right 2026 stance: lexical retrieval is not obsolete, it is the workhorse that dense retrieval augments rather than replaces.

03

Learning to Rank

By the late 2000s, the realisation that BM25 alone left obvious signal on the table — click logs, document quality, freshness, link graphs — produced a wave of methods that learned to rank using all of these features simultaneously. Learning to rank is the umbrella for these methods, and the LambdaMART family that emerged from Microsoft Bing remains a strong production baseline, used at scale by major web-search engines.

Three loss families

Learning-to-rank methods are conventionally classified by their training-loss formulation. Pointwise methods treat each (query, document) pair independently and predict a relevance score with a regression or classification loss. Simple and easy to train but ignores the relative-ordering signal that matters at evaluation time.

Pairwise methods take pairs of documents from the same query and train the model to score the more relevant one higher. The classical RankNet (Burges et al. 2005) optimises a logistic-style pairwise loss. Pairwise losses align better with ranking metrics than pointwise but still miss listwise effects (the loss does not directly optimise NDCG-style position-discounting).

Listwise methods take the full ranked list for a query and optimise a loss that depends on the whole ordering. LambdaRank (Burges 2010) is the canonical instance — it modifies the gradient of pairwise RankNet to weight pairs by their impact on NDCG, effectively turning the pairwise gradient into a listwise one. LambdaMART combines LambdaRank with gradient-boosted decision trees (MART = Multiple Additive Regression Trees) and is the workhorse of production learning-to-rank.

Features

The strength of learning-to-rank is the breadth of features it can integrate. Typical industrial feature sets include hundreds of features:

Query-document features: BM25 scores on different fields (title, body, URL), cosine similarity in various embedding spaces, exact-match indicators, query-term coverage.

Document features: PageRank-style link-graph scores, click-through-rate priors, freshness, document length, language match.

Query features: query length, intent classification, entity types, navigational vs. informational.

User-context features: location, device type, query history.

Cross-features: interactions between any of the above.

The model — typically a gradient-boosted ensemble like LambdaMART — combines these into a relevance score. Production systems at Bing and Yandex have published feature counts in the high hundreds; Google's internal feature sets are presumably in the thousands.

The MART / LambdaMART recipe

Gradient-boosted decision trees turn out to be unusually well-suited to learning-to-rank. They handle heterogeneous features (some categorical, some continuous, some sparse), they are robust to feature scaling, they admit fast inference (a few hundred shallow trees can be evaluated in microseconds), and they handle the high-dimensional discrete-feature spaces typical of search well. LambdaMART has been the standard production ranker since around 2010, and even in the era of deep neural rankers it remains the choice for many enterprise and web-search systems where interpretability and operational simplicity matter.
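
A sketch of the LambdaMART recipe using LightGBM's LGBMRanker (a common open-source implementation); the features, labels, and group sizes below are random placeholders, and the group argument tells the ranker which rows belong to the same query.

import numpy as np
from lightgbm import LGBMRanker   # pip install lightgbm

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))              # 30 hand-crafted features per (query, doc) row
y = rng.integers(0, 4, size=1000)            # graded relevance labels 0-3
group = [20] * 50                            # 50 queries, 20 candidate documents each

ranker = LGBMRanker(
    objective="lambdarank",                  # LambdaMART: lambdarank gradients + boosted trees
    metric="ndcg",
    n_estimators=300,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# at query time: score one query's candidates and sort descending
candidate_features = X[:20]
order = np.argsort(-ranker.predict(candidate_features))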

Modern deep rankers

The 2020s saw deep neural networks displace LambdaMART for the heaviest ranking workloads. The architecture is essentially the recommendation-style ranker of Ch 01 Section 5: a deep model takes hundreds of features and outputs a relevance score, trained with pairwise or listwise losses. Wide & Deep, DCN, and the various transformer-over-features architectures all see use. The deep rankers buy two things over LambdaMART: better handling of dense feature interactions (especially when features include learned embeddings from text, images, or behavioural signals), and better integration with end-to-end training pipelines that include neural retrievers upstream. They cost more inference compute and are harder to debug, which is why LambdaMART has not been universally displaced.

The training-data question

Learning-to-rank requires labelled training data — (query, document, relevance) triples. Three sources dominate. Editorial labels: human raters score (query, document) pairs on a relevance scale. Expensive, low volume, but high quality. Click logs: use click-through behaviour as an implicit relevance signal. High volume but biased (users click on what they're shown, with strong position bias). Click models: model the bias explicitly to extract clean relevance signals from clicks. The 2024 production trend combines editorial labels for the highest-quality signal with click-model-corrected logs for volume; LLM-generated relevance labels are increasingly part of the mix.

04

Dense Retrieval and Bi-Encoders

The dominant change in retrieval since 2019 is the rise of dense retrieval: queries and documents are encoded into dense vectors in a shared semantic space, and retrieval is approximate-nearest-neighbour search over the vector index. The architecture is simple — two transformer encoders, one for queries and one for documents, sharing weights or trained separately — but the empirical impact has been enormous, and dense retrieval is now standard alongside BM25 in serious production systems.

The bi-encoder architecture

A bi-encoder uses two transformer encoders (often the same model with the same weights, sometimes separate) to map queries and documents into a shared vector space. The query encoder takes the query as input and outputs a dense vector; the document encoder does the same for documents. The similarity is the dot product (or cosine) of the two vectors:

Bi-encoder dense retrieval
score(q, d) = encoder_q(q) · encoder_d(d)
Both encoders typically use a transformer architecture (BERT, DistilBERT, or specialised retrieval models) with mean-pooling or CLS-token-pooling over the final-layer hidden states. The crucial property: documents can be encoded once at index time, and retrieval at query time is a single forward pass plus a fast nearest-neighbour search. This makes dense retrieval feasible at the scale of millions of documents and single-digit-millisecond query latency.

Sentence-BERT and the sentence-encoder lineage

The conceptual ancestor is Sentence-BERT (Reimers and Gurevych 2019), which fine-tuned BERT with a contrastive loss to produce vectors useful for sentence similarity. The architecture and training procedure transferred directly to retrieval: train the bi-encoder so that relevant (query, document) pairs are close in vector space and irrelevant pairs are far apart. The sentence-transformers library that emerged from this work is the dominant practical toolkit for dense retrieval in 2026.
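
A minimal bi-encoder retrieval sketch with the sentence-transformers library; the checkpoint name is one commonly used public model and is interchangeable with any other bi-encoder.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # illustrative checkpoint

docs = [
    "BM25 is a lexical scoring function over an inverted index.",
    "Dense retrievers encode queries and documents into a shared vector space.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)        # encode once, at index time

query_vec = model.encode("how do neural retrievers represent documents?",
                         normalize_embeddings=True)
scores = util.dot_score(query_vec, doc_vecs)                    # cosine, since vectors are normalised
print(scores)                                                   # 1 x 2 tensor of query-document similarities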

DPR and contrastive training

Dense Passage Retrieval (DPR, Karpukhin et al. 2020) was the first widely-adopted dense retriever for open-domain question answering. The training objective is contrastive: for each (question, gold-passage) pair, the model is trained to score the gold passage above a set of in-batch negative passages. The contrastive loss maximises the gold passage's score relative to negatives, producing a vector space where questions cluster near their answers.

Dense retrieval contrastive loss
ℒ = − Σ_{(q, d⁺)} log [ exp(score(q, d⁺)) / ( exp(score(q, d⁺)) + Σ_{d⁻} exp(score(q, d⁻)) ) ]
d⁺ is the relevant document, d⁻ ranges over negative documents (typically other batch elements' positive documents — "in-batch negatives"). The InfoNCE-style loss makes the gold passage's score higher than any of the negatives, in expectation. The crucial hyperparameter is the negative-sampling strategy — Section 4's later subsection covers this.
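
A PyTorch sketch of the in-batch-negatives form of this loss: because row i of the document batch is the positive for query i, the objective reduces to a cross-entropy over the batch similarity matrix. The temperature is a common practical addition not shown in the formula above.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """query_vecs, doc_vecs: [batch, dim]; row i of doc_vecs is the positive for query i."""
    sims = query_vecs @ doc_vecs.T / temperature                     # [batch, batch] similarity matrix
    labels = torch.arange(sims.size(0), device=sims.device)          # the diagonal holds the positives
    return F.cross_entropy(sims, labels)                             # InfoNCE over in-batch negatives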

Negative sampling

The choice of negative documents during training is the single most-important hyperparameter in dense retrieval. In-batch negatives use other passages in the same training batch as negatives — cheap and abundant but typically too easy. Hard negatives are negatives selected to be close to the gold passage in vector space (top-ranked but not relevant); they produce stronger models but require careful curation. BM25-mined negatives are high-BM25-scoring but non-relevant documents, a cheap way to find hard negatives. ANCE (Xiong et al. 2020) and RocketQA (Qu et al. 2021) developed the canonical hard-negative mining procedures. The 2024 standard combines all three: in-batch for breadth, BM25-mined for hard structural negatives, model-mined for adaptive hard negatives that change as the model improves.

Approximate nearest-neighbour search

A trained bi-encoder produces dense vectors for every document. Retrieving the top-K nearest documents to a query vector requires approximate nearest-neighbour search at scale — exact nearest-neighbour over millions of vectors is too slow. Three families dominate. HNSW (Hierarchical Navigable Small World, Malkov and Yashunin 2018) builds a multi-layer graph for fast approximate search. IVF (Inverted File) clusters vectors and searches only the nearest clusters. PQ (Product Quantisation) compresses vectors with discrete codes for memory efficiency. Production systems combine these — typical: IVF-PQ for memory-efficient indexing, HNSW for the highest accuracy. Libraries like FAISS (Facebook), ScaNN (Google), and Vespa are the standard production tools.
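
A sketch using FAISS's in-memory HNSW index, assuming the document vectors already exist; the connectivity parameter (32) and efConstruction value are typical starting points rather than tuned settings, and the default metric is L2 (pass faiss.METRIC_INNER_PRODUCT for dot-product retrieval).

import numpy as np
import faiss   # pip install faiss-cpu

dim = 384
doc_vecs = np.random.rand(100_000, dim).astype("float32")   # placeholder for real embeddings

index = faiss.IndexHNSWFlat(dim, 32)        # 32 = M, graph neighbours per node
index.hnsw.efConstruction = 200             # build-time accuracy/speed knob
index.add(doc_vecs)                         # build the index once

query = np.random.rand(1, dim).astype("float32")
distances, doc_ids = index.search(query, 10)   # approximate top-10 neighbours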

The BEIR result and what it taught us

The BEIR benchmark (Thakur et al. 2021) evaluated dense and sparse retrievers on 18 diverse retrieval tasks and produced an inconvenient result: in-domain, dense retrievers beat BM25 substantially; out-of-domain (the test set is from a different distribution than the training data), BM25 wins on a majority of tasks. The 2022–2026 wave of dense-retrieval research has been about closing this generalisation gap — large-scale pretraining on retrieval data (E5, BGE, gte models), instruction-tuned retrievers, and increasingly LLM-distilled retrievers. The 2026 production reality: a strong dense retriever pretrained on a billion query-document pairs matches or beats BM25 on most benchmarks, but combining the two via hybrid retrieval (Section 6) usually beats either alone.

05

Cross-Encoders and Rerankers

Bi-encoders are fast but limited — the query and document never interact during encoding, so the model cannot directly use query terms to attend to specific document parts. Cross-encoders fix this by encoding the (query, document) pair jointly, with full attention between query and document tokens. They produce stronger relevance scores at substantially higher cost, and the standard production pattern is to use bi-encoders for retrieval and cross-encoders for reranking.

The cross-encoder architecture

A cross-encoder takes the query and document concatenated together (with separator tokens) and runs a transformer over the full sequence. The output is a single relevance score — the transformer's CLS token, projected through a small head. Because every query token can attend to every document token, the model can use detailed query-document interactions that bi-encoders cannot represent.

Cross-encoder relevance scoring
score(q, d) = MLP( transformer( [CLS] q [SEP] d [SEP] )_CLS )
The query and document are concatenated as a single input; the transformer processes the whole sequence with self-attention; the relevance score is read off the CLS token. The cost: every (q, d) pair requires a full transformer forward pass, which is too expensive for retrieval over millions of documents but tractable for reranking the top 100–1000 retrieved candidates.
[Figure: four retrieval architectures (BM25, bi-encoder, ColBERT, cross-encoder) arranged by interaction depth vs. inference cost, with illustrative per-query latencies of ~0.1 ms, ~5 ms, ~20 ms, and ~50 ms respectively.]
The four major retrieval architectures arranged by interaction depth (left to right). BM25 matches query and document terms with no learned representation — fast, lexical, no synonyms. Bi-encoder encodes query and document independently into single vectors and scores by dot product — semantic but no token-level interaction. ColBERT encodes both as sequences of token vectors and uses late interaction (each query token max-pools over document tokens) — combines bi-encoder efficiency with token-level granularity. Cross-encoder encodes query and document jointly with full attention — highest quality but must re-encode for every (q, d) pair, so used only at the reranking stage. Latencies are illustrative on a single GPU; production systems aggressively optimise the lower-cost architectures.

monoBERT and the canonical recipe

monoBERT (Nogueira and Cho 2019) was the first widely-cited cross-encoder reranker, and the recipe — fine-tune BERT on (query, relevant doc, irrelevant doc) triples with a pairwise or pointwise loss — became the standard for the field. Subsequent generations (T5-based duoT5 / monoT5, the LLM-based RankGPT family) extend the approach with larger models and better training data, but the architecture remains the same.
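
A reranking sketch using the sentence-transformers CrossEncoder wrapper with one of the published MS MARCO reranker checkpoints; the candidate list would normally come from the first-stage retriever.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # published reranker checkpoint

query = "what does bm25 stand for"
candidates = [
    "BM25 is a bag-of-words ranking function used by search engines.",
    "The jaguar is a large cat native to the Americas.",
]
scores = reranker.predict([(query, doc) for doc in candidates])   # one forward pass per (query, doc) pair
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]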

The empirical pattern: cross-encoder rerankers consistently improve quality over bi-encoder retrieval alone. On standard benchmarks, the improvement is 5–15 NDCG points; on harder benchmarks it's larger. The production cost is the per-document transformer forward pass — even with optimised inference, reranking 1000 candidates takes 100–500ms on a single GPU, which is too slow for many deployments and dictates careful candidate-set sizing.

ColBERT and late interaction

An interesting middle ground is ColBERT (Khattab and Zaharia 2020, ColBERTv2 in 2021). The architecture encodes the query as a sequence of token vectors and the document as a sequence of token vectors (no pooling); at scoring time, each query token finds its best-matching document token via cosine similarity, and the document score is the sum of these per-token max-similarities. This late interaction preserves token-level granularity (unlike bi-encoders, which collapse documents to a single vector) while keeping document encoding fast (unlike cross-encoders, which require joint encoding at query time).
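
A sketch of the late-interaction (MaxSim) scoring rule in PyTorch, assuming the per-token query and document vectors have already been produced and L2-normalised by the encoders.

import torch

def maxsim_score(query_tokens, doc_tokens):
    """query_tokens: [q_len, dim], doc_tokens: [d_len, dim], both L2-normalised.
    Each query token takes its best cosine match in the document; the matches are summed."""
    sim = query_tokens @ doc_tokens.T           # [q_len, d_len] cosine similarities
    return sim.max(dim=1).values.sum()          # MaxSim: best document token per query token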

ColBERT approaches cross-encoder reranking quality on standard retrieval benchmarks at much lower inference cost, and ColBERTv2 with quantised token embeddings makes the storage tractable. It is increasingly popular as a "second-stage" retriever that combines the recall properties of dense bi-encoders with quality close to cross-encoder reranking.

LLM-based rerankers

The 2023–2026 frontier uses LLMs as rerankers directly. RankGPT (Sun et al. 2023) prompts a large language model with the query and a list of candidate passages and asks it to produce a ranked list. The empirical results are striking — LLM rerankers match or beat fine-tuned cross-encoders on many benchmarks without any task-specific training. The cost is high (a large LLM forward pass per query, with a long prompt), so production deployments use LLM rerankers only on small candidate sets or distil their outputs into faster cross-encoders.

The two-stage pattern

The standard 2026 production pattern: a fast retriever (bi-encoder + BM25 fusion) generates 100–1000 candidates; a slower cross-encoder reranks them precisely; a final LLM reranker (when latency allows) refines the top 10–20 for the highest-stakes presentations. The latency-quality trade-off mirrors the recommendation-system pattern of Ch 01. The architectural shape is similar enough that some 2026 systems use the same engineering team for both search and recommendation infrastructure.

06

Learned Sparse and Hybrid Retrieval

Lexical retrieval (BM25) is fast and out-of-domain robust but cannot match synonyms; dense retrieval (bi-encoders) handles synonyms but is sensitive to domain shift. A third family — learned sparse retrieval — produces sparse vectors that are interpretable like BM25's term weights but learned like a neural network. Combined with dense retrieval in a hybrid system, learned sparse methods are the modern best-of-both-worlds approach.

Learned sparse retrieval

The idea: train a transformer to produce, for each document, a sparse weighted bag-of-terms representation — most term weights are zero, the few non-zero weights capture the document's important terms (including expansions and reweightings learned from data). The query is similarly encoded as a sparse weighted bag of terms. Retrieval uses the same inverted-index machinery as BM25, just over the learned weights instead of the raw frequencies.

SPLADE (Formal et al. 2021) is the canonical instance. A BERT-based model expands each input token into a probability distribution over the vocabulary, then sparsifies via a regularisation term. The result is a learned sparse representation that uses the inverted index for fast retrieval but captures synonym-style relationships that pure BM25 misses. uniCOIL (Lin and Ma 2021) is a simpler alternative that learns per-token weights without expansion. Both methods are competitive with dense retrievers on standard benchmarks at much lower memory cost (sparse vectors compress better than dense ones).
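
Whichever model produces them, learned sparse representations are just weighted term dictionaries, so scoring is the same dot product an inverted index computes; a sketch with hand-written weights standing in for a SPLADE-style model's output.

# learned sparse vectors: term -> weight, with most terms absent (zero)
doc_sparse = {"car": 1.8, "automobile": 0.9, "engine": 1.2, "repair": 0.7}   # note the learned expansion
query_sparse = {"automobile": 1.5, "repair": 1.0}

def sparse_dot(query_vec, doc_vec):
    """Score is the dot product over shared terms (the same operation an inverted index computes)."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

print(sparse_dot(query_sparse, doc_sparse))   # 0.9*1.5 + 0.7*1.0 ≈ 2.05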

Hybrid retrieval

The dominant 2026 production pattern is hybrid retrieval: combine BM25 (or learned sparse), dense, and sometimes cross-encoder signals. The combinations include:

Score fusion: run multiple retrievers, normalise their scores, combine via weighted sum or rank-based methods (Reciprocal Rank Fusion). The simplest hybrid approach and surprisingly effective.

Disjunction with reranking: take the top-K from each retriever (deduplicated), then rerank the union with a cross-encoder. This produces a candidate set with broader coverage than either retriever alone.

Joint training: train dense and sparse retrievers jointly with a shared loss, so they learn complementary representations rather than redundant ones.

The empirical pattern: hybrid methods consistently beat single-retriever baselines, especially on out-of-domain queries. The production reality is that essentially every serious search system in 2026 is hybrid; the methodology has converged on this rather than on a single dominant retriever.

Multi-vector dense retrieval

A different generalisation of bi-encoder retrieval: rather than collapsing each document to a single vector, represent it as multiple vectors that each capture different aspects (different paragraphs, different facets of the content). At query time, score against each vector and combine. ColBERT (Section 5) takes this idea to its extreme with per-token vectors. Less extreme variants — using a vector per paragraph, per topic, per query intent — are increasingly common in production. The cost is index size; the benefit is recall and precision improvements that single-vector methods cannot match.

The 2026 retriever stack

Putting the methods of Sections 2–6 together, a typical 2026 production retrieval stack looks like: BM25 over the inverted index for lexical recall; a dense bi-encoder for semantic recall; a learned-sparse retriever as the bridge; score fusion combining all three into a candidate set; and finally a cross-encoder or LLM reranker for precision. Each layer adds complexity; the empirical case for the layered approach is strong; the engineering investment is substantial. For applications where simpler stacks suffice, BM25 alone or BM25 plus dense remain perfectly serviceable.

07

The Multi-Stage Search Pipeline

A production search system is more than its retriever and ranker. The full pipeline includes query understanding, query expansion, multi-source retrieval, candidate generation, ranking, diversification, and presentation — all under a strict latency budget. This section assembles the pieces from the prior sections into a complete production-style architecture.

Query understanding

Before retrieval, the query passes through several stages of processing. Spell correction handles typos. Query expansion adds synonyms, related terms, or LLM-generated reformulations. Entity linking recognises named entities (locations, people, products) and links them to canonical entries. Intent classification labels the query as informational, navigational, transactional, or other. Locale detection routes to language-specific pipelines. Each stage produces a richer query representation that downstream retrieval can use.

The 2024 generation increasingly uses LLMs for query understanding — a single LLM call that produces an enriched query with rewrites, identified entities, classified intent, and explicit search constraints. The result is more flexible than rule-based pipelines and often better-quality, at the cost of inference latency.

Multi-source retrieval

The retrieval stage runs multiple retrievers in parallel: BM25 over the main index, dense retrieval over the embedding index, learned sparse retrieval, and often specialised retrievers for specific content types (image search, video search, code search). Each produces a ranked list; the lists are deduplicated and combined into a single candidate set, typically of size 100–1000.

The combination methodology is itself an engineering question. Reciprocal Rank Fusion (Cormack et al. 2009) is the standard rank-based combiner — each document's combined score is the sum over retrievers of 1 / (rank + k), with k typically 60. The method requires no score calibration across retrievers and is remarkably robust. Score-based combiners require normalising the per-retriever scores to comparable ranges, which is finicky in practice. Most 2026 production systems use RRF or a learned weighted combiner.
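
A sketch of Reciprocal Rank Fusion exactly as described, with k=60; each retriever contributes only ranks, so no score calibration is required.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: one ranked list of doc ids per retriever, best first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (rank + k)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]
dense_hits = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))   # d1 and d3, found by both retrievers, rise to the top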

Ranking

The candidate set is reranked by a heavy ranker — LambdaMART or deep ranker on hand-crafted features, optionally followed by a cross-encoder or LLM reranker for the highest-stakes queries. The ranker uses many features (relevance, freshness, authority, click signals, personalisation) and produces a final ranked list. Latency budgets typically allow 50–200ms for reranking, which constrains the candidate-set size and the reranker's depth.

Diversification and presentation

The ranked list is rarely shown raw. Diversification ensures the top results cover different facets of the query — a query for "Python" should not return ten results all about the snake. Vertical mixing combines results from different sources (web, news, images, videos) into a single page. Presentation logic enriches results with snippets, thumbnails, and direct-answer modules. The final user-visible page is the output of all of this composition.

The freshness loop

Search systems must handle continuously-updating corpora — new documents arrive, old documents change, ranking signals (clicks, citations) accumulate. The standard pattern is a layered indexing system: a real-time index for the most recent documents (a few minutes' worth), a fast-updating tier for the last day or week, and a stable batch-built tier for the long tail. Queries hit all tiers in parallel and the results are merged. The engineering complexity is substantial but the alternative — rebuilding the entire index daily — is infeasible at scale.

Operational realities

Production search systems run at scale that dwarfs most ML deployments. Google indexes hundreds of billions of pages and serves billions of queries a day at 99th-percentile latencies under 200ms. Enterprise search at large companies indexes billions of internal documents. The methodology of the chapter is the substance of these systems, but a serious search system also requires substantial investment in distributed indexing, query routing, caching, fault tolerance, and monitoring infrastructure; the chapter does not cover that engineering explicitly, yet it typically dominates the build cost.

08

Evaluation: TREC, NDCG, and Click Models

Search evaluation is mature — the field has been doing rigorous evaluation since TREC began in 1992 — but it is also fraught. The standard offline metrics correlate imperfectly with user satisfaction; online metrics are biased by the system that generated them; and modern LLM-augmented search has introduced new evaluation challenges. This section covers the standard machinery and the modern complications.

The TREC tradition

TREC (Text REtrieval Conference, NIST, since 1992) is the canonical IR benchmark. The protocol: a fixed corpus, a set of topics (queries), human-labelled relevance judgments (qrels) for a pooled set of documents per topic. Retrieval systems run on the corpus; their results are scored against the qrels using standard metrics. TREC has produced dozens of tracks (ad-hoc, web, deep learning, conversational) and remains the gold standard for academic IR evaluation. The MS MARCO benchmark (Microsoft, 2016) extended the framework to web-scale data with about a million real queries sampled from Bing logs.

Standard metrics

The dominant offline metrics for search are: Mean Average Precision (MAP) — the mean over queries of the average precision at each relevant document; Mean Reciprocal Rank (MRR) — the mean of 1/rank for the first relevant document; NDCG at K — Normalised Discounted Cumulative Gain over the top-K results, which is rank-aware and handles graded relevance; Recall at K — fraction of relevant documents in the top K. NDCG@10 is the dominant 2026 academic metric; recall@1000 is often the dominant retrieval-only metric (it measures whether the candidate set is good enough for downstream reranking).
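
Sketches of three of these metrics over a single query, taking a ranked list of document ids and either a set of relevant ids or a dict of graded relevance labels; the NDCG variant below uses the common exponential gain.

import math

def recall_at_k(ranked, relevant, k):
    """relevant: set of relevant doc ids."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document, 0 if none is retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, grades, k):
    """grades: dict doc_id -> graded relevance (0 = not relevant)."""
    dcg = sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0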

Click models

For production systems, the evaluation problem is different: the test data is the click logs the system itself generated, and using clicks naively as relevance labels introduces severe biases. Click models are explicit probabilistic models of user click behaviour that try to disentangle relevance from confounders.

The simplest click model is the Position-Based Model: a click occurs when the user examines the result (probability depends on position) and the result is relevant (probability depends on the document). The two probabilities are estimated jointly from logs. Subsequent models — Cascade Model, Dynamic Bayesian Network model, the various counterfactual click models — refine this with richer behavioural assumptions. The 2024 production standard uses click models alongside human-labelled validation data to extract clean relevance signals from logs, which then feed back into ranker training.
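
A compressed sketch of the Position-Based Model estimated with EM from an impression log; real implementations add smoothing, priors, and far more bookkeeping, but the attribution of each non-click between "not examined" and "not relevant" is the heart of it.

from collections import defaultdict

def pbm_em(log, n_positions=10, iters=50):
    """log: iterable of (query, doc, position, clicked) impressions.
    Returns examination probability per position and relevance per (query, doc)."""
    exam = [0.5] * n_positions
    rel = defaultdict(lambda: 0.5)
    for _ in range(iters):
        exam_num = [0.0] * n_positions
        exam_cnt = [0.0] * n_positions
        rel_num = defaultdict(float)
        rel_cnt = defaultdict(float)
        for q, d, pos, clicked in log:
            e, r = exam[pos], rel[(q, d)]
            if clicked:                            # a click implies examined and relevant
                p_exam, p_rel = 1.0, 1.0
            else:                                  # E-step: attribute the non-click
                denom = max(1.0 - e * r, 1e-9)
                p_exam = e * (1.0 - r) / denom     # examined but found irrelevant
                p_rel = (1.0 - e) * r / denom      # relevant but never examined
            exam_num[pos] += p_exam
            exam_cnt[pos] += 1.0
            rel_num[(q, d)] += p_rel
            rel_cnt[(q, d)] += 1.0
        exam = [exam_num[p] / exam_cnt[p] if exam_cnt[p] else exam[p]
                for p in range(n_positions)]       # M-step for examination
        for key in rel_cnt:
            rel[key] = rel_num[key] / rel_cnt[key] # M-step for relevance
    return exam, dict(rel)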

The offline-online gap

Search evaluation has the same offline-online gap as recommendation (Ch 01 Section 8). Offline metrics on logged data measure something subtly different from online performance, and methods that win offline often disappoint online. The gap is smaller in search than in recommendation (the relevance labels are more stable than user preferences, and the action space is more constrained), but the discipline is the same: offline benchmarks for development and prioritisation, online A/B tests for the headline result.

Modern complications: LLMs and conversational search

The rise of LLM-augmented search has introduced new evaluation challenges. Generated answers (the LLM-summarised answer at the top of a search results page) require evaluation that goes beyond ranking — was the generated answer factually grounded in the retrieved documents? Conversational search spans multiple turns where the relevance of one turn depends on the previous turns. Retrieval-augmented generation evaluation must measure both retrieval quality and downstream answer quality, with neither metric a complete picture.

The 2024–2026 IR community has produced new benchmarks for these regimes — RAG benchmarks (HotpotQA, RGB, ALCE for citation quality), conversational benchmarks (CONQRR, TREC CAsT), grounded-generation benchmarks (CRAG). The methodology is still consolidating, and "what is the right metric for an LLM search assistant" is genuinely open.

09

Search for Retrieval-Augmented Generation

The fastest-growing application of search in 2026 is as the substrate of retrieval-augmented generation — search retrieves relevant context, an LLM generates an answer grounded in that context. RAG has become the dominant pattern for enterprise AI because it sidesteps the LLM's reliability problem on factual recall. The retrieval methodology of the chapter applies almost verbatim, but the deployment patterns and evaluation frames are distinctive enough to warrant their own treatment.

Why RAG works

An LLM trained on internet text has knowledge with a hard cutoff at its pretraining date and limited reliability on factual recall. RAG augments the LLM with a retrieval step: at inference time, retrieve relevant documents from an external corpus, include them in the prompt, generate an answer grounded in the retrieved content. The result has up-to-date knowledge (the corpus can be updated daily), is constrained to documents the system trusts (which reduces, though does not eliminate, hallucination), and produces citations that link back to source documents.

The architectural shape is canonical: dense retriever + LLM, with the retriever selecting top-K passages from a corpus and the LLM conditioning its generation on them. The corpus is typically the deployer's domain-specific knowledge — internal documentation, customer-support history, product manuals, compliance documents — that the LLM never saw at pretraining time.
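
The canonical shape in a few lines: dense retrieval over a chunked corpus followed by grounded generation. The encoder checkpoint is an illustrative public model, and llm_generate is a hypothetical callable standing in for whatever generation API the deployer uses.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # illustrative checkpoint
chunks = [
    "Refunds are processed within 14 days of a return being received.",
    "Enterprise plans include single sign-on and audit logging.",
]                                                                          # placeholder corpus
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)             # index once

def answer(question, llm_generate, top_k=2):
    """llm_generate: any callable prompt -> text, supplied by the deployer (hypothetical here)."""
    q_vec = encoder.encode(question, normalize_embeddings=True)
    hits = util.semantic_search(q_vec, chunk_vecs, top_k=top_k)[0]         # dense retrieval
    context = "\n\n".join(chunks[h["corpus_id"]] for h in hits)
    prompt = ("Answer the question using only the context below, and cite the passages you use.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm_generate(prompt)                                            # grounded generation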

Chunking

The first engineering question in RAG is how to chunk source documents. Documents are usually too long to retrieve whole — a 50-page PDF cannot be a single retrieval unit because the relevant content is a few paragraphs. Chunking splits documents into smaller units (typically 200–1000 tokens) that are individually indexed. Chunk boundaries matter — splitting mid-sentence or mid-table corrupts the retrieved context — and 2024 best practices use semantic chunking that respects paragraph and section boundaries rather than fixed token counts.

Overlap between chunks is standard — 10–20% overlap means context that spans a chunk boundary is captured in at least one chunk. Hierarchical retrieval indexes both fine-grained chunks (for precise retrieval) and coarse documents (for broad context), then routes queries to the appropriate level. The 2026 RAG production pattern uses some form of hierarchical retrieval as standard.
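
A sketch of paragraph-respecting chunking with overlap; word counts stand in for model-token counts, and a paragraph longer than the budget simply becomes an oversized chunk in this simplified version.

def chunk_document(text, max_words=300, overlap_words=50):
    """Greedy chunker: pack whole paragraphs up to max_words, then start the next
    chunk with the last overlap_words of the previous one so boundary context survives."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]        # carry the overlap forward
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks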

Multi-vector and contextual retrieval

A specific 2024 RAG insight: chunks lose context. A chunk that says "the company decided to fire him" loses the context of who "him" refers to and which company. Contextual retrieval (Anthropic, 2024) adds context to each chunk before indexing — the LLM is asked to summarise the chunk's situating context, and that summary is included in the indexed text. The result is dramatically better retrieval on long documents at modest extra indexing cost.

Multi-vector approaches index the same chunk multiple times — once for the verbatim text, once for an LLM-generated summary, once for hypothetical questions the chunk could answer. At query time, all three indexes are searched and the results combined. This sacrifices index size for retrieval quality and is increasingly common in production RAG.

LLM rerankers and the RAG-eval feedback loop

The reranking machinery of Section 5 applies directly to RAG. After retrieval, an LLM reranker scores each retrieved chunk for relevance to the query and the top-K reranked chunks are passed to the generator LLM. The cost is real but the quality improvement is substantial — RAG systems with LLM rerankers consistently outperform those without, especially on harder queries.

RAG evaluation has its own machinery. Faithfulness: does the generated answer factually align with the retrieved content? Answer relevance: does the answer address the query? Context precision: are the retrieved chunks relevant? Context recall: did the retrieval find all the chunks needed? Frameworks like Ragas, TruLens, and DeepEval automate this evaluation by using LLMs as judges, with the obvious caveat that LLM-as-judge has its own reliability problems.

Hybrid RAG

Production RAG increasingly combines retrieval with other sources of grounding: structured database queries for precise facts, code execution for calculations, web search for live information, knowledge-graph queries for relational facts. The LLM orchestrates these — Section 7 of Part XIII Ch 11 (Neurosymbolic AI) developed this pattern as tool use; RAG is the search-flavoured instance of it. The 2026 production pattern for serious enterprise AI is rarely "just RAG" — it is RAG plus structured queries plus tool calls plus careful prompting.

10

Applications and Frontier

Search is everywhere in modern software — web search, enterprise search, code search, e-commerce search, conversational AI, agent tool-calling, RAG. The methodology of the chapter applies across all of these with domain-specific variations. This final section surveys the application landscape and the frontier where modern LLM-era search is reshaping the field.

Web search

Web search is the canonical application — Google, Bing, and the various challengers (DuckDuckGo, Brave, Perplexity, Kagi) all run massive multi-stage retrieval pipelines combining BM25, learned sparse, dense retrieval, and heavy reranking. The 2024 generation of web search increasingly integrates LLM-summarised answer panels at the top of the results, fed by a RAG-style architecture with the search results as the context. This is changing how users interact with search — fewer click-throughs to web pages, more direct-answer consumption — with substantial implications for the publishers whose content drives the underlying corpus.

Enterprise search

Enterprise search retrieves over internal corpora — Confluence, Slack, SharePoint, internal wikis, customer-support tickets. The technical problems differ from web search: corpora are smaller (millions rather than billions of documents) but more heterogeneous (mix of structured and unstructured); access control matters (different users see different documents); and the queries are more domain-specific. Production enterprise search increasingly looks like RAG over the company's own data — Glean, Microsoft's M365 Copilot, Google's enterprise search products, and the various startup offerings all share this architectural shape.

Code search

Code search is a specialised application with distinctive properties: queries are often natural-language descriptions of desired functionality, but documents are code; semantic similarity matters more than lexical match; the corpus has rich structure (syntax trees, type information, call graphs) that can inform retrieval. GitHub Copilot, Sourcegraph, and the various AI coding assistants use code-specialised retrievers (CodeBERT, Code-T5, the various code-trained dense retrievers) combined with LLM rerankers. The 2024 generation of code-search systems is genuinely better than text-only baselines on code-specific tasks.

E-commerce and product search

E-commerce search has its own constraints: the corpus is products with structured attributes (price, category, brand, rating); queries often combine free text with explicit filters; the goal is conversion, not just relevance. Amazon, eBay, and the various marketplaces combine BM25 over product text with dense retrieval over product embeddings (often image+text multimodal embeddings) and structured filtering on attributes. The ranking stage is similar to web-search ranking but optimised for purchase rather than click.

Multimodal and cross-modal search

Modern search increasingly spans modalities. Image search via text queries (Google Images, Pinterest visual search) and text search via image queries (reverse image search, lens-style applications) use joint vision-language embeddings (CLIP, ALIGN, the various successor models). Video search adds temporal structure. Audio search uses learned audio embeddings. The 2024 generation of multimodal search is unified — a single embedding space holds text, images, video frames, and audio, and queries in any modality retrieve documents in any modality. The CLIP-family encoders are the substrate.

Agent tool-calling and search

The 2024 rise of agentic systems has produced a specific search application: agents decide when to call search as a tool and synthesise the results into their reasoning. The retrieval methodology of the chapter applies but the integration is different — search is one tool among many, called when the agent's reasoning identifies an information need. Agent search has unusual requirements: latency must be low (the agent often calls search multiple times per task), result quality on specific factual queries must be high, and the agent must be able to formulate queries that will work for the retrieval system. This is an active research area with no settled methodology.

Frontier methods

Several frontiers are particularly active in 2026. Generative retrieval: instead of retrieving documents from a fixed index, a model generates document IDs autoregressively (DSI, NCI, the various generative-retrieval lines of work). Retrieval-augmented training: train LLMs to use retrieval naturally during their forward pass, rather than as a post-hoc add-on (REALM, Atlas, the various end-to-end RAG approaches). Personalised search: integrate user history and preferences into retrieval beyond simple session-aware ranking. LLM-as-retriever: use a large LLM to score documents directly against queries, without training a dedicated retriever. Long-context retrieval: retrieve and process documents at the scale of millions of tokens, enabled by long-context LLMs and approximate-retrieval methods that scale with context.

What this chapter does not cover

Several adjacent areas are out of scope. The classical IR theory — probabilistic retrieval, language models for IR, the BM25-derivation lineage — has its own substantial literature; the chapter has used the results without deriving them. Question answering as a distinct task (extractive QA, generative QA, knowledge-base QA) overlaps with search but has its own methodological centre. Conversational search and dialogue systems intersect with RAG but have their own dialogue-state-tracking machinery. Domain-specific retrieval — biomedical literature search, legal-document retrieval, patent search — has substantial domain-specific methodology that the chapter does not develop. And the privacy and ethical issues of search at scale — censorship, filter bubbles, surveillance — are largely policy questions rather than ML ones.

Further reading

Foundational papers and surveys for search and information retrieval. The Robertson BM25 paper, the DPR paper, the BEIR benchmark paper, and a recent RAG survey together form the right starting kit for practitioners.