A pretrained language model knows what it was trained on. That knowledge is frozen at the moment training stopped; it is compressed into weights that preserve no citations; and it cannot be updated except by another training run. For any application that depends on facts — company documents, product manuals, scientific literature, the contents of this week's wiki — the model's frozen knowledge is almost always wrong by default: missing, outdated, or hallucinated. Retrieval-augmented generation (RAG) is the dominant response. At inference time, the user's query is used to retrieve relevant passages from a corpus — through keyword search, dense-vector search, or a hybrid of the two — the retrieved passages are stitched into the prompt, and the model is asked to answer using what it just read. The retrieval step gives the system up-to-date, citeable, auditable knowledge; the generation step gives it the fluency and composition abilities of a large pretrained model. RAG pipelines now serve a substantial fraction of user-facing LLM applications in production — customer support, code assistants with repository context, legal and medical reference tools, internal document Q&A, web-search-augmented chat — and the engineering discipline around them has developed its own literature, its own failure modes, its own evaluation metrics, and its own tension with the long-context alternative of simply putting everything in the prompt and letting the model sort it out. This chapter covers the full pipeline: what to retrieve, how to retrieve it, how to rerank, how to stuff it into the context, what goes wrong, and how to tell whether the system is working.
Sections one and two frame the problem and the pattern. Section one is why RAG — the class of problems that retrieval addresses better than fine-tuning, the distinction between knowledge and behaviour, and the reason RAG has become the default architecture for any LLM application that touches real-world documents. Section two walks through the canonical RAG architecture from end to end — indexing, query, retrieval, augmentation, generation — so that the subsequent sections on individual components have a shared frame.
Sections three through six cover retrieval itself. Section three is chunking — the preprocessing step that turns source documents into the units that get indexed and retrieved, and the single design decision that most determines the quality of everything that follows. Section four covers sparse retrieval — BM25, TF-IDF, and the lexical methods that have been the backbone of information retrieval for decades and remain the strong baseline that many RAG systems never beat. Section five covers dense retrieval — DPR, bi-encoders, and the embedding-based approach that added semantic matching to the retrieval toolkit. Section six covers hybrid search — the near-universal finding that combining sparse and dense retrieval beats either alone.
Sections seven and eight cover the layer above retrieval proper. Section seven is rerankers — the cross-encoder models that take a shortlist from first-stage retrieval and reorder it using much more compute per candidate. Section eight covers vector databases and approximate-nearest-neighbour infrastructure — the FAISS/HNSW/IVF story and the operational side of serving embeddings at scale.
Sections nine and ten cover the query and context sides of the prompt. Section nine is query transformation — HyDE, query expansion, step-back prompting, and the techniques for turning a user's question into a retrieval-friendly form. Section ten is context construction — how to order, deduplicate, and pack retrieved passages into a limited context window, and the lost-in-the-middle phenomenon that constrains how you do it.
Sections eleven through thirteen cover the structural alternatives and extensions. Section eleven is the long-context vs RAG debate — the argument that million-token context windows obviate retrieval, and the reasons the practice has not gone that way. Section twelve covers graph RAG — structured retrieval over knowledge graphs rather than flat chunks. Section thirteen covers multimodal RAG — retrieving images, tables, diagrams, and code alongside text.
Sections fourteen through sixteen cover agency, evaluation, and failure. Section fourteen is agentic RAG — iterative and tool-using retrieval, where the model itself decides when and what to retrieve. Section fifteen covers evaluation — RAGAS, faithfulness/groundedness/relevance metrics, and the (harder than it looks) problem of knowing whether a RAG system is actually working. Section sixteen is failure modes — context poisoning, retrieval drift, hallucination under retrieval, and the operational patterns that catch them.
The closing section places RAG between Chapter 08 (fine-tuning) and Chapter 10 (evaluation), and sketches where the field is going as of early 2026: the interplay of retrieval with agents, the still-unresolved chunking problem, the growing sophistication of reranking stacks, and the surprisingly durable dominance of RAG as the pattern through which most LLMs reach users.
Retrieval-augmented generation is the answer to a specific question: how do you build an LLM application that needs to be correct about facts the model does not know? The model does not know about your company's documents, today's news, last month's product release, or the contents of a customer's past tickets. It cannot be told about them through prompting alone because they do not fit. It cannot be trained on them cheaply. RAG is the pattern that splits the problem: the model supplies language, reasoning, and format; a retrieval system supplies the facts at inference time.
The fundamental observation, made explicitly in the Lewis et al. 2020 RAG paper and refined many times since, is that LLMs are remarkably good at reading and summarising text they have not seen before. A model with no prior knowledge of your domain will, given a well-chosen passage from your documents, produce a faithful and well-phrased answer to questions about that passage. The model's contribution is language capability; the corpus's contribution is specific knowledge. RAG is the infrastructure that matches the two.
Contrast this with fine-tuning, covered in the previous chapter. Fine-tuning is the right tool for changing behaviour — the model's style, format, tone, or default response pattern. Fine-tuning is the wrong tool for injecting knowledge. Attempts to fine-tune facts into a model — a single bullet point, a product SKU, a policy clause — generally fail: the fact is diluted across millions of parameters, it interferes with adjacent knowledge, it cannot be updated without retraining, and it cannot be cited. The same fact in a retrieved passage is precise, up-to-date, and auditable.
The specific problems RAG was built to solve, and the ways it solves each: staleness (a corpus can be reindexed in minutes, where a model must be retrained); coverage (documents that were never in the training data are supplied at inference time); attribution (a retrieved passage can be cited and audited, where a weight cannot); and updateability (adding, correcting, or deleting a fact is an index operation, not a training run).
There is a second, subtler reason RAG matters. It makes the information-retrieval expertise that organisations have built up for decades — search relevance, indexing, ranking, query understanding — directly applicable to LLM-based products. A company with a search team and a document store can build a RAG system. A company without those things has to build both the retrieval side and the LLM side from scratch. This is why RAG adoption has tracked closely with existing search-engineering maturity rather than with AI-team maturity.
RAG is not a panacea. It adds latency (a retrieval round trip before generation). It adds failure modes (bad retrieval produces bad generation). It adds complexity (an entire second production system in the serving path). The chapter's job is to explain how each piece works, when each piece fails, and what the current best practice looks like for each.
Before getting into each component, it helps to have the whole pipeline on one page. A production RAG system has two phases — offline indexing, online query — that share a common representation of the corpus and that together define the system's quality.
Offline: indexing. The corpus is ingested, chunked, embedded, and indexed. This runs whenever documents are added, removed, or changed. For a small corpus (thousands of documents) it takes minutes; for a large one (hundreds of millions of documents) it is a batch pipeline that runs continuously, and the infrastructure to maintain it is non-trivial.
Online: query. A user's request is turned into a retrieval query, relevant chunks are pulled back, passed through a reranker if present, stitched into a prompt, and sent to the LLM.
Several cross-cutting concerns show up at every layer. Metadata filtering — restricting retrieval to a subset of chunks based on access control, document type, or recency — is needed in almost every production system. Access control — making sure user A does not see chunks from documents they should not have access to — is the single most common source of compliance bugs in RAG systems. Observability — logging which chunks were retrieved for which query, and which answers followed — is how the system is debugged and improved over time. These concerns are part of the architecture, not bolted on.
The rest of the chapter proceeds component by component, with an eye to both how each piece works and how it fails. The two most important components — chunking and the retrieval stack — come first because they are the components that most determine whether the whole system works.
Chunking is the step where a corpus becomes a retrieval problem. A document must be broken into units that are small enough to be retrieved selectively and large enough to carry meaning. The choice of chunk size and boundary strategy determines, more than any other single decision, what the retrieval system can and cannot do. Teams that treat chunking as a default parameter rather than a design problem pay for it in retrieval quality.
The conceptual trade-off is simple. Small chunks — a sentence, a short paragraph — have high specificity: each one covers a narrow concept, and a relevant retrieval returns a passage that is densely on-topic. They have low context: a one-sentence chunk often lacks the surrounding explanation needed to understand it. Large chunks — a multi-page section — have the opposite profile: high context, low specificity. The right point on this trade-off depends on the corpus, the query style, and the downstream model's ability to ignore irrelevant context.
Common chunking strategies, in increasing sophistication:
Recursive splitting (split on paragraph boundaries, fall back to sentences, then to characters) is the strategy behind LangChain's RecursiveCharacterTextSplitter and the default in most mature RAG pipelines as of 2026. There are a few specific gotchas worth knowing. Tables chunk badly with any text-based strategy; a row broken across chunks loses its header and becomes meaningless. Tables are typically handled by a separate preprocessing step that keeps them intact and often serialises them (e.g. to Markdown or JSON) before embedding. Code chunks badly with natural-language strategies; code-specific splitters respect function and class boundaries. Headers should usually be prepended to each chunk of their section — a chunk from the "Refund policy" section is more retrievable if "Refund policy →" appears at its start.
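The splitting-with-overlap pattern, including the header-prepending trick, fits in a few lines. A toy character-based sketch (`chunk_section` is an illustrative name, not a library function; production splitters recurse over paragraph and sentence separators before falling back to characters):

```python
def chunk_section(header: str, text: str, max_chars: int = 500, overlap: int = 50):
    """Split one section into overlapping chunks, each prefixed with its header."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # Prepend the section header so the chunk is retrievable on its own.
        chunks.append(f"{header} → {text[start:end].strip()}")
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across the boundary
    return chunks
```

The overlap means a sentence straddling a boundary appears whole in at least one chunk, at the cost of indexing some text twice.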
The chunking choice is tightly coupled to the embedding model. Most embedding models have a sequence limit (typically 512 or 8192 tokens), and chunks longer than the limit are silently truncated. Chunks much shorter than the limit waste the embedding's capacity: a one-sentence chunk fed to a 512-token model uses a small fraction of the context the model was trained to summarise. The sweet spot is usually close to the model's limit for dense retrieval and shorter (a sentence or two) for sparse retrieval.
A useful debugging habit: read your chunks. Dump a random sample of fifty chunks from your index and read them as if they were standalone documents. If a chunk is unintelligible without its context, your chunking is wrong. If a chunk contains two unrelated topics, your chunking is too coarse. This exercise has caught more bad RAG setups than any automated evaluation has.
Sparse retrieval is the classical information-retrieval story: treat documents and queries as weighted bags of words, build an inverted index, retrieve the documents whose word overlap with the query scores highest under some statistical model. It predates neural methods by half a century, it is used at Google scale by the largest search engines, and — embarrassingly for the dense-retrieval literature — it remains competitive with, and often superior to, pure neural methods on many real-world RAG benchmarks.
The dominant sparse method in 2026 is still BM25 (Okapi BM25, Robertson & Walker 1994), a term-frequency model with saturation and length normalisation. For a query $q$ and a document $d$:
$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)(k_1 + 1)}{f(t,d) + k_1(1 - b + b \cdot |d|/\overline{|d|})}$
where $f(t,d)$ is the term frequency, $\text{IDF}(t)$ is the inverse document frequency, $|d|$ is the document length, and $k_1$ and $b$ are tuning parameters (typical defaults $k_1 = 1.2$, $b = 0.75$). The shape of the formula captures three observations: rare terms are more informative than common ones (IDF), matching a term multiple times matters but with diminishing returns (saturation via $k_1$), and longer documents should not be rewarded for having more opportunities to contain the query terms (length normalisation via $b$).
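The formula translates directly into code. A self-contained sketch over a toy corpus, using naïve whitespace tokenisation and the +1-smoothed IDF variant that Lucene-family engines ship (real systems lowercase, stem, and use an inverted index rather than scanning every document):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N  # average document length
    df = Counter()                         # document frequency per term
    for t in toks:
        df.update(set(t))
    def idf(term):
        # Smoothed IDF: rare terms score high, ubiquitous terms near zero.
        return math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            f = tf[term]
            if f:
                # Saturation via k1, length normalisation via b.
                denom = f + k1 * (1 - b + b * len(t) / avgdl)
                s += idf(term) * f * (k1 + 1) / denom
        scores.append(s)
    return scores
```

Note the saturation at work: a document repeating the query term three times scores higher than one match, but nowhere near three times higher.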
BM25 has properties that are easy to undervalue until you work without them. It handles exact matches — a product SKU, a person's name, a code identifier — perfectly, which dense retrieval famously struggles with. It has no training requirement: the inverted index and the IDF statistics are all it needs. It scales to corpora larger than dense retrieval can handle, using less memory per document. It is interpretable: you can point at the terms that caused a match. Its failure mode is also clean: it cannot match synonyms or paraphrases, and a query that uses different words than the document will return nothing.
The infrastructure story is similarly mature. Elasticsearch and OpenSearch — both built on Lucene — are the production systems most teams reach for; they handle billions of documents, distribute across nodes, and support the rich query language that lets you combine BM25 with filters on metadata (date ranges, access control tags, document types). tantivy (a Lucene-inspired full-text search library in Rust) and Meilisearch are lighter-weight alternatives for smaller deployments. Postgres's tsvector is adequate for moderate-scale corpora colocated with relational data.
Beyond BM25 proper, there are a handful of extensions worth knowing. Query-time expansion — adding synonyms to the query — helps with the vocabulary-mismatch problem. Stemming and lemmatisation — reducing morphological variants to a common form — is almost always a win for English and a mixed bag for other languages. Field-weighted BM25 — scoring matches in titles higher than matches in body text — is a small but consistent quality gain for structured documents. SPLADE (Formal et al. 2021) is a neural sparse method that learns a sparse term-weighting over a transformer's vocabulary; it outperforms BM25 on most benchmarks while keeping the inverted-index infrastructure, and has real production uptake in 2025–2026.
The practical recipe: start with BM25. Tune $k_1$ and $b$ on your corpus. Add field weighting for structured documents. Add query expansion for domains with heavy vocabulary. Only then, if the quality gap remains, consider replacing or augmenting with dense retrieval (which §5 covers) or SPLADE. Teams that skip this and jump to dense retrieval often spend months tuning embedding models and rerankers to rebuild the capability BM25 would have given them in an afternoon.
Dense retrieval embeds queries and documents into a common vector space and scores their similarity by inner product or cosine. Where sparse retrieval matches words, dense retrieval matches meanings. The paraphrase problem that defeats BM25 — "how do I cancel my subscription" vs a document titled "unsubscribing from our service" — is what dense retrieval was designed to solve.
The architecture that dominates is the bi-encoder: a transformer that maps a passage to a fixed-dimension vector, trained so that query–passage pairs known to be relevant have higher inner product than negative pairs. The training objective is typically a contrastive loss: for each query, one positive passage and a batch of negatives are scored, and the model is updated to push the positive up and the negatives down. In-batch negatives — using other queries' positives as this query's negatives — are standard; hard negatives (confusingly similar non-matches) are added when quality demands.
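The in-batch-negatives objective in miniature: with a batch of query embeddings and their positive-passage embeddings, the similarity matrix's diagonal holds the positives, and every off-diagonal entry is a free negative. A pure-Python sketch of the loss only (real training backpropagates through a deep encoder; here the embeddings are given):

```python
import math

def in_batch_contrastive_loss(q_embs, p_embs, temperature=0.05):
    """Mean cross-entropy of each query classifying its own passage
    among all passages in the batch (InfoNCE with in-batch negatives)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, q in enumerate(q_embs):
        logits = [dot(q, p) / temperature for p in p_embs]
        m = max(logits)  # max-subtraction for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-softmax of the positive
    return total / len(q_embs)
```

The batch size matters because every extra example adds a negative for every query, which is why dense-retrieval training favours large batches.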
The foundational paper is Karpukhin et al.'s Dense Passage Retrieval (DPR, 2020), which demonstrated that a well-trained bi-encoder beats BM25 substantially on open-domain question answering. The method: two BERT encoders, one for queries and one for passages, trained on (query, positive passage) pairs from Natural Questions, TriviaQA, and a handful of other QA datasets. DPR's numbers were striking enough to kick off the dense-retrieval wave.
The lineage of improvements since has been substantial. Sentence-BERT (Reimers & Gurevych 2019) predates DPR and introduced the bi-encoder template. ColBERT (Khattab & Zaharia 2020) argued for late interaction — store per-token embeddings, score query tokens against document tokens at query time — trading storage for better quality. E5 (Wang et al. 2022), BGE (Xiao et al. 2023), GTE, NV-Embed, and the commercial embedding APIs (OpenAI text-embedding-3, Cohere embed-v3, Voyage) are the widely used embedding models in 2026.
Dense retrieval has a number of failure modes that the literature has had to name. Exact-match failure: dense models can place distinct SKUs or identifiers close enough in vector space to be confused. Out-of-domain collapse: a model trained on Natural Questions may perform poorly on medical or legal text, sometimes worse than BM25. Popularity bias: content resembling the training distribution is embedded well, while the long tail of rare topics is embedded poorly and goes under-retrieved. Length bias: similarity scores can vary systematically with passage length, so short queries pair badly with long passages; some models are trained to correct for this, others are not.
The serving story for dense retrieval is non-trivial and §8 covers it in more detail. The embeddings must be stored in an approximate-nearest-neighbour index (brute-force search scales linearly with corpus size and becomes prohibitive past a few million vectors). The index data structures — HNSW, IVF, ScaNN — trade off recall, latency, and memory in different ways. The engineering competence required to run a dense-retrieval system at production scale is considerable, and more than one enterprise team has learned this the hard way after a prototype worked and production didn't.
The empirical consensus in 2026 is that dense retrieval is rarely the right solo choice. It complements sparse retrieval. The next section covers how the combination is constructed in practice.
The benchmark result that shaped modern RAG practice is that a combination of BM25 and dense retrieval consistently beats either alone. The two methods make different kinds of mistakes; their errors are loosely uncorrelated; the union of their top results is richer than either. Hybrid search is now the default retrieval configuration in most production RAG systems.
The intuition for why hybrid works is straightforward. BM25 excels at exact matches and named entities — SKUs, people's names, technical terms, code identifiers — but fails on paraphrases and cross-lingual queries. Dense retrieval excels at paraphrases and semantic matching but can confuse similar-but-distinct entities. A query like "how do I reset my password on the v2 API?" matches a document titled "v2 API: credential management" via dense retrieval (the password/credential paraphrase) and matches a document mentioning "v2 API" via BM25 (the exact-match tag). Retrieving from both and combining gives the system access to both kinds of signal.
The implementation question is how to combine the two ranked lists. There are three dominant approaches: reciprocal rank fusion (RRF), which ignores raw scores and credits each document $1/(k + \text{rank})$ per list; score normalisation, which rescales BM25 and cosine scores to a comparable range and takes a weighted sum; and deferring to a reranker, which scores the union of both candidate sets and makes the score-compatibility problem moot.
Several tooling choices make hybrid search cheaper than it used to be. Elasticsearch/OpenSearch now ship with native dense-vector support alongside BM25, so a single query can return both sparse and dense scores in one round trip. Weaviate, Qdrant, and Pinecone all offer hybrid APIs. Postgres's pg_trgm combined with pgvector covers the smaller-scale case. The "run two separate indices and combine in application code" pattern, while still common, has lost ground to integrated offerings.
Three practical notes. First, the same chunks should be indexed in both the sparse and dense index; retrieving from different chunks defeats the combination. Second, filtering (metadata, access control, recency) should be applied before fusion, not after; filtered-out results should not take up slots in the union. Third, the $k$ for each retriever — how many candidates to pull before fusion — often wants to be larger than the final result count; fetching 50 from each, fusing, and keeping the top 10 is a common pattern.
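Reciprocal rank fusion, a widely used combination method, needs only the ranked ID lists from each retriever; it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales. An illustrative implementation (the constant 60 is the value proposed in the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked lists of document IDs: each document scores
    the sum over lists of 1 / (k + rank), with ranks starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Usage follows the fetch-50-keep-10 pattern above: `reciprocal_rank_fusion([bm25_top50, dense_top50], top_n=10)`. A document appearing in both lists accumulates credit from each, which is exactly the signal hybrid search is after.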
Hybrid search is the most reliable quality improvement available in RAG engineering. Teams that skip it in favour of expensive reranker tuning or aggressive chunking experiments are almost always leaving quality on the table. The recipe is cheap, well-supported by tooling, and the gains are consistent across corpora.
Retrieval returns dozens to hundreds of candidates; the generator can only attend to a handful. The reranker is the model that decides which of the candidates actually make it into the prompt. It is almost always a cross-encoder: a transformer that sees the query and a candidate document jointly and emits a single relevance score. Unlike the bi-encoders used for retrieval, the cross-encoder reads both inputs in one pass and can model fine-grained interactions between the query tokens and the document tokens. This makes it much more accurate than a bi-encoder and far too slow to run over an entire corpus — which is exactly why it sits at the second stage.
The pipeline shape is retrieve a lot, rerank to a few. A typical production setup might fetch 100 candidates from a hybrid retriever, pass all 100 through a cross-encoder that scores each (query, candidate) pair, and keep the top 5–10 for the generator. The cross-encoder is expensive per call — a forward pass per candidate, not per corpus — but the candidate set is small enough to make the cost tractable. Running a reranker over millions of documents would be infeasible; running it over a hundred is a few hundred milliseconds.
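The retrieve-a-lot, rerank-to-a-few shape as code, with the cross-encoder stubbed out (`score_pair` stands in for a real cross-encoder forward pass, e.g. a bge-reranker call; the threshold option anticipates the abstention use discussed below):

```python
def rerank(query, candidates, score_pair, keep=5, threshold=None):
    """Score each (query, candidate) pair with a cross-encoder and keep
    the top few; optionally drop low scorers to allow abstention."""
    scored = [(score_pair(query, c), c) for c in candidates]  # one forward pass each
    scored.sort(key=lambda x: x[0], reverse=True)
    if threshold is not None:
        scored = [(s, c) for s, c in scored if s >= threshold]
    return [c for _, c in scored[:keep]]
```

The cost structure is visible in the list comprehension: one model call per candidate, which is why this runs over a hundred retrieved candidates rather than the whole corpus.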
The quality improvement from reranking is substantial. On BEIR, adding a cross-encoder reranker to a dense retriever typically adds 3–8 points of NDCG@10. The pattern is consistent across domains; the effect is larger when the first-stage retriever is weaker, smaller when it is already excellent. The main failure mode is score inflation: cross-encoders can hallucinate relevance for documents that share surface vocabulary with the query but answer a different question. Good rerankers are trained with hard negatives specifically to resist this.
The model landscape splits three ways. Open-weight rerankers: bge-reranker-large, mxbai-rerank, jina-reranker — all cross-encoders fine-tuned from encoder-only transformers on MS MARCO and synthetic data, typically 300M–500M parameters. Commercial APIs: Cohere's Rerank (v3), Voyage's voyage-rerank. LLM-as-reranker: prompt a general-purpose LLM (GPT-4-class) to score (query, document) pairs; much more expensive but sometimes higher quality, especially for complex queries. ColBERT is a third architectural pattern — late interaction between query and document tokens — that sits between bi-encoder and cross-encoder on the cost/quality frontier.
Three design questions recur. How many candidates to rerank? More candidates means higher recall but more cost; 50–100 is the common range, with diminishing returns past ~200. What to pass into the reranker? Usually the full chunk text, but some systems concatenate chunk + section title + document title to give the reranker more context for the decision. Whether to use reranker scores as final confidence? The reranker's scores are calibrated to the reranker's training distribution, not to the downstream task. Using them as a hard threshold (drop all candidates with score below $\tau$) is useful for abstention; using them as a soft weight in a further fusion is trickier.
Rerankers are the highest-leverage addition to a RAG pipeline after hybrid search. They are cheap relative to fine-tuning a new embedding model, well-supported by managed APIs, and they compose cleanly with everything upstream. A team whose retrieval feels "close but not quite right" should try a reranker before trying almost anything else.
A dense retriever's runtime problem is given a query vector, find the $k$ nearest vectors in a corpus of millions or billions. The exact solution — compute the distance to every vector and take the top $k$ — is $O(N)$ per query and intolerable past a few hundred thousand documents. The field has responded with a family of approximate nearest neighbour (ANN) algorithms that trade a small amount of recall for a large amount of speed, and with a family of vector databases that package these algorithms behind a query interface.
Three ANN index families dominate. HNSW (Hierarchical Navigable Small World graphs, Malkov & Yashunin 2016) builds a multi-layer graph where each layer is sparser than the one below; queries descend layer by layer, greedily moving toward closer neighbours. HNSW is the quality leader — near-exact recall at reasonable latency — but memory-hungry and slow to build. IVF (Inverted File Index) partitions the vector space into Voronoi cells via k-means, and at query time searches only the few cells closest to the query. IVF is cheaper in memory and faster to build than HNSW but less accurate, and is usually combined with PQ (Product Quantisation) which compresses each vector into a few bytes for a massive memory reduction at further recall cost. Finally, ScaNN (Google) and DiskANN (Microsoft) represent research pushes on the frontier — ScaNN for in-memory GPU-friendly search, DiskANN for billion-scale indices that live on SSD.
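The IVF idea fits in a toy: assign every vector to the nearest of a few centroids at build time, then at query time scan only the `nprobe` cells whose centroids are closest to the query. A pure-Python sketch (centroids here are just sampled points for brevity; real indices run k-means, and function names are illustrative):

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, n_cells=8, seed=0):
    """Partition vector indices into cells around sampled centroids."""
    centroids = random.Random(seed).sample(vectors, n_cells)
    cells = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(n_cells), key=lambda c: sq_dist(v, centroids[c]))
        cells[nearest].append(i)
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, nprobe=2, k=3):
    """Scan only the nprobe cells closest to the query, not the whole corpus."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]
```

The recall/speed dial is `nprobe`: probing every cell recovers exact search; probing a few cells scans a small fraction of the corpus and occasionally misses a neighbour that fell just across a cell boundary.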
The tooling splits three ways. Dedicated vector databases: the Weaviate/Qdrant/Pinecone class from §6. General-purpose databases with vector support: pgvector (Postgres), Redis Vector, SQLite's sqlite-vec. Libraries, not databases: FAISS (Meta, the canonical ANN toolkit), Annoy (Spotify), hnswlib. The "do I need a dedicated vector DB?" question has become more complicated as existing general-purpose databases add vector support; for many use cases under ~10M vectors, pgvector on a Postgres instance is enough.
The operational concerns around vector DBs are less about algorithms and more about hybridisation with structured data. A real RAG query is rarely "find the most semantically similar chunks"; it is "find the most semantically similar chunks from documents the user has permission to see, published after X, tagged with Y, excluding source Z." Filtered ANN is a harder problem than unfiltered ANN — the obvious algorithms either filter first and lose ANN speedups, or filter after and return too few results. Modern vector DBs offer various strategies for pre-, post-, and integrated filtering; correctness and performance both depend on choosing the right one for the filter's selectivity.
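The two naïve strategies can be made concrete. A sketch under obvious assumptions (`ann_search` is a stand-in for an ANN index query returning IDs by similarity; `allowed` is the set of IDs the filter admits):

```python
def pre_filter_search(vectors, allowed, query_vec, k):
    """Pre-filtering: brute-force over the allowed subset.
    Always correct, but forfeits the ANN index entirely; only
    viable when the filter is very selective."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(allowed, key=lambda i: sq_dist(query_vec, vectors[i]))[:k]

def post_filter_search(ann_search, allowed, query_vec, k, overfetch=4):
    """Post-filtering: over-fetch from the ANN index, then drop
    disallowed hits. Fast, but a selective filter can leave
    fewer than k results no matter how much you over-fetch."""
    hits = ann_search(query_vec, k * overfetch)
    return [h for h in hits if h in allowed][:k]
```

The `overfetch` factor is the knob that fails: if only 1% of the corpus passes the filter, a 4x over-fetch will usually come back empty, which is why integrated filtering inside the index traversal exists.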
The other operational concern is reindexing. Embedding models update; chunking strategies change; new data arrives continuously. A well-designed RAG system treats the index as a build artefact, versioned alongside the code, so that a change to chunking or embeddings produces a new index that can be swapped in atomically. Teams that "patch" a long-lived index in place tend to accumulate silent quality drift.
The emerging consensus is that vector databases will be one of many tools in a retrieval stack, not the whole stack. The best systems combine ANN search with full-text search, metadata filters, graph traversals, and structured SQL — each chosen for the query type it serves best.
The user's question is often a poor search query. It may be under-specified, use different vocabulary from the corpus, or bundle several questions together. The remedy is to insert a preprocessing step that rewrites the question into one or more queries better matched to the retrieval system. This class of techniques — query transformation — has become one of the main intervention points in modern RAG design.
The simplest form is query expansion: add synonyms, related terms, or reformulations before retrieval. Classical IR used thesauri and pseudo-relevance feedback; modern systems use an LLM to "rewrite this question three different ways" and union the retrieval results. The cost is a small LLM call; the benefit is coverage of vocabulary variants that a single query would miss.
A more surgical pattern is HyDE — Hypothetical Document Embeddings (Gao et al. 2022). Instead of embedding the query, prompt an LLM to answer the query hypothetically, then embed the hypothetical answer and search with that embedding. The insight is that answers look more like documents than questions do — documents are declarative, questions are interrogative — so embedding a hypothetical answer lands closer to the correct documents in vector space. HyDE helps most when queries are short and corpora are long; it costs one LLM call per query and can measurably improve recall in zero-shot settings.
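The HyDE flow is three calls in sequence. A sketch with every dependency stubbed (`generate_hypothetical`, `embed`, and `index_search` are placeholder callables, not any real API):

```python
def hyde_retrieve(question, generate_hypothetical, embed, index_search, k=10):
    """Search with the embedding of a hypothetical answer, not the question.

    generate_hypothetical: LLM call, question -> plausible answer text
    embed:                 text -> vector
    index_search:          (vector, k) -> top-k chunk IDs
    """
    hypothetical = generate_hypothetical(question)
    # The hypothetical may be factually wrong; only its shape matters.
    # Declarative answer text embeds closer to real documents than a question does.
    return index_search(embed(hypothetical), k)
```

The comment carries the whole idea: HyDE tolerates a wrong hypothetical answer, because the retrieval step replaces the LLM's guess with real documents before generation happens.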
Two further patterns push more work upstream. Query decomposition splits a multi-part question into sub-queries that are retrieved separately and combined. Routing and structured query construction has an LLM decide which index or tool should serve the question, extract any structured constraints (e.g. author=Hinton AND year>2015), and hand them to the retrieval layer. The design tension is latency. Each transformation adds an LLM call, often a large one, before retrieval can begin. A chain of query expansion + decomposition + routing can add a second or two to the first-token latency — a serious cost in interactive applications. The production practice is to be selective: use HyDE only when the first-stage retriever returns low-confidence results; use decomposition only for questions the router flags as multi-hop; skip transformations entirely for simple factual queries where plain retrieval is already strong.
The theoretical question at the back of this section is where does the reasoning belong? — in front of the retriever or behind it. Moving reasoning upstream (decompose, route, rewrite) makes retrieval smarter but the generator simpler; moving it downstream (retrieve generously, let the generator sort things out) makes retrieval simpler but the generator work harder. The current best practice is a mix: light transformation in front for query quality, the main reasoning work behind retrieval where the full context is available.
Once retrieval and reranking have selected a handful of chunks, the system still has to assemble them into a prompt the generator will attend to well. This stage is undervalued: teams put enormous effort into retrieval quality and then paste the top-5 chunks into the prompt in rank order, unaware that ordering, delimitation, and packing decisions can change the answer quality by double-digit percentages.
The first decision is ordering, and the main finding in this area is lost-in-the-middle (Liu et al. 2023): transformer-based language models attend more carefully to the beginning and end of their context than to the middle. A relevant chunk placed in the middle of a ten-chunk context is less likely to be used than the same chunk placed first or last, even holding the total context length constant. The practical recipe that follows is to order chunks so that the most relevant ones are at the ends — often with the single most relevant chunk at the very end, closest to the question. Several popular RAG frameworks implement "long-context reordering" as a default.
The second decision is packing: how many chunks, how much total context, and what to do when retrieved chunks exceed the budget. The naïve strategy — always include top-$k$ — wastes tokens when the first chunk already contains the answer and starves the generator when the answer is split across many. A better approach is a budget-aware packer that sorts chunks by reranker score, adds them until a target token count is reached, and stops. An even better approach is to compress each chunk (via a smaller summariser model or a contextual compression API like LongLLMLingua) before packing, trading compute for tokens.
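The two decisions combine into a small packer: keep chunks under a token budget in score order, then reorder so the best chunks sit at the attention-advantaged ends. An illustrative sketch (whitespace word counts stand in for a real tokeniser):

```python
def pack_context(chunks_with_scores, budget_tokens=2000):
    """chunks_with_scores: list of (score, text), higher score = more relevant.
    Returns chunk texts ordered with the most relevant at the ends."""
    ranked = sorted(chunks_with_scores, key=lambda x: x[0], reverse=True)
    kept, used = [], 0
    for score, text in ranked:
        n = len(text.split())  # crude stand-in for a tokeniser
        if used + n > budget_tokens:
            break
        kept.append(text)
        used += n
    # Lost-in-the-middle reorder: alternate kept chunks to the back and front,
    # so the least relevant ones land in the middle and the single most
    # relevant chunk sits last, closest to the question.
    front, back = [], []
    for i, text in enumerate(kept):
        (front if i % 2 else back).append(text)
    return front + back[::-1]
```

With four chunks ranked a > b > c > d, this emits b, d, c, a: the top two chunks occupy the two ends, and the weakest sit in the middle.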
Around the chunks sits the prompt scaffold, and its recurring elements are consistent across systems: clear delimitation of each chunk (for example, XML-style tags such as <document id="3">...</document>) so the model can cite by ID; instruction prefacing ("Answer the question using only the information in the documents below. If the answer is not present, say 'I don't know.'"); and a closing repetition of the question after the documents, which pushes the question into the attention-advantaged tail. Anthropic's and OpenAI's RAG cookbooks both converge on similar scaffolds.
The third decision is citation. If the generator is expected to cite its sources, the prompt must give it a way to do so — numbered chunks, inline IDs, URL metadata. If the generator is expected to abstain when the corpus does not contain the answer, the prompt must say so explicitly; a bare retrieval prompt will almost always produce some answer, whether or not it is grounded. The difference between a grounded, cited answer and a fluent-but-unsupported one is often a single sentence in the system prompt.
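Scaffolding, citation, and abstention all meet in the prompt assembler. A sketch, assuming numbered XML-style delimiters and an explicit abstention instruction; the exact wording is illustrative, not any vendor's canonical template:

```python
def build_rag_prompt(question, chunks):
    """Assemble a grounded-QA prompt: ID-tagged documents so the model
    can cite, an explicit abstention instruction, and the question
    repeated at the end, where attention is strongest."""
    docs = "\n".join(
        f'<document id="{i}">\n{text}\n</document>'
        for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the information in the documents "
        "below, citing document IDs. If the answer is not present, say "
        "'I don't know.'\n\n"
        f"{docs}\n\n"
        f"Question: {question}"
    )
```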
Context construction sits at an uncomfortable interface: retrieval engineers think of it as "the generator's problem," and generator prompt engineers think of it as "retrieval's output." In practice it belongs to the integration layer and deserves its own design attention, its own evaluation metrics (faithfulness, answer relevance), and its own experiments. The cheapest quality wins in a mature RAG system are almost always in how the prompt is assembled, not in how the chunks were found.
The main architectural question hanging over RAG in 2024–2026 is whether it matters at all. Frontier models now accept context windows of one million tokens and above — roughly the text of a long novel, a medium codebase, or a year of email. If the entire corpus fits in the prompt, why bother with retrieval? This question has produced a large literature of benchmarks, ablations, and increasingly nuanced answers.
The case for long context is simplicity and robustness. No retrieval pipeline to build; no chunking decisions; no embedding model to maintain; no failure mode where the right chunk is not in the top-$k$. The model sees everything and decides for itself what is relevant. The "needle-in-a-haystack" benchmarks — planting a single fact in a long document and asking the model to recall it — show that frontier long-context models can retrieve single facts with near-perfect accuracy even at the far end of a million-token window.
The case for RAG survives several of these results because the needle-in-a-haystack benchmark is the easiest possible long-context task. Real workloads involve many relevant facts, distributed across the document, often in tension with irrelevant but superficially similar facts. Benchmarks like LongBench, RULER, and InfiniteBench show that long-context performance degrades — sometimes sharply — as the task moves from single-fact retrieval to multi-fact reasoning, aggregation, or comparison. The million-token window is not uniformly useful; attention still has a budget.
The consensus that has emerged is that RAG and long-context are complements, not substitutes. Long-context is the right tool when the relevant data is small (a single long document, a contained codebase) and the access pattern is read-heavy. RAG is the right tool when the corpus is large or dynamic, when access control matters, when auditability matters, and when latency and cost matter. Many production systems combine the two: a retriever fetches a few hundred chunks, the generator is given all of them at once in a long context, and the model decides what to attend to.
One further distinction: RAG provides citability almost for free, because the system knows which documents it fed to the model. Long-context alone provides no such audit trail — the model was given "everything" and used what it used. For regulated domains, legal research, and medical contexts, this is a decisive advantage for RAG that even a perfect infinite-context model would not erase.
Chunk-and-embed RAG does well on local questions — "what does this document say about X?" — and poorly on global ones — "what are the main themes across the corpus?" The reason is structural: top-$k$ retrieval returns a small, local sample of the corpus, which is adequate for a question that has a small, local answer and inadequate for one that requires synthesis. GraphRAG is the family of techniques that addresses this gap by building an explicit graph structure over the corpus and using it to answer global questions.
The canonical version is Microsoft's GraphRAG (Edge et al. 2024). At index time, an LLM reads each chunk and extracts entities and relationships, forming a knowledge graph. A community-detection algorithm (Leiden) groups the graph into clusters, and an LLM summarises each cluster into a short description. At query time, the system retrieves the relevant cluster summaries rather than raw chunks — so a question like "what are the main research directions in this archive?" retrieves cluster-level summaries that span the whole corpus, not individual documents. The approach trades higher indexing cost (tens of thousands of LLM calls to build the graph) for dramatically better performance on global, thematic queries.
The more general pattern that GraphRAG belongs to is structured retrieval: representing the corpus as something richer than a flat list of chunks, and retrieving over the structure. Variants include hierarchical summary trees built over the corpus (RAPTOR-style), knowledge-graph traversal over extracted entities and relations, and parent-document retrieval over explicit document hierarchies.
The broader reading is that retrieval is a search problem, and different questions need different search indices. A production system that asks both local questions (answered by chunk-retrieval) and global questions (answered by summary-retrieval) and factual questions (answered by graph traversal) will end up with multiple indices, a router in front, and some glue code behind. The monolithic "one vector index" picture is a starting point, not a destination.
Real documents are not plain text. They contain diagrams, charts, tables, screenshots, equations, photographs, and scanned pages. A RAG system that only indexes extracted text loses most of the signal in a typical PDF of a scientific paper, a financial report, or a product manual. Multimodal RAG is the family of techniques that extend retrieval to non-text content.
Two architectural patterns dominate. The first is caption-and-embed: run a vision-language model over each image or table to produce a textual description, and index that description alongside the surrounding text. Retrieval and generation stay purely textual; the multimodality lives entirely in the ingestion pipeline. This approach is cheap at query time, plays well with existing vector infrastructure, and is what most production systems do today. Its weakness is that captions lose information — a complex chart cannot be fully described in a paragraph, and the generator never sees the original image.
The second is unified multimodal embeddings. Embed images and text into a shared vector space (via CLIP, SigLIP, or a more recent multimodal encoder) so that a text query retrieves both text and image chunks in one ANN search. At generation time, pass the retrieved images directly into a vision-language model (GPT-4o, Claude, Gemini) that can see them. This approach preserves the original signal and enables genuinely visual questions — "find the diagram that looks like this one" — but requires multimodal embeddings, image-capable generators, and storage for the images themselves. ColPali (Faysse et al. 2024) is a notable recent point in this space: it embeds PDF pages directly as images, skipping text extraction entirely, and retrieves whole pages as visual evidence.
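At query time, the unified-embedding pattern reduces to one nearest-neighbour search over a mixed index. The sketch below uses tiny hand-made vectors as stand-ins for real CLIP or SigLIP embeddings, and a brute-force cosine scan in place of an ANN index:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search_shared_space(query_vec, indexed):
    """One search over a mixed index: each entry is (kind, payload,
    vector), where kind is 'text' or 'image' and every vector lives
    in the same embedding space."""
    return sorted(indexed, key=lambda item: cosine(query_vec, item[2]),
                  reverse=True)

# toy stand-in vectors in place of CLIP/SigLIP embeddings
index = [
    ("text", "install section", [1.0, 0.0]),
    ("image", "wiring-diagram.png", [0.9, 0.1]),
    ("text", "legal boilerplate", [0.0, 1.0]),
]
hits = search_shared_space([1.0, 0.0], index)
```

The point of the sketch is the shape of the result: a single text query pulls back both text and image entries, ranked together, which is exactly what the caption-and-embed approach cannot do.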
The current state of the practice is a mixture. Most production systems caption-and-embed for cost reasons, but the frontier is moving toward unified embeddings as multimodal encoders improve and generator pricing drops. The ColPali result — that page-level visual retrieval can outperform text extraction on document understanding benchmarks — is an early signal that the "extract text first" default may not survive long. For now, the pragmatic answer is: extract text well, caption images and tables, and choose a generator that can look at the originals if the caption turns out to be insufficient.
Domain-specific multimodality extends further. Audio RAG (over meeting transcripts, podcasts, call-centre recordings) adds a speech-to-text stage; video RAG adds frame sampling and captioning; CAD/3D-model retrieval adds shape descriptors. Each adds its own ingestion pipeline, but the core shape — retrieve over embeddings, generate with what you retrieved — survives intact.
Classical RAG is single-shot: one query goes in, one retrieval happens, one answer comes out. Agentic RAG replaces that shape with an LLM loop in which retrieval is one tool among several and the model decides, turn by turn, what to search for next. The result is a system that can answer questions requiring several sequential lookups — where the result of the first search tells you what the second search should look for.
The minimal pattern is a ReAct-style loop (Yao et al. 2022): the LLM receives the question, emits either a tool call (search("X")) or a final answer, and repeats until it decides it has enough information. A multi-hop question like "which of our Q3 contracts had a clause similar to the one in the 2021 Acme agreement?" naturally decomposes into two searches — first fetch the Acme clause, then search Q3 contracts for similar language — that a single-shot RAG cannot express. Modern variants (Self-Ask, ReWOO, Reflexion) add planning, self-criticism, and memory to this basic loop.
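The loop itself is small; everything interesting lives in the model and the tools. The sketch below uses a scripted stand-in for the LLM (`scripted_llm`) and a stubbed search tool, both hypothetical, to show the control flow, including the hard step limit that production deployments rely on:

```python
def react_loop(question, llm, tools, max_steps=5):
    """Minimal ReAct-style loop. `llm` maps the transcript so far to
    either ("call", tool_name, argument) or ("answer", text); both the
    LLM and the tools here are stand-ins for real implementations."""
    transcript = [("question", question)]
    for _ in range(max_steps):  # hard step limit: the usual cost bound
        action = llm(transcript)
        if action[0] == "answer":
            return action[1]
        _, tool_name, arg = action
        observation = tools[tool_name](arg)
        transcript.append(("observation", observation))
    return "step limit exceeded"  # fall back rather than loop forever

# scripted stand-in: one search hop, then an answer from the observation
def scripted_llm(transcript):
    if transcript[-1][0] == "question":
        return ("call", "search", "2021 Acme clause")
    return ("answer", f"Based on: {transcript[-1][1]}")

answer = react_loop(
    "Which Q3 contracts resemble the Acme clause?",
    scripted_llm,
    {"search": lambda q: "indemnification clause text"},
)
```

A real deployment replaces `scripted_llm` with a model call and adds the token-budget and fall-back machinery described above, but the control flow stays this simple.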
The practical upside is substantial: agentic RAG can handle questions that require genuine reasoning over multiple documents, can combine retrieval with other tools (calculators, SQL queries, web search, APIs), and can abstain gracefully when the corpus does not contain the answer. The practical downside is also substantial: agentic loops are slow (many sequential LLM calls), expensive (tokens for every intermediate step), and capable of failing in new ways — infinite loops, tool-call hallucinations, plan drift. Production deployments wrap agents in step limits, cost budgets, and fall-back paths that revert to single-shot RAG when the agent exceeds a threshold.
Agentic RAG is also the bridge from question-answering to task-completion. Once the LLM has retrieval plus a set of action tools — send email, update a database, create a calendar event — the pattern generalises from "answer my question using this corpus" to "do this task using these tools and this corpus." The retrieval step becomes one of many the agent chooses among. This shift is what makes agentic RAG the dominant architectural pattern in the current generation of LLM-driven assistants and copilots.
The open research questions are the same ones that face agentic systems generally: how to bound cost, how to bound loop length, how to detect and recover from mistakes mid-loop, how to evaluate a system whose behaviour depends on stochastic tool-call sequences. None of these are solved. Most are being actively worked on.
A RAG system has more ways to fail than a pure generator does, and evaluating it is correspondingly harder. There are two distinct components — retrieval and generation — that can each be evaluated separately, and then the end-to-end behaviour that neither component's metric captures. Good RAG engineering depends on having metrics at all three levels and watching them move when the system changes.
The retrieval evaluation is the most traditional. Given a labelled test set of (query, relevant document) pairs, compute standard IR metrics: recall@k (did we fetch the right document in the top $k$?), precision@k, MRR (mean reciprocal rank: the average, over queries, of the reciprocal rank of the first relevant result), and NDCG@k (normalised discounted cumulative gain, which rewards getting relevant results at high ranks). The BEIR benchmark (Thakur et al. 2021) — a zero-shot retrieval benchmark spanning 18 diverse datasets — is the canonical external yardstick. For a specific corpus, teams typically build a small labelled test set (100–1000 queries) and track recall@k as their north-star retrieval metric.
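Each of these rank metrics is a few lines to compute. A sketch for a single query with binary relevance judgements (a set of relevant doc IDs):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, 0 if none is retrieved.
    MRR is this value averaged over the query set."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: discounted gain, normalised by the
    gain of an ideal ordering that puts all relevant docs first."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

A perfect ranking scores 1.0 on NDCG; pushing a relevant document down the list decays its contribution logarithmically rather than discarding it, which is why NDCG is gentler than recall@k on near-misses.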
The generation evaluation is harder because the outputs are free text. The RAGAS framework (Es et al. 2023) introduced the dominant vocabulary: faithfulness (does the answer only contain claims supported by the retrieved documents?), answer relevance (does the answer address the question?), context precision (are the retrieved documents actually relevant?), context recall (did retrieval bring in everything needed to answer?). Each metric is operationalised as an LLM-as-judge prompt over (query, retrieved context, answer) triples. The framework is widely adopted; its weakness is that LLM-as-judge metrics are noisy and drift as the judging model changes.
Three specific failure modes deserve dedicated metrics. Hallucination: the generator produces a confident answer when retrieval brought back nothing relevant. Measured by running the eval on out-of-corpus questions and counting non-abstentions. Citation accuracy: when the generator cites a source, does that source actually support the claim? Measured by extracting cited claims and verifying each against its cited document. Staleness: the answer is correct with respect to an old snapshot of the corpus but wrong with respect to the current one. Measured by time-partitioning the eval set.
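Of the three, hallucination rate is the cheapest to operationalise: feed the system questions known to be unanswerable from the corpus and count the answers that are not abstentions. The marker list below is an illustrative assumption; a production version would use an LLM judge rather than string matching:

```python
# illustrative abstention markers; tune to your system's actual phrasing
ABSTAIN_MARKERS = ("i don't know", "not in the provided documents")

def hallucination_rate(answers):
    """Share of out-of-corpus questions that received a confident
    answer instead of an abstention. `answers` are generator outputs
    for questions known to be unanswerable from the corpus."""
    non_abstentions = [
        a for a in answers
        if not any(m in a.lower() for m in ABSTAIN_MARKERS)
    ]
    return len(non_abstentions) / len(answers)
```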
The organisational point is that RAG evaluation needs both an offline test suite (for regression testing and CI) and an online observability layer (for catching production drift). Neither alone is sufficient; the offline tests will always underrepresent real traffic, and the online telemetry will always be too noisy to debug a specific change. A mature RAG team runs both, correlates them, and treats divergence between offline and online metrics as a signal worth investigating.
A survey of public post-mortems and internal retrospectives from teams that have shipped RAG at scale turns up a remarkably consistent catalogue of failure modes. Most are not novel problems in retrieval or in generation; they are integration problems that arise specifically because the two stages are glued together. Knowing the catalogue in advance is the single highest-leverage piece of domain knowledge in RAG engineering.
Lost-in-the-middle (already mentioned): relevant information placed in the middle of a long prompt is used less well than the same information at the ends. Mitigation: reorder chunks so the most relevant are at the ends, keep the total context short, or compress.
Distractor sensitivity. Adding an additional irrelevant-but-superficially-similar chunk to the context can flip the answer from correct to incorrect. The generator "sees" the distractor and gets pulled toward it. Mitigation: better reranking, lower $k$, higher reranker precision threshold.
Chunk boundary failures. The answer spans the boundary between two chunks, and retrieval brings back one half; the generator fabricates the other half. Mitigation: overlap between chunks, retrieval with windowed context (fetch chunk $i$ together with $i-1$ and $i+1$), or structure-aware chunking.
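Windowed retrieval is a one-function mitigation: return the hit together with its neighbours from the original document order. A sketch (the function name is ours):

```python
def with_window(chunks, hit_index, window=1):
    """Return the retrieved chunk together with its neighbours, so an
    answer that spans a chunk boundary is not cut in half.

    chunks: all chunks of one document, in document order.
    hit_index: position of the retrieved chunk within that list."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return chunks[lo:hi]
```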
Semantic-lexical mismatch. The corpus uses a specialised vocabulary (ICD-10 codes, product SKUs, legal Latin) that the embedding model was not trained on; dense retrieval fails silently. Mitigation: hybrid search, domain-adapted embeddings, synonym dictionaries, query rewriting.
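Hybrid search needs a way to merge a keyword ranking and a dense ranking whose scores are not comparable. Reciprocal rank fusion is the standard trick: score each document by the sum of 1/(k + rank) across the lists it appears in, with k around 60 by convention. A sketch:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several rankings (e.g. BM25 and dense) without score
    calibration: documents that rank well in multiple lists rise to
    the top. Returns doc IDs, best first."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, an exact SKU match that BM25 puts first survives fusion even when the embedding model has never seen the token, which is precisely the failure mode hybrid search exists to cover.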
Stale indices. The corpus updates but the index does not, and the system confidently returns outdated information. Mitigation: index versioning, incremental updates, explicit "last indexed" metadata, TTLs on cached embeddings.
Access-control leakage. Retrieval returns a chunk the user was not authorised to see; the generator puts its contents in the answer. Mitigation: filter for permissions at retrieval time, not only at display time; test explicitly with access-boundary queries.
Long-tail query collapse. The system is evaluated on the common-query distribution and performs well, but the bulk of user pain is on the tail — rare, poorly specified, idiosyncratic queries that the retrieval was never tuned for. Mitigation: log all queries, cluster the tail, prioritise fixes by volume × severity rather than by what's easy to measure.
Embedding model drift. The embedding API updates to a new version, old embeddings and new embeddings are no longer compatible, retrieval quality silently degrades. Mitigation: version-pin embeddings, reindex on version bumps, treat embedding model updates as a deployment event with its own eval run.
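The version-pinning mitigation is mechanical: record the embedding model version in the index and reject writes from any other version, so a model bump forces an explicit reindex instead of silent degradation. A minimal sketch (class and field names are illustrative):

```python
class VersionedIndex:
    """Refuse to mix embeddings from different model versions: a
    version bump becomes a loud deployment event, not silent drift."""

    def __init__(self, model_version):
        self.model_version = model_version
        self.vectors = {}  # doc_id -> embedding

    def add(self, doc_id, vector, model_version):
        if model_version != self.model_version:
            raise ValueError(
                f"index built with {self.model_version}, "
                f"got {model_version}: reindex required"
            )
        self.vectors[doc_id] = vector
```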
The common thread across these failures is that they are detected late — users complain, support tickets accumulate, then someone runs a root-cause analysis and discovers the mismatch. A RAG system that ships with telemetry and evaluation for each of these failure modes in place at launch will save itself the firefighting.
Retrieval-augmented generation occupies a peculiar position in contemporary machine learning. It is the dominant deployment pattern for LLMs in industry — most "LLM applications" are RAG under the hood — and yet it is also one of the least model-y techniques in the AI stack, consisting mostly of information retrieval, database engineering, and prompt design rather than novel neural architectures. The tension between "RAG is everywhere" and "RAG is barely an ML problem" is characteristic of the current era.
The intellectual lineage runs through three distinct traditions. From information retrieval, RAG inherits BM25, inverted indices, TF-IDF, and the entire apparatus of search quality measurement. From neural question answering, it inherits the reader-retriever separation that Chen et al. 2017 made explicit in DrQA and that DPR (2020) made the modern standard. From open-domain language modelling, it inherits the generators — the GPT, Llama, Claude, Gemini families — whose quality is the upper bound on what a RAG system can produce. Each tradition contributes half-solved pieces of the modern stack.
The relationship to the rest of this chapter is one of deferred knowledge. Fine-tuning internalises knowledge into weights; RAG externalises knowledge into a searchable corpus. Instruction tuning shapes how the model responds; RAG shapes what the model has access to when responding. Pretraining gives the model a general world model; RAG gives it a specific one. The decision of what to teach the model (weights) versus what to show the model (context) is the central architectural decision in any LLM application.
The skill at the centre of all of this is not model training; it is system design. A competent RAG engineer understands embeddings well enough to choose one, chunking well enough to tune it, retrieval algorithms well enough to debug them, prompt engineering well enough to guide generation, and evaluation well enough to measure outcomes. They also understand, which is harder, the data — what is in the corpus, how it is structured, what users will ask — because the best algorithmic choices depend on the data's shape. This full-stack understanding has no clean home in any academic subfield; it lives in the engineering-heavy practice of applied NLP.
From the perspective of a compendium: RAG is the pattern by which language models became useful in situations where pretraining alone was insufficient. It is likely to remain the dominant pattern for as long as corpora exist that cannot be baked into model weights — which is, essentially, forever. The particular algorithms will evolve; the pattern of retrieve, then generate, grounded in something outside the model is almost certainly permanent.
Retrieval-augmented generation sits at the intersection of information retrieval, neural question-answering, and modern LLM engineering, and the reading list reflects that breadth. The selections below are a 2026 snapshot: the IR classics that still define the retrieval half of the pipeline, the neural QA papers that made reader-retriever the modern default, the specific papers that introduced now-ubiquitous RAG techniques (DPR, BM25-as-dense-baseline, HyDE, Lost-in-the-Middle, GraphRAG), the key benchmarks (BEIR, MTEB, RULER), and the open-source tooling that most production systems actually use.
pgvector adds vector types and ANN to Postgres; Chroma is an embedded vector store. Adequate for most prototypes and many production systems under ~10M vectors.