25 RAG and Retrieval

A model’s weights are a lossy, frozen snapshot of the corpus it was trained on. Retrieval-augmented generation puts a live, queryable corpus next to the model and feeds the right passages into the prompt at answer time. By the end of this chapter a reader can explain how a query becomes a set of retrieved chunks, why dense embeddings, sparse term matching, and rerankers each earn their place in the pipeline, how GraphRAG and agentic RAG extend the naive loop, and where retrieval is contested against ever-longer context windows.

25.1 Problem

A frontier model knows what was in its pre-training mixture, up to the day the mixture was frozen, blurred by the lossy compression of Chapter 3. It does not know your private documents, it does not know what happened after the cutoff, and when asked about a fact it half-remembers, it tends to produce a fluent, confident, wrong answer. Fine-tuning the knowledge in is expensive, slow to update, and still lossy. The cleaner move is to keep the knowledge outside the weights and fetch it on demand.

That reframes the problem as information retrieval married to generation. Given a question and a corpus of millions of passages, find the small set of passages that actually answer the question, place them in the context window, and let the model read and cite them. The hard parts are all in the retrieval: the corpus is large, so search must be sub-linear; the question and the answer rarely share exact words, so lexical match alone is brittle; the context window is finite and expensive, so you can pass only a handful of passages and they had better be the right ones. Retrieval is the act of turning a question into the few hundred tokens of evidence that change the answer.

25.2 Design

The naive RAG pipeline has four stages: chunk and index the corpus offline, embed the query online, retrieve the nearest chunks, and generate an answer conditioned on them. Each component described below exists to fix a specific failure of the simpler design it replaces.

Chunking exists because documents are too long to embed or retrieve whole. A single vector cannot represent a hundred-page report at the granularity a question needs, and you cannot paste the whole report into the prompt. So the corpus is split into chunks, typically a few hundred tokens, and each chunk is indexed independently. The chunk size is a direct trade: chunks too large dilute the relevant sentence among irrelevant ones and waste context budget; chunks too small sever the sentence from the context that disambiguates it. Overlapping windows and structure-aware splitting (by section, by paragraph) soften the boundary problem.
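
The overlapping-window idea can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes the document is already a list of tokens (whitespace-split words here, where a real system would use the embedding model's tokenizer), and the function name `chunk_tokens` and its default sizes are hypothetical.

```python
def chunk_tokens(tokens, size=200, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Tiny sizes so the overlap is visible; real chunks are a few hundred tokens.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, size=4, overlap=2):
    print(" ".join(chunk))
```

Each window repeats the tail of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk. The price is a larger index: with 25 percent overlap, roughly a third more chunks are embedded and stored.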

Dense retrieval answers the brittleness of lexical match. A dual-encoder embeds the query and each chunk into the same vector space, trained so that a question and its answer land near each other even when they share no words. Karpukhin et al. (2020), Dense Passage Retrieval, showed that a dual-encoder trained on a few thousand question-passage pairs beats a strong BM25 baseline by a wide margin on top-20 retrieval accuracy across open-domain QA. Relevance becomes a dot product or cosine similarity in embedding space:

\[ \text{score}(q, c) = \mathbf{E}_q(q) \cdot \mathbf{E}_c(c) \]

where \(\mathbf{E}_q\) and \(\mathbf{E}_c\) are the query and chunk encoders. The encoders can be tied or separate, and the chunk embeddings are computed once, offline, then frozen into an index.
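
As a sketch of what the score means operationally, the snippet below ranks chunks by dot product over unit-normalized vectors, which makes the dot product equal to cosine similarity. The three-dimensional vectors are hand-written stand-ins for encoder outputs, and the helper names are assumptions for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query_vec, chunk_vecs, k=2):
    """Exact search: score every chunk and return (chunk_id, score), best first."""
    scored = [(cid, dot(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for E_c(c); a real encoder emits hundreds of dimensions.
chunk_vecs = {
    "c1": normalize([0.9, 0.1, 0.0]),
    "c2": normalize([0.0, 1.0, 0.2]),
    "c3": normalize([0.7, 0.6, 0.1]),
}
query_vec = normalize([1.0, 0.2, 0.0])  # stand-in for E_q(q)
print(top_k(query_vec, chunk_vecs))
```

This exact scan touches every chunk vector on every query, which is fine for a toy corpus and is the cost that an approximate index exists to avoid.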

The vector store answers the cost of search. With millions of chunk vectors, computing the dot product against every one for every query is too slow. Approximate nearest neighbor indexes trade a small, tunable recall loss for orders of magnitude in speed. The dominant structure is HNSW, the hierarchical navigable small-world graph of Malkov and Yashunin (2018), which builds a multi-layer proximity graph and searches it greedily from a sparse top layer down, giving query time that grows roughly logarithmically with the number of vectors. A vector store is HNSW (or a quantized variant) plus metadata filtering plus the plumbing to keep the index in sync with the corpus.
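
The core of graph-based search can be sketched without the hierarchy. The code below is a deliberately simplified, single-layer illustration and not HNSW itself: it builds the proximity graph by brute force (HNSW builds it incrementally and adds sparse upper layers), then runs a best-first walk that keeps the `ef` closest nodes seen so far. The names `build_knn_graph`, `greedy_search`, `m`, and `ef` are illustrative choices here.

```python
import heapq
import math
import random


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_knn_graph(points, m=8):
    """Link each point to its m nearest neighbours (brute force, bidirectional)."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(p, points[j]))[:m]
        for j in nearest:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_search(points, graph, query, entry=0, k=5, ef=32):
    """Best-first walk over the graph, keeping the ef closest nodes seen so far."""
    d0 = dist(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexpanded node first
    best = [(-d0, entry)]       # max-heap via negated distance, at most ef items
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the nearest frontier node is worse than everything kept
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, points[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]]


random.seed(0)
points = [[random.random(), random.random()] for _ in range(300)]
graph = build_knn_graph(points, m=8)
query = [0.5, 0.5]
approx = greedy_search(points, graph, query, k=5, ef=32)
exact = sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:5]
print("recall@5:", len(set(approx) & set(exact)) / 5)
```

Raising `ef` widens the beam, which tends to raise recall and latency together. That is the recall-latency dial in miniature.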

Hybrid search answers the blind spot of dense retrieval. Embeddings generalize across paraphrase but lose exact tokens: a part number, a rare name, a code identifier, a literal string that the dense model has smoothed away. Sparse lexical retrieval, BM25 over an inverted index, is exact on those terms and complementary. Hybrid search runs both and fuses the rankings, commonly with reciprocal rank fusion, so that a chunk ranked highly by either method surfaces.
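
Reciprocal rank fusion is small enough to show whole. In this sketch each ranking is a list of chunk ids, best first, and every list contributes 1 / (k + rank) to an id's fused score. The constant k = 60 is the commonly used default, and the alphabetical tie-break is an assumption added here to make the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c7", "c2", "c9", "c4"]   # hypothetical dense ranking
sparse = ["c4", "c7", "c1"]        # hypothetical BM25 ranking
print(reciprocal_rank_fusion(dense, sparse))
# -> ['c7', 'c4', 'c2', 'c1', 'c9']
```

Because only ranks are used, the fusion never has to reconcile BM25 scores with cosine similarities, which live on incomparable scales. A chunk that appears in both lists (c7, c4) outranks one that appears high in only one.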

Reranking answers the gap between cheap retrieval and accurate scoring. The dual-encoder is fast because the query and chunk never meet until the final dot product, which is also why it is imprecise: it cannot model fine-grained interaction between the query’s words and the chunk’s. A cross-encoder reranker feeds the query and chunk together through a transformer and scores their joint representation, which is far more accurate and far too slow to run over the whole corpus. So the pipeline retrieves a few hundred candidates cheaply, then reranks only that shortlist expensively. ColBERT (Khattab and Zaharia, 2020) sits between the two: its late-interaction architecture keeps per-token embeddings and does a cheap MaxSim interaction, recovering much of the cross-encoder’s accuracy at a cost much closer to retrieval than to cross-encoding.
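
The MaxSim operation itself is simple: for every query token vector, take the best dot product against any chunk token vector, then sum those maxima. The sketch below shows only that scoring rule on hand-written two-dimensional vectors; it is not ColBERT's encoder or index, and the function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, chunk_tokens):
    """Late interaction: sum over query tokens of the best-matching chunk token.

    Both arguments are non-empty lists of per-token embedding vectors.
    """
    return sum(max(dot(q, c) for c in chunk_tokens) for q in query_tokens)


# Toy per-token embeddings; a real model emits one vector per wordpiece.
query = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a good match for each query token
chunk_b = [[0.6, 0.4], [0.7, 0.1]]              # matches only the first query token well

print(maxsim_score(query, chunk_a), maxsim_score(query, chunk_b))
```

The chunk token vectors can still be computed offline, as in a dual-encoder, but each query token gets to find its own best match, which is the fine-grained interaction a single pooled vector loses.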

flowchart LR
  Q[query] --> DE[dense embed]
  Q --> SP[BM25 sparse]
  DE --> ANN[ANN index / HNSW]
  SP --> INV[inverted index]
  ANN --> FUSE[rank fusion]
  INV --> FUSE
  FUSE --> RR[cross-encoder rerank]
  RR --> CTX[top-k chunks in context]
  CTX --> GEN[generate + cite]

The pipeline is a funnel: each stage is cheaper and broader than the one after it, and the order exists so that the expensive stages only ever see a short list.

25.3 Evolution

Lewis et al. (2020) named retrieval-augmented generation and framed it as combining parametric memory, the model’s weights, with non-parametric memory, a retrievable index, training the retriever and generator together on knowledge-intensive tasks. The naive RAG that followed froze that idea into the four-stage funnel above and decoupled the stages: embed once, retrieve once, generate once. This is the workhorse, and it has three structural weaknesses that the rest of the field’s evolution attacks in turn.

The first weakness is local-only reasoning. Naive RAG retrieves the chunks most similar to the query, which works for questions whose answer sits in a few passages and fails for questions whose answer is spread across the whole corpus. Asked “what are the main themes in these documents?”, similarity search has nothing to grab, because no single chunk is about the themes. GraphRAG (Edge et al., 2024) reframes this as query-focused summarization. It uses a model to extract an entity-and-relationship graph from the corpus, runs Leiden community detection to partition the graph hierarchically, and pre-summarizes each community. A global question is then answered by map-reduce over community summaries rather than over raw chunks, which the authors show improves the comprehensiveness and diversity of answers on sensemaking questions over corpora in the million-token range. LightRAG (Guo et al., 2024) keeps the graph idea but cuts the cost: a dual-level retrieval that combines low-level entity retrieval with high-level concept retrieval over a graph index, with incremental updates so a new document does not force a full re-index.

The second weakness is the static, single-shot control flow. Naive RAG always retrieves, retrieves exactly once, and never checks whether what it retrieved was any good. The fix is to put a controller in the loop. Self-RAG (Asai et al., 2023) trains the model to emit reflection tokens that decide whether to retrieve at all and then critique each retrieved passage for relevance and support before using it. CRAG (Yan et al., 2024) adds a lightweight evaluator between retrieval and generation that scores retrieved chunks and, when they fall short, triggers a corrective web search instead of generating from weak evidence. These are the bridge from naive to agentic.
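
The control-flow change can be sketched as a thin wrapper around the naive loop. What follows is a loose illustration of the grade-then-fall-back pattern, not CRAG's trained evaluator or Self-RAG's reflection tokens: every callable is injected, and the names `corrective_answer`, `grade`, `fallback_search`, `threshold`, and `min_good` are hypothetical.

```python
def corrective_answer(query, retrieve, grade, fallback_search, generate,
                      threshold=0.5, min_good=1):
    """Retrieve once, keep chunks the grader accepts, fall back if too few remain."""
    chunks = retrieve(query)
    good = [c for c in chunks if grade(query, c) >= threshold]
    source = "index"
    if len(good) < min_good:
        # Evidence from the index is too weak: use another source instead.
        good = fallback_search(query)
        source = "fallback"
    return generate(query, good), source


# Toy stand-ins so the control flow can be run end to end.
corpus = ["HNSW is a graph index.", "BM25 is a lexical ranking function."]


def toy_grade(query, chunk):
    """Fraction of query words that appear in the chunk (a crude relevance proxy)."""
    words = set(query.lower().split())
    return sum(w in chunk.lower() for w in words) / len(words)


def toy_generate(query, evidence):
    return f"answer to {query!r} from {len(evidence)} passage(s)"


def toy_web_search(query):
    return [f"web result for {query}"]


print(corrective_answer("hnsw graph", lambda q: corpus, toy_grade,
                        toy_web_search, toy_generate))
print(corrective_answer("leiden communities", lambda q: corpus, toy_grade,
                        toy_web_search, toy_generate))
```

The first query is answered from the index; the second finds nothing acceptable and routes to the fallback. In a real system the grader would be a model, and the fallback might be a web search, a query rewrite, or a refusal.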

The third weakness is the fixed funnel itself, and the fix generalizes the second. Singh et al. (2025), surveying agentic RAG, describe the endpoint: the retrieval loop becomes an agent that plans, decomposes a question into sub-queries, chooses which tools and indexes to query, reflects on intermediate results, and iterates until it has enough evidence. Retrieval stops being a fixed funnel and becomes a policy the agent runs, drawing on the agent architectures of Chapter 21 and, when multiple specialized retrievers are involved, the coordination of Chapter 24. The same passages, reordered and re-queried, get used very differently.

25.4 Trade-offs

Every stage of the funnel is a balance with a knee.

Chunk size. Large chunks preserve context but dilute relevance and burn the context budget; small chunks are precise but lose the surrounding meaning. There is no universal size; it depends on document structure and question type, which is why structure-aware and overlapping chunking exist.
Dense versus sparse. Dense retrieval generalizes across paraphrase and fails on exact tokens; sparse retrieval is exact and fails on paraphrase. Hybrid search pays two indexes and a fusion step to get both, and that cost is usually worth it.
Recall versus latency in the index. Exact nearest neighbor is correct and too slow; approximate indexes like HNSW trade a tunable recall loss for logarithmic query time. The index parameters are a recall-latency dial, not a fixed setting.
Retrieve-then-rerank. A dual-encoder is cheap and approximate; a cross-encoder is accurate and expensive. Running the cross-encoder only on a retrieved shortlist buys most of the accuracy at a fraction of the cost, and the shortlist length is the dial.
Naive versus agentic. A single-shot pipeline is cheap, predictable, and low-latency. An agentic loop that plans, re-queries, and reflects answers harder questions but multiplies token cost and latency and adds failure modes of its own. Most production traffic does not need the agent; the minority that does benefits enormously.

Retrieval also moves the security boundary. Once untrusted text from a corpus flows into the prompt, that text can carry instructions. SafeRAG (Liang et al., 2025) benchmarks this directly, classifying attacks into silver noise, inter-context conflict, soft advertisement, and white denial-of-service, and finds that across many RAG components even simple injected passages bypass retrievers and filters and degrade answer quality. Retrieved content is untrusted input and must be treated as such, which connects directly to Chapter 34.

What’s contested

The live debate is whether long context will make retrieval obsolete. As context windows grow toward and past a million tokens, one camp argues the simplest design wins: drop the retrieval machinery, paste the whole corpus into the prompt, and let attention do the searching. The other camp argues retrieval stays necessary on three grounds the window does not address: cost, since attention over a million tokens per query is far more expensive than fetching a few thousand; freshness, since the corpus changes faster than you can re-prompt and an index updates incrementally; and scale, since real corpora exceed any window. The evidence cuts against the pure long-context position: Liu et al. (2023), “Lost in the Middle,” show that models use evidence placed mid-context markedly worse than evidence at the edges, so a full window is not a uniform substitute for a short, well-ranked one. The honest reading is that the two are converging. Long context makes retrieval coarser and more forgiving, and retrieval makes long context affordable and current. See Chapter 20 for the window side and Chapter 26 for how the retrieved evidence is assembled into the prompt.

Constraint arrow

Retrieval is kept economically necessary by a lower layer. The serving cost of attention, Chapter 16, and the key-value cache it fills, Chapter 17, make every token in the context window a recurring expense paid on every request. That cost is what forbids the simplest design, pasting the entire corpus into the prompt, and forces a retrieval funnel that delivers a few hundred high-value tokens instead of a few hundred thousand mediocre ones. The shape of the retrieval pipeline is dictated by what a token of context costs to serve.
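
A back-of-envelope calculation makes the gap concrete. The sketch below counts only key-value cache memory, using the standard accounting of one key and one value vector per layer, per KV head, per token. The model shape is a hypothetical example and not any particular model, and attention compute, which adds further cost, is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 or bf16


def kv_cache_bytes(n_tokens, shape):
    """KV cache size: 2 (key and value) x layers x KV heads x head dim x bytes x tokens."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * n_tokens


# Hypothetical shape, chosen only to make the arithmetic concrete.
shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128)
for label, n_tokens in [("retrieved evidence", 500), ("pasted corpus", 500_000)]:
    mib = kv_cache_bytes(n_tokens, shape) / 2**20
    print(f"{label}: {n_tokens:>7} tokens -> {mib:,.1f} MiB of KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so 500 retrieved tokens occupy about 62.5 MiB while 500,000 pasted tokens occupy about 61 GiB, per concurrent request.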

25.5 Implementation

A minimal naive RAG loop is short, and the brevity is the point: the sophistication lives in the index, the embeddings, and the reranker, not the control flow.

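def answer(query, k=5):
    """Naive hybrid RAG loop. embed, ann_index, bm25, reciprocal_rank_fusion,
    cross_encoder, build_context, and generate are assumed to be defined elsewhere."""
    q = embed(query)                               # dense query vector
    dense = ann_index.search(q, n=100)             # HNSW recall
    sparse = bm25.search(query, n=100)             # exact terms
    fused = reciprocal_rank_fusion(dense, sparse)  # merge the two rankings
    top = cross_encoder.rerank(query, fused)[:k]   # expensive scoring, shortlist only
    return generate(prompt=build_context(query, top))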

Three implementation realities decide whether this works in practice. The first is index freshness: the offline chunk-and-embed step must run incrementally as the corpus changes, or retrieval silently serves stale evidence. The second is evaluation, which is harder than it looks, because a RAG system can fail at retrieval (the right chunk was never fetched) or at generation (the right chunk was fetched and ignored or misread), and these need separate measurement. Retrieval quality is measured with ranking metrics like recall at k and nDCG; answer quality with faithfulness to the retrieved evidence and answer relevance, typically judged by a model as in Chapter 28, with the caveats about agentic and trajectory evaluation in Chapter 27. The third is the provenance contract: because every claim should trace to a retrieved chunk, the pipeline must carry citations through to the answer, which is both the original selling point of RAG in Lewis et al. and the main defense against the confident hallucination that motivated it.
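
The two retrieval metrics named above are short to write down. This sketch assumes a hand-labeled evaluation set: `relevant` is the set of chunk ids that answer the question, and `gains` maps chunk ids to graded relevance. The function names and the toy labels are illustrative.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(retrieved, gains, k):
    """DCG of the actual ranking over DCG of the ideal one; unlabeled ids count as 0."""
    actual = dcg([gains.get(c, 0) for c in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


retrieved = ["c3", "c1", "c8", "c2"]   # what the retriever returned, best first
relevant = {"c1", "c2", "c5"}          # hypothetical labels
gains = {"c1": 3, "c2": 2, "c5": 1}    # hypothetical graded relevance

print(recall_at_k(retrieved, relevant, k=4))
print(round(ndcg_at_k(retrieved, gains, k=4), 3))
```

Recall at k asks only whether the right chunks made the list; nDCG also penalizes putting them low. Both are computed before generation runs, which is what lets a retrieval failure be told apart from a generation failure.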

RAG sits beside, not inside, the agent’s working memory. The retrieved context is ephemeral, assembled per request and discarded, which distinguishes it from the persistent state of Chapter 22. Retrieval is how an agent reaches knowledge it does not hold; memory is how it keeps what it has learned. The two compose, and the strongest systems use both.

25.6 Further reading

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020. arXiv:2005.11401
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering,” 2020. arXiv:2004.04906
Khattab & Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT,” 2020. arXiv:2004.12832
Malkov & Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs,” 2018. arXiv:1603.09320
Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization,” 2024. arXiv:2404.16130
Guo et al., “LightRAG: Simple and Fast Retrieval-Augmented Generation,” 2024. arXiv:2410.05779
Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection,” 2023. arXiv:2310.11511
Yan et al., “Corrective Retrieval Augmented Generation,” 2024. arXiv:2401.15884
Singh et al., “Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG,” 2025. arXiv:2501.09136
Liang et al., “SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model,” 2025. arXiv:2501.18636
Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023. arXiv:2307.03172