Retrieval-Augmented Generation

All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.

This chapter is new; it has no AIMA 4e antecedent. Retrieval-Augmented Generation (RAG) combines a retrieval system that finds relevant information in a knowledge corpus with a generator (typically an LLM) that produces responses grounded in the retrieved content. RAG became substantial production infrastructure in 2023-2026: it is used in enterprise Q&A, consumer products (Perplexity, ChatGPT with browsing, Claude with search), specialized applications (legal research, medical information), and as a foundation for many agentic systems.

The chapter consolidates RAG-specific material referenced across many other chapters: LLM §9 (briefly), AI Agents §3-§6 (memory and retrieval as tool use), Multimodal Models §10 (cross-modal retrieval), Causality §11 (RAG and causal reasoning). This chapter develops the comprehensive treatment.

The chapter assumes the Large Language Models chapter, the Foundation Models chapter, and basic familiarity with information retrieval.

Scope and What This Chapter Is About

The chapter develops retrieval-augmented generation - the combination of retrieval systems with generative models. We cover the conceptual framework (why retrieval; what to retrieve; how to combine), the dominant retrieval methods (sparse, dense, hybrid, reranking), the indexing and vector-store infrastructure, the retrieval-generation interface, advanced patterns (multi-hop, conversational, multimodal), evaluation, production deployment, and connections to agentic systems.

Approximate length target: 15,000–22,000 words.

§1. Motivation and Scope

Three worked instances

Three concrete instances spanning the modern RAG landscape.

Instance 1: Perplexity answers a current-events question. A user asks Perplexity “What’s the latest on the EU AI Act implementation?”. Perplexity searches the web for recent articles on the topic; retrieves the most relevant; passes them to its LLM with the user query; produces a synthesized answer with citations to the source articles. The user can click each citation to verify. The entire pipeline takes a few seconds; the answer reflects information from after the LLM’s training cutoff.

Instance 2: An enterprise Q&A system answers a policy question. An employee at a large company asks the internal Q&A system “What’s our policy on remote work for parental leave?”. The system retrieves relevant pages from the company’s internal HR documentation; passes them to an LLM with the question; produces an answer with references to the specific policy sections. The system handles thousands of such queries daily, drawing from millions of internal documents.

Instance 3: A legal research agent investigates a case. A lawyer asks Harvey AI (or a similar legal-AI system) “What’s the precedent for claims like this one in California?” The system retrieves relevant case law from legal databases; uses LLM reasoning to identify the most-relevant precedents; produces a memo summarizing applicable precedents with citations. The lawyer reviews and refines.

These three instances span web-scale (Perplexity), enterprise-scale (internal Q&A), and specialized-domain (legal) RAG. They share an architectural pattern (retrieve relevant content; generate a response grounded in the retrieved content); they differ in scale, domain, and evaluation criteria.

What RAG is

A working definition. Retrieval-Augmented Generation (RAG) is the combination of:

A retrieval system that, given a query, returns relevant content from a knowledge corpus.
A generator (typically an LLM) that produces a response grounded in the retrieved content.

The basic pipeline:

   THE RAG PIPELINE

   User query
       │
       ▼
   Retrieval (search corpus → return top-K relevant items)
       │
       ▼
   Augmentation (combine query + retrieved items → prompt)
       │
       ▼
   Generation (LLM produces response from augmented prompt)
       │
       ▼
   Response (possibly with citations to retrieved sources)

The crucial property. The generator’s response is grounded in specific retrieved content. This is qualitatively different from pure LLM generation, where responses come from the model’s parametric knowledge alone.
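
The three stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the corpus is a toy dictionary, `retrieve` ranks by simple word overlap as a stand-in for a real retriever, and `generate` is a hypothetical stub where a real system would call an LLM.

```python
import re

# Toy corpus; in a real system these would be chunks from an indexed knowledge base.
CORPUS = {
    "doc1": "Employees may take up to 12 weeks of parental leave.",
    "doc2": "Remote work requires manager approval.",
    "doc3": "The cafeteria is open from 8am to 3pm.",
}

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank documents by word overlap with the query (stand-in for a real retriever)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def augment(query, doc_ids, corpus):
    """Combine the query and the retrieved items into a single prompt."""
    context = "\n".join(f"[{i}] {corpus[d]}" for i, d in enumerate(doc_ids, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Hypothetical stub; a real system would call an LLM here."""
    return f"(LLM response conditioned on {len(prompt)} prompt characters)"

def rag_answer(query, corpus=CORPUS, k=2):
    doc_ids = retrieve(query, corpus, k)
    prompt = augment(query, doc_ids, corpus)
    return {"answer": generate(prompt), "sources": doc_ids, "prompt": prompt}
```

Returning the source ids alongside the answer is what makes citation possible downstream: the response can point back to exactly the items that were placed in the prompt.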

Why RAG matters

Three structural reasons.

1. Knowledge freshness. LLM training is expensive and infrequent. Knowledge encoded in model parameters is frozen at training cutoff. RAG enables the model to access current information without retraining - by retrieving from continuously-updated knowledge sources.

2. Factual grounding. LLMs hallucinate. Generation grounded in specific retrieved documents can be cited, verified, and corrected. The retrieval layer provides a grounding mechanism the parametric LLM alone lacks.

3. Domain specialization. General-purpose LLMs can’t be experts in everything. RAG enables specialized deployment: connect the LLM to a domain-specific corpus (medical literature, legal cases, internal documentation); the system answers domain-specific questions with domain-specific knowledge.

The combined argument. RAG addresses three persistent LLM limitations (staleness; hallucination; domain limitation) by pairing the model with an external knowledge source that can be updated, cited, and swapped per domain. By 2026, RAG is standard production technology for many LLM applications.

The 2023-2026 RAG inflection

A specific industry transition. Pre-2023, RAG was a research direction (Lewis et al. 2020 paper; some specialized deployments). Post-2023 - particularly with ChatGPT’s deployment and the broader LLM-application boom - RAG became the dominant pattern for production LLM applications.

The growth metrics.

Vector database market. Pinecone, Weaviate, Chroma, Qdrant, Milvus, and others grew from research-stage in 2022 to substantial commercial reality by 2024-2026.
RAG-as-service providers. Multiple companies offer managed RAG infrastructure (LangChain RAG products, LlamaIndex, dedicated RAG platforms).
Enterprise RAG deployment. Most large enterprises had RAG systems in production by 2024-2025.

The 2026 state. RAG is foundational infrastructure for production LLM applications. Multimodal RAG, agentic RAG, and specialized-domain RAG are emerging frontiers.

Boundaries with adjacent chapters

Large Language Models §9 briefly covers RAG; this chapter develops it in depth.
AI Agents §3-§6 covers retrieval as tool use; this chapter covers the underlying retrieval methodology.
Multimodal Models §10 covers cross-modal retrieval; this chapter covers RAG-specific multimodal aspects.
Causality §11 covers RAG in causal-reasoning contexts.
Self-Supervised Learning §5 covers contrastive learning, which underlies modern dense embeddings.
Foundation Models provides the FM-as-substrate framing.
Evaluation §10 covers cross-cutting evaluation; this chapter covers RAG-specific evaluation.

What this chapter does not try to do

We do not provide a complete information-retrieval textbook treatment. Manning, Raghavan, and Schütze (Introduction to Information Retrieval) provide the classical IR foundation.
We do not extensively cover vector-database engineering. The substrate is mature; multiple commercial and open-source options exist.
We do not develop search-engine ranking algorithms in depth (PageRank, learning-to-rank). These are foundational but well-covered elsewhere.

Position taken in this chapter

The chapter takes RAG seriously as substantial production infrastructure. The methodology is mature; the deployment is broad; the open problems (multi-hop, hallucination prevention, multimodal quality, evaluation) are substantive. The chapter develops both the established methodology and the active frontiers.

§2. Historical Context

This section traces RAG from classical information retrieval through the modern LLM-integrated paradigm.

A timeline of the inflection points:

   1970s-2000s   Classical information retrieval. TF-IDF
                  (1970s); Okapi BM25 (1994); inverted index;
                  PageRank (Brin and Page 1998); learning-to-
                  rank methods. Foundational for web search.
                                  │
                                  ▼
   2018-2020    Neural information retrieval emerges. BERT
                  (Devlin et al. 2018) enables learned query-
                  document matching. ColBERT (Khattab and
                  Zaharia 2020), DPR (Karpukhin et al. 2020)
                  establish dense-retrieval paradigm.
                                  │
                                  ▼
   2020         RAG paper (Lewis et al., Facebook AI 2020):
                  "Retrieval-Augmented Generation for
                  Knowledge-Intensive NLP Tasks." Formalizes
                  the RAG paradigm; demonstrates substantial
                  benefits.
                                  │
                                  ▼
   2020-2022    Dense-retrieval refinements: Contriever
                  (Izacard et al. 2021); E5 (Wang et al.
                  2022); modern embedding-model families.
                  Vector databases mature (FAISS, Pinecone
                  early commercial).
                                  │
                                  ▼
   2022 Nov     ChatGPT released; LLM-application boom
                  begins. RAG becomes the dominant pattern
                  for production LLM applications.
                                  │
                                  ▼
   2023         LangChain, LlamaIndex frameworks proliferate.
                  Make RAG easy to implement. Vector
                  databases see substantial commercial growth.
                                  │
                                  ▼
   2023         OpenAI ChatGPT browsing/plugins; first
                  large-scale consumer-facing RAG.
                  Perplexity AI grows (search-grounded
                  conversational AI).
                                  │
                                  ▼
   2024         Enterprise RAG matures. Glean, Notion AI,
                  Microsoft Copilot all deploy RAG at scale.
                  Specialized RAG (Harvey for legal, multiple
                  medical RAG systems).
                                  │
                                  ▼
   2024-2025    Agentic RAG. Retrieval as *tool* invoked
                  dynamically by LLM agents. Multi-hop
                  retrieval; iterative refinement.
                                  │
                                  ▼
   2024-2025    Multimodal RAG: image + text retrieval;
                  document RAG with mixed content. ColPali
                  and similar for document understanding.
                                  │
                                  ▼
   2025-2026    Long-context-vs-RAG debate. Frontier models
                  with 1M+ token contexts enable alternative
                  to RAG for some applications. The two
                  approaches coexist; production picks based
                  on application requirements.

We develop each phase below.

Classical information retrieval

The pre-deep-learning foundation. Information retrieval has substantial history - from library card catalogs through computational text search.

TF-IDF (Term Frequency-Inverse Document Frequency). A document is represented as a vector of term weights; query-document similarity is cosine similarity. The 1970s-1980s methodology that dominated text search for decades.

Okapi BM25 (Robertson, 1994 and subsequent work). Refinement of TF-IDF with probabilistic motivation; substantially better empirically. Standard sparse-retrieval baseline.

The inverted index. Foundational data structure. For each term, store list of documents containing it. Enables efficient search. The substrate for production search engines.

PageRank (Brin and Page, 1998). Web-link-based document quality scoring. Foundational for Google’s early search engine.

Learning-to-rank methods. Use machine learning to combine multiple retrieval signals. Substantial commercial deployment in search engines through 2000s-2010s.

The 2026 state. Classical IR methods (especially BM25) remain substantial production technology - sparse retrieval is part of most hybrid RAG systems.

Neural information retrieval

The 2018-2020 transformation. Transformer-based models enabled learned retrieval substantially better than BM25 on many tasks.

BERT for retrieval (multiple groups, 2018-2019). Use BERT to compute query-document similarity. Substantially better than BM25 on benchmarks; much slower (must compute BERT score for every query-document pair).

ColBERT (Khattab and Zaharia, 2020). Late-interaction model: encode query and document tokens separately; compute fine-grained similarity. Substantial improvement over basic BERT-retrieval.

DPR (Dense Passage Retrieval) (Karpukhin et al., Facebook AI, 2020). Two-tower dense retrieval. Encode queries and passages into separate dense embeddings; retrieve via nearest-neighbor search. Substantially faster than ColBERT; competitive accuracy.

Contriever (Izacard et al., 2021). Unsupervised contrastive learning for retrieval. Reduces labelled-data requirements.

The pattern. Dense retrieval became the new standard. Substantial subsequent work scaled and refined.

The 2020 RAG paper

The formalization. Lewis, Perez, Piktus et al. (Facebook AI Research, 2020) “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Formalized the RAG paradigm.

The recipe.

Retriever. Dense passage retrieval. Given query, return top-K relevant passages.
Generator. Sequence-to-sequence model (BART). Generate response conditional on query and retrieved passages.
Training. End-to-end fine-tuning on knowledge-intensive QA tasks. The query encoder and generator are jointly fine-tuned; the document encoder and its index are kept fixed.

The result. RAG substantially outperformed both retrieval-only and generation-only baselines on knowledge-intensive QA benchmarks. Established that retrieval + generation is substantially better than either alone for many tasks.

The trajectory. The Lewis et al. paper was foundational; the RAG paradigm it established became the dominant pattern for knowledge-intensive LLM applications.

The ChatGPT-era RAG explosion

The 2022-2023 inflection. After ChatGPT (November 2022), LLM applications proliferated rapidly. Many applications required current information or domain-specific knowledge - both addressed by RAG.

The framework explosion. LangChain (October 2022, growing rapidly through 2023) and LlamaIndex (originally GPT Index, late 2022) made RAG easy to implement. Modular abstractions for retrieval, generation, pipelines.

The vector-database commercial growth. Pinecone (founded 2019, substantial growth 2023), Weaviate (founded 2019), Chroma (founded 2022), Qdrant (founded 2021), Milvus (open-source 2019, commercial Zilliz). The vector-DB market grew substantially in 2023-2024.

The early production deployment. ChatGPT browsing/plugins (early 2023), Perplexity AI (founded 2022, substantial growth 2023), enterprise RAG products (Glean, Notion AI, Microsoft Copilot). RAG became standard production technology.

2024-2025: agentic and multimodal RAG

Two specific advances.

Agentic RAG. Treat retrieval as a tool the LLM can invoke dynamically. Rather than always retrieving up front, the LLM decides when retrieval is needed and what to query. Cross-reference AI Agents §4. The pattern enables more sophisticated reasoning; reduces unnecessary retrieval; supports multi-hop reasoning.

Multimodal RAG. Retrieval over multimodal content (images, documents with images, video). ColPali (Faysse et al., 2024) for document understanding. Cross-reference Multimodal Models §10.

The long-context-vs-RAG debate

A specific 2024-2026 tension. Long-context models (Gemini 1.5 Pro 1M+ tokens; Claude with 200K+ context; GPT-4 with 128K+) enable putting entire documents in context without retrieval. Does long context replace RAG?

The trade-offs.

Long context advantages. No retrieval system to maintain; no chunking artifacts; potentially better handling of cross-document reasoning.
Long context limitations. Substantially more expensive per query; latency higher; performance degrades with very long contexts (lost-in-the-middle); doesn’t handle corpora larger than context.
RAG advantages. Scalable to arbitrarily large corpora; cheaper per query; freshness via index updates; precise grounding via citations.
RAG limitations. Retrieval quality bounds end-to-end quality; chunking artifacts; multi-hop reasoning is harder.

The 2026 picture. Both coexist. Long context for some applications; RAG for others; hybrid (retrieval-augmented long-context generation) for substantial cases. The two are complementary, not strictly substitutable.

Where this leaves us in 2026

RAG is foundational production infrastructure for LLM applications. The methodology is mature; deployment is broad; ongoing advances in multi-hop, multimodal, agentic, and evaluation areas continue.

§3. The RAG Framework

The basic pipeline

The fundamental architecture. RAG combines retrieval and generation in a specific pattern:

   RAG PIPELINE (detailed)

   1. INDEXING (offline, before queries arrive):
      Documents → chunks → embeddings → vector index.

   2. QUERY TIME:
      User query
          │
          ▼
      Query encoding (embed the query)
          │
          ▼
      Retrieval (find top-K relevant chunks via similarity)
          │
          ▼
      Re-ranking (optional: refine top-K with cross-encoder)
          │
          ▼
      Context construction (combine query + retrieved chunks)
          │
          ▼
      Generation (LLM produces response)
          │
          ▼
      Response with citations to source chunks

The components.

Document corpus. The knowledge base (documents, web pages, structured data).
Indexer. Builds searchable index from corpus.
Retriever. Finds relevant content for a query.
Re-ranker. (Optional) Refines retrieval results.
Generator. LLM producing response from retrieved context.
Output formatter. Adds citations, formats response.

Each component has design choices; combinations produce many distinct RAG architectures.

Why RAG vs alternatives

The comparative case for RAG.

vs larger models with more parametric knowledge. Larger models know more in principle but knowledge is static at training cutoff; expensive to update. RAG enables continuous knowledge updates without retraining.

vs longer context windows. Long-context models can fit substantial content directly. But: corpora exceeding context size still require retrieval; long-context per-query cost is substantial; lost-in-the-middle degrades long-context quality.

vs fine-tuning. Fine-tuning encodes specific knowledge in weights; requires retraining for updates; harder to attribute outputs to sources.

vs prompt engineering. Prompt engineering with manually-included context is brittle and doesn’t scale beyond small corpora.

The aggregate. RAG is the right choice when the corpus is too large for context, requires frequent updates, or needs citation-based grounding. Other approaches are appropriate in other contexts; they often combine with RAG.

The reliability, freshness, cost arguments

Three specific deployment-relevant arguments for RAG.

Reliability. Grounded generation reduces hallucination. The prompt directs the LLM to answer from retrieved content; citations enable verification.

Freshness. Update the index without retraining the model. New documents become retrievable as soon as they are indexed.

Cost. Retrieving a few thousand tokens of context is cheaper than passing a 1M-token long-context query through the full LLM.

Trade-offs

The costs of RAG.

Pipeline complexity. RAG systems have many components; engineering is non-trivial.
Retrieval quality bottleneck. Bad retrieval → bad responses.
Latency. Retrieval adds latency; production systems must optimize.
Index maintenance. Indexes need updating; embeddings may need recomputation as embedding models change.

The 2026 production picture. RAG is substantial production infrastructure despite these costs. The benefits substantially outweigh the costs for many applications.

§4. Sparse Retrieval

BM25 and TF-IDF

The classical algorithms. Both produce sparse vector representations (high-dimensional vectors with most components zero - non-zero only for terms present).

TF-IDF. Term Frequency × Inverse Document Frequency. Weight each term by how often it appears in the document and how rare it is across the corpus. Query-document similarity is dot product or cosine.

BM25. Refinement with non-linear term-frequency scaling and document-length normalization. Substantially better empirically than TF-IDF; remains the standard sparse-retrieval baseline.

The BM25 formula (simplified):

$$
\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\bar{d}}\right)}
$$

where $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, $\bar{d}$ is the average document length in the corpus, $\text{IDF}(t)$ is the inverse document frequency of $t$, and $k_1$ and $b$ are tuning parameters (typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$).
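
The formula translates directly into code. The sketch below is illustrative rather than a reference implementation: documents are assumed to be pre-tokenized lists of terms, and the IDF uses one common smoothed variant, which is an assumption because the simplified formula leaves IDF unspecified.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF variant (an assumption, see lead-in).
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score is a sum over query terms, each term's contribution can be inspected separately, which is the interpretability property that sparse retrieval offers.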

The inverted index

The data structure. For each term, store list of (document-ID, term-frequency) pairs. Enables efficient retrieval - given a query, look up terms; intersect or union document lists; rank by score.

The implementation. Production inverted indexes handle millions to billions of documents; substantial engineering for storage, compression, query processing.
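
A minimal in-memory version of the data structure, for illustration only. It assumes pre-tokenized documents and omits everything that makes production indexes hard (compression, on-disk layout, skip lists); the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs.
    `docs` maps doc_id -> list of tokens."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term].append((doc_id, freq))
    return index

def candidates(index, query_terms, mode="or"):
    """Union ("or") or intersection ("and") of the postings for the query terms."""
    postings = [{doc_id for doc_id, _ in index.get(t, [])} for t in query_terms]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)
    return set.union(*postings)
```

Only documents that appear in at least one postings list are ever scored, which is why lookup cost depends on the query terms rather than on corpus size.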

Production sparse-retrieval systems

The dominant implementations.

Elasticsearch / OpenSearch. Open-source distributed search engines based on Apache Lucene. Dominant in industrial deployment. Production-grade; substantial ecosystem.

Apache Solr. Older alternative; substantial production use.

Vespa (Yahoo, open-source). Designed for large-scale serving; deployed commercially (used by Spotify, Yahoo, others).

ZincSearch, Meilisearch. Lighter-weight alternatives for smaller deployments.

When sparse retrieval works best

Specific situations.

Keyword-rich queries. Specific term matches matter; e.g., “Python list comprehension” - exact terms are signal.
Corpora with substantial term overlap with queries. Production documentation; technical content.
When interpretability matters. BM25 scores can be decomposed into per-term contributions.
When low-latency at scale matters. Sparse retrieval is very fast; production sparse retrieval handles thousands of queries per second per machine.

The 2026 state. Sparse retrieval (especially BM25) is substantial production technology. Modern RAG systems typically include sparse retrieval in hybrid (with dense; §6) pipelines.

§5. Dense Retrieval

Dense embeddings via neural networks

The modern approach. Encode each document and query into a dense vector (typically 384-1536 dimensions). Compute similarity via dot product or cosine.

The training. Typically contrastive learning (cross-reference SSL §5). The encoder learns to map a query and its relevant documents to nearby vectors, and to push unrelated pairs apart.
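
One common form of the contrastive objective, often called InfoNCE, can be written as follows. The notation is introduced here for illustration: $s(q, d)$ is a similarity score (for example, the dot product of the two embeddings), $d^{+}$ is a document relevant to query $q$, $d^{-}_{1}, \dots, d^{-}_{n}$ are non-relevant documents, and $\tau$ is a temperature. Specific models differ in details such as how negatives are chosen and whether a temperature is used.

$$
\mathcal{L}(q, d^{+}, \{d^{-}_{j}\}) = -\log \frac{\exp\left(s(q, d^{+}) / \tau\right)}{\exp\left(s(q, d^{+}) / \tau\right) + \sum_{j=1}^{n} \exp\left(s(q, d^{-}_{j}) / \tau\right)}
$$

Minimizing this loss raises the similarity of the relevant pair relative to the negatives.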

The encoders.

Bi-encoder. Query and document encoded separately by the same (or similar) encoder. Allows pre-computing document embeddings (offline indexing). Standard for production retrieval.

Cross-encoder. Query and document concatenated, encoded together. Much higher quality but cannot pre-compute; computationally expensive. Typically used for re-ranking (§6) rather than initial retrieval.
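
The bi-encoder's key operational property, pre-computing document embeddings offline so that query time needs only one encoding plus similarity computations, can be sketched as follows. `toy_embed` is a hypothetical stand-in for a neural encoder (a hashed bag-of-words vector); it carries no semantic knowledge and is used only to keep the example self-contained.

```python
import math
import re

DIM = 16  # toy dimensionality; real encoders use hundreds to thousands of dimensions

def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for a neural encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[sum(map(ord, token)) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(passages):
    """Offline step: embed every passage once and store the vectors."""
    return [(p, toy_embed(p)) for p in passages]

def search(index, query, k=3):
    """Query-time step: embed only the query, then rank by dot product
    (equal to cosine similarity because all vectors are unit-normalized)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), p) for p, v in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A cross-encoder cannot be split this way: its score depends on the query and document jointly, so nothing about a document can be computed before the query arrives.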

Modern embedding models

The 2020-2026 progression.

Sentence-BERT (SBERT) (Reimers and Gurevych, 2019). BERT-based bi-encoder for sentence/passage embeddings.

DPR (Dense Passage Retrieval) (Karpukhin et al., 2020). BERT-based passage retrieval for open-domain QA.

ColBERT (Khattab and Zaharia, 2020). Late-interaction model; per-token embeddings. Better quality than DPR; higher index size.

Contriever (Izacard et al., 2021). Unsupervised contrastive learning.

E5 (Wang et al., 2022). Modern strong embedding model; widely deployed.

BGE family (Beijing Academy of Artificial Intelligence, 2023+). Open-source strong embedding models; widely used.

Cohere Embed, OpenAI text-embedding-3, Voyage AI, Mistral Embed. Commercial high-quality embedding APIs.

Snowflake Arctic Embed, Nomic Embed. Open-source 2024 strong embedders.

Jina Embeddings, GTE. Multiple competing open-source embedders.

The 2026 state. Embedding model quality has substantially advanced. Modern open-source models compete with closed-source. Embedding dimensions range from 384 (small models) to 4096+ (frontier).

Vector databases

The infrastructure substrate. Vector databases store embeddings and support efficient nearest-neighbor search.

FAISS (Facebook AI Similarity Search, 2017+). Open-source library for similarity search. Substantial deployment; foundational.

Pinecone (founded 2019). Managed vector database service. Substantial commercial growth 2023-2026.

Weaviate (founded 2019). Open-source vector database with substantial features. Commercial support.

Chroma (founded 2022). Open-source vector database designed for ease-of-use. Substantial uptake.

Qdrant (founded 2021). Open-source vector database; substantial commercial activity.

Milvus (open-source, 2019). Commercial via Zilliz. Substantial scale capability.

pgvector. PostgreSQL extension adding vector indexing. Substantial uptake in 2023-2026 - many teams prefer adding vectors to existing Postgres over deploying separate vector DB.

Vespa. Combines sparse and dense retrieval in unified system; substantial commercial deployment.

The trade-offs. Cloud vs self-hosted; managed vs open-source; specialized vs general-purpose; cost vs features. Different applications choose differently.

Approximate nearest neighbor algorithms

The algorithmic substrate. Exact nearest-neighbor search scales poorly; approximate algorithms enable substantial scale.

HNSW (Hierarchical Navigable Small World). Graph-based ANN. Excellent recall/speed trade-off. Standard in most vector databases.

IVF (Inverted File Index). Cluster-based ANN. Used in FAISS variants.

Product Quantization (PQ). Compress vectors for memory efficiency. Used with IVF for very large indexes.

ScaNN (Google). High-performance ANN library.

The 2026 state. HNSW dominates production. Substantial engineering matters for very-large indexes (billions of vectors).
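
To make the cluster-based idea behind IVF concrete, here is a toy sketch. It assumes the centroids are already given (in practice they are typically learned with k-means) and uses squared Euclidean distance; the function names and the `nprobe` parameter name are illustrative choices, not the API of any particular library.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)), key=lambda c: dist2(query, centroids[c]))[:nprobe]
    cand = [vid for c in cells for vid in lists[c]]
    cand.sort(key=lambda vid: dist2(query, vectors[vid]))
    return cand[:k]

# A query whose true nearest neighbor sits near a cell boundary.
vectors = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5.4)]
centroids = [(0, 0.5), (10, 10.5)]
lists = build_ivf(vectors, centroids)
approx = ivf_search((6, 6), vectors, centroids, lists, k=1, nprobe=1)
exact = ivf_search((6, 6), vectors, centroids, lists, k=1, nprobe=2)
```

In this example probing one cell returns vector 2, while probing both cells returns the true nearest neighbor, vector 4. That gap is the recall/speed trade-off: fewer probed cells means less work and a higher chance of missing neighbors that fall just across a cell boundary.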

§6. Hybrid and Re-ranking

Combining sparse and dense

The 2026 production standard. Combine sparse (BM25) and dense (embedding-based) retrieval. Sparse handles exact-keyword queries; dense handles semantic-similarity queries; combination handles both.

The patterns.

Score fusion. Run both retrievers; combine scores. Methods:

Reciprocal Rank Fusion (RRF). For each retrieval result, score = 1/(k + rank). Sum across retrievers. Simple and effective.
Weighted combinations. score = α × dense_score + (1 - α) × sparse_score. Requires score normalization.
Learned combinations. Train a model to combine scores.

Sequential. Use one retriever to filter candidates; use the other to rank.

Disjunctive union with deduplication. Take top-K from each; deduplicate; combine.
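
Reciprocal Rank Fusion is short enough to show in full. The sketch assumes each retriever returns a best-first list of document ids; the default `k = 60` is a commonly used value and an assumption here, not something the text specifies.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list is ordered best-first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Deduplication is implicit: each doc id ends up with one fused score.
    # Ties are broken by doc id so the output is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))

sparse_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused_order = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

RRF uses only ranks, so it sidesteps the score-normalization problem that weighted combinations face when sparse and dense scores live on different scales.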

Cross-encoder re-ranking

A specific re-ranking pattern. Initial retrieval (bi-encoder or sparse) returns top-K candidates; a cross-encoder re-ranks these to produce final top-N (N ≤ K, often N = 5-10).

The cross-encoder. Encode query and document together (concatenated through transformer). Compute relevance score. Substantially higher quality than bi-encoder scoring but much slower; only feasible on small candidate sets (typically K = 100, N = 5-10).

The standard production pattern:

   HYBRID RAG WITH RE-RANKING (production standard)

   Query
       │
       ▼
   ┌─────────────┐    ┌─────────────┐
   │ Sparse      │    │ Dense       │
   │ (BM25)      │    │ (Embedding) │
   └─────────────┘    └─────────────┘
       │                    │
       │ top-100            │ top-100
       ▼                    ▼
   ┌──────────────────────────────┐
   │ Combine via RRF / dedupe     │
   └──────────────────────────────┘
       │
       │ candidate set (100-200 items)
       ▼
   ┌──────────────────────────────┐
   │ Cross-encoder re-ranker      │
   └──────────────────────────────┘
       │
       │ top-5-10
       ▼
   Generator (LLM)
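
The two-stage structure of the diagram (a cheap scorer over the full corpus, then an expensive scorer over a small candidate set) can be sketched as below. Both scoring functions are hypothetical stand-ins: `overlap_score` plays the role of BM25 or bi-encoder similarity, and `phrase_score` plays the role of a cross-encoder by rewarding something the cheap scorer cannot see.

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score, k=100, n=5):
    """Stage 1: cheap scorer over the whole corpus keeps the top k.
    Stage 2: expensive scorer re-orders only those k and keeps the top n."""
    if n > k:
        raise ValueError("n must not exceed k")
    stage1 = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    stage2 = sorted(stage1, key=lambda doc: expensive_score(query, doc), reverse=True)[:n]
    return stage2

def overlap_score(query, doc):
    """Cheap stand-in for BM25 or bi-encoder similarity: shared-word count."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: rewards the exact phrase,
    which a bag-of-words scorer cannot distinguish."""
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "leave policy for parental cases",
    "parental leave policy details",
    "cafeteria menu",
]
best = retrieve_then_rerank("parental leave policy", corpus,
                            overlap_score, phrase_score, k=2, n=1)
```

The expensive scorer runs k times per query rather than once per corpus document, which is what makes cross-encoder quality affordable.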

Notable re-rankers. ColBERT-based (late-interaction) re-rankers; Cohere Rerank; BGE Reranker; Jina Reranker. Substantial competition.

Production hybrid systems

The 2026 deployment. Most serious RAG deployments use hybrid retrieval + re-ranking. The infrastructure is mature; the methodology is well-established.

§7. Chunking and Indexing

Document chunking strategies

A specific design decision. Long documents must be split into chunks for retrieval. Chunk granularity substantially affects retrieval quality.

Fixed-size chunking. Split documents into N-token chunks (e.g., 512 tokens, 1024 tokens). Simple; doesn’t respect content boundaries.

Semantic chunking. Split at semantically-meaningful boundaries (paragraph breaks; section headers; sentence boundaries). Typically better retrieval quality than fixed-size splitting, because chunks stay topically coherent.

Hierarchical chunking. Multiple chunk granularities (page-level, section-level, paragraph-level, sentence-level). Retrieve at different granularities for different needs.

Sliding-window chunking. Overlapping chunks to capture cross-boundary content.

Document-structure-aware chunking. Use document structure (headings, sections, tables) to inform chunk boundaries.

The trade-offs.

Smaller chunks. More precise retrieval; potentially lose context.
Larger chunks. More context per chunk; potentially less precise retrieval.

The 2026 best practice. Semantic + hierarchical chunking. 200-800 token chunks typical for most applications; tune chunk size against application requirements.
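
Two of the strategies above, sketched under simplifying assumptions: the sliding-window function operates on an already-tokenized list, and the paragraph-based function counts whitespace-separated words rather than model tokens and treats blank lines as paragraph boundaries.

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

def paragraph_chunks(text, max_words=200):
    """Pack whole paragraphs (split on blank lines) into chunks of at most
    `max_words` words; a single oversized paragraph becomes its own chunk."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The first function never looks at content, so a chunk boundary can fall mid-sentence; the second never splits a paragraph, at the cost of uneven chunk sizes.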

Indexing pipelines

The production process. Convert documents into indexable representations.

   INDEXING PIPELINE (typical)

   Documents (PDFs, HTML, Word, etc.)
       │
       ▼
   Parsing (extract text, structure, metadata)
       │
       ▼
   Cleaning (deduplication, normalization, filtering)
       │
       ▼
   Chunking (split into chunks per strategy)
       │
       ▼
   Embedding (compute embeddings for each chunk)
       │
       ▼
   Indexing (insert into vector DB + sparse index)
       │
       ▼
   Metadata indexing (separate index for filterable metadata)

The engineering matters. Parsing different document formats; handling tables, images, code blocks; preserving structure; updating incrementally.

Updating indexes

A specific operational concern. Indexes need updating as the corpus changes.

Full reindexing. Rebuild entire index. Expensive; reliable.

Incremental updates. Add/update/delete documents one at a time. Faster; engineering-heavy.

Embedding-model upgrades. When the embedding model changes, all embeddings must be recomputed, because vectors from different models are not comparable. Substantial operational cost.

The 2026 production patterns. Periodic full reindexing combined with incremental updates. Embedding-model upgrades are major events; production systems plan carefully.
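
One way to plan an incremental update is to store a content hash per document at indexing time and compare it against the current corpus. This bookkeeping scheme is an assumption for illustration, not something the text prescribes, and the function names are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(indexed_hashes, current_docs):
    """Compare stored hashes with the current corpus.
    indexed_hashes: {doc_id: hash recorded at last indexing}
    current_docs:   {doc_id: current text}
    Returns (ids to embed or re-embed, ids to delete from the index)."""
    to_upsert = [doc_id for doc_id, text in current_docs.items()
                 if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_upsert), sorted(to_delete)
```

Unchanged documents are skipped entirely, so embedding cost scales with the size of the change rather than the size of the corpus. An embedding-model change invalidates this shortcut: every document must be re-embedded regardless of its hash.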

§8. The Retrieval-Generation Interface

How retrieved content enters the generator

The basic pattern. After retrieval, format retrieved chunks into a prompt for the LLM.

A typical prompt template:

   PROMPT TEMPLATE (illustrative)

   System: You are a helpful assistant. Answer the question
           based on the following context. Cite sources with
           [n] for chunk n.

   Context:
   [1] [chunk 1 content]
   [2] [chunk 2 content]
   ...
   [N] [chunk N content]

   Question: [user query]

   Answer:

The variations.

Number of chunks. Typically 3-10 chunks; depends on context window and chunk size.
Chunk formatting. Source attribution; metadata; structure preservation.
System instructions. Cite sources; refuse if context insufficient; format constraints.
Order. Recency, relevance, or randomized.
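
A small builder for the template above, as a sketch. The chunk schema (dicts with `text` and `source` keys) and the added refusal instruction are illustrative assumptions.

```python
def build_prompt(question, chunks, max_chunks=5):
    """Format retrieved chunks as numbered context followed by the question.
    `chunks` is a list of dicts with "text" and "source" keys (hypothetical schema),
    assumed to be ordered as they should appear in the prompt."""
    lines = [
        "System: You are a helpful assistant. Answer the question based on the",
        "following context. Cite sources with [n] for chunk n. If the context is",
        "insufficient, say so.",
        "",
        "Context:",
    ]
    for n, chunk in enumerate(chunks[:max_chunks], start=1):
        lines.append(f"[{n}] ({chunk['source']}) {chunk['text']}")
    lines += ["", f"Question: {question}", "", "Answer:"]
    return "\n".join(lines)
```

The numbering assigned here is what the generator's `[n]` citations refer to, so the caller should keep the same chunk order when mapping citations back to sources.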

Citation and grounding

A specific requirement for many RAG applications. The generator should cite which retrieved chunks support each part of its response.

The patterns.

Inline citations. Generator produces citations inline: “The policy allows 12 weeks of leave [1, 3].”

Footnote citations. All citations at the end.

Per-claim citations. Each factual claim cited.

Source highlighting. UI highlights which retrieved chunks contributed.

The challenges. Generators sometimes cite chunks that don’t actually support the claim; verification is needed.
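A first, purely structural check can be automated cheaply. The sketch below, under the assumption that citations use the [n] or [n, m] format from the prompt template, finds sentences with no citation and citations that point at chunks that were never supplied. It does not check whether a cited chunk actually supports the claim; that still needs an entailment model, an LLM judge, or a human. The function names and the sample answer are illustrative.

```python
import re

# Matches [3] or [1, 3]; captures the comma-separated numbers.
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def split_sentences(answer: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def cited_numbers(sentence: str) -> list[int]:
    """All chunk numbers cited anywhere in the sentence."""
    numbers = []
    for group in CITATION.findall(sentence):
        numbers.extend(int(n) for n in group.split(","))
    return numbers


def audit_citations(answer: str, num_chunks: int) -> dict:
    """Report uncited sentences and citations outside 1..num_chunks."""
    report = {"uncited": [], "out_of_range": []}
    for sentence in split_sentences(answer):
        numbers = cited_numbers(sentence)
        if not numbers:
            report["uncited"].append(sentence)
        bad = [n for n in numbers if not 1 <= n <= num_chunks]
        if bad:
            report["out_of_range"].append((sentence, bad))
    return report


answer = ("The policy allows 12 weeks of leave [1, 3]. "
          "Managers must approve requests. "
          "Remote work is permitted [7].")
report = audit_citations(answer, num_chunks=3)
```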

Handling retrieval failures

A specific challenge. What if retrieval returns no relevant content?

The patterns.

Refuse to answer. “I don’t have information about that.”

Acknowledge uncertainty. “Based on limited information, I think... but I’m not sure.”

Fall back to parametric knowledge. Generate from the LLM’s training data alone; flag this clearly.

Re-query. Reformulate the query and try again.

Escalate to human. For high-stakes applications, route to human review.

The 2026 best practice. Combine multiple patterns: retrieve; check relevance; if low, refuse or escalate; if borderline, acknowledge uncertainty.
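The combined pattern can be sketched as a small routing function over retrieval scores. The thresholds (0.3 and 0.6), the high_stakes flag, and the assumption that scores are comparable across queries are all hypothetical; in practice thresholds are calibrated per retriever and per application.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"
    REFUSE = "refuse"
    ESCALATE = "escalate"


def route(scores: list[float], high_stakes: bool = False,
          low: float = 0.3, high: float = 0.6) -> Action:
    """Decide what to do based on the best retrieval score (no results counts as 0.0)."""
    best = max(scores, default=0.0)
    if best < low:
        # Nothing relevant was retrieved: refuse, or hand off when stakes are high.
        return Action.ESCALATE if high_stakes else Action.REFUSE
    if best < high:
        # Borderline relevance: answer, but acknowledge uncertainty.
        return Action.ANSWER_WITH_CAVEAT
    return Action.ANSWER
```

Re-querying fits in front of this function: reformulate and retrieve again once or twice, then route on the best scores seen.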

§9. Advanced RAG Patterns

Multi-hop retrieval

A specific capability. Some queries require multiple retrieval steps - find one piece of information; use it to retrieve another.

Example. “Who succeeded the CEO of the company that acquired Pixar?” Requires: (1) find which company acquired Pixar; (2) find that company’s CEO at the time; (3) find that CEO’s successor.

The patterns.

Iterative retrieval. Repeatedly retrieve based on accumulating information.

Decomposition + retrieval. LLM decomposes question into sub-questions; retrieve for each; combine.

Graph-based. Build knowledge graph from retrieved content; reason over graph.

The 2026 state. Multi-hop RAG is active research with substantial open problems (OP-RAG-2). Production deployment is limited; quality varies substantially.
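A toy sketch of decomposition plus iterative retrieval, to make the control flow concrete. Everything here is a stand-in: the three-passage corpus and its entities are fictional, the retriever is plain word overlap, and toy_extract replaces the LLM call that would read a passage and produce an answer. The sub-questions are given by hand; in a real system an LLM would generate them.

```python
from typing import Callable

# Fictional corpus: passage -> the answer span an LLM reader would extract from it.
CORPUS = {
    "Widget Labs was acquired by Acme Corp.": "Acme Corp",
    "The CEO of Acme Corp is Dana Lee.": "Dana Lee",
    "Dana Lee was succeeded by Sam Ortiz.": "Sam Ortiz",
}


def toy_retrieve(question: str) -> str:
    """Return the passage sharing the most words with the question."""
    q = set(question.lower().replace("?", "").split())
    return max(CORPUS, key=lambda p: len(q & set(p.lower().replace(".", "").split())))


def toy_extract(question: str, passage: str) -> str:
    """Stand-in for an LLM that reads the passage and answers the question."""
    return CORPUS[passage]


def multi_hop(sub_questions: list[str],
              retrieve: Callable[[str], str],
              extract: Callable[[str, str], str]) -> list[tuple[str, str]]:
    """Answer sub-questions in order; '{prev}' is filled with the previous hop's answer."""
    trace, prev = [], ""
    for template in sub_questions:
        question = template.format(prev=prev)
        prev = extract(question, retrieve(question))
        trace.append((question, prev))
    return trace


trace = multi_hop(
    ["Who acquired Widget Labs?", "Who is the CEO of {prev}?", "Who succeeded {prev}?"],
    toy_retrieve,
    toy_extract,
)
```

The trace makes the main failure mode visible: an error at any hop is carried into every later query, which is one reason multi-hop quality varies so much.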

Conversational RAG

A specific application pattern. Multi-turn conversations where each turn may require retrieval.

The challenges.

Query reformulation. A user’s follow-up may rely on conversation context (for example, “What about part-time staff?”); the system must rewrite it as a standalone query before retrieval.
Session memory. Track what’s been retrieved in the conversation.
Multi-document handling. Context accumulates across turns.

Production examples. ChatGPT with retrieval; Claude with search; Perplexity conversational mode.

Agentic RAG

A specific pattern that has emerged in 2024-2026. Treat retrieval as a tool the LLM agent can invoke dynamically (cross-reference AI Agents §4).

The pattern.

LLM decides when retrieval is needed.
LLM constructs the retrieval query.
LLM may issue multiple retrieval queries.
LLM integrates retrieved information into reasoning.

The advantages. Avoids unnecessary retrieval; supports multi-hop reasoning; adapts retrieval to evolving understanding.

The challenges. LLM may issue poor retrieval queries; latency and cost can grow; debugging is harder.

Self-RAG and reflective retrieval

A specific 2023-2024 advance. Self-RAG (Asai et al., 2023) trains the LLM to evaluate its retrievals and outputs.

The pattern. The model generates reflection tokens indicating:

Whether retrieval was needed.
Whether retrieved content is relevant.
Whether generated content is grounded in retrieval.

The reflection enables iterative refinement.

Other advanced patterns

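HyDE (Hypothetical Document Embeddings). Have the LLM generate a hypothetical answer; embed the hypothetical answer; use that embedding for retrieval. Sometimes substantially better than embedding the query directly, because the hypothetical answer resembles the target documents more closely than a short query does.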

Query expansion. Expand the query with related terms before retrieval.

Multi-query retrieval. Generate multiple query variations; retrieve for each; combine.

Recursive retrieval. Retrieve at multiple granularities (chunk → section → document).
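For multi-query retrieval, the "combine" step needs a rule for merging several ranked lists. One common choice, used here as an assumption rather than something this chapter prescribes, is reciprocal rank fusion (RRF): each document scores the sum of 1/(k + rank) over the lists it appears in. The constant k = 60 is the conventional default; the canned rankings and query strings are made up.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; ties are broken by id for determinism."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_retrieve(queries: list[str],
                         retrieve: Callable[[str], list[str]],
                         top_n: int = 5) -> list[str]:
    """Retrieve once per query variation, then fuse the rankings."""
    return reciprocal_rank_fusion([retrieve(q) for q in queries])[:top_n]


# Toy retriever: canned rankings per query variation (illustrative only).
CANNED = {
    "parental leave length": ["d1", "d2", "d3"],
    "how many weeks of parental leave": ["d2", "d1", "d4"],
    "leave policy duration": ["d2", "d5", "d1"],
}
fused = multi_query_retrieve(list(CANNED), CANNED.__getitem__)
```

Because RRF uses only ranks, it can merge lists from retrievers whose raw scores are not comparable, which also makes it a natural fit for hybrid sparse-plus-dense retrieval.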

§10. Multimodal RAG

The extension to non-text modalities. Cross-reference Multimodal Models §10 for the broader cross-modal-retrieval treatment.

Image+text RAG

The dominant multimodal RAG pattern.

The indexing. Use CLIP-style models (cross-reference Multimodal Models §4) to embed images and text into a shared space. Index both.

The retrieval. Text queries can retrieve images; image queries can retrieve text or other images.

The generation. Multimodal LLMs (GPT-4o, Claude 3.5+, Gemini) handle images in generation; produce responses that reference retrieved images.

Document RAG with mixed content

A specific application. Documents (PDFs, HTML) often contain images, tables, diagrams, code blocks. Retrieving and using these correctly is non-trivial.

ColPali (Faysse et al., 2024). Document-page-as-image retrieval. Treat each PDF page as an image; embed it with a vision-language model; retrieve at page level. Reported to be substantially better than text-extraction-then-embedding for image-heavy documents.

Modern document RAG. Combines text extraction with image-based retrieval; preserves document structure; handles tables and figures.

The 2026 state. Document RAG is substantial production technology. Specialized infrastructure for legal documents, scientific papers, technical documentation.

Video and audio RAG

Less mature. Retrieving relevant video clips or audio segments for a query, then generating responses based on retrieved content.

The challenges. Video and audio retrieval is harder than text or image retrieval. Storage is large; relevance is subtler.

The 2026 state. Emerging; substantial open problems.

§11. Evaluating RAG

Retrieval-quality metrics

Standard IR metrics.

Recall@K. Fraction of all relevant documents that appear in the top-K results.

Precision@K. Fraction of the top-K results that are relevant.

MRR (Mean Reciprocal Rank). Average, over queries, of 1/rank of the first relevant result.

NDCG (Normalized Discounted Cumulative Gain). Graded relevance summed with a logarithmic rank discount, normalized by the score of the ideal ranking.

These require labeled relevance data. Standard benchmarks (MS MARCO, BEIR) provide labeled data.
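The four metrics can be written down directly. A minimal sketch, assuming ranked lists contain no duplicate ids and that NDCG uses linear gain (some implementations use 2^relevance - 1 instead); the document ids and relevance grades are made up.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked: list[str], gains_by_doc: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(gains_by_doc.values(), reverse=True), k)
    actual = dcg_at_k([gains_by_doc.get(d, 0.0) for d in ranked], k)
    return actual / ideal if ideal > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
gains = {"d1": 3.0, "d9": 2.0, "d2": 1.0}
```

On this toy ranking, one of three relevant documents appears in the top three, so Recall@3 and Precision@3 both equal 1/3, and the first relevant result sits at rank 2, giving a reciprocal rank of 0.5.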

End-to-end RAG evaluation

Measuring the full system - retrieval plus generation.

Faithfulness / grounding. Does the generated response accurately reflect retrieved content?

Answer quality. Is the generated response a good answer to the question?

Hallucination rate. Does the response include claims not supported by retrieval?

Citation accuracy. Do citations correctly attribute claims to retrieved chunks?

The methods.

LLM-as-judge (cross-reference Evaluation §7). Use an LLM to evaluate RAG outputs against rubrics.

Human evaluation. Domain experts assess responses.

Reference-based. Compare to reference answers (where available).

Notable benchmarks.

RAGAS (RAG Assessment, Es et al. 2023). Framework for systematic RAG evaluation across multiple dimensions.
TruthfulQA-style evaluations. Truthfulness benchmarks such as TruthfulQA, adapted to a retrieval setting.
HotpotQA. Multi-hop QA benchmark; tests multi-hop RAG.
NQ (Natural Questions). Open-domain QA from Google.
MS MARCO. Web-scale retrieval and QA.

Hallucination measurement

A specific concern. Even RAG systems hallucinate. Specific evaluation methods target this.

Per-claim checking. For each claim in the response, check if retrieved content supports it.
Citation verification. For each citation, check whether the cited chunk actually supports the claim.
Adversarial evaluation. Test RAG on queries with no relevant content; check if model refuses or hallucinates.
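The adversarial test can be sketched as a small harness: run the system on queries known to have no relevant content and measure how often it refuses. The refusal markers, the toy system, and the rule that any non-refusal counts as a hallucination are simplifying assumptions; string matching is a crude refusal detector, and an LLM judge or human review would be more reliable.

```python
from typing import Callable

# Phrases treated as a refusal (hypothetical list; tune for the system under test).
REFUSAL_MARKERS = (
    "i don't have information",
    "i do not have information",
    "cannot find",
)


def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a known refusal phrase?"""
    text = response.lower().replace("’", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(rag_system: Callable[[str], str], unanswerable: list[str]) -> float:
    """Fraction of unanswerable queries on which the system refuses."""
    if not unanswerable:
        raise ValueError("need at least one query")
    return sum(is_refusal(rag_system(q)) for q in unanswerable) / len(unanswerable)


def toy_system(query: str) -> str:
    """Stand-in RAG system that invents an answer for one kind of query."""
    if "moon" in query.lower():
        return "The moon office opened last year."  # unsupported, made-up claim
    return "I don't have information about that."


queries = [
    "What is the dress code on the moon office?",
    "Who is the head of the Atlantis branch?",
    "What is the 2090 holiday schedule?",
    "Which floor is the dragon stable on?",
]
rate = refusal_rate(toy_system, queries)
hallucination_rate = 1.0 - rate
```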

Production monitoring

Beyond benchmark evaluation. Production RAG systems require ongoing monitoring.

User feedback. Explicit ratings; implicit signals (refinement, abandonment).
Retrieval-quality monitoring. Track recall and precision over time.
Hallucination monitoring. Sample outputs; check for hallucinations; aggregate trends.
Cost monitoring. Track retrieval and generation costs.

§12. Production RAG Systems

Enterprise RAG

Substantial commercial activity. Enterprise RAG systems answer questions over internal company documentation, code, conversations.

Glean (founded 2019). Enterprise search and RAG. Substantial commercial growth 2023-2026.

Notion AI. RAG over Notion workspaces.

Microsoft Copilot for Microsoft 365. RAG over Office documents, emails, Teams chats.

Google Gemini for Workspace. Similar offering.

Atlassian Intelligence. RAG for Jira, Confluence.

Slack AI. RAG for Slack conversations and connected content.

The pattern. Substantial enterprise demand; substantial commercial competition. RAG is standard enterprise software by 2026.

Consumer RAG

A different deployment context.

Perplexity. RAG over web search; substantial commercial growth. Among the most prominent dedicated consumer-RAG products.

You.com. Competitor to Perplexity.

ChatGPT with browsing/search. OpenAI’s consumer RAG offering.

Claude with web search. Anthropic’s consumer RAG.

Google AI Mode and AI Overviews. RAG integrated into Google Search.

Bing Chat / Microsoft Copilot. Similar integration into Bing.

Specialized RAG

Domain-specific deployments.

Harvey AI. Legal research RAG.

Multiple medical RAG systems. Glass, OpenEvidence, others.

Scientific RAG. Elicit, Consensus, others for academic-paper search.

Sales/CRM RAG. Multiple enterprise sales tools.

Customer-support RAG. Substantial commercial deployment.

RAG-as-service infrastructure

Managed RAG services.

LangChain RAG products. Managed RAG platform.

LlamaIndex Cloud. Managed RAG and indexing.

Vector-DB managed services (Pinecone, Weaviate Cloud, others).

Cohere RAG offering. RAG-specific commercial products.

The 2026 state. RAG infrastructure is mature. Multiple service tiers from raw vector DB to fully-managed end-to-end RAG.

§13. Connections to Other Chapters

Large Language Models §9 covers RAG briefly. This chapter develops the treatment in depth.
AI Agents §3-§6 covers retrieval as memory/tool. This chapter covers the underlying retrieval methodology. Agentic RAG bridges both.
Multimodal Models §10 covers cross-modal retrieval. This chapter §10 covers the RAG-specific aspects.
Causality §11 covers RAG in causal-reasoning contexts (LLM as causal-inference tool with retrieval).
Self-Supervised Learning §5 covers contrastive learning, foundational for modern dense embeddings.
Foundation Models provides the FM-as-substrate framing.
Evaluation §10 covers cross-cutting evaluation; this chapter §11 covers RAG-specific evaluation.
Generative Models §3 covers autoregressive generation underlying the generator component.
Alignment §7 covers safety evaluations including hallucination measurement.

§14. Critiques, Limitations, and Open Problems

“RAG is just a workaround for context limits”

A sceptical position. With infinite context windows, RAG would be unnecessary. Long-context models will eventually obsolete RAG.

The pushback. Even if context were unbounded in principle, per-query cost grows with the amount of context processed. Corpora can be larger than any context window. RAG provides fundamental advantages (citation, freshness, cost) that long context alone doesn’t.

The chapter’s position. RAG and long-context are complementary. Both will coexist in 2026 and beyond; neither obsoletes the other.

“Long-context models will replace RAG”

A specific version of the above. The argument: as context windows grow to multi-million tokens, RAG becomes unnecessary.

The pushback. Even at 10M tokens of context, corpora can be larger (web-scale corpora are far larger). Per-query cost of long context is substantial. The need for citation and grounding remains.
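A back-of-envelope calculation makes the per-query cost point concrete. The numbers are assumptions chosen for illustration: a corpus that exactly fills a 10M-token context, 10 retrieved chunks of 500 tokens each, and a 50-token query. The sketch counts prompt tokens only and ignores prompt caching, the cost of retrieval itself, and any non-linear cost of attention over long inputs.

```python
def prompt_tokens_long_context(corpus_tokens: int, query_tokens: int) -> int:
    """Whole corpus goes into the prompt on every query."""
    return corpus_tokens + query_tokens


def prompt_tokens_rag(top_k: int, chunk_tokens: int, query_tokens: int) -> int:
    """Only the top-k retrieved chunks go into the prompt."""
    return top_k * chunk_tokens + query_tokens


# Hypothetical sizes, see the assumptions above.
long_ctx = prompt_tokens_long_context(corpus_tokens=10_000_000, query_tokens=50)
rag = prompt_tokens_rag(top_k=10, chunk_tokens=500, query_tokens=50)
ratio = long_ctx / rag
```

Under these assumptions the long-context prompt is 10,000,050 tokens against 5,050 for RAG, a ratio of roughly 1,980. Caching can narrow the gap considerably for a static corpus, which is part of why small-corpus document QA is the case where long context is most competitive.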

The chapter’s position. Long-context will displace RAG for some applications (small-corpus document QA) but not all.

“RAG quality is dominated by retrieval quality”

A specific empirical observation. Bad retrieval → bad responses, even with excellent generators.

The implication. Investment in retrieval quality often pays more than investment in generation. Modern RAG systems invest heavily in retrieval pipelines (hybrid, re-ranking, query expansion).

Open problems

OP-RAG-1. Retrieval quality for complex queries. Multi-faceted queries; ambiguous queries; queries requiring deep understanding. Retrieval quality is uneven.
OP-RAG-2. Multi-hop reasoning. Multi-step retrieval-and-reasoning remains hard.
OP-RAG-3. Multimodal RAG quality. Emerging, with uneven quality across modalities.
OP-RAG-4. Cost-effective fresh-knowledge integration. Frequent reindexing is expensive; balancing freshness and cost is non-trivial.
OP-RAG-5. Hallucination prevention in RAG. Even with retrieval, models hallucinate. Grounding mechanisms imperfect.
OP-RAG-6. RAG evaluation methodology. End-to-end RAG evaluation is hard; existing metrics imperfect.
OP-RAG-7. Retrieval for agentic systems at scale. Production agentic systems need substantial retrieval infrastructure.
OP-RAG-8. Privacy-preserving RAG. Retrieving from private documents while preserving privacy. Multi-tenant RAG.
OP-RAG-9. Continual learning of retrieval. Retrievers that improve over time from feedback.
OP-RAG-10. RAG over structured data. RAG over tables, knowledge graphs, time-series. Less mature than text RAG.

§15. Further Reading

Foundational

Lewis, P., et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” The RAG paper (Facebook AI and collaborators).
Karpukhin, V., et al. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” DPR (Facebook AI and collaborators).
Khattab, O., and Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.”
Robertson, S., and Zaragoza, H. (2009). “The Probabilistic Relevance Framework: BM25 and Beyond.”

Modern advances

Asai, A., et al. (2023). “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.”
Izacard, G., et al. (2021). “Unsupervised Dense Information Retrieval with Contrastive Learning.” Contriever.
Faysse, M., et al. (2024). “ColPali: Efficient Document Retrieval with Vision Language Models.”

Production frameworks

LangChain, LlamaIndex documentation.
Vector database documentation (Pinecone, Weaviate, Chroma, Qdrant, Milvus, pgvector).

Evaluation

Es, S., et al. (2023). “RAGAS: Automated Evaluation of Retrieval Augmented Generation.”
Yang, Z., et al. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.”

Information retrieval foundations

Manning, C., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Available free online.

§16. Exercises and Experiments

E1. Build a small RAG system. From a small corpus (e.g., a Wikipedia subset), build a complete RAG system with embedding indexing, retrieval, and generation. Evaluate on QA queries.
E2. Compare sparse vs dense retrieval. On the same corpus, evaluate BM25 vs dense embeddings. Plot recall@K vs K. Investigate query types where each excels.
E3. Implement re-ranking. Add a cross-encoder reranker to your basic RAG system. Measure quality improvement.
E4. Build multi-hop RAG. Implement multi-hop retrieval on HotpotQA. Compare to single-hop.
E5. Evaluate hallucination. Take your RAG system; specifically test on queries with no relevant context. Measure hallucination rate.
E6. Multimodal RAG. Build a system that retrieves from a corpus of mixed text and images.
E7. Compare chunking strategies. On the same corpus, test fixed-size, semantic, and hierarchical chunking. Measure retrieval quality.
E8. Production-pattern implementation. Build hybrid (BM25 + dense + reranker) RAG; compare to each component alone.
E9. RAGAS evaluation. Use RAGAS to evaluate your RAG system on multiple dimensions; analyze results.
E10. Cost-quality analysis. For a fixed task, vary retrieval depth (top-K) and measure cost vs quality. Plot Pareto frontier.