Large Language Models

§3 (Tokenization) was reordered to precede §4 (LLM Architecture) on 2026-05-12 — see ADR-0009 — because architecture uses token IDs as input to the embedding layer, so tokenization should be grounded first.

This chapter develops the most concrete and consequential specialization of foundation models: large language models (LLMs). Where the Foundation Models chapter is the spine, this chapter is the largest single rib. Material that is general to all foundation models lives in the FM chapter; material specific to LLMs lives here.

Scope and What This Chapter Is About

The chapter develops language-specific foundation models end to end: architectural choices for text, the training pipeline (pretraining → instruction tuning → preference tuning), inference and decoding, prompting, in-context learning at depth, tool use as it touches LLMs, long-context techniques, the open / closed model ecosystem, and a survey of major model families.

We treat the language-specific surface. The substrate (transformers, MoE, SSMs) lives in Deep Learning. The training paradigm (SSL) lives in Self-Supervised Learning. The conceptual frame (foundation models) lives in the FM chapter. Reasoning models (test-time compute), agents, alignment in depth, and multimodal LLMs live in their own chapters with explicit pointers from here. Open problems are flagged inline and consolidated in §13.

§1. Motivation and Scope

What this chapter is for

A large language model (LLM) is a foundation model whose pretraining objective and dominant deployment surface are text. The Foundation Models chapter developed the general regime: pretrain once at scale on broad data, then adapt many ways. This chapter narrows the focus to the language-specific instance, which is the most studied, most deployed, and most consequential class of foundation models in 2026. Material general to all foundation models — the regime’s logic, its scaling behaviour, its open problems with emergence and homogenization — lives in the FM chapter. Material that is language-specific — how text becomes a model input, how generation actually works at inference time, what the modern training pipeline looks like, what the deployed ecosystem looks like — lives here.

Treating LLMs as a footnote to foundation models would underrepresent the practice. Most readers’ direct experience of foundation models is mediated by an LLM (a chatbot, a coding assistant, a writing tool); most published benchmarks target LLMs; most of the alignment and safety literature is LLM-centric in its current form. A research-oriented reader needs the language-specific surface in depth.

What an LLM looks like, concretely

Begin with one concrete trace, tracking what happens when a user sends the prompt “What is the capital of France?” to a modern LLM. The shape applies, with minor variations, to GPT-4, Claude, Gemini, Llama, Mistral, and most others.

Step 1: Tokenization. The prompt is split into tokens — small units of text, typically a few characters each. The procedure that does the splitting is the model’s tokenizer. For an English prompt under most common tokenizers (in particular byte-pair encoding, or BPE; treated in §3), the example splits into roughly seven tokens: ["What", " is", " the", " capital", " of", " France", "?"]. Each token has a unique integer ID in the model’s vocabulary, the fixed set of tokens it can produce or consume. Typical modern LLM vocabularies are in the range of 32,000–256,000 tokens.

Step 2: Wrapping in a chat format. Deployed LLMs do not see the prompt naked. They see it wrapped in a chat format — a structured sequence of role-tagged messages such as [system: "You are a helpful assistant."] [user: "What is the capital of France?"] [assistant: ""] — where the model’s job is to fill in the assistant slot. The role tags and the system message are themselves tokens that the model has been trained to interpret. The exact format is a per-model design choice; the abstraction is the same.

Step 3: Forward pass through the model. The token sequence is fed into the model, a deep neural network that in 2026 is almost always a decoder-only Transformer. The Transformer architecture is developed in the Deep Learning chapter; decoder-only means that a single stack both reads the input and generates the output, with no separate encoder that first turns the input sequence into a representation for a decoder to consume (the encoder-only and encoder-decoder alternatives are described in §2). The model produces, for each position in the sequence, a probability distribution over what the next token should be: a vector of length equal to the vocabulary size, with non-negative entries that sum to one.

Step 4: Sampling. A single next token is drawn from that probability distribution by a decoding strategy — typically temperature-controlled sampling or nucleus (top-p) sampling (both treated in §6). Imagine the chosen token is Paris.

Step 5: Autoregression. The newly chosen token is appended to the sequence, and the whole process repeats. The extended sequence is fed back into the model, which produces a new distribution over the token after that, from which another token is sampled. This is autoregressive generation: each new token conditions on every preceding token. The loop continues until the model emits a special end-of-message token, or a maximum length is reached. Diagrammatically:

flowchart TD
  Prompt(["$$\text{prompt tokens: } [t_1, \ldots, t_n]$$"])
  Seq["$$\text{current sequence: } [t_1, \ldots, t_n, y_1, \ldots, y_k]$$"]
  Fwd["$$\text{forward pass through model}$$"]
  Dist["$$\text{distribution over next token (length } |\text{vocab}|)$$"]
  Dec["$$\text{decoding strategy (§6): sample } y_{k+1}$$"]
  App["$$\text{append } y_{k+1}$$"]
  Check{"$$\text{end token or max length?}$$"}
  Stop(["$$\text{stop — return response}$$"])

  Prompt --> Seq --> Fwd --> Dist --> Dec --> App --> Check
  Check -. no — loop .-> Seq
  Check --> |yes| Stop

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class Prompt,Seq,Fwd,Dist,Dec,App,Stop pill

The cost: each step is one full forward pass through the model. Generating $N$ output tokens requires $N$ forward passes. This is the central performance bottleneck of LLM inference and is what makes the decoding-and-serving stack of §6 so consequential.

The resulting response (“Paris is the capital of France.” or similar) is returned to the user.
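The trace above can be sketched end to end in a few lines. This is a toy under stated assumptions, not how any particular model is implemented: the role tags and the `<|end|>` token are hypothetical, the system message is treated as a single token for brevity, and `toy_logits` is a lookup table standing in for the Transformer forward pass. Only the control flow (wrap, forward pass, sample, append, check for the end token) is meant to carry over.

```python
import math
import random
from typing import Dict, List, Tuple

END = "<|end|>"  # hypothetical end-of-message token

# Toy "model": the most likely next token depends only on the last token.
NEXT = {"<|assistant|>": "Paris", "Paris": " is", " is": " the", " the": " capital",
        " capital": " of", " of": " France", " France": ".", ".": END}
VOCAB = sorted(set(NEXT.values()))


def wrap_chat(user_tokens: List[str]) -> List[str]:
    """Step 2: wrap the user turn in role tags and open the assistant slot."""
    return (["<|system|>", "You are a helpful assistant.", END, "<|user|>"]
            + user_tokens + [END, "<|assistant|>"])


def toy_logits(sequence: List[str]) -> Dict[str, float]:
    """Step 3 stand-in: one 'forward pass' returning a score per vocabulary token."""
    likely = NEXT.get(sequence[-1], END)
    return {tok: (8.0 if tok == likely else 0.0) for tok in VOCAB}


def sample_next(logits: Dict[str, float], temperature: float,
                top_p: float, rng: random.Random) -> str:
    """Step 4: temperature-scaled softmax followed by nucleus (top-p) sampling."""
    peak = max(logits.values())
    weights = {t: math.exp((z - peak) / temperature) for t, z in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, t) for t, w in weights.items()), reverse=True)
    nucleus: List[Tuple[float, str]] = []
    mass = 0.0
    for p, tok in ranked:  # keep the most probable tokens until their mass reaches top_p
        nucleus.append((p, tok))
        mass += p
        if mass >= top_p:
            break
    return rng.choices([t for _, t in nucleus], weights=[p for p, _ in nucleus], k=1)[0]


def generate(prompt_tokens: List[str], max_new_tokens: int = 32, seed: int = 0):
    """Step 5: the autoregressive loop, counting forward passes."""
    rng = random.Random(seed)
    sequence = wrap_chat(prompt_tokens)
    response: List[str] = []
    passes = 0
    for _ in range(max_new_tokens):
        logits = toy_logits(sequence + response)  # one forward pass per new token
        passes += 1
        token = sample_next(logits, temperature=0.7, top_p=0.9, rng=rng)
        if token == END:
            break
        response.append(token)
    return "".join(response), passes


text, passes = generate(["What", " is", " the", " capital", " of", " France", "?"])
print(text, passes)
```

Because the toy scores are sharply peaked, the nucleus contains a single token at every step and the run is deterministic; with flatter scores the same code would sample among several candidates. The pass counter includes the final pass that emits the end token.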

Three features of this trace deserve emphasis, since they are language-specific in ways the general FM chapter does not address:

Generation is sequential, one token at a time. The Transformer can process many tokens in parallel during training and during the prompt-processing phase of inference, but generation proper is autoregressive: token $t+1$ depends on token $t$, which had to be produced first. This shapes inference cost, latency, and the algorithms developed to mitigate them (speculative decoding and KV caching, both treated in §6).
The full token history matters; the model has no hidden memory across calls. Within a single generation, the model conditions on the entire token sequence so far. Across separate calls (separate sessions, separate API requests), a deployed LLM is stateless by default; “memory” between sessions, if it exists, is implemented by the surrounding system feeding earlier tokens back in at the start of each new request.
The model’s working surface is text, but text is doing more work than it looks like. The same model handles natural-language conversation, code generation, structured outputs (JSON, function calls), and structured input from tools and retrieval systems — all by running tokens in and tokens out. The deployment patterns that exploit this (prompting, function calling, agent loops, retrieval augmentation) are why LLMs became the substrate for so much modern AI plumbing.
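The second point, statelessness across calls, can be illustrated with a small sketch. Everything here is hypothetical: `stateless_llm` stands in for a single API request and simply reports what it was shown, and `ChatSession` stands in for the surrounding system that supplies the "memory".

```python
from typing import Dict, List

Message = Dict[str, str]


def stateless_llm(messages: List[Message]) -> str:
    """Stand-in for one model call: it sees only `messages` and keeps nothing."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s); the first was {user_turns[0]!r}."


class ChatSession:
    """The surrounding system, not the model, carries the conversation."""

    def __init__(self) -> None:
        self.history: List[Message] = []

    def send(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        reply = stateless_llm(self.history)  # the full history is re-sent on every call
        self.history.append({"role": "assistant", "content": reply})
        return reply


session = ChatSession()
first = session.send("My name is Ada.")
second = session.send("What is my name?")

# A fresh call without the history has no access to the earlier turn.
isolated = stateless_llm([{"role": "user", "content": "What is my name?"}])
print(second)
print(isolated)
```

The second reply can refer to the first turn only because the session re-sent it; the isolated call sees a single message.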

What “large” means

“Large” in large language model is not a fixed threshold; it is a practical class. We adopt the following pragmatic working definition for this chapter:

Large language model (LLM). A language model — a model that assigns probabilities to sequences of tokens — pretrained at sufficient scale that downstream adaptation (fine-tuning, prompting, or in-context learning) is meaningfully cheaper than retraining for the target use case. In 2026, this typically means models with at least several billion parameters trained on trillions of tokens.

The lower edge of the class — at what parameter count does a model stop being “an LLM”? — is contested. The seven-billion-parameter open-weights models routinely covered by the LLM literature (Llama-7B, Mistral-7B, and their successors) are in. Anything frontier is in. So are, by extension, the much smaller small language models (SLMs) that share the architecture and training pipeline but at one to three orders of magnitude smaller scale — even though their behaviour and deployment differ enough to make the boundary case interesting. We include them at the borders of the chapter rather than ruling them out by definition.

Boundaries with adjacent chapters

The chapter is positioned within a larger structure. To avoid duplication and to orient the reader:

Foundation Models (prerequisite) covers the general regime: definition, the three pillars (scale, self-supervision, adaptation), scaling laws and emergence. We presume this material.
Deep Learning (prerequisite) develops the Transformer architecture, attention, normalization, positional encodings, and training dynamics. We refer rather than re-derive.
Self-Supervised Learning (prerequisite) develops the family of pretraining objectives. We use them as building blocks.
Theoretical Foundations of Learning covers generalization theory; we cite forward for the in-context-learning theory discussion in §7.
Reasoning Models covers test-time-compute reasoning systems (o1-, o3-, and successor-style models) — closely related to LLMs but with distinct training and inference characteristics. We summarize the connection in §2 and defer.
AI Agents and Tool Use covers agentic loops, multi-step planning, and tool ecosystems. We discuss the LLM-side tool-calling surface in §8 and defer the systems story.
Retrieval-Augmented Generation covers retrieval architectures in depth; we discuss the LLM-side surface (§9) and defer retriever design.
Mechanistic Interpretability covers what is inside an LLM at the level of features and circuits; we use its results when discussing capabilities and limitations but do not duplicate the techniques.
Alignment covers preference tuning (RLHF, DPO, GRPO), scalable oversight, and safety training in depth; we cover the LLM-side training-pipeline placement in §5 and defer the substance.
Evaluation covers benchmarking, contamination, and capability evaluation in depth; we touch the LLM-specific evaluation issues at relevant points and defer the rest.
Multimodal Models covers vision-language, audio-language, and embodied extensions. We treat the language-only core here; the multimodal extensions live there.

What remains here, and what gives this chapter its weight, is the language-specific surface: tokenization (§3); architecture choices that matter for text (§4); the modern training pipeline as it composes pretraining, supervised fine-tuning, and preference tuning into a deployed LLM (§5); inference and decoding (§6); prompting and in-context learning at depth (§7); the practical surface of tool use (§8); long-context techniques (§9); the open / closed ecosystem (§10); and a survey of major model families as of the chapter’s snapshot date (§11). §13 catalogues the open problems specific to LLMs.

§2. Historical Context

This section gives a descriptive trajectory of how the LLM regime came to exist. It is not a complete history of computational linguistics nor an attempt to adjudicate credit. As in the FM chapter, the trajectory is best read as an accumulation of partial results that became viable together once compute and data scaled.

A timeline of the inflection points covered below:

~1990s–2000s

n-gram language models

Statistical n-gram models dominate practical language modelling; Kneser-Ney smoothing carries the approach into the 2010s in speech, translation, and predictive text.
2003

Neural language model (Bengio et al.)

Feedforward neural network predicts next word from learned dense vector representations, introducing distributed representations of words.
2010

Recurrent language models (Mikolov et al.)

Recurrent neural networks process tokens sequentially with a hidden state, handling arbitrary-length context and displacing n-grams on research benchmarks.
2017

The Transformer (Vaswani et al.)

Attention-only architecture eliminates recurrence; parallelizable training and scalability make it the architectural substrate for all that follows.
2018–2020

The 2018 inflection — BERT, GPT, T5

Three instantiations of the Transformer emerge: decoder-only (GPT), encoder-only (BERT), and encoder-decoder (T5); GPT-3 (2020) demonstrates in-context learning at scale.
2021

Instruction tuning (Wei et al.)

Supervised fine-tuning on diverse natural-language instruction datasets produces models that generalise to unseen instructions at inference time.
2022

InstructGPT / RLHF (Ouyang et al.)

Preference tuning via reinforcement learning from human feedback layered on top of instruction tuning becomes the standard post-training pipeline.
Nov 2022

ChatGPT

Instruction-tuned GPT-3 successor packaged as a conversational interface reaches mass public attention, reshaping research priorities and public expectations within months.
2024

Reasoning models — o1, o3, and successors

Models trained via reinforcement learning on their own reasoning traces allocate substantial test-time compute to deliberation, outperforming comparable non-reasoning models on multi-step tasks.

n-gram language models

For most of the 1990s and into the 2000s, the dominant practical language model was the n-gram model: a model that estimates the probability of the next word from the n − 1 preceding words by counting how often the corresponding n-gram appears in a training corpus. A bigram model (n = 2) estimates the probability of the next word given the single previous word; a trigram model uses the two preceding words; longer contexts become statistically infeasible without massive corpora because the count tables grow exponentially in the context length. The simplification that the next word depends only on the previous n − 1 words is called the Markov assumption at a fixed order.
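A minimal sketch of the counting estimate for the bigram case, on a made-up eleven-token corpus. The corpus and the helper name `bigram_prob` are illustrative assumptions; the point is only that the estimate is a ratio of counts, and that an unseen pair gets probability zero.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate .".split()

# pair_counts[prev][nxt] = number of times `nxt` follows `prev` in the corpus
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1


def bigram_prob(prev: str, nxt: str) -> float:
    """Maximum-likelihood estimate: count(prev, nxt) / count(prev, anything)."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0


print(bigram_prob("the", "cat"))  # "the" is followed by cat, mat, cat
print(bigram_prob("the", "dog"))  # never observed
```

The zero for the unseen pair is exactly the problem that smoothing addresses.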

Practical n-gram models required smoothing — techniques such as Laplace smoothing, Good-Turing, and most famously Kneser-Ney smoothing — to assign nonzero probability to n-grams that did not occur in training. Smoothing carried the n-gram era into the 2010s in production systems (speech recognition, machine translation, predictive text), but the framework was fundamentally limited: it could not generalize to semantically similar but lexically different contexts, and it scaled poorly in context length.
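As a hedged illustration of the simplest of these techniques, the sketch below applies Laplace (add-k) smoothing to a bigram model on a made-up corpus. The corpus and the name `laplace_prob` are assumptions for illustration, and the next word is assumed to come from the known vocabulary. Good-Turing and Kneser-Ney are more elaborate and are not shown.

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()
vocab = sorted(set(corpus))

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of (prev, nxt) pairs
context_counts = Counter(corpus[:-1])       # how often each word has a successor


def laplace_prob(prev: str, nxt: str, k: float = 1.0) -> float:
    """Add k to every pair count so unseen pairs get a small nonzero probability."""
    return (bigrams[(prev, nxt)] + k) / (context_counts[prev] + k * len(vocab))


print(laplace_prob("the", "cat"))  # seen twice after "the"
print(laplace_prob("the", "sat"))  # never seen after "the", but no longer zero
print(sum(laplace_prob("the", w) for w in vocab))  # still a proper distribution
```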

Neural language models

Bengio et al. (2003) introduced the neural language model: instead of counting n-gram statistics, a feedforward neural network learned to predict the next word from a learned vector representation of the preceding context. The key idea was the distributed representation of words — words encoded as dense vectors in a continuous space, with semantically similar words ending up with similar vectors. A dense vector here just means a fixed-length list of real numbers, most of which are nonzero (in contrast to a sparse one-hot encoding of a vocabulary index, which is zero everywhere except one position).

A small illustrative example. Imagine a tiny 4-dimensional embedding space in which the words king, queen, and spreadsheet end up at concrete positions like

   king        ≈ [ 0.62, -0.08,  0.31,  0.45 ]
   queen       ≈ [ 0.58, -0.03,  0.34,  0.41 ]
   spreadsheet ≈ [-0.21,  0.55,  0.04, -0.18 ]

“Similar vectors” can be quantified by cosine similarity, defined for two vectors $\mathbf{u}, \mathbf{v}$ as

$$\cos(\mathbf{u}, \mathbf{v}) \;=\; \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$

which is the cosine of the angle between them in the embedding space (1 when parallel, 0 when orthogonal, $-1$ when opposite). For the toy vectors above, $\cos(\text{king}, \text{queen}) \approx 0.996$ while $\cos(\text{king}, \text{spreadsheet}) \approx -0.47$. The model has learned that king and queen belong in a similar region of the space, even though no rule said so; the geometry falls out of training to predict tokens in natural text.
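The two similarities can be recomputed directly from the toy vectors. This is a plain-Python sketch of the cosine formula; the function name `cosine` is an illustrative choice, not a library call.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


king = [0.62, -0.08, 0.31, 0.45]
queen = [0.58, -0.03, 0.34, 0.41]
spreadsheet = [-0.21, 0.55, 0.04, -0.18]

cos_king_queen = cosine(king, queen)
cos_king_spreadsheet = cosine(king, spreadsheet)
print(round(cos_king_queen, 3), round(cos_king_spreadsheet, 3))
```

The first value is close to 1 (nearly parallel vectors); the second is clearly negative (the vectors point in broadly opposite directions).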

Distributed representations let the model generalize across contexts in ways n-gram models structurally could not: the probability of a continuation could be informed by similar contexts seen in training even when the exact context had never appeared. The same idea — dense vectors in a learned space — runs through the rest of this chapter; in modern LLMs the embedding space has $d_{\text{model}}$ dimensions in the thousands (developed in §4), not four.

The recurrent extension followed within a decade. Recurrent language models (Mikolov et al., 2010, and subsequent LSTM-based variants) used recurrent neural networks — networks that process tokens one at a time while maintaining a hidden state that carries information forward, treated in the Deep Learning chapter — to handle arbitrary-length context. Recurrent LMs displaced n-grams in research benchmarks and many production settings through the mid-2010s.

The Transformer arrives

The Transformer architecture (Vaswani et al., 2017), introduced for machine translation, was rapidly applied to language modeling. Its combination of attention-based context handling, parallelizable training (no recurrence to serialize), and scalability made it the architectural choice for everything that followed. The Deep Learning chapter develops the architecture; for purposes of this chapter, it is enough to know that the LLM era is the Transformer-language-modeling era.

The 2018 inflection: BERT, GPT, T5

Three architecturally distinct uses of the Transformer for language emerged in close succession:

BERT (Devlin et al., 2018) is an encoder-only Transformer pretrained with masked language modeling (MLM): during training, a fraction of tokens in the input are replaced with a special [MASK] token, and the model predicts the originals from the surrounding context on both sides (left and right). Concretely, the input “The cat sat on the mat.” might be transformed as
```
original input :  ["The",  "cat",  "sat",   "on",  "the",  "mat",  "."]
masked input   :  ["The",  "cat",  "[MASK]","on",  "the",  "[MASK]","."]
training target:  predict "sat" at position 3 and "mat" at position 6
```
with about 15% of tokens masked at random in the original BERT recipe. Because the model sees tokens on both sides of each [MASK], MLM is suited to learning representations for understanding the input rather than for generating new text. BERT was designed not for generation but for understanding tasks — classification, named-entity recognition, question answering — to which it was adapted by fine-tuning a small task-specific head on top of the pretrained model.
GPT (Radford et al., 2018) is a decoder-only Transformer pretrained with causal language modeling (CLM): the model predicts the next token conditioning only on the preceding tokens (not the following ones). This left-to-right structure makes the same model naturally usable for generation — produce a token, condition on it, produce the next. The GPT lineage continued with GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020), the latter at a scale large enough to demonstrate in-context learning (defined in FM §1) — performing novel tasks specified in the prompt without any further training.
T5 (Raffel et al., 2019) is an encoder-decoder Transformer pretrained with span corruption (the model is shown text with random contiguous spans replaced by sentinel tokens, and trained to produce the original spans). T5 cast many NLP tasks uniformly as text-to-text problems.
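The three objectives differ mainly in how a training example is built from the same text. The sketch below constructs one example of each kind from the sentence used above. It is a simplification under stated assumptions: positions are chosen by hand rather than at random, indices are 0-based (so the text's positions 3 and 6 are 2 and 5 here), and the sentinel names are illustrative.

```python
from typing import Dict, List, Set, Tuple

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]


def mlm_example(toks: List[str], mask_positions: Set[int]):
    """Masked LM (BERT-style): hide chosen positions, predict the originals."""
    masked = ["[MASK]" if i in mask_positions else t for i, t in enumerate(toks)]
    targets: Dict[int, str] = {i: toks[i] for i in sorted(mask_positions)}
    return masked, targets


def clm_examples(toks: List[str]) -> List[Tuple[List[str], str]]:
    """Causal LM (GPT-style): predict each token from the tokens to its left only."""
    return [(toks[:i], toks[i]) for i in range(1, len(toks))]


def span_corruption_example(toks: List[str], spans: List[Tuple[int, int]]):
    """Span corruption (T5-style): replace each [start, end) span with a sentinel;
    the target lists each sentinel followed by the tokens it replaced."""
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"  # illustrative sentinel name
        corrupted += toks[cursor:start] + [sentinel]
        target += [sentinel] + toks[start:end]
        cursor = end
    corrupted += toks[cursor:]
    return corrupted, target


masked, mlm_targets = mlm_example(tokens, {2, 5})
clm_pairs = clm_examples(tokens)
corrupted, span_target = span_corruption_example(tokens, [(1, 3), (5, 6)])

print(masked, mlm_targets)
print(clm_pairs[0], clm_pairs[-1])
print(corrupted, span_target)
```

Only the causal examples have the property that the prediction target is always the next token after the visible context, which is why that objective yields a model usable for generation without modification.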

In 2018–2020, the field debated which of these three would dominate. By 2026 the answer is clear: decoder-only LLMs, the GPT lineage’s descendants, are the substrate of practically every deployed LLM, including those produced by labs that initially favoured other approaches. Encoder-only and encoder-decoder models persist in specialized roles (retrieval encoders, some translation systems, some sequence-labeling pipelines) but are no longer where mainstream LLM research happens. We give the architectural reasons for this convergence in §4.

The instruction-tuning turn

Pretraining alone produces models that are fluent but not instruction-following. A pretrained model continues whatever pattern is in its prompt; if the prompt looks like the start of a Wikipedia article, the model produces what looks like the continuation of a Wikipedia article, regardless of whether the user wanted that. Making LLMs follow instructions reliably required a separate adaptation step.

Wei et al. (2021) demonstrated that instruction tuning — supervised fine-tuning on a diverse mixture of tasks framed as natural-language instructions paired with ideal responses — produced models that generalized to unseen instructions: an instruction-tuned model could follow new directives at inference time that resembled (but did not duplicate) its training instructions. The technique transferred quickly across labs. Ouyang et al. (2022), in the InstructGPT paper, layered preference tuning via RLHF (defined in FM §1) on top of supervised instruction tuning. The combination — pretrain, then instruction-tune via supervised fine-tuning on demonstrations, then preference-tune via RLHF or its newer alternatives — became the standard pipeline that FM §7 surveys and this chapter’s §5 develops in detail.

The productization era

OpenAI deployed ChatGPT in November 2022; it was the first LLM-based system to reach mass public attention. ChatGPT was not architecturally novel (it was a successor of GPT-3 with the instruction-tuning-plus-RLHF pipeline applied and a chat interface wrapped around it), but its packaging as a conversational system, its public availability, and its competence across a wide range of requests reshaped both research priorities and public expectations within months. The subsequent two years saw the deployment of GPT-4 (OpenAI, 2023), Claude (Anthropic), Gemini (Google DeepMind), and a rapid expansion of the open-weights ecosystem: models whose trained parameters are released publicly, in contrast to the closed-weights frontier where only API access is available. Llama (Meta), Mistral (Mistral AI), Qwen (Alibaba), DeepSeek (DeepSeek), OLMo (AI2), and many smaller research releases populated this ecosystem. By 2024 the open-weights ecosystem was capable enough that frontier-versus-open performance comparisons became a recurring debate, treated in §10.

The reasoning turn

By 2024 a new class of LLM-derived system had emerged: reasoning models, most prominently OpenAI’s o1 (2024) and o3 (announced late 2024) and successors from other labs. Reasoning models are trained — most often via reinforcement learning on their own reasoning traces — to produce long, deliberate chains of intermediate reasoning at inference time before committing to a final answer. They allocate substantial test-time compute (compute spent at inference, not during training) to that deliberation, and outperform comparable non-reasoning models on tasks involving multi-step structure (mathematics, programming, certain agentic tasks). They are distinct enough in their training and inference characteristics that we treat them as a separate object of study in the Reasoning Models chapter; this chapter covers the LLM substrate they are built on.

Where this leaves us

As of 2026, the LLM regime is characterized by: rapid iteration across closed and open foundation models on a near-monthly cadence; a stable training pipeline (pretraining → instruction tuning → preference tuning, optionally followed by reasoning-model RL or further adaptation); an active alignment and interpretability research community; the rise of agentic deployment patterns built on top of LLMs; and a recurring set of unresolved questions about reliability, factuality, calibration, and the open-versus-closed capability gap that the rest of the chapter develops.

Historical aside. As in the FM chapter, this narrative privileges the lineage that produced today’s dominant systems. Substantial parallel research — symbolic NLP, structured semantic parsing, classical knowledge representation, neuroscience-inspired language modeling — remains active and is treated in its own contexts.

§3. Tokenization for Language

Tokenization is the bridge between text and the integers an LLM operates on. Every modern LLM has a tokenizer in front of it: a procedure that splits raw text into a sequence of small units called tokens and looks up each one in a fixed vocabulary (the set of all token strings the tokenizer can produce) to get a sequence of integer IDs. Those IDs become the input to the embedding layer (§4) and, via the model’s output projection, also the unit the model produces one at a time during generation. Tokenization sits underneath everything else in this chapter — it determines, at the byte level, what the model can “see” and what it can “say”.

This section develops, in order: what tokenization is and why it exists; the dominant algorithm (byte-pair encoding, BPE) and its byte-level variant; alternatives (WordPiece, Unigram, SentencePiece); vocabulary design choices and special tokens; the multilingual coverage problem; tokenization effects on code, math, structured output; and tokenizer-free alternatives. The story will keep returning to one observation: choices made by the tokenizer designer have outsize, sometimes surprising effects on what the trained model can do.

What tokenization is, concretely

The tokenizer is a function: text in, sequence of integer IDs out. A concrete example:

   text:    "The unhappiness is unbearable."
                          │
                          ▼
                     tokenizer
                          │
                          ▼
   tokens:    [  "The",  " un",  "happiness",  " is",  " un",  "bearable",  "." ]
                          │
                          │ lookup in vocabulary table
                          ▼
   token IDs: [   791,    650,      33915,       374,    650,     8053,      13 ]
                          │
                          ▼
                  feed to model (§4)

(The exact tokens and IDs depend on the tokenizer; the example uses GPT-style byte-pair encoding for illustration.) The model never sees the raw text “The unhappiness is unbearable.” — only the seven integers [791, 650, 33915, 374, 650, 8053, 13].

The tokenizer also has an inverse: given a sequence of IDs, reconstruct the text. Detokenization (or “decoding” in the tokenizer sense — not to be confused with §6’s “decoding” in the sampling sense) is just the reverse lookup followed by string concatenation, with whitespace handled by whatever convention the tokenizer uses.
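The round trip can be sketched with a toy vocabulary built from the example above. Two assumptions are worth flagging: the IDs are simply the illustrative ones from the diagram, and the splitting rule used here (greedy longest match against the vocabulary) is a stand-in chosen for brevity, not the BPE procedure developed later in this section.

```python
from typing import Dict, List

# Toy vocabulary using the illustrative IDs from the example above.
vocab: Dict[str, int] = {
    "The": 791, " un": 650, "happiness": 33915,
    " is": 374, "bearable": 8053, ".": 13,
}
id_to_token: Dict[int, str] = {i: t for t, i in vocab.items()}


def encode(text: str) -> List[int]:
    """Text in, IDs out, using greedy longest-match splitting (a stand-in rule)."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        match = max((t for t in vocab if text.startswith(t, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers position {pos}: {text[pos:]!r}")
        ids.append(vocab[match])
        pos += len(match)
    return ids


def decode(ids: List[int]) -> str:
    """IDs in, text out: reverse lookup followed by concatenation."""
    return "".join(id_to_token[i] for i in ids)


ids = encode("The unhappiness is unbearable.")
print(ids)
print(decode(ids))
```

Whitespace survives the round trip because it is stored inside the tokens themselves (`" un"`, `" is"`), which is one common convention. The `ValueError` branch is what a real tokenizer must avoid; the byte-level fallback discussed below is one way to do so.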

Why do we need this at all? A neural network takes fixed-size numerical inputs — but text is variable-length, drawn from a huge character set, with words and morphemes of different lengths. We need a procedure that:

Splits text into units that can be indexed into a fixed vocabulary.
Returns a sequence of integer IDs (one per token).
Has a clean inverse, so the model’s output IDs can become text again.
Handles unseen inputs gracefully — there should be no “I don’t know what that character is”.

The two extreme choices and why they fail

The simplest tokenization is word-level: take each whitespace-separated word as a token. The vocabulary is the set of words seen in training. This fails in three ways. First, the vocabulary is enormous: English has hundreds of thousands of distinct words once morphological variants are counted. Second, any new word at inference time (a typo, a proper noun, a domain term, a translation) becomes an out-of-vocabulary (OOV) token with no representation. Third, different languages need different tokenizers, so there is no shared cross-lingual representation.

The opposite extreme is character-level: take each character as a token. The vocabulary is small (a few hundred Unicode characters in most cases) and there are no OOV problems. But sequences become very long — 4–5× the word-level length for English. Attention is quadratic in sequence length (§4.7), so long sequences are expensive. The model also has no built-in notion of “word” or “morpheme”; it has to learn these from raw character co-occurrence statistics.
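Both failure modes show up even on a one-sentence toy. The sentences below are made up for illustration; the length ratio for this short example is smaller than the typical English figure quoted above because its words are short.

```python
train_text = "the cat sat on the mat"
test_text = "the cat sat on the sofa"

# Word-level: any word missing from the training vocabulary is out-of-vocabulary.
word_vocab = set(train_text.split())
oov_words = [w for w in test_text.split() if w not in word_vocab]

# Character-level: no OOV for these characters, but the sequence is much longer.
char_vocab = set(train_text) | set(test_text)
word_len = len(test_text.split())
char_len = len(test_text)

# Self-attention cost grows with the square of the sequence length.
length_ratio = char_len / word_len
attention_cost_ratio = length_ratio ** 2

print(oov_words, word_len, char_len)
print(round(length_ratio, 2), round(attention_cost_ratio, 1))
```

The squared ratio is the reason a modest increase in sequence length translates into a much larger increase in attention cost.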

The middle ground: subword tokenization

Subword tokenization is the dominant compromise. The vocabulary contains common whole words and shorter character sequences that can compose into rarer words. Common words like “the”, “is”, “happiness” are typically single tokens. Rare words like “unhappiness” might decompose as “un” + “happiness”, or “un” + “happy” + “ness”, depending on what the tokenizer learned. Vocabulary stays manageable (32K–256K tokens in modern LLMs), every input is representable (because the fallback is always shorter pieces — at worst, single characters or single bytes), and the typical English word produces 1–2 tokens.
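The "at worst, single bytes" fallback is easy to check: any string has a UTF-8 encoding, so a vocabulary that contains all 256 byte values can represent any input. The sample string is an arbitrary illustration.

```python
text = "naïve 🙂"

# Every character becomes one to four bytes, each in the range 0..255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# The mapping is lossless: decoding the bytes recovers the original text.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)
```

The cost of the fallback is length: seven characters become eleven byte-level tokens here, which is why learned merges on top of the byte alphabet matter.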

BPE, WordPiece, Unigram, and SentencePiece (the four approaches covered in this section) are all forms of subword tokenization. They differ in how they pick which subword pieces become tokens, not in the basic idea.

Byte-pair encoding (BPE)

Byte-pair encoding (BPE) is the dominant tokenization algorithm in 2026. It is used by the GPT family (GPT-2 onward), Llama, Mistral, DeepSeek, and most open-weights LLMs. BPE originated as a data-compression algorithm (Gage, 1994) and was adapted to neural machine translation by Sennrich et al. (2016).

The idea is mechanical and easy to state. Start with a vocabulary of individual characters. Repeatedly find the most frequent pair of adjacent tokens in the training corpus and merge that pair into a new single token. Each merge adds one entry to the vocabulary. After enough merges, the vocabulary contains the original characters plus an ordered list of common multi-character chunks — exactly the subword pieces we want. We develop the algorithm in two parts: training (which produces the merge list) and encoding (which applies the merges to new text).

BPE training

BPE training takes a text corpus and a target vocabulary size $V$ , and outputs a merge table (an ordered list of pairs to merge) plus the final vocabulary.

BPE TRAINING
============

Inputs:
  corpus  : a list of training texts
  V       : target vocabulary size (e.g. 50,000, 128,000)

Initialize:
  # Step 1: pre-tokenize by whitespace (and punctuation) and count word frequencies.
  word_freq ← Counter over corpus, e.g. {"the": 1_500_000,
                                          "happiness": 12_000,
                                          "unhappiness": 3_400, ...}

  # Step 2: represent each word as a tuple of its characters,
  # plus a special end-of-word marker "</w>" so BPE doesn't merge
  # across word boundaries.
  splits ← { word: tuple(chars) + ("</w>",) for word in word_freq }
    # e.g. "happiness" -> ("h", "a", "p", "p", "i", "n", "e", "s", "s", "</w>")

  vocab ← set of all individual characters seen in the corpus, plus "</w>"
  merge_table ← []                          # ordered list of merges

Training loop:
  while |vocab| < V:
    # Count occurrences of every adjacent pair across all words,
    # weighted by word frequency.
    pair_freq ← Counter()
    for word, freq in word_freq:
      tokens ← splits[word]
      for i in 0 .. len(tokens) - 2:
        pair_freq[(tokens[i], tokens[i+1])] += freq

    if pair_freq is empty:
      break                                  # no more merges possible

    # Pick the most frequent pair.
    best_pair  ← argmax(pair_freq)           # e.g. ("h", "a")
    new_token  ← concat(best_pair)           # "ha"

    # Apply the merge: replace every occurrence of the pair
    # with the new token in every word's current split.
    for word in word_freq:
      splits[word] ← apply_merge(splits[word], best_pair, new_token)

    # Record the merge in order.
    merge_table.append(best_pair)
    vocab.add(new_token)

Output:
  vocab        : the final vocabulary of size V
  merge_table  : the ordered list of merges

A small worked example. Suppose the entire training corpus consists of just three words with these frequencies (adapted from the toy example in Sennrich et al., 2016):

   "low":     count 5
   "lower":   count 2
   "newest":  count 6

After pre-tokenization and the end-of-word marker, the initial splits are:

   l o w </w>            (count 5)
   l o w e r </w>        (count 2)
   n e w e s t </w>      (count 6)

Initial vocabulary: $\{$ l, o, w, e, r, n, s, t, </w> $\}$ .

Iteration 1. Count adjacent pairs across all words, weighted by word frequency:

   ("l", "o"):       5 + 2 = 7
   ("o", "w"):       5 + 2 = 7
   ("w", "</w>"):    5
   ("w", "e"):       2 + 6 = 8    ← most frequent
   ("e", "r"):       2
   ("r", "</w>"):    2
   ("n", "e"):       6
   ("e", "w"):       6
   ("e", "s"):       6
   ("s", "t"):       6
   ("t", "</w>"):    6

Merge ("w", "e") → "we". New vocabulary token: we. After applying the merge to every split:

   l o w </w>            (count 5)
   l o we r </w>         (count 2)
   n e we s t </w>       (count 6)

Iteration 2. Recount the pairs:

   ("l", "o"):       5 + 2 = 7    ← most frequent
   ("o", "w"):       5
   ("o", "we"):      2
   ("we", "r"):      2
   ("we", "s"):      6
   ...

The most frequent pair is now $(l, o)$ with count 7; merge to lo. Continue iterating. After many merges the vocabulary contains the original characters plus useful chunks: word stems, common suffixes, frequent collocations.

The merge table is ordered: later merges depend on earlier ones. The vocabulary alone is not enough to tokenize new text — we also need the order of merges, because applying merge $k$ requires that merges $1, \ldots, k-1$ have already produced the relevant intermediate tokens.
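The training loop can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the names `train_bpe` and `num_merges` are ours, the stopping rule is a fixed number of merges instead of a target vocabulary size, and ties are broken in favour of the pair counted first (an assumption; real implementations choose their own rule).

```python
from collections import Counter

def train_bpe(word_freq, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Each word starts as its characters plus the end-of-word marker.
    splits = {w: tuple(w) + ("</w>",) for w in word_freq}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pair_freq = Counter()
        for word, freq in word_freq.items():
            toks = splits[word]
            for a, b in zip(toks, toks[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break  # every word is a single token; nothing left to merge
        # Most frequent pair; ties go to the pair counted first (assumption).
        best = max(pair_freq, key=pair_freq.get)
        merges.append(best)
        # Rewrite every split, replacing the pair with the merged token.
        for word in list(splits):
            toks, out, i = splits[word], [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            splits[word] = tuple(out)
    return merges, splits

merges, splits = train_bpe({"low": 5, "lower": 2, "newest": 6}, num_merges=2)
print(merges)           # [('w', 'e'), ('l', 'o')]
print(splits["lower"])  # ('lo', 'we', 'r', '</w>')
```

Run on the toy corpus, the first two learned merges match the two iterations of the worked example.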

BPE encoding (tokenizing new text)

Given the trained vocabulary and merge table, tokenizing a new piece of text is straightforward:

BPE ENCODING
============

Inputs:
  text         : a raw input string
  vocab        : trained BPE vocabulary
  merge_table  : ordered list of merges from training

Procedure:
  words ← split_by_whitespace_and_punct(text)
  encoded_ids ← []

  for word in words:
    tokens ← list(word) + ["</w>"]          # start at character level

    # Greedily apply merges in the order they were learned.
    for (a, b) in merge_table:
      i ← 0
      while i < len(tokens) - 1:
        if tokens[i] == a and tokens[i+1] == b:
          tokens[i : i+2] ← [a + b]         # merge in place
        else:
          i ← i + 1

    for tok in tokens:
      encoded_ids.append(vocab.id(tok))

Output: encoded_ids

Walking the unseen word “lowest” through a small example merge table:

   l o w e s t </w>                                  (initial char split)
     │
     │ apply merges in order. Suppose the trained merge table is:
     │     1. (w, e)   -> we
     │     2. (we, s)  -> wes
     │     3. (wes, t) -> west
     │     4. (l, o)   -> lo
     │     ...
     ▼
   l o we s t </w>                                   (after merge 1)
     │
     ▼
   l o wes t </w>                                    (after merge 2)
     │
     ▼
   l o west </w>                                     (after merge 3)
     │
     ▼
   lo west </w>                                      (after merge 4)

The unseen word “lowest” tokenizes as ["lo", "west", "</w>"]: two subword pieces plus the end-of-word marker, derived from merges learned on different training words. This is the kind of generalization subword tokenizers buy us: a word never seen at training time still has a sensible representation.
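The encoding procedure translates almost line for line into Python. The sketch below handles a single pre-tokenized word and stops at token strings (the final vocabulary-ID lookup is omitted); the function name `bpe_encode_word` is illustrative.

```python
def bpe_encode_word(word, merge_table):
    """Apply merges, in learned order, to one pre-tokenized word."""
    tokens = list(word) + ["</w>"]  # start at character level
    for a, b in merge_table:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge in place, re-check same index
            else:
                i += 1
    return tokens

# The example merge table from the walkthrough above.
merge_table = [("w", "e"), ("we", "s"), ("wes", "t"), ("l", "o")]
print(bpe_encode_word("lowest", merge_table))  # ['lo', 'west', '</w>']
```

Production tokenizers avoid this full pass over the merge table for every word (for example by caching per-word results), but the output is defined by the same ordered-merge rule.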

Byte-level BPE

A subtle but important variant. The BPE above operates on characters, but Unicode defines 150,000+ of them, and no training corpus covers them all. An input containing a character that never appeared in training has no token in a strictly character-level BPE.

Byte-level BPE (Radford et al., 2019, in GPT-2) solves this by operating on the bytes of the UTF-8 encoding of the text rather than on Unicode characters. UTF-8 represents any Unicode codepoint as a sequence of 1–4 bytes; every possible input text is a sequence of bytes drawn from the 256 possible byte values. Byte-level BPE starts its vocabulary with all 256 byte values and applies the same merge procedure to bytes.

Consequences:

No out-of-vocabulary input ever exists — every text is a byte sequence, and every byte is in the initial vocabulary.
No language-specific preprocessing. The tokenizer does not need to know that some languages do not put spaces between words, or that some scripts use combining diacritics. The bytes handle it.
Common sequences still get long tokens. Frequent ASCII-byte sequences (English words, code keywords, common punctuation) get merged into multi-byte tokens just as in character-level BPE.
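A short Python sketch of the starting point of byte-level BPE, before any merges are applied: every string maps to integers below 256, so nothing can be out of vocabulary. The helper name `to_byte_tokens` is illustrative, and real implementations add details (such as a printable remapping of byte values) that are omitted here.

```python
def to_byte_tokens(text):
    """Initial byte-level tokenization: one token ID per UTF-8 byte."""
    return list(text.encode("utf-8"))

samples = ["cat", "caf\u00e9", "\u65e5\u672c", "\U0001F642"]  # ASCII, accented Latin, CJK, emoji
for s in samples:
    ids = to_byte_tokens(s)
    assert all(0 <= b < 256 for b in ids)     # always inside the 256-entry base vocabulary
    assert bytes(ids).decode("utf-8") == s    # lossless round trip
    print(len(s), "code points ->", len(ids), "byte tokens")
```

Merges then shorten frequent byte sequences exactly as in the character-level algorithm; rare scripts simply stay closer to this byte-by-byte baseline.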

By 2026 byte-level BPE is the dominant scheme: GPT-2, GPT-3, GPT-4, Llama (from Llama 3 onward; earlier Llama releases used SentencePiece BPE with byte fallback), Mistral, Qwen, DeepSeek, and most major LLMs use it.

Alternatives: WordPiece, Unigram, SentencePiece

BPE is dominant but not universal. Three other names appear in the literature.

WordPiece (Schuster & Nakajima, 2012; used in BERT) is a close cousin of BPE. The training procedure is identical except for the criterion that picks which pair to merge: instead of picking the most frequent pair, WordPiece picks the pair that maximizes the likelihood of the corpus under a unigram language model defined over the current vocabulary. In practice, BPE and WordPiece produce very similar vocabularies on the same corpus; the algorithms are often interchangeable in casual discussion.

Unigram language model tokenization (Kudo, 2018) is structurally different. Instead of building up by merging, Unigram starts from a large candidate vocabulary (typically all subword substrings appearing above some frequency threshold) and prunes it: at each step, remove the tokens whose removal decreases the corpus likelihood the least, until the vocabulary reaches the target size. Encoding new text under a trained Unigram tokenizer requires solving an optimization: find the segmentation of the input that maximizes the product of token probabilities. This is done via Viterbi-style dynamic programming. Unigram is the default algorithm in the SentencePiece library and is used by T5 and mT5. It produces somewhat different segmentations from BPE on the same corpus, but at the deployment level the differences rarely matter.
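The encoding step can be illustrated with a small dynamic program. This is a sketch under stated assumptions: the token probabilities are invented for the example, and `viterbi_segment` is an illustrative name, not a library function.

```python
import math

def viterbi_segment(text, log_prob):
    """Most probable segmentation of `text` under a unigram token model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: max log-prob over segmentations of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            tok = text[start:end]
            if tok in log_prob and best[start] + log_prob[tok] > best[end]:
                best[end] = best[start] + log_prob[tok]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], n
    while i > 0:                  # follow back-pointers from the end of the text
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical token probabilities, chosen only for illustration.
probs = {"un": 0.05, "happi": 0.01, "ness": 0.04, "happiness": 0.002,
         "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01,
         "i": 0.01, "e": 0.01, "s": 0.01}
log_prob = {t: math.log(p) for t, p in probs.items()}
print(viterbi_segment("unhappiness", log_prob))  # ['un', 'happiness']
```

With these numbers, "un" + "happiness" (probability 0.05 × 0.002) beats "un" + "happi" + "ness" (0.05 × 0.01 × 0.04), so the longer token wins even though its own probability is low.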

SentencePiece (Kudo & Richardson, 2018) is not a separate algorithm but a framework: it implements both BPE and Unigram inside a shared tooling, with the additional design choice of operating on raw text — including whitespace — without any prior whitespace-based pre-tokenization. SentencePiece treats whitespace as just another character (encoded with a sentinel symbol ▁), which makes it directly applicable to languages without explicit word boundaries (Japanese, Chinese, Thai, Khmer). SentencePiece is widely used in multilingual models and is the default tokenization library in many open-source pipelines.

The differences among these algorithms matter less than the regime they share: subword tokenization with a fixed vocabulary of 30K–256K tokens, derived from a training corpus, applied to new text via a learned procedure.

Vocabulary design and special tokens

Beyond the algorithm, the tokenizer designer makes several consequential choices.

Vocabulary size $V$ . Common values are 32,000 (early Llama), 50,000 (GPT-2/3), 128,000 (modern Llama, Mistral-Large class), and up to 256,000 (recent multilingual models). The tradeoffs:

Larger $V$ : shorter token sequences (each token covers more characters on average); fewer attention computations per character; but a larger embedding table (see §4.2: the embedding matrix has shape $V \times d_{\text{model}}$ , so doubling $V$ adds tens to hundreds of millions of parameters). Larger $V$ also means more rare tokens that the model sees infrequently during training, which can produce undertrained tokens (see “glitch tokens” below).
Smaller $V$ : smaller embedding matrix; but longer sequences (more tokens per word) and more positions to attend over.

Special tokens. Tokens with reserved meaning that do not appear in normal text:

<bos> and <eos> — beginning-of-sequence and end-of-sequence markers used to signal where a document starts and ends.
<pad> — used to pad sequences to a uniform length during batched training. The model is trained to ignore these positions via attention masking.
<unk> — fallback for inputs the tokenizer cannot represent. With byte-level BPE, <unk> is unnecessary because every byte is in the vocabulary.
Chat-template tokens like <|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|> — these mark the role boundaries in conversation transcripts (the chat format described in §1).
Tool-use tokens — in models that support function calling, special tokens delimit the function name, arguments, and tool return values.
Reasoning tokens — in reasoning models (§5’s reasoning-RL stage), tokens delimit the chain-of-thought region (e.g., <think> and </think> markers) from the final answer.

The set of special tokens is part of the tokenizer’s contract with the model: the model is trained to respond to these tokens in specific ways, and prompts must use the exact special-token strings the model was trained on to elicit the corresponding behaviour.

The multilingual story

LLM tokenizers are trained on corpora that are heavily skewed toward English (and a few other high-resource languages: Chinese, Spanish, French, German, Japanese). The skew has a direct, mechanical consequence: a typical English word tokenizes to fewer tokens than the equivalent in many other languages, because English character sequences won the early BPE merges by frequency.

Concretely, an English sentence of roughly 30 characters might tokenize to 6–8 tokens. The same content in Tamil, Burmese, or Khmer — written in a non-Latin script underrepresented in the training corpus — might tokenize to 30–60 tokens at the same byte length. The non-English sequences end up represented byte-by-byte (or close to it) because their merge frequencies were lower.

The practical consequences:

Cost asymmetry. API providers charge per token. The same logical content costs several times more in some languages than in English. This affects research access in low-resource languages.
Context-window asymmetry. A model with a “128K-token context window” can hold many more characters of English than of an underrepresented script.
Quality asymmetry. More tokens to generate per concept means more positions where the model can make an error; downstream capability is correlated with tokenization efficiency, which is correlated with how heavily a language was represented in the tokenizer’s training corpus.
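The mechanical root of the asymmetry is visible at the byte level. The sketch below compares UTF-8 byte counts for an English word and a short string in Tamil script; under a byte-level tokenizer, the byte count is the number of tokens a string costs when no learned merge applies to it. The helper name `utf8_stats` and the sample strings are illustrative.

```python
def utf8_stats(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

samples = {
    "English": "hello",
    # Five code points from the Tamil Unicode block (U+0B80..U+0BFF).
    "Tamil": "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd",
}
for name, text in samples.items():
    n_chars, n_bytes = utf8_stats(text)
    print(f"{name}: {n_chars} code points, {n_bytes} UTF-8 bytes")
# English: 5 code points, 5 UTF-8 bytes
# Tamil: 5 code points, 15 UTF-8 bytes
```

English text starts at one byte per character and then benefits from many merges; a script encoded at three bytes per code point starts three times further behind and, if underrepresented in the tokenizer's corpus, receives fewer merges.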

Modern multilingual tokenizers (used in Llama-3, recent Qwen, recent Gemini) train on more linguistically balanced corpora and use larger vocabularies (128K–256K) to allocate more slots to non-English subwords. The asymmetry persists but has been reduced.

Tokenization effects on downstream behaviour

Tokenization is invisible at runtime but shapes what the model can and cannot do. Several specific effects are worth flagging.

Arithmetic and numbers. Different tokenizers split numbers differently. The number 1234 might tokenize as a single token, as [12, 34], as [1, 234], or as [1, 2, 3, 4], depending on what the BPE training data contained. The model’s arithmetic ability is strongly affected: if 123 and 124 are two distinct tokens with no embedding-space relationship, the model has to learn from scratch that they represent consecutive integers. Some modern tokenizers explicitly split all digit sequences into individual digit tokens, as a deliberate design choice for arithmetic competence — at the cost of slightly longer sequences for any text containing numbers.
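The digit-splitting design choice can be sketched as a pre-tokenization rule applied before BPE. The regular expression below is one simple way to express it, not the rule used by any particular model; `split_digits` is an illustrative name.

```python
import re

def split_digits(text):
    """Pre-tokenize so that every digit becomes its own piece."""
    # A single digit, or a maximal run of non-digit characters.
    return re.findall(r"\d|\D+", text)

print(split_digits("pay 1234 now"))  # ['pay ', '1', '2', '3', '4', ' now']
```

Because BPE merges never cross pre-tokenization boundaries, no multi-digit token can be learned, and every number is represented digit by digit.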

Code. Code is highly whitespace-sensitive. A four-space indent might be a single token ("    "), four separate space tokens, or various intermediate splits, depending on the tokenizer. Identifiers like getUserName might be one token or three (["get", "User", "Name"]); the latter is usually better because each subword carries semantic content the model can learn. Code-oriented tokenizers (used in code-specialized variants and increasingly in general-purpose modern models) are trained on code-heavy corpora and produce more efficient code tokenizations.

Structured output. When generating JSON, code, or schema-constrained outputs, the model produces syntax tokens one at a time. If the closing brace } is a single token, the model emits it cleanly; if it is bundled with other characters in a multi-character token, the model must commit to all those characters together. Tokenizer choices interact subtly with constrained-decoding pipelines (§6.7) — a constrained decoder enforces “valid JSON suffix” by masking out token IDs whose strings would make the output invalid; whether that masking is fine-grained or coarse depends on the tokenization.

Glitch tokens. A well-documented failure mode (investigated thoroughly by Rumbelow & Watkins, 2023): in GPT-2 and early GPT-3 tokenizers, certain tokens appeared frequently in the BPE training corpus (a scrape used to build the tokenizer) but rarely or never in the separate LLM-training corpus. The model had embeddings for these tokens but had effectively never been trained on them. When asked to produce them, the model produced bizarre, undefined behaviour: refusing to repeat them, emitting unrelated text, or “getting stuck”. The most famous case was the token " SolidGoldMagikarp", frequent on a specific Reddit forum but absent from the actual training. Modern training pipelines audit for this kind of tokenization-vocabulary mismatch, but it remains a real failure mode whenever the tokenizer and the training corpus are not aligned.

Tokenizer-free and byte-level alternatives

If tokenization causes all of these problems, why not eliminate it? Several research lines explore tokenizer-free models.

ByT5 (Xue et al., 2022) is an encoder-decoder model trained directly on UTF-8 bytes — each byte is a token, with a fixed 256-byte vocabulary. The model has no subword structure imposed; it learns whatever structure it needs. Tradeoffs: no tokenization-induced biases (better multilingual fairness, better arithmetic, no glitch tokens), but sequences are 3–4× longer than with BPE. Attention is quadratic in sequence length (§4.7), so byte-level models are correspondingly more expensive.

CANINE (Clark et al., 2022) operates on Unicode characters (with a clever hash-based vocabulary to avoid the OOV problem) and includes downsampling layers that reduce sequence length internally. Similar tradeoff profile.

MegaByte (Yu et al., 2023) is a hybrid: a “global” Transformer operates on patches of bytes, while a “local” Transformer fills in byte-level detail within each patch. This addresses the sequence-length problem head-on by structuring the computation hierarchically.

As of 2026, tokenizer-free models remain a research direction, not the dominant practice. The cost of longer sequences has so far outweighed the benefits of eliminating tokenization for frontier-scale LLMs. The calculus may change as sub-quadratic architectures (Mamba, hybrid attention/SSM; §4.10) mature: with linear-time inference, byte-level sequences become less expensive, and the tokenization-vs-no-tokenization tradeoff shifts.

Editorial note. Tokenization is a “supporting machinery” topic with unusually outsized consequences. Many of the LLM-specific open problems flagged in §13, including multilingual coverage (OP-LLM-3), arithmetic competence (OP-LLM-8), and tokenization-induced failure modes (OP-LLM-9), trace back to tokenizer design choices. The field's default is to treat tokenization as a solved engineering problem; the deeper truth is that it interacts with capability, safety, and fairness in ways the field is still working out.

§4. LLM Architecture

The architectural substrate for modern LLMs is overwhelmingly the decoder-only Transformer. This section explains why that single choice dominates, what a modern Transformer-based LLM actually looks like, where mixture-of-experts fits, what long-context architectures look like, and where non-Transformer alternatives stand. The Transformer architecture in general is developed in the Deep Learning chapter; here we develop enough mechanism that the reader can picture what the model actually does to a token sequence — what computation runs, what gets stored, and what each component is for.

Why decoder-only won

Recall from §2 that three architectural families competed as LLM substrates in 2018–2020: encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5). The convergence on decoder-only is not historical accident; it follows from three properties of the modern deployment regime.

Generation is the dominant deployment surface. Encoder-only models, designed for understanding tasks, do not natively support generation: a BERT-style model cannot produce text token-by-token because it was trained on bidirectional context, not on left-to-right prediction. The deployed-LLM regime — chatbots, code generation, structured-output APIs, agents — is generation-first, and decoder-only models are generation-native.

Decoder-only covers nearly all usable behaviour. A decoder-only model can do most of what an encoder-only model does (representation, classification by reading off internal states or by zero-shot prompting) and most of what an encoder-decoder model does (translation, summarization, structured transformation, all as prompted text-to-text tasks). Encoder-only and encoder-decoder models, by contrast, cannot match decoder-only models on open-ended generation.

A single training objective scales cleanly. Decoder-only models are pretrained with a single objective (causal language modeling, defined in §2 — predict the next token from the preceding ones). The pretraining objective composes cleanly with the rest of the pipeline (SFT, preference tuning, reasoning-RL) and benefits from straightforward scaling-law extrapolation (Foundation Models §6). Mixed-objective architectures complicate the training story in ways that the field, after several years of comparison, judged not worth the gain.

Encoder-only models persist as retrieval encoders (small, fast models that turn passages of text into dense vectors for retrieval pipelines, treated in the Retrieval-Augmented Generation chapter), as classification heads in specialized pipelines, and as research substrates where bidirectionality is essential. Encoder-decoder models persist in machine translation, some sequence-labeling pipelines, and code-edit settings where the encoder/decoder split corresponds to clean input/output separation. Neither is where mainstream LLM research happens.

From tokens to vectors: the embedding layer

Before any attention, before any Transformer block, the model must turn tokens — integer IDs (§1, §3) — into vectors of real numbers that the network can manipulate. The embedding layer does this lookup.

The model has a learned embedding table of shape $(V, d_{\text{model}})$ , where $V$ is the vocabulary size and $d_{\text{model}}$ is the hidden size (the dimension of the vector representing each token, the width of the residual stream). For a vocabulary of 128,000 tokens and a hidden size of 4,096, the table contains $128{,}000 \times 4{,}096 \approx 524$ million real numbers, each updated during training. To embed a token, the model looks up the corresponding row.

The entire embedding operation:

   token IDs     [3742, 318, 257, ...]             (integer sequence)
     │
     │ look up each ID as a row in the embedding table   (one lookup per token)
     ▼
   embeddings    [e_1, e_2, e_3, ...]              (each e_t ∈ R^{d_model})

Each output vector $\mathbf{e}_t$ is a dense vector — a list of $d_{\text{model}}$ real numbers, most of which are nonzero. For $d_{\text{model}} = 4{,}096$ , a single token embedding looks like

\mathbf{e}_t \;\approx\; [\, 0.018,\, -0.234,\, 1.102,\, 0.547,\, \ldots,\, -0.089 \,] \qquad (\text{4,096 numbers}).

“Dense” contrasts with sparse (mostly zeros, as in one-hot encodings of token IDs). Two tokens with related meanings — say king and queen — typically end up with embeddings that are closer in $\mathbb{R}^{d_{\text{model}}}$ than two unrelated tokens like king and spreadsheet. Closeness is most commonly measured by cosine similarity: for two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ ,

\cos(\mathbf{u}, \mathbf{v}) \;=\; \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|},

which is $1$ when the vectors point in the same direction, $0$ when orthogonal, $-1$ when opposite. Trained LLM embeddings often have $\cos(\text{king}, \text{queen}) > \cos(\text{king}, \text{spreadsheet})$ , reflecting semantic similarity learned as a side effect of the next-token-prediction training objective.
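The cosine formula is easy to check numerically. The three-dimensional vectors below are made up for illustration; real embeddings have thousands of dimensions and their values come from training.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings.
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
spreadsheet = [-0.1, 0.2, 0.9]

print(cosine(king, queen) > cosine(king, spreadsheet))  # True
print(cosine([1, 0], [1, 0]), cosine([1, 0], [0, 1]), cosine([1, 0], [-1, 0]))  # 1.0 0.0 -1.0
```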

The embedding layer is the model’s only point of contact with raw token IDs. From this point on, the model operates on vectors in $\mathbb{R}^{d_{\text{model}}}$ .
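As a sketch, the embedding layer is nothing more than row indexing into a table. The toy sizes and random initial values below are assumptions for illustration; in a trained model the rows are learned.

```python
import random

V, d_model = 8, 4  # toy sizes; the example in the text uses 128,000 x 4,096
random.seed(0)
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
                   for _ in range(V)]

def embed(token_ids):
    """Row lookup: one d_model-dimensional vector per token ID."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([3, 1, 7])
print(len(vectors), len(vectors[0]))  # 3 4

# Parameter count of the table at the sizes quoted in the text.
print(128_000 * 4_096)                # 524288000
```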

The Transformer block, end to end

A modern LLM is a tall stack of Transformer blocks — identical-in-structure layers that each refine the per-token vector representations using all preceding tokens as context. After the embedding layer turns input tokens into vectors $(\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_T)$ , the blocks are applied in sequence:

   Token IDs → Embedding layer → Block 1 → Block 2 → … → Block N → Output projection → Logits (per token, over vocab)

After the last block, an output projection maps the final per-token vectors back to vocabulary-sized logit vectors, one logit per possible next token at each position (see §1 and §6). The block count $N$ ranges from a few dozen in small open-weights models to over a hundred in frontier models.

Each block has the same internal structure. The modern 2026 recipe is:

flowchart TD
  In(["$$\text{Input } \mathbf{x}_1, \ldots, \mathbf{x}_T$$"])
  N1["$$\text{RMSNorm}$$"]
  Attn["$$\text{Multi-head causal self-attention}$$"]
  Add1(["$$\oplus$$"])
  Int(["$$\text{intermediate vectors}$$"])
  N2["$$\text{RMSNorm}$$"]
  FF["$$\text{SwiGLU feedforward}$$"]
  Add2(["$$\oplus$$"])
  Out(["$$\text{Output } \mathbf{x}'_1, \ldots, \mathbf{x}'_T$$"])

  In --> N1 --> Attn --> Add1
  In -. residual .-> Add1
  Add1 --> Int --> N2 --> FF --> Add2
  Int -. residual .-> Add2
  Add2 --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef sum  fill:#fafaf9,stroke:#1a1a1a,stroke-width:1.5px,color:#1a1a1a
  class In,N1,Attn,Int,N2,FF,Out pill
  class Add1,Add2 sum

Two structural features deserve emphasis before we look at the components individually.

The residual stream. The block adds its computation onto its input at the ⊕ joins (the residual connections), rather than replacing it. Stacked through $N$ blocks, this creates a “stream” of vectors that runs end-to-end, with each block reading from and writing to it. From the mechanistic-interpretability perspective the residual stream is the central object: tokens accumulate information across the layers, each block contributing a partial update. The Deep Learning chapter develops this view.

Pre-normalization. The normalization (RMSNorm) happens before each sub-layer (attention or feedforward), not after. The original 2017 Transformer normalized after, but at LLM scale this becomes unstable; pre-normalization is now standard.
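The wiring of one block (normalize, apply the sub-layer, add back onto the stream) can be sketched in plain Python. The RMSNorm formula used here, dividing by the root-mean-square and multiplying by a learned gain, is the standard definition but is not spelled out in this chapter; the attention and feedforward sub-layers are passed in as placeholder functions, and all names are illustrative.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root-mean-square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(xs, attention, feedforward, gain1, gain2):
    """One pre-norm block over a sequence `xs` (a list of vectors)."""
    # Attention mixes information across positions, so it sees the whole sequence.
    attn_out = attention([rms_norm(x, gain1) for x in xs])
    mid = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn_out)]   # first residual add
    # The feedforward sub-layer acts on each position independently.
    ff_out = [feedforward(rms_norm(m, gain2)) for m in mid]
    return [[a + b for a, b in zip(m, o)] for m, o in zip(mid, ff_out)]   # second residual add

# Placeholder sub-layers that write nothing: the block then passes its input through.
zero_attention = lambda seq: [[0.0] * len(v) for v in seq]
zero_feedforward = lambda v: [0.0] * len(v)

xs = [[1.0, 2.0], [3.0, 4.0]]
ones = [1.0, 1.0]
print(pre_norm_block(xs, zero_attention, zero_feedforward, ones, ones) == xs)  # True
```

The pass-through behaviour with zero sub-layers is the residual stream in miniature: each sub-layer only adds an update, and whatever it does not touch flows on unchanged.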

We now describe each component.

Self-attention

Self-attention is the mechanism that lets each token “look at” the other tokens before deciding what to write into the residual stream. We develop it in three stages: the projections, the attention computation (including the causal mask), and the output projection.

Stage 1: project each token into queries, keys, and values. Given the per-token input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_T$ (each of dimension $d_{\text{model}}$ ), three learned linear projections produce three new vector sequences:

\mathbf{q}_t \;=\; \mathbf{W}_Q \mathbf{x}_t, \qquad \mathbf{k}_t \;=\; \mathbf{W}_K \mathbf{x}_t, \qquad \mathbf{v}_t \;=\; \mathbf{W}_V \mathbf{x}_t,

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ are learned weight matrices of shape $(d_{\text{head}}, d_{\text{model}})$ , with $d_{\text{head}}$ the per-head dimension (typically 64 or 128). Reading the names: $\mathbf{q}_t$ is the query (“what is this token looking for?”); $\mathbf{k}_t$ is the key (“what does this token offer?”); $\mathbf{v}_t$ is the value (“what does this token actually contribute when matched?”). These are intuitions about the roles, not strict definitions; the actual content of $\mathbf{q}, \mathbf{k}, \mathbf{v}$ is determined by training.

Stage 2: compute attention weights and the output. For each query position $t$ , the attention output is a weighted sum of value vectors at preceding positions, where the weights depend on how well each preceding key matches the current query:

\mathbf{o}_t \;=\; \sum_{s \,\leq\, t} \alpha_{t,s} \, \mathbf{v}_s, \qquad \alpha_{t,s} \;=\; \mathrm{softmax}_{s}\!\left( \frac{\mathbf{q}_t \cdot \mathbf{k}_s}{\sqrt{d_{\text{head}}}} \right),

where the softmax is taken over $s$ (i.e., the weights $\{\alpha_{t,s}\}_{s \leq t}$ sum to 1 for each fixed $t$ ). In matrix form — one of the most-written equations in modern ML — let $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ be the matrices stacking $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ as rows. Then

\mathbf{O} \;=\; \mathrm{softmax}\!\left( \frac{\mathbf{Q} \mathbf{K}^{\!\top}}{\sqrt{d_{\text{head}}}} \right) \mathbf{V}.

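The $1 / \sqrt{d_{\text{head}}}$ factor is numerical-stability scaling: without it, the dot products grow with $d_{\text{head}}$ and push the softmax into saturation. In a decoder-only model the softmax is applied after the causal mask has set every entry with $s > t$ to $-\infty$ , so the matrix form agrees with the per-position sum over $s \leq t$ . Visually: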

   Queries Q, Keys K, Values V
     │
     ▼
   Q · K^T / sqrt(d_head)        (scaled dot products, a T × T matrix)
     │
     │ softmax row-wise          (each row becomes a probability distribution over preceding positions)
     ▼
   Attention weights α
     │
     │ multiply by V             (weighted sum of values, per query)
     ▼
   Output O = α · V

A small worked example, with $T = 3$ tokens and $d_{\text{head}} = 4$ . The exact numbers are illustrative.

Compute scaled Q·K^T (3 × 3 matrix of dot products):

                         k_1     k_2     k_3
                     ┌─────────────────────────┐
                  q_1│  0.4    0.1     0.2    │
                  q_2│  0.6    0.3     0.5    │
                  q_3│  0.2    0.8     0.1    │
                     └─────────────────────────┘

Apply causal mask (set entries above the diagonal to −∞,
so they vanish under softmax):

                         k_1     k_2     k_3
                     ┌─────────────────────────┐
                  q_1│  0.4    −∞      −∞     │      (token 1 sees only itself)
                  q_2│  0.6    0.3     −∞     │      (token 2 sees positions 1, 2)
                  q_3│  0.2    0.8     0.1    │      (token 3 sees positions 1, 2, 3)
                     └─────────────────────────┘

Apply softmax row-wise:

                         k_1     k_2     k_3
                     ┌─────────────────────────┐
                  q_1│  1.00   0.00    0.00   │
                  q_2│  0.57   0.43    0.00   │
                  q_3│  0.27   0.49    0.24   │
                     └─────────────────────────┘

These are the attention weights α_{t,s}. Output o_t is the weighted
sum of value vectors weighted by row t:

  o_1 = 1.00·v_1
  o_2 = 0.57·v_1 + 0.43·v_2
  o_3 = 0.27·v_1 + 0.49·v_2 + 0.24·v_3

The causal mask is what makes the model “decoder-only” rather than encoder-only: it prevents each token from attending to tokens that come after it in the sequence. Without the mask, the model could “cheat” during training by looking at the very token it is supposed to predict. With the mask, the only information the model has when producing the prediction at position $t$ is positions $1, 2, \ldots, t$ — exactly what it has at inference time.
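The mask-then-softmax step can be recomputed directly from the scaled score matrix of the worked example. The sketch below uses only the standard library; `causal_attention_weights` is an illustrative name, and masking is implemented by dropping the future positions before the softmax, which is equivalent to setting them to −∞.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over positions s <= t of a T x T score matrix."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]              # causal mask: ignore positions s > t
        m = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in visible]
        z = sum(exps)
        # Masked positions receive weight exactly zero.
        weights.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Scaled dot products from the worked example.
scores = [[0.4, 0.1, 0.2],
          [0.6, 0.3, 0.5],
          [0.2, 0.8, 0.1]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row sums to 1, and row $t$ has nonzero weight only on positions $1, \ldots, t$.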

Stage 3: project the output back to model dimension. The attention output $\mathbf{o}_t$ has dimension $d_{\text{head}}$ , but the residual stream expects dimension $d_{\text{model}}$ . A final learned linear projection $\mathbf{W}_O$ maps the output back:

\text{attn-output}_t \;=\; \mathbf{W}_O \, \mathbf{o}_t.

This is what gets added back to the residual stream at the ⊕ join in the block diagram.

Multi-head attention, GQA, and MQA

The single-head attention described above is the basic mechanism. In practice, modern LLMs run multi-head attention: $H$ independent attention computations in parallel, each with its own projections $\mathbf{W}_Q^h, \mathbf{W}_K^h, \mathbf{W}_V^h$ . Their outputs are concatenated and projected back to $d_{\text{model}}$ :

flowchart TD
  Input(["$$\text{input}$$"])
  H1["$$\text{head 1: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  H2["$$\text{head 2: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  HD["…"]
  HH["$$\text{head } H\text{: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  O1["$$\text{out}_1$$"]
  O2["$$\text{out}_2$$"]
  OD["…"]
  OH["$$\text{out}_H$$"]
  Concat["$$\text{concat: } \text{out}_1 \,\Vert\, \text{out}_2 \,\Vert\, \cdots \,\Vert\, \text{out}_H$$"]
  WO["$$\text{output projection } \mathbf{W}_O$$"]
  Res(["$$\text{add to residual}$$"])

  Input --> H1
  Input --> H2
  Input --> HD
  Input --> HH

  H1 --> O1
  H2 --> O2
  HD --> OD
  HH --> OH

  O1 --> Concat
  O2 --> Concat
  OD --> Concat
  OH --> Concat

  Concat --> WO --> Res

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class Input,Res,WO,Concat,H1,H2,HH,O1,O2,OH pill
  class HD,OD dim

Modern LLMs typically use $H$ in the range of 32 to 128 heads. Each head can specialize: some heads track syntactic relations (subject ↔ verb), some track long-range coreference, some attend mostly to the immediately preceding token. The Mechanistic Interpretability chapter develops what specific heads actually do.

The KV-cache problem. In multi-head attention, each head needs its own key and value cached at every preceding position to make autoregressive generation efficient (§1, §6 develop KV caching). The total KV-cache size per inference call is

2 \times N_{\text{layers}} \times T \times H \times d_{\text{head}} \quad \text{(real numbers)},

where the factor of $2$ accounts for K and V, times the bytes per element (e.g., 2 bytes at fp16). For a 100-billion-parameter model with $N_{\text{layers}} = 80$ , $H = 64$ , $d_{\text{head}} = 128$ , and a 100,000-token context, this is about $1.3 \times 10^{11}$ numbers, or roughly 260 GB at fp16: hundreds of gigabytes for a single sequence. Cache size, not parameter count, is what limits how many concurrent requests a server can hold.

Grouped-query attention (GQA) (Ainslie et al., 2023) addresses this by letting groups of query heads share keys and values:

Standard multi-head (H = 8 query heads, 8 K heads, 8 V heads):
  Q:  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8
  K:  K1  K2  K3  K4  K5  K6  K7  K8                (8 K's, 8 V's)
  V:  V1  V2  V3  V4  V5  V6  V7  V8

GQA with 2 KV groups (8 Q heads, 2 K heads, 2 V heads):
  Q:  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8
        └────┬────┘  └────┬────┘
             │            │
  K:        K1           K2                          (only 2 K's, 2 V's)
  V:        V1           V2

Multi-query attention (MQA) — one shared KV across all heads:
  Q:  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8
        └────────────┬───────────┘
                     │
  K:                K                                (1 K, 1 V)
  V:                V

GQA cuts the KV-cache size by the ratio of query heads to KV groups (eight-to-two here, a 4× reduction), with usually only a small capability loss. MQA is the most aggressive variant. The choice in practice is a tradeoff between memory and quality; modern LLMs (Llama-3, Mistral, Qwen) overwhelmingly use GQA, with KV-group counts that strike the empirical sweet spot for their scale.
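To make the memory tradeoff concrete, the cache-size formula can be evaluated for the three sharing schemes. The following is a minimal sketch, not a serving-system calculation: `kv_cache_bytes` is a hypothetical helper, the layer count, head size, and context length reuse the worked example above, and the choice of 8 KV groups is an assumption for illustration.

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_elem=2):
    """Bytes of keys and values cached for one sequence of seq_len tokens."""
    # The leading 2 counts K and V. n_kv_heads is H for standard multi-head
    # attention, the number of KV groups for GQA, and 1 for MQA.
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem


# Shape taken from the worked example in the text; 8 KV groups is hypothetical.
shape = dict(n_layers=80, seq_len=100_000, d_head=128)
sizes = {
    "MHA (64 KV heads)": kv_cache_bytes(n_kv_heads=64, **shape),
    "GQA (8 KV groups)": kv_cache_bytes(n_kv_heads=8, **shape),
    "MQA (1 KV head)": kv_cache_bytes(n_kv_heads=1, **shape),
}
for name, n_bytes in sizes.items():
    print(f"{name}: {n_bytes / 1e9:.1f} GB at fp16")
```

Only the number of KV heads changes between the three rows, so the cache shrinks by exactly the ratio of query heads to KV groups.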

Rotary position embeddings (RoPE)

The attention mechanism described so far has no built-in notion of order: setting the causal mask aside, shuffling the input tokens would produce the same set of attention weights, because the dot products $\mathbf{q}_t \cdot \mathbf{k}_s$ do not know anything about $t$ or $s$. (Formally, unmasked attention is permutation-equivariant: permuting the inputs permutes the outputs the same way.) To make the model order-sensitive, the position of each token must be injected somewhere. The dominant choice in modern LLMs is rotary position embeddings (RoPE) (Su et al., 2021).

RoPE injects position by rotating the query and key vectors by an angle that depends on the token’s position. Group the dimensions of $\mathbf{q}_t$ into pairs $(q_t^{(2i)}, q_t^{(2i+1)})$ . Each pair is treated as a 2D vector and rotated by angle $t \cdot \theta_i$ , where $\theta_i$ is a per-pair frequency:

   Each consecutive pair of query dimensions is treated as a 2D vector
   and rotated by an angle proportional to the token's position t:

       y ↑     (q_t^{(2i+1)})
         │       •
         │      ╱
         │     ╱ rotation by angle t·θ_i
         │    ╱
         │   ●━━━━▶ x  (q_t^{(2i)})

   The same rotation is applied to the corresponding key dimensions.

The same rotation is applied to the corresponding pair of dimensions in $\mathbf{k}_s$ . After rotation, the dot product $\mathbf{q}_t \cdot \mathbf{k}_s$ depends only on the difference $t - s$ between the two positions, not on their absolute values. This is RoPE’s central property: the model is sensitive to relative position, which makes it possible (with appropriate extrapolation tricks; see below) to use the same model at longer context lengths than it was trained on.

The per-pair frequencies $\theta_i$ are typically chosen as $\theta_i = b^{-2i / d_{\text{head}}}$ for some base $b$ (commonly $b = 10{,}000$ ). The high-frequency pairs are sensitive to short-range position differences; the low-frequency pairs are sensitive to long-range ones.
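The relative-position property can be checked numerically. The sketch below is a toy illustration in plain Python, assuming the pairing and frequency formula given above; `rope` and `dot` are hypothetical helper names, and real implementations operate on batched tensors rather than lists.

```python
import math


def rope(vec, pos, base=10_000.0):
    """Rotate each consecutive pair of vec by the angle pos * theta_i."""
    d = len(vec)  # plays the role of d_head; must be even
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2D rotation of the pair
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

same_gap_near = dot(rope(q, 10), rope(k, 7))     # t - s = 3
same_gap_far = dot(rope(q, 103), rope(k, 100))   # t - s = 3, shifted by 93
other_gap = dot(rope(q, 10), rope(k, 6))         # t - s = 4

print(same_gap_near, same_gap_far, other_gap)
```

The first two scores agree up to floating-point error even though the absolute positions differ, while changing the gap changes the score.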

RMSNorm

Inside the block, RMSNorm (Zhang & Sennrich, 2019) is the normalization step. The standard alternative, LayerNorm, both centers and rescales activations:

\mathrm{LayerNorm}(\mathbf{x}) \;=\; \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mathrm{mean}(\mathbf{x})}{\mathrm{std}(\mathbf{x})} + \boldsymbol{\beta},

where $\mathrm{mean}(\mathbf{x})$ and $\mathrm{std}(\mathbf{x})$ are the per-vector mean and standard deviation, $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learned per-dimension scale and shift parameters, and $\odot$ is elementwise multiplication. RMSNorm omits the mean-subtraction and the additive bias:

\mathrm{RMSNorm}(\mathbf{x}) \;=\; \boldsymbol{\gamma} \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d} \sum_{i=1}^d x_i^2}}.

The operation rescales each token vector to unit root-mean-square magnitude, then multiplies elementwise by the learned scale $\boldsymbol{\gamma}$ . RMSNorm is cheaper to compute, numerically stable at low precision (fp8, int8), and empirically equivalent in quality. Most modern LLMs use it.
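The two normalizations can be compared side by side on a single vector. This is a minimal sketch of the formulas above in plain Python; the small `eps` added under the square root is an assumption (a common guard against division by zero) and does not appear in the equations as written.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-6):
    """Center, rescale to unit standard deviation, then apply scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-6):
    """Rescale to unit root-mean-square; no centering and no additive bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]


x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

y_ln = layer_norm(x, ones, zeros)
y_rms = rms_norm(x, ones)
print("LayerNorm:", y_ln)
print("RMSNorm:  ", y_rms)
```

With unit scale, the RMSNorm output has mean square 1 but keeps a nonzero mean, whereas the LayerNorm output is centered at zero.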

SwiGLU feedforward

The block’s feedforward sub-layer is a per-token nonlinearity that operates independently on each position. The original Transformer used a simple two-layer MLP: project up to a larger hidden dimension $d_{\text{ff}}$ , apply a nonlinearity (ReLU or GELU), project back down. Modern LLMs use SwiGLU (Shazeer, 2020), a gated variant:

\mathrm{SwiGLU}(\mathbf{x}) \;=\; \mathbf{W}_2 \, \Big(\, \mathrm{Swish}(\mathbf{W}_1 \mathbf{x}) \;\odot\; \mathbf{W}_3 \mathbf{x} \,\Big),

where $\mathrm{Swish}(z) = z \cdot \sigma(z)$ (with $\sigma$ the logistic sigmoid) and $\odot$ is elementwise multiplication. There are three learned weight matrices. $\mathbf{W}_1$ and $\mathbf{W}_3$ project up to $d_{\text{ff}}$ in parallel; the $\mathbf{W}_1$ branch is passed through Swish and acts as the gate; the two branches are multiplied elementwise; and the result is projected back to $d_{\text{model}}$ by $\mathbf{W}_2$:

flowchart TD
  X(["$$\text{input } \mathbf{x}$$"])
  W1["$$\mathbf{W}_1 \mathbf{x}$$"]
  Sw["$$\text{Swish}$$"]
  W3["$$\mathbf{W}_3 \mathbf{x}$$"]
  Mul(["$$\odot$$"])
  W2["$$\mathbf{W}_2$$"]
  Out(["$$\text{output}$$"])

  X --> W1 --> Sw --> Mul
  X --> W3 --> Mul
  Mul --> W2 --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef sum  fill:#fafaf9,stroke:#1a1a1a,stroke-width:1.5px,color:#1a1a1a
  class X,W1,Sw,W3,W2,Out pill
  class Mul sum

The gating gives the model a multiplicative interaction (one branch decides how much to amplify the other) that pure additive nonlinearities lack. Empirically SwiGLU consistently outperforms the original MLP at matched parameter counts.
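The SwiGLU formula translates almost line for line into code. The sketch below uses plain Python lists and tiny hand-picked matrices ($d_{\text{model}} = 2$, $d_{\text{ff}} = 3$); the weight values and the `matvec` helper are hypothetical, chosen only so the example is easy to trace.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu(x, W1, W3, W2):
    gate = [swish(z) for z in matvec(W1, x)]    # gated branch
    up = matvec(W3, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise product
    return matvec(W2, hidden)                   # project back to d_model


# Hypothetical weights: W1 and W3 are d_ff x d_model, W2 is d_model x d_ff.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W3 = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

y = swiglu([1.0, 2.0], W1, W3, W2)
print(y)
```

Note that three matrices are involved rather than the two of the original MLP, which is why SwiGLU comparisons are made at matched parameter counts.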

Mixture-of-experts

The block as described so far is dense: every parameter of the block runs on every token. Mixture-of-experts (MoE) is an alternative scaling strategy. In a MoE block, the feedforward sub-layer is replaced by a bank of $E$ feedforward sub-layers (the experts), and only $k$ of them — typically $k = 2$ out of $E = 8, 16, 64,$ or more — run on any given token. A small learned router decides which:

flowchart TD
  X(["$$\text{Token vector } \mathbf{x}_t$$"])
  R["$$\text{Router: scores } \mathbf{s} = \mathbf{W}_{\text{router}}\,\mathbf{x}_t \in \mathbb{R}^E$$"]
  Top["$$\text{top-}k\text{; softmax} \to w_1, \ldots, w_k$$"]
  E1["$$\text{Expert FF}_1$$"]
  E2["$$\text{Expert FF}_2$$"]
  ED["…"]
  EE["$$\text{Expert FF}_E$$"]
  Y1["$$y_1 = \mathrm{FF}_1(\mathbf{x}_t)$$"]
  Y2["$$y_2$$"]
  YD["…"]
  YE["$$y_k$$"]
  Sum["$$\text{weighted sum } \sum_e w_e \cdot y_e$$"]
  Out(["$$\text{output for } \mathbf{x}_t$$"])

  X --> R --> Top
  Top --> E1
  Top --> E2
  Top --> ED
  Top --> EE
  E1 --> Y1
  E2 --> Y2
  ED --> YD
  EE --> YE
  Y1 --> Sum
  Y2 --> Sum
  YD --> Sum
  YE --> Sum
  Sum --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class X,R,Top,E1,E2,EE,Y1,Y2,YE,Sum,Out pill
  class ED,YD dim

Concretely: the router computes scores $\mathbf{s} = \mathbf{W}_{\text{router}} \mathbf{x}_t$ (a vector of length $E$ , one entry per expert), keeps the top- $k$ , normalizes those into weights via softmax, runs only the chosen $k$ experts on the token, and combines their outputs via the router weights.
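The routing steps just described can be sketched for a single token. This is a toy illustration, assuming the softmax is taken over the selected top-$k$ scores as in the text; the experts here are hypothetical stand-ins (each simply scales its input) rather than real feedforward sub-layers, and load balancing is ignored.

```python
import math


def moe_forward(x, router_W, experts, k=2):
    """Route one token vector x through the top-k of the given experts."""
    # Router scores: one entry per expert.
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_W]
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    # Softmax over the selected scores only (max subtracted for stability).
    m = max(scores[e] for e in top)
    exps = [math.exp(scores[e] - m) for e in top]
    weights = [v / sum(exps) for v in exps]
    # Run only the chosen experts and combine their outputs.
    out = [0.0] * len(x)
    for w, e in zip(weights, top):
        y = experts[e](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights


# Hypothetical setup: E = 4 experts, each multiplying its input by a constant.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
router_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out, top, weights = moe_forward([2.0, 1.0], router_W, experts, k=2)
print("chosen experts:", top, "weights:", weights, "output:", out)
```

Experts 2 and 3 are never evaluated for this token, which is the source of the compute saving.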

Two parameter counts matter:

The total parameter count is the parameter count across all $E$ experts (plus the rest of the model).
The active parameter count per token is the parameters in only the chosen $k$ experts (plus the rest).

The active count determines training and inference compute per token. The total count determines model capacity, and also the memory needed to store the weights, since all experts must be kept available even though only $k$ run on a given token. MoE lets a model be larger in capacity than in per-token compute. At frontier scale this matters: a 1-trillion-total-parameter MoE with 100 billion active parameters costs roughly the same compute per token as a 100-billion-parameter dense model, while often behaving like a substantially larger one.
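The distinction between the two counts is simple arithmetic. In the sketch below, all numbers are hypothetical and chosen only to give round totals; `shared` stands for everything outside the expert banks (attention, embeddings, norms), and `per_expert` is one expert's parameters summed over all layers.

```python
def moe_param_counts(shared, per_expert, n_experts, k):
    """Return (total, active-per-token) parameter counts for an MoE model."""
    total = shared + n_experts * per_expert
    active = shared + k * per_expert
    return total, active


# Hypothetical configuration: 64 experts, 2 active per token.
total, active = moe_param_counts(
    shared=40_000_000_000,
    per_expert=15_000_000_000,
    n_experts=64,
    k=2,
)
print(f"total:  {total / 1e9:.0f}B parameters")
print(f"active: {active / 1e9:.0f}B parameters per token")
```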

The tradeoffs are real. Load balancing — ensuring all experts get used roughly evenly during training — is necessary to prevent expert collapse (one expert absorbs all tokens, the rest are unused); this requires an auxiliary loss term. MoE inference has irregular memory-access patterns that complicate efficient serving. And MoE models are harder to fine-tune than dense ones because preserving load balance through fine-tuning is non-trivial. The Deep Learning chapter develops the training side; for the LLM chapter, the practically relevant fact is that by 2026, MoE is standard at the frontier (Mixtral, DeepSeek-MoE, and others) and increasingly common at smaller scales.

Long-context architectures

Attention’s cost is quadratic in sequence length: a context of length $L$ requires $O(L^2)$ attention computations per layer. For context windows now in the hundreds of thousands of tokens, this cost dominates everything else. Several architectural responses coexist in 2026.

Sliding-window attention. Each token attends only to a fixed window of nearby tokens (say, the previous $W = 4{,}096$ ), rather than to all earlier tokens. The attention pattern becomes band-diagonal:

   Standard causal attention:           Sliding window (W = 3):

         ▲                                    ▲
         │ █                                  │ █
         │ ██                                 │ ██
         │ ███                                │ ███
   pos t │ ████                          pos t│  ███
         │ █████                              │   ███
         │ ██████                             │    ███
         │ ███████                            │     ███
         │                                    │
         └──────────────▶                     └──────────────▶
                pos s                                pos s

   (lower triangle: token t attends         (only a band: token t attends
    to every token s ≤ t)                    to the W tokens before it)

Cost is $O(L \cdot W)$ per layer — linear in $L$ once $W$ is fixed. Long-range dependencies are handled by stacking: a token at position $t$ in layer 20 can be informed by a token at position $t - 20W$ in the original input, through a chain of overlapping windows.
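The cost difference and the stacking argument can both be checked on a small mask. The sketch below assumes one particular convention, namely that a token attends to itself plus the $W$ tokens before it; other implementations count the current token as part of the window, which shifts the numbers by one.

```python
def causal_mask(L):
    """mask[t][s] is True when token t may attend to token s (all s <= t)."""
    return [[s <= t for s in range(L)] for t in range(L)]


def sliding_window_mask(L, W):
    """Token t attends to itself and the W tokens before it (assumed convention)."""
    return [[0 <= t - s <= W for s in range(L)] for t in range(L)]


def attended_pairs(mask):
    """Number of (t, s) pairs that are computed: a proxy for attention cost."""
    return sum(sum(row) for row in mask)


def reach(n_layers, W):
    """How far back information can flow through n_layers of stacked windows."""
    return n_layers * W


L, W = 8, 2
print("full causal pairs:   ", attended_pairs(causal_mask(L)))             # grows as L^2
print("sliding-window pairs:", attended_pairs(sliding_window_mask(L, W)))  # grows as L*W
print("reach after 20 layers with W = 4096:", reach(20, 4096))
```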

Global-plus-local attention. A small subset of tokens (often the first few of the context) attends globally; the rest use sliding-window attention. Useful when there is reason to expect important global state to be encoded in those tokens.

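Dilated and ring attention. Dilated attention attends to positions spaced at gaps rather than to a contiguous window. Longformer uses a dilated sliding window with a fixed gap; exponentially growing spacing (attend to positions $t-1, t-2, t-4, t-8, \ldots$) is a variant that extends reach further for the same number of attended positions. Ring attention is a distributed-computation pattern for long contexts: the context is partitioned across devices, which pass key/value blocks around a ring of GPUs.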

Position-encoding extrapolation. Rather than change the attention pattern, change how positions are encoded. RoPE-based models can be extended to context lengths longer than they were trained on by scaling the rotation frequencies. NTK-aware scaling adjusts the rotation base; YaRN (Peng et al., 2023) is a more careful interpolation. The effective context after extrapolation is shorter than the nominal one (§9 develops this).

ALiBi. Attention with Linear Biases (Press et al., 2021) replaces positional encodings altogether: a fixed linear bias decaying with distance is added to each attention score. ALiBi-trained models extrapolate to longer contexts more gracefully than vanilla learned positional embeddings; empirical comparison with RoPE-plus-extrapolation has gone back and forth.
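The ALiBi bias is easy to write down explicitly. The sketch below builds the bias matrix for one head; the per-head slope schedule (a geometric sequence starting at $2^{-8/H}$) follows the recipe described in the ALiBi paper for power-of-two head counts as best recalled here, and should be treated as an assumption rather than a specification.

```python
def alibi_slopes(n_heads):
    """Assumed schedule: geometric sequence 2^(-8/H), 2^(-16/H), ..., 2^(-8)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(L, slope):
    """bias[t][s] is added to the attention score of query t and key s.

    The penalty grows linearly with the distance t - s; future positions
    (s > t) are set to -inf, which plays the role of the causal mask.
    """
    return [[-slope * (t - s) if s <= t else float("-inf") for s in range(L)]
            for t in range(L)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the bias depends only on distance and has no learned per-position parameters, nothing in it is tied to the training context length.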

Retrieval as a long-context substitute. Rather than feed all relevant text into a long context, retrieve only the most relevant passages at inference time and feed those. Treated in §9 and in the Retrieval-Augmented Generation chapter.

The practical answer to long context in 2026 is usually a combination: GQA + RoPE + sliding-window (or global-plus-local) + retrieval, with extrapolation when the deployed context exceeds the training context.

Non-Transformer alternatives

The Transformer’s dominance is empirically overwhelming, not architecturally inevitable. Several alternative substrates remain active research.

State-space models (SSMs), most prominently Mamba (Gu & Dao, 2023) and its successors, are sub-quadratic architectures that process tokens sequentially while maintaining a learned hidden state — modern successors to recurrent networks (§2), designed to be parallelizable in training while linear in inference cost. By 2026 SSMs are competitive with Transformers at small to mid scales; the gap at frontier scale has narrowed but not closed.
Hybrid attention/SSM stacks (e.g., Jamba) interleave Transformer blocks with SSM blocks, attempting to combine the long-range retrieval capability of attention with the efficient long-context inference of SSMs.
RWKV is another sub-quadratic architecture rooted in recurrent computation, with a different set of tradeoffs.

The Deep Learning chapter develops the architectural substance. For the LLM chapter, the practically relevant fact is that as of 2026, the decoder-only Transformer with the modern recipe described above remains the substrate of essentially every frontier and competitive open-weights LLM, with sub-quadratic alternatives present but not (yet) dominant.

Editorial note. The convergence on a single architectural recipe is unusually strong in 2026. Whether this reflects a genuine attractor in architecture space or an artifact of community focus, deployment economics, and tooling investment is an open question. We record that the convergence is real and that its origins are debated; the deeper question is treated in the Deep Learning chapter.

§5. The LLM Training Pipeline

The modern LLM training pipeline is a sequence of distinct training stages applied in order to the same model. §1’s worked example and §2’s history showed the high-level shape; this section develops the substance. The pipeline’s defining property is separation of concerns: each stage targets a specific property of the final model (broad competence, instruction-following, preference alignment, reasoning, specialized capability), uses a specific kind of data, and has its own characteristic algorithm. By 2026 the composition has largely stabilized across labs.

This section covers seven stages, in the order they are typically applied:

Pretraining — causal language modeling on a very broad, mostly unlabelled corpus.
Supervised fine-tuning (SFT) — continued training on instruction-following demonstrations.
Preference tuning — adjusting the model to produce outputs that humans (or trained proxies) prefer.
Reasoning-RL — optional, for reasoning models; deferred to the Reasoning Models chapter.
Continual pretraining — periodic updates to incorporate new knowledge.
Distillation — producing smaller, cheaper models from a larger trained one.
Synthetic-data augmentation — generating training data with one model to train another.

Pretraining

Pretraining is the initial — and overwhelmingly the most expensive — training stage. The objective is causal language modeling (defined in §2): predict the next token from the preceding ones, on a corpus of unlabelled text. Foundation Models §3 covered the why (self-supervision lets the procedure consume essentially unlimited unlabelled data); we cover the how here.

The full pipeline, from raw data to a base model:

Raw web crawl + books + code + papers + Wikipedia + ...
  ↓
Quality filtering: heuristics + learned classifiers
  ↓
Deduplication: exact + near-duplicate removal
  ↓
Mixture composition: weights per source tuned by ablation
  ↓
Tokenization (§3): text → token IDs
  ↓
Packing into fixed-length sequences ($L = 2\text{k}$–$8\text{k}$ tokens): concatenate token IDs into contexts with boundary markers
  ↓
Pretraining loop: forward pass → next-token loss → backprop → AdamW step (trillions of tokens; $10^5$–$10^7$ optimizer steps)
  ↓
Base model $\pi_{\text{base}}$: fluent, not instruction-following

Each box is itself a substantial engineering effort. We describe the early stages briefly here; the deep details (tokenizer design, distributed training infrastructure, hyperparameter recipes) are treated in §3 and the Deep Learning chapter.
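As one example of what a single box involves, the packing step can be sketched in a few lines. This is a simplified illustration, assuming a single end-of-document token as the boundary marker and dropping the incomplete final sequence; `pack` and the token IDs are hypothetical, and real pipelines differ in how they handle remainders and cross-document attention.

```python
def pack(docs_token_ids, seq_len, eos_id):
    """Concatenate tokenized documents and cut the stream into fixed-length sequences."""
    stream = []
    for ids in docs_token_ids:
        stream.extend(ids)
        stream.append(eos_id)  # boundary marker between documents
    n_full = len(stream) // seq_len
    # Keep only complete sequences; the trailing remainder is dropped here.
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Three toy documents, already tokenized; 0 is the assumed boundary token.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack(docs, seq_len=4, eos_id=0)
for seq in packed:
    print(seq)
```

A packed sequence can start or end in the middle of a document, which is why each training sequence is described as a concatenation of text from one or more documents.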

Corpus composition. A modern LLM pretraining corpus contains on the order of trillions of tokens drawn from multiple sources: web crawls (notably Common Crawl and its derivatives), curated web subsets (technical Q&A like Stack Exchange, news sites, Reddit conversations), books, code repositories (GitHub and similar), academic papers (arXiv), Wikipedia, and language-specific resources. The exact mixture is a per-lab design choice that affects downstream behaviour: weighting code more heavily produces better-coding models; weighting math-reasoning data more heavily produces better-mathematics models; weighting non-English data more heavily produces models with better multilingual coverage. Choosing the mixture — what is sometimes called data curation — is now a substantial fraction of pretraining engineering work, tuned via small-scale ablation experiments that test the effect of mixture changes on downstream behaviour.

Deduplication. A corpus assembled from web sources contains many near-duplicate passages — boilerplate, reposted articles, multiple copies of common reference text. Training on duplicates wastes compute and biases the model toward memorizing duplicated material rather than generalizing. Deduplication — removing exact and near-exact duplicates from the corpus before training — is now standard practice. Lee et al. (2022) and subsequent work demonstrated meaningful capability gains from aggressive deduplication.
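The two kinds of deduplication can be illustrated on a handful of strings. This is a toy sketch only: exact matching uses a hash of whitespace- and case-normalized text, and near-duplicate detection uses Jaccard similarity over word 3-grams with a hypothetical threshold of 0.8. The pairwise comparison is quadratic in corpus size; web-scale pipelines rely on scalable approximations such as MinHash instead.

```python
import hashlib


def exact_dedup(docs):
    """Drop documents whose normalized text has been seen before."""
    seen, kept = set(), []
    for d in docs:
        key = hashlib.sha256(" ".join(d.split()).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept


def shingles(text, n=3):
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0


def near_dedup(docs, threshold=0.8, n=3):
    """Keep a document only if it is not too similar to any already-kept one."""
    kept, kept_shingles = [], []
    for d in docs:
        sh = shingles(d, n)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(d)
            kept_shingles.append(sh)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",        # exact duplicate after normalization
    "The quick brown fox jumps over the lazy dog today",   # near duplicate
    "Attention cost is quadratic in sequence length",
]
after_exact = exact_dedup(docs)
after_near = near_dedup(after_exact)
print(len(docs), "->", len(after_exact), "->", len(after_near))
```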

Quality filtering. Beyond deduplication, quality filtering removes pages judged unlikely to contribute to a useful model: machine-translated low-quality content, spam, gibberish, content that does not look like coherent text in the target language(s). Quality filters range from simple heuristics (page length, perplexity under a small reference model, content-detector rules) to learned classifiers trained on human-judged quality labels. The Pile (Gao et al., 2020) and its successors are well-documented public examples; proprietary corpora at frontier labs go further and the specific recipes are mostly not public.

Tokenization and curricula. Tokenization — the process that splits the corpus into tokens (§3) — interacts with multilingual coverage, code performance, and arithmetic ability. Training curricula specify the order and mixture of data through training: many recipes anneal the data mixture, starting with broader and more diverse data and concentrating on higher-quality or more domain-specific data toward the end of training. This two-stage or multi-stage pretraining is widespread by 2026.

The pretraining loop

Once the corpus is prepared and packed into fixed-length sequences, the actual training loop is straightforward in form (the engineering of distributed training at frontier scale is anything but — the Efficient and Scaled Training chapter develops that):

PRETRAINING PROCEDURE
=====================

Inputs:
  corpus     : packed token sequences after filter/dedup/tokenize/pack;
               each sequence is L tokens long
  L          : context length (typical 2048, 4096, or 8192 tokens)
  B          : batch size in sequences per optimizer step
               (often 1M-8M tokens per step at frontier scale, so
                B = 256-2000 sequences with L = 4096)
  η_schedule : learning-rate schedule (typical: warmup over the first
               few thousand steps, then cosine decay to 10% of peak)
  S          : total optimizer steps (typical 10^5 to 10^7)

Initialize:
  θ ← model parameters (random init with appropriate scaling;
       see Deep Learning chapter for init schemes)

Training loop:
  for step = 1 .. S:
    # 1. Sample a batch of B sequences from the corpus.
    batch ← sample B sequences from corpus
    # Each sequence is (t_1, t_2, ..., t_L), a packed concatenation
    # of tokenized text from one or more documents.

    # 2. Forward pass. For each sequence in the batch, the model
    # produces a distribution over the vocabulary at every position,
    # conditioning only on prior tokens (causal mask, §4.5).
    L_total ← 0
    n_predicted ← 0

    for each sequence (t_1, ..., t_L) in batch:
      logits ← model_θ.forward(t_1, ..., t_L)        # shape: L × V

      # 3. Next-token cross-entropy at every position 1 .. L-1.
      # The model predicts t_{i+1} from t_1..t_i; the loss is the
      # negative log-probability the model assigns to the actual next
      # token.
      for i = 1 .. L-1:
        L_total ← L_total + (-log p_θ(t_{i+1} | t_1, ..., t_i))
        n_predicted ← n_predicted + 1

    L_mean ← L_total / n_predicted                   # per-token loss

    # 4. Backprop + optimizer step.
    η ← η_schedule(step)
    θ ← AdamW_update(θ, ∂L_mean/∂θ, η)

Output: pretrained base model π_base
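The learning-rate schedule named in the inputs (warmup, then cosine decay to 10% of peak) can be written as a small function. This is a sketch under assumptions: linear warmup, a peak learning rate of 3e-4, and 2,000 warmup steps are hypothetical choices for illustration, not values from any particular recipe.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


# Hypothetical settings for illustration only.
S, PEAK, WARMUP = 100_000, 3e-4, 2_000
for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, lr_at(step, S, PEAK, WARMUP))
```

In the training loop above, this function plays the role of `η_schedule(step)`.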

A concrete picture of a single training batch. Imagine $L = 8$ , $B = 2$ , and two packed sequences:

   sequence 1: [The_, _quick_, _brown_, _fox_, _jumps_, _over_, _the_, _lazy_]
   sequence 2: [In_, _2026_, _the_, _company_, _announced_, _a_, _new_, _model_]

   For sequence 1, the loss is the sum of:
     -log p_θ(_quick_  | The_)
     -log p_θ(_brown_  | The_, _quick_)
     -log p_θ(_fox_    | The_, _quick_, _brown_)
     ...
     -log p_θ(_lazy_   | The_, _quick_, _brown_, ..., _the_)

   For sequence 2, similarly. The total batch loss is the average
   per-token negative log-likelihood across both sequences.

The model is asked to predict every token from its predecessors, in parallel, in a single forward pass. The “parallel” is what makes Transformer pretraining tractable at scale: every position’s prediction can be computed simultaneously using causally-masked attention.

The output of pretraining is a base model (sometimes called a pretrained model or, more loosely, the foundation model). A base model is fluent but not instruction-following: it cannot reliably respond to “What is the capital of France?” with “Paris” because, having seen many lists of quiz questions during pretraining, it is just as likely to interpret the prompt as the start of a quiz article and continue with three more quiz questions. The remaining stages exist to turn the base model into something deployable.

Supervised fine-tuning (SFT)

Supervised fine-tuning (SFT) was defined in §1 and Foundation Models §1: continue training the base model on a curated dataset of (prompt, ideal response) pairs. The ideal responses are usually written by humans — often by highly trained annotators whose work is reviewed for quality and style — though by 2026 a substantial fraction of SFT data is also synthetically generated by stronger models and then filtered or edited (see §5.7 below).

A literal SFT example might look like:

   prompt:
     <|im_start|>system
     You are a helpful programming assistant. Answer concisely
     and include working code when relevant.
     <|im_end|>
     <|im_start|>user
     Write a Python function that returns the nth Fibonacci number.
     <|im_end|>
     <|im_start|>assistant

   ideal response (target):
     Here is an iterative implementation that runs in O(n) time
     and O(1) extra space:

         def fib(n):
             a, b = 0, 1
             for _ in range(n):
                 a, b = b, a + b
             return a
     <|im_end|>

The angle-bracketed tokens (<|im_start|>, <|im_end|>) are chat-template special tokens (§3.4); the model has been trained to treat them as role boundaries. SFT training updates the model’s parameters so the assistant response is highly likely given the prompt, and so that the boundary tokens fire at the right places.
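Rendering a conversation into this layout is a plain string-formatting step. The sketch below mirrors the layout of the example above, including its line breaks; `render_chat` is a hypothetical helper, and actual chat templates vary between model families in their exact whitespace and token names.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts into the template shown above."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its response.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful programming assistant."},
    {"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."},
]
prompt = render_chat(messages)
print(prompt)
```

In a real pipeline the special tokens map to single reserved token IDs during tokenization rather than being split into ordinary sub-word pieces.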

Structural difference from pretraining. SFT uses the same next-token-prediction objective as pretraining, but the loss is computed only on the response tokens, not on the prompt. The prompt is held as fixed conditioning input; we do not want to train the model to generate the prompt, only to respond to it. Diagrammatically:

   Pretraining: loss on EVERY position
   ─────────────────────────────────────────────────────────────────
   [t_1   t_2   t_3   t_4   t_5   ...   t_L]
     │     │     │     │     │           │
     │     │     │     │     │           │       at each position,
     ▼     ▼     ▼     ▼     ▼           ▼       predict the NEXT token
   pred  pred  pred  pred  pred  ...   pred      and accumulate loss
     │     │     │     │     │           │
     └─────┴─────┴─────┴─────┴───────────┘
                       │
                       ▼
           total loss = sum over ALL positions


   SFT: loss only on RESPONSE positions
   ─────────────────────────────────────────────────────────────────
   [p_1  p_2  ...  p_n  |  r_1  r_2  ...  r_m]
     ↓    ↓         ↓       │    │         │
   no loss on prompt        ▼    ▼         ▼     predict r_1, r_2, ..., r_m
   (forward pass runs       pred pred ... pred    and accumulate loss
    over the prompt;        │    │         │
    activations matter,     └────┴─────────┘     (the prompt is masked
    but loss is masked)            │              out of the loss; the
                                   ▼              model sees it as input
                       total loss = sum only       only)
                       over RESPONSE positions

Mechanically: a single binary loss mask vector says “compute loss at this position” or “skip”. During training, the loss-mask vector is 0 for the prompt tokens and 1 for the response tokens. The forward pass is the same; only the loss aggregation differs.
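The effect of the loss mask can be demonstrated on a toy sequence. In this sketch the "model" is just a hand-written table of logits per position (a hypothetical stand-in for a forward pass), and positions are 0-indexed, so `logits_per_pos[i]` is the prediction for `token_ids[i + 1]`.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def masked_nll(logits_per_pos, token_ids, loss_mask):
    """Mean next-token negative log-likelihood over positions where the mask is 1."""
    total, n = 0.0, 0
    for i in range(len(token_ids) - 1):
        if loss_mask[i + 1] == 1:  # the token being predicted is a response token
            total += -log_softmax(logits_per_pos[i])[token_ids[i + 1]]
            n += 1
    return total / n, n


prompt, response = [0, 1, 2], [3, 1]
token_ids = prompt + response
mask = [0] * len(prompt) + [1] * len(response)

# Toy logits over a 4-token vocabulary: uniform at every position.
logits = [[0.0] * 4 for _ in token_ids]
loss, n = masked_nll(logits, token_ids, mask)

# Changing the prediction made at a prompt position does not change the loss.
perturbed = [row[:] for row in logits]
perturbed[0] = [5.0, -5.0, 0.0, 0.0]
loss_perturbed, _ = masked_nll(perturbed, token_ids, mask)
print(loss, n, loss_perturbed)
```

Setting the mask to all ones recovers the pretraining loss over every position.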

The data is small relative to pretraining — typically tens of thousands to a few million examples versus the trillions of pretraining tokens — but its design is consequential. SFT-data design largely determines:

Conversational style. Terse vs. verbose, formal vs. casual, the model’s apparent “voice”.
Instruction-following reliability. Whether the model robustly responds to what was asked rather than continuing the prompt as text.
Format adherence. Whether the model produces structured outputs (JSON, code, lists) in the expected format.
Safe refusal patterns. What the model declines to produce, and in what tone.

The training objective during SFT is the same as in pretraining — next-token prediction — now applied to the supervised pairs only. The procedure is otherwise standard supervised learning: minimize cross-entropy loss on the target tokens. In pseudocode:

SFT PROCEDURE
=============

Inputs:
  π_base     : pretrained base model (output of pretraining)
  D_sft      : dataset {(x_i, y_i)} of (prompt, ideal response) pairs
               typically 10^4 to 10^6 pairs
  η_sft      : learning rate (smaller than pretraining, typical 1e-5)
  E_sft      : number of epochs (typical 1-3)
  B          : batch size

Initialize:
  Copy π_base's parameters into a new trainable model π_θ.
  θ ← parameters of π_base

Training loop:
  for epoch = 1 .. E_sft:
    for each mini-batch B sampled from D_sft:
      L ← 0
      n_response_tokens ← 0

      for each (x, y) in B:
        # Concatenate prompt and response for the forward pass.
        full_seq ← concat(x, y)

        # Build the loss mask: 0 for prompt positions, 1 for response.
        # Length |x| + |y|.
        mask ← [0] * |x| + [1] * |y|

        # Forward pass through the model.
        logits ← model_θ.forward(full_seq)

        # Loss is summed only over positions where mask == 1.
        for i = 1 .. |full_seq| - 1:
          if mask[i+1] == 1:                              # response token
            L ← L + (-log p_θ(full_seq[i+1] | full_seq[1..i]))
            n_response_tokens ← n_response_tokens + 1

      L ← L / n_response_tokens                          # mean loss

      θ ← AdamW_update(θ, ∂L/∂θ, η_sft)                  # same optimizer as pretraining

Output: SFT model π_SFT

Modern recipes extend the basic pattern in several ways:

Multi-turn SFT — training on extended conversations, not just single prompt-response pairs.
Tool-use SFT — training on demonstrations that include tool calls (function invocations, code execution, retrieval queries; the tool surface is treated in §8).
Chain-of-thought SFT — training on demonstrations that explicitly include intermediate reasoning steps before the final answer.

Preference tuning

After SFT the model usually follows instructions but may not produce the best response among the many plausible responses to a given prompt. Preference tuning (also called alignment tuning or preference alignment) is the stage that refines the model based on human (or human-trained-proxy) judgments about which responses are better. Several algorithmic families share a common structure and differ in how they use the preference signal.

The preference data. All preference-tuning methods rely on preference data: pairs of responses to the same prompt, with a human or AI-proxy judgment of which is preferred. A typical entry is (prompt, response_A, response_B, preferred = A). Preference data is cheaper to collect per example than fully written demonstrations — comparing is easier than writing — but at frontier scale it is still a substantial human-labour cost.

RLHF: Reinforcement Learning from Human Feedback. RLHF (Christiano et al., 2017; Ouyang et al., 2022) is the original preference-tuning method and the technique that ChatGPT brought to public attention. We develop it in detail because (a) it remains in production use at frontier labs, (b) it grounds the DPO derivation that follows, and (c) most of the alignment literature is RLHF-shaped, even where the actual deployed algorithm is something else.

The goal is the one we’ve already stated: update the policy $\pi_\theta$ to produce responses humans prefer, with a KL constraint against the SFT model $\pi_{\text{SFT}}$ . RLHF solves this in two distinct training stages: first train an explicit reward model, then optimize the policy against it via reinforcement learning. We treat each stage in turn.

Stage 1: training the reward model

The reward model, written $\hat{r}_\phi(x, y)$ and parameterized by $\phi$ , is a learned function that takes a prompt $x$ and a response $y$ and outputs a scalar score — a single real number predicting how strongly a human would prefer this response over alternatives.

Architecture. The reward model uses the same Transformer backbone as the SFT model (§4) — same number of layers, same hidden size, same attention heads. The only structural difference is the output head:

Transformer LLM backbone (architecture from §4, initialized from $\pi_{\text{SFT}}$)
  ↓
Final hidden states $h_1, h_2, \ldots, h_{|x|+|y|}$ for concatenated input $x \mathbin{+\!+} y$
  ↓
Take last token’s hidden state $h_{\text{last}} \in \mathbb{R}^{d_{\text{model}}}$
  ↓
Scalar head $\mathbf{W}_r$ (learned projection to $\mathbb{R}^1$)
  ↓
Scalar reward $\hat{r}_\phi(x, y) \in \mathbb{R}$

The token-prediction head of the SFT model (which maps $\mathbb{R}^{d_{\text{model}}}$ to a vocabulary-sized logit vector) is replaced by scalar head $\mathbf{W}_r$ — a single learned projection mapping $\mathbb{R}^{d_{\text{model}}}$ down to one number. The reward model’s parameters $\phi$ consist of the Transformer weights plus this scalar head. The Transformer portion of $\phi$ is initialized from $\pi_{\text{SFT}}$ 's weights so that the model starts with strong text representations and only needs to learn to map them to preference scores; $\mathbf{W}_r$ is initialized fresh.

Training data. The same preference dataset described in the introduction to this subsection: $N$ triples $(x_i, y_w^i, y_l^i)$ with $y_w^i$ the chosen (“winning”) response and $y_l^i$ the rejected (“losing”) response.

Training loss. Under the Bradley-Terry preference model (defined later in the DPO subsection), the probability that $y_w$ is preferred over $y_l$ given the model’s rewards is

P(y_w \succ y_l \mid x) \;=\; \sigma\!\big( \hat{r}_\phi(x, y_w) - \hat{r}_\phi(x, y_l) \big),

where $\sigma$ is the logistic sigmoid. To fit $\hat{r}_\phi$ to the observed preferences, we maximize the log-likelihood — equivalently, minimize the negative log-likelihood:

\mathcal{L}_{\text{RM}}(\phi) \;=\; -\, \mathbb{E}_{(x, y_w, y_l)}\, \log \sigma\!\big( \hat{r}_\phi(x, y_w) - \hat{r}_\phi(x, y_l) \big).

This is a standard binary classification-style loss on the margin $\hat{r}_\phi(x, y_w) - \hat{r}_\phi(x, y_l)$ : encourage the chosen response’s score to exceed the rejected response’s score.
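The loss can be evaluated directly on pairs of scalar rewards. The sketch below is a minimal illustration that takes the reward values as given numbers rather than computing them with a Transformer; `log_sigmoid` is written in a numerically stable form, and the function names are hypothetical.

```python
import math


def log_sigmoid(z):
    """log(sigmoid(z)), computed without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(pairs):
    """Mean of -log sigmoid(r_w - r_l) over (chosen, rejected) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# The loss falls as the chosen response's score pulls ahead of the rejected one.
for r_w, r_l in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0), (5.0, 0.0)]:
    print(f"margin {r_w - r_l:+.1f}: loss {reward_model_loss([(r_w, r_l)]):.4f}")
```

Only the difference between the two scores enters the loss, so the reward model's outputs are determined only up to an additive constant per prompt.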

Training procedure:

REWARD-MODEL TRAINING
=====================

Inputs:
  π_SFT   : SFT model (Transformer LLM, architecture §4, frozen — used
            only to initialize the reward model's weights)
  D       : preference dataset {(x_i, y_w_i, y_l_i)}_{i=1..N}
  η_RM    : learning rate (typical: 1e-6 to 1e-5)
  E_RM    : number of epochs (typical: 1-2 — reward models overfit easily)

Initialize:
  Copy π_SFT's Transformer weights into a new model.
  Replace the LM head with a fresh linear projection W_r of
  shape (d_model, 1).
  Let φ denote all the parameters (Transformer weights + W_r).

Training loop:
  for epoch in 1..E_RM:
    for each mini-batch B drawn from D:
      L ← 0
      for each (x, y_w, y_l) in B:
        # Score both responses. Each call is a forward pass
        # of the same Transformer; the final hidden state at
        # the last token is taken and W_r is applied.
        r_w ← r̂_φ(x, y_w)
        r_l ← r̂_φ(x, y_l)

        # Bradley-Terry / margin loss.
        L_example ← -log σ(r_w - r_l)
        L ← L + L_example

      L ← L / |B|

      # Standard backprop on all of φ.
      φ ← φ - η_RM · ∂L/∂φ

Output: trained reward model r̂_φ

After training, the reward model is frozen and used (without further updates) to provide scalar rewards during the policy-optimization stage.
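The training procedure above can be exercised on a toy problem. The sketch below is a deliberate simplification: synthetic three-dimensional feature vectors stand in for the last-token hidden state $h_{\text{last}}$, and only a linear scalar head is trained, whereas the real procedure updates all of $\phi$. The data, dimensions, learning rate, and function names are all hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, h):
    """Scalar head: dot product of the weights with a (stand-in) hidden state."""
    return sum(wi * hi for wi, hi in zip(w, h))

def train_reward_head(pairs, dim, lr=0.1, epochs=50):
    """Plain SGD on -log sigma(r_w - r_l) for a linear scalar head."""
    w = [0.0] * dim
    for _ in range(epochs):
        for h_w, h_l in pairs:
            margin = score(w, h_w) - score(w, h_l)
            # d/dw [-log sigma(margin)] = -(1 - sigma(margin)) * (h_w - h_l)
            coeff = 1.0 - sigmoid(margin)
            for k in range(dim):
                w[k] += lr * coeff * (h_w[k] - h_l[k])
    return w

# Synthetic preferences: the "true" judgment depends only on feature 0.
rng = random.Random(0)
pairs = []
for _ in range(200):
    a = [rng.gauss(0, 1) for _ in range(3)]
    b = [rng.gauss(0, 1) for _ in range(3)]
    pairs.append((a, b) if a[0] > b[0] else (b, a))  # (chosen, rejected)

w = train_reward_head(pairs, dim=3)
accuracy = sum(score(w, hw) > score(w, hl) for hw, hl in pairs) / len(pairs)
```

The learned weight on feature 0 should dominate, and `accuracy` reports how often the trained head ranks the chosen response above the rejected one on the training pairs.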

Stage 2: optimizing the policy with PPO

With $\hat{r}_\phi$ in hand, we now optimize the policy $\pi_\theta$ to produce responses that score highly under it, subject to staying close to $\pi_{\text{SFT}}$ . The objective is the KL-constrained reward maximization we’ll see again in the DPO derivation:

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, \; y \sim \pi_\theta(\cdot \mid x)} \big[ \hat{r}_\phi(x, y) \big] \;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}} \, \mathrm{KL}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right].

The crucial detail is the inner expectation $y \sim \pi_\theta(\cdot \mid x)$ : to evaluate the reward, we need to actually sample responses from the current policy. This makes the problem fundamentally reinforcement-learning-shaped — the algorithm must repeatedly sample from the policy, evaluate the samples, and update the policy based on the evaluation. The dataset of responses is generated by the policy itself and changes as the policy changes; this is called on-policy training, in contrast to DPO’s fully off-policy approach.

The standard algorithm is Proximal Policy Optimization (PPO) (Schulman et al., 2017). PPO was developed earlier for game-playing and robotics and adapted to LLMs in InstructGPT (Ouyang et al., 2022). PPO’s full justification is treated in the Reinforcement Learning chapter; here we describe what it does in the LLM setting and ground the concepts it uses.

Three RL concepts we need.

A rollout is a sequence of samples generated by interacting with the environment under the current policy. In LLM RLHF, “the environment” reduces to: given a prompt, produce a response. A rollout is therefore just a triple: a prompt $x$, a response $y$ sampled from $\pi_\theta(\cdot \mid x)$, and the scalar reward $\hat{r}_\phi(x, y)$.
An advantage, written $A$ , measures how much better the actual reward of a sampled response was than some baseline expectation. Higher advantage means “the policy did something surprisingly good here, reinforce it.” The simplest baseline is a running average of recent rewards; more sophisticated implementations use Generalized Advantage Estimation (GAE) with a learned value-function critic — a separate network predicting expected reward for each prompt. The RL chapter develops both.
An importance-sampling ratio $\rho = \pi_\theta(y \mid x) / \pi_{\theta_{\text{old}}}(y \mid x)$ compares the current policy’s probability of a sampled response to the old policy’s probability — the policy as it was when the rollout was generated. After one gradient step, $\theta$ no longer equals $\theta_{\text{old}}$ , so the rollout data is no longer drawn from $\pi_\theta$ . PPO uses the importance ratio to correct for this distribution mismatch.
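The last two concepts reduce to a few lines of arithmetic. The sketch below assumes per-token log-probabilities are already available for a sampled response, and it uses the batch mean as a stand-in for the running-average baseline. The helper names and the numbers are illustrative only.

```python
import math

def importance_ratio(new_token_log_probs, old_token_log_probs):
    """rho = pi_theta(y|x) / pi_theta_old(y|x), computed in log space."""
    # The sequence log-probability is the sum of per-token log-probabilities.
    return math.exp(sum(new_token_log_probs) - sum(old_token_log_probs))

def advantages_with_mean_baseline(rewards):
    """A_i = R_i - baseline, with the batch mean as the baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Log-probs of the same three sampled tokens under the old and current policy.
old_lp = [-1.2, -0.7, -2.0]
new_lp = [-1.1, -0.7, -1.9]
rho = importance_ratio(new_lp, old_lp)            # exp(0.2): slightly above 1

# Rewards for three rollouts; above-average ones get positive advantage.
adv = advantages_with_mean_baseline([1.0, 3.0, 2.0])
```

Before any gradient step the two policies coincide and the ratio is exactly 1; it moves away from 1 only as $\theta$ departs from $\theta_{\text{old}}$.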

The PPO objective. PPO updates $\pi_\theta$ using a clipped surrogate objective:

\mathcal{L}_{\text{PPO}}(\theta) \;=\; -\, \mathbb{E}_{(x, y, A)} \, \min\!\Big( \rho \cdot A, \;\; \mathrm{clip}(\rho, \, 1 - \varepsilon, \, 1 + \varepsilon) \cdot A \Big),

where $\varepsilon \approx 0.2$ is a hyperparameter. The clip bounds how much credit the objective gives for moving the importance ratio. For a response with positive advantage, once $\rho$ exceeds $1 + \varepsilon$ the clipped term is the smaller of the two and the objective goes flat, so further gradient steps in that direction don’t help. Symmetrically, for a response with negative advantage the objective goes flat once $\rho$ falls below $1 - \varepsilon$. The effect is to keep $\pi_\theta$ from straying too far from $\pi_{\theta_{\text{old}}}$ between rollout collections: a “trust region” enforced by the loss shape rather than by an explicit constraint. This is the central mechanism that distinguishes PPO from earlier policy-gradient methods.
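A small sketch of the clipped surrogate for a single sample makes the flat regions concrete. It assumes the ratio $\rho$ and advantage $A$ are given as plain floats; `ppo_clip_loss` is an illustrative name.

```python
def clip(value, low, high):
    """Clamp value into the interval [low, high]."""
    return max(low, min(high, value))

def ppo_clip_loss(rho, advantage, eps=0.2):
    """Per-sample PPO loss: -min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    unclipped = rho * advantage
    clipped = clip(rho, 1.0 - eps, 1.0 + eps) * advantage
    return -min(unclipped, clipped)

# Positive advantage: the loss stops improving once rho passes 1 + eps.
at_1_5 = ppo_clip_loss(1.5, 1.0)      # uses the clipped term (1.2 * A)
at_2_0 = ppo_clip_loss(2.0, 1.0)      # same value: the objective is flat here

# Negative advantage: the loss stops improving once rho drops below 1 - eps.
at_0_5 = ppo_clip_loss(0.5, -1.0)     # uses the clipped term (0.8 * A)
at_0_3 = ppo_clip_loss(0.3, -1.0)     # same value: flat again

# Moving the wrong way is never clipped: the penalty keeps growing.
wrong_way = ppo_clip_loss(1.5, -1.0)  # uses the unclipped term (1.5 * A)
```

The `min` makes the objective pessimistic: clipping only ever removes incentive to move further in the direction the advantage favours, never the penalty for moving against it.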

The KL penalty. The KL term $\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ from the original objective is implemented in one of two ways:

As a per-token penalty subtracted from $\hat{r}_\phi$ at rollout time: each generated token $y_t$ is penalized by $\beta \cdot (\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{SFT}}(y_t \mid x, y_{<t}))$ . The “effective reward” for the response is then the scalar from $\hat{r}_\phi$ minus this per-token KL sum.
As an explicit term in the loss, added directly to $\mathcal{L}_{\text{PPO}}$ .

Both implementations achieve the same effect: pull $\pi_\theta$ back toward $\pi_{\text{SFT}}$ when it drifts too far.
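The first option can be sketched as follows. Placing the reward model's scalar on the final token is an assumption made here (it is a common convention, not something the text above specifies); the per-token rewards sum to the effective reward either way. The function name and numbers are illustrative.

```python
def per_token_rewards(r_hat, policy_token_lp, sft_token_lp, beta=0.02):
    """Per-token KL-shaped rewards for one sampled response.

    Every token gets -beta * (log pi_theta - log pi_SFT); the scalar
    reward-model score is added at the last token (assumed convention).
    """
    rewards = [-beta * (p - s) for p, s in zip(policy_token_lp, sft_token_lp)]
    rewards[-1] += r_hat
    return rewards

# Log-probs of the three sampled tokens under the policy and the SFT reference.
policy_lp = [-0.5, -1.0, -0.2]
sft_lp = [-0.9, -1.0, -1.4]

rewards = per_token_rewards(1.5, policy_lp, sft_lp, beta=0.1)
effective = sum(rewards)   # r_hat - beta * (sum of per-token log-ratios)
```

Tokens the policy has made more likely than the reference would (first and third here) carry a negative reward; a token with unchanged probability carries none.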

Why the KL penalty matters. Without the KL constraint, the policy quickly drifts into reward hacking — generating outputs that score artificially high under the reward model but are degenerate. Common failure modes: the policy discovers a phrase the reward model is biased toward (e.g., a particular formatting flourish) and emits it everywhere; the policy produces verbose, padded text because the reward model conflates length with quality; the policy produces ungrammatical token sequences that happen to land in a high-scoring region of the reward landscape. The KL penalty pulls the policy back toward the SFT model, which is fluent and coherent by construction. Reward hacking is treated in depth in the Alignment chapter; here we record only that the KL term is what stands between RLHF and catastrophic reward exploitation.

The full PPO loop:

RLHF / PPO POLICY OPTIMIZATION
==============================

Inputs:
  π_SFT       : frozen SFT model (used to compute KL reference)
  r̂_φ        : frozen trained reward model
  D_prompts   : a set of prompts {x_1, x_2, ..., x_M_total}
  β           : KL-penalty strength (typical: 0.01-0.05 for PPO,
                smaller than for DPO because per-token KL accumulates)
  ε           : PPO clip parameter (typical: 0.2)
  η           : learning rate (typical: 1e-6)
  K_iters     : number of PPO iterations
  M_batch     : prompts sampled per iteration
  K_epochs    : optimizer epochs per rollout (typical: 2-4)

Initialize:
  θ ← parameters of π_SFT       (start the policy as a copy of SFT)

PPO loop:
  for iteration in 1..K_iters:

    # === ROLLOUT phase ===
    sample prompts {x_1, ..., x_M_batch} from D_prompts

    # Snapshot the policy parameters for importance sampling.
    θ_old ← θ

    # Generate one response per prompt by sampling from π_θ_old
    # token by token (autoregressive sampling, §1, §6).
    for each x_i:
      generate y_i ~ π_θ_old(·|x_i)

      # Score the response.
      r_i ← r̂_φ(x_i, y_i)            (forward pass through reward model)

      # Compute per-token KL contribution against SFT.
      kl_i ← Σ_t  [ log π_θ_old(y_{i,t} | x_i, y_{i,<t})
                  - log π_SFT  (y_{i,t} | x_i, y_{i,<t}) ]

      # Effective reward: scalar reward minus β-scaled KL.
      R_i ← r_i - β · kl_i

      # Compute advantage A_i. With a running-baseline approximation
      # (baseline = running average of recent effective rewards R):
      A_i ← R_i - baseline
      # (Production implementations use GAE with a value-function critic.)

    # === UPDATE phase ===
    # Multiple gradient updates per rollout. The clip keeps θ from
    # drifting too far from θ_old per iteration, which is what makes
    # the rollout data still usable.
    for epoch in 1..K_epochs:
      for each (x_i, y_i, A_i) in rollouts:
        # Importance-sampling ratio comparing current policy to
        # the rollout-time policy.
        ρ_i ← π_θ(y_i | x_i) / π_θ_old(y_i | x_i)

        # Clipped surrogate objective (negate because we minimize).
        L_PPO_i ← -min( ρ_i · A_i,
                        clip(ρ_i, 1-ε, 1+ε) · A_i )

      total_loss ← mean(L_PPO_i)

      # Backprop on the policy.
      θ ← θ - η · ∂total_loss/∂θ

Output: trained policy π_θ

A few features of this loop worth pulling out.

Three (or four) models in flight at once. The trainable policy $\pi_\theta$, the frozen SFT reference $\pi_{\text{SFT}}$ (for KL), the frozen reward model $\hat{r}_\phi$ (for scoring), and, in production implementations with GAE, a separately trained value-function critic. All of them must be resident in accelerator memory during PPO training. This is why RLHF is memory-expensive: it holds up to four models' worth of weights where standard fine-tuning holds one. DPO avoids most of this, since it needs only the policy and the frozen reference.

Each iteration requires fresh rollouts. Unlike DPO’s static dataset, PPO must sample from the current policy on every iteration. The rollout phase is essentially LLM inference at training time, with the same memory-bandwidth bottleneck as deployed serving. This is the single biggest contributor to RLHF’s compute cost relative to DPO.

Many things can go wrong. The KL coefficient $\beta$ must be tuned: too small and the policy reward-hacks; too large and the policy never moves. The clip $\varepsilon$ must be tuned: too small and updates are slow; too large and PPO loses its stability properties. The number of PPO epochs per rollout must be tuned as well: too few wastes the rollout compute; too many lets $\pi_\theta$ drift far from $\pi_{\theta_{\text{old}}}$ despite the clip. Production RLHF is full of empirical hyperparameter folklore. The Alignment chapter treats the practical recipes; the RL chapter treats the theoretical underpinnings.
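To see the rollout and update phases interact, the PPO loop can be run on a toy problem. The sketch below makes strong simplifying assumptions: a single prompt, four candidate "responses" modelled as a categorical policy over logits, a fixed reward table in place of $\hat{r}_\phi$, a uniform reference in place of $\pi_{\text{SFT}}$, a batch-mean baseline instead of GAE, and hand-written gradients. Every hyperparameter value is illustrative.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_toy(rewards, beta=0.05, eps=0.2, lr=0.3, iters=300,
            batch=32, epochs=3, seed=0):
    rng = random.Random(seed)
    k = len(rewards)
    ref = softmax([0.0] * k)      # frozen reference (stands in for pi_SFT)
    theta = [0.0] * k             # policy logits, initialised at the reference
    for _ in range(iters):
        # Rollout phase: snapshot the policy and sample from the snapshot.
        old = softmax(theta)
        ys = rng.choices(range(k), weights=old, k=batch)
        # Effective reward: table reward minus beta-scaled log-ratio vs. reference.
        eff = [rewards[y] - beta * (math.log(old[y]) - math.log(ref[y]))
               for y in ys]
        baseline = sum(eff) / batch
        adv = [r - baseline for r in eff]
        # Update phase: several gradient steps on the same rollouts.
        for _ in range(epochs):
            pi = softmax(theta)
            grad = [0.0] * k
            for y, a in zip(ys, adv):
                rho = pi[y] / old[y]
                clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
                # Gradient flows only when the unclipped term is the min.
                if rho * a <= clipped * a:
                    for j in range(k):
                        indicator = 1.0 if j == y else 0.0
                        # d(rho)/d(theta_j) = rho * (1[j == y] - pi_j); loss = -rho * A
                        grad[j] -= a * rho * (indicator - pi[j])
            theta = [t - lr * g / batch for t, g in zip(theta, grad)]
    return softmax(theta)

# Response 2 has the highest reward, so probability mass should move onto it.
final = ppo_toy([0.0, 1.0, 2.0, 0.5])
```

With these settings the policy concentrates on the highest-reward response. The reward table never changes here, so nothing in this toy can exhibit reward hacking; it illustrates only the mechanics of sampling, scoring, and clipped updates.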

The pipeline as a whole

flowchart TD
  subgraph P1["Phase 1 — train reward model (offline, one-shot)"]
    PrefData(["$$\text{preference data } \{(x_i, y_w^i, y_l^i)\}$$"])
    InitRM["$$\text{initialise } \hat{r}_\phi \text{ from } \pi_{\text{SFT}}$$"]
    TrainRM["$$\text{Bradley-Terry loss on margin } \hat{r}(x,y_w) - \hat{r}(x,y_l)$$"]
    FrozenRM(["$$\text{trained reward model } \hat{r}_\phi \text{ (frozen for Phase 2)}$$"])
    PrefData --> InitRM --> TrainRM --> FrozenRM
  end

  subgraph P2["Phase 2 — PPO policy optimisation (iterative, on-policy)"]
    Prompts(["$$\text{prompts } \{x_i\}$$"])
    Roll["$$\text{1. sample rollouts } y_i \sim \pi_\theta(\cdot|x_i)$$"]
    Score["$$\text{2. score: } r_i = \hat{r}_\phi(x_i, y_i)$$"]
    KL["$$\text{3. KL penalty per token vs } \pi_{\text{SFT}}$$"]
    EffR["$$\text{4. effective reward } R_i = r_i - \beta \cdot \mathrm{kl}_i$$"]
    Adv["$$\text{5. advantage } A_i = R_i - \text{baseline (or GAE)}$$"]
    PPO["$$\text{6. PPO clipped surrogate update on } \theta$$"]
    LoopCheck{"$$\text{more iterations?}$$"}
    Policy(["$$\text{trained policy } \pi_\theta$$"])
    Prompts --> Roll --> Score --> KL --> EffR --> Adv --> PPO --> LoopCheck
    LoopCheck -. yes — loop .-> Roll
    LoopCheck --> |no| Policy
  end

  FrozenRM -. used to score rollouts .-> Score

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class PrefData,InitRM,TrainRM,FrozenRM,Prompts,Roll,Score,KL,EffR,Adv,PPO,Policy pill

Why RLHF still exists despite DPO

By 2026, DPO and its variants have displaced RLHF as the default preference-tuning algorithm at most labs — comparable benchmark performance at a fraction of the operational cost (§5.3 returns to this comparison). RLHF nonetheless remains in production use. Three reasons.

Flexibility on the reward side. A trained reward model is a reusable artifact. Once $\hat{r}_\phi$ exists, new reward signals — a formatting linter, a safety classifier, a factuality checker — can be incorporated by combining them with $\hat{r}_\phi$ at PPO time, e.g., adding a bonus or penalty for tripping a particular detector. DPO has no equivalent, because it has no explicit reward; adding a new reward signal requires collecting new preference data and re-running DPO from scratch.

On-policy data collection. PPO samples from the current policy at training time, so the training distribution matches where the policy actually operates as it evolves. DPO trains on a static preference dataset assembled before training began; for policies that drift far from the SFT initialization, that dataset’s coverage may not match the policy’s current output distribution.

The reward model is independently useful. A trained $\hat{r}_\phi$ doubles as a proxy preference oracle — useful for ablations, for evaluating other models, for filtering generated training data, and for synthetic preference generation. DPO’s implicit-reward formulation does not yield a usable standalone reward model.

For most applications, DPO’s operational simplicity outweighs these considerations. For frontier-lab deployments with many reward signals and the engineering budget to run RL infrastructure, RLHF is often retained.

The complexity of RLHF, and the recognition that the two-stage structure was solving the same underlying problem twice, motivated the search for simpler alternatives. The next subsection — DPO — describes what that simplification looks like.

DPO: Direct Preference Optimization. Rafailov et al. (2023) showed that the reward-model step of RLHF can be eliminated entirely — the policy can be updated directly from preference data with a single supervised loss. The result is operationally simpler, and the derivation reveals something deeper about RLHF itself. We build the explanation from the ground up.

What we are actually doing

Before any math, ground the setup. At this point in the pipeline:

We have an SFT model: a Transformer LLM (architecture in §4) that has been pretrained and then supervised-fine-tuned on instruction-following demonstrations (§5.2). Its parameters are fixed. Call it $\pi_{\text{SFT}}$ .

We have a preference dataset: a collection of (prompt, chosen response, rejected response) triples judged by humans (or human-trained proxies). A literal entry might be:

prompt:   "Explain quantum entanglement in one sentence for a curious teenager."
chosen:   "Two particles can be linked so that measuring one instantly tells
           you the state of the other, no matter how far apart they are."
rejected: "Quantum entanglement is a phenomenon in which the quantum states
           of two or more particles become correlated such that the state of
           each particle cannot be described independently."

Both responses are coherent; the human judged the first more appropriate for the requested audience. A preference dataset contains thousands to millions of such triples.

Our goal: update the LLM’s parameters so that, on average, it produces responses humans rate as preferred, without straying so far from the SFT model that it loses fluency, factuality, or instruction-following.

Notation we will use. Throughout this subsection:

$x$ denotes a prompt (a sequence of input tokens).
$y$ denotes a response (a sequence of output tokens generated by the model).
$y_w$ and $y_l$ denote, respectively, the preferred (“winning”) and dispreferred (“losing”) responses in a preference pair. The $w$ / $l$ subscripts come from the preference-learning literature and are common across the modern preference-tuning papers.
When indexing examples in a preference dataset of size $N$ , we write $(x_i, y_w^i, y_l^i)$ for the $i$ -th example, with $i$ ranging from $1$ to $N$ .
$\theta$ denotes the parameters of the policy being trained — the trainable weights of the Transformer (architecture in §4). At frontier scale, $\theta$ contains billions of float values, but it is conventional to write it as a single object.
$\pi_\theta$ denotes the policy parameterized by $\theta$ . Different values of $\theta$ produce different policies. When we say “update $\theta$ ” we mean change the values in this parameter vector.
$\pi_{\text{SFT}}$ denotes the SFT reference policy — the trained SFT model used as a fixed reference. Its parameters are frozen during DPO.
$\beta > 0$ denotes the KL-penalty strength (a hyperparameter). Smaller $\beta$ lets the policy move further from $\pi_{\text{SFT}}$ ; larger $\beta$ keeps it closer.
$\eta > 0$ denotes the learning rate — how large a step gradient descent takes per update.
$\mathcal{D}$ denotes the distribution over prompts we are training on (usually a fixed dataset of prompts).
$\mathbb{E}_{z \sim p}[f(z)]$ denotes the expectation of $f(z)$ when $z$ is drawn from distribution $p$ .

Defining the foundational concepts

We need three definitions before the derivation: policy, reward, and KL divergence.

Policy. In the LLM context, the policy is just the model itself, viewed as a probability distribution over responses. The same Transformer LLM described in §4, with its parameters $\theta$ , takes a prompt $x$ as input and assigns a probability to every possible complete response $y$ . Writing the response as a sequence of $T$ tokens $y = (y_1, y_2, \ldots, y_T)$ (with $T$ the response length and $y_t$ the $t$ -th token), the policy factorizes as

\pi_\theta(y \mid x) \;=\; \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t}).

The notation in this formula:

$\prod_{t=1}^{T}$ is a product over $t = 1, 2, \ldots, T$ .
$y_t$ is the $t$ -th token of the response.
$y_{<t}$ denotes all tokens before position $t$ in the response — that is, $(y_1, y_2, \ldots, y_{t-1})$ . At $t = 1$ this is the empty sequence.
$\pi_\theta(y_t \mid x, y_{<t})$ is the probability the model assigns to the specific token $y_t$ at position $t$ , given the prompt and the tokens generated so far.

In words: the probability of the full response is the product of the next-token probabilities, computed one token at a time exactly as in §1’s autoregressive trace. The architecture stays the same; what we call a “policy” is the same forward pass we have been talking about throughout the chapter, just packaged as a probability distribution over complete responses rather than viewed one decoding step at a time.
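In code, the factorization is usually evaluated in log space, where the product becomes a sum. The sketch below assumes the per-token probabilities have already been read off the model's output distribution; the numbers are made up.

```python
import math

def sequence_log_prob(token_probs):
    """log pi(y|x) = sum over t of log pi(y_t | x, y_<t)."""
    return sum(math.log(p) for p in token_probs)

# Probabilities the model assigned to each token it was scored on.
token_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(token_probs)
p = math.exp(log_p)                       # same as 0.5 * 0.25 * 0.8

# Why log space: a long response underflows as a raw product of
# probabilities but is perfectly representable as a sum of logs.
long_response = [0.1] * 2000
raw_product = math.prod(long_response)    # underflows to 0.0
log_product = sequence_log_prob(long_response)
```

This is why the preference-tuning losses later in the section are all written in terms of $\log \pi_\theta(y \mid x)$ rather than $\pi_\theta(y \mid x)$ itself.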

Reward. A reward is a scalar number $r(x, y) \in \mathbb{R}$ (a single real value) that says how good response $y$ is for prompt $x$. A higher reward means a more preferred response. The reward can come from anywhere: a human rating, a learned reward model trained on preference data (the RLHF setup), or, as DPO will show, it can be implicit in a policy.

KL divergence. The Kullback-Leibler divergence between two probability distributions $p$ and $q$ over the same outcome space is

\mathrm{KL}(p \,\|\, q) \;=\; \sum_{y} p(y) \log \frac{p(y)}{q(y)} \quad \text{(or the integral analogue for continuous } y\text{)}.

Three properties matter:

$\mathrm{KL}(p \,\|\, q) \geq 0$ , with equality iff $p = q$ everywhere.
It is not symmetric: $\mathrm{KL}(p \,\|\, q) \neq \mathrm{KL}(q \,\|\, p)$ in general.
Intuitively, $\mathrm{KL}(p \,\|\, q)$ measures the expected log-probability gap a sample from $p$ would show under $q$ — i.e., how surprised you’d be on average if you expected $q$ but reality was drawn from $p$ .
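The first two properties are easy to check numerically on a small discrete example. The sketch below assumes $q$ is nonzero wherever $p$ is; the distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as aligned lists.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)   # non-negative
kl_qp = kl_divergence(q, p)   # non-negative, but a different number
kl_pp = kl_divergence(p, p)   # zero: identical distributions
```

The asymmetry is why the direction $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ matters: the expectation is taken under the policy being trained, so the penalty is evaluated on the responses that policy actually produces.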

In the RLHF objective below, we use $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ as a penalty term, a quantity subtracted from the expected reward in the objective (equivalently, added to the loss), to keep the trained policy $\pi_\theta$ from drifting too far from the SFT reference $\pi_{\text{SFT}}$ . Without such a constraint, the policy will over-optimize the reward signal and produce incoherent text that exploits reward-model quirks.

The training workflow

Concretely, what happens during DPO training? Given the SFT model and the preference dataset:

DPO TRAINING PROCEDURE
======================

Inputs:
  π_SFT  : the trained SFT model. A Transformer LLM (architecture §4)
           that has been pretrained (§5.1) and instruction-tuned
           (§5.2). Its parameters are frozen — they do not change
           during DPO. Used as a reference to keep π_θ close to.

  D      : the preference dataset. A collection of N triples
           {(x_i, y_w_i, y_l_i)}, where for the i-th example:
             - x_i      is the prompt
             - y_w_i    is the preferred ("winning") response
             - y_l_i    is the dispreferred ("losing") response
           Typical N: 10^4 to 10^6 triples.

  β      : the KL-penalty strength (a positive real number).
           Smaller β  ⇒ policy can move further from π_SFT.
           Larger β   ⇒ policy stays closer to π_SFT.
           Typical: 0.1.

  η      : the learning rate (a positive real number).
           How large a step the optimizer takes per update.
           Typical: 1e-6 to 1e-5 for LLM fine-tuning.

  E      : the number of epochs (passes through D).
           Typical: 1 to 3.

Initialize:
  Copy the parameter values of π_SFT into a fresh, trainable model
  with the same architecture:
        θ  ←  parameters of π_SFT.
  We will update θ; π_SFT keeps its original values and is used only
  as a fixed reference. Both models are forward-passed during
  training, but only π_θ accumulates gradients.

Training loop:
  for epoch in 1 .. E:
    for each mini-batch B drawn from D:        # |B| examples per batch
      L ← 0

      for each (x, y_w, y_l) in B:
        # Four forward passes. Same architecture (Transformer LLM),
        # different parameter values:
        log_pi_theta_w ← log π_θ(y_w | x)      # gradient flows through θ
        log_pi_theta_l ← log π_θ(y_l | x)      # gradient flows through θ
        log_pi_sft_w   ← log π_SFT(y_w | x)    # NO gradient (frozen)
        log_pi_sft_l   ← log π_SFT(y_l | x)    # NO gradient (frozen)

        # Implicit reward margins:
        #   the log-probability ratio between policy and SFT reference,
        #   scaled by β. Higher margin_w ⇒ policy assigns higher
        #   relative probability to y_w than the SFT model does.
        margin_w ← β · (log_pi_theta_w - log_pi_sft_w)
        margin_l ← β · (log_pi_theta_l - log_pi_sft_l)

        # Per-example DPO loss (derivation in "The math" below).
        # σ is the logistic sigmoid: σ(z) = 1 / (1 + exp(-z)).
        # The loss is low when margin_w > margin_l, i.e. when the
        # policy ranks y_w above y_l more strongly than the SFT
        # reference does — which is what we want.
        L_example ← -log σ(margin_w - margin_l)
        L ← L + L_example

      L ← L / |B|                              # mean over the batch

      # Standard backprop on the policy's parameters:
      # reverse-mode autodiff computes ∂L/∂θ through the two
      # forward passes that touched θ, and the optimizer takes
      # a gradient step.
      θ ← θ - η · ∂L/∂θ                        # (in practice, Adam/AdamW)

Output: trained policy π_θ

A few things to note about this workflow:

The architecture never changes: $\pi_\theta$ and $\pi_{\text{SFT}}$ are the same Transformer LLM (from §4). What changes is the parameter values in $\pi_\theta$. $\pi_{\text{SFT}}$ is just a frozen copy used as a reference.
“Training” here means gradient descent on a loss function, exactly as in any other deep-learning training. We compute a loss, compute its gradient with respect to parameters, and update parameters. The same machinery as in §5.1 (pretraining) and §5.2 (SFT), just with a different loss.
Each preference pair contributes one log-sigmoid term to the loss. We do not sample new responses from the model during training; we only score the responses that are already in the preference dataset. (This is the contrast with RLHF, where we would have to sample responses from $\pi_\theta$ inside the training loop and pass them through the reward model. RLHF is on-policy; DPO is fully off-policy.)
$\pi_{\text{SFT}}$ appearing in the margin is what makes the procedure stay close to the SFT model without an explicit KL penalty term — the implicit reward formulation builds the KL constraint into the loss itself.
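The workflow can be run end to end on a toy policy. The sketch below assumes a single prompt with three candidate responses, a categorical policy over logits, and a uniform frozen reference. For this parameterization the gradient of the per-example loss with respect to the logits has a simple closed form, used directly here in place of autodiff. All names and hyperparameters are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpo_toy(pairs, k, beta=0.5, lr=1.0, epochs=200):
    """pairs: list of (winner_index, loser_index) over k candidate responses."""
    ref_logp = log_softmax([0.0] * k)   # frozen uniform reference
    theta = [0.0] * k                   # trainable logits, start at the reference
    for _ in range(epochs):
        for w, l in pairs:
            # Score only the responses in the dataset; nothing is sampled.
            logp = log_softmax(theta)
            margin = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
            # For loss = -log sigma(margin) with softmax logits:
            #   dL/dtheta_w = -beta * sigma(-margin)
            #   dL/dtheta_l = +beta * sigma(-margin), all other entries zero.
            step = lr * beta * sigmoid(-margin)
            theta[w] += step
            theta[l] -= step
    return [math.exp(v) for v in log_softmax(theta)]

# Preferences: response 2 beats 0 and 1; response 1 beats 0.
policy = dpo_toy([(2, 0), (2, 1), (1, 0)], k=3)
```

The factor `sigmoid(-margin)` shrinks as a pair becomes correctly ranked, so pairs the policy already gets right contribute progressively smaller updates.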

The question that remains: why does this particular loss work? The derivation is worth understanding because the insight generalizes.

The math: why this loss is what it is

The DPO loss is not a heuristic. It is the maximum-likelihood objective that follows from the RLHF problem under a Bradley-Terry preference model. The derivation has four steps.

Step 1: write down the RLHF objective. RLHF trains the policy $\pi_\theta$ to maximize expected reward under a KL constraint against the SFT policy. Recall from the notation list: $\mathcal{D}$ is the distribution over prompts (the prompt portion of our training data), and $\mathbb{E}_{x \sim \mathcal{D}}[\cdot]$ denotes expectation when $x$ is drawn from $\mathcal{D}$ . The objective is

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, \; y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big] \;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}} \, \mathrm{KL}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right].

Read this in two parts. The first expectation averages over prompts $x$ drawn from $\mathcal{D}$ , and within each prompt, over responses $y$ sampled from the current policy $\pi_\theta(\cdot \mid x)$ . The inner $r(x, y)$ is the scalar reward for that (prompt, response) pair. So the first term says: produce responses with high reward, in expectation. The second expectation averages over prompts only, and computes the KL divergence between the policy’s response distribution at that prompt and the SFT reference’s response distribution at the same prompt. The coefficient $\beta > 0$ controls how heavily this drift is penalized.

PPO solves this iteratively — sample responses, evaluate them via the reward model, take a gradient step — but it doesn’t have to be solved iteratively.

Step 2: write down the closed-form optimum. This optimization has an analytical solution: the optimal policy under the KL-constrained reward-maximization objective is the SFT policy reweighted by an exponential of the reward:

\pi^*(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\text{SFT}}(y \mid x) \, \exp\!\left( \tfrac{1}{\beta} r(x, y) \right),

where $Z(x) = \sum_{y'} \pi_{\text{SFT}}(y' \mid x) \exp(r(x, y') / \beta)$ is a normalization (the partition function) that depends on the prompt $x$ but not on the response $y$ . Reading: the optimal policy upweights responses with high reward (in proportion to $\exp(r/\beta)$ ) on top of the SFT base distribution.
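The closed form can be checked numerically when the response space is small enough to enumerate. The sketch below assumes a single prompt with three possible responses and made-up values for $\pi_{\text{SFT}}$, $r$, and $\beta$. It also uses the standard fact that the objective evaluated at $\pi^*$ equals $\beta \log Z(x)$.

```python
import math

def optimal_policy(pi_sft, rewards, beta):
    """pi*(y) = pi_SFT(y) * exp(r(y) / beta) / Z, for an enumerable response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, pi_sft, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_SFT)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_sft) if p > 0)
    return expected_reward - beta * kl

pi_sft = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

pi_star, z = optimal_policy(pi_sft, rewards, beta)
best = objective(pi_star, pi_sft, rewards, beta)

# Any other policy should score no higher than pi*.
at_sft = objective(pi_sft, pi_sft, rewards, beta)             # KL is zero here
at_greedy = objective([0.001, 0.001, 0.998], pi_sft, rewards, beta)
```

Staying at $\pi_{\text{SFT}}$ forgoes reward, while collapsing onto the single best response pays too much KL; $\pi^*$ sits between the two, and a smaller $\beta$ moves it toward the greedy end.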

Step 3: invert the formula. Solve the closed form for $r$ in terms of $\pi^*$ :

r(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{SFT}}(y \mid x)} \;+\; \beta \log Z(x).

This is the central observation. The reward is implicit in the policy. For any policy, you can read off the reward function under which that policy would be optimal, up to the $x$ -only term $\beta \log Z(x)$ .

Step 4: cancel the partition function via Bradley-Terry. The Bradley-Terry preference model formalizes how preferences arise from rewards. Given two responses $y_w$ (preferred) and $y_l$ (dispreferred) to the same prompt $x$ :

P(y_w \succ y_l \mid x) \;=\; \sigma\!\big( r(x, y_w) - r(x, y_l) \big),

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid. The key detail: this involves the difference of rewards on the same prompt. Plug in the implicit reward from Step 3 and the $\beta \log Z(x)$ terms cancel — they appear identically on both sides of the difference. We are left with

P(y_w \succ y_l \mid x) \;=\; \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{SFT}}(y_w \mid x)} \;-\; \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{SFT}}(y_l \mid x)} \right).

The reward model has disappeared. The probability of the preference is now expressed entirely in terms of the policy $\pi^*$ and the frozen reference $\pi_{\text{SFT}}$ .
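Steps 3 and 4 can be verified on the same kind of enumerable toy: build $\pi^*$ from a known reward, recover the implicit reward from the log-ratio, and confirm that the preference probability is unchanged. The distributions and rewards below are made up for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

pi_sft = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Step 2: the closed-form optimum.
weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Step 3: implicit reward beta * log(pi* / pi_SFT). It differs from the
# true reward by the same constant, beta * log Z, for every response.
implicit = [beta * math.log(ps / p0) for ps, p0 in zip(pi_star, pi_sft)]
offsets = [r - imp for r, imp in zip(rewards, implicit)]

# Step 4: Bradley-Terry depends only on reward differences, so the
# constant cancels and both versions give the same preference probability.
w, l = 2, 0
p_true = sigmoid(rewards[w] - rewards[l])
p_implicit = sigmoid(implicit[w] - implicit[l])
```

The cancellation relies on both responses sharing the same prompt; rewards for different prompts carry different $\beta \log Z(x)$ offsets and would not cancel.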

To fit a policy $\pi_\theta$ from preference data $\{(x, y_w, y_l)\}$ , we maximize this probability — equivalently, minimize its negative log — over $\theta$ . That gives the DPO loss the training procedure above is minimizing:

\mathcal{L}_{\text{DPO}}(\theta) \;=\; -\, \mathbb{E}_{(x, y_w, y_l)} \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{SFT}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{SFT}}(y_l \mid x)} \right).

This is a standard supervised loss in $\theta$ : differentiable, optimizable with backprop and any standard optimizer (Adam/AdamW). No reward model. No PPO. No rollout sampling.
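For a single preference triple, the loss is a few lines once the four sequence log-probabilities are in hand. The sketch below takes them as plain floats; in a real training run they would come from forward passes through $\pi_\theta$ and $\pi_{\text{SFT}}$. The function names and example values are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(lp_theta_w, lp_theta_l, lp_sft_w, lp_sft_l, beta=0.1):
    """Per-example DPO loss from four sequence log-probabilities."""
    implicit_reward_w = beta * (lp_theta_w - lp_sft_w)
    implicit_reward_l = beta * (lp_theta_l - lp_sft_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)

# At initialisation the policy equals the reference: margin 0, loss log 2.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)

# Policy has moved the wrong way relative to the reference.
regressed = dpo_loss(-11.0, -11.0, -10.0, -12.0)
```

Only the change relative to the reference matters: the reference already assigned the chosen response a higher log-probability in all three cases, yet the loss at initialisation is still $\log 2$.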

The pipelines compared

flowchart LR
  subgraph RLHF["RLHF"]
    RD(["$$\text{preference data}$$"])
    RM["$$\text{reward model } \hat{r}(x,y)$$"]
    PPOLoop["$$\text{PPO rollout loop}$$"]
    Pi_r["$$\pi_\theta$$"]
    UTheta_r(["$$\text{updates } \theta$$"])
    KLSFT_r["$$\text{KL penalty vs } \pi_{\text{SFT}}$$"]
    RD --> RM
    RM -. used every step .-> PPOLoop
    Pi_r -. sample responses .-> PPOLoop
    PPOLoop --> UTheta_r
    KLSFT_r -. separate loss term .-> PPOLoop
  end

  subgraph DPO["DPO"]
    DD(["$$(x, y_w, y_l) \text{ — preference data}$$"])
    Loss["$$\text{supervised DPO loss}$$"]
    Ref["$$\pi_{\text{SFT}} \text{ frozen (log-prob ratios only)}$$"]
    UTheta_d(["$$\text{updates } \theta \text{ on } \pi_\theta$$"])
    DD --> Loss --> UTheta_d
    Ref -. log-prob ratios .-> Loss
  end

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class RD,RM,PPOLoop,Pi_r,UTheta_r,KLSFT_r,DD,Loss,Ref,UTheta_d pill

The insight worth keeping

The takeaway is not the algebra but the underlying duality. RLHF and DPO do not solve different problems; they are different parameterizations of the same one. RLHF fits the reward function, then uses it to update the policy. DPO recognizes that the policy already encodes the reward implicitly, so the explicit reward representation is redundant and we can update the policy directly from preferences. The same problem; one representation needs an auxiliary model, the other does not.

This duality is the kind of observation that recurs in modern ML: an algorithm that looks complicated because it builds an explicit intermediate object can often be simplified by recognizing that the intermediate object is determined by the thing you actually want. Score-based generative models, energy-based models, and several other techniques have the same flavour.

Practical consequences

DPO is dramatically simpler to implement than RLHF: a single supervised loss on preference pairs, with the SFT model held frozen as a reference. There is no reward-model architecture to design, no PPO hyperparameter sweep, no concern about reward hacking during PPO rollouts. Empirically, DPO matches or modestly underperforms RLHF on standard benchmarks; for many applications the simplification more than compensates. By 2026, DPO and its variants are the default preference-tuning algorithm at most labs, with RLHF retained where its flexibility — for example, easy incorporation of new reward signals after the fact — justifies the operational cost.

DPO is not free of failure modes. It can overshoot, driving the preferred-response log-likelihood ratio to extreme values without proportionate quality gains; the IPO variant (next subsection) addresses this. It requires preference pairs in a specific format (one chosen and one rejected response per prompt), which is more restrictive than RLHF’s reward-model-mediated workflow. And the theoretical equivalence to RLHF holds under specific assumptions about the Bradley-Terry preference model and the KL regularization; in practice these are approximate.

DPO variants: IPO, KTO, ORPO

DPO’s basic form has known failure modes, and several variants address them. We develop each briefly — enough that the reader can see what’s being modified and why.

IPO: Identity Preference Optimization (Azar et al., 2023). DPO’s loss uses the log-sigmoid of the margin between the preferred and rejected log-likelihood ratios. As that margin grows, the loss keeps decreasing toward zero but never reaches it, so no finite margin is ever a minimum. In practice this lets DPO push the preferred-response log-likelihood ratio to extreme values long after the model is already strongly preferring the right response, without proportionate quality gains. The model overfits the preference signal.

IPO replaces the log-sigmoid loss with a squared-error loss against a target margin $\tau$:

\mathcal{L}_{\text{IPO}}(\theta) \;=\; \mathbb{E}_{(x, y_w, y_l)} \left[ \left( \beta \, \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{SFT}}(y_w \mid x)} - \beta \, \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{SFT}}(y_l \mid x)} - \tau \right)^2 \right].

Once the margin reaches $\tau$, the squared term is zero, so there is no incentive to push further, and overshooting the target is penalized. The model converges to a bounded preference margin rather than diverging. In practice IPO is more conservative than DPO; it sacrifices some peak performance for stability.
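The contrast between the two losses is easiest to see as a function of the margin alone. The sketch below is illustrative rather than a training implementation: `margin` stands for the $\beta$-scaled difference of log-likelihood ratios that appears inside both losses, and the function names are hypothetical.

```python
import math

def dpo_loss(margin: float) -> float:
    """Negative log-sigmoid of the margin, written as a stable softplus(-margin)."""
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_grad(margin: float) -> float:
    """d(dpo_loss)/d(margin) = -sigmoid(-margin): negative for every finite margin."""
    return -1.0 / (1.0 + math.exp(margin))

def ipo_loss(margin: float, tau: float) -> float:
    """Squared distance between the margin and the target margin tau."""
    return (margin - tau) ** 2

def ipo_grad(margin: float, tau: float) -> float:
    """d(ipo_loss)/d(margin): zero at tau, positive once the margin overshoots."""
    return 2.0 * (margin - tau)

tau = 2.0  # illustrative target margin
for m in (0.0, 2.0, 5.0, 10.0):
    print(f"margin={m:5.1f}  dpo={dpo_loss(m):.5f}  ipo={ipo_loss(m, tau):.2f}")
```

The DPO column keeps shrinking as the margin grows, so gradient descent always prefers a wider margin; the IPO column bottoms out at `tau` and rises again beyond it.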

KTO: Kahneman-Tversky Optimization (Ethayarajh et al., 2024). DPO and IPO both require pairwise preference data — a chosen response and a rejected response for the same prompt. Pairwise data is more expensive to collect than single-label data (“was this response good or bad?”); rejected responses must be generated and curated alongside chosen ones.

KTO uses single-label data: an entry is either $(x, y, \texttt{good})$ or $(x, y, \texttt{bad})$ , no pair required. The loss is motivated by prospect theory — Kahneman and Tversky’s account of how humans weight gains and losses asymmetrically — and asymmetrically penalises worsening a desirable response (a “loss”) more strongly than it rewards improving one (a “gain”):

   For a "good" example (x, y, good):
     reward ← β · log(π_θ(y|x) / π_SFT(y|x))
     loss   ← gain_term(reward)          (concave, gentle)

   For a "bad" example (x, y, bad):
     reward ← β · log(π_θ(y|x) / π_SFT(y|x))
     loss   ← loss_term(reward)          (convex, sharper)

The asymmetry matches the prospect-theory finding that losses loom larger than gains. Operationally, KTO is attractive because single-label data is much easier to gather (every accept/reject signal in a deployed product becomes training data) and because the same loss handles both signals.
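A minimal per-example sketch, assuming the logistic value function used in the KTO paper: the reward is measured relative to a reference point, passed through a sigmoid, and weighted by a separate coefficient for desirable and undesirable examples. The names `z_ref`, `lambda_d`, and `lambda_u` are illustrative; in the paper the reference point is estimated from a batch-level KL term, which is replaced here by a plain argument.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             z_ref: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Single-label loss for one (x, y, good/bad) example.

    logp_policy, logp_ref: sequence log-probabilities of y under the
    current policy and the frozen SFT reference.
    z_ref: reference point (assumed given; a stand-in for the KL estimate).
    """
    reward = logp_policy - logp_ref
    if desirable:
        # Value rises as the policy makes a good response more likely.
        value = lambda_d * sigmoid(beta * (reward - z_ref))
        return lambda_d - value
    # Value rises as the policy makes a bad response less likely.
    value = lambda_u * sigmoid(beta * (z_ref - reward))
    return lambda_u - value

# Raising the likelihood of a good response lowers its loss;
# raising the likelihood of a bad response raises its loss.
print(kto_loss(-8.0, -10.0, desirable=True), kto_loss(-12.0, -10.0, desirable=True))
print(kto_loss(-8.0, -10.0, desirable=False), kto_loss(-12.0, -10.0, desirable=False))
```

Choosing `lambda_u` larger than `lambda_d` is one way to make undesirable examples weigh more heavily; the right ratio depends on the balance of good and bad examples in the data.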

ORPO: Odds Ratio Preference Optimization (Hong et al., 2024). DPO, IPO, and KTO all require a separate preference-tuning stage after SFT. ORPO folds the preference signal into the SFT stage itself: a single combined loss is the standard SFT cross-entropy loss on the chosen response, plus an odds-ratio penalty term that suppresses the rejected response. Concretely:

\mathcal{L}_{\text{ORPO}}(\theta) \;=\; \mathcal{L}_{\text{SFT}}(x, y_w) \;+\; \lambda \cdot \mathcal{L}_{\text{OR}}(x, y_w, y_l),

where $\mathcal{L}_{\text{OR}}$ is a log-odds-ratio penalty pushing the policy’s likelihood for $y_w$ above its likelihood for $y_l$ . The advantage is operational: one training run, one optimizer state, one set of hyperparameters — no separate preference-tuning phase to design, tune, and run. ORPO trades some of DPO’s theoretical elegance for end-to-end simplicity.
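The combined loss can be sketched for a single preference pair. This assumes, following the ORPO paper as we read it, that each response’s likelihood is the length-normalized (per-token average) log-probability and that the odds of a response are $p / (1 - p)$. The function names and the value of `lam` are illustrative.

```python
import math

def softplus(z: float) -> float:
    """Stable log(1 + exp(z)); softplus(-z) equals -log(sigmoid(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """SFT cross-entropy on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    or_term = softplus(-log_odds_ratio)  # -log sigmoid(log odds ratio)
    return sft_term + lam * or_term

# Same chosen-response likelihood; the penalty shrinks as the
# rejected response becomes less likely under the policy.
print(orpo_loss(-0.5, -0.6))
print(orpo_loss(-0.5, -2.0))
```

Note that no reference model appears anywhere in the computation: only the policy’s own likelihoods for the two responses are needed.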

GRPO: Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024) and later used to train DeepSeek-R1, is the algorithm most associated with reasoning models (§5.4) and with the test-time-compute era of LLMs. It deserves a real treatment because of its role rather than its complexity: the algorithm itself is conceptually clean once you see what it removes.

GRPO sits between RLHF and DPO in spirit. Like RLHF, it is on-policy: it samples responses from the current policy and updates the policy from the resulting feedback. Like DPO, it can dispense with a separate reward model, provided the reward signal is something the environment can compute directly — typically a verifiable outcome reward such as “did the model produce the correct mathematical answer?” or “did the generated code pass the test cases?”.

The key idea is a group baseline. Instead of training a value-function critic (the GAE approach in RLHF, recall the “(or four) models in flight” remark), GRPO samples multiple responses from the policy for each prompt, computes the reward for each, and uses the mean reward over the group as the baseline for advantage estimation. A response’s advantage is just how much better its reward is than the group’s average.

GRPO TRAINING (sketch)
======================

Inputs:
  π_θ        : current policy (Transformer LLM, initialized from SFT)
  D_prompts  : prompt dataset (e.g., math problems with verifiable answers)
  G          : group size — responses sampled per prompt (typical: 8-64)
  ε, β       : PPO-style clip parameter; KL-penalty weight against π_SFT
  η          : learning rate
  K_iters    : number of training iterations
  K_epochs   : optimization epochs per rollout batch

  reward_fn(x, y) → R  : a function that scores responses.
                         For math problems: 1.0 if the final answer
                         is correct, 0.0 otherwise.
                         For code: 1.0 if all test cases pass, fractional
                         credit for partial passes, 0.0 if syntax error.

Training loop:
  for iteration = 1 .. K_iters:

    # === ROLLOUT phase ===
    sample a batch of prompts {x_1, ..., x_M} from D_prompts

    for each prompt x_i:
      # Generate G independent responses by sampling from the
      # current policy (this is the "group").
      for g = 1 .. G:
        y_{i,g} ~ π_θ(·|x_i)

      # Score each response with the external reward function.
      for g = 1 .. G:
        r_{i,g} ← reward_fn(x_i, y_{i,g})

      # Compute group baseline: mean reward in the group for prompt i.
      r_mean_i ← mean(r_{i,1}, ..., r_{i,G})

      # Per-response advantage relative to the group's mean.
      for g = 1 .. G:
        A_{i,g} ← r_{i,g} - r_mean_i
        # (Some implementations also divide by the standard
        # deviation within the group for normalization.)

    # === UPDATE phase ===
    # Same clipped-surrogate update as PPO, but using the
    # group-relative advantages and no value-function critic.
    π_θ_old ← frozen copy of π_θ    # the policy that produced the rollouts
    for epoch in 1..K_epochs:
      L_total ← 0
      for each (x_i, y_{i,g}, A_{i,g}) sample:
        ρ ← π_θ(y_{i,g} | x_i) / π_θ_old(y_{i,g} | x_i)
        L_clip ← -min( ρ · A_{i,g},
                       clip(ρ, 1-ε, 1+ε) · A_{i,g} )

        # KL penalty against the SFT reference, applied either
        # as a per-token penalty on the reward or as a separate
        # loss term.
        L_kl ← β · KL(π_θ(·|x_i) || π_SFT(·|x_i))

        L_total ← L_total + L_clip + L_kl

      # One gradient step on the loss averaged over all M·G samples.
      θ ← θ - η · ∂(L_total / (M·G))/∂θ

Output: trained policy π_θ

What GRPO buys, relative to standard PPO-based RLHF:

No reward model. When the reward is verifiable (mathematics, code execution, structured-output validation), the learned reward model $\hat{r}$ can be replaced by a direct computation. No preference-labelling labour, no reward-model training, no reward-model misspecification.
No value-function critic. The group baseline plays the role that the value-function critic plays in standard PPO/GAE. With a verifiable reward, only two networks remain in flight (policy and SFT reference) alongside a programmatic reward function, instead of four (policy, SFT reference, reward model, value critic).
High signal-to-noise on hard problems. Sampling $G$ responses per prompt lets the algorithm distinguish “this is a hard prompt where any response gets low reward” (low $r_{\text{mean}}$ , similar individual rewards, small advantages — small gradient) from “this prompt is solvable but only some sampled paths got it” (variance in $r$ across the group, meaningful advantages — large gradient on the correct paths).

The third property is what makes GRPO well-suited to reasoning-model training. On a problem that the policy solves correctly only $20\%$ of the time with a binary reward, the group mean is about $0.2$, so each correct response receives an advantage of about $+0.8$ and each incorrect one about $-0.2$: the algorithm pushes hard on the $20\%$ that worked and mildly down-weights the $80\%$ that didn’t. Over many iterations the policy concentrates on reasoning paths that produce correct answers. The Reasoning Models chapter develops this in full, including the connection to chain-of-thought training (§7) and the test-time-compute trade-off.
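The two numerical pieces of the GRPO sketch, the group-relative advantage and the clipped surrogate, are small enough to run on toy numbers. The following is an illustration only: it operates on scalar rewards and probability ratios rather than on a real policy, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, normalize=False, eps=1e-8):
    """Advantage of each response relative to its own group's mean reward.

    With normalize=True the advantages are also divided by the group's
    standard deviation, as some implementations do.
    """
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        scale = pstdev(rewards) + eps
        advantages = [a / scale for a in advantages]
    return advantages

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped loss for one sample (to be minimized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return -min(ratio * advantage, clipped_ratio * advantage)

# A group of 10 sampled responses, 2 of which are correct (binary reward).
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(group_advantages(rewards))

# A prompt nobody in the group solved: every advantage is zero, so no gradient.
print(group_advantages([0.0] * 8))
```

The second call shows why uniformly failed (or uniformly solved) prompts contribute nothing to the update: with no variation inside the group, every advantage is zero.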

RLAIF and Constitutional AI. Reinforcement learning from AI feedback (RLAIF) replaces the human preference labels with judgments from a stronger LLM (or the same LLM evaluating its own outputs against written criteria). Constitutional AI (Bai et al., 2022) is the canonical instance: a “constitution” — a list of natural-language principles — is used to generate critiques of model outputs, which then drive an iterative self-improvement loop. The Alignment chapter develops the substance; for the LLM training pipeline, the practically relevant point is that AI-generated preference data is now widespread and reduces — though does not eliminate — the human-labelling bottleneck.

Reasoning-RL (deferred)

For reasoning models (described in §2), an additional training stage applies reinforcement learning to the model’s own reasoning traces, typically with outcome-based rewards (was the final answer correct?) and GRPO-style algorithms. This stage produces models that allocate substantial test-time compute to deliberation. We defer the substance to the Reasoning Models chapter.

Continual pretraining and knowledge updating

A pretrained, fine-tuned, preference-tuned LLM has a knowledge cutoff: a date beyond which it has no training data and therefore no internal knowledge of events. By the time a model reaches deployment, the cutoff is typically months in the past; after a year in production, its knowledge is well over a year stale. Closing the gap is knowledge updating, and its principal mechanism is continual pretraining: apply additional pretraining-style training on a corpus that includes newer data, then re-run SFT and preference tuning on top.

The structural problem this creates is catastrophic forgetting — the tendency of further training to overwrite earlier-learned material:

   Original training distribution:
     [- - - period 1 data - - - period 2 - - - period 3 - - -]
     model learns from all of this; weights θ_0 encode all eras.

   Naive continual training on new data only:
     [period 4 data]
                │
                ▼
     all gradient signal pushes weights toward period 4 statistics
                │
                ▼
     model θ_1 strongly biases toward period 4;
     period 1-3 knowledge degrades — catastrophic forgetting.

   Mitigated continual training (e.g., 80% old + 20% new):
     [- old replay -|- new period 4 -|- old replay -|- new -|...]
                │
                ▼
     gradient signal stays balanced; new knowledge added
     without erasing old.

Four mitigations recur in practice, each addressing the forgetting problem from a different angle:

Data mixing with replay. Continual training proceeds on a mixture of old and new data — typically 80–95% old data sampled from the original pretraining corpus (the “replay”) plus 5–20% new data. The replay supplies the gradient signal that prevents the model from drifting on old content.
Low learning rate. Continual training uses a much smaller learning rate than the original pretraining (often 10× to 100× smaller). Small updates accumulate the new information slowly while perturbing previously-encoded representations only mildly.
Regularization toward the original model. A penalty added to the loss (typically a KL divergence between the continued model’s output distribution and the original model’s on a held-out set, or an L2 distance between the parameter vectors) pulls the updated model back toward its starting point. This is the same role the KL penalty plays in RLHF.
Modular approaches. Rather than updating the full parameter vector, train adapters (small trainable modules inserted into specific layers, e.g., LoRA) on the new data, keeping the base parameters frozen. The adapters carry the new knowledge; if they fail or need rollback, the base model is untouched.
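The first mitigation, data mixing with replay, amounts to a weighted choice of data source for every training example. A minimal sketch, assuming the old and new corpora are simple in-memory lists; the function name and the 20% default are illustrative.

```python
import random

def replay_mixture(old_docs, new_docs, new_fraction=0.2, n_samples=1000, seed=0):
    """Draw a training stream in which about `new_fraction` of the examples
    come from the new corpus and the rest are replayed from the old one."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_samples):
        source = new_docs if rng.random() < new_fraction else old_docs
        stream.append(rng.choice(source))
    return stream

old_corpus = ["old-doc-%d" % i for i in range(100)]
new_corpus = ["new-doc-%d" % i for i in range(10)]
stream = replay_mixture(old_corpus, new_corpus, new_fraction=0.2, n_samples=10_000)
share_new = sum(doc.startswith("new") for doc in stream) / len(stream)
print(f"share of new data in the stream: {share_new:.3f}")
```

Because the new corpus is usually much smaller than the old one, each new document is seen many more times than each replayed document at the same mixture ratio, which is one reason the new-data share is kept low.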

None of these mitigations is fully satisfactory. The fundamental tradeoff is between fast adaptation to new information and stability of existing capabilities. A model that updates too readily drifts; a model that updates too cautiously stays stale. As of 2026 the practical recipe at frontier labs is some combination of all four — moderate replay, low learning rate, regularization, occasional modular knowledge isolation — applied on a periodic (quarterly or monthly) re-training schedule rather than as continuous online learning. This is the cross-chapter open problem OP-FM-8 (knowledge updating without catastrophic forgetting; see Foundation Models §11). The LLM-specific facet — that knowledge updates interact with SFT and preference-tuning behaviour, not just with raw knowledge — shows up in §13.

Distillation

Once a high-quality model exists, distillation transfers its capabilities to a smaller model. Introduced in deep learning by Hinton et al. (2015) for image classifiers and adapted to LLMs, distillation is now responsible for much of the open-weights ecosystem.

The motivating economics: a 7-billion-parameter model that captures 90% of a 70-billion-parameter model’s capability is dramatically cheaper to serve, and most production deployments accept the 10% quality loss in exchange for the 10× cost reduction. The basic idea: train a student model (smaller, cheaper) to match a teacher model (larger, more capable). Three variants matter in practice; they differ in what signal the student learns from.

Hard-label (token-level) distillation

The simplest form. For each training prompt, the teacher generates a response; the student is trained on that (prompt, response) pair using the standard SFT loss (per-position cross-entropy on the response tokens).

   prompt x ──▶ TEACHER ──▶ emits response y_1, y_2, ..., y_T
                             (chosen tokens only)
                                   │
                                   ▼
                          (x, y_1 ... y_T)
                                   │
                                   ▼
   STUDENT trained on this pair via SFT loss
   (next-token cross-entropy on the actual y tokens)

This is effectively SFT where the “ideal responses” come from the teacher rather than from human annotators. No teacher logits are required — the teacher only needs to produce text — so this is the distillation method that works against closed-API teachers, where only generated text is accessible. The cost: information the teacher had but did not emit (its uncertainty between near-equivalent token choices, its weighting of alternatives) is discarded.

Soft-label distillation

A richer signal: instead of matching only the teacher’s emitted tokens, the student matches the teacher’s full output distribution at each position. The loss is the KL divergence between the teacher’s per-token distribution and the student’s:

\mathcal{L}_{\text{soft}}(\theta) \;=\; \mathbb{E}_{x,\,t}\, \mathrm{KL}\!\left( \pi_{\text{teacher}}(\cdot \mid x, y_{<t}) \;\|\; \pi_\theta(\cdot \mid x, y_{<t}) \right).

The student is trained to assign similar probabilities to all tokens at each position, not just the one the teacher sampled. This captures information the teacher had but did not emit: the alternatives it considered, the relative weights it assigned to near-synonyms, the magnitude of its uncertainty at each step. Soft-label distillation typically produces better students than hard-label at the same data scale, at the cost of requiring access to the teacher’s logits, which closed-API teachers generally do not expose in full.

Side by side:

   HARD-LABEL                       SOFT-LABEL
   ──────────                       ──────────

   teacher                           teacher
      │                                 │
      ▼ emit token y_t                  ▼ produce full distribution
                                          p_teacher(·|context)
   student trains on (x, y_t):       
                                       student trains to match
   loss = -log p_student(y_t|...)     the WHOLE distribution:
                                       loss = KL(p_teacher || p_student)

   teacher uncertainty discarded     teacher uncertainty preserved
   works with closed-API teacher     requires teacher logits
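The difference between the two signals shows up on a single position with a three-token vocabulary. In this sketch (toy distributions, hypothetical function names) two students assign the same probability to the token the teacher emitted, so the hard-label loss cannot tell them apart, while the soft-label loss can.

```python
import math

def hard_label_loss(student_probs, teacher_token):
    """Cross-entropy on the single token the teacher emitted."""
    return -math.log(student_probs[teacher_token])

def soft_label_loss(teacher_probs, student_probs):
    """KL(teacher || student) over the whole next-token distribution."""
    return sum(p * math.log(p / student_probs[tok])
               for tok, p in teacher_probs.items() if p > 0)

teacher   = {"big": 0.5, "large": 0.4, "tiny": 0.1}
student_a = {"big": 0.5, "large": 0.4, "tiny": 0.1}  # matches the teacher everywhere
student_b = {"big": 0.5, "large": 0.1, "tiny": 0.4}  # same top token, wrong alternatives

for name, student in (("A", student_a), ("B", student_b)):
    print(name,
          round(hard_label_loss(student, "big"), 4),
          round(soft_label_loss(teacher, student), 4))
```

Student B has swapped the near-synonym and the antonym, an error that only the full-distribution loss penalizes.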

On-policy distillation

In the two variants above, the training data is generated by the teacher. The student is therefore trained on the teacher’s distribution of behaviour, which is not the distribution the student itself produces at deployment. At inference time the student conditions on its own earlier tokens and so reaches situations its training never covered: a distribution-mismatch failure mode.

On-policy distillation flips the data direction: the student generates its own outputs; the teacher labels them (with corrections, with quality scores, or with preferred alternatives). The student is then trained on (student-generated, teacher-corrected) pairs. The training distribution now matches the deployment distribution.

   STUDENT generates response
        │
        ▼
   TEACHER labels / corrects the student’s response
        │
        ▼
   (student_prompt, student_response, teacher_label)
        │
        ▼
   STUDENT trains on label
   (correction, quality score, or preference signal)

On-policy distillation is closer to RLHF in spirit — it requires sampling from the current student policy at training time, like PPO — but the “reward” signal comes from a teacher’s labels rather than a learned reward model. The expensive side is rollout cost; the benefit is that the student covers exactly the distributions it will encounter at deployment.

Trace distillation for reasoning models

For reasoning models (§5.4 / Reasoning Models chapter), the teacher’s reasoning trace — its intermediate chain-of-thought, not just its final answer — can be distilled along with the answer. A small student trained on traces from a much larger reasoning teacher inherits the structure of the teacher’s deliberation. This is how many “small reasoning models” (sub-10B-parameter models with strong reasoning behaviour) are produced in 2026: distillation from a frontier reasoning teacher rather than RL-from-scratch.

Where distillation sits in the ecosystem

Distillation is responsible for much of the open-weights LLM ecosystem. Smaller, deployable models are often distilled from larger ones: sometimes within the same lab (Meta’s Llama 3.2 1B and 3B models, for example, were trained with outputs of the Llama 3.1 8B and 70B models as targets), sometimes by training students on outputs of competing frontier models. The latter practice raises unresolved legal and ethical questions around terms of service, copyright, and competition, which we touch on in §10.

Synthetic data

The synthetic-data turn is the practice of using model-generated data in further training. By 2026 it is unavoidable: the supply of high-quality unlabelled web text is finite, and at frontier scale labs are increasingly training on data that another model — or the same model in an earlier iteration — generated.

Where synthetic data enters the pipeline

Synthetic data feeds into every stage of the training pipeline introduced above:

flowchart TD
  Synth(["$$\text{synthetic data}$$"])

  PT["$$\textbf{Pretraining}$$\nreal web/code/books PLUS synthetic augmentation (math, code, low-resource languages, structured formats)"]
  SFT["$$\textbf{SFT}$$\nreal human demonstrations PLUS synthetic demonstrations from a stronger teacher, filtered for quality"]
  Pref["$$\textbf{Preference tuning}$$\nreal human preference pairs PLUS RLAIF judgments (Constitutional AI)"]
  RRL["$$\textbf{Reasoning-RL}$$\nteacher reasoning traces as targets for student reasoning-RL or trace distillation"]

  Synth --> PT
  Synth --> SFT
  Synth --> Pref
  Synth --> RRL

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Synth,PT,SFT,Pref,RRL pill
| Token | Probability | Cumulative |
|---|---|---|
| `" Paris"` | 0.50 | 0.50 |
| `" France"` | 0.20 | 0.70 |
| `" the"` | 0.12 | 0.82 |
| `" a"` | 0.08 | 0.90 |
| `" our"` | 0.04 | 0.94 |
| `" home"` | 0.03 | 0.97 |
| `" happy"` | 0.02 | 0.99 |
| (rest of vocab) | 0.01 | 1.00 |

Under each strategy:

Greedy. Always picks " Paris" (the highest probability). Deterministic.
Temperature $T = 1$, pure sampling. Samples from the full distribution as-is. Most likely outcome " Paris" (50% of the time), but " France", " the", etc., are all possible with their listed probabilities.
Temperature $T = 0.5$ (sharpening). Divide the logits by $T$ and re-normalize, which here amounts to squaring each probability and renormalizing. The top probability is concentrated further: " Paris" becomes ≈ 0.82, " France" becomes ≈ 0.13, the rest split the residual. Effectively closer to greedy.
Temperature $T = 2$ (flattening). The top probability drops; lower-probability tokens get relatively more mass. " Paris" becomes ≈ 0.3 if the residual 0.01 is treated as a single token (lower still if that mass is spread over a long tail of tokens), and the rest of the table becomes correspondingly more likely. More creative, more error-prone.
Top-k with $k = 3$. Restrict to {" Paris", " France", " the"}, renormalize (0.50, 0.20, 0.12 → 0.61, 0.24, 0.15), sample. Everything outside the top 3 is impossible.
Top-k with $k = 50$. All seven of these tokens plus the next 43 from the full vocabulary are in the sampling set. The renormalization is mild.
Nucleus $p = 0.9$. Smallest set with cumulative probability ≥ 0.9: {" Paris", " France", " the", " a"}. Renormalize and sample from those four. At a confident step like this one, the nucleus is small.
Nucleus $p = 0.95$. Smallest set with cumulative ≥ 0.95: {" Paris", " France", " the", " a", " our", " home"}. Six tokens, because the first five reach only 0.94.
Min-p with $p = 0.05$. Threshold is $0.05 \times 0.50 = 0.025$. Tokens whose probability is at least 0.025 stay: {" Paris", " France", " the", " a", " our", " home"}. The 0.02 token (" happy") is dropped. Renormalize and sample.

Two things to note from this example. First, on a confident step like the one above, the sampling strategies behave similarly: they all concentrate on the top handful of tokens, just with slightly different cutoffs. The differences become more pronounced on uncertain steps where the distribution is flatter. Second, temperature and truncation do not commute: applying temperature scaling before top-p or min-p changes which tokens survive the cutoff, whereas applying it afterwards only reshapes the probabilities within the surviving set, so production systems fix a specific ordering.

In practice, common deployed configurations are temperature ≈ 0.7 combined with nucleus $p \in [0.9, 0.95]$ , or temperature ≈ 0.7 with min-p ≈ 0.05. Greedy ( $T = 0$ ) is used when reproducibility is needed (e.g., in evaluation pipelines).

Beam search

Beam search maintains, at each step, the $B$ highest-probability partial sequences so far (the beams), extending each by all possible next tokens and keeping the top $B$ extended sequences. Beam search approximates the highest-probability sequence under the model, a fundamentally different goal from sampling from the model’s distribution.

For neural machine translation and other tasks with a clearly correct output, beam search was the dominant decoding strategy through the 2010s. For modern LLMs it has been largely abandoned for two reasons. First, the highest-probability sequence is usually not the best sequence — it tends to be short, generic, and repetitive (the “beam-search curse” first noted in NMT literature). Second, modern LLM use cases are generative and creative, where sampling-style diversity is the goal, not modal certainty. Beam search persists in specialized settings: machine translation, structured prediction tasks where there is a uniquely correct answer, and constrained-decoding pipelines (below) where it interacts well with constraint enforcement.
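A toy run makes the difference between greedy decoding and beam search concrete. The "model" below is a hypothetical lookup table that conditions only on the previous token; it is chosen so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def beam_search(next_probs, start, steps, beam_width):
    """Return the surviving beams as (log_probability, token_list) pairs,
    best first. beam_width=1 is greedy decoding."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for token, p in next_probs[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(MODEL, "<s>", steps=2, beam_width=1)[0]
best = beam_search(MODEL, "<s>", steps=2, beam_width=2)[0]
print("greedy:", greedy[1], round(math.exp(greedy[0]), 2))
print("beam-2:", best[1], round(math.exp(best[0]), 2))
```

Greedy commits to "a" because it is the most likely first token, and ends with a sequence of probability 0.30. With two beams the search keeps "b" alive long enough to find the sequence with probability 0.36.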

Speculative decoding

The memory-bandwidth bottleneck in decoding suggests a workaround: if a cheap-to-evaluate draft model can guess what the expensive target model will produce, the target model can verify many tokens at once in a single forward pass, instead of one at a time.

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) implements this:

Run the draft model — a small, fast LLM — to generate the next $K$ tokens speculatively.
Run one forward pass of the target model on the prefix plus the $K$ draft tokens, producing a distribution at each position.
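For each draft token, in order, accept it with probability $\min(1,\; p_{\text{target}} / p_{\text{draft}})$, a rejection-sampling rule that provably preserves the exact target-model distribution.
On the first rejection, discard the remaining draft tokens, sample a corrected token from a residual distribution, and restart drafting from the new prefix. If all $K$ draft tokens are accepted, sample one additional token from the target model’s distribution at position $K + 1$.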

Diagrammatically:

flowchart TD
  Prefix(["$$\text{prefix: } [t_1, \ldots, t_n]$$"])
  Draft["$$\text{draft model (small, fast): propose } K \text{ tokens}$$\n$$[d_1, \ldots, d_K] \text{ with } p_{\text{draft}}(d_i|\cdots)$$"]
  Target["$$\text{target model: one forward pass over}$$\n$$[t_1, \ldots, t_n, d_1, \ldots, d_K]$$\n$$\to p_{\text{target}}(d_i|\cdots)$$"]
  Walk["$$\text{walk left-to-right: acceptance ratio } r = p_{\text{target}}(d_i)/p_{\text{draft}}(d_i)$$"]
  Accept{"$$r \geq 1?$$\n$$\text{or accept with prob } r\text{?}$$"}
  Correction["$$\text{sample correction from residual:}$$\n$$\max(0,\; p_{\text{target}} - p_{\text{draft}})$$"]
  Extended(["$$\text{extended prefix (longer than } t_n)$$"])
  Done(["$$\text{done — full response generated}$$"])

  Prefix --> Draft --> Target --> Walk --> Accept
  Accept --> |"all accepted"| Extended
  Accept -. rejected .-> Correction --> Extended
  Extended -. loop until done .-> Draft
  Extended --> |"end token or max length"| Done

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Prefix,Draft,Target,Walk,Correction,Extended,Done pill

Because the target model’s forward pass over $K + 1$ tokens is roughly as expensive as a single-token forward pass — the memory-bandwidth cost is per pass, not per token — accepting even half of the draft tokens cuts effective decoding cost substantially. Practical speedups of 2–4× are typical, with no loss of generation quality (the procedure is exact — it produces samples from exactly the target model’s distribution).

Why it preserves the target distribution

The claim of exactness is worth unpacking, because it’s what makes speculative decoding usable in production. The procedure draws from a different proposal distribution (the draft model) but rejects in a precise way that compensates.

Consider one position. The draft model has proposed token $d$ with probability $p_{\text{draft}}(d)$. The target model assigns it probability $p_{\text{target}}(d)$. The acceptance rule:

If $p_{\text{target}}(d) \geq p_{\text{draft}}(d)$, accept $d$ unconditionally.
Otherwise, accept $d$ with probability $r = p_{\text{target}}(d) / p_{\text{draft}}(d)$.

On rejection, sample a correction token from the residual distribution $\max(0, p_{\text{target}} - p_{\text{draft}})$, renormalized.

Why this works: by the standard rejection sampling argument, the total probability of emitting token $d$ at this position — across both the “accept the draft” path and the “reject and sample correction” path — is exactly $p_{\text{target}}(d)$ . The draft model can be arbitrarily bad; the procedure is correct regardless. A good draft model (one whose distribution is close to the target’s) yields high acceptance rates and large speedups; a bad draft model yields low acceptance and small speedups, but the output distribution is always the target’s.

This is the deeper invariant: draft-model quality determines decoding speed, never correctness, because exactness is preserved by construction.
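The exactness argument can be verified numerically for a single position. The sketch below does not sample; it adds up the probability of emitting each token through the accept path and through the reject-and-correct path, for one good and one poor draft distribution. The distributions and function name are illustrative.

```python
def emitted_distribution(p_target, p_draft):
    """Exact per-token emission probabilities of one speculative step,
    plus the overall acceptance rate of the draft's proposal."""
    # Accept path: the draft proposes t (prob p_draft[t]) and it is accepted
    # with prob min(1, p_target[t] / p_draft[t]); the product is min of the two.
    accepted = {t: min(p_draft[t], p_target[t]) for t in p_target}
    p_reject = 1.0 - sum(accepted.values())

    # Reject path: draw from the renormalized residual max(0, target - draft).
    residual = {t: max(0.0, p_target[t] - p_draft[t]) for t in p_target}
    norm = sum(residual.values())

    emitted = {}
    for t in p_target:
        correction = p_reject * residual[t] / norm if norm > 0 else 0.0
        emitted[t] = accepted[t] + correction
    return emitted, 1.0 - p_reject

target = {"A": 0.6, "B": 0.3, "C": 0.1}
good_draft = {"A": 0.5, "B": 0.35, "C": 0.15}
poor_draft = {"A": 0.05, "B": 0.05, "C": 0.9}

for name, draft in (("good", good_draft), ("poor", poor_draft)):
    emitted, accept_rate = emitted_distribution(target, draft)
    print(name, {t: round(p, 6) for t, p in emitted.items()}, round(accept_rate, 2))
```

Both drafts reproduce the target distribution; only the acceptance rate differs (0.9 for the good draft against 0.2 for the poor one), and that rate is what sets the speedup.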

Variants extend the idea. Medusa (Cai et al., 2024) avoids a separate draft model by training multiple small prediction heads attached to the target model itself, each predicting a different offset into the future. Lookahead decoding (Fu et al., 2024) uses a Jacobi-iteration-style drafting strategy without any draft model. EAGLE and successors continue this line.

KV-cache management and serving

For long contexts, the KV cache itself is the dominant memory cost. Recall (§4.6) the KV-cache size formula: at fp16 precision, the cache holds $2 \times N_{\text{layers}} \times T \times H \times d_{\text{head}}$ half-precision floats per request, where $T$ is the current context length and the factor of $2$ is for K and V. The physical layout, in concrete terms:

   KV cache for one inference request
   ──────────────────────────────────

                       ┌──────────────────────────────────┐
   per layer (one of   │   token positions 1..T           │
   N_layers stacks):   ├──────────────────────────────────┤
                       │  ┌──────┐ ┌──────┐         ┌──────┐
   per head (one of    │  │ K_1  │ │ K_2  │   ...   │ K_T  │
   H heads):           │  └──────┘ └──────┘         └──────┘
                       │   each K_t ∈ R^{d_head}
                       │
                       │  ┌──────┐ ┌──────┐         ┌──────┐
                       │  │ V_1  │ │ V_2  │   ...   │ V_T  │
                       │  └──────┘ └──────┘         └──────┘
                       │   each V_t ∈ R^{d_head}
                       └──────────────────────────────────┘
                          .  same shape for layer 2 ..
                          .  same shape for layer N_layers

For a 100-billion-parameter model with $N_{\text{layers}} = 80$, $H = 64$, $d_{\text{head}} = 128$, and a 100,000-token context at fp16:

2 \times 80 \times 100{,}000 \times 64 \times 128 \times 2\,\text{bytes} \;=\; \text{about } 262\,\text{GB per request}.

That is the cache for one user, already larger than the roughly 200 GB the model’s own fp16 weights occupy. Frontier serving systems hold dozens of requests in flight simultaneously. The KV cache, not the parameter count, is what limits how many concurrent users a server can sustain. The next two techniques attack this constraint.
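The size formula is simple enough to keep as a helper when planning capacity. A minimal sketch with a hypothetical function name; the parameter `n_kv_heads` equals $H$ for standard multi-head attention, and the second call uses an assumed configuration with fewer key/value heads (as in grouped-query attention) purely to show how the cache scales.

```python
def kv_cache_bytes(n_layers, context_len, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of KV cache for one request: K and V (factor 2) for every
    layer, position, and key/value head, at the given precision."""
    return 2 * n_layers * context_len * n_kv_heads * d_head * bytes_per_value

full = kv_cache_bytes(n_layers=80, context_len=100_000, n_kv_heads=64, d_head=128)
print(f"64 KV heads: {full / 1e9:.1f} GB per request")

# Assumed variant: 8 shared key/value heads instead of 64.
shared = kv_cache_bytes(n_layers=80, context_len=100_000, n_kv_heads=8, d_head=128)
print(f" 8 KV heads: {shared / 1e9:.1f} GB per request")
```

The cache grows linearly in every factor, so halving the context length, the number of key/value heads, or the bytes per value each halves the per-request footprint.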

Continuous batching. Static batching — grouping requests into a fixed-size batch processed together — wastes GPU compute because requests have different lengths and finish at different times. Continuous batching (Yu et al., 2022, “Orca”) dynamically replaces finished requests in the batch with new ones at every step. The technique is now standard in LLM serving systems.
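The gain from continuous batching can be seen by counting decoding steps in a toy scheduler. The sketch assumes each request needs a known number of decode steps and that a batch slot can be refilled at any step boundary; real schedulers also handle prefill, memory limits, and preemption, which are ignored here. Function names are illustrative.

```python
def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Finished requests are replaced from the queue at every step."""
    queue = list(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        # Refill free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [2, 10, 3, 9]  # decode steps needed by four requests
print(static_batching_steps(lengths, batch_size=2))      # 19
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

In the static schedule the short requests hold their slots idle until the long ones finish; in the continuous schedule the slot freed after two steps is reused immediately.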

Paged attention

A second waste source is the physical layout of the KV cache. The naive implementation stores each request’s KV cache as one contiguous memory block, sized for the maximum possible context length (say, 128k tokens). But most requests use far less. A request whose generated response is 200 tokens long still reserves a 128k-sized block — most of it unused.

Paged attention (Kwon et al., 2023, “vLLM”) borrows the virtual-memory paging idea from operating systems. Memory is split into fixed-size pages (typically 16 or 32 tokens per page), and each request’s logical sequence of KV-cache entries is mapped to a (potentially non-contiguous) list of physical pages via a per-request block table:

   Without paging (contiguous, max-size):
   ───────────────────────────────────────
   request A's KV cache (128k tokens reserved):
     [ used 200 tokens ][ . . . . wasted 127,800 tokens . . . . ]
   request B's KV cache (128k tokens reserved):
     [ used 5000 tokens ][ . . . wasted 123,000 tokens . . . . ]
   request C's KV cache (128k tokens reserved):
     [ used 100 tokens ][ . . . . wasted 127,900 tokens . . . . ]

   Memory utilization: hundreds of GB wasted on padding.


   With paged attention (fixed-size pages, non-contiguous):
   ─────────────────────────────────────────────────────────
   Physical memory pool, split into pages of 16 tokens each:

     [page 0] [page 1] [page 2] [page 3] [page 4] [page 5] [page 6]
     [page 7] [page 8] [page 9] [page 10] [page 11] [page 12] ...

   Block tables map logical positions to physical pages:

     request A (200 tokens, needs 13 pages):
       logical pages [0, 1, 2, ..., 12]  ─▶  physical [3, 7, 1, ..., 21]

     request B (5000 tokens, needs 313 pages):
       logical pages [0, 1, ..., 312]    ─▶  physical [4, 12, 5, ..., 256]

     request C (100 tokens, needs 7 pages):
       logical pages [0, ..., 6]         ─▶  physical [0, 8, 2, 9, ...]

   Only the pages actually used are allocated. Pages freed by
   finished requests are returned to the pool and reused.

The attention computation is modified to gather K and V values via the block table rather than from a single contiguous range. The result: memory utilization rises from typically 20–40% (with contiguous max-size allocation) to near 95%, multiplying the number of concurrent requests a server can hold by roughly 2–4×.
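The bookkeeping behind the block table can be sketched in a few lines. This is a simplified allocator, not vLLM’s implementation: it tracks only which physical pages belong to which request, and the class and method names are hypothetical.

```python
import math

class PagedKVCache:
    """Toy page allocator: maps each request's logical pages to physical pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of physical page ids
        self.lengths = {}       # request id -> number of cached tokens

    def append_tokens(self, request_id, n_tokens):
        """Grow a request by n_tokens, allocating new pages only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        new_length = self.lengths.get(request_id, 0) + n_tokens
        needed = math.ceil(new_length / self.page_size) - len(table)
        if needed > len(self.free_pages):
            raise MemoryError("KV page pool exhausted")
        for _ in range(needed):
            table.append(self.free_pages.pop())
        self.lengths[request_id] = new_length

    def locate(self, request_id, position):
        """Physical (page, offset) holding the K/V entry for a token position."""
        page = self.block_tables[request_id][position // self.page_size]
        return page, position % self.page_size

    def free(self, request_id):
        """Return a finished request's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]

cache = PagedKVCache(num_pages=400)
for request_id, n_tokens in (("A", 200), ("B", 5000), ("C", 100)):
    cache.append_tokens(request_id, n_tokens)
    print(request_id, len(cache.block_tables[request_id]), "pages")
```

The three requests from the diagram occupy 13, 313, and 7 pages respectively, and the only waste is the partially filled last page of each request.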

Two further subtleties matter in production: copy-on-write sharing of pages across requests with a common prefix (such as a long system prompt shared across many users), and page-table updates during continuous batching as requests come and go. The vLLM paper develops the details; the inference-serving chapter (forthcoming, in the Efficient and Scaled Training material) treats them.

Quantization. Storing model parameters in lower-precision numerical formats reduces both memory footprint and memory-bandwidth demand. fp8 (8-bit floating point) is now standard for serving frontier models. int8 and int4 integer quantization are widely used at smaller scales; sub-4-bit quantization is an active research area. For many models, quantization to fp8/int8 is essentially free (negligible quality loss); inference-optimized deployments increasingly push to more aggressive formats. The Deep Learning chapter develops the substance.
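To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is one simple scheme among many and is chosen here only for illustration; production systems typically use per-channel or per-group scales and fused kernels. The weight values are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floating-point values from the int8 codes."""
    return [v * scale for v in q]


weights = [0.02, -0.51, 0.33, 1.27, -0.90]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte plus a shared scale, and the round-trip error per weight is bounded by half the scale. Whether that error is negligible for a given model is an empirical question.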

Structured-output decoding

A frequent practical requirement: the model’s output must conform to a structured format — valid JSON, code that parses, an XML schema, output matching a regular expression, output parsing under a context-free grammar. Constrained decoding modifies the sampling step to mask out tokens that would violate the constraint, leaving only legal continuations to sample from.

Regex-constrained decoding restricts outputs to a specified regular expression.
JSON-constrained decoding ensures outputs are syntactically valid JSON, often with schema conformance (correct keys, value types matching the schema).
Grammar-constrained decoding generalizes to context-free grammars: outputs must be parseable under a specified grammar (e.g., a programming-language grammar).

The Outlines library (Willard & Louf, 2023) and the broader Guidance / lm-format-enforcer ecosystem implement these patterns. By 2026, most production LLM deployments use some form of constrained decoding when structured outputs are required.
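The masking step itself is small. The sketch below is a toy written independently of the libraries named above: the eight-token vocabulary, the logits, and the hand-written prefix check are all assumptions for illustration. Real implementations compile the constraint into an automaton over the tokenizer's vocabulary rather than re-checking strings.

```python
import math
import re

VOCAB = ["-", "4", "2", "7", "forty", " ", "two", "."]  # toy vocabulary
INTEGER = re.compile(r"-?\d+")  # target format: a signed integer


def is_valid_prefix(text: str) -> bool:
    """True if `text` can still be extended into a match of INTEGER."""
    return text in ("", "-") or INTEGER.fullmatch(text) is not None


def mask_logits(prefix: str, logits: list[float]) -> list[float]:
    """Set the logit of every constraint-violating token to -inf."""
    return [
        logit if is_valid_prefix(prefix + token) else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


def greedy_step(prefix: str, logits: list[float]) -> str:
    """Pick the highest-scoring token among the legal continuations."""
    masked = mask_logits(prefix, logits)
    best = max(range(len(VOCAB)), key=lambda i: masked[i])
    return VOCAB[best]


# Hypothetical raw logits: the unconstrained model prefers the word "forty".
raw_logits = [0.1, 2.0, 1.5, 0.3, 3.0, 0.2, 2.5, 0.4]
unconstrained = VOCAB[raw_logits.index(max(raw_logits))]
first = greedy_step("", raw_logits)
```

Without the mask the top token is the word "forty"; with it, the decoder is forced onto a digit. The same mask applies unchanged to temperature or nucleus sampling, because tokens with a logit of negative infinity receive zero probability.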

Serving infrastructure

The full LLM-serving picture (multi-GPU sharding strategies, tensor/pipeline parallelism, hardware-aware scheduling, multi-node deployment) is treated in the Deep Learning chapter and the Efficient and Scaled Training chapter. The LLM side of inference engineering is covered above; the systems side is delegated.

Empirical note. Inference-side optimization is an unusually empirical part of the field, dominated by engineering rather than theory, and changing rapidly. Specific recipes (which quantization scheme, which serving framework, which speculative-decoding variant) shift on a roughly quarterly cadence; the categorical structure given here is stable, but the within-category state of the art is not.

§7. Prompting and In-Context Learning

The most consequential capability of large language models — and the one that distinguishes them most sharply from smaller pretrained models — is the ability to take instructions and demonstrations from the prompt itself, at inference time, with no further training. This section develops prompting (the practice of crafting input to elicit desired behaviour) and in-context learning (the phenomenon of acquiring task behaviour from examples provided in the prompt), then surveys the theoretical accounts of why ICL works. The section is structured to separate the empirical regularities (well-established) from the theoretical accounts (multiple, none decisive).

Prompting as a deployment surface

In the LLM regime, a substantial fraction of behavioural adaptation happens not through training but by prompting: changing the input to change the output. The deployed model’s parameters are fixed; the user (or system designer) varies the prompt.

The structure of a typical prompt to a deployed LLM was sketched in §1: a sequence of role-tagged messages, typically with three roles.

System — a message that sets the assistant’s persona, behavioural rules, and any persistent context (tools available, output format, hard constraints).
User — the user’s current request.
Assistant — the slot the model fills in.

Multi-turn conversations alternate user and assistant messages; the entire history is fed to the model on each turn.

Within this structure, several prompting patterns recur:

Zero-shot prompting. The user asks for the task directly — "Translate this paragraph to French: ..." — without examples. Effective when the task is well-represented in the model’s pretraining.
Few-shot prompting. The user provides a small number of input-output examples before the actual query: "English: dog → French: chien. English: cat → French: chat. English: bird → French: ___". The model is expected to follow the demonstrated pattern. In-context learning is the technical name for this capability.
A more concrete picture. Suppose we want the model to classify movie reviews as positive or negative. Without demonstrations (zero-shot):
```
user:    Review: "The cinematography was breathtaking but the plot
         dragged."  Is this review positive or negative?
model:   The review expresses mixed feelings — praising the
         cinematography but criticizing the pacing. Overall it
         leans slightly negative.
```
With few-shot demonstrations:
```
user:    Review: "An absolute masterpiece. I was glued to my seat."
         Label: positive

         Review: "Pretentious, slow, and ultimately empty."
         Label: negative

         Review: "Great cast but the script was awful."
         Label: negative

         Review: "The cinematography was breathtaking but the plot
         dragged."
         Label:
model:   negative
```
The model now produces a single-word label conforming to the demonstrated pattern. No parameters were updated; only the prompt changed. The model has “learned” the task (a binary classification with the specific label vocabulary) from the three demonstrations.
Role-based prompting. The system message assigns the model a role ("You are an expert physician..."), which can shift the style and content of subsequent responses.
Format prompting. The prompt explicitly specifies the desired output format, often via a structured template that the model continues.
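Mechanically, the few-shot pattern in the list above is string assembly. The sketch below rebuilds the movie-review prompt from its demonstrations; the function name `build_few_shot_prompt` is hypothetical, and the "Review / Label" layout simply mirrors the example.

```python
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Format (review, label) demonstrations followed by an unlabeled query."""
    blocks = [f'Review: "{review}"\nLabel: {label}' for review, label in demos]
    # The query reuses the demonstration format but leaves the label open,
    # so the most natural continuation is a single label word.
    blocks.append(f'Review: "{query}"\nLabel:')
    return "\n\n".join(blocks)


demos = [
    ("An absolute masterpiece. I was glued to my seat.", "positive"),
    ("Pretentious, slow, and ultimately empty.", "negative"),
    ("Great cast but the script was awful.", "negative"),
]
prompt = build_few_shot_prompt(
    demos, "The cinematography was breathtaking but the plot dragged."
)
```

Because accuracy is sensitive to separators and ordering, keeping the template in one function makes it easy to vary the format systematically rather than by hand.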

The line between prompting and prompt engineering — iteratively refining a prompt to improve performance — is blurry. Both are common; both have been gradually displaced by training-time techniques (instruction tuning, preference tuning) where reliability matters, as discussed later in this section.

In-context learning

In-context learning (ICL) — the LLM’s ability to learn a new task from examples in the prompt, without any parameter updates — was the most striking capability demonstrated by GPT-3 (Brown et al., 2020) and is one of the defining features of large language models. We use learning loosely: the model’s parameters do not change, but the model’s behaviour at inference time is reshaped by the prompt content in ways that look, externally, like learning a new task.

The empirical regularities of ICL are now reasonably well-characterized:

Sensitivity to demonstration format. The exact format of the in-context examples — separators, capitalization, spacing, ordering — has substantial and sometimes surprising effects on accuracy. Minor reformatting can change accuracy by tens of points on the same task.
Order matters. The order in which demonstrations appear in the prompt affects accuracy, sometimes dramatically (Lu et al., 2022, “Fantastically Ordered Prompts”). The optimal order is not predictable from first principles and is usually found empirically.
Label correctness matters less than expected. Min et al. (2022) reported that, surprisingly, the correctness of the labels in the demonstrations matters less than the distribution over labels — the model can in-context-learn the task even when many of the demonstration labels are wrong, provided the label space is correctly represented in the prompt. This finding is consequential for theoretical accounts (below) and contested in its scope; the conditions under which it holds are still being mapped.
Larger models do more in-context learning. ICL accuracy grows with model scale, and the most striking ICL phenomena — emergent behaviour in arithmetic, code translation, structured tasks — appear primarily at scale (see Foundation Models §6 on emergence, with the Schaeffer et al. critique recorded there).
Some tasks are harder than others. ICL works well for tasks similar to or simpler than ones the model saw during pretraining; it works less well for tasks involving novel multi-step structure, where reasoning-time techniques (below) help.

Chain-of-thought prompting

A specific prompting pattern that produces large performance gains on reasoning tasks: chain-of-thought (CoT) prompting (Wei et al., 2022). The pattern is to elicit intermediate reasoning steps from the model before its final answer, by either:

providing few-shot examples that include explicit step-by-step reasoning, or
appending a simple prompt like "Let's think step by step." to the user’s request — the zero-shot CoT variant (Kojima et al., 2022).

Models prompted this way perform substantially better on multi-step arithmetic, logical reasoning, and structured problem-solving. The improvement is not free — generating intermediate reasoning costs additional inference tokens — but on reasoning tasks the accuracy gain is large enough that CoT became a standard prompting pattern in practice. It is also the foundational observation behind reasoning models (§5.4): if scaffolded reasoning at inference time helps so much, training the model to produce such reasoning natively should help more.

A worked example shows the mechanism. Without CoT, on a multi-step arithmetic word problem:

user:    A pet store has 64 birds. They sell 1/4 of them. Then
         they receive 3 new shipments of 10 birds each. How
         many birds does the store have now?
model:   78

The model emits an answer directly, with no intermediate work shown. Here the answer happens to be correct, but with a non-reasoning model in the tens-of-billions-of-parameters range the direct answer is right perhaps half the time: the model gets something like the right magnitude but skips a step, or misapplies the fraction.

With CoT (zero-shot, by appending the prompt suffix "Let's think step by step."):

user:    A pet store has 64 birds. They sell 1/4 of them. Then
         they receive 3 new shipments of 10 birds each. How
         many birds does the store have now?

         Let's think step by step.
model:   1/4 of 64 is 16, so they sell 16 birds.
         After selling, they have 64 - 16 = 48 birds.
         3 shipments of 10 birds each adds 3 × 10 = 30 birds.
         After receiving the shipments, they have 48 + 30 = 78.

         The answer is 78.

Two effects produce the accuracy gain. First, each intermediate computation is externalized into its own decoding step, which is mechanically what next-token prediction is good at — the model is asked to compute “16” in a context that strongly suggests it, rather than holding it in latent state. Second, the final answer is conditioned on the intermediate work, so the kinds of errors that come from skipping steps (or from mis-applying a fraction in one’s head) are structurally less likely.

Variants extend the basic pattern:

Self-consistency (Wang et al., 2022) samples multiple CoT reasoning chains from the model, takes a majority vote on the final answer, and reports it. The diversity in reasoning paths often surfaces a more robust answer than any single chain would:

flowchart TD
  Q(["$$\text{question (e.g., }5 \times 24\text{?)}$$"])
  C1["$$\text{chain 1 (CoT, } T{=}0.7\text{)}$$\n5×24 = 120"]
  C2["$$\text{chain 2}$$\n5×20+5×4 = 100+20 = 120"]
  C3["$$\text{chain 3}$$\n(4+1)×24 = 96+24 = 120"]
  CD["…"]
  CN["$$\text{chain } N$$\n...100+24 = 124 (oops)"]
  A1["$$\text{answer: 120}$$"]
  A2["$$\text{answer: 120}$$"]
  A3["$$\text{answer: 120}$$"]
  AD["…"]
  AN["$$\text{answer: 124}$$"]
  Tally["$$\text{tally votes: } 120 \times (N{-}1),\; 124 \times 1$$"]
  Maj(["$$\text{majority answer: 120}$$"])

  Q --> C1 --> A1 --> Tally
  Q --> C2 --> A2 --> Tally
  Q --> C3 --> A3 --> Tally
  Q --> CD --> AD --> Tally
  Q --> CN --> AN --> Tally
  Tally --> Maj

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class Q,C1,C2,C3,CN,A1,A2,A3,AN,Tally,Maj pill
  class CD,AD dim

Generating $N$ chains costs $N$ times the inference compute of a single chain, but the majority vote often filters out errors that any one sampled chain might commit. Self-consistency is the simplest example of test-time-compute scaling (the observation that more inference compute produces better answers), an idea developed in full in the Reasoning Models chapter.
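The voting step reduces to a counter over final answers. In the sketch below, `sample_answer` is a hypothetical stand-in for "sample one chain at nonzero temperature and extract its final answer". It is scripted here so that the example is deterministic.

```python
from collections import Counter
from typing import Callable


def self_consistency(
    sample_answer: Callable[[], str], n_chains: int
) -> tuple[str, Counter]:
    """Sample n_chains final answers and return the most common one."""
    votes = Counter(sample_answer() for _ in range(n_chains))
    # Ties are broken by first appearance, which is arbitrary; a real
    # system might weight votes or sample further chains instead.
    answer, _count = votes.most_common(1)[0]
    return answer, votes


# Scripted stand-in for five sampled chains on "5 x 24 = ?".
scripted = iter(["120", "120", "120", "124", "120"])
answer, votes = self_consistency(lambda: next(scripted), n_chains=5)
```

Only the final answers are compared; the reasoning text of each chain is discarded once its answer has been extracted.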

Least-to-most prompting (Zhou et al., 2023) decomposes a complex question into a sequence of sub-questions, has the model answer each in turn, then aggregates.
Tree-of-thoughts (Yao et al., 2023) generalizes CoT to explore multiple reasoning branches with an explicit search procedure. At each step the model samples several candidate next-steps; an evaluator (the same model with a different prompt, or a separate scoring function) rates each candidate; promising ones are expanded further, unpromising ones are pruned. Diagrammatically:
```
flowchart TD
  Q(["$$\text{question}$$"])
  T1["$$\text{thought\_1}$$\nscore: 7"]
  T2["$$\text{thought\_2}$$\nscore: 2"]
  T3["$$\text{thought\_3}$$\nscore: 8"]
  Prune["$$\times \text{ prune}$$"]
  T1a["$$\text{thought\_1a}$$"]
  T1b["$$\text{thought\_1b}$$"]
  T3a["$$\text{thought\_3a}$$"]
  T3b["$$\text{thought\_3b}$$"]

  Q --> T1
  Q --> T2
  Q --> T3
  T2 --> Prune
  T1 --> T1a
  T1 --> T1b
  T3 --> T3a
  T3 --> T3b

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class Q,T1,T2,T3,T1a,T1b,T3a,T3b pill
  class Prune dim
```
ToT is a tree search: branching factor and depth are hyperparameters; the evaluator is usually the bottleneck. ToT outperforms single-chain CoT on tasks where the right reasoning path is non-obvious from the start (puzzle solving, planning, multi-hop reasoning). Conceptually, it is the precursor to test-time-compute reasoning models — a way of using more inference compute to find better answers, via search rather than just sampling-and-voting.
Program-of-thought (Chen et al., 2022) has the model emit code that, when executed, produces the answer — outsourcing arithmetic and structured manipulation to an interpreter.
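Program-of-thought, the last variant above, is easy to illustrate with the pet-store problem from earlier. In this sketch the string `model_program` is a hypothetical model output and `run_program` is an illustrative helper. Emptying the builtins is not a security boundary, so a real deployment would execute the code in a proper sandbox.

```python
# Hypothetical model output for the pet-store problem used earlier.
model_program = """
birds = 64
sold = birds // 4
birds = birds - sold
birds = birds + 3 * 10
answer = birds
"""


def run_program(source: str) -> object:
    """Execute model-written code and return its `answer` variable.

    WARNING: exec() on untrusted text is unsafe. This is illustration only.
    """
    namespace: dict[str, object] = {"__builtins__": {}}
    exec(source, namespace)
    return namespace["answer"]


result = run_program(model_program)
```

The model is responsible only for translating the problem into code; the interpreter does the arithmetic, so digit-level slips cannot occur at that stage.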

The general lesson: shaping what tokens the model generates between question and answer can substantially improve answer quality.

Theoretical accounts of in-context learning

Why does ICL work? The model’s parameters do not change between training and inference; nothing in standard supervised-learning theory predicts that a fixed-parameter model should acquire new task behaviour from a few prompt examples. Several theoretical accounts have been proposed; none is decisive. We survey them without ranking them. The unresolved status of ICL theory is recorded as OP-FM-14 in Foundation Models §11 and is a central open problem of the field.

Induction heads. Olsson et al. (2022) identified specific circuits in pretrained Transformers — termed induction heads — that implement a simple algorithm: when a token pattern [A B ... A] appears in the input, the head attends back to the previous occurrence of A and reads off the token immediately after it, then predicts that token at the current position. Concretely, on the sequence below, after the model has seen "Alice Smith" once and then encounters "Alice" again, an induction head implements:

flowchart TD
  Seq(["$$\ldots\;\text{``Alice''}\;\text{``Smith''}\;\ldots\;\text{``Alice''}\;[\text{next?}]$$"])
  S1["$$\textbf{Step 1 — prefix-match}$$\nQuery at current token (``Alice'') looks back for a previous occurrence of the same token"]
  S2["$$\textbf{Step 2 — successor-copy}$$\nAttend to the token immediately after the matched occurrence — that is ``Smith''"]
  S3["$$\textbf{Step 3 — predict}$$\nOutput ``Smith'' at the current position"]
  Out(["$$\text{prediction: ``Smith''}$$"])

  Seq --> S1 --> S2 --> S3 --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Seq,S1,S2,S3,Out pill

The mechanism is two coordinated attention operations: a prefix-match (find the previous occurrence of the current token) followed by a successor-copy (predict whatever followed that previous occurrence). Both operations are within reach of a multi-head Transformer’s attention: one head can do the matching, and the values its attention exposes can be read by a downstream head as the predicted next token.
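Stripped of the attention machinery, the algorithm is a backwards scan and a copy. The function below is a symbolic sketch of what the circuit computes, not of how a Transformer computes it; the name `induction_predict` is hypothetical.

```python
from typing import Optional


def induction_predict(tokens: list[str]) -> Optional[str]:
    """Predict the next token by prefix-match followed by successor-copy."""
    current = tokens[-1]
    # Step 1 (prefix-match): scan backwards for an earlier occurrence
    # of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Step 2 (successor-copy): return the token that followed it.
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule makes no prediction


prediction = induction_predict(["Alice", "Smith", "went", "home", ".", "Alice"])
```

A real induction head does this softly, with attention weights rather than an exact match, and over learned representations rather than raw strings.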

Olsson et al. argue that ICL is mechanistically grounded in this capability, which emerges during pretraining in a sudden phase transition observable in training curves. Subsequent mechanistic-interpretability work (treated in the Mechanistic Interpretability chapter) has elaborated and qualified the picture — induction heads are necessary but not sufficient for the full range of ICL behaviour — but the broad story is widely cited. Crucially, many ICL tasks reduce to this pattern: a few-shot prompt is a sequence of demonstrations followed by a query that resembles one of the demonstrations, and the model’s job is to “find the matching demonstration and copy its response” — exactly the induction-head algorithm at the level of structured examples rather than individual tokens.

Implicit gradient descent. Several papers (Akyürek et al., 2022; von Oswald et al., 2023; Garg et al., 2022) showed that, in simplified theoretical settings, a Transformer can implement a forward pass that is functionally equivalent to running gradient descent on the in-context examples internally. Under this account, ICL is the model “fine-tuning itself” with no parameter updates: the forward pass implements an optimization procedure on the prompt’s examples, producing an updated effective model that is then queried. The theoretical settings in which this account is exact are simplified (linear regression on synthetic inputs, in particular); how much they explain ICL in deployed LLMs on natural-language tasks is unclear.
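The simplest instance of the equivalence can be checked numerically. Assume a linear model initialized at $w = 0$ and trained with squared loss $\tfrac{1}{2}\sum_i (w \cdot x_i - y_i)^2$. One gradient step with learning rate $\eta$ gives $w = \eta \sum_i y_i x_i$, so the prediction on a query is $\eta \sum_i y_i (x_i \cdot x_q)$, which is exactly unnormalized linear attention with keys $x_i$, values $\eta y_i$, and query $x_q$. The sketch below verifies this on made-up data; it covers only this one-step linear case, not the general constructions in the cited papers.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained by one GD step from w = 0."""
    dim = len(x_query)
    # Gradient of 0.5 * sum_i (w.x_i - y_i)^2 at w = 0 is -sum_i y_i * x_i.
    grad = [-sum(y * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    w = [-lr * g for g in grad]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: keys x_i, values lr * y_i."""
    return sum(lr * y * dot(x, x_query) for x, y in zip(xs, ys))


# Made-up in-context examples and query.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]
gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

The two predictions agree up to floating-point rounding. Softmax attention, multiple layers, and natural-language inputs all break this exact identity, which is why the account's reach in deployed models remains unclear.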

Task vectors and task identification. Hendel et al. (2023) and successors propose that what ICL actually does is identify a task — pick out, from the in-context examples, which of many tasks the pretrained model already knows how to perform — rather than learning a novel task. Under this account, demonstrations are pointers, not training data: they signal “the task is X” rather than teaching X from scratch. This matches the empirical finding that label correctness matters less than label-space coverage (Min et al., 2022): demonstrations work by indicating which task to invoke, not by providing training signal.

Bayesian framings. Xie et al. (2022) propose a Bayesian account: pretraining endows the model with an implicit prior over latent “tasks” (concepts), and ICL performs approximate Bayesian inference, computing a posterior over those tasks conditioned on the in-context examples. The Bayesian view is appealingly clean but operates at a level of abstraction where direct contact with mechanism is loose.
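A toy version of the Bayesian picture: a handful of candidate tasks, a distribution over them before any demonstrations are seen, and an update per demonstration. Everything in the sketch is an assumption made for illustration (the three string tasks, the uniform starting distribution, and the 0.9 match probability); it shows only the shape of the inference, not the model in Xie et al.

```python
TASKS = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
    "identity": lambda s: s,
}


def posterior_over_tasks(demos, prior, match_prob=0.9):
    """Bayes' rule over latent tasks given (input, output) demonstrations."""
    unnormalized = {}
    for name, fn in TASKS.items():
        likelihood = 1.0
        for x, y in demos:
            # A demonstration is likely under a task if the task explains it.
            likelihood *= match_prob if fn(x) == y else 1.0 - match_prob
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


prior = {name: 1 / len(TASKS) for name in TASKS}
demos = [("abc", "cba"), ("hello", "olleh")]
posterior = posterior_over_tasks(demos, prior)
most_likely = max(posterior, key=posterior.get)
```

Two demonstrations are enough to concentrate almost all of the probability mass on one task, which is the sense in which a few examples can select a behaviour without any parameter update.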

Editorial note. We deliberately survey these accounts without ranking them. As of 2026, no single theoretical framework explains the full empirical behaviour of ICL — its sensitivity to format, its scale-dependence, its variation across tasks. The accounts above are likely complementary rather than competitive: each captures part of the phenomenon. The unresolved status of ICL theory is the open problem OP-FM-14.

The fate of prompt engineering

Prompt engineering — the practice of crafting prompts to elicit desired model behaviour, often through iterative refinement and informally shared best-practice folklore — emerged as a deployment-time discipline in 2020–2023 and was, briefly, a job category. Its trajectory since has been one of gradual displacement by training-time techniques:

Instruction tuning (§5.2) makes models reliably follow naturally phrased instructions, reducing the need for elaborate prompt scaffolding.
Preference tuning (§5.3) makes models converge on default-good responses, reducing the prompt-format sensitivity that motivated much prompt engineering.
Tool use (§8) and retrieval augmentation (§9) move work from the prompt to the surrounding system.
Reasoning models (§5.4) move the work of step-by-step reasoning from a CoT prompt into model-internal training.

By 2026 the prompt-engineering folklore of 2022 — explicit reasoning prompts, role assignments, format specifications — has been substantially absorbed into the models themselves. Some prompting practice remains (production-system reliability, niche tasks, structured-output extraction over noisy inputs), but the discipline has not turned out to be a stable engineering specialty.

Empirical note. The displacement-by-training-time dynamic is empirically clear, but the limit of the trend is not. Whether all prompt-engineering practice will be absorbed into training (suggesting “good prompts” are a temporary scaffold), or whether some genuinely irreducible prompting craft will remain — for novel tasks, for production reliability, for highly structured outputs — is an open empirical question.

§8. Tool Use and Function Calling

The chat-style LLMs developed in §1–§7 take a prompt as input and produce a text response. Tool use extends this loop: between receiving the prompt and producing the final response, the model can call external functions (fetch information from a database, execute code, search the web, invoke a calculator) and incorporate the results into its reasoning. The model is no longer a closed function from text to text; it is a controller of a small computational graph that includes both the model’s own reasoning and external operations.

This is conceptually simple but turns out to be the foundation of agentic AI as a whole. We treat the LLM-side surface here — what tool calling looks like, how the model emits and consumes tool calls, how the loop is structured — and defer the systems treatment (multi-step planning, agent design, complex agentic workflows) to the AI Agents and Tool Use chapter.

The basic loop

A tool is any function that an LLM can invoke from inside a conversation. The function is defined outside the model (in the surrounding application code, an API, an environment), with a documented interface: a name, an argument schema, and a return-value type. The model is shown the available tools as part of its prompt (in the system message or via a special tools section); when the model wants to use one, it emits a structured tool call specifying the tool name and arguments. The surrounding system intercepts the tool call, runs the function, and feeds the result back into the conversation as a new message. The model continues from there:

flowchart TD
  Query(["$$\text{user query}$$"])
  LLM["$$\text{LLM forward pass}$$\nsees: [system + tools] [user query] [conversation so far]\nemits: final response OR structured tool call"]
  Decision{"$$\text{tool call?}$$"}
  Run["$$\text{surrounding code intercepts tool call}$$\n$$\texttt{fn(args)} \to \text{result}$$"]
  Feed["$$\text{result fed back as tool message}$$\n$$\text{LLM continues with new context}$$"]
  Done(["$$\text{return final response to user}$$"])

  Query --> LLM --> Decision
  Decision --> |tool call| Run --> Feed
  Feed -. loop .-> LLM
  Decision --> |final response| Done

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Query,LLM,Run,Feed,Done pill

The loop terminates when the model emits a final response rather than another tool call. Modern systems put a hard cap on the number of tool calls per query (typical: 10–30) to prevent runaway loops.
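The control flow of the loop, including the cap, can be sketched as follows. The model is replaced by a scripted stand-in (`fake_model`) and the tool by a canned function, both hypothetical; the message dictionaries use an illustrative layout rather than any particular provider's wire format.

```python
import json

MAX_TOOL_CALLS = 10  # hard cap on tool calls per query


def get_weather(city: str, units: str = "celsius") -> dict:
    """Hypothetical tool: returns canned data instead of calling an API."""
    return {"temperature": 22, "condition": "clear", "units": units}


TOOLS = {"get_weather": get_weather}


def fake_model(messages: list[dict]) -> dict:
    """Scripted stand-in for an LLM: one tool call, then a final answer."""
    if messages[-1]["role"] == "tool":
        result = json.loads(messages[-1]["content"])
        text = f"It is {result['temperature']} degrees and {result['condition']}."
        return {"type": "final", "content": text}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": {"city": "Tokyo"}}


def run_loop(user_query: str, model=fake_model) -> tuple[str, list[dict]]:
    messages = [{"role": "user", "content": user_query}]
    tool_calls = 0
    while True:
        output = model(messages)
        if output["type"] == "final":
            return output["content"], messages
        if tool_calls >= MAX_TOOL_CALLS:
            raise RuntimeError("tool-call cap exceeded")
        tool_calls += 1
        # The surrounding code, not the model, executes the function.
        result = TOOLS[output["name"]](**output["arguments"])
        messages.append({"role": "assistant", "content": json.dumps(output)})
        messages.append({"role": "tool", "content": json.dumps(result)})


answer, history = run_loop("What's the weather in Tokyo?")
```

The cap matters because nothing in the loop guarantees termination: a model that keeps requesting tools would otherwise run indefinitely.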

What a tool call actually looks like

Concretely, a tool exposed to the model via OpenAI’s function-calling format (the de-facto standard, used in similar form by Anthropic, Google, and most open-weights models) looks like this in the prompt:

   System tools available:

     {
       "name": "get_weather",
       "description": "Get the current weather for a city.",
       "parameters": {
         "type": "object",
         "properties": {
           "city":  { "type": "string", "description": "City name." },
           "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
         },
         "required": ["city"]
       }
     }

When the user asks “What’s the weather in Tokyo?”, the model emits a structured tool call rather than a text response. In the chat format:

   user:      What's the weather in Tokyo?

   assistant: <tool_call>
                 {"name": "get_weather",
                  "arguments": {"city": "Tokyo", "units": "celsius"}}
              </tool_call>

   tool (system runs fn):
              <tool_result name="get_weather">
                 {"temperature": 22, "condition": "clear", "humidity": 65}
              </tool_result>

   assistant: It's currently 22°C and clear in Tokyo, with humidity at 65%.

The <tool_call> and <tool_result> markers are chat-template special tokens (§3.4) that the model has been trained to emit and recognize. The arguments are JSON conforming to the declared schema; the surrounding code parses them, calls the function, and inserts the result back into the conversation.
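The parsing step on the application side might look like the sketch below. It treats the markers as plain text for simplicity, and `validate` is a deliberately minimal, hypothetical check (required keys, unknown keys, enum membership), not a full JSON Schema validator.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}


def parse_tool_calls(text: str) -> list[dict]:
    """Extract and JSON-decode every <tool_call>...</tool_call> body."""
    return [json.loads(body) for body in TOOL_CALL.findall(text)]


def validate(call: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is accepted."""
    errors = []
    params = schema["parameters"]
    args = call.get("arguments", {})
    if call.get("name") != schema["name"]:
        errors.append(f"unexpected tool name: {call.get('name')}")
    for key in params["required"]:
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value for {key}: {value}")
    return errors


assistant_text = """<tool_call>
  {"name": "get_weather",
   "arguments": {"city": "Tokyo", "units": "celsius"}}
</tool_call>"""
calls = parse_tool_calls(assistant_text)
errors = validate(calls[0], GET_WEATHER_SCHEMA)
```

Validation failures are usually fed back to the model as a tool-result message describing the error, which gives it a chance to retry with corrected arguments.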

The training for this behaviour happens in SFT (§5.2): the model is trained on demonstrations that include tool calls in the expected format. By 2026, this is part of the standard instruction-tuning recipe rather than a separate capability.

Categories of tool

Practically deployed tools cluster into a few categories:

Retrieval / search. Look up information in a database, a search index, or the web. Examples: vector search over a knowledge base, web search APIs, document retrieval over a user’s files. Retrieval-augmented generation (RAG; §9, and the RAG chapter) is the most consequential instance.
Code execution. Run code in a sandbox and return the output. Python sandboxes are the most common; calculators are a degenerate case (“eval this arithmetic expression”). Enables the model to do arithmetic, plot data, run simulations, and process structured input outside its own context.
Web browsing. Fetch and render a specific URL, follow links, take screenshots. More expressive than search; the model can interact with pages rather than only read snippets.
Application APIs. Calendar, email, CRM, internal systems — anything with a documented API. The model becomes an interface between the user’s natural-language request and the underlying system.
Computer use. Drive a desktop or a browser through screenshot + click/type actions. The “tool” is the operating system or browser itself. This is the most agentic of the tool patterns and is treated in the Agents chapter.

The tool-calling surface is increasingly the deployment pattern for production LLM applications. A user-facing chatbot is now almost always a model + a constellation of tools + an agentic loop, not a model in isolation.

Multi-tool and parallel calls

Modern function-calling formats allow the model to emit multiple tool calls in one turn, sometimes to be executed in parallel:

   assistant: <tool_call> {"name": "get_weather", "arguments": {"city": "Tokyo"}}  </tool_call>
              <tool_call> {"name": "get_weather", "arguments": {"city": "Paris"}}  </tool_call>
              <tool_call> {"name": "get_time",    "arguments": {"city": "Tokyo"}} </tool_call>

Parallel tool calls reduce latency: rather than waiting for one tool to complete before issuing the next, the surrounding system runs all three concurrently and feeds the three results back together. The model must be trained to emit the parallel-call format; modern models do this natively.
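On the application side, concurrent execution is a thread pool over the parsed calls. The sketch assumes the tools are independent and I/O-bound; the two tool functions are hypothetical stubs that return placeholder dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor


def get_weather(city: str) -> dict:
    """Hypothetical stub standing in for a network call."""
    return {"tool": "get_weather", "city": city}


def get_time(city: str) -> dict:
    """Hypothetical stub standing in for a network call."""
    return {"tool": "get_time", "city": city}


TOOLS = {"get_weather": get_weather, "get_time": get_time}


def run_parallel(calls: list[dict]) -> list[dict]:
    """Run all tool calls concurrently; results keep the order of the calls."""
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [
            pool.submit(TOOLS[call["name"]], **call["arguments"])
            for call in calls
        ]
        return [future.result() for future in futures]


calls = [
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_time", "arguments": {"city": "Tokyo"}},
]
results = run_parallel(calls)
```

Returning results in call order lets each one be matched back to the tool call that produced it when the three tool messages are appended to the conversation.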

Pointer to the Agents chapter

The simple loop above is the LLM-side surface. The systems treatment — multi-turn agentic workflows, complex tool ecosystems with hundreds of tools, multi-agent coordination, error recovery, planning over tool sequences, evaluation of agentic systems — is the substance of the AI Agents and Tool Use chapter. The LLM chapter’s contribution is the substrate: a model trained to emit structured tool calls and consume their results.

§9. Long Context

The context window of an LLM is the maximum number of tokens it can take as input in a single forward pass. From a starting point of 1,024 tokens (GPT-2 in 2019), the context windows of frontier models have grown to 1,000,000 tokens or more by 2026. Long context unlocks new deployments: feeding entire books, entire codebases, hours of meeting transcripts. It also raises a recurrent question: does a “1M-token context window” actually work at 1M tokens?

This section covers what long context means in practice. The architectural techniques (sliding-window attention, RoPE extrapolation, ALiBi, etc.) are developed in §4.7; we will refer to them by name here without re-deriving. The retrieval-as-a-long-context-substitute story is sketched and then deferred to the Retrieval-Augmented Generation chapter.

Nominal vs effective context

A model’s nominal context window is its advertised maximum, the upper bound the architecture supports. The effective context window is how many tokens of context the model can actually use — meaning, attend to, integrate, reason over — without significant performance degradation.

These are not the same number. A model rated for 128k tokens of input does not necessarily reason equally well over the first 1,000 and the last 1,000 of those tokens. Several phenomena commonly cause effective context to lag nominal:

Position-encoding limits. If a model was trained at 4k tokens and extrapolated to 128k via RoPE frequency scaling, the encodings at deep positions are interpolated rather than learned. Information at those positions is correctly addressed but not as cleanly attended to.
Lost-in-the-middle. The model attends most reliably to material near the start and end of its context. Tokens in the middle are de-prioritized, even when nominally addressable.
Distractor pile-up. With long context, the model’s attention is spread thinner. A relevant fact in a 100k-token document is “competed against” by 99,990 tokens of irrelevant material.
Working-memory bottlenecks. Even if all tokens are technically addressable, the model’s internal representations only have so much capacity to hold information across long ranges. Some information is read but not retained for synthesis.

Visually:

   Nominal context: every position addressable.
   Effective context: capacity varies sharply by position.

      |#####|---------------------------------------|#####|
      ▲ ▲▲▲▲ ▲                                       ▲▲▲▲▲ ▲
      start of context   "lost in the middle"        end of context
      (strong attention) (degraded retrieval &       (strong attention)
                          reasoning capacity)

The shape of the curve depends on the model: some have a relatively flat profile across the window; others have steep falloffs starting around 10–20% of the nominal limit.
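The distractor pile-up effect listed above has a simple idealized form. Suppose a single attention head scores one relevant token with a logit margin $s$ over $N - 1$ distractors that all score zero. The softmax weight on the relevant token is then $e^{s} / (e^{s} + N - 1)$, which shrinks as the context grows. The sketch evaluates this under exactly those assumptions; real models have many heads and layers, and distractor logits are not uniform.

```python
import math


def relevant_attention_weight(context_len: int, logit_margin: float) -> float:
    """Softmax weight on one relevant token among uniform distractors."""
    boosted = math.exp(logit_margin)
    return boosted / (boosted + context_len - 1)


# Same logit margin, increasingly long contexts.
weights = {
    n: relevant_attention_weight(n, logit_margin=8.0)
    for n in (1_000, 10_000, 100_000)
}
```

With a fixed margin, the weight on the relevant token falls from most of the attention mass at 1,000 tokens to a few percent at 100,000. Holding it constant would require the margin to grow with $\log N$.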

How long context is actually evaluated

A small number of standard tests have emerged to probe effective context.

Needle-in-a-Haystack (NIAH). Insert a specific factoid (the “needle”) at a specific position in an otherwise irrelevant long context (the “haystack”); ask the model to retrieve it. Sweep over needle positions and over haystack lengths. The result is a 2D heatmap of retrieval accuracy:

                                  needle position →
                               start     middle      end
                       4k     ┌───────────────────────────┐
                              │ ████ │ ████ │ ████ │ ████ │
                              │ ████ │ ████ │ ████ │ ████ │     accuracy
                       16k    │ ████ │ ████ │ ████ │ ████ │     "good"
                              │ ████ │ ███▒ │ ███▒ │ ████ │     ████ → 100%
   haystack            32k    │ ████ │ ███▒ │ ▒▒░░ │ ████ │     ███▒ → ~80%
   length                     │ ████ │ ▒▒░░ │ ░░░░ │ ████ │     ▒▒░░ → ~50%
                       64k    │ ████ │ ░░░░ │ ░░░░ │ ████ │     ░░░░ → ~20%
                              │ ████ │ ░░░░ │ ░░░░ │ ████ │
                       128k   │ ████ │ ░░░░ │ ░░░░ │ ███▒ │
                              └───────────────────────────┘

In the schematic above, the model retrieves needles near the start and end reliably but loses needles in the middle as the haystack grows. Heatmaps measured on real models often show this general shape; the haystack length and needle position at which retrieval drops vary by model.
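The NIAH sweep can be sketched as a small harness. This is an illustrative sketch, not a standard implementation: `build_prompt`, `niah_grid`, the needle text, and the filler sentence are all hypothetical, and `toy_model` is a stand-in for a real LLM call that only "sees" the head and tail of its input so the example runs without any API.

```python
from typing import Callable, Dict, List, Tuple

NEEDLE = "The secret code for the vault is 7421."
QUESTION = "What is the secret code for the vault?"
ANSWER = "7421"
FILLER = "The sky was grey and the streets were quiet."

def build_prompt(n_filler: int, depth: float) -> str:
    """Place the needle at relative `depth` (0.0 = start, 1.0 = end)
    inside a haystack of `n_filler` filler sentences."""
    sentences = [FILLER] * n_filler
    sentences.insert(round(depth * n_filler), NEEDLE)
    return " ".join(sentences) + "\n\n" + QUESTION

def niah_grid(model: Callable[[str], str], lengths: List[int],
              depths: List[float], trials: int = 1) -> Dict[Tuple[int, float], float]:
    """Retrieval accuracy for every (haystack length, needle depth) cell."""
    grid = {}
    for n in lengths:
        for d in depths:
            hits = sum(ANSWER in model(build_prompt(n, d)) for _ in range(trials))
            grid[(n, d)] = hits / trials
    return grid

def toy_model(prompt: str, window: int = 2000) -> str:
    """Stand-in for a real LLM (assumption): it reads only the first and last
    `window` characters, a crude imitation of lost-in-the-middle."""
    seen = NEEDLE in prompt[:window] or NEEDLE in prompt[-window:]
    return ANSWER if seen else "I don't know."

depths = [0.0, 0.25, 0.5, 0.75, 1.0]
grid = niah_grid(toy_model, lengths=[50, 500, 2000], depths=depths)
for n in (50, 500, 2000):
    print(n, [grid[(n, d)] for d in depths])
```

A real sweep would measure haystack length in tokens rather than sentences, use varied filler text, and average over several needles per cell; with sampling enabled, `trials` above 1 turns each cell into an estimated accuracy.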

NIAH is necessary but not sufficient — it tests retrieval, not the harder capacity to reason over long context.

RULER (Hsieh et al., 2024) is a more comprehensive long-context benchmark. It tests retrieval (single and multi-key), aggregation (summing over multiple values scattered through the context), multi-hop tracing (chains of references across the context), and question answering with long input. RULER scores typically degrade more sharply than NIAH scores, exposing the difference between “the model can find this fact” and “the model can use this fact in concert with other facts also in the context”.
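To make the multi-hop tracing idea concrete, the sketch below generates a chain of variable assignments scattered through filler text, in the spirit of the tracing tasks described above. It is an assumption-laden illustration, not RULER's actual generator: the function name, the `VAR_i` naming, and the filler line are hypothetical.

```python
import random

def make_tracing_task(n_hops: int, n_filler: int, seed: int = 0):
    """Build a context with a chain VAR_0 = value, VAR_1 = VAR_0, ...
    scattered among filler lines. Returns (context, question, answer)."""
    rng = random.Random(seed)
    value = str(rng.randint(10000, 99999))
    chain = [f"VAR_0 = {value}"]
    chain += [f"VAR_{i} = VAR_{i - 1}" for i in range(1, n_hops + 1)]
    lines = ["The grass is green."] * n_filler
    # Sorted insertion points keep the chain in order within the context.
    slots = sorted(rng.sample(range(n_filler + 1), len(chain)))
    for offset, (slot, stmt) in enumerate(zip(slots, chain)):
        # Earlier inserts shift later positions by one each, hence the offset.
        lines.insert(slot + offset, stmt)
    question = f"What is the numeric value of VAR_{n_hops}?"
    return "\n".join(lines), question, value

context, question, answer = make_tracing_task(n_hops=3, n_filler=200)
print(question, "->", answer)
```

Answering correctly requires finding every link in the chain, so a model that retrieves any single line reliably can still fail once the links are spread across a long context.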

LongBench, Loong, and follow-ups extend this with more diverse tasks: long-document summarization, long-conversation memory, codebase-scale reasoning. By 2026 a typical evaluation report includes NIAH plus at least one harder benchmark; reporting only NIAH is considered insufficient.

Architectural techniques (recap and pointers)

The techniques that enable long context were developed in §4.7. Briefly recapitulated:

Sliding-window attention caps attention at a window of $W$ tokens, making attention cost linear in $L$ but requiring stacked layers to propagate distant information.
Global-plus-local attention combines a small number of globally-attending tokens with sliding-window otherwise.
Position-encoding extrapolation: RoPE frequency scaling (NTK-aware, YaRN) extends a model trained at one context length to longer contexts at inference time, with quality degradation that grows with the extrapolation factor.
ALiBi is an alternative positional scheme designed to extrapolate gracefully.
Sub-quadratic architectures (state-space models, Mamba; §4.10): linear-time alternatives to attention that change the asymptotic cost of long context entirely.

Each technique partially addresses the cost or quality problem of long context. No single technique solves both: cost-effective long context with high effective-context quality remains an active area.
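The cost and propagation trade-off of sliding-window attention can be checked numerically. The following is a minimal NumPy sketch, assuming a causal window in which each query attends to itself and the previous $W - 1$ tokens; the function names and the sizes $L = 4096$, $W = 512$ are illustrative, not taken from any particular model.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean causal mask: query i may attend to keys j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def reachable_span(n_layers: int, window: int) -> int:
    """Upper bound on the span of positions that can influence a token after
    stacking layers: each layer extends the reach by at most window - 1."""
    return n_layers * (window - 1) + 1

L, W = 4096, 512
mask = sliding_window_mask(L, W)
print("scored pairs, sliding window:", int(mask.sum()))    # grows like L * W
print("scored pairs, full causal:   ", L * (L + 1) // 2)   # grows like L^2 / 2
# Ceiling division: layers needed before the last token can be reached by the first.
print("layers needed to span L:", -(-(L - 1) // (W - 1)))
```

The mask count grows linearly in $L$ for fixed $W$, while the layer count shows the price: information from the start of the context must be relayed through several layers before it can reach the end.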

Retrieval as a long-context substitute

A different approach to “the model needs information from a large corpus”: don’t put the corpus in the context at all. Instead, retrieve the most relevant passages and put those in the context. The model sees a much smaller, more relevant input — often a few thousand tokens of highly targeted material rather than the original 1M-token corpus.

This is retrieval-augmented generation (RAG), the substance of which is developed in the Retrieval-Augmented Generation chapter. The LLM-side surface is what we need here:

   user query
        │
        ▼
   retrieval system (vector search, BM25, or hybrid)
   embed query; find top-K most similar passages from the large indexed corpus
        │
        ▼
   top-K relevant passages
   e.g. 5 passages × 500 tokens = 2.5k tokens
        │
        ▼
   LLM forward pass
   input: query + retrieved passages; output: response conditioned on retrieved evidence
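The retrieve-then-prompt flow can be sketched in a few lines. This is a toy illustration under stated assumptions: a bag-of-words cosine similarity stands in for a learned embedding model and vector index, the three-passage corpus is invented, and `bow`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
import math
import re
from collections import Counter
from typing import List

def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Return the k passages most similar to the query."""
    q = bow(query)
    return sorted(corpus, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]

def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble the LLM input: retrieved evidence first, then the question."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the passages below.\n\n{evidence}\n\nQuestion: {query}"

corpus = [
    "RoPE encodes position by rotating query and key vectors.",
    "BM25 is a lexical ranking function used in search engines.",
    "Paged attention manages the KV cache in fixed-size blocks.",
]
query = "How does BM25 ranking work?"
prompt = build_prompt(query, retrieve(query, corpus, k=2))
print(prompt)
```

The model never sees the full corpus, only the assembled prompt, so the context cost is set by the number and size of retrieved passages rather than by the size of the corpus.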

The retrieval-vs-long-context tradeoff:

Long context, all-in-window: simpler architecture, all information in one place, no retrieval failure modes, but expensive (1M-token inference is slow) and effective-context-limited (the lost-in-the-middle problem applies).
Retrieval + modest context: cheaper per query, sharper input, but failure modes shift to retrieval errors (the relevant passage was not retrieved) and require a separately maintained index.

By 2026 the practical answer in most production systems is hybrid: a moderately long context (32k–128k) plus a retrieval system to pull in additional grounding documents. The pure-long-context path remains attractive for use cases where the user wants to dump an entire codebase or book without preprocessing; the pure-retrieval path remains attractive when the underlying corpus is much larger than any reasonable context window. Most production deployments sit in between.

The open problem OP-LLM-4 in §13 captures the persistent gap: nominal context windows have grown faster than effective-context quality, and there is no consensus on how to close the gap architecturally.

§10. The Open / Closed Model Ecosystem

LLMs in 2026 ship in two distinguishable regimes. Closed-weights models — GPT, Claude, Gemini, and others — are deployed as API services; the trained parameters never leave the provider’s servers, only inference outputs do. Open-weights models — Llama, Mistral, Qwen, DeepSeek, OLMo, and many others — release the trained parameters publicly, so anyone with the hardware can run them locally, fine-tune them further, and inspect their internals. The split is not new (the BERT era was open-weights by default, GPT-3 was closed), but the gap between the two has become a defining feature of the field’s structure, with consequences for evaluation, governance, research access, and the dynamics of the field itself.

The dimensions of openness

“Open” and “closed” are not single attributes; openness has at least four dimensions, and a real model occupies a position on each.

                    OPEN ←────────────────────────────→ CLOSED

   Weights         "downloadable from HuggingFace"     "API only"
                    (Llama, Mistral, Qwen, ...)         (GPT, Claude, Gemini)

   Training        "training scripts, hyperparams,     "no information"
   recipe          and data mixtures published"
                    (OLMo, Pythia)                       (most frontier models)

   Training        "exact training data released"      "data composition described
   data             (Pythia, BLOOM)                     in broad strokes or not at all"
                                                        (most frontier models)

   License         "permissive, commercial OK"         "research-only or terms-
                    (Apache 2.0, MIT, Llama 2+)         restricted-by-acceptable-use"
                                                        (some Llama variants
                                                         have field-of-use carve-outs)

A model can be open along some axes and closed along others. Llama-3 has open weights with a permissive-ish commercial license but does not disclose its training data or full training recipe. OLMo has all four dimensions open. GPT-4 and Claude have all four dimensions closed. Most discussion of “the open-weights ecosystem” really means weights-open; the other dimensions are usually closed even there.

The frontier-gap dynamic

By 2026 the gap between frontier closed-weights models and best-available open-weights models is real but smaller than it was. The dynamic has three components.

Lag. Open-weights models trail the closed frontier by roughly six to eighteen months on standard benchmarks. The lag has fluctuated — sometimes shrinking to a few months (when a major open release lands), sometimes widening when a frontier lab makes a substantial jump. As of the chapter’s snapshot date, the lag on general capability benchmarks is in the six-to-twelve-month range for the largest open-weights models.

Compression. Within the lag, capability is compressed: an open model lagging by twelve months on the benchmark frontier may still be highly usable in production, and the difference between “frontier-quality” and “best-open” is often invisible for many practical applications. The marginal value of frontier capability diminishes for most use cases.

Asymmetry by capability. The gap is not uniform. Open-weights models have closed in particularly hard on text and code; reasoning models (§5.4) and multimodal capability lagged longer because they require additional training stages or modalities that are operationally harder to reproduce. As of 2026, the gap on reasoning and on agentic capability is wider than the gap on pure language tasks.

Whether the gap closes, stabilizes, or widens over the longer run is an open question (recorded as OP-LLM-6 in §13). The competing pressures: frontier labs invest heavily in proprietary advantages (more compute, more data, post-training secrets), but open-weights releases benefit from rapid community iteration, distillation from closed models, and the increasing maturity of public training infrastructure.

Reproducibility under closed deployment

When a model is closed, you cannot reproduce its behaviour from first principles — only observe its outputs. Two consequences pose practical research and evaluation problems.

The silent-update problem. Closed-model providers can change the deployed model at any time without notice. A benchmark result reported in March against an unpinned alias such as gpt-4-turbo may not reproduce in May, because the alias may by then resolve to a different underlying model. The provider may publish date-stamped pinned versions (good practice), or may not (worse practice). Even with pinned versions, the rest of the inference stack (system prompts, safety filters, decoding defaults) can shift without versioning.

Lack of weights for analysis. Mechanistic interpretability (the subject of its own chapter), probing, attention analysis, adversarial robustness studies — all require access to the model’s weights. Closed models foreclose this entire line of research on the frontier. The available alternative — distilling weights-open student models from closed teachers and studying the students — produces useful but partial substitutes.
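One practical response to the silent-update problem is to log enough about every evaluation call that drift can at least be detected later. The sketch below is one possible record format, not an established standard: the `EvalRecord` fields, the `drifted` helper, and the model identifier in the usage lines are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalRecord:
    model_id: str        # pinned, date-stamped identifier as reported by the provider
    system_prompt: str
    user_prompt: str
    temperature: float
    max_tokens: int
    run_date: str        # ISO date of the API call
    output: str

    def fingerprint(self) -> str:
        """Hash of everything that defines the request (not the output or date)."""
        request = {k: v for k, v in asdict(self).items() if k not in ("output", "run_date")}
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]

def drifted(old: EvalRecord, new: EvalRecord) -> bool:
    """Same request, different output: a signal worth investigating.
    With temperature > 0 this can also be plain sampling noise."""
    return old.fingerprint() == new.fingerprint() and old.output != new.output

march = EvalRecord("example-model-2026-01-15", "You are terse.", "2+2?", 0.0, 16, "2026-03-01", "4")
may = EvalRecord("example-model-2026-01-15", "You are terse.", "2+2?", 0.0, 16, "2026-05-01", "Four.")
print(march.fingerprint(), drifted(march, may))
```

A record like this cannot explain why behaviour changed, but it separates changes on the caller's side (a different prompt or decoding setting changes the fingerprint) from changes behind the API.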

Implications for evaluation, governance, and research

Each implication is treated more fully in its own chapter; we summarize here.

Evaluation. Benchmarks need to specify which model version they were run on; comparing scores across versions without disclosure is misleading. The Evaluation chapter discusses test-set contamination as a related issue: closed models may have seen public benchmarks during training (deliberately or by accident), which inflates their scores.
Governance. Regulatory frameworks that target “frontier models” (the EU AI Act, the US executive orders, voluntary commitments at international AI safety summits) operate primarily on closed models, because those are the ones whose providers can be regulated as corporate entities. Open-weights releases complicate this picture: an open model becomes part of the open-source software supply chain and is harder to regulate or recall.
Research access. Academic and independent research depends on open-weights models for any work that requires non-trivial access (interpretability, fine-tuning at scale, controlled training-data experiments). The shape of the open-weights ecosystem therefore matters for the health of the field’s research base, beyond its consequences for deployment.

The split between regimes is unlikely to resolve in either direction. The pragmatic reality of 2026 is both — a frontier with capabilities you cannot run yourself, and a fast-following open-weights tier that supports much of the research, fine-tuning, and on-device deployment of LLM applications.

§11. Survey of Major Model Families

Snapshot date: 2026-05. This section ages faster than any other in the chapter. Treat it as a description of the landscape on the chapter’s last-updated date, not a stable map. The structural observations (which labs ship which kinds of models, how families differentiate themselves) are more stable than the specific model names and version numbers.

We cover eight families. Each entry covers organizational origin, the family’s scale and openness posture, the distinguishing technical choices that have emerged across releases, and the primary deployment context.

GPT family (OpenAI). The originator of the modern decoder-only LLM regime: GPT (2018), GPT-2 (2019), GPT-3 (2020), GPT-4 (2023) and successors (4o, 4.1, the o-series reasoning models). From GPT-3 onward the flagship models are closed-weights and API-only; GPT-2’s weights were released publicly. Distinguishing choices: aggressive scale, early commitment to RLHF (InstructGPT, ChatGPT), pioneering work on reasoning models (the o1, o3, and successor lineage). Primary deployment: OpenAI’s ChatGPT product, the OpenAI API, which serves a large share of third-party LLM applications, and downstream partnerships (Microsoft, others).

Claude family (Anthropic). Closed-weights, API-only. Distinguishing choices: heavy investment in Constitutional AI (§5.3), early and continued emphasis on long context, the longest publicly documented commitment to alignment and safety research among frontier labs, and explicit governance commitments under the Responsible Scaling Policy. The family’s Sonnet, Opus, and Haiku tiers organize a price-vs-capability ladder. Primary deployment: Anthropic’s API, Claude.ai consumer product, and significant enterprise use.

Gemini family (Google DeepMind). Closed-weights flagship models with an open-weights Gemma sibling family for research and on-device deployment. Distinguishing choices: native multimodality (text, image, audio, video) from the start of the line rather than as an add-on, with an extension to robotics in Gemini Robotics, deeply integrated tool use, and exceptionally large context windows (>1M tokens at the top of the line). Primary deployment: Google’s product surface (Search, Workspace, Android, Gemini Robotics), Vertex AI for enterprise, the AI Studio API.

Llama family (Meta). The dominant open-weights family. Meta has released most major Llama versions with permissive (commercial-OK with some restrictions) licenses. Distinguishing choices: weights-open release strategy at near-frontier capability, code and multilingual specialization variants (Code Llama, multilingual Llamas), and a wide range of model sizes within each release for diverse deployment contexts. Primary deployment: the open-source ecosystem, on-device applications, and Meta’s own products. Llama releases shape what “best open-weights” means at any given moment.

Mistral family (Mistral AI). European open-weights and closed-frontier hybrid. Open-weights releases include Mistral 7B, Mixtral (the popular MoE variant), and successors; the flagship Mistral Large family is API-only. Distinguishing choices: early and prominent use of mixture-of-experts (§4.6) in production-scale open-weights releases, a strong focus on inference efficiency, and a European regulatory posture that has shaped some of the family’s deployment dynamics.

Qwen family (Alibaba). A mix of closed-frontier and open-weights releases. The Qwen lineage (including Qwen-VL for vision-language and Qwen-Coder for code) has been increasingly competitive with the best open-weights models from Western labs. Distinguishing choices: strong Chinese-language capability, broad multilingual coverage, frequent open-weights releases of substantial models. Primary deployment: Chinese consumer and enterprise products (Alibaba’s stack) and the global open-source ecosystem.

DeepSeek family (DeepSeek). Chinese lab whose 2024–2025 releases, particularly DeepSeek-V3 (a large MoE model) and the DeepSeek-R1 reasoning models, closed the open-weights gap dramatically. Distinguishing choices: aggressive MoE design, the GRPO algorithm introduced in the lab’s earlier DeepSeekMath work and popularized by DeepSeek-R1 (§5.3), and an exceptionally cost-efficient training-and-inference posture. Primary deployment: open-weights releases that have been widely fine-tuned and distilled across the ecosystem.

Open-weights research models (OLMo, Pythia, BLOOM, and others). Models released with not just weights but also full training data, training scripts, intermediate checkpoints, and reproducible recipes. OLMo (AI2), Pythia (EleutherAI), and BLOOM (BigScience) are the canonical examples. These models are not competitive with frontier capability — they are deliberately smaller, more transparent, and intended for research rather than deployment. They are the substrate of much of the academic interpretability, training-dynamics, and scaling-laws literature.

The structural picture: a small number of closed-weights frontier labs (OpenAI, Anthropic, Google DeepMind), a hybrid Western tier (Meta and Mistral straddling open/closed), a strong Chinese tier (Qwen, DeepSeek) increasingly defining what “best open-weights” means, and a research-oriented tier (OLMo and successors) maintaining the fully-reproducible substrate. The composition of these tiers will shift; the shape of having multiple tiers is likely durable.

§12. Connections to Other Chapters

LLMs sit at the intersection of many other chapters. The list below covers both the chapters this treatment depends on and the chapters that develop topics it touches but does not develop:

Foundation Models (prerequisite) — the conceptual frame for the whole regime. Definitions, scaling laws (§6 of FM), the three pillars (§3 of FM). This chapter presumes that material.
Deep Learning (prerequisite) — architectural substrate. Attention, transformer mechanics, positional encodings, optimizers, normalization. Referenced extensively from LLM §4.
Self-Supervised Learning (prerequisite) — pretraining objectives in general (causal LM, masked LM, span corruption, contrastive). The LLM chapter uses these as building blocks; the SSL chapter develops them.
Theoretical Foundations of Learning — generalization theory and the modern generalization puzzle. Referenced from LLM §7 for the theoretical accounts of in-context learning, particularly the implicit-gradient-descent account.
Reasoning Models — test-time-compute deliberation built on LLMs. The o1/o3 lineage, GRPO-trained reasoning models, chain-of-thought RL, scaling test-time compute. LLM §5.4 (“reasoning-RL deferred”) points forward to the dedicated chapter.
AI Agents and Tool Use — agentic deployment of LLMs. The systems story behind LLM §8: multi-turn agentic loops, planning over tools, complex agent ecosystems, agent evaluation.
Retrieval-Augmented Generation — retrieval-as-an-LLM-tool and as a long-context substitute. LLM §8 (tool use) and §9 (long context) both point at RAG; the RAG chapter develops retriever design, indexing strategies, evaluation of retrieval-augmented systems.
Mechanistic Interpretability — what is inside an LLM at the circuit level. Heavily relevant to LLM §4 (the residual stream view) and §7 (induction heads as part of an ICL story).
Alignment — preference tuning in depth (RLHF, DPO, GRPO and the safety questions around them), scalable oversight, constitutional AI, the open problems around prompt injection and reward hacking. LLM §5.3 covers the algorithmic surface; the Alignment chapter covers the substance.
Evaluation — benchmarking, contamination, capability assessments, dangerous-capability evaluations. The LLM chapter touches these at relevant points; the dedicated chapter does the substance.
Multimodal Models — extension of the LLM regime to images, audio, video, embodied control. The LLM chapter holds the language-only core; the multimodal chapter holds extensions.
AI for Science — domain-specific LLM applications (mathematics, chemistry, biology). Distillation and fine-tuning of general LLMs onto scientific domains; specialized models like AlphaFold-lineage protein models.

§13. Limitations and Open Problems

A research-oriented inventory of LLM-specific open problems. Each is the LLM facet of a broader question (the Foundation Models chapter §11 holds the cross-cutting list, OP-FM-1 through OP-FM-15); the items below are language-specific. We do not adjudicate any of them — the point is to mark what is unresolved as of 2026 so the reader can place subsequent literature against the right open questions.

OP-LLM-1: Hallucination and factuality. LLMs produce confident-sounding statements that are false at non-trivial rates. The phenomenon has multiple causes — gaps in training data, training objectives that reward fluency over accuracy, the autoregressive commitment to a path once a few tokens are chosen — and partial mitigations (RAG to ground in retrieved evidence, confidence-thresholded refusal, post-hoc verification with external tools), none fully satisfactory. The cross-chapter version is OP-FM-5; the LLM-specific facet is that mitigations interact with the chat-style instruction-following surface in ways that are not yet well-characterized. Key references: surveys on hallucination in LLMs; RAG literature; verification-and-refinement methods.

OP-LLM-2: Calibration of LLM confidences. A well-calibrated model assigns probabilities that match empirical frequencies — when it says “70% confident”, it is right 70% of the time. Pretrained LMs are reasonably well-calibrated on token probabilities; post-trained chat models often are not, because preference tuning rewards confident-sounding language regardless of underlying uncertainty. Verbalized uncertainty (“I’m not sure, but...”) is poorly correlated with actual model uncertainty. How to recover calibration after preference tuning — and how to estimate the model’s real uncertainty about a claim — is open. Connection: this is the LLM-specific facet of a broader Bayesian-calibration problem in modern ML.
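The definition of calibration above can be turned into a number with expected calibration error (ECE), a common binned estimate. The sketch below is one standard formulation using equal-width bins; the function name is illustrative, and the input confidences could come from token probabilities or from verbalized confidence statements parsed into numbers (an assumption about how they are obtained).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by stated confidence and average |accuracy - confidence|
    across bins, weighted by the fraction of predictions in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1]; 1.0 lands in the last bin.
    bins = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# "70% confident" and right 7 times out of 10: calibrated.
print(expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3))
# "90% confident" and right 5 times out of 10: overconfident.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

ECE depends on the binning and says nothing about where the confidences come from, so it measures the gap this open problem describes without closing it.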

OP-LLM-3: Multilingual coverage and English-centric biases. As developed in §3.5, modern tokenizers and training corpora skew heavily toward English and a few other high-resource languages. Per-token costs, effective context windows, and downstream capability are all worse for low-resource languages. Multilingual tokenizer design (§3) and balanced corpora (§5.1) reduce the gap but do not close it. The relationship between training-data fraction and downstream capability is not linear, and the right way to allocate scarce data and compute across thousands of languages is unsolved.

OP-LLM-4: Long-context fidelity vs nominal context length. As developed in §9, nominal context windows have grown faster than effective context. The gap between “the model can address position 500,000 in the context” and “the model can usefully reason over what is at position 500,000” remains. Architectural and training-time techniques (§4.7, §9) close part of the gap; whether the gap closes in full, or fundamentally requires retrieval as an architectural primitive, is open. Related cross-chapter open problem: OP-DL-4 in the Deep Learning chapter.

OP-LLM-5: Reproducibility under closed deployment. As developed in §10, closed-model providers can update their deployed models silently. Benchmark results, scientific claims about model behaviour, and downstream applications can all break without notice. Practical reproducibility now depends on pinned model versions, careful prompt-version logging, and where possible cross-validation against open-weights models. Whether the closed-deployment regime can be made scientifically reproducible without forcing open-weights release is open.

OP-LLM-6: Open vs closed capability gap dynamics. As developed in §10, the gap fluctuates and currently sits in the six-to-twelve-month range on most benchmarks for the largest open models, but is wider on reasoning and agentic tasks. Whether the gap closes, stabilizes, or widens over time depends on dynamics — proprietary investment vs community iteration — that are themselves not settled. The cross-chapter version is OP-FM-12 (centralization concerns).

OP-LLM-7: Memory and personalization without privacy compromise. A useful assistant should remember details about its user across sessions; a privacy-respecting one should not surrender that information. Long-context approaches treat memory as input retrieval; fine-tuning approaches bake memory into parameters; external-store approaches keep memory outside the model. Each has failure modes: long-context approaches don’t generalize across sessions; fine-tuning approaches risk memorization of sensitive data; external-store approaches risk leakage through prompt-injection attacks. The right architecture for safe personalization is unsolved. Related cross-chapter problem: OP-FM-8 (knowledge updating and continual learning).

OP-LLM-8: Code and math — reasoning vs pattern-match. LLMs solve many programming problems and mathematical questions; the disagreement is over how. Are these instances of underlying compositional reasoning, or sophisticated retrieval of training-corpus patterns? The empirical evidence supports a mix: clear reasoning on novel-but-structurally-familiar problems, clear pattern-matching on problems whose surface form was over-represented in training. Reasoning models (§5.4) and CoT (§7) push the balance toward reasoning, but the underlying question of when the model is doing what remains unresolved. Connection: this is the LLM facet of OP-FM-6 (compositional generalization).

OP-LLM-9: Tokenization-induced failure modes. As developed in §3.6, tokenization design choices cascade into model behaviour in unintuitive ways: digit splitting affecting arithmetic, language splitting affecting multilingual coverage, glitch tokens like SolidGoldMagikarp. The relationship between tokenizer choices and downstream failure modes is not fully mapped; tokenizer design is largely empirical. Tokenizer-free architectures (ByT5, MegaByte; §3.7) are a possible eventual resolution but currently come with sequence-length costs that frontier deployments do not accept.

OP-LLM-10: Instruction-following robustness under adversarial input. Instruction-tuned models follow user instructions reliably under normal use but can be subverted by carefully crafted adversarial inputs — prompt injection (instructions hidden inside data the model is asked to process), jailbreaks (prompts that bypass refusal training), and various indirect attacks via tools and retrieval. The fundamental issue is that the model has no architectural distinction between “data to process” and “instructions to follow”; everything is tokens in the context window. How to make instruction-following robust without breaking generality is an open alignment-and-security problem. Treated in depth in the Alignment chapter.

These ten open problems are not exhaustive. They are the ones most often surfaced in current research on language-specific LLM behaviour. The cross-cutting Foundation Models open-problems list (OP-FM-1 through OP-FM-15) addresses the broader regime; the chapter-specific lists in Reasoning Models, Agents, Mechanistic Interpretability, Alignment, Evaluation, and Multimodal Models will surface adjacent problems we do not duplicate here.

§14. Critiques

The broad critiques of the foundation-model regime — symbolic AI, causal critique, embodiment critique, world-models perspective — live in the Foundation Models chapter (§12). This section addresses only critiques that are language-specific; we cite forward to FM §12 for the rest.

The “stochastic parrot” position. Bender et al. (2021), in “On the Dangers of Stochastic Parrots”, argued that large language models produce fluent-sounding text by interpolating between training examples without genuine semantic understanding, and that calling them “intelligent” or “linguistic” attributes capabilities they do not have. The position has three threads. First, an empirical claim: LLMs operate by pattern completion over training data rather than by grounded language use, and their outputs reflect this. Second, a methodological claim: the field’s evaluation practices conflate fluent output with understanding. Third, an ethical claim: the costs of training and deploying these models (environmental, economic, social) are not justified by the capabilities delivered, especially in light of the harms (misinformation, displacement of low-resource speech communities, concentration of power).

By 2026 the position remains contested. The strongest empirical version — that LLMs only pattern-match and never reason in any meaningful sense — is hard to maintain against reasoning models (§5.4) and the empirical record of LLMs solving novel multi-step problems. The weaker version — that pattern completion is a large fraction of what LLMs do, and that the field tends to over-attribute capability — is widely accepted, including by researchers who disagree with the original paper’s other claims. The methodological and ethical threads have outlasted the empirical one and continue to shape evaluation practice and governance debates.

The memorization-vs-generalization debate. A related but distinct concern: how much of an LLM’s apparent capability is novel synthesis and how much is retrieval of training-data passages? Studies (Carlini et al. on extraction attacks; subsequent work on training-data contamination of benchmarks) show that LLMs can be made to emit verbatim training data, that benchmark contamination is widespread, and that capability scores reported in papers are sometimes inflated by overlap between training data and test sets. The empirical picture is mixed: clear evidence of memorization on some problems, clear evidence of generalization on others, with the boundary not cleanly characterized.

The debate matters because it interacts with claims about scale, with copyright and licensing concerns about training data, and with evaluation methodology. It is also the most concrete instance of OP-LLM-8 (reasoning vs pattern-match).

LLM-specific embodiment critique. The broader embodiment critique, that systems trained without any link to a physical world cannot acquire grounded concepts, lives in FM §12. The LLM-specific facet is sharper: LLMs are explicitly text-only systems. Their model of the world is the model that fits the statistics of human text. They have never observed an object, never failed to lift a heavy box, never moved through space. Critics (Bender and Koller, McClelland and colleagues, others) argue that this places hard limits on what LLMs can understand about physical, perceptual, and embodied concepts, and that benchmarked capability on these topics reflects fluent retrieval from text about them, not the kinds of representation embodied agents possess.

The counter-argument: enough text contains enough indirect signal that LLMs do build partial representations of physical concepts (object permanence, causation, spatial reasoning); the gap is in degree, not kind; and multimodal models (the Multimodal Models chapter) close part of the gap by adding visual and increasingly other sensory input. As of 2026, neither position is settled.

Editorial note. As elsewhere in this book, we survey these critiques as positions, not as decided questions. The methodologically responsible reader takes them seriously without committing to any of them prematurely.

§15. Further Reading

An opinionated list, not a survey. The chapter cites many primary papers inline; this section lists the ones worth reading first if the reader wants to go deeper, grouped by topic and annotated with what each gives that the chapter does not.

Foundational papers (the LLM lineage in primary sources)

Vaswani et al. (2017), “Attention Is All You Need.” The Transformer paper. The chapter’s §4 derives from this; reading the paper itself is still worth the half hour for its framing and the encoder-decoder picture that subsequent decoder-only models inherited.
Devlin et al. (2018), “BERT.” The encoder-only / masked-LM lineage. The clearest single source for masked language modeling and the BERT-fine-tuning pattern.
Radford et al. (2018, 2019), “Improving Language Understanding by Generative Pre-Training” and GPT-2. The decoder-only / causal-LM lineage. Together with Brown et al. (2020), the canonical sources for the regime the chapter develops.
Brown et al. (2020), “Language Models are Few-Shot Learners” (GPT-3). The empirical paper that established in-context learning as the centrepiece of LLM behaviour. Also the cleanest documentation of how scale changes capability profile.
Ouyang et al. (2022), “InstructGPT.” The standard reference for the modern training pipeline (pretrain → SFT → RLHF). Reads quickly and grounds the algorithmic developments in §5.

Tokenization

Sennrich, Haddow, Birch (2016), “Neural Machine Translation of Rare Words with Subword Units.” The BPE-for-NLP paper. §3.2’s training and encoding algorithms come from here.
Kudo (2018), “Subword Regularization.” The Unigram-LM tokenization paper. §3.3.
Kudo and Richardson (2018), “SentencePiece.” The framework paper; explains the design decisions that make SentencePiece usable across languages without word-boundary preprocessing.

Architecture and scaling

Su et al. (2021), “RoFormer.” RoPE introduced; §4.5’s rotation construction.
Ainslie et al. (2023), “GQA.” Grouped-query attention; §4.6’s KV-cache reduction.
Shazeer (2020), “GLU Variants Improve Transformer.” SwiGLU and related; §4.7’s feedforward choice.
Zhang and Sennrich (2019), “RMSNorm.” The simpler-than-LayerNorm normalization; §4.7.
Kaplan et al. (2020), “Scaling Laws for Neural Language Models.” The original empirical scaling-law paper. Foundation Models §6 unpacks the methodology.
Hoffmann et al. (2022), “Training Compute-Optimal Large Language Models” (Chinchilla). The compute-optimal revision; this is the paper to read if you want one source on scaling-law methodology.

Inference

Holtzman et al. (2020), “The Curious Case of Neural Text Degeneration.” Introduces nucleus sampling; the source for §6.3’s decoding-strategy framing.
Leviathan, Kalman, Matias (2023), “Fast Inference from Transformers via Speculative Decoding.” §6.4’s rejection-sampling argument; clearer than the simultaneous-and-equivalent Chen et al. paper.
Kwon et al. (2023), “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). §6.5’s paged-attention mechanism; one of the most influential serving-system papers of the recent period.

Preference tuning

Christiano et al. (2017), “Deep Reinforcement Learning from Human Preferences.” The pre-LLM RLHF paper. Foundational for the algorithmic ideas; reads in an afternoon.
Rafailov et al. (2023), “Direct Preference Optimization.” The DPO paper. §5.3’s derivation follows this paper, whose exposition is clear and worth reading in the original.
Schulman et al. (2017), “Proximal Policy Optimization Algorithms.” PPO; the RL primitive underlying RLHF.
Bai et al. (2022), “Constitutional AI.” RLAIF and the constitutional approach; the standard reference and a clear methodological paper.

Prompting and ICL

Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” §7.3’s CoT origin.
Kojima et al. (2022), “Large Language Models are Zero-Shot Reasoners.” Zero-shot CoT.
Wang et al. (2022), “Self-Consistency Improves Chain of Thought Reasoning.” Self-consistency.
Olsson et al. (2022), “In-context Learning and Induction Heads.” Mechanistic story for ICL; §7.5’s induction-head treatment.

Long context and retrieval

Press, Smith, Lewis (2021), “Train Short, Test Long: Attention with Linear Biases.” ALiBi.
Peng et al. (2023), “YaRN.” RoPE extrapolation done carefully; §4.9.
Hsieh et al. (2024), “RULER.” Long-context evaluation; §9’s harder-than-NIAH benchmark.

Open-weights documentation

Llama, Mistral, Qwen, DeepSeek model cards and technical reports. Each release has accompanying documentation; these are the primary sources for the model-family details in §11. Llama-3 and DeepSeek-V3 are particularly worth reading for their disclosed training details.
OLMo and Pythia papers (Groeneveld et al., 2024; Biderman et al., 2023). Fully-open research models; the canonical sources for reproducible LLM research substrates.

Critiques

Bender et al. (2021), “On the Dangers of Stochastic Parrots.” §14’s stochastic-parrot critique.
Carlini et al. (papers on training-data extraction). The empirical basis for the memorization-vs-generalization debate.

Surveys (when you want one paper covering a lot of ground)

Bommasani et al. (2021), “On the Opportunities and Risks of Foundation Models.” The Stanford CRFM report that coined “foundation model.” Long, but the section structure makes it skimmable.
Any current arXiv survey titled “A Survey of Large Language Models” — multiple have been written; they age quickly but are useful overviews when current.

This list will be revised as the field evolves. Readers from a future snapshot should treat the older entries as foundational and consult more recent surveys for current state.

§16. Exercises and Experiments

The exercises below are research-style rather than textbook-style: each is open-ended enough that the “answer” is a small empirical investigation rather than a closed-form solution. They are sized so that a graduate student or research practitioner with access to a single GPU (or, in some cases, only an API account) can complete each in a few hours to a few days. Runnable notebooks are planned in notebooks/llm/.

Per the conventions, each exercise declares its intent (demonstration — deterministic and fast — or exploration — open-ended, possibly expensive) and lists what the reader is expected to come away with.

E1. Implement greedy / temperature / top-k / nucleus sampling from scratch.

Intent: demonstration.

Setup. Pick a small open-weights model (e.g., a 1B-parameter Llama variant or smaller). Using the model’s tokenizer and forward-pass interface — not a high-level generation API — implement, by hand, the four decoding strategies from §6.3: greedy, temperature sampling, top-k sampling, and nucleus (top-p) sampling.

Tasks.

Implement each strategy as a function decode(logits, **params) → token_id.
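Run each on the same prompt, with the same seed for the random number generator. Three things should be visible: greedy produces identical output on every run regardless of seed; sampled outputs change as the temperature changes; and the sampling set keeps a fixed size under top-k but grows or shrinks under top-p as the model’s confidence varies from position to position.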
Reproduce §6.3’s “What each strategy actually picks” example by constructing a specific output distribution (either by manually crafting it or by finding a real position in a generation where the distribution has the desired shape) and verifying that each strategy selects from the expected subset.

Takeaway. A working mental model of what each parameter actually does to the next-token distribution.
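
A possible starting point for E1, written against plain Python lists so that it runs without a model. This is a minimal sketch, not a reference implementation: the function names and the toy `peaked` and `flat` logit vectors are illustrative assumptions, and applying temperature before the top-p cutoff is one common ordering that a given library may or may not share. In the exercise, `logits` would be the model's output at the last position.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Index of the largest logit; deterministic, so the seed is irrelevant."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_set(logits, k):
    """The k highest-scoring token ids; the set size never changes."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def top_p_set(logits, p, temperature=1.0):
    """Smallest prefix of the ranked tokens whose probability mass reaches p."""
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


def sample(logits, rng, temperature=1.0, allowed=None):
    """Sample a token id, optionally restricted to an allowed subset."""
    probs = softmax(logits, temperature)
    if allowed is not None:
        keep = set(allowed)
        probs = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    # random.Random.choices accepts unnormalized weights.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


peaked = [10.0, 0.0, 0.0, 0.0]  # toy "confident" position (assumption)
flat = [1.0, 1.0, 1.0, 1.0]     # toy "uncertain" position (assumption)
rng = random.Random(0)

print("greedy:", greedy(peaked))
print("top-k sizes:", len(top_k_set(peaked, 2)), len(top_k_set(flat, 2)))
print("top-p sizes:", len(top_p_set(peaked, 0.9)), len(top_p_set(flat, 0.9)))
print("nucleus sample:", sample(flat, rng, temperature=0.7, allowed=top_p_set(flat, 0.9)))
```

The two toy vectors make the top-k versus top-p contrast concrete: the top-k set has the same size at both positions, while the nucleus collapses to a single token at the confident position and spans the whole toy vocabulary at the uncertain one.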

E2. Reproduce ICL sensitivity to prompt format.

Intent: exploration.

Setup. Pick an open-weights model in the 7B–30B range. Choose a simple few-shot task — sentiment classification, simple arithmetic, or a small synthetic transformation task with known ground truth.

Tasks.

Construct a “canonical” few-shot prompt with four demonstrations.
Generate variations of the canonical prompt by changing surface features only — separators (comma vs newline vs colon), capitalization (lowercase vs title case), demonstration ordering (random permutations), prefix/suffix wording. Each variation should preserve semantic content.
Evaluate each variation on a held-out test set and record accuracy.
Tabulate the resulting accuracies. The spread should be substantial — easily tens of percentage points — replicating the Lu et al. (2022) finding cited in §7.4.

Takeaway. A direct empirical encounter with ICL’s format sensitivity. A reader who has run this exercise will not over-attribute capability to fragile prompts again.
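
One way to generate the surface variations for E2 is to cross a few formatting axes mechanically. The sketch below is an assumption-laden illustration: the four demonstrations, the `Review` / `Sentiment` field names, and the particular separators are invented for the example, and the evaluation step (running each prompt through a model) is left out.

```python
import itertools
import random

# Hypothetical demonstrations for a sentiment task.
DEMOS = [
    ("great film", "positive"),
    ("dull plot", "negative"),
    ("loved it", "positive"),
    ("a waste of time", "negative"),
]


def render(demos, query, sep, joiner, label_case):
    """Format demonstrations plus the unanswered query as one prompt."""
    blocks = [
        f"Review{sep}{text}{joiner}Sentiment{sep}{label_case(label)}"
        for text, label in demos
    ]
    blocks.append(f"Review{sep}{query}{joiner}Sentiment{sep}")
    return "\n\n".join(blocks)


def variations(demos, query, n_orders=3, seed=0):
    """Yield prompts that differ only in surface form, never in content."""
    rng = random.Random(seed)
    orders = [list(demos)]
    for _ in range(n_orders - 1):
        shuffled = list(demos)
        rng.shuffle(shuffled)  # may occasionally repeat an earlier order
        orders.append(shuffled)
    seps = [": ", " - ", ":\n"]
    joiners = ["\n", " | "]
    cases = [str.lower, str.title]
    for order, sep, joiner, case in itertools.product(orders, seps, joiners, cases):
        yield {
            "sep": sep,
            "joiner": joiner,
            "case": case.__name__,
            "prompt": render(order, query, sep, joiner, case),
        }


all_prompts = list(variations(DEMOS, "fine acting"))
print(len(all_prompts), "variations")
print(all_prompts[0]["prompt"])
```

Keeping the axis values alongside each prompt makes the final table easy to slice: accuracy can then be grouped by separator, by casing, or by ordering to see which surface feature moves the result most.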

E3. Compare DPO and RLHF on a small preference dataset.

Intent: exploration; expensive end of the chapter’s exercise budget.

Setup. Pick a small SFT model and a small preference dataset (UltraFeedback, HH-RLHF, or a custom dataset built for the exercise). Fix a basis for comparison: a held-out evaluation set of paired preferences not seen during training.

Tasks.

Train two checkpoints from the same starting SFT model, using the same preference data: one with DPO (§5.3), one with PPO-based RLHF (§5.3) including a separately-trained reward model.
Evaluate both on the held-out preference set: which model produces preferred responses more often?
Track training-time complexity: wall-clock time, peak GPU memory, hyperparameter sweeps required to get a working result.
Report both axes (preference accuracy and operational cost). The expectation to test is that DPO is operationally simpler at the cost of modestly worse preference performance.

Takeaway. First-hand evidence of the operational-simplicity-vs-flexibility tradeoff developed in §5.3.
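
For the DPO half of E3, the per-pair objective from Rafailov et al. (2023) is small enough to write out directly. The sketch below assumes the summed log-probabilities of the chosen and rejected responses, under both the policy and the frozen reference model, have already been computed; the function names are illustrative. The `preference_accuracy` helper is one possible proxy for the held-out comparison (the fraction of pairs where the implicit reward ranks the chosen response higher), not the only way to score it.

```python
import math


def implicit_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """beta * (log-ratio of the chosen response minus log-ratio of the rejected one)."""
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)), in a numerically stable form."""
    m = implicit_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log(1 + exp(-|m|)).
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def preference_accuracy(pairs, beta=0.1):
    """Fraction of (pol_c, pol_r, ref_c, ref_r) tuples with a positive margin."""
    wins = sum(implicit_margin(*pair, beta=beta) > 0 for pair in pairs)
    return wins / len(pairs)


# Toy log-probabilities (assumptions, not measured values).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))               # policy equals reference
print(dpo_loss(-1.0, -5.0, -3.0, -3.0, beta=1.0))  # policy favours the chosen response
print(preference_accuracy([(-1.0, -5.0, -3.0, -3.0), (-5.0, -1.0, -3.0, -3.0)]))
```

When the policy equals the reference the margin is zero and the loss sits at $\log 2$; the loss falls only as the policy raises the chosen response relative to the rejected one, measured against the reference.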

E4. Effective vs nominal long-context with NIAH.

Intent: demonstration.

Setup. Pick an open-weights long-context model (e.g., a Llama variant with a 128k or larger context window). Use a public NIAH (Needle-in-a-Haystack) implementation or write a minimal one.

Tasks.

Sweep the haystack length: 4k, 8k, 16k, 32k, 64k, 128k tokens of filler.
Sweep the needle position: early (10% of context), middle (50%), late (90%).
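For each (length, position) pair, ask the model to retrieve the needle 5 to 10 times to reduce variance. Under greedy decoding identical prompts give identical answers, so vary the needle or the filler between trials, or sample at a nonzero temperature.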
Produce a heatmap analogous to §9.2’s schematic.
Optionally extend to a harder retrieval-plus-reasoning task (e.g., RULER-style multi-hop tracing) and compare the two heatmaps.

Takeaway. Quantitative encounter with the nominal-vs-effective context gap for whichever model is at hand.
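
A minimal harness for the E4 sweep might look like the following. Everything model-specific is pushed into an `ask` callable that the reader supplies; the `forgetful` stand-in used here is a hypothetical placeholder that simply truncates its input, so that the sketch runs without a model. Lengths are counted in whitespace-separated words as a rough proxy for tokens, which is an assumption a real run should replace with the model's tokenizer.

```python
import random


def build_context(filler, needle, n_words, depth):
    """Repeat the filler to n_words words and splice the needle in at a fractional depth."""
    base = filler.split()
    words = (base * (n_words // len(base) + 1))[:n_words]
    cut = int(depth * n_words)
    return " ".join(words[:cut] + [needle] + words[cut:])


def sweep(ask, filler, lengths, depths, trials=5, seed=0):
    """Return {(length, depth): retrieval rate} over a grid of haystacks."""
    rng = random.Random(seed)
    grid = {}
    for n_words in lengths:
        for depth in depths:
            hits = 0
            for _ in range(trials):
                secret = str(rng.randint(10000, 99999))  # fresh needle per trial
                needle = f"The magic number is {secret}."
                context = build_context(filler, needle, n_words, depth)
                prompt = context + "\n\nWhat is the magic number?"
                hits += secret in ask(prompt)
            grid[(n_words, depth)] = hits / trials
    return grid


def forgetful(prompt, window=1000):
    """Stand-in for a model: it only 'sees' the first `window` words."""
    return " ".join(prompt.split()[:window])


grid = sweep(
    forgetful,
    "the quick brown fox jumps over the lazy dog",
    lengths=[500, 4000],
    depths=[0.1, 0.5, 0.9],
)
for (n_words, depth), rate in sorted(grid.items()):
    print(n_words, depth, rate)
```

The returned dictionary maps directly onto a heatmap with length on one axis and depth on the other. Drawing a new needle for every trial keeps repeated trials informative even when decoding is deterministic.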

E5. Tokenizer effects on arithmetic and code.

Intent: exploration.

Setup. Pick two LLMs with substantially different tokenizers — typically (a) a model whose tokenizer splits multi-digit numbers into individual digits and (b) a model whose tokenizer keeps multi-digit chunks (e.g., 50,000-vocabulary BPE).

Tasks.

Construct a small arithmetic test set (addition, subtraction, multiplication of two 3-to-5-digit numbers; 200 problems).
Evaluate each model’s accuracy on the test set with greedy decoding.
Inspect the tokenization of the inputs and outputs: how is “1234 + 5678” tokenized in each model?
Repeat with a code-completion benchmark, comparing tokenization of identifiers like getUserName.
Discuss the difference, if any. Tokenizer effects should be visible; the magnitude varies.

Takeaway. Concrete evidence for §3.6’s claim that tokenizer choices cascade into capability.
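
The arithmetic half of E5 needs a reproducible test set and an exact-match scorer; a sketch of both follows. The `generate` argument is a hypothetical callable wrapping whichever model is under test. The two digit-splitting helpers are toy illustrations of the two tokenization styles named in the setup, not the behaviour of any particular tokenizer; the real tokenizations should be inspected with each model's own tokenizer.

```python
import operator
import random
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_problems(n=200, seed=0):
    """Seeded list of (question, answer) pairs over 3-to-5-digit operands."""
    rng = random.Random(seed)
    problems = []
    for _ in range(n):
        a = rng.randint(100, 99999)
        b = rng.randint(100, 99999)
        op = rng.choice(sorted(OPS))
        problems.append((f"{a} {op} {b} =", str(OPS[op](a, b))))
    return problems


def extract_answer(text):
    """First integer in the model output, with thousands separators removed."""
    match = re.search(r"-?\d[\d,]*", text)
    return match.group(0).replace(",", "") if match else None


def accuracy(problems, generate):
    """Exact-match accuracy of generate(question) against the true answer."""
    correct = sum(extract_answer(generate(q)) == ans for q, ans in problems)
    return correct / len(problems)


def split_digits(number):
    """Toy scheme (a): one token per digit."""
    return list(number)


def chunk_digits(number, size=3):
    """Toy scheme (b): left-to-right chunks of up to `size` digits."""
    return [number[i:i + size] for i in range(0, len(number), size)]


problems = make_problems()
print(problems[0])
print(split_digits("12345"), chunk_digits("12345"))
```

Chunked schemes can misalign place values between operands of different lengths (the toy chunker turns 12345 into 123 and 45), which is one concrete thing to look for when inspecting the real tokenizations.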

E6. Structured-output decoding strategies.

Intent: demonstration.

Setup. Pick an open-weights model and a task whose output must be valid JSON conforming to a small schema (e.g., extracting structured records from short bios).

Tasks.

Implement three strategies:
1. Free-form generation + post-hoc parsing (try to parse the model’s output; retry on failure).
2. JSON-constrained decoding (using Outlines, lmformatenforcer, or a hand-rolled mask).
3. Function calling (using the model’s native tool-call format from §8 to request structured output).
Evaluate three things on a held-out test set: rate of valid JSON output, semantic accuracy of the extracted fields, average tokens generated per response.
Compare the strategies on these three axes.

Takeaway. Hands-on understanding of why constrained decoding has displaced free-form-plus-parse for production structured-output applications.
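
The core of a hand-rolled mask for E6 is a single step: before picking a token, set the logit of every token the output format forbids to negative infinity. The sketch below shows that step together with a validity metric for the evaluation; the set of allowed ids is passed in by hand here, whereas the libraries named in the task derive it from the schema at each position. All names and the toy records are illustrative assumptions.

```python
import json
import math


def apply_mask(logits, allowed_ids):
    """Keep logits for allowed token ids; send all others to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_greedy(logits, allowed_ids):
    """Greedy choice restricted to the tokens the format permits."""
    masked = apply_mask(logits, allowed_ids)
    return max(range(len(masked)), key=lambda i: masked[i])


def is_valid_record(text, required_keys):
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(required_keys) <= set(obj)


def valid_rate(outputs, required_keys):
    """Fraction of outputs that are valid records."""
    return sum(is_valid_record(o, required_keys) for o in outputs) / len(outputs)


# Token 0 has the highest logit but is not permitted at this position.
print(constrained_greedy([5.0, 1.0, 3.0], allowed_ids={1, 2}))

outputs = [
    '{"name": "Ada", "born": 1815}',  # valid
    '{"name": "Ada"',                 # truncated, does not parse
    '["Ada", 1815]',                  # parses, but is not an object
]
print(valid_rate(outputs, {"name", "born"}))
```

The same `valid_rate` function can score all three strategies, which keeps the comparison fair: free-form output is judged by exactly the check that constrained decoding is built to satisfy.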

E7. (Optional, advanced) Reproduce a simple scaling law.

Intent: exploration; substantial compute.

Setup. Pretrain a series of small transformer language models (e.g., 1M, 10M, 100M, 1B parameters), each on the same dataset, each trained to convergence. Use a small open corpus, such as a subset of FineWeb, or a domain-specific corpus.

Tasks.

Measure converged validation loss for each scale.
Plot $\log L$ vs $\log N$ (parameter count).
Fit a power law; extract the exponent.
Compare your exponent to the Kaplan et al. (2020) and Hoffmann et al. (2022) values.

Takeaway. First-hand experience of the empirical workflow described in FM §6, including the practical difficulties (compute-budget tuning across scales, convergence criteria, fit quality).
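
The fitting step in E7 reduces to a straight-line fit in log-log space. The sketch below assumes the pure power-law form $L(N) = a N^{-\alpha}$ and uses synthetic losses generated from known parameters, so it checks only the fitting code and says nothing about real models. Note that the parametric form in Hoffmann et al. (2022) also includes an irreducible-loss constant and a data term, so an exponent fitted this way is not directly comparable to theirs without further modelling.

```python
import math


def fit_power_law(param_counts, losses):
    """Least-squares fit of log L = log a - alpha * log N; returns (alpha, a)."""
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data from a known law (assumed values, for checking the fit only).
sizes = [1e6, 1e7, 1e8, 1e9]
true_alpha, true_a = 0.08, 5.0
synthetic = [true_a * n ** -true_alpha for n in sizes]

alpha, a = fit_power_law(sizes, synthetic)
print(f"alpha = {alpha:.4f}, a = {a:.4f}")
```

With only four points the fit is sensitive to any single run that has not converged, which is part of what the exercise is meant to surface; plotting the residuals in log space is a quick check on fit quality.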

Notebook implementations (planned, not yet released): see notebooks/llm/E1_decoding.ipynb, E2_icl_format.ipynb, and so on, once the project’s interactive layer is decided.