Foundation Models
Scope and What This Chapter Is About
The chapter treats foundation models as a unifying lens for modern AI: large-scale, self-supervised models that serve as a substrate for many downstream tasks via adaptation. We treat the concept and cross-cutting questions here. Architectural and training depth lives in dedicated chapters (Deep Learning, Self-Supervised Learning, LLMs, Multimodal, Reasoning Models, Agents, Alignment, Evaluation, Efficient and Scaled Training). This chapter is the spine those chapters attach to. Open problems are flagged inline and consolidated in §11.
§1. Motivation and Definition
What this chapter is for
By 2026 the term foundation model is the most common label for the class of AI systems that has reorganized the field over the past five years. A working researcher coming in cold could be forgiven for finding the term opaque: it carries an architectural metaphor (foundations on which other things are built), a technical claim (these models are unusually general), and an industrial signal (these are the systems that frontier labs invest in). This chapter exists because treating the material as scattered subsections of “deep learning” or “natural language processing” misrepresents how the field is now organized - and because the underlying concept, while genuinely new in its consequences, is simpler than the term suggests once it is grounded in a concrete example. We therefore start with one such example and build the formal definition from it.
What a foundation model is, concretely
Begin with one example, traced end to end. Consider GPT-4, a foundation model released by OpenAI in 2023. Its production proceeded in two main stages.
The first stage is pretraining: a single large neural network was trained on a very broad corpus - likely on the order of trillions of tokens of text drawn from web pages, books, code, and other sources (the exact figure is undisclosed). A token is a small unit of text; a tokenizer is the procedure that splits text into tokens, and depending on its choices a typical English word breaks into one to four tokens. The training task at each position in the corpus was simple: predict what token comes next. This is an example of self-supervised learning - self-supervised meaning that the labels (the correct next tokens) come from the data itself rather than from human annotators. Self-supervision lets the training procedure consume essentially unlimited unlabeled text, where supervised training would require humans to annotate examples and would be capped by labeling cost.
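To make "the labels come from the data itself" concrete, here is a minimal sketch of how next-token training pairs fall out of raw text. The whitespace split is an assumption that stands in for a real subword tokenizer, and `next_token_examples` is a hypothetical helper, not part of any library.

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) training pairs.

    Every target is simply the token that follows its context, so the
    labels come from the data itself and no annotator is involved.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Assumption: whitespace splitting stands in for a real subword tokenizer.
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 training pairs at no labeling cost, which is why the objective can consume unlabeled text at very large scale.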
After pretraining, the model could produce fluent, contextually appropriate text continuations from any prompt. It could not yet consistently follow instructions, refuse harmful requests, or behave like a helpful assistant.
The second stage is adaptation: a sequence of much smaller training procedures that turn the pretrained model into a deployed system. For GPT-4, adaptation included two further training phases:
Supervised fine-tuning (SFT). “Fine-tuning” means continuing to train an already-trained model on additional data; supervised means each training example is a (prompt, ideal response) pair written by humans. SFT teaches the model the target conversational behaviour: how to answer questions, follow instructions, format outputs, and so on.
Preference tuning. Even after SFT, the model produces multiple plausible responses to many prompts and may favour the wrong one. Preference tuning refines the model using human comparisons of its outputs: human raters are shown two responses to the same prompt and choose the better one. The specific algorithm OpenAI used belongs to a family called reinforcement learning from human feedback (RLHF): a small reward model is first trained to predict human preferences from the comparison data, and then the foundation model is updated (using a reinforcement-learning algorithm called PPO, treated in the RL chapter) to produce responses that score highly on the learned reward.
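The reward-model step described above can be sketched as a pairwise loss. This is a common formulation (a Bradley-Terry model over scalar rewards), shown here as an illustration and not as a statement of exactly what any lab runs; `preference_loss` is a hypothetical name and the reward values are made up.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for one human comparison.

    Equals -log(sigmoid(r_chosen - r_rejected)): small when the reward
    model scores the human-preferred response higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Illustrative scalar rewards for two responses to the same prompt.
agrees_with_rater = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_rater = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_rater, disagrees_with_rater)
```

Minimizing this loss over many comparisons pushes the reward model to rank responses the way raters did; the policy update against that learned reward is a separate step.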
The pretrained GPT-4 base is then adapted in further ways for other deployments. Smaller fine-tuning runs specialize it for particular tasks (code generation, medical question answering, customer service). Retrieval augmentation lets it call out to a search system at query time and incorporate fetched documents into its responses. Prompting elicits specific behaviours by giving the model carefully designed input - system instructions, examples, format specifications - without any further training at all. The same pretrained model serves all of these.
The economic logic underlying the regime is simple: the expensive pretraining is paid for once, and a comparatively cheap adaptation specializes the resulting model for each downstream use. This separation - one expensive pretraining run on broad data, then many cheap adaptations to specific applications - is the defining feature of the foundation-model regime. The earlier paradigm trained a fresh model per task; the foundation-model paradigm trains a base once and reuses it everywhere.
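The amortization argument can be written as simple arithmetic. The sketch below uses abstract cost units chosen purely for illustration (they are not estimates of any real system), and all three function names are hypothetical.

```python
import math


def per_task_cost(num_tasks, cost_from_scratch):
    """Total cost when every task gets its own model trained from scratch."""
    return num_tasks * cost_from_scratch


def foundation_cost(num_tasks, pretrain_cost, adapt_cost):
    """Total cost when one base is pretrained once and adapted per task."""
    return pretrain_cost + num_tasks * adapt_cost


def break_even_tasks(pretrain_cost, adapt_cost, cost_from_scratch):
    """Smallest task count at which pretrain-once-adapt-many is strictly cheaper.

    Returns None when adaptation is not cheaper than training from scratch,
    in which case the pretraining cost is never recovered.
    """
    if adapt_cost >= cost_from_scratch:
        return None
    return math.floor(pretrain_cost / (cost_from_scratch - adapt_cost)) + 1


# Hypothetical units: pretraining 1000, each adaptation 1, each from-scratch model 50.
n = break_even_tasks(pretrain_cost=1000.0, adapt_cost=1.0, cost_from_scratch=50.0)
print(n, foundation_cost(n, 1000.0, 1.0), per_task_cost(n, 50.0))
```

The sketch only captures cost. In practice the adapted base is often also more capable than a from-scratch model, which strengthens the case beyond what this arithmetic shows.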
The pattern beyond GPT-4
The same shape - pretrain at scale on broad data, then adapt many ways - produces a wide variety of systems. Claude, Gemini, Llama, Mistral, and Qwen are foundation models for language with the same overall lifecycle as GPT-4. CLIP (Radford et al., 2021) is a foundation model for joint understanding of images and text: it was pretrained on hundreds of millions of image-caption pairs by predicting which captions matched which images (a contrastive objective, treated in the Self-Supervised Learning chapter), and the resulting image and text representations can be used directly - zero-shot, meaning without any further training - for classification, retrieval, and grounding. SAM (Kirillov et al., 2023) is a foundation model for image segmentation: pretrained on a billion segmentation masks, it can segment objects in arbitrary images guided by simple prompts such as a click, a bounding box, or a text description. AlphaFold 2 (Jumper et al., 2021) is a foundation model for protein-structure prediction: pretrained on a large database of known protein structures, it predicts the three-dimensional fold of a protein from its amino-acid sequence with accuracy that displaced decades of structural-biology methodology. The vision-language-action (VLA) lineage in robotics - RT-2, OpenVLA, π0 - extends the pattern to embodied control: a single model is pretrained on combinations of image, language, and motor-action data, then adapted for specific manipulation or navigation tasks.
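The zero-shot use of CLIP-style representations mentioned above reduces to comparing one image embedding against a text embedding per candidate label. The sketch below assumes the embeddings already exist; the three-dimensional vectors are hand-made stand-ins for encoder outputs, the temperature is an illustrative value, and `zero_shot_classify` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, label_embs, temperature=0.07):
    """Softmax over image-to-caption similarities; no training step involved."""
    labels = list(label_embs)
    logits = [cosine(image_emb, label_embs[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(labels, exps)}


# Assumption: toy vectors standing in for real image and text encoder outputs.
label_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best, probs)
```

Adding a new class means adding one more caption to `label_embs`; nothing is retrained, which is what "zero-shot" refers to here.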
What these systems share, despite spanning very different modalities and applications:
They are pretrained at scale on broad data - broad in the sense of covering many domains, sources, or modalities, not narrowly task-specific.
They are designed and deployed with the expectation that they will be adapted to many downstream tasks rather than trained for one. A downstream task is any application or use the model is put to that is different from the pretraining objective itself; predicting the next token is the pretraining objective, while answering medical questions or writing code are downstream tasks.
Adaptation is cheaper than retraining from scratch, which is what makes the pattern economically and methodologically distinct from earlier transfer learning (the older practice of starting from a pretrained model - typically supervised on ImageNet - and fine-tuning narrowly for a related task).
The foundation-model lifecycle, as a diagram:
                      Broad data
                 (web / books / code /
                 images / audio / ...)
                           │
                           ▼
                ┌────────────────────┐
                │   PRETRAIN ONCE    │   self-supervised at scale;
                │   (the expensive   │   trillions of tokens or
                │       stage)       │   billions of images;
                │                    │   weeks to months of compute
                └──────────┬─────────┘
                           │
                           ▼
                    Pretrained base
                 (general capability,
                 not yet specialized)
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
       adapt for       adapt for       adapt for      ... many other
    chat assistant  code completion   medical Q&A     downstream tasks
     (SFT + RLHF)    (fine-tuning)      (RAG +        and
                                      fine-tune)      deployments
           │               │               │
           ▼               ▼               ▼
   GPT-4 / Claude /  Copilot-style      medical
     Gemini style    coding tools      decision
       chatbots                         support
The economic shift compared to the earlier “one model per task” regime: the costly pretraining is paid for once, and each adaptation is comparatively cheap. The same base model serves many deployments. This is the defining structural property of the foundation-model regime.
Formal definition
The term foundation model was introduced in a 2021 report by the Stanford Center for Research on Foundation Models (Bommasani et al., 2021), which defined it as
any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.
The definition is functional rather than architectural: it does not name transformers, autoregressive language models, or any specific training objective. What it requires is broad pretraining and adaptability. The same report identified two cross-cutting properties that, it argued, distinguish the foundation-model regime from prior practice:
Emergence. New capabilities, the report claims, emerge with scale rather than being explicitly designed in. Whether this emergence is genuine, or an artifact of how capabilities are measured, is contested; we treat the debate in §6.
Homogenization. A small number of pretrained base models serve many downstream applications, concentrating both methodological reuse and systemic risk into a few systems.
For the rest of this chapter we adopt a slightly tightened working definition:
Foundation model. A model trained at scale on broad data, typically with self-supervised objectives, intended for adaptation to a wide range of downstream tasks. Adaptation can take many forms: fine-tuning (continuing to train the model on task-specific data), prompting (specifying the task at inference time through carefully chosen input), retrieval augmentation (giving the model access to a database to query at inference time), or in-context learning (specifying a task purely through examples provided in the prompt, without any training updates at all). §7 develops these.
The added constraints - trained at scale, intended for adaptation, wide range of tasks - keep the definition from over-extending to small specialized pretrained models that happen to admit some fine-tuning. Borderline cases exist (a 100-million-parameter BERT variant fine-tuned only for medical de-identification, for example) and we acknowledge them where they matter.
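Of the adaptation forms listed in the working definition, in-context learning is the easiest to show in a few lines, because the "adaptation" is nothing more than constructing the input. The sketch below is a hypothetical prompt builder; the `Input:` / `Output:` layout is one convention among many, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt.

    No model weights change: the task is specified entirely by this text.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("The battery lasts all day.", "positive"),
        ("It broke after a week.", "negative"),
    ],
    query="Setup took five minutes and everything worked.",
)
print(prompt)
```

The model is expected to continue the text after the final `Output:`; swapping the examples changes the task without any training run.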
Alternative names
“Foundation model” is the most widely used term, but several alternatives circulate, each emphasizing a different aspect. We use foundation model throughout, switching only where another term better fits the local context.
Pretrained model or large pretrained model - older and more neutral; avoids the “foundation” metaphor but does not capture the adaptation-first deployment posture.
General-purpose AI (GPAI) - used in regulatory contexts (notably the EU AI Act), emphasizing capability generality rather than the training pattern.
Frontier model - used in safety and policy literature for the largest, most capable models at any given time. A frontier model is typically a foundation model, but not all foundation models are at the frontier.
Generalist agent - used in DeepMind’s Gato (Reed et al., 2022) and subsequent agentic-AI work; emphasizes deployment as an agent that selects actions rather than as a pure input-to-output function.
The metaphor and its limits
The “foundation” metaphor has been criticized on several grounds. Foundations are usually durable, but current foundation models are deprecated and replaced on cycles measured in months. Foundations are something other things are built on top of, but most downstream applications simply call the model rather than building structures atop it. And calling everything from a seven-billion-parameter open-weights model to a frontier mixture-of-experts model (MoE: an architecture in which many “expert” sub-networks share the work, covered in the Deep Learning chapter) “a foundation model” can obscure differences that matter for theory, training, and deployment. We adopt the term because it is the most useful single label, while flagging its limits where they bear on technical or empirical claims.
Editorial note. The term will likely change within the next decade. A reader who finds foundation model too loaded should mentally substitute the working definition above; nothing else in this chapter depends on the metaphor.
§2. Historical Context
This section gives a descriptive trajectory of how the foundation-model regime came to dominate AI practice. It is not a complete history of machine learning, nor an attempt to credit individuals with the “invention” of foundation models. Several converging research programs across decades made the regime possible; the trajectory is best read as an accumulation of partial results that became viable together once compute, data, and architecture aligned.
A timeline of the inflection points covered below:
~1990s-2000s task-specific ML; feature engineering;
transfer between tasks rare and narrow
│
▼
2012-2015 vision pretraining: AlexNet on ImageNet
shows that supervised pretraining yields
reusable representations
│
▼
2013-2018 language pretraining lineage:
Word2Vec → GloVe → ELMo → ULMFiT
static then contextual word embeddings
│
▼
2017 Transformer architecture (Vaswani et al.)
arrives - parallel training, scalable
│
▼
2018-2020 THE INFLECTION: BERT (encoder-only, MLM),
GPT (decoder-only, causal LM), T5 (encoder-
decoder, span corruption). Pretrain-then-
fine-tune becomes the default in NLP.
│
▼
2020-2022 THE SCALING ERA: GPT-3 demonstrates
in-context learning at scale. The term
"foundation model" coined (Bommasani et al.,
2021). Empirical scaling laws (Kaplan,
Chinchilla). RLHF for products (InstructGPT).
ChatGPT productization (late 2022).
│
▼
2020-2024 Multimodal and beyond text: CLIP, ViT, DINO,
SAM, Whisper, AlphaFold 2 lineage. Robotics
VLAs (RT-1, RT-2, OpenVLA, π0). Open-weights
ecosystem matures: Llama, Mistral, Qwen,
DeepSeek.
│
▼
2024-2026 THE AGENTIC + REASONING TURN: tool use,
retrieval augmentation, agentic loops,
o1/o3-style reasoning models trained with RL
on traces; test-time compute as a new axis
of "scaling".
We develop each phase below.
The pre-foundation era
Through the 1990s and 2000s, mainstream machine learning was dominated by task-specific models. A model was designed, trained, and deployed for a single task; feature engineering, model selection, and hyperparameter tuning were all bound to that task. Transfer between tasks was occasional, narrow, and not central to the methodology. Compute was small enough that retraining for a new task was usually cheaper than designing an adaptation pipeline.
The 2010s reshaped this picture in two stages: first in vision, then in language.
Vision: pretraining as a useful pattern
The widely cited results of AlexNet on ImageNet (Krizhevsky et al., 2012; Russakovsky et al., 2015) demonstrated that deep convolutional networks trained on a single large supervised dataset could produce representations useful for other tasks. Practitioners noticed that the convolutional features learned on ImageNet transferred well - often as initialization for downstream models, occasionally as fixed feature extractors. This was the first widely adopted pretraining-and-adaptation pattern in deep learning. It lacked two properties of contemporary foundation models: the pretraining was supervised rather than self-supervised, and the downstream adaptations were narrow. But the pattern of pretrain once, adapt many was visible and economically meaningful.
Language: from word embeddings to pretrained language models
A parallel lineage emerged in natural language processing. Word embeddings - dense vector representations of words learned from co-occurrence statistics - became standard practice with Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These were static: each word had a single vector regardless of context. Contextual embeddings, introduced with ELMo (Peters et al., 2018), produced different representations for the same word in different contexts by running a bidirectional language model and reading out its hidden states. ULMFiT (Howard & Ruder, 2018) introduced systematic fine-tuning of pretrained language models for downstream classification tasks, demonstrating that the pretraining-and-adaptation pattern from vision transferred to language.
The architectural substrate that would carry the next decade arrived in 2017: the Transformer (Vaswani et al., 2017), introduced for machine translation and rapidly generalized. The Transformer’s combination of parallelizable training, scalable attention, and absence of recurrence made it the default substrate once compute and data were available at the relevant scales.
The 2018–2020 inflection
The pattern crystallized over 2018 and 2019 with three architecturally distinct but methodologically convergent models:
BERT (Devlin et al., 2018) - an encoder-only Transformer pretrained with masked language modeling.
GPT (Radford et al., 2018) and its successors - a decoder-only Transformer pretrained with causal language modeling.
T5 (Raffel et al., 2019) - an encoder-decoder Transformer pretrained with span corruption.
Each was pretrained on large unlabeled text corpora and then adapted to many downstream tasks. The economic shift was real: a small number of pretrained models, fine-tuned for many problems, displaced the previous norm of training task-specific models from scratch. By 2020 the pretraining-and-adaptation pattern was the default across research and industrial NLP.
The scaling era
GPT-3 (Brown et al., 2020) marked a quantitative turn. With 175 billion parameters and a much larger training corpus, it demonstrated in-context learning - performing novel tasks specified in the prompt, without any gradient-based adaptation - and made the case that scale itself was producing capabilities that smaller models did not exhibit. The Bommasani et al. (2021) report appeared shortly after, naming the phenomenon “foundation models” and arguing that emergence and homogenization warranted treating the regime as a distinct object of study.
The next several years were dominated by attempts to characterize, exploit, and respond to the scaling phenomenon. Empirical scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) attempted to predict model loss as a function of parameters, data, and compute, and shifted the field’s understanding of compute-optimal training. Reinforcement learning from human feedback (RLHF), developed earlier (Christiano et al., 2017) and applied to language models in InstructGPT (Ouyang et al., 2022), made foundation models viable as products by aligning their outputs to human preferences. ChatGPT, released in late 2022, was the first deployment to reach mass public attention; the next two years saw GPT-4 (OpenAI, 2023), Claude, Gemini, Llama, Mistral, and a rapid expansion of both proprietary and open-weights models.
Beyond text
The foundation-model pattern spread well outside language. In vision:
CLIP (Radford et al., 2021) trained a joint image-text representation by contrastive learning at scale, enabling zero-shot image classification.
Vision Transformers (Dosovitskiy et al., 2020) brought the Transformer architecture to image patches.
Self-supervised vision - DINO (Caron et al., 2021), SimCLR, MAE, and successors - produced general-purpose image representations without labels.
SAM (Kirillov et al., 2023) demonstrated foundation-model-style segmentation that responded to prompts.
In audio, Whisper (Radford et al., 2022) and successors produced general-purpose speech recognition. In scientific domains, AlphaFold 2 (Jumper et al., 2021) demonstrated that a domain-specific foundation model could solve a long-standing problem in structural biology, with later versions extending to ligand binding and other tasks. Subsequent work in materials, weather, and mathematics followed similar templates. In robotics, the vision-language-action lineage (RT-1, RT-2, OpenVLA, π0) extended foundation-model methodology to embodied control.
The agentic and reasoning turn
By 2024 the discussion had shifted from raw model capabilities to systems built around foundation models. Tool use, retrieval-augmented generation, and structured agentic loops became standard. A new class of reasoning models - most prominently OpenAI’s o1 and o3, and successors from other labs - used reinforcement learning to train models to produce long, deliberate chains of intermediate reasoning at inference time. This shifted some performance gains from training-time scaling to test-time compute, and reframed what “scaling” means in practice.
By 2026 the regime is characterized by rapid iteration across closed and open foundation models, an active alignment and interpretability research community, the rise of agentic deployment, and a growing set of domain-specific foundation models in science, robotics, and design. The substrate has stabilized enough to support a chapter-length treatment; the open problems have not stabilized at all.
Historical aside. The narrative above privileges the lineage that produced today’s dominant systems. Several research programs that did not lead directly to foundation models - symbolic AI, classical knowledge representation, structured probabilistic models, neuroscience-inspired learning - remain active and relevant; this book treats them in their own contexts. The history of an idea is rarely the same as the history of the field that produced it.
§3. The Three Pillars
Foundation models are not the product of any single technical innovation. They are the joint product of three pillars, each of which had been pursued independently for years before the combination became viable: scale, self-supervision, and adaptation. The chapter’s later sections develop each in detail; this section argues why all three are jointly required.
                        FOUNDATION MODELS
                                ▲
                                │ jointly require
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
        SCALE           SELF-SUPERVISION         ADAPTATION
   ───────────────     ───────────────────     ───────────────
   model parameters,   objectives that         fine-tuning,
   training data,      consume unlabeled       prompting,
   training compute    data by predicting      retrieval,
   (jointly via the    parts of the input      in-context
   Chinchilla-style    from other parts        learning,
   relationship)       (next token, masked     RLHF, etc.
                       token, contrastive)
          │                     │                     │
   without it,         without it,             without it,
   adaptation is no    labeled data is the     the pretraining
   cheaper than        bottleneck (the         is useless;
   retraining; the     2010s ImageNet          the model is
   foundation regime   ceiling)                not deployable
   collapses
The three pillars are jointly necessary in the strong sense developed in this section: dropping any one removes the regime as it exists in 2026. Removing scale leaves us with small pretrained models; removing self-supervision leaves us with the supervised ImageNet-era ceiling; removing adaptation leaves us with research artifacts no one can deploy. The counterfactual at the end of this section (§3’s “A counterfactual lens”) works through each absence in turn.
Scale
By “scale” we mean the joint expansion of three resources: model parameters, training data, and compute. The empirical evidence assembled in §6 shows that performance on pretraining objectives, and on many downstream tasks, improves as a smooth function of these resources, with regularities precise enough to be called scaling laws. The three resources are not independently maximizable: at fixed compute, there is a roughly optimal model size and a roughly optimal training-data size - a finding consolidated by Hoffmann et al. (2022) under the label Chinchilla scaling.
Scale matters not because larger is always better in some absolute sense, but because the foundation-model regime depends on a quantitative property: a single pretraining run must be large enough that downstream adaptation is cheaper than retraining. Without scale, the economic case for foundation models collapses; one might as well train per task.
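The compute-optimal trade-off described above can be sketched with two widely quoted rules of thumb, neither of which is stated in this chapter and both of which are approximations: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 training tokens per parameter. The function name and default arguments below are assumptions for illustration.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training-compute budget between parameters N and tokens D.

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    Both constants are rules of thumb, not exact laws.
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A budget equal to 6 * N * D for a 70-billion-parameter model on 1.4 trillion tokens.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_split(budget)
print(f"{n_params:.3g} parameters, {n_tokens:.3g} tokens")
```

Under these assumptions the split recovers the 70-billion-parameter, 1.4-trillion-token configuration associated with Chinchilla. The point of the sketch is the square-root relationship: quadrupling compute roughly doubles both the optimal model size and the optimal data size.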
Self-supervision
By “self-supervision” we mean training objectives that consume unlabeled data by predicting parts of the input from other parts. Causal language modeling predicts the next token; masked language modeling predicts masked tokens; contrastive image-text training predicts which captions belong to which images. The full taxonomy is developed in the Self-Supervised Learning chapter. What matters here is the scalability of self-supervised objectives: they consume the kind of data the world produces in vast quantities (text on the web, images, audio recordings, code) without requiring human annotation. Supervised pretraining at the same scale would be infeasible: the annotation effort would exceed any practical labeling budget by orders of magnitude.
The combination of scale and self-supervision removed the labeling bottleneck that capped pre-2017 deep learning. ImageNet’s million-plus labeled examples were once the largest practical training corpus; by 2026, language foundation models are trained on trillions of tokens of unlabeled text, and the labeling bottleneck has been displaced by data-quality and licensing questions treated in §5.
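Masked language modeling, one of the objectives named above, can be sketched the same way as next-token prediction: hide some tokens and keep the originals as targets. This is a simplified illustration; the 15% rate and the `[MASK]` symbol are assumptions in the style of BERT, and the sketch omits refinements such as occasionally substituting random tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_input, targets) for masked-token prediction.

    targets maps each masked position to the original token, so the
    supervision signal is again derived from the input itself.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "self supervision turns raw text into training signal".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model sees `masked` and is trained to recover `targets`; unlike the causal case it can use context on both sides of each masked position.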
Adaptation
By “adaptation” we mean any mechanism that turns a general-purpose pretrained model into a useful one for a specific downstream task or deployment. The catalog is broad: full fine-tuning, parameter-efficient methods (LoRA, adapters), instruction tuning on demonstrations, preference tuning (RLHF, DPO, GRPO), retrieval augmentation, prompting, in-context learning, and tool use. §7 develops these.
Adaptation is what makes the foundation-model regime deployable. A pretrained model that admits no useful specialization is a research artifact, not a foundation. The economic and methodological shift from “one model per task” to “one model adapted many ways” depends on adaptation mechanisms being cheap relative to retraining, and on adaptation being more useful than starting from scratch. Both conditions hold in practice for the dominant model families today; neither was a foregone conclusion when the regime began.
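The claim that adaptation is cheap relative to retraining can be illustrated with the parameter count of LoRA, one of the parameter-efficient methods listed above. In LoRA a frozen weight matrix W of shape d_out × d_in is adapted by training two small matrices B (d_out × r) and A (r × d_in) whose product is added to W. The layer size and rank below are illustrative assumptions, and the function names are hypothetical.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r update B @ A with W kept frozen."""
    return rank * (d_out + d_in)


# Assumption: one 4096 x 4096 projection adapted with rank 8.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
low_rank = lora_params(d_out, d_in, rank)
print(full, low_rank, full // low_rank)
```

For this layer the low-rank update trains 256 times fewer parameters than full fine-tuning. The saving applies to trainable parameters and optimizer state; the frozen base still has to be loaded and run.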
A counterfactual lens
It is instructive to ask what fails if any single pillar is removed.
Without scale, but with self-supervision and adaptation, we have small pretrained models. They exist (early BERT variants for niche domains; small CLIP variants for resource-constrained deployment); they support a kind of foundation-model practice; they do not exhibit the emergent capability claims of §6 or anchor the modern field.
Without self-supervision, but with scale and adaptation, we are back to supervised pretraining at ImageNet scale: useful, transferable, but capped by labeling cost. The 2010s vision lineage shows what this regime can and cannot achieve.
Without adaptation, but with scale and self-supervision, we have very large pretrained models that are not useful for any specific deployment. Such models are research artifacts at best; a deployed system requires adaptation in some form, even if only zero-shot prompting.
The three pillars are jointly necessary in a strong sense: the foundation-model regime as it exists in 2026 does not appear under any two of the three.
Empirical note. This counterfactual is descriptive of the historical record, not theoretical. There is no proof that no other combination of mechanisms could produce foundation-model-like systems; alternative routes (purely supervised training at much greater scale, purely model-based RL, neurosymbolic hybrids) are active research programmes whose viability is open. The claim here is more modest: in the actual history of the field, the three pillars together produced the regime, and removing any one is sufficient to remove the regime as we know it.
§4. Architectural Substrate (Brief)
This section is deliberately brief. The Deep Learning chapter develops the transformer architecture and its variants in full mechanical detail - Q/K/V projections, attention computation, positional encodings, normalization, MoE routing, state-space models. The LLM chapter §4 develops the language-specific design choices. What we cover here is the architectural taxonomy of foundation models: what families exist, how they relate, and where to read for depth.
The transformer family and its three deployment-shape variants
Modern foundation models are overwhelmingly transformers (Vaswani et al., 2017). The architecture has three deployment shapes that emerged in the 2018–2020 inflection (§2):
Decoder-only - predicts the next token from prior tokens (causal masking). Generation-native; used by GPT, Llama, Mistral, Qwen, DeepSeek, and most LLMs and multimodal generative models. The dominant shape in 2026.
Encoder-only - produces a representation of the input (bidirectional attention). Suited to classification, retrieval, embedding generation. BERT and its descendants. Persists in retrieval encoders, classification heads, and specialized representation-learning systems.
Encoder-decoder - encodes the input into a representation, then decodes an output autoregressively. T5 and its descendants. Persists in machine translation, structured prediction, and some code-edit settings.
The LLM chapter (§4) develops why decoder-only won for general-purpose language modelling. The same arguments apply across modalities for generation-first applications; encoder-only and encoder-decoder shapes persist where their structural fit is strong.
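The difference between the decoder-only and encoder-only shapes comes down to which positions each token may attend to. A minimal sketch of the two masks, using plain nested lists and hypothetical function names:

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Decoder-only models allow only j <= i, so each token sees the past
    and itself but never the future.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-only models let every position attend to every other."""
    return [[True] * n for _ in range(n)]


for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

The lower-triangular pattern is what makes a decoder generation-native: the prediction at each position never depends on tokens that have not been produced yet. An encoder-decoder uses the bidirectional mask on the input side and the causal mask on the output side.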
Scaled architectural variants
The transformer family extends in two practically important ways:
Mixture-of-experts (MoE). The dense feedforward layer is replaced by a bank of expert feedforward layers with a learned router; each token is processed by only a small subset of experts. Frontier-scale MoE models (Mixtral, DeepSeek-MoE, and others) achieve a higher effective capacity than a dense model at the same per-token inference compute, because only the selected experts run for each token. The Deep Learning chapter develops the training-side mechanics; the LLM chapter §4 develops the LLM-side picture.
Long-context variants. Sliding-window attention, global-plus-local attention, and dilated attention each modify the attention pattern to reduce the quadratic cost of long context; ring attention instead keeps exact attention and distributes the sequence across devices. Position-encoding extrapolation (NTK-aware, YaRN) and ALiBi extend usable context lengths beyond those seen in training. Treated in LLM §4.7 and §9.
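The routing step in the mixture-of-experts item above can be sketched in a few lines. This is a toy version under stated assumptions: experts are scalar functions, the router logits are passed in directly (in a real layer they come from a learned projection of the token), and top-2 selection with renormalized gate weights is one common choice, not the only one.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Hypothetical experts: four scalar functions standing in for feedforward blocks.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
logits = [0.1, 2.0, -1.0, 1.5]

gates = route_top_k(logits, k=2)
y = moe_layer(1.0, experts, logits, k=2)
print(gates, y)
```

Only two of the four experts are evaluated for this token, which is the source of the capacity-versus-compute advantage: total parameters grow with the number of experts while per-token compute grows only with k.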
Non-transformer alternatives
Several sub-quadratic architectures remain active research:
State-space models (SSMs), most prominently the Mamba lineage (Gu and Dao, 2023), process tokens sequentially with a learned hidden state. Linear inference cost in sequence length; parallelizable in training.
Hybrid attention/SSM stacks (e.g., Jamba) interleave transformer and SSM blocks, combining attention’s retrieval capability with SSMs’ efficient long-context inference.
RWKV is a different sub-quadratic family with recurrent structure.
As of 2026 the decoder-only transformer remains the dominant frontier substrate; sub-quadratic alternatives are competitive at smaller scales and are research-active at the frontier.
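The "linear inference cost, parallelizable in training" property of state-space models can be seen in a scalar toy. The sketch below uses fixed coefficients, which is an assumption: it illustrates a time-invariant linear recurrence, whereas Mamba-style models make the coefficients input-dependent and use a parallel scan instead of a convolution. Function names and coefficient values are hypothetical.

```python
def linear_recurrence(xs, a=0.9, b=0.1, c=1.0):
    """Sequential view: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    One constant-cost update per token, so inference is linear in length.
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def as_convolution(xs, a=0.9, b=0.1, c=1.0):
    """Unrolled view: y_t = sum_k c * a**k * b * x_{t-k}.

    Each output depends only on the inputs, not on earlier outputs, so all
    positions can be computed independently during training.
    """
    return [
        sum(c * (a ** k) * b * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 0.0, 2.0, -1.0]
print(linear_recurrence(xs))
print(as_convolution(xs))
```

The two functions compute the same outputs up to floating-point rounding. The sequential form is the one used at inference time, where a fixed-size state replaces the growing key-value cache of attention.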
Multimodal architectures
Foundation models that process more than one modality (vision-language, audio-language, vision-language-action) typically use one of three architectural patterns:
Cross-attention bridges. A modality-specific encoder produces tokens (or a fixed-size representation) of one modality; cross-attention in a separate decoder integrates these with another modality. Flamingo-style architectures use this pattern.
Fused token streams. All modalities are tokenized into a shared sequence space; a single transformer processes the interleaved tokens. Modern multimodal Gemini, GPT-4V/4o-class architectures, and most contemporary multimodal LLMs use this pattern.
Perceiver-style designs. A small set of learned latent vectors cross-attends to arbitrary modality-specific inputs of variable size, producing a fixed representation that downstream layers consume.
The Multimodal Models chapter develops these patterns in depth.
What we do not cover here
For mechanism, math, and engineering choices - attention computation, positional encodings (RoPE, ALiBi, learned absolute), normalization (LayerNorm, RMSNorm), feedforward variants (SwiGLU, GeGLU), training-stability tricks (pre-normalization, gradient clipping) - see the Deep Learning chapter and LLM §4. The point of this section is the taxonomy and the pointers; the depth lives elsewhere.
§5. Pretraining
This section treats pretraining as a regime-spanning operation, not as a language-specific one. The LLM chapter (§5.1) developed pretraining for language; the analogous structure recurs across vision, audio, code, multimodal, robotics, and scientific foundation models, with mechanically similar pipelines and substantially overlapping engineering. We cover the regime here; the chapter-specific details live in the corresponding chapters (Self-Supervised Learning for the objectives, Multimodal Models for the multimodal cases, Robotics for VLA-style pretraining, AI for Science for domain-specific FMs).
Objective families across modalities
The defining structural property of pretraining is a self-supervised objective on broad data: the model is trained to predict some piece of the input from other pieces, with the targets coming from the data itself rather than from human annotation. Different modalities have different natural choices for what to predict. The dominant objective families:
MODALITY | DOMINANT OBJECTIVE | NOTES
Text | causal language modelling (next-token prediction) | GPT family, Llama, Mistral, etc.
Text | masked language modelling (predict masked tokens from surrounding context) | BERT family; declining in the LLM era but alive in retrieval encoders
Text | span corruption | T5 family
Images | contrastive image-text (match captions to images) | CLIP and successors; the dominant vision-LM substrate
Images | masked image modelling | MAE, BEiT; rebuild patches from partially-masked images
Images | autoregressive image gen | image-token-by-image-token prediction, mostly for multimodal generation
Audio | waveform LM / contrastive audio-text | Whisper-style ASR is the dominant audio FM training
Video | contrastive, masked patches, autoregressive | active research; less consolidated than text or images
Code | causal LM (same as text) | code-specialized variants of language pretraining
Scientific | domain-specific predictive objectives (e.g., predict protein structure from sequence) | AlphaFold (predict folded protein structure from sequence) - domain-tailored self-supervised objectives
Robotics / VLA | behaviour cloning, multimodal sequence prediction | vision-language-action data: observe sequences of image + language + action

What unites these is the structural pattern: a forward pass through a large neural network, a self-supervised loss computed from the data itself, optimization at scale. What differs is what counts as a token in each modality and what the model is being asked to predict. The Self-Supervised Learning chapter develops each objective family in depth; this section assembles them under the FM regime.
Data: the shared substrate
Across modalities, pretraining data shares structural features that have shaped how FMs are built.
Heavy-tailed quality. Real-world data of any modality has a long tail of low-quality material. Pretraining pipelines invest heavily in filtering: heuristics for obvious junk, learned quality classifiers for borderline cases, and (increasingly) model-judgment for the hardest cases. Deduplication (Lee et al., 2022, for text; analogous techniques for images and audio) removes near-copies that would otherwise waste compute and bias training.
Mixture composition. A pretraining corpus is a mixture of sources, weighted by a curated recipe. Across modalities the question of what mix produces the best downstream model is empirical and an active research area. Code-heavy text mixtures produce stronger coding models; math-heavy mixtures produce stronger math models; broad multimodal mixtures produce more generally-capable multimodal models, at the cost of some specialization in any single modality.
Licensing and copyright. Foundation-model training data raises live legal and ethical questions across modalities - copyrighted text, copyrighted images, scraped code under restrictive licenses, biological data with disclosure restrictions. The regulatory landscape is evolving and varies by jurisdiction. This is treated more fully in the Alignment chapter’s safety-and-governance section.
The exhaustion question. For text especially, the supply of high-quality unlabelled data is finite. By 2026, frontier labs have largely processed the publicly accessible high-quality text corpus; further scaling depends on synthetic data (OP-FM-2), better filtering of lower-quality data, or new sources (private corpora, transcribed audio, multimodal data). Whether the field is running out of data in some meaningful sense is contested; it is at least running out of the cheapest data.
Tokenization across modalities
Tokenization - the process that turns raw input into a sequence of tokens, either discrete codes or continuous patch vectors - exists in every modality, not just in text:
Text uses subword tokenizers (BPE, WordPiece, Unigram, SentencePiece; see LLM §3).
Images are tokenized into patches (typically 16×16 pixels for ViT-style architectures), each patch becoming one token in the input sequence. Some models use learned visual tokenizers that map image patches to discrete codes (VQ-VAE-style).
Audio is tokenized either as raw waveform samples (less common at scale), as spectrogram patches (common in speech FMs like Whisper), or as discrete audio tokens from a learned codec.
Action sequences in robotics are tokenized as discrete or quantized motor commands, allowing a multimodal model to treat actions as just more tokens in a unified sequence.
The fact that all modalities can be reduced to a sequence of tokens (discrete codes, or patch vectors that are projected into the same embedding space) is what makes architecturally unified multimodal models possible: a single Transformer can ingest interleaved image patches, text tokens, and audio tokens.
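As an illustration of the image case above, here is a minimal sketch of ViT-style patch tokenization. It assumes a 224×224 RGB image and 16×16 patches; the function name `patchify` is hypothetical, and the learned linear projection that a real model applies to each flattened patch is omitted.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    x = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster order: this is the token sequence.
    return x.reshape(rows * cols, patch * patch * c)

# Assumed input size, for illustration only.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)   # one vector per 16x16 patch
```

Each row of `tokens` plays the role that a token embedding plays for text, which is why the same Transformer stack can consume both.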
Curricula and training schedules
Most modern pretraining recipes do not feed a uniform random sample of the corpus from start to finish. Instead they use a curriculum: a controlled schedule over data composition. Two-stage and multi-stage pretraining is standard by 2026:
Early stages use broader, more diverse data - get the model fluent across many domains.
Later stages concentrate on higher-quality or more domain-targeted material - sharpen the model’s capability profile in directions the deployer cares about.
Curriculum design is largely empirical; small-scale ablation experiments test the effect of mixture changes, and the recipe is tuned before committing to the full training run.
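A two-stage schedule of the kind described above can be sketched as a function from training progress to mixture weights. The source names, the weights, and the 80% switch point are all hypothetical, chosen only to show the mechanism.

```python
import random

# (stage end as a fraction of training, mixture weights) - hypothetical recipe.
STAGES = [
    (0.8, {"web": 0.70, "code": 0.15, "books": 0.10, "math": 0.05}),  # broad
    (1.0, {"web": 0.30, "code": 0.30, "books": 0.15, "math": 0.25}),  # targeted
]

def mixture_at(progress):
    """Return the mixture weights in force at a given training progress."""
    for stage_end, weights in STAGES:
        if progress <= stage_end:
            return weights
    return STAGES[-1][1]

def sample_source(progress, rng):
    """Pick the data source for the next training document."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
early = [sample_source(0.10, rng) for _ in range(10_000)]
late = [sample_source(0.95, rng) for _ in range(10_000)]
early_web = early.count("web") / len(early)
late_web = late.count("web") / len(late)
```

In a small-scale ablation, only the `STAGES` table would change between runs, which is what makes mixture recipes cheap to compare before the full run.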
Compute and infrastructure (pointer)
Frontier pretraining at 2026 scales requires distributed training across thousands of GPUs, tensor and pipeline parallelism, careful fault tolerance, and a dedicated team of infrastructure engineers per training run. The Deep Learning chapter and the Efficient and Scaled Training chapter develop this. For purposes of this chapter, the relevant facts are: training a frontier FM is a months-long operation that costs tens to hundreds of millions of dollars, requires millions of GPU-hours, and concentrates capability in a small number of organizations with the resources to do it (OP-FM-12).
Open problems anchored here
Two of the cross-cutting open problems (§11) are most directly tied to pretraining:
OP-FM-1: Data quality vs scale tradeoffs. The exchange rate between quantity and quality, and the right way to allocate compute between them, remains empirical and not fully characterized.
OP-FM-2: Synthetic data and recursive training. Filtered synthetic data demonstrably helps; uncurated recursive training produces model collapse. The boundary between robust and harmful is an active research area.
§6. Scaling Laws and Emergence
This section treats the empirical regularities of scaling - how loss and downstream performance change with model size, data size, and compute - and the contested claim of emergence: that some capabilities appear discontinuously at scale rather than improving smoothly.
Empirical scaling laws
Kaplan et al. (2020) measured the test loss of decoder-only language models as a function of three resources: number of parameters $N$, dataset size $D$ (in tokens), and training compute $C$ (in FLOPs). They reported approximate power-law scaling of loss in each variable, with consistent exponents:
$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Every symbol in this formula is a fitted quantity, not a derived one. $N$ counts model parameters; $D$ counts training tokens; $C$ counts training-time floating-point operations. The subscripted constants $N_c$, $D_c$, $C_c$ are characteristic scales - values calibrated against the data such that $L = 1$ when $N = N_c$ (and similarly for $D_c$ and $C_c$). The exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are scaling exponents, typically in the range 0.05–0.10 for language models, controlling how rapidly loss drops as resources grow. Both the characteristic scales and the exponents are empirical fits extracted from training-run data; nothing in the formula is derived from a theoretical model of learning. We return to this distinction in the editorial pitfall at the end of the section.
The empirical workflow. Producing a scaling law looks roughly like this:
Train a series of models at different scales, sweeping one of $N$, $D$, $C$ while holding the others fixed or in a known relationship. A single sweep might include twenty or more training runs spanning two orders of magnitude in the swept resource.
Measure the converged test loss for each run.
Plot $\log L$ against $\log N$ (or $\log D$, or $\log C$). If the points fall approximately on a line, the underlying relationship is a power law $L \propto N^{-\alpha}$, and the slope of the line is $-\alpha$.
Fit the slope and the intercept by least squares (or a more careful weighted fit). The slope gives the exponent; the position of the line gives the characteristic scale.
Diagrammatically, on log-log axes:
log L ▲
      │ •
      │   •
      │     •            each • is one trained model;
      │       •          loss measured at convergence
      │         •
      │           •      the line of best fit has slope -α
      │ - - - - - - - .  and gives the characteristic
      │  slope = -α   •  scale (where the line crosses
      │                 •  L = 1 marks N_c, D_c, or C_c)
      │                   •
      └────────────────────────▶
        log N (or log D, or log C)

A scaling law thus has the form of a theoretical prediction but the epistemic status of a regression fit. It tells us how loss has behaved across the range of scales we trained at; it does not tell us how loss must behave at scales beyond the fit. The exponents and characteristic scales are also dataset-specific and architecture-specific: change the corpus, change the model family, and you should expect different fits.
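The fitting workflow above can be sketched in a few lines. The data here is synthetic: the "true" exponent and characteristic scale are assumed values used only to generate points, and real sweeps would substitute measured losses from actual training runs.

```python
import numpy as np

# Assumed ground truth, used only to synthesize the "measurements".
true_alpha, true_Nc = 0.076, 8.8e13

rng = np.random.default_rng(0)
N = np.logspace(7, 9, 20)                       # 20 runs over two orders of magnitude
noise = np.exp(rng.normal(0.0, 0.01, size=N.size))
L = (true_Nc / N) ** true_alpha * noise         # converged test loss per run

# log L = alpha * log N_c - alpha * log N, so a straight-line fit in
# log-log space recovers both fitted quantities.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha_hat = -slope
Nc_hat = np.exp(intercept / alpha_hat)
```

Because the characteristic scale is recovered by extrapolating the line far outside the swept range, small errors in the slope translate into large errors in `Nc_hat`; this is the regression-fit caveat in numerical form.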
Hoffmann et al. (2022), using more careful experimental design across a range of model sizes and training-token budgets, revised the implied compute-optimal allocation between parameters and data. Where Kaplan et al. had suggested that compute should grow primarily through model size, Hoffmann et al. found that compute-optimal models train on roughly twenty tokens per parameter - substantially more data than common practice at the time. The resulting model, Chinchilla, outperformed larger contemporaries trained on less data, and the “Chinchilla optimal” rule of thumb reshaped subsequent training practice.
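The twenty-tokens-per-parameter rule of thumb can be turned into a back-of-envelope allocation. This sketch assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6ND), which is not stated in the text above; the function name and the example budget are illustrative.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a training budget into parameters and tokens.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Illustrative budget; comes out near 7e10 parameters and 1.4e12 tokens.
n_params, n_tokens = compute_optimal_allocation(5.76e23)
```

Under these assumptions both parameters and tokens grow as the square root of compute, which is the sense in which the rule departs from growing compute primarily through model size.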
Scaling laws have since been extended in several directions: to multimodal training (with separate exponents per modality), to code, to mixture-of-experts (where the relevant counts are activated parameters and total parameters separately), and to inference-aware training, where one trains past the compute-optimal point because inference cost dominates total cost over a deployed model’s lifetime.
Compute-optimal versus inference-optimal training
The Chinchilla framing implicitly optimizes for one objective: minimum loss at fixed training compute. This is the right objective when inference cost is negligible relative to training cost, as in a research project. It is the wrong objective for a deployed model serving billions of inference queries, where each additional inference cycle has a cost. For deployed models, training past the Chinchilla-optimal point - on more data than would be optimal for training compute alone - produces a smaller model with the same loss, lower per-inference cost, and (often) acceptable additional training cost amortized over inference. The Llama family, among others, has been trained explicitly past Chinchilla-optimal for this reason. The choice between compute-optimal and inference-optimal training is a deployment-economics question, not a property of the underlying scaling law.
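The deployment-economics argument can be illustrated with a rough lifetime-compute comparison. Everything numeric here is hypothetical: the two model configurations, the assumption that they reach the same loss, and the approximations of 6 FLOPs per parameter per training token and 2 FLOPs per parameter per served token.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Approximate total compute: training plus inference over deployment."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * served_tokens
    return training + inference

def break_even_served_tokens(big, small):
    """Served-token volume above which the smaller, longer-trained model wins."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_token

# Hypothetical pair ASSUMED to reach the same loss (not a fitted result).
compute_optimal = (70e9, 1.4e12)   # (parameters, training tokens)
overtrained = (35e9, 4.0e12)

threshold = break_even_served_tokens(compute_optimal, overtrained)
low_volume = 1e12                  # research-style usage
high_volume = 1e13                 # heavily deployed model
```

Below `threshold` the compute-optimal model is cheaper overall; above it the overtrained model is, which is the sense in which the choice depends on deployment volume and not on the scaling law itself.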
Emergence: the claim and the critique
A widely cited claim, articulated most prominently by Wei et al. (2022), is that some capabilities of large language models exhibit emergence: they remain near random until model scale crosses a threshold, then improve sharply. The implication, if true, is consequential. Smooth scaling of training loss does not entail smooth scaling of all downstream capabilities; some capabilities arrive in discontinuous jumps that cannot be predicted by extrapolating from smaller models.
Schaeffer et al. (2023) responded that the apparent discontinuities in many emergence plots are artifacts of how performance is measured. Capabilities that look discontinuous under a 0/1 metric (such as exact-match accuracy on multi-step arithmetic) become smooth under a continuous metric (such as token-level accuracy or cross-entropy on the same task). Under this critique, “emergence” is not a property of the model’s underlying capability but of the choice of evaluation metric: a non-linear function applied to underlying smooth improvements can produce apparent jumps.
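The metric argument can be reproduced with a toy calculation. The per-token accuracy curve below is hypothetical (linear in log-parameters), and the sketch assumes token errors are independent, so that exact match on a k-token answer is the per-token accuracy raised to the power k.

```python
def per_token_accuracy(log10_params):
    """Hypothetical smooth improvement: +0.08 accuracy per 10x parameters."""
    return 0.5 + 0.08 * (log10_params - 7)

def exact_match(token_accuracy, answer_len=10):
    """All-or-nothing metric: every token of the answer must be right."""
    return token_accuracy ** answer_len

scales = [7, 8, 9, 10, 11, 12]                     # log10 of parameter count
token_acc = [per_token_accuracy(s) for s in scales]
em = [exact_match(p) for p in token_acc]
```

`token_acc` rises in equal steps, while `em` stays close to zero for the smaller scales and then climbs steeply: the same underlying improvement, viewed through a non-linear metric.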
The current state of the empirical question is unresolved. Schaeffer et al.'s critique applies cleanly to the specific examples they re-analyzed; whether it generalizes to all claimed emergent capabilities is contested. Subsequent work has explored cases where the smooth-underlying-metric story is harder to maintain - some agentic and reasoning capabilities; specific in-context-learning regimes - and the literature continues to develop.
Editorial note. We deliberately refrain from picking a winner between Wei et al. and Schaeffer et al. The technical question - are there capabilities whose appearance is genuinely discontinuous in scale? - is one of the most important open empirical questions in the field, and the evidence as of 2026 does not settle it. We return to its consequences for safety and governance under open problem OP-FM-3 in §11 and in the Alignment chapter.
Predictability of capabilities
A related but distinct question is whether scaling laws permit prediction of downstream capabilities, not merely of pretraining loss. Kaplan-style and Chinchilla-style laws describe loss; for many users, the practically relevant quantities are downstream task performance, agentic competence, code-generation reliability, or specific safety-relevant capabilities. The literature on whether and how loss scaling translates into capability scaling is partial: in some regimes the relationship is reasonably smooth; in others (especially capabilities involving multi-step structure) it is less reliable.
The predictability question matters beyond academic literature. Safety frameworks at frontier labs - model evaluations, dangerous-capability assessments, scaling commitments - implicitly assume some degree of capability predictability from scale. If that assumption fails - that is, if capabilities can appear without warning at scales not predictable from earlier model behaviour - the safety case for scaling-as-currently-practiced is weaker. We treat this in detail in the Alignment and Evaluation chapters; here we record it as open problem OP-FM-4 in §11.
Pitfall. It is tempting to read scaling laws as theory, since they take the mathematical form of theoretical predictions. They are not theory in the standard sense. They are empirical regularities fit to a finite range of observed scales; their extrapolation to scales beyond the fit, or to capabilities not directly measured, is an empirical bet, not a derivation. The Theoretical Foundations of Learning chapter develops this distinction further (§9).
§7. Adaptation
Adaptation is the second of the three pillars (§3) and the operation that makes the foundation-model regime economically viable: a single expensive pretraining produces a base model that can be specialized many times, cheaply, for many downstream uses. This section catalogues the adaptation mechanisms in roughly increasing order of intrusiveness - from zero parameter changes (prompting, in-context learning) to substantial parameter changes (full fine-tuning, RL-based methods). The LLM chapter develops the language-specific surface of each mechanism; this section treats the regime-spanning structure.
The adaptation spectrum
       no parameter changes              parameter changes           retraining
  ───────────────────────────────────────────────────────────▶  ───────▶

  Zero-shot       In-context      Parameter-       Full            continual
  prompting       learning        efficient        fine-           pretraining
                  (ICL)           fine-tuning      tuning          + the rest
      ▲               ▲               ▲              ▲             of the
      |               |               |              |             pipeline
  no training     no training     train small     train
  no examples     a few           adapters or     the full
                  examples        prefixes        model on
                  in prompt       (LoRA,          new data
                                  adapters,
                                  prefix
                                  tuning, IA³)

  ───────────────────────────────────────────────────────────▶
  cheaper                                        more expensive
  per use                                        per adaptation

The further right on this spectrum, the more the model changes (rather than just the input); the further right, the more durable the adaptation but the more expensive each instance. Production deployments combine multiple points on the spectrum - for example, a model that has been instruction-tuned (parameter changes) and is then prompted (no parameter changes) at inference time.
Zero-shot prompting
The simplest adaptation: the pretrained (or instruction-tuned) model is shown a task description in natural language and produces a response. No examples, no parameter updates. Effective when the task is well-represented in pretraining and the description is unambiguous. The LLM chapter develops the chat-format that makes zero-shot prompting practical (system/user/assistant roles, instruction templates).
In-context learning (ICL)
Slightly more adaptive: the prompt includes a small number of input-output examples, and the model is expected to follow the demonstrated pattern. Still no parameter updates. As developed in the LLM chapter §7, ICL is one of the defining capabilities of large foundation models - and one whose theoretical basis is not fully understood. The four theoretical accounts (induction heads, implicit gradient descent, task vectors, Bayesian inference) are surveyed in the LLM chapter; we record the open question here as OP-FM-14.
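Mechanically, in-context learning is only prompt construction. The sketch below assembles a few-shot prompt from demonstration pairs; the `Input:`/`Output:` template and the function name are illustrative choices, not a format any particular model requires.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an ICL prompt: instruction, demonstrations, then the query."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # The prompt ends where the model is expected to continue the pattern.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great battery life.", "positive"), ("Broke after a week.", "negative")],
    "Arrived late but works well.",
)
```

No parameters change anywhere in this process; the adaptation lives entirely in the string passed to the model.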
Retrieval augmentation
A different mechanism for adapting the model’s effective knowledge without changing parameters: retrieve relevant external documents at inference time and add them to the prompt. The Retrieval-Augmented Generation chapter develops this; the LLM chapter §9 treats the LLM-specific surface.
Parameter-efficient fine-tuning (PEFT)
When adaptation without parameter changes is not enough but full fine-tuning is too expensive (or risky for catastrophic forgetting), parameter-efficient fine-tuning (PEFT) updates a small subset of parameters - or a small number of additional parameters - while keeping the base model frozen. The dominant techniques:
LoRA (Low-Rank Adaptation) (Hu et al., 2021) injects trainable low-rank matrices into the model’s attention layers; only those matrices are updated, the original weights are frozen. The trainable parameter count is typically 0.1–1% of the base model. After training, the LoRA matrices can be merged back into the base for inference, or kept separate as a swappable adapter.
Adapters (Houlsby et al., 2019) insert small trainable modules into each layer of the transformer; only the adapters are updated.
Prefix tuning / prompt tuning (Li and Liang, 2021; Lester et al., 2021) prepends a small number of trainable vectors to the input sequence at every layer (prefix) or just the input layer (prompt); only those vectors are updated.
IA³ (Liu et al., 2022) learns per-element scaling factors on key, value, and feedforward activations.
The shared advantage: many adaptations of the same base model are possible at near-zero marginal storage cost. The shared disadvantage: capacity is limited; tasks requiring substantial behavioural change usually benefit from full fine-tuning instead.
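The LoRA idea from the list above can be sketched for a single linear layer. This is a forward-pass illustration only (no training loop); the class name, the 512×512 layer, the rank, and the alpha/rank scaling value are assumptions for the example, and the random `B` assigned at the end merely stands in for a trained adapter.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                          # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))              # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus low-rank path; only A and B would receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Fold the adapter into the base weight for inference.
        return self.weight + self.scale * self.B @ self.A

rng = np.random.default_rng(1)
base_weight = rng.normal(size=(512, 512))
layer = LoRALinear(base_weight, rank=2)
x = rng.normal(size=(4, 512))
out_before = layer.forward(x)
layer.B = rng.normal(0.0, 0.01, size=layer.B.shape)   # stand-in for a trained B
out_after = layer.forward(x)
trainable_fraction = (layer.A.size + layer.B.size) / base_weight.size
```

Keeping `A` and `B` separate gives the swappable-adapter mode; calling `merged_weight()` gives the merged mode, with identical outputs.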
Full fine-tuning
Continuing training of all of the base model’s parameters on a task-specific dataset. The dominant tool for substantial behavioural change: when the deployment differs enough from the base model’s competence that PEFT cannot bridge the gap. Risk: catastrophic forgetting of the base model’s broader capability if fine-tuning is too aggressive (OP-FM-8). The LLM chapter §5 develops the standard recipe.
Instruction tuning
A specific form of supervised fine-tuning where the data consists of (instruction, ideal response) pairs across many task types. This is the standard step that turns a fluent-but-not-instruction-following base model into one that reliably follows directives. The LLM chapter §5.2 develops this; the same pattern applies to multimodal FMs, code FMs, and robotics FMs.
Preference-based methods
After instruction tuning, the model usually follows instructions but may not produce the best response among the many plausible options. Preference-based methods refine the model based on judgments - by humans or by trained proxies - about which responses are better. The major algorithms:
RLHF (reinforcement learning from human feedback) - train a separate reward model on preference data; optimize the policy against the reward model via PPO.
DPO (direct preference optimization) - skip the reward model; update the policy directly from preference data with a closed-form supervised-style loss.
GRPO (group relative policy optimization) - sample multiple responses, score each with a verifiable reward function, use group-relative advantages for policy updates.
RLAIF (RL from AI feedback) and Constitutional AI - replace human judgments with model judgments against written criteria, reducing the human-labelling bottleneck.
The LLM chapter §5.3 develops the algorithmic depth (full derivations, pseudocode); the Alignment chapter develops the substance of what the model is being aligned to.
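Two of the quantities named above are compact enough to sketch directly: the per-pair DPO loss and the group-relative advantage used in GRPO. The inputs are sequence log-probabilities and scalar rewards supplied by the caller; the function names, the default beta, and the use of the population standard deviation for normalization are assumptions of this sketch.

```python
import math
import statistics

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen shift - rejected shift))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

loss_at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
loss_after_update = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # policy favours chosen
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The DPO loss needs no reward model, only log-probabilities from the policy and a frozen reference; the GRPO advantage needs no learned value function, only several scored samples per prompt.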
Distillation
Once a high-quality adapted model exists, distillation transfers its capabilities to a smaller student model. The LLM chapter §5.6 develops the variants (hard-label, soft-label, on-policy, trace distillation). Distillation is the standard mechanism by which large frontier models give rise to deployable smaller models, and is responsible for much of the open-weights LLM ecosystem.
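For the soft-label variant mentioned above, a minimal sketch of the loss is the KL divergence between temperature-softened teacher and student distributions. The temperature value and the multiplication by temperature squared are common conventions assumed here, not details given in this chapter.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)      # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over positions, at a shared temperature."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_teacher = np.log(p_teacher)
    log_p_student = np.log(softmax(student_logits, temperature))
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))       # 4 positions, 10-token vocabulary
student_logits = rng.normal(size=(4, 10))
loss_mismatch = soft_label_distillation_loss(student_logits, teacher_logits)
loss_match = soft_label_distillation_loss(teacher_logits, teacher_logits)
```

Hard-label distillation would replace `p_teacher` with a one-hot vector at the teacher's top token, discarding the relative probabilities that the soft-label loss transfers.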
Model merging
A more recent adaptation pattern: rather than training a fine-tuned variant from scratch, combine multiple already-fine-tuned variants of the same base model by averaging or interpolating their parameters. Techniques include task arithmetic (subtract the base from a fine-tuned model to get the “task vector”; combine task vectors algebraically), TIES (a more careful merging procedure that handles conflicts between merged models), and SLERP (spherical linear interpolation in parameter space). Model merging is cheap (no training required), occasionally produces surprisingly capable results, and is an active research area.
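Task arithmetic, the simplest of the merging techniques above, can be sketched on toy parameter dictionaries. The parameter names and values are hypothetical, and the sketch assumes all models share the base model's parameter names and shapes; TIES and SLERP are not shown.

```python
import numpy as np

def task_vector(finetuned, base):
    """Task vector: fine-tuned parameters minus base parameters."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Add scaled task vectors to the base model; no training involved."""
    merged = {name: value.copy() for name, value in base.items()}
    for vector in vectors:
        for name in merged:
            merged[name] += scale * vector[name]
    return merged

# Toy "models" with a single parameter tensor each.
base = {"w": np.zeros(3)}
finetuned_a = {"w": np.array([1.0, 0.0, 0.0])}   # hypothetical task A variant
finetuned_b = {"w": np.array([0.0, 2.0, 0.0])}   # hypothetical task B variant

vectors = [task_vector(finetuned_a, base), task_vector(finetuned_b, base)]
merged = apply_task_vectors(base, vectors)
```

In this toy case the two task vectors touch different coordinates, so they combine without interference; conflicts between overlapping updates are the problem that more careful procedures such as TIES address.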
Putting adaptation together
A production foundation model is rarely a single adaptation. The typical pipeline:
Pretrained base
      │
      ▼
Instruction tuning          (SFT on demonstrations)
      │
      ▼
Preference tuning           (DPO / RLHF / GRPO)
      │
      ▼
Optional: domain adaptation (PEFT or full fine-tune
                             on domain data)
      │
      ▼
Optional: distillation      (produce smaller deployable
                             variants)
      │
      ▼
Deployed model
      │
      ▼
Per-query adaptation        (prompting, ICL, retrieval
                             augmentation, tool use)

Each layer of adaptation makes the model more specialized for its deployment; each costs less per instance than the stages that precede it. The economic shape of the FM regime is this layered, increasingly cheap specialization on top of a one-time-paid pretraining base.
§8. Capabilities
A capability inventory across modalities. The point of this section is not to evaluate foundation models - that is §9 and the Evaluation chapter - but to catalogue what kinds of tasks they have shown competence on, where capability is sharpest, and where it is weakest. Each subsection is short; depth is delegated to the chapter where the capability is most fully developed.
Language
The capabilities most studied in the LLM chapter (§1–§14): natural-language generation, summarization, translation, instruction following, code generation, structured output (JSON, function calls), question answering, classification, named-entity recognition, dialogue. By 2026 frontier LLMs are competitive with non-expert humans on most language tasks; expert humans still outperform on specialized domains and on tasks requiring careful long-form reasoning.
Vision and multimodal
Image description, visual question answering (VQA), grounded image segmentation (SAM-style prompted segmentation), grounded object detection, multimodal generation (image, video, audio synthesis), document understanding, screenshot understanding, computer-use control. Native multimodal models (modern Gemini, GPT-4o-class, Claude with multimodal input) handle interleaved text-and-images in a single forward pass; older systems used separate encoder + LLM pipelines. The Multimodal Models chapter develops the substance; the Generative Models chapter covers diffusion- and flow-matching-based generative capability.
Reasoning
What foundation models without test-time compute can do: pattern-completion, retrieval of training-corpus patterns, structurally-familiar multi-step inference up to a few steps. What requires test-time compute (chain-of-thought or reasoning-model RL): multi-step mathematical proof, theorem proving, hard programming problems, multi-hop reasoning over long contexts, planning over multi-step decision sequences. The Reasoning Models chapter develops the test-time-compute regime (o1/o3-style models). The open question of where reasoning shades into pattern-matching - OP-FM-6 (compositional generalization) and OP-LLM-8 (reasoning vs pattern-match) - recurs throughout.
Tool use and agency
Calling external functions, querying databases, browsing the web, executing code, controlling computers (screenshot + click/type). LLM §8 develops the LLM-side surface; the AI Agents and Tool Use chapter develops the systems story (multi-step agents, planning over tools, complex agent ecosystems, agentic evaluation). By 2026 agentic deployment is one of the most consequential deployment patterns for foundation models - many production “chatbots” are actually agents in the technical sense.
Embodied
Foundation models for robotics (RT-1, RT-2, OpenVLA, π0 and successors) combine vision, language, and motor action in a single multimodal model trained on combinations of perception and demonstration data. Capabilities range from manipulation of objects to navigation to multi-step physical tasks. The gap between embodied FMs and their text-only cousins is substantial: embodied training data is much more expensive to collect and lower in volume; embodied capability lags accordingly. The Robotics chapter develops this.
Scientific
Domain-specific foundation models have produced striking results in several scientific areas:
Protein structure prediction. AlphaFold-2 (Jumper et al., 2021) and AlphaFold-3 are foundation models for biology; they accept amino-acid sequences and predict three-dimensional folded structures with experimentally-useful accuracy. The impact was recognized with a share of the 2024 Nobel Prize in Chemistry.
Materials discovery. GNoME and successors search materials space for stable crystal structures by training on databases of known materials.
Mathematics. AlphaProof, AlphaGeometry, and FunSearch have produced solutions to substantial mathematical problems; their relationship to “general” reasoning is contested (OP-LLM-8).
Weather and climate. GraphCast and similar models predict weather competitively with operational numerical-weather-prediction systems, at dramatically lower inference cost.
Chemistry, drug discovery, biology more broadly. Smaller-scale FMs for molecular property prediction, protein-protein interaction, gene-expression modelling.
The AI for Science chapter develops the substance.
Calibration
Across modalities, an open question: do foundation models know what they know? Calibration - whether the model’s expressed confidence matches its empirical accuracy - is uneven. Pretrained base models are reasonably well-calibrated on token probabilities; post-trained chat models often are not, because preference tuning rewards confident-sounding language regardless of underlying uncertainty (OP-LLM-2). Verbalized uncertainty (“I’m not sure, but...”) is a poor proxy for actual model uncertainty. The question of how to recover or measure real model uncertainty after the standard training pipeline is unresolved.
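One common way to quantify the mismatch between confidence and accuracy is expected calibration error (ECE). The sketch below assumes equal-width confidence bins and that a confidence score and a correctness label are available for each answer; the function name and the toy inputs are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap      # weight by fraction of samples in bin
    return float(ece)

# Toy cases: stated confidence matches accuracy vs. overconfident answers.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
```

The measure applies only where a numeric confidence exists (for example, token probabilities); it does not by itself address the verbalized-uncertainty problem described above.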
What foundation models are weak at
The mirror of the capabilities catalogue:
Reliable multi-step reasoning without scaffolding (helped by reasoning models but not solved).
Tasks requiring tightly bound factual knowledge with no error tolerance (hallucination, OP-FM-5).
Tasks far outside the training distribution.
Embodied tasks compared to specialized robotic systems.
Tasks requiring genuine novelty (compositional generalization, OP-FM-6).
Long-horizon planning without extensive search infrastructure around the model.
The full open-problems list is in §11.
§9. Evaluation
Foundation models are notoriously difficult to evaluate well. The challenges come from several directions and compound; the Evaluation chapter develops methodology in depth, but the chapter-level surface is worth laying out because so much depends on it.
Why FM evaluation is hard
Generality breaks task-specific benchmarks. Traditional ML evaluation assumed a model trained for a specific task could be measured on that task’s test set. Foundation models are deployed across many tasks; no single benchmark covers the deployment surface, and aggregate scores across many benchmarks combine quantities that may not be commensurable.
Test-set contamination. Pretraining corpora include essentially everything publicly available on the web, often including the test sets of published benchmarks. A frontier model that “scores 95% on benchmark X” may be (in part) reciting test-set answers from its training data. Detecting contamination is hard; mitigating it requires holdout benchmarks or fresh-data evaluation, both of which are operationally expensive.
Surface vs depth. A model that produces fluent-sounding answers does not necessarily produce correct ones. Benchmarks that measure only output fluency (rather than checking correctness against ground truth) systematically over-estimate capability. The same problem in a sharper form: many benchmarks check whether an answer resembles the right one, not whether the model’s reasoning underneath was sound.
Adaptation moves the target. A “GPT-4” evaluated as a base model is different from “GPT-4” after instruction tuning, different again after preference tuning, different again under various system prompts and tool-use configurations. Evaluation results from one configuration often do not transfer. OP-FM-15 in §11.
Agentic loops compound errors. When a foundation model is deployed in an agentic loop (multiple tool calls, multiple turns), small per-step error rates compound across the loop. Assuming independent errors, a model with 90% reliability per step is about 59% reliable over five steps and about 35% reliable over ten. Aggregate evaluation that ignores this compounding misrepresents real-world reliability.
Silent updates. As developed in LLM §10, closed-model providers update deployed models silently. A benchmark run today may not reproduce tomorrow.
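The compounding arithmetic in the agentic-loops item above is simple enough to compute directly. The sketch assumes every step must succeed and that step failures are independent, which real agent loops (with retries and error recovery) only approximate.

```python
def chain_success_probability(per_step_reliability, n_steps):
    """Probability that all n independent steps succeed."""
    return per_step_reliability ** n_steps

def required_per_step_reliability(target, n_steps):
    """Per-step reliability needed to hit a target end-to-end success rate."""
    return target ** (1.0 / n_steps)

five_steps = chain_success_probability(0.9, 5)       # about 0.59
ten_steps = chain_success_probability(0.9, 10)       # about 0.35
needed = required_per_step_reliability(0.9, 10)      # about 0.99
```

Read in the other direction, the second function shows why long-horizon agents need per-step reliability well above what single-turn benchmarks typically report.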
Benchmark families
A non-exhaustive taxonomy of how FM benchmarks are organized:
Knowledge benchmarks - MMLU and similar knowledge-of-world tests across domains.
Reasoning benchmarks - GSM8K and MATH (math), ARC, BBH and other BIG-Bench reasoning subsets.
Code benchmarks - HumanEval, MBPP, SWE-bench, LiveCodeBench (regularly refreshed to fight contamination).
Multimodal benchmarks - MMMU, vision-language QA suites, segmentation benchmarks.
Long-context benchmarks - needle-in-a-haystack (NIAH) tests, RULER, LongBench (see LLM §9).
Agentic benchmarks - WebArena, AgentBench, GAIA, SWE-bench-like environments that test multi-step tool-using agents end to end.
Capability evaluations / dangerous-capability evaluations - biosecurity, cybersecurity, autonomous-replication assessments used in frontier-lab safety reviews. The Evaluation and Alignment chapters develop these.
Holistic frameworks - HELM (Liang et al., 2022) and successors that combine many benchmarks into a structured report covering accuracy, fairness, robustness, bias, calibration, and other axes.
Human-judgment evaluation - pairwise preference comparisons (Chatbot Arena), task-completion ratings by domain experts.
A modern frontier-model evaluation report typically draws on many of these families simultaneously; relying on any single benchmark is now considered insufficient.
Contamination: detection and mitigation
Detection techniques include probing the model for verbatim recall of benchmark examples, checking training corpora directly (when accessible), running benchmark variants generated after the training cutoff, and using held-out private test sets. Mitigation takes one of three forms: holdout benchmarks that are kept private, fresh-data benchmarks that are rebuilt periodically (LiveCodeBench is the canonical example), or careful training-data curation that excludes known benchmarks.
By 2026 contamination is widely acknowledged as a real problem; the field’s response varies - some labs publish detailed contamination analyses, others do not. The reader of any specific benchmark result should consider contamination as a plausible component of the headline number.
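When the training corpus is accessible, the "checking training corpora directly" technique is often approximated with n-gram overlap. The following is a minimal sketch under stated assumptions: whitespace tokenization, lowercasing, and a window size n chosen by the caller are all illustrative choices, and the function names are hypothetical. Production audits use proper tokenizers, normalization, and indexed corpora rather than in-memory sets.

```python
from __future__ import annotations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams of `text` after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus.

    Returns 0.0 when the item is shorter than n tokens (no n-grams to check).
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy corpus and items (hypothetical, for illustration only).
corpus = ["forum post: what is the capital of france the answer is paris thanks"]
leaked = "What is the capital of France the answer is Paris"
clean = "Name the largest moon of Saturn and its discoverer"

print(overlap_fraction(leaked, corpus, n=4))  # 1.0
print(overlap_fraction(clean, corpus, n=4))   # 0.0
```

A high overlap fraction flags an item for inspection; it does not by itself show that the model memorized the answer, and paraphrased leaks will pass this check undetected.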
Pointer to the Evaluation chapter
This section is the chapter-level surface. The Evaluation chapter develops the methodology in depth: how to design a benchmark, how to think about validity, how to handle contamination at the policy level, how to evaluate agentic systems and reasoning models specifically, and how to read existing benchmark reports skeptically.
§10. Connections to Other Chapters
This chapter is the spine. The map of which chapters develop which dimensions of the regime:
Deep Learning develops the architectural substrate that foundation models are built on - attention, the transformer block, positional encodings, normalization, MoE training, state-space models. §4 of this chapter is the chapter-level surface; the substance lives there.
Self-Supervised Learning develops the family of pretraining objectives - causal LM, masked LM, span corruption, contrastive learning, masked image modelling, JEPA, multimodal alignment. §5 of this chapter catalogues them across modalities; the SSL chapter develops each in depth.
Theoretical Foundations of Learning develops generalization theory, the modern generalization puzzle, and what scaling laws (§6) can and cannot tell us. The epistemic-status framing in §6 is developed in full in that chapter.
Large Language Models is the language-specific specialization of this chapter. Most of the cross-cutting adaptation material developed in §7 is given algorithmic depth in the LLM chapter (§5 for the training pipeline, §6 for inference, §7 for prompting and ICL, §8 for tool use).
Multimodal Models is the vision-language, audio-language, and embodied extension. The cross-modality pretraining of §5 and the multimodal architecture pattern of §4 are developed in depth there.
Generative Models covers diffusion, flow matching, and the broader class of generative foundation models - image, video, audio, molecular synthesis. The pretraining objectives in §5’s image and audio rows are developed there.
Reasoning Models covers the test-time-compute layer built on top of foundation models - o1/o3-style reasoning models, GRPO-trained systems, the reasoning-RL stage of the training pipeline.
AI Agents and Tool Use covers agentic deployment patterns - multi-turn tool use, planning over tools, complex agent systems. The LLM-side surface is in LLM §8; the systems story is in that chapter.
Retrieval-Augmented Generation covers retrieval architectures and the retrieval-as-long-context-substitute pattern flagged in §7.
Mechanistic Interpretability covers what is inside a foundation model at the level of features and circuits. Key for §8’s calibration discussion and §11’s open problems on the theory of in-context learning (OP-FM-14).
Alignment covers the substance of what FMs are aligned to - preference tuning details (RLHF, DPO, GRPO, RLAIF), scalable oversight, dangerous-capability evaluations, governance frameworks. This chapter’s §7 covers the algorithmic surface; the Alignment chapter covers what alignment is for.
Causality develops the Pearl-aligned critique flagged in §12 - the question of whether associational learning fundamentally limits foundation models. OP-FM-11 anchors here.
Reinforcement Learning develops the RL primitives used in preference tuning (PPO, GRPO), reasoning-model training, and the broader question of LLMs-as-RL-policies.
Evaluation covers the methodology of evaluating foundation models - benchmark design, contamination, validity, capability assessments. §9 is the chapter-level surface.
Efficient and Scaled Training covers the engineering of training at frontier scale - distributed training, parallelism strategies, hardware-aware design, quantization. Compute-and-infrastructure pointers in §5 and elsewhere link here.
AI for Science covers domain-specific foundation models - AlphaFold and successors in biology, GNoME in materials, GraphCast in weather, AlphaProof in mathematics.
Robotics covers embodied foundation models - RT-2, OpenVLA, π0 - and the embodiment story of §12 and OP-FM-13.
The Foundation Models chapter is meant to be the first the reader encounters in the modern-AI cluster; the chapters above develop the substance of the dimensions sketched here. Where a chapter is a prerequisite of another, this is noted in the YAML frontmatter of the dependent chapter.
§11. Limitations and Open Problems
A neutral inventory of cross-cutting open problems in the foundation-model regime. Each item below is the FM-level version of a question that recurs across the chapters of this book; the LLM, Multimodal, Reasoning, Alignment, Evaluation, and other chapters surface chapter-specific facets that we cross-reference inline. As elsewhere in the book, we list the problem and the current state of partial answers; we do not adjudicate.
OP-FM-1: Data quality vs scale tradeoffs. The empirical scaling laws of §6 measure loss as a function of data quantity. They do not capture data quality: a corpus of well-edited books and code yields different downstream behaviour than an equivalent volume of low-quality web crawl. The current state of practice is that high-quality data dominates at the margin - careful filtering and curation produce better models per token than naive scraping - but the right exchange rate between scale and quality is not characterized. How much can scale compensate for quality, and where? The cross-chapter facet in the LLM chapter is OP-LLM-3 (multilingual) and the tokenization-induced parts of OP-LLM-9; the data-curation side is treated in §5.
OP-FM-2: Synthetic data and recursive training. As developed in the LLM chapter’s distillation and synthetic-data subsections, modern training pipelines increasingly include model-generated data. Filtered synthetic data demonstrably helps; uncurated recursive training produces model collapse (Shumailov et al., 2024), where successive generations of models trained on each other’s outputs progressively lose distributional diversity. The conditions under which synthetic data is robust versus harmful are not fully mapped. The cross-chapter facet is the LLM training pipeline’s “synthetic-data turn” treatment.
OP-FM-3: Predicting emergence and capability transitions. §6 surveyed the Wei et al. (2022) / Schaeffer et al. (2023) debate on emergence. The deeper open question, regardless of which side is right: when a new capability does appear at scale, can it be predicted in advance from smaller-model behaviour? Safety frameworks at frontier labs depend on the answer - dangerous capabilities that arrive without warning at scales not predicted by earlier model behaviour weaken the safety case for further scaling. The relationship between scaling laws (loss) and capability scaling is partial and remains an active research area.
OP-FM-4: Validity of scaling laws beyond observed regimes. Scaling laws (§6) are fit to a finite range of observed scales. Their extrapolation to scales beyond that range is an empirical bet, not a derivation. Whether the power-law relationship continues, flattens, breaks, or changes regime at larger scales is unknown until experiments at those scales are run. The Theoretical Foundations of Learning chapter (§9) develops the epistemic-status side of this question.
OP-FM-5: Hallucination and factuality. Foundation models - particularly LLMs - produce confident-sounding statements that are false at non-trivial rates. The causes are multiple (training-data gaps, fluency-rewarding objectives, autoregressive commitment to wrong paths); the mitigations are multiple (retrieval grounding, confidence calibration, post-hoc verification); none is complete. The LLM-specific facet is OP-LLM-1; the Multimodal-specific facets show up in the Multimodal chapter.
OP-FM-6: Compositional generalization. Foundation models reliably handle inputs that recombine pieces seen at training time (a familiar sentence structure with new words, a familiar task with new content); they are less reliable on inputs that require composing those pieces in ways unseen at training. The boundary between “recombinable” and “novel composition” is itself contested. Reasoning models (§7) and chain-of-thought close part of the gap. The LLM-specific facet is OP-LLM-8 (reasoning vs pattern-match).
OP-FM-7: Long context and working memory. Foundation models hold information across a context window; they do not hold information across calls without explicit mechanisms. The question is whether more-context (longer windows) is the right architectural primitive for working memory, or whether retrieval (search over an external store) is, or whether some hybrid is. The LLM-specific facet is OP-LLM-4; the RAG chapter develops the retrieval side; OP-FM-8 covers the long-term knowledge-update side.
OP-FM-8: Knowledge updating without catastrophic forgetting. As developed in the LLM chapter’s continual-pretraining subsection, the fundamental tradeoff is between fast adaptation to new information and stability of existing capabilities. The four mitigations (data-mixing replay, low learning rate, regularization, modular adapters) each partially address the problem; none solves it. How to update a foundation model’s knowledge online, across the lifetime of a deployment, without periodic full re-training, is open.
OP-FM-9: Sample inefficiency relative to humans. Foundation models train on trillions of tokens of text - vastly more than a human encounters in a lifetime - to reach competence that humans achieve from a much smaller language exposure. The mismatch could reflect the absence of grounding (the embodiment story, OP-FM-13), the absence of compositional structure in current architectures (OP-FM-6), the difference between predictive learning and explicit reasoning, or all three. Whether sample efficiency at the human level is achievable with current architectures plus richer training signals, or requires fundamentally different architectures, is open. The Self-Supervised Learning chapter treats the predictive-learning side.
OP-FM-10: Robustness and adversarial inputs. Foundation models can be subverted by adversarial inputs: prompt injection in LLMs (OP-LLM-10), adversarial perturbations in vision, jailbreaks bypassing safety training, indirect attacks via tools and retrieval. The fundamental issue across modalities: the model has no architectural distinction between “data to process” and “instructions to follow”; every input shapes behaviour. Making foundation models robust to adversarial input without sacrificing capability is an open alignment-and-security problem, treated in depth in the Alignment chapter.
OP-FM-11: Causal reasoning. Pearl-aligned critics (the Causality chapter develops this position) argue that purely associational learning - predicting one thing from another - is fundamentally limited and cannot produce systems capable of intervention and counterfactual reasoning in the full causal-ladder sense. Whether foundation models perform some kind of partial causal reasoning, or whether their apparent causal competence is sophisticated associational pattern-matching, is an open empirical and theoretical question. The Causality chapter develops the framework; this is its FM-level facet.
OP-FM-12: Energy, economics, and access. Frontier-model training at 2026 scales requires capital expenditures and ongoing compute spending that only a small number of organizations can sustain. This concentrates the field’s capability frontier in a few labs, with consequences for research access (OP-LLM-5/6 capture the LLM facet), environmental footprint, and global access to the technology. Whether the centralization is durable or whether efficiency gains will redistribute access is open. The Efficient and Scaled Training chapter treats the efficiency side; this is its societal facet.
OP-FM-13: Embodiment. The broader embodiment critique (§12 below): systems trained without any link to a physical world may have fundamental limits on grounded concepts. Whether enough text/image/audio data carries enough indirect signal to build adequate physical representations - or whether some form of embodied training (robotics, simulated environments, sensorimotor experience) is necessary - is contested. The Robotics and Multimodal chapters cover the embodied-FM side; the LLM-specific facet is part of LLM §14.
OP-FM-14: Theoretical understanding of in-context learning. As developed in the LLM chapter §7, multiple theoretical accounts of ICL exist (induction heads, implicit gradient descent, task vectors, Bayesian inference) and none is decisive. ICL is the most consequential capability that distinguishes large foundation models from smaller pretrained ones; a theory that fully explained when and why it works would substantially clarify many other open problems on this list. The Theoretical Foundations of Learning chapter develops the learning-theory side.
OP-FM-15: Evaluation under adaptation - the moving target problem. Foundation models are adapted in many ways before deployment (SFT, preference tuning, retrieval augmentation, agentic loops, tool use). Each adaptation changes the model’s behaviour, and the evaluation results from before adaptation often do not transfer. Test-set contamination, silent model updates (OP-LLM-5), and the diversity of deployment configurations compound the problem: what does it mean to “evaluate GPT-4” when there are dozens of different post-trained variants of the same base, accessed through different interfaces? The Evaluation chapter treats the methodology; this is the FM-level surface.
These fifteen open problems span the FM regime. The chapter-specific open-problem lists in LLMs (OP-LLM-1 through OP-LLM-10), Multimodal Models, Reasoning Models, Mechanistic Interpretability, Alignment, Evaluation, and Robotics extend and specialize this list along the dimensions each chapter develops.
§12. Critiques and Alternative Frames
The foundation-model regime, as developed throughout this chapter, is not the only available frame for understanding the path to capable AI systems. Several substantive critiques exist as research programmes in their own right, each with its own publications, conferences, and adherents. We present each as a position with citations and current state; the editorial stance is neutral.
The symbolic / hybrid AI critique. A long-running position, with Gary Marcus as one prominent advocate, holds that scaling pure pattern-completion models cannot produce systems with the kind of compositional, systematic, rule-following reasoning that intelligent behaviour requires. The claim has both an empirical version (pure neural models fail at compositional generalization, OP-FM-6) and an architectural version (the right path forward is a hybrid of neural perception with symbolic reasoning systems, rather than scaling neural systems alone). The 2024–2026 response from foundation-model researchers has been mixed: reasoning models (the Reasoning Models chapter) close part of the empirical gap, while purely neural systems continue to under-perform on the hardest compositional tasks. The architectural debate remains live; “neurosymbolic” remains an active research direction whose practitioners argue that the empirical gap is structural and durable.
The causal critique. Judea Pearl and aligned researchers argue that purely associational learning - predicting outputs from inputs without modelling intervention or counterfactual reasoning - is fundamentally limited. The full position rests on the ladder of causation (associations, interventions, counterfactuals), with the claim that foundation models operate at the lowest rung and cannot, by their training objective alone, ascend. The Causality chapter develops this position; OP-FM-11 captures the open empirical question (whether FMs exhibit some partial causal reasoning despite not being explicitly trained for it). The position predicts that systems built purely on next-token prediction will plateau on tasks requiring causal reasoning, regardless of scale. The empirical picture as of 2026 is partial evidence in both directions.
The embodiment critique. Several research traditions - robotics, developmental psychology, cognitive science - hold that grounded concepts (what an object is, what it means to move, what causation between physical events feels like) cannot be learned from text alone, and that embodied training in some form is necessary. The position has degrees: a strong version (no genuine grounding from text) and a weaker version (text-only systems have ceiling effects on physical-world reasoning that multimodal or embodied training would surpass). Vision-language-action models (the Robotics chapter), multimodal foundation models, and simulator-trained systems are the empirical responses; the strong-version critics generally find these responses inadequate, the weak-version critics find them more compelling. OP-FM-13 records the question.
The world-models perspective. Yann LeCun, with the JEPA family of architectures as the technical proposal, has argued that the right path forward involves building explicit world models - predictive models of how the environment evolves under actions - rather than continuing to scale prediction over outputs. Under this framing, current LLMs are world models only of text; the harder and more important target is a system that predicts the next state of the physical or simulated environment. The position has both architectural content (JEPA-style predictive-embedding architectures instead of generative ones) and methodological content (focus on representation quality rather than output quality). It overlaps with the embodiment critique but is distinct in its specific technical proposal.
Cognitive-science critiques. A loose family of positions, from researchers in psychology and cognitive neuroscience (McClelland, Tenenbaum, Lake, Marcus, and others), questions what foundation models are as models of cognition. The critiques include: that FMs lack the structured world-modelling humans deploy from early childhood; that they fail at tasks requiring causal or analogical reasoning that humans pass easily; that the inductive biases that make them work (massive data, dense distributed representations) are very different from those of biological learning. Whether foundation models are bad models of cognition or partial models - or useful tools whose function as models of cognition is incidental - varies by which critic is asked.
Editorial note. Each of these critiques represents a substantive research programme, not a fringe view. The book covers them where they are most fully developed: the Causality chapter for the Pearl-aligned critique, the Robotics and Multimodal Models chapters for the embodiment story, the Mechanistic Interpretability chapter for the symbolic-vs-neural mechanism question. A reader who finds any of these positions persuasive can pursue them through the corresponding chapters and references; the foundation-model regime described elsewhere in this book is a path forward in AI, not the only path.
§13. Further Reading
An opinionated list of starting points. Each entry is annotated with what it adds beyond this chapter.
The foundation-model regime itself
Bommasani et al. (2021), “On the Opportunities and Risks of Foundation Models.” The Stanford CRFM report that coined the term. Long and uneven in places, but the section structure makes it skimmable; chapters 1, 2, 4, and 5 are particularly worth reading for the regime’s framing.
Scaling laws
Kaplan et al. (2020), “Scaling Laws for Neural Language Models.” The original empirical scaling-law paper. §6 unpacks the methodology.
Hoffmann et al. (2022), “Training Compute-Optimal Large Language Models” (Chinchilla). The compute-optimal revision; the single best paper to read for scaling-law methodology.
Wei et al. (2022), “Emergent Abilities of Large Language Models.” The emergence claim.
Schaeffer et al. (2023), “Are Emergent Abilities of Large Language Models a Mirage?” The metric-artifact critique. Read both Wei and Schaeffer back to back.
Cross-modality foundation models
Jumper et al. (2021), “Highly accurate protein structure prediction with AlphaFold.” The defining domain-specific FM result; structurally similar pretrain-and-deploy story to text FMs but in a very different domain.
Radford et al. (2021), “CLIP.” Multimodal pretraining at scale via contrastive image-text learning.
Kirillov et al. (2023), “Segment Anything” (SAM). Foundation-model-style prompted segmentation.
Reed et al. (2022), “A Generalist Agent” (Gato). Early architectural unification of modalities.
Pretraining engineering
Gao et al. (2020), “The Pile.” A foundational document for pretraining-corpus design.
Lee et al. (2022), “Deduplicating Training Data Makes Language Models Better.” Empirical demonstration of the value of deduplication.
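Shumailov et al. (2024), “AI Models Collapse When Trained on Recursively Generated Data” (the journal version of the 2023 preprint “The Curse of Recursion: Training on Generated Data Makes Models Forget”). Model collapse; OP-FM-2.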
Critiques
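Bender et al. (2021), “On the Dangers of Stochastic Parrots.” One of the most influential critiques of large language models, focused on environmental cost, training-data documentation, and the gap between fluent output and understanding. It sits alongside, rather than within, the five positions surveyed in §12.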
Marcus and Davis (2019), “Rebooting AI”; subsequent essays by Marcus. The neurosymbolic position; §12.
Pearl (2018), “Theoretical Impediments to Machine Learning with Seven Sparks from the Causal Revolution.” A short essay introducing the causal critique; the full development is in Pearl’s books.
LeCun (2022 position paper on JEPA, and subsequent writings). The world-models alternative.
Adaptation algorithms
Hu et al. (2021), “LoRA.” The dominant PEFT method.
Christiano et al. (2017), “Deep Reinforcement Learning from Human Preferences.” Foundational RLHF paper.
Rafailov et al. (2023), “DPO.” The pivotal preference-tuning paper of recent years.
Bai et al. (2022), “Constitutional AI.” Standard reference for RLAIF and constitutional approaches.
Surveys
Any recent (within ~12 months of reading) arXiv survey covering foundation models or LLMs. These age quickly but are useful for current state.
The CRFM report (Bommasani et al., above) is the standard regime-spanning survey; it is now several years old.
Reading order
For a reader new to foundation models, we suggest the following order: start with the Bommasani report’s introduction; then read the Kaplan and Chinchilla papers; then GPT-3 (Brown et al., 2020) and AlphaFold-2 (Jumper et al., 2021) as the two canonical FM successes outside vision; then DPO (Rafailov et al., 2023) for a clean technical paper; then one critique (Bender et al., 2021, or Pearl, 2018) for the contrasting view. After that, the chapter-specific reading lists in LLMs, Multimodal Models, and the other dedicated chapters give the depth.
§14. Exercises and Experiments
Five research-style exercises. As in the LLM chapter §16, each declares its intent (demonstration - deterministic and fast - or exploration - open-ended, possibly expensive), gives the setup and tasks, and identifies the expected takeaway. Notebooks planned in notebooks/foundation-models/.
E1. Reproduce a small scaling-law fit.
Intent: exploration; substantial compute.
Setup. Pretrain a series of small transformer language models - say, 1M, 10M, 100M, and 1B parameters - each on the same dataset, each trained to convergence. Use a publicly available small corpus (a FineWeb subset, OpenWebText, or domain-specific).
Tasks.
Measure converged validation loss for each scale.
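Plot $L$ vs $N$ on log-log axes, where $L$ is the converged validation loss and $N$ is the parameter count.
Fit a power law; extract the exponent $\alpha$.
Compare to Kaplan et al. (2020) and Hoffmann et al. (2022) reported values.
Repeat for $L$ vs $D$ (sweeping dataset size $D$ at fixed $N$) if compute permits.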
Takeaway. First-hand experience of the empirical workflow described in §6: model-sweeping, convergence criteria, fit quality. Notice how much practical difficulty hides behind the published smooth curves.
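The fitting step of E1 can be sketched before any model is trained. The example below fits a pure power law, loss = c * N^(-alpha), by ordinary least squares in log-log space. The data are hypothetical and noise-free, generated from c = 20 and alpha = 0.08 purely so the fit can be checked; they are not measurements. The pure power-law form is also an assumption: it omits the irreducible-loss offset that compute-optimal fits such as Hoffmann et al. (2022) include, and fitting that form needs a nonlinear optimizer.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**(-alpha) in log-log space.

    Returns (c, alpha). Requires at least two distinct positive x values
    and positive y values.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx                      # slope of the log-log line
    intercept = mean_y - slope * mean_x    # log of the coefficient c
    return math.exp(intercept), -slope


# Hypothetical sweep: parameter counts from E1, synthetic losses (not measured).
param_counts = [1e6, 1e7, 1e8, 1e9]
val_losses = [20.0 * n ** -0.08 for n in param_counts]

c, alpha = fit_power_law(param_counts, val_losses)
print(f"c = {c:.3f}, alpha = {alpha:.4f}")
```

With real runs the four points will not sit on a line, and with only four scales the exponent is weakly constrained; reporting the residuals alongside the fitted exponent makes that visible.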
E2. Probe ICL sensitivity to prompt format.
Intent: exploration.
Setup. Pick an open-weights model in the 7B–30B range. Choose a simple few-shot task: sentiment classification, binary entailment, or a synthetic pattern-matching task with known ground truth.
Tasks.
Build a canonical 4-example few-shot prompt.
Generate variations by changing surface features only - separators (newline vs colon vs comma), capitalization, demonstration order (random permutations), prefix/suffix wording - while preserving semantic content.
Evaluate each variation on a held-out test set and record accuracy.
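Tabulate the spread (minimum, maximum, and standard deviation of accuracy across variations); it should be substantial.
Takeaway. A direct empirical encounter with the ICL format sensitivity developed in LLM §7, which any theory of ICL (OP-FM-14) has to explain.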
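The variation-generation task in E2 is mostly bookkeeping, and a small generator keeps it reproducible. The sketch below is one possible setup: the demonstrations, the field names "Review" and "Sentiment", and the particular separators are all hypothetical choices, and the model call and scoring are left out.

```python
import itertools
import random

# Hypothetical 4-example demonstration set for sentiment classification.
DEMOS = [
    ("great film", "positive"),
    ("dull plot", "negative"),
    ("loved it", "positive"),
    ("waste of time", "negative"),
]


def render(demos, query, field_sep, item_sep, upper_labels):
    """Render one few-shot prompt; only surface features vary, not content."""
    blocks = []
    for text, label in demos:
        shown = label.upper() if upper_labels else label
        blocks.append(f"Review{field_sep}{text}{item_sep}Sentiment{field_sep}{shown}")
    # The final block leaves the label empty for the model to complete.
    blocks.append(f"Review{field_sep}{query}{item_sep}Sentiment{field_sep}")
    return "\n\n".join(blocks)


def variations(demos, query, n_orders=3, seed=0):
    """Yield prompts over demonstration orders x separators x capitalization."""
    rng = random.Random(seed)          # fixed seed so the sweep is repeatable
    orders = [list(demos)]
    for _ in range(n_orders - 1):
        shuffled = list(demos)
        rng.shuffle(shuffled)
        orders.append(shuffled)
    grid = itertools.product(orders, (": ", " - "), ("\n", ", "), (False, True))
    for order, field_sep, item_sep, upper in grid:
        yield render(order, query, field_sep, item_sep, upper)


prompts = list(variations(DEMOS, "fine acting"))
print(len(prompts))  # 24  (3 orders x 2 field separators x 2 item separators x 2 cases)
print(prompts[0])
```

Random shuffles can repeat an order, so the 24 prompts are not guaranteed to be distinct; deduplicate before evaluation if the spread statistics should weight each format once.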
E3. Compare PEFT (LoRA) vs full fine-tuning.
Intent: exploration.
Setup. Pick a small base model (a 1B-class open-weights LLM is comfortable on a single GPU). Pick a focused downstream task: domain-specific classification, summarization on a niche corpus, or instruction-following on a specific style.
Tasks.
Train two variants of the model: one via full fine-tuning of all parameters; one via LoRA on the attention layers (typical rank: 8 or 16).
Measure: trainable parameter count, training time, peak GPU memory, downstream task accuracy.
Evaluate also for catastrophic forgetting: how does each variant perform on the base model’s original capabilities (a held-out general-knowledge or fluency benchmark)?
Tabulate both: (task improvement, base-capability degradation).
Takeaway. Quantitative encounter with the cost-quality-stability tradeoffs developed in §7’s adaptation spectrum.
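The trainable-parameter gap in E3 can be estimated before training anything. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in, so the trainable count per adapted matrix is r * (d_in + d_out). The sketch below applies that count to a hypothetical 1B-class configuration (hidden size 2048, 16 layers, four square attention projections per layer); these dimensions are assumptions for illustration, not those of any specific model.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in a full weight matrix (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a rank-`rank` LoRA update: B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical model shape (assumed for illustration).
HIDDEN = 2048
LAYERS = 16
ADAPTED_PER_LAYER = 4  # e.g. query, key, value, and output projections

n_matrices = LAYERS * ADAPTED_PER_LAYER
full = n_matrices * dense_params(HIDDEN, HIDDEN)

for rank in (8, 16):
    lora = n_matrices * lora_params(HIDDEN, HIDDEN, rank)
    print(f"rank {rank:>2}: {lora:,} trainable vs {full:,} full ({100 * lora / full:.2f}%)")
# prints:
# rank  8: 2,097,152 trainable vs 268,435,456 full (0.78%)
# rank 16: 4,194,304 trainable vs 268,435,456 full (1.56%)
```

The percentages are relative to the adapted attention matrices only; measured against the whole model (MLP blocks, embeddings) the trainable fraction is smaller still. Peak memory does not fall by the same factor, since activations and the frozen weights still occupy GPU memory, which is why the exercise asks for it to be measured rather than computed.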
E4. Demonstrate emergence as metric artifact.
Intent: demonstration.
Setup. Pick a multi-step task whose accuracy can be evaluated at multiple granularities - multi-digit arithmetic is the canonical case. Pick a series of model scales (use an open scaling suite like Pythia, or fine-tune from a single base at several truncations).
Tasks.
Evaluate each model on the task using a 0/1 metric (exact-match correctness).
Evaluate each model on the same task using a continuous metric (per-token accuracy, cross-entropy of the correct answer, partial-credit score).
Plot both metrics as a function of scale.
Observe the difference: the 0/1 metric likely shows a sharp jump (emergence); the continuous metric likely shows a smooth curve.
Takeaway. Direct replication of the Schaeffer et al. (2023) critique on a model and task of your choice. The result motivates careful metric design when reporting capability-as-a-function-of-scale.
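The two metric types in E4 differ only in how they score the same outputs. The sketch below defines an exact-match metric and a per-digit partial-credit metric and applies both to hypothetical predictions from three model scales. The predictions are invented to illustrate the scoring, not taken from any model, and the left-aligned digit comparison is a simplifying assumption (arithmetic answers may be better compared right-aligned).

```python
def exact_match(pred: str, gold: str) -> float:
    """0/1 metric: full credit only when the whole answer is correct."""
    return 1.0 if pred == gold else 0.0


def digit_accuracy(pred: str, gold: str) -> float:
    """Continuous metric: fraction of positions where pred and gold agree.

    Strings are compared left-aligned; the shorter one is padded with spaces
    so missing or extra digits count as errors.
    """
    width = max(len(pred), len(gold))
    if width == 0:
        return 1.0
    matches = sum(p == g for p, g in zip(pred.ljust(width), gold.ljust(width)))
    return matches / width


def mean_score(metric, preds, golds):
    return sum(metric(p, g) for p, g in zip(preds, golds)) / len(golds)


# Hypothetical answers to three multi-digit arithmetic problems.
GOLD = ["80235", "19980", "44100"]
PREDICTIONS = {
    "small": ["81111", "10000", "40000"],
    "medium": ["80231", "19970", "44100"],
    "large": ["80235", "19980", "44100"],
}

for name, preds in PREDICTIONS.items():
    em = mean_score(exact_match, preds, GOLD)
    da = mean_score(digit_accuracy, preds, GOLD)
    print(f"{name:>6}: exact match {em:.2f}, digit accuracy {da:.2f}")
# prints:
#  small: exact match 0.00, digit accuracy 0.40
# medium: exact match 0.33, digit accuracy 0.87
#  large: exact match 1.00, digit accuracy 1.00
```

In this toy data the small model already gets 40% of digits right while scoring zero on exact match, which is the pattern the exercise asks you to look for across real scales.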
E5. Contamination check.
Intent: exploration.
Setup. Pick a published benchmark (a coding benchmark like HumanEval, a knowledge benchmark like MMLU, or an older benchmark like SuperGLUE) and an open-weights model.
Tasks.
Sample a subset of the benchmark’s questions.
For each question, query the model with a partial prompt - the question stem, but with the answer obscured.
Compare the model’s generation against the benchmark’s known answers. If the model emits verbatim-or-near-verbatim benchmark answers given only the stem, that is evidence of training-data contamination.
Try also querying with deliberately corrupted question stems (typos, paraphrases) and see whether the model still produces benchmark-specific answers.
Tabulate the contamination rate.
Takeaway. Hands-on experience with the contamination problem developed in §9. Replicates a small piece of what serious contamination audits look like.
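The comparison and tabulation steps of E5 can be sketched with the standard library. The example below scores near-verbatim recall with difflib.SequenceMatcher and reports the fraction of items above a threshold. The generate argument stands in for whatever call queries your model; stub_generate, the 0.9 threshold, and the toy items are all hypothetical and exist only so the sketch runs end to end.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from typing import Callable, Sequence


def near_verbatim(generated: str, reference: str, threshold: float = 0.9) -> bool:
    """True when the generation is a close character-level match to the reference."""
    a = generated.strip().lower()
    b = reference.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def contamination_rate(
    items: Sequence[tuple[str, str]],
    generate: Callable[[str], str],
    threshold: float = 0.9,
) -> float:
    """Fraction of (stem, answer) items whose answer the model reproduces from the stem."""
    if not items:
        raise ValueError("items must be non-empty")
    hits = sum(near_verbatim(generate(stem), answer, threshold) for stem, answer in items)
    return hits / len(items)


# Toy benchmark items and a stub "model" that has memorized one of them.
ITEMS = [
    ("Complete: def add(a, b):", "return a + b"),
    ("Capital of France?", "Paris"),
    ("2 + 2 =", "4"),
    ("Largest planet?", "Jupiter"),
]
MEMORIZED = {"Capital of France?": "Paris"}


def stub_generate(stem: str) -> str:
    # Placeholder for a real model call (assumption for illustration).
    return MEMORIZED.get(stem, "unknown")


print(contamination_rate(ITEMS, stub_generate))  # 0.25
```

Short answers such as a single digit or a city name can match by chance or by genuine knowledge, so the rate is most informative on items with long, idiosyncratic answers (canonical code solutions, verbatim question text). The corrupted-stem task helps separate the two cases: recall that survives paraphrase looks more like knowledge, while recall that needs the exact stem looks more like memorization.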
Notebook implementations are planned (see notebooks/foundation-models/E1_scaling.ipynb, etc.) once the project’s interactive layer is decided.