Generative Models

The chapter is structured to interface cleanly with several other chapters: the autoregressive material in §3 is the same machinery as LLM §3 (decoder-only Transformers) viewed through the generative-model lens; diffusion in §6 is the same machinery as SSL §4 (denoising as SSL) viewed through the modelling lens; flow-matching in §7 is the practical successor to diffusion that has displaced it in many applications.


Scope and What This Chapter Is About

The chapter develops generative models - probabilistic models that learn to generate samples from a target data distribution. We cover the modelling principle (likelihood, density, score, ODE-flow), the six dominant families (autoregressive, VAE, normalizing flows, GANs, diffusion, flow-matching), the practical engineering of training and sampling, the modality-specific instantiations (images, audio, video, 3D, molecules), and the unifying theoretical perspective that has emerged in the 2020s (diffusion as score-matching; flow-matching as a generalization of both). Open problems are flagged inline and consolidated in §13.


§1. Motivation and Scope

A worked-example anchor

Concretely, what does a generative model do? Three instances from 2026 practice, drawn to span modalities and method families:

  1. Stable Diffusion XL produces a high-fidelity image from a text prompt like “a cat sitting on a windowsill at sunset, oil painting style.” Under the hood: a text encoder turns the prompt into a conditioning vector; a U-Net or DiT model iteratively denoises a random initial latent over 20–50 steps; a decoder maps the final latent to pixels. The output is a $1024 \times 1024$ image that - given a good prompt - looks plausible enough to fool most casual viewers.

  2. AlphaFold 2 predicts a protein’s 3D structure from its amino-acid sequence. The model is not “generative” in the artistic-image sense but is a generative model of structures conditional on sequence - sampling a 3D conformation from the predicted distribution. The output is atomic coordinates that match experimental structures to within a few angstroms on a substantial fraction of proteins.

  3. GPT-style LMs generate text autoregressively, one token at a time, sampling from $p(x_t \mid x_{<t})$. This is the third major generative-model family - autoregressive - and is responsible for the dominant production-grade text generation of 2026.

All three are generative models. They differ in modality (image, structure, text), architecture (U-Net or DiT, equivariant Transformer, decoder-only Transformer), and training objective (denoising score-matching, distogram + structure prediction, next-token cross-entropy). The unifying feature: each represents and samples from a probability distribution $p(x)$ (or $p(x \mid c)$ for conditional models) over a structured output space.

What generative modelling is, formally

A generative model is a probabilistic model of a data distribution $p_{\text{data}}(x)$, where $x$ is a structured object - a vector, image, sentence, molecule, video - drawn from some space $\mathcal{X}$. The model has parameters $\theta$ and defines a distribution $p_\theta(x)$ that approximates $p_{\text{data}}$. The model is useful if we can do at least one of the following with it:

  • Sample. Produce a new $x \sim p_\theta(x)$. This is the most visible capability - the ability to generate new images, sentences, molecules.

  • Evaluate density. Compute $p_\theta(x)$ for a given $x$. Useful for anomaly detection, compression, model comparison.

  • Approximate inference. Compute or sample from posteriors $p_\theta(z \mid x)$ where $z$ is some latent variable. Useful for representation learning and structured inference.

The six families covered in this chapter differ in which subset of these they support and how they support it. Autoregressive models (§3) support sampling and exact-density-evaluation but not exact-posterior-inference. VAEs (§4) support sampling and approximate-density-evaluation and approximate-posterior-inference. Flows (§5) support all three exactly. GANs (§5) support only sampling. Diffusion (§6) supports sampling and approximate-density-evaluation. Flow-matching (§7) supports sampling with a clean continuous-time framework that subsumes much of diffusion.

The differences matter. Which capabilities matter for an application determines which family is appropriate.

What generative modelling is, intuitively

The intuitive picture in one sentence: a generative model learns the shape of the data distribution well enough to produce new samples that look like they came from the same distribution.

Two illustrations.

First, the 1D case. Suppose $x \in \mathbb{R}$ and the true distribution is a mixture of two Gaussians, $p_{\text{data}}(x) = 0.5 \cdot \mathcal{N}(x; -2, 1) + 0.5 \cdot \mathcal{N}(x; +2, 1)$. A generative model trained on samples from this distribution should produce a $p_\theta$ that captures the two-mode structure: samples from $p_\theta$ should themselves cluster around $-2$ and $+2$. A bad model (a single Gaussian centered at $0$) misses the structure entirely - it would never generate values near $-2$ or $+2$ with the right relative frequency.
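A minimal NumPy sketch of this 1D case, under stated assumptions: the sample size, the seed, and the choice to fit the "bad" single Gaussian by matching the sample mean and standard deviation are illustrative and not taken from the chapter. It compares the average log-likelihood of the true mixture against the single-Gaussian fit, and the fraction of samples each places near the centre.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def sample_mixture(n, rng):
    """Draw n samples from 0.5*N(-2, 1) + 0.5*N(+2, 1)."""
    centers = rng.choice([-2.0, 2.0], size=n)  # pick one component per sample
    return centers + rng.standard_normal(n)

def log_normal_pdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def mixture_log_pdf(x):
    """Log-density of the two-component mixture."""
    return np.logaddexp(log_normal_pdf(x, -2.0, 1.0),
                        log_normal_pdf(x, 2.0, 1.0)) + np.log(0.5)

data = sample_mixture(20_000, rng)

# "Bad" model: one Gaussian with the sample mean and standard deviation.
mu, sigma = data.mean(), data.std()

avg_ll_mixture = mixture_log_pdf(data).mean()
avg_ll_single = log_normal_pdf(data, mu, sigma).mean()

# The single Gaussian puts too much mass between the modes.
single_samples = mu + sigma * rng.standard_normal(20_000)
frac_center_data = np.mean(np.abs(data) < 0.5)
frac_center_single = np.mean(np.abs(single_samples) < 0.5)

print(f"avg log-lik  mixture={avg_ll_mixture:.3f}  single={avg_ll_single:.3f}")
print(f"P(|x|<0.5)   data={frac_center_data:.3f}  single={frac_center_single:.3f}")
```

The single Gaussian matches the mean and variance of the data but not its shape: it scores a lower average log-likelihood and generates values near $0$ far more often than the data does.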

Second, the image case. Suppose $x$ is a $28 \times 28$ pixel image of a handwritten digit. The set of “plausible digit images” is a tiny, structured subset of the $\mathbb{R}^{784}$ pixel space - the data manifold. A generative model trained on MNIST learns the geometry of this manifold and can produce new images that lie on it. A bad model (random pixel noise) produces images that are not on the manifold and look like noise. A good model produces images that are on the manifold and look like digits - but are not in the training set.

The “lies on the data manifold” intuition is consistent across families and modalities. Audio: the model produces plausible sound waveforms or spectrograms. Video: the model produces plausible space-time-coherent image sequences. Molecules: the model produces plausible chemical-valid graphs. The mathematical formalism (likelihood, score, ODE-flow) differs across families; the underlying goal - capture the structure of the data manifold and sample from it - is shared.

Why generative models matter in 2026

Three motivations a 2026 researcher should care about.

1. Generative models are deployed at massive scale. Image-generation services (Midjourney, DALL·E, Stable Diffusion, Imagen, Firefly), video-generation services (Sora, Veo, Runway), audio-generation services (Suno, Udio, ElevenLabs), and text-generation services (ChatGPT, Claude, Gemini) are used by tens of millions of users daily. The aggregate economic, cultural, and creative impact of generative AI is substantial enough that no research-oriented practitioner can ignore the technical substrate.

2. Generative models are scientific instruments. AlphaFold 2 (and its successor, AlphaFold 3) transformed protein structure prediction; AlphaProof, GNoME, and other scientific generative models extend the recipe to mathematics, materials, and chemistry. The Foundation Models chapter (FM §8) catalogues these in more detail; the generative-modelling techniques that underlie them are this chapter’s content.

3. Generative components are everywhere in the modern stack. A vision-language model includes a generative text decoder. An LLM’s reasoning chain is generative text. A recommendation system’s response is increasingly generative. A robot policy can be framed as conditional generation (action sequence given observation). The generative-modelling machinery in this chapter appears, directly or indirectly, in essentially every modern Foundation Model.

Density modelling vs sample generation

A specific distinction worth flagging because it confuses many readers entering the field.

Density modelling is the problem of learning the function $x \mapsto p_\theta(x)$ - assigning a probability to any input. It is a modelling task. It serves anomaly detection (low-probability inputs are anomalous), compression (high-probability inputs need fewer bits), and model comparison (which $p_\theta$ better explains a dataset?).

Sample generation is the problem of producing new $x \sim p_\theta(x)$. It is a sampling task. It serves content creation, simulation, scientific exploration.

The two are related but not identical. A model can be excellent at density modelling and poor at sample generation (an autoregressive language model with good perplexity but slow sampling). A model can be excellent at sample generation and unable to evaluate density (a GAN). A model can be poor at both. The choice between families often comes down to which of the two tasks matters for the application.
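To make the density-modelling uses concrete, the sketch below shows the standard link between probability and code length: under a model, a symbol with probability $p$ ideally costs $-\log_2 p$ bits, and the same quantity serves as a simple anomaly score. The four-symbol distribution `p_model` is a hypothetical example, not something defined in the chapter.

```python
import numpy as np

# Hypothetical model distribution over four symbols (illustrative only).
p_model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def code_length_bits(symbol):
    """Ideal code length, in bits, of a symbol under the model: -log2 p(symbol)."""
    return float(-np.log2(p_model[symbol]))

bits = {s: code_length_bits(s) for s in p_model}

# Expected bits per symbol when the data really follows p_model (its entropy).
expected_bits = sum(p_model[s] * bits[s] for s in p_model)

print(bits)           # high-probability symbols get shorter codes
print(expected_bits)  # average cost per symbol
```

A GAN cannot be used this way because it never exposes $p_\theta(x)$; an autoregressive model can, by summing the per-token values of $-\log_2 p_\theta(x_t \mid x_{<t})$.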

In contemporary practice, sample generation gets most of the public attention while density modelling gets most of the rigorous evaluation. We develop both throughout the chapter.

Boundaries with adjacent chapters

This chapter sits in the middle of a constellation of related chapters; the boundaries:

  • Self-Supervised Learning §4 treats denoising as a self-supervised pretraining objective. The same denoising machinery powers diffusion models (this chapter’s §6). The chapters develop the same mechanism through two different lenses: SSL views denoising as a way to learn representations; Generative Models views it as a way to sample. Many modern systems use the same trained model for both.

  • Large Language Models §3 and §6 develop autoregressive Transformer LMs and their inference-time decoding strategies. This chapter’s §3 develops the autoregressive generative-modelling framework - what it is, how it relates to other families, where it is and is not appropriate. The LM chapter is the LM-specific instantiation.

  • Foundation Models is the spine chapter. Generative models are one of the two dominant FM substrates (the other being discriminative SSL models). FM §3 discusses generative models as a family at a high level; this chapter develops the mechanism.

  • Deep Learning provides the architectures (U-Net, Transformer, DiT, equivariant nets) that generative models use. The chapter assumes the architectural material rather than developing it.

  • Multimodal Models (planned) develops cross-modal generation; this chapter develops the within-modality generative-modelling techniques that cross-modal systems use.

  • AI for Science (planned) develops scientific generative models in domain depth (protein design, materials, mathematics, climate); this chapter develops the modelling techniques.

What this chapter does not try to do

A research-oriented chapter must be honest about scope:

  • We do not develop application-specific best practices in depth. Diffusion models for high-resolution image synthesis, for example, have a substantial sub-literature on samplers, schedulers, model-conditioning, and prompt engineering that this chapter sketches but does not exhaustively cover.

  • We do not derive most results in full mathematical detail. Diffusion’s connection to stochastic differential equations, for instance, is sketched here and developed properly in the references.

  • We do not treat evaluation as a separate engineering discipline beyond §10. The Evaluation chapter (planned) will develop the cross-cutting issues.

  • We do not survey all generative-modelling techniques. The space is large; we develop the six dominant families and refer the reader to surveys for niche techniques (energy-based models, score-based models with non-standard noise schedules, etc.).

  • We do not treat the legal, copyright, and societal questions about generative AI substantively. They are real and important; they live in the Alignment/Ethics chapter.

Position taken in this chapter

The chapter is organized by mathematical family rather than by modality or by historical era. The structural argument: a researcher who understands autoregressive, VAE, flow, GAN, diffusion, and flow-matching as mathematical families can quickly understand any specific application of generative models, because the modality-specific details are mostly architecture and conditioning choices layered on top of the core family. Conversely, organizing by modality (image generation, video generation, text generation) would repeat the same core mechanisms many times.

The chapter is also organized to reflect 2026 practice. Diffusion (§6) and flow-matching (§7) get the most detailed treatment because they are dominant in 2026 production systems for non-text modalities. Autoregressive (§3) gets a tight treatment because LLM §3 already develops it. GANs (§5) get a brief treatment because they are largely historical for new system design (though still common in some niches). VAEs (§4) get a brief treatment because they are now most often used as components (e.g., the latent encoder in latent diffusion) rather than as standalone generators.


§2. Historical Context

This section traces the field from early statistical generative modelling through the modern deep-learning era. The chapter’s substance is the current mechanics; the history is here because the conceptual moves that produced the modern toolkit are themselves illuminating.

A timeline of the inflection points:

   ~1980s        Boltzmann machines (Hinton, Sejnowski 1985);
                  early energy-based generative models
                                  │
                                  ▼
   ~1990s        Mixture models, Hidden Markov Models;
                  generative modelling as a statistics topic
                                  │
                                  ▼
   2006-2010     Restricted Boltzmann Machines; Deep Belief
                  Networks (Hinton et al. 2006);
                  deep generative models begin to be feasible
                                  │
                                  ▼
   2013          VARIATIONAL AUTOENCODERS: Kingma & Welling
                  "Auto-Encoding Variational Bayes"
                                  │
                                  ▼
   2014          GANs: Goodfellow et al. "Generative
                  Adversarial Nets" - adversarial training
                  becomes the dominant image-generation paradigm
                                  │
                                  ▼
   2015          NORMALIZING FLOWS: Rezende & Mohamed
                  "Variational Inference with Normalizing Flows";
                  Sohl-Dickstein et al. "Deep Unsupervised
                  Learning using Nonequilibrium Thermodynamics"
                  (the first diffusion-model paper, largely
                  overlooked for five years)
                                  │
                                  ▼
   2016          PixelRNN (van den Oord et al.); WaveNet
                  (van den Oord et al.); GPT-1 era preludes;
                  autoregressive generation gains traction
                                  │
                                  ▼
   2017-2018     GAN era: DCGAN, Wasserstein GAN, Progressive
                  GANs, BigGAN, StyleGAN - high-resolution
                  image synthesis as the GAN benchmark.
                  Glow, MAF, IAF for flow models.
                                  │
                                  ▼
   2018-2019     GPT-2 (Radford et al. 2019) demonstrates
                  large-scale autoregressive text generation;
                  AlphaFold 1 (Senior et al. 2020 - work
                  finished 2018) shows generative modelling
                  for protein structure
                                  │
                                  ▼
   2020          DDPM: Ho, Jain, Abbeel "Denoising Diffusion
                  Probabilistic Models" - the modern diffusion
                  recipe; Score-based generative modelling
                  (Song & Ermon 2019 antecedent, 2020 unified
                  framework)
                                  │
                                  ▼
   2021          DDIM (Song, Meng, Ermon); CLIP (Radford et al.);
                  classifier-free guidance (Ho & Salimans);
                  GLIDE (Nichol et al.) shows text-to-image
                  diffusion at scale; AlphaFold 2 (Jumper et al.)
                                  │
                                  ▼
   2022          DIFFUSION TAKES OVER: DALL-E 2 (Ramesh et al.);
                  Imagen (Saharia et al.); STABLE DIFFUSION
                  (Rombach et al., latent diffusion) - open
                  weights at consumer-GPU scale.
                  Flow-matching (Lipman et al.) introduced.
                                  │
                                  ▼
   2023          Flow-matching and rectified flow (Liu et al.)
                  consolidate the post-diffusion framework.
                  Video diffusion gains traction. DiT
                  (Peebles & Xie) replaces U-Net with
                  Transformer backbone.
                                  │
                                  ▼
   2024-2026     SORA (OpenAI, early 2024): text-to-video
                  diffusion at minute-length scale. Veo (Google);
                  Gen-3 (Runway). Stable Diffusion 3, Flux:
                  flow-matching-based commercial text-to-image.
                  AlphaFold 3 (DeepMind, 2024). Generative
                  models pervasive across science and creative
                  applications.

We develop each phase below.

Early statistical generative modelling

Generative modelling as a research topic substantially predates deep learning. Mixture models (especially Gaussian mixture models) for unsupervised density estimation, Hidden Markov Models for sequence modelling, Bayesian networks as structured generative models - these were the standard pre-deep-learning toolkit. The Bayesian-networks chapter (planned, originally AIMA Ch 13–14) develops this material; we proceed from the deep-learning era.

Boltzmann machines (Hinton and Sejnowski, 1985) and especially Restricted Boltzmann Machines were among the first energy-based deep generative models. Deep Belief Networks (Hinton et al., 2006) - stacked RBMs trained layer-by-layer - were briefly the leading deep generative approach in the late 2000s and helped trigger the broader deep-learning revival. They are now largely of historical interest; the energy-based-model lineage continues but is no longer dominant.

The 2013–2014 inflection: VAEs and GANs

Two papers within twelve months transformed deep generative modelling.

Variational Autoencoders (VAEs), introduced by Kingma and Welling (2013) “Auto-Encoding Variational Bayes,” combined a probabilistic encoder-decoder framework with the reparameterization trick - a technical move that makes the gradient of an expectation over latent variables backpropagatable. The result: a deep generative model with a well-defined log-likelihood lower bound (the ELBO), a meaningful latent representation, and tractable training.

Generative Adversarial Networks (GANs), introduced by Goodfellow et al. (2014) “Generative Adversarial Nets,” took an entirely different approach. Train two networks: a generator that maps random noise to samples, and a discriminator that tries to distinguish generated samples from real data. The two networks play a minimax game: the generator improves to fool the discriminator; the discriminator improves to detect generated samples. At equilibrium, the generator produces samples indistinguishable from the data.

GANs and VAEs differ structurally. GANs make no attempt to model the data density - they only learn to produce plausible samples. VAEs learn an explicit (but lower-bounded) density and a latent representation. Their relative strengths and weaknesses defined a multi-year debate: GANs produced sharper images, VAEs produced more meaningful latents. Both became major research areas.

The normalizing-flows line and the missed diffusion paper

In 2015, two papers appeared. Rezende and Mohamed (2015) “Variational Inference with Normalizing Flows” introduced normalizing flows - invertible neural-network transformations with a tractable Jacobian determinant, allowing exact log-likelihood computation and exact sampling. Flows became a third major family.

Less prominent at the time: Sohl-Dickstein et al. (2015) “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” introduced what is now called diffusion modelling. The paper proposed a procedure that progressively corrupted data with noise and learned to reverse the process; samples could be drawn by starting from pure noise and iteratively denoising. The paper was thorough and the framework rigorous, but the empirical results at the time did not stand out, and the line of work was largely overlooked for five years.

The same period saw autoregressive image and audio models mature: PixelRNN/PixelCNN (van den Oord et al., 2016) for images, WaveNet (van den Oord et al., 2016) for audio. These showed that autoregressive generation - long established for text - also worked for high-dimensional sensory modalities.

The GAN era: 2016–2020

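For most of 2016–2020, GANs were the dominant deep generative paradigm for images. The progression of architectures and training techniques was rapid, running from DCGAN and Wasserstein GAN through Progressive GANs, BigGAN, and StyleGAN, with high-resolution image synthesis as the benchmark.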

The GAN era produced striking images. It also produced two persistent difficulties: mode collapse (the generator focuses on a few easy-to-fool modes and ignores the rest of the data distribution) and training instability (the minimax optimization is delicate and often fails to converge). Many engineering techniques addressed these difficulties but none fully solved them.

Autoregressive scaling and AlphaFold

In parallel, two other generative-model threads were quietly transforming the field. GPT-2 (Radford et al., 2019) - a large-scale autoregressive Transformer trained on web text - demonstrated that scaling autoregressive text generation produced qualitatively new capabilities. This thread eventually became GPT-3, ChatGPT, and the LLM-dominated era of 2022 onward (LLM chapter).

AlphaFold 1 (Senior et al., 2020) for protein structure prediction was a different but related generative result: a deep network produced plausible 3D protein structures from amino-acid sequences. While not “generative” in the same sense as image-synthesis GANs, AlphaFold demonstrated that deep generative modelling could solve a substantive scientific problem.

Diffusion takes over: 2020–2022

The paper that started the modern diffusion era: Ho, Jain, Abbeel (2020) “Denoising Diffusion Probabilistic Models” (DDPM). DDPM simplified and operationalized the Sohl-Dickstein et al. (2015) framework into a recipe that worked at scale. Three innovations:

  • A specific noise schedule and parameterization that made training stable.

  • A reweighted training objective (the simplified MSE on the noise prediction) that worked much better than the variational lower bound.

  • Concrete sampling procedures that produced high-quality images.

DDPM image samples were immediately competitive with GANs and quickly surpassed them. Song and Ermon (2019, 2020) had developed a parallel score-based generative-modelling framework that turned out to be mathematically equivalent to diffusion (Song et al., 2021); the unified view became standard.

The years 2021–2022 saw the diffusion explosion. DDIM (Song, Meng, Ermon, 2021) gave a deterministic, much-faster sampler. Classifier-free guidance (Ho, Salimans, 2021) gave the dominant conditioning technique. GLIDE (Nichol et al., 2021) demonstrated text-to-image diffusion at scale. Then in 2022 the production-grade systems arrived: DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and especially Stable Diffusion (Rombach et al., 2022) - the latent diffusion model that ran on consumer GPUs and was released with open weights, putting text-to-image generation in the hands of millions of users.

By 2022, diffusion had replaced GANs as the dominant image-generation paradigm. The same machinery extended to audio (Riffusion, MusicLM successors), video (Imagen Video, then Sora), and 3D (DreamFusion, then several improvements). Diffusion-shaped recipes became the default starting point for new generative-modelling applications.

Flow-matching and rectified flow: the post-diffusion era

In late 2022 and into 2023, a new framework emerged that subsumes and extends diffusion. Flow-matching (Lipman, Chen, Ben-Hamu, Nickel, Le, 2022) “Flow Matching for Generative Modeling” gave a clean continuous-time formulation: train a vector field whose ODE-flow transports a simple distribution (e.g., Gaussian) into the data distribution. Rectified flow (Liu, Gong, Liu, 2022) “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow” gave a particular instantiation that produced near-straight ODE trajectories and therefore fast sampling.

The relationship to diffusion. Diffusion can be viewed as a particular noise-schedule choice within the flow-matching framework. Flow-matching is more general (does not require diffusion’s specific noise schedule), conceptually cleaner (continuous-time ODE flow rather than discrete-time stochastic process), and often produces better empirical results with fewer sampling steps. Many 2024–2026 generative systems (including Stable Diffusion 3, Flux, and several frontier text-to-video systems) use flow-matching or rectified-flow training rather than DDPM-style diffusion.

The post-diffusion era is, in 2026, still consolidating. Diffusion-trained models remain in widespread production use; flow-matching-trained models are increasingly dominant in new deployments. The relationship is technical-and-incremental rather than paradigm-shifting; we develop both in §6 and §7.

2024–2026: video, frontier scientific generation, and the production stack

The most recent inflections:

  • Sora (OpenAI, early 2024) demonstrated minute-length, coherent, high-quality text-to-video diffusion. Veo (Google) and Gen-3 (Runway) followed. Video generation became a real consumer product.

  • AlphaFold 3 (DeepMind, 2024) extended the protein-structure model to protein-ligand, protein-nucleic-acid, and protein-protein complexes - a scientifically substantial generalization.

  • Stable Diffusion 3 and Flux demonstrated flow-matching at production scale for text-to-image.

  • Generative models for scientific design (drugs, materials, catalysts) matured into a substantial sub-field.

Where this leaves us in 2026

The current state. Generative modelling is a mature subfield with six major mathematical families and many modality-specific applications. The dominant production paradigms are diffusion (still widely deployed) and flow-matching (increasingly dominant in new systems). Autoregressive models dominate text generation and parts of audio. VAEs and flows are mostly used as components rather than as standalone systems. GANs are largely historical for new system design.

The remaining sections of this chapter develop each family on its own terms. §3 covers autoregressive; §4 VAEs; §5 flows and GANs; §6 diffusion (the centerpiece); §7 flow-matching; §8 conditional generation; §9 modality-specific architectures; §10 evaluation; §11–§15 close out.

Editorial note. The diffusion-vs-flow-matching boundary is the most rapidly evolving part of the chapter and the most likely to date. The 2024–2026 period has seen continued consolidation; readers should treat the diffusion-vs-flow-matching material as a current snapshot rather than a settled state.


§3. Autoregressive Models

The factorization

The conceptual core of autoregressive generation. For any structured object $x = (x_1, x_2, \ldots, x_T)$ - a sequence of tokens, pixels, audio samples, mesh vertices - the joint probability factors exactly using the chain rule of probability:

$$p(x) = p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}).$$

Reading this. The joint probability of the entire sequence equals the product of conditional probabilities - each $x_t$’s probability given everything that came before. The factorization is mathematically exact; it makes no assumption about the data. It is simply the chain rule.

An autoregressive model parameterizes each of the conditional distributions $p_\theta(x_t \mid x_{<t})$ with a learnable function (typically a neural network with shared parameters across time steps). Generation works one token at a time: sample $x_1 \sim p_\theta(x_1)$, then $x_2 \sim p_\theta(x_2 \mid x_1)$, then $x_3 \sim p_\theta(x_3 \mid x_1, x_2)$, and so on.
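The exactness of the chain-rule factorization can be checked numerically on a small case. The sketch below assumes an arbitrary random joint distribution over three binary variables (the size, the seed, and the helper names `conditional` and `sample` are illustrative). It derives each conditional by marginalization, confirms that the product of conditionals reproduces the joint, and then samples one variable at a time.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
T = 3

# An arbitrary joint distribution over {0,1}^3, stored as a 2x2x2 table.
joint = rng.random((2,) * T)
joint /= joint.sum()

def conditional(joint, prefix):
    """Return p(x_t = . | x_<t = prefix) as a length-2 array, where t = len(prefix)."""
    t = len(prefix)
    later_axes = tuple(range(t + 1, joint.ndim))
    # Marginalize out the variables that come after position t.
    marg = joint.sum(axis=later_axes) if later_axes else joint
    row = marg[tuple(prefix)]  # unnormalized p(x_<t = prefix, x_t = .)
    return row / row.sum()

# Chain rule: the product of conditionals equals the joint for every x.
max_err = 0.0
for x in itertools.product([0, 1], repeat=T):
    prod = 1.0
    for t in range(T):
        prod *= conditional(joint, x[:t])[x[t]]
    max_err = max(max_err, abs(prod - joint[x]))

def sample(joint, rng):
    """Ancestral sampling: draw x_1, then x_2 given x_1, and so on."""
    x = []
    for _ in range(joint.ndim):
        probs = conditional(joint, x)
        x.append(int(rng.choice(2, p=probs)))
    return tuple(x)

print("max |product of conditionals - joint| =", max_err)
print("one sample:", sample(joint, rng))
```

Here the conditionals are read off a known table; an autoregressive model replaces that table lookup with a learned function, but the sampling loop has the same shape.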

A worked example: a tiny autoregressive image model

To anchor the abstraction. Suppose we want to model $4 \times 4$ binary images, $x \in \{0, 1\}^{16}$. Order the pixels in row-major order: $x_1 =$ top-left, $x_2 =$ one to its right, ..., $x_{16} =$ bottom-right.

The factorization:

   x_1 → x_2 → x_3 → x_4
                       │
                       ▼
   x_5 → x_6 → x_7 → x_8
                       │
                       ▼
   x_9 → x_10 → x_11 → x_12
                          │
                          ▼
   x_13 → x_14 → x_15 → x_16

For each pixel $x_t$, the model predicts $p_\theta(x_t = 1 \mid x_1, \ldots, x_{t-1})$ - a single probability between 0 and 1 since $x_t$ is binary.

Training: for each image in the dataset, compute the loss

$$\mathcal{L}(x) = -\log p_\theta(x) = -\sum_{t=1}^{16} \log p_\theta(x_t \mid x_{<t}).$$

Sampling: choose $x_1 \sim p_\theta(x_1)$, then $x_2$ conditional on $x_1$, ..., 16 sequential sampling steps total.

For real images at resolution $256 \times 256 \times 3 = 196{,}608$ values, the same recipe applies - but the sequential generation becomes prohibitively slow. This is the central trade-off of autoregressive models: exact likelihood at the cost of sequential generation.
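One way to instantiate the $4 \times 4$ example is sketched below. The parameterization is an assumption made for illustration, not the chapter's prescribed model: each conditional is a logistic regression on the earlier pixels, enforced by a strictly lower-triangular weight matrix, and the weights are random rather than trained. The sketch computes the exact log-likelihood, samples in 16 sequential steps, and checks that the probabilities of all $2^{16}$ images sum to one.

```python
import numpy as np

D = 16  # 4x4 binary image, pixels in row-major order
rng = np.random.default_rng(0)

# Strictly lower-triangular weights: pixel t only sees pixels before it.
# Random, untrained parameters (illustrative assumption).
W = np.tril(rng.normal(scale=0.5, size=(D, D)), k=-1)
b = rng.normal(scale=0.5, size=D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_prob(x):
    """log p(x) = sum_t log p(x_t | x_<t) for x of shape (..., 16) with 0/1 entries."""
    x = np.asarray(x, dtype=float)
    p1 = sigmoid(x @ W.T + b)  # p(x_t = 1 | x_<t) for every t at once
    return np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1), axis=-1)

def sample(rng):
    """Generate one image in 16 sequential steps."""
    x = np.zeros(D)
    for t in range(D):
        p1 = sigmoid(W[t] @ x + b[t])  # W[t, t:] is zero, so only x[:t] matters
        x[t] = rng.random() < p1
    return x.reshape(4, 4).astype(int)

img = sample(rng)
nll = -log_prob(img.reshape(-1))  # the per-image training loss

# Exact likelihood: the model is normalized over all 2^16 binary images.
all_images = (np.arange(2**D)[:, None] >> np.arange(D)) & 1
total = np.exp(log_prob(all_images)).sum()

print(img)
print("negative log-likelihood:", nll)
print("sum of p(x) over all images:", total)
```

Note the asymmetry already visible here: `log_prob` evaluates all 16 conditionals in one matrix product, while `sample` must loop.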

Training: next-token cross-entropy

For a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the maximum-likelihood objective is

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x^{(i)}) = \arg\max_\theta \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_\theta(x_t^{(i)} \mid x_{<t}^{(i)}).$$

Equivalently, minimize the average negative log-likelihood, which is the next-token cross-entropy loss. For discrete tokens drawn from a vocabulary of size $V$, this is the standard softmax-cross-entropy loss familiar from supervised classification - but applied to the next-token prediction at every position.

A crucial implementation detail. Training is parallel across positions. For a given training sequence $x = (x_1, \ldots, x_T)$, we want to compute the loss at every position $t$. Naively, this requires $T$ forward passes (each conditioning on different lengths of context). But with causal masking in a Transformer (LLM §3, DL §6), all $T$ next-token predictions can be computed in one forward pass: feed in $x_{<T}$ as input, get all $p_\theta(x_t \mid x_{<t})$ predictions at once.

Generation, however, is inherently sequential - each token must be sampled before the next can be conditioned on it. This training-vs-generation asymmetry is the source of autoregressive models’ efficiency profile: training is fast (parallelizable), generation is slow (sequential).
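The equivalence between one causally masked pass and $T$ separate passes can be verified on a toy model. The sketch below is an assumption-laden illustration, not a real Transformer: a single attention head with random weights, no positional encoding, and a hypothetical start token `bos` prepended so that $x_1$ also has an input position.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 11, 8, 6  # toy vocabulary size, model width, sequence length
E = rng.normal(size=(V, d))  # token embeddings (random, untrained)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Wout = rng.normal(size=(d, V)) / np.sqrt(d)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_logprobs(tokens):
    """Row i holds log p(next token | tokens[:i+1]); shape (len(tokens), V)."""
    h = E[tokens]
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d)
    n = len(tokens)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where key is after query
    scores = np.where(future, -np.inf, scores)          # causal mask
    attn = np.exp(log_softmax(scores))
    return log_softmax((h + attn @ v) @ Wout)

bos = 0                                    # hypothetical start-of-sequence token
x = rng.integers(1, V, size=T)             # target sequence x_1..x_T
inputs = np.concatenate([[bos], x[:-1]])   # start token followed by x_<T

# Parallel: one forward pass yields the prediction at every position.
logp = next_token_logprobs(inputs)
loss_parallel = -logp[np.arange(T), x].mean()

# Naive: T forward passes, each on a longer prefix, keeping only the last row.
loss_naive = -np.mean(
    [next_token_logprobs(inputs[: t + 1])[-1, x[t]] for t in range(T)]
)

print(loss_parallel, loss_naive)
```

Because the mask prevents position $t$ from attending to later positions, row $t$ of the single pass matches the last row of the pass over the length-$t$ prefix, so the two losses agree up to floating-point error.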

Architectural choices

The autoregressive principle is architecture-agnostic. Any neural network that can take $x_{<t}$ and produce $p_\theta(x_t \mid x_{<t})$ works. Three concrete instantiations from history:

PixelRNN / PixelCNN (van den Oord et al., 2016). For images, generate one pixel at a time (and one channel within each pixel). PixelRNN used RNNs with horizontal-and-vertical recurrence; PixelCNN used masked convolutions to ensure each pixel only depended on earlier pixels in the raster order. Generated high-quality images for the time; very slow to sample ($256 \times 256 \times 3 = 196{,}608$ sequential steps).

WaveNet (van den Oord et al., 2016). For raw audio at 16 kHz sampling rate, generate one audio sample at a time using dilated causal convolutions. Achieved unprecedented quality for text-to-speech and music. Generation cost per second of audio: 16,000 forward passes. Optimizations (parallel WaveNet, Gaussian inverse autoregressive flow distillation) reduced this; the framework’s autoregressive nature kept it inherently slow.

Decoder-only Transformers / GPT family (Radford et al., 2018, 2019; Brown et al., 2020). For text tokens, autoregressive over a vocabulary of $\sim 5 \times 10^4$ subword tokens. The dominant text generative architecture from 2018 onward. Developed in detail in LLM §3 (architecture), LLM §5 (training), and LLM §6 (inference). This chapter does not re-develop the LM-specific material.

Strengths of autoregressive models

Three structural advantages worth flagging.

Exact likelihood. Unlike VAEs (which give a lower bound on likelihood) or GANs (which give no likelihood at all), autoregressive models give the exact log-likelihood of any sample. This makes them the standard choice when likelihood-based evaluation matters (anomaly detection, model comparison, compression).

No mode collapse. GANs notoriously fail by collapsing to a few modes. Autoregressive models, trained on next-token cross-entropy, cannot collapse - every token in every training example contributes to the gradient with equal weight. The training objective is coverage-preserving by construction.

Token-level interpretability. Each generation step is interpretable as “the model assigned probability $p$ to this specific token.” For analysis, debugging, and steering, this is far easier than diffusion’s iterative-denoising or GANs’ opaque latent-to-output mapping.

Weaknesses of autoregressive models

Equally structural.

Sequential generation. The dominant practical issue. Generating an $N$-token sequence takes $N$ sequential forward passes. For images and video, $N$ is enormous, making generation expensive (KV-caching helps; LLM §6). The diffusion-vs-autoregressive trade-off for image generation: diffusion uses $\sim 20$–$50$ sequential denoising steps (each is a single forward pass of the full image), autoregressive uses $\sim 10^4$–$10^6$ steps (each is a single forward pass of one token).

Modality-specific tokenization. Autoregressive models need a discrete tokenization of the modality. Text has natural discrete units (words, subwords); images and audio do not. Image autoregressive models require either pixel-by-pixel generation (slow) or learned discrete tokens (typically from a VQ-VAE - see §4). Audio is similar. The tokenization choice substantially affects model quality.

Ordering dependence. The factorization $p(x) = \prod_t p(x_t \mid x_{<t})$ requires choosing an order. For text, left-to-right is natural. For images, raster order is arbitrary - the model “sees” the top-left of the image first and the bottom-right last, which has no semantic justification. The model can still work well in practice but the order imposes structure that the data does not have.

Connection to LLM §3 and modern practice

LLM §3 develops decoder-only Transformer architecture (embedding, attention, FFN, position encoding, output projection) in detail, applied to the case where the autoregressive model is over text tokens. The same architecture works for any tokenized modality:

  • Text: tokens from BPE or SentencePiece subword vocabulary.

  • Images: tokens from a learned VQ-VAE codebook (Esser, Rombach, Ommer, 2021 “Taming Transformers” / VQGAN; the LlamaGen line).

  • Audio: tokens from a learned audio codec (SoundStream, EnCodec).

  • Video: tokens from a 3D VQ-VAE (the MAGVIT line).

  • Multimodal: interleaved tokens from multiple codebooks.

By 2026, autoregressive Transformers dominate text generation entirely and have substantial niches in image, audio, and video generation (often alongside or in competition with diffusion/flow-matching). The choice between autoregressive and diffusion-style generation for a given modality depends on sample-quality requirements, generation-speed requirements, modality-specific tokenization quality, and engineering investment in each path.

Where autoregressive models sit in 2026

The honest accounting. Autoregressive models are:

  • Dominant for text and structured-sequence data.

  • Competitive but not dominant for images, audio, and video (where diffusion/flow-matching is often preferred).

  • Increasingly important for multimodal generation, where unified-token autoregressive models can generate across modalities in a single framework (Chameleon, Gemini Native).

The trade-offs are well-understood; the choice is application-specific rather than a sweeping recommendation.


§4. Variational Autoencoders (VAEs)

The latent-variable framework

A different family of generative models, based on a different conceptual move. VAEs assume the data $x$ is generated from a low-dimensional latent variable $z$:

   LATENT-VARIABLE GENERATIVE MODEL

         z ~ p(z)   (prior, e.g., N(0, I) in R^d)
                │
                ▼
         x ~ p_theta(x | z)   (decoder)


   To generate a new sample x:
     1. Sample z from the prior.
     2. Pass z through the decoder to get p(x | z).
     3. Sample x from this distribution.

The model specifies a prior $p(z)$ (usually a standard normal in $\mathbb{R}^d$ for some modest $d$, e.g., $d = 32$ or $d = 256$) and a decoder $p_\theta(x \mid z)$ - typically a neural network mapping $z$ to the parameters of the output distribution (e.g., the mean and variance of a Gaussian over images).

To compute the marginal likelihood:

$$p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, dz.$$

This integral is the fundamental difficulty of latent-variable models. For deep decoders, the integral is intractable - there is no closed-form solution and no efficient way to compute it. We need an approximation. The variational autoencoder (Kingma and Welling, 2013) is one specific way to handle this.
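To make the integral concrete, here is a minimal sketch on a one-dimensional linear-Gaussian model, chosen (as an assumption for illustration) because its marginal has a closed form to compare against: $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(a z, s^2)$, hence $x \sim \mathcal{N}(0, a^2 + s^2)$. The naive Monte Carlo estimator averages $p(x \mid z)$ over prior samples. The parameter names `a` and `s2` are illustrative.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, a, s2, num_samples, rng):
    """Estimate p(x) = E_{z ~ N(0,1)}[ p(x | z) ] by sampling the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, a * z, s2)
    return total / num_samples

a, s2, x = 2.0, 0.5, 1.0
exact = normal_pdf(x, 0.0, a * a + s2)   # closed form for this toy model
estimate = mc_marginal(x, a, s2, 50_000, random.Random(0))
```

In one dimension the estimator is accurate with a modest number of samples. With a deep decoder and a high-dimensional $x$, almost no prior sample explains a given data point well, so the same estimator needs an impractical number of samples; this is the gap the encoder is introduced to close.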

The ELBO: a tractable lower bound on the likelihood

The variational approach. Introduce a second neural network - an encoder $q_\phi(z \mid x)$ - that maps inputs $x$ to distributions over latents. The encoder approximates the (true but intractable) posterior $p_\theta(z \mid x) = p_\theta(x \mid z) p(z) / p_\theta(x)$.

For any encoder $q_\phi(z \mid x)$ and any input $x$, the Evidence Lower BOund (ELBO) is

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z \mid x)} [\log p_\theta(x \mid z)] - \text{KL}(q_\phi(z \mid x) \| p(z)).$$

Reading this. The right-hand side is a lower bound on the log-likelihood that we can compute and optimize. The bound has two terms:

  • The reconstruction term $\mathbb{E}_{z \sim q_\phi(z \mid x)} [\log p_\theta(x \mid z)]$: encode $x$ into a latent $z$, then ask how well the decoder reconstructs $x$ from $z$. Higher reconstruction quality → higher likelihood.

  • The KL term $\text{KL}(q_\phi(z \mid x) \| p(z))$: how far the encoder’s output (the posterior over $z$ given $x$) is from the prior $p(z)$. Acts as a regularizer pushing the encoder toward the prior.

The two terms have a clean interpretation. Reconstruct well, but don’t deviate too much from the prior. The trade-off is what produces the VAE’s structured latent space.
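The bound can be checked numerically on a toy model where everything is available in closed form. The sketch below assumes a one-dimensional linear-Gaussian model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(a z, s^2)$) and a Gaussian encoder $q(z \mid x) = \mathcal{N}(m, v)$; the function names are illustrative.

```python
import math

def log_marginal(x, a, s2):
    """log p(x) for z ~ N(0,1), x | z ~ N(a z, s2): x ~ N(0, a^2 + s2)."""
    var = a * a + s2
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * x * x / var

def elbo(x, a, s2, m, v):
    """ELBO for q(z | x) = N(m, v); both terms are in closed form."""
    # E_q[log N(x; a z, s2)], using E_q[(x - a z)^2] = (x - a m)^2 + a^2 v
    recon = (-0.5 * math.log(2 * math.pi * s2)
             - ((x - a * m) ** 2 + a * a * v) / (2 * s2))
    # KL( N(m, v) || N(0, 1) )
    kl = 0.5 * (m * m + v - 1.0 - math.log(v))
    return recon - kl

a, s2, x = 2.0, 0.5, 1.0
# Exact posterior of the linear-Gaussian model.
v_star = 1.0 / (1.0 + a * a / s2)
m_star = v_star * a * x / s2

tight = elbo(x, a, s2, m_star, v_star)   # q = true posterior: bound is tight
loose = elbo(x, a, s2, 0.0, 1.0)         # q = prior: KL is 0, bound is loose
```

Setting the encoder to the prior zeroes the KL term but pays heavily in reconstruction; the exact posterior balances the two and recovers $\log p_\theta(x)$ exactly.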

The reparameterization trick

To train, we need to backpropagate gradients through the expectation $\mathbb{E}_{z \sim q_\phi(z \mid x)}[\cdot]$. The challenge: $z$ is sampled from a distribution that depends on $\phi$ (the encoder’s parameters); naive Monte Carlo sampling does not give a backpropagatable gradient.

The fix: the reparameterization trick (Kingma and Welling, 2013). For a Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi(x)^2 I)$, sample as follows:

$$\epsilon \sim \mathcal{N}(0, I), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon.$$

Reading this. Rather than sample $z$ from a distribution depending on $\phi$, sample $\epsilon$ from a fixed distribution (standard normal) and deterministically compute $z$ as a function of $\epsilon$ and $\phi$. Gradients with respect to $\phi$ flow through $\mu_\phi$ and $\sigma_\phi$ to the encoder’s parameters. The randomness is “off the gradient path.”

This trick is foundational: it appears not only in VAEs but in SAC (RL §7), Gumbel-softmax for discrete latents, and other modern stochastic-network architectures.
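A small numerical check of the idea, under the assumption of a scalar Gaussian and the test function $f(z) = z^2$ (chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$). The helper name `pathwise_grads` is illustrative; the derivatives are written out by hand rather than obtained from an autodiff library.

```python
import random

def pathwise_grads(mu, sigma, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2], with z = mu + sigma * eps.

    Because z is a deterministic function of (mu, sigma, eps), the chain
    rule gives d(z^2)/dmu = 2 z and d(z^2)/dsigma = 2 z * eps per sample.
    """
    g_mu = g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness from a fixed distribution
        z = mu + sigma * eps        # deterministic in (mu, sigma)
        g_mu += 2.0 * z
        g_sigma += 2.0 * z * eps
    return g_mu / num_samples, g_sigma / num_samples

mu, sigma = 1.5, 0.7
g_mu, g_sigma = pathwise_grads(mu, sigma, 100_000, random.Random(0))
# Analytic gradients of mu^2 + sigma^2: 2 * mu = 3.0 and 2 * sigma = 1.4.
```

In a VAE the same per-sample chain rule is applied automatically by backpropagation, with $f$ replaced by the decoder's log-likelihood.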

Pseudocode for VAE training

ALGORITHM VAE Training (Kingma & Welling, 2013)
INPUT: encoder q_phi (outputs mean and log-variance),
       decoder p_theta, dataset D

For each minibatch x:
  # Encode
  mu, log_sigma2 = q_phi(x)
  sigma = exp(0.5 * log_sigma2)

  # Sample z via reparameterization
  eps ~ N(0, I)
  z = mu + sigma * eps

  # Decode and compute reconstruction loss
  x_recon = p_theta_decode(z)
  L_recon = reconstruction_loss(x, x_recon)
            # e.g., -log p(x | z), often MSE for image data
            # or pixelwise cross-entropy for discrete pixels

  # KL term: closed form for Gaussian posterior and prior
  L_kl = 0.5 * sum(mu^2 + sigma^2 - 1 - log_sigma2)

  # Total ELBO (negative, since we minimize)
  L = L_recon + L_kl

  Update theta, phi by gradient descent on L.

The structure: encode, reparameterize, decode, compute two losses, backpropagate. Two networks ($q_\phi$ and $p_\theta$) trained jointly.

A worked example: MNIST VAE

A standard pedagogical instance. Train a VAE on MNIST $28 \times 28$ binary digit images with latent dimension $d = 2$ (chosen low for visualization). After training:

  • The encoder maps each digit image to a 2D point. Plotting these reveals a structured latent space: similar digits cluster together (all "1"s near each other, all "8"s elsewhere).

  • Sampling new digits: draw $z \sim \mathcal{N}(0, I_2)$, decode. The result is a new digit-like image, smoothly interpolating between training-set styles depending on $z$.

  • Latent traversal: sweep $z$ along a line in the 2D latent space and decode at each point. The decoded images smoothly morph from one digit style to another. The latent space is continuous and meaningful.

This combination - meaningful latent space, smooth interpolation, principled generation - is the VAE’s defining feature. It is what GANs do not have (their latent-to-output mapping is not constrained to be meaningful) and what autoregressive models do not have (they have no latent space at all).

Common VAE variants

$\beta$-VAE (Higgins et al., 2017). Modify the ELBO by reweighting the KL term:

$$\mathcal{L}_{\beta} = -\mathbb{E}[\log p_\theta(x \mid z)] + \beta \cdot \text{KL}(q_\phi(z \mid x) \| p(z)).$$

For $\beta > 1$, the KL term is upweighted, pushing the posterior closer to the prior. Empirically this often produces more disentangled latents - different dimensions of $z$ correspond to different semantic factors of variation (rotation, scale, colour). The disentanglement claims are contested in subsequent work (Locatello et al., 2019), but $\beta$-VAE remains widely used.

Conditional VAE (CVAE). For tasks where we want to generate $x$ conditional on some side information $c$: replace $p_\theta(x \mid z)$ with $p_\theta(x \mid z, c)$ and $q_\phi(z \mid x)$ with $q_\phi(z \mid x, c)$. The structure is unchanged; the conditioning adds $c$ to both networks. Used for image-to-image translation and style-conditional generation. (The latent encoder in latent-diffusion systems, by contrast, is an ordinary unconditional VAE; the conditioning there enters through the diffusion model.)

VQ-VAE (van den Oord, Vinyals, Kavukcuoglu, 2017). The latent variable $z$ is discrete - a sequence of integer codes from a learned codebook of size $K$ (typically $K \in [2^{10}, 2^{14}]$). The encoder maps $x$ to a sequence of codebook indices; the decoder reconstructs from those indices.

VQ-VAE’s importance: it enables autoregressive modelling in latent space. Train a VQ-VAE on images; tokenize images into sequences of codebook indices; train a Transformer autoregressively over those tokens. This is the recipe behind VQGAN (Esser, Rombach, Ommer, 2021), DALL-E 1 (Ramesh et al., 2021), and the modern image-tokenization stack used in autoregressive image generation. VQ-VAE turns out to be more central to modern generative AI than the standalone VAE: it is the interface between continuous data and autoregressive models.
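The quantization step at the heart of this recipe is a nearest-neighbour lookup. The sketch below shows only that lookup, with a hand-written hypothetical codebook; training a real VQ-VAE additionally needs a straight-through gradient estimator and codebook and commitment losses, which are omitted here.

```python
def quantize(vectors, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    Returns (indices, quantized_vectors). The indices are the discrete
    tokens an autoregressive model would then be trained on.
    """
    indices = []
    for v in vectors:
        dists = [sum((vi - ci) ** 2 for vi, ci in zip(v, code))
                 for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]

# Hypothetical codebook with K = 4 entries in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens, quantized = quantize(encoder_outputs, codebook)
```

After this step an image is just a short sequence of integers in $\{0, \ldots, K-1\}$, which is exactly the input format a decoder-only Transformer expects.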

Posterior collapse and other VAE pathologies

Two characteristic failure modes worth flagging.

Posterior collapse. The encoder learns to ignore $x$ and produce $q_\phi(z \mid x) \approx p(z)$ for all $x$, driving the KL term to zero. The latent then carries no information about the input, so the decoder must model $x$ without it: a weak decoder falls back to average-looking outputs regardless of input, while a powerful one models the data distribution on its own and leaves $z$ unused.

The cause: the decoder is too powerful relative to the information the latent provides; it can reach a good likelihood without needing $z$, so the optimizer takes the easy path and eliminates the KL penalty. Mitigations include KL warmup (start $\beta = 0$ and anneal up), bottleneck design, and free-bits objectives (Kingma et al., 2016).

Blurry samples. Standard VAE decoders trained with Gaussian likelihood produce blurry images. The reason is technical: the Gaussian likelihood penalizes pixel-wise squared error, which is minimized by averaging over plausible reconstructions rather than committing to any specific one. The blurriness is an artefact of the loss function, not of the latent-variable framework. VQ-VAEs avoid it (discrete codes prevent averaging); diffusion models avoid it (denoising loss does not penalize per-pixel averaging the same way).

Where VAEs sit in 2026

The honest accounting. Standalone VAE generative models are largely historical - GANs displaced them for image quality in 2016–2020; diffusion displaced both in 2021–2022. New high-quality generative systems are rarely standalone VAEs.

But VAEs (especially VQ-VAEs) are ubiquitous as components:

  • Latent encoders in latent diffusion. Stable Diffusion uses a VAE to encode images into a compressed latent space before applying diffusion. The VAE compresses (e.g.) a $512 \times 512 \times 3$ image into a $64 \times 64 \times 4$ latent: 64x fewer spatial positions and 48x fewer values overall ($786{,}432$ versus $16{,}384$), which makes each diffusion step far cheaper. The diffusion model lives in latent space; the VAE handles the pixel-space interface.

  • Discrete tokenizers for autoregressive image models. VQ-VAE provides the tokenization that turns continuous images into discrete tokens for Transformer autoregressive modelling.

  • Audio and video codecs. Modern audio (EnCodec) and video (MAGVIT) tokenizers are VQ-VAE-style discrete encoders.

The VAE framework - encoder, decoder, latent bottleneck, possibly with discrete codes - is one of the most-used architectural patterns in modern generative AI, even though the standalone VAE-as-generator is rarely used today.


§5. Normalizing Flows and GANs

We treat these two families together because both are largely historical for new system design but remain conceptually important. Flows are the simplest exact-likelihood family that handles continuous data; GANs were the dominant image-generation paradigm for half a decade and shaped the field’s expectations.

Normalizing flows: exact likelihood via invertible transformations

The defining idea. A normalizing flow is an invertible neural network $f_\theta: \mathbb{R}^d \to \mathbb{R}^d$ that maps a simple distribution (typically standard Gaussian) to the data distribution. The model density is computed from the base density and the Jacobian-determinant of the inverse transformation.

Concretely. Let $z \sim p_z(z) = \mathcal{N}(0, I)$ be a sample from the base distribution. Define $x = f_\theta(z)$. By the change-of-variables formula:

$$p_\theta(x) = p_z(f_\theta^{-1}(x)) \cdot \left| \det \frac{\partial f_\theta^{-1}}{\partial x} \right|.$$

Reading this. The density of $x$ under the flow equals the density of the corresponding $z$ in the base distribution, multiplied by the Jacobian-determinant of the inverse transformation. The Jacobian-determinant captures how much the transformation locally stretches or compresses volume - high-density regions of $x$-space are where the transformation has compressed volume relative to $z$-space.
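A one-dimensional sanity check of the change-of-variables formula, assuming the simplest possible flow $f(z) = a z + b$ (so $f^{-1}(x) = (x - b)/a$ and the Jacobian factor is $1/|a|$). The result should coincide with the density of $\mathcal{N}(b, a^2)$; the function names are illustrative.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def flow_density(x, a, b):
    """Density of x = f(z) = a * z + b with z ~ N(0, 1).

    f^{-1}(x) = (x - b) / a and |d f^{-1} / dx| = 1 / |a|.
    """
    z = (x - b) / a
    return std_normal_pdf(z) * (1.0 / abs(a))

def normal_pdf(x, mean, std):
    return (math.exp(-0.5 * ((x - mean) / std) ** 2)
            / (std * math.sqrt(2 * math.pi)))

a, b = 0.5, 3.0                      # |a| < 1 compresses volume
p_flow = flow_density(3.2, a, b)
p_ref = normal_pdf(3.2, mean=b, std=abs(a))
peak_flow = flow_density(b, a, b)    # density at the image of z = 0
peak_base = std_normal_pdf(0.0)
```

With $|a| < 1$ the map squeezes the base distribution into a narrower region, and the $1/|a|$ factor raises the peak density accordingly.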

Two design requirements follow:

  1. The transformation $f_\theta$ must be invertible (so that $f_\theta^{-1}(x)$ exists for every $x$).

  2. The Jacobian-determinant must be efficiently computable (so that training and density evaluation are tractable).

These requirements rule out standard neural network architectures (a generic deep network has neither property). Flow architectures are designed specifically to be invertible with tractable Jacobians.

Coupling-layer flows: RealNVP and Glow

The dominant flow architecture. RealNVP (Dinh, Sohl-Dickstein, Bengio, 2017) introduced coupling layers: split the input into two halves $x_a, x_b$; transform one half conditional on the other; leave the other half unchanged. The transformation $x_b' = x_b \cdot \exp(s(x_a)) + t(x_a)$ for neural networks $s, t$ is invertible by construction and has a triangular Jacobian whose determinant is the product of the diagonal - efficiently computable.

   COUPLING LAYER (RealNVP)

   Input:  x = (x_a, x_b)        # split into two halves

   Forward:
     x_a' = x_a                  # unchanged
     x_b' = x_b * exp(s(x_a)) + t(x_a)
   Output: (x_a', x_b')

   Inverse:
     x_a  = x_a'
     x_b  = (x_b' - t(x_a')) * exp(-s(x_a'))

   Log-det Jacobian:
     log |det J| = sum(s(x_a))   # simply the sum of the scale outputs

Stacking many coupling layers with different splits produces a flow with substantial expressive power. Glow (Kingma and Dhariwal, 2018) added invertible $1 \times 1$ convolutions to mix channels between coupling layers and produced high-quality image generation.
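The coupling layer is short enough to run end to end. In the sketch below `s_net` and `t_net` are hypothetical fixed functions standing in for the learned networks $s$ and $t$; note that they never need to be inverted themselves.

```python
import math

def s_net(x_a):
    """Hypothetical stand-in for the scale network s(x_a)."""
    return [0.5 * math.tanh(v) for v in x_a]

def t_net(x_a):
    """Hypothetical stand-in for the shift network t(x_a)."""
    return [2.0 * v + 1.0 for v in x_a]

def coupling_forward(x_a, x_b):
    s, t = s_net(x_a), t_net(x_a)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    log_det = sum(s)               # triangular Jacobian: sum of log-scales
    return x_a, y_b, log_det

def coupling_inverse(y_a, y_b):
    s, t = s_net(y_a), t_net(y_a)  # y_a equals x_a, so s and t are recoverable
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a, x_b

x_a, x_b = [0.3, -1.2], [0.7, 2.5]
y_a, y_b, log_det = coupling_forward(x_a, x_b)
x_a_rec, x_b_rec = coupling_inverse(y_a, y_b)
```

Because the untouched half passes through unchanged, the inverse can recompute exactly the same scale and shift, which is why arbitrary (non-invertible) networks are allowed for $s$ and $t$.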

Autoregressive flows: MAF and IAF

A different parameterization. Masked Autoregressive Flow (MAF) (Papamakarios, Pavlakou, Murray, 2017) makes the transformation of each data dimension depend on the preceding data dimensions:

$$z_i = (x_i - \mu_i(x_{<i})) / \sigma_i(x_{<i}).$$

The Jacobian $\partial z / \partial x$ is triangular (each $z_i$ depends only on $x_{\le i}$), so the determinant is again a simple product, here of the diagonal terms $1 / \sigma_i(x_{<i})$.

Inverse Autoregressive Flow (IAF) (Kingma, Salimans, Welling et al., 2016) reverses the direction: each latent dimension depends on previous latent dimensions. MAF is fast for density evaluation, slow for sampling; IAF is fast for sampling, slow for density evaluation. The choice depends on which operation matters more.
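The speed asymmetry follows directly from which variables the conditioner needs. A minimal MAF-style sketch, with a hypothetical hand-written conditioner `mu_sigma` in place of a masked network:

```python
import math

def mu_sigma(prefix):
    """Hypothetical conditioner: (mu_i, sigma_i) as a function of x_{<i}."""
    total = sum(prefix)
    return 0.5 * total, math.exp(0.1 * total)

def maf_to_latent(x):
    """Density direction: every z_i uses only already-known x values.

    The iterations are independent of each other, so a masked network
    can compute all of them in one parallel pass.
    """
    z, log_det = [], 0.0
    for i, xi in enumerate(x):
        mu, sigma = mu_sigma(x[:i])
        z.append((xi - mu) / sigma)
        log_det -= math.log(sigma)   # log |det dz/dx|
    return z, log_det

def maf_sample(z):
    """Sampling direction: x_i needs x_{<i}, so this loop is sequential."""
    x = []
    for zi in z:
        mu, sigma = mu_sigma(x)      # depends on already-generated x
        x.append(mu + sigma * zi)
    return x

x = [0.4, -1.0, 2.0]
z, log_det = maf_to_latent(x)
x_rec = maf_sample(z)
```

IAF uses the same affine form but conditions on the preceding latent dimensions instead, which swaps which direction is the sequential one.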

Where flows sit in 2026

The honest accounting. Normalizing flows are mathematically elegant and have exact likelihoods (unlike VAEs’ lower bounds and GANs’ absence of likelihood). They are the cleanest exact-likelihood family for continuous data.

But they have not become dominant. Three reasons:

  1. Architectural constraint cost. The invertibility-and-tractable-Jacobian requirement substantially limits expressiveness compared to unconstrained networks. State-of-the-art image generation requires deep flows with many coupling layers; the result is parameter-heavy and slower to train than alternatives.

  2. Sample quality. Modern flows produce reasonable but not state-of-the-art image samples. Diffusion (§6) and flow-matching (§7) substantially exceed flows on image quality benchmarks.

  3. Use cases narrowed. The clean-density-evaluation niche turned out to be served well enough by autoregressive models (which also have exact likelihood, are simpler architecturally, and scale better).

Flows remain in use in a few niches: scientific applications where exact likelihood matters (cosmology simulators, lattice field theory); density estimation for anomaly detection; conditional density estimation for tabular data. They are also conceptually important - the continuous flow framework underlies flow-matching (§7), which is the modern descendant.

GANs: adversarial training

The defining idea. Generative Adversarial Networks (Goodfellow et al., 2014) train two networks in opposition:

  • A generator $G_\theta: \mathbb{R}^d \to \mathcal{X}$ that maps random noise $z \sim p_z$ (typically Gaussian) to samples $G_\theta(z)$.

  • A discriminator $D_\phi: \mathcal{X} \to [0, 1]$ that takes a sample and outputs the probability that it is real (from the data distribution) rather than generated (by $G_\theta$).

The training objective:

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\phi(G_\theta(z)))].$$

Reading this. The discriminator maximizes its ability to distinguish real from generated; the generator minimizes the discriminator’s ability (by producing samples the discriminator misclassifies as real). At equilibrium, the generator produces samples that the discriminator cannot distinguish from real data.

A clean theoretical result (Goodfellow et al.): at the global optimum of the minimax game (assuming a discriminator of unlimited capacity trained to optimality), $p_\theta = p_{\text{data}}$ and $D_\phi \equiv 1/2$. The generator has matched the data distribution; the discriminator is reduced to chance.
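The result can be verified on a finite sample space, where the optimal discriminator has the closed form $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_\theta(x))$ from the original paper. The distributions below are made-up toy values.

```python
import math

def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) on a finite space."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

def gan_value(p_data, p_gen, d):
    """V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))]."""
    return sum(pd * math.log(di) + pg * math.log(1.0 - di)
               for pd, pg, di in zip(p_data, p_gen, d))

p_data = [0.5, 0.3, 0.2]
matched = [0.5, 0.3, 0.2]       # generator equals the data distribution
mismatched = [0.2, 0.2, 0.6]    # generator over-produces the last outcome

d_matched = optimal_discriminator(p_data, matched)
v_matched = gan_value(p_data, matched, d_matched)
v_mismatched = gan_value(p_data, mismatched,
                         optimal_discriminator(p_data, mismatched))
```

Against an optimal discriminator the value is $-\log 4$ plus twice the Jensen-Shannon divergence between the two distributions, so any mismatch strictly raises it above the matched case.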

Pseudocode for GAN training

ALGORITHM GAN Training (Goodfellow et al., 2014)
INPUT: generator G_theta, discriminator D_phi, dataset D

For each iteration:
  # Discriminator update
  Sample minibatch x_real from D.
  Sample minibatch z from p_z; compute x_fake = G_theta(z).
  L_D = -mean[ log D_phi(x_real) + log(1 - D_phi(x_fake)) ]
  phi = AdamStep(phi, L_D)

  # Generator update
  Sample minibatch z from p_z; compute x_fake = G_theta(z).
  L_G = -mean[ log D_phi(x_fake) ]    # non-saturating form
  theta = AdamStep(theta, L_G)

The generator’s “non-saturating” loss $-\log D(x_{\text{fake}})$ is a practical improvement on the theoretical $\log(1 - D(x_{\text{fake}}))$ - the latter has vanishing gradients early in training when $D$ classifies fakes as fake with high confidence.
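The vanishing-gradient claim is easy to check by differentiating both generator losses with respect to the discriminator's logit $a$ on a fake sample, where $D = \sigma(a)$. The derivatives are worked out by hand; the logit value is an illustrative assumption.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))

# a = -6 means D is confident the sample is fake (D is about 0.0025),
# the typical situation early in training.
a = -6.0
g_sat = grad_saturating(a)           # about -0.0025: almost no signal
g_non_sat = grad_non_saturating(a)   # about -0.9975: strong signal
```

Both losses push the logit in the same direction; they differ only in how much gradient survives when the discriminator is winning.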

Mode collapse, instability, and the GAN failure modes

The characteristic difficulties of GAN training.

Mode collapse. The generator finds a small subset of the data distribution that consistently fools the discriminator and stays there. For MNIST, this might be a generator that only produces 3’s and 7’s, ignoring other digits. The training loss is low; the discriminator is fooled; but the model has not learned the full distribution. Mode collapse is the GAN-specific failure that cannot happen with autoregressive models, VAEs, or diffusion (all of which have likelihood-style losses that penalize ignoring parts of the distribution).

Training instability. GANs train via a minimax game with two networks; the dynamics are not those of standard gradient descent on a single loss. The training can oscillate, diverge, or fail to converge to a useful state. Empirically, GAN training requires careful hyperparameter tuning, architectural choices, and tricks (spectral normalization, gradient penalties, careful loss weighting). It is fragile relative to other generative-modelling families.

No likelihood. GANs do not provide any way to evaluate $p_\theta(x)$ for a given $x$. This rules out likelihood-based applications (anomaly detection, compression) and makes evaluation difficult - there is no internal metric of progress, only the discriminator’s loss (which can be misleading) and external sample-quality metrics (FID, §10).

The GAN refinement programme: 2016–2020

The years following Goodfellow’s original paper saw substantial engineering effort addressing the failure modes:

  • DCGAN (Radford, Metz, Chintala, 2016). Practical guidelines for CNN-based generators (transposed convolutions, batch normalization, leaky ReLU). The first stable recipe for GAN training at moderate resolutions.

  • Wasserstein GAN (WGAN) (Arjovsky, Chintala, Bottou, 2017). Replace the Jensen-Shannon-style discriminator with a Wasserstein-distance critic. The Wasserstein loss has smoother gradients and addresses many training-stability issues. WGAN-GP (Gulrajani et al., 2017) added a gradient-penalty term that made the recipe practical.

  • Progressive Growing GANs (Karras, Aila, Laine, Lehtinen, 2018). Train at low resolution first, then progressively add resolution. Stabilizes high-resolution training.

  • BigGAN (Brock, Donahue, Simonyan, 2019). Class-conditional ImageNet generation at $512 \times 512$. Demonstrated that GANs scaled to substantial dataset sizes with careful engineering.

  • StyleGAN (Karras, Laine, Aila, 2019) and StyleGAN2/3. Unconditional face generation at $1024 \times 1024$ with style-based generator architecture. The visual high-water mark of the GAN era - and the closest GANs came to “photorealistic” image generation.

Through 2020 GANs were the default choice for image synthesis. Then diffusion arrived.

Why GANs gave way to diffusion

By 2022, diffusion (§6) had largely displaced GANs for new image-generation systems. The reasons are mostly empirical-and-engineering rather than theoretical:

  • Mode coverage. Diffusion’s likelihood-style training covers the data distribution by construction; GANs collapse.

  • Training stability. Diffusion training is single-loss-on-single-network supervised learning; GAN training is fragile minimax.

  • Conditioning. Text-to-image conditioning (CLIP-guided, classifier-free) integrates more naturally with diffusion than with GANs.

  • Sample quality. By 2022 diffusion matched or exceeded StyleGAN quality on standard benchmarks while being more tractable to scale to text-conditional and high-resolution settings.

GANs are not gone. They survive in niches where their speed matters: GAN sampling is single-forward-pass (vs diffusion’s 20–50 steps), making GANs attractive for real-time applications. They also remain a useful research model - easy to analyze theoretically, useful for studying training-dynamics questions. The high-fidelity-but-narrow-distribution generator (e.g., StyleGAN for specific face datasets) remains a real engineering choice in 2026.

But for new general-purpose generative-modelling research and system design, GANs are no longer the default. The mantle has passed to diffusion and flow-matching.


§6. Diffusion Models

This is the chapter’s centerpiece. Diffusion models are the dominant non-text generative paradigm of 2026 (with flow-matching, §7, as the increasingly-dominant variant in new deployments). We develop the framework mechanistically: the forward noising process, the reverse denoising process, the loss, the sampling algorithms, and the architectural choices that make it work at scale.

The intuition

The core idea, stripped of mathematics. Suppose we want to generate images. We have a large dataset of real images. Here is the procedure:

  1. Training-time. Take each training image. Gradually corrupt it with Gaussian noise - adding a tiny amount of noise at step 1, more at step 2, until at step $T$ the image is pure noise (indistinguishable from $\mathcal{N}(0, I)$). At each noise level, the model is trained to predict the noise that has been added to the image.

  2. Sampling-time. Start with pure noise. Gradually remove the noise - using the trained model to predict, at each step, what noise to remove. After $T$ steps, what was pure noise has become a plausible new image.

The training process defines a recipe for going from noise to data, by reversing the noising process. The sampling process executes the recipe.

Two pictures to anchor:

   FORWARD PROCESS (training-time, fixed)

   x_0 (real image) → x_1 → x_2 → ... → x_T (pure noise)
       │ add small  │ add  │              │
       │   noise    │      │              │

   REVERSE PROCESS (sampling-time, learned)

   x_T (pure noise) → x_{T-1} → x_{T-2} → ... → x_0 (generated image)
       │ remove pre-  │           │              │
       │   dicted noise           │              │

Two intuitive arguments for why this works.

First, local denoising is easy. Predicting what noise was added in one step of the forward process is a tractable problem - the model only needs to look at slightly-noisy data and predict the small perturbation. It does not need to model the entire transition from data to noise in one shot.

Second, iterative refinement composes. By chaining many small denoising steps, the model effectively transports samples from pure noise to data through a sequence of small, locally-easy moves. This is the same structural insight that makes gradient descent work: many small steps in a good direction can traverse a complicated landscape.

The forward process

Formally. Define a sequence of distributions $q(x_t \mid x_{t-1})$ for $t = 1, \ldots, T$ that add Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \, I),$$

where $\beta_t \in (0, 1)$ is a small noise level at step $t$ (the variance schedule). A typical schedule has $\beta_1 \approx 10^{-4}$ growing to $\beta_T \approx 0.02$ over $T = 1000$ steps.

Reading this. At each step, $x_t$ is a noisy version of $x_{t-1}$: the mean is scaled-down $x_{t-1}$ (factor $\sqrt{1 - \beta_t}$, very close to 1 for small $\beta_t$), and Gaussian noise of variance $\beta_t I$ is added. With many steps and growing $\beta_t$, $x_T$ becomes essentially pure noise - independent of $x_0$.

A crucial property of Gaussian noise: the forward process composes in closed form. We do not need to sample $x_1, x_2, \ldots, x_{t-1}$ to get $x_t$; we can sample $x_t$ from $x_0$ directly:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$. This closed-form formula lets us sample any noise level $t$ in one operation:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Reading this. Sampling at noise level $t$: scale the original image by $\sqrt{\bar{\alpha}_t}$ and add Gaussian noise scaled by $\sqrt{1 - \bar{\alpha}_t}$ (the two squared coefficients sum to 1, so unit-variance data stays unit-variance). At $t = 0$, $\bar{\alpha}_0 = 1$, $x_0$ is the original. At $t = T$, $\bar{\alpha}_T \approx 0$, $x_T \approx \epsilon$ (pure noise).
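The schedule and the one-shot sampler can be sketched in a few lines. The code assumes that $\beta_t$ grows linearly from $10^{-4}$ to $0.02$ (the text gives only the endpoints, so the linear shape is an assumption), and the names `linear_alpha_bar` and `q_sample` are illustrative.

```python
import math
import random

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[t] = prod_{s<=t} (1 - beta_s), with alpha_bar[0] = 1."""
    alpha_bar = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_start + (beta_end - beta_start) * (t - 1) / (T - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta_t))
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, elementwise over a list."""
    a = alpha_bar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

alpha_bar = linear_alpha_bar()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
x_early = q_sample(x0, 10, alpha_bar, rng)     # still close to x0
x_late = q_sample(x0, 1000, alpha_bar, rng)    # essentially pure noise
```

Under this schedule $\bar{\alpha}_T$ comes out at roughly $4 \times 10^{-5}$, so the signal coefficient $\sqrt{\bar{\alpha}_T}$ is below $0.01$ and $x_T$ retains almost nothing of $x_0$.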

The reverse process and the training loss

The reverse process is what we learn. We parameterize a neural network $\epsilon_\theta(x_t, t)$ that predicts the noise contained in $x_t$.

Specifically. Given $x_t$ and $t$, the model outputs $\hat{\epsilon} = \epsilon_\theta(x_t, t)$ - a prediction of the total noise $\epsilon$ that was added when producing $x_t$ from $x_0$ (not merely the increment between $x_{t-1}$ and $x_t$). The training loss is

$$\mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\!\left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right],$$

where $t$ is sampled uniformly from $\{1, \ldots, T\}$, $x_0$ is sampled from the dataset, and $\epsilon \sim \mathcal{N}(0, I)$ is the noise used to produce $x_t$ from $x_0$.

Reading this. The loss is just mean squared error between the true noise and the predicted noise. It is a standard supervised regression loss - no adversarial training, no variational lower bound, no Jacobian. This simplicity is one of the reasons diffusion training is so stable.

Important nuance. The full variational derivation of DDPM gives a weighted MSE loss with weights depending on $t$. The Ho-Jain-Abbeel insight: dropping those weights (using uniform-in-$t$ unweighted MSE) gives both better empirical results and a simpler training procedure. This simplified loss is the standard DDPM training objective.

Pseudocode for DDPM training

ALGORITHM DDPM Training (Ho, Jain, Abbeel, 2020)
INPUT: noise schedule beta_1, ..., beta_T (and derived alpha_bar_t),
       dataset D, denoising network epsilon_theta

For each minibatch x_0 from D:
  # Sample noise level uniformly
  t ~ Uniform({1, ..., T})

  # Sample noise
  epsilon ~ N(0, I)

  # Compute noisy x_t in closed form
  x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon

  # Predict the noise
  epsilon_hat = epsilon_theta(x_t, t)

  # Simple MSE loss
  L = mean((epsilon - epsilon_hat) ** 2)

  Update theta with gradient descent on L.

The structure is striking. Diffusion training is supervised regression - predict the noise, no minimax, no adversarial loss, no variational hierarchy. The network architecture is largely unconstrained (any network whose output has the same shape as its input works); $T$ is a hyperparameter (typically $T = 1000$). The cost of one training step is comparable to one training step of a classifier of similar size.
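The loss computation in the pseudocode can be checked numerically without training anything. The sketch below makes several assumptions that are not in the text: a linear beta schedule, toy 1-D data $x_0 \sim \mathcal{N}(0, s^2)$, and, in place of a trained network, the closed-form MSE-optimal predictor for that toy data, $\mathbb{E}[\epsilon \mid x_t] = \sqrt{1 - \bar{\alpha}_t}\, x_t / (\bar{\alpha}_t s^2 + 1 - \bar{\alpha}_t)$. The names `ddpm_loss`, `optimal_eps`, and `zero_eps` are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                               # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])     # alpha_bar[t], t = 0..T

def ddpm_loss(eps_model, x0, rng):
    """Monte Carlo estimate of the simplified DDPM loss on one minibatch."""
    n = x0.shape[0]
    t = rng.integers(1, T + 1, size=n)                  # t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0.shape)                 # epsilon ~ N(0, I)
    a = alpha_bar[t].reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps      # closed-form noising
    return np.mean((eps - eps_model(x_t, t)) ** 2)      # plain MSE on the noise

s = 0.5                                                 # toy data: x0 ~ N(0, s^2)

def optimal_eps(x_t, t):
    """Closed-form E[eps | x_t] for the toy Gaussian data (stands in for a network)."""
    a = alpha_bar[t]
    return np.sqrt(1.0 - a) * x_t / (a * s**2 + 1.0 - a)

def zero_eps(x_t, t):
    """Trivial baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

x0 = s * np.random.default_rng(0).standard_normal(50_000)
loss_opt = ddpm_loss(optimal_eps, x0, np.random.default_rng(1))
loss_zero = ddpm_loss(zero_eps, x0, np.random.default_rng(1))
```

The zero predictor scores a loss near 1 (the variance of $\epsilon$); the optimal predictor scores lower but not zero, because $\epsilon$ cannot be recovered exactly from $x_t$ alone. A real training loop would wrap `ddpm_loss` in an autodiff framework and take gradient steps on the network's parameters.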

Sampling: DDPM and DDIM

After training, sampling produces a new image by reversing the noising process. The standard DDPM sampling procedure:

ALGORITHM DDPM Sampling
INPUT: trained denoising network epsilon_theta,
       noise schedule beta_1, ..., beta_T

1. x_T ~ N(0, I)   # start from pure noise

2. For t = T, T-1, ..., 1:
   # Predict noise
   epsilon_hat = epsilon_theta(x_t, t)

   # Compute mean of p(x_{t-1} | x_t) (DDPM formula)
   mu_t = (1 / sqrt(1 - beta_t)) *
          (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_hat)

   # Sample with stochastic noise (except at final step)
   If t > 1: z ~ N(0, I); else: z = 0
   x_{t-1} = mu_t + sqrt(beta_t) * z

3. Return x_0

Reading this. At each step, the model predicts the noise; we subtract a fraction of it from $x_t$ to get $\mu_t$; we add a small amount of fresh noise to get $x_{t-1}$. After $T$ steps, $x_0$ is a sample from the model’s approximation of $p_{\text{data}}$.
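The sampling loop can be run end to end on a toy problem. This is a sketch under stated assumptions rather than a production sampler: the schedule is the linear one, the data distribution is a 1-D Gaussian $\mathcal{N}(m, s^2)$, and `eps_model` is the closed-form conditional expectation of the noise for that Gaussian, used here in place of a trained network.

```python
import numpy as np

T = 1000
betas = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])  # betas[t] for t = 1..T; index 0 unused
alpha_bar = np.cumprod(1.0 - betas)                          # alpha_bar[0] = 1

m, s = 2.0, 0.5                                              # toy data distribution N(m, s^2)

def eps_model(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a = alpha_bar[t]
    return np.sqrt(1.0 - a) * (x_t - np.sqrt(a) * m) / (a * s**2 + 1.0 - a)

def ddpm_sample(n, rng):
    x = rng.standard_normal(n)                               # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps_hat = eps_model(x, t)
        # Mean of p(x_{t-1} | x_t), as in the pseudocode above.
        mu = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - betas[t])
        z = rng.standard_normal(n) if t > 1 else 0.0         # no fresh noise on the final step
        x = mu + np.sqrt(betas[t]) * z
    return x

samples = ddpm_sample(20_000, np.random.default_rng(0))
```

With an exact noise predictor, the empirical mean and standard deviation of `samples` land close to $m$ and $s$; with a trained network the same loop applies and only `eps_model` changes.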

The cost: $T$ forward passes of $\epsilon_\theta$ to generate a single sample. For $T = 1000$, this is expensive. Two responses:

DDIM (Song, Meng, Ermon, 2021) “Denoising Diffusion Implicit Models.” A deterministic sampler that produces samples from the same trained model in far fewer steps - typically 20–50. The reformulation interprets the diffusion as an ODE rather than an SDE; the deterministic ODE solver can take larger steps than the stochastic process. With DDIM, diffusion sampling becomes practical for production use. DDIM is the dominant sampler in 2026 production systems.

Higher-order solvers. DPM-Solver (Lu et al., 2022) and its successors apply higher-order ODE-solver theory to reduce step counts further - reaching sample quality comparable to 20–50-step DDIM in roughly 10–20 steps.
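For concreteness, here is a sketch of the deterministic DDIM update (the zero-stochasticity case) on a strided subset of timesteps. As before, the linear schedule, the toy Gaussian data $\mathcal{N}(m, s^2)$, and the closed-form `eps_model` standing in for a trained network are all assumptions made for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                               # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])     # alpha_bar[t], t = 0..T

m, s = 2.0, 0.5                                                  # toy data distribution N(m, s^2)

def eps_model(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a = alpha_bar[t]
    return np.sqrt(1.0 - a) * (x_t - np.sqrt(a) * m) / (a * s**2 + 1.0 - a)

def ddim_sample(n, num_steps, rng):
    # Evenly spaced timesteps, e.g. 1000, 980, ..., 20, 0 for num_steps = 50.
    ts = np.linspace(T, 0, num_steps + 1).round().astype(int)
    x = rng.standard_normal(n)                                   # x_T ~ N(0, I)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps_hat = eps_model(x, t)
        # Estimate the clean sample implied by the current noise prediction ...
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        # ... then re-noise it deterministically to the earlier noise level.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x

samples = ddim_sample(20_000, 50, np.random.default_rng(0))
```

The only randomness is the initial draw of $x_T$; the same $x_T$ always maps to the same output. On this toy problem the 50-step result comes out slightly narrower than the target, which is the discretization error that higher-order solvers are designed to shrink.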

Score-based generative modelling: the unified framework

A parallel line of work. Song, Ermon (2019, 2020) developed score-based generative modelling: parameterize a neural network to estimate the score function $\nabla_x \log p_t(x)$ at each noise level $t$. Sample by Langevin dynamics - random walks guided by the score.

The connection to DDPM. The noise prediction $\epsilon_\theta(x_t, t)$ in DDPM is a rescaled estimate of the score of the noised marginal: $\nabla_{x_t} \log q_t(x_t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$. (The conditional score $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar{\alpha}_t}$ is the regression target; the MSE minimizer averages it over $x_0$, which yields the marginal score.) Predicting noise is estimating the score, up to a known time-dependent scale factor. Song et al. (2021) formalized this: DDPM and score-based modelling are two equivalent formulations of the same underlying framework. The continuous-time view (stochastic differential equations) provides the cleanest unification and is the foundation for flow-matching (§7).

The practical implication. The framework is one thing with two presentations. Different papers use different notation (some say “predict the noise”; some say “estimate the score”; some say “predict the data $x_0$”); all are equivalent up to rescaling. A reader engaging the literature needs to know all three.
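The three parameterizations are related by simple algebra on the forward-process formula $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The helpers below are an illustrative sketch (the function names are not from any library) showing how one prediction converts into the others.

```python
import numpy as np

def eps_to_score(eps_hat, alpha_bar_t):
    """Noise prediction -> score estimate: score = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

def eps_to_x0(x_t, eps_hat, alpha_bar_t):
    """Noise prediction -> data prediction, by inverting the forward formula."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def x0_to_eps(x_t, x0_hat, alpha_bar_t):
    """Data prediction -> noise prediction."""
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

x0_back = eps_to_x0(x_t, eps, alpha_bar_t)       # recovers x0 when eps is the true noise
eps_back = x0_to_eps(x_t, x0, alpha_bar_t)       # recovers eps when x0 is the true data
score = eps_to_score(eps, alpha_bar_t)           # conditional score of q(x_t | x0)
```

Since the conversions are exact, the choice of parameterization affects only how errors are weighted across noise levels during training, not what the model can represent.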

Conditional diffusion and classifier-free guidance

For most applications, we want conditional generation: image conditioned on text prompt, audio conditioned on melody, structure conditioned on sequence. Two techniques.

Conditional training. Train $\epsilon_\theta(x_t, t, c)$ where $c$ is the conditioning (e.g., a text embedding). The training data includes $(x, c)$ pairs; the loss is MSE on noise prediction as before. Architecturally, the conditioning $c$ is injected via cross-attention or feature concatenation.

Classifier-free guidance (CFG) (Ho and Salimans, 2021). The crucial technique that made text-to-image diffusion produce high-quality, prompt-faithful samples. The idea:

  • Train a single model that handles both conditional and unconditional generation. During training, randomly drop the conditioning (replace $c$ with a null token) with probability of roughly 10%. The result is a network $\epsilon_\theta(x_t, t, c)$ that produces meaningful output whether $c$ is given or null.

  • At inference, extrapolate away from the unconditional prediction toward the conditional one. Use the guided prediction:

$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot \big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\big),$$

where $w$ is the guidance scale (typically $w \in [3, 15]$).

Reading this. With $w = 1$, this is just the conditional model. With $w > 1$, the guidance exaggerates the conditional signal - pushing the sample more in the direction of the conditioning. Empirically, $w = 7$ or so produces samples that are strongly prompt-faithful at the cost of some diversity.

Classifier-free guidance is the single most important engineering technique behind modern text-to-image diffusion. Without guidance ($w = 1$), samples tend to be diverse but only loosely tied to the prompt; with too much guidance, they lose diversity and pick up artefacts. The guidance-scale knob lets users trade diversity for fidelity.
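Both halves of the technique are short enough to sketch. The code below is illustrative: `drop_condition` and `cfg_combine` are hypothetical helper names, the conditioning is represented as one vector per example, and the null token is taken to be an all-zeros vector, which is one common choice but an assumption here.

```python
import numpy as np

def drop_condition(c, null_token, p_drop, rng):
    """Training side: replace each example's conditioning with the null token w.p. p_drop."""
    drop = rng.random(c.shape[0]) < p_drop
    return np.where(drop[:, None], null_token, c)

def cfg_combine(eps_uncond, eps_cond, w):
    """Inference side: extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)

# Training: roughly 10% of the conditioning vectors become the null token.
c = np.ones((10_000, 4))
null_token = np.zeros(4)
c_train = drop_condition(c, null_token, 0.1, rng)
dropped_fraction = (c_train.sum(axis=1) == 0).mean()

# Inference: two forward passes per step, then one linear combination.
eps_uncond = rng.standard_normal(8)
eps_cond = rng.standard_normal(8)
guided = cfg_combine(eps_uncond, eps_cond, w=7.0)
```

Note the cost implication: guided sampling needs both the conditional and the unconditional prediction at every step, so it roughly doubles the number of network evaluations unless the two are batched together.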

Latent diffusion: Stable Diffusion’s architecture

The most-used variant of diffusion in 2026. Latent diffusion (Rombach, Blattmann, Lorenz, Esser, Ommer, 2022) “High-Resolution Image Synthesis with Latent Diffusion Models” - the technical core of Stable Diffusion.

The problem latent diffusion solves. Diffusion in pixel space is expensive at high resolution: each forward pass of $\epsilon_\theta$ on a $1024 \times 1024 \times 3$ image is computationally heavy. For 50-step sampling, this is prohibitive on consumer GPUs.

The solution. Train a VAE (§4) to compress images into a much-smaller latent representation. For Stable Diffusion: $512 \times 512 \times 3$ images compressed to $64 \times 64 \times 4$ latents - a $48\times$ reduction. Then run diffusion in latent space. After sampling a latent, decode it with the VAE to produce the final image.
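The compression factor is a one-line calculation; the snippet below simply spells out the arithmetic for the shapes quoted above.

```python
import math

pixel_shape = (512, 512, 3)      # height x width x RGB channels
latent_shape = (64, 64, 4)       # height/8 x width/8 x 4 latent channels

pixel_values = math.prod(pixel_shape)       # 786432 numbers per image
latent_values = math.prod(latent_shape)     # 16384 numbers per latent
reduction = pixel_values / latent_values    # 48.0
spatial_factor = pixel_shape[0] // latent_shape[0]   # 8x downsampling per side
```

The $8\times$ downsampling per side accounts for a $64\times$ drop in spatial positions; the channel count rising from 3 to 4 gives some of that back, leaving $48\times$ overall.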

   LATENT DIFFUSION ARCHITECTURE (Stable Diffusion)

   text prompt → text encoder (CLIP) → conditioning c
                                         │
                                         ▼
   z_T ~ N(0, I)  →  U-Net diffusion  →  z_0
   (in latent space  conditional on c   (in latent space)
   64x64x4)         in latent space)
                                         │
                                         ▼
                                  VAE decoder → x_0 (image)
                                                (512x512x3)

The architecture has three components:

  • VAE (encoder + decoder). Trained separately first; frozen during diffusion training. Compresses images to latents and back.

  • Diffusion model (typically a U-Net, increasingly a DiT). Operates in latent space. Conditioned on text via cross-attention.

  • Text encoder (typically CLIP’s text encoder). Produces the conditioning vectors.

The result: full-quality $512 \times 512$ or $1024 \times 1024$ image generation in a few seconds on a consumer GPU. Stable Diffusion’s open-weights release in 2022 brought text-to-image generation to millions of users and is the most-influential single generative-model release of the diffusion era.

Architectural choices: U-Net vs DiT

The denoising network $\epsilon_\theta(x_t, t, c)$ can be any architecture that takes the noisy input plus time-step and conditioning and outputs a same-shape noise prediction. Two dominant choices.

U-Net. The original DDPM and most early diffusion systems use a U-Net (Ronneberger, Fischer, Brox, 2015) - a CNN with skip connections between encoder and decoder layers. Empirically excellent for image diffusion; matches the image’s spatial structure naturally.

Diffusion Transformer (DiT) (Peebles and Xie, 2023). Replaces the U-Net with a Transformer operating on image patches (like ViT, DL §4). DiT scales better (the Transformer’s parameter count can grow more cleanly than a U-Net’s), and large-scale models (Sora, Stable Diffusion 3) increasingly use DiT-style architectures.

By 2026 the U-Net-vs-DiT choice is still mixed in production. Smaller and older systems use U-Nets; newer and larger systems use DiTs. The trend is toward DiT for new frontier work.

Where diffusion sits in 2026

The dominant generative paradigm for:

  • Images (Stable Diffusion line, DALL-E line, Midjourney, Imagen, Firefly).

  • Video (Sora, Veo, Gen-3) - extending diffusion to the space-time domain.

  • Audio (Riffusion, Stable Audio) and music.

  • 3D (Stable 3D, DreamFusion line) - extending diffusion to 3D representations.

  • Molecules and proteins (RFdiffusion, Chroma, AlphaFold-3’s structure-prediction component).

Each of these is developed in modality-specific detail in §9.

The flow-matching framework (§7) is increasingly displacing pure-diffusion in new systems, but the diffusion-trained models in production will remain in use for years. The two frameworks are best understood as variations on the same continuous-time generative-modelling theme.


§7. Flow-Matching and Rectified Flow

The continuous-time perspective

Diffusion (§6) was developed as a discrete-time process - $T$ steps of adding noise, $T$ steps of denoising. The continuous-time perspective generalizes this and is the foundation of flow-matching.

Consider a time-indexed family of distributions $p_t(x)$ for $t \in [0, 1]$, interpolating between a simple base distribution $p_0$ (e.g., $\mathcal{N}(0, I)$) and the data distribution $p_1 = p_{\text{data}}$. (Note that the time convention is reversed relative to §6, where $t = 0$ was the data and $t = T$ the noise.) There is some velocity field $u_t(x)$ that, applied as an ordinary differential equation, transports samples from $p_0$ to $p_1$:

$$\frac{dx}{dt} = u_t(x), \qquad x(0) \sim p_0.$$

If we know $u_t$, we can generate a sample from $p_{\text{data}}$ by sampling $x(0)$ from the base distribution and integrating the ODE forward to $t = 1$.

   CONTINUOUS-TIME GENERATIVE FLOW

   Base distribution p_0 (Gaussian noise)
            │
            │   d x / d t = u_t(x)
            │   (velocity field, learned)
            ▼
   Time     0 ──────── 0.5 ──────── 1
            │           │           │
   Sample:  noise   intermediate   data
   p_t:     p_0        p_0.5       p_1

Two questions: (1) what should the velocity field be? (2) how do we learn it from data?

Flow-matching (Lipman, Chen, Ben-Hamu, Nickel, Le, 2022) answers both with a clean framework that subsumes diffusion as a special case.

The flow-matching objective

The setup. Choose any conditional path - a family of distributions $p_t(x \mid x_1)$ that interpolates between $p_0(x \mid x_1) = p_0(x)$ (noise, independent of the target) and $p_1(x \mid x_1) = \delta(x - x_1)$ (the target $x_1$ itself, with no variance). The simplest such path is the straight-line interpolation:

$$p_t(x \mid x_1) = \mathcal{N}(x; t \cdot x_1, (1 - t)^2 I),$$

or equivalently, sample $x_0 \sim \mathcal{N}(0, I)$ and define $x_t = (1 - t) x_0 + t x_1$.

Reading this. At $t = 0$, $x_t = x_0$ (pure noise). At $t = 1$, $x_t = x_1$ (the data). In between, $x_t$ linearly interpolates between them. The path is a straight line in $x$-space from noise to data.

For this path, the conditional velocity field - the velocity that transports points along the straight line - is simply

$$u_t(x \mid x_1) = x_1 - x_0.$$

(The derivative of $x_t = (1-t) x_0 + t x_1$ with respect to $t$.)
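Written as a function of the current position $x$ rather than of the starting noise $x_0$, the same velocity takes a slightly different form. The short derivation below eliminates $x_0$ using the interpolation formula; it is a worked restatement of the expression above, not an additional assumption.

```latex
x_t = (1 - t)\,x_0 + t\,x_1
\;\Longrightarrow\;
x_0 = \frac{x_t - t\,x_1}{1 - t},
\qquad
u_t(x \mid x_1) \;=\; x_1 - \frac{x - t\,x_1}{1 - t} \;=\; \frac{x_1 - x}{1 - t}.
```

Both forms agree along the path: the velocity always points from the current position toward the target $x_1$, scaled so that the point arrives exactly at $t = 1$.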

The flow-matching loss. Train a network $v_\theta(x, t)$ to predict the conditional velocity field at randomly sampled $(t, x_0, x_1)$:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, x_0, x_1}\!\left[ \|v_\theta(x_t, t) - (x_1 - x_0)\|^2 \right],$$

where $t \sim \mathcal{U}(0, 1)$, $x_0 \sim \mathcal{N}(0, I)$, $x_1 \sim p_{\text{data}}$, and $x_t = (1 - t) x_0 + t x_1$.

Reading this. The training procedure is structurally identical to diffusion: sample a noise level, sample a noise instance, sample a data point, mix them, predict the target direction. The target is now $x_1 - x_0$ (the direction from noise to data) rather than $\epsilon$ (the noise itself). Otherwise identical.

Pseudocode for flow-matching training

ALGORITHM Flow-Matching Training (Lipman et al., 2022, straight-line variant)
INPUT: dataset D, velocity-prediction network v_theta

For each minibatch x_1 from D:
  # Sample time and noise
  t ~ Uniform(0, 1)
  x_0 ~ N(0, I)

  # Interpolated point
  x_t = (1 - t) * x_0 + t * x_1

  # Target velocity (straight line)
  v_target = x_1 - x_0

  # Predict
  v_hat = v_theta(x_t, t)

  # MSE loss
  L = mean((v_hat - v_target) ** 2)

  Update theta with gradient descent on L.

The structure is supervised regression - same as diffusion. The only difference from DDPM is what the regression target is.
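As with DDPM, the loss can be checked numerically on a toy problem. The sketch below assumes 1-D Gaussian data $x_1 \sim \mathcal{N}(m, s^2)$, for which the MSE-optimal velocity $\mathbb{E}[x_1 - x_0 \mid x_t]$ has a closed form; that closed form stands in for a trained network. The function names are illustrative.

```python
import numpy as np

def flow_matching_loss(v_model, x1, rng):
    """Monte Carlo estimate of the straight-line flow-matching loss on one minibatch."""
    n = x1.shape[0]
    t = rng.random(n)                                   # t ~ Uniform(0, 1)
    x0 = rng.standard_normal(x1.shape)                  # x_0 ~ N(0, I)
    tb = t.reshape(-1, *([1] * (x1.ndim - 1)))          # broadcast t over feature dims
    x_t = (1.0 - tb) * x0 + tb * x1                     # interpolated point
    return np.mean((v_model(x_t, t) - (x1 - x0)) ** 2)  # regress onto x_1 - x_0

m, s = 2.0, 0.5                                         # toy data distribution N(m, s^2)

def marginal_velocity(x, t):
    """Closed-form E[x_1 - x_0 | x_t = x] for the toy Gaussian (stands in for a network)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2               # Var(x_t)
    return m + (t * s**2 - (1.0 - t)) / var_t * (x - t * m)

def zero_velocity(x, t):
    """Trivial baseline that predicts no motion."""
    return np.zeros_like(x)

x1 = m + s * np.random.default_rng(0).standard_normal(50_000)
loss_opt = flow_matching_loss(marginal_velocity, x1, np.random.default_rng(1))
loss_zero = flow_matching_loss(zero_velocity, x1, np.random.default_rng(1))
```

The optimal loss is well below the baseline but not zero: many different $(x_0, x_1)$ pairs pass through the same $x_t$, so the network can only learn their average direction. That average is the marginal velocity field used at sampling time.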

Sampling: ODE integration

After training, sampling is ODE integration. Start from noise, integrate the learned velocity field:

ALGORITHM Flow-Matching Sampling (Euler integration, simplest)
INPUT: trained velocity-prediction network v_theta, number of steps N

1. x ~ N(0, I)   # start from noise
2. dt = 1 / N
3. For i = 0, 1, ..., N-1:
   t = i * dt
   x = x + dt * v_theta(x, t)
4. Return x   # the generated sample at t = 1

The simplest Euler integrator works for moderate step counts. More sophisticated ODE solvers (Heun’s method, Runge-Kutta, adaptive solvers) can produce comparable quality with fewer evaluations. Modern flow-matching systems typically use 20–50 integration steps - comparable to DDIM-sampled diffusion.
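The Euler sampler is a few lines of code. In the sketch below the velocity field is the closed-form marginal velocity for toy 1-D Gaussian data $\mathcal{N}(m, s^2)$, an assumption made so that the example runs without a trained network; `euler_sample` itself is agnostic to where the velocity comes from.

```python
import numpy as np

m, s = 2.0, 0.5                                      # toy data distribution N(m, s^2)

def velocity(x, t):
    """Closed-form marginal velocity for the toy Gaussian; stands in for v_theta."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return m + (t * s**2 - (1.0 - t)) / var_t * (x - t * m)

def euler_sample(v_model, n, num_steps, rng):
    x = rng.standard_normal(n)                       # start from noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * v_model(x, i * dt)              # one explicit Euler step
    return x                                         # approximate sample at t = 1

samples = euler_sample(velocity, 20_000, 100, np.random.default_rng(0))
```

Swapping the update line for a Heun or Runge-Kutta step changes the accuracy per network evaluation but not the overall structure of the loop.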

Why flow-matching is conceptually cleaner than diffusion

The frameworks are mathematically related; the conceptual cleanliness of flow-matching matters for several reasons.

The path is a choice. Flow-matching makes explicit what diffusion implicitly assumes - that a path connecting noise to data must be specified. Diffusion’s path is determined by its noise schedule; flow-matching lets you choose any path. The straight-line path of rectified flow (below) is not the path diffusion uses, and the straight-line path is empirically better in many settings.

Sampling is ODE integration, not SDE simulation. Diffusion sampling involves stochastic dynamics (in DDPM’s stochastic form) or deterministic dynamics interpretable through SDE-to-ODE conversion (in DDIM). Flow-matching is deterministic from the start. Standard ODE solvers from numerical analysis apply directly.

Connections to optimal transport. The flow-matching objective is connected to optimal transport (Villani, 2008; Peyré and Cuturi, 2019) - the mathematics of moving probability mass from one distribution to another at minimum cost. The straight-line path is the optimal-transport solution in a specific sense; this gives a principled reason to expect it to work well.

Rectified flow: straightening the paths

Liu, Gong, Liu (2022) “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow” developed a particularly useful instance of the framework.

The observation. The straight-line conditional paths of flow-matching are conditional - they assume we know $x_1$ at training time. The marginal paths (the actual trajectories sampled at inference) are not straight; they curve as samples mix through the population of possible $x_1$’s.

The fix. Rectified flow proposes a reflow procedure: after training a first flow, use it to generate $(x_0, x_1)$ pairs (sample $x_0$ from noise, integrate to get $x_1$). Train a second flow on these pairs, using the same straight-line objective. The second flow’s marginal paths are straighter than the first’s. Iterate.

The result. After 2–3 reflow iterations, the marginal paths become almost straight. Sampling can then use very few integration steps - sometimes a single Euler step from noise to data - with minimal quality loss. This is the one-step generation result that made rectified flow practically important.
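A toy illustration of why reflow enables one-step sampling. The sketch makes strong simplifying assumptions: the data are 1-D Gaussian $\mathcal{N}(m, s^2)$, the first flow is its closed-form marginal velocity rather than a trained network, and the "second flow" is only fitted at $t = 0$ with an affine least-squares model, which happens to suffice in this toy case. A real system would retrain the full network on the generated pairs.

```python
import numpy as np

m, s = 2.0, 0.5                                      # toy data distribution N(m, s^2)

def first_flow(x, t):
    """Closed-form marginal velocity for the toy Gaussian; stands in for a trained first flow."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return m + (t * s**2 - (1.0 - t)) / var_t * (x - t * m)

def integrate(v_model, x0, num_steps):
    """Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * v_model(x, i * dt)
    return x

rng = np.random.default_rng(0)

# Reflow step 1: couple each noise draw with the first flow's own output.
x0 = rng.standard_normal(5_000)
x1 = integrate(first_flow, x0, 200)

# Reflow step 2: regress the straight-line target x1 - x0 on the start point.
slope, intercept = np.polyfit(x0, x1 - x0, 1)        # toy stand-in for retraining v_theta

# One-step generation: a single Euler step from t = 0 to t = 1.
x0_new = rng.standard_normal(20_000)
one_step_reflow = x0_new + (intercept + slope * x0_new)
one_step_first = integrate(first_flow, x0_new, 1)    # same budget, un-rectified flow
```

On this example a single Euler step of the first flow sends every sample to the mean $m$, losing all spread, while the reflowed model recovers the target spread in one step. The contrast comes entirely from the coupling: after reflow each $x_0$ has a single deterministic partner $x_1$, so the straight line between them is the actual trajectory.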

Where flow-matching sits in 2026

By 2026, flow-matching has substantially displaced pure-diffusion in new generative-modelling systems. The empirical advantages:

  • Better sample quality at low step counts. With well-tuned flow-matching, 4–8 step sampling can produce high-quality samples - substantially faster than DDIM-sampled diffusion.

  • Cleaner training dynamics. The simpler loss formulation produces more stable training, especially for large models.

  • One-step generation via rectified flow. When inference speed is paramount, rectified flow’s one-step generation is unmatched.

Notable production systems using flow-matching:

  • Stable Diffusion 3 (Stability AI, 2024) and successors use a flow-matching objective.

  • Flux (Black Forest Labs, 2024) is a major commercial text-to-image system built on flow-matching.

  • Some text-to-video systems (notably ones derived from the Stable Diffusion lineage) have moved to flow-matching.

The framework is not a paradigm shift from diffusion - the architectures, conditioning techniques, classifier-free guidance, and modality-specific engineering carry over almost unchanged. It is a training-objective improvement that is incremental in mechanism but substantial in production impact.

Diffusion-trained models (the bulk of deployed systems in 2026) will remain in use for years; flow-matching is the default for new training runs. The two are compatible enough that “diffusion / flow-matching” is often treated as a single family in informal references.

A unifying conceptual picture

To close the section. The unified view of diffusion, score-matching, and flow-matching:

  • All three train a neural network to predict something about how to move samples from a base distribution to the data distribution.

  • DDPM predicts the noise added at each step.

  • Score-based methods predict the gradient of the log-density at each noise level.

  • Flow-matching predicts the velocity field of an ODE flow.

All three are equivalent up to rescaling for appropriate parameterization choices. The differences are in the noise schedule (or path), the training objective’s reweighting, and the sampling procedure. The choice between them is empirical (which produces better samples? trains more stably? requires fewer sampling steps?), not theoretical.

This unification is the key conceptual maturity of 2024–2026 generative modelling: the framework is one thing with multiple presentations.


§8. Conditional Generation and Guidance

Why conditioning matters

Generative models in production are almost always conditional. Users want to generate an image of a particular thing, a video with a particular subject, a melody in a particular style. Pure unconditional generation - sample anything plausible - has limited applications; conditional generation is what makes generative models useful as tools.

We have already seen two conditioning mechanisms: the conditional VAE (CVAE, §4) and conditional diffusion with classifier-free guidance (§6). This section systematizes the topic, covers the dominant text-conditioning recipes, and develops three deployment patterns (image-to-image, ControlNet, image editing).

Class-conditional models: the simplest case

The simplest setting. Given a discrete class label $c \in \{1, \ldots, K\}$, generate samples from class $c$. Two implementation patterns.

Conditioning via concatenation. Concatenate a one-hot or embedded representation of $c$ to the input of the generator network. The network learns to use this signal to produce class-specific samples. A closely related scheme appears in early class-conditional diffusion (Dhariwal and Nichol, 2021), which combines a class embedding with the timestep embedding and feeds the result into each residual block through adaptive normalization.

Conditioning via cross-attention. Use $c$ (or its embedding) as a key/value in a cross-attention layer; the generator’s intermediate features attend to it. More flexible than concatenation - easily handles multiple conditioning signals and variable-length conditioning (such as text). Now the standard for modern systems.
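A minimal single-head cross-attention sketch in NumPy illustrates the variable-length point. It is deliberately stripped down: there is no output projection, residual connection, or multi-head split, the weights are random rather than learned, and all dimensions are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, cond, W_q, W_k, W_v):
    """features: (n_positions, d_model) from the generator; cond: (n_tokens, d_cond)."""
    q = features @ W_q                               # queries come from the generator features
    k = cond @ W_k                                   # keys come from the conditioning
    v = cond @ W_v                                   # values come from the conditioning
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (n_positions, n_tokens)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_cond, d_head = 32, 16, 8
features = rng.standard_normal((64, d_model))        # e.g. an 8 x 8 feature map, flattened
W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_cond, d_head)) / np.sqrt(d_cond)
W_v = rng.standard_normal((d_cond, d_head)) / np.sqrt(d_cond)

# The same weights handle a 5-token and a 77-token conditioning sequence.
out_short, w_short = cross_attention(features, rng.standard_normal((5, d_cond)), W_q, W_k, W_v)
out_long, w_long = cross_attention(features, rng.standard_normal((77, d_cond)), W_q, W_k, W_v)
```

The output shape depends only on the number of generator positions, not on the length of the conditioning sequence, which is what makes the mechanism a natural fit for text prompts.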

Class-conditional generation on ImageNet (1000 classes) was an early benchmark; BigGAN and class-conditional DDPM both achieved strong results. The interesting cases came when “class” was replaced by text.

Text-to-image: the dominant conditioning recipe

The biggest deployed conditioning task. Given a natural-language prompt $c$ (e.g., “a cat sitting on a windowsill at sunset, oil painting style”), generate an image that matches the prompt.

The recipe, as it stands in 2026:

  1. Text encoder. Embed the prompt into a sequence of feature vectors. Two dominant choices:

    • CLIP text encoder (Radford et al., 2021): a Transformer trained contrastively against images (see SSL §7). Produces text features that are aligned with image features. Used by Stable Diffusion 1/2 and many derivatives.

    • T5 (Raffel et al., 2020): a text-to-text Transformer pretrained on general text. Produces more linguistically rich features but is not image-aligned. Used by Imagen and DeepFloyd IF, and increasingly by newer systems.

    • Modern frontier systems often use both - CLIP for image-text alignment, T5 for rich language understanding.

  2. Cross-attention conditioning. The diffusion U-Net or DiT includes cross-attention layers; the text features serve as keys and values. Each spatial position in the noisy latent attends to relevant tokens in the prompt, allowing the model to align image content with text.

  3. Classifier-free guidance (§6). Trained with random text-dropout; sampled with guidance scale $w \in [3, 15]$ to balance prompt fidelity with diversity.

The pipeline (slightly more detailed than the §6 latent-diffusion diagram):

   TEXT-TO-IMAGE PIPELINE (Stable Diffusion architecture)

   prompt: "a cat at sunset"
       │
       ▼
   text encoder (CLIP / T5)
       │
       ▼
   c = sequence of token embeddings    (conditioning)
       │
       ├──────────────────────┐
       │                      │
   z_T ~ N(0, I)               │
       │                      │
       ▼                      ▼
   ┌───────────────────────────────────────┐
   │  U-Net / DiT diffusion in latent space│
   │                                       │
   │  At each timestep t and each layer:   │
   │    - self-attention over spatial      │
   │      positions in z_t                 │
   │    - cross-attention from z_t to c    │
   │    - feedforward                      │
   │                                       │
   │  With classifier-free guidance:       │
   │    epsilon_tilde = epsilon(z, ∅)      │
   │      + w (epsilon(z, c)               │
   │           - epsilon(z, ∅))            │
   └───────────────────────────────────────┘
       │
       ▼
   z_0 (clean latent)
       │
       ▼
   VAE decoder
       │
       ▼
   x_0 (generated image, 1024x1024)

This is the canonical architecture for text-to-image generation in 2026. Most major systems (Stable Diffusion, Imagen, DALL-E 3, Midjourney, Flux) follow this template or a close variant, differing in the specifics.

The guidance scale and the fidelity-diversity trade-off

The single most user-visible hyperparameter. Classifier-free guidance (§6) extrapolates the conditional prediction away from the unconditional prediction:

$$\tilde{\epsilon}(x_t, t, c) = \epsilon(x_t, t, \emptyset) + w \cdot \big(\epsilon(x_t, t, c) - \epsilon(x_t, t, \emptyset)\big).$$

The guidance scale $w$ controls the trade-off:

  • $w = 0$: unconditional generation. The prompt is ignored.

  • $w = 1$: standard conditional generation. The prompt is respected but the samples are diverse.

  • $w = 5$ to $w = 10$: typical “good” range for text-to-image. Samples are prompt-faithful with reasonable diversity.

  • $w \geq 15$: extreme guidance. Samples become over-saturated, artefact-prone, and lose diversity. Mode collapse around the most “average” interpretation of the prompt.

   FIDELITY vs DIVERSITY TRADE-OFF (schematic)

         fidelity (matches prompt)
            ▲
            │                  *
            │            *
            │       *
            │   *
            │ *           best
            │*           operating
            │*            range
            │ *
            │   *
            │     *  *
            │           *  *  *  *
            └─────────────────────────► diversity
                 w high              w low
                (mode collapse)     (off-prompt)

Users of text-to-image services often have access to this knob; product-grade systems typically default to $w \approx 7$.

Image-to-image: starting from a given image

A second conditioning regime. Given an initial image $x_{\text{init}}$ and a text prompt, produce a new image that resembles $x_{\text{init}}$ but is modified per the prompt.

The recipe (SDEdit, Meng et al., 2021). Encode $x_{\text{init}}$ to a latent; add noise to a moderate level $t^*$ (rather than the full $T$); run the diffusion reverse process from $t^*$ down to 0, conditioned on the text prompt. The choice of $t^*$ controls how much the original image structure is preserved: small $t^*$ preserves most of the original; large $t^*$ produces samples mostly determined by the prompt.

This is the image-to-image slider that users see in tools like Stable Diffusion’s web UI. The same machinery handles photo editing, stylization, and inpainting.
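The first half of the recipe, noising the initial latent to level $t^*$ and listing the reverse steps that remain, can be sketched as follows. The linear schedule, the `strength` parameter (mapping a value in $[0, 1]$ to $t^* = \text{strength} \cdot T$), and the function name `sdedit_start` are assumptions for illustration; the reverse loop itself is the usual conditional sampler started at $t^*$ instead of $T$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                               # assumed linear schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])     # alpha_bar[t], t = 0..T

def sdedit_start(z_init, strength, rng):
    """Noise the init latent to t* = round(strength * T); return it with the reverse timesteps."""
    t_star = int(round(strength * T))
    eps = rng.standard_normal(z_init.shape)
    z_t = np.sqrt(alpha_bar[t_star]) * z_init + np.sqrt(1.0 - alpha_bar[t_star]) * eps
    return z_t, list(range(t_star, 0, -1))           # reverse process runs t*, ..., 1

rng = np.random.default_rng(0)
z_init = rng.standard_normal(4096)                   # stand-in for an encoded image latent
z_low, steps_low = sdedit_start(z_init, 0.2, rng)    # light edit
z_high, steps_high = sdedit_start(z_init, 0.8, rng)  # heavy edit

# How much of the original survives in the starting point of the reverse process.
corr_low = np.corrcoef(z_init, z_low)[0, 1]
corr_high = np.corrcoef(z_init, z_high)[0, 1]
```

At low strength the noised latent is still strongly correlated with the original, so the reverse process can only make local changes; at high strength almost nothing of the original remains and the prompt dominates. Lower strength also means fewer reverse steps to run.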

ControlNet: spatial conditioning

A third conditioning regime. Sometimes the conditioning signal is spatial - a depth map, an edge map, a pose skeleton, a semantic segmentation. We want the generated image to follow that spatial structure while remaining free in colour, texture, and style.

ControlNet (Zhang, Rao, Agrawala, 2023) “Adding Conditional Control to Text-to-Image Diffusion Models” gives a clean architectural recipe. Take a pretrained text-to-image model; duplicate its encoder; train the duplicate to consume a spatial conditioning signal; inject its outputs into the original network’s decoder via zero-initialized convolutions (so that early training does not disrupt the pretrained model).

The result. A ControlNet for “Canny edges” lets users sketch an outline and have the diffusion fill in a prompt-conditioned image that respects the outline. ControlNets for depth, pose, segmentation, scribble, etc. compose with text-to-image generation to give users fine-grained spatial control.
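The role of the zero-initialized convolutions can be shown in isolation. The sketch below models a 1x1 convolution as a per-position linear map over channels and uses random arrays in place of real network features; `ZeroConv1x1` and `decoder_input` are illustrative names, not part of any ControlNet codebase.

```python
import numpy as np

class ZeroConv1x1:
    """A 1x1 convolution (per-position linear map over channels), initialised to zero."""
    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))
        self.bias = np.zeros(channels)

    def __call__(self, x):                           # x: (height, width, channels)
        return x @ self.weight + self.bias

def decoder_input(base_skip, control_feat, zero_conv):
    """Feature from the frozen base model plus the control branch's contribution."""
    return base_skip + zero_conv(control_feat)

rng = np.random.default_rng(0)
base_skip = rng.standard_normal((8, 8, 16))          # stand-in for a pretrained skip feature
control_feat = rng.standard_normal((8, 8, 16))       # stand-in for the control branch output

zero_conv = ZeroConv1x1(16)
at_init = decoder_input(base_skip, control_feat, zero_conv)

zero_conv.weight = 0.1 * np.eye(16)                  # pretend training has moved the weights
after_training = decoder_input(base_skip, control_feat, zero_conv)
```

At initialization the combined feature equals the base feature exactly, so the pretrained model's behaviour is untouched on the first training step; the control signal is blended in only as the zero-conv weights move away from zero.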

ControlNet is engineering-heavy but practically transformative for creative applications. Most production text-to-image systems in 2026 ship with at least a dozen ControlNet variants.

Image editing: prompt-driven changes

A fourth regime. Given a real image and an editing instruction (“change the cat to a dog”, “make it sunset”, “remove the chair”), produce a modified image.

The dominant approach in 2026 is supervised: train or fine-tune the diffusion model on (original image, instruction, edited image) triples, then combine it with inference-time guidance. Specific systems include:

  • InstructPix2Pix (Brooks, Holynski, Efros, 2023). Train on paired data generated by GPT + Stable Diffusion; the resulting model edits real images per instruction.

  • Imagen Editor (Wang et al., 2023). Uses Imagen as backbone with editing-specific conditioning.

  • DALL-E 3 inpainting and outpainting. Selectively regenerate portions of an image.

The editing space is also where text-conditioning shows its current limits - fine-grained edits (“rotate the cat 30 degrees”, “make the third building taller”) are still difficult, and the best systems make mistakes that a human editor would not.

Negative prompts: a cheap-and-useful technique

A practical conditioning technique worth mentioning. Most production text-to-image systems support a negative prompt $c^-$ - text describing what the user does not want in the image. The implementation is classifier-free guidance with the null conditioning replaced by the negative prompt:

$$\tilde{\epsilon}(x_t, t, c, c^-) = \epsilon(x_t, t, c^-) + w \cdot \big(\epsilon(x_t, t, c) - \epsilon(x_t, t, c^-)\big).$$

Reading this. Replace the empty unconditional with a negative-prompt unconditional. The model extrapolates away from $c^-$ as it extrapolates toward $c$. Setting $c^- =$ “blurry, low quality, deformed” pushes samples away from these undesirable features.

Negative prompts are widely used in production text-to-image and require no architectural changes - just an inference-time substitution.

Multi-modal and multi-signal conditioning

Modern systems combine multiple conditioning signals: text + depth map + reference image + style image. The architectural approach is to inject each via cross-attention or feature-fusion at different layers; the result is a system that respects multiple constraints simultaneously.

The most general framing: a conditional diffusion model $\epsilon_\theta(x_t, t, c)$ where $c$ is any structured conditioning input (text, image, audio, structured data). The pretraining recipe of large-scale text-conditioned diffusion + targeted ControlNet-style additions for new modalities has produced the bulk of 2026 production systems.

Where conditioning fits in 2026

Three observations to close the section:

  1. Text conditioning is mostly solved at the engineering level. Cross-attention + CLIP/T5 text encoders + classifier-free guidance is a robust recipe. New text-conditional diffusion systems work well out of the box; the engineering effort is now mostly in scaling and refinement.

  2. Spatial conditioning via ControlNet is engineering-heavy but powerful. Each ControlNet must be trained; the recipe is well-established but the work is not zero.

  3. Editing remains a research frontier. Fine-grained, semantically-aware editing is harder than the high-level “prompt-to-image” task. This is where current systems most visibly fall short of human capability.

OP-GM-3 (compositional generalization) is the open problem that underlies many of conditioning’s failure modes: current systems handle common combinations of concepts well but compose novel combinations poorly. We return to this in §13.


§9. Modality-Specific Architectures and Practice

The generative-modelling frameworks of §3–§7 are modality-agnostic - the math is the same whether $x$ is an image, an audio clip, a 3D mesh, or a molecule. The architecture and engineering differ substantially by modality. This section surveys the dominant modalities, their characteristic architectural choices, and the production-grade systems of 2026.

Image generation

The most-developed modality. The dominant 2026 recipe is latent diffusion (§6) or latent flow-matching (§7) with a Transformer-based denoising network.

The standard pipeline:

  • VAE encoder/decoder trained separately, frozen during diffusion training. Compresses pixel images to a much-smaller latent (Stable Diffusion: $512 \times 512 \times 3 \to 64 \times 64 \times 4$, a $48\times$ reduction; SD3 and Flux keep the $8\times$ spatial downsampling but widen the latent to 16 channels, trading compression for reconstruction quality).

  • Text encoder (CLIP + T5 in modern systems). Embeds the prompt as a sequence of token features.

  • Denoising network operating in latent space. Two architectural choices:

    • U-Net with cross-attention to text features. Standard through 2023; still in production use (Stable Diffusion 1.x, 2.x).

    • Diffusion Transformer (DiT) (Peebles and Xie, 2023). Operates on patches of the latent; conditioning via cross-attention or adaptive layer norm (AdaLN). Better scaling; the dominant choice in newer systems (Stable Diffusion 3, Flux, Sora-style image backbones).

  • Sampler. DDIM (20–50 steps) for diffusion-trained systems; Euler or higher-order ODE solvers (10–30 steps) for flow-matching-trained systems.

The notable 2026 commercial systems for text-to-image:

  • Midjourney v6/v7. Proprietary; visually distinctive; closed.

  • DALL-E 3 / GPT-4o image generation. Proprietary; tightly integrated with the GPT chat experience.

  • Stable Diffusion 3 / SDXL. Open weights; the most-used open system.

  • Flux (Black Forest Labs). Open weights; flow-matching-based; high quality.

  • Imagen 3 / Imagen 4. Google’s frontier system.

  • Firefly (Adobe). Trained on stock-photography data; positioned for commercial use with clean licensing.

The technical state. High-resolution photorealistic generation is largely solved at production quality; the open problems are now in (a) compositional generalization (OP-GM-3), (b) precise prompt control, (c) generation speed, and (d) memorization detection.

Video generation

The breakout modality of 2024–2026. Sora (OpenAI, early 2024) demonstrated minute-length, coherent, high-quality text-to-video generation. Veo (Google), Gen-3 (Runway), and Mochi (Genmo) followed.

The architectural choices for video extend image generation to space-time:

  • Patches in space-time. Sora and successors tokenize video into $16 \times 16 \times t$ patches for some temporal extent $t$ (typically 1–4 frames). The diffusion model operates on sequences of these patches.

  • 3D VAE. A specialized VAE that handles temporal compression - typically compressing $T \times H \times W \times 3$ videos to $T/4 \times H/8 \times W/8 \times C$ latents.

  • Temporal attention. The denoising network has both spatial attention (within a frame) and temporal attention (across frames). Sora-style systems use joint spatiotemporal attention; cheaper alternatives factorize.

  • Long-video conditioning. Sora generates up to one minute of coherent video; the recipe uses progressively longer windows during training and inference, with strong KV-caching to handle temporal context.
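
To make the shapes in the list above concrete, the sketch below computes a 3D-VAE latent shape and a space-time patch count. The clip size and the latent channel count are hypothetical, and the source does not say whether patching is applied to pixels or to latents, so `num_spacetime_patches` simply counts patches over whatever grid it is given.

```python
def video_latent_shape(frames, height, width, channels, t_down=4, s_down=8):
    """Latent shape for a 3D VAE with the T/4, H/8, W/8 factors quoted above."""
    return (frames // t_down, height // s_down, width // s_down, channels)

def num_spacetime_patches(frames, height, width, patch_hw=16, patch_t=1):
    """Number of patch_hw x patch_hw x patch_t patches tiling a clip."""
    return (frames // patch_t) * (height // patch_hw) * (width // patch_hw)

# Hypothetical clip: 64 frames at 512 x 512, with C = 16 latent channels (assumed).
print(video_latent_shape(64, 512, 512, 16))        # (16, 64, 64, 16)
print(num_spacetime_patches(64, 512, 512, 16, 4))  # 16384
```

The patch count is the sequence length the denoising network attends over, which is why joint spatiotemporal attention becomes expensive as clips get longer.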

The current state. Video generation has improved dramatically from 2023 to 2026 but still exhibits visible artefacts on hard cases: rare interactions (a person stirring a cup of coffee), persistent identity (the same person across long shots), and physical realism (objects respecting gravity and contact). These are not fully solved and are active research areas.

Audio generation

Three regimes worth distinguishing.

Speech synthesis. Text-to-speech is largely solved at production quality. The dominant recipes:

  • Autoregressive (WaveNet line, Tortoise TTS, XTTS). High quality, slow.

  • Diffusion (NaturalSpeech 3). High quality, faster sampling.

  • Flow-based and modern flow-matching variants (Voicebox). Increasingly used.

Music generation. Less mature than speech but rapidly improving.

  • MusicLM (Google, 2023): hierarchical autoregressive on audio tokens.

  • MusicGen (Meta, 2023): Transformer on EnCodec tokens.

  • Suno and Udio (2023–2024): commercial systems producing high-quality vocal-and-instrumental music from text prompts.

  • Stable Audio: diffusion-based commercial music generation.

General audio. Generation of arbitrary audio (sound effects, ambient soundscapes) is less mature. AudioLM (Google), AudioGen (Meta), and others use autoregressive Transformers on discrete audio tokens.

The dominant pattern across audio modalities: audio is tokenized via a learned neural codec (EnCodec, SoundStream, DAC), then either autoregressive or diffusion is applied in the discrete-token or continuous-latent space. The codec is the engineering substrate that makes audio generation tractable.
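
The discrete-token route can be illustrated with a toy nearest-codebook lookup. This is only a sketch of the tokenize/detokenize interface: the codebook here is fixed by hand, whereas real codecs such as EnCodec, SoundStream, and DAC learn the encoder, the codebooks, and the decoder jointly and use more elaborate quantizers.

```python
import math

def tokenize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: math.dist(f, codebook[k]))
            for f in frames]

def detokenize(tokens, codebook):
    """Reconstruct frames by looking tokens up in the codebook."""
    return [codebook[t] for t in tokens]

# Toy 2-D codebook and frames (hypothetical values, not real audio features).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.9)]
tokens = tokenize(frames, codebook)
print(tokens)  # [0, 1, 2, 3]
```

An autoregressive model is then trained on the integer sequence `tokens`; a diffusion model would instead operate on the continuous frame vectors before quantization.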

3D generation

The most-active research frontier in 2026 (still less mature than image or video).

NeRF (Neural Radiance Fields) (Mildenhall et al., 2020). Represents a 3D scene as a continuous function $f_\theta(x, y, z, \theta_{\text{view}}, \phi_{\text{view}}) \to (r, g, b, \sigma)$ - colour and density at each spatial point. Rendered by ray-marching. Originally trained per-scene; later generalized to generative settings.
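
Ray-marching can be sketched as front-to-back compositing of samples along a ray, following the quadrature used by Mildenhall et al. (2020): each sample contributes its colour weighted by its own opacity and by the transmittance of everything in front of it. In this sketch the colours and densities are supplied directly rather than produced by a trained network, and the function name is illustrative.

```python
import math

def render_ray(densities, colours, deltas):
    """Composite samples along one ray, front to back.

    densities: sigma_i >= 0 at each sample
    colours:   (r, g, b) at each sample
    deltas:    length of the ray segment represented by each sample
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, colour, delta in zip(densities, colours, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for ch in range(3):
            pixel[ch] += weight * colour[ch]
        transmittance *= 1.0 - alpha
    return tuple(pixel)

# Empty space in front of an opaque green surface renders as green.
print(render_ray([0.0, 1e9], [(1, 0, 0), (0, 1, 0)], [0.5, 0.5]))  # (0.0, 1.0, 0.0)
```

Because every operation here is differentiable in the densities and colours, the rendered pixel can be compared with a photograph and the error back-propagated into the scene representation.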

Gaussian Splatting (3DGS) (Kerbl et al., 2023). Represents a scene as a cloud of 3D Gaussians with position, covariance, colour, and opacity. Renders faster than NeRF, simpler optimization. Has substantially displaced NeRF for new 3D systems.

Diffusion-based 3D generation. Several techniques.

  • DreamFusion (Poole et al., 2022). Use a 2D text-to-image diffusion model as a score for optimizing a 3D representation (NeRF or 3DGS) - the Score Distillation Sampling (SDS) approach. Slow (each generation is an optimization) but produces 3D-consistent results.

  • Direct 3D diffusion. Diffusion models trained directly on 3D representations (point clouds, voxels, implicit fields). Less mature; growing rapidly.

  • Multi-view diffusion + reconstruction. Generate multiple consistent views with a 2D diffusion model conditioned on camera parameters; reconstruct 3D from the views.

The current state. 3D generation is rapidly improving but is not as polished as image or video. The dominant approach for high-quality results is still optimization-based (SDS + Gaussian Splatting) at substantial inference cost.

Molecules and proteins

A modality where generative models have produced scientifically significant results.

AlphaFold 2 / 3 (Jumper et al., 2021; Abramson et al., 2024). Structure prediction from sequence. Not a generative model in the “sample diverse outputs” sense, but a conditional generative model of structure given sequence. AlphaFold 3 extends to protein-ligand, protein-protein, and protein-nucleic-acid complexes. The architectural keys differ by version: AlphaFold 2 pairs the Evoformer with a structure module whose invariant point attention respects the rotation-and-translation symmetries of 3D space; AlphaFold 3 replaces the Evoformer with the Pairformer module, which handles pairwise residue interactions, and generates atomic coordinates with a diffusion module.

RFdiffusion (Watson et al., 2023). Generative model for protein backbones - sample new protein shapes conditional on functional constraints. Used for de novo protein design. The base recipe is diffusion on protein backbone coordinates with an SE(3)-equivariant denoising network.

Chroma (Ingraham et al., 2023). Programmable protein generation with conditioning on symmetry, shape, function. Same general framework as RFdiffusion with refined conditioning.

Molecular generation systems. A separate line of work generates small molecules (drug candidates, materials components). Techniques include equivariant graph diffusion, fragment-based generation, and equivariant flows.

The architectural common feature: equivariance. Models for 3D molecular and protein data must respect the symmetries of physical space - a rotated and translated input should produce a correspondingly rotated and translated output. SE(3)-equivariant neural networks (Thomas et al., 2018; Geiger and Smidt, 2022) are the standard substrate. The AI for Science chapter (planned) develops the equivariance machinery in domain depth.
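
The equivariance property can be stated as a numerical check: applying a rigid motion before the model should give the same result as applying it after. The sketch below tests this for a toy map (pulling every point halfway toward the centroid), not for a real SE(3)-equivariant network, and it checks a single rotation about the z-axis rather than proving the property for all motions. All names are illustrative.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def contract(points):
    """Toy equivariant map: move every point halfway toward the centroid."""
    c = centroid(points)
    return [tuple((p[i] + c[i]) / 2 for i in range(3)) for p in points]

def rigid_transform(points, angle, shift):
    """Rotate about the z-axis by `angle` radians, then translate by `shift`."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y + shift[0],
             sa * x + ca * y + shift[1],
             z + shift[2]) for x, y, z in points]

def is_equivariant(f, points, angle, shift, tol=1e-9):
    """Check f(g . x) == g . f(x) for one rigid motion g."""
    lhs = f(rigid_transform(points, angle, shift))
    rhs = rigid_transform(f(points), angle, shift)
    return all(math.isclose(a, b, abs_tol=tol)
               for p, q in zip(lhs, rhs) for a, b in zip(p, q))

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
print(is_equivariant(contract, atoms, 0.7, (1.0, -2.0, 0.5)))  # True
```

A map that treats one coordinate axis specially, such as squaring every x-coordinate, fails the same check; equivariant architectures are built so that this failure cannot occur.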

Tabular and time-series generation

Worth a brief mention. Generating synthetic tabular data - for privacy-preserving data sharing, missing-value imputation, augmentation - is a substantial industrial application. Dominant techniques: GANs (CTGAN), VAEs (TVAE), and increasingly diffusion (TabDDPM). The challenges are mixed data types (numeric and categorical in the same row) and small dataset sizes (tabular data is rarely as large as image data).

Time-series generation (financial markets, sensor data, biological signals) uses similar techniques. The challenges are temporal coherence (long-range structure that’s hard to capture) and multivariate dependencies (correlated time series with structured interactions).

Multimodal native generation

A 2024–2026 development worth flagging. Some frontier systems generate across multiple modalities natively - text-and-image in the same model, or text-and-image-and-audio. Examples:

  • GPT-4o (OpenAI, 2024). Generates text, images, and audio in a unified autoregressive framework.

  • Chameleon (Meta, 2024). Mixed-modal autoregressive on interleaved text and image tokens.

  • Gemini (Google). Native multimodal input and output.

The architectural approach: tokenize each modality (text via subword, image via VQ-VAE, audio via EnCodec), interleave the tokens into a single sequence, and apply a Transformer autoregressively. This is the unified-tokens approach. Diffusion-based equivalents (joint image-text diffusion, etc.) exist but are less dominant in 2026 frontier systems.
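
A minimal sketch of the interleaving step follows. The vocabulary sizes and the begin/end-of-image markers are assumptions made for illustration; they are not the published configuration of GPT-4o, Chameleon, or Gemini.

```python
TEXT_VOCAB = 32_000      # assumed subword vocabulary size
IMAGE_VOCAB = 8_192      # assumed VQ-VAE codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB      # hypothetical begin-of-image marker
EOI = BOI + 1                       # hypothetical end-of-image marker

def interleave(segments):
    """Flatten [("text", ids), ("image", codes), ...] into one id sequence.

    Image codes are shifted past the text vocabulary so that the two
    modalities occupy disjoint id ranges in a shared embedding table.
    """
    sequence = []
    for modality, ids in segments:
        if modality == "text":
            sequence.extend(ids)
        elif modality == "image":
            sequence.append(BOI)
            sequence.extend(TEXT_VOCAB + code for code in ids)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return sequence

seq = interleave([("text", [5, 17]), ("image", [0, 8191]), ("text", [42])])
print(seq)  # [5, 17, 40192, 32000, 40191, 40193, 42]
```

Once everything is a single id sequence, the model needs no modality-specific machinery beyond the tokenizers: next-token prediction covers text and image positions alike.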

The Multimodal Models chapter (planned) develops this in depth.

Cross-cutting practical issues

Two engineering issues that arise in every modality:

Data curation matters more than algorithmic choice. A diffusion model trained on a small, well-curated dataset typically beats one trained on a large, uncurated dataset. The single biggest practical determinant of generative-model quality is the data. Specific issues: deduplication (highly-replicated training examples cause memorization), copyright filtering, quality filtering, and recaptioning (rewriting messy alt-text into well-structured prompts; the DALL-E 3 and SD3 recipe).

Conditioning quality matters more than model size. A smaller model with high-quality text encoders and conditioning often outperforms a larger model with poor conditioning. Investment in the text encoder (T5, CLIP, both), in the conditioning architecture (cross-attention vs AdaLN), and in negative prompts pays off more than equivalent investment in scaling the denoising network.

These two observations are unifying lessons across modalities and are worth emphasizing because they are sometimes obscured by the architectural detail.


§10. Evaluating Generative Models

Evaluating generative models is hard. We have no single scalar that captures “is this model good.” This section develops the standard metrics, their failure modes, and the 2026 state of the evaluation question.

The fundamental difficulty

The evaluation question has no clean form. For a classifier, we ask: how often is it correct on a held-out test set? The answer is a single number (accuracy or its refinements). For a generative model, the analogous question - how well does it model the data? - admits no such clean answer. The model is producing new samples, not labelling existing ones; “correct” is not the right word.

Three related-but-distinct questions a generative-model evaluation might try to answer:

  1. Sample quality. Are the model’s samples individually high-quality? Photorealistic, coherent, prompt-faithful?

  2. Distribution coverage. Does the model produce a diverse set of samples that covers the true data distribution, not just a narrow mode?

  3. Likelihood. Does the model assign high probability to held-out real data?

A generative model can excel at one and fail at another. A GAN with mode collapse has excellent (1) and terrible (2). An over-regularized VAE has decent (2) and (3) but blurry samples. A perfect-density-estimation autoregressive model might have great (3) and excellent (2) but unimpressive (1) on aesthetic dimensions.

The evaluation problem is choosing which questions to ask and how to operationalize each.

Likelihood-based evaluation

For models that can evaluate density (autoregressive, exact-flow, VAE-ELBO-lower-bound), the natural metric is the log-likelihood on held-out data. For autoregressive language models this is perplexity:

$$\text{perplexity}(\theta; \mathcal{D}_{\text{test}}) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)\right).$$

Reading this. Perplexity is the effective branching factor - the reciprocal of the geometric mean of the probability the model assigns to each test token, with $N$ counting tokens and each $x_i$ scored given its preceding context. Lower is better; perplexity 1 means the model assigns probability 1 to every test token (perfect).
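
The definition translates directly into code. The sketch assumes the per-token natural-log probabilities have already been obtained from a model; the function name is illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that spreads probability uniformly over 4 choices at every step
# has an effective branching factor of 4.
uniform_over_4 = [math.log(0.25)] * 10
print(perplexity(uniform_over_4))
```

The uniform case prints a value of approximately 4, matching the branching-factor reading.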

For images: bits-per-dimension (BPD), the negative log-likelihood per pixel-channel divided by $\log 2$. Lower is better.
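
As a sketch of the unit conversion, assuming the model reports a per-image negative log-likelihood in nats (the function name is illustrative):

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

dims = 32 * 32 * 3  # a 32 x 32 RGB image has 3072 pixel-channels
# A model that is uniform over 256 intensity levels pays log(256) nats per dimension.
uniform_nll = dims * math.log(256)
print(bits_per_dim(uniform_nll, dims))
```

The uniform baseline comes out at approximately 8 bits per dimension, so any reported BPD below 8 on 8-bit images reflects structure the model has learned.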

Strengths of likelihood. It is a well-defined probabilistic measure of model fit; it has rigorous theoretical foundations; it does not require expensive sample generation.

Weaknesses of likelihood:

  • Not all models support it. GANs, score-based methods (without explicit density formulation), and some diffusion variants do not give a usable density.

  • Likelihood and sample quality are decoupled. Theis, van den Oord, Bethge (2016) “A Note on the Evaluation of Generative Models” gave the canonical critique: high likelihood is neither necessary nor sufficient for good sample quality. A model with strong likelihood can produce visually terrible samples; a model with poor likelihood can produce visually excellent samples.

  • Sensitivity to dimensionality. For high-dimensional data, likelihood values are dominated by easy-to-model dimensions and may miss interesting structure.

For text generation, perplexity is widely used as one signal alongside others. For image generation, likelihood (when available) is rarely the primary metric.

FID: the standard image-quality metric

The dominant metric for image-generation quality. Fréchet Inception Distance (Heusel et al., 2017).

The recipe. Take a pretrained Inception V3 network trained on ImageNet classification. Pass real images and generated images through it, extracting features from the penultimate pooling layer. Fit a multivariate Gaussian to each set of features (mean $\mu$ and covariance $\Sigma$). The FID is the Fréchet distance between the two Gaussians:

$$\text{FID} = \|\mu_{\text{real}} - \mu_{\text{gen}}\|^2 + \text{Tr}(\Sigma_{\text{real}} + \Sigma_{\text{gen}} - 2(\Sigma_{\text{real}} \Sigma_{\text{gen}})^{1/2}).$$

Lower FID is better. A FID of 0 would mean the generated and real feature distributions are identical (under the Gaussian approximation).
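
The Fréchet distance itself is a few lines once the features are in hand. The sketch below assumes NumPy is available and takes feature arrays as given; it does not run an Inception network. It computes the trace of the matrix square root from eigenvalues, which is one of several equivalent ways to evaluate that term.

```python
import numpy as np

def gaussian_stats(features):
    """Mean vector and covariance matrix of an (n_samples, dim) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Tr((sigma_r sigma_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative for
    # positive semi-definite inputs; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)

# Two unit-covariance Gaussians whose means differ by (3, 4): only the mean term remains.
identity = np.eye(2)
print(frechet_distance(np.zeros(2), identity, np.array([3.0, 4.0]), identity))
```

With equal covariances the trace term cancels and the result is the squared mean shift (approximately 25 here); with equal means and a generated covariance that is too small, the trace term alone is positive, which is the diversity penalty.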

What FID measures. Two things implicitly: (a) sample quality - bad samples produce features unlike any real images, shifting $\mu_{\text{gen}}$; (b) sample diversity - mode-collapsed generators produce features clustered in a small region, shrinking $\Sigma_{\text{gen}}$ and increasing the Fréchet distance. FID captures both in a single scalar.

Why FID became dominant. It correlates reasonably well with human judgement on image quality; it is straightforward to compute; it works for any image-generation model (no need for likelihood or particular model structure). It became the standard reporting metric in image-generation papers from 2017 onward.

FID’s limitations

By 2026, FID’s limits are well-understood and worth flagging:

  • Inception network bias. FID measures distance in ImageNet-Inception feature space. This space encodes a specific notion of what makes images “different” - one shaped by ImageNet’s 1000-class taxonomy. For generation of images outside ImageNet’s distribution (medical images, satellite imagery, abstract art), FID can be misleading.

  • Sample-size sensitivity. FID converges slowly with sample size; small differences in FID between models can vanish with larger sample sets. Best practice is to evaluate with 10K–50K generated samples and a similarly-sized real reference.

  • Saturation at high quality. As image-generation systems improve, FID values approach zero and small differences become hard to interpret. Two systems with FID 2.5 and 3.0 may produce visibly different samples or essentially identical ones; FID can’t tell.

  • Prompt-following blind. FID measures distribution similarity; it doesn’t measure whether a text-to-image model actually follows the prompt. Two models with the same FID might have very different prompt-fidelity.

Inception Score and other older metrics

Inception Score (Salimans et al., 2016) preceded FID. Computes a single-image quality score plus a diversity score using the same Inception network. Critiqued by Barratt and Sharma (2018) and Borji (2019); largely superseded by FID. Mentioned here for historical context.
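
For reference, the score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all generated images. The sketch below takes the class probabilities as given (in practice they come from the Inception classifier); the function name is illustrative.

```python
import math

def inception_score(class_probs):
    """exp of the mean KL(p(y|x) || p(y)) over samples.

    class_probs: one list of class probabilities per generated image.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in class_probs:
        # Terms with p[j] == 0 contribute 0 by the usual convention.
        mean_kl += sum(pj * math.log(pj / marginal[j])
                       for j, pj in enumerate(p) if pj > 0) / n
    return math.exp(mean_kl)

confident_and_diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
collapsed = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(inception_score(confident_and_diverse), inception_score(collapsed))
```

Confident predictions spread evenly over the classes give a score near the number of classes (about 3 here); a collapsed generator scores 1, the minimum.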

CLIP-score: prompt fidelity

For text-to-image models, FID does not measure whether samples follow the prompt. CLIP-score addresses this. Use the CLIP text encoder to embed the prompt and the CLIP image encoder to embed the generated image; the cosine similarity between them measures how well the image matches the prompt according to CLIP.
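
The scoring step is a cosine similarity between two embedding vectors. In the sketch below the embeddings are placeholder numbers; in practice they would be produced by the CLIP text and image encoders, which are not loaded here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder embeddings (hypothetical values, not real CLIP outputs).
text_embedding = [0.2, 0.9, 0.1]
image_embedding = [0.25, 0.8, 0.0]
score = cosine_similarity(text_embedding, image_embedding)
print(score)
```

Because the score depends only on direction, not magnitude, it is comparable across prompts of different length or images of different resolution, provided both embeddings come from the same CLIP model.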

CLIP-score is widely used alongside FID. It has its own limitations - CLIP itself is a model with biases, and high CLIP-score does not guarantee aesthetic quality - but it captures the prompt-fidelity dimension that FID misses.

Human preference: the ground truth

For practical purposes, the most reliable evaluation is human pairwise preference. Show humans pairs of generated images (or videos, or audio) and ask which they prefer. Aggregate over many comparisons to produce an Elo rating or preference matrix.
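
One common way to aggregate pairwise votes is the classic Elo update, sketched below. The K-factor of 32 and the 400-point scale are the conventional chess defaults, used here as assumptions; a given leaderboard may use different constants or fit a different preference model altogether.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    score_a = 1.0 if a_wins else 0.0
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

print(elo_update(1500.0, 1500.0, a_wins=True))  # (1516.0, 1484.0)
```

The update is zero-sum and shrinks when the outcome was already expected, so a strong model gains little from beating a weak one and loses a lot from an upset.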

Two landmark uses:

  • Human Preference Score (HPS) (Wu et al., 2023) - a learned model trained on human preferences. Used as an automatic substitute for actual human evaluation.

  • LMSYS Chatbot Arena (and its image-generation counterparts). Public deployments where users compare model outputs head-to-head. Produces Elo-style leaderboards that are increasingly treated as authoritative.

Human preference is expensive but is the gold standard. Automated metrics (FID, CLIP-score, HPS) are useful proxies but should be checked against human evaluation when the stakes warrant.

Diversity-fidelity trade-off and the precision-recall framing

Many metrics (FID especially) collapse two dimensions into one scalar. Sajjadi et al. (2018) “Assessing Generative Models via Precision and Recall” proposed a two-dimensional alternative:

  • Precision. What fraction of generated samples lie in the real-data distribution? (High precision = good sample quality; the model produces things that look like real data.)

  • Recall. What fraction of the real-data distribution is covered by generated samples? (High recall = good diversity; the model produces a wide range of things.)

A precision-recall pair gives a richer picture than FID alone. Kynkäänniemi et al. (2019) “Improved Precision and Recall Metric for Generative Models” gave a practical algorithm using $k$-nearest-neighbour computations in feature space. Increasingly used in image-generation evaluation.
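
The nearest-neighbour construction can be sketched in a few lines: each set's manifold is approximated by balls around its points, with each radius set to the distance to that point's k-th nearest neighbour in the same set. The version below is a brute-force illustration on toy 2-D points; a real evaluation would use features from a pretrained network and many more samples.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of queries inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    inside = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return inside / len(queries)

def precision_recall(real, generated, k=3):
    precision = coverage(generated, real, k)   # generated samples on the real manifold
    recall = coverage(real, generated, k)      # real samples on the generated manifold
    return precision, recall

# Real data has two modes; the mode-collapsed generator reproduces only one.
real = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
collapsed = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(precision_recall(real, collapsed, k=2))  # (1.0, 0.5)
```

The toy result separates the two failure types that FID merges: everything the collapsed generator produces is realistic (precision 1.0), but it covers only half of the real data (recall 0.5).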

Memorization detection

A growing concern in 2026. Generative models trained on large datasets can memorize training examples and reproduce them at inference. For copyrighted training data this is a legal problem; for privacy-sensitive data (medical images, faces) it is a privacy problem.

Carlini et al. (2023) “Extracting Training Data from Diffusion Models.” Showed that Stable Diffusion can be made to regurgitate exact training images for specific prompts that match the training caption.

Detection techniques:

  • Nearest-neighbour search. For each generated sample, find the nearest training example in feature space; if the distance is below a threshold, flag as memorization.

  • Membership inference. Test whether the model’s probability of a candidate $x$ is anomalously high (suggesting it was in the training set).

  • Targeted prompt probing. For known training captions, test whether the model generates the corresponding training image.
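
The nearest-neighbour check from the list above can be sketched as follows. The feature vectors and the threshold are hypothetical; in practice the features would come from a pretrained embedding model, the threshold would be calibrated on known duplicates, and the brute-force search would be replaced by an approximate index.

```python
import math

def flag_memorized(generated, training, threshold):
    """Return indices of generated samples whose nearest training example
    lies closer than `threshold` in feature space."""
    flagged = []
    for i, g in enumerate(generated):
        nearest = min(math.dist(g, t) for t in training)
        if nearest < threshold:
            flagged.append(i)
    return flagged

# Toy 2-D features (hypothetical values).
training = [(0.0, 0.0), (5.0, 5.0)]
generated = [(0.01, 0.0), (2.5, 2.5), (5.0, 4.99)]
print(flag_memorized(generated, training, threshold=0.1))  # [0, 2]
```

Flagged samples are candidates for manual review rather than confirmed copies: a small feature-space distance can also arise from generic content that many training images share.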

Modern training practices include deduplication of training data (CLIP-based near-duplicate removal) to reduce memorization rates. The trade-off: more aggressive deduplication reduces memorization but may also reduce sample quality. The balance is an active area of practical research.

The benchmarking problem in 2026

Four issues that affect how generative-model evaluation is read in 2026:

Benchmark saturation. On standard image-generation benchmarks (ImageNet-1K class-conditional, MS-COCO text-to-image), frontier systems achieve FID values so low that the metric no longer distinguishes them meaningfully. The community has shifted toward harder benchmarks (DrawBench, GenEval, T2I-CompBench for compositional generation) and head-to-head human comparison.

Multi-dimensional reporting. Modern papers report multiple metrics together: FID + CLIP-score + Precision-Recall + human preference. No single metric is taken as definitive.

Real-world preference vs benchmark metrics. The most-used image-generation systems (Midjourney, DALL-E 3, Imagen 3) are not always the FID-best - they are optimized for user preference, which weighs aesthetic appeal and prompt fidelity in ways FID does not capture. The disconnect between research benchmarks and deployed-system quality is real and known.

Modality-specific evaluation gaps. Video evaluation is much less mature than image evaluation; current best practices (FVD - Fréchet Video Distance, FID applied per-frame, human evaluation) have known limitations. Audio evaluation similarly varies by sub-modality. 3D and molecular evaluation are even less standardized.

Where evaluation sits in 2026

The honest summary. Generative-model evaluation has improved substantially but remains the most difficult engineering problem in generative modelling. The current state of practice:

  • For images: FID + CLIP-score + human preference (Chatbot Arena style) is the standard reporting suite.

  • For video: FVD + human preference. Less standardized.

  • For audio: domain-specific metrics + human preference.

  • For text: perplexity (for likelihood) + task-specific automated metrics + human preference.

  • For 3D and molecules: domain-specific scientific evaluation (RMSD for protein structure, etc.).

Across all modalities, human evaluation is the gold standard and automated metrics are useful proxies. As frontier models continue to improve and existing metrics saturate, the gap between what automated metrics report and the quality differences that matter will continue to widen; new evaluation techniques will be needed to fill the gap. OP-GM-2 (evaluation that aligns with human judgement) is the open problem this section centres on.


§11. Connections to Other Chapters

This chapter is connected to many others - generative models are one of the two foundational paradigms (alongside discriminative SSL) on which Foundation Models are built, and they appear directly or indirectly in most modern AI systems. The cross-references below are dependency statements, not pointers.

  • Self-Supervised Learning §4 develops generative SSL objectives - masked language modelling, span corruption, MAE, denoising/diffusion-as-SSL. This chapter develops the generative-modelling lens on the same machinery: how to use it to sample. The two chapters cover overlapping technique with different emphases. Recommended to read SSL §4 first if encountering generative modelling for the first time.

  • Large Language Models is the LM-specific instantiation of autoregressive generative modelling (§3 of this chapter). LLM §3 develops the decoder-only Transformer architecture; LLM §6 develops decoding strategies (greedy, beam, nucleus, contrastive) - all autoregressive-generation specifics. This chapter develops autoregressive as a family and refers to the LM chapter for the LM-specific detail.

  • Foundation Models §3 lists generative models as one of two dominant FM substrates (the other being discriminative SSL). FM §6 covers scaling laws which apply to generative models; FM §7 covers adaptation methods that can include preference-based training of generative models. The Foundation Models chapter is the spine; this chapter develops one of the two ribs.

  • Deep Learning provides the architectural building blocks: U-Net (DL §4), Transformers (DL §6), equivariant networks (referenced for §9 of this chapter). This chapter assumes the architectural material rather than re-developing it.

  • Multimodal Models (planned) develops cross-modal systems - text-to-image, image-to-text, audio-text, vision-language-action. The text-to-image conditioning of §8 is the dominant cross-modal generative technique; the planned chapter will develop the broader cross-modal landscape.

  • Reinforcement Learning §10 develops RLHF/DPO/GRPO as post-training for generative models. The connection: a generative model trained with maximum likelihood (this chapter) can be further trained to align with human preferences (RL §10). RLHF as applied to text-to-image models is an active research area; the framework comes from RL.

  • AI for Science (planned) develops scientific generative models in domain depth - protein design (RFdiffusion, Chroma), molecular generation, materials, mathematics. The generative-modelling techniques are this chapter’s content; the domain-specific architectures (equivariance, structure-prediction backbones) and benchmarks live in the AI for Science chapter.

  • Evaluation (planned) is the cross-cutting chapter on benchmarking. §10 of this chapter sketches generative-model evaluation specifically; the broader evaluation methodology lives in the planned chapter.

  • Theoretical Foundations of Learning §8 develops the modern generalization puzzle. Generative models inherit the same puzzle: why does a high-capacity diffusion model generalize beyond its training set rather than memorizing it? OP-GM-7 (memorization vs novel generation) is the generative-model-specific instance.

  • Mechanistic Interpretability (planned) studies internal representations of generative models. Diffusion U-Nets and DiTs have been the subject of substantial interpretability research (e.g., what features at which timesteps?); the techniques mirror those used for discriminative networks but adapted for the iterative-denoising setting.

  • Alignment / Ethics treats memorization, copyright, deepfakes, watermarking, and the broader societal implications of generative AI as central concerns. OP-GM-4 (watermarking and provenance) and OP-GM-7 (memorization) are the open problems that the Alignment chapter develops in depth.


§12. Critiques and Alternative Perspectives

This section presents critiques of generative AI as critiques - substantive intellectual positions held by working researchers and observers, not strawmen. The chapter does not adjudicate (per the project’s editorial stance, ADR-0004); the critiques are organized by what they are critiquing.

Memorization and copyright

The most-publicized critique of generative AI: the models memorize substantial portions of their training data and can reproduce copyrighted material in their outputs. Carlini et al.'s (2023) extraction of training images from Stable Diffusion is the canonical empirical demonstration; the New York Times v. OpenAI lawsuit (2023) is the canonical legal one.

Two distinct concerns. Quantitative memorization: how often does a generative model reproduce a verbatim training example? Empirically, the rate is non-zero but small for most prompts; specific prompts can elicit specific memorized examples. Qualitative memorization: do the model’s outputs derivatively depend on copyrighted training material in a way that constitutes copying? The legal answer is unsettled and varies by jurisdiction; the technical answer is essentially yes (the model would not produce the outputs it does without having been trained on the data, by construction).

The critique here is that training-data licensing should be respected at training time - current frontier systems train on enormous web-scraped datasets without per-asset licensing, and the legal-and-ethical implications are real. Mitigations include opt-out mechanisms (the OpenAI image-classifier opt-out), licensed training data (Adobe Firefly’s stock-photo training set), and post-training filters. None is a complete solution.

Diversity collapse

A related but distinct concern. Generative models trained on web data and tuned with RLHF for human preferences tend to converge on a narrow style - generating images, text, or music that look superficially diverse but cluster in a small region of the space the model is theoretically capable of producing.

Empirically: ask a text-to-image model for “a beautiful landscape” 100 times; the resulting images often share aesthetic characteristics (lighting, composition, palette) more than the prompt itself implies. Ask an LLM for “a creative story” 100 times; the resulting stories share narrative patterns. The model has been steered by training data and reward signals toward a particular zone of plausibility.

The critique. Generative AI deployed at scale homogenizes the cultural/creative output it is used to produce - not because each individual user wanted homogeneity, but because each user got the same model’s same biases. The long-run consequences for cultural and creative diversity are an open empirical question; the short-run consequences are visible in the “AI look” that contemporary observers can identify in many AI-generated images.

Evaluation as a social problem

A meta-critique of the evaluation discussion in §10. The metrics we use to evaluate generative models shape what models we build. FID rewards distribution coverage; CLIP-score rewards prompt fidelity; human preference rewards what humans currently prefer (which is shaped by what they have already seen).

The critique: optimization against any of these metrics produces Goodhart’s-law failures. Once a metric is the target, it stops being a good measure. The 2026 evidence: production text-to-image models that score well on standard benchmarks but produce visibly inferior outputs to alternatives that score worse; LLMs that score well on standard reasoning benchmarks but fail at slightly-perturbed problems.

The deeper critique: there may be no automated metric that captures all the dimensions that matter, because “what matters” is itself plural and contested. The benchmarking community has responded with multi-metric reporting and live human-preference comparisons; the underlying problem is structural and persistent.

The “do generative models understand?” debate

A philosophical critique. When a diffusion model produces a realistic image of “a cat sitting on a windowsill,” does the model understand what cats and windowsills are, in any meaningful sense? Or has it learned a statistical pattern that is accurate enough to produce convincing samples without any understanding?

Two camps.

The no-understanding camp argues that generative models manipulate statistical patterns over surface representations (pixels, tokens) without representing the underlying entities. Evidence: models fail at out-of-distribution combinations, struggle with physical reasoning (“an object falling and bouncing”), and produce confidently wrong outputs on slightly-novel queries. The technical critique points to compositional generalization failures (OP-GM-3).

The some-understanding camp argues that the distinction between “statistical pattern” and “understanding” is not as clear as the no-understanding camp implies. Modern generative models exhibit emergent capabilities (in-context learning, chain-of-thought reasoning, world-model behaviour in video models) that look like some form of understanding, even if not human-like. The empirical evidence is sufficient that the strong no-understanding position requires substantial qualifications.

This chapter does not adjudicate. The honest summary: the debate is partly empirical (about what generative models actually can and cannot do) and partly definitional (about what we mean by “understand”). Both threads are active. We refer the reader to the broader cognitive-science and philosophy-of-AI literature for the substantive treatment.

The labour critique

A non-technical but consequential critique. Generative models compete with and displace human creative labour - illustrators, writers, voice actors, musicians, designers. The economic and social effects are real and immediate. The technical community’s response to this critique has been mixed: some researchers acknowledge the labour displacement and engage with mitigations (revenue-sharing, attribution, opt-out); others treat it as outside their scope.

The chapter notes the critique without taking a position. The Alignment / Ethics chapter (planned) develops it in depth.


§13. Limitations and Open Problems

Consolidated open-problems list. Each item carries an OP-GM-N identifier so other chapters can cross-reference.

  • OP-GM-1. Sample-efficient training of generative models. Frontier diffusion and flow-matching systems require massive training data ($10^9$+ image-text pairs) and compute ($10^{23}$+ FLOPs). Smaller training budgets typically produce substantially weaker models. Whether this is fundamental or a consequence of current architectures and recipes is open. The protein-structure analogue (AlphaFold 2 trained on the much smaller PDB) shows that with strong inductive biases, much less data can suffice - but generalizing this insight to general image/audio/video generation is unsolved. Relates to OP-FM-2 (foundation-model training data).

  • OP-GM-2. Evaluation that aligns with human judgement. §10 makes the case extensively. FID, CLIP-score, and the standard automated metrics correlate imperfectly with what humans actually value. Designing metrics that align better with human judgement - without becoming Goodhart-targets themselves - is an open and likely-permanent problem. Recent learned-metric approaches (HPS, ImageReward) help but inherit the biases of the preference data they were trained on.

  • OP-GM-3. Compositional generalization. Current text-to-image systems handle common combinations of concepts well (“a red car”, “a cat in a hat”) but compose novel combinations poorly (“a cat with three heads riding a bicycle in zero gravity”). The model has memorized common combinations; truly compositional generation - combining concepts the model has never seen combined - fails systematically. This is the generative-model analogue of the broader compositional-generalization problem. Cross-references the discriminative-side discussions in DL/SSL/LLM open problems.

  • OP-GM-4. Watermarking and provenance. As generative content becomes ubiquitous, distinguishing AI-generated content from human-created content becomes important - for trust, attribution, copyright, and democratic discourse. Watermarking techniques (visible, invisible, model-specific) have been proposed; none is robust to all attacks. The C2PA (Coalition for Content Provenance and Authenticity) industry standard is one approach; its adoption is partial. OP-GM-4 has both a technical core (robust watermarks) and an institutional dimension (incentives for adoption).

  • OP-GM-5. Theory of why diffusion / flow-matching work. Empirically, diffusion and flow-matching produce high-quality samples reliably. Theoretically, the account of why they work as well as they do is only partial. Rigorous results exist for restricted settings (Gaussian data, kernel methods, linear models), but a theory that explains the generalization and sample quality of practical large-scale diffusion is open. Connects to OP-TH-1 (overparameterization generalization) - the generative-model-specific instance of the broader generalization puzzle.

  • OP-GM-6. Continuous-time vs discrete formulations. Diffusion was developed in discrete time; flow-matching is continuous-time. Both work; the relationship is now well understood. The open question is whether new generative-modelling frameworks lurk in formulation choices we have not yet explored - non-Gaussian noise, non-Euclidean spaces, trainable noise schedules. This is a research-aesthetic open problem rather than a hard practical one.

  • OP-GM-7. Memorization vs novel generation. When does a generative model produce novel outputs vs recombinations (or verbatim reproductions) of training data? The problem has technical, legal, and ethical dimensions. Detection techniques (§10) have improved but are not fully reliable. The deep theoretical question - what does it mean for a high-dimensional generative model to “produce something new” - is unsolved.

  • OP-GM-8. Long-horizon coherent generation. For video: minute-length generation is now possible (Sora, Veo); hour-length coherent generation is not. For text: novel-length coherent generation by a single model is not yet reliable. For 3D: extended scenes with consistent geometry across viewpoints are difficult. The long-horizon problem - maintaining coherence over generation lengths much larger than training-time sequence lengths - is one of the central open problems in 2026.

  • OP-GM-9. Inference-time efficiency. Diffusion sampling at 20–50 steps per image is far slower than single-step GAN sampling, yet it is now the standard. Inference cost is the dominant deployment expense. Better samplers (DDIM, DPM-Solver), distillation (consistency models, rectified flow), and architecture choices (smaller models matched to inference target) are the active research directions. Tied to OP-RL-9 (training/inference compute trade-off).

  • OP-GM-10. Equivariance and invariance. For domains with strong symmetries (molecules, proteins, materials, physical simulations), generative models that are equivariant to the relevant symmetry group produce more reliable and sample-efficient results. Designing equivariant architectures for general-purpose generative modelling - applying equivariance principles where they help without crippling expressiveness - is an open area.
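
To make the sampler side of OP-GM-9 concrete, the following is a minimal sketch of deterministic DDIM sampling on a sub-sampled timestep grid. It is an illustration under stated assumptions, not a reference implementation: the data is a single scalar, the linear beta schedule endpoints (1e-4 to 0.02) are a commonly used default that we assume here, and `eps_model` is a hypothetical stand-in for a trained noise-prediction network. All function names are ours.

```python
import math


def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar[t] for a linear beta schedule.

    Index 0 is clean data (alpha_bar = 1); index T is (almost) pure noise.
    The schedule endpoints are assumed defaults, not taken from the chapter.
    """
    alpha_bars = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) from level ab_t to ab_prev."""
    # Estimate of the clean sample implied by the current noise prediction.
    x0_hat = (x_t - math.sqrt(1.0 - ab_t) * eps_hat) / math.sqrt(ab_t)
    # Re-noise that estimate to the (less noisy) target level.
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps_hat


def ddim_sample(x_T, eps_model, alpha_bars, num_steps=50):
    """Run DDIM over an evenly spaced subset of timesteps, from T down to 0.

    eps_model(x, t) is a hypothetical callable returning the predicted noise.
    The network is evaluated num_steps times rather than T times.
    """
    T = len(alpha_bars) - 1
    timesteps = [round(T * k / num_steps) for k in range(num_steps, -1, -1)]
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps_hat = eps_model(x, t)
        x = ddim_step(x, eps_hat, alpha_bars[t], alpha_bars[t_prev])
    return x
```

With a perfect noise predictor the 50-step trajectory recovers the clean sample exactly, which is a convenient sanity check before plugging in a real network. The cost saving is simply the ratio of network evaluations (50 against 1000 in this setup); how much quality is lost with an imperfect network is the empirical question the samplers above try to improve on.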


§14. Further Reading

Opinionated list. Not exhaustive; intended as a reading-order recommendation for someone entering generative modelling cold.

Foundational textbooks and surveys

  • Murphy, K. (2023). Probabilistic Machine Learning: Advanced Topics. Chapter on generative models gives a unified textbook treatment.

  • Tomczak, J. (2024). Deep Generative Modeling. A focused textbook on modern generative models.

  • Bond-Taylor, S., et al. (2022). “Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models.” A useful unified survey; somewhat pre-flow-matching but covers the older families well.

Autoregressive

  • van den Oord, A., et al. (2016). “Pixel Recurrent Neural Networks.” PixelRNN.

  • van den Oord, A., et al. (2016). “WaveNet: A Generative Model for Raw Audio.” WaveNet.

  • Esser, P., Rombach, R., and Ommer, B. (2021). “Taming Transformers for High-Resolution Image Synthesis.” VQGAN - the autoregressive-image-generation recipe via discrete tokens.

VAEs

  • Kingma, D. P., and Welling, M. (2014). “Auto-Encoding Variational Bayes.” The foundational VAE paper.

  • van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). “Neural Discrete Representation Learning.” VQ-VAE.

  • Higgins, I., et al. (2017). “$\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.” The disentanglement-oriented variant.

  • Kingma, D. P., and Welling, M. (2019). “An Introduction to Variational Autoencoders.” A clear extended tutorial.

Normalizing flows

  • Rezende, D., and Mohamed, S. (2015). “Variational Inference with Normalizing Flows.” Foundational.

  • Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). “Density Estimation using Real NVP.” Coupling-layer flows.

  • Kingma, D. P., and Dhariwal, P. (2018). “Glow: Generative Flow with Invertible 1x1 Convolutions.” Glow.

  • Papamakarios, G., et al. (2021). “Normalizing Flows for Probabilistic Modeling and Inference.” Comprehensive survey.

GANs

  • Goodfellow, I., et al. (2014). “Generative Adversarial Nets.” Foundational.

  • Radford, A., Metz, L., and Chintala, S. (2016). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” DCGAN.

  • Arjovsky, M., Chintala, S., and Bottou, L. (2017). “Wasserstein Generative Adversarial Networks.” WGAN.

  • Karras, T., Laine, S., and Aila, T. (2019). “A Style-Based Generator Architecture for Generative Adversarial Networks.” StyleGAN.

  • Brock, A., Donahue, J., and Simonyan, K. (2019). “Large Scale GAN Training for High Fidelity Natural Image Synthesis.” BigGAN.

Diffusion

  • Sohl-Dickstein, J., et al. (2015). “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” The original diffusion paper.

  • Ho, J., Jain, A., and Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” DDPM - the modern recipe.

  • Song, Y., and Ermon, S. (2019, 2020). “Generative Modeling by Estimating Gradients of the Data Distribution” and related papers. Score-based formulation.

  • Song, Y., et al. (2021). “Score-Based Generative Modeling through Stochastic Differential Equations.” The unified SDE framework.

  • Song, J., Meng, C., and Ermon, S. (2021). “Denoising Diffusion Implicit Models.” DDIM.

  • Ho, J., and Salimans, T. (2021). “Classifier-Free Diffusion Guidance.” Classifier-free guidance - the dominant conditioning technique.

  • Rombach, R., et al. (2022). “High-Resolution Image Synthesis with Latent Diffusion Models.” Stable Diffusion’s technical core.

  • Peebles, W., and Xie, S. (2023). “Scalable Diffusion Models with Transformers.” DiT.

  • Karras, T., et al. (2022). “Elucidating the Design Space of Diffusion-Based Generative Models.” A systematic ablation study; the modern engineering reference.

Flow-matching and rectified flow

  • Lipman, Y., et al. (2022). “Flow Matching for Generative Modeling.” Foundational.

  • Liu, X., Gong, C., and Liu, Q. (2022). “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.” Rectified flow.

  • Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). “Stochastic Interpolants: A Unifying Framework for Flows and Diffusions.” A unifying framework.

Modality-specific

  • Mildenhall, B., et al. (2020). “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” NeRF.

  • Kerbl, B., et al. (2023). “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” 3DGS.

  • Poole, B., et al. (2023). “DreamFusion: Text-to-3D using 2D Diffusion.” SDS.

  • Jumper, J., et al. (2021). “Highly Accurate Protein Structure Prediction with AlphaFold.” AlphaFold 2.

  • Abramson, J., et al. (2024). “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” AlphaFold 3.

  • Watson, J. L., et al. (2023). “De novo Design of Protein Structure and Function with RFdiffusion.” RFdiffusion.

  • Brooks, T., et al. (2024). “Video Generation Models as World Simulators.” Sora (technical report).

Conditioning and guidance

  • Zhang, L., Rao, A., and Agrawala, M. (2023). “Adding Conditional Control to Text-to-Image Diffusion Models.” ControlNet.

  • Brooks, T., Holynski, A., and Efros, A. A. (2023). “InstructPix2Pix: Learning to Follow Image Editing Instructions.” Image editing.

  • Meng, C., et al. (2022). “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations.” Image-to-image.

Evaluation

  • Theis, L., van den Oord, A., and Bethge, M. (2016). “A Note on the Evaluation of Generative Models.” The canonical critique of single-metric evaluation.

  • Heusel, M., et al. (2017). “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.” FID introduction.

  • Sajjadi, M. S. M., et al. (2018). “Assessing Generative Models via Precision and Recall.” Precision-recall framing.

  • Wu, X., et al. (2023). “Human Preference Score: Better Aligning Text-to-Image Models with Human Preference.” HPS.

  • Carlini, N., et al. (2023). “Extracting Training Data from Diffusion Models.” Memorization.

Reading-order recommendation

For someone entering the field cold: start with the Bond-Taylor 2022 survey for orientation. Then read the foundational papers chronologically: Goodfellow 2014 (GANs) → Kingma-Welling 2014 (VAEs) → Dinh et al. 2017 (RealNVP) → Ho-Jain-Abbeel 2020 (DDPM) → Song et al. 2021 (unified SDE framework) → Ho-Salimans 2021 (CFG) → Rombach et al. 2022 (latent diffusion) → Lipman et al. 2022 (flow-matching) → Liu et al. 2022 (rectified flow) → Karras et al. 2022 (“Elucidating”) for engineering depth. Add modality-specific papers as needed.


§15. Exercises and Experiments

Research-style exercises spanning the chapter’s families. Each is designed to develop hands-on understanding of one or two of the dominant techniques.

  • E1. VAE on MNIST with latent visualization. Implement a small VAE with $d = 2$ latent dimensions on MNIST. After training, plot the encoded test set in the 2D latent space coloured by digit class - confirm meaningful clustering. Sample new digits by decoding random points in the latent space. Implement latent traversal (sweep along a line in latent space, decode at each point) to confirm smooth interpolation. Then increase to $d = 64$ and compare reconstruction quality.

  • E2. GAN on MNIST and observation of mode collapse. Implement a small DCGAN on MNIST. Train and sample. Deliberately under-train the discriminator (or use a too-high learning rate) and observe mode collapse - generated samples cluster on a few digit classes. Then implement WGAN-GP and confirm it is more robust under the same hyperparameter settings.

  • E3. DDPM on CIFAR-10 or Fashion-MNIST. Implement DDPM with a small U-Net, a $T = 1000$ step noise schedule, and the simplified MSE loss. Train to reasonable convergence. Implement DDPM and DDIM sampling; compare sample quality at 1000 sampling steps (full DDPM) vs 50 sampling steps (DDIM). Plot a few samples per setting; quantify with FID against a held-out reference.

  • E4. Flow-matching on a 2D toy distribution. Implement straight-line flow-matching on a 2D Gaussian-mixture target distribution. Train a small MLP velocity-prediction network. Implement Euler integration sampling. Visualize the learned velocity field on a grid of $(x, t)$ points and the resulting sample trajectories from $t = 0$ to $t = 1$. Repeat for a more complex 2D target (a spiral or moons distribution).

  • E5. Implement classifier-free guidance. Take the trained DDPM from E3 and add class-conditional training (CIFAR-10 has 10 classes). Train with random class-dropout (10%). At inference, implement CFG with varying guidance scales $w \in \{1, 3, 7, 15\}$. Visualize how sample fidelity (matches the target class) and diversity change with $w$. Quantify with class-conditional FID.

  • E6. Compute FID for trained models. Take three trained models (your VAE from E1, your GAN from E2, your DDPM from E3). Compute FID against the corresponding test set for each. Compare. Investigate the effect of sample-set size on FID (1K, 5K, 10K samples). Confirm the well-known instability of FID at small sample sizes.

  • E7. Reproduce a memorization experiment. Following Carlini et al. (2023) at small scale: train a VAE or small DDPM on MNIST with a deliberately under-deduplicated dataset (some samples replicated 100x). Investigate which samples the model memorizes (find generated samples nearest to known training examples in pixel space). Observe the relationship between training-replication count and memorization rate.

  • E8. Implement a simple ControlNet for edge-conditioned generation. Take your trained DDPM from E3 (or its class-conditional version from E5, since inference below uses a class condition). Add a ControlNet branch that conditions on Canny edges of the target image. Train on (image, edge-map) pairs. At inference, supply a new edge map and a class condition; verify that the generated image follows the edge structure.

  • E9. Compare U-Net and DiT for diffusion. Implement two small diffusion models with comparable parameter counts: one with a U-Net, one with a DiT. Train both on the same dataset (e.g., CelebA-64). Compare sample quality (FID), training stability (loss curves), and inference speed.

  • E10. Investigate the effect of training-data quality. Take a small diffusion model. Train three versions: one on the full CIFAR-10 training set, one on a deduplicated subset, one on a 10% random subset. Compare sample quality and (where possible) memorization rates across the three.
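
As a starting point for E4, the two non-network pieces (building a straight-line training pair and Euler sampling) can be sketched as below. This is a sketch under assumptions: it follows the exercise's convention of noise at $t = 0$ and data at $t = 1$, uses plain Python lists in place of tensors, and all function names are ours. `point_target_velocity` is a hypothetical analytic stand-in for the trained MLP, valid only when the target distribution is a single point; it exists so the sampler can be checked before any training.

```python
def fm_training_pair(x0, x1, t):
    """Straight-line interpolant and regression target for one (noise, data) pair.

    x0: noise sample, x1: data sample, t in [0, 1].
    Returns (x_t, target_velocity); the network is trained with MSE so that
    v_theta(x_t, t) approximates target_velocity.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def euler_sample(x0, velocity_fn, num_steps=100):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler.

    Returns the final point and the full trajectory (for plotting in E4).
    """
    x = list(x0)
    dt = 1.0 / num_steps
    trajectory = [list(x)]
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return x, trajectory


def point_target_velocity(x1):
    """Exact marginal velocity field when all data mass sits at the point x1.

    Hypothetical stand-in for a trained network: v(x, t) = (x1 - x) / (1 - t).
    """
    def v(x, t):
        return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]
    return v
```

Running `euler_sample([0.3, -0.8], point_target_velocity([2.0, -1.0]))` carries the starting point along a straight line to the target, which is the behaviour the learned field should approximate locally around each mixture component. For the actual exercise, replace `point_target_velocity` with the trained MLP and draw `x0`, `x1`, and `t` at random for each training pair.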

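For E5, the two CFG-specific ingredients are label dropout at training time and a linear combination of two noise predictions at inference time. The sketch below assumes the convention in which $w = 1$ recovers the purely conditional prediction and $w = 0$ the unconditional one, which matches the scale set listed in E5; some papers parameterize the same idea with a shifted scale, so check the chapter's own definition. Function names and the `null_label` placeholder are ours.

```python
import random


def maybe_drop_label(label, p_drop, rng, null_label=None):
    """Training-time class dropout: replace the label with a null token
    with probability p_drop (10% in E5), so one network learns both the
    conditional and the unconditional noise prediction."""
    return null_label if rng.random() < p_drop else label


def cfg_combine(eps_uncond, eps_cond, w):
    """Inference-time guidance: extrapolate from the unconditional
    prediction towards (and, for w > 1, past) the conditional one.

    Assumed convention: w = 0 -> unconditional, w = 1 -> conditional.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At each sampling step the network is evaluated twice, once with the class label and once with the null label, and the combined prediction is passed to the usual DDPM or DDIM update. Larger `w` pushes samples further in the direction that distinguishes the class, which is the fidelity-versus-diversity trade-off E5 asks you to visualize.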

AI: A Living Reference by Fuzue. Content licensed under CC BY-SA 4.0 - share, adapt, and build on it; keep the attribution and the open licence on derivatives.