Generative Models
The chapter is structured to interface cleanly with several other chapters: the autoregressive material in §3 is the same machinery as LLM §3 (decoder-only Transformers) viewed through the generative-model lens; diffusion in §6 is the same machinery as SSL §4 (denoising as SSL) viewed through the modelling lens; flow-matching in §7 is the practical successor to diffusion that has displaced it in many applications.
Scope and What This Chapter Is About
The chapter develops generative models - probabilistic models that learn to generate samples from a target data distribution. We cover the modelling principle (likelihood, density, score, ODE-flow), the six dominant families (autoregressive, VAE, normalizing flows, GANs, diffusion, flow-matching), the practical engineering of training and sampling, the modality-specific instantiations (images, audio, video, 3D, molecules), and the unifying theoretical perspective that has emerged in the 2020s (diffusion as score-matching; flow-matching as a generalization of both). Open problems are flagged inline and consolidated in §13.
§1. Motivation and Scope
A worked-example anchor
Concretely, what does a generative model do? Three instances from 2026 practice, drawn to span modalities and method families:
Stable Diffusion XL produces a photorealistic image from a text prompt like “a cat sitting on a windowsill at sunset, oil painting style.” Under the hood: a text encoder turns the prompt into a conditioning vector; a U-Net or DiT model iteratively denoises a random initial latent over 20–50 steps; a decoder maps the final latent to pixels. The output is an image that - given a good prompt - looks plausible enough to fool most casual viewers.
AlphaFold 2 predicts a protein’s 3D structure from its amino-acid sequence. The model is not “generative” in the artistic-image sense but is a generative model of structures conditional on sequence - sampling a 3D conformation from the predicted distribution. The output is atomic coordinates that match experimental structures to within a few angstroms on a substantial fraction of proteins.
GPT-style LMs generate text autoregressively, one token at a time, sampling from $p_\theta(x_t \mid x_1, \dots, x_{t-1})$. This is the third major generative-model family - autoregressive - and is responsible for the dominant production-grade text generation of 2026.
All three are generative models. They differ in modality (image, structure, text), architecture (U-Net or DiT, equivariant Transformer, decoder-only Transformer), and training objective (denoising score-matching, distogram + structure prediction, next-token cross-entropy). The unifying feature: each represents and samples from a probability distribution $p(x)$ (or $p(x \mid c)$ for conditional models, where $c$ is the conditioning input) over a structured output space.
What generative modelling is, formally
A generative model is a probabilistic model of a data distribution $p_{\text{data}}(x)$, where $x$ is a structured object - a vector, image, sentence, molecule, video - drawn from some space $\mathcal{X}$. The model has parameters $\theta$ and defines a distribution $p_\theta(x)$ that approximates $p_{\text{data}}(x)$. The model is useful if we can do at least one of the following with it:
Sample. Produce a new $x \sim p_\theta(x)$. This is the most visible capability - the ability to generate new images, sentences, molecules.
Evaluate density. Compute $p_\theta(x)$ for a given $x$. Useful for anomaly detection, compression, model comparison.
Approximate inference. Compute or sample from posteriors $p_\theta(z \mid x)$, where $z$ is some latent variable. Useful for representation learning and structured inference.
The six families covered in this chapter differ in which subset of these they support and how they support it. Autoregressive models (§3) support sampling and exact-density-evaluation but not exact-posterior-inference. VAEs (§4) support sampling and approximate-density-evaluation and approximate-posterior-inference. Flows (§5) support all three exactly. GANs (§5) support only sampling. Diffusion (§6) supports sampling and approximate-density-evaluation. Flow-matching (§7) supports sampling with a clean continuous-time framework that subsumes much of diffusion.
The differences matter. Which capabilities matter for an application determines which family is appropriate.
What generative modelling is, intuitively
The intuitive picture in one sentence: a generative model learns the shape of the data distribution well enough to produce new samples that look like they came from the same distribution.
Two illustrations.
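First, the 1D case. Suppose $x \in \mathbb{R}$ and the true distribution is a mixture of two Gaussians, $p_{\text{data}}(x) = \pi\,\mathcal{N}(x; \mu_1, \sigma_1^2) + (1 - \pi)\,\mathcal{N}(x; \mu_2, \sigma_2^2)$, with well-separated means $\mu_1$ and $\mu_2$. A generative model trained on samples from this distribution should produce a $p_\theta(x)$ that captures the two-mode structure: samples from $p_\theta$ should themselves cluster around $\mu_1$ and $\mu_2$. A bad model (a single Gaussian centered between the two modes) misses the structure entirely - it would never generate values near $\mu_1$ or $\mu_2$ with the right relative frequency.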
Second, the image case. Suppose $x$ is a $28 \times 28$-pixel image of a handwritten digit. The set of “plausible digit images” is a tiny, structured subset of the 784-dimensional pixel space - the data manifold. A generative model trained on MNIST learns the geometry of this manifold and can produce new images that lie on it. A bad model (random pixel noise) produces images that are not on the manifold and look like noise. A good model produces images that are on the manifold and look like digits - but are not in the training set.
The “lies on the data manifold” intuition is consistent across families and modalities. Audio: the model produces plausible sound waveforms or spectrograms. Video: the model produces plausible image sequences that are coherent in space and time. Molecules: the model produces plausible, chemically valid graphs. The mathematical formalism (likelihood, score, ODE-flow) differs across families; the underlying goal - capture the structure of the data manifold and sample from it - is shared.
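The 1D illustration can be made concrete with a few lines of code. The sketch below is illustrative only: the mode locations (-2 and +2), the unit variances, and the equal weights are assumptions chosen for the example, not values from the text. It compares a model with the right two-mode shape against a single Gaussian fitted by maximum likelihood, using held-out log-likelihood and the fraction of samples that land in the low-density gap between the modes.

```python
import math
import random

# Assumed illustration values (not from the text): two equally weighted
# unit-variance modes at -2 and +2.
MEANS = (-2.0, 2.0)
STD = 1.0


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_mixture(rng, n):
    """Draw n samples: pick a mode uniformly, then add Gaussian noise."""
    return [rng.gauss(rng.choice(MEANS), STD) for _ in range(n)]


def mixture_logpdf(x):
    """Log-density of the two-mode 'good' model."""
    return math.log(sum(0.5 * normal_pdf(x, m, STD) for m in MEANS))


def fit_single_gaussian(data):
    """Maximum-likelihood fit of one Gaussian: the 'bad' model."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(var)


rng = random.Random(0)
train = sample_mixture(rng, 5000)
held_out = sample_mixture(rng, 5000)

bad_mean, bad_std = fit_single_gaussian(train)
avg_ll_good = sum(mixture_logpdf(x) for x in held_out) / len(held_out)
avg_ll_bad = sum(math.log(normal_pdf(x, bad_mean, bad_std)) for x in held_out) / len(held_out)

# Fraction of samples landing in the low-density gap between the modes.
bad_samples = [rng.gauss(bad_mean, bad_std) for _ in range(5000)]
gap_real = sum(abs(x) < 0.5 for x in held_out) / len(held_out)
gap_bad = sum(abs(x) < 0.5 for x in bad_samples) / len(bad_samples)

print(f"avg log-likelihood: mixture {avg_ll_good:.3f}, single Gaussian {avg_ll_bad:.3f}")
print(f"mass in (-0.5, 0.5): data {gap_real:.3f}, single-Gaussian samples {gap_bad:.3f}")
```

The single Gaussian places its peak exactly where the data is sparsest, so it both scores lower on held-out likelihood and over-generates values between the modes.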
Why generative models matter in 2026
Three motivations a 2026 researcher should care about.
1. Generative models are deployed at massive scale. Image-generation services (Midjourney, DALL·E, Stable Diffusion, Imagen, Firefly), video-generation services (Sora, Veo, Runway), audio-generation services (Suno, Udio, ElevenLabs), and text-generation services (ChatGPT, Claude, Gemini) are used by tens of millions of users daily. The aggregate economic, cultural, and creative impact of generative AI is substantial enough that no research-oriented practitioner can ignore the technical substrate.
2. Generative models are scientific instruments. AlphaFold 2 (and its successor AlphaFold 3) transformed protein structure prediction; AlphaProof, GNoME, and other scientific generative models extend the recipe to mathematics, materials, and chemistry. The Foundation Models chapter (FM §8) catalogues these in more detail; the generative-modelling techniques that underlie them are this chapter’s content.
3. Generative components are everywhere in the modern stack. A vision-language model includes a generative text decoder. An LLM’s reasoning chain is generative text. A recommendation system’s response is increasingly generative. A robot policy can be framed as conditional generation (action sequence given observation). The generative-modelling machinery in this chapter appears, directly or indirectly, in essentially every modern Foundation Model.
Density modelling vs sample generation
A specific distinction worth flagging because it confuses many readers entering the field.
Density modelling is the problem of learning the function $p_\theta(x)$ - assigning a probability to any input $x$. It is a modelling task. It serves anomaly detection (low-probability inputs are anomalous), compression (high-probability inputs need fewer bits), and model comparison (which model better explains a dataset?).
Sample generation is the problem of producing new $x \sim p_\theta(x)$. It is a sampling task. It serves content creation, simulation, scientific exploration.
The two are related but not identical. A model can be excellent at density modelling and poor at sample generation (an autoregressive language model with good perplexity but slow sampling). A model can be excellent at sample generation and unable to evaluate density (a GAN). A model can be poor at both. The choice between families often comes down to which of the two tasks matters for the application.
In contemporary practice, sample generation gets most of the public attention while density modelling gets most of the rigorous evaluation. We develop both throughout the chapter.
Boundaries with adjacent chapters
This chapter sits in the middle of a constellation of related chapters; the boundaries:
Self-Supervised Learning §4 treats denoising as a self-supervised pretraining objective. The same denoising machinery powers diffusion models (this chapter’s §6). The chapters develop the same mechanism through two different lenses: SSL views denoising as a way to learn representations; Generative Models views it as a way to sample. Many modern systems use the same trained model for both.
Large Language Models §3 and §6 develop autoregressive Transformer LMs and their inference-time decoding strategies. This chapter’s §3 develops the autoregressive generative-modelling framework - what it is, how it relates to other families, where it is and is not appropriate. The LM chapter is the LM-specific instantiation.
Foundation Models is the spine chapter. Generative models are one of the two dominant FM substrates (the other being discriminative SSL models). FM §3 discusses generative models as a family at a high level; this chapter develops the mechanism.
Deep Learning provides the architectures (U-Net, Transformer, DiT, equivariant nets) that generative models use. The chapter assumes the architectural material rather than developing it.
Multimodal Models (planned) develops cross-modal generation; this chapter develops the within-modality generative-modelling techniques that cross-modal systems use.
AI for Science (planned) develops scientific generative models in domain depth (protein design, materials, mathematics, climate); this chapter develops the modelling techniques.
What this chapter does not try to do
A research-oriented chapter must be honest about scope:
We do not develop application-specific best practices in depth. Diffusion models for high-resolution image synthesis, for example, have a substantial sub-literature on samplers, schedulers, model-conditioning, and prompt engineering that this chapter sketches but does not exhaustively cover.
We do not derive most results in full mathematical detail. Diffusion’s connection to stochastic differential equations, for instance, is sketched here and developed properly in the references.
We do not treat evaluation as a separate engineering discipline beyond §10. The Evaluation chapter (planned) will develop the cross-cutting issues.
We do not survey all generative-modelling techniques. The space is large; we develop the six dominant families and refer the reader to surveys for niche techniques (energy-based models, score-based models with non-standard noise schedules, etc.).
We do not treat the legal, copyright, and societal questions about generative AI substantively. They are real and important; they live in the Alignment/Ethics chapter.
Position taken in this chapter
The chapter is organized by mathematical family rather than by modality or by historical era. The structural argument: a researcher who understands autoregressive, VAE, flow, GAN, diffusion, and flow-matching as mathematical families can quickly understand any specific application of generative models, because the modality-specific details are mostly architecture and conditioning choices layered on top of the core family. Conversely, organizing by modality (image generation, video generation, text generation) would repeat the same core mechanisms many times.
The chapter is also organized to reflect 2026 practice. Diffusion (§6) and flow-matching (§7) get the most detailed treatment because they are dominant in 2026 production systems for non-text modalities. Autoregressive (§3) gets a tight treatment because LLM §3 already develops it. GANs (§5) get a brief treatment because they are largely historical for new system design (though still common in some niches). VAEs (§4) get a brief treatment because they are now most often used as components (e.g., the latent encoder in latent diffusion) rather than as standalone generators.
§2. Historical Context
This section traces the field from early statistical generative modelling through the modern deep-learning era. The chapter’s substance is the current mechanics; the history is here because the conceptual moves that produced the modern toolkit are themselves illuminating.
A timeline of the inflection points:
~1980s Boltzmann machines (Hinton, Sejnowski 1985);
early energy-based generative models
│
▼
~1990s Mixture models, Hidden Markov Models;
generative modelling as a statistics topic
│
▼
2006-2010 Restricted Boltzmann Machines; Deep Belief
Networks (Hinton et al. 2006);
deep generative models begin to be feasible
│
▼
2013 VARIATIONAL AUTOENCODERS: Kingma & Welling
"Auto-Encoding Variational Bayes"
│
▼
2014 GANs: Goodfellow et al. "Generative
Adversarial Nets" - adversarial training
becomes the dominant image-generation paradigm
│
▼
2015 NORMALIZING FLOWS: Rezende & Mohamed
"Variational Inference with Normalizing Flows";
Sohl-Dickstein et al. "Deep Unsupervised
Learning using Nonequilibrium Thermodynamics"
(the first diffusion-model paper, largely
overlooked for five years)
│
▼
2016 PixelRNN (van den Oord et al.); WaveNet
(van den Oord et al.); GPT-1 era preludes;
autoregressive generation gains traction
│
▼
2017-2018 GAN era: DCGAN, Wasserstein GAN, Progressive
GANs, BigGAN, StyleGAN - high-resolution
image synthesis as the GAN benchmark.
Glow, MAF, IAF for flow models.
│
▼
2018-2019 GPT-2 (Radford et al. 2019) demonstrates
large-scale autoregressive text generation;
AlphaFold 1 (Senior et al. 2020 - work
finished 2018) shows generative modelling
for protein structure
│
▼
2020 DDPM: Ho, Jain, Abbeel "Denoising Diffusion
Probabilistic Models" - the modern diffusion
recipe; Score-based generative modelling
(Song & Ermon 2019 antecedent, 2020 unified
framework)
│
▼
2021 DDIM (Song, Meng, Ermon); CLIP (Radford et al.);
classifier-free guidance (Ho & Salimans);
GLIDE (Nichol et al.) shows text-to-image
diffusion at scale; AlphaFold 2 (Jumper et al.)
│
▼
2022 DIFFUSION TAKES OVER: DALL-E 2 (Ramesh et al.);
Imagen (Saharia et al.); STABLE DIFFUSION
(Rombach et al., latent diffusion) - open
weights at consumer-GPU scale.
Flow-matching (Lipman et al.) introduced.
│
▼
2023 Flow-matching and rectified flow (Liu et al.)
consolidate the post-diffusion framework.
Video diffusion gains traction. DiT
(Peebles & Xie) replaces U-Net with
Transformer backbone.
│
▼
2024-2026 SORA (OpenAI, early 2024): text-to-video
diffusion at minute-length scale. Veo (Google);
Gen-3 (Runway). Stable Diffusion 3, Flux:
flow-matching-based commercial text-to-image.
AlphaFold 3 (DeepMind, 2024). Generative
models pervasive across science and creative
applications.
We develop each phase below.
Early statistical generative modelling
Generative modelling as a research topic substantially predates deep learning. Mixture models (especially Gaussian mixture models) for unsupervised density estimation, Hidden Markov Models for sequence modelling, Bayesian networks as structured generative models - these were the standard pre-deep-learning toolkit. The Bayesian-networks chapter (planned, originally AIMA Ch 13–14) develops this material; we proceed from the deep-learning era.
Boltzmann machines (Hinton and Sejnowski, 1985) and especially Restricted Boltzmann Machines were among the first energy-based deep generative models. Deep Belief Networks (Hinton et al., 2006) - stacked RBMs trained layer-by-layer - were briefly the leading deep generative approach in the late 2000s and helped trigger the broader deep-learning revival. They are now largely of historical interest; the energy-based-model lineage continues but is no longer dominant.
The 2013–2014 inflection: VAEs and GANs
Two papers within twelve months transformed deep generative modelling.
Variational Autoencoders (VAEs), introduced by Kingma and Welling (2013) “Auto-Encoding Variational Bayes,” combined a probabilistic encoder-decoder framework with the reparameterization trick - a technical move that makes the gradient of an expectation over latent variables backpropagatable. The result: a deep generative model with a well-defined log-likelihood lower bound (the ELBO), a meaningful latent representation, and tractable training.
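The reparameterization trick is easier to see on a toy expectation than inside a full VAE. The sketch below is a minimal illustration, not the VAE training loop: it assumes a scalar latent $z \sim \mathcal{N}(\mu, \sigma^2)$ and a made-up test function $f(z) = z^2$, chosen because the exact gradients of $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ are known in closed form. The function name `reparam_grads` is hypothetical.

```python
import random


def reparam_grads(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z ~ N(mu, sigma^2).

    Writing z = mu + sigma * eps with eps ~ N(0, 1) moves the randomness into
    eps, so z is a differentiable function of (mu, sigma) and the chain rule
    applies sample by sample.
    """
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps          # reparameterized sample
        df_dz = 2.0 * z               # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0        # dz/dmu = 1
        grad_sigma += df_dz * eps     # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.7
g_mu, g_sigma = reparam_grads(mu, sigma, 200_000, rng)

# Closed form: E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print("d/dmu   :", g_mu, "exact:", 2 * mu)
print("d/dsigma:", g_sigma, "exact:", 2 * sigma)
```

In a VAE the same move is applied to the encoder’s output mean and standard deviation, with an automatic-differentiation library supplying the chain-rule step written out by hand here.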
Generative Adversarial Networks (GANs), introduced by Goodfellow et al. (2014) “Generative Adversarial Nets,” took an entirely different approach. Train two networks: a generator that maps random noise to samples, and a discriminator that tries to distinguish generated samples from real data. The two networks play a minimax game: the generator improves to fool the discriminator; the discriminator improves to detect generated samples. At equilibrium, the generator produces samples indistinguishable from the data.
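For reference, the minimax game described here is usually written as the following value function, in the notation of Goodfellow et al. (2014): $G$ is the generator, $D$ the discriminator (outputting the probability that its input is real), and $p(z)$ the noise distribution fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards the discriminator for recognizing real data; the second rewards it for rejecting generated samples and is the only term the generator can influence.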
GANs and VAEs differ structurally. GANs make no attempt to model the data density - they only learn to produce plausible samples. VAEs learn an explicit (but lower-bounded) density and a latent representation. Their relative strengths and weaknesses defined a multi-year debate: GANs produced sharper images, VAEs produced more meaningful latents. Both became major research areas.
The normalizing-flows line and the missed diffusion paper
In 2015, two papers appeared. Rezende and Mohamed (2015) “Variational Inference with Normalizing Flows” introduced normalizing flows - invertible neural-network transformations with a tractable Jacobian determinant, allowing exact log-likelihood computation and exact sampling. Flows became a third major family.
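The “tractable Jacobian determinant” matters because of the change-of-variables formula. As a sketch, assume an invertible map $f_\theta$ from data $x$ to a latent $z$ with a simple base density $p_Z$ (the symbols $f_\theta$ and $p_Z$ are notation introduced here for illustration):

```latex
z = f_\theta(x), \qquad x = f_\theta^{-1}(z), \qquad z \sim p_Z(z)

\log p_\theta(x) = \log p_Z\big(f_\theta(x)\big)
  + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

Exact likelihood needs the determinant term to be cheap to evaluate, and exact sampling needs $f_\theta^{-1}$ to be cheap to apply; flow architectures are designed around these two constraints.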
Less prominent at the time: Sohl-Dickstein et al. (2015) “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” introduced what is now called diffusion modelling. The paper proposed a procedure that progressively corrupted data with noise and learned to reverse the process; samples could be drawn by starting from pure noise and iteratively denoising. The paper was thorough and the framework rigorous, but the empirical results at the time did not stand out, and the line of work was largely overlooked for five years.
The same period saw autoregressive image and audio models mature: PixelRNN/PixelCNN (van den Oord et al., 2016) for images, WaveNet (van den Oord et al., 2016) for audio. These showed that autoregressive generation - long established for text - also worked for high-dimensional sensory modalities.
The GAN era: 2016–2020
For most of 2016–2020, GANs were the dominant deep generative paradigm for images. The progression of architectures and training techniques was rapid:
DCGAN (Radford, Metz, Chintala, 2016) - practical guidelines for stable CNN-based GAN training.
Wasserstein GAN (Arjovsky, Chintala, Bottou, 2017) - the Wasserstein-distance training objective addressed many training-stability issues.
Progressive Growing GANs (Karras et al., 2018) - train at increasing resolutions.
BigGAN (Brock, Donahue, Simonyan, 2019) - class-conditional ImageNet generation at high quality.
StyleGAN (Karras, Laine, Aila, 2019) and StyleGAN2/3 - high-resolution photorealistic face synthesis, the visual high-water mark of the GAN era.
The GAN era produced striking images. It also produced two persistent difficulties: mode collapse (the generator focuses on a few easy-to-fool modes and ignores the rest of the data distribution) and training instability (the minimax optimization is delicate and often fails to converge). Many engineering techniques addressed these difficulties but none fully solved them.
Autoregressive scaling and AlphaFold
In parallel, two other generative-model threads were quietly transforming the field. GPT-2 (Radford et al., 2019) - a large-scale autoregressive Transformer trained on web text - demonstrated that scaling autoregressive text generation produced qualitatively new capabilities. This thread eventually became GPT-3, ChatGPT, and the LLM-dominated era of 2022 onward (LLM chapter).
AlphaFold 1 (Senior et al., 2020) for protein structure prediction was a different but related generative result: a deep network produced plausible 3D protein structures from amino-acid sequences. While not “generative” in the same sense as image-synthesis GANs, AlphaFold demonstrated that deep generative modelling could solve a substantive scientific problem.
Diffusion takes over: 2020–2022
The paper that started the modern diffusion era: Ho, Jain, Abbeel (2020) “Denoising Diffusion Probabilistic Models” (DDPM). DDPM simplified and operationalized the Sohl-Dickstein et al. (2015) framework into a recipe that worked at scale. Three innovations:
A specific noise schedule and parameterization that made training stable.
A reweighted training objective (the simplified MSE on the noise prediction) that worked much better than the variational lower bound.
Concrete sampling procedures that produced high-quality images.
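The first two innovations can be stated compactly. In the notation of Ho et al. (2020), with noise schedule $\beta_1, \dots, \beta_T$ and $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$, a noised sample can be drawn in closed form at any step, and the simplified objective trains a network $\epsilon_\theta$ to predict the noise that was added:

```latex
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\big)

L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)}
  \Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t\big) \big\|^2 \Big]
```

This is a summary for orientation only; the derivation and the sampling procedure belong to §6.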
DDPM image samples were immediately competitive with GANs and quickly surpassed them. Song, Ermon (2019, 2020) had developed a parallel score-based generative-modelling framework that turned out to be mathematically equivalent to diffusion (Song et al., 2021); the unified view became standard.
The years 2021–2022 saw the diffusion explosion. DDIM (Song, Meng, Ermon, 2021) gave a deterministic, much-faster sampler. Classifier-free guidance (Ho, Salimans, 2021) gave the dominant conditioning technique. GLIDE (Nichol et al., 2021) demonstrated text-to-image diffusion at scale. Then in 2022 the production-grade systems arrived: DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and especially Stable Diffusion (Rombach et al., 2022) - the latent diffusion model that ran on consumer GPUs and was released with open weights, putting text-to-image generation in the hands of millions of users.
By 2022, diffusion had replaced GANs as the dominant image-generation paradigm. The same machinery extended to audio (Riffusion, MusicLM successors), video (Imagen Video, then Sora), and 3D (DreamFusion, then several improvements). Diffusion-shaped recipes became the default starting point for new generative-modelling applications.
Flow-matching and rectified flow: the post-diffusion era
In late 2022 and into 2023, a new framework emerged that subsumes and extends diffusion. Flow-matching (Lipman, Chen, Ben-Hamu, Nickel, Le, 2022) “Flow Matching for Generative Modeling” gave a clean continuous-time formulation: train a vector field whose ODE-flow transports a simple distribution (e.g., Gaussian) into the data distribution. Rectified flow (Liu, Gong, Liu, 2022) “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow” gave a particular instantiation that produced near-straight ODE trajectories and therefore fast sampling.
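As a sketch of what “train a vector field” means, consider one common instantiation: the straight-line interpolation path used in rectified flow, which also arises as a special case of conditional flow-matching. The symbols below ($x_0$ for a noise sample, $x_1$ for a data sample, $v_\theta$ for the learned vector field) are notation assumed here for illustration.

```latex
\frac{d x_t}{d t} = v_\theta(x_t, t), \qquad x_0 \sim \mathcal{N}(0, I)

x_t = (1 - t)\, x_0 + t\, x_1, \qquad x_1 \sim p_{\text{data}}, \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}
  \Big[ \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2 \Big]
```

Sampling integrates the ODE from $t = 0$ to $t = 1$ starting at a noise draw; the straighter the learned trajectories, the fewer integration steps are needed.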
The relationship to diffusion. Diffusion can be viewed as a particular noise-schedule choice within the flow-matching framework. Flow-matching is more general (does not require diffusion’s specific noise schedule), conceptually cleaner (continuous-time ODE flow rather than discrete-time stochastic process), and often produces better empirical results with fewer sampling steps. Many 2024–2026 generative systems (including Stable Diffusion 3, Flux, and several frontier text-to-video systems) use flow-matching or rectified-flow training rather than DDPM-style diffusion.
The post-diffusion era is, in 2026, still consolidating. Diffusion-trained models remain in widespread production use; flow-matching-trained models are increasingly dominant in new deployments. The relationship is technical-and-incremental rather than paradigm-shifting; we develop both in §6 and §7.
2024–2026: video, frontier scientific generation, and the production stack
The most recent inflections:
Sora (OpenAI, early 2024) demonstrated minute-length, coherent, high-quality text-to-video diffusion. Veo (Google) and Gen-3 (Runway) followed. Video generation became a real consumer product.
AlphaFold 3 (DeepMind, 2024) extended the protein-structure model to protein-ligand, protein-nucleic-acid, and protein-protein complexes - a scientifically substantial generalization.
Stable Diffusion 3 and Flux demonstrated flow-matching at production scale for text-to-image.
Generative models for scientific design (drugs, materials, catalysts) matured into a substantial sub-field.
Where this leaves us in 2026
The current state. Generative modelling is a mature subfield with six major mathematical families and many modality-specific applications. The dominant production paradigms are diffusion (still widely deployed) and flow-matching (increasingly dominant in new systems). Autoregressive models dominate text generation and parts of audio. VAEs and flows are mostly used as components rather than as standalone systems. GANs are largely historical for new system design.
The remaining sections of this chapter develop each family on its own terms. §3 covers autoregressive; §4 VAEs; §5 flows and GANs; §6 diffusion (the centerpiece); §7 flow-matching; §8 conditional generation; §9 modality-specific architectures; §10 evaluation; §11–§15 close out.
Editorial note. The diffusion-vs-flow-matching boundary is the most rapidly evolving part of the chapter and the most likely to date. The 2024–2026 period has seen continued consolidation; readers should treat the diffusion-vs-flow-matching material as a current snapshot rather than a settled state.
§3. Autoregressive Models
The factorization
The conceptual core of autoregressive generation. For any structured object $x = (x_1, \dots, x_T)$ - a sequence of tokens, pixels, audio samples, mesh vertices - the joint probability factors exactly using the chain rule of probability:
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Reading this. The joint probability of the entire sequence equals the product of $T$ conditional probabilities - each $x_t$’s probability given everything that came before. The factorization is mathematically exact; it makes no assumption about the data. It is simply the chain rule.
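The exactness claim can be checked numerically. The sketch below builds an arbitrary joint distribution over three binary variables (random weights, so no structure is assumed), derives each conditional by marginalization, and confirms that the product of conditionals reproduces the joint. All names are illustrative.

```python
import itertools
import random

rng = random.Random(0)

# An arbitrary joint distribution over three binary variables (x1, x2, x3):
# random positive weights, normalized to sum to 1.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [rng.random() for _ in outcomes]
total = sum(weights)
joint = {x: w / total for x, w in zip(outcomes, weights)}


def marginal(prefix):
    """P(x_1..x_k = prefix), obtained by summing out the remaining variables."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)


def conditional(x, t):
    """P(x_t | x_1..x_{t-1}) for 1-indexed position t."""
    return marginal(x[:t]) / marginal(x[:t - 1])


def chain_rule_product(x):
    """Product of the per-position conditionals."""
    prod = 1.0
    for t in range(1, len(x) + 1):
        prod *= conditional(x, t)
    return prod


# Largest discrepancy between the joint and the chain-rule product.
max_error = max(abs(chain_rule_product(x) - joint[x]) for x in outcomes)
print("max |product of conditionals - joint| =", max_error)
```

Any discrepancy is floating-point rounding; the identity holds for every joint distribution, which is why autoregressive models lose nothing in expressiveness by adopting it.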
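An autoregressive model parameterizes each of the conditional distributions $p(x_t \mid x_1, \dots, x_{t-1})$ with a learnable function $p_\theta$ (typically a neural network with shared parameters across time steps). Generation works one token at a time: sample $x_1 \sim p_\theta(x_1)$, then $x_2 \sim p_\theta(x_2 \mid x_1)$, then $x_3 \sim p_\theta(x_3 \mid x_1, x_2)$, and so on.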
A worked example: a tiny autoregressive image model
To anchor the abstraction. Suppose we want to model $4 \times 4$ binary images, $x \in \{0, 1\}^{16}$. Order the pixels in row-major order: top-left, one to its right, ..., bottom-right.
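The factorization is $p_\theta(x) = \prod_{t=1}^{16} p_\theta(x_t \mid x_1, \dots, x_{t-1})$, with the pixels visited in this order: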
x_1 → x_2 → x_3 → x_4
│
▼
x_5 → x_6 → x_7 → x_8
│
▼
x_9 → x_10 → x_11 → x_12
│
▼
x_13 → x_14 → x_15 → x_16
For each pixel $x_t$, the model predicts $p_\theta(x_t = 1 \mid x_1, \dots, x_{t-1})$ - a single probability between 0 and 1 since $x_t$ is binary.
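Training: for each image in the dataset, compute the loss $\mathcal{L}(x; \theta) = -\sum_{t=1}^{16} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$, the negative log-likelihood of the image (a sum of 16 binary cross-entropy terms), and minimize its average over the dataset.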
Sampling: draw $x_1 \sim p_\theta(x_1)$, then $x_2 \sim p_\theta(x_2 \mid x_1)$, and so on up to $x_{16}$ - 16 sequential sampling steps in total.
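The worked example can be sketched end to end in plain Python. The conditional model below is a deliberately crude stand-in for a neural network: the function `prob_one` and its `bias` and `weight` parameters are hypothetical, chosen only so the example runs. What the sketch shows is the structure: a per-pixel conditional, a loss that sums 16 cross-entropy terms, and a sampler that takes 16 sequential steps.

```python
import itertools
import math
import random

N_PIXELS = 16  # 4 x 4 image, flattened in row-major (raster) order


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def prob_one(prefix, bias=-0.5, weight=2.0):
    """Hypothetical conditional p(x_t = 1 | x_1..x_{t-1}).

    A stand-in for a neural network: the logit depends on the fraction of
    earlier pixels that are on. bias and weight are made-up parameters.
    """
    frac_on = sum(prefix) / len(prefix) if prefix else 0.5
    return sigmoid(bias + weight * (frac_on - 0.5))


def nll(image):
    """Negative log-likelihood: sum of 16 binary cross-entropy terms."""
    total = 0.0
    for t in range(N_PIXELS):
        p = prob_one(image[:t])
        total -= math.log(p if image[t] == 1 else 1.0 - p)
    return total


def sample(rng):
    """Ancestral sampling: 16 sequential draws, each conditioned on the prefix."""
    image = []
    for _ in range(N_PIXELS):
        image.append(1 if rng.random() < prob_one(image) else 0)
    return image


rng = random.Random(0)
img = sample(rng)
for row in range(4):
    print(img[4 * row: 4 * row + 4])
print("NLL of the sample:", nll(img))

# Exactness check: the probabilities of all 2^16 images sum to 1.
total_prob = sum(math.exp(-nll(x)) for x in itertools.product([0, 1], repeat=N_PIXELS))
print("total probability:", total_prob)
```

The final check enumerates every possible image, which is feasible only because the example is tiny. It illustrates the key property: whatever the conditionals are, the model defines a properly normalized distribution with no partition function to estimate.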
For real images, with far more pixels and many possible values per pixel, the same recipe applies - but the sequential generation becomes prohibitively slow. This is the central trade-off of autoregressive models: exact likelihood at the cost of sequential generation.
Training: next-token cross-entropy
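For a dataset $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ of sequences, the maximum-likelihood objective is
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta\big(x^{(n)}_t \mid x^{(n)}_1, \dots, x^{(n)}_{t-1}\big)$$
where $T_n$ is the length of the $n$-th sequence.
Equivalently, minimize the average negative log-likelihood, which is the next-token cross-entropy loss. For discrete tokens drawn from a vocabulary of size $V$, this is the standard softmax-cross-entropy loss familiar from supervised classification - but applied to the next-token prediction at every position.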
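A minimal sketch of the loss computation, assuming the model has already produced one vector of logits per position. The toy vocabulary size and the logit values are made up for illustration; a real implementation would use a tensor library’s built-in cross-entropy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_cross_entropy(logits_per_position, targets):
    """Average negative log-likelihood of the target token at each position.

    logits_per_position[t] holds the model's scores for the token at position
    t given the earlier tokens; targets[t] is the token actually observed.
    """
    losses = [-log_softmax(logits)[y] for logits, y in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)


V = 5  # toy vocabulary size (assumed)
targets = [2, 0, 4]
uniform = [[0.0] * V for _ in targets]
confident = [[8.0 if v == y else 0.0 for v in range(V)] for y in targets]

loss_uniform = next_token_cross_entropy(uniform, targets)
loss_confident = next_token_cross_entropy(confident, targets)

# A uniform predictor pays log V nats per token.
print("uniform:", loss_uniform, "log V:", math.log(V))
# Perplexity is the exponential of the average loss.
print("confident:", loss_confident, "perplexity:", math.exp(loss_confident))
```

The uniform case gives a useful sanity bound: any trained model should score below $\log V$ nats per token, and perplexity reads as the effective number of equally likely choices per position.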
A crucial implementation detail. Training is parallel across positions. For a given training sequence $x_1, \dots, x_T$, we want to compute the loss at every position $t$. Naively, this requires $T$ forward passes (each conditioning on a different length of context). But with causal masking in a Transformer (LLM §3, DL §6), all next-token predictions can be computed in one forward pass: feed in $x_1, \dots, x_T$ as input, get all predictions at once.
Generation, however, is inherently sequential - each token must be sampled before the next can be conditioned on it. This training-vs-generation asymmetry is the source of autoregressive models’ efficiency profile: training is fast (parallelizable), generation is slow (sequential).
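The effect of causal masking can be illustrated without a Transformer. In the sketch below, uniform attention weights stand in for learned ones, so the output at position $t$ is simply the mean of the inputs up to and including $t$; in a real model this summary would feed the prediction of the next token. This toy setup is an assumption made for illustration. The point is that one masked computation over the whole sequence gives the same per-position results as running a separate computation for each prefix.

```python
def causal_mask(T):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(T)] for t in range(T)]


def masked_mean_one_pass(values):
    """All T context summaries from one masked computation.

    Uniform attention weights stand in for learned ones: row t averages the
    values at positions 0..t and ignores everything later. On real hardware
    the loop over t is a single matrix product.
    """
    T = len(values)
    mask = causal_mask(T)
    out = []
    for t in range(T):
        weights = [1.0 if mask[t][s] else 0.0 for s in range(T)]
        norm = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / norm)
    return out


def prefix_mean_sequential(values):
    """The naive alternative: one separate computation per prefix length."""
    return [sum(values[: t + 1]) / (t + 1) for t in range(len(values))]


tokens = [3.0, 1.0, 4.0, 1.0, 5.0]
parallel = masked_mean_one_pass(tokens)
sequential = prefix_mean_sequential(tokens)
print(parallel)
print(sequential)

# Changing later tokens must not change earlier outputs.
perturbed = masked_mean_one_pass(tokens[:3] + [9.0, 9.0])
print("earlier outputs unchanged:", perturbed[:3] == parallel[:3])
```

At generation time the later tokens do not exist yet, so the same computation has to be repeated once per new token, which is the asymmetry described above.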
Architectural choices
The autoregressive principle is architecture-agnostic. Any neural network that can take $x_1, \dots, x_{t-1}$ and produce $p_\theta(x_t \mid x_1, \dots, x_{t-1})$ works. Three concrete instantiations from history:
PixelRNN / PixelCNN (van den Oord et al., 2016). For images, generate one pixel at a time (and one channel within each pixel). PixelRNN used RNNs with horizontal-and-vertical recurrence; PixelCNN used masked convolutions to ensure each pixel only depended on earlier pixels in the raster order. Generated high-quality images for the time; very slow to sample (one sequential step per pixel per channel).
WaveNet (van den Oord et al., 2016). For raw audio at 16 kHz sampling rate, generate one audio sample at a time using dilated causal convolutions. Achieved unprecedented quality for text-to-speech and music. Generation cost per second of audio: 16,000 forward passes. Optimizations (parallel WaveNet, Gaussian inverse autoregressive flow distillation) reduced this; the framework’s autoregressive nature kept it inherently slow.
Decoder-only Transformers / GPT family (Radford et al., 2018, 2019; Brown et al., 2020). For text tokens, autoregressive over a vocabulary of subword tokens. The dominant text generative architecture from 2018 onward. Developed in detail in LLM §3 (architecture), LLM §5 (training), and LLM §6 (inference). This chapter does not re-develop the LM-specific material.
Strengths of autoregressive models
Three structural advantages worth flagging.
Exact likelihood. Unlike VAEs (which give a lower bound on likelihood) or GANs (which give no likelihood at all), autoregressive models give the exact log-likelihood of any sample. This makes them the standard choice when likelihood-based evaluation matters (anomaly detection, model comparison, compression).
No mode collapse. GANs notoriously fail by collapsing to a few modes. Autoregressive models, trained on next-token cross-entropy, cannot collapse - every token in every training example contributes to the gradient with equal weight. The training objective is coverage-preserving by construction.
Token-level interpretability. Each generation step is interpretable as “the model assigned probability $p$ to this specific token.” For analysis, debugging, and steering, this is far easier than diffusion’s iterative-denoising or GANs’ opaque latent-to-output mapping.
Weaknesses of autoregressive models
Equally structural.
Sequential generation. The dominant practical issue. Generating a $T$-token sequence takes $T$ sequential forward passes. For images and video, $T$ is enormous, making generation expensive (KV-caching helps; LLM §6). The diffusion-vs-autoregressive trade-off for image generation: diffusion uses a few tens of sequential denoising steps (20–50 in the Stable Diffusion XL example of §1; each is a single forward pass over the full image), while autoregressive generation uses one step per token, typically hundreds to thousands for an image (each is a single forward pass producing one token).
Modality-specific tokenization. Autoregressive models need a discrete tokenization of the modality. Text has natural discrete units (words, subwords); images and audio do not. Image autoregressive models require either pixel-by-pixel generation (slow) or learned discrete tokens (typically from a VQ-VAE - see §4). Audio is similar. The tokenization choice substantially affects model quality.
Ordering dependence. The factorization requires choosing an order. For text, left-to-right is natural. For images, raster order is arbitrary - the model “sees” the top-left of the image first and the bottom-right last, which has no semantic justification. The model can still work well in practice but the order imposes structure that the data does not have.
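The tokenization weakness above refers to learned discrete tokens. The core operation in a VQ-style tokenizer is nearest-neighbour lookup in a codebook, sketched below. The codebook and feature vectors are hypothetical hand-picked values; in a real VQ-VAE both the encoder that produces the features and the codebook itself are learned.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def tokenize(features, codebook):
    """Map a sequence of continuous feature vectors to discrete token ids."""
    return [nearest_code(f, codebook) for f in features]


def detokenize(tokens, codebook):
    """Look the ids back up; the result is a lossy reconstruction."""
    return [codebook[k] for k in tokens]


# Hypothetical 2-D codebook with four entries; a real one is learned.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1), (0.7, 0.1)]

tokens = tokenize(features, codebook)
print(tokens)
print(detokenize(tokens, codebook))
```

The resulting ids can be modelled by exactly the same next-token machinery as text. The reconstruction error introduced by the lookup is one reason the tokenizer’s quality caps the quality of the generator built on top of it.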
Connection to LLM §3 and modern practice
LLM §3 develops decoder-only Transformer architecture (embedding, attention, FFN, position encoding, output projection) in detail, applied to the case where the autoregressive model is over text tokens. The same architecture works for any tokenized modality:
Text: tokens from BPE or SentencePiece subword vocabulary.
Images: tokens from a learned VQ-VAE codebook (Esser, Rombach, Ommer, 2021 “Taming Transformers” / VQGAN; the LlamaGen line).
Audio: tokens from a learned audio codec (SoundStream, EnCodec).
Video: tokens from a 3D VQ-VAE (the MAGVIT line).
Multimodal: interleaved tokens from multiple codebooks.
By 2026, autoregressive Transformers dominate text generation entirely and have substantial niches in image, audio, and video generation (often alongside or in competition with diffusion/flow-matching). The choice between autoregressive and diffusion-style generation for a given modality depends on sample-quality requirements, generation-speed requirements, modality-specific tokenization quality, and engineering investment in each path.
Where autoregressive models sit in 2026
The honest accounting. Autoregressive models are:
Dominant for text and structured-sequence data.
Competitive but not dominant for images, audio, and video (where diffusion/flow-matching is often preferred).
Increasingly important for multimodal generation, where unified-token autoregressive models can generate across modalities in a single framework (Chameleon, Gemini Native).
The trade-offs are well-understood; the choice is application-specific rather than a sweeping recommendation.
§4. Variational Autoencoders (VAEs)
The latent-variable framework
A different family of generative models, based on a different conceptual move. VAEs assume the data is generated from a low-dimensional latent variable $z$:
LATENT-VARIABLE GENERATIVE MODEL
z ~ p(z) (prior, e.g., N(0, I) in R^d)
│
▼
x ~ p_theta(x | z) (decoder)
To generate a new sample x:
1. Sample z from the prior.
2. Pass z through the decoder to get p(x | z).
3. Sample x from this distribution.
The model specifies a prior $p(z)$ (usually a standard normal in $\mathbb{R}^d$ for some modest latent dimension $d$) and a decoder $p_\theta(x \mid z)$ - typically a neural network mapping $z$ to the parameters of the output distribution (e.g., the mean and variance of a Gaussian over images).
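To compute the marginal likelihood of a data point $x$, integrate over the latent:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$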
This integral is the fundamental difficulty of latent-variable models. For deep decoders, the integral is intractable - there is no closed-form solution and no efficient way to compute it. We need an approximation. The variational autoencoder (Kingma and Welling, 2013) is one specific way to handle this.
The ELBO: a tractable lower bound on the likelihood
The variational approach. Introduce a second neural network - an encoder $q_\phi(z \mid x)$ - that maps inputs to distributions over latents. The encoder approximates the (true but intractable) posterior $p_\theta(z \mid x)$.
For any encoder $q_\phi$ and any input $x$, the Evidence Lower BOund (ELBO) is
$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
Reading this. The right-hand side is a lower bound on the log-likelihood that we can compute and optimize. The bound has two terms:
The reconstruction term $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$: encode $x$ into a latent $z$, then ask how well the decoder reconstructs $x$ from $z$. Higher reconstruction quality → higher likelihood.
The KL term $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$: how far the encoder’s output (the approximate posterior over $z$ given $x$) is from the prior $p(z)$. Acts as a regularizer pushing the encoder toward the prior.
The two terms have a clean interpretation. Reconstruct well, but don’t deviate too much from the prior. The trade-off is what produces the VAE’s structured latent space.
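Why the right-hand side is a lower bound can be seen from a short derivation. The following is a standard identity, written here with the encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$ used in this section; it is a sketch of the usual argument rather than a new result.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \\
  &= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)}_{\ge 0}
\end{aligned}
```

Because the last KL term is non-negative, the ELBO never exceeds $\log p_\theta(x)$, and the gap closes exactly when the encoder matches the true posterior.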
The reparameterization trick
To train, we need to backpropagate gradients through the expectation $\mathbb{E}_{q_\phi(z \mid x)}[\cdot]$. The challenge: $z$ is sampled from a distribution that depends on $\phi$ (the encoder’s parameters); naive Monte Carlo sampling does not give a backpropagatable gradient.
The fix: the reparameterization trick (Kingma and Welling, 2013). For a Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))$, sample $z$ as follows:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reading this. Rather than sample $z$ from a distribution depending on $\phi$, sample $\epsilon$ from a fixed distribution (standard normal) and deterministically compute $z$ as a function of $\epsilon$ and $\phi$. Gradients with respect to $z$ flow through $\mu_\phi$ and $\sigma_\phi$ to the encoder’s parameters. The randomness is “off the gradient path.”
This trick is foundational: it appears not only in VAEs but in SAC (RL §7), Gumbel-softmax for discrete latents, and other modern stochastic-network architectures.
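A small numerical check can make the trick concrete. The sketch below is an illustration, not part of any VAE implementation: it assumes a scalar Gaussian and the test function $f(z) = z^2$ (both chosen here for convenience), and estimates the gradients of $\mathbb{E}[f(z)]$ with respect to the mean and standard deviation by differentiating through $z = \mu + \sigma \epsilon$.

```python
import numpy as np

def reparam_grad_estimate(mu, sigma, n_samples=200_000, seed=0):
    """Pathwise estimate of d/dmu and d/dsigma of E[z^2] for z ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)   # noise from a fixed distribution
    z = mu + sigma * eps                   # deterministic function of (mu, sigma, eps)
    df_dz = 2.0 * z                        # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz * 1.0)         # chain rule: dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)      # chain rule: dz/dsigma = eps
    return grad_mu, grad_sigma

g_mu, g_sigma = reparam_grad_estimate(mu=1.5, sigma=0.7)
# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

The estimates should land close to the analytic values $2\mu$ and $2\sigma$; in a VAE the same chain rule is applied automatically by the autodiff framework.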
Pseudocode for VAE training
ALGORITHM VAE Training (Kingma & Welling, 2013)
INPUT: encoder q_phi (outputs mean and log-variance),
decoder p_theta, dataset D
For each minibatch x:
# Encode
mu, log_sigma2 = q_phi(x)
sigma = exp(0.5 * log_sigma2)
# Sample z via reparameterization
eps ~ N(0, I)
z = mu + sigma * eps
# Decode and compute reconstruction loss
x_recon = p_theta_decode(z)
L_recon = reconstruction_loss(x, x_recon)
# e.g., -log p(x | z), often MSE for image data
# or pixelwise cross-entropy for discrete pixels
# KL term: closed form for Gaussian posterior and prior
L_kl = 0.5 * sum(mu^2 + sigma^2 - 1 - log_sigma2)
# Total ELBO (negative, since we minimize)
L = L_recon + L_kl
Update theta, phi by gradient descent on L.
The structure: encode, reparameterize, decode, compute two losses, backpropagate. Two networks ($q_\phi$ and $p_\theta$) trained jointly.
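The closed-form KL term in the pseudocode can be sanity-checked against a Monte Carlo estimate. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior, as in the algorithm above; the function names and the example values of `mu` and `log_sigma2` are illustrative choices.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_sigma2):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(mu**2 + sigma2 - 1.0 - log_sigma2, axis=-1)

def monte_carlo_kl(mu, log_sigma2, n_samples=200_000, seed=0):
    """Estimate the same KL as E_q[log q(z) - log p(z)] with reparameterized samples."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.standard_normal((n_samples, mu.shape[-1]))
    z = mu + sigma * eps
    # log q(z): Gaussian log-density, using (z - mu) / sigma = eps
    log_q = -0.5 * np.sum(eps**2 + log_sigma2 + np.log(2.0 * np.pi), axis=-1)
    # log p(z): standard normal log-density
    log_p = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)
    return np.mean(log_q - log_p)

mu = np.array([0.5, -1.0])
log_sigma2 = np.array([-0.2, 0.3])
kl_closed = gaussian_kl_to_standard_normal(mu, log_sigma2)
kl_mc = monte_carlo_kl(mu, log_sigma2)
print(kl_closed, kl_mc)
```

The two numbers should agree up to sampling noise, and the closed form is exactly zero when the posterior equals the prior (`mu = 0`, `log_sigma2 = 0`).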
A worked example: MNIST VAE
A standard pedagogical instance. Train a VAE on MNIST binary digit images with latent dimension $d = 2$ (chosen low for visualization). After training:
The encoder maps each digit image to a 2D point. Plotting these reveals a structured latent space: similar digits cluster together (all "1"s near each other, all "8"s elsewhere).
Sampling new digits: draw $z \sim \mathcal{N}(0, I)$, decode. The result is a new digit-like image, smoothly interpolating between training-set styles depending on $z$.
Latent traversal: sweep along a line in the 2D latent space and decode at each point. The decoded images smoothly morph from one digit style to another. The latent space is continuous and meaningful.
This combination - meaningful latent space, smooth interpolation, principled generation - is the VAE’s defining feature. It is what GANs do not have (their latent-to-output mapping is not constrained to be meaningful) and what autoregressive models do not have (they have no latent space at all).
Common VAE variants
$\beta$-VAE (Higgins et al., 2017). Modify the ELBO by reweighting the KL term:
$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
For $\beta > 1$, the KL term is upweighted, pushing the posterior closer to the prior. Empirically this often produces more disentangled latents - different dimensions of $z$ correspond to different semantic factors of variation (rotation, scale, colour). The disentanglement claims are contested in subsequent work (Locatello et al., 2019), but $\beta$-VAE remains widely used.
Conditional VAE (CVAE). For tasks where we want to generate conditional on some side information $c$: replace $p_\theta(x \mid z)$ with $p_\theta(x \mid z, c)$ and $q_\phi(z \mid x)$ with $q_\phi(z \mid x, c)$. The structure is unchanged; the conditioning adds $c$ as an input to both networks. Used for image-to-image translation and style-conditional generation.
VQ-VAE (van den Oord, Vinyals, Kavukcuoglu, 2017). The latent variable is discrete - a sequence of integer codes from a learned codebook of size $K$ (typically hundreds to thousands of entries). The encoder maps $x$ to a sequence of codebook indices; the decoder reconstructs $x$ from those indices.
VQ-VAE’s importance: it enables autoregressive modelling in latent space. Train a VQ-VAE on images; tokenize images into sequences of codebook indices; train a Transformer autoregressively over those tokens. This is the recipe behind VQGAN (Esser, Rombach, Ommer, 2021), DALL-E 1 (Ramesh et al., 2021), and the modern image-tokenization stack used in autoregressive image generation. VQ-VAE turns out to be more central to modern generative AI than the standalone VAE: it is the interface between continuous data and autoregressive models.
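The quantization step at the heart of this recipe is a nearest-neighbour lookup. The sketch below shows only that lookup, with a tiny hand-written codebook as a stand-in for a learned one; training details such as the straight-through gradient estimator and the codebook losses are not shown, and the function name is illustrative.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (n, d) array of continuous encoder outputs
    codebook: (K, d) array of code vectors
    Returns (indices, quantized) with shapes (n,) and (n, d).
    """
    # Squared Euclidean distance between every latent and every code: shape (n, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs**2, axis=-1)
    indices = np.argmin(dists, axis=1)     # discrete tokens
    quantized = codebook[indices]          # what the decoder would receive
    return indices, quantized

# Toy codebook with K = 3 entries in d = 2 dimensions (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7], [0.6, 0.6]])
indices, quantized = vector_quantize(latents, codebook)
print(indices)
```

The integer `indices` are what an autoregressive Transformer would be trained on; the decoder only ever sees the corresponding rows of the codebook.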
Posterior collapse and other VAE pathologies
Two characteristic failure modes worth flagging.
Posterior collapse. The encoder learns to ignore $x$ and produce $q_\phi(z \mid x) = p(z)$ for all $x$ - making the KL term zero. The decoder then must produce $x$ from no information (since $z$ carries none), which means the reconstruction loss is high. The model produces average-looking outputs regardless of input.
The cause: the decoder is too powerful relative to the encoder; it can produce reasonable reconstructions without needing $z$ to carry information, so the optimizer takes the easy path. Mitigations include KL warmup (start with KL weight $\beta = 0$ and anneal up), bottleneck design, and free-bits objectives (Kingma et al., 2016).
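Two of these mitigations are simple enough to sketch directly. The code below is a simplified illustration: the linear warmup shape, the per-dimension floor, and all numeric values are assumptions made here, and the published free-bits objective applies the floor to minibatch-averaged KL per group of latents rather than to a single example.

```python
import numpy as np

def kl_warmup_weight(step, warmup_steps):
    """Linear KL warmup: the weight rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits):
    """Free-bits style KL: each latent dimension is charged at least `free_bits` nats,
    so the optimizer gains nothing by pushing a dimension's KL below that floor."""
    return float(np.sum(np.maximum(kl_per_dim, free_bits)))

# Hypothetical per-dimension KL values for one example.
kl_per_dim = np.array([0.01, 0.80, 0.00, 2.50])
plain_kl = float(np.sum(kl_per_dim))
floored_kl = free_bits_kl(kl_per_dim, free_bits=0.25)
weights = [kl_warmup_weight(s, warmup_steps=1000) for s in (0, 500, 5000)]
print(plain_kl, floored_kl, weights)
```

In a training loop the loss would be `L_recon + kl_warmup_weight(step, warmup_steps) * L_kl`, with `L_kl` optionally replaced by the floored version.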
Blurry samples. Standard VAE decoders trained with Gaussian likelihood produce blurry images. The reason is technical: the Gaussian likelihood penalizes pixel-wise squared error, which is minimized by averaging over plausible reconstructions rather than committing to any specific one. The blurriness is an artefact of the loss function, not of the latent-variable framework. VQ-VAEs avoid it (discrete codes prevent averaging); diffusion models avoid it (denoising loss does not penalize per-pixel averaging the same way).
Where VAEs sit in 2026
The honest accounting. Standalone VAE generative models are largely historical - GANs displaced them for image quality in 2016–2020; diffusion displaced both in 2021–2022. New high-quality generative systems are rarely standalone VAEs.
But VAEs (especially VQ-VAEs) are ubiquitous as components:
Latent encoders in latent diffusion. Stable Diffusion uses a VAE to encode images into a compressed latent space before applying diffusion. The VAE compresses (e.g.) a $512 \times 512 \times 3$ image into a $64 \times 64 \times 4$ latent, so each diffusion step operates on 64x fewer spatial positions (48x fewer values) and is correspondingly cheaper. The diffusion model lives in latent space; the VAE handles the pixel-space interface.
Discrete tokenizers for autoregressive image models. VQ-VAE provides the tokenization that turns continuous images into discrete tokens for Transformer autoregressive modelling.
Audio and video codecs. Modern audio (EnCodec) and video (MAGVIT) tokenizers are VQ-VAE-style discrete encoders.
The VAE framework - encoder, decoder, latent bottleneck, possibly with discrete codes - is one of the most-used architectural patterns in modern generative AI, even though the standalone VAE-as-generator is rarely used today.
§5. Normalizing Flows and GANs
We treat these two families together because both are largely historical for new system design but remain conceptually important. Flows are the simplest exact-likelihood family that handles continuous data; GANs were the dominant image-generation paradigm for half a decade and shaped the field’s expectations.
Normalizing flows: exact likelihood via invertible transformations
The defining idea. A normalizing flow is an invertible neural network that maps a simple distribution (typically standard Gaussian) to the data distribution. The model density is computed from the base density and the Jacobian-determinant of the inverse transformation.
Concretely. Let $z \sim p_Z$ be a sample from the base distribution. Define $x = f_\theta(z)$. By the change-of-variables formula:
$$p_X(x) = p_Z\big(f_\theta^{-1}(x)\big) \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|$$
Reading this. The density of $x$ under the flow equals the density of the corresponding $z = f_\theta^{-1}(x)$ in the base distribution, multiplied by the Jacobian-determinant of the inverse transformation. The Jacobian-determinant captures how much the transformation locally stretches or compresses volume - high-density regions of $x$-space are where the transformation has compressed volume relative to $z$-space.
Two design requirements follow:
The transformation must be invertible (so that $f_\theta^{-1}(x)$ exists for every $x$).
The Jacobian-determinant must be efficiently computable (so that training and density evaluation are tractable).
These requirements rule out standard neural network architectures (a generic deep network has neither property). Flow architectures are designed specifically to be invertible with tractable Jacobians.
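The change-of-variables formula can be checked on the simplest possible flow. The sketch below assumes a one-dimensional affine map $x = a z + b$ (chosen here for illustration, since its pushforward of a standard normal is known to be $\mathcal{N}(b, a^2)$) and compares the flow density against that reference.

```python
import numpy as np

def standard_normal_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def flow_density(x, a, b):
    """Density of x = a*z + b with z ~ N(0, 1), via the change-of-variables formula."""
    z = (x - b) / a                        # inverse transformation f^{-1}(x)
    abs_det_inv_jacobian = 1.0 / abs(a)    # |d f^{-1}(x) / dx| in one dimension
    return standard_normal_pdf(z) * abs_det_inv_jacobian

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-4.0, 8.0, 25)
a, b = 2.0, 1.5
p_flow = flow_density(x, a, b)
p_ref = normal_pdf(x, mean=b, std=abs(a))
print(np.max(np.abs(p_flow - p_ref)))
```

Here the map stretches volume by a factor of $|a|$, so the density is scaled down by $1/|a|$; deep flows apply the same bookkeeping layer by layer.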
Coupling-layer flows: RealNVP and Glow
The dominant flow architecture. RealNVP (Dinh, Sohl-Dickstein, Bengio, 2017) introduced coupling layers: split the input into two halves $(x_a, x_b)$; transform one half conditional on the other; leave the other half unchanged. The transformation $x_b' = x_b \odot \exp(s(x_a)) + t(x_a)$, for neural networks $s$ and $t$, is invertible by construction and has a triangular Jacobian whose determinant is the product of the diagonal - efficiently computable.
COUPLING LAYER (RealNVP)
Input: x = (x_a, x_b) # split into two halves
Forward:
x_a' = x_a # unchanged
x_b' = x_b * exp(s(x_a)) + t(x_a)
Output: (x_a', x_b')
Inverse:
x_a = x_a'
x_b = (x_b' - t(x_a')) * exp(-s(x_a'))
Log-det Jacobian:
log |det J| = sum(s(x_a)) # simply the sum of the scale outputs
Stacking many coupling layers with different splits produces a flow with substantial expressive power. Glow (Kingma and Dhariwal, 2018) added invertible 1x1 convolutions to mix channels between layers and produced high-quality image generation.
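The coupling layer above translates almost line for line into NumPy. In the sketch below the networks `s` and `t` are replaced by fixed random linear maps (a stand-in assumed here purely for illustration), and a finite-difference Jacobian is used to confirm that the log-determinant really is the sum of the scale outputs.

```python
import numpy as np

def make_toy_nets(dim_a, dim_b, seed=0):
    """Hypothetical stand-ins for the scale and shift networks s(.) and t(.)."""
    rng = np.random.default_rng(seed)
    W_s = 0.5 * rng.standard_normal((dim_a, dim_b))
    W_t = rng.standard_normal((dim_a, dim_b))
    s = lambda x_a: np.tanh(x_a @ W_s)     # bounded log-scale
    t = lambda x_a: x_a @ W_t              # shift
    return s, t

def coupling_forward(x, s, t, dim_a):
    x_a, x_b = x[:dim_a], x[dim_a:]
    y_b = x_b * np.exp(s(x_a)) + t(x_a)    # transform second half given the first
    log_det = np.sum(s(x_a))               # log |det J|
    return np.concatenate([x_a, y_b]), log_det

def coupling_inverse(y, s, t, dim_a):
    y_a, y_b = y[:dim_a], y[dim_a:]
    x_b = (y_b - t(y_a)) * np.exp(-s(y_a))
    return np.concatenate([y_a, x_b])

def numerical_log_abs_det(f, x, h=1e-6):
    """log |det J| of f at x from a central finite-difference Jacobian."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return np.linalg.slogdet(J)[1]

DIM_A = 2
s, t = make_toy_nets(DIM_A, 2)
x = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = coupling_forward(x, s, t, DIM_A)
x_back = coupling_inverse(y, s, t, DIM_A)
log_det_numeric = numerical_log_abs_det(lambda v: coupling_forward(v, s, t, DIM_A)[0], x)
print(log_det, log_det_numeric)
```

Note that the inverse never inverts `s` or `t` themselves, which is why they can be arbitrary networks.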
Autoregressive flows: MAF and IAF
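A different parameterization. Masked Autoregressive Flow (MAF) (Papamakarios, Pavlakou, Murray, 2017) makes each output dimension depend on previous output dimensions:
$$x_i = z_i \exp\big(\alpha_i(x_{1:i-1})\big) + \mu_i(x_{1:i-1})$$
where the shift $\mu_i$ and log-scale $\alpha_i$ are produced by a masked network from the earlier outputs.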
The Jacobian is triangular (each output depends only on earlier outputs), so the determinant is again a simple product.
Inverse Autoregressive Flow (IAF) (Kingma, Salimans, Welling et al., 2016) reverses the direction: each output dimension depends on previous latent dimensions $z_{1:i-1}$ rather than on previous outputs. MAF is fast for density evaluation, slow for sampling; IAF is fast for sampling, slow for density evaluation. The choice depends on which operation matters more.
Where flows sit in 2026
The honest accounting. Normalizing flows are mathematically elegant and have exact likelihoods (unlike VAEs’ lower bounds and GANs’ absence of likelihood). They are the cleanest exact-likelihood family for continuous data.
But they have not become dominant. Three reasons:
Architectural constraint cost. The invertibility-and-tractable-Jacobian requirement substantially limits expressiveness compared to unconstrained networks. State-of-the-art image generation requires deep flows with many coupling layers; the result is parameter-heavy and slower to train than alternatives.
Sample quality. Modern flows produce reasonable but not state-of-the-art image samples. Diffusion (§6) and flow-matching (§7) substantially exceed flows on image quality benchmarks.
Use cases narrowed. The clean-density-evaluation niche turned out to be served well enough by autoregressive models (which also have exact likelihood, are simpler architecturally, and scale better).
Flows remain in use in a few niches: scientific applications where exact likelihood matters (cosmology simulators, lattice field theory); density estimation for anomaly detection; conditional density estimation for tabular data. They are also conceptually important - the continuous flow framework underlies flow-matching (§7), which is the modern descendant.
GANs: adversarial training
The defining idea. Generative Adversarial Networks (Goodfellow et al., 2014) train two networks in opposition:
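A generator $G_\theta$ that maps random noise $z \sim p_z$ (typically Gaussian) to samples $x = G_\theta(z)$.
A discriminator $D_\phi$ that takes a sample $x$ and outputs the probability that it is real (from the data distribution) rather than generated (by $G_\theta$).
The training objective:
$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big]$$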
Reading this. The discriminator maximizes its ability to distinguish real from generated; the generator minimizes the discriminator’s ability (by producing samples the discriminator misclassifies as real). At equilibrium, the generator produces samples that the discriminator cannot distinguish from real data.
A clean theoretical result (Goodfellow et al.): at the unique fixed point of the minimax game, $p_G = p_{\text{data}}$ and $D(x) = 1/2$ for every $x$. The generator has matched the data distribution; the discriminator is reduced to chance.
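The fixed-point result follows from a short calculation. The sketch below restates the standard argument from the original GAN paper, writing $p_G$ for the distribution of generated samples and $V(G, D)$ for the minimax objective (a label introduced here for convenience).

```latex
\begin{aligned}
&\text{For fixed } G,\ \text{maximize pointwise in } D(x):\quad
  p_{\text{data}}(x)\,\log D(x) + p_G(x)\,\log\big(1 - D(x)\big) \\
&\Longrightarrow\quad D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} \\[4pt]
&\text{Substituting } D^* \text{ back:}\quad
  V(G, D^*) = -\log 4 + 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_G\big)
\end{aligned}
```

The Jensen-Shannon divergence is non-negative and zero only when $p_G = p_{\text{data}}$, at which point $D^*(x) = 1/2$ everywhere.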
Pseudocode for GAN training
ALGORITHM GAN Training (Goodfellow et al., 2014)
INPUT: generator G_theta, discriminator D_phi, dataset D
For each iteration:
# Discriminator update
Sample minibatch x_real from D.
Sample minibatch z from p_z; compute x_fake = G_theta(z).
L_D = -mean[ log D_phi(x_real) + log(1 - D_phi(x_fake)) ]
phi = AdamStep(phi, L_D)
# Generator update
Sample minibatch z from p_z; compute x_fake = G_theta(z).
L_G = -mean[ log D_phi(x_fake) ] # non-saturating form
theta = AdamStep(theta, L_G)
The generator’s “non-saturating” loss $-\log D_\phi(G_\theta(z))$ is a practical improvement on the theoretical $\log(1 - D_\phi(G_\theta(z)))$ - the latter has vanishing gradients early in training when $D_\phi$ classifies fakes as fake with high confidence.
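The vanishing-gradient claim can be checked with two one-line derivatives. The sketch below assumes the discriminator output is a sigmoid of a logit (a common parameterization, assumed here) and compares the gradient of each generator loss with respect to that logit when the discriminator confidently rejects a fake.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - D) with D = sigmoid(logit); the original minimax generator loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(D) with D = sigmoid(logit); the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# Early in training: the discriminator is confident that a generated sample is fake.
logit = -6.0
d_fake = sigmoid(logit)                  # D(G(z)) is close to 0
g_sat = saturating_grad(logit)           # nearly zero: almost no learning signal
g_nonsat = non_saturating_grad(logit)    # close to -1: strong learning signal
print(d_fake, g_sat, g_nonsat)
```

Both losses push the logit in the same direction; the non-saturating form simply keeps the gradient large exactly where the generator is doing worst.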
Mode collapse, instability, and the GAN failure modes
The characteristic difficulties of GAN training.
Mode collapse. The generator finds a small subset of the data distribution that consistently fools the discriminator and stays there. For MNIST, this might be a generator that only produces 3’s and 7’s, ignoring other digits. The training loss is low; the discriminator is fooled; but the model has not learned the full distribution. Mode collapse is the GAN-specific failure that cannot happen with autoregressive models, VAEs, or diffusion (all of which have likelihood-style losses that penalize ignoring parts of the distribution).
Training instability. GANs train via a minimax game with two networks; the dynamics are not those of standard gradient descent on a single loss. The training can oscillate, diverge, or fail to converge to a useful state. Empirically, GAN training requires careful hyperparameter tuning, architectural choices, and tricks (spectral normalization, gradient penalties, careful loss weighting). It is fragile relative to other generative-modelling families.
No likelihood. GANs do not provide any way to evaluate $p(x)$ for a given $x$. This rules out likelihood-based applications (anomaly detection, compression) and makes evaluation difficult - there is no internal metric of progress, only the discriminator’s loss (which can be misleading) and external sample-quality metrics (FID, §10).
The GAN refinement programme: 2016–2020
The years following Goodfellow’s original paper saw substantial engineering effort addressing the failure modes:
DCGAN (Radford, Metz, Chintala, 2016). Practical guidelines for CNN-based generators (transposed convolutions, batch normalization, leaky ReLU). The first stable recipe for GAN training at moderate resolutions.
Wasserstein GAN (WGAN) (Arjovsky, Chintala, Bottou, 2017). Replace the Jensen-Shannon-style discriminator with a Wasserstein-distance critic. The Wasserstein loss has smoother gradients and addresses many training-stability issues. WGAN-GP (Gulrajani et al., 2017) added a gradient-penalty term that made the recipe practical.
Progressive Growing GANs (Karras, Aila, Laine, Lehtinen, 2018). Train at low resolution first, then progressively add resolution. Stabilizes high-resolution training.
BigGAN (Brock, Donahue, Simonyan, 2019). Class-conditional ImageNet generation at up to $512 \times 512$. Demonstrated that GANs scaled to substantial dataset sizes with careful engineering.
StyleGAN (Karras, Laine, Aila, 2019) and StyleGAN2/3. Unconditional face generation at $1024 \times 1024$ with style-based generator architecture. The visual high-water mark of the GAN era - and the closest GANs came to “photorealistic” image generation.
Through 2020 GANs were the default choice for image synthesis. Then diffusion arrived.
Why GANs gave way to diffusion
By 2022, diffusion (§6) had largely displaced GANs for new image-generation systems. The reasons are mostly empirical-and-engineering rather than theoretical:
Mode coverage. Diffusion’s likelihood-style training covers the data distribution by construction; GANs collapse.
Training stability. Diffusion training is single-loss-on-single-network supervised learning; GAN training is fragile minimax.
Conditioning. Text-to-image conditioning (CLIP-guided, classifier-free) integrates more naturally with diffusion than with GANs.
Sample quality. By 2022 diffusion matched or exceeded StyleGAN quality on standard benchmarks while being more tractable to scale to text-conditional and high-resolution settings.
GANs are not gone. They survive in niches where their speed matters: GAN sampling is single-forward-pass (vs diffusion’s 20–50 steps), making GANs attractive for real-time applications. They also remain a useful research model - easy to analyze theoretically, useful for studying training-dynamics questions. The high-fidelity-but-narrow-distribution generator (e.g., StyleGAN for specific face datasets) remains a real engineering choice in 2026.
But for new general-purpose generative-modelling research and system design, GANs are no longer the default. The mantle has passed to diffusion and flow-matching.
§6. Diffusion Models
This is the chapter’s centerpiece. Diffusion models are the dominant non-text generative paradigm of 2026 (with flow-matching, §7, as the increasingly-dominant variant in new deployments). We develop the framework mechanistically: the forward noising process, the reverse denoising process, the loss, the sampling algorithms, and the architectural choices that make it work at scale.
The intuition
The core idea, stripped of mathematics. Suppose we want to generate images. We have a large dataset of real images. Here is the procedure:
Training-time. Take each training image. Gradually corrupt it with Gaussian noise - adding a tiny amount of noise at step 1, more at step 2, until at step $T$ the image is pure noise (indistinguishable from $\mathcal{N}(0, I)$). At each noise level, the model is trained to predict the noise that was added.
Sampling-time. Start with pure noise. Gradually remove the noise - using the trained model to predict, at each step, what noise to remove. After $T$ steps, what was pure noise has become a plausible new image.
The training process defines a recipe for going from noise to data, by reversing the noising process. The sampling process executes the recipe.
Two pictures to anchor:
FORWARD PROCESS (training-time, fixed)
x_0 (real image) → x_1 → x_2 → ... → x_T (pure noise)
(each arrow adds a small amount of Gaussian noise)
REVERSE PROCESS (sampling-time, learned)
x_T (pure noise) → x_{T-1} → x_{T-2} → ... → x_0 (generated image)
(each arrow removes the predicted noise)
Two intuitive arguments for why this works.
First, local denoising is easy. Predicting what noise was added in one step of the forward process is a tractable problem - the model only needs to look at slightly-noisy data and predict the small perturbation. It does not need to model the entire transition from data to noise in one shot.
Second, iterative refinement composes. By chaining many small denoising steps, the model effectively transports samples from pure noise to data through a sequence of small, locally-easy moves. This is the same structural insight that makes gradient descent work: many small steps in a good direction can traverse a complicated landscape.
The forward process
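Formally. Define a sequence of distributions $q(x_t \mid x_{t-1})$ for $t = 1, \dots, T$ that add Gaussian noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big)$$
where $\beta_t$ is a small noise level at step $t$ (the variance schedule). A typical schedule, the one used in DDPM, has $\beta_t$ growing linearly from $10^{-4}$ to $0.02$ over $T = 1000$ steps.
Reading this. At each step, $x_t$ is a noisy version of $x_{t-1}$: the mean is scaled down (factor $\sqrt{1 - \beta_t}$, very close to 1 for small $\beta_t$), and Gaussian noise of variance $\beta_t$ is added. With many steps and growing $\beta_t$, $x_T$ becomes essentially pure noise - independent of $x_0$.
A crucial property of Gaussian noise: the forward process composes in closed form. We do not need to sample $x_1, \dots, x_{t-1}$ to get $x_t$; we can sample $x_t$ from $x_0$ directly:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big)$$
where $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$. This closed-form formula lets us sample any noise level in one operation:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Reading this. Sampling at noise level $t$: take a fraction $\sqrt{\bar\alpha_t}$ of the original image plus a fraction $\sqrt{1 - \bar\alpha_t}$ of Gaussian noise. At $t = 0$, $\bar\alpha_0 = 1$ and $x_0$ is the original. At $t = T$, $\bar\alpha_T \approx 0$ and $x_T \approx \epsilon$ (pure noise).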
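The closed-form shortcut can be verified numerically against the step-by-step process. The sketch below assumes the linear DDPM schedule ($\beta$ from $10^{-4}$ to $0.02$ over 1000 steps) and a toy "image" consisting of many copies of a single scalar value; the function names are illustrative.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule and the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 in one step (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def q_sample_iterative(x0, t, betas, rng):
    """Sample x_t by applying the one-step noising kernel t times."""
    x = x0.copy()
    for s in range(t):
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = np.full(100_000, 2.0)     # many copies of the same scalar "pixel"
t = 300
x_direct = q_sample(x0, t, alpha_bars, rng)
x_iter = q_sample_iterative(x0, t, betas, rng)
print(x_direct.mean(), x_iter.mean(), x_direct.std(), x_iter.std())
print(alpha_bars[-1])          # close to 0: x_T is essentially pure noise
```

The two routes should produce matching means and standard deviations up to sampling noise, which is what lets training pick a random $t$ and jump straight to $x_t$.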
The reverse process and the training loss
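The reverse process is what we learn. We parameterize a neural network $\epsilon_\theta(x_t, t)$ that predicts the noise added between $x_0$ and $x_t$.
Specifically. Given $x_t$ and $t$, the model outputs $\epsilon_\theta(x_t, t)$ - a prediction of the noise that was added when producing $x_t$ from $x_0$. The training loss is
$$L(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2\Big]$$
where $t$ is sampled uniformly from $\{1, \dots, T\}$, $x_0$ is sampled from the dataset, and $\epsilon \sim \mathcal{N}(0, I)$ is the noise used to produce $x_t$ from $x_0$.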
Reading this. The loss is just mean squared error between the true noise and the predicted noise. It is a standard supervised regression loss - no adversarial training, no variational lower bound, no Jacobian. This simplicity is one of the reasons diffusion training is so stable.
Important nuance. The full variational derivation of DDPM gives a weighted MSE loss with weights depending on $t$. The Ho-Jain-Abbeel insight: dropping those weights (using uniform-in-$t$ unweighted MSE) gives both better empirical results and a simpler training procedure. This simplified loss is the standard DDPM training objective.
Pseudocode for DDPM training
ALGORITHM DDPM Training (Ho, Jain, Abbeel, 2020)
INPUT: noise schedule beta_1, ..., beta_T (and derived alpha_bar_t),
dataset D, denoising network epsilon_theta
For each minibatch x_0 from D:
# Sample noise level uniformly
t ~ Uniform({1, ..., T})
# Sample noise
epsilon ~ N(0, I)
# Compute noisy x_t in closed form
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
# Predict the noise
epsilon_hat = epsilon_theta(x_t, t)
# Simple MSE loss
L = mean((epsilon - epsilon_hat) ** 2)
Update theta with gradient descent on L.
The structure is striking. Diffusion training is supervised regression - predict the noise, no minimax, no adversarial loss, no variational hierarchy. The network architecture is unconstrained (any deep network works); $T$ is a hyperparameter (typically $T = 1000$). The training-time complexity per step is comparable to a single classification model.
Sampling: DDPM and DDIM
After training, sampling produces a new image by reversing the noising process. The standard DDPM sampling procedure:
ALGORITHM DDPM Sampling
INPUT: trained denoising network epsilon_theta,
noise schedule beta_1, ..., beta_T
1. x_T ~ N(0, I) # start from pure noise
2. For t = T, T-1, ..., 1:
# Predict noise
epsilon_hat = epsilon_theta(x_t, t)
# Compute mean of p(x_{t-1} | x_t) (DDPM formula)
mu_t = (1 / sqrt(1 - beta_t)) *
(x_t - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_hat)
# Sample with stochastic noise (except at final step)
If t > 1: z ~ N(0, I); else: z = 0
x_{t-1} = mu_t + sqrt(beta_t) * z
3. Return x_0
Reading this. At each step, the model predicts the noise; we subtract a scaled fraction of it from $x_t$ to get the mean $\mu_t$; we add a small amount of fresh noise to get $x_{t-1}$. After $T$ steps, $x_0$ is a sample from the model’s approximation of $p_{\text{data}}$.
The cost: $T$ forward passes of $\epsilon_\theta$ to generate a single sample. For $T = 1000$, this is expensive. Two responses:
DDIM (Song, Meng, Ermon, 2021) “Denoising Diffusion Implicit Models.” A deterministic sampler that produces samples from the same trained model in far fewer steps - typically 20–50. The reformulation interprets the diffusion as an ODE rather than an SDE; the deterministic ODE solver can take larger steps than the stochastic process. With DDIM, diffusion sampling becomes practical for production use. DDIM is the dominant sampler in 2026 production systems.
Higher-order solvers. DPM-Solver (Lu et al., 2022) and its successors apply advanced ODE-solver theory to reduce step counts further - competitive with DDIM at 10–20 steps for high-quality samples.
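The DDPM sampling loop above can be run end to end without training anything if the data distribution is simple enough that the ideal noise predictor is known. The sketch below assumes one-dimensional Gaussian data, for which $\mathbb{E}[\epsilon \mid x_t]$ has a closed form; that analytic predictor stands in for a trained $\epsilon_\theta$, and the schedule and function names are illustrative.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return betas, np.cumprod(1.0 - betas)

def optimal_eps_gaussian(x_t, alpha_bar_t, data_mean, data_std):
    """E[eps | x_t] when x_0 ~ N(data_mean, data_std^2); stands in for a trained network."""
    var_t = alpha_bar_t * data_std**2 + (1.0 - alpha_bar_t)   # variance of x_t
    return np.sqrt(1.0 - alpha_bar_t) * (x_t - np.sqrt(alpha_bar_t) * data_mean) / var_t

def ddpm_sample(n, betas, alpha_bars, data_mean, data_std, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                     # x_T: pure noise
    for t in range(len(betas), 0, -1):             # t = T, T-1, ..., 1
        beta_t, ab_t = betas[t - 1], alpha_bars[t - 1]
        eps_hat = optimal_eps_gaussian(x, ab_t, data_mean, data_std)
        # Mean of p(x_{t-1} | x_t), as in the DDPM sampling algorithm
        mu = (x - beta_t / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(1.0 - beta_t)
        # Fresh noise at every step except the last
        z = rng.standard_normal(n) if t > 1 else np.zeros(n)
        x = mu + np.sqrt(beta_t) * z
    return x

betas, alpha_bars = make_schedule()
samples = ddpm_sample(20_000, betas, alpha_bars, data_mean=3.0, data_std=0.5)
print(samples.mean(), samples.std())
```

Starting from standard normal noise, the loop should return samples whose mean and standard deviation are close to the assumed data distribution's; with a real dataset the only change is that `optimal_eps_gaussian` is replaced by the trained network.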
Score-based generative modelling: the unified framework
A parallel line of work. Song, Ermon (2019, 2020) developed score-based generative modelling: parameterize a neural network $s_\theta(x, \sigma)$ to estimate the score function $\nabla_x \log p_\sigma(x)$ at each noise level $\sigma$. Sample by Langevin dynamics - random walks guided by the score.
The connection to DDPM. The noise prediction $\epsilon_\theta(x_t, t)$ in DDPM is equivalent to a (rescaled) score estimate $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$. Predicting noise is estimating the score, up to a known constant. Song et al. (2021) formalized this: DDPM and score-based modelling are two equivalent formulations of the same underlying framework. The continuous-time view (stochastic differential equations) provides the cleanest unification and is the foundation for flow-matching (§7).
The practical implication. The framework is one thing with two presentations. Different papers use different notation (some say “predict the noise”; some say “estimate the score”; some say “predict the data $x_0$”); all are equivalent up to rescaling. A reader engaging the literature needs to know all three.
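The rescalings between the three parameterizations follow directly from the Gaussian forward process. The sketch below assumes the DDPM notation $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$; $\hat{x}_0$ is a label introduced here for the model's implied estimate of the clean data.

```latex
\begin{aligned}
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t}
   = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}
  &&\Longrightarrow\quad
  s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}} \\[4pt]
x_0 &= \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon}{\sqrt{\bar\alpha_t}}
  &&\Longrightarrow\quad
  \hat{x}_0(x_t, t) = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
\end{aligned}
```

Given any one of the noise, score, or data predictions at a known $t$, the other two are obtained by these fixed affine maps.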
Conditional diffusion and classifier-free guidance
For most applications, we want conditional generation: image conditioned on text prompt, audio conditioned on melody, structure conditioned on sequence. Two techniques.
Conditional training. Train $\epsilon_\theta(x_t, t, c)$ where $c$ is the conditioning (e.g., a text embedding). The training data includes $(x_0, c)$ pairs; the loss is MSE on noise prediction as before. Architecturally, the conditioning is injected via cross-attention or feature concatenation.
Classifier-free guidance (CFG) (Ho and Salimans, 2021). The crucial technique that made text-to-image diffusion produce high-quality, prompt-faithful samples. The idea:
Train a single model that handles both conditional and unconditional generation. During training, randomly drop the conditioning (replace $c$ with a null token $\varnothing$) with a small probability (values around 10% are common). The result is a network that produces meaningful output whether $c$ is given or null.
At inference, extrapolate away from the unconditional prediction toward the conditional one. Use the guided prediction:
$$\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w\, \big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$
where $w$ is the guidance scale (typically somewhere between about 5 and 10 for text-to-image models).
Reading this. With $w = 1$, this is just the conditional model. With $w > 1$, the guidance exaggerates the conditional signal - pushing the sample more in the direction of the conditioning. Empirically, $w \approx 7.5$ or so produces samples that are strongly prompt-faithful at the cost of some diversity.
Classifier-free guidance is the single most important engineering technique behind modern text-to-image diffusion. Without it, the samples are either uncorrelated to the prompt (low $w$) or excessively diverse (no guidance). The guidance-scale knob lets users trade fidelity for diversity.
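The guidance rule is a single line of arithmetic on two network outputs. The sketch below uses small hand-written arrays as stand-ins for the unconditional and conditional noise predictions (the values are hypothetical), under the convention that a guidance scale of 1 recovers the plain conditional prediction.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for w > 1, past) the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Hypothetical noise predictions for the same (x_t, t), without and with the prompt.
eps_uncond = np.array([0.10, -0.30, 0.50])
eps_cond = np.array([0.20, -0.10, 0.40])

eps_w0 = cfg_combine(eps_uncond, eps_cond, w=0.0)    # ignores the prompt entirely
eps_w1 = cfg_combine(eps_uncond, eps_cond, w=1.0)    # plain conditional model
eps_w75 = cfg_combine(eps_uncond, eps_cond, w=7.5)   # extrapolates past the conditional
print(eps_w0, eps_w1, eps_w75)
```

In a sampler, both predictions come from the same network (once with the prompt, once with the null token), so guidance roughly doubles the per-step compute.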
Latent diffusion: Stable Diffusion’s architecture
The most-used variant of diffusion in 2026. Latent diffusion (Rombach, Blattmann, Lorenz, Esser, Ommer, 2022) “High-Resolution Image Synthesis with Latent Diffusion Models” - the technical core of Stable Diffusion.
The problem latent diffusion solves. Diffusion in pixel space is expensive at high resolution: each forward pass of $\epsilon_\theta$ on a $512 \times 512$ image is computationally heavy. For 50-step sampling, this is prohibitive on consumer GPUs.
The solution. Train a VAE (§4) to compress images into a much-smaller latent representation. For Stable Diffusion: $512 \times 512 \times 3$ images compressed to $64 \times 64 \times 4$ latents - a 64x reduction in spatial positions (48x in total values). Then run diffusion in latent space. After sampling a latent, decode it with the VAE to produce the final image.
LATENT DIFFUSION ARCHITECTURE (Stable Diffusion)
1. text prompt → text encoder (CLIP) → conditioning c
2. z_T ~ N(0, I) (latent, 64x64x4) → U-Net diffusion in latent space, conditional on c → z_0 (latent, 64x64x4)
3. z_0 → VAE decoder → x_0 (image, 512x512x3)
The architecture has three components:
VAE (encoder + decoder). Trained separately first; frozen during diffusion training. Compresses images to latents and back.
Diffusion model (typically a U-Net, increasingly a DiT). Operates in latent space. Conditioned on text via cross-attention.
Text encoder (typically CLIP’s text encoder). Produces the conditioning vectors.
The result: full-quality high-resolution image generation ($512 \times 512$ and above) in a few seconds on a consumer GPU. Stable Diffusion’s open-weights release in 2022 brought text-to-image generation to millions of users and is the most-influential single generative-model release of the diffusion era.
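The size of the saving from working in latent space is a matter of simple arithmetic. The sketch below uses the Stable Diffusion shapes quoted in this section ($512 \times 512 \times 3$ pixels, $64 \times 64 \times 4$ latents); the helper name is illustrative, and the ratios are only a rough proxy for compute, since actual cost depends on the network.

```python
def num_elements(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels

pixel_elements = num_elements(512, 512, 3)     # values the decoder finally produces
latent_elements = num_elements(64, 64, 4)      # values the diffusion model operates on

element_ratio = pixel_elements / latent_elements    # reduction in total values
spatial_ratio = (512 * 512) / (64 * 64)             # reduction in spatial positions
print(pixel_elements, latent_elements, element_ratio, spatial_ratio)
```

The spatial ratio is the more relevant figure for convolutional and attention layers, whose cost scales with the number of positions (quadratically so for self-attention).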
Architectural choices: U-Net vs DiT
The denoising network can be any architecture that takes the noisy input plus time-step and conditioning and outputs a same-shape noise prediction. Two dominant choices.
U-Net. The original DDPM and most early diffusion systems use a U-Net (Ronneberger, Fischer, Brox, 2015) - a CNN with skip connections between encoder and decoder layers. Empirically excellent for image diffusion; matches the image’s spatial structure naturally.
Diffusion Transformer (DiT) (Peebles and Xie, 2023). Replaces the U-Net with a Transformer operating on image patches (like ViT, DL §4). DiT scales better (the Transformer’s parameter count can grow more cleanly than a U-Net’s), and large-scale models (Sora, Stable Diffusion 3) increasingly use DiT-style architectures.
By 2026 the U-Net-vs-DiT choice is still mixed in production. Smaller and older systems use U-Nets; newer and larger systems use DiTs. The trend is toward DiT for new frontier work.
Where diffusion sits in 2026
The dominant generative paradigm for:
Images (Stable Diffusion line, DALL-E line, Midjourney, Imagen, Firefly).
Video (Sora, Veo, Gen-3) - extending diffusion to the space-time domain.
Audio (Riffusion, Stable Audio) and music.
3D (Stable 3D, DreamFusion line) - extending diffusion to 3D representations.
Molecules and proteins (RFdiffusion, Chroma, AlphaFold-3’s structure-prediction component).
Each of these is developed in modality-specific detail in §9.
The flow-matching framework (§7) is increasingly displacing pure-diffusion in new systems, but the diffusion-trained models in production will remain in use for years. The two frameworks are best understood as variations on the same continuous-time generative-modelling theme.
§7. Flow-Matching and Rectified Flow
The continuous-time perspective
Diffusion (§6) was developed as a discrete-time process - steps of adding noise, steps of denoising. The continuous-time perspective generalizes this and is the foundation of flow-matching.
Consider a time-indexed family of distributions $p_t$ for $t \in [0, 1]$, interpolating between a simple base distribution $p_0$ (e.g., $\mathcal{N}(0, I)$) and the data distribution $p_1$. There is some velocity field $u_t(x)$ that, applied as an ordinary differential equation, transports samples from $p_0$ to $p_1$:
$$\frac{dx}{dt} = u_t(x).$$
If we know $u_t$, we can generate a sample from $p_1$ by sampling $x_0$ from the base distribution and integrating the ODE forward to $t = 1$.
CONTINUOUS-TIME GENERATIVE FLOW
Base distribution p_0 (Gaussian noise)
         │
         │   d x / d t = u_t(x)
         │   (velocity field, learned)
         ▼
Time     0 ──────── 0.5 ──────── 1
         │           │           │
Sample:  noise   intermediate   data
p_t:     p_0       p_0.5        p_1
Two questions: (1) what should the velocity field be? (2) how do we learn it from data?
Flow-matching (Lipman, Chen, Ben-Hamu, Nickel, Le, 2022) answers both with a clean framework that subsumes diffusion as a special case.
The flow-matching objective
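The setup. Choose any conditional path $p_t(x \mid x_1)$ - a family of distributions that interpolates between $p_0(x \mid x_1) = \mathcal{N}(0, I)$ (noise, independent of the target) and $p_1(x \mid x_1) = \delta_{x_1}$ (the target itself, with no variance). The simplest such path is the straight-line interpolation:
$$p_t(x \mid x_1) = \mathcal{N}\big(x;\ t\,x_1,\ (1 - t)^2 I\big),$$
or equivalently, sample $x_0 \sim \mathcal{N}(0, I)$ and define $x_t = (1 - t)\,x_0 + t\,x_1$.
Reading this. At $t = 0$, $x_t = x_0$ (pure noise). At $t = 1$, $x_t = x_1$ (the data). In between, $x_t$ linearly interpolates between them. The path is a straight line in $x$-space from noise to data.
For this path, the conditional velocity field - the velocity that transports points along the straight line - is simply
$$u_t(x_t \mid x_1) = \frac{d x_t}{dt} = x_1 - x_0.$$
(The derivative of $x_t$ with respect to $t$.)
The flow-matching loss. Train a network $v_\theta(x_t, t)$ to predict the conditional velocity field at randomly sampled $(t, x_t)$:
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\,\big\|v_\theta(x_t, t) - (x_1 - x_0)\big\|^2\,\right],$$
where $t \sim \mathcal{U}(0, 1)$, $x_0 \sim \mathcal{N}(0, I)$, $x_1 \sim p_{\text{data}}$, and $x_t = (1 - t)\,x_0 + t\,x_1$.
Reading this. The training procedure is structurally identical to diffusion: sample a noise level, sample a noise instance, sample a data point, mix them, predict the target direction. The target is now $x_1 - x_0$ (the direction from noise to data) rather than $\epsilon$ (the noise itself). Otherwise identical.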
Pseudocode for flow-matching training
ALGORITHM Flow-Matching Training (Lipman et al., 2022, straight-line variant)
INPUT: dataset D, velocity-prediction network v_theta
For each minibatch x_1 from D:
    # Sample time and noise
    t ~ Uniform(0, 1)
    x_0 ~ N(0, I)
    # Interpolated point
    x_t = (1 - t) * x_0 + t * x_1
    # Target velocity (straight line)
    v_target = x_1 - x_0
    # Predict
    v_hat = v_theta(x_t, t)
    # MSE loss
    L = mean((v_hat - v_target) ** 2)
    Update theta with gradient descent on L.
The structure is supervised regression - same as diffusion. Compared with DDPM, what changes is the regression target and the interpolation path that produces x_t; the training loop itself is unchanged.
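As a concrete companion to the pseudocode, the sketch below builds one training batch and evaluates the loss in NumPy. The function names (make_fm_batch, fm_loss) and the zero-output placeholder model are illustrative assumptions, not part of any library; a real system would replace the placeholder with a neural network and an optimizer step.

```python
import numpy as np

def make_fm_batch(x1, rng):
    """Build one straight-line flow-matching batch from data x1 of shape (B, D)."""
    batch = x1.shape[0]
    t = rng.uniform(0.0, 1.0, size=(batch, 1))   # one time per example
    x0 = rng.standard_normal(x1.shape)           # noise endpoint
    xt = (1.0 - t) * x0 + t * x1                 # interpolated point
    v_target = x1 - x0                           # straight-line velocity
    return xt, t, v_target

def fm_loss(v_hat, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_hat - v_target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 2)) + 3.0   # toy "data": a Gaussian blob near (3, 3)
xt, t, v_target = make_fm_batch(x1, rng)

# Hypothetical placeholder model that always predicts zero velocity.
v_hat = np.zeros_like(xt)
loss = fm_loss(v_hat, v_target)
```

Sampling t per example (rather than once per minibatch) is a common choice that this sketch assumes; it exposes the network to many noise levels in every update.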
Sampling: ODE integration
After training, sampling is ODE integration. Start from noise, integrate the learned velocity field:
ALGORITHM Flow-Matching Sampling (Euler integration, simplest)
INPUT: trained velocity-prediction network v_theta, number of steps N
1. x ~ N(0, I)  # start from noise
2. dt = 1 / N
3. For i = 0, 1, ..., N-1:
       t = i * dt
       x = x + dt * v_theta(x, t)
4. Return x  # the generated sample at t = 1
The simplest Euler integrator works for moderate step counts. More sophisticated ODE solvers (Heun’s method, Runge-Kutta, adaptive solvers) can produce comparable quality with fewer evaluations. Modern flow-matching systems typically use 20–50 integration steps - comparable to DDIM-sampled diffusion.
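The Euler loop can be sketched directly in NumPy. Because no trained network is available here, the example substitutes an analytic velocity field as an assumption: if all data sits at a single point mu, the marginal velocity of the straight-line path is (mu - x) / (1 - t). The names euler_sample and point_mass_velocity are hypothetical.

```python
import numpy as np

def euler_sample(v_theta, x0, num_steps):
    """Integrate dx/dt = v_theta(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                    # t stays strictly below 1 inside the loop
        x = x + dt * v_theta(x, t)
    return x

# Stand-in for a trained network: data concentrated at the single point mu.
mu = np.array([2.0, -1.0])

def point_mass_velocity(x, t):
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x_start = rng.standard_normal((4, 2))   # four noise samples
samples = euler_sample(point_mass_velocity, x_start, num_steps=10)
```

For this toy field every trajectory is a straight line, so Euler integration is exact and each noise sample lands on mu. With a learned field the trajectories curve, which is why step count and solver order matter in practice.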
Why flow-matching is conceptually cleaner than diffusion
The frameworks are mathematically related; the conceptual cleanliness of flow-matching matters for several reasons.
The path is a choice. Flow-matching makes explicit what diffusion implicitly assumes - that a path connecting noise to data must be specified. Diffusion’s path is determined by its noise schedule; flow-matching lets you choose any path. The straight-line path of rectified flow (below) is not the path diffusion uses, and the straight-line path is empirically better in many settings.
Sampling is ODE integration, not SDE simulation. Diffusion sampling involves stochastic dynamics (in DDPM’s stochastic form) or deterministic dynamics interpretable through SDE-to-ODE conversion (in DDIM). Flow-matching is deterministic from the start. Standard ODE solvers from numerical analysis apply directly.
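Connections to optimal transport. The flow-matching objective is connected to optimal transport (Villani, 2008; Peyré and Cuturi, 2019) - the mathematics of moving probability mass from one distribution to another at minimum cost. The straight-line conditional path is the optimal-transport (displacement) interpolation between each noise sample and its paired data point, although the resulting marginal flow is not in general the optimal-transport map between the noise and data distributions; this gives a principled, if partial, reason to expect it to work well.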
Rectified flow: straightening the paths
Liu, Gong, Liu (2022) “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow” developed a particularly useful instance of the framework.
The observation. The straight-line conditional paths of flow-matching are conditional - they assume we know $x_1$ at training time. The marginal paths (the actual trajectories sampled at inference) are not straight; they curve as samples mix through the population of possible $x_1$’s.
The fix. Rectified flow proposes a reflow procedure: after training a first flow, use it to generate pairs $(x_0, \hat{x}_1)$ (sample $x_0$ from noise, integrate to get $\hat{x}_1$). Train a second flow on these pairs, using the same straight-line objective. The second flow’s marginal paths are straighter than the first’s. Iterate.
The result. After 2–3 reflow iterations, the marginal paths become almost straight. Sampling can then use very few integration steps - sometimes a single Euler step from noise to data - with minimal quality loss. This is the one-step generation result that made rectified flow practically important.
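The data side of reflow can be sketched as follows. The key difference from ordinary flow-matching training is that the noise endpoint is no longer drawn fresh: each x_0 stays coupled to the point the first flow sent it to. All function names are hypothetical, and a constant-drift function stands in for a trained first flow.

```python
import numpy as np

def euler_integrate(v, x0, num_steps):
    """Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * v(x, i * dt)
    return x

def make_reflow_pairs(v_first, num_pairs, dim, num_steps, rng):
    """Couple each noise sample with the point the first flow maps it to."""
    x0 = rng.standard_normal((num_pairs, dim))
    x1_hat = euler_integrate(v_first, x0, num_steps)
    return x0, x1_hat

def reflow_batch(x0, x1_hat, rng):
    """Straight-line targets on the coupled pairs (no fresh noise is drawn)."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1_hat
    return xt, t, x1_hat - x0

# Hypothetical stand-in for a trained first flow: a constant drift.
def v_first(x, t):
    return np.full_like(x, 2.0)

rng = np.random.default_rng(0)
x0, x1_hat = make_reflow_pairs(v_first, num_pairs=5, dim=3, num_steps=20, rng=rng)
xt, t, v_target = reflow_batch(x0, x1_hat, rng)
```

The second flow is then trained on (xt, t, v_target) exactly as in the first round; only the source of the pairs has changed.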
Where flow-matching sits in 2026
By 2026, flow-matching has substantially displaced pure-diffusion in new generative-modelling systems. The empirical advantages:
Better sample quality at low step counts. With well-tuned flow-matching, 4–8 step sampling can produce high-quality samples - substantially faster than DDIM-sampled diffusion.
Cleaner training dynamics. The simpler loss formulation produces more stable training, especially for large models.
One-step generation via rectified flow. When inference speed is paramount, rectified flow’s one-step generation is among the fastest options available.
Notable production systems using flow-matching:
Stable Diffusion 3 (Stability AI, 2024) and successors use a flow-matching objective.
Flux (Black Forest Labs, 2024) is a major commercial text-to-image system built on flow-matching.
Some text-to-video systems (notably ones derived from the Stable Diffusion lineage) have moved to flow-matching.
The framework is not a paradigm shift from diffusion - the architectures, conditioning techniques, classifier-free guidance, and modality-specific engineering carry over almost unchanged. It is a training-objective improvement that is incremental in mechanism but substantial in production impact.
Diffusion-trained models (the bulk of deployed systems in 2026) will remain in use for years; flow-matching is the default for new training runs. The two are compatible enough that “diffusion / flow-matching” is often treated as a single family in informal references.
A unifying conceptual picture
To close the section. The unified view of diffusion, score-matching, and flow-matching:
All three train a neural network to predict something about how to move samples from a base distribution to the data distribution.
DDPM predicts the noise added at each step.
Score-based methods predict the gradient of the log-density at each noise level.
Flow-matching predicts the velocity field of an ODE flow.
All three are equivalent up to rescaling for appropriate parameterization choices. The differences are in the noise schedule (or path), the training objective’s reweighting, and the sampling procedure. The choice between them is empirical (which produces better samples? trains more stably? requires fewer sampling steps?), not theoretical.
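To make "equivalent up to rescaling" concrete, the sketch below converts a noise prediction into a velocity and a score. It assumes the straight-line path of §7, $x_t = (1 - t)\,x_0 + t\,x_1$ with $x_0 = \epsilon$ and $0 < t < 1$; DDPM's noise schedule yields different scaling factors, so these particular formulas should not be applied to a DDPM-trained network. The identities are shown for a single (noise, data) pair; for trained networks they hold between the corresponding conditional expectations. Function names are hypothetical.

```python
import numpy as np

def velocity_from_noise(xt, eps, t):
    """v = x1 - eps, rewritten using x1 = (xt - (1 - t) * eps) / t."""
    return (xt - eps) / t

def score_from_noise(eps, t):
    """Score of N(t * x1, (1 - t)^2 I) evaluated at xt: -eps / (1 - t)."""
    return -eps / (1.0 - t)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(4)     # a "data" point
eps = rng.standard_normal(4)    # the noise endpoint x0
t = 0.3
xt = (1.0 - t) * eps + t * x1   # point on the straight-line path

v = velocity_from_noise(xt, eps, t)
s = score_from_noise(eps, t)
```

Both conversions are pointwise rescalings that depend only on t and x_t, which is why a network trained in one parameterization can in principle be read out in another.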
This unification is the key conceptual maturity of 2024–2026 generative modelling: the framework is one thing with multiple presentations.
§8. Conditional Generation and Guidance
Why conditioning matters
Generative models in production are almost always conditional. Users want to generate an image of a particular thing, a video with a particular subject, a melody in a particular style. Pure unconditional generation - sample anything plausible - has limited applications; conditional generation is what makes generative models useful as tools.
We have already seen two conditioning mechanisms: the conditional VAE (CVAE, §4) and conditional diffusion with classifier-free guidance (§6). This section systematizes the topic, covers the dominant text-conditioning recipes, and develops three deployment patterns (image-to-image, ControlNet, image editing).
Class-conditional models: the simplest case
The simplest setting. Given a discrete class label $c$, generate samples from class $c$. Two implementation patterns.
Conditioning via concatenation. Concatenate a one-hot or embedded representation of $c$ to the input of the generator network. The network learns to use this signal to produce class-specific samples. Used in early class-conditional diffusion (Dhariwal and Nichol, 2021).
Conditioning via cross-attention. Use $c$ (or its embedding) as a key/value in a cross-attention layer; the generator’s intermediate features attend to it. More flexible than concatenation - easily handles multiple conditioning signals and variable-length conditioning (such as text). Now the standard for modern systems.
Class-conditional generation on ImageNet (1000 classes) was an early benchmark; BigGAN and class-conditional DDPM both achieved strong results. The interesting cases came when “class” was replaced by text.
Text-to-image: the dominant conditioning recipe
The biggest deployed conditioning task. Given a natural-language prompt (e.g., “a cat sitting on a windowsill at sunset, oil painting style”), generate an image that matches the prompt.
The recipe, as it stands in 2026:
Text encoder. Embed the prompt into a sequence of feature vectors. Two dominant choices:
CLIP text encoder (Radford et al., 2021): a Transformer trained contrastively against images (see SSL §7). Produces text features that are aligned with image features. Used by Stable Diffusion 1/2 and many derivatives.
T5 (Raffel et al., 2020): a text-to-text Transformer pretrained on general text. Produces more linguistically rich features but is not image-aligned. Used by Imagen and DeepFloyd IF, and increasingly by newer systems.
Modern frontier systems often use both - CLIP for image-text alignment, T5 for rich language understanding.
Cross-attention conditioning. The diffusion U-Net or DiT includes cross-attention layers; the text features serve as keys and values. Each spatial position in the noisy latent attends to relevant tokens in the prompt, allowing the model to align image content with text.
Classifier-free guidance (§6). Trained with random text-dropout; sampled with guidance scale $w$ to balance prompt fidelity with diversity.
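The cross-attention step in this recipe can be sketched as a single attention head in NumPy. The tensor sizes, the random projection matrices, and the function names are illustrative assumptions; production models use multi-head attention with learned weights inside each U-Net or DiT block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latent positions attend to prompt tokens."""
    q = latent_tokens @ w_q                   # (num_positions, d_attn)
    k = text_tokens @ w_k                     # (num_text_tokens, d_attn)
    v = text_tokens @ w_v                     # (num_text_tokens, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)        # each row sums to 1 over prompt tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d_latent, d_text, d_attn = 8, 6, 4            # hypothetical sizes
latent_tokens = rng.standard_normal((16, d_latent))   # e.g. a flattened 4x4 latent grid
text_tokens = rng.standard_normal((5, d_text))        # 5 prompt-token features
w_q = rng.standard_normal((d_latent, d_attn))
w_k = rng.standard_normal((d_text, d_attn))
w_v = rng.standard_normal((d_text, d_attn))

out, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
```

Queries come from the noisy latent and keys/values from the text, so the prompt length can vary without changing the shape of the latent features.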
The pipeline (slightly more detailed than the §6 latent-diffusion diagram):
TEXT-TO-IMAGE PIPELINE (Stable Diffusion architecture)
prompt: "a cat at sunset"
          │
          ▼
text encoder (CLIP / T5)
          │
          ▼
c = sequence of token embeddings (conditioning)
          │
          ├──────────────────────┐
          │                      │
    z_T ~ N(0, I)                │
          │                      │
          ▼                      ▼
┌──────────────────────────────────────────┐
│ U-Net / DiT diffusion in latent space    │
│                                          │
│ At each timestep t and each layer:       │
│   - self-attention over spatial          │
│     positions in z_t                     │
│   - cross-attention from z_t to c        │
│   - feedforward                          │
│                                          │
│ With classifier-free guidance:           │
│   epsilon_tilde = (1 + w) epsilon(z, c)  │
│                     - w epsilon(z, ∅)    │
└──────────────────────────────────────────┘
          │
          ▼
z_0 (clean latent)
          │
          ▼
VAE decoder
          │
          ▼
x_0 (generated image, 1024x1024)
This is the canonical architecture for text-to-image generation in 2026. The major systems (Stable Diffusion, Imagen, DALL-E 3, Midjourney, Flux) are generally understood to follow this broad template, with variations in the specifics.
The guidance scale and the fidelity-diversity trade-off
The single most user-visible hyperparameter. Classifier-free guidance (§6) extrapolates the conditional prediction away from the unconditional prediction:
$$\tilde{\epsilon}_\theta(z_t, c) = (1 + w)\,\epsilon_\theta(z_t, c) - w\,\epsilon_\theta(z_t, \varnothing).$$
The guidance scale $w$ controls the trade-off (values below are for this parameterization, the one used in the pipeline diagram above):
$w = -1$: unconditional generation. The prompt is ignored.
$w = 0$: standard conditional generation. The prompt is respected but the samples are diverse.
Moderate positive $w$ (single-digit values): typical “good” range for text-to-image. Samples are prompt-faithful with reasonable diversity.
Very large $w$: extreme guidance. Samples become over-saturated, artefact-prone, and lose diversity. Mode collapse around the most “average” interpretation of the prompt.
FIDELITY vs DIVERSITY TRADE-OFF (schematic)
[Schematic curve. Vertical axis: fidelity (matches prompt). Horizontal axis: diversity. High w sits at the high-fidelity, low-diversity end (mode collapse); low w sits at the low-fidelity, high-diversity end (off-prompt); the best operating range lies on the bend between the two.]
Users of text-to-image services often have access to this knob; product-grade systems typically default to a value inside the typical range.
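The guidance combination is a one-line extrapolation. The sketch below uses the $(1 + w)$ convention from the pipeline diagram; the two-element arrays are toy stand-ins for the network's conditional and unconditional outputs at one denoising step, and cfg_combine is a hypothetical name.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy stand-ins for the two network outputs at one denoising step.
eps_cond = np.array([1.0, 0.0])
eps_uncond = np.array([0.0, 1.0])

unconditional = cfg_combine(eps_cond, eps_uncond, w=-1.0)  # recovers eps_uncond
conditional = cfg_combine(eps_cond, eps_uncond, w=0.0)     # recovers eps_cond
guided = cfg_combine(eps_cond, eps_uncond, w=2.0)          # extrapolates past eps_cond
```

Some toolkits instead expose a scale s with the form eps_uncond + s * (eps_cond - eps_uncond); the two conventions are related by s = 1 + w, so it is worth checking which one a given interface uses before comparing numbers.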
Image-to-image: starting from a given image
A second conditioning regime. Given an initial image $x_{\text{init}}$ and a text prompt, produce a new image that resembles $x_{\text{init}}$ but is modified per the prompt.
The recipe (SDEdit, Meng et al., 2021). Encode $x_{\text{init}}$ to a latent; add noise to a moderate level $t_0$ (rather than the full $T$); run the diffusion reverse process from $t_0$ down to 0, conditioned on the text prompt. The choice of $t_0$ controls how much the original image structure is preserved: small $t_0$ preserves most of the original; large $t_0$ produces samples mostly determined by the prompt.
This is the image-to-image slider that users see in tools like Stable Diffusion’s web UI. The same machinery handles photo editing, stylization, and inpainting.
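A minimal sketch of the partial-noising idea, assuming the standard DDPM forward marginal $z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ and an illustrative linear beta schedule. The reverse_step argument is a hypothetical placeholder for one prompt-conditioned denoising step; here it is an identity function so the example runs end to end.

```python
import numpy as np

def sdedit_init(z_init, alpha_bar, t0, rng):
    """Noise a clean latent to level t0 using the DDPM forward marginal."""
    eps = rng.standard_normal(z_init.shape)
    return np.sqrt(alpha_bar[t0]) * z_init + np.sqrt(1.0 - alpha_bar[t0]) * eps

def sdedit(z_init, alpha_bar, t0, reverse_step, rng):
    """Start the reverse process from t0 (not T) and run it down to 0."""
    z = sdedit_init(z_init, alpha_bar, t0, rng)
    for t in range(t0, 0, -1):
        z = reverse_step(z, t)   # hypothetical prompt-conditioned denoising step
    return z

# Illustrative linear-beta schedule with T = 1000 steps (an assumption).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])  # alpha_bar[0] = 1

def identity_step(z, t):
    return z                     # placeholder so the sketch runs

rng = np.random.default_rng(0)
z_init = rng.standard_normal((4, 8, 8))
z_light_edit = sdedit(z_init, alpha_bar, t0=50, reverse_step=identity_step, rng=rng)
z_heavy_edit = sdedit(z_init, alpha_bar, t0=900, reverse_step=identity_step, rng=rng)
```

The fraction of the original latent that survives the noising is governed by alpha_bar[t0]: close to 1 for small t0, close to 0 for large t0. That is the quantity the image-to-image strength slider effectively controls.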
ControlNet: spatial conditioning
A third conditioning regime. Sometimes the conditioning signal is spatial - a depth map, an edge map, a pose skeleton, a semantic segmentation. We want the generated image to follow that spatial structure while remaining free in colour, texture, and style.
ControlNet (Zhang, Rao, Agrawala, 2023) “Adding Conditional Control to Text-to-Image Diffusion Models” gives a clean architectural recipe. Take a pretrained text-to-image model; duplicate its encoder; train the duplicate to consume a spatial conditioning signal; inject its outputs into the original network’s decoder via zero-initialized convolutions (so that early training does not disrupt the pretrained model).
The result. A ControlNet for “Canny edges” lets users sketch an outline and have the diffusion fill in a prompt-conditioned image that respects the outline. ControlNets for depth, pose, segmentation, scribble, etc. compose with text-to-image generation to give users fine-grained spatial control.
ControlNet is engineering-heavy but practically transformative for creative applications. Most production text-to-image systems in 2026 ship with at least a dozen ControlNet variants.
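The role of the zero-initialized convolutions can be illustrated with a 1x1 convolution written as a per-pixel matrix multiply. The shapes and names below are hypothetical, and the random arrays stand in for real feature maps; the point is only that with zero weights the pretrained features pass through unchanged.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution on (H, W, C_in) features: a per-pixel linear map."""
    return features @ weight + bias

def inject_control(decoder_features, control_features, weight, bias):
    """Add the control branch to the pretrained decoder features."""
    return decoder_features + conv1x1(control_features, weight, bias)

rng = np.random.default_rng(0)
height, width, channels = 8, 8, 16                              # hypothetical shape
decoder_features = rng.standard_normal((height, width, channels))  # pretrained network
control_features = rng.standard_normal((height, width, channels))  # trainable copy

# Zero initialization: the control branch contributes nothing at step 0.
zero_weight = np.zeros((channels, channels))
zero_bias = np.zeros(channels)
at_init = inject_control(decoder_features, control_features, zero_weight, zero_bias)

# Simulated later state: non-zero weights let the control signal shift the output.
trained_weight = 0.1 * rng.standard_normal((channels, channels))
after_training = inject_control(decoder_features, control_features, trained_weight, zero_bias)
```

Because the combined model starts out exactly equal to the pretrained one, early gradient steps refine the control pathway without first having to undo random perturbations.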
Image editing: prompt-driven changes
A fourth regime. Given a real image and an editing instruction (“change the cat to a dog”, “make it sunset”, “remove the chair”), produce a modified image.
The dominant approach in 2026 uses supervised training data for edits - fine-tuning the diffusion model on (original, edited, instruction) triples - combined with inference-time guidance. Specific systems include:
InstructPix2Pix (Brooks, Holynski, Efros, 2023). Train on paired data generated by GPT-3 and Stable Diffusion; the resulting model edits real images per instruction.
Imagen Editor (Wang et al., 2023). Uses Imagen as backbone with editing-specific conditioning.
DALL-E 3 inpainting and outpainting. Selectively regenerate portions of an image.
The editing space is also where text-conditioning shows its current limits - fine-grained edits (“rotate the cat 30 degrees”, “make the third building taller”) are still difficult, and the best systems make mistakes that a human editor would not.
Negative prompts: a cheap-and-useful technique
A practical conditioning technique worth mentioning. Most production text-to-image systems support a negative prompt - text describing what the user does not want in the image. The implementation is straightforward classifier-free guidance with a non-empty unconditional:
$$\tilde{\epsilon}_\theta(z_t, c, c_{\text{neg}}) = (1 + w)\,\epsilon_\theta(z_t, c) - w\,\epsilon_\theta(z_t, c_{\text{neg}}).$$
Reading this. Replace the empty unconditional $\varnothing$ with a negative-prompt unconditional $c_{\text{neg}}$. The model extrapolates away from $c_{\text{neg}}$ as it extrapolates toward $c$. Setting $c_{\text{neg}} =$ “blurry, low quality, deformed” pushes samples away from these undesirable features.
Negative prompts are widely used in production text-to-image and require no architectural changes - just an inference-time substitution.
Multi-modal and multi-signal conditioning
Modern systems combine multiple conditioning signals: text + depth map + reference image + style image. The architectural approach is to inject each via cross-attention or feature-fusion at different layers; the result is a system that respects multiple constraints simultaneously.
The most general framing: a conditional diffusion model $p_\theta(x \mid c)$ where $c$ is any structured conditioning input (text, image, audio, structured data). The pretraining recipe of large-scale text-conditioned diffusion + targeted ControlNet-style additions for new modalities has produced the bulk of 2026 production systems.
Where conditioning fits in 2026
Three observations to close the section:
Text conditioning is mostly solved at the engineering level. Cross-attention + CLIP/T5 text encoders + classifier-free guidance is a robust recipe. New text-conditional diffusion systems work well out of the box; the engineering effort is now mostly in scaling and refinement.
Spatial conditioning via ControlNet is engineering-heavy but powerful. Each ControlNet must be trained; the recipe is well-established but the work is not zero.
Editing remains a research frontier. Fine-grained, semantically-aware editing is harder than the high-level “prompt-to-image” task. This is where current systems most visibly fall short of human capability.
OP-GM-3 (compositional generalization) is the open problem that underlies many of conditioning’s failure modes: current systems handle common combinations of concepts well but compose novel combinations poorly. We return to this in §13.
§9. Modality-Specific Architectures and Practice
The generative-modelling frameworks of §3–§7 are modality-agnostic - the math is the same whether $x$ is an image, an audio clip, a 3D mesh, or a molecule. The architecture and engineering differ substantially by modality. This section surveys the dominant modalities, their characteristic architectural choices, and the production-grade systems of 2026.
Image generation
The most-developed modality. The dominant 2026 recipe is latent diffusion (§6) or latent flow-matching (§7) with a Transformer-based denoising network.
The standard pipeline:
VAE encoder/decoder trained separately, frozen during diffusion training. Compresses pixel images to a much smaller latent (Stable Diffusion: $512 \times 512 \times 3 \to 64 \times 64 \times 4$, a $48\times$ reduction; newer systems vary the trade-off, with SD3, for example, using a 16-channel latent).
Text encoder (CLIP + T5 in modern systems). Embeds the prompt as a sequence of token features.
Denoising network operating in latent space. Two architectural choices:
U-Net with cross-attention to text features. Standard through 2023; still in production use (Stable Diffusion 1.x, 2.x).
Diffusion Transformer (DiT) (Peebles and Xie, 2023). Operates on patches of the latent; conditioning via cross-attention or adaptive layer norm (AdaLN). Better scaling; the dominant choice in newer systems (Stable Diffusion 3, Flux, Sora-style image backbones).
Sampler. DDIM (20–50 steps) for diffusion-trained systems; Euler or higher-order ODE solvers (10–30 steps) for flow-matching-trained systems.
The notable 2026 commercial systems for text-to-image:
Midjourney v6/v7. Proprietary; visually distinctive; closed.
DALL-E 3 / GPT-4o image generation. Proprietary; tightly integrated with the GPT chat experience.
Stable Diffusion 3 / SDXL. Open weights; the most-used open system.
Flux (Black Forest Labs). Open weights; flow-matching-based; high quality.
Imagen 3 / Imagen 4. Google’s frontier system.
Firefly (Adobe). Trained on stock-photography data; positioned for commercial use with clean licensing.
The technical state. High-resolution photorealistic generation is largely solved at production quality; the open problems are now in (a) compositional generalization (OP-GM-3), (b) precise prompt control, (c) generation speed, and (d) memorization detection.
Video generation
The breakout modality of 2024–2026. Sora (OpenAI, early 2024) demonstrated minute-length, coherent, high-quality text-to-video generation. Veo (Google), Gen-3 (Runway), and Mochi (Genmo) followed.
The architectural choices for video extend image generation to space-time:
Patches in space-time. Sora and successors tokenize video into space-time patches, each covering a small spatial region over some temporal extent (typically 1–4 frames). The diffusion model operates on sequences of these patches.
3D VAE. A specialized VAE that handles temporal compression - typically compressing videos along both the spatial and temporal axes before diffusion is applied.
Temporal attention. The denoising network has both spatial attention (within a frame) and temporal attention (across frames). Sora-style systems use joint spatiotemporal attention; cheaper alternatives factorize.
Long-video conditioning. Sora generates up to one minute of coherent video; the recipe uses progressively longer windows during training and inference, with strong KV-caching to handle temporal context.
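Space-time patching is, mechanically, a reshape. The sketch below splits a (frames, height, width, channels) array into flattened patches; the patch sizes are hypothetical, and since tokenizers of closed systems such as Sora are not public this should be read as an illustration of the idea rather than any particular system's implementation. In latent video models the same operation would be applied to the 3D VAE's latent rather than to raw pixels.

```python
import numpy as np

def patchify_video(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened space-time patches."""
    t, h, w, c = video.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # patch-grid axes first
    return x.reshape(-1, pt * ph * pw * c)    # (num_patches, patch_dim)

# Toy video: 8 frames of 16x16 RGB, filled with distinct values.
video = np.arange(8 * 16 * 16 * 3, dtype=float).reshape(8, 16, 16, 3)
tokens = patchify_video(video, pt=2, ph=4, pw=4)
```

Each row of tokens is one patch spanning 2 frames and a 4x4 spatial window; a linear projection of these rows would give the token sequence the denoising Transformer consumes.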
The current state. Video generation has improved dramatically from 2023 to 2026 but still exhibits visible artefacts on hard cases: rare interactions (a person stirring a cup of coffee), persistent identity (the same person across long shots), and physical realism (objects respecting gravity and contact). These are not fully solved and are active research areas.
Audio generation
Three regimes worth distinguishing.
Speech synthesis. Text-to-speech is largely solved at production quality. The dominant recipes:
Autoregressive (WaveNet line, Tortoise TTS, XTTS). High quality, slow.
Diffusion (NaturalSpeech 3). High quality, faster sampling.
Flow-based and modern flow-matching variants (Voicebox). Increasingly used.
Music generation. Less mature than speech but rapidly improving.
MusicLM (Google, 2023): hierarchical autoregressive on audio tokens.
MusicGen (Meta, 2023): Transformer on EnCodec tokens.
Suno and Udio (2023–2024): commercial systems producing high-quality vocal-and-instrumental music from text prompts.
Stable Audio: diffusion-based commercial music generation.
General audio. Generation of arbitrary audio (sound effects, ambient soundscapes) is less mature. AudioLM (Google), AudioGen (Meta), and others use autoregressive Transformers on discrete audio tokens.
The dominant pattern across audio modalities: audio is tokenized via a learned neural codec (EnCodec, SoundStream, DAC), then either autoregressive or diffusion is applied in the discrete-token or continuous-latent space. The codec is the engineering substrate that makes audio generation tractable.
3D generation
The most-active research frontier in 2026 (still less mature than image or video).
NeRF (Neural Radiance Fields) (Mildenhall et al., 2020). Represents a 3D scene as a continuous function - colour and density at each spatial point. Rendered by ray-marching. Originally trained per-scene; later generalized to generative settings.
Gaussian Splatting (3DGS) (Kerbl et al., 2023). Represents a scene as a cloud of 3D Gaussians with position, covariance, colour, and opacity. Renders faster than NeRF, simpler optimization. Has substantially displaced NeRF for new 3D systems.
Diffusion-based 3D generation. Several techniques.
DreamFusion (Poole et al., 2022). Use a 2D text-to-image diffusion model as a score for optimizing a 3D representation (NeRF or 3DGS) - the Score Distillation Sampling (SDS) approach. Slow (each generation is an optimization) but produces 3D-consistent results.
Direct 3D diffusion. Diffusion models trained directly on 3D representations (point clouds, voxels, implicit fields). Less mature; growing rapidly.
Multi-view diffusion + reconstruction. Generate multiple consistent views with a 2D diffusion model conditioned on camera parameters; reconstruct 3D from the views.
The current state. 3D generation is rapidly improving but is not as polished as image or video. The dominant approach for high-quality results is still optimization-based (SDS + Gaussian Splatting) at substantial inference cost.
Molecules and proteins
A modality where generative models have produced scientifically significant results.
AlphaFold 2 / 3 (Jumper et al., 2021; Abramson et al., 2024). Structure prediction from sequence. Not a generative model in the “sample diverse outputs” sense, but a conditional generative model of structure given sequence. AlphaFold 3 extends to protein-ligand, protein-protein, and protein-nucleic-acid complexes. The architectural keys: in AlphaFold 2, a structure module that respects the rotation-and-translation symmetries of 3D space; in AlphaFold 3, the Pairformer module that handles pairwise residue interactions, followed by a diffusion module that generates atomic coordinates.
RFdiffusion (Watson et al., 2023). Generative model for protein backbones - sample new protein shapes conditional on functional constraints. Used for de novo protein design. The base recipe is diffusion on protein backbone coordinates with an SE(3)-equivariant denoising network.
Chroma (Ingraham et al., 2023). Programmable protein generation with conditioning on symmetry, shape, function. Same general framework as RFdiffusion with refined conditioning.
Molecular generation systems. A separate line of work generates small molecules (drug candidates, materials components). Techniques include equivariant graph diffusion, fragment-based generation, and equivariant flows.
The architectural common feature: equivariance. Models for 3D molecular and protein data must respect the symmetries of physical space - a rotated and translated input should produce a correspondingly rotated and translated output. SE(3)-equivariant neural networks (Thomas et al., 2018; Geiger and Smidt, 2022) are the standard substrate. The AI for Science chapter (planned) develops the equivariance machinery in domain depth.
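The equivariance requirement can be checked numerically: applying a rigid motion before the model should give the same result as applying it after. The sketch below uses a toy stand-in for a denoiser (shrinking points toward their centroid), which happens to be equivariant; all names are hypothetical and no equivariant-network library is used.

```python
import numpy as np

def rotation_z(angle):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def shrink_toward_centroid(points, rate=0.5):
    """Toy 'denoiser': move every point part-way toward the centroid."""
    centroid = points.mean(axis=0, keepdims=True)
    return points + rate * (centroid - points)

def apply_rigid(points, rotation, translation):
    """Rotate then translate each row of an (N, 3) coordinate array."""
    return points @ rotation.T + translation

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 3))      # stand-in for atom coordinates
rotation = rotation_z(0.7)
translation = np.array([1.0, -2.0, 0.5])

transform_then_model = shrink_toward_centroid(apply_rigid(points, rotation, translation))
model_then_transform = apply_rigid(shrink_toward_centroid(points), rotation, translation)
equivariant = np.allclose(transform_then_model, model_then_transform)
```

The same two-path comparison works as a unit test for a real network: a model built from equivariant layers should pass it for any rotation and translation, up to floating-point error.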
Tabular and time-series generation
Worth a brief mention. Generating synthetic tabular data - for privacy-preserving data sharing, missing-value imputation, augmentation - is a substantial industrial application. Dominant techniques: GANs (CTGAN), VAEs (TVAE), and increasingly diffusion (TabDDPM). The challenges are mixed data types (numeric and categorical in the same row) and small dataset sizes (tabular data is rarely as large as image data).
Time-series generation (financial markets, sensor data, biological signals) uses similar techniques. The challenges are temporal coherence (long-range structure that’s hard to capture) and multivariate dependencies (correlated time series with structured interactions).
Multimodal native generation
A 2024–2026 development worth flagging. Some frontier systems generate across multiple modalities natively - text-and-image in the same model, or text-and-image-and-audio. Examples:
GPT-4o (OpenAI, 2024). Generates text, images, and audio in a unified autoregressive framework.
Chameleon (Meta, 2024). Mixed-modal autoregressive on interleaved text and image tokens.
Gemini (Google). Native multimodal input and output.
The architectural approach: tokenize each modality (text via subword, image via VQ-VAE, audio via EnCodec), interleave the tokens into a single sequence, and apply a Transformer autoregressively. This is the unified-tokens approach. Diffusion-based equivalents (joint image-text diffusion, etc.) exist but are less dominant in 2026 frontier systems.
The Multimodal Models chapter (planned) develops this in depth.
Cross-cutting practical issues
Two engineering issues that arise in every modality:
Data curation matters more than algorithmic choice. A diffusion model trained on a small, well-curated dataset typically beats one trained on a large, uncurated dataset. The single biggest practical determinant of generative-model quality is the data. Specific issues: deduplication (highly-replicated training examples cause memorization), copyright filtering, quality filtering, and recaptioning (rewriting messy alt-text into well-structured prompts; the DALL-E 3 and SD3 recipe).
Conditioning quality matters more than model size. A smaller model with high-quality text encoders and conditioning often outperforms a larger model with poor conditioning. Investment in the text encoder (T5, CLIP, both), in the conditioning architecture (cross-attention vs AdaLN), and in negative prompts pays off more than equivalent investment in scaling the denoising network.
These two observations are unifying lessons across modalities and are worth emphasizing because they are sometimes obscured by the architectural detail.
§10. Evaluating Generative Models
Evaluating generative models is hard. We have no single scalar that captures “is this model good.” This section develops the standard metrics, their failure modes, and the 2026 state of the evaluation question.
The fundamental difficulty
The evaluation question has no clean form. For a classifier, we ask: how often is it correct on a held-out test set? The answer is a single number (accuracy or its refinements). For a generative model, the analogous question - how well does it model the data? - admits no such clean answer. The model is producing new samples, not labelling existing ones; “correct” is not the right word.
Three related-but-distinct questions a generative-model evaluation might try to answer:
Sample quality. Are the model’s samples individually high-quality? Photorealistic, coherent, prompt-faithful?
Distribution coverage. Does the model produce a diverse set of samples that covers the true data distribution, not just a narrow mode?
Likelihood. Does the model assign high probability to held-out real data?
A generative model can excel at one and fail at another. A GAN with mode collapse has excellent (1) and terrible (2). An over-regularized VAE has decent (2) and (3) but blurry samples. A perfect-density-estimation autoregressive model might have great (3) and excellent (2) but unimpressive (1) on aesthetic dimensions.
The evaluation problem is choosing which questions to ask and how to operationalize each.
Likelihood-based evaluation
For models that can evaluate density (autoregressive, exact-flow, VAE-ELBO-lower-bound), the natural metric is the log-likelihood on held-out data. For autoregressive language models this is perplexity:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right).$$
Reading this. Perplexity is the effective branching factor - the reciprocal of the geometric mean of the model’s per-token probability over the test set. Lower is better; perplexity 1 means the model assigns probability 1 to every test token (perfect).
For images: bits-per-dimension (BPD), the negative log-likelihood (in nats) per pixel-channel divided by $\ln 2$. Lower is better.
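Both quantities are short computations once per-token or per-image log-likelihoods are available. The sketch below assumes natural-log probabilities as input; the function names and the example numbers (a uniform 1-in-4 model, a 6000-nat image) are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def bits_per_dim(total_nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))

# A model that gives every test token probability 1/4 has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 10
ppl = perplexity(uniform_log_probs)

# Hypothetical 32x32 RGB image whose total NLL under some model is 6000 nats.
bpd = bits_per_dim(6000.0, 32 * 32 * 3)
```

The uniform example is a quick sanity check on the "effective branching factor" reading of perplexity.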
Strengths of likelihood. It is a well-defined probabilistic measure of model fit; it has rigorous theoretical foundations; it does not require expensive sample generation.
Weaknesses of likelihood:
Not all models support it. GANs, score-based methods (without explicit density formulation), and some diffusion variants do not give a usable density.
Likelihood and sample quality are decoupled. Theis, van den Oord, Bethge (2016) “A Note on the Evaluation of Generative Models” gave the canonical critique: high likelihood is neither necessary nor sufficient for good sample quality. A model with strong likelihood can produce visually terrible samples; a model with poor likelihood can produce visually excellent samples.
Sensitivity to dimensionality. For high-dimensional data, likelihood values are dominated by easy-to-model dimensions and may miss interesting structure.
For text generation, perplexity is widely used as one signal alongside others. For image generation, likelihood (when available) is rarely the primary metric.
FID: the standard image-quality metric
The dominant metric for image-generation quality. Fréchet Inception Distance (Heusel et al., 2017).
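The recipe. Take a pretrained Inception V3 network trained on ImageNet classification. Pass real images and generated images through it, extracting features from the final pooling layer (the 2048-dimensional activations just before the classifier). Fit a multivariate Gaussian to each set of features (mean $\mu_r$ and covariance $\Sigma_r$ for real images, $\mu_g$ and $\Sigma_g$ for generated images). The FID is the Fréchet distance between the two Gaussians:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$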
Lower FID is better. A FID of 0 would mean the generated and real feature distributions are identical (under the Gaussian approximation).
What FID measures. Two things implicitly: (a) sample quality - bad samples produce features unlike any real images, shifting the generated-feature mean $\mu_g$; (b) sample diversity - mode-collapsed generators produce features clustered in a small region, shrinking the generated-feature covariance $\Sigma_g$ and increasing the Fréchet distance. FID captures both in a single scalar.
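The Fréchet-distance step of the recipe can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes the features have already been extracted into `(n, d)` arrays with `d >= 2`, uses random arrays as stand-ins for Inception features, and computes the trace of the matrix square root from the eigenvalues of the covariance product rather than forming the square root explicitly. The function name `frechet_distance` is illustrative.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative when
    # both factors are positive semi-definite. Clip tiny negative round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

# Stand-in features (assumption: real evaluations use Inception V3 activations).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
shifted = real + np.array([3.0, 4.0, 0.0, 0.0])  # same spread, mean moved by 5
collapsed = 0.1 * real                            # same centre, shrunken spread

fid_same = frechet_distance(real, real)
fid_shift = frechet_distance(real, shifted)
fid_collapsed = frechet_distance(real, collapsed)
```

The two perturbations mirror the two failure modes: a pure mean shift contributes only the squared mean difference, while shrinking the covariance (a crude model of mode collapse) raises the distance through the trace term even though the mean barely moves.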
Why FID became dominant. It correlates reasonably well with human judgement on image quality; it is straightforward to compute; it works for any image-generation model (no need for likelihood or particular model structure). It became the standard reporting metric in image-generation papers from 2017 onward.
FID’s limitations
By 2026, FID’s limits are well-understood and worth flagging:
Inception network bias. FID measures distance in ImageNet-Inception feature space. This space encodes a specific notion of what makes images “different” - one shaped by ImageNet’s 1000-class taxonomy. For generation of images outside ImageNet’s distribution (medical images, satellite imagery, abstract art), FID can be misleading.
Sample-size sensitivity. FID converges slowly with sample size; small differences in FID between models can vanish with larger sample sets. Best practice is to evaluate with 10K–50K generated samples and a similarly-sized real reference.
Saturation at high quality. As image-generation systems improve, FID values approach zero and small differences become hard to interpret. Two systems with FID 2.5 and 3.0 may produce visibly different samples or essentially identical ones; FID can’t tell.
Blind to prompt-following. FID measures distribution similarity; it doesn’t measure whether a text-to-image model actually follows the prompt. Two models with the same FID might have very different prompt-fidelity.
Inception Score and other older metrics
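Inception Score (Salimans et al., 2016) preceded FID. It uses the class predictions of the same Inception network to combine, in a single score, a per-image quality term (each image should yield a confident class prediction) and a diversity term (the predictions should spread across many classes). Critiqued by Barratt and Sharma (2018) and Borji (2019); largely superseded by FID. Mentioned here for historical context.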
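For concreteness, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all generated images. A minimal sketch, assuming the classifier's softmax outputs are already available as an `(n, classes)` array; the hand-built probability arrays below are placeholders, not real Inception outputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) from an (n, classes) array of softmax outputs."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 10 classes: the best case.
confident_diverse = np.tile(np.eye(10), (5, 1))
# Confident predictions that all land on one class: mode collapse.
confident_collapsed = np.tile(np.eye(10)[:1], (50, 1))
# Maximally uncertain predictions: low per-image quality.
uncertain = np.full((50, 10), 0.1)

is_best = inception_score(confident_diverse)
is_collapsed = inception_score(confident_collapsed)
is_uncertain = inception_score(uncertain)
```

The score ranges from 1 up to the number of classes, and it falls to 1 both when the generator collapses to one class and when individual predictions are uninformative, so the single number cannot tell those two failures apart.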
CLIP-score: prompt fidelity
For text-to-image models, FID does not measure whether samples follow the prompt. CLIP-score addresses this. Use the CLIP text encoder to embed the prompt and the CLIP image encoder to embed the generated image; the cosine similarity between them measures how well the image matches the prompt according to CLIP.
CLIP-score is widely used alongside FID. It has its own limitations - CLIP itself is a model with biases, and high CLIP-score does not guarantee aesthetic quality - but it captures the prompt-fidelity dimension that FID misses.
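The scoring step itself is just a cosine similarity between two embedding vectors. A minimal sketch, assuming the prompt and image embeddings have already been produced by the CLIP text and image encoders; no CLIP model is loaded here, the arrays are placeholder vectors, and the function name `clip_style_score` is illustrative.

```python
import numpy as np

def clip_style_score(text_emb, image_emb):
    """Row-wise cosine similarity between (n, d) text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return (t * v).sum(axis=1)

# Placeholder vectors standing in for encoder outputs (not real CLIP embeddings).
text = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
image = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0]])
scores = clip_style_score(text, image)  # aligned, orthogonal, partially aligned
```

Because both vectors are normalized first, the score depends only on direction in the shared embedding space, which is why it inherits whatever CLIP's encoders consider "similar".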
Human preference: the ground truth
For practical purposes, the most reliable evaluation is human pairwise preference. Show humans pairs of generated images (or videos, or audio) and ask which they prefer. Aggregate over many comparisons to produce an Elo rating or preference matrix.
Two landmark uses:
Human Preference Score (HPS) (Wu et al., 2023) - a learned model trained on human preferences. Used as an automatic substitute for actual human evaluation.
LMSYS Chatbot Arena (and its image-generation counterparts). Public deployments where users compare model outputs head-to-head. Produces Elo-style leaderboards that are increasingly treated as authoritative.
Human preference is expensive but is the gold standard. Automated metrics (FID, CLIP-score, HPS) are useful proxies but should be checked against human evaluation when the stakes warrant.
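The Elo aggregation mentioned above can be sketched as a sequence of online updates over pairwise outcomes. This is a minimal illustration under stated assumptions: the K-factor of 32, the starting rating of 1000, and the 400-point logistic scale are conventional choices rather than values taken from any particular leaderboard, and ties are not handled.

```python
def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Online Elo over an iterable of (winner, loser) pairs. Returns name -> rating."""
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability the current ratings assigned to the observed winner.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)  # larger update for a more surprising win
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings

# Hypothetical model names and outcomes, for illustration only.
single = elo_ratings([("model_a", "model_b")])
many = elo_ratings([("model_a", "model_b")] * 3 + [("model_b", "model_c")])
```

Online Elo depends on the order in which comparisons arrive, so leaderboards built this way are usually computed over many comparisons, and often over shuffled orderings, before the ranking is read as stable.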
Diversity-fidelity trade-off and the precision-recall framing
Many metrics (FID especially) collapse two dimensions into one scalar. Sajjadi et al. (2018) “Assessing Generative Models via Precision and Recall” proposed a two-dimensional alternative:
Precision. What fraction of generated samples lie in the real-data distribution? (High precision = good sample quality; the model produces things that look like real data.)
Recall. What fraction of the real-data distribution is covered by generated samples? (High recall = good diversity; the model produces a wide range of things.)
A precision-recall pair gives a richer picture than FID alone. Kynkäänniemi et al. (2019) “Improved Precision and Recall Metric for Generative Models” gave a practical algorithm using $k$-nearest-neighbour computations in feature space. Increasingly used in image-generation evaluation.
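The nearest-neighbour idea can be sketched as follows: approximate each distribution's support by a union of balls, one per sample, with radius equal to the distance to that sample's k-th nearest neighbour in its own set; precision is the fraction of generated samples falling inside the real-data balls, and recall is the reverse. This is a minimal sketch in the spirit of that algorithm, assuming features are already extracted; the brute-force distance matrix, the choice `k=3`, and the 2D grid data are illustrative simplifications.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def knn_radii(x, k):
    """Distance from each point to its k-th nearest neighbour within x."""
    d = pairwise_dist(x, x)
    # After sorting, column 0 is the point itself (distance 0).
    return np.sort(d, axis=1)[:, k]

def coverage(queries, support, k):
    """Fraction of queries inside the k-NN ball of at least one support point."""
    radii = knn_radii(support, k)
    d = pairwise_dist(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, gen, k=3):
    return coverage(gen, real, k), coverage(real, gen, k)

# Toy 2D "features": real data on a 10x10 grid; a collapsed generator that
# squeezes the same grid into a tiny patch near the origin.
real = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
collapsed = 0.01 * real

p_same, r_same = precision_recall(real, real)
p_col, r_col = precision_recall(real, collapsed)
```

In the collapsed case every generated point sits inside the real support (high precision) while almost none of the real points are covered (low recall), which is exactly the distinction a single FID value blurs.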
Memorization detection
A growing concern in 2026. Generative models trained on large datasets can memorize training examples and reproduce them at inference. For copyrighted training data this is a legal problem; for privacy-sensitive data (medical images, faces) it is a privacy problem.
Carlini et al. (2023) “Extracting Training Data from Diffusion Models.” Showed that Stable Diffusion can be made to regurgitate exact training images for specific prompts that match the training caption.
Detection techniques:
Nearest-neighbour search. For each generated sample, find the nearest training example in feature space; if the distance is below a threshold, flag as memorization.
Membership inference. Test whether the model’s probability of a candidate is anomalously high (suggesting it was in the training set).
Targeted prompt probing. For known training captions, test whether the model generates the corresponding training image.
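The nearest-neighbour check from the list above can be sketched in a few lines. This is a minimal illustration, assuming features for generated and training samples are already available as arrays; the brute-force search, the Euclidean metric, and the fixed threshold are simplifying assumptions (at real dataset sizes an approximate-nearest-neighbour index would be needed, and the threshold would have to be calibrated).

```python
import numpy as np

def flag_memorized(gen_feats, train_feats, threshold):
    """For each generated sample, find its nearest training sample and flag near-copies.

    Returns (flags, nearest_idx, nearest_dist), each of length len(gen_feats).
    """
    d = np.linalg.norm(gen_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    nearest_idx = d.argmin(axis=1)
    nearest_dist = d[np.arange(len(gen_feats)), nearest_idx]
    return nearest_dist < threshold, nearest_idx, nearest_dist

# Toy features: 50 well-separated training points in 8 dimensions.
train = np.arange(50 * 8, dtype=float).reshape(50, 8)
gen = np.stack([
    train[7] + 0.001,  # near-verbatim copy of training sample 7
    train[20] + 3.0,   # closest to sample 20, but clearly distinct from it
])
flags, idx, dist = flag_memorized(gen, train, threshold=0.1)
```

Returning the index of the nearest training example alongside the flag is a deliberate choice: flagged cases generally need manual inspection of the (generated, training) pair before being counted as memorization.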
Modern training practices include deduplication of training data (CLIP-based near-duplicate removal) to reduce memorization rates. The trade-off: more aggressive deduplication reduces memorization but may also reduce sample quality. The balance is an active area of practical research.
The benchmarking problem in 2026
Four issues that affect how generative-model evaluation is read in 2026:
Benchmark saturation. On standard image-generation benchmarks (ImageNet-1K class-conditional, MS-COCO text-to-image), frontier systems achieve FID values so low that the metric no longer distinguishes them meaningfully. The community has shifted toward harder benchmarks (DrawBench, GenEval, T2I-CompBench for compositional generation) and head-to-head human comparison.
Multi-dimensional reporting. Modern papers report multiple metrics together: FID + CLIP-score + Precision-Recall + human preference. No single metric is taken as definitive.
Real-world preference vs benchmark metrics. The most-used image-generation systems (Midjourney, DALL-E 3, Imagen 3) are not always the FID-best - they are optimized for user preference, which weighs aesthetic appeal and prompt fidelity in ways FID does not capture. The disconnect between research benchmarks and deployed-system quality is real and known.
Modality-specific evaluation gaps. Video evaluation is much less mature than image evaluation; current best practices (FVD - Fréchet Video Distance, FID applied per-frame, human evaluation) have known limitations. Audio evaluation similarly varies by sub-modality. 3D and molecular evaluation are even less standardized.
Where evaluation sits in 2026
The honest summary. Generative-model evaluation has improved substantially but remains one of the most difficult engineering problems in generative modelling. The current state of practice:
For images: FID + CLIP-score + human preference (Chatbot Arena style) is the standard reporting suite.
For video: FVD + human preference. Less standardized.
For audio: domain-specific metrics + human preference.
For text: perplexity (for likelihood) + task-specific automated metrics + human preference.
For 3D and molecules: domain-specific scientific evaluation (RMSD for protein structure, etc.).
Across all modalities, human evaluation is the gold standard and automated metrics are useful proxies. As frontier models continue to improve and automated metrics saturate, the gap between what those metrics report and meaningful quality differences is likely to widen; new evaluation techniques will emerge to fill the gap. OP-GM-2 (evaluation that aligns with human judgement) is the open problem this section centres on.
§11. Connections to Other Chapters
This chapter is connected to many others - generative models are one of the two foundational paradigms (alongside discriminative SSL) on which Foundation Models are built, and they appear directly or indirectly in most modern AI systems. The cross-references below are dependency statements, not pointers.
Self-Supervised Learning §4 develops generative SSL objectives - masked language modelling, span corruption, MAE, denoising/diffusion-as-SSL. This chapter develops the generative-modelling lens on the same machinery: how to use it to sample. The two chapters cover overlapping techniques with different emphases. Readers encountering generative modelling for the first time should read SSL §4 first.
Large Language Models is the LM-specific instantiation of autoregressive generative modelling (§3 of this chapter). LLM §3 develops the decoder-only Transformer architecture; LLM §6 develops decoding strategies (greedy, beam, nucleus, contrastive) - all autoregressive-generation specifics. This chapter develops autoregressive as a family and refers to the LM chapter for the LM-specific detail.
Foundation Models §3 lists generative models as one of two dominant FM substrates (the other being discriminative SSL). FM §6 covers scaling laws which apply to generative models; FM §7 covers adaptation methods that can include preference-based training of generative models. The Foundation Models chapter is the spine; this chapter develops one of the two ribs.
Deep Learning provides the architectural building blocks: U-Net (DL §4), Transformers (DL §6), equivariant networks (referenced for §9 of this chapter). The chapter assumes the architectural material rather than re-developing it.
Multimodal Models (planned) develops cross-modal systems - text-to-image, image-to-text, audio-text, vision-language-action. The text-to-image conditioning of §8 is the dominant cross-modal generative technique; the planned chapter will develop the broader cross-modal landscape.
Reinforcement Learning §10 develops RLHF/DPO/GRPO as post-training for generative models. The connection: a generative model trained with maximum likelihood (this chapter) can be further trained to align with human preferences (RL §10). RLHF as applied to text-to-image models is an active research area; the framework comes from RL.
AI for Science (planned) develops scientific generative models in domain depth - protein design (RFdiffusion, Chroma), molecular generation, materials, mathematics. The generative-modelling techniques are this chapter’s content; the domain-specific architectures (equivariance, structure-prediction backbones) and benchmarks live in the AI for Science chapter.
Evaluation (planned) is the cross-cutting chapter on benchmarking. §10 of this chapter sketches generative-model evaluation specifically; the broader evaluation methodology lives in the planned chapter.
Theoretical Foundations of Learning §8 develops the modern generalization puzzle. Generative models inherit the same puzzle: why does a high-capacity diffusion model generalize beyond its training set rather than memorizing it? OP-GM-7 (memorization vs novel generation) is the generative-model-specific instance.
Mechanistic Interpretability (planned) studies internal representations of generative models. Diffusion U-Nets and DiTs have been the subject of substantial interpretability research (e.g., what features at which timesteps?); the techniques mirror those used for discriminative networks but adapted for the iterative-denoising setting.
Alignment / Ethics treats memorization, copyright, deepfakes, watermarking, and the broader societal implications of generative AI as central concerns. OP-GM-4 (watermarking and provenance) and OP-GM-7 (memorization) are the open problems that the Alignment chapter develops in depth.
§12. Critiques and Alternative Perspectives
This section presents critiques of generative AI as critiques - substantive intellectual positions held by working researchers and observers, not strawmen. The chapter does not adjudicate (per the project’s editorial stance, ADR-0004); the critiques are organized by what they are critiquing.
Memorization and copyright
The most-publicized critique of generative AI: the models memorize substantial portions of their training data and can reproduce copyrighted material in their outputs. Carlini et al.'s (2023) extraction of training images from Stable Diffusion is the canonical empirical demonstration; the New York Times v. OpenAI lawsuit (2023) is the canonical legal one.
Two distinct concerns. Quantitative memorization: how often does a generative model reproduce a verbatim training example? Empirically, the rate is non-zero but small for most prompts; specific prompts can elicit specific memorized examples. Qualitative memorization: do the model’s outputs derivatively depend on copyrighted training material in a way that constitutes copying? The legal answer is unsettled and varies by jurisdiction; the technical answer is essentially yes (the model would not produce the outputs it does without having been trained on the data, by construction).
The critique here is that training-data licensing should be respected at training time - current frontier systems train on enormous web-scraped datasets without per-asset licensing, and the legal-and-ethical implications are real. Mitigations include opt-out mechanisms (the OpenAI image-classifier opt-out), licensed training data (Adobe Firefly’s stock-photo training set), and post-training filters. None is a complete solution.
Diversity collapse
A related but distinct concern. Generative models trained on web data and tuned with RLHF for human preferences tend to converge on a narrow style - generating images, text, or music that look superficially diverse but cluster in a small region of the space the model is theoretically capable of producing.
Empirically: ask a text-to-image model for “a beautiful landscape” 100 times; the resulting images often share aesthetic characteristics (lighting, composition, palette) more than the prompt itself implies. Ask an LLM for “a creative story” 100 times; the resulting stories share narrative patterns. The model has been steered by training data and reward signals toward a particular zone of plausibility.
The critique. Generative AI deployed at scale homogenizes the cultural/creative output it is used to produce - not because each individual user wanted homogeneity, but because each user got the same model’s same biases. The long-run consequences for cultural and creative diversity are an open empirical question; the short-run consequences are visible in the “AI look” that contemporary observers can identify in many AI-generated images.
Evaluation as a social problem
A meta-critique of the evaluation discussion in §10. The metrics we use to evaluate generative models shape what models we build. FID rewards distribution coverage; CLIP-score rewards prompt fidelity; human preference rewards what humans currently prefer (which is shaped by what they have already seen).
The critique: optimization against any of these metrics produces Goodhart’s-law failures. Once a metric is the target, it stops being a good measure. The 2026 evidence: production text-to-image models that score well on standard benchmarks but produce visibly inferior outputs to alternatives that score worse; LLMs that score well on standard reasoning benchmarks but fail at slightly-perturbed problems.
The deeper critique: there may be no automated metric that captures all the dimensions that matter, because “what matters” is itself plural and contested. The benchmarking community has responded with multi-metric reporting and live human-preference comparisons; the underlying problem is structural and persistent.
The “do generative models understand?” debate
A philosophical critique. When a diffusion model produces a realistic image of “a cat sitting on a windowsill,” does the model understand what cats and windowsills are, in any meaningful sense? Or has it learned a statistical pattern that is accurate enough to produce convincing samples without any understanding?
Two camps.
The no-understanding camp argues that generative models manipulate statistical patterns over surface representations (pixels, tokens) without representing the underlying entities. Evidence: models fail at out-of-distribution combinations, struggle with physical reasoning (“an object falling and bouncing”), and produce confidently wrong outputs on slightly-novel queries. The technical critique points to compositional generalization failures (OP-GM-3).
The some-understanding camp argues that the distinction between “statistical pattern” and “understanding” is not as clear as the no-understanding camp implies. Modern generative models exhibit emergent capabilities (in-context learning, chain-of-thought reasoning, world-model behaviour in video models) that look like some form of understanding, even if not human-like. The empirical evidence is sufficient that the strong no-understanding position requires substantial qualifications.
This chapter does not adjudicate. The honest summary: the debate is partly empirical (about what generative models actually can and cannot do) and partly definitional (about what we mean by “understand”). Both threads are active. We refer the reader to the broader cognitive-science and philosophy-of-AI literature for the substantive treatment.
The labour critique
A non-technical but consequential critique. Generative models compete with and displace human creative labour - illustrators, writers, voice actors, musicians, designers. The economic and social effects are real and immediate. The technical community’s response to this critique has been mixed: some researchers acknowledge the labour displacement and engage with mitigations (revenue-sharing, attribution, opt-out); others treat it as outside their scope.
The chapter notes the critique without taking a position. The Alignment / Ethics chapter (planned) develops it in depth.
§13. Limitations and Open Problems
Consolidated open-problems list. Each item carries an OP-GM-N identifier so other chapters can cross-reference.
OP-GM-1. Sample-efficient training of generative models. Frontier diffusion and flow-matching systems require massive training data (web-scale collections of image-text pairs) and correspondingly large compute budgets. Smaller training budgets typically produce substantially weaker models. Whether this is fundamental or a consequence of current architectures and recipes is open. The protein-structure analogue (AlphaFold 2 trained on the much smaller PDB) shows that with strong inductive biases, much less data can suffice - but generalizing this insight to general image/audio/video generation is unsolved. Relates to OP-FM-2 (foundation-model training data).
OP-GM-2. Evaluation that aligns with human judgement. §10 makes the case extensively. FID, CLIP-score, and the standard automated metrics correlate imperfectly with what humans actually value. Designing metrics that align better with human judgement - without becoming Goodhart-targets themselves - is an open and likely-permanent problem. Recent learned-metric approaches (HPS, ImageReward) help but inherit the biases of the preference data they were trained on.
OP-GM-3. Compositional generalization. Current text-to-image systems handle common combinations of concepts well (“a red car”, “a cat in a hat”) but compose novel combinations poorly (“a cat with three heads riding a bicycle in zero gravity”). The model has memorized common combinations; truly compositional generation - combining concepts the model has never seen combined - fails systematically. This is the generative-model analogue of the broader compositional-generalization problem. Cross-references the discriminative-side discussions in DL/SSL/LLM open problems.
OP-GM-4. Watermarking and provenance. As generative content becomes ubiquitous, distinguishing AI-generated content from human-created content becomes important - for trust, attribution, copyright, and democratic discourse. Watermarking techniques (visible, invisible, model-specific) have been proposed; none is robust to all attacks. The C2PA (Coalition for Content Provenance and Authenticity) industry standard is one approach; its adoption is partial. OP-GM-4 has both a technical core (robust watermarks) and an institutional dimension (incentives for adoption).
OP-GM-5. Theory of why diffusion / flow-matching work. Empirically, diffusion and flow-matching produce high-quality samples reliably. The theoretical account of why they work as well as they do is only partial. Rigorous results exist for restricted settings (Gaussian data, kernel methods, linear models), but a theory that explains the generalization and sample quality of practical large-scale diffusion is open. Connects to OP-TH-1 (overparameterization generalization) - the generative-model-specific instance of the broader generalization puzzle.
OP-GM-6. Continuous-time vs discrete formulations. Diffusion was developed discretely; flow-matching is continuous-time. Both work; the relationship is now well-understood. The open question is whether new generative-modelling frameworks lurk in formulation choices we have not yet explored - non-Gaussian noise, non-Euclidean spaces, trainable noise schedules. This is a research-aesthetic open problem rather than a hard practical one.
OP-GM-7. Memorization vs novel generation. When does a generative model produce novel outputs vs recombinations (or verbatim reproductions) of training data? The problem has technical, legal, and ethical dimensions. Detection techniques (§10) have improved but are not fully reliable. The deep theoretical question - what does it mean for a high-dimensional generative model to “produce something new” - is unsolved.
OP-GM-8. Long-horizon coherent generation. For video: minute-length generation is now possible (Sora, Veo); hour-length coherent generation is not. For text: novel-length coherent generation by a single model is not yet reliable. For 3D: extended scenes with consistent geometry across viewpoints are difficult. The long-horizon problem - maintaining coherence over generation lengths much larger than training-time sequence lengths - is one of the central open problems in 2026.
OP-GM-9. Inference-time efficiency. Diffusion sampling at 20–50 steps per image is far slower than GANs (single-step) but is now the standard. Inference cost is the dominant deployment expense. Better samplers (DDIM, DPM-Solver), distillation (consistency models, rectified flow), and architecture choices (smaller models matched to inference target) are the active research directions. Tied to OP-RL-9 (training/inference compute trade-off).
OP-GM-10. Equivariance and invariance. For domains with strong symmetries (molecules, proteins, materials, physical simulations), generative models that are equivariant to the relevant symmetry group produce more reliable and sample-efficient results. Designing equivariant architectures for general-purpose generative modelling - applying equivariance principles where they help without crippling expressiveness - is an open area.
§14. Further Reading
Opinionated list. Not exhaustive; intended as a reading-order recommendation for someone entering generative modelling cold.
Foundational textbooks and surveys
Murphy, K. (2023). Probabilistic Machine Learning: Advanced Topics. Chapter on generative models gives a unified textbook treatment.
Tomczak, J. (2024). Deep Generative Modeling. A focused textbook on modern generative models.
Bond-Taylor, S., et al. (2022). “Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models.” A useful unified survey; somewhat pre-flow-matching but covers the older families well.
Autoregressive
van den Oord, A., et al. (2016). “Pixel Recurrent Neural Networks.” PixelRNN.
van den Oord, A., et al. (2016). “WaveNet: A Generative Model for Raw Audio.” WaveNet.
Esser, P., Rombach, R., and Ommer, B. (2021). “Taming Transformers for High-Resolution Image Synthesis.” VQGAN - the autoregressive-image-generation recipe via discrete tokens.
VAEs
Kingma, D. P., and Welling, M. (2014). “Auto-Encoding Variational Bayes.” The foundational VAE paper.
van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). “Neural Discrete Representation Learning.” VQ-VAE.
Higgins, I., et al. (2017). “β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.” The disentanglement-oriented variant.
Kingma, D. P., and Welling, M. (2019). “An Introduction to Variational Autoencoders.” A clear extended tutorial.
Normalizing flows
Rezende, D., and Mohamed, S. (2015). “Variational Inference with Normalizing Flows.” Foundational.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). “Density Estimation using Real NVP.” Coupling-layer flows.
Kingma, D. P., and Dhariwal, P. (2018). “Glow: Generative Flow with Invertible 1x1 Convolutions.” Glow.
Papamakarios, G., et al. (2021). “Normalizing Flows for Probabilistic Modeling and Inference.” Comprehensive survey.
GANs
Goodfellow, I., et al. (2014). “Generative Adversarial Nets.” Foundational.
Radford, A., Metz, L., and Chintala, S. (2016). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” DCGAN.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). “Wasserstein Generative Adversarial Networks.” WGAN.
Karras, T., Laine, S., and Aila, T. (2019). “A Style-Based Generator Architecture for Generative Adversarial Networks.” StyleGAN.
Brock, A., Donahue, J., and Simonyan, K. (2019). “Large Scale GAN Training for High Fidelity Natural Image Synthesis.” BigGAN.
Diffusion
Sohl-Dickstein, J., et al. (2015). “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” The original diffusion paper.
Ho, J., Jain, A., and Abbeel, P. (2020). “Denoising Diffusion Probabilistic Models.” DDPM - the modern recipe.
Song, Y., and Ermon, S. (2019, 2020). “Generative Modeling by Estimating Gradients of the Data Distribution” and related papers. Score-based formulation.
Song, Y., et al. (2021). “Score-Based Generative Modeling through Stochastic Differential Equations.” The unified SDE framework.
Song, J., Meng, C., and Ermon, S. (2021). “Denoising Diffusion Implicit Models.” DDIM.
Ho, J., and Salimans, T. (2021). “Classifier-Free Diffusion Guidance.” Classifier-free guidance - the dominant conditioning technique.
Rombach, R., et al. (2022). “High-Resolution Image Synthesis with Latent Diffusion Models.” Stable Diffusion’s technical core.
Peebles, W., and Xie, S. (2023). “Scalable Diffusion Models with Transformers.” DiT.
Karras, T., et al. (2022). “Elucidating the Design Space of Diffusion-Based Generative Models.” A systematic ablation study; the modern engineering reference.
Flow-matching and rectified flow
Lipman, Y., et al. (2022). “Flow Matching for Generative Modeling.” Foundational.
Liu, X., Gong, C., and Liu, Q. (2022). “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.” Rectified flow.
Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). “Stochastic Interpolants: A Unifying Framework for Flows and Diffusions.” A unifying framework.
Modality-specific
Mildenhall, B., et al. (2020). “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” NeRF.
Kerbl, B., et al. (2023). “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” 3DGS.
Poole, B., et al. (2023). “DreamFusion: Text-to-3D using 2D Diffusion.” SDS.
Jumper, J., et al. (2021). “Highly Accurate Protein Structure Prediction with AlphaFold.” AlphaFold 2.
Abramson, J., et al. (2024). “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” AlphaFold 3.
Watson, J. L., et al. (2023). “De novo Design of Protein Structure and Function with RFdiffusion.” RFdiffusion.
Brooks, T., et al. (2024). “Video Generation Models as World Simulators.” Sora (technical report).
Conditioning and guidance
Zhang, L., Rao, A., and Agrawala, M. (2023). “Adding Conditional Control to Text-to-Image Diffusion Models.” ControlNet.
Brooks, T., Holynski, A., and Efros, A. A. (2023). “InstructPix2Pix: Learning to Follow Image Editing Instructions.” Image editing.
Meng, C., et al. (2022). “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations.” Image-to-image.
Evaluation
Theis, L., van den Oord, A., and Bethge, M. (2016). “A Note on the Evaluation of Generative Models.” The canonical critique of single-metric evaluation.
Heusel, M., et al. (2017). “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.” FID introduction.
Sajjadi, M. S. M., et al. (2018). “Assessing Generative Models via Precision and Recall.” Precision-recall framing.
Wu, X., et al. (2023). “Human Preference Score: Better Aligning Text-to-Image Models with Human Preference.” HPS.
Carlini, N., et al. (2023). “Extracting Training Data from Diffusion Models.” Memorization.
Reading-order recommendation
For someone entering the field cold: start with the Bond-Taylor 2022 survey for orientation. Then read the foundational papers chronologically: Goodfellow 2014 (GANs) → Kingma-Welling 2014 (VAEs) → Dinh et al. 2017 (RealNVP) → Ho-Jain-Abbeel 2020 (DDPM) → Song et al. 2021 (unified SDE framework) → Ho-Salimans 2021 (CFG) → Rombach et al. 2022 (latent diffusion) → Lipman et al. 2022 (flow-matching) → Liu et al. 2022 (rectified flow) → Karras et al. 2022 (“Elucidating”) for engineering depth. Add modality-specific papers as needed.
§15. Exercises and Experiments
Research-style exercises spanning the chapter’s families. Each is designed to develop hands-on understanding of one or two of the dominant techniques.
E1. VAE on MNIST with latent visualization. Implement a small VAE with $d = 2$ latent dimensions on MNIST. After training, plot the encoded test set in the 2D latent space coloured by digit class - confirm meaningful clustering. Sample new digits by decoding random points in the latent space. Implement latent traversal (sweep along a line in latent space, decode at each point) to confirm smooth interpolation. Then increase $d$ to a larger value and compare reconstruction quality.
E2. GAN on MNIST and observation of mode collapse. Implement a small DCGAN on MNIST. Train and sample. Deliberately under-train the discriminator (or use a too-high learning rate) and observe mode collapse - generated samples cluster on a few digit classes. Then implement WGAN-GP and confirm it is more robust to the same parameter settings.
E3. DDPM on CIFAR-10 or Fashion-MNIST. Implement DDPM with a small U-Net, a noise schedule, and the simplified MSE loss. Train to reasonable convergence. Implement DDPM and DDIM sampling; compare sample quality at the full number of training timesteps $T$ (full DDPM) vs a much smaller number of sampling steps (DDIM). Plot a few samples per setting; quantify with FID against a held-out reference.
E4. Flow-matching on a 2D toy distribution. Implement straight-line flow-matching on a 2D Gaussian-mixture target distribution. Train a small MLP velocity-prediction network. Implement Euler integration sampling. Visualize the learned velocity field on a grid of points and the resulting sample trajectories from $t = 0$ to $t = 1$. Repeat for a more complex 2D target (a spiral or moons distribution).
E5. Implement classifier-free guidance. Take the trained DDPM from E3 and add class-conditional training (CIFAR-10 has 10 classes). Train with random class-dropout (10%). At inference, implement CFG with varying guidance scales $w$. Visualize how sample fidelity (matches the target class) and diversity change with $w$. Quantify with class-conditional FID.
E6. Compute FID for trained models. Take three trained models (your VAE from E1, your GAN from E2, your DDPM from E3). Compute FID against the corresponding test set for each. Compare. Investigate the effect of sample-set size on FID (1K, 5K, 10K samples). Confirm the well-known instability of FID at small sample sizes.
E7. Reproduce a memorization experiment. Following Carlini et al. (2023) at small scale: train a VAE or small DDPM on MNIST with a deliberately under-deduplicated dataset (some samples replicated 100x). Investigate which samples the model memorizes (find generated samples nearest to known training examples in pixel space). Observe the relationship between training-replication count and memorization rate.
E8. Implement a simple ControlNet for edge-conditioned generation. Take your trained DDPM from E3. Add a ControlNet branch that conditions on Canny edges of the target image. Train on (image, edge-map) pairs. At inference, supply a new edge map and a class condition; verify that the generated image follows the edge structure.
E9. Compare U-Net and DiT for diffusion. Implement two small diffusion models with comparable parameter counts: one with a U-Net, one with a DiT. Train both on the same dataset (e.g., CelebA-64). Compare sample quality (FID), training stability (loss curves), and inference speed.
E10. Investigate the effect of training-data quality. Take a small diffusion model. Train three versions: one on the full CIFAR-10 training set, one on a deduplicated subset, one on a 10% random subset. Compare sample quality and (where possible) memorization rates across the three.
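As a possible starting scaffold for E4, the two pieces that do not depend on the network are the straight-line training pair and the Euler sampler. The sketch below is a minimal illustration under stated assumptions: noise sits at $t = 0$ and data at $t = 1$, the function names are illustrative, and the analytic velocity field used to exercise the sampler is a stand-in for the trained MLP (it transports every point to a single target, which is not a realistic data distribution).

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Straight-line interpolant x_t and the velocity the network should regress to.

    x0: (n, d) noise samples, x1: (n, d) data samples, t: (n,) times in [0, 1].
    """
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along each straight path
    return x_t, v_target

def euler_sample(velocity_fn, x0, num_steps=100):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # never reaches 1, so 1 - t stays positive below
        x = x + dt * velocity_fn(x, t)
    return x

# Stand-in for a trained network: the exact velocity field that carries any
# starting point along a straight line to one fixed target point.
target = np.array([2.0, -1.0])
def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=(16, 2))
samples = euler_sample(toy_velocity, noise, num_steps=100)
```

In the exercise itself, `toy_velocity` would be replaced by the MLP trained with a mean-squared-error loss between its prediction at `(x_t, t)` and `v_target`; storing the intermediate `x` values inside `euler_sample` gives the trajectories to plot.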