Deep Learning

Scope and What This Chapter Is About

The chapter develops the architectural and optimization substrate that modern deep learning rests on: how the dominant model families work, how they are trained, and which design choices matter at scale. It is the substrate chapter that the Foundation Models, Self-Supervised Learning, Large Language Models, Multimodal, Reasoning Models, Generative Models, and Robotics chapters build on.

We treat the machinery. Self-supervised objectives, the foundation-model framing, distributed training infrastructure, and modern alignment methods live in their own chapters with explicit pointers from here. Open problems are flagged inline and consolidated in §12.

§1. Motivation and Scope

What this chapter is for

By 2026, deep learning is no longer a research direction; it is the default substrate on which essentially every modern AI system is built. The Foundation Models chapter (which most readers will reach first) treats deep learning as a black box and abstracts away its architectural and training details. The LLM, Multimodal, Reasoning Models, Robotics, and Generative Models chapters each develop a specific use of deep-learning machinery without re-deriving the machinery itself. This chapter exists to consolidate what those other chapters presume - the architectures, the training algorithms, the design choices, the empirical regularities - so that a reader who picks up the cross-cutting machinery here can follow the rest of the book without having to assemble it piecewise from each chapter’s local treatment.

The chapter has a second purpose: as a reference. A reader who knows transformers well but wants the residual-stream picture, who knows SGD but needs the AdamW-and-warmup recipe, who has worked with multi-head attention but never coded grouped-query attention - this chapter is the place to consult. We develop mechanism with diagrams and pseudocode (per the project’s editorial rules); we do not assume the reader has the field’s current jargon already loaded.

What “deep” means here

The phrase deep learning has multiple readings; we use the dominant one. We mean the paradigm of multi-layer parametric models trained end-to-end by gradient descent on differentiable loss functions, with very many parameters (millions to trillions) and very many training examples (millions to trillions of input units, depending on the modality). “Deep” is a misnomer in some sense - modern transformers are deep but the central technical achievement is breadth and scale of parameters, not depth alone - but the term is stable enough that we keep it.

A working definition for this chapter:

Deep learning. The use of multi-layer parametric models - typically neural networks with millions to trillions of parameters - trained by gradient-based optimization of a differentiable loss function on large datasets, where the model and its training procedure are jointly responsible for both representation (what features the model encodes) and behaviour (what outputs the model produces).

The contrast that grounds this definition: in classical ML pipelines, representation was handled by humans through feature engineering, and a relatively shallow model was fit on top of those features. In deep learning, representation and behaviour are both learned from the data by the same training procedure - features are not given but discovered.

What we cover, in three layers

The chapter is organized into three structural layers.

Architectures

How neural networks are structured.

§3 MLPs and backprop fundamentals
§4 Convolutional networks (historical)
§5 Recurrent networks (historical)
§6 Attention and the Transformer (the bulk)
§7 Mixture-of-Experts and sparse models
§8 State-space models and hybrids

Training

How the architectures get trained.

§9 Training dynamics: optimizers, normalization, initialization, schedules, regularization, mixed precision

Scale

What changes when these get large.

§10 Efficient attention and long context

The remaining sections (§11 Connections to other chapters, §12 Open Problems, §13 Further Reading, §14 Exercises) close the chapter in the standard project pattern.

Boundaries with adjacent chapters

Deep learning is a foundational substrate for many things; consequently the chapter abuts many others. Where the chapter draws a line:

Self-Supervised Learning treats what objectives the architectures are trained on. This chapter treats what the architectures are and how they are trained mechanically; SSL treats what is being optimized.
Foundation Models treats the deployment regime - pretrain once at scale, adapt many ways. This chapter treats the substrate; the FM chapter treats what is built on top.
Theoretical Foundations of Learning treats why deep models generalize (or fail to) - PAC theory, the deep-learning generalization puzzle, scaling laws as empirical regularities. This chapter treats what works empirically and the recipes that produce working models; theory is delegated.
LLMs, Multimodal Models, Reasoning Models, Robotics, Generative Models are applications of the substrate - each takes the architectures and training procedures developed here and builds a deployment-specific story on top.
Efficient and Scaled Training treats distributed training, hardware-aware design, and the engineering of training at frontier scale. This chapter touches efficiency at the algorithmic level (§9 on optimizers, §10 on long context); the systems story is delegated.

Editorial posture

We present deep learning as a working substrate - the architectures and training procedures that actually produce useful models in 2026 - not as a settled theory. Many of the design choices we describe (the transformer block recipe, the AdamW-with-warmup recipe, RoPE vs ALiBi) are empirical: they work, they have been verified across many groups and many model scales, but they do not have first-principles derivations. Where this matters, we flag it. Where mainstream practice is contested or rapidly evolving (state-space models, sub-quadratic attention, alternative training paradigms), we note the contest rather than pretend it is settled.

§2. Historical Context

This section sketches the trajectory from the perceptron of 1958 to the transformer-dominant substrate of 2026. It is not a complete history of machine learning, nor an attempt to credit individuals with the “invention” of deep learning - the field has been built by many people across many decades, and most credit narratives oversimplify. The trajectory below is the substrate history: how the architectures and training procedures of this chapter actually came to exist.

A timeline of the inflection points covered below:

1958: Rosenblatt’s Perceptron. Single-layer linear classifier with weight updates.
1969: Minsky and Papert. Critique limits the connectionist programme; first AI winter.
1986: Backpropagation formalized. Rumelhart, Hinton, and Williams (1986) train multi-layer networks; second-wave connectionism.
~1990s–2000s: Statistical-ML mainstream. SVMs, kernel methods, boosting. Neural networks persist but are not central.
2006–2010: Deep-learning revival. Deep belief networks (Hinton), unsupervised pretraining, restricted Boltzmann machines.
2012: AlexNet inflection. Krizhevsky et al. (2012) win ImageNet by a wide margin - supervised deep learning at scale with GPUs becomes the dominant paradigm in vision.
2014–2017: CNN era + parallel RNN/LSTM era. VGG, GoogLeNet, ResNet (He et al., 2015), whose residual connections are the architectural primitive that enabled very deep networks. Parallel sequence-modelling thread on LSTMs with seq2seq (Sutskever et al., 2014) and attention as add-on (Bahdanau et al., 2014).
2017: The Transformer. Vaswani et al. (2017) introduce the architecture that goes on to absorb both the vision and language threads; the architectural substrate of everything that follows.
2018–2022: Pretraining, scale, consolidation. BERT, GPT, T5. CNNs displaced from vision by ViT (Dosovitskiy et al., 2020). Architectural design coalesces around a small recipe.
2022–2026: Scaling era and consolidation. Mixture-of-experts at the frontier (Mixtral, DeepSeek-MoE). State-space models as the first serious sub-quadratic challenger (Mamba). The “modern recipe” of RMSNorm + SwiGLU + RoPE + GQA stabilizes.

We develop each phase briefly below.

The perceptron and the first AI winter

Frank Rosenblatt’s perceptron (1958) was the first practical trainable neural model - a single-layer linear classifier with an iterative weight-update rule that provably converges on linearly separable problems. The perceptron generated substantial early enthusiasm.

In 1969 Marvin Minsky and Seymour Papert published Perceptrons, a careful analysis showing that single-layer perceptrons could not represent functions like XOR - and that extending to multi-layer networks would require, among other things, a way to train the hidden layers, which was not then understood. The book is widely cited as a contributor to the first “AI winter” - a multi-year decline in funding and interest in neural-network research. The narrative is somewhat oversimplified (other factors contributed), but the upshot is clear: connectionism as a research programme was significantly dampened from roughly 1970 to the mid-1980s.

Backpropagation and the second wave

The training problem for multi-layer networks was solved (formalized for the field) in the 1980s by several researchers; the most cited reference is Rumelhart, Hinton, and Williams (1986), “Learning representations by back-propagating errors”. Backpropagation (which we develop in §3) is just reverse-mode automatic differentiation applied to a layered network’s parameters; the contribution of the 1986 paper was demonstrating that the technique worked on non-trivial multi-layer problems and arguing that the field should take it seriously.

The “second wave” of connectionism (mid-1980s to mid-1990s) produced foundational architectural ideas - convolutional networks (LeCun’s LeNet for handwritten-digit recognition), recurrent networks (Hochreiter and Schmidhuber’s LSTM in 1997), early ideas about attention and content-addressable memory. But neural networks were not yet the dominant ML paradigm; statistical learning theory (Vapnik), support vector machines, kernel methods, and boosting dominated practical ML through the 2000s.

The pre-2012 dormancy

Through roughly 2006–2010, deep-network research was a niche within ML. Geoffrey Hinton and others kept developing ideas - restricted Boltzmann machines, deep belief networks, layer-wise unsupervised pretraining - and there was real progress on benchmarks, but the central position in the field was occupied by statistical ML. Two things were missing: enough data, and enough compute.

2012: the AlexNet inflection

In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered an 8-layer convolutional network trained on two consumer GPUs in the ImageNet Large Scale Visual Recognition Challenge. AlexNet won by a wide margin against the best non-deep-learning entrants - a top-5 error rate of 15.3% against the next-best entry’s 26.2%. The result was not a slight improvement; it was a different regime.

The narrative interpretation: deep convolutional networks had been mostly a curiosity for several decades because no one had run them at the right combination of data scale (ImageNet’s millions of labelled images) and compute scale (GPUs). Once those were available, the architecture’s potential was visible. AlexNet kicked off the modern era of deep learning. Within three years, essentially every computer-vision research project was using a deep network.

The CNN era (2012–2017)

A series of architectural and engineering improvements pushed image-classification benchmarks rapidly: VGG (Simonyan and Zisserman, 2014) increased depth dramatically with a uniform 3×3 convolution structure; GoogLeNet/Inception (Szegedy et al., 2014) introduced the inception block; ResNet (He et al., 2015) introduced residual connections - a small but crucial architectural primitive that made it possible to train networks of arbitrary depth without the vanishing-gradient problems that had previously capped useful depth at ~20 layers. The residual stream that we will see in §6 of this chapter, and that the LLM and FM chapters keep returning to as a useful conceptual frame, traces directly to ResNet.

Beyond architectures: AlexNet-era practice introduced or popularized ReLU activations (faster training than sigmoid), dropout (a regularization technique), batch normalization (Ioffe and Szegedy, 2015 - training stability), and a constellation of small empirical tricks that compounded into a usable training recipe. The CNN era was also when the use of GPUs for ML training became standard, and the field’s relationship to compute providers (NVIDIA in particular) became structurally important.

The RNN era in parallel

While CNNs reshaped vision, recurrent networks - LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014) - handled sequence modelling, especially in language. Seq2seq models (Sutskever et al., 2014) and attention as an add-on to seq2seq (Bahdanau et al., 2014) were the substrate for neural machine translation through 2017. The recurrent-attention combination had real successes - Google’s production translation system switched to it in 2016 - but it had structural limitations: training was sequential (no parallelization across positions), gradient propagation through long sequences was unstable, and capacity was limited.

2017: the Transformer

Vaswani et al. (2017), “Attention Is All You Need”, proposed an architecture that eliminated recurrence. The Transformer used attention not as an add-on to a recurrent backbone but as the entire mechanism for cross-position interaction. The training was parallelizable across positions in a way recurrent networks could not be. The architecture was originally introduced for machine translation, where it modestly outperformed the best LSTM-based systems; within a few years it had absorbed essentially every sequence-modelling application in the field.

The Transformer is the architectural substrate of every chapter after the FM chapter. We develop it in detail in §6.

2018–2022: pretraining, scale, and consolidation

The pretraining-and-adapt regime (developed in the FM chapter) crystallized in 2018–2019 with BERT, GPT, and T5 (treated in LLM §2). What matters for this chapter: the field stopped designing new architectures for every problem and started reusing a small set of substrate architectures across many problems. Vision was reshaped a second time when ViT (Dosovitskiy et al., 2020) showed that the Transformer worked on image patches, displacing CNNs from many vision applications by 2022.

Inside the Transformer family, the architectural design consolidated. The 2017 paper’s recipe (sinusoidal positional encodings, learned biases, post-LayerNorm, standard MLP feedforward) was replaced piece by piece by empirically better alternatives:

Pre-normalization instead of post-LayerNorm - training stability at depth.
RMSNorm (Zhang and Sennrich, 2019) replacing LayerNorm - cheaper, equivalent quality at low precision.
SwiGLU (Shazeer, 2020) replacing the ReLU/GELU MLP - small consistent gains.
Rotary position embeddings (Su et al., 2021) replacing sinusoidal or learned absolute encodings - better extrapolation, cleaner relative-position properties.
Grouped-query attention (Ainslie et al., 2023) replacing standard multi-head - much smaller KV cache at modest capability cost.

By 2024 these choices had stabilized into the “modern recipe” we develop in §6.

2022–2026: scaling, MoE, and non-transformer alternatives

The recent era is shaped by three forces. Scaling: model parameters and training compute have grown several orders of magnitude beyond the 2020 baseline, with empirical scaling laws (treated in FM §6) describing the result. Mixture-of-experts (§7): frontier models routinely use sparse architectures with substantially more total parameters than active parameters per token. Non-transformer alternatives (§8): state-space models (Mamba, Gu and Dao 2023) and hybrid attention/SSM architectures (Jamba) are the first serious challengers to the Transformer’s substrate dominance in years; whether they displace it at the frontier remains open.

Where this leaves us

The chapter’s substrate as of 2026: the decoder-only Transformer with the modern recipe (RMSNorm + SwiGLU + RoPE + GQA, optional MoE) at scales ranging from a few hundred million parameters (on-device models) to a trillion-plus total parameters (frontier MoE). These models are trained with AdamW-based optimizers, learning-rate schedules with warmup and cosine decay, mixed precision (fp16/bf16/fp8 depending on the deployment), and a constellation of empirical training tricks that have been verified across labs. Sub-quadratic alternatives are present but not dominant.

Historical aside. This narrative privileges the lineage that produced today’s dominant systems. Several research programmes that did not lead directly to current frontier deep learning - symbolic AI, classical statistical learning theory, Bayesian neural networks, equivariant networks, neuroscience-inspired learning - remain active and are treated in their own contexts in this book (the Theoretical Foundations of Learning chapter, the Causality chapter, and others). The history of an idea is rarely the same as the history of the field that produced it.

§3. The Multi-Layer Perceptron and Backpropagation

This section develops the substrate underneath every architecture in this chapter: the multi-layer perceptron (MLP) as the simplest fully-connected neural network, and backpropagation as the training mechanism. The transformer block of §6 has an MLP inside it; the convolutional and recurrent networks of §4 and §5 are MLPs with structured weight-sharing. The training procedure for every architecture in this book is backpropagation plus a gradient-based optimizer (§9). Understanding the substrate is therefore prerequisite to understanding what specific architectures change.

The multi-layer perceptron

An MLP is a function from input vectors to output vectors built from a sequence of linear projections and element-wise nonlinearities. Given an input $\mathbf{x} \in \mathbb{R}^{d_0}$, an $L$-layer MLP computes:

$$\mathbf{h}_0 = \mathbf{x}, \qquad \mathbf{h}_\ell = \sigma(\mathbf{W}_\ell \mathbf{h}_{\ell-1} + \mathbf{b}_\ell), \qquad \mathbf{y} = \mathbf{W}_{L} \mathbf{h}_{L-1} + \mathbf{b}_L,$$

for $\ell = 1, \ldots, L-1$, where $\mathbf{W}_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ are learned weight matrices, $\mathbf{b}_\ell \in \mathbb{R}^{d_\ell}$ are learned bias vectors, $\sigma$ is a non-linear activation applied element-wise, and the final output $\mathbf{y}$ is typically produced by a linear layer without nonlinearity.

Diagrammatically:

flowchart TD
  X(["$$\text{input } \mathbf{x} \in \mathbb{R}^{d_0}$$"])
  L1["$$\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1$$"]
  H1(["$$\mathbf{h}_1 = \sigma(\cdot) \in \mathbb{R}^{d_1}$$"])
  L2["$$\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2$$"]
  H2(["$$\mathbf{h}_2 = \sigma(\cdot) \in \mathbb{R}^{d_2}$$"])
  Dots["…"]
  LL["$$\mathbf{W}_L \mathbf{h}_{L-1} + \mathbf{b}_L$$"]
  Y(["$$\mathbf{y} \in \mathbb{R}^{d_L} \;\; (\text{output})$$"])

  X --> L1 --> H1 --> L2 --> H2 --> Dots --> LL --> Y

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef act fill:#fafaf9,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class X,Y act
  class L1,L2,LL pill
  class H1,H2 act
  class Dots dim

Each layer of an MLP applies a linear transformation followed by a per-element nonlinearity. The widths $d_\ell$ are design choices; common patterns for general MLPs are a uniform width (e.g., 4096 across all hidden layers) or a “wide hidden” pattern (e.g., 4× the input width).
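As a minimal sketch of the forward computation defined above, the following plain-Python code (standard library only, chosen for self-containment rather than speed) evaluates an $L$-layer MLP. The helper names `matvec`, `init_layer`, and `mlp_forward`, the layer widths, and the choice of ReLU with Gaussian weights of variance $2/d_{\text{in}}$ are illustrative assumptions, not part of the chapter's notation.

```python
import math
import random

def matvec(W, h):
    """Multiply a matrix W (list of rows) by a vector h."""
    return [sum(w * v for w, v in zip(row, h)) for row in W]

def relu(z):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in z]

def init_layer(d_in, d_out, rng):
    """Illustrative init (an assumption): Gaussian weights, variance 2/d_in, zero biases."""
    std = math.sqrt(2.0 / d_in)
    W = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]
    return W, [0.0] * d_out

def mlp_forward(x, layers, activation=relu):
    """h_0 = x; h_l = sigma(W_l h_{l-1} + b_l) for l < L; the last layer is linear."""
    h = x
    for W, b in layers[:-1]:
        h = activation([a + c for a, c in zip(matvec(W, h), b)])
    W, b = layers[-1]
    return [a + c for a, c in zip(matvec(W, h), b)]

rng = random.Random(0)
widths = [4, 8, 8, 3]  # d_0, d_1, d_2, d_3, so L = 3
layers = [init_layer(d_in, d_out, rng) for d_in, d_out in zip(widths, widths[1:])]
y = mlp_forward([1.0, -2.0, 0.5, 3.0], layers)
print(len(y))  # 3 = d_L
```

In practice the same computation would be written with an array library and batched over many inputs; the nested lists here only make each index in the formula visible.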

The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) says that an MLP with a single sufficiently-wide hidden layer can approximate any continuous function on a compact domain to arbitrary precision. In principle, MLPs are enough. In practice, the theorem tells us very little - it does not say how wide the layer must be (often exponential in input dimension for problems of interest), how many examples are needed to find good weights, or how well-conditioned the optimization will be. Modern deep learning’s success comes from better function representations (deeper, structured, with appropriate inductive biases) and from trainable representations (architectures whose optimization actually converges to useful weights). Universal approximation is a permission slip, not a recipe.

Activations: what nonlinearity to use

The choice of $\sigma$ matters more than the universal approximation theorem suggests. Linear-only networks (no nonlinearity) collapse: a stack of linear transformations is itself a single linear transformation, regardless of depth. The nonlinearity is what makes depth useful.
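The collapse of linear-only stacks can be checked numerically. The sketch below uses two small hand-picked weight matrices (hypothetical values, biases omitted for brevity) and compares a two-layer linear stack with the single layer whose weight matrix is the product $\mathbf{W}_2 \mathbf{W}_1$; inserting a ReLU between the layers breaks the equivalence.

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product of A (m x k) and B (k x n), both lists of rows."""
    return [[sum(a * B[k][j] for k, a in enumerate(row)) for j in range(len(B[0]))]
            for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # maps R^2 -> R^3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # maps R^3 -> R^2
x = [0.5, -2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers, no nonlinearity
collapsed = matvec(matmul(W2, W1), x)        # one layer with weights W2 W1
with_relu = matvec(W2, [max(0.0, v) for v in matvec(W1, x)])

print(stacked, collapsed, with_relu)  # [-4.5, 5.5] [-4.5, 5.5] [0.0, 2.0]
```

With biases included the same argument goes through: the stack collapses to weights $\mathbf{W}_2 \mathbf{W}_1$ and bias $\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2$.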

A historical sequence of common choices:

Sigmoid $\sigma(z) = 1 / (1 + e^{-z})$. The classical choice in early connectionist work. Saturates at 0 and 1 for large $|z|$; the gradient through a saturated sigmoid is near zero, which makes gradients in deep networks shrink layer by layer (the vanishing gradient problem). The classical bottleneck for training deep MLPs through the 1990s.
tanh $\sigma(z) = (e^z - e^{-z})/(e^z + e^{-z})$. Sigmoid-like but centered at zero, with output range $(-1, 1)$. Slightly better gradient flow than sigmoid but the saturation problem remains.
ReLU $\sigma(z) = \max(0, z)$ (Glorot et al., 2011, popularized by AlexNet, 2012). The dominant activation through the CNN era. Non-saturating in the positive regime - the gradient is exactly 1 for positive inputs - which dramatically improves gradient flow at depth. The “dying ReLU” failure mode (units that saturate at zero and stop updating) motivated variants:
Leaky ReLU, PReLU (small negative-side slope to prevent the dying-ReLU failure).
GELU (Gaussian Error Linear Unit; Hendrycks and Gimpel, 2016): $\sigma(z) = z \cdot \Phi(z)$ where $\Phi$ is the standard normal CDF. Smooth alternative to ReLU; standard in BERT and GPT-2 / GPT-3.
Swish / SiLU (Ramachandran et al., 2017; Elfwing et al., 2017): $\sigma(z) = z \cdot \mathrm{sigmoid}(z)$. Smooth, self-gated. The modern default in many recipes; used inside SwiGLU (§6).
GLU variants (gated linear units, Shazeer 2020). Not strictly an activation but a layer pattern: take two parallel linear projections of the input, apply a sigmoid-like gate to one, multiply them elementwise. SwiGLU is the dominant 2026 choice for the feedforward sub-layer of transformers (§6).

Visualizing the shapes (figure panels): sigmoid $\sigma(z) = 1/(1+e^{-z})$; $\tanh(z)$; $\text{ReLU}(z) = \max(0, z)$; $\text{GELU}(z) = z \cdot \Phi(z)$, a smooth alternative to ReLU; $\text{Swish}(z) = z \cdot \mathrm{sigmoid}(z)$, smooth and self-gated.
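The activation formulas listed in this subsection translate directly into code. The following is a scalar sketch using only Python's `math` module; the function names and the leaky slope of 0.01 are illustrative assumptions. It also evaluates the sigmoid's derivative, which makes the saturation argument concrete: the derivative peaks at 0.25 and is already below 0.01 at $z = 6$.

```python
import math

def sigmoid(z):
    # Fine for moderate |z|; a production version would guard against overflow.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (attained at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """ReLU with a small negative-side slope (the slope value is an assumption)."""
    return z if z > 0 else slope * z

def gelu(z):
    """z * Phi(z), with the standard normal CDF Phi written via erf."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z * sigmoid(z)

for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(z, sigmoid(z), sigmoid_grad(z), math.tanh(z), relu(z), gelu(z), swish(z))
```

For large positive inputs GELU and Swish both approach the identity, like ReLU; they differ from ReLU mainly in being smooth around zero.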

Backpropagation

How does an MLP get trained? The model has a loss function $\mathcal{L}(\theta)$ that depends on the parameters $\theta = \{\mathbf{W}_\ell, \mathbf{b}_\ell\}_{\ell=1}^L$ (the collection of all weights and biases). Training minimizes the loss by gradient descent: compute $\nabla_\theta \mathcal{L}$ and take a step in the direction that decreases the loss.

The non-trivial part is computing $\nabla_\theta \mathcal{L}$ for a network with millions to billions of parameters. Backpropagation is the algorithm that does this efficiently.

The key observation: backpropagation is reverse-mode automatic differentiation specialized for layered architectures. Reverse-mode AD computes the gradient of a scalar output with respect to many inputs in time proportional to the cost of the forward computation itself - a small constant multiple of the forward pass’s work gives you the gradient with respect to every parameter, no matter how many parameters there are.

The procedure for our $L$-layer MLP:

BACKPROPAGATION (one training example)

Inputs:
  x         : a training input vector
  y_target  : the target label
  θ = {W_1, b_1, ..., W_L, b_L}   : current parameters
  ℒ(·, ·)   : loss function comparing prediction to target

FORWARD PASS - compute and CACHE intermediate activations:
  h_0 ← x
  for ℓ = 1 .. L-1:
    z_ℓ ← W_ℓ h_{ℓ-1} + b_ℓ          # pre-activation
    h_ℓ ← σ(z_ℓ)                      # post-activation
  z_L ← W_L h_{L-1} + b_L            # final pre-activation
  ŷ   ← z_L                          # prediction (no final nonlinearity here)

  loss ← ℒ(ŷ, y_target)

BACKWARD PASS - compute gradients in REVERSE layer order:
  # Start: derivative of the loss w.r.t. the final pre-activation z_L.
  δ_L ← ∂ℒ/∂z_L         (e.g., for squared error ½‖ŷ - y_target‖²: δ_L = ŷ - y_target)

  # Output-layer parameter gradients:
  ∇_{W_L} ℒ ← δ_L · h_{L-1}^T        (outer product)
  ∇_{b_L} ℒ ← δ_L

  # Propagate the gradient backward through the layers:
  for ℓ = L-1 down to 1:
    # 1. Pass the gradient through W_{ℓ+1} and the activation:
    #      W_{ℓ+1}^T · δ_{ℓ+1} is the gradient at h_ℓ from downstream;
    #      ⊙ σ'(z_ℓ) multiplies elementwise by the derivative of σ at z_ℓ.
    δ_ℓ ← (W_{ℓ+1}^T · δ_{ℓ+1}) ⊙ σ'(z_ℓ)

    # 2. Compute parameter gradients for layer ℓ:
    ∇_{W_ℓ} ℒ ← δ_ℓ · h_{ℓ-1}^T
    ∇_{b_ℓ} ℒ ← δ_ℓ

Output: ∇_θ ℒ = { ∇_{W_ℓ} ℒ, ∇_{b_ℓ} ℒ  for ℓ = 1..L }

The chain rule does all the work: each gradient is a product of the gradient from downstream layers and the local derivative at the current layer. The “vanishing gradient” failure mode is now mechanically visible - if $\sigma'(z_\ell)$ is near zero (saturated sigmoid, dying ReLU), the gradient $\delta_\ell$ shrinks; propagated through many such layers, the gradient at early layers becomes useless for training.
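To make the pseudocode above concrete, here is a runnable sketch in plain Python that follows the same forward and backward passes and checks one analytic gradient entry against a central finite difference. The choices of tanh as $\sigma$, squared error as the loss, the layer sizes, and the helper names (`forward`, `backprop`, `numeric_grad`) are assumptions made for illustration.

```python
import math
import random

def matvec(W, h):
    return [sum(w * v for w, v in zip(row, h)) for row in W]

def forward(x, params):
    """Return cached hs = [h_0, ..., h_{L-1}] and zs = [z_1, ..., z_L]."""
    hs, zs = [x], []
    for i, (W, b) in enumerate(params):
        z = [a + c for a, c in zip(matvec(W, hs[-1]), b)]
        zs.append(z)
        if i < len(params) - 1:            # hidden layers apply sigma = tanh
            hs.append([math.tanh(v) for v in z])
    return hs, zs

def loss_fn(x, y, params):
    """Squared error 0.5 * ||y_hat - y||^2 with y_hat = z_L."""
    _, zs = forward(x, params)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(zs[-1], y))

def backprop(x, y, params):
    """Gradients [(dW_1, db_1), ..., (dW_L, db_L)] via the backward recursion."""
    hs, zs = forward(x, params)
    delta = [p - t for p, t in zip(zs[-1], y)]    # delta_L for squared error
    grads = [None] * len(params)
    for l in reversed(range(len(params))):        # 0-based index l is layer l+1
        # Outer product delta * h^T for the weights; delta itself for the bias.
        grads[l] = ([[d * h for h in hs[l]] for d in delta], list(delta))
        if l > 0:
            W = params[l][0]
            # W^T delta, then multiply elementwise by tanh'(z) = 1 - tanh(z)^2.
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(W[0]))]
            delta = [g * (1.0 - math.tanh(z) ** 2) for g, z in zip(back, zs[l - 1])]
    return grads

def numeric_grad(x, y, params, layer, i, j, eps=1e-6):
    """Central finite difference of the loss w.r.t. one weight entry."""
    W = params[layer][0]
    old = W[i][j]
    W[i][j] = old + eps
    up = loss_fn(x, y, params)
    W[i][j] = old - eps
    down = loss_fn(x, y, params)
    W[i][j] = old
    return (up - down) / (2.0 * eps)

rng = random.Random(0)
sizes = [3, 4, 4, 2]                               # d_0 .. d_3, so L = 3
params = [([[rng.gauss(0.0, 0.5) for _ in range(d_in)] for _ in range(d_out)],
           [rng.gauss(0.0, 0.5) for _ in range(d_out)])
          for d_in, d_out in zip(sizes, sizes[1:])]
x, y = [0.2, -0.4, 0.9], [1.0, -1.0]

grads = backprop(x, y, params)
analytic = grads[0][0][1][2]                       # d loss / d W_1[1][2]
numeric = numeric_grad(x, y, params, 0, 1, 2)
print(abs(analytic - numeric))                     # should be tiny
```

The finite-difference check costs two forward passes per parameter, which is exactly the expense backpropagation avoids; it is useful only as a correctness test on a handful of entries.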

In modern frameworks (PyTorch, JAX, TensorFlow), you do not write backpropagation by hand. The framework’s autodiff engine constructs the computation graph during the forward pass and runs reverse-mode AD automatically. Conceptually, however, backpropagation is what every “loss.backward()” call computes.

Initialization: where does $\theta$ start?

Backpropagation tells us how to update parameters; it doesn’t say what they should start as. Initialization matters at depth.

If weights are initialized too small, activations shrink layer by layer, and gradients shrink with them - the network’s signal is lost in the noise floor. If weights are initialized too large, activations grow and saturate - gradients vanish at the saturated regions. Both failure modes are exponential in depth: a 100-layer network with the wrong initialization is unusable; a 10-layer network with the wrong initialization is merely slow to train.

The dominant initialization recipes all aim to keep activation variance approximately constant across layers:

Xavier (Glorot) initialization (Glorot and Bengio, 2010): sample weights from a distribution with variance $2 / (d_{\text{in}} + d_{\text{out}})$. Designed for tanh/sigmoid activations.
He initialization (He et al., 2015): variance $2 / d_{\text{in}}$. Designed for ReLU-family activations, accounting for the fact that ReLU zeros out half the inputs on average.
Modern scaled initialization: variants on the above, sometimes with additional rescaling based on depth (DeepNet, ReZero, and others). Transformer-specific initialization recipes (e.g., scaling output projections by $1/\sqrt{N_{\text{layers}}}$) appear in modern training pipelines.

The recipes differ in detail; what they share is the principle: choose the scale of initial weights so that activation variance is preserved as the signal passes through the network.
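The variance-preservation principle can be observed in a small experiment. The sketch below pushes a random vector through a stack of bias-free ReLU layers with freshly sampled Gaussian weights and reports the root-mean-square activation at the end. The width of 128, depth of 20, and the factors of 0.5 and 2 around the He scale are illustrative assumptions. Because the activation is ReLU, the too-large case shows activations growing rather than saturating.

```python
import math
import random

def final_rms(weight_std, width=128, depth=20, seed=0):
    """RMS activation after `depth` ReLU layers with N(0, weight_std^2) weights."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit draws its own row of weights; biases are omitted.
        z = [sum(rng.gauss(0.0, weight_std) * v for v in h) for _ in range(width)]
        h = [max(0.0, v) for v in z]
    return math.sqrt(sum(v * v for v in h) / width)

width = 128
he_std = math.sqrt(2.0 / width)            # He: variance 2 / d_in
too_small = final_rms(0.5 * he_std, width)
he = final_rms(he_std, width)
too_large = final_rms(2.0 * he_std, width)
print(too_small, he, too_large)
```

Halving or doubling the weight scale changes the activation magnitude by roughly a factor of two per layer, so after 20 layers the three settings differ by about six orders of magnitude in each direction, while the He scale keeps the magnitude near its starting value.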

The gradient-flow problem and its architectural responses

Stacking deep MLPs and training them with backpropagation was technically possible in the 1990s but rarely worked at depth beyond a handful of layers. The vanishing-gradient (and complementary exploding-gradient) failure modes were the central obstacle.

Three architectural responses, developed at different points and now standard, restore gradient flow at depth:

Residual connections (He et al., 2015, in ResNet). The fundamental fix. Each layer’s output is added to its input rather than replacing it: $\mathbf{h}_\ell = f_\ell(\mathbf{h}_{\ell-1}) + \mathbf{h}_{\ell-1}$. The gradient from later layers can flow back unchanged via the identity path; even if $f_\ell$’s gradient is degraded, the additive structure ensures the signal still propagates. Residual connections are what made networks of arbitrary depth possible. They are also the architectural primitive underlying the residual stream view of §6.
Normalization layers (LayerNorm, RMSNorm, BatchNorm, GroupNorm, and others). Rescale activations to keep their magnitudes in a useful range, which keeps the local derivative $\sigma'(z)$ in a useful range. Normalization is the secondary fix; without it, residual connections alone are not always enough.
Better activations (ReLU and successors). Non-saturating activations (or weakly-saturating like GELU/Swish) prevent the worst of the vanishing-gradient failure mode within a single layer.
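The effect of the identity path on gradient flow can be seen in a scalar toy model. The sketch below (a deliberately simplified assumption: one scalar unit per layer, $f(h) = \tanh(w h)$ with a shared weight $w = 0.5$) computes the end-to-end derivative $\partial h_{\text{depth}} / \partial h_0$ by the chain rule, once for a plain stack and once with a residual connection around each layer.

```python
import math

def end_to_end_gradient(depth, w, residual, h0=0.5):
    """d h_depth / d h_0 for a scalar chain of tanh layers, via the chain rule."""
    h, grad = h0, 1.0
    for _ in range(depth):
        f = math.tanh(w * h)
        local = w * (1.0 - f * f)                    # f'(h), at most |w|
        if residual:
            h, grad = h + f, grad * (1.0 + local)    # identity path contributes the 1
        else:
            h, grad = f, grad * local
    return grad

plain = end_to_end_gradient(depth=30, w=0.5, residual=False)
resid = end_to_end_gradient(depth=30, w=0.5, residual=True)
print(plain, resid)
```

In the plain stack every layer multiplies the gradient by a factor of at most 0.5, so after 30 layers it is below $0.5^{30} \approx 10^{-9}$. With the residual connection each factor is $1 + f'(h) \geq 1$ in this toy setting, so the gradient cannot vanish. The same factors can also grow, which is one way to read the text's point that normalization is needed alongside residual connections.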

By 2026 every serious deep architecture uses some combination of all three. The transformer block of §6 uses pre-normalization (with RMSNorm) + a non-saturating activation in the feedforward (SwiGLU) + residual connections wrapping both sub-layers. The residual-stream interpretability framing is a direct consequence of this combination.

What you need from §3 to read the rest of the chapter

If you came to this section to understand what the other chapters’ architectures are built on, the takeaways are:

Every architecture in this book is some structured variant of an MLP - different choices of which weights are tied, which positions are connected, what additional operations are interleaved.
Every architecture in this book is trained by gradient descent on a differentiable loss, computed via backpropagation (reverse-mode autodiff).
Depth at scale requires three architectural decisions: non-saturating activations, normalization, and residual connections. The transformer block has all three; so do modern CNNs, RNNs, and SSMs.
The training procedure has knobs (initialization, learning rate, optimizer choice) that §9 develops in detail.

§4. Convolutional Networks (Historical Centre, Now Brief)

Convolutional networks (CNNs, ConvNets) dominated computer vision from 2012 (AlexNet, §2) through roughly 2020 (the arrival of Vision Transformers). The Transformer-dominant era has displaced them from the centre of vision research, but they remain a useful architectural pattern, and the inductive biases they encode are pedagogically clarifying. This section treats them briefly - enough to know what they are, why they worked, where they still win, and where the Transformer took over.

The convolutional inductive bias

A convolutional layer is a constrained MLP. Instead of an unrestricted weight matrix between layers, a convolution applies the same small set of weights (the kernel or filter) at every spatial position of the input. Three structural properties follow:

Locality. Each output value depends only on a small spatial neighbourhood of the input - typically a $3 \times 3$ or $5 \times 5$ patch. The output at position $(i, j)$ does not depend on input position $(i + 100, j + 100)$ directly; it depends on it only through the chain of compositions in subsequent layers.
Translation equivariance. Shifting the input shifts the output identically. The same feature detector (a vertical-edge detector, say) recognises the feature wherever it appears in the input.
Parameter sharing. The kernel’s weights are shared across all spatial positions. A convolutional layer that produces millions of activations can have only thousands of unique parameters.

These three together encode prior knowledge about images - that what a thing is matters more than where it is, and that local patches contain meaningful structure - into the architecture itself. The result is dramatically improved sample efficiency on visual tasks compared to fully-connected MLPs of comparable parameter count, plus invariance properties that the model does not have to learn from scratch.
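A small sketch makes the three properties tangible. The code below implements a single-channel 2D convolution (strictly, cross-correlation without padding, as is conventional in deep-learning usage) in plain Python, applies a hand-written vertical-edge kernel to a toy image, and checks translation equivariance and the parameter count. The image size, the kernel values, and the helper names are illustrative assumptions.

```python
def conv2d_valid(image, kernel):
    """Apply one shared kernel at every position (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(cols - kw + 1)]
            for i in range(rows - kh + 1)]

def shift_right(grid, s):
    """Shift every row right by s >= 1 columns, filling with zeros."""
    return [[0.0] * s + row[:-s] for row in grid]

# A 6 x 8 image with a vertical bar in column 2.
image = [[1.0 if j == 2 else 0.0 for j in range(8)] for i in range(6)]
edge_kernel = [[-1.0, 0.0, 1.0]] * 3      # a vertical-edge detector, 9 weights

out = conv2d_valid(image, edge_kernel)
out_of_shifted = conv2d_valid(shift_right(image, 1), edge_kernel)

# Equivariance: shifting the input shifts the output the same way
# (exact here because the bar stays away from the image border).
print(out_of_shifted == shift_right(out, 1))   # True

# Parameter sharing: 9 weights versus a dense map between the same shapes.
conv_params = len(edge_kernel) * len(edge_kernel[0])
dense_params = (6 * 8) * (len(out) * len(out[0]))
print(conv_params, dense_params)               # 9 1152
```

Locality is visible in the inner sum: each output value reads only a $3 \times 3$ window of the input. Real convolutional layers add input and output channels, padding, and strides on top of this pattern.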

Landmark architectures

A compressed history of CNN architecture, with each entry’s key contribution:

LeNet (LeCun et al., 1989). The original CNN for handwritten-digit recognition. Already had the now-standard pattern: alternating convolution + activation + pooling layers, ending in a fully-connected classifier.
AlexNet (Krizhevsky et al., 2012). The §2 inflection - eight layers, ReLU activations, dropout, GPU training. Established CNN-at-scale as practical.
VGG (Simonyan and Zisserman, 2014). Uniform stack of $3 \times 3$ convolutions; depth (up to 19 layers) as the central design choice; demonstrated that, within that range, going deeper improved accuracy.
GoogLeNet / Inception (Szegedy et al., 2014). The inception block with multiple parallel convolutions of different filter sizes, concatenated. Improved efficiency at given capacity.
ResNet (He et al., 2015). Residual connections - the architectural primitive that made networks of a hundred or more layers trainable (§3 of this chapter develops this) and that the Transformer’s residual stream inherits.
EfficientNet (Tan and Le, 2019). Principled co-scaling of depth, width, and input resolution; “EfficientNet-B0 through B7” as a calibrated capacity ladder.

These architectures were the central focus of vision research from 2012 to roughly 2020; many specialized variants (dense prediction, object detection, segmentation) added to the family.

The Transformer takeover and where ConvNets remain

Vision Transformer (ViT) (Dosovitskiy et al., 2020) showed that the Transformer architecture works on images: tokenize an image into patches (typically $16 \times 16$ pixels), treat the patches as a sequence, run a standard Transformer over them. With sufficient training data, ViT matches or outperforms ConvNets on image-classification benchmarks. CLIP (Radford et al., 2021) trained image encoders (both ResNet and ViT variants, with the ViT variants performing best) on 400 million image-text pairs and demonstrated zero-shot transfer that earlier supervised ConvNet classifiers had not approached.
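The patch-tokenization step is a pure reshape. A minimal sketch, assuming a 224 x 224 RGB input (a common but here illustrative resolution) and a hypothetical helper named `patchify`; the learned linear projection and position embeddings that follow in a real ViT are omitted.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

# Assumed input size; any H and W divisible by the patch size would work.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)   # (number of patches, values per patch)
```

Each row of `tokens` is one patch in row-major grid order; the Transformer then treats the rows as a sequence exactly as it would treat token embeddings.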

By 2026, mainstream vision research has largely moved to Transformer-based architectures. ConvNets persist in specific niches:

Small-data regimes. Without ImageNet-scale or larger pretraining, a ConvNet’s inductive biases give it an advantage; a from-scratch-trained model on 10,000 images is often better as a ConvNet than as a ViT.
On-device deployment. ConvNets are amenable to mobile and edge-device hardware optimization (NEON, Apple Neural Engine, etc.) in ways ViTs have been less so until very recently.
Dense prediction. Semantic segmentation, depth estimation, optical flow - tasks where the output is a per-pixel map - fit ConvNet-style architectures (U-Net and successors) naturally.
Modern revivals. ConvNeXt (Liu et al., 2022) and ConvMixer showed that ConvNets can be made competitive with ViTs at the same training scale, given the right architectural updates (patchify-then-process, depthwise convolutions, GELU activations, LayerNorm). The takeaway: the original ViT-vs-ConvNet capability gap was partly architectural and partly recipe-related; closing both gaps closes most of the practical gap.

For the rest of this chapter and the rest of the book, the substrate we develop is the Transformer family. ConvNets remain a useful pattern to know - the inductive-bias arguments above apply to many domains beyond vision (genomics, time-series, audio with spectrograms) - but they are no longer the centre of gravity in modern deep learning.

§5. Recurrent Networks (Historical Centre, Now Brief)

Recurrent networks (RNNs) handled sequence modelling for most of the deep-learning era preceding the Transformer’s 2017 arrival. Like the CNN section, this section is brief: enough to know what RNNs were, why they had structural limits, and where the recurrent-modelling pattern still matters.

Vanilla RNNs, LSTMs, GRUs

The basic RNN processes a sequence by maintaining a hidden state that updates at each step:

\mathbf{h}_t \;=\; \sigma(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}), \qquad \mathbf{y}_t \;=\; \mathbf{W}_y \mathbf{h}_t + \mathbf{c}.

The hidden state $\mathbf{h}_t$ is supposed to encode everything relevant from the history $\mathbf{x}_1, \ldots, \mathbf{x}_t$ seen so far. In practice, the vanilla RNN’s hidden state rarely retains useful information across more than a handful of timesteps - the vanishing gradient problem of §3 applies along the time dimension with particular force, because the same weight matrix $\mathbf{W}_h$ is multiplied through repeatedly.
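The repeated multiplication by $\mathbf{W}_h$ can be made visible by tracking the Jacobian of $\mathbf{h}_t$ with respect to $\mathbf{h}_0$. A minimal sketch under stated assumptions: $\sigma$ is taken to be tanh, the weights are small random matrices (scale 0.05, chosen for illustration), and the inputs are random noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 16, 8, 50

# Small random weights; the 0.05 scale is an illustrative assumption.
W_h = rng.normal(scale=0.05, size=(d_h, d_h))
W_x = rng.normal(scale=0.05, size=(d_h, d_x))
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One step of h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(d_h)
jac = np.eye(d_h)      # accumulates d h_t / d h_0
norms = []
for t in range(T):
    h = rnn_step(h, rng.normal(size=d_x))
    # Chain rule: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_h for tanh.
    jac = (np.diag(1.0 - h**2) @ W_h) @ jac
    norms.append(np.linalg.norm(jac, 2))

print(norms[0], norms[9], norms[-1])
```

With weights this small the Jacobian norm shrinks geometrically, which is the vanishing case; with sufficiently large weights the same product can grow instead, which is the exploding case.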

LSTM (Long Short-Term Memory; Hochreiter and Schmidhuber, 1997) replaces the simple recurrence with a gated structure: a separate cell state that the network learns to write to (input gate), forget from (forget gate), and read from (output gate). The cell state is updated additively rather than multiplicatively, side-stepping the worst of the vanishing-gradient failure mode and enabling useful retention over tens to hundreds of timesteps.

GRU (Gated Recurrent Unit; Cho et al., 2014) is a streamlined variant - two gates instead of three, no separate cell state - with similar empirical behaviour at slightly lower computational cost.

LSTMs and GRUs were the workhorses of sequence modelling (machine translation, speech recognition, language modelling) from roughly 2014 to 2017.

Truncated backprop through time

Training an RNN requires backpropagating through the temporal recurrence. Backpropagation through time (BPTT) unrolls the network across $T$ steps and applies standard backprop (§3) to the unrolled computation graph. For long sequences, this is memory-expensive (activations at every timestep must be cached). Truncated BPTT addresses this by backpropagating only through a fixed window of $K$ recent steps; gradients beyond $K$ steps in the past are simply dropped.

Truncated BPTT trades exactness for tractability. The trade-off shapes what RNNs can learn: dependencies longer than $K$ steps are invisible to the gradient signal, even if the architecture in principle could capture them.

Why RNNs were displaced

The Transformer (§6) replaced RNNs across almost all sequence-modelling applications between 2017 and 2020. Three reasons:

Parallelism. Transformers parallelize across positions; RNNs are inherently sequential. On hardware that supports parallel computation, training a Transformer on $L$-token sequences is faster in wall-clock time than training an RNN on the same sequences.
Long-range dependencies. Self-attention provides direct access to any prior position; the RNN’s bottleneck through a fixed-size hidden state limits how much information can be carried forward.
Scaling. As models grew larger, Transformers’ empirical performance scaled more reliably than RNNs’. The scaling laws (FM §6) were developed on Transformers; equivalent results for RNNs were never as clean.

Where recurrent designs persist

RNNs and recurrent designs more broadly have not disappeared - they persist in several niches:

Streaming inference. A recurrent network can process a stream of inputs token by token with constant per-token cost, ideal for real-time audio (speech recognition, voice assistants) and streaming text.
Low-latency on-device. Where memory bandwidth is the bottleneck and KV-cache-style transformer inference is expensive, recurrent designs can be more efficient.
Control and robotics. Some control systems use recurrent architectures for reasons of inductive bias or hardware fit.
The state-space model revival (§8). Modern SSMs and related linear-recurrent designs (Mamba, RWKV, and hybrids such as Jamba that interleave them with attention) are the recurrent renaissance - recurrent in inference but parallelizable in training, designed to compete with attention on its own terms. The classical RNN failure modes are largely sidestepped by careful architectural design.

Bridge to §8

The state-space models of §8 are RNNs in a precise sense: they update a hidden state recurrently. What makes them competitive with Transformers is the architectural engineering - linear recurrences that can be unrolled as convolutions for training, input-dependent dynamics for content selectivity - that the original LSTM and GRU lineage lacked. If §5 is the historical context, §8 is its modern continuation.
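The claim that a linear recurrence can be unrolled as a convolution is easy to verify in the simplest case. The sketch below assumes a scalar, time-invariant recurrence with illustrative coefficients `a` and `b`; it is not the parameterization of any particular SSM.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 12
a, b = 0.9, 0.5        # state-transition and input weights (illustrative values)
x = rng.normal(size=T)

# Recurrent form: h_t = a * h_{t-1} + b * x_t, starting from h = 0.
h_rec = np.zeros(T)
h = 0.0
for t in range(T):
    h = a * h + b * x[t]
    h_rec[t] = h

# Convolutional form: h_t = sum_k (b * a^k) * x_{t-k}, computed for all t at once.
kernel = b * a ** np.arange(T)
h_conv = np.convolve(x, kernel)[:T]

print(np.allclose(h_rec, h_conv))
```

The recurrent form is what makes per-token inference cheap; the convolutional form is what makes training parallel across positions. Once the dynamics depend on the input, the kernel is no longer fixed and this simple identity does not apply directly.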

§6. Attention and the Transformer

The Transformer (Vaswani et al., 2017) is the architectural substrate of every chapter after the Foundation Models chapter. This section develops it from first principles: the attention operator, the block structure, the deployment shapes (encoder-only, decoder-only, encoder-decoder), positional information, normalization placement, and the residual-stream view that recent mechanistic-interpretability work has made central. This section is the canonical mechanical reference for the architecture; the LLM chapter §4 specializes the same machinery to the modern language-modelling recipe.

Motivation: what attention solves

Before attention, sequence-processing neural networks were recurrent. An RNN reads a sequence token by token, maintaining a hidden state that summarizes everything seen so far. Recurrent processing has two structural limitations:

Information bottleneck. Everything the model knows about earlier tokens must fit through the fixed-size hidden state.
Sequential training. Token $t$ ’s representation cannot be computed before token $t-1$ ’s; training cannot parallelize over positions.

Attention addresses both directly. Each output position attends to every relevant input position via learned weights - no bottleneck, since each query has direct access to every key/value pair. And the attention computation is fully parallel across positions; the entire forward pass over a sequence of length $L$ can be computed in parallel on hardware that supports it.

What attention costs in exchange: quadratic time and memory in sequence length. The trade is favourable at the scales of interest; the chapter’s §10 develops the responses (sliding windows, sparse patterns, sub-quadratic alternatives).

The attention operator

We develop attention in three stages: the projections, the attention computation, and the output projection.

Stage 1: queries, keys, values

Given a sequence of input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_T$, each $\mathbf{x}_t \in \mathbb{R}^{d_{\text{model}}}$, three learned linear projections produce three new vector sequences. In an LLM, $t$ indexes token positions in the sequence and $\mathbf{x}_t$ is the activation at that position - at the input, the token embedding (plus a positional encoding); at deeper layers, the running residual-stream activation that earlier blocks have written into (developed below in §6.6 and in the Mechanistic Interpretability chapter). The dimensionality $d_{\text{model}}$ is the width of the residual stream, held constant throughout the stack so that residual additions type-check; it typically ranges from several hundred to over ten thousand - 768 in GPT-2 small, 4096 in LLaMA-2 7B, 12288 in GPT-3 175B. The three projections are:

\mathbf{q}_t \;=\; \mathbf{W}_Q \mathbf{x}_t, \qquad \mathbf{k}_t \;=\; \mathbf{W}_K \mathbf{x}_t, \qquad \mathbf{v}_t \;=\; \mathbf{W}_V \mathbf{x}_t,

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$ are learned weight matrices and $d_{\text{head}}$ is the per-head dimension (typically 64 or 128). The names - query, key, value - are intuitions about the roles:

$\mathbf{q}_t$ is what position $t$ is “looking for” in the rest of the sequence.
$\mathbf{k}_t$ is what position $t$ “offers” to be matched against.
$\mathbf{v}_t$ is what position $t$ “contributes” once matched.

These roles are not enforced architecturally; they are learned by the training procedure. Empirical work on trained transformers (the Mechanistic Interpretability chapter develops this) shows that some heads do behave intuitively in this query-key-value sense and others find their own roles entirely.

Stage 2: scaled dot-product attention

For each query position $t$ , the attention output is a weighted sum of value vectors, with weights determined by how well each key matches the query:

\mathbf{o}_t \;=\; \sum_{s} \alpha_{t,s} \, \mathbf{v}_s, \qquad \alpha_{t,s} \;=\; \mathrm{softmax}_s\!\left( \frac{\mathbf{q}_t \cdot \mathbf{k}_s}{\sqrt{d_{\text{head}}}} \right).

The dot product $\mathbf{q}_t \cdot \mathbf{k}_s$ measures alignment; the softmax (over $s$ ) turns the alignment scores at each $t$ into a probability distribution over positions; the attention output is the corresponding weighted average of values.

The $1/\sqrt{d_{\text{head}}}$ factor is a numerical-stability rescaling: as $d_{\text{head}}$ grows, the variance of unscaled dot products grows linearly, and the softmax saturates (concentrates all mass on one entry). Dividing by $\sqrt{d_{\text{head}}}$ keeps the softmax operating in a useful regime across scales.
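The linear growth of the variance can be checked empirically. A minimal sketch, assuming query and key components are independent standard normals (an idealization of what trained projections produce); the helper name `dot_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000

def dot_variance(d_head, scaled):
    """Empirical variance of q . k for random unit-variance q and k."""
    q = rng.normal(size=(n_samples, d_head))
    k = rng.normal(size=(n_samples, d_head))
    dots = np.sum(q * k, axis=1)
    if scaled:
        dots = dots / np.sqrt(d_head)
    return dots.var()

variances = {}
for d in (16, 64, 256):
    variances[d] = (dot_variance(d, scaled=False), dot_variance(d, scaled=True))
    print(d, variances[d])
```

Under these assumptions the unscaled variance should track $d_{\text{head}}$, while the scaled variance should stay close to 1 at every width.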

In matrix form - the equation that appears in every Transformer paper - let $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times d_{\text{head}}}$ stack the per-position vectors as rows. Then

\mathbf{O} \;=\; \mathrm{softmax}\!\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_{\text{head}}}} \right) \mathbf{V}.

Pictorially:

flowchart TD
  Q(["$$\text{Queries } \mathbf{Q}$$"])
  S["$$\mathbf{Q}\, \mathbf{K}^\top / \sqrt{d_{\text{head}}}$$"]
  Sm["$$\text{softmax row-wise}$$"]
  A(["$$\text{Attention weights } \alpha$$"])
  Mul["$$\alpha\, \mathbf{V}$$"]
  O(["$$\text{Output } \mathbf{O} = \alpha\, \mathbf{V}$$"])

  Q --> S
  S --"$$T \times T \text{ matrix}$$"--> Sm
  Sm --"$$\text{each row sums to 1 over keys}$$"--> A
  A --> Mul
  Mul --"$$\text{weighted sum of values, one per query}$$"--> O

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Q,S,Sm,A,Mul,O pill

A small numerical example with $T = 3$ tokens and $d_{\text{head}} = 4$ . Suppose the scaled dot products produce the following $3 \times 3$ matrix:

	$k_1$	$k_2$	$k_3$
$q_1$	0.40	0.10	0.20
$q_2$	0.60	0.30	0.50
$q_3$	0.20	0.80	0.10

Scaled dot products $\mathbf{Q}\mathbf{K}^\top/\sqrt{d_{\text{head}}}$

Apply softmax row-wise (each row becomes a probability distribution):

	$k_1$	$k_2$	$k_3$
$q_1$	0.39	0.29	0.32
$q_2$	0.38	0.28	0.34
$q_3$	0.27	0.49	0.24

After row-wise softmax

Output $o_t$ is the row-$t$-weighted sum of value vectors:

o_1 = 0.39·v_1 + 0.29·v_2 + 0.32·v_3
o_2 = 0.38·v_1 + 0.28·v_2 + 0.34·v_3
o_3 = 0.27·v_1 + 0.49·v_2 + 0.24·v_3

This is self-attention: queries, keys, and values all come from the same input sequence. We will see two structural variants: cross-attention (queries from one sequence, keys/values from another) and causally-masked self-attention (queries cannot attend to later positions).
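The operator is short enough to write out in full. The following NumPy sketch applies row-wise softmax to the $3 \times 3$ score matrix of the worked example, then runs the complete single-head operator on random inputs; the function names and the random $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are illustrative assumptions.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V."""
    d_head = Q.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_head))
    return weights @ V, weights

# The matrix of scaled dot products from the worked example (rows: queries).
scores = np.array([[0.40, 0.10, 0.20],
                   [0.60, 0.30, 0.50],
                   [0.20, 0.80, 0.10]])
alpha = softmax_rows(scores)
print(np.round(alpha, 2))

# Random Q, K, V with T = 3 and d_head = 4 to exercise the full operator.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
O, weights = attention(Q, K, V)
print(O.shape)
```

Subtracting the row maximum before exponentiating does not change the softmax result; it only prevents overflow when scores are large.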

Stage 3: output projection

The attention output $\mathbf{o}_t \in \mathbb{R}^{d_{\text{head}}}$ has the dimension of a single head; the surrounding architecture expects vectors of dimension $d_{\text{model}}$ . A final learned linear projection $\mathbf{W}_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ (or, with multi-head attention below, $\mathbb{R}^{d_{\text{model}} \times H \cdot d_{\text{head}}}$ ) maps the per-head outputs back:

\text{attn-output}_t \;=\; \mathbf{W}_O \, \mathbf{o}_t.

Multi-head attention

Single-head attention is sufficient for the mechanism, but in practice attention is multi-head: $H$ independent attention operations run in parallel, each with its own $\mathbf{W}_Q^h, \mathbf{W}_K^h, \mathbf{W}_V^h$ . Their outputs are concatenated and projected back to $d_{\text{model}}$ by a single $\mathbf{W}_O$ .

flowchart TD
  Input(["$$\text{Input}$$"])
  H1["$$\text{head 1: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  H2["$$\text{head 2: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  HD["…"]
  HH["$$\text{head } H \text{: own } \mathbf{Q}, \mathbf{K}, \mathbf{V}$$"]
  O1["$$\text{out}_1$$"]
  O2["$$\text{out}_2$$"]
  OD["…"]
  OH["$$\text{out}_H$$"]
  Concat["$$\text{concat: } \text{out}_1 \,\Vert\, \text{out}_2 \,\Vert\, \cdots \,\Vert\, \text{out}_H$$"]
  WO["$$\text{output projection } \mathbf{W}_O$$"]
  Res(["$$\text{add to residual}$$"])

  Input --> H1
  Input --> H2
  Input --> HD
  Input --> HH

  H1 --> O1
  H2 --> O2
  HD --> OD
  HH --> OH

  O1 --> Concat
  O2 --> Concat
  OD --> Concat
  OH --> Concat

  Concat --> WO --> Res

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class Input,Res,WO,Concat pill
  class H1,H2,HH,O1,O2,OH pill
  class HD,OD dim

Modern transformers use $H$ in the range of 12 to 128 attention heads per layer. Each head can specialize: empirical interpretability work has found heads dedicated to specific syntactic relations, to coreference resolution, to attending mostly to the previous token, and to less interpretable patterns. The Mechanistic Interpretability chapter develops what individual heads do; for purposes of this chapter, the multi-head structure is what gives a single attention layer enough capacity to perform several distinct operations in parallel.
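The multi-head computation can be sketched compactly by stacking the per-head projection matrices along a leading axis. This is a minimal NumPy sketch: the array layout, the choice $d_{\text{head}} = d_{\text{model}} / H$, and the function names are assumptions made for illustration, and no masking is applied.

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (T, d_model); W_Q, W_K, W_V: (H, d_head, d_model); W_O: (d_model, H * d_head)."""
    H, d_head, _ = W_Q.shape
    # Per-head projections, stored row-wise per position: (H, T, d_head).
    Q = np.einsum("hdm,tm->htd", W_Q, X)
    K = np.einsum("hdm,tm->htd", W_K, X)
    V = np.einsum("hdm,tm->htd", W_V, X)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (H, T, T)
    heads = softmax_rows(scores) @ V                        # (H, T, d_head)
    # Concatenate heads along the feature axis, then project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], H * d_head)
    return concat @ W_O.T                                    # (T, d_model)

rng = np.random.default_rng(0)
T, d_model, H = 5, 32, 4
d_head = d_model // H      # a common choice, assumed here
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(H, d_head, d_model)) * 0.1 for _ in range(3))
W_O = rng.normal(size=(d_model, H * d_head)) * 0.1
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

The heads never interact until the final projection: each computes its own $T \times T$ attention pattern, and $\mathbf{W}_O$ is the only place their outputs are mixed.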

The KV-cache problem and its mitigation via grouped-query attention (GQA) are language-deployment-specific and are developed in LLM §4.6.

Cross-attention vs self-attention

So far, $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ all derive from the same input sequence. This is self-attention, the basis of every modern transformer.

Cross-attention is the other structural pattern: queries come from one sequence, keys and values from another. Cross-attention appears in encoder-decoder transformers, where the decoder’s queries attend to the encoder’s keys and values - the decoder integrates information from the encoder via cross-attention layers. Cross-attention also appears in some multimodal architectures, where one modality’s tokens cross-attend to another’s. The mechanism is identical to self-attention except for the source of $\mathbf{K}, \mathbf{V}$ .

Causal masking

In decoder-only transformers (the dominant LLM substrate, per LLM §4), self-attention is causally masked: position $t$ can attend only to positions $\leq t$ , never to future positions. The mask is implemented by setting the corresponding entries of the $\mathbf{Q} \mathbf{K}^\top$ matrix to $-\infty$ before softmax (so they vanish in the exponential).

Standard self-attention - every query attends to every key.

Causally-masked self-attention (decoder-only) - position $t$ attends only to positions $s \le t$.

The causal mask is what makes decoder-only transformers generation-native: during inference, the model produces one token at a time, and the mask ensures that the architecture has no way to “cheat” by looking at tokens it has not yet produced. During training, all positions can be computed in parallel - the mask is purely a constraint on which positions attend to which, not on the order of computation.
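The masking step itself is a few lines. A minimal sketch, assuming a square score matrix with queries as rows and keys as columns; the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to a (T, T) score matrix, then softmax each row."""
    T = scores.shape[0]
    # True strictly above the diagonal: key position s > query position t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))   # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
alpha = causal_attention_weights(scores)
print(np.round(alpha, 2))
```

The resulting weight matrix is lower-triangular: the first position can attend only to itself, and every row still sums to 1 over the positions it is allowed to see. All rows are computed in one pass, which is the training-time parallelism described above.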

Positional information

Standard attention is permutation-equivariant: shuffling the input tokens shuffles the outputs in the same way and changes nothing else, because $\mathbf{q}_t \cdot \mathbf{k}_s$ depends only on the content at positions $t$ and $s$, not on the indices themselves. To make the model order-aware, position must be injected somewhere. Several schemes are in use:

Sinusoidal absolute encodings (the original Transformer, 2017). Add to each token’s embedding a fixed sinusoidal pattern that depends on absolute position. Simple, parameter-free, but does not extrapolate well to lengths longer than training.
Learned absolute encodings. Replace the sinusoidal pattern with a learned per-position embedding (BERT, GPT-2). Still does not extrapolate; constrained to the maximum trained position.
Rotary position embeddings (RoPE) (Su et al., 2021). Rotate queries and keys by an angle that depends on position; the dot product after rotation depends only on the relative position $t - s$. Became the de-facto default after ~2022 for both LLMs and many vision-language transformers. LLM §4.5 develops the rotation construction.
ALiBi (Press et al., 2021). Add a fixed bias to attention scores that decreases linearly with the distance $|t - s|$, with a fixed per-head slope. No learned position parameters at all; extrapolates gracefully but has empirical tradeoffs against RoPE that have gone back and forth.

By 2026 RoPE is the dominant choice, with ALiBi persisting in some training pipelines that emphasize length extrapolation.
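The relative-position property of RoPE can be verified directly. The sketch below is one common formulation, stated with assumptions: consecutive pairs of dimensions are rotated together, and the frequency base is 10000; other implementations pair dimensions differently. It is a check of the property, not the full construction.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by angles proportional to pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even dimension"
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset t - s = 3 at two different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
# A different offset (t - s = 1) generally gives a different score.
score_c = rope(q, 5) @ rope(k, 4)

print(np.isclose(score_a, score_b), np.isclose(score_a, score_c))
```

Because each pair is rotated by a plain 2-D rotation, the rotations on the query and key sides combine into a single rotation by the position difference, which is why the absolute positions drop out of the score.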

The Transformer block

A single transformer block combines an attention sub-layer and a feedforward sub-layer, with residual connections and normalization. The block is then stacked many times to form a full transformer; for LLMs, dozens to hundreds of identical-in-structure blocks.

The block, in its 2026 standard form:

flowchart TD
  In(["$$\text{Input } \mathbf{x}_1, \ldots, \mathbf{x}_T$$"])
  N1["$$\text{Norm}$$"]
  Attn["$$\text{Multi-head (causal) attention}$$"]
  Add1(["$$\oplus$$"])
  Int(["$$\text{intermediate vectors}$$"])
  N2["$$\text{Norm}$$"]
  FF["$$\text{Feed-forward (MLP or GLU)}$$"]
  Add2(["$$\oplus$$"])
  Out(["$$\text{Output } \mathbf{x}'_1, \ldots, \mathbf{x}'_T$$"])

  In --> N1 --> Attn --> Add1
  In -. residual .-> Add1
  Add1 --> Int --> N2 --> FF --> Add2
  Int -. residual .-> Add2
  Add2 --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef sum fill:#fafaf9,stroke:#1a1a1a,stroke-width:1.5px,color:#1a1a1a
  class In,N1,Attn,Int,N2,FF,Out pill
  class Add1,Add2 sum

Two structural properties of this block are worth pulling out before the components:

The residual stream. The block adds its computation onto its input (via the residual ⊕ joins) rather than replacing it. Stacked through $N$ blocks, this creates a “stream” of vectors that flows end-to-end - every block reads from it (via its norm+attention and norm+feedforward sub-layers) and writes to it (via the residual additions). Mechanistic-interpretability work (Elhage et al., 2021, and subsequent literature) has shown that thinking of the transformer as a residual stream that each block partially updates is more useful than thinking of it as a stack of independent operations. The residual stream view recurs throughout this book.

Pre- vs post-normalization. The original 2017 Transformer placed normalization after each sub-layer (post-LayerNorm). At depth and at scale, post-normalization is hard to train stably: every residual addition is followed by a normalization, so there is no unnormalized identity path from input to output, and training typically needs careful learning-rate warmup. Pre-normalization (analysed by Xiong et al., 2020) moves the normalization before each sub-layer, leaving the residual path untouched and stabilizing training at depth. By 2022 essentially every new transformer architecture uses pre-normalization.
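In symbols, the two placements differ only in where the norm sits relative to the residual addition. Writing $F$ for either sub-layer (attention or feedforward) and $\mathbf{x}'$ for the sub-layer's output, a schematic comparison (the notation $F$ is introduced here for illustration):

\text{post-norm:} \quad \mathbf{x}' \;=\; \mathrm{Norm}\big(\mathbf{x} + F(\mathbf{x})\big), \qquad \text{pre-norm:} \quad \mathbf{x}' \;=\; \mathbf{x} + F\big(\mathrm{Norm}(\mathbf{x})\big).

In the pre-norm form the input $\mathbf{x}$ reaches $\mathbf{x}'$ unchanged through the residual addition, which is the arrangement the block diagram above depicts.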

Feedforward sub-layer

Inside each block, the feedforward sub-layer is a per-token nonlinearity that operates independently at each position. Two common forms.

The simple MLP, used in the original Transformer:

\mathrm{FFN}(\mathbf{x}) \;=\; \mathbf{W}_2 \, \sigma(\mathbf{W}_1 \mathbf{x}),

where $\mathbf{W}_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ projects up to a wider hidden dimension $d_{\text{ff}}$ (typically $4 \cdot d_{\text{model}}$ ), $\sigma$ is a non-linear activation (originally ReLU, later GELU), and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ projects back down.

GLU variants (Shazeer, 2020) generalize the MLP with a multiplicative gating structure:

\mathrm{GLU\text{-}FFN}(\mathbf{x}) \;=\; \mathbf{W}_2 \, \big( \sigma(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{W}_3 \mathbf{x} \big),

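where $\odot$ is elementwise multiplication and a third learned matrix $\mathbf{W}_3$ provides the gating signal. SwiGLU is the GLU variant using the Swish activation ($\mathrm{Swish}(z) = z \cdot \mathrm{sigmoid}(z)$, with the logistic sigmoid rather than the generic activation $\sigma$ above). Because a GLU layer has three matrices instead of two, $d_{\text{ff}}$ is commonly reduced to about two-thirds of its MLP value to keep parameter counts matched. In reported comparisons SwiGLU outperforms the simpler MLP+activation at matched parameter counts, and it is the dominant feedforward choice in 2026 LLMs.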
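The two feedforward forms, side by side. This is a minimal sketch with assumptions flagged: GELU uses the tanh approximation, biases are omitted as in the formulas above, and the GLU hidden width is set to two-thirds of $4 \cdot d_{\text{model}}$ so that the three-matrix layer has the same parameter count as the two-matrix MLP. The small `d_model` is for illustration only.

```python
import numpy as np

def gelu(z):
    """tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    """Swish (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def ffn(x, W1, W2):
    """Plain MLP feedforward: W2 gelu(W1 x)."""
    return W2 @ gelu(W1 @ x)

def swiglu_ffn(x, W1, W2, W3):
    """Gated feedforward: W2 (swish(W1 x) * W3 x)."""
    return W2 @ (swish(W1 @ x) * (W3 @ x))

rng = np.random.default_rng(0)
d_model = 48
d_ff_mlp = 4 * d_model             # hidden width of the plain MLP
d_ff_glu = (8 * d_model) // 3      # two-thirds of 4 * d_model (assumed convention)

x = rng.normal(size=d_model)
W1 = rng.normal(size=(d_ff_mlp, d_model)) * 0.1
W2 = rng.normal(size=(d_model, d_ff_mlp)) * 0.1
G1 = rng.normal(size=(d_ff_glu, d_model)) * 0.1
G2 = rng.normal(size=(d_model, d_ff_glu)) * 0.1
G3 = rng.normal(size=(d_ff_glu, d_model)) * 0.1

mlp_params = W1.size + W2.size
glu_params = G1.size + G2.size + G3.size
print(mlp_params, glu_params, ffn(x, W1, W2).shape, swiglu_ffn(x, G1, G2, G3).shape)
```

Both functions act on a single position's vector; applying them to a sequence means applying them independently to each row.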

Normalization

The block’s normalization step rescales activations to stabilize training. Two variants matter.

LayerNorm (Ba et al., 2016) centres and rescales:

\mathrm{LayerNorm}(\mathbf{x}) \;=\; \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \boldsymbol{\beta},

where $\mu$ and $\sigma$ are the per-vector mean and standard deviation, and $\boldsymbol{\gamma}, \boldsymbol{\beta}$ are learned per-dimension scale and shift parameters.

RMSNorm (Zhang and Sennrich, 2019) drops the mean-centering and the additive bias:

\mathrm{RMSNorm}(\mathbf{x}) \;=\; \boldsymbol{\gamma} \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_i x_i^2}}.

RMSNorm is cheaper to compute, numerically stable at low precision (fp8, int8), and empirically equivalent in quality. By 2026 it has displaced LayerNorm in most new transformer architectures.
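Both normalizations in NumPy, as a minimal sketch. The small constant `eps` added under the square root is an assumption not shown in the formulas above; implementations commonly include one to avoid division by zero.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Centre and rescale each vector, then apply learned scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

def rms_norm(x, gamma, eps=1e-5):
    """Rescale each vector by its root-mean-square; no centring, no shift."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
d = 64
x = rng.normal(loc=3.0, scale=2.0, size=d)    # deliberately non-zero mean
gamma, beta = np.ones(d), np.zeros(d)

y_ln = layer_norm(x, gamma, beta)
y_rms = rms_norm(x, gamma)

print(y_ln.mean(), y_ln.std())                      # centred, unit scale
print(y_rms.mean(), np.sqrt(np.mean(y_rms**2)))     # mean kept, unit RMS
```

On an input that is already zero-mean the two coincide (with unit $\boldsymbol{\gamma}$ and zero $\boldsymbol{\beta}$); the difference is only whether the mean is removed first.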

Three deployment shapes

The transformer family has three deployment shapes that emerged in the 2018 inflection (FM §2):

Encoder-only: bidirectional self-attention, no causal mask. The model produces a representation of the input. Used for understanding tasks (classification, named-entity recognition, retrieval embeddings). BERT and its descendants.
Decoder-only: causally-masked self-attention, autoregressive. The model produces output tokens one at a time. Dominant for generation tasks; the substrate of every modern LLM (LLM §4).
Encoder-decoder: an encoder (bidirectional) processes the input, a decoder (causal, plus cross-attention into the encoder) produces the output. Used for translation, summarization, and structured-prediction tasks where the input and output are distinct sequences. T5 and its descendants.

LLM §4 develops the argument for why decoder-only won the LLM substrate competition; the encoder and encoder-decoder shapes remain useful in specialized roles.

Closing summary

The full modern transformer block, assembled from the components above, is:

flowchart TD
  In(["$$\text{input}$$"])
  N1["$$\text{RMSNorm}$$"]
  Attn["$$\text{multi-head causal attention with RoPE}$$"]
  Add1(["$$\oplus$$"])
  Int(["$$\text{intermediate}$$"])
  N2["$$\text{RMSNorm}$$"]
  FF["$$\text{SwiGLU feedforward}$$"]
  Add2(["$$\oplus$$"])
  Out(["$$\text{output}$$"])

  In --> N1 --> Attn --> Add1
  In -. residual .-> Add1
  Add1 --> Int --> N2 --> FF --> Add2
  Int -. residual .-> Add2
  Add2 --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef sum fill:#fafaf9,stroke:#1a1a1a,stroke-width:1.5px,color:#1a1a1a
  class In,N1,Attn,Int,N2,FF,Out pill
  class Add1,Add2 sum

stacked $N$ times to produce a depth- $N$ transformer. The architectural recipe - pre-normalization with RMSNorm, multi-head causal attention with GQA and RoPE, SwiGLU feedforward - has stabilized across labs by 2026. Variations within this recipe (number of layers, hidden size, head count, expert count for MoE) shape capacity and cost. The LLM, FM, and Multimodal Models chapters develop the deployment-specific instantiations.

Editorial note. The architectural convergence on a single recipe is unusually strong for a research field. Whether this reflects a genuine attractor in architecture space or an artifact of community focus, deployment economics, and tooling investment is an open question. The convergence is real; its origins are debated. The non-transformer alternatives in §8 are the field’s current attempt to test whether the recipe is uniquely good.

§7. Mixture-of-Experts and Sparse Models

The Transformer block of §6 is dense: every parameter is used on every token. Mixture-of-experts (MoE) is an alternative architectural pattern that uses more total parameters than dense at the same per-token compute cost. MoE has emerged as the dominant frontier architecture by 2026 - Mixtral, DeepSeek-V3, and many others are MoE, and GPT-4-class models are widely reported to be. This section develops the mechanism and the training/inference trade-offs.

The dense vs sparse trade-off

The usual way to increase a dense transformer’s capacity is to widen its layers - bigger $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_O$, $\mathbf{W}_1$, $\mathbf{W}_2$. The result: every parameter is active for every token, and per-token compute scales linearly with parameter count. Doubling the parameters roughly doubles inference cost.

MoE breaks this lock-step. Instead of one big feedforward sub-layer per block, MoE has a bank of feedforward sub-layers (the experts), and a learned router decides which experts process each token. Typically a token is routed to $k = 2$ out of $E = 8, 32, 64,$ or more experts. Only the routed experts compute; the rest sit idle.

The two parameter counts to keep in mind:

Total parameters: parameters across all $E$ experts (plus the rest of the model that stays dense).
Active parameters per token: parameters in the $k$ experts that actually run (plus the dense parts).

The motivation is straightforward: per-token compute and inference cost scale with active parameters, while model capacity scales with total parameters. MoE allows a model to be larger in capacity than in inference cost - a one-trillion-total-parameter MoE with 100B active parameters costs roughly the same per inference as a 100B-parameter dense model, while often behaving like a substantially larger one on capability benchmarks.
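The two counts are simple arithmetic. The sketch below counts feedforward parameters for a single MoE block; the function name and the configuration values are hypothetical, chosen only to illustrate the ratio, and attention and embedding parameters (which stay dense) are left out.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k, n_matrices=3):
    """Feedforward parameter counts for one MoE block (biases ignored)."""
    per_expert = n_matrices * d_model * d_ff    # e.g. W1, W2, W3 of a GLU expert
    router = n_experts * d_model                # W_router maps d_model -> E scores
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

# Hypothetical configuration, for illustration only.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
ratio = total / active
print(total, active, round(ratio, 2))
```

With $k = 2$ of $E = 8$ experts active, the block holds close to four times as many feedforward parameters as it uses per token; the router's own parameters are negligible by comparison.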

The routing mechanism

The standard pattern: replace the feedforward sub-layer of some or all transformer blocks with a routed-expert sub-layer:

flowchart TD
  X(["$$\text{Token vector } \mathbf{x}_t$$"])
  R["$$\text{Router: scores } \mathbf{s} = \mathbf{W}_\text{router}\, \mathbf{x}_t \in \mathbb{R}^E$$"]
  Top["$$\text{top-}k\text{; softmax } \to w_1, \ldots, w_k$$"]

  Ei["$$\text{Expert FF}_i$$"]
  Ej["$$\text{Expert FF}_j$$"]
  Ed["…"]
  El["$$\text{Expert FF}_l$$"]

  Yi["$$y_i = \mathrm{FF}_i(\mathbf{x}_t)$$"]
  Yj["$$y_j$$"]
  Yd["…"]
  Yl["$$y_l$$"]

  Sum["$$\text{weighted sum } \sum_e w_e \cdot y_e$$"]
  Out(["$$\text{output for } \mathbf{x}_t$$"])

  X --> R --> Top
  Top --> Ei
  Top --> Ej
  Top --> Ed
  Top --> El

  Ei --> Yi
  Ej --> Yj
  Ed --> Yd
  El --> Yl

  Yi --> Sum
  Yj --> Sum
  Yd --> Sum
  Yl --> Sum
  Sum --> Out

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class X,R,Top,Sum,Out,Ei,Ej,El,Yi,Yj,Yl pill
  class Ed,Yd dim

(Of the $E$ experts available, only $k$ - typically 2 - run per token; the rest sit idle.)

Concretely:

Score. A small router computes $\mathbf{s} = \mathbf{W}_{\text{router}} \mathbf{x}_t \in \mathbb{R}^E$ .
Select. Keep the top- $k$ scores; mask out the rest.
Weight. Apply softmax over the kept scores to produce routing weights $w_1, \ldots, w_k$ summing to 1.
Run. Compute $y_e = \mathrm{FF}_e(\mathbf{x}_t)$ for each chosen expert $e$ .
Combine. Output is $\sum_e w_e \cdot y_e$ .
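The five steps above can be sketched for a single token as follows. This is a minimal illustration, not a production router: the experts are small ReLU MLPs built by a hypothetical `make_expert` helper, all sizes are toy values, and there is no batching or capacity limit.

```python
import numpy as np

def moe_layer(x, W_router, experts, k=2):
    """Route one token vector x through its top-k experts and combine the outputs."""
    scores = W_router @ x                           # Score: one score per expert
    top = np.argsort(scores)[-k:][::-1]             # Select: indices of the k largest
    kept = scores[top]
    w = np.exp(kept - kept.max())                   # Weight: softmax over kept scores
    w = w / w.sum()
    outputs = [experts[e](x) for e in top]          # Run: only the chosen experts
    y = sum(w_e * y_e for w_e, y_e in zip(w, outputs))   # Combine: weighted sum
    return y, top, w

rng = np.random.default_rng(0)
d_model, d_ff, E = 16, 32, 8

def make_expert():
    """Build one small ReLU MLP expert with its own weights."""
    W1 = rng.normal(size=(d_ff, d_model)) * 0.1
    W2 = rng.normal(size=(d_model, d_ff)) * 0.1
    return lambda v: W2 @ np.maximum(W1 @ v, 0.0)

experts = [make_expert() for _ in range(E)]
W_router = rng.normal(size=(E, d_model))
x = rng.normal(size=d_model)

y, chosen, weights = moe_layer(x, W_router, experts)
print(chosen, np.round(weights, 3), y.shape)
```

Only `k` of the `E` expert functions are ever called for this token, which is where the compute saving comes from; in a batched implementation the tokens are first grouped by chosen expert so that each expert runs once per batch.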

Three routing variants matter:

Top- $k$ routing (the standard). Each token chooses the top- $k$ experts.
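Expert-choice routing (Zhou et al., 2022). Each expert chooses its top-scoring tokens, up to a fixed capacity of roughly $n/E$ (where $n$ is the number of tokens in the batch and $E$ is the expert count, usually scaled by a capacity factor). The roles are flipped: every expert gets a guaranteed, equal workload, but a given token may be picked by more than $k$ experts, by fewer, or by none.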
Soft routing (every token uses every expert, weighted). Mostly historical; the sparsity is what gives MoE its compute advantage.

Load balancing

A pathological failure mode: nothing in the basic router prevents the model from learning to route all tokens to the same one or two experts. The “unused” experts would then never receive gradient signal, and the model would effectively become a dense model with the rest of the experts as wasted parameters.

The standard fix: an auxiliary loss added to the main training loss, penalizing routing imbalance. A common form:

\mathcal{L}_{\text{aux}} \;=\; \alpha \cdot E \cdot \sum_e f_e \cdot P_e,

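where $f_e$ is the fraction of tokens routed to expert $e$ in the batch, $P_e$ is the average routing probability assigned to expert $e$ across tokens, and $\alpha$ is a hyperparameter (typically 0.01–0.1). Because the router sends more tokens to the experts it scores highly, $f_e$ and $P_e$ rise and fall together, and the sum $\sum_e f_e \cdot P_e$ is smallest when both are spread evenly across experts (each $1/E$), where $\mathcal{L}_{\text{aux}} = \alpha$. Only $P_e$ is differentiable ($f_e$ comes from a hard top-$k$ selection), so the gradient reaches the router through $P_e$. Adding this term to the main loss steers the router toward balanced expert utilization.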
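The auxiliary loss can be evaluated on two extreme cases to see what it rewards. A minimal sketch assuming top-1 routing (each token counted toward its single highest-probability expert) and hand-built router logits; the function name and the logit scale of 10 are illustrative.

```python
import numpy as np

def load_balance_loss(router_logits, alpha=0.01):
    """alpha * E * sum_e f_e * P_e, with top-1 routing over a batch of tokens."""
    n_tokens, E = router_logits.shape
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    assigned = probs.argmax(axis=1)
    f = np.bincount(assigned, minlength=E) / n_tokens   # fraction of tokens per expert
    P = probs.mean(axis=0)                              # mean routing probability
    return alpha * E * np.sum(f * P)

E, n_tokens = 4, 8
# Balanced: tokens cycle through the experts, each strongly preferring one.
balanced = 10.0 * np.eye(E)[np.arange(n_tokens) % E]
# Collapsed: every token strongly prefers expert 0.
collapsed = 10.0 * np.eye(E)[np.zeros(n_tokens, dtype=int)]

loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

In the balanced case both $f_e$ and $P_e$ equal $1/E$ and the loss equals $\alpha$; in the collapsed case it approaches $\alpha \cdot E$, so the penalty for collapse grows with the number of experts.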

Other approaches include expert-choice routing (which makes the balance automatic by construction) and soft constraints during training that gradually tighten. Production MoE training typically combines several.

Training and inference costs

The cost picture has several layers worth understanding because MoE complicates the parameter-count/compute-cost relationship that dense models had:

Per-token forward compute. Same as a dense model with the same active parameter count. The $k$ chosen experts run; the rest do not.

Memory bandwidth. Higher than dense, because all expert weights still need to be loaded somewhere - typically all $E$ experts live in GPU memory across the cluster, and routing during inference involves communication to send tokens to the GPU(s) holding their assigned experts (expert parallelism). The communication pattern is irregular, which complicates efficient serving.

Training memory. Holds all expert weights plus all optimizer state (e.g., AdamW moments) for all experts. This is the dominant memory cost in MoE training. Practical MoE training requires careful parallelism design - typically a combination of expert parallelism, data parallelism, and tensor parallelism.

Activation memory. During training, each token’s activations must be kept around for backpropagation through the routed expert. Recomputation (gradient checkpointing) trades extra forward-pass compute for reduced activation memory.

The result: MoE moves cost from “every-token compute” to “every-token memory bandwidth and communication”. For inference workloads with many small batches, this trade-off is favourable. For inference with very large batches, dense can win.

Why MoE became dominant at the frontier

A simple counting argument. Dense models scale per-token compute linearly with parameter count; MoE decouples capacity (total parameters) from compute (active parameters). At frontier scale where the bottleneck is capability per inference dollar, MoE wins by giving more capacity for the same inference cost. The trade-offs (more memory, more communication, harder fine-tuning) are real but tractable for organizations with the engineering capacity to handle them.
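
The counting argument can be made explicit with a few lines of arithmetic. The dimensions below are hypothetical and chosen only for illustration; the sketch counts feed-forward and router parameters, assumes a two-matrix feed-forward block without biases, and ignores attention and embedding parameters.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k, num_layers):
    """Return (total, active-per-token) feed-forward parameter counts."""
    per_expert = 2 * d_model * d_ff        # up-projection + down-projection
    router = d_model * num_experts         # W_router
    total = num_layers * (num_experts * per_expert + router)
    active = num_layers * (top_k * per_expert + router)
    return total, active

# Hypothetical configuration, not taken from any specific model card.
total, active = moe_ffn_params(d_model=4096, d_ff=14336,
                               num_experts=8, top_k=2, num_layers=32)
print(f"total  = {total / 1e9:.1f}B feed-forward parameters")
print(f"active = {active / 1e9:.1f}B per token")
print(f"ratio  = {total / active:.2f}")
```

Capacity grows with `num_experts` while per-token compute grows only with `top_k`, which is the decoupling the paragraph above describes.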

By 2026 the frontier picture is: most newly-trained frontier models are MoE; open-weights MoE releases (Mixtral 8×7B, Mixtral 8×22B, DeepSeek-V3) have demonstrated that the pattern works at scale; the practical question is no longer whether to use MoE but how - what expert count, what top- $k$ , what auxiliary-loss weight, how to deploy.

Open versus closed MoE practice

What is public in 2026 about MoE training recipes:

Mixtral and DeepSeek model cards and technical reports document expert counts, top- $k$ , auxiliary-loss form, and routing variants.
Switch Transformer (Fedus et al., 2022) was an early important public reference for sparse expert routing.

What remains undocumented at frontier labs (OpenAI, Anthropic, Google DeepMind closed releases):

Whether and how MoE is used (sometimes inferred from inference-cost characteristics; rarely confirmed).
Specific routing recipes, auxiliary-loss weights, and load-balancing schemes.
How fine-tuning and preference-tuning interact with the routing.

This is part of the broader open/closed pattern (LLM §10) - many of the engineering details that matter for reproducing frontier results are closed even when the architectural pattern is widely known.

§8. State-Space Models and Hybrid Architectures

The Transformer’s quadratic-in-sequence-length attention has been the dominant substrate of modern deep learning, but at very long contexts (hundreds of thousands of tokens) the $O(L^2)$ cost becomes painful. State-space models (SSMs) are the most prominent class of sub-quadratic alternatives - a recurrent renaissance, building on classical state-space machinery from control theory but designed to be parallelizable in training and linear in inference cost.

This section develops the SSM mechanism, the relationship to recurrent networks, the Mamba lineage that brought SSMs to competitiveness with Transformers, and the hybrid attention/SSM architectures that have emerged as the most pragmatic compromise.

What a state-space model is

A state-space model in continuous time is a linear dynamical system

\frac{d \mathbf{h}(t)}{dt} \;=\; \mathbf{A} \mathbf{h}(t) + \mathbf{B} \mathbf{x}(t), \qquad \mathbf{y}(t) \;=\; \mathbf{C} \mathbf{h}(t) + \mathbf{D} \mathbf{x}(t),

where $\mathbf{x}(t) \in \mathbb{R}$ is a scalar input over time, $\mathbf{h}(t) \in \mathbb{R}^d$ is a hidden state, $\mathbf{y}(t) \in \mathbb{R}$ is a scalar output, and $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}$ are matrices. The hidden state $\mathbf{h}$ is a continuous-time summary of the input history; the output is a linear function of state and current input. Classical state-space machinery handles many properties of this system - controllability, observability, stability - but for our purposes the relevant operation is discretization: turn the continuous-time system into a discrete sequence-processing model by sampling at discrete time steps.

After discretization, the system becomes a recurrence:

\mathbf{h}_t \;=\; \bar{\mathbf{A}} \mathbf{h}_{t-1} + \bar{\mathbf{B}} \mathbf{x}_t, \qquad \mathbf{y}_t \;=\; \bar{\mathbf{C}} \mathbf{h}_t + \bar{\mathbf{D}} \mathbf{x}_t,

where the bar matrices come from a discretization rule (zero-order hold, bilinear transform, etc.). This is an RNN - but a linear one, with no nonlinearity inside the recurrence and no learned nonlinear gating like LSTM’s.

The crucial property: because the recurrence is linear, it can be unrolled as a convolution:

\mathbf{y}_t \;=\; \sum_{s = 0}^{t} \bar{\mathbf{C}} \bar{\mathbf{A}}^{t-s} \bar{\mathbf{B}} \cdot \mathbf{x}_s + \bar{\mathbf{D}} \mathbf{x}_t,

i.e., $\mathbf{y}$ is the convolution of $\mathbf{x}$ with a kernel determined by $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ . This duality is what makes SSMs both training-parallelizable (via the convolution formulation, computed in $O(L \log L)$ time using FFTs or in $O(L)$ time using specialized parallel-scan algorithms) and inference-efficient (via the recurrence, $O(1)$ per generated token, $O(L)$ over $L$ generated tokens).
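
The duality can be checked numerically. The sketch below assumes a diagonal $\mathbf{A}$ (so the state update is element-wise), a zero-order-hold discretization, and a direct $O(L^2)$ convolution in place of the FFT, purely for readability; all parameter values are arbitrary.

```python
import math

def discretize_zoh(a_diag, b, delta):
    """Zero-order hold for diagonal A: A_bar = exp(delta*a), B_bar = (A_bar - 1)/a * b."""
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, d, xs):
    """h_t = A_bar h_{t-1} + B_bar x_t ;  y_t = C h_t + D x_t, one step at a time."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)) + d * x)
    return ys

def ssm_convolutional(a_bar, b_bar, c, d, xs):
    """y_t = sum_s K[t-s] x_s + D x_t with kernel K[j] = C A_bar^j B_bar."""
    L = len(xs)
    kernel = [sum(ci * ab ** j * bb for ci, ab, bb in zip(c, a_bar, b_bar))
              for j in range(L)]
    return [sum(kernel[t - s] * xs[s] for s in range(t + 1)) + d * xs[t]
            for t in range(L)]

a_bar, b_bar = discretize_zoh([-1.0, -0.5, -0.1], [1.0, 1.0, 1.0], delta=0.1)
c, d = [0.3, -0.2, 0.5], 0.1
xs = [1.0, 0.0, -2.0, 0.5, 3.0]
y_rec = ssm_recurrent(a_bar, b_bar, c, d, xs)
y_conv = ssm_convolutional(a_bar, b_bar, c, d, xs)
print(max(abs(a - b) for a, b in zip(y_rec, y_conv)))
```

The two outputs agree up to floating-point rounding; a practical implementation would compute the same kernel once and apply it with an FFT.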

Two views of the same SSM, identical computation:

Recurrent view (efficient inference, sequential):

flowchart LR
  Hprev(["$$\mathbf{h}_{t-1}$$"])
  AB["$$\bar{\mathbf{A}}, \bar{\mathbf{B}}$$"]
  Ht(["$$\mathbf{h}_t$$"])
  Xt(["$$\mathbf{x}_t$$"])

  Hprev --> AB --> Ht
  Xt --> AB

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Hprev,Ht,Xt,AB pill

Per step: $O(d)$ work for hidden size $d$ when $\bar{\mathbf{A}}$ is diagonal ( $O(d^2)$ for a dense $\bar{\mathbf{A}}$ ); constant in sequence length.

Convolutional view (parallel training, all positions at once):

flowchart LR
  X(["$$\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_L$$"])
  Conv["$$\text{FFT-based convolution}$$"]
  Y(["$$\mathbf{y}_0, \ldots, \mathbf{y}_L$$"])

  X --> Conv --> Y

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class X,Conv,Y pill

$O(L \log L)$ via FFT, or $O(L)$ via a parallel-scan algorithm - across all positions at once, rather than one step $t$ at a time.

This is the technical primitive that makes SSMs competitive with attention.

From classical SSMs to S4

Linear time-invariant SSMs as just described have been understood for decades in classical signal processing and control theory. The deep-learning challenge has been making them useful as sequence-model substrates. Two papers established the field’s modern interest:

HiPPO (Gu et al., 2020), which proposed a specific structured matrix $\mathbf{A}$ (the HiPPO matrix) designed to project an input sequence onto orthogonal polynomial basis functions - a principled way to compress sequence history into a fixed-dimensional state.
S4 (Gu, Goel, and Ré, 2021), “Efficiently Modeling Long Sequences with Structured State Spaces”, which used the HiPPO initialization plus a structured parameterization that made the convolution efficient to compute. S4 was the first SSM-based model to compete on standard sequence-modelling benchmarks against then-state-of-the-art attention models.

S4’s contribution was engineering as much as architecture: it showed that the convolution kernel of a structured SSM could be computed efficiently enough to make the architecture practical at scale.

Mamba: the breakthrough

S4-style SSMs had a structural limitation: they were time-invariant, meaning the matrices $\mathbf{A}, \mathbf{B}, \mathbf{C}$ did not depend on the input. The model could not selectively attend to or ignore specific parts of its input the way attention can. On retrieval-heavy benchmarks (associative recall, language modelling), S4-class models underperformed.

Mamba (Gu and Dao, 2023) introduced input-dependent state-space parameters: $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ depend on the current input $\mathbf{x}_t$ . The recurrence becomes time-varying, breaking the simple convolution formulation but enabling content-aware selectivity (“read this token carefully, skip that one”). To preserve efficient training, Mamba uses a custom hardware-aware parallel-scan algorithm - a primitive that computes the time-varying recurrence efficiently on GPUs through careful memory-hierarchy management.
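
Why a time-varying recurrence can still be parallelized is easiest to see on a scalar toy. Each step $h \mapsto a_t h + b_t$ is an affine map, and composing affine maps is associative, so prefixes can be combined in a tree rather than strictly left to right. The sketch below is an illustration of that idea only, not Mamba's fused GPU kernel; the softplus step size and the scalar state are simplifying assumptions.

```python
import math

def selective_steps(xs, w_delta=1.0):
    """Input-dependent (a_t, b_t): a simplified scalar stand-in for a selective SSM."""
    steps = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))    # softplus keeps delta > 0
        steps.append((math.exp(-delta), delta * x))  # both depend on x_t
    return steps

def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def tree_scan(steps):
    """Inclusive prefix scan by halving; the two recursive calls are independent."""
    if len(steps) == 1:
        return [steps[0]]
    mid = len(steps) // 2
    left = tree_scan(steps[:mid])
    right = tree_scan(steps[mid:])
    carry = left[-1]  # composition of every step in the left half
    return left + [combine(carry, r) for r in right]

xs = [0.5, -1.0, 2.0, 0.0, 1.5, -0.3, 0.7]
steps = selective_steps(xs)
h_seq = sequential_scan(steps)
h_par = [b for _, b in tree_scan(steps)]  # with h_{-1} = 0, the state is the offset
print(max(abs(s - p) for s, p in zip(h_seq, h_par)))
```

Because the two halves of `tree_scan` do not depend on each other, they could run concurrently; hardware-aware implementations exploit the same associativity with far more care about memory traffic.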

The result: an architecture that achieves attention-like content-dependent processing while keeping the SSM’s linear-time inference cost. Mamba was competitive with Transformers on standard language-modelling benchmarks at small-to-mid scales, and triggered a wave of follow-up work.

The lineage continued with Mamba-2 (Dao and Gu, 2024), which reformulated selective SSMs in terms of a connection to attention itself (State-Space Duality), unifying parts of the two architectural families and producing further efficiency gains.

RWKV: the other branch

RWKV (Receptance-Weighted Key-Value; Peng et al., 2023) is a different sub-quadratic architecture, rooted in a recurrent reformulation of attention. RWKV achieves sub-quadratic inference cost through a different mathematical mechanism than SSMs - a careful exponential-decay recurrence over key-value pairs - but occupies a similar niche: efficient long-sequence modelling. The RWKV ecosystem produced open-weights models that have been used in production.

Hybrid attention / SSM architectures

The most pragmatic 2026 design pattern is not “pure SSM” or “pure attention” but hybrid stacks that interleave the two. The motivation is empirical:

Attention layers excel at retrieval - looking back to find a specific earlier token and reading it.
SSM layers excel at summarization and long-range integration - maintaining a low-cost summary of long context.
Combining them in the same model gives both.

Jamba (AI21 Labs, 2024) is the most prominent hybrid: a stack of mostly SSM blocks (Mamba-style) with attention blocks interspersed at a fixed cadence (one attention layer in every eight layers, a 1:7 attention-to-Mamba ratio). The attention layers provide explicit retrieval capacity; the SSM layers handle the bulk of the computation cheaply. The total computational cost is dramatically lower than a pure-attention model at long context, while retrieval-heavy benchmark performance approaches pure-attention levels.

By 2026 hybrid architectures of various designs (Jamba, Zamba, hybrid versions of recent frontier models) are an active and competitive research direction.

Trade-offs against attention

To put the SSM vs attention picture in one comparison:

	Attention	SSM (Mamba-class)
Training-time cost	$O(L^2)$ per layer	$O(L)$ or $O(L \log L)$ per layer
Inference per token (autoregressive)	$O(L)$ for the position being computed; $O(L^2)$ over a sequence of length $L$	$O(1)$ for the position being computed; $O(L)$ over a sequence of length $L$
Retrieval (look back at arbitrary earlier token)	Direct, via the attention pattern	Indirect, via the hidden state’s summary of history
Long-range integration	Strong but expensive	Strong; the hidden state is designed for it
Recipe maturity (2026)	Very mature (under continuous refinement since 2017)	Maturing; the modern SSM recipe stabilized around 2023–2024
Frontier scale (2026)	Dominant	Present but not dominant; hybrid architectures closing the gap

Status as of 2026

The transformer’s substrate dominance is real but no longer unchallenged. Mamba and successors are competitive at small-to-mid scale; the frontier scale story is partial - SSMs have been verified up to roughly the 10B-parameter range as standalone architectures, and hybrid SSM/attention models scale further. Whether sub-quadratic architectures will displace attention at the frontier in the next several years remains open.

The empirical evidence supports two reads:

Pure SSMs lag pure transformers on retrieval-heavy tasks at scale; the gap exists.
Hybrid SSM/attention architectures match transformers on retrieval at substantially lower cost; this is the practical sweet spot for long-context deployment.

The architectural-convergence-as-open-question editorial note in §6 applies here too. Whether the next decade of deep learning will look like more transformers or more hybrids - or whether something else entirely will emerge - is the kind of question that this chapter cannot resolve.

§9. Training Dynamics

§3 introduced backpropagation as the mechanism for computing gradients. This section develops what to do with those gradients: how to choose optimizers, learning-rate schedules, batch sizes, precision formats, and regularization, and what is known empirically about the loss landscapes these procedures traverse. The choices in this section determine whether a training run that can converge actually does converge, and how efficiently.

We treat the recipes that have stabilized across labs by 2026. Most of these are empirical - they work and have been verified across many groups and scales, but they do not have first-principles derivations. We flag this throughout.

Optimizers

Gradient descent in its simplest form, stochastic gradient descent (SGD):

\theta_{t+1} \;=\; \theta_t - \eta \cdot \nabla_\theta \mathcal{L}_t,

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}_t$ is the gradient on the current minibatch. SGD is simple and well-understood but slow to converge on the non-convex loss landscapes of deep networks. Modern training uses one of a small number of more sophisticated optimizers.

SGD with momentum. Adds an exponentially-weighted moving average of past gradients:

\mathbf{m}_{t+1} = \beta \mathbf{m}_t + \nabla_\theta \mathcal{L}_t, \qquad \theta_{t+1} = \theta_t - \eta \mathbf{m}_{t+1},

with $\beta \approx 0.9$ . The momentum term smooths the noise of per-batch gradients and accelerates progress in consistent directions. SGD with momentum was the default optimizer for the CNN era (2012–~2018).

Adam (Kingma and Ba, 2014). Adaptive per-parameter learning rates, computed from first and second moments of the gradient. The update has the form

\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}_t,

\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) (\nabla_\theta \mathcal{L}_t)^2,

\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon},

where $\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)$ and $\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)$ are bias-corrected versions of $\mathbf{m}_t, \mathbf{v}_t$ (compensating for their initialization at zero) and the square and square-root operations are element-wise. Typical $\beta_1 = 0.9$ , $\beta_2 = 0.999$ . The intuition: each parameter gets a learning rate scaled inversely by an estimate of its gradient magnitude - parameters with large recent gradients move less aggressively, parameters with small recent gradients move more aggressively.

AdamW (Loshchilov and Hutter, 2017). Adam with decoupled weight decay: the L2-style regularization is applied directly to the parameters (subtracting $\eta \lambda \theta_t$ at each step) rather than added into the gradient before the Adam updates. This separation matters at scale - Adam-with-L2 in the gradient confounds the regularization signal with the adaptive scaling, while AdamW keeps them separate. AdamW is the dominant optimizer for transformer training in 2026.

Lion (Chen et al., 2023), Sophia (Liu et al., 2023), and others. Newer optimizers seeking to match or beat AdamW’s convergence at lower cost. Adam tracks two moments per parameter, doubling the optimizer-state memory relative to SGD with momentum (plain SGD keeps no optimizer state at all); Lion, for example, keeps only a single momentum buffer. These have seen some use in large training runs; AdamW remains the most common default.

Practical rule of thumb in 2026. AdamW with $\beta_1 = 0.9$ , $\beta_2 = 0.95$ (transformer training often uses 0.95 rather than the original 0.999), $\epsilon = 10^{-8}$ , weight decay around $0.1$ , is the starting recipe for new training runs. Deviations are justified empirically.
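
The AdamW update, with the rule-of-thumb defaults above, can be written out for a flat list of parameters. This is a minimal sketch for reading alongside the equations, not a replacement for a framework optimizer; the function name `adamw_step` and the toy quadratic objective are assumptions of the example.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first moment
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        p = p - lr * weight_decay * p            # decoupled decay: acts on p, not on g
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Toy objective 0.5 * sum(theta^2), whose gradient is theta itself.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
grad = list(theta)
theta, m, v = adamw_step(theta, grad, m, v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so both parameters move by roughly `lr` regardless of gradient magnitude, plus the small decay term.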

Learning-rate schedules

The learning rate $\eta$ is the most-tuned hyperparameter in deep learning. Modern training does not use a constant $\eta$ ; it uses a schedule - $\eta(t)$ as a function of training step.

The canonical 2026 schedule has three phases:

Figure: $\eta(t)$ under warmup then cosine decay - linear warmup over the first 10% of training, then cosine decay to $\eta_{\min} = 0.05 \cdot \eta_{\max}$ .

Warmup. $\eta$ grows linearly from a small initial value (often 0) to the peak learning rate $\eta_{\max}$ over the first $T_{\text{warmup}}$ steps. Typical $T_{\text{warmup}}$ : 1–10% of total training. Empirically, transformer training is unstable without warmup at the start; the optimizer’s moment estimates need a few steps to stabilize before the learning rate is large.
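Decay. $\eta$ decreases over the remaining $T_{\text{total}} - T_{\text{warmup}}$ steps. The dominant choice is cosine decay: $\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\pi \cdot \frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}}\right)\right)$ for $t \ge T_{\text{warmup}}$ , starting at $\eta_{\max}$ when warmup ends and decreasing smoothly to $\eta_{\min}$ at the end.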
(Optional) Final fine-tune. Some recipes hold $\eta = \eta_{\min}$ constant for a final phase rather than decaying through it.
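
The warmup and decay phases can be combined into one function of the step count. The sketch measures the cosine phase from the end of warmup so that the schedule is continuous at the peak; the step counts and peak learning rate are hypothetical values chosen for the example.

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warmup from 0 to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

TOTAL, WARMUP = 1000, 100            # warmup is 10% of training
LR_MAX = 3e-4                        # hypothetical peak learning rate
LR_MIN = 0.05 * LR_MAX

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, TOTAL, WARMUP, LR_MAX, LR_MIN))
```

The learning rate reaches `LR_MAX` exactly at the end of warmup, passes the midpoint between `LR_MAX` and `LR_MIN` halfway through the decay phase, and ends at `LR_MIN`.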

Variants include linear decay (simpler, sometimes competitive), WSD (“warmup, stable, decay” - hold at peak for most of training, then decay at the end), and trapezoidal schedules. Empirical comparisons between these have produced mixed conclusions; the practical answer in 2026 is cosine decay unless you have specific reason to deviate.

Batch size, gradient accumulation, gradient clipping

Batch size. The number of training examples whose gradients are averaged for each parameter update. Larger batch sizes give lower-variance gradients (so each update is more reliable) but require more memory and more parallel hardware. The relationship between batch size and learning rate is non-trivial - the “linear scaling rule” (Goyal et al., 2017) says that for many architectures, batch size and learning rate can be scaled jointly (double the batch, double the learning rate) without losing performance, up to some break-down point.

Gradient accumulation. When a single batch does not fit in GPU memory, gradients can be accumulated across multiple smaller “micro-batches” before applying an update. The accumulation is mathematically equivalent to one larger batch (with floating-point caveats); the practical effect is to decouple the logical batch size from the per-step memory footprint.

Gradient clipping. Cap the norm of the gradient vector at some maximum $g_{\max}$ :

\nabla_\theta \mathcal{L} \;\leftarrow\; \nabla_\theta \mathcal{L} \cdot \min\!\left(1, \, \frac{g_{\max}}{\|\nabla_\theta \mathcal{L}\|}\right).

Prevents rare outlier batches (e.g., a single training example whose loss is anomalously large) from corrupting the optimizer state with a runaway gradient. Standard in transformer training; typical $g_{\max} = 1.0$ .
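
Both gradient accumulation and clipping are short enough to check by hand. The sketch below uses a one-parameter least-squares model with made-up data; it assumes equal-sized micro-batches (otherwise the micro-batch gradients need weighting by size), and the helper names are illustrative.

```python
import math

def grad_mse(w, batch):
    """d/dw of the mean over (x, y) of 0.5 * (w*x - y)^2."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0),
        (5.0, 9.0), (6.0, 12.5), (7.0, 13.0), (8.0, 16.5)]
w = 0.5

# One large batch versus four accumulated micro-batches of two examples each.
full = grad_mse(w, data)
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accum = 0.0
for mb in micro_batches:
    accum += grad_mse(w, mb) / len(micro_batches)
print(full, accum)

# Clipping leaves small gradients alone and rescales large ones.
clipped_big, norm_big = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(clipped_big, clipped_small)
```

Clipping changes the length of the gradient but not its direction, which is why it guards against outlier batches without otherwise altering the update.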

Mixed precision

Modern training does not use 32-bit floating-point everywhere. The dominant practice is mixed precision: store some quantities in low precision (16-bit or 8-bit floats) to save memory and bandwidth, while keeping numerically sensitive quantities (the optimizer’s moment estimates, the loss accumulation) in higher precision.

Three formats matter in 2026:

fp16 (half precision). 5-bit exponent, 10-bit mantissa. Range is narrow (magnitudes above ~65,000 overflow; magnitudes below ~ $6 \times 10^{-8}$ underflow to zero), requiring loss scaling - multiply the loss by a large constant before backpropagation to bring gradients into the representable range, then divide the gradients by the same constant before the optimizer step.
bf16 (Bfloat16). 8-bit exponent (the same as fp32), 7-bit mantissa. Same range as fp32 (no loss-scaling needed) but with reduced precision. The default for training at frontier scale on hardware that supports it (NVIDIA A100/H100, Google TPUs).
fp8. 8-bit floats with two main formats (E4M3 and E5M2; different exponent/mantissa splits). Even smaller memory footprint; used for inference and increasingly for training on hardware with native fp8 support.
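
The range difference between fp16 and bf16, and the effect of loss scaling, can be reproduced with Python's standard `struct` module, which supports IEEE half precision through the `e` format. The bf16 helper below is an approximation that truncates a float32 to its top 16 bits (real hardware typically rounds), and the scale factor 65536 is an arbitrary illustrative choice.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fits_fp16(x):
    """struct raises OverflowError when x is too large for half precision."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False

def to_bf16(x):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

tiny_grad = 1e-8
print(to_fp16(tiny_grad))          # underflows to zero in fp16
print(to_bf16(tiny_grad))          # survives in bf16, with coarse precision

# Loss scaling: scale up before the cast, scale down in higher precision after.
scale = 65536.0
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)

print(fits_fp16(65504.0), fits_fp16(70000.0))
print(to_bf16(70000.0))            # in range for bf16, but only ~3 significant digits
```

The same tiny gradient is lost in fp16, recovered once scaled, and representable without scaling in bf16, at the cost of a shorter mantissa.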

The general pattern: forward and backward passes run in low precision; the optimizer state (a master copy of the parameters, plus the moments) stays in higher precision (typically fp32, though some recipes keep the moments in bf16); a single training step casts between formats at the right points.

Regularization at scale

In the smaller-model era (pre-2020), regularization techniques like dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016) were essential to prevent overfitting on limited data. At frontier scale, the regime is different: training data is in the billions to trillions of tokens (or images), models have billions to trillions of parameters, and the central problem is capacity utilization rather than overfitting.

The empirical consequence is that dropout’s role has declined at scale. Many frontier LLMs use little or no dropout in their transformer layers. The regularization that remains:

Weight decay (via AdamW). The dominant regularization mechanism; small enough to barely affect the loss visibly but large enough to keep parameter magnitudes from drifting.
Data augmentation. In vision and audio, transformations that produce new training examples from existing ones (random crops, flips, colour jitter, time stretching). At LLM scale, less directly used; data quality and mixture composition (LLM §5.1, FM §5) play an analogous role.
Label smoothing. Replace one-hot training targets with slightly softened versions; small but consistent gain in some settings.
Stochastic depth in very deep networks. Randomly skip layers during training. Used in some vision transformers and modern CNNs.

The empirical pattern is that most explicit regularization mechanisms developed for the smaller-model era are weaker or unnecessary at scale; the implicit regularization of large-scale stochastic gradient descent on overparameterized networks does much of the work. This is part of the broader generalization puzzle (Theoretical Foundations of Learning chapter).

Loss landscape intuitions

Despite the empirical success of training procedures, the loss landscape of a deep network is high-dimensional, non-convex, and only partially understood. Several empirical findings shape current intuitions:

Mode connectivity (Garipov et al., 2018). Two SGD-trained solutions of the same network are typically connected by a low-loss path in parameter space - they do not sit in separate “basins” but on a connected manifold of approximately-equivalent solutions. The loss landscape is more like a high-dimensional plateau with structured variation than like an isolated set of valleys.
Flat vs sharp minima (Keskar et al., 2017; Hochreiter and Schmidhuber, 1997). Solutions in “flatter” regions of parameter space (small Hessian eigenvalues) tend to generalize better than solutions in “sharper” regions, empirically. The mechanistic explanation is contested; the empirical correlation is widely observed.
The lottery ticket hypothesis (Frankle and Carbin, 2018). Within a randomly-initialized network, there exists a sparse subnetwork that, trained in isolation from its original initial weights, reaches performance comparable to the full network. On this reading, training finds this subnetwork (which pruning can then expose) rather than creating something genuinely new from random initialization.

These findings inform intuition but do not by themselves prescribe training recipes. The Theoretical Foundations of Learning chapter develops the generalization-theoretic side; this section’s takeaway is that current training procedures work in spite of, not because of, our limited theoretical understanding.

What’s left to say

§10 develops the long-context and efficient-attention story - what changes when the architectures of §6 face sequence lengths in the hundreds of thousands. The other architecture-specific sections (§4 on CNNs, §5 on RNNs, §7 on MoE, §8 on SSMs) each have their own training-dynamics quirks that we treat in those sections rather than here. The Efficient and Scaled Training chapter develops the distributed-systems story - how training is parallelized across thousands of GPUs.

§10. Efficient Attention and Long Context

§6 developed standard attention with the quadratic-in-length cost. This section develops the techniques that have reduced or worked around that cost - at the architectural level (FlashAttention, sparse-attention patterns), at the inference-time level (KV-cache management, paged attention), and at the long-context level (positional extrapolation, retrieval-as-substitute). Some of this material recurs in LLM §6 and §9 in deployment-focused form; the DL chapter develops the architectural-mechanical version.

The quadratic-cost problem

Attention’s computational structure is, as a reminder:

Compute cost: $O(L^2 \cdot d_{\text{model}})$ per layer, split between forming the $L \times L$ score matrix $\mathbf{Q} \mathbf{K}^\top$ and the softmax-weighted aggregation of $\mathbf{V}$ ; the softmax itself adds $O(L^2)$ per head.
Memory cost: $O(L^2)$ for storing the attention matrix in standard implementations.

For $L$ in the thousands this is acceptable; for $L$ in the hundreds of thousands and beyond (up to the 1M+ token contexts of mid-2020s frontier models), the cost dominates everything else. Several lines of work attack this problem from different directions.

Memory-efficient attention

The first observation: the $L \times L$ attention matrix does not need to exist in memory. The output of attention at each position depends only on a single row of the attention matrix at a time. FlashAttention (Dao et al., 2022, 2023) exploits this by computing attention in a blocked, fused manner that never materializes the full attention matrix.

Standard attention - peak memory $O(T^2)$ (writing $T$ for the sequence length, $L$ above); the full $T \times T$ matrix is materialised:

flowchart LR
  S["$$\mathbf{S} = \mathbf{Q}\mathbf{K}^\top / \sqrt{d_{\text{head}}} \;\; (T \times T)$$"]
  P["$$\mathbf{P} = \text{softmax}(\mathbf{S}) \;\; (T \times T)$$"]
  O["$$\mathbf{O} = \mathbf{P}\mathbf{V} \;\; (T \times d)$$"]
  S --> P --> O
  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class S,P,O pill

FlashAttention - peak memory $O(T \cdot d)$ , the full attention matrix is never materialised:

flowchart TD
  Tile["$$\text{Tile } \mathbf{Q}, \mathbf{K}, \mathbf{V} \text{ into blocks } B_q\!\times\!d, B_k\!\times\!d$$"]
  SubS["$$\text{For each } \mathbf{Q}\text{-block and each } \mathbf{K}\text{-block: compute the } B_q\!\times\!B_k \text{ sub-}\mathbf{S}$$"]
  OnlineSM["$$\text{Online-softmax: update running max and running sum}$$"]
  Accum["$$\text{Accumulate weighted } \mathbf{V} \text{ into the running output}$$"]
  Assemble["$$\mathbf{O} \text{ assembled from running stats}$$"]

  Tile --> SubS --> OnlineSM --> Accum --> Assemble

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Tile,SubS,OnlineSM,Accum,Assemble pill

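The key technical mechanism is the online softmax algorithm - a way of computing the softmax of a long sequence by processing it in blocks, maintaining a running maximum (for numerical stability) and a running denominator, and rescaling the partial results whenever the running maximum changes. The result is mathematically exact (it is not an approximation of attention), although the different order of floating-point operations means it is not guaranteed to be bit-identical to the naïve implementation; the memory savings come entirely from the access pattern.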
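
A single-query version of the online softmax shows the bookkeeping. The sketch processes keys and values in blocks of two and compares against a naïve softmax-then-sum; it illustrates the algorithmic idea only, with none of the tiling, fusion, or memory-hierarchy handling that makes the real kernels fast. All inputs are arbitrary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_naive(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def attention_online(q, keys, values, block=2):
    scale = 1.0 / math.sqrt(len(q))
    run_max, run_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) * scale for k in keys[start:start + block]]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the old running maximum.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]

q = [0.1, -0.4, 0.7]
keys = [[0.2, 0.1, -0.3], [1.5, -0.7, 0.9], [-0.4, 0.8, 0.0],
        [0.6, 0.6, 2.0], [-1.1, 0.3, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-3.0, 4.0]]
out_naive = attention_naive(q, keys, values)
out_online = attention_online(q, keys, values)
print(out_naive, out_online)
```

Only one block of scores exists at a time, yet the final output matches the naïve result up to floating-point rounding.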

FlashAttention’s practical impact has been substantial. It, or an equivalent fused kernel, is now the default attention path in essentially every modern transformer training stack (PyTorch, JAX, and Triton-based kernels). FlashAttention-2 (Dao, 2023) improved work partitioning and parallelism across the sequence dimension; FlashAttention-3 (Shah et al., 2024) added hardware-specific optimizations for NVIDIA Hopper GPUs (H100). The general lesson: GPU-aware implementations of mathematically identical operations can produce severalfold wall-clock speedups and, at long sequence lengths, order-of-magnitude memory savings.

KV-cache management

During autoregressive generation (the LLM use case), each new token’s attention computation depends on the keys and values of all preceding tokens. Naively, this means recomputing all keys and values from scratch at every generation step - quadratic in generated tokens. The KV cache stores already-computed keys and values for the prompt and prior generated tokens, so each new step is linear in current sequence length rather than quadratic.

The cache structure (a per-request data structure):

   For a generation request:
     prompt of length P,
     generating tokens one at a time:

   After prompt processing:
     KV cache for each layer holds:
       K of shape (P, H, d_head)    keys at every prompt position
       V of shape (P, H, d_head)    values at every prompt position

   For each new generated token g_1, g_2, ...:
     compute K_new, V_new for the new token only      (cheap)
     append to the cache                              (cheap)
     attention computation: Q_new attends to the
       full cache (P + |generated so far|)            (linear in cache size)

The KV cache turns generation from $O(L^2)$ per token into $O(L)$ per token. The trade-off is memory: the cache size grows linearly with both context length and number of layers, and for long-context models can easily reach tens of gigabytes per active inference request.
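
The memory claim follows from a one-line formula: two tensors (keys and values) per layer, each with one vector per KV head per position. The configuration below is hypothetical and chosen only to show the order of magnitude; 2 bytes per value assumes a 16-bit cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Per-request KV-cache size; the leading 2 covers keys plus values."""
    return 2 * num_layers * num_kv_heads * d_head * seq_len * bytes_per_value

GIB = 1024 ** 3
# Hypothetical model: 32 layers, head dimension 128, 128k-token context.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, d_head=128, seq_len=128_000)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, d_head=128, seq_len=128_000)
print(f"32 KV heads: {full_mha / GIB:.1f} GiB per request")
print(f" 8 KV heads: {gqa / GIB:.1f} GiB per request")
```

Reducing the number of KV heads, as grouped-query attention does, shrinks the cache proportionally without changing the number of query heads.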

Paged attention and serving optimization

The naive KV-cache layout allocates a contiguous block of memory sized for the maximum context length, even when most requests use far less. Paged attention (Kwon et al., 2023, “vLLM”) borrows the virtual-memory paging idea from operating systems: split the KV cache into fixed-size pages, allocate pages on demand, and maintain a per-request block table mapping logical positions to physical pages. The result: dramatic improvement in memory utilization (from typically 20–40% with contiguous allocation to near 95% with paging), allowing many more concurrent requests on the same hardware.
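
The block-table idea can be sketched as a small allocator. This is a toy model of the bookkeeping only, with hypothetical class and method names; it does not reflect vLLM's actual data structures or API, and it tracks page ownership without storing any tensors.

```python
class PagedKVCache:
    """Toy page allocator: maps (request, logical position) to (page, offset)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> list of physical page ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.page_size:   # current pages are full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())     # allocate on demand
        self.lengths[request_id] = length + 1
        return self.locate(request_id, length)

    def locate(self, request_id, position):
        page = self.block_tables[request_id][position // self.page_size]
        return page, position % self.page_size

    def release(self, request_id):
        self.free_pages.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]

cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(20):
    cache.append_token("a")      # 20 tokens -> 2 pages
for _ in range(5):
    cache.append_token("b")      # 5 tokens -> 1 page
print(cache.block_tables, cache.free_pages)
```

Each request wastes at most one partially filled page, instead of a contiguous reservation sized for the maximum context length.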

The LLM chapter §6 develops paged attention in deployment detail with full diagrams; we record the architectural mechanism here.

Sparse attention patterns

A different attack on the $L^2$ problem: don’t let every token attend to every other token. Sparse attention restricts the attention pattern to a structured subset of position pairs:

Sliding-window attention. Token $t$ attends only to tokens in a window of size $W$ around it. Cost: $O(L \cdot W)$ per layer instead of $O(L^2)$ .
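Dilated attention. The attended positions are spread out with gaps rather than packed contiguously. Longformer (Beltagy et al., 2020) uses a sliding window with a fixed dilation, which widens the receptive field at the same $O(L \cdot W)$ cost; log-sparse variants attend to positions at exponential offsets $t-1, t-2, t-4, t-8, \ldots$ , for $O(L \log L)$ cost.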
Global-plus-local attention. A small number of “global” tokens attend everywhere and are attended to by everyone; the rest use sliding-window or sparse patterns. Used for inputs where a few positions need full visibility (e.g., the [CLS] token in BERT, the first few tokens of a long context).
Ring attention (Liu et al., 2023). A distributed pattern for training and inference: partition the context across GPUs so that each GPU holds one block of queries, keys, and values; key-value blocks are passed around a ring so that every query block eventually sees every key-value block, with communication overlapped with computation. Not sparse (the attention is exact) but enables long contexts that do not fit on a single GPU.

The shared property: a structured restriction of the attention pattern that exposes locality and parallelism. The cost: representational capacity. A position cannot directly attend to positions outside its sparsity pattern - long-range dependencies must propagate through intermediate positions across multiple layers.
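
The cost saving and the capacity cost of a sliding window can both be counted directly. The sketch assumes a causal window (each position sees itself and the $W-1$ positions before it), which is one common variant; the function names are illustrative.

```python
import math

def attended_positions(t, window):
    """Causal sliding window: position t attends to the last `window` positions."""
    return list(range(max(0, t - window + 1), t + 1))

def num_pairs(seq_len, window):
    """Number of (query, key) pairs actually computed."""
    return sum(len(attended_positions(t, window)) for t in range(seq_len))

def layers_to_reach(distance, window):
    """Layers needed before information from `distance` positions back can arrive."""
    return math.ceil(distance / (window - 1))

L, W = 8, 3
print(attended_positions(5, W))                  # positions visible from t = 5
print(num_pairs(L, W), "pairs with the window")
print(num_pairs(L, L), "pairs with full causal attention")
print(layers_to_reach(L - 1, W), "layers for the last position to see the first")
```

The pair count grows as $L \cdot W$ rather than $L^2$ , while the number of layers needed to bridge a distance grows as that distance divided by $W - 1$ , which is the multi-layer propagation described above.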

Positional extrapolation: same architecture, longer context

A different approach: keep the architecture identical, but extend its effective context window at inference time beyond what it was trained on. The mechanism is positional encoding extrapolation.

RoPE-based models (the dominant choice; §6) are particularly amenable. Recall that RoPE rotates queries and keys by an angle proportional to position, with frequencies $\theta_i = b^{-2i/d_{\text{head}}}$. To extrapolate to longer contexts, scale the rotation frequencies appropriately:

NTK-aware scaling (re-derived independently by several practitioners). Adjust the base $b$ so that the rotation angles at the new context length match the distribution of angles seen during training. Heuristic but cheap.
YaRN (Peng et al., 2023). A more careful scheme that interpolates the high-frequency (short-range) dimensions less aggressively than the low-frequency (long-range) ones, recognizing that the model has seen the short-range rotations many times in training but the long-range rotations rarely.
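A minimal sketch of the base adjustment, assuming the commonly circulated form of the NTK-aware rule, $b' = b \cdot s^{d_{\text{head}}/(d_{\text{head}}-2)}$ for a context-extension factor $s$. The default base of 10000 and the example values are our assumptions; YaRN's per-frequency treatment is not shown.

```python
def rope_frequencies(d_head, base=10000.0):
    """theta_i = base ** (-2 i / d_head) for i = 0 .. d_head/2 - 1."""
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]


def ntk_scaled_base(base, scale, d_head):
    """Commonly cited NTK-aware rule: b' = b * s ** (d / (d - 2))."""
    return base * scale ** (d_head / (d_head - 2))


# Hypothetical setting: head dimension 128, extending the context by 8x.
d_head, base, scale = 128, 10000.0, 8.0
original = rope_frequencies(d_head, base)
scaled = rope_frequencies(d_head, ntk_scaled_base(base, scale, d_head))

# The fastest rotation is untouched; the slowest is slowed by exactly `scale`.
print(original[0], scaled[0])
print(original[-1] / scaled[-1])
```

With this choice of exponent the highest frequency is left unchanged and the lowest is divided by $s$, so the adjustment is concentrated on the long-range dimensions.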

The empirical pattern: positional extrapolation works for moderate extensions (e.g., a 4k-trained model extended to 32k or 64k) with manageable quality degradation; it works less well at very large multipliers (e.g., 4k to 1M). LLM §9 (Long Context) develops the effective-vs-nominal-context-window distinction that this produces.

Retrieval as a long-context substitute

Cleaner than any of the above: don’t put the long context into the model at all. Instead, retrieve the relevant portion at inference time and put only that into a normal context window. This is retrieval-augmented generation (RAG), treated in depth in the Retrieval-Augmented Generation chapter. Architecturally, RAG side-steps the long-context problem rather than solving it; the model itself is a standard transformer with a modest context window, and the retrieval system handles the corpus-scale problem.
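The division of labour can be illustrated with a toy pipeline. The retriever here is a word-overlap ranker standing in for a real retrieval system (an assumption of ours; production systems typically use learned embeddings), and the function names and corpus are hypothetical.

```python
import re


def tokenize(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query (stand-in for a real retriever)."""
    query_words = tokenize(query)
    ranked = sorted(chunks,
                    key=lambda chunk: len(query_words & tokenize(chunk)),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Only the retrieved chunks enter the model's context window."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "RoPE rotates queries and keys by a position-dependent angle.",
    "Paged attention splits the KV cache into fixed-size pages.",
    "AdamW decouples weight decay from the gradient update.",
]
print(build_prompt("How does paged attention manage the KV cache?", corpus, k=1))
```

The prompt length depends on `k` and the chunk size, not on the size of the corpus, which is the sense in which retrieval side-steps the long-context problem.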

Putting it together

In production, a long-context deployment in 2026 typically combines several of these techniques:

Architecture

Transformer with grouped-query attention (smaller KV cache)
RoPE positional encoding
Optional: sliding-window or global-plus-local sparse pattern
Optional: SSM blocks interleaved (hybrid; §8)

Implementation

FlashAttention (versioned for the target hardware)
Paged attention for KV cache management
Continuous batching for serving multiple requests

Positional extension

YaRN or NTK-aware scaling if the deployed context exceeds the training context

Supplementary

Retrieval system to handle corpus-scale inputs that don’t belong in the context window at all
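To put a number on the “smaller KV cache” item in the Architecture list, the sketch below computes per-request KV-cache size from the standard accounting: two tensors (keys and values) per layer, each of shape context length × KV heads × head dimension. The configuration (32 layers, head dimension 128, a 128k-token context, 2-byte bf16 elements) is hypothetical and chosen only for round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, context_len, bytes_per_elem=2):
    """Per-request KV-cache size; the leading 2 counts keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * context_len * bytes_per_elem


GIB = 2 ** 30
context_len = 128 * 1024  # 131072 tokens

# Multi-head attention: one KV head per query head (32 here).
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, context_len=context_len)
# Grouped-query attention: 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, context_len=context_len)

print(mha / GIB, gqa / GIB)  # 64.0 16.0
```

The cache shrinks in direct proportion to the number of KV heads, which is why grouped-query attention and paged allocation compound: fewer bytes per token, and fewer wasted pages per request.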

Pointer to the Efficient and Scaled Training chapter

This section has covered the architectural and per-request-inference picture. The systems story - how training and serving infrastructure is parallelized across thousands of GPUs, how communication patterns are designed, how hardware-specific kernels are written - is developed in the Efficient and Scaled Training chapter.

§11. Connections to Other Chapters

This chapter is the substrate: most other chapters in the book build on the architectures and training procedures developed here. The map of who uses what:

Foundation Models treats deep learning as a black-box substrate and develops the deployment regime on top. FM §4 has a brief architectural-substrate section that points back here for depth.
Self-Supervised Learning develops the objectives that the architectures of this chapter are trained on - causal LM, masked LM, contrastive learning, predictive embedding objectives. The architectures don’t care what objective they’re trained on; the SSL chapter cares about everything except the architecture.
Theoretical Foundations of Learning develops generalization theory - why architectures trained as in §9 generalize. This chapter is about what works empirically; the theory chapter is about why (where we have theory) and what is unresolved (most of it).
Large Language Models specializes the Transformer family of §6, §7, §8, §10 to the modern language-modelling recipe (decoder-only, causal mask, RMSNorm + SwiGLU + RoPE + GQA). LLM §4 develops the LLM-specific architectural picture; this chapter holds the canonical mechanical reference.
Generative Models develops diffusion, flow matching, autoregressive image and audio generation - all built on the architectures of §3 and §6.
Multimodal Models extends §6’s Transformer and §7’s MoE pattern across modalities (vision, audio, video, embodied control).
Reasoning Models layers test-time-compute deliberation on top of these architectures.
Mechanistic Interpretability develops what is inside the architectures of §6 at the circuit level. The residual stream view of §6 originates in this literature.
Efficient and Scaled Training develops the distributed-systems engineering that §9 and §10 sketch from the algorithmic side: tensor and pipeline parallelism, hardware-aware kernel design, communication patterns at thousand-GPU scale.
Reinforcement Learning uses the architectures of this chapter as value functions, policy networks, and world models. The RL chapter develops the algorithms; the architectures come from here.
Robotics uses deep-learning substrates for perception, control, and vision-language-action models. The specifics of how robotics adapts these architectures live in the Robotics chapter.

§12. Limitations and Open Problems

A research-oriented inventory of deep-learning open problems. As elsewhere in the book, we mark what is unresolved without adjudicating; each item is the substrate-level facet of a question that recurs across other chapters’ lists.

OP-DL-1: Architectural design without strong theory. The “modern recipe” of §6 (RMSNorm over LayerNorm, RoPE over ALiBi or sinusoidal, SwiGLU over ReLU/GELU MLPs, pre-normalization over post-normalization) is empirically grounded: each choice has been verified by many groups, at many scales, to outperform the alternatives. But none has a derivation from first principles. We do not understand why SwiGLU outperforms ReLU MLPs at matched parameter count, only that it does. The collection of empirical results is reliable; the collection of theoretical explanations for the results is partial. Whether the right architectural choices are derivable from theory (and we just haven’t worked out the theory) or whether deep-learning architecture is fundamentally empirical is open.

OP-DL-2: When does MoE actually help? Mixture-of-experts (§7) is dominant at frontier scale by 2026, but the boundary - at what scale and for what tasks MoE outperforms dense - is not crisp. Some empirical findings suggest MoE’s advantages grow with scale, others suggest the gain saturates. The right active-to-total parameter ratio, the right expert count, and the right routing scheme are tuned empirically per model family. A unifying account of when sparsity beats density in deep learning is missing.

OP-DL-3: State-space vs attention. As §8 noted, SSMs are competitive with Transformers at small-to-mid scales but lag on retrieval-heavy tasks. Whether the gap closes with architectural progress, whether hybrid attention/SSM architectures (Jamba) are the durable answer, or whether one or the other architecture becomes dominant at the frontier remains open. Related: whether the field’s architectural convergence on Transformers reflects genuine optimality or a tooling/community lock-in.

OP-DL-4: Long-context fidelity vs nominal context length. As developed in LLM §9 and DL §10, nominal context windows have grown faster than effective context. The gap between “the model can address position 500,000” and “the model can usefully reason over position 500,000” persists. This is the substrate-level version of OP-LLM-4; whether the gap closes architecturally, requires retrieval as primitive, or has a more fundamental cause is unresolved.

OP-DL-5: Optimization at frontier scale. The training recipes of §9 (AdamW with warmup + cosine decay) work at scales from millions to trillions of parameters. But the robustness of frontier-scale training - why some runs converge to good models and others fail despite nominally identical recipes - is not well understood. Hyperparameter sensitivity at frontier scale is high; the field operates with substantial folklore around what works. A principled account of frontier-scale optimization dynamics would change the operational economics of training large models.

OP-DL-6: Architectural search. Neural Architecture Search (NAS) promised to automate the design of neural-network architectures. By 2026, hand-designed architectures consistently outperform NAS-found ones in mainstream deep learning. Why: the search space is too large; the search signal (downstream performance after long training runs) is too expensive to evaluate; transferring NAS results across scales is unreliable. Whether NAS is fundamentally less effective than human design (because humans bring strong priors that the search cannot encode) or whether better NAS recipes will eventually surpass hand-design is open.

OP-DL-7: Numerical stability at low precision. The trajectory of training precision has been fp32 → bf16 → fp8 → (proposed) fp4 and lower. Each step exposes new numerical-stability failures that the previous regime did not. By 2026 fp8 is in production training; fp4 and below are research-active. The question of how low precision can go without losing training quality, and where the architecture/optimizer must change to support lower precision, is an active research area whose empirical floor moves on a quarterly cadence.

OP-DL-8: Catastrophic forgetting in continual training. Cross-referenced as the LLM-facet OP-LLM-7 and FM-facet OP-FM-8: how to update a trained model’s knowledge or behaviour without erasing what it previously learned. The mitigations of §9 and LLM §5.5 (data replay, low learning rate, regularization, modular adapters) are partial. The substrate-level question is whether deep networks fundamentally suffer from this failure mode (because gradient updates affect parameters that other capabilities depend on) or whether some architectural redesign (modular networks, mixture-of-experts variants, completely separated knowledge stores) could in principle escape it.

These eight problems are the substrate-level questions. The chapter-specific lists in the LLM chapter (OP-LLM-1..10), Foundation Models chapter (OP-FM-1..15), and the other dedicated chapters develop the questions in their domain-specific forms.

§13. Further Reading

An opinionated list, grouped by topic. Each entry is annotated with what it adds beyond this chapter.

Foundational

Rumelhart, Hinton, Williams (1986), “Learning representations by back-propagating errors.” The paper that brought backpropagation to the field’s attention. Worth reading as a historical document and for the conceptual exposition.
LeCun, Bottou, Bengio, Haffner (1998), “Gradient-based learning applied to document recognition.” Includes LeNet. The canonical reference for the early CNN era.
Krizhevsky, Sutskever, Hinton (2012), “ImageNet Classification with Deep Convolutional Neural Networks” (AlexNet). The §2 inflection paper; reads quickly and grounds the modern era.
He et al. (2015), “Deep Residual Learning for Image Recognition” (ResNet). Introduces residual connections, the architectural primitive underlying everything in this chapter.
Vaswani et al. (2017), “Attention Is All You Need.” The Transformer. §6 is mostly an unpacking of this paper.

Modern Transformer recipe

Su et al. (2021), “RoFormer” (RoPE). Positional encoding via rotation; §6.
Ainslie et al. (2023), “GQA.” Grouped-query attention; §6 and LLM §4.
Shazeer (2020), “GLU Variants Improve Transformer” (SwiGLU and others). §3 and §6.
Zhang and Sennrich (2019), “RMSNorm.” §3 and §6.

Vision-Transformer transition

Dosovitskiy et al. (2020), “An Image is Worth 16×16 Words” (ViT). The Transformer comes to vision.
Liu et al. (2022), “A ConvNet for the 2020s” (ConvNeXt). The ConvNet response to ViT.

MoE

Shazeer et al. (2017), “Outrageously Large Neural Networks.” The early-modern sparse-expert paper.
Fedus et al. (2022), “Switch Transformer.” Sparse expert routing at scale.
Mixtral and DeepSeek-MoE technical reports. Public production-scale MoE references.

State-space models

Gu et al. (2020), “HiPPO.” The structured-state-space initialization.
Gu, Goel, Ré (2021), “Efficiently Modeling Long Sequences with Structured State Spaces” (S4). §8.
Gu and Dao (2023), “Mamba.” Selective state-space models; the breakthrough.
Dao and Gu (2024), “Mamba-2” / “Transformers are SSMs.” The state-space-duality unification.

Efficient attention and long context

Dao et al. (2022), “FlashAttention.” §10.
Dao (2023), “FlashAttention-2.”
Kwon et al. (2023), “Efficient Memory Management for LLM Serving with PagedAttention” (vLLM). §10.
Peng et al. (2023), “YaRN.” §10.
Beltagy, Peters, Cohan (2020), “Longformer.” Dilated attention.

Training dynamics

Loshchilov and Hutter (2017), “Decoupled Weight Decay Regularization” (AdamW). §9.
Kingma and Ba (2014), “Adam.” §9.
Goyal et al. (2017), “Accurate, Large Minibatch SGD.” The linear scaling rule.

Mechanistic interpretability (substrate framings)

Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits.” The residual-stream view of §6 originates here.
Olsson et al. (2022), “In-Context Learning and Induction Heads.” Substrate-level mechanistic story for a specific Transformer behaviour. (Also referenced from LLM §7.)

Surveys and broader perspective

Goodfellow, Bengio, Courville (2016), “Deep Learning.” The textbook of the pre-Transformer era; still useful for backprop, optimization, and the basics. Outdated on architectures.
Any current arXiv survey on Transformers or efficient attention. These age fast but cover specific subfields well.

§14. Exercises and Experiments

Five research-style exercises, each with declared intent (demonstration or exploration), setup, tasks, and expected takeaway. Sized for a graduate student with one-GPU access except where noted.

E1. Implement a Transformer block from scratch.

Intent: demonstration.

Setup. Pick a tiny configuration: $d_{\text{model}} = 128$, 4 attention heads, sequence length 64. Use PyTorch or JAX, but do not use the built-in nn.MultiheadAttention; implement attention by hand.

Tasks.

Implement (a) multi-head self-attention (Q/K/V projections, scaled dot-product, output projection), (b) a SwiGLU feedforward, (c) RMSNorm, (d) the pre-normalized residual block from §6.
Verify the forward pass against PyTorch’s reference implementation (e.g., via torch.nn.functional.scaled_dot_product_attention) on random inputs.
Verify that gradients flow correctly: run backprop on a synthetic loss and check that gradient magnitudes do not vanish or explode through 12 stacked blocks.
Optionally add RoPE and verify that the resulting attention pattern is sensitive only to relative position.

Takeaway. A working mental model of every component of the modern Transformer block, in code you wrote yourself.
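For the optional RoPE task, a framework-free check of the relative-position property may be a useful reference point before wiring RoPE into the block. The sketch assumes the adjacent-pair convention (dimensions $2i$ and $2i+1$ rotate together) and a base of 10000; some implementations pair dimension $i$ with $i + d/2$ instead. The vectors are arbitrary test values.

```python
import math


def apply_rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (2i, 2i+1) by pos * base ** (-2i / d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def score(q, k, q_pos, k_pos):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(apply_rope(q, q_pos), apply_rope(k, k_pos)))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1, 0.9, -0.8]
k = [1.0, 0.2, -0.6, 0.4, 0.8, -0.3, 0.1, 0.5]

# Same offset (3) at different absolute positions: the logits should agree.
print(score(q, k, 5, 2), score(q, k, 105, 102))
# Different offset (2): the logit should change.
print(score(q, k, 5, 3))
```

The same test, run on the attention logits of your own implementation, is a quick way to catch an off-by-one in the position index or a mismatched pairing convention.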

E2. Normalization ablation.

Intent: exploration.

Setup. A small Transformer (4–8 layers, ~10M parameters) trained on a tiny language-modelling task (e.g., character-level Shakespeare or a small subset of FineWeb-Edu).

Tasks.

Train three variants: with LayerNorm, with RMSNorm, with no normalization at all.
Measure: training loss curves, gradient magnitudes through training, time-to-convergence, final perplexity.
The no-norm variant should fail to train at depth; quantify the failure (which layers’ gradients vanish?).
LayerNorm and RMSNorm should be approximately equivalent on final loss; check whether RMSNorm trains faster (fewer FLOPs per step).

Takeaway. Empirical confirmation that normalization matters and that RMSNorm has the expected cost-quality profile.
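As a reference for the two normalizers compared in this exercise, here is a framework-free sketch on plain Python lists. The function names and the `eps` value are our choices; a real implementation would operate on tensors along the feature dimension.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-6):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square; no mean subtraction and no bias."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    gain = gain or [1.0] * n
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gain)]


centered = [1.0, -1.0, 2.0, -2.0]     # zero mean: the two norms coincide
shifted = [v + 5.0 for v in centered]  # nonzero mean: they differ

print(layer_norm(centered), rms_norm(centered))
print(layer_norm(shifted), rms_norm(shifted))
```

RMSNorm drops the mean computation and the bias term, which is the source of its lower per-step cost; the two agree exactly whenever the input already has zero mean.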

E3. Reproduce a small scaling-law fit.

Intent: exploration; substantial compute.

Setup. Pretrain a series of small Transformer language models (1M, 10M, 100M parameters; if compute permits, 1B). All on the same dataset, all to convergence.

Tasks.

Plot final loss $L$ against parameter count $N$ on log-log axes.
Fit a power law; extract the exponent $\alpha_N$.
Compare to Kaplan et al. (2020) and Hoffmann et al. (2022).
Optionally sweep over $D$ (training tokens) at fixed $N$ as well.

(This exercise overlaps with FM §14’s E1; the difference is the framing - DL is interested in architectures and their fits, FM is interested in the regime they produce.)

Takeaway. Direct experience of the empirical scaling-law workflow.
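The fitting step can be sketched as ordinary least squares on the logged data. The sketch assumes a pure power law $L = A \cdot N^{-\alpha_N}$ with no irreducible-loss offset, which is a simplification of the forms fitted in the papers above; the data points are synthetic placeholders, not measurements.

```python
import math


def fit_power_law(ns, losses):
    """Least-squares line through (log N, log L); returns (alpha, A) for L = A * N**(-alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)


# Synthetic losses generated from a known exponent, to check the fit recovers it.
ns = [1e6, 1e7, 1e8]
losses = [12.0 * n ** -0.08 for n in ns]

alpha, coeff = fit_power_law(ns, losses)
print(alpha, coeff)
```

With only three or four model sizes the fitted exponent is sensitive to each point, so it is worth checking how much it moves when one run is left out.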

E4. Attention vs SSM on long-context retrieval.

Intent: exploration.

Setup. Choose an open-weights Transformer in the 1B parameter range (e.g., a small Llama variant) and an open-weights Mamba model at comparable scale. Choose a long-context retrieval benchmark (RULER, Needle-in-a-Haystack).

Tasks.

Evaluate both models at increasing context lengths (4k, 16k, 64k, 128k).
For NIAH, produce both models’ heatmaps and compare. The attention-based model should retrieve more reliably; the SSM may show position-dependent gaps.
For RULER, compare on retrieval-heavy and aggregation-heavy subsets separately. Quantify the gap.
Optionally test a hybrid attention/SSM model if one is available at comparable scale.

Takeaway. Empirical encounter with the trade-off table of §8: SSMs match Transformers on summarization-like tasks; the gap appears on retrieval.

E5. Positional encoding extrapolation.

Intent: exploration.

Setup. A small Transformer (~50M parameters) trained at a fixed context length (e.g., 1024 tokens). Two variants: one with sinusoidal positional encoding, one with RoPE.

Tasks.

After training at length 1024, evaluate at lengths 2048, 4096, 8192 without further training.
Measure perplexity at each length.
For the RoPE variant, also try NTK-aware scaling and YaRN; measure perplexity with each extrapolation strategy.
The sinusoidal variant should fail to extrapolate (perplexity rises sharply at lengths beyond training); RoPE should extrapolate more gracefully; YaRN should extrapolate further than NTK-aware.

Takeaway. Empirical confirmation of the extrapolation story in §10: positional encoding choice has substantial consequences for effective context window.

Notebook implementations planned at notebooks/deep-learning/ once the project’s interactive layer is decided.
