Theoretical Foundations of Learning


Scope and What This Chapter Is About

The chapter develops the theoretical understanding (and theoretical confusion) about why machine learning generalizes. It covers classical statistical learning theory (PAC, VC, Rademacher), modern bounds (PAC-Bayes, information-theoretic generalization), and the deep-learning-specific puzzles that resist classical analysis. It also covers scaling laws as empirical regularities and what they can and cannot predict.

We treat theory. Algorithms, architectures, and training recipes live in their own chapters. Open problems are flagged inline and consolidated in §12.


§1. Motivation and Scope

What this chapter is for

By 2026 the empirical methods of machine learning are dramatically more successful than the theoretical understanding of why they work. Classical statistical learning theory - PAC, VC dimension, Rademacher complexity, uniform convergence - was developed in the 1970s–1990s and predicts what deep learning should not be able to do. Modern deep learning routinely succeeds at scales and on tasks where classical theory says it should fail. The empirical results are robust; the theoretical explanations are partial.

This chapter exists to make the theoretical part of the picture visible. Specifically:

  • To present classical learning theory (PAC, VC, Rademacher, PAC-Bayes, information-theoretic generalization) on its own terms - what it proves, where it applies, what its failure modes are.

  • To present the modern generalization puzzle - the empirical phenomena that classical theory does not explain, and the partial theoretical accounts of them (NTK, double descent, implicit regularization, scaling laws as empirical regularities).

  • To consolidate, for a research-oriented reader, what is known, what is conjectured, and what is genuinely open in learning theory.

The chapter does not try to choose a winner among theoretical frameworks; it tries to lay out the conceptual landscape so that subsequent literature can be placed against it.

Why learning theory matters in 2026

Three reasons a research-oriented practitioner should care about theory even when their primary work is empirical:

  • Theory tells you what cannot work. Worst-case bounds in classical learning theory are pessimistic but rigorous. When a method is in a regime where the worst-case bound is the relevant one (small data, distribution shift, formal guarantees needed), the theory is binding.

  • Theory tells you where to look for failure. Modern deep learning does fail - adversarial attacks, distribution shift, jailbreaks, prompt injection. A theoretical understanding of why a method generalizes often illuminates the boundary where it stops generalizing.

  • Theory shapes long-term research direction. What we believe is theoretically true affects what research seems promising. The double-descent phenomenon (§8) reshaped the bias-variance teaching narrative of decades; the scaling laws (§9) reshaped the field’s compute-allocation thinking.

The combination of “the methods work” and “we do not fully understand why” is normal in deep learning - flagged elsewhere in this book (DL §12 OP-DL-1, SSL §11 OP-SSL-1, FM §11). This chapter develops the theoretical apparatus that exists for engaging that gap rigorously rather than waving at it.

The three cultures within learning theory

A useful organizing scheme. Learning-theory research roughly clusters into three traditions, each with its own framing and its own characteristic failure modes:

   THREE CULTURES OF LEARNING THEORY

   1. CLASSICAL / WORST-CASE
      ───────────────────────
      Distribution-free, hypothesis-class-based bounds.
      PAC framework (§3), VC dimension (§4), uniform
      convergence (§4), Rademacher complexity (§5).
      Strength: rigorous, applies to broad model classes.
      Failure: bounds typically loose by many orders of
               magnitude for practical learners.

   2. DATA-DEPENDENT / AVERAGE-CASE
      ──────────────────────────────
      Bounds that use information about the training data
      or the learning procedure, producing tighter
      predictions. PAC-Bayes (§6), information-theoretic
      generalization (§7), margin-based bounds.
      Strength: bounds can be non-vacuous for deep
                networks.
      Failure: applicability is narrower; the bounds
               depend on quantities that are hard to
               compute.

   3. EMPIRICAL-REGULARITY / DESCRIPTIVE
      ─────────────────────────────────
      Empirical observations of deep-learning behaviour
      that take the form of theoretical predictions.
      Scaling laws (§9), neural tangent kernel (§8),
      double descent (§8), implicit bias of optimizers (§8).
      Strength: matches empirical practice closely.
      Failure: the "theory" is often a regression fit;
               extrapolation beyond the fitted range is
               an empirical bet, not a derivation.

Each culture has a coherent research community, a body of results, and characteristic limitations. None of them on its own explains why modern ML works. Together they form a partial map.

What this chapter does not try to do

A research-oriented theory chapter must be honest about scope:

  • We do not prove most results in full mathematical detail. We state results, sketch the techniques behind them, and cite the primary literature for proofs.

  • We do not adjudicate between competing theoretical frameworks. The §10 closing section returns to the question of how the perspectives relate; we do not pretend the answers are settled.

  • We do not cover all of learning theory. The field is vast; we focus on the perspectives most consequential for modern deep-learning practice.

  • We do not derive scaling laws from theory. Scaling laws are empirical regularities (Foundation Models §6); we develop their epistemic status in §9 of this chapter but treat their derivation as outside the chapter’s scope (because no satisfactory derivation exists).

What this chapter is for (revisited)

The dominant claim of the chapter: in 2026, learning theory exists, is partial, is fragmented across communities, and is best engaged honestly rather than ignored or overclaimed. The empirical practitioner who reads this chapter should come away knowing what theoretical apparatus is available, what each piece of it can and cannot do, and where the genuine open questions live.

Boundaries with adjacent chapters

  • Deep Learning develops what the architectures and training procedures actually are. This chapter develops theoretical understanding of why training procedures generalize.

  • Self-Supervised Learning §8 surveys four theoretical perspectives specifically about SSL. This chapter develops the broader framework those perspectives sit within.

  • Foundation Models §6 treats scaling laws as empirical regularities; this chapter develops the theoretical status of scaling laws - the regression-fit-vs-derivation distinction.

  • Causality develops a parallel theoretical critique (Pearl-aligned) that this chapter touches in §12 but defers to that chapter for substance.

  • Statistics in the traditional sense underlies much of classical learning theory; we assume undergraduate probability and statistics throughout per ADR-0001.


§2. Historical Context

This section traces the history of learning theory from its 1960s foundations through the deep-learning shock of 2017 and the post-shock fragmentation. The chapter’s substance is the current theoretical apparatus; the history is here because the conceptual moves that produced today’s apparatus are themselves illuminating.

A timeline of the inflection points:

   ~1960s        Foundations: Vapnik and Chervonenkis on
                  uniform convergence (1968, 1971)
                                  │
                                  ▼
   1971          VC dimension introduced (Vapnik and
                  Chervonenkis)
                                  │
                                  ▼
   1984          PAC framework (Valiant): probabilistic-
                  approximately-correct learnability as a
                  formal definition
                                  │
                                  ▼
   ~1990s        SLT programme matures: kernel methods (SVMs),
                  structural risk minimization, the
                  bias-variance tradeoff as the teaching
                  narrative
                                  │
                                  ▼
   1995-1999     PAC-Bayes framework (McAllester 1999,
                  Shawe-Taylor and Williamson 1997);
                  margin bounds (Schapire et al. on boosting)
                                  │
                                  ▼
   2000s         Statistical-ML mainstream consolidates;
                  classical theory considered settled;
                  deep learning is still a niche
                                  │
                                  ▼
   2012-2016     Deep learning's empirical rise (AlexNet
                  inflection, ResNet); the empirical
                  success is incommensurable with classical
                  theoretical predictions
                                  │
                                  ▼
   2017          THE SHOCK: Zhang, Bengio, Hardt, Recht,
                  Vinyals, "Understanding Deep Learning
                  Requires Rethinking Generalization." Deep
                  networks can fit RANDOM labels yet still
                  GENERALIZE on real labels; classical
                  capacity-based bounds cannot explain this
                                  │
                                  ▼
   2018-2022     POST-SHOCK FRAGMENTATION: Neural Tangent
                  Kernel (Jacot et al. 2018), double descent
                  (Belkin et al. 2019), implicit bias of SGD
                  (multiple papers), PAC-Bayes revival
                  (Dziugaite and Roy 2017 - first non-vacuous
                  bounds on deep nets)
                                  │
                                  ▼
   2020-2026     Scaling laws as a new kind of empirical
                  theory (Kaplan et al. 2020, Hoffmann et al.
                  2022); the field develops in two directions:
                  rigorous-but-narrow theoretical work, and
                  large-scale-empirical-regularity work

We develop each phase below.

Foundations: Vapnik, Chervonenkis, and the uniform-convergence programme

The mathematical foundation of classical statistical learning theory was laid by Vladimir Vapnik and Alexey Chervonenkis in the 1960s and 1970s. Their central result (1968, 1971): under certain conditions on a hypothesis class, the empirical risk on a training set converges uniformly to the true risk over all hypotheses in the class as the training set grows. The convergence rate is controlled by a capacity measure of the hypothesis class - what is now called the VC dimension.

The intuition: if a hypothesis class is “large” (in the VC-dimension sense), the worst-case gap between training and true error can be substantial, and we need many examples to be confident the empirical winner is also the true winner. If the class is “small”, few examples suffice.

This was the first rigorous account of why statistical learning generalizes: it generalizes when the hypothesis class is not so large that some hypothesis fits the training data well by accident. The framework was domain-agnostic - applicable to any binary classification problem with a well-defined hypothesis class - and remains the conceptual foundation of classical learning theory.

PAC: making learnability formal

Leslie Valiant’s “A Theory of the Learnable” (1984) introduced the PAC (Probably Approximately Correct) framework. The formal definition: a hypothesis class is PAC-learnable if there exists an algorithm that, given any ϵ,δ>0\epsilon, \delta > 0, returns a hypothesis with error at most ϵ\epsilon with probability at least 1δ1 - \delta, using a number of training examples polynomial in 1/ϵ1/\epsilon and 1/δ1/\delta (and in the hypothesis-class complexity).

PAC made learnability itself a formal property rather than a vague desideratum. It gave the field a vocabulary for distinguishing easy from hard learning problems, and it spawned the COLT (Computational Learning Theory) research community. PAC is still the standard framework for stating sample-complexity results in classical learning theory; we develop it in §3.

The 1990s SLT programme

Through the 1990s, Statistical Learning Theory (SLT, the programme that Vapnik’s work led to) matured into the dominant theoretical framework for ML. Vapnik’s 1995 textbook The Nature of Statistical Learning Theory consolidated the framework. Support Vector Machines (Cortes and Vapnik, 1995) were a practical algorithm that came directly out of SLT - minimize empirical risk subject to a structural constraint on capacity, which we now call structural risk minimization.

The bias-variance tradeoff became the dominant teaching narrative. A model with too few parameters has high bias and cannot fit the data; a model with too many parameters has high variance and overfits noise. The right model size balances the two - and is somewhere in the middle. Generations of ML students learned this picture as the conceptual frame for model selection.

By the early 2000s, classical SLT was considered mature as a field. The major theoretical questions appeared to be settled or refined.

The PAC-Bayes refinement

A parallel thread developed in the late 1990s. The classical PAC framework gives bounds that depend on the worst-case hypothesis in the class. PAC-Bayes (McAllester, 1999; Shawe-Taylor and Williamson, 1997) gives bounds that depend on the distribution the learner produces over hypotheses. The framework is Bayesian-flavoured: the learner has a prior over hypotheses, sees the data, produces a posterior; the bound depends on the KL divergence between posterior and prior.

PAC-Bayes was initially a refinement that gave slightly tighter bounds in some settings. Its real importance would come two decades later when it produced the first non-vacuous bounds on deep networks - but for the 1990s, it was a technical advance within the broader SLT framework.

The pre-2017 dormancy

Through the 2000s and into the 2010s, classical learning theory continued to develop in incremental ways. The deep-learning revival of the late 2000s and the AlexNet inflection of 2012 (Deep Learning §2) were treated as engineering successes that did not particularly challenge the theoretical framework. Deep networks were assumed to have very high VC dimension and therefore (by the bias-variance narrative) to require lots of data and to be prone to overfitting - both of which were taken as descriptively true based on early experience with smaller datasets.

The combination of enough data (ImageNet) and enough compute (GPUs) that produced the AlexNet inflection was an engineering combination; nobody read it as a theoretical refutation.

2017: the Zhang et al. shock

In 2017, Zhang, Bengio, Hardt, Recht, and Vinyals published “Understanding Deep Learning Requires Rethinking Generalization.” The paper’s central experiments produced a result that was, as a matter of theory, impossible.

The experiment: take a standard convolutional network and a standard image dataset (CIFAR-10, ImageNet). Randomize the training labels - assign each training image a uniformly random label instead of its true label. Train the network to convergence. Two findings:

  1. The network successfully fits the random labels. Training accuracy reaches 100% even though there is no signal whatsoever in the labels.

  2. The same network architecture, trained on the true (non-randomized) labels, generalizes well to the test set, achieving near-state-of-the-art accuracy.

By classical learning theory, this combination should be impossible. If the network can fit random labels, its effective capacity must be at least as large as the dataset. With effective capacity at least as large as the dataset, classical uniform-convergence bounds are vacuous - they would not predict any generalization at all. Yet the network does generalize. Therefore, whatever explains deep learning’s generalization, it is not classical capacity-based theory.

This is what made the result a shock rather than a curiosity. It did not refute classical theory mathematically (the bounds are still correct as math) but showed that classical theory does not predict the empirical behaviour of deep networks. Some other mechanism must be at work.

Subsequent papers (Nagarajan and Kolter, 2019, in particular) showed that uniform convergence is provably unable to explain deep-learning generalization in many natural settings - not because the bounds are weak, but because the implicit-bias structure of training violates the assumptions on which uniform convergence rests.

Post-2017 fragmentation

The Zhang et al. shock did not produce a single replacement theory. Instead, it produced a fragmentation of theoretical work into multiple lines, each capturing a piece of the phenomenon. Several of these threads are developed in detail later in this chapter:

  • Neural Tangent Kernel (NTK) (Jacot, Gabriel, Hongler, 2018). In an infinite-width limit, neural networks trained with gradient descent behave like kernel methods - making them theoretically analyzable. The NTK framework explains some of what deep networks do at near-infinite width; how relevant the framework is to finite-width networks is contested.

  • Double descent (Belkin, Hsu, Ma, Mandal, 2019). Empirically, the test error of overparameterized models often decreases past the point where the bias-variance narrative predicts overfitting. The “U-shape” of the bias-variance curve continues into a second descent in the overparameterized regime. This is now well-documented and partially understood (with multiple complementary theoretical accounts), but it falsified the bias-variance picture as the central narrative for deep learning.

  • Implicit bias of SGD. A series of papers showed that the optimization procedure (specifically, stochastic gradient descent) introduces structural biases on the solutions it finds - biases that are not present in the loss function and are not captured by classical theory. Among overparameterized solutions that all fit the training data, SGD-found solutions have specific structural properties (low norm, low complexity in various senses) that classical bounds don’t anticipate. These implicit biases may be part of the explanation for why deep learning generalizes.

  • PAC-Bayes revival. Dziugaite and Roy (2017) produced the first non-vacuous PAC-Bayes bounds on deep networks - bounds with numerical values that actually predict the empirical behaviour rather than being trivially weak. Subsequent work has developed PAC-Bayes into one of the most practically useful frameworks for deep-learning generalization.

By 2022 the field had multiple partial accounts of deep-learning generalization, none decisive, each capturing a piece. The classical theory remained correct as math but did not have descriptive power for the modern regime; the modern accounts had descriptive power but had not consolidated into a unified theory.

Scaling laws as a new kind of empirical theory

A different mode of theoretical work emerged in parallel. Kaplan et al. (2020) and the Chinchilla revision (Hoffmann et al., 2022) developed scaling laws - power-law empirical regularities relating loss to model parameters, dataset size, and compute. Foundation Models §6 develops these as empirical phenomena; this chapter §9 develops their theoretical status.

The scaling-laws style of work is neither classical (no worst-case bounds, no PAC framework) nor data-dependent in the PAC-Bayes sense. It is purely empirical - power-law fits to training-run data. But it takes the mathematical form of theoretical predictions, and the field has come to use it as a kind of theory. The epistemic status - regression fit vs derivation - is a real question that we develop in §9.

Where this leaves us in 2026

The current state. Classical learning theory exists, is mathematically valid, and is largely orthogonal to deep-learning empirical practice. Modern theoretical work has produced partial accounts of deep-learning generalization, with PAC-Bayes (§6) and information-theoretic frameworks (§7) being the most practically useful. The modern generalization puzzle (§8) is partially explained but not fully understood. Scaling laws (§9) are an empirical-regularity framework with theory-shaped output that has substantially reshaped the field’s compute-allocation practice.

The remaining sections of this chapter develop each piece on its own terms. §3§5 cover classical theory; §6§7 cover modern data-dependent bounds; §8 covers the modern generalization puzzle and its partial accounts; §9 covers scaling laws specifically; §10 covers sample complexity in RL; §11 is the cross-chapter map; §12 is the open-problems inventory.

Editorial note. This is theoretical material that interacts with active research; some statements above will date faster than the empirical-side material elsewhere in the book. The chapter is meant as a snapshot of the conceptual landscape and the canonical literature, not as a comprehensive survey of every theoretical contribution.


§3. The PAC Framework

What problem PAC is solving

Before any formal definition, the question PAC answers. Suppose you have a learning algorithm and you ask: given enough training data, will the hypothesis it returns be close to the true relationship in the data? “Close” is vague; “enough” is vague; “true relationship” is vague. PAC makes all three precise.

The framework was introduced by Leslie Valiant in “A Theory of the Learnable” (1984). The acronym PAC stands for Probably Approximately Correct:

  • Approximately correct: the hypothesis the algorithm returns has error close to the best possible error.

  • Probably: this happens with high probability over the random draw of the training set.

The “probably” hedge is necessary because any training set could in principle be unrepresentative; a learner that promises correctness with certainty cannot exist if training data is random. PAC asks for high-probability correctness instead.

Notation we will use

The PAC framework requires a few objects defined up front. We will reuse this notation throughout §3§7.

  • X\mathcal{X}: the input space. Examples in X\mathcal{X} are things the learner sees (images, sentences, vectors of features). In a binary image-classification problem, X\mathcal{X} might be {0,,255}H×W×3\{0, \ldots, 255\}^{H \times W \times 3}, the set of all H×WH \times W RGB images.

  • Y\mathcal{Y}: the label space. For binary classification, Y={0,1}\mathcal{Y} = \{0, 1\} or {1,+1}\{-1, +1\}. For regression, Y=R\mathcal{Y} = \mathbb{R}.

  • D\mathcal{D}: a distribution over X×Y\mathcal{X} \times \mathcal{Y}. The training data is sampled from this distribution. The learner does not know D\mathcal{D}; it only sees samples.

  • S={(x1,y1),,(xn,yn)}S = \{(x_1, y_1), \ldots, (x_n, y_n)\}: a training set of nn examples drawn i.i.d. from D\mathcal{D}. (“i.i.d.” = independent and identically distributed: each (xi,yi)(x_i, y_i) is drawn from D\mathcal{D} independently of the others.)

  • H\mathcal{H}: a hypothesis class - a set of candidate functions h:XYh: \mathcal{X} \to \mathcal{Y} that the learner may return. Examples: the set of all linear classifiers in Rd\mathbb{R}^d; the set of all decision trees of depth at most kk; the set of functions computable by a neural network of a specified architecture.

  • LD(h)L_{\mathcal{D}}(h): the true risk (or “generalization error”) of hypothesis hh on distribution D\mathcal{D}. For binary classification with 0-1 loss, LD(h)=P(x,y)D[h(x)y]L_{\mathcal{D}}(h) = \mathbb{P}_{(x, y) \sim \mathcal{D}}[h(x) \neq y] - the probability that hh predicts wrongly on a fresh sample from D\mathcal{D}.

  • LS(h)L_S(h): the empirical risk of hypothesis hh on training set SS. For 0-1 loss, LS(h)=1ni=1n1[h(xi)yi]L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(x_i) \neq y_i] - the fraction of training examples that hh gets wrong.

“Risk” and “error” are used interchangeably; “loss” usually refers to a per-example quantity that gets averaged or summed into a risk. The expectation P(x,y)D\mathbb{P}_{(x,y) \sim \mathcal{D}} means the probability when (x,y)(x, y) is drawn from D\mathcal{D}. The indicator 1[]\mathbb{1}[\cdot] is 11 when its argument is true and 00 otherwise.

The PAC definition

A hypothesis class H\mathcal{H} is PAC-learnable if there exists an algorithm AA and a function mH(ϵ,δ)m_{\mathcal{H}}(\epsilon, \delta) such that, for any distribution D\mathcal{D} over X×Y\mathcal{X} \times \mathcal{Y} for which the true relationship lies in H\mathcal{H} (the realizable case; we relax this below), and for any ϵ,δ(0,1)\epsilon, \delta \in (0, 1):

If AA is given nmH(ϵ,δ)n \geq m_{\mathcal{H}}(\epsilon, \delta) training examples sampled i.i.d. from D\mathcal{D}, then AA returns a hypothesis hh satisfying

PSDn[LD(h)ϵ]1δ.\mathbb{P}_{S \sim \mathcal{D}^n}\big[L_{\mathcal{D}}(h) \leq \epsilon\big] \geq 1 - \delta.

Reading this: with probability at least 1δ1 - \delta over the random draw of the training set, the algorithm returns a hypothesis whose true risk is at most ϵ\epsilon. The function mH(ϵ,δ)m_{\mathcal{H}}(\epsilon, \delta) is the sample complexity of learning H\mathcal{H} - the number of examples needed to achieve (ϵ,δ)(\epsilon, \delta)-correctness.

The two parameters are independent knobs:

  • ϵ\epsilon controls how good the returned hypothesis is. Smaller ϵ\epsilon requires more data.

  • δ\delta controls how confident we are in that result. Smaller δ\delta requires more data.

The framework demands learnability with sample complexity polynomial in 1/ϵ1/\epsilon and 1/δ1/\delta - meaning the data requirement grows at worst as a polynomial in how strict the demands are.

A concrete example: learning axis-aligned rectangles

The canonical pedagogical example. The input space is X=R2\mathcal{X} = \mathbb{R}^2 (points in the plane); the label space is Y={0,1}\mathcal{Y} = \{0, 1\} (positive or negative). The hypothesis class is

H={axis-aligned rectangles in R2},\mathcal{H} = \{\,\text{axis-aligned rectangles in } \mathbb{R}^2\,\},

where a rectangle classifies points inside it as 11 and points outside it as 00. Suppose there is a true rectangle RR^* generating the labels; positive examples lie inside, negative examples outside.

The learning algorithm is simple: return the smallest axis-aligned rectangle that contains all positive training examples. Call it R^\hat{R}.

The analysis. R^\hat{R} is always contained in RR^* (since all observed positives are inside RR^*). The error of R^\hat{R} is therefore the probability mass of RR^R^* \setminus \hat{R} - the set of points inside the true rectangle but not inside the empirical one. This region decomposes into four “strips” along the four sides of R^\hat{R}. A standard argument: for the error to exceed ϵ\epsilon, at least one strip must have probability mass exceeding ϵ/4\epsilon/4; the probability that no training example falls into a strip of mass ϵ/4\epsilon/4 after nn draws is at most (1ϵ/4)nenϵ/4(1 - \epsilon/4)^n \leq e^{-n\epsilon/4}. Union bound over four strips: total failure probability is at most 4enϵ/44 e^{-n\epsilon/4}.

Demanding 4enϵ/4δ4 e^{-n\epsilon/4} \leq \delta and solving for nn:

n4ϵln4δ.n \geq \frac{4}{\epsilon} \ln \frac{4}{\delta}.

So mH(ϵ,δ)=O((1/ϵ)log(1/δ))m_{\mathcal{H}}(\epsilon, \delta) = O\big((1/\epsilon) \log(1/\delta)\big). Polynomial in 1/ϵ1/\epsilon and 1/δ1/\delta: rectangles are PAC-learnable.

Two observations. First, the sample complexity does not depend on the unknown distribution D\mathcal{D} - only on ϵ\epsilon and δ\delta. This is distribution-free learning, a defining feature of the PAC framework. Second, the proof was concrete and constructive: we wrote down an algorithm (smallest enclosing rectangle), bounded its error, and produced a sample-complexity formula. PAC results often have this character.

Realizable vs agnostic PAC

The version above assumes the true relationship is in the hypothesis class - there is some hHh^* \in \mathcal{H} with LD(h)=0L_{\mathcal{D}}(h^*) = 0. This is the realizable setting. It is restrictive: in practice, the true relationship is rarely exactly in the class.

The agnostic PAC framework relaxes this. Define the best achievable risk in H\mathcal{H}:

LH(D)=infhHLD(h).L^*_{\mathcal{H}}(\mathcal{D}) = \inf_{h \in \mathcal{H}} L_{\mathcal{D}}(h).

Agnostic PAC asks: does there exist an algorithm that, given enough data, returns a hypothesis whose risk is close to LH(D)L^*_{\mathcal{H}}(\mathcal{D}) - close to the best the class can do, even if that best is not zero? Formally, H\mathcal{H} is agnostically PAC-learnable if there is an algorithm AA and a function mHag(ϵ,δ)m^{\mathrm{ag}}_{\mathcal{H}}(\epsilon, \delta) such that, for any D\mathcal{D} and any ϵ,δ\epsilon, \delta:

PSDn[LD(h)LH(D)+ϵ]1δ,\mathbb{P}_{S \sim \mathcal{D}^n}\big[L_{\mathcal{D}}(h) \leq L^*_{\mathcal{H}}(\mathcal{D}) + \epsilon\big] \geq 1 - \delta,

whenever nmHag(ϵ,δ)n \geq m^{\mathrm{ag}}_{\mathcal{H}}(\epsilon, \delta). The hypothesis returned is at most ϵ\epsilon worse than the best the class allows. This is the framework that matches most real machine-learning settings (where no model is exactly right but some are pretty good).

The agnostic sample complexity is generally worse than the realizable one: mHag(ϵ,δ)m^{\mathrm{ag}}_{\mathcal{H}}(\epsilon, \delta) typically scales as 1/ϵ21/\epsilon^2 rather than 1/ϵ1/\epsilon. The extra factor comes from estimating risks (which now matter for every hypothesis, not just the perfect one) via concentration of empirical averages.

The canonical algorithm: ERM

The standard learning algorithm in the PAC framework is Empirical Risk Minimization (ERM):

ALGORITHM ERM(training set S, hypothesis class H)
    return argmin over h in H of L_S(h)

ERM picks the hypothesis in the class that has lowest empirical risk on the training data. In the realizable case, “lowest empirical risk” means “zero empirical risk” - any hHh \in \mathcal{H} that fits the training data perfectly works.

The conceptual move that makes ERM analyzable: if every hypothesis in H\mathcal{H} has its empirical risk close to its true risk (the uniform-convergence guarantee, §4), then minimizing empirical risk approximately minimizes true risk. The PAC guarantee for ERM is therefore an immediate corollary of uniform convergence - the gap between LSL_S and LDL_{\mathcal{D}} is small, the ERM minimizer’s LSL_S is small, so its LDL_{\mathcal{D}} is small.

This dependence on uniform convergence is what links §3 to §4 - and what makes the failure of uniform convergence for deep networks (Nagarajan and Kolter, 2019; covered in §4) a deep problem.

Limitations of the PAC framework

PAC is rigorous and conceptually clean. Its limitations are well-understood and shape the framing of everything that comes after:

  • Distribution-free is too pessimistic. PAC bounds hold for any distribution, including adversarially constructed ones. The actual data distributions we face in practice are far more benign; bounds that hold for arbitrary distributions are typically far looser than the actual generalization behaviour. Data-dependent frameworks (PAC-Bayes, §6; information-theoretic, §7) exploit the gap.

  • Worst-case over the class. Classical PAC bounds depend on a capacity measure of H\mathcal{H} (such as VC dimension, §4) and not on the specific hypothesis the learner returned. For modern overparameterized networks where VC dimension is astronomical, the bounds are vacuous. Refinements (margin bounds §5, PAC-Bayes §6) bound based on the returned hypothesis itself.

  • Polynomial is a low bar. A sample complexity of, say, n=1020n = 10^{20} examples is polynomial in 1/ϵ1/\epsilon but practically useless. PAC-learnability is a binary property that doesn’t track the constants that matter empirically.

  • The framework is about whether learning is possible, not how. Computational efficiency of the learning algorithm is a separate question (efficient PAC-learnability adds a polynomial-time requirement). For deep learning, the gap between “sample-complexity bounds” and “what training procedures actually find” is the modern generalization puzzle (§8).

These limitations motivate everything in §4§8: each successive framework tries to tighten the bound by exploiting more structure - capacity measures (§4), data-dependent quantities (§5, §6), information-theoretic structure (§7), or specifics of the modern training regime (§8).


§4. VC Dimension and Uniform Convergence

Why we need a notion of “capacity”

The PAC framework (§3) reduces learning to: find a hypothesis with low empirical risk in a class where empirical and true risk are close. The second condition - empirical and true risk are close, uniformly over the class - is the uniform convergence property. Whether a class has it, and how many samples are needed to achieve it, depends on the richness or capacity of the class.

The intuition: a class that contains very few hypotheses is “small” - there are few ways for any hypothesis in it to accidentally fit the data; empirical and true risk track each other tightly. A class with many hypotheses is “large” - some hypothesis in the class will accidentally fit any finite training set well, and empirical risk underestimates true risk. We need a formal notion of size that captures this.

For finite classes, the obvious measure is cardinality H|\mathcal{H}|. For infinite classes (linear classifiers in Rd\mathbb{R}^d, neural networks with continuous parameters), cardinality is infinite and uninformative. We need a different measure.

Shattering

A set of mm points {x1,,xm}X\{x_1, \ldots, x_m\} \subset \mathcal{X} is shattered by hypothesis class H\mathcal{H} if, for every possible labeling of the mm points with {0,1}\{0, 1\} values, some hHh \in \mathcal{H} realizes that labeling. There are 2m2^m possible labelings; shattering means H\mathcal{H} can produce all of them on this particular set of points.

A concrete check. Take X=R\mathcal{X} = \mathbb{R} and H={half-lines ha(x)=1[xa]}\mathcal{H} = \{\text{half-lines } h_a(x) = \mathbb{1}[x \geq a]\} (each hypothesis classifies points to the right of some threshold aa as positive). Consider the set {1,3}\{1, 3\}. The four possible labelings are (0,0),(0,1),(1,0),(1,1)(0, 0), (0, 1), (1, 0), (1, 1).

  • (0,0)(0, 0): realized by h10h_{10} (threshold to the right of both points).

  • (0,1)(0, 1): realized by h2h_2 (threshold between 1 and 3).

  • (1,1)(1, 1): realized by h0h_0 (threshold to the left of both points).

  • (1,0)(1, 0): would require h(1)=1h(1) = 1 and h(3)=0h(3) = 0. Impossible - half-lines are monotone in xx.

The class fails to realize (1,0)(1, 0), so {1,3}\{1, 3\} is not shattered by half-lines on R\mathbb{R}. By a similar argument, no two-point set is shattered. A single-point set {x1}\{x_1\} is shattered (the two labelings (0)(0) and (1)(1) are both achievable by choosing a threshold to the right or left of x1x_1).

VC dimension

The VC dimension of H\mathcal{H}, written VC(H)\text{VC}(\mathcal{H}) or dVCd_{\text{VC}}, is the maximum number of points that can be shattered by H\mathcal{H} - the size of the largest set H\mathcal{H} can label in every possible way.

The half-line example: VC(H)=1\text{VC}(\mathcal{H}) = 1, since single points can be shattered and two-point sets cannot.

More examples:

  • Intervals on R\mathbb{R}: H={1[axb]}\mathcal{H} = \{\mathbb{1}[a \leq x \leq b]\}. VC dimension =2= 2. Two points can be shattered (interval covers neither, only the left, only the right, both); three points {x1<x2<x3}\{x_1 < x_2 < x_3\} cannot, since the labeling (1,0,1)(1, 0, 1) is unrealizable.

  • Axis-aligned rectangles in R2\mathbb{R}^2: VC dimension =4= 4. Four points in general position can be shattered; no five-point configuration can.

  • Half-planes in R2\mathbb{R}^2 (lines): VC dimension =3= 3. Three points in general position can be shattered by lines; four points cannot (the XOR-like configuration is unrealizable).

  • Linear classifiers in Rd\mathbb{R}^d: VC dimension =d+1= d + 1 (with bias term).

  • Neural networks of architecture with WW parameters: VC dimension scales roughly as O(WlogW)O(W \log W) for binary classification (Bartlett et al., 1998; tighter bounds exist for specific architectures).

VC dimension is a property of the class, not of any particular training set. It measures the class’s expressive power in a worst-case combinatorial sense.

The growth function and Sauer-Shelah

A more refined object than VC dimension is the growth function τH(m)\tau_{\mathcal{H}}(m) - the maximum number of distinct labelings H\mathcal{H} can realize on any mm-point set. By definition:

τH(m)=maxx1,,xmX{(h(x1),,h(xm)):hH}.\tau_{\mathcal{H}}(m) = \max_{x_1, \ldots, x_m \in \mathcal{X}} \big|\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}\big|.

If H\mathcal{H} shatters some mm-point set, τH(m)=2m\tau_{\mathcal{H}}(m) = 2^m (every labeling realizable). Otherwise τH(m)<2m\tau_{\mathcal{H}}(m) < 2^m. VC dimension is equivalently the largest mm for which τH(m)=2m\tau_{\mathcal{H}}(m) = 2^m.

The Sauer-Shelah lemma (Sauer, 1972; Shelah, 1972) is the crucial structural result: if VC(H)=d\text{VC}(\mathcal{H}) = d, then for all mdm \geq d,

τH(m)i=0d(mi)(emd)d.\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d.

In words: once we exceed the shattering threshold, the growth function transitions from 2m2^m (exponential in mm) to polynomial in mm of degree dd. This is a sharp phase transition. It is what makes VC dimension the right notion of capacity: the growth function below dd can be exponential, but above dd it is forever polynomial.

The uniform convergence theorem

The central theorem of classical learning theory. Roughly: if a class has finite VC dimension, then with high probability over the training set, every hypothesis in the class has empirical risk close to its true risk.

Formally: for a class H\mathcal{H} with VC dimension dd, for any distribution D\mathcal{D} over X×{0,1}\mathcal{X} \times \{0, 1\}, with probability at least 1δ1 - \delta over the random draw of a training set SS of size nn from D\mathcal{D}:

suphHLS(h)LD(h)O ⁣(dlog(n/d)+log(1/δ)n).\sup_{h \in \mathcal{H}} \big| L_S(h) - L_{\mathcal{D}}(h) \big| \leq O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right).

Reading this: the worst gap between empirical and true risk, over all hypotheses in the class, goes to zero as d/n\sqrt{d / n} (up to log factors). To make the gap small, we need nn growing linearly with dd.

Combined with ERM (§3), this gives the fundamental theorem of statistical learning (for binary classification with 0-1 loss): the realizable PAC sample complexity is

mH(ϵ,δ)=O ⁣(d+log(1/δ)ϵ),m_{\mathcal{H}}(\epsilon, \delta) = O\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right),

and the agnostic sample complexity is

mHag(ϵ,δ)=O ⁣(d+log(1/δ)ϵ2).m^{\mathrm{ag}}_{\mathcal{H}}(\epsilon, \delta) = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right).

Finite VC dimension is both necessary and sufficient for PAC-learnability (Vapnik, Chervonenkis, 1971; Blumer et al., 1989). This is one of the most consequential theorems in classical learning theory: it ties learnability to a single combinatorial number.

The proof strategy: symmetrization

The proof of the uniform-convergence theorem uses a technique called symmetrization, which is worth sketching because the same technique reappears throughout learning theory (in Rademacher complexity, §5, especially).

The challenge: we need to bound the deviation of LS(h)L_S(h) from LD(h)L_{\mathcal{D}}(h) uniformly over the infinite class H\mathcal{H}. The expectation LD(h)L_{\mathcal{D}}(h) is hard to work with directly. Symmetrization replaces it with a second, “ghost” training set SS' drawn independently from the same distribution. Roughly:

  • The deviation suphLS(h)LD(h)\sup_h |L_S(h) - L_{\mathcal{D}}(h)| is bounded by suphLS(h)LS(h)\sup_h |L_S(h) - L_{S'}(h)| (with appropriate constants).

  • On a fixed combined set SSS \cup S' of size 2n2n, the class H\mathcal{H} realizes at most τH(2n)\tau_{\mathcal{H}}(2n) distinct labelings.

  • The deviation suphLS(h)LS(h)\sup_h |L_S(h) - L_{S'}(h)| over this finite (polynomially-large) set of effective hypotheses is bounded by standard concentration (Hoeffding’s inequality + union bound).

Applying Sauer-Shelah to bound τH(2n)\tau_{\mathcal{H}}(2n) by (2en/d)d(2en/d)^d and tracking constants yields the theorem.

The key technical move: replacing the infinite class with the effective finite class of labelings on a fixed sample. This is the whole point of the growth function - it captures the effective size of H\mathcal{H} as seen by any finite sample.

A canonical pseudocode pattern

The theory above gives an analysis of ERM, not a new algorithm. But the structural argument it suggests appears repeatedly:

ANALYSIS: Why ERM generalizes (sketch)

INPUT: hypothesis class H, training set S of size n drawn from D
       VC dimension d of H

1. Run ERM on S: h_hat = argmin over h in H of L_S(h).
2. Empirical risk L_S(h_hat) is small (zero in realizable case;
   close to L*_H in agnostic case).
3. By uniform convergence: with probability >= 1 - delta,
   |L_S(h) - L_D(h)| <= eps for ALL h in H simultaneously,
   where eps = O(sqrt(d log(n/d) / n)).
4. Therefore L_D(h_hat) <= L_S(h_hat) + eps.
5. Conclusion: ERM returns a hypothesis whose true risk is within
   eps of the best the class can do.

The whole machinery - VC dimension, growth function, Sauer-Shelah, symmetrization, uniform convergence - exists to make step 3 rigorous.

Why uniform convergence fails for deep learning

This is the central problem the chapter has been building toward. For modern deep networks, the uniform-convergence theorem gives bounds that are vacuous - the right-hand side is much larger than 1, so the bound says nothing.

The reason. A typical modern network has W109W \sim 10^9 parameters or more. VC-dimension bounds for neural networks scale at least linearly with WW. A training set might have n106n \sim 10^6 examples. The ratio d/nd/n is enormous, and the bound d/n\sqrt{d / n} is much greater than 1 - meaning the theorem only guarantees that empirical and true risk differ by at most something far larger than the maximum possible difference of 1. It is a true statement, but vacuous.

For decades this was viewed as a bound looseness problem - surely a tighter bound is possible. The shock from Zhang et al. (2017), discussed in §2, and especially the follow-up result from Nagarajan and Kolter (2019) “Uniform convergence may be unable to explain generalization in deep learning,” changed the picture. Nagarajan and Kolter showed:

In natural deep-learning settings, any uniform-convergence bound (not just the VC-dimension one) is provably vacuous. The looseness is not a defect of a particular technique; it is intrinsic to the uniform-convergence approach.

The argument structure (sketch): consider a training procedure that, for any training set of size nn, produces a network that fits the training set perfectly. Now consider a modification of the training procedure that, on a special “adversarial” sample, produces a hypothesis whose error on a fresh sample of the same distribution is very high. The class of hypotheses producible by the training procedure must contain both - but then no uniform-convergence bound can be small, because the supremum over the class is dominated by the bad hypothesis. Empirically, the training procedure rarely produces the bad hypothesis; uniform convergence cannot exploit this fact.

The implication: the success of deep-learning generalization cannot be explained by worst-case capacity bounds applied to the class of all possible trained networks. It must be explained by something about which networks the training procedure actually finds. This is the entry point for everything in §5§8: data-dependent bounds (margin, PAC-Bayes), information-theoretic bounds, and the modern generalization puzzle.

Where this leaves VC theory

VC theory remains a correct and important piece of classical learning theory. It establishes:

  • A clean characterization of PAC-learnability (finite VC dimension is necessary and sufficient).

  • A worst-case capacity measure that is the right notion for many problem classes (intervals, half-planes, decision trees of bounded depth, polynomial classifiers).

  • A proof technique (symmetrization) that is the backbone of modern data-dependent generalization bounds.

What it does not do, in the modern regime, is explain why deep networks generalize. That is not a flaw in the theorem; it is a feature of the regime. The theorem is about the worst case over a class. The deep-learning regime is not worst-case; it is structured by the optimizer and the data in ways that worst-case bounds cannot see. The remaining sections of this chapter develop the frameworks that can.


§5. Rademacher Complexity and Margin-Based Bounds

What problem Rademacher complexity is solving

VC dimension (§4) is a combinatorial measure of a hypothesis class. It is computed without reference to the data distribution or the actual training set. For many problems this is exactly the right notion; for others it is too coarse.

The problem case: a hypothesis class that, in worst case, has very large VC dimension but, on the actual distribution and the actual training data, behaves much more simply. Linear classifiers in high-dimensional spaces with a margin (a gap between the positive and negative classes) are the canonical example. The VC dimension is d+1d + 1 where dd is the ambient dimension, which can be enormous. But if the data has a positive margin, only a small subspace of separating hyperplanes is consistent with the data; the effective complexity is much smaller.

Rademacher complexity captures this by averaging over the data rather than maximizing over a worst-case sample. It is a data-dependent and distribution-dependent capacity measure.

Definition

Given a training set S={x1,,xn}S = \{x_1, \ldots, x_n\} and a class F\mathcal{F} of real-valued functions f:XRf: \mathcal{X} \to \mathbb{R}, the empirical Rademacher complexity is

R^S(F)=Eσ ⁣[supfF1ni=1nσif(xi)],\hat{R}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \!\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right],

where σ=(σ1,,σn)\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n) are independent Rademacher variables - each σi\sigma_i is +1+1 or 1-1 with probability 1/21/2 each. The population (or “expected”) Rademacher complexity is

Rn(F)=ESDn ⁣[R^S(F)].R_n(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^n}\!\left[\hat{R}_S(\mathcal{F})\right].

Reading the empirical version: we generate a random ±1\pm 1 “label” σi\sigma_i for each training input, then ask how well the best function in F\mathcal{F} can correlate with these random labels on average. If F\mathcal{F} is rich enough to track any pattern of labels, the supremum is large; the class can fit noise, and we should worry about overfitting. If F\mathcal{F} is restricted enough that no function correlates well with random noise on average, the supremum is small.

The mental picture: Rademacher complexity is the fraction of random noise the class can fit. A class that can fit half the noise has complexity 1/2\sim 1/2; a class that can only fit a small fraction has complexity that vanishes as nn \to \infty.

The Rademacher generalization bound

The central theorem (McDiarmid plus standard concentration): for a class F\mathcal{F} of functions bounded in [0,1][0, 1], with probability at least 1δ1 - \delta over the draw of SS,

supfFED[f]E^S[f]2Rn(F)+O ⁣(log(1/δ)n).\sup_{f \in \mathcal{F}} \big| \mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f] \big| \leq 2 R_n(\mathcal{F}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right).

The empirical version of the bound replaces RnR_n with R^S\hat{R}_S (with a small extra concentration term). For a binary classification loss class derived from a hypothesis class H\mathcal{H}, this gives a generalization bound for ERM in terms of Rademacher complexity.

Key features:

  • The bound is data-dependent: R^S(F)\hat{R}_S(\mathcal{F}) depends on the actual training set, so it can be tighter than the worst-case VC bound when the data has favourable structure.

  • The bound is distribution-aware: through the expectation over SDS \sim \mathcal{D}, Rn(F)R_n(\mathcal{F}) reflects the actual data distribution.

  • It recovers VC bounds in the worst case: Rn(H)=O(dVC/n)R_n(\mathcal{H}) = O(\sqrt{d_{\text{VC}}/n}) up to log factors, so Rademacher is at least as good as VC asymptotically.

Margin-based bounds

The second motivation for moving past VC dimension is margin. In binary classification with a real-valued score function ff (e.g., the signed distance from a separating hyperplane, or the pre-softmax logit of a network), a positive example with score f(x)=+5f(x) = +5 is classified more confidently than one with f(x)=+0.01f(x) = +0.01 - both are correctly classified, but the second is on the boundary.

The margin of a correctly-classified example (x,y)X×{1,+1}(x, y) \in \mathcal{X} \times \{-1, +1\} is yf(x)y \cdot f(x) - positive when the prediction is correct, and its magnitude measures the confidence of the prediction. The margin classifier idea: if a training set is classified with large margin everywhere, the classifier is “far from the decision boundary” and intuitively more robust.

The margin theorem (Schapire, Freund, Bartlett, Lee, 1998 for boosting; Bartlett, 1998 for neural networks; Koltchinskii and Panchenko, 2002 for the general statement) makes this rigorous. For a class of real-valued classifiers F\mathcal{F}, a fixed margin parameter γ>0\gamma > 0, and the margin loss γ(f,x,y)=1[yf(x)<γ]\ell_{\gamma}(f, x, y) = \mathbb{1}[y \cdot f(x) < \gamma] (the fraction of examples with margin less than γ\gamma), with high probability over SS:

LD0/1(f)L^Sγ(f)+2γRn(F)+O ⁣(log(1/δ)n).L_{\mathcal{D}}^{0/1}(f) \leq \hat{L}_S^{\gamma}(f) + \frac{2}{\gamma} R_n(\mathcal{F}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right).

Reading this: the zero-one test error of ff is bounded by the empirical margin loss (the fraction of training examples with margin less than γ\gamma) plus a complexity term scaled by 1/γ1/\gamma. A classifier with consistent large margin on the training data (L^Sγ\hat{L}_S^{\gamma} small for large γ\gamma) generalizes - even if the underlying class has high VC dimension, as long as the relevant scaled Rademacher complexity is small.

This explained why boosting (which provably increases margins) and SVMs (which maximize margin by construction) generalize even with very rich underlying classifiers. The theorem decoupled “fitting” (which uses the rich class) from “generalization” (which uses the margin).

Norm-based bounds for neural networks

The natural question after the margin breakthrough: can margin-style bounds explain neural network generalization? Classical VC bounds are vacuous for deep networks; could Rademacher-with-margin bounds be tight?

Bartlett, Foster, Telgarsky (2017) “Spectrally-normalized margin bounds for neural networks” gave the first serious answer. For a feedforward network fWf_W with weights W=(W1,,WL)W = (W_1, \ldots, W_L) and ReLU activations, they bound the Rademacher complexity in terms of spectral norms and Frobenius norms of the weight matrices. The qualitative result: the relevant complexity term is something like

Rn(F)    1n=1LWop(=1LW2,12/3Wop2/3)3/2,R_n(\mathcal{F}) \;\lesssim\; \frac{1}{\sqrt{n}} \prod_{\ell=1}^{L} \|W_\ell\|_{\text{op}} \cdot \left(\sum_{\ell=1}^{L} \frac{\|W_\ell^\top\|_{2,1}^{2/3}}{\|W_\ell\|_{\text{op}}^{2/3}}\right)^{3/2},

where op\|\cdot\|_{\text{op}} is the operator (spectral) norm and 2,1\|\cdot\|_{2,1} is a matrix norm penalizing rows. The expression is technical; the point is that it depends on norms of the trained weights, not on combinatorial counts of network parameters.

The result is norm-based: networks with small weight norms generalize, networks with large weight norms have weaker guarantees. The bound exploits the empirical observation that SGD-trained networks tend to have controlled norms - a property of the trained network, not of the class of all possible networks of this architecture.

Neyshabur, Bhojanapalli, Srebro (2017–2018) developed parallel and tighter analyses with PAC-Bayes machinery (closer in spirit to §6). The qualitative picture from this line of work: deep-learning generalization can be related to post-hoc properties of the trained network (margin, norms, sharpness) in ways that VC dimension cannot reach.

Where these bounds are tight, and where they are not

The honest accounting. Norm-based and margin-based bounds for neural networks are much tighter than VC bounds - they are not vacuous in the way VC bounds are. But:

  • For practical-size networks, the numerical bounds are often still very loose. They predict generalization gaps far larger than the actual gap.

  • The bounds depend on quantities (spectral norms, margins) that are not directly controlled during training. SGD with weight decay tends to produce networks with controlled norms, but the link is empirical, not algorithmic.

  • The relevant constants involve the depth of the network multiplicatively, which weakens bounds for very deep networks.

  • The bounds say very little about why training procedures find solutions with the favourable properties - they explain “given a low-norm low-margin-loss solution, it generalizes,” but not “why does training find such a solution?”

That last gap is the implicit bias question (§8): what is it about SGD that makes it find solutions whose norms and margins lead to good generalization, when many other solutions in the class would not? Rademacher-based theory is silent on this.

The pseudocode pattern: data-dependent bound computation

Unlike VC bounds, Rademacher-based bounds can be estimated from data:

COMPUTE EMPIRICAL RADEMACHER ESTIMATE

INPUT: training set S = (x_1, ..., x_n), function class F
       number of Rademacher samples M
       optimizer for sup_{f in F}

1. For trial t = 1, ..., M:
   a. Draw sigma^(t) = (sigma_1, ..., sigma_n) with each
      sigma_i in {-1, +1} uniformly.
   b. Solve: f_t = argmax over f in F of
                  (1/n) sum_i sigma_i^(t) f(x_i)
   c. Record value v_t = (1/n) sum_i sigma_i^(t) f_t(x_i)

2. Return (1/M) sum_t v_t as estimate of R_hat_S(F).

The expensive part is step 1b - maximizing the correlation between ff and a random Rademacher vector. For simple classes (linear, kernel methods) this is tractable; for neural networks it is itself a non-trivial optimization. In practice, Rademacher complexity is rarely computed directly for deep networks; it serves more as a conceptual anchor for data-dependent bounds, with the actual bounds tightened through norm-based or PAC-Bayes (§6) machinery.


§6. PAC-Bayes

What problem PAC-Bayes is solving

§3§5 have all been about bounding the generalization gap of a deterministic hypothesis returned by a learning algorithm. PAC-Bayes shifts perspective: instead of returning a single hypothesis, the learner returns a distribution over hypotheses (a “posterior”). The bound depends on how far this posterior has moved from a fixed prior distribution.

Two motivations. First, Bayesian-flavoured learning algorithms (Gaussian processes, Bayesian neural networks, ensembles, dropout-as-Bayesian-approximation) naturally produce posterior distributions; we want bounds appropriate to them. Second - and decisively for deep learning - even a deterministic algorithm can be analyzed by considering a small perturbation distribution around the returned hypothesis as a posterior; the resulting bound can be much tighter than VC or Rademacher bounds.

PAC-Bayes was introduced by David McAllester (1999) “PAC-Bayesian Model Averaging” with parallel work by Shawe-Taylor and Williamson (1997). For two decades it was a refinement; in 2017 it became the central framework for non-vacuous deep-learning bounds (Dziugaite and Roy).

The framework: prior, posterior, KL divergence

Three objects.

  • Prior PP: a distribution over the hypothesis class H\mathcal{H}, chosen before the learner sees the data. The prior encodes what hypotheses the learner would have considered plausible without any data. Important: PP does not have to be the Bayesian “true prior”; it is a technical object that the learner specifies for the bound to apply.

  • Posterior QQ: a distribution over H\mathcal{H}, returned by the learner after seeing the training set SS. QQ encodes the learner’s data-informed beliefs over hypotheses.

  • KL divergence KL(QP)\text{KL}(Q \| P): a measure of how much QQ has moved away from PP. Defined as KL(QP)=EhQ[logQ(h)/P(h)]\text{KL}(Q \| P) = \mathbb{E}_{h \sim Q}[\log Q(h) / P(h)], the expected log-ratio under QQ. KL is non-negative, zero iff Q=PQ = P, and grows as QQ concentrates on regions that PP assigns low probability.

The intuition for the bound that comes next: if the posterior QQ has moved a lot from the prior PP (large KL), the learner has “used a lot of information” from the data, and we should expect more overfitting. If QQ is close to PP (small KL), the learner has barely deviated from its prior beliefs, and there is little overfitting risk. The KL acts as the complexity penalty.

For a posterior QQ over H\mathcal{H}, define the expected risks under QQ:

LD(Q)=EhQ[LD(h)],LS(Q)=EhQ[LS(h)].L_{\mathcal{D}}(Q) = \mathbb{E}_{h \sim Q}[L_{\mathcal{D}}(h)], \quad L_S(Q) = \mathbb{E}_{h \sim Q}[L_S(h)].

These are the average true risk and empirical risk over a random draw from QQ.

The PAC-Bayes bound (McAllester form)

The central result. For any fixed prior PP (independent of SS), with probability at least 1δ1 - \delta over the draw of SS, for every posterior QQ over H\mathcal{H} (which may depend on SS):

LD(Q)LS(Q)+KL(QP)+log(2n/δ)2(n1).L_{\mathcal{D}}(Q) \leq L_S(Q) + \sqrt{\frac{\text{KL}(Q \| P) + \log(2\sqrt{n}/\delta)}{2(n - 1)}}.

Reading this: the expected true risk under QQ is bounded by the empirical risk plus a complexity term that grows with KL(QP)\text{KL}(Q \| P).

The crucial property: the bound holds uniformly over all posteriors QQ. The learner can choose QQ after seeing the data - the bound still holds for whatever QQ the learner returns. This is what makes PAC-Bayes data-dependent in a stronger sense than Rademacher: the complexity penalty is computed on the specific QQ the learner produces, not on a worst-case class.

A tighter version, more useful for empirical bounds, is the Seeger-Langford form:

kl(LS(Q)LD(Q))KL(QP)+log(2n/δ)n,\text{kl}(L_S(Q) \,\|\, L_{\mathcal{D}}(Q)) \leq \frac{\text{KL}(Q \| P) + \log(2\sqrt{n}/\delta)}{n},

where kl(pq)\text{kl}(p \| q) on the left is the binary KL divergence between Bernoulli distributions with means pp and qq. This form is tighter than the McAllester form when LS(Q)L_S(Q) is close to zero (which is the deep-learning regime).

A concrete instantiation: deep networks as Gaussian posteriors

The application that produced the modern PAC-Bayes revival. Dziugaite and Roy (2017) “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data” applied PAC-Bayes to deep networks as follows.

The setup. Train a neural network with SGD to produce weights WW^*. Define a posterior over networks by perturbing WW^* with Gaussian noise: QQ is the distribution of W+ξW^* + \boldsymbol{\xi} where ξ\boldsymbol{\xi} has independent Gaussian entries with learned variances σpost2\sigma^2_{\text{post}}. Define the prior as another Gaussian centered at the initialization W0W_0 with variance σprior2\sigma^2_{\text{prior}}. (Centring the prior at W0W_0 is a key technical choice; this is allowed because W0W_0 is chosen before seeing the data.) The KL between two diagonal-covariance Gaussians has a closed form.

The algorithm (sketch):

NON-VACUOUS PAC-BAYES BOUND (Dziugaite & Roy 2017, sketch)

INPUT: training set S, network architecture, initialization W_0
       confidence delta

1. Train W^* by SGD on S, starting from W_0.

2. Define posterior Q as Gaussian centered at W^*, variance sigma_post^2
   (learnable, vector-valued one per parameter).

3. Define prior P as Gaussian centered at W_0, variance sigma_prior^2.

4. Optimize sigma_post (and effectively the bound) by minimizing
   the PAC-Bayes RHS:
        L_S(Q) + sqrt[ (KL(Q || P) + log(2 sqrt(n) / delta)) / (2(n-1)) ]
   where L_S(Q) is estimated by Monte Carlo sampling from Q.

5. Report the optimized bound value as a valid upper bound on L_D(Q).

OUTPUT: a numerical bound on the test error of the stochastic
        network Q.

The remarkable result. On MNIST with a network of 6105\sim 6 \cdot 10^5 parameters and n=6104n = 6 \cdot 10^4 examples (more parameters than training data - the regime where VC bounds are vacuous), Dziugaite and Roy obtained PAC-Bayes bounds with numerical values around 17% test error - a bound that is correct, provable, and predictive of the empirical test error. This was the first non-vacuous generalization bound for an overparameterized deep network.

The technique has been refined and extended substantially since (Pérez-Ortiz et al., 2021; Neyshabur et al., 2017–2018; Lotfi et al., 2022, “PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization”, which gives bounds within a few percent of empirical performance on small networks). PAC-Bayes remains, as of 2026, the most empirically tight classical-style generalization framework for deep networks.

Compression-based bounds and flat minima

Two flavours of PAC-Bayes are particularly active.

Compression bounds. The intuition: if the trained network can be compressed to a small description (few effective bits) with little loss of accuracy, the effective hypothesis class is small, and the bound tightens. Arora et al. (2018) gave compression-based PAC-Bayes bounds; Lotfi et al. (2022) tightened these substantially using modern compression-style priors. The bound mechanically reads “if I can describe my trained network in kk bits, my effective KL is O(k)O(k), my bound is small.”

Flat minima. A long-standing empirical observation (Hochreiter and Schmidhuber, 1997; Keskar et al., 2017): networks that converge to “flat” regions of the loss surface (regions where small perturbations of weights cause little increase in loss) generalize better than networks that converge to “sharp” minima. The PAC-Bayes link: a flat minimum is one where a Gaussian posterior with large variance still has low expected empirical loss - meaning LS(Q)L_S(Q) is low even for a high-variance QQ, which keeps the KL small. PAC-Bayes formalizes the flat-minima intuition; the converse direction (does flatness cause generalization?) remains contested but is the subject of substantial active work.

Status in 2026

PAC-Bayes is, today, the framework where rigorous classical-style theory and practical deep-learning generalization come closest to making contact. Its results are:

  • Non-vacuous on overparameterized networks (where VC bounds are vacuous).

  • Numerical (the bound has a specific real-number value, computable from the data).

  • Predictive (the bound tracks empirical test error in many settings).

  • Data-dependent in the strongest sense (the bound depends on the specific QQ returned by the learner).

The limitations are also clear:

  • The bounds still substantially overestimate test error for the largest models.

  • Computing tight bounds requires careful instantiation (prior choice, posterior structure, perturbation analysis).

  • The framework explains “given this QQ, here is the bound”; it does not explain why training procedures produce posteriors with small KL.

  • The framework cannot easily handle deterministic learners; the move to stochastic posteriors is a technical step that is not always natural.

PAC-Bayes interacts with information-theoretic bounds (§7) in subtle ways - both frameworks are data-dependent and learner-dependent, both can produce non-vacuous bounds, and several recent unification results bridge them. We turn to the information-theoretic framework next.


§7. Information-Theoretic Generalization

The conceptual move

PAC-Bayes (§6) bounds the gap between empirical and true risk in terms of KL(QP)\text{KL}(Q \| P) - the divergence of the learner’s posterior from a prior. The information-theoretic framework generalizes the conceptual move: bound generalization in terms of how much information the learned hypothesis carries about the training set.

The intuition. A learner that uses all the information in its training data to fit idiosyncratic patterns will overfit. A learner that extracts only the information relevant to the task, and discards the rest, generalizes. This is the information bottleneck intuition (Tishby, Pereira, Bialek, 1999; Tishby and Zaslavsky, 2015 for deep networks): generalization corresponds to compression of the training data.

The information-theoretic generalization framework, introduced in modern form by Russo and Zou (2016) “Controlling Bias in Adaptive Data Analysis Using Information Theory” and Xu and Raginsky (2017) “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms,” makes this rigorous via mutual information.

Notation: mutual information

The mutual information I(X;Y)I(X; Y) between two random variables is a measure of how much knowing XX reduces uncertainty about YY (or symmetrically, vice versa). Formally,

I(X;Y)=Ex,y ⁣[logp(x,y)p(x)p(y)]=H(Y)H(YX),I(X; Y) = \mathbb{E}_{x, y}\!\left[\log \frac{p(x, y)}{p(x) p(y)}\right] = H(Y) - H(Y \mid X),

where HH is Shannon entropy. Mutual information is symmetric, non-negative, and zero iff XX and YY are independent. Higher I(X;Y)I(X; Y) means stronger statistical dependence.

For a learning algorithm AA that maps a training set SS to a hypothesis hh, the relevant quantity is I(S;A(S))I(S; A(S)) - the mutual information between the training set and the hypothesis the algorithm returns. How much does the returned hypothesis tell you about the training set? A learner that depends very little on the training set has low I(S;A(S))I(S; A(S)); a learner that memorizes the training set has high I(S;A(S))I(S; A(S)).

The Xu-Raginsky bound

The central result. For a learning algorithm AA with σ\sigma-subgaussian loss (a technical condition satisfied by bounded losses such as 0-1):

E[LD(A(S))LS(A(S))]2σ2nI(S;A(S)).\big| \mathbb{E}[L_{\mathcal{D}}(A(S)) - L_S(A(S))] \big| \leq \sqrt{\frac{2 \sigma^2}{n} \cdot I(S; A(S))}.

Reading this: the expected generalization gap (the average over training sets of the difference between true and empirical risk) is bounded by something that scales with I(S;A(S))/n\sqrt{I(S; A(S)) / n}. If the mutual information is small (the algorithm doesn’t “remember” much about its training set), generalization is good. If the mutual information is large (the algorithm “remembers” the training set in detail), generalization may be poor.

The bound has appealing properties:

  • It is algorithm-dependent - different algorithms on the same hypothesis class can have different bounds.

  • It directly connects to information-theoretic privacy notions: a learner that satisfies differential privacy (Dwork et al., 2006) has bounded I(S;A(S))I(S; A(S)), and therefore bounded generalization gap, automatically.

  • It directly connects to algorithmic stability (Bousquet and Elisseeff, 2002): a learner that is stable to small training-set perturbations has bounded mutual information.

Refinements: sample-conditioned and individual-sample bounds

The Xu-Raginsky bound depends on the total mutual information between SS and A(S)A(S). In some regimes this is very large (worst case is nlogHn \log |\mathcal{H}|, which is large for big classes). Refinements decompose the mutual information into smaller terms.

Bu, Zou, Veeravalli (2020) showed that the generalization gap can be bounded by individual-sample mutual information:

E[LD(A(S))LS(A(S))]1ni=1n2σ2I(zi;A(S)),\big| \mathbb{E}[L_{\mathcal{D}}(A(S)) - L_S(A(S))] \big| \leq \frac{1}{n} \sum_{i=1}^{n} \sqrt{2 \sigma^2 \cdot I(z_i; A(S))},

where zi=(xi,yi)z_i = (x_i, y_i) is the ii-th training example. Reading this: the gap is controlled by the average information about a single training example carried by the returned hypothesis. This is much tighter than the total mutual information in many settings.

Negrea et al. (2019) and Steinke and Zakynthinou (2020) developed sample-conditioned and conditional mutual information (CMI) bounds that further refine the framework. The CMI bound, in particular, is one of the tightest known information-theoretic generalization bounds in many settings.

Connections to other frameworks

The information-theoretic framework is not isolated. Key connections:

  • PAC-Bayes (§6). When AA outputs a posterior QQ rather than a single hypothesis, I(S;Q)I(S; Q) can be related to ES[KL(Q(S)P)]\mathbb{E}_S[\text{KL}(Q(S) \| P)] for an appropriate prior PP. Several PAC-Bayes bounds and information-theoretic bounds become formally equivalent (or one bound implies the other up to constants) under such identifications. The two frameworks are unified at a deeper level, though their natural application regimes differ.

  • Differential privacy. An (ϵ,δ)(\epsilon, \delta)-differentially private algorithm has mutual information bounded by O(ϵ2)O(\epsilon^2) (roughly), giving an automatic generalization bound. This is the generalization-by-privacy observation that motivated much of the modern adaptive data analysis literature.

  • Stability. Uniform algorithmic stability (small change in A(S)A(S) when one example in SS is changed) gives bounded mutual information, hence bounded generalization gap. The framework gives a clean argument why stability implies generalization.

These connections matter both conceptually (the four frameworks - PAC, PAC-Bayes, information-theoretic, stability - are different views of the same underlying structure) and practically (proving a stability or privacy property of an algorithm immediately yields a generalization bound).

Limits of the framework

The information-theoretic framework has well-known counterexamples and limits.

The “infinite mutual information, good generalization” problem. For deterministic learning algorithms that interact with continuous data, I(S;A(S))I(S; A(S)) can be technically infinite (the returned A(S)A(S) uniquely determines SS within precision limits, so the mutual information is unbounded). Yet such algorithms can generalize well empirically. The bound is then vacuous. Two responses: add infinitesimal noise to randomize the algorithm (which reduces mutual information but changes the algorithm), or use refined bounds (sample-conditioned, chaining) that avoid the pathology.

Chaining and refined bounds. Recent work uses chaining arguments - bounding mutual information across a sequence of progressively refined hypothesis-class scales - to produce bounds that avoid the infinite-MI pathology and are competitive with state-of-the-art PAC-Bayes bounds on real networks. The technical machinery here is substantial and is still evolving as of 2026.

Computability. As with PAC-Bayes, the bounds are only useful if the relevant information-theoretic quantities can be estimated or upper-bounded. For complex algorithms (SGD on deep networks), estimating I(S;A(S))I(S; A(S)) is non-trivial; the bounds are often stated as conditional results.

Status in 2026

The information-theoretic framework is, alongside PAC-Bayes, one of the two most actively developed lines of rigorous generalization theory. The two frameworks are increasingly understood as views of the same underlying picture; the practical question is which framework gives the tightest bound in a specific setting. For deep networks, sample-conditioned and chaining-based information-theoretic bounds are now competitive with PAC-Bayes bounds in some regimes - though PAC-Bayes still tends to produce the numerically tightest non-vacuous bounds when carefully instantiated (§6).

The framework also has the appealing property of connecting generalization to information-theoretic privacy and stability - two other notions of “good behaviour of a learning algorithm” that are interesting on their own terms and turn out to imply generalization automatically.


§8. The Modern Generalization Puzzle

The puzzle

This is the longest single section of the chapter because it is the topic where theoretical work has been most active for the last decade and where the conceptual landscape is most fragmented. We develop the puzzle, then survey four partial accounts (NTK, double descent, implicit bias, benign overfitting). We follow the editorial discipline of presenting positions as positions, not as settled facts.

The puzzle, in one sentence: modern deep networks generalize well in regimes where classical learning theory predicts they should not. The four pieces of evidence that constitute the puzzle:

  1. Memorization without overfitting (Zhang et al., 2017; reviewed in §2). Networks can fit random labels yet generalize on real labels. Classical capacity-based theory cannot distinguish the two regimes.

  2. Overparameterization helps (Belkin et al., 2019; Nakkiran et al., 2020). Increasing model size past the point where the model perfectly fits the training data continues to decrease test error. This is the double descent phenomenon.

  3. Optimizer matters (multiple papers, 2017–onward). Among the many networks that fit the training data, SGD finds ones that generalize; other optimizers (or even other learning rates within SGD) find solutions with worse generalization. Classical theory does not see “which solution among the many” the optimizer reaches.

  4. Feature learning matters (Yang and Hu, 2021; Petrini et al.; many others). In the regime where networks generalize best, they appear to learn features (representations that match task structure), not just fit the labels. Lazy-regime accounts (NTK) that exclude feature learning capture only part of the empirical picture.

We develop each in turn.

The Zhang et al. shock, revisited

We covered the experiment in §2. The implication, stated precisely:

For the class of all networks of a fixed architecture, VC(H)\text{VC}(\mathcal{H}) is at least nn (since the class can fit any labels on any nn-point dataset). Any uniform-convergence-based bound of the form dVC(H)/n\sqrt{d_{\text{VC}}(\mathcal{H})/n} is at least 11 - vacuous. Nagarajan and Kolter (2019), as discussed in §4, strengthened this: in natural settings, any uniform-convergence-style bound for SGD-trained networks is vacuous.

The conceptual move forced by this result. Generalization must be explained by properties of the trained network and the training procedure, not by capacity of the class. The remaining frameworks in §5§7 already make this move (Rademacher with margins, PAC-Bayes, information-theoretic). The puzzle is whether any such framework actually closes the gap with empirical practice - and what mechanism in the training procedure produces the favourable properties the frameworks require.

Double descent

Belkin, Hsu, Ma, Mandal (2019) “Reconciling modern machine-learning practice and the classical bias-variance trade-off” gave the cleanest characterization. They plotted test error vs model complexity for a range of models on standard datasets. The shape they found:

   TEST ERROR vs MODEL CAPACITY (schematic)

   error
     │       *
     │      * *           classical regime         modern regime
     │     *   *         (bias-variance U)        (over-parameterized
     │    *     *                                  monotone-decreasing)
     │   *       *  *  *  *                      
     │  *           interpolation       *
     │ *           threshold (Hat-line)    *
     │*                       │                *
     │_________________________│___________________*___*___*___*___ capacity
                              n
            UNDER-PARAMETERIZED       OVER-PARAMETERIZED
            (classical U-curve)       (second descent)

The shape has two regimes separated by the interpolation threshold - the capacity at which the model can exactly fit the training data:

  • Below the threshold: the classical bias-variance U-curve. Test error decreases with capacity (less bias), reaches a minimum, then increases with capacity (more variance, overfitting). The textbook curve.

  • At the threshold: test error reaches a peak - the worst point of the entire curve. This is sometimes called the “interpolation peak.”

  • Past the threshold: test error decreases again, often falling below the classical regime’s best point. This is the second descent.

The empirical phenomenon is well-documented. Nakkiran et al. (2020) “Deep Double Descent: Where Bigger Models and More Data Hurt” showed three variants - model-wise (varying capacity), sample-wise (varying nn), and epoch-wise (varying training time) - all exhibit the same U-then-down shape.

Theoretical accounts. The clearest theoretical understanding of double descent comes from analyzable settings: linear regression with random features (Belkin, Hsu, Xu, 2020), kernel regression (Mei and Montanari, 2019). In these settings, the second descent can be derived analytically. The mechanism: in the overparameterized regime, among the many solutions that fit the data, the minimum-norm solution (which SGD-style optimizers naturally find) has favourable structure that reduces test error.

For deep networks, the full mechanism is partially understood. Double descent in deep networks is consistent with the implicit-bias picture (next): SGD selects, among interpolating networks, ones with structural properties that improve with overparameterization rather than degrade.

What double descent falsified: the bias-variance picture as the central narrative for model selection in deep learning. The U-curve is one piece of a larger curve; the U-shape’s right-hand “bad” branch can be passed through into a regime where capacity continues to help.

Neural tangent kernel (NTK)

Jacot, Gabriel, Hongler (2018) “Neural Tangent Kernel: Convergence and Generalization in Neural Networks” gave a theoretical framework that became one of the most-studied accounts of deep-network behaviour.

The setup. Consider a neural network f(x;θ)f(x; \theta) with parameters θ\theta initialized randomly, trained with gradient descent on squared loss. In the limit of infinite width (every hidden layer has infinitely many neurons), with appropriate parameterization, the following holds:

  • The parameters θ\theta change only infinitesimally during training. The network is essentially linear in θ\theta around the initialization.

  • The training dynamics are equivalent to those of a kernel method with a specific kernel - the neural tangent kernel - that depends on the architecture and initialization but not on the data.

  • Generalization can be analyzed using classical kernel-regression theory.

The neural tangent kernel is

KNTK(x,x)=Eθ ⁣[θf(x;θ),θf(x;θ)],K_{\text{NTK}}(x, x') = \mathbb{E}_\theta\!\left[\langle \nabla_\theta f(x; \theta), \nabla_\theta f(x'; \theta)\rangle\right],

where the expectation is over the initialization distribution. In the infinite-width limit, this kernel becomes deterministic and stays fixed throughout training.

The framework is mathematically beautiful. It makes infinite-width networks analyzable with the full machinery of kernel theory and gives generalization bounds. It also explains a number of empirical phenomena (e.g., why wide networks are easier to train).

The catch. The NTK regime is also called the lazy regime: parameters barely move from initialization, the model is essentially linear, and features are not learned in any meaningful sense - the relevant representation is the random initialization’s representation.

But real-world deep learning is widely thought not to be in the lazy regime in practice. Modern networks evidently learn features: the representations they end up with are very different from the initial random ones. Chizat and Bach (2018) “On Lazy Training in Differentiable Programming” showed that the lazy regime is a specific scaling choice; with a different scaling, networks operate in a feature-learning regime that NTK theory does not describe.

The current consensus: NTK is an instructive toy regime for studying deep-network optimization. It is not a complete description of practical training. The feature-learning regime - where the puzzle’s empirical phenomena live - requires different theoretical tools, many of which are open (OP-TH-7).

Implicit bias of optimization

The empirical observation is by now well-established: among the many networks that fit a training set, the specific one SGD finds has systematic structural properties. These properties are imposed by the optimizer rather than by the loss function.

For linear models, the implicit bias is partially understood. Soudry et al. (2018) “The Implicit Bias of Gradient Descent on Separable Data” showed: for logistic regression on linearly separable data, gradient descent (run long enough) converges to the max-margin solution - the same solution an SVM would find. The loss function does not include a margin term; the optimizer imposes one.

For deep networks, the picture is much richer and only partially understood:

  • Norm-based implicit bias. In some settings, SGD on overparameterized networks converges to low-norm solutions among the many that fit the data. Low-norm solutions plug into the norm-based generalization bounds of §5 to give generalization guarantees.

  • Sparsity / low-rank bias. Several lines of work show that gradient descent on deep linear networks converges to solutions with low-rank or sparse structure (Arora et al., 2019; Gunasekar et al., 2017).

  • Flatness preference. SGD with non-trivial batch size and learning rate empirically prefers flat minima (regions where small parameter changes cause small loss increases) over sharp minima. There is theoretical machinery (modified equations, SDE limits of SGD) that partially explains this, though the connection to generalization remains under active study.

  • Edge of stability. Cohen et al. (2021) showed that SGD/GD often operates in a regime where the loss is not monotonically decreasing - at the “edge of stability” where the maximum eigenvalue of the Hessian is just below 2/η2/\eta (the stability threshold for the learning rate η\eta). This regime is incompatible with classical descent analyses and may itself be part of why SGD generalizes.

What implicit bias explains: why, in the underdetermined regime where many solutions fit the data, SGD finds favourable ones. What it does not yet explain: a quantitative, predictive theory of generalization from first principles of the optimization procedure. We have characterizations and examples; we do not have a derived bound that matches empirical generalization.

Feature learning vs lazy learning

The dichotomy worth flagging as a structural one. In lazy training (NTK regime, infinite width, appropriate scaling), networks behave as kernel methods with a kernel determined by the architecture; representations do not change. In feature learning training (more realistic regime), the internal representations evolve substantially during training and end up matched to the task.

The empirical evidence is overwhelming that real-world deep learning is in the feature-learning regime. Modern interpretability research (Olah et al.'s circuits programme; mechanistic-interpretability work) routinely identifies task-specific features in trained networks that are clearly not present in random initializations.

Theoretical work on feature learning is much less developed than on lazy learning. Yang and Hu (2021) “Feature Learning in Infinite-Width Neural Networks” gave a parameterization scheme (μP) under which the infinite-width limit itself is in the feature-learning regime, providing a tractable starting point for theoretical analysis. This is one of the more active frontiers as of 2026.

Benign overfitting

A related theoretical thread. Bartlett, Long, Lugosi, Tsigler (2020) “Benign Overfitting in Linear Regression” introduced the term. The phenomenon: in some overparameterized linear regression problems, the model exactly interpolates noisy training data (achieves zero training error) yet generalizes well (achieves test error close to the noise floor). The interpolation does not produce the harm classical theory predicts.

The technical reason. In the overparameterized regime, the minimum-norm interpolator can spread the noise across many directions in the feature space; if the data and noise structure is favourable, the noise contribution to test error is small.

Several follow-up papers extended this analysis to ridge regression, kernel methods, and shallow networks (Tsigler and Bartlett, 2023; Frei et al., 2023). The framework is currently the cleanest mathematical setting where we can prove “interpolating an overparameterized model on noisy data does not hurt” - a small but rigorous piece of the larger puzzle.

Where the puzzle stands in 2026

The honest summary:

  • The Zhang et al. shock and Nagarajan-Kolter’s strengthening rule out worst-case capacity-based explanations.

  • Data-dependent bounds (Rademacher with margin, PAC-Bayes, information-theoretic) can produce non-vacuous bounds for deep networks but are not yet predictive of empirical test error.

  • Double descent and benign overfitting are partially understood in restricted settings (linear, kernel) and are empirically robust in deep networks but lack first-principles deep-network derivations.

  • NTK gives a clean analyzable regime that is not the practical regime. Feature-learning analyses are in their early stages.

  • Implicit bias of SGD is the leading candidate for the missing piece - what selects favourable solutions from the many that fit the data - but a quantitative theory remains open.

The puzzle is partially answered. The empirical phenomena are no longer unexplained; multiple complementary partial accounts have been developed. But the field does not yet have a single unified framework that quantitatively predicts when, why, and how much overparameterized networks generalize. Open problems OP-TH-1, OP-TH-6, OP-TH-7 in §12 consolidate the still-open pieces.

Editorial note. This is a fragmented subfield. The accounts above (NTK, double descent, implicit bias, benign overfitting, feature learning) are positions and partial mechanisms, not a settled theory. A reader engaging the primary literature should expect to encounter multiple viewpoints presenting their own framework as central; the chapter’s position is that each captures a piece of a larger picture that has not yet been assembled.


§9. Scaling Laws as Empirical Regularities

Why scaling laws live in a theory chapter

Foundation Models §6 develops scaling laws as the empirical phenomenon they are: regular power-law relationships between loss, model parameters, training data, and compute. This chapter §9 develops their epistemic status. Are scaling laws theory, or are they empirical extrapolation in the shape of theory? The distinction matters because field-shaping decisions (where to invest billions of compute, which architectures to scale) currently rest on scaling-law predictions.

The position taken in this section: scaling laws are empirical regularities that take the mathematical form of theoretical predictions. They are not derived from a theory of how neural networks learn; they are fitted to training-run data and interpolated/extrapolated from those fits. This is not a criticism - empirical regularities can be enormously useful - but the epistemic status is different from a derived theoretical bound, and conflating the two has produced confusion.

The basic shape of scaling laws

The canonical result. Kaplan et al. (2020) “Scaling Laws for Neural Language Models” found that, for autoregressive language models trained to convergence, the test cross-entropy loss LL depends on three quantities - number of non-embedding parameters NN, dataset size DD (in tokens), and compute CC - through power-law relationships:

L(N)(NcN)αN,L(D)(DcD)αD,L(C)(CcC)αC,L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},

with fitted constants Nc,Dc,CcN_c, D_c, C_c (the “scale at which loss equals 1 nat”) and exponents αN0.076,αD0.095,αC0.05\alpha_N \approx 0.076, \alpha_D \approx 0.095, \alpha_C \approx 0.05 in their original fits.

Reading the formulas: doubling parameters reduces loss by a factor of 2αN0.952^{-\alpha_N} \approx 0.95 - a 5% reduction per doubling. Doubling data reduces loss by a similar small amount. The relationships hold over many orders of magnitude in NN, DD, and CC - from 103\sim 10^3 parameters to 109\sim 10^9 in the original paper, and have been extended to 1012\sim 10^{12} in subsequent work.

Chinchilla: the compute-optimal correction

Hoffmann et al. (2022) “Training Compute-Optimal Large Language Models” - the “Chinchilla” paper - was a major refinement. The setup. Given a fixed compute budget CC (in FLOPs, roughly ND\propto N \cdot D for transformers), how should CC be allocated between model size NN and training tokens DD?

Kaplan et al.'s scaling-law fits had implied that, at fixed CC, the compute-optimal allocation is heavily biased toward NN - train relatively small datasets on relatively large models. Models like GPT-3 (175B parameters, ~300B tokens) followed this prescription.

Hoffmann et al. revisited the question with a more careful experimental design, training over 400 models with varying N/DN/D ratios. Their finding: NN and DD should scale equally - for compute-optimal training, NC0.5N \propto C^{0.5} and DC0.5D \propto C^{0.5} (roughly). GPT-3-style models were dramatically undertrained - they had too many parameters for the number of tokens they were trained on.

Chinchilla itself was a 70B-parameter model trained on 1.4T tokens - much smaller and much more data than GPT-3. It substantially outperformed GPT-3 (175B, 300B tokens) on standard benchmarks despite having less than half the parameters. The Chinchilla scaling-law fits became the new convention.

The Chinchilla form of the scaling law is

L(N,D)=E+ANα+BDβ,L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},

where EE is an irreducible-loss constant (the entropy of the language distribution, lower-bounding what any model can achieve), A/NαA/N^\alpha is the contribution of finite model size, B/DβB/D^\beta is the contribution of finite data, and the fitted exponents are roughly α0.34,β0.28\alpha \approx 0.34, \beta \approx 0.28. This functional form is now standard.

Compute-optimal vs inference-optimal regimes

A further refinement, important practically. Chinchilla optimized compute spent on training. But once a model is trained, it is served - used for inference - potentially many times. If inference cost is significant compared to training cost, the right tradeoff changes.

The inference-optimal regime: train a smaller model than Chinchilla scaling would suggest, on more tokens than Chinchilla would suggest. The smaller model is more efficient at inference, and the extra training compute may pay for itself many times over in cheaper inference. This is the regime that produced Llama 2 and Llama 3 (and most modern open-weight models), which are aggressively over-trained compared to Chinchilla optimality.

There is no single scaling-law fit for the inference-optimal regime; it depends on the inference-vs-training cost ratio, which depends on deployment context.

Other scaling laws

The Kaplan/Chinchilla scaling laws are the most-developed example, but the methodology generalizes:

  • Scaling laws for transfer. Hernandez et al. (2021) studied how fine-tuning effectiveness scales with model size and pre-training compute. Power-law fits also work here, though with different exponents.

  • Scaling laws for multimodal training. Cherti et al. (2023, OpenCLIP scaling) fit power-laws for image-text contrastive training. The exponents differ from language-only but the functional form persists.

  • Scaling laws for reinforcement learning. Hilton et al. (2023) and others have fit scaling laws for RL training. The fits are noisier than for supervised learning, but the qualitative picture holds.

  • Scaling laws for downstream task performance. A subtler question: do scaling laws for next-token loss translate to scaling laws for downstream task accuracy? Often yes (Schaeffer et al., 2023 argues that apparent “emergence” of capabilities is partly an artefact of discontinuous evaluation metrics applied to smoothly improving underlying scores). The link is not airtight.

The methodology is robust across domains: train models of varying size on varying amounts of data and compute, fit a power-law functional form to the resulting loss values, use the fit to predict the loss at scales not yet trained.

What scaling laws predict, what they don’t

Scaling laws make sharp predictions in their fitted range:

  • The loss of a model of a specific size, trained on a specific amount of data, with a specific compute budget - interpolation within the fitted range - is accurate to within a few percent for models like those in the fit.

  • The compute-optimal allocation between NN and DD for a given CC - guidance for training run design - is the practical product the field uses most.

They make weaker predictions outside the fitted range:

  • Extrapolation in scale. Predicting loss at scales an order of magnitude beyond the fitted data is a bet that the power law continues. So far the power-law form has held up across many orders of magnitude (this is itself remarkable). But there is no theoretical guarantee.

  • Saturation. Scaling laws of the form L=E+A/Nα+B/DβL = E + A/N^\alpha + B/D^\beta predict that loss approaches the irreducible floor EE but never reaches it. Whether this is exactly right - whether the floor is fixed at a specific entropy value, or whether the functional form breaks down close to the floor - is open.

And they make essentially no predictions for some important questions:

  • Capability emergence. Whether and when a model will exhibit a specific capability (in-context learning, arithmetic, chain-of-thought reasoning) is not predictable from loss scaling laws alone. The emergence-of-capability literature (Wei et al., 2022; Schaeffer et al., 2023 critique) is a separate, much less settled question. OP-FM-3.

  • Out-of-distribution performance. Scaling laws are fitted on a specific data distribution; how performance on shifted distributions scales is not part of the fit.

  • Optimization difficulty. Some architectures or scales might be optimization-hard in a way that scaling laws cannot anticipate.

The epistemic status: regression fit vs derived theory

The chapter’s stance on the central question. Scaling laws are empirical regressions of a specific functional form to training-run data. The functional form (L=E+A/Nα+B/DβL = E + A/N^\alpha + B/D^\beta) is not derived from a theory of how neural networks learn; it is chosen because it fits the data well and respects scaling intuitions (irreducible floor; power-law decay). Power laws of various exponents fit empirical data routinely; the specific exponents are fitted, not predicted.

What scaling laws are not. They are not theorems. They are not derivations. They are not predictions in the sense that physics predictions are: there is no underlying mechanistic theory whose consequences they are. They are robust empirical observations, with all the strengths (replication, predictive power within their range) and weaknesses (extrapolation hazard, no mechanism) that implies.

What scaling laws are. A new kind of empirical theory - predictions of practical phenomena that take the mathematical form of theoretical predictions and serve as theory does in the field’s working practice (guiding decisions, structuring intuitions, organizing literature). The field has had to develop the vocabulary for working with empirical theories of this kind, because the same shape appears in neural-architecture-search results, compute-optimal pretraining, and even some capability emergence studies.

The cost of conflating regression with derivation. Treating scaling laws as derived theory would lead to overconfidence in extrapolations, under-appreciation of where the fits might break down, and a failure to remain alert for new empirical phenomena (saturation, regime shifts at very large scale) that the fits do not anticipate.

The benefit of treating them as empirical regularities. We retain the predictive power of the fits where they apply, while remaining open to refinement as scales grow and new architectures and training procedures are explored.

This epistemic distinction is consequential. As of 2026, billions of dollars of compute are allocated annually based on scaling-law predictions; the predictions have been right enough that this is not foolish. They are also not derived from theory, which means the inference is that the world will continue to behave as it has - a useful but bounded form of knowledge.


§10. Sample Complexity in Reinforcement Learning

Why RL deserves its own section

The frameworks of §3§8 are primarily about supervised learning - given labelled training data, return a hypothesis. Reinforcement learning (RL) is structurally different. An RL learner interacts with an environment, observes rewards, and must learn a policy. The data is collected by the learner, depends on the policy being learned, and is not labelled with “correct” actions.

This structural difference produces a substantial theoretical gap between RL theory and RL practice - even more pronounced than the supervised-learning gap. We sketch the framework here; the algorithms themselves live in the Reinforcement Learning chapter.

The MDP setup and the sample-complexity question

The standard RL formalism is the Markov Decision Process (MDP): a tuple M=(S,A,P,r,γ)\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma) with state space S\mathcal{S}, action space A\mathcal{A}, transition kernel P(ss,a)P(s' \mid s, a), reward function r(s,a)r(s, a), and discount factor γ[0,1)\gamma \in [0, 1). A policy π:SA\pi: \mathcal{S} \to \mathcal{A} (or distribution over actions) maps states to actions; its value is the expected discounted sum of rewards Vπ(s)=Eπ[t=0γtr(st,at)s0=s]V^\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s]. The goal of RL is to find an optimal policy π\pi^* maximizing VπV^\pi.

The sample-complexity question. How many interactions (steps in the environment) does a learner need to find an ϵ\epsilon-optimal policy with probability at least 1δ1 - \delta? This is the analogue of the PAC sample-complexity question for supervised learning, adapted to the interactive setting.

PAC-MDP bounds

The first generation of results. Kearns and Singh (2002) and Brafman and Tennenholtz (2002, R-MAX) introduced PAC-MDP as the framework. A learning algorithm is PAC-MDP if there is a polynomial m(S,A,1/ϵ,1/δ,1/(1γ))m(|\mathcal{S}|, |\mathcal{A}|, 1/\epsilon, 1/\delta, 1/(1-\gamma)) such that, after mm steps of interaction, the algorithm’s policy is ϵ\epsilon-optimal with probability at least 1δ1 - \delta.

The classical PAC-MDP results give bounds of the form

m=O~ ⁣(S2Aϵ3(1γ)6)m = \tilde{O}\!\left(\frac{|\mathcal{S}|^2 \, |\mathcal{A}|}{\epsilon^3 (1-\gamma)^6}\right)

(R-MAX-style analysis; tighter results in the same family by Strehl and Littman, 2008). Reading this: the sample complexity scales as S2A|\mathcal{S}|^2 |\mathcal{A}| - polynomial in the size of the state and action spaces, which is useless for any practical setting. A modern Atari game has 1060\sim 10^{60} states; the bound is astronomically loose.

The R-MAX algorithm. A canonical PAC-MDP-style algorithm. It maintains a model of the MDP; for state-action pairs it has tried many times, it uses the empirical estimate of PP and rr; for state-action pairs it has tried few times, it optimistically assumes the reward is maximal. The optimism encourages exploration; the algorithm is provably PAC-MDP. The bound is loose but the algorithmic principle - optimism in the face of uncertainty - has been enormously influential.

Regret bounds and function approximation

A different question. Rather than ask “how many samples to find an ϵ\epsilon-optimal policy?”, ask “what is the cumulative regret of the algorithm over TT steps?” - where regret is the total reward shortfall compared to the optimal policy. Regret bounds in finite MDPs scale as O~(TSA)\tilde{O}(\sqrt{T \cdot |\mathcal{S}| \cdot |\mathcal{A}|}) for algorithms like UCRL2 (Jaksch, Ortner, Auer, 2010).

For modern RL with function approximation (the practical regime where deep networks parametrize the policy or value function), the analogous question is: what is the function-approximation analogue of S|\mathcal{S}| that controls regret?

This is the Bellman-eluder dimension programme (Jin, Liu, Yang, 2021) and successors. Define a complexity measure of the function class used to approximate the value function, derive regret bounds in terms of it, and verify that the measure is small for natural classes (linear MDPs, low-rank MDPs, etc.). For linear function approximation in a dd-dimensional feature space, regret bounds of the form O~(d3T)\tilde{O}(\sqrt{d^3 T}) are achievable.

Other complexity measures developed in this line:

The qualitative picture: RL with function approximation is theoretically tractable when the function class has bounded complexity in one of these senses; the regret bounds are polynomial in this complexity and in TT.

Sample complexity gaps between theory and practice

The gap that mirrors §8’s situation for supervised learning, but more severe.

Modern deep RL - DQN, PPO, AlphaGo/AlphaZero, RLHF for language models - uses massive function classes (deep networks). The Bellman-eluder dimension of such function classes is typically not known; bounds derived under such assumptions do not apply directly. The classical sample-complexity bounds from PAC-MDP give astronomical numbers; the regret bounds with function approximation give tractable numbers under assumptions that may not hold in practice.

Empirically, modern deep RL works on tasks where theoretical bounds say it should not. Concretely:

  • AlphaZero learned Go from self-play; the state space is 10170\sim 10^{170}; PAC-MDP bounds are vacuous; the algorithm achieved superhuman performance in 109\sim 10^9 self-play games.

  • DQN learned Atari games from raw pixel input; classical bounds say nothing useful about the resulting performance.

  • RLHF for language models (LLM §5) fine-tunes a model with 10910^9+ parameters using 10410^410510^5 preference pairs; far below any classical sample-complexity prediction for the function class size involved.

The gap reflects the same underlying issue as in supervised learning: classical bounds are worst-case; deep RL operates in a regime where the function class, the training algorithm, and the data distribution have favourable structure that classical bounds cannot exploit. The gap is more severe in RL because the structural assumptions (Markovian dynamics, accurate function approximation, sufficient exploration) are harder to verify and easier to violate.

Partial accounts of why deep RL works

The theoretical work attempting to close the gap is active and fragmented:

  • Linear-MDP and low-rank-MDP frameworks. Polynomial sample-complexity results for MDPs whose transition dynamics admit low-rank structure (Jin et al., 2020; Yang and Wang, 2020). Whether real-world environments approximately satisfy these assumptions is an empirical question, often answered “approximately, sometimes.”

  • Pessimism in offline RL. Mirror image of the optimism in PAC-MDP. For offline RL (learning from a fixed dataset of past interactions), pessimistic algorithms - choose the action that performs best in the worst-case model consistent with the data - admit theoretical guarantees that align with practical methods (CQL, etc.).

  • Sample complexity of RLHF. Recent work on the sample complexity of preference-based RL (RLHF in the LLM setting, LLM §5) gives polynomial bounds under reward-model assumptions. The bounds are still loose compared to empirical practice.

Status and pointer to the RL chapter

The RL chapter develops the algorithms (Q-learning, policy gradients, AlphaZero, deep RL methods, RLHF). This section’s purpose was to sketch the theoretical-bounds picture - what PAC-MDP, regret, and Bellman-eluder bounds say, and what their relationship to practice is.

The picture: as in supervised learning, classical worst-case theory is loose; modern data-dependent and structure-dependent theory is much tighter but still substantially looser than empirical practice. The gap is one of the open problems collected in §12 (OP-TH-5). Closing it would require either better characterizations of the structural assumptions modern RL methods exploit, or new theoretical frameworks suited to the deep-RL regime.


§11. Connections to Other Chapters

This chapter is cross-cutting: every other learning-related chapter in the book makes assumptions or uses techniques whose theoretical analysis lives here, and conversely this chapter’s prose depends on the algorithmic and empirical material in other chapters for its substance. The chapter is best read after - or in parallel with - the chapters whose generalization behaviour it analyzes.

  • Deep Learning develops the architectures and training procedures whose generalization §8 of this chapter attempts to explain. Specifically: DL §3 introduces the MLP and backpropagation; DL §6 develops the Transformer; DL §9 develops training dynamics (the regime where implicit-bias arguments apply). This chapter’s §4 (uniform convergence’s failure for deep nets), §5 (norm-based bounds for neural networks), §6 (PAC-Bayes for deep nets), and §8 (the modern puzzle) together provide the theoretical layer above the DL chapter’s algorithmic material.

  • Foundation Models uses scaling laws as the empirical theory of pretraining (FM §6). This chapter’s §9 develops the epistemic status of those scaling laws - regression fit vs derivation. FM §11’s open-problems list (OP-FM-1, OP-FM-3) and this chapter’s §12 (OP-TH-4) share questions about whether scaling laws can predict capability emergence.

  • Large Language Models §5 develops RLHF and DPO algorithmically; this chapter §10 sketches the theoretical sample-complexity questions for preference-based RL. The DPO derivation in LLM §5 is one of the few places in modern practice where a clean theoretical reduction (KL-constrained reward maximization → closed-form policy) is exploited algorithmically.

  • Self-Supervised Learning §8 develops four theoretical perspectives on SSL specifically. This chapter develops the broader generalization framework those SSL-specific perspectives sit within - and is largely silent on SSL-specific generalization questions, which remain open (OP-SSL-1).

  • Reinforcement Learning develops Q-learning, policy gradients, AlphaZero, modern deep RL. The sample-complexity layer of that chapter is what this chapter’s §10 develops as theory. The practical-theoretical gap is severe and a major open problem (OP-TH-5).

  • Causality develops the parallel theoretical critique that current ML is correlational rather than causal. This chapter touches the issue in §12 (OP-TH-8, distribution shift and OOD generalization) but defers substantive treatment to the Causality chapter.

  • Generative Models raises generalization questions specific to density estimation (what does it mean for a generative model to “generalize” beyond its training data?) that are conceptually different from the discriminative-classification framework this chapter develops. Recent work on memorization in diffusion models (Carlini et al., 2023) is in the spirit of the Zhang et al. shock applied to generation; the connection is sketched in the Generative Models chapter.

  • Evaluation is the practical face of generalization. Whatever theoretical bounds we prove, evaluation is the empirical measurement of what generalization actually looks like in deployment. This chapter’s frameworks specify what to measure; the Evaluation chapter specifies how to measure it well.


§12. Limitations and Open Problems

This is the consolidated open-problems list for the theory chapter. Each item carries the OP-TH-N identifier so other chapters can cross-reference (LLM §13, FM §11, SSL §11, and DL §12 already do).

  • OP-TH-1. Why do overparameterized neural networks generalize? This is the master open problem of modern learning theory, the centerpiece of §8. Multiple partial accounts exist (data-dependent bounds with margin and norm, PAC-Bayes for deep nets, information-theoretic bounds, NTK in the lazy regime, double descent in linear and kernel settings, benign overfitting, implicit-bias-of-SGD). None of them is a complete theory; their relationship to each other is partially understood. A unified, predictive theory is open.

  • OP-TH-2. When are PAC-Bayes bounds tight enough to be useful for deep nets in practice? PAC-Bayes can produce non-vacuous bounds (Dziugaite-Roy 2017 and successors, §6), but the bounds are still substantially looser than empirical test error for state-of-the-art models. Identifying which regimes admit numerically tight bounds - and developing techniques to produce them at scale - is open.

  • OP-TH-3. Theoretical account of in-context learning. ICL (LLM §7) is empirically dramatic and theoretically mysterious: a pretrained model produces useful outputs from a few demonstrations in the prompt without parameter updates. Multiple proposed mechanisms (implicit Bayesian inference, gradient descent in activation space, induction-head circuits, distributional pattern-matching). No mechanism is settled. Shared with the FM chapter as OP-FM-14.

  • OP-TH-4. Can capability emergence be predicted from scaling laws? Scaling laws (§9) predict loss well; whether they predict specific capabilities (arithmetic, chain-of-thought reasoning, in-context learning) is contested. The Wei et al. (2022) “emergence” literature vs the Schaeffer et al. (2023) “metric artefact” critique frames the question. Shared with FM §11 as OP-FM-3.

  • OP-TH-5. Sample-complexity bounds that match RL practice. Classical PAC-MDP bounds are loose by many orders of magnitude (§10); function-approximation regret bounds require assumptions that may not hold for deep RL. Closing the gap - either through better assumptions or new theoretical machinery - is open.

  • OP-TH-6. Reconciling worst-case bounds with average-case practice. A meta-version of OP-TH-1. Classical learning theory is worst-case; real-world ML operates in a much more benign regime. How to characterize the structure of real distributions and real training algorithms that produces the benign behaviour, and how to incorporate it into rigorous bounds, is open.

  • OP-TH-7. Theory of training dynamics: why optimizers find generalizing solutions. Implicit bias of SGD (§8) is empirically central and theoretically only partially understood. A predictive theory of which solutions SGD finds, why those solutions have favourable structural properties, and how this relates to generalization is open.

  • OP-TH-8. Theory of distribution shift and OOD generalization. Almost all of this chapter assumes train and test data come from the same distribution. Real deployment routinely violates this. Theoretical frameworks for bounded distribution shift (covariate shift, domain adaptation) exist but are limited; theory for arbitrary OOD generalization is essentially absent. The Causality chapter develops one approach (causal invariance); this remains open more broadly.

  • OP-TH-9. Information-theoretic vs PAC-Bayes unification. Both frameworks (§6, §7) produce data-dependent generalization bounds; they are increasingly recognized as views of the same underlying picture. A clean unified framework - and clarity about which framework is tighter in which regime - is currently emerging but not settled.


§13. Further Reading

Opinionated list. Not exhaustive; intended as a reading-order recommendation for someone entering the field.

Classical learning theory

  • Vapnik, V. (1998). Statistical Learning Theory. The canonical textbook from the originator of the field. Dense; rewards repeated reading.

  • Shalev-Shwartz, S., and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. The modern undergraduate/graduate textbook. Clean, complete, well-organized. Read this first if entering the area cold.

  • Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). Heavier on proofs, with full treatments of Rademacher complexity and margin bounds.

Modern deep-learning theory

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). “Understanding Deep Learning Requires Rethinking Generalization.” The shock paper. Mandatory background for anyone working on generalization in 2026.

  • Nagarajan, V., and Kolter, J. Z. (2019). “Uniform Convergence May Be Unable to Explain Generalization in Deep Learning.” The strengthening result that ruled out a whole class of approaches.

  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off.” The double-descent paper.

  • Jacot, A., Gabriel, F., and Hongler, C. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” Foundational NTK paper.

  • Chizat, L., and Bach, F. (2018). “On Lazy Training in Differentiable Programming.” Clarifies the lazy-vs-feature-learning distinction.

PAC-Bayes

  • McAllester, D. A. (1999). “PAC-Bayesian Model Averaging.” The foundational reference.

  • Dziugaite, G. K., and Roy, D. M. (2017). “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data.” The non-vacuous-bound breakthrough.

  • Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., and Szepesvári, C. (2021). “Tighter Risk Certificates for Neural Networks.” Modern PAC-Bayes refinements.

  • Lotfi, S., Finzi, M., Kapoor, S., Potapczynski, A., Goldblum, M., and Wilson, A. G. (2022). “PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization.” Compression-based PAC-Bayes pushed close to empirical performance on small networks.

Information-theoretic generalization

  • Xu, A., and Raginsky, M. (2017). “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” Foundational paper.

  • Russo, D., and Zou, J. (2016). “Controlling Bias in Adaptive Data Analysis Using Information Theory.” Parallel foundational work.

  • Bu, Y., Zou, S., and Veeravalli, V. V. (2020). “Tightening Mutual Information-Based Bounds on Generalization Error.” Individual-sample refinement.

  • Steinke, T., and Zakynthinou, L. (2020). “Reasoning About Generalization via Conditional Mutual Information.” The CMI framework.

Norm-based and margin bounds

  • Bartlett, P. L. (1998). “The Sample Complexity of Pattern Classification with Neural Networks.” Foundational norm-based analysis.

  • Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. (1998). “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” The margin theorem for boosting.

  • Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). “Spectrally-Normalized Margin Bounds for Neural Networks.” Modern spectral norm bounds.

  • Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). “Exploring Generalization in Deep Learning.” PAC-Bayes-flavoured analysis.

Implicit bias and training dynamics

  • Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). “The Implicit Bias of Gradient Descent on Separable Data.” Foundational result for logistic regression.

  • Cohen, J., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. (2021). “Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability.” Empirical observation that has reshaped optimization-theoretic understanding.

  • Yang, G., and Hu, E. J. (2021). “Feature Learning in Infinite-Width Neural Networks.” The μP parameterization placing feature learning into the infinite-width limit.

Scaling laws

  • Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models.” The original.

  • Hoffmann, J., et al. (2022). “Training Compute-Optimal Large Language Models.” The Chinchilla correction.

  • Schaeffer, R., Miranda, B., and Koyejo, S. (2023). “Are Emergent Abilities of Large Language Models a Mirage?” The metric-artefact critique. Read alongside scaling laws.

  • Hilton, J., et al. (2023). “Scaling Laws for Single-Agent Reinforcement Learning.” Scaling laws in the RL regime.

RL theory

  • Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Standard textbook; covers PAC-MDP background.

  • Strehl, A. L., and Littman, M. L. (2008). “An Analysis of Model-Based Interval Estimation for Markov Decision Processes.” Canonical PAC-MDP reference.

  • Jin, C., Liu, Q., and Yang, Z. (2021). “Bellman-Eluder Dimension: New Rich Classes of RL Problems and Sample-Efficient Algorithms.” Modern complexity-measure unification.

Reading order recommendation

For someone new to the area: start with Shalev-Shwartz and Ben-David (classical), then Zhang et al. 2017 and Belkin et al. 2019 (the empirical phenomena), then Dziugaite and Roy 2017 (the first non-vacuous bound), then a survey of one of the active threads (Bartlett et al. for norms, Xu-Raginsky-style for information-theoretic). Defer the NTK literature and full RL theory until the rest is in hand.


§14. Exercises and Experiments

Research-style exercises designed to make the theoretical material concrete and to expose the empirical-theoretical gap.

  • E1. VC and Rademacher of simple classes. For each of (a) axis-aligned rectangles in R2\mathbb{R}^2, (b) half-planes in R2\mathbb{R}^2, (c) polynomial classifiers of degree dd in R1\mathbb{R}^1: derive the VC dimension by direct shattering arguments, then numerically estimate the empirical Rademacher complexity on a synthetic dataset of varying size. Compare the VC bound and the Rademacher bound for generalization. Where is each tight, where is each loose?

  • E2. Zhang et al. random-label reproduction. Train a small CNN on CIFAR-10 in three label regimes: (a) true labels; (b) randomly permuted labels; (c) labels that are partial mixtures (50% true, 50% random). Plot training accuracy vs epochs and final test accuracy. Confirm that the network fits all three but generalizes only on (a). Reflect on what classical theory does and does not predict for each regime.

  • E3. Double descent. Train a sequence of two-layer ReLU networks on CIFAR-10 with widths ranging from 10\sim 10 to 104\sim 10^4 hidden units. Plot test error vs width. Locate the interpolation threshold (the width at which training error reaches zero). Confirm the U-shaped peak around the threshold and the second descent in the overparameterized regime. Compare to the bias-variance prediction.

  • E4. A non-vacuous PAC-Bayes bound. For a small MLP trained on MNIST, follow the Dziugaite-Roy (2017) recipe: train deterministically, then optimize a Gaussian posterior QQ centered at the trained weights against the McAllester PAC-Bayes bound (with prior centered at initialization). Report (a) the numerical value of the bound, (b) the empirical test error of the stochastic network QQ, (c) the bound’s looseness. Aim for a bound below 50% test-error.

  • E5. Scaling-law fit. Train a sequence of small transformer models on a small text corpus (e.g., enwik8 or a 10710^7-token subset of Pile) with parameters spanning roughly 10510^5 to 10810^8 and training-token counts spanning two orders of magnitude. Fit the Chinchilla functional form L(N,D)=E+A/Nα+B/DβL(N, D) = E + A/N^\alpha + B/D^\beta to the resulting loss values. Use the fit to predict the loss at a configuration not in the training set; train a model at that configuration and compare the prediction to the actual loss.

  • E6. Implicit bias on a toy problem. For 2D linearly separable data, train logistic regression with gradient descent (no regularization). Show empirically that the iterate norm grows but the direction converges to the max-margin direction (Soudry et al., 2018). Compare to SVM solution explicitly.

  • E7. NTK as a limit. For a simple two-layer network with width mm, compute the empirical NTK at initialization on a small training set for varying mm (e.g., m{10,100,1000,10000}m \in \{10, 100, 1000, 10000\}). Show convergence to the analytical NTK as mm \to \infty. Train each network and measure how much the network’s tangent kernel changes from initialization. Confirm that the change decreases with mm (the lazy-regime prediction).


AI: A Living Reference by Fuzue. Content licensed under CC BY-SA 4.0 - share, adapt, and build on it; keep the attribution and the open licence on derivatives.