Self-Supervised Learning
Scope and What This Chapter Is About
Self-supervised learning (SSL) is the use of objectives that consume unlabelled data at scale by predicting parts of the input from other parts. The chapter covers the family of objectives, the empirical and theoretical understanding of why they work, and how they compose with the architectures of the Deep Learning chapter to produce the foundation models of the Foundation Models (FM) chapter.
We treat the training paradigm. Architectures, scaling laws, foundation-model deployment, and downstream alignment live in their own chapters with explicit pointers from here. Open problems are flagged inline and consolidated in §11.
§1. Motivation and Scope
What this chapter is for
Self-supervised learning (SSL) is the training paradigm that supplied the missing ingredient to deep learning’s empirical success in the 2010s and 2020s. The deep-learning substrate (developed in the Deep Learning chapter) provides the architectures: how a model is structured. The foundation-model regime (developed in the Foundation Models chapter) provides the deployment posture: pretrain once at scale, adapt many ways. SSL provides what sits between them: the training objectives that turn architectures plus deployment posture into models that actually work.
This chapter exists for two reasons. First, SSL is the single most consequential training paradigm of the modern era - every frontier foundation model in 2026 was trained with some form of self-supervised pretraining, and the techniques scattered across the NLP, vision, audio, and multimodal literatures share enough structure to deserve a unified treatment. Second, the chapter consolidates terminology and conceptual framing that the other chapters reference without re-deriving: “masked language modelling,” “contrastive learning,” “predictive coding,” “JEPA-style embeddings” - all are SSL objectives that the FM and LLM chapters treat as primitives.
What SSL is, in one paragraph
Define a pretext task: a prediction problem constructed from the data itself, with no external annotation. Predict the next word from the words before it. Predict a masked-out image patch from the surrounding patches. Match a caption to its image among a batch of candidates. Train a model on this pretext task at scale. The trained model’s internal representations turn out to be useful for many downstream tasks (classification, retrieval, generation, structured prediction) that the pretext task did not directly target. The combination - cheap-to-generate pretext tasks on unlimited data, useful representations as a side effect - is the paradigm.
The labelled-data bottleneck that motivated SSL
Through the early 2010s, the dominant deep-learning paradigm was supervised: train a model on labelled examples, where each input has an associated target the model is asked to predict. ImageNet’s million-plus labelled images supported the AlexNet inflection (Deep Learning §2) and the subsequent CNN era. But supervised learning is structurally capped by labelling cost. Every additional training example requires a human (or expensive process) to produce a label. Web text, on the other hand, exists in trillions of tokens at near-zero marginal cost - but without labels. Speech audio, video frames, scientific data - most of the data the world produces is unlabelled.
The labelling bottleneck shaped what supervised deep learning could do. Vision systems trained on ImageNet’s thousand classes did not directly help with the long tail of visual categories. Translation systems required parallel text in pairs of languages, available only for a fraction of language pairs. Specialized domains (medical imaging, chemistry, scientific literature) had labelled corpora measured in thousands of examples rather than millions. The natural question: can we use the unlabelled data?
The functional definition
We adopt the following working definition for this chapter:
Self-supervised learning. A training paradigm in which the learning objective is constructed from the input data itself, without external labels - typically by predicting one part of the input from other parts, or by predicting a transformed version from the original. The trained model’s internal representations are subsequently used as the basis for downstream tasks via adaptation (fine-tuning, prompting, retrieval, or zero-shot deployment).
Two structural features matter:
Labels come from the data. Mask a word in a sentence; the correct answer (what word was masked) is in the data. No annotator was needed. This is what makes SSL operate at the scale of available data rather than at the scale of annotation budget.
Representation transfer is the goal. The pretext task is not what we ultimately want the model to do; we want the representations the model builds to be useful for whatever downstream tasks come later. A model trained to predict masked tokens turns out to encode word meanings, syntactic structure, and (at scale) world knowledge - even though “encode world knowledge” was not what the loss measured directly.
Why SSL became central
The shift from supervised pretraining (the ImageNet-era recipe) to self-supervised pretraining was the engine of the foundation-model regime (§3 of the Foundation Models chapter develops “the three pillars”; SSL is the second pillar). Three reasons SSL is the central training paradigm of 2026:
Data scaling. SSL’s training signal comes free with the data. While supervised pretraining at ImageNet scale plateaus around millions of labelled examples, SSL pretraining scales to trillions of tokens of text, billions of images, hundreds of thousands of hours of audio - data volumes that supervised learning fundamentally cannot match.
Cross-modality. The same conceptual pattern (predict-the-missing-piece) works across modalities. The chapter’s §4–§7 catalogue the variants. This generality is part of what makes foundation models foundational - the architecture and training paradigm transfer across domains.
Representation transfer. SSL-trained representations transfer to many downstream tasks. A model trained on next-token prediction can be adapted to classification, summarization, translation, dialogue, code generation, structured output, and many other tasks - none of which were targets of the pretraining loss directly.
Boundaries with adjacent paradigms
SSL is one specific point in a larger space of training paradigms. Adjacent paradigms - sometimes confusing terminologically - differ in important ways:
Unsupervised learning in the classical sense covers clustering, density estimation, and dimensionality reduction. SSL is unsupervised in the broad sense (no external labels) but distinct in approach: SSL constructs prediction tasks on the data, where classical unsupervised methods do not necessarily formulate a prediction problem at all (k-means clusters without predicting anything).
Semi-supervised learning combines a small labelled dataset with a large unlabelled one. SSL is the unlabelled-only extreme; semi-supervised approaches use both.
Supervised pretraining (the ImageNet-era recipe) uses labelled data for the pretraining task itself. SSL replaces external labels with self-derived ones.
Reinforcement learning uses environment-provided reward signals. SSL has no environment; the signal comes from the data’s own structure.
Contrastive representation learning is one family of SSL approaches (§5), not a synonym for SSL.
Boundaries with adjacent chapters
Deep Learning (prerequisite). The architectures SSL trains on; this chapter treats SSL as model-agnostic but most concrete instances assume a Transformer or CNN.
Foundation Models is the deployment regime. FM §5 (Pretraining) overlaps with this chapter; we develop the training paradigm here, FM develops how the pretrained model is then deployed.
Theoretical Foundations of Learning develops the generalization story (why SSL representations transfer); we sketch the empirical findings here and leave the theory there.
Large Language Models specializes the language SSL objectives we cover here (causal LM, masked LM, span corruption) to the LLM training pipeline.
Multimodal Models specializes the multimodal SSL objectives (CLIP-style contrastive, masked multimodal modelling) to vision-language and beyond.
Generative Models covers diffusion and flow-matching training, which is a kind of SSL (denoising score-matching) but specialized to generative deployment.
The chapter is meant as the cross-cutting reference for SSL across modalities. Where the LLM, Multimodal, or Generative Models chapter develops a specific instance in depth, we point.
§2. Historical Context
The idea of predicting parts of input from other parts is older than deep learning itself. This section traces the trajectory from early autoencoders, through Word2Vec’s distributed representations, the BERT/GPT inflection, and contrastive vision, to the multimodal SSL of 2026. As elsewhere in the book, we focus on the substrate history - which ideas became the building blocks of the modern paradigm - rather than crediting individuals.
A timeline of the inflection points covered below:
~1980s      Early autoassociative networks (Rumelhart et al.,
            Hopfield); the predict-from-input pattern as a
            theoretical idea
                │
                ▼
2006-2008   Autoencoders for deep-learning pretraining
            (Hinton and Salakhutdinov, 2006); denoising
            autoencoders (Vincent et al., 2008)
                │
                ▼
2013-2014   Word2Vec (Mikolov et al., 2013), GloVe
            (Pennington et al., 2014) - distributed
            word representations from context prediction
            at scale
                │
                ▼
2015-2017   Sequence pretraining: skip-thoughts (Kiros
            et al., 2015), seq2seq pretraining
                │
                ▼
2018        THE LANGUAGE INFLECTION: BERT (Devlin et al.,
            masked LM), GPT (Radford et al., causal LM).
            Self-supervised pretraining becomes the dominant
            paradigm in NLP
                │
                ▼
2019-2021   CONTRASTIVE VISION SSL: SimCLR (Chen et al.,
            2020), MoCo (He et al., 2020), BYOL
            (Grill et al., 2020), SwAV (Caron et al.,
            2020), DINO (Caron et al., 2021)
                │
                ▼
2021        CROSS-MODAL SSL: CLIP (Radford et al., 2021)
            trained on web-scale image-text pairs,
            enabling zero-shot transfer across vision
                │
                ▼
2022        MASKED VISION SSL: MAE (He et al., 2022).
            JEPA framing (LeCun, 2022 position paper)
            proposing predictive-embedding architectures
                │
                ▼
2022-2026   SSL as the shared substrate across modalities;
            audio (Whisper-style), video, action, scientific
            domains all use SSL pretraining; multimodal SSL
            becomes the default; the field consolidates
            around a small set of objective families

We develop each phase below.
Autoencoders and the predict-from-input pattern
The earliest SSL-style training objectives in deep learning were autoencoders: train a neural network to reconstruct its input from a compressed intermediate representation. The architecture is symmetric - an encoder compresses input to a low-dimensional code, a decoder reconstructs the input from the code - and the training objective is reconstruction error. No external labels are needed; the input is its own target.
Hinton and Salakhutdinov (2006), “Reducing the Dimensionality of Data with Neural Networks”, used deep autoencoders as a way to learn useful representations from unlabelled data, with the autoencoder’s bottleneck layer serving as the learned representation. The approach predated the AlexNet inflection and was part of the deep-learning revival of the late 2000s - a serious attempt to use unlabelled data when supervised deep learning had not yet won.
Denoising autoencoders (Vincent et al., 2008) added a structural improvement: corrupt the input (mask pixels, add noise) and ask the network to reconstruct the original from the corrupted version. The corruption forces the network to learn structure of the input distribution rather than just memorizing the identity function. This is the conceptual ancestor of masked language modelling, masked autoencoders, and most modern SSL objectives.
The autoencoder line of work did not produce a foundation-model breakthrough on its own - supervised learning with AlexNet (2012) and ResNet (2015) had more impact at the time. But the predict-from-input pattern was already in place.
Word2Vec and distributed representations
In 2013, Mikolov, Chen, Corrado, and Dean’s Word2Vec (CBOW and Skip-gram variants) trained a shallow network to predict a word from its context (CBOW) or context from a word (Skip-gram), on a corpus of billions of words. The model’s input embeddings turned out to encode semantic relationships in their geometric structure: famously, vec("king") - vec("man") + vec("woman") ≈ vec("queen"). The trained embeddings were used as fixed input features for downstream NLP tasks.
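To make the Skip-gram pretext task concrete, the sketch below shows how (centre word, context word) training pairs can be read off raw text with no annotation. It is an illustration only: the function name `skipgram_pairs` and the `window` parameter are our own, and the shallow network that Word2Vec trains on such pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for a Skip-gram-style objective.

    Each word is paired with every word at most `window` positions away.
    The pairs come from the text alone; no external labels are involved.
    """
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
for centre, context in pairs[:4]:
    print(f"predict {context!r} from {centre!r}")
```

A CBOW-style variant would invert the direction, grouping the context words as input and the centre word as target.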
GloVe (Pennington, Socher, Manning, 2014) produced similar embeddings via a different objective - factoring the word-co-occurrence matrix - and reached comparable downstream performance.
The Word2Vec/GloVe lineage established that self-supervised training on the prediction objective of “what word goes with what context” produces representations that capture meaning. This was the first clear demonstration of SSL transfer in the modern sense: a model trained on a simple prediction task produced features that improved downstream tasks that the prediction task did not directly target. The line of work also established the dense distributed representation as the central representational object of modern NLP (LLM §2 develops this in language-specific terms).
Through the mid-2010s, the Word2Vec/GloVe-style embeddings were the standard input layer for NLP pipelines, with task-specific models on top.
Sequence pretraining
The natural extension: replace a per-word embedding with a sequence-level representation. Skip-thoughts (Kiros et al., 2015) trained an encoder to predict surrounding sentences from a central sentence - analogous to Word2Vec but at the sentence level. CoVe (McCann et al., 2017) used a sequence-to-sequence translation model’s encoder as a contextual word representation. ELMo (Peters et al., 2018) used a bidirectional language model to produce contextual word representations whose embeddings varied with the surrounding sentence.
These approaches were precursors of the 2018 inflection but had limitations: they typically produced representations that were added to task-specific models rather than replacing them, and they did not directly support generation. The pretraining task and the deployment task were still separated.
2018: BERT and GPT
The breakthrough year. Two papers in 2018 set the template for modern SSL pretraining in language:
GPT (Radford et al., 2018), “Improving Language Understanding by Generative Pre-Training.” A decoder-only Transformer (LLM §4) pretrained with causal language modelling: predict the next token from the prior tokens. The same architecture used for pretraining could be fine-tuned for downstream tasks.
BERT (Devlin et al., 2018), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” An encoder-only Transformer pretrained with masked language modelling: randomly mask 15% of input tokens, predict the masked tokens from the remaining context (LLM §2). Downstream tasks were handled by adding a small task-specific head on top and fine-tuning the model.
Together, GPT and BERT established the modern SSL pretraining paradigm in language. Three properties of these models that the prior generation lacked:
The same model architecture (Transformer) was used for pretraining and downstream tasks.
Fine-tuning of the pretrained model (with the head replaced for the new task) produced state-of-the-art results across many benchmarks - outperforming both training from scratch and the older add-features-to-task-specific-models approach.
Scale paid off. Larger versions of BERT and GPT trained on more data consistently improved downstream performance, establishing the scaling pattern that later became central (Foundation Models §6).
By 2020 the BERT/GPT template was the default in NLP. Most subsequent language-model work in the chapter’s domain has been refinements of the same pattern.
Contrastive vision SSL
In parallel with the BERT/GPT inflection in language, a different line of work attacked SSL in vision. The challenge: an image is a 2D grid of pixels with no obvious analogue to the “predict the next word” pretext task that worked for language. Several lines of work converged on contrastive learning as the solution.
The basic idea: two augmented views of the same image (different crops, colour jitters, rotations) should produce similar representations; views of different images should produce dissimilar representations. The model is trained to make this true by minimizing a contrastive loss.
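A minimal NumPy sketch of one common form of this loss (the normalized-temperature cross-entropy used in SimCLR-style training) may help fix ideas. It assumes the encoder has already produced embeddings for both views; the function name `nt_xent_loss`, the temperature value, and the random stand-in embeddings are illustrative choices, not taken from any particular implementation.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over a batch of paired views.

    z1, z2: arrays of shape (N, D). Row i of z1 and row i of z2 embed two
    augmented views of image i (a positive pair); every other row in the
    batch acts as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                         # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                      # a view is never its own candidate
    # Index of the positive partner for each of the 2N rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over the candidates in each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())


rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = nt_xent_loss(anchors, anchors + 0.01 * rng.normal(size=anchors.shape))
unrelated = nt_xent_loss(anchors, rng.normal(size=anchors.shape))
print(f"aligned views: {aligned:.3f}   unrelated views: {unrelated:.3f}")
```

The loss is lower when the two views of each image land close together and far from the rest of the batch, which is the behaviour training is meant to produce.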
A short sequence of papers that defined the field:
SimCLR (Chen et al., 2020) - Simple Contrastive Learning of Representations. Showed that with strong data augmentation, a large batch size, and a projection head, contrastive learning could match supervised pretraining on ImageNet classification.
MoCo (He et al., 2020) - Momentum Contrast. Used a memory queue of past representations and a momentum-updated encoder to make contrastive learning work at smaller batch sizes.
BYOL (Grill et al., 2020) - Bootstrap Your Own Latent. Removed the negative examples from contrastive learning entirely; trained an online network to predict the output of a slowly updated target network on a different augmentation of the same image, with surprising stability. Triggered a re-examination of what makes contrastive learning work.
SwAV (Caron et al., 2020) - clustering-based contrastive learning, with assignments to prototypes as the contrastive signal.
DINO (Caron et al., 2021) - Self-distillation with no labels. A teacher-student approach with a similar negative-free structure to BYOL, used on Vision Transformers; produced features with surprising semantic structure (the famous self-attention visualizations of DINO models segmenting objects without supervision).
By 2022 contrastive (and related “joint-embedding”) SSL was the dominant approach to image-only representation learning, with the field consolidating around the SimCLR/MoCo/BYOL/DINO design space.
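The momentum-updated encoder mentioned for MoCo, and the teacher or target networks in the negative-free methods, are commonly maintained as an exponential moving average of the trained network’s weights. The sketch below shows that update in isolation; the dictionary-of-arrays parameter format and the momentum value 0.99 are assumptions made for illustration.

```python
import numpy as np


def momentum_update(target_params, online_params, m=0.99):
    """Move each target parameter a small step towards the online parameter.

    target <- m * target + (1 - m) * online, applied per parameter tensor.
    No gradient is involved; the target network only trails the online one.
    """
    return {
        name: m * target_params[name] + (1.0 - m) * online_params[name]
        for name in target_params
    }


online = {"w": np.ones(3)}     # stand-in for the network being trained
target = {"w": np.zeros(3)}    # stand-in for the momentum / teacher network

after_one = momentum_update(target, online)
after_many = target
for _ in range(1000):
    after_many = momentum_update(after_many, online)
print(after_one["w"], after_many["w"])
```

With a momentum close to 1 the target changes slowly, giving the online network a more stable prediction target than its own rapidly changing outputs.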
Cross-modal SSL: CLIP
CLIP (Radford et al., 2021) extended contrastive SSL to cross-modal training. Train an image encoder and a text encoder jointly on hundreds of millions of image-caption pairs, with the contrastive objective: match each image to its caption (and vice versa) against a batch of negatives.
The result: a model whose image representations are aligned with natural-language descriptions. CLIP enabled zero-shot transfer to image-classification tasks - feed candidate class names through the text encoder, compute image-encoder similarities, pick the highest - without any task-specific training. Performance on ImageNet zero-shot approached supervised models of comparable scale, and on robustness benchmarks (distribution-shifted test sets such as ImageNet-Sketch) CLIP substantially outperformed them.
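The zero-shot procedure just described reduces to a nearest-neighbour lookup in the shared embedding space. The sketch below assumes the two encoders have already been run; the embeddings are hand-written placeholders, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP release.

```python
import numpy as np


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image.

    image_emb: (D,) image-encoder output.
    class_text_embs: (C, D) text-encoder outputs, one row per candidate prompt.
    Similarity is cosine similarity, so both sides are normalised first.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return int(np.argmax(sims)), sims


prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
# Placeholder embeddings standing in for real encoder outputs.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

best, sims = zero_shot_classify(image_emb, text_embs)
print(prompts[best])
```

Changing the task only requires changing the list of prompts, which is why no task-specific training is needed.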
CLIP’s structural innovation was cross-modal SSL: the model’s representations were grounded in another modality (language), which gave them semantics that pure-vision SSL lacked. This pattern - joint training across modalities - became the dominant approach to multimodal foundation models.
Masked vision and JEPA
A parallel line revisited the masked-prediction pattern for vision. Masked Autoencoder (MAE) (He et al., 2022) trained a Vision Transformer to reconstruct masked image patches from the unmasked ones - essentially BERT for images, with high masking ratios (75% of patches masked). MAE produced representations competitive with contrastive methods and was substantially more compute-efficient to train.
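The data-side half of this recipe is simple enough to sketch: choose a random 75% of patch positions to hide, and score reconstruction only on those. The helper names and the 196-patch example (for instance, a 14×14 grid) are illustrative assumptions; the encoder and decoder are omitted.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Split patch indices into visible and masked sets, uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx


def masked_reconstruction_loss(pred_patches, true_patches, masked_idx):
    """Mean squared error over the masked patches only."""
    diff = pred_patches[masked_idx] - true_patches[masked_idx]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))            # stand-in for flattened pixel patches
visible_idx, masked_idx = random_patch_mask(196, 0.75, rng)

# The encoder would see only patches[visible_idx]; here a trivial "prediction"
# of all zeros stands in for the decoder output.
loss = masked_reconstruction_loss(np.zeros_like(patches), patches, masked_idx)
print(len(visible_idx), len(masked_idx), round(loss, 3))
```

Because the encoder processes only the visible quarter of the patches, most of the per-image compute is avoided, which is one reading of the efficiency claim above.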
In 2022, Yann LeCun published a position paper proposing JEPA (Joint Embedding Predictive Architecture) as the right path forward for vision SSL. The argument: rather than predict in pixel space (as MAE does) or in contrastive space (as SimCLR does), predict in a learned embedding space. The predicted target is the embedding of the unseen part, not its raw pixels; this is supposed to capture the relevant structure of the input without the noise of pixel-level reconstruction. I-JEPA (Image JEPA, 2023) and subsequent variants (V-JEPA for video, others) have explored this direction.
The JEPA framing is one of the active SSL research programmes in 2026; whether it becomes dominant for vision and beyond, or whether contrastive and masked-prediction approaches remain competitive, is open.
2022–2026: shared substrate and multimodal consolidation
By 2026, SSL has consolidated into a small number of objective families (§4–§7 of this chapter develop the families) applied broadly:
Causal language modelling for text (the LLM-pretraining substrate).
Contrastive multimodal (CLIP-style) for vision-language and beyond.
Masked prediction (MLM-style for text, MAE-style for vision) for understanding-focused tasks.
Predictive-embedding (JEPA-style) as a research-active alternative.
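Self-supervised audio (masked and contrastive prediction on raw speech; Whisper-style models, often grouped with these, are trained on weakly supervised audio-transcript pairs rather than a purely self-derived signal) and other modality-specific variants.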
Most frontier foundation models are trained with combinations of these - multimodal pretraining typically uses several objectives simultaneously, with the relative weighting tuned empirically.
The remaining sections of this chapter develop each family in depth (§4 generative, §5 contrastive, §6 predictive), the theory (such as it is) of why SSL works (§8), and the practical considerations of SSL at scale (§9).
Historical aside. This narrative privileges the lineage that produced the modern paradigm. Several research programmes that did not lead directly to current SSL - purely classical unsupervised methods (clustering, density estimation, ICA), neuroscience-inspired learning rules, Bayesian methods - remain active and relevant; the Theoretical Foundations of Learning chapter treats some of them.
§3. The SSL Paradigm
§2 traced the SSL trajectory across modalities and decades. This section unifies what those approaches share: a paradigmatic structure that holds across causal LM, masked LM, contrastive learning, JEPA, and the rest. The unifying view is useful both for understanding the family as a coherent thing and for designing new SSL objectives - once the structure is in mind, new pretext tasks suggest themselves.
A unifying view
SSL methods share a three-part structure:
┌─────────────────────────────┐
│ Unlabelled data sample x │
└─────────────┬───────────────┘
│
▼
┌────────────────────────────────┐
│ 1. Transformation / split │
│ Apply some operation that │
│ produces (input, target) │
│ from x alone: │
│ - mask some tokens │
│ - split into context + │
│ continuation │
│ - augment two views │
│ - corrupt with noise │
│ - separate modalities │
└─────────────┬──────────────────┘
│
▼
┌────────────────────────────────┐
│ 2. Model the prediction │
│ A neural network maps the │
│ input to a prediction of │
│ the target (or some │
│ transformation of it). │
└─────────────┬──────────────────┘
│
▼
┌────────────────────────────────┐
│ 3. Loss │
│ Compare the prediction to │
│ the target via some loss │
│ function. Backpropagate │
│ and update parameters. │
└────────────────────────────────┘

Every SSL method instantiates this template:
Causal LM. Transformation: split a sequence into prefix (input) and next token (target). Model: a decoder-only Transformer producing a distribution over the vocabulary. Loss: cross-entropy between predicted distribution and actual next token.
Masked LM. Transformation: randomly mask tokens in a sequence; the masked tokens are targets, the surrounding context is input. Model: an encoder-only Transformer producing per-position distributions. Loss: cross-entropy at masked positions.
Contrastive learning (SimCLR-style). Transformation: produce two augmented views of an image; the two views form a positive pair, other images in the batch are negatives. Model: an image encoder producing dense feature vectors. Loss: contrastive (the views should be close in feature space; the negatives should be far).
CLIP. Transformation: take an image-caption pair from the web; the image and caption are a positive pair. Model: separate image and text encoders. Loss: contrastive across a batch of pairs.
MAE. Transformation: mask 75% of an image’s patches; the masked patches are targets. Model: a Vision Transformer that produces predictions of the masked pixel patches. Loss: pixel reconstruction error on masked patches.
JEPA. Transformation: split an image into context and target patches. Model: an encoder + a predictor in embedding space. Loss: distance between predicted embedding and target embedding.
Denoising / diffusion training. Transformation: add noise to the input. Model: a denoiser that predicts either the original input or the noise. Loss: reconstruction or score-matching error.
The three steps are conceptually decoupled. A given transformation can be paired with many model architectures; a given loss can be paired with many transformations. The space of SSL objectives is combinatorial in these three choices, which is part of why the field has produced so many distinct methods.
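The decoupling can be made explicit by writing the template as a function of three interchangeable pieces. Everything in the sketch below is a stand-in chosen for brevity: `ssl_step` is a hypothetical name, the "model" is an identity function rather than a network, and no parameter update is performed.

```python
import numpy as np


def ssl_step(x, transform, model, loss_fn):
    """One forward pass through the three-part template."""
    inp, target = transform(x)      # 1. build (input, target) from x alone
    pred = model(inp)               # 2. predict the target from the input
    return loss_fn(pred, target)    # 3. compare prediction and target


rng = np.random.default_rng(0)


def add_noise(x):
    """Denoising transformation: input is a noisy copy, target is the clean sample."""
    return x + 0.1 * rng.normal(size=x.shape), x


def mask_second_half(x):
    """Masking transformation: input has the last four features zeroed out."""
    inp = x.copy()
    inp[:, 4:] = 0.0
    return inp, x


def identity_model(inp):
    """Placeholder for a neural network."""
    return inp


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


x = rng.normal(size=(4, 8))
loss_denoise = ssl_step(x, add_noise, identity_model, mse)
loss_masked = ssl_step(x, mask_second_half, identity_model, mse)
print(round(loss_denoise, 4), round(loss_masked, 4))
```

Swapping `add_noise` for `mask_second_half` changes the objective without touching the model or the loss, which is the combinatorial freedom the paragraph above describes.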
The two axes of variation
Within the three-part template, two design axes matter most.
What is predicted, and in what space?
Input space. Predict the actual pixel values (MAE), the actual tokens (causal/masked LM), the actual audio waveform.
Embedding space. Predict the embedding of the target rather than the target itself (JEPA).
Match / contrast space. Don’t predict the target directly; instead predict whether two samples are positive or negative pairs (contrastive learning).
The choice has substantial consequences. Input-space prediction (MAE, language modelling) has to model irrelevant noise in the input - the precise pixel values of a corner of an image, the specific punctuation of a sentence - which the model expends capacity on. Embedding-space prediction (JEPA) lets the model ignore irrelevant detail at the cost of needing a target encoder to define the embedding space, which raises the question of how to train it.
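A toy calculation illustrates the trade-off. Suppose a target encoder keeps only the "content" dimensions of a sample and discards unpredictable detail; a prediction that gets the content right is then penalised in input space but not in embedding space. The fixed projection matrix `E` below is a hand-built stand-in for a learned target encoder, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target encoder: keeps the first 2 of 6 input dimensions
# ("content") and discards the remaining 4 ("nuisance detail").
E = np.zeros((6, 2))
E[0, 0] = 1.0
E[1, 1] = 1.0

content = np.array([1.0, -2.0])
target = np.concatenate([content, rng.normal(size=4)])    # content + unpredictable detail
prediction = np.concatenate([content, np.zeros(4)])       # content right, detail not modelled

input_space_loss = float(np.mean((prediction - target) ** 2))
embedding_space_loss = float(np.mean((prediction @ E - target @ E) ** 2))
print(round(input_space_loss, 3), embedding_space_loss)
```

The example also shows where the difficulty moves: the result depends entirely on which dimensions the encoder chooses to keep, and in practice that encoder has to be learned.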
Where does the supervisory signal come from?
Reconstructive. The signal comes from the model’s ability to predict the missing part. The loss measures prediction accuracy directly. Generative objectives (next-token, MLM, MAE) work this way.
Discriminative. The signal comes from the model’s ability to distinguish positive pairs from negative ones. Contrastive learning works this way; the model never reconstructs anything, only learns to push positives together and negatives apart.
The discriminative path requires negatives (real or implicit), which raises the questions of where to get them and whether their structure biases the learned representation. The reconstructive path does not need negatives but commits to the choice of “what to predict” we discussed above.
Why SSL representations transfer
The empirical fact: a model trained on a pretext task that has nothing directly to do with downstream tasks ends up with representations useful for those tasks. Why?
Three partial answers circulate in the literature; the Theoretical Foundations of Learning chapter develops the rigorous theory, while we sketch the intuitions here.
The manifold-coverage hypothesis. Natural data (images, language, audio) lies on a relatively low-dimensional manifold within its ambient space - most points in pixel-space are not natural images, most token sequences are not natural language. A good SSL objective forces the model to learn the structure of this manifold: to know what natural-image patches look like in context, what natural sentences look like in continuation. Once the manifold is encoded in the model’s representations, downstream tasks operate on those representations and inherit the manifold knowledge.
The compression-and-prediction view. A predictive objective (predict next token, predict masked patch) forces the model to compress the input into a representation rich enough to support prediction. The information-theoretic interpretation: the representation is approximately a sufficient statistic for the prediction task; sufficient statistics tend to be useful for many downstream tasks that share some structure with the pretext. This view connects SSL to classical information theory (the Theoretical Foundations chapter elaborates).
The emergent-features view. Empirically, SSL-trained models develop features that correspond to semantically meaningful concepts even though no semantics were in the loss. Word embeddings encode meaning; image-model features encode object identity and category; LLM features encode entities, relations, and (at scale) world knowledge. These features emerge from the pretraining objective in ways that current theory does not fully predict - they are an empirical surprise, repeatedly confirmed.
None of these accounts is decisive. The empirical fact of transfer is robust; the explanation is partial.
Comparison with adjacent paradigms
Pulling §1’s terminology into a single comparison:
Paradigm                  Labels      Pretext task on   Prediction target
                          required?   data itself?
Classical unsupervised    No          No*               varies (cluster assignment,
(k-means, PCA, ...)                                     density estimate)
Self-supervised           No          Yes               constructed from the data
(this chapter)
Semi-supervised           Few         Yes (on           downstream task
                                      unlabelled)
Supervised                Many        No                external labels
Reinforcement learning    No          No                external rewards from
                                                        environment
* Some classical unsupervised methods (e.g., score matching,
  denoising) can be seen as proto-SSL with hindsight.

The boundary between SSL and other paradigms is somewhat fluid in practice - semi-supervised learning often uses an SSL component for the unlabelled portion; supervised learning sometimes uses SSL pretraining as initialization. The conceptual distinction matters less than the mechanical question of what training signal is being used.
What the rest of the chapter develops
The remaining sections develop the major SSL families:
§4 (Generative Objectives): causal LM, masked LM, span corruption, MAE, denoising/diffusion.
§5 (Contrastive and Joint-Embedding Objectives): InfoNCE, SimCLR, MoCo, BYOL, SwAV, DINO, CLIP.
§6 (Predictive and Latent-Space Objectives): JEPA family.
§7 (Multimodal Self-Supervision): cross-modal SSL, native multimodal pretraining.
§8 (Why SSL Works: Theoretical Perspectives): theoretical accounts surveyed.
§9 (SSL in Practice): data, curricula, evaluation.
§4. Generative Objectives
Generative objectives are SSL methods where the model predicts (reconstructs) the missing or target part of the input directly. Loss is computed in the input space - pixel-space for images, token-space for language, waveform or spectrogram space for audio. The training signal is the model’s reconstruction accuracy.
This section catalogues the five most consequential generative objective families. Each instantiates the §3 three-part structure with a specific transformation and a specific loss.
Causal language modelling
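The dominant SSL objective for language: predict the next token given the prior tokens. Given a tokenized sequence $x = (x_1, \dots, x_T)$, the model learns to compute $p_\theta(x_t \mid x_{<t})$ for every position $t$. The training loss is the negative log-likelihood of the actual next token at every position:
$$\mathcal{L}_{\text{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$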
Mechanically:
Token sequence: [The quick brown fox jumps over ... ]
For each position t, target = token at position t,
conditioning context = tokens at positions 1..t-1:
Position 1: (no context) → predict "The"
Position 2: "The" → predict "quick"
Position 3: "The quick" → predict "brown"
Position 4: "The quick brown" → predict "fox"
...
Each prediction contributes -log p_θ(target | context) to the loss.
All positions are computed in parallel during training via the
causal mask of the decoder-only Transformer (DL §6).

The architecture is a decoder-only Transformer with causal masking, ensuring each position attends only to prior positions. The Deep Learning chapter §6 develops the mechanism; the LLM chapter §1 and §5.1 develop the deployment-specific picture.
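The per-position contribution -log p_θ(target | context) can be computed for all positions at once from the model’s output scores. The sketch below assumes a (T, V) array of logits in which row t already holds the scores for token t given the tokens before it; real implementations usually obtain this alignment by shifting inputs and targets by one position. The function name and toy values are illustrative.

```python
import numpy as np


def causal_lm_loss(logits, tokens):
    """Sum over positions of -log p(token_t | tokens before t).

    logits: (T, V) array; row t scores every vocabulary item as the
            candidate for position t, computed from earlier positions only.
    tokens: length-T integer array of the tokens that actually occurred.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(tokens)), tokens].sum())


T, V = 4, 5
tokens = np.array([0, 3, 1, 4])

uniform_logits = np.zeros((T, V))                 # a model that knows nothing
loss_uniform = causal_lm_loss(uniform_logits, tokens)

confident_logits = np.zeros((T, V))
confident_logits[np.arange(T), tokens] = 5.0      # a model that favours the right token
loss_confident = causal_lm_loss(confident_logits, tokens)

print(round(loss_uniform, 3), round(loss_confident, 3))
```

For the uninformed model the loss is T·log V, the cost of guessing uniformly at every position; training drives it below that baseline.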
Why causal LM dominates as the language SSL objective:
Generation-native. The same model trained for next-token prediction is directly usable as a generator at deployment: sample one token, append, repeat.
Single objective. The training task is fully specified by the corpus; no separate decisions about what to mask or how to corrupt.
Scales cleanly. Empirical scaling laws (FM §6) are smoothest for causal LM; the loss curve and the downstream-capability curve are well-behaved.
Universal across modalities. The “predict the next item” pattern extends naturally to code (next token), audio (next spectrogram frame), action sequences (next motor command), arbitrary token sequences - a single training paradigm across domains.
Masked language modelling
The encoder-side counterpart. Given a sequence, randomly mask some fraction of tokens (typically 15%), and ask the model to predict the masked tokens from the surrounding context on both sides. The model is bidirectional - every input position can attend to every other input position.
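$$\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{t \in M} \log p_\theta(x_t \mid \tilde{x})$$
where $M$ is the set of masked positions and $\tilde{x}$ is the sequence with masked tokens replaced by a special [MASK] token. The model sees the corrupted sequence $\tilde{x}$ as input; the original tokens at the masked positions are the targets.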
Mechanically:
Original sequence: The cat sat on the mat .
After random masking (15% is typical; a higher rate is used here so the short example has two targets: tokens 3 and 6):
Input: The cat [MASK] on the [MASK] .
Target: predict "sat" at position 3
predict "mat" at position 6
Model sees the full surrounding context including positions
AFTER each masked position. The bidirectional attention is
what makes MLM suited to UNDERSTANDING tasks rather than
generation.
BERT (Devlin et al., 2018) was the canonical MLM model, and the BERT lineage is treated in LLM §2.
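The masking step in the example above can be sketched as follows. This is a simplified illustration with hypothetical names: it always substitutes [MASK], whereas BERT replaces only 80% of the selected tokens with [MASK], swaps 10% for a random token, and leaves 10% unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Return (corrupted_tokens, targets) for one sequence.

    targets maps each masked position to the original token there.
    At least one position is always masked so short sequences still
    yield a training signal (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]
        corrupted[p] = MASK
    return corrupted, targets

tokens = "The cat sat on the mat .".split()
corrupted, targets = mask_tokens(tokens, mask_fraction=0.3, seed=1)
```

The loss is then computed only at the positions in `targets`; unmasked positions contribute context but no gradient signal of their own.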
Why MLM is suited to understanding tasks but not generation: the model is trained on a non-autoregressive setup; it does not generate token by token at training time. Adapting an MLM model to generation requires substantial additional machinery (mask-and-fill iterations, non-autoregressive decoding) that does not match autoregressive generation’s quality. For generation, causal LM dominates; for representation-learning-for-understanding, MLM persists, particularly in retrieval encoders (LLM §10 mentions these).
Span corruption
The encoder-decoder objective from T5 (Raffel et al., 2019). Like MLM, randomly select spans to “corrupt”; unlike MLM, the model produces the original spans as a generation target. The encoder sees the corrupted input (with spans replaced by sentinel tokens); the decoder produces the original spans, separated by the same sentinel tokens.
Original: The cat sat on the mat in the garden today.
Corruption (mask span "sat on the mat" → <X>;
"garden today" → <Y>):
Encoder input: The cat <X> in the <Y>.
Decoder target: <X> sat on the mat <Y> garden today <Z>
The decoder is trained to produce the missing spans, generated
autoregressively in the standard seq2seq fashion.
Span corruption sits between causal LM (next token, fully autoregressive) and masked LM (single token, bidirectional). It supports both understanding-style adaptation (use the encoder) and generation-style adaptation (use the encoder-decoder).
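The construction of encoder input and decoder target can be sketched directly. This is an illustrative helper, not T5's preprocessing code: the function name, the explicit span list, and the `<X>`/`<Y>`/`<Z>` sentinel names (taken from the example above) are assumptions; real pipelines sample spans randomly.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Build (encoder_input, decoder_target) from half-open spans.

    spans is a sorted list of non-overlapping (start, end) index pairs.
    Each span is replaced by one sentinel in the encoder input; the
    decoder target lists each sentinel followed by the tokens it hides,
    then a final sentinel marking the end.
    """
    if len(spans) + 1 > len(sentinels):
        raise ValueError("need one sentinel per span plus a closing one")
    encoder_input, decoder_target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        encoder_input.extend(tokens[cursor:start])
        encoder_input.append(sentinel)
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:end])
        cursor = end
    encoder_input.extend(tokens[cursor:])
    decoder_target.append(sentinels[len(spans)])
    return encoder_input, decoder_target

tokens = "The cat sat on the mat in the garden today .".split()
enc, dec = span_corrupt(tokens, [(2, 6), (8, 10)])
print(" ".join(enc))  # The cat <X> in the <Y> .
print(" ".join(dec))  # <X> sat on the mat <Y> garden today <Z>
```

Because the decoder only has to emit the hidden spans, the target is much shorter than the input, which keeps the decoder side cheap.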
Masked autoencoders for images
The vision analogue of MLM. MAE (He et al., 2022) tokenizes an image into patches (say, 16×16 pixels), randomly masks a large fraction of patches (75%), and trains a Vision Transformer to reconstruct the original patches from the unmasked ones. The training signal is pixel-reconstruction error on the masked patches.
Original image: grid of 14×14 = 196 patches (for a 224×224 image
at 16×16 patch size)
│
▼
Masking: randomly mask 75% of patches; keep 25%
│
▼
Encoder: Vision Transformer processes ONLY the unmasked
patches (25% of original; ~49 patches).
│
▼
Decoder: Small Transformer with mask tokens at masked
positions; reconstructs the original patches
(in pixel space).
│
▼
Loss: mean-squared error between reconstructed and
original masked patches.
Two design choices distinguish MAE from naïve “BERT for images”:
Asymmetric encoder/decoder. The encoder only processes unmasked patches (compute-cheap); the decoder reconstructs masked patches. This makes the model trainable at high masking ratios where naive symmetric masked-image training would be too slow.
High masking ratio. Images have substantial redundancy across nearby pixels; with only 15% masked (the BERT setting), reconstruction is too easy and the model learns shortcuts. 75% masking forces genuine prediction.
MAE produces representations competitive with contrastive methods for image SSL while being substantially cheaper to train.
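The patch arithmetic and the masked-only loss can be sketched as below. This is a toy illustration with hypothetical function names; patches are plain Python lists rather than tensors, and the encoder and decoder themselves are omitted.

```python
import random

def mae_mask(image_size=224, patch_size=16, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch indices for one image."""
    per_side = image_size // patch_size
    n_patches = per_side * per_side
    n_visible = int(n_patches * (1 - mask_ratio))
    order = list(range(n_patches))
    random.Random(seed).shuffle(order)
    visible = sorted(order[:n_visible])
    masked = sorted(order[n_visible:])
    return visible, masked

def masked_mse(pred, target, masked):
    """Mean-squared error over masked patches only.

    pred and target are lists of per-patch pixel lists (flattened).
    """
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred[idx], target[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count

visible, masked = mae_mask()   # 49 visible and 147 masked patches out of 196
```

Only the `visible` indices would be fed to the encoder, which is where the compute saving at a 75% masking ratio comes from.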
Denoising and diffusion as SSL
A broader generalisation: train a model to recover the original input from a corrupted version. Denoising autoencoders (Vincent et al., 2008) introduced this in the predict-from-input pattern of §2. Diffusion models (the Generative Models chapter develops them in depth) generalize: corrupt the input by adding Gaussian noise at varying intensities, train the model to predict the noise (or, equivalently up to a noise-level-dependent scaling, the score - the gradient of log-density with respect to the input).
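The training objective for diffusion:
$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]$$
where $x_t$ is the noisy version of $x_0$ at noise level $t$, $\epsilon$ is the noise, and $\epsilon_\theta(x_t, t)$ is the model’s noise prediction.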
For the SSL chapter the key observation: diffusion training is self-supervised. No external labels; the corruption (noise) is generated from the data alone; the target (clean data, or the noise) is derived from the data. The Generative Models chapter develops the substance - what changes when the model is used for generation rather than for representation learning.
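The self-supervised character of the objective is easy to see in code: both the corrupted input and the regression target are manufactured from the data point. The sketch below assumes the DDPM-style variance-preserving corruption $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, which is one common parameterisation and not something this chapter fixes; the "oracle" predictor is a hypothetical stand-in for a trained network.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Corrupt x0 to noise level alpha_bar in (0, 1); return (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean-squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
alpha_bar = 0.6
x_t, eps = add_noise(x0, alpha_bar, rng)

# Hypothetical oracle that knows x0 and inverts the corruption exactly.
a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
eps_oracle = [(xt - a * x) / b for xt, x in zip(x_t, x0)]
loss_oracle = noise_prediction_loss(eps_oracle, eps)       # ~0
loss_zero = noise_prediction_loss([0.0] * len(eps), eps)   # mean of eps**2
```

No label enters at any point: the training pair `(x_t, eps)` is generated from `x0` and a random draw.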
Trade-offs across generative objectives
A summary comparison:
Objective       | Modality                 | Bidirectional?                                 | Generative?   | Strength
----------------|--------------------------|------------------------------------------------|---------------|------------------------
Causal LM       | text, code, any sequence | No (autoregressive)                            | Yes           | scaling, deployment
Masked LM       | text                     | Yes                                            | Indirect      | understanding
Span corruption | text                     | Encoder bidirectional; decoder autoregressive  | Yes (decoder) | uniformity across tasks
MAE             | images                   | Yes (encoder); decoder only for reconstruction | Indirect      | efficient vision SSL
Diffusion       | images, audio, arbitrary | N/A (different deployment)                     | Yes           | generation quality
The choice among generative objectives is driven by the downstream deployment the trained model is intended for. Generative deployment (text generation, image generation) favours causal LM or diffusion. Understanding deployment (classification, retrieval, structured prediction) favours MLM or MAE. Mixed deployment (T5-style) favours span corruption. Modern multimodal pretraining (§7) often combines several of these in a single training run.
§5. Contrastive and Joint-Embedding Objectives
§4’s generative objectives predict the target itself - the next token, the masked patch, the clean image. Contrastive and joint-embedding objectives do something different: they train the model to produce embeddings of inputs in a way that respects a notion of similarity, without predicting any input directly. The training signal comes from comparing pairs: two augmented views of the same image should be close in embedding space; an image and its caption should be close; views of different images should be far.
This section develops the family. We start from the InfoNCE loss as the canonical contrastive formulation, walk through the most influential variants (SimCLR, MoCo, BYOL, SwAV, DINO), treat CLIP-style cross-modal contrastive learning, then discuss the surprising “negative-free” branch and the failure modes that motivated much of the field’s design choices.
Contrastive learning as a discriminative SSL paradigm
The basic recipe:
Take an unlabelled sample $x$ from the dataset.
Produce two augmented views $v_1 = t_1(x)$ and $v_2 = t_2(x)$ via random augmentation functions $t_1, t_2$ (random crops, colour jitter, rotations, blurring, etc.).
Encode both views into representations $z_1 = f_\theta(v_1)$, $z_2 = f_\theta(v_2)$.
$z_1$ and $z_2$ should be similar; representations of views from other samples in the same batch should not be similar to either.
The model is trained so that the positive pair $(z_1, z_2)$ has high similarity (typically cosine similarity) and negative pairs (an anchor view and views of other samples) have low similarity.
The general structure:
Anchor sample x:
│
├─ augmentation t_1 → view v_1 → encode → z_1 (anchor representation)
│ \
├─ augmentation t_2 → view v_2 → encode → z_2 (positive: pull close to z_1)
│
│
Other samples in batch:
├─ ... → ... → z_neg_1 (negative: push away from z_1)
├─ ... → ... → z_neg_2
├─ ... → ... → z_neg_3
...
The loss measures how well the anchor identifies its positive among the negatives.
The InfoNCE loss
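The most widely-used contrastive loss is InfoNCE (Oord et al., 2018, “Representation Learning with Contrastive Predictive Coding”). Given an anchor representation $z$, its positive partner $z^+$, and a set of negatives $\{z_i^-\}_{i=1}^{N}$:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z, z^+)/\tau)}{\exp(\mathrm{sim}(z, z^+)/\tau) + \sum_{i=1}^{N} \exp(\mathrm{sim}(z, z_i^-)/\tau)}$$
Reading the formula:
$\mathrm{sim}(\cdot, \cdot)$ is a similarity function - almost always cosine similarity $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$.
$\tau$ is a temperature hyperparameter (typically 0.1–0.5) that controls how sharply the loss distinguishes the positive from the negatives. Smaller $\tau$ makes the model sensitive to small similarity differences; larger $\tau$ smooths the contrast.
The numerator is the exponentiated similarity of the positive pair. The denominator sums the same quantity over the positive and all $N$ negatives.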
Minimizing the loss is equivalent to maximising the log-probability that, among the positive and all negatives, the actual positive is identified.
Mechanically, InfoNCE is a softmax classification loss with the positive as the one correct class and the negatives as wrong classes. The model is implicitly learning to classify which sample the anchor belongs to, given an “N+1-way” candidate set.
The information-theoretic interpretation (also from Oord et al., 2018): $\log(N+1)$ minus the InfoNCE loss, with $N$ the number of negatives, is a lower bound on the mutual information between the anchor and positive representations. Minimizing the loss therefore maximises a lower bound on that mutual information; the model learns representations whose mutual information across augmented views is high - capturing what is invariant under augmentation. (This interpretation has been refined and partly contested in subsequent work; the §8 theoretical-perspectives section returns to it.)
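The softmax-classification reading of InfoNCE can be made concrete with a small sketch. The function names and the toy 2-D vectors are illustrative assumptions; embeddings would normally come from an encoder and be batched.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: softmax cross-entropy with the
    positive as the correct class among 1 + len(negatives) candidates."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)
loss_misaligned = info_nce(anchor, [0.1, 0.9], negatives)

# If the positive is no more similar than the negatives, the loss is log(N + 1).
loss_chance = info_nce(anchor, [0.0, 1.0], [[0.0, 1.0]] * 3)
```

The chance-level value log(N + 1) is why the loss can only certify as much mutual information as the number of negatives allows.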
SimCLR: the canonical clean recipe
SimCLR (Chen et al., 2020) was the paper that demonstrated contrastive learning could match supervised pretraining on ImageNet. The recipe combines:
Random augmentations (crop+resize, colour jitter, Gaussian blur) applied independently to produce two views.
A standard image encoder (a ResNet, later a ViT).
A projection head - a small MLP on top of the encoder that maps the encoder output to the space where contrastive loss is computed. Crucially, the projection head’s output is used for the loss; the encoder’s output (one layer earlier) is used for downstream tasks. The projection head is discarded after pretraining.
Large batch sizes (often 4096 or 8192) to provide many negatives per anchor.
Long training (hundreds of epochs).
SimCLR’s contribution was the empirical clarity: showing which components matter (augmentations are crucial; the projection head matters; batch size matters) and demonstrating that, with the right combination, contrastive SSL produces competitive image representations.
MoCo: handling small batches
A practical problem with SimCLR: large batch sizes are expensive. MoCo (Momentum Contrast; He et al., 2020) decouples the number of negatives from the batch size by maintaining a memory queue of past representations to use as negatives.
The structure:
A query encoder processes the anchor view; trained via backprop.
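A key encoder processes the positive partner view; updated as an exponential moving average of the query encoder (momentum update):
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$
with momentum $m \in [0, 1)$ close to 1 (0.999 in the MoCo paper), where $\theta_q$ and $\theta_k$ are the query- and key-encoder parameters.
A memory queue of size $K$ (typically tens of thousands) holds past key-encoder outputs.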
The contrastive loss uses the query as anchor, the current positive key as positive, and the queue as negatives.
The queue and the moving-average update give MoCo two benefits. First, the queue can be large without proportionate memory cost - it just stores feature vectors, not full batches. Second, the slow-moving key encoder ensures the queue’s entries are consistent across recent steps; without momentum, the queue would contain old features that the encoder has since drifted away from.
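The two mechanisms, the momentum update and the fixed-size queue, can be sketched independently of any network. The class and function names are hypothetical, and parameters and embeddings are plain lists for illustration.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of past key embeddings used as negatives."""

    def __init__(self, size):
        self._items = deque(maxlen=size)

    def enqueue(self, keys):
        self._items.extend(keys)   # oldest entries fall off automatically

    def negatives(self):
        return list(self._items)

theta_q = [1.0, 2.0]
theta_k = [0.0, 0.0]
theta_k = momentum_update(theta_k, theta_q, m=0.9)   # approximately [0.1, 0.2]

queue = NegativeQueue(size=4)
queue.enqueue([[1.0], [2.0], [3.0]])
queue.enqueue([[4.0], [5.0]])        # [1.0] is evicted
```

With m close to 1 the key encoder changes slowly, so embeddings enqueued many steps ago remain comparable to fresh ones.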
MoCo-v2 and MoCo-v3 incorporated SimCLR’s projection-head idea and the standard ViT architectures, producing the now-standard contrastive baseline at small batch sizes.
The negative-free branch: BYOL, SwAV, DINO
A surprise of the contrastive-learning era. BYOL (Bootstrap Your Own Latent; Grill et al., 2020) appeared to remove negative samples entirely:
Two networks (online and target).
Online network produces an embedding from one view; passes it through a prediction head to produce a predicted target embedding.
Target network produces an embedding from the other view (using stop-gradient - no backprop through this branch).
Loss is the mean-squared error between the prediction and the target.
Target network is an exponential moving average of the online network (like MoCo’s key encoder).
There is no contrastive denominator. Naïvely, this should collapse - the model could learn to produce a constant embedding for every input (the “all-zeros solution”), making both prediction and target trivially match. It does not collapse in practice, and the question of why was an active research topic in 2020–2022.
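The collapse concern can be checked numerically. The sketch below uses the normalised form of the BYOL loss (squared error between L2-normalised vectors, which equals 2 − 2·cosine); it only evaluates loss values, so the stop-gradient on the target branch appears as a comment rather than as code, and the vectors are made-up illustrations.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def byol_loss(prediction, target):
    """Squared error between L2-normalised vectors (= 2 - 2 * cosine).

    In training, `target` comes from the EMA network and is treated as a
    constant (stop-gradient); this sketch only evaluates the loss value.
    """
    p, z = l2_normalize(prediction), l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, z))

# A collapsed model emits the same vector for every view ...
constant = [0.3, 0.3, 0.3]
loss_collapsed = byol_loss(constant, constant)        # 0.0

# ... while an informative but imperfect prediction pays a small cost.
loss_informative = byol_loss([1.0, 0.2, 0.0], [1.0, 0.0, 0.1])
```

The constant solution is a global minimum of the loss value, so whatever keeps training away from it has to come from the optimisation dynamics rather than from the objective itself.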
The answer (Tian et al., 2021, and successor analyses) involves several factors interacting:
The prediction head’s asymmetric structure (online predicts target, but not vice versa) breaks symmetric collapse.
The momentum update slows the target enough that the online network always chases a slightly-different target - implicit regularization against collapse.
The augmentations introduce variability that a constant solution cannot match.
SwAV (Caron et al., 2020) used a different approach: each view is assigned to one of a set of $K$ learned prototypes via an optimal-transport-based clustering; the loss requires that the two views’ prototype assignments agree. Clustering structure provides the implicit contrast.
DINO (Caron et al., 2021) - Self-distillation with no labels - applied a BYOL-like structure to Vision Transformers, with one important detail: the target is centered and sharpened (mean-subtraction across the batch + low-temperature softmax). This centering serves as the implicit-contrast mechanism, preventing collapse. DINO produced features with remarkable emergent properties - self-attention maps that segmented objects without supervision became a striking visualization.
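Centering and sharpening can be sketched as two small operations on the teacher's outputs. The temperatures and centre momentum below are illustrative values in the range reported for DINO, and the function names are hypothetical; the networks producing the logits are omitted.

```python
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def teacher_probs(teacher_logits, center, temperature=0.04):
    """Centre (subtract a running mean), then sharpen (low temperature)."""
    return softmax([l - c for l, c in zip(teacher_logits, center)], temperature)

def update_center(center, batch_teacher_logits, momentum=0.9):
    """EMA of the per-dimension batch mean of teacher outputs."""
    n = len(batch_teacher_logits)
    batch_mean = [sum(col) / n for col in zip(*batch_teacher_logits)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

def dino_loss(student_logits, teacher_logits, center, student_temp=0.1):
    """Cross-entropy from the (fixed) teacher distribution to the student."""
    t = teacher_probs(teacher_logits, center)
    s = softmax(student_logits, student_temp)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

batch = [[2.0, 1.0, 0.5], [1.8, 1.2, 0.1]]
center = update_center([0.0, 0.0, 0.0], batch)
sharp = teacher_probs(batch[0], center, temperature=0.04)
soft = teacher_probs(batch[0], center, temperature=1.0)
```

The two operations pull in opposite directions: centering discourages any one output dimension from dominating, while sharpening discourages a uniform output.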
The “negative-free” approaches revealed something important: the contrastive loss is one way to prevent representational collapse, but not the only way. Asymmetric architectures, momentum updates, prototype clustering, and feature centering all provide alternative anti-collapse mechanisms. The deeper principle: SSL works when the training procedure has some structural force pushing the model toward producing diverse, informative representations rather than the trivial constant solution.
Failure modes: collapse, in three forms
The SSL literature names three related failure modes, distinguished by which kind of structure degenerates:
Healthy SSL training (diverse representations):
▲ feature 2
│ • • • •
│ •• ••• •• each • is one sample's
│ •• • •• ••• representation in the
│ • • • • •• feature space; spread
│•• •• • • •• out across many dimensions
│ ••• • •• ••
│ • • • •
└─────────────▶ feature 1
Complete representational collapse:
▲ feature 2
│
│
│ •
│ • • • all samples mapped to (essentially)
│ • • • • a single point in feature space -
│ • • • the trivial constant solution
│ •
│
└─────────────▶ feature 1
Dimensional collapse:
▲ feature 2
│
│ ••••••••••••• features spread along ONE
│ ••••••••••••• direction but constant in
│ ••••••••••••• the others; the embedding
│ ••••••••••••• is effectively 1D in a
│ ••••••••••••• d-dimensional space
│
└─────────────▶ feature 1
Representational collapse - all samples mapped to the same feature. The trivial constant solution.
Dimensional collapse - features spread out along only a few dimensions of an embedding space designed to be higher-dimensional. Capacity wasted. (Identified by Jing et al., 2021, with diagnostics based on the eigenvalue spectrum of the feature covariance.)
Shortcut learning - the model exploits an unintended correlation in the augmentation pipeline (e.g., consistently using JPEG-compression artifacts to identify a sample’s identity), producing high training-time contrastive performance with useless downstream representations.
The history of contrastive SSL is partly a history of anti-collapse mechanisms - careful augmentation design, projection heads, prototype clustering, feature centering, momentum updates, BatchNorm-related tricks. Each method’s specific design choices are largely responses to one or more of these failure modes.
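The covariance-spectrum diagnostic for the first two failure modes can be illustrated in two dimensions, matching the feature-1/feature-2 pictures above. This sketch uses the closed-form eigenvalues of a 2×2 covariance matrix so that it needs no linear-algebra library; the point sets are invented, and real diagnostics operate on high-dimensional embeddings.

```python
import math

def covariance_eigenvalues_2d(points):
    """Eigenvalues (largest first) of the 2x2 covariance of 2-D features."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    mean = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return mean + radius, max(mean - radius, 0.0)

healthy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dimensional = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # all on one line
complete = [(0.5, 0.5)] * 4                                       # a single point

spectra = {name: covariance_eigenvalues_2d(pts)
           for name, pts in [("healthy", healthy),
                             ("dimensional", dimensional),
                             ("complete", complete)]}
```

Healthy features have variance in every direction, dimensional collapse shows up as one or more near-zero eigenvalues, and complete collapse zeroes the whole spectrum. Shortcut learning does not show up in this diagnostic and has to be caught by downstream evaluation.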
CLIP: cross-modal contrastive learning
The most-deployed contrastive SSL method in 2026 is CLIP (Radford et al., 2021), which we treated historically in §2. The structure adapts the contrastive recipe to cross-modal pairs:
The dataset is image-caption pairs from the web (hundreds of millions of pairs).
An image encoder produces image embeddings; a text encoder produces text embeddings.
The InfoNCE loss is computed per batch: for each image, the positive is its actual caption; the negatives are all other captions in the batch. (And symmetrically: for each caption, the positive is its actual image; the negatives are all other images.)
The cross-modal structure provides a semantically grounded contrastive signal. SimCLR-style image-only contrastive learning gives image representations that are invariant under augmentation; CLIP gives image representations that are aligned with natural-language descriptions of their content. The latter generalizes better to downstream semantic tasks, and supports zero-shot transfer: at deployment, give the text encoder a list of candidate class names, compute image-encoder similarities, and pick the highest.
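The symmetric loss and the zero-shot procedure can both be sketched from a similarity matrix. The names, the temperature value, and the similarity numbers are illustrative assumptions; in practice the similarities come from the two encoders and the temperature is learned.

```python
import math

def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def clip_loss(similarity, tau=0.07):
    """Symmetric InfoNCE over a B x B image-by-text similarity matrix.

    similarity[i][j] is the cosine similarity of image i and caption j;
    matched pairs sit on the diagonal.
    """
    b = len(similarity)
    logits = [[s / tau for s in row] for row in similarity]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(b)) / b
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(b)) / b
    return 0.5 * (image_to_text + text_to_image)

def zero_shot_classify(image_to_class_similarity, class_names):
    """Pick the class whose text embedding is most similar to the image."""
    best = max(range(len(class_names)), key=lambda k: image_to_class_similarity[k])
    return class_names[best]

aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
shuffled = [aligned[1], aligned[2], aligned[0]]   # captions no longer match
label = zero_shot_classify([0.21, 0.35, 0.18], ["cat", "dog", "car"])
```

Zero-shot classification reuses the training-time comparison unchanged: class names simply play the role of candidate captions.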
CLIP and its successors (ALIGN, BLIP, SigLIP, and others) are the workhorse of cross-modal SSL by 2026, with §7 of this chapter developing the family further.
Where contrastive SSL stands in 2026
The empirical picture: contrastive and joint-embedding methods produced the best image-only SSL recipes through 2022; generative methods (MAE, in particular) caught up and have parity at scale; cross-modal contrastive (CLIP-class) is the dominant approach for multimodal pretraining and remains essentially unchallenged for vision-language tasks.
The deeper structural lesson is that contrastive learning is one mechanism in a broader space of joint-embedding methods, all of which require some anti-collapse mechanism. The diversity of recipes (SimCLR vs MoCo vs BYOL vs DINO) reflects different choices of that mechanism rather than fundamentally different ideas.
§6. Predictive and Latent-Space Objectives
A third major family. JEPA (Joint Embedding Predictive Architecture; LeCun, 2022 position paper) splits the difference between the two we have already covered:
Generative objectives (§4) predict the target in input space - actual pixels, actual tokens. The model must reconstruct fine-grained detail of the input.
Contrastive objectives (§5) do not predict any target directly - they only impose a similarity structure on embeddings.
Predictive joint-embedding objectives (this section) predict the target in embedding space. The model predicts the embedding of the unseen part given the embedding of the seen part. There is reconstruction, but the reconstruction target is itself a learned representation, not raw input.
This section develops the JEPA framing, the I-JEPA / V-JEPA / A-JEPA / MC-JEPA lineage, the connection to predictive coding from neuroscience, and the argument for why predicting in latent space may scale differently than predicting in input space.
The JEPA structure
A JEPA model has three components:
Input x_full (an image, a video, ...)
│
▼
Split into context and target:
x_full = (x_context, x_target)
e.g., x_context = visible image patches
x_target = held-out image patches
│
▼
Context encoder: z_context = f_θ(x_context)
Target encoder: z_target = f_target(x_target) (often EMA of f_θ;
gradient-detached)
│
▼
Predictor:
z_predicted = g_φ(z_context, info about target location)
│
▼
Loss = distance(z_predicted, z_target)
(typically L2 or cosine distance)
Three structural features distinguish JEPA from the §4 generative objectives:
Prediction in embedding space, not input space. The target is the encoder’s output, not raw pixels or raw tokens.
Asymmetric encoders. The context encoder is trained by gradient descent; the target encoder receives no gradients and is typically maintained as an exponential moving average (EMA) of the context encoder. This is the same momentum-encoder device used in MoCo and BYOL (§5).
A predictor between encoders. A separate network maps context embeddings to predicted target embeddings, conditioned on which target is being predicted (its spatial location, its temporal offset, its modality, etc.).
The model never has to reconstruct raw input. Pixel-level noise, irrelevant fine detail, and modality-specific signal that is not relevant for downstream representation are not in the loss - the embedding is free to abstract them away.
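The data flow through the three components can be traced with a deliberately tiny stand-in. Everything here is a toy assumption: the "encoders" are element-wise scalings, the "predictor" adds the target position as an offset, and the numbers are arbitrary. Only the wiring (context encoder, gradient-free target encoder, predictor, embedding-space loss, EMA update) mirrors the structure above.

```python
def encode(weights, x):
    """Toy encoder: element-wise scaling (stands in for a deep network)."""
    return [w * v for w, v in zip(weights, x)]

def predict(predictor_weights, z_context, target_position):
    """Toy predictor conditioned on where the target sits."""
    return [p * z + target_position for p, z in zip(predictor_weights, z_context)]

def jepa_loss(z_predicted, z_target):
    """Squared L2 distance in embedding space."""
    return sum((a - b) ** 2 for a, b in zip(z_predicted, z_target))

def ema_update(target_weights, context_weights, momentum=0.99):
    """Target encoder tracks the context encoder; it gets no gradients."""
    return [momentum * t + (1 - momentum) * c
            for t, c in zip(target_weights, context_weights)]

context_w, target_w, predictor_w = [1.0, 0.5], [0.8, 0.4], [1.0, 1.0]
x_context, x_target, position = [2.0, 4.0], [1.0, 3.0], 0.0

z_context = encode(context_w, x_context)
z_target = encode(target_w, x_target)          # treated as a constant
z_predicted = predict(predictor_w, z_context, position)
loss = jepa_loss(z_predicted, z_target)
target_w = ema_update(target_w, context_w)
```

Note that raw inputs never appear in the loss: only `z_predicted` and `z_target` are compared.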
Why predict in embedding space?
The argument for JEPA’s design choice runs through three observations.
Reconstructive losses force the model to model irrelevant detail. A masked-autoencoder (§4) has to reproduce the exact pixel values of the masked region - including high-frequency noise, JPEG-compression artifacts, and exact lighting variation that the model would ideally abstract over. Some fraction of model capacity is devoted to modelling things that no downstream task cares about.
Contrastive losses depend on the augmentation strategy. SimCLR-class methods (§5) learn representations invariant to the chosen augmentations - colour jitter, random crops, blurring. Other invariances that the augmentations did not encode are not enforced. The choice of augmentation strategy shapes what the model can and cannot do downstream; designing good augmentations is itself a research problem.
Predicting in embedding space allows the model to choose what to encode. The target encoder, by being a moving average of the context encoder, encodes “whatever the context encoder has learned to encode”. The training objective then asks: can the context encoder predict the embedding the target encoder produces? If the target encoder has learned to discard irrelevant detail, the predictor must produce that same discarded version; the loss does not push the model to model the discarded detail.
The cleanest way to state the case: in a generative objective, the loss is in the input space - the model is told what to reconstruct (pixels). In JEPA, the loss is in the embedding space - the model is told that the predicted and actual target embeddings should agree, with the embeddings themselves free to be whatever the model finds most useful.
LeCun’s 2022 position paper makes a stronger claim: that JEPA is the right path forward for vision SSL, that generative reconstruction is fundamentally wasteful, and that future foundation models should be predictive-embedding architectures. As of 2026 the empirical picture is more nuanced - JEPA models are competitive with MAE and CLIP on many vision-SSL benchmarks but have not decisively overtaken either. The argument that predicting in embedding space is structurally better is more compelling at the principle level than the current empirical record settles.
I-JEPA: the first concrete instance
I-JEPA (Image JEPA; Assran et al., 2023) instantiates JEPA for image SSL. The recipe:
Take an image; split it into spatial patches (ViT-style).
Select a single context block (typically a large contiguous region).
Select one or several smaller target blocks elsewhere in the image.
Encode the context block with the context encoder.
Encode each target block with the target encoder (EMA of context encoder; gradient-detached).
Use the predictor, conditioned on the target block’s spatial location, to predict the target embedding from the context embedding.
Loss: L2 distance between predicted and actual target embedding.
Mechanically:
Image divided into patches:
┌──┬──┬──┬──┬──┬──┬──┐
│T1│  │  │  │  │  │  │
├──┼──┼──┼──┼──┼──┼──┤
│  │  │  │  │  │  │T2│
├──┼──┼──┼──┼──┼──┼──┤
│  │  │  │C │C │C │  │
├──┼──┼──┼──┼──┼──┼──┤      C = context block (visible)
│  │  │  │C │C │C │  │      T1, T2, T3 = target blocks (hidden)
├──┼──┼──┼──┼──┼──┼──┤      (selected at random)
│  │  │  │C │C │C │  │
├──┼──┼──┼──┼──┼──┼──┤
│T3│  │  │  │  │  │  │
└──┴──┴──┴──┴──┴──┴──┘
For each target block T_i:
z_context = f_θ(C)
z_target_i = f_target(T_i) (EMA encoder, no gradient)
z_pred_i = g_φ(z_context, loc(T_i))
loss_i = || z_pred_i − z_target_i ||²
Total loss = mean over target blocks
The choice of large context block (rather than scattered visible patches) is empirically important - it forces the predictor to do semantically meaningful inference rather than local-patch interpolation.
I-JEPA produced representations competitive with MAE and contrastive methods on standard image-classification benchmarks, at substantially lower compute cost during training (because no pixel-level reconstruction is needed). The result strengthened the JEPA-as-direction argument.
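The block-sampling step of the recipe can be sketched on a patch grid. The block sizes, grid size, and function names are illustrative assumptions rather than the published hyperparameters. Removing target patches from the context is included so that the prediction task cannot be solved by copying.

```python
import random

def sample_block(grid, height, width, rng):
    """Return the set of (row, col) patches in a random height x width block."""
    top = rng.randrange(grid - height + 1)
    left = rng.randrange(grid - width + 1)
    return {(r, c) for r in range(top, top + height)
                   for c in range(left, left + width)}

def sample_context_and_targets(grid=14, n_targets=3, seed=0):
    """Sample one large context block and several small target blocks,
    then drop any context patch that overlaps a target."""
    rng = random.Random(seed)
    targets = [sample_block(grid, 3, 3, rng) for _ in range(n_targets)]
    context = sample_block(grid, 10, 10, rng)
    hidden = set().union(*targets)
    return context - hidden, targets

context, targets = sample_context_and_targets()
```

The context encoder would see only the patches in `context`; each set in `targets` is encoded by the target encoder and predicted separately.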
The JEPA family beyond images
The framework generalizes naturally to other modalities, with each producing concrete instances:
V-JEPA (Video JEPA; Bardes et al., 2024). Context and target are blocks of frames in a video; the predictor uses temporal location. Captures motion and temporal dynamics in the embedding space without pixel-level frame reconstruction.
A-JEPA (Audio JEPA). Spectrogram regions as context and target.
MC-JEPA (Motion-Content JEPA; Bardes et al., 2023). Jointly learns motion (optical-flow) features and content features within a shared encoder.
Embodied JEPAs. Active research direction: applying JEPA-style predictive embedding to sequences of (observation, action, next-observation) tuples in robotic and game-playing settings.
The cross-modality applicability is part of what makes JEPA conceptually attractive - the same framework instantiates across visual, auditory, video, and embodied domains.
Connections to predictive coding
The JEPA framing has roots in predictive coding, a theory from neuroscience proposing that perception and learning in biological systems work by predicting incoming sensory input and using prediction errors as the learning signal. The hierarchical-predictive-coding theories of Rao and Ballard (1999), Friston’s free-energy principle (2010), and related neuroscience frameworks all describe the brain as a multi-level prediction machine where each level predicts the activity at the level below it.
JEPA’s structural similarity to predictive coding is not coincidental - LeCun and others have drawn the parallel explicitly. The relevant difference: predictive coding in neuroscience often emphasizes that predictions happen at every level of the hierarchy, and that errors propagate up and predictions propagate down. JEPA is a single-level predictive system; multi-level JEPA architectures (Hierarchical-JEPA) are an open research direction.
The biological-plausibility argument is sometimes invoked for JEPA - that predictive-embedding architectures are more like what brains do than generative reconstruction is. This argument is rhetorical; what brains actually do at the algorithmic level is not settled enough to validate or refute the analogy. The argument is mentioned in the JEPA literature; whether it should carry weight depends on one’s views about brain-inspired AI design.
Why latent-space prediction may scale differently
A speculative observation, not yet definitively established but worth recording. Reconstructive objectives (§4) have a fixed target per training example - the achievable loss is bounded below by the irreducible entropy of the input, and the model devotes capacity to modelling input-space noise that does not contribute to downstream tasks. Latent-space prediction (this section) has a learnable target - as the model improves, the target encoder’s embeddings can become richer, and the prediction task can become commensurately harder. This co-evolving, adversarial-like dynamic between the encoder and the predictor might scale differently than fixed-target reconstruction.
The empirical evidence so far is suggestive but not decisive. JEPA-class methods have continued to improve at the largest scales at which they have been tested; whether the “different scaling” hypothesis holds at frontier scale is open. We record it as an open empirical question, related to OP-FM-3 (predicting emergence).
Where JEPA sits in 2026
JEPA is a credible alternative to both generative and contrastive SSL, with empirical results that are competitive but not decisively dominant. The research programme is active and well-funded; whether JEPA becomes the dominant SSL paradigm or whether it remains one of several competing approaches is open. The structural argument - that predicting in embedding space is principled in a way that input-space reconstruction is not - is the durable contribution of the JEPA framing, whether or not the specific recipes win out.
§7. Multimodal Self-Supervision
Sections §4–§6 developed SSL families that operate within a single modality: causal LM on text, MAE on images, contrastive learning on augmented views of the same image. Multimodal SSL trains on data that contains more than one modality at once - image-text pairs, audio-video, vision-language-action sequences - and uses the cross-modal structure as the training signal.
This is operationally important: the modern foundation-model regime is predominantly multimodal. Frontier LLMs are natively multimodal; vision-language models are the dominant deployment surface for image understanding; vision-language-action models are how foundation models reach robotics. The SSL substrate for all of these is this section’s subject.
Why cross-modal SSL matters
Single-modality SSL produces representations invariant to whatever the loss treated as invariant. SimCLR-class image SSL gives representations invariant under augmentation (different crops of the same image should match) - useful, but the invariances are engineering-designed (we picked the augmentations) rather than semantically grounded. Pure image-only SSL has no way to know that “this picture of a cat” should be represented similarly to other pictures of cats; it knows only that two crops of the same picture should match.
Cross-modal SSL fixes this by grounding the invariance signal in another modality. An image’s caption is a semantic description of the image’s content; if the model learns to align images with their captions, the image representations inherit semantics from the text-side signal - without ever having seen explicit semantic labels.
This is the structural reason multimodal SSL works for downstream tasks: the alignment between modalities supplies a semantic grounding that single-modality SSL lacks.
CLIP and its lineage
The canonical example, treated historically in §2 and as a contrastive method in §5. CLIP (Radford et al., 2021) is contrastive multimodal SSL: an image encoder and a text encoder are trained jointly to map matched image-caption pairs to nearby embeddings and mismatched pairs to distant ones. The loss is the symmetric InfoNCE objective on a batch of image-caption pairs.
The CLIP recipe has been extended substantially since 2021:
ALIGN (Jia et al., 2021) used noisier web-scale image-text pairs to demonstrate scale advantages.
BLIP and BLIP-2 (Li et al., 2022, 2023) added a generative component: produce captions from images, not just match them.
SigLIP (Zhai et al., 2023) replaced the softmax-based InfoNCE with a sigmoid-based per-pair loss, removing the batch-wide softmax normalization that couples every pair in the batch and complicates scaling to very large batches.
OpenCLIP and many open-weights variants reproduced and extended CLIP at various scales.
By 2026 CLIP-class contrastive multimodal SSL is essentially the default for producing image embeddings aligned with language.
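The difference between the softmax and sigmoid formulations is easiest to see in code. The sketch below follows the pairwise form described for SigLIP, with a scale `t` and bias `b` that are learned in practice; the initial values used here and the similarity matrices are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def sigmoid_pair_loss(similarity, t=10.0, b=-10.0):
    """Sigmoid loss over every image-caption pair in a B x B matrix.

    Matched pairs (the diagonal) get label +1 and all others -1; each
    pair is scored independently, with no softmax across the batch.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (t * similarity[i][j] + b))
    return -total / n

aligned = [[0.9, 0.1], [0.2, 0.8]]
swapped = [[0.1, 0.9], [0.8, 0.2]]
loss_aligned = sigmoid_pair_loss(aligned)
loss_swapped = sigmoid_pair_loss(swapped)
```

Because each term depends on a single pair, the sum can be split across devices without first gathering a full row or column of similarities.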
Contrastive vs generative multimodal SSL
Contrastive multimodal (CLIP-class). Train two encoders to produce embeddings that match across modalities. The loss is contrastive (InfoNCE or variants). No generation.
Generative multimodal. Train a single model - typically a multimodal Transformer - to generate one modality from another, or to predict masked tokens across modalities. Examples:
Flamingo (Alayrac et al., 2022). A vision encoder feeds into an autoregressive language model via cross-attention; the model generates text grounded in input images.
GPT-4V / GPT-4o and successors. Multimodal autoregressive models trained to predict tokens in interleaved image-text-audio sequences. The training objective is causal LM (§4) applied to mixed-modality token streams.
Multimodal MAE variants. Mask patches across modalities (image patches and text tokens simultaneously); train to reconstruct.
Generative multimodal models tend to produce richer representations than purely contrastive ones - they have to encode the actual structure of each modality, not just an embedding aligned across modalities. But they are more expensive to train and harder to evaluate.
By 2026 modern frontier multimodal models combine both. They use contrastive cross-modal training to align modality-specific encoders, and generative multimodal training to produce capable generative deployment. The best of both, at the cost of complexity.
Beyond image-text: audio, video, action
Cross-modal SSL extends naturally beyond image-text:
Audio-vision (lip-reading, audio captioning, audio-video retrieval). Audio encoders aligned with video encoders via contrastive losses on aligned audio-video segments. AudioCLIP, VATT (Akbari et al., 2021), and successors.
Speech-text alignment. Whisper (Radford et al., 2022) trains an encoder-decoder on hundreds of thousands of hours of paired speech-text, producing both transcription capability and audio representations aligned with language.
Vision-language-action (VLA). Models that learn from sequences of (visual observation, language instruction, motor action) tuples for robotic control. RT-1, RT-2, OpenVLA, π0 (treated in the Robotics chapter). The SSL objective is typically causal LM applied to interleaved multi-modality token streams; the action modality has its own tokenization (discretized motor commands or quantized actions).
Scientific multimodal. Models that align text descriptions with protein sequences, molecular graphs, or experimental observations. The downstream use is scientific discovery; the SSL pattern is the same.
The shared pattern: each modality has its own tokenizer (§3 of LLM for text; patch tokenization for images; discrete codec or spectrogram patches for audio; quantized action vocabulary for motor control), all modalities flow into a unified sequence space, and the model is trained to predict across the sequence with cross-modal supervisory signal.
Native multimodal pretraining
The dominant 2026 pattern for frontier multimodal models is native multimodal training: rather than train modality-specific encoders separately and align them post hoc, train a single Transformer on interleaved multi-modality sequences from the start.
Native multimodal training data - interleaved sequences:
<image tokens for sunset photo> "I took this last weekend" <image tokens for second sunset> "the colors were better in the second one"
<audio tokens for question> "What is the capital of France?" "Paris."
<image tokens for math problem> "Solve this." "First, isolate x..." <image tokens for answer>
<image tokens for robot view> "pick up the red block" <action tokens for pick-up> <image tokens for after>
Modality boundaries are marked by special tokens; the model's
causal-LM objective applies uniformly to the whole sequence.
The model learns each modality's structure, the boundaries
between modalities, and the conditional dependencies across
modalities - all from one self-supervised objective.
Native multimodal models (modern Gemini, GPT-4o-class, Claude with vision) are the dominant frontier deployment surface for multimodal foundation models. The training story is structurally simpler than the older “train encoders separately, align later” pattern; the engineering is more complex because the training data must be assembled with appropriate cross-modal pairings.
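The interleaved-sequence idea can be made concrete with a small sketch. Everything named here is hypothetical: the boundary tokens, the token strings, and the helper functions are illustrative stand-ins, not the format of any particular model. The sketch shows only that one causal next-token objective is applied uniformly across modality boundaries.

```python
# Hypothetical special tokens marking modality boundaries.
BOUNDARY = {
    "image": ("<img>", "</img>"),
    "audio": ("<aud>", "</aud>"),
    "action": ("<act>", "</act>"),
    "text": (None, None),  # text tokens are left unwrapped in this sketch
}

def interleave(segments):
    """Flatten (modality, tokens) segments into one token stream."""
    stream = []
    for modality, tokens in segments:
        start, end = BOUNDARY[modality]
        if start is not None:
            stream.append(start)
        stream.extend(tokens)
        if end is not None:
            stream.append(end)
    return stream

def next_token_pairs(stream):
    """Causal-LM training pairs: predict token t+1 from tokens 0..t."""
    return [(stream[:t + 1], stream[t + 1]) for t in range(len(stream) - 1)]

# A robot-control style example with made-up token ids.
segments = [
    ("image", ["v17", "v4"]),
    ("text", ["pick", "up", "the", "red", "block"]),
    ("action", ["a3", "a9"]),
]
stream = interleave(segments)
pairs = next_token_pairs(stream)
```

Note that the pair whose context ends with the closing image token has a text token as its target: the cross-modal dependency is learned from the same objective as everything else.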
Open question: what does cross-modal alignment actually buy?
A perennial research question (OP-FM-13 captures a related facet). The empirical claim of CLIP-class methods is that cross-modal alignment grounds representations semantically. But how much of the downstream capability comes from the cross-modal alignment per se, vs. from the scale of the data the alignment objective enabled?
Several lines of evidence suggest both factors matter:
CLIP at fixed dataset scale outperforms image-only contrastive learning at the same scale on downstream semantic tasks - alignment helps.
Pure-text LLMs trained without any image data have nonetheless been shown to encode representations with some implicit spatial-and-physical structure (after sufficient scale on text data alone) - text alone can do some of the work.
Native multimodal models trained on the same data volume but with multimodal objectives outperform pure-text models on multimodal tasks - the modalities themselves carry signal.
The clean question is whether alignment is the active ingredient or whether more modalities of data is the active ingredient. As of 2026 the answer appears to be “both, in a way the field has not cleanly separated.” The Multimodal Models chapter develops the question in depth.
§8. Why SSL Works: Theoretical Perspectives
The empirical fact: SSL representations transfer to downstream tasks they were not trained for, often outperforming supervised pretraining at comparable scales. The empirical fact is robust across modalities, methods, and a decade of replication. The explanation for the fact is partial; the theoretical accounts in the literature each capture a piece, none captures the whole.
This section surveys four theoretical perspectives. As elsewhere in the book, we do not adjudicate; the §3 editorial framing applies - three partial accounts of why SSL transfers, here developed at greater depth than §3 had room for.
Perspective 1: SSL representations as approximate sufficient statistics
The earliest and most general framing. A sufficient statistic $T(X)$ of a random variable $X$ with respect to a prediction target $Y$ is a function of $X$ that retains all the information $X$ carries about $Y$ - formally, $I(T(X); Y) = I(X; Y)$. A minimal sufficient statistic is the most compressed representation that still predicts $Y$ as well as $X$ itself does.
An SSL model trained to predict masked tokens (or next tokens, or augmented views) builds representations that are approximately sufficient for the pretraining task. The interesting empirical observation: these representations also turn out to be useful for many downstream tasks that the pretraining did not target directly.
The hypothesis: if the pretraining target encompasses enough of the input’s structure (predicting anything from anything else about an input), the resulting representation captures the input’s shared structure - the part useful across many downstream tasks. Tishby’s Information Bottleneck principle (Tishby et al., 1999) and its deep-learning extensions formalize a related idea: representations that maximally compress the input while retaining task-relevant information are good representations.
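For readers who want the two ideas in symbols, a compact statement follows. The notation is an assumption of this sketch: $X$ is the input, $Y$ the prediction target, and $Z = f(X)$ the learned representation; $\beta$ is the trade-off weight of the Information Bottleneck.

```latex
% Data-processing inequality: a representation cannot add information about Y.
\[
  I(Z; Y) \;\le\; I(X; Y),
  \qquad \text{with equality exactly when } Z \text{ is sufficient for } Y .
\]

% Information Bottleneck: compress X while keeping what is relevant to Y.
\[
  \min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0 .
\]
```

The first line says what "approximately sufficient" means (the gap $I(X;Y) - I(Z;Y)$ is small); the second expresses the compression-versus-relevance trade-off the paragraph describes.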
What this account explains. Why SSL representations transfer across tasks (the shared structure captures task-relevant information that many downstream tasks share). Why scale matters (more data exposes more of the underlying structure to the compression process).
What this account does not explain. Which structure SSL captures and which it discards - the account is fully general, and so cannot distinguish a good SSL method from a bad one. It also does not predict which downstream tasks will benefit and which will not.
Perspective 2: Mutual information and its limits
The InfoNCE loss (§5) was originally derived as a lower bound on the mutual information between a context representation and the signal it predicts (Oord et al., 2018); applied to contrastive SSL, the same bound holds between two augmented views. The interpretation: contrastive SSL maximizes a lower bound on the mutual information between views, learning representations that capture what is invariant across the augmentations.
This is a clean theoretical framing - mutual information is a well-defined information-theoretic quantity, and the InfoNCE bound is mathematically rigorous (under appropriate assumptions).
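The bound can be stated compactly. The symbols are assumptions of this sketch: $c$ is the context (or first view), $x$ the matching positive, $x_1, \dots, x_N$ the candidate set containing the positive and $N-1$ negatives, and $f$ a positive scoring function.

```latex
\[
  \mathcal{L}_N \;=\; -\,\mathbb{E}\!\left[ \log
      \frac{f(x, c)}{\sum_{j=1}^{N} f(x_j, c)} \right],
  \qquad
  I(x; c) \;\ge\; \log N \;-\; \mathcal{L}_N .
\]
```

Since $\mathcal{L}_N \ge 0$, the right-hand side can never exceed $\log N$: the bound is only as informative as the number of candidates allows, which is one of the "appropriate assumptions" worth keeping in mind.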
The trouble starts when the framing is taken seriously as an explanation of why SSL works.
Tschannen et al. (2020), “On Mutual Information Maximization for Representation Learning”, showed empirically that the mutual-information-maximization framing is misleading. Tighter bounds on mutual information did not yield better representations and sometimes yielded worse ones, and encoders whose mutual information is maximal by construction (invertible networks) still varied widely in downstream quality. The mutual-information bound is correct as math; it does not predict downstream utility.
The deeper issue: high mutual information between views does not require semantically useful representations. A representation that perfectly identifies which augmented sample each view came from would have maximum mutual information between views (the views completely determine which sample) but be useless downstream - it would just memorize sample identity. What contrastive SSL actually learns depends not on the loss alone but on the inductive biases of the architecture, the augmentation choices, and the optimization dynamics.
The mutual-information account is therefore incomplete. It captures part of why SSL works (encoders are learning something about the structure of input pairs) but does not predict which inductive biases produce useful representations.
Perspective 3: Spectral analyses of contrastive learning
A more recent theoretical thread. Several papers - Saunshi et al. (2019), HaoChen et al. (2021), Saunshi et al. (2022) - analyze contrastive learning as an eigendecomposition of a structured operator.
The setup: define an augmentation graph whose vertices are augmented views and whose edge weights measure how likely two views are to arise from the same underlying sample (e.g., two crops of the same image). Contrastive learning’s loss can be reformulated as approximating the top eigenvectors of this graph’s normalized adjacency matrix. The learned representations are essentially the top-$k$ eigenvectors, where $k$ is the embedding dimension.
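A toy version of this setup can be worked through directly. The sketch below is an illustration under stated assumptions, not the procedure of the cited papers: the six-vertex graph, its edge weights, and the power-iteration solver are all invented for the example. Vertices 0-2 and 3-5 stand for views of two different underlying samples, joined by one weak edge.

```python
import math

def two_cluster_graph():
    """Hypothetical augmentation graph: two tight groups, one weak link."""
    n = 6
    w = [[0.0] * n for _ in range(n)]
    for group in ([0, 1, 2], [3, 4, 5]):
        for i in group:
            for j in group:
                if i != j:
                    w[i][j] = 1.0
    w[2][3] = w[3][2] = 0.1
    return w

def normalized_adjacency(weights):
    """D^{-1/2} A D^{-1/2}, where D holds the weighted degrees."""
    n = len(weights)
    deg = [sum(row) for row in weights]
    return [[weights[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def top_eigenvectors(matrix, k, iters=2000):
    """Top-k eigenvectors by power iteration with deflation.

    Assumes a symmetric matrix with eigenvalues in [-1, 1], which holds
    for a normalized adjacency matrix with non-negative weights.
    """
    n = len(matrix)
    # Shift to (M + I) / 2: eigenvalues become non-negative, eigenvectors unchanged.
    shifted = [[0.5 * (matrix[i][j] + (1.0 if i == j else 0.0))
                for j in range(n)] for i in range(n)]
    found = []
    for index in range(k):
        vec = [math.sin(1.0 + i + index) for i in range(n)]  # deterministic start
        for _ in range(iters):
            vec = [sum(shifted[i][j] * vec[j] for j in range(n)) for i in range(n)]
            for prev in found:  # project out eigenvectors already found
                dot = sum(a * b for a, b in zip(vec, prev))
                vec = [a - dot * b for a, b in zip(vec, prev)]
            norm = math.sqrt(sum(a * a for a in vec))
            vec = [a / norm for a in vec]
        found.append(vec)
    return found

vectors = top_eigenvectors(normalized_adjacency(two_cluster_graph()), k=2)
second = vectors[1]
# Each vertex's embedding is its coordinates across the top-k eigenvectors.
embeddings = [[vec[i] for vec in vectors] for i in range(6)]
```

The first eigenvector is the uninformative one (all entries share a sign); the second takes opposite signs on the two groups, so views of the same underlying sample land close together in the embedding.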
This account explains several empirical observations:
Why augmentation choice matters. The augmentation graph’s structure depends entirely on which augmentations are used; different augmentations produce different eigenvectors.
Why projection heads help. The projection head allows the model to learn both the eigenvector representation (in the projection-head output) and a richer representation (one layer earlier in the encoder), accommodating both contrastive-loss optimality and downstream-task utility.
Why collapse is a degenerate solution. Collapse corresponds to all samples mapping to a single point, so the representation spans at most one trivial direction instead of the top-$k$ eigenvectors; the anti-collapse mechanisms of §5 are mechanisms that prevent this specific eigendecomposition degeneracy.
The spectral account is more operationally predictive than the mutual-information account - it suggests what to look at in a trained contrastive model and predicts some of the field’s empirical regularities. But it applies specifically to contrastive methods; the spectral framing does not extend cleanly to generative SSL (§4) or to JEPA-style predictive embedding (§6).
Perspective 4: Predictive coding and the Bayesian brain
A different theoretical thread, with roots in neuroscience. Predictive coding theories (Rao and Ballard, 1999; Friston’s free-energy principle, 2010) propose that perception and learning in biological systems work by predicting incoming sensory input and using prediction errors as the learning signal.
The connection to SSL is structural: SSL’s predict-from-input objectives have the same form as the objectives that predictive-coding theories posit. A model trained to predict masked tokens, next tokens, or masked patches is, in this sense, doing predictive coding on its input data. The neuroscience theories suggest a normative answer to why this works: it is what biological intelligence does, and presumably for good reasons.
The Bayesian-brain framing extends this: SSL models learn implicit generative models of their training distribution, and downstream task performance reflects how well the generative model captures the structure relevant to those tasks. The Bayesian framing is appealingly principled but operationally vague - what the model’s “implicit prior” actually is, and how it relates to downstream-task usefulness, is not characterized precisely.
What this account contributes. It connects SSL to a much broader scientific question (how does any learning system, biological or artificial, build useful representations from sensory input?) and provides a direction of inquiry (study what brains do; design SSL methods that share more of brain-like structure). It is more programmatic than predictive.
The §6 JEPA framing inherits the predictive-coding heritage explicitly. Whether JEPA-style methods are “more brain-like” in any meaningful sense, and whether brain-likeness predicts downstream-task usefulness, is contested.
A neutral conclusion
As of 2026, no single theoretical account explains the full empirical behaviour of SSL. The four perspectives are complementary rather than competitive:
The compression / sufficient-statistics view explains why representations transfer to many tasks (shared structure).
The mutual-information view captures what contrastive objectives optimize but not which representations result.
The spectral view characterizes what contrastive models learn in cases where the analysis applies.
The predictive-coding view provides normative motivation and cross-disciplinary connections.
For a research-oriented reader: SSL theory is a productive but unsettled area. The empirical methods are robust and reliably deployed; the theoretical explanations are partial. This is one of the most important open scientific questions in modern ML - captured at the meta level as OP-FM-14 (theoretical understanding of in-context learning, the most striking SSL-derived capability) and developed further in the Theoretical Foundations of Learning chapter.
Editorial note. The combination of “the methods work robustly” and “we do not fully understand why” is normal in deep learning. The Deep Learning chapter §12 catches this pattern explicitly (OP-DL-1: architectural design without strong theory); it recurs here. A research-oriented reader should be comfortable using SSL methods that work without insisting on a complete theoretical justification first.
§9. SSL in Practice: Data, Curricula, and Evaluation
§4–§7 developed SSL methods at the level of objectives and architectures. This section treats the practical questions that determine whether an SSL training run produces useful representations or not: what data to train on, in what order, and how to evaluate the resulting model.
Most of the data-side material overlaps with Foundation Models §5 (which treats pretraining at the regime level across modalities). We summarize the SSL-specific facets here.
Data composition
A few practical observations specific to SSL data.
Scale matters more than purity. Empirical results across modalities support a rough hierarchy: more data > better-quality data > more careful curation > better algorithms (at the same scale). At frontier scales the hierarchy is more nuanced - quality matters more, recursive synthetic data matters more, but the broad shape holds. SSL methods that scale poorly with data are unlikely to produce useful representations regardless of architectural elegance.
Deduplication is essential. Lee et al. (2022) and many follow-up studies show that aggressive deduplication of training corpora improves downstream representations beyond what the raw data volume would suggest. Memorization of duplicated content wastes capacity; deduplicated training distributes the capacity across genuinely distinct content. The cost-benefit calculation strongly favours deduplication, even when it shrinks the corpus by a factor of 2–5.
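The simplest form of deduplication, exact matching after light normalization, can be sketched as below. This is a deliberately minimal stand-in: the function names and the normalization rule are assumptions of the example, and production pipelines typically add substring-level and near-duplicate matching on top of it.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Hypothetical mini-corpus: the second entry is a trivial variant of the first.
corpus = ["The cat sat.", "the  cat   sat.", "A dog ran."]
unique_docs = deduplicate(corpus)
```

Hashing keeps memory proportional to the number of distinct documents, not their total size, which matters once the corpus no longer fits in memory.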
Mixture composition is empirical and consequential. As Foundation Models §5 notes for FMs generally and the LLM chapter notes for language: the mixture of sources in the training corpus (code-heavy, math-heavy, multilingual-heavy, conversation-heavy) substantially shapes downstream behaviour. SSL inherits this: a contrastive image-text model trained on a code-heavy caption corpus will perform differently than one trained on a balanced web crawl.
Augmentation design (contrastive only). For contrastive SSL, the choice of augmentations is part of the data design and is famously consequential. Strong augmentations (random crops, colour jitter, blur, rotation) produce more invariant representations but at some loss of fine-grained discrimination. The “right” augmentations for a given downstream task family are usually found empirically.
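How two augmented views of one image are produced can be shown with a stripped-down sketch. The image here is a plain nested list of pixel values, and only two augmentations (random crop, horizontal flip) are included; the function names, the crop size, and the 0.5 flip probability are assumptions of the example and not a recommended policy.

```python
import random

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(image) - size)       # randint bounds are inclusive
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def two_views(image, size, rng):
    """Produce the positive pair used by a contrastive loss."""
    views = []
    for _ in range(2):
        view = random_crop(image, size, rng)
        if rng.random() < 0.5:
            view = horizontal_flip(view)
        views.append(view)
    return views

# Hypothetical 6x6 "image" whose pixel values encode their own position.
image = [[10 * r + c for c in range(6)] for r in range(6)]
view_a, view_b = two_views(image, size=4, rng=random.Random(0))
```

Whatever differs between the two views (position, orientation) is what the contrastive loss teaches the encoder to ignore, which is why the augmentation policy is effectively part of the training signal.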
Curricula
Most SSL training does not use a uniform random sample of the corpus from start to finish. Several curricular patterns recur.
Data-mixture annealing. Start with broader, more diverse data (heavy on raw web crawl); end with higher-quality or more domain-targeted data (curated subsets, math, code). Multi-stage pretraining is widespread at frontier scale.
Curriculum learning per se. Order examples from easy to hard during training. The empirical evidence on whether this helps for SSL at scale is mixed; the default in practice is to not do explicit curriculum ordering beyond the mixture-annealing pattern above.
Phase transitions. Some SSL training trajectories show empirical phase transitions - sudden improvements at specific compute milestones. Olsson et al. (2022)'s induction-head emergence in language models is the most famous example. These transitions are sometimes interpreted as the model crossing thresholds where new capabilities become accessible; they make the loss curve more interesting than a smooth power-law fit suggests.
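The data-mixture annealing pattern above amounts to a schedule over sampling weights. The sketch below uses linear interpolation, which is an assumption made for simplicity (staged switches between fixed mixtures are just as plausible); the source names and weights are invented for the example.

```python
def mixture_weights(step, total_steps, start, end):
    """Interpolate per-source sampling weights from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    mixed = {name: (1.0 - frac) * start[name] + frac * end[name] for name in start}
    norm = sum(mixed.values())
    return {name: weight / norm for name, weight in mixed.items()}

# Hypothetical mixtures: broad web data early, more code and math late.
start = {"web": 0.8, "code": 0.1, "math": 0.1}
end = {"web": 0.4, "code": 0.3, "math": 0.3}

w_first = mixture_weights(0, 1000, start, end)
w_mid = mixture_weights(500, 1000, start, end)
w_last = mixture_weights(1000, 1000, start, end)
```

At each training step the data loader would sample a source according to the current weights, so the share of web text falls smoothly while the curated sources rise.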
Evaluating SSL representations
A central methodological challenge. The whole point of SSL is that the pretraining task is not the task we care about - we care about downstream tasks. How do we evaluate the representations without committing to specific downstream tasks?
Three standard evaluation protocols, each with characteristic disagreements:
EVALUATION PROTOCOLS
1. Linear probe
──────────────
Freeze the SSL-trained encoder.
Train a single LINEAR layer on top for a classification task.
Measure test accuracy.
Strength: probes the LINEARLY DECODABLE information in the
representation; cheap to evaluate; comparable across
methods.
Weakness: if the model has the information but represents it
in a non-linearly-decodable form, linear probe
under-reports utility.
2. k-NN classification
───────────────────
Freeze the encoder.
For each test sample, find its k nearest neighbors in the
training set's representations (via cosine similarity).
Vote on the test sample's class from the neighbors' labels.
Strength: even simpler than linear probe; tests SIMILARITY
STRUCTURE of representations.
Weakness: sensitive to choice of k; sensitive to imbalance
in the training set; ignores fine-grained linear
structure that linear probe captures.
3. Fine-tuning transfer
────────────────────
Continue training the SSL-pretrained model (all parameters)
on a downstream task with labels.
Measure final downstream task performance.
Strength: measures DOWNSTREAM-DEPLOYMENT utility directly;
what most practitioners actually care about.
Weakness: tangles representation quality with adaptation
quality; expensive to run for many downstream tasks;
harder to compare across SSL methods (different
methods may benefit from different fine-tuning
recipes).
The three protocols often disagree about which SSL method is best on a given dataset. A method that produces strong linearly-decodable representations may not produce the best fine-tuning starting point; a method with good fine-tuning behaviour may have moderate linear-probe accuracy. The disagreement is a real methodological issue - there is no single “SSL representation quality” number that captures all uses.
The pragmatic 2026 default: report linear probe, k-NN, and fine-tuning numbers; expect them to rank methods somewhat differently; pay most attention to whichever protocol most closely matches the intended downstream deployment.
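The two frozen-encoder protocols are simple enough to sketch end to end. The features below are hypothetical stand-ins for encoder outputs, and the probe is a binary logistic regression trained by plain SGD; the names, the learning rate, and the epoch count are assumptions of the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(train_feats, train_labels, query, k=3):
    """Majority vote over the k training features most similar to `query`."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine(train_feats[i], query), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def train_linear_probe(feats, labels, lr=0.5, epochs=200):
    """Fit one linear layer (binary logistic regression) on frozen features."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def probe_predict(w, b, x):
    """Class 1 if the logit is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical frozen-encoder outputs for two classes.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = [0, 0, 0, 1, 1, 1]
test_feats = [[0.95, 0.05], [0.05, 0.95]]
test_labels = [0, 1]

knn_acc = sum(knn_predict(train_feats, train_labels, x) == y
              for x, y in zip(test_feats, test_labels)) / len(test_labels)
w, b = train_linear_probe(train_feats, train_labels)
probe_acc = sum(probe_predict(w, b, x) == y
                for x, y in zip(test_feats, test_labels)) / len(test_labels)
```

In both protocols the encoder is never updated: only the vote or the linear layer sees labels, so the accuracy reflects what the representation already makes accessible.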
Zero-shot transfer is an evaluation specific to cross-modal SSL: a CLIP-class model can be evaluated by feeding candidate class names through its text encoder and using image-text similarity for classification, without any fine-tuning. Zero-shot transfer measures the semantic grounding of the representations directly.
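Zero-shot classification reduces to a nearest-neighbour lookup in the shared embedding space. In the sketch below the embeddings are hypothetical hand-written vectors; in practice they would come from the trained image and text encoders, with each class name usually wrapped in a prompt such as "a photo of a {class}" before encoding.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

# Hypothetical text-encoder outputs for three candidate class prompts.
class_text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical image-encoder output for a picture of a cat.
image_emb = [0.8, 0.3, 0.1]
predicted = zero_shot_classify(image_emb, class_text_embs)
```

No parameter is trained at evaluation time; changing the label set only means encoding a different list of class names.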
Pointer to Foundation Models §5
Foundation Models §5 develops the data-curation and pretraining-pipeline picture at the regime level - across modalities, with the full pipeline from raw data through tokenization to packing to the training loop. The SSL chapter focuses on the objectives and the evaluation specifically of representations; the broader pipeline lives in FM §5.
§10. Connections to Other Chapters
SSL is the training paradigm; many chapters develop what is built on top, what is theorized about, or what specific instances look like in their domains. The map:
Deep Learning (prerequisite) develops the architectures SSL trains. The Transformer of DL §6, the MLP of DL §3, the optimizer recipes of DL §9 are all assumed; this chapter develops what to train them on.
Foundation Models (most-cross-referenced) treats SSL as one of the three pillars (FM §3) and develops the pretraining stage at the regime level (FM §5). This chapter develops the SSL paradigm proper; FM develops what is built on it.
Theoretical Foundations of Learning develops generalization theory in general; the SSL-specific generalization story (why SSL representations transfer) is surveyed in §8 of this chapter at the empirical-perspective level and developed further there at the formal-theoretic level.
Large Language Models specializes the language-side SSL objectives of §4 (causal LM, masked LM, span corruption) to the LLM training pipeline. LLM §5 develops the deployment pipeline; this chapter holds the SSL-objective canonical reference.
Multimodal Models specializes the cross-modal SSL of §7 to vision-language models and beyond. That chapter develops architectures and deployment-side specifics; we develop the training-paradigm side.
Generative Models treats diffusion and flow-matching training, which are kinds of SSL (denoising-score-matching, §4) but oriented toward generative deployment. The Generative Models chapter focuses on what the trained generator does; this chapter focuses on what the training procedure is.
Reinforcement Learning uses self-supervised auxiliary objectives in some agent training setups (predictive world models, self-supervised representation learning as a feature pipeline for downstream RL). The RL chapter develops the agent-training story; SSL provides the representation pipeline.
Robotics uses SSL for embodied perception and for vision-language-action pretraining. The Robotics chapter develops the embodied deployment; SSL provides the training paradigm.
Mechanistic Interpretability asks what SSL representations actually encode. The chapter develops circuit-level analysis of trained models; SSL is what produces the models it analyzes.
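AI for Science uses SSL for domain-specific scientific foundation models (AlphaFold 2’s auxiliary masked-MSA prediction loss is a domain-specific SSL pattern, although its main sequence-to-structure objective is supervised on experimentally determined structures).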
§11. Limitations and Open Problems
A research-oriented inventory of SSL-specific open problems. Each is the SSL facet of a broader question that recurs across other chapters’ lists.
OP-SSL-1: No general theoretical account of why SSL representations transfer. §8 surveyed four partial perspectives (compression / sufficient-statistics, mutual information, spectral, predictive coding) and concluded that no account explains the full empirical behaviour. The empirical fact of transfer is robust across modalities, methods, and a decade of replication; the why remains the most important open theoretical question in SSL. Cross-chapter: OP-FM-14 (theoretical understanding of in-context learning, an SSL-derived capability) and the broader content of the Theoretical Foundations of Learning chapter.
OP-SSL-2: Generative vs predictive vs contrastive - which is most general? §4, §5, §6 catalogued the three major SSL families. Each has empirical wins and losses; none decisively dominates across modalities and tasks. Generative (causal LM, MAE) is dominant for language and competitive for vision. Contrastive (CLIP-class) is dominant for cross-modal alignment. Predictive joint-embedding (JEPA) is the youngest of the three and the most theoretically motivated; whether it overtakes the others at scale is unsettled. The deeper question: are these three families specializing for different domains, or will one prove more general?
OP-SSL-3: SSL for embodied agents. Most SSL methods assume large static datasets - images on the web, text in books. Embodied agents (robotics, simulated agents) face a different data regime: experience is generated through interaction, in much smaller volumes than web-scale unlabelled text, with the agent’s actions affecting future observations. Adapting SSL to this regime - predictive losses on embodied data - is an active research area. JEPA’s predictive-coding heritage may be relevant here; the Robotics chapter develops the empirical state.
OP-SSL-4: Cross-modal SSL - how much does each additional modality buy? §7’s closing observation: how much downstream capability comes from cross-modal alignment per se vs from simply training on more modalities of data is not crisply separated empirically. Each additional modality (text + image + audio + video + action + ...) adds training cost; the boundary where marginal benefit drops below marginal cost is not well-characterized. Native multimodal pretraining is the field’s empirical bet; the principled answer is still open.
OP-SSL-5: Data curation - what makes a “good” pretraining corpus? §9 noted that data composition matters and that mixture choices are made empirically. The deeper question: what intrinsic properties of a corpus make it good for SSL pretraining? Diversity, quality, distributional coverage, structural complexity all plausibly matter; how to measure them in advance of training, and how to predict the resulting model’s capabilities, is open.
OP-SSL-6: Recursive SSL - training on synthetic data generated by earlier SSL models. Cross-references OP-FM-2. As frontier-scale training increasingly uses model-generated data, the dynamics of recursive SSL - what fails when models are trained on their predecessors’ outputs - become important. Shumailov et al. (2024)'s model-collapse work gives the basic concern; the conditions under which recursive SSL is robust versus harmful are not fully mapped.
OP-SSL-7: Evaluation protocols disagree. §9 noted that linear probe, k-NN, fine-tuning, and zero-shot transfer often rank SSL methods differently. There is no single “SSL representation quality” number that captures all uses. The methodological consequence: claims about which SSL method is “best” are sensitive to which evaluation protocol is chosen. A unifying framework that predicts which protocol’s ranking will transfer to which downstream use case is missing.
OP-SSL-8: Symbolic and causal structure in SSL representations. SSL produces representations that encode statistical structure of the training data. Whether these representations also encode symbolic structure (compositional rules, hierarchical concepts) and causal structure (mechanisms, interventions) is contested. The Mechanistic Interpretability chapter develops the empirical investigations; the Causality chapter develops the theoretical-impossibility critique (Pearl-aligned argument that purely associational learning cannot build causal representations). The SSL-side question is whether better objectives could produce representations with more of this structure, or whether the limit is fundamental.
These eight open problems span the SSL-specific facets. The cross-cutting Foundation Models open problems (OP-FM-1 through OP-FM-15) include SSL’s broader extensions; the Theoretical Foundations of Learning chapter develops what theoretical apparatus exists for understanding SSL formally.
§12. Critiques and Alternative Frames
SSL is not without critics. Several substantive positions argue either that current SSL methods are overhyped, that they are not the right path forward, or that the conceptual framing of “self-supervised” obscures what is really happening. We present each as a position with adherents and current status, not as settled.
The labels-are-still-needed position. A pushback against the rhetoric that SSL eliminates the need for labels. The empirical observation: while pretraining is self-supervised, the deployed foundation models almost always depend on substantial labelled data at later stages - instruction-tuning demonstrations (LLM §5.2), preference comparisons for RLHF/DPO (LLM §5.3), human-judged evaluations, dangerous-capability evaluations. The “self-supervised” framing captures pretraining accurately but obscures the substantial human-labelling effort that turns a pretrained base into a useful product. Critics argue that the practical bottleneck has shifted from pretraining data to deployment-stage data rather than disappearing, and that calling the resulting models “self-supervised” is misleading. The bottleneck has indeed shifted; whether the shift represents progress (less labelling per capable model) or stagnation (the bottleneck is still labelling, just at a different stage) is debated.
The predictive-coding / world-model critique (LeCun and others). §6 covered the JEPA framing; the critique behind it goes further. The position holds that pure generative SSL (predicting raw inputs - next tokens, masked pixels) is fundamentally an inefficient route to the representations we want. The model expends substantial capacity modelling input-space noise irrelevant to downstream tasks; the gains from scale come slower than they would under a more efficient objective. The constructive alternative is JEPA-style predictive embedding plus world-model learning - train models to predict abstract embeddings of environmental state, with the embeddings themselves chosen to discard irrelevant detail. As of 2026 this critique is partially vindicated (JEPA-class methods are competitive) and partially unsettled (they have not decisively overtaken generative methods at frontier scale). The Foundation Models chapter’s §12 develops the broader world-models critique.
The information-theoretic skepticism (Tschannen et al., 2020). §8 covered this in the theory section. The critique: the field’s go-to theoretical framing of SSL - that contrastive methods maximize mutual information between views - is correct mathematics but does not predict downstream utility. High-mutual-information representations can be useless. The critique is targeted at the explanation, not at the methods themselves; contrastive SSL works, the field’s standard story about why is misleading. The skepticism has shifted the theoretical conversation (toward spectral analyses, inductive-bias arguments) and is increasingly accepted.
The task-design critique. A meta-critique. SSL is presented as removing the labelling problem; the critique observes that it replaces labelling with pretext-task design. The pretext task (which augmentations, which masking ratio, which contrastive structure) is itself a choice - and it turns out to matter enormously. Designing good augmentations for SimCLR, good masking ratios for MAE, good context-target splits for JEPA, is operationally similar to designing good labels. The labelling work has moved upstream into the design of the SSL training procedure rather than been eliminated. The critics’ point: SSL has not freed us from the labelling problem; it has engineered it differently. Defenders observe that the new engineering scales: a single augmentation policy supports billions of images, where a labelling procedure supports tens of thousands.
Cognitive-science critiques. Adjacent to the embodiment and world-model positions but distinct. Some cognitive scientists argue that SSL-trained representations, however successful on engineering tasks, differ in important ways from how biological learning works: biological learning is grounded in agentive interaction with the world, is informed by innate priors that SSL methods lack, and is shaped by social and developmental factors that web-scale SSL cannot capture. The critique is more relevant to AI-as-cognitive-science than to AI-as-engineering; whether SSL methods are good models of cognition is a different question from whether they produce useful engineering artifacts.
Editorial note. As elsewhere in this book, we survey these positions without ranking them. SSL is methodologically successful and theoretically partial; substantive critics exist; and the gap between methodological success and partial theory is the defining state of the field as of 2026. A research-oriented reader should engage the critiques seriously without abandoning the methods that work.
§13. Further Reading
An opinionated annotated list. Each entry includes what it adds beyond this chapter and which section it grounds.
Foundational SSL
Hinton and Salakhutdinov (2006), “Reducing the Dimensionality of Data with Neural Networks.” Deep autoencoders as the early modern SSL ancestor; §2.
Vincent et al. (2008), “Extracting and Composing Robust Features with Denoising Autoencoders.” Denoising autoencoders, the conceptual ancestor of masked-prediction objectives; §2 and §4.
Language SSL (predict-the-token lineage)
Mikolov et al. (2013), “Efficient Estimation of Word Representations in Vector Space” (Word2Vec). The skip-gram and CBOW predict-context objectives.
Pennington, Socher, and Manning (2014), “GloVe.” Co-occurrence-matrix factorization for word embeddings.
Devlin et al. (2018), “BERT.” Masked LM at scale; §4.
Radford et al. (2018, 2019), GPT and GPT-2. Causal LM at scale; §4.
Raffel et al. (2019), “T5.” Span corruption as the encoder-decoder middle ground; §4.
Contrastive vision SSL
Oord, Li, and Vinyals (2018), “Representation Learning with Contrastive Predictive Coding.” The InfoNCE loss with the mutual-information bound; §5.
Chen et al. (2020), “SimCLR.” The clean recipe for contrastive image SSL; §5.
He et al. (2020), “MoCo.” Momentum encoder + memory queue for contrastive learning at smaller batches; §5.
Grill et al. (2020), “BYOL.” Negative-free contrastive learning; §5.
Caron et al. (2020), “SwAV.” Clustering-based contrastive learning; §5.
Caron et al. (2021), “DINO.” Self-distillation with no labels on Vision Transformers; §5.
Masked vision and JEPA
He et al. (2022), “Masked Autoencoders Are Scalable Vision Learners” (MAE). BERT-for-images with 75% masking; §4.
LeCun (2022), “A Path Towards Autonomous Machine Intelligence.” The JEPA position paper; §6.
Assran et al. (2023), “I-JEPA.” First concrete JEPA instance; §6.
Cross-modal and multimodal SSL
Radford et al. (2021), “CLIP.” Cross-modal contrastive learning; §5 and §7.
Jia et al. (2021), “ALIGN.” Scaling CLIP-style training on noisier data.
Li et al. (2022), “BLIP.” Adding generative components to CLIP-style training.
Zhai et al. (2023), “SigLIP.” Sigmoid loss as a simpler alternative to InfoNCE for contrastive learning.
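Radford et al. (2022), “Whisper.” Speech-to-text training at scale on paired audio-transcript data; the paper describes this as weak supervision rather than SSL in the strict sense.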
Alayrac et al. (2022), “Flamingo.” Cross-attention vision-language model.
Theoretical perspectives
Tishby and Zaslavsky (2015), “Deep Learning and the Information Bottleneck Principle.” Compression / sufficient-statistics framing.
Tschannen et al. (2020), “On Mutual Information Maximization for Representation Learning.” The critique of the MI framing; §8.
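Saunshi et al. (2022), “Understanding Contrastive Learning Requires Incorporating Inductive Biases.” The inductive-bias argument: the loss and augmentations alone do not explain contrastive SSL without the function class and training algorithm; §8.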
HaoChen et al. (2021), “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss.” Eigendecomposition framing of contrastive learning.
Critiques and meta-perspectives
Bommasani et al. (2021), “On the Opportunities and Risks of Foundation Models.” SSL is the engine of the foundation-model regime; this is the survey of the regime.
Shumailov et al. (2023), “The Curse of Recursion: Training on Generated Data Makes Models Forget.” Recursive synthetic-data training failures; §11 OP-SSL-6.
Surveys
Any current arXiv survey on self-supervised learning. These age fast but cover specific sub-fields well.
Reading order
For a reader new to SSL: start with the Bommasani report’s introduction (or Foundation Models §1–§3); then read the Devlin BERT and Radford GPT papers; then Oord et al. (2018) on InfoNCE; then SimCLR; then CLIP. After that, the Tschannen and Saunshi theoretical critiques give a sense of where the theoretical conversation is unsettled. JEPA and the world-models framing are the alternative direction worth understanding.
§14. Exercises and Experiments
Five research-style exercises, each with a declared intent: demonstration (deterministic and fast) or exploration (open-ended). Sized for a graduate student with single-GPU access. Notebooks are planned in notebooks/self-supervised-learning/ once the project’s interactive layer is decided.
E1. Train SimCLR on CIFAR; compare evaluation protocols.
Intent: exploration.
Setup. A small ResNet-18 or ViT-Tiny on CIFAR-10 or CIFAR-100. Implement SimCLR (§5): two augmented views, projection head, InfoNCE loss with batch-internal negatives.
Tasks.
Pretrain for ~100 epochs.
Evaluate the pretrained encoder via three protocols (§9): linear probe (train a linear classifier on frozen features), k-NN classification (k = 5, 20), and fine-tuning (continue training all parameters on the labelled task).
Report all three accuracies; observe whether they rank methods consistently.
Compare against a baseline trained supervised from scratch on the same labels.
Takeaway. Hands-on experience with contrastive SSL and the §9 evaluation-protocol-disagreement problem.
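The InfoNCE loss with batch-internal negatives named in the setup can be sketched as follows, in the NT-Xent form SimCLR uses. This is a minimal NumPy illustration under stated assumptions, not a training-ready implementation: the function name nt_xent_loss and the default temperature of 0.5 are illustrative choices, and a real run would compute the same quantity inside an autodiff framework.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for two batches of projection-head outputs.

    z1, z2: arrays of shape (N, D); row i of z1 and row i of z2 are the
    two augmented views of the same image. Every other row in the
    combined batch of 2N embeddings acts as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit norm -> cosine similarity
    sim = (z @ z.T) / temperature                        # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                       # a view is never its own candidate

    # Row i's positive is the other view of the same image.
    positive = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row's candidates.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_positive = sim[np.arange(2 * n), positive] - log_denominator
    return -log_prob_positive.mean()
```

A useful sanity check while debugging: if every embedding is identical, each of the 2N - 1 candidates is equally likely, so the loss equals log(2N - 1). A loss that sits at that value during training is a sign of collapse rather than learning.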
E2. Masked-LM vs causal-LM on a small text corpus.
Intent: exploration.
Setup. A small Transformer (~10M parameters) and a small text corpus (Shakespeare, tiny-stories, or a small FineWeb-Edu subset).
Tasks.
Train two models from the same starting point: one with causal LM, one with masked LM (15% masking, BERT-style).
Evaluate downstream transfer on two task types: a generation task (text continuation; measure perplexity or human-judged quality on held-out continuations) and an understanding task (sentence classification on a small labelled subset; measure linear-probe accuracy on the encoder).
Check the results against the expected pattern: the causal LM should win on generation; the masked LM should win or tie on understanding.
Quantify the gap and reflect on why the asymmetry exists (§4).
Takeaway. First-hand evidence of the generative-vs-understanding tension in SSL objective choice.
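The two objectives differ only in how inputs and targets are built from the same token sequence, which the following sketch makes concrete. It assumes integer token ids; MASK_ID, IGNORE, and both function names are hypothetical, and the 80/10/10 replacement split follows the BERT paper’s recipe rather than anything this chapter prescribes.

```python
import random

MASK_ID = 0     # hypothetical id reserved for the [MASK] token
IGNORE = -100   # label meaning "no loss at this position" (a common convention)

def causal_lm_example(tokens):
    """Causal LM: at every position, predict the next token."""
    return tokens[:-1], tokens[1:]

def masked_lm_example(tokens, vocab_size, mask_prob=0.15, rng=random):
    """BERT-style masked LM: select ~mask_prob of positions as targets.

    A selected position is replaced by [MASK] 80% of the time, by a
    random token 10% of the time, and left unchanged 10% of the time.
    Loss is computed only where the label is not IGNORE.
    """
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected
        labels[i] = tok                    # the model must recover the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID
        elif r < 0.9:
            inputs[i] = rng.randrange(1, vocab_size)  # random non-mask token
        # otherwise keep the original token in place
    return inputs, labels
```

The asymmetry the exercise asks about is visible here: the causal example yields a training signal at every position but sees only left context, while the masked example sees both sides but trains on roughly 15% of positions per pass.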
E3. Reproduce representational collapse.
Intent: demonstration.
Setup. A simple contrastive setup (SimCLR or BYOL) on CIFAR-10.
Tasks.
First, train the standard setup and verify it produces useful representations.
Then, break it in specific ways and observe collapse:
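Remove augmentations (use identical views). The pretext task becomes trivial, since any encoder that keeps images distinguishable solves it; expect a shortcut, near-identity solution with poor linear-probe accuracy.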
Remove the projection head. May produce a less useful representation; observe the dimensional structure.
For BYOL: remove the stop-gradient on the target branch. Should collapse.
Remove temperature normalization in InfoNCE. Observe instability.
For each break, measure (a) downstream task accuracy via linear probe, (b) the eigenvalue spectrum of the feature covariance matrix (to identify dimensional collapse), (c) feature diversity (e.g., mean cosine similarity between random pairs of representations; collapsed → 1).
Takeaway. Mechanical understanding of §5’s three failure modes by seeing them in action.
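Measurements (b) and (c) above can be computed from a matrix of encoder outputs along the following lines. This is a sketch assuming NumPy and a feature array of shape (num_samples, dim); the function name collapse_diagnostics is hypothetical, and the entropy-based effective rank is an added one-number summary of the spectrum that the exercise itself does not ask for.

```python
import numpy as np

def collapse_diagnostics(features, n_pairs=10_000, seed=0):
    """Return (eigenvalues, effective_rank, mean_cosine) for a feature matrix."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]

    # (b) Eigenvalue spectrum of the feature covariance, largest first.
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)

    # Effective rank: exp of the entropy of the normalized spectrum.
    # Near `dim` for an even spectrum, near 1 when one direction dominates.
    p = eigvals / max(eigvals.sum(), 1e-12)
    p = p[p > 0]
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))

    # (c) Mean cosine similarity between random pairs of distinct samples.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    mean_cosine = float((unit[i[keep]] * unit[j[keep]]).sum(axis=1).mean())

    return eigvals, effective_rank, mean_cosine
```

The two measurements catch different failures. Complete collapse (all inputs mapped to nearly one vector) drives the mean cosine toward 1, while dimensional collapse shows up as a spectrum with only a few non-negligible eigenvalues even when the mean cosine looks healthy. Effective rank ignores overall scale, so it should be read alongside the raw eigenvalue magnitudes rather than instead of them.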
E4. MAE vs I-JEPA on a small image task.
Intent: exploration.
Setup. Implement MAE and I-JEPA on CIFAR-10 or a small ImageNet-100 subset. Both use a Vision Transformer of comparable scale.
Tasks.
Pretrain both for matched compute.
Evaluate via the three protocols of E1 (linear probe, k-NN, fine-tuning).
Inspect the representations: visualize the attention patterns of the encoder, the feature-similarity structure, the cluster structure under k-NN.
Identify where the two methods’ representations differ: what does each capture that the other does not?
Takeaway. Comparison between reconstructive (§4) and predictive-embedding (§6) approaches on the same data.
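Of the three evaluation protocols, k-NN is the cheapest to run on both encoders because it needs no training. A minimal sketch, assuming NumPy, frozen features already extracted as arrays, and integer class labels; the function name knn_predict, the use of cosine similarity, and the unweighted majority vote are illustrative choices (published protocols often weight votes by similarity).

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=20, num_classes=None):
    """Majority-vote k-NN on frozen features using cosine similarity.

    train_feats: (n_train, D), train_labels: (n_train,) integer array,
    test_feats: (n_test, D). Requires k <= n_train.
    """
    def unit(x):
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

    sims = unit(test_feats) @ unit(train_feats).T            # (n_test, n_train)
    # Indices of the k most similar training points (order within the k is arbitrary).
    nearest = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    neighbour_labels = train_labels[nearest]                 # (n_test, k)

    if num_classes is None:
        num_classes = int(train_labels.max()) + 1
    votes = np.zeros((test_feats.shape[0], num_classes), dtype=int)
    rows = np.arange(test_feats.shape[0])[:, None]
    np.add.at(votes, (rows, neighbour_labels), 1)            # count votes per class
    return votes.argmax(axis=1)
```

Running it with k = 5 and k = 20 on the MAE and I-JEPA features, using the same train/test split for both, gives a direct read on how linearly-unaided each feature space already is, before any probe or fine-tuning is trained.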
E5. Build a small CLIP-style model.
Intent: exploration.
Setup. A small image encoder (ViT-Tiny) and a small text encoder (a small Transformer). A small image-caption dataset (Conceptual Captions subset, Flickr30k, or COCO captions).
Tasks.
Implement the CLIP loss (symmetric InfoNCE over a batch of image-text pairs).
Train for ~50 epochs.
Evaluate zero-shot classification on CIFAR-10 or CIFAR-100 (feed candidate class names through the text encoder; classify by image-text similarity).
Compare zero-shot accuracy to what a small from-scratch supervised model would achieve.
Investigate the failure modes: what does the model get wrong, and how is that related to the training-data distribution?
Takeaway. First-hand experience with cross-modal contrastive SSL and the zero-shot-transfer capability that defines CLIP-class models.
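The symmetric InfoNCE loss and the zero-shot classification step described in the tasks can be sketched as below. This assumes NumPy arrays of already-computed image and text embeddings; the function names are hypothetical, and the temperature is fixed at 0.07 for simplicity, whereas CLIP learns it as a parameter.

```python
import numpy as np

def _unit(x):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)      # stabilize before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched image-text pairs.

    image_emb, text_emb: (N, D); row i of each belongs to the same pair,
    so the correct matches lie on the diagonal of the logit matrix.
    """
    logits = _unit(image_emb) @ _unit(text_emb).T / temperature   # (N, N)
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar.

    class_text_emb: (num_classes, D), one row per candidate class name
    passed through the text encoder.
    """
    return (_unit(image_emb) @ _unit(class_text_emb).T).argmax(axis=1)
```

For the CIFAR evaluation, class_text_emb would hold one embedding per class name; wrapping names in a short prompt template is a common practice, but which template works best at this scale is something to test rather than assume.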