Mechanistic Interpretability
The chapter assumes the Deep Learning chapter (architectures), the Foundation Models chapter (the FM-as-substrate framing), and at minimum LLM §3 (Transformer architecture). It develops MI both as a scientific programme (what we have learned about how networks compute) and as an engineering tool (interpretability methods deployed in safety-critical contexts).
Scope and What This Chapter Is About
The chapter develops mechanistic interpretability: the toolkit and findings of research that explains how neural networks (especially Transformers) compute their outputs through analysis of internal activations, weights, circuits, and features. We cover the foundational programmes (circuits, features), the modern techniques (probing, sparse autoencoders, transcoders, attention-pattern analysis, causal interventions), the substantive findings (induction heads, indirect-object-identification circuits, factual-recall mechanisms, refusal directions), the methodology (causal intervention, ablation, patching), and the connection to alignment and safety. Open problems are flagged inline and consolidated in §14.
§1. Motivation and Scope
A worked instance to anchor the chapter
Take an LLM that, when prompted with “When John and Mary went to the store, John gave a drink to ___”, reliably completes “Mary”. This is a simple capability - the model is performing Indirect Object Identification (IOI). The behavioural evidence is clear; performance is near-100% on this template.
The mechanistic question: what specific internal computation produces this behaviour? Which attention heads identify “John” and “Mary” as candidate names? Which heads decide that the indirect object should not be the subject? Where in the network does this decision happen?
Wang, Variengien, Conmy, Shlegeris, Steinhardt (2022) “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small” gave the answer. The IOI circuit in GPT-2 Small involves:
Duplicate token heads (in layers 0–3) identify that “John” appears twice in the context.
Name mover heads (in layers 9–10) attend to candidate names and copy them to the residual stream.
S-Inhibition heads (in layers 7–8) suppress the subject name (“John”) in favour of the indirect object (“Mary”) by steering the name mover heads away from the subject.
Backup name movers (in layers 9–11) provide redundancy.
The circuit involves ~26 specific attention heads across ~7 layers, with specific roles assigned via causal-intervention experiments (activation patching, ablation). The paper traces the information flow from input tokens to output prediction, identifies each computational role, and validates each claim with targeted interventions.
This is mechanistic interpretability: explaining how a network produces a specific behaviour by identifying the internal computations responsible. The IOI work is one of dozens of similar circuit analyses produced in the 2022–2024 period - the foundational examples on which the field’s methodology has been built.
What MI is
A working definition. Mechanistic interpretability (MI) is the research programme of explaining how neural networks compute, in mechanistic terms - identifying internal features (computational quantities the network represents) and circuits (sub-networks of weights and activations that perform specific functions), with claims validated by causal interventions on the model’s computation.
The MI commitment is to explanations that operate at the level of mechanism, not just behavioural prediction. A behavioural explanation says “the model outputs Y given X because it was trained to do so on similar data.” A mechanistic explanation says “the model outputs Y given X because attention head 9.6 attends to token T, the OV circuit of head 9.6 writes feature F to the residual stream, and this feature is read out by the unembedding to produce Y.”
The MI commitment is also to causal validation. A pattern observed in activations (a probe successfully classifies a feature) is suggestive but not definitive - the activation could be a correlational shadow of the true causal computation. To establish that a feature is causally responsible for behaviour, MI uses targeted interventions: ablate the feature, replace it with a counterfactual value, and observe whether the behaviour changes accordingly.
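The gap between a correlational pattern and a causal one can be shown on a toy example. The sketch below is not taken from any paper: the two-unit "network", its weights, and the mean-ablation choice are all illustrative assumptions. Both hidden units track the input signal, but only one feeds the output, so correlation flags both while ablation singles out the causal one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "network": two hidden units both track the input signal s,
# but only unit 0 feeds the output. Unit 1 is a correlational shadow.
s = rng.normal(size=1000)
hidden = np.stack([s, s + 0.1 * rng.normal(size=1000)], axis=1)
w_out = np.array([1.0, 0.0])  # the output reads unit 0 only


def output(h):
    return h @ w_out


baseline = output(hidden)

# Observational evidence: both units correlate strongly with the output.
corr = [np.corrcoef(hidden[:, i], baseline)[0, 1] for i in range(2)]


def ablation_effect(unit):
    """Mean-ablate one unit and return the mean absolute change in output."""
    h = hidden.copy()
    h[:, unit] = hidden[:, unit].mean()
    return float(np.mean(np.abs(output(h) - baseline)))


# Causal evidence: only ablating unit 0 changes the output.
effects = [ablation_effect(i) for i in range(2)]
print("correlations:", corr)
print("ablation effects:", effects)
```

A probe trained on unit 1 would succeed here even though unit 1 plays no role in the computation, which is the situation the intervention is meant to rule out.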
The MI scope is broad. It includes:
Probing: training linear classifiers on internal activations to identify what features the network represents.
Circuits: identifying sub-networks of attention heads and MLP neurons that implement specific computations.
Sparse autoencoders (SAEs): decomposing dense activations into interpretable feature directions.
Activation patching and causal intervention: validating mechanistic claims via targeted modifications.
Feature visualization: identifying what inputs activate specific features.
Steering and representation engineering: modifying behaviour by intervening on internal representations.
What MI is not
Several boundaries worth flagging.
MI is not “explainable AI” in the traditional sense. Explainable AI (XAI) emerged in the 2010s as a programme for producing post-hoc explanations of model predictions - saliency maps, attribution methods, surrogate models. MI shares some methods with XAI but differs in commitments: MI seeks mechanistic understanding (how the model computes), not just human-interpretable explanations of individual predictions. MI is also more demanding of causal validation than most XAI methods.
MI is not a complete solution to alignment. The alignment programme requires that AI systems behave in accordance with human values across deployment contexts. MI is one tool for alignment - it can help identify deceptive behaviour, dangerous capabilities, or misaligned reasoning by examining internal computation. But MI alone does not solve the alignment problem; MI findings can be misleading, MI does not directly produce aligned behaviour, and many alignment failures may be hard to detect mechanistically.
MI is not “neural network theory.” Theoretical work on neural networks (NTK, mean-field theory, generalization bounds) is sometimes called “interpretability” loosely but is qualitatively different from MI. Theory makes claims about networks in general (or in restricted limits); MI makes claims about specific trained networks via empirical analysis. Theory and MI are complementary rather than competing.
MI is not finished. Despite substantial progress, MI’s findings are still partial, model-specific, and contested. The field’s open problems (§14) are large; many results are at the demonstration level rather than the systematic-understanding level. The chapter is appropriately cautious about overclaim.
Why MI matters in 2026
Four motivations.
1. Safety and alignment. As frontier AI systems become more capable and more autonomous, behavioural testing alone provides limited assurance. A model that behaviourally refuses dangerous requests during evaluation might internally represent the dangerous capability while suppressing it via training-induced behavioural patterns; under adversarial probing or distribution shift, the behaviour could change. Mechanistic understanding - knowing what capabilities the model has internally and how they are gated - provides stronger assurance than behavioural testing alone. MI is increasingly the basis for safety arguments about deployed AI systems.
2. Debugging and engineering. When a model fails - produces wrong outputs, exhibits bias, regresses on a task - understanding why requires looking inside. Traditional debugging (look at logs, narrow down the input) is much less effective for neural networks than for software. MI tools (probing, activation patching) provide a systematic debugging methodology that can identify where errors arise. Industrial AI labs are increasingly using MI for production debugging.
3. Scientific understanding. Neural networks are some of the most consequential artefacts of modern science, but our understanding of how they compute is partial. MI is the scientific programme of understanding them - what features they represent, what algorithms they implement, what circuits emerge from training. The findings are intrinsically interesting (induction heads, factual recall mechanisms, feature superposition) and inform theoretical work in machine learning.
4. Governance and accountability. As AI systems are deployed in consequential settings (medical decisions, lending, hiring, criminal justice), regulatory frameworks increasingly demand explanations of model behaviour. MI provides one form of explanation - what internal mechanisms produce specific decisions. The EU AI Act, US executive orders on AI, and other policy frameworks reference interpretability as part of trustworthiness requirements. MI is increasingly relevant to AI governance.
The MI taxonomy
A practical organization of the techniques. MI methods can be classified along several axes.
By unit of analysis:
Neurons: individual scalar activations.
Features: directions in activation space (often spanning multiple neurons).
Attention heads: structured Transformer components with QK and OV decompositions.
Layers: groups of components at the same depth.
Circuits: cross-layer sub-networks implementing specific functions.
By methodology:
Observational: probing, feature visualization, attention pattern analysis. Look at activations without modifying them.
Causal: activation patching, ablation, steering. Modify activations to test mechanistic claims.
Decompositional: sparse autoencoders, dictionary learning. Decompose activations into interpretable components.
By granularity of claim:
Single-example: explain one model output.
Class of examples: explain a behaviour across many inputs.
Universal: identify mechanisms that recur across models, training runs, or scales.
The chapter develops representative methods in each cell of this taxonomy. The most mature methods combine multiple cells - causal interventions on SAE-decomposed features at the circuit level for a behaviour-class, for example.
The relationship to explainable AI
A brief clarification. Explainable AI (XAI) emerged in the 2010s as a research community focused on producing human-interpretable explanations of model outputs. The dominant XAI methods (LIME, SHAP, saliency maps, integrated gradients) produce post-hoc, local explanations - they explain specific predictions in terms of input feature importances.
MI overlaps with XAI on some methods (saliency, integrated gradients) but has different methodological commitments:
XAI typically does not demand causal validation of explanations; MI does.
XAI focuses on individual predictions; MI focuses on mechanisms underlying broad behaviours.
XAI is typically presented to end users who want to understand why a specific decision was made; MI is typically presented to researchers trying to understand model internals.
The communities are increasingly converging. Modern XAI methods are adopting more rigorous causal validation; MI methods are being deployed in user-facing explanation contexts.
The relationship to causality
A specific connection worth flagging. Causal-intervention methods (§5) are the technical heart of MI’s validation methodology - they are essentially the causal-inference framework of the Causality chapter applied to neural-network internals. Activation patching is a causal intervention on internal activations, analogous to do-calculus on a causal graph.
The conceptual point. MI treats the network’s internal computation as a causal system whose structure can be probed with interventions. The Causality chapter’s machinery (do-operator, mediation, counterfactuals) maps onto MI methodology with appropriate adaptation. This is one of several recent intellectual connections between previously-separate fields.
What this chapter does not try to do
Several explicit exclusions.
We do not provide a complete survey of every interpretability paper. The literature is large and rapidly growing; we develop foundational results and key methods, referring to surveys and reading lists for breadth.
We do not develop the engineering of MI tools (specific libraries, visualization frameworks) in depth. These change rapidly; we point to current tools in §15.
We do not treat non-mechanistic interpretability (post-hoc explanation methods, XAI) systematically. We touch on the overlap; the broader XAI literature has its own treatments.
We do not develop evaluation methodology specifically for MI claims. The chapter discusses validation methodology in §11; cross-cutting evaluation issues live in the Evaluation chapter (planned).
We do not extensively cover MI for non-Transformer architectures (CNNs, RNNs, SSMs, diffusion models). The field’s centre of mass is Transformer-LM MI; we mention other architectures where directly relevant.
Position taken in this chapter
The chapter takes MI seriously as both a scientific programme and an engineering tool. The findings (circuits, SAEs, induction heads) are substantive and have changed how the field thinks about neural networks. The methodology (causal intervention as validation) is rigorous and increasingly standard.
The chapter is also appropriately cautious. MI claims often involve substantial human-interpretation effort and may not generalize across models or scales. The systematic-understanding level - knowing in advance what mechanism produces a given behaviour - is not yet attainable for frontier models. The chapter develops both what MI has achieved and where its limitations are.
§2. Historical Context
This section traces MI from its origins in neural-network visualization through the modern circuits-and-features programme. The history is essential because MI’s methodology and conceptual framework have evolved substantially over a relatively short period (roughly 2015–2026).
A timeline of the inflection points:
2014-2015 Early visualization: convolutional-filter
visualization (Zeiler-Fergus 2014); activation
maximization. Network internals as
visualizable objects.
│
▼
2016-2017 Attribution methods: saliency maps,
integrated gradients (Sundararajan et al. 2017),
layer-wise relevance propagation. Producing
per-prediction explanations.
│
▼
2017-2018 Distill: the academic journal founded by Chris
Olah and collaborators. Long-form interactive
articles on neural-network interpretation.
Sets the stylistic and methodological template
for the field.
│
▼
2017-2020 "Feature Visualization" (Olah et al., Distill 2017);
"Zoom In: An Introduction to Circuits" (Olah,
Cammarata et al., Distill 2020). The circuits
programme begins: features and circuits as the
primary unit of mechanistic analysis. CNN-era
studies of InceptionV1.
│
▼
2020-2021 CNN circuits papers in Distill: curve
detectors, high-low-frequency detectors,
multimodal neurons (Goh et al. 2021).
Anthropic founded; transformer-circuits work
begins.
│
▼
2021 "A Mathematical Framework for Transformer
Circuits" (Elhage, Nanda, Olsson et al.,
Anthropic 2021). The QK/OV decomposition;
toy two-layer attention-only models;
conceptual framework for analyzing Transformers
mechanistically.
│
▼
2022 "In-context Learning and Induction Heads"
(Olsson, Elhage, Nanda et al., Anthropic 2022).
Identification of induction heads as the
mechanism behind in-context learning. The
first major-scale mechanistic explanation of
a Transformer capability.
│
▼
2022 "Interpretability in the Wild: A Circuit for
Indirect Object Identification in GPT-2 Small"
(Wang et al., 2022). The first full-circuit
analysis in a real LLM. Methodology template
for many subsequent circuit papers.
│
▼
2022 ROME (Meng et al., 2022): "Locating and Editing
Factual Associations in GPT." Activation
patching identifies factual-recall mechanism;
weight editing achieves targeted fact updates.
Brings causal-intervention methodology to
larger models.
│
▼
2022 "Toy Models of Superposition" (Elhage et al.,
Anthropic 2022). The superposition hypothesis;
polysemantic neurons; the conceptual basis
for SAEs.
│
▼
2023 Sparse autoencoders for interpretability
(Cunningham et al. 2023, "Sparse Autoencoders
Find Highly Interpretable Features in Language
Models"). SAEs as a tool to decompose
polysemantic neurons into interpretable
features.
│
▼
2023 ACDC (Conmy et al. 2023): automated circuit
discovery. Path patching at scale.
│
▼
2024 The "SAE explosion": Bricken et al. (Anthropic),
Templeton et al. (Anthropic), Cunningham et al.,
OpenAI's GPT-4 SAE work. SAEs scaled to frontier
models with millions of features.
│
▼
2024 MI for safety: refusal-direction analysis;
jailbreak mechanistic understanding;
deception-circuit search. MI as alignment tool
matures.
│
▼
2024-2026 Field-level consolidation: MI conferences
(Mech Interp at ICML, NeurIPS); standard
methodology; production deployment at major
labs. MI as a recognized subfield with its
own infrastructure, journals, and community.
We develop each phase below.
Early visualization: 2014–2017
The substrate. Zeiler and Fergus (2014) “Visualizing and Understanding Convolutional Networks” gave the first systematic visualization of what convolutional networks learned. The technique: take a trained network, identify which inputs maximally activate a given filter, and produce a visualization. The result: low-layer filters detect edges and textures; mid-layer filters detect parts; high-layer filters detect objects. The hierarchical-feature picture of CNNs was empirically grounded.
Attribution methods developed in parallel. LIME (Ribeiro et al., 2016), SHAP (Lundberg-Lee, 2017), integrated gradients (Sundararajan, Taly, Yan, 2017), and layer-wise relevance propagation (Bach et al., 2015) produced per-prediction explanations of which input features mattered most. These became standard tools in the XAI community.
The limitations. Visualization and attribution gave suggestive pictures but did not establish causal mechanism. A visualized filter might look like a “dog detector” without actually being causally responsible for dog classifications. The mid-2010s methodology was insufficient to make rigorous mechanistic claims.
Distill and the circuits programme: 2018–2020
A specific institutional development. Distill - an online journal founded by Chris Olah, Shan Carter, and collaborators in 2016 - published long-form interactive articles on neural-network interpretation. The journal’s commitment to interactive, thorough, rigorous exposition substantially raised the field’s standards.
“Feature Visualization” (Olah, Mordvintsev, Schubert, Distill 2017) consolidated the visualization techniques. “The Building Blocks of Interpretability” (Olah, Satyanarayan et al., Distill 2018) introduced compositional visualization techniques. The methodology was maturing.
The big inflection. “Zoom In: An Introduction to Circuits” (Olah, Cammarata, Schubert, Goh, Petrov, Carter, Distill 2020) explicitly articulated the circuits hypothesis: neural networks can be understood as compositions of features (directions in activation space corresponding to interpretable computational quantities) and circuits (sub-networks of weights implementing specific computations). The hypothesis was a research programme: identify circuits empirically; validate via interventions; build up a “library” of known mechanisms.
A series of Distill papers through 2020–2021 demonstrated the programme on InceptionV1 (a 2014-era CNN): “Curve Detectors” (Cammarata et al.); “High-Low Frequency Detectors”; “Naturally Occurring Equivariance”; “Multimodal Neurons in Artificial Neural Networks” (Goh et al., 2021, on CLIP). Each paper traced a specific computational mechanism through a specific network.
The methodology that emerged. Identify features by visualization, attribution, and probing. Identify circuits as connections between features at different layers. Validate via causal interventions. This methodology, refined over time, is the methodological core of modern MI.
Anthropic and Transformer circuits: 2021–2022
The next inflection. Anthropic was founded in 2021 with substantial focus on safety and interpretability. The Anthropic interpretability team produced a series of papers that established Transformer-specific MI methodology.
“A Mathematical Framework for Transformer Circuits” (Elhage, Nanda, Olsson et al., Anthropic 2021) was foundational. The paper developed:
The QK / OV decomposition: each attention head can be analyzed in terms of two matrices, the QK matrix (what the head attends to) and the OV matrix (what information the head writes when it attends). The decomposition exposes the head’s computational role.
The residual stream view: Transformer activations as the residual stream (a high-dimensional vector at each token position that grows additively through the layers); attention heads and MLPs as components that read from and write to the residual stream.
Toy models of two-layer attention-only Transformers, where the full mechanism could be reverse-engineered analytically.
The framework gave a language for talking about Transformer internals. Many subsequent MI papers use the QK/OV/residual-stream vocabulary.
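The QK/OV decomposition can be checked numerically on a single head. The sketch below uses random weights and a row-vector convention (activations are rows, projections are `x @ W`); the dimensions and variable names are illustrative assumptions, not the notation of the paper. It shows that the attention scores depend on the weights only through the product `W_Q @ W_K.T`, and the written output only through `W_V @ W_O`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 5

# Hypothetical random weights for one attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

W_QK = W_Q @ W_K.T  # (d_model, d_model): determines where the head attends
W_OV = W_V @ W_O    # (d_model, d_model): determines what the head writes

x = rng.normal(size=(n_tokens, d_model))  # residual stream, one row per token

# Standard computation via separate Q, K, V projections with a causal mask.
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
scores = np.where(mask, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
head_out = attn @ (x @ W_V) @ W_O

# The same quantities expressed through the two combined matrices.
scores_qk = x @ W_QK @ x.T / np.sqrt(d_head)
head_out_ov = attn @ x @ W_OV

# The head writes additively into the residual stream.
resid_after = x + head_out
```

Both combined matrices are `d_model × d_model` but have rank at most `d_head`, which is one reason they are convenient objects to analyse directly.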
“In-context Learning and Induction Heads” (Olsson, Elhage, Nanda et al., Anthropic 2022) was the first major-scale mechanistic explanation. The paper identified induction heads - attention heads that implement the pattern “[A][B] ... [A] → [B]” - as a primary mechanism behind in-context learning. The paper traced induction-head emergence during training; presented evidence (causal in small models, correlational in larger ones) that they drive in-context-learning performance; and connected them to a sudden (“phase transition”) improvement in in-context-learning capability during training.
This was a substantive scientific result. A high-level capability (in-context learning) was traced to a specific architectural mechanism (induction heads), with causal validation. The paper substantially raised expectations for what MI could accomplish.
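The “[A][B] ... [A] → [B]” pattern can be written down as a plain lookup rule. The sketch below describes only the input-output behaviour attributed to induction heads, not how a Transformer implements it internally; the function names are hypothetical.

```python
def induction_attention_target(tokens):
    """Return the position an induction head would attend to, or None.

    The rule: find the most recent earlier occurrence of the current (last)
    token and attend to the token immediately after it.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return i + 1
    return None


def induction_prediction(tokens):
    """Predict the next token by copying the token at the attended position."""
    target = induction_attention_target(tokens)
    return None if target is None else tokens[target]


print(induction_prediction(["Mr", "Dursley", "was", "Mr"]))  # prints Dursley
print(induction_prediction(["a", "b", "c"]))                 # prints None
```

In a trained model the same behaviour is understood to arise from attention heads composing across layers rather than from an explicit search, so the rule above is best read as a specification of what to look for.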
The IOI circuit and the circuit-paper template: 2022
“Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small” (Wang, Variengien, Conmy, Shlegeris, Steinhardt, 2022) gave the first full circuit analysis in a real-world LLM. The IOI circuit (overviewed in §1) involves ~26 attention heads with specific roles.
The paper’s methodology became a template:
Choose a behaviour (IOI completion).
Construct a dataset of clean and counterfactual prompts.
Use activation patching (replacing internal activations with counterfactuals) to identify which components are causally responsible.
Decompose responsible components into specific attention heads.
Validate each head’s role with targeted ablations and projections.
Synthesize into a full circuit description.
Many subsequent MI papers follow this template, applied to different behaviours and models.
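The patching step of this template can be sketched on a toy additive model. Everything below is an illustrative assumption: the model is a single layer of components that each write `tanh(W_i @ x)` into a residual vector, and the metric is a linear readout standing in for something like a logit difference. The loop runs the model on a corrupted input, swaps in one component's activation from the clean run, and records how much of the clean-versus-corrupted gap is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_components = 8, 4

# Hypothetical toy model: each component writes additively to the residual.
W = rng.normal(size=(n_components, d, d)) / np.sqrt(d)
readout = rng.normal(size=d)


def component_outputs(x):
    return [np.tanh(W_i @ x) for W_i in W]


def metric(x, patched=None):
    """Linear readout of the residual.

    `patched` maps a component index to the activation to substitute for it.
    """
    outs = component_outputs(x)
    for idx, act in (patched or {}).items():
        outs[idx] = act
    resid = x + sum(outs)
    return float(readout @ resid)


x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)
clean_cache = component_outputs(x_clean)  # activations saved from the clean run

m_clean, m_corrupt = metric(x_clean), metric(x_corrupt)

# Patch each component's clean activation into the corrupted run, one at a time.
recovered = {
    i: (metric(x_corrupt, {i: clean_cache[i]}) - m_corrupt) / (m_clean - m_corrupt)
    for i in range(n_components)
}
for i, frac in recovered.items():
    print(f"component {i}: fraction of gap recovered = {frac:+.3f}")
```

Because this toy has one layer, a patched component has no downstream effects; in a real model the patched activation also changes what later components compute, and the recovered fractions need not sum to one or lie between zero and one.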
ROME, MEMIT, and causal-intervention scaling: 2022–2023
A different thread. ROME (Meng, Bau, Andonian, Belinkov, 2022) “Locating and Editing Factual Associations in GPT” used activation patching to identify where in the model a factual association (“Paris is the capital of France”) was stored. The authors found that facts were localized in specific MLP layers (typically mid-network) and demonstrated fact editing: applying a rank-one update to an MLP weight matrix produced a model that returned a different answer for that specific fact while leaving other behaviour largely intact.
MEMIT (Meng et al., 2023) scaled ROME’s approach to editing thousands of facts simultaneously. Knowledge Neurons (Dai et al., 2022) used a similar approach for fact identification.
These results were dramatic. They showed: (a) some facts are localized in specific MLP weights; (b) targeted weight interventions could edit specific facts; (c) causal-intervention methodology scaled to billion-parameter models. The mechanistic-editing direction has become a substantial subarea, with applications to model unlearning, fact-correction, and (controversially) deployment-time model modifications.
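The core idea of a rank-one weight edit can be sketched in a few lines. This is a simplified illustration, not ROME's actual update: it treats an MLP weight matrix as a key-value map and omits the key-covariance weighting that ROME uses to limit interference with other keys. The matrix, key, and value below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 5

W = rng.normal(size=(d_out, d_in))  # stand-in for an MLP weight matrix
k = rng.normal(size=d_in)           # "key": representation of the subject
v_new = rng.normal(size=d_out)      # "value": desired output for that key


def rank_one_edit(W, k, v_new):
    """Return W' = W + (v_new - W k) k^T / (k^T k), so that W' k = v_new."""
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)


W_edit = rank_one_edit(W, k, v_new)

# A key orthogonal to k is unaffected by the edit.
k_other = rng.normal(size=d_in)
k_other -= (k_other @ k) / (k @ k) * k

print(np.allclose(W_edit @ k, v_new))                # the edited key maps to v_new
print(np.allclose(W_edit @ k_other, W @ k_other))    # orthogonal keys are unchanged
```

Keys that are neither identical nor orthogonal to `k` are partially affected, which is one reason edits of this kind can have side effects on related facts.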
The superposition hypothesis: 2022
“Toy Models of Superposition” (Elhage, Hume et al., Anthropic 2022) articulated a crucial theoretical observation: neural networks can represent more features than they have dimensions by encoding features in superposition - overlapping linear directions in activation space.
The mechanism. If a network has $d$ neurons but $m > d$ relevant features, the features cannot each have their own dimension. Instead, they share dimensions: each neuron is polysemantic (responds to multiple features), and each feature is encoded partially in many neurons. This explains polysemantic neurons - a long-standing observation - as an emergent consequence of dimensional pressure.
The implication. Direct neuron-level interpretation is limited because individual neurons mix multiple features. To recover interpretable features, we need to decompose the activation space into the underlying directions. This sets up the SAE programme.
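A small numerical sketch shows what superposition looks like geometrically. The sizes and the use of random unit directions are assumptions made for illustration; trained networks arrange their features in more structured ways. With more features than dimensions the directions cannot all be orthogonal, yet a sparse input can still be read back with tolerable interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 100  # d neurons (dimensions), m > d features

# One random unit direction per feature, stacked as rows.
D = rng.normal(size=(m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# A sparse input: only feature 7 is active.
f = np.zeros(m)
f[7] = 1.0
x = f @ D  # the d-dimensional activation vector the network would hold

# Read out every feature by projecting onto its direction.
readout = D @ x
recovered = int(np.argmax(readout))

# Interference: the largest overlap between two distinct feature directions.
gram = D @ D.T
interference = float(np.abs(gram - np.eye(m)).max())

print("recovered feature:", recovered)
print("max interference between distinct features:", round(interference, 3))
```

Each column of `D` (one neuron) carries non-zero weight for many features, which is the polysemanticity described above; the readout stays usable only because few features are active at once.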
Sparse autoencoders: 2023–2024
The proposed remedy for superposition. Sparse autoencoders (SAEs) are autoencoders with $m \gg d$ hidden units (an overcomplete dictionary for $d$-dimensional activations) and a sparsity constraint (only a few units are active per input). When trained on neural-network activations, an SAE decomposes the activations into a dictionary of feature directions, each of which ideally has a clear interpretation.
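One common SAE formulation uses a ReLU encoder, a linear decoder, and an L1 penalty on the hidden activations. The sketch below shows only the forward pass and the loss for that formulation, with untrained random weights; the sizes, the L1 coefficient, and the variable names are assumptions, and published SAEs differ in details such as weight tying and normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, batch = 16, 64, 32  # activation dim, dictionary size (m > d), batch size
l1_coeff = 1e-2           # hypothetical sparsity coefficient

W_enc = rng.normal(size=(d, m)) * 0.1
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
b_dec = np.zeros(d)


def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    f, x_hat = sae_forward(x)
    recon = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    sparsity = float(np.mean(np.sum(np.abs(f), axis=1)))
    return recon + l1_coeff * sparsity, recon, sparsity


x = rng.normal(size=(batch, d))  # stand-in for a batch of model activations
f, x_hat = sae_forward(x)
loss, recon, sparsity = sae_loss(x)
print(f"loss={loss:.3f} recon={recon:.3f} l1={sparsity:.3f}")
```

After training, each row of `W_dec` is a candidate feature direction and each column of `f` is that feature's activation across inputs.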
Cunningham, Ewart, Riggs, Huben, Sharkey (2023) “Sparse Autoencoders Find Highly Interpretable Features in Language Models” was one of the foundational papers. They demonstrated that SAE features in small language models were substantially more interpretable than the underlying neurons.
The 2024 SAE explosion. Bricken et al. (Anthropic 2023, “Towards Monosemanticity”), Templeton et al. (Anthropic 2024, “Scaling Monosemanticity”), OpenAI’s GPT-4 SAE work, and many others scaled SAEs to frontier models. The results: SAEs trained on Claude 3 Sonnet identified millions of features with substantial interpretability, including features for abstract concepts (deception, sycophancy, internal conflict, code patterns).
The “Scaling Monosemanticity” paper was a watershed: it demonstrated that industrial-scale MI was tractable and that features extracted from frontier models had safety-relevant interpretations. By 2024, SAEs were the dominant MI technique for decomposing dense activations.
Automated circuit discovery: 2023+
A methodological development. ACDC (Automated Circuit Discovery, Conmy et al., 2023) automated the circuit-identification process. Given a behaviour and a model, ACDC searches for the minimal subgraph of model components that produces the behaviour, using path patching and edge ablation.
Successor methods (EAP, attribution patching, integrated gradients for circuits) have refined the approach. Automated circuit discovery makes MI more systematic: instead of human-driven exploration, the methods produce circuits semi-automatically.
The current state. Automated methods produce candidate circuits; humans interpret and validate them. Fully autonomous MI (system identifies, names, and interprets circuits without human intervention) is an active research direction but not yet mature.
Safety MI: 2024–2026
The most recent direction. As frontier AI capabilities have advanced, MI has been increasingly deployed for safety. Several lines:
Refusal-direction analysis. Identify specific feature directions in the residual stream that encode “this is a refused request”; understand how RLHF-trained refusal works mechanistically.
Jailbreak mechanism. Identify why certain prompts bypass safety training; investigate the residual-stream patterns associated with successful jailbreaks.
Deception detection. Search for features associated with deceptive behaviour (saying one thing while internally representing another); a key target for alignment-relevant MI.
Capability monitoring. Identify features associated with potentially-dangerous capabilities (biology knowledge, cyber capability) for governance and oversight.
These applications are at the frontier of MI as of 2026. The results are suggestive; the methodology is developing; the deployment is partial. Whether MI can be a reliable safety tool - not just an interesting research programme - is one of the field’s open questions.
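For the refusal-direction line, one common way to estimate a direction is a difference of mean activations between two prompt sets, followed by projecting that direction out to test its causal role. The sketch below uses synthetic activations with a planted direction; the dimensions, sample sizes, and the names "harmful" and "harmless" are placeholders for activations that would in practice be collected from a model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200

# Synthetic data: "harmful" activations carry an extra planted direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
harmless = rng.normal(size=(n, d))
harmful = rng.normal(size=(n, d)) + 3.0 * true_dir

# Difference-of-means estimate of the direction, normalized to unit length.
r = harmful.mean(axis=0) - harmless.mean(axis=0)
r /= np.linalg.norm(r)


def ablate_direction(x, r):
    """Remove the component of each row of x along the unit vector r."""
    return x - np.outer(x @ r, r)


harmful_ablated = ablate_direction(harmful, r)

print("cosine with planted direction:", round(float(r @ true_dir), 3))
print("mean projection before:", round(float((harmful @ r).mean()), 3))
print("mean projection after:", round(float((harmful_ablated @ r).mean()), 3))
```

In a real experiment the test is behavioural: if ablating the direction at inference time changes whether the model refuses, the direction is causally implicated rather than merely correlated with refusal.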
Field-level consolidation: 2024–2026
The most recent inflection is institutional. MI has become a recognized subfield with:
Dedicated workshops at ICML, NeurIPS, ICLR.
Industrial research groups (Anthropic Interpretability, OpenAI Superalignment, DeepMind Mechanistic Interpretability, Apollo Research).
Standard methodology and shared infrastructure (TransformerLens, SAELens, the Eleuther interp tools).
Connection to alignment research and to policy frameworks.
By 2026, MI is no longer a niche research direction but a substantial subfield with its own conferences, infrastructure, and professional identity.
Where this leaves us in 2026
The current state. MI has mature foundations (the QK/OV framework, the circuits programme, the superposition hypothesis), a substantial methodology (probing, activation patching, SAEs, automated circuit discovery), substantive findings (induction heads, IOI circuit, factual-recall mechanism, refusal directions), and increasing connection to alignment and safety applications.
The remaining sections develop the technical material. §3 establishes the conceptual framework. §4–§5 cover the methods. §6 covers superposition and SAEs. §7 covers Transformer circuits. §8 covers MI findings. §9 covers SAEs at scale. §10 covers MI for safety. §11 covers methodology and practice. §12–§16 close out.
Editorial note. Mechanistic interpretability is a rapidly evolving subfield; specific technical findings and SAE-scaling results will date faster than the conceptual framework. The 2024 SAE-scaling work is the most recent material; readers should expect substantial subsequent development. The chapter is a snapshot of the methodology and findings as they stood in mid-2026.
§3. The MI Conceptual Framework
The methodological work of MI rests on a small set of conceptual commitments. This section develops them: features as the unit of analysis, circuits as the structural object, superposition as the explanation for polysemanticity, the linear-representation hypothesis as a working assumption, and causality as the validation criterion.
Features as the unit of analysis
The starting commitment. Mechanistic interpretability treats features as the primary unit of analysis. A feature is a direction in activation space corresponding to an interpretable computational quantity - “the model is currently representing that the subject is plural”, “the model is in a code-completion context”, “the model is detecting an indirect-object-completion pattern”.
The crucial property: features are directions, not individual neurons. A feature might be encoded along a direction $\mathbf{v}$ in a $d$-dimensional activation space; its activation on a given input is the dot product $\mathbf{v} \cdot \mathbf{x}$, where $\mathbf{x}$ is the activation vector. Multiple neurons can contribute to a single feature (the feature is distributed); a single neuron can participate in multiple features (the neuron is polysemantic).
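A four-neuron example makes the direction-versus-neuron distinction concrete. The two feature directions and their labels below are invented for illustration; they are chosen to be orthogonal unit vectors that both use every neuron.

```python
import numpy as np

# Hypothetical 4-neuron activation space with two feature directions.
v_plural = np.array([0.5, 0.5, 0.5, 0.5])   # distributed over all four neurons
v_code = np.array([0.5, -0.5, 0.5, -0.5])   # uses the same four neurons

# An activation vector in which only the "plural" feature is active.
x = 2.0 * v_plural + 0.0 * v_code

# A feature's activation is the dot product of its direction with x.
act_plural = float(v_plural @ x)
act_code = float(v_code @ x)

print("neuron activations:", x)
print("plural feature:", act_plural, "code feature:", act_code)
```

Every neuron is non-zero in `x`, and every neuron has weight in both directions, so no single neuron identifies which feature is present; only the projections do.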
Why directions rather than neurons. Neural-network training has no incentive to axis-align representations with neuron coordinates. The natural representation is a low-dimensional manifold embedded in the activation space; the relevant computational quantities are the coordinates on that manifold, which are linear combinations of neurons rather than individual neurons.
The empirical evidence. Polysemantic neurons - single neurons that respond to multiple unrelated inputs - are observed throughout deep networks. The classic example: a neuron in InceptionV1 that responds to cat faces, fronts of cars, and cat legs (Olah et al., 2017, “Feature Visualization”). The polysemanticity is not a bug; it is an emergent property of how networks pack many features into limited dimensions.
The implication for methodology. Direct neuron-level interpretation is limited. To find the interpretable computational units, we need to decompose the activation space into feature directions. This sets up the role of sparse autoencoders (§6, §9) - finding feature directions empirically.
Circuits as the structural object
The second commitment. Features are the what; circuits are the how. A circuit is a sub-network of weights and activations that implements a specific computation - typically a sequence of features at different layers, connected by specific weights, that together produce a behaviour.
The IOI circuit (§1) is a canonical example. The circuit involves ~26 attention heads at specific layers, each playing a specific role (duplicate-token detection, name-mover attention, S-inhibition). The circuit is not a separate component of the model; it is a subgraph of the existing model architecture, identified by analysis.
Three properties of circuits.
Sparsity. Most circuits involve only a small fraction of the model’s components. The IOI circuit involves ~26 heads out of 144 in GPT-2 Small; the remaining heads contribute comparatively little to IOI behaviour.
Compositionality. Different behaviours often share circuit components. The same name-mover heads that participate in IOI also participate in other completion tasks; the same induction heads that drive in-context learning also support various pattern-matching behaviours.
Emergence. Circuits are not designed; they emerge from training. The IOI circuit was not built into GPT-2’s architecture - it formed during pretraining as the model learned to predict next tokens.
The methodological consequence. Identifying a circuit means tracing the information flow from input to output through specific components, with each component’s role validated by causal intervention. This is more demanding than identifying features alone; circuits require characterizing both the features and the connections between them.
The superposition hypothesis
A specific theoretical framework. The superposition hypothesis (Elhage et al., Anthropic 2022, “Toy Models of Superposition”) proposes:
Neural networks represent more features than they have dimensions by encoding them in superposition - overlapping linear directions in activation space.
Why this happens. Suppose a network has $d$ neurons in some layer but $m > d$ relevant features that the network needs to represent. The features cannot each have an orthogonal dimension. Instead:
Each feature is encoded along a direction in the $d$-dimensional space.
The directions for different features overlap (are not orthogonal).
The features are sparsely active - only a few are non-zero on any given input - which makes the overlap manageable.
The mathematics. If features are sparse enough (each input activates only a few features) and the feature directions are appropriately spread (e.g., approximately uniformly distributed on the unit sphere), the network can recover individual feature activations from the superposed representation via thresholding or sparse decomposition.
The empirical evidence. The Toy Models paper demonstrated superposition explicitly in small synthetic settings: a 2-neuron representation can encode 5 features in superposition under appropriate sparsity. Subsequent work (Bricken et al., 2023; Templeton et al., 2024) demonstrated superposition in real language models by training SAEs that recover thousands to millions of features from much-lower-dimensional activations.
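The 5-features-in-2-dimensions case can be sketched by hand. The snippet below places five unit feature directions at the corners of a regular pentagon (an assumed geometry, chosen for illustration rather than learned by training) and shows that the readout is clean when one feature is active and misleading when two non-adjacent features are active at once.

```python
import numpy as np

m, d = 5, 2                                    # 5 features, 2 dimensions
angles = 2 * np.pi * np.arange(m) / m
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (m, d) unit directions

def readout(x):
    """Superpose feature activations x into d dims, then read each feature back."""
    h = x @ W          # superposed representation, shape (d,)
    return W @ h       # dot product with every feature direction, shape (m,)

one_active = np.zeros(m)
one_active[2] = 1.0
two_active = np.zeros(m)
two_active[[0, 2]] = 1.0

sparse_read = readout(one_active)   # feature 2 reads 1.0, neighbours about 0.31
dense_read = readout(two_active)    # interference: inactive feature 1 reads highest
```

Thresholding `sparse_read` at 0.5 recovers exactly the active feature; in `dense_read` the largest value belongs to a feature that was never active, which is why sparsity is what makes the overlap manageable.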
The implications for MI.
Polysemantic neurons are explained. Each neuron participates in many features; its response to any specific input is a mix of feature activations.
Direct neuron interpretation is limited. The interpretable units are features, not neurons.
Sparse decomposition methods (SAEs) are the natural tool. If features are encoded in superposition, the way to recover them is to learn an overcomplete sparse dictionary.
The linear representation hypothesis
A working assumption. The linear representation hypothesis asserts that high-level features in trained neural networks are encoded linearly in the activation space - as directions, with the feature’s activation being approximately a linear function of activations.
The rationale. Linear representations are operationally simple: features can be added, subtracted, projected, and decomposed using linear-algebraic operations. Many empirical observations support the linearity assumption:
Word embeddings exhibit linear analogies (“king - man + woman ≈ queen”) - a property that requires linearity in the embedding space.
Activation steering (adding a “direction” to internal activations) produces predictable behavioural changes (Subramani et al., 2022; Turner et al., 2023).
SAE-decomposed features behave linearly under intervention - adding $\alpha$ times a unit-norm feature direction increases the feature’s apparent activation by approximately $\alpha$.
Circuits found via causal intervention almost always involve linear flows between components (residual stream additions, attention-head outputs as linear projections).
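Under an exactly linear readout, the steering claim in the list above is a one-line identity. The sketch below checks it numerically and also shows projection-based ablation; the feature direction `v` is random here and stands in for a direction that would in practice come from a probe or an SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
v = rng.normal(size=d)
v /= np.linalg.norm(v)            # unit-norm feature direction (assumed known)
a = rng.normal(size=d)            # an arbitrary activation vector

alpha = 3.0
steered = a + alpha * v           # activation steering along v
ablated = a - (v @ a) * v         # projection-based ablation of v

steer_shift = float(v @ steered - v @ a)   # equals alpha for a linear readout
ablated_read = float(v @ ablated)          # approximately zero
```

In a real network the downstream effect of `steered` passes through non-linearities, so the identity holds for the readout along `v` but not necessarily for behaviour.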
The honest qualifications. Linearity is approximate, not exact. Some behaviours involve non-linear interactions (MLP non-linearities, attention softmax). The linearity hypothesis is useful - it makes MI tractable - but is itself a contestable empirical claim that holds in many settings and may fail in others.
The methodological commitment. MI methods that assume linearity (probing with linear classifiers, SAEs, activation steering, projection-based ablation) are more interpretable and computationally tractable than non-linear alternatives. The trade-off: if linearity fails for a specific case, these methods will produce wrong answers without obvious failure signals. The honest practice is to validate linearity assumptions for the specific case.
Causality as the validation criterion
The methodological commitment that distinguishes MI from older interpretability work. A claim about what a network represents (a feature, a circuit) is not validated by correlation alone. The claim must be validated by demonstrating that intervening on the proposed feature or circuit produces the expected behavioural change.
The validation pattern. Suppose we hypothesize that “head 9.6 in GPT-2 Small is a name mover for the IOI circuit.” The claim has multiple components:
Observation. When the model performs IOI completion, head 9.6 attends strongly to the indirect-object name and writes the name to the residual stream. Validated by attention-pattern analysis.
Necessity. If we remove head 9.6’s contribution (ablate it), IOI performance drops substantially. Validated by ablation experiments.
Sufficiency. If we insert head 9.6’s contribution from a clean prompt into a context where IOI would otherwise fail, the model performs IOI successfully. Validated by activation patching.
The combination of necessity and sufficiency provides causal validation. The feature/circuit is doing the work; alternative explanations (the feature is a downstream consequence of some other mechanism) are ruled out by the interventions.
This commitment to causal validation is the methodological core of MI. It is what distinguishes MI from XAI - XAI methods often produce explanations without causal validation, while MI methods make causal validation a defining requirement.
The technical machinery. The Causality chapter (especially §4 and §10) developed the causal-inference framework that MI uses operationally. Activation patching is essentially the do-operator applied to neural-network activations; causal mediation in MI is the same machinery as mediation in causal inference (§7 of the Causality chapter); causal scrubbing (Chan et al., Redwood Research, 2022) is a more sophisticated framework for testing causal-graph hypotheses about model internals.
A worked diagram of the conceptual framework
      THE MI CONCEPTUAL STACK

┌─────────────────────────────────┐
│            CAUSALITY            │
│     (validation criterion)      │
│ Activation patching, ablation,  │
│   causal mediation, scrubbing   │
└─────────────────────────────────┘
                 │
                 │ validates
                 ▼
┌─────────────────────────────────┐
│            CIRCUITS             │
│       (structural object)       │
│   Sub-networks of weights and   │
│    activations implementing     │
│      specific computations      │
└─────────────────────────────────┘
                 │
                 │ composed of
                 ▼
┌─────────────────────────────────┐
│            FEATURES             │
│       (units of analysis)       │
│ Directions in activation space  │
│     encoding interpretable      │
│    computational quantities     │
└─────────────────────────────────┘
                 │
                 │ encoded via
                 ▼
┌─────────────────────────────────┐
│      LINEAR REPRESENTATION      │
│         + SUPERPOSITION         │
│  (representational substrate)   │
│     Features as overlapping     │
│      linear directions in       │
│    high-dim activation space    │
└─────────────────────────────────┘
                 │
                 │ instantiated in
                 ▼
┌─────────────────────────────────┐
│  NETWORK ACTIVATIONS / WEIGHTS  │
│    (the empirical substrate)    │
└─────────────────────────────────┘

Each layer of the stack depends on the layer below. The empirical substrate (activations and weights) is decomposed via the linear-representation + superposition assumptions into features. Features compose into circuits. Circuits and features are validated via causal intervention.
This conceptual framework is the methodological foundation of modern MI. The remaining sections (§4 onward) develop the specific techniques that operationalize each layer of the stack.
Where this framework fits in 2026
The summary. The MI conceptual framework - features, circuits, superposition, linearity, causality - is consensus in the modern MI community. Specific technical claims (which features exist, which circuits implement which behaviours) are debated; the framework itself is broadly accepted.
The framework’s limits. Each commitment is an empirical claim that may fail. The linear-representation hypothesis is an idealization; some computations may genuinely be non-linear in ways that break the framework. Superposition is a hypothesis with strong evidence but is not a theorem; alternative organizational principles for activation space may matter. Causal validation is methodologically rigorous but in practice is often partial - full causal validation is expensive and many MI claims rest on partial validation.
The chapter develops the framework with appropriate caveats. The methodology is rigorous given the framework’s commitments; the framework’s commitments are themselves subject to revision as the field develops.
§4. Probing and Attribution
The simplest MI techniques. Probing asks: what features can be linearly extracted from a layer’s activations? Attribution asks: which inputs are most responsible for a model’s prediction? Both are observational - they look at activations and gradients without modifying the model - and both are limited in what they can validate (they identify correlations, not causal mechanisms). They are essential entry points for MI work and are useful when interpreted with appropriate caveats.
Linear probing for features
The basic idea. Given a trained network and a specific layer, train a linear classifier (probe) on the layer’s activations to predict some target property - “is the input a noun?”, “is the model uncertain about its prediction?”, “is the input describing a person?”. If the probe succeeds, the layer contains the relevant feature in linearly-extractable form. If it fails, either the layer doesn’t represent the feature or the feature isn’t linearly encoded.
The mechanics. For an activation $a \in \mathbb{R}^d$ at a specific layer and a binary target $y \in \{0, 1\}$:
LINEAR PROBE (basic recipe)
Given: N labelled examples (a_i, y_i), each a_i in R^d, y_i in {0, 1}
1. Train a logistic-regression classifier:
     y_hat_i = sigmoid(w · a_i + b)
   minimize cross-entropy loss with regularization.
2. Evaluate on held-out examples:
     accuracy of probe on test set.
3. Interpret:
   - High accuracy → the layer contains the feature linearly.
   - Low accuracy → either the layer doesn't represent it,
     or it's non-linear, or the probe is underspecified.
A worked example. Conneau et al. (2018) “What you can cram into a single $&!#* vector” probed sentence representations from various models for properties like sentence length, word content, and syntactic depth. The probes successfully extracted these properties at varying accuracies depending on the layer and the property. The result: sentence representations in modern models implicitly encode many syntactic and semantic features.
Probing has been applied broadly: probing for linguistic features (POS, syntax, semantics), for factual knowledge, for reasoning steps, for emotional content, for safety-relevant features. Probes are simple to train and easy to interpret; they provide suggestive evidence about what models internally represent.
The probing pitfalls
Several well-documented failure modes. Probing alone is not a reliable mechanistic technique; understanding its pitfalls is essential.
Pitfall 1: spurious probe success. A probe can succeed by exploiting features that are statistically informative for the target but not causally responsible for the model’s behaviour. For example, probing a model’s internal representation for “the input is in English” might succeed because of irrelevant correlations (e.g., the model’s representation of common English words) without “language detection” being a meaningful feature the model uses.
Pitfall 2: probe expressivity. A probe might succeed because the classifier itself is expressive enough to recover the target from activations that don’t contain a clean feature. Hewitt and Liang (2019) “Designing and Interpreting Probes with Control Tasks” showed that sufficiently expressive probes for syntactic features can reach high accuracy even on control tasks, in which each word type is assigned a random label - meaning probe success doesn’t always indicate the model represents the target.
The fix: use control tasks (probe a random-labelled version of the same task and compare; if the probe succeeds at a similar rate, the original probe success is uninformative). The gap between real-task and control-task accuracy is the probe’s selectivity.
Pitfall 3: probe ≠ model behaviour. A probe might succeed at extracting a feature that the model contains but does not use for its predictions. The presence of the feature in activations is not evidence that the model uses it causally.
The fix: validate with causal interventions (§5) - does ablating the feature change behaviour? If yes, the feature is causally relevant; if no, the feature is present but unused.
Pitfall 4: layer choice. Features may be present at some layers but not others. Probing requires choosing layers; choices affect conclusions. A feature absent from layer 5 may be present at layer 8.
The fix: probe at all layers; report which layer (or range) the feature appears at; verify that the feature’s presence aligns with the expected stage of computation.
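The basic probe recipe and the control check from Pitfall 2 can be sketched together. Everything below is synthetic: the "activations" are Gaussian noise plus a shift along an assumed ground-truth direction `v`, and the control is a simple label shuffle, which is a simplification of Hewitt and Liang's control tasks (those assign random labels per word type rather than per example).

```python
import numpy as np

def train_probe(A, y, lr=0.5, steps=300, l2=1e-3):
    """Fit y_hat = sigmoid(w . a + b) by gradient descent on cross-entropy."""
    w, b = np.zeros(A.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A @ w + b)))
        w -= lr * (A.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(A, y, w, b):
    return float(np.mean(((A @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
N, d = 400, 20
v = rng.normal(size=d)
v /= np.linalg.norm(v)                       # hypothetical feature direction
y = rng.integers(0, 2, size=N)
A = rng.normal(size=(N, d)) + 4.0 * (y[:, None] - 0.5) * v   # shift of +/-2 along v
y_control = rng.permutation(y)               # labels decoupled from activations

train, test = slice(0, 300), slice(300, N)
w, b = train_probe(A[train], y[train])
real_acc = accuracy(A[test], y[test], w, b)
wc, bc = train_probe(A[train], y_control[train])
control_acc = accuracy(A[test], y_control[test], wc, bc)
selectivity = real_acc - control_acc         # large gap = probe is reading the feature
```

A high `real_acc` with a near-chance `control_acc` is the pattern one hopes for; it still says nothing about whether the model uses the feature, which is Pitfall 3.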
Saliency methods
A different observational approach. Saliency methods identify which input features are most responsible for a model’s prediction. The classical approach: compute the gradient of the output with respect to the input.
SALIENCY MAP (basic)
Given: model f, input x, output class c
1. Compute s = ∂f_c(x) / ∂x (the gradient).
2. The magnitude of s_i indicates how sensitive the prediction is
   to input feature i.
3. Visualize as a "saliency map" overlaid on the input.
The applications. Saliency maps are widely used for image classification (which pixels matter for the prediction?) and for text models (which tokens matter?). They provide intuitive visualizations that can be useful for debugging and presentation.
The limitations. Saliency methods are known to be unreliable:
Sanity checks fail. Adebayo et al. (2018) “Sanity Checks for Saliency Maps” showed that many saliency methods produce similar-looking maps even when the model’s weights are randomized - meaning the saliency reflects properties of the image more than properties of the model.
Sensitivity to small changes. Saliency maps can change substantially with small input perturbations, suggesting they capture noise rather than signal.
No causal validation. Saliency identifies high-gradient inputs but doesn’t establish that those inputs are causally responsible for the prediction. Counter-examples exist where saliency-highlighted regions can be removed without affecting prediction.
The honest accounting. Saliency methods are useful for visualization and intuition but unreliable for mechanistic claims. They are largely outside the modern MI methodological core; they remain in use for XAI applications and presentations.
Integrated gradients
A refinement that addresses some saliency-method limitations. Sundararajan, Taly, Yan (2017) “Axiomatic Attribution for Deep Networks” introduced integrated gradients (IG).
The idea. Instead of computing the gradient at a single input $x$, integrate the gradient along a straight-line path from a baseline $x'$ (typically a neutral input, e.g., a black image or empty string) to the actual input $x$:
$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha$$
Reading this. For each input feature $i$, integrate the gradient as we move from baseline to actual input; multiply by the input difference $x_i - x'_i$. The result is the attribution of feature $i$ to the prediction.
The advantages over basic saliency. IG satisfies several axioms (sensitivity, completeness, implementation invariance) that motivate it as a principled attribution method. It is less sensitive to local gradient noise.
The limitations. IG remains correlational rather than causal - it identifies feature importances, not causal mechanisms. It depends on the choice of baseline (different baselines give different attributions). The axiomatic justification is appealing but does not guarantee mechanistic correctness.
IG is more reliable than basic saliency but still not a substitute for causal-intervention methods. It is widely used as a debugging and presentation tool.
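A small numerical sketch makes the contrast with plain saliency concrete. The function `f` below is a hypothetical two-input model that saturates in its first input; the gradient is taken by central differences as a stand-in for autodiff, and the path integral is approximated with a midpoint Riemann sum. None of these choices come from the IG paper itself.

```python
import numpy as np

def f(x):
    # Hypothetical scalar model: saturates in x[0], linear in x[1].
    return min(x[0], 1.0) + 0.5 * x[1]

def grad(fn, x, eps=1e-5):
    """Central-difference gradient (a stand-in for autodiff in this sketch)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

def integrated_gradients(fn, x, baseline, steps=100):
    alphas = (np.arange(steps) + 0.5) / steps     # midpoint rule on [0, 1]
    grads = [grad(fn, baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
saliency = grad(f, x)                         # plain gradient at the input
ig = integrated_gradients(f, x, baseline)     # path-integrated attribution
```

At this input the plain gradient assigns zero to the first feature because the model has saturated, while IG credits it with the full change it caused along the path; the attributions also sum to `f(x) - f(baseline)`, which is the completeness axiom.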
Attention rollout and attention attribution
A Transformer-specific technique. Attention rollout (Abnar and Zuidema, 2020) aggregates attention patterns across layers to produce a per-token attribution score:
ATTENTION ROLLOUT
Given: attention matrices A^l ∈ R^{n×n} at each layer l = 1, ..., L, averaged over heads
       (where A^l_ij = attention from token i to token j at layer l)
1. Account for residual stream: A_tilde^l = (A^l + I) / 2
2. Multiply across layers: R = A_tilde^L · A_tilde^{L-1} · ... · A_tilde^1
3. R_ij is the rollout attribution from token i to token j.
The result is an attention-derived attribution map showing which input tokens are most attended to (across all layers) for each output position.
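The three steps of the recipe translate almost directly into code. This is a minimal sketch assuming the caller supplies head-averaged, row-stochastic attention matrices ordered from the first layer to the last; the two-token example is made up for illustration.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of (n, n) attention matrices, first layer first."""
    n = attn_per_layer[0].shape[0]
    rollout = np.eye(n)
    for A in attn_per_layer:
        A_tilde = 0.5 * (A + np.eye(n))   # mix in the residual connection
        rollout = A_tilde @ rollout       # later layers multiply on the left
    return rollout

# Two layers in which token 1 attends entirely to token 0.
A = np.array([[1.0, 0.0],
              [1.0, 0.0]])
R = attention_rollout([A, A])
```

Here `R[1, 0]` comes out to 0.75 and `R[1, 1]` to 0.25: even though token 1 puts all its attention on token 0 in both layers, the residual term keeps a quarter of the attribution on itself.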
The honest accounting. Attention rollout is suggestive but has known limitations:
Attention is not the only computational mechanism - MLPs and residual stream also contribute.
Multiplying attention matrices double-counts certain paths and ignores others.
Attention scores are not the same as causal contribution - some attention is correlational rather than functionally important.
Jain and Wallace (2019) “Attention is not Explanation” showed that attention patterns can be substantially modified without affecting model predictions, suggesting attention is not a reliable explanation.
The deeper point. Attention is one aspect of Transformer computation; attribution by attention alone misses other components. Modern circuit analysis (§7) examines attention patterns together with the QK and OV decompositions, with causal-intervention validation. Pure attention attribution is increasingly viewed as too coarse.
When probing tells us something causal
A practical question. Given the pitfalls above, when are probing results trustworthy indicators of model mechanism?
The honest checklist:
Use control tasks (Hewitt-Liang). Probe random labels; verify probe success drops substantially.
Probe at multiple layers. A feature should appear at the expected depth for the computation; if it appears uniformly across layers, the probe is likely overfitting.
Test linearity assumptions. Use both linear probes and small non-linear probes; if non-linear probes substantially outperform, the feature isn’t linearly encoded.
Validate causally. The probe identifies presence of a feature; causal interventions establish use. Combine probing with §5 methods.
Replicate across models. A probe finding that holds across multiple model scales and training runs is more likely meaningful than one that holds for a single model.
When all of these are satisfied, probing provides reasonably strong evidence about what the model represents and uses. When not, probing provides suggestive evidence that needs further validation.
The role of probing in modern MI
The summary. Probing and attribution are first-pass tools in MI workflow. They identify candidates - features the model might represent, inputs the model might be using - that are then validated via causal interventions (§5). The combination of probing and intervention is more reliable than either alone.
Probing remains an active area. Modern probing work (early 2020s onward) addresses the pitfalls explicitly: control tasks are standard; multi-layer analysis is routine; causal-validation followups are increasingly common. Attribution methods (saliency, IG, attention rollout) are mostly relegated to XAI / presentation contexts; modern MI relies on probing only as one component of a broader methodology.
The next section (§5) develops the causal methods that complement probing - the activation-patching, ablation, and mediation techniques that establish causal claims about model internals.
§5. Causal Intervention Methods
The methodological core of modern MI. Causal-intervention methods modify the model’s internal computation and observe the effect on its output, providing direct evidence about what each component is doing. This section develops the four standard techniques: activation patching, ablation, path patching, and causal scrubbing.
Activation patching: the canonical technique
The most-used MI tool. Activation patching (Vig et al., 2020; Geiger et al., 2021; Meng et al., 2022) replaces a model’s internal activation at a specific location with the activation from a different (counterfactual) input, and observes how the output changes.
The recipe:
ALGORITHM Activation Patching (canonical form)
INPUT: model M, clean prompt x_clean, counterfactual prompt x_cf,
       location L (a specific layer + token position + component),
       evaluation metric (e.g., logit difference for clean answer)
1. Forward pass on x_clean.
   - Cache all internal activations.
   - Record the model's output (e.g., logit for clean answer).
2. Forward pass on x_cf.
   - Cache the activation at location L (call it a_cf[L]).
3. Forward pass on x_clean WITH PATCHING:
   - Run as in step 1, but at location L, REPLACE the clean
     activation with a_cf[L].
   - Continue forward through the rest of the network.
   - Record the patched output.
4. Compute metric:
   - patched_metric - clean_metric.
   - Large absolute change ⇒ location L is causally important
     for the behaviour.
   - Small change ⇒ no evidence that location L matters for
     this clean/counterfactual contrast.
The core idea. By swapping in the activation from a counterfactual at one location and seeing how much the output moves toward the counterfactual answer, we measure the causal contribution of that location. If the output changes substantially, the location is causally responsible for the relevant computation. If not, the computation happens elsewhere, or other components compensate for the patched one.
A worked example using IOI. Suppose:
Clean prompt: “When John and Mary went to the store, John gave a drink to” (correct answer: " Mary").
Counterfactual prompt: “When Alice and Bob went to the store, Alice gave a drink to” (correct answer: " Bob").
Patching the counterfactual activation into the clean run at layer 9, at the position of “to”: if the model’s prediction for the next token shifts substantially from " Mary" toward " Bob", layer 9 at that position carries the IOI-relevant information.
By systematically patching across all layers and positions, we build a patching map - a 2D heatmap (layer × position) showing where the IOI computation lives. Hotspots in the heatmap correspond to circuit components.
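The recipe can be exercised end to end on a toy model. The sketch below uses a hand-built two-layer residual network with three named components; the weights, inputs, and the scalar readout `u` are all hypothetical and chosen so the arithmetic can be followed by hand. A real experiment would hook a Transformer instead.

```python
import numpy as np

# Hypothetical weights for a two-layer residual toy model.
W_A = np.array([[1.0, 0.0], [0.0, 0.0]])   # layer 1, component A: copies input dim 0
W_B = np.array([[0.0, 0.0], [0.0, 0.1]])   # layer 1, component B: weak read of dim 1
W_C = np.array([[2.0, 0.0], [0.0, 0.0]])   # layer 2, component C: amplifies dim 0
u = np.array([1.0, 1.0])                   # readout to a scalar metric

def run(x, patch=None):
    """Forward pass; `patch` maps a component name to an output to substitute."""
    patch = patch or {}
    cache = {}
    r0 = np.asarray(x, dtype=float)
    cache["A"] = patch.get("A", W_A @ r0)
    cache["B"] = patch.get("B", W_B @ r0)
    r1 = r0 + cache["A"] + cache["B"]
    cache["C"] = patch.get("C", np.maximum(W_C @ r1, 0.0))
    r2 = r1 + cache["C"]
    return float(u @ r2), cache

x_clean = np.array([1.0, 1.0])
x_cf = np.array([0.0, 1.0])      # counterfactual differs only in dim 0

clean_metric, _ = run(x_clean)
_, cf_cache = run(x_cf)
effects = {
    name: run(x_clean, {name: cf_cache[name]})[0] - clean_metric
    for name in ("A", "B", "C")
}
```

Patching A or C moves the metric (by about -3 and -4 respectively), while patching B leaves it unchanged because B's output is identical on the two prompts. That last case illustrates why the counterfactual matters: B is not shown to be unimportant in general, only for this contrast.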
The variants. Modern activation patching has many flavours:
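Mean and resample ablation. Patch with the mean activation over a set of counterfactuals (mean ablation), or with the activation from a randomly resampled counterfactual (resample ablation), instead of a single hand-picked one. More stable; less specific.
Direct logit attribution. Measure only a component’s direct contribution to the logits - its output as it reaches the final residual stream and unembedding - ignoring any effect mediated by later components.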
Attribution patching. Approximate patching with a first-order Taylor expansion around the clean activation; computationally cheaper but less accurate (Syed et al., 2023).
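The accuracy trade-off of attribution patching can be seen on a one-dimensional toy. The downstream map `metric_from_activation` below is a hypothetical non-linear function standing in for everything the network does after the patched location; the derivative is written analytically where a real implementation would use a backward pass.

```python
def metric_from_activation(a):
    # Hypothetical non-linear downstream computation from one activation to the metric.
    return a ** 2 + a

def d_metric(a):
    # Analytic derivative of the toy metric (a backward pass in practice).
    return 2 * a + 1

def exact_patch_effect(a_clean, a_cf):
    """Effect of truly replacing the clean activation with the counterfactual one."""
    return metric_from_activation(a_cf) - metric_from_activation(a_clean)

def attribution_patch_effect(a_clean, a_cf):
    """First-order Taylor estimate of the same effect around the clean activation."""
    return (a_cf - a_clean) * d_metric(a_clean)

a_clean = 1.0
near = (exact_patch_effect(a_clean, 1.5), attribution_patch_effect(a_clean, 1.5))
far = (exact_patch_effect(a_clean, 3.0), attribution_patch_effect(a_clean, 3.0))
```

For the nearby counterfactual the estimate is 1.5 against an exact effect of 1.75; for the distant one it is 6.0 against 10.0. The linear approximation needs only one forward and one backward pass for all locations at once, but it degrades as the counterfactual activation moves away from the clean one.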
Ablation: deletion-style intervention
A simpler intervention: instead of replacing an activation with a counterfactual, zero it out (or replace with mean activation, or with random noise). The result is a direct measurement of necessity: if zeroing the component breaks the behaviour, the component is necessary; if not, it is not.
ABLATION
For each component C in the model:
  1. Forward pass with C's contribution zeroed (or set to mean).
  2. Measure the resulting metric.
  3. Compare to the clean metric.
  4. Large drop ⇒ C is necessary for the behaviour.
The trade-off vs activation patching. Ablation tests necessity (does removing it break things?) but not sufficiency (could the component alone produce the behaviour?). Activation patching can test sufficiency by patching from a successful prompt into a failure context.
The choice. Most modern MI work uses both: ablation to identify necessary components; activation patching to test how the components interact and to characterize what information they carry.
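The choice between zeroing and mean-replacing a component can change the conclusion. The sketch below is a made-up two-component model in which component B only contributes a constant offset; the inputs, the components, and the `BIAS` term are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)       # hypothetical scalar inputs
y = X > 0                      # behaviour to explain: predict the sign of x

components = {
    "A": lambda x: 2.0 * x,                 # carries the input-dependent signal
    "B": lambda x: np.full_like(x, 5.0),    # constant offset, no input information
}
BIAS = -5.0                    # cancels B's offset in the clean model

def accuracy(x, y, ablate=None, mode="zero"):
    logit = np.full_like(x, BIAS)
    for name, fn in components.items():
        out = fn(x)
        if name == ablate:
            # Zero ablation removes the output; mean ablation keeps its average.
            out = np.zeros_like(out) if mode == "zero" else np.full_like(out, out.mean())
        logit = logit + out
    return float(np.mean((logit > 0) == y))

clean_acc = accuracy(X, y)
zero_B = accuracy(X, y, ablate="B", mode="zero")
mean_B = accuracy(X, y, ablate="B", mode="mean")
zero_A = accuracy(X, y, ablate="A", mode="zero")
```

Zero-ablating B collapses accuracy to roughly chance, which would suggest B is necessary, yet mean-ablating B changes nothing because B carries no information about the input. Zero-ablating A also breaks the behaviour, and in that case the conclusion survives mean ablation as well.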
Path patching
A refinement that handles the cross-component interactions. Path patching (Goldowsky-Dill et al., 2023; Wang et al., 2022 in IOI) traces information flow along specific paths through the network - patching only the connection between two specified components rather than the activation at a single location.
The use case. Suppose we know head 7.3 (an S-inhibition head in IOI) writes to the residual stream and head 9.6 (a name mover, two layers later) reads from the same residual stream. We want to test: does 7.3’s effect on the output go through its influence on 9.6?
Path patching answers this by intervening only on the contribution from 7.3 to 9.6’s input - leaving everything else clean. If the output changes substantially under this targeted intervention, the 7.3 → 9.6 path is causally important.
PATH PATCHING (sketch)
For a path from component A (in an earlier layer) to component B (in a later layer):
1. Forward pass on x_clean. Cache all activations.
2. Forward pass on x_cf. Cache A's contribution to the residual
   stream (this is what we'll patch).
3. Forward pass on x_clean, with the modification:
   - B's input is the clean residual stream with A's clean
     contribution swapped for A's contribution from x_cf.
   - All other components keep their cached clean outputs, so
     the change can reach the output only through B.
4. Measure the change in output relative to clean.
The result. Path patching produces a graph of causal influences - which components flow information to which others. The graph is the basis for circuit identification.
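The cost. Path patching requires more careful experimental design than activation patching at single locations. The number of direct connections grows quadratically with the number of components (and the number of multi-hop paths grows faster still), and not all paths are interesting. Modern automated tools such as ACDC (Conmy et al., 2023) and edge attribution patching (EAP; Syed et al., 2023) automate the path-search process.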
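The difference between patching a component and patching a single path can be sketched on a three-component scalar model. The weights and inputs below are hypothetical, and only the direct edge A → B is implemented; a fuller version would also handle the indirect route through M.

```python
A_W, M_W, B_W = 1.0, 0.5, 1.0     # hypothetical scalar weights

def forward(x, a_out=None):
    """Full forward pass; optionally overwrite A's output (plain activation patching)."""
    a = A_W * x if a_out is None else a_out
    r1 = x + a                     # residual stream after layer 1
    m = M_W * r1                   # intermediate component M (layer 2)
    r2 = r1 + m
    b = max(B_W * r2, 0.0)         # receiver component B (layer 3)
    return {"x": x, "A": a, "M": m, "r2": r2, "B": b, "out": r2 + b}

def path_patch_A_to_B(clean, cf):
    """Patch only the direct edge A -> B; every other activation stays clean."""
    b_in = clean["x"] + cf["A"] + clean["M"]   # B sees A's counterfactual output
    b_new = max(B_W * b_in, 0.0)
    return clean["r2"] + b_new                 # the rest of the stream is untouched

clean = forward(1.0)
cf = forward(0.5)
total_effect = forward(1.0, a_out=cf["A"])["out"] - clean["out"]
direct_effect = path_patch_A_to_B(clean, cf) - clean["out"]
```

Patching A everywhere changes the output by -1.5, while the direct A → B edge alone accounts for -0.5; the remainder flows through A's own write to the stream and through M.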
Causal scrubbing
A more sophisticated framework. Causal scrubbing (Chan et al., Redwood Research, 2022) tests whether a proposed causal hypothesis about a model’s internals is consistent with the model’s actual behaviour.
The idea. A hypothesis specifies a causal graph of internal computation - which components produce which intermediate features, with what dependencies. Causal scrubbing tests this graph by replacing each component’s activation with the activation it would have on a different input that produces the same value of the relevant feature. If the hypothesis is correct, the model’s behaviour should be unchanged. If it changes, the hypothesis is incomplete or wrong.
The technical detail. For a hypothesis “at component $C$, the model represents feature $\phi$, and only $\phi$ matters for the downstream computation”:
Identify a counterfactual input $x_{\text{cf}}$ that has the same value of feature $\phi$ as the clean input.
Patch in $C$’s activation from $x_{\text{cf}}$.
If the model’s output is unchanged, the hypothesis (that only $\phi$ matters at $C$) is consistent.
If the output changes, the hypothesis is missing some information that $C$ carries.
Causal scrubbing is more demanding than simple activation patching. It tests what information matters, not just whether a component matters. It is among the most rigorous validations available for circuit-level claims; it is less commonly used in exploratory work because it requires a fully-specified hypothesis to test.
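A single-node version of this test fits in a few lines. The sketch below is a heavy simplification (real causal scrubbing resamples recursively over a whole hypothesis graph and compares expected loss); the integer inputs, the two components, and the parity feature are all hypothetical.

```python
import random

def comp_C(x):
    return x % 2          # hypothetical component that computes parity only

def comp_D(x):
    return x % 4          # hypothetical component that carries more than parity

def model(x, c_act=None, d_act=None):
    c = comp_C(x) if c_act is None else c_act
    d = comp_D(x) if d_act is None else d_act
    return c + 10 * (d >= 2)

def scrub(component, feature, inputs, trials=200, seed=0):
    """Fraction of trials where the output survives resampling `component`'s
    activation from another input that agrees with the clean one on `feature`."""
    rng = random.Random(seed)
    unchanged = 0
    for _ in range(trials):
        x = rng.choice(inputs)
        x_cf = rng.choice([z for z in inputs if feature(z) == feature(x)])
        if component == "C":
            patched = model(x, c_act=comp_C(x_cf))
        else:
            patched = model(x, d_act=comp_D(x_cf))
        unchanged += patched == model(x)
    return unchanged / trials

def parity(x):
    return x % 2

inputs = list(range(8))
c_score = scrub("C", parity, inputs)   # hypothesis "C carries only parity"
d_score = scrub("D", parity, inputs)   # hypothesis "D carries only parity"
```

The hypothesis about C survives every resample, so it is consistent with the model. The same hypothesis about D fails in a large share of trials, which signals that D carries information beyond parity that the downstream computation uses.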
Distinguishing causally-relevant from spurious features
The methodological payoff. With these intervention methods, MI claims can be validated:
Probing finds a feature. Necessary first step but not sufficient.
Ablation tests necessity. Does removing the feature break the behaviour?
Activation patching tests sufficiency. Does inserting the feature produce the behaviour?
Path patching tests information flow. Does the feature flow through the proposed path?
Causal scrubbing tests completeness. Does the proposed mechanism account for all relevant information?
A claim that survives all of these is strongly validated. A claim that survives only probing is suggestive but not validated. The gold standard for modern MI is the combination of all five.
The limitations. Even causal validation has limitations:
Counterfactual specification. The validity of activation patching depends on the choice of counterfactual; bad counterfactuals give misleading results.
Distribution shift. Patching can produce activations the model never sees in training; the resulting behaviour may not reflect normal computation.
Component non-additivity. Patching multiple components individually can miss interactions; the joint effect of two components may not equal the sum of individual effects.
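Non-additivity is easiest to see with two redundant components. The behaviour metric below is a deliberately minimal, hypothetical stand-in in which either component alone suffices.

```python
def output(a_on=True, b_on=True):
    """Hypothetical behaviour metric with two redundant components."""
    a = 1.0 if a_on else 0.0
    b = 1.0 if b_on else 0.0
    return max(a, b)              # either component alone is enough

clean = output()
effect_a = output(a_on=False) - clean          # single ablation of A
effect_b = output(b_on=False) - clean          # single ablation of B
effect_joint = output(False, False) - clean    # ablating both together
```

Each single ablation has zero effect, so one-at-a-time screening would discard both components, yet ablating them jointly removes the behaviour entirely.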
Best practice in 2026 acknowledges these limits and uses multiple interventions with consistent results as the validation standard.
Activation patching as do-calculus
A conceptual connection. Activation patching is essentially an application of the do-operator (Causality §4). The internal activations of the model form a causal graph; activation patching is a do-intervention on a specific node in this graph; the resulting output change is the causal effect of that intervention.
The Causality chapter’s machinery (the do-operator, mediation, identifiability) applies, with appropriate adaptation, to MI. Some recent work (Geiger et al., 2024 “Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations”) makes this connection explicit, treating MI as causal-inference applied to neural-network internals.
The implication. MI’s methodology is not a separate framework - it is the application of causal-inference principles to a specific target (model internals). Practitioners trained in causal inference can adapt their tools to MI; MI practitioners benefit from the broader causal-inference literature.
Where causal-intervention methods fit
The summary. Causal-intervention methods are the methodological core of modern MI. Activation patching is the workhorse; ablation, path patching, and causal scrubbing are specialized refinements. Together they form a rigorous validation toolkit that distinguishes MI from older interpretability work.
The §6 turn. So far we have treated activations as if they were interpretable once located - the IOI worked example talks about “head 9.6 attending to the name”. But in many cases, the activations are polysemantic (a single neuron mixes multiple features). Recovering the underlying interpretable features from polysemantic activations requires decomposition. The next section covers the superposition hypothesis and SAEs as the dominant decomposition technique.
§6. Features and the Superposition Hypothesis
This section develops the conceptual and technical machinery for recovering interpretable features from neural network activations when those activations are polysemantic. The starting point is the empirical observation that many neurons in trained networks respond to multiple unrelated stimuli; the explanation is the superposition hypothesis; the response is sparse autoencoders.
Polysemantic neurons: the empirical observation
A long-standing finding in neural-network analysis. Polysemantic neurons are neurons that respond to multiple, unrelated stimuli. Examples from the literature:
A neuron in InceptionV1 that fires for both car wheels and dog faces (Olah et al., 2018).
A neuron in CLIP that responds to both Spider-Man and spiders (Goh et al., 2021).
A neuron in GPT-2 that fires for Greek letters, for mathematical notation, and for a scattering of unrelated other contexts (multiple sources).
Polysemanticity was initially viewed as a bug - networks should ideally have monosemantic (single-feature) neurons that we can interpret directly. The empirical reality is the opposite: most neurons in most networks are polysemantic.
Three explanations were considered.
Explanation 1: Polysemanticity is essential. Networks fundamentally need to mix concepts; trying to decompose is misguided. Disfavoured by subsequent results showing decomposition works.
Explanation 2: Polysemanticity is an artefact of suboptimal training. Better-trained networks would have monosemantic neurons. Disfavoured by the universality of polysemanticity across architectures, training procedures, and scales.
Explanation 3: Polysemanticity is the consequence of dimensional pressure. Networks need to represent more features than they have neurons; superposition is the natural consequence. The current consensus, articulated and validated by Elhage et al. 2022.
The toy models of superposition
The decisive paper. Elhage, Hume, Olsson, Schiefer, Henighan, Kravec, Hatfield-Dodds, Lasenby, Drain et al. (Anthropic, 2022) “Toy Models of Superposition” gave a clean theoretical and empirical demonstration.
The toy setup. A small autoencoder with $m$ hidden units (the “neurons”) is trained to reconstruct $n$ sparse input features, where $n > m$. The features have different importances (some are more important than others) and sparsities (each feature is active on only a fraction of inputs).
The empirical findings:
When features are dense (low sparsity), the network represents only the most important features, one per hidden dimension, and ignores the less important ones. No superposition.
As sparsity increases, the network represents more features than it has dimensions by placing them in superposition - overlapping, non-orthogonal linear directions. Less important features are the first to be dropped or to share directions.
The network finds geometrically efficient packings (antipodal pairs at some settings; pentagons, regular polytopes, etc. at others).
The key insight. Superposition emerges naturally from the optimization. Networks learn to encode more features than they have dimensions when the features are sparse enough that interference is manageable. This is not a quirk of training; it is a structural feature of how networks pack information.
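The pentagon packing mentioned above can be checked by hand. The following is a minimal sketch, not the paper's training code: it assumes the toy reconstruction form x_hat = ReLU(WᵀW x + b), places five feature directions at the corners of a regular pentagon in a two-dimensional hidden space, and picks the scale and bias analytically (these values are derived here for illustration, not learned).

```python
import numpy as np

n_features, n_hidden = 5, 2
angles = 2 * np.pi * np.arange(n_features) / n_features
c = np.cos(2 * np.pi / n_features)           # overlap between neighbouring directions
scale = 1.0 / np.sqrt(1.0 - c)               # chosen so the active feature reconstructs to 1
W = scale * np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 5): one column per feature
b = -(scale ** 2) * c                        # bias that cancels neighbour interference

def reconstruct(x):
    h = W @ x                                # compress 5 features into 2 hidden units
    return np.maximum(W.T @ h + b, 0.0)      # ReLU readout removes negative interference

# Sparse case: exactly one feature active at a time.
one_hot_errors = [np.abs(reconstruct(e) - e).max() for e in np.eye(n_features)]

# Dense case: all features active at once; the directions cancel and the input is lost.
dense = np.ones(n_features)
dense_error = np.abs(reconstruct(dense) - dense).max()
```

With one feature active, reconstruction is essentially exact even though five features share two dimensions; with all features active, interference destroys the signal. That is the sparsity dependence the section describes.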
Subsequent work examined real Transformer activations. Architecturally, nothing ties residual-stream features to neuron coordinates, so features can lie in arbitrary directions. Elhage et al. 2023 (“Privileged Bases in the Transformer Residual Stream”) report that some coordinate alignment nonetheless appears in practice and tentatively attribute it to the optimizer rather than to the features themselves.
Sparse autoencoders: the proposed remedy
If features are encoded as overlapping directions, recovering them requires sparse decomposition. Sparse autoencoders (SAEs) are the dominant tool.
The architecture. An SAE is an autoencoder with two unusual properties:
Overcomplete. The hidden layer has more units than the input dimension. If the SAE is decomposing $d$-dimensional activations, the hidden layer has $k > d$ units (often many times $d$ in recent work).
Sparse. The hidden activations are constrained to be sparse - only a few hidden units are non-zero per input.
The training:
SPARSE AUTOENCODER
Encoder: z = ReLU(W_e · a + b_e) # k-dim, sparse
Decoder: a_hat = W_d · z + b_d # d-dim, reconstruction
Loss: L = ||a - a_hat||^2 + λ * ||z||_1
(reconstruction + sparsity)
Trained on: many activations a from a chosen layer of the model.
The hope. The trained encoder produces a sparse decomposition of the activation: the few active hidden units correspond to features present in the input. Each hidden unit has a direction (its column in $W_d$, given the $\hat{a} = W_d z$ convention above) that defines the feature in activation space.
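The pseudocode above translates almost line for line into NumPy. This is a minimal sketch of the forward pass and loss only, assuming the column-vector convention of the pseudocode (so hidden unit j's decoder direction is column j of W_d). The sizes, the random initialisation, and the value of λ are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 64             # activation width d and overcomplete dictionary size k (illustrative)
lam = 1e-3                # sparsity coefficient λ (illustrative value)

W_e = rng.normal(0.0, 0.1, size=(k, d))
b_e = np.zeros(k)
W_d = rng.normal(0.0, 0.1, size=(d, k))
b_d = np.zeros(d)

def sae_forward(a):
    z = np.maximum(W_e @ a + b_e, 0.0)       # encoder: k-dim, non-negative code
    a_hat = W_d @ z + b_d                    # decoder: d-dim reconstruction
    return z, a_hat

def sae_loss(a):
    z, a_hat = sae_forward(a)
    recon = np.sum((a - a_hat) ** 2)         # ||a - a_hat||^2
    sparsity = lam * np.sum(np.abs(z))       # λ * ||z||_1
    return recon + sparsity, recon, sparsity

a = rng.normal(size=d)                       # stand-in for one model activation
z, a_hat = sae_forward(a)
total, recon, sparsity = sae_loss(a)
feature_direction = W_d[:, 3]                # decoder direction of hidden unit 3
```

Because the weights here are untrained, z is non-negative but not yet sparse; sparsity only emerges once the L1 term has been minimised over many activations.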
The validation. Each hidden unit can be analyzed:
What inputs activate it? Examine training inputs where the unit is most active; look for a coherent pattern.
Where does it project to? Examine the unit’s decoder direction in activation space; relate to other components of the model.
What does steering it do? Add the unit’s decoder direction to the residual stream; observe behaviour change.
If the unit consistently activates on an interpretable concept, has a meaningful projection direction, and produces interpretable behaviour change when steered, it is a monosemantic feature.
A worked SAE example
A small concrete demonstration. Suppose we train an SAE with $k$ hidden units on the residual stream of a small Transformer at layer 8. After training:
We pass 100,000 text snippets through the model.
For each hidden unit, we identify the top-10 activating snippets.
We examine the snippets for shared themes.
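The "top activating snippets" step is a per-column sort of the SAE activation matrix. A minimal sketch follows, assuming the activations have already been collected into an array of shape (snippets, hidden units); the function name `top_activating` and the four toy snippets are made up for illustration.

```python
import numpy as np

def top_activating(z, n_top=10):
    """z has shape (n_snippets, n_units).
    Returns an (n_units, n_top) array of snippet indices, strongest first."""
    n_top = min(n_top, z.shape[0])
    order = np.argsort(-z, axis=0, kind="stable")[:n_top]   # sort each unit's column, descending
    return order.T

snippets = ["def f(x): return x", "import numpy as np",
            "She wept quietly.", "The protein folds."]
z = np.array([[0.9, 0.0],      # toy activations: unit 0 likes code,
              [0.7, 0.0],      # unit 1 likes the emotional sentence
              [0.0, 0.8],
              [0.0, 0.1]])
top = top_activating(z, n_top=2)
examples = {unit: [snippets[i] for i in idx] for unit, idx in enumerate(top)}
```

A human (or an LLM labeller) then reads `examples[unit]` and looks for a shared theme, which is the manual step the list describes.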
A typical result (paraphrased from real SAE work):
Hidden unit 7,234: top activations on snippets containing the word “Python” in code contexts. Interpretation: “Python programming language”.
Hidden unit 2,891: top activations on snippets describing emotional dialogue. Interpretation: “emotional context”.
Hidden unit 11,503: top activations on snippets containing scientific writing about proteins. Interpretation: “protein biology context”.
Hidden unit 9,118: top activations on... unclear; the activating snippets share no obvious theme. Interpretation: unidentified or noise.
The mix of interpretable and uninterpreted features is typical. Modern SAEs at scale produce millions of features; some are highly interpretable, some are partially interpretable, some are uninterpreted. The interpretation rate (fraction of features humans can label) ranges from 30% to 70% depending on the SAE quality, the analysis method, and the threshold for what counts as “interpretable”.
Modern SAE variants
The basic SAE (above) has been refined substantially. The 2023–2024 period saw several variants that improved interpretation quality.
Gated SAEs (Rajamanoharan et al., DeepMind 2024). Use a separate gating network to decide which features to activate, and a magnitude network to determine activation strength. Decouples the binary “is this feature present?” from the continuous “how strong is it?” - reduces the shrinkage problem of sparsity penalties (an L1 penalty systematically pulls feature activations towards zero).
Top-K SAEs (Gao et al., OpenAI 2024). Replace the sparsity penalty with a hard top-K constraint - only the K largest pre-activation values are kept, others are zeroed. Avoids shrinkage; sparsity is controlled directly.
JumpReLU SAEs (Rajamanoharan et al., DeepMind 2024). Use a JumpReLU activation (zero below a learned threshold, linear above it) that produces clean sparsity without the shrinkage of standard ReLU + $L_1$.
Matryoshka SAEs (Bussmann et al., 2024). Train SAEs with nested hierarchies: the first features should be the most important, the next less so, etc. Allows variable-resolution interpretation.
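The difference between the variants is easiest to see in the activation function applied to the encoder pre-activations. The sketch below compares plain ReLU, a hard top-K selection, and a JumpReLU with a fixed threshold. It is an illustration only: in the published methods the JumpReLU threshold is learned per feature, and implementation details (such as whether a ReLU follows the top-K step) vary and are omitted here.

```python
import numpy as np

def relu(pre):
    return np.maximum(pre, 0.0)

def top_k(pre, k):
    """Keep the k largest pre-activations in each row; zero everything else."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def jump_relu(pre, theta):
    """Zero below the threshold theta, identity above it (no shrinkage of survivors)."""
    return np.where(pre > theta, pre, 0.0)

pre = np.array([[0.1, 2.0, -1.0, 0.5]])
relu_out = relu(pre)               # keeps the small 0.1 activation
topk_out = top_k(pre, k=2)         # keeps exactly the two largest entries
jump_out = jump_relu(pre, 0.3)     # drops everything at or below 0.3
```

Both top-K and JumpReLU pass surviving activations through unchanged, which is why they avoid the downward bias that an L1 penalty introduces.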
The empirical state in 2026. JumpReLU and Top-K are the dominant SAE flavours; both substantially outperform basic SAEs on interpretability metrics. Production SAE pipelines at major labs (Anthropic, OpenAI, DeepMind) use these or close variants.
Transcoders and end-to-end SAEs
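A different architectural direction. Transcoders (Dunefsky, Chlenski, Nanda, 2024) replace the reconstruction objective with a prediction objective: instead of reconstructing the activations it reads, a transcoder reads the input to an MLP layer and is trained to predict that layer’s output through a sparse bottleneck, so its features describe downstream computation rather than only upstream activations.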
The motivation. Standard SAEs reconstruct the activations they decompose; the features may be interpretable but may not correspond to causally relevant computations. Transcoders directly optimize for features that drive downstream behaviour, producing more causally-meaningful decompositions.
End-to-end SAEs (multiple groups, 2024) extend the idea: train SAE features to be useful for the model’s task (not just for activation reconstruction). The features that emerge are likely to be those the model uses, rather than artefacts of the activation distribution.
The practical state. Transcoders and end-to-end SAEs are an active 2024-2026 research area. Initial results suggest they produce more interpretable and more causally relevant features than standard SAEs. The methodology is still developing.
The interpretation pipeline at scale
Modern SAE work involves substantial interpretation infrastructure beyond the SAE itself. The pipeline:
Train SAE on a target layer’s activations from a large corpus.
For each feature: collect top-activating examples; collect random non-activating examples.
Generate a label using an LLM (typically GPT-4 or Claude) given the activating examples.
Validate the label: ask the LLM to predict (using only the label) whether the feature would activate on held-out examples; compare to actual activations.
Compute interpretation quality as the agreement between predicted and actual activations.
This is automated interpretability - using LLMs to interpret features at scale. Bricken et al. (2023), Templeton et al. (2024), and the OpenAI SAE work all use this kind of pipeline. The result is a labelled dictionary of millions of features, each with an interpretable description and validation score.
The state of practice. Automated interpretability scales to millions of features. The interpretation quality varies - some features are reliably labelled, others are confused. Manual validation of important features is still required; automated interpretation provides first-pass labels.
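The validation step in the pipeline reduces to comparing predicted and actual activations on held-out text. The sketch below uses plain agreement (accuracy) as the score and a keyword matcher as a stand-in for the LLM judge; both `label_agreement` and `keyword_predictor` are hypothetical helpers, and real pipelines use an LLM and often a correlation-based score instead.

```python
import numpy as np

def label_agreement(predicted_active, actual_activation, threshold=0.0):
    """Fraction of held-out examples where the label-based prediction
    matches whether the feature actually fired above the threshold."""
    actual_active = np.asarray(actual_activation) > threshold
    predicted_active = np.asarray(predicted_active, dtype=bool)
    return float(np.mean(predicted_active == actual_active))

def keyword_predictor(label_keyword, texts):
    """Stand-in for the LLM judge: predicts 'active' if the label keyword appears."""
    return [label_keyword.lower() in t.lower() for t in texts]

held_out = ["import python module", "a python script",
            "the python bit its handler", "weather report", "stock prices fell"]
actual = [0.9, 0.6, 0.0, 0.0, 0.0]           # toy feature activations on the held-out texts
predicted = keyword_predictor("python", held_out)
score = label_agreement(predicted, actual)
```

Here the label "python" over-predicts on the snake sentence, so the score is 4 out of 5. A low score of this kind is the signal that a label is too broad and needs manual review.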
Where SAEs sit in 2026
The summary. SAEs are the dominant MI technique for decomposing dense activations into interpretable features. The methodology has matured rapidly (Gated SAEs, JumpReLU, Top-K, Matryoshka, transcoders). Production deployment at scale (Claude SAEs, GPT-4 SAEs) demonstrates that industrial-scale MI is tractable.
The remaining limits. SAE interpretation rates are partial (30-70% interpretable features); some features encode abstract concepts with no clean human label; the relationship between SAE features and the actual computational features the model uses is itself contested (the SAE provides one decomposition; other decompositions may also be valid).
The bridge to the next section. Now we have features (from SAEs and probing) and causal-intervention methodology (from §5). These are the components for circuit analysis: identifying sub-networks of features and components that implement specific behaviours. §7 develops circuit analysis in Transformers using the QK/OV framework and the IOI / induction-head case studies as motivation.
§7. Circuits in Transformers
The technical core of modern MI applied to language models. This section develops the Anthropic Transformer-circuits framework (Elhage et al., 2021) with its QK/OV decomposition and residual-stream picture, then walks through the two foundational circuits - Indirect Object Identification and induction heads - as worked examples. Both circuits illustrate the methodology of §5 (causal intervention) applied to specific behaviours.
The residual stream
The conceptual reframing that made Transformer circuits tractable. Elhage, Nanda, Olsson et al. (Anthropic, 2021) “A Mathematical Framework for Transformer Circuits” gave the field its working vocabulary.
The reframing. Standard Transformer descriptions present the architecture as a sequence of attention + MLP blocks with skip connections. The Anthropic framework reframes this: there is a single residual stream at each token position - a $d$-dimensional vector that is updated additively through the layers. Each component (attention head, MLP) reads from the stream (its inputs are projections of the stream) and writes to it (its outputs are added back into the stream).
THE RESIDUAL STREAM VIEW (Elhage et al. 2021)
Position t, layer l:
residual stream r_t^l (a d-dimensional vector)
│
├─── read by attention heads at layer l
│ each head reads via Q^h, K^h, V^h
│ each head writes via O^h
│ all writes added to r_t^l
│
├─── read by MLP at layer l
│ MLP reads via W_in, writes via W_out
│
▼
residual stream r_t^{l+1} = r_t^l + Σ (head outputs) + (MLP output)
The implications.
Components are independent contributors. Each attention head, each MLP layer, contributes additively to the residual stream. The stream’s value at any layer is the sum of contributions from all earlier components. This makes attribution natural: the contribution of a specific head can be isolated by removing that head’s write and observing the change.
Reading and writing are linear projections. Every component reads via specific projection matrices (Q/K/V for attention, $W_{in}$ for MLP) and writes via others (O for attention, $W_{out}$ for MLP). All transformations into and out of the residual stream are linear; the non-linearity (softmax, ReLU, GELU) lives inside the components.
The residual stream has no privileged basis. The stream is an arbitrary vector space; features can be encoded along any direction. Direct neuron-by-neuron interpretation is misleading; feature-by-feature interpretation (with feature directions identified by SAEs or causal analysis) is the right approach. This is the empirical basis for §6’s superposition discussion.
The framework changed how MI thinks about Transformers. Pre-Anthropic, attention heads were typically discussed in isolation; the Anthropic framing makes their interactions through the shared stream the central object of analysis.
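Because the stream is a sum of writes, the contribution of each component to a given output logit can be read off directly. The sketch below illustrates this with random vectors standing in for component writes at one position. It is a toy under stated assumptions: the final LayerNorm is ignored, `unembed_direction` is a hypothetical unembedding vector for a single token, and only the direct effect of a write is measured, not its effect on later components that read it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_components = 8, 5
embedding = rng.normal(size=d)                   # initial stream value at this position
writes = rng.normal(size=(n_components, d))      # one additive write per head / MLP
unembed_direction = rng.normal(size=d)           # hypothetical direction for one output token

final_stream = embedding + writes.sum(axis=0)    # the stream is the sum of all contributions
total_logit = final_stream @ unembed_direction

# Direct logit attribution: project each write onto the unembedding direction.
per_component = writes @ unembed_direction
embedding_part = embedding @ unembed_direction

# Removing one component's write changes the logit by exactly its attributed amount.
ablated_logit = (final_stream - writes[2]) @ unembed_direction
```

The per-component terms sum to the total logit, which is what makes attribution "natural" in the residual-stream picture.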
The QK and OV decomposition
The core structural insight about attention heads.
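For a standard attention head with weights $W_Q, W_K, W_V \in \mathbb{R}^{d_{head} \times d}$ and $W_O \in \mathbb{R}^{d \times d_{head}}$ acting on residual stream input $x$ (one $d$-dimensional vector $x_i$ per position):
$$A_{ij} = \mathrm{softmax}_j\!\left(\frac{x_i^\top W_Q^\top W_K\, x_j}{\sqrt{d_{head}}}\right), \qquad h_i = \sum_j A_{ij}\, W_O W_V\, x_j$$
The Anthropic framework rewrites this in terms of two effective matrices:
The QK matrix $W_{QK} = W_Q^\top W_K$ (a $d \times d$ matrix). This determines which token positions the head attends to - it controls the attention pattern.
The OV matrix $W_{OV} = W_O W_V$ (also $d \times d$). This determines what the head writes to the residual stream when it attends - it controls the information flow.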
The two matrices are conceptually independent. QK answers “where to look”; OV answers “what to do once we look there.”
The interpretation. Each attention head can be analyzed by its QK and OV matrices:
QK analysis. Decompose the QK matrix via SVD or eigendecomposition; identify the directions that produce strong attention. These directions correspond to features in the residual stream that trigger the head’s attention.
OV analysis. Decompose similarly; identify the directions the head writes. These correspond to features the head introduces into the residual stream.
A worked example. Consider a hypothetical “name copier” head. Its QK matrix should have directions that fire on tokens that look like names; its OV matrix should write the same name token (perhaps at a shifted position). Analyzing the matrices empirically, we can identify whether the head’s behaviour matches this description.
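The claim that a head is fully described by its QK and OV matrices can be verified numerically. The following sketch assumes a column-vector convention (per-head W_Q, W_K, W_V of shape d_head × d and W_O of shape d × d_head), random weights, and no causal mask; all sizes are illustrative. It checks that attention computed the usual way matches attention computed from the two effective d × d matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, d_head = 6, 32, 4
X = rng.normal(size=(n_tokens, d))                       # one residual-stream row per token
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_head, d)) for _ in range(3))
W_O = rng.normal(scale=0.1, size=(d, d_head))

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)                # subtract row max for stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

# Standard computation: queries, keys, values, then output projection.
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
A = softmax(Q @ K.T / np.sqrt(d_head))
out_standard = (A @ V) @ W_O.T

# Effective-matrix computation: "where to look" and "what to write".
W_QK = W_Q.T @ W_K                                       # d x d, rank at most d_head
W_OV = W_O @ W_V                                         # d x d, rank at most d_head
A_eff = softmax(X @ W_QK @ X.T / np.sqrt(d_head))
out_effective = A_eff @ X @ W_OV.T

rank_qk = np.linalg.matrix_rank(W_QK)
ov_singular_values = np.linalg.svd(W_OV, compute_uv=False)   # starting point for OV analysis
```

Both matrices are d × d but low rank (at most d_head), which is why an SVD of each is a compact summary of the directions a head reads and writes.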
The Indirect Object Identification circuit
The first full circuit analysis. Wang, Variengien, Conmy, Shlegeris, Steinhardt (2022) “Interpretability in the Wild.” The IOI behaviour: complete prompts like “When Mary and John went to the store, John gave a drink to ___” with the correct indirect object (Mary).
The methodology applied. The authors used:
Activation patching at every layer × position to identify where the IOI computation happens.
Path patching to trace information flow between components.
Direct logit attribution to identify which components write the answer to the residual stream.
The discovered circuit. 26 attention heads, which the paper groups into 7 classes; the simplified diagram below merges these into five stages:
THE IOI CIRCUIT (Wang et al. 2022, simplified)
Input: "When [S1] and [IO] went to the store, [S2] gave a drink to ___"
where S1 = S2 = subject (e.g., "John"), IO = indirect object (e.g., "Mary")
Layer 0-3: DUPLICATE TOKEN HEADS
Detect that S1 and S2 are the same token.
Mark this fact in the residual stream at S2's position.
Layer 2-6: PREVIOUS TOKEN HEADS, INDUCTION HEADS
Provide additional positional information.
Layer 7-8: S-INHIBITION HEADS
Read the duplicate-token signal.
Write a "subject suppression" signal that prevents the
subject token from being copied as the answer.
Layer 9-10: NAME MOVER HEADS
Read the residual stream at the "to" position.
The S-inhibition signal makes them attend to IO (not S2).
They copy IO's representation to the residual stream as the answer.
Layer 10-11: BACKUP NAME MOVERS
Provide redundancy if the primary name movers are ablated.
The model is robust: removing primary name movers does not
fully break IOI because backups take over.
Output: IO token (correct answer)
The validation. Each component’s role is confirmed by:
Attention patterns (the duplicate-token heads attend to the duplicate; the name movers attend to IO).
Activation patching (patching out a component degrades IOI by an amount consistent with its role).
Path patching (the information flow is traced through specific paths).
The result is a mechanistic explanation of IOI: the circuit identifies the duplicate, suppresses it, and copies the non-duplicate. Each step is implemented by specific attention heads with specific QK and OV behaviours.
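The patching logic used in this kind of validation can be shown on a toy network. The sketch below is not GPT-2 and not the authors' code: it uses a random two-layer network, treats the difference between two output logits as a stand-in for the IOI metric (logit of IO minus logit of S), and patches clean hidden activations into a corrupted run one unit at a time. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(2, 6))

def run(x, patch=None):
    """Forward pass; `patch` optionally maps hidden-unit index -> overriding value."""
    h = np.maximum(W1 @ x, 0.0)
    if patch:
        for i, v in patch.items():
            h[i] = v
    return h, W2 @ h

def logit_diff(logits):
    return logits[0] - logits[1]             # stand-in for logit(IO) - logit(S)

x_clean, x_corrupt = rng.normal(size=4), rng.normal(size=4)
h_clean, logits_clean = run(x_clean)
_, logits_corrupt = run(x_corrupt)
base, target = logit_diff(logits_corrupt), logit_diff(logits_clean)

# Patch one clean hidden unit at a time into the corrupted run.
recovery = []
for i in range(6):
    _, logits_patched = run(x_corrupt, patch={i: h_clean[i]})
    recovery.append((logit_diff(logits_patched) - base) / (target - base))

# Patching every unit at once must fully restore the clean metric.
_, logits_full = run(x_corrupt, patch=dict(enumerate(h_clean)))
full_recovery = (logit_diff(logits_full) - base) / (target - base)
```

Units with a large normalised recovery are candidates for the circuit. In this toy the readout is linear, so the per-unit recoveries add up to one; in a real Transformer they generally do not, which is one reason path patching is needed.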
Why IOI matters methodologically
The IOI paper established several methodological norms that shaped subsequent MI work:
Behavioural specification. Define the behaviour precisely (a templated dataset; a clear correctness metric).
Component enumeration. Identify candidate components via patching maps.
Role attribution. For each candidate, propose a specific role and validate.
End-to-end story. Synthesize the components into a complete information-flow account from input to output.
Robustness to ablation. Document the circuit’s redundancy structure.
The template has been replicated for many other behaviours. Most modern Transformer circuit papers follow this structure; the IOI paper is the methodological prototype.
Induction heads
The second foundational circuit. Olsson, Elhage, Nanda et al. (Anthropic, 2022) “In-context Learning and Induction Heads.”
The behaviour. Suppose a sequence contains “[A][B] ... [A]”. An induction head attends from the second [A] back to the [B] that previously followed [A] in the context, and copies [B] to the current position. The result: the model predicts [B] as the next token, performing a basic in-context-learning pattern.
Olsson et al. argue that this is a core mechanism of in-context learning (LLM §7). When you give an LLM examples in the prompt (“Q: What’s 2+2? A: 4. Q: What’s 3+5? A: 8. Q: What’s 4+6? A:”), the model is thought to use induction-head-like mechanisms to attend back to similar contexts and copy the corresponding answer pattern.
The circuit structure (simplified):
INDUCTION HEAD CIRCUIT (Olsson et al. 2022)
Setup: A two-attention-head "induction circuit"
HEAD 1 (Previous Token Head, layer l):
At each position, attends to the IMMEDIATELY PREVIOUS token.
Writes the previous token's identity into the residual stream
at the current position.
HEAD 2 (Induction Head, layer l+1):
At each position [A] (the current "query" token), looks for
earlier positions in the context where the residual stream
contains [A] as the previous-token signal.
Such positions are the ones immediately AFTER earlier occurrences
of [A] (their residual stream now contains [A]'s identity as
previous-token info).
The head copies the TOKEN AT THAT POSITION (which is [B], the
thing that came after [A] previously) to the current position.
Net effect: When [A] appears, the model predicts [B] (the thing
that previously followed [A]).
The mechanism is two-step: head 1 propagates previous-token info; head 2 uses that info to find matches and copy. The composition produces the [A][B]...[A] → [B] behaviour.
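The two-step composition can be written as a token-level rule. This is a symbolic sketch of what the circuit computes, not an implementation with attention weights; the function names are made up for illustration.

```python
def previous_token_signal(tokens):
    """Step 1 (previous-token head): each position records the token before it."""
    return [None] + list(tokens[:-1])

def induction_prediction(tokens):
    """Step 2 (induction head): from the last token, find the most recent position
    whose previous-token signal matches it, and copy the token at that position."""
    prev = previous_token_signal(tokens)
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev[pos] == current:
            return tokens[pos]
    return None                                  # no earlier occurrence: nothing to copy

prediction = induction_prediction(["A", "B", "C", "A"])
```

The position that matches is the one just after the earlier [A], so the copied token is [B]. When the current token has not appeared before, the rule has nothing to copy, mirroring the fact that an induction head only helps on repeated patterns.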
The empirical findings.
Phase transition. Induction heads emerge suddenly during training, around a specific point in the training trajectory. Before this point, the model has weak in-context-learning ability; after, it has strong ability. The transition corresponds to the formation of specific induction-head structures.
Universality. Induction heads emerge in essentially every Transformer trained on language modelling at sufficient scale. Different model families, different sizes, different training data - induction heads are robustly present. They are an emergent universal mechanism.
Causal centrality. Ablating induction heads substantially degrades in-context learning. The heads are necessary for the capability; they are not just correlationally present.
The implication. The Olsson et al. paper provided the first large-scale mechanistic explanation of an emergent LLM capability. In-context learning - the marquee capability of modern LLMs - was traced to specific architectural mechanisms with causal validation. This was the proof-of-concept that MI could explain consequential model behaviours.
Recent circuit extensions
Subsequent work has extended circuit analysis substantially.
Function vectors. Todd et al. (2023) identified function vectors - directions in the residual stream that encode the task the model is performing in-context; Hendel, Geva, Globerson (2023) describe closely related task vectors. By extracting and reusing these vectors, the model’s behaviour can be steered to perform specific in-context tasks.
Circuit motifs across models. Subsequent work has cataloged recurring circuit patterns across different models - successor heads (heads that predict “next” in a sequence), negation heads, summarization heads. Some motifs appear in many models; others are model-specific.
Hierarchical circuits. Modern frontier models contain multi-level circuits: low-level circuits implement basic patterns (induction heads); mid-level circuits compose these (function induction); high-level circuits implement complex behaviours (multi-step reasoning, factual recall). Tracing these hierarchies is an active research area.
Multi-token circuits. Most original circuit work focused on single-token completions. Recent extensions analyze circuits that span multiple tokens - chains of reasoning, longer planning behaviours.
The state of circuit analysis in 2026
The summary. Circuit analysis is the mature methodology of modern MI. The IOI and induction-head circuits are the foundational examples; many other behaviours have been analyzed using the same template. Automated tools (ACDC, EAP, attribution patching at scale) make circuit identification semi-automatic.
The remaining limits. Circuit analysis is labour-intensive - even with automation, specifying behaviours, designing datasets, and validating roles requires substantial human effort. Most analyzed circuits are for small models (GPT-2 size); circuits in frontier models are larger, more complex, and harder to analyze. The gap between research demonstrations and routine production analysis is real.
The 2026 frontier. Combining circuit analysis with SAE features (§6, §9) to express circuits in terms of interpretable features rather than raw attention heads. The combination produces more semantically-meaningful circuit descriptions; it is the dominant methodology for scaling circuit analysis to frontier models.
§8. Findings: What MI Has Discovered
A summary of substantive MI findings: not the methods (developed in §3-§7) but the content - what we have learned about how trained networks compute. This section is organized by the kind of behaviour analyzed.
Factual recall: how LLMs store and retrieve facts
A central finding. Modern LLMs store factual information in their MLP weights and retrieve it through specific computational patterns. Meng, Bau, Andonian, Belinkov (2022) “Locating and Editing Factual Associations in GPT” (the ROME paper) was the foundational work.
The methodology applied. Meng et al. used a form of activation patching (which they call causal tracing) to find where a specific fact (e.g., “The Eiffel Tower is in Paris”) is stored in GPT-2. They corrupted the subject tokens, then restored clean activations one layer and position at a time, identifying the sites where restoration recovered the correct prediction.
The finding. For most factual associations, the causal site is in mid-network MLP layers at the subject token position. Specifically: when GPT processes “The Eiffel Tower is in”, the subject “Eiffel Tower” is processed by mid-network MLPs that retrieve the associated fact (“Paris”) from memory and write it to the residual stream.
The mechanism (simplified):
FACTUAL RECALL CIRCUIT (Meng et al. 2022, simplified)
Token sequence: "The Eiffel Tower is in" → ?
Layer 1-5:
Form representation of "Eiffel Tower" in residual stream
at the Tower-token position.
Layer 6-10 (mid-network MLPs at subject position):
The MLP reads "Eiffel Tower" representation as KEY.
The MLP retrieves "Paris" as VALUE (the associated fact).
Writes "Paris" representation to the residual stream.
Layer 11+:
Attention heads at later positions attend back to the
subject position; copy the "Paris" representation forward
to the prediction position.
Output: "Paris"
The KEY-VALUE interpretation. The mid-network MLPs effectively act as key-value memories: the input neurons (keys) are activated by the subject’s representation; the output neurons (values) write the associated fact. This is consistent with theoretical work (Geva et al., 2021 “Transformer Feed-Forward Layers Are Key-Value Memories”) that characterized MLPs as memory structures.
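The key-value reading of an MLP can be made concrete with a two-entry toy memory. In this sketch the rows of `keys` play the role of the MLP input weights and the rows of `values` the output weights; the vectors and the "Paris" / "Rome" labels are invented for illustration and do not come from a real model.

```python
import numpy as np

keys = np.array([[1.0, 0.0, 0.0],      # key pattern standing in for "Eiffel Tower"
                 [0.0, 1.0, 0.0]])     # key pattern standing in for "Colosseum"
values = np.array([[1.0, 0.0],         # value vector standing in for "Paris"
                   [0.0, 1.0]])        # value vector standing in for "Rome"

def mlp_memory(x):
    coeffs = np.maximum(keys @ x, 0.0)     # how strongly each key matches the input (ReLU)
    return coeffs @ values                 # weighted sum of the matching values

eiffel_query = np.array([1.0, 0.0, 0.0])
retrieved = mlp_memory(eiffel_query)

unrelated_query = np.array([0.0, 0.0, 1.0])
nothing = mlp_memory(unrelated_query)
```

An input that matches a key retrieves the corresponding value, while an input that matches no key writes nothing to the stream.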
ROME and MEMIT: editing facts via intervention
The natural follow-up. If we know where facts are stored, can we edit them?
ROME (Meng et al., 2022) demonstrated targeted fact editing. By identifying the specific MLP layer where the fact is stored and modifying its output for the relevant key, they could change the model’s response. Edit “The Eiffel Tower is in Paris” to “The Eiffel Tower is in Rome”, and the model now (correctly, given the edit) responds “Rome”.
The crucial property: edits are localized. Editing the Eiffel Tower fact does not affect other facts. The mechanism’s localization permits targeted intervention.
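The idea of a localized edit can be illustrated with a plain rank-one update to a linear map. This is a deliberately simplified sketch, not the published ROME update: it treats the layer as a matrix W mapping a key k* to a value, and it omits the key-covariance weighting that the published method uses to protect other stored keys. The names `k_star` and `v_star` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 6, 4
W = rng.normal(size=(d_val, d_key))        # stand-in for one MLP output matrix
k_star = rng.normal(size=d_key)            # key for the fact being edited
v_star = rng.normal(size=d_val)            # desired new value for that key

def rank_one_edit(W, k_star, v_star):
    """Smallest (Frobenius-norm) change to W such that W_new @ k_star == v_star."""
    residual = v_star - W @ k_star
    return W + np.outer(residual, k_star) / (k_star @ k_star)

W_edited = rank_one_edit(W, k_star, v_star)

# A key orthogonal to k_star is untouched by the edit.
k_other = rng.normal(size=d_key)
k_other -= (k_other @ k_star) / (k_star @ k_star) * k_star
```

The edited key now maps to the new value, and keys orthogonal to it are unaffected. Keys that overlap with k* are perturbed, which is one way to see why edits in a real model are not perfectly isolated.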
MEMIT (Meng et al., 2023) “Mass-Editing Memory in a Transformer” scaled the approach to thousands of edits simultaneously. This made fact-editing practical at scale, with applications in:
Model unlearning. Remove specific facts from a model (for privacy or copyright reasons).
Fact correction. Update outdated facts (the model trained in 2023 still says “Joe Biden is president”; edit to “Donald Trump”).
Targeted modification. Adjust specific behaviours without retraining.
The honest qualifications. Modern frontier models (Claude, GPT-4) have more complex factual-recall mechanisms than GPT-2. ROME/MEMIT-style edits work less reliably; multi-step reasoning is harder to edit; some facts appear distributed across many components rather than localized. The methodology remains useful for research and for some applications, but the simple “edit one MLP weight” picture does not generalize cleanly to frontier scale.
Refusal directions: how safety training works
A safety-relevant MI finding. Modern LLMs trained with RLHF refuse certain requests (harmful instructions, dangerous information). The mechanism: there is a refusal direction in the residual stream - a specific direction that, when active, causes the model to produce refusal output (“I can’t help with that”).
Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee, Nanda (2024) “Refusal in Language Models Is Mediated by a Single Direction” identified the refusal direction empirically. The key results:
For each request type (harmful instructions), the refusal direction is consistent - the same direction works across many prompts.
The direction can be extracted as a difference of means: the mean residual-stream activation on harmful (refused) prompts minus the mean on harmless (accepted) prompts, taken at a chosen layer and token position.
Removing the direction (by projection) substantially reduces refusal behaviour - the model produces non-refused outputs even for harmful requests.
Adding the direction increases refusal - the model refuses even benign requests.
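The extraction, ablation, and addition operations listed above are a few lines of linear algebra. The sketch below runs them on synthetic activations in which "harmful" inputs are shifted along one known axis; the data, the shift size, and the steering coefficient are all invented for illustration, and no real model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0                                        # planted direction for the toy data
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir     # shifted along the planted direction

# Extraction: difference of means, normalised to unit length.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
r_hat = direction / np.linalg.norm(direction)

def ablate(x, r_hat):
    """Project out the direction: remove each row's component along r_hat."""
    return x - np.outer(x @ r_hat, r_hat)

def add_direction(x, r_hat, alpha):
    """Steer towards the direction by adding alpha * r_hat to every row."""
    return x + alpha * r_hat

harmful_ablated = ablate(harmful, r_hat)
harmless_steered = add_direction(harmless, r_hat, alpha=4.0)
shift = (harmless_steered @ r_hat).mean() - (harmless @ r_hat).mean()
```

After ablation the activations have no component along the extracted direction; after addition the harmless activations move along it by exactly the chosen coefficient.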
The implication. RLHF-trained refusal is not a complex multi-component mechanism; it is approximately one feature that the safety training pushed the model to use. This has substantial implications:
For safety. Refusal can be bypassed by feature-level intervention. An adversary with white-box access to the model can extract and ablate the refusal direction, removing safety constraints. This is one mechanism behind some “jailbreaks.”
For alignment understanding. The simplicity of the refusal mechanism is informative - RLHF apparently teaches the model a relatively simple feature that suffices for the demonstrated behaviour, rather than developing deep understanding of what makes requests harmful.
For monitoring. The refusal feature’s activation can be monitored - if it is active, the model is in a “refusal context”; if not, normal output.
The honest qualifications. The “single direction” finding is for specific models and request types; more diverse refusal patterns may have multi-feature representations. Subsequent work (2024–2026) has extended the analysis to other safety-related features (deception detection, manipulation detection); some are similarly feature-localized, others appear more distributed.
Multi-step reasoning circuits
Recent work on how models perform multi-step reasoning. A key finding: many reasoning behaviours decompose into sequential circuit operations.
Successor circuits. Models can perform “next item” predictions (“January, February, ___” → “March”) via specific successor heads - attention heads that systematically map representations to their successors. Identified across multiple models (Gould et al., 2023).
Composition circuits. When the model needs to compose multiple facts (“The president of France’s wife is named ___”), specific attention heads carry the intermediate result (the president of France) forward, where it is used by later attention heads as the subject of the second lookup.
Chain-of-thought circuits. When models are explicitly trained on chain-of-thought reasoning (LLM §12), the resulting circuits handle multi-step logical operations through specific attention patterns - earlier reasoning steps are attended to by later steps; intermediate conclusions are encoded as features in the residual stream and read by downstream computation.
The general pattern. Multi-step reasoning behaviours decompose into sequences of single-step circuits (each operation handled by a specific component) connected by information flow through the residual stream. The decomposition is not always clean - some reasoning behaviours appear to involve more distributed computation - but the modular picture is the working hypothesis.
Truthfulness and deception representations
A particularly safety-relevant MI direction. Can MI detect when a model is being deceptive - saying one thing while internally representing another?
Burns et al. (2022) “Discovering Latent Knowledge in Language Models Without Supervision” (“DLK” or “CCS”) demonstrated that models often have internal representations of truth even when they output falsehoods. The methodology: train an unsupervised probe that finds consistent truth-or-falsity representations across many statements. The probe can sometimes identify the model’s internal “judgment” of a claim’s truth even when the model’s output is a contrary assertion.
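The unsupervised objective can be sketched as follows, assuming the form described in the paper: each statement is presented in an affirmed and a negated version, a probe assigns each a probability of being true, and the loss rewards probabilities that are consistent (they sum to one) and confident (not both near 0.5). The function name and the toy inputs are illustrative; the probe itself and the activation normalisation are omitted.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """p_pos: probe output on 'X is true' versions; p_neg: on 'X is false' versions."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    consistency = (p_pos - (1.0 - p_neg)) ** 2       # the two probabilities should sum to 1
    confidence = np.minimum(p_pos, p_neg) ** 2       # penalise the degenerate 0.5 / 0.5 answer
    return float(np.mean(consistency + confidence))

confident_consistent = ccs_loss([1.0], [0.0])        # ideal probe output
hedging = ccs_loss([0.5], [0.5])                     # consistent but uninformative
contradictory = ccs_loss([1.0], [1.0])               # claims both versions are true
```

No truth labels appear anywhere in the loss, which is what makes the probe unsupervised; the direction it finds may correspond to truth or to its negation, so the sign has to be resolved separately.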
Azaria and Mitchell (2023) “The Internal State of an LLM Knows When It’s Lying” extended this finding using classifiers trained on hidden activations. Various studies have shown that, in specific settings, model internals contain features corresponding to “I am being deceptive now” or “I know my output is false” - even when these are not externalized.
The implications.
For safety. Internal-state probes could potentially detect deceptive behaviour before it produces harmful outputs. This is the basis for several proposed deception detection methods.
For alignment. The fact that models have internal representations distinct from their outputs is important - it suggests the model has some form of “judgment” that can be probed even when output behaviour is misleading.
Honest caveats. The deception-detection results are partial. Internal states do not always cleanly indicate deception; the probes can produce false positives and false negatives; adversarially-trained models may suppress detectable internal states. The research direction is active but not yet a deployed safety tool.
What MI has not discovered
The honest accounting of limitations. Despite the substantive findings above, MI has not yet:
Provided complete mechanistic accounts of frontier-model capabilities. Frontier-scale mechanism remains substantially unknown.
Solved the interpretability scaling problem - many findings come from small models (GPT-2 size) and may not generalize.
Reliably detected complex deception or manipulation. Internal-state probes work in restricted settings, not in adversarial general settings.
Produced predictive understanding - given a new behaviour or model, MI cannot yet say in advance what mechanism implements it.
Reduced AI safety risks substantively in deployed systems. MI tools inform alignment research; they do not yet provide operational safety guarantees.
The trajectory. MI capability is growing rapidly. The 2024–2026 period saw substantial scaling (SAEs to frontier models), substantial findings (refusal directions, deception representations), and substantial methodology refinement. The gap between current MI and the level needed for operational safety guarantees is real but is being actively reduced.
Where MI findings sit in 2026
The summary. MI has produced substantive scientific findings about how trained Transformers compute: facts are largely stored in MLP key-value structures; refusal is mediated by approximately a single direction; in-context learning relies heavily on induction heads; multi-step reasoning often decomposes into sequential circuits; deception leaves internal traces that probes can detect in restricted settings.
The findings are real and consequential. They have changed how the field thinks about LLM internals. They are also partial - most findings come from small or mid-sized models, with frontier-model mechanism still substantially unknown.
The next sections develop the scaling programme (§9, SAEs at frontier models) and the safety applications (§10, MI for alignment) that build on these findings.
§9. Sparse Autoencoders at Scale
§6 introduced sparse autoencoders as the response to superposition. This section develops the scaling of SAEs from research-scale demonstrations to industrial deployment on frontier models - the most consequential MI development of 2023–2026.
The 2024 SAE explosion
A specific moment. Through most of 2023, SAEs were a research-stage technique applied to small models (GPT-2 scale). Late 2023 and 2024 saw an explosion: SAEs were scaled to frontier models, the engineering matured rapidly, and substantial methodology converged.
The triggering papers.
Bricken, Templeton et al. (Anthropic, October 2023) “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” A long-form Distill-style paper demonstrating SAEs on a 1-layer Transformer with a 4096-feature SAE. The paper validated the approach: features were highly interpretable; their causal effects on model behaviour matched the proposed interpretations; the methodology could be applied systematically.
Templeton, Conerly, Marcus et al. (Anthropic, May 2024) “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” The breakthrough scaling paper. SAEs trained on Claude 3 Sonnet’s middle layer - a frontier-scale model - extracted millions of features. Examples included features for the Golden Gate Bridge, code patterns, deceptive context, internal conflict, sycophancy. The features were shown to be causally relevant via steering experiments.
Cunningham, Ewart, Riggs, Huben, Sharkey (2023) “Sparse Autoencoders Find Highly Interpretable Features in Language Models.” Independent academic work showing similar results on smaller open-source models.
Gao, la Tour, Tillman et al. (OpenAI, June 2024) “Scaling and Evaluating Sparse Autoencoders.” OpenAI’s parallel SAE work on GPT-4. Introduced Top-K SAEs (§6) and substantial scaling-law-style analyses of SAE training.
Lieberum et al. (DeepMind, July 2024) “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.” DeepMind’s open-weight SAE release for Gemma 2 - making frontier-scale SAEs available for academic research.
By mid-2024, all three major frontier-AI labs had released substantial SAE work. SAEs went from research demonstrations to standard production tools in roughly twelve months.
Scaling SAEs: the engineering
The scaling challenge. SAEs at frontier scale require:
Massive activation data. Training an SAE on a frontier model requires billions of activation samples - collected by running the base model on a large text corpus.
Substantial compute. Training a high-quality SAE for a single layer of a frontier model requires hundreds of GPU-hours.
Many SAEs per model. Each layer (and within each layer, multiple sub-locations) requires its own SAE.
High feature counts. To capture rare features, SAEs need many hidden units - typically a large multiple of the residual stream dimension. For a frontier model, this means SAE feature counts in the hundreds of thousands to millions.
The engineering response.
Distributed training. SAE training is parallelized across many GPUs.
Checkpointing and incremental training. SAEs are trained on streaming activations from the base model; the activation generation and SAE training are interleaved.
Activation compression. Storing the activation dataset is expensive; modern pipelines use streaming, compressed representations, or on-the-fly generation.
Distributed inference for analysis. Running an SAE on millions of inputs to identify each feature’s activating examples requires substantial inference compute.
The result is industrial-scale MI infrastructure. The frontier labs maintain dedicated MI compute clusters, dedicated engineering teams, and dedicated tooling. MI is no longer a side project; it is a substantial sub-organization at multiple major labs.
Feature interpretation pipelines
Producing labelled, validated features at scale requires substantial automation.
The standard pipeline (paraphrasing the Anthropic and OpenAI workflows):
FEATURE INTERPRETATION PIPELINE (2024 standard)
1. SAE TRAINING
   Train SAE on a frontier-model layer's activations.
   Result: dictionary of k features (typically 10^5 to 10^7).
2. ACTIVATION COLLECTION
   Run the base model on a large diverse corpus.
   For each feature, record:
   - Top-N activating examples (N typically 100-1000).
   - Average activation distribution.
   - Distribution of contexts in which it activates.
3. AUTO-LABELING
   For each feature, prompt an LLM (Claude, GPT-4) with the
   top-activating examples. Ask: "What concept does this
   feature represent?"
   Result: a candidate human-readable label.
4. AUTO-VALIDATION
   Use the label to predict whether the feature should activate
   on held-out examples. Compare predictions to actual SAE
   activations. Compute the "explanation score" - the agreement
   between label-predicted and actual activations.
5. SAFETY-RELEVANT FEATURE EXTRACTION
   Search the dictionary for features matching safety-relevant
   categories (deception, dangerous knowledge, refusal).
   Validate via steering experiments.
6. INTERACTIVE EXPLORATION
   Build dashboards (Neuronpedia, Anthropic's interpretability
   tools) for human analysts to explore features.
The auto-labeling step is itself a substantial AI use case. Modern LLMs are competent at the task - given 50 activating examples and asked “what concept do these share?”, they produce coherent labels for most features. The interpretation rate (fraction of features with high explanation scores) typically ranges from 30% to 70% depending on the SAE quality, the analysis pipeline, and the layer being analyzed.
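The auto-validation step (step 4) reduces to a simple computation once the label-predicted activations are available. The following is a minimal sketch, assuming agreement is measured by Pearson correlation, which is one common choice rather than the only one; the function names and the 0.5 threshold are hypothetical.

```python
import math


def explanation_score(predicted, actual):
    """Pearson correlation between label-predicted and actual activations."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 items")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0.0 or var_a == 0.0:
        return 0.0  # a constant prediction carries no information about the feature
    return cov / math.sqrt(var_p * var_a)


def interpretation_rate(scores, threshold=0.5):
    """Fraction of features whose explanation score clears the threshold."""
    if not scores:
        raise ValueError("no scores given")
    return sum(s >= threshold for s in scores) / len(scores)
```

The interpretation rate quoted above depends directly on where the threshold is set, which is one reason reported rates are hard to compare across pipelines.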
Feature taxonomies
A finding from large-scale SAE work. The features extracted from frontier models cluster into recognizable categories.
Concrete entities. Specific people, places, events, products. (“Golden Gate Bridge”, “Albert Einstein”, “World War II”.)
Abstract concepts. “Deception”, “loneliness”, “code in Python”, “scientific writing”, “internal conflict”.
Linguistic features. Specific syntactic patterns, parts of speech, language identification, formal vs informal register.
Functional features. “About to refuse”, “summarizing”, “performing arithmetic”, “code completion context”.
Multimodal features. In multimodal models, features bridging image and text content (e.g., “this image depicts a cat” + “the word cat appears nearby”).
Safety-relevant features. “User is being manipulative”, “request involves dangerous knowledge”, “model is being sycophantic”, “discussing self-harm”.
The diversity of features confirms that SAEs find genuinely different kinds of computational quantities - not just lexical or syntactic but conceptual and behavioural. The taxonomy is informative for downstream applications (steering, monitoring, safety analysis).
Steering: causal validation at scale
A crucial validation technique. Once a feature is identified, its causal role can be tested by steering - adding the feature’s decoder direction to the residual stream during the forward pass and observing the behavioural change.
The mechanism:
STEERING VECTORS
For a target SAE feature f with decoder direction d_f:
    At chosen layer + position(s):
        residual_stream += alpha * d_f
        (where alpha controls steering strength)
    Observe behavioural change:
        - Does the model produce output consistent with the feature
          being more strongly active?
        - Does it produce coherent outputs (or degenerate)?
A canonical example. The “Golden Gate Bridge” feature in Claude 3 Sonnet (Anthropic, 2024). When steered strongly:
The model brings up the Golden Gate Bridge in essentially every response, regardless of prompt.
Asked “What is your name?”, the model might respond “I am the Golden Gate Bridge” or describe itself as the bridge.
The steering is coherent - outputs remain grammatically correct and contextually responsive - but the bridge is forced into the topic.
The “Golden Gate Claude” demonstration (publicly released by Anthropic, May 2024) was a striking illustration of feature-level intervention. Users could chat with a Claude variant where the Golden Gate Bridge feature was permanently enabled; the model’s outputs became progressively bridge-themed regardless of input.
The validation. Successful steering provides causal validation that the feature is doing what its label suggests. The combination of (top-activating examples → label → successful steering at that label) is the gold standard for SAE feature interpretation.
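The steering update itself is a single vector addition. The following is a minimal sketch in plain Python, with lists standing in for activation tensors; `steer` and `steer_positions` are hypothetical helper names. In a real model the same addition would run inside a forward hook at the chosen layer.

```python
def steer(residual, direction, alpha):
    """Return residual + alpha * direction for one token position."""
    if len(residual) != len(direction):
        raise ValueError("dimension mismatch")
    return [r + alpha * d for r, d in zip(residual, direction)]


def steer_positions(residual_stream, direction, alpha, positions=None):
    """Steer selected positions of a [seq_len][d_model] nested list.

    positions=None steers every position; otherwise only the listed indices.
    The input is left unmodified, so the intervention is trivially reversible.
    """
    chosen = set(range(len(residual_stream)) if positions is None else positions)
    return [
        steer(vec, direction, alpha) if i in chosen else list(vec)
        for i, vec in enumerate(residual_stream)
    ]
```

In practice alpha is swept over a range: too small and behaviour does not change, too large and outputs degenerate, which corresponds to the two questions in the box above.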
The “interpretation at scale” problem
A practical issue. Modern SAEs produce millions of features per model. Even with automated labeling, manually validating every feature is impossible.
The pragmatic responses.
Targeted analysis. Rather than analyzing every feature, focus on safety-relevant features (refusal, deception, dangerous knowledge) and on features active for specific behaviours of interest.
Sample-based validation. Validate a random sample of features manually; use the validation rate to estimate overall reliability.
User-driven exploration. Provide tools (Neuronpedia, Anthropic dashboards) where users can explore features for their specific use cases. Each user validates the features relevant to their analysis.
Automated cross-checking. Compare automated labels across SAEs trained with different hyperparameters, different layers, different model variants. Features with consistent labels across many setups are more likely to be real.
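Sample-based validation is a standard estimation problem: draw features at random, validate each by hand, and report the validation rate with an interval. The following is a minimal sketch, assuming a Wilson score interval at roughly 95% confidence (z = 1.96); the helper names are hypothetical and the manual validation itself happens outside the code.

```python
import math
import random


def sample_features(num_features, sample_size, seed=0):
    """Draw feature indices uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(num_features), sample_size)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (successes out of n)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, if 60 of 100 sampled features pass manual validation, `wilson_interval(60, 100)` gives the range within which the dictionary-wide rate plausibly lies. The estimate is only valid for the population that was sampled; a sample drawn from high-activation features says nothing about the long tail.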
The honest accounting. The “interpretation at scale” problem is unsolved. Modern SAE analysis produces a mixture of highly interpretable features (clear labels, successful steering, consistent across analyses), partially interpretable features (labels that capture some but not all activations), and uninterpreted features (no consistent pattern). The fractions vary; the trend is improving but the issue persists.
Interpretability benchmarks
A maturing methodology. Several benchmarks now evaluate SAE quality systematically.
Reconstruction loss. How well does the SAE reconstruct the original activations? Low reconstruction loss is necessary but not sufficient.
Sparsity. Average number of active features per input. Sparser SAEs tend to have more interpretable individual features, at some cost in reconstruction quality.
Explanation score. Fraction of features with high auto-validation scores. Higher is better.
Causal effect on downstream loss. If we replace activations with their SAE-reconstruction, how much does the model’s downstream loss increase? Smaller is better - indicates the SAE captures the relevant information.
Steering effectiveness. For a target behavioural change, how reliably does steering the relevant feature produce it?
These benchmarks let SAE methods (Gated, Top-K, JumpReLU, etc.) be compared systematically. Modern releases include benchmark scores; the field is moving toward standardized evaluation.
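Three of these metrics are simple enough to state in code. The following is a minimal sketch in plain Python, with nested lists standing in for activation matrices. `loss_recovered` normalises the downstream-loss effect against an ablation baseline; this is a common convention that the text above does not itself specify, and all function names are hypothetical.

```python
def mean_l0(feature_acts):
    """Sparsity: average number of non-zero features per input ([n][k] list)."""
    return sum(sum(1 for a in row if a != 0.0) for row in feature_acts) / len(feature_acts)


def reconstruction_mse(originals, reconstructions):
    """Mean squared error between activations and their SAE reconstructions."""
    total, count = 0.0, 0
    for x, x_hat in zip(originals, reconstructions):
        for a, b in zip(x, x_hat):
            total += (a - b) ** 2
            count += 1
    return total / count


def loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Fraction of the ablation-induced loss gap closed by the SAE.

    clean_loss:   model loss with original activations.
    spliced_loss: loss with activations replaced by SAE reconstructions.
    ablated_loss: loss with the activations ablated (e.g. zeroed).
    1.0 means the reconstruction is as good as the original activations.
    """
    gap = ablated_loss - clean_loss
    if gap == 0.0:
        raise ValueError("ablation baseline equals clean loss")
    return (ablated_loss - spliced_loss) / gap
```

Reporting these together matters: an SAE can score well on reconstruction error while still degrading downstream loss, which is why low reconstruction loss is described above as necessary but not sufficient.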
Where SAEs at scale sit in 2026
The summary. SAEs at scale are the dominant MI methodology for frontier-model analysis. Major labs (Anthropic, OpenAI, DeepMind) maintain SAE pipelines on their production models. The infrastructure (training, interpretation, validation) is mature; the methodology (Gated, Top-K, JumpReLU; auto-labeling; steering validation) is converging on standards.
The remaining limits. Interpretation rates are still partial. Frontier-model SAE analysis costs substantial compute. Many findings are demonstration-grade rather than systematic-understanding-grade - we know SAEs find features for deception, but we do not have complete maps of all deception-relevant computation in the model.
The trajectory. The 2024–2026 period saw SAEs go from research-stage to production. The 2026–2028 period is likely to see:
Better methodology (transcoders, end-to-end SAEs becoming standard).
More automation of the interpretation pipeline.
Production safety applications (using SAE features for monitoring, steering, dangerous-capability detection).
More open-weight SAE releases enabling academic research at scale.
SAEs are likely to remain a central MI technique for the foreseeable future, even as new techniques are developed.
§10. MI for Safety and Alignment
The applied frontier of MI. As frontier AI capabilities have grown, MI has been increasingly framed as an alignment tool - a way to make AI systems more trustworthy by understanding and intervening on their internal computation. This section develops the safety-relevant MI applications, their current state, and their honest limitations.
Why MI matters for safety
Three structural arguments.
Behavioural testing has limits. A model that appears aligned during evaluation may have internal states that suggest different deployment behaviour. Behavioural tests sample model behaviour on a finite set of inputs; mechanistic analysis examines what the model could do, not just what it does on the test set.
RLHF and similar alignment training shape behaviour, not understanding. As §8 discussed, RLHF-trained refusal is approximately a single direction in activation space. The model has not learned to “deeply understand harm” - it has learned a feature that triggers refusal output. Understanding this mechanistically is essential for understanding safety guarantees.
Capability emergence is partially predictable from internals. Some dangerous capabilities (deception, manipulation, dangerous knowledge) may be detectable in internal representations before they are reliably elicited by behavioural tests. MI provides a forward-looking safety tool.
The combination motivates MI as a complement to behavioural safety testing - not a replacement, but an additional source of evidence about model alignment.
Detecting deception, manipulation, sycophancy
A specific safety-relevant programme. Detecting deceptive AI behaviour is one of the central concerns of alignment research; if AI systems can recognize when they are being evaluated and adjust their behaviour accordingly, behavioural testing alone fails. MI offers tools for internal detection.
The state of the art:
Truthfulness probes (Burns et al. 2022, “Discovering Latent Knowledge”). Train probes that distinguish “model’s internal belief about truth” from “model’s output”. When the model’s output disagrees with its internal-state probe, it may be deceiving.
Sycophancy detection (Sharma et al. 2023). Models trained with RLHF often become sycophantic - agreeing with whatever the user says rather than the truth. MI work has identified specific features and circuits associated with sycophantic behaviour.
Backdoor detection (Hubinger et al. 2024, “Sleeper Agents”). Models can be trained with backdoors - hidden behavioural patterns triggered by specific inputs - and Hubinger et al. showed that such backdoors can persist through standard safety training. MI work has investigated whether such backdoors leave detectable signatures in internal computation. Initial results are mixed - some backdoors are detectable; some are stealthier.
The honest accounting. Adversarially-trained deception (where the model is trained to suppress its detectable internal signals) substantially reduces MI’s detection capability. The Hubinger et al. results showed that some backdoors remain detectable; others can be made nearly invisible via specific training procedures. MI is not yet a robust adversarial-deception detector.
Monitoring for dangerous capabilities
A different safety direction. Frontier AI systems may have latent dangerous capabilities (biology knowledge sufficient for bioweapon design, cyber-offensive capability, manipulation skills) that are present internally but suppressed by training. MI can potentially detect these.
The methodology. SAE features (§9) often include capability-relevant directions. By monitoring whether these features activate for specific inputs, we can detect when the model is internally engaging with dangerous content even if its output is benign.
A specific application. Anthropic’s SAE work on Claude 3 Sonnet identified features for dangerous content categories (bioweapon-relevant biology, cyber-offensive techniques, manipulation tactics). Monitoring these features across inputs provides early-warning information about what the model might do under different conditions.
The applications:
Pre-deployment evaluation. Before deploying a model, check whether dangerous-capability features are present at concerning levels.
Real-time monitoring. During deployment, log when dangerous-capability features activate; investigate concerning activation patterns.
Adversarial testing. When red-teamers attempt to elicit dangerous behaviour, MI can identify whether the model is internally engaging with the dangerous content even when behavioural output is sanitized.
The honest qualifications. Capability monitoring depends on identifying the right features in the right layers. Frontier models have many capability-relevant features; full coverage is unproven. False positives (features that look concerning but reflect benign reasoning) and false negatives (concerning behaviour without flagged features) both occur.
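The real-time monitoring pattern amounts to a threshold check over selected feature activations. The following is a minimal sketch, assuming per-token activations for the monitored features have already been computed by an SAE; the feature names and thresholds are hypothetical and would need calibration against the false-positive and false-negative rates discussed above.

```python
def flag_activations(feature_acts, thresholds):
    """Return (feature, token_position, activation) for every exceedance.

    feature_acts: {feature_name: [activation per token]}
    thresholds:   {feature_name: alert threshold}; features absent here are ignored.
    """
    flags = []
    for name, threshold in thresholds.items():
        for position, value in enumerate(feature_acts.get(name, [])):
            if value > threshold:
                flags.append((name, position, value))
    return flags


# Hypothetical usage: log flagged positions for later human review.
acts = {"bio_risk": [0.0, 0.2, 3.1], "code_completion": [9.0, 8.5, 7.0]}
alerts = flag_activations(acts, {"bio_risk": 1.0})
```

A flag is a prompt for investigation, not a verdict: the same feature can fire during benign reasoning about a dangerous topic.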
Steering vectors and representation engineering
A practical alignment intervention. Representation engineering (Zou et al., 2023, “Representation Engineering: A Top-Down Approach to AI Transparency”) and steering vectors (Subramani et al., 2022; Turner et al., 2023) directly modify model behaviour by intervening on internal representations.
The mechanism. Identify a feature direction associated with the desired behaviour. During inference, add a scaled copy of that direction to the residual stream at relevant layers. The model’s behaviour shifts in the direction the vector encodes.
Applications.
Refusal steering. Add the refusal direction (Arditi et al. 2024, §8) to the residual stream to make the model refuse more often. Useful for tightening safety. Conversely, projecting out the refusal direction can produce a model that refuses less - which is dangerous in adversarial hands but potentially useful for legitimate applications (e.g., medical professional access).
Truthfulness steering. Steer the model toward its internal “true” representation (from Burns et al. probes) and away from output deception.
Style and persona steering. Modify outputs to match specific styles or personas without prompting.
Helpful-vs-harmless balance. Adjust the model’s safety-helpfulness trade-off post-hoc by steering relevant features.
The advantages. Steering is cheap (a single addition during the forward pass), reversible (easy to undo), and targeted (changes specific behaviours without retraining).
The honest qualifications. Aggressive steering produces incoherent outputs (the model becomes erratic when pushed too far from its training distribution). Steering a feature can have unintended effects on related features. The control is imprecise compared to retraining.
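Projecting out a direction, as in the refusal example above, is the counterpart of additive steering: instead of adding a multiple of the direction, the component of the activation along it is removed. The following is a minimal sketch in plain Python, with lists standing in for residual-stream vectors; `project_out` is a hypothetical helper name.

```python
import math


def unit(vector):
    """Normalise a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [c / norm for c in vector]


def project_out(residual, direction):
    """Remove the component of `residual` along `direction`.

    Computes x - (x . r_hat) * r_hat, where r_hat is the unit direction.
    """
    r_hat = unit(direction)
    coeff = sum(x * r for x, r in zip(residual, r_hat))
    return [x - coeff * r for x, r in zip(residual, r_hat)]
```

After projection the activation has zero component along the direction, regardless of how large that component was. Additive steering, in contrast, requires choosing a strength, which is where the incoherence risk described above comes from.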
MI as a part of the alignment toolkit
The honest picture. MI is one component of a comprehensive alignment programme - not a complete solution. The alignment toolkit includes:
Training-time methods (RLHF, RLAIF, Constitutional AI, debate).
Behavioural evaluation (red-teaming, dangerous-capability tests, unintended-behaviour benchmarks).
Mechanistic interpretability (this chapter).
Formal verification (limited applicability to current models but growing).
Containment and oversight (sandboxing, human review of high-stakes decisions, defense in depth).
MI’s specific contributions:
Insight into how training shapes model behaviour at the mechanism level.
Detection tools for deceptive or dangerous internal states.
Targeted intervention via steering and feature-level editing.
Monitoring during deployment.
MI does not:
Provide guarantees of model alignment.
Replace behavioural testing.
Solve fundamental alignment problems (specification gaming, reward hacking, deceptive alignment) on its own.
The realistic picture. MI strengthens alignment evidence; it does not by itself produce aligned systems.
Limitations of MI as a safety tool
Honest treatment of where MI falls short for safety.
Adversarial brittleness. A sufficiently capable adversarial model could suppress the internal signals MI relies on. Empirical evidence (Hubinger et al. 2024 sleeper agents) shows partial vulnerability to this.
Coverage gaps. Even at frontier-scale SAE analysis, many features remain uninterpreted. Safety-relevant computation may live in features we haven’t identified.
Validation cost. Establishing that a specific MI claim about safety holds reliably (across inputs, across model variants, under deployment conditions) requires substantial validation effort. The cost prevents universal application.
Frontier-scale gap. Most rigorous MI work has been on smaller models. The largest frontier models (Claude Opus, GPT-4 successors) have only partial MI coverage. As capabilities grow, the MI gap may grow.
Translation to operational safety. MI findings are research outputs; turning them into operational safety guarantees (auditable, certifiable, deployable) is a separate engineering challenge that is still nascent.
These limitations are not arguments against MI for safety - they are arguments for not over-claiming MI’s contributions. MI is a valuable safety tool; it is not a solved problem.
Connection to evaluations and red-teaming
A practical integration question. Behavioural evaluation (Evaluation chapter, planned) and red-teaming are the established safety methods for AI systems. How does MI complement them?
Three integration patterns.
MI-informed red-teaming. MI findings (e.g., specific deception features) suggest targeted red-team probes - try inputs that should activate concerning features. This makes red-teaming more efficient than purely behavioural exploration.
MI as evaluation augment. During behavioural evaluation, log MI features alongside behavioural outputs. A model that behaviourally passes a safety test but has concerning internal activation patterns warrants additional scrutiny.
MI for failure analysis. When a behavioural test reveals a failure (jailbreak, deception, dangerous output), MI can investigate the mechanism - what internal state produced the failure? - informing both immediate mitigation and longer-term training improvements.
These integrations are increasingly standard at major labs. The 2024–2026 period saw MI move from research lab to production safety pipeline at the frontier-AI organizations.
Where safety MI sits in 2026
The summary. MI for safety is a substantial and growing application area. Specific findings (refusal directions, deception representations, capability-relevant features) inform safety analysis at major labs. Steering and representation engineering provide actionable intervention tools. The integration with behavioural evaluation and red-teaming is increasingly standard.
The unresolved questions. Whether MI can achieve adversarial robustness (detection that survives adversarial training of the model). Whether MI can scale to frontier-model coverage that meaningfully constrains deployment risks. Whether MI findings translate to operational safety guarantees suitable for high-stakes deployment.
These are the central questions of alignment-relevant MI. The trajectory is positive but the destination is not yet reached. The Alignment chapter (planned) will develop the safety implications more comprehensively; this chapter establishes the MI-side technical content.
§11. MI Methodology and Practice
A meta-level section. Having developed the techniques (§4-§9) and applications (§10), this section discusses how MI work is done well - choosing what to interpret, validating claims rigorously, communicating findings, and the role of automated interpretability.
Choosing what to interpret
The first practical question. MI projects can target many different things. Choosing well affects whether the project produces useful results.
The dimensions of choice.
Behaviour vs feature vs circuit. Some projects target specific behaviours (IOI, induction); some target specific features (“what does this SAE feature encode?”); some target full circuits. Each requires different methodology and produces different kinds of findings.
Model scale. Small models (GPT-2 size) are amenable to detailed circuit analysis but findings may not generalize to frontier models. Frontier models offer more interesting capabilities but are harder to analyze comprehensively. The choice depends on the project’s goals.
Layer choice. Different computations live at different layers - early layers handle low-level pattern matching; mid-layers handle semantic processing and factual recall; late layers handle output formatting. A focused project chooses layers based on prior expectations about where the target computation lives.
Static vs dynamic. Some MI work is static (analyze trained weights and typical activations); some is dynamic (track how representations evolve during training, or how they respond to specific inputs). Dynamic MI is more demanding but reveals additional structure.
The pragmatic recommendation. Choose a specific behaviour or specific feature first; develop a clear hypothesis about its mechanism; design experiments that would falsify the hypothesis; iterate based on results. Open-ended exploration is sometimes valuable but more often produces vague findings.
Validating interpretability claims rigorously
The crucial methodological discipline. MI findings are interpretive - they connect numerical activations to human-meaningful concepts. The interpretation step is where claims can fail; rigor here is essential.
The validation checklist for a typical MI claim.
Specify the claim. What feature/circuit/mechanism is being claimed? In operational terms - what experiments would confirm or falsify it?
Demonstrate necessity. Ablate the proposed mechanism; verify the behaviour degrades.
Demonstrate sufficiency. Activate the proposed mechanism in a context where it shouldn’t be active; verify the behaviour appears.
Test counterfactual robustness. Try the experiments with multiple counterfactual inputs; the conclusion should hold across reasonable counterfactuals.
Compare alternative hypotheses. Could other mechanisms explain the same observations? Test the alternatives.
Estimate effect sizes. How much does the proposed mechanism explain quantitatively (e.g., what fraction of behavioural performance is restored by the circuit alone)?
Report what is not explained. Be explicit about residual behaviour the proposed mechanism does not account for.
A rigorous MI claim survives all of these. Most published MI work satisfies most of them; the field’s standards are improving.
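The necessity, sufficiency, and effect-size items can be reported as normalised fractions of a behavioural metric. The following is a minimal sketch, assuming a scalar metric (for instance a logit difference) has been measured for the clean model, the model with the mechanism ablated, and the model with only the mechanism intact; the function names and the baseline convention are assumptions.

```python
def _span(clean_score, baseline_score):
    span = clean_score - baseline_score
    if span == 0.0:
        raise ValueError("clean and baseline scores coincide")
    return span


def fraction_lost(clean_score, ablated_score, baseline_score=0.0):
    """Necessity: share of performance destroyed by ablating the mechanism."""
    return (clean_score - ablated_score) / _span(clean_score, baseline_score)


def fraction_restored(clean_score, circuit_only_score, baseline_score=0.0):
    """Sufficiency: share of performance retained by the mechanism alone."""
    return (circuit_only_score - baseline_score) / _span(clean_score, baseline_score)


def fraction_unexplained(clean_score, circuit_only_score, baseline_score=0.0):
    """Residual behaviour the proposed mechanism does not account for."""
    return 1.0 - fraction_restored(clean_score, circuit_only_score, baseline_score)
```

Reporting all three numbers keeps the last checklist item honest: a circuit that restores 75% of performance leaves 25% to be explained by something else.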
The reproducibility challenge
A specific concern. MI findings can be fragile - they may depend on:
Random initialization. Different training runs of the same model may have different circuits.
Architectural details. Small changes (number of heads, layer norm placement) may shift mechanisms.
Hyperparameter choices. Probe training, SAE training, evaluation thresholds - all affect findings.
Implementation details. Different MI libraries (TransformerLens, NNsight, custom code) may produce slightly different results.
The reproducibility responses:
Multiple seeds. Demonstrate findings on multiple training runs / random initializations.
Multiple model variants. Demonstrate on multiple model sizes / architectures / families.
Open code and data. Release the analysis code and (where possible) the activation datasets.
Sensitivity analysis. Document how sensitive the findings are to specific hyperparameters or implementation choices.
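The multiple-seeds response can be made concrete by reporting the spread of an effect across runs instead of a single number. The following is a minimal sketch, assuming one scalar effect size per training seed has already been measured; `summarize_across_seeds` is a hypothetical helper name.

```python
import statistics


def summarize_across_seeds(effect_by_seed):
    """Summarise one effect size measured on several training seeds.

    effect_by_seed: {seed: effect size}, e.g. fraction of behaviour restored.
    """
    values = list(effect_by_seed.values())
    if not values:
        raise ValueError("no seeds given")
    return {
        "n_seeds": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

The same summary applies to a sensitivity analysis by keying on hyperparameter settings instead of seeds. A large gap between minimum and maximum is itself a finding worth reporting.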
The 2026 standard. Major MI papers now routinely include reproducibility information; the field has been moving toward higher standards. Smaller publications and blog posts vary substantially.
Standard benchmarks and metrics
A maturing aspect of MI methodology. Several benchmarks now provide standardized evaluation.
SAE benchmarks (developed in §9): reconstruction loss, sparsity, explanation score, causal effect on downstream loss, steering effectiveness.
Circuit benchmarks. Wang et al. (2022) IOI is the prototype; the IOI dataset and metric are reused for many circuit analyses. Other circuits (induction, factual recall) have associated standard datasets.
Probing benchmarks. Standard sets of probing targets (POS tags, syntactic dependencies, semantic features) for comparing probe quality across models.
Auto-interpretability benchmarks. OpenAI’s neuron-explanation benchmark (Bills et al., 2023); Anthropic’s feature-explanation evaluations. Standardized scoring of whether automatic feature labels are correct.
The trajectory. Standardized benchmarks make MI claims comparable across papers and models. The field is moving toward more standardization, though specific subfields (circuits, SAEs, probing) have different standards.
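The IOI metric is a logit difference: the logit of the correct name minus the logit of the distractor name at the final position. The following is a minimal sketch, assuming final-position logits are available as a plain list indexed by token id; the token ids used in the test values are hypothetical.

```python
def logit_diff(logits, correct_id, distractor_id):
    """Logit of the correct token minus logit of the distractor token."""
    return logits[correct_id] - logits[distractor_id]


def mean_logit_diff(examples):
    """Average logit difference over (logits, correct_id, distractor_id) triples."""
    if not examples:
        raise ValueError("no examples given")
    return sum(logit_diff(*example) for example in examples) / len(examples)
```

Because the metric is a single scalar per prompt, it can be compared directly between the clean model and an ablated or patched model, which is what makes it reusable across circuit analyses.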
The role of automated interpretability
A recent and consequential development. Automated interpretability uses LLMs (typically Claude or GPT-4) to interpret MI artefacts (features, neurons, circuits). The §9 SAE labeling pipeline is the most-used instance.
The argument for automation. Manual interpretation does not scale to millions of features. If LLMs can produce coherent labels reliably, MI’s coverage problem (interpretation at scale) becomes tractable.
The argument for caution. LLMs may produce plausible but wrong labels - text that sounds coherent but doesn’t match the actual feature. The auto-validation step (§9) helps but is itself imperfect.
The 2026 state. Automated interpretability is useful for first-pass labelling but insufficient for confident claims. Major MI work uses the standard pattern: auto-label many features; manually validate a subset; report aggregate scores rather than individual labels for unvalidated features.
The evolving frontier. Multi-step automated interpretation - using LLMs to design experiments, generate hypotheses, and refine interpretations iteratively - is an active research direction. Whether it produces reliable mechanistic findings or just impressive-sounding output is contested.
Communicating MI findings
A practical skill. MI findings are technical (involve specific architectures and training details), contingent (apply to specific models/checkpoints), and partial (capture some but not all of relevant computation). Communicating responsibly requires care.
Best practices.
Specify scope. “We find X in GPT-2 Small at checkpoint Y.” Not “transformers do X.”
Quantify uncertainty. “The feature explains approximately 60% of the behavioural variance.” Not “the feature is responsible for the behaviour.”
Acknowledge alternative interpretations. “Our interpretation is consistent with the observations, though alternative mechanisms (X, Y) might also explain them.”
Distinguish demonstration from understanding. A circuit demonstration shows mechanism in this case; it does not necessarily generalize. Make the distinction explicit.
Visualize meaningfully. Activation heatmaps, attention patterns, feature visualizations are powerful but can mislead. Choose visualizations that reflect the underlying claims.
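A sketch of the "quantify uncertainty" practice. A claim such as "the feature explains approximately 60% of the behavioural variance" is more useful with an interval attached. The example computes the squared correlation between a feature's activation and a behavioural metric across prompts, then a percentile bootstrap interval by resampling prompts. The data are synthetic and the function names are assumptions; with a real model, the two lists would come from forward passes on an evaluation set.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation: share of variance in ys linearly explained by xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov * cov / (vx * vy)


def bootstrap_interval(xs, ys, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(xs, ys), resampling prompts with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(stat([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic stand-in for 200 prompts: feature activation and a behavioural metric.
rng = random.Random(1)
feature_act = [rng.gauss(0, 1) for _ in range(200)]
behaviour = [1.2 * a + rng.gauss(0, 1) for a in feature_act]

point = r_squared(feature_act, behaviour)
low, high = bootstrap_interval(feature_act, behaviour, r_squared)
print(f"variance explained: {point:.2f} (95% bootstrap interval {low:.2f} to {high:.2f})")
```

This statistic is correlational. Reporting it alongside an intervention result (ablation or patching) keeps the claim within what the evidence supports.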
The Distill articles (§2) set the gold standard for MI communication: long-form, interactive, careful about uncertainty, generous with visualizations. Modern MI papers and blog posts vary; the best follow Distill conventions.
Where MI methodology sits in 2026
The summary. MI methodology has matured substantially over 2020-2026. The field has converged on standard techniques (probing, activation patching, SAEs, circuit analysis), standard validation practices (necessity-and-sufficiency, multiple seeds), standard benchmarks (IOI, SAE metrics, auto-interp scores), and standard communication norms (specify scope, quantify uncertainty).
Open methodological questions. How to explore feature spaces systematically rather than analyzing specific behaviours ad hoc. How to automate circuit identification reliably. How to certify MI claims for safety-critical applications. These are active research areas; OP-MI-3 captures the rigorous-validation challenge.
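A sketch of the necessity-and-sufficiency check. Necessity asks how much of the behaviour disappears when the hypothesised circuit is mean-ablated; sufficiency asks how much survives when everything except the circuit is mean-ablated. The example is a deliberately simplified toy in which the logit difference is a sum of per-component contributions. The component names, the contribution values, and the additive structure are all assumptions; in a real network the ablations must be run through the actual forward pass, because components interact.

```python
import random

COMPONENTS = ["name_mover", "s_inhibition", "other_a", "other_b"]
CIRCUIT = {"name_mover", "s_inhibition"}  # hypothesised circuit (toy)


def make_prompts(n, task_signal, seed):
    """Per-prompt contribution of each component to the logit difference (toy data)."""
    rng = random.Random(seed)
    signal = {"name_mover": 3.0, "s_inhibition": 1.0} if task_signal else {}
    return [
        {c: signal.get(c, 0.0) + rng.uniform(-0.2, 0.2) for c in COMPONENTS}
        for _ in range(n)
    ]


def mean_logit_diff(prompts, ablated, reference_means):
    """Average output when each ablated component is replaced by its reference mean."""
    total = 0.0
    for contribs in prompts:
        total += sum(
            reference_means[c] if c in ablated else v for c, v in contribs.items()
        )
    return total / len(prompts)


task = make_prompts(200, task_signal=True, seed=0)
reference = make_prompts(200, task_signal=False, seed=1)
reference_means = {c: sum(p[c] for p in reference) / len(reference) for c in COMPONENTS}

full = mean_logit_diff(task, set(), reference_means)
without_circuit = mean_logit_diff(task, CIRCUIT, reference_means)
circuit_only = mean_logit_diff(task, set(COMPONENTS) - CIRCUIT, reference_means)

necessity = 1 - without_circuit / full  # share of behaviour lost when the circuit is ablated
sufficiency = circuit_only / full       # share of behaviour kept by the circuit alone
print(f"necessity {necessity:.2f}, sufficiency {sufficiency:.2f}")
```

Reporting both numbers matters: a circuit can be necessary without being sufficient (other components also contribute), or sufficient without being necessary (backup mechanisms compensate after ablation).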
§12. Connections to Other Chapters
This chapter is densely connected to the rest of the book; the cross-references below are dependency statements.
Deep Learning provides the architectures MI analyzes - MLPs, attention, residual streams, layer normalization. DL §6 (Transformers) is the most directly relevant section; this chapter assumes it as background.
Large Language Models is the primary subject of modern MI. LLM §3 (architecture), §5 (training), and §6 (decoding) provide the LM specifics that MI analyzes. Most MI findings (IOI, induction heads, refusal directions, factual recall) are about LMs specifically.
Foundation Models provides the FM-as-substrate framing. SAEs at scale (§9) are increasingly applied to frontier FMs; MI’s scaling problem is the FM-scaling problem applied to the interpretability question.
Causality §4 develops the do-calculus and causal-intervention machinery that MI’s activation patching applies. The two chapters are intellectual siblings - Causality provides the theoretical framework; MI applies it operationally to network internals. This chapter cross-references Causality throughout.
Self-Supervised Learning is the pretraining paradigm for the models MI analyzes. SSL §4 (generative objectives) and §5 (contrastive) produce the representations that downstream MI investigates.
Alignment / Ethics (planned) is the policy-and-application chapter for MI’s safety contributions. §10 of this chapter develops MI’s role in alignment; the Alignment chapter will develop the broader safety framework MI sits within.
Theoretical Foundations of Learning §8 develops the modern generalization puzzle; MI provides one approach to understanding generalization mechanistically. The connection is at the empirical-vs-mechanistic level - Theory provides the puzzle, MI provides the empirical investigation.
Reinforcement Learning §10 develops RLHF, which produces the safety-trained models MI investigates. The refusal-direction analysis (§8) is essentially analyzing what RLHF teaches the model mechanistically.
Generative Models provides the generative substrate (autoregressive Transformers) that MI primarily targets. That chapter’s §3 (autoregressive models) covers the model class most MI work analyzes; its §6 (diffusion) covers a class for which MI remains largely open (OP-MI-7).
AI for Science §13 raises the curve-fitting-vs-understanding critique. MI provides one response - we can partially understand what these models compute, even at scale. Whether MI-derived “understanding” satisfies the scientific-understanding desideratum is a separate question.
Multimodal Models (planned). Recent MI work has extended to multimodal models (CLIP features, vision-language models); the techniques are similar but with multimodal feature taxonomies (§9).
Evaluation (planned) is the cross-cutting chapter on benchmarking. MI provides internal evidence that complements behavioural evaluation; §10 of this chapter develops the integration.
§13. Critiques and Alternative Perspectives
This section presents critiques of MI as substantive intellectual positions held by working researchers.
“Interpretability is too hard at scale”
A sceptical position. Frontier models have hundreds of billions of parameters; interpreting them comprehensively may be intractable. SAEs scale to millions of features but the interpretation step does not - humans cannot validate millions of features individually. Even automated interpretation produces findings of uneven quality. The field may be producing demonstrations of MI on small models without a path to systematic understanding of frontier models.
The pushback. The scaling trajectory has been positive - 2020 MI was on tiny models; 2024 MI scales to frontier models. The 2024 results (Scaling Monosemanticity, Gemma Scope) were viewed as impossible by many in 2022. Continued progress is plausible.
The chapter’s position. The scaling concern is real but not decisive. MI may not produce complete understanding of frontier models in the near term; partial understanding has substantial value. Whether the partial understanding is enough for safety-critical applications is the harder question.
“MI findings don’t generalize”
A specific critique. Most MI findings are for specific models (GPT-2 Small, Pythia 2.8B, Claude 3 Sonnet). Whether the same circuits or features appear in other models - different sizes, different families, different training procedures - is often not tested. Findings may be model-specific artefacts rather than universal mechanisms.
The pushback. Some findings do generalize. Induction heads emerge in essentially every multi-layer Transformer studied. Refusal directions appear consistently across RLHF-trained models. The factual-recall mechanism (key-value memories in mid-network MLPs) appears across model families. For these mechanisms, universality has direct empirical support in the models tested.
The honest accounting. Universality is partial. Some mechanisms are universal; some are model-specific; we do not always know in advance which is which. The trajectory is toward more cross-model verification; the standards are improving.
“MI is a distraction from alignment”
A different criticism. MI is technically interesting but the resources spent on MI could be spent on more directly alignment-relevant work (constitutional AI, debate, training-time alignment). Mechanistic understanding is insufficient for alignment guarantees; behavioural evaluation is more direct.
The pushback. MI provides complementary evidence to behavioural testing - finding internal mechanisms that behavioural testing might miss (the sleeper-agent example). MI also informs training-time alignment by clarifying what current methods actually teach the model. The refusal-direction finding suggests RLHF teaches a single feature, not deep understanding of harm - a substantive alignment-relevant result.
The chapter’s position. MI is one component of alignment, not the whole. Spending resources on MI is reasonable; spending all resources on MI is not. The honest framing is portfolio diversification across alignment approaches.
The validation crisis
A specific methodological concern. Many MI claims are under-validated - based on probing without causal validation, based on visualization without systematic testing, based on small samples. Replications sometimes fail; alternative interpretations sometimes succeed for the same data.
The response. Modern MI methodology (§11) demands rigorous validation. The field has been moving toward higher standards. Older results that don’t meet modern standards are revisited or qualified. The validation crisis is being actively addressed but is not fully resolved.
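A sketch of why probing without causal validation is under-validated. In the toy below, a property z is perfectly linearly decodable from the hidden state, yet removing the probe direction leaves the output unchanged, because the readout never uses that dimension. The two-dimensional hidden state, the readout, and the mean-difference probe are all illustrative assumptions chosen to make the contrast exact.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def readout(h):
    """Toy model output: depends only on dimension 0 of the hidden state."""
    return 2.0 * h[0]


def fit_mean_difference_probe(states, labels):
    """Linear probe: direction between class means, threshold at their midpoint."""
    dim = len(states[0])
    pos = [h for h, y in zip(states, labels) if y == 1]
    neg = [h for h, y in zip(states, labels) if y == 0]
    mu_pos = [sum(h[i] for h in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(h[i] for h in neg) / len(neg) for i in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -dot(w, [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)])
    return w, b


def project_out(h, w):
    """Remove the component of h along direction w."""
    scale = dot(h, w) / dot(w, w)
    return [hi - scale * wi for hi, wi in zip(h, w)]


# Hidden states: dimension 0 carries the task variable t; dimension 1 encodes
# a property z that the readout never uses.
states, labels = [], []
for i in range(21):
    t = i / 10 - 1
    for z in (0, 1):
        states.append([t, 2.0 * z - 1.0])
        labels.append(z)

w, b = fit_mean_difference_probe(states, labels)
probe_acc = sum((dot(w, h) + b > 0) == (y == 1) for h, y in zip(states, labels)) / len(states)

# Causal check: ablate the probe direction, then ablate the task direction.
effect_of_probe_dir = max(abs(readout(h) - readout(project_out(h, w))) for h in states)
effect_of_task_dir = max(abs(readout(h) - readout(project_out(h, [1.0, 0.0]))) for h in states)
print(probe_acc, effect_of_probe_dir, effect_of_task_dir)
```

The probe result alone would support "the model represents z"; only the intervention shows that the output does not depend on it.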
The scaling-vs-interpretability trade-off
A structural worry. As models scale, capabilities grow but interpretability becomes harder. The gap between what models can do and what we can mechanistically understand may widen over time. If so, MI’s relative contribution to safety decreases as models become more capable.
The pushback. SAE scaling has kept pace with model scaling so far. New techniques (transcoders, end-to-end SAEs, automated interpretation) may continue to scale. The trade-off is real but not necessarily decisive.
The honest worry. If MI can only achieve partial understanding at any scale, and the frontier moves faster than MI catches up, the practical relevance of MI for safety may decline. This is one of the central uncertainties of the field.
§14. Limitations and Open Problems
Consolidated open-problems list. Each carries an OP-MI-N identifier.
OP-MI-1. Scaling MI to frontier-scale models. SAE scaling has worked; circuit-level analysis is harder. Most circuit-level findings are for small models (GPT-2 Small, Pythia 2.8B). Whether circuit analysis can be made tractable for frontier-scale models - Claude Opus, GPT-4, Gemini Ultra - is the central practical question. Automated tools (ACDC, EAP) help but are not yet sufficient.
OP-MI-2. Universal vs model-specific findings. Many MI findings have been demonstrated on specific models but not systematically tested on others. Establishing which findings are universal (induction heads, refusal directions, factual-recall in mid-MLPs) and which are model-specific requires substantial cross-model verification work. The current state is that some findings have been verified across multiple models; many have not.
OP-MI-3. Validating circuit claims rigorously. Modern circuit work uses necessity-and-sufficiency standards, but full causal validation (causal scrubbing for every claimed component, all alternative hypotheses tested) is rarely achieved. Lifting the validation standards across the field - and developing tools that make rigorous validation tractable - is open.
OP-MI-4. SAE feature interpretation at scale. Automated interpretability produces labels for millions of features but interpretation rates remain partial (30-70%). Many features remain uninterpreted; some may have no clean human label. Improving the rate, the consistency, and the validation of automated interpretation is the central practical issue for SAE-at-scale work.
OP-MI-5. Polysemanticity and superposition. The superposition hypothesis explains polysemantic neurons; SAEs are the proposed remedy. But the SAE decomposition is not unique - multiple SAEs trained on the same activations can produce different feature dictionaries. Which decomposition is “the right one” - or whether the question is well-formed - is debated.
OP-MI-6. Causal validity of probing. Linear probes can succeed without the probed feature being causally relevant. Causal validation (combining probes with intervention) is the response, but is rarely done at scale. Establishing causal validity for the many published probing results is a substantial back-fill task.
OP-MI-7. MI for non-Transformer architectures. Almost all MI methodology is Transformer-specific. CNN MI (the early circuits work on InceptionV1) provided foundations but has not been substantially extended. SSM MI (Mamba and successors), diffusion-model MI, and MI for hybrid architectures are largely open. As architectures diversify, MI methodology needs to follow.
OP-MI-8. Connecting MI findings to behavioural evaluations. A consistent challenge: MI findings are about internal computation; safety claims need to be about behaviour. Translating from “feature X exists internally” to “the model will not exhibit behaviour Y in deployment” requires bridging assumptions that are themselves uncertain. Tighter links between mechanistic findings and behavioural guarantees are needed.
OP-MI-9. Adversarial robustness of MI. A model trained to suppress its detectable internal signals could defeat MI’s safety applications. Hubinger et al. (2024) show that backdoored “sleeper agent” behaviour can persist through standard safety training, which makes the worry concrete. Developing MI methods that are robust to such adversarial training - or proving that they cannot be - is one of the critical open questions for MI as a safety tool.
OP-MI-10. Operational MI for deployed safety. MI findings inform alignment research; turning them into operational safety tools (real-time monitoring, automated detection, certifiable claims) is a separate engineering challenge. Bridging research-grade MI to deployment-grade safety infrastructure is a substantial open problem.
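A sketch relevant to OP-MI-5. One simple diagnostic for non-uniqueness is to match each feature direction in one SAE's decoder to its nearest neighbour in another's and report the mean of the best cosine similarities. The decoder directions below are hypothetical toy vectors and the function names are assumptions; real dictionaries have thousands to millions of directions, for which a vectorized implementation would be needed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_matches(dict_a, dict_b):
    """For each direction in dict_a, the index and cosine of its closest match in dict_b."""
    matches = []
    for u in dict_a:
        sims = [cosine(u, v) for v in dict_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        matches.append((j, sims[j]))
    return matches


# Toy decoder directions (hypothetical) from two SAEs trained on the same activations.
sae_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sae_b = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [1.0, 0.0, 1.0]]

matches = best_matches(sae_a, sae_b)
mean_max_cos = sum(sim for _, sim in matches) / len(matches)
for i, (j, sim) in enumerate(matches):
    print(f"A[{i}] -> B[{j}]  cosine {sim:.3f}")
print(f"mean max cosine: {mean_max_cos:.3f}")
```

In this toy, two features are recovered identically up to scale and ordering, while the third is only partially matched. The diagnostic measures agreement between dictionaries; it does not say which dictionary, if either, is the right decomposition.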
§15. Further Reading
Opinionated annotated list. Not exhaustive; intended as a reading-order recommendation.
Foundational papers and surveys
Olah, C., Mordvintsev, A., and Schubert, L. (2017). “Feature Visualization.” Distill. Foundational visualization work.
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. (2020). “Zoom In: An Introduction to Circuits.” Distill. The circuits programme articulation.
Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2023). “Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks.” Comprehensive survey of MI methods.
Bereska, L., and Gavves, E. (2024). “Mechanistic Interpretability for AI Safety: A Review.” Recent survey emphasizing safety applications.
Conceptual framework
Elhage, N., et al. (2021). “A Mathematical Framework for Transformer Circuits.” Anthropic. The QK/OV decomposition; residual stream view.
Elhage, N., et al. (2022). “Toy Models of Superposition.” Anthropic. Superposition hypothesis.
Park, K., Choe, Y. J., and Veitch, V. (2024). “The Linear Representation Hypothesis and the Geometry of Large Language Models.” Formalization and empirical investigation of linear representations.
Circuits
Olsson, C., et al. (2022). “In-context Learning and Induction Heads.” Anthropic. The induction-heads finding.
Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2022). “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.” The IOI circuit. Methodology template.
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). “Towards Automated Circuit Discovery for Mechanistic Interpretability.” ACDC.
Heimersheim, S., and Janiak, J. (2023). “A Circuit for Python Docstrings in a 4-Layer Attention-Only Transformer.” Another canonical circuit analysis.
Causal interventions
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. (2020). “Investigating Gender Bias in Language Models Using Causal Mediation Analysis.” Early causal-intervention MI.
Geiger, A., Lu, H., Icard, T., and Potts, C. (2021). “Causal Abstractions of Neural Networks.”
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). “Locating and Editing Factual Associations in GPT.” ROME.
Meng, K., et al. (2023). “Mass-Editing Memory in a Transformer.” MEMIT.
Probing
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). “What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties.” Foundational probing paper.
Hewitt, J., and Liang, P. (2019). “Designing and Interpreting Probes with Control Tasks.” The control-task method.
Belinkov, Y. (2022). “Probing Classifiers: Promises, Shortcomings, and Advances.” Comprehensive review of probing.
Sparse autoencoders
Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. (2023). “Sparse Autoencoders Find Highly Interpretable Features in Language Models.” Foundational SAE paper.
Bricken, T., Templeton, A., et al. (Anthropic 2023). “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.”
Templeton, A., Conerly, T., Marcus, J., et al. (Anthropic 2024). “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” The SAE-at-frontier-scale breakthrough.
Gao, L., et al. (OpenAI 2024). “Scaling and Evaluating Sparse Autoencoders.” Top-K SAEs.
Rajamanoharan, S., et al. (DeepMind 2024). “Improving Dictionary Learning with Gated Sparse Autoencoders.”
Lieberum, T., et al. (DeepMind 2024). “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.”
MI for safety
Burns, C., Ye, H., Klein, D., and Steinhardt, J. (2022). “Discovering Latent Knowledge in Language Models Without Supervision.” Truthfulness probing.
Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. (2024). “Refusal in Language Models Is Mediated by a Single Direction.”
Zou, A., et al. (2023). “Representation Engineering: A Top-Down Approach to AI Transparency.”
Hubinger, E., et al. (2024). “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” Backdoored behaviour that survives safety training; motivates the adversarial-robustness question in OP-MI-9.
Reading-order recommendation
For someone entering MI cold: start with Olah et al. 2020 (“Zoom In”, Distill) for the circuits framing. Then Elhage et al. 2021 (“Mathematical Framework”) for the Transformer-specific machinery. Then Olsson et al. 2022 (induction heads) and Wang et al. 2022 (IOI) for the canonical circuit analyses. Then Bricken et al. 2023 and Templeton et al. 2024 for the modern SAE programme. Then Räuker et al. 2023 and Bereska and Gavves 2024 for comprehensive surveys. Add the safety-MI papers (Arditi et al. 2024, Burns et al. 2022, Hubinger et al. 2024) once the foundations are in place.
§16. Exercises and Experiments
Research-style exercises. Each develops a specific MI skill on a tractable target.
E1. Linear probes on a small Transformer. Train linear probes for syntactic features (POS tags, dependency relations) on each layer of a small Transformer (Pythia 70M or GPT-2 Small). Plot accuracy by layer. Implement Hewitt-Liang control tasks to test probe expressivity. Compare with non-linear probes.
E2. Train a sparse autoencoder. Train a small SAE on the residual stream of a Transformer layer (Pythia 1.4B is tractable). Analyze the resulting features: identify top-activating examples for ~20 random features; manually label them; compute interpretation rate. Compare basic L1, Top-K, and JumpReLU SAEs.
E3. Reproduce induction-head analysis. Following Olsson et al. (2022), identify candidate induction heads in a small Transformer. Use activation patching to validate their causal role in in-context learning. Verify the phase transition during training (requires training checkpoints).
E4. Activation patching on factual recall. Pick a known factual prompt (e.g., “The capital of France is”). Implement activation patching across layers and positions. Identify the causal hotspots (layer × position pairs that restore the correct prediction). Compare to ROME’s findings on similar prompts.
E5. Investigate a refusal direction. Take a small RLHF-trained model (Llama-2-Chat or Gemma-Instruct). Following Arditi et al. (2024), extract the refusal direction. Project it out of the residual stream and observe the behavioural change on safety-relevant prompts. Add it to the residual stream on ambiguous prompts and observe increased refusal.
E6. Use automated interpretability. Take an existing SAE (Gemma Scope provides public SAEs). Pick 50 features. For each, get top-activating examples and use an LLM (Claude or GPT-4) to generate a label. Manually check the labels; compute agreement rate. Reflect on automated-interpretability quality.
E7. Path patching for circuit analysis. Pick a simple behaviour (e.g., subject-verb agreement). Use path patching to identify which inter-component paths are responsible. Build a circuit diagram showing the information flow.
E8. Steering vector experiments. Identify a feature direction associated with a target behaviour (e.g., “formal writing style”). Add it to the residual stream during inference; observe the behavioural change as a function of steering strength. Find the “sweet spot” between effective steering and incoherent output.
E9. Reproduce IOI on GPT-2 Small. Following Wang et al. (2022), implement the full IOI circuit analysis. Identify the four functional roles (duplicate-token heads, S-inhibition, name movers, backup name movers). Validate each via targeted ablation.
E10. Cross-model verification. Pick an MI finding (e.g., refusal directions, factual-recall localization, induction heads). Test whether the finding holds on a model from a different family (e.g., extend Anthropic findings to a Llama or Mistral model). Document where the finding generalizes and where it doesn’t.
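A starting scaffold for E4. The core loop of activation patching is: cache activations on a clean input, run a corrupted input, overwrite one activation at a time with its clean value, and record how much of the clean output is restored. The sketch below does this on a two-layer toy network with hand-set weights. The weights, inputs, and function names are assumptions for illustration; a real Transformer adds a position axis and requires hooking into the model's forward pass.

```python
W1 = [[1.0, 0.0], [0.0, 1.0]]  # layer-1 weights (toy values)
W2 = [[2.0, 0.0], [0.0, 1.0]]  # layer-2 weights (toy values)
W_OUT = [1.0, 1.0]             # readout weights (toy values)


def relu_layer(weights, inputs):
    """One linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(x, patch=None):
    """Run the toy net; patch maps (layer, unit) -> activation value to overwrite."""
    patch = patch or {}
    cache = {}
    h = x
    for layer, weights in ((1, W1), (2, W2)):
        h = relu_layer(weights, h)
        h = [patch.get((layer, i), v) for i, v in enumerate(h)]
        cache[layer] = h
    return sum(w * v for w, v in zip(W_OUT, h)), cache


clean_x, corrupt_x = [1.0, 0.5], [0.0, 0.5]
clean_out, clean_cache = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch each unit's clean activation into the corrupted run, one at a time.
restoration = {}
for layer, acts in clean_cache.items():
    for unit, clean_value in enumerate(acts):
        patched_out, _ = forward(corrupt_x, patch={(layer, unit): clean_value})
        restoration[(layer, unit)] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

for site, frac in sorted(restoration.items()):
    print(f"layer {site[0]} unit {site[1]}: restores {frac:.0%} of the clean output")
```

In this toy, unit 0 in each layer fully restores the output and unit 1 restores nothing, so the causal path runs through unit 0. For the exercise, the same loop runs over (layer, position) sites and the restoration fractions form the hotspot heatmap.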