Causality and Causal Inference

The chapter assumes some probability background (Bayesian networks, conditional independence) and undergraduate statistics. It does not assume prior exposure to causality.

Scope and What This Chapter Is About

The chapter develops causal inference and causal reasoning in AI and statistics — the framework for reasoning about interventions and counterfactuals rather than just correlations. We cover structural causal models, do-calculus, identification (backdoor, front-door, instrumental variables), counterfactual reasoning, causal discovery, causal representation learning, the application of causality to mainstream ML (treatment effects, off-policy evaluation, fairness), and the open question of whether modern LLMs can reason causally. Open problems are flagged inline and consolidated in §14.

§1. Motivation and Scope

A worked example to anchor everything

Three observations about a hospital. We will return to this example throughout the chapter.

Patients with severe pneumonia and asthma are less likely to die than pneumonia patients without asthma. The hospital records this association clearly: a logistic regression on patient outcomes confirms that asthma is a protective factor for pneumonia mortality.

Should asthma-pneumonia patients therefore be sent home with less aggressive treatment than non-asthmatic pneumonia patients? No. That reading of the association gets the causal story backwards: hospital clinicians know that asthma increases pneumonia risk, so they triage asthma-pneumonia patients to more aggressive care (intensive monitoring, earlier antibiotics, ICU admission). The aggressive care reduces mortality enough to outweigh the added risk that asthma carries. The observed protective association is an artefact of the treatment policy, not a property of asthma.

A model trained to predict mortality from this data correctly captures the association: predicted mortality is lower for asthma-pneumonia patients. A clinical decision system that acts on this prediction — recommending less aggressive care for asthma-pneumonia patients — would increase their mortality. The model is correlationally correct and causally wrong. Acting on the model produces harm.

This is the Caruana asthma example (Caruana et al., 2015), one of the canonical illustrations of why correlation-based machine learning is not sufficient for decision-making. The same data supports diametrically opposite recommendations depending on whether we read it correlationally (“asthma is protective; treat less aggressively”) or causally (“asthma increases risk; the protective association is an artefact of triage policy”). The data alone cannot distinguish the two.
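The mechanism can be made concrete with a small simulation. The sketch below is illustrative only: the probabilities, the function names, and the two policies are assumptions chosen for the example, not figures from Caruana et al. It generates data under the historical triage policy, where asthma looks protective, and then shows what happens when a policy acts on that association.

```python
import random

def dies(rng, asthma, aggressive_care):
    """Sample one outcome. All probabilities here are illustrative assumptions."""
    p_death = 0.10                                # assumed baseline pneumonia mortality
    p_death += 0.10 if asthma else 0.0            # asthma raises the underlying risk
    p_death -= 0.15 if aggressive_care else 0.0   # aggressive care lowers it
    return rng.random() < p_death

def mortality_by_asthma(policy, n=200_000, seed=0):
    """Mortality rate for asthma (True) and non-asthma (False) patients under a care policy."""
    rng = random.Random(seed)
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        asthma = rng.random() < 0.3
        counts[asthma] += 1
        deaths[asthma] += dies(rng, asthma, policy(asthma))
    return {a: deaths[a] / counts[a] for a in (True, False)}

def triage_policy(asthma):
    """Historical policy: asthma patients receive aggressive care."""
    return asthma

def naive_policy(asthma):
    """Policy suggested by the correlational reading: no one receives aggressive care."""
    return False

observed = mortality_by_asthma(triage_policy)   # the data a predictive model is trained on
deployed = mortality_by_asthma(naive_policy)    # the world after acting on the prediction

print("under triage: asthma", round(observed[True], 3), "no asthma", round(observed[False], 3))
print("after acting: asthma", round(deployed[True], 3), "no asthma", round(deployed[False], 3))
```

Under the triage policy the asthma group has the lower death rate; once the policy changes, the same patients have the higher one. The predictive model was never wrong about the first data set. It simply has nothing to say about the second.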

What causality is

A working definition. Causal reasoning is the framework for reasoning about interventions and counterfactuals — what would happen if we changed a variable, or what would have happened in a different world — rather than just about correlations in observed data.

The key distinction. Standard machine learning is observational: given samples from a joint distribution $P(X, Y)$, predict $Y$ from $X$. Causality is interventional: given the same data plus a causal model of how variables relate, predict what would happen if we intervened on $X$, that is, set it to a specific value rather than merely observing it. The intervention distribution $P(Y \mid \text{do}(X = x))$ differs in general from the observational conditional $P(Y \mid X = x)$, and the gap is sometimes the difference between helpful and harmful decisions.

Causality also encompasses counterfactual reasoning: given that we observed a particular outcome, what would have happened if a particular variable had been different? Counterfactuals are even further from standard ML than interventions, because they ask about parallel worlds rather than the actual one.

The framework is mathematically rigorous. Pearl’s structural causal model (SCM) formalism (developed in §4) gives a clean language for expressing causal assumptions, deriving identifiable causal quantities, and computing counterfactuals when an SCM is given. Rubin’s potential-outcomes framework (also developed in §4) gives an equivalent but differently-formulated approach focused on treatment-effect estimation.

Correlation vs causation: making the distinction precise

The catchphrase “correlation does not imply causation” is widely repeated. The technical content is worth unpacking.

A correlation $P(Y \mid X = x_1) \neq P(Y \mid X = x_2)$ between two variables can arise from:

1. $X$ causes $Y$. Manipulating $X$ changes $Y$.
2. $Y$ causes $X$. The direction of dependence is the reverse of the one we want to exploit.
3. A common cause (confounder) $Z$ causes both $X$ and $Y$. Neither $X$ nor $Y$ has a causal influence on the other; their correlation is induced by $Z$.
4. Conditioning on a common effect (collider) $W$ caused by both $X$ and $Y$. Selection on $W$ induces correlation between $X$ and $Y$ even if they are independent in the broader population.

The asthma example is case (4): the data is from a hospital where asthma-pneumonia patients are selected for aggressive treatment; the selection process induces the spurious protective association.

Causal reasoning is precisely the framework that distinguishes these four cases (and combinations of them) and tells us which causal interpretations the data is consistent with. Standard ML cannot distinguish them, because all four can produce the same observed correlations. Distinguishing requires either additional assumptions (a causal model encoding background knowledge) or additional data (interventional experiments, instrumental variables, randomization).
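The confounder and collider cases are the least intuitive, so a short simulation may help. The sketch below assumes linear-Gaussian mechanisms and an arbitrary selection threshold, both chosen purely for illustration. In neither scenario does $X$ have any causal effect on $Y$, yet a correlation appears in the analysed data.

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n = 50_000

# Confounder: Z drives both X and Y; X has no effect on Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_conf = [zi + rng.gauss(0, 1) for zi in z]
y_conf = [zi + rng.gauss(0, 1) for zi in z]
corr_confounded = corr(x_conf, y_conf)

# Collider: X and Y are independent; W = X + Y + noise is their common effect.
x_ind = [rng.gauss(0, 1) for _ in range(n)]
y_ind = [rng.gauss(0, 1) for _ in range(n)]
corr_population = corr(x_ind, y_ind)

# Keep only units with W > 1 (an assumed selection rule, e.g. "admitted to the study").
selected = [(xi, yi) for xi, yi in zip(x_ind, y_ind)
            if xi + yi + rng.gauss(0, 1) > 1.0]
corr_selected = corr([s[0] for s in selected], [s[1] for s in selected])

print("confounded:", round(corr_confounded, 2))
print("independent, full population:", round(corr_population, 2))
print("independent, after selecting on the collider:", round(corr_selected, 2))
```

The confounded pair is positively correlated, and the independent pair becomes negatively correlated once the sample is restricted by the collider. A model fitted to either data set would report a dependence that no intervention on $X$ could exploit.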

Why causality matters for AI in 2026

Four motivations for a research-oriented practitioner.

1. Out-of-distribution generalization. Standard ML generalizes well within the training distribution and degrades sharply outside it (DL §12, Theoretical Foundations §8, OP-TH-8). Many distribution shifts are causally interpretable — they correspond to interventions on the data-generating process (a new hospital, a different time period, a covariate-shifted population). Models that explicitly capture the causal structure of the data can sometimes generalize across such shifts where correlation-based models fail. The Schölkopf-Bengio “Independent Causal Mechanisms” hypothesis (§9) is the most-developed framework here.

2. Decision-making and policy. Whenever an AI system’s output is acted on (a clinical decision, a policy recommendation, a treatment assignment, a hiring decision), the relevant question is interventional: what happens if I deploy this policy? Correlational predictions can give misleading answers, as the asthma example shows. AI applied to consequential decisions needs to be causal-aware.

3. Fairness and recourse. Fair classification is often most naturally expressed causally: would this person’s prediction change if their protected attribute had been different, holding other things equal? This is a counterfactual question (§7), and the technical framework for answering it is causal. Recourse — telling users what they could change to flip a model’s decision — is also fundamentally about interventions.

4. Scientific discovery and LLM reasoning. AI for Science (the previous chapter) frequently makes claims that are correlational but consumed as causal — “this molecule causes that property”, “this gene variant causes that disease”. The boundary between AI4S’s correlational predictions and the causal claims downstream applications need is one of the field’s open challenges (OP-S-8). Separately, foundation models are increasingly used for reasoning tasks; whether LLMs can reason causally (vs just textually mimicking causal language) is an open empirical question (§11).

The two dominant traditions

Causality has two largely-parallel research traditions that have substantial mathematical overlap but distinct vocabulary, conventions, and intellectual lineage. Both are essential to a complete understanding of the field.

The Pearl tradition: structural causal models and graphs. Originating with Judea Pearl’s work from the late 1980s onward, this tradition represents causal assumptions as a directed acyclic graph (DAG) whose nodes are variables and whose edges encode direct causal influences. Computations operate on the graph: identification of causal effects, derivation of testable implications, computation of counterfactuals. The do-calculus (Pearl, 1995) is the formal rule system. The Book of Why (Pearl and Mackenzie, 2018) and the technical Causality: Models, Reasoning, and Inference (Pearl, 2009) are the canonical references. Substantial uptake in computer science, AI, and applied epidemiology.

The Rubin tradition: potential outcomes. Originating with Donald Rubin’s 1974 paper and extended through decades of work in statistics and econometrics, this tradition represents causal quantities as potential outcomes — the outcome each unit would have under each treatment assignment. The framework focuses on treatment-effect estimation and is deeply integrated with statistical practice. Imbens and Rubin (2015) is the canonical reference. Substantial uptake in economics, biostatistics, and clinical research.

The relationship. The two frameworks are mathematically equivalent for the same causal queries — anything one can express, the other can express. They differ in emphasis, typical use cases, and vocabulary. Pearl’s framework is more natural for causal-graph reasoning, complex causal structures, and counterfactual computation. Rubin’s is more natural for treatment-effect estimation, randomized experiments, and applied statistics. Recent textbooks (Hernán and Robins, 2020 “Causal Inference: What If”) bridge both. We develop the Pearl framework first because the graphical machinery is most useful for AI applications; §6 covers the Rubin-aligned estimation tradition.

The “Pearl-vs-Rubin” intellectual division is partly substantive (different frameworks for different problems) and partly cultural (different communities with different conventions). The chapter does not adjudicate; it develops both as complementary perspectives on the same underlying mathematics.

Boundaries with adjacent chapters

This chapter has substantial connections with several others.

Probabilistic Reasoning (planned, originally AIMA Ch 13–14) develops Bayesian networks as the probabilistic substrate. Causal DAGs are a special interpretation of Bayesian networks where edges encode causal influence rather than just statistical dependence. This chapter assumes Bayesian-network background but develops what is causally distinctive.
Theoretical Foundations of Learning §8 develops the modern generalization puzzle. Causal frameworks (causal invariance, ICM) provide one approach to OOD generalization (OP-TH-8); this chapter develops the causal side, the Theory chapter develops the ML-side.
Reinforcement Learning §10 develops off-policy evaluation; the causal framing of OPE (treat past data as observational, target policy as intervention) is one of the cleanest cases where causal machinery applies to a standard ML problem.
Foundation Models and Large Language Models raise the question of whether modern LLMs reason causally. §11 of this chapter develops the empirical evidence; the FM and LLM chapters’ open problems (OP-FM-14, OP-LLM-N) cross-reference.
AI for Science uses causal frameworks for some scientific applications (epidemiology, public health, structural biology of regulatory networks). The chapter §13 critique of AI4S as “curve-fitting not causation” is directly relevant.
Alignment / Ethics treats fairness and recourse, both of which have natural causal framings developed in §10 of this chapter.
Statistics in the traditional sense is the broader field. This chapter develops the causal subset relevant to AI; the broader statistics references in §15 are entry points.

What this chapter does not try to do

Several explicit exclusions.

We do not develop the philosophy of causation in depth. The metaphysical questions (does the world have causal structure independent of our models? what does “cause” mean?) live in philosophy of science. We adopt the working stance that causal models are useful abstractions for reasoning and decision without taking a strong metaphysical position.
We do not provide a complete statistical-methodology treatment. Imbens-Rubin (~700 pages) and Hernán-Robins (~600 pages) are the references for that. We develop the causal framework with applied AI emphasis.
We do not cover every causal-ML technique. The literature is large; we develop the dominant techniques (backdoor, instrumental variables, double ML, causal forests, do-calculus) and refer to surveys for breadth.
We do not treat experimental design in depth — A/B testing, multi-armed bandits as causal-inference tools, sequential randomized trials. These are real and important; they are best treated in dedicated references.
We do not develop the political-economy implications of causal reasoning in deployed AI (who decides which causal assumptions count? whose data informs the model?). These are real questions; they live in the Alignment / Ethics chapter.

Position taken in this chapter

The chapter takes a Pearl-aligned-first organizational stance — structural causal models, DAGs, and do-calculus are developed first because the graphical machinery is most useful for AI applications. The Rubin tradition is integrated where it provides cleaner exposition (especially for treatment-effect estimation in §6). The chapter is neutral on the Pearl-vs-Rubin debate; both are presented as complementary frameworks on the same underlying mathematics.

The chapter is also cautious about causal-ML claims. Many published “causal ML” results require strong assumptions (no unmeasured confounders, correct functional forms, as-if-random treatment assignment) that are not verified in the application setting. The chapter develops both the techniques and the assumptions; the assumptions are non-negotiable for the techniques to give correct causal answers.

§2. Historical Context

This section traces causal reasoning from its philosophical roots through the modern statistical and AI formulations. The modern frameworks (Pearl, Rubin, and their ML descendants) emerged from substantial intellectual conflict and synthesis, and that history explains why the field has its current structure.

A timeline of the inflection points:

1748: Hume’s Enquiry. Hume’s ‘Enquiry Concerning Human Understanding’: classical philosophical account of causation as constant conjunction.
1843: Mill’s System of Logic. J.S. Mill’s ‘A System of Logic’: Mill’s Methods of agreement, difference, etc.
1921: Sewall Wright’s path analysis. Graphical representation of correlations among variables, with edge weights representing causal effects. Foundation of modern graphical methods.
1925: Fisher’s randomized controlled trial. R.A. Fisher’s ‘Statistical Methods for Research Workers’: establishes the RCT as the gold standard for causal inference.
1944: Haavelmo’s structural econometrics. Haavelmo’s ‘The probability approach in econometrics’: structural equations for causal inference in economics. Becomes the basis for structural econometrics.
1974: Rubin’s potential-outcomes framework. Rubin’s ‘Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies’: potential-outcomes framework. The Stable Unit Treatment Value Assumption (SUTVA).
1978–1980s: Rubin causal model matures. Propensity scores (Rosenbaum and Rubin, 1983); causal inference becomes a substantive subfield of statistics.
1988: Pearl’s Bayesian networks. Pearl’s ‘Probabilistic Reasoning in Intelligent Systems’: Bayesian networks as the substrate for AI; partial development of causal semantics.
1995: Do-calculus introduced. Pearl’s ‘Causal Diagrams for Empirical Research’ introduces the do-operator and the three rules of do-calculus. Causality becomes formally tractable on graphs.
2000: Pearl’s Causality (book). Pearl’s ‘Causality: Models, Reasoning, and Inference’: consolidation of the SCM framework, do-calculus, counterfactuals.
2000s: Applied causal inference matures. Causal-effect estimation matures in econometrics and epidemiology. Tian-Pearl ID algorithm (2002) for general identifiability. Rubin and Pearl frameworks develop in parallel; some conflict, some synthesis.
2009: Field at textbook maturity. Pearl’s ‘Causality’ 2nd edition; Imbens and Wooldridge’s review of causal inference in econometrics.
2010s: Causal discovery matures. PC, FCI, GES, LiNGAM algorithms established. Targeted maximum likelihood estimation (van der Laan). Causal inference enters mainstream statistics curricula.
2015: Causality in ML (the asthma example). Caruana et al. ‘Intelligible Models for Healthcare’: the asthma-pneumonia example becomes the canonical motivation for causality in ML.
2016–2017: ML-for-causal-inference emerges. Double ML (Chernozhukov et al., 2016), causal forests (Wager and Athey, 2017), neural network estimators for ATE.
2017–2019: Causal representation learning programme. Schölkopf, Bengio, et al. develop causal representation learning as a research programme. ICM hypothesis formalized. Connection between OOD generalization and causal invariance.
2018: The Book of Why. Pearl’s ‘Book of Why’ (with Mackenzie): popular-press exposition of causation, reaches broader AI audience.
2020: Hernán and Robins synthesis. ‘Causal Inference: What If’ bridges Pearl and Rubin frameworks for an applied audience.
2022–2024: LLM-causality benchmarks. Empirical work on whether LLMs can reason causally. CRASS, CLadder, CausalBench benchmarks. Substantial evidence that LLMs handle textually-described causal scenarios well but reason causally less well than the textual performance suggests.
2024–2026: Causal ML as mature subfield. Double-ML in production. Causal-representation-learning theory advances (Khemakhem et al. identifiability; iVAE, contrastive identifiability). Open question of causality-at-FM-scale remains active.

We develop each phase below.

Early philosophy: Hume and Mill

The philosophical question of causation predates statistics by centuries. David Hume’s An Enquiry Concerning Human Understanding (1748) gave the canonical sceptical account: we observe constant conjunction (A is followed by B, repeatedly) but never observe causation itself; causation is a habit of the mind, not a feature of the external world. Hume’s framing has been influential and is the source of much modern scepticism about causal claims.

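John Stuart Mill’s A System of Logic (1843) developed practical methods for causal inference from observational data: the method of agreement (the one circumstance shared by every case in which the effect occurs is its likely cause), the method of difference (the one circumstance that differs between a case where the effect occurs and an otherwise similar case where it does not is its likely cause), the method of residues, and the method of concomitant variation. Mill’s methods are the conceptual ancestors of modern statistical methods for causal inference and remain pedagogically useful.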

These two threads — Hume’s scepticism and Mill’s methods — set up the modern situation: we cannot directly observe causation, but we can reason about it under appropriate assumptions about the data-generating process.

Wright and Fisher: graphical and experimental foundations

The 20th-century statistical formalization. Sewall Wright’s path analysis (1921, 1934) introduced graphical representations of causal relationships: variables as nodes, causal influences as directed edges, with the edges labelled by quantitative effects. Wright’s path tracing rules allowed reading correlations off the graph. This was the first formal use of directed graphs for causal modelling and is the direct ancestor of Pearl’s structural causal models.

R.A. Fisher’s Statistical Methods for Research Workers (1925) established the randomized controlled trial (RCT) as the gold standard for causal inference. The argument: if treatment is randomly assigned, the treatment group and the control group are statistically indistinguishable in all baseline variables (in expectation); any subsequent difference in outcomes must be due to the treatment. Randomization gives causal identifiability by construction, side-stepping confounding entirely.

Fisher’s argument is mathematically airtight but operationally limited — we cannot run RCTs on most important questions (cannot randomize people to smoke; cannot randomize countries to economic policies). The 20th-century statistical project was largely about extending causal inference beyond the RCT setting: how to infer causal effects from observational data with appropriate assumptions.

Haavelmo and structural econometrics

A separate strand. Trygve Haavelmo’s 1944 paper “The probability approach in econometrics” introduced structural equations as a causal framework for economics. The idea: write down a system of equations expressing each variable as a function of others; estimate the equations from data; use the estimated structure for counterfactual reasoning (“what if we changed the tax rate?”).

Haavelmo’s framework became structural econometrics and remains the basis for much economic policy analysis. The econometric tradition developed largely in parallel with the statistical-causality tradition until they substantially merged in the 1990s–2000s.

Rubin’s potential outcomes (1974)

The modern statistical framework. Donald Rubin’s 1974 paper “Estimating causal effects of treatments in randomized and nonrandomized studies” introduced potential outcomes.

The core idea. For a binary treatment $T \in \{0, 1\}$ , define two potential outcomes for each unit: $Y(0)$ (the outcome if untreated) and $Y(1)$ (the outcome if treated). For any given unit, we observe only one of these (the one corresponding to the actual treatment). The individual treatment effect is $Y(1) - Y(0)$ — but we never observe both.

The fundamental problem of causal inference: we cannot observe both potential outcomes for the same unit. Causal inference is therefore an inference problem about unobserved counterfactuals.

The Rubin framework specifies what assumptions allow population-level causal effects (the average treatment effect, ATE) to be estimated from observed data. The Stable Unit Treatment Value Assumption (SUTVA) — units don’t affect each other, treatments are well-defined — plus ignorability (treatment assignment is independent of potential outcomes given observed covariates) give the identifiability that allows estimation. Propensity scores (Rosenbaum and Rubin, 1983) provide the practical estimation technology.
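A simulation is the one setting where both potential outcomes are visible, which makes it a convenient way to see what ignorability buys. The sketch below is a toy under stated assumptions: a single binary covariate, a constant treatment effect `TRUE_ATE`, and a propensity score that is known rather than estimated. It compares the naive difference in means with an inverse-probability-weighted (IPW) estimate.

```python
import random

rng = random.Random(1)
n = 100_000
TRUE_ATE = 2.0   # assumed constant effect, so Y(1) - Y(0) = 2 for every unit

units = []
for _ in range(n):
    x = rng.random() < 0.5                        # observed binary covariate
    y0 = (3.0 if x else 0.0) + rng.gauss(0, 1)    # potential outcome if untreated
    y1 = y0 + TRUE_ATE                            # potential outcome if treated
    e = 0.8 if x else 0.2                         # propensity P(T = 1 | X = x)
    t = rng.random() < e                          # assignment depends on X: confounded
    y = y1 if t else y0                           # only one potential outcome is observed
    units.append((x, t, y, e))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive contrast: compares groups that differ in X as well as in treatment.
naive = (mean(y for _, t, y, _ in units if t)
         - mean(y for _, t, y, _ in units if not t))

# IPW estimate of the ATE, valid here because assignment is ignorable given X
# and the propensity e is known by construction.
ipw = mean(t * y / e - (1 - t) * y / (1 - e) for _, t, y, e in units)

print("true ATE:", TRUE_ATE, "naive:", round(naive, 2), "IPW:", round(ipw, 2))
```

The naive contrast is far from the true effect because treated units are mostly those with the covariate present, while the weighted estimate lands close to it. In applied work the propensity score has to be estimated, and the result is only as good as the ignorability assumption behind it.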

The Rubin tradition became dominant in applied causal inference — biostatistics, epidemiology, clinical research, applied microeconomics. It is the framework most working causal-inference practitioners use today.

Pearl’s structural causal models (1988–2009)

The graphical-AI tradition. Judea Pearl’s Probabilistic Reasoning in Intelligent Systems (1988) introduced Bayesian networks to AI as a substrate for probabilistic reasoning. The 1990s saw Pearl extend Bayesian networks with causal semantics: a directed edge from $X$ to $Y$ does not just encode “X is associated with Y” but “X is a direct cause of Y”.

The do-operator (Pearl, 1995) was the key formal advance. Define $\text{do}(X = x)$ as the intervention that sets $X$ to value $x$ — overriding whatever value $X$ would have taken from its natural causes. The intervention distribution $P(Y \mid \text{do}(X = x))$ is different from the observational conditional $P(Y \mid X = x)$ in general; the do-operator gives the formal vocabulary for this distinction.

The three rules of do-calculus (Pearl, 1995) give a complete set of axioms for manipulating interventional distributions in terms of observational ones. The identifiability problem — given a causal DAG, can $P(Y \mid \text{do}(X = x))$ be expressed in terms of observational distributions? — has a clean algorithmic answer (the ID algorithm, Tian and Pearl, 2002).
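For reference, the three rules can be stated compactly. The notation below is the usual one and is not defined elsewhere in this section, so treat it as an assumption of this sketch: $G_{\overline{X}}$ is the graph with all edges into $X$ removed, $G_{\underline{Z}}$ is the graph with all edges out of $Z$ removed, and $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$.

```latex
\begin{align*}
\text{Rule 1 (insert/delete observation):}\quad
  & P(y \mid \text{do}(x), z, w) = P(y \mid \text{do}(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2 (exchange action and observation):}\quad
  & P(y \mid \text{do}(x), \text{do}(z), w) = P(y \mid \text{do}(x), z, w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3 (insert/delete action):}\quad
  & P(y \mid \text{do}(x), \text{do}(z), w) = P(y \mid \text{do}(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```

An identification argument applies these rules repeatedly until no do-operator remains on the right-hand side; if that is possible, the causal effect is expressible in observational terms.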

Pearl’s Causality: Models, Reasoning, and Inference (2000; 2nd ed. 2009) consolidated the framework into textbook form. The book is dense, technically deep, and substantially shaped the AI and CS approach to causality.

The Pearl-vs-Rubin tension

For two decades the two frameworks developed in parallel with limited dialogue. Rubin dismissed Pearl’s graph framework as “dangerous”; Pearl dismissed potential outcomes as overcomplicated and obscure. The two communities (statistics + applied causal inference vs CS + AI) had different conventions, different vocabulary, and different journals.

Subsequent work has made clear that the frameworks are mathematically equivalent for the same questions. Potential outcomes can be defined within Pearl’s framework as the solutions of the structural equations under an intervention, so an SCM expresses individual counterfactuals; conversely, Rubin’s framework can be reformulated in graphical terms. Hernán and Robins (2020) “Causal Inference: What If”, the modern applied-causal-inference textbook, uses both frameworks side-by-side, treating them as complementary.

The intellectual conflict has cooled but is not entirely gone. Many working statisticians still favour Rubin; many AI researchers still favour Pearl; many causal-ML papers are framed in one tradition without engaging the other. The chapter treats both as part of the same field.

The 2000s rise of applied causal inference

Through the 2000s, causal-effect estimation matured into a substantial applied subfield. Several threads:

Propensity-score methods became standard in epidemiology and economics. Inverse-probability-weighting estimators, matching estimators, and doubly-robust estimators were developed and refined.
Instrumental-variables methods matured in econometrics, with the Imbens-Angrist work on local average treatment effects.
Mediation analysis developed as a formal framework for decomposing causal effects into pathways.
Causal discovery matured: PC algorithm (Spirtes, Glymour, Scheines), FCI for handling unmeasured confounders, GES for score-based discovery.

By 2010, applied causal inference was a substantial subfield of statistics with its own journals, conferences, and methodological literature. Most working scientists in the relevant applied domains use these methods, though adoption is uneven across fields.

The 2015–2020 ML revival

Around 2015, the ML community substantially re-engaged with causality. Three triggers:

Caruana et al. (2015), the asthma example. The paper made vivid the practical risk of correlation-without-causation in deployed AI; it became the canonical motivation for causal AI.

Double ML and causal forests. Chernozhukov et al. (2016) “Double Machine Learning” gave a recipe for combining flexible ML methods (gradient-boosted trees, neural networks) with causal-inference machinery to estimate average treatment effects with valid inference. Wager and Athey (2017) “Causal forests” gave a related technique for heterogeneous treatment effects. These methods made ML-driven causal inference practical.

Causal representation learning. Schölkopf, Bengio, et al. developed the research programme of causal representation learning — extracting causal variables and structure from raw observational data (pixels, sentences). The Independent Causal Mechanisms (ICM) hypothesis (Schölkopf et al., 2012; refined through 2019) gave a theoretical scaffold for connecting causality to ML.
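The Double ML recipe mentioned above can be sketched in a few lines for the partially linear model $Y = \theta T + g(X) + \varepsilon$. The sketch below is not the authors' implementation: it uses a histogram regressor (`fit_binned_mean`, a hypothetical helper) as a stand-in for the flexible ML learners named in the text, and the data-generating process and the value `THETA` are assumptions made for illustration. The two ingredients it does show are residualizing both treatment and outcome on the covariates, and cross-fitting so that each residual uses a model trained on the other fold.

```python
import math
import random

def fit_binned_mean(xs, vs, lo=-2.0, hi=2.0, bins=40):
    """Histogram regression of v on x; a stand-in for any flexible regressor."""
    sums, counts = [0.0] * bins, [0] * bins
    width = (hi - lo) / bins

    def index(x):
        return min(bins - 1, max(0, int((x - lo) / width)))

    for x, v in zip(xs, vs):
        sums[index(x)] += v
        counts[index(x)] += 1
    overall = sum(vs) / len(vs)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[index(x)]

rng = random.Random(0)
n, THETA = 20_000, 1.5                      # THETA is the assumed true effect
X = [rng.uniform(-2, 2) for _ in range(n)]
T = [math.sin(2 * x) + rng.gauss(0, 1) for x in X]             # treatment depends on X
Y = [THETA * t + 2 * math.sin(2 * x) + x * x + rng.gauss(0, 1)
     for x, t in zip(X, T)]                                     # X also affects Y

# Two-fold cross-fitting: nuisance models are trained on one fold, applied to the other.
folds = [list(range(0, n, 2)), list(range(1, n, 2))]
res_t, res_y = [0.0] * n, [0.0] * n
for k in (0, 1):
    train, held_out = folds[1 - k], folds[k]
    m_hat = fit_binned_mean([X[i] for i in train], [T[i] for i in train])  # E[T | X]
    l_hat = fit_binned_mean([X[i] for i in train], [Y[i] for i in train])  # E[Y | X]
    for i in held_out:
        res_t[i] = T[i] - m_hat(X[i])
        res_y[i] = Y[i] - l_hat(X[i])

# Final stage: regress outcome residuals on treatment residuals.
theta_dml = sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

# For comparison: slope of Y on T that ignores X entirely.
mt, my = sum(T) / n, sum(Y) / n
theta_naive = (sum((t - mt) * (y - my) for t, y in zip(T, Y))
               / sum((t - mt) ** 2 for t in T))

print("true:", THETA, "naive:", round(theta_naive, 2), "cross-fitted:", round(theta_dml, 2))
```

The naive slope absorbs the part of the outcome that $X$ drives through the same nonlinear term as the treatment, while the residual-on-residual estimate recovers a value close to the assumed effect. The estimate is causal only if $X$ contains all confounders, which the procedure itself cannot check.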

2020–2026: causal LLMs and the integration

The most recent inflection. As foundation models scaled, the question arose: can these models reason causally, or are they limited to surface-level pattern-matching?

The empirical work has been substantial. Several benchmarks (CRASS, CLadder, CausalBench, CausalChain) probe LLMs’ causal reasoning. The findings: LLMs handle textually-described causal scenarios with moderate competence — they can answer questions like “if I push the cart, what happens?” reasonably well. But they consistently fail at formally-stated causal-inference problems — questions that require do-calculus reasoning, identification of confounders, or counterfactual computation.

The interpretation is contested. Some argue LLMs are learning causal structure implicit in language and could improve with scale; others argue they are pattern-matching on causal-language surface forms without internal causal representation. The empirical evidence is mixed; the chapter develops this in §11.

Where this leaves us in 2026

The current state. Causality is a mature mathematical and statistical subfield with a clean theoretical foundation (SCMs, potential outcomes, do-calculus) and a substantial applied toolkit (propensity scores, IV, double ML, causal forests). The Pearl-vs-Rubin division persists as a cultural artefact but no longer reflects a mathematical disagreement. Causal representation learning is an active research area; causal reasoning in foundation models is an open frontier.

The remaining sections of this chapter develop the technical material. §3 covers the ladder of causation as a conceptual framework. §4–§5 develop SCMs, do-calculus, and identification. §6 develops causal effect estimation. §7 develops counterfactuals. §8 covers causal discovery. §9 covers causal representation learning. §10 covers causality applied to mainstream ML problems. §11 covers causality in foundation models. §12–§16 close out.

Editorial note. Causality is a field where intellectual heritage and tradition matter substantially. The Pearl and Rubin frameworks have different conventions; modern papers may use either or both. We have tried to be even-handed; readers from one tradition may find some of the other tradition’s conventions unfamiliar. The references in §15 include entry points to both traditions.

§3. The Ladder of Causation

Pearl’s organizing framework for the kinds of questions causal reasoning can answer. The ladder has three rungs; each represents a qualitatively different reasoning capability. The framework is conceptual rather than technical; the technical machinery (SCMs, do-calculus, counterfactuals) of §4 onward operationalizes the ladder.

The three rungs

Rung 1: Association. Observing patterns in data. Among people who take aspirin, what fraction have a headache? Signature: $P(Y \mid X = x)$; requires the joint distribution $P(X, Y)$.

Rung 2: Intervention. Reasoning about what happens if we act. If I take aspirin, will my headache go away? Signature: $P(Y \mid \text{do}(X = x))$; requires a causal graph plus identifiability.

Rung 3: Counterfactual. Imagining what would have been. What would my headache be like if I hadn’t taken aspirin? Signature: $P(Y_x \mid X = x', Y = y)$; requires a full SCM (structural equations).

The crucial property: each rung cannot be deduced from the rung below. Knowing the joint distribution $P(X, Y)$ does not determine $P(Y \mid \text{do}(X = x))$ ; knowing the intervention distribution does not determine counterfactual quantities. To move up the ladder, additional information (structural assumptions, a causal graph, experimental data) is required.
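The gap between the first two rungs can be shown with two toy models that agree on every observational quantity and disagree under intervention. Both models below are assumptions constructed for this illustration: in model A the arrow runs from $X$ to $Y$, in model B from $Y$ to $X$, and the parameters are chosen so that the joint distribution over the two binary variables is identical.

```python
import random

def sample_model_a(rng, do_x=None):
    """X -> Y: X is a fair coin; Y copies X with probability 0.9."""
    x = (rng.random() < 0.5) if do_x is None else do_x
    y = x if rng.random() < 0.9 else (not x)
    return x, y

def sample_model_b(rng, do_x=None):
    """Y -> X: Y is a fair coin; X copies Y with probability 0.9."""
    y = rng.random() < 0.5
    x_natural = y if rng.random() < 0.9 else (not y)
    x = x_natural if do_x is None else do_x   # do(X) overrides X's own mechanism
    return x, y

def p_y_given_x1(sampler, n=100_000, seed=0):
    """Observational P(Y = 1 | X = 1), estimated by filtering samples."""
    rng = random.Random(seed)
    ys = [y for x, y in (sampler(rng) for _ in range(n)) if x]
    return sum(ys) / len(ys)

def p_y_do_x1(sampler, n=100_000, seed=0):
    """Interventional P(Y = 1 | do(X = 1)), estimated by forcing X."""
    rng = random.Random(seed)
    return sum(sampler(rng, do_x=True)[1] for _ in range(n)) / n

obs_a, obs_b = p_y_given_x1(sample_model_a), p_y_given_x1(sample_model_b)
do_a, do_b = p_y_do_x1(sample_model_a), p_y_do_x1(sample_model_b)

print("P(Y=1 | X=1):     A", round(obs_a, 2), " B", round(obs_b, 2))
print("P(Y=1 | do(X=1)): A", round(do_a, 2), " B", round(do_b, 2))
```

No amount of observational data separates the two models, since their joint distributions coincide, yet forcing $X$ moves $Y$ in model A and leaves it at its marginal in model B. The extra information needed to choose between them is exactly the causal direction.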

Rung 1: Association

The first rung asks: what is correlated with what? The relevant quantity is the conditional probability $P(Y \mid X = x)$: how does the distribution of $Y$ change when we observe $X = x$?

Standard machine learning operates almost entirely at this level. A trained classifier $f(x) \to y$ approximates $\arg\max_y P(Y = y \mid X = x)$, which is pure association. The classifier does not know whether $X$ causes $Y$, whether $Y$ causes $X$, or whether they share a common cause; it has access only to an estimate of their joint distribution.

The asthma example reframed at Rung 1: the data exhibits $P(\text{died} \mid \text{has-asthma}, \text{has-pneumonia}) < P(\text{died} \mid \neg\text{has-asthma}, \text{has-pneumonia})$ . This is a true Rung-1 statement — the correlation in the data is real. The error was in interpreting the correlation causally (Rung 2).

The Rung 1 toolkit. Statistical learning theory, probabilistic graphical models for conditional independence, kernel methods, deep learning, most of modern ML. None of these tools, on their own, distinguish causation from correlation. They are computationally and theoretically rich at Rung 1 and silent at higher rungs.

Rung 2: Intervention

The second rung asks: what would happen if we made $X$ take a specific value? The relevant quantity is the intervention distribution $P(Y \mid \text{do}(X = x))$, read as “the distribution of $Y$ when we set $X$ to $x$, overriding whatever $X$ would have been from its natural causes.”

The crucial point: $P(Y \mid \text{do}(X = x)) \neq P(Y \mid X = x)$ in general. The conditional $P(Y \mid X = x)$ describes what happens given that $X$ was observed to be $x$ — the observation comes from $X$ ’s natural causal mechanism. The intervention distribution describes what happens given that $X$ is forced to be $x$ — overriding the natural mechanism.

Two examples make the distinction vivid.

Example: barometer and storm. A barometer reading falls before a storm. The conditional probability $P(\text{storm} \mid \text{barometer-low})$ is high — observing a low barometer is informative about storms. The intervention distribution $P(\text{storm} \mid \text{do}(\text{barometer-low}))$ is not high — forcing the barometer to a low value (by removing some mercury) does not cause a storm.

Example: hospital admission and recovery. Among hospitalized patients, $P(\text{recovers} \mid \text{has-severe-symptoms})$ is lower than $P(\text{recovers} \mid \neg\text{has-severe-symptoms})$: severe symptoms are associated with worse outcomes. The intervention $\text{do}(\text{has-severe-symptoms})$, making a patient have severe symptoms, would presumably also lower recovery, so here the two quantities agree in direction. They need not agree in magnitude, because part of the observed gap may come from common causes of severity and recovery rather than from severity itself. In general the observational conditional and the intervention distribution may agree in direction but differ in magnitude, or they may disagree entirely (as in the asthma case).

The Rung 2 toolkit. The do-calculus (§4), the backdoor and front-door criteria (§5), instrumental-variables methods, randomized experiments (where intervention is constructed by random assignment), and the entire framework of causal effect estimation.

The deep observation: Rung 2 questions can be answered from Rung 1 data only with additional causal assumptions. The additional structure is what a causal graph or an SCM provides.

Rung 3: Counterfactual

The third rung asks: what would have happened if a particular variable had been different, given that we observed a specific outcome? The relevant quantity is the counterfactual probability $P(Y_{x'} \mid X = x, Y = y)$: “given that we observed $X = x$ and $Y = y$, what would $Y$ have been if $X$ had been $x'$ instead?”

Counterfactuals are parallel-world questions. They require reasoning not just about what intervention would have done in general, but about what intervention would have done in this specific case, given everything else we observed about this case.

Two examples.

Example: medical counterfactual. A patient took aspirin and the headache resolved. The counterfactual question: would the headache have resolved if the patient had not taken aspirin? This requires more than knowing the population-average effect of aspirin; it requires reasoning about this specific patient’s specific situation, given what we know.

Example: legal counterfactual. A defendant is on trial for negligence. The relevant counterfactual: would the accident have occurred if the defendant had taken reasonable care? Counterfactuals are the conceptual basis of legal causation in many legal systems.

The Rung 3 toolkit. Structural causal models (§4) — which specify not just the causal graph but the functional form of each causal mechanism — are required for counterfactual reasoning. Counterfactual queries can be answered from an SCM by a three-step procedure (Abduction, Action, Prediction) developed in §7.

The deep observation: Rung 3 questions require more structural information than Rung 2 questions. A causal graph is sufficient for many Rung 2 queries (identifiability) but a full SCM (graph + structural equations) is generally required for Rung 3.
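A minimal sketch of why the structural equation itself is needed at Rung 3. The one-equation SCM below ($Y := 2X + U$) and the function names are hypothetical, chosen only so the arithmetic is easy to follow; the three steps are the abduction, action, and prediction steps named above.

```python
def f_y(x, u):
    # Hypothetical structural equation Y := 2*X + U (an assumption for illustration).
    return 2.0 * x + u


def counterfactual_y(x_obs, y_obs, x_cf):
    # Abduction: recover the noise value consistent with what was observed.
    u = y_obs - 2.0 * x_obs
    # Action: replace X by the counterfactual value x_cf.
    # Prediction: push the same noise value through the structural equation.
    return f_y(x_cf, u)


# Observed X = 1 and Y = 3; ask what Y would have been had X been 0.
y_cf = counterfactual_y(x_obs=1.0, y_obs=3.0, x_cf=0.0)
print(y_cf)
```

The abduction step is only possible because the functional form is known. A causal graph alone says that $X$ and $U$ feed into $Y$, but not how to recover $U$ from an observed pair.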

Why ML is mostly on Rung 1

Modern machine learning — deep learning, foundation models, reinforcement learning — operates primarily on Rung 1. The reasons are structural rather than incidental.

The training data is observational. Standard ML datasets are samples from a joint distribution. They do not contain intervention information unless they were collected experimentally (which most are not). Without intervention information, only association is identifiable.

The training objective is correlational. Empirical risk minimization minimizes prediction error, which is a Rung-1 quantity. The optimizer rewards correlation; it has no mechanism to distinguish causal from non-causal correlation.

The architectures are non-causal. A neural network’s predictions depend on the data distribution; there is no built-in machinery to distinguish correlation from causation. Architectural choices (which features feed which layers) are not causal commitments.

The implication: standard ML is correctly calibrated at Rung 1. It is silent at Rungs 2 and 3 unless additional causal structure is brought to bear. The chapter develops what that additional structure looks like and how it can be combined with standard ML.

What it takes to move higher

Three pathways to Rung 2 reasoning from observational data:

Strong assumptions plus a causal graph. Specify a DAG encoding causal relationships; use do-calculus to check identifiability; if identifiable, estimate from data. This is the Pearl programme; §4–§5 develop it.
Randomization or as-if-randomization. Use randomized experiments (where intervention is constructed by random assignment) or natural experiments (instrumental variables, regression discontinuity) where treatment assignment is plausibly random conditional on observed covariates. §6 develops this.
Structural assumptions plus invariance. Causal mechanisms are invariant across environments; correlations are not. By observing data from multiple environments (multiple distribution shifts), causal structure can sometimes be identified that would be unidentifiable from any single environment. The Schölkopf-Bengio causal representation learning programme uses this; §9 develops it.

For Rung 3 (counterfactuals), all three pathways require additional structural assumptions about the functional form of the causal mechanisms — a full SCM rather than just a graph.

Where the ladder fits in 2026

The conceptual framework. The ladder of causation is more useful for clarifying questions than for solving them. When someone asks “does X cause Y?” the ladder demands precision: do you mean associated with (Rung 1), if intervened on (Rung 2), or if it had been different in this case (Rung 3)? Different rungs require different evidence and produce different conclusions.

The cross-chapter implication. When the rest of the book (and the AI literature broadly) talks about “ML systems making decisions” or “ML systems giving recommendations,” the relevant question is: which rung is the system actually answering? Most ML systems answer Rung 1 questions, but the contexts in which they are deployed often require Rung 2 (interventional) or Rung 3 (counterfactual) answers. The gap is the source of many failure modes; the asthma example is one of many.

This sets up the technical machinery of §4 onward. The do-operator, the structural causal model, the identification criteria, the estimation techniques — all are the operational tools for moving from Rung 1 data to Rung 2 and Rung 3 answers.

§4. Structural Causal Models and Do-Calculus

The technical core of the Pearl framework. Structural Causal Models (SCMs) are the mathematical objects that encode causal assumptions; the do-operator is the formal notation for interventions; the three rules of do-calculus give the axiomatic machinery for computing intervention distributions from observational ones.

Directed Acyclic Graphs as causal models

The first object. A causal DAG is a directed acyclic graph $\mathcal{G} = (V, E)$ whose nodes $V$ are random variables and whose directed edges $E$ encode direct causal relationships. A directed edge $X \to Y$ means “ $X$ is a direct cause of $Y$ ” — manipulating $X$ would (in the absence of other interventions) affect $Y$ .

The DAG is acyclic: no variable can be its own ancestor. Causal cycles would correspond to feedback loops that the framework does not handle directly (cyclic models exist but are more complex).

A canonical example. Consider four variables: smoking $S$ , tar in lungs $T$ , exercise $E$ , and lung cancer $C$ . A plausible causal DAG:

flowchart TD
  S["$$S$$ (smoking)"]
  T["$$T$$ (tar in lungs)"]
  E["$$E$$ (exercise)"]
  C["$$C$$ (lung cancer)"]

  S --> T
  S --> E
  T --> C
  E --> C

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class S,T,E,C pill

Reading this. Smoking directly causes tar deposition; smoking directly affects exercise habits (perhaps negatively); tar causes lung cancer; exercise affects lung cancer risk (perhaps protectively). Smoking affects $C$ in this graph only indirectly, through $T$ and $E$; there is no direct $S \to C$ edge.

The graph encodes claims about which variables directly cause which others. The claims are editorial — they come from the modeller’s domain knowledge, not from the data. Different modellers might draw different graphs for the same domain; the chapter takes no position on which is right. Different graphs may lead to different identifiability conclusions, so getting the graph right matters.

Structural equations

The DAG specifies which variables directly cause which others. A full structural causal model (SCM) also specifies how. For each variable $V_i$ , the SCM gives a structural equation

V_i := f_i(\text{Pa}(V_i), U_i),

where $\text{Pa}(V_i)$ is the set of parents of $V_i$ in the DAG, $U_i$ is an exogenous (external, unobserved) noise variable, and $f_i$ is a function. The assignment symbol $:=$ (rather than $=$ ) emphasizes that the equation is directional: $V_i$ is computed from its parents, not solved jointly.

The structural equations specify the data-generating process: each $U_i$ is drawn from some distribution, then $V_i$ is computed from its parents and noise. The joint distribution $P(V)$ implied by the SCM is the distribution that arises from this generative process.

For the smoking-cancer example:

\begin{aligned} S &:= f_S(U_S) \\ T &:= f_T(S, U_T) \\ E &:= f_E(S, U_E) \\ C &:= f_C(T, E, U_C) \end{aligned}

The functions $f_S, f_T, f_E, f_C$ specify the magnitude and shape of each causal influence; the noise variables $U_S, U_T, U_E, U_C$ capture everything else.

The graph plus the structural equations is a complete causal model. From it, one can compute observational distributions, intervention distributions, and counterfactual quantities.
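The generative reading of the structural equations can be sketched as a sampler. The functional forms and every probability below are invented for illustration (the chapter does not specify $f_S, f_T, f_E, f_C$); only the parent sets follow the DAG.

```python
import numpy as np


def sample_scm(n, seed=0):
    """Draw n samples from a hypothetical binary version of the smoking SCM."""
    rng = np.random.default_rng(seed)
    # One independent uniform noise variable per structural equation.
    u_s, u_t, u_e, u_c = rng.random((4, n))
    s = (u_s < 0.3).astype(int)                          # S := f_S(U_S)
    t = (u_t < np.where(s == 1, 0.8, 0.1)).astype(int)   # T := f_T(S, U_T)
    e = (u_e < np.where(s == 1, 0.3, 0.6)).astype(int)   # E := f_E(S, U_E)
    p_c = 0.25 + 0.30 * t - 0.20 * e                     # assumed cancer risk
    c = (u_c < p_c).astype(int)                          # C := f_C(T, E, U_C)
    return {"S": s, "T": t, "E": e, "C": c}


data = sample_scm(100_000)
print({name: round(float(values.mean()), 3) for name, values in data.items()})
```

Each variable is computed only after its parents, which is the directional reading of the $:=$ symbol. The empirical frequencies in `data` approximate the observational distribution $P(V)$ implied by this particular choice of functions and noise.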

The do-operator

The formal notation for intervention. Pearl’s $\text{do}$ -operator writes $\text{do}(X = x)$ to denote the intervention that sets the variable $X$ to the value $x$ .

The semantics. Under the intervention $\text{do}(X = x)$ :

The structural equation for $X$ is replaced by the constant assignment $X := x$ . The variable $X$ no longer depends on its causes.
All other structural equations remain unchanged.
The intervention distribution $P(V \setminus X \mid \text{do}(X = x))$ is the distribution of all other variables under this modified system.

Graphically, the intervention $\text{do}(X = x)$ corresponds to removing all incoming edges to $X$ in the DAG — the mutilated graph $\mathcal{G}_{\overline{X}}$ . The remaining graph captures all causal dependencies that survive the intervention.

flowchart LR
  subgraph G ["Original graph 𝒢"]
    S1["$$S$$"]
    T1["$$T$$"]
    E1["$$E$$"]
    C1["$$C$$"]
    S1 --> T1
    S1 --> E1
    T1 --> C1
    E1 --> C1
  end

  subgraph GM ["Mutilated graph 𝒢&#773; (after do(T = t))"]
    S2["$$S$$"]
    T2["$$T = t$$"]
    E2["$$E$$"]
    C2["$$C$$"]
    S2 --> E2
    T2 --> C2
    E2 --> C2
  end

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class S1,T1,E1,C1,S2,T2,E2,C2 pill

The intervention $\text{do}(T = t)$ sets tar level to $t$ regardless of smoking. Under this intervention, smoking still affects exercise (and through exercise affects cancer), but smoking does not affect tar (because we’ve forced tar to $t$ ). The mutilated graph $\mathcal{G}_{\overline{T}}$ captures this.

The intervention distribution is then defined by the SCM on the mutilated graph: run the structural equations of the mutilated graph forward; the joint distribution over the remaining variables is $P(V \setminus T \mid \text{do}(T = t))$ .
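A sketch of these semantics in code, under the same caveat that the functional forms and probabilities are hypothetical. The `do_t` argument is an illustrative name: when it is set, the structural equation for $T$ is replaced by a constant and every other equation is left untouched.

```python
import numpy as np


def sample(n, do_t=None, seed=0):
    """Sample the smoking SCM, optionally under the intervention do(T = do_t)."""
    rng = np.random.default_rng(seed)
    u_s, u_t, u_e, u_c = rng.random((4, n))
    s = (u_s < 0.3).astype(int)
    if do_t is None:
        t = (u_t < np.where(s == 1, 0.8, 0.1)).astype(int)  # natural mechanism
    else:
        t = np.full(n, do_t)                                # T := t, S is ignored
    e = (u_e < np.where(s == 1, 0.3, 0.6)).astype(int)      # unchanged
    c = (u_c < 0.25 + 0.30 * t - 0.20 * e).astype(int)      # unchanged
    return s, t, e, c


s_obs, t_obs, e_obs, c_obs = sample(200_000)
s_do, t_do, e_do, c_do = sample(200_000, do_t=1)

p_obs = c_obs[t_obs == 1].mean()   # estimate of P(C=1 | T=1)
p_do = c_do.mean()                 # estimate of P(C=1 | do(T=1))
print(round(float(p_obs), 3), round(float(p_do), 3))
```

In this toy model the two numbers differ because observing $T = 1$ makes smoking more likely, and smoking lowers exercise. Under $\text{do}(T = 1)$ that inference is cut off, while the $S \to E \to C$ pathway keeps operating exactly as before.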

Why $P(Y \mid \text{do}(X)) \neq P(Y \mid X)$ in general

A worked illustration of the central distinction.

Consider the simplest confounded graph:

flowchart TD
  U["$$U$$ (confounder)"]
  X["$$X$$"]
  Y["$$Y$$"]

  U --> X
  U --> Y

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class U,X,Y pill

A confounder $U$ causes both $X$ and $Y$ ; there is no direct edge from $X$ to $Y$ . In this graph:

$P(Y \mid X = x)$: the conditional distribution of $Y$ given an observation that $X = x$. By the law of total probability, this is $\sum_u P(Y \mid X = x, U = u) P(U = u \mid X = x)$, and because $Y$ depends only on $U$ in this graph, $P(Y \mid X = x, U = u) = P(Y \mid U = u)$. Conditioning on $X = x$ updates our beliefs about $U$ (because $U$ is a direct cause of $X$), and the updated distribution of $U$ changes the distribution of $Y$. So the conditional varies with $x$ even though $X$ has no causal effect on $Y$: $Y$ is associated with $X$ only through their common cause $U$, and conditioning on $X$ reveals this association.
$P(Y \mid \text{do}(X = x))$: the distribution of $Y$ under intervention. The mutilated graph has no edge $U \to X$ (the intervention removes it), so $X = x$ no longer reveals anything about $U$. The distribution is $\sum_u P(Y \mid U = u) P(U = u)$, which is just the marginal distribution of $Y$. The intervention on $X$ has no effect on $Y$ in this graph, because there is no directed path from $X$ to $Y$.

The conditional and the intervention distribution can be drastically different. The conditional says “observing $X$ is informative about $Y$ ”; the intervention distribution says “manipulating $X$ has no effect on $Y$ ”. Both are correct; they answer different questions.

This is the formal version of the barometer-storm example. The conditional confuses people because it makes barometer observations seem causally relevant to storms; the intervention distribution shows they are not.
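The two quantities can be computed exactly for a small binary instance of this graph. The probability tables below are made up for illustration; the two functions implement the sums for $P(Y \mid X = x)$ and $P(Y \mid \text{do}(X = x))$ in the graph with $U \to X$ and $U \to Y$ only.

```python
import numpy as np

# Hypothetical tables for binary U, X, Y (index 0 is U=0, index 1 is U=1).
p_u = np.array([0.5, 0.5])              # P(U = u)
p_x1_given_u = np.array([0.2, 0.9])     # P(X = 1 | U = u)
p_y1_given_u = np.array([0.1, 0.7])     # P(Y = 1 | U = u); X is not an argument


def p_y1_given_x(x):
    """P(Y=1 | X=x): observing X updates beliefs about U via Bayes' rule."""
    p_x_given_u = p_x1_given_u if x == 1 else 1.0 - p_x1_given_u
    p_u_given_x = p_u * p_x_given_u / np.sum(p_u * p_x_given_u)
    return float(np.sum(p_y1_given_u * p_u_given_x))


def p_y1_do_x(x):
    """P(Y=1 | do(X=x)): U keeps its marginal, so x never enters the sum."""
    return float(np.sum(p_y1_given_u * p_u))


for x in (0, 1):
    print(x, round(p_y1_given_x(x), 4), round(p_y1_do_x(x), 4))
```

The conditional changes sharply with $x$ while the interventional quantity is the same for both values of $x$, which is the barometer pattern in numbers.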

The three rules of do-calculus

The technical heart of Pearl’s framework. Pearl’s three rules of do-calculus (Pearl, 1995) give a complete set of axiomatic rewriting rules that can be used to derive intervention distributions $P(\cdot \mid \text{do}(\cdot))$ in terms of observational distributions $P(\cdot \mid \cdot)$ — when such derivation is possible.

The rules involve conditional-independence relationships in mutilated graphs. We state them in informal form; the precise statement requires care about which graph mutilation applies in which rule.

Rule 1 (Insertion/deletion of observations). If $Y$ and $Z$ are conditionally independent given $X$ and $W$ in the mutilated graph where incoming edges to the variables in $X$ have been removed, then conditioning on $Z$ can be added or removed without changing $P(Y \mid \text{do}(X), W)$:

P(Y \mid \text{do}(X), Z, W) = P(Y \mid \text{do}(X), W) \quad \text{if } Y \perp\!\!\!\perp Z \mid X, W \text{ in } \mathcal{G}_{\overline{X}}.

Rule 2 (Action/observation exchange). If a do-intervention on $Z$ is equivalent to observation of $Z$ given other variables — formally, if $Y$ is independent of $Z$ in a specific mutilated graph — then $\text{do}(Z)$ can be replaced by conditioning on $Z$ :

P(Y \mid \text{do}(X), \text{do}(Z), W) = P(Y \mid \text{do}(X), Z, W) \quad \text{if } Y \perp\!\!\!\perp Z \mid X, W \text{ in } \mathcal{G}_{\overline{X}, \underline{Z}}.

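Rule 3 (Insertion/deletion of actions). A do-intervention on $Z$ can be added or removed without affecting the distribution if $Z$ has no causal effect on $Y$ given the other interventions and observations. Writing $Z(W)$ for the nodes of $Z$ that are not ancestors of any node of $W$ in $\mathcal{G}_{\overline{X}}$: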

P(Y \mid \text{do}(X), \text{do}(Z), W) = P(Y \mid \text{do}(X), W) \quad \text{if } Y \perp\!\!\!\perp Z \mid X, W \text{ in } \mathcal{G}_{\overline{X}, \overline{Z(W)}}.

Reading the rules informally. Rule 1 says: we can ignore observations of independent variables. Rule 2 says: if intervention on $Z$ would behave like observation of $Z$ , we can use the observation in its place. Rule 3 says: if $Z$ has no effect on $Y$ in the relevant context, we can drop the intervention.

The power of do-calculus. Shpitser and Pearl (2006) showed the three rules are complete: if $P(Y \mid \text{do}(X))$ is identifiable from observational data given the causal graph, then iterated application of the three rules can derive the identifying expression. Do-calculus is the formal proof system for causal identification.
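A short derivation shows the rules at work. As an assumed illustrative setting, take the three-node graph $U \to X$, $U \to Y$, $X \to Y$ with $U$ observed. Here $\mathcal{G}_{\underline{X}}$ denotes the graph with edges out of $X$ removed and $\mathcal{G}_{\overline{X}}$ the graph with edges into $X$ removed.

\begin{aligned} P(y \mid \text{do}(x)) &= \sum_u P(y \mid \text{do}(x), u) \, P(u \mid \text{do}(x)) && \text{(law of total probability)} \\ &= \sum_u P(y \mid x, u) \, P(u \mid \text{do}(x)) && \text{(Rule 2: } Y \perp\!\!\!\perp X \mid U \text{ in } \mathcal{G}_{\underline{X}}) \\ &= \sum_u P(y \mid x, u) \, P(u) && \text{(Rule 3: } U \perp\!\!\!\perp X \text{ in } \mathcal{G}_{\overline{X}}) \end{aligned}

The final line contains no do-operator, so the effect is expressed entirely in observational quantities. Rule 2 applies because removing $X \to Y$ leaves only the path $X \leftarrow U \to Y$, which $U$ blocks. Rule 3 applies because removing $U \to X$ leaves $X$ and $U$ connected only through the collider $Y$.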

Identifiability and the ID algorithm

A causal effect $P(Y \mid \text{do}(X))$ is identifiable from a causal graph and observational distribution if it can be uniquely expressed in terms of the observational distribution (and the graph structure). Identifiability depends on the graph; the same observational distribution can have different identifiable causal effects depending on the assumed causal structure.

The ID algorithm (Tian and Pearl, 2002; Shpitser and Pearl, 2006) is a complete algorithm: given a causal DAG and a target effect $P(Y \mid \text{do}(X))$ , the algorithm either returns an expression for the effect in terms of observational quantities, or proves that no such expression exists (the effect is not identifiable from the given graph).

A simple worked example. Take the confounded graph above and add a direct edge $X \to Y$, giving $U \to X$, $U \to Y$, and $X \to Y$. The effect $P(Y \mid \text{do}(X = x))$ is identifiable if $U$ is observed: the backdoor adjustment (developed in §5) gives

P(Y \mid \text{do}(X = x)) = \sum_u P(Y \mid X = x, U = u) P(U = u).

If $U$ is unobserved, the effect is not identifiable from this graph: the same observational distribution $P(X, Y)$ is consistent with multiple causal-effect values, depending on how much of the association is due to the unobserved $U$. This is the fundamental confounding problem: unmeasured common causes prevent identification of causal effects. (Without the $X \to Y$ edge the graph itself settles the question: Rule 3 gives $P(Y \mid \text{do}(X = x)) = P(Y)$ whether or not $U$ is observed.)

The identifiability machinery is what makes Pearl’s framework practical. Given a causal graph, the framework tells you exactly which causal effects can be estimated from your data and which cannot. The framework is honest — when an effect is unidentifiable, it says so explicitly. This contrasts with naive correlational analysis, which produces an estimate regardless of whether the estimate has any causal interpretation.

Pseudocode for the ID algorithm (sketch)

The full algorithm is technically involved. A sketch:

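ALGORITHM ID (Shpitser & Pearl, 2006, sketch)
INPUT: graph G over variables V (unobserved confounders drawn as
       bidirected edges), distribution P(V), target effect P(Y | do(X))
OUTPUT: expression for P(Y | do(X)) in observational quantities,
        OR proof of non-identifiability

1. If X is empty: return P(Y), obtained by summing P(V) over V \ Y
   [no intervention; just observation].

2. If some nodes are not ancestors of Y in G:
   marginalize them out and recurse on the ancestral subgraph.

3. If some node W outside X is not an ancestor of Y in the mutilated
   graph G_X-bar: add do(W) to the intervention (Rule 3) and recurse.

4. Remove X and split the rest into c-components (sets of nodes
   linked by bidirected paths). If there is more than one:
   recurse on each and return the product, summed over V \ (Y ∪ X).

5. If G itself is a single c-component (a hedge):
   return non-identifiable.

6. If the remaining c-component is also a c-component of G:
   return its factor of P(V), with non-Y nodes summed out.

7. Otherwise: recurse inside the c-component of G that contains it.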

The algorithm is complete: it succeeds whenever identification is possible and proves failure whenever it is not. The algorithmic content makes Pearl’s framework operational rather than purely conceptual.

Where SCMs and do-calculus sit in the field

The summary. SCMs are the standard mathematical object for causal reasoning in the Pearl tradition; do-calculus is the axiomatic system for manipulating causal expressions; the ID algorithm provides operational identifiability checking. Together they are the conceptual core of one of two dominant causal frameworks (the Rubin potential-outcomes framework being the other).

The strengths. Mathematically clean; algorithmically complete; explicitly transparent about assumptions; extensible to counterfactuals (§7) and discovery (§8).

The limitations. The framework requires the causal graph as input — but the graph must come from somewhere (domain knowledge, expert judgment, sometimes causal discovery from data). The graph is the Achilles heel of causal inference: getting it wrong invalidates everything downstream. Considerable effort in modern causal ML is devoted to either learning the graph from data (causal discovery, §8) or to robustness against misspecification of the graph.

§5 develops the practical identification criteria (backdoor, front-door, instrumental variables) that suffice for many applied causal-inference problems and that practitioners use as shortcuts without invoking the full do-calculus machinery.

§5. Identification: Backdoor, Front-door, and Instrumental Variables

The do-calculus and ID algorithm (§4) are mathematically complete but operationally heavy. In practice, most identifiable causal effects can be derived via three named criteria — graphical recipes that practitioners check directly on the DAG without invoking the full algorithm. Each criterion identifies a path structure in the DAG that suffices for identification.

This section develops the backdoor criterion (the most common case), the front-door criterion (a more subtle case useful when the backdoor fails), and the instrumental-variables approach (a different identifiability strategy). Each is presented with the formal statement, a worked example, and the conditions for applicability.

The backdoor criterion

The most-used identification criterion. The backdoor criterion (Pearl, 1993) identifies causal effects in the presence of confounders by adjusting for the confounders.

Definition. A set of variables $Z$ satisfies the backdoor criterion relative to $(X, Y)$ in DAG $\mathcal{G}$ if:

No node in $Z$ is a descendant of $X$ .
$Z$ blocks every “backdoor path” from $X$ to $Y$ — every path that starts with an arrow into $X$ .

When $Z$ satisfies the backdoor criterion, the causal effect $P(Y \mid \text{do}(X = x))$ is identifiable via the backdoor adjustment formula:

P(Y \mid \text{do}(X = x)) = \sum_z P(Y \mid X = x, Z = z) \, P(Z = z).

Reading this. The intervention distribution can be computed by adjusting (averaging) over the values of $Z$ . For each value of $Z$ , observe the conditional $P(Y \mid X = x, Z = z)$ in the data; weight by $P(Z = z)$ ; sum. The result is the causal effect.
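The two conditions of the definition can be checked mechanically on a small DAG. The sketch below is one way to do it, not a reference implementation: the function names are illustrative, the DAG is given as a dictionary from each node to its list of parents, and d-separation is tested with the standard moralized-ancestral-graph construction. Condition 2 is checked as d-separation of $X$ and $Y$ given $Z$ after deleting the edges out of $X$.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def d_separated(parents, x, y, z):
    """True if the set z d-separates x from y (moralized ancestral graph test)."""
    keep = ancestors(parents, {x, y} | set(z))
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:                        # undirected parent-child edges
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(ps, 2):    # "marry" parents of a common child
            nbrs[p].add(q)
            nbrs[q].add(p)
    seen, stack = set(), [x]
    while stack:                            # search for a path that avoids z
        v = stack.pop()
        if v == y:
            return False
        if v in seen or v in z:
            continue
        seen.add(v)
        stack.extend(nbrs[v])
    return True


def satisfies_backdoor(parents, x, y, z):
    z = set(z)
    descendants = {v for v in parents if v != x and x in ancestors(parents, [v])}
    if z & descendants:                     # condition 1: no descendants of x
        return False
    cut = {v: [p for p in ps if p != x] for v, ps in parents.items()}
    return d_separated(cut, x, y, z)        # condition 2: backdoor paths blocked


# A confounded triangle (A -> S, A -> C, S -> C) and a front-door style chain.
dag = {"A": [], "S": ["A"], "C": ["A", "S"]}
chain = {"U": [], "S": ["U"], "T": ["S"], "C": ["T", "U"]}
print(satisfies_backdoor(dag, "S", "C", {"A"}))    # adjust for the confounder
print(satisfies_backdoor(dag, "S", "C", set()))    # adjust for nothing
print(satisfies_backdoor(chain, "S", "C", {"T"}))  # T is a descendant of S
```

The check is purely graphical: it needs no data, only the assumed DAG, which is why a wrong graph silently yields a wrong adjustment set.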

A worked example: smoking and lung cancer. Suppose the causal DAG is:

flowchart TD
  A["$$A$$ (age)"]
  S["$$S$$ (smoking)"]
  C["$$C$$ (cancer)"]

  A --> S
  A --> C
  S --> C

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class A,S,C pill

Age $A$ is a confounder: it causes both smoking (older people may smoke less) and cancer (cancer risk increases with age). The observational conditional $P(C \mid S)$ mixes the causal effect of smoking with the non-causal association carried by the backdoor path through age.

Applying the backdoor criterion: is $\{A\}$ a valid backdoor adjustment set?

$A$ is not a descendant of $S$ . ✓
The only backdoor path is $S \leftarrow A \rightarrow C$ . Conditioning on $A$ blocks this path. ✓

So $\{A\}$ satisfies the backdoor criterion. The causal effect is:

P(C \mid \text{do}(S = s)) = \sum_a P(C \mid S = s, A = a) \, P(A = a).

In words: estimate the cancer rate among smokers and non-smokers separately for each age group; then average across age groups using the population age distribution. The result is the smoking effect adjusted for age.

The general intuition. Backdoor adjustment controls for confounders by stratifying the data on them and then averaging. The formal criterion ensures we have stratified on enough variables to block all confounding paths.
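The stratify-then-average recipe can be sketched on simulated data. All numbers below are invented: age is reduced to a binary "older" indicator, and the data-generating process is built so that the true effect of smoking on cancer risk is $+0.10$, which lets the naive and adjusted estimates be compared against a known answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
a = rng.random(n) < 0.4                           # A: older age group
s = rng.random(n) < np.where(a, 0.15, 0.35)       # S: older people smoke less
c = rng.random(n) < 0.02 + 0.10 * s + 0.15 * a    # C: true effect of S is +0.10


def p_c_do_s(s_val):
    """Backdoor adjustment: sum_a P(C=1 | S=s, A=a) P(A=a)."""
    total = 0.0
    for a_val in (False, True):
        stratum = (s == s_val) & (a == a_val)
        total += c[stratum].mean() * (a == a_val).mean()
    return total


naive = c[s].mean() - c[~s].mean()               # P(C|S=1) - P(C|S=0)
adjusted = p_c_do_s(True) - p_c_do_s(False)      # backdoor-adjusted contrast
print(round(float(naive), 3), round(float(adjusted), 3))
```

In this simulation the naive contrast understates the effect, because smokers are disproportionately young and the young have lower baseline risk. Stratifying on age removes that imbalance.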

What can go wrong with backdoor

Several subtleties.

Adjusting for too much. Conditioning on a descendant of $X$ can introduce bias. The most famous case: conditioning on a collider opens an otherwise-blocked path. If we mistakenly include a descendant of $X$ in our adjustment set, we can create confounding that wasn’t there.

Worked example. Suppose admission to a selective university $U$ is caused by both intelligence $I$ and athletic ability $A$ (these symbols are local to this example). Conditioning on $U$ (looking only at admitted students) creates a spurious negative correlation between $I$ and $A$ among admitted students, even though $I$ and $A$ are independent in the population. This is collider bias (Berkson’s paradox). It is a different mechanism from the one behind the asthma example, where the apparent protective effect arises from the treatment policy rather than from selection on a collider.
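The admissions example is easy to reproduce in simulation. The admission rule below (a threshold on the sum of the two traits) is a hypothetical stand-in for whatever the real mechanism would be.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
intelligence = rng.standard_normal(n)         # independent of athletic ability
athletic = rng.standard_normal(n)
admitted = intelligence + athletic > 1.5      # hypothetical admission rule

corr_all = np.corrcoef(intelligence, athletic)[0, 1]
corr_admitted = np.corrcoef(intelligence[admitted], athletic[admitted])[0, 1]
print(round(float(corr_all), 3), round(float(corr_admitted), 3))
```

The correlation is close to zero in the full population and clearly negative among admitted students: within the selected group, a student who is weaker on one trait must have been stronger on the other to clear the threshold.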

Adjusting for too little. If we leave out a confounder, the backdoor criterion is not satisfied and the adjustment is biased. Unmeasured confounders are the dominant practical concern in observational causal inference.

Functional misspecification. Even if the adjustment set is correct, the form of $P(Y \mid X, Z)$ must be estimated; if the estimator is misspecified (wrong parametric form, insufficient flexibility), the adjustment fails. Modern ML estimators (gradient-boosted trees, neural networks) can mitigate this by being more flexible; §6 develops these.

The front-door criterion

A more subtle identification result. The front-door criterion (Pearl, 1995) identifies causal effects even when backdoor adjustment is impossible — when there is an unmeasured confounder.

Definition. A set $Z$ satisfies the front-door criterion relative to $(X, Y)$ if:

$Z$ intercepts all directed paths from $X$ to $Y$ (i.e., $X$ affects $Y$ only through $Z$ ).
There is no unblocked backdoor path from $X$ to $Z$ .
All backdoor paths from $Z$ to $Y$ are blocked by $X$ .

When $Z$ satisfies the front-door criterion, the causal effect is:

P(Y \mid \text{do}(X = x)) = \sum_z P(Z = z \mid X = x) \sum_{x'} P(Y \mid Z = z, X = x') \, P(X = x').

The formula is more complex than the backdoor adjustment; it involves a two-step causal chain through the intermediate variable $Z$ .

A worked example: the smoking-tar-cancer chain. Suppose smoking ( $S$ ) causes cancer ( $C$ ) only through tar in the lungs ( $T$ ) — and there’s an unobserved confounder ( $U$ , perhaps genetic predisposition) affecting both smoking and cancer:

flowchart TD
  U["$$U$$ (unobserved)"]
  S["$$S$$ (smoking)"]
  T["$$T$$ (tar in lungs)"]
  C["$$C$$ (cancer)"]

  U --> S
  U --> C
  S --> T
  T --> C

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class S,T,C pill
  class U dim

Backdoor fails here: the backdoor path $S \leftarrow U \rightarrow C$ would need to be blocked by $U$ , but $U$ is unobserved.

Front-door succeeds: $T$ intercepts the only directed path $S \rightarrow T \rightarrow C$; the only backdoor path from $S$ to $T$, $S \leftarrow U \rightarrow C \leftarrow T$, is blocked at the collider $C$; and the backdoor path $T \leftarrow S \leftarrow U \rightarrow C$ is blocked by conditioning on $S$. The front-door criterion is satisfied.

The intuition. The front-door formula effectively decomposes the causal effect into two steps: (1) the effect of $S$ on $T$ (which can be measured directly because there is no confounding for this sub-effect); (2) the effect of $T$ on $C$ (which is identifiable by adjusting for $S$, which blocks the only backdoor path). Chaining the two sub-effects (multiplying them and summing over the values of $T$, as in the formula above) gives the total $S \to C$ causal effect even though the original $U$ confounder is unobserved.
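The front-door formula can be verified exactly on a small discrete model. The conditional probability tables below are hypothetical. The check builds the full joint including $U$, marginalizes $U$ out to get what an analyst would observe, and compares the front-door expression computed from that observational joint with the true interventional value computed using $U$.

```python
import numpy as np

# Hypothetical tables for binary U, S, T, C.
p_u = np.array([0.5, 0.5])                        # P(U=u)
p_s_u = np.array([[0.8, 0.2], [0.2, 0.8]])        # p_s_u[u, s] = P(S=s | U=u)
p_t_s = np.array([[0.9, 0.1], [0.1, 0.9]])        # p_t_s[s, t] = P(T=t | S=s)
p_c1_tu = np.array([[0.1, 0.4], [0.5, 0.8]])      # P(C=1 | T=t, U=u), indexed [t, u]
p_c_tu = np.stack([1 - p_c1_tu, p_c1_tu], axis=-1)  # indexed [t, u, c]

# Full joint P(u, s, t, c), then the observational joint with U summed out.
joint = np.einsum("u,us,st,tuc->ustc", p_u, p_s_u, p_t_s, p_c_tu)
obs = joint.sum(axis=0)                           # obs[s, t, c]


def truth(s):
    """P(C=1 | do(S=s)), computed with access to the unobserved U."""
    return sum(p_t_s[s, t] * np.dot(p_c1_tu[t], p_u) for t in (0, 1))


def naive(s):
    """P(C=1 | S=s), the confounded observational conditional."""
    return obs[s, :, 1].sum() / obs[s].sum()


def front_door(s):
    """Front-door formula, using only the observational joint obs."""
    p_s = obs.sum(axis=(1, 2))                    # P(S=s')
    p_t_given_s = obs.sum(axis=2) / p_s[:, None]  # P(T=t | S=s)
    p_c1_given_st = obs[:, :, 1] / obs.sum(axis=2)  # P(C=1 | S=s', T=t)
    return sum(
        p_t_given_s[s, t] * sum(p_c1_given_st[s2, t] * p_s[s2] for s2 in (0, 1))
        for t in (0, 1)
    )


for s in (0, 1):
    print(s, round(float(naive(s)), 4), round(float(front_door(s)), 4),
          round(float(truth(s)), 4))
```

The front-door value matches the truth for both values of $s$, while the naive conditional does not, even though the front-door computation never touches $U$.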

Front-door is rarer in practice than backdoor: it requires the specific structural setup where an intermediate variable fully mediates the effect and is itself shielded from the unmeasured confounder. When it applies, it is powerful because it identifies causal effects in the presence of unmeasured confounding.

Instrumental variables

A different identifiability strategy, with a long tradition in econometrics.

Definition. A variable $W$ is an instrumental variable (IV) for the effect of $X$ on $Y$ if:

Relevance: $W$ affects $X$ (i.e., $W$ has a causal influence on $X$ ).
Exclusion: $W$ affects $Y$ only through $X$ (no direct $W \to Y$ effect; no confounding paths from $W$ to $Y$ except through $X$ ).
Independence: $W$ is independent of unobserved confounders of $(X, Y)$ .

When $W$ is a valid IV, the causal effect of $X$ on $Y$ can be identified — for linear models, by two-stage least squares; for nonparametric models, with additional assumptions.

flowchart TD
  U["$$U$$ (unobserved confounder)"]
  W["$$W$$ (instrument)"]
  X["$$X$$"]
  Y["$$Y$$"]

  U --> X
  U --> Y
  W --> X
  X --> Y

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  classDef dim  fill:#fff,stroke:#aaa,stroke-dasharray:3 3,color:#888
  class W,X,Y pill
  class U dim

Two-stage least squares (2SLS), the canonical estimator. For linear models:

First stage. Regress $X$ on $W$ to get the predicted value $\hat{X}$. Because $\hat{X}$ is a function of $W$ alone, it captures only the part of $X$’s variation driven by $W$, which by the independence assumption is unrelated to the unobserved confounders.
Second stage. Regress $Y$ on $\hat{X}$ . The slope coefficient is the causal effect of $X$ on $Y$ .

The intuition. The instrument $W$ creates as-if randomization in $X$ — variation in $X$ driven by $W$ alone is independent of the unobserved confounders. Using only this exogenous variation lets us identify the causal effect even though $X$ overall is confounded.
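The two stages can be sketched with plain least squares on simulated data. The linear model and its coefficients are assumptions made for the illustration: the true effect of $X$ on $Y$ is set to 2.0, and the confounder $U$ is generated but never shown to either estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.standard_normal(n)                       # unobserved confounder
w = rng.standard_normal(n)                       # instrument, independent of u
x = 1.0 * w + 1.0 * u + rng.standard_normal(n)   # treatment
y = 2.0 * x + 3.0 * u + rng.standard_normal(n)   # outcome; true effect is 2.0


def fit_line(target, regressor):
    """Least-squares fit of target on [1, regressor]; returns (coef, design)."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design


ols_coef, _ = fit_line(y, x)                     # confounded regression of Y on X
first_coef, design_w = fit_line(x, w)            # first stage: X on W
x_hat = design_w @ first_coef                    # fitted values, driven by W only
tsls_coef, _ = fit_line(y, x_hat)                # second stage: Y on X-hat

ols_slope, tsls_slope = float(ols_coef[1]), float(tsls_coef[1])
print(round(ols_slope, 3), round(tsls_slope, 3))
```

The ordinary regression slope is biased upward by the confounder, while the two-stage slope lands near the true value. Running the second stage by hand like this gives the right point estimate but not valid standard errors, so dedicated IV routines are preferable for inference.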

Worked example: smoking and birth weight. Suppose we want to know the effect of smoking during pregnancy on infant birth weight. Direct observational analysis is confounded — mothers who smoke may differ in other ways (income, prenatal care, diet). Cigarette tax variation across states or time is a plausible instrument: it affects smoking (relevance) but presumably not birth weight directly except through smoking (exclusion) and is not correlated with unobserved maternal characteristics (independence).

2SLS using cigarette taxes as the instrument can identify the smoking-on-birth-weight effect under these assumptions. This is the classic natural experiment identification strategy in econometrics.

Assumption fragility. IV identifiability is delicate. The exclusion restriction is untestable — it asserts there is no direct $W \to Y$ effect, but the data cannot prove this. A bad instrument (one that fails exclusion or independence) produces biased estimates that look statistically valid. Considerable empirical work has shown that many published IV analyses use instruments of questionable validity.

When each criterion applies

A summary of when to use each.

Criterion	When to use	What it requires	Common pitfall
Backdoor	All confounders are measured	Adjustment set blocking all backdoor paths	Conditioning on colliders/descendants
Front-door	An intermediate variable fully mediates; backdoor unavailable	Specific structural form	Rare in practice
Instrumental variables	Have an external variation source	Relevance + exclusion + independence	Exclusion is untestable

In practice, backdoor covers most analyzable causal effects when confounders are measured. Front-door applies in specific scenarios where the mediating structure is known. Instrumental variables applies in econometric and natural-experiment contexts where exogenous variation is available.

When none of the three applies, the causal effect may be non-identifiable from the given graph — meaning no estimator from observational data alone can recover it without additional assumptions. The honest response: acknowledge non-identifiability and either collect interventional data, restrict the claim, or impose stronger structural assumptions (linearity, monotonicity, partial identification with bounds).

Sensitivity analysis

A practical extension. Even when a causal effect is “identifiable” given assumed structure, the assumptions are uncertain. Sensitivity analysis asks: how robust is the estimate to violations of the assumptions?

For backdoor adjustment with unmeasured confounders, the Rosenbaum sensitivity approach asks: how strong would an unmeasured confounder need to be (in terms of its association with both treatment and outcome) to change the qualitative conclusion? If the answer is “implausibly strong,” the conclusion is robust; if even a weak confounder would suffice, the conclusion is fragile.

Modern variants (E-values, Cinelli-Hazlett sensitivity analysis) provide quantitative tools for assessing robustness. Best practice in 2026 applied causal inference includes sensitivity analysis as a standard companion to the main causal estimate.
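The E-value has a simple closed form on the risk-ratio scale (VanderWeele and Ding, 2017): for an observed risk ratio $RR \geq 1$ it is $RR + \sqrt{RR\,(RR - 1)}$, the minimum strength of association an unmeasured confounder would need with both treatment and outcome to fully explain away the estimate. A minimal sketch follows; the function name `e_value` is illustrative and not taken from any package.

```python
import math

def e_value(rr: float) -> float:
    """E-value for a point estimate on the risk-ratio scale."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates (RR < 1) are handled by taking the reciprocal first.
    rr = rr if rr >= 1 else 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

print(round(e_value(2.0), 3))  # 3.414
print(round(e_value(0.5), 3))  # 3.414 (same strength, opposite direction)
print(e_value(1.0))            # 1.0 (a null estimate needs no confounding to explain)
```

Reading the first line: an observed risk ratio of 2 could be explained away only by a confounder associated with both treatment and outcome by a risk ratio of about 3.4 each.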

Where identification fits in the chapter

The summary. §4 developed what causal questions are identifiable in principle from a given graph (the ID algorithm). §5 develops how to identify them in practice using the three canonical criteria. The two sections are complementary: the ID algorithm tells you whether an effect can be derived; the criteria of this section tell you what the derivation looks like for common cases.

The next section (§6) develops estimation — given identifiability, how do we actually compute the causal effect from finite data? Identification is a statistical-population question; estimation is a finite-sample question. Both are required for applied causal inference.

§6. Causal Effect Estimation

Identification (§4–§5) is about whether a causal effect can be expressed in terms of observable quantities. Estimation is about how to compute it from finite data. This section develops the standard estimands (ATE, CATE, ITE), the classical estimators (propensity scores, IPW, doubly robust), and the modern ML-based estimators (causal forests, double ML, meta-learners).

Causal estimands: ATE, CATE, ITE

Three levels of granularity for a treatment effect.

Average Treatment Effect (ATE). The population-average effect of treatment versus control:

\tau = \mathbb{E}[Y(1) - Y(0)],

where $Y(1)$ and $Y(0)$ are the potential outcomes under treatment and control (Rubin notation). The ATE is the most-reported causal quantity in applied causal inference.

Conditional Average Treatment Effect (CATE). The treatment effect for a specific subpopulation defined by covariates $X = x$ :

\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x].

CATE captures heterogeneity in the treatment effect — the effect may be larger for some subgroups than others. Estimating CATE is the modern personalized-medicine and personalized-policy question.

Individual Treatment Effect (ITE). The treatment effect for a specific individual:

\tau_i = Y_i(1) - Y_i(0).

The ITE is fundamentally unobservable (the fundamental problem of causal inference, §2): for any individual, only one potential outcome is observed. Some methods produce predictions of the ITE (model-based predictions of what each individual’s treatment effect would be); these predictions are best understood as ITE-conditioned-on-observable-covariates — essentially CATE evaluated at $X = x_i$ .

The progression ATE → CATE → ITE represents increasing granularity and decreasing identifiability. ATE is identifiable under standard assumptions; CATE is identifiable conditional on covariates; true ITE is only identifiable under strong additional assumptions (functional form, deterministic counterfactuals).

The standard assumptions

For backdoor-adjustment-based estimation, the standard assumptions are:

Ignorability (no unmeasured confounders). $Y(0), Y(1) \perp T \mid X$ — treatment assignment is independent of potential outcomes given observed covariates.
Positivity (overlap). $0 < P(T = 1 \mid X = x) < 1$ for all relevant $x$ — every subpopulation has some chance of receiving each treatment.
SUTVA (Stable Unit Treatment Value Assumption). Each unit’s outcome depends only on its own treatment, not on others’ treatments. The treatment is well-defined and consistent.
Consistency. $Y = Y(T)$ — the observed outcome equals the potential outcome under the actual treatment.

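These assumptions are the practical version of the structural assumptions encoded in a causal graph. Ignorability, SUTVA, and consistency are untestable from the data alone; they must be argued from substantive domain knowledge. Positivity is the exception: it can be checked empirically by inspecting the estimated propensity scores for values near 0 or 1.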

Propensity score methods

The most-classical estimation approach. The propensity score is the probability of treatment given covariates:

e(x) = P(T = 1 \mid X = x).

The propensity score has a remarkable property (Rosenbaum and Rubin, 1983): if ignorability holds given $X$ , it also holds given $e(X)$ alone — a one-dimensional summary suffices. This means that adjustment can be done on the propensity score rather than on the full covariate vector.

Three main propensity-score-based estimators:

Propensity score matching. Find pairs of treated and untreated units with similar propensity scores; estimate the effect by comparing matched pairs. Practical but loses information from unmatched units.

Propensity score stratification. Divide units into strata by propensity-score quintile (or finer); compute the treatment-control difference within each stratum; average across strata.

Inverse Propensity Weighting (IPW). Weight each unit by the inverse of its propensity of receiving the treatment it actually received:

\hat{\tau}_{\text{IPW}} = \frac{1}{n} \sum_i \left[ \frac{T_i Y_i}{\hat{e}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{e}(X_i)} \right].

Reading this. Each treated unit is weighted by $1/\hat{e}(X)$ (the inverse probability of being treated); each untreated unit is weighted by $1/(1 - \hat{e}(X))$ . This reweights the sample to make treated and untreated units comparable in their covariate distributions.

IPW is unbiased when the true propensities are known and consistent when the propensity model is correctly specified. It has high variance when some propensities are close to 0 or 1 (near-violations of positivity), because extreme inverse weights blow up the estimator.
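A small simulation makes the reweighting concrete. The sketch below assumes a hypothetical data-generating process (one confounder, logistic propensity, true ATE of 2.0) and uses the true propensity in place of an estimated $\hat{e}$; in real data the propensity would have to be fitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: X confounds T and Y; the true ATE is 2.0.
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))        # P(T = 1 | X = x)
t = rng.binomial(1, e_true)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)

# Naive difference in means: confounded, because treated units have larger X.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW estimate, term by term as in the formula for tau_hat_IPW.
ipw = np.mean(t * y / e_true - (1 - t) * y / (1 - e_true))

print(f"naive difference in means: {naive:.2f}")
print(f"IPW estimate:              {ipw:.2f}")
```

Under these assumptions the naive contrast lands well above 2 while the IPW estimate is close to it. Making the propensity steeper (for example, a logistic in $3x$) is an easy way to watch the variance grow as weights become extreme.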

Doubly robust estimators

A refinement that combines outcome modelling and propensity-score modelling. The doubly robust estimator (Robins, Rotnitzky, Zhao, 1994) uses both a propensity model $\hat{e}(x)$ and an outcome model $\hat{\mu}_t(x) = \hat{\mathbb{E}}[Y \mid T = t, X = x]$ .

The Augmented Inverse Propensity Weighted (AIPW) estimator:

\hat{\tau}_{\text{AIPW}} = \frac{1}{n} \sum_i \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right].

The double-robustness property: $\hat{\tau}_{\text{AIPW}}$ is consistent if either the propensity model or the outcome model is correctly specified (not necessarily both). This is a substantial robustness advantage over IPW alone or outcome regression alone.

In modern practice, doubly robust estimators are the default for causal-effect estimation in serious applied settings. The robustness is real and matters when models are misspecified (which they almost always are).
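The double-robustness property can be checked numerically. The sketch below assumes a hypothetical data-generating process with true ATE 2.0 and plugs deliberately good or bad nuisance models into the AIPW formula; the "good" models are oracle quantities used only for illustration, and the "bad" ones ignore the confounder entirely.

```python
import numpy as np

def aipw(y, t, mu1, mu0, e):
    """AIPW estimate of the ATE from outcome predictions mu1, mu0 and propensities e."""
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))
t = rng.binomial(1, e_true)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)   # true ATE = 2.0

# Correct outcome model: the true conditional means (an oracle, for illustration).
mu1_good, mu0_good = 2.0 + 3.0 * x, 3.0 * x
# Misspecified outcome model: ignores X and predicts the per-arm sample mean.
mu1_bad = np.full(n, y[t == 1].mean())
mu0_bad = np.full(n, y[t == 0].mean())
# Misspecified propensity: pretends treatment was a fair coin flip.
e_bad = np.full(n, 0.5)

good_outcome_bad_prop = aipw(y, t, mu1_good, mu0_good, e_bad)
bad_outcome_good_prop = aipw(y, t, mu1_bad, mu0_bad, e_true)
both_bad = aipw(y, t, mu1_bad, mu0_bad, e_bad)

print(f"outcome ok, propensity wrong: {good_outcome_bad_prop:.2f}")
print(f"outcome wrong, propensity ok: {bad_outcome_good_prop:.2f}")
print(f"both wrong:                   {both_bad:.2f}")
```

With either nuisance model correct the estimate stays near 2; with both wrong it collapses to the confounded difference in means.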

Causal forests and meta-learners (CATE)

Estimating CATE — heterogeneous treatment effects across subpopulations — is the focus of modern ML-for-causal-inference. Several major estimators.

Causal forests (Wager and Athey, 2018). A modification of random forests that uses an honest splitting rule: each tree is split using one subsample of data and the leaf-level treatment effects are estimated using a different subsample. The honest construction gives valid pointwise confidence intervals for CATE. Implemented in the grf R package; widely used.

Meta-learners. A family of approaches that reduce CATE estimation to standard supervised learning. The basic taxonomy (Künzel et al., 2019):

S-learner. Fit a single ML model $\hat{\mu}(X, T)$ predicting outcome from covariates and treatment; CATE is $\hat{\mu}(X, 1) - \hat{\mu}(X, 0)$. Simple, but biased toward zero when regularization shrinks the contribution of $T$, which the model treats as just another feature.
T-learner. Fit two models, $\hat{\mu}_1(X)$ on treated units and $\hat{\mu}_0(X)$ on untreated units; CATE is $\hat{\mu}_1(X) - \hat{\mu}_0(X)$. Lets the two response surfaces differ freely but shares no information between arms, so it loses efficiency, especially when one arm is much smaller than the other.
X-learner. A hybrid that uses T-learner predictions to impute counterfactual outcomes, then trains a second-stage model. Empirically strong when treatment groups have very different sizes.
R-learner. Minimizes a residualized objective (Nie and Wager, 2021, building on Robinson, 1988): estimate $\tau(X)$ by minimizing $\sum_i ((Y_i - \hat{m}(X_i)) - (T_i - \hat{e}(X_i)) \tau(X_i))^2$, where $\hat{m}$ is the marginal-outcome model and $\hat{e}$ is the propensity. Has good theoretical properties when both nuisance models are estimated with flexible ML.

The meta-learner literature shows: which learner is best depends on the data structure. No single learner dominates across settings.
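The dependence on data structure shows up even in a toy case. The sketch below compares an S-learner and a T-learner on hypothetical data with a heterogeneous effect $\tau(x) = 1 + 2x$. Both use ordinary least squares as the base learner, a deliberately restrictive stand-in for an ML model: with no interaction term the S-learner can only represent a constant effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.uniform(-1, 1, size=n)
t = rng.binomial(1, 0.5, size=n)                 # randomized treatment, for simplicity
tau_true = 1.0 + 2.0 * x                         # hypothetical heterogeneous effect
y = x + tau_true * t + rng.normal(scale=0.5, size=n)

def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# S-learner: one model mu(X, T) with T as an ordinary feature (no interaction).
s_coef = fit_linear(np.column_stack([x, t]), y)
cate_s = np.full(n, s_coef[2])                   # mu(x, 1) - mu(x, 0) is constant in x

# T-learner: a separate model for each treatment arm.
c1 = fit_linear(x[t == 1], y[t == 1])
c0 = fit_linear(x[t == 0], y[t == 0])
cate_t = (c1[0] + c1[1] * x) - (c0[0] + c0[1] * x)

rmse_s = np.sqrt(np.mean((cate_s - tau_true) ** 2))
rmse_t = np.sqrt(np.mean((cate_t - tau_true) ** 2))
print(f"S-learner CATE RMSE: {rmse_s:.3f}")
print(f"T-learner CATE RMSE: {rmse_t:.3f}")
```

Here the T-learner wins because the two arms really do have different slopes and are equally sized. Shrinking one arm to a few dozen units, or making the true effect constant, tilts the comparison the other way.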

Double Machine Learning (DML)

A landmark recipe from Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins (2018) “Double/Debiased Machine Learning for Treatment and Structural Parameters.” DML provides a framework for combining flexible ML with valid statistical inference for causal parameters.

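The setup. We want to estimate a treatment effect $\tau$ under ignorability with high-dimensional covariates $X$. The simplest case is the partially linear model $Y = \tau T + g(X) + \varepsilon$, in which the effect is constant and therefore equals the ATE; with heterogeneous effects the estimator below recovers a weighted average of $\tau(x)$, not the ATE itself. Use ML to estimate two nuisance parameters: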

$\hat{m}(X) \approx \mathbb{E}[Y \mid X]$ — the marginal outcome model.
$\hat{e}(X) \approx P(T = 1 \mid X)$ — the propensity score.

Then estimate $\tau$ from the Neyman-orthogonal score:

\hat{\tau}_{\text{DML}} = \frac{\sum_i (T_i - \hat{e}(X_i))(Y_i - \hat{m}(X_i))}{\sum_i (T_i - \hat{e}(X_i))^2}.

The Neyman-orthogonality means: small errors in $\hat{m}$ and $\hat{e}$ have only second-order effects on $\hat{\tau}$ . This lets us use flexible ML estimators (gradient-boosted trees, deep nets) for the nuisance parameters without inflating the bias of the causal estimate.

The key innovation: cross-fitting. Split the data into folds; estimate nuisance parameters on each fold’s training data; use them to compute the score on the held-out data. This prevents overfitting bias from the nuisance estimators.

ALGORITHM Double Machine Learning (Chernozhukov et al., 2018, sketch)
INPUT: data (X, T, Y), ML estimators for m(X) and e(X)

1. Split data into K folds.
2. For each fold k:
   - Fit m_hat^(-k) and e_hat^(-k) on the other K-1 folds.
   - On fold k, compute residuals:
       eps_T_i = T_i - e_hat^(-k)(X_i)
       eps_Y_i = Y_i - m_hat^(-k)(X_i)
3. Combine across folds:
   tau_hat = sum(eps_T_i * eps_Y_i) / sum(eps_T_i^2)
4. Compute standard error from the influence function;
   construct confidence interval.
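The four steps can be sketched in a few lines. The version below is a minimal illustration, not a reference implementation: it assumes a scalar covariate, uses cubic-polynomial least squares as a stand-in for the ML nuisance learners (so the fitted propensity is a linear-probability approximation that is not constrained to $[0, 1]$), and simulates hypothetical data with a constant effect of 1.5. The function names are illustrative.

```python
import numpy as np

def poly_features(x, degree=3):
    """Polynomial basis in a scalar covariate (stand-in for a flexible ML learner)."""
    return np.column_stack([x ** d for d in range(degree + 1)])

def fit_predict(x_train, target_train, x_test):
    """Least-squares fit on the training folds, prediction on the held-out fold."""
    coef, *_ = np.linalg.lstsq(poly_features(x_train), target_train, rcond=None)
    return poly_features(x_test) @ coef

def dml_plr(x, t, y, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate of tau with a standard error."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds   # step 1
    eps_t = np.empty(n)
    eps_y = np.empty(n)
    for k in range(n_folds):                                       # step 2
        test, train = folds == k, folds != k
        eps_t[test] = t[test] - fit_predict(x[train], t[train], x[test])
        eps_y[test] = y[test] - fit_predict(x[train], y[train], x[test])
    tau = np.sum(eps_t * eps_y) / np.sum(eps_t ** 2)               # step 3
    # Step 4: plug-in variance from the influence function of the orthogonal score.
    psi = eps_t * (eps_y - tau * eps_t)
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(eps_t ** 2)
    return tau, se

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.5 * t + np.sin(x) + x ** 2 + rng.normal(size=n)   # true tau = 1.5

tau_hat, se = dml_plr(x, t, y)
print(f"tau_hat = {tau_hat:.3f}, "
      f"95% CI = ({tau_hat - 1.96 * se:.3f}, {tau_hat + 1.96 * se:.3f})")
```

Swapping `fit_predict` for any regressor that exposes fit and predict steps leaves the rest unchanged; the cross-fitting loop is what keeps the nuisance fit and the score evaluation on separate data.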

DML is the modern default for causal-effect estimation with high-dimensional covariates. It combines the flexibility of modern ML (which can capture complex nuisance relationships) with the statistical rigor of classical causal inference (valid confidence intervals, double robustness).

Deep learning for causal effect estimation

A more recent thread. Treatment-specific neural networks parameterize $\mu_t(x)$ as a deep network; Counterfactual Regression (CFR; Shalit, Johansson, Sontag, 2017) adds representation learning — learn a shared representation $\phi(x)$ such that treated and untreated distributions are balanced in representation space. DragonNet (Shi, Blei, Veitch, 2019) combines outcome models with propensity heads in a shared backbone. Causal Transformer, TARNet, CEVAE (Causal Effect VAE), and many other variants exist.

The honest accounting. Deep learning for causal effect estimation has shown modest empirical advantages over gradient-boosted trees + double ML on standard benchmarks. The advantages are real but smaller than the deep-learning revolution in other domains. The most-deployed causal-effect estimators in 2026 industry use a mix: gradient-boosted trees + DML for tabular/structured data; deep learning for high-dimensional unstructured covariates (images, text); causal forests for interpretable CATE.

What works in practice

A summary for practitioners.

Setting	Recommended approach
Small-to-moderate $n$ , low-dim $X$	Propensity matching or doubly robust
Large $n$ , moderate-dim $X$	Double ML with gradient-boosted nuisances
CATE with low-dim $X$	Causal forest
CATE with high-dim $X$	DML + meta-learner (R or X)
High-dim unstructured $X$	Deep counterfactual regression + DML

Always: sensitivity analysis for unmeasured confounders. Cross-validation of the nuisance estimators. Honest sample-splitting for valid inference. Robustness checks with alternative specifications.

Where estimation fits

The summary. Identification (§4–§5) tells us which causal effects can in principle be recovered from the observable distribution. Estimation tells us how to do this from finite data with valid inference. Modern estimation combines the robustness of classical statistical theory (doubly robust, Neyman orthogonality) with the flexibility of modern ML (gradient-boosted trees, deep nets for nuisances).

The honest caveat. Every estimator in this section assumes the identifiability assumptions of §4–§5 hold. If the assumed causal graph is wrong, the estimator computes the wrong causal effect (with potentially impressively small standard errors). The fragility of causal estimation is in the identification assumptions, not in the estimator’s finite-sample variance. Sensitivity analysis is the practical check on this fragility; OP-C-1 and OP-C-9 (in §14) are the open problems that drive it.

§7. Counterfactuals

The third rung of causation (§3). Counterfactual reasoning asks what would have happened if a variable had been different in a specific case, given what we actually observed. This is qualitatively different from interventional reasoning (which asks about hypothetical interventions on the population) and requires correspondingly more structure to compute.

This section develops counterfactual computation from an SCM, the equivalent potential-outcomes formulation, mediation analysis as a counterfactual structural decomposition, and counterfactual fairness as a substantive application.

What counterfactuals ask

A worked example. A patient was prescribed drug $A$ for headache and recovered. The question: would the patient have recovered if they had been prescribed drug $B$ instead?

This is not a question about the population-average effect of drug $B$ vs drug $A$ — that is a Rung-2 (intervention) question. It is a question about this specific patient given everything we observed about them — their age, symptoms, medical history, and the fact that they took drug $A$ and recovered. We want to evaluate, for this specific case, what the alternative outcome would have been.

Counterfactual queries arise in:

Medicine. “Did the treatment cause the recovery?” — would they have recovered without it?
Law. “But for the defendant’s action, would the harm have occurred?” — the legal counterfactual standard.
Personalized policy. “If we had given this person a different recommendation, what would they have done?”
Fairness. “Would this decision have been different if the applicant’s protected attribute had been different, holding other things equal?”

Each question requires parallel-world reasoning — what would have happened in this specific case if one thing had changed.

Computing counterfactuals from an SCM

The key technical result. Given a full structural causal model (graph plus structural equations plus noise distributions), counterfactual quantities are computable via a three-step procedure.

Pearl’s three steps (Pearl, 2000, ch. 7):

Abduction. Given the observed data, infer the values of the exogenous noise variables $U$ . Each $U_i$ is the idiosyncratic part of variable $V_i$ — everything specific to this case that the structural equations did not deterministically pin down.
Action. Modify the SCM by replacing the structural equation for the counterfactual variable with the counterfactual value: $V_i := v_i^*$ . (This is the same modification as the $\text{do}$ -operator at Rung 2.)
Prediction. Using the inferred $U$ values and the modified SCM, compute the counterfactual outcome by running the structural equations forward.

Step 1 — Abduction

Given SCM M and observed data v: for each variable V_i, infer exogenous noise U_i such that the structural equation produces the observed v_i. Result: P(U | V = v).

Step 2 — Action

Modify M by replacing V_j’s structural equation with the counterfactual assignment V_j := v_j*. Result: modified SCM M*.

Step 3 — Prediction

Run M* forward with the inferred U values. Compute the counterfactual outcome. Result: Y_{v_j*} | V = v.

A worked example. SCM with two variables: cause $X$ and effect $Y$ , structural equations $X := U_X$ , $Y := f(X, U_Y)$ .

Observed: $X = x$ , $Y = y$ .

Counterfactual question: $Y_{x'} \mid X = x, Y = y$ — what would $Y$ have been if $X = x'$ instead?

Abduction. From observed $X = x$ , the noise $U_X$ has been pinned. From observed $Y = y$ and the structural equation $Y = f(x, U_Y)$ , the noise $U_Y$ is determined: $U_Y = f^{-1}(y; x)$ (assuming the equation can be inverted; otherwise we get a distribution over $U_Y$ ).
Action. Set $X := x'$ in the modified SCM.
Prediction. Compute $Y_{x'} = f(x', U_Y) = f(x', f^{-1}(y; x))$ .

The crucial point: the counterfactual outcome $Y_{x'}$ depends on $U_Y$ — the noise specific to this case. Two different cases with the same $X, Y$ but different latent $U_Y$ would have different counterfactual outcomes. This is the individualization that makes counterfactuals different from population-average interventions.
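The three steps are short enough to write out. The sketch below assumes a hypothetical additive-noise equation $f(x, u) = 2x + u$, chosen only because it can be inverted in closed form; the function names are illustrative.

```python
def f(x: float, u_y: float) -> float:
    """Hypothetical structural equation Y := 2*X + U_Y."""
    return 2.0 * x + u_y

def abduct_u_y(x: float, y: float) -> float:
    """Abduction: invert f in its noise argument to recover U_Y from (x, y)."""
    return y - 2.0 * x

def counterfactual_y(x_obs: float, y_obs: float, x_cf: float) -> float:
    u_y = abduct_u_y(x_obs, y_obs)   # Step 1: abduction
    # Step 2: action, the equation for X is replaced by X := x_cf
    return f(x_cf, u_y)              # Step 3: prediction with the abducted noise

# Two units with the same X = 1 but different observed Y carry different noise,
# so the same alternative x' = 3 gives them different counterfactual outcomes.
print(counterfactual_y(1.0, 2.5, 3.0))  # 6.5
print(counterfactual_y(1.0, 1.0, 3.0))  # 5.0
```

A population-level query $\mathbb{E}[Y \mid \text{do}(X = 3)]$ would return a single number for both units; the counterfactual differs between them because the abducted $U_Y$ differs.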

Counterfactuals in the potential-outcomes framework

The same content in Rubin’s notation. In the potential-outcomes framework, each unit $i$ has a vector of potential outcomes $(Y_i(0), Y_i(1))$ — what would have happened under each treatment. The counterfactual question “what would $Y_i$ have been under the alternative treatment?” is just asking for the unobserved potential outcome.

The framework’s reading. We observe $T_i$ and $Y_i = Y_i(T_i)$ — the actual treatment and the corresponding outcome. The counterfactual $Y_i(1 - T_i)$ is the unobserved potential outcome under the alternative treatment. The individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ is the difference between the two potential outcomes, of which we observe only one.

The Rubin framework treats counterfactuals as unobserved variables; the Pearl framework treats them as computable quantities given an SCM. The frameworks agree for the queries they both address but differ in emphasis: Rubin emphasizes treatment effects and observational identification; Pearl emphasizes general counterfactual computation from SCMs.

The identifiability of counterfactuals

A subtle point. Rung 2 (intervention) queries are sometimes identifiable from observational data given a causal graph. Rung 3 (counterfactual) queries are generally not — even with a known causal graph and observational data, most counterfactuals are not uniquely determined.

The reason. Two SCMs can agree on every observational and every interventional distribution and still produce different counterfactuals. The counterfactual depends on the specific functional form of the structural equations, not just on which variables affect which others. Without committing to a specific functional form, counterfactuals are not identifiable.
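A standard two-variable construction shows this exactly. The sketch below assumes binary $X$ and $Y$ with independent fair-coin noises; in model A the treatment does nothing, in model B it flips the outcome for every unit. All names are illustrative, and the probabilities are computed by enumeration, not sampling.

```python
from itertools import product

def y_model_a(x: int, u_y: int) -> int:
    return u_y          # X has no effect on Y for any unit

def y_model_b(x: int, u_y: int) -> int:
    return x ^ u_y      # X flips Y for every unit (XOR)

def observational(y_fn):
    """Joint distribution P(X = x, Y = y) when X := U_X."""
    dist = {}
    for u_x, u_y in product([0, 1], repeat=2):   # each noise setting has mass 1/4
        key = (u_x, y_fn(u_x, u_y))
        dist[key] = dist.get(key, 0.0) + 0.25
    return dist

def interventional(y_fn, x: int) -> float:
    """P(Y = 1 | do(X = x))."""
    return sum(0.5 * y_fn(x, u_y) for u_y in [0, 1])

def counterfactual(y_fn, x_obs: int, y_obs: int, x_cf: int) -> float:
    """P(Y_{x_cf} = 1 | X = x_obs, Y = y_obs) via abduction, action, prediction."""
    consistent = [u_y for u_y in [0, 1] if y_fn(x_obs, u_y) == y_obs]
    return sum(y_fn(x_cf, u_y) for u_y in consistent) / len(consistent)

print(observational(y_model_a) == observational(y_model_b))        # True
print(interventional(y_model_a, 1), interventional(y_model_b, 1))  # 0.5 0.5
print(counterfactual(y_model_a, 0, 0, 1))                          # 0.0
print(counterfactual(y_model_b, 0, 0, 1))                          # 1.0
```

No amount of observational or experimental data separates the two models, yet for a unit observed with $X = 0, Y = 0$ they give opposite answers to "what if $X$ had been 1?".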

When are counterfactuals identifiable?

With strong functional-form assumptions. Linear-Gaussian SCMs, additive-noise models, monotonicity assumptions — these constrain the structural equations enough that counterfactuals become unique.
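With experimental data. Randomized cross-over designs (where each unit receives both treatments at different times) provide direct access to both potential outcomes, provided there are no carry-over effects between periods and each unit’s response is stable over time.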
With probabilistic counterfactuals. We can sometimes identify bounds on counterfactual probabilities even when point identification fails.

The honest accounting. Counterfactual reasoning is more demanding than interventional reasoning. The framework is mathematically clean but practically limited; counterfactual claims should be made with explicit awareness of the structural assumptions they require.

Mediation analysis

A specific application. Mediation analysis asks: when $X$ causes $Y$ , how much of the effect goes through a mediator variable $M$ , and how much goes directly?

flowchart LR
  X["$$X$$"]
  M["$$M$$ (mediator)"]
  Y["$$Y$$"]

  X --> Y
  X --> M
  M --> Y

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class X,M,Y pill

The total effect of $X$ on $Y$ decomposes into:

Direct effect. The effect of $X$ on $Y$ that does not go through $M$ .
Indirect effect. The effect of $X$ on $Y$ that goes through $M$ .

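The decomposition is a counterfactual question. The natural direct effect asks: what would the effect of $X$ on $Y$ be if we held $M$ fixed at the value it would have taken under control? The natural indirect effect asks: what would the effect on $Y$ be of changing $M$ from its control value to its treatment value, while holding $X$ fixed? With $X$ held at control (Pearl’s original definition), the direct and indirect effects sum to the total effect only when $X$ and $M$ do not interact in their effect on $Y$; with $X$ held at treatment, the sum is exact. These are quintessentially counterfactual quantities.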

The applications. Mediation analysis is used in:

Epidemiology. Decomposing the effect of education on health into pathways through income, behaviour, and stress.
Social science. Understanding causal mechanisms in policy interventions.
Fairness. Decomposing disparate outcomes into “legitimate” and “illegitimate” pathways (a contested but influential framing).

Practical mediation analysis requires both sequential identification (the effect of $X$ on $M$ and the effect of $M$ on $Y$ given $X$ are both identifiable) and no mediator-outcome confounding affected by treatment (a non-trivial assumption). The VanderWeele textbook (“Explanation in Causal Inference,” 2015) is the canonical reference.
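The nested potential outcomes $Y(x, M(x'))$ behind natural effects can be made concrete in a simulation where the SCM is known. The sketch below assumes a hypothetical linear SCM with an $X$-$M$ interaction (all coefficients are illustrative) and evaluates every world on the same units, that is, with the same noise draws. It reports the natural indirect effect under both conventions for where $X$ is held.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
u_m = rng.normal(size=n)
u_y = rng.normal(size=n)

def m_of(x):
    """Mediator equation M := 2*X + U_M (hypothetical coefficients)."""
    return 2.0 * x + u_m

def y_of(x, m):
    """Outcome equation Y := X + 3*M + 0.5*X*M + U_Y, with an X-M interaction."""
    return 1.0 * x + 3.0 * m + 0.5 * x * m + u_y

# Nested potential outcomes Y(x, M(x')), evaluated on the same units.
y11 = y_of(1, m_of(1))
y00 = y_of(0, m_of(0))
y10 = y_of(1, m_of(0))   # treated, mediator as it would be under control
y01 = y_of(0, m_of(1))   # control, mediator as it would be under treatment

total = np.mean(y11 - y00)
nde = np.mean(y10 - y00)
nie_x_control = np.mean(y01 - y00)
nie_x_treated = np.mean(y11 - y10)

print(f"total effect          : {total:.2f}")
print(f"natural direct effect : {nde:.2f}")
print(f"NIE with X at control : {nie_x_control:.2f}")
print(f"NIE with X at treated : {nie_x_treated:.2f}")
```

In this SCM the direct effect plus the indirect effect with $X$ at treatment reproduces the total effect exactly, while the version with $X$ at control falls short by the interaction term. Setting the interaction coefficient to zero makes the two conventions coincide. None of `y10` or `y01` is observable in real data, which is why the identification assumptions above carry so much weight.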

Counterfactual fairness

A landmark application of counterfactual reasoning to algorithmic fairness. Kusner, Loftus, Russell, Silva (2017) “Counterfactual Fairness” defined fairness in counterfactual terms.

Definition. A predictor $\hat{Y}$ is counterfactually fair with respect to a protected attribute $A$ if, for every individual, $\hat{Y}$ would be the same in the counterfactual world where their value of $A$ had been different (while holding all non-descendants of $A$ fixed).

Formally:

P(\hat{Y}_{A \leftarrow a} = y \mid X = x, A = a) = P(\hat{Y}_{A \leftarrow a'} = y \mid X = x, A = a) \quad \forall a, a', x, y.

Reading this. For each individual (specified by $X = x, A = a$ ), the prediction should be the same regardless of whether their counterfactual protected attribute was $a$ or $a'$ . The fairness is individual-level (it applies to each person, not just the population) and counterfactual (it asks about parallel worlds, not actual data).

The technical content. Counterfactual fairness can be achieved by:

Building a causal model identifying the descendants of $A$ .
Using only the non-descendants of $A$ as predictors.
Equivalently: ensuring the predictor is a function only of variables whose counterfactual values are insensitive to $A$ .
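A toy check of the definition. The sketch below assumes a hypothetical SCM in which the observed feature is $X := \alpha A + U$ with latent $U$ independent of $A$; the value of $\alpha$ and both predictors are illustrative. The fair predictor uses only the abducted $U$, a non-descendant of $A$; the unfair one uses $X$ directly.

```python
ALPHA = 1.5  # hypothetical effect of A on X in the assumed SCM  X := ALPHA*A + U

def abduct_u(x: float, a: int) -> float:
    """Recover the latent U from the observed (x, a) under the assumed SCM."""
    return x - ALPHA * a

def counterfactual_x(x: float, a: int, a_cf: int) -> float:
    """Value X would have taken for this individual had A been a_cf."""
    return ALPHA * a_cf + abduct_u(x, a)

def unfair_predictor(x: float, a: int) -> float:
    return 2.0 * x                 # uses X, a descendant of A

def fair_predictor(x: float, a: int) -> float:
    return 2.0 * abduct_u(x, a)    # uses only the inferred non-descendant U

def counterfactual_gap(predictor, x: float, a: int) -> float:
    """Prediction in the world where A is flipped, minus the factual prediction."""
    a_cf = 1 - a
    return predictor(counterfactual_x(x, a, a_cf), a_cf) - predictor(x, a)

print(counterfactual_gap(unfair_predictor, x=2.0, a=0))  # 3.0
print(counterfactual_gap(fair_predictor, x=2.0, a=0))    # 0.0
```

The fair predictor's invariance is only as good as the assumed structural equation: if the true dependence of $X$ on $A$ is not $\alpha A$, the abducted $U$ still carries information about $A$.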

The contest. Counterfactual fairness is one fairness criterion among many; others include demographic parity, equalized odds, calibration. The criteria are mutually incompatible in general (Chouldechova, 2017; Kleinberg, Mullainathan, Raghavan, 2017). Different fairness criteria reflect different normative commitments; counterfactual fairness has the advantage of being individual-level and the disadvantage of requiring a causal model.

The applications. Counterfactual fairness has been studied in hiring, lending, criminal justice, and college admissions. Production deployment is limited because constructing the required causal model is hard and contested. The framework remains influential in fairness research even where the specific criterion is not the deployed one.

Where counterfactuals fit

The summary. Counterfactuals are the third rung of the ladder, the most demanding form of causal reasoning. They require an SCM (graph plus structural equations); they are generally not identifiable from observational data alone; they have substantive applications in medicine, law, fairness, and personalized policy. The technical machinery is mature; the practical use is limited by identifiability requirements.

The cross-section to LLMs. A natural question: can a language model reason counterfactually? Empirically, LLMs handle textually-described counterfactuals at the narrative level — they can produce plausible counterfactual stories. Whether they reason about counterfactuals using causal structure or via surface-level pattern matching is debated; §11 develops the empirical evidence.

§8. Causal Discovery

So far the chapter has assumed the causal graph is given — by domain knowledge, expert judgment, or theoretical considerations. Causal discovery asks the inverse question: can we learn the causal graph from data?

The honest answer is partially. Under standard assumptions (the causal Markov condition and faithfulness), some structural information is recoverable from observational data alone; some is not. The fundamental limit is the Markov equivalence class: multiple distinct causal graphs can imply the same conditional independencies and are therefore indistinguishable from observational data alone. Discovery methods aim to recover as much structure as is identifiable, and additional information (interventional data, functional-form assumptions, multiple environments) can sometimes resolve the remaining ambiguity.

This section develops the main causal-discovery methodologies: constraint-based methods (PC, FCI), score-based methods (GES), and functional-causal-model methods (LiNGAM, ANM, nonlinear extensions).

The problem statement

The setup. Given a dataset with $n$ samples and $d$ variables, learn a directed acyclic graph (DAG) over the $d$ variables that best represents the causal structure.

The fundamental difficulty. Multiple DAGs can be Markov equivalent — they encode the same set of conditional independencies among observable variables, so they cannot be distinguished from observational data alone. The Markov-equivalence class is the equivalence class of DAGs with the same skeleton (undirected graph) and the same v-structures (colliders without a direct edge between the parents).

A simple example. Three variables, $X, Y, Z$ . The three DAGs:

$X \to Y \to Z$
$X \leftarrow Y \to Z$
$X \leftarrow Y \leftarrow Z$

all imply the same conditional independencies: $X \perp\!\!\!\perp Z \mid Y$ , with no other independencies. They are Markov equivalent and indistinguishable from observational data alone. By contrast, $X \to Y \leftarrow Z$ implies $X \perp\!\!\!\perp Z$ but not $X \perp\!\!\!\perp Z \mid Y$ — observationally distinct from the others.
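The skeleton-plus-v-structures characterization is easy to check mechanically. A minimal sketch for the four graphs above, with DAGs represented as lists of (parent, child) pairs; the helper names are illustrative.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: a set of unordered pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Unshielded colliders a -> c <- b, where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    found = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

chain     = [("X", "Y"), ("Y", "Z")]   # X -> Y -> Z
fork      = [("Y", "X"), ("Y", "Z")]   # X <- Y -> Z
rev_chain = [("Z", "Y"), ("Y", "X")]   # X <- Y <- Z
collider  = [("X", "Y"), ("Z", "Y")]   # X -> Y <- Z

print(markov_equivalent(chain, fork))       # True
print(markov_equivalent(chain, rev_chain))  # True
print(markov_equivalent(chain, collider))   # False
```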

The implication. Observational causal discovery can identify the Markov equivalence class but generally not the unique DAG within it. This is the fundamental limit: observational data alone cannot identify edge directions in the parts of the graph where multiple equivalent directions are consistent with the data.

Constraint-based methods: PC and FCI

PC algorithm (Spirtes and Glymour, 1991) is the prototypical constraint-based method. The approach:

Start with a complete undirected graph.
For each pair of variables $X, Y$ , test conditional independence $X \perp\!\!\!\perp Y \mid Z$ for various $Z$ . If conditional independence holds for some $Z$ , remove the edge $X - Y$ .
After this skeleton phase, orient edges using the v-structure rule: if $X - Z - Y$ in the skeleton, $X$ and $Y$ are not adjacent, and $Z$ is not in the separating set that made $X$ and $Y$ independent, orient as $X \to Z \leftarrow Y$.
Apply additional orientation rules (Meek rules) to propagate edge directions where forced by acyclicity and Markov equivalence.

ALGORITHM PC (Spirtes and Glymour 1991, sketch)
INPUT: dataset, conditional-independence test

1. Start with complete undirected graph on the d variables.
2. For each pair (X, Y) and conditioning set Z:
     If X ⊥ Y | Z by the CI test:
       Remove edge X-Y; record the separating set Sep(X, Y) = Z.
3. (Skeleton complete) For each unshielded triple X - Z - Y:
     If Z is NOT in Sep(X, Y): orient as X → Z ← Y (v-structure).
4. Apply Meek orientation rules to propagate as far as possible.
5. Output: the resulting PDAG (Partially Directed Acyclic Graph)
   representing the Markov equivalence class.
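Steps 1 to 3 can be sketched compactly for linear-Gaussian data. The version below is a simplified illustration and not the pcalg or causal-learn implementation: it uses a Fisher z-test on partial correlations as the CI test, takes a correlation matrix plus a nominal sample size as input, and omits the Meek rules of step 4. To keep the example deterministic it is fed population correlation matrices for two hypothetical SCMs; with real data the matrix would be estimated from the sample.

```python
import math
from itertools import combinations

def partial_corr(corr, i, j, cond):
    """Partial correlation of i and j given the tuple cond (standard recursion)."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def independent(corr, n, i, j, cond, z_crit=1.96):
    """Fisher z-test of zero partial correlation (two-sided, roughly 5% level)."""
    r = max(min(partial_corr(corr, i, j, cond), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return abs(z) < z_crit

def pc(corr, n):
    d = len(corr)
    adj = {i: set(range(d)) - {i} for i in range(d)}   # step 1: complete graph
    sepset = {}
    size = 0
    # Step 2: test conditioning sets of increasing size drawn from current neighbours.
    while any(len(adj[i]) - 1 >= size for i in range(d)):
        for i in range(d):
            for j in sorted(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), size):
                    if independent(corr, n, i, j, cond):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(cond)
                        break
        size += 1
    # Step 3: orient unshielded triples i - k - j when k is not in Sep(i, j).
    arrows = set()
    for k in range(d):
        for i, j in combinations(sorted(adj[k]), 2):
            if j not in adj[i] and k not in sepset[frozenset((i, j))]:
                arrows |= {(i, k), (j, k)}
    return adj, arrows

# Collider X -> Z <- Y with unit-variance noises; variables indexed X=0, Y=1, Z=2.
r = 1 / math.sqrt(3)
collider_corr = [[1.0, 0.0, r], [0.0, 1.0, r], [r, r, 1.0]]
print(pc(collider_corr, n=1000))   # edges 0-2 and 1-2, arrows (0, 2) and (1, 2)

# Chain X -> Y -> Z with unit-variance noises; indexed X=0, Y=1, Z=2.
a, b, c = 1 / math.sqrt(2), 1 / math.sqrt(3), 2 / math.sqrt(6)
chain_corr = [[1.0, a, b], [a, 1.0, c], [b, c, 1.0]]
print(pc(chain_corr, n=1000))      # edges 0-1 and 1-2, no arrows
```

The collider is fully oriented, while the chain comes back as an undirected path: exactly the Markov-equivalence behaviour described earlier.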

The strengths of PC. Theoretically clean; provably correct under causal sufficiency (no unmeasured confounders) and faithfulness; widely implemented (R pcalg package, Python causal-learn).

The limitations. PC is brittle to errors in conditional-independence tests — false positives and negatives in the CI tests propagate to errors in the graph. For continuous variables, CI testing is hard (especially in nonlinear settings); for moderate dimensionality ( $d \geq 50$ ), test errors accumulate. PC also assumes causal sufficiency (no unmeasured confounders) — a strong assumption.

FCI (Fast Causal Inference; Spirtes, Meek, Richardson, 1995) extends PC to handle unmeasured confounders. The output is not a DAG but a more complex graph (PAG, Partial Ancestral Graph) that encodes “either X is an ancestor of Y, or there is an unmeasured confounder, or both.” FCI is more robust to confounding but produces weaker conclusions.

Score-based methods: GES

A different approach. GES (Greedy Equivalence Search; Chickering, 2002) treats the graph-discovery problem as optimization: find the DAG that maximizes a score relating the graph to the data.

The recipe.

Forward search. Start from the empty graph. Greedily add edges (or v-structures) that improve the score, until no addition improves it.
Backward search. Start from the resulting graph. Greedily remove edges that improve the score, until no removal improves it.

The score is typically a Bayesian Information Criterion (BIC) — log-likelihood penalized by model complexity. BIC is score-equivalent — it assigns the same score to all members of a Markov equivalence class — which makes GES naturally work on equivalence classes rather than individual DAGs.

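Score-based search has been extended to latent variables and nonlinear models. For high dimensions, NOTEARS (Zheng et al., 2018) replaces the discrete search with continuous optimization under a smooth acyclicity constraint; it is a separate score-based method, not an extension of GES.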

The strengths. Score-based methods are less sensitive to individual CI-test errors than constraint-based methods. They can be tractably applied to larger graphs.

The limitations. The space of DAGs is super-exponential in $d$ ; search is fundamentally hard. Greedy procedures can get stuck in local optima. NOTEARS-style continuous relaxations are scalable but introduce their own optimization difficulties.

Functional causal models: LiNGAM and ANM

A third approach. Functional Causal Models (FCMs) make assumptions about the functional form of the structural equations, and exploit these to break Markov-equivalence ambiguity.

LiNGAM (Linear Non-Gaussian Acyclic Model; Shimizu, Hoyer, Hyvärinen, Kerminen, 2006). Assumes:

Linear: $V_i = \sum_{j \neq i} b_{ij} V_j + U_i$ .
Non-Gaussian: the noise variables $U_i$ are mutually independent and non-Gaussian.
Acyclic: the underlying graph is a DAG.

The crucial result. Under these assumptions, the causal DAG is uniquely identifiable from observational data — no Markov-equivalence ambiguity remains. The non-Gaussianity of the noise distinguishes between, e.g., $X \to Y$ and $Y \to X$ in a way that Gaussian-distribution-only methods cannot.

The intuition. In Gaussian linear models, $X \to Y$ and $Y \to X$ produce indistinguishable joint distributions (both bivariate Gaussians with the same correlation). In non-Gaussian linear models, the distributions differ in higher-order moments (skewness, kurtosis, etc.), which encode the causal direction.
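The asymmetry can be seen in a two-variable simulation. The sketch below is not the ICA-based LiNGAM algorithm; it is a pairwise illustration under assumed uniform noise. It regresses in each direction and measures dependence between residual and regressor through the correlation of their squares, a crude higher-order statistic chosen for simplicity.

```python
import numpy as np

def dependence_after_regression(cause, effect):
    """Regress effect on cause by OLS; return corr(residual^2, cause^2).

    Residual and regressor are uncorrelated by construction, so a clearly
    nonzero value signals dependence beyond second-order moments.
    """
    cause = cause - cause.mean()
    effect = effect - effect.mean()
    slope = np.dot(cause, effect) / np.dot(cause, cause)
    resid = effect - slope * cause
    return np.corrcoef(resid ** 2, cause ** 2)[0, 1]

rng = np.random.default_rng(5)
n = 50_000

# Non-Gaussian (uniform) noise; the true direction is X -> Y.
x = rng.uniform(-1, 1, size=n)
y = x + rng.uniform(-1, 1, size=n)
forward = dependence_after_regression(x, y)     # fit in the causal direction
backward = dependence_after_regression(y, x)    # fit in the anti-causal direction

# Gaussian noise, same structure: neither direction leaves a trace.
xg = rng.normal(size=n)
yg = xg + rng.normal(size=n)
forward_g = dependence_after_regression(xg, yg)
backward_g = dependence_after_regression(yg, xg)

print(f"uniform noise : forward {forward:+.3f}, backward {backward:+.3f}")
print(f"Gaussian noise: forward {forward_g:+.3f}, backward {backward_g:+.3f}")
```

With uniform noise the residual is independent of the regressor only in the causal direction, so the backward statistic is clearly nonzero. With Gaussian noise both directions give values near zero, which is the non-identifiability the text describes.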

ANM (Additive Noise Models; Hoyer et al., 2009). Generalizes to nonlinear models: $V_i = f_i(\text{Pa}(V_i)) + U_i$ for nonlinear $f_i$ and independent noise. Under regularity conditions, the additive-noise structure breaks the Markov-equivalence ambiguity even in nonlinear settings.

Post-Nonlinear Models, Causal Additive Models. Further generalizations.

The strength of FCMs. They achieve unique causal identification when their assumptions hold — a substantial advance over constraint-based methods, which only identify the equivalence class.

The weakness of FCMs. The assumptions are strong and untestable. Real data may not satisfy linearity (LiNGAM), additivity (ANM), or even the parametric family the method assumes. Misspecification produces incorrect causal conclusions.

Causal discovery in nonlinear, high-dimensional settings

The frontier. Modern data — images, sentences, time-series with many variables — is high-dimensional and nonlinear. Classical causal discovery does not scale or apply directly. Several lines address this.

Deep-learning-based discovery. DAG-GNN, NOTEARS-MLP, GraN-DAG, and related methods use neural networks to model the structural equations and continuous-optimization tricks to search the DAG space. Empirically promising on small benchmarks; less reliable at scale.

Causal discovery with interventional data. If interventional data (or perturbations) are available, causal directions are much more identifiable. The Schölkopf-Bengio causal representation learning programme (§9) leverages this.

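Multi-environment discovery. Peters, Bühlmann, Meinshausen (2016) “Causal inference using invariant prediction” (ICP) uses data from multiple environments to identify causal predictors. The key property: conditional on the full set of its direct causes, the distribution of $Y$ is invariant across environments, provided the environments do not intervene on $Y$ itself. ICP tests each candidate predictor set for this invariance and returns the intersection of the sets that pass; under its assumptions the result contains only direct causes of $Y$, though it may miss some. This is the invariance principle: causal mechanisms are stable across environments, correlations are not.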
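A toy sketch of the invariant-prediction idea, standard library only. It is not the ICP procedure of Peters et al., which uses formal hypothesis tests; here a fixed tolerance on the residual mean and variance stands in for the test. The simulated chain $X_1 \to Y \to X_2$, the two environments, the tolerance, and all names are assumptions made for the example.

```python
import itertools
import math
import random

def mean(v):
    return sum(v) / len(v)

def residualize(y, cols):
    """OLS residuals of y on an intercept plus cols (Frisch-Waugh recursion)."""
    if not cols:
        m = mean(y)
        return [v - m for v in y]
    *rest, last = cols
    ry, rc = residualize(y, rest), residualize(last, rest)
    b = sum(a * c for a, c in zip(ry, rc)) / sum(c * c for c in rc)
    return [a - b * c for a, c in zip(ry, rc)]

def invariance_gap(resid, env):
    """Crude gap between environments 0 and 1 in residual mean and variance."""
    r0 = [r for r, e in zip(resid, env) if e == 0]
    r1 = [r for r, e in zip(resid, env) if e == 1]
    m0, m1 = mean(r0), mean(r1)
    v0 = mean([(r - m0) ** 2 for r in r0])
    v1 = mean([(r - m1) ** 2 for r in r1])
    pooled_sd = math.sqrt((v0 + v1) / 2)
    return max(abs(m0 - m1) / pooled_sd, abs(math.log(v0 / v1)))

def simulate(n_per_env=20000, seed=0):
    rng = random.Random(seed)
    x1, x2, y, env = [], [], [], []
    for e, (shift, child_sd) in enumerate([(0.0, 1.0), (2.0, 2.0)]):
        for _ in range(n_per_env):
            a = rng.gauss(shift, 1.0)            # X1: cause of Y, shifted in env 1
            t = 1.5 * a + rng.gauss(0.0, 1.0)    # Y: same mechanism in both envs
            c = t + rng.gauss(0.0, child_sd)     # X2: effect of Y, noisier in env 1
            x1.append(a); y.append(t); x2.append(c); env.append(e)
    return {"X1": x1, "X2": x2}, y, env

def invariant_sets(features, y, env, tol=0.1):
    """All predictor subsets whose pooled-regression residuals look invariant."""
    names = sorted(features)
    accepted = []
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            resid = residualize(y, [features[s] for s in subset])
            if invariance_gap(resid, env) < tol:
                accepted.append(set(subset))
    return accepted

features, y, env = simulate()
accepted = invariant_sets(features, y, env)
identified = set.intersection(*accepted) if accepted else set()
print("invariant sets:", accepted)
print("identified causal predictors:", identified)
```

In this simulation only the set containing the true parent should pass: the empty set fails because the mean of $Y$ shifts, and any set containing the child $X_2$ fails because the child's mechanism changes across environments.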

The honest accounting. High-dimensional, nonlinear causal discovery in 2026 remains an active research frontier with substantial unsolved problems. It is OP-C-1 (causal discovery in nonlinear high-dimensional settings) on the chapter’s open-problems list.

The Markov-equivalence class limitation

A recurring theme. Observational causal discovery generally identifies only the Markov equivalence class, not the unique DAG. This is a fundamental limit — not a deficiency of current methods but a property of the observational-data problem.

Three pathways to resolving the ambiguity within the equivalence class:

Functional assumptions (FCMs above). Assume linearity + non-Gaussianity, additivity, or other restrictions. Breaks the symmetry but requires the assumption to hold.
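Interventional data. An intervention on a variable orients the edges adjacent to it, and orientation rules can propagate that information further. One intervention is not always enough, but a small number of well-chosen interventions can often resolve direction ambiguities throughout the graph. If you can intervene experimentally, do so.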
Multi-environment data. Different environments produce different distributions but share the underlying causal mechanisms. Comparing across environments identifies causal structure that no single environment reveals (ICP, the invariance principle).

The pragmatic recommendation. For working causal discovery in 2026:

Use constraint-based methods (PC, FCI) for transparent and interpretable analysis with explicit conditional-independence assumptions.
Use score-based methods (GES, NOTEARS family) for moderate-scale problems where the score has good statistical properties.
Use FCM-based methods (LiNGAM, ANM) when the functional-form assumptions are plausible.
Use invariance / multi-environment methods (ICP and successors) when data from multiple environments is available — increasingly the case in modern data settings.
Combine with substantive domain knowledge about possible causal directions: many ambiguities can be resolved by expert input.

Where causal discovery fits

The summary. Causal discovery is a partial solution to the problem of getting causal graphs from data. It identifies as much structure as is observably identifiable; the remainder requires additional information. The field has mature theoretical foundations and a substantial methodological toolkit, but high-dimensional and nonlinear cases remain open frontiers.

The connection to §9 (causal representation learning). Standard causal discovery operates on predefined variables — given a set of measured variables, find their causal relationships. Modern data often does not have predefined causal variables — raw pixels, sentences, sensor readings are not themselves causal entities. The causal representation learning programme of §9 extends discovery to the problem of learning the variables alongside their causal structure.

§9. Causal Representation Learning

The classical causal-inference framework (§4–§8) operates on predefined variables — given a set of measured variables, find their causal relationships. Modern AI data often does not come with predefined causal variables. Pixels in an image, tokens in a sentence, samples in an audio waveform are not themselves causal entities; the causal variables (objects, properties, relations, agents) are latent and must be inferred.

Causal Representation Learning (CRL) is the research programme that addresses this gap: learn both the causal variables and their causal structure from raw observations. It is the bridge between modern representation learning (deep nets producing useful embeddings) and classical causal modelling (graphs over interpretable variables).

The motivating problem

A concrete illustration. Consider video data of a billiards table. Pixel-level data is high-dimensional and changes substantially between frames. The causal variables — ball positions, velocities, cue stick movements — are low-dimensional and have clean causal relationships (cue stick hits ball → ball moves → ball collides with another ball → second ball moves).

A neural network trained on the videos can produce useful embeddings — perhaps high-quality prediction of next-frame pixels. But the embedding is generally not a causal representation: dimensions of the embedding may mix multiple causal variables together; causal interventions on the embedding (changing one dimension) do not correspond to coherent physical interventions.

The CRL goal. Train a representation $\phi(x)$ from raw observations $x$ such that:

The dimensions of $\phi(x)$ correspond to causally meaningful variables — ball position, velocity, etc.
The causal structure among the latent variables can be inferred.
Interventions on the latent representation correspond to coherent interventions on the underlying causal variables.

This is fundamentally harder than standard representation learning. The goals are normative (what should the representation look like?) in a way that pixel-prediction or contrastive losses are not.

The Schölkopf-Bengio programme

Schölkopf, Janzing, Peters, Sgouritsa, Zhang, Mooij (2012) “On Causal and Anticausal Learning” was an early articulation. The programme matured through the 2010s and was formalized in Schölkopf, Locatello, Bauer, Ke, Kalchbrenner, Goyal, Bengio (2021) “Toward Causal Representation Learning” — a position paper in Proceedings of the IEEE that consolidated the research direction.

The programme’s central commitments:

The world has causal structure that is reflected in the data-generating process. The observed data is generated by a causal mechanism operating on (potentially-latent) causal variables.
This causal structure has favourable properties that standard representation learning ignores: independence of mechanisms, invariance to interventions, modularity. Representations that capture these properties should be better at OOD generalization, transfer learning, and reasoning.
The problem is identifiable in principle under appropriate assumptions. Multi-environment data, interventional access, or structural restrictions can break the underdetermination that standard observational data faces.
The connection to general AI is direct. Many open problems in AI (OOD generalization, compositionality, common-sense reasoning) have causal framings; CRL provides a path toward addressing them.

The programme is more a research direction than a unified technique. Many specific approaches share the CRL framing without sharing methods.

Independent Causal Mechanisms (ICM)

The central theoretical principle. The Independent Causal Mechanisms (ICM) hypothesis (Schölkopf et al., 2012; refined through 2019) asserts:

The causal mechanisms that generate the data are autonomous: they neither inform nor influence one another.

Reading this. The data-generating process consists of multiple mechanisms (each variable being computed from its parents). The ICM hypothesis claims these mechanisms have no shared information — knowing one mechanism does not give information about another. Mathematically, the conditional distributions $P(V_i \mid \text{Pa}(V_i))$ for different $i$ are independent in some appropriate algorithmic-information-theoretic sense.

The consequences. If ICM holds:

Intervention modularity. Changing one causal mechanism does not affect others. Interventions are local.
Invariance. Mechanisms are stable across environments; only the prior distributions over their inputs change.
Sparse parameter changes. When the data-generating process changes (a new environment, a new intervention), only a small subset of mechanisms changes.

The empirical implications. Models that encode causal structure with the modularity ICM implies should generalize better across environments where only a few mechanisms change. This connects CRL to OOD generalization (OP-TH-8): causal-invariance-based methods are one approach to OOD that ICM theoretically grounds.

Identifiability results for nonlinear ICA

A specific technical thread. Independent Component Analysis (ICA) is the classical problem of recovering independent latent sources from their linear mixtures. The linear-Gaussian case is fundamentally non-identifiable: any rotation of the sources explains the observations equally well. The linear non-Gaussian case (the basis of LiNGAM) is identifiable up to permutation and rescaling of the sources.
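The non-identifiability of the Gaussian case can be seen in one line. A sketch, with $A$ the mixing matrix, $s$ the sources, and $R$ an arbitrary orthogonal matrix (notation introduced here for illustration only):

$$x = A s, \quad s \sim \mathcal{N}(0, I) \quad\Longrightarrow\quad x = (A R^\top)(R s), \quad R s \sim \mathcal{N}(0, R R^\top) = \mathcal{N}(0, I).$$

Both $(A, s)$ and $(A R^\top, R s)$ are decompositions with independent standard-normal sources and the same observed distribution $\mathcal{N}(0, A A^\top)$, so no amount of data distinguishes them. Non-Gaussian sources break the symmetry because a rotation of independent non-Gaussian variables is in general no longer independent.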

Modern work extends to nonlinear settings. Hyvärinen, Sasaki, Turner (2019) “Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning” gave a general, practically usable identifiability result for nonlinear ICA, building on earlier time-contrastive work (Hyvärinen and Morioka, 2016). It uses auxiliary variables (typically environment or time indices) to break the unidentifiability of the purely nonlinear case.

iVAE (Khemakhem et al., 2020) gave a variational-autoencoder-based estimator for the same setup. The mathematical content: if the latent sources are conditionally independent given an auxiliary variable, and that variable modulates their distribution sufficiently, the latent sources are identifiable up to permutation and component-wise transformations.

Contrastive identifiability. A separate line of work (Zimmermann et al., 2021; HZ-Z 2022) shows that contrastive learning (SSL §5) on time-series or multi-environment data gives provable identifiability of the underlying latent sources under appropriate structural assumptions. This connects modern self-supervised representation learning to causal identifiability theory.

The practical takeaway. Identifiability in CRL is partial and requires additional structure beyond raw observations — auxiliary variables, multi-environment data, time-series structure, or interventional access. Pure observational data on a single environment generally does not suffice. The CRL community has developed a rich theoretical understanding of which additional structures restore identifiability.

Disentanglement and causal structure

A related but distinct programme. Disentanglement in deep learning is the task of learning representations where different latent dimensions correspond to different factors of variation (e.g., a face image’s latent dimensions corresponding to identity, expression, pose, lighting separately).

The early disentanglement literature (Higgins et al. $\beta$-VAE, 2017; InfoGAN; subsequent work) achieved partial success on synthetic benchmarks. Locatello et al. (2019) “Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations” made a striking negative claim: purely unsupervised disentanglement is provably impossible without inductive bias on both model and data. The result was sobering for the disentanglement programme and shifted attention toward weakly-supervised and causally-motivated approaches.

Modern disentanglement work uses auxiliary information — known interventions, multiple environments, time-series — to make the problem tractable. This is the disentanglement-side analogue of the CRL identifiability results above. The two literatures have substantially converged.

Causal world models

A specific application area worth flagging. Causal world models are world models (used in RL and embodied AI) that explicitly represent causal structure. The motivation: model-based RL (RL §8, Dreamer family) works well within the training distribution but fails on OOD interventions; a causally-structured world model should generalize better.

Notable work:

Causal InfoGAN (Kurutach et al., 2018) and related.
CWM (Causal World Models, several groups, 2021–).
CITRIS (Lippe et al., 2022) for learning disentangled causal representations from interventional video.
iCITRIS, BISCUIT (2023–2024) — successors with stronger identifiability.

These are research-stage systems. Production-grade causal world models are not yet a mainstream RL tool, but the research direction is active.

Current state and open problems

The honest accounting. CRL is a substantial research programme with rich theoretical machinery (ICM, identifiability theorems, multi-environment identifiability), active empirical work (causal world models, disentangled representation learning), and a clear motivation (OOD generalization, transfer, robustness).

The empirical state. CRL has produced modest empirical advantages over standard representation learning on standard benchmarks. The advantages are real but smaller than the theoretical promise. CRL methods are not yet dominant in production AI systems.

Why the gap. Several reasons:

Identifiability requires substantial extra structure — multi-environment data, interventional access, auxiliary variables. Many practical settings don’t have these.
The standard representation-learning baselines are strong. Contrastive learning, masked modelling, and large-scale pretraining produce useful representations without explicit causal commitments.
The benchmarks are limited. Most CRL benchmarks are synthetic; real-world settings where CRL clearly helps are still being identified.

The open problems. OP-C-2 (causal representation learning from raw observations) captures this. Whether CRL will scale to produce the next generation of foundation models — or whether scale + contrastive learning will continue to dominate — is one of the chapter’s central uncertainties. The 2024–2026 evidence suggests CRL is useful in specific settings (multi-environment robustness, intervention-aware reasoning) but is not a general replacement for non-causal representation learning.

The most promising direction in 2026. Hybrid approaches: pretrained foundation models that incorporate causal structure as a fine-tuning objective or inductive bias, leveraging both the scale of standard pretraining and the structure of causal modelling. Production deployment is partial; the research is active.

§10. Causality Meets ML: Effects, OPE, Fairness

A survey section. We bring together the causal-inference machinery of §4–§8 with several specific ML problems where causality is structurally relevant: heterogeneous treatment effects, off-policy evaluation in reinforcement learning, fairness, distribution-shift robustness, and recourse. These are the canonical application areas where causal framing has produced specific technical contributions.

Heterogeneous treatment effects with neural networks

The CATE estimation problem (§6) extended to settings where the covariate space is high-dimensional or unstructured (images, text). Standard ML methods are needed to handle the dimensionality; the causal-inference machinery is needed to handle the identifiability and inference.

The dominant approach in 2026 is Double ML + flexible nuisance estimators (§6). For high-dimensional structured covariates: gradient-boosted trees for the nuisance functions, with cross-fitting for valid inference. For unstructured covariates (images, text): deep nets for the nuisances, with counterfactual regression (CFR, Shalit et al., 2017) or DragonNet architectures that explicitly model the treatment-control balance in representation space.
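A minimal sketch of the cross-fitting recipe, standard library only, for the partially linear model $Y = \theta T + g(X) + \varepsilon$ with a constant effect. Histogram (binned-mean) regression stands in for the gradient-boosted or neural nuisance learners named above; the data-generating process, the true effect `THETA`, and all function names are assumptions made for the example.

```python
import random

def bin_index(x, lo=-2.0, hi=2.0, n_bins=20):
    return min(n_bins - 1, max(0, int((x - lo) / (hi - lo) * n_bins)))

def fit_bin_means(xs, vs, n_bins=20):
    """Histogram regression: mean of v within each bin of x (stand-in learner)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, v in zip(xs, vs):
        b = bin_index(x, n_bins=n_bins)
        sums[b] += v
        counts[b] += 1
    overall = sum(vs) / len(vs)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, n_bins=n_bins)]

def ols_slope(a, b):
    """Slope of b regressed on a, with intercept."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return (sum((u - ma) * (v - mb) for u, v in zip(a, b))
            / sum((u - ma) ** 2 for u in a))

def double_ml(x, t, y, n_folds=2):
    """Cross-fitted residual-on-residual estimate of theta."""
    n = len(x)
    res_t, res_y = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisances E[T|X] and E[Y|X] are fit on the other fold only.
        m_hat = fit_bin_means([x[i] for i in train], [t[i] for i in train])
        l_hat = fit_bin_means([x[i] for i in train], [y[i] for i in train])
        for i in test:
            res_t[i] = t[i] - m_hat(x[i])
            res_y[i] = y[i] - l_hat(x[i])
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

rng = random.Random(0)
THETA = 1.0  # true effect in this simulation
x = [rng.uniform(-2.0, 2.0) for _ in range(4000)]
t = [xi + rng.gauss(0.0, 1.0) for xi in x]                   # treatment depends on X
y = [THETA * ti + xi ** 2 + 2.0 * xi + rng.gauss(0.0, 1.0)   # outcome confounded by X
     for ti, xi in zip(t, x)]

naive = ols_slope(t, y)     # ignores the confounder
dml = double_ml(x, t, y)    # partials it out with cross-fitting
print("naive:", naive, "double ML:", dml)
```

The naive regression of $Y$ on $T$ absorbs the confounding through $X$ and should overshoot; the cross-fitted estimate should land close to `THETA`. Confidence intervals and heterogeneous effects are omitted from this sketch.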

Applications. Estimating heterogeneous effects in:

Personalized medicine. Which patients benefit most from this drug?
Targeted policy. Which subpopulations benefit most from this intervention?
Recommendation systems. What is the causal lift of a recommendation algorithm, vs the natural baseline?
Education. Which interventions help which students?

The honest accounting. Heterogeneous-treatment-effect estimation is mathematically clean but practically demanding. The required assumptions (ignorability, positivity, correct functional forms for nuisances) are strong; estimates can be misleading when assumptions fail. Best practice in 2026 includes:

Sensitivity analysis to unmeasured confounders.
Cross-validation of nuisance estimators.
Robustness checks across different specifications.
Where possible: validation against randomized experiments.

Off-policy evaluation in RL

The reinforcement-learning connection. Off-policy evaluation (OPE) asks: given data collected under a behaviour policy $\pi_\beta$, estimate the value of a target policy $\pi_e$, that is, the expected return if we deployed $\pi_e$ instead. This is structurally a causal question: what would happen if we intervened on the policy?

The standard OPE estimators are essentially the causal-inference estimators of §6 applied to RL:

Importance sampling. Reweight trajectories from $\pi_\beta$ to estimate values under $\pi_e$. Equivalent to inverse-propensity weighting in §6.
Direct method. Fit a value function from the behaviour data; evaluate at the target policy’s state-action distribution. Equivalent to outcome regression.
Doubly robust OPE. Combine importance sampling with a learned value function for double robustness. Equivalent to the AIPW estimator.
MAGIC (Thomas and Brunskill, 2016), Per-Step Doubly Robust, Marginalized Importance Sampling. Modern variants reducing the variance of importance-sampling-based estimators.
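The first three estimators can be sketched in the one-step (contextual bandit) special case, where a trajectory is a single context-action-reward triple. This is an illustrative simulation, not a full sequential OPE implementation; the reward table, both policies, and all function names are assumptions made for the example.

```python
import random

P_REWARD = {0: [0.6, 0.2], 1: [0.3, 0.8]}  # hypothetical P(r = 1 | x, a)

def prob(p_one, a):
    return p_one if a == 1 else 1.0 - p_one

def pi_b(a, x):
    """Behaviour policy: tends to pick the worse action."""
    return prob(0.3 if x == 1 else 0.7, a)

def pi_e(a, x):
    """Target policy: tends to pick the better action."""
    return prob(0.9 if x == 1 else 0.1, a)

def collect(n, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        a = int(rng.random() < pi_b(1, x))
        r = int(rng.random() < P_REWARD[x][a])
        data.append((x, a, r))
    return data

def ips(data):
    """Importance sampling / inverse-propensity weighting."""
    return sum(pi_e(a, x) / pi_b(a, x) * r for x, a, r in data) / len(data)

def direct(data, q):
    """Direct method: average the reward model under the target policy."""
    return sum(sum(pi_e(b, x) * q(x, b) for b in (0, 1))
               for x, _, _ in data) / len(data)

def doubly_robust(data, q):
    """Direct-method baseline plus an importance-weighted residual correction."""
    total = 0.0
    for x, a, r in data:
        baseline = sum(pi_e(b, x) * q(x, b) for b in (0, 1))
        total += baseline + pi_e(a, x) / pi_b(a, x) * (r - q(x, a))
    return total / len(data)

def fit_tabular_q(data):
    sums, counts = {}, {}
    for x, a, r in data:
        sums[(x, a)] = sums.get((x, a), 0) + r
        counts[(x, a)] = counts.get((x, a), 0) + 1
    return lambda x, a: sums[(x, a)] / counts[(x, a)]

TRUE_VALUE = sum(0.5 * pi_e(a, x) * P_REWARD[x][a] for x in (0, 1) for a in (0, 1))
data = collect(20000)
bad_q = lambda x, a: 0.5          # deliberately misspecified reward model
good_q = fit_tabular_q(data)

print("true value       :", TRUE_VALUE)
print("IPS              :", ips(data))
print("DM, good model   :", direct(data, good_q))
print("DM, bad model    :", direct(data, bad_q))
print("DR, bad model    :", doubly_robust(data, bad_q))
```

The point of the last two lines is double robustness: with a badly misspecified reward model the direct method is biased, while the doubly robust estimate should stay close to the true value because the propensities are correct.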

The connection to offline RL (RL §9). The offline RL problem is essentially: given OPE estimates for many candidate policies, choose the best policy. The pessimism principle in offline RL (CQL, IQL) is causal pessimism — be skeptical about policies that exploit poorly-supported state-action pairs in the data, because OPE is unreliable there.

The cross-references. RL §9 develops offline RL algorithmically; this section develops the causal interpretation of the same problem. The two frameworks are converging: modern OPE work is increasingly explicit about its causal foundations.

Fairness through causal lenses

A substantive application. Algorithmic fairness has multiple non-equivalent technical definitions; many of them have natural causal framings.

Demographic parity. $\hat{Y} \perp\!\!\!\perp A$ — the prediction is independent of the protected attribute. Easy to state, often unattainable, sometimes substantively wrong (when $A$ is causally related to a legitimate predictor).

Equalized odds (Hardt, Price, Srebro, 2016). $\hat{Y} \perp\!\!\!\perp A \mid Y$: the prediction is independent of $A$ within each ground-truth-outcome group. Compatible with some legitimate uses of $A$.

Counterfactual fairness (Kusner et al., 2017; §7). $\hat{Y}$ is the same in the actual world and the counterfactual world where $A$ was different. The most-thoroughly-causal definition.

Path-specific fairness. Distinguish legitimate causal pathways from $A$ to $Y$ (e.g., $A \to \text{qualifications} \to Y$) from illegitimate ones (e.g., $A \to \text{biased-evaluation} \to Y$). Allow predictions to depend on the legitimate paths but not the illegitimate ones. Operationalized by mediation analysis (§7).
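The two observational criteria above (demographic parity and equalized odds) can be computed directly from predictions, without a causal model. A minimal sketch on a hand-made dataset; the records and function names are illustrative assumptions.

```python
def rate(records, predicate):
    """Fraction of positive predictions among records satisfying predicate."""
    selected = [r for r in records if predicate(r)]
    return sum(r["yhat"] for r in selected) / len(selected)

def demographic_parity_gap(records):
    """|P(yhat=1 | A=1) - P(yhat=1 | A=0)|."""
    return abs(rate(records, lambda r: r["a"] == 1)
               - rate(records, lambda r: r["a"] == 0))

def equalized_odds_gaps(records):
    """Between-group gaps in true-positive and false-positive rates."""
    gaps = {}
    for y, name in ((1, "tpr_gap"), (0, "fpr_gap")):
        r1 = rate(records, lambda r: r["a"] == 1 and r["y"] == y)
        r0 = rate(records, lambda r: r["a"] == 0 and r["y"] == y)
        gaps[name] = abs(r1 - r0)
    return gaps

# (protected attribute a, true outcome y, prediction yhat)
rows = [
    (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0),   # group 0: noisy predictions
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 0, 0),   # group 1: perfect predictions
]
records = [{"a": a, "y": y, "yhat": yhat} for a, y, yhat in rows]

dp_gap = demographic_parity_gap(records)
eo_gaps = equalized_odds_gaps(records)
print(dp_gap)    # 0.0
print(eo_gaps)   # {'tpr_gap': 0.5, 'fpr_gap': 0.5}
```

Both groups receive positive predictions at the same rate, so demographic parity holds, yet the error rates differ sharply between groups, so equalized odds fails. The criteria are not interchangeable.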

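The mutual incompatibility result. Chouldechova (2017) and Kleinberg, Mullainathan, Raghavan (2017) independently showed that several plausibly desirable fairness criteria are mutually incompatible: when base rates differ across groups, no imperfect classifier can be calibrated within groups and also equalize false-positive and false-negative rates across groups (the error-rate balance that equalized odds demands). Demographic parity is likewise incompatible with each of these for any informative predictor when base rates differ. The incompatibility is mathematical, not a deficiency of methods.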

The implication. Fairness is not a single technical desideratum but a family of normatively distinct criteria, each reflecting different ethical commitments. Causal framings clarify what is being demanded; they do not resolve the underlying normative choices.

The practical state. Counterfactual and path-specific fairness require causal models that must be specified from domain knowledge. The models are contested (different stakeholders may disagree about the appropriate causal structure). Production deployment of causal-fairness methods is limited; demographic parity and equalized odds (which don’t require causal models) are more common in practice.

The cross-reference. Alignment / Ethics chapter (planned) develops fairness more broadly; this section’s causal framing is one of several perspectives.

Robustness to distribution shift via causal invariance

A specific technical contribution. Invariant Risk Minimization (IRM) (Arjovsky, Bottou, Gulrajani, Lopez-Paz, 2019) operationalizes the ICM hypothesis (§9) for predictive ML. The recipe:

Train a representation $\phi(X)$ such that the same classifier on top of $\phi$ is optimal across all training environments.
The intuition: features that have stable causal relationships with $Y$ should give stable optima; features that are spuriously correlated (correlated through unstable mechanisms) will give different optima in different environments.

The IRM objective:

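$$\min_{\phi} \sum_{e \in \text{envs}} \left[ L^e(\phi) + \lambda \left\| \nabla_{w \mid w = 1.0} L^e(w \cdot \phi) \right\|^2 \right].$$

Here $w$ is a fixed scalar “dummy” classifier set to 1.0, so the minimization is over $\phi$ alone (the practical IRMv1 form of Arjovsky et al.). The second term is the squared gradient of each environment’s risk with respect to $w$: it vanishes exactly when $w = 1.0$ is a stationary point of $L^e$ in every environment, which acts as a soft constraint that the same classifier be optimal in every environment simultaneously.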
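A minimal sketch of the gradient penalty for squared loss, where the derivative with respect to the scalar $w$ at $w = 1$ is available in closed form: $\frac{d}{dw} \mathbb{E}[(w\phi - y)^2] = \mathbb{E}[2(\phi - y)\phi]$. The two-environment simulation (a stable cause `x1` and an environment-dependent effect `x2`), the fixed coefficient 0.5, and the function names are assumptions made for the example; no training loop is shown.

```python
import random

def irm_penalty(phi, y):
    """Squared gradient of mean((w * phi - y)^2) with respect to w, at w = 1."""
    grad = sum(2.0 * (p - t) * p for p, t in zip(phi, y)) / len(y)
    return grad ** 2

def make_env(sigma2, n=20000, seed=0):
    """x1 -> y -> x2, with the noise scale sigma2 varying by environment."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    x1 = [rng.gauss(0.0, sd) for _ in range(n)]
    y = [a + rng.gauss(0.0, sd) for a in x1]       # stable mechanism: y = x1 + noise
    x2 = [t + rng.gauss(0.0, 1.0) for t in y]      # x2 is an effect of y
    return x1, x2, y

envs = [make_env(0.1, seed=1), make_env(2.0, seed=2)]

# Representation using the causal feature with its true coefficient.
causal_penalty = sum(irm_penalty(x1, y) for x1, _, y in envs)

# Representation using the spurious feature with one fixed coefficient.
spurious_penalty = sum(irm_penalty([0.5 * v for v in x2], y) for _, x2, y in envs)

print("penalty, causal feature  :", causal_penalty)
print("penalty, spurious feature:", spurious_penalty)
```

The causal representation should incur a penalty near zero in both environments, because the same predictor is optimal in each. The spurious representation cannot be simultaneously optimal, since the best coefficient on `x2` depends on the environment's noise level, so its penalty should be clearly positive.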

The promise. IRM gives a path to OOD generalization grounded in the causal-invariance principle of ICP (§8).

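The reality. The evidence on IRM is mixed. Rosenfeld, Ravikumar, Risteski (2021) “The Risks of Invariant Risk Minimization” showed theoretically that IRM can fail to recover the invariant predictor unless it sees enough training environments, and that in nonlinear settings it can do no better than ERM (Empirical Risk Minimization). Empirically, Gulrajani and Lopez-Paz (2021) “In Search of Lost Domain Generalization” found that no method, IRM included, consistently outperforms well-tuned ERM on standard benchmarks. The community has produced variants and related methods (REx, V-REx, GroupDRO, EIIL); none dominates. The OOD-generalization problem (OP-TH-8) remains substantially open.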

The honest assessment. Causal-invariance-based OOD methods are theoretically motivated but empirically unimpressive. The principle is correct (causal mechanisms are invariant); operationalizing it well in practice is hard. The 2024–2026 evidence: invariance-based methods help in specific cases but are not a general solution.

Recourse: counterfactual explanations as causal interventions

A practical fairness-and-interpretability application. Recourse is the problem of telling a user what they would need to change to flip an algorithmic decision. If a loan was denied, what (feasibly modifiable) factors would the applicant need to change to be approved?

The framing is fundamentally counterfactual: if your income were $X$, your loan would be approved. The actionable recommendation is an intervention.

Two main approaches.

Wachter et al. (2017) “Counterfactual Explanations” proposed nearest-counterfactual: find the closest input to the original that flips the decision. Conceptually simple; sometimes produces unrealistic counterfactuals (modifying features that are causally downstream of unchangeable ones).

Karimi et al. (2021) “Algorithmic Recourse: from Counterfactual Explanations to Interventions” extended this with causal recourse: respect the causal structure when proposing interventions. If changing $X$ would also change $Z$ (which is downstream), the recourse must account for this.
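The difference between the two approaches can be made concrete with a toy loan example. This is a sketch under strong assumptions, not the algorithm of either paper: a linear score, a single actionable feature (income), and a hypothetical structural equation in which savings depend on income. All names and numbers are invented for illustration.

```python
THRESHOLD = 75.0  # hypothetical approval threshold

def score(income, savings):
    """Toy classifier: approve when income + savings >= THRESHOLD."""
    return income + savings

def savings_equation(income, u):
    """Hypothetical structural equation: savings = 0.5 * income + u."""
    return 0.5 * income + u

def nearest_counterfactual_income(income, savings):
    """Feature-space view: raise income while holding savings fixed."""
    return max(0.0, THRESHOLD - score(income, savings))

def causal_recourse_income(income, savings):
    """Intervention view: raising income also raises savings downstream."""
    u = savings - 0.5 * income                      # abduction: recover the noise term
    # Solve score(i, savings_equation(i, u)) = THRESHOLD for the new income i:
    # i + 0.5 * i + u = THRESHOLD, so i = (THRESHOLD - u) / 1.5.
    return max(0.0, (THRESHOLD - u) / 1.5 - income)

income, savings = 40.0, 20.0                        # rejected applicant: score 60
delta_feature = nearest_counterfactual_income(income, savings)
delta_causal = causal_recourse_income(income, savings)
print(delta_feature)   # 15.0
print(delta_causal)    # 10.0

# Check: after the causal action, the downstream-updated score meets the threshold.
new_income = income + delta_causal
u = savings - 0.5 * income
assert score(new_income, savings_equation(new_income, u)) >= THRESHOLD
```

Treating features as independently manipulable overstates the required action here, because it ignores the downstream increase in savings. With a different causal structure the error could go the other way, which is why the recommendation depends on the assumed model.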

The applications. Recourse-style explanations are used in:

Lending (telling rejected applicants what to change).
Hiring (giving rejected candidates feedback).
Healthcare (explaining recommended treatments).
Education (explaining admissions decisions).

The state of practice. Recourse is deployed but limited. Production systems often provide simplified Wachter-style explanations; full causal recourse remains research-stage. The legal and regulatory pressure (the EU AI Act, the GDPR’s contested “right to explanation”, and US adverse-action notice requirements in lending) is increasing the demand for principled recourse.

Cross-cutting observation

The pattern across this section. Each problem (HTEs, OPE, fairness, OOD, recourse) has a natural causal framing — the question is genuinely about interventions or counterfactuals, not just correlations. Each problem has a causal-inference technical machinery that addresses it. Each problem also has substantial unresolved practical issues — the causal assumptions are uncertain, the estimators are sensitive to misspecification, the deployments are limited.

Causal ML is not yet a solved problem. It is a substantive research and engineering programme with mature theoretical foundations and active empirical development. The 2026 state is “established framework, partial production use, active improvement.”

Where causality-meets-ML sits

The summary. Causality applied to ML is one of the most-active intersection areas of 2024–2026 research. The standard ML tasks (prediction, classification, generation) have correlational objectives; the high-stakes deployment contexts (medicine, policy, lending, education) often require causal answers. Causal ML is the bridge.

The unsolved problems. Reliable causal estimation when causal assumptions are uncertain. Causal reasoning at the scale of modern foundation models. Production deployment of causal methods in contexts where the underlying causal models are contested. Each of these is in the chapter’s open-problems list (§14).

The chapter’s next section (§11) covers the most recent intersection: can foundation models themselves reason causally? The empirical evidence is mixed and the implications are substantial.

§11. Causality and Foundation Models

The most recent intersection. Can modern foundation models — LLMs trained on trillions of tokens of internet text — reason causally? The empirical evidence is mixed; the implications matter for how we deploy LLMs in causal-sensitive contexts.

The question

A foundation model trained on text observes substantial causal language in its training data — papers about treatment effects, news articles about cause-and-consequence, scientific reasoning, legal arguments, everyday narrative. The model could in principle have learned:

The vocabulary of causal reasoning (the words “cause”, “because”, “if”, “would have”).
Causal patterns in specific domains (drug X causes side-effect Y; policy A leads to outcome B).
Causal-reasoning skills (identifying confounders, distinguishing correlation from causation, performing counterfactual reasoning).

The first two are essentially pattern matching — the model has seen enough text to mimic the surface forms. The third is the substantive question: does the model reason causally, or does it produce text that looks causal without underlying causal computation?

The question matters. If LLMs reason causally, they can be deployed in causal-sensitive contexts (medical decision support, policy analysis, scientific advisory) with appropriate trust. If they only pattern-match, deploying them in such contexts is dangerous — they will produce confidently-stated correlational-only inferences dressed up as causal claims.

Empirical evidence: textual fluency

The clear positive evidence. LLMs are highly fluent at textually-described causal scenarios.

Worked example. Prompt GPT-4 with: “A study found that hospital admission is associated with higher mortality. Critic A argues that this proves hospitals are dangerous and we should avoid them. Critic B argues that this is a selection effect — sicker people are more likely to be admitted. Whose argument is more sound, and why?”

A modern LLM produces a substantive response identifying the selection problem, articulating the correct framing (confounding by severity of illness: sicker patients are both more likely to be admitted and more likely to die), and recommending appropriate causal analysis (adjustment for severity, careful matching, or randomization where feasible). The reasoning displayed is at the level of a thoughtful undergraduate in epidemiology.

This is real capability. The LLM has internalized substantial narrative-level causal reasoning patterns. For most textually-described causal scenarios, frontier LLMs perform competently.

Empirical evidence: formal failures

The clear negative evidence. The same LLMs fail systematically on formally-stated causal-inference problems.

Worked example. Specify a small DAG (e.g., 4 nodes with explicit edges); ask the LLM to identify a backdoor adjustment set. LLMs often produce plausible-sounding but wrong answers — naming an adjustment set that does not block all backdoor paths, missing the descendant-of-treatment exclusion, or proposing colliders for inclusion.
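Answers to this kind of question can be checked mechanically, which is what makes the failures easy to demonstrate. A minimal sketch of a backdoor-criterion checker for small DAGs, by brute-force path enumeration; the four-node graph (confounder `Z`, mediator `M`) and the function names are assumptions made for the example, and the approach is only practical for small graphs.

```python
def descendants(edges, node):
    """All nodes reachable from node by following directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(edges, start, goal):
    """All simple paths from start to goal, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([start])

def is_blocked(edges, path, z):
    """d-separation check for one path given conditioning set z."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if mid not in z and not (descendants(edges, mid) & z):
                return True
        elif mid in z:
            # A chain or fork node blocks when conditioned on.
            return True
    return False

def is_valid_backdoor_set(edges, x, y, z):
    z = set(z)
    if z & descendants(edges, x):          # no descendants of the treatment
        return False
    backdoor = [p for p in simple_paths(edges, x, y) if (p[1], x) in edges]
    return all(is_blocked(edges, p, z) for p in backdoor)

# Hypothetical DAG: Z confounds X and Y; M mediates X -> Y.
EDGES = {("Z", "X"), ("Z", "Y"), ("X", "M"), ("M", "Y")}

print(is_valid_backdoor_set(EDGES, "X", "Y", {"Z"}))       # True
print(is_valid_backdoor_set(EDGES, "X", "Y", set()))       # False
print(is_valid_backdoor_set(EDGES, "X", "Y", {"M"}))       # False
print(is_valid_backdoor_set(EDGES, "X", "Y", {"Z", "M"}))  # False
```

The empty set fails because the path through `Z` stays open; any set containing `M` fails the descendant-of-treatment exclusion. A checker of this kind can score an LLM's proposed adjustment set without relying on how plausible the accompanying explanation sounds.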

Specific findings from the recent literature:

CRASS (Frohberg and Binder, 2022) “CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models.” Tests counterfactual completions in everyday scenarios. LLMs are competent on most items but with characteristic failure modes.

CLadder (Jin et al., 2023) “CLadder: Assessing Causal Reasoning in Language Models.” Tests Pearl’s three-rung framework with formal questions. Finds:

LLMs perform well at Rung 1 (association).
LLMs struggle at Rung 2 (intervention) when questions are formally stated.
LLMs struggle more at Rung 3 (counterfactual).
Performance improves substantially with chain-of-thought prompting but remains below expert human level.

CausalBench (Zhang et al., 2023) and successors. Tests causal-graph reasoning, counterfactual computation, and identification. LLMs achieve moderate scores; the gap between LLM and expert performance is substantial.

Causal Parrots (Zečević et al., 2023). The provocative interpretation: LLMs are causally fluent (textual mimicry) without being causally competent (true causal reasoning). The pattern-vs-reasoning distinction is empirically demonstrable on formally-stated problems.

Interpreting the mixed evidence

The community has not converged on a single interpretation. Several positions:

“LLMs are reasoning causally.” Defenders point to the fluent narrative-level performance, the substantial improvements with chain-of-thought, and the scaling trends (performance improves with model size). The pattern-matching account undersells what LLMs actually do.

“LLMs are sophisticated pattern matchers.” Critics point to the formal failures, the brittleness under prompt variations, and the absence of explicit causal-reasoning computation in the architecture. LLMs are doing something useful but it is not causal reasoning in the Pearl-Rubin sense.

“Both, with the boundary uncertain.” A middle position: LLMs have learned substantial implicit causal knowledge from training data; they apply it competently in familiar settings; they fail in novel or formally-demanding settings. Whether the implicit knowledge constitutes “real” causal reasoning is partly a definitional question.

The chapter does not adjudicate. The honest summary: LLMs handle narrative-level causal scenarios well, struggle with formal causal inference, and are not yet trustworthy as autonomous causal reasoners in high-stakes contexts.

LLM-assisted causal inference

A productive direction that avoids the deeper question. Instead of asking “can LLMs do causal inference?”, ask “can LLMs help humans do causal inference?”.

Several useful capabilities:

Causal-graph elicitation. Ask an LLM about plausible causal relationships among a set of variables; use the response as a first draft causal graph for human review. The LLM brings broad domain knowledge that no individual researcher has.

Confounder identification. Given a treatment-outcome pair in a specific domain, ask the LLM to enumerate plausible confounders. The list is typically rich and serves as a starting point for the analyst.

Causal-language audit. Given a research paper or news article, ask the LLM to identify causal-language overclaim — places where correlation is being interpreted as causation without justification.

Code generation for causal analysis. Given a dataset and a causal question, the LLM can generate code (Python’s dowhy, R’s pcalg, etc.) implementing the appropriate causal-inference workflow.

These applications work well because the LLM serves as a tool for the human analyst rather than as an autonomous causal reasoner: its output is a draft to be checked, not a conclusion. The human retains responsibility for the substantive judgments; the LLM provides knowledge synthesis and code support.
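One lightweight way to keep the human in the validator role is to treat an elicited graph as data to be checked before anyone reasons from it. The sketch below is illustrative only: the edge list is hypothetical and stands in for parsed LLM output (no LLM API is called), and `check_draft_graph` is a helper name introduced here. It checks only that the draft is acyclic, which is a necessary condition for a causal DAG and says nothing about whether the edges are substantively right.

```python
from graphlib import CycleError, TopologicalSorter


def check_draft_graph(edges):
    """Return (is_dag, detail) for a list of (cause, effect) pairs.

    detail is a topological order if the draft is acyclic,
    otherwise the cycle reported by the standard library.
    """
    sorter = TopologicalSorter()
    for cause, effect in edges:
        sorter.add(effect, cause)  # the effect depends on its cause
    try:
        return True, list(sorter.static_order())
    except CycleError as err:
        return False, err.args[1]


# Hypothetical draft, loosely modelled on the asthma example of §1.
draft_edges = [
    ("asthma", "aggressive_care"),
    ("asthma", "mortality"),
    ("aggressive_care", "mortality"),
]
ok, order = check_draft_graph(draft_edges)

# A draft with a feedback edge is flagged for human review instead of being used.
cyclic_edges = draft_edges + [("mortality", "asthma")]
ok_cyclic, cycle = check_draft_graph(cyclic_edges)
```

A draft that passes this check still needs expert review of each edge; a draft that fails it cannot be used as a DAG at all.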

The Pearl-vs-correlation debate, FM-era version

The classical Pearl-vs-correlation debate (correlation is not enough; we need explicit causal modelling) takes a specific form in the LLM era. The new question: do LLMs make explicit causal modelling unnecessary?

The “no” position. The Pearl-aligned view: LLMs produce confident-sounding outputs without explicit causal commitments; deploying them in causal-sensitive contexts inherits all the failure modes of correlational ML (asthma example, §1), just dressed up in fluent language. Causal modelling remains as important as ever.

The “maybe” position. A more sympathetic view: LLMs may have internalized substantial implicit causal knowledge; the appropriate response is not to dismiss them but to integrate them with explicit causal frameworks — LLMs as draft-generators, humans-or-causal-systems as validators.

The “increasingly yes” position. The most aggressive AI-positive position: scaled LLMs with appropriate training data will eventually reason causally well enough that explicit causal modelling becomes one tool among many rather than the foundational requirement. This is a bet on future capability; the 2024–2026 evidence supports it weakly at best.

The chapter’s stance. The Pearl-aligned view is the safe default for high-stakes deployment in 2026. LLMs are useful adjuncts to causal inference, not replacements. As capabilities improve, the balance may shift; for now, claims that LLMs reason causally should be treated cautiously.

Cross-references and open problems

The Foundation Models chapter (FM §11) and the LLM chapter discuss reasoning capabilities broadly. The AI for Science chapter §13 discusses the curve-fitting-not-understanding critique. The Theoretical Foundations of Learning chapter §8 discusses generalization. Each is relevant to the question of whether LLM-scale models can reason causally.

OP-C-3 (causal reasoning in foundation models) and OP-C-5 (causal inference at LLM scale) capture the open frontier. Whether the trajectory continues toward LLMs as full causal reasoners or toward LLMs as causal-inference tools-not-reasoners is one of the consequential open questions of mid-2020s AI.

Where this section sits

The summary. Causality and foundation models is the most recent and most speculative topic in the chapter. The empirical evidence shows LLMs are fluent at narrative-level causal reasoning, fail at formal causal inference, and are useful as tools for human causal analysts. The deeper question — what cognitive function the LLM is actually performing when it produces causal-language outputs — is genuinely open and matters for deployment.

§12. Connections to Other Chapters

The Causality chapter is one of the most cross-cutting in the book; almost every other chapter has causal questions or commitments embedded in it.

Probabilistic Reasoning (planned, AIMA Ch 13–14 material) develops Bayesian networks as the probabilistic substrate. Causal DAGs are Bayesian networks with causal semantics; this chapter develops what is causally distinctive on top of the BN foundation.
Theoretical Foundations of Learning §8 develops the modern generalization puzzle. Distribution-shift generalization (OP-TH-8) has natural causal framings — distributions shift through changes to causal mechanisms; models that capture invariant causal structure should generalize across the shifts. This chapter’s §9 (CRL) and §10 (IRM) develop the causal-side techniques.
Reinforcement Learning §10 develops RL post-training (RLHF, DPO, GRPO); §9 develops offline RL. Both have causal interpretations: offline RL is essentially OPE applied at scale, and RLHF is an intervention on policy parameters with implications for outcome distributions. §10 of this chapter develops the causal-interpretive layer.
Foundation Models and Large Language Models. §11 of this chapter develops the empirical evidence on LLM causal reasoning and cross-references the FM/LLM chapters’ open problems (OP-FM-14, OP-LLM-N) on emergent reasoning capabilities.
AI for Science §13 develops the curve-fitting-not-understanding critique. Causal framing is one substantive response to this critique. AlphaMissense, AlphaFold, and similar systems produce predictions that are sometimes consumed as causal claims; the gap matters for downstream applications (clinical genomics, drug development) and is the subject of OP-S-8.
Generative Models §11 includes a critique about generative models not “understanding”; the understanding-is-causal position is one substantive interpretation. Causal generative models are an active research area at the intersection.
Self-Supervised Learning provides the pretraining substrate for many representations that downstream causal methods consume. The connection is via §9 (CRL) — modern causal representation learning often builds on SSL-pretrained representations.
Alignment / Ethics (planned) develops fairness, recourse, and the broader normative questions. The causal framings of fairness (§7 counterfactual fairness, §10 path-specific) live in this chapter; the broader ethical analysis lives there.
Deep Learning provides the architectural substrate; CRL and causal ML methods are built on DL architectures.
Robotics (planned) involves substantial causal reasoning: robotic control is fundamentally about interventions and counterfactuals. The model-based RL developed in Reinforcement Learning §8 has causal-world-model variants (§9 of this chapter), and deployed robotic systems use these.
Evaluation (planned) develops cross-cutting evaluation methodology. The evaluation of causal-ML methods has specific challenges (no ground truth in observational settings, validation requires randomized experiments or known SCMs); this chapter sketches them in §11 and §14.

§13. Critiques and Alternative Perspectives

This section presents substantive critiques of the chapter’s framing.

The Pearl-vs-Rubin debate revisited

A substantive intellectual division. The two traditions have largely converged mathematically (§2), but cultural and methodological differences persist.

The Pearl-aligned critique of the Rubin tradition: Rubin’s potential-outcomes framework obscures structural causal information; it is opaque about which assumptions are doing the work in identification; it is hard to extend to settings with complex causal structure (mediation, multi-step pathways, dynamic treatments). The Rubin tradition’s emphasis on randomized-experiment-as-gold-standard underemphasizes the substantial information available in observational data given a causal model.

The Rubin-aligned critique of the Pearl tradition: Pearl’s framework requires correctly specifying the DAG, which is hard and contested; small misspecification can produce drastically wrong conclusions; the algorithmic identifiability machinery is technically elegant but practically rare; for treatment-effect estimation (the most common applied task), potential-outcomes machinery is more direct and more amenable to statistical practice.

The chapter’s position. Both critiques have force. The frameworks are complementary; choosing between them is partly substantive (different problems suit different framings) and partly cultural (different communities have different conventions). The pragmatic stance: learn both; use whichever is more natural for the specific problem.

“Causality is not necessary for prediction”

A genuine alternative position. The argument: most ML applications do not require causal inference. Prediction systems (classifiers, recommenders, forecasters) need to be accurate, not causally correct. The Caruana asthma example (§1) is a decision-support problem, not a prediction problem; for pure prediction, the correlational solution is correct.

The counter. The boundary between “prediction” and “decision support” is rarely clean in practice. A prediction that informs a decision is a decision-support tool, and the asthma example shows the failure mode. A defensible limitation of correlational ML is not “we predict without making causal claims” but rather “here are the contexts in which our predictions are reliable proxies for the causal questions a user might bring to them”. Few deployed ML systems make this scoping explicit.

The honest accounting. There are genuine pure-prediction problems where causality is unnecessary (forecasting tomorrow’s weather, predicting tomorrow’s stock prices in a market that doesn’t move based on the prediction). But many deployed AI systems blend prediction with decision support, and the boundary is consequential.
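The gap between a correct prediction and a correct decision can be made concrete with a small simulation. The sketch below assumes a toy data-generating process in the spirit of the asthma example; every probability in it is invented for illustration and none is taken from Caruana et al. Asthma raises baseline mortality risk, aggressive care lowers it, and the treatment policy decides who receives aggressive care.

```python
import random

random.seed(0)


def mortality_prob(asthma, aggressive):
    # Hypothetical structural equation: asthma raises risk, aggressive care lowers it.
    return 0.10 + 0.05 * asthma - 0.08 * aggressive


def simulate(n, policy):
    """Return (asthma, died) pairs; policy maps asthma -> P(aggressive care)."""
    rows = []
    for _ in range(n):
        asthma = 1 if random.random() < 0.3 else 0
        aggressive = 1 if random.random() < policy(asthma) else 0
        died = 1 if random.random() < mortality_prob(asthma, aggressive) else 0
        rows.append((asthma, died))
    return rows


def death_rate(rows, asthma):
    outcomes = [died for a, died in rows if a == asthma]
    return sum(outcomes) / len(outcomes)


# Observed policy: clinicians triage asthmatic patients to aggressive care.
observed = simulate(200_000, lambda a: 0.95 if a else 0.10)

# Policy a decision system might recommend after reading asthma as "protective":
# asthmatic patients get aggressive care no more often than anyone else.
acted_on = simulate(200_000, lambda a: 0.10)

print("observed data, asthma:     ", death_rate(observed, 1))
print("observed data, no asthma:  ", death_rate(observed, 0))
print("after policy change, asthma:", death_rate(acted_on, 1))
```

In the observed data the asthma group has the lower death rate, so a predictor trained on it is accurate. Once the policy changes, the same group's death rate rises, because the association depended on the treatment policy that the decision system has just altered.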

The unidentifiability critique

A strong sceptical position. Pearl’s framework is mathematically rigorous, but in practice the causal graph is rarely known with certainty. Different reasonable graphs yield different identification results. The framework’s honest answer in such cases is “depends on the assumed graph” — but practitioners often suppress this dependence and report a single causal estimate.

The constructive response. Modern best practice incorporates sensitivity analysis (§5) and causal-discovery cross-checks — testing whether the assumed graph is consistent with the data via conditional-independence tests, and quantifying how much the conclusion changes under reasonable alternative graphs.

The deeper question. Should we make causal claims when the causal graph is uncertain? The yes answer: the framework lets us state assumptions explicitly and reason from them. The no answer: explicit assumptions can still be wrong, and confident causal claims may mislead.
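The dependence on the assumed graph is easy to demonstrate on synthetic data. The sketch below is an illustration under stated assumptions: the SCM, its coefficients, and the function names are invented here. Two analysts see the same data. Analyst A believes $Z$ is a confounder and applies backdoor adjustment; analyst B believes $Z$ is a collider and therefore leaves it unadjusted. Each estimator is the right one under its own assumed graph, and they disagree.

```python
import random

random.seed(1)


def draw(n):
    """Toy SCM (assumed for illustration): Z -> X, Z -> Y, X -> Y."""
    data = []
    for _ in range(n):
        z = 1 if random.random() < 0.5 else 0
        x = 1 if random.random() < (0.8 if z else 0.2) else 0
        y = 1 if random.random() < 0.2 + 0.1 * x + 0.4 * z else 0
        data.append((z, x, y))
    return data


def mean_y(data, x, z=None):
    ys = [yy for zz, xx, yy in data if xx == x and (z is None or zz == z)]
    return sum(ys) / len(ys)


def unadjusted(data):
    # Appropriate if Z is not a confounder (e.g. the analyst takes Z to be a collider).
    return mean_y(data, 1) - mean_y(data, 0)


def backdoor_adjusted(data):
    # Appropriate if Z is a confounder: sum_z P(z) * [E(Y | x=1, z) - E(Y | x=0, z)].
    effect = 0.0
    for z in (0, 1):
        p_z = sum(1 for zz, _, _ in data if zz == z) / len(data)
        effect += p_z * (mean_y(data, 1, z) - mean_y(data, 0, z))
    return effect


data = draw(100_000)
estimate_a = backdoor_adjusted(data)  # analyst A: Z confounds X and Y
estimate_b = unadjusted(data)         # analyst B: Z is a collider
print("graph A estimate:", estimate_a)
print("graph B estimate:", estimate_b)
```

Here the data were generated under analyst A's graph, where the structural effect of $X$ on $Y$ is 0.1, so A's estimate is the accurate one. With real data that ground truth is unavailable, which is why reporting the spread across plausible graphs is more honest than reporting a single number.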

The expert-knowledge requirement and its implications

A specific instance of the unidentifiability critique. Causal inference requires substantive domain knowledge to specify the causal graph. The knowledge comes from experts (clinicians, economists, scientists), who often disagree. Whose causal model gets used matters for the conclusion.

The political-economy critique. Whose expertise is privileged in setting up causal models? Whose perspectives are absent? For causally-modelled fairness, this matters substantially — different causal models of how protected attributes affect outcomes produce different fairness criteria. The “neutral” causal model is itself a contested claim.

The chapter notes this critique without resolving it. The Alignment / Ethics chapter (planned) develops the political-economy dimension; this chapter’s technical content is silent on it.

The “causality is hard, correlation is easy” trade-off

A practical critique. Correlational ML works now, at scale, on real problems. Causal ML is harder, with more assumptions, more failure modes, smaller-scale results. For most deployed AI systems, the marginal benefit of moving from correlational to causal does not justify the cost.

The counter. The Caruana asthma example argues otherwise: correlational ML has real failure modes that show up in deployment, and the cost of getting causal questions wrong can be substantial. The trade-off is not “easy correlation vs hard causation” but “easy correlation with hidden failure modes vs harder causation with explicit assumptions”.

The honest accounting. The trade-off is real. Causal ML is harder; the gains are often modest in standard ML benchmarks; for many applications correlational ML suffices. The cases where causal framing matters are high-stakes, distribution-shifting, decision-relevant — where deploying correlational ML has real costs.

§14. Limitations and Open Problems

Consolidated open-problems list. Each carries an OP-C-N identifier for cross-chapter reference.

OP-C-1. Causal discovery in nonlinear, high-dimensional settings. Classical PC, FCI, and GES are limited to moderate dimensions and to linear or otherwise constraint-tractable settings. Modern data (millions of variables, complex nonlinearities) defies these methods. Deep-learning-based discovery (NOTEARS-style, DAG-GNN) is promising but unreliable at scale. Multi-environment discovery (ICP and successors) is more robust but requires multi-environment data. Closing the gap to discovery on real high-dimensional observational data is open.
OP-C-2. Causal representation learning from raw observations. The Schölkopf-Bengio programme has produced rich theory (ICM, identifiability for nonlinear ICA with auxiliaries, contrastive identifiability) but limited empirical impact at scale. Whether CRL methods can produce the next generation of foundation models — or whether scale + standard contrastive learning will continue to dominate — is open. Cross-references OP-FM-N and OP-SSL-N.
OP-C-3. Causal reasoning in foundation models. §11 develops the empirical evidence: LLMs are fluent at narrative-level causal reasoning, fail at formal causal inference. Whether scaling, RL post-training, or architectural changes can produce reliably causally-reasoning LLMs is open. The implications for high-stakes deployment are substantial.
OP-C-4. Identifiability in continuous-and-high-dimensional regimes. Identifiability results in classical causal inference are largely for discrete or low-dimensional settings. Modern data is often continuous and high-dimensional; identifiability theory in such regimes is incomplete. Specific open questions: identification with high-dimensional confounders, identification under partial overlap, identification with continuous treatments.
OP-C-5. Causal inference at LLM scale (millions of variables). Standard causal-inference methods do not scale to settings with thousands or millions of variables. Some progress (sparse causal discovery, score-based methods with continuous optimization) exists; substantial gaps remain. Biological-pathway and genomic settings (where tens of thousands of genes and millions of genetic variants interact) are typical motivating cases.
OP-C-6. Reconciling Pearl and Rubin frameworks. The frameworks are mathematically equivalent but practically separate. Producing a single unified workflow that combines the graphical clarity of Pearl with the estimation practice of Rubin would benefit applied causal inference substantially. Hernán-Robins is one step; broader synthesis remains open.
OP-C-7. Causal benchmarks that are both rigorous and realistic. Most causal-ML benchmarks are synthetic (the true causal structure is known, so methods can be evaluated) or real but informal (no known ground truth, so methods are evaluated by domain plausibility). Bridging these — producing real-world benchmarks with known causal structure — would substantially accelerate the field.
OP-C-8. Causality and OOD generalization at scale. The IRM, REx, and GroupDRO line of work tried to operationalize causal invariance for OOD generalization. Empirical results have been disappointing: these methods do not robustly outperform ERM. Whether causal-invariance-based OOD methods can be made to work at foundation-model scale is an open question.
OP-C-9. Causal inference with foundation models as nuisance estimators. Double ML and similar methods can in principle use foundation models for the nuisance functions $\hat{m}(X)$ and $\hat{e}(X)$. The theoretical conditions (Neyman orthogonality, cross-fitting) require care when the nuisance estimator is itself a complex pretrained model. Establishing rigorous valid-inference guarantees in this regime is open.
OP-C-10. Counterfactual identification at scale. Counterfactual reasoning requires SCMs (graph + structural equations), which are harder to specify than causal graphs alone. Whether counterfactual methods can scale to large applied problems — beyond the small-scale theoretical demonstrations — is open.
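To make the cross-fitting requirement in OP-C-9 concrete, the following is a minimal sketch of the partialling-out form of double ML on a toy partially linear model. The data-generating process, the true effect `THETA`, and all function names are assumptions made for illustration. The nuisance learner is a plain least-squares line, used here only as a stand-in for whatever model (up to a pretrained foundation model) would supply $\hat{m}(X)$ and $\hat{e}(X)$; in this sketch $\hat{m}$ estimates $E[Y \mid X]$ and $\hat{e}$ estimates $E[T \mid X]$.

```python
import random

random.seed(2)
THETA = 1.5  # true effect in the toy model (assumed for illustration)


def make_data(n):
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        t = 0.5 * x + random.gauss(0, 1)              # treatment depends on X
        y = THETA * t + 2.0 * x + random.gauss(0, 1)  # outcome depends on T and X
        rows.append((x, t, y))
    return rows


def fit_line(xs, ys):
    """Least-squares line; a placeholder for an arbitrary nuisance learner."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def dml_partial_linear(rows, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the treatment coefficient."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    num = den = 0.0
    for k, held_out in enumerate(folds):
        # Nuisances are fitted on the other folds and applied to the held-out fold.
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        xs = [r[0] for r in train]
        m_hat = fit_line(xs, [r[2] for r in train])  # estimates E[Y | X]
        e_hat = fit_line(xs, [r[1] for r in train])  # estimates E[T | X]
        for x, t, y in held_out:
            t_res = t - e_hat(x)
            y_res = y - m_hat(x)
            num += t_res * y_res
            den += t_res * t_res
    return num / den


rows = make_data(20_000)
theta_hat = dml_partial_linear(rows)
print("estimated effect:", theta_hat)
```

The point of the fold structure is that no observation is scored by a nuisance model trained on it. With a large pretrained model as the nuisance learner, that separation is harder to guarantee, which is part of what the open problem asks about.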

§15. Further Reading

Opinionated annotated list. Not exhaustive; intended as a reading-order recommendation.

Foundational textbooks

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). The Pearl-tradition canonical reference. Dense, technically deep, rewards repeated reading.
Pearl, J., and Mackenzie, D. (2018). The Book of Why. Popular-press exposition of Pearl’s ideas. Useful entry point.
Imbens, G. W., and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. The Rubin-tradition canonical reference. Heavy on potential-outcomes machinery.
Hernán, M. A., and Robins, J. M. (2020). Causal Inference: What If. Bridges Pearl and Rubin frameworks for applied causal inference. Available free online from the authors.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. Modern textbook integrating causal inference and ML.

Foundational papers

Rubin, D. B. (1974). “Estimating causal effects of treatments in randomized and nonrandomized studies.” The potential-outcomes paper.
Rosenbaum, P. R., and Rubin, D. B. (1983). “The central role of the propensity score in observational studies for causal effects.”
Pearl, J. (1995). “Causal diagrams for empirical research.” The do-calculus introduction.
Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. The causal-discovery textbook.

Effect estimation

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). “Estimation of regression coefficients when some regressors are not always observed.” Doubly robust estimation.
Chernozhukov, V., et al. (2018). “Double/debiased machine learning for treatment and structural parameters.”
Wager, S., and Athey, S. (2018). “Estimation and inference of heterogeneous treatment effects using random forests.”
Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). “Metalearners for estimating heterogeneous treatment effects using machine learning.”
Shalit, U., Johansson, F. D., and Sontag, D. (2017). “Estimating individual treatment effect: generalization bounds and algorithms.” CFR.

Causal discovery

Spirtes, P., and Glymour, C. (1991). “An algorithm for fast recovery of sparse causal graphs.” PC.
Chickering, D. M. (2002). “Optimal structure identification with greedy search.” GES.
Shimizu, S., et al. (2006). “A linear non-Gaussian acyclic model for causal discovery.” LiNGAM.
Hoyer, P. O., et al. (2009). “Nonlinear causal discovery with additive noise models.” ANM.
Peters, J., Bühlmann, P., and Meinshausen, N. (2016). “Causal inference using invariant prediction.” ICP.
Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). “DAGs with NO TEARS.”

Causal representation learning

Schölkopf, B., et al. (2021). “Toward causal representation learning.” The position paper.
Hyvärinen, A., Sasaki, H., and Turner, R. (2019). “Nonlinear ICA using auxiliary variables and generalized contrastive learning.”
Khemakhem, I., Kingma, D. P., Monti, R. P., and Hyvärinen, A. (2020). “Variational autoencoders and nonlinear ICA: a unifying framework.” iVAE.
Locatello, F., et al. (2019). “Challenging common assumptions in the unsupervised learning of disentangled representations.”

Causal ML applications

Caruana, R., et al. (2015). “Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission.” The asthma example.
Hardt, M., Price, E., and Srebro, N. (2016). “Equality of opportunity in supervised learning.” Equalized odds.
Kusner, M. J., Loftus, J. R., Russell, C., and Silva, R. (2017). “Counterfactual fairness.”
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). “Invariant risk minimization.”
Wachter, S., Mittelstadt, B., and Russell, C. (2017). “Counterfactual explanations without opening the black box.”
Karimi, A.-H., et al. (2021). “Algorithmic recourse: from counterfactual explanations to interventions.”

Causality and LLMs

Zečević, M., Willig, M., Dhami, D. S., and Kersting, K. (2023). “Causal Parrots: large language models may talk causality but are not causal.”
Jin, Z., et al. (2023). “CLadder: assessing causal reasoning in language models.”

Reading-order recommendation

For someone entering the field: start with Pearl-Mackenzie Book of Why for orientation. Then Hernán-Robins Causal Inference: What If (free online) for applied content. Then Pearl’s Causality for deep theory; Imbens-Rubin for the Rubin tradition. Add specialized references (Peters-Janzing-Schölkopf for ML-aligned framing; specific papers for the topics that matter for your application).

§16. Exercises and Experiments

Research-style exercises. Each develops a specific causal-inference skill.

E1. Do-calculus by hand. Take a small DAG of about five nodes (e.g., a “Simpson’s paradox” graph containing a treatment, an outcome, a confounder, and a mediator). For three different effects $P(Y \mid \text{do}(X))$ on the same graph, derive the identifying expression using do-calculus rules. Verify each step. Compare to the output of an automated tool (the dosearch R package, the dowhy Python library).
E2. ATE estimation by backdoor adjustment. Generate a synthetic dataset from a known SCM with a confounder. Estimate the ATE using (a) naive comparison of means, (b) propensity-score matching, (c) IPW, (d) AIPW, (e) double ML. Compare to the true ATE. Verify that the causally-aware estimators are approximately unbiased when their nuisance models are correctly specified, while the naive estimator is biased. Plot the bias-variance trade-off across estimators.
E3. PC algorithm on a synthetic graph. Take a known 10-node DAG. Generate synthetic data from a corresponding SCM. Run PC algorithm on the data; recover the Markov equivalence class. Compare to the true DAG. Investigate the effect of (a) sample size, (b) noise level, (c) edge density on PC’s success rate.
E4. Counterfactual fairness analysis. Take a synthetic loan-approval dataset where the protected attribute affects outcomes through both legitimate (income) and illegitimate (biased evaluation) pathways. Construct a causal graph; implement counterfactual fairness; compare to demographic parity and equalized odds. Investigate which fairness criterion each predictor satisfies.
E5. LLM on a causal benchmark. Pick a publicly available causal-reasoning benchmark (CLadder, CausalBench, or a CRASS subset). Evaluate one or more LLMs on it; analyze the pattern of successes and failures. Run chain-of-thought prompting; compare to direct prompting. Identify specific question types where the LLM fails systematically. Reflect on the implications for deploying LLMs in causal-sensitive contexts.
E6. Sensitivity analysis. For a real or synthetic causal-effect estimate, conduct sensitivity analysis to an unmeasured confounder. Use the Cinelli-Hazlett sensitivity-analysis framework or the E-value approach. Report how strong an unmeasured confounder would need to be to invalidate the conclusion. Reflect on whether the conclusion is robust.
E7. Invariant Risk Minimization. Implement IRM (Arjovsky et al., 2019) on a synthetic dataset with multiple environments. Compare to ERM. Test OOD generalization on a held-out environment. Reproduce (or fail to reproduce) the IRM paper’s claims. Reflect on the gap between theoretical motivation and empirical performance.
E8. Causal Discovery with LiNGAM. Generate synthetic data from a linear SCM with non-Gaussian noise. Run LiNGAM on the data; verify that it recovers the unique DAG (not just the equivalence class). Then run LiNGAM on linear-Gaussian data; verify that it fails to recover the unique direction (expected). Reflect on what the failure mode tells us about identifiability.
E9. Causal effect with high-dimensional covariates. Generate a synthetic dataset with 500-dimensional confounders and a treatment-outcome pair. Estimate the ATE using (a) classical IPW with a logistic propensity model, (b) double ML with gradient-boosted nuisances, (c) double ML with neural-network nuisances. Compare bias and variance. Reflect on when high-dimensional methods help vs hurt.
E10. Mediation analysis. Take a synthetic dataset with treatment, mediator, and outcome. Decompose the total effect into natural direct and natural indirect effects. Investigate sensitivity to the (untestable) cross-world independence assumption. Reflect on when mediation analysis is reliable and when it is fragile.
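For E6, the E-value has a closed form that makes a convenient starting point. The helper below is a minimal sketch, assuming the effect estimate is expressed as a risk ratio and using the VanderWeele-Ding formula $E = RR + \sqrt{RR\,(RR - 1)}$ for $RR \geq 1$, with ratios below 1 inverted first. The function name and the example estimate are introduced here for illustration.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate expressed as a risk ratio."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates (RR < 1) are inverted so the formula applies.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


observed_rr = 2.0  # hypothetical estimate from an observational study
strength_needed = e_value(observed_rr)
print("E-value:", strength_needed)
```

The result is read as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the estimate. Applying the same function to the confidence limit closest to 1 gives the corresponding statement for the interval.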

Causality and Causal Inference

Scope and What This Chapter Is About

§1. Motivation and Scope

A worked example to anchor everything

What causality is

Correlation vs causation: making the distinction precise

Why causality matters for AI in 2026

The two dominant traditions

Boundaries with adjacent chapters

What this chapter does not try to do

Position taken in this chapter

§2. Historical Context

Early philosophy: Hume and Mill

Wright and Fisher: graphical and experimental foundations

Haavelmo and structural econometrics

Rubin’s potential outcomes (1974)

Pearl’s structural causal models (1988–2009)

The Pearl-vs-Rubin tension

The 2000s rise of applied causal inference

The 2015–2020 ML revival

2020–2026: causal LLMs and the integration

Where this leaves us in 2026

§3. The Ladder of Causation

The three rungs

Rung 1: Association

Rung 2: Intervention

Rung 3: Counterfactual

Why ML is mostly on Rung 1

What it takes to move higher

Where the ladder fits in 2026

§4. Structural Causal Models and Do-Calculus

Directed Acyclic Graphs as causal models

Structural equations

The do-operator

Why $P(Y \mid \text{do}(X)) \neq P(Y \mid X)$ in general

The three rules of do-calculus

Identifiability and the ID algorithm

Pseudocode for the ID algorithm (sketch)

Where SCMs and do-calculus sit in the field

§5. Identification: Backdoor, Front-door, and Instrumental Variables

The backdoor criterion

What can go wrong with backdoor

The front-door criterion

Instrumental variables

When each criterion applies

Sensitivity analysis

Where identification fits in the chapter

§6. Causal Effect Estimation

Causal estimands: ATE, CATE, ITE

The standard assumptions

Propensity score methods

Doubly robust estimators

Causal forests and meta-learners (CATE)

Double Machine Learning (DML)

Deep learning for causal effect estimation

What works in practice

Where estimation fits

§7. Counterfactuals

What counterfactuals ask

Computing counterfactuals from an SCM

Counterfactuals in the potential-outcomes framework

The identifiability of counterfactuals

Mediation analysis

Counterfactual fairness

Where counterfactuals fit

§8. Causal Discovery

The problem statement

Constraint-based methods: PC and FCI

Score-based methods: GES

Functional causal models: LiNGAM and ANM

Causal discovery in nonlinear, high-dimensional settings

The Markov-equivalence class limitation

Where causal discovery fits

§9. Causal Representation Learning

The motivating problem

The Schölkopf-Bengio programme

Independent Causal Mechanisms (ICM)

Identifiability results for nonlinear ICA

Disentanglement and causal structure

Causal world models

Why $P(Y \mid \text{do}(X)) \neq P(Y \mid X)$ in general