Causal Inference: What Would Have Happened, But Didn't
Most of machine learning answers the question "what is correlated with what?" — and most of decision-making needs the answer to a different question: "what would happen if I did X?" Causal inference is the discipline that bridges the gap. This chapter develops the two dominant frameworks (potential outcomes and causal graphs), the methods for estimating effects from observational and experimental data, and the quasi-experimental designs that have re-shaped applied empirical work over the past three decades.
Prerequisites & orientation
This chapter assumes basic familiarity with probability and statistics (Part I Ch 04–05) and a comfort with linear regression (Part IV Ch 01). The DAG sections build on the probabilistic graphical models material of Part IV Ch 06; readers should glance there if the d-separation discussion feels unfamiliar. No prior causal-inference experience is assumed; the chapter develops the framework from first principles.
Two threads run through the chapter. The first is the long-running tension between potential outcomes (the Rubin / Neyman framework, dominant in statistics, biostatistics, and economics) and causal graphs (the Pearl framework, dominant in computer science and AI). The two frameworks are mathematically equivalent for most purposes but emphasise different intuitions and have different practical workflows. This chapter develops both. The second thread is the identification-versus-estimation distinction: showing that a causal effect is in principle recoverable from your data (identification) is logically prior to estimating it well (estimation), and most causal-inference failures trace back to skipping the identification step.
Why Causal Inference Is Different
A trained machine-learning model answers questions about the data it was trained on. A decision-maker needs answers about a world that has not happened yet — what would happen if we ran this advertising campaign, raised this price, gave this drug to this patient. Causal inference is the discipline that connects observed data to those counterfactual questions, and the gap between it and standard predictive ML is the most consequential conceptual gap in applied data science.
Correlation, prediction, and the missing arrow
Standard supervised learning learns the conditional distribution P(Y | X) — how the outcome varies with features in the world as it exists. This is sufficient for prediction problems where the future world will look like the past. It is not sufficient for decision problems, because changing X by an action is fundamentally different from observing X in the world. The classic illustration: ice-cream sales correlate with shark attacks (both rise in summer); a predictor will happily learn this. Acting on the prediction by selling less ice cream will not reduce shark attacks. Causal inference asks not "given that we see X, what do we expect Y to be?" but "if we set X to a particular value by intervention, what would Y be?" The two questions have different answers and require different mathematics.
Where causal questions actually live
The decision questions at the heart of every applied data science role are causal: did this advertising campaign actually drive sales, or would the customers have bought anyway? Does this medical treatment improve patient outcomes, or are we selecting healthier patients for it? Will raising the price reduce demand by 10%, or did the past 10% reduction in volume coincide with a price increase for unrelated reasons? Does this software feature increase engagement, or do engaged users self-select into using the feature? Each of these questions has a numerical answer that prediction cannot give. Causal inference is what produces those answers — and when an organisation's data work shifts from "what is X?" to "what should we do about X?", the methods of this chapter are what's needed.
The two frameworks
Causal inference has two dominant theoretical frameworks that grew up largely in parallel through the 1970s–1990s. The potential outcomes framework (Rubin, building on Neyman) treats causal inference as a missing-data problem: each unit has a value of the outcome under treatment and a value under control, and we observe only one of them. The causal graph framework (Pearl, building on Wright's path analysis) represents causal relationships as directed acyclic graphs and develops a calculus for reasoning about interventions on those graphs. The two frameworks are mathematically equivalent in the cases that overlap, but they emphasise different intuitions and provide different practical workflows. Statisticians, biostatisticians, and economists tend to think in potential outcomes; computer scientists and AI researchers tend to think in graphs. This chapter develops both, in roughly that order.
If two variables are correlated, then either one causes the other, the other causes the one, or they share a common cause (possibly in some combination). This deceptively simple principle (Hans Reichenbach, 1956) is the conceptual seed of all of causal inference: distinguishing among the three possibilities is the central problem the rest of this chapter is organised around. A correlation is suggestive but underdetermined; the methods of this chapter are how that determination gets made.
The Potential Outcomes Framework
The cleanest way to formalise a causal question is to imagine a parallel-universe ledger: for each unit in the population, what would happen under treatment, and what would happen under no treatment? The causal effect on each unit is the difference between those two potential outcomes. The catch — the fundamental problem of causal inference — is that we only ever observe one universe per unit.
Notation
For each unit i in the population, let Ti ∈ {0, 1} indicate whether the unit was treated, and let Yi(0) and Yi(1) be the unit's potential outcomes under no treatment and treatment respectively. The unit's individual treatment effect is τi = Yi(1) − Yi(0). The observed outcome is Yi = Ti Yi(1) + (1 − Ti) Yi(0) — we observe whichever potential outcome corresponds to the treatment the unit actually received.
The fundamental problem
The fundamental problem of causal inference (Holland 1986) is that τi is never observable for any single unit, because we always observe at most one of the two potential outcomes. A patient who received the drug never gives us the counterfactual outcome where they did not receive it; a customer who saw the ad never gives us the counterfactual outcome where they did not. Individual treatment effects are forever counterfactual. What can be estimated is average effects across populations, where the variation across units lets us pool information to recover the missing potential outcomes.
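To make the missing-data structure concrete, here is a minimal simulation sketch (the numbers and variable names are illustrative, not from any real study): each unit carries both potential outcomes, but the treatment indicator reveals only one of them.

```python
# Illustrative simulation of the fundamental problem: each unit has two
# potential outcomes, but the observed data reveal only one per unit.
import numpy as np

rng = np.random.default_rng(0)
n = 10

y0 = rng.normal(10, 2, size=n)           # potential outcome under control
y1 = y0 + rng.normal(3, 1, size=n)       # potential outcome under treatment
tau_i = y1 - y0                           # individual effects: never observable

t = rng.integers(0, 2, size=n)            # treatment actually received
y_obs = t * y1 + (1 - t) * y0             # Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)

print("individual effects (hidden):", np.round(tau_i, 2))
print("observed outcomes:          ", np.round(y_obs, 2))
print("true ATE (hidden):          ", round(tau_i.mean(), 2))
```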
Average effects
The most commonly estimated quantity is the average treatment effect (ATE), along with several closely related estimands:
ATE (overall): τ = E[ Y(1) − Y(0) ]
ATT (treated): τT = E[ Y(1) − Y(0) | T = 1 ]
ATU (untreated): τU = E[ Y(1) − Y(0) | T = 0 ]
CATE (conditional): τ(x) = E[ Y(1) − Y(0) | X = x ]
SUTVA and the assumptions you need
The potential-outcomes setup quietly assumes SUTVA (Stable Unit Treatment Value Assumption), which has two parts: no interference (one unit's treatment does not affect another unit's outcome) and no hidden treatment versions (the treatment is well-defined and consistent). Both can fail in real applications. Network experiments violate no-interference (treating one user changes another's experience). A medical "drug" that is administered at varying doses violates no-hidden-versions. SUTVA failures do not invalidate causal inference but require explicit modelling — the field of network causal inference exists for precisely the no-interference case.
Identification: Randomization and Ignorability
Identification is the question of whether a causal effect is in principle recoverable from the data, before any estimation effort. Randomized experiments identify the ATE essentially by construction; observational studies require additional assumptions that the analyst must justify on substantive grounds. Most causal-inference failures trace back to skipping this step.
The randomized controlled trial
If treatment is assigned by an unbiased random mechanism — a coin flip, a random-number generator, anything that has nothing to do with the unit's potential outcomes — then the treated and untreated groups are exchangeable in expectation. The ATE estimator is then the simple difference of means: τ̂ = ȳT=1 − ȳT=0. The argument is clean: random assignment severs any link between treatment and pre-treatment characteristics, including unobserved ones, so the difference in observed outcomes captures the causal effect rather than confounding.
This is why randomized controlled trials (RCTs) are the gold standard for causal evidence in medicine, and why A/B testing is the dominant evaluation methodology at every serious tech company. When randomization is feasible, identification is automatic; the estimation reduces to computing a difference in means with appropriate confidence intervals; and the credibility is high enough to support consequential decisions.
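A hedged sketch of the RCT analysis described above, on simulated data with a known effect of 2.0: randomize, difference the group means, and attach a normal-approximation confidence interval.

```python
# Difference-in-means ATE estimate from a simulated RCT, with a
# normal-approximation 95% confidence interval. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
t = rng.integers(0, 2, size=n)                   # coin-flip assignment
y = 5.0 + 2.0 * t + rng.normal(0, 3, size=n)     # true ATE = 2.0

treated, control = y[t == 1], y[t == 0]
ate_hat = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))

print(f"ATE estimate: {ate_hat:.3f}")
print(f"95% CI: [{ate_hat - 1.96 * se:.3f}, {ate_hat + 1.96 * se:.3f}]")
```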
The catch: when randomization is not feasible
Many of the most consequential causal questions cannot be addressed by randomization. We cannot randomly assign people to smoke or not smoke, to attend college or not attend, to be exposed to a recession or not. Ethical, practical, and political constraints rule out randomized trials in vast areas of medicine, social science, and public policy. The methods of the rest of this chapter exist to recover causal effects from observational data — and the central question for each of them is: under what assumptions does the data identify the causal effect?
Conditional ignorability
The most common identification strategy for observational data is conditional ignorability (also called unconfoundedness, no-unmeasured-confounding, or the conditional-exchangeability assumption). It says: given a sufficient set of covariates X, treatment assignment is as if random. Formally:
ignorability: (Y(0), Y(1)) ⊥ T | X
positivity: 0 < P(T = 1 | X = x) < 1 for all x
The catch is that ignorability is an untestable assumption: it requires you to have measured all the confounders. Whether your X is sufficient is a substantive question about the science of the application, not a statistical one. Most published observational causal estimates rely on ignorability, and the credibility of those estimates depends entirely on how convincing the substantive argument for ignorability is. Sensitivity analysis (Section 10) is the standard tool for honest reporting when the assumption cannot be fully justified.
Causal DAGs and the Pearl Framework
The second major framework for causal inference represents causal relationships as directed acyclic graphs. The graph encodes the analyst's assumptions about which variables cause which others; a precise graphical calculus then determines which causal questions can be answered from observational data and how. The framework is more abstract than potential outcomes but pays for the abstraction with mechanical procedures for handling complex confounding.
Directed acyclic graphs
A causal DAG is a directed graph in which each node represents a variable and each edge X → Y is read "X is a direct cause of Y." Acyclicity (no directed cycles) is a substantive assumption: causes precede their effects, and feedback loops require explicit time indexing. The graph is the analyst's hypothesis about the data-generating process; like any modelling assumption it can be wrong, but it is at least visible and criticisable, which is the central pedagogical advantage of the graphical approach.
Three structural building blocks
Three patterns of three nodes appear repeatedly and have very different statistical properties.
- Chain (mediation): X → Z → Y. Z is a mediator on the causal path from X to Y. Conditioning on Z blocks the path and removes part of the X-Y association.
- Fork (confounding): X ← Z → Y. Z is a common cause — a confounder. The unconditional X-Y association mixes the causal effect (if any) with the spurious association induced by Z. Conditioning on Z removes the confounding.
- Inverted fork (collider): X → Z ← Y. Z is a common effect — a collider. X and Y are unconditionally independent (assuming no other connection); conditioning on Z creates a spurious association between them. This is the famous "selection bias" pattern.
The third pattern is counterintuitive and is the source of many empirical disasters: conditioning on a collider — selecting your sample based on a common effect of two variables you want to study — creates correlations where none existed, and is the mechanism behind classic puzzles like Berkson's paradox.
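The collider mechanism is easy to demonstrate in a few lines of simulation (purely illustrative data): X and Y are generated independently, yet selecting on their common effect Z manufactures a correlation.

```python
# Illustrative simulation of collider bias: X and Y are independent,
# but selecting on their common effect Z induces a spurious correlation.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)                      # independent of x by construction
z = x + y + rng.normal(scale=0.5, size=n)   # collider: common effect of x and y

print("corr(X, Y) overall:    ", round(np.corrcoef(x, y)[0, 1], 3))
selected = z > 1.0                           # condition on the collider
print("corr(X, Y) given Z > 1:", round(np.corrcoef(x[selected], y[selected])[0, 1], 3))
```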
The backdoor criterion
Pearl's backdoor criterion formalises the question of which covariates need to be controlled for to identify the causal effect of X on Y from observational data. A set of variables S satisfies the backdoor criterion if (a) S blocks every path from X to Y that contains an arrow into X (a "backdoor path"), and (b) S contains no descendants of X. If such an S exists and is observed, the causal effect of X on Y is identified by adjusting for S — i.e., by stratifying or conditioning on S in the analysis.
The criterion mechanises the question that ignorability poses verbally: which variables are the necessary controls? A correctly drawn DAG turns the question into a graph-traversal exercise. The catch is the same as with ignorability: if the DAG is missing variables or edges, the criterion will produce a misleading answer. Drawing the DAG is the substantive, untestable input; the rest is mechanical.
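A minimal sketch of what "adjusting for S" means in the simplest case, assuming a single observed binary confounder Z that closes the only backdoor path: estimate the X-Y contrast within each stratum of Z and average over the marginal distribution of Z. The data and coefficients are invented for illustration.

```python
# Backdoor adjustment by stratification on a single binary confounder Z.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n)                         # confounder
x = rng.binomial(1, 0.2 + 0.6 * z)                     # treatment depends on z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)             # true effect of x is 1.0

naive = y[x == 1].mean() - y[x == 0].mean()            # confounded contrast

# backdoor adjustment: E_z[ E[Y | X=1, Z=z] - E[Y | X=0, Z=z] ]
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
print(f"naive: {naive:.2f}   backdoor-adjusted: {adjusted:.2f}   truth: 1.00")
```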
Do-Calculus and Identifiability
The crowning achievement of the Pearl framework is do-calculus: a small set of mechanical rules that determine, for any DAG and any causal query, whether the query can be answered from observational data — and if so, exactly which observational quantities to compute. It is the mathematical formalism behind every claim of identifiability in the graphical framework.
Interventions and the do operator
The notation P(Y | do(X = x)) denotes the distribution of Y when X has been set by external intervention to the value x — distinct from the conditional P(Y | X = x), which is the distribution of Y when we passively observe X equal to x. The two coincide in randomized experiments (where the act of randomly assigning X is itself the intervention) and differ in observational data (where the conditional distribution mixes the causal effect with confounding).
An intervention do(X = x) corresponds graphically to severing all incoming edges to X and setting X to the value x. The post-intervention DAG is the original DAG minus the arrows into X; the post-intervention distribution factorises according to the modified graph. The do-operator formalises the difference between observation and intervention in a single piece of notation.
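The observation-versus-intervention distinction can be seen directly by simulating the same structural equations twice (an illustrative model, with U playing the unobserved confounder): once letting X respond to U, and once with the U → X edge severed and X set to 1.

```python
# Contrast between conditioning and intervening in a confounded structural
# model: P(Y | X = 1) vs P(Y | do(X = 1)).
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
u = rng.normal(size=n)                                  # unobserved confounder

# observational world: X listens to U
x_obs = (u + rng.normal(size=n) > 0).astype(float)
y_obs = 1.0 * x_obs + 2.0 * u + rng.normal(size=n)
print("E[Y | X=1] (observational):", round(y_obs[x_obs == 1].mean(), 2))

# interventional world: cut the U -> X edge and set X = 1 everywhere
x_do = np.ones(n)
y_do = 1.0 * x_do + 2.0 * u + rng.normal(size=n)
print("E[Y | do(X=1)]:            ", round(y_do.mean(), 2))   # the causal value, 1.0
```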
The three rules of do-calculus
Pearl's do-calculus consists of three rules that allow do-expressions to be transformed into other do-expressions, conditional expressions, or marginal expressions, by appealing to graphical properties of the DAG.
- Rule 1 (insertion / deletion of observations): When a variable is conditionally independent of Y in the post-intervention graph given the other variables, you can drop or add it to the conditioning set without changing the distribution.
- Rule 2 (action / observation exchange): When intervening on a variable is graphically equivalent to observing it (no confounding for the variable), the do-operator can be replaced by ordinary conditioning.
- Rule 3 (insertion / deletion of actions): When intervening on a variable has no effect on Y in the post-intervention graph (the variable is not a cause of Y in the relevant sense), the do-operator can be removed entirely.
The three rules together are complete for causal identification — Shpitser and Pearl proved that an effect is identifiable from observational data if and only if it can be reduced to a do-free expression by repeated application of these three rules. The proof is constructive: there is a polynomial-time algorithm (the ID algorithm) that, given a DAG and a query, returns either an identifying expression or a proof that the query is non-identifiable.
The frontdoor criterion
The most celebrated consequence of do-calculus is the frontdoor criterion, which identifies certain causal effects in the presence of unobserved confounders. The setup: X causes Y partially through a mediator M, and although there is unobserved confounding between X and Y, neither X-M nor M-Y is confounded. The frontdoor formula then identifies the X→Y causal effect from observational data, even though the X-Y association is confounded. This is a striking result — that some causal effects are identifiable despite unobserved confounding — and the kind of theorem that simply does not exist in the potential-outcomes framework alone.
The classic example is smoking and lung cancer in the 1960s: the X-Y association was confounded by an unobserved "smoking-gene" variable, but the mediator (tar deposits in the lungs) was not confounded with either the cause or the effect. The frontdoor formula could in principle have identified the smoking-cancer effect even without conducting RCTs, by exploiting the conditional independence structure in the data.
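A hedged simulation sketch of the frontdoor formula in the binary case (the structural equations and coefficients are invented for illustration): the estimator never touches the unobserved confounder U, yet recovers the true effect that the naive contrast gets wrong.

```python
# Frontdoor formula with binary X and mediator M:
#   E[Y | do(x)] = sum_m P(m | x) * sum_x' E[Y | m, x'] P(x')
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
u = rng.normal(size=n)                                   # unobserved confounder
x = rng.binomial(1, 1 / (1 + np.exp(-2 * u)))            # X <- U
m = rng.binomial(1, 0.2 + 0.6 * x)                       # M <- X (unconfounded)
y = 2.0 * m + 1.5 * u + rng.normal(size=n)               # Y <- M, U  (true ATE = 1.2)

def frontdoor(x_val):
    px = np.array([(x == 0).mean(), (x == 1).mean()])
    total = 0.0
    for m_val in (0, 1):
        p_m_given_x = (m[x == x_val] == m_val).mean()
        inner = sum(y[(m == m_val) & (x == xp)].mean() * px[xp] for xp in (0, 1))
        total += p_m_given_x * inner
    return total

print("frontdoor ATE estimate:", round(frontdoor(1) - frontdoor(0), 2))   # ~1.2
print("naive contrast:        ", round(y[x == 1].mean() - y[x == 0].mean(), 2))
```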
Estimation from Observational Data
Once a causal effect is identified — by ignorability, the backdoor criterion, or some other argument — the remaining work is statistical estimation: turning the identifying expression into a number with appropriate uncertainty. Several estimation strategies have become standard, each with its own strengths and pathologies.
Regression adjustment
The simplest estimation strategy: fit a regression model Y = f(X, T) + ε on the observed data, then estimate the ATE as E[ f(X, 1) − f(X, 0) ] averaged over the empirical distribution of X. This works when the regression model captures the conditional expectation correctly and the relevant covariates are in X. The catch is sensitivity to model specification: if the functional form is wrong, the estimate is biased even if ignorability holds.
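A minimal regression-adjustment sketch on simulated data where the linear outcome model happens to be correct (variable names and coefficients are illustrative): fit E[Y | X, T], then average the difference of predictions with T switched on and off.

```python
# Regression adjustment: fit an outcome model, then average f(X, 1) - f(X, 0).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=(n, 3))                              # observed confounders
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))          # treatment depends on x
y = 2.0 * t + x @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)   # true ATE = 2.0

model = LinearRegression().fit(np.column_stack([x, t]), y)
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
print("regression-adjusted ATE:", round((y1_hat - y0_hat).mean(), 3))
```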
Propensity scores and matching
The propensity score e(X) = P(T = 1 | X) is the probability of receiving treatment given covariates. Rosenbaum and Rubin (1983) showed that if treatment is ignorable given X, it is also ignorable given the propensity score alone: adjusting on a single scalar function of the covariates is sufficient. This collapses a high-dimensional adjustment problem into a one-dimensional one and underlies several estimation strategies. Propensity score matching pairs treated and untreated units with similar propensity scores and estimates the effect from the matched pairs. Inverse probability weighting (IPW) reweights observations by the inverse probability of the treatment actually received (1/e(X) for treated units, 1/(1 − e(X)) for untreated) so that the weighted population is balanced on covariates and the simple difference in weighted means estimates the ATE.
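A hedged IPW sketch using the same kind of simulated data: estimate the propensity score with a logistic regression, clip it away from 0 and 1, and take the weighted difference in means.

```python
# Inverse probability weighting with an estimated propensity score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 50_000
x = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = 2.0 * t + x @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)    # true ATE = 2.0

e_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]        # propensity score
e_hat = np.clip(e_hat, 0.01, 0.99)                                    # guard against extreme weights

ate_ipw = np.mean(t * y / e_hat) - np.mean((1 - t) * y / (1 - e_hat))
print("IPW ATE:", round(ate_ipw, 3))
```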
Doubly robust estimation
Both regression adjustment and IPW require a correct model — regression needs the outcome model right, IPW needs the propensity model right. Doubly robust estimators combine the two so that the estimate is consistent if either model is correct, with the second model serving as insurance against misspecification of the first. The augmented IPW (AIPW) estimator is the canonical doubly robust form. Modern variants — TMLE, double machine learning — extend the idea to allow flexible nonparametric machine-learning estimators for the nuisance models while preserving the doubly-robust property.
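A minimal AIPW sketch combining the two nuisance models from the previous examples (simulated data, illustrative coefficients); in this toy setup both models are correct, but the estimator only needs one of them to be.

```python
# Augmented IPW (doubly robust) estimate of the ATE.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(8)
n = 50_000
x = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = 2.0 * t + x @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)     # true ATE = 2.0

e = np.clip(LogisticRegression().fit(x, t).predict_proba(x)[:, 1], 0.01, 0.99)
mu1 = LinearRegression().fit(x[t == 1], y[t == 1]).predict(x)          # outcome model, treated
mu0 = LinearRegression().fit(x[t == 0], y[t == 0]).predict(x)          # outcome model, control

aipw = np.mean(mu1 - mu0
               + t * (y - mu1) / e
               - (1 - t) * (y - mu0) / (1 - e))
print("AIPW ATE:", round(aipw, 3))
```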
The bias-variance tradeoff in causal estimation
Causal estimation differs from prediction in a subtle but important way. The goal is unbiased estimation of an average effect, not minimum prediction error on individual outcomes. Off-the-shelf machine-learning models, optimised for prediction, can produce strongly biased causal estimates because their bias is asymmetric across the treatment groups — the model fits the larger group better than the smaller one. The double machine learning framework (Chernozhukov et al. 2018) addresses this with a sample-splitting "cross-fitting" procedure that decouples the nuisance estimation from the effect estimation. Modern causal-ML practice almost always uses some form of cross-fitting; Chapter 04 of this Part develops the techniques in more depth.
Instrumental Variables
When ignorability fails — when there are unobserved confounders that the analyst cannot adjust for — observational adjustment cannot identify the causal effect. Instrumental variables offer a different route: find a variable that influences treatment but has no other path to the outcome, and use that variable as a source of "exogenous" variation. The technique was developed in mid-20th-century econometrics and remains the workhorse for credible causal inference under unobserved confounding.
The IV setup
An instrumental variable Z must satisfy three conditions. Relevance: Z affects the treatment T. Exclusion: Z affects the outcome Y only through T (no direct path Z → Y, and no other paths around T). Independence: Z is unconfounded with Y in the same way randomization would be — independent of unobserved variables that affect Y.
If a valid instrument exists, the causal effect of T on Y can be identified from the data — even if T and Y share unobserved confounders. The intuition: random variation in Z produces random variation in T (by relevance), which produces variation in Y that must be attributable to the T → Y effect (by exclusion and independence). The classical formula is the Wald estimator: the effect of Z on Y, divided by the effect of Z on T.
Two-stage least squares
The standard estimation procedure is two-stage least squares (2SLS). In the first stage, regress T on Z (and any other exogenous covariates) to get predicted treatment T̂. In the second stage, regress Y on T̂ (and the same exogenous covariates). The coefficient on T̂ in the second stage is the IV estimate of the causal effect. 2SLS is implemented in every statistics package and is the default IV estimator in econometrics.
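A hedged 2SLS sketch with a single binary instrument, written as two explicit least-squares stages in numpy (a real analysis would use a packaged implementation to get correct standard errors; the simulated coefficients are illustrative).

```python
# Two-stage least squares as two explicit OLS stages.
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
u = rng.normal(size=n)                          # unobserved confounder
z = rng.binomial(1, 0.5, size=n)                # instrument (as-if random)
t = 0.5 * z + 0.8 * u + rng.normal(size=n)      # treatment: driven by z and u
y = 1.5 * t + 2.0 * u + rng.normal(size=n)      # true effect of t is 1.5

def ols(design, target):
    return np.linalg.lstsq(design, target, rcond=None)[0]

ones = np.ones(n)
t_hat = np.column_stack([ones, z]) @ ols(np.column_stack([ones, z]), t)   # first stage
beta = ols(np.column_stack([ones, t_hat]), y)                              # second stage

print("OLS of y on t (confounded):", round(ols(np.column_stack([ones, t]), y)[1], 3))
print("2SLS estimate:             ", round(beta[1], 3))                     # ~1.5
```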
LATE and the heterogeneity issue
A subtle conceptual point: with a binary instrument and binary treatment, IV estimates not the ATE but the Local Average Treatment Effect (LATE), the effect for the subpopulation of "compliers," units whose treatment status is actually changed by the instrument; identification additionally requires a monotonicity assumption (no "defiers" who do the opposite of what the instrument encourages). Always-takers (treated regardless of Z) and never-takers (untreated regardless of Z) contribute nothing to the IV estimate. The LATE-vs-ATE distinction (Imbens & Angrist 1994) was a major theoretical advance: it clarified that IV identifies a different causal quantity than an RCT does, and its policy relevance depends on whether the compliers are the population the policy will affect.
Where to find instruments
Real instruments are scarce and the standard for accepting one is high. The classic examples include lottery-based natural experiments (Vietnam draft lottery as an instrument for military service), institutional features (distance to a college as an instrument for college attendance), policy discontinuities (eligibility cutoffs), and rainfall (as an instrument for agricultural production). The exclusion restriction is rarely directly testable, and most published IV studies devote substantial space to defending it on substantive grounds. Weak instruments — those with low first-stage power — produce biased and noisy IV estimates with notoriously bad finite-sample behaviour, and modern IV practice includes diagnostic tests (the Stock-Yogo test, robust IV procedures) to catch this failure mode.
Difference-in-Differences and Quasi-Experiments
A second class of strategies for credible causal inference uses temporal or geographical variation to construct counterfactuals. The idea: if a treatment is rolled out at one time or in one place but not another, the untreated time/place serves as a counterfactual for the treated one, with the difference between them attributable to the treatment.
Difference-in-differences
Difference-in-differences (DiD) is the canonical method. Compare a treated group's outcome before and after a treatment to a control group's outcome over the same period, and take the difference of differences. Under the parallel trends assumption — that without the treatment, the treated and control groups would have moved together — the DiD estimator identifies the causal effect of the treatment on the treated.
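A minimal two-by-two DiD sketch on simulated data built to satisfy parallel trends, with a true effect of 1.0; the estimator is literally a difference of four group means.

```python
# Two-by-two difference-in-differences on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
n = 20_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),    # treated-group indicator
    "post": rng.integers(0, 2, size=n),       # after-treatment period indicator
})
# parallel trends by construction: common time shock + group level + effect of 1.0
df["y"] = (1.0 * df.treated * df.post + 0.5 * df.post
           + 2.0 * df.treated + rng.normal(size=n))

cell = df.groupby(["treated", "post"])["y"].mean()
did = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])
print("DiD estimate:", round(did, 3))          # ~1.0
```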
The classic application is Card and Krueger's (1994) study of the New Jersey minimum-wage increase: fast-food employment in New Jersey before and after the rise was compared to fast-food employment in nearby Pennsylvania (no minimum-wage change), and the DiD estimate became one of the most-cited empirical results in 20th-century labour economics. The method has since become ubiquitous: every "did this policy work?" study in social science, and every "did this product change move the metric?" study in tech, is some variant of DiD.
The parallel-trends question
The parallel-trends assumption is testable in principle by examining pre-treatment trends: if the treated and control groups moved in parallel before the treatment, it is more credible that they would have continued to move in parallel without treatment. Modern DiD practice includes event-study plots that visualise this, with leads and lags around the treatment date and confidence intervals around each. Failures of parallel trends in the pre-period are usually disqualifying; success in the pre-period is suggestive but not definitive (the trends could have diverged at exactly the treatment date for unrelated reasons).
Synthetic control
When there is one treated unit and many potential controls (one country adopted a policy, the others did not), synthetic control (Abadie & Gardeazabal 2003) constructs a weighted average of control units to match the treated unit's pre-treatment trajectory, then uses the post-treatment difference as the causal estimate. The method has become the standard for case-study causal inference and has been extended in dozens of directions (matrix completion methods, generalised synthetic control, augmented synthetic control).
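A hedged synthetic-control sketch (toy data, invented trajectories): choose non-negative weights summing to one that make the weighted controls track the treated unit before treatment, then read the post-treatment gap as the effect estimate.

```python
# Synthetic control as a constrained least-squares fit of the pre-period.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(13)
T_pre, T_post, n_controls = 20, 10, 8
controls = rng.normal(size=(T_pre + T_post, n_controls)).cumsum(axis=0)   # control trajectories
true_w = np.array([0.5, 0.3, 0.2] + [0.0] * (n_controls - 3))
treated = controls @ true_w
treated[T_pre:] += 3.0                                                     # effect of 3.0 after treatment

def loss(w):
    return np.sum((treated[:T_pre] - controls[:T_pre] @ w) ** 2)

res = minimize(loss, x0=np.full(n_controls, 1 / n_controls),
               bounds=[(0, 1)] * n_controls,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
gap = treated[T_pre:] - controls[T_pre:] @ res.x
print("estimated post-treatment effect:", round(gap.mean(), 2))            # ~3.0
```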
Regression discontinuity
A different quasi-experimental design: regression discontinuity (RD). Some treatments are assigned by a sharp cutoff on a continuous variable — students above a test-score threshold get a scholarship, individuals below an income threshold get a benefit, customers above a tenure threshold get a promotion. Just above and just below the cutoff, the units are essentially identical except for the treatment, so the discontinuity in the outcome at the cutoff identifies the causal effect locally. RD has become a standard tool in education research, public-policy evaluation, and increasingly in tech-product analysis where rule-based eligibility creates the cutoffs naturally.
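A minimal sharp-RD sketch on simulated data with a jump of 2.0 at the cutoff: fit a local linear regression on each side within a bandwidth (here an arbitrary choice of 0.2) and take the gap at the threshold.

```python
# Sharp regression discontinuity via local linear fits on each side of the cutoff.
import numpy as np

rng = np.random.default_rng(11)
n = 50_000
running = rng.uniform(-1, 1, size=n)                 # running variable, cutoff at 0
treated = (running >= 0).astype(float)
y = 2.0 * treated + 1.5 * running + rng.normal(scale=0.5, size=n)   # jump of 2.0 at cutoff

h = 0.2                                               # bandwidth (a tuning choice)
def local_fit(mask):
    coefs = np.polyfit(running[mask], y[mask], deg=1)
    return np.polyval(coefs, 0.0)                     # prediction at the cutoff

left = local_fit((running < 0) & (running > -h))
right = local_fit((running >= 0) & (running < h))
print("RD estimate at cutoff:", round(right - left, 3))   # ~2.0
```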
The credibility revolution
The collective body of techniques in this section — IV, DiD, synthetic control, RD — is associated with what economists call the credibility revolution in empirical work, the shift over the 1990s and 2000s away from regression-based causal claims under heroic assumptions toward design-based causal claims under transparent, defensible identification arguments. The 2021 Nobel Prize in Economics (Card, Angrist, Imbens) recognised the originators. The same intellectual movement has reshaped applied empirical work in epidemiology, education, public health, and tech-industry data science.
Heterogeneous Treatment Effects
The ATE answers "what is the average effect across the population?" Most decisions actually need to know "for whom is the effect largest, and for whom is it negative?" Heterogeneous-treatment-effect estimation is the discipline of producing the conditional average treatment effect (CATE) as a function of unit-level features — the bridge from causal inference to personalised decision-making.
The CATE as the central object
The conditional average treatment effect τ(x) = E[Y(1) − Y(0) | X = x] varies across feature profiles in nearly every real application: a drug works better for some patient subgroups than others, a marketing campaign changes behaviour for some customer segments and not others. Modelling this variation lets policy decisions be targeted to the subgroups where the effect is largest, which is what most operational use cases of causal inference actually require. The challenge is that the CATE is a function of features, not a single number, and estimating it well from data requires both flexible function fitting and the same identification machinery as the ATE.
Meta-learners
The "meta-learner" approach takes any standard ML predictor and uses it as a building block for CATE estimation. Several flavours exist. The S-learner uses a single predictor for E[Y | X, T] and computes τ̂(x) = f̂(x, 1) − f̂(x, 0). The T-learner fits separate predictors per treatment group: τ̂(x) = f̂1(x) − f̂0(x). The X-learner (Künzel et al. 2019) is a more involved construction designed for cases where treatment-group sizes are imbalanced; it has been the dominant meta-learner in production deployments since publication. The trade-offs among them are largely empirical and depend on dataset structure.
Causal forests
Causal forests (Wager & Athey 2018) adapt the random-forest framework specifically for CATE estimation. Trees are split to maximise heterogeneity in estimated treatment effects rather than predictive accuracy, and the resulting forest produces CATE estimates with asymptotically valid confidence intervals at each query point. It is the central method in the grf R package and among the strongest off-the-shelf CATE estimators in 2024–2026 production work.
Where this connects to Chapter 04
The full machinery of double machine learning, causal trees and forests, uplift modelling, and the broader integration of causal inference with modern ML is the subject of Chapter 04 (Causal Machine Learning). This section has introduced the central concept (the CATE) and a few canonical estimators; the next chapter develops them in production-relevant depth.
Frontier: Sensitivity, Discovery, and the LLM Question
Three frontier directions are reshaping causal inference in 2024–2026: principled handling of unobserved confounding via sensitivity analysis, automated learning of causal structure from data, and the harder question of whether large language models can perform genuine causal reasoning.
Sensitivity analysis
The honest weakness of every observational causal estimate is that ignorability is untestable; there could always be unobserved confounders the analyst missed. Sensitivity analysis quantifies how strong such unobserved confounding would have to be to overturn the conclusion. Rosenbaum's classical approach (Γ-bounds) reports the smallest hypothetical confounder strength that would render the estimate insignificant. The Cinelli-Hazlett (2020) approach reframes this in terms of partial R² values that an unobserved confounder would need with respect to both treatment and outcome — values that are interpretable on the same scale as observed covariates' R² values. Either approach turns the bare estimate into a more honest statement: "this effect is robust unless an unobserved confounder is at least N times stronger than the strongest observed one." Modern responsible practice always reports a sensitivity analysis alongside the main estimate.
Causal discovery
The DAGs of Section 04 have been the analyst's input — drawn from substantive knowledge, then used for analysis. Causal discovery reverses the direction: learn the structure of the DAG from data. Classical algorithms (PC, FCI) use conditional-independence tests in the data to constrain the set of DAGs consistent with the observed correlations. Modern methods (NOTEARS, GraN-DAG, causal-discovery transformers) use continuous optimisation or deep learning to search the space of DAGs directly. The catch is that conditional independence in finite samples is noisy, and many causal structures are observationally equivalent — multiple distinct DAGs can produce the same observational distribution. Practical causal discovery returns an equivalence class of DAGs and supplements with substantive expert judgement to pick among them.
Can LLMs reason causally?
The 2023–2026 literature on LLMs and causal reasoning has produced a contradictory mixture of results. On one hand, LLMs can verbally describe DAGs, identify confounders in textual descriptions, and produce reasonable-sounding causal explanations of phenomena they have read about. On the other hand, careful evaluations (the CLadder benchmark, the CausalBench suite) find that LLMs frequently confuse correlation and causation, get d-separation arguments wrong, and fail systematically on counterfactual reasoning that requires careful distinction between observational and interventional distributions. The honest 2026 picture is that LLMs are useful for the prose around a causal analysis (writing up findings, brainstorming confounders, suggesting candidate instruments) but unreliable for the formal reasoning steps that determine identifiability or correct estimation. Whether scaling alone fixes this, or whether genuine causal reasoning requires architectural changes, is an open empirical question.
What this chapter does not cover
The full integration of causal inference with modern machine learning — double ML, causal forests in depth, uplift modelling, doubly robust learners, the EconML and CausalML libraries — is the subject of Chapter 04 (Causal Machine Learning). The use of causal methods specifically for time-series data, including counterfactual forecasting and intervention-effect estimation in temporal settings, connects to Chapter 01 of this Part. The deep philosophical questions — what constitutes a cause, how to handle hypothetical contrasts that cannot be physically realised, the relationship between causal inference and prediction — belong to the philosophy-of-science literature rather than this chapter.
Causal inference is the discipline that connects observed data to decisions about the world. The two frameworks of this chapter — potential outcomes and causal graphs — are the conceptual machinery; the estimation methods are the practical tools; the quasi-experimental designs are the credible identification strategies when randomization is impossible. A practitioner who can move fluently among the three is the one who can answer the questions that operational decisions actually need.
Further Reading
- Causal Inference: The Mixtape. The most popular contemporary introduction to causal inference, covering potential outcomes, DAGs, RCTs, IV, DiD, synthetic control, and RD with worked examples in Stata, R, and Python. Friendly and conversational; freely available online from the author. The right starting point for someone new to the field. The most accessible serious introduction to applied causal inference.
- Mostly Harmless Econometrics. The textbook that popularised the credibility-revolution view of empirical work for a generation of economists and applied data scientists. Covers IV, DiD, RD, and matching with the design-based perspective that has reshaped the field. The follow-up, Mastering 'Metrics (2014), is a less technical introduction to the same material. The book that made design-based causal inference mainstream.
- Causality: Models, Reasoning, and Inference (2nd Edition). The canonical reference for the graphical / do-calculus approach to causal inference, by the framework's creator. Dense and mathematical, but the source for everything in Sections 04 and 05 of this chapter. The shorter The Book of Why (2018, with Dana Mackenzie) is a more accessible companion. The reference for graph-based causal inference.
- Causal Inference: What If. A free graduate-level textbook that develops the potential-outcomes framework rigorously while integrating the graphical perspective. Especially strong on epidemiological applications and the fine points of identification under various confounding patterns. The current draft is updated regularly online. The free textbook that statisticians actually use.
- Identification of Causal Effects Using Instrumental Variables. The paper that established the modern interpretation of IV as identifying the LATE. Reframes IV in the potential-outcomes language and clarifies which causal quantity 2SLS actually estimates. The foundation for nearly every subsequent IV paper. The reference for what IV identifies, and what it doesn't.
- Synthetic Control Methods for Comparative Case Studies. The paper that made synthetic control the standard tool for case-study causal inference. Walks through the California tobacco-control law as a worked example and presents the methodology that has been extended in dozens of subsequent papers. Pairs naturally with the modern SyntheticControlMethods Python and R packages. The reference for synthetic control.
- Double/Debiased Machine Learning for Treatment and Causal Parameters. The double-ML paper. Establishes the theoretical foundation for using flexible machine-learning estimators in causal inference without the bias problems that naive applications produce. Pairs with Chapter 04 of this Part for the practical applications. The bridge between modern ML and causal estimation.
- Making Sense of Sensitivity: Extending Omitted Variable Bias. The Cinelli-Hazlett sensitivity-analysis paper. Reframes sensitivity analysis in terms of partial R² values, which makes it directly interpretable on the same scale as observed covariates. Should accompany every observational causal estimate. The modern reference for sensitivity analysis.