Causal Machine Learning: flexibility meets identification
Classical causal inference uses linear regression and parametric models because they are tractable, not because they are true. Modern machine learning offers far more flexible estimators — random forests, gradient boosting, neural networks — but plugging them into a causal pipeline naively produces biased estimates. Causal machine learning is the synthesis: the identification rigour of causal inference combined with the flexibility of modern ML, made to work via cross-fitting, orthogonalisation, and a small but consequential set of mathematical tricks.
Prerequisites & orientation
This chapter is the direct sequel to Chapter 03 (Causal Inference) and assumes the framework developed there: potential outcomes, the ATE/CATE distinction, identification under ignorability, and the IV / DiD strategies for unconfounded variation. It also assumes comfort with random forests (Part IV Ch 03) and gradient boosting, plus basic neural networks (Part V Ch 01–02) for the deep-learning sections. No prior exposure to causal ML specifically is assumed.
Two threads run through the chapter. The first is the orthogonalisation idea — the central mathematical insight that lets flexible ML estimators serve as nuisance components inside causal estimators without contaminating the causal estimate with their bias. Once you understand orthogonalisation, every method in this chapter is a different way of applying it. The second thread is the operational shift: the question is rarely "what is the average treatment effect?" but "for whom is the effect largest, and how should we use that information to act?" Most of the methods in this chapter are organised around producing CATE estimates that decision systems can act on, and the chapter closes with the policy-learning frame that makes the action explicit.
Why Causal Machine Learning
Classical causal inference works in low-dimensional, parametric settings. Modern operational data is high-dimensional, full of interactions, and rarely well-described by a linear model. Causal machine learning exists because the techniques of Chapter 03 break down at the data scales that matter, and the techniques of standard ML give biased causal estimates if used naively. The synthesis is a small but consequential set of methods that get both the rigour and the flexibility right.
The classical limits
The standard estimators of Chapter 03 — regression adjustment, propensity-score weighting, even matching — assume linear models for the outcome and the propensity, or hand-engineered nonlinear features. With a few covariates and a clear hypothesis, this is fine. With hundreds of features (the typical situation in tech, retail, healthcare claims, online experimentation), parametric assumptions become indefensible. The right functional form for the outcome is unknown; the right functional form for the propensity is unknown; trying to specify both correctly is asking for misspecification bias on top of any unmeasured-confounding bias.
Modern ML — random forests, gradient boosting, neural networks — handles arbitrary functional forms beautifully for prediction. The temptation is to plug ML estimators into the classical causal pipeline (regression adjustment with a random forest, IPW with a neural-network propensity score). The result, in general, is biased causal estimates, even when the ML predictors are individually accurate. The bias has a specific source — and a specific fix.
The ML-bias problem
Standard ML estimators are biased toward the centre of the training distribution because regularisation pulls them there. For prediction, this is a feature: it reduces variance at the cost of a small bias, and the trade-off is favourable. For causal estimation, it is a disaster: the bias is asymmetric across treatment groups, doesn't average out, and propagates through the causal estimator. A naive ML-plug-in causal estimator can have worse mean-squared error than a classical parametric one despite using a better predictor underneath, because the gain in the nuisance prediction is overwhelmed by the bias the regularisation introduces in the causal estimate.
The fix, due to Chernozhukov and colleagues (2018), is orthogonalisation: construct the causal estimator so that small biases in the nuisance predictors cancel out at first order. Combined with sample-splitting (cross-fitting), this gives an estimator whose causal-parameter estimate is consistent and asymptotically normal at the parametric rate, even when the nuisance components are estimated by flexible nonparametric ML. The result — double machine learning — is the central technical achievement that makes the rest of this chapter possible. Section 02 develops it in depth.
The operational shift
Beyond the technical fix, causal ML reflects a shift in what practitioners ask for. The classical question — "what is the average treatment effect?" — is rarely the question that drives a decision. The decisions are: which customers should I target with this offer, which patients should I treat with this drug, which users should I show this feature? These are CATE questions, often with a downstream policy attached. The methods of this chapter — causal forests, meta-learners, uplift modelling, policy learning — are organised around producing CATEs and then using them to recommend or evaluate actions. The policy framing makes the operational use explicit, which the ATE framing did not.
The standard ML mantra "trade a little bias for a lot of variance reduction" produces models that predict well but estimate causal effects badly. The mantra of causal ML is closer to "use the same predictors but combine them so the bias cancels." Once you understand why this is necessary, the methods of the chapter — orthogonalisation, cross-fitting, doubly robust scoring — are predictable consequences rather than a list of tricks.
Double Machine Learning, Deeper
Double machine learning (DML) is the most consequential technique in causal ML. It lets you use any flexible ML estimator for the nuisance components of a causal estimator while retaining classical inference guarantees on the causal parameter itself. The technical core is small enough to fit on a page; the implications are large enough to have reshaped applied causal work since 2018.
The partially linear model
The cleanest setting to develop DML is the partially linear model:
Y = θT + g(X) + ε,  E[ε | X, T] = 0
T = m(X) + ν,  E[ν | X] = 0
Here θ is the causal parameter of interest, g(X) captures the outcome's arbitrary (nonparametric) dependence on the covariates, and m(X) is the propensity-style model of treatment given covariates. Both g and m are nuisance functions, to be estimated by flexible ML.
The orthogonalisation trick
A naive estimator of θ would fit g(X) and m(X) by ML, plug them in, and compute a residual-based estimate. The bias from imperfect nuisance estimation contaminates the θ estimate. DML's orthogonalised score is constructed so that small errors in either g or m have only second-order effects on the θ estimate:
θ̂ = ( Σᵢ T̃ᵢ Ỹᵢ ) / ( Σᵢ T̃ᵢ² ),  where T̃ᵢ = Tᵢ − m̂(Xᵢ) and Ỹᵢ = Yᵢ − ℓ̂(Xᵢ) are the treatment and outcome residuals after partialling out X (with ℓ(X) = E[Y | X] estimated by ML alongside m). Regressing the outcome residual on the treatment residual is Robinson-style partialling out; the product structure of the score is what makes first-order nuisance errors cancel.
Cross-fitting
One more piece is needed for DML to work in finite samples. The same data cannot both fit the nuisance functions and produce the residuals that go into θ̂; the resulting overfitting bias would not vanish even with orthogonalisation. Cross-fitting handles this by splitting the data: fit the nuisance functions on one half, compute residuals on the other half, swap, average. This reuses the sample-splitting pattern of k-fold cross-validation so that every residual is computed out-of-fold. With cross-fitting and orthogonalisation, DML's θ̂ is consistent and asymptotically normal under flexible nuisance estimation, with the parametric √n convergence rate.
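To make the recipe concrete, here is a minimal cross-fitted DML sketch for the partially linear model, using scikit-learn gradient boosting as the nuisance learners. The function name and choice of learners are illustrative; a production analysis would typically use a library such as DoubleML or EconML instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_plm(X, T, Y, n_folds=5, seed=0):
    """Cross-fitted DML for the partially linear model Y = theta*T + g(X) + eps."""
    T_res = np.zeros(len(T))
    Y_res = np.zeros(len(Y))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Fit nuisances on the training folds only...
        m_hat = GradientBoostingRegressor().fit(X[train], T[train])  # E[T | X]
        l_hat = GradientBoostingRegressor().fit(X[train], Y[train])  # E[Y | X]
        # ...and residualise on the held-out fold (cross-fitting).
        T_res[test] = T[test] - m_hat.predict(X[test])
        Y_res[test] = Y[test] - l_hat.predict(X[test])
    # Residual-on-residual regression: theta = sum(T~ Y~) / sum(T~^2).
    theta = (T_res @ Y_res) / (T_res @ T_res)
    # Standard error from the orthogonal score psi = T~ * (Y~ - theta * T~).
    psi = T_res * (Y_res - theta * T_res)
    se = np.sqrt(np.mean(psi**2) / np.mean(T_res**2) ** 2 / len(Y))
    return theta, se
```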
Beyond the partially linear model
The orthogonalisation idea generalises far beyond the partially linear setting. The same recipe — find an orthogonalised score function for the causal parameter, estimate nuisance components by ML, cross-fit — works for general ATE estimation under unconfoundedness (the doubly robust score), for IV estimation under flexible first-stage models, for difference-in-differences, for mediation analysis, and for the heterogeneous treatment effect estimators of the next sections. The 2018 Chernozhukov paper provides the general framework; the practical instantiations are scattered across the modern causal-ML literature, but they all share the same logic.
What DML actually buys you
The honest practical answer: DML lets you use random forests, gradient boosting, or neural networks as the nuisance estimators, without having to specify functional forms by hand, while still getting valid confidence intervals on your treatment-effect estimate. The price is a bit more code (cross-fitting, orthogonalised scores) and a bit more thinking (about which score function to use). The benefit is honest causal estimates from messy real-world data with hundreds of covariates. Every modern causal-ML library — EconML, CausalML, DoubleML — implements DML as the default for a wide range of estimands.
Causal Forests in Depth
If DML is the central technique for ATE estimation, causal forests are the central technique for CATE estimation. They adapt the random-forest framework specifically to estimate heterogeneous treatment effects, with a clever splitting criterion and an "honesty" property that delivers asymptotic normality at every query point — meaning calibrated confidence intervals, not just point estimates.
The standard random forest, recapped
A standard regression random forest aggregates predictions from many decision trees, each fit on a bootstrap sample of the data with random feature subsets at each split. Trees split to minimise the mean squared error of the predictions; the forest averages tree predictions. The result is a flexible nonparametric regression estimator with strong empirical performance and a useful neighbour-aggregation interpretation: the prediction at a query point is a weighted average of training outcomes, with weights determined by how often the point falls in the same leaf as each training point.
What changes for causal forests
Causal forests (Wager & Athey 2018) adapt this in two ways. First, the splitting criterion is changed: instead of splitting to minimise outcome MSE, trees split to maximise heterogeneity in treatment effect. The relevant criterion is something like "after this split, do the treatment-effect estimates in the two child nodes differ as much as possible?" This pushes the trees to find feature combinations where treatment effects vary, which is precisely what CATE estimation needs.
Second, the trees are fit with honest splitting: the data used to determine the splits is held out from the data used to estimate the leaf values. Without honesty, the same data both decides where to split and estimates the effects within leaves, which produces overfitting bias that does not vanish asymptotically. Honest splitting halves the data efficiency in finite samples but is the price of valid confidence intervals.
The asymptotic theory
The honest causal forest's central theoretical result is that, under regularity conditions, the CATE estimator at any query point is consistent and asymptotically normal, with a variance that can be estimated from the forest itself; the convergence rate is nonparametric (slower than √n), as the flexibility requires. This is unusual for nonparametric methods — most do not produce calibrated confidence intervals at individual query points. Causal forests do, which is why they are the standard tool for CATE estimation when honest uncertainty quantification matters (clinical decision support, personalised pricing, anything regulated).
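As a concrete sketch of querying those intervals, the snippet below fits a causal forest on synthetic data and asks for pointwise 95% intervals. It assumes EconML's CausalForestDML estimator and its effect / effect_interval methods; the data-generating code is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from econml.dml import CausalForestDML  # assumes EconML is installed

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                         # effect modifiers
W = rng.normal(size=(n, 3))                         # additional controls
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # confounded treatment
tau = 1.0 + X[:, 1]                                 # true heterogeneous effect
Y = tau * T + X[:, 0] + W[:, 0] + rng.normal(size=n)

est = CausalForestDML(
    model_y=GradientBoostingRegressor(),            # nuisance E[Y | X, W]
    model_t=GradientBoostingClassifier(),           # nuisance E[T | X, W]
    discrete_treatment=True,
)
est.fit(Y, T, X=X, W=W)
tau_hat = est.effect(X[:10])                        # CATE point estimates
lo, hi = est.effect_interval(X[:10], alpha=0.05)    # pointwise 95% intervals
```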
Generalised random forests
Causal forests are the most popular instance of a broader family: generalised random forests (GRF, Athey, Tibshirani & Wager 2019). The GRF framework lets you target any moment-condition-defined estimand — quantile regression, IV-CATE estimation, partial effect estimation in nonlinear models — using the same forest machinery. The grf R package and its Python equivalents implement the framework with a unified API; in 2026 it is the dominant library for forest-based causal estimation.
Practical considerations
Causal forests work well in moderate-dimensional settings (up to a few hundred features) with sample sizes from a few thousand upward. They struggle with very high-dimensional features where the splitting criterion becomes uninformative, and they can be slow to fit on large datasets compared with gradient-boosted alternatives. The practical recipe in 2026 is: use causal forests when you need calibrated CATE confidence intervals; use a meta-learner with gradient boosting when speed matters more and confidence intervals can be obtained by other means (bootstrap, conformal prediction).
Meta-Learners Revisited
Meta-learners — the family of CATE estimators that use any standard ML predictor as a building block — were introduced briefly in Chapter 03. This section develops them in production-relevant depth: when each variant is appropriate, what their pathologies are, and how the most modern ones (R-learner, DR-learner) connect to the orthogonalisation theme of Section 02.
S-learner
The S-learner ("single") fits one regression model for E[Y | X, T] using both treated and untreated data, with treatment as just another feature. The CATE estimate is τ̂(x) = f̂(x, 1) − f̂(x, 0). S-learners work well when the treatment effect is small relative to the outcome variation — the signal you want is buried in a model trained primarily on the noise — and when the predictor regularises the treatment effect appropriately. They fail when the treatment effect is large but localised: regularisation can shrink it toward zero in a way that biases CATE estimates.
T-learner
The T-learner ("two") fits separate regression models on the treated and untreated subsets: f̂1 on treated data, f̂0 on untreated data. The CATE is τ̂(x) = f̂1(x) − f̂0(x). T-learners avoid the regularisation issue of S-learners but introduce a different one: when treatment-group sizes are very imbalanced, the two models are estimated with very different precision, and the CATE estimate inherits the noise of the weaker model. They work best when the treatment-group sizes are roughly balanced.
X-learner
The X-learner (Künzel et al. 2019) addresses imbalanced-treatment-group situations explicitly. The recipe: fit T-learner-style models, then use them to impute the missing potential outcomes for each unit, then fit a second-stage model on the imputed individual treatment effects. The two stages are weighted by an estimated propensity score. The X-learner is the dominant meta-learner in production deployments because it handles imbalance gracefully, which most real applications have. The trade-off is two stages of fitting and more sensitivity to propensity-score quality.
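A sketch of that recipe with scikit-learn components; the helper name and learner choices are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def x_learner_cate(X, T, Y, X_query):
    # Stage 1: T-learner-style outcome models.
    f1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
    f0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
    # Impute unit-level effects using the opposite arm's model.
    d1 = Y[T == 1] - f0.predict(X[T == 1])   # treated: observed minus imputed control
    d0 = f1.predict(X[T == 0]) - Y[T == 0]   # control: imputed treated minus observed
    # Stage 2: regress the imputed effects on X within each arm.
    tau1 = GradientBoostingRegressor().fit(X[T == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[T == 0], d0)
    # Blend the two CATE models with an estimated propensity score e(x).
    e = LogisticRegression().fit(X, T).predict_proba(X_query)[:, 1]
    return e * tau0.predict(X_query) + (1 - e) * tau1.predict(X_query)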
R-learner and DR-learner
The R-learner (Nie & Wager 2021) extends DML's orthogonalisation idea to CATE estimation. The recipe: residualise both Y and T against X using ML predictions (as in Section 02), then fit a model for the heterogeneous effect on the residualised data with a particular weighted-loss form. The R-learner inherits DML's orthogonality property at the heterogeneous-effect level, which means the CATE estimate is robust to small errors in the nuisance predictors. The DR-learner ("doubly robust") similarly uses doubly-robust scores in a second-stage CATE regression. Both are now standard in modern causal-ML libraries and are the right choice when you want the orthogonality guarantees of DML extended to heterogeneous effects.
Choosing among them
| Method | Best when | Watch out for |
|---|---|---|
| S-learner | small treatment effect, similar groups | regularisation shrinking effect to zero |
| T-learner | balanced groups, large effects | noisier estimates with imbalance |
| X-learner | imbalanced treatment groups | propensity-score sensitivity |
| R-learner | orthogonality / robustness goals | more complex implementation |
| DR-learner | doubly robust + flexible CATE | same as R-learner |
| Causal forest | need calibrated CIs at query points | moderate-dim only, slower than GBT |
Honest practice: try several, evaluate via methods designed for CATE evaluation (the matching-based or quantile-based heterogeneity scores), and pick the one that performs best on your data. Modern libraries (EconML, CausalML) make running multiple meta-learners and comparing them mechanical.
Uplift Modelling
Uplift modelling is causal ML's most-deployed application: predicting which customers to target with a marketing intervention, which patients to enrol in a treatment programme, which users to nudge with a feature. The framing differs from CATE estimation in vocabulary more than substance, but the operational and practical literature has its own conventions worth knowing.
The four quadrants
Uplift modelling thinks of any treatment as dividing the population into four conceptual groups based on potential outcomes. Sure things would buy whether or not they receive the offer — treating them costs money for no gain. Lost causes won't buy regardless — treating them costs money for no gain. Persuadables would buy only if they receive the offer — these are the targets. Do-not-disturbs would buy without the offer but, perversely, won't buy if they receive it (perhaps because the offer signals desperation, or because of opt-out behaviours triggered by marketing); treating them is actively counterproductive. The goal of uplift modelling is to identify the persuadables and target the marketing budget at them.
The four quadrants are essentially the sign of the unit-level treatment effect plus the sign of the baseline outcome. Persuadables have positive effect (won't buy without, will buy with); do-not-disturbs have negative effect. The CATE is exactly what tells you which quadrant a unit is in.
Direct uplift modelling
The dominant practical approach is the same set of meta-learners (Section 04) under different names. The two-model approach is a T-learner. The single-model approach with treatment-as-a-feature is an S-learner. The X-learner, R-learner, and DR-learner all map directly onto uplift modelling pipelines. The marketing literature has its own terminology and its own evaluation conventions, but the underlying mathematics is the CATE estimation of the previous sections.
One method genuinely particular to the uplift literature is the transformed outcome approach. The trick: when treatment is randomised with probability e (typically 0.5 in A/B test settings), the transformed outcome Y* = Y · T/e − Y · (1−T)/(1−e) has expectation exactly equal to the CATE. So a regression of Y* on X with any standard ML method produces a (noisy but unbiased) CATE estimator without needing to fit nuisance models for E[Y | X, T]. The transformed-outcome method is fast and simple and is often the right baseline against which more sophisticated CATE estimators should be compared.
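A sketch of the transformed-outcome estimator under a known randomisation probability e; the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def transformed_outcome_cate(X, T, Y, X_query, e=0.5):
    """CATE via the transformed outcome, assuming known propensity e."""
    # Y* = Y*T/e - Y*(1-T)/(1-e); under randomisation, E[Y* | X] is the CATE.
    y_star = Y * T / e - Y * (1 - T) / (1 - e)
    return GradientBoostingRegressor().fit(X, y_star).predict(X_query)
```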
Evaluation: Qini and uplift curves
Evaluating an uplift model differs from evaluating a regression model. The relevant question is: if I rank units by predicted uplift and target the top fraction f, how much incremental outcome do I capture relative to random targeting? The Qini curve plots this — incremental outcome on the y-axis, fraction targeted on the x-axis — and the Qini coefficient (the area under the Qini curve, normalised) is the standard scalar evaluation metric. The uplift curve is a related variant. Both metrics depend only on the ranking, not on calibration of the absolute uplift, which matches what marketing-targeting decisions actually need.
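A sketch of one common Qini convention (libraries differ in normalisation details), computing the curve and the unnormalised area relative to random targeting; the function names are illustrative.

```python
import numpy as np

def qini_curve(uplift_pred, T, Y):
    """Qini curve: incremental outcome vs. fraction targeted."""
    order = np.argsort(-uplift_pred)        # target highest predicted uplift first
    T, Y = T[order], Y[order]
    n = len(Y)
    n_t = np.cumsum(T)                      # treated count in top-k
    n_c = np.cumsum(1 - T)                  # control count in top-k
    y_t = np.cumsum(Y * T)                  # treated outcomes in top-k
    y_c = np.cumsum(Y * (1 - T))            # control outcomes in top-k
    # Incremental outcome among top-k, scaling controls to the treated count.
    qini = y_t - y_c * np.divide(n_t, np.maximum(n_c, 1))
    frac = np.arange(1, n + 1) / n
    return frac, qini

def qini_coefficient(uplift_pred, T, Y):
    frac, qini = qini_curve(uplift_pred, T, Y)
    random_line = qini[-1] * frac           # straight line to the endpoint
    return np.trapz(qini - random_line, frac)  # area above random targeting
```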
The marketing reality
Production uplift modelling in 2026 is heavy infrastructure: incremental-test platforms that randomly hold out a fraction of the eligible population to measure true incrementality, feature stores that capture per-customer signals at decision time, model-update pipelines that retrain weekly or daily, and policy layers that combine uplift predictions with budget constraints (knapsack-style optimisation over predicted CATEs). The modelling is one piece; the rest of the stack is typically larger.
Causal Discovery from Data
The DAGs of Chapter 03 were inputs — drawn from substantive expertise, then used for analysis. Causal discovery reverses the direction: learn the structure of the DAG from data. The methods exist, work better than commonly believed, and are the closest thing causal inference has to an unsupervised-learning problem.
The constraint-based family
The classical approach to causal discovery exploits the connection between conditional independence in the data and graph structure. If X ⊥ Y | Z in the joint distribution, certain edge configurations are ruled out. Iteratively testing conditional independence across all variable triples gives a set of constraints; finding all DAGs consistent with those constraints gives the equivalence class. The PC algorithm (Spirtes, Glymour & Scheines, early 1990s) is the canonical instance and remains the most-used constraint-based discovery method. The FCI algorithm extends it to handle latent confounders by allowing partial directed graphs that explicitly mark "we cannot tell which way this edge points."
Constraint-based methods are sound under the conditional-independence assumptions of their tests. The catch is that conditional-independence testing in finite samples is noisy, especially when the conditioning set grows; the reliability of the discovered structure deteriorates with the dimensionality of the variables and the sample size relative to it. Modern variants use better independence tests (kernel-based, neural-network-based) and sequential testing protocols, but the fundamental difficulty of conditional-independence testing remains.
Score-based methods
An alternative approach scores each candidate DAG by how well it fits the data (typically via penalised likelihood, BIC, or related criteria) and searches for the best-scoring structure. The GES algorithm (Greedy Equivalence Search) is the canonical score-based method, climbing through the space of DAG equivalence classes by adding, removing, or reversing edges. Score-based methods avoid explicit independence testing but require search over a combinatorial space; with modern computation, GES is feasible for moderate-dimensional problems and produces good results on data with linear-Gaussian structure.
Continuous-optimisation methods (NOTEARS)
The most consequential 2018+ contribution is NOTEARS (Zheng et al. 2018), which reformulates causal discovery as a continuous optimisation. The trick: encode the DAG as a weighted adjacency matrix W, and penalise W to enforce acyclicity via a smooth function (specifically, tr(exp(W∘W)) − d, which equals zero iff the graph is acyclic). With acyclicity expressible as a smooth constraint, gradient-based optimisers can search the DAG space efficiently. The method scales to hundreds of variables and integrates naturally with neural-network function approximators (NOTEARS-MLP, GraN-DAG). NOTEARS reshaped the discovery field by making it a standard optimisation problem rather than a combinatorial-search one.
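The acyclicity function is easy to sanity-check directly; a sketch using scipy's matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """NOTEARS acyclicity measure h(W) = tr(exp(W∘W)) - d.
    Zero iff the weighted graph encoded by W is acyclic."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

# A DAG gives (numerically) zero; any cycle makes h strictly positive.
W_dag = np.array([[0.0, 1.5], [0.0, 0.0]])   # edge 0 -> 1 only
W_cyc = np.array([[0.0, 1.5], [0.7, 0.0]])   # cycle 0 -> 1 -> 0
print(notears_acyclicity(W_dag))  # ~0.0
print(notears_acyclicity(W_cyc))  # > 0
```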
The LiNGAM family
A different direction exploits non-Gaussianity. The LiNGAM approach (Linear Non-Gaussian Acyclic Model, Shimizu et al. 2006) shows that if the data-generating process is linear with non-Gaussian noise, the causal direction between two variables is identifiable from observational data alone (whereas with Gaussian noise it is not). LiNGAM uses independent component analysis to recover the structural model; later variants (DirectLiNGAM, RESIT) generalise the approach. LiNGAM-style methods are useful when the linearity-with-non-Gaussian-noise assumption is plausible, which is more often than the Gaussian default would suggest.
What discovery actually delivers
Honest assessment: causal discovery in 2026 is a useful tool but rarely a complete answer. The discovered structure is typically an equivalence class of DAGs rather than a single graph, and substantive expert judgement is needed to pick among the candidates. Discovery is most useful when (a) the analyst has substantive priors about parts of the graph but is uncertain about others, in which case discovery can confirm or refute the uncertain parts, or (b) the analyst is doing exploratory work to generate hypotheses that later RCTs or quasi-experiments will test. Discovery as a fully automated alternative to DAG drawing remains an aspiration rather than a reality.
Counterfactual Prediction and Policy Learning
A CATE is one number per individual; a policy is the rule that turns that number into an action. The discipline of policy learning bridges causal estimation with operational decision-making, and connects directly to the contextual-bandit literature where the same problem is studied under a different name.
From CATE to policy
The simplest treatment policy is "treat if CATE > 0." But operational policies usually have additional structure: a budget constraint (treat at most K out of N units), a cost asymmetry (treating costs $C, the outcome is worth $V per unit), an eligibility filter (some units cannot be treated regardless). Each constraint converts the CATE into a policy through a different optimisation. With a budget constraint and uniform cost, the optimal policy treats the K units with the highest predicted CATEs. With a cost-benefit ratio, treat anyone whose predicted CATE exceeds C/V. With more complex constraints, the policy is the solution to a more elaborate optimisation.
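The first two conversions are one-liners; a minimal sketch with illustrative names.

```python
import numpy as np

def budgeted_policy(cate_hat, budget_k):
    """Treat the K units with the highest predicted CATE."""
    treat = np.zeros(len(cate_hat), dtype=bool)
    treat[np.argsort(-cate_hat)[:budget_k]] = True
    return treat

def cost_benefit_policy(cate_hat, cost, value):
    """Treat whenever predicted incremental value exceeds the cost (CATE > C/V)."""
    return cate_hat > cost / value
```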
Policy trees and direct policy learning
An alternative skips CATE estimation entirely and directly learns the optimal policy from data. Policy trees (Athey & Wager 2021) fit a decision tree whose leaves correspond to treat/don't-treat decisions, with the tree splits chosen to maximise expected outcome under the resulting policy. The advantage is a policy that is interpretable and immediately deployable; the disadvantage is that the optimisation is harder (a fundamentally combinatorial problem). The policytree R package and its successors implement this for binary and multi-armed treatment decisions.
Off-policy evaluation
Once you have a candidate policy, you need to know how good it is — without deploying it (or before deploying it) and without an RCT specifically for the policy. Off-policy evaluation (OPE) is the discipline of estimating a candidate policy's expected outcome from data collected under a different (existing) policy. The standard estimators — inverse propensity weighting, doubly robust OPE, self-normalised IPS — match the structure of the causal estimators of Chapter 03 but with the candidate policy in place of the treatment indicator.
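A sketch of the three estimators for logged bandit-style data; the argument names are illustrative, and the doubly robust variant assumes an outcome model q̂ is available.

```python
import numpy as np

def ope_estimates(pi_a, mu, r, q_hat=None, q_pi=None):
    """Estimate a candidate policy's value from logged data.
    pi_a : candidate policy's probability of each logged action
    mu   : logging policy's propensity for each logged action
    r    : logged rewards
    q_hat: outcome-model prediction at the logged action (optional)
    q_pi : outcome-model prediction under the candidate policy (optional)"""
    w = pi_a / mu                               # importance weights
    ipw = np.mean(w * r)                        # inverse propensity weighting
    snips = np.sum(w * r) / np.sum(w)           # self-normalised IPS
    if q_hat is None or q_pi is None:
        return ipw, snips, None
    dr = np.mean(q_pi + w * (r - q_hat))        # doubly robust OPE
    return ipw, snips, dr
```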
OPE is the central methodology in contextual bandit literature, where it lets a system evaluate hypothetical alternatives to its current allocation rule before changing the live policy. Production deployments at every major recommendation, ad-targeting, and automated-decision platform use OPE continuously; the development of robust OPE estimators has been one of the most consequential lines of work in causal ML over the past decade.
The contextual bandit framing
A contextual bandit is exactly the setup of this section: at each time step, observe a context, choose an action, observe a reward. The data generated under one policy is used to evaluate or improve another. The technical machinery — propensity-score-based importance weighting, doubly robust estimators, conservative bounds — is identical to causal CATE estimation under a different name. In 2026 the best place to get hands-on with these methods is often the bandit literature, even if the application is not naturally bandit-shaped.
Production Stacks: EconML, CausalML, DoWhy
The methods of this chapter exist in a small set of mature open-source libraries. Each has slightly different priorities, and the choice among them shapes what is easy and what is hard in a given project. This section maps the landscape as it stands in 2026.
EconML
EconML (Microsoft Research) is the most comprehensive of the libraries. It implements DML, the meta-learners (S/T/X/R/DR), causal forests via integration with grf, orthogonal random forests, and several specialised estimators (causal LASSO, deep IV, instrumented learners). The API is opinionated — every estimator follows the same fit / effect / effect_inference pattern — which makes mixing methods within a single project straightforward. EconML is the dominant choice in 2026 for academic or research-flavoured work on heterogeneous treatment effects, and is widely used in production at Microsoft and partner companies.
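The shared pattern, sketched with LinearDML on synthetic data; this assumes EconML's documented fit / effect / effect_inference API, and the data-generating code is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from econml.dml import LinearDML  # assumes EconML is installed

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = (0.5 + X[:, 1]) * T + X[:, 0] + rng.normal(size=n)

est = LinearDML(model_y=GradientBoostingRegressor(),
                model_t=GradientBoostingClassifier(),
                discrete_treatment=True)
est.fit(Y, T, X=X)                  # same pattern across EconML estimators
tau = est.effect(X[:5])             # CATE point estimates
inf = est.effect_inference(X[:5])   # inference object: intervals, p-values
print(inf.summary_frame())
```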
CausalML
CausalML (Uber) is the uplift-modelling-flavoured library. Its strengths are the meta-learners with fast tree-based implementations (LightGBM and XGBoost as nuisance estimators), the uplift trees and uplift random forests, and the evaluation metrics geared toward marketing-targeting use cases (Qini coefficient, uplift curves, gain charts). CausalML is the right choice for marketing, recommendation, and product-engagement applications where the operational frame is uplift rather than CATE per se.
DoWhy
DoWhy (Microsoft Research) takes a different angle: rather than focusing on estimation, it emphasises the four-step causal-inference workflow — model the problem (build a DAG), identify the effect (apply do-calculus), estimate the effect, refute the estimate (sensitivity analysis, placebo tests). The library integrates with EconML and CausalML for the estimation step but provides explicit support for the modelling and refutation steps that the others mostly leave implicit. DoWhy is the right choice when the analyst wants the full causal-inference workflow with explicit modelling assumptions and built-in robustness checks.
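A sketch of the four steps on synthetic data, assuming DoWhy's CausalModel API and its built-in estimation and refutation method names.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel  # assumes DoWhy is installed

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                          # confounder
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 2.0 * t + x + rng.normal(size=n)
df = pd.DataFrame({"x": x, "t": t, "y": y})

model = CausalModel(data=df, treatment="t", outcome="y",
                    common_causes=["x"])                    # 1. model
estimand = model.identify_effect()                          # 2. identify
estimate = model.estimate_effect(
    estimand, method_name="backdoor.linear_regression")     # 3. estimate
refutation = model.refute_estimate(
    estimand, estimate,
    method_name="placebo_treatment_refuter")                # 4. refute
print(estimate.value)
print(refutation)
```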
The choice in 2026
The pragmatic answer: most production work in 2026 uses CausalML for marketing/uplift applications, EconML for heterogeneous-treatment-effect research, and DoWhy as a wrapper around either when the workflow rigour matters more than estimator novelty. The libraries cooperate well enough that mixing is feasible. The DoubleML library (R and Python, by the original DML authors) is also worth knowing for its strict theoretical fidelity to the DML framework.
What's missing from all of them
The honest gap as of 2026 is in causal discovery. The discovery libraries (causal-learn, gCastle, NOTEARS implementations) are research-quality rather than production-quality, with rougher APIs, weaker documentation, and less institutional adoption. A causal discovery analysis is feasible but more bespoke than a CATE estimation. The expectation is that the discovery side of the toolchain will catch up over the next few years; in 2026 that catch-up is still ahead, and discovery work means writing more of the glue yourself.
Causal Representation Learning
A frontier direction connects causal inference to deep learning's representation-learning agenda: can we learn neural-network representations of the world that are themselves causally structured — disentangled by underlying causes, robust to interventions, useful for counterfactual reasoning? The field has produced provocative results and provocative limitations.
Disentangled representations
The original motivation for causal representation learning came from the disentanglement literature in deep learning: train a generative model so that each latent dimension corresponds to one underlying generative factor of variation. If the latents are aligned with causal factors, downstream interventions on the latents would correspond to real-world interventions, and the representation would be useful for counterfactual reasoning. β-VAEs, FactorVAEs, and their successors aimed at this goal through the late 2010s.
The catch (Locatello et al. 2019) is a now-famous impossibility result: without inductive biases or supervision, disentanglement is fundamentally underdetermined. Many distinct latent decompositions reproduce the same data distribution, and there is no purely-unsupervised way to choose among them. Practical disentanglement requires either weak supervision (pairs of examples that share or differ in particular factors) or strong inductive biases. The negative result reframed the agenda from "learn disentangled latents from data" to "learn latents that respect known causal structure" — a more modest but more attainable goal.
Independent causal mechanisms
The independent causal mechanisms (ICM) principle (Schölkopf et al.) provides one inductive bias. The principle: in a true causal decomposition, the conditional distributions P(Xi | parents(Xi)) are autonomous — changing one does not affect the others. A representation respects ICM if it factorises the joint distribution into pieces that can be modified independently. This gives a target for representation learning: learn latent factors whose conditional distributions are mutually invariant under interventions on each other. Methods that operationalise this — invariant risk minimisation, multi-environment learning, distributional robustness with intervention support — are an active 2024–2026 research direction.
Causal abstraction
A separate line of work — causal abstraction (Beckers & Halpern 2019, Geiger et al.) — formalises the idea that a high-level causal model (variables: "treatment," "outcome") can be a faithful abstraction of a low-level mechanism (the underlying neural-network or biological circuitry). The technical machinery defines what it means for two causal models at different levels to agree on interventions; recent work uses it to interpret what neural networks are doing causally, to build interpretability tools, and to bridge between symbolic and neural representations of causal knowledge. The field is young but is producing theoretically interesting results.
The current bridge
As of 2026, causal representation learning is an active research area with limited production deployment. The honest picture: the methods work in carefully controlled settings (synthetic data, simulators, narrow scientific domains) and are still maturing for operational use on messy real-world data. The next several years will determine whether the agenda produces practically transformative tools or remains a theoretically interesting niche. The bet from the major industrial labs (DeepMind, FAIR, Microsoft) is that the agenda is consequential enough to fund through to production.
Frontier: LLMs as Causal Tools and Causal RL
Three frontier directions are reshaping causal ML in 2024–2026: large language models as collaborators in causal analysis, causal methods integrated into reinforcement learning, and the broader question of how foundation models change what causal practice looks like.
LLMs as causal-analysis collaborators
The straightforward use of LLMs in causal ML is workflow assistance: drafting candidate DAGs from textual descriptions of a problem, suggesting confounders the analyst might have missed, generating documentation for a causal analysis, writing up findings. Modern LLMs (Claude, GPT-5, Gemini 2) are competent at all of these tasks when prompted carefully and verified by humans. The 2024–2025 literature has cautiously documented the wins (LLMs identifying domain-relevant confounders that an analyst missed) and the failures (LLMs confusing correlation and causation, getting d-separation wrong on novel graphs, hallucinating instrumental variables that are not actually exogenous).
The deeper question — whether LLMs can do the formal reasoning steps of causal inference reliably — has a more pessimistic answer in 2026. Benchmarks like CLadder (causal language understanding), CausalBench (causal-reasoning evaluation), and Corr2Cause consistently show that LLMs underperform on formal causal queries: they correctly compute observational quantities but fail at interventional ones, and the failure rate does not improve smoothly with model scale. Whether reliable causal reasoning requires architectural changes or just better training is one of the open empirical questions of the field.
Causal reinforcement learning
RL is naturally causal: the agent intervenes on the world (chooses actions) and observes outcomes. But standard RL methods learn policies under the data-generating distribution they were trained on and fail to generalise when the environment's causal structure changes. Causal reinforcement learning integrates structural assumptions about the environment into RL, with the goal of policies that transfer across environments with shared underlying causal structure but different surface statistics. Bareinboim's transport-formula work, the multi-environment-RL literature, and offline RL methods that explicitly handle confounding are all part of this effort. The 2024–2026 picture is that causal RL is progressing slowly but steadily, with the most concrete progress in offline-RL settings where confounding is easy to identify.
Foundation models and the causal toolkit
The integration of causal methods with the broader foundation-model trend is the most uncertain frontier. Two directions are visible. One: use foundation models as flexible nuisance estimators for DML and related methods, exploiting their representational depth on text and image data. Two: use causal methods to evaluate and improve foundation models, asking causal questions about training-data interventions ("what would happen if I removed this data subset?"), prompt interventions ("does this prompt cause this behaviour?"), and policy interventions ("does this safety training reduce this harm?"). Both directions are active; both are likely to produce tools that practitioners use within a few years.
What this chapter does not cover
The graph-neural-network literature, which provides one route to learning over relational and structural data, is the subject of Chapter 05 (Graph Neural Networks) — including the recent work that uses GNNs for causal discovery and structure learning. Survival analysis, which models time-to-event causal questions ("does this treatment delay event time?"), is the subject of Chapter 06. The Bayesian-deep-learning methods that produce calibrated uncertainty estimates, useful when causal estimates need to be combined with prior beliefs, are the subject of Chapter 07.
Causal machine learning is the synthesis at which causal inference's identification rigour meets modern ML's flexibility. The classical methods of Chapter 03 remain foundational; the techniques of this chapter make them work at the data scales decisions actually involve. A practitioner who can move fluently among the libraries of Section 08, who understands when each meta-learner is appropriate, who can interpret a Qini curve and a CATE confidence interval, and who knows when to reach for policy learning — that practitioner is doing the work that operational causal data science actually requires.
Further Reading
- *Double/Debiased Machine Learning for Treatment and Causal Parameters*. The DML paper. The single most consequential paper in causal ML — establishes the theoretical foundation for using flexible ML estimators inside causal estimators while preserving valid inference. The proofs are dense but the key ideas (Neyman orthogonality, cross-fitting) are accessible from the introduction. The paper that defines the field.
- *Estimation and Inference of Heterogeneous Treatment Effects using Random Forests*. The causal-forest paper. Establishes honest splitting and the asymptotic-normality results that make calibrated CATE inference possible. The accompanying grf package is the reference implementation for forest-based heterogeneous-treatment-effect estimation in 2026. The reference for causal-forest theory and practice.
- *Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning*. The X-learner paper. Develops the X-learner explicitly and provides the empirical comparisons with S-learner and T-learner that establish when each is appropriate. The paper is unusually clear on practical advice and is the right reading after this chapter for anyone deploying CATE estimators. The reference for choosing among meta-learners.
- *Quasi-Oracle Estimation of Heterogeneous Treatment Effects* (R-learner). The R-learner paper. Extends the orthogonalisation idea from DML's ATE setting to the heterogeneous-effect setting and provides theoretical guarantees on the resulting CATE estimator. The natural reading after the DML paper for practitioners working on heterogeneous effects. The reference for orthogonal CATE estimation.
- *DAGs with NO TEARS: Continuous Optimization for Structure Learning*. The NOTEARS paper. Reformulates causal discovery as a continuous optimisation problem and triggered the wave of differentiable causal-discovery methods that followed. Reading it grounds the modern discovery literature even though most production discovery work still uses constraint-based or score-based alternatives. The branch point for modern continuous-optimisation discovery.
- *EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation*. EconML is the most comprehensive causal-ML library in 2026, with implementations of DML, all meta-learners, orthogonal random forests, and several IV-style estimators under a unified API. The documentation includes tutorials and example notebooks that double as a hands-on introduction to the field. The library you will actually use.
- *Causal Inference and Uplift Modelling: A Review of the Literature*. A focused survey of uplift-modelling methods organised around the marketing-application frame. Covers the four-quadrant taxonomy, the transformed-outcome approach, uplift trees, and the standard evaluation metrics. The right entry to the production-marketing-flavoured side of causal ML. The clearest survey of uplift modelling specifically.
- *Toward Causal Representation Learning*. The position paper that articulates the causal-representation-learning research agenda, including the independent-causal-mechanisms principle and the case for incorporating causal structure into deep learning. The paper is more programmatic than technical, but it is the right framing document for understanding what the frontier work in Section 09 is trying to achieve. The agenda-setting paper for causal representation learning.