A/B Testing & Causal Experimentation in Production: where statistics meets the real world.

A model that scores well in offline evaluation may underperform in production; a model that scores poorly offline may nonetheless move the business metric. The only way to know is controlled experimentation: randomly split users between treatment and control, measure the outcome, and infer the causal effect of the treatment on actual user behaviour. A/B testing is the dominant pattern, used by every major tech company to ship product changes, including ML model rollouts. Randomisation is the statistical machinery that turns observational data into causal inference — without it, every measurement is confounded. CUPED (Controlled-experiment Using Pre-Experiment Data) and related variance-reduction techniques cut required sample sizes substantially, letting teams run more experiments per unit of traffic. Multi-armed bandits dynamically allocate traffic to better-performing variants, trading statistical purity for faster convergence. This chapter develops the methodology with the depth a working ML practitioner needs: the experimental design, the statistical machinery, the platform tooling, and the operational discipline that distinguishes rigorous experimentation from data-driven decoration.

Prerequisites & orientation

This chapter assumes the statistics material of Part I (probability, hypothesis testing, regression) and the causal-inference material of Part XIII Ch 03 (DAGs, do-calculus, potential outcomes). Familiarity with at least one experimentation platform (Optimizely, GrowthBook, LaunchDarkly Experimentation, internal tools at major tech companies) helps but is not required. The chapter is written for ML engineers, data scientists, product engineers, and analysts running experiments on production traffic. The methodology generalises beyond ML — it's the standard for any product change — but the chapter emphasises ML-specific applications: deploying new models, tuning ranking algorithms, evaluating recommendation systems.

Three threads run through the chapter. The first is the causality imperative: A/B testing is fundamentally about establishing causation, not just measuring correlation. Without proper randomisation, no amount of measurement gives you a causal answer. The second is the variance-reduction tension: experiments cost traffic, traffic is finite, and variance-reduction techniques (CUPED, stratification, regression adjustment) substantially cut the cost of significance. The third is the velocity-vs-rigour trade-off: A/B tests take weeks; bandits converge in days; both have their place, and choosing the wrong tool costs either rigour or velocity. The chapter develops each in turn.

01

Why Experimentation Matters

A new model improves accuracy by 2% on the held-out test set. Does it improve user experience? Does it improve business metrics? Will users like it more? Will revenue go up? Offline evaluation cannot answer these questions because the test set, no matter how carefully constructed, is not the same as actual users responding to predictions in real time. The only reliable answer comes from running the new model against real users in a controlled experiment and measuring the actual difference. This is the core insight of the experimentation movement that has reshaped product development since the 2000s.

The offline-online gap

Several phenomena create the gap between offline metrics and real-world impact. Selection effects: the test set was constructed from past traffic that the old model influenced; the new model will influence future traffic differently. User-behaviour adaptation: users learn to interact with the system; new behaviours emerge that weren't in the test set. Compounding effects: a small change in one model affects downstream models, which affect downstream behaviour, in non-linear ways. Novelty and primacy effects: users react to changes (positively or negatively) for reasons that decay over time. None of these are captured by held-out evaluation; all of them are captured by running the system against real users.

[Figure: A/B Testing methodology stack. DESIGN (§2–4): randomisation, power analysis, OEC + guardrails, pre-registration, causal foundations. RUN (§5, 7, 8): CUPED variance reduction, multi-armed bandits, interference handling, switchback / cluster designs, execution & allocation. DECIDE (§6): analysis & p-values, confidence intervals, multiple comparisons, decision rules, ship or kill. PLATFORMS (§9) and FRONTIER (§10), application layer: Optimizely, Statsig, GrowthBook, LaunchDarkly, in-house tools.]

The famous-case examples

The reasons to run experiments rather than trust intuition are well-documented. Microsoft Bing's 2012 ad-headline experiment (Kohavi et al.): a simple change to how ad headlines were displayed, which the team had deprioritised because intuition said it couldn't matter much, produced a revenue gain estimated at over $100M annually once it was finally A/B tested. Booking.com reportedly runs over 1,000 concurrent experiments and finds that roughly 80% of intuited improvements either don't move the metric or move it the wrong way. Netflix's recommendation experiments have shown that small A/B tests routinely flip the sign of teams' priors — the "obviously better" version turns out to be worse. The empirical pattern: human intuition about product changes is unreliable; experimentation is the only way to make consistently good decisions.

The decision-quality argument

The deeper argument for experimentation is decision quality. Without experiments, decisions about which model to ship are based on offline metrics that may not reflect user impact, plus the political and persuasive talents of whoever advocates loudest. With experiments, the decision is grounded in measurable user response. Teams that run experiments seriously develop a healthier relationship with intuition — they keep using it to generate hypotheses, but they don't trust it for outcomes. The cultural shift this represents is substantial; mature ML organisations have decision-making cultures organised around experimental evidence.

The five concrete capabilities

An experimentation platform provides five distinct capabilities. Random assignment: users are assigned to treatment or control deterministically and without systematic bias. Metric pipelines: the metrics of interest are computed correctly and consistently across experiments. Statistical analysis: appropriate statistical tests are applied, with corrections for multiple comparisons, sequential testing, and so on. Decision tooling: dashboards, alerts, and reports that help teams interpret and act on results. Governance: review of experiment designs, audit trails for results, archives of past experiments. The chapter develops each of these.

The downstream view

Operationally, an experimentation system sits between the deployment layer (Ch 03) and the product-decision layer. Upstream: model deployments configured with treatment-vs-control variants, request streams that trigger random assignment, instrumentation that captures metrics. Inside: random-assignment service, exposure logging, metric computation, statistical analysis, dashboards. Downstream: ship/kill decisions, post-experiment analysis, knowledge bases of past learnings. The remainder of this chapter develops each piece: §2 randomisation, §3 design and power, §4 metrics, §5 CUPED, §6 analysis, §7 bandits, §8 interference, §9 platforms, §10 the frontier.

02

Randomisation: The Statistical Foundation

Randomisation is what turns an observational comparison ("users who saw the new model converted at 8%") into a causal claim ("the new model causes 8% conversion vs 6% for the old model"). Without proper randomisation, the comparison is confounded by selection effects: which users saw which model, and why. This section unpacks how to randomise correctly, the assumptions that randomisation depends on, and the most-common ways teams break it.

The unit of randomisation

The first design decision: what gets randomly assigned? User-level randomisation assigns each user to treatment or control consistently across all their requests; this is the standard for most product experiments. Session-level randomisation randomises per session, which is appropriate when treatment effects are session-scoped. Request-level randomisation randomises per request — only appropriate when the treatment can be safely changed mid-session and there's no learning effect. Cluster randomisation assigns groups of users (a region, a marketplace, a content category) together, addressing interference (Section 8). The wrong unit produces biased estimates; the right unit depends on the experiment's hypotheses.

The SUTVA assumption

Randomisation produces unbiased estimates under the Stable Unit Treatment Value Assumption (SUTVA): each unit's outcome depends only on its own treatment assignment, not on other units' assignments. SUTVA is violated when there's interference — when one unit's treatment affects another unit's outcome. Marketplace experiments (where treatment users compete with control users for limited supply), social-network experiments (where treatment users influence control users via sharing), and any system with limited capacity have potential SUTVA violations. Section 8 develops the methodology for handling these.

Sticky assignment and bucket consistency

For user-level randomisation to work over time, each user must be deterministically assigned to the same bucket across requests. The standard implementation is hashing: take the user ID, append the experiment ID, hash the combination, take the hash mod 100, assign to a bucket based on the result. This produces deterministic assignment without storing per-user-per-experiment lookup tables. The discipline is that the hashing is deterministic, the salt (experiment ID) prevents correlation between experiments, and the buckets are well-balanced.
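
A minimal sketch of the hashing scheme, assuming Python and SHA-256; the function name and the 0–99 bucket range are illustrative rather than any particular platform's API:

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str, treatment_pct: int = 50) -> str:
    """Deterministic, sticky assignment: same user + experiment => same bucket.

    The experiment ID acts as the salt, so assignments in one experiment
    are uncorrelated with assignments in any other experiment.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# No lookup table needed: re-hashing reproduces the assignment on every request.
assert assign_bucket("user-42", "exp-ranker-v2") == assign_bucket("user-42", "exp-ranker-v2")
```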

Stratification

Pure random assignment can produce imbalanced groups by chance: more high-value users in treatment than control, more weekend traffic in one variant. Stratified randomisation rebalances this: divide users into strata (paid vs free, weekday vs weekend, geography) and randomise within each stratum. This tightens the estimator's standard errors and reduces the chance of conclusion errors driven by random imbalance. Modern experimentation platforms support stratification natively; the discipline is to identify the strata in advance, based on which dimensions are most likely to affect outcomes.

Common failure modes

Several anti-patterns recur. Sample-ratio mismatch (SRM): the actual fraction of users assigned to treatment doesn't match the design (e.g., 50/50 design produces 47/53). SRM almost always indicates a bug — uneven random assignment, bot traffic mis-classified, exposure logging issues. The fix is to monitor for SRM as a guardrail (see Section 4). Crossover: users move between buckets mid-experiment, which destroys the analysis. Telemetry gaps: the treatment variant breaks logging in some path, so treatment users appear under-counted relative to control. Each failure produces apparent treatment effects that are really artefacts of bad randomisation.
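
A sample-ratio mismatch check is a chi-square goodness-of-fit test on observed arm counts against the designed split; a sketch, assuming scipy and the conventional very strict alert threshold:

```python
from scipy.stats import chisquare

def srm_check(n_treatment: int, n_control: int,
              expected_treatment_share: float = 0.5, alert_p: float = 0.001):
    """Flag sample-ratio mismatch: observed counts vs the designed split."""
    total = n_treatment + n_control
    expected = [total * expected_treatment_share, total * (1 - expected_treatment_share)]
    _stat, p_value = chisquare([n_treatment, n_control], f_exp=expected)
    return p_value < alert_p, p_value   # True => investigate before trusting any result

# A 50/50 design that came out roughly 49/51 on ~100K users is flagged decisively.
flagged, p = srm_check(50_912, 53_088)
```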

Validating randomisation: A/A tests

The discipline of validating randomisation is to compute A/A tests: assign users to two control variants (no actual treatment difference), measure metric differences, expect to see no significant effect. If you see a significant effect in an A/A test, your randomisation has a bug. Mature experimentation platforms run continuous A/A tests as a sanity check; the moment one fires, the platform team investigates before any experiments are trusted.

03

Experimental Design and Power Analysis

Before running an experiment, you need to know how many users you need to observe a real effect, how long the experiment must run, and what statistical thresholds will mean "ship" or "kill." Power analysis answers these questions; getting it right prevents under-powered experiments (which produce noise) and over-powered experiments (which waste traffic).

The four parameters

Power analysis for the standard two-sample comparison takes four inputs and produces the fifth. The inputs are: significance level α (typically 0.05, the false-positive rate), power 1−β (typically 0.80, the true-positive rate), minimum detectable effect (MDE; the smallest effect size you care about), and the baseline metric standard deviation σ. The output is the required sample size per arm. For these default α and power values, a convenient rule of thumb is n ≈ 16·σ²/MDE² per arm; for a binary metric (conversion rate) σ² ≈ p(1−p) at the baseline rate p, while for a continuous metric σ is the metric's actual standard deviation. Modern experimentation platforms provide power calculators; the discipline is to do power analysis before launching, not after.
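
A sketch of the sample-size calculation for a binary OEC using the normal approximation; the function name and example numbers are illustrative:

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users required per arm for a two-proportion test (normal approximation).

    mde_abs is the absolute lift you want to detect, e.g. 0.005 for +0.5pp.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = variance * (z_alpha + z_beta) ** 2 / mde_abs ** 2
    return math.ceil(n)

# Detecting a +0.5pp lift on a 5% baseline conversion rate:
print(sample_size_per_arm(0.05, 0.005))   # roughly 31K users per arm
```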

The MDE choice

The MDE is the most-debated input. It encodes "what's the smallest effect that would matter to the business?" Set it too small and experiments take forever (you're powered to detect effects smaller than you'd act on); set it too large and you miss real effects. Common rules: 1% for high-traffic core metrics, 5% for niche metrics, 10–20% for tail metrics with low baseline. The right choice is product-specific; mature teams have explicit MDE policies tied to expected business value of changes.

Sample size and runtime

Required sample size translates to runtime via traffic. A site with 1M daily visitors, enrolling all traffic in a 50/50 split, accrues 500K users per arm per day, so an experiment requiring 100K users per arm hits its sample size in well under a day; a site with 100K daily visitors needs a couple of days, and longer still if only a fraction of traffic is enrolled. The runtime is bounded below by required sample size and bounded above by patience and weekly seasonality (you need at least a full week to capture day-of-week effects). The standard rule: run experiments for at least one full week, even if power is hit sooner; run for two full weeks if there are day-of-week interactions of interest.

Sequential testing and peeking

A subtle but common bug: peeking — checking experiment results as they accrue and stopping when significance is reached. This inflates false-positive rates dramatically: an experiment "powered" at α=0.05 with peeking can have an actual false-positive rate of 30% or more. The proper response is sequential testing: a statistical methodology that adjusts for the peeking. The classical Wald sequential probability ratio test (SPRT), the more-modern always-valid p-values (Howard et al. 2021), and group sequential designs all handle this. Modern experimentation platforms (Statsig, Eppo) implement always-valid p-values natively; legacy platforms often don't, and teams using legacy platforms must enforce no-peeking discipline.
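
The inflation from peeking is easy to demonstrate by simulation: run repeated A/A comparisons, check a t-test at many interim looks, and stop at the first p < 0.05. A sketch, assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rate(n_sims: int = 1000, n_per_arm: int = 10_000,
                                n_peeks: int = 20, seed: int = 0) -> float:
    """Simulate A/A tests analysed with repeated peeking at a fixed alpha = 0.05."""
    rng = np.random.default_rng(seed)
    checkpoints = np.linspace(n_per_arm // n_peeks, n_per_arm, n_peeks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_per_arm)   # no true difference between arms
        b = rng.normal(size=n_per_arm)
        for n in checkpoints:
            _, p = stats.ttest_ind(a[:n], b[:n])
            if p < 0.05:
                false_positives += 1     # stopped early on a spurious "win"
                break
    return false_positives / n_sims

# The realised false-positive rate comes out well above the nominal 0.05.
```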

Pre-registration

The discipline of pre-registration — committing to the experiment design before observing data — prevents post-hoc rationalisation: "we predicted this exact subgroup-effect all along." Pre-registration includes: the hypothesis, the OEC and guardrail metrics, the MDE, the analysis plan, the stop conditions. Modern experimentation platforms support pre-registration natively; the discipline is to fill it in honestly before launching, not retrofit after seeing results.

The cost-of-experimentation calculation

Every experiment has a cost: the lost utility from running a worse variant on half the traffic. For most experiments this is small (a few hours of slightly-worse experience for half the users). For large experiments on dominant metrics, it can be substantial. Mature teams explicitly weigh the cost: an experiment that's expected to run for a month against a very-high-traffic surface should be designed carefully because the cost is real. Bandits (Section 7) reduce this cost by allocating less traffic to the loser variant once evidence accumulates.

04

Metrics: OEC, Guardrails, and Pre-Registration

An experiment measures many metrics, but they're not all equal. The overall evaluation criterion (OEC) is the single metric the experiment is designed to move; guardrail metrics are the things you don't want to break; supporting metrics are diagnostic. The discipline of metric design — choosing the OEC, identifying guardrails, deciding what counts as success — is at the heart of running experiments well.

The OEC and Goodhart's law

The OEC should be a metric that captures user value or business value. Common OECs: revenue per user, conversion rate, monthly active users, session length. The challenge is Goodhart's law: when a metric becomes a target, it ceases to be a good metric. Optimising for click-through inflates clickbait; optimising for session length inflates dark patterns; optimising for short-term revenue can erode long-term retention. Mature experimentation cultures pick OECs that are genuinely-aligned with long-term value, with explicit guardrails on the things the OEC could degrade.

Guardrail metrics

Guardrail metrics are the things that must not regress. Standard guardrails include: latency (the change shouldn't break performance), error rate (the change shouldn't increase failures), user retention (short-term wins shouldn't sacrifice long-term users), SRM (sample-ratio mismatch — Section 2), key business metrics not directly targeted but important. An experiment that wins the OEC but breaks a guardrail is not a winning experiment. Mature platforms automatically check guardrails; the discipline is that guardrail violations block ship, not just produce warnings.

Counterbalancing and multi-metric thinking

Real product changes affect many metrics simultaneously. A new recommendation algorithm might increase short-term engagement (good) but reduce content diversity (concerning) and increase server cost (significant). The decision is rarely "did it move the OEC?" but "given the full multi-metric impact, should we ship?" Mature teams have explicit decision frameworks for the multi-metric case: minimum-required improvements on the OEC, maximum-allowed regressions on guardrails, qualitative review for second-order effects.

Short-term vs long-term metrics

Most experiments measure short-term metrics (within the experiment's runtime). But many product decisions have long-term effects that don't show in short-term metrics: novelty effects fade, learning effects compound, retention compounds over months. Surrogate metrics attempt to capture long-term effects in measurable short-term form; they're imperfect proxies. Holdout populations — small permanent control groups never exposed to changes — let teams measure cumulative long-term impact over months. The 2024–2026 work on long-term experimentation (the Booking.com / Microsoft / Meta papers on long-term causal estimation) has matured this substantially.

Heterogeneous treatment effects

The aggregate effect across all users might be neutral while there are large positive effects on one subgroup and large negative effects on another. Heterogeneous treatment effects (HTE) analysis decomposes the aggregate by user characteristics. Standard HTE methods include subgroup analysis (with multiple-comparisons correction), causal forests (Wager & Athey 2018), meta-learners, and uplift modelling (cross-referenced from Part XIII Ch 04). HTE analysis is increasingly important for personalisation experiments; the methodology is mature but requires care to avoid spurious findings.

Metric pipelines and trustworthiness

The metric pipeline is what computes OECs, guardrails, and supporting metrics from raw event logs. Bugs in this pipeline silently contaminate every experiment running on the platform; trustworthy metrics are essential. The discipline includes: rigorous unit tests for metric definitions, validation against independent computation paths, monitoring for metric anomalies (a sudden 10× change in baseline is almost certainly a pipeline bug, not a user-behaviour change), and explicit metric ownership (every metric has an owner who responds to questions). Mature experimentation cultures invest substantially in the metric pipeline because every other piece depends on it.

05

CUPED and Variance Reduction

The cost of an experiment is dominated by sample size, which is dominated by metric variance. Cutting variance cuts required sample size proportionally — a 50% variance reduction halves the sample size needed for the same statistical power. CUPED (Controlled-experiment Using Pre-Experiment Data, Deng et al. 2013) is the dominant variance-reduction technique; combined with stratification and other adjustments, it routinely cuts experimentation cost by 30–60%.

The intuition

Suppose you're measuring revenue-per-user as your OEC. Some users naturally spend more than others, regardless of treatment. Their spending in the pre-experiment period is highly correlated with their spending during the experiment. If you adjust the post-period metric by subtracting a regression on the pre-period metric, you remove the variation that comes from "this user is just a high spender" and you're left with cleaner variation that reflects treatment effects. CUPED formalises this.

The CUPED estimator

Mathematically, CUPED replaces the simple metric Y with a covariate-adjusted version: Y_cuped = Y − θ(X − E[X]), where X is a pre-experiment covariate (typically the same metric measured pre-experiment) and θ is the regression coefficient cov(X,Y)/var(X). Under random assignment, the adjusted metric has the same expected value but lower variance, so the t-test on Y_cuped has tighter standard errors. The variance reduction is approximately ρ², where ρ is the correlation between the pre-period and post-period metrics. For metrics with high pre-post correlation (revenue, engagement, retention), this can be 50–80% variance reduction.
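
A minimal sketch of the adjustment, assuming numpy; θ is estimated from pooled data across both arms, which is valid under random assignment:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return Y_cuped = Y - theta * (X - mean(X)).

    y: metric observed during the experiment
    x: the same metric for the same users, measured pre-experiment
    """
    cov = np.cov(x, y)                 # 2x2 sample covariance matrix
    theta = cov[0, 1] / cov[0, 0]      # cov(X, Y) / var(X)
    return y - theta * (x - x.mean())

# The usual t-test is then run on the adjusted metric; variance drops by
# roughly the squared pre/post correlation.
```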

What CUPED requires

CUPED requires pre-experiment observations on the users in the experiment. This is typically straightforward — most experimentation platforms have access to historical user data. New users (who joined during the experiment) cannot be CUPED-adjusted; they're handled with a separate adjustment or by including a "new user" indicator in the regression. The covariate must be selected before observing experiment data to avoid post-hoc fishing. Modern experimentation platforms (Microsoft ExP, Statsig, Eppo) implement CUPED natively; legacy platforms often don't.

Beyond CUPED: regression adjustment

CUPED is a special case of regression adjustment: regressing the outcome on pre-experiment covariates and analysing the residuals. The general regression-adjustment framework (Lin 2013, Athey & Imbens 2017) extends CUPED in several ways: multiple covariates (not just the pre-period outcome), interactions with treatment (allowing different adjustments per arm), and machine-learning-based adjustment (using ML to predict outcomes and analysing residuals — sometimes called ML-CUPED or double machine learning). The 2024–2026 work on ML-based variance reduction has substantially advanced the state of the art.

Stratification revisited

Stratified randomisation (Section 2) is itself a variance-reduction technique: it ensures balanced groups, which reduces variance from random imbalance. Combined with CUPED, stratification can produce 60–80% effective variance reduction in favourable conditions. The discipline is that stratification is a design-time choice (you can't add stratification after the experiment has started), so power analysis should account for the planned stratification.

The diminishing-returns ceiling

Variance reduction has limits. After CUPED + stratification + best-of-class regression adjustment, you've removed the predictable component of variance; what remains is the irreducible noise that only larger samples can address. For metrics with low pre-post correlation (rare events, novel features, exploratory metrics), variance reduction is much less effective. The discipline is to invest variance-reduction effort proportional to expected benefit: the OEC and high-traffic guardrails benefit most; rare exploratory metrics benefit least.

06

Analysis: From Test Statistic to Decision

Once an experiment has run, the analysis stage turns observed data into a ship-or-kill decision. The mechanics are statistical (compute the test statistic, the p-value, the confidence interval) and operational (apply the decision rule, document the conclusion, archive the artifacts). Doing this well requires care about multiple comparisons, sequential testing, effect-size interpretation, and the gap between statistical significance and practical significance.

The standard analysis

For a continuous metric with two arms, the standard analysis is a two-sample t-test (or its CUPED-adjusted version): compute the means, compute the pooled standard error, divide to get a t-statistic, look up the p-value, compute the 95% confidence interval. For a binary metric (conversion rate), the equivalent is a two-proportion z-test. Modern experimentation platforms automate this; the analyst's job is to interpret rather than compute.
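
A sketch of the two-proportion analysis, assuming scipy; it reports the absolute lift, its 95% confidence interval, and the two-sided p-value:

```python
import numpy as np
from scipy import stats

def two_proportion_test(conv_t: int, n_t: int, conv_c: int, n_c: int, alpha: float = 0.05):
    """Two-proportion z-test plus a CI on the absolute lift (treatment minus control)."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    diff = p_t - p_c
    # pooled standard error for the test statistic
    p_pool = (conv_t + conv_c) / (n_t + n_c)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = diff / se_pool
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    # unpooled standard error for the confidence interval
    se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    margin = stats.norm.ppf(1 - alpha / 2) * se
    return diff, (diff - margin, diff + margin), p_value

# lift, ci, p = two_proportion_test(conv_t=5_400, n_t=100_000, conv_c=5_000, n_c=100_000)
```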

Confidence intervals over p-values

p-values are misunderstood and misused. The 2016 American Statistical Association statement on p-values catalogued the abuses; the field has been moving toward confidence intervals (CIs) as the primary report format. A 95% CI around the treatment effect tells you the effect's magnitude (point estimate) plus the uncertainty (interval width). "The effect is +2.5% [CI: +1.0% to +4.0%]" is much more informative than "p = 0.003." Mature experimentation cultures display CIs prominently and de-emphasise raw p-values.

Multiple-comparisons correction

An experiment that tests one OEC has α=0.05. An experiment that tests 20 metrics with α=0.05 each has, by chance, an ~64% chance of finding at least one "significant" result that's actually noise. Multiple-comparisons correction addresses this. Bonferroni divides α by the number of tests (conservative). Benjamini-Hochberg controls the false-discovery rate (less conservative, increasingly the standard). Family-wise error rate control is appropriate when avoiding any false positive matters; FDR control is appropriate when accepting some false positives is fine as long as the rate is bounded.
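
A sketch of the Benjamini-Hochberg procedure, assuming numpy (statsmodels' multipletests offers the same correction and more):

```python
import numpy as np

def benjamini_hochberg(p_values, q: float = 0.05) -> np.ndarray:
    """Return a boolean mask of discoveries at false-discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passing = p[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if passing.any():
        k = np.max(np.nonzero(passing)[0])      # largest rank whose p-value passes
        discoveries[order[: k + 1]] = True      # everything at or below that rank is a discovery
    return discoveries
```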

Effect sizes vs statistical significance

"Statistically significant" and "practically significant" are different. With enough samples, a 0.01% effect can be statistically significant; with too few samples, a 10% effect might not be. The discipline is to focus on effect-size point estimates and CIs, not just whether the CI excludes zero. An experiment with CI [+0.05%, +0.15%] is statistically significant but probably not worth shipping; an experiment with CI [-1%, +5%] doesn't reach significance but might point at a real meaningful effect that needs more data. The MDE (Section 3) should anchor what counts as practical significance.

The decision rule

Beyond the analysis is the decision rule: under what conditions should we ship, kill, or extend the experiment? Common rules: ship if OEC improves significantly and no guardrail regresses significantly; kill if OEC regresses significantly or any guardrail regresses significantly; extend (run longer) if the result is inconclusive. The rule should be documented before the experiment runs (Section 4), so the decision isn't post-hoc.
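
A pre-registered rule can be expressed as a small function over effect-size confidence intervals; a minimal sketch, assuming every metric is oriented so that positive means improvement (the structure and thresholds are illustrative):

```python
def decide(oec_ci: tuple, guardrail_cis: list) -> str:
    """Ship/kill/extend from 95% CIs on treatment effects (positive = improvement)."""
    oec_low, oec_high = oec_ci
    guardrail_regressed = any(high < 0 for (_low, high) in guardrail_cis)
    if guardrail_regressed or oec_high < 0:
        return "kill"      # a guardrail or the OEC regresses significantly
    if oec_low > 0:
        return "ship"      # the OEC improves significantly, no guardrail regresses
    return "extend"        # inconclusive: run longer or accept no effect

# decide(oec_ci=(0.004, 0.021), guardrail_cis=[(-0.01, 0.02), (-0.002, 0.005)]) -> "ship"
```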

Documenting and archiving

Every experiment, regardless of outcome, should produce an artifact: a structured record of the design, results, and decision. Modern experimentation platforms produce these automatically. The artifact is the institutional memory: future teams can browse past experiments, learn what's been tried, avoid repeating dead-ends. Mature experimentation cultures have searchable archives of thousands of past experiments; the cultural payoff is enormous because new teams stop relitigating settled questions.

07

Multi-Armed Bandits and Adaptive Allocation

In a classical A/B test, traffic allocation is fixed (typically 50/50) for the entire experiment, even after one variant is clearly winning. Multi-armed bandits dynamically reallocate traffic toward the better-performing variant as evidence accumulates, reducing the cumulative cost of the experiment. The trade-off is statistical purity for faster convergence; bandits are appropriate when the cost of running a loser variant is the dominant concern.

The exploration-exploitation tradeoff

The bandit problem captures a fundamental tension: to know which variant is best, you must explore (try them all); to maximise reward, you must exploit (use the best one). Pure exploration is a uniform A/B test. Pure exploitation is greedy assignment to the current best, which can lock in a wrong answer. Bandit algorithms balance the two: explore enough to identify the best, exploit enough to capture the gains. The classical formulations include epsilon-greedy, UCB, and Thompson sampling.

Epsilon-greedy

The simplest bandit: with probability ε, pick a random arm (explore); with probability 1−ε, pick the current best (exploit). ε is typically 0.1 or 0.05. The variant: decaying epsilon-greedy reduces ε over time as confidence increases. Epsilon-greedy is easy to implement and understand; its weakness is that it doesn't differentiate between "obviously bad" and "barely worse" arms in the explore step, treating them equally.
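
A sketch of epsilon-greedy over per-arm estimated mean rewards:

```python
import random

def epsilon_greedy_choose(estimated_means: dict, epsilon: float = 0.1) -> str:
    """Explore a uniformly random arm with probability epsilon, else exploit the best."""
    arms = list(estimated_means)
    if random.random() < epsilon:
        return random.choice(arms)              # explore: any arm, good or bad
    return max(arms, key=estimated_means.get)   # exploit: current best estimate

# epsilon_greedy_choose({"model_a": 0.051, "model_b": 0.049})
```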

UCB and confidence-based exploration

Upper Confidence Bound (UCB) algorithms always pick the arm with the highest upper confidence bound on its mean reward — a principled way of trading off exploitation (high estimated mean) and exploration (high uncertainty). UCB has provable regret guarantees: the cumulative shortfall vs picking the optimal arm grows logarithmically, which is the best possible. UCB variants (UCB1, UCB-V, KL-UCB) handle different reward distributions.
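
A sketch of UCB1, assuming rewards scaled to [0, 1]; counts and means are per-arm running statistics maintained by the caller:

```python
import math

def ucb1_choose(counts: dict, means: dict, total_pulls: int) -> str:
    """Pick the arm maximising estimated mean + exploration bonus."""
    for arm, n in counts.items():
        if n == 0:
            return arm                                  # try every arm at least once
    def upper_bound(arm):
        bonus = math.sqrt(2 * math.log(total_pulls) / counts[arm])
        return means[arm] + bonus
    return max(counts, key=upper_bound)
```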

Thompson sampling

Thompson sampling is the dominant bandit algorithm in production today. The mechanism: maintain a Bayesian posterior over each arm's reward; on each request, sample once from each posterior, pick the arm with the highest sampled value. Thompson sampling adapts smoothly between exploration and exploitation, has competitive regret bounds, and is easy to implement (especially for binary or Beta-distributed rewards). The 2010s and 2020s adoption of Thompson sampling at major tech companies (Google, Microsoft, Meta) has made it the de-facto standard.
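
A sketch of Thompson sampling for binary rewards using Beta posteriors; the class and arm names are illustrative:

```python
import numpy as np

class BetaThompsonSampler:
    """Thompson sampling for binary rewards (click / no-click) with Beta posteriors."""

    def __init__(self, arms):
        self.alpha = {a: 1.0 for a in arms}   # 1 + observed successes
        self.beta = {a: 1.0 for a in arms}    # 1 + observed failures

    def choose(self) -> str:
        # sample one plausible conversion rate per arm, pick the largest
        samples = {a: np.random.beta(self.alpha[a], self.beta[a]) for a in self.alpha}
        return max(samples, key=samples.get)

    def update(self, arm: str, reward: int) -> None:
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# per request: arm = bandit.choose(); observe the click; bandit.update(arm, click)
bandit = BetaThompsonSampler(["model_a", "model_b"])
```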

Contextual bandits

Contextual bandits extend the bandit framework with side information about each request: the user's features, the time of day, the device. The bandit picks the variant that's best for this particular context, learning a different optimal arm per context segment. Contextual bandits are closely related to online supervised learning, but with partial feedback: only the reward of the chosen action is observed. The recommendation problem reframes naturally this way, where the action is the recommended item and the reward is the click. Major recommendation systems at Netflix, Spotify, and similar companies are sophisticated contextual-bandit systems.

When bandits beat A/B tests

Bandits are appropriate when: traffic is precious (minimising loss to suboptimal variants matters), the variants are clearly comparable (no need for clean post-hoc statistical inference), the reward signal is fast (the bandit needs feedback to update), and the question is "which is better?" rather than "what is the magnitude of the difference?" A/B tests are appropriate when: statistical inference matters (the magnitude of the effect is the question), long-term effects need to be measured (bandits' adaptive allocation makes long-term inference harder), or there are heterogeneous effects to characterise. Many production systems use both: bandits for routine optimisation, A/B tests for major changes.

08

Interference, Network Effects, and Switchback

SUTVA — the assumption that one user's outcome doesn't depend on other users' assignments — is fundamental to standard A/B test analysis. In practice, SUTVA is often violated. Marketplace experiments have buyer-seller interference. Social-network experiments have friend interference. Capacity-limited systems have congestion interference. The methodology of interference-aware experimentation is its own substantial topic.

Identifying interference

Interference is suspected when treatment and control are not independent. Marketplace: a new ranking algorithm in treatment shows higher-quality items to treatment users; sellers reallocate inventory toward treatment; control users now see worse items than they would have in a uniform launch. Social network: a new feature shown to treatment users gets shared with their friends; some control users see the feature too. Capacity: a recommendation algorithm directs more traffic to popular items in treatment; popular items run out of inventory; control users see different inventory than they would have. Each of these breaks SUTVA in characteristic ways.

Cluster randomisation

The standard fix for interference is cluster randomisation: assign clusters of interacting units (a city, a marketplace region, a social-network community) together. Treatment users in a cluster don't interfere with control users in different clusters. The trade-off is statistical power: cluster randomisation reduces the effective sample size to the number of clusters rather than users. For city-level cluster randomisation, you might have 100 cities — much smaller than 100M users. Power analysis for cluster designs (Hussey & Hughes 2007, the more-recent work on cluster sample sizes) is its own topic.
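
The power cost of clustering is captured by the design effect 1 + (m − 1)·ICC, where m is the average cluster size and ICC is the intraclass correlation of the metric; a sketch with illustrative numbers:

```python
def effective_sample_size(n_users: float, avg_cluster_size: float, icc: float) -> float:
    """Effective sample size under cluster randomisation via the design effect."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_users / design_effect

# 10M users spread over 100 city-sized clusters with an ICC of 0.01:
# design effect ~ 1001, so the experiment behaves like ~10K independent users.
print(effective_sample_size(10_000_000, 100_000, 0.01))
```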

Switchback experiments

Switchback experiments turn the entire system on and off in alternating time blocks: 1pm-2pm treatment, 2pm-3pm control, 3pm-4pm treatment, etc. This is appropriate when interference is global (a single shared algorithm affects everyone) and when temporal seasonality is small relative to the effect. Switchback is the standard for marketplace and ride-hailing experiments where there's no clean cluster structure. Lyft, DoorDash, and Uber have all published on switchback methodology; the 2024–2026 work on optimal switchback designs is mathematically sophisticated.
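
A sketch of time-block assignment for a switchback; real designs usually randomise the block order rather than strictly alternating, and block length is chosen to balance carry-over effects against power:

```python
from datetime import datetime, timezone

def switchback_arm(ts: datetime, block_hours: int = 1, salt: int = 0) -> str:
    """Assign an entire time block to one arm; alternates blocks for simplicity."""
    block_index = int(ts.timestamp()) // (block_hours * 3600)
    return "treatment" if (block_index + salt) % 2 == 0 else "control"

# every request between 1pm and 2pm gets the same arm; 2pm-3pm flips, and so on
print(switchback_arm(datetime.now(timezone.utc)))
```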

Synthetic control methods

For changes that are launched to everyone at once (not as A/B tests), synthetic control compares post-launch outcomes to a synthetic counterfactual constructed from pre-launch data and similar non-launched populations. Originally developed for policy evaluation (Abadie & Gardeazabal 2003), synthetic control is increasingly used in tech for rollout evaluations where running an A/B test is impossible. The 2020s methodological work (Doudchenko & Imbens 2016, the various 2024–2026 extensions) has substantially improved synthetic control's reliability.

Instrumental variables and the broader causal toolkit

When randomisation is impossible or violated, the broader causal-inference toolkit applies. Instrumental variables exploit a known random source upstream of the variable of interest. Difference-in-differences compares pre/post changes in treated and control units. Regression discontinuity exploits sharp treatment thresholds. Part XIII Ch 03 develops these methodologies; this section just notes that A/B testing is the gold standard but the broader toolkit fills in when randomisation isn't feasible.

The honesty of "we couldn't run a clean experiment"

Mature experimentation cultures are honest about what their experiments measure and don't measure. An experiment with substantial interference should be reported with explicit caveats; a switchback with limited statistical power should be reported with wide confidence intervals; a synthetic-control study should acknowledge the modelling assumptions. The opposite — overclaiming causal certainty from messy experiments — is the failure mode that undermines trust in experimentation as a discipline.

09

Experimentation Platforms and Tooling

A serious experimentation programme requires platform infrastructure: random assignment service, exposure logging, metric pipelines, statistical analysis, dashboards, and governance workflows. The 2010s and 2020s have produced substantial commercial and open-source options. This section surveys the dominant platforms and their trade-offs.

The big-tech in-house platforms

Major tech companies — Microsoft (ExP), Google, Meta, Netflix, Booking.com, Airbnb (ERF), Spotify — have built sophisticated in-house experimentation platforms. They often run thousands of concurrent experiments, support sequential testing and CUPED, and integrate deeply with the company's metric pipelines. Public papers and blog posts describing these platforms (Kohavi et al.'s "Trustworthy Online Controlled Experiments" book, the various engineering blogs) are foundational references.

Optimizely and the legacy commercial platforms

Optimizely (founded 2010) was the dominant early commercial experimentation platform, focused initially on web A/B testing. Subsequent generations expanded to feature-flagging, full-stack experimentation, and personalisation. VWO and Adobe Target are similar platforms. The legacy platforms are appropriate for marketing-heavy use cases (web pages, onboarding flows) but are sometimes less-well-suited to deep-product or ML-experimentation use cases.

Statsig, Eppo, and the modern entrants

The 2020–2024 generation of experimentation platforms has set new standards. Statsig (founded 2021 by ex-Facebook experimentation engineers) has seen rapid adoption and has strong technical foundations: always-valid p-values, CUPED, deep metric pipelines. Eppo (founded 2020) emphasises analyst-friendly workflows and statistical rigour. GrowthBook (open-source, founded 2020) offers a self-hosted alternative with the same modern feature set. These platforms have largely caught up to in-house tools at major tech companies and are increasingly the default for new ML/product teams.

Feature flags as an experimentation substrate

Feature-flag platforms (LaunchDarkly, Unleash, the various open-source alternatives) provide the runtime infrastructure for variant assignment. Some have added native experimentation support (LaunchDarkly Experimentation, Unleash with extensions); others are pure feature-flagging that pair with separate experimentation analysis. The build-vs-buy question often comes down to: do you want feature flags + bring-your-own-analysis, or a unified experimentation platform? Both work; the right choice depends on whether you have data infrastructure to handle the analysis side.

Cloud-managed alternatives

The major cloud providers offer experimentation tooling. AWS Evidently (CloudWatch Evidently), Google Optimize (sunset 2023, replaced by partner integrations), and Azure App Configuration with its experimentation features. These are appropriate for teams already deeply invested in a cloud's stack but typically less sophisticated than the dedicated experimentation platforms.

The build-vs-buy decision

The platform decision tracks team scale and experimentation sophistication. Small teams (low experiment volume): use a SaaS platform (Statsig, Eppo, GrowthBook), don't build. Mid-sized teams running hundreds of experiments per year: SaaS or self-hosted GrowthBook, depending on cost sensitivity and on-prem requirements. Large teams running thousands of experiments and building experimentation as a competitive advantage: typically build in-house, with substantial dedicated experimentation-platform-engineering investment. The 2024–2026 industry experience suggests that the SaaS options (Statsig especially) have substantially raised the bar — many teams that would have built in-house in 2018 use SaaS in 2026.

10

The Frontier and the Operational Question

A/B testing is a mature operational discipline in 2026, but several frontiers remain active. Heterogeneous treatment effect (HTE) estimation has become routine and is reshaping personalisation. LLM evaluation poses methodological challenges that classical A/B testing doesn't address well. Long-term causal estimation is a substantial topic. This section traces the open questions and the directions the field is moving in.

Heterogeneous treatment effects at scale

Pre-2020 HTE work was mostly academic. The 2020s have made HTE estimation operational: causal forests in production, double machine learning, meta-learners, and the various uplift-modelling extensions. The 2024–2026 work emphasises decision-grade HTE — identifying which users should get treatment vs control rather than just measuring heterogeneity. The trajectory is toward personalised allocation policies that maximise expected reward subject to fairness constraints; this is reshaping how mature personalisation systems work.

LLM evaluation in production

Evaluating LLM-based products via A/B testing is methodologically distinctive. Output quality is multidimensional: a generated paragraph can be more accurate but less helpful, more concise but less complete. Single OECs don't capture this; multi-metric experiments are essentially required. Side-by-side preference (let users pick between two variants without explicit metrics) is increasingly used. LLM-as-judge evaluation pipelines (an LLM grades outputs, results feed into experiment metrics) introduce their own measurement-validity questions. The 2024–2026 work on A/B testing for LLM products is methodologically active.

Long-term causal estimation

Most A/B tests run for 1–4 weeks. Many product changes have effects that compound over months — engagement habits form, retention compounds, network effects accumulate. Long-term holdouts (small permanent control groups) let teams measure long-term effects, but at cost (long-term holdouts represent forgone wins). Surrogate-index methods use short-term proxies as predictors of long-term outcomes. Long-running ramp-ups measure cumulative effects across many short experiments. The 2024–2026 work on long-term causal estimation has matured substantially, particularly at companies (Booking, Microsoft, Meta) with sophisticated experimentation programmes.

Causal-inference-meets-experimentation

The traditional distinction between "causal inference" (observational) and "experimentation" (randomised) is blurring. Modern experimentation programmes use causal-inference techniques (DAGs, do-calculus, double ML) to analyse experiments more rigorously. Modern causal inference uses experimentation when possible and falls back to observational methods when not. The unified discipline — Part XIII Ch 03 — is increasingly the working baseline for serious analytics teams.

Regulatory and ethical considerations

Running experiments on real users raises ethical questions that have always been present but become more pressing as AI systems make consequential decisions. Informed consent for experiments on users — when is it required, when is it implied? Vulnerable populations — should treatments be withheld from groups where treatment is clearly beneficial? Negative experiments — is it ethical to run experiments where the treatment is suspected to harm users in order to measure the harm? The 2024–2026 work on experimentation ethics is reaching the regulatory layer; the EU AI Act has specific provisions on AI experimentation. Mature experimentation programmes have explicit ethics review for high-stakes experiments.

What this chapter has not covered

Several adjacent areas are out of scope. The deep statistical theory of causal inference (potential outcomes, do-calculus, instrumental variables) is developed in Part XIII Ch 03. The detailed methodology of offline policy evaluation (estimating what would have happened under a counterfactual policy) is its own topic. Bayesian experimentation as an alternative paradigm has been touched only briefly. Field experiments in the social-sciences sense (large-scale interventions with strong theoretical motivation) connect to but are distinct from product A/B testing. The chapter focused on the production-engineering substrate of A/B testing for ML and product teams; the broader experimentation landscape connects to causal inference, statistics, and policy research.

Further reading

Foundational papers and references for A/B testing and causal experimentation. Kohavi et al. on trustworthy online controlled experiments; Deng et al. on CUPED; Howard et al. on always-valid inference; Lattimore & Szepesvári on bandit theory; the Microsoft / Booking / Netflix / Airbnb experimentation papers; and the Statsig / Eppo / GrowthBook documentation form the right starting kit.