CI/CD for Machine Learning, where software discipline meets ML's distinctive complications.

Continuous integration and continuous delivery (CI/CD) are the bedrock of modern software engineering: every code change is automatically tested, integrated, and shipped to production through a pipeline that prevents regressions and accelerates iteration. Applying CI/CD to machine learning systems is the same idea adapted to a substantially more complicated artefact. An ML pipeline depends on code, data, and models — three different things that change at different rates and require different validation. Testing for ML goes beyond unit tests: data validation, model evaluation against held-out sets, behavioural tests for known edge cases, fairness checks, and performance benchmarks all need automation. Automated retraining turns drift detection (Ch 04) into action: when monitoring fires, a pipeline retrains, evaluates, gates, and (with appropriate human approval) promotes a new model version. MLOps pipelines orchestrate all of this end-to-end, with proper version control, artefact tracking, and rollback capability. This chapter develops the methodology with the depth a working ML practitioner needs: the testing strategies, the pipeline architectures, the platform choices, and the operational discipline that distinguishes a healthy ML CI/CD setup from a fragile one.

Prerequisites & orientation

This chapter assumes the experiment-tracking material of Ch 01, the feature-store material of Ch 02, the deployment material of Ch 03, and the monitoring material of Ch 04. Familiarity with general-purpose CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, CircleCI) is assumed; familiarity with Kubernetes and at least one orchestrator (Airflow, Argo Workflows, Prefect, Dagster) helps for §6 on pipeline tooling. The chapter is written for ML engineers, platform engineers, and DevOps practitioners who own production ML infrastructure end-to-end. Pure-research contexts where models are run once on static data have less use for this material; teams that ship and maintain models in production have substantial use.

Three threads run through the chapter. The first is the code-data-model triple: traditional CI/CD versions only code, but ML pipelines must version all three artefacts and reason about their interactions. The second is the testing-pyramid adaptation: ML tests look different from software tests — they include data validation, model evaluation, behavioural tests, and operational tests, each at a different cost-and-coverage trade-off. The third is the automation-vs-control tension: fully-automated retraining-and-deployment is operationally elegant but introduces risks (auto-deploying a regressing model is worse than not auto-deploying at all); the discipline is to choose the right level of automation for each pipeline stage with appropriate human gates. The chapter develops each in turn.

01

Why CI/CD for ML Is Different

Software CI/CD is a mature discipline with well-known patterns: every commit triggers tests, passing tests trigger builds, passing builds trigger deployments, deployments trigger smoke tests, and the whole loop runs in minutes. Applying this to ML is straightforward in spirit and complicated in practice. The reason is that ML systems have three independently-evolving artefact types — code, data, and models — and the validation logic for each is substantially more involved than running a unit-test suite.

The code-data-model triple

Traditional software CI/CD versions and validates only one thing: code. ML systems version and validate three: code (the training and serving logic), data (the training set, the feature definitions, the validation set), and models (the trained weights). Each evolves at its own pace: code changes daily; data changes hourly or continuously; models retrain on whatever schedule the team chose. A change in any one can break the system, and changes in pairs (a code change plus a data change) can interact in subtle ways that no individual test catches.

[Figure: CI/CD for ML methodology stack. TEST (§2–3): unit tests on code, data validation, model behavioural tests, evaluation and slice metrics — code + data + models. BUILD (§4–5): training pipelines (Airflow / Argo / KFP), evaluation gates, artefact registry — orchestrated retraining. SHIP (§6–7): GitOps deployment, canary and rollback, automated retraining with human approval gates — safe automation. Supporting layers: §8 platforms, §9 team practice, §10 frontier, on an application layer of Kubeflow, MLflow, Vertex AI, SageMaker, ZenML, Metaflow.]

The testing pyramid for ML

Software's testing pyramid (many fast unit tests at the base, fewer integration tests in the middle, very few slow end-to-end tests at the top) extends to ML with extra layers. Code unit tests are still the fast-and-cheap base. Data tests validate that incoming data has expected schemas, distributions, and freshness. Model behavioural tests check known edge cases (the model still classifies cats correctly even after a code refactor). Evaluation tests compute metrics on held-out sets and gate promotion. Slice tests compute metrics on important subpopulations (Ch 04 §9). Performance tests verify latency and resource usage. Each layer catches different problems; the discipline is to invest at every layer rather than relying on any single one.

The automation gradient

Software CI/CD is typically full-automation: passing tests trigger automatic deployment with no human gate. ML CI/CD typically has more human gates because the consequences of bad deployments are different. The 2020 Google paper on MLOps maturity introduced a useful three-level scale: Level 0 is manual everything (no automation, models trained and deployed by hand). Level 1 automates training and deployment but keeps humans in the loop for promotion. Level 2 automates the full retraining-and-deployment loop with monitoring as the safety net. Most production ML organisations operate at Level 1 in 2026; Level 2 is increasingly common for low-stakes use cases but still rare for high-stakes ones.

What can go wrong

The failure modes are varied. Bad data: a corrupted upstream pipeline produces garbage features; the model trains on the garbage and silently regresses. Test-set leakage: a refactor accidentally moves a feature into both training and validation sets; reported metrics are inflated. Skew between training and serving: even with feature stores (Ch 02), subtle differences emerge; a CI pipeline that doesn't validate this catches the bug only after deployment. Promotion of regressing models: a retrained model has slightly worse aggregate metrics but better slice metrics, or vice versa; without explicit gating logic, the wrong choice gets promoted. Pipeline ossification: the CI/CD pipeline accumulates implicit dependencies and becomes brittle; a small change cascades into hours of debugging.

The downstream view

Operationally, an ML CI/CD system sits at the centre of the entire ML platform. Upstream: code commits (developers), data updates (upstream pipelines), monitoring signals (Ch 04). Inside: testing infrastructure, training pipelines, evaluation gates, deployment automation, rollback machinery. Downstream: model registry (Ch 03 §5), serving infrastructure (Ch 03), monitoring (Ch 04), incident-response procedures. The remainder of this chapter develops each piece: §2 testing methodology, §3 data validation, §4 training pipelines, §5 evaluation gates, §6 continuous delivery, §7 automated retraining, §8 platforms, §9 team practice, §10 the frontier.

02

Testing for ML: Beyond Unit Tests

Testing ML systems requires more than the unit-and-integration test pattern from software engineering. The testing pyramid for ML adds several layers: data validation, model behavioural tests, evaluation tests, and operational performance tests. Each tests something different and catches a different class of bug; together they form the safety net that lets CI/CD safely deploy ML changes.

Code unit tests still matter

The starting point: unit tests on the training and serving code. Feature transformations should produce expected outputs given known inputs. Loss functions should return expected values for hand-chosen examples. Preprocessing pipelines should handle nulls, edge cases, and type coercions correctly. Pytest, hypothesis (property-based testing), and the ML-specific test libraries (mlxtend's `assert_*` helpers, newer 2024–2026 entrants) handle this layer. Skipping this layer because "ML is special" is a common and damaging anti-pattern; software-engineering testing discipline matters at least as much for ML as for general software, not less.
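
A minimal sketch of this layer, assuming a hypothetical `normalise_age` feature transformation (the function and its contract are invented for illustration):

```python
# Hypothetical feature transformation plus its unit tests (pytest + hypothesis).
import math

import pytest
from hypothesis import given, strategies as st


def normalise_age(age: float | None) -> float:
    """Clip age to [0, 120] and scale to [0, 1]; map nulls to a -1.0 sentinel."""
    if age is None or math.isnan(age):
        return -1.0
    return min(max(age, 0.0), 120.0) / 120.0


def test_known_values():
    assert normalise_age(60.0) == pytest.approx(0.5)
    assert normalise_age(None) == -1.0   # null handling
    assert normalise_age(-5.0) == 0.0    # clipping below
    assert normalise_age(999.0) == 1.0   # clipping above


@given(st.floats(allow_nan=True, allow_infinity=False))
def test_output_always_in_range(age):
    # Property-based test: for any float input, output is -1 or in [0, 1].
    out = normalise_age(age)
    assert out == -1.0 or 0.0 <= out <= 1.0
```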

Behavioural tests: invariance, directional, minimum-functionality

The 2020 paper "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" by Ribeiro et al. introduced an influential typology. Minimum functionality tests (MFT) check that the model handles basic capabilities correctly: a sentiment model classifies "I loved this movie" as positive. Invariance tests check that small perturbations don't change predictions: replacing "great" with "wonderful" shouldn't flip sentiment. Directional expectation tests check that a perturbation moves the prediction in the expected direction: adding a negation should flip sentiment. These tests are cheap to write, easy to maintain, and catch a substantial class of regressions that aggregate metrics miss.
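
The typology sketched as plain pytest tests rather than via the CheckList library itself; `predict_sentiment` is a toy keyword stub standing in for the real model interface so the file runs as-is:

```python
# CheckList-style behavioural tests in plain pytest. In practice the tests
# would call the trained model; the stub below is a stand-in.
NEGATIVE_CUES = ("hated", "terrible", "not good")


def predict_sentiment(text: str) -> str:
    t = text.lower()
    return "negative" if any(cue in t for cue in NEGATIVE_CUES) else "positive"


def test_minimum_functionality():
    # MFT: a basic capability the model must always have.
    assert predict_sentiment("I loved this movie") == "positive"
    assert predict_sentiment("I hated this movie") == "negative"


def test_invariance_to_synonym_swap():
    # Invariance: a near-synonym substitution must not flip the label.
    assert (predict_sentiment("This film was great")
            == predict_sentiment("This film was wonderful"))


def test_directional_negation():
    # Directional expectation: adding a negation should flip the label.
    assert predict_sentiment("The service was good") == "positive"
    assert predict_sentiment("The service was not good") == "negative"
```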

Evaluation tests on held-out data

The next layer: evaluation tests that compute metrics on held-out validation and test sets and compare against thresholds. The discipline is that metrics are computed in CI and gate the pipeline: if the new model's accuracy is more than 2% worse than the previous version's, the pipeline fails. The thresholds are project-specific; setting them right requires balancing strictness (catch real regressions) against flakiness (don't fail every minor variation). The evaluation set must be reproducibly versioned (Ch 02) and the metrics computed deterministically.
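
A sketch of such a gate as a CI step, assuming higher-is-better metrics written as JSON by earlier pipeline stages; the file names and the 2% default are illustrative:

```python
# A CI evaluation gate: compare candidate metrics to the production model's
# and fail the job on regression.
import json
import sys


def evaluation_gate(candidate: dict, production: dict,
                    max_relative_regression: float = 0.02) -> list[str]:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    for metric, prod_value in production.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            failures.append(f"{metric}: missing from candidate metrics")
        elif cand_value < prod_value * (1 - max_relative_regression):
            failures.append(
                f"{metric}: {cand_value:.4f} regresses more than "
                f"{max_relative_regression:.0%} vs {prod_value:.4f}"
            )
    return failures


if __name__ == "__main__":
    with open("candidate_metrics.json") as f:
        candidate = json.load(f)
    with open("production_metrics.json") as f:
        production = json.load(f)
    if failures := evaluation_gate(candidate, production):
        print("\n".join(failures))
        sys.exit(1)  # non-zero exit fails the CI job and blocks promotion
```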

Slice tests

Aggregate metrics hide subpopulation regressions; slice tests address this. For each important slice (demographic, geographic, behavioural), compute the metric and compare to the expected threshold. A model that improves aggregate accuracy by 1% but regresses on a critical slice by 10% should fail CI. The discipline is to identify slices in advance — Ch 04 §9 develops the methodology — and bake them into CI gates rather than discovering slice regressions in production.
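
A slice-gate sketch in the same spirit; column names and the 0.05 regression budget are illustrative:

```python
# Per-slice accuracy for candidate and production, failing on any slice
# that regresses beyond its budget.
import pandas as pd


def slice_accuracies(df: pd.DataFrame, slice_col: str) -> pd.Series:
    """df carries y_true, y_pred, and the slice column for one model."""
    return (df["y_true"] == df["y_pred"]).groupby(df[slice_col]).mean()


def check_slices(candidate: pd.Series, production: pd.Series,
                 max_regression: float = 0.05) -> list[str]:
    failures = []
    for slice_name, prod_acc in production.items():
        cand_acc = candidate.get(slice_name, 0.0)  # a vanished slice counts as 0
        if cand_acc < prod_acc - max_regression:
            failures.append(
                f"slice {slice_name!r}: {cand_acc:.3f} vs {prod_acc:.3f}"
            )
    return failures
```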

Operational tests

Finally, operational tests: latency, memory usage, throughput, batch-size handling. These verify that the model meets production-deployment requirements. A model that's accurate but takes 500ms per inference when the SLO is 50ms should fail CI. A model that fits on a development laptop but won't fit in the production GPU memory should fail CI. These tests usually run on production-sized hardware (or a representative subset) rather than on tiny CI runners.
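
A latency-SLO sketch; the 50 ms target and the `model.predict` interface are assumptions, and as noted the test belongs on production-shaped hardware:

```python
# An operational gate: p95 single-example latency against an SLO.
import statistics
import time


def p95_latency_ms(predict_fn, example, n_calls: int = 200) -> float:
    predict_fn(example)  # warmup call, excluded from the timings
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict_fn(example)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=100)[94]  # 95th percentile


# In CI: assert p95_latency_ms(model.predict, example_input) <= 50.0
```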

Test maintenance and the brittleness trap

ML tests have a unique maintenance challenge: when the model legitimately improves, some tests fail (a previous "correct answer" becomes a different "correct answer"). Without discipline, teams routinely loosen test thresholds to make CI green, eventually rendering the tests useless. The mitigations are: keep MFT and behavioural tests at the level of "this is a true property of the task" rather than "this is the current model's prediction"; version test cases alongside the data and model; review test changes in PRs with the same rigour as code changes.

03

Data Validation in the Pipeline

A model is only as good as the data it trained on, so the CI/CD pipeline must validate data with the same rigour it validates code. Data validation runs at multiple points in the pipeline: at ingestion (before the data lands in the warehouse), at feature computation (before the feature store materialises), at training (before the model fits), and at serving (before predictions ship). Each catches a different failure class.

Schema validation

The simplest and most-essential data test: schema validation. Every dataset has an expected schema — column names, types, allowed value ranges, null-ability rules. The schema is committed to source control alongside the code that consumes the data. Validation libraries — Great Expectations, Pandera, Soda, Deepchecks, TFDV (TensorFlow Data Validation) — automate schema checks. A schema mismatch should fail the pipeline rather than silently produce garbage models.
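
A sketch of the pattern with Pandera; the dataset and its columns are invented for illustration:

```python
# Schema validation with Pandera: the schema lives in source control next
# to the code that consumes the data.
import pandas as pd
import pandera as pa

transactions_schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(str, nullable=False),
        "amount": pa.Column(float, pa.Check.ge(0), nullable=False),
        "country": pa.Column(str, pa.Check.isin(["GB", "DE", "FR", "US"])),
        "ts": pa.Column("datetime64[ns]", nullable=False),
    },
    strict=True,  # unexpected extra columns also fail validation
)


def validate_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Raises pa.errors.SchemaError on any violation, failing the pipeline
    # rather than silently producing a garbage model downstream.
    return transactions_schema.validate(df)
```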

Statistical validation

Beyond schema, statistical validation checks that distributional properties match expectations. Mean and variance of numeric columns within expected ranges. Distribution of categorical values matching expected proportions. Null rates not exceeding thresholds. The distinction between schema and statistical validation is important: schema catches structural problems (a column is missing); statistical catches subtle problems (the column is there but the distribution shifted). Statistical validation reuses the drift-detection machinery of Ch 04 §3 in a CI/CD context.
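
A sketch of a statistical gate using the population stability index (PSI), one standard drift statistic; the bin count and the 0.2 rule-of-thumb threshold are conventional but adjustable:

```python
# PSI as a CI gate, reusing the drift machinery of Ch 04 §3. Bin edges
# come from the reference distribution.
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    p = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    q = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))


def statistical_gate(reference, current, threshold: float = 0.2) -> None:
    value = psi(np.asarray(reference), np.asarray(current))
    assert value <= threshold, f"PSI {value:.3f} exceeds {threshold}"
```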

The expectations-as-code pattern

The pattern that has consolidated by 2026 is expectations-as-code: data quality rules expressed in code, version-controlled, reviewed in PRs, and executed automatically. Great Expectations made this pattern dominant: a JSON or Python "expectation suite" specifies the rules, and the framework runs them and produces structured pass/fail reports. Pandera achieves the same with type-decorator-based syntax. The 2024–2026 work on these tools has matured them substantially; selecting between them is mostly a question of team preference for syntax style.

Data validation gates

Data validation should gate the pipeline rather than just produce reports. A failed expectation should stop the training run, not just log a warning. Promoting data quality to a CI/CD gate requires real discipline — teams that don't gate on data quality routinely find themselves debugging models trained on bad data, which is much more expensive than debugging the data quality issue at ingestion time.

Detecting label-pipeline issues

A specific failure class: label-pipeline corruption. The labelled dataset has labels that are wrong, drifted, or missing. Validation should check label distributions, label-schema consistency, and label-feature joinability (every example with features has a label, and vice versa). Mature ML pipelines have explicit label-quality assertions: minimum number of labels per class, maximum class imbalance, expected agreement rate between annotators. The 2024–2026 work on annotation-quality tooling has improved this layer substantially.
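
A sketch of such assertions; the thresholds, column names, and join key are illustrative:

```python
# Label-quality assertions as explicit pipeline gates.
import pandas as pd


def assert_label_quality(labels: pd.Series,
                         min_per_class: int = 100,
                         max_imbalance: float = 20.0) -> None:
    assert not labels.isna().any(), "missing labels in labelled set"
    counts = labels.value_counts()
    assert counts.min() >= min_per_class, f"rare class: {counts.idxmin()!r}"
    assert counts.max() / counts.min() <= max_imbalance, "class imbalance"


def assert_joinable(features: pd.DataFrame, labels: pd.DataFrame,
                    key: str = "example_id") -> None:
    # Every example with features has a label, and vice versa.
    unmatched = set(features[key]).symmetric_difference(labels[key])
    assert not unmatched, f"{len(unmatched)} examples fail the feature-label join"
```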

Sampling and approximation for speed

Validating petabyte-scale datasets in CI requires care: full validation is too slow for fast iteration. The standard pattern is sampling: validate a representative sample (typically 1% to 10%) in CI, validate the full dataset in nightly batch jobs. Modern data-validation libraries support both approximate (sampling-based) and exact validation; the choice is per-test based on the cost-benefit trade-off.

04

The Training Pipeline as a CI Artefact

A training run is not just a script — it's a directed acyclic graph (DAG) of stages: data extraction, feature computation, model training, evaluation, registry update, optional deployment. Treating this DAG as a first-class artefact, version-controlled and orchestrated by a workflow engine, is the modern standard. This section covers the orchestration layer and the patterns for keeping it clean.

The DAG abstraction

An ML training pipeline is a DAG with typed inputs and outputs. Data extraction reads from source systems (warehouse queries, file paths, feature-store snapshots) and produces typed dataset artefacts. Preprocessing applies transformations and produces transformed-dataset artefacts. Training consumes the preprocessed data and produces a model artefact plus training metrics. Evaluation consumes the model and a test set, produces evaluation metrics. Registry registers the model with metadata. Deployment (optional) promotes the model. Every stage has typed inputs and outputs; every stage runs in isolation; the orchestrator handles dependencies, retries, and parallelism.

Airflow and the general-purpose orchestrators

Apache Airflow remains the dominant general-purpose workflow orchestrator. ML pipelines as Airflow DAGs are battle-tested but somewhat awkward — Airflow predates ML-specific concerns and treats data/models as opaque file paths rather than typed artefacts. Prefect (founded 2018) modernises the orchestrator pattern with Python-native flow definitions, dynamic DAGs, and better observability. Dagster (founded 2018) centres on typed assets — the data and model artefacts are first-class — which fits ML particularly well. Argo Workflows is the Kubernetes-native option, popular in cloud-native MLOps stacks.
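
A Prefect-style sketch of the training DAG of this section, with illustrative stage names and elided bodies; the same shape ports to Dagster assets or Airflow/KFP tasks:

```python
# Each stage has typed inputs and outputs; the orchestrator handles
# dependencies and retries. Stage bodies are elided for brevity.
from prefect import flow, task


@task(retries=2)
def extract_data(snapshot_date: str) -> str:
    """Pull a feature-store snapshot; return the dataset artefact path."""
    ...


@task
def train_model(dataset_path: str, config: dict) -> str:
    """Fit the model; return the model artefact path."""
    ...


@task
def evaluate(model_path: str, dataset_path: str) -> dict:
    """Compute held-out metrics for the evaluation gate (§5)."""
    ...


@flow(name="training-pipeline")
def training_pipeline(snapshot_date: str, config: dict) -> dict:
    data = extract_data(snapshot_date)
    model = train_model(data, config)
    return evaluate(model, data)
```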

Kubeflow Pipelines

Kubeflow Pipelines (KFP) is the ML-specific orchestrator. Pipelines are defined in Python with the KFP SDK; each step runs as a Kubernetes pod with isolated resources; artefacts pass between steps via typed metadata. KFP integrates deeply with the rest of Kubeflow (Notebooks, Katib for HPO, KServe for deployment). The strength is the native ML focus; the weakness is that running Kubeflow at scale requires real Kubernetes operational expertise.

MLflow Pipelines and the modern alternatives

The 2022–2024 generation introduced lighter-weight ML-specific orchestrators. MLflow Pipelines (Databricks; later renamed MLflow Recipes) provides templated ML workflows with built-in tracking integration. ZenML emphasises framework-agnostic pipeline definitions that can be retargeted to different orchestrators. Metaflow (Netflix, 2019) emphasises data-scientist ergonomics — pipelines feel like Python decorators, not Kubernetes resources. The choice depends on team context: ZenML for portability, Metaflow for ergonomics, KFP for Kubernetes-native ops, Airflow/Prefect/Dagster for general-purpose orchestration that includes ML.

The reproducibility property

A well-designed training pipeline is fully reproducible: given the same code commit, data version (Ch 02), and config, re-running the pipeline produces equivalent (or identical) artefacts. This requires that every stage is deterministic and that all inputs are explicitly captured. The reproducibility test — re-run a historical pipeline and check that artefacts match — should be in CI. Without it, claims of reproducibility are aspirational; with it, they are verified.
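
A sketch of that test, assuming the pipeline sketched above plus a pinned artefact path and a stored expected digest (all illustrative); stacks that cannot achieve bit-identical artefacts compare metrics within a tolerance instead:

```python
# Reproducibility as a CI test: re-run a pinned historical pipeline and
# compare artefact digests.
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def test_pipeline_is_reproducible():
    # Pinned inputs: the CI checkout's code, a fixed data version (Ch 02),
    # a fixed config with fixed seeds.
    training_pipeline(snapshot_date="2026-01-01", config={"seed": 42})
    expected = Path("artifacts/expected_model.sha256").read_text().strip()
    assert sha256_of("artifacts/model.bin") == expected
```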

Pipeline observability

Production training pipelines require their own observability: stage durations, resource utilisation, artefact sizes, failure rates, cost per run. A pipeline that ran in 2 hours yesterday and 8 hours today usually has a problem worth investigating. Modern orchestrators expose these metrics; mature ML platforms surface them in dashboards alongside the model-level monitoring of Ch 04.

05

Evaluation Gates and Promotion Logic

An evaluation gate is the decision rule that says "this model is good enough to ship" — or, equivalently, "this model is too bad to ship." Gates turn evaluation metrics into binary or staged decisions that the CI/CD pipeline can act on. Designing good gates is its own engineering discipline: too lenient and bad models slip through; too strict and the pipeline becomes useless because nothing ever passes.

The basic threshold pattern

The simplest gate is a threshold check: the new model's metric must exceed (or not regress beyond) a threshold. Common variants: "validation AUC must be greater than 0.85" (absolute threshold), "validation AUC must not regress more than 1% from the production model" (relative threshold), "F1 must improve on at least 80% of slices and not regress more than 5% on any slice" (slice-aware threshold). The right threshold is task-specific; setting it requires balancing the cost of false-pass (shipping a bad model) against false-fail (blocking a good one).
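
These variants can be expressed as declarative gate configurations that a pipeline step evaluates; the schema and values below are illustrative, not any platform's native format:

```python
# Gates as data: absolute, relative, and slice-aware thresholds.
GATES = [
    {"metric": "val_auc", "kind": "absolute", "min": 0.85},
    {"metric": "val_auc", "kind": "relative", "max_regression": 0.01},
    {"metric": "f1", "kind": "slice",
     "max_slice_regression": 0.05, "min_improved_fraction": 0.8},
]


def passes(gate: dict, candidate: dict, production: dict) -> bool:
    m = gate["metric"]
    if gate["kind"] == "absolute":
        return candidate[m] >= gate["min"]
    if gate["kind"] == "relative":
        return candidate[m] >= production[m] * (1 - gate["max_regression"])
    if gate["kind"] == "slice":
        cand, prod = candidate["slices"][m], production["slices"][m]
        deltas = {s: cand[s] - prod[s] for s in prod}
        improved = sum(d > 0 for d in deltas.values()) / len(deltas)
        return (improved >= gate["min_improved_fraction"]
                and min(deltas.values()) >= -gate["max_slice_regression"])
    raise ValueError(f"unknown gate kind: {gate['kind']}")


# Promotion requires every gate to pass:
# promote = all(passes(g, candidate_metrics, production_metrics) for g in GATES)
```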

Multi-metric gates

Single-metric gates are usually insufficient. A model can improve accuracy but regress on calibration, or improve accuracy but become unfair across slices, or improve quality but regress on latency. Multi-metric gates require all (or a configured subset) of the metrics to pass. The discipline is to identify the metrics that matter — accuracy, latency, calibration, slice metrics, fairness, robustness — and gate on all of them rather than picking a single proxy.

Champion-challenger comparison

For applications where the production model is constantly evolving, the gate is often champion-challenger: the new candidate must outperform the current production model on a held-out test set by some margin. The margin is a hyperparameter — too small a margin means accepting noise as improvement; too large means missing real progress. Statistical-significance-based margins (a paired t-test or McNemar's test against the champion) give principled thresholds that adapt to test-set size.
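
A sketch of the significance-based variant using statsmodels' McNemar implementation on paired correctness vectors; the function names are illustrative:

```python
# Champion-challenger gate via McNemar's test on the same held-out examples.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def challenger_beats_champion(y_true, champ_pred, chall_pred,
                              alpha: float = 0.05) -> bool:
    champ_ok = np.asarray(champ_pred) == np.asarray(y_true)
    chall_ok = np.asarray(chall_pred) == np.asarray(y_true)
    # 2x2 table of paired agreement/disagreement between the two models.
    table = [
        [np.sum(champ_ok & chall_ok), np.sum(champ_ok & ~chall_ok)],
        [np.sum(~champ_ok & chall_ok), np.sum(~champ_ok & ~chall_ok)],
    ]
    result = mcnemar(table, exact=True)
    # Promote only if the difference is significant and in the right direction.
    return result.pvalue < alpha and chall_ok.mean() > champ_ok.mean()
```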

Approval workflows

For high-stakes applications, gating on automated metrics alone is insufficient — human review is required before promotion. Modern MLOps platforms support approval workflows: when the pipeline reaches a promotion stage, it pauses and requests a sign-off from a designated reviewer. The reviewer sees the metrics, slice analyses, fairness checks, and any other diagnostics; they approve, request changes, or reject. The discipline is to make the reviewer's job tractable — clear dashboards, structured comparison views, automated anomaly highlighting.

Staging and pre-prod gates

Gates often appear in stages: a model passes initial CI, is promoted to staging, undergoes more-intensive evaluation (longer evaluation suites, manual review, potentially shadow deployment), and only then promotes to production. The staging-to-production transition has the strictest gates, since regression at this stage hits real users. Mature ML platforms have explicit stage definitions and explicit promotion rules between stages.

Gate hygiene

Gates require maintenance. The list of metrics being gated should be reviewed regularly — new metrics added when new failure modes are identified, old metrics retired when no longer relevant. Threshold values should be re-tuned periodically — if they're never crossed, they're not protecting anything; if they're crossed routinely, they're noise. Failed gates should produce post-mortems: what was the failure, was the gate justified, should the threshold change. Without this hygiene, gates drift into either uselessness (everything passes) or theatre (everything is overridden manually).

06

Continuous Delivery: From Training to Serving

Once a model passes evaluation gates, the deployment side of CI/CD takes over: moving the model artefact from the registry to live serving infrastructure with appropriate caution and observability. This is where Ch 03 (Model Deployment & Serving) integrates with Ch 05; this section focuses on the automation pipeline around the deployment.

GitOps for ML

The dominant 2024–2026 deployment pattern is GitOps: every deployment is described by configuration files in a Git repository; a controller (Argo CD, Flux) continuously reconciles the live state of the production system against the desired state in Git. For ML, the configuration includes the model registry version, the deployment topology, the canary/rollout strategy, and the monitoring rules. Promoting a new model version is a Git commit (ideally automated by the CI/CD pipeline once gates pass) that triggers the controller to update production.

Model-aware deployment automation

Beyond generic GitOps, ML deployment automation knows about model-specific concerns. Warmup: load the model into GPU memory and run a few inference calls before serving real traffic. Capacity planning: scale the new replica count based on expected QPS, not just the previous deployment's count. Backwards compatibility: verify that the new model's input schema accepts existing client requests. Resource sizing: ensure the new model fits in the configured GPU memory. Mature deployment systems (KServe, BentoML, the cloud-managed services) handle these automatically.

Canary automation

Canary deployments (Ch 03 §9) are increasingly automated: the CI/CD pipeline deploys the new model to a small percentage of traffic, monitors operational metrics (latency, error rate) and ML metrics (prediction-distribution divergence, slice performance), and either promotes to higher percentages or rolls back automatically based on a configured policy. Tools like Argo Rollouts and Flagger provide canary primitives integrated with Kubernetes; cloud-managed platforms expose canary policies natively.

Rollback automation

Every deployment must have an automated rollback path: when a problem is detected, the system reverts to the previous version without engineer intervention. The mechanics are straightforward — model versions are immutable in the registry, so reverting is a registry pointer update — but the trigger logic is subtle. A naive auto-rollback that fires on every minor metric deviation produces oscillation; a too-permissive auto-rollback misses real regressions. The discipline is to base auto-rollback on SLO breaches rather than minor variations and to require human intervention for rollbacks that don't auto-fire after a window.
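
A sketch of sustained-breach trigger logic; the window sizes, the error-rate SLO, and the registry call in the closing comment are illustrative:

```python
# Auto-rollback fires on a sustained SLO breach, not single-sample noise.
from collections import deque


class RollbackTrigger:
    def __init__(self, slo_error_rate: float = 0.01,
                 window: int = 10, breaches_to_fire: int = 8):
        self.slo = slo_error_rate
        self.recent = deque(maxlen=window)
        self.breaches_to_fire = breaches_to_fire

    def observe(self, error_rate_1m: float) -> bool:
        """Feed one per-minute error rate; return True when rollback fires."""
        self.recent.append(error_rate_1m > self.slo)
        # Fire only when most of the window breaches the SLO: a sustained
        # problem rather than an oscillating blip.
        return sum(self.recent) >= self.breaches_to_fire


# if trigger.observe(rate): registry.set_production_alias(previous_version)
```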

Multi-environment promotion

Mature CI/CD pipelines promote artefacts across multiple environments: development (engineer's branch builds), staging (integration tests against pre-production data), production (real users). Each promotion is a gated step; the artefact is the same (the same model bits with the same registry hash) but the configuration changes (different feature-store endpoints, different scaling parameters). The discipline is that the model bits are immutable across promotions — what changes is the configuration of the environment serving them.

Deployment as code review

The GitOps pattern means that a deployment is a code review. A pull request modifies the deployment configuration (typically just bumping the model version); a reviewer approves; merging triggers the rollout. This brings deployment under the same discipline as code: peer review, change tracking, rollback by reverting commits. For high-stakes systems, this is materially better than ad-hoc deploy commands; the audit trail is automatic and complete.

07

Automated Retraining: When and How

The endgame of ML CI/CD is automated retraining: when monitoring detects drift or performance decay, a pipeline retrains the model on fresh data, evaluates the candidate against gates, and (with appropriate human approval) promotes it. Done well, automated retraining keeps models fresh without requiring constant engineer attention; done badly, it auto-deploys regressing models and erodes trust. This section covers the methodology.

Trigger types

Several patterns trigger automated retraining. Scheduled retraining: every N days, retrain on the most-recent data. The simplest pattern; the inefficiency is retraining when nothing has changed and not retraining when something has. Drift-triggered retraining: when monitoring (Ch 04 §3) detects significant drift, trigger a retrain. Better-targeted than scheduled. Performance-triggered retraining: when model performance crosses a threshold (Ch 04 §4), trigger a retrain. Best-targeted, but requires labels with reasonable latency. Combined triggers (scheduled with drift override) are common in mature platforms.
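
A sketch of such a combined policy, with illustrative inputs and cadence; returning the trigger reason supports the retrain-reason logging discussed later in this section:

```python
# Scheduled retraining with drift and performance overrides.
from datetime import datetime, timedelta


def should_retrain(last_trained: datetime,
                   drift_detected: bool,
                   perf_below_threshold: bool | None,
                   cadence: timedelta = timedelta(days=30)) -> tuple[bool, str]:
    if perf_below_threshold:   # best-targeted signal; None when labels lag
        return True, "performance"
    if drift_detected:         # proxy signal from monitoring (Ch 04 §3)
        return True, "drift"
    if datetime.utcnow() - last_trained >= cadence:
        return True, "schedule"
    return False, "none"
```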

The retraining pipeline

An automated retraining pipeline runs the same DAG as the original training pipeline (§4) with updated data inputs. The pipeline must handle: data freshness (the most-recent data slice up to the cut-off), label availability (waiting for labels to materialise where applicable), evaluation (comparing the retrained model to the current production version), gating (§5), and promotion (§6). Modern MLOps platforms make this straightforward to orchestrate; the discipline is to design the pipeline once and treat retraining as a parameterisation of the same pipeline rather than a separate one.

Human gates for promotion

Even with full pipeline automation, the production promotion step usually requires human approval. The reason is that automated metrics can miss issues that a human reviewer would catch — context-aware concerns like "this slice metric is fine numerically but the affected population is more important than the metric weighting suggests." Mature platforms split the pipeline at the promotion gate: everything before the gate is fully automated; the gate is human; everything after is automated again. This is the Level 1 MLOps maturity pattern (§1).

Fully autonomous Level-2 retraining

Some teams operate Level 2: full automation including production promotion. This works when the cost of a regression is low (an experimental recommendation system can serve a slightly-worse model for an hour without business impact), monitoring is comprehensive (any regression is detected within minutes), and rollback is instant. For high-stakes applications, Level 2 is rarely appropriate; the consequences of auto-deploying a regression are too severe.

The drift-retrain-evaluate cycle

The full automated cycle: monitoring detects drift → pipeline triggers retraining → retrained model is evaluated → evaluation results trigger human review → human approves or rejects → if approved, deployment automation rolls out → post-deployment monitoring confirms the rollout was successful or triggers rollback. Closing this loop is the operational achievement that distinguishes mature ML organisations. The discipline at every step is observability: every retrain is logged with reasons, every evaluation is recorded, every deployment is auditable.

Cost discipline

Automated retraining can be expensive. A daily retrain of a large model can dominate the team's compute budget. The discipline is to trigger retraining on evidence rather than reflexively, and to measure marginal value — does the new model actually outperform the current one enough to justify the retraining cost? Some teams find that monthly retraining is sufficient for many use cases and that more-frequent retraining is wasted compute. The right cadence is empirical and worth measuring.

08

MLOps Platforms and Tooling

The 2020s have produced a substantial ecosystem of integrated MLOps platforms — products that bundle experiment tracking, feature stores, deployment, monitoring, and CI/CD into a single tool. Choosing the right platform is a substantial architectural decision; this section surveys the dominant options and their trade-offs.

Kubeflow

Kubeflow is the dominant open-source ML platform built on Kubernetes. The full Kubeflow stack includes Kubeflow Pipelines (KFP, the orchestrator), Notebooks (managed Jupyter), Katib (HPO), KServe (serving, Ch 03), and integrations with MLflow and others. The strength is the open-source, cloud-portable architecture; the weakness is the operational overhead of running and updating the full Kubeflow stack. Most production deployments use a managed-Kubeflow offering (GCP, AWS, on-prem managed services) rather than self-hosting.

MLflow as a platform

MLflow (Ch 01) extends well beyond experiment tracking. The model registry (Ch 03 §5) and deployment tools provide a CI/CD-friendly artefact-and-promotion layer; MLflow Recipes (the renamed MLflow Pipelines) provides templated workflows; the broader ecosystem of third-party integrations covers most ML-platform needs. The strength is open-source and minimal lock-in; the weakness is that MLflow's pipeline orchestration is less mature than dedicated orchestrators. Many teams use MLflow plus a separate orchestrator (Airflow, Prefect, Dagster) rather than relying on its templated pipelines.

Cloud-managed platforms

The cloud-managed platforms — Vertex AI (GCP), SageMaker (AWS), Azure ML (Microsoft) — provide integrated experiment tracking, feature store, training pipelines, model registry, deployment, monitoring, and CI/CD. The integration is the main draw: each component talks to the others natively, with shared identity, audit, and billing. The trade-off is platform lock-in and (typically) higher cost than self-hosted equivalents. For teams without dedicated platform engineers, cloud-managed is usually the right starting point.

Databricks

Databricks has consolidated into a comprehensive ML platform around its Lakehouse foundation. Databricks ML includes Databricks Feature Store (Ch 02), MLflow (deeply integrated), Databricks Notebooks, Databricks Workflows (orchestration), and Mosaic AI (LLM-specific). The unified platform is the value proposition: data, features, and ML all live in the same governance and access-control layer. For teams already on Databricks for analytics, the ML extension is usually the natural choice.

The lighter-weight alternatives

Several lighter-weight platforms target ML teams who want less operational overhead than full Kubeflow. ZenML emphasises framework-agnostic pipelines that retarget to different orchestrators and infrastructure. Metaflow (Netflix-originated) emphasises data-scientist ergonomics. Modal, Replicate, and similar serverless platforms provide ML-as-a-service with minimal infrastructure to manage. Weights & Biases has expanded from tracking into a broader platform with Launch (training pipelines), Models (registry), and the various 2024–2026 LLM-specific extensions.

The build-vs-buy decision

The platform decision tracks team scale and ML investment. Small teams (1–5 ML engineers): cloud-managed (Vertex AI / SageMaker) for path-of-least-resistance, or lightweight platforms (ZenML, Metaflow) for cost-conscious teams. Mid-sized teams (5–30 ML engineers): managed Kubeflow or Databricks for serious capability, or a hybrid of cloud-managed plus best-of-breed components (MLflow + Airflow + KServe). Large teams (30+ ML engineers): typically build their own platforms on top of these primitives, with substantial dedicated platform-engineering investment. The 2024–2026 industry experience suggests that fully-custom platforms are rarely worthwhile below FAANG-scale ML investment; the open-source and managed options have matured to the point that not-invented-here (NIH) builds are rarely justified.

09

Operational Workflows and Team Practice

Tools alone do not produce good CI/CD. The team-level workflows around the tools — PR templates, code-review practices, staging-environment discipline, post-incident reviews — determine whether a CI/CD pipeline actually delivers safer deployments or becomes elaborate theatre. This section covers the operational practices that distinguish well-run ML teams from chaotic ones.

PR templates and CI gates

Every change to ML code, data, or models should flow through a pull request. The PR template ensures the right context is captured: what changed, why, what tests run, what the expected impact is. CI gates on the PR ensure quality: code lint passes, unit tests pass, data validation passes, model evaluation gates pass. The discipline is that no change ships without a PR, and no PR merges without passing CI. Mature teams have explicit "no exceptions" rules around this; teams that allow CI bypasses for urgent fixes find that emergency-bypass becomes the default.

Staging environments and the promotion ladder

The standard environment ladder is dev → staging → production. Dev is the engineer's working environment: rapid iteration, no production data, no production traffic. Staging is production-shaped: production data (or production-fidelity synthetic data), production-equivalent infrastructure, but no real users. Production is the real thing. Promotion across the boundaries is gated; a bug caught in staging never makes it to production. The discipline is to invest in staging fidelity — a staging environment that doesn't catch real production bugs is theatre, not protection.

Code review for ML

ML code review (cross-referenced from Ch 01 §9) has its own conventions. Reviewers should ask: are the right tests added? Are the data and model changes justified? Is the experiment-tracking instrumentation present? Is the rollback plan clear? Is there a runbook for the inevitable on-call incident? Mature teams include these in PR templates explicitly. The investment in code review pays back many times over by catching issues early.

Approver discipline

For high-stakes changes (production model promotion, major architectural changes), the approval should require multiple reviewers, including someone with operational responsibility for the affected system. The discipline is that approvers actually review — rubber-stamp approvals defeat the purpose. Mature teams have escalation paths: if an approver doesn't have time, the PR finds another approver; if no approver is available, the change waits.

Post-incident reviews and the learning loop

When a CI/CD pipeline fails to catch a problem (a regressing model reaches production, an outage results from a deployment), the response should be a post-incident review. The review covers: what happened, what the pipeline missed, what would have caught it, what changes to add to the pipeline. The output is concrete: new tests, new gates, new monitoring. Teams that consistently feed incidents back into pipeline improvements develop progressively stronger CI/CD over time; teams that don't feed them back repeat the same mistakes.

The compliance and audit dimension

Mature ML organisations treat CI/CD outputs as audit artefacts. Every model version in production traces back to: the code commit that produced it (Ch 01), the data version it was trained on (Ch 02), the evaluation results that justified its promotion (§5), the human who approved (§5), and the deployment events that put it in production (§6). The chain is auditable end-to-end. The EU AI Act, FDA AI/ML guidance, and SR 11-7 (financial-services model risk management) increasingly require this lineage; CI/CD pipelines that produce it as a natural by-product save substantial compliance work.

10

The Frontier and the Operational Question

CI/CD for ML is mature operational infrastructure for classical ML in 2026, but several frontiers remain active. LLM CI/CD has its own distinctive shape. Agent-system CI/CD is genuinely emerging. Regulatory pressure is automating compliance documentation. This section traces the open questions and the directions the field is moving in.

LLM CI/CD

LLM systems have CI/CD challenges that the classical ML methodology doesn't address. Evaluation is harder: there's no single accuracy metric for a generated paragraph; LLM evaluation requires multiple metrics (BLEU, ROUGE, semantic similarity, LLM-as-judge, human evaluation) each with its own caveats. Fine-tuning is the new training: most LLM "training" runs are fine-tunes of foundation models, with their own dependency on the base model version. Prompts are code: changes to system prompts can have model-changing effects and need to flow through CI/CD just like model weight changes. The 2024–2026 generation of LLM-specific platforms (LangSmith, Langfuse, the various promptops tools) is consolidating but not yet mature.

Agent CI/CD

LLM agents (Part XI) introduce CI/CD challenges that have only just begun to be addressed. Trace-based evaluation: an agent's correctness is over a multi-step trajectory, not a single output, requiring trace evaluation rather than per-call metrics. Tool-availability changes: an agent that worked yesterday may break today because an external tool's API changed. Cost regression: an agent change that increases tool calls or token usage by 10× is a meaningful regression even if outputs improve. The methodology of "what does CI/CD look like for agents?" is genuinely emerging; this is one of the more-active 2025–2026 industry topics.

Regulatory automation

The EU AI Act's full enforcement (2026) is making model-card generation, fairness reporting, and audit-trail packaging into automatable CI/CD outputs. Tools that take CI/CD pipeline metadata and produce regulatory-grade documentation packages — model cards (Ch 01), datasheets, risk assessments, post-deployment monitoring reports — are increasingly important. The 2025–2027 work on regulatory-grade CI/CD will produce a generation of tooling that turns compliance from a manual burden into a pipeline output.

Self-improving pipelines

The 2024–2026 wave of LLM-driven engineering tools has reached CI/CD: AI agents that read pipeline failures, propose fixes, and submit PRs; AI assistants that suggest test cases when a model regression is detected; AI-driven root-cause analysis on incidents. The methodology is genuinely emerging — AI-driven CI/CD has not yet become the production default, but the trajectory is clear. The frontier 2026–2030 work on this will reshape the operational pattern substantially.

The platform consolidation question

The MLOps tooling landscape has been fragmented for years, with dozens of platforms each claiming to be the unified solution. The 2024–2026 trend has been consolidation: smaller players acquired (W&B by CoreWeave, others), feature scope converging across the major platforms, open-source projects coalescing around shared standards (OCI artefacts for model containers, OpenLineage for data lineage, MLflow's pipeline format). The end-state plausibly has 3–5 dominant comprehensive platforms (cloud-managed plus a couple of independents) with smaller players filling specialised niches. The current chaotic landscape is likely temporary.

What this chapter has not covered

Several adjacent areas are out of scope. A/B testing and causal experimentation — the rigorous methodology for measuring deployment impact — is the topic of Ch 06. Responsible release and deployment practices — the broader governance discipline — is Ch 07. The deeper topic of software-engineering best practices beyond ML is its own field. Cost management at scale — the FinOps discipline applied to ML — has been touched only briefly. The chapter focused on the operational substrate of CI/CD for ML; the broader MLOps landscape develops adjacent topics in subsequent chapters.

Further reading

Foundational papers and references for CI/CD in ML. The Google MLOps maturity papers; the CheckList paper for behavioural testing; Great Expectations and Pandera documentation; the Kubeflow, MLflow, and orchestrator (Airflow, Prefect, Dagster) documentation; the "Continuous Delivery for Machine Learning" (CD4ML) article by Sato et al.; Introducing MLOps by Treveil et al.; Chip Huyen's Designing Machine Learning Systems (cross-referenced); and the various 2023–2025 LLM-CI/CD posts form the right starting kit.