Experiment Tracking & Reproducibility: where research code becomes engineering practice.

Machine learning models are produced by training runs that depend on hundreds of inputs: source code, datasets, hyperparameters, library versions, hardware configurations, and random seeds. Without disciplined tracking, those runs become irreproducible — even by the original author days later — making it impossible to debug regressions, audit decisions, or hand off work to teammates. Experiment tracking systems (MLflow, Weights & Biases, Neptune, Comet) and data-and-model versioning systems (DVC, lakeFS, Pachyderm) provide the operational substrate for reproducibility. Environment management tools (Conda, Poetry, uv, Docker, Nix) capture the dependency graph that produced a result. And the discipline of determinism — controlling random seeds, numerical precision, and hardware non-determinism — is what turns reproducibility from aspirational into actual. This chapter develops the methodology with the depth a working ML practitioner needs: the tools, the practices, the trade-offs, and the operational considerations that distinguish a reproducible research lab from an irreproducible one.

Prerequisites & orientation

This chapter assumes the working machinery of modern deep learning (Part VI) and the data-engineering material of Part III (data pipelines, version control, cloud platforms). Familiarity with Git is essential; familiarity with at least one ML framework (PyTorch, TensorFlow, JAX) is assumed. The chapter is written for ML engineers, research engineers, and applied scientists who run training jobs at scale — whether on a laptop, a single GPU box, a research cluster, or a managed cloud platform. Readers who have never personally hit an "I can't reproduce this run" wall may want to skim §2 before settling in.

Three threads run through the chapter. The first is the state-explosion problem: a modern training run can depend on a million-line codebase, multi-terabyte datasets, ten thousand hyperparameter combinations, and the GPU's particular non-determinism. Capturing all of this state cheaply enough to actually use is a non-trivial engineering challenge. The second is the tooling-vs-discipline tension: the right tools (MLflow, W&B) make tracking nearly free, but no tool removes the need for the discipline of actually using them — committing before training, logging the right metrics, recording the right artefacts. The third is the research-to-production gradient: experiment tracking is most often associated with research, but the same discipline applies (with different tooling weights) to production training pipelines, A/B-test gating, and incident-response audits. The chapter develops each piece in turn.

01

Why Experiment Tracking Matters

Machine learning has a reproducibility problem. Surveys of published ML papers consistently find that 30–50% of headline results cannot be reproduced from the published artifacts; even authors often cannot reproduce their own results six months later. The cause is rarely scientific misconduct — it is almost always missing state. A training run depended on some combination of code revision, data snapshot, hyperparameters, library versions, GPU model, and random seed that was not captured. Experiment tracking is the operational discipline of capturing all of that state, automatically, every time a model is trained.

The reproducibility crisis in numbers

The 2020 NeurIPS reproducibility study found that fewer than 25% of accepted papers provided code that ran successfully out of the box, and fewer than half of those produced numbers within 5% of the published values. Industry surveys (Algorithmia 2020, Anaconda 2023) consistently find that 60–80% of ML practitioners have experienced "I can't reproduce my own work" within the last year. The cost is real: model regressions ship without being noticed, A/B test results are unreliable, and team knowledge evaporates when an engineer leaves.

[Figure: the chapter's methodology stack. §2 State (what defines a run): code (Git SHA), data (DVC hash), params & config, environment, seeds. §3–4 Track (where it lives): MLflow, Weights & Biases, metrics & artefacts, model registry. §5–7 Reproduce (how to re-run): DVC versioning, Conda/Docker/Nix, deterministic seeds, CI verification. §8 Sweeps, §9 Team practice, §10 Frontier. Application layer: research labs, production training pipelines, regulatory audits.]

The five concrete benefits

Disciplined experiment tracking pays back on five distinct dimensions. Reproducibility: any run can be re-executed weeks or years later with identical (or near-identical) outputs. Auditability: regulators, reviewers, and internal stakeholders can trace any deployed model back to its training inputs. Regression debugging: when a metric drops between two runs, the differences in inputs are immediately visible. Knowledge transfer: new team members can browse historical runs to understand what's been tried; departing engineers leave behind a usable record. Iteration speed: with infrastructure in place, the marginal cost of running and comparing experiments approaches zero, which substantially accelerates research velocity.

The cost of doing it badly

Several anti-patterns recur. The spreadsheet pattern — logging hyperparameters and metrics in a shared spreadsheet — collapses past about 50 runs and is incompatible with CI/CD. The filename pattern — encoding hyperparameters in model filenames like `resnet50_lr0.001_bs64_seed42.pt` — works for tiny experiments but does not scale and breaks down when a team works in parallel. The "I'll just remember" pattern — running ad-hoc training without logging — is the dominant cause of irreproducibility and the strongest argument for tooling. The over-tracking pattern — logging so much that the tracking system itself becomes the bottleneck — is the opposite failure and is rarer but real.

The downstream view

Operationally, experiment tracking sits at the intersection of research and engineering. The downstream consumers of tracked experiments are: human researchers comparing approaches, automated CI systems regression-testing trained models, model registries that promote models to production, regulatory-audit pipelines that prove provenance, and downstream MLOps systems (deployment, monitoring) that need to know which model produced which prediction. The remainder of this chapter develops each piece: §2 the state to track, §3 MLflow, §4 W&B, §5 DVC and data versioning, §6 environment management, §7 determinism, §8 hyperparameter sweeps, §9 team-level practice, §10 the frontier.

02

What to Track: The State of an ML Experiment

Before discussing tools, it's worth being concrete about what "state" actually means for an ML experiment. A training run depends on six categories of input, each with its own capture mechanism. Track all six and any run can be reproduced; miss one and reproducibility silently breaks. This section enumerates the categories and the standard practices for each.

Code

The canonical way to capture code state is the Git commit SHA. Every training run should record the SHA of the repository at training start, plus a flag indicating whether the working tree was clean. The most-common failure mode is "I had uncommitted changes" — a run that does not correspond to any committed code. The defensive practice is the commit-before-train rule: pre-flight checks that reject runs with a dirty working tree. For systems with code spread across multiple repositories (model code in one, training infra in another), all relevant SHAs must be captured.
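
A minimal sketch of that capture step, using `subprocess` to query Git directly; the function name and returned keys are illustrative, not part of any tracking library:

```python
import subprocess

def capture_code_state() -> dict:
    """Return the current commit SHA and whether the working tree is dirty."""
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # `git status --porcelain` prints one line per modified or untracked file;
    # empty output means the working tree is clean.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {"git_sha": sha, "git_dirty": bool(status.strip())}
```

The returned dictionary can be logged as run tags or parameters so that every tracked run carries its exact code revision.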

Data

Data is the most-common reproducibility hole. A training set that is "the latest export from the warehouse" is not a reproducible artefact — the warehouse changes. Reproducibility requires a content-addressed snapshot: a hash of the data plus a recipe for retrieving it. DVC (Section 5) is the canonical tool. lakeFS provides Git-like operations on object storage. For datasets too large to version directly, the recipe is captured: SQL query, date range, and any sampling parameters. The discipline is the same in either case — never train on "the data" without a way to identify exactly which data.

Hyperparameters and configuration

Modern training systems separate code from configuration: learning rates, batch sizes, model widths, regularisation strengths, and dataset paths live in YAML or JSON files (or, in more sophisticated setups, in Hydra/OmegaConf hierarchical configs — Section 8). The configuration file is the input most actively varied across runs and the one most essential to capture. Tracking systems automatically log the resolved config; the practice is never to hardcode hyperparameters in source code, since hardcoded values are invisible to the tracking layer.

Environment

The environment is the set of installed packages, the Python version, the CUDA version, the operating system, and the hardware. A model trained against PyTorch 2.4 and CUDA 12.1 may produce different numerics against PyTorch 2.5 and CUDA 12.4. Tracking systems should capture requirements.txt, conda.yaml, or equivalent; environment-management tools (Section 6) make capture mechanical. The full environment can be encoded as a Docker image hash or a Nix derivation hash for the strongest reproducibility guarantee.

Random seeds

Most ML training is non-deterministic by default: data shuffling, weight initialisation, dropout masks, and (for some operations on GPU) numerical reductions all introduce randomness. A reproducible run requires seeding Python's `random`, NumPy's PRNG, the framework's PRNG, and (for distributed training) the seed for each worker. Even with seeds, GPU non-determinism (Section 7) can introduce small numerical differences. Tracking systems log the seed; the discipline is to control all sources of randomness, not just the obvious one.

Metrics, artefacts, and outputs

The outputs of a training run are also state worth tracking: scalar metrics (training loss, validation accuracy, latency), curves and plots (learning curves, confusion matrices, gradient norms over time), model artefacts (weights, optimiser state, training-script copy), and examples (predictions on a fixed validation subset). Tracking systems automatically log scalar metrics and artefacts; the practice is to log enough that any future debugging question — "why did this run diverge?" "what did the model predict on edge cases?" — can be answered from the tracking record without re-training.

Hardware and timing

For performance-sensitive work, hardware state matters: GPU model, GPU count, interconnect topology (NVLink, InfiniBand), CPU model, RAM. A run on 8x A100 vs 8x H100 vs 1x A100 may produce different convergence behaviour. Tracking systems log hardware fingerprints automatically. Timing is also worth tracking: wall-clock duration, GPU utilisation, and per-step throughput help diagnose whether changes are improving the model or just changing the speed.

03

MLflow and the Open-Source Stack

MLflow is the dominant open-source experiment tracking platform. Originally developed at Databricks (2018) and now a Linux Foundation project, it provides four loosely-coupled components: tracking, projects, models, and registry. MLflow's strength is that it is open-source, self-hostable, and has minimal lock-in — your tracking data lives in your own database. Its weakness is that the UI is more utilitarian than the hosted alternatives, and operating it at scale requires real engineering investment.

The MLflow tracking model

An MLflow experiment is a named bucket. An experiment run is a single training execution within an experiment, with a unique run ID. Each run records parameters (typed key-value pairs, immutable once set), metrics (typed scalar time series, append-only), tags (mutable string key-value pairs), and artefacts (arbitrary files: model weights, plots, data samples). The Python API is minimal: `mlflow.start_run()`, `mlflow.log_param(...)`, `mlflow.log_metric(...)`, `mlflow.log_artifact(...)`. Most teams instrument their training scripts with these calls and rely on MLflow's auto-logging integrations (sklearn, PyTorch Lightning, XGBoost) to capture standard metrics automatically.
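
A minimal sketch of a training script instrumented with these calls; the experiment name, parameter values, and the `train_one_epoch` helper are placeholders, not part of MLflow:

```python
import mlflow

mlflow.set_experiment("image-classification")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 64)
    for epoch in range(10):
        val_acc = train_one_epoch()                          # hypothetical training step
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)
    mlflow.log_artifact("model.pt")                          # any file: weights, plots, configs
```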

Tracking server architecture

MLflow's tracking server is the persistence layer. The default backing store is SQLite (suitable for a single user); production deployments use PostgreSQL or MySQL plus an object store (S3, GCS, Azure Blob) for artefacts. The server exposes a REST API and a web UI. For a team of more than a few engineers, the practice is to run a centralised tracking server (managed by an MLOps-platform team) so that all team members log to the same backend and can browse each other's runs.

The model registry

The model registry is MLflow's solution to the "which model is in production?" question. A registered model has a name, a sequence of versions, and a stage (None, Staging, Production, Archived). The registry provides an audit trail of which model version was promoted when, by whom, and what training run produced it. The registry is the connecting tissue between research (experiment tracking) and production (deployment, Section 9). Most modern MLOps stacks treat the registry as the source-of-truth for "what model should serving be running."
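
A sketch of registering a run's model and promoting it through the stage-based API described above; the model name and run URI are placeholders, and newer MLflow releases also offer alias-based promotion alongside stages:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artefact produced by a tracked run ("<run_id>" is a placeholder).
result = mlflow.register_model("runs:/<run_id>/model", "fraud-detector")

# Promote the new version, leaving an audit trail of who promoted what and when.
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector", version=result.version, stage="Staging",
)
```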

MLflow Projects and reproducibility recipes

MLflow Projects bundle a training script with a description of its dependencies (Conda environment or Docker image) and its entry points. A project can be re-run with `mlflow run` against any compatible execution backend (local, Databricks, Kubernetes). In practice, MLflow Projects are less widely-used than the tracking and registry components; teams more often use Hydra/Make/Bash scripts to encode their training recipes and let MLflow handle only logging.

Self-hosting versus managed

MLflow can be run as a fully self-hosted service or consumed as a managed offering. Databricks-managed MLflow integrates with the Databricks platform and is the no-friction path for teams already on Databricks. Self-hosted MLflow on Kubernetes (typically via the official Helm chart) gives full control but requires real operational investment: HA Postgres, S3 object store, OAuth integration, log retention policies. The choice is the standard build-vs-buy trade-off; the right answer depends on team size and platform-team capacity.

The ecosystem and integrations

MLflow integrates broadly. The PyTorch Lightning, Keras, and Hugging Face Transformers training loops all have native MLflow loggers. Spark integrations capture model training in distributed pipelines. Kubernetes operators (Kubeflow, Argo Workflows) can launch MLflow-instrumented training jobs. The breadth of integration is one of MLflow's main strengths — adopting it usually means adding a few lines of code to existing training scripts rather than re-architecting workflows.

04

Weights & Biases and the Hosted Stack

Weights & Biases (W&B) is the dominant SaaS experiment-tracking platform. Founded in 2017 and acquired by CoreWeave in 2025, W&B is widely used by deep-learning research labs and engineering teams that prioritise UI quality and zero ops over self-hosting flexibility. Its core proposition is that tracking should be effectively free at the margin — with a one-line `wandb.init()` call, an entire training run is logged, visualised, and shareable.

The W&B tracking model

A W&B project is a collection of related runs. Each run captures the same six categories as MLflow (params, metrics, artefacts, environment, code, hardware) but with much richer auto-instrumentation. By default, W&B captures GPU utilisation, memory, temperature, system metrics, console output, the Git SHA, the diff of any uncommitted changes, and the exact command-line invocation. The Python API is similarly minimal: `wandb.init(project="myproj")`, `wandb.log({"loss": loss})`, `wandb.log_artifact(model_path)`. The richness comes from the platform's data model and the automatic capture, not from extra code.
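
A minimal sketch of the same instrumentation in W&B; the project name, config values, and the `train_one_epoch` helper are placeholders:

```python
import wandb

run = wandb.init(project="myproj", config={"learning_rate": 1e-3, "batch_size": 64})

for epoch in range(10):
    loss = train_one_epoch(run.config)            # hypothetical training step
    wandb.log({"loss": loss, "epoch": epoch})

wandb.log_artifact("model.pt", type="model")      # version the checkpoint as an artefact
run.finish()
```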

Sweeps and hyperparameter optimisation

W&B Sweeps are a hyperparameter-optimisation system layered on top of tracking. A sweep is defined by a YAML config specifying the search space (grid, random, or Bayesian), the optimisation target metric, and the agent count. W&B coordinates a fleet of agents (each agent is a worker process that pulls the next config from the sweep controller) and provides a real-time UI showing the search frontier. The integration with tracking is tight — every sweep run is automatically logged, and the sweep dashboard provides parallel-coordinates plots, importance scores, and Pareto-front analysis. For routine hyperparameter tuning, W&B Sweeps is the lowest-friction option in the ecosystem.
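
A sketch of defining and running a sweep programmatically; the search space and the `train_and_evaluate` helper are illustrative, and the same configuration can equally be supplied as YAML:

```python
import wandb

sweep_config = {
    "method": "bayes",                                 # grid | random | bayes
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

def train():
    with wandb.init() as run:                          # the agent injects the chosen config
        acc = train_and_evaluate(run.config)           # hypothetical training function
        wandb.log({"val_accuracy": acc})

sweep_id = wandb.sweep(sweep_config, project="myproj")
wandb.agent(sweep_id, function=train, count=20)        # run 20 trials in this process
```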

Reports and the collaboration story

W&B Reports are interactive notebook-style documents that embed live plots, metrics tables, and run comparisons. The collaboration story is W&B's main differentiator: a researcher can run an experiment, write up the results in a Report with embedded charts, and share a URL with collaborators or stakeholders. The Reports are versioned, comment-able, and update automatically as new runs come in. This is materially better collaboration than MLflow's native UI provides, and is the main reason research labs choose W&B despite the cost.

Artefacts and lineage

W&B Artefacts are W&B's content-addressed file-versioning system. An artefact is a named, versioned, hash-keyed bundle of files (a dataset, a model checkpoint, a preprocessed feature set). Artefacts can be inputs to runs (the run consumed dataset version v3) or outputs (the run produced model version v17). W&B automatically constructs the lineage graph between runs and artefacts, providing a "where did this model come from?" view that mirrors MLflow's registry but with broader scope (any file, not just models).
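
A sketch of the input/output artefact pattern; the artefact names and the version alias are illustrative:

```python
import wandb

run = wandb.init(project="myproj", job_type="train")

# Declare the dataset version this run consumed (an input edge in the lineage graph).
dataset = run.use_artifact("training-data:v3")
data_dir = dataset.download()

# ... train on the contents of data_dir ...

# Log the produced checkpoint as a new artefact version (an output edge).
model_artifact = wandb.Artifact("resnet-checkpoint", type="model")
model_artifact.add_file("model.pt")
run.log_artifact(model_artifact)
run.finish()
```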

Comet, Neptune, and the alternatives

The hosted-tracking ecosystem includes several W&B-class alternatives. Comet (founded 2017) is the closest competitor, with similar feature scope and a slight emphasis on enterprise use. Neptune.ai emphasises metadata flexibility and offline-first work. ClearML (formerly Allegro Trains) is open-source-first with a managed tier. Aim (open-source) is a lightweight self-hosted alternative emphasising fast UI on large run histories. The choice between W&B and the alternatives is mostly a function of budget, on-prem requirements, and team familiarity.

Cost considerations

SaaS experiment tracking is not free. W&B charges per-seat for teams above the free tier, plus storage costs for artefacts and metric history. For a 50-engineer team logging large image-classification or LLM-training runs, annual costs can reach six figures. The trade-off versus self-hosted MLflow is real: W&B saves probably 1–3 engineer-FTEs of MLOps-platform work, which is worth the licence cost for teams whose ML productivity is the bottleneck. For teams whose primary cost is engineering payroll for the platform team itself, self-hosting may be cheaper.

05

Data and Model Versioning with DVC

Code is versioned by Git; data and models, which are too large for Git, need their own versioning system. DVC (Data Version Control) is the dominant solution: a command-line tool that layers content-addressed file versioning on top of Git, using cloud object storage (S3, GCS, Azure) or a network file system as the backing store. DVC's core proposition is that data and model files become first-class citizens in your Git workflow without bloating the Git repository.

The DVC data model

DVC tracks files by computing their hashes and storing the hashes (and pointers) in a small `.dvc` file that lives in Git. The actual file content lives in a separate DVC remote (typically an S3 bucket). Standard Git operations (commit, push, branch, checkout) work on the `.dvc` files; DVC operations (`dvc push`, `dvc pull`) sync the actual content between the remote and the local working directory. The result is that a Git checkout of a 5-TB-dataset project occupies a few kilobytes until you `dvc pull` the data you need.

DVC pipelines and reproducibility

DVC also provides a pipelines feature: a YAML file describing the dependency graph between data, code, and outputs. A pipeline stage declares its inputs (data files, parameters), its command, and its outputs. DVC can then re-execute only the stages whose inputs have changed (a make-style incremental build for ML pipelines). For teams whose preprocessing pipelines are non-trivial, DVC pipelines provide a substantial reproducibility benefit: re-running the full pipeline produces the same outputs given the same inputs.

lakeFS and the alternatives

The data-versioning ecosystem has multiple players. lakeFS provides Git-like semantics (branch, commit, merge) directly on top of object storage, without DVC's Git-and-pointer-files indirection. Pachyderm emphasises pipeline orchestration in addition to versioning. Delta Lake and Apache Iceberg provide ACID transactions and time-travel on data-lake table formats but are oriented to analytical workloads rather than ML pipelines. The choice depends on how your data lives: many small files (DVC), large columnar tables (Iceberg/Delta), or whole-bucket workflows (lakeFS).

Model versioning

Models are technically just large files, so DVC handles them too. But for production model lifecycle, the model registry features of MLflow or W&B (Sections 3–4) are usually a better fit than raw DVC: they add stage-management (None/Staging/Production), metadata (training run that produced the model), and approval workflows. The common pattern is: DVC for datasets and intermediate artefacts, MLflow/W&B registry for production model versions.

Diffing and reviewing data

One challenge with data versioning is that diffs are hard to read. A 50-GB dataset has changed, but `dvc diff` only shows that the hash differs — not what changed inside. The 2024–2026 ecosystem has produced specialised diff tools (DataPanel, Great Expectations, the Deepchecks data-validation library) that surface schema changes, distribution shifts, and example-level differences. The discipline is to combine DVC's content-level versioning with a schema-and-stats validation layer that flags meaningful data changes.

Practical adoption pattern

The practical adoption pattern for DVC: start by versioning the largest, most-frequently-changed datasets; add the trained model artefacts; let small intermediate files stay in Git LFS or just plain Git. Pair DVC with a tracking system (MLflow, W&B) so that experiments record the DVC hash of the data they consumed. The combination — Git for code, DVC for data, MLflow/W&B for runs — is the contemporary standard for reproducible ML pipelines and is what the rest of this chapter assumes.
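
A sketch of that pairing: read the content hash from the dataset's `.dvc` pointer file and log it as a run parameter. The pointer-file layout assumed here matches current DVC output but may differ across versions:

```python
import mlflow
import yaml

def dvc_data_hash(dvc_file: str) -> str:
    """Return the md5 content hash recorded in a .dvc pointer file."""
    with open(dvc_file) as f:
        pointer = yaml.safe_load(f)
    return pointer["outs"][0]["md5"]

with mlflow.start_run():
    mlflow.log_param("train_data_md5", dvc_data_hash("data/train.csv.dvc"))
    # ... training and metric logging as in Section 3 ...
```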

06

Environment Management

A trained model is a function of its environment: the Python version, every installed library, every transitive dependency, the CUDA driver, the operating system. Two practitioners running "the same code" with subtly different environments can produce subtly different models. Environment management is the discipline of capturing the exact environment under which a result was produced, in a form that can be re-instantiated weeks or years later. The tools have evolved dramatically over the past decade.

The Python packaging landscape

The Python packaging ecosystem has been historically painful and is finally improving. pip + requirements.txt is the lowest-common-denominator: a flat list of package names and versions. The weakness is that requirements.txt does not pin transitive dependencies, so two installs of the same requirements can produce different environments. pip-tools (`pip-compile`) generates pinned lockfiles. Poetry introduced robust lockfiles and dependency resolution; uv (Astral, 2024) is a blazingly-fast Rust-based alternative that has largely replaced pip-tools in 2024–2026 practice. Pixi extends uv-style speed to Conda packages.

Conda and the scientific stack

Conda remains the dominant package manager for scientific Python because it handles non-Python dependencies (NumPy's underlying BLAS, CUDA, system libraries) that pip cannot. Mamba and micromamba are faster Conda-compatible reimplementations. The conda-forge channel hosts community-maintained recipes for ~25,000 packages and is the de-facto standard. Conda environments are captured in `environment.yml` files; the `--explicit` flag produces fully-pinned lockfiles suitable for reproduction.

Docker and containerisation

Docker images go beyond Python packages: a Dockerfile captures the operating system, the system libraries, the CUDA toolkit, the Python interpreter, and the application code as a single immutable image. A trained model that came out of a specific Docker image hash can in principle always be re-trained from that image. Docker is the operational standard for production deployment; for research it is heavier than Conda or uv but provides stronger guarantees. NVIDIA's NGC catalog provides pre-built CUDA-Python images that are the typical starting point for deep-learning Dockerfiles.

Nix and the deterministic frontier

Nix is the deepest reproducibility tool: a functional package manager that builds every artefact (including Python packages, system libraries, and the OS itself) from declarative specifications, with hash-addressed outputs. A Nix derivation hash is a complete, content-addressed identifier for an entire environment — the strongest reproducibility guarantee available. The cost is steep: Nix has a learning curve and an ecosystem that is smaller than pip's. For most teams, Nix is overkill; for teams whose reproducibility requirements are regulatory (pharmaceutical AI, financial models), it is increasingly attractive.

Lockfiles and the reproducibility contract

The unifying concept across these tools is the lockfile: a file that pins not only direct dependencies but every transitive dependency to exact versions and hashes. `requirements.lock` (pip-tools/uv), `poetry.lock` (Poetry), `conda-lock.yml` (conda-lock), `Dockerfile + image hash` (Docker), `flake.lock` (Nix). The discipline is that lockfiles are committed to Git alongside code; reproducing a result requires checking out the same commit and using the lockfile to install exactly the same dependencies. Without a lockfile, you have a request to the package ecosystem, not a reproducible environment.

The tension between speed and rigour

Environment management has a fundamental tension: stricter tools (Nix, Docker) provide stronger guarantees but slower iteration; looser tools (plain pip) iterate faster but break reproducibility. The practical answer is layered: use uv or Poetry for day-to-day development (fast iteration), commit a lockfile for the reproducibility contract, build a Docker image for production deployment, and reach for Nix only when regulatory requirements demand it. Modern CI pipelines often verify the entire environment chain on every commit, so divergences are caught early.

07

Determinism: Seeds, Hardware, and Numerical Reproducibility

Even with the same code, the same data, and the same environment, two runs can produce different results because of randomness — both intentional (random initialisation, dropout, data shuffling) and unintentional (non-deterministic GPU operations, non-deterministic reduction order, mixed-precision rounding). Achieving full bitwise determinism is hard and sometimes expensive; achieving statistical reproducibility (close-but-not-identical outputs) is more practical and usually sufficient. This section unpacks the distinction.

Sources of randomness

An ML training run draws randomness from several places. Weight initialisation uses the framework's PRNG. Data loading uses Python's `random` and NumPy's PRNG to shuffle batches and apply augmentations. Dropout and stochastic depth draw on the framework's PRNG at every forward pass. Distributed training introduces additional non-determinism through the order in which workers complete reductions. Each source has its own seed; controlling them all requires explicit code:

```python
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)                          # Python stdlib RNG
np.random.seed(SEED)                       # NumPy PRNG
torch.manual_seed(SEED)                    # PyTorch CPU PRNG
torch.cuda.manual_seed_all(SEED)           # PyTorch PRNGs on every visible GPU
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable autotuning, which can pick different kernels per run
torch.use_deterministic_algorithms(True)   # error on ops that lack a deterministic implementation
# Note: some CUDA operations additionally require the CUBLAS_WORKSPACE_CONFIG
# environment variable to be set before determinism is honoured.
```

GPU non-determinism

GPUs introduce non-determinism even with seeds set. The cause is that many GPU operations (especially reductions like sum and mean) execute in parallel and combine partial results in non-deterministic order; floating-point addition is not associative, so the order matters. The PyTorch flag `torch.use_deterministic_algorithms(True)` disables non-deterministic kernels (and errors out when a deterministic implementation is unavailable). The trade-off is performance: deterministic kernels can be 20–40% slower than the non-deterministic defaults. For most research, accepting small numerical differences (validation accuracy ±0.1%) is fine; for regulatory work, full determinism is worth the cost.

Mixed-precision rounding

Modern training uses mixed-precision arithmetic (FP16 or BF16 for the forward and backward passes, FP32 for accumulation and master weights). Mixed precision is not bitwise-reproducible across hardware: A100 and H100 use slightly different rounding behaviour for some operations. The same code on different hardware may produce slightly different gradients, slightly different weight updates, and ultimately different models. Cross-hardware reproducibility usually requires forcing FP32, which gives up the performance benefits of mixed precision. The practical answer is to record the hardware (which tracking systems do automatically) and to test for sensitivity, not to insist on cross-hardware bit-identity.

Distributed training non-determinism

Distributed training (data-parallel, model-parallel, pipeline-parallel) adds another non-determinism source: gradient all-reduce operations sum gradients in a non-deterministic order across workers. The PyTorch DDP `find_unused_parameters=False` setting and explicit gradient bucketing partially address this, but not completely. For full distributed determinism, the operational answer is to fix the worker count, the worker assignment, and the all-reduce algorithm — but this constrains scaling and is rarely worth the cost for research.

Verifying determinism in CI

The discipline of determinism is verified by automation: a CI test that runs a small training job twice and compares outputs. If the bitwise outputs match, full determinism holds. If the validation metrics match within a small tolerance, statistical reproducibility holds. Teams that care about reproducibility run this test on every commit; the failure mode "we accidentally introduced non-determinism" is then caught immediately rather than discovered six months later when an audit fails.
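
A sketch of such a test in pytest style, assuming a hypothetical `train_small_model(seed)` helper that runs a short training job and returns the final weights and validation accuracy:

```python
import numpy as np

def test_training_is_reproducible():
    weights_a, acc_a = train_small_model(seed=0)
    weights_b, acc_b = train_small_model(seed=0)

    # Statistical reproducibility: metrics agree within a small tolerance.
    assert abs(acc_a - acc_b) < 1e-3

    # Full (bitwise) determinism: every weight tensor is exactly identical.
    for wa, wb in zip(weights_a, weights_b):
        assert np.array_equal(wa, wb)
```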

The bitwise-vs-statistical trade-off

The right level of determinism depends on the use case. Bitwise reproducibility is required for regulatory audits (pharmaceutical AI, FDA submissions), for numerical-debugging work where any difference is suspect, and for some scientific publications. Statistical reproducibility (validation metrics within ε across runs) is sufficient for almost all research and most production work. The cost of full bitwise determinism is real (20–40% training-time overhead, restricted hardware portability), so the choice should be deliberate rather than aspirational.

08

Hyperparameter Sweeps and Configuration Management

A modern training run has hundreds of hyperparameters, and most non-trivial work involves running dozens to thousands of variants. Managing this combinatorial explosion has its own tooling layer — configuration management for the input side, hyperparameter-sweep orchestration for the execution side. This section covers both.

Hydra and hierarchical configuration

Hydra (Facebook AI, 2020) is the dominant Python config-management library for ML research. Hydra provides hierarchical YAML configs with dynamic composition: a base config defines defaults, group configs (model, optimiser, dataset) provide swappable variants, and command-line overrides let users specify exact configurations without hand-editing YAML. The Hydra config is type-checked (via Structured Configs), version-controlled, and automatically logged by tracking systems. OmegaConf is the underlying config library; dataclass-based Hydra configs add static type safety.
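
A minimal sketch of a Hydra entry point; the `conf/config.yaml` layout and the `train` function are assumed for illustration:

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # The fully resolved config (defaults plus command-line overrides such as
    # `python train.py optimiser=adamw optimiser.lr=3e-4`) arrives as one object.
    print(OmegaConf.to_yaml(cfg))
    train(cfg)                                # hypothetical training entry point

if __name__ == "__main__":
    main()
```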

Sweep search strategies

Hyperparameter sweeps employ several search strategies. Grid search enumerates all combinations of discrete values; works for small parameter spaces but explodes combinatorially. Random search samples uniformly from the parameter space; surprisingly effective and the standard baseline (Bergstra and Bengio 2012). Bayesian optimisation (Gaussian processes, TPE) builds a surrogate model and searches the most-promising regions; effective for expensive evaluations. Hyperband and ASHA use early-stopping to reallocate compute from underperforming runs; the standard for deep-learning sweeps where most runs can be cheaply identified as bad early.

Sweep orchestration platforms

Sweeps require coordination across many parallel runs. W&B Sweeps (Section 4) is the lowest-friction option for hosted-tracking users. Optuna is the dominant open-source HPO library, with strong Bayesian-optimisation support and minimal dependencies. Ray Tune is a more sophisticated platform integrated with Ray for distributed training; supports advanced strategies (Population-Based Training, ASHA) and is the choice for large-scale work. SMAC3 is a research-focused platform with strong theoretical foundations.
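
A sketch of an Optuna study (TPE-based Bayesian optimisation is its default sampler); the parameter ranges and the `train_and_evaluate` helper are illustrative:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("num_layers", 2, 8)
    # Hypothetical helper: trains a model and returns validation accuracy.
    return train_and_evaluate(lr=lr, num_layers=depth)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```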

The sweep-vs-search trade-off

Hyperparameter optimisation has a recurring methodological question: how much of your compute budget should go to "searching" (running variants) vs. "training" (extending the best variant)? The 2010s answer was 80% search, 20% train; the 2020s LLM era has shifted dramatically toward 5% search, 95% train, because the models are so expensive that even a small search budget costs more than the engineer time needed for careful manual tuning. The practical answer is task-dependent: for research with small models, big sweeps are cheap and worth running; for production-scale LLM training, careful manual hyperparameter selection plus a few large runs is the norm.

Reproducibility within sweeps

Sweeps amplify the reproducibility problem: hundreds of runs each with slightly different hyperparameters, all logged to the same project. The discipline is to ensure that each sweep run is individually reproducible (the seed, code SHA, and config are captured per run) and that the sweep itself is reproducible (the sweep config is committed to Git and tagged in the tracking system). Modern sweep tools handle this automatically; the failure mode is the ad-hoc-bash-loop sweep that does not capture per-run state.

Automation and CI integration

The most-mature MLOps pipelines run sweeps automatically as part of CI: a code change triggers a small "smoke" sweep that verifies the model still trains; an accepted pull request triggers a larger evaluation sweep that updates leaderboards; periodic re-training kicks off a full hyperparameter sweep against the latest data. The sweep results feed into the model registry (Section 3) for promotion decisions. The integration ladder — manual run → tracked run → tracked sweep → CI-gated sweep → automated re-training — is the trajectory most teams move along over time.

09

Operational Workflows and Team Practice

Tools alone do not produce reproducibility. The team-level discipline of using the tools — naming conventions, code-review practices, pre-flight checks, model documentation — is what determines whether a tracking system actually delivers reproducibility or becomes an unloved sea of orphaned runs. This section covers the operational practices that distinguish well-run ML teams from chaotic ones.

Run naming and tagging conventions

A team of ten engineers logging to MLflow or W&B will produce thousands of runs per month. Without naming conventions, the tracking system becomes useless: you cannot find anything. The practical convention is hierarchical tagging: every run gets tags for the experiment goal (e.g. `goal:improve-recall`), the engineer (`author:alice`), the dataset version (`data:v3.2`), the JIRA ticket (`ticket:ML-1234`), and the run type (`run:smoke|full|eval`). Filtering and grouping in the tracking UI then becomes mechanical. Dashboards built on these tags surface the team's current state at a glance.
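
A sketch of the convention using MLflow tags; the keys mirror the examples above and are not a fixed schema:

```python
import mlflow

with mlflow.start_run(run_name="recall-improvement-smoke-01"):
    mlflow.set_tags({
        "goal": "improve-recall",
        "author": "alice",
        "data": "v3.2",
        "ticket": "ML-1234",
        "run": "smoke",
    })
    # ... training and logging ...
```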

The commit-before-train rule

The single most-important team practice is the commit-before-train rule: training scripts refuse to run if the working tree has uncommitted changes, unless explicitly overridden with a flag. The implementation is a few lines in the training script's startup: check `git status`, abort if dirty unless `--allow-dirty` is passed. The practice eliminates the "I forgot to commit my changes" failure mode that produces irreproducible runs, at the cost of a small inconvenience. Teams that adopt this rule see substantial reproducibility improvements.
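
A sketch of the check; the `--allow-dirty` escape hatch and the error message are illustrative:

```python
import argparse
import subprocess
import sys

def assert_clean_worktree(allow_dirty: bool) -> None:
    dirty = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if dirty and not allow_dirty:
        sys.exit("Refusing to train with uncommitted changes; "
                 "commit them or pass --allow-dirty.")

parser = argparse.ArgumentParser()
parser.add_argument("--allow-dirty", action="store_true")
args = parser.parse_args()
assert_clean_worktree(args.allow_dirty)
```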

Pre-flight checks

Before launching an expensive training run (anything taking more than a few hours), the practice is a pre-flight check: run the training loop for one or two iterations, validate that all metrics log correctly, validate that the checkpoint format is correct, validate that downstream evaluation runs. Catching configuration errors before a 24-hour training run starts saves substantial waste. The pre-flight check can be automated in CI for the most-expensive runs and made part of the team workflow for everything else.

Code review for ML

ML code review has its own conventions distinct from general software engineering. The reviewer should ask: is the experiment-tracking instrumentation present? Are the right metrics logged? Is the dataset version pinned? Is the seed controlled? Are randomness sources accounted for? Is the configuration externalised rather than hardcoded? Is the training command idempotent? These questions are the equivalent of "are there tests?" and "is the error-handling correct?" in general software review. Teams that explicitly include these in PR templates have substantially better reproducibility.

Model cards and documentation

Beyond the per-run tracking record, models that ship to production benefit from model cards (Mitchell et al. 2019): structured documentation describing the model's intended use, training data, evaluation results, known limitations, and risk considerations. Model cards are the human-readable analogue of the tracked-run record; they provide the context that an audit or a downstream user needs. The discipline is to require a model card before a model can be promoted to production status in the registry.

Onboarding and knowledge transfer

The often-overlooked benefit of disciplined tracking is team-level knowledge transfer. A new team member should be able to browse historical runs, understand what's been tried, see which approaches worked and which didn't, and reproduce key results without help from the original author. The practice that makes this work is descriptive run names and dashboards documenting key findings. Without this, the standard team-onboarding tax (months to come up to speed) compounds with the reproducibility tax (everything has to be redone). With it, a new engineer is productive within days.

10

The Frontier and the Operational Question

Experiment tracking and reproducibility are mature operational disciplines in 2026, but several frontiers remain active. The scale of frontier-LLM training has outgrown traditional tracking patterns; regulatory pressure (the EU AI Act, FDA AI/ML guidance) is pushing reproducibility from best practice to legal requirement; data-and-model provenance for downstream auditing is becoming a substantial methodological topic. This section traces the open questions and the directions the field is moving in.

The LLM-scale tracking problem

Training a frontier LLM produces a single run that consumes weeks of compute on tens of thousands of GPUs and produces a model whose value is hundreds of millions of dollars. Traditional tracking tools were designed for many small runs; they do not handle the different operational shape of one massive run with millions of metric points and thousands of distinct telemetry streams. Custom tracking infrastructure is the norm at frontier labs (Anthropic, OpenAI, DeepMind, the various open-source-leaderboard groups). The 2024–2026 work on standardising this — including the open-source ML telemetry projects and the various large-scale-training observability papers — is moving toward more-mature shared tooling.

Regulatory provenance

The EU AI Act (entered into force 2024, full enforcement 2026) requires demonstrable provenance for high-risk AI systems: complete records of training data, model versions, evaluation results, and deployment decisions. This regulatory pressure is pushing reproducibility from best practice to legal requirement. Tools that produce tamper-evident audit trails (cryptographic hashes, append-only logs, signed attestations) are increasingly important. The MLflow registry and W&B audit logs are evolving in this direction; the 2025–2027 work on regulatory-grade ML provenance is a major industry theme.

Data lineage at scale

For models trained on broad data (web crawls, customer interaction data, multi-source pipelines), data lineage — the complete record of which data flowed into which model — is increasingly hard to maintain and increasingly demanded by auditors. The 2024–2026 work on lineage tools (OpenLineage, the various Datahub-class projects, the Marquez metadata platform) is moving toward a unified standard, but adoption is uneven. The frontier is automated lineage: code instrumentation that captures data flows without requiring engineer intervention.

Reproducibility for non-deterministic AI agents

The rise of LLM-based agents (Part XI) has introduced a new reproducibility challenge: agent behaviour depends on the underlying LLM, which evolves over time as model providers update their endpoints. Reproducing an agent run from six months ago may be literally impossible if the underlying model has been deprecated. The 2025–2026 work on agent reproducibility — frozen model snapshots, deterministic LLM serving, and replay frameworks — is methodologically distinctive and is an active topic.

Reproducibility-as-a-service

A growing class of products positions reproducibility as their primary value proposition. Cradle and DAGsHub integrate Git, DVC, and tracking into a single managed experience. Modal and Replicate let you "snapshot" a training environment and re-run later. The ecosystem is consolidating, and the next 2–3 years will see further convergence around an opinionated default stack — likely some combination of uv (environments), DVC or lakeFS (data), and one of MLflow/W&B (tracking).

What this chapter has not covered

Several adjacent areas are out of scope. The substantial software-engineering discipline of testing ML code (unit tests for data pipelines, regression tests for trained models) is touched only briefly; Ch 05 (CI/CD for ML) develops it. Model deployment and serving (Ch 03) and monitoring (Ch 04) are out of scope. The statistics of experiment design (when to stop a sweep, when an improvement is significant) are a topic of their own. The chapter focused on the operational substrate of tracking and reproducibility; the broader MLOps landscape is genuinely vast, and the rest of Part XVI develops adjacent topics.

Further reading

Foundational papers and references for experiment tracking and reproducibility. The MLflow paper, the W&B documentation, the DVC documentation, Sculley et al. on technical debt in ML, the model cards paper (Mitchell et al.), the Reproducibility Checklist (Pineau et al.), and the various 2020–2024 reproducibility-crisis surveys form the right starting kit. The PyTorch determinism documentation and the documentation for the various lockfile tools are essential operational references.