A model that works only in its author's notebook is, for practical purposes, a model that does not exist. Clean code, testing, design patterns, documentation — the classical software-engineering disciplines — are what turn an isolated experiment into a system another person can read, trust, change, and ship. This chapter is a field guide to the engineering habits that matter most inside an ML codebase, where the usual rules are the same but the pressure to skip them is unusually strong.
The first six sections are about the shape of code — how it is named, organised into modules, structured into patterns, and refactored when it drifts out of shape. Sections seven through eleven are about keeping code honest: error handling, testing, types, and logging. Sections twelve through fifteen are about the ecosystem a codebase lives in — documentation, configuration, packaging, and the CI and review disciplines that keep a multi-author project healthy. The final section steps back and asks where all this compounds, and why it matters more in ML than almost anywhere else.
Conventions: Python examples assume 3.11+; package-manager and tooling names refer to the current (2026) state of the ecosystem. We prefer the pragmatic over the doctrinaire — the point of this chapter is not to advocate for any one style but to give you the vocabulary and the habits that hold up across projects, languages, and teams.
A model that is beautiful in a notebook and impossible to run on anyone else's laptop is, for practical purposes, a model that does not exist. Software-engineering discipline is what turns a clever idea into an artefact someone else — including your six-months-older self — can read, trust, change, and ship.
The overwhelming majority of production ML code is glue. It reads rows from somewhere, reshapes them, runs a model, writes scores somewhere else. The modelling step — the one research papers are about — is rarely the largest block of code, and almost never the one that breaks. The things that break are configuration, data contracts, versioning, retries, logging, deployment. These are the concerns of software engineering, and ignoring them is the single most reliable way to stall an ML project.
This chapter is not a comprehensive course. It is a working field guide to the engineering habits that matter most inside a data-science or ML codebase. We cover the traditional canon — clean code, testing, design patterns, documentation — but always through the lens of what actually goes wrong in ML work: nondeterminism, heavy dependencies, notebooks, data-schema drift, long training runs, and teams where half the authors have never written a unit test.
Software engineering in ML is not about making notebooks look like production Java. It is about recognising which parts of your code change slowly (library interfaces, data schemas, training loops) and which change quickly (hyperparameters, experiment configs, plots), and putting the effort of good engineering where it compounds — into the slow parts.
Most "clean code" advice boils down to a single observation: code is read ten times more often than it is written, so optimise for the reader. That reader is almost always a future version of you, struggling to remember what an eight-line list comprehension was meant to compute.
The single most impactful stylistic habit is naming. A variable called df communicates nothing; monthly_orders tells you the unit of the row and the granularity of the time axis. A function called process hides its contract; filter_invalid_users announces it. Pick names that describe what a value represents, not how it was produced: scored_candidates, not result_of_step_3. And resist the urge to abbreviate — an extra six characters of typing now saves six minutes of reading later.
Long functions are where bugs hide. A function that fetches data, cleans it, scores it with a model, and writes the results has four reasons to change and four places to break. Split it. The heuristic is Robert Martin's: a function should be small enough that you can give it a descriptive name covering what it does — if the name needs an "and," it is two functions.
Good function boundaries also make testing tractable. A function that takes a pure data frame and returns a pure data frame is trivial to unit-test; one that reads from S3, calls a model service, and mutates a database is not. The presence of a good test often indicates the presence of good structure, and vice versa.
Prefer local scope over global, immutable inputs over mutated state, and explicit arguments over reaching into globals. A function that silently mutates a list passed into it, or that depends on a module-level config dictionary, is a landmine — every caller becomes a potential cause of its bugs. The rule that makes this practical is: a function's behaviour should be completely determined by its arguments and its return value should be its only effect.
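A minimal sketch of that rule, using a hypothetical drop_low_scores helper: the first version mutates its argument and reaches for module state; the second is determined entirely by its arguments and returns a new list.

```python
# Risky: mutates the caller's list and depends on hidden module state.
DEFAULT_THRESHOLD = 0.5  # module-level config a careless function might reach for

def drop_low_scores_inplace(rows: list[dict]) -> None:
    rows[:] = [r for r in rows if r["score"] >= DEFAULT_THRESHOLD]  # surprise!

# Better: behaviour fully determined by arguments; the return value is the only effect.
def drop_low_scores(rows: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return only the rows whose 'score' meets the threshold."""
    return [r for r in rows if r["score"] >= threshold]

rows = [{"score": 0.9}, {"score": 0.1}]
kept = drop_low_scores(rows, threshold=0.5)
assert len(kept) == 1 and len(rows) == 2  # original input untouched
```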
If you cannot describe a function in a single sentence starting with a verb, it is doing too much. Split until you can.
A system is only as maintainable as its boundaries are clear. A module's public interface — the handful of functions, classes, and constants a caller is expected to use — is the contract you have to keep stable; everything else is private, and can change without warning.
A good module has a single responsibility, a small public surface, and no hidden coupling to the rest of the codebase. In Python, the convention is a folder with an __init__.py that re-exports the public names, leaving the internal files free to be restructured. Inside, each file should be organised around one noun — a scorer.py, a schema.py, a registry.py — rather than around a grab-bag of verbs.
Two module-level properties determine how a codebase ages. Cohesion is how closely the pieces inside a module belong together — high is good. Coupling is how tightly one module depends on the internals of another — low is good. High cohesion and low coupling let you change a module's implementation without touching its callers; the opposite is a codebase where every change reveals a web of unexpected dependencies.
Inside any non-trivial system, decide which modules are "core" (they define the domain and depend on nothing) and which are "shell" (they adapt the core to the outside world — databases, HTTP, file systems, model registries). Arrows of dependency should point inwards: the shell knows about the core, but the core never imports anything from the shell. This is the essence of "hexagonal" or "ports-and-adapters" architecture, and it is the single most useful idea from enterprise software that transfers cleanly to ML codebases.
If import-ing your scoring module pulls in AWS clients, database drivers, and your whole experiment tracker, the scoring module is not a scoring module — it is everything, and it will be untestable. Push I/O to the edges.
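The ports-and-adapters shape can be sketched in a few lines. The names here (ScoreStore, score_users, InMemoryStore) are illustrative: the core declares what it needs as a Protocol and imports nothing from the shell, while the shell supplies an adapter at the edge.

```python
from typing import Protocol

# --- core: pure domain logic, no I/O imports ---
class ScoreStore(Protocol):
    """Port: the only thing the core knows about the outside world."""
    def save(self, user_id: str, score: float) -> None: ...

def score_users(features: dict[str, float], store: ScoreStore) -> None:
    for user_id, x in features.items():
        store.save(user_id, round(2 * x, 3))  # stand-in for a real model

# --- shell: adapters live at the edge (a DB-backed one would go here too) ---
class InMemoryStore:
    def __init__(self) -> None:
        self.rows: dict[str, float] = {}
    def save(self, user_id: str, score: float) -> None:
        self.rows[user_id] = score

store = InMemoryStore()
score_users({"u1": 0.4}, store)
assert store.rows == {"u1": 0.8}
```

Because the core depends only on the port, the unit test above needs no database at all.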
The SOLID acronym is older than modern Python and steeped in Java idioms, but three of its five principles remain genuinely useful when applied with restraint, and the other two are at least worth recognising in code review.
The five principles, each in one sentence:

Single responsibility — a class or module should have only one reason to change.

Open/closed — software entities should be open for extension but closed for modification.

Liskov substitution — a subtype must be usable anywhere its parent type is expected, without surprising the caller.

Interface segregation — many small, client-specific interfaces beat one large general-purpose one.

Dependency inversion — depend on abstractions, not concretions; high-level logic should not import low-level details.
Single responsibility and dependency inversion are the two that pay off daily. A Scorer class that loads a model, preprocesses features, scores, and logs is four classes pretending to be one; the day you want to swap the model backend for a remote service, you will rewrite all four. A training loop that takes an optimiser object rather than importing torch.optim.Adam directly lets you test the loop with a fake optimiser and swap the real one at configuration time.
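The optimiser example can be made concrete without any framework. This is a toy sketch — the SGD class stands in for torch.optim.SGD, and the loop works with scalar "gradients" — but the inversion is the real point: the loop depends on an Optimizer protocol, not on a concrete import.

```python
from typing import Protocol

class Optimizer(Protocol):
    def step(self, grads: list[float]) -> None: ...

class SGD:
    """Toy stand-in for a real optimiser: params -= lr * grads."""
    def __init__(self, params: list[float], lr: float) -> None:
        self.params, self.lr = params, lr
    def step(self, grads: list[float]) -> None:
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g

def train_loop(grads_per_batch: list[list[float]], optimizer: Optimizer) -> None:
    # The loop never imports a concrete optimiser; it is injected by the caller.
    for grads in grads_per_batch:
        optimizer.step(grads)

params = [1.0]
train_loop([[0.5], [0.5]], SGD(params, lr=0.1))
assert abs(params[0] - 0.9) < 1e-9
```

In a test, a fake recording optimiser can be passed in the same slot to verify the loop's call pattern without running any real maths.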
Open/closed and Liskov matter in moderation: when you find yourself in a chain of if isinstance(x, ...) branches, a polymorphic interface is often the refactor that wants to happen. Interface segregation, applied aggressively, tends to over-engineer Python code, where duck typing already solves most of what the principle addresses in statically-typed languages.
Do not treat SOLID as a checklist. Treat it as five names for patterns you have already felt — the frustration of a class that does too much, the pain of a subclass that lies about its parent's contract, the unit test that cannot be written because the code reaches for the real database. When the pain appears, the principle tells you what to fix.
The Gang of Four catalogue lists twenty-three patterns; a working ML engineer uses maybe six of them regularly, and the rest are worth knowing by name so you can recognise them when they appear in someone else's code.
Strategy — swap one algorithm for another behind a shared interface. A Sampler abstract class with concrete RandomSampler, WeightedSampler, and StratifiedSampler implementations is the canonical example, and it is exactly how PyTorch's DataLoader accepts its sampler argument.
Factory — centralise object construction so callers specify what they want, not how it is built. model = build_model(config) hides the ten lines of "if it's a ResNet, do this; if it's a Transformer, do that"; every ML framework has one somewhere.
Adapter — wrap an unfriendly interface in a friendly one. Your training code wants something with .fit(X, y) and .predict(X); a vendor library gives you something with .train_model() and .inference(). A thin adapter class reconciles them without spreading the vendor's vocabulary through your codebase.
Pipeline — a sequence of stages, each taking the output of the previous one; a close relative of the Gang of Four's "chain of responsibility," though in a pipeline every stage runs rather than a single handler electing to handle the request. sklearn.pipeline.Pipeline, spaCy's Language, Hugging Face's pipeline() — all the same pattern: a list of transformers whose __call__ is composed.
Observer / callback — register handlers that fire on events. PyTorch Lightning's callbacks, Keras callbacks, and MLflow's autologging all use this pattern to let you tap into the training loop without editing it.
Singleton — one shared instance of a thing, often a model or a connection pool. In Python this is just a module-level variable with a lazy initialiser; no class needed.
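The Strategy pattern from the list above, sketched with the sampler example (class and function names here are illustrative, not any framework's actual API):

```python
import random
from abc import ABC, abstractmethod

class Sampler(ABC):
    """Shared interface: every strategy answers the same question."""
    @abstractmethod
    def sample(self, items: list, k: int) -> list: ...

class RandomSampler(Sampler):
    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
    def sample(self, items: list, k: int) -> list:
        return self.rng.sample(items, k)

class FirstKSampler(Sampler):
    """Deterministic strategy, handy in tests."""
    def sample(self, items: list, k: int) -> list:
        return items[:k]

def make_batch(items: list, k: int, sampler: Sampler) -> list:
    return sampler.sample(items, k)  # the caller never knows which strategy ran

assert make_batch([1, 2, 3, 4], 2, FirstKSampler()) == [1, 2]
assert len(make_batch([1, 2, 3, 4], 2, RandomSampler(seed=42))) == 2
```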
God class — one class that does everything; common in early-stage ML code, where a Model class loads data, trains, evaluates, serialises, and serves. Shotgun surgery — a change to one logical feature that requires editing ten files; usually a sign that a cross-cutting concern (logging, metrics, retries) should be extracted. Primitive obsession — everything is a str or a dict; named types (dataclasses, enums, Protocols) pay back the typing within a week.
Refactoring is the disciplined practice of improving a codebase's structure without changing its behaviour. Done well, it is almost invisible in the diff of any single commit, and yet it is the single activity that most distinguishes senior engineers from junior ones.
A refactor is safe only if it is covered by tests. The workflow that makes refactoring routine is: write a test for the behaviour you want to preserve, confirm it passes, apply one mechanical change — extract a function, rename a variable, inline a class — and re-run the test. Repeat until the shape is what you wanted. The tests are the scaffolding that lets you move fast without breaking things; without them, every "improvement" is a gamble.
Martin Fowler's catalogue is the standard reference, but a handful cover most cases: extract function (pull a block of related lines into a named function); inline variable (delete a one-use local whose name adds nothing); introduce parameter object (group four arguments that always travel together into a dataclass); replace conditional with polymorphism (turn an if-chain over types into a call on a virtual method); move function to module (a function that touches no state of its owning class wants to be a free function elsewhere).
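"Introduce parameter object" is the easiest of these to show in miniature. A hypothetical before-and-after: four arguments that always travel together become one frozen dataclass, and the behaviour (checked by the test scaffolding around the refactor) stays identical.

```python
from dataclasses import dataclass

# Before: four arguments that always travel together.
def train_v1(lr: float, batch_size: int, epochs: int, seed: int) -> str:
    return f"lr={lr} bs={batch_size} epochs={epochs} seed={seed}"

# After "introduce parameter object": one named, typed, immutable bundle.
@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-3
    batch_size: int = 32
    epochs: int = 10
    seed: int = 0

def train_v2(cfg: TrainConfig) -> str:
    return f"lr={cfg.lr} bs={cfg.batch_size} epochs={cfg.epochs} seed={cfg.seed}"

# The behaviour-preserving check that makes the refactor safe:
assert train_v1(1e-3, 32, 10, 0) == train_v2(TrainConfig())
```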
A code smell is a surface pattern that usually indicates a deeper problem — duplicated code, long functions, long parameter lists, data clumps, feature envy. They are not bugs; they are hints. Long functions smell because they fuse responsibilities; duplicated code smells because a future change will need to happen in two places and will forget one. The skill is spotting the smell early, naming it, and doing the small refactor before the bug it is hinting at actually lands.
Leave every file you touch slightly better than you found it — rename one variable, extract one helper, delete one dead import. Over a year of commits, a whole codebase becomes easy to read.
Any non-trivial program spends a surprising fraction of its code handling things that went wrong. In ML systems, the ratio is often higher — pipelines fail on malformed inputs, model servers time out, training runs crash five hours in. Good error handling is less about writing try/except and more about deciding, up front, which failures you recover from and which you let propagate.
Python's preferred mechanism is exceptions — raise when something goes wrong, let it travel up the call stack until a caller who can do something sensible catches it. The Pythonic rule is "easier to ask forgiveness than permission": attempt the operation, handle the failure. The alternative — returning sentinel values or error codes — works in the small but quietly complicates every function signature as the set of failures grows.
The single most common error-handling bug is the bare except: — it catches everything, including KeyboardInterrupt and programmer errors like NameError, and it reliably produces systems that ignore the bugs you most wanted to see. Catch the specific exception type you intend to handle, let everything else propagate, and write a log line that includes enough context to make the failure diagnosable.
A pipeline that silently skips malformed rows today produces a model that trains on half its data tomorrow. Validate inputs at the boundary — on ingest, before training, before serving — and fail immediately with a descriptive error. Validation libraries (pydantic, marshmallow, pandera for data frames) make this almost free. The cost of a loud failure at 3 AM is a page to an engineer; the cost of a silent failure is a week of wrong results.
Distributed systems fail transiently — a network blip, a rate-limited API, a restarted node. Retry with exponential backoff and a jitter term; give up after a bounded number of attempts. But do not retry on errors that will not succeed on retry — a malformed request, an authentication failure, an out-of-memory crash. Libraries like tenacity make the right patterns one decorator away.
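The shape of the pattern, hand-rolled with the standard library (in real code a library like tenacity expresses the same thing as a decorator; the function and parameter names below are this sketch's own): exponential backoff with jitter, a bounded attempt count, and retries only on error types known to be transient.

```python
import random
import time

def retry_with_backoff(fn, *, attempts: int = 5, base: float = 0.1, cap: float = 2.0,
                       retry_on: tuple = (ConnectionError, TimeoutError)):
    """Call fn(); on a transient error, sleep base * 2**i plus jitter, then retry."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise  # bounded: give up after the last attempt
            time.sleep(min(cap, base * 2 ** i) * (1 + random.random()))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient blip")
    return "ok"

assert retry_with_backoff(flaky, base=0.001) == "ok" and calls["n"] == 3
```

A malformed request would raise something outside retry_on and propagate immediately, which is exactly the desired behaviour.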
Write errors to be read. An exception message that says "ValueError: invalid literal" helps no one; "ValueError: column 'user_id' contains non-numeric values at rows [12, 88, 309]" pinpoints the cause. The cost of the good message is one line; the benefit is paid every time something goes wrong.
A test is a small program that exercises a slice of your code and asserts that its behaviour matches an expectation. Tests are the only mechanism that scales to verify a codebase written by many hands over years; every other form of quality assurance — code review, static analysis, manual QA — supplements them but cannot replace them.
The classic picture, Mike Cohn's test pyramid, arranges tests by granularity. Many fast unit tests at the base, fewer integration tests in the middle, a handful of end-to-end tests at the top. The ratio matters: unit tests are cheap, fast, and localised in blame; end-to-end tests are slow, brittle, and often fail without telling you which component broke.
A unit test exercises one function, with concrete inputs, and asserts one expected outcome. It is named after the behaviour it verifies (test_scorer_rejects_empty_input, not test_1), runs in milliseconds, and does not touch the network, the file system, or a database. The standard library's unittest works, but most modern Python codebases use pytest for its less verbose assertion style and rich fixture system.
The conventional structure of a test has three phases: arrange the inputs and fakes, act by calling the code under test, assert on the result. Keeping the three phases visible makes the test readable at a glance; mixing them is how test files become as hard to maintain as the code they verify. Favour many small tests — one behaviour each — over a single sprawling one that asserts twelve things, because when the sprawling one fails you learn only that something, somewhere is broken.
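The three phases in miniature, against a hypothetical reject_empty scorer (the functions are called directly here; pytest would collect anything named test_* automatically):

```python
def reject_empty(batch: list[float]) -> list[float]:
    """Hypothetical function under test: normalise by the batch max."""
    if not batch:
        raise ValueError("empty batch")
    return [x / max(batch) for x in batch]

def test_scorer_normalises_to_unit_max():
    batch = [2.0, 4.0]            # arrange
    scored = reject_empty(batch)  # act
    assert scored == [0.5, 1.0]   # assert

def test_scorer_rejects_empty_input():
    try:
        reject_empty([])
        assert False, "expected ValueError"
    except ValueError:
        pass

test_scorer_normalises_to_unit_max()
test_scorer_rejects_empty_input()
```

With pytest installed, the second test reads more cleanly as `with pytest.raises(ValueError): reject_empty([])`.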
When the code under test depends on something heavy — a database, an HTTP service, a model — you replace that dependency with a test double. A fake is a lightweight working implementation (an in-memory store instead of Postgres); a mock is a recording stand-in that lets you assert which calls happened. Prefer fakes when you can, because they exercise more of your code; reach for mocks when the dependency is truly irreplaceable in a test.
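The distinction in code, with a hypothetical feature store: the fake is a genuine in-memory implementation that exercises the read-then-write behaviour; the mock only records that the calls happened.

```python
from unittest.mock import Mock

class FakeFeatureStore:
    """A fake: a small but real, working implementation."""
    def __init__(self) -> None:
        self._data: dict[str, float] = {}
    def put(self, key: str, value: float) -> None:
        self._data[key] = value
    def get(self, key: str) -> float:
        return self._data[key]

def double_feature(store, key: str) -> None:
    store.put(key, store.get(key) * 2)

# Fake: the code under test runs against real (in-memory) behaviour.
fake = FakeFeatureStore()
fake.put("ctr", 0.1)
double_feature(fake, "ctr")
assert fake.get("ctr") == 0.2

# Mock: a recording stand-in that lets you assert which calls happened.
mock = Mock()
mock.get.return_value = 0.1
double_feature(mock, "ctr")
mock.put.assert_called_once_with("ctr", 0.2)
```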
If you are afraid to change a piece of code, it needs tests. If you are tempted to delete a test because it "always fails," it either caught a real bug or is testing behaviour that is no longer real — either way, investigate before deleting.
ML code resists testing for three reasons: it is stochastic, its inputs are high-dimensional data rather than discrete cases, and its correctness is statistical rather than logical. None of these make testing impossible — they change what you should test.
A unit test should not assert "this model achieves 92% accuracy." Accuracy is a property of model, data, and random seed — it is not what your code is supposed to guarantee. Instead, test the pipeline: that features are computed correctly from a fixed input, that the training loop runs for one epoch without error, that the model saves and loads to the same predictions, that the scoring function handles an empty batch without crashing. These are the things your code actually determines.
For functions that operate on arrays, property-based testing is a natural fit. Instead of asserting specific outputs, assert properties that must hold: a softmax's outputs sum to one; a normalised vector has unit L2 norm; a matrix inverse of an inverse is the original matrix (to floating-point tolerance); a shuffle preserves the multiset of elements. Libraries like hypothesis generate adversarial inputs — tiny arrays, large arrays, arrays with NaNs and infinities — and will find the edge cases your hand-written tests missed.
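The softmax property can be checked with a hand-rolled generator loop — hypothesis would produce the random inputs (and shrink failing ones) for you, but the shape of the test is the same. Pure standard library here, so softmax works on plain lists:

```python
import math
import random

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Property check over many generated inputs, including extreme magnitudes.
rng = random.Random(0)
for _ in range(200):
    xs = [rng.uniform(-1e3, 1e3) for _ in range(rng.randint(1, 50))]
    out = softmax(xs)
    assert abs(sum(out) - 1.0) < 1e-9          # outputs sum to one
    assert all(0.0 <= p <= 1.0 for p in out)   # each is a valid probability
```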
A test that is flaky because the model initialisation is random is a bad test, but you do not fix it by not testing. You fix it by seeding: set the seeds for random, numpy.random, torch, and tf at the top of the test, and assert deterministic outputs. In production code, the same seeding discipline produces reproducible experiments — and makes a failed run worth debugging rather than worth re-running and hoping.
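A seed_everything helper (the name is a common convention, not a standard-library function) makes the discipline one call; the numpy and torch branches are guarded so the sketch runs anywhere:

```python
import random

def seed_everything(seed: int) -> None:
    """Seed every RNG the project uses; heavy libraries are optional."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
assert [random.random() for _ in range(3)] == first  # the run is deterministic
```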
Most production ML failures are data failures, not model failures. A column changed type; a source feed started sending nulls; a joining key drifted. Data tests — written against expectations, not specific values — catch these early. Tools like great-expectations, pandera, and deepchecks let you assert schemas, value ranges, and distributional properties of your data, and fail the pipeline when they drift. Treat these tests as first-class; they are often more valuable than model tests.
In an ML codebase, you are testing the scaffolding — data, features, loop, I/O, serialisation. The model's accuracy is monitored, not unit-tested. The two do different jobs; collapsing them into "my test asserts 90% accuracy" produces tests that are slow, brittle, and silent about real regressions.
Python's type hints are optional and unenforced by the runtime, but a modern codebase that uses them thoughtfully catches a whole class of bugs before they ship and renders code readable without opening the calling file.
A typed signature — def score(items: list[Item]) -> dict[str, float] — documents the contract at the point it matters most: the call site. A type checker (mypy, pyright) then verifies that every caller passes the right thing and every body returns the right thing, and most IDEs use the same information for autocomplete and go-to-definition. The net effect is that large-scale refactors become possible: rename a class, and the tooling tells you every line that needs updating.
Start at module boundaries: every exported function should have fully typed arguments and return. Internal helpers can follow, but it is fine for glue code to stay loosely typed. Use dataclasses for record types — they replace the dict-of-strings idiom with something an IDE can reason about. Use typing.Protocol for structural types: "anything with a .predict method" is a Predictor, without requiring inheritance. Use typing.Literal to pin enum-like strings, and Annotated with libraries like pydantic for runtime validation on top of static types.
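The three tools together, in a sketch with illustrative names (Item, Predictor, ConstantModel): a frozen dataclass replaces the dict-of-strings, Literal pins the allowed split names, and a Protocol types "anything with .predict" structurally.

```python
from dataclasses import dataclass
from typing import Literal, Protocol, runtime_checkable

@dataclass(frozen=True)
class Item:
    item_id: str
    split: Literal["train", "valid", "test"]  # the type checker rejects other strings
    score: float

@runtime_checkable
class Predictor(Protocol):
    """Structural type: anything with a .predict method, no inheritance required."""
    def predict(self, items: list[Item]) -> dict[str, float]: ...

class ConstantModel:  # never mentions Predictor, yet satisfies it
    def predict(self, items: list[Item]) -> dict[str, float]:
        return {it.item_id: 0.5 for it in items}

def score(items: list[Item], model: Predictor) -> dict[str, float]:
    return model.predict(items)

items = [Item("a", "train", 0.0)]
assert isinstance(ConstantModel(), Predictor)  # structural match at runtime too
assert score(items, ConstantModel()) == {"a": 0.5}
```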
Type checking is only one flavour of static analysis. Linters (ruff, flake8) catch unused imports, shadowed names, and dubious constructs. Formatters (black, ruff format) remove bikeshedding over spaces-and-line-breaks by having the tool decide. Security scanners (bandit, semgrep) find known dangerous patterns — SQL injection, pickle deserialisation of untrusted input, shell=True subprocess calls. Wire all of them into a pre-commit hook and they run on every commit, automatically.
Turn on the strictest reasonable mypy settings in new code (strict = true, or at least disallow_untyped_defs) from day one. Adding types to an existing untyped codebase is a multi-week project; writing them as you go is free.
The first thing a production engineer does when a system breaks is read the logs. If the logs are missing, empty, or unreadable, the system is effectively a black box and debugging becomes guesswork. Good logging is not decorative; it is the primary way a running program talks to the humans responsible for it.
Replace print with logging.getLogger(__name__). The standard logging module gives you levels (DEBUG, INFO, WARNING, ERROR, CRITICAL), structured output, and per-module configuration — so you can turn up the verbosity on a single module during an incident without flooding the rest. Emit at INFO what the system is doing, at WARNING what it had to compensate for, at ERROR what failed, and at DEBUG the details you will need if ERROR fires.
In a distributed system, logs end up in a central aggregator (Datadog, Splunk, CloudWatch, Loki). If every log line is a free-text sentence, searching and filtering them is painful. If every line is a JSON object with named fields — request_id, user_id, latency_ms, model_version — queries are mechanical. Libraries like structlog or loguru make structured logging the path of least resistance.
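What structured output looks like, hand-rolled with the standard library — a minimal sketch, since structlog or loguru would handle the formatting and field-binding for you (the "fields" extra-key convention is this example's own):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal structured formatter: one JSON object per log line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

buffer = io.StringIO()  # stands in for stdout / the aggregator
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("scoring")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("scored batch", extra={"fields": {"request_id": "r-1", "latency_ms": 34}})

line = json.loads(buffer.getvalue().splitlines()[0])
assert line["request_id"] == "r-1" and line["level"] == "INFO"
```

Every named field is now mechanically queryable in the aggregator, which is the whole point.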
Modern observability divides signals into three: logs (text events), metrics (numeric time series), and traces (spans across a distributed request). Each answers different questions — logs tell you what happened, metrics tell you how often and how fast, traces tell you where time went. A serious ML service needs all three: Prometheus or Datadog for request rates and latencies, OpenTelemetry for traces, and structured logs for the details. Do not try to rebuild one out of the others.
Log the inputs and outputs of major pipeline stages, shapes of tensors at key points, hyperparameters at the start of a run, and the git commit and configuration hash so that any log line is traceable to exact code and data. For model serving, log the model version alongside each prediction — debugging "why did this user get this score" is impossible without knowing which model produced it.
A log line should answer, at minimum, three questions: what happened, in what context, and when. "Scored 128 items" is worthless; "Scored 128 items for user 42 in 34 ms with model v0.7.3" is a debugging tool.
Documentation is the interface between the people who wrote a system and the people who must keep it running — including the authors, six months later, who have forgotten. Code tells you what the system does now; good documentation tells you what it is supposed to do, and why.
A codebase needs documentation at three levels of granularity. The finest is the docstring on every public function, class, and module — one line describing what it does, followed by arguments, return value, and exceptions if any. The middle is the README — how to install, run, and contribute; the five-minute onboarding of a new team member. The coarsest is the architecture document — the shape of the system, the major components, the data flow — often written as a diagram and a page of prose, updated when it changes.
The one documentation artefact most teams under-use is the Architecture Decision Record — a short document, numbered and dated, describing a decision, the context that forced it, the alternatives considered, and the trade-offs accepted. ADRs are cheap to write, immensely valuable two years later when someone asks "why did we pick Postgres?", and they survive team turnover in a way that tribal knowledge does not. A directory of docs/adr/0007-feature-store.md is one of the highest-leverage documentation investments available.
A picture of the component graph — boxes for services, arrows for data flow — is worth a thousand words, especially for engineers joining the project. Draw it. Keep it in the repo, as Mermaid source or an SVG. Update it when the architecture changes, and link to it from the README. An out-of-date diagram is worse than no diagram; a current one is the single fastest path to understanding a system.
The audience for your documentation is not the current team — you all already know. The audience is the engineer who joins in eighteen months. Write for them, and your future self will thank you.
The defining property of a well-engineered system is that the same code runs unchanged in development, staging, and production — and the only thing that differs is configuration. Getting this separation right is harder than it sounds, and getting it wrong is the cause of most "works on my machine" incidents.
The principle that has aged best is from the Twelve-Factor App: store configuration in the environment, not in the code, and not in code-adjacent files committed to the repository. Environment variables are the lowest common denominator — they work in every language, in every deploy target, and separate cleanly from source control. For anything non-trivial, parse them at startup into a typed object; for simple cases, read them directly.
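Parsing the environment into a typed object at startup can look like this — a sketch with hypothetical variable names (MODEL_PATH, BATCH_SIZE, DEBUG); libraries like pydantic-settings offer the same pattern with validation built in. Taking the environment as an argument keeps the loader testable:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Parsed once at startup; the rest of the code never touches os.environ."""
    model_path: str
    batch_size: int
    debug: bool

def load_settings(env: dict[str, str]) -> Settings:
    return Settings(
        model_path=env.get("MODEL_PATH", "models/latest.pt"),
        batch_size=int(env.get("BATCH_SIZE", "32")),
        debug=env.get("DEBUG", "0") == "1",
    )

settings = load_settings({"BATCH_SIZE": "64"})  # pass os.environ in production
assert settings.batch_size == 64 and settings.debug is False
```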
API keys, database passwords, and model credentials must never be committed to a repository — not in code, not in a config file, not even in a comment. Use a secret manager (AWS Secrets Manager, Google Secret Manager, HashiCorp Vault) or, minimally, a .env file that is listed in .gitignore and loaded at startup. Tools like git-secrets and trufflehog scan commits for credential patterns; run them in CI. A leaked secret costs real money and real time; a twenty-line safeguard is cheaper.
ML has an additional twist: every experiment has its own configuration — model class, optimiser, learning rate, data slice, seed — and two experiments that differ in one field must be easy to compare. Tools built for this (Hydra, OmegaConf, the newer pydantic-settings) let you compose configs from YAML files, override from the command line, and serialise the full resolved config alongside the run's artefacts. The rule is: every saved model must come with the exact config that produced it, so reproducing a result is a python train.py --config=run_2026_04_12.yaml, not an archaeology expedition.
If you cannot check out a commit from six months ago and reproduce the result, your configuration discipline is not strong enough. The fix is to treat config as code and artefacts: versioned, reviewed, and saved alongside the outputs.
A Python project's dependency situation is, at best, manageable. At worst, it is the reason a model that worked last Tuesday no longer runs today. Discipline around environments and versions is not a nice-to-have in ML — with NVIDIA drivers, PyTorch builds, and CUDA versions in the mix, it is the cost of doing business.
Pick one tool per project and commit to it. pip with venv and a pyproject.toml is the standard baseline; poetry, pdm, and the newer uv (from Astral) add lockfiles, dependency groups, and fast solvers. conda remains relevant when you need non-Python binaries (CUDA toolkit, MKL). The wrong choice is to mix them — a project half-managed by pip and half by conda will eventually corrupt in subtle ways.
Every project needs two files: a loose declaration of direct dependencies (pyproject.toml) and a lock file that pins every transitive dependency to an exact version (poetry.lock, uv.lock, requirements.txt produced by pip-compile). Only the lock file guarantees reproducibility; without it, a fresh install next month gets a slightly different version of some deep library and your model starts producing slightly different numbers.
For ML workloads, Docker (or a compatible runtime — podman, containerd) is not optional. CUDA versions, system-level dependencies (libsndfile, libgl), and exact Python builds are trivial to encode in a Dockerfile and painful to coordinate across a team otherwise. A Dockerfile that builds a reproducible image from a locked requirements.txt, tagged with the git SHA, is the closest thing this discipline has to a deliverable. Store it in the repo; build it in CI; run it everywhere.
"Works on my machine" is not a closed bug. The fix is almost always to eliminate the gap between your machine and everyone else's — a locked environment, a reproducible build, a container image. The goal is that git clone && make dev yields a working setup on any developer's laptop.
Two practices — continuous integration and code review — do more than any others to keep a multi-author codebase from decaying. Both work by putting human and automated judgement between a change and its merge; the rest of software-engineering discipline is the habit of cooperating with that gate.
A continuous-integration pipeline runs the same checks, automatically, on every pull request. At minimum: install dependencies from the lock file, run the linters and type checkers, execute the unit tests. In a serious project, add: integration tests against ephemeral services, security scans, coverage reports, artefact builds (a Docker image, a wheel), and a deploy-to-staging on merge. The exact tool — GitHub Actions, GitLab CI, Buildkite, CircleCI — matters far less than the discipline of running every check on every change.
A CI pipeline that takes forty minutes is a pipeline no one waits for, and PRs pile up behind it. The single most valuable CI tuning is keeping the common-case feedback loop under five minutes — parallelise test suites, cache dependency installs, fail fast on linters before running the slow integration tests. If an end-to-end suite has to take an hour, run it on a nightly schedule, not on every commit.
Code review is where the standards of a team are transmitted. A good review is collaborative, not adversarial — the reviewer is helping the author ship, not grading their homework. Keep PRs small (under ~300 lines where possible); they get better reviews and catch more bugs. Comment on what matters — correctness, design, safety — and let formatters and linters argue about whitespace. Explain why, not just what. And know when to defer: if something is a taste difference, say so, and let the author decide.
Authors help reviewers by writing good PR descriptions (what changed, why, and how to test), splitting unrelated changes into separate PRs, and responding to review comments within a day. The most expensive defect in review is the one the author silently resents; the second most expensive is the one no one was allowed time to read carefully. Both are avoided by culture, not tooling.
CI and review together embody a simple claim: no change ships without the code having been read by another human and having passed a mechanical gate. Breaking either discipline for one "just this once" commit is the beginning of a codebase no one trusts.
Software-engineering discipline is usually invisible in the moment — the benefit is not that today's PR is ten minutes faster but that next year's refactor is possible at all. For ML, that compounding shows up in a handful of predictable places.
Reproducibility. Locked dependencies, seeded experiments, saved configurations, and structured logs are the difference between "the result from paper X" and "the result from paper X, reproduced locally on a Tuesday." Every paper in your lab repository should come with a command that regenerates its numbers; that is only possible if the engineering underneath is tight.
Iteration speed. The team whose training pipeline takes ninety seconds to start, whose tests run in three minutes, whose configuration is one file to edit, iterates ten times faster than the team for whom each of those takes an order of magnitude longer. The gap does not show up in any single experiment — it shows up over a quarter.
Handing code off. A model that only its author can run is a model that retires when the author leaves. Engineering discipline — readable code, tests, docs, a clear build — is what makes a model survive team changes. For any artefact intended to live past a single person's attention span, this is not optional.
Scaling teams. A two-person codebase survives almost any style; a twenty-person codebase without enforced formatting, types, and review is chaos within months. The investments in tooling that feel heavy at four engineers pay their own salary back at ten.
Going to production. The step from "trained model" to "served prediction in a live system" is almost entirely an engineering step — deployment, monitoring, rollback, versioning, observability. A team that never treats code as production-grade will never ship; a team that treats it as production-grade from the start does so without drama.
None of the practices in this chapter is novel. They are the working wisdom of a generation of engineers who learned the hard way. The only thing that makes them distinctive in ML is that the pressure to skip them — to iterate faster, to ship the next model, to write one more notebook — is unusually strong. The teams that resist that pressure are the ones whose systems are still running, legibly, two years from now.
Software-engineering literature is enormous and wildly uneven. The list below is opinionated: a few canonical books that repay re-reading, a handful of ML-specific treatments that take the engineering side seriously, and the tooling documentation that most rewards time spent with it.
The pytest documentation (a richer reference than the standard library's unittest). Read the "Good Integration Practices" page and the fixture documentation; both are short and high-leverage.

uv is Astral's fast dependency manager, a near-drop-in replacement for pip/pip-tools/virtualenv; hydra is Meta's configuration framework, the standard for ML experiments, with composable YAML configs and multi-run sweeps.

This page is the fourth chapter of Part II: Programming & Software Engineering. Up next: databases and SQL, then version control and collaborative development — the two remaining pieces of the engineering foundation. After that, Part III enters data engineering, where the discipline of this chapter meets the scale of modern pipelines.