Model Deployment & Serving, where trained models meet real users.

A trained model is a file on disk. A deployed model is a service that returns predictions reliably, at scale, with latency budgets, monitoring, and rollback capability — and the gap between the two is where most ML projects fail. Deployment is the operational discipline of turning model files into production services. Real-time serving handles individual requests with millisecond latency budgets via REST or gRPC APIs, typically running model containers behind a load balancer. Batch serving processes large historical datasets via scheduled jobs, prioritising throughput over latency. Streaming inference sits between the two, scoring events from message queues. Model registries govern which model version is in production at any moment, with promotion workflows and rollback capability. Containerisation (Docker plus Kubernetes) is the operational substrate that makes serving infrastructure reproducible, portable, and elastic. This chapter develops the methodology with the depth a working ML practitioner needs: the architectural patterns, the framework choices, the latency and cost trade-offs, and the operational considerations that distinguish a healthy production deployment from a fragile one.

Prerequisites & orientation

This chapter assumes the experiment-tracking material of Ch 01 (where model registries are introduced as part of MLflow) and the data-engineering material of Part III. Familiarity with HTTP, REST APIs, and basic systems programming is assumed; familiarity with Kubernetes helps but is not required (we cover the essentials). The chapter is written for ML engineers, platform engineers, and applied scientists who need to take a model from "works on my laptop" to "serves a million requests a day." Readers focused purely on model research can skim and use this as a reference when production work arrives.

Three threads run through the chapter. The first is the latency-vs-throughput trade-off: real-time serving optimises for latency, batch serving optimises for throughput, and the architectural choices (request batching, model size, hardware, caching) trace back to which trade-off you've prioritised. The second is the build-vs-buy gradient: deployment ranges from rolling your own FastAPI service through using framework-specific servers (TorchServe, TensorFlow Serving) to managed cloud platforms (SageMaker, Vertex AI, Azure ML) and dedicated serving frameworks (BentoML, KServe, Ray Serve, vLLM). The third is the model-versioning-and-rollback imperative: production models change, and the registry plus deployment-pattern combination is what lets you safely promote new versions and quickly revert when things break. The chapter develops each piece in turn.

01

Why Model Serving Is Its Own Discipline

A common misconception in early-career ML work is that deployment is the easy step at the end. The reality is the opposite: deployment is where the largest fraction of ML projects fail, and the engineering required to deploy a model to production reliably often exceeds the engineering required to train it. This section orients the reader to why deployment is its own discipline with its own tools, its own failure modes, and its own operational rhythm.

The "research code" problem

A trained model usually starts life as a Jupyter notebook. The notebook depends on a particular Python version, a specific set of library versions, a path to a GPU, and an in-memory state that exists only for the duration of the kernel. None of this is suitable for production: production needs a deterministic dependency graph, a reproducible build, multiple replicas behind a load balancer, observability, version control, and the ability to roll back. Bridging from notebook to production service is the operational gap that this chapter is about.

[Figure: the model-serving methodology stack. Package (§5–6): model registry, Docker images, Kubernetes, CI/CD pipelines (version & ship). Serve (§3–4, 7): REST/gRPC APIs, TorchServe/Triton, batching & caching, vLLM/Ray Serve (latency & throughput). Release (§8–9): canary deploys, shadow traffic, A/B experiments, rollback & SLOs (safe rollout). Application layer: real-time, batch, streaming, edge, and LLM serving (§2 patterns, §8 scaling, §10 frontier).]

What can go wrong

Production deployments fail in a handful of recurring ways. Performance regression: a new model version is deployed and prediction quality drops, often because of training-serving skew (Ch 02) or an upstream data shift. Latency violations: P99 latency exceeds the SLO, frequently because of cold starts, GC pauses, or insufficient batching. Outages: the service falls over because of memory pressure, dependency failure, or upstream load. Cost overruns: GPU instances scale up on a load spike and the bill triples without the team noticing. Silent corruption: predictions are wrong but no monitoring catches it because metrics weren't instrumented. Each failure mode has known mitigations; the discipline is to anticipate them before deployment, not after.

The five distinct concerns

Production deployment has five distinct concerns that need separate engineering attention. Packaging: turning model weights plus inference code into a deployable artefact (a Docker image, a SageMaker endpoint config, a Kubernetes resource). Serving: handling requests, running inference, returning responses with appropriate latency and throughput characteristics. Releasing: getting new model versions in front of real traffic safely (canary, shadow, A/B) without disturbing existing users. Operating: monitoring health and performance, handling on-call incidents, scaling capacity. Iterating: feeding production observations back into the next training cycle. The chapter develops each in turn.

Production vs research distinctions

Production serving has constraints research does not. Determinism: production must produce the same output for the same input every time, including across deployment refreshes (so caching works). Idempotence: retried requests must be safe; clients reasonably expect "this request" to mean the same thing on retry. Observability: every prediction generates structured logs, traces, and metrics suitable for incident response. Backwards compatibility: changing the input schema breaks downstream consumers; production APIs evolve carefully. None of these constraints apply to research notebooks; all of them apply to production services.

The downstream view

Operationally, model deployment sits between the training pipeline (where models come from) and downstream consumers (web/mobile services, internal tools, batch jobs, reporting systems). Upstream: a trained model artefact in a registry. Inside the deployment layer: containerisation, the serving framework, the API surface, the autoscaler, the rollout controller. Downstream: client applications, monitoring systems, the model-monitoring stack (Ch 04), the cost-and-billing pipelines. The remainder of this chapter develops each piece: §2 the deployment patterns, §3 the API protocols, §4 the serving frameworks, §5 model registries, §6 containerisation, §7 latency optimisation, §8 scaling and cost, §9 release strategies, §10 the frontier.

02

Deployment Patterns: Real-Time, Batch, Streaming, Edge

Not every model needs to serve millisecond-latency requests behind a REST API. The right deployment pattern depends on how predictions are consumed: synchronously per user action (real-time), in bulk on a schedule (batch), continuously from event streams (streaming), or on the user's own device (edge). Each pattern has a different operational shape, different infrastructure, and different failure modes.

Real-time (request-response) serving

The dominant pattern: a client sends a request with input features, the model returns a prediction, the client uses the prediction. Latency budgets are typically 50–200ms end-to-end (which means the model itself often has 5–50ms). Examples: search ranking, recommendation systems, fraud scoring at transaction time, ad bidding, chatbot responses. The infrastructure is a stateless service running model containers behind a load balancer with autoscaling; the request path uses REST or gRPC (Section 3). This is the most-engineering-intensive pattern but also the most-impactful for user-facing applications.

Batch serving

The simplest pattern: at a scheduled time, run the model over a large dataset and write predictions to a downstream store. Latency is irrelevant; throughput is everything. Examples: nightly customer-segment scoring, weekly churn predictions, monthly retraining-evaluation runs. The infrastructure is a scheduled job (Airflow, Argo Workflows, Databricks Jobs) that loads the model, processes records in parallel, and writes outputs. Batch serving is operationally low-stress but has the obvious limitation that predictions are stale by the time consumers read them.
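As a sketch of the shape, the entire "serving system" can be a script the scheduler invokes. Paths, the joblib model format, the `id` column, and the chunk size below are illustrative assumptions:

```python
# Batch-scoring sketch, as might run inside an Airflow or Argo task.
import joblib
import pandas as pd

CHUNK_ROWS = 100_000  # bounded-memory chunks; throughput matters, latency doesn't

def score_batch(model_path: str, input_csv: str, output_parquet: str) -> None:
    model = joblib.load(model_path)              # load once, reuse across chunks
    scored = []
    for chunk in pd.read_csv(input_csv, chunksize=CHUNK_ROWS):
        chunk["prediction"] = model.predict(chunk.drop(columns=["id"]))
        scored.append(chunk[["id", "prediction"]])
    pd.concat(scored).to_parquet(output_parquet, index=False)

if __name__ == "__main__":
    score_batch("models/churn-v3.joblib", "data/customers.csv", "out/churn_scores.parquet")
```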

Streaming inference

The hybrid: events flow through a streaming system (Kafka, Kinesis, Pub/Sub), each event is scored by the model in real time, results flow downstream. Latency budgets are typically seconds rather than milliseconds. Examples: real-time fraud detection on transaction streams, anomaly detection on log streams, content moderation on uploaded media. The infrastructure is a streaming consumer (Flink, Spark Streaming, Kafka Streams) that loads the model and processes events; the model runs once per event and writes results to another stream or a database.
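A minimal sketch of a streaming scorer, assuming the kafka-python client, illustrative topic and broker names, and a joblib-loaded model:

```python
# Streaming-scorer sketch: consume events, score each, publish results.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("models/fraud-v7.joblib")   # loaded once at consumer startup

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for event in consumer:                          # one inference per event
    tx = event.value
    score = float(model.predict_proba([tx["features"]])[0][1])
    producer.send("transaction-scores", {"tx_id": tx["tx_id"], "fraud_score": score})
```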

Edge and on-device deployment

The constrained pattern: the model runs on the user's device (phone, browser, embedded sensor) rather than a server. Examples: on-device speech recognition (Whisper), face unlock, smartphone camera ML, browser-based content filtering, IoT sensor analytics. The constraints are severe — limited memory, limited compute, often no network, energy budgets — so models must be small (typically <100MB) and quantised (Section 7). Frameworks include TensorFlow Lite, PyTorch Mobile, ONNX Runtime, Core ML (Apple), WebAssembly-based runners (transformers.js), and MediaPipe for cross-platform.

The choice matrix

The pattern is dictated by how predictions are consumed. Real-time when a user is waiting on the prediction. Batch when predictions feed a downstream system that consumes them periodically. Streaming when events are continuous and must be scored as they arrive but seconds of latency are acceptable. Edge when network connectivity, privacy, or latency requirements rule out server-side inference. A surprising number of teams default to real-time because it sounds modern, when batch would be operationally simpler and equally valuable. The default should be: use the simplest pattern that meets the requirement.

Hybrid patterns

Many production systems combine patterns. Batch-precomputed-with-real-time-lookup: predictions are computed in batch and cached in a key-value store, served at request time via lookup (no model runs at request time). Real-time-with-batch-fallback: real-time inference is attempted; if the service times out, a stale batch prediction is returned. Edge-with-server-fallback: simple cases handled on-device, complex cases sent to a server. The hybrid patterns trade engineering complexity for performance and reliability gains; the right choice depends on the workload's specific characteristics.

03

REST and gRPC: The Wire-Protocol Choice

A real-time serving service exposes a wire protocol: the on-the-network format of requests and responses. Two protocols dominate: REST (HTTP+JSON, the lingua franca of web APIs) and gRPC (HTTP/2 with binary protobuf payloads, optimised for service-to-service traffic). Choosing between them is a substantive architectural decision with material implications for latency, ecosystem fit, and developer experience.

REST and JSON

REST is the default. A model server exposes an HTTP endpoint (e.g. POST /predict) that accepts JSON input and returns JSON output. The strengths are universality (every language, every framework, every browser, every test tool speaks JSON over HTTP), debuggability (humans can read the payloads), and ecosystem (curl, Postman, OpenAPI, Swagger UI all just work). The weaknesses are payload bloat (JSON is verbose, especially for numerical arrays), schema drift (without OpenAPI specs, the schema is unverified), and serialisation cost (Python's `json.dumps` adds non-trivial overhead on latency-sensitive paths).
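A minimal sketch of this shape with FastAPI; the model path, feature layout, and version string are illustrative assumptions, not a prescribed stack:

```python
# Minimal REST serving sketch. The pydantic classes double as the API contract.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

class PredictRequest(BaseModel):
    features: list[float]          # the input schema, enforced on every request

class PredictResponse(BaseModel):
    prediction: float
    version: str                   # which model version produced the prediction

app = FastAPI()
model = joblib.load("models/fraud-v7.joblib")  # loaded once at process start

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    score = float(model.predict([req.features])[0])
    return PredictResponse(prediction=score, version="v7")
```

Because the request and response schemas are declared explicitly, FastAPI generates the OpenAPI spec (discussed below) from them, so the contract lives in code and can be checked in CI.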

gRPC and protobuf

gRPC is the alternative. Schemas are defined in protobuf files; clients and servers are auto-generated from the schemas in the language of choice; payloads are binary. The strengths are smaller payloads (typically 30–60% smaller than equivalent JSON), faster serialisation (binary protobuf is much faster than text JSON), and stronger type safety (the schema is enforced at compile time). The weaknesses are tooling friction (browsers don't speak gRPC natively, debuggability is harder, the ecosystem is service-to-service-oriented). For high-QPS internal traffic where every millisecond matters, gRPC is materially better; for external APIs and human-debuggable services, REST is materially better.

The hybrid pattern: gRPC + REST gateway

The common production pattern is to run gRPC as the primary protocol for service-to-service traffic and add a REST gateway (grpc-gateway, Envoy with grpc-json transcoding) for clients that need REST. The model service speaks gRPC internally; the gateway translates external REST calls to internal gRPC calls. This gets you the performance benefits of gRPC without giving up REST compatibility for the clients that need it.

Streaming RPC

gRPC supports four call patterns: unary (single request, single response, like REST), server streaming (single request, stream of responses — useful for token-by-token LLM output), client streaming (stream of requests, single response — useful for chunked input), and bidirectional streaming (both sides stream — useful for interactive chat or multi-turn agents). REST handles only the unary pattern natively; for streaming over REST, the alternatives are Server-Sent Events (SSE), WebSockets, or chunked HTTP responses. LLM serving (Section 10) makes heavy use of server streaming.
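For REST clients, server streaming is most often approximated with SSE. A minimal sketch with FastAPI, where a placeholder generator stands in for real token-by-token decoding:

```python
# SSE streaming sketch; the token generator is a placeholder for real decoding.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    for token in ["Hello", ",", " world", "!"]:   # placeholder model output
        await asyncio.sleep(0.05)                 # simulate per-token decode time
        yield f"data: {token}\n\n"                # SSE wire format
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream(prompt: str):
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
```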

OpenAPI and API contracts

Whichever wire protocol you choose, the API contract — the schema of inputs and outputs — should be explicitly versioned and documented. OpenAPI (formerly Swagger) is the dominant REST-API spec format; protobuf serves the same role for gRPC. The discipline is that the contract lives in source control, is reviewed like code, and the server's actual behaviour is verified against the contract in CI. Without this discipline, breaking changes to the API silently break consumers — a classic production failure mode.

Latency profile and overheads

The wire-protocol overhead matters more for fast models than slow models. A 5ms inference call has minimal headroom; doubling the network overhead from 1ms to 2ms is a 20% latency increase. A 500ms LLM inference call has more headroom; the same overhead is <1%. The practical guidance: REST is fine for most ML serving (fast or slow); gRPC pays back materially only for very-fast inference paths in high-throughput service-to-service traffic. The decision is rarely binary; many production stacks support both via the gateway pattern.

04

Serving Frameworks and Runtimes

You don't have to build a serving system from scratch. The 2020s have produced a substantial ecosystem of serving frameworks — libraries and runtimes that handle the boilerplate (HTTP server, request batching, model loading, health checks) so you can focus on the inference logic. Choosing the right framework substantially affects engineering velocity and the operational shape of the service.

Framework-native servers

TorchServe (PyTorch's production-serving framework) and TensorFlow Serving (TF's equivalent) are the framework-native options. They load model files of the framework's native format, expose REST and gRPC endpoints, and provide standard production features (versioning, A/B testing hooks, metrics). The strengths are deep integration with their respective frameworks; the weaknesses are framework lock-in and an opinionated abstraction that can be hard to escape when you need custom inference logic.

Triton Inference Server

NVIDIA Triton is a more general-purpose option. It supports models from PyTorch, TensorFlow, ONNX, OpenVINO, TensorRT, and others; runs them efficiently on GPUs with dynamic batching and concurrent execution; and exposes both REST and gRPC. Triton is the dominant choice for performance-sensitive GPU inference, particularly for ensemble models (multiple models pipelined together). The weakness is that the configuration is more involved than the framework-native servers — Triton rewards engineering investment.

BentoML and the Pythonic alternatives

BentoML (founded 2019) is a Python-first serving framework that emphasises developer experience. You write a Python class with a `@bentoml.service` decorator and a `@bentoml.api` method; BentoML packages the service as a deployable bundle (a "Bento") that includes model files, dependencies, and the serving code. The strength is that it integrates naturally with existing Python ML workflows. The weakness is performance — pure Python serving is slower than Triton for the same model. The 2024–2026 work on BentoML has substantially improved performance, and it remains a popular choice for ML teams that prioritise iteration speed over raw throughput.
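A sketch of that service shape, assuming BentoML's 1.2+ decorator API and a model previously saved to the local model store; the model tag and input/output types are illustrative:

```python
# BentoML service sketch (1.2+ decorator API).
import bentoml
import numpy as np

@bentoml.service
class FraudClassifier:
    def __init__(self) -> None:
        # load once per worker from BentoML's local model store
        self.model = bentoml.sklearn.load_model("fraud-classifier:latest")

    @bentoml.api
    def predict(self, features: np.ndarray) -> np.ndarray:
        return self.model.predict(features)
```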

KServe and Kubernetes-native serving

KServe (formerly KFServing, part of the Kubeflow ecosystem) is a Kubernetes-native serving framework. You define an `InferenceService` Kubernetes resource, KServe handles model loading, autoscaling (including scale-to-zero), traffic routing, and canary releases. The strength is deep Kubernetes integration — KServe inherits all the deployment, autoscaling, and networking primitives that Kubernetes provides. The weakness is operational complexity: running KServe at scale requires real Kubernetes expertise.

Ray Serve

Ray Serve (Anyscale, 2020) is a serving framework built on the Ray distributed-computing platform. It supports model composition (multiple models in a single service), dynamic batching, and complex request routing (e.g., conditional routing based on input). Ray Serve is the choice for ML services with non-trivial composition logic — multi-model ensembles, agent-style architectures, RAG pipelines — where the orchestration is itself substantial. The 2024–2026 Ray ecosystem has matured into one of the most-capable Python distributed computing platforms.

vLLM and LLM-specialised serving

LLM serving has its own specialised frameworks because of the distinctive workload (autoregressive token generation, KV-cache management, attention computation). vLLM (UC Berkeley, 2023) introduced PagedAttention and substantially improved LLM throughput; it has become the dominant open-source LLM-serving framework. TGI (Hugging Face Text Generation Inference) is the major alternative. SGLang (2024) and LMDeploy are newer entrants. Frontier-LLM serving (Anthropic, OpenAI) uses internal proprietary stacks but borrows extensively from these public frameworks.

Cloud-managed serving

SageMaker Endpoints (AWS), Vertex AI Prediction (GCP), and Azure ML Online Endpoints are the cloud-managed options. They provide model hosting, autoscaling, A/B testing, and monitoring as managed services. The trade-off versus self-hosted is the standard build-vs-buy: managed services save substantial operational engineering at the cost of platform lock-in and (typically) higher per-inference cost. For teams without dedicated ML platform engineers, managed services are usually the right starting point.

05

Model Registries and Promotion Workflows

A model registry is the system of record for "which model is in production." It stores model artefacts, tracks versions, manages stages (Staging, Production, Archived), and provides the audit trail of who promoted what when. The registry is the connecting tissue between the training pipeline (which produces models) and the deployment pipeline (which serves them). Ch 01 introduced MLflow's registry; this section develops the concept in serving-deployment terms.

Registry semantics

A registry organises models hierarchically: registered model (a named slot, e.g. `fraud-classifier`), model version (an immutable artefact, e.g. v1, v2, v3), stage (a label on a version: None, Staging, Production, Archived). Promoting a version to Production means "deploy this version when serving the `fraud-classifier` slot." Multiple versions can coexist in different stages; only one is in Production at any moment. The semantics are similar to a Git tag — a stable, reviewable label on an immutable artefact.

The dominant registries

MLflow Model Registry (Ch 01) is the dominant open-source registry. Weights & Biases Model Registry (Ch 01) is the dominant SaaS option. SageMaker Model Registry, Vertex AI Model Registry, and Azure ML Model Registry are the cloud-native options, deeply integrated with their respective serving systems. Hugging Face Hub is the dominant registry for open-source models (LLMs, vision models, the broader research-community model corpus). The choice depends on the rest of your stack; teams with substantial cross-cloud or cross-framework ambitions tend to centre on MLflow or W&B for portability.

Promotion workflows

Promoting a model from Staging to Production should never be a one-click, unreviewed operation. The disciplined workflow is: (1) train and register the candidate; (2) run automated evaluation (validation metrics, fairness checks, data-drift checks); (3) gate on human approval (a peer reviewer signs off); (4) promote to Staging; (5) deploy to a staging environment; (6) run integration tests against staging; (7) promote to Production; (8) deploy via canary or shadow (Section 9); (9) monitor for regression; (10) graduate to full production traffic if metrics hold. Mature ML platforms automate this workflow with explicit stage transitions and required approvals.
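As a concrete sketch, the registry side of steps (1), (4), and (7) might look like the following with MLflow's stage-based client API (newer MLflow releases also offer alias-based promotion). The model name and run URI are placeholders; the evaluation and approval gates live outside this script:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# (1) register the candidate produced by a training run
version = mlflow.register_model("runs:/<run_id>/model", "fraud-classifier")

# (4) promote to Staging once the automated checks and human approval pass
client.transition_model_version_stage(
    name="fraud-classifier", version=version.version, stage="Staging"
)

# (7) promote to Production, archiving whatever previously held the stage
client.transition_model_version_stage(
    name="fraud-classifier",
    version=version.version,
    stage="Production",
    archive_existing_versions=True,
)
```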

Rollback discipline

Every production deployment must be trivially rollbackable. The standard pattern is: keep the previous Production version Archived (not deleted); have a one-command rollback that re-promotes the Archived version to Production and triggers a redeploy; ensure the rollback completes in <5 minutes. The discipline is to test the rollback path in non-emergency conditions — rehearse rollbacks during normal business hours, so that the on-call engineer who needs to roll back at 3am has already done it. Teams that don't practice rollbacks discover at 3am that rollback is broken.
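A minimal sketch of such a one-command rollback against an MLflow-style registry. It naively assumes the latest Archived version is the previous Production model, which a real implementation would verify before re-promoting:

```python
from mlflow.tracking import MlflowClient

def rollback(name: str) -> None:
    client = MlflowClient()
    archived = client.get_latest_versions(name, stages=["Archived"])
    if not archived:
        raise RuntimeError(f"no archived version of {name} to roll back to")
    previous = max(archived, key=lambda v: int(v.version))
    client.transition_model_version_stage(
        name=name, version=previous.version, stage="Production",
        archive_existing_versions=True,
    )
    # the deployment layer (CI job, KServe reconcile, Helm) picks up the change

rollback("fraud-classifier")
```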

Lineage and audit

Every model version in the registry should link back to the training run that produced it (Ch 01's experiment-tracking record), the dataset version it was trained on (Ch 02's feature-store snapshot or Ch 01's DVC hash), the evaluation metrics that justified its promotion, and the human who approved the promotion. The end-to-end lineage — from raw data through to production prediction — is what regulators and auditors increasingly demand. The EU AI Act has substantially raised the stakes for lineage discipline.

Multi-environment registries

Mature ML organisations have separate registries (or separate logical namespaces) for development, staging, and production environments. Models are promoted across environment boundaries with explicit gating: a model must pass dev tests before being copied to staging, must pass staging tests before being copied to production. The boundary discipline prevents accidental "deploy a debug version to prod" failures. Cloud-managed registries provide environment scoping natively; self-hosted MLflow setups achieve it via separate tracking-server instances.

06

Containerisation and Kubernetes

The dominant deployment substrate for ML services is containers orchestrated by Kubernetes. A container packages the model, its dependencies, and the serving code into a portable unit; Kubernetes runs containers across a cluster of machines with health checks, autoscaling, and rolling updates. The combination is the de-facto standard for production ML serving in 2026, with cloud-managed alternatives (SageMaker, Vertex AI) sitting on top of similar primitives.

Docker images for ML serving

An ML service Docker image typically starts from a CUDA-capable base (NVIDIA's NGC images are the standard), adds a Python interpreter, installs the framework (PyTorch, TensorFlow), copies the inference code, and (sometimes) bakes the model file into the image. The image is then pushed to a container registry (ECR, GCR, Docker Hub, GitHub Container Registry) and pulled by Kubernetes at runtime. Image sizes for deep-learning serving are typically 5–20 GB (the CUDA base is large), which has implications for cold-start latency.

Model-in-image vs model-from-storage

A key design decision: should the model file be baked into the Docker image, or pulled from object storage at startup? Model-in-image is faster to start up (everything is local) and reproducible (the image hash captures the exact model). Model-from-storage lets you update the model without rebuilding the image (faster iteration) and reduces image size. The standard production pattern is model-in-image for production services where reproducibility is paramount, model-from-storage for development workflows where iteration speed matters.

Kubernetes resources for ML

The base Kubernetes resources for ML serving are: Deployment (manages replicas of a stateless service), Service (provides a stable network endpoint), HorizontalPodAutoscaler (scales replica count based on CPU, memory, or custom metrics), Ingress (routes external traffic), and Job/CronJob (for batch inference). For ML-specific needs, KServe (Section 4) adds an `InferenceService` resource that bundles these primitives with model-aware features.

GPU scheduling

Kubernetes' GPU support is mature in 2026. The NVIDIA device plugin exposes GPUs as schedulable resources; pods can request specific GPU types via node selectors. Multi-Instance GPU (MIG) on A100/H100 lets you share a physical GPU across multiple pods. GPU sharing via time-slicing is supported for development workloads (production typically uses dedicated GPUs). Spot GPUs (preemptible) are substantially cheaper but require graceful-restart logic. The operational discipline of GPU scheduling — bin packing, preemption handling, multi-tenancy — is where ML platform engineering invests substantial time.

Helm and the deployment lifecycle

Helm is the dominant Kubernetes package manager. A Helm chart bundles the Kubernetes resources for an ML service into a versioned, parameterised template. Deployments become `helm upgrade`; rollbacks become `helm rollback`. The discipline is that the Helm chart is version-controlled in Git alongside the model code; the deployment is fully described by the chart plus the values file plus the model registry version. Argo CD and Flux add GitOps on top: changes to the Git repo automatically trigger Kubernetes updates, with a clear audit trail.

The cloud-managed alternative

Operating Kubernetes is non-trivial. The cloud-managed alternatives (SageMaker, Vertex AI, Azure ML, Modal, Replicate) hide most of the complexity behind their APIs. The trade-off is cost (managed services are typically 2–4× more expensive per unit of compute than running your own Kubernetes) and flexibility (managed services have more constraints). For teams smaller than ~5 ML platform engineers, managed services are typically the better default; for larger teams with substantial workload diversity, self-managed Kubernetes is materially cheaper at scale.

07

Latency Optimisation: Batching, Caching, and Quantisation

A naive model deployment runs one inference call per request, in FP32, on a single GPU. This is rarely close to optimal; for real production workloads, several optimisations stack to give 10–100× throughput improvements and latency reductions. This section covers the most impactful: dynamic batching, caching, model compilation, and quantisation. Most are now standard, but the operational details determine whether they actually pay back.

Dynamic batching

The single most-impactful optimisation: dynamic batching. Instead of processing requests one at a time, the server queues incoming requests, batches them together (up to some max batch size or max-wait-time), runs a single inference call on the batch, and returns the per-request results. GPU inference is dramatically more throughput-efficient on batches than on single requests — a typical CV model might do 200 QPS at batch 1 and 10,000 QPS at batch 32. The trade-off is latency: requests wait for batchmates. The right batch size and timeout depend on the workload's QPS distribution and latency budget; modern serving frameworks (Triton, Ray Serve, vLLM) tune them automatically.
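A toy dynamic batcher makes the mechanics concrete: queue requests, flush on either the size cap or the wait deadline, run one batched inference, and fan the results back out. The limits and the stand-in model are illustrative; production servers implement this far more carefully:

```python
import asyncio

MAX_BATCH = 32
MAX_WAIT_S = 0.005  # 5 ms: the longest a request waits for batchmates

queue: asyncio.Queue = asyncio.Queue()

async def handle(features):
    """Called per request: enqueue and await the batched result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((features, fut))
    return await fut

async def batcher(model):
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]              # block until the first request
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        preds = model.predict([f for f, _ in items])  # one call for the batch
        for (_, fut), pred in zip(items, preds):
            fut.set_result(pred)

class EchoModel:
    def predict(self, batch):                    # stand-in model: identity
        return batch

async def main():
    asyncio.ensure_future(batcher(EchoModel()))
    results = await asyncio.gather(*(handle([i]) for i in range(100)))
    print(f"{len(results)} predictions served in batches")

asyncio.run(main())
```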

Caching

If many requests have identical inputs, caching saves inference cost. The cache key is the (hashed) input; the cache value is the prediction. Request-level caching works for tasks where exact input matches occur (e.g., image classification on a fixed catalog of images). Embedding caching works for tasks where intermediate embeddings can be reused (e.g., the user's embedding stays the same for many recommendation requests). KV caching for LLMs (Section 10) is a specialised version that has become essential for long-context inference. The cache hit rate determines the value; caches with <30% hit rates are usually not worth the complexity.
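A request-level cache reduces to a few lines: hash the serialised input, look up before inferring. The in-process dict below stands in for Redis or Memcached, and TTLs and eviction are omitted:

```python
import hashlib
import json

_cache: dict[str, float] = {}

def cached_predict(model, features: list[float]) -> float:
    key = hashlib.sha256(json.dumps(features).encode()).hexdigest()
    if key in _cache:
        return _cache[key]             # hit: zero inference cost
    pred = float(model.predict([features])[0])
    _cache[key] = pred                 # miss: infer, then populate
    return pred
```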

Model compilation and graph optimisation

Models defined in PyTorch or TensorFlow can be compiled to optimised execution graphs. torch.compile (PyTorch 2.0+, 2023) provides JIT compilation with substantial speedups (typically 1.5–3× for inference). TensorRT (NVIDIA) takes a model graph and produces an optimised CUDA kernel that's typically 2–5× faster than the naive PyTorch path. ONNX Runtime provides a cross-framework optimisation layer. OpenVINO targets Intel hardware. The compilation step adds build complexity but is usually worth it for production workloads.
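Adopting `torch.compile` is typically a one-line change; a sketch with a placeholder model, where the first call pays the JIT compilation cost and later calls run the optimised graph:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
).eval()
compiled = torch.compile(model)       # optimised callable, same signature

with torch.inference_mode():
    x = torch.randn(32, 128)
    _ = compiled(x)                   # first call: JIT compilation (slow)
    y = compiled(x)                   # subsequent calls: optimised graph
```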

Quantisation

Quantisation reduces the numerical precision of model weights and activations, typically from FP32 to INT8 (8-bit integer). The weight memory drops by 4×, the inference compute is faster on hardware that has integer-arithmetic optimisations (Tensor Cores, specialised INT8 paths), and the accuracy drop is usually small (well below 1 percentage point on standard benchmarks). Post-training quantisation (PTQ) is the simplest: take a trained FP32 model and quantise the weights with a calibration step. Quantisation-aware training (QAT) trains with quantisation in mind for higher accuracy. For LLMs, 4-bit quantisation (NF4, GPTQ, AWQ) is now standard for inference, allowing 70B-parameter models to fit on a single 80GB GPU.
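Post-training dynamic quantisation of a model's `Linear` layers is similarly compact in PyTorch (weights stored as INT8, dequantised on the fly); the architecture and shapes below are placeholders:

```python
import torch
from torch.ao.quantization import quantize_dynamic

fp32_model = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).eval()

int8_model = quantize_dynamic(fp32_model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(fp32_model(x).shape, int8_model(x).shape)  # same interface, ~4x smaller weights
```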

Distillation and pruning

Knowledge distillation trains a small "student" model to mimic a large "teacher" model, often achieving 80–95% of the teacher's quality with a fraction of the parameters. Structured pruning removes entire neurons or channels from the model with minimal accuracy loss. Both are forms of model compression that reduce inference cost; both require additional training-time work. They pay back primarily when inference cost is dominant (high-QPS production workloads, edge deployment) and additional training is acceptable.

Hardware selection

The hardware matters: a model running on an A100 vs an H100 vs an L4 vs a T4 has dramatically different cost-per-inference profiles. H100 / H200 are the throughput champions (and the only viable option for very-large LLMs). A100 is the workhorse for general-purpose ML serving. L4 / L40 are mid-range options optimised for inference. T4 is the budget option for non-demanding workloads. CPU inference is viable for small models and high-margin scenarios. Custom AI chips (TPUs, AWS Inferentia, Google's Trillium, the various 2024–2026 entrants) are increasingly attractive for specific workloads. The right hardware choice is workload-specific and benchmarking is essential.

08

Scaling Patterns and Cost Management

A model serving service that handles 100 QPS at noon may handle 1,000 QPS at 8pm and 10 QPS at 3am. Scaling capacity to match demand — without breaking the latency SLO during ramp-up and without paying for idle capacity at 3am — is its own engineering discipline. This section covers the patterns: horizontal autoscaling, multi-tenancy, spot instances, and cost-per-inference accounting.

Horizontal Pod Autoscaling

The basic Kubernetes scaling primitive is HorizontalPodAutoscaler (HPA): scale the replica count up or down based on observed metrics (CPU usage, memory usage, custom metrics). The standard pattern for ML serving is to autoscale based on a custom metric like "P95 inference latency" or "request queue depth" rather than CPU — model containers often have low CPU even when GPU-saturated. KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to event-source-driven metrics (Kafka queue depth, SQS message count). For LLM serving, tokens-per-second-based scaling is increasingly the metric of choice.

Cold start and scale-to-zero

Cold start — the time from "no replicas running" to "first request served" — is a major operational concern. For ML services, cold start can be 30–120 seconds (image pull, model load to GPU, warmup); during the cold-start window, requests either queue or fail. Mitigations include pre-warming (keep some minimum replica count even at low load), model preloading in the image, and faster image pulls via lazy-pull mechanisms (eStargz, SOCI). Scale-to-zero — actually running zero replicas at idle — is appropriate for low-traffic services where occasional cold starts are tolerable; KServe and Knative both support it.

Spot and preemptible instances

Spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) are cloud capacity sold at 60–90% discounts that can be reclaimed by the provider on short notice. For batch ML workloads that are restart-tolerant, spot is a major cost saver. For real-time serving, spot is more delicate — preemption mid-request causes failures — but workable with careful design (graceful drain on preemption notice, redundant capacity in different spot pools, spot-and-on-demand mixed pools). The 2024–2026 cost-optimisation work has pushed spot use further into real-time serving than was previously thought feasible.

Multi-tenancy

Running multiple models on the same infrastructure — multi-tenancy — improves utilisation. Patterns include model multiplexing (load multiple small models into a single process, route requests by model ID), multi-model endpoints (SageMaker MME, Vertex AI Multi-Model), and GPU sharing via MIG or time-slicing. The trade-off is isolation: tenants share resources and can interfere with each other's tail latency. The 2024–2026 work on production multi-tenancy has refined the patterns substantially; mature platforms achieve 5–10× utilisation improvements over single-tenant deployments.

Cost-per-inference accounting

The metric that matters for ML cost is cost per inference (or cost per 1,000 inferences). Computing it requires attributing infrastructure costs (GPU hours, network, storage) back to the inference workload — a non-trivial accounting exercise but essential for capacity planning. Modern observability stacks (Datadog, New Relic, Grafana Cloud) provide the inputs; teams typically build their own attribution logic. The aspiration is that every product team owns its cost-per-inference and is incentivised to optimise it — without this discipline, ML costs creep up unnoticed.

Capacity planning

Production capacity planning requires three inputs: forecast traffic (expected QPS over the planning horizon), latency SLO (the latency target, usually P95 or P99), and unit cost (cost per replica per hour). The math is straightforward: required replicas = forecast QPS / per-replica QPS at SLO. The hard part is forecasting — traffic spikes (Black Friday, news events, viral content) routinely exceed plans by 10×. The mitigation is overcapacity buffers and aggressive autoscaling; the cost is that the buffer itself eats margin at normal load. Mature MLOps teams have explicit capacity-planning rituals tied to budgeting cycles.
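The arithmetic as a worked example, with illustrative numbers and an explicit overcapacity buffer:

```python
import math

forecast_qps = 1_200      # expected peak traffic
per_replica_qps = 85      # measured QPS one replica sustains within the P95 SLO
buffer = 1.3              # 30% overcapacity buffer for unforecast spikes

replicas = math.ceil(forecast_qps / per_replica_qps * buffer)
print(replicas)  # -> 19 replicas to provision (and budget) for
```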

09

Release Strategies: Canary, Shadow, and A/B

Promoting a new model version to 100% of production traffic on day one is a bad idea. Even with rigorous offline evaluation, real-world traffic surfaces issues — latency regressions, edge-case predictions, integration bugs — that didn't show in testing. The disciplined alternative is progressive rollout: route a small fraction of traffic to the new version first, monitor the results, and expand only if the metrics hold. Three patterns dominate: canary, shadow, and A/B testing.

Canary deployments

A canary deployment sends a small percentage of production traffic (typically 1–5% to start) to the new model version while the rest continues to hit the old version. Operational metrics (error rate, latency, infrastructure cost) are compared between the canary and the baseline; if the canary looks healthy, traffic is gradually shifted (5% → 10% → 25% → 50% → 100%) over hours or days. Mature service meshes (Istio, Linkerd) provide canary primitives natively; cloud-managed serving platforms expose canary-percentage parameters directly.

Shadow deployments

A shadow deployment runs the new model version against production traffic but discards its outputs — the user sees only the old version's predictions. Shadow mode lets you observe the new model's behaviour on real production traffic without any user-facing risk. The cost is that you double your inference cost for the shadow window. Shadow is particularly valuable for high-stakes models (fraud detection, content moderation) where even a small prediction-quality regression has substantial consequences.

A/B testing

An A/B test goes further: split user traffic between the old and new models with random assignment, then compare downstream metrics (conversion rate, click-through, revenue) over a fixed window. A/B testing measures the new model's actual effect on user behaviour, not just its operational health. The rigour comes with cost: A/B tests typically need 2–4 weeks for statistical significance on real-world metrics, and they require explicit experimentation infrastructure (consistent random assignment, metric pipelines, hypothesis tracking). Ch 06 in this part will develop A/B testing and causal experimentation in detail.

Multi-armed bandits

An evolution of A/B testing: multi-armed bandits dynamically allocate more traffic to the better-performing variant as evidence accumulates. The advantage is faster convergence to the winning variant and less traffic wasted on the loser. The disadvantage is that the statistical analysis is more complex. Bandits are appropriate when traffic is expensive and you want to minimise opportunity cost; classical A/B is appropriate when traffic is cheap and you want clean statistical inference.

Feature flags and gradual rollout

Feature flags (LaunchDarkly, Optimizely, Unleash, the various open-source alternatives) provide runtime control over which users get which model version. The same infrastructure that handles UI feature flags can handle ML model variant assignment. This is increasingly the default mechanism: deploy multiple model versions, control which version each user sees via a feature-flag rule, and change rules without redeploying. The integration with ML serving is straightforward; modern serving stacks support feature-flag-aware routing natively.

Rollback procedures

Every release strategy needs a rollback procedure, ideally one that can be executed in seconds without engineer intervention. The standard pattern: an automated rule monitors the canary's error rate; if it exceeds a threshold, traffic is automatically routed back to the baseline; an alert is fired to on-call. Manual rollback is a fallback for cases the automation doesn't catch. The discipline is to test the rollback path regularly and to set the automated thresholds based on real production data, not arbitrary numbers.
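A sketch of such an automated rule; the `metrics`, `router`, and `alert` objects are hypothetical stand-ins for whatever the service mesh or deployment controller exposes, and the thresholds are illustrative:

```python
import time

ERROR_RATE_THRESHOLD = 0.02     # 2% errors over the window
LATENCY_P99_THRESHOLD = 0.250   # 250 ms

def watch_canary(metrics, router, alert) -> None:
    while router.canary_weight() > 0:           # until fully rolled out or back
        window = metrics.last_5m("canary")      # hypothetical 5-minute aggregate
        if (window.error_rate > ERROR_RATE_THRESHOLD
                or window.latency_p99 > LATENCY_P99_THRESHOLD):
            router.set_canary_weight(0)         # route everything back to baseline
            alert.page_oncall("canary breached SLO; auto-rolled back")
            return
        time.sleep(30)                          # re-evaluate every 30 seconds
```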

10

The Frontier and the Operational Question

Model serving is mature operational infrastructure in 2026, but several frontiers remain active. LLM serving has grown from a research curiosity to the dominant workload at frontier-AI companies. Agent-based serving introduces new operational shapes. Serverless ML and fractional-GPU serving are reshaping cost economics. This section traces the open questions and the directions the field is moving in.

LLM serving

LLM inference is fundamentally different from classical ML inference. It is autoregressive: each token depends on all previous tokens, so the model runs once per token rather than once per request. It uses KV caching: per-request memory grows with context length and persists across the request's tokens. It has heterogeneous request shapes: a 100-token prompt expecting a 50-token reply is operationally different from a 10,000-token prompt expecting a 5,000-token reply. vLLM, SGLang, LMDeploy, and TensorRT-LLM are the dominant open-source frameworks; frontier-LLM serving (Anthropic, OpenAI, Google) uses internal proprietary stacks. The 2025–2027 work on continuous batching, paged attention, and disaggregated prefill/decode is the major industry frontier.
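Offline use of vLLM is a few lines — continuous batching and PagedAttention happen inside the engine. The model name below is an illustrative open-weights placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain dynamic batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```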

Agent and tool-use serving

LLM agents (Part XI) introduce new serving patterns. A single user request can expand into many model calls (the agent reasons, calls tools, evaluates results, calls more tools), each potentially long-running. The traditional request-response pattern doesn't fit; streaming, long-lived connections, and durable execution (the agent's state persists across model calls and survives infrastructure restarts) are the new requirements. Frameworks like LangGraph, Temporal, and the Claude Agent SDK address aspects of this; the operational story is still consolidating.

Serverless ML

Serverless ML platforms (Modal, Replicate, Banana, Beam, Lambda Labs) provide model serving without explicit infrastructure management — you push a Python function, the platform handles everything else. The trade-off is the standard serverless one: faster iteration and zero ops, paid for by per-request pricing that can become expensive at high QPS. For prototypes, low-traffic services, and sporadic batch workloads, serverless is materially simpler than self-hosted Kubernetes. For high-throughput production, the cost crossover point is real and team-specific.

Fractional GPUs and disaggregated inference

The fractional GPU story has matured substantially. Multi-Instance GPU (MIG) on A100/H100 hardware splits a physical GPU into 7 logical GPUs with hardware isolation; this lets you serve small models on 1/7th of an H100 instead of dedicating a whole one. Disaggregated inference separates the prefill stage (compute-heavy first-token generation) from the decode stage (memory-bound subsequent tokens), running each on hardware optimised for it. The combination is reshaping LLM cost economics; frontier labs report 3–5× cost reduction from these techniques.

Regulatory and explainability requirements

The regulatory pressure described in Ch 01 (EU AI Act, FDA AI/ML guidance) applies to deployment too. Explainability requirements (the right to explanation under GDPR, fairness requirements under the AI Act) are increasingly applied at serving time: each prediction must be accompanied by an explanation suitable for the user. Audit logs of which model version made which prediction for which user are increasingly mandatory. Production serving infrastructure is evolving to support these requirements natively; the 2025–2027 work on regulatory-grade serving will be a major industry theme.

What this chapter has not covered

Several adjacent areas are out of scope. Model monitoring (drift detection, performance monitoring, data quality at serving time) is the topic of Ch 04. CI/CD for ML (the pipeline that takes a code change to a deployed model) is Ch 05. A/B testing at depth — including experimental design, multi-armed bandits, and CUPED — is Ch 06. The substantial topic of edge inference (mobile, browser, embedded) has been touched only briefly. Real-time recommendations, retrieval pipelines for RAG, and the various specialised serving patterns are out of scope. The chapter focused on the operational substrate of model serving; the broader MLOps landscape develops adjacent topics in subsequent chapters.

Further reading

Foundational papers and references for model deployment and serving. The Triton, KServe, BentoML, Ray Serve, and vLLM documentation; the Kubernetes documentation; Chip Huyen's textbook (cross-referenced from Ch 01–02); the Continuous Delivery for ML book; the gRPC and OpenAPI specs; and the various 2023–2025 LLM-serving papers form the right starting kit.