Model Compression, where smaller models meet bigger reach.
Trained models are often dramatically larger than they need to be for the workload at hand. A 70-billion-parameter LLM with full FP32 weights occupies 280 GB of memory and is impractical for most deployment contexts; the same model, quantised to 4 bits, occupies roughly 35 GB and can serve real users from a single 48 GB GPU at acceptable quality. Model compression is the discipline of making models smaller, faster, and cheaper to deploy while preserving as much capability as possible. Pruning removes weights or entire structures that contribute little; quantisation represents weights and activations with fewer bits; knowledge distillation trains a small "student" model to mimic a large "teacher"; low-rank factorisation approximates large matrices by products of smaller ones. The 2024–2026 era has produced extraordinarily aggressive compression — 4-bit and even 1-bit weights running production LLMs, sparsity ratios above 90% for some workloads, distilled student models matching teachers many times their size. This chapter develops the methodology with the depth a working ML practitioner needs: the algorithms, the trade-offs, the deployment realities, and the framework choices that make compression actually work.
Prerequisites & orientation
This chapter assumes the deep-learning material of Part VI, the hardware material of Ch 01 (memory bandwidth, FLOPs, the roofline model), and basic familiarity with at least one ML framework (PyTorch, JAX, TensorFlow). Familiarity with linear algebra (eigenvalues, SVD, matrix factorisations) helps for §6 on low-rank methods; familiarity with information theory (entropy, KL divergence) helps for §5 on distillation. The chapter is written for ML engineers, ML researchers, and platform engineers who deploy models in resource-constrained environments — phones, browsers, cost-sensitive servers, edge devices. Pure-research contexts where the only goal is to maximise quality with unlimited compute have less use for this material; teams that ship trained models to production have substantial use.
Three threads run through the chapter. The first is the quality-vs-cost trade-off: every compression technique loses some quality in exchange for some cost saving, and the engineering work is finding the right balance for the specific deployment. The second is the training-vs-deployment distinction: some methods (post-training quantisation, post-training pruning) work on already-trained models without retraining; others (quantisation-aware training, distillation) require additional training but achieve better quality at the same compression. The third is the hardware-aware compression imperative: compression that doesn't map to hardware acceleration provides theoretical savings without practical speedup; modern compression methods are co-designed with the hardware they target. The chapter develops each in turn.
Why Model Compression Matters
Compression is not optional for most production ML. The model that scores highest on the leaderboard is rarely the model that ships, because the highest-scoring model is too large, too slow, or too expensive to deploy. Compression closes the gap between what's trainable and what's deployable; the discipline determines whether AI capability reaches users at all.
The four motivations
Compression is motivated by four distinct concerns. Memory footprint: smaller models fit in smaller memory, allowing deployment on phones, laptops, embedded devices, or fewer/smaller servers. Inference cost: smaller models cost less to serve, which compounds across millions of inferences. Inference latency: smaller models run faster, which matters for interactive applications and tight latency budgets. Energy efficiency: smaller models consume less power, mattering for both battery-powered devices and datacentre electricity bills. Each motivation pushes toward different compression strategies; the right approach depends on which constraint is binding.
The capability-democratisation effect
Compression has had a substantial democratising effect on AI capability. DistilBERT (2019) made BERT-class capability available with a quarter of the inference cost. 4-bit quantised LLMs in 2023–2024 made running 70B-parameter models on consumer GPUs feasible, kicking off the local-LLM ecosystem (llama.cpp, Ollama, LM Studio). On-device LLMs in 2024–2026 (Apple Intelligence, Gemini Nano, Phi-3-class models) brought capable LLMs to phones. Each generation of compression has expanded who can deploy AI capability and where; the overall trend has been toward broader access rather than concentrated deployment in cloud-only contexts.
The compounded value
Compression saves money in proportion to inference volume. A 4× compression applied to a model serving 1 trillion tokens per year doesn't save an abstract ratio; it removes roughly three-quarters of a multi-million-dollar bill. For frontier models with serving costs in the tens of millions annually, even modest compression has enormous absolute value. The 2024–2026 industry investment in inference-time compression (the FP4 race, the various kernel optimisations, the speculative decoding methods) reflects this: every percent of cost reduction at frontier scale is worth substantial engineering investment.
The training-vs-inference asymmetry
Compression matters much more for inference than training. Training is a one-time cost (per training run) amortised across the model's serving lifetime; inference cost is recurring and scales with usage. Most compression techniques therefore target inference: post-training quantisation, deployment-time pruning, distilled-for-deployment student models. Training-time compression (quantisation-aware training, sparse training) exists but is less developed; the 2024–2026 work on this is producing results but most production focus remains on inference compression.
The downstream view
Operationally, model compression sits between training (Ch 02) and deployment (Ch 03 of MLOps). Upstream: a fully-trained model in the registry. Inside this chapter's scope: pruning, quantisation, distillation, low-rank factorisation, sparsity, and the combinations of these. Downstream: a smaller, faster, cheaper model going through the deployment pipeline, evaluation, monitoring, and serving. The remainder of this chapter develops each piece: §2 pruning, §3 quantisation, §4 extreme quantisation, §5 distillation, §6 low-rank methods, §7 sparsity, §8 combinations, §9 evaluation, §10 the frontier.
Pruning: Removing Redundant Parameters
Pruning removes parameters that contribute little to the model's outputs, producing a smaller model that (ideally) performs nearly as well as the original. The intuition — that trained networks are typically over-parameterised, with many weights near zero or with little impact — has driven decades of research. The practical question is which weights to remove and what shape the resulting sparsity should take, since not every kind of sparsity translates to actual deployment speedup.
Unstructured pruning
Unstructured pruning removes individual weights based on some criterion. The simplest is magnitude pruning: weights with smallest absolute values are zeroed out. This is surprisingly effective — Han et al. (2015) demonstrated that 90%+ of weights in standard CNNs could be removed with minimal accuracy loss after fine-tuning. The challenge is that unstructured sparsity doesn't speed up dense matrix multiplication on standard hardware: a matrix with 90% zero entries still requires the same multiplications unless the hardware specifically supports sparse computation. Unstructured pruning's main benefit is therefore memory savings (the zero entries can be compressed for storage), not speed.
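To make the magnitude criterion concrete, here is a minimal PyTorch sketch of one-shot unstructured magnitude pruning. The tensor size and 90% ratio are illustrative; production code would typically keep an explicit mask (e.g. via torch.nn.utils.prune) rather than overwrite the weights.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of entries (unstructured)."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.9)
print("achieved sparsity:", (w_sparse == 0).float().mean().item())  # ~0.9
```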
Structured pruning
Structured pruning removes entire structural units — whole neurons, attention heads, layers, or channels — preserving the dense matrix structure that hardware accelerates. A pruned attention head simply isn't computed; a pruned layer is removed entirely from the forward pass. Structured pruning produces actual speedups because the resulting smaller matrices map directly to hardware. The trade-off is that the same accuracy preservation requires less aggressive pruning ratios — typically 30–50% structured vs 90%+ unstructured.
The lottery ticket hypothesis
Frankle and Carbin (2019) introduced the influential Lottery Ticket Hypothesis: trained networks contain sparse subnetworks ("winning tickets") that, if extracted with their original initialisations and trained from scratch, can match the original network's accuracy. The hypothesis spawned substantial follow-up research (stability analyses, weight rewinding, and the various 2020–2022 follow-ups). The practical implication is methodological: rather than train a dense network and prune, identify the winning ticket early and train it sparsely from the start. The methodology has practical issues — finding the ticket requires a full training run plus careful pruning — but has substantially influenced thinking about sparsity.
Iterative pruning and the sparsity-quality trade-off
Iterative magnitude pruning (IMP) is the workhorse: train, prune the lowest-magnitude weights, fine-tune, prune more, fine-tune. The procedure typically achieves better quality at a given sparsity than one-shot pruning. The 2020–2024 research has produced sophisticated variants — gradual pruning schedules, layer-wise sparsity ratios, sparsity-aware training. Movement pruning (Sanh et al. 2020) bases pruning decisions on how much weights have moved during training rather than their absolute magnitude — particularly effective for fine-tuning scenarios.
SparseGPT and post-training pruning
For LLMs, post-training pruning has become particularly important because fine-tuning a 70B-parameter model is expensive. SparseGPT (Frantar & Alistarh 2023) introduced a one-shot post-training pruning method that achieves 50%+ sparsity on LLMs without retraining, by formulating pruning as a layer-wise reconstruction problem solved with a custom Cholesky-based update. Wanda (Sun et al. 2023) is a simpler variant that prunes by the product of weight magnitude and input-activation norm. These methods have made LLM pruning operationally practical for the first time.
The deployment hardware question
The practical impact of pruning depends entirely on whether the hardware can exploit it. Unstructured sparsity typically gets memory savings but no speedup on standard GPUs. 2:4 structured sparsity (Section 7) gets 2× speedup on NVIDIA A100/H100 Tensor Cores. Channel/head structured sparsity gets immediate speedup on any hardware. The practitioner's choice depends on the deployment hardware; "the model has 90% sparsity" is not by itself a useful claim without specifying which kind of sparsity and on what hardware.
Quantisation: Lower-Precision Numerics
Quantisation reduces the numerical precision of weights and activations, typically from 32-bit floating point down to 16-bit, 8-bit, 4-bit, or even lower. The mathematical insight is that ML model weights and activations don't need anywhere near the precision of FP32 to produce the same outputs; the engineering insight is that lower-precision arithmetic runs faster and uses less memory. Quantisation is the most-impactful single compression technique for modern LLMs.
The bit-width hierarchy
The standard precisions in ML, ordered from highest to lowest. FP32 (32-bit floating point): the historical default, retained for some optimiser state. FP16 / BF16 (16-bit): the standard for training and inference since 2018; BF16 has a wider exponent range than FP16, making it more numerically stable. FP8 (8-bit floating point, two formats E4M3 and E5M2): production-ready since 2023 (H100), used for both training and inference at frontier scale. INT8 (8-bit integer): the standard for inference quantisation on most hardware; widely supported and well-understood. FP4 / NF4 / INT4 (4-bit): the dominant format for LLM inference on consumer hardware. INT2 / 1.58-bit / INT1: experimental/research, see Section 4.
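As a back-of-envelope sketch of what the hierarchy means for memory, the snippet below computes the weight-only footprint of the chapter's 70B-parameter running example at each precision (KV-cache, activations, and framework overhead come on top):

```python
# Weight-only memory footprint of a 70B-parameter model per precision.
PARAMS = 70e9

bits_per_weight = {
    "FP32": 32,
    "FP16/BF16": 16,
    "FP8/INT8": 8,
    "FP4/NF4/INT4": 4,
    "ternary (BitNet b1.58)": 1.58,
}

for fmt, bits in bits_per_weight.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{fmt:>24}: {gb:6.1f} GB")  # FP32 -> 280 GB, 4-bit -> 35 GB
```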
Post-training quantisation
Post-training quantisation (PTQ) operates on an already-trained FP32/BF16 model: weights are quantised to lower precision; a small calibration dataset is used to determine appropriate scaling factors. PTQ is operationally simple — no retraining required — and preserves quality reasonably well at INT8 for most models. The simplest PTQ algorithm: for each weight tensor, find the maximum absolute value, choose a scale that maps that maximum to the largest representable INT8 value (127), quantise. Dynamic quantisation applies this at inference time; static quantisation precomputes scales based on calibration data.
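A hedged sketch of that simplest recipe (symmetric per-tensor absmax INT8) in PyTorch. Production libraries quantise per-channel or per-group and choose static scales from calibration data; this is the minimal version:

```python
import torch

def absmax_quantise_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8: map the largest |w| to 127."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantise_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = absmax_quantise_int8(w)
err = (w - dequantise_int8(q, scale)).abs().max()
print(f"scale={scale.item():.5f}, max rounding error={err.item():.5f}")
```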
Quantisation-aware training
Quantisation-aware training (QAT) is the higher-quality alternative: simulate quantisation during training so the model learns weights that quantise well. Forward passes round weights to the quantised values; backward passes use the straight-through estimator (Bengio et al. 2013) to propagate gradients through the rounding. QAT typically recovers 1–3% of the accuracy lost by PTQ at the same bit width, but requires retraining (expensive for large models). The standard pattern: PTQ as the default, QAT for the highest-stakes deployments where the quality recovery is worth the retraining cost.
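A minimal sketch of fake quantisation with the straight-through estimator, assuming symmetric INT8; real QAT stacks (e.g. torch.ao.quantization) add observers, per-channel scales, and quantisation schedules.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Forward: round to the INT8 grid. Backward: pass gradients straight
    through, as if the rounding were the identity (Bengio et al. 2013)."""
    @staticmethod
    def forward(ctx, w, scale):
        q = torch.clamp(torch.round(w / scale), -127, 127)
        return q * scale          # dequantised weights used downstream

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: treat rounding as identity

def fake_quant(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max().detach() / 127.0
    return FakeQuantSTE.apply(w, scale)
```

In a QAT linear layer, the forward pass would call F.linear(x, fake_quant(self.weight), self.bias), so the network learns weights that survive their own rounding.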
GPTQ and one-shot LLM quantisation
For LLMs, the dominant post-training quantisation algorithm is GPTQ (Frantar et al. 2022). GPTQ formulates quantisation as a layer-wise reconstruction problem: for each layer, find the quantised weights that minimise the activation reconstruction error on a small calibration set. The optimisation uses an OBQ-style update with a Hessian-based correction. GPTQ achieves 4-bit quantisation with minimal quality loss and is the standard for the LLM-quantisation tooling (AutoGPTQ, the various Hugging Face quantisation libraries).
AWQ and activation-aware methods
Activation-aware Weight Quantisation (AWQ, Lin et al. 2023) takes the insight that weights connected to outlier activations matter more for quantisation quality than other weights — and adjusts the quantisation scaling to protect them. AWQ produces slightly better quality than GPTQ at 4-bit and is the basis of much of the LLM-quantisation tooling in 2024–2026. The two methods are close competitors; which produces better quality depends on the model and the calibration data.
The hardware-quantisation match
Quantisation must match what the hardware supports. INT8 is supported by every modern accelerator. FP8 is supported by H100/H200/B200 (Hopper and Blackwell generations) and some others. FP4 is supported natively on B200 (Blackwell) and AMD's MI350 generation. NF4 is a storage format rather than a compute format: weights are held in 4-bit and dequantised on the fly to 16-bit for the actual matmul. The choice of quantisation format should be informed by what the deployment hardware can actually accelerate; running INT4-quantised weights on hardware that has to dequantise to FP16 for the matmul gives memory savings but no speedup.
Extreme Quantisation: 4-Bit, 2-Bit, and 1-Bit
The 2023–2026 era has pushed LLM quantisation to surprisingly aggressive precisions. 4-bit quantisation has become the standard for consumer-hardware LLM deployment. 2-bit and 1.58-bit methods are pushing into experimental and increasingly production use. The methodology has matured through several generations; the quality-vs-compression frontier has moved further than seemed possible in 2020.
NF4 and the 4-bit consumer revolution
NormalFloat 4 (NF4) (Dettmers et al. 2023, the QLoRA paper) is a 4-bit format optimised for the distribution of normalised weights in LLMs. NF4 uses non-uniform quantisation levels chosen to be optimal for normally-distributed values. Combined with double-quantisation (quantising the quantisation constants) and paged optimisers, NF4 enables 4-bit fine-tuning of large LLMs on consumer hardware — the core insight of the QLoRA paper. The 4-bit ecosystem (llama.cpp's GGUF format, the various 4-bit serving frameworks) has made open-source LLM deployment broadly accessible since 2023.
QLoRA and 4-bit fine-tuning
QLoRA (Quantised LoRA, Dettmers et al. 2023) combines 4-bit quantisation of the base model with LoRA fine-tuning (Section 6) on top. The base model's weights stay frozen in 4-bit; only the small LoRA adapters are trained. The combination enables fine-tuning of 65B-parameter models on a single 48 GB GPU, which would otherwise require a multi-GPU cluster and thousands of dollars of cloud GPU time. QLoRA has become a foundational technique for the LLM fine-tuning ecosystem; nearly every "fine-tune your own LLM" tutorial in 2024–2026 uses QLoRA.
BitNet and the 1.58-bit frontier
The 2024 BitNet b1.58 paper (Ma et al.) demonstrated that LLMs can be trained from scratch with weights restricted to {-1, 0, 1} — 1.58 bits per weight (log2(3)) — with quality competitive with FP16-trained baselines at sufficiently large model size. The methodology requires training-aware quantisation; standard post-training quantisation cannot achieve this aggressive level. BitNet's hardware advantages are substantial: ternary weights enable arithmetic that's just additions and subtractions, no multiplications, dramatically reducing chip area and energy. Whether BitNet becomes a production deployment standard or remains a research direction is open; the 2024–2026 follow-ups suggest the methodology is continuing to mature.
2-bit and 1-bit methods
Beyond 1.58-bit, the 2024–2026 work on 2-bit (using 4 levels) and 1-bit (binary) weights has produced methods that work for some workloads. QuIP# (Tseng et al. 2024) combined incoherence preprocessing with vector quantisation to produce 2-bit LLMs with surprisingly good quality. AQLM achieves 2-bit quantisation with quality competitive with 4-bit GPTQ. The trajectory is clear: with sufficient methodology, the bits-per-weight frontier continues to drop.
Activation quantisation
Beyond weight quantisation, activation quantisation reduces the precision of intermediate computations. Activations are harder to quantise than weights because they have outliers — a few very-large values that dominate the dynamic range. SmoothQuant (Xiao et al. 2022) and LLM.int8() (Dettmers et al. 2022) introduced methods for handling these outliers, enabling INT8 inference for LLMs. The 2024–2026 work on FP8 inference (Transformer Engine, vLLM's FP8 support) has substantially advanced production activation quantisation.
The practical 4-bit standard
For most production LLM deployment in 2026, 4-bit quantisation is the default: GPTQ or AWQ for the algorithm, GGUF (llama.cpp) or safetensors-packaged GPTQ/AWQ checkpoints for the file format, vLLM/llama.cpp/TensorRT-LLM for the serving framework. Quality loss vs FP16 is typically 1–2% on standard benchmarks — small enough to be acceptable for most workloads, with 4× memory savings and substantial speedup. The 2-bit and 1-bit frontiers are advancing but most production work has stabilised at 4-bit.
Knowledge Distillation
Knowledge distillation (Hinton et al. 2015) trains a small "student" model to mimic the outputs of a large "teacher" model. The student learns from the teacher's "soft" probability distributions — which contain more information than hard labels — and typically achieves quality between standard supervised training and the teacher itself. Distillation has been a foundational compression technique for almost a decade and remains central to producing efficient deployment models.
The basic distillation framework
The standard formulation: train the student to minimise KL divergence between its output distribution and the teacher's, often combined with a standard supervised loss against the labels. The teacher's distribution provides "soft targets" that encode richer information than one-hot labels — relative confidences across classes, ambiguity for difficult inputs, the structure of the teacher's confusion matrix. The loss is typically L = α · T² · KL(teacher || student) + (1−α) · CE(student, labels), where both softmaxes are taken at temperature T (higher T flattens the soft targets) and the T² factor keeps the gradients of the two terms on a comparable scale. The mathematical formulation has been refined many times since 2015 but the core idea is unchanged.
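In PyTorch the loss is a few lines; this sketch assumes a classification-style setup, with α and T as illustrative values. (Note that F.kl_div expects log-probabilities for the student and probabilities for the teacher.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Soft-target KD loss: KL(teacher || student) at temperature T,
    blended with ordinary cross-entropy against the hard labels."""
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # batchmean + T^2 follows the standard Hinton et al. scaling
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```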
DistilBERT and the BERT-distillation generation
DistilBERT (Sanh et al. 2019) was an influential demonstration. A student transformer with half the layers of BERT, trained with distillation plus a few additional losses (cosine embedding loss, MLM loss), achieved 97% of BERT's GLUE performance with 60% of the parameters and 60% of the inference time. TinyBERT, MobileBERT, MiniLM, and many others followed similar patterns for BERT-class models. The methodology became standard for producing deployable encoder models throughout the 2019–2022 era.
LLM distillation: TinyLlama and the modern era
For LLMs, distillation has become the dominant approach for producing smaller deployable models. TinyLlama (2024), a 1.1B-parameter model trained on 3T tokens, showed how far over-trained small models could go; Phi-3 (Microsoft, 2024) combined careful data curation with synthetic data generated by larger models; Llama 3.2 1B/3B (Meta, 2024) used pruning plus systematic distillation from the larger Llama 3.1 models, and other small-model families follow similar recipes. Distillation is now table stakes for releasing a useful small LLM; pure scratch training of small LLMs produces materially worse models than distilled equivalents.
Response-based, feature-based, and relation-based distillation
The 2017–2022 research developed several extensions. Response-based distillation matches the final outputs (the standard formulation). Feature-based distillation matches intermediate layer activations between teacher and student; useful when the architectures differ. Relation-based distillation matches relationships between examples (which examples does the teacher consider similar). For modern LLM distillation, response-based is dominant, but feature-based methods appear in some specialised contexts.
The data scale and quality question
Modern LLM distillation depends on data: the student needs vast amounts of teacher-generated outputs to learn from. Synthetic data generation from the teacher has become a core part of the methodology — the teacher generates training data that the student trains on. Quality control on the synthetic data (removing hallucinations, ensuring diversity, calibrating difficulty) is its own engineering discipline. The 2024–2026 work on data-centric distillation (the various papers from Microsoft Research, Anthropic, and others) emphasises that data quality often matters more than the specific distillation algorithm.
The teacher-student architecture relationship
In principle, the student and teacher can have arbitrarily different architectures. In practice, a student that is a smaller version of the teacher's architecture works best — same transformer block design, fewer layers and/or smaller hidden dimensions. Cross-architecture distillation (transformer → SSM, dense → MoE) is an active research area but produces less reliable results. The pragmatic guidance: distil to a smaller version of the teacher's architecture unless you have a specific reason to do otherwise.
Low-Rank Factorisation and LoRA
Many of a model's weight matrices are approximately low-rank — the singular-value decomposition reveals that most of the matrix's "information" lives in a handful of dominant singular values. Low-rank factorisation exploits this by replacing a large matrix with a product of smaller ones. LoRA (Low-Rank Adaptation, Hu et al. 2021) brought low-rank methods to mainstream attention via parameter-efficient fine-tuning; the methodology has expanded to encompass a substantial family of techniques in 2024–2026.
The SVD intuition
For an m×n matrix W, the SVD writes W = U Σ V^T where Σ is diagonal with the singular values. If most singular values are near zero, W can be well-approximated by keeping only the top-k singular values: W ≈ U_k Σ_k V_k^T, where U_k is m×k and V_k is k×n. The compressed representation has m·k + k + k·n parameters instead of m·n. For typical neural-network weight matrices, k can be a small fraction of min(m,n) with little quality loss. The mathematical machinery is classical; the engineering question is which weights to factorise and at what rank.
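The truncation is mechanical with torch.linalg.svd; the sketch below factorises a matrix at rank k. One caveat the example makes visible: a random Gaussian matrix, as used here, has a nearly flat spectrum, so the rank-64 error is large; trained weight matrices typically have much faster singular-value decay.

```python
import torch

def low_rank_factorise(W: torch.Tensor, k: int):
    """Approximate an m-by-n matrix by rank-k factors A (m x k) and B (k x n),
    storing m*k + k*n numbers instead of m*n."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]   # fold the singular values into the left factor
    B = Vh[:k, :]
    return A, B

W = torch.randn(1024, 1024)
A, B = low_rank_factorise(W, k=64)
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(f"rank-64 relative error: {rel_err.item():.3f}")
```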
LoRA: low-rank fine-tuning
The 2021 LoRA paper applied low-rank factorisation to fine-tuning rather than to the base model itself. Instead of fine-tuning all weights, LoRA freezes the base model and adds small low-rank "adapters" to selected weight matrices: W → W + B·A, where A is k×n and B is m×k with k typically 8–64. The base model's weights are unchanged; only the adapter matrices A and B are trained. The compute and memory savings are substantial — for a 70B-parameter model, the LoRA adapters might be only 0.1% of the parameter count. LoRA has become the default fine-tuning method for LLMs in cost-conscious contexts.
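A minimal sketch of the adapter pattern around an existing nn.Linear. The scaling convention (alpha/r) and initialisation (A random, B zero, so training starts from the unmodified base model) follow the LoRA paper; the hyperparameter values here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update W + B.A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```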
QLoRA: combining quantisation and LoRA
The 2023 QLoRA paper (already discussed in Section 4) combined NF4 quantisation of the base model with LoRA fine-tuning. The base model is loaded in 4-bit (a quarter of the memory); the LoRA adapters train at higher precision; gradients flow through dequantisation operations. The combined system enables fine-tuning of 70B-parameter models on consumer hardware, which has been transformative for the open-source LLM ecosystem. QLoRA is essentially the standard pattern for LLM fine-tuning in 2024–2026.
DoRA, VeRA, and the parameter-efficient family
The success of LoRA spawned a family of parameter-efficient fine-tuning (PEFT) methods. DoRA (Liu et al. 2024) decomposes weight updates into magnitude and direction components, with LoRA applied to the direction; substantially improves quality at the same parameter count. VeRA (Vector-based Random Matrix Adaptation) shares random projection matrices across layers, reducing parameter count further. AdaLoRA adaptively chooses ranks per layer. IA³ (Liu et al. 2022) scales activations rather than adding adapters. The Hugging Face PEFT library implements most of these and is the standard tooling.
Low-rank pretraining
Beyond fine-tuning, the 2023–2024 work on low-rank methods for pretraining has produced interesting results. ReLoRA (Lialin et al. 2023) periodically merges LoRA adapters into the base weights and re-initialises, producing pretraining at a fraction of the standard cost. GaLore (Zhao et al. 2024) projects gradients into a low-rank space during training, reducing optimiser memory substantially. The 2024–2026 work has demonstrated that low-rank methods can reduce pretraining cost by substantial fractions while maintaining final quality. Whether this becomes the dominant pretraining pattern is open.
The deployment side
Beyond training, low-rank methods can be applied to compress already-trained models. SVD-based compression: take each weight matrix, compute SVD, keep top-k singular values, replace W with U_k Σ_k V_k^T as a factored pair. Quality often drops more than with quantisation at equivalent compression, but the memory savings are real. Mixed methods (low-rank for some layers, quantisation for others) are common in production deployment stacks. Modern compression toolkits (the Hugging Face Optimum library, NVIDIA's TensorRT-LLM) often combine multiple techniques automatically.
Sparsity: Structured Hardware Acceleration
Beyond the unstructured pruning of Section 2, modern hardware supports specific sparsity patterns natively. 2:4 structured sparsity on NVIDIA Tensor Cores produces 2× speedup; activation sparsity (especially in MoE models) can dramatically reduce per-token computation; block sparsity at coarser granularity also has hardware support. Hardware-aware sparsity is increasingly important for getting actual deployment-time speedups from compression.
2:4 structured sparsity
NVIDIA introduced 2:4 structured sparsity with Ampere (A100, 2020): for every block of 4 consecutive weights, exactly 2 must be zero, in any of the four positions. The Tensor Cores can skip the zero multiplications, achieving 2× throughput on the sparse matrix-multiply. The constraint preserves enough flexibility that quality loss is small (typically 0.5–1.5% on standard benchmarks); the 2× speedup is real. Modern PyTorch supports 2:4 sparsity natively; converting a dense model to 2:4 sparse is a standard post-training step in performance-sensitive deployments.
The 2:4 deployment pattern
Achieving 2:4 sparsity in practice requires careful methodology. Magnitude-based 2:4 conversion: in each 4-element block of a weight tensor, keep the two largest-magnitude entries and zero out the others. This works for many cases but loses quality; better methods include SparseGPT 2:4 (one-shot post-training sparsification with the 2:4 constraint), N:M sparsity-aware training (training with the constraint enforced from the start), and progressive 2:4 conversion (gradually increasing sparsity during fine-tuning). The choice depends on the budget for retraining; one-shot post-training methods are operationally cheapest.
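A sketch of the magnitude-based 2:4 conversion described above. This only produces the masked tensor; to get the 2× speedup, the result must then be packed into the hardware's semi-structured sparse format (PyTorch exposes this via its sparse semi-structured tensor support) rather than kept as explicit zeros.

```python
import torch

def to_2_4_sparse(w: torch.Tensor) -> torch.Tensor:
    """In every contiguous group of 4 weights along the last dimension,
    keep the two largest-magnitude entries and zero the other two."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices     # two largest |w| per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randn(4096, 4096)
w24 = to_2_4_sparse(w)
print("sparsity:", (w24 == 0).float().mean().item())  # exactly 0.5
```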
Activation sparsity
Beyond weight sparsity, activation sparsity exploits the property that many neural-network activations are zero (or near-zero) due to ReLU-like nonlinearities. If you can predict which activations will be zero before computing them, you can skip the corresponding weights entirely. Deja Vu (Liu et al. 2023) and similar methods predict activation sparsity for LLMs, achieving substantial speedups for inference. The methodology requires per-input computation (which activations are zero depends on the input), making it more complex than static weight sparsity, but the speedups can be substantial for memory-bound LLM inference.
Mixture-of-experts as sparsity
Mixture-of-experts (MoE) models — already discussed in Ch 02 §10 — are essentially a structured sparsity pattern at the architecture level. Different tokens activate different "experts"; only a small fraction of the model's parameters compute for any given token. The sparsity is in which experts run, not which weights within experts. MoE provides the largest practical "sparsity" in modern frontier models — DeepSeek-V3 and similar activate only ~5% of parameters per token. The 2024–2026 frontier is increasingly MoE-flavoured because the compute-vs-parameter-count trade-off favours sparse expert architectures.
Block and butterfly sparsity
Beyond 2:4, more flexible sparsity patterns are supported by some hardware. Block sparsity: weights are organised in blocks (typically 16×16 or 32×32), with entire blocks zero or nonzero. Butterfly sparsity: a structured pattern from signal processing, supported by some specialised hardware. N:M sparsity: generalisations of 2:4 (e.g., 1:4, 2:8, 4:8). Each pattern offers different quality-vs-speedup trade-offs; the practical choice depends on hardware support and willingness to do training-time sparsification.
The hardware-co-design pattern
Effective sparsity is fundamentally hardware-co-designed: the chip's specific support for sparse computation determines which sparsity patterns produce real speedups. The 2020s evolution has seen substantial hardware support for sparsity — NVIDIA's Sparse Tensor Cores, AMD's matrix-core sparsity support, the various ASIC-based sparse accelerators (Cerebras, Groq, the various 2024–2026 entrants). The trajectory is toward more flexible sparsity at the hardware level, which will enable more aggressive compression at the algorithm level.
Combining Compression Techniques
Real production deployments rarely use a single compression technique in isolation. Combining quantisation, pruning, distillation, and low-rank methods often produces compounding benefits — but also compounding risks, since each technique has its own quality cost. This section covers how to combine techniques productively and how to avoid the pitfalls.
The orthogonality principle
Different compression techniques tend to be largely orthogonal: quantisation reduces precision per weight; pruning reduces the number of weights; distillation reduces the model size; low-rank reduces matrix complexity. Combining them roughly multiplies their compression ratios. A model with 4-bit quantisation (4× from 16-bit) plus 50% structured sparsity (2×) plus distillation to half the layer count (2×) is roughly 16× smaller than the original — and each step's quality loss is somewhat independent. The combined quality loss is approximately additive in many cases, not multiplicative.
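The arithmetic is worth making explicit; a trivial sketch of how the ratios in the example above compound:

```python
# Illustrative composition of the ratios from the example above.
quantisation = 16 / 4    # BF16 -> 4-bit weights
sparsity     = 1 / 0.5   # 50% structured sparsity
distillation = 2         # student with half the layer count
print(f"combined: {quantisation * sparsity * distillation:.0f}x")  # 16x
```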
The order matters
The order of operations affects quality. Distill first, then quantise: train a small student via distillation, then post-training quantise. Standard pattern for producing deployable LLMs from large teachers. Prune then quantise: remove redundant weights first, then quantise the remaining ones. Avoids quantising weights that will be zeroed anyway. Quantise then prune: less common; usually worse quality. Compress then fine-tune: apply compression operations, then fine-tune on a small dataset to recover quality. The post-compression fine-tune is the most-impactful single quality-recovery step; teams that skip it leave substantial quality on the table.
Staged compression for deployment
The mature production pattern for LLM deployment in 2026 is roughly: Stage 1: train the full-precision teacher model. Stage 2: distil to a smaller student (Section 5). Stage 3: apply structured pruning if hardware supports it (Section 7). Stage 4: quantise to 4-bit or 8-bit using GPTQ/AWQ (Section 3). Stage 5: small post-compression fine-tune to recover any quality lost. Stage 6: deploy with the appropriate serving framework (vLLM, TensorRT-LLM, etc.). Each stage is optional; the right choice depends on the deployment constraints and the compression engineering budget.
Hyperparameter search at each stage
Each compression technique has hyperparameters: pruning ratio per layer, quantisation bit width, distillation temperature, LoRA rank. Tuning these well requires search. Layer-wise sparsity: not all layers should be pruned equally — typically layers with more parameters tolerate more pruning. Mixed-precision quantisation: some layers benefit from higher precision than others (typically the embedding and output layers stay higher precision). Adaptive ranks: low-rank methods benefit from per-layer rank tuning. The 2024–2026 work on automated compression hyperparameter search (the various AutoML-for-compression methods) has matured the methodology substantially.
Compression-aware training
Beyond combining post-training methods, compression-aware training trains models knowing they'll be compressed. Quantisation-aware training (Section 3) is the canonical example. Sparsity-aware pretraining trains with sparsity from the start. Distillation-friendly training shapes the teacher's behaviour to be more easily distilled. The general pattern: training that produces models more amenable to compression than naive training would. The training-time investment is rewarded with better post-compression quality. The 2024–2026 work has substantially advanced this direction.
The deployment-stack composition
Modern serving stacks (vLLM, TensorRT-LLM, llama.cpp) combine many compression techniques automatically. A typical vLLM serving config might enable: 4-bit GPTQ weights, FP8 KV-cache, 2:4 sparsity if applicable, FlashAttention for memory efficiency, continuous batching. The user doesn't manage each technique separately; they configure the serving stack. The 2024–2026 evolution has substantially raised what comes "for free" with modern serving frameworks; teams that haven't updated their serving stack in two years are leaving substantial efficiency on the table.
Evaluation: Measuring Compression Quality
Compression is fundamentally a quality-vs-cost trade-off; evaluating where on that trade-off a particular compressed model lives is essential. The challenge is that simple aggregate metrics (perplexity, accuracy on standard benchmarks) don't capture all the ways compression can degrade a model. Robust evaluation requires multiple complementary metrics and careful attention to what the deployment cares about.
The standard quality metrics
For LLMs, the standard quality metrics are perplexity on a held-out evaluation corpus (lower is better), accuracy on benchmark suites (MMLU, HellaSwag, GSM8K, HumanEval, MATH), and downstream task performance (whatever the actual application measures). The first two are easy to compute and standard; the third is what actually matters for deployment. A compressed model with 0.5% perplexity increase but 3% drop on the actual production task is not a good deployment candidate — and standard benchmarks don't always catch this kind of regression.
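A hedged sketch of a perplexity measurement with Hugging Face transformers; the model name and corpus file are placeholders, and a proper evaluation would stride a window over the whole corpus rather than score a single chunk.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "my-org/model-4bit-gptq"          # hypothetical compressed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
model.eval()

text = open("eval_corpus.txt").read()    # placeholder held-out corpus
ids = tok(text, return_tensors="pt").input_ids[:, :4096].to(model.device)

with torch.no_grad():
    # labels=input_ids makes the model return mean next-token cross-entropy
    loss = model(ids, labels=ids).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```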
Behavioural evaluation
Beyond metric-based evaluation, behavioural evaluation (cross-referenced from Ch 05 of MLOps) catches issues that aggregate metrics miss. Does the compressed model still handle the same edge cases? Does it produce the same outputs on a fixed test set of typical inputs? Does it have the same refusal patterns (an important consideration for instruction-tuned models)? The 2024–2026 work on LLM behavioural evaluation has produced standard test suites (Inspect AI, Ragas, the various LLM evaluation frameworks). A compressed model that passes accuracy benchmarks but fails behavioural tests has a specific failure mode that operators need to know about.
Latency, throughput, and the operational profile
Beyond quality, the compression's actual operational benefits must be measured. Latency: time to generate the first token, time per generated token, total time for a typical request. Throughput: tokens per second per GPU at production batch sizes. Memory footprint: peak GPU memory during inference. Cost per token: the integrated dollars-per-million-tokens metric. The compressed model that's nominally smaller but slower (e.g. due to dequantisation overhead) is often a worse deployment than a slightly-larger model with cleaner numerics. Mature compression evaluation always measures these actual operational metrics, not just compression ratios; a crude single-request probe is sketched below.
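The probe assumes a Hugging Face causal LM on a CUDA device; serving frameworks report these metrics properly at production batch sizes, which a single-request measurement like this cannot capture.

```python
import time
import torch

def time_generation(model, tok, prompt: str, max_new_tokens: int = 128):
    """Crude latency probe: time-to-first-token and overall tokens/sec."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(ids, max_new_tokens=1)          # prefill + one token
    torch.cuda.synchronize()
    ttft = time.perf_counter() - t0                # time to first token

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    total = time.perf_counter() - t0
    # Conflates prefill and decode; serving stacks separate the two.
    tokens_per_s = (out.shape[1] - ids.shape[1]) / total
    return ttft, tokens_per_s
```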
Slice-aware evaluation
Compression can affect different subpopulations differently. A 4-bit quantised model might handle common cases fine but fail on rare or out-of-distribution inputs. A pruned model might lose capability on niche topics that depended on the pruned weights. Slice-aware evaluation (Ch 04 §9 of MLOps) tests performance on important subpopulations and surfaces these regressions. The discipline is to identify the slices in advance and evaluate against them; teams that don't end up catching these regressions in production, where they affect users.
The capability-elasticity question
An emerging research question: which capabilities are most elastic with respect to compression — i.e., degrade fastest as compression increases. The 2024–2026 evidence suggests that reasoning-heavy tasks (math, code, multi-step problems) degrade faster than surface-level tasks (paraphrasing, summarisation). Long-context capability degrades faster than short-context capability. Instruction-following calibration (knowing when to refuse, when to express uncertainty) is particularly sensitive to compression. The implication: compression decisions should be informed by which capabilities matter most for the deployment.
The benchmark-vs-production gap
A standing concern: standard benchmarks may not reflect production usage. A compressed model with strong MMLU scores might still feel "worse" in chat, less helpful for actual user tasks, or differently calibrated. A/B testing compressed-vs-uncompressed models in production (Ch 06 of MLOps) is the gold-standard evaluation but is expensive and slow. The pragmatic intermediate: human evaluation panels on representative production traffic, plus continuous monitoring (Ch 04 of MLOps) for unexpected degradations after deployment.
The Frontier and the Operational Question
Model compression is mature for most production use cases in 2026, but several frontiers remain active. 1-bit and ternary methods are becoming production-relevant; MoE-aware compression has its own methodology; hardware-software co-design is producing new compression-friendly architectures; the speculative-decoding paradigm changes the compression-vs-quality trade-off in interesting ways. This section traces the open questions and the directions the field is moving in.
The 1-bit and ternary frontier
BitNet b1.58's 2024 demonstration of competitive ternary LLMs has produced substantial follow-up. The 2025 work on scalable ternary training, post-training conversion from full-precision to ternary, and ternary-specialised hardware is rapidly maturing. The ultimate hardware advantages — multiplication-free arithmetic, dramatically reduced chip area, lower power — are substantial enough that several startups are building ternary-specialised chips. Whether ternary becomes the production deployment standard or remains a parallel track is open, but the trajectory is faster than seemed plausible in 2022.
MoE-aware compression
Mixture-of-experts models present their own compression challenges. The experts are individually small but the total model is large; the routing structure introduces complications. Expert pruning removes lightly-used experts; expert merging combines similar experts into one; expert quantisation can use different precisions for different experts based on usage. The 2024–2026 work on MoE compression is active; specialised methods for DeepSeek-V3-class models are rapidly maturing.
Speculative decoding and compression's reframing
Speculative decoding uses a small "draft" model to propose tokens that a large "verifier" model checks; the verifier runs much less often than every token. The methodology effectively compresses inference cost without compressing the model itself: the deployment serves the full-quality large model but at the speed of the smaller draft model. The 2023–2026 work on speculative decoding has substantially advanced; modern serving frameworks (vLLM, TensorRT-LLM) integrate it natively. The reframing it provides — efficiency at inference time without quality loss in the model — is shifting how teams think about compression vs serving optimisation.
The training-data / model-size frontier
The Chinchilla paper established that compute-optimal scaling balances model size and training data. The 2024–2026 work has produced models that are deliberately under-sized relative to their training data ("over-trained" small models) — the LLaMA family, Phi-3, Gemma — that achieve quality competitive with larger models. This is effectively compression-by-design: train a smaller model on more data rather than train a larger model and then compress. Whether this approach displaces post-training compression or coexists with it is open.
Hardware-software co-design
The 2024–2026 trajectory is increasingly toward hardware that's specifically designed for compressed models. FP4-native hardware (B200, MI300) makes 4-bit quantisation actually fast. Sparse-tensor cores make 2:4 sparsity actually fast. Ternary-specialised chips (the various 2024–2026 startup designs) make 1.58-bit actually fast. The compression methodology and the hardware development are increasingly co-designed; choosing compression strategies in 2026 means choosing for specific deployment hardware.
What this chapter has not covered
Several adjacent topics are out of scope. Inference optimisation at depth (batching, KV caching, speculative decoding implementation) is the topic of Ch 04. AI chips and custom silicon at depth — the design philosophy and the economics — is Ch 05. The detailed methodology of parameter-efficient fine-tuning beyond LoRA basics is in Part IX (LLMs). Compression for non-transformer architectures (CNNs, RNNs, GNNs) is touched only in passing; the methodology for those architectures is similar in principle but different in detail. The chapter focused on the methodology of model compression with emphasis on the LLM-era patterns; the broader landscape develops adjacent topics in subsequent chapters.
Further reading
Foundational papers and references for model compression. Hinton et al. on knowledge distillation; Han et al. on Deep Compression; Frankle & Carbin's lottery ticket; Hu et al. on LoRA; Frantar et al. on GPTQ; Dettmers et al. on QLoRA; Ma et al. on BitNet; the various 2024–2026 follow-ups; and the production toolkit documentation (Hugging Face PEFT, AutoGPTQ, AWQ, vLLM) form the right starting kit.
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015). The foundational knowledge-distillation paper. Introduces the soft-targets distillation framework that has become standard. Required reading for anyone working with distillation. The reference for knowledge distillation.
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (Han et al., 2015). The seminal Deep Compression paper. Introduces the combined pruning-quantisation-encoding pipeline that influenced essentially all subsequent compression work. Required reading for compression methodology. The reference for combined compression.
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (Frankle & Carbin, 2019). The lottery-ticket paper. Establishes the empirical finding that trained networks contain sparse subnetworks ("winning tickets") that can match the original's quality. Foundational for thinking about sparsity and pruning. The lottery-ticket reference.
- DistilBERT: a distilled version of BERT — smaller, faster, cheaper and lighter (Sanh et al., 2019). The DistilBERT paper. Demonstrates that distillation can produce 60%-of-original-cost models with 97% of the quality, establishing the methodology for transformer distillation. Required reading for anyone distilling transformer models. The DistilBERT / transformer-distillation reference.
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021). The LoRA paper. Introduces low-rank adapter-based fine-tuning that has become the dominant parameter-efficient fine-tuning method. The foundation of modern LLM fine-tuning. The LoRA reference.
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023). The QLoRA paper. Combines NF4 quantisation with LoRA fine-tuning, enabling 65B-parameter LLM fine-tuning on a single 48 GB GPU. The foundation of the open-source LLM fine-tuning ecosystem. The QLoRA reference.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022). The GPTQ paper. Introduces the layer-wise reconstruction approach to LLM post-training quantisation that achieves 4-bit quantisation with minimal quality loss. The foundation of the modern LLM-quantisation toolkit. The GPTQ reference.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023). The AWQ paper. Improves on GPTQ by accounting for activation outliers when choosing quantisation scaling factors. Slightly higher quality at 4-bit than GPTQ; widely deployed in production LLM serving. The AWQ reference.
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (Frantar & Alistarh, 2023). The SparseGPT paper. Introduces a one-shot post-training pruning method that achieves 50%+ sparsity on LLMs without retraining. Foundational for modern LLM pruning. The SparseGPT / LLM pruning reference.
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58) (Ma et al., 2024). The BitNet b1.58 paper. Demonstrates that LLMs can be trained from scratch with weights restricted to {-1, 0, 1}, achieving competitive quality with much lower compute and memory. The foundation of the 1.58-bit research agenda. The BitNet / 1.58-bit reference.
- Hugging Face PEFT and Optimum — Documentation. The dominant open-source toolkits for parameter-efficient fine-tuning (PEFT) and model optimisation (Optimum). PEFT implements LoRA, DoRA, IA³, and the broader family; Optimum integrates GPTQ, AWQ, and the various quantisation backends. Required reading for production compression. For production compression tooling.
- Efficient Deep Learning Computing — MIT 6.5940. MIT's course on efficient deep-learning computing (cross-referenced from Ch 01). Comprehensive coverage of compression methodology with strong production focus. The course materials are publicly available and are an excellent resource for deep self-study. For self-study of compression at depth.