Hardware for ML, where neural networks meet silicon.

Modern machine learning is a hardware-bound discipline. The most-discussed model architectures, training recipes, and benchmark results all reflect what's possible on the hardware available, and the hardware has reshaped the field in ways that are easy to take for granted. GPUs made deep learning practical in the early 2010s; TPUs proved that custom silicon could substantially outperform general-purpose chips for specific workloads; NPUs have brought ML acceleration to laptops and phones; the H100 and successor generations have made trillion-parameter LLMs trainable. Beyond the chip itself, memory bandwidth, interconnect topology, and the operational realities of running thousands of accelerators in coordinated training have become the defining constraints of frontier-AI work. The roofline model formalises the central trade-off: every workload is bounded either by compute throughput or memory bandwidth, and understanding which bound applies determines what optimisations matter. This chapter develops the subject at the depth a working ML practitioner needs: the architectural concepts behind the dominant accelerators, the operational considerations that distinguish a productive ML cluster from a poorly-utilised one, and the analytical frameworks for reasoning about hardware-aware ML.

Prerequisites & orientation

This chapter assumes basic familiarity with computer architecture (registers, caches, memory hierarchies) and with the deep-learning material of Part VI. Familiarity with at least one ML framework (PyTorch, JAX, TensorFlow) and the basic operations of training (forward pass, backward pass, optimiser update) is assumed; familiarity with CUDA programming helps for §3 but is not required (we cover the essentials at a conceptual level). The chapter is written for ML engineers, ML researchers, and platform engineers who need to make architectural decisions about hardware: which accelerators to buy or rent, how to size training jobs, how to debug performance, when to invest in custom kernels. Pure-application contexts where ML is consumed via APIs have less use for this material; teams that train and serve their own models will find it directly applicable.

Three threads run through the chapter. The first is the compute-vs-bandwidth tension: every workload is either compute-bound (limited by floating-point throughput) or memory-bound (limited by memory bandwidth), and the two regimes call for different optimisations. The roofline model (Section 8) formalises this, but the intuition runs throughout. The second is the specialisation gradient: hardware ranges from fully-general (CPUs) through partially-specialised (GPUs) to highly-specialised (TPUs, ASICs), with corresponding trade-offs between flexibility and efficiency. The third is the system-vs-chip distinction: a single chip's performance is one thing, but at scale ML training is overwhelmingly bounded by inter-chip communication and the memory hierarchy. The chapter develops each in turn.

01

Why Hardware Shapes ML

The history of modern ML is largely a history of hardware. Backpropagation was published in 1986; deep learning's empirical breakthrough waited until 2012 because the hardware to make it practical hadn't existed. The transformer was published in 2017; large language models reached current capability because GPU clusters capable of training models in the 100B-parameter range became economically feasible. The next generation of capability — agents that operate continuously, models with millions of tokens of context, real-time multimodal generation — is overwhelmingly bottlenecked by what hardware can deliver at what cost. Understanding the hardware landscape is increasingly essential for ML practitioners, not just hardware engineers.

The empirical hardware-capability link

The connection between hardware progress and ML capability is unusually direct in this field. AlexNet (2012) was trained on two GTX 580s with 3 GB of memory each — the first widely-cited deep-learning result, made possible because GPUs had become fast enough. GPT-3 (2020) required ~3,640 petaflop/s-days of compute, infeasible without thousands of V100s in coordinated training. GPT-4-class models required tens of thousands of A100s. Frontier models of 2025–2026 use H100 and B200 clusters with hundreds of thousands of accelerators. Each generation has been gated by what hardware was available; capability scaling is fundamentally a hardware-investment story.

[Figure: the chapter's methodology stack. Chips (§3–5): GPUs (CUDA), TPUs (systolic arrays), NPUs (mobile/edge), and specialised ASICs form the accelerator landscape. System (§6–7): HBM and SRAM, NVLink/NVSwitch, InfiniBand fabrics, and collective ops form the memory and interconnect layer. Analysis (§8–9): the roofline model, arithmetic intensity, profiling tools, and utilisation engineering make it fast. The CPU baseline (§2) and the frontier (§10, B200, MI400, photonic, quantum) bracket the stack; the application layer spans training, inference, edge deployment, and frontier scaling.]

The economic dimension

Hardware also dominates ML economics. Training costs for frontier models are now in the hundreds of millions of dollars, almost entirely accelerator costs. Inference costs for serving popular models can dwarf training costs over a model's deployed lifetime. Cluster capital expenditure at major AI labs runs to billions per year. The implications cascade: hyperscaler capex is reshaping data-centre construction; power grids are becoming a binding constraint; the geopolitics of advanced semiconductor manufacturing has become a national-security issue. ML practitioners increasingly need to reason about hardware costs as a primary engineering concern rather than an externality.

The hardware-software co-design imperative

Modern ML development is a co-design discipline: model architectures, training recipes, and software stacks evolve to match what hardware does well, while hardware evolves to match what models do. Transformers' dominance is partly because they map well to GPUs (matrix multiplication everywhere); FlashAttention's success comes from understanding GPU memory hierarchies; the recent shift toward smaller, specialised models partly reflects inference-cost pressure. The 2024–2026 generation of frontier work (sparse attention, mixture-of-experts, retrieval-augmented architectures) is heavily shaped by hardware-cost considerations. Practitioners who treat hardware as a black box miss optimisation opportunities and routinely make poor architectural choices.

What this chapter covers

The chapter develops hardware concepts at a depth appropriate for ML practitioners, not hardware engineers. We cover what GPUs and TPUs actually do, why they're fast for ML workloads, where the bottlenecks are, and how to reason about performance using the roofline model. We touch on the alternative-accelerator ecosystem (AMD, Intel Gaudi, Cerebras, Groq, the various 2024–2026 entrants) and on the operational realities of running large GPU clusters. We do not develop circuit design, semiconductor process technology, or the deep details of CUDA programming — those are separate disciplines.

The downstream view

Operationally, hardware sits underneath everything else in ML: trained models exist on hardware, training runs consume hardware, inference services run on hardware, and the cost-and-capability frontier is hardware-defined. Upstream: chip vendors (NVIDIA, AMD, Intel, Google TPU, AWS Trainium, the various ASIC startups), cloud providers, on-prem datacentres. Inside this chapter's scope: chip architecture, memory hierarchy, interconnect, the roofline analytical framework. Downstream: distributed training (Ch 02), model compression (Ch 03), inference optimisation (Ch 04), AI chips (Ch 05). The remainder of this chapter develops each piece: §2 the CPU baseline, §3 GPUs, §4 TPUs, §5 alternatives, §6 memory, §7 interconnect, §8 roofline, §9 operations, §10 the frontier.

02

The CPU Baseline and Where It Falls Short

Before discussing accelerators, it's worth understanding why CPUs — the general-purpose chips that run almost all other software — are unsuitable for modern deep learning. CPUs are extraordinarily good at sequential, branchy, latency-sensitive code with complex control flow; they are unsuited to the massively-parallel, throughput-oriented arithmetic that neural networks require. The mismatch motivates the entire accelerator ecosystem.

What CPUs are optimised for

A modern CPU core is optimised for single-thread performance on irregular code: branch prediction, out-of-order execution, deep pipelines, large multi-level caches, sophisticated prefetching. A typical 2026-era server CPU has 32–128 cores, each running at 3–4 GHz with substantial per-core sophistication. The cost of this sophistication is silicon area and power: each core spends a large transistor budget and tens of watts, mostly on control machinery rather than arithmetic. The trade-off is right for most general-purpose computing — running a database, serving a web page, running a developer's IDE — but wrong for the kind of arithmetic ML workloads need.

SIMD and vector instructions

CPUs do offer parallelism through SIMD (Single Instruction Multiple Data) instructions: a single CPU instruction operates on multiple data elements simultaneously. Modern CPU SIMD widths are 256 bits (AVX2) or 512 bits (AVX-512), operating on 8–16 single-precision elements per instruction. Intel AMX (Advanced Matrix Extensions, 2023) extends this with explicit matrix-multiply support. AMD and ARM have analogous features. CPU SIMD is useful for ML inference of small models on existing CPU infrastructure — many production inference workloads can hit acceptable performance on AMX-equipped CPUs without GPUs at all. But it's an order of magnitude or two slower than dedicated accelerators for serious ML workloads.

The arithmetic-throughput ceiling

A modern server CPU can do roughly 1–10 TFLOPS (trillion floating-point operations per second) in FP32. A modern GPU can do 100–4,000 TFLOPS in mixed precision, depending on the model. The gap is one to three orders of magnitude depending on the chips and precisions compared. For a workload that's dominated by matrix multiplication (the deep-learning case), running on CPU when a GPU is available is leaving a huge factor of performance on the table.
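
As a rough illustration of the gap, the sketch below times a single large matrix multiplication at representative sustained throughputs (the figures are the ballpark numbers quoted above, not measurements of any particular chip).

```python
# Back-of-envelope: one 8192x8192x8192 matmul at CPU-class vs GPU-class throughput.
M = N = K = 8192
flops = 2 * M * N * K  # each output element needs K multiply-adds, counted as 2 FLOPs

for name, tflops in [("server CPU, FP32", 5), ("H100-class GPU, BF16", 989)]:
    seconds = flops / (tflops * 1e12)
    print(f"{name:>22}: {seconds * 1e3:8.1f} ms")
```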

Memory bandwidth and the cache hierarchy

CPUs have sophisticated multi-level cache hierarchies (L1, L2, L3) optimised for the spatial and temporal locality patterns of general-purpose code. A typical modern CPU has 32 KB of L1 and roughly 1 MB of L2 per core, plus tens of MB of L3 shared across cores; main-memory bandwidth is roughly 200–400 GB/s for a server-class chip. By comparison, an H100 GPU has 80 GB of HBM3 memory at 3,350 GB/s — an order of magnitude more bandwidth for an ML-shaped workload. The cache hierarchy that helps CPUs on irregular code can actively hurt for ML workloads, where the working set overflows the cache.

When CPUs are appropriate for ML

Despite the general inadequacy, CPUs are appropriate for several ML scenarios. Inference of small models (<1 billion parameters) on existing CPU infrastructure, particularly with AMX. Latency-sensitive inference where GPU cold-starts are unacceptable and the model fits in CPU caches. Sparse or branchy models (some recommender systems, decision trees, GNNs with low arithmetic intensity) where the GPU's parallelism doesn't help much. Embedded inference on edge devices that don't have GPUs. The 2024–2026 work on CPU-optimised inference (Intel's OpenVINO, the various AMX-aware libraries) has made this regime substantially more practical than it was a decade ago.

The fundamental mismatch

For training and most serving of large models, the CPU-vs-accelerator gap is structural. CPUs can't be made fast enough at ML workloads through incremental improvements; the architecture is wrong. The right architecture is massively-parallel, throughput-oriented, with simple cores doing many arithmetic operations in parallel — which is what GPUs and TPUs provide. The next sections develop why their architectures work.

03

GPUs and the Parallel Computing Revolution

GPUs (Graphics Processing Units) accidentally became the dominant ML hardware. Originally designed for the embarrassingly-parallel computations of graphics rendering — apply the same shader to many pixels — they turned out to be near-optimal for the matrix multiplications at the heart of neural networks. NVIDIA's 20-year investment in CUDA (a programming model that exposes GPU parallelism to general-purpose computation) made them programmable for ML; the dominant position they hold in 2026 is the result.

The streaming multiprocessor architecture

A modern GPU is organised around streaming multiprocessors (SMs) — independent processing units that each contain hundreds of simple ALU cores plus shared memory. An H100 has 132 SMs; each SM has 128 FP32 cores plus specialised hardware (Tensor Cores, see below) for matrix operations. Threads execute in groups called warps (32 threads) that move in lockstep through the same instruction; massive parallelism comes from running many warps simultaneously across many SMs. The total parallelism is enormous: an H100 has hundreds of thousands of threads in flight at once.

Tensor Cores

The single most-important GPU evolution for ML was the introduction of Tensor Cores in NVIDIA's Volta architecture (V100, 2017). Tensor Cores are dedicated hardware for small matrix multiplications — typically 4×4 in V100, 16×8 in newer architectures — performed in mixed precision (FP16 or BF16 inputs accumulated in FP32). Tensor Cores deliver substantially higher throughput than general FP32 cores for matrix multiplication, which is overwhelmingly the dominant operation in neural networks. The successive Tensor Core generations have driven the FLOPS-per-chip improvements from V100 (125 TFLOPS FP16) through A100 (312 TFLOPS BF16) and H100 (989 TFLOPS BF16) to B200 (~4,500 TFLOPS at FP8, higher still at FP4).
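
In practice, Tensor Cores are reached through mixed-precision execution rather than explicit programming. The sketch below is a minimal PyTorch training step under autocast (the model, sizes, and learning rate are placeholders, not recommendations): matrix multiplications inside the autocast region run in BF16 on Tensor Cores while accumulation and the optimiser state stay in FP32.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()            # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

# Matmuls under autocast execute in BF16 on Tensor Cores, accumulating in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```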

The CUDA programming model

CUDA exposes GPU parallelism through a programming model where the developer writes kernel functions that execute across many threads in parallel. The threads are organised into blocks (sharing fast on-chip memory) and grids (the full launch). Memory comes in several flavours: global memory (HBM, large but slow), shared memory (fast but small, per-block), registers (fastest, per-thread). Writing high-performance CUDA code requires careful management of memory hierarchy, warp divergence, and arithmetic intensity. Most ML practitioners don't write CUDA directly; they rely on framework-provided kernels (PyTorch, JAX, TensorFlow) and the cuDNN / cuBLAS libraries underneath.
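
For a feel of the kernel/block/grid model without raw CUDA C++, the sketch below uses Triton, the Python-embedded kernel language discussed later in the ecosystem section. It is the canonical vector-add: each program instance (roughly, a block) handles one contiguous chunk of the tensor, loading from and storing to global memory with a mask guarding the tail. The block size of 1024 is an arbitrary illustrative choice.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block this instance is
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)          # reads from global memory (HBM)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)    # writes back to global memory

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)               # number of blocks in the launch
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```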

The H100 generation

The NVIDIA H100 (Hopper architecture, 2022) became the dominant ML training accelerator through 2024–2025. Headline specs: 80 GB HBM3 memory at 3.35 TB/s bandwidth, 989 TFLOPS dense FP16/BF16 (roughly double with structured sparsity), NVLink at 900 GB/s. The H100 SXM form factor — 8 H100s on a single board with NVSwitch interconnect — is the canonical building block of frontier-AI training systems. Cluster prices for H100s in 2024–2025 ran roughly $30,000 per chip and over $250,000 per 8-GPU server, with significant additional cost for networking and infrastructure.

The B100/B200 generation

The NVIDIA Blackwell architecture (B100, B200, 2024) substantially extended the H100 in nearly every dimension: more HBM (192 GB on B200), higher bandwidth (~8 TB/s), substantially more FP4/FP8 throughput (~9,000 TFLOPS sparse FP4 on B200), and tighter NVLink coupling (NVLink 5 at 1.8 TB/s). The Blackwell-era systems (NVL36, NVL72) bundle dozens of B-series GPUs into rack-scale "single coherent" systems for the largest training jobs. The 2025–2026 frontier-AI training landscape is overwhelmingly Blackwell-based.

Why NVIDIA dominates

NVIDIA's near-monopoly position in training accelerators is not just a hardware story; it's a software story. CUDA has had two decades of investment, with deep integration into every major ML framework. cuDNN provides highly-optimised primitives. NCCL handles multi-GPU collective communication. The ecosystem of tools, libraries, optimised kernels, debugging support, and community knowledge is unmatched. Competitors (Section 5) face a chicken-and-egg problem: hardware that's not in production has no software ecosystem; without software ecosystem, hardware is hard to adopt. The 2024–2026 efforts to build cross-vendor abstractions (OpenAI's Triton, MLIR-based stacks, the various PyTorch backend abstractions) are gradually loosening this lock-in but progress is incremental.

04

TPUs and Custom Silicon

TPUs (Tensor Processing Units) are Google's custom silicon for ML, first deployed internally in 2015. TPUs proved that domain-specific accelerators could substantially outperform general-purpose GPUs on ML workloads, particularly at very large scale. While TPUs remain primarily a Google internal asset (with limited external availability through Google Cloud), their architectural ideas have influenced the broader landscape, and they remain the only credible alternative to NVIDIA at frontier-training scale.

The systolic array architecture

The defining architectural feature of TPUs is the systolic array: a 2D grid of multiply-accumulate (MAC) units that pumps data in from one direction and accumulates results across the array. For matrix multiplication — the dominant ML operation — systolic arrays are extremely efficient: data flows through the array with minimal control overhead, and the array can be made very large because each MAC unit is simple. TPU v1 had a 256×256 systolic array (65,536 MACs); successive generations have used various sizes optimised for particular workloads. The contrast with GPUs is informative: GPUs use general-purpose SIMD ALUs with Tensor Cores grafted on; TPUs are systolic-array-first.
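
The peak throughput of a systolic array follows directly from its dimensions and clock, since each MAC unit contributes a multiply and an add per cycle. A quick check against the commonly cited TPU v1 figures (illustrative arithmetic, using the published 256×256 array and roughly 700 MHz clock):

```python
# Peak throughput of a systolic array = number of MAC units * 2 ops * clock rate.
rows, cols = 256, 256        # TPU v1 array dimensions
clock_hz = 700e6             # TPU v1 clock, roughly 700 MHz
peak_ops_per_sec = rows * cols * 2 * clock_hz
print(f"{peak_ops_per_sec / 1e12:.0f} TOPS")  # ~92 TOPS, matching the published int8 peak
```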

The TPU generations

TPU has gone through six publicly-acknowledged generations. TPU v1 (2015) was inference-only, deployed for Google Search. TPU v2 (2017) added training support and bf16. TPU v3 (2018) doubled performance. TPU v4 (2021) introduced optical-circuit-switched interconnect for substantially better cluster scaling. TPU v5 (v5e and v5p, 2023) improved per-chip and cluster efficiency. TPU v6 / Trillium (2024) is the most-recent generation as of 2026, focused on inference. Each generation has been deployed at substantial internal scale; external availability through Google Cloud has been more limited than NVIDIA's GPU availability through major clouds.

Pod-scale training

The defining TPU advantage is pod-scale training. A TPU pod is thousands of TPUs connected by Google's specialised optical interconnect; the entire pod operates as a single distributed-training system. Pod scale runs from hundreds of TPUs (small pods) to tens of thousands (TPU v4/v5 pods, which Google has used to train Gemini and similar). The interconnect bandwidth and the training-software stack (XLA, JAX) are tightly co-designed for pod operation. For very-large training jobs, TPU pods are competitive with — sometimes superior to — H100/B200 clusters, despite NVIDIA's broader ecosystem.

The XLA and JAX stack

TPU programming is via XLA (Accelerated Linear Algebra), a compiler that takes high-level ML graphs and produces optimised TPU code. JAX is the preferred ML framework for TPU work — JAX's functional design and explicit compilation model fit the TPU/XLA pattern naturally. TensorFlow also targets TPUs through XLA. PyTorch can target TPUs through the PyTorch/XLA project, but the experience is less polished than native CUDA. The framework-and-compiler asymmetry is part of why TPUs have not displaced GPUs even where their hardware would be competitive.
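
The compile-first model is visible even in a toy JAX example: the function is traced once, compiled by XLA for whichever backend is available (TPU, GPU, or CPU), and subsequent calls reuse the compiled program. A minimal sketch with illustrative shapes:

```python
import jax
import jax.numpy as jnp

@jax.jit                        # trace once, compile with XLA for the local backend
def mlp_layer(w, b, x):
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (1024, 1024))
b = jnp.zeros(1024)
x = jax.random.normal(key, (32, 1024))

y = mlp_layer(w, b, x)          # first call compiles; later calls reuse the binary
print(y.shape, jax.devices())   # shows which backend XLA targeted
```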

Why TPUs aren't more widely available

TPUs are primarily a Google internal asset. External availability is limited to Google Cloud, where TPU pricing has historically been competitive with GPU pricing but with substantially less hardware availability. The strategic reason is that TPUs are part of Google's competitive advantage in AI — making them broadly available would erode that advantage. The technical reason is that TPU's pod-scale advantages are hard to deliver to customers as smaller chunks; TPU economics work best at scale. The 2024–2026 evolution has been gradual loosening — more TPU capacity available externally, better tooling — but TPUs remain a Google-favoured platform.

Custom silicon as a strategic asset

The TPU example has motivated other major AI labs to invest in custom silicon. AWS Trainium / Inferentia, Meta's MTIA, Microsoft's Maia, and various other proprietary chips are all responses to the strategic logic Google demonstrated: when you're at hyperscale, owning the silicon stack is a competitive advantage. The 2024–2026 industry trajectory has been toward more custom silicon, more diversity in the accelerator landscape, and gradually-less-monolithic NVIDIA dominance.

05

The Accelerator Ecosystem

Beyond NVIDIA GPUs and Google TPUs, the 2020s have produced a substantial ecosystem of alternative ML accelerators. Some target training, some target inference, some target edge deployment. The 2024–2026 generation has produced credible competitors to NVIDIA in specific niches; whether any will dislodge NVIDIA's dominant position at the high end remains unclear, but the diversity is real and growing.

AMD MI300 and the GPU competitor

AMD's MI300X (2023) and MI325X (2024) are the most-credible direct competitors to NVIDIA's H100/H200 generation. The chips have larger HBM (192 GB on MI300X vs 80 GB on H100) and competitive raw FLOPS. The software story has been the perennial challenge: ROCm (AMD's CUDA equivalent) is less mature, and porting CUDA-optimised ML codebases to ROCm has been substantial engineering work. The 2024–2026 progress on PyTorch ROCm support has been substantial; major training jobs (some at OpenAI and Meta) have begun running on AMD hardware. AMD is a serious alternative for inference and increasingly for training, but ecosystem maturity remains a gap.

Intel Gaudi

Intel Gaudi (originally from Habana Labs, acquired by Intel 2019) is Intel's dedicated AI training accelerator. Gaudi3 (2024) competes with the H100 generation on price-per-performance for training workloads. The software is SynapseAI, with PyTorch support. Like AMD, Intel faces ecosystem-maturity gaps but has been making investments. Gaudi has been adopted by some hyperscalers and some on-premises ML platforms; its broader market impact has been more modest than AMD's.

AWS Trainium and Inferentia

AWS Trainium (training) and Inferentia (inference) are Amazon's custom silicon. Trainium 2 (2024) is competitive with H100 for training cost-effectiveness and is heavily adopted within AWS itself for training Anthropic Claude and other partner models. Inferentia 3 (2025) similarly competes for inference cost. These chips are the path of least resistance for AWS customers, and less attractive for teams not already deeply invested in AWS infrastructure.

Cerebras and the wafer-scale approach

Cerebras (founded 2016) takes a radically different approach: a single chip that occupies an entire silicon wafer (the Wafer-Scale Engine, WSE). The WSE-3 (2024) has 4 trillion transistors, 900,000 cores, and 44 GB of on-chip SRAM — eliminating the chip-to-chip and chip-to-memory bottlenecks that bound conventional accelerators. Cerebras has been deployed at major research labs and in some supercomputing contexts; its niche is workloads where the conventional GPU approach hits scaling walls. The economics are unusual; whether wafer-scale becomes mainstream or remains a specialised tool is open.

Groq and inference-specialised silicon

Groq (founded 2016) targets inference with a deterministic dataflow architecture optimised for LLM token generation. Groq has demonstrated dramatically lower per-token latency than GPU-based inference (a key metric for interactive chatbot applications). The 2024–2026 traction has been substantial; Groq Cloud and partner deployments have brought GPU-comparable inference cost with much lower latency. Whether Groq becomes a general-purpose inference platform or remains a specialised latency-leader is open.

NPUs and edge accelerators

NPUs (Neural Processing Units) are accelerators integrated into consumer devices. Apple's Neural Engine (in M-series chips and iPhones) provides 35–40 TOPS for on-device ML. Qualcomm's Hexagon NPU (in Snapdragon SoCs) is similar. Intel's NPU (Meteor Lake/Lunar Lake/Arrow Lake) has been integrated into recent laptop CPUs. The 2024–2026 push toward "AI PC" hardware has substantially raised on-device NPU capability. The use case is local inference — speech recognition, image processing, small LLMs — where privacy, latency, or offline operation matter. The performance is far below datacentre-class but adequate for the workload.

The strategic question

The 2026 question is whether the alternative-accelerator ecosystem will remain a set of niches or coalesce into serious competition for NVIDIA. The answer is uncertain: NVIDIA's combination of hardware, software, and ecosystem is formidable; alternative vendors face the chicken-and-egg ecosystem problem; but NVIDIA's pricing power and supply constraints have given alternatives oxygen they didn't have in 2018–2022. Most production ML in 2026 still runs on NVIDIA, but the alternatives are real and growing. The wise practitioner watches the ecosystem rather than betting on a single vendor.

06

Memory Hierarchy and Bandwidth

After raw arithmetic throughput, the most-important hardware property for ML is memory bandwidth: how fast can data move from where it lives to where it's used. Many ML workloads — particularly transformer attention and the inference path generally — are memory-bandwidth-bound rather than compute-bound, meaning a chip with higher FLOPS but slower memory underperforms its lower-FLOPS-but-faster-memory competitor. Understanding the memory hierarchy is essential to understanding ML performance.

HBM and the bandwidth race

The dominant memory technology for ML accelerators is HBM (High Bandwidth Memory): 3D-stacked DRAM connected to the chip through a wide on-package bus. HBM3 (used in H100, MI300) provides ~3 TB/s; HBM3e (used in H200, B200) provides roughly 5–8 TB/s per chip; HBM4 (announced for the next generation) targets higher still. By comparison, conventional GDDR memory in consumer GPUs offers ~1 TB/s; CPU DDR5 memory offers ~400 GB/s. The HBM-vs-DDR gap is what makes ML accelerators ML accelerators; without HBM, GPU FLOPS would be wasted waiting for data.

The memory-capacity dimension

Beyond bandwidth, memory capacity matters because it determines what models fit. An H100 has 80 GB HBM; a single H100 cannot hold a 70B-parameter LLM in FP16 (which would need ~140 GB). The mitigations include model parallelism (split the model across multiple chips, Ch 02), quantisation (reduce precision, Ch 03), and the various memory-saving training techniques (gradient checkpointing, ZeRO sharding). The 2024–2026 trend has been toward larger memory per chip (H200's 141 GB, B200's 192 GB, MI300X's 192 GB) which lets larger models fit on fewer chips and reduces parallelism complexity.
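
A rough memory budget makes the capacity constraint concrete. The sketch below uses the commonly cited mixed-precision accounting (BF16 weights and gradients plus FP32 master weights and Adam moments, roughly 16 bytes per parameter); activation memory is workload-dependent and omitted, so treat the result as a lower bound for a single unsharded model copy.

```python
def training_memory_gb(n_params: float) -> float:
    """Approximate memory for one unsharded copy of a model under mixed-precision Adam."""
    bytes_per_param = (
        2      # BF16 weights
        + 2    # BF16 gradients
        + 4    # FP32 master weights
        + 4    # Adam first moment
        + 4    # Adam second moment
    )          # ~16 bytes/param, before activations
    return n_params * bytes_per_param / 1e9

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params: ~{training_memory_gb(n):,.0f} GB before activations")
```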

The on-chip memory hierarchy

Beyond HBM, modern accelerators have substantial on-chip SRAM. An H100 SM has ~256 KB of register file plus ~228 KB of shared memory / L1 cache; the chip-level L2 cache is 50 MB. SRAM is 10–100× faster than HBM but 1000× smaller. Performance optimisation involves keeping the working set in SRAM as long as possible — the FlashAttention algorithm (2022) is the classic example of an algorithm redesigned to fit in SRAM, achieving 2–4× speedups over naive attention. Modern serving frameworks (vLLM with PagedAttention, Ch 03 of MLOps) similarly are SRAM-conscious.
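
In PyTorch, the SRAM-aware attention kernels are reached through the scaled_dot_product_attention function, which dispatches to a fused FlashAttention-style implementation when shapes, dtypes, and hardware allow. A minimal sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 4, 16, 2048, 64
q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: avoids materialising the full seq x seq score matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (4, 16, 2048, 64)
```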

Memory bandwidth as the binding constraint

For many ML workloads, particularly transformer attention and most LLM inference, memory bandwidth is the binding constraint. LLM token generation requires reading the entire KV-cache for each generated token; the bandwidth determines tokens-per-second more than the FLOPS rating does. Attention moves a quadratic amount of data through the memory hierarchy; bandwidth bounds this directly. Many fine-tuning workloads are similarly bandwidth-bound. The roofline analysis (Section 8) makes this quantitative; the operational implication is that hardware selection should be informed by memory bandwidth, not just FLOPS.
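
The decoding bound is easy to estimate: each generated token must stream (at least) every model weight from HBM once, so tokens-per-second for a single sequence is capped by bandwidth divided by the model's size in bytes. A sketch using the H100 figure quoted above (illustrative; it ignores KV-cache traffic, which makes the real bound tighter):

```python
def decode_tokens_per_sec_bound(n_params: float, bytes_per_param: float,
                                bandwidth_gb_per_s: float) -> float:
    """Bandwidth-limited upper bound on single-sequence decode throughput."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / model_bytes

# 70B-parameter model in FP16 on an H100 (3,350 GB/s HBM3), batch size 1:
print(f"~{decode_tokens_per_sec_bound(70e9, 2, 3350):.0f} tokens/s upper bound")  # ~24
```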

The memory-bandwidth wall

Memory bandwidth has been growing more slowly than compute. From 2010 to 2024, peak per-chip FLOPS (helped by Tensor Cores and lower precisions) grew roughly 1000×; memory bandwidth grew roughly 30×. The ratio of compute to bandwidth has steadily increased, meaning more workloads are bandwidth-bound on newer hardware. This is partly intentional: chip designers prioritise compute because it scales more cheaply with transistor count; bandwidth is a packaging-and-physics problem. The 2024–2026 generation has substantially closed this gap (HBM3e's 5–8 TB/s is a major step), but the trend is still that bandwidth is precious and worth optimising for.

NVLink and chip-to-chip memory

For multi-GPU training, the bandwidth that matters extends beyond a single chip's HBM to the interconnect bandwidth between chips. NVLink (NVIDIA's chip-to-chip interconnect) provides 900 GB/s on H100 and 1.8 TB/s on B200. NVSwitch extends this to all-to-all connectivity within an 8-GPU server (or 72-GPU rack with NVL72). Cross-server connectivity drops to 200–800 Gb/s (roughly 25–100 GB/s) via InfiniBand. The bandwidth-cliff between NVLink-connected and network-connected chips substantially shapes how training is parallelised — Section 7 develops this.

07

Interconnect: Networking Many Chips Together

A frontier ML training run uses thousands or tens of thousands of accelerators in coordinated training. The accelerators must constantly exchange gradient updates, model weights, and activations. The interconnect — the network connecting accelerators — is often the single most-binding constraint on training scaling. Designing the interconnect is its own engineering discipline; understanding it is essential for reasoning about distributed-training performance.

The interconnect hierarchy

Modern training systems have a multi-level interconnect hierarchy. On-chip: connections between cores within a single chip (~10 TB/s effective). Intra-server: NVLink/NVSwitch within an 8-GPU server (~900 GB/s per GPU, all-to-all topology). Intra-rack: NVLink-extended fabric (NVL72) up to 72 GPUs in a single coherent system (~1.8 TB/s per GPU). Inter-rack: InfiniBand or Ethernet at 200–800 Gbps (~25–100 GB/s) per link. Inter-datacentre: orders-of-magnitude slower, used only for very-large-scale training spread across sites. Each level is roughly 10× slower than the one before it; training-parallelism strategies must respect this hierarchy.

Topology and the network fabric

At scale, the network topology determines which chips can talk to which others efficiently. Common topologies include fat tree (Clos network, the standard data-centre topology), torus (used in TPU pods, particularly efficient for nearest-neighbour communication), and dragonfly (used in some HPC systems). The topology shapes which collective-communication patterns are efficient and which have high latency or limited bandwidth. Modern training stacks (NCCL, RCCL, oneCCL) implement collective operations that respect the topology; but they don't compensate for fundamentally bad topologies.

Collective operations

Distributed training is dominated by collective communication: operations involving all participating workers. All-reduce averages a tensor across all workers (the dominant operation in data-parallel training, run after each gradient computation). Broadcast sends a tensor from one worker to all others. Reduce-scatter performs a partial reduction with each worker keeping only its portion of the result. All-gather is the reverse, collecting every worker's shard so each worker ends up with the full tensor. These primitives are implemented by NVIDIA's NCCL library (the dominant choice), AMD's RCCL, the cross-platform MPI, and various others. Performance optimisation of collectives is a major topic in distributed-training engineering.
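
A minimal data-parallel all-reduce in PyTorch looks like the sketch below, assuming a launch via torchrun (which sets the rank and world-size environment variables); the tensor stands in for a bucket of gradients.

```python
import torch
import torch.distributed as dist

def average_gradients_demo():
    dist.init_process_group(backend="nccl")                # NCCL for GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    grad = torch.full((1024, 1024), float(rank), device="cuda")  # stand-in gradient
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)            # sum across all workers
    grad /= dist.get_world_size()                          # turn the sum into a mean

    if rank == 0:
        print("averaged value:", grad[0, 0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    average_gradients_demo()   # e.g. torchrun --nproc_per_node=8 all_reduce_demo.py
```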

InfiniBand and the data-centre network

For inter-server communication, InfiniBand is the dominant high-performance fabric. InfiniBand offers low latency (~1 µs round-trip), high bandwidth (200–400 Gbps per port in current HDR/NDR), and efficient RDMA (Remote Direct Memory Access) that lets a worker read another worker's memory without CPU involvement. Mellanox (acquired by NVIDIA in 2020) is the dominant vendor. Ethernet alternatives (RoCE, Ultra Ethernet under development by the UEC) offer competitive performance at lower cost; the 2024–2026 trajectory suggests Ethernet may displace InfiniBand for many workloads, though InfiniBand remains the standard for the highest-performance training.

The bandwidth-vs-latency trade-off

Different ML workloads care about different network properties. Large-batch data-parallel training cares about bandwidth: the all-reduce of large gradient tensors dominates communication, and latency is amortised across the reduction. Pipeline-parallel training cares about latency: pipeline bubbles depend on the time to forward activations, and a slow link wastes compute on every micro-batch. Inference systems care about both: bandwidth for moving model weights, latency for token-by-token generation. The interconnect choice affects each differently; mature ML infrastructure makes trade-offs explicit.

Optical interconnect and the future

The 2024–2026 frontier in interconnect is optical interconnect. Co-packaged optics (CPO) integrate optical transceivers directly with the switch ASIC, dramatically reducing power and increasing bandwidth. Google's TPU v4/v5 pods use an optical-circuit-switched fabric for unprecedented scaling. NVIDIA, Broadcom, and Cisco all have CPO products in development or shipping. The 2026–2030 trajectory plausibly has photonic interconnects as the standard for high-performance ML clusters; the change is incremental but substantial.

08

The Roofline Model and Performance Analysis

The roofline model (Williams, Waterman, Patterson 2009) is the dominant analytical framework for reasoning about hardware performance. It distils the relationship between a workload's arithmetic intensity (FLOPS per byte of memory traffic) and the hardware's compute and bandwidth capabilities into a single chart that immediately tells you whether you're compute-bound or bandwidth-bound — and therefore which optimisations matter. Understanding the roofline model is essential for serious hardware-aware ML work.

The basic roofline

The roofline model plots achievable performance (FLOPS) on the y-axis against arithmetic intensity (FLOPS/byte) on the x-axis. The "roof" has two parts: a sloped section (memory-bound) where performance scales with arithmetic intensity at the rate of memory bandwidth, and a flat section (compute-bound) where performance is capped at the chip's peak FLOPS. The transition point — the "ridge" — is at AI = peak_FLOPS / peak_bandwidth. For an H100 with ~989 TFLOPS BF16 and ~3,350 GB/s HBM, the ridge is around 295 FLOPS/byte. Workloads with arithmetic intensity above 295 are compute-bound; below 295 are memory-bound.
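
The roofline itself is a one-line function: attainable performance is the minimum of peak compute and arithmetic intensity times memory bandwidth. A sketch using the H100 figures quoted above:

```python
def attainable_tflops(ai_flops_per_byte: float,
                      peak_tflops: float = 989.0,          # H100 dense BF16
                      bandwidth_tb_per_s: float = 3.35) -> float:
    """Roofline: attainable TFLOPS at a given arithmetic intensity."""
    memory_bound = ai_flops_per_byte * bandwidth_tb_per_s  # TB/s * FLOPs/byte = TFLOPS
    return min(peak_tflops, memory_bound)

ridge = 989.0 / 3.35   # ~295 FLOPs/byte: the compute-vs-bandwidth crossover
for ai in (1, 50, 295, 1365):
    print(f"AI = {ai:4d} FLOPs/byte -> {attainable_tflops(ai):7.1f} TFLOPS attainable")
```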

Arithmetic intensity for common ML operations

Computing arithmetic intensity for common ML operations gives concrete intuition. Dense matrix multiplication M×N times N×K has 2MNK FLOPS and roughly 2(MN+NK+MK) bytes of memory traffic; for square M=N=K=4096 in BF16, AI ≈ 1365 — well into compute-bound. Convolution on images has similarly high AI. Element-wise operations (ReLU, dropout, layer norm) have AI ≈ 0.5–2 FLOPS/byte — heavily memory-bound. Attention has AI that scales with sequence length but for typical lengths is in the 50–150 range — bandwidth-bound on H100 hardware. The implication: matrix multiplications are the operation to optimise the chip for; many other operations are memory-bound and need different optimisations.
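
The matmul figure is easy to reproduce. The sketch below assumes 2-byte (BF16) elements and ideal reuse (each operand read once, the output written once), which is why measured kernels land somewhat below it:

```python
def matmul_arithmetic_intensity(M: int, N: int, K: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * M * N * K                                  # multiply-accumulates
    traffic = (M * N + N * K + M * K) * bytes_per_elem     # read A and B, write C, once each
    return flops / traffic

ai = matmul_arithmetic_intensity(4096, 4096, 4096)
print(f"{ai:.0f} FLOPs/byte")   # ~1365, well above the H100 ridge of ~295
```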

Implications for model design

The roofline model has direct implications for ML architecture. Larger matrix sizes push toward the compute-bound regime; smaller matrix sizes are wasteful because they're memory-bound. Mixed-precision arithmetic (FP16/BF16/FP8) helps both axes: smaller data types mean more FLOPS per chip and less data to move. Operator fusion (combining multiple operations into a single kernel) reduces memory traffic and pushes toward compute-bound. Sparsity (using fewer parameters) reduces both memory and compute, but only helps if the hardware supports sparse computation efficiently (which Tensor Cores partly do). Modern ML compilers (XLA, torch.compile, TVM) make these optimisations partly automatic.

The MFU metric

Model FLOPS Utilisation (MFU) is the standard metric for how well a training run uses the available hardware: actual_FLOPS / peak_theoretical_FLOPS. A well-tuned training run should achieve 40–60% MFU on H100-class hardware. MFU below 30% suggests substantial inefficiency — usually data movement, suboptimal kernels, or interconnect bottlenecks. MFU is the primary metric reported in frontier-AI training papers (the GPT-4-era papers, the Llama 3 paper, the Gemini technical reports) because it directly translates to training cost.
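
MFU is usually estimated from the standard approximation of about 6 FLOPs per parameter per token for a forward-plus-backward pass, divided by the cluster's peak. A sketch with illustrative numbers (not taken from any particular run):

```python
def model_flops_utilisation(n_params: float, tokens_per_sec: float, n_gpus: int,
                            peak_tflops_per_gpu: float = 989.0) -> float:
    """MFU using the ~6 * params FLOPs-per-token training approximation."""
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops_per_sec / peak_flops_per_sec

# Illustrative: a 70B-parameter model training at 1.1M tokens/s across 1,024 H100s.
print(f"MFU = {model_flops_utilisation(70e9, 1.1e6, 1024):.1%}")   # ~46%
```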

Profiling and bottleneck identification

The practical use of roofline analysis is identifying bottlenecks via profiling. NVIDIA Nsight Compute and Nsight Systems profile CUDA kernels and produce roofline plots. PyTorch Profiler shows operation-level timing. JAX/XLA has its own profiling tools. The pattern is: profile the workload, identify the bottleneck (usually memory bandwidth on a few specific kernels), apply the appropriate optimisation (fusion, larger batches, FlashAttention-style algorithmic redesign), re-profile. Mature ML teams have explicit performance-engineering practices around this loop; it routinely produces 2–5× speedups on previously-suboptimal training runs.
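
The loop usually starts with something as lightweight as the built-in PyTorch profiler, sorted by GPU time to find the kernels worth a roofline look. A minimal sketch (the model and input are placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().bfloat16()
x = torch.randn(256, 4096, device="cuda", dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# The slowest CUDA kernels are the candidates for fusion or algorithmic redesign.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```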

Beyond the basic roofline

The basic roofline model has extensions for more-realistic analysis. Hierarchical roofline models multiple memory levels (HBM, L2, L1) with separate ridges. Multi-precision rooflines account for different peak FLOPS at FP32 vs BF16 vs FP8. Communication rooflines include network bandwidth as another bound. The 2024–2026 work on roofline analysis for distributed training (the various LLM-training characterisation papers) has substantially extended the basic framework. The intuition remains the same: understand which bound applies, optimise for that bound.

09

Operational Realities of GPU Clusters

Beyond the chip, beyond the network, the operational realities of running an ML cluster — utilisation, scheduling, failures, multi-tenancy, cost engineering — determine whether the hardware investment pays back. This section covers the operational layer that distinguishes a productive ML cluster from an under-utilised one.

Utilisation and scheduling

A GPU cluster running at 30% utilisation is wasting most of the capital investment. Utilisation engineering — making sure the chips are doing useful work most of the time — is its own discipline. Scheduling systems (Kubernetes with the NVIDIA device plugin, Slurm, the various proprietary schedulers) decide which jobs get which GPUs. Job queuing handles the pattern of "more demand than supply" with priorities and fair-share allocation. Preemption lets high-priority jobs displace lower-priority ones. The operational pattern is mature in HPC; ML clusters have been adopting these patterns over the 2020s.

Multi-Instance GPU and time-slicing

For workloads that don't need a full GPU, Multi-Instance GPU (MIG) on A100/H100/B200 partitions a physical GPU into up to 7 isolated logical GPUs, each with its own dedicated compute and memory. MIG is the standard for serving multiple inference workloads on shared hardware. Time-slicing is an alternative that doesn't isolate at the hardware level (multiple containers share the GPU temporally) — appropriate for development but not for production. The 2024–2026 work on MIG-aware scheduling has substantially improved cluster utilisation for inference-heavy workloads.

Failures at scale

Large GPU clusters fail constantly. With tens of thousands of GPUs, network components, and cooling systems, a major component fails roughly daily on average. The discipline is resilient training: training runs that handle individual failures gracefully via checkpointing, automatic restart, and replica replacement. Checkpoint frequency is a tuning parameter — too frequent and checkpoint I/O dominates training; too infrequent and a failure costs hours of progress. The 2024–2026 work on fast checkpointing (incremental checkpoints, GPU-resident checkpoint state, asynchronous I/O) has substantially reduced the cost of resilience. Frontier-AI training runs (the Llama 3 paper documents this in detail) operate continuously despite failures.
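
A classical starting point for the checkpoint-frequency trade-off is the Young/Daly approximation: the interval that minimises expected lost time is roughly the square root of twice the checkpoint cost times the mean time between failures. A sketch with illustrative numbers (not drawn from any particular cluster):

```python
import math

def optimal_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly approximation for the checkpoint interval that minimises lost work."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Illustrative: checkpoints take 2 minutes to write; the job sees roughly one failure a day.
interval = optimal_checkpoint_interval_s(checkpoint_cost_s=120, mtbf_s=24 * 3600)
print(f"checkpoint roughly every {interval / 60:.0f} minutes")   # ~76 minutes
```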

Power and cooling

Modern GPUs consume substantial power: H100 SXM at 700W, B200 at 1000W, and a rack of 72 B200s consuming ~125 kW continuously. Power and cooling have become binding constraints on data-centre scale. Liquid cooling is increasingly standard for high-density GPU racks; power purchase agreements for ML clusters routinely run to hundreds of megawatts; grid availability in some regions is now a binding constraint on cluster siting. The 2025–2027 work on more-efficient hardware (B200 was a substantial efficiency improvement over H100), better cooling, and grid-scale power infrastructure is shaping the geography of frontier AI.

Cost engineering and FinOps

GPU costs dominate ML budgets. FinOps for ML — the discipline of attributing GPU costs to workloads, optimising costs, and aligning incentives — is increasingly important. The standard practices: tag every GPU job with owner and purpose, track cost-per-experiment, build cost-aware autoscalers, periodic cluster-utilisation reviews. Spot instances (Ch 03 of MLOps) are a major lever for batch workloads. Reserved capacity versus on-demand is another major lever. The 2024–2026 industry experience suggests FinOps maturity correlates strongly with operational ML maturity; teams without it find their GPU spend grows monotonically.

Cluster software stacks

Running an ML cluster involves a substantial software stack: Linux, Kubernetes, the NVIDIA driver and CUDA toolkit, container runtimes, the GPU device plugin, the MIG operator, the network device plugin (for InfiniBand), monitoring (DCGM for GPU metrics), the orchestrator (Kubernetes, Slurm), and the ML framework integration (PyTorch, JAX, TensorFlow). Mature ML platforms package this; rolling your own is a substantial engineering investment. NVIDIA NGC, Run:ai (acquired by NVIDIA 2024), CoreWeave, Lambda Labs, and the major cloud platforms all package versions of this stack. The build-vs-buy decision tracks the broader MLOps platform decision (Ch 05 of MLOps).

10

The Frontier and the Operational Question

Hardware for ML is mature in 2026 but rapidly evolving. The B200 generation has substantially extended training capability; the next generation (B300, MI400, the various 2026–2027 entrants) is in development; alternative architectures (photonic, neuromorphic, quantum) are exploring fundamentally different approaches; the geopolitics of advanced chip manufacturing has become a national-security issue. This section traces the open frontiers and the directions the field is moving in.

The next NVIDIA generation

The Blackwell architecture (B100, B200) became dominant in 2024–2025; Blackwell Ultra / B300 (2025) extended it; the next major generation (codenamed Rubin, expected 2026–2027) is rumoured to bring substantial further gains in HBM capacity, bandwidth, and FLOPS. NVIDIA's roughly two-year cadence has been remarkably consistent; the trajectory of ~2× FLOPS per chip per generation is expected to continue through ~2030. Whether this can be sustained depends on transistor density (Moore's Law's slowing), HBM packaging advances, and the various physical limits — but the near-term forecast is more of the same.

The competitor trajectory

AMD's MI400 (announced for 2026) is positioned to compete directly with NVIDIA's next generation; Intel Gaudi continues to evolve. AWS Trainium 3 / Inferentia 4 (announced for 2026) extend Amazon's custom silicon. Google TPU v6 (Trillium, 2024) and v7 (anticipated 2025–2026) continue Google's pod-scale push. The 2024–2026 trajectory has been gradual erosion of NVIDIA's near-monopoly, with alternatives capturing meaningful workloads but not yet displacing NVIDIA at frontier scale.

Photonic computing

Photonic computing — using optical signals rather than electrical for some computations — has been a long-running research thread that is starting to become product-relevant. Lightmatter and Lightelligence are startups commercialising photonic AI accelerators with substantial efficiency advantages on certain operations. The economics work for some workloads; whether photonic computing displaces electrical at scale or remains a specialised technology is open. The 2025–2027 trajectory will reveal more.

Neuromorphic computing

Neuromorphic computing — chips that mimic biological neurons with spiking, sparse, asynchronous computation — is the longer-running alternative architecture research thread. Intel Loihi 2 and IBM TrueNorth (the major research platforms) demonstrate orders-of-magnitude efficiency advantages on specific workloads (event-based processing, certain types of inference). Whether neuromorphic computing finds product-relevant niches is unclear; the 2026 production-ML landscape uses essentially no neuromorphic chips, and the engineering investment to make it production-ready is substantial.

Quantum computing for ML

Quantum computing remains far from useful for ML in 2026. Quantum machine learning is an active research area, but practical advantage on real ML problems has not been demonstrated. The hardware (trapped-ion, superconducting, photonic quantum) is improving but still operates at small qubit counts with substantial error rates. The 2026 honest assessment: quantum ML is a research curiosity, not yet a production tool. Whether this changes in the next 5–10 years is genuinely uncertain.

The geopolitics of chips

Beyond the engineering, the geopolitical dimension of advanced semiconductors has become a substantial issue. Export controls on advanced chips (US restrictions on selling H100/B200-class hardware to China) have reshaped the global market. Domestic semiconductor industry investments (the US CHIPS Act, EU Chips Act, Chinese SMIC investments) are being made on national-security grounds. TSMC's near-monopoly on advanced fabrication has become a strategic asset. Whether the global semiconductor industry remains globalised or fractures into national/regional supply chains is one of the major industrial-policy questions of the late 2020s.

What this chapter has not covered

Several adjacent topics are out of scope. Distributed training — the methodology for using many accelerators in coordination — is the topic of Ch 02. Model compression (quantisation, distillation, pruning) is Ch 03. Inference optimisation (batching, KV caching, speculative decoding) is Ch 04. AI chips and custom silicon at depth — the design philosophy and economics — is Ch 05. The deeper details of CUDA programming, semiconductor process technology, and chip design are out of scope. The chapter has focused on the conceptual hardware substrate from an ML practitioner's perspective; subsequent chapters develop the adjacent topics in the broader hardware landscape.

Further reading

Foundational papers and references for ML hardware. The roofline model paper (Williams et al.); the original GPU computing papers; the TPU papers (Jouppi et al.); the Hopper/Blackwell whitepapers; the major ML-systems textbooks; the Patterson & Hennessy quantitative-architecture textbook; and the 2024–2026 industry surveys form the right starting kit.