AI Chips & Custom Silicon, where physics meets the ML stack.
For decades, general-purpose CPUs were enough. Then GPUs enabled the deep-learning era; then TPUs proved that domain-specific silicon could outperform GPUs at ML workloads by an order of magnitude; then a Cambrian explosion of AI chips arrived, each making a different architectural bet. ASIC design philosophy — when does specialised silicon make sense, and when is general-purpose better — is the unifying methodology question. The competitive landscape includes the dominant NVIDIA GPU lineage, Google's TPUs, AWS Trainium and Inferentia, Cerebras's wafer-scale chips, Groq's deterministic dataflow architecture, the Apple/Qualcomm/Intel NPUs, and a long tail of startups. Beyond conventional silicon, photonic computing uses light rather than electrons for some operations; neuromorphic computing mimics biological neural circuits with spiking, sparse, asynchronous architectures; analogue compute performs matrix-vector products in the analogue domain. Each represents a different bet on what the future of ML hardware looks like. This chapter develops the methodology with the depth a working ML practitioner needs: the design philosophy behind ASICs, the alternative architectures, the competitive landscape, and the strategic questions that shape both engineering and policy.
Prerequisites & orientation
This chapter assumes the hardware material of Ch 01 (GPUs, TPUs, memory hierarchy, the roofline model), the distributed-training material of Ch 02, the model-compression material of Ch 03, and the inference-optimisation material of Ch 04. Familiarity with basic computer architecture (registers, caches, memory bandwidth) is assumed; familiarity with semiconductor process technology (transistor counts, fab nodes, packaging) helps but is not required. The chapter is written for ML engineers, ML researchers, infrastructure engineers, and technology strategists who need to reason about the hardware landscape — which chips to buy, which ones to build on, what the competitive trajectory looks like. Pure-application contexts where ML is consumed via APIs have less use for this material; teams making strategic hardware bets have substantial use.
Three threads run through the chapter. The first is the specialisation gradient: hardware ranges from fully-general (CPUs) through partially-specialised (GPUs) to highly-specialised (TPUs, fixed-function ASICs), with corresponding trade-offs between flexibility and efficiency. The second is the workload-shape stability question: ASICs only make sense if the target workload is stable enough to justify the multi-year design cycle, and ML workloads have evolved fast enough that this is not always obvious. The third is the economics-and-geopolitics dimension: chip design is fabulously expensive (billions per generation), the global semiconductor industry is geographically concentrated, and the strategic stakes have made AI silicon a national-security topic. The chapter develops each in turn.
Why Custom Silicon Matters
Custom silicon for AI is one of the largest industrial investments of the 2020s. The strategic logic is straightforward: at sufficient scale, designing your own chip beats buying NVIDIA's. The execution challenges are substantial: chip design is expensive, slow, and risky; the software ecosystem matters at least as much as the hardware; and the moving target of ML workloads can render multi-year design bets obsolete before they ship. Understanding why and when custom silicon wins is the foundation for reasoning about the broader AI-hardware landscape.
The specialisation pay-off
Custom silicon, well-targeted at a stable workload, can be 5–20× more efficient (perf-per-watt or perf-per-dollar) than general-purpose alternatives. The classic example: Bitcoin mining. CPUs gave way to GPUs, GPUs gave way to FPGAs, FPGAs gave way to ASICs — each step roughly 10× more efficient at the specific workload of computing SHA-256 hashes. ML workloads are less narrow than Bitcoin mining, but the same logic applies: a chip designed specifically for transformer matrix multiplications, with appropriate memory hierarchy and interconnect, can substantially outperform a general-purpose GPU on the same workload.
The strategic-asset argument
For hyperscalers (Google, Amazon, Microsoft, Meta), custom silicon is also a strategic asset. Owning the chip stack reduces dependence on NVIDIA, improves margins on internal compute, and provides differentiation that competitors can't easily replicate. Google's TPU programme has run since 2015; AWS Trainium since 2020; Microsoft Maia since 2023; Meta MTIA since 2023. Each of these costs hundreds of millions to billions of dollars per chip generation but is justified by the strategic value of the resulting compute capacity. The 2023–2026 hyperscaler trajectory is clear: more custom silicon, more diverse silicon, less dependence on the GPU monoculture.
The startup ecosystem
Beyond hyperscalers, a substantial startup ecosystem is building specialised AI chips. Cerebras (wafer-scale chips), Groq (deterministic dataflow inference), SambaNova (reconfigurable dataflow), Tenstorrent (RISC-V-based AI), Untether, Rain, and dozens of others. The startups aim to find specific workloads or operating points where their architectural bets win against NVIDIA; the success rate is mixed (many have failed; some have found niches; a few are growing into substantial businesses) but the diversity is real and ongoing.
The geographic concentration
Modern advanced semiconductor manufacturing is dominated by a handful of foundries: TSMC in Taiwan (the dominant cutting-edge foundry), Samsung in South Korea, Intel Foundry in the US (re-emerging). Most AI chips — including NVIDIA's, Google's TPUs, Apple's chips, and the various startups' — are manufactured at TSMC. The geographic concentration has become a strategic-vulnerability concern: any disruption to Taiwan-based manufacturing would affect the entire global AI industry. The 2022 US CHIPS Act, the EU Chips Act, and corresponding investments in domestic semiconductor capacity are responses to this concern.
The downstream view
Operationally, AI chip choice cascades through the entire ML stack. Upstream: chip designers, foundries, packaging, board manufacturers. Inside this chapter's scope: the architectural choices, the competitive landscape, the strategic and economic dimensions. Downstream: the cost-and-capability frontier of what ML workloads are feasible at what cost, which determines what teams can train and deploy. The remainder of this chapter develops each piece: §2 ASIC design philosophy, §3 the NVIDIA/AMD landscape, §4 hyperscaler ASICs, §5 the startup cohort, §6 edge NPUs, §7 photonic computing, §8 neuromorphic computing, §9 economics and geopolitics, §10 the frontier.
ASIC Design Philosophy
The decision to build a custom ASIC rather than use commodity hardware is a substantial commitment with high stakes. Multi-year design cycles, hundreds of millions of dollars in non-recurring engineering (NRE), and the risk that the workload changes before the chip ships make it a bet that's right only under specific conditions. Understanding when ASICs make sense — and the design philosophy that makes them work — is the foundation for reasoning about all the chips this chapter discusses.
The workload-stability requirement
The most important precondition for a successful ASIC: a stable, well-understood workload. The chip-design timeline is roughly 18–36 months from architectural decision to production silicon; if the target workload changes substantially during that window, the chip ships obsolete. Bitcoin mining was the textbook example of a stable workload — the SHA-256 proof-of-work has not changed since the network launched in 2009. ML workloads are less stable: transformers became dominant only around 2018, mixture-of-experts only around 2022, ring-attention-style long context only around 2023. ASICs designed before each shift were obsoleted; the wave of "ML chips" designed around CNN workloads (around 2015–2017) mostly didn't survive the transformer transition.
The NRE economics
Designing a modern advanced-node chip costs roughly $50M–$1B in NRE, depending on complexity. Mask sets at 5nm are ~$5M per spin; at 3nm, ~$10M; at 2nm, projected to be ~$20M. A typical chip programme uses 2–4 spins (initial silicon plus revisions). Add tape-out engineering, EDA tool licenses, IP licensing (ARM cores, memory controllers, etc.), packaging design, and ramp-up. The full programme cost easily runs to nine figures. To recoup this investment, you need either substantial sales volume (NVIDIA-class) or substantial internal usage (hyperscaler-class). The specialised-startup model bets on a specific niche where their architectural advantages justify the cost.
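A back-of-envelope sketch of that break-even logic, with every figure hypothetical rather than quoted from any vendor: the full-programme NRE is amortised over shipped volume, added to per-unit cost, and compared against the price of a merchant accelerator.

```python
# Hedged illustration of ASIC break-even economics. All numbers are assumptions.
def effective_cost_per_chip(nre_usd: float, unit_cost_usd: float, volume: int) -> float:
    return nre_usd / volume + unit_cost_usd

nre = 500e6              # assumed full-programme NRE: design, masks, spins, tooling
unit_cost = 3_000        # assumed per-unit wafer + HBM + packaging + test cost
merchant_price = 25_000  # assumed price of a comparable merchant accelerator

for volume in (10_000, 100_000, 1_000_000):
    cost = effective_cost_per_chip(nre, unit_cost, volume)
    print(f"{volume:>9,} units -> ${cost:>10,.0f}/chip vs ${merchant_price:,} merchant")
# At 10k units the custom chip costs far more than buying; at ~100k+ units it starts
# to win, which is why only hyperscaler-scale internal demand or large external sales
# volume justifies the programme.
```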
The spectrum: programmability vs efficiency
The fundamental design choice is the spectrum from programmable to specialised. A general-purpose CPU is the most programmable; it can run any workload, but at relatively low efficiency for ML. A GPU is partially specialised; it accelerates parallel workloads (including ML) but remains programmable through CUDA. A TPU is more specialised; it has specific support for ML's dominant operations but is harder to repurpose for non-ML work. A fully fixed-function ASIC is most specialised; it efficiently runs exactly the workload it was designed for, nothing else. Each step toward specialisation buys efficiency at the cost of flexibility; the right point depends on workload stability and volume.
The software-ecosystem problem
Hardware without software is unusable. Every AI-chip company faces the question: what is the programming model, and how does it integrate with the existing ML stack (PyTorch, JAX, TensorFlow, etc.)? NVIDIA's CUDA has 20 years of investment and ecosystem effects. Competing requires either offering CUDA compatibility (AMD's HIP, Intel's various efforts) or building a new software stack good enough that users will switch (Google's XLA + JAX, the various startup-specific stacks). Many AI-chip startups have failed not because of hardware shortcomings but because of software ecosystem inadequacy. The lesson: the chip is half the product; the software is the other half.
The performance metrics
What does "better silicon" mean? Several metrics matter. Peak FLOPS: maximum theoretical floating-point operations per second. Memory bandwidth: how fast data can move from memory to compute units. Perf-per-watt: throughput divided by power consumption — increasingly the binding metric as datacentres hit power limits. Perf-per-dollar: throughput divided by chip cost — the economic-efficiency metric. Latency: time to complete a single inference — matters for interactive workloads. MFU on real workloads: actual performance, not theoretical peak — often the gap between marketing numbers and reality. Honest comparison requires multiple metrics; cherry-picked single-metric comparisons are common in the chip-marketing literature.
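As an illustration of that last point, a minimal sketch of the MFU arithmetic, using the common 6 × parameters × tokens approximation for dense-transformer training FLOPs; every number below is a placeholder, not a measured figure.

```python
# Model FLOPs utilisation: achieved useful FLOP/s divided by theoretical hardware peak.
def mfu(params: float, tokens_per_s: float, n_chips: int, peak_flops_per_chip: float) -> float:
    achieved = 6 * params * tokens_per_s      # approximate useful training FLOP/s
    peak = n_chips * peak_flops_per_chip      # theoretical peak across the job
    return achieved / peak

# Hypothetical job: a 70B-parameter model at 400k tokens/s on 1,024 chips rated 1 PFLOP/s each.
print(f"MFU = {mfu(70e9, 4e5, 1024, 1e15):.2%}")   # ~16%: the gap vs marketing peak numbers
```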
The "fast follower" pattern
One operational pattern: don't build the most-aggressive ASIC; build a "fast follower" that incorporates lessons from the leader. The 2024–2026 generation of NVIDIA-competitor GPUs (AMD MI300, Intel Gaudi) are examples — they target the same general pattern as NVIDIA but with cost or capacity advantages. Fast-follower designs trade leading-edge performance for lower risk and faster time-to-market. They've been moderately successful; whether any displaces NVIDIA at the high end is open.
The NVIDIA Lineage and AMD's Challenge
Most AI compute in 2026 still runs on NVIDIA GPUs. The dominance is no accident: NVIDIA began investing in CUDA half a decade before deep learning made it pay off, and the resulting software-ecosystem moat is formidable. AMD has emerged as the most credible direct competitor, with the MI300 series capturing meaningful share. Intel's Gaudi line targets specific niches. This section unpacks the dominant programmable-GPU landscape that all the more-specialised alternatives are positioned against.
NVIDIA's compounding advantages
NVIDIA's near-monopoly position rests on layered advantages. CUDA: nearly two decades of programming-model investment, with deep integration into every major ML framework. cuDNN: highly-optimised kernels for the operations ML actually does. NCCL: efficient multi-GPU collective communication. NVLink and NVSwitch: high-bandwidth intra-node interconnect. Mellanox (acquired 2020): InfiniBand and high-performance Ethernet for cluster-scale networking. DGX / HGX and the surrounding software platforms: turn-key training systems. Ecosystem: every ML library, every cloud service, every research paper assumes NVIDIA hardware. Competitors face a chicken-and-egg problem: hardware that's not deployed has no software ecosystem; without a software ecosystem, hardware is hard to adopt.
The H100/B200 generation cadence
NVIDIA's roughly two-year cadence has been remarkably consistent. V100 (Volta, 2017): introduced Tensor Cores, foundational for the transformer era. A100 (Ampere, 2020): substantially extended throughput and memory; the workhorse of GPT-3-era training. H100 (Hopper, 2022): introduced FP8, NVLink 4, much higher HBM bandwidth; dominant 2022–2024. H200 (Hopper-refresh, 2024): more HBM, more memory bandwidth. B100/B200 (Blackwell, 2024): substantially increased FLOPS, FP4 native, NVLink 5, NVL72 rack-scale systems. B300 (Blackwell Ultra, 2025): another step. Rubin (anticipated 2026–2027): the next major architecture. Each generation has delivered ~2× FLOPS per chip; the trajectory is expected to continue.
AMD's MI300 challenge
AMD MI300X (2023) was the first credible direct competitor to NVIDIA at the high end since the V100 era. The chip has more HBM than the H100 (192 GB vs 80 GB), competitive raw FLOPS, and competitive pricing. The follow-up MI325X (2024) and MI400 series (2026) continue the line. The challenge has been software: ROCm (AMD's CUDA equivalent) is less mature, and porting CUDA-optimised codebases is substantial engineering work. Progress over 2024–2026 has been real: PyTorch ROCm support has improved markedly, and major training jobs (some at OpenAI, Meta, and others) have begun running on AMD hardware. AMD now captures a meaningful share of new training deployments; whether it displaces NVIDIA at the high end is open.
Intel Gaudi
Intel Gaudi (originally Habana Labs, acquired 2019) is Intel's dedicated AI training accelerator. Gaudi 2 (2022) and Gaudi 3 (2024) compete with the H100 generation. The pricing has been aggressive (often 30–50% cheaper per FLOP than NVIDIA), and the software (SynapseAI, with PyTorch support) has been improving. The market reception has been more modest than AMD's; Gaudi has captured some hyperscaler deployments but hasn't reached AMD-scale market presence. Intel's broader semiconductor challenges (foundry execution, the various 2023–2025 reorganisations) have affected the Gaudi line's momentum.
The programming-model arms race
The 2024–2026 work on cross-vendor programming abstractions is gradually loosening NVIDIA's CUDA lock-in. OpenAI Triton: a Python DSL for writing GPU kernels that compiles to both NVIDIA and AMD targets; it has substantial adoption. PyTorch's torch.compile: increasingly produces kernels portable across vendors. MLIR-based stacks: cross-vendor compilation infrastructure. SYCL / oneAPI: Intel's cross-vendor programming model. JAX with XLA: targets both GPUs and TPUs naturally. Whether these abstractions become "good enough" to make the choice of GPU brand irrelevant is the strategic question; as of 2026, NVIDIA still has meaningful advantages but the gap is narrowing.
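To give the flavour of these abstractions, here is a minimal Triton-style vector-add sketch: the same Python source can, in principle, be compiled for NVIDIA or AMD backends (assuming a working Triton and GPU PyTorch installation). It follows the standard tutorial pattern rather than any production kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # enough programs to cover n elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(10_000, device="cuda")             # "cuda" also maps to ROCm devices in PyTorch
    b = torch.randn(10_000, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```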
The scaling-systems advantage
Beyond the chip itself, NVIDIA's NVL72 rack-scale system (72 GPUs as a single coherent system) and the broader DGX/HGX server line provide the high-bandwidth domain that frontier-AI training depends on. AMD and Intel have less-mature equivalents. For the very largest training jobs, NVIDIA's cluster-scaling story is currently a meaningful advantage; AMD's 2024–2026 cluster offerings (the Helios rack-scale system, the various OEM partnerships) are starting to close the gap, but systems integration takes years to mature.
Hyperscaler ASICs: TPU, Trainium, Maia, MTIA
The major hyperscalers have all invested in custom silicon: Google's TPUs (since 2015), AWS's Trainium and Inferentia (since 2020), Microsoft's Maia (since 2023), Meta's MTIA (since 2023), and various others. The strategic logic is consistent: at hyperscale, owning the silicon stack is a competitive advantage; the engineering investment (hundreds of millions per generation) is justified by internal-usage volume. This section unpacks the dominant hyperscaler programmes.
Google TPU: the proof of concept
Google's TPU programme (already discussed in Ch 01 §4) was the proof of concept that custom AI silicon could substantially outperform GPUs at scale. Six public generations through 2024–2026 (TPU v1 through TPU v6/Trillium); each has been deployed at substantial internal scale. The TPU's architectural distinction is the systolic array — a matrix-multiply-optimised structure that achieves higher arithmetic density than GPUs. The pod-scale advantage — thousands of TPUs operating as a single coherent system via specialised optical interconnect — is what makes TPUs particularly competitive at the largest training scales. External availability through Google Cloud has been improving, but TPUs remain primarily a Google-favoured platform.
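A toy illustration of the systolic-array idea, reduced to its dataflow (not cycle-accurate, and at nothing like the 128×128 scale of real TPU arrays): weights sit stationary in a grid of multiply-accumulate cells while activations stream past and partial sums accumulate.

```python
# Pedagogical sketch of a weight-stationary systolic matmul, C = A @ W.
import numpy as np

def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    C = np.zeros((M, N))
    # Each cell (k, n) holds weight W[k, n]. Activation A[m, k] arrives at row k of the
    # array and is multiplied by every stored weight in that row in parallel; partial
    # sums accumulate as they flow down the columns, one wavefront per hardware cycle.
    for m in range(M):
        partial = np.zeros(N)
        for k in range(K):
            partial += A[m, k] * W[k, :]   # row k of MAC cells fires in parallel
        C[m, :] = partial
    return C

A = np.random.randn(4, 8)
W = np.random.randn(8, 3)
assert np.allclose(systolic_matmul(A, W), A @ W)
```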
AWS Trainium and Inferentia
AWS Trainium (training, since 2020) and Inferentia (inference, since 2019) are Amazon's parallel programmes. Trainium 2 (2024) is competitive with the H100 for cost-effective training and is heavily used internally for training Anthropic's Claude and other partner models. Inferentia 3 (2025) competes on inference cost with Groq, vLLM-on-NVIDIA, and the various alternatives. Trainium and Inferentia are the path of least resistance for AWS customers who can adopt the Neuron SDK; for teams not deeply invested in AWS, the chips are less compelling than the dominant alternatives.
Microsoft Maia
Microsoft Maia 100 (2023) is Microsoft's first generation of AI accelerator. The chip is co-designed with Microsoft's hyperscale infrastructure and targets Microsoft's internal training workloads (substantial at OpenAI's scale, given the partnership). The 2024–2026 trajectory has Maia generations following a roughly annual cadence; whether Maia becomes externally available or remains an Azure-internal asset is uncertain. The strategic logic for Microsoft is clear: reducing dependence on NVIDIA's pricing and supply.
Meta MTIA
Meta Training and Inference Accelerator (MTIA) is Meta's silicon programme, originally focused on inference (recommendation models for Facebook/Instagram), now also extending to training. MTIA v1 (2023) and MTIA v2 (2024) target Meta's specific workload mix; Meta's recommendation systems have distinctive operational characteristics (very large embedding tables, irregular memory-access patterns) that custom silicon can target effectively. The 2025–2026 trajectory plausibly has MTIA capturing an increasing share of Meta's ML compute.
The strategic logic, in detail
Why do hyperscalers invest in this? Several reasons. Pricing power: NVIDIA's H100 list price has been ~$30K, with margins reportedly 70%+. Owning the chip stack captures that margin internally. Supply: during 2023–2024 NVIDIA shortages, hyperscalers with their own silicon had options that competitors didn't. Differentiation: cloud customers can be offered TPU/Trainium-based services that AWS/Google's competitors literally can't provide. Strategic optionality: not betting the entire ML stack on a single supplier reduces tail risk. Each of these is worth substantial engineering investment at hyperscale; collectively they justify the multi-billion-dollar hyperscaler-ASIC programmes.
The custom-silicon trajectory
The 2024–2026 trajectory is clear: more hyperscaler silicon, more diverse silicon, more competitive landscape. Apple's Private Cloud Compute infrastructure runs on custom Apple silicon. xAI's Colossus cluster uses NVIDIA but is reportedly investigating custom silicon. The smaller hyperscalers (Oracle Cloud, IBM Cloud) are increasingly partnering with specialised AI-chip startups. The 2027–2030 forecast plausibly has the AI-compute landscape substantially more diverse than 2025; whether the trend continues and produces a real multi-vendor world or NVIDIA reasserts dominance is the strategic question.
The Startup Cohort: Cerebras, Groq, SambaNova
Beyond NVIDIA, AMD, Intel, and the hyperscalers, a substantial AI-chip startup ecosystem aims to find specific niches where their architectural bets win. Cerebras, Groq, SambaNova, Tenstorrent, Untether, Rain, Etched, the various 2024–2026 entrants — each has a distinctive architectural vision. Most face significant ecosystem-and-scale challenges; some have found viable niches; a few have grown into substantial businesses. This section surveys the dominant startups and their strategic positioning.
Cerebras: the wafer-scale bet
Cerebras (founded 2016) takes a radically different approach: a single chip that occupies an entire silicon wafer. The Wafer-Scale Engine 3 (WSE-3, 2024) has 4 trillion transistors, 900,000 cores, and 44 GB of on-chip SRAM. The architectural bet is that eliminating chip-to-chip and chip-to-memory bottlenecks — keeping everything on a single silicon substrate — produces substantially better efficiency for some workloads. The wafer-scale chip has been deployed in some research labs, supercomputing centres, and increasingly for commercial inference. The economics are unusual; whether wafer-scale becomes a meaningful competitor to GPU-cluster architectures is open. The 2024–2026 traction has been substantial enough to suggest the bet is at least partially working.
Groq: deterministic inference
Groq (founded 2016) targets inference with a deterministic dataflow architecture. The GroqChip (originally Tensor Streaming Processor, TSP) has no caches and no branch prediction — every operation has predictable timing, which the compiler can schedule deterministically. The result is dramatically lower per-token latency than GPU-based inference (a critical metric for interactive chatbots). Groq Cloud (2024) has captured a meaningful share of the inference market, with sub-100ms time-to-first-token even for 70B-parameter models. Whether Groq becomes a general-purpose inference platform or remains a specialised latency leader is uncertain; the 2024–2026 traction has been substantial.
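A rough sketch of why per-token decode latency is so architecture-sensitive: each generated token must stream essentially all of the model's weights through the compute units, so memory bandwidth sets a lower bound on latency before any batching, KV-cache traffic, or multi-chip sharding is considered. The numbers below are assumptions for illustration, not measurements of any specific chip.

```python
# Bandwidth-bound lower bound on single-stream decode latency.
def per_token_latency_ms(params_billion: float, bytes_per_param: float,
                         mem_bandwidth_tb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param   # weights read per token
    return bytes_per_token / (mem_bandwidth_tb_s * 1e12) * 1e3

# Hypothetical: a 70B model in 8-bit weights on a device with ~3 TB/s of memory bandwidth.
print(per_token_latency_ms(70, 1.0, 3.0))   # ~23 ms/token lower bound, before overheads
```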
SambaNova: reconfigurable dataflow
SambaNova (founded 2017) builds reconfigurable dataflow architectures with their Reconfigurable Dataflow Unit (RDU). The architectural bet is that ML workloads' dataflow patterns are stable enough to benefit from a chip that maps them efficiently, but variable enough to benefit from runtime reconfiguration. SambaNova's product positioning has been on enterprise on-premises ML — bundled hardware + software for organisations that want LLM capability without depending on external cloud APIs. The 2024–2026 traction has been moderate; the niche is real but smaller than GPU-cloud.
Tenstorrent: open RISC-V AI
Tenstorrent (founded 2016, led since 2021 by Jim Keller) takes a different bet: AI accelerators built on open RISC-V cores, with explicit support for general-purpose programming alongside ML acceleration. The Wormhole and Blackhole chips target both training and inference. The strategic positioning is "open alternative to NVIDIA" — users can program the chip at multiple levels of abstraction. Tenstorrent has secured substantial customer wins (some hyperscaler partnerships, various automotive deals); whether it scales to meaningful market share is open.
The narrower-niche startups
Beyond the larger startups, a long tail of more-specialised entrants. Etched (founded 2022): a transformer-specific ASIC betting that transformers will dominate for the next decade. Rain: analogue compute for inference. Untether: at-memory compute for inference. d-Matrix: chiplet-based inference. MatX, Tiny Corp, Lemurian Labs, and many others. Most are at pre-product or early-product stages as of 2026; some will succeed, most won't. The diversity reflects the genuine architectural disagreement about what the future of AI silicon looks like.
Why most AI-chip startups fail
The AI-chip startup graveyard is substantial. Reasons for failure include: software ecosystem inadequacy (the chip works but no one wants to port their CUDA code); workload-shift obsolescence (the chip targets last year's workload); execution challenges (chip design is hard, foundry capacity is constrained, packaging is non-trivial); scale economics (NVIDIA's volume produces price advantages startups can't match); capital intensity (multi-hundred-million-dollar burn rates with multi-year time-to-market). The startups that survive have specific architectural advantages that compensate for these challenges, plus enough capital to weather the long path to profitability.
Edge and Mobile NPUs
Beyond datacentre AI, a parallel ecosystem of NPUs (Neural Processing Units) targets edge and mobile devices. Apple's Neural Engine, Qualcomm's Hexagon, Intel's NPU (in recent laptop CPUs), Google's Edge TPU — each integrates ML acceleration into consumer-grade SoCs. The 2024–2026 push toward "AI PC" and "AI phone" hardware has substantially raised on-device NPU capability. The use case is local inference where privacy, latency, or offline operation matter; the performance is far below datacentre-class but adequate for the workload.
Apple Neural Engine
Apple Neural Engine (in M-series Macs and iPhone A-series chips) provides 35–40 TOPS for on-device ML. The 2024 M4 chip's NPU runs at 38 TOPS; the iPhone 16 Pro at 35 TOPS. These numbers are far below datacentre-GPU throughput (an H100's low-precision throughput is roughly 50–100× higher) but adequate for many on-device tasks: speech recognition, image processing, photo categorisation, OCR, the various Apple Intelligence features (2024). The chip is tightly integrated with Core ML, Apple's on-device ML framework; developer adoption has been substantial.
Qualcomm Hexagon
Qualcomm Hexagon NPU (in Snapdragon mobile SoCs) is the dominant Android-side mobile NPU. Snapdragon 8 Gen 3 (2023) and its successor the Snapdragon 8 Elite (2024) provide 35+ TOPS. Qualcomm's AI Hub (2024) provides cross-platform tooling for deploying models to Snapdragon NPUs; the 2024–2026 work on Snapdragon-based on-device LLMs (Llama 3.2 1B/3B optimised for Snapdragon, the various Meta partnerships) has advanced substantially.
Intel NPU and the AI PC era
Intel NPU (Meteor Lake / Lunar Lake / Arrow Lake CPUs) integrates ML acceleration into Intel's mainstream laptop SoCs. The 2024 Lunar Lake at 48 TOPS pushed past the threshold Microsoft set for "Copilot+ PC" certification. The Microsoft Copilot+ PC programme — with its on-device ML requirements — has been a substantial driver of NPU integration into laptops. AMD Ryzen AI (using AMD's XDNA architecture) and Apple M-series NPUs are the competing alternatives in the AI-PC space. The 2025–2027 forecast has on-device NPU capability continuing to grow; small LLMs running on every laptop is plausibly the near-term direction.
Google Edge TPU and mobile inference
Google Edge TPU (in some Android phones and the Coral developer board) and Pixel-specific TPUs (in Pixel 8/9/10 phones) provide Google's on-device ML acceleration. The Pixel-specific TPU is co-designed with Google's on-device models (Gemini Nano and successors), enabling on-device LLM features. The strategic logic is similar to Apple's: vertical integration of silicon plus models plus apps produces a differentiated product.
The on-device LLM frontier
The 2024–2026 push toward on-device LLMs has been substantial. Apple Intelligence (2024): a 3B-parameter on-device model handles many tasks without cloud. Gemini Nano (2024): Google's on-device model in Pixel phones and Chrome. Phi-3 / Phi-4: Microsoft's small models targeting on-device deployment. Llama 3.2 1B / 3B: Meta's small models with explicit mobile-NPU optimisation. The on-device LLM landscape has rapidly matured; many use cases that would have required cloud inference in 2022 now run on phones in 2026.
The deployment-tooling layer
Deploying models to NPUs requires specific tooling. Apple Core ML: Apple's on-device ML framework, with conversion tools from PyTorch. TensorFlow Lite / LiteRT: cross-platform mobile inference, the dominant Android-side option. ONNX Runtime Mobile: cross-platform alternative. Qualcomm AI Hub: Qualcomm-specific tooling. MediaPipe: Google's cross-platform on-device ML toolkit. The 2024–2026 maturation of these tools has substantially reduced the friction of deploying to mobile NPUs; the user experience is approaching cloud-deployment ergonomics.
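A minimal sketch of the first step in one such path: exporting a small PyTorch model to ONNX so that a mobile runtime (ONNX Runtime Mobile here; the pattern is analogous for Core ML or LiteRT converters) can consume it. The model and shapes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for whatever the team actually ships.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example_input = torch.randn(1, 128)

torch.onnx.export(
    model,
    example_input,
    "tiny_classifier.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
    opset_version=17,
)
# The .onnx file is then quantised and packaged for the target NPU with the
# vendor-specific toolchain (Qualcomm AI Hub, an ONNX Runtime Mobile build, etc.).
```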
Photonic Computing
Photonic computing uses optical signals — light moving through waveguides — rather than electrical signals for some computations. The architectural bet is that optical matrix-multiplication can be substantially more energy-efficient than electronic equivalents, with potentially dramatic implications for ML workloads dominated by matrix operations. The 2020s have seen photonic computing move from research into early product deployment; whether it becomes a mainstream technology or remains a specialised tool is the strategic question.
The basic idea
Optical matrix-vector multiplication can be performed by passing light through an array of programmable interferometers (often Mach-Zehnder interferometers). The light at each output port corresponds to one element of the matrix-vector product — one row of the matrix dotted with the input vector. The operation is essentially passive — the light propagates through the optical circuit at the speed of light, with energy consumed only at input and output (not for the multiplication itself). The energy-per-operation can be orders of magnitude below electronic equivalents. The challenge is that the matrix weights are encoded in the settings of the optical circuit, which are harder and slower to reprogram than electronic memory.
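A toy numerical sketch of the building block: a single Mach-Zehnder interferometer acts on two optical modes as a 2×2 unitary set by phase shifts, meshes of such blocks compose larger unitaries, and an arbitrary weight matrix can be programmed as mesh, attenuators, mesh via its SVD. The parameterisation below is one common convention, chosen purely for illustration.

```python
import numpy as np

def mzi(theta: float, phi: float) -> np.ndarray:
    """A 2x2 unitary of the kind one MZI realises (exact form varies with convention)."""
    return np.array([
        [np.exp(1j * phi) * np.cos(theta), -np.sin(theta)],
        [np.exp(1j * phi) * np.sin(theta),  np.cos(theta)],
    ])

T = mzi(0.3, 1.1)
assert np.allclose(T.conj().T @ T, np.eye(2))   # each block is lossless (unitary)

# The SVD view of programming an arbitrary weight matrix onto two unitary meshes
# plus a diagonal attenuation stage: W = U diag(s) V*.
W = np.random.randn(4, 4)
U, s, Vh = np.linalg.svd(W)
x = np.random.randn(4)
assert np.allclose(U @ np.diag(s) @ Vh @ x, W @ x)  # light through mesh V*, attenuators s, mesh U
```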
Lightmatter
Lightmatter (founded 2017) is the leading photonic-computing startup. Their Envise chips integrate photonic matrix-vector multiplication with conventional electronic logic. The 2023–2024 deployments at major data centres have demonstrated the energy advantage on inference workloads; whether photonic computing scales economically is the focus of 2025–2027 development. Lightmatter's valuation and strategic positioning suggest that photonic computing is no longer pure research; it is becoming a near-term commercial reality, at least for specific niches.
Lightelligence and the broader ecosystem
Lightelligence (founded 2017) is the major Lightmatter competitor in commercial photonic AI. PsiQuantum (photonic quantum computing) and Q.ANT (photonic analogue processing) run adjacent programmes. The University of Pennsylvania, MIT, and Stanford have substantial academic programmes producing the underlying technology. The 2024–2026 photonic-computing landscape is several startups plus academic groups; the field is small but growing.
Co-packaged optics
Beyond pure photonic computing, co-packaged optics (CPO) integrates optical I/O directly into chip packages, dramatically reducing the energy and latency of inter-chip communication. NVIDIA's Quantum and Spectrum-X switches increasingly use CPO; Broadcom and Cisco have CPO-based switch products. The strategic implication: even if photonic computing itself doesn't displace electronic computing, photonic interconnect is rapidly becoming the standard for high-performance ML clusters. The 2025–2027 trajectory has CPO as standard infrastructure.
The economic and physical limits
Photonic computing has real challenges. Reprogramming: changing the matrix weights means retuning the optical circuit (thermo-optic or electro-optic phase shifters), which is far slower than rewriting electronic memory; the trade-off is analogous to FPGA-vs-ASIC. Precision: optical computations are inherently analogue, with limited effective bit-width before noise dominates. Manufacturing: photonic chip manufacturing is less mature than electronic; yields and costs are higher. Integration: integrating photonics with electronics in a manufacturable, deployable form is non-trivial. The 2024–2026 work on these challenges has advanced substantially; whether the advances are enough to make photonic computing a mainstream technology is open.
The future trajectory
The 2026 honest assessment: photonic computing is product-relevant for some workloads (high-volume inference of stable models), still research-relevant for most. The 2027–2030 forecast has photonic computing capturing increasing share of inference workloads where its energy advantages are decisive; training applications remain difficult because of the reprogramming challenge. Whether photonic computing eventually displaces electronic for the bulk of ML compute is uncertain; the bet is being actively made by substantial venture capital and is plausible enough that mainstream chip designers (NVIDIA, Intel, the foundries) are investing in supporting capabilities.
Neuromorphic Computing
Neuromorphic computing takes inspiration from biological neural circuits: spiking, sparse, asynchronous, event-driven computation. Conventional ML hardware processes dense data through synchronous clocked operations; neuromorphic chips process spike events as they arrive, using little or no compute when nothing is happening. The energy advantages can be substantial for the right workloads. The 2010s and 2020s have produced several major neuromorphic platforms; whether the architecture finds product-relevant niches at scale remains genuinely open.
The biological-inspiration argument
The human brain consumes roughly 20 watts while performing human cognition; an H100 GPU at full utilisation consumes 700W and falls short of biological cognition by orders of magnitude on some tasks. The argument for neuromorphic computing: biological computation must be radically more efficient than electronic computation for some operations, and architecturally mimicking biology should capture some of that efficiency. The counter-argument: biological efficiency depends heavily on physical properties (3D structure, chemistry, very-low-frequency operation) that don't map cleanly to silicon. The empirical question is whether neuromorphic silicon can deliver the efficiency advantages without the structural advantages of biology.
Intel Loihi
Intel Loihi 2 (2021) is the major commercial neuromorphic research platform. The chip has 1 million spiking neurons across 128 cores; it operates asynchronously with no global clock; computation happens only when spike events occur. Intel's research collaborators have demonstrated efficiency advantages on specific workloads — sparse recurrent networks, event-based processing of dynamic-vision-sensor data, certain types of optimisation problems. The platform is research-focused; commercial deployment has been minimal as of 2026.
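A minimal sketch of the event-driven style of computation this class of chip targets: a leaky integrate-and-fire neuron that does work only when input spikes arrive and emits output events only when its membrane potential crosses threshold. The parameters below are illustrative, not Loihi's.

```python
import numpy as np

def lif_neuron(spike_times_in, weight=0.6, tau=20.0, threshold=1.0):
    v, t_prev, out_spikes = 0.0, 0.0, []
    for t in spike_times_in:                      # work happens only at input events
        v *= np.exp(-(t - t_prev) / tau)          # analytic leak between events
        v += weight                               # integrate the incoming spike
        if v >= threshold:
            out_spikes.append(t)                  # emit an output event, reset membrane
            v = 0.0
        t_prev = t
    return out_spikes

print(lif_neuron([1, 2, 3, 30, 31, 32, 33, 90]))  # bursts of input produce output spikes
```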
IBM TrueNorth and successors
IBM TrueNorth (2014) was the original major neuromorphic chip — 1 million neurons, 256 million synapses, ~70mW power consumption. IBM NorthPole (2023) is a successor with substantial advances. IBM's neuromorphic programme has been research-focused; whether it produces commercial product is unclear.
The use-case challenge
Neuromorphic computing's commercial challenge is finding workloads where its architectural advantages are decisive. Event-based vision: dynamic-vision sensors produce spike-like events; neuromorphic processors handle these naturally. Temporal pattern recognition: sparse temporal data with long delays. Some optimisation problems: certain combinatorial optimisations map well to spiking networks. Some kinds of robotics: sensorimotor control with low latency and low power. None of these is a billion-dollar market; the question is whether the cumulative niche set is large enough to support commercial scale, or whether neuromorphic remains primarily a research direction.
The mainstream-ML mismatch
Most modern ML — large transformers, dense matrix multiplications, gradient-based training — does not fit neuromorphic architectures naturally. The dominant ML stack has adapted to GPU-style compute, not spiking neuromorphic compute; bridging the two is a substantial research effort. The 2024–2026 work on hybrid architectures (combining standard GPU with neuromorphic auxiliary processing for specific operations) has been active but hasn't produced widely-deployed products. The honest 2026 assessment: neuromorphic computing is fascinating, has potential, and is not currently a meaningful part of the production ML landscape. Whether this changes is open.
Analogue compute and in-memory compute
Adjacent to neuromorphic computing, analogue compute performs matrix-vector products in the analogue domain (using crossbar arrays of resistive elements). In-memory compute performs computation inside the memory array, eliminating the data-movement bottleneck. Both have substantial energy advantages on specific workloads; both face precision and manufacturability challenges. Companies like Mythic AI, Rain, and Untether are commercialising analogue and in-memory compute for inference. The 2024–2026 traction has been niche; whether these technologies become mainstream is open.
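A toy sketch of the crossbar idea: weights are stored as conductances at each row-column crosspoint, so applying input voltages to the rows yields column currents equal to a matrix-vector product by Ohm's and Kirchhoff's laws. The quantisation and read noise below stand in for the precision limits mentioned above; real arrays also use differential conductance pairs to represent signed weights, which this sketch ignores.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))                 # target weights (64 inputs, 32 outputs)
V = rng.standard_normal(64)                       # input activations as row voltages

levels = 16                                       # e.g. a 4-bit programmable conductance
G = np.round((W - W.min()) / (W.max() - W.min()) * (levels - 1))
G = G / (levels - 1) * (W.max() - W.min()) + W.min()      # quantised conductance map
I = G.T @ V + 0.01 * rng.standard_normal(32)      # column currents plus analogue read noise

exact = W.T @ V
print("relative error:", np.linalg.norm(I - exact) / np.linalg.norm(exact))
```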
Economics, Foundries, and Geopolitics
Beyond architecture, the economics and geopolitics of chip manufacturing have become substantial topics in their own right. Modern advanced semiconductor manufacturing is dominated by a handful of foundries; AI chip design is hugely capital-intensive; the strategic stakes of advanced silicon have become a national-security concern. This section unpacks the economic and political dimensions that shape the broader AI-hardware industry.
The foundry oligopoly
Cutting-edge semiconductor manufacturing is concentrated. TSMC (Taiwan Semiconductor Manufacturing Company) is the dominant foundry: ~60% of all foundry revenue, ~90% of cutting-edge nodes (5nm, 3nm, 2nm in development). Samsung Foundry is the major alternative, with ~10% market share and capable of cutting-edge nodes. Intel Foundry Services (re-emerging since 2021) targets cutting-edge nodes with significant US-government backing but has had execution challenges. SMIC in China and the various smaller foundries handle older nodes. Most AI chips — NVIDIA, AMD, Apple, Google TPU, the various startups — are manufactured at TSMC.
The advanced-node escalation
Each successive process node — 7nm, 5nm, 3nm, 2nm — provides more transistor density and (sometimes) better power-efficiency. The cost of advancing nodes has been escalating: a 2nm fab costs ~$30B to build; mask costs scale similarly. Only a handful of companies can afford cutting-edge nodes (NVIDIA, AMD, Apple, Qualcomm, the major hyperscalers, a few large startups). The implication: the AI-chip race is partly a race for foundry capacity at leading nodes, with substantial allocation issues during high-demand periods (the 2023–2024 H100 shortage was partly a TSMC-allocation issue, not just NVIDIA capacity).
Packaging and the system perspective
Modern AI chips are not just silicon dies; they're assembled systems involving multiple chiplets, HBM stacks, interposers, and substrates. CoWoS (Chip-on-Wafer-on-Substrate) is the dominant packaging technology for high-end AI chips; TSMC dominates this too. SoIC (System-on-Integrated-Chips) and the various 3D-stacking technologies are pushing further. The packaging step has become a binding constraint — TSMC's CoWoS capacity has been a key bottleneck for H100 and B200 supply through 2023–2025. The 2025–2027 expansion of CoWoS capacity is one of the major industry investments.
The CHIPS Act and domestic manufacturing
The US CHIPS and Science Act (2022) provides ~$52B in subsidies and tax credits for US-based semiconductor manufacturing. Major investments include TSMC's Arizona fabs, Intel's Ohio fab expansion, Samsung's Texas fabs, and various smaller projects. The European Chips Act provides ~€43B in similar subsidies. Japan has its own programmes; Korea similarly. The strategic logic is geographic diversification: reducing dependence on Taiwan-based manufacturing in case of geopolitical disruption. The 2024–2027 effects are substantial — new fabs are coming online — but the dominant cutting-edge capacity remains in Taiwan for the foreseeable future.
Export controls and the geopolitical dimension
The US has implemented export controls restricting sales of cutting-edge AI chips to China, with successive tightening rounds through 2022–2025. Restricted chips include H100, H200, B200, and various others; restrictions cover both finished products and key manufacturing equipment. The implications are substantial: Chinese AI development is increasingly constrained by available hardware; Chinese-domestic foundries (SMIC) face technology-transfer restrictions. The 2024–2026 effects include accelerated Chinese investment in domestic chip capability (Huawei Ascend, Cambricon, and others), workarounds via re-exports through third countries, and ongoing diplomatic friction. The export-control framework will likely continue evolving through the late 2020s.
The investment scale
The aggregate capital expenditure on AI silicon is staggering. NVIDIA's 2024 revenue ran roughly $130B, mostly from AI chips. TSMC's 2024 revenue ran roughly $90B, increasingly AI-driven. Hyperscaler capex (Google, Microsoft, Amazon, Meta) collectively running ~$200B+ annually, with substantial fractions for AI infrastructure. AI-chip startup funding in 2024 ran over $10B across the sector. The aggregate annual capital flow to the AI-hardware industry is in the hundreds of billions of dollars; the comparison to other industries (the entire US automotive industry's annual revenue is ~$700B) shows that AI hardware has become one of the largest industrial sectors of the late 2020s.
The Frontier and the Strategic Question
AI silicon is mature in 2026 along the dominant GPU-and-TPU trajectory but rapidly evolving along several frontiers. 3D stacking is changing what fits on a chip; CXL is reshaping how memory connects; disaggregated architectures separate compute from memory; quantum computing remains far from useful but continues to develop. The strategic questions — who wins, what architectures dominate, what capabilities become possible — will shape the AI industry through the late 2020s and beyond. This section traces the open frontiers and the directions the field is moving in.
3D stacking and the next density frontier
Conventional planar transistor scaling is slowing as physics imposes harder limits. 3D stacking — stacking transistor layers vertically — is the next density frontier. TSMC's SoIC (System-on-Integrated-Chips) and Intel's Foveros are commercial 3D-stacking technologies; HBM memory stacks are an early example. The 2025–2030 trajectory plausibly has 3D-stacked logic becoming standard; the implications for AI chips are substantial — more transistors per package, shorter interconnect distances, and harder thermal-management challenges. Whether 3D stacking sustains the historical exponential improvement in compute density is open.
CXL and memory disaggregation
Compute Express Link (CXL) is an emerging interconnect standard that allows memory to be shared coherently across multiple compute nodes. The implication for AI: large pools of HBM or DDR memory can be shared across many GPUs, eliminating the per-GPU memory limit. The 2024–2026 first deployments are small; the 2026–2030 trajectory plausibly has CXL-based memory disaggregation becoming a standard part of large AI clusters. Whether this displaces NVLink-style intra-server interconnects or coexists is open.
Disaggregated architectures
Beyond the chip, the broader trend is toward disaggregated architectures: compute, memory, and accelerators as separate pools connected by high-bandwidth networks rather than as fixed-server bundles. The disaggregated-inference pattern (Ch 04 §10) is one example; broader disaggregation is moving into training and into general data-centre architecture. The 2025–2030 trajectory plausibly has disaggregated infrastructure as the dominant pattern for new AI clusters; the implications cascade through every layer of the stack.
Quantum computing for ML
Quantum computing remains far from useful for ML in 2026 (cross-referenced from Ch 01 §10). The hardware is improving — IBM, Google, IonQ, the various quantum startups — but practical quantum advantage on real ML problems has not been demonstrated. The 2026 honest assessment: quantum ML is a research curiosity, not yet a production tool, and unlikely to be one for the next 5+ years. Whether it ever becomes one is open.
The post-Moore era
The exponential transistor-density growth that characterised the semiconductor industry from the 1960s through the 2010s has been slowing for years; advanced-node cost increases mean that "more transistors" no longer automatically means "lower cost per transistor." Post-Moore-era scaling requires different innovations: 3D stacking, novel materials, photonic interconnect, in-memory compute, all the alternative approaches discussed in this chapter. Whether the AI industry's compute appetite can continue to be fed depends on succeeding at these alternatives; the 2026 outlook is cautious but not pessimistic.
The strategic outlook
The long-term strategic questions about AI silicon are increasingly entwined with broader industrial and geopolitical trends. Will NVIDIA's ecosystem dominance persist or erode? Will hyperscaler ASICs displace merchant silicon for hyperscaler workloads? Will photonic computing, neuromorphic computing, or analogue compute reach mainstream relevance? Will geographic diversification of foundry capacity actually happen? Will export controls reshape the global AI-hardware landscape? The 2026 honest answer is "we don't know" for all of these; the engineering reality is that everyone in the industry is making bets, and the bets that pay off will substantially shape the AI landscape of the late 2020s.
What this chapter has not covered
Several adjacent topics are out of scope. The detailed methodology of chip design (RTL, verification, EDA tooling) is its own substantial discipline. Semiconductor process technology at depth (lithography, transistor physics, process integration) is out of scope. Specific company financials and strategies evolve too rapidly for an evergreen reference. Supply-chain and procurement details are operationally critical but specific to deployment context. The chapter focused on the conceptual landscape from an ML practitioner's perspective; the broader industry develops adjacent topics in adjacent literature.
Further reading
Foundational papers and references for AI chips and custom silicon. The Hennessy & Patterson architecture textbook; the Jouppi TPU papers; the various NVIDIA/AMD/hyperscaler architecture whitepapers; the Lightmatter and photonic-computing papers; the SemiAnalysis industry reports; and the various 2024–2026 industry surveys form the right starting kit.
- Computer Architecture: A Quantitative Approach. The standard graduate textbook on computer architecture (cross-referenced from Ch 01). Comprehensive coverage of design principles, with substantial chapters on accelerators, domain-specific architecture, and warehouse-scale computing. The reference textbook for computer architecture.
- A Domain-Specific Architecture for Deep Neural Networks (TPU). The TPU paper, in CACM form. Comprehensive treatment of the design philosophy, the architectural choices, and the empirical performance of Google's first TPU. Required reading for understanding the rationale for custom AI silicon. The TPU design-philosophy reference.
- A Domain-Specific Supercomputer for Training Deep Neural Networks (TPU v2/v3/v4). The TPU pod paper (cross-referenced from Ch 01). Documents the evolution toward training and pod-scale systems. The reference for the pod-scale-supercomputer paradigm. The TPU pod reference.
- NVIDIA H100 / B200 Architecture Whitepapers. NVIDIA's architecture whitepapers (cross-referenced from Ch 01). Detailed descriptions of the chip architecture, the SM design, the Tensor Cores, NVLink, and the broader ecosystem. Required reading for anyone optimising for or evaluating NVIDIA hardware. The reference for current GPU architecture.
- SemiAnalysis Industry Reports. SemiAnalysis is the leading industry analysis source for the semiconductor and AI-hardware industry (cross-referenced from Ch 01). Their reports on chip economics, hardware trends, foundry dynamics, and competitive positioning are the dominant source for industry-strategy reasoning. For industry-analysis perspective.
- Deep Learning with Coherent Nanophotonic Circuits. The foundational photonic-deep-learning paper. Demonstrated the principle of optical neural-network computation with programmable interferometers. Foundational for the photonic-computing direction that Lightmatter and others have commercialised. The photonic-deep-learning reference.
- Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. The Intel Loihi paper. Comprehensive description of the architecture, the spiking neural network model, and the on-chip learning capabilities. The reference paper for modern neuromorphic computing. The Loihi / neuromorphic reference.
- Cerebras Wafer-Scale Engine — Technical Reports and Papers. Cerebras's technical descriptions of the wafer-scale architecture. The architectural philosophy is novel enough to warrant detailed reading; the public materials include both marketing whitepapers and serious technical analysis. For wafer-scale-architecture understanding.
- Programming Massively Parallel Processors. The standard textbook for CUDA programming and GPU computing (cross-referenced from Ch 01). The reference for understanding the programming model that underlies the dominant AI-hardware platform. The reference for GPU programming.
- CHIPS for America Act / EU Chips Act / Japan and Korea Programmes. The major government programmes shaping the global semiconductor industry. The CHIPS Act allocations, the EU Chips Act details, the Japan/Korea programmes, and the various export-control frameworks. Required reading for anyone reasoning about the geopolitical dimension of AI hardware. For chip-industry policy understanding.
- Chip War: The Fight for the World's Most Critical Technology. The history-and-strategy book on the global semiconductor industry. Comprehensive coverage of how the modern chip industry came to be, the geographic concentration, the strategic stakes, and the policy implications. Required reading for understanding the broader context of AI hardware. The reference for chip-industry history and strategy.
- The State of AI Compute Reports. Annual industry surveys on AI compute capacity, hardware adoption, and infrastructure trends (cross-referenced from Ch 01). Useful for benchmarking against industry trends and for staying current on the rapidly-evolving competitive landscape. For staying current.