Autonomous Vehicles: The Largest Robot Deployment in History
Self-driving cars are the most economically consequential robotics application of the 2020s. They are also the hardest robotics deployment problem ever attempted: high stakes, heavy regulation, adverse weather, billions of edge cases. This chapter covers the perception, prediction, planning, and safety stack that makes a vehicle drive itself, and the regulatory environment that determines where and how they are allowed to do so.
Prerequisites & orientation
This chapter draws on the perception material of Chapter 01, the planning and control material of Chapter 02, and the foundation-model architectures of Chapter 05. It is the most applied chapter in Part XII — most of the techniques discussed elsewhere appear here in their production-deployed form, with the addition of safety and regulatory considerations that other robotics applications mostly avoid.
Two threads run through the chapter. The first is the architectural modular-vs-end-to-end question: should the AV stack be a chain of separate perception, prediction, and planning modules with clean interfaces, or a single neural network from sensors to control? Both approaches have shipped in 2026, with different trade-offs in safety verification, debuggability, and capability. The second thread is the long tail: AV failures cluster in rare situations that no individual development team has seen before, which is why fleet-scale data collection and scenario-based testing dominate the engineering effort, often more than algorithmic innovation.
Levels and the Long Road to L4
The Society of Automotive Engineers' six levels of driving automation are the lingua franca of the autonomous vehicle industry. They are also a slightly misleading abstraction: the levels suggest a smooth progression from L0 (no automation) to L5 (drive anywhere), but in practice only two levels — L2 driver assistance and L4 geofenced robotaxi — have produced commercially viable products, and the gap between them is wider than the numbering suggests.
The SAE levels, briefly
The SAE J3016 levels were standardised in 2014 (with substantial revisions through 2021). Level 0 is no automation. Level 1 is single-function driver assistance — adaptive cruise control or lane keeping, but not both at once. Level 2 is partial automation with both steering and acceleration handled by the system, but the driver remains responsible for monitoring and is expected to intervene at any moment. Level 3 is conditional automation where the system handles full driving in a defined set of conditions and the driver may disengage attention but must take over when prompted. Level 4 is high automation within a defined operational design domain — no driver required when conditions are met. Level 5 is full automation everywhere.
The clean numerical progression hides a discontinuity. L2 vehicles ship by the millions today (Tesla Autopilot, GM Super Cruise, Ford BlueCruise, Mercedes Drive Pilot in some configurations); the driver is legally responsible. L4 robotaxis (Waymo, certain Pony.ai and Apollo Go services in China) operate without a driver in carefully geofenced areas. L3 has been a regulatory and commercial dead zone — only a handful of approvals exist (Mercedes in Nevada and California, Honda in Japan), each with extremely narrow operational design domains, because the handover problem (driver re-engagement when the system fails) turned out to be harder than expected.
The two production tiers
The market has thus bifurcated. L2 systems are a feature on a consumer vehicle the customer drives — an advanced cruise control with computer vision. The driver remains attentive (at least nominally), and the system's job is to make ordinary highway driving and increasingly urban driving safer and less effortful. The technical bar is high but bounded by the human in the loop. L4 systems are an entirely different product: a vehicle service that operates without any driver, in a defined service area, governed by entirely different regulatory and operational expectations. The technical bar is much higher because there is no fallback.
Most of this chapter is about the AV stack as it appears in both tiers, but specific differences will be flagged where they matter. The architectural details of an L2 highway-pilot stack are surprisingly similar to those of an L4 robotaxi stack; the difference is in the safety case, the validation effort, and the maturity of the long-tail handling, more than in the algorithms themselves.
L3 is conceptually elegant — the system drives until it can't, then hands off — but the human factors are brutal. Asking a person who has been disengaged for an hour to re-take control with a few seconds' notice has consistently failed in human-factors research. Every L3 deployment has had to constrain the operational design domain so heavily (specific highways, fair weather, low speed, no construction zones) that the practical capability is barely above L2 with a less attentive driver. L4 sidesteps the handover entirely; L2 keeps the driver attentive throughout. L3 sits in the worst spot.
The AV Stack: Five Functions, One System
A self-driving vehicle's software stack does five things in a tight loop: localise itself in the world, perceive other agents, predict what those agents will do, plan its own trajectory, and execute that trajectory through the vehicle's actuators. Forty years of robotics research has refined how each of those functions is implemented; the live architectural debate is whether they should be separate modules or one big network.
The five functions
Localisation answers "where am I?" — placing the vehicle within a map (or, in mapless systems, within an inferred road network) at lane-level accuracy. Perception answers "what is around me?" — detecting and tracking other vehicles, pedestrians, cyclists, lane markings, traffic lights, signs, and road geometry. Prediction answers "what will those things do next?" — forecasting trajectories for every dynamic agent over the next few seconds. Planning answers "what should I do?" — choosing a trajectory that achieves the route goal while remaining safe, legal, and comfortable. Control answers "how do I make the wheels do that?" — translating the planned trajectory into steering, throttle, and brake commands that the vehicle dynamics will execute.
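A skeleton of that loop, with hypothetical module and method names (no vendor's actual API), just to fix the data flow:

```python
import time

def drive_loop(stack, vehicle, route_goal, rate_hz=20):
    """The five functions run as a tight loop (typically 10-30 Hz).
    `stack` and `vehicle` are hypothetical objects standing in for the
    real software stack and the vehicle interface."""
    period = 1.0 / rate_hz
    while not vehicle.arrived(route_goal):
        t0 = time.monotonic()
        sensors = vehicle.read_sensors()                # cameras, radar, LiDAR, IMU
        pose = stack.localise(sensors)                  # where am I?
        agents = stack.perceive(sensors, pose)          # what is around me?
        forecasts = stack.predict(agents, pose)         # what will they do next?
        traj = stack.plan(pose, forecasts, route_goal)  # what should I do?
        cmd = stack.control(traj, pose)                 # steering, throttle, brake
        vehicle.actuate(cmd)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```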
Modular vs. end-to-end
The classical architecture treats each function as a separate module with a clean interface to the next. Perception emits a list of detected objects with poses and class labels; prediction takes that list and emits predicted trajectories; planning takes the trajectories and emits a path; control takes the path and emits actuator commands. The interfaces are inspectable, the modules are independently testable, and the safety case is constructible. Most production AV systems through 2024 used some version of this architecture.
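A sketch of what "clean interface" means concretely: typed messages between modules. Field names here are assumptions, not any production schema.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    track_id: int
    cls: str                 # "vehicle", "pedestrian", "cyclist", ...
    pose: tuple              # (x, y, heading) in the ego frame
    velocity: tuple          # (vx, vy), metres per second

@dataclass
class PredictedTrajectory:
    track_id: int
    waypoints: list          # [(x, y, t), ...] over the next few seconds
    probability: float       # one entry per mode for multi-modal predictors

# Each module maps one representation to the next, and each boundary is
# a place to log, replay, and unit-test:
#   perception : sensors                   -> list[DetectedObject]
#   prediction : list[DetectedObject]      -> list[PredictedTrajectory]
#   planning   : list[PredictedTrajectory] -> planned trajectory
#   control    : planned trajectory        -> actuator commands
```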
The competing approach is end-to-end: a single neural network ingests raw sensor data and emits low-level driving commands, with no engineered intermediate representations. Tesla's FSD has progressively moved in this direction; Wayve has been end-to-end from the beginning; UniAD and similar academic efforts have shown end-to-end stacks can match modular ones on benchmark splits. This chapter's position on the debate, developed in the sections that follow, is that the production picture in 2026 is hybrid: end-to-end where capability matters more than verifiability, modular where the safety case must be constructible, and a steady drift toward end-to-end as verification techniques catch up.
Perception for AVs
AV perception is the single most engineering-intensive component of the stack. A modern self-driving car carries 8–14 cameras, 4–8 radar sensors, sometimes 1–4 LiDAR sensors, and runs a neural-network stack that fuses them into a unified scene representation at 10–30 Hz with strict latency budgets. Most of the perception work covered in Chapter 01 has its highest-stakes deployment here.
Sensor configurations
The camera-only approach (Tesla, increasingly Mobileye and Wayve) relies on dense surround-view cameras — typically 8 to 12 — for all perception. The argument is that humans drive with vision alone and that LiDAR adds cost without proportional capability gains; the counter-argument is that cameras struggle in adverse weather and direct sun, conditions where LiDAR and radar continue to work. The multi-sensor approach (Waymo, Cruise, Pony.ai, Mobileye's premium stack) adds LiDAR for precise geometry and radar for weather robustness and direct velocity sensing. Each sensor's failure modes are different, and the redundancy gives the stack a path to gracefully degrade rather than fail when one modality is impaired.
BEV representation as the unifier
The dominant 2022–2026 perception architecture lifts all sensor inputs into a shared bird's-eye-view grid centred on the vehicle. Per-camera features are projected via either depth estimation (Lift-Splat-Shoot) or learned attention (BEVFormer, BEVFusion); LiDAR points are voxelised in the same grid; radar is fused as additional channels. Detection, tracking, and online lane-graph extraction all operate on this unified BEV feature map. The architectural pattern was covered in detail in Chapter 01 Section 8; in AV deployment it is now standard at every major company except for those using fully end-to-end sensor-to-action networks.
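A naive, loop-based sketch of the lift step in the spirit of Lift-Splat-Shoot. Real implementations batch this on the GPU and lift a per-pixel depth distribution rather than a point estimate; shapes and helper names here are assumptions.

```python
import numpy as np

def lift_splat(feats, depths, cam_to_ego, grid_size=200, cell_m=0.5):
    """feats:  (H, W, C) per-camera feature map
       depths: (H, W) estimated metric depth per pixel
       cam_to_ego: hypothetical helper mapping a pixel (u, v) at depth d
                   to (x, y) metres in the ego frame."""
    C = feats.shape[-1]
    bev = np.zeros((grid_size, grid_size, C))
    H, W = depths.shape
    for v in range(H):
        for u in range(W):
            x, y = cam_to_ego(u, v, depths[v, u])   # back-project to ego frame
            i = int(x / cell_m) + grid_size // 2    # ego-centred grid indices
            j = int(y / cell_m) + grid_size // 2
            if 0 <= i < grid_size and 0 <= j < grid_size:
                bev[i, j] += feats[v, u]            # sum-pool features per cell
    return bev

# LiDAR points and radar returns are voxelised into the same grid and
# concatenated as extra channels; detection and lane heads run on `bev`.
```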
Occupancy networks
Tesla's transition from per-class object detection to occupancy networks (publicly described in 2022 and refined since) was widely cited as a paradigm shift. Rather than detect a list of bounding-box objects, the network predicts, for every voxel in a 3D grid around the vehicle, the probability that it is occupied — and an associated semantic class and motion vector. The output handles overhangs, irregular geometry, and unmapped objects (e.g., debris in the road) without requiring the object to belong to a known category. By 2024 this approach was standard across most camera-heavy AV stacks.
Tracking and kinematic priors
Per-frame detection is not enough; AV perception needs multi-object tracking that maintains identity across frames and handles the case where an object is briefly occluded. Modern trackers combine learned association (transformer-based query tracking, MOTRv2-style approaches) with kinematic Kalman or extended-Kalman filters that smooth the trajectory and predict short-term continuation through occlusions. The combination is the standard production pattern.
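A minimal sketch of the kinematic half of that pattern: a constant-velocity Kalman filter per track, with greedy nearest-neighbour matching standing in for the learned association network. Gating thresholds and noise values are illustrative.

```python
import numpy as np

DT = 0.1                                  # frame period, seconds
F = np.array([[1, 0, DT, 0],              # state: [x, y, vx, vy]
              [0, 1, 0, DT],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
H = np.array([[1, 0, 0, 0],               # we observe position only
              [0, 1, 0, 0]])
Q = np.eye(4) * 0.1                       # process noise
R = np.eye(2) * 0.5                       # measurement noise

class Track:
    def __init__(self, x, y):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4)
        self.missed = 0                   # frames since last match (occlusion)

    def predict(self):
        self.state = F @ self.state       # coast forward under constant velocity
        self.P = F @ self.P @ F.T + Q
        return self.state[:2]

    def update(self, z):
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)            # Kalman gain
        self.state = self.state + K @ (z - H @ self.state)
        self.P = (np.eye(4) - K @ H) @ self.P
        self.missed = 0

def step(tracks, detections, gate=2.0, max_missed=5):
    """One tracking step: predict every track, greedily match detections."""
    preds = [t.predict() for t in tracks]
    unmatched = list(range(len(detections)))
    for t, p in zip(tracks, preds):
        if not unmatched:
            t.missed += 1
            continue
        j = min(unmatched, key=lambda k: np.linalg.norm(detections[k] - p))
        if np.linalg.norm(detections[j] - p) < gate:
            t.update(detections[j]); unmatched.remove(j)
        else:
            t.missed += 1                 # coast through a brief occlusion
    tracks[:] = [t for t in tracks if t.missed <= max_missed]
    tracks += [Track(*detections[j]) for j in unmatched]   # spawn new tracks
```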
Latency and the perception budget
The whole perception stack must fit in something like 50–100 ms of total latency from sensor capture to scene-representation output, on automotive-grade hardware (NVIDIA Orin, Tesla's HW4, Qualcomm Ride). The budget shapes every design choice: networks are quantised to INT8, batches are restricted to one or two timesteps, and most production systems use distillation to compress research-quality networks into deployable ones. Perception's "is this fast enough?" question is as important as its "is this accurate enough?" question, and is often the harder constraint.
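The budget is enforced in code, not in documentation. A toy version of the kind of per-stage latency guard such pipelines carry; the stage structure and the 80 ms figure are assumptions.

```python
import time

BUDGET_MS = 80.0   # illustrative total budget, capture to scene output

def run_pipeline(frame, stages):
    """stages: ordered list of (name, fn) callables forming the pipeline."""
    t0 = time.perf_counter()
    out, timings = frame, {}
    for name, fn in stages:
        s = time.perf_counter()
        out = fn(out)
        timings[name] = (time.perf_counter() - s) * 1e3
    total = (time.perf_counter() - t0) * 1e3
    if total > BUDGET_MS:
        # In production an overrun triggers degraded-mode handling and
        # fleet telemetry, not just a log line.
        print(f"budget overrun: {total:.1f} ms > {BUDGET_MS} ms, {timings}")
    return out
```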
Localisation and HD Maps
A self-driving car needs to know where it is at lane-level precision — about 10 cm. Consumer GPS is accurate to a few metres, which is not nearly enough. The classical answer is high-definition maps and lane-level localisation against them; the newer answer is mapless approaches that infer the road structure on the fly. Both ship in 2026.
HD maps
High-definition maps for AVs are not consumer navigation maps. They are precise, lane-level vector representations of the road network: every lane centreline, every lane boundary, every stop line, every traffic light, every speed limit, every yielding rule. Building them requires fleet-scale survey vehicles equipped with high-end LiDAR and cameras; maintaining them requires continuous re-survey because the world changes. Waymo, Cruise (pre-shutdown), and Pony.ai all built proprietary HD maps for their service areas. Mobileye's approach was to crowdsource map updates from its production fleet, which is why their maps cover much more area but are less detailed.
HD maps simplify everything downstream. The vehicle localises against the map (lane-level GPS + IMU dead reckoning + visual or LiDAR matching to map features), and the planner can then reason about the world using the map's structured representation rather than having to infer lanes and rules in real time. The cost is the map itself — building, maintaining, and validating it. Estimates for a single dense urban service area range from $50 to $500 per kilometre per year, before adding redundancy.
Mapless approaches
The most consequential 2024–2026 trend in AV stacks has been the move toward mapless driving: inferring the lane structure, traffic rules, and road topology directly from sensors at runtime, with at most a coarse navigation map indicating the route. Tesla's FSD has been mapless from early on; Wayve's stack is mapless by design; Mobileye's lower tiers have always been mapless because the cost of HD-mapping every road they wanted to operate on was infeasible.
Mapless approaches scale geographically — once the model works in city A, it should work in city B without a separate map-building project — but they impose a higher burden on the perception system. The lane-graph extraction must run accurately at every step; misreading an intersection's lane structure can cause an immediate planning failure. The best mapless systems in 2026 are competitive with HD-mapped systems on routine driving but degrade more in unusual situations (construction zones, faded markings, atypical intersections).
Lane-level localisation
Inside the localisation module, the actual computation is a tight fusion of GPS / GNSS (when available, and with RTK corrections for centimetre accuracy when in coverage), IMU dead reckoning, wheel odometry, and visual or LiDAR matching against either the HD map or the immediately observed road features. Modern factor-graph SLAM back-ends (covered in Chapter 01 Section 6) run continuously, with the map providing strong priors when available and the system falling back to mapless visual-inertial odometry when not. The output is a vehicle pose with uncertainty estimates, used by every other module downstream.
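A toy version of the fusion loop's data flow: dead reckoning at IMU rate, corrected by absolute fixes when they arrive. A production system uses a factor graph with full covariances; the scalar-gain filter below only shows the structure, and all names and gains are illustrative.

```python
import math

class PoseFilter:
    def __init__(self, x, y, heading):
        self.x, self.y, self.heading = x, y, heading

    def predict(self, speed, yaw_rate, dt):
        """Dead-reckoning step from wheel odometry + IMU (runs at ~100 Hz)."""
        self.heading += yaw_rate * dt
        self.x += speed * math.cos(self.heading) * dt
        self.y += speed * math.sin(self.heading) * dt

    def correct(self, fix_x, fix_y, gain=0.3):
        """Blend in an absolute fix: GNSS/RTK, or a LiDAR/visual match
        against HD-map features, whenever one is available."""
        self.x += gain * (fix_x - self.x)
        self.y += gain * (fix_y - self.y)
```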
Prediction
Prediction is the AV stack's hardest unsolved problem. The vehicle must forecast the trajectories of every nearby car, pedestrian, and cyclist over the next several seconds, accounting for what each agent might do, what they might do in response to the AV's own actions, and the considerable irreducible uncertainty in human behaviour.
The prediction problem, formally
Given the current scene — the AV's pose, the HD map (if available), and the tracked state of every other agent — predict for each agent i a distribution over its future trajectory τi spanning the next 3–8 seconds. The distribution is non-trivial because human drivers are multi-modal: at an intersection, the same approaching car might turn left, turn right, or go straight, and the predictor needs to represent all three possibilities with calibrated probabilities. A unimodal prediction that averages the modes ("the car is going slightly left and slightly right") is much worse than a multi-modal prediction that explicitly says "30% turn left, 50% straight, 20% turn right."
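A numeric illustration of why mode-averaging hurts, using a 1-D lateral endpoint as a stand-in for a full trajectory (the setup is invented for illustration):

```python
import numpy as np

# Hypothetical intersection: the car ends 3 m left or 3 m right of centre
# with equal probability. A full trajectory behaves the same way per step.
modes = np.array([-3.0, +3.0])        # two predicted mode endpoints, metres
unimodal = modes.mean()               # 0.0: "slightly left and slightly right"

for ground_truth in (-3.0, +3.0):
    err_multi = np.abs(modes - ground_truth).min()   # 0.0 in both cases
    err_uni = abs(unimodal - ground_truth)           # 3.0 in both cases
    print(ground_truth, err_multi, err_uni)

# minADE-style benchmark metrics formalise this: score the closest predicted
# mode, so keeping both modes is strictly better than averaging them.
```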
Architectures
Modern prediction networks are transformer-based, taking as input the historical trajectories of all agents in the scene plus the local map, and producing a set of trajectory candidates per agent with associated probabilities. MotionLM (Waymo, 2023) was an influential entry: tokenise trajectories the same way language models tokenise text, then predict the next-token continuation autoregressively, with the multi-modality emerging from sampling. MTR (Motion Transformer, 2022) anchors predictions on a discrete set of intent prototypes ("turn left at next intersection," "lane change right") and predicts trajectories conditional on each, with an associated probability over intents.
The major architectural axis is whether prediction is scene-level (modelling all agents jointly with their interactions) or agent-level (predicting each agent independently, ignoring interactions). Scene-level prediction handles the case where two cars are predicting each other's behaviour and influencing each other; agent-level prediction is simpler and faster but misses these interactions. Production stacks in 2026 are trending toward scene-level for the most complex environments (busy intersections, merging, lane changes) and agent-level for everything else.
The interaction problem and self-conditioning
A subtlety: the AV's own future actions affect the predicted trajectories of other agents. If the AV is going to merge into another car's lane, that car's trajectory depends on whether the AV actually merges. The prediction module must therefore be conditioned on a hypothetical AV plan, which means prediction and planning have to be solved jointly rather than sequentially. Modern stacks do exactly this — running prediction conditional on a candidate plan, scoring the plan, and iterating; Waymo's MotionLM, with its joint autoregressive rollouts over all agents, is the clearest published example. The clean modular interface is preserved, but the joint optimisation is the substance.
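Schematically, the joint loop looks like this; the predictor and scorer are hypothetical stand-ins for the real modules.

```python
def choose_plan(candidate_plans, scene, predictor, scorer):
    """Score each ego plan under predictions conditioned on that plan."""
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:
        # Other agents' forecasts depend on what the ego vehicle does.
        forecasts = predictor(scene, ego_plan=plan)
        score = scorer(plan, forecasts)   # safety, progress, comfort terms
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```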
Planning and Behaviour
The planner translates "perceive and predict" into "act." Three families of planners coexist in production AV stacks: classical rule-based behaviour layers, optimisation-based trajectory planners, and learned planners. Most deployments combine two or three of them in a hierarchy.
The hierarchical planner
The classical decomposition splits planning into three layers. Route planning is global navigation — pick the sequence of roads from origin to destination, typically using A* on a road graph (Chapter 02 Section 3). Behaviour planning is medium-horizon decision-making — should I yield, change lanes, pass the cyclist, wait at the intersection? This is often implemented as a behaviour tree or a finite-state machine over discrete manoeuvres, with rules encoding traffic law and social conventions. Trajectory planning is short-horizon kinematic feasibility — given the chosen behaviour, produce a continuous trajectory in time-position-velocity space that the vehicle can actually execute.
Each layer constrains the next. Route planning fixes the road sequence; behaviour planning chooses the manoeuvre on the next segment; trajectory planning produces the smooth path. The interfaces are well-defined and the layers can be developed and tested largely independently — which is much of why this decomposition has remained the default for safety-critical deployments.
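A toy behaviour layer as a finite-state machine, to make the middle layer concrete; the states, predicates, and transitions are illustrative.

```python
def behaviour_step(state, obs):
    """obs: dict of predicates derived from perception and prediction."""
    if state == "LANE_FOLLOW":
        if obs["red_light_ahead"]:
            return "STOP_AT_LINE"
        if obs["slow_lead_vehicle"] and obs["adjacent_lane_clear"]:
            return "LANE_CHANGE_LEFT"
    elif state == "LANE_CHANGE_LEFT":
        if obs["lane_change_complete"]:
            return "LANE_FOLLOW"
        if not obs["adjacent_lane_clear"]:
            return "LANE_FOLLOW"          # abort: the gap closed mid-manoeuvre
    elif state == "STOP_AT_LINE":
        if obs["light_green"] and obs["intersection_clear"]:
            return "LANE_FOLLOW"
    return state

# The chosen manoeuvre is handed to the trajectory layer, which produces
# the continuous motion realising it.
```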
Trajectory planning techniques
The trajectory layer is where the classical planning techniques of Chapter 02 are most directly applied. Model predictive control is the dominant technique: at every control step, formulate a constrained optimisation over the next few seconds of vehicle motion subject to dynamics, speed limits, comfort constraints, lane geometry, and avoidance of every predicted agent trajectory. Solve it. Apply the first action. Repeat. Variants on this — sampling-based planners (MCTS-like searches over discretised manoeuvres), learned planners (neural networks trained to imitate expert drivers), and hybrid combinations — all exist, but MPC over predicted-agent constraints is the workhorse.
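A minimal receding-horizon sketch of that loop for the longitudinal case: choose accelerations over a 3-second horizon that track a target speed while keeping a gap to the predicted lead-vehicle positions. The weights, limits, and soft-constraint penalty are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

DT, H = 0.2, 15                       # 3-second horizon at 5 Hz

def rollout(a, s0, v0):
    """Integrate position/velocity under an acceleration sequence."""
    v = v0 + np.cumsum(a) * DT
    s = s0 + np.cumsum(v) * DT
    return s, v

def plan_step(s0, v0, v_target, lead_s):
    """lead_s: predicted lead-vehicle positions over the horizon (length H)."""
    def cost(a):
        s, v = rollout(a, s0, v0)
        gap = lead_s - s
        return ((v - v_target) ** 2).sum() \
             + 10.0 * (np.diff(a, prepend=a[0]) ** 2).sum() \
             + 1e4 * np.clip(8.0 - gap, 0, None).sum()   # soft 8 m gap constraint
    bounds = [(-4.0, 2.0)] * H                           # comfort accel limits
    res = minimize(cost, np.zeros(H), bounds=bounds)
    return res.x[0]                                      # apply first action only
```

The production version adds lateral dynamics, lane geometry, one constraint per predicted agent trajectory, and a real-time QP solver, but the receding-horizon structure is the same.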
Learned planners
The frontier in AV planning is the move toward end-to-end learned planners — networks that map from perception output (and sometimes raw sensors) directly to a trajectory or to control commands, trained on demonstration data from human drivers and refined with reinforcement learning. UniAD (CVPR 2023) showed that an end-to-end learned planner could match or exceed modular pipelines on benchmark splits while sharing perception and prediction features. Tesla's FSD, Wayve's stack, and several Chinese AV efforts (Pony.ai's PonyWorld, Xpeng's XNet) have all moved heavily in this direction in 2024–2026.
The long tail
Most ordinary driving is straightforward — staying in lane, maintaining speed, stopping at lights. The hard cases are the long tail: the construction zone with non-standard signage, the cyclist weaving through traffic, the emergency vehicle approaching from behind, the deer crossing at night, the four-way stop with confused human drivers. Production AV stacks succeed or fail on the long tail more than on the common cases, and the engineering effort is heavily skewed toward identifying and handling rare situations. This is why the data-flywheel pattern — fleet observations feeding back into training data — has become the dominant strategic asset of every serious AV company. The algorithms are commodities; the labelled long-tail dataset is not.
End-to-End Autonomy
The classical modular stack has been the production default for a decade. The most consequential architectural debate in AVs since 2022 is whether to replace it with a single end-to-end neural network. Several major players have already shipped versions of this; the rest are watching closely.
What "end-to-end" actually means
The strict definition of end-to-end driving is a single neural network that takes raw sensor inputs and outputs steering, throttle, and brake commands, with no engineered intermediate representations. The looser practical definition is a deeply integrated stack where perception, prediction, and planning share a backbone and are trained jointly, but where some structured outputs (like predicted trajectories or behaviour primitives) are still emitted as intermediate signals that humans can inspect.
Pure end-to-end systems have existed since DAVE-2 (NVIDIA, 2016) and earlier; they've been research curiosities for most of that history because their behaviour was hard to verify. The 2022–2026 shift is that end-to-end approaches have become competitive with modular ones on real driving, while still being substantially harder to certify for safety.
Tesla FSD's trajectory
Tesla's Full Self-Driving (FSD) software is the most-deployed end-to-end-leaning AV stack as of 2026. Tesla's progression from FSD V11 (modular) to V12 (heavily neural-network-based, trained end-to-end on millions of fleet-collected miles of human driving) to V13 and V14 (further compressed, faster inference, expanded operational design domain) has been the most public test case for the end-to-end approach. The capability profile that emerged is recognisable: better than modular stacks on smooth, common situations; less predictable on unusual ones; difficult to construct a safety case around.
Wayve, GAIA, and the world-model approach
Wayve's stack has been end-to-end from its founding in 2017, on the bet that sensor-to-action driving could match modular approaches if the data scale was sufficient. Their GAIA series (GAIA-1 in 2023, GAIA-2 in 2025) is a generative video model trained on driving data, used both as a simulator and as a backbone for the driving policy. The architectural pattern — a world model trained on real driving video, with a policy trained against the world model — is increasingly influential and is the closest AV analogue to the foundation-model approach of Chapter 05.
UniAD and the academic frontier
On the academic side, UniAD (CVPR 2023, best paper award) demonstrated a unified architecture where perception, prediction, and planning all share a transformer backbone and are trained jointly with an end-to-end planning loss. The paper triggered a wave of follow-up work (VAD, GenAD, DriveTransformer) refining the architecture and pushing benchmark scores. By 2026 the academic consensus is that end-to-end-with-shared-backbone is the right architectural direction; the production consensus is that it is the right direction but the safety case still needs work.
Safety: ODD, ASIL, and Scenario Testing
An AV's safety case is the central engineering and regulatory artefact. It is the formal argument that the system, deployed in a specified environment, achieves an acceptable level of safety. Building it consumes more engineering time than the perception or planning algorithms themselves.
The Operational Design Domain
The Operational Design Domain (ODD) is the formal specification of the conditions under which the AV is designed to operate: geographic area, road types, weather, time of day, traffic conditions, speed range, and dozens of other parameters. The ODD is the precondition for everything else — a safety case is constructed for a specific ODD, and behaviour outside the ODD is undefined. Waymo's robotaxi service in Phoenix has a specific ODD covering specific roads, specific weather conditions, specific times of day; outside those conditions the vehicles do not operate. Tesla's FSD has a much wider declared ODD (most paved roads in covered countries) and the validation effort is correspondingly larger and less complete.
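In practice the ODD is executable configuration as much as documentation. A sketch of the runtime gate; the fields and thresholds are invented, not any operator's actual ODD.

```python
from dataclasses import dataclass

@dataclass
class ODD:
    geofence: set            # allowed map-region IDs
    max_speed_mps: float
    weather_allowed: set     # e.g. {"clear", "light_rain"}
    daylight_only: bool

def in_odd(odd, region, weather, is_daytime, posted_speed_mps):
    return (region in odd.geofence
            and weather in odd.weather_allowed
            and (is_daytime or not odd.daylight_only)
            and posted_speed_mps <= odd.max_speed_mps)

# Leaving the ODD mid-trip triggers a minimal-risk manoeuvre (slow, pull
# over, stop) rather than continued autonomous operation.
```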
ISO 26262 and ASIL
The functional-safety standard for automotive electronics is ISO 26262. Every safety-relevant electronic system in a vehicle is assigned an Automotive Safety Integrity Level (ASIL) from A (least critical) to D (most critical). AV perception and planning components are typically ASIL-B or ASIL-D depending on the failure mode, which imposes specific requirements on hardware redundancy, software development process, verification rigour, and failure-rate targets (e.g., ASIL-D requires fewer than 10 dangerous failures per billion hours of operation).
ISO 26262 was originally written assuming deterministic software and breaks down somewhat for neural-network components, where "verification" cannot mean "exhaustively check all inputs." The newer ISO 21448 (Safety of the Intended Functionality, SOTIF) covers the gap: how do you certify a system whose behaviour is fundamentally probabilistic? SOTIF reframes safety in terms of statistical performance over a representative distribution of scenarios. Combined, the two standards form the regulatory backbone of AV safety in most jurisdictions.
Scenario-based testing
The dominant testing methodology for AVs is scenario-based testing: rather than try to drive a billion miles to demonstrate safety statistically, define a structured catalogue of driving scenarios (intersection crossings, lane changes, pedestrian encounters, construction zones, etc.) and verify the system handles each scenario correctly across a range of parameter variations. The scenarios come from human-driving incident logs, regulatory test catalogues (PEGASUS in the EU, NHTSA's pre-crash typology in the US), and the AV company's own fleet observations. Each scenario is automatically tested in simulation across thousands of parameter variations, and the system's pass rate becomes part of the safety case. Production AV companies run scenario suites of millions of variations on every code change.
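A skeleton of one scenario's sweep, a pedestrian stepping out mid-block, exercised across a small parameter grid; the simulate() hook and the parameter ranges are assumptions.

```python
import itertools

ped_speeds = [0.8, 1.2, 1.6, 2.0]        # pedestrian walking speed, m/s
ego_speeds = [8, 10, 12, 14]             # ego speed at trigger, m/s
offsets    = [5, 10, 20, 30]             # metres ahead when pedestrian steps out

def run_suite(simulate):
    """simulate(params) -> True if the ego behaves safely in that variation."""
    results = []
    for ped_v, ego_v, off in itertools.product(ped_speeds, ego_speeds, offsets):
        ok = simulate({"ped_speed": ped_v, "ego_speed": ego_v, "offset": off})
        results.append(((ped_v, ego_v, off), ok))
    failures = [p for p, ok in results if not ok]
    pass_rate = 1 - len(failures) / len(results)
    return pass_rate, failures            # failures feed triage and retraining
```

Production suites differ in scale (millions of variations, thousands of logical scenarios) rather than in structure.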
The role of fault injection and the safety driver
Two more ingredients complete the safety practice. Fault injection testing deliberately introduces sensor failures, actuator faults, software crashes, and network drops to verify the system degrades gracefully rather than catastrophically. The safety driver (or, for L4, a remote operator) provides the final fallback: a human authorised to intervene when the system requests assistance or when an external observer flags a problem. The safety-driver model is being phased out for L4 (Waymo has been driverless in commercial operation since 2023), but it remains universal for L2 and for L4 testing.
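A sketch of fault injection at the sensor boundary, the simplest of the fault classes above; the fault types and rates are illustrative.

```python
import random

class FaultySensor:
    """Wraps a sensor interface and injects dropouts and stale frames."""
    def __init__(self, sensor, drop_rate=0.01, stale_rate=0.01):
        self.sensor = sensor
        self.drop_rate, self.stale_rate = drop_rate, stale_rate
        self.last = None

    def read(self):
        r = random.random()
        if r < self.drop_rate:
            return None                   # total dropout: stack must degrade
        if r < self.drop_rate + self.stale_rate:
            return self.last              # stale frame: exercises timestamp checks
        self.last = self.sensor.read()
        return self.last

# The test harness asserts the stack enters its degraded mode (slow down,
# widen margins, request fallback) rather than treating None or stale data
# as an empty, safe world.
```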
Regulatory and Operational Context
An AV is not just a technical artefact; it is a regulated product operating in public space under varying legal regimes. The technology can ship as fast as the engineers can build it; the deployment is gated by regulators, insurance markets, and the public's tolerance for the inevitable incidents.
The major regulatory regimes
Three regulatory environments dominate global AV deployment in 2026. The United States is fragmented: federal NHTSA regulations cover vehicle safety standards, but operational permits for self-driving services are state-level. California's DMV requires a layered permit process for testing and commercial operation; Arizona is permissive; Texas, Florida, and Nevada each have their own rules. The result is that AV companies pick service areas state by state, often building separate validation cases for each.
The European Union regulates through the UN-ECE harmonised regulations (R157, R171) plus the EU Type Approval Framework. The bar is uniform across member states but high; deployments lag the US by 1–3 years on most operational expansions, but the certification, once obtained, applies across the union. Mercedes' L3 Drive Pilot was approved under R157 and is the most-deployed L3 system globally as a result.
China has the most permissive testing environment among major markets: dedicated AV-pilot zones in Beijing, Shanghai, Guangzhou, Wuhan, Shenzhen, and Hefei, with explicit policies encouraging rapid deployment. Pony.ai, Apollo Go (Baidu), AutoX, and WeRide all run commercial robotaxi services in multiple Chinese cities at scale. The regulatory speed has compressed the development-to-deployment cycle to a degree the US and EU have not matched.
The "responsible operator" model
Across regimes, the dominant legal model for L4 deployment is the responsible operator: the AV company is legally responsible for the vehicle's behaviour when it is operating without a driver, with attendant liability for accidents. This is a meaningful shift from L2 deployments (where the human driver bears responsibility) and from how most other technology products are sold (where the user accepts terms of service that limit the vendor's liability). Waymo, Cruise, and Pony.ai all operate under variants of this model in their respective markets; the insurance and liability frameworks have had to adapt.
Incident reporting and the public record
NHTSA's Standing General Order (issued 2021, updated 2023 and 2025) requires AV operators to report every incident involving an SAE L2 or higher system to the federal government within specific time windows. The reports are public; researchers and journalists analyse them continuously. The dataset has become the most consequential public-record input to AV regulation, surfacing recurring failure patterns (Cruise's 2023 pedestrian-dragging incident leading to suspension, Tesla's autopilot crashes leading to ongoing investigations) that drive new requirements.
The public's tolerance
The deeper political reality is that the public's tolerance for AV-caused incidents is asymmetric: the same number of fatalities caused by a self-driving system attracts vastly more attention and demands more regulatory response than the same number caused by human drivers. The technology must be substantially safer than humans (per mile, per hour, per task) to be politically acceptable, not merely as safe. This asymmetry is the binding constraint on deployment timelines for most AV companies, more so than any specific technical limitation.
The Industry: Who Ships What in 2026
The AV industry of 2026 looks substantially different from the AV industry of 2020. The shakeout that began with Argo AI's shutdown in 2022 and accelerated through Cruise's suspension in 2023 has left a smaller field of survivors, each focused on a specific operational niche. This section maps the landscape as it actually stands.
The L4 robotaxi survivors
Waymo is the unquestioned commercial leader in L4. As of 2026 Waymo One operates in Phoenix, San Francisco, Los Angeles, Austin, Atlanta, and Miami, with announced expansions to several more cities. The vehicle fleet is dominated by Jaguar I-Pace and Zeekr-built custom platforms; the technical stack is a heavily modular pipeline with HD maps, multi-sensor perception, and a hybrid learned/optimisation-based planner. Waymo has crossed cumulative milestones (millions of fully driverless miles, multiple cities of profitable per-trip economics) that no other operator has matched.
Cruise was suspended after the October 2023 pedestrian-dragging incident, restructured under GM, and re-launched with a much smaller scope in 2025. As of 2026 Cruise operates a limited service in Phoenix and Houston with a notably more conservative behavioural profile than Waymo. The trajectory is recovery rather than market leadership.
Pony.ai, Apollo Go (Baidu), AutoX, and WeRide dominate the Chinese robotaxi market and collectively operate at a fleet scale comparable to Waymo's, with somewhat different operational design domains and a faster regulatory environment. Pony.ai has expanded to Hong Kong and several Middle Eastern markets; Apollo Go runs the largest fleet (over 1,000 vehicles in Wuhan alone). The technical stacks are broadly similar to Waymo's — modular, multi-sensor, HD-mapped — with their own variants of each component.
Zoox (Amazon-owned) opened limited paid service in Las Vegas and San Francisco in 2025 with a custom-built bidirectional vehicle. The technology is competitive; the path to broad geographic expansion is the open question.
The L2 / L2+ players
Tesla dominates the consumer end. FSD (and FSD Beta in earlier years) ships on millions of vehicles and is increasingly capable, with the V12-V14 progression making it a competitive alternative to dedicated L4 services in some operational regimes. Tesla's announced plans for an L4 robotaxi service ("Cybercab") have moved on a slower timeline than the company has indicated, but the underlying stack continues to advance.
Mobileye ships ADAS to virtually every major automaker; their SuperVision stack is L2+ and approaching L3 in select deployments, and their Drive product is L4 in pilot programmes with Volkswagen, Holon, and others. Mobileye's strategy of incremental capability across a vast OEM base contrasts with Waymo's vertically integrated robotaxi model.
The OEMs — Mercedes, BMW, GM, Ford, the Chinese OEMs (BYD, Xpeng, Nio, Li Auto) — all ship some L2+ system, with varying strategies for the move to L3 and beyond. The Chinese OEMs in particular have been aggressive on autonomous-driving feature rollouts, with Xpeng's XNGP, Li Auto's NOA, and similar systems shipping in tens of millions of consumer vehicles.
Wayve is the wildcard. Their end-to-end stack is licensed to OEMs (Nissan, Uber's L4 effort) rather than deployed as a Wayve-branded service, and the technology has been visibly impressive in demos. Whether the architectural bet on end-to-end pays off at scale is one of the open questions of the next two to three years.
What this chapter does not cover
The trucking-AV market (Aurora, Gatik, Plus.ai, Kodiak Robotics, TuSimple's various successor entities) has its own dynamics — different ODDs, different economics, different regulatory environments — that deserve their own treatment. The delivery-robot market (Nuro, Starship, Serve, Coco) overlaps technically with AVs but operates under different rules. The drone-delivery market (Wing, Zipline, Amazon Prime Air) is mostly separate. And the broader question of how foundation-model architectures from Chapter 05 will eventually inform autonomous vehicles — VLA-style language-conditioned driving, world-model-based simulators — is an active research frontier rather than a 2026 production reality.
Autonomous vehicles are the most economically consequential robotics deployment of the 2020s. The classical modular stack covered in this chapter is the foundation; the end-to-end and foundation-model approaches are the frontier. Both are converging toward the same thing — vehicles that can drive themselves more safely than humans, in increasingly broad operational domains — and both will likely play a role in the deployments of the next decade. The practitioner who understands the modular stack and the foundation-model trajectory together is the one who can build the next generation of AV systems.
Further Reading
- SAE J3016: Levels of Driving Automation. The SAE standard that defines the L0–L5 levels of driving automation. The standard itself is brief and worth reading directly rather than relying on summaries — many of the persistent misunderstandings about AV capability come from imprecise references to the levels. The reference document for the AV industry's shared vocabulary.
- UniAD: Planning-Oriented Autonomous Driving. The UniAD paper that established the unified-architecture end-to-end approach as a serious alternative to modular stacks. Covers the architecture, the joint training procedure, and the empirical results showing that end-to-end can match or exceed modular pipelines on benchmark splits. The paper that legitimised end-to-end AV stacks for academic and industrial audiences.
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling. Waymo's MotionLM paper. Reframes trajectory prediction as autoregressive token prediction over discretised trajectories, naturally producing multimodal forecasts. The paper is especially clear on the empirical advantages of multimodal vs. unimodal prediction in interactive scenes. The reference for modern transformer-based trajectory prediction.
- BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. The BEVFormer paper that established the multi-camera bird's-eye-view paradigm now dominant in production AV perception stacks. Pairs naturally with the perception material in Chapter 01 Section 8. The reference architecture for camera-based BEV perception.
- Waymo Safety Methodology and Safety Performance Reports. Waymo publishes detailed safety methodology documents and quarterly safety performance reports covering their commercial fleet. The documents are unusually detailed by industry standards and offer the clearest public window into how a production L4 operator constructs and validates a safety case. The most accessible serious public document on AV safety practice.
- ISO 26262 and ISO 21448 (SOTIF). The two standards together define the formal safety envelope for AV development. ISO 26262 covers functional safety of automotive electronics; SOTIF (ISO 21448) covers the gap that 26262 leaves around probabilistic / ML-based components. Reading both, even casually, calibrates expectations about what AV safety engineering actually involves. The regulatory backbone of the AV safety case.
- GAIA-1: A Generative World Model for Autonomous Driving. The Wayve GAIA-1 paper. Establishes the generative-world-model approach to AV training: train a video generator on driving data, use it as both simulator and policy backbone. The closest published AV analogue to the foundation-model approach of Chapter 05. The reference for generative world models in the AV setting.
- NHTSA Standing General Order Incident Reports. The public dataset of every reportable AV-related incident in the US since 2021. The dataset is searchable, periodically updated, and is the primary empirical input to the public's understanding of AV safety. The aggregate patterns visible in the data are different from any individual operator's marketing claims, and reading even a sample of the reports recalibrates expectations. The most consequential public dataset on AV safety.