Robot Perception & Sensing: the world made measurable.

A robot's perception system is the boundary between physics and computation. Every belief the robot holds about where it is, what surrounds it, and how the scene is changing comes from a small collection of imperfect sensors and the mathematics that fuses them. Get this layer wrong and nothing above it can be saved.

Prerequisites & orientation

This chapter assumes basic familiarity with linear algebra (Part I Ch 01), probability theory (Part I Ch 04), and computer vision fundamentals (Part VII Ch 01). No prior robotics background is required — we develop the perception stack from sensors up. Readers comfortable with classical robotics will find the modern learned-perception sections (BEV networks, neural implicit maps, foundation models for scene understanding) the most novel; readers from an ML background will find the sections on calibration and Bayesian filtering the most novel.

Two threads run through the chapter. The first is the trade-off between raw geometry (LiDAR, depth, ranging) and semantics (cameras, learned features) — modern systems use both, but how they are combined is one of the central design questions in robot perception. The second is the relationship between classical filtering and optimization and end-to-end learned perception, which is currently the most active research frontier in the field.

01

Why Perception Is Different on a Robot

Computer vision and robot perception look like the same field from a distance — both take pixels in and produce structured outputs. They diverge once you start moving. A vision system that processes a fixed image set offline can spend seconds per frame and re-run with new parameters whenever it wants. A robot perception system runs while the robot is moving, on a power-constrained computer, with results that immediately feed actuators that affect what the next frame looks like. Five constraints follow from that, and they shape every design decision in this chapter.

Real-time latency

Perception runs at the rate of the control loop above it, which runs at the rate of the slowest sensor in the loop. A robot arm doing torque control wants new state estimates at a kilohertz; an autonomous vehicle's planner wants 10–20 Hz; a humanoid balancing on one leg wants the IMU loop at 1 kHz and the vision loop somewhere from 30 Hz upward. Latency budgets are allocated backward from the actuator: if the wheel command must update at 50 Hz, perception, prediction, and planning together have 20 ms — and perception is often the slowest single step. A perception system that produces correct answers 100 ms late is, for a fast-moving robot, indistinguishable from one that produces wrong answers.

Closed-loop coupling

The robot's perception affects its motion, and its motion affects the next perception. This positive feedback is the core difference between robot perception and vision-on-saved-images. A small bias in pose estimation rotates the camera slightly off-axis, which biases the next pose estimate slightly more, and so on — drift that is invisible in a single frame becomes catastrophic over a hundred. The whole field of state estimation exists to break this feedback by combining noisy sensor readings into a coherent belief whose error is bounded over time, not just per-frame.

Partial observability

Any single sensor reading is incomplete. A camera sees only what is in its field of view, only at the resolution it has, only in the part of the spectrum it captures. A LiDAR returns ranges only along a fixed pattern of rays. Even with all sensors combined, the world the robot can infer is much larger than the world it can measure at any one instant. Perception is fundamentally a problem of maintaining beliefs over unobserved state, which is why probability is everywhere in this chapter and why deterministic vision pipelines tend to fail when they meet the real world.

Noise that isn't Gaussian

Sensor noise on a robot is rarely the convenient zero-mean Gaussian assumed in textbook derivations. Cameras drop out in low light, smear under motion, and produce systematic bias under autoexposure. LiDAR gives spurious returns through glass and rain. IMUs have biases that drift with temperature and time. Real perception code spends as much effort modelling these failure modes as it does the main signal — and the assumptions baked into a Kalman filter (zero-mean Gaussian noise, no missing data, known covariance) are violated routinely.

Compute and energy budgets

A datacenter perception model can run on a 700 W H100. A car can spend 100 W on perception, a drone perhaps 20 W, and a battery-powered legged robot less than that. These constraints are why mobile robots run quantised, distilled models, and why some of the most interesting recent work is on perception architectures that match a target latency-and-power envelope rather than a target accuracy. Perception per watt, not peak benchmark accuracy, is the metric that matters.

The Latency Constraint

The most consequential single number in any perception design is the end-to-end latency from photon to motor command. Every algorithmic choice — how often to run the heavy network vs. a tracker, when to skip frames, whether to fuse synchronously or asynchronously — flows from that budget. Designs that treat latency as a downstream concern and try to fix it after the fact almost always end up rewriting the whole pipeline.

02

Cameras and Visual Perception

Cameras dominate robot perception for the same reason they dominate computer vision: they are cheap, dense, and rich. A standard automotive-grade camera produces several megapixels at 30+ Hz for a few tens of dollars, with each pixel carrying both geometric information (where in the world this ray came from) and semantic information (what colour, texture, and pattern is at that ray). LiDAR cannot match the semantic richness; radar cannot match the resolution; IMUs cannot match the field of view. For nearly every robot built since the late 2010s, vision is the primary source of information about the world.

The pinhole camera model

A camera maps 3D world points onto a 2D image plane. The pinhole model — every point projects through a single optical centre — is the standard idealisation, parameterised by an intrinsic matrix K that encodes focal length and principal point, plus a small set of distortion coefficients that capture the lens deviations from the ideal. The projection of a world point is the matrix-vector product x = K [R | t] X, where [R | t] is the camera's pose in the world. Inverting this — going from a pixel back to a world ray — requires K, which means a camera is useless for geometry until it has been calibrated.

The pinhole camera model. A world point X projects onto the image plane through the optical centre; the intrinsic matrix K combined with the pose [R | t] gives the pixel coordinates. The same model, inverted, defines the ray of possible world positions a single pixel could have come from.
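To make the projection concrete, here is a minimal NumPy sketch of both directions of the model — projecting a world point to a pixel, and recovering the ray a pixel came from. The intrinsic values and the test point are illustrative, not from any particular camera:

```python
import numpy as np

# Illustrative intrinsics: 800 px focal length, principal point at image centre.
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                      # camera aligned with world axes
t = np.array([0.0, 0.0, 0.0])      # camera at the world origin

def project(X_world):
    """Pinhole projection x = K [R | t] X, returning pixel coordinates."""
    X_cam = R @ X_world + t        # world frame -> camera frame
    x = K @ X_cam                  # camera frame -> homogeneous pixels
    return x[:2] / x[2]            # perspective divide

def backproject(u, v):
    """Invert K to get the unit ray of world points a pixel could come from."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

print(project(np.array([1.0, 0.5, 10.0])))   # point 10 m ahead -> (720, 400)
print(backproject(720.0, 400.0))             # the ray through that pixel
```

Note that `backproject` returns a direction, not a point: without depth, a single pixel constrains the world point only to a ray, which is the sense in which a camera is "half a sensor".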

The major camera types

Robots deploy several distinct camera technologies, each with different strengths and characteristic failure modes. The choice between them is mostly determined by how much depth information is needed, the lighting conditions the robot will operate in, and the dynamic range of the scene.

RGB monocular
Standard global-shutter camera

A single colour camera, the workhorse for navigation, object detection, and visual SLAM. Cheap and dense; the only depth information is from cues inferred over time (motion parallax) or from a learned monocular depth network. Global shutter is preferred for robotics because rolling shutter introduces motion-correlated distortion.

~30–60 Hz · 2–8 MP · < $100

Stereo
Two cameras with known baseline

Two cameras separated by a known baseline. Disparity between matched pixels yields depth via triangulation (a worked sketch follows this list). Depth accuracy degrades with distance (∝ 1/disparity) and with low-texture surfaces where matching is unreliable. Standard on outdoor mobile robots and AVs.

~10–30 Hz · accurate to ~30 m at 12 cm baseline · $200–$2,000

RGB-D
Active depth sensor + RGB

RGB camera fused with an active depth sensor — structured light (original Kinect), time-of-flight (Azure Kinect, RealSense L515), or projected-IR stereo (RealSense D-series). Excellent indoor depth at short range; struggles outdoors because of sunlight interference with the active source.

~30 Hz · sub-cm depth at < 4 m · $200–$1,000

Fisheye / 360
Ultra-wide-FOV cameras

Lenses with field of view from 180° (fisheye) to 360° (panoramic / catadioptric). Very useful for visual SLAM and surround perception because they reduce blind spots, but the projection is highly non-linear and demands a more elaborate camera model than the pinhole.

~10–30 Hz · 180–360° FOV · $100–$500

Event cameras
Asynchronous neuromorphic

Each pixel independently emits an event whenever its log-intensity changes by a threshold. No frame rate; events arrive at microsecond latency. Excellent for high-speed motion and high-dynamic-range scenes, but the data is a sparse stream rather than an image, so most existing CV tools don't apply directly.

µs latency · > 120 dB dynamic range · $500–$5,000

Thermal / NIR
Long-wave IR & near-IR

Thermal cameras (LWIR) sense emitted heat rather than reflected light, which makes them invaluable for night driving, fire detection, and pedestrian visibility in poor conditions. Near-IR is used as the active illumination band in many RGB-D and structured-light systems.

~30 Hz · low resolution (640×512 typ.) · $1,000–$10,000
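The stereo card's 1/disparity relation deserves a number. A minimal sketch of depth from disparity and how the depth error grows with range — the focal length and matching noise are illustrative, and the 12 cm baseline matches the stereo card above:

```python
# Stereo depth from disparity: Z = f * B / d, with f in pixels, B in metres.
f_px = 800.0      # focal length in pixels (illustrative)
B = 0.12          # baseline in metres, as in the stereo card above

def depth_from_disparity(d_px):
    return f_px * B / d_px

def depth_error(Z, disparity_noise_px=0.25):
    # Differentiating Z = f*B/d gives dZ = (Z^2 / (f*B)) * dd:
    # depth error grows quadratically with range.
    return (Z ** 2) / (f_px * B) * disparity_noise_px

for Z in (5.0, 15.0, 30.0):
    print(f"at {Z:4.0f} m: disparity {f_px * B / Z:5.2f} px, "
          f"error ±{depth_error(Z):4.2f} m for quarter-pixel matching noise")
```

At 30 m this configuration has only about 3 px of disparity and metre-scale depth error, which is why the card quotes ~30 m as the useful limit.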

Distortion, exposure, and the realities of image data

Every real camera deviates from the pinhole model in ways that matter geometrically. Radial distortion (lines that should be straight bow outward or inward) and tangential distortion (a slight tilt of the lens relative to the sensor) are corrected by a small set of distortion coefficients, fitted at calibration time. Wider lenses need richer distortion models — the standard Brown-Conrady model breaks down past ~120° FOV, and fisheye cameras use polynomial or equidistant projection models instead.

Beyond geometry, exposure control silently shapes the data. An auto-exposure algorithm that reacts too slowly produces a dark image as the robot moves into shadow; one that reacts too quickly oscillates and corrupts feature tracking. Most production robotic-vision stacks eventually re-implement parts of the exposure controller, because the off-the-shelf algorithm in the camera firmware was designed for human-pleasing photographs rather than for downstream computer vision.

Cameras Are Half a Sensor

A camera by itself measures intensity per pixel, not depth or motion. Everything else — depth, optical flow, semantic content — is inferred. The accuracy of those inferences depends on the geometry of the camera (intrinsics, pose), the conditions (lighting, motion), and the algorithm used. This is why production robot stacks combine cameras with at least one direct geometric sensor (LiDAR, radar, or stereo) and at least one proprioceptive sensor (IMU, encoders) — to reduce reliance on any single inference.

03

LiDAR and Active Range Sensing

Where a camera infers geometry from intensity, LiDAR (Light Detection And Ranging) measures it directly. A laser pulse is emitted, reflects off a surface, and its time of flight back to a detector gives the distance. Sweep that pulse across many directions and the result is a point cloud — a set of 3D points in the sensor's reference frame, each annotated with intensity and (in newer sensors) wavelength or velocity information. LiDAR is the closest thing in robotics to a sensor that gives ground-truth geometry, and the cost of that ground truth is hardware that is bulky, power-hungry, and (until recently) far more expensive than cameras.

Mechanical, MEMS, and solid-state LiDAR

The mechanical LiDAR — a spinning unit, typically with multiple laser-detector pairs stacked vertically, rotating at 5–20 Hz — is the architecture associated with the Velodyne sensors that defined the early autonomous-vehicle era. It produces a uniform 360° horizontal field of view, but is mechanically complex and expensive.

MEMS LiDAR replaces the spinning assembly with a tiny oscillating micro-mirror that sweeps the beam across a smaller region. It is cheaper and more reliable, but its narrower FOV means several units must be combined for surround coverage. Solid-state LiDAR has no moving parts at all — it uses an optical phased array or a flash architecture (firing a single broad pulse and resolving direction with a 2D detector array). Solid-state designs are smaller, cheaper, and more rugged, and have driven the cost of automotive LiDAR from tens of thousands of dollars in 2015 to a few hundred dollars by 2025.

Point clouds as a representation

A LiDAR scan is a list of 3D points (often hundreds of thousands per scan) that share a common origin (the sensor) but are otherwise unstructured. This unstructured-set property is what makes point clouds awkward for neural networks designed around grids: there is no canonical ordering, no fixed shape, and the local density varies dramatically with distance from the sensor. The dominant treatments are: voxelising the cloud into a regular 3D grid so that (sparse) 3D convolutions apply (VoxelNet and its successors); collapsing the vertical axis into pillars so that cheap 2D convolutions apply (PointPillars); operating directly on the unordered set with permutation-invariant networks (PointNet, PointNet++); and projecting the scan into a 2D range image indexed by the sensor's beam pattern.
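The first two of these treatments are, at their core, a few lines of array indexing. A minimal voxelisation sketch — NumPy only, with an illustrative grid extent and cell size; real detectors keep per-voxel feature vectors and sparse tensors rather than dense counts:

```python
import numpy as np

def voxelize(points, voxel_size=0.2, extent=((-40, 40), (-40, 40), (-3, 3))):
    """Bucket an (N, 3) point cloud into a dense occupancy-count grid.

    This sketch only counts points per voxel, to show the indexing; a
    detection network would aggregate learned per-point features instead.
    """
    lo = np.array([e[0] for e in extent])
    hi = np.array([e[1] for e in extent])
    keep = np.all((points >= lo) & (points < hi), axis=1)   # crop to extent
    idx = ((points[keep] - lo) / voxel_size).astype(int)    # voxel indices
    shape = np.ceil((hi - lo) / voxel_size).astype(int)
    grid = np.zeros(shape, dtype=np.int32)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)   # scatter-add counts
    return grid

# Synthetic scan: 100k points in a flat-ish slab around the sensor.
scan = np.random.uniform(-40, 40, size=(100_000, 3)) * np.array([1, 1, 0.05])
print(voxelize(scan).shape)   # (400, 400, 30)
```

Collapsing the third axis of that grid (summing or max-pooling over z) gives exactly the pillar representation that PointPillars feeds to a 2D convolutional backbone.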

Other range and active sensors

Several other active sensors share LiDAR's general principle but trade off range, resolution, and cost differently. Radar uses radio waves (typically 24, 77, or 79 GHz) and is robust to fog, rain, and dust where LiDAR struggles. It also measures relative velocity directly via the Doppler shift, which is a major advantage for tracking. The trade-off is much lower angular resolution. Ultrasound is cheap and works well at very short range (parking sensors, indoor robots) but is too low-resolution for general perception. Time-of-flight (ToF) cameras are like flash LiDARs in the near-IR band — useful indoors at short range, but heavily affected by ambient light outdoors.

| Sensor | Range | Angular resolution | Weather robustness | Velocity? |
| --- | --- | --- | --- | --- |
| LiDAR (mechanical) | up to 200 m | 0.1°–0.4° | moderate (rain, fog degrade) | via tracking |
| LiDAR (solid-state) | up to 250 m | 0.05°–0.2° | moderate | via tracking |
| Radar (77 GHz) | up to 250 m | 1°–5° (poor) | excellent | direct (Doppler) |
| Stereo camera | ~30 m | image-resolution | poor in low light | via tracking |
| RGB-D (ToF) | ~5 m | image-resolution | indoor only | via tracking |
| Ultrasound | ~5 m | very poor | moderate | via tracking |

The standard answer to "which range sensor should I use" in a serious robotics product is more than one. AV stacks combine LiDAR (precise range, moderate weather), radar (velocity and weather robustness), and cameras (semantics) precisely because the failure modes of each are different — and the sensor-fusion machinery in the next two sections is what lets these sensors compensate for one another.

04

IMU and Proprioceptive Sensors

The exteroceptive sensors of the previous two sections — cameras and range sensors — measure the world. Proprioceptive sensors measure the robot itself: its acceleration, angular velocity, joint angles, wheel rotations. They are smaller, faster, and cheaper than exteroceptive sensors, and they fill in the gaps between vision frames. A 30-Hz camera with a 1-kHz IMU produces about thirty IMU samples between every two images; the IMU keeps the state estimate alive and accurate during that interval.

What an IMU actually measures

A six-axis IMU contains a three-axis accelerometer and a three-axis gyroscope. The accelerometer measures specific force — the force per unit mass applied to the device, which equals the kinematic acceleration minus gravity. The gyroscope measures angular velocity around each of three orthogonal body axes. Nine-axis IMUs add a magnetometer that measures the local magnetic field, providing a global heading reference (subject to magnetic distortion from nearby metal). Modern automotive- and consumer-grade IMUs are MEMS devices on a few-millimetre die; tactical and navigation-grade IMUs use ring-laser gyros or fibre-optic gyros and cost 100–1000× more for proportionally smaller noise and bias.

The integration problem

If you knew the IMU's measurements perfectly, you could integrate angular velocity to get orientation, subtract gravity from accelerometer readings to get kinematic acceleration, and double-integrate that to get position. In practice, IMU readings have additive noise plus a slowly varying bias, and integration is unforgiving: a small constant bias b in the gyroscope produces an orientation error that grows linearly in time (b·t); a small bias in the accelerometer produces a position error that grows quadratically in time (½b·t²). A consumer-grade IMU integrated alone will drift several metres in a few seconds.

IMU measurement model (continuous time)
ω_measured(t) = ω_true(t) + b_g(t) + n_g(t)
a_measured(t) = Rᵀ(t) (a_true(t) − g) + b_a(t) + n_a(t)
The measured gyro signal is the true angular velocity plus a slowly drifting bias plus white noise. The accelerometer signal is the body-frame projection of the kinematic acceleration minus gravity, with its own bias and noise. Estimating the biases b_g and b_a online — and removing them before integration — is the central job of an IMU-based state estimator.
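The quadratic blow-up is easy to see in simulation. A sketch in plain NumPy — the bias value is an illustrative consumer-MEMS-grade number, not a datasheet figure:

```python
import numpy as np

dt = 1e-3                     # 1 kHz IMU
t = np.arange(0, 10, dt)      # ten seconds, robot actually stationary
b_a = 0.05                    # accelerometer bias, m/s^2 (illustrative)

# True kinematic acceleration is zero; the measurement is bias + white noise.
a_meas = b_a + np.random.normal(0.0, 0.02, size=t.shape)

v = np.cumsum(a_meas) * dt    # integrate once: velocity error grows like b*t
p = np.cumsum(v) * dt         # integrate twice: position error like 0.5*b*t^2

print(f"after {t[-1]:.0f} s: velocity error {v[-1]:.2f} m/s, "
      f"position error {p[-1]:.1f} m "
      f"(predicted 0.5*b*t^2 = {0.5 * b_a * t[-1]**2:.1f} m)")
```

A 0.05 m/s² bias — small enough to be invisible in any single reading — produces metres of position error within ten seconds, which is the whole argument for pairing the IMU with an exteroceptive sensor.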

This is why an IMU is almost never used alone. It is paired with an exteroceptive sensor (vision, LiDAR, GPS) that provides absolute position information, and a filter combines the two so that the IMU fills the high-rate gaps while the exteroceptive sensor anchors the long-term drift. Visual-inertial odometry (Section 6) is the canonical worked example.

Wheel encoders, joint encoders, and odometry

Wheeled robots typically have rotary encoders on each drive wheel, producing a signed pulse count proportional to wheel rotation. With known wheel geometry, this gives wheel odometry: an estimate of how far each wheel has travelled. Combined into a kinematic model of the chassis, wheel odometry yields a pose estimate that is locally smooth and metrically scaled — but dead-reckoned, with errors that grow with distance travelled (typically 1–5% of distance, depending on the robot and surface).
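A minimal differential-drive odometry update makes the kinematic model concrete. The wheel radius, tick count, and wheel base below are illustrative constants, not a specific platform:

```python
import math

WHEEL_RADIUS = 0.05      # m (illustrative)
TICKS_PER_REV = 2048
WHEEL_BASE = 0.30        # m between wheel contact points (illustrative)

def odometry_step(x, y, theta, d_ticks_left, d_ticks_right):
    """Advance the pose estimate from one pair of encoder deltas."""
    m_per_tick = 2 * math.pi * WHEEL_RADIUS / TICKS_PER_REV
    d_left = d_ticks_left * m_per_tick
    d_right = d_ticks_right * m_per_tick
    d_center = (d_left + d_right) / 2.0          # chassis forward motion
    d_theta = (d_right - d_left) / WHEEL_BASE    # chassis rotation
    # Integrate at the midpoint heading (first-order arc approximation).
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    return x, y, theta + d_theta

pose = (0.0, 0.0, 0.0)
for _ in range(100):                   # 100 encoder intervals, gentle left turn
    pose = odometry_step(*pose, 40, 44)
print(f"x={pose[0]:.3f} m, y={pose[1]:.3f} m, "
      f"heading={math.degrees(pose[2]):.1f} deg")
```

Note what the model cannot see: if a wheel spins in place, the tick counts still arrive and the pose estimate still advances — the slip problem discussed next.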

Wheel odometry's main weakness is that it cannot detect slip. A wheel that spins on ice or sand records distance even when the robot is not moving. The combination of wheel odometry and IMU is one of the oldest and most reliable patterns in mobile robotics: the IMU detects when the chassis is moving (or not), and the encoders provide the magnitude of that motion. On legged robots and arms, joint encoders play the same role: they tell you the robot's configuration directly, but they don't tell you whether the foot is in contact, the gripper has slipped, or the load has shifted.

GPS and global position

GPS gives an absolute position fix in a global reference frame, which is the one piece of information that all the local-relative sensors above lack. A consumer GPS gives 3–5 m accuracy; differential GPS or RTK (Real-Time Kinematic) GPS can give centimetre-level accuracy when conditions are good. The catch is that GPS only works outdoors with a clear view of the sky. Urban canyons, tunnels, indoor environments, and forest canopies degrade or block the signal. Production AV stacks treat GPS as a long-term anchor — useful for global localisation, untrusted for moment-to-moment control — and do most of their tracking with IMU + LiDAR + camera fusion.

The Proprioceptive / Exteroceptive Pairing

The single most important pattern in robot perception is pairing a high-rate, drift-prone proprioceptive sensor (IMU or encoder) with a lower-rate, absolute exteroceptive sensor (camera, LiDAR, or GPS). The proprioceptive sensor predicts the state forward in time at high frequency; the exteroceptive sensor corrects the prediction whenever it has new information. Almost every state estimator in this chapter — Kalman filter, EKF, factor graph, learned IMU-camera fusion — is some elaboration of that one idea.

05

Sensor Fusion and Bayesian Filtering

A robot has multiple sensors that disagree about what is happening — sometimes mildly, sometimes dramatically. The job of sensor fusion is to combine these readings into a single coherent estimate of the robot's state, with appropriately quantified uncertainty. The dominant framework is Bayesian filtering: maintain a probability distribution over the state, predict it forward in time using a motion model, and update it with each new measurement using Bayes' rule. Different filters are different parameterisations of that same idea.

The Kalman filter

The Kalman filter is the closed-form solution to the Bayesian filtering problem when (a) the state and measurement dynamics are linear, (b) the noise is zero-mean Gaussian, and (c) the prior is Gaussian. Under those assumptions the posterior is also Gaussian, fully described by a mean and a covariance, which can be updated in closed form. The result is the minimum-mean-squared-error estimator for the linear-Gaussian problem, computed at a fixed per-step cost that depends only on the state and measurement dimensions.

Kalman filter — predict and update
predict:
  μ̄_t = F_t μ_{t−1} + B_t u_t
  Σ̄_t = F_t Σ_{t−1} F_tᵀ + Q_t

update (with measurement z_t):
  K_t = Σ̄_t H_tᵀ (H_t Σ̄_t H_tᵀ + R_t)⁻¹
  μ_t = μ̄_t + K_t (z_t − H_t μ̄_t)
  Σ_t = (I − K_t H_t) Σ̄_t
F_t is the state transition, B_t u_t is the control input, Q_t is the process noise covariance, H_t is the measurement matrix, R_t is the measurement noise covariance, and K_t is the Kalman gain. The gain trades off prior and measurement: a small K trusts the prior, a large K trusts the measurement. Read these equations once carefully and the rest of state estimation makes much more sense.
The predict-update cycle. Between measurements, the filter propagates the state forward using the motion model, and uncertainty grows. When a measurement arrives, the filter combines it with the predicted state weighted by the relative uncertainties — the Kalman gain — and uncertainty shrinks. The cycle repeats indefinitely.
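The equations compress into very little code. A one-dimensional sketch — constant-velocity motion model, position-only measurements, all noise values illustrative:

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
H = np.array([[1.0, 0.0]])              # we measure position only
Q = np.diag([1e-4, 1e-2])               # process noise (illustrative)
R = np.array([[0.25]])                  # measurement noise (illustrative)

mu = np.array([0.0, 0.0])               # initial state: [position, velocity]
Sigma = np.eye(2)

def kf_step(mu, Sigma, z):
    # predict: propagate the belief through the motion model
    mu_bar = F @ mu
    Sigma_bar = F @ Sigma @ F.T + Q
    # update: blend prediction and measurement via the Kalman gain
    S = H @ Sigma_bar @ H.T + R                 # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_bar + K @ (z - H @ mu_bar)
    Sigma_new = (np.eye(2) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new

for k in range(50):                      # track a target moving at 1 m/s
    z = np.array([k * dt * 1.0 + np.random.normal(0, 0.5)])
    mu, Sigma = kf_step(mu, Sigma, z)
print(f"estimated position {mu[0]:.2f} m, velocity {mu[1]:.2f} m/s")
```

The filter never observes velocity directly, yet the estimate converges to ~1 m/s: the cross-covariance in Σ lets position measurements inform the velocity state, which is the quiet power of the formulation.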

Beyond linear: EKF, UKF, particle filter

Real robots are not linear-Gaussian. Rotations are non-linear, sensor models are non-linear, and noise is rarely truly Gaussian. Three filter families handle the non-linear case, each with different trade-offs:

Extended Kalman filter (EKF): linearise the motion and measurement models around the current estimate and run the standard equations on the Jacobians. Cheap and ubiquitous, but linearisation errors compound when uncertainty is large.
Unscented Kalman filter (UKF): propagate a small deterministic set of sigma points through the full non-linear models. Captures the posterior more faithfully than the EKF, without computing Jacobians, at modestly higher cost.
Particle filter: represent the belief as a weighted set of samples. Handles arbitrary non-linearities and multi-modal beliefs (the standard choice for global localisation), but scales poorly with state dimension.

Factor graphs and the modern back-end

A more recent shift in robot state estimation is from filtering (incremental, single-pass) to smoothing on factor graphs (re-optimising over a sliding window of recent state). A factor graph is a bipartite graph where nodes represent state variables (poses, velocities, biases, landmarks) and factors represent measurements that constrain those variables. Solving a factor graph is a non-linear least-squares optimisation; the result is a maximum-a-posteriori estimate over the whole window jointly, rather than a forward-only filter pass. Libraries like GTSAM, g2o, and Ceres made this approach the default for serious SLAM systems by the mid-2010s, and the iSAM family of incremental solvers brought the per-step cost down to filter-like levels.
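A toy one-dimensional pose graph makes the idea concrete: three poses linked by odometry factors, plus one loop-closure factor that contradicts the odometry chain. The sketch uses scipy.optimize.least_squares as a stand-in for g2o/Ceres/GTSAM; the measurement values are illustrative:

```python
import numpy as np
from scipy.optimize import least_squares

# Odometry says each hop is ~1.0 m; a loop closure says pose2 is 1.8 m
# from pose0, contradicting the 2.0 m the odometry chain implies.
odometry = [(0, 1, 1.0), (1, 2, 1.0)]   # (from, to, measured displacement)
loop     = [(0, 2, 1.8)]

def residuals(x):
    r = [x[0]]                          # prior factor: anchor pose0 at origin
    for i, j, z in odometry + loop:
        r.append((x[j] - x[i]) - z)     # factor residual: predicted - measured
    return np.array(r)

x0 = np.array([0.0, 1.0, 2.0])          # initialise from the odometry chain
sol = least_squares(residuals, x0)
print(sol.x)                            # error redistributed around the loop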

In practice, modern robotic state estimators look hybrid: an EKF or factor-graph smoother for the high-rate IMU integration, a windowed factor graph for the visual-inertial fusion, and a global pose graph for loop closures. The implementation is more elaborate than any single filter, but each layer fits the time scale and structure of its problem.

06

SLAM: Simultaneous Localization and Mapping

To know where you are, you need a map. To build a map, you need to know where you are. Simultaneous Localization and Mapping (SLAM) cuts this Gordian knot by solving both problems jointly: as the robot moves, it estimates both its pose and the map of landmarks around it, using each to refine the other. SLAM is one of the genuinely hard problems in robotics, and it remains a research area decades into its development. This section covers the architecture that nearly all SLAM systems share, the major instantiations of that architecture, and the recent shift toward learned and neural representations.

The front-end / back-end decomposition

A SLAM system is conventionally split into a front-end that produces measurements from raw sensor data, and a back-end that fuses those measurements into a globally consistent estimate. The split is computational rather than mathematical, but it is the dominant organising principle of the field.

Front-end
Feature extraction: from the raw frame, extract repeatable, distinctive features — corners (Harris, Shi-Tomasi, FAST), descriptors (ORB, SIFT, SuperPoint), or learned key-points.
Data association: match features across frames or against landmarks already in the map. The hardest single step in classical SLAM; mistakes here corrupt the back-end.
Pose estimation: compute the relative pose change between consecutive frames from the matched features (visual odometry) or from scan registration (ICP for LiDAR).
Loop closure detection: recognise that the robot has revisited a place it saw before. Without this, drift accumulates indefinitely. Bag-of-words (DBoW), NetVLAD, or modern global descriptors like CosPlace.
Back-end
Pose graph optimisation: the graph of poses linked by relative-pose constraints from the front-end, optimised to minimise constraint residuals using a non-linear least-squares solver (g2o, Ceres, GTSAM).
Bundle adjustment: joint optimisation of camera poses and 3D landmark positions to minimise reprojection error. The geometric core of every visual SLAM system.
Outlier handling: robust kernels (Huber, Cauchy, switchable constraints) so that bad data associations from the front-end do not corrupt the global solution.
Map management: adding new landmarks, removing stale ones, deciding when to commit a key-frame. Almost all the practical complexity of a SLAM system lives here.

The major SLAM families

Visual SLAM (V-SLAM). The map is a set of 3D landmark points (or surfaces) reconstructed from camera images. Two stylistic branches: feature-based methods (ORB-SLAM, the dominant open-source baseline; PTAM, the parallel-tracking-and-mapping ancestor), and direct methods (DSO, LSD-SLAM, which optimise photometric error directly without explicit features). Stereo and RGB-D variants extend monocular V-SLAM to systems that have a metric scale.

LiDAR SLAM. The map is a 3D point cloud or occupancy grid. Front-ends use scan registration (ICP, NDT, LOAM-style edge/plane features). LiDAR SLAM is generally more reliable than V-SLAM in well-structured environments because it gives metric depth directly, but it relies on geometric structure — a long featureless corridor causes LiDAR SLAM to drift along its principal axis.
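A minimal point-to-point ICP sketch shows the alternating structure of scan registration: associate points by nearest neighbour, then solve the rigid alignment in closed form, and repeat. This is a 2D toy (NumPy plus SciPy's KD-tree); real LiDAR front-ends add outlier rejection, point-to-plane residuals, and motion compensation:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_2d(source, target, iters=20):
    """Align `source` (N,2) to `target` (M,2) with point-to-point ICP."""
    R, t = np.eye(2), np.zeros(2)
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(iters):
        _, nn = tree.query(src)              # data association: nearest neighbours
        matched = target[nn]
        # Closed-form rigid alignment of the matched pairs (SVD / Procrustes).
        mu_s, mu_t = src.mean(axis=0), matched.mean(axis=0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_t))
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:        # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_t - R_step @ mu_s
        src = src @ R_step.T + t_step        # apply the increment
        R, t = R_step @ R, R_step @ t + t_step
    return R, t

# Toy check: recover a known 10-degree rotation plus a small translation.
pts = np.random.uniform(-5, 5, size=(500, 2))
a = np.deg2rad(10)
R_true = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
R_est, t_est = icp_2d(pts, pts @ R_true.T + np.array([0.3, -0.1]))
print(np.rad2deg(np.arctan2(R_est[1, 0], R_est[0, 0])), t_est)
```

The corridor failure mode mentioned above falls straight out of the algebra: if the scan's geometry does not constrain translation along one axis, the SVD alignment is free to slide along it.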

Visual-inertial SLAM (VI-SLAM). Combines a camera with an IMU. The IMU provides a high-rate motion prior (Section 4) that anchors the camera pose between frames, especially during fast motion when feature tracking would otherwise fail. VINS-Mono, OKVIS, and the IMU-coupled variant of ORB-SLAM are the canonical implementations. VI-SLAM produces metric scale (which monocular V-SLAM cannot) and is the standard for AR/VR headsets and small drones.

Multi-sensor SLAM. Production systems combine several of the above. LIO-SAM and FAST-LIO fuse LiDAR with IMU; LVI-SAM adds vision; the AV perception stacks of Waymo and Cruise fuse all three plus radar plus GPS. The complexity scales fast, but each added sensor narrows the failure modes the system is vulnerable to.

Loop closure: why it is the hard part

Drift in SLAM accumulates linearly in the worst case. After 10 km of driving, a 1% drift produces 100 m of error — far enough to put the robot on the wrong street. The only way to bound this is loop closure: recognising when you have returned to a previously mapped location, and adding a constraint that ties the current pose to the historical one. The back-end then redistributes the accumulated error around the loop. The challenge is that loop closure is a binary classification problem on appearance — and false positives are catastrophic, because a single bad loop closure links the wrong places and corrupts the entire map.

The robust answer is to be conservative in declaring loop closures (high precision) and to use techniques like switchable constraints in the back-end that can disable a loop closure factor if it conflicts too strongly with the rest of the graph. Modern systems also use learned global descriptors (NetVLAD, Patch-NetVLAD, MixVPR) that are far more discriminative than the bag-of-words approaches that dominated through the mid-2010s.

The learned-SLAM frontier

Neural networks have entered every part of the SLAM stack. The front-end has been the easiest to absorb — learned feature detectors (SuperPoint), matchers (SuperGlue, LoFTR), and place recognisers (NetVLAD) are now standard in modern systems. The back-end is harder because the optimisation is high-dimensional and the constraints are physical, but differentiable bundle adjustment (DROID-SLAM) and end-to-end neural SLAM (CodeSLAM, NICE-SLAM, the Gaussian-splatting SLAM lineage) are active research areas. As of 2026 the production picture is still hybrid: classical optimisation back-ends with neural front-ends and increasingly neural map representations. Section 10 covers the emerging implicit-map approaches in more depth.

SLAM Is Never "Done"

SLAM looks like a solved problem on the benchmark datasets that researchers compete on. It is far from solved in the wild. Every long-term deployment runs into corner cases — featureless environments, dynamic scenes, slow drift over weeks, sensor degradation — that academic papers tend not to confront. Production SLAM is as much about monitoring and intervention as it is about the algorithm itself.

07

Sensor Calibration

Every algorithm in the previous sections assumed that the sensors' parameters are known: that we know the camera's intrinsic matrix, the relative pose between the camera and the LiDAR, the orientation of the IMU relative to the chassis, and the time offset between sensor clocks. Those parameters are not given by the manufacturer, except very approximately. They have to be measured — and they drift, sometimes over hours, sometimes over months. Calibration is the unglamorous infrastructural work that makes every other layer of robot perception possible.

Intrinsic calibration

Intrinsic calibration estimates the per-sensor parameters that describe how the sensor maps physical input to measurements. For a camera, this is the intrinsic matrix K (focal length, principal point) and the distortion coefficients. The standard procedure shows the camera a known calibration pattern — a chessboard, an AprilTag grid, or a CharucoBoard — from many viewpoints, and solves a non-linear least-squares optimisation that finds the intrinsics that minimise the reprojection error of the pattern's known geometry.
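The whole procedure is a few calls in OpenCV. A hedged sketch — the board dimensions, square size, and image directory are placeholders, and a real calibration needs 20+ views covering the full field of view and range of tilts:

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)          # inner-corner grid of the chessboard (placeholder)
SQUARE = 0.025          # square size in metres (placeholder)

# The board's known 3D geometry: a planar grid at z = 0.
obj = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):     # hypothetical image directory
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(                 # refine to sub-pixel
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(obj)
        img_points.append(corners)

# Non-linear least squares over intrinsics, distortion, and per-view poses.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error (px):", rms)   # below ~0.5 px is healthy
print("intrinsics K:\n", K)
```

The RMS reprojection error is the quantity to watch: it is the same residual that, logged over a deployment, later serves as the calibration-drift signal discussed at the end of this section.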

For an IMU, intrinsic calibration estimates the bias, scale-factor errors, and axis misalignments of the gyroscope and accelerometer, plus the noise-density and bias-instability parameters that go into the filter. The Allan variance plot, computed by recording a stationary IMU for many hours, is the standard way to estimate the random-walk and bias-instability parameters that the filter needs in order to assign appropriate uncertainty to IMU integration.

Extrinsic calibration

Extrinsic calibration estimates the rigid-body transformation between sensors. A camera-LiDAR extrinsic, for example, is a 6-DOF transform: three translations and three rotations that map a point in the LiDAR frame to a point in the camera frame. Get this wrong by a degree of rotation and points 50 m away appear nearly a metre off; get it wrong by a centimetre of translation and close-range manipulation fails.

The standard procedure is to find a calibration target visible to both sensors simultaneously — a board with known dimensions, or a set of corner reflectors — and to optimise the relative transform that minimises the disagreement between the two sensors' observations of the target. For camera-IMU calibration, the procedure is more elaborate: the camera must observe a moving pattern while the IMU measures the motion, and the relative transform plus the time offset between sensor clocks plus the IMU biases are all estimated jointly. Open-source toolboxes like Kalibr (covered in Further Reading) automate this process and have become the de facto standard.

Time synchronisation

Sensors do not share a clock by default. A camera might say its frame was captured at time t, but the timestamp is when the host computer received the frame, which can be tens of milliseconds after the actual exposure. An IMU streams at its own clock rate. A LiDAR has its own internal timing for the spinning sweep. If the timestamps are misaligned, the fusion algorithm tries to combine measurements taken at slightly different physical moments — and on a moving robot, the difference between data taken 30 ms apart is a non-trivial pose change.

The robust solution is hardware time synchronisation: a single clock signal (PTP, GPS-PPS, or a custom trigger line) drives all sensors. The cheaper, lossier solution is software synchronisation: model each sensor's time offset and estimate it online as part of the state. Real systems use a combination — hardware sync where possible, online estimation as a backstop.
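One common software-sync trick (also used by Kalibr-style tools for initialisation) is to cross-correlate the gyro's angular speed with the angular speed implied by visual odometry and read the time offset off the correlation peak. A sketch with synthetic signals standing in for the two rate streams:

```python
import numpy as np

rate = 200.0                          # common resampled rate, Hz
t = np.arange(0, 20, 1 / rate)
true_offset = 0.035                   # camera lags the IMU by 35 ms (synthetic)

omega_imu = np.sin(0.7 * t) + 0.3 * np.sin(2.3 * t)    # gyro angular speed
omega_cam = np.interp(t - true_offset, t, omega_imu)   # same motion, delayed
omega_cam += np.random.normal(0, 0.05, t.shape)        # camera-side noise

# Cross-correlate the zero-mean signals; the peak lag is the time offset.
a = omega_imu - omega_imu.mean()
b = omega_cam - omega_cam.mean()
corr = np.correlate(b, a, mode="full")
lag = np.argmax(corr) - (len(a) - 1)
print(f"estimated offset: {lag / rate * 1e3:.1f} ms "
      f"(true: {true_offset * 1e3:.0f} ms)")
```

This recovers the offset to within a sample or two, which is good enough to initialise the joint optimisation that then refines it along with the extrinsics and biases.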

Calibration drifts

The least-discussed fact about calibration is that it does not stay calibrated. Cameras shift in their mounts after vibration. IMU biases vary with temperature and time. LiDAR-camera extrinsics drift after a pothole jolt. A robot that worked perfectly on Monday can fail on Friday because its calibration silently degraded over the week. Production systems either re-calibrate periodically (the AV approach: factory calibration plus weekly re-calibration on a known scene) or estimate calibration parameters online as part of the state estimator (the visual-inertial approach: bias estimation runs continuously inside the filter).

Detecting calibration drift is itself a perception problem. The standard signal is increasing residuals in the back-end optimisation: if the bundle adjustment cannot fit the current data well even after convergence, the most likely culprit is that the assumed sensor parameters no longer match reality. Production stacks log these residuals and flag the robot for re-calibration when they cross a threshold.

08

The Modern Perception Stack

The classical stack — features, descriptors, hand-crafted optimisation back-ends — was the dominant architecture from roughly 2000 to the late 2010s. The current production stack is different in almost every layer. Front-ends are learned. Map representations are increasingly neural. Object detection runs on dense BEV features rather than per-camera pipelines. And there is a growing trend of fusing perception, prediction, and planning into single end-to-end networks. This section traces those shifts and the architectures that drive them.

Bird's-eye-view representations

The BEV (bird's-eye-view) representation is the dominant paradigm in automotive and mobile-robot perception today. Instead of producing per-camera detections in image space and then fusing them into 3D, the modern approach lifts each camera's features into a shared top-down grid centred on the robot, fuses across cameras, and runs detection and tracking in BEV. The advantages are immediate: BEV is the natural frame for path planning (cars and robots move on a roughly horizontal surface), it makes multi-camera fusion architecturally simple (just project everything to BEV), and it dovetails with LiDAR which is already most naturally represented in BEV.

The architectures that established BEV — Lift-Splat-Shoot (2020), BEVFormer (2022), BEVFusion (2022) — share a common pattern: image features are extracted by a backbone (ResNet, EfficientNet, ViT), explicitly or implicitly projected into BEV using either a depth estimate per pixel or a learned attention mechanism, and refined by a transformer-style encoder operating on the BEV grid. Detection and tracking heads then operate on this unified feature map.

The modern multi-camera BEV perception architecture. Per-camera image features are extracted by a shared backbone, lifted into a common bird's-eye-view grid via depth estimation or cross-attention, refined by a BEV-space encoder, and consumed by task-specific heads for detection, tracking, and online mapping. LiDAR features (when present) are projected into the same BEV grid and fused naturally.
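A minimal "lift and splat" sketch shows the geometric core of the depth-based BEV lift: given per-pixel depth, back-project pixels into 3D and scatter their features into a top-down grid. NumPy only; the intrinsics and grid parameters are illustrative, and real BEV networks predict a depth distribution per pixel and splat softly rather than using a single hard depth:

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])        # illustrative intrinsics
K_inv = np.linalg.inv(K)

def lift_to_bev(features, depth, cell=0.5, bev_size=100):
    """Scatter (H, W, C) image features into a (bev_size, bev_size, C) grid.

    depth: (H, W) per-pixel depth in metres (a network prediction in
    practice; a constant stand-in below).
    """
    H, W, C = features.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    pts = (K_inv @ pix) * depth.ravel()       # back-project: rays scaled by depth
    x, z = pts[0], pts[2]                     # camera frame: x right, z forward
    col = ((x / cell) + bev_size // 2).astype(int)
    row = (z / cell).astype(int)
    ok = (0 <= row) & (row < bev_size) & (0 <= col) & (col < bev_size)
    bev = np.zeros((bev_size, bev_size, C))
    np.add.at(bev, (row[ok], col[ok]), features.reshape(-1, C)[ok])
    return bev

feats = np.random.rand(480, 640, 8)           # stand-in for backbone features
depth = np.full((480, 640), 10.0)             # stand-in for predicted depth
print(lift_to_bev(feats, depth).shape)        # (100, 100, 8)
```

Running this per camera (with each camera's extrinsics applied before splatting) and summing the grids is, in miniature, why multi-camera and camera-LiDAR fusion become architecturally simple in BEV: everything lands in the same array.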

Occupancy networks

BEV is fast but loses the vertical dimension — a flat top-down representation cannot distinguish a low overhang from a clear path. Occupancy networks address this by predicting, for every voxel in a 3D grid around the robot, the probability that it is occupied. The output is a dense 3D representation that handles overhangs, irregular geometry, and unmapped objects without requiring those objects to belong to a known class. Tesla's transition from per-class object detection to occupancy networks (2022) was widely cited as a turning point in production AV perception, and similar architectures are now standard across the industry.

Foundation models for robot perception

Self-supervised vision models — DINO, DINOv2, CLIP, SAM-2, V-JEPA — produce general-purpose visual representations that transfer across tasks without per-task training. For robotics, the most useful properties are zero-shot object segmentation (SAM family), open-vocabulary detection (Grounding DINO, OWL-ViT), and dense semantic features that survive viewpoint and lighting changes (DINOv2 features). Production robot stacks increasingly use these models as front-end encoders, with thin task-specific heads on top, rather than training perception networks from scratch.

The longer-term ambition — generalist robot perception models that consume any sensor modality and produce structured scene representations — is an active research frontier covered in Section 10. But the practical near-term impact has already arrived: a robotics startup in 2026 can take SAM-2 plus DINOv2 plus a small fine-tuned head and have a perception system that would have required a multi-year custom-training effort five years earlier.

End-to-end perception and the prediction-planning blur

The classical stack treated perception, prediction, and planning as separate modules with clean interfaces between them: perception produced a list of detected objects with poses; prediction produced future trajectories for each; planning produced a route. The modern trend is to blur these boundaries. End-to-end driving networks (e.g., UniAD, VAD) jointly produce perception, prediction, and planning outputs from raw sensor inputs, sharing a common BEV feature map. The argument is that the modules' interfaces — bounding boxes, predicted trajectories — discard information that the downstream module would have used. The counter-argument is that end-to-end systems are harder to debug, harder to verify, and harder to certify for safety-critical use. Production AV stacks in 2026 typically have a hybrid: end-to-end neural prediction-planning, with a separate, verifiable perception module that retains structured outputs for the safety-critical certification path.

09

Failure Modes

Every layer of the perception stack has characteristic failure modes. They are mostly known. They mostly have known mitigations. The reason perception is still hard is that the failures interact, accumulate, and cascade — and the long tail of edge cases that happens once in a million miles is the part that determines whether a robot is safe to deploy.

Sensor degradation

Cameras saturate in direct sunlight, fail in low light, smear under motion blur, and produce unreliable colours through poorly chosen autoexposure. LiDAR returns degrade in rain, fog, snow, and dust — water droplets and particles scatter the laser pulse and produce spurious close-range returns. Radar is robust to weather but suffers from multipath reflections from metal surfaces and the road itself. IMUs drift with temperature and time, and magnetometers are unreliable indoors and near metal structures. Each of these is well-characterised; production stacks include sensor-health monitors that flag degraded data before it reaches the fusion layer.

Filter divergence

Bayesian filters can diverge — produce estimates that are confidently wrong — when their assumptions are violated. The classic mode is over-confidence: the filter's covariance shrinks faster than the actual error, the gain on new measurements drops, and the estimate becomes locked into a wrong belief that no incoming data can correct. This usually traces back to under-modelled process noise (the Q matrix is too small) or to a subtle linearisation error in an EKF when uncertainty is large. The diagnostic is the normalised innovation squared (NIS) statistic, which should be roughly chi-squared distributed if the filter is consistent; persistent NIS values outside the expected range mean the filter does not believe its own measurements, which is the early warning of divergence.
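The NIS check is a few lines on top of any Kalman-style filter's update step, where the innovation and its covariance are already computed. A sketch using scipy for the chi-squared bounds; the windowing scheme is one reasonable choice, not a standard:

```python
import numpy as np
from scipy.stats import chi2

def nis(innovation, S):
    """Normalised innovation squared, nu^T S^-1 nu, for one update step."""
    return float(innovation.T @ np.linalg.solve(S, innovation))

def consistent(nis_window, m):
    """Is a window of NIS values plausible for an m-dim measurement?

    For a consistent filter each NIS value is chi-squared with m degrees
    of freedom, so the window's sum is chi-squared with n*m; flag windows
    whose average falls outside the central 95% band.
    """
    n = len(nis_window)
    lo = chi2.ppf(0.025, df=n * m) / n
    hi = chi2.ppf(0.975, df=n * m) / n
    return lo <= np.mean(nis_window) <= hi

# Synthetic check: NIS samples from a healthy filter with 2D measurements...
healthy = chi2.rvs(df=2, size=50)
print(consistent(healthy, m=2))       # True (with high probability)
# ...versus an over-confident filter whose innovations are ~3x larger
# than its covariance admits (NIS inflated by ~9x).
print(consistent(healthy * 9, m=2))   # False: early warning of divergence
```

Persistently high NIS means the filter is receiving measurements it believes are implausible; persistently low NIS means the covariance is inflated and the filter is wasting information. Both are actionable before the estimate visibly diverges.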

Adversarial and out-of-distribution conditions

Perception networks trained on standard datasets fail in conditions those datasets did not contain. Snow on the road, glare from a setting sun directly into a camera, an emergency vehicle with non-standard markings, a pedestrian carrying an unusually large object — all of these have been documented as failure modes in production AV systems. The mitigations are layered: train on more diverse data (the data flywheel), augment heavily during training (synthetic snow, glare, occlusion), add anomaly detection that flags low-confidence inputs for additional caution, and require the planner to default to safe-stop behaviour when perception is uncertain.

Adversarial attacks on perception — patterned stickers that cause a stop sign to be misclassified, projector-based attacks that inject phantom objects into camera streams — are a more pointed threat. They are practically rare in benign deployment but increasingly worth defending against in safety-critical systems. The standard countermeasure is sensor-fusion redundancy: an attack that fools one sensor modality must also fool the others, which is dramatically harder than fooling a single one.

Calibration drift, undetected

The slowest and most insidious failure mode is calibration drift that goes unnoticed. The system continues to produce plausible-looking outputs, but the assumed sensor parameters no longer match reality, and every downstream estimate is subtly biased. The signal in the back-end residuals (Section 7) is the standard detection method, but the threshold for "this is bad enough to recalibrate" is hard to set and often only obvious in retrospect. Long-running deployments need a re-calibration cadence and a residual-monitoring dashboard, not a one-time factory calibration.

The long tail

The single hardest fact about robot perception is that the failure rate per mile (or per hour, or per task) is dominated by the long tail of rare events. A perception system that handles 99.9% of common situations correctly will still encounter the remaining 0.1% — and on a fleet of thousands of vehicles driving millions of miles, that 0.1% is many thousands of incidents. Most of the engineering effort in production perception is spent on the long tail, not on the common cases. This is why the "data flywheel" — using the fleet itself to discover and label rare events that go back into training — is the central strategic asset of every serious AV company.

Defensive Layering

Production perception systems are organised in defensive layers. Each sensor has its own health monitor. Each fusion module has consistency checks against the others. The state estimator has divergence detection. The output layer has anomaly detection on its own predictions. Every layer is allowed to flag a problem and demand a graceful-degradation response. This sounds like over-engineering until the first time a single sensor failure cascades into a wrong action — at which point it becomes obvious why every layer needs an independent health check.

10

Frontier Topics

Robot perception is in the middle of a major architectural transition. The classical stack — features, filters, optimisation — is being progressively replaced by learned components. The replacement is not uniform: the back-end optimisation has been the most resistant, while the front-end has been almost entirely absorbed by neural networks. Several research threads are active enough that the production picture in five years will look meaningfully different from today's.

Neural implicit maps and Gaussian splatting

Until recently, robot maps were geometric — point clouds, voxel grids, surfel meshes — and the dominant trade-off was between resolution and memory. Neural radiance fields (NeRFs) and the 3D Gaussian splatting family changed the framing: a map can be a learned function that, given a query position, returns colour and density. The result is a photorealistic, view-dependent reconstruction that uses orders of magnitude less memory than a high-resolution voxel grid would, and that natively supports synthesising novel viewpoints — useful for rendering simulation scenes from real-world data.

NICE-SLAM (2022) was an early system showing that NeRF-style implicit representations could serve as the map in a real-time SLAM system. The Gaussian-splatting lineage (SplaTAM, MonoGS, Gaussian Splatting SLAM) has made these systems faster and more memory-efficient. As of 2026 the production SLAM picture is still mostly classical, but research SLAM has decisively moved to learned implicit maps, and production adoption is expected to follow over the next few years.

Foundation models for spatial understanding

The progression that played out in language — from task-specific models to foundation models — is now playing out in vision and 3D. SAM and SAM-2 provide generalist segmentation. DINOv2 provides generalist dense features. V-JEPA and similar models learn predictive video representations that transfer across tasks. For robotics, the relevant question is whether a single large model can replace the perception stack: take in sensor data and output a unified representation that perception, prediction, and planning all consume. Models like 3D-LLM, ConceptGraphs, and the broader vision-language-action (VLA) lineage covered in the next chapter (RT-2, OpenVLA) are the most concrete instances of this trend.

Active perception

A subtle point that is finally getting attention: perception is not passive. The robot can choose what to look at next. Active perception (also called next-best-view planning) treats sensor placement as an optimisation problem — given the current uncertain belief, where should the camera point next to maximally reduce uncertainty? Classical active perception has existed for decades, but it has been hard to integrate with modern learned perception. Recent work (active SLAM with neural implicit maps, learned exploration policies, information-gain estimation in BEV) is starting to close that gap.

Generalist sensor fusion

Modern fusion is largely modality-specific: a separate pipeline for camera fusion, another for LiDAR fusion, another for IMU integration. The frontier is end-to-end fusion across all modalities through a single learned model that consumes raw sensor streams and outputs a structured scene representation. This is the architecture of recent work like UniAD (camera + radar + map), MM-DiffusionDet (camera + LiDAR + radar), and the broader trend in AV stacks toward unified multi-sensor backbones. The argument is that hand-designed fusion discards information that a learned model could exploit; the counter-argument is that fusion is one of the places where physical models really do help, and replacing them with neural networks risks regressing on the cases the physics handles cleanly.

What this chapter does not cover

Several adjacent topics are properly the subject of later chapters. The control loops that consume the perception output — PID, model predictive control, optimisation-based controllers — belong to Chapter 02 (Motion Planning and Control). The training of action policies on top of perception belongs to Chapter 03 (Learning from Demonstration). The simulation environments used to train and evaluate perception under controlled conditions belong to Chapter 04 (Sim-to-Real Transfer). The vision-language-action models that fold perception into a generalist policy belong to Chapter 05 (Foundation Models for Robotics). And the integration of perception into a complete autonomous-driving stack — with all its safety, regulatory, and behavioural complexity — belongs to Chapter 06 (Autonomous Vehicles). Robot perception is the foundation of all of those, and the rest of Part XII assumes the material here.

Further Reading