The world is three-dimensional; images are not. 3D vision is the set of techniques that reconstruct spatial structure — depth, geometry, camera pose, object shape, scene layout — from 2D image evidence. The field has a long pre-deep-learning history: pinhole cameras and intrinsics, epipolar geometry for stereo, Structure from Motion for multi-image reconstruction (COLMAP, Bundler), multi-view stereo for dense depth, and the classical SLAM pipelines (PTAM, ORB-SLAM) that run in real time on autonomous cars and drones. Deep learning brought four waves of change. First, monocular and stereo depth networks (DPT, MiDaS, ZoeDepth, Depth Anything, RAFT-Stereo) replaced hand-crafted matching with learned representations. Second, point-cloud and voxel networks (PointNet, PointNet++, DGCNN, Point Transformer, MinkowskiNet, SpConv) learned directly on 3D data for detection, segmentation, and classification — the perception backbone of autonomous driving. Third, neural radiance fields (NeRF, 2020) and their hundred-plus follow-ups (Mip-NeRF, Instant-NGP, Zip-NeRF, Plenoxels) encoded scenes as continuous volumetric functions, producing photorealistic novel views from a handful of posed photos. Fourth — and most recently — 3D Gaussian Splatting (3DGS, 2023) replaced neural fields with millions of tiny Gaussian primitives, achieving real-time rendering at NeRF-beating quality; within two years, extensions covered dynamic scenes, large environments, and generative 3D. In parallel, 3D generation (DreamFusion, Zero-1-to-3, Score Distillation Sampling), parametric body models (SMPL, SMPL-X, HMR), and dense SLAM (DROID-SLAM, MASt3R, NeRF-SLAM) have pushed the state of the art in scenes, people, and real-time reconstruction.
This chapter follows the full arc: camera geometry and classical reconstruction, depth networks, point-cloud and volumetric deep learning, 3D detection and segmentation, mesh and surface reconstruction, neural fields and Gaussian splatting, SLAM, 3D human and object modelling, 3D generation, and the deployment engineering that makes all of this practical.
Sections one through five cover the geometry foundations. Section one is why 3D vision matters — the task landscape (depth, pose, reconstruction, detection, segmentation, generation, SLAM), the applications (autonomous driving, robotics, AR/VR, digital twins, content creation), and the three great difficulties of lifting 2D to 3D: scale ambiguity, occlusion, and the information lost in projecting the world onto a single view. Section two is 3D representations: point clouds, voxel grids, polygon meshes, implicit signed-distance and occupancy fields, and Gaussian splats — with a practical guide to when each is the right choice. Section three is camera geometry: the pinhole model, intrinsics and extrinsics, homogeneous coordinates, projective geometry, radial distortion, and camera calibration. Section four is multi-view geometry: epipolar constraints, the fundamental and essential matrices, stereo triangulation, and the historical pipeline from the pre-deep-learning era. Section five is Structure from Motion and Multi-View Stereo: the COLMAP / Bundler / PMVS tradition that remains an essential pre-processing step for most neural 3D methods.
Sections six through ten cover 2D-to-3D perception. Section six is depth estimation — monocular (MiDaS, DPT, ZoeDepth, Depth Anything v1/v2, Marigold) and stereo (RAFT-Stereo, IGEV, CREStereo) — with the data, loss, and evaluation story for each. Section seven is point-cloud networks: PointNet's permutation-invariant point-wise MLP, PointNet++'s hierarchical sampling, DGCNN's graph-based features, KPConv's kernel-point convolutions, and the Point Transformer / PTv3 line. Section eight is voxel and sparse-convolutional networks: dense 3D UNets, OctNet's sparse octrees, and the MinkowskiEngine / SpConv sparse-convolutional ecosystem that made large-scale 3D learning practical. Section nine is 3D object detection: LiDAR-based (VoxelNet, SECOND, PointPillars, CenterPoint, TransFusion), camera-based (BEVFormer, PETR, Lift-Splat-Shoot), and multi-modal (FUTR3D, BEVFusion). Section ten is 3D segmentation: semantic, instance, and panoptic — Mask3D, Cylinder3D, PolarMix, OneFormer3D.
Section eleven is mesh reconstruction: marching cubes, screened Poisson reconstruction, AtlasNet, Pixel2Mesh, Mesh R-CNN, neural mesh optimisation (NeuralMesh, Flexicubes), and the differentiable-rendering tooling (nvdiffrast, Mitsuba 3) that enables gradient-based mesh refinement.
Sections twelve through fourteen cover the neural-field revolution and its aftermath. Section twelve is neural radiance fields — NeRF's volumetric rendering formulation, positional encoding and coarse-to-fine sampling, and the family of variants (Mip-NeRF 360, Instant-NGP's hash grid, Plenoxels, TensoRF, Zip-NeRF, Nerfacto). Section thirteen is Gaussian splatting: the 3DGS formulation, differentiable rasterisation, and the rapidly expanding family (Mip-Splatting, 4D-GS, LangSplat, Deformable-GS) that replaced NeRF as the default radiance representation within eighteen months of its release. Section fourteen is SLAM: classical feature-based (ORB-SLAM3), dense (KinectFusion, BundleFusion), learned (DROID-SLAM, DPV-SLAM), and neural-field-based (iMAP, NICE-SLAM, NeRF-SLAM, SplaTAM, Gaussian-SLAM).
Sections fifteen and sixteen cover 3D generation and humans. Section fifteen is 3D generation: the Score Distillation Sampling that powers DreamFusion / Magic3D, the image-conditioned generators (Zero-1-to-3, Stable Zero123, Wonder3D, SV3D), the fast feed-forward meshes and splats (LRM, TripoSR, InstantMesh, LGM), and the latent 3D diffusion lineage. Section sixteen is 3D human body: the parametric SMPL / SMPL-X / SMPL-H bodies, image-based regression (HMR, HMR 2.0, PIXIE, 4D-Humans), and hand / face / avatar reconstruction.
Section seventeen is efficient deployment: model compression, neural-field quantisation, streaming and mobile rendering, ONNX / TensorRT / WebGPU export, and the LiDAR / stereo / SLAM engineering that turns research models into drivable autonomy stacks and on-device AR. The closing section is the operational picture: dataset catalogues (ScanNet, NYU-v2, KITTI, nuScenes, Waymo Open, Matterport3D, Objaverse, CO3D, DTU), the tooling ecosystem (PyTorch3D, Kaolin, Open3D, MinkowskiEngine, SpConv, COLMAP, gsplat, tiny-cuda-nn), labelling, and how 3D plugs back into the rest of Part VII — video for 4D reconstruction, VLMs for 3D grounding — and into the robotics and autonomy parts of the compendium.
The world we see is three-dimensional; cameras record it as a 2D projection, and almost every practical vision task eventually has to invert that projection. Robots need to know how far away obstacles are before they can plan a path. Self-driving cars need to localise themselves in a metric map and predict the three-dimensional trajectories of pedestrians and vehicles. AR headsets need to place a virtual object on a real tabletop so convincingly that it does not drift when the user moves their head. Content creators want to turn a handful of phone photos into a walkable scene. Medical imaging reconstructs volumes from 2D slices. All of these are 3D vision: the body of techniques that recovers the geometry of the world — depth, surfaces, camera poses, object shapes, full scenes — from 2D image evidence.
The task landscape is broad. Depth estimation asks for a per-pixel distance map from a single image (monocular) or a stereo pair. Camera calibration and pose estimation recover intrinsic and extrinsic camera parameters. Structure from Motion (SfM) takes an unordered photo collection and returns a sparse 3D point cloud plus every camera's pose. Multi-view stereo (MVS) densifies that into a point cloud or mesh. 3D object detection predicts oriented 3D boxes around cars, pedestrians, and cyclists from LiDAR, camera, or both. 3D semantic segmentation labels every point or voxel of a scene. SLAM fuses localisation and mapping into a single real-time loop. Neural fields (NeRF) and Gaussian splatting represent scenes as learned functions or primitives, enabling photorealistic novel-view synthesis. 3D generation turns text prompts or single images into 3D assets. Human body reconstruction (SMPL, SMPL-X) recovers articulated meshes from photos and video.
The applications drive the choices. Autonomous driving is dominated by LiDAR-first perception stacks (Waymo, Cruise) and camera-first stacks (Tesla) racing to close the gap; the benchmarks (KITTI, nuScenes, Waymo Open, Argoverse) exist because industry needs them. Robotics — manipulation, navigation, bin picking — depends on depth cameras (RealSense, Kinect), SLAM, and increasingly neural scene representations. AR/VR (ARKit, ARCore, Meta Quest, Apple Vision Pro) ships visual-inertial SLAM to every phone and headset. Digital twins for architecture, construction, and industrial inspection reconstruct facilities from drone flights and 360° scans. Visual effects and content creation use photogrammetry (RealityCapture, Metashape), and increasingly NeRF and Gaussian splatting, to turn shoots into editable 3D scenes. E-commerce wants 3D product models at scale, which is why generative 3D (LRM, TripoSR) is getting so much investment.
The field has moved through four broad eras. The pre-deep-learning era (through 2014) was classical geometry: SIFT features, RANSAC, epipolar constraints, bundle adjustment, and SfM/MVS pipelines (COLMAP, Bundler, PMVS) that are still the backbone of most modern neural methods. The first deep era (2015–2019) replaced hand-crafted components with learned depth networks (Eigen et al., DORN, MiDaS) and point-cloud networks (PointNet, PointNet++) that learned directly on 3D data. The NeRF era (2020–2022) reframed 3D reconstruction as learning a continuous neural function of space, kicking off a rapid explosion of photorealistic novel-view methods and neural SLAM. The Gaussian-splatting era (2023–) replaced neural fields with explicit millions-of-Gaussians primitives, recovering real-time rendering and kicking off a second wave of generative and 4D methods.
This chapter follows the arc from the geometric foundations that every method still uses, through the 2D-to-3D perception stack (depth, point clouds, 3D detection and segmentation), through the neural-field and Gaussian-splatting revolutions, to the application stacks (SLAM, 3D generation, human body) and the deployment engineering that matters in practice.
An image is a regular 2D grid of pixels and admits exactly one natural representation. A 3D scene does not. Point clouds, voxel grids, polygon meshes, implicit signed-distance and occupancy fields, neural radiance fields, and Gaussian splats are all in active use, each with strengths and weaknesses. Choosing the right representation — or the right conversion between representations — is half of the design work in a 3D system.
A point cloud is an unordered set of 3D points {(x, y, z)}, often with extra per-point features: colour, normal, intensity, semantic label, feature vector. It is the native output of LiDAR and structured-light sensors, and the natural format for sparse geometry. Point clouds have two challenging properties: they are unordered (a network must be permutation-invariant, which is why PointNet's symmetric max-pool was such a breakthrough) and irregular (there is no natural grid, so standard convolutions do not apply). They scale linearly with the number of points; a KITTI LiDAR sweep has ~100 000 points, a high-density indoor scan can have 10M+.
A voxel grid is a regular 3D grid of cells, each storing occupancy or a feature vector — the direct 3D analogue of an image. Voxels admit 3D convolutions and fit standard CNN frameworks, but their memory cost is O(N³), which makes dense voxel grids impractical beyond about 256³ resolution. The practical workaround is sparse voxels: only store non-empty cells using a hash table or octree, which gets you from O(N³) to roughly O(N²) memory for surface data. MinkowskiEngine and SpConv are the dominant sparse-convolutional libraries and have made voxel networks practical at real scanning resolutions (2–5 cm cells for rooms, 10–20 cm cells for streets).
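The sparse-voxel idea is, at its core, just bookkeeping over occupied cells. A minimal numpy sketch (the `sparse_voxelize` helper is illustrative, not a library API; MinkowskiEngine and SpConv add hashing, kernel maps, and GPU kernels on top of exactly this structure):

```python
import numpy as np

def sparse_voxelize(points, cell=0.05):
    """Quantise an (N, 3) point cloud into its set of occupied voxel cells.

    Returns the unique integer cell coordinates plus, for every point, the
    index of its cell -- the bookkeeping a sparse-conv library keeps in a
    hash table instead of a dense O(N^3) grid.
    """
    ijk = np.floor(points / cell).astype(np.int64)           # (N, 3) cell coords
    cells, inverse = np.unique(ijk, axis=0, return_inverse=True)
    return cells, inverse.reshape(-1)

# points on a thin slab occupy far fewer 5 cm cells than there are points
pts = np.random.default_rng(0).random((100_000, 3)) * np.array([10.0, 10.0, 0.1])
cells, inv = sparse_voxelize(pts)
```

Only the occupied cells are stored, which is why memory scales with the surface area of the scene rather than its bounding volume.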
A polygon mesh is a graph of vertices and faces (usually triangles) defining a surface. Meshes are the native format for computer graphics and the preferred output for engineering, 3D printing, and game-asset pipelines. They represent only the surface, not the interior or the unobserved volume, which makes them efficient for rendering but awkward for volumetric reasoning. Meshes are hard to learn directly — the topology (which vertices connect to which) cannot easily be made differentiable — so most learned pipelines produce volumetric outputs and convert to meshes as a post-processing step via marching cubes or Poisson reconstruction.
A neural radiance field (NeRF) generalises the implicit-field idea (geometry represented as a learned continuous function of space, as in signed-distance and occupancy networks) to appearance as well as geometry. The network takes a 3D position and a viewing direction and returns a density σ and an RGB colour; rendering is done by classical volumetric integration along rays. NeRF is an implicit, continuous representation — infinitely resolvable and compactly parameterised by a small MLP — but evaluating it is slow (hundreds of MLP queries per pixel), which kicked off a research direction aimed at speeding it up through hybrid explicit/implicit structures (Instant-NGP's hash grids, Plenoxels' sparse voxels, TensoRF's low-rank decomposition).
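The volumetric integration reduces to a simple per-ray quadrature. A numpy sketch of just the compositing step, with the ray sampling and the MLP omitted (densities and colours are given directly):

```python
import numpy as np

def composite_ray(sigma, rgb, deltas):
    """Quadrature of the volume-rendering integral along one ray.

    sigma:  (S,)   densities at S samples, ordered front to back
    rgb:    (S, 3) colours at those samples
    deltas: (S,)   spacing between adjacent samples
    """
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # transmittance T_i
    weights = trans * alpha                                        # contribution of sample i
    return (weights[:, None] * rgb).sum(axis=0), weights

# a dense (nearly opaque) red sample early on the ray dominates the pixel
sigma = np.array([0.0, 50.0, 0.0, 0.0])
rgb = np.array([[0.0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
color, weights = composite_ray(sigma, rgb, np.full(4, 0.1))
```

The weights sum to at most one; whatever is left over is the background contribution. Training simply backpropagates a photometric loss on `color` through this compositing into the network that produced `sigma` and `rgb`.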
3D Gaussian splatting (3DGS, 2023) is the most recent and most consequential representation. The scene is a cloud of millions of anisotropic 3D Gaussians, each with position, covariance (encoded as rotation + scale), opacity, and view-dependent colour (spherical harmonics). Rendering is a differentiable rasterisation of these Gaussians to screen-space ellipses, which is 100–1000× faster than volumetric NeRF evaluation. 3DGS combines the interpretability and editability of explicit primitives with the rendering quality of neural fields, and has largely displaced NeRF as the default radiance representation for static scenes.
A practical pipeline often uses several representations in sequence. An SfM front-end produces a sparse point cloud; MVS or a depth network densifies it into a dense point cloud; Poisson reconstruction extracts a mesh; or, increasingly, the point cloud initialises a set of 3D Gaussians that are optimised end-to-end. Autonomous-driving perception typically converts LiDAR into a voxel grid or BEV (bird's-eye-view) grid for detection, while keeping the raw points for segmentation. Choosing between representations is less about finding the one right answer and more about matching the representation to the operation — convolutions like grids, rendering likes splats, surface reasoning likes meshes, sparse sensors produce clouds — and converting when convenient.
Every 3D vision method depends on a mathematical model of the camera: how a 3D point in the world becomes a 2D pixel in an image. The pinhole camera model is the standard abstraction, and its algebra — intrinsics, extrinsics, homogeneous coordinates, and projective geometry — is the language in which all reconstruction is written. A system that gets the camera model wrong will produce sharp-looking but metrically unreliable output.
The pinhole model imagines light from a scene point passing through an idealised single point (the optical centre) and landing on an image plane on the other side. A 3D point X = (X, Y, Z) expressed in the camera frame projects to a pixel (u, v) according to u = fx·X/Z + cx, v = fy·Y/Z + cy, where (fx, fy) are the focal lengths in pixels and (cx, cy) is the principal point (the pixel where the optical axis meets the sensor). Writing this in homogeneous coordinates gives the compact matrix form λ·(u, v, 1)T = K · X, where K is the 3 × 3 intrinsic matrix.
A point in world coordinates reaches the camera frame via a rigid transformation: Xcam = R · Xworld + t, where R is a 3 × 3 rotation matrix and t a 3-vector translation. Together (R, t) form the 3 × 4 extrinsic matrix [R | t]. The full world-to-pixel map is then λ·(u, v, 1)T = K · [R | t] · (X, Y, Z, 1)T, or λ·(u, v, 1)T = P · X, where P is the 3 × 4 camera projection matrix. Every SfM, SLAM, or multi-view method eventually reduces to estimating these matrices for every image in a dataset.
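The full world-to-pixel map is a few lines of numpy. The `project` helper below is illustrative (distortion is ignored); it implements exactly λ·(u, v, 1)T = K · [R | t] · X:

```python
import numpy as np

def project(X_world, K, R, t):
    """Map (N, 3) world points to pixels: lambda*(u, v, 1)^T = K [R|t] X."""
    X_cam = X_world @ R.T + t          # rigid world-to-camera transform
    uv = X_cam @ K.T                   # apply intrinsics in homogeneous coords
    return uv[:, :2] / uv[:, 2:3]      # perspective divide by lambda = Z

K = np.array([[800.0, 0, 320],         # fx, cx
              [0, 800.0, 240],         # fy, cy
              [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)          # camera at the world origin
pts = np.array([[0.0, 0.0, 2.0],       # on the optical axis
                [0.5, 0.0, 2.0]])
uv = project(pts, K, R, t)
# the on-axis point lands on the principal point (320, 240);
# the second lands at (800*0.5/2 + 320, 240) = (520, 240)
```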
Homogeneous coordinates are the algebraic trick that makes this all work. A 3D point (X, Y, Z) becomes the 4-vector (X, Y, Z, 1); a 2D pixel (u, v) becomes (u, v, 1); projective transformations are linear in homogeneous coordinates (where in Euclidean coordinates they are not), so composition is matrix multiplication. Points at infinity have the last coordinate zero; lines are represented by their coefficient vectors; conics by 3 × 3 symmetric matrices. This is projective geometry, codified for computer vision in Hartley & Zisserman's textbook, and it is the standard vocabulary of multi-view geometry.
Camera calibration is the procedure that estimates K, the distortion coefficients, and sometimes extrinsics. The classical method is Zhang's method (2000): show the camera a known planar target (checkerboard, ChArUco board) from several poses, detect the corner points, and solve for K and distortion by minimising reprojection error. OpenCV's calibrateCamera has been the practical standard for 25 years. For specialised scenes (endoscopes, catadioptric systems, omnidirectional cameras), there are corresponding tools. Modern learned approaches — deep single-image calibration — estimate K from a single uncalibrated image and work well enough for downstream SfM priors, though not for metric applications.
Camera pose — the 6-DoF (R, t) of a camera relative to a world or object frame — is computed from 2D-3D correspondences by Perspective-n-Point (PnP) solvers. Given n ≥ 3 points with known 3D coordinates and their pixel observations, PnP solves for the pose that minimises reprojection error. The P3P minimal solver (three correspondences) gives up to four geometric solutions; EPnP scales to n correspondences in O(n); RANSAC wraps any solver with outlier rejection. PnP inside a RANSAC loop is how essentially every visual localisation, SLAM relocalisation, and structure-refinement step works.
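As a concrete illustration of pose-from-correspondences, here is a linear DLT resection in numpy. It estimates the full 3 × 4 projection matrix from n ≥ 6 correspondences via SVD, so it is a simpler (and noise-sensitive) cousin of the P3P/EPnP minimal solvers named above, not a replacement for them; `dlt_resection` and the synthetic setup are this sketch's own:

```python
import numpy as np

def dlt_resection(X, uv):
    """Estimate the 3x4 projection matrix P from n >= 6 2D-3D
    correspondences by the Direct Linear Transform.
    X: (n, 3) world points; uv: (n, 2) observed pixels."""
    n = len(X)
    Xh = np.hstack([X, np.ones((n, 1))])       # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    for i in range(n):                         # two linear constraints per point
        A[2 * i, 0:4] = Xh[i]
        A[2 * i, 8:12] = -uv[i, 0] * Xh[i]
        A[2 * i + 1, 4:8] = Xh[i]
        A[2 * i + 1, 8:12] = -uv[i, 1] * Xh[i]
    _, _, Vt = np.linalg.svd(A)                # null vector = flattened P
    return Vt[-1].reshape(3, 4)                # defined up to scale

# synthetic check: project known points with a known P, then recover it
rng = np.random.default_rng(0)
K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
P_true = K @ np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.2]])])
X = np.column_stack([rng.uniform(-1, 1, 10),
                     rng.uniform(-1, 1, 10),
                     rng.uniform(2, 5, 10)])
uvh = np.hstack([X, np.ones((10, 1))]) @ P_true.T
uv = uvh[:, :2] / uvh[:, 2:3]
P_est = dlt_resection(X, uv)
```

In production the linear estimate would seed a nonlinear reprojection-error refinement, wrapped in RANSAC against outlier matches.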
Beyond the pinhole, specialist models exist: orthographic cameras (no perspective; useful for distant scenes), weak perspective (a scaled orthographic approximation common in faces and bodies), rolling-shutter models that account for CMOS sensors reading rows sequentially, event cameras that fire on brightness change rather than at a fixed frame rate, and light-field cameras that record rays rather than projections. For most applications, a calibrated pinhole with Brown-Conrady distortion and rolling-shutter compensation for video is sufficient.
A single image is not enough to recover depth, but two images very nearly are. The geometric constraints that relate pairs, triples, and larger collections of images are tight enough that, given two calibrated views of a rigid scene, the relative camera pose and the 3D structure of matched points are recoverable up to a single overall scale. Multi-view geometry is the algebra that makes this precise, and it is the theoretical core that SfM, stereo matching, SLAM, and all their descendants are built on.
The fundamental object is the epipolar constraint. Given two views of a rigid scene and a point x1 observed in image 1, its corresponding point x2 in image 2 must lie on a line — the epipolar line — determined by the geometry of the two cameras. The epipolar constraint reduces the search for correspondences from 2D (anywhere in the image) to 1D (along the epipolar line), and it is the reason stereo matching is tractable.
Algebraically, for uncalibrated cameras, the epipolar constraint is x2T · F · x1 = 0, where F is the 3 × 3 fundamental matrix. F has seven degrees of freedom (nine entries, one scale ambiguity, one rank-2 constraint) and can be estimated from a minimum of seven point correspondences (the seven-point algorithm) or more robustly from eight using Longuet-Higgins's eight-point algorithm, typically with RANSAC for outlier rejection. For calibrated cameras with known intrinsics K, the analogous object is the essential matrix E = K2T · F · K1, which has five degrees of freedom and encodes pure rotation and translation (up to scale). Nistér's five-point algorithm is the gold-standard minimal solver.
From E (or F with known K) the camera pose decomposition recovers rotation R and translation t up to a sign and a scale. There are four candidate solutions; the physically correct one is selected by requiring that triangulated points lie in front of both cameras. The translation direction is recovered but not its magnitude — the celebrated scale ambiguity of two-view reconstruction — which is why monocular SLAM has a free scale that must be anchored externally (IMU, known object sizes, loop closure with metric scale).
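The four-candidate decomposition can be checked numerically. A numpy sketch, assuming a unit-norm baseline; the cheirality test that selects the physically correct candidate is omitted here:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:                   # keep proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
    R1, R2, t = U @ W @ Vt, U @ W.T @ Vt, U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# build E = [t]_x R from a known relative pose and check it is recovered
R_true = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])   # 90 degrees about z
t_true = np.array([1.0, 0.0, 0.0])                        # unit baseline
candidates = decompose_essential(skew(t_true) @ R_true)
```

Note that the translation comes back as a unit vector: the magnitude is the scale ambiguity discussed above, and no amount of algebra on E can recover it.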
Stereo rectification is the preprocessing step that makes dense stereo matching efficient. Given a pair of images and their relative pose, rectification warps both images so that corresponding epipolar lines become horizontal scanlines in the rectified output. After rectification, matching reduces to searching for each pixel's partner on the same row of the other image — a 1D problem along the x-axis — and the depth is a simple function of the horizontal disparity d = x1 − x2: Z = f·B/d, where f is the focal length and B is the stereo baseline (the distance between the two camera centres).
The classical dense-stereo pipeline has three steps: (1) matching cost at every pixel for every candidate disparity (SAD, SSD, NCC, Census, Hamming over binary-encoded patches); (2) aggregation with a cost-volume filter (semi-global matching, SGM) that enforces piecewise smoothness; (3) disparity refinement with sub-pixel interpolation. SGM (Hirschmüller, 2005) dominated stereo benchmarks until the deep-learning era took over; it is still the default on embedded hardware because it runs in real time with minimal memory.
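A toy version of step (1) plus winner-take-all, with no SGM aggregation or sub-pixel refinement, fits in a few lines. This sketch assumes scipy's `uniform_filter` as the cost-aggregation window, and the focal length and baseline at the end are made-up values for the Z = f·B/d conversion:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_match(left, right, max_disp=16, radius=3):
    """SAD matching cost over all candidate disparities, box-filtered over
    a (2r+1)^2 window, then a winner-take-all argmin per pixel."""
    H, W = left.shape
    cost = np.full((max_disp, H, W), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :W - d])      # shifted abs difference
        cost[d, :, d:] = uniform_filter(diff, size=2 * radius + 1)
    return cost.argmin(axis=0)                             # integer disparity map

# synthetic rectified pair: the right image is the left shifted by a known disparity
rng = np.random.default_rng(1)
left = rng.random((60, 80))
true_d = 5
right = np.roll(left, -true_d, axis=1)                     # content moves left by true_d
disp = block_match(left, right)

f, B = 700.0, 0.12          # assumed focal length (px) and baseline (m)
depth = f * B / np.maximum(disp, 1e-6)                     # Z = f*B/d
```

Real SGM adds the smoothness-penalised path aggregation between the cost volume and the argmin, which is what suppresses the noise this naive version produces at textureless pixels.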
Beyond two views, the trifocal tensor relates point and line correspondences across three views, and quadrifocal and higher-order constraints exist in principle but are rarely used in practice. Multi-view reconstruction at scale uses the pairwise fundamental-matrix + PnP + bundle-adjustment pipeline that the next section covers. The theoretical foundations here are dense — Hartley & Zisserman's Multiple View Geometry is the canonical 600-page reference — but the handful of ideas above (epipolar lines, F/E matrices, triangulation, rectification, disparity-to-depth) are what almost every production pipeline actually uses.
Structure from Motion (SfM) takes an unordered or ordered collection of images of a static scene and returns two things: a sparse 3D point cloud of feature points, and the 6-DoF pose of every camera. Multi-view stereo (MVS) takes those posed cameras and densifies the point cloud into a dense surface. Together, SfM+MVS is the classical photogrammetry pipeline, and it is still the first step of almost every NeRF and Gaussian-splatting project. COLMAP, the open-source incumbent, serves as the camera-registration front-end of most modern 3D systems.
The classical SfM pipeline has five stages. First, feature extraction detects and describes distinctive keypoints in every image. SIFT (Lowe, 2004) remains the workhorse for photogrammetry because of its rotation and scale invariance; ORB (Rublee et al., 2011) is faster and binary-descriptor-based, preferred in SLAM; SuperPoint (DeTone et al., 2018) is the learned drop-in replacement, and DISK, ALIKED, and XFeat continue to improve the accuracy-speed frontier. Second, feature matching finds correspondences between image pairs — brute-force nearest-neighbour on descriptors for small collections, vocabulary trees or NetVLAD-style image retrieval for large ones — filtered by Lowe's ratio test and the fundamental-matrix RANSAC from the previous section.
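Lowe's ratio test is the small but crucial filter in the matching stage. A numpy sketch on synthetic descriptors (a real pipeline matches SIFT or SuperPoint descriptors, usually with approximate nearest-neighbour search; `ratio_test_match` is this sketch's own helper):

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.8):
    """Brute-force nearest neighbour with Lowe's ratio test: keep a match
    only if the best candidate is clearly better than the second best.
    desc1: (N, D), desc2: (M, D); returns (num_kept, 2) index pairs."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)  # (N, M)
    order = np.argsort(d, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    keep = d[rows, best] < ratio * d[rows, second]
    return np.column_stack([rows[keep], best[keep]])

# descriptors of true matches are small perturbations of each other;
# the distractors are unrelated random vectors
rng = np.random.default_rng(2)
base = rng.random((50, 128))
desc2 = np.vstack([base + 0.01 * rng.random((50, 128)),
                   rng.random((200, 128))])
matches = ratio_test_match(base, desc2)
```

The surviving matches then go through the fundamental-matrix RANSAC from the previous section, which removes the geometrically inconsistent remainder.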
Third, incremental reconstruction: pick a well-conditioned initial image pair, triangulate an initial set of 3D points, then iteratively register new images via PnP, triangulate new points, and refine. Fourth, bundle adjustment — the Levenberg-Marquardt minimisation of total reprojection error over all cameras and all 3D points simultaneously — is applied intermittently and as a global final step. Bundle adjustment is computationally expensive (millions of variables for a city scan) but essential for accuracy; Ceres Solver is the standard implementation. Fifth, post-processing removes outlier points, merges split tracks, and exports to downstream formats.
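The joint minimisation at the heart of bundle adjustment can be sketched with scipy's `least_squares` on a toy two-camera problem. Rotations are held fixed and the first camera anchors the gauge, so this illustrates the residual structure, not a production bundle adjuster (which would parameterise rotations, exploit the sparse Jacobian, and use robust losses):

```python
import numpy as np
from scipy.optimize import least_squares

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
rng = np.random.default_rng(3)
X_true = np.column_stack([rng.uniform(-1, 1, 20),
                          rng.uniform(-1, 1, 20),
                          rng.uniform(4, 6, 20)])
t2_true = np.array([1.0, 0.0, 0.0])          # second camera, one unit to the right

def proj(X, t):
    """Pinhole projection with identity rotation (fixed in this toy)."""
    x = (X + t) @ K.T
    return x[:, :2] / x[:, 2:3]

obs = [proj(X_true, np.zeros(3)), proj(X_true, t2_true)]   # noiseless observations

def residuals(params):
    """Reprojection errors over all cameras and all points, stacked."""
    t2, X = params[:3], params[3:].reshape(-1, 3)
    r1 = proj(X, np.zeros(3)) - obs[0]       # camera 1 fixed at the origin (gauge)
    r2 = proj(X, t2) - obs[1]
    return np.concatenate([r1.ravel(), r2.ravel()])

# start from perturbed camera and points, refine jointly
x0 = np.concatenate([t2_true + 0.1,
                     (X_true + 0.05 * rng.standard_normal(X_true.shape)).ravel()])
sol = least_squares(residuals, x0)           # drives reprojection error to ~0
```

The real problem differs mainly in scale: cameras and points number in the thousands to millions, which is why Ceres's sparse Schur-complement solvers matter.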
COLMAP (Schönberger & Frahm, 2016) is the dominant open-source SfM system. It pairs SIFT features with vocabulary-tree matching, incremental reconstruction with a careful next-view selection heuristic, and rigorous bundle adjustment. It handles datasets from a few dozen to several hundred thousand images, exports dense MVS via its own PatchMatch implementation, and produces the camera poses and sparse point clouds that almost every NeRF and 3DGS codebase expects as input. Its predecessors — Bundler (Snavely et al., 2006) and VisualSFM (Wu, 2013) — pioneered many of the techniques COLMAP now uses.
Multi-view stereo (MVS) takes posed cameras and densifies the geometry. The classical approach is PatchMatch MVS (Bleyer, Rhemann & Rother, 2011; Schönberger et al., 2016): for each pixel, hypothesise a depth and normal, score by photometric consistency across the other views, and iteratively propagate good hypotheses to neighbouring pixels. PMVS (Furukawa & Ponce, 2010) produces oriented patches and grows them. Deep-learning MVS arrived with MVSNet (Yao et al., 2018), which built a differentiable cost volume on a reference view's plane sweep; CasMVSNet, PatchmatchNet, and IterMVS followed, with transformer-based TransMVSNet and generalising foundation models like DUSt3R (2024) pushing the state of the art further.
DUSt3R (Wang et al., 2024) deserves special mention as a paradigm shift in the SfM/MVS stack: it directly regresses pointmaps — a pixel-aligned 3D-coordinate image — from unposed image pairs using a ViT backbone, outputting both the 3D structure and the implicit relative pose in one forward pass. It sidesteps feature matching, RANSAC, and bundle adjustment entirely, and handles wide baselines and low-texture scenes where classical pipelines fail. Its extensions — MASt3R (with matching head), Spann3R (streaming), Fast3R — are rapidly being absorbed into SLAM and scene-reconstruction systems.
For production use, the commercial photogrammetry stack — Agisoft Metashape, RealityCapture, Reality Scan, Pix4D — wraps similar algorithms with better UIs, GPU acceleration, and turnkey mesh/texture export. These are what architects, VFX artists, and surveyors actually run. The NeRF and Gaussian-splatting toolchains (Nerfstudio, gsplat, Polycam) typically invoke COLMAP internally to recover camera poses before training the neural scene, though the DUSt3R lineage is beginning to replace even that step.
Depth estimation is the task of predicting a per-pixel distance map — how far each pixel's corresponding world point is from the camera. It has two major flavours: monocular depth (one image in, depth out, which is fundamentally ambiguous in scale) and stereo depth (a calibrated pair of images, which resolves the scale via the baseline). Deep networks dominate both. Monocular depth has become one of the most useful vision foundation models of the 2020s because every downstream 3D task benefits from a good prior on scene geometry.
Monocular depth estimation is ill-posed: the mapping from a 2D image to a 3D world is one-to-many, and the global scale can only be recovered from learned priors. A 2D image of a miniature model looks identical, up to scale, to an image of the real thing. Human observers resolve the ambiguity through cues — perspective, shading, texture, known object sizes, the interposition of objects — and a deep network learns to do the same from large datasets. The practical consequence is that monocular predictions are typically relative depth (up to an unknown affine) unless the network is specifically trained for metric depth with calibrated data.
The modern era of monocular depth started with Eigen et al. (2014), which predicted log-depth from a CNN trained on NYU-v2; the output was metric but only on the dataset it was trained on. DORN (Fu et al., 2018) framed depth as an ordinal classification problem over discretised depth bins and won the KITTI benchmark; AdaBins (Bhat et al., 2021) made the bin boundaries adaptive per image. These methods were accurate on their training domain but did not generalise.
MiDaS (Ranftl et al., 2019; v3 in 2022) was the first depth network to target zero-shot generalisation. The recipe was training on a mixture of ten diverse depth datasets with heterogeneous ground truth — some RGB-D, some laser-scanned, some SfM, some stereo — using a scale-and-shift-invariant inverse-depth loss that allowed each dataset to contribute without imposing its idiosyncratic scale. MiDaS v3 with a DPT (Dense Prediction Transformer) backbone became a workhorse for NeRF initialisation, image editing, and generic scene understanding. ZoeDepth (2023) extended MiDaS to metric depth by adding per-domain calibration heads.
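The scale-and-shift alignment at the heart of that loss is a small closed-form least-squares problem. A numpy sketch of the alignment step only, without the trimming and gradient-matching terms the full MiDaS loss adds (`ssi_align` is this sketch's own name):

```python
import numpy as np

def ssi_align(pred, gt):
    """Least-squares scale s and shift b mapping a relative prediction
    onto ground truth, both expressed in inverse depth."""
    A = np.column_stack([pred, np.ones_like(pred)])
    (s, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + b

gt = 1.0 / np.linspace(1.0, 10.0, 100)     # ground-truth inverse depth
pred = 3.0 * gt + 0.5                      # prediction off by an unknown affine map
aligned = ssi_align(pred, gt)              # affine error removed before the loss
```

Because the alignment is solved per image before the error is measured, a dataset whose depth is only defined up to an affine transform (e.g. SfM output) still provides a useful training signal.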
Depth Anything (Yang et al., 2024) pushed zero-shot monocular depth to a new level by scaling training data with pseudo-labels: train a MiDaS-style teacher on labelled data, use it to pseudo-label 62M unlabelled images, then train a student on the combined labelled + pseudo-labelled corpus. The v2 version (2024) refined the recipe with higher-quality synthetic data and a better teacher. Depth Anything is now the default open-source monocular depth backbone; it is the initialisation for Marigold, the depth prior for many NeRF and Gaussian-splatting pipelines, and the input for robot manipulation systems that need a quick geometry read.
Stereo depth is easier because the geometry resolves the scale. The modern stereo-matching stack is dominated by RAFT-Stereo (Lipson et al., 2021), which adapted the RAFT optical-flow architecture to the stereo setting: a shared 3D cost volume of all disparity hypotheses, a recurrent GRU-based refiner, and a multi-level feature pyramid. IGEV-Stereo (Xu et al., 2023) replaced the implicit cost volume with an iterative geometric encoding of pairwise features; CREStereo (Li et al., 2022) used cascaded recurrent refinement at multiple scales. These methods reach sub-pixel disparity accuracy on KITTI and Middlebury with reasonable inference cost.
The evaluation landscape has its own metrics. For monocular depth, the standard reports are AbsRel (mean |d_pred − d_gt| / d_gt), δ < 1.25 (fraction of pixels with max(d_pred/d_gt, d_gt/d_pred) < 1.25), and RMSE. For stereo, EPE (endpoint error on disparity) and D1 / bad 3.0 (fraction of pixels with disparity error > 3 px and > 5% of the true disparity) are the KITTI standards. Benchmarks include NYU-v2 (indoor), KITTI (driving), DIODE and iBims-1 (unified indoor/outdoor), ETH3D (cross-domain stereo), and DDAD (autonomous dense depth).
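The monocular metrics are a few lines of numpy; `depth_metrics` is an illustrative helper (real evaluations also mask invalid pixels and cap the depth range):

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, the delta < 1.25 accuracy, and RMSE for metric depth maps."""
    pred, gt = pred.ravel(), gt.ravel()
    absrel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)        # symmetric over/under-estimate
    delta1 = float(np.mean(ratio < 1.25))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return absrel, delta1, rmse

gt = np.linspace(1.0, 10.0, 1000)
pred = 1.1 * gt                                     # uniformly 10% too deep
absrel, delta1, rmse = depth_metrics(pred, gt)
# a uniform 10% error gives AbsRel = 0.1 and a perfect delta < 1.25 score
```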
Applications of monocular depth are everywhere: portrait mode on phones (synthesising a bokeh from estimated depth), NeRF initialisation (depth priors regularise ill-conditioned scenes), robot manipulation (fast depth read without a dedicated sensor), virtual camera moves in image editing, 3D photo effects, and pseudo-labelling for downstream 3D perception. Depth Anything v2 can be run in real time on a laptop GPU, and "good enough" depth is now a solved problem for most consumer use cases.
Point clouds — unordered sets of 3D points from LiDAR, RGB-D, or SfM — are the most common 3D data format in robotics and autonomy. Learning on them was an unsolved problem before 2017 because standard convolutions require a grid, and point clouds do not have one. The PointNet paper broke the dam by showing that a permutation-invariant MLP plus a symmetric pooling function was sufficient to process point clouds directly, and the family of PointNet, PointNet++, DGCNN, KPConv, and Point Transformer has grown to cover every 3D task.
PointNet (Qi et al., 2017) is conceptually simple: embed each point with a shared MLP (3 → 64 → 128 → 1024), aggregate across all points with a symmetric pooling operation (max-pool is the key choice), and feed the resulting global feature vector into a classification or segmentation head. The shared MLP processes every point identically; the symmetric max-pool guarantees permutation invariance; the combination provably approximates any continuous set function. PointNet classified ModelNet40 shapes at 89% top-1 and segmented ShapeNet parts competitively, all while processing unordered clouds natively.
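The permutation-invariance argument can be demonstrated in a few lines. A toy sketch with random weights; the real network adds batch norm and learned T-Net alignment transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_pointnet(points, W1, W2):
    """Toy PointNet global feature: a shared per-point MLP followed by a
    symmetric max-pool. Illustrative only, not the full 3->64->128->1024 stack."""
    h = np.maximum(points @ W1, 0)   # shared MLP layer 1 (ReLU)
    h = np.maximum(h @ W2, 0)        # shared MLP layer 2
    return h.max(axis=0)             # symmetric pooling over the point set

W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 32))
cloud = rng.normal(size=(128, 3))
f1 = tiny_pointnet(cloud, W1, W2)
f2 = tiny_pointnet(cloud[rng.permutation(128)], W1, W2)  # same cloud, shuffled
# f1 and f2 are identical: the max-pool erases the point ordering
```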
PointNet++ (Qi et al., 2017, same year) added the hierarchical structure PointNet lacked. Inspired by 2D CNN stacks, it alternated sampling (farthest-point sampling to reduce resolution), grouping (ball-query or k-NN to gather each sample's neighbourhood), and PointNet-within-group (apply a small PointNet to each neighbourhood). Multiple such stages built a hierarchy of features at increasing spatial scales, and skip connections fed the encoder features back into a point-wise segmentation decoder. PointNet++ became the default backbone for indoor-scene semantic segmentation and part segmentation through 2019–2020.
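The sampling stage is plain greedy farthest-point sampling. A minimal sketch (illustrative; production pipelines use a CUDA implementation of the same O(k·N) loop):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS as in PointNet++'s sampling stage: repeatedly pick the
    point farthest from the set already chosen."""
    points = np.asarray(points, float)
    chosen = [0]                                  # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                # farthest from current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

pts = np.array([[0, 0, 0], [0.1, 0, 0], [5, 0, 0], [0, 5, 0]])
idx = farthest_point_sampling(pts, 3)   # skips the near-duplicate at index 1
```

FPS is preferred over random sampling precisely because of this behaviour: it spreads samples over the whole shape rather than over the density of the scan.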
Dynamic Graph CNN (DGCNN, Wang et al., 2018) introduced the EdgeConv operator: at each layer, build a k-nearest-neighbour graph in feature space, compute features on each (point, neighbour) edge, and aggregate with max-pool. The graph is recomputed at every layer using the latent features, which lets the receptive field change based on semantic similarity rather than just spatial proximity. DGCNN improved part segmentation on ShapeNet and was the first serious graph-based point network.
KPConv (Thomas et al., 2019) introduced kernel-point convolution: instead of a grid-based kernel, the convolution places a small set of learnable 3D kernel points, and the response at each input point is a weighted sum of features at nearby points, with weights determined by distance to the kernel points. This gave point-cloud CNNs a genuine analogue of the 2D convolution operator — weight sharing, local receptive field, translation equivariance — without needing a voxel grid. KPConv set the state of the art on S3DIS, SemanticKITTI, and ScanNet for semantic segmentation and remains a strong baseline.
Point Transformer (Zhao et al., 2021) replaced EdgeConv-style aggregation with a vector attention mechanism: for each point, compute attention weights over its neighbours and aggregate values, all in feature space. The attention is vector-wise (per-channel) rather than scalar, which substantially improved expressive power. The paper set the state of the art on ScanNet, S3DIS, and ShapeNetPart. Point Transformer V2 (Wu et al., 2022) added grouped vector attention and partition-based pooling; Point Transformer V3 (2024) simplified the recipe radically, replacing neighbour search with serialisation through a space-filling curve and using standard transformer blocks over the serialised tokens. PTv3 became the new top of the leaderboard on indoor segmentation and is the current reference.
PointMLP (Ma et al., 2022), PointNeXt (Qian et al., 2022), and PointMAE / Point-BERT (self-supervised masked point modelling) have all continued to push point networks forward. Self-supervised pretraining on large-scale point clouds (Objaverse, indoor scans) mirrors the masked-image-modelling revolution in 2D vision, and the pretrained features transfer across downstream tasks. The parallel track, sparse-convolutional networks, covers the same task space and often the same benchmarks; the two families compete and cross-pollinate continuously.
Voxelising 3D data turns it into a regular grid where 3D convolutions are well-defined. The catch is memory: a 256³ dense voxel grid has 16M cells, and a 512³ grid has 134M — too expensive to process with a typical GPU. The workaround is sparse convolution: store only non-empty cells, and define convolution operators that respect the sparsity pattern. This idea made volumetric learning practical at production scale and is the backbone of essentially every competitive LiDAR-based autonomous-driving perception stack.
The naïve dense alternative is 3D U-Net (Çiçek et al., 2016), a volumetric analogue of the 2D U-Net with 3D convolutions, pooling, and skip connections. Dense 3D U-Nets work for small volumes (medical image segmentation on 256³ CT scans is the canonical use case) and have the advantage of using off-the-shelf 3D convolution kernels. For large scenes they do not scale; the memory cost is simply too high. OctNet (Riegler et al., 2017) was an early attempt to fix this with an octree-based sparse representation, and O-CNN (Wang et al., 2017) followed with more efficient octree convolutions, but the octree data structure is difficult to work with efficiently on modern hardware.
The breakthrough was sparse convolution with hash-table-backed storage. SparseConvNet (Graham, 2014) laid the theoretical foundation, and MinkowskiEngine (Choy, Gwak & Savarese, 2019) delivered the first widely adopted implementation. The idea: only store voxels where input points land, and define convolutions that produce output voxels only within a local neighbourhood of input voxels (submanifold sparse convolution) or with an expanding receptive field (sparse convolution). Both operations are implemented with coordinate hash tables and a kernel-map precomputation that identifies which input voxels contribute to each output voxel. The resulting networks can operate on grids of arbitrary size because they are O(active voxels), not O(grid size).
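The mechanics fit in a few lines. A toy submanifold sparse convolution with a Python dict standing in for the coordinate hash table — illustrative only; MinkowskiEngine's API and kernel-map precomputation look nothing like this:

```python
import numpy as np

def submanifold_sparse_conv(coords, feats, weights):
    """Toy submanifold sparse 3D convolution: outputs exist only at active
    (input) voxel sites, and each site gathers features from active
    neighbours found via a coordinate hash map. `weights` maps each 3x3x3
    kernel offset to a (Cin, Cout) matrix."""
    table = {tuple(c): i for i, c in enumerate(coords)}   # coordinate hash table
    out = np.zeros((len(coords), weights[(0, 0, 0)].shape[1]))
    for i, c in enumerate(coords):
        for offset, W in weights.items():                 # kernel-map lookup
            j = table.get(tuple(np.add(c, offset)))
            if j is not None:                             # skip empty cells
                out[i] += feats[j] @ W
    return out

rng = np.random.default_rng(0)
offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
weights = {o: rng.normal(size=(4, 8)) for o in offsets}
coords = np.array([[0, 0, 0], [1, 0, 0], [10, 10, 10]])   # only 3 active voxels
feats = rng.normal(size=(3, 4))
out = submanifold_sparse_conv(coords, feats, weights)
```

Note that the cost is O(active voxels × kernel size) regardless of the bounding grid: the isolated voxel at (10, 10, 10) costs the same as one in a dense cluster, which is exactly why these networks scale to arbitrarily large scenes.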
MinkowskiEngine (2019) became the standard library and the backbone of many state-of-the-art indoor and outdoor 3D perception systems. MinkUNet — a 3D U-Net built from MinkowskiEngine submanifold and sparse convolutions — set records on SemanticKITTI, ScanNet, and S3DIS for semantic segmentation, and its features transferred well across datasets. The library supports generalised sparse convolution in arbitrary dimensions, which has been useful for 4D (space-time) networks and higher.
The choice of voxel size is the key hyperparameter. For indoor scans, 2–5 cm voxels match the sensor resolution and capture fine geometry; for outdoor driving LiDAR at 30–50 m range, 10–20 cm voxels are standard; for long-range highway scenarios (150 m+), voxel sizes expand to 0.4 m or more. Finer voxels mean more voxels and more compute; coarser voxels lose geometric fidelity. Most systems use a multi-resolution U-Net-style backbone that combines features across voxel sizes.
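The voxelisation step itself is a quantise-and-aggregate operation. A minimal sketch, averaging point positions per voxel (real pipelines aggregate learned per-point features the same way):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantise points into voxel indices and average the points that land
    in each voxel. `voxel_size` is the key hyperparameter."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.ravel()                     # keep 1-D across numpy versions
    centroids = np.zeros((len(uniq), 3))
    np.add.at(centroids, inverse, points)         # scatter-add points per voxel
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    return uniq, centroids / counts

pts = np.array([[0.01, 0.02, 0.0], [0.03, 0.01, 0.0], [0.30, 0.0, 0.0]])
voxels, cents = voxelize(pts, voxel_size=0.05)    # 5 cm voxels: first two points merge
```

Halving the voxel size roughly octuples the number of candidate cells, which is the compute/fidelity trade-off the paragraph above describes.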
A related approach is the bird's-eye-view (BEV) representation: project the 3D scene onto a 2D BEV grid by collapsing or pillarising the vertical axis, then apply 2D convolutions in BEV space. PointPillars (Lang et al., 2019) was the canonical example: divide the ground plane into 2D pillars, aggregate each pillar's points with a mini-PointNet, then run a standard 2D CNN on the pillar feature map. PointPillars ran at 60+ Hz on a desktop GPU and dominated the KITTI leaderboard for several years because of its speed-accuracy balance. BEV is especially attractive for autonomous driving because objects of interest (cars, pedestrians) lie on the ground plane, and BEV is naturally compatible with top-down map representations.
Recent work blurs these boundaries. CenterPoint (Yin et al., 2021) uses a sparse 3D voxel backbone with a 2D BEV head. VoxelNeXt (Chen et al., 2023) removed the BEV head entirely and detects directly on sparse voxel features. OctFormer and PointTransformer V3 import transformer attention into sparse voxel representations. The split between "point-based" and "voxel-based" is no longer as sharp as it once was: most systems mix the two, using a voxel grid as the computational backbone and per-point features for fine-grained prediction.
3D object detection is the task of predicting a set of oriented 3D bounding boxes — each with position, dimensions, heading, and class — from sensor data, typically LiDAR point clouds, camera images, or both. It is the foundational perception task of autonomous driving, and the benchmarks (KITTI, nuScenes, Waymo Open, Argoverse 2) set much of the research agenda for 3D vision. The methods split by sensor modality: LiDAR-based, camera-based, and multi-modal fusion.
The first generation of LiDAR detectors adapted 2D techniques to sparse 3D data. VoxelNet (Zhou & Tuzel, 2018) voxelised the point cloud, applied a mini-PointNet per voxel to produce a feature grid, and ran a region-proposal network over the grid. It was the first end-to-end trainable LiDAR detector and established the voxel-feature + RPN template. SECOND (Yan, Mao & Li, 2018) replaced VoxelNet's dense 3D convolutions with sparse convolution, giving an order-of-magnitude speedup and launching the SpConv library.
PointPillars (Lang et al., 2019) collapsed the voxel grid to pillars (z-axis fully collapsed), ran a 2D CNN instead of 3D, and became the first LiDAR detector to reach 60+ Hz at competitive accuracy on KITTI. It remains a production workhorse on embedded hardware because the 2D convolutions are trivially deployable. PointRCNN (Shi, Wang & Li, 2019) took a point-based approach: run PointNet++ directly on raw points, generate proposals per-point, and refine them in a second stage — a 3D analogue of Faster R-CNN.
CenterPoint (Yin, Zhou & Krähenbühl, 2021) made the biggest single improvement on the autonomous-driving benchmarks: it treated 3D detection as centre-point prediction in BEV space (a 3D analogue of CenterNet), with a heatmap of object centres plus per-object regression of dimensions and heading. The anchor-free formulation simplified the loss, and CenterPoint led the nuScenes detection and tracking leaderboards for two years. It is still the default single-frame detector in most nuScenes-first benchmarks.
Camera-based 3D detection has historically been harder because a single image does not directly give depth. The Lift-Splat-Shoot (Philion & Fidler, 2020) paradigm made it tractable: each pixel predicts a depth distribution, the pixel features are "lifted" into 3D by splatting them at candidate depths, then "splatted" onto a BEV grid where a 2D detection head operates. BEVFormer (Li et al., 2022) used a transformer with learned BEV queries that attended to multi-camera features; PETR (Liu et al., 2022) inserted 3D positional encodings directly into ViT features and attended from object queries. BEVDet, SOLOFusion, StreamPETR, SparseBEV, and Far3D iteratively pushed camera-only 3D detection on nuScenes to within striking distance of LiDAR-only systems.
Multi-modal fusion combines camera and LiDAR. Early-stage approaches decorate LiDAR points with projected image features (PointPainting, PointAugmenting); mid-stage approaches fuse in a shared BEV space (BEVFusion, Liu et al., 2022; Liang et al., 2022); late-stage approaches combine per-modality detection heads (FUTR3D). BEVFusion is the current de facto baseline: it runs a LiDAR backbone to get a BEV feature map, runs a camera backbone to get per-camera features, lifts the camera features into the same BEV grid, concatenates, and detects in the fused BEV space. TransFusion (Bai et al., 2022) pioneered query-based cross-modal attention and remains a strong detection-only alternative.
The benchmarks drive specialisations. KITTI (2012) has ~7500 train samples with forward-facing LiDAR and one stereo camera pair; it is small and well-curated but saturated. nuScenes (2019) scaled up to 1000 scenes × 20 s each, six cameras, one 32-beam LiDAR, five radars; it is the current standard for multi-modal research. Waymo Open (2019+) is the largest fully-annotated driving dataset with five cameras and five LiDARs; it is the platform of choice for industrial-scale experiments. Argoverse 2 (2022) adds map-aware evaluation and long-tail classes. The Occupancy Network challenge (CVPR 2023) shifted attention from per-object detection toward dense volumetric occupancy prediction, and OccNet, TPVFormer, SurroundOcc, and OccFormer are the current open-source leaders.
3D segmentation generalises the 2D segmentation triplet (semantic / instance / panoptic) to point clouds, voxels, and meshes. The task is to assign a label — a semantic class, an instance ID, or both — to every point or voxel of a 3D scene. Indoor benchmarks (ScanNet, S3DIS) and outdoor LiDAR benchmarks (SemanticKITTI, nuScenes, Waymo Open) drive the research. The architectural recipes are mostly 3D analogues of their 2D counterparts, with the sparse-convolutional and point-transformer backbones doing most of the work.
Semantic segmentation on point clouds was the first task to be tackled successfully, and it is roughly where 2D semantic segmentation was in 2017: strong baselines exist, the benchmarks are well-understood, and progress is measured in 1–2 mIoU points per year. The standard backbones are MinkUNet, KPConv, Point Transformer, and most recently PTv3. For LiDAR, range-image-based methods like RangeNet++ (Milioto et al., 2019), SalsaNext, and Cylinder3D (Zhu et al., 2021) project the point cloud into a 2D range or cylindrical image where standard 2D convolutions apply; Cylinder3D's cylindrical partition is especially well-matched to rotating LiDAR's scan pattern.
Instance segmentation on 3D is harder because the ambiguity about "what is one instance" is already nontrivial in 2D and gets worse in 3D. The classical recipe is to detect a 3D bounding box first, then predict a mask within it (the 3D analogue of Mask R-CNN); 3D-BoNet (Yang et al., 2019) and SoftGroup (Vu et al., 2022) follow this pattern. More recent approaches do proposal-free instance segmentation by clustering point features: PointGroup (Jiang et al., 2020) clusters on learned embedding offsets; HAIS (Chen et al., 2021) does hierarchical aggregation; Mask3D (Schult et al., 2023) imported the 2D Mask2Former design — object queries attend to sparse-voxel features and predict 3D masks — and became the new top of the ScanNet leaderboard.
OneFormer3D (Kolodiazhnyi et al., 2024) unified semantic, instance, and panoptic segmentation in a single Mask2Former-style architecture, following the 2D OneFormer recipe. A task token conditions the decoder on which segmentation flavour to output, and the same backbone and decoder serve all three. OneFormer3D sits at or near the top of most 3D segmentation benchmarks and signals the same convergence that has already happened in 2D.
Data augmentation in 3D has its own vocabulary. PolarMix (Xiao et al., 2022) cut a pie-slice of one LiDAR scan and pasted it into another, rotating about the sensor axis — a 3D analogue of CutMix. CutMix-for-3D, LaserMix, and scene-level object insertion are now standard. The SemanticKITTI benchmark in particular has been driven forward by augmentation as much as by architecture; its test-set saturation argument is a running theme in the segmentation literature.
Evaluation uses the standard mean intersection-over-union (mIoU) for semantic segmentation, 3D mask AP for instance segmentation, and 3D panoptic quality (PQ) for panoptic. The 3D mask-AP protocol is a direct generalisation of 2D COCO AP with mask IoU computed over 3D points. Standard benchmarks are ScanNet200 (200 classes, long-tail distribution, current mainstream indoor), S3DIS (indoor, six buildings, cross-area evaluation), SemanticKITTI (outdoor LiDAR), nuScenes (outdoor LiDAR, more diverse scenes), and the Occ3D / Occ-nuScenes occupancy benchmarks for volumetric prediction.
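The semantic metric is worth stating precisely, since per-class IoU and its mean are where most benchmark confusion arises. A minimal sketch via a confusion matrix:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU from a confusion matrix: per-class IoU = TP / (TP + FP + FN),
    averaged over classes present in the ground truth."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt, pred), 1)                 # rows: gt class, cols: prediction
    tp = np.diag(cm)
    denom = cm.sum(0) + cm.sum(1) - tp           # TP + FP + FN per class
    present = cm.sum(1) > 0
    return np.mean(tp[present] / denom[present])

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
miou = mean_iou(pred, gt, num_classes=2)         # IoU = [1/2, 2/3], mean = 7/12
```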
Meshes are the end format for most 3D content — games, VFX, 3D printing, engineering — but they are awkward to learn directly because their discrete connectivity (which vertices form which triangles) is not differentiable in any useful way. The standard approach is to learn a volumetric, implicit, or point-based intermediate representation, then extract a mesh as a post-processing step. Marching cubes, screened Poisson reconstruction, and their learned variants are the workhorses.
Marching cubes (Lorensen & Cline, 1987) is the classical algorithm for extracting a mesh from a dense voxel grid of signed-distance or occupancy values. It iterates over every voxel, looks up the pattern of inside/outside corners in a 256-entry lookup table, and outputs the triangles that interpolate the implicit surface through the voxel. Marching cubes is embarrassingly parallel, hardware-efficient, and the standard converter from volumetric learned representations (Occupancy Networks, DeepSDF, NeRF density) to meshes. Its one drawback — sharp features are rounded off by the trilinear interpolation — is addressed by Dual Contouring (Ju et al., 2002) and the recent learned variant Flexicubes (Shen et al., 2023).
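The core of the algorithm is the per-voxel case index. A sketch of just that step — the 256-entry triangle table that the index addresses is omitted:

```python
import numpy as np

def corner_code(sdf_corners):
    """The 8-bit case index marching cubes computes per voxel: one bit per
    corner, set when the SDF is negative (inside the surface). The code
    indexes the 256-entry triangle lookup table."""
    bits = (np.asarray(sdf_corners) < 0).astype(int)
    return int(sum(b << i for i, b in enumerate(bits)))

# A voxel with exactly one inside corner (corner 0) is case 1: one triangle
# clipping that corner; all-inside (255) and all-outside (0) emit nothing.
code = corner_code([-0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.9, 0.8])
```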
Screened Poisson reconstruction (Kazhdan & Hoppe, 2013) reconstructs a watertight surface from an oriented point cloud by solving a Poisson equation — find a scalar field whose gradient best matches the input normals and whose level set approximates the surface. Poisson reconstruction is the default MVS-to-mesh converter in most photogrammetry pipelines, including COLMAP's and Agisoft Metashape's. It is robust to noisy point clouds, produces watertight meshes, and has few hyperparameters; its weakness is a tendency to smooth over thin structures and fine details that are under-sampled in the point cloud.
Learned mesh reconstruction has a long history. AtlasNet (Groueix et al., 2018) represented a shape as a union of learned parametric surface patches, each a 2D-to-3D MLP deforming a unit square; the shape was literally an "atlas" of charts. Pixel2Mesh (Wang et al., 2018) started from a coarse ellipsoid mesh and iteratively deformed and subdivided it to match an input image, using graph convolutions on the mesh vertices. Mesh R-CNN (Gkioxari, Malik & Johnson, 2019) paired 2D object detection with a mesh-head that first predicted a coarse voxel shape and then refined it as a mesh. These methods produced meshes directly but were limited to objects with simple topologies.
Occupancy Networks (Mescheder et al., 2019) and DeepSDF (Park et al., 2019) moved to the implicit-function paradigm: learn a continuous function that decides inside / outside (or signed distance) at any 3D query point, then extract the mesh with marching cubes at inference. These methods handle arbitrary topology and have become the dominant neural 3D-shape representation, with extensions to convolutional feature grids (ConvONet), global latent codes (IM-NET), and unsigned distance functions for open surfaces (NDF, 3PSDF).
NeuS (Wang et al., 2021) and VolSDF (Yariv et al., 2021) unified NeRF-style volumetric rendering with implicit-SDF representations, so that one could optimise a signed-distance function directly from multi-view images using volumetric rendering. The result was a mesh-extractable representation that combined NeRF's photometric accuracy with a usable surface. NeuralAngelo (Li et al., 2023) and Bakedangelo pushed surface quality further, and are standard for high-fidelity mesh reconstruction from multi-view photographs — the default pipeline when the downstream consumer is a 3D artist or a CAD tool rather than a novel-view renderer.
The most recent wave is mesh generation by autoregressive transformers. PolyGen (Nash et al., 2020) autoregressively generated vertices then faces. MeshGPT (Siddiqui et al., 2024), MeshXL, and MeshAnything (2024) scaled this up to large transformer models trained on tokenised meshes from Objaverse, producing structured meshes with proper topology rather than the marching-cubes-extracted surfaces that typical implicit methods produce. This is a significant step because art-ready meshes need clean topology — which implicit methods cannot provide — and these generative transformers suggest a path toward automated 3D content production for the game and VFX industries.
Neural radiance fields (NeRF), introduced by Mildenhall et al. in 2020, redefined what "3D reconstruction" meant. A scene is represented as a single small MLP mapping a 3D position and viewing direction to an emitted radiance and a volume density; given a few dozen posed photographs, the MLP is optimised to reproduce them, and at inference can be rendered from any novel viewpoint with photorealistic quality. The paper went from arXiv preprint to having over 4000 follow-up papers in under three years. It is one of the fastest-moving research areas in the history of computer vision.
The NeRF formulation is clean. For a ray r(t) = o + t·d through a pixel, sample K points along the ray between near and far bounds. At each sample, query the MLP Fθ(x, d) = (σ, c) to get density and colour. The rendered pixel is a weighted sum: C(r) = Σ Ti·(1 − exp(−σi·δi))·ci, where Ti = exp(−Σj<i σj·δj) is the accumulated transmittance and δi = ti+1 − ti. This is a direct discretisation of the volume-rendering equation. The loss is pixelwise MSE between rendered and observed images; optimisation is gradient descent on the MLP parameters.
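The rendering equation above is a few lines of numpy. A sketch for a single ray, with densities and colours hand-picked to show a near-opaque surface occluding a farther one:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Discrete volume rendering along one ray:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    deltas = np.diff(ts)                          # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)       # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])  # T_i
    weights = trans * alphas
    return weights @ colors, weights

sigmas = np.array([0.0, 5.0, 50.0])               # empty, dense, very dense
colors = np.array([[0, 0, 0], [1.0, 0, 0], [0, 0, 1.0]])
ts = np.array([0.0, 1.0, 2.0, 3.0])
pixel, w = render_ray(sigmas, colors, ts)         # mostly red: sample 2 is occluded
```

The same weights are commonly reused for depth (Σ wi·ti) and accumulated-opacity maps, which is how depth supervision and sky masking plug into the pipeline.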
The two key engineering ideas in the original paper were positional encoding — map each input coordinate x through γ(x) = (sin(2⁰πx), cos(2⁰πx), sin(2¹πx), ..., sin(2^(L−1)πx), cos(2^(L−1)πx)) so the MLP can represent high-frequency detail — and hierarchical coarse-to-fine sampling, where a coarse MLP proposes densities that are used to importance-sample the fine MLP's ray samples. Both tricks were essential for the original's photometric quality, and both have survived in one form or another through every follow-up.
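The encoding is a deterministic feature map, not a learned one. A minimal sketch:

```python
import numpy as np

def positional_encoding(x, L=10):
    """NeRF's frequency encoding gamma(x): for each input coordinate, emit
    sin(2^k * pi * x) and cos(2^k * pi * x) for k = 0..L-1."""
    x = np.atleast_1d(np.asarray(x, float))
    freqs = 2.0 ** np.arange(L) * np.pi           # 2^k * pi
    angles = np.outer(x, freqs)                   # (dims, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

g = positional_encoding([0.5, -0.25], L=4)        # 2 coords x 2 fns x 4 freqs = 16 values
```

Without this map, a plain MLP fits only the low-frequency content of the scene; with it, the same MLP resolves fine texture — the spectral-bias observation that made the original paper work.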
NeRF's weakness was speed. Training took 1–2 days on a single GPU per scene; rendering an 800 × 600 image took 30 seconds. The follow-up race was primarily about speed. Instant-NGP (Müller et al., 2022) replaced the coordinate MLP with a multi-resolution hash grid: a pyramid of feature tables addressed by a spatial hash of the input coordinate, followed by a tiny MLP. The hash grid encodes most of the capacity explicitly in a learnable table rather than in the MLP weights, allowing training to converge in ~5 seconds per scene on a modern GPU and real-time rendering at up to 60 fps. Instant-NGP was the single biggest performance jump in the NeRF era, and it remains a de facto baseline.
Plenoxels (Yu et al., 2021) went further: remove the MLP entirely, store spherical-harmonic coefficients and a density at every cell of a sparse voxel grid, and optimise the grid values directly with a simple reconstruction loss. Training took 10 minutes and matched NeRF quality. TensoRF (Chen et al., 2022) used a low-rank decomposition of the volumetric field (vectors × matrices) to compress Plenoxels further. DVGO, K-Planes, and related approaches explored similar explicit-grid ideas. The explicit/implicit hybrids were uniformly much faster than pure MLPs and often matched or exceeded NeRF's quality.
The NeRF literature branches rapidly from here: NeRF-W handles in-the-wild photo collections with illumination variation; BARF and NoPe-NeRF jointly optimise camera poses with the radiance field; Dynamic NeRF / D-NeRF / NSFF add time; PixelNeRF and IBRNet condition on image features for single-image generalisation; NeRF in the Dark (RawNeRF) works in raw linear space for HDR; DietNeRF trains with few-view inputs using CLIP consistency; Semantic-NeRF and NeSF add semantic outputs. The Nerfstudio framework (Tancik et al., 2023) became the Python-ecosystem standard for the whole NeRF zoo, shipping with Instant-NGP-style backbones, Mip-NeRF 360-style outdoor handling, and tooling for iterative capture and deployment.
NeRF's practical impact has been significant: novel-view synthesis for real-estate listings, VFX plate generation, reality capture for industrial digital twins, virtual production, and asset creation. Its main limitations are (a) slow training and rendering compared to explicit geometry (partially solved), (b) the need for accurate input camera poses (usually provided by COLMAP), (c) the difficulty of editing the resulting representation, and (d) the lack of surface primitives that downstream 3D tools can consume. All four of these are part of what made Gaussian splatting — the next section — such an immediate hit.
3D Gaussian Splatting (3DGS), introduced by Kerbl et al. at SIGGRAPH 2023, replaced NeRF as the default radiance-field representation within eighteen months of its publication. A scene is represented as millions of explicit 3D Gaussians — each with position, covariance, opacity, and spherical-harmonic colour — rendered via a differentiable point-based rasteriser. Training takes minutes rather than hours, rendering is real-time (100+ fps), and quality matches or exceeds the best NeRFs on static-scene benchmarks. The technique spawned a wave of follow-up work that now covers almost every NeRF sub-topic.
The 3DGS formulation replaces volumetric MLP sampling with a cloud of anisotropic Gaussians. Each Gaussian has: a 3D mean μ, a 3 × 3 covariance Σ (factored as rotation × scaling for stable optimisation), an opacity α, and a set of spherical-harmonic coefficients (typically SH-degree 3, 48 coefficients) for view-dependent colour. To render, the Gaussians are projected to screen space where they become 2D Gaussians; the 2D screen-space ellipses are depth-sorted, alpha-blended front-to-back via tile-based rasterisation, and the result is a photorealistic image. The whole pipeline is differentiable: gradients flow back through the rasteriser to the Gaussian parameters.
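The per-pixel compositing inside the tile rasteriser is ordinary front-to-back alpha blending. A sketch for one pixel — the projection of 3D covariances to screen-space ellipses is omitted, and `alphas` is taken as each Gaussian's opacity already modulated by its 2D footprint at this pixel:

```python
import numpy as np

def composite_pixel(depths, alphas, colors):
    """Front-to-back alpha blending of depth-sorted splats at one pixel:
    C = sum_i T_i * alpha_i * c_i with T_i = prod_{j<i} (1 - alpha_j)."""
    order = np.argsort(depths)                    # nearest splat first
    T, pixel = 1.0, np.zeros(3)
    for i in order:
        pixel += T * alphas[i] * colors[i]
        T *= 1.0 - alphas[i]
        if T < 1e-4:                              # early termination, as in 3DGS
            break
    return pixel

pix = composite_pixel(np.array([2.0, 1.0]),       # blue splat behind a red one
                      np.array([0.5, 0.8]),
                      np.array([[0, 0, 1.0], [1.0, 0, 0]]))
```

The early-termination test is one of the reasons the rasteriser is fast: once a pixel is saturated, the remaining depth-sorted splats are skipped entirely.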
Training a 3DGS scene starts from a sparse SfM point cloud (COLMAP output). Each initial point becomes a Gaussian; during optimisation, the algorithm periodically densifies regions that need more detail (splitting under-reconstructed Gaussians into two) and prunes Gaussians whose opacity drops below a threshold. The optimisation runs for ~30k iterations (~30 minutes on a single GPU) and produces a scene representation of typically 1–5M Gaussians occupying 200MB–1GB on disk.
The four practical advantages over NeRF are striking. First, rendering speed: 100–300 fps at 1080p on a consumer GPU vs. seconds per frame for NeRF, with no further engineering needed. Second, training time: minutes instead of hours. Third, editability: because the scene is a cloud of explicit primitives, you can translate, rotate, and manipulate individual Gaussians, drop new ones in, or composite scenes; this is much harder with a neural field. Fourth, quality: on most benchmarks (Mip-NeRF 360, Tanks-and-Temples, Deep Blending), 3DGS matches or beats the best NeRF variants.
The ecosystem grew equally fast. gsplat (Nerfstudio team, 2024) is the leading open-source differentiable-rasteriser library with support for most 3DGS extensions. The original Inria reference code (graphdeco) remains the gold-standard reproduction. PlayCanvas, Luma AI, Polycam, and Postshot ship consumer-facing 3DGS tools. Browsers support 3DGS rendering via WebGL and WebGPU, which has been important for sharing reconstructions — a 500MB Gaussian scene loads into a browser and runs in real time on a laptop.
The limitations of 3DGS mirror those of NeRF. Training still requires accurate camera poses (typically COLMAP); scene editing is easier than NeRF but still not as clean as a mesh; sharp surfaces and material properties are approximated rather than modelled; view-dependent effects near reflective surfaces can be brittle. Research has addressed many of these: 2DGS (Huang et al., 2024) replaces 3D Gaussians with 2D Gaussian disks that better align to surfaces; SuGaR and Gaussian Surfels bridge to meshes; GaussianShader and 3DGS-DR bring explicit BRDFs.
The broader picture is that Gaussian splatting brought radiance-field research into a regime of speed, quality, and editability that was unreachable with pure neural fields. It did so without inventing new mathematics — the volumetric rendering equations are essentially unchanged — but by moving from implicit evaluation to explicit primitives with a carefully engineered rasteriser. The lesson, echoing the history of graphics generally, is that the right primitives plus the right rendering algorithm outperform arbitrarily deep networks for representing geometry. NeRF's neural MLPs are not gone — they are still used inside many 3DGS extensions — but the primary representation is now an explicit one.
SLAM — Simultaneous Localisation and Mapping — is the problem of building a map of an unknown environment while simultaneously tracking the sensor's pose within it, in real time and online. It is what every autonomous robot, AR headset, drone, and self-driving car does at the perception level, and it is one of the oldest problems in robotics. The classical probabilistic formulation (Smith & Cheeseman, 1986) has been joined over the past decade by learning-based and neural-field variants, and the current state of the art is a hybrid that uses learned components inside a classical optimisation backbone.
The SLAM pipeline traditionally has five elements: front-end tracking that estimates the current frame's pose from local observations, map maintenance that stores geometry and feature data over time, loop closure that detects when the sensor has returned to a previously visited place, global optimisation (bundle adjustment or pose-graph optimisation) that reduces accumulated drift, and relocalisation that handles kidnapped-robot-style tracking loss. Getting all five working together at 30+ Hz is the engineering achievement that distinguishes SLAM from offline SfM.
Feature-based visual SLAM uses sparse keypoints (ORB, FAST+BRIEF, SIFT) tracked frame-to-frame. The classical reference system is ORB-SLAM (Mur-Artal et al., 2015), superseded by ORB-SLAM2 (2017) with stereo and RGB-D support and ORB-SLAM3 (Campos et al., 2021) with visual-inertial fusion, multi-map handling, and fisheye support. ORB-SLAM3 is still the open-source reference for sparse feature-based SLAM in 2025, running comfortably on a CPU and handling room-scale to building-scale trajectories. PTAM (Klein & Murray, 2007) was the historical predecessor that first split tracking and mapping into two parallel threads.
Dense visual SLAM maps every pixel rather than sparse features. DTAM (Newcombe et al., 2011), KinectFusion (Newcombe et al., 2011), BundleFusion (Dai et al., 2017), and ElasticFusion (Whelan et al., 2016) build volumetric or surfel-based maps from RGB-D input. Direct visual SLAM methods like LSD-SLAM (Engel et al., 2014) and DSO (Engel et al., 2016) skip feature extraction and directly minimise photometric error on high-gradient pixels — more accurate in textureless scenes but more sensitive to illumination change.
Visual-inertial SLAM fuses cameras with IMU measurements to resolve the monocular scale ambiguity and improve robustness during fast motion. OKVIS, VINS-Mono / VINS-Fusion, ROVIO, and Kimera are the open-source references; ORB-SLAM3's visual-inertial mode is the most widely used in practice. All mobile phone AR (ARKit, ARCore) is a proprietary VI-SLAM system, and the Apple Vision Pro SLAM stack is similarly VI-SLAM with specialised hardware.
Neural-field SLAM replaces the map with a NeRF or Gaussian-splatting scene representation. iMAP (Sucar et al., 2021) was the first: optimise a single MLP on the fly as the sensor moves, sampling new rays per frame. NICE-SLAM (Zhu et al., 2022) used a multi-resolution feature grid instead of a monolithic MLP for scalability. NeRF-SLAM and Orbeez-SLAM combined classical tracking with NeRF mapping. The Gaussian-splatting SLAM generation — SplaTAM (Keetha et al., 2024), Gaussian-SLAM, MonoGS, Photo-SLAM — swapped the neural field for a Gaussian cloud, getting real-time performance and photorealistic reconstruction in one system. As of 2025, Gaussian-based SLAM is the liveliest research direction and is approaching the classical systems in tracking accuracy while vastly exceeding them in map quality.
LiDAR SLAM follows a separate evolution. The classical references are LOAM (Zhang & Singh, 2014), LeGO-LOAM, LIO-SAM (LiDAR-inertial fusion), and FAST-LIO2 (Xu et al., 2022) — the current best-in-class for LiDAR-IMU SLAM with tight coupling. These run comfortably on a CPU and produce centimetre-accurate trajectories over kilometres. The autonomy industry pairs LiDAR SLAM with prior HD maps for production use; research LiDAR SLAM is increasingly combined with learned place recognition (MinkLoc3D, OverlapTransformer) for robust loop closure.
Evaluation uses Absolute Trajectory Error (ATE) — RMSE of the aligned trajectory against ground truth — and Relative Pose Error (RPE) on short intervals. Standard benchmarks are TUM-RGBD (indoor RGB-D), EuRoC (drones), KITTI and KITTI-360 (driving), TartanAir (synthetic photorealistic for challenging scenes), Replica (indoor scanned), and ScanNet. For SLAM systems with map quality claims, the map is also evaluated on rendering PSNR or on 3D reconstruction accuracy against the scanned ground truth.
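ATE is simple enough to sketch directly. The numpy function below rigidly aligns the estimated trajectory to ground truth (Horn/Umeyama via SVD; the similarity-scale term used for monocular evaluation is omitted for brevity) and reports the RMSE of the remaining position differences.

```python
import numpy as np

def ate_rmse(gt, est):
    """Absolute Trajectory Error: rigidly align `est` (N, 3) to `gt` (N, 3)
    with the SVD-based Horn/Umeyama solution (rotation + translation, no
    scale), then return the RMSE of the residual positions."""
    mu_g, mu_e = gt.mean(0), est.mean(0)
    G, E = gt - mu_g, est - mu_e
    U, _, Vt = np.linalg.svd(E.T @ G)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = (U @ S @ Vt).T                                       # maps est -> gt frame
    aligned = (est - mu_e) @ R.T + mu_g
    return float(np.sqrt(np.mean(np.sum((gt - aligned) ** 2, axis=1))))

# A trajectory differing from ground truth only by a rigid motion has zero
# ATE after alignment -- the metric measures drift, not choice of frame.
t = np.linspace(0, 4 * np.pi, 200)
gt = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
est = gt @ Rz.T + np.array([1.0, -2.0, 0.5])
assert ate_rmse(gt, est) < 1e-9
```

RPE is computed the same way but on relative transforms over short windows, which isolates local drift from accumulated global error.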
3D generation — producing novel 3D assets from text prompts, single images, or nothing at all — was a niche research topic until DreamFusion (Poole et al., 2022) showed that a pretrained 2D diffusion model could be used to distil 3D content without any 3D training data. Since then the area has split into optimisation-based methods (slow but high quality) that distil 2D priors via Score Distillation Sampling, feed-forward methods (fast but lower quality) that regress 3D representations from images in a single network pass, and direct 3D diffusion that trains diffusion models in a 3D latent space. By 2025 all three flavours produce usable outputs.
The breakthrough was Score Distillation Sampling (SDS). The setup: initialise a 3D representation (NeRF, mesh, Gaussians), render it from a random viewpoint, pass the rendered image through a pretrained text-to-image diffusion model (Stable Diffusion, Imagen), and use the diffusion model's noise-prediction error as a gradient signal to update the 3D representation. The gradient is a biased but effective "score" telling the 3D representation how to look more like samples from the 2D model conditioned on a text prompt. After ~10 000 iterations of random-viewpoint rendering plus SDS updates, the 3D object looks plausibly like the prompt from any direction.
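The update can be made concrete with a toy stand-in. Here the "3D representation" is collapsed to a single image and the pretrained diffusion model is replaced by a hand-written noise predictor that favours a known target — neither corresponds to any real model, and the constant weighting is a simplification. Only the shape of the SDS step is faithful: forward-noise the render at a random timestep, query the denoiser, and descend on eps_hat − eps.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((8, 8))   # stand-in for "what the 2D prior wants to see"
theta = np.zeros((8, 8))      # the "3D representation" (here: one image)

def fake_denoiser(x_noisy, alpha_bar):
    """Stand-in for a pretrained diffusion model's noise prediction. A real
    SDS loop calls Stable Diffusion here; this toy predicts the noise that
    explains x_noisy as a noised version of `target`."""
    return (x_noisy - np.sqrt(alpha_bar) * target) / np.sqrt(1 - alpha_bar)

for step in range(500):
    alpha_bar = rng.uniform(0.1, 0.9)                  # random diffusion timestep
    eps = rng.standard_normal(theta.shape)             # forward-noise the "render"
    x_noisy = np.sqrt(alpha_bar) * theta + np.sqrt(1 - alpha_bar) * eps
    eps_hat = fake_denoiser(x_noisy, alpha_bar)
    grad = eps_hat - eps                               # SDS gradient with w(t) = 1
    theta -= 0.05 * grad                               # step the representation

# theta has been distilled toward the prior's preference
```

In the real algorithm the gradient is backpropagated through the differentiable renderer into NeRF or Gaussian parameters, and the denoiser's U-Net Jacobian is deliberately dropped — that omission is what makes SDS cheap and also what makes its gradient biased.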
DreamFusion (Poole et al., 2022) instantiated SDS with a NeRF. Magic3D (Lin et al., 2023) added a coarse-to-fine NeRF-to-mesh progression for faster convergence and better geometry. SJC (Wang et al., 2023) rederived an SDS-like update by chaining score Jacobians through a differentiable renderer. ProlificDreamer (Wang et al., 2023) replaced SDS with Variational Score Distillation (VSD), which optimises a distribution over NeRFs rather than a single sample, producing substantially sharper and less-saturated outputs. GaussianDreamer (Yi et al., 2024) and DreamGaussian (Tang et al., 2024) swapped the NeRF for 3D Gaussians, speeding up optimisation from hours to minutes.
The second family is image-conditioned generation. Zero-1-to-3 (Liu et al., 2023) fine-tuned Stable Diffusion to be a camera-pose-conditioned novel-view synthesiser: given one image of an object and a target viewpoint, generate a rendered image of the same object from that viewpoint. Once you have a novel-view model, you can generate a consistent set of views around the object and fit a 3D representation to them. Stable Zero123, SyncDreamer (Liu et al., 2024), Wonder3D, and Zero123++ refined the approach with multi-view joint attention to enforce cross-view consistency. SV3D (2024) extended the same idea to video diffusion models for smoother viewpoint trajectories.
The third family is native 3D diffusion: train a diffusion model directly on 3D representations without the 2D-prior detour. Point-E (Nichol et al., 2022) diffused point clouds directly. Shap-E (Jun & Nichol, 2023) diffused an implicit MLP parameterisation. GenS, 3DGen, LDM3D, Mesh-Diffusion, CLAY (2024), and others have pushed the latent-3D-diffusion idea through variants. The advantage is 3D consistency by construction; the disadvantage is that 3D training data (Objaverse, Objaverse-XL, ShapeNet, 3D-Future) is smaller and less diverse than 2D image data, so these models typically train on tens of millions of assets rather than billions of images. Most 2025 production pipelines use a hybrid: a 2D multi-view diffusion front-end for diversity, a feed-forward LRM or Gaussian model for the actual 3D reconstruction.
Evaluation is unsettled. CLIP similarity between rendered views and the text prompt is the standard text-to-3D metric, but it is gameable and does not evaluate 3D consistency or quality. PSNR / LPIPS on held-out viewpoints is used for image-to-3D. Human preference studies are common in papers. Benchmarks include GSO (Google Scanned Objects, image-to-3D), T³Bench (text-to-3D), CO3D (multi-view object benchmark), OmniObject3D, and Objaverse-LVIS for retrieval. The field is moving fast enough that benchmarks saturate within months.
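Of the metrics above, PSNR is the one worth having exact: it is just log-scaled mean squared error against the held-out ground-truth view.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between a rendered view and the
    held-out reference image. Higher is better; identical images -> inf."""
    mse = np.mean((rendered.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.full((32, 32, 3), 0.5)
noisy = np.clip(ref + 0.01 * np.random.default_rng(0).standard_normal(ref.shape), 0, 1)
val = psnr(noisy, ref)   # roughly 40 dB for ~1% Gaussian pixel noise
```

The weakness PSNR shares with CLIP similarity is that it is per-view: a reconstruction can score well on every held-out image while being geometrically inconsistent between them, which is why LPIPS and human studies remain in the mix.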
The applications driving investment are clear: game asset production (a single-prompt 3D pipeline replaces weeks of 3D modelling), e-commerce (product → 3D for web/AR), VFX pre-visualisation, architecture and design, and 3D printing. Commercial services — Tripo AI, Meshy, Rodin, Luma Genie — converted research papers into products within months of publication. The quality gap to bespoke 3D content is still wide for complex scenes, but for single objects at typical game-asset quality, generation has crossed the threshold of usefulness.
The human body is the single most economically important 3D object to model. Fitness, virtual try-on, animation, telepresence, film visual effects, medical imaging, and ergonomics all need accurate body reconstruction from images and video. The dominant approach is parametric body modelling — factor the body into a low-dimensional shape and pose parameterisation, then fit the parameters to observations. SMPL and its family (SMPL-X, SMPL-H, STAR) are the standard; image-based regressors (HMR, HMR 2.0, PIXIE, 4D-Humans) and capture systems have been built around them.
SMPL — "A Skinned Multi-Person Linear model" (Loper et al., 2015) — is a vertex-based PCA-style parametric body mesh. It has three components: a shape blendshape space (10 principal components parameterising height, build, proportions), a pose-dependent blendshape (corrections that depend on the current pose to preserve realism — e.g. muscle bulges at a bent elbow), and a linear blend skinning step that deforms the mesh according to a skeleton of 24 joints with axis-angle rotations. The full parameterisation is 10 shape + 72 pose parameters (24 joints × 3 axis-angle numbers, the root joint carrying the global orientation) = 82 numbers, producing a 6890-vertex mesh. SMPL has been fitted to motion capture, used as the prior in essentially every image-based body regressor, and serves as the ground truth for several benchmarks.
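The final stage — linear blend skinning — is compact enough to sketch. This toy two-joint "arm" is illustrative only: real SMPL first applies shape and pose blendshapes to the 6890-vertex template and composes joint transforms down a kinematic tree, neither of which is modelled here.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector -> 3x3 rotation matrix (Rodrigues' formula), the
    per-joint rotation parameterisation SMPL uses."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def lbs(verts, joints, weights, pose):
    """Linear blend skinning: each vertex moves by a weights[i, j]-blended
    combination of the joint transforms (each a rotation about its joint)."""
    transforms = []
    for aa, origin in zip(pose, joints):
        R = rodrigues(aa)
        transforms.append((R, origin - R @ origin))   # rotate about the joint
    posed = np.zeros_like(verts)
    for i, v in enumerate(verts):
        for j, (R, t) in enumerate(transforms):
            posed[i] += weights[i, j] * (R @ v + t)
    return posed

# A straight "arm" along x; bend the elbow joint (at x=1) by 90 degrees.
verts = np.array([[0.5, 0, 0], [1.0, 0, 0], [1.5, 0, 0], [2.0, 0, 0]])
joints = np.array([[0.0, 0, 0], [1.0, 0, 0]])
weights = np.array([[1, 0], [0.5, 0.5], [0, 1], [0, 1]])   # per-vertex skinning
pose = np.array([[0, 0, 0], [0, 0, np.pi / 2]])            # elbow: 90° about z
posed = lbs(verts, joints, weights, pose)
```

The half-and-half weights at the elbow vertex keep it fixed while the forearm swings up — the blending behaviour (and the volume-loss artefacts at bent joints) is exactly what SMPL's pose-dependent blendshapes exist to correct.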
SMPL-X (Pavlakos et al., 2019) extends SMPL with articulated hands (from MANO) and an expressive face (from FLAME), yielding a unified full-body mesh with facial expressions and finger articulation. SMPL-H is the hand-only extension; STAR (Osman, Bolkart & Black, 2020) tightens SMPL with sparse trainable correctives; GHUM and GHUML (Xu et al., 2020) are Google's alternative body models. SCAPE was the predecessor to SMPL. For most research and product use, SMPL-X is the standard.
Image-to-SMPL regression — Human Mesh Recovery (HMR, Kanazawa et al., 2018) — takes a cropped image of a person and predicts SMPL parameters directly from a CNN. HMR trained on a combination of image-to-2D-keypoint data and 3D MoCap via an adversarial prior on poses. SPIN (Kolotouros et al., 2019) introduced an optimisation-in-the-loop step: regress initial parameters, refine with SMPLify optimisation against 2D keypoints, use the refined parameters as supervision. HMR 2.0 (Goel et al., 2023) scaled the recipe with a ViT backbone and more training data; PIXIE (Feng et al., 2021) did joint body-face-hand regression for SMPL-X. 4D-Humans (Goel et al., 2023) added temporal tracking.
Clothed body reconstruction goes beyond SMPL to model the geometry of clothing. PIFu (Saito et al., 2019) and PIFuHD (Saito et al., 2020) predicted implicit surfaces from a single image, producing high-resolution clothed reconstructions. ICON (Xiu et al., 2022) combined SMPL conditioning with implicit surface regression for better generalisation. ECON (Xiu et al., 2023) used front/back normal prediction to regularise the implicit surface. PHORHUM added relighting-compatible albedo. SIFU and TeCH bring diffusion priors into the loop.
Animatable avatars — digital humans that can be reposed after reconstruction — combine SMPL articulation with learned appearance representations. HumanNeRF (Weng et al., 2022) fit a NeRF to a monocular video with SMPL providing pose supervision; the result could be viewed from any angle at any time step. Neural Actor, Ava, SMPL-X Gaussian avatars, Animatable Gaussians, GaussianAvatars, and 4K4D have continued the thread using Gaussian splats. Meta's Codec Avatars and Apple's Persona are industrial realisations.
Applications beyond the mesh are broad. Mocap from video replaces marker-based systems for animation and sports science. Virtual try-on (VTON) uses body shape to drape garments — TryOnDiffusion (Zhu et al., 2023) and StableVITON are representative. Ergonomics and medical imaging use SMPL for posture analysis, gait, and movement assessment. Telepresence (Meta's Codec Avatars, Google's Starline) builds on SMPL-X-like parameterisations for real-time avatar streaming. The hand and face variants (MANO, FLAME) power sign-language recognition, gesture tracking, and facial-expression transfer.
The limitations are real: SMPL assumes a single topology and so can miss prosthetics, extreme body shapes, or non-standard anatomy; fine-grained clothing dynamics require explicit cloth simulation; monocular ambiguity still produces pose errors on occluded joints; and the datasets have known demographic biases in shape and motion distribution. Research on more inclusive parametric bodies (SMPL-Kids, skeletal atypical populations) is active, and the community is generally careful with representation given the social stakes.
Research papers on 3D vision report accuracy on benchmark datasets; products have to run on specific hardware at specific latency and memory budgets, often on the edge. The gap between a 2024 top-of-leaderboard NeRF paper and a shippable AR feature is substantial and covers compression, quantisation, streaming, mobile-specific rasterisers, and the operational discipline of sensor-fusion pipelines. This section covers the engineering that makes research-grade 3D methods deployable.
Neural-field compression is the first set of techniques. A raw NeRF MLP is ~2 MB but the volumetric feature grids that dominate its capacity (in Instant-NGP and TensoRF) can reach hundreds of megabytes to gigabytes. Techniques include VQ compression (vector-quantise the hash-grid features into a codebook), pruning (drop unused grid entries), low-rank decomposition (TensoRF's underlying idea), and 8-bit quantisation of grid feature tables. Compressed NeRFs can reach 10–50 MB while matching the uncompressed quality. For Gaussian splats, compression involves reducing the per-Gaussian payload (quantising spherical-harmonic coefficients, reducing SH degree, pruning low-opacity Gaussians) and has brought typical scenes from 1 GB to 20–50 MB.
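Two of the Gaussian-splat steps are easy to sketch in numpy: opacity pruning and 8-bit quantisation of the spherical-harmonic colour table. The threshold and layout here are illustrative, not those of any particular codec, and real pipelines add SH-degree reduction and VQ codebooks on top.

```python
import numpy as np

def compress_splats(opacity, sh_coeffs, opacity_thresh=0.01):
    """Sketch of two standard 3DGS compression steps: (1) prune Gaussians
    whose opacity contributes almost nothing, (2) quantise the spherical-
    harmonic colour coefficients to an 8-bit table with a shared range."""
    keep = opacity > opacity_thresh
    sh = sh_coeffs[keep]
    lo, hi = sh.min(), sh.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((sh - lo) / scale).astype(np.uint8)   # 4x smaller than float32
    return keep, q, (lo, scale)                        # params to dequantise later

def dequantise(q, lo, scale):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
opacity = rng.random(10_000)
sh_coeffs = rng.standard_normal((10_000, 48)).astype(np.float32)  # deg-3 SH x RGB
keep, q, (lo, scale) = compress_splats(opacity, sh_coeffs)
recon = dequantise(q, lo, scale)
err = np.abs(recon - sh_coeffs[keep]).max()   # bounded by scale / 2
```

The quantisation error is bounded by half a quantisation step, which is visually negligible for colour; the SH coefficients dominate the per-Gaussian payload (48 of ~59 floats at degree 3), which is why they are the first target.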
Streaming and progressive rendering matter for city-scale or building-scale reconstructions that do not fit in GPU memory. Hierarchical 3DGS (Kerbl et al., 2024) reconstructs block-by-block with hierarchical LODs; BungeeNeRF and CityNeRF handle multi-scale city views. Web-based viewers (Antimatter15's splat, SpatialisHQ, Luma's viewer) load progressively — first a low-resolution version for instant feedback, then higher-resolution tiles as the user moves. Streaming radiance fields is more or less a solved problem for static scenes in 2025; dynamic streaming is an active research area.
Mobile and on-device rendering requires specialised rasterisers. The original 3DGS rasteriser is CUDA-only; mobile deployment needs Metal (iOS), Vulkan (Android), or WebGPU implementations. MobileNeRF (Chen et al., 2022) baked a NeRF into a polygonal mesh with view-dependent textures that standard mobile GPUs could render at 30 fps. For 3DGS, gsplat and commercial implementations (Luma, Polycam, Niantic) have shipped mobile-capable rasterisers. WebGPU brings the same capability to the browser, and 3D scene-sharing services increasingly ship WebGPU viewers that run on laptops and modern phones.
Autonomous-driving deployment is a separate world with its own discipline. A production LiDAR-based perception stack on a Waymo-class vehicle runs: point-cloud pre-processing (voxelisation, ground removal), detection (CenterPoint or a proprietary variant), segmentation (Cylinder3D-like), tracking (3D Kalman filter or a learned tracker), and prediction. Every stage is compiled through TensorRT and runs on an embedded GPU (Xavier, Orin) or custom ASIC, with hard real-time budgets (typically 10 Hz per sensor). The training-production skew — mismatch between research benchmarks and production data distribution — is the chronic operational concern.
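The pre-processing stage is the simplest to sketch: a height-threshold ground filter and centroid voxelisation, in numpy. Production systems use RANSAC plane fitting and fused CUDA voxelisers, so treat this as a shape-of-the-computation sketch with arbitrary thresholds.

```python
import numpy as np

def remove_ground(points, z_thresh=-1.5):
    """Crude ground removal: drop points below a height threshold relative
    to the sensor. Real stacks fit a ground plane or use learned masks."""
    return points[points[:, 2] > z_thresh]

def voxelise(points, voxel_size=0.2):
    """Quantise points into voxel indices and keep one centroid per occupied
    voxel -- the step in front of a CenterPoint-style sparse detector."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(idx, axis=0, return_inverse=True)
    inv = inv.ravel()                       # guard against numpy-version quirks
    counts = np.bincount(inv, minlength=len(uniq)).astype(np.float64)
    centroids = np.zeros((len(uniq), 3))
    for d in range(3):                      # mean of the points in each voxel
        centroids[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return uniq, centroids

rng = np.random.default_rng(0)
cloud = rng.uniform([-10, -10, -2], [10, 10, 2], size=(50_000, 3))
above = remove_ground(cloud)
voxels, centroids = voxelise(above)
```

The point of voxelisation is the interface it creates: a bounded, sorted, sparse set of features that a TensorRT-compiled sparse-conv or pillar network can consume within a fixed latency budget, regardless of raw point count.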
SLAM deployment has similar disciplines. ARKit and ARCore run VI-SLAM on every iPhone and Android phone; the algorithms are proprietary but are well-engineered classical/learned hybrids with IMU preintegration, keypoint tracking, and pose-graph optimisation. Drones (DJI, Skydio) run visual-inertial odometry at 100+ Hz on small embedded boards. Household robots (Roomba-class vacuums) use LiDAR-plus-vision SLAM on tiny CPUs. The pattern is that deployed SLAM lags research SLAM by several years — DROID-SLAM and the Gaussian-SLAM family are only now crossing into production — because the real-time and robustness requirements are much stricter than benchmark performance.
For research, the practical deployment checklist is: pick a representation that rasterises efficiently (Gaussian splats, baked textures, meshes); compress the asset; test on the target hardware with the target sensor stack; include streaming and LOD; handle the long-tail robustness issues (occlusion, illumination change, sensor noise) that do not appear in the benchmarks. The research-to-product gap in 3D vision is narrowing — Luma can turn a video into a shareable 3DGS scene in 10 minutes, and ARKit can build a room-scale mesh in real time on a phone — but it is still a non-trivial engineering lift for anything novel.
A 3D-vision system in production is much more than a model: it is a sensor stack, a dataset, a labelling pipeline, a training infrastructure, a deployment path, and a set of downstream consumers. Getting the model to work is often the easiest part. This closing section steps back and surveys the operational picture — datasets, toolkits, annotation tools, and how 3D vision integrates with the rest of the ML stack.
The datasets are the first thing to know. For indoor scenes: ScanNet and ScanNet200 (1500+ RGB-D indoor scans with fine semantic labels, the main indoor benchmark), NYU-v2 (1449 labelled RGB-D frames from 464 indoor scenes), Matterport3D (90 building-scale scanned properties), Replica (18 room-scale photorealistic reconstructions), S3DIS (six buildings at Stanford). For outdoor driving: KITTI (original, small), nuScenes (1000 scenes, camera+LiDAR+radar), Waymo Open (largest with 5 cameras + 5 LiDARs), Argoverse 2 (map-aware long-tail), KITTI-360 (urban long-term), Lyft Level 5.
For object-level 3D: ShapeNet (50 000 CAD models across 55 categories, the classical benchmark), Objaverse (800k scanned or procedural assets), Objaverse-XL (10M assets, the largest single 3D collection), 3D-FUTURE (furniture), Google Scanned Objects (1000 scanned household items), OmniObject3D (6000 scanned objects across ~190 categories). For NeRF / 3DGS training benchmarks: NeRF Synthetic (8 synthetic object scenes), Mip-NeRF 360 (9 real outdoor/indoor unbounded), Tanks and Temples (real multi-view scenes), Deep Blending, LLFF. For multi-view stereo: DTU (classical MVS benchmark), BlendedMVS, ETH3D. For depth: KITTI depth, NYU-v2, DIODE, iBims-1, ETH3D, Hypersim. For human body: Human3.6M, 3DPW, AMASS, EgoBody, AGORA.
The toolkits cover the workflow end to end. PyTorch3D (Meta) and Kaolin (NVIDIA) provide differentiable rendering, mesh operations, and 3D data loaders inside PyTorch. Open3D is the Swiss-army-knife point-cloud/mesh processing library with visualisation and the classical reconstruction algorithms. COLMAP is the SfM workhorse; hloc (the Hierarchical Localisation toolbox) builds on it, adding learned features and matchers. MinkowskiEngine and SpConv are the sparse-conv libraries; torch-points3d is a high-level framework wrapping them. Nerfstudio is the NeRF workflow standard; gsplat and graphdeco-inria for Gaussian splatting. Mitsuba 3 for differentiable rendering. tiny-cuda-nn for fast MLPs. mmdetection3d and OpenPCDet for 3D detection benchmarks.
3D vision plugs back into the rest of the ML stack. It depends on many of the earlier pieces in this compendium: the linear-algebra foundations of camera geometry, the optimisation machinery of bundle adjustment and SLAM, the CNN and transformer backbones that most 3D networks inherit, the self-supervised pretraining recipes that transfer to 3D, and the neural-field / diffusion toolkits that came from generative modelling. It connects forward to video understanding (4D reconstruction is video-vision married to 3D), to vision-language models (3D grounding — tell me what's in this room — becomes tractable once a 3D representation is in hand), to robotics (every robot perception stack), and to the upcoming world models that combine video, 3D, and action.
For a team getting started: pick the smallest sensor modality that matches the product (monocular camera → depth networks and Gaussian splatting; stereo or RGB-D → dense MVS; LiDAR → point-cloud / sparse-conv; multi-modal → BEVFusion-style). Pick a dataset or capture process early; most 3D failures come from dataset mismatch rather than model mismatch. Use pretrained foundation depth models (Depth Anything) as regularisers. Render your reconstructions often — the visual debugging is much more informative than numeric metrics in 3D — and keep a human evaluator in the loop. The benchmarks matter but have known blind spots (e.g. indoor ScanNet biases, KITTI's narrow driving distribution), so validate on your target domain before shipping.
The next chapter is Vision-Language Models — CLIP, BLIP, LLaVA, Flamingo, GPT-4V, and the multimodal-LLM family — which is where Part VII culminates. The 3D methods in this chapter will re-emerge there as the grounding substrate for 3D-aware VLMs (LLaVA-3D, Scene3D-LLM, 3D-LLM), and many of the datasets above (ScanNet with referring expressions, 3D-QA benchmarks) serve both sub-fields. 3D vision is no longer a niche; it is now one of the mainstream pillars of multimodal learning.
3D vision sits at the junction of classical geometry, deep representation learning, computer graphics, and robotics. The literature reflects that breadth: foundational multi-view geometry textbooks remain essential, and the recent neural-field and Gaussian-splatting waves are still playing out across conference venues. The selections below cover the geometric foundations (camera models, multi-view geometry, SfM), depth estimation, point-cloud and sparse-convolutional networks, 3D detection and segmentation, the NeRF and Gaussian-splatting lineages, SLAM, 3D generation, human body modelling, and the toolkits (COLMAP, Open3D, MinkowskiEngine, PyTorch3D, Nerfstudio, gsplat) that practical work depends on.