3D & Multimodal Generation: when models leave the page and enter the world.

Every image is a 2D projection of a 3D world, and every creative act involves multiple sensory modalities — language, sight, sound, touch. The frontier of generative AI is learning to operate in these richer spaces: synthesizing objects and scenes with genuine 3D structure that can be viewed from any angle, and building models that can fluidly translate between text, images, audio, and beyond. NeRF and Gaussian splatting redefined 3D representation; score distillation made 2D priors useful for 3D; any-to-any models dissolved the boundaries between modalities entirely.

Prerequisites

This chapter assumes familiarity with diffusion models (Part X Ch 04), latent diffusion and classifier-free guidance (Part X Ch 04), and the image-generation ecosystem (Part X Ch 06). The NeRF sections require calculus (Part I Ch 02) and some comfort with volume rendering integrals. For multimodal models, the transformer architecture (Part VI Ch 04) and CLIP-style contrastive learning (Part VII Ch 06) are essential context. Score distillation sampling is derived here in detail and does not require prior exposure.

The 3D Generation Challenge

Section 01 · Representations · the view-synthesis problem · why 3D is hard

Generating a convincing 2D image is hard. Generating a 3D scene — one that looks correct from every possible viewpoint, has consistent geometry, and supports interaction — is qualitatively harder. The difficulty starts with representation: 3D data does not have a single canonical format the way images are grids of pixels. Several competing representations exist, each with different trade-offs.

Meshes represent surfaces as collections of triangular faces, with vertices in 3D space. This is the format used by nearly all downstream applications (rendering engines, games, CAD). But meshes are topologically complex, and differentiable mesh optimization is notoriously unstable — gradients through a rasterizer must handle discrete changes in visibility as triangles move. Voxel grids discretize 3D space into cubic cells, analogous to pixels in 2D. They are simple to process with 3D convolutions but scale cubically with resolution: a 256³ grid requires nearly 17 million cells, most of them empty. Point clouds are unordered sets of 3D points, compact but lacking surface information. Signed Distance Functions (SDFs) represent geometry as a continuous function \(f: \mathbb{R}^3 \to \mathbb{R}\) whose sign indicates inside/outside and whose magnitude gives the distance to the nearest surface — smooth and resolution-free, but requiring a neural network to approximate the function.
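To make the SDF idea concrete, here is a minimal sketch (plain NumPy, illustrative only) of the analytic SDF for a sphere; a neural SDF replaces this closed-form function with a learned \(f_\theta\):

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside,
    zero exactly on the surface; |value| is the distance to the surface."""
    return np.linalg.norm(np.asarray(points) - center, axis=-1) - radius
```

The surface itself is the zero level set, which is what marching-cubes-style extraction recovers as a mesh.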

The novel view synthesis problem — given images from some viewpoints, render the scene from new viewpoints — is the core challenge that NeRF solved. It requires the model to reason about 3D structure from 2D observations: which pixels correspond to which 3D points, how light scatters through a scene, what is occluded and what is visible. Classical computer graphics solves this given a 3D model; the machine learning challenge is to infer the 3D model from images.

The 3D generation gap

2D image generators benefit from enormous training datasets (billions of images) and clear evaluation metrics. 3D generation has neither: 3D assets are expensive to create, datasets are orders of magnitude smaller, and there is no agreed-upon metric for 3D quality. The result is that 3D generation lags 2D generation by several years. The most successful approaches sidestep direct 3D supervision entirely by lifting 2D priors — exploiting the billions of images used to train 2D diffusion models as indirect 3D supervision.

Neural Radiance Fields

Section 02 · Volume rendering · positional encoding · density and radiance

Mildenhall et al. (2020) introduced Neural Radiance Fields (NeRF), a representation of 3D scenes as continuous implicit functions parameterized by a neural network. The key idea: a scene is represented as a function mapping any 3D point and viewing direction to a density (how opaque is this point?) and a color (what radiance is emitted/reflected toward the camera?). To render an image from a given camera pose, rays are cast through each pixel and the color is computed by integrating density and radiance along the ray — volume rendering.

NeRF function and volume rendering
\[ F_\theta: (\mathbf{x}, \mathbf{d}) \mapsto (\sigma, \mathbf{c}) \]
\[ \hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt \]
\[ T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) \]
A ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\) is parameterized by origin \(\mathbf{o}\) and direction \(\mathbf{d}\). The network takes a 3D position \(\mathbf{x} \in \mathbb{R}^3\) and viewing direction \(\mathbf{d} \in \mathbb{R}^2\) (spherical coordinates) and outputs volume density \(\sigma \geq 0\) and RGB color \(\mathbf{c} \in [0,1]^3\). The transmittance \(T(t)\) captures how much light survives before reaching point \(t\). In practice the integral is approximated by stratified sampling of \(N\) points per ray.
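The quadrature NeRF actually computes can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the reference implementation; in practice `sigmas` and `colors` come from evaluating the network \(F_\theta\) at the sample points:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Discrete volume rendering along one ray.
    sigmas: (N,) densities; colors: (N, 3) RGB; ts: (N,) sorted sample depths.
    Returns the composited color and the per-sample weights T_i * alpha_i."""
    deltas = np.diff(ts, append=ts[-1] + 1e10)       # inter-sample distances
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = trans * alphas                         # transmittance * opacity
    return (weights[:, None] * colors).sum(axis=0), weights
```

The weights sum to at most 1; rays that exit the scene without hitting density composite to black, which is why NeRF scenes are typically trained against known backgrounds.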

Training is remarkably simple: given a set of posed images (images with known camera parameters), sample random rays, evaluate the volume rendering integral, and minimize the squared difference between rendered and true pixel colors. No 3D supervision — only 2D images and camera poses. The network optimizes its own 3D representation purely from the photometric constraint that rendered images should match observed ones.

Positional Encoding

Raw 3D coordinates fed into an MLP produce blurry results — the network struggles to represent high-frequency variations in color and geometry. NeRF uses a positional encoding (borrowed from Fourier features) to map coordinates to a higher-dimensional space before passing them to the network:

Positional encoding
\[ \gamma(p) = \left(\sin(2^0 \pi p),\, \cos(2^0 \pi p),\, \ldots,\, \sin(2^{L-1} \pi p),\, \cos(2^{L-1} \pi p)\right) \]
Applied separately to each coordinate dimension. With \(L = 10\) frequencies for positions and \(L = 4\) for directions, the encoding enables the network to represent fine geometric details and view-dependent shading effects. The theoretical justification comes from the spectral bias of neural networks — without encoding, they preferentially learn low-frequency functions.
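The encoding is a handful of lines. A NumPy sketch, applied to a batch of coordinates:

```python
import numpy as np

def positional_encoding(p, L):
    """Fourier-feature encoding gamma(p): maps (..., D) coordinates to
    (..., 2*L*D) features using frequencies 2^0 * pi ... 2^{L-1} * pi."""
    p = np.asarray(p)
    freqs = (2.0 ** np.arange(L)) * np.pi            # (L,)
    angles = p[..., None] * freqs                    # (..., D, L)
    feats = np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*p.shape[:-1], -1)          # interleaved sin/cos
```

With \(L = 10\), each 3D position becomes a 60-dimensional input vector to the MLP.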

Hierarchical Sampling

Integrating along a ray with uniform samples is inefficient — most of a ray passes through empty space. NeRF uses a coarse-to-fine sampling strategy: a coarse network with uniform samples estimates where density concentrates, then a fine network samples more densely near surfaces. This two-network hierarchy makes training tractable within reasonable compute budgets, though the original NeRF still required 1–2 days per scene on a single GPU.
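The fine-network resampling step is inverse-transform sampling from the coarse weights. A minimal NumPy sketch (function and argument names are illustrative):

```python
import numpy as np

def sample_pdf(bins, weights, n_fine, rng):
    """Draw n_fine depths from the piecewise-constant PDF defined by the
    coarse weights; bins are the N+1 edges of the N coarse intervals."""
    pdf = weights / (weights.sum() + 1e-8)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side='right') - 1, 0, len(weights) - 1)
    denom = np.where(cdf[idx + 1] - cdf[idx] < 1e-8, 1.0, cdf[idx + 1] - cdf[idx])
    t = (u - cdf[idx]) / denom                       # position within the bin
    return bins[idx] + t * (bins[idx + 1] - bins[idx])
```

Samples land overwhelmingly inside intervals where the coarse pass found density, which is exactly where extra resolution pays off.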

Efficient NeRFs: Instant-NGP and Beyond

Section 03 · Hash encoding · Mip-NeRF · NeRF in the Wild · downstream variants

Original NeRF was beautiful in concept but impractical in application: training from scratch per-scene took days, and rendering was seconds per frame rather than real-time. A wave of follow-up work addressed these bottlenecks systematically.

Instant Neural Graphics Primitives (Instant-NGP)

Müller et al. (2022) replaced the MLP's positional encoding with a multiresolution hash encoding. The 3D space is divided into a hierarchy of grids at different resolutions, from coarse (capturing global structure) to fine (capturing detail). Each grid cell stores a learned feature vector. For a given 3D point, features are looked up at all resolutions via trilinear interpolation, hashed to a fixed-size table, and concatenated — producing a compact feature vector that is then decoded by a tiny MLP.

The hash table allows many grid positions to share the same entry (hash collisions are intentional), which implicitly regularizes the representation. The result is dramatic: Instant-NGP trains in seconds to minutes on a single GPU (compared to hours or days for vanilla NeRF) and renders at interactive frame rates. It was a turning point that made NeRF practical for downstream applications like scene reconstruction and 3D generation.
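The core of the encoding is the spatial hash itself. A sketch of the per-corner index computation using the XOR-of-primes hash from the Instant-NGP paper (the multiresolution loop, trilinear blend, and learned feature tables are omitted for brevity):

```python
import numpy as np

# Large primes from the Instant-NGP paper; the first is 1 so that one axis
# indexes the table directly when the grid is small enough to fit.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(grid_coords, table_size):
    """Map integer 3D grid corners (..., 3) to indices in a fixed-size table.
    Collisions are intentional; training resolves them implicitly."""
    c = np.asarray(grid_coords, dtype=np.uint64)
    h = c[..., 0] * PRIMES[0]
    h ^= c[..., 1] * PRIMES[1]
    h ^= c[..., 2] * PRIMES[2]
    return h % np.uint64(table_size)
```

Because the table size is fixed per level, memory is bounded regardless of grid resolution — the fine levels simply alias more heavily.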

Mip-NeRF and Anti-Aliasing

Standard NeRF samples along rays as infinitesimally thin lines, which causes aliasing when images are captured at different scales or resolutions. Barron et al.'s Mip-NeRF (2021) replaces ray sampling with cone sampling: each pixel corresponds to a frustum (a tapered cone), and the encoding is replaced by an integrated positional encoding over the cone's volume at each sample. This is equivalent to representing the mean and covariance of the expected feature over each cone slice, which can be computed in closed form for Gaussian cones. Mip-NeRF produces sharper renders at multiple scales and is the basis for most subsequent NeRF work. Mip-NeRF 360 extended this to unbounded scenes with a nonlinear scene contraction and a proposal network for efficient sampling.

NeRF Variants for Diverse Settings

NeRF in the Wild (Martin-Brualla et al., 2021) adds per-image latent codes to handle appearance variation in internet photos taken under different lighting and weather conditions — the same Paris street scene photographed in summer and winter should share geometry but not appearance. Block-NeRF decomposes large outdoor environments (city blocks) into overlapping NeRF segments that can be trained in parallel and composed at inference. Deformable NeRF and D-NeRF add a time-dependent deformation field to model dynamic scenes. NeuS replaces volume density with a Signed Distance Function (SDF) representation, producing cleaner surface geometry that can be directly extracted as a mesh via marching cubes.

Method              Training speed   Key innovation                     Best use case
NeRF                Days             Volume rendering + MLP             Bounded scenes, lab quality
Instant-NGP         Seconds          Multiresolution hash encoding      Fast reconstruction, real-time preview
Mip-NeRF 360        Hours            Anti-aliased cone sampling         Unbounded outdoor scenes
NeuS                Hours            SDF density + surface extraction   Clean mesh output
NeRF in the Wild    Hours            Appearance latent codes            Internet photo collections

3D Gaussian Splatting

Section 04 · Anisotropic Gaussians · differentiable rasterization · real-time rendering

Kerbl et al. (2023) introduced 3D Gaussian Splatting (3DGS), a radically different approach to scene representation that achieves real-time rendering quality matching or exceeding the best NeRF methods. Instead of a neural network queried per-point, 3DGS represents a scene as a set of explicit 3D Gaussian primitives — millions of anisotropic blobs in 3D space, each with a position, orientation, scale, opacity, and spherical harmonics coefficients for view-dependent color.

Rendering works by splatting: project each 3D Gaussian onto the image plane as a 2D Gaussian, sort Gaussians by depth, and alpha-composite them front-to-back. This is a rasterization operation that runs on the GPU at 100+ FPS for scenes with millions of Gaussians — a 100–1000× speedup over NeRF rendering. Training uses a differentiable rasterizer: gradients flow back through the alpha-compositing step to update each Gaussian's parameters.
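The compositing step is the same transmittance-weighted sum as NeRF's quadrature, but over depth-sorted splats with explicit per-splat opacities. A minimal sketch of the per-pixel accumulation:

```python
import numpy as np

def composite_splats(alphas, colors):
    """Front-to-back alpha compositing of depth-sorted 2D splats:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)
```

In the real rasterizer this runs per tile on the GPU with Gaussians pre-sorted by depth; the differentiability of this weighted sum is what lets gradients reach each Gaussian's parameters.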

3DGS: Gaussian parameters and color model
\[ \mathcal{G} = \{\mu_i, \Sigma_i, \alpha_i, \mathbf{f}_i\}_{i=1}^N \]
Each Gaussian \(i\) has: mean position \(\mu_i \in \mathbb{R}^3\); covariance \(\Sigma_i\) (stored as scale + rotation quaternion for positive-semidefinite guarantee); opacity \(\alpha_i \in [0,1]\); and spherical harmonic coefficients \(\mathbf{f}_i\) encoding view-dependent RGB color. Training starts from a sparse point cloud (from COLMAP or similar) and adds/removes Gaussians via adaptive density control based on gradient magnitudes and opacity thresholds.
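Storing \(\Sigma_i\) as a scale vector plus a rotation quaternion guarantees a valid covariance by construction, \(\Sigma = R S S^\top R^\top\). A sketch:

```python
import numpy as np

def covariance_from_scale_rotation(scale, quat):
    """Build Sigma = R S S^T R^T, positive semidefinite for any scale
    vector and (normalized) quaternion -- no explicit constraints needed."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    M = R @ np.diag(scale)
    return M @ M.T
```

Optimizing scale and quaternion directly, rather than the six covariance entries, is what keeps every gradient step inside the space of valid ellipsoids.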

3DGS has effectively supplanted NeRF as the default method for scene reconstruction from posed images, at least for applications requiring interactive rendering. The explicit representation also enables intuitive scene editing — Gaussians can be selected, deleted, moved, or color-changed directly, and the scene re-renders immediately. Extensions include 4D Gaussian Splatting for dynamic scenes, Gaussian Avatars for photorealistic human head reconstruction, and Compact 3DGS variants that reduce memory footprint through vector quantization of attributes.

NeRF vs. 3DGS: which to use?

3DGS trains faster, renders faster, and produces competitive or better visual quality for bounded scenes. NeRF has advantages in anti-aliasing (Mip-NeRF), clean surface extraction (NeuS), and handling of specular/transparent surfaces where Gaussians struggle. For most practical reconstruction tasks, 3DGS is now the preferred starting point. For scientific applications requiring accurate geometry and surface normals, SDF-based NeRF variants remain competitive.

Score Distillation Sampling: Text-to-3D

Section 05 · DreamFusion · SDS loss · Magic3D · Janus problem · SDS improvements

How do you generate a 3D object from a text prompt when you have no 3D training data? Poole et al.'s DreamFusion (2022) answered this with a remarkable insight: you don't need 3D supervision if you have a strong 2D prior. A pretrained text-to-image diffusion model already "knows" what a 3D object should look like from any viewpoint — if you render the object from a random camera and ask the diffusion model whether that render looks like the prompt, the gradient of that signal can be used to update the 3D representation.

The SDS Loss

DreamFusion trains a NeRF from scratch to maximize the likelihood of its renders under a pretrained diffusion model's score function. The key operation is Score Distillation Sampling (SDS). At each training step: (1) render the current NeRF from a random camera pose to get image \(\mathbf{x} = g(\theta)\); (2) add noise to get \(\mathbf{x}_t\); (3) run the diffusion model to predict the noise \(\hat{\boldsymbol{\epsilon}}_\phi(\mathbf{x}_t, t, y)\) conditioned on the text prompt \(y\); (4) compute the SDS gradient with respect to the NeRF parameters \(\theta\):

Score Distillation Sampling gradient
\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t, \boldsymbol{\epsilon}}\!\left[w(t)\!\left(\hat{\boldsymbol{\epsilon}}_\phi(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}\right)\frac{\partial \mathbf{x}}{\partial \theta}\right] \]
This gradient is the predicted noise minus the actual noise added, propagated back through the differentiable renderer \(\partial\mathbf{x}/\partial\theta\). Intuitively: the diffusion model knows what the render "should" look like after denoising; the gradient pushes the 3D parameters so that future renders are easier to denoise into the target description. The weight \(w(t)\) typically upweights intermediate noise levels.

The SDS gradient avoids backpropagating through the diffusion model's denoising network (which would be extremely expensive), treating the diffusion model as a fixed scorer. This makes the approach tractable: training a DreamFusion object takes 1–2 hours on an A100, entirely without any 3D training data.
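The four steps can be sketched end to end on a toy "renderer" and "diffusion model". Everything here is a stand-in — the exponential noise schedule and the finite-difference Jacobian replacing an autodiff renderer are assumptions made only to show the shape of the computation:

```python
import numpy as np

def sds_step(theta, render, eps_pred, t, w, rng):
    """One SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta.
    render: theta (P,) -> image x (D,); eps_pred: (x_t, t) -> predicted noise.
    dx/dtheta is computed by finite differences (a stand-in for autodiff)."""
    x = render(theta)
    eps = rng.standard_normal(x.shape)
    alpha_bar = np.exp(-t)                           # toy noise schedule (assumption)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    residual = eps_pred(x_t, t) - eps                # predicted minus injected noise
    h = 1e-5
    J = np.stack([(render(theta + h * e) - x) / h for e in np.eye(theta.size)], axis=1)
    return w * (J.T @ residual)                      # (P,) gradient on theta
```

Note what is absent: no backpropagation through `eps_pred` — the diffusion model is treated as a frozen scorer, which is the tractability trick the text describes.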

Failure Modes and Improvements

SDS has two characteristic failure modes. The Janus problem (multi-face problem) occurs because 2D diffusion models were trained mostly on front-facing photos: when the NeRF is viewed from behind, the model still predicts a front-facing face, causing the NeRF to grow a face on all sides of an object. Solutions include camera-conditioned diffusion models (Zero-1-to-3) that know the viewpoint, or explicit multi-view consistency constraints. The second failure mode is over-smoothing: SDS tends to produce over-saturated, blobby geometry because the mode-seeking behavior of the score gradient favors average-looking geometry. Alternatives to vanilla SDS include Variational Score Distillation (VSD, used in ProlificDreamer), which models the 3D distribution more carefully and produces sharper, more detailed results.

Magic3D (Lin et al., 2023) addressed quality by using a coarse-to-fine pipeline: SDS on a coarse NeRF to establish global geometry, then refinement with SDS on a high-resolution differentiable mesh renderer for fine-grained texture and surface quality. Fantasia3D disentangles geometry and appearance optimization, using SDS with an SDF for geometry and a separate appearance diffusion prior for texture — producing mesh-based outputs suitable for downstream use in game engines and rendering software.

3D-Aware Image Synthesis

Section 06 · π-GAN · EG3D · tri-plane representation · view-consistent generation

A related but distinct problem: train a generative model on 2D images alone such that the generated images are 3D-consistent — meaning you can generate the same object from multiple viewpoints and it looks coherent, as if the model has an internal 3D representation. This is 3D-aware image synthesis, and it works without any 3D ground truth, only posed 2D images.

π-GAN: Radiance Fields as GAN Generators

Chan et al. (2021) proposed π-GAN, which uses a NeRF-based representation as the generator in a GAN. A mapping network transforms a latent code \(\mathbf{z}\) into FiLM (Feature-wise Linear Modulation) parameters that condition every layer of the NeRF MLP. Images at arbitrary viewpoints are rendered via volume rendering and discriminated by a 2D image discriminator. Because the generator is a NeRF, sampling the same \(\mathbf{z}\) at different camera poses produces 3D-consistent images from a model trained purely on unstructured posed images. The weakness is computational cost: rendering a full NeRF at every discriminator step makes training slow.

EG3D: Tri-Plane Representation

Chan et al. (2022) introduced EG3D, which replaced the costly MLP-based NeRF with a tri-plane representation for dramatically faster rendering. A scene is represented by three axis-aligned feature planes (XY, XZ, YZ). For any 3D point, its feature vector is obtained by projecting onto each plane, looking up the feature at the projected 2D location (bilinear interpolation), and summing the three vectors. A small decoder MLP maps this feature to density and color. The three planes can be generated by a StyleGAN2 backbone operating on 2D feature maps — fast to generate, fast to query.

Tri-plane feature lookup
\[ \mathbf{f}(\mathbf{x}) = F_{XY}(\mathbf{x}_{xy}) + F_{XZ}(\mathbf{x}_{xz}) + F_{YZ}(\mathbf{x}_{yz}) \]
where \(F_{XY}, F_{XZ}, F_{YZ}\) are learned feature maps and \(\mathbf{x}_{xy}\) denotes the projection of 3D point \(\mathbf{x}\) onto the XY plane. The three-plane decomposition approximates a full 3D feature volume \(O(N^3)\) with three 2D planes \(O(3N^2)\), making it far more memory-efficient while still allowing view-consistent rendering.
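The lookup for a single point is short enough to write out. A sketch (plane names, the extent convention, and the bilinear helper are illustrative choices, not EG3D's exact code):

```python
import numpy as np

def triplane_feature(point, planes, extent=1.0):
    """f(x) = F_XY(x_xy) + F_XZ(x_xz) + F_YZ(x_yz) via bilinear lookup.
    planes: dict with 'xy', 'xz', 'yz' arrays of shape (N, N, C);
    point lies in [-extent, extent]^3."""
    def bilinear(plane, u, v):
        N = plane.shape[0]
        fu = (u + extent) / (2 * extent) * (N - 1)   # map to pixel coords
        fv = (v + extent) / (2 * extent) * (N - 1)
        i0, j0 = int(np.floor(fu)), int(np.floor(fv))
        i1, j1 = min(i0 + 1, N - 1), min(j0 + 1, N - 1)
        du, dv = fu - i0, fv - j0
        return ((1 - du) * (1 - dv) * plane[i0, j0] + du * (1 - dv) * plane[i1, j0]
                + (1 - du) * dv * plane[i0, j1] + du * dv * plane[i1, j1])
    x, y, z = point
    return (bilinear(planes['xy'], x, y)
            + bilinear(planes['xz'], x, z)
            + bilinear(planes['yz'], y, z))
```

Each query touches only three 2D texture fetches plus a tiny MLP, which is why tri-planes render so much faster than a full MLP NeRF.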

EG3D achieves StyleGAN2-level image quality on face and car datasets while providing exact multi-view consistency — the same sample can be rendered at any camera pose. This enables applications like 3D-consistent face editing (change hairstyle in one view, render from another and it remains consistent) that were impossible with unstructured 2D GANs. The tri-plane representation has since been adopted widely in 3D generation, including by GET3D for textured 3D mesh generation.

Multi-view Diffusion and Feed-Forward 3D

Section 07 · Zero123 · SyncDreamer · One-2-3-45 · LRM · single-image-to-3D

SDS-based text-to-3D is slow (1–2 hours per object) and suffers from the Janus problem. A newer generation of methods takes a different approach: train a diffusion model that understands 3D geometry well enough to directly predict novel views of an object, then use those multi-view predictions to reconstruct a 3D model quickly.

Zero-1-to-3: Camera-Conditioned Novel View Synthesis

Liu et al. (2023) fine-tuned Stable Diffusion on a dataset of 3D object renders (Objaverse) augmented with camera pose conditioning. The model Zero-1-to-3 takes a single input image and a relative camera transformation (delta azimuth, elevation, distance) and generates the object as seen from the new viewpoint. Unlike vanilla Stable Diffusion, which has no notion of viewpoint, Zero-1-to-3 has been exposed to many objects from many angles and learned the visual transformations that correspond to camera motion. This directly addresses the Janus problem: the model knows the viewpoint and generates a geometrically consistent novel view rather than generating a canonical front face.

Zero-1-to-3 enables SDS with camera conditioning — at each training step, the diffusion score is conditioned on the actual viewing angle of the current NeRF render, dramatically reducing multi-face artifacts. Several systems (Zero123++, SyncDreamer) improved multi-view consistency by generating all views simultaneously in a single synchronized diffusion pass rather than independently per viewpoint.

Large Reconstruction Model (LRM)

Hong et al. (2023) took a feed-forward approach: train a large transformer to directly predict a tri-plane NeRF representation from a single input image, bypassing iterative optimization entirely. The LRM uses a ViT image encoder to extract image tokens, then a cross-attention transformer decoder that attends over image tokens to produce a tri-plane feature map of shape \(3 \times N \times N \times C\). A small NeRF decoder then renders novel views from this tri-plane.

The key is training data: LRM was trained on Objaverse-XL (10 million+ 3D objects) rendered from multiple viewpoints, with the model supervised to predict the correct tri-plane for each object by minimizing reconstruction error across held-out views. At inference, a single-image 3D reconstruction takes about 5 seconds — compared to hours for SDS-based methods. The quality is lower than optimization-based methods for complex scenes, but the speed enables entirely new workflows: rapid prototyping of 3D assets from product photos, automatic 3D scanning of real objects from a smartphone image.

Instant3D, One-2-3-45, and TripoSR are closely related systems that pair multi-view diffusion (generate several views) with fast feed-forward reconstruction (build 3D from views), achieving a quality-speed balance that SDS alone cannot.

Two paradigms for 3D generation. SDS optimizes a NeRF using gradients from a 2D diffusion model score function — flexible and text-driven but slow. Feed-forward reconstruction (LRM) predicts 3D directly from a single image using a pretrained transformer — fast but requires an image input and is limited by training distribution.

Multimodal Joint Embedding Spaces

Section 08 · CLIP · ALIGN · ImageBind · binding modalities via images

Before any-to-any generation becomes possible, the different modalities must be brought into a shared representational space — one where semantically similar concepts across modalities are close together, regardless of whether they are expressed as text, image, audio, or video. This is the job of multimodal joint embedding models.

CLIP and Contrastive Pretraining

CLIP (Radford et al., 2021) trains separate image and text encoders by maximizing the cosine similarity between paired (image, caption) embeddings and minimizing it between non-paired ones, across batches of up to 32,768 examples. With 400 million training pairs, the resulting embedding space captures rich visual-semantic correspondences. Text queries like "a dog playing frisbee" and "a corgi at the beach" are nearby in the embedding space, and so are the images they describe.
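The training objective is a symmetric cross-entropy over the batch similarity matrix. A NumPy sketch (the fixed temperature here is an illustrative simplification; CLIP learns it as a parameter):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: each image should match its own caption (the
    diagonal of the similarity matrix) against all others in the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature               # (B, B) scaled cosines
    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)         # stable log-softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

The symmetry matters: the image-to-text term and the text-to-image term each use every non-matching pair in the batch as negatives, which is why very large batches help.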

CLIP's embedding space has properties that go beyond retrieval: it supports zero-shot classification (embed class names as text, find the nearest image embedding), style arithmetic (subtract "photorealistic" embedding, add "oil painting" embedding), and — as used throughout Ch 06 — image generation conditioning via cross-attention. The CLIP space is the backbone of the multimodal AI ecosystem.

ALIGN and Scaling

ALIGN (Jia et al., 2021) scaled the contrastive approach to 1.8 billion noisy image-text pairs with minimal curation, demonstrating that scale can compensate for data quality. The key finding: a large enough model trained on noisy data outperforms a smaller model on clean data. ALIGN's text encoder (BERT-based) and image encoder (EfficientNet-based) match or exceed CLIP on most zero-shot benchmarks despite noisier training data — a result that influenced subsequent scaling decisions across multimodal pretraining.

ImageBind: Six Modalities, One Space

Girdhar et al. (2023) extended the contrastive approach to six modalities simultaneously: images, text, audio, depth, thermal (infrared), and IMU (inertial measurement unit) data. The key insight is that images can serve as a binding modality — since large paired datasets exist for (image, text), (image, audio), (image, depth), and (image, thermal) separately, a single image encoder can be used as the anchor to align all other modalities into a shared space, even without direct audio-text or audio-depth pairs.

Training ImageBind involves six separate contrastive objectives, each pairing images with one other modality. The resulting joint embedding space supports zero-shot cross-modal retrieval between any two modalities: given an audio recording of a thunderstorm, retrieve matching images, depth maps, or text descriptions. More significantly, it enables cross-modal generation: use an audio embedding in place of a text embedding in a diffusion model's cross-attention, and generate images conditioned on sound rather than text — without any explicit audio-conditioned image generation training. The joint embedding space is a universal "language" that different modal encoders all speak.
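Once all encoders share one space, zero-shot cross-modal retrieval reduces to nearest-neighbor search on cosine similarity, with no modality-specific logic. A sketch over hypothetical precomputed embeddings:

```python
import numpy as np

def cross_modal_retrieve(query_emb, candidate_embs, k=3):
    """Rank candidates from any modality by cosine similarity to a query
    embedding from any other modality, assuming one shared space."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]
```

The same mechanism underlies the conditioning swap described next: the query embedding can be dropped into a generator wherever a text embedding was expected.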

Any-to-Any Generation

Section 09 · CoDi · Unified-IO · NExT-GPT · composable diffusion

The natural extension of joint embedding spaces is any-to-any generation: a single model that accepts any combination of text, image, audio, and video as input and produces any combination as output. Rather than having separate models for text-to-image, image-to-text, text-to-audio, and audio-to-image, one model handles all twelve pairs (and their compositions) simultaneously.

CoDi: Composable Diffusion

Tang et al. (2023) introduced CoDi, which trains separate diffusion models for each modality (text, image, audio, video) and connects them through a shared multimodal latent space. Each modality-specific diffusion model has its own conditioning encoder but shares a common cross-attention mechanism that can attend to latents from any other modality. Training uses modality-specific data pairs (text-image, text-audio, etc.) and aligns the latent spaces through a contrastive bridging loss. At inference, any modality can condition any other — and crucially, multiple input modalities can be combined simultaneously: generate a video that matches both a text description and an audio clip.

Unified-IO: One Model, All Tasks

Lu et al. (2023) trained Unified-IO 2, a single autoregressive model capable of handling text, images, audio, and video — both as inputs and outputs — within a single shared token vocabulary. Images are VQ-tokenized into discrete codes, audio is mel-spectrogram-tokenized, and video frames are tokenized individually and interleaved. The model processes all modalities as a single mixed-token sequence, predicting any output tokens given any input tokens. Unified-IO 2 achieves competitive performance on dozens of tasks (image captioning, VQA, image generation, audio captioning, audio generation) using a single set of weights, demonstrating that a truly universal multimodal model is achievable.

NExT-GPT: LLM as Multimodal Orchestrator

Wu et al. (2023) used a large language model as the orchestrating backbone for any-to-any generation. NExT-GPT encodes inputs from all modalities into the LLM's token space using modality-specific encoders, runs the LLM to generate output tokens (which may include special modality-routing tokens), and decodes those tokens through modality-specific generation models (Stable Diffusion for images, AudioLDM for audio, etc.). The LLM's role is reasoning and coordination — deciding which output modalities are needed and what content they should contain — while domain-specific generators handle high-quality synthesis. This approach leverages existing pretrained generators and requires relatively little additional training, though the coupling between the LLM's text-space planning and the generators' latent spaces is indirect.

The tokenization unification bet

The deep disagreement in any-to-any modeling is whether to unify at the token level (all modalities become discrete tokens, processed by one transformer) or at the latent level (modality-specific continuous latents, aligned by a shared embedding space). Token-level unification (Unified-IO, DALL-E 1, Gemini) is simpler architecturally but requires very high-quality tokenizers for audio and video. Latent-level alignment (CoDi, NExT-GPT) preserves the quality of specialist generators at the cost of looser modality coupling. Both approaches are active areas of research and current systems combine elements of both.
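Token-level unification amounts to giving each modality a disjoint slice of one vocabulary so a single transformer can read and predict them interchangeably. A minimal sketch of the packing (the vocabulary sizes and segment format are illustrative, not any particular model's layout):

```python
def pack_multimodal(segments, vocab_sizes):
    """Offset per-modality token ids into disjoint ranges of one shared
    vocabulary, then concatenate into a single mixed-token sequence.
    segments: list of (modality, token_id_list) pairs, in order."""
    offsets, total = {}, 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = total
        total += size
    return [offsets[mod] + tok for mod, toks in segments for tok in toks]
```

Everything difficult is hidden in the tokenizers: the scheme is only as good as the discrete codes available for images, audio, and video.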

Cross-Modal Synthesis: Audio-Visual and Beyond

Section 10 · Foley generation · talking head synthesis · sound-to-image · image sonification

Any-to-any generation enables specific cross-modal applications that would be impossible with single-modality models. The most commercially significant of these involve audio-visual synchronization — generating sound for silent video, or generating video that matches an audio track.

Foley Sound Generation

Foley is the art of creating sound effects synchronized to video — footsteps, environmental sounds, impacts. Manual Foley work is labor-intensive and expensive in film production. Generative models have automated much of this: given a video clip, a model trained on (video, audio) pairs can generate temporally synchronized sound effects that match the visual content. SpecVQGAN (Iashin & Rahtu, 2021) and subsequent models encode video frames as tokens and generate synchronized audio spectrograms autoregressively. Diffusion-based audio generation conditioned on video features (e.g., via cross-attention) has pushed quality further, enabling realistic Foley for arbitrary video content.

Talking Head and Lip Sync

Given a portrait image and a speech audio clip, talking head models generate a realistic video of the person speaking the words. Wav2Lip (Prajwal et al., 2020) achieved near-perfect lip synchronization by training a discriminator specifically on the synchronization between lip motion and audio features extracted by a pretrained audio encoder. More recent methods like SadTalker and EchoMimic use diffusion models conditioned on facial landmark sequences and audio features, producing more natural head motion and expression variation beyond just lip movement.

Audio-Driven Image and Video Generation

ImageBind's joint embedding space enables a mode of generation that bypasses text entirely: given an audio clip, generate an image whose semantic content matches the sound. A thunderstorm recording generates a stormy landscape; a piano piece generates an abstract visual interpretation; a dog barking generates a dog image. This works by simply replacing the text embedding in a text-conditioned diffusion model with the audio embedding from ImageBind's shared space — no additional training required, because the embeddings of the same concept are already nearby across modalities.

Image Sonification

The reverse direction — generating audio from images — is less common but has scientific and accessibility applications. Sonification of scientific data (converting telescope images, medical scans, or climate models to audio) helps researchers perceive patterns that are invisible in visual form. For accessibility, image-to-audio systems can describe visual scenes as sound for visually impaired users, going beyond text captions to generate spatial audio that conveys layout and depth.

Generative Models for Science

Section 11 · Protein structure · molecular generation · materials · equivariance

Generative models have found some of their most consequential applications not in creative content but in scientific discovery — generating molecules with desired properties, predicting and designing protein structures, and exploring materials with target characteristics. These applications differ from image or text generation in requiring the model to respect hard physical symmetries and constraints.

AlphaFold and Protein Structure Prediction

DeepMind's AlphaFold 2 (Jumper et al., 2021) solved a 50-year-old grand challenge in biology: predicting a protein's 3D structure from its amino acid sequence with near-experimental accuracy. While not strictly a generative model, AlphaFold's Evoformer architecture — which processes multiple sequence alignments through pairwise attention to build up a representation of inter-residue distances and angles — was a crucial technical foundation for subsequent generative protein design work. AlphaFold 3 extended the framework to predict structures of complexes involving proteins, DNA, RNA, and small molecules, using a diffusion model over atomic coordinates as the structure prediction head.

RFDiffusion: Protein Design with Diffusion

Watson et al. (2023) trained a diffusion model directly on protein backbone coordinates — the 3D positions of the backbone atoms tracing the protein chain. Starting from random noise in coordinate space and running the reverse diffusion process, conditioned on desired functional properties (binding site, catalytic residues, symmetry), RFDiffusion generates novel protein backbones that fold into the specified structure and can be further detailed with side chains. This enabled the de novo design of proteins with no natural homologs — proteins that evolution never created — that were experimentally verified to fold into their designed structures.

A key requirement is equivariance: the generated protein structure should not depend on the arbitrary orientation of the coordinate frame. Rotating the input specifications should rotate the output by the same amount. RFDiffusion and related models use SE(3)-equivariant neural networks (invariant point attention, equivariant convolutions) to enforce this symmetry. Violating equivariance would cause the model to generate different structures depending on irrelevant rotations of the input — physically nonsensical.
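The equivariance property can be checked numerically on a toy EGNN-style coordinate update — a sketch of one common equivariant construction, not RFDiffusion's invariant point attention. Because the update is built only from pairwise coordinate differences (which rotate with the frame) and distances (which do not), it commutes with any orthogonal transform.

```python
import numpy as np

def egnn_coord_update(x, w=0.1):
    """One E(3)-equivariant coordinate update (EGNN-style):
    x_i <- x_i + sum_j (x_i - x_j) * phi(||x_i - x_j||^2).
    Differences rotate with the frame; squared distances are invariant."""
    diff = x[:, None, :] - x[None, :, :]       # (N, N, 3) pairwise differences
    d2 = (diff ** 2).sum(-1, keepdims=True)    # (N, N, 1), rotation-invariant
    phi = w / (1.0 + d2)                       # toy scalar edge weighting
    return x + (diff * phi).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                    # 5 toy residue positions

# Random orthogonal transform (rotation or reflection) via QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

rotated_then_updated = egnn_coord_update(x @ q.T)
updated_then_rotated = egnn_coord_update(x) @ q.T
# The two orders agree: the layer is equivariant to the transform.
```

A non-equivariant layer (say, an MLP on raw coordinates) would fail this check, which is exactly the physically nonsensical behavior described above.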

Molecular Generation

Drug discovery requires generating molecules with specific properties: target binding affinity, drug-likeness, synthetic accessibility, low toxicity. Models approach this as generation over molecular graphs (atoms as nodes, bonds as edges) or over 3D point clouds of atom positions and types. Graph diffusion models (GDSS, DiGress) run diffusion over the graph adjacency matrix and node attributes jointly. EDM (Hoogeboom et al., 2022) and its successors apply equivariant diffusion directly to 3D molecular geometries — each atom's position, element type, and formal charge. Conditioning on protein binding site geometry is particularly valuable for structure-based drug design: DiffSBDD generates ligand structures in 3D, positioned within the target protein's binding pocket, while DiffDock uses diffusion over ligand poses to dock a given molecule into the pocket.
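One concrete piece of the equivariant-diffusion machinery is easy to show: EDM-style models keep both the data and the injected noise in the zero-center-of-mass subspace, which makes the forward process translation-invariant. A minimal sketch, with illustrative `alpha`/`sigma` values rather than any model's actual noise schedule:

```python
import numpy as np

def com_free_noise(n_atoms, rng):
    """Gaussian noise projected onto the zero-center-of-mass subspace.
    Keeping noise mean-free is the standard trick for translation
    invariance in E(3)-equivariant molecular diffusion."""
    eps = rng.normal(size=(n_atoms, 3))
    return eps - eps.mean(axis=0, keepdims=True)

def noise_coords(x, alpha, sigma, rng):
    """Forward-diffuse atom coordinates: x_t = alpha * x + sigma * eps."""
    x = x - x.mean(axis=0, keepdims=True)  # center the molecule first
    return alpha * x + sigma * com_free_noise(len(x), rng)

rng = np.random.default_rng(0)
coords = rng.normal(size=(12, 3))          # toy 12-atom molecule
x_t = noise_coords(coords, alpha=0.8, sigma=0.6, rng=rng)
# The noised molecule stays centered at the origin at every diffusion step.
```

Without the projection, the diffused distribution would drift with arbitrary translations of the input, violating the same symmetry requirement discussed for proteins above.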

Materials Discovery

Crystalline materials (semiconductors, catalysts, battery materials) can also be modeled as 3D atomic structures with periodic boundary conditions. Generative models for crystal structures must handle the periodic tiling and the combinatorial explosion of possible element compositions. DiffCSP uses SE(3)-equivariant diffusion over lattice parameters and atomic fractional coordinates to generate stable crystal structures for novel compositions. MatterGen (Zeni et al., 2024) extended this to conditional generation: design a crystal structure with target bandgap, bulk modulus, or magnetic moment, enabling property-guided materials exploration at a scale impossible with experimental synthesis alone.
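The periodic boundary handling can be sketched directly: fractional coordinates live on the unit torus, so diffusion noise wraps modulo 1, and interatomic distances use the minimum-image convention. A toy NumPy illustration of these two ingredients (not DiffCSP's actual code; the `sigma` value is arbitrary):

```python
import numpy as np

def noise_fractional(frac, sigma, rng):
    """Diffuse fractional coordinates on the torus [0, 1)^3: add Gaussian
    noise, then wrap, so an atom drifting past a cell face re-enters on
    the opposite side (a wrapped-normal forward process)."""
    return (frac + sigma * rng.normal(size=frac.shape)) % 1.0

def min_image_delta(f1, f2):
    """Minimum-image displacement between fractional coordinates: the
    shortest among periodic copies, needed for distances in a crystal."""
    d = f1 - f2
    return d - np.round(d)

rng = np.random.default_rng(0)
frac = rng.uniform(size=(8, 3))            # toy 8-atom unit cell
noised = noise_fractional(frac, sigma=0.3, rng=rng)

# Atoms near opposite faces of the cell are actually close neighbors:
a, b = np.array([0.02, 0.5, 0.5]), np.array([0.98, 0.5, 0.5])
delta = min_image_delta(a, b)              # -> [0.04, 0, 0], not [-0.96, 0, 0]
```

A full crystal-structure model additionally diffuses the lattice parameters and the discrete element composition; the wrapping above is only the coordinate part.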

The 3D and Multimodal Landscape

Section 12 · Convergence trends · world models · generative simulation · remaining barriers

The 3D and multimodal generation landscape is more fragmented than the 2D image generation ecosystem, because the problem space is vastly larger. Rather than a single dominant paradigm (latent diffusion with CFG), there are competing approaches at every level: representation (NeRF vs. 3DGS vs. mesh vs. SDF), training signal (SDS vs. feed-forward vs. multi-view diffusion), and cross-modal architecture (unified token space vs. aligned latent space vs. LLM orchestration).

The Convergence of 3D, Video, and World Models

The most significant trend in 2024–2025 is the blurring boundary between 3D generation, video generation, and world models. A video model that has learned to simulate physical dynamics — how objects move, how light changes, how rigid bodies interact — has implicitly learned a 3D world model. Sora-class video generators can maintain consistent 3D structure across many frames; the open question is whether this consistency reflects "true" 3D understanding or sophisticated pattern matching. Pragmatically, the answer matters less than the engineering implications: if a video generator reliably produces 3D-consistent outputs, it can be used as a 3D generator by generating videos and extracting geometry.

Conversely, explicit 3D representations (NeRF, 3DGS) are increasingly being used as the latent space for video generation — rather than diffusing over 2D frame sequences, diffuse over 3D scene parameters and render frames from the 3D scene. This enforces multi-view consistency by construction and enables camera-controlled generation (specify the camera trajectory and generate the video as seen from that camera). 4D generation — generating time-varying 3D scenes — is the natural next step, combining 3DGS with temporal dynamics modeling.

Generative Simulation

Scientific simulation (physics, chemistry, climate, biology) is computationally expensive. Generative models trained on simulation outputs can serve as surrogate simulators — orders-of-magnitude faster approximations that can be sampled at inference time for rapid exploration. Weather prediction (GraphCast, Pangu-Weather), molecular dynamics simulation (neural force fields), and fluid dynamics (physics-informed neural networks) are all areas where generative models are beginning to complement or replace traditional numerical solvers.

Remaining Barriers

The most persistent barrier in 3D generation is training data. Objaverse and Objaverse-XL represent the largest available 3D asset datasets, but they are dominated by synthetic CAD models and game assets that differ substantially from real-world objects. Physical consistency — accurate material properties, realistic lighting, correct deformation — requires either procedural simulation data or expensive real-world capture. The synthetic-to-real gap remains substantial for any application requiring physical accuracy.

In multimodal generation, the main barrier is modality alignment quality for non-image modalities. The paired data problem is severe: while image-text pairs number in the billions, audio-video pairs with natural synchronization are far fewer and noisier. Models trained on weaker alignment data produce weaker cross-modal correspondences. Until large-scale naturally synchronized multimodal datasets exist (or can be synthesized at scale), any-to-any generation will be limited to the modality pairs for which sufficient supervised pairs exist.

Finally, the interactive 3D generation problem — editing a generated 3D scene interactively, the way Photoshop allows editing of 2D images — is largely unsolved. 3DGS enables some direct Gaussian manipulation, but semantically meaningful edits (change the material of the chair, resize the table) require semantic understanding of the 3D representation that current models lack. Connecting language-guided editing to 3D representations, analogous to what ControlNet and InstructPix2Pix did for 2D, is one of the most actively researched problems in the field.

Key Papers