Sim-to-Real Transfer: training where the data is, deploying where the world is.
Real robot data is slow and expensive; simulated robot data is fast and cheap. The problem is that simulators differ from reality in ways that quietly invalidate the policies trained inside them. This chapter is about that gap — what causes it, what physics simulators do well and badly, and the techniques that have made it possible, in 2024–2026, to train policies in simulation that actually work on hardware.
Prerequisites & orientation
This chapter assumes familiarity with the perception, control, and learning material of Chapter 01, Chapter 02, and Chapter 03. The reinforcement-learning sections lean on the basics from Part IX Ch 01 and the policy-gradient material in Part IX Ch 04; readers who want a refresher on PPO or SAC should glance there first. No physics background is assumed beyond high-school mechanics.
Two threads run through the chapter. The first is the reality gap: the systematic differences between simulator and reality that make a sim-trained policy fail on hardware. We will name the gap's components — visual, physical, contact, sensor — and look at how each is being closed. The second is the compute-for-data trade: simulation lets you spend GPU-hours instead of robot-hours, and the practical question for any project is how that trade actually pencils out for the task at hand. The answer is increasingly "well," but it is task-specific, and the techniques in this chapter exist precisely because the trade does not always work without help.
The Reality Gap
Every simulator is a model of reality, and every model is wrong about something. The reality gap is the gap between "the policy works in simulation" and "the policy works on the real robot." Closing it is the central engineering problem of modern robot learning.
If you train a policy in simulation and deploy it directly on hardware, the most common outcome is that it performs worse — sometimes catastrophically — than the simulation evaluation suggested. The policy might over-rotate the gripper because the simulated friction was idealised, or fail to localise because the simulator's camera renders shadows that the real camera handles differently, or fall off a balance point because the simulated motor torque profile differs from the real one. None of these failures show up in simulation evaluation; all of them show up at deployment. The reality gap is not one problem but a family of them, and the techniques in this chapter address different parts of the family.
Four kinds of gap
It helps to name the components rather than treat the gap as monolithic. The visual gap is the difference between simulated and real images: lighting, shadows, materials, sensor noise, exposure, motion blur. A vision-conditioned policy trained on perfect ray-traced renderings can fail when the real camera produces a slightly washed-out image at the wrong moment. The physical gap is the difference between simulated and real dynamics: mass, inertia, motor lag, gear backlash, joint damping. A policy that depends on a precisely modelled torque-to-acceleration relationship will fail if the real robot's gears have more friction than the simulator assumed.
The contact gap is the special case of physical gap where the modelling is hardest: how rigid bodies behave when they touch, deform, slip, or break contact. Simulating contact accurately is computationally expensive and most simulators take shortcuts that produce subtly wrong behaviour at the moment of contact, exactly when many manipulation tasks need precision. The sensor gap is the model-to-reality mismatch in sensor responses: a simulated LiDAR returns clean ranges while a real one has spurious returns and dropouts; a simulated IMU has zero bias while a real one drifts. Each of these gaps has its own causes and its own fixes, and the rest of this chapter is organised around them.
Why simulation is worth the trouble
If the gap is real and hard to close, why bother with simulation at all? The answer is the compute-for-data trade. A real robot collects data at the rate of physical time and risks hardware damage at every step. A simulated robot can be cloned a thousand times across GPUs and run a million steps in parallel, with no risk of damage and no waiting. For RL — which needs many millions of environment interactions to converge — simulation is the only viable training environment for most tasks. For imitation learning, simulation can amplify a small set of demonstrations into a large training set by replaying them in randomised conditions. The techniques in this chapter exist because the trade is overwhelmingly in simulation's favour as long as you can close the gap, and because closing the gap turned out to be tractable.
The honest evaluation of any sim-to-real technique is a real-world success rate, not a simulation one. A policy that achieves 99% in sim and 30% on hardware has not benefited from simulation; the sim score is misleading you. Every section in this chapter should be read with the question "what does this do to the real-world number?" in mind, because that is the only number that matters.
Physics Simulators: The Modern Landscape
A simulator is a piece of software that integrates a model of physics forward in time given inputs. The dominant simulators in robotics differ in their physics models, their rendering quality, their speed, and what they were originally designed for — and the choice between them often determines what kinds of sim-to-real techniques are even available.
The major simulators
MuJoCo. Long the dominant academic robotics simulator, MuJoCo (Multi-Joint dynamics with Contact) was developed by Emo Todorov's Roboti LLC, acquired by DeepMind in 2021, and fully open-sourced in 2022. It uses a soft-contact model that produces stable, differentiable contact dynamics — at the cost of contact behaviour that is physically smooth but not physically exact. MuJoCo is fast, accurate enough for most manipulation and locomotion research, and the simulator behind nearly every modern legged-robot RL paper through 2024.
Isaac Sim and Isaac Lab. NVIDIA's flagship robotics simulator, built on the Omniverse platform with PhysX as the physics engine. Isaac Sim provides photo-realistic ray-traced rendering, large-scale GPU-parallel simulation, and tight integration with the rest of NVIDIA's robotics stack (Isaac ROS, Cosmos, etc.). Isaac Lab is the RL training framework on top, with built-in support for tens of thousands of parallel environments. The combination is the dominant production training environment for robot learning as of 2026.
PyBullet and Bullet. An open-source physics engine that has been around since the 2000s, with broad robotics adoption through PyBullet (its Python interface) and through simulators such as Gazebo and CoppeliaSim that offer Bullet as a physics backend. Less performant than MuJoCo or Isaac on RL workloads but easy to install and well-documented; it remains the workhorse for tutorials and small projects.
Brax. Google's JAX-based simulator, designed from the start for massively parallel hardware acceleration. Brax can run hundreds of thousands of environments on a single GPU and achieves training throughput an order of magnitude higher than CPU-based simulators on suitable problems. The catch is that Brax's contact model is simplified for speed — it works well for locomotion but is less reliable for fine manipulation.
Drake. An MIT/TRI simulator focused on rigorous physics, primarily used in academia and at the Toyota Research Institute. Drake's contact and constraint solvers are among the most accurate available; the trade-off is that Drake is slower than MuJoCo or Isaac and lacks their RL-training infrastructure. Drake is the right choice when physical fidelity matters more than throughput — verification, planning research, certain manipulation tasks.
Genesis and the next-gen wave. Genesis (released late 2024) is a new entrant claiming order-of-magnitude speedups over Isaac with comparable physical fidelity, courtesy of a Taichi-based GPU implementation. ManiSkill, RoboCasa, and several other 2025-vintage simulators are pushing similar performance bounds. The simulator landscape is moving fast enough that any survey is out of date within a year.
The right simulator for the job
| Use case | Pick | Reason |
|---|---|---|
| Locomotion RL at scale | Isaac Lab or Brax | GPU-parallel, fast forward dynamics, well-tooled |
| Manipulation (research) | MuJoCo or Isaac Sim | Stable contact, mature tooling |
| Fine contact tasks (assembly) | Drake or MuJoCo | Most accurate contact & constraint solvers |
| Visual policies / sensors | Isaac Sim | Photo-realistic ray-traced rendering |
| Quick prototypes / tutorials | PyBullet | Easy install, broad documentation |
| Production AV / large fleets | CARLA or proprietary | Scenario authoring, traffic models |
The choice has consequences for everything downstream. A policy trained in MuJoCo may need a different sim-to-real recipe than one trained in Isaac, because the underlying physics models embed different approximations. Switching simulators mid-project is rarely cheap. The pragmatic advice is to make the choice early, validate that the simulator's physics matches your task's regime, and stick with it.
The Visual Gap and Photo-Realistic Rendering
Vision-based policies are uniquely vulnerable to the visual gap. A policy that learned its features from one rendering pipeline can fail completely when shown images from a different one — even if the underlying scene is identical. Closing the visual gap is partly a rendering problem and partly a training-time problem.
What's hard about rendering for robotics
Game engines render to look good to humans. Robotics simulators need to render to look real to a CNN, which is a different objective. Real cameras have characteristics — sensor noise, exposure response curves, chromatic aberration, motion blur, rolling-shutter artefacts, autoexposure dynamics, lens distortion — that game-engine renderers don't model by default. A policy trained on synthetically clean images will weight features that don't generalise to real cameras, and will be derailed by real-world artefacts that never appeared in training.
Modern robotics renderers (Isaac Sim's RTX-based pipeline, Unreal-based simulators, NVIDIA Omniverse) approach photo-realism by combining ray-tracing for global illumination, physically based materials, and explicit sensor models that simulate per-pixel sensor noise, exposure, and motion blur. The rendering quality of Isaac Sim 2024 is genuinely difficult to distinguish from a real camera under good conditions; the failure modes that remain tend to be in low-light, reflective, or fast-motion scenes.
Sensor models, not just rendering
A photo-realistic renderer is necessary but not sufficient. The sensor itself adds noise, drops frames, has a finite dynamic range, and produces specific artefacts (banding, blooming, autoexposure overshoot) that the policy must learn to ignore. Modern simulators include explicit sensor models as a layer between rendered images and the policy: the renderer produces a clean image, the sensor model adds Gaussian read noise, applies an exposure curve, simulates rolling shutter for fast motion, and so on. The closer this layer matches the actual hardware, the smaller the visual gap on deployment.
The same idea applies to LiDAR (simulated with realistic beam divergence, intensity noise, and weather attenuation), depth cameras (with structured-light interference patterns or ToF noise models), and IMUs (with bias drift, white noise, and temperature dependence). The simulators that are taken seriously for sim-to-real always include explicit sensor models; ones that don't, don't transfer well.
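As a concrete sketch of the sensor layer, here is a toy camera model sitting between renderer and policy. Everything in it is illustrative: the function name, the Gaussian read-noise level, the gamma-style exposure curve, and the 8-bit quantisation are stand-ins for a model calibrated to a specific real camera.

```python
import numpy as np

def camera_sensor_model(clean_img, read_noise_std=0.01, gamma=2.2, rng=None):
    """Toy sensor layer between renderer and policy: add read noise,
    apply an exposure (gamma) response, quantise to 8-bit levels."""
    if rng is None:
        rng = np.random.default_rng()
    img = clean_img + rng.normal(0.0, read_noise_std, clean_img.shape)
    img = np.clip(img, 0.0, 1.0) ** (1.0 / gamma)   # exposure response
    return np.round(img * 255) / 255                # 8-bit quantisation

clean = np.full((4, 4), 0.5)   # a flat grey patch from the "renderer"
noisy = camera_sensor_model(clean, rng=np.random.default_rng(0))
```

The same wrapper shape applies to LiDAR and IMU models: clean simulated signal in, hardware-like signal out, with the wrapper's parameters fitted to the deployment sensor rather than guessed.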
The texture-randomisation trick
The earliest and still one of the most reliable visual-gap mitigations is to give up on rendering reality and instead train on a wide enough distribution of randomised textures that the real world looks like a typical training sample. A policy trained on a thousand random textures applied to the same geometry learns to ignore appearance and focus on structure. This is a special case of domain randomisation (Section 5) and the original sim-to-real success story — Tobin et al. (2017) showed it could close the gap for an object-localisation task without using any real images. Even today, in pipelines that use photo-realistic rendering, texture randomisation is usually applied as a defensive layer.
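A minimal sketch of the trick, assuming nothing beyond NumPy: each parallel environment is assigned a random appearance. Here that is just one flat colour per environment; real pipelines randomise procedural textures, lighting, and camera pose over the same geometry.

```python
import numpy as np

def randomise_textures(n_envs, img_hw=(8, 8), rng=None):
    """Give each of n_envs parallel environments a random flat-colour
    'texture' (the minimal illustrative case of texture randomisation)."""
    if rng is None:
        rng = np.random.default_rng()
    colours = rng.uniform(0.0, 1.0, size=(n_envs, 3))   # one RGB per env
    h, w = img_hw
    return np.broadcast_to(colours[:, None, None, :], (n_envs, h, w, 3)).copy()

textures = randomise_textures(1000, rng=np.random.default_rng(0))
```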
Contact, Friction, and Where Engines Disagree
Contact is the hardest thing physics simulators do. When two rigid bodies touch, the simulator has to enforce non-interpenetration, model friction, decide when contact breaks, and do all of this stably across a tiny time step. Different simulators make different approximations, and those approximations are where most of the physical reality gap lives.
Hard contact vs. soft contact
The mathematically correct way to handle contact is as a complementarity problem: at each time step, the simulator solves for contact forces that are non-negative, act only on bodies in contact, and are exactly large enough to prevent interpenetration. This is the linear complementarity problem (LCP) formulation; engines such as ODE and Bullet solve it (or an iterative relaxation of it) for hard contact, while Drake favours convex compliant-contact formulations that sidestep the LCP's pathologies. The catch is that LCP solvers are computationally expensive, can have multiple solutions when contacts conflict, and need careful regularisation to avoid jittering at near-contact configurations.
The practical alternative is soft contact: model contacts as stiff springs with damping, allowing tiny interpenetrations in exchange for a smooth, easily-solvable system. MuJoCo and Brax both use soft contact (with different parameterisations); the contact behaviour is stable and differentiable, but it is not physically exact. A simulated gripper closing on an object using soft contact will produce slightly different forces than the real gripper would. For most tasks the difference is negligible; for delicate assembly or in-hand manipulation, it can be the dominant source of sim-to-real mismatch.
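A minimal penalty-style sketch of soft contact, with illustrative spring and damper constants (MuJoCo's actual parameterisation is smoother and expressed through solver impedances, not a raw spring, but the spring-damper captures the core idea):

```python
def soft_contact_force(penetration, penetration_rate, k=1e4, d=100.0):
    """Penalty-style soft contact: normal force from a stiff spring-damper,
    active only while the bodies interpenetrate. Constants are illustrative."""
    if penetration <= 0.0:
        return 0.0                    # bodies separated: no contact force
    force = k * penetration + d * penetration_rate
    return max(force, 0.0)            # contact can only push, never pull

print(soft_contact_force(0.001, 0.0))   # 10.0 N for a 1 mm interpenetration
```

The tiny allowed interpenetration is exactly the "not physically exact" part: the force ramps in smoothly instead of appearing discontinuously at first touch.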
Friction is its own can of worms
Real friction is non-linear, history-dependent, and strongly affected by surface micro-geometry. Simulators model it with the Coulomb friction law (constant friction coefficient, force tangent to the surface, magnitude bounded by μ times the normal force) — a clean abstraction that captures the dominant behaviour but misses everything that makes real friction interesting. The friction cone is the geometric formulation: contact forces must lie inside a cone centred on the surface normal. Most simulators linearise this cone (replacing it with an inscribed pyramid) for solver efficiency; the linearisation introduces directional artefacts that show up as biased motion at contact.
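The cone-versus-pyramid difference can be made concrete with a small sketch, using an inscribed four-sided pyramid (the constraint |fx| + |fy| ≤ μN). The force and coefficient values below are illustrative.

```python
import numpy as np

def in_friction_cone(f_tangent, f_normal, mu):
    """Exact Coulomb condition: tangential force magnitude at most mu * N."""
    return np.linalg.norm(f_tangent) <= mu * f_normal

def in_friction_pyramid(f_tangent, f_normal, mu):
    """Inscribed 4-sided pyramid: |fx| + |fy| <= mu * N. Conservative
    relative to the true cone, most so at 45 degrees to the axes."""
    return np.sum(np.abs(f_tangent)) <= mu * f_normal

# A tangential force at 45 degrees to the pyramid axes: inside the true
# cone but outside the inscribed pyramid -- a directional artefact.
ft = np.array([0.6, 0.6])                             # |ft| ~ 0.849
print(in_friction_cone(ft, f_normal=1.0, mu=0.9))     # True
print(in_friction_pyramid(ft, f_normal=1.0, mu=0.9))  # False
```

A solver using the pyramid will declare slipping for forces the true cone would hold, and the error depends on direction, which is what "biased motion at contact" means in practice.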
Production sim-to-real pipelines for manipulation almost universally randomise friction coefficients across training environments, precisely because friction is the parameter most likely to differ between simulator and reality. Domain randomisation (next section) is in part a response to friction's intractability.
Stiffness, damping, and stable simulation
Beyond contact, the simulator's integrator — the algorithm that advances physics one time step at a time — is a critical engineering choice. Explicit Euler integration is fast but unstable for stiff systems (high spring constants, fast dynamics); implicit integrators are stable but slower. MuJoCo uses a semi-implicit scheme that balances the trade. The integrator's time step is the most consequential single parameter: too large and the simulation diverges, too small and it runs slowly. Modern simulators expose this and let practitioners tune it for their task.
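The stability difference is easy to demonstrate on an undamped spring-mass system. The constants below are illustrative, chosen so that explicit Euler diverges while semi-implicit (symplectic) Euler stays bounded at the same time step.

```python
def simulate_spring(dt, steps, k=1000.0, m=1.0, semi_implicit=True):
    """Undamped spring-mass integrated with explicit vs semi-implicit Euler.
    Explicit Euler injects energy every step and diverges for stiff springs;
    semi-implicit Euler stays bounded at the same dt."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        if semi_implicit:
            v += dt * (-k / m) * x   # update velocity first...
            x += dt * v              # ...then position with the new velocity
        else:
            x_old = x
            x += dt * v              # explicit: both updates use old state
            v += dt * (-k / m) * x_old
    return abs(x)

dt, steps = 0.01, 1000
print(simulate_spring(dt, steps, semi_implicit=False))  # blows up
print(simulate_spring(dt, steps, semi_implicit=True))   # stays near 1.0
```

The only difference is the order of the two update lines, which is why semi-implicit integration is essentially free and near-universal in real-time physics engines.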
Domain Randomization
If you cannot model reality precisely, model many slightly-different realities and force the policy to be robust to all of them. Domain randomisation is the simplest sim-to-real technique that consistently works, and it remains the foundation under most other techniques in this chapter.
The core idea
The original domain randomisation proposal (Tobin et al., 2017; Sadeghi & Levine, 2017) was almost embarrassingly simple: don't try to make the simulator match reality, just train on a wide enough distribution of simulated realities that the real world is, in expectation, just one more sample from that distribution. The policy never sees real images during training; it sees a thousand variants of the simulated scene with different textures, lighting, camera positions, and noise. Provided the distribution of variations is wide enough that the real world is statistically inside it, the policy generalises to reality without any real data.
The argument worked surprisingly well. Domain randomisation closed the sim-to-real gap on the original 2017 object-localisation task (feeding a real grasping pipeline) without any real images, and it has remained a foundational technique. Almost every successful sim-to-real result since then uses some form of domain randomisation, often combined with one or more of the techniques in later sections.
What to randomise
Visual randomisation is what made the technique famous, but the same principle applies to physics. The standard randomisation menu for a manipulation policy includes:
- Visual: textures, lighting position and colour, scene clutter, camera position and intrinsics, sensor noise level.
- Physical: object masses, friction coefficients, joint damping, motor gains, gravitational constant (yes, really).
- Latency: sensor-to-action delay, action-to-effect delay (motor lag).
- Sensor noise: camera noise variance, IMU bias and white noise, encoder quantisation.
- Initial conditions: object positions, robot starting pose, target locations.
The specific ranges to randomise over matter a lot. Too narrow and the policy fails to generalise; too wide and it becomes a generalist that can't solve any specific instance well. The standard practice is to start with rough engineering estimates of how much each parameter could plausibly differ between sim and reality, then tune the ranges based on the real-world success rate.
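In code, the menu reduces to a table of per-parameter ranges sampled once per environment. The parameter names and ranges below are illustrative engineering guesses of the kind you would then tune against the real-world success rate, not calibrated values.

```python
import numpy as np

# Illustrative per-parameter ranges for a manipulation policy.
RANDOMISATION_RANGES = {
    "object_mass_kg":   (0.05, 0.5),
    "friction_coeff":   (0.3, 1.2),
    "motor_gain_scale": (0.8, 1.2),
    "action_delay_ms":  (0.0, 40.0),
}

def sample_env_params(rng):
    """Draw one environment's physics parameters uniformly from the ranges."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMISATION_RANGES.items()}

rng = np.random.default_rng(0)
params = sample_env_params(rng)   # resampled per environment, per episode
```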
Automatic domain randomisation
Hand-tuning randomisation ranges is tedious. Automatic Domain Randomisation (OpenAI, 2019) closes the loop: start with narrow ranges, train the policy until it succeeds, expand the ranges, train again, repeat. The result is a curriculum that adaptively widens the difficulty as the policy becomes more capable. ADR was the technique behind OpenAI's solving of Rubik's cube on a real Shadow Hand, where the in-hand manipulation problem required randomisation across so many parameters that hand-tuning would have been impossible.
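A heavily simplified sketch of the ADR loop: the published algorithm adjusts each range boundary independently using per-boundary performance estimates, whereas here a single scalar success rate expands or shrinks every range by the same factor. All thresholds and factors are illustrative.

```python
def adr_step(ranges, success_rate, expand=1.1, shrink=0.9,
             hi_thresh=0.8, lo_thresh=0.4):
    """One simplified ADR update: widen every range when the policy is
    succeeding, narrow when it is struggling, hold otherwise."""
    if success_rate >= hi_thresh:
        scale = expand
    elif success_rate <= lo_thresh:
        scale = shrink
    else:
        return ranges                       # performance in-band: hold
    new = {}
    for name, (lo, hi) in ranges.items():
        mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
        new[name] = (mid - half * scale, mid + half * scale)
    return new

ranges = {"friction_coeff": (0.5, 0.7)}
ranges = adr_step(ranges, success_rate=0.9)   # policy doing well: widen
print(ranges["friction_coeff"])               # ~(0.49, 0.71)
```

Run inside the training loop, this produces the widening curriculum: easy early, as hard as the policy can bear later.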
The limits of randomisation
Domain randomisation is bounded by the simulator's expressive range. If the simulator's contact model is structurally wrong (Section 4), no amount of randomising friction coefficients within a wrong model fixes the problem. If the visual rendering pipeline cannot produce the kinds of artefacts a real camera produces, randomising textures within the renderer doesn't help. The technique works when the gap is parametric (the simulator has the right model, just wrong parameters) and fails when the gap is structural (the simulator is missing a phenomenon entirely). The next sections cover techniques that handle the structural-gap case.
Domain Adaptation
If the gap is too structural to randomise across, the alternative is to learn it. Domain adaptation methods explicitly use real data — typically much less than RL would need — to align the policy or its inputs with the real domain.
Image-to-image translation
One family of techniques translates simulated images into real-looking images using generative models — GANs in 2017–2020, diffusion in 2023+. An early and influential instance was CycleGAN-style translation: train an unpaired image-to-image translation network on simulated and real images, then apply it during sim training so that the policy sees images that look real. The catch is that translation networks introduce their own artefacts and can systematically distort important features (object boundaries, fine textures) in ways that hurt the downstream policy. Modern translation methods using diffusion models are cleaner but still imperfect.
Feature-level adaptation
A subtler approach: align not the images but the features the policy extracts from them. The policy network has an encoder that maps observations to features and a head that maps features to actions. If the encoder produces similar features for simulated and real images of the same scene, the head trained on simulated features will work on real ones. Methods like Domain-Adversarial Neural Networks (Ganin et al., 2016) train the encoder with an additional adversarial loss that confuses a domain classifier — the encoder is rewarded for producing features the classifier cannot tell apart by domain. The result is a domain-invariant feature space.
Learned residual dynamics
For the physical-gap case, an analogous trick: train a small neural network to predict the residual between simulator dynamics and reality, then add the residual to the simulator during training. The residual is fit on a small dataset of real transitions paired with simulator predictions of what would have happened. Once trained, the augmented simulator (sim + residual model) is a closer model of reality, and policies trained on it transfer better. This is the model-side analogue of feature-level adaptation, and is increasingly common in production sim-to-real pipelines for legged-robot locomotion.
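A minimal sketch on a hypothetical one-dimensional system in which "reality" has a drag term the simulator omits. The residual is fitted here by ordinary least squares; real pipelines fit a small neural network on state-action features, but the structure of the trick is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_step(x, u):          # analytical simulator's one-step prediction
    return x + 0.1 * u

def real_step(x, u):         # "reality": same model plus unmodelled drag
    return x + 0.1 * u - 0.05 * x

# Small dataset of real transitions paired with sim predictions.
X = rng.uniform(-1, 1, size=(200, 2))                    # columns: x, u
residuals = np.array([real_step(x, u) - sim_step(x, u) for x, u in X])

# Fit a linear residual model r(x, u) = w . [x, u] by least squares.
w, *_ = np.linalg.lstsq(X, residuals, rcond=None)

def augmented_step(x, u):
    """Simulator plus learned residual: a closer model of reality."""
    return sim_step(x, u) + w @ np.array([x, u])

print(w)   # the fit recovers the missing drag term on x
```

Policies then train against `augmented_step` instead of `sim_step`, so the unmodelled drag is present during training rather than discovered at deployment.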
When adaptation pays
Domain adaptation requires real data, which costs robot time. The break-even point is when you have some real data — say, a few hundred trajectories — but not enough to train a policy from scratch. Below that, domain randomisation alone is more efficient (no real data needed); above it, real-data fine-tuning works directly. The sweet spot is the middle, where adaptation amplifies a small real dataset into a closed-gap simulator.
System Identification
Before randomising or adapting, you can simply measure the simulator's parameters from the real robot. System identification — the discipline of fitting a model to observed data — is the oldest and most direct way to close the physical gap.
Offline system ID
The classical approach: run a battery of motion primitives on the real robot, record the trajectories, and fit the simulator's parameters (mass, inertia, friction coefficients, joint damping, motor gains) to minimise the discrepancy between simulated and real responses. The motion primitives are designed to excite each parameter — quick steps reveal motor lag and damping, prolonged loads reveal friction, free-swing trajectories reveal inertia. The fit is a non-linear least-squares optimisation that solvers like IPOPT or Levenberg-Marquardt handle in minutes.
Offline system identification is cheap (a few hours of robot time, one optimisation), produces a much better simulator than the factory defaults, and is often the first step before any other sim-to-real technique. Production stacks for legged robots and manipulators almost always include a calibration pass that fits the simulator parameters from the specific hardware unit, accounting for unit-to-unit variation.
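A toy version of the fit, for a single damping parameter on a hypothetical free-decay recording. Because this model is linear in the parameter the fit is closed-form; real system ID fits many coupled parameters at once with non-linear least squares, as described above.

```python
import numpy as np

# Discrete damping model: v[t+1] = v[t] - dt * b * v[t].
dt, b_true = 0.01, 2.5
v = [1.0]
for _ in range(100):                       # the "recorded" real trajectory
    v.append(v[-1] - dt * b_true * v[-1])
v = np.array(v)

# Least squares for b from (v[t] - v[t+1]) = dt * b * v[t].
num = np.sum((v[:-1] - v[1:]) * v[:-1])
den = dt * np.sum(v[:-1] ** 2)
b_hat = num / den
print(b_hat)   # recovers b_true on this noiseless recording
```

With sensor noise in `v` the same least-squares form still applies; the estimate just acquires variance, which is why excitation trajectories are designed to maximise the signal each parameter contributes.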
Online system ID
The harder version: identify parameters continuously as the robot operates, so the simulator (and hence the policy) tracks slow changes in hardware (wear, temperature, payload). Online system identification typically maintains a probability distribution over parameters and updates it as new data arrives — Kalman-filter-style for linear cases, particle-filter-style for non-linear. The cost is computational, the benefit is a model that stays current. Used in autonomous flight (gain scheduling against wind conditions) and in legged robots that adapt to slope or surface changes.
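A scalar recursive-least-squares / Kalman-style sketch of the idea. The parameter (think: a friction coefficient), the regressor, and the noise levels are all hypothetical; real stacks run a vector-valued version of this update once per control tick.

```python
import numpy as np

def online_update(theta, P, phi, y, meas_var=0.01):
    """One recursive update of parameter estimate theta (variance P)
    given regressor phi and measurement y = phi * theta_true + noise."""
    K = P * phi / (phi * phi * P + meas_var)   # Kalman gain
    theta = theta + K * (y - phi * theta)      # correct with the innovation
    P = (1.0 - K * phi) * P                    # shrink the uncertainty
    return theta, P

rng = np.random.default_rng(0)
theta, P, theta_true = 0.0, 1.0, 0.7           # start ignorant and uncertain
for _ in range(500):
    phi = rng.uniform(0.5, 1.5)                # excitation at this tick
    y = phi * theta_true + rng.normal(0, 0.1)  # noisy real measurement
    theta, P = online_update(theta, P, phi, y)
```

If `theta_true` drifts (wear, temperature), adding a small process-noise term to `P` each tick keeps the filter responsive instead of freezing on the first few hundred samples.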
Differentiable physics
The newest twist is differentiable physics: simulators that produce gradients of the trajectory with respect to the simulator's parameters. With gradients available, system identification becomes an end-to-end optimisation that can be done with standard ML tooling. Toolboxes like Brax, MuJoCo MJX, NimblePhysics, and Drake's autodiff pipeline all support this. The same gradients can also be used during policy training, making the whole sim-to-real pipeline a continuous differentiable system. Differentiable physics is a research frontier in 2026 — it unifies system ID, model-based RL, and trajectory optimisation under a single mathematical framework.
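The idea in miniature, with the gradient written out by hand for a one-parameter linear "simulator" (autodiff frameworks such as MJX or Brax compute the same thing automatically for full rigid-body dynamics). Every constant here is illustrative.

```python
# x_next = x + dt * (-k * x), unrolled, with the gradient of a trajectory
# loss with respect to the stiffness k propagated through the rollout.
def rollout_and_grad(k, x0=1.0, dt=0.1, steps=20, x_target=0.2):
    x, dx_dk = x0, 0.0
    for _ in range(steps):
        # chain rule through x_next = x * (1 - dt * k)
        dx_dk = dx_dk * (1.0 - dt * k) - dt * x
        x = x * (1.0 - dt * k)
    loss = (x - x_target) ** 2
    return loss, 2.0 * (x - x_target) * dx_dk

# Gradient-based system ID: find the k whose rollout hits the target state.
k = 0.5
for _ in range(100):
    loss, grad = rollout_and_grad(k)
    k -= 3.0 * grad          # plain gradient descent on the sim parameter
```

The same gradients that fit `k` here could instead flow into a policy's parameters, which is why differentiable physics unifies system ID, model-based RL, and trajectory optimisation.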
Real-to-Sim and Hybrid Training
The natural inverse of sim-to-real is real-to-sim: use real-world data to make the simulator more realistic, then train policies in that improved simulator. The two directions can be composed into a real-to-sim-to-real loop that closes the gap iteratively.
Real-to-sim asset reconstruction
A practical first step: take a real scene the robot will operate in and reconstruct it in simulation. Modern photogrammetry and Neural Radiance Field methods (Chapter 01 covered these for SLAM) can produce a 3D scene reconstruction from a few minutes of real-camera footage. Drop the reconstruction into a simulator and the policy can train against a digital twin of the actual deployment environment, eliminating the geometric component of the visual gap entirely.
This pattern is becoming standard for warehouse robots and home robots: scan the deployment site once, generate a high-fidelity simulation of it, train (or fine-tune) the policy in that simulation, deploy. Companies like Skild and Physical Intelligence reportedly run pipelines like this at scale, with site-specific simulators for each customer deployment.
Hybrid training: real + sim
For policy training itself, the dominant pattern is hybrid: most of the data is simulated, a smaller portion is real. The policy gets the breadth of simulation (many tasks, many variations) and the grounding of reality (correct physics on the deployment hardware). The mixing ratio matters — too little real data and the policy is dominated by sim quirks; too much and the simulation no longer adds value. Rules of thumb in 2026: 90% sim / 10% real is typical for manipulation, 99% sim / 1% real is typical for locomotion (where simulation is more reliable).
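The mixing itself is simple at batch-sampling time. A sketch assuming list-like sim and real datasets, with the rough 90/10 manipulation rule of thumb as the default ratio; the right value is task-specific and found empirically.

```python
import numpy as np

def mixed_batch(sim_data, real_data, batch_size, real_frac=0.1, rng=None):
    """Sample a training batch with a fixed sim/real mixing ratio."""
    if rng is None:
        rng = np.random.default_rng()
    n_real = int(round(batch_size * real_frac))
    n_sim = batch_size - n_real
    sim_idx = rng.integers(0, len(sim_data), size=n_sim)
    real_idx = rng.integers(0, len(real_data), size=n_real)
    return [sim_data[i] for i in sim_idx] + [real_data[i] for i in real_idx]

sim_data = [("sim", i) for i in range(100000)]
real_data = [("real", i) for i in range(300)]    # a few hundred trajectories
batch = mixed_batch(sim_data, real_data, batch_size=256,
                    rng=np.random.default_rng(0))
```

Note that the small real dataset is resampled far more often than any individual sim sample, which is the intended effect: the real data anchors the policy while the sim data provides breadth.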
The real-to-sim-to-real loop
The most ambitious pattern composes the previous techniques into a closed loop: train a policy in simulation, deploy it on hardware, collect real trajectories from the deployment, use those to identify simulator parameters and reconstruct any new scene geometry, retrain in the updated simulator, redeploy. The loop closes the sim-to-real gap iteratively as more real data accumulates. The technique is conceptually clean and increasingly common in production, though it requires the engineering infrastructure to support all three directions of the data flow.
Massively Parallel Simulation and RL Training
The single biggest change in sim-to-real over the past five years has been the move to massively parallel GPU simulation. Where 2018-era robotics RL ran a few hundred environments on CPU at a few thousand steps per second, 2024-era pipelines run hundreds of thousands of environments on a single GPU at tens of millions of steps per second. The same algorithms with three orders of magnitude more data produce qualitatively different results.
The Isaac Gym revolution
The transition started with NVIDIA's Isaac Gym (2021), which moved both the physics simulation and the rendering onto the GPU. The performance gain was dramatic: training an in-hand cube reorientation policy that took a week on CPU clusters in 2019 took a few hours on a single workstation GPU in 2021. Isaac Gym was deprecated in favour of Isaac Lab (2023, built on Isaac Sim's Omniverse base), which kept the GPU-parallel simulation and added photo-realistic rendering, mature RL training infrastructure, and a much larger task library. By 2024 nearly every serious legged-robot or manipulation RL paper used Isaac Lab or one of its peers.
The training-pipeline shape
A modern parallel-sim training pipeline looks roughly like this. The simulator runs N = 4,096 to 100,000 environments in parallel, each with a different randomisation seed. The policy network ingests all N observations as a batch and produces N actions. The simulator advances all environments by one step (a fully parallel GPU operation) and produces N new observations and rewards. After k steps, the rollout is fed to a PPO-style update that runs the policy network's gradient computation on the same GPU. The cycle repeats. Throughputs of 10–100 million environment steps per second are typical on modern hardware.
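The shape of that loop, scaled down to a toy vectorised environment in NumPy: a stand-in for a GPU-resident simulator, with a fixed PD "policy" in place of the network and the PPO update omitted. All dynamics and gains are illustrative.

```python
import numpy as np

class BatchedPointEnv:
    """N 1-D point-mass environments advanced as one vectorised array
    operation -- the structural pattern Isaac Lab / Brax pipelines use
    at vastly larger scale, with physics state resident on the GPU."""
    def __init__(self, n_envs, dt=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.dt = dt
        self.pos = rng.uniform(-1, 1, n_envs)   # randomised initial states
        self.vel = np.zeros(n_envs)

    def step(self, actions):
        self.vel += self.dt * actions           # all N envs in one operation
        self.pos += self.dt * self.vel
        reward = -np.abs(self.pos)              # drive position to zero
        return self.pos.copy(), reward

n_envs = 4096
env = BatchedPointEnv(n_envs)
obs = env.pos.copy()
for _ in range(64):                             # one k-step rollout
    actions = -2.0 * obs - 1.0 * env.vel        # fixed PD stand-in "policy"
    obs, reward = env.step(actions)             # batch of N transitions
```

In a real pipeline the rollout buffer of (observation, action, reward) batches would now feed a PPO update on the same device, and the loop would repeat.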
The shape has consequences. With this throughput, behaviours that were prohibitively expensive to train (long-horizon manipulation, complex locomotion gaits, multi-agent coordination) become feasible. The sim-to-real techniques in this chapter — domain randomisation especially — also scale with the throughput; randomising over 100,000 simultaneous environments produces dramatically more robust policies than randomising over 100.
The simulators that scale
Not all simulators support massive parallelism well. Brax and Isaac are the dominant choices. ManiSkill 3 (2024) added a parallel-manipulation focus. Genesis claims another order of magnitude over Isaac on the same workloads. The common architectural pattern is: physics state lives in GPU memory, all per-environment computations are SIMD across environments, and the policy network shares the GPU with the simulation rather than running on a separate device. Simulators that don't fit this pattern (Drake, classic PyBullet) are fast for single environments but cannot match the parallel throughput of the GPU-native ones.
Frontier: Generative Sim and Closing the Loop
The current frontier is the use of generative models — video generators, world models, neural physics — as components of the simulation pipeline itself. The boundary between "simulator" and "learned dynamics model" is blurring fast, and the implications for sim-to-real are still being worked out.
World models as simulators
If you train a model to predict the next observation given the current observation and action, you have, in effect, trained a simulator from data. The DreamerV3 line of work (covered in Part IX Ch 05) uses this kind of world model as the substrate for model-based RL: predict in latent space, plan against the prediction, only occasionally check predictions against reality. Translating that to robotics, the question is whether a learned world model trained on real robot data can replace the analytical simulator entirely. Early answers (RoboDreamer, several 2024–2025 papers) suggest yes for short-horizon tasks; the verdict on long-horizon tasks is still pending.
Generative video as simulation
A more aggressive version: treat video generation as the simulator. Genie (DeepMind, 2024), Sora (OpenAI), and the Cosmos line (NVIDIA, 2025) are video-generation models capable of producing plausible continuations of a scene given an initial image and a control signal. If the videos are physically plausible enough, a policy can train against them — reading observations from the generated video, predicting actions, and using a separate scoring model to compute rewards. The technique is at the proof-of-concept stage in 2025–2026 but the trajectory is steep, and it is widely expected to be a major component of robot training pipelines by the end of the decade.
Neural physics
A third direction: replace specific components of an analytical simulator with learned approximations. Neural physics models for cloth, fluids, granular media, and deformable contact are now competitive in fidelity with classical solvers and dramatically faster on GPU. Production simulators are starting to incorporate them as drop-in components for the regimes where classical physics is too slow or too brittle (cloth manipulation, food handling, granular pouring). The sim-to-real implications follow the standard pattern: a neural physics module that better matches reality reduces the gap to be closed by other techniques.
Real2sim2real at foundation scale
The longer-term vision composes everything in this chapter into a single foundation-scale loop: large fleets of deployed robots collect real data continuously, the data feeds a world model and updates analytical simulators, foundation-scale policies are trained against the combined sim, and the policies deploy back to the fleet — closing the loop. This is the strategy of Physical Intelligence, Skild, 1X, and several other 2024–2026 robotic-foundation-model companies. Whether the loop closes the way the analogous loop did in language modelling is still an open empirical question, but the bet is being made aggressively.
What this chapter does not cover
The vision-language-action models that the foundation-scale training loop produces belong to Chapter 03 (Learning from Demonstration & Imitation) and Chapter 05 (Foundation Models for Robotics). The autonomous-driving simulators (CARLA, Waymo's simulator, Tesla's simulator stack) and the safety-critical scenario-authoring techniques they require belong to Chapter 06 (Autonomous Vehicles), where the sim-to-real problem becomes a regulatory and legal one as well as a technical one.
Sim-to-real is the layer at which the breadth of simulation meets the constraints of the real world. The classical techniques of this chapter — domain randomisation, system identification, hybrid training — remain the operational core of every successful deployment. The frontier techniques — generative simulation, world models, neural physics — are reshaping that core in real time. A practitioner who understands both axes can decide for any given task whether to lean on the classics or reach for the frontier.
Further Reading
- Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. The original domain-randomisation paper. Establishes the technique on a robot localisation task and presents the empirical results that launched it into mainstream sim-to-real practice. Short, clear, and still the right entry point. The single paper that most reshaped sim-to-real practice.
- Solving Rubik's Cube with a Robot Hand. The headline result that established Automatic Domain Randomisation. The technical paper describes the curriculum that adaptively widens randomisation ranges, the engineering required to make it work on the Shadow Hand, and the empirical lessons. Reading it in full grounds the rest of this chapter's randomisation discussion. The reference for what large-scale, well-tuned domain randomisation can do.
- MuJoCo: A Physics Engine for Model-Based Control. The original MuJoCo paper. Covers the soft-contact model that underlies most of academic robotics simulation today, plus the design philosophy that prioritised simulation speed over physical exactness. Pairs well with the Drake or Bullet documentation for contrast. The clearest single explanation of the MuJoCo design choices.
- Isaac Lab Documentation. Official documentation for NVIDIA's Isaac Lab, the dominant production training environment for robot RL as of 2026. Covers GPU-parallel simulation, photo-realistic rendering, the standard task library, and the integration with PPO and other RL algorithms. Reading the tutorials is the fastest way to actually train a sim-to-real policy. The current de facto training infrastructure.
- Sim-to-Real: Learning Agile Locomotion for Quadruped Robots. The early Google quadruped sim-to-real paper, which established the recipe of system identification, dynamics randomisation, and actuator modelling that became standard for legged-robot RL. The paper is unusually concrete about engineering details. The reference recipe for legged-robot sim-to-real.
- Brax: A Differentiable Physics Engine for Large Scale Rigid Body Simulation. The Brax paper. Describes the JAX-based differentiable simulator that demonstrated GPU-parallel robot simulation at scales CPU-based simulators could not match. The technical sections on the simulator design and the training results are the right entry to differentiable physics for ML practitioners. The branch point for differentiable, GPU-native robotics simulation.
- Learning Quadrupedal Locomotion over Challenging Terrain. The ETH ANYmal sim-to-real paper that produced the legged-robot videos most readers will have seen: climbing stairs, crossing rubble, walking on grass, all from a policy trained entirely in simulation. The paper is a clean worked example of the techniques in this chapter applied at production quality. The most widely cited demonstration that sim-to-real works for serious robotics.
- DreamerV3: Mastering Diverse Domains through World Models. The most polished entry in the world-model lineage that informs Section 10's frontier discussion. Covers the architecture, the training procedure, and the empirical results that show a single learned world model can substitute for analytical simulators on a range of tasks. The natural reading bridge from this chapter into Part IX Ch 05's model-based RL material. The reference for using learned world models as simulators.