Foundation Models for Robotics: the GPT moment, on a robot.

For sixty years, robot policies were specialists — one model, one task, one robot. Between 2022 and 2026 that changed. Foundation-scale vision-language-action models trained on internet-scale text and images plus millions of robot trajectories now drive a single network that can pick up an unfamiliar object on an unfamiliar table when asked in plain English. This chapter covers how that change happened, what the leading models do, and where the field is heading.

Prerequisites & orientation

This chapter assumes familiarity with the imitation-learning material of Chapter 03 and the sim-to-real techniques of Chapter 04. It also assumes a working understanding of large language models (Part VI Ch 06–07) and vision-language models (Part VII Ch 06), since vision-language-action models are essentially VLMs with an action head. Diffusion (Part X Ch 04) and autoregressive generation (Part X Ch 05) come up in the architecture sections.

Two threads run through the chapter. The first is the VLM-to-VLA recipe: take a pretrained vision-language model, attach an action head, fine-tune on robot data, and inherit the VLM's world knowledge as a prior over robot behaviour. The recipe is the architectural backbone of nearly every model in this chapter, and it explains the timing — why robotics foundation models took off shortly after vision-language models reached production maturity. The second thread is the generalist-vs-specialist trade-off: foundation policies handle a wide range of tasks adequately, while task-specific policies handle a narrow range expertly. The trade-off is shifting fast toward foundation policies, but it has not yet flipped completely.

01

Why Foundation Models Came to Robotics

Robotics had been the field most resistant to the foundation-model paradigm — until it wasn't. The transition between 2022 and 2024 was driven by three trends arriving simultaneously: the maturity of vision-language models, the emergence of cross-embodiment robot datasets large enough to fine-tune on, and the discovery that VLMs already contained most of what a robot policy needs to know.

The data wall robotics had been hitting

For most of robotics' history, every new task meant a new dataset, a new policy, and a new training run. The most data-rich domains in the field — autonomous driving, certain industrial-manipulation cells — had carefully curated proprietary datasets with millions of trajectories, but they were narrow. The general-purpose datasets that powered the language and vision revolutions had no robotics analogue. Without scale, the architectural innovations that worked elsewhere kept under-delivering on robots: a transformer trained on 50,000 manipulation trajectories looked very similar to a CNN-MLP trained on the same data.

The missing ingredient was internet-scale priors. A robot learning to grasp a coffee mug from a hundred demonstrations cannot generalise to a thermos because nothing in those demonstrations ever connected the visual category "thermos" to the manipulation category "graspable cylindrical container." A pretrained vision-language model knows that connection from billions of internet image-text pairs, even though it has never controlled a robot. The question that drove the 2022–2024 transition was: can a robot policy borrow that knowledge?

The answer that turned out to be yes

RT-1 (2022) and RT-2 (2023), both from Google DeepMind, were the proofs of concept. RT-1 showed that a transformer-based policy trained on 130,000 manipulation trajectories — orders of magnitude more than typical robot datasets at the time — could handle hundreds of distinct tasks with reasonable success and showed early signs of cross-task transfer. RT-2 went further: by jointly training on robot data and the VLM's original internet-scale image-text data, it inherited the VLM's semantic generalisation, allowing it to handle commands and objects it had never seen during robot training. The semantic breadth of a VLA was qualitatively different from anything a from-scratch robot policy could match.

The pattern reset what "good" meant in robotics. Within eighteen months, OpenVLA, π0, RDT, and a half-dozen other vision-language-action models followed, each using different base VLMs and different action heads but all building on the same insight: the policy is not a from-scratch network trained on robot data, it is a fine-tune of a pretrained vision-language model. Foundation models had arrived in robotics, and the rest of the chapter is about how they actually work.

Why It Took So Long

The VLM-to-VLA recipe seems obvious in retrospect. In practice it required three preconditions that only converged around 2022–2023: vision-language models good enough to be worth fine-tuning, large enough robot datasets (Open X-Embodiment) to make the fine-tune meaningful, and enough compute infrastructure (TPU pods, large-scale GPU clusters) to do both training stages at scale. Each of the three had a long lead time, and the field changed once they were all available.

02

The VLA Recipe: From VLM to VLA

Nearly every successful robotic foundation model in 2024–2026 is built from the same recipe: take a pretrained vision-language model, attach a way to produce actions, fine-tune on robot data. The recipe has variants, but its core structure is shared, and understanding it makes the entire VLA literature legible at once.

The three ingredients

A vision-language-action model takes one or more camera images plus a language instruction as input and produces a sequence of robot actions as output. The architecture has three logical parts. The vision encoder turns images into visual tokens — typically a ViT or SigLIP-style encoder, often pretrained and frozen during fine-tuning. The language-vision backbone is a transformer that processes the visual tokens together with the tokenised language instruction, producing contextualised representations that capture both what the robot sees and what it has been asked to do. This is where most of the model's parameters live, and it is what gets inherited from the pretrained VLM. The action head finally maps the backbone's output to robot actions; this is the only component that has to be trained from scratch on robot data.

[Figure: the canonical VLA pipeline. Camera RGB images feed a vision encoder (SigLIP / ViT); the language instruction ("pick up X") feeds a tokenizer (BPE / SentencePiece); both token streams enter the VLM backbone (PaLI / Llama / Gemma, ~1B–10B params, mostly frozen); an action head (trained from scratch) emits the action chunk (N × 7-DOF poses).]
The canonical VLA architecture. A pretrained vision encoder turns images into tokens; a tokeniser turns the language instruction into tokens; the VLM backbone (the largest component, inherited from a pretrained vision-language model) processes both jointly; an action head maps backbone outputs to robot actions. The action head is the only component that strictly requires training from scratch on robot data.
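The three-part decomposition can be sketched in code. This is a deliberately toy skeleton, not any real model's implementation: the class, the shapes, and the random stand-ins for the pretrained components are all illustrative, chosen only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyVLA:
    """Illustrative three-part VLA: vision encoder -> backbone -> action head."""
    def __init__(self, d_model=64, action_dim=7, chunk=8):
        self.d_model, self.action_dim, self.chunk = d_model, action_dim, chunk
        # The action head is the only part trained from scratch on robot data.
        self.head = rng.normal(0, 0.02, (d_model, chunk * action_dim))

    def vision_encoder(self, image):
        # Stand-in for a frozen ViT/SigLIP encoder: image -> visual tokens.
        return rng.normal(size=(16, self.d_model))

    def tokenize(self, instruction):
        # Stand-in for BPE/SentencePiece: text -> token embeddings.
        return rng.normal(size=(len(instruction.split()), self.d_model))

    def backbone(self, tokens):
        # Stand-in for the pretrained VLM transformer: pooled representation.
        return tokens.mean(axis=0)

    def __call__(self, image, instruction):
        # Visual and language tokens are processed jointly by the backbone.
        tokens = np.concatenate([self.vision_encoder(image),
                                 self.tokenize(instruction)])
        rep = self.backbone(tokens)
        # The action head maps the representation to a chunk of actions.
        return (rep @ self.head).reshape(self.chunk, self.action_dim)

policy = ToyVLA()
actions = policy(image=None, instruction="pick up the red cup")
print(actions.shape)  # (8, 7): a chunk of 8 seven-dimensional actions
```

The point of the sketch is the asymmetry: everything except `self.head` would be inherited from the pretrained VLM.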

Action representation choices

The action head's design is the single most consequential choice in a VLA. Three families dominate. Discrete action tokens (RT-1, RT-2, OpenVLA) discretise each action dimension into ~256 bins and predict each bin autoregressively as just another vocabulary token. The advantage is that the action head reuses the VLM's existing token-prediction head — no architectural changes — at the cost of token-level precision and slow autoregressive inference. Continuous regression heads output action vectors directly via a small MLP. They are faster at inference but discard the VLM's mature decoding machinery. Diffusion or flow-matching heads (π0, RDT) treat the action chunk as the output of a small diffusion process conditioned on the backbone's representation; this handles multimodality cleanly and is the dominant frontier choice in 2025–2026.

Fine-tuning vs. co-training

The recipe has two main sub-flavours. In fine-tuning, the VLM is trained on internet data first, then on robot data — robot training is a separate phase. In co-training, the VLM is trained jointly on robot data and a continuing stream of internet data, mixed in some ratio. Co-training (the RT-2 contribution) preserves the VLM's semantic generalisation through the robot fine-tune and is now the dominant approach for large-scale models. Pure fine-tuning is still common for smaller open-source efforts where the compute for co-training is prohibitive.

03

RT-1 and the Action-Tokenization Idea

RT-1 was the model that showed manipulation policies could absorb scale. It was not yet a vision-language-action model — it had no internet pretraining — but it established the architecture, the action-tokenisation trick, and the dataset (Robotics Transformer Data) that later VLAs would build on.

The architecture

RT-1 (Brohan et al., 2022) is a transformer-decoder policy with about 35 million parameters. The input is a sequence of recent camera images plus a natural-language instruction. Images go through an EfficientNet vision encoder; the resulting tokens are conditioned on the instruction via a FiLM layer; a transformer decoder then produces action tokens autoregressively. Each robot action — a 7-DOF end-effector pose plus a gripper command — is discretised into 256 bins per dimension, so an action becomes 8 tokens. The model emits one action per step as a group of 8 tokens.

The architecture itself is unremarkable. What was new was the dataset: 130,000 teleoperated trajectories collected on 13 robots over 17 months, covering 700+ tasks. RT-1 is what you get when you take an ordinary transformer and feed it more robot data than anyone had ever fed a transformer before. The performance jump over previous baselines was substantial enough — and the cross-task generalisation strong enough — that the rest of the field paid attention.

Why discrete action tokens stuck

Discretising actions into bins and predicting bin indices as just-another-token-from-a-vocabulary was the architectural move that allowed VLAs to inherit decoder machinery from language models without changes. The cost was per-step quantisation error bounded by the bin width (about 0.4% of each dimension's range with 256 bins) — usually lost in the noise of the trajectory's other imperfections. The benefit was that the same forward pass that generates "pick up the red cup" can generate the action sequence, and the same training infrastructure (cross-entropy loss, teacher forcing) just works. Every later autoregressive VLA has used some variant of this discretisation, sometimes with finer or learned bin boundaries.
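The discretisation itself is a few lines. A minimal sketch, assuming 256 bins and a normalised per-dimension action range of [-1, 1] (both illustrative choices; real models normalise per-dataset):

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # per-dimension action range (illustrative)

def discretize(action):
    """Map continuous action dimensions to bin indices 0..N_BINS-1."""
    frac = (np.asarray(action) - LOW) / (HIGH - LOW)
    return np.clip((frac * N_BINS).astype(int), 0, N_BINS - 1)

def undiscretize(tokens):
    """Map bin indices back to bin-centre continuous values."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

# A 7-DOF end-effector pose plus gripper command -> 8 tokens, as in RT-1.
action = np.array([0.13, -0.52, 0.99, 0.0, 0.7, -0.7, 0.25, 1.0])
tokens = discretize(action)
recovered = undiscretize(tokens)
err = np.abs(recovered - action).max()
print(tokens.shape, err)  # 8 tokens; worst-case error is half a bin width
```

The round trip makes the cost concrete: the reconstruction error is bounded by half a bin, i.e. about 0.2% of the range per dimension at 256 bins.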

Cross-task transfer and early generalisation

The most-quoted result from RT-1 was that a single network trained jointly across 700+ tasks performed better on each individual task than a smaller model trained on that single task — the reverse of what you would expect if tasks were independent. This was the first concrete evidence in robotics that scale was producing the same kind of cross-task transfer that scale had produced in language. It set up RT-2 to ask the question of whether scale could be borrowed from outside robotics entirely.

04

RT-2 and Co-Training with Internet Data

RT-2 was the moment robotics caught up with the foundation-model paradigm. By co-training on robot data and on the same internet-scale vision-language data the underlying VLM was originally pretrained on, the model retained its semantic understanding through the robot fine-tune — and gained the ability to follow instructions and recognise objects it had never seen during robot training.

What RT-2 actually did

RT-2 (Brohan et al., 2023) replaced RT-1's lightweight backbone with a full pretrained vision-language model — either PaLI-X (55 billion parameters) or PaLM-E (12 billion parameters). The action head was kept simple: actions were tokenised into the VLM's existing vocabulary by reusing infrequently-used token IDs as action bins. The model thus emitted actions as ordinary tokens during inference, exactly as it would emit a caption or an answer. The key training trick was co-training: instead of fine-tuning the VLM on robot data and forgetting its prior knowledge, mix robot data with the VLM's original training data in roughly a 1:1 ratio during the fine-tune. The mixture preserves the VLM's vision-language understanding while teaching it to predict actions.
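The co-training mixture reduces to a weighted sampler over data sources. A sketch under stated assumptions: the batch structure, the placeholder samples, and the exact 1:1 mixing probability are illustrative stand-ins for the RT-2-style recipe, not its actual pipeline.

```python
import random

def cotraining_batches(robot_data, vlm_data, robot_frac=0.5,
                       batch_size=8, seed=0):
    """Yield mixed batches: each element is drawn from robot data with
    probability robot_frac, else from the VLM's original internet data,
    approximating the roughly 1:1 co-training mixture."""
    rng = random.Random(seed)
    while True:
        batch = [rng.choice(robot_data) if rng.random() < robot_frac
                 else rng.choice(vlm_data)
                 for _ in range(batch_size)]
        yield batch

# Toy samples: (image, text prompt, target tokens).
robot = [("img", "pick up cup", "ACTION_TOKENS")] * 100
vlm = [("img", "what is in the image?", "a cup")] * 100

gen = cotraining_batches(robot, vlm)
batch = next(gen)
print(len(batch))  # 8 mixed examples per batch
```

Because actions are emitted as ordinary vocabulary tokens, both kinds of example train the same cross-entropy loss; only the data mixture changes.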

Semantic generalisation

The empirical results were striking. RT-2 could handle commands and objects entirely absent from its robot training data. Asked to "move the apple to the can of energy drink," the model identified the energy-drink can on the table — even though the robot training data had never contained an energy-drink can — by leveraging the underlying VLM's knowledge of what energy drinks look like. The cross-task and cross-object generalisation was qualitatively different from RT-1's, and it was attributed entirely to the VLM prior.

The other surprise was the ability to do chain-of-thought robot actions: when prompted to reason about a multi-step task, RT-2 could emit a brief textual plan ("first I need to clear the cup, then reach for the bowl") interleaved with action tokens. This was not engineered; it emerged from co-training with chain-of-thought-style language data and using the VLM's existing reasoning capacity. RT-2 thus became the first robot policy that could plausibly be said to "think" about the task before acting.

Limitations and what RT-2 did not solve

RT-2 was not perfect. Its action precision was bounded by tokenisation quantisation, its inference speed was slow (1–3 Hz on production hardware) because of the autoregressive decoding through a multi-billion-parameter backbone, and its capability profile was narrower than the marketing suggested — long-horizon tasks and fine manipulation remained outside its reliable range. The contribution was structural rather than complete: a recipe and a proof of concept that subsequent models (OpenVLA, π0, RDT) would refine into production-ready systems.

05

OpenVLA and the Open-Source Lineage

RT-2 was closed. OpenVLA, released by a Stanford / UC Berkeley collaboration in mid-2024, was the first vision-language-action model with weights, training code, and data fully open — and it became the baseline that the entire academic and small-team ecosystem built on.

The architecture

OpenVLA is a 7-billion-parameter model built on a Llama-2-7B language backbone, with vision provided by a fused DINOv2 + SigLIP encoder pair. Actions are tokenised into 256 bins per dimension as in RT-2, and predicted autoregressively. The training data is the Open X-Embodiment dataset (Section 7), filtered down to a subset of about 1 million trajectories from manipulation embodiments. Training took about 14 days on 64 A100 GPUs — small enough for a well-funded academic group, large enough to be inaccessible to a single researcher, and a useful reference point for the cost of training a VLA from scratch in 2024.

Performance and use as a base model

On in-distribution tasks (manipulation tasks similar to those in Open X-Embodiment), OpenVLA matched or exceeded RT-2 on benchmark splits. More importantly, it was designed for downstream fine-tuning: the paper provided clean recipes and showed that fine-tuning OpenVLA on a few hundred new trajectories produced strong task-specific policies, with the foundation model's prior knowledge clearly improving the fine-tune over from-scratch baselines.

The release reshaped the academic robotics-foundation-model landscape. Within months, the OpenVLA repository had hundreds of forks with task-specific fine-tunes, distillations to smaller backbones, and quantised inference variants. By early 2026 it was the de facto starting point for any academic VLA project — the "Llama-2 of VLAs" — and a benchmark every commercial model was measured against.

The wider open ecosystem

OpenVLA was not the only open VLA, but it was the most influential. Other notable entries: RDT-1B (Robotics Diffusion Transformer, Tsinghua, 2024), which used a diffusion action head instead of autoregressive tokens; Octo (Berkeley, 2024), a smaller 93M-parameter generalist trained on Open X-Embodiment with a transformer backbone; NORA (Carnegie Mellon, 2025), an open VLA optimised for inference latency. Each fills a niche, and most academic robotics-foundation-model work in 2025–2026 builds on one of them rather than RT-2-style closed models.

LoRA and the efficient-fine-tuning workflow

The dominant pattern for adapting OpenVLA to a new task is LoRA fine-tuning — low-rank adaptation that trains only a tiny fraction of the parameters while keeping most of the model frozen. LoRA fine-tunes can be done on a single GPU in hours, store as small adapter weights (~50 MB), and stack with each other. The result is a workflow where a single base OpenVLA serves as a foundation, and individual deployments swap in task-specific or robot-specific LoRA adapters as needed. This workflow has become standard for production deployments of open VLAs.
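The LoRA idea reduces to replacing each frozen weight matrix W with W + BA, where the low-rank factors B and A are the only trained parameters. A minimal numpy sketch; the hidden size and rank are illustrative, and a real fine-tune would apply this per attention projection via a library such as PEFT:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8  # hidden size and LoRA rank (illustrative)

W = rng.normal(size=(d, d))        # frozen base weight (never updated)
A = rng.normal(0, 0.01, (r, d))    # trained: rank-r down-projection
B = np.zeros((d, r))               # trained: zero-init so W + BA == W at start

def lora_forward(x):
    # Frozen base path plus the low-rank adapter path.
    return x @ W.T + x @ A.T @ B.T

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.3%}")  # ~1.6%
```

The trainable fraction is 2rd/d² = 2r/d, which is why the adapters stored per task stay in the tens of megabytes even for a 7B base model.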

06

π0 and Flow-Matching VLAs

Physical Intelligence's π0 (released October 2024) is the most influential commercial VLA of the 2024–2026 period. It moved away from RT-2's autoregressive action tokens and instead used flow matching as the action representation, producing dramatically smoother and faster action sequences. π0 set the bar for production-deployed VLAs and influenced nearly every commercial model that followed.

What flow matching does for actions

Where RT-2 and OpenVLA emit action tokens autoregressively (one bin at a time across all action dimensions), π0 emits an entire action chunk in a single denoising pass via flow matching — a continuous-action analogue of diffusion. The action chunk (typically 50 future actions) is treated as a vector in continuous space; the model learns a velocity field that, integrated over a few denoising steps, transports a sample from random noise to a valid action sequence. The technique sits between standard diffusion (which would require many more denoising steps) and direct regression (which produces unimodal outputs).

The benefits are substantial. Inference is much faster than autoregressive token generation — typically 2–3 denoising steps per chunk versus hundreds of token decoding passes. Action precision is no longer bounded by quantisation. And the multimodality of flow matching naturally captures situations where multiple valid action sequences exist (the same trajectory can be approached from several angles). The trade-off is implementation complexity: flow matching is mathematically more involved than discrete tokens, and the training procedure has more knobs to tune.
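The mechanics can be seen in toy form. With the straight-line probability path commonly used in flow matching, the target velocity transports noise x0 toward the data x1, and inference is a few Euler steps along the learned field. In the sketch below an oracle velocity field stands in for the trained, backbone-conditioned network; the chunk size matches the text, but the step count and the closed-form field are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK, DOF = 50, 7  # action-chunk length and per-action dimension

target_chunk = rng.uniform(-1, 1, (CHUNK, DOF))  # the "true" action chunk

def velocity(x, t, x1):
    """Oracle velocity field for a straight path: v = (x1 - x) / (1 - t).
    In a real VLA this is a network conditioned on the backbone output."""
    return (x1 - x) / (1.0 - t)

def sample_chunk(n_steps=3):
    x = rng.normal(size=(CHUNK, DOF))  # start from Gaussian noise
    for k in range(n_steps):
        t = k / n_steps
        x = x + velocity(x, t, target_chunk) / n_steps  # Euler step
    return x

chunk = sample_chunk()
print(np.abs(chunk - target_chunk).max())  # ~0: noise transported to actions
```

The whole 50-action chunk is produced in three forward passes, versus hundreds of sequential token decodes for an autoregressive head of the same chunk length.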

The architecture and training

π0 is a 3-billion-parameter model with a PaliGemma vision-language backbone and a flow-matching action head. The training data is a mix of Open X-Embodiment plus a substantial proprietary dataset of teleoperated demonstrations across multiple Physical Intelligence robots and customer deployments. Co-training with internet vision-language data continues throughout the robot fine-tune, in the RT-2 tradition. Training compute is large but not enormous — comparable to a moderate-scale language-model training run — because the bulk of the model's capability comes from the pretrained backbone.

π0-FAST and the iteration cadence

Physical Intelligence has iterated on π0 at a pace closer to language-model startups than to traditional robotics labs. π0-FAST (early 2025) introduced the FAST action tokeniser, which compresses action chunks into short discrete token sequences and cut training cost by roughly 5×, making autoregressive decoding competitive with the flow-matching head. π0.5 (mid-2025) extended the model to mobile manipulation and home-robot deployments, with substantially expanded training data. The cadence is in part a consequence of the foundation-model recipe: once the data flywheel is running, each iteration is a fine-tune-and-evaluate cycle rather than a full re-engineering.

Why π0 mattered beyond the technical contribution

π0 was the first VLA that visibly worked in the real world to a non-specialist audience. The video demos — folding laundry, making coffee, packing a box, all driven by language commands — were qualitatively different from anything previously shown publicly. The model became the existence proof that the foundation-model approach could produce useful robot policies, and it accelerated investment and competitive entry across the sector. In 2026 most serious commercial robotics-foundation-model efforts (Skild, 1X, Figure, Tesla Optimus's policy stack) use a flow-matching or diffusion action head in the π0 lineage, regardless of which other architectural choices they make.

07

Training Data: Open X-Embodiment, DROID, and Friends

A foundation model is only as good as its training data. The breakthrough that made VLAs possible was not a single architectural insight; it was the appearance of cross-embodiment robot datasets large enough to support foundation-scale fine-tuning. This section catalogues the major data sources and the practical question of how to mix them.

Open X-Embodiment

The dominant academic dataset is Open X-Embodiment (released 2023, expanded continuously since), a collaborative collection of teleoperated demonstrations contributed by 34 academic robotics labs. The dataset spans 22 different robot embodiments (single-arm manipulators, bimanual setups, mobile manipulators, even a few legged platforms) and covers more than a million trajectories across 60+ source datasets. The data is deliberately heterogeneous — different robots, different tasks, different action representations — and the central empirical finding from the accompanying RT-X paper was that joint training across the heterogeneity outperformed training on any single embodiment alone, even when evaluated on tasks from that embodiment. Open X-Embodiment is the closest thing in robotics to ImageNet — the dataset that defined a generation of foundation models.

DROID

DROID (2024) addresses a complementary problem: where Open X-Embodiment trades consistency for breadth, DROID emphasises consistency. 76,000 trajectories collected on identical Franka Emika Panda arms distributed across 13 institutions, all with the same setup procedure, the same calibration, and the same recording format. The result is a smaller but substantially cleaner dataset, and it is the standard fine-tuning target when downstream deployments will run on Franka hardware. Most VLA papers in 2024–2026 report results on both Open X-Embodiment and DROID benchmark splits.

Commercial datasets

The proprietary side has grown faster than the academic side. Physical Intelligence, Skild, Tesla, Figure, 1X, and Google's robotics group all maintain internal demonstration datasets that are reportedly an order of magnitude larger than Open X-Embodiment. The data flywheel arguments of Chapter 03 apply here: deployed fleets generate continuous data, internal teleoperation studios produce focused task data, and customer deployments contribute use-case-specific demonstrations. The proprietary data is rarely public, but its scale and diversity are what give commercial VLAs their visible edge over academic ones in 2025–2026.

Mixing and weighting

A foundation VLA is rarely trained on a single dataset. The standard recipe is to combine multiple sources with task-specific weights, oversampling embodiments and tasks that are similar to the deployment target. Choosing the weights is more art than science — too much weight on the deployment embodiment causes overfitting to its quirks; too much weight on the wider mixture loses the specialist's edge. Modern VLA training pipelines treat the mixing weights as a hyperparameter to tune, often with separate weights for the pretraining and fine-tuning phases.
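One common way to structure the "art" of weight-choosing is size-tempered sampling with a boost for the deployment embodiment. Everything in this sketch is an assumption for illustration: the dataset names, the temperature `alpha`, and the `target_boost` factor are hypothetical knobs, not a published recipe.

```python
import numpy as np

datasets = {  # name -> (num_trajectories, matches deployment embodiment?)
    "open_x_subset": (1_000_000, False),
    "droid":         (76_000,    True),   # deployment runs on Franka hardware
    "internal":      (20_000,    True),
}

def mixture_weights(datasets, alpha=0.5, target_boost=3.0):
    """Size-tempered sampling weights: size**alpha flattens the raw size
    skew; target_boost oversamples data matching the deployment target."""
    w = np.array([(n ** alpha) * (target_boost if match else 1.0)
                  for n, match in datasets.values()])
    return w / w.sum()

w = mixture_weights(datasets)
for name, p in zip(datasets, w):
    print(f"{name}: sampled with probability {p:.2f}")
```

With alpha below 1, the million-trajectory source no longer drowns out the smaller, deployment-matched ones, which is exactly the failure mode the weighting exists to prevent.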

Data Diversity Beats Data Volume

The empirical finding everyone in the field eventually arrives at is that diversity matters more than volume past a certain threshold. A million trajectories from 22 embodiments outperform ten million from one. The practical implication is that data-collection effort should optimise for breadth — different robots, different objects, different scenes, different operators — rather than just volume. Most early VLA failures trace back to over-narrow data more than to architectural problems.

08

Capability Profile: What Generalist Policies Actually Do

A clear-eyed view of what foundation-scale robot policies actually do — and don't do — well is essential for deciding whether to use one. The marketing has been ahead of reality at every step; the reality is good but specific.

What they do well

VLAs in 2025–2026 reliably handle the following kinds of tasks. Single-step manipulation under language instruction: "pick up the red cup and put it on the plate" works, including when the model has not seen the specific cup before. Object-in-scene generalisation: the model finds and manipulates objects in cluttered environments, identifying targets from natural-language descriptions including referential ones ("the smaller of the two mugs"). Mid-horizon multi-step tasks: tasks requiring 5–15 sequential steps work reasonably well when each step is a simple manipulation; success rates degrade gracefully as horizon length grows. Adapting to novel objects: for objects within the rough training distribution (kitchen items, office supplies, household tools), zero-shot transfer is the rule rather than the exception.

Where they still struggle

The weaknesses are equally specific. Long-horizon tasks (50+ steps): the success rate drops sharply, often below practical-utility thresholds. Fine manipulation requiring sub-millimetre precision: assembly, threading, delicate insertion remain hard. The action representation's effective resolution is often the bottleneck. Tasks requiring sustained reasoning across steps: "fold this shirt and then look for stains and report any you find" tests the model's ability to remember a complex plan and respond to novel observations along the way; current VLAs are inconsistent here. Out-of-distribution physics: liquids, granular media, and deformable objects (cloth, dough) are hard for VLAs trained primarily on rigid-body manipulation, even when the model has seen videos of these materials in its VLM pretraining.

Failure modes to watch for

A few characteristic failure modes are worth knowing. VLAs tend to hallucinate plausible-looking but wrong actions when they are uncertain — picking up something nearby when the asked-for object is not visible, for example. They have short effective memories compared with what their context windows would suggest — the policy weights what happened recently more than what was said three minutes ago, which can cause drift on long tasks. And they have poor calibration of their own confidence: a VLA does not reliably know when it is about to fail, so the standard production pattern is to wrap a foundation policy in an external verifier (often a smaller model) that catches likely-bad actions before execution.
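The verifier-wrapper pattern mentioned above is straightforward to express. A sketch with hypothetical function names throughout (`safe_step`, the toy policy, and the toy verifier are all illustrative, not any production system's API):

```python
def safe_step(policy, verifier, observation, instruction,
              fallback_action=(0.0,) * 7):
    """Run the foundation policy, but gate its action through an external
    verifier; fall back to a safe hold action if the verifier rejects it."""
    action = policy(observation, instruction)
    ok, reason = verifier(observation, action)
    if not ok:
        # Poor self-calibration: the VLA does not know it is about to fail,
        # so an external check catches likely-bad actions before execution.
        return fallback_action, f"vetoed: {reason}"
    return action, "ok"

# Toy policy and verifier for illustration.
policy = lambda obs, instr: (0.9,) * 7
verifier = lambda obs, act: (max(act) <= 0.5, "action magnitude too large")

action, status = safe_step(policy, verifier, observation=None,
                           instruction="pick up the cup")
print(status)  # vetoed: action magnitude too large
```

In practice the verifier is often a smaller learned model rather than a hand-written rule, but the control-flow seam is the same.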

Evaluation

The gap between sim-eval scores and real-deployment scores is even larger for VLAs than for specialist policies, because the language-following capability is genuinely broad and is hard to evaluate without expensive real-world trials. The standard evaluation suite includes BridgeData and Robomimic for in-distribution checks, the LIBERO benchmark for cross-task generalisation, and per-deployment task suites for actual capability assessment. The honest number is the per-deployment task suite; the rest are diagnostic.

09

The Production Architecture

A 7-billion-parameter VLA with a 30-step diffusion head is not the architecture you actually run on a robot. Production deployments use a layered architecture that compresses the foundation model's capability into a fast, reliable inference path while keeping its semantic depth available when needed.

The latency problem

A robot arm wants new commands at 10–30 Hz. A 7B VLA running autoregressive action decoding on a server-class GPU produces actions at 1–3 Hz; a 3B flow-matching VLA gets 10–20 Hz; running anything large on the robot's onboard hardware is typically a non-starter. Production stacks therefore split the work: a slow, capable foundation model running on a server (or cloud) at 1–10 Hz emits coarse plans; a fast, small policy running on the robot at 100+ Hz produces low-level actions that track those plans.

Action chunking as latency hiding

The dominant technique for making VLAs feel responsive at lower inference rates is action chunking: the model emits a sequence of N future actions per inference call, and the robot executes them open-loop while the next chunk is being computed. With a chunk size of 50 actions at 30 Hz, the model only needs to run inference at 0.6 Hz to keep the actuators busy. The trade-off is a longer period of un-corrected behaviour between inferences — which is fine when the world is stable but problematic when it changes. Modern stacks dynamically adjust chunk size based on task velocity.
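The latency arithmetic, and the open-loop buffer it implies, fit in a few lines. The rates and chunk size come from the example above; the buffer class itself is an illustrative sketch.

```python
from collections import deque

CONTROL_HZ = 30   # rate at which the actuators want commands
CHUNK_SIZE = 50   # actions emitted per inference call

# Minimum model inference rate needed to keep the buffer from running dry:
min_inference_hz = CONTROL_HZ / CHUNK_SIZE
print(min_inference_hz)  # 0.6 Hz, as in the text

class ChunkBuffer:
    """Executes a chunk open-loop while the next one is being computed."""
    def __init__(self):
        self.buf = deque()
    def refill(self, chunk):
        self.buf.extend(chunk)          # called when a new chunk arrives
    def next_action(self):
        # Called at CONTROL_HZ; an empty buffer means the robot stalls.
        return self.buf.popleft() if self.buf else None

buf = ChunkBuffer()
buf.refill([f"a{i}" for i in range(CHUNK_SIZE)])
ticks = [buf.next_action() for _ in range(CHUNK_SIZE)]
print(ticks[0], ticks[-1])  # a0 a49: one full chunk executed open-loop
```

The `None` branch is where the trade-off bites: if inference falls behind 0.6 Hz, the robot is left without commands mid-motion.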

Distillation

The other major production technique is distillation: train a smaller, faster model to imitate the foundation VLA's outputs. The teacher (the full foundation model) generates supervised data; the student (a smaller specialist policy) learns to reproduce it. A distilled 100-million-parameter student running at 30 Hz on a Jetson can be 90% as capable as a 3-billion-parameter teacher running at 5 Hz on a server, for the deployment-specific task it was distilled on. Production deployments routinely train per-deployment distilled students from the same foundation teacher, which gives them the foundation model's semantic depth where it matters and a fast specialist where it doesn't.
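Distillation in this setting is ordinary supervised learning with the teacher as the labeller. A toy sketch under stated assumptions: the teacher is a stand-in linear map (in reality a multi-billion-parameter VLA), and the student fit uses least squares where a real pipeline would run gradient descent on a small network.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 32, 7

# Stand-in for the slow foundation teacher.
W_teacher = rng.normal(size=(OBS_DIM, ACT_DIM))
teacher = lambda obs: obs @ W_teacher

# 1. The teacher generates supervised data on deployment-relevant states.
obs = rng.normal(size=(5000, OBS_DIM))
labels = teacher(obs)

# 2. A small fast student is fit to reproduce the teacher's outputs.
W_student, *_ = np.linalg.lstsq(obs, labels, rcond=None)
student = lambda o: o @ W_student

# 3. On the distilled task distribution, the student matches the teacher.
err = np.abs(student(obs) - labels).max()
print(err)  # ~0 on this toy problem
```

The catch the toy version hides is distribution: the student is only guaranteed to match the teacher on states like those in `obs`, which is why students are distilled per deployment rather than once globally.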

The cascade pattern

The most sophisticated production deployments use a three-tier cascade. A high-level reasoning model (often a separate text-only LLM) takes the user's request and produces a plan. The foundation VLA executes each step of the plan, called every few seconds. A fast distilled student handles the within-step low-level control at high rate. Each tier hands off increasingly concrete commands to the next, and each tier can be improved or swapped independently. This is the architecture of most polished customer-facing humanoid robot demos in 2026, even when the marketing talks about "a single foundation model" controlling the robot.
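The three-tier cascade can be sketched as nested loops running at different rates. All names, rates, and the toy tier implementations below are illustrative; the structural point is the handoff of increasingly concrete commands.

```python
def cascade(user_request, planner, vla, student,
            steps_per_plan=3, controls_per_step=5):
    """Three-tier cascade: LLM planner -> foundation VLA -> distilled student.
    Outer loops are slow and semantic; inner loops are fast and concrete."""
    log = []
    plan = planner(user_request)            # tier 1: once per request
    for step in plan[:steps_per_plan]:
        goal = vla(step)                    # tier 2: every few seconds
        for _ in range(controls_per_step):
            log.append(student(goal))       # tier 3: high-rate control
    return log

# Toy tiers for illustration.
planner = lambda req: [f"step-{i}" for i in range(3)]
vla = lambda step: f"goal({step})"
student = lambda goal: f"ctrl[{goal}]"

log = cascade("tidy the table", planner, vla, student)
print(len(log), log[0])  # 15 low-level commands; first is ctrl[goal(step-0)]
```

Because each tier only consumes the output of the tier above, any one of them can be swapped or improved independently, which is the operational appeal of the pattern.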

10

Frontier: Reasoning VLAs and What Comes Next

The current frontier in robotics foundation models is the integration of explicit reasoning, tool use, and multi-modal grounding into the VLA architecture. The 2026 generation of models is starting to look qualitatively different from RT-2 and OpenVLA, and the next two to three years will determine how far the foundation-model paradigm scales in robotics.

Reasoning VLAs

The integration of reasoning — chain-of-thought style — directly into the VLA's action loop is the most visible 2025–2026 frontier. Models like Gemini Robotics-ER (2024–2025) and the Hugging Face SmolVLA-Reasoning lineage explicitly emit tokens that mix natural-language reasoning ("I need to clear the cup first because it is blocking the bowl") with action commands. The reasoning serves as a planning step and provides a debuggable trace; the action commands carry out the plan. Whether reasoning improves real-world performance vs. compute spent elsewhere is an active empirical question, but the architectural pattern is now standard in frontier work.

Tool-using VLAs

A natural extension: let the VLA call external tools — a separate vision model for fine-grained perception, a path planner for kinematic verification, a database for environmental knowledge — in the same way an LLM agent calls tools. The result is a hybrid system where the VLA does the orchestration and high-level decision-making, classical components do the things they are good at, and the seam between them is a tool-call interface. Several 2025 papers (RoboTool, ToolVLA) have explored this; production adoption is accelerating in 2026.

Multi-embodiment, multi-task, multi-modal at foundation scale

The natural endpoint of the cross-embodiment story is a single model trained on every robot, every task, and every input modality. Gemini Robotics (Google, 2025) was the first model marketed in those terms — claiming a single model that controls humanoid, mobile manipulator, and arm embodiments with the same weights, conditioned on language and image input. The claim is empirically defensible on benchmark splits and visibly works in demos; whether it works in arbitrary new deployments is the open practical question. The trajectory is clearly toward more multi-modal, more cross-embodiment foundation models, and toward fewer task-specific specialists.

The data flywheel, robotics edition

The strategic bet driving all of the above is the data flywheel: deployed VLAs collect data from real customers; the data improves the next-generation model; the better model attracts more deployments; repeat. This is the model that worked in language and is being aggressively pursued in robotics. Whether the loop closes the way it did in language depends on questions that are not yet settled — whether robot deployment generates useful training data the way internet text does, whether the long tail of robotic edge cases is bounded, whether the privacy and safety concerns of fleet-wide data collection are tractable. The serious commercial efforts are betting yes; the next two years will resolve the question.

What this chapter does not cover

The autonomous-driving foundation-model lineage — Tesla's FSD policy networks, Waymo's MotionLM, Wayve's GAIA — has been developing in parallel with the manipulation lineage of this chapter, with substantial cross-pollination but separate commercial dynamics. That story belongs to Chapter 06 (Autonomous Vehicles). The agentic LLM context — the broader question of what it means for an AI system to act in the world, plan, use tools — is the subject of Part XI Chapters 01–02. And the alignment and safety questions raised by deployed foundation models with the ability to take physical action belong to Part XVIII.

Foundation models for robotics are the layer at which language understanding meets physical action. The classical specialist policies of earlier chapters remain the right tool for narrow industrial deployments; the foundation models of this chapter are increasingly the right tool for everything else. The transition is not yet complete and may never be — different tasks rationally favour different tools — but the centre of gravity has clearly moved toward foundation-scale models, and the practitioner who can wield them well will be doing the most consequential work in robotics for the foreseeable future.

Further Reading