NVIDIA Maps Robotics' Future on the LLM Evolution Path

Jim Fan, Lead of the Embodied Autonomous Research group at NVIDIA, took the stage at Sequoia Capital's AI Ascent 2026 on May 10 to present a complete roadmap for robotics — what he calls the "Great Parallel." Fan argued that robotics is now replicating the four-stage evolution of large language models, and that the Physical Turing Test will become a reality within 2 to 3 years.

Key takeaways

  • NVIDIA identifies four robotics stages: pre-training, alignment, reasoning, and autonomous research
  • New WAM (World Action Model) paradigm replaces current VLA (Vision-Language-Action) models
  • EgoScale: pre-training on 20,854 hours of human egocentric video yields log-linear dexterity gains
  • DreamDojo: NVIDIA's neural simulator generates training data at over 10 FPS without classical physics
  • Fan predicts: full robotics technology tree completed by 2040

Four stages, one playbook

Fan's analogy is precise. Language models passed through four phases: unsupervised pre-training on raw data, alignment with human intent, step-by-step reasoning, and finally autonomous knowledge generation. Fan argues robotics is now in the first phase of that path — and accelerating.

The key shift is abandoning VLA models in favor of WAM. Fan described Vision-Language-Action models, including NVIDIA's own GR00T N1.5, as "head-heavy" architectures: language models with an action module bolted on. They are strong at object recognition but weak at physics.

WAM flips the priority. Instead of predicting the next token, the model predicts the next physical state: pixels and joint torques simultaneously. NVIDIA's DreamZero is an early realization of this approach.
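
To make the contrast concrete, here is a minimal PyTorch sketch of the WAM idea as Fan describes it: one model that, given the current camera frame and joint state, jointly predicts the next frame's pixels and the next joint torques. Every name and dimension is illustrative; this is not DreamZero's actual architecture.

    # Sketch of a World Action Model: predict the next physical state
    # (pixels + joint torques) instead of the next token. Illustrative only.
    import torch
    import torch.nn as nn

    class WorldActionModel(nn.Module):
        def __init__(self, img_dim=64 * 64 * 3, joint_dim=24, hidden=512):
            super().__init__()
            # Shared encoder over the current observation (frame + joint state)
            self.encoder = nn.Sequential(
                nn.Linear(img_dim + joint_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Two heads trained jointly: next frame and next joint torques
            self.pixel_head = nn.Linear(hidden, img_dim)
            self.torque_head = nn.Linear(hidden, joint_dim)

        def forward(self, frame, joints):
            h = self.encoder(torch.cat([frame, joints], dim=-1))
            return self.pixel_head(h), self.torque_head(h)

    model = WorldActionModel()
    next_frame, torques = model(torch.rand(1, 64 * 64 * 3), torch.rand(1, 24))

A VLA in this framing would keep only the torque head and attach it to a pretrained language backbone; WAM makes the pixel prediction a first-class training target.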

The end of teleoperation as a standard

Fan called for a "moment of silence" for teleoperation, long the gold standard of robot data collection. The problem is fundamental: a human operator can supply at most 24 hours of data per day, robots break down, and the approach does not scale. The millions of hours of data needed for general intelligence are beyond its reach.

NVIDIA's alternative is sensorized human data. The strategy rests on two pillars: the Universal Manipulation Interface (UMI), simple hand-held devices that let humans collect manipulation data with no robot in the loop, and egocentric scaling, training on thousands of hours of first-person human video.

EgoScale confirmed this empirically: pre-training on 20,854 hours of egocentric video revealed a near-perfect log-linear scaling law — more human data translates proportionally to better zero-shot robot dexterity. Teleoperation now accounts for less than 0.1% of the training mix.
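
The claimed law is simple enough to write down: dexterity grows linearly in the logarithm of pre-training hours. The sketch below fits and extrapolates that relationship; the dexterity numbers are invented for illustration, and only the 20,854-hour figure and the 100,000-hour expansion target come from the article.

    # Hypothetical fit of the EgoScale log-linear scaling law:
    # dexterity = a + b * log(hours). Data points below are made up.
    import numpy as np

    hours = np.array([100, 500, 2_000, 8_000, 20_854])
    dexterity = np.array([0.21, 0.34, 0.45, 0.56, 0.64])  # invented success rates

    b, a = np.polyfit(np.log(hours), dexterity, deg=1)  # straight line in log space

    # Extrapolate to the announced 100,000-hour target (valid only if the law holds)
    print(f"predicted dexterity at 100k hours: {a + b * np.log(100_000):.2f}")

The practical consequence is the one Fan emphasizes: if the line keeps holding, capability gains become a budgeting exercise over hours of human video.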

Compute = environment = data

The bottleneck for robotics is the physical world. NVIDIA's answer is neural simulation. DreamDojo — an open-source simulator based on generative video — replaces classical physics equations with a data-driven model. It generates sensor states in real time at over 10 frames per second, enabling reinforcement learning directly in the model's "dream space."
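
Operationally, "RL in the dream space" means the learned simulator takes the physics engine's place inside the rollout loop, so throughput is bounded by compute rather than real-time physics. A minimal sketch, with entirely hypothetical interfaces:

    # Dream-space rollout: a neural world model stands in for a physics
    # engine. Interfaces are hypothetical, not DreamDojo's actual API.
    import torch

    def dream_rollout(world_model, policy, obs, horizon=64):
        """Roll a policy forward entirely inside the neural simulator."""
        trajectory = []
        for _ in range(horizon):
            action = policy(obs)                    # act on the dreamed observation
            obs, reward = world_model(obs, action)  # model predicts next sensor state
            trajectory.append((obs, action, reward))
        return trajectory                           # feed into any standard RL update

    # Stand-ins so the loop runs; in practice both would be learned networks.
    world_model = lambda o, a: (o + 0.1 * a, torch.zeros(()))
    policy = lambda o: torch.tanh(o)
    traj = dream_rollout(world_model, policy, torch.randn(1, 24))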

Fan described this as the equation of a new era: compute creates environments, and environments create data. In the Blackwell GPU era, this means the data limit is no longer physical — it becomes a compute budget.

"Our generation was born too late to explore the earth and too early to explore the stars. But we were born just in time to solve robotics."

Jim Fan, Lead of Embodied Autonomous Research, NVIDIA

Horizon 2040: physical auto-research

Fan sketched a timeline for the "robotics technology tree." The nearest milestone — the Physical Turing Test, meaning physical dexterity indistinguishable from a human — is expected within 2 to 3 years. By 2040, Fan predicts the Physical Auto Research phase: robots designing and building their own next generations.

Current industry data support this ambition. Figure is producing one Figure 03 robot per hour at its BotQ facility. 1X launched its NEO factory in Hayward, California, targeting 100,000 units per year by 2027. Genesis AI unveiled the GENE-26.5 model for dexterous manipulation. NVIDIA supplies compute and foundation models. The pieces are on the table — integration is the open question.

Why this matters

Fan's presentation is a rare event: a lead engineer at the company supplying infrastructure to nearly the entire AI ecosystem publicly declaring the end of one paradigm and the start of the next. VLA models won't disappear overnight, but the direction is set, and it comes from the company selling GPUs to practically every lab in the space.

NVIDIA's strategy is also a direct answer to an economic problem: collecting robot training data via teleoperation does not scale. The new approach — human sensorized data plus neural simulation — makes early-stage training less dependent on robot hardware, lowering the barrier to entry for startups.

If EgoScale's log-linear scaling law holds across domains, the implications are significant: every additional hour of human video will translate to a predictable increase in robot capability. This shifts competition from a hardware race to a race for human data.

What's next

  • DreamDojo is available as open-source — first external benchmarks vs. Isaac Sim and MuJoCo expected within months
  • Fan announced EgoScale expansion toward 100,000+ hours of egocentric video; results will confirm or challenge the scaling law across new domains
  • Robotics Summit & Expo (May 2026) will bring further announcements from NVIDIA partners deploying GR00T N1.5 and DreamZero in commercial pilots
