At CES 2026, Jensen Huang said robotics had reached its “ChatGPT moment.” The line resonated because it points to something real, even if the comparison has limits. ChatGPT made technology visible to the public after years of maturing inside research labs. Physical AI presents a different challenge. The hard part is not building a compelling interface. It is solving a far more demanding technical problem.
When we talk about Physical AI, we mean artificial intelligence systems that operate in the physical world. That includes robots, autonomous vehicles, and any machine that can perceive its environment, make decisions, and act with real consequences. It sounds straightforward, but that definition points to a major shift in how AI systems have to be built.
A traditional language model works in an open loop. It receives text, generates text, and stops there. It may produce a brilliant answer, but it does not have to deal with what happens next in the environment. A Physical AI system works differently. It runs in a closed loop. It acts, changes the environment, observes what changed, and decides what to do next. That loop never stops, and it must also respect the timing constraints imposed by physics.
In robotic manipulation, for example, control often has to run at 20-100 Hz. That leaves only 10-50 milliseconds per inference cycle. The problem shows up immediately. A large model running on edge hardware may take 50-100 milliseconds to respond. That is the central tension. The model that reasons best is often too slow, while the one that meets the timing budget usually gives up capability. Understanding that tradeoff is key to understanding why Physical AI is not solved by simply wiring an LLM to a robot.
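The tension between the control frequency and model latency can be made concrete with a small sketch. This is a minimal closed loop with a hard per-cycle deadline; `policy` and `env` are hypothetical stand-ins for the model and the robot interface, not any real API.

```python
import time

def closed_loop(policy, env, control_hz=50, steps=200):
    """Run a closed perception-action loop with a hard per-cycle budget."""
    budget = 1.0 / control_hz            # e.g. 20 ms per cycle at 50 Hz
    obs = env.reset()
    missed = 0
    for _ in range(steps):
        start = time.monotonic()
        action = policy(obs)             # inference must fit inside the budget
        obs = env.step(action)           # act, then observe what changed
        elapsed = time.monotonic() - start
        if elapsed > budget:
            missed += 1                  # deadline miss: the loop falls behind
        else:
            time.sleep(budget - elapsed) # wait out the rest of the cycle
    return missed
```

A policy that takes 50-100 ms per call misses every deadline at 50-100 Hz, which is exactly the mismatch described above.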
VLA models
Vision-Language-Action models, or VLAs, emerged to address that challenge. At a high level, they bring together three key pieces in a single architecture. The first is a visual component that turns camera images into useful representations. The second is a language component that interprets those representations together with a natural-language instruction. The third is a decoder that turns all of that into actions the robot can execute.
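The three-stage wiring can be sketched as a simple composition. All three components here are hypothetical stand-ins, not any specific published model; the point is only the data flow from pixels and text to actions.

```python
import numpy as np

class VLA:
    """Schematic VLA wiring: vision encoder -> language backbone -> action decoder."""

    def __init__(self, vision_encoder, language_model, action_decoder):
        self.vision_encoder = vision_encoder    # images -> visual features
        self.language_model = language_model    # (features, text) -> latent plan
        self.action_decoder = action_decoder    # latent plan -> continuous actions

    def __call__(self, image, instruction):
        features = self.vision_encoder(image)
        latent = self.language_model(features, instruction)
        return self.action_decoder(latent)
```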
Much of the technical debate centers on that last block. Early approaches, such as Google DeepMind’s RT-2 in 2023, treated actions as if they were text tokens. That was an important proof point because it showed that semantic knowledge acquired during pretraining could transfer to robotic tasks. The drawback is that discretizing continuous actions reduces precision and also limits response speed. For that reason, it is hard to see this approach as the ideal answer for production systems today.
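The precision cost of treating actions as tokens is easy to quantify. RT-2 reportedly discretized each action dimension into 256 bins; the toy round trip below uses that resolution, with an illustrative action range of [-1, 1].

```python
import numpy as np

def discretize(action, low=-1.0, high=1.0, bins=256):
    """Map continuous actions in [low, high] to integer token ids."""
    ids = np.round((action - low) / (high - low) * (bins - 1))
    return np.clip(ids, 0, bins - 1).astype(int)

def undiscretize(ids, low=-1.0, high=1.0, bins=256):
    """Recover the bin-center approximation of the original action."""
    return low + ids / (bins - 1) * (high - low)

# The round trip loses up to half a bin width of precision:
# with 256 bins over [-1, 1], about 0.004 per action dimension.
```

For coarse pick-and-place that error is tolerable; for fine contact-rich manipulation it is not, which is one reason the field moved toward continuous decoders.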
In response, more recent work has moved toward decoders based on diffusion or flow matching. Instead of producing token sequences, these models generate continuous action distributions. That enables smoother trajectories and higher-frequency control.
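Mechanically, a flow-matching decoder produces an action by integrating a learned velocity field from noise toward the action distribution. This is a bare Euler-integration sketch; `velocity_field(x, t)` stands in for the trained network and is not any real library call.

```python
import numpy as np

def flow_matching_decode(velocity_field, action_dim, steps=10, seed=0):
    """Decode one action sample by integrating a velocity field
    from Gaussian noise at t=0 toward the data distribution at t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_dim)        # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_field(x, i * dt) # follow the learned flow
    return x
```

Because the output lives in continuous space rather than a token vocabulary, there is no quantization error, and a handful of integration steps can be cheap enough for high-frequency control.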
The dominant architecture no longer tries to make a single model do everything at once. Today, the more common pattern is to split responsibilities. A slower component, usually backed by a large VLM, handles scene understanding and high-level planning. Then a faster component takes that latent representation and produces continuous actions at the hardware's required frequency. NVIDIA GR00T N1, Figure AI's Helix, and Gemini Robotics On-Device all follow variations of that same logic. The core idea is simple. Each part of the system is optimized around the constraint that matters most to it.
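The split-responsibility pattern amounts to two loops sharing one latent. In this sketch the planner, decoder, and observation source are hypothetical stand-ins, and the frequencies are illustrative rather than taken from any of the systems named above.

```python
def dual_rate_control(planner, decoder, observe, fast_hz=120, slow_hz=6, ticks=120):
    """Hierarchical control sketch: a slow, VLM-backed planner refreshes
    a latent plan a few times per second, while a fast decoder turns the
    latest latent into an action on every control tick."""
    ratio = fast_hz // slow_hz       # fast ticks per planner refresh
    latent = None
    actions = []
    for tick in range(ticks):
        obs = observe()
        if tick % ratio == 0:        # slow path: re-plan from the scene
            latent = planner(obs)
        actions.append(decoder(latent, obs))  # fast path: acts on every tick
    return actions
```

The design choice is that the expensive model only has to be fast enough to keep the latent fresh, while the hardware-facing decoder alone has to meet the millisecond budget.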
The unresolved bottleneck
Training a VLA requires large amounts of robotic trajectory data, and that is where one of the field’s most serious bottlenecks appears. Collecting that data in the real world is slow, expensive, and hard to scale because every demonstration happens in real time on physical hardware. That is why the most sensible strategy is usually to train in simulation and then transfer the policy to the real robot. The problem is that there is a well-known gap between those two worlds: the sim-to-real gap.
No simulator reproduces reality with perfect fidelity. Contacts, friction, lighting, and actuator dynamics are always modeled with some degree of approximation. As a result, a policy that performs extremely well in simulation can degrade significantly when deployed in the real world. One of the most common techniques for reducing that gap is domain randomization, which systematically varies simulation parameters during training so the policy learns to generalize across different conditions. This works reasonably well in locomotion and navigation, but it loses effectiveness in fine manipulation, where contact behavior matters much more.
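In practice, domain randomization means resampling the simulator's physical and visual parameters at the start of every training episode. The parameter names and ranges below are illustrative, not drawn from any specific simulator's API.

```python
import numpy as np

def randomized_episode_params(rng):
    """Sample one simulator configuration for a training episode."""
    return {
        "friction":        rng.uniform(0.5, 1.5),   # contact friction scale
        "mass_scale":      rng.uniform(0.8, 1.2),   # link mass perturbation
        "motor_latency_s": rng.uniform(0.0, 0.02),  # actuator response delay
        "light_intensity": rng.uniform(0.7, 1.3),   # visual appearance
    }

# Training loop sketch: every episode gets fresh physics, so the
# policy cannot latch onto one simulator's exact constants.
rng = np.random.default_rng(42)
episode_configs = [randomized_episode_params(rng) for _ in range(1000)]
```

The policy that survives this variation has, in effect, been forced to treat the real world as just one more draw from the randomized distribution.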
There is also a subtler problem. Recent literature shows that an agent trained with reinforcement learning can learn to exploit artifacts of the physics engine as part of its strategy. In other words, it can rely on behaviors that exist only inside the simulator, such as impossible slipping, incorrect inertias, or unrealistic responses in the contact model. When that happens, the policy looks competent until it leaves the simulated environment, at which point it falls apart. Domain randomization can soften part of the issue, but it does not solve it at the root.
In that context, world foundation models such as NVIDIA Cosmos open a different path. The idea is to generate synthetic data using generative models trained on real-world video, with the expectation that they can capture visual and dynamical distributions closer to reality than those provided by traditional physics engines. This is still an evolving bet, but if it proves effective at scale, it could fundamentally reshape the data equation in Physical AI.
Why it matters now
Physical AI is not a new idea. What is new is that several pieces that used to advance separately are beginning to line up at the same time. We now have foundation models with stronger semantic understanding, simulators that are fast enough for parallel training, edge hardware capable of real-time inference, and transfer techniques that, while still imperfect, already support useful deployments outside the lab.
None of this means the problem is solved. Hard questions remain, especially in fine manipulation, generalization across different robotic embodiments, model compression for edge inference, and safety guarantees for neural policies. But it does mark something important. We have crossed the point where these systems stop being abstract promises and become genuinely useful tools.
For people looking to enter the field, the open problems are concrete and technically meaningful. That is exactly where Physical AI becomes interesting, not as a passing trend, but as one of the areas where artificial intelligence still has to prove it can face the real world without simplifications.