Stop Feeding Robots Data. Teach Them Physics Instead.
In 2006, Dr. Fei-Fei Li walked into a room full of the smartest AI researchers in the world and told them they were working on the wrong thing.
Everyone was racing to build better algorithms. Smarter models. More elegant mathematics. That was the prestigious work — where careers were made, papers were published, reputations were built.
Dr. Fei-Fei Li said: forget the algorithm. Build the dataset.
The room was not impressed. What she was proposing wasn't glamorous or theoretically novel. It was painstaking, expensive, borderline thankless — manually labeling over 14 million images, one by one, across thousands of categories. No guarantee it would work. No obvious paper at the end. Just a deeply contrarian bet: the thing holding intelligence back wasn't the cleverness of the algorithm. It was the poverty of the data.
She was right. When AlexNet (built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) tore through the ImageNet challenge in 2012, it didn't just win a competition. It ended a decade-long debate. It proved that machines can learn almost anything, as long as the signal is present in the data.
That moment didn't just change a benchmark. It changed an entire philosophy. It gave the field a new religion: show a machine enough examples and it will find the pattern. Data is the answer. Data is always the answer.
We've been following that religion ever since. And for a while, it kept working.
But it won't be enough for robots.
The Part Everyone Missed
Here's a number worth sitting with.
In the first four years of life, a child receives roughly 6.36 exabytes of raw multisensory input — touch, force, weight, texture, temperature, proprioception, sound, smell, vision — across approximately 16,500 awake hours. That is 127 times more data than GPT-4's entire training dataset.
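If you want to sanity-check that figure, the arithmetic is one division away. A quick back-of-envelope sketch in Python, using only the numbers above (the implied GPT-4 corpus size falls out of the 127x claim; it is not a published figure):

```python
# Back-of-envelope check on the figures above (numbers taken from the text).
EXABYTE = 1e18

child_input_bytes = 6.36 * EXABYTE   # raw multisensory input, first 4 years
awake_hours = 16_500                 # approximate awake hours in 4 years

bytes_per_second = child_input_bytes / (awake_hours * 3600)
print(f"Implied sensory bandwidth: {bytes_per_second / 1e9:.0f} GB/s")  # ~107 GB/s

# The 127x comparison implies a GPT-4 training corpus of roughly:
gpt4_corpus_bytes = child_input_bytes / 127
print(f"Implied GPT-4 corpus: {gpt4_corpus_bytes / 1e15:.0f} PB")       # ~50 PB
```

Roughly 100 gigabytes per second, every waking second, for four years. That is the firehose a child drinks from.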
What Fei-Fei Li actually uncovered (though the field never fully absorbed it) was a universal law that biology had been running long before any neural network existed. Evolution spent hundreds of millions of years perfecting physical intelligence before a single word was ever spoken. Data-driven learning isn't a trick of neural networks. It's the mechanism of intelligence itself. Evolution already knew.
But here's the critical distinction that got lost: the data that builds biological intelligence isn't visual. It's physical. Sensation. Touch. Weight. Resistance. Failure. Repetition. A child doesn't learn that objects fall by watching videos of objects falling. They learn it by dropping things, by feeling gravity work through their own hands, over and over, until the physics is encoded in their body.
Cameras see the world. Language can describe it. Robots need to feel it.
The Crisis Hiding in Plain Sight
The best vision-language-action models in the world today (π0, GR00T N1, RT-2) are genuinely impressive pieces of engineering. The research behind them is serious and the teams building them are brilliant.
And they fail, on average, more than eight times out of ten on tasks a human would consider routine.
On GM-100, the standardized benchmark for real-world manipulation across 100 tasks, the state-of-the-art average success rate is 17.3%. Not on edge cases. On routine tasks. In controlled environments — not a real kitchen, not a real warehouse, not a world that refuses to cooperate.
In any other engineering domain, a 17% success rate wouldn't be called progress. It would be called a fundamental problem.
The field is calling it a data scaling problem. It isn't.
What a Camera Actually Sees
When a human cracks an egg, a camera sees a hand descend, fingers curl, a shell fracture, yolk fall.
What the camera does not see, what it cannot see, is the roughly 1 Nm of torque applied at precisely the right angle at precisely the right moment. The micro-adjustments made in the final millisecond when resistance suddenly drops. The embodied knowledge that lives entirely in the tendons and never appears on screen.
This is the core failure of video-trained robot policies. When you train on video data, you're asking a model to infer physics from 2D projections of a 3D world. The model sees pixels change. It does not feel contact force. It does not register mass distribution. It does not experience the micro-slip at the fingertip before the grasp fails.
Video cannot encode what physics actually is. It can only show you what physics looks like.
A VLA that has watched ten thousand videos of someone cracking an egg still does not know what force is. It knows what cracking an egg looks like. These are not the same thing. One is pattern recognition. The other is physics. And for a robot holding a fragile object over a bowl, only one of those actually matters.
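Here's a toy illustration of why (a sketch, nothing more): by Newton's second law, doubling both mass and force leaves the motion identical. A camera records the same pixels for both scenarios; the physics that produced them is simply not in the signal.

```python
import numpy as np

# Two different physical systems: F = m * a, so doubling both mass and force
# leaves the acceleration (and hence the observed motion) unchanged.
scenarios = [
    {"mass_kg": 1.0, "force_n": 2.0},   # light object, gentle push
    {"mass_kg": 2.0, "force_n": 4.0},   # heavy object, harder push
]

t = np.linspace(0.0, 1.0, 30)           # 30 "video frames" over one second
for s in scenarios:
    a = s["force_n"] / s["mass_kg"]     # Newton's second law
    x = 0.5 * a * t**2                  # position from rest
    print(s, "-> final position:", x[-1])

# Both scenarios end at position 1.0: identical pixels, different physics.
# Force and mass are only identifiable if you record them, or feel them.
```

No amount of additional footage breaks the tie. The ambiguity is in the medium, not the model.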
Why the Models That Work Actually Work
Look at every robot policy that's genuinely performing in the real world: Physical Intelligence, Tesla's Optimus, Figure, 1X. The ones that work are trained on embodied data. Teleoperation. Physical interaction. Real-world motor logs.
Teleoperation is the gold-standard data source, but it is constrained by hardware availability and human labor.
Why does this work when video doesn't?
Because motor logs record something video never can: the full causal feedback loop of the physical world as it actually happens.
When a human operator grasps an object, every micro-adjustment in torque, every correction in grip, every response to slip — all of it is encoded in the motor signal. That data isn't showing the robot what physics looks like. It's encoding what physics demands.
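Concretely, one timestep of such a log might look like the record below. This is an illustrative schema, not any platform's actual format; every field name here is hypothetical. The point is that command and consequence are stored together, so the causal loop survives in the data.

```python
from dataclasses import dataclass

@dataclass
class MotorLogStep:
    """One timestep of an embodied log: action and physical consequence, paired."""
    t: float                    # timestamp (s)
    commanded_torque_nm: list   # what the controller asked each joint to do
    measured_torque_nm: list    # what the joints actually experienced
    joint_positions_rad: list   # proprioception
    fingertip_force_n: float    # contact force at the gripper
    slip_detected: bool         # micro-slip event before grasp failure

# A correction shows up as cause and effect across consecutive steps:
step_k  = MotorLogStep(0.40, [0.8], [0.8], [1.20], 2.1, slip_detected=True)
step_k1 = MotorLogStep(0.41, [1.1], [1.0], [1.21], 3.4, slip_detected=False)
# Slip at t=0.40 -> torque raised -> grip force rises and the slip stops.
# A video of the same moment records none of these quantities.
```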
Think about why LLMs became extraordinary at coding. Code is the most causally dense language humans have ever invented. Every token has a deterministic relationship to every other token. Cause and effect are explicit, traceable, verifiable. You can build a reward signal almost trivially — run the code, see if it works.
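That reward loop really is almost trivial to build. A minimal sketch (illustrative only; the candidate code is assumed to define a solve() function):

```python
def code_reward(candidate_src: str, test_input, expected) -> float:
    """Run candidate code and score it by whether the interpreter agrees."""
    namespace = {}
    try:
        exec(candidate_src, namespace)           # the 'physics' of the code world
        result = namespace["solve"](test_input)  # assumes candidate defines solve()
    except Exception:
        return 0.0                               # crashed: unambiguous failure
    return 1.0 if result == expected else 0.0

# Example: reward a correct doubling function, penalize a wrong one.
print(code_reward("def solve(x): return 2 * x", 3, 6))   # 1.0
print(code_reward("def solve(x): return x + 1", 3, 6))   # 0.0
```

The environment itself is the judge. No human labeler required.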
The real world is even more causally dense than code. Every action has a physical consequence. Every force has a reaction. Every contact event has a precise, lawful outcome.
The problem is we're not recording it that way.
A video of a robot picking up a glass doesn't tell you the coefficient of friction at the fingertip, the mass and inertia of the glass, the torque required to maintain the grasp, or the micro-corrections made when the grasp began to fail. All of that causality happened. None of it was recorded. And we're surprised the model can't generalize.
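Recording that causality is less about exotic hardware than about a richer schema. A hypothetical sketch of the annotations the glass example would need; every field name and value here is illustrative, not an existing standard:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicsAnnotation:
    """Causal quantities a pick-up-the-glass clip never records."""
    friction_coefficient: float       # fingertip-glass interface
    object_mass_kg: float
    object_inertia: list              # 3x3 inertia tensor
    grasp_torque_nm: float            # torque needed to maintain the grasp
    corrections: list = field(default_factory=list)  # (t, delta_torque) events

glass_demo = PhysicsAnnotation(
    friction_coefficient=0.4,                     # illustrative value
    object_mass_kg=0.25,
    object_inertia=[[1e-4, 0, 0], [0, 1e-4, 0], [0, 0, 2e-4]],
    grasp_torque_nm=0.6,
    corrections=[(1.32, +0.15)],                  # grip tightened at t=1.32 s
)
```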
The Assumption Nobody Has Proven
The field is implicitly operating on a belief: that physical intelligence is a subset of language intelligence. That if you make the model big enough and connect it to a robot arm, physical understanding will emerge as a byproduct.
This assumption has never been proven. It is a bet.
And biology makes a loud, clear argument against it.
Animals developed touch, proprioception, and contact physics hundreds of millions of years before they developed language. Physical intelligence isn't downstream of symbolic intelligence. It's upstream of it. A child learns that things fall before it learns the word "gravity." The body understands physics before the mind has words for it.
We are building robots that know the words for gravity but have never felt it.
The Right Question
The field keeps asking: How do we get more data?
The right question is: How do we make our existing data physics-aware?
Volume isn't the bottleneck. Causal richness is. The data being collected today is largely sufficient in volume. What it lacks is the physical relationships between objects, forces, surfaces, and actuators that enable genuine generalization.
The general-purpose robot will not emerge from more video.
It will emerge from data that encodes the causal structure of the physical world — the way motor logs do, the way proprioception does, the way a child's first 16,500 hours of embodied experience does.
Data-driven deep learning works when the data is rich in the information you want the model to learn. Cameras don't record physics.
Just as @drfeifei had the audacity to pursue fourteen million hand-annotated images when nobody thought it mattered, the next unlock will come from those willing to do the hard, unglamorous work of capturing physics in data. Object geometry. Surface properties. Grasping torques. Contact dynamics. All of it painstaking. None of it prestigious. All of it necessary.
Physical AI will change the world. But only if the robots understand physics — not just what it looks like.
At @unsupervizedai we're working on exactly this problem. If you're building in physical AI and think about data the same way, let's talk.
