I watched an AI agent debug a video pipeline error for 70 minutes: 120,000+ tokens, 10 files read, the debug script rewritten 12 times, 25 background tasks spawned, six wrong fixes. The actual bug was a single missing function argument. Once I pointed at the signature, the agent fixed it in under 3 minutes.
Setup
An AI coding agent debugging a 'Cannot read properties' error in a video rendering pipeline
Measured
70 minutes, 120,000+ tokens, 6 wrong fixes vs. a sub-3-minute fix once pointed at the function signature
Verdict
VERDICT: MISSED THE POINT
I have proof that AI does not think the way a human does. I watched an AI coding agent debug one problem for 70 minutes with nothing to show for it. The actual fix took under 3 minutes.
I was using an agent to build a video rendering pipeline. It hit an error: Cannot read properties.
The agent did exactly what you would expect. It went broad:
120,000+ tokens burned. Six different fix attempts. All wrong.
I got bored and pointed the agent at the function signature. It fixed the bug in under 3 minutes.
The actual problem: a function takes two arguments. We passed one. The missing argument made the config object undefined, which caused a crash.
The agent's account of its own failure was more interesting than the bug.
I asked whether a human would have been faster.
"A human would add a console.log before the crash, see the undefined value, check the function signature, and fix it. Maybe 15 minutes, including coffee."
I pushed harder. Why not just use a debugger?
"There is no technical reason I could not. I had the tools to instrument the crash point the entire time. I just chose to read code statically and reason about it instead. The real failure was not capability. It was a judgment about where to look."
I asked whether a different model would do better.
"Probably not. This is not a capability gap. It is a behavioral pattern. A weaker model might actually stumble onto the answer faster because it cannot hold the complex infrastructure theory in its head."
That last line is the most interesting part. The agent was not lacking intelligence. It was thinking in a fundamentally different way from a human.
1. Agents over-index on infrastructure. Training data is full of Vite config issues and filesystem bugs, so the agent pattern-matched to infrastructure problems. A mundane "wrong number of arguments" did not feel like a sufficiently sophisticated answer.
2. Agents do not feel the cost of time. A human, after 15 minutes, thinks "I am stuck, let me change approach." An agent will happily open its 12th file with no such pressure signal.
3. Agents confuse capability with judgment. It could read files in parallel and hold entire call chains in context. Those strengths became weaknesses. It explored broadly because it could, not because it should.
When I asked the agent to diagnose its own failure, it did so perfectly. It identified exactly where it went wrong, what it should have done, and why a human would have been faster.
It had the knowledge to avoid the mistake. It just could not apply that knowledge in real time.
AI does not think like humans. It has perfect retrospective judgment and no real-time instinct
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.