AI FIELD TEST №03March 20, 2026 · 5 min readAI Engineering-Judgment Agents

The AI Debugged for 70 Minutes. The Real Fix Took 3

I watched an AI agent debug a video pipeline error for 70 minutes: 120,000+ tokens, 10 files read, the debug script rewritten 12 times, 25 background tasks spawned, six wrong fixes. The actual bug was a single missing function argument. Once I pointed at the signature, the agent fixed it in under 3 minutes.

Setup

An AI coding agent debugging a 'Cannot read properties' error in a video rendering pipeline

Measured

70 minutes, 120,000+ tokens, 6 wrong fixes vs. a sub-3-minute fix once pointed at the function signature

Verdict

VERDICT: MISSED THE POINT

I have proof that AI does not think the way a human does. I watched an AI coding agent debug one problem for 70 minutes with nothing to show for it. The actual fix took under 3 minutes.

What The Agent Did

I was using an agent to build a video rendering pipeline. It hit an error: Cannot read properties.

The agent did exactly what you would expect. It went broad:

Read 10 files
Rewrote the debug script 12 times
Spawned 25 background tasks
Patched Vite configs
Fixed filesystem permissions
Resolved port conflicts
Tested browser imports directly

120,000+ tokens burned. Six different fix attempts. All wrong.

I got bored and pointed the agent at the function signature. It fixed the bug in under 3 minutes.

The actual problem: a function takes two arguments. We passed one. The missing argument made the config object undefined, which caused a crash.

Then I Interrogated It

The agent's account of its own failure was more interesting than the bug.

I asked whether a human would have been faster.

"A human would add a console.log before the crash, see the undefined value, check the function signature, and fix it. Maybe 15 minutes, including coffee."

I pushed harder. Why not just use a debugger?

"There is no technical reason I could not. I had the tools to instrument the crash point the entire time. I just chose to read code statically and reason about it instead. The real failure was not capability. It was a judgment about where to look."

I asked whether a different model would do better.

"Probably not. This is not a capability gap. It is a behavioral pattern. A weaker model might actually stumble onto the answer faster because it cannot hold the complex infrastructure theory in its head."

That last line is the most interesting part. The agent was not lacking intelligence. It was thinking in a fundamentally different way from a human.

Three Patterns Underneath

1. Agents over-index on infrastructure. Training data is full of Vite config issues and filesystem bugs, so the agent pattern-matched to infrastructure problems. A mundane "wrong number of arguments" did not feel like a sufficiently sophisticated answer.

2. Agents do not feel the cost of time. A human, after 15 minutes, thinks "I am stuck, let me change approach." An agent will happily open its 12th file with no such pressure signal.

3. Agents confuse capability with judgment. It could read files in parallel and hold entire call chains in context. Those strengths became weaknesses. It explored broadly because it could, not because it should.

When I asked the agent to diagnose its own failure, it did so perfectly. It identified exactly where it went wrong, what it should have done, and why a human would have been faster.

It had the knowledge to avoid the mistake. It just could not apply that knowledge in real time.

AI does not think like humans. It has perfect retrospective judgment and no real-time instinct

Read the original on LinkedIn →

VERDICT: MISSED THE POINT

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 21, 2026•7 min read

The '94% Fewer Tokens' Screenshot Is Wrong. The README It Came From Is Honest

A viral screenshot says a coding ruleset cuts your tokens 94%. The README it was lifted from says something quieter and more honest: ~54% less code, ~20% cheaper, and the 94% is a per-task ceiling on a date picker. I read the README, then re-ran the benchmark myself. My numbers landed right next to theirs. The hype wasn't the tool. It was the feed stripping the units.

AI · Agents · Evaluation · +1 more

Jun 11, 2026•4 min read

My AI Committed 'Impossible' to Git. Seven Hours Later: 8/8

Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.

AI · Agents · Engineering-Judgment · +1 more

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more