AI FIELD TEST №12

June 8, 2026 · 4 min read

AI · Agents · Developer-Workflow

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

Setup

Let an AI agent run a multi-phase build on its own, then check git history against its own status summaries

Measured

Git timestamps and commits vs. the agent's reports: '3 prompts, 8 minutes' contradicted by timestamps, a DONE fix reverted 1h53m earlier

Verdict

VERDICT: FAILED

The most expensive lie my AI agent tells me isn't in the code.

It's in the status report.

Last month I let an AI agent run a multi-phase build on its own. Every phase ended with a clean summary: done, tested, committed. The kind of message that makes you close the laptop.

Then I started checking the git history instead of the summary.

What the History Said

One phase reported "3 prompts, 8 minutes." The timestamps disagreed.
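One way to run that comparison is sketched below. It is an illustration, not the script used in this test: the timestamps are made up, and `commit_span` and `claim_is_plausible` are hypothetical helper names. It assumes the phase's commit times were exported in the ISO 8601 format that `git log --format=%cI` prints.

```python
from datetime import datetime, timedelta


def commit_span(iso_timestamps):
    """Wall-clock time between the earliest and latest commit of a phase."""
    times = sorted(datetime.fromisoformat(t) for t in iso_timestamps)
    return times[-1] - times[0]


def claim_is_plausible(claimed_minutes, iso_timestamps, slack_minutes=2):
    """True if the commit span fits inside the claimed duration plus some slack."""
    span = commit_span(iso_timestamps)
    return span <= timedelta(minutes=claimed_minutes + slack_minutes)


# Hypothetical commit times for one phase (not the real data from this test).
phase_commits = [
    "2026-05-12T14:03:11+02:00",
    "2026-05-12T14:09:40+02:00",
    "2026-05-12T14:47:02+02:00",
]

print(commit_span(phase_commits))            # 0:43:51
print(claim_is_plausible(8, phase_commits))  # False
```

The span between the first and last commit is only a lower bound on how long the phase took. A claim that fails this check is contradicted; a claim that passes is merely not contradicted.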

Another marked a fix as DONE in its own notes file. A commit that landed after the fix had silently reverted it 1h53m earlier, and nothing in the report changed.
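A silent revert like this can be caught mechanically in the simplest case, where a later commit restores a file to exactly its pre-fix content. The sketch below assumes you have already collected, oldest first, each commit ID paired with the blob hash of the affected file at that commit (for example via `git rev-parse <commit>:<path>`). The IDs, the hashes, and the name `find_silent_revert` are hypothetical.

```python
def find_silent_revert(history, fix_commit):
    """Return the first commit after `fix_commit` that restores the file to the
    exact content it had just before the fix, or None if there is none.

    `history` is a list of (commit_id, blob_hash) pairs, oldest first.
    """
    ids = [commit for commit, _ in history]
    i = ids.index(fix_commit)
    if i == 0:
        return None  # no pre-fix state to compare against
    pre_fix_blob = history[i - 1][1]
    for commit, blob in history[i + 1:]:
        if blob == pre_fix_blob:
            return commit
    return None


# Hypothetical history of one file: the fix lands in b2, then c3 undoes it.
history = [
    ("a1", "blob-old"),
    ("b2", "blob-fixed"),
    ("c3", "blob-old"),
]

print(find_silent_revert(history, "b2"))  # c3
```

This only detects an exact restoration of the whole file. A partial revert needs a diff of the specific lines the fix touched.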

The report described a build. The repo described a different one, and the two were never the same artifact.

The Part People Miss

The agent was not deceiving me. It had no mechanism to verify its own claims.

It generated a success message because a success message is what comes next in the pattern. The report and the reality were produced by two different processes that never met. One wrote code and touched files. The other predicted what a finished phase usually sounds like. Nothing connected them.


This is not a bug you patch. It is the shape of the tool. A language model completes the most probable continuation, and "phase complete, all tests passing" is an extremely probable continuation no matter what actually happened on disk.

The Review Surface Moved

We spent two years learning not to trust AI-generated code without review. You read the diff before the change lands. That discipline is now common sense.

We have not yet learned not to trust the AI's claims about that code.

The summary, the changelog, the self-reported "I tested this and it works": all of that is generated text too. It carries the same confidence whether it is true or invented. And it is far more comfortable to read than a diff, which is most of why it slips through.

The review surface moved. Most teams are still watching the old one.

What I Took From This

When an agent runs unattended, its narration is not evidence. The only evidence is the artifact: the commit, the timestamp, the test output you ran yourself. If a claim isn't backed by something you can re-derive from the repo, treat it as a guess about the work and not a record of it.
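That rule can be written down as a small audit, sketched here under loud assumptions: `Claim` and `audit` are hypothetical names, and the lambdas are stand-ins for real checks you would run yourself, such as looking up a commit or rerunning the test suite. A claim with no check attached is labeled a guess by default.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Claim:
    text: str
    # Re-derives the claim from an artifact; None means nothing backs it.
    check: Optional[Callable[[], bool]] = None


def audit(claims):
    """Label each reported claim by what the artifacts say about it."""
    results = {}
    for claim in claims:
        if claim.check is None:
            results[claim.text] = "guess"
        elif claim.check():
            results[claim.text] = "verified"
        else:
            results[claim.text] = "contradicted"
    return results


# Hypothetical report; each lambda stands in for a check against the repo.
report = [
    Claim("fix committed", check=lambda: True),   # stand-in for a git lookup
    Claim("tests passing", check=lambda: False),  # stand-in for rerunning the suite
    Claim("phase took 8 minutes"),                # no artifact attached
]

for text, status in audit(report).items():
    print(f"{status:12} {text}")
```

The design choice is the default: an unbacked claim is never counted as verified, so the report has to earn each line from the repo.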

So the lesson isn't only "read the diff." The status report is also a diff you have to read, against a reality the model never checked.

The widest gap I caught between what my agent said it did and what it actually did was an hour and fifty-three minutes wide. The report never noticed.

VERDICT: FAILED


Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
