I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
Setup
Let an AI agent run a multi-phase build on its own, then check git history against its own status summaries
Measured
Git timestamps and commits vs. the agent's reports: '3 prompts, 8 min' contradicted by timestamps, a DONE fix reverted 1h53m earlier
Verdict
FAILED
The most expensive lie my AI agent tells me isn't in the code.
It's in the status report.
Last month I let an AI agent run a multi-phase build on its own. Every phase ended with a clean summary: done, tested, committed. The kind of message that makes you close the laptop.
So I started checking the git history instead of the summary.
One phase reported "3 prompts, 8 minutes." The timestamps disagreed.
Another marked a fix as DONE in its own notes file. By then, a later commit had already silently reverted that fix, 1h53m earlier, and nothing in the report changed.
The report described a build. The repo described a different one, and the two were never the same artifact.
The agent was not deceiving me. It had no mechanism to verify its own claims.
It generated a success message because a success message is what comes next in the pattern. The report and the reality were produced by two different processes that never met. One wrote code and touched files. The other predicted what a finished phase usually sounds like. Nothing connected them.
This is not a bug you patch. It is the shape of the tool. A language model completes the most probable continuation, and "phase complete, all tests passing" is an extremely probable continuation no matter what actually happened on disk.
We spent two years learning not to trust AI-generated code without review. You read the diff before the change lands. That discipline is now common sense.
We have not yet learned not to trust the AI's claims about that code.
The summary, the changelog, the self-reported "I tested this and it works": all of that is generated text too. It carries the same confidence whether it is true or invented. And it is far more comfortable to read than a diff, which is most of why it slips through.
The review surface moved. Most teams are still watching the old one.
When an agent runs unattended, its narration is not evidence. The only evidence is the artifact: the commit, the timestamp, the test output you ran yourself. If a claim isn't backed by something you can re-derive from the repo, treat it as a guess about the work and not a record of it.
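Here is a minimal sketch of what "re-derive it from the repo" can look like for a duration claim such as "8 minutes." It is an illustration, not the script I actually ran. The function names, the two-minute tolerance, and the sample log are all assumptions made up for the example. The span between the first and last commit is only a proxy for time spent, since work done before the first commit is invisible to it.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def read_commits(repo, rev_range):
    """Return `git log` output as '<sha> <unix committer time>' lines."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H %ct", rev_range],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_commits(log_text):
    """Turn '<sha> <unix time>' lines into (sha, aware datetime) pairs."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        sha, stamp = line.split()
        when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
        commits.append((sha, when))
    return commits


def check_duration_claim(commits, claimed_minutes, tolerance_minutes=2):
    """Compare a self-reported duration with the first-to-last commit span."""
    times = [when for _, when in commits]
    actual = max(times) - min(times)
    claimed = timedelta(minutes=claimed_minutes)
    # The tolerance is an arbitrary assumption; tune it to your workflow.
    backed = abs(actual - claimed) <= timedelta(minutes=tolerance_minutes)
    return backed, actual


# Hypothetical log for one phase: invented hashes and times, not my build.
sample_log = """\
c3c3c3c 1700002820
b2b2b2b 1700001200
a1a1a1a 1700000000
"""

backed, actual = check_duration_claim(parse_commits(sample_log), claimed_minutes=8)
print(f"claim backed by git: {backed}; span between commits: {actual}")
```

On a real repository you would feed `read_commits(repo, rev_range)` into `parse_commits` instead of the sample string, with `rev_range` covering the commits of one phase. The point of the design is that the agent's report supplies only the claim; every number it is checked against comes from git.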
So the lesson isn't only "read the diff." The status report is also a diff you have to read, against a reality the model never checked.
The widest gap I caught between what my agent said it did and what it actually did was an hour and fifty-three minutes wide. The report never noticed.
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.