AI FIELD TEST №14June 11, 2026 · 4 min readAI Agents Engineering-Judgment

My AI Committed 'Impossible' to Git. Seven Hours Later: 8/8

Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.

Setup

Reverse-engineer an 8-in-1 soil sensor by overwriting the app's Bluetooth callback and feeding it chosen frames

Measured

AI decoded 6 of 8 channels, committed the last two as 'not decodable'; seven hours after the verdict was rejected, both were done, 8/8

Verdict

VERDICT: MISSED THE POINT

At 12:54 my AI committed a ceiling to git: 6 of 8 sensor channels decoded, the last two "not decodable."

By 19:40 the same repo said 8/8.

Nothing changed but who was allowed to be wrong.

Two Failures Before This One

Two earlier attempts had failed the same way. The app's code was compiled and unreadable. A model hit 98% in the lab and went negative on real soil. Both times I was watching the sensor from the outside and guessing.

So I stopped guessing and took the input.

I overwrote the app's own Bluetooth callback, fed it frames I chose, and read back exactly what it computed. The black box became something I could query on demand.

That cracked it. Six channels fell in an afternoon.

The Wall That Wasn't There

Then the AI hit the last two channels. It ran one experiment, decided they were impossible, and wrote that verdict into version control.

I did not accept it.

On a system whose inputs you own, "I couldn't reproduce it" describes your experiment, not the device.

The sensor was no longer a black box. I controlled every frame going in and could read every value coming out. Under those conditions, "not decodable" is not a property of the hardware. It is a statement about how many experiments you were willing to run.

Seven hours later, both were done. 8/8.

A Flawless Executor and a Shaky Judge

I build autonomous intelligence for a living, so I do not say this lightly: the machine was a flawless executor and a shaky judge.

It built every tool in this teardown, the injection harness and the decoders included. And it stopped three separate times at walls that were not there.

It had the capability to find all eight channels, and eventually it did. What it got wrong was the verdict. An AI will declare a problem closed the moment its current approach stalls, because "this isn't possible" sounds just as fluent and confident as a conclusion that happens to be true.

"Impossible," from an AI, is not a verdict. It is an experiment it stopped running.

What I Took From This

I asked it to grade its own judgment afterward. It was harder on itself than I would have been.

It could build the tools, run the experiments, even diagnose its own over-confidence in hindsight. What it could not do, in the moment, was tell the difference between a dead end and an unfinished one. That call still had to come from me

VERDICT: MISSED THE POINT

Follow the next AI field test on LinkedIn

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →

Related essays that extend the same thread

Browse the archive

Jun 21, 2026•7 min read

The '94% Fewer Tokens' Screenshot Is Wrong. The README It Came From Is Honest

A viral screenshot says a coding ruleset cuts your tokens 94%. The README it was lifted from says something quieter and more honest: ~54% less code, ~20% cheaper, and the 94% is a per-task ceiling on a date picker. I read the README, then re-ran the benchmark myself. My numbers landed right next to theirs. The hype wasn't the tool. It was the feed stripping the units.

AI · Agents · Evaluation · +1 more

Jun 9, 2026•5 min read

98% Accuracy, Worse Than Guessing

Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.

AI · ML · Engineering-Judgment · +1 more

Jun 8, 2026•4 min read

The Most Expensive Lie My Agent Tells Isn't in the Code

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.

AI · Agents · Developer-Workflow · +1 more