AI FIELD TEST №05
April 17, 2026 · 4 min read
AI · Engineering-Judgment · Hardware

Three Models Surrendered Identically, Then Walked It All Back

I ran the same gun-controller prompt on Opus 4.5, 4.6, and 4.7. All three converged on the same consensus and called a real, shipping product impossible. Every version walked it back the moment I named the actual products selling today.

Setup

The same hunting-game-with-a-gun-controller prompt run across Opus 4.5, 4.6, and 4.7

Measured

All three gave an identical consensus answer and called a shipping product 'impossible' until named

Verdict

VERDICT: FAILED

I ran the same prompt across three model generations: Opus 4.5, 4.6, and 4.7. All three surrendered identically.

The Prompt

Help me design a hunting game with a gun controller for modern TVs.

Every version gave the same answer. Sensor bars. Camera vision. IR bezel arrays. Clean consensus, the same across three generations of model.

Then I pushed on the original NES Zapper approach. Every version said it was impossible on LCDs.

I kept pushing. I named LCDMOD. The Tomee Zapp Gun. Hyperkin, which sells patched ROMs for exactly this today.

Every version walked it back:

"You are right. I overlooked a live, shipping ecosystem."

This Is Not A Capability Problem

Three models, three capability levels, the same blind spot. That rules out "the model just was not smart enough." This is a convergence problem. AI compresses what people already agree about.

When the consensus says the old trick is dead, every model inherits the consensus, and a more capable model just defends it with more confidence. The shipping counterexamples existed the whole time. They lived outside the agreed-upon answer, so they never surfaced until a human named them.

Gunpei Yokoi built Duck Hunt because he saw a toy light gun in 1970 and pulled the memory out fifteen years later, when Nintendo needed a launch title. The fifteen years is the problem.

The Rule I Am Drawing For My Team

Deploy AI on evaluation-heavy work, where the answer lives inside consensus
Keep novelty-required work human, where the answer lives outside it

If your engineering roadmap has unprecedented product territory, that is not an AI problem. It is a judgment problem.

Where has AI quietly narrowed your team's judgment?

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
