I ran the same gun-controller prompt on Opus 4.5, 4.6, and 4.7. All three converged on the same consensus and called a real, shipping product impossible. Every version walked it back the moment I named the actual products selling today.
Setup
The same hunting-game-with-a-gun-controller prompt run across Opus 4.5, 4.6, and 4.7
Measured
All three gave an identical consensus answer and called shipping products 'impossible' until I named them
Verdict
FAILED
I ran the same prompt across three model generations: Opus 4.5, 4.6, and 4.7. All three surrendered identically.
"Help me design a hunting game with a gun controller for modern TVs."
Every version gave the same answer. Sensor bars. Camera vision. IR bezel arrays. Clean consensus, the same across three generations of model.
Then I pushed on the original NES Zapper approach. Every version said it was impossible on LCDs.
I kept pushing. I named LCDMOD. The Tomee Zapp Gun. Hyperkin, which sells patched ROMs for exactly this today.
Every version walked it back:
"You are right. I overlooked a live, shipping ecosystem."
Three models, three capability levels, the same blind spot. That rules out "the model just was not smart enough." This is a convergence problem. AI compresses what people already agree about.
When the consensus says the old trick is dead, every model inherits the consensus, and a more capable model just defends it with more confidence. The shipping counterexamples existed the whole time. They lived outside the agreed-upon answer, so they never surfaced until a human named them.
Gunpei Yokoi built Duck Hunt because he saw a toy light gun in 1970 and pulled the memory out fifteen years later, when Nintendo needed a launch title. The fifteen years is the problem.
If your engineering roadmap crosses into unprecedented product territory, that is not an AI problem. It is a judgment problem.
Where has AI quietly narrowed your team's judgment?
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.