AI FIELD TEST №13 · June 9, 2026 · 5 min read · AI · ML · Engineering-Judgment

98% Accuracy, Worse Than Guessing

Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.

Setup

Claude trained a gradient boosting model mapping raw soil-sensor bytes to pH, EC, and temperature on 2,347 paired points.

Measured

Cross-validation R² of 0.98 / 0.99 / 0.999 (pH / EC / temperature) vs. 59 held-out real-soil points, where EC fell to -0.56 R², worse than predicting the mean.

Verdict

VERDICT: FAILED

My ML model scored 98% in cross-validation. Then it completely collapsed on real soil.

Claude and I were hacking a sensor for pH, EC (conductivity), and temperature. The app's code was compiled and unreadable, so we took the "smart" route: let an ML model learn the mapping straight from raw bytes to readings.

I collected 2,347 paired points from liquid sweeps, trained a gradient boosting model, and the results looked incredible:

pH: 0.98 R²
EC: 0.99 R²
Temperature: 0.999 R²

Claude said: "We're done."

The Test I'd Been Delaying

Then I forced myself to run the test I'd been avoiding. Fresh data from real soil pots. 59 held-out points, zero overlap with training.

The EC channel crashed from 0.99 R² to -0.56 R².

Negative R². Worse than predicting the average. The model was outputting near-constant nonsense, sometimes 6x higher than the actual readings.
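
To make the negative score concrete, here is a minimal sketch of how R² is computed and why a near-constant, too-high prediction lands below zero. The readings are made-up illustration values, not the soil data from this test, and `r2_score` is a hand-written helper rather than a library call.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical readings, chosen only to illustrate the arithmetic.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4   # always predict the average of y_true
stuck_high = [6.0] * 4      # near-constant output, far above the real values

r2_mean = r2_score(y_true, mean_baseline)
r2_stuck = r2_score(y_true, stuck_high)
print(r2_mean, r2_stuck)    # 0.0 for the mean baseline, about -9.8 for the stuck model
```

Predicting the mean scores exactly 0, so any score below 0 means the model's squared error is larger than that of a model that ignores its inputs entirely.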

The First Wrong Answer

Claude blamed data leakage, and I agreed.

The liquid sweeps had near-identical consecutive frames, so random cross-validation was leaking almost-duplicate rows across the train and test splits. I fixed it properly: splits grouped by sweep, so no sweep appeared on both sides.

The score barely dropped: 0.99 to 0.98.

Leakage was real. It was not the killer.
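
The grouping fix can be sketched with scikit-learn's `GroupKFold`. This is an illustration under assumptions, not the code from the project: the sweep layout (6 sweeps of 10 frames), the placeholder features, and the helper `sweeps_on_both_sides` are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 6 sweeps of 10 consecutive frames each.
sweep_ids = np.repeat(np.arange(6), 10)
X = np.zeros((sweep_ids.size, 1))  # placeholder features; only the split matters here

def sweeps_on_both_sides(splitter, groups=None):
    """Count folds in which at least one sweep lands in both train and test."""
    leaky = 0
    for train_idx, test_idx in splitter.split(X, groups=groups):
        shared = set(sweep_ids[train_idx]) & set(sweep_ids[test_idx])
        leaky += bool(shared)
    return leaky

# Random K-fold scatters frames from the same sweep across train and test.
random_leaks = sweeps_on_both_sides(KFold(n_splits=3, shuffle=True, random_state=0))
# Grouped K-fold keeps every sweep entirely on one side of each split.
grouped_leaks = sweeps_on_both_sides(GroupKFold(n_splits=3), groups=sweep_ids)
print(random_leaks, grouped_leaks)
```

Grouped splits remove near-duplicate leakage, but every fold still comes from the same liquid sweeps, so the score still says nothing about soil.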

The Real Culprit

Distribution shift.

Trained on liquids. Tested on soil. Many byte ranges in real soil were outside anything the model had ever seen.
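
One rough way to catch this before trusting a score is to count how many new rows fall outside the per-feature range seen in training. The sketch below assumes byte-valued features in NumPy arrays; the arrays and the helper `fraction_out_of_range` are hypothetical, not taken from the project.

```python
import numpy as np

def fraction_out_of_range(train_X, new_X):
    """Share of new rows with at least one feature outside the training min/max."""
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    outside = (new_X < lo) | (new_X > hi)
    return outside.any(axis=1).mean()

# Hypothetical byte-valued features (0-255), for illustration only.
train_X = np.array([[10, 200], [40, 220], [25, 210]])
soil_X = np.array([[30, 205], [90, 215], [20, 120], [35, 219]])

frac = fraction_out_of_range(train_X, soil_X)
print(frac)  # 0.5: two of the four rows leave the training range
```

This is a coarse check: a row can sit inside every per-feature range and still be an unseen combination, so a low fraction is not proof of coverage.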

Gradient-boosted trees don't extrapolate. Outside the range they were trained on, their predictions flatten to a constant.

So the moment the input left the training distribution, the model stopped predicting and started repeating a number. The 98% never measured how well it read soil. It measured how well it fit the liquid sweeps it was trained on.
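
The flattening is easy to reproduce on synthetic data. The sketch below assumes scikit-learn's `GradientBoostingRegressor` as a stand-in for whatever boosting library was actually used, and a toy linear relationship (y = 2x) in place of the real byte-to-reading mapping.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the sensor mapping: y = 2x on x in [0, 10].
x_train = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()

model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)

# In-range inputs: the model tracks the line closely.
x_in = np.linspace(0.5, 9.5, 50).reshape(-1, 1)
r2_in = r2_score(2.0 * x_in.ravel(), model.predict(x_in))

# Out-of-range inputs: every tree routes them to its outermost leaf,
# so the prediction is the same number no matter how far out x goes.
x_out = np.linspace(20.0, 30.0, 50).reshape(-1, 1)
pred_out = model.predict(x_out)
r2_out = r2_score(2.0 * x_out.ravel(), pred_out)
spread_out = pred_out.max() - pred_out.min()

print(r2_in, r2_out, spread_out)
```

The in-range score stays high while the out-of-range score goes negative and the spread of out-of-range predictions is zero, which is the same pattern as a near-constant EC output on soil.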

What I Took From This

A few things, especially for anyone building with AI:

Claude is genuinely fast at ML. The same work that took me days to weeks in the past took a few hours here.
GenAI can be overly confident even when it's missing practical context.
A high cross-validation score measures fit to your training distribution, not real-world performance.
The only honest validation is fresh data from actual deployment conditions. 59 soil points revealed what 2,347 liquid points hid.
We do this constantly in fields like automotive, leaning on public datasets while hoping they match the real-world conditions OEMs keep secret.
Never outsource your skepticism to the AI.

The model (the ML model, not Claude) did exactly what it was trained to do. Knowing when to stop trusting the numbers was on me.

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
