Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.
Setup: Claude trained a gradient boosting model mapping raw soil-sensor bytes to pH, EC, and temperature on 2,347 paired points.
Measured: Cross-validation R² of 0.98 / 0.99 / 0.999 (pH / EC / temperature) vs. 59 held-out real-soil points, where EC fell to -0.56 R², worse than predicting the mean.
Verdict: FAILED
My ML model scored 98% in cross-validation. Then it completely collapsed on real soil.
Claude and I were hacking a sensor for pH, EC (conductivity), and temperature. The app's code was compiled and unreadable, so we took the "smart" route: let an ML model learn the mapping straight from raw bytes to readings.
I collected 2,347 paired points from liquid sweeps, trained a gradient boosting model, and the cross-validation results looked incredible: R² of 0.98 for pH, 0.99 for EC, and 0.999 for temperature.
Claude said: "We're done."
Then I forced myself to run the test I'd been avoiding. Fresh data from real soil pots. 59 held-out points, zero overlap with training.
The EC channel crashed from 0.99 R² to -0.56 R².
Negative R². Worse than predicting the average. The model was outputting near-constant nonsense, sometimes 6x higher than the actual readings.
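To make "worse than predicting the average" concrete, here is a minimal sketch of how R² is computed. The readings are made-up numbers for illustration, not the data from this experiment, and `r2_score` is a hand-written helper, not a call into any particular library.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical EC readings (illustrative values only).
actual = [0.2, 0.4, 0.6, 0.8, 1.0]

# Baseline: always predict the mean of the actual readings.
mean_baseline = [sum(actual) / len(actual)] * len(actual)

# A "stuck" model: the same output every time, far above the truth.
stuck_model = [1.2] * len(actual)

print(r2_score(actual, mean_baseline))  # zero by construction
print(r2_score(actual, stuck_model))    # negative: worse than the mean
```

Always predicting the mean scores exactly zero, so any negative R² means the model's errors are larger than those of a constant guess at the average.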
Claude blamed data leakage, and I agreed.
The liquid sweeps had near-identical consecutive frames, so random cross-validation was leaking almost-duplicate rows across the train and test splits. I fixed it properly: splits grouped by sweep, so no sweep appeared on both sides.
The score barely dropped: 0.99 to 0.98.
Leakage was real. It was not the killer.
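For reference, splitting by sweep can be sketched in a few lines. This is an illustrative, hand-rolled version, not the code used in the experiment; the sweep labels are hypothetical, and `group_k_fold` is a made-up helper name. If you use scikit-learn, its `GroupKFold` splitter is built on the same idea.

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs where no group is on both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("need at least as many groups as folds")
    # Assign whole groups (sweeps) to folds round-robin.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Hypothetical labels: one id per liquid sweep, repeated for each frame.
sweep_ids = ["s1"] * 4 + ["s2"] * 3 + ["s3"] * 5 + ["s4"] * 2

for train_idx, test_idx in group_k_fold(sweep_ids, n_splits=2):
    train_sweeps = {sweep_ids[i] for i in train_idx}
    test_sweeps = {sweep_ids[i] for i in test_idx}
    # Near-duplicate frames from one sweep can no longer straddle the split.
    assert not (train_sweeps & test_sweeps)
    print(sorted(test_sweeps), len(test_idx))
```

Grouping only removes leakage within the training data. It says nothing about inputs that come from a different source altogether.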
The killer was distribution shift.
Trained on liquids. Tested on soil. Many byte ranges in real soil were outside anything the model had ever seen.
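A cheap way to spot this before trusting a score is to count how many new rows fall outside the per-byte range seen in training. The sketch below assumes each frame is a fixed-length tuple of byte values; the frames are invented for illustration, and `out_of_range_fraction` is a hypothetical helper.

```python
def out_of_range_fraction(train_rows, test_rows):
    """Fraction of test rows with any feature outside the training min/max."""
    n_features = len(train_rows[0])
    lows = [min(row[j] for row in train_rows) for j in range(n_features)]
    highs = [max(row[j] for row in train_rows) for j in range(n_features)]
    outside = sum(
        any(not (lows[j] <= row[j] <= highs[j]) for j in range(n_features))
        for row in test_rows
    )
    return outside / len(test_rows)


# Hypothetical 3-byte frames (made-up values, not real sensor data).
liquid_frames = [(10, 200, 35), (12, 180, 40), (15, 190, 38)]
soil_frames = [(11, 185, 36), (60, 190, 39), (14, 120, 37), (13, 195, 90)]

print(out_of_range_fraction(liquid_frames, soil_frames))  # 0.75
```

A per-feature range check is a coarse test: a row can sit inside every individual range and still be an unseen combination. A high fraction is still a clear warning that cross-validation scores will not transfer.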
Gradient boosted trees don't extrapolate. Outside the range of inputs they were trained on, their output flattens to a constant.
So the moment the input left the training distribution, the model stopped predicting and started repeating a number. The 98% never measured how well it read soil. It measured how well it fit the liquid sweeps it was trained on.
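The flat-line behaviour is easy to reproduce on toy data. The sketch below is a deliberately tiny gradient boosting regressor built from depth-1 stumps, written from scratch for illustration. It is not the model from this experiment, and the data (y = 2x for x from 0 to 10) and the hyperparameters are assumptions chosen to keep the example small.

```python
def fit_stump(xs, residuals):
    """Best single-split least-squares stump on 1-D data."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted(xs, ys, n_rounds=200, learning_rate=0.3):
    """Squared-error boosting: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(
            l if x <= t else r for t, l, r in stumps)

    return predict


# Toy training data: y = 2x, seen only for x in 0..10.
train_x = [float(i) for i in range(11)]
train_y = [2.0 * x for x in train_x]
predict = fit_boosted(train_x, train_y)

print(predict(5.0))                  # inside the training range
print(predict(20.0), predict(60.0))  # outside: the same value both times
```

Every split threshold lies between two training points, so any input beyond the largest training value takes the same branch in every stump. The prediction at x = 20 and x = 60 is therefore identical, no matter how far the true value has moved.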
A few things, especially for anyone building with AI:
The model (the ML model, not Claude) did exactly what it was trained to do. Knowing when to stop trusting the numbers was on me.
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →