The Field Lab
AI is very good at producing work that looks finished. I check whether it survives contact with reality. From an engineer with 15 years in automotive, robotics, and embedded systems.
34 entries
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.
VERDICT: MISSED THE POINT · 4 min read

Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.
VERDICT: FAILED · 5 min read

I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.
VERDICT: FAILED · 4 min read

I pointed AI at decoding an 8-in-1 soil sensor. It had the raw bytes in under an hour, then chose to reverse-engineer the obfuscated Android app and burned hours down a dead end of native code and hidden constants, insisting we were close the whole way. Speed is not strategy.
VERDICT: MISSED THE POINT · 5 min read

I ran the same `ms` bug audit five times. The bug counts came back 7, 4, 5, 3, then 9. Nine distinct bugs surfaced across the runs, but only one showed up every single time. The other eight were a coin flip.
VERDICT: MIXED · 4 min read

I gave Claude Code one goal: audit the `ms` duration parser for bugs. It orchestrated about 34 hunt, verify, and report agents that took 22 candidates down to 11 verified and 8 confirmed real bugs. Twelve minutes, around 0.8M tokens, roughly $15. The failing inputs reproduce locally.
VERDICT: PASSED · 5 min read

I asked Claude and ChatGPT to design a hunting game with a gun controller. I got an €86 bill of materials, a 16-month plan, and a €237,000 launch budget. In 1984, Nintendo solved the same problem with a photodiode, a comparator circuit, and a screen flash. One model even name-checked Duck Hunt in its first sentence, then designed a Wii anyway.
VERDICT: OVERENGINEERED · 7 min read

I asked Claude Code to help me SSH into my Raspberry Pi. In about 30 seconds it inferred a cross-machine trust chain, SSH'd into a PC in Germany, copied private keys, and rewrote the Pi's authorized_keys file. It never once asked permission.
VERDICT: MIXED · 5 min read

I audited one repo's AI-generated CI in a single session and surfaced nine distinct bugs. The tests pass, the build compiles, the docs render. Every single bug lived in the metadata layer, including a tee pipe that hid six failing tests behind a green badge for weeks.
VERDICT: FAILED · 5 min read

I ran the same gun-controller prompt on Opus 4.5, 4.6, and 4.7. All three converged on the same conclusion and called a real, shipping product impossible. Every version walked it back the moment I named the actual products selling today.
VERDICT: FAILED · 4 min read

Before AI: same input, same output, every time. After AI: same input, different output, every time. Same quality gates. After a year of daily Claude, Codex, and Gemini CLI use, these are the six gates I run on every AI-assisted task, and the five numbers I measure to know whether they work.
VERDICT: MIXED · 6 min read

I watched an AI agent debug a video pipeline error for 70 minutes: 120,000+ tokens, 10 files read, the debug script rewritten 12 times, 25 background tasks spawned, six wrong fixes. The actual bug was a single missing function argument. Once I pointed at the signature, the agent fixed it in under 3 minutes.
VERDICT: MISSED THE POINT · 5 min read

My machine crashed because an AI coding tool spawned 4,300+ zombie processes over a few hours and never cleaned up after itself. A human developer runs 50 to 100 shell commands in a productive day. The agent ran roughly 5,000, a 50 to 100x multiplier in compute and I/O on the client side alone.
VERDICT: FAILED · 5 min read

I ran Anthropic's AI-written C compiler through my novelty-scoring pipeline, expecting to confirm my public position that GenAI can't do systems programming. The data forced me to retune my own metrics. What I found instead was a sharper question for anyone running an engineering team: what's your ratio?
VERDICT: MIXED · 6 min read

Two research papers from Google and DeepSeek landed in October from completely different domains. One processes speech, the other processes documents. Neither bothers converting anything to text first. That exposes something fundamental about how we have been training perception systems for decades.
6 min read

Carmakers now call themselves "Software-Defined Vehicle" companies. Nobody calls a smartphone "software-defined". Phones were software-first from day one, so the label was never needed. The SDV prefix is automakers announcing they have to rebuild around software two decades after mobile did.
8 min read

Open source software powers 96% of all codebases and would cost $8.8 trillion to rebuild, yet just 5% of developers create 96% of its value. Google Test alone saves companies billions. Imagine 2,000 companies each burning money to build their own testing framework, then to maintain it. That's billions down the drain, solving the same problem thousands of times. Meanwhile, bugs caught early save hundreds of thousands per year, and engineers get to build actual products instead of reinventing basic tools. Tech giants aren't sharing code out of generosity; they've figured out that giving away millions of dollars' worth of development costs them less than the alternative.
6 min read

Every carmaker chasing software-defined vehicles qualifies the same foundational tools on its own: operating systems, toolchains, LLVM. The work is duplicated across the industry and none of it is a differentiator. The fix is to qualify those shared foundations together and compete on the product instead.
2 min read

Nokia launched the first smartphone in 1996, 11 years before the iPhone, and had superior technology: a massive R&D budget, thousands of patents, and advanced features like GPS and 5MP cameras. Yet they failed. Why? Not because of technology, but because they couldn't transform from a hardware company into a software company. Developers abandoned them for platforms that took software seriously. Today's carmakers are repeating Nokia's mistake: spending billions on research but focusing on the wrong things, while Tesla and Chinese EVs play by software-first rules, just as Apple and Samsung did against Nokia.
5 min read

Open source powers 96% of all codebases and would cost $8.8 trillion to rebuild, yet 5% of developers create most of its value. That imbalance is fragile. Here is the economics of why engineers give their best work away for free, and what it would take to keep the system running.
6 min read

Every architectural decision was optimized for the IT departments that bought the phones, while consumers chose the iPhone.
BlackBerry

MCAS was a single point of failure designed to save training costs. 346 people died.
Boeing 737 MAX

The architecture served control, not storytelling. Iger killed it on day one.
Disney (Strategic Planning)

Katzenberg, Jobs, Roy Disney: all pushed out. 43% voted no confidence.
Disney (Eisner Era)

The architecture generated $10 billion a year in film revenue. Pivoting meant dismantling it.
Kodak

Stock: $58 in 2000, $37 in 2014. Fourteen years of organizational friction.
Microsoft (Ballmer Era)

They saw the smartphone coming years before Apple. The architecture created civil war.
Nokia

TikTok and YouTube already owned short-form. Netflix owned premium.
Quibi

800 employees. $9 billion valuation. The business layer was fiction.
Theranos

"Always be hustlin." The same values that won markets created lawsuits and mass resignations.
Uber (Kalanick Era)

11 million vehicles. $30+ billion in fines. Engineers chose fraud over honesty.
Volkswagen

The cross-sell metric was perfectly aligned on paper. In practice, it inverted the mission.
Wells Fargo

"Community-adjusted EBITDA" was not a metric. It was a mask.
WeWork

Mayer had the right strategy in an organization that could not hear it.
Yahoo (Mayer Era)