AI FIELD TEST №01
February 20, 2026 · 6 min read
AI · Compilers · Systems-Software

I Set Out to Prove AI Can't Build Systems Software

I ran Anthropic's AI-written C compiler through my novelty-scoring pipeline, expecting to confirm my public position that GenAI can't do systems programming. The data forced me to retune my own metrics. What I found instead was a sharper question for anyone running an engineering team: what's your ratio?

Setup

Ran Anthropic's AI-written C compiler (CCC) through my own novelty-scoring pipeline

Measured

Passes 99% of GCC torture tests, 0% token similarity, ~$20K; but ~2x instruction bloat, ~100K LOC vs 15-20K human, 478 dead functions

Verdict

Mixed

I set out to prove AI can't build serious systems software. The data proved me wrong.

That sentence is uncomfortable to write, because a day before the evidence landed, I had taken the opposite position in public.

The Setup

I maintain a pipeline that scores the novelty of software projects. It leans on established metrics and known FOSS tools, and I use it for two things: evaluating my students' work and dissecting interesting repositories. It is the closest thing I have to an objective instrument for the question "is this code actually original, or is it a remix?"

The day before Anthropic published CCC, their AI-written C compiler, I had posted about why generative AI is not suited to systems programming. Compilers, kernels, drivers: the argument was that this class of software demands a kind of judgment and rigor that next-token prediction does not have.

So when CCC dropped, I ran it through the pipeline expecting confirmation. A victory lap, basically.

The results were strong enough that I had to retune my weights, replace some metrics, and rerun the whole analysis. The rerun was still impressive. There is a particular humility in watching your own measurement tool side against you the day after you planted a flag.

What Is Genuinely Impressive

The numbers that made me rerun the pipeline:

Passes 99% of the GCC torture tests. That suite exists specifically to break compilers. Passing it is not a toy result.
Real optimization passes. This is actual optimization machinery, not a naive translator that emits whatever works.
0% token similarity with existing compilers. Structurally original by every measure my pipeline can apply.
Cost: roughly $20K. That is less than a few weeks of a senior compiler engineer in the Bay Area.

Any one of these would be notable. Together they ended my "AI can't do systems software" position as a blanket claim.

Where It Gets Complicated

The same pipeline that humbled me also surfaced the cracks:

Instruction-level output is roughly 2x more bloated than what established compilers produce.
The compiler itself is about 100K lines of code, and the full repo is 305K. Comparable human-written compilers land at 15-20K lines.
478 dead functions sit in the codebase.
A high LLM fingerprint throughout: verbose, over-documented, repetitive.
It leans on GNU binutils in some cases rather than standing fully on its own.

None of these stop the compiler from working. All of them matter the moment a human team has to maintain it, extend it, or trust it in production. The gap between "it works" and "you can ship it" is enormous. That gap is where engineering judgment lives.
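To put the size gap in proportion, here is a quick back-of-the-envelope calculation using only the figures quoted above. The variable names are illustrative, and the line counts are the rounded numbers from this post, not fresh measurements.

```python
# Rounded figures quoted in this post (not re-measured here).
ccc_loc = 100_000                      # CCC compiler proper
human_loc_low, human_loc_high = 15_000, 20_000  # comparable human-written compilers

# How many times larger CCC is than the human-written range.
ratio_low = ccc_loc / human_loc_high   # against the larger human compiler
ratio_high = ccc_loc / human_loc_low   # against the smaller human compiler

print(f"CCC is about {ratio_low:.1f}x to {ratio_high:.1f}x larger")
```

Every extra line is something a maintainer has to read before changing it, which is why a multiple like this matters more for maintenance than for correctness.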

The Caveat That Matters Most

Building a compiler is a well-understood, bounded problem. There are hundreds of references, decades of literature, and test suites like the GCC torture tests that define success precisely. The AI had a clear target to optimize against.

Getting something working under those conditions is not the hard part. The hard part is building something efficient, maintainable, and robust under conditions the test suite never anticipated. CCC was playing a game with published rules. Most real systems work is not.

The Philosophical Question

CCC is described as a clean-room implementation. But the model was trained on vast amounts of code, certainly including compiler implementations. My pipeline confirms it did not copy from specific projects. What it cannot tell me is whether the model internalized compiler-construction patterns from training and reproduced them in a different language.

And it was a clever choice to build it in Rust. Most reference compilers are written in C or C++, so the language switch alone inflates novelty scores. My instrument measures token and structural similarity, not conceptual lineage.
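The language effect is easy to reproduce on a toy scale. The following is a minimal sketch of one common token-similarity measure (Jaccard overlap of token trigrams). It is an assumption for illustration, not the scoring used in the pipeline described here, and the lexer, function names, and sample snippets are all hypothetical.

```python
import re


def tokens(source: str) -> list[str]:
    # Crude lexer: identifiers, integer literals, or single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def shingles(source: str, n: int = 3) -> set[tuple[str, ...]]:
    # All overlapping runs of n consecutive tokens.
    toks = tokens(source)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    # Shared shingles divided by all distinct shingles across both inputs.
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# The same function, written once in C and once in Rust.
C_VERSION = "int add(int a, int b) { return a + b; }"
RUST_VERSION = "fn add(a: i32, b: i32) -> i32 { a + b }"

same = jaccard(C_VERSION, C_VERSION)
cross = jaccard(C_VERSION, RUST_VERSION)

print(f"C vs C:    {same:.2f}")
print(f"C vs Rust: {cross:.2f}")
```

The two snippets compute the same thing, yet the cross-language score comes out close to zero because almost no token sequences survive the change in syntax. A measure like this can rule out copying, but it says nothing about shared design.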

Does Originality Even Matter?

Here is where the analysis turned on me a second time.

Looking back at my own career, the genuinely novel problems were maybe 20% of the work. The rest was engineers detecting patterns, relating the current situation to one they had solved before, then adapting. That is pattern recognition, and it is exactly what LLMs do.

So if AI can handle 80% of the bounded, pattern-based work at this speed and cost, does it matter that it is not original?

I think the question for executives is not "Can AI replace my engineers?" It is:

"What percentage of my work is the kind AI handles well, and what percentage requires judgment and ambiguity-handling?"

If you do not know your ratio, you are making AI adoption decisions in the dark. You will either over-trust the tool on the judgment-heavy 20% or waste your engineers on the bounded 80%.
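One rough way to get a first estimate, sketched under stated assumptions: label a sample of recent tasks by hand as "bounded" (clear spec, known pattern, testable target) or "judgment" (ambiguous requirements, trade-offs, no oracle), then count. The task list, the labels, and the `work_ratio` helper below are all hypothetical.

```python
from collections import Counter

# Hypothetical, hand-labeled sample of recent tasks.
tasks = [
    ("port sensor driver to new board revision", "bounded"),
    ("add unit tests for CAN message parser", "bounded"),
    ("migrate build scripts to new toolchain", "bounded"),
    ("implement documented protocol extension", "bounded"),
    ("decide safety architecture for new actuator", "judgment"),
]


def work_ratio(labeled_tasks):
    # Share of tasks per label; raises if there is nothing to count.
    counts = Counter(label for _, label in labeled_tasks)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tasks to classify")
    return {label: counts[label] / total for label in ("bounded", "judgment")}


ratio = work_ratio(tasks)
print(f"bounded: {ratio['bounded']:.0%}, judgment: {ratio['judgment']:.0%}")
```

Counting tasks is a simplification. Weighting by hours or by risk would likely shift the numbers, and the labeling itself is a judgment call.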

Know your ratio.

For most teams, originality was never the bulk of the job anyway.

Originally shared on LinkedIn.

Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work.
