I ran Anthropic's AI-written C compiler through my novelty-scoring pipeline, expecting to confirm my public position that GenAI can't do systems programming. The data forced me to retune my own metrics. What I found instead was a sharper question for anyone running an engineering team: what's your ratio?
Setup
Ran Anthropic's AI-written C compiler (CCC) through my own novelty-scoring pipeline
Measured
Passes 99% of GCC torture tests, 0% token similarity, ~$20K; but ~2x instruction bloat, ~100K LOC vs 15-20K human, 478 dead functions
Verdict
VERDICT: MIXED
I set out to prove AI can't build serious systems software. The data proved me wrong.
That sentence is uncomfortable to write, because a day before the evidence landed, I had taken the opposite position in public.
I maintain a pipeline that scores the novelty of software projects. It leans on established metrics and known FOSS tools, and I use it for two things: evaluating my students' work and dissecting interesting repositories. It is the closest thing I have to an objective instrument for the question "is this code actually original, or is it a remix?"
The day before Anthropic published CCC, their AI-written C compiler, I had posted about why generative AI is not suited to systems programming. Compilers, kernels, drivers: the argument was that this class of software demands a kind of judgment and rigor that next-token prediction does not have.
So when CCC dropped, I ran it through the pipeline expecting confirmation. A victory lap, basically.
The results were strong enough that I had to retune my weights, replace some metrics, and rerun the whole analysis. Still impressive. There is a particular humility in watching your own measurement tool side against you the day after you planted a flag.
The numbers that made me rerun the pipeline:
Any one of these would be notable. Together they ended my "AI can't do systems software" position as a blanket claim.
The same pipeline that humbled me also surfaced the cracks:
None of these stop the compiler from working. All of them matter the moment a human team has to maintain it, extend it, or trust it in production. The gap between "it works" and "you can ship it" is enormous. That gap is where engineering judgment lives.
Building a compiler is a well-understood, bounded problem. There are hundreds of references, decades of literature, and test suites like GCC torture that define success precisely. The AI had a clear target to optimize against.
Getting something working under those conditions is not the hard part. The hard part is building something efficient, maintainable, and robust under conditions the test suite never anticipated. CCC was playing a game with published rules. Most real systems work is not.
CCC is described as a clean-room implementation. But the model was trained on vast amounts of code, certainly including compiler implementations. My pipeline confirms it did not copy from specific projects. What it cannot tell me is whether the model internalized compiler-construction patterns from training and reproduced them in a different language.
And it was a clever choice to build it in Rust. Most reference compilers are written in C or C++, so the language switch alone inflates novelty scores. My instrument measures token and structural similarity, not conceptual lineage.
Here is where the analysis turned on me a second time.
Looking back at my own career, the genuinely novel problems were maybe 20% of the work. The rest was engineers detecting patterns, relating the current situation to one they had solved before, then adapting. That is pattern recognition, and it is exactly what LLMs do.
So if AI can handle 80% of the bounded, pattern-based work at this speed and cost, does it matter that it is not original?
I think the question for executives is not "Can AI replace my engineers?" It is:
"What percentage of my work is the kind AI handles well, and what percentage requires judgment and ambiguity-handling?"
If you do not know your ratio, you are making AI adoption decisions in the dark. You will either over-trust the tool on the judgment-heavy 20% or waste your engineers on the bounded 80%.
Know your ratio.
For most teams, originality was never the bulk of the job anyway.
Originally shared on LinkedIn.
Read the original on LinkedIn →
Fadi Labib runs this field lab. 15 years in automotive, robotics, and embedded systems; ESMT Berlin EMBA. I give AI real engineering problems, then check its work. More about the lab →
Keep reading
Reverse-engineering an 8-in-1 soil sensor, my AI decoded 6 of 8 channels, declared the last two 'not decodable,' and wrote that verdict into version control. I rejected the false ceiling and pushed. Seven hours later the same repo said 8/8. A flawless executor and a shaky judge.
Claude trained a gradient boosting model mapping raw soil-sensor bytes to readings on 2,347 points: pH 0.98, EC 0.99, temperature 0.999 R² in cross-validation. On 59 held-out points from real soil, EC crashed to -0.56 R², worse than predicting the mean. The model overfit the rig, not the world.
I let an AI agent run a multi-phase build solo. Every phase ended with a clean summary: done, tested, committed. Then I checked git instead. One phase reported '3 prompts, 8 minutes' while the timestamps disagreed, and a fix it marked DONE had been silently reverted 1h53m earlier with nothing in the report changed.