I Set Out to Prove AI Can't Build Systems Software

I set out to prove AI can't build serious systems software. The data proved me wrong.

That sentence is uncomfortable to write, because a day before the evidence landed, I had taken the opposite position in public.

The Setup

I maintain a pipeline that scores the novelty of software projects. It leans on established metrics and known FOSS tools, and I use it for two things: evaluating my students' work and dissecting interesting repositories. It is the closest thing I have to an objective instrument for the question "is this code actually original, or is it a remix?"

The day before Anthropic published CCC, their AI-written C compiler, I had posted about why generative AI is not suited to systems programming. Compilers, kernels, drivers: the argument was that this class of software demands a kind of judgment and rigor that next-token prediction does not have.

So when CCC dropped, I ran it through the pipeline expecting confirmation. A victory lap, basically.

The results were strong enough that I distrusted my own instrument first: I retuned my weights, replaced some metrics, and reran the whole analysis. The results were still impressive. There is a particular kind of humility in watching your own measurement tool side against you the day after you planted a flag.

What Is Genuinely Impressive

The numbers that made me rerun the pipeline:

Passes 99% of the GCC torture tests. That suite exists specifically to break compilers. Passing it is not a toy result.
Real optimization passes. Not a naive translator that emits whatever works. Actual optimization machinery.
0% token similarity with existing compilers. Structurally original by every measure my pipeline can apply.
Cost: roughly $20K. That is less than a few weeks of a senior compiler engineer in the Bay Area.

Any one of these would be notable. Together they ended my "AI can't do systems software" position as a blanket claim.

Where It Gets Complicated

The same pipeline that humbled me also surfaced the cracks:

Instruction-level output is roughly 2x more bloated than what established compilers produce.
The compiler itself is about 100K lines of code, 305K in the full repo. Comparable human-written compilers land at 15-20K lines.
478 dead functions sitting in the codebase.
A high LLM fingerprint throughout: verbose, over-documented, repetitive.
It leans on GNU binutils in some cases rather than standing fully on its own.

None of these stop the compiler from working. All of them matter the moment a human team has to maintain it, extend it, or trust it in production. The gap between "it works" and "you can ship it" is enormous, and that gap is where engineering judgment lives.
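To make the dead-function finding concrete, here is a minimal sketch of one common way to count such functions: build a call graph, walk it from the entry points, and report whatever is never reached. This is an illustration under assumptions, not a description of how my pipeline arrived at 478. The function names and the call graph below are hypothetical, and a real tool would also have to account for trait dispatch, function pointers, and conditionally compiled code.

```python
from collections import deque


def dead_functions(call_graph: dict, roots: set) -> set:
    """Return functions that are not reachable from any root via calls."""
    reachable = set()
    queue = deque(roots)
    while queue:
        fn = queue.popleft()
        if fn in reachable:
            continue
        reachable.add(fn)
        # Follow every outgoing call edge; unknown callees have no edges.
        for callee in call_graph.get(fn, ()):
            if callee not in reachable:
                queue.append(callee)
    # Anything defined in the graph but never visited is dead.
    return set(call_graph) - reachable


# Hypothetical call graph: function name -> functions it calls.
CALL_GRAPH = {
    "main": ["parse", "codegen"],
    "parse": ["lex"],
    "lex": [],
    "codegen": ["emit"],
    "emit": [],
    "old_emit": ["emit"],        # calls live code, but nothing calls it
    "debug_dump": ["old_emit"],  # only reaches other dead code
}

dead = dead_functions(CALL_GRAPH, {"main"})
print(sorted(dead))
```

Note that old_emit counts as dead even though it calls a live function: what matters is whether anything reachable calls it, not what it calls.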

The Caveat That Matters Most

Building a compiler is a well-understood, bounded problem. There are hundreds of references, decades of literature, and test suites like the GCC torture tests that define success precisely. The AI had a clear target to optimize against.

Getting something working under those conditions is not the hard part. The hard part is building something efficient, maintainable, and robust under conditions the test suite never anticipated. CCC was playing a game with published rules. Most real systems work is not.

The Philosophical Question

CCC is described as a clean-room implementation. But the model was trained on vast amounts of code, certainly including compiler implementations. My pipeline confirms it did not copy from specific projects. What it cannot tell me is whether the model internalized compiler-construction patterns from training and reproduced them in a different language.

And it was a clever choice to build it in Rust. Most reference compilers are written in C or C++, so the language switch alone inflates novelty scores. My instrument measures token and structural similarity, not conceptual lineage.
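A small sketch shows why the language switch matters to a token-based instrument. It compares the same algorithm, Euclid's GCD, written in C and in Rust, using Jaccard similarity over three-token windows. The tokenizer, the window size, and the two snippets are assumptions chosen for illustration; my actual pipeline uses different metrics and weights.

```python
import re

# Hypothetical tokenizer: identifiers, integer literals, or any single
# non-whitespace character (so "!=" becomes two tokens).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(source: str, n: int = 3) -> set:
    """Return the set of n-token windows ("shingles") in a source string."""
    tokens = TOKEN_RE.findall(source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two sources' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


C_GCD = """
int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}
"""

RUST_GCD = """
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 { let t = a % b; a = b; b = t; }
    a
}
"""

same = jaccard(C_GCD, C_GCD)
cross = jaccard(C_GCD, RUST_GCD)
print(f"C vs C:    {same:.2f}")
print(f"C vs Rust: {cross:.2f}")
```

The algorithm is identical in both snippets, yet the cross-language score falls well below the same-language score, because keywords, type syntax, and declarations differ token by token. A low score here says the surface differs. It says nothing about whether the idea does.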

Does Originality Even Matter?

Here is where the analysis turned on me a second time.

Looking back at my own career, the genuinely novel problems were maybe less than 10% of the work. The rest was engineers detecting patterns, relating the current situation to one they had solved before, and adapting. That is pattern recognition. It is exactly what LLMs do.

So if AI can handle something like 80% of the work, the bounded, pattern-based part, at this speed and cost, does it matter that it is not original?

I think the question for executives is not "Can AI replace my engineers?" It is:

"What percentage of my work is the kind AI handles well, and what percentage requires judgment and ambiguity-handling?"

If you do not know your ratio, you are making AI adoption decisions in the dark. You will either over-trust the tool on the judgment-heavy 20% or waste your engineers on the bounded 80%.

Know your ratio.

And maybe the real question is not whether AI can be original. Maybe it is whether originality was ever as important as we thought.

Originally shared on LinkedIn.

Related essays that extend the same thread.

Why AI Can't Design Duck Hunt

Six Quality Gates for AI-Assisted Engineering

Why Do We Still Assume Machines Must Read Before They Understand?
