Over the past two years, large language models (LLMs) have gone from novelty to daily driver. They can generate entire applications from a prompt, refactor complex functions, write CUDA kernels, and even emit LLVM IR. For many engineers, the workflow has shifted from "write every line" to "describe intent and iterate."
That shift naturally raises a provocative question:
If LLMs can write code, do we still need compilers?
At first glance, it's tempting to imagine a future where AI systems replace large parts of the traditional toolchain. If a model can emit optimized C++ or vectorized Python, maybe the compiler becomes a thin execution engine, or even disappears entirely.
But this framing misunderstands both technologies.
LLMs are generative systems. They are exceptional at pattern synthesis, code completion, API discovery, and translating intent into working implementations. They are probabilistic, heuristic, and language-driven.
Compilers, by contrast, are deterministic optimization engines. They reason about types, data layout, control flow, aliasing, vectorization, threading, and hardware targets. They enforce correctness constraints and transform programs with mathematical guarantees.
At Exaloop, we’ve seen this distinction firsthand. LLMs are increasingly capable of generating C/C++, emitting low-level kernels, or even “transpiling” Python into lower-level languages. But doing so introduces subtle correctness and semantic inconsistencies, leaves performance optimizations on the table, and ultimately creates a maintenance headache when the code needs to be updated or extended. Python, as it turns out, is a heck of a lot easier to manage than mountains of equivalent C code.
The real story isn’t "LLMs vs. compilers". It’s LLMs + compilers.
LLMs can dramatically improve developer productivity by generating Codon-compatible Python, scaffolding parallel algorithms, or refactoring code into forms that are more amenable to static optimization. Codon can then take that code and apply aggressive optimizations—multithreading, vectorization, memory specialization, and hardware-specific lowering—that no language model can reliably synthesize on its own.
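For example, an LLM might draft a plain, statically typed numeric kernel, and a single annotation lets Codon parallelize the hot loop. The snippet below is a minimal sketch rather than a tuned implementation; the @par annotation shown is Codon's OpenMP-backed loop parallelism:

```python
# Ordinary, statically typed Python an LLM can readily generate; no dynamic
# features, so Codon can specialize the types and optimize the loop aggressively.
def dot(a: list[float], b: list[float]) -> float:
    total = 0.0
    @par  # Codon-specific: run this loop in parallel (OpenMP), reducing into total
    for i in range(len(a)):
        total += a[i] * b[i]
    return total
```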
In this post, we’ll explore why LLMs can’t replace compilers, and why the combination of AI-generated code and high-performance compilation may be one of the most powerful shifts in software engineering yet.
The Calculator Problem
Ask an LLM: “What is 7 × 13?”
You’ll probably get 91.
Ask it again.
You’ll probably get 91 again. But “probably” is the key word.
LLMs are probabilistic systems. They do not execute algorithms in the traditional sense; they generate tokens based on learned statistical patterns. Even when they produce correct answers consistently, correctness is emergent behavior—not a guarantee enforced by a formal execution model.
Now consider a compiler. When a compiler translates:
```python
x = 7 * 13
```

it must deterministically emit machine instructions that, on every supported architecture, always compute 91. There is no temperature parameter. No sampling variance. No “close enough.” If the result changes between runs, builds, or hardware targets, the compiler is broken.
This is what we call the Calculator Problem.
Compilers exist to provide deterministic, semantics-preserving transformations. Given the same input program and target configuration, they must produce the same meaning every time. Optimization passes (vectorization, inlining, constant folding, loop transformations) are required to be correctness-preserving. The system is built on invariants, formal IRs, and well-defined lowering stages.
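A small example shows why this constraint has teeth. Floating-point addition is not associative, so a correctness-preserving compiler cannot freely reorder a summation; relaxing that rule is an explicit opt-in (e.g. -ffast-math in GCC/Clang), never a default:

```python
# Floating-point addition is not associative, so reordering changes the result.
# A semantics-preserving compiler must keep the order the program specifies.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0 -- the large terms cancel first, so the 1.0 survives
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed into -1e16 and lost
```

This is exactly the kind of invariant an optimizer has to respect on every path, for every input.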
LLMs, by design, do not operate under those constraints. They can suggest a faster algorithm. They can rewrite Python into C++. They can emit SIMD intrinsics or even LLVM IR.
But they cannot guarantee that:
- The transformation preserves semantics in all edge cases
- The memory model is respected
- Undefined behavior isn’t introduced
- The optimization remains valid across architectures
- The result is reproducible build after build
Even if an LLM gets it right 99% of the time, that remaining 1% is unacceptable for compilation. Software infrastructure depends on determinism the way mathematics depends on equality.
This is not a weakness of LLMs, it’s simply not what they are built for. They are generative reasoning tools, not formal transformation engines.
Codon, like any serious compiler, exists to solve the Calculator Problem. It takes Python code and applies deterministic, architecture-aware transformations that are provably semantics-preserving. It doesn’t “guess” how to vectorize a loop. It analyzes it. It doesn’t “approximate” aliasing behavior. It computes it.
In other words:
- LLMs are excellent at proposing programs.
- Compilers are responsible for guaranteeing them.
And when performance, reproducibility, and correctness matter, determinism isn’t optional, it’s foundational.
Why LLM-based transpilation doesn't work
A common pitch is: take Python, translate it to C/C++ (by hand or with an LLM), and you get speed.
That can work for a narrow subset of Python: straight-line numeric code with fixed types and simple control flow. But “Python” in the real world isn’t that subset. Python’s semantics include constructs that do not have a 1-to-1 mapping to C/C++, because they aren’t just syntax, they’re runtime behaviors: lazy evaluation, dynamic dispatch, introspection, exceptions as control flow, and pervasive object identity.
A canonical example: generators.
Generators aren't a loop, they're a suspended computation
In Python:
```python
def tokens(stream):
    for chunk in stream:
        for t in chunk.split():
            yield t
```

This function doesn’t produce a list. It produces an iterator that yields values one at a time, maintaining internal state between yields. It’s lazy, it composes, and it has a bounded memory footprint: it only holds the current chunk, the split iterator, and a bit of state.
There is no direct C equivalent of yield as a language feature. To implement this in C/C++, you need one of:
- A state machine (manual continuation lowering)
- A heap-allocated coroutine frame (compiler/runtime support)
- A custom iterator object with careful lifetime and exception semantics
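To get a feel for how much machinery yield hides, here is a rough sketch of that third option written in Python itself: an explicit iterator object whose suspension state is managed by hand. A real C/C++ port would also have to nail down lifetimes and exception behavior on top of this:

```python
# A hand-rolled equivalent of tokens(): the generator's implicit suspension
# state becomes explicit fields that must be kept consistent manually.
class Tokens:
    def __init__(self, stream):
        self._chunks = iter(stream)
        self._current = iter(())  # tokens of the chunk we are "suspended" in

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                return next(self._current)  # resume inside the current chunk
            except StopIteration:
                # Current chunk exhausted: pull the next one. If the stream is
                # done, next() raises StopIteration, which correctly ends iteration.
                chunk = next(self._chunks)
                self._current = iter(chunk.split())
```

Consuming it looks the same (for t in Tokens(stream): ...), but every piece of state the generator kept for free is now manual, and this is still the easy case: no try/finally around the yield, no exceptions thrown into the iterator, no early-close semantics to honor.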
That’s already a lot of machinery. And when an LLM “transpiles” this, it frequently takes the path of least resistance.
The easy transpilation is a semantic landmine: eager materialization
A typical naïve rewrite turns lazy iteration into a materialized container:
```python
def tokens_eager(stream):
    out = []
    for chunk in stream:
        out.extend(chunk.split())
    return out
```

This is still valid Python, but it’s no longer the same kind of program. The original generator can be consumed incrementally:
```python
for t in tokens(big_stream()):
    process(t)  # constant-ish memory
```

The eager version must allocate and hold all tokens before processing can even start:
```python
for t in tokens_eager(big_stream()):
    process(t)  # memory scales with total tokens
```

That one change can turn a pipeline that runs in $O(1)$ additional memory into $O(n)$ memory. On large inputs, that’s not a micro-optimization mistake, it’s the difference between “works” and “crashes.”
And this exact pattern is a common failure mode when using LLMs as “transpilers”: they produce code that looks plausibly equivalent but silently changes evaluation strategy from streaming to batch.
“But C++ is faster” isn’t enough if it explodes memory
Even if an LLM emits C++ that is functionally equivalent, it may still choose structurally expensive representations:
- Building std::vector<std::string> when the Python version yields strings on demand
- Copying substrings eagerly instead of using views
- Reifying intermediate arrays that Python never materializes
- Hoisting generators into buffers “for simplicity”
- Flattening iterators into full lists to avoid modeling suspension state
In other words, the “transpiled” program can easily allocate orders of magnitude more memory than the original Python, precisely because Python’s lazy constructs are often what keep memory bounded in the first place.
The deeper issue: Python semantics are runtime semantics
Generators are just one example. Other Python features that routinely defeat 1-to-1 translation include:
- Closures capturing variables with late binding (see the sketch after this list)
- Dynamic dispatch and class/object semantics
- Duck-typed protocols (iteration, context managers, numeric tower)
- Exception semantics (catching exceptions, finally blocks, and exceptions used as control flow)
- … and many others
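The first item alone trips up mechanical translation: Python closures capture variables, not values, so bindings are resolved when the closure runs. A minimal sketch:

```python
# Late binding: each lambda captures the variable i itself, not its value at
# definition time, so every closure observes i's final value.
fns = [lambda: i for i in range(3)]
print([f() for f in fns])         # [2, 2, 2]

# The conventional workaround captures the value via a default argument.
fns = [lambda i=i: i for i in range(3)]
print([f() for f in fns])         # [0, 1, 2]
```

A translation that quietly copies the loop variable into each closure produces the second behavior where the original program meant the first.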
C/C++ doesn’t lack expressiveness: you can implement almost anything with enough scaffolding. The problem is that “transpilation” is only attractive if it’s simple and mechanical. Once you need a Python runtime, coroutine lowering, object model emulation, and semantics-preserving guarantees, you’re back to… building a compiler and runtime.
Where this lands for LLMs
LLMs can absolutely help here: they can refactor Python into a form that’s easier to compile, or suggest a streaming-friendly algorithm, or rewrite code to avoid dynamic features.
But using an LLM as an automatic Python → C++ translator is brittle for the same reason the Calculator Problem exists:
- it’s easy to produce something that looks equivalent,
- while quietly changing semantics,
- and frequently changing memory behavior in disastrous ways.
The right approach is not “LLM replaces the compiler”—it’s “LLM proposes code; the compiler enforces semantics and optimizes deterministically”.
Subtle semantic and performance regressions
Even when a Python → C/C++ translation looks correct, it can quietly change either numerical semantics or performance characteristics in ways that are hard to detect.
Two examples from NumPy make this concrete.
Summation is not just a for-loop
Consider:
```python
import numpy as np

x = np.random.rand(10_000_000)
s = np.sum(x)
```

An LLM asked to “rewrite this in C++” might generate something like:
```cpp
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
    s += x[i];
}
```

That is simple. It is readable. It compiles.
It is also not equivalent to what NumPy actually does.
Modern NumPy implementations use block-based or pairwise summation strategies (and may leverage vectorized reduction kernels underneath). The goal is to reduce floating-point error accumulation by avoiding purely linear summation order. Summing left-to-right as $((((a_0 + a_1) + a_2) + a_3) + \cdots)$ accumulates rounding error differently than summing in balanced blocks.
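The effect is easy to demonstrate without NumPy at all. The sketch below compares a left-to-right loop against a simple recursive pairwise sum, using math.fsum as a correctly rounded reference; exact error magnitudes depend on the data, but the gap between the two strategies is the point:

```python
import math
import random

def pairwise_sum(xs):
    # Sum halves recursively: rounding error grows roughly like log(n)
    # rather than n, as in a single left-to-right accumulation.
    if len(xs) <= 32:
        total = 0.0
        for v in xs:
            total += v
        return total
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10 ** random.randint(0, 8)
        for _ in range(1_000_000)]

exact = math.fsum(data)   # correctly rounded reference
naive = 0.0
for v in data:
    naive += v            # the "obvious" transpiled loop

print("naive error:   ", abs(naive - exact))
print("pairwise error:", abs(pairwise_sum(data) - exact))
```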
On large arrays, especially with values of mixed magnitude, these differences are measurable. The naïve loop may:
- Accumulate significantly more floating-point error
- Produce slightly different results than NumPy
- Break numerical reproducibility expectations
In finance, scientific computing, or ML preprocessing pipelines, that “tiny” difference can matter.
So what happened? The LLM generated something that looks equivalent, but silently changed the numerical algorithm. That’s not a syntax change, that’s a semantic change.
Multi-dimensional arrays and cache behavior
Now consider:
```python
y = np.sum(A, axis=0)
```

where A is a large 2D array.
NumPy doesn’t just utilize a naïve nested loop. It:
- Examines memory layout (C-order vs. Fortran-order vs. non-contiguous arrays)
- Chooses loop ordering to maximize contiguous access
- Uses stride-aware iteration
- Applies vectorized inner kernels
- May tile internally for cache locality
An LLM translating this into C++ will likely produce:
```cpp
for (size_t i = 0; i < rows; i++) {
    for (size_t j = 0; j < cols; j++) {
        out[j] += A[i][j];
    }
}
```

Looks fine. But if A is column-major, this loop walks memory row-wise, resulting in strided access that jumps by an entire column each iteration. On large matrices, that means:
- Poor cache utilization
- More cache misses
- Lower effective memory bandwidth
- Dramatically worse performance
Simply flipping loop order:
```cpp
for (size_t j = 0; j < cols; j++) {
    for (size_t i = 0; i < rows; i++) {
        out[j] += A[i][j];
    }
}
```

can be substantially faster due to contiguous memory traversal.
NumPy handles this automatically. It reasons about strides and layout. An LLM emitting straightforward nested loops has no guarantee of doing so correctly, and often defaults to the most “obvious” ordering rather than the most efficient one.
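You can see that adaptability from Python directly: the same reduction over the same values is handled sensibly whether the array is C-ordered or Fortran-ordered, because NumPy picks a stride-aware traversal for each layout. A quick sketch (absolute timings will vary by machine):

```python
import time
import numpy as np

A_c = np.random.rand(4000, 4000)   # C-order (row-major)
A_f = np.asfortranarray(A_c)       # same values, column-major layout

for name, A in (("C-order", A_c), ("F-order", A_f)):
    t0 = time.perf_counter()
    col_sums = A.sum(axis=0)       # NumPy adapts its traversal to the strides
    print(f"{name}: {time.perf_counter() - t0:.4f} s")
```

A fixed nested loop like the ones above is implicitly committed to a single layout; NumPy is not.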
The pattern
In both examples, the LLM-generated code:
- Compiles
- Appears correct
- Passes small tests
And yet:
- Changes floating-point behavior
- Ignores memory layout
- Leaves performance on the table
- Fails to match library-level optimizations
These aren’t dramatic bugs. They’re subtle degradations. And those are often more dangerous, because they slip through review and only surface under scale.
A real compiler pipeline, by contrast:
- Understands types and layout
- Performs vectorization and reduction optimizations systematically
- Applies deterministic, architecture-aware transformations
- Preserves semantics while improving performance
This is exactly why “just rewrite it in C++”, whether by hand or via LLM, is not a substitute for a proper optimizing compiler.
The hard part of performance is not writing loops, it’s writing the right loops for the machine without changing the meaning of the program.
Conclusion: Intelligence proposes, compilers guarantee
LLMs are extraordinary tools. They can draft complex programs, refactor legacy code, suggest vectorized rewrites, and even emit low-level kernels. They dramatically reduce the friction between idea and implementation.
But they are not compilers.
Compilation is not just translation. It is the deterministic, semantics-preserving transformation of programs into efficient machine code. It requires formal reasoning about types, memory layout, aliasing, control flow, numerical stability, concurrency, and hardware characteristics. It requires guarantees.
LLMs operate probabilistically. They can approximate transformations. They can suggest optimizations. They can produce code that looks right. But they cannot provide the invariants that software infrastructure depends on:
- Reproducible builds
- Deterministic execution
- Architecture-aware optimization
- Semantic preservation across edge cases
- Memory and performance guarantees at scale
And as we’ve seen, subtle differences matter. A naïve reduction changes numerical behavior. A simple loop ordering change destroys cache locality. An eager rewrite of a generator explodes memory usage. These aren’t stylistic details, they are fundamental properties of the program.
The future is not “LLMs replace compilers”—the future is a layered stack:
- LLMs accelerate humans.
- Compilers accelerate programs.
At Exaloop, this is exactly how we see Codon fitting into the AI era. LLMs can generate clean, Codon-compilable Python. They can refactor dynamic code into statically optimizable patterns. They can help developers explore the design space faster than ever before.
Then Codon does what only a compiler can do: deterministically analyze, specialize, and lower that code into highly optimized native execution across CPUs and GPUs without sacrificing correctness.
In other words: LLMs make writing programs easier, and compilers make running them fast, reliably.



