The Waymo Rule for AI-Generated Code
AI-generated code doesn't have to be perfect; it just has to be better than you.
I used to think that AI would end up writing code in a language that had formal methods properties.
In computer science, formal methods are mathematically rigorous techniques for the specification, development, analysis, and verification of software and hardware systems. (Formal methods, Wikipedia)
The idea was that:
- engineers write a specification,
- AI generates code according to that specification, and
- a model checker verifies that the generated code conforms to the specification.
FizzBee seemed particularly well-suited to the task. It’s a specification language and model checker similar in spirit to TLA+. Unlike TLA+, whose syntax is famously impenetrable, FizzBee specifications are written in Starlark, the stripped-down dialect of Python used in Bazel. This makes it friendly for humans. It also makes it friendly for AI, since there’s a ton of Python code (and a reasonable amount of Starlark) on the internet.
I have also been watching Higher Order Company’s work on Bend and HVM4 for some time. (The founder, Victor Taelin, is an excellent follow on Twitter, if that’s your jam.)
I’ve given up on this idea. I now believe that:
- the specification language is just English,
- the AI’s programming languages are just the languages we have today, and
- the verification is just tests.
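Concretely, the loop is that simple: an English sentence as the spec, AI-written code in an ordinary language, and plain tests as the verifier. A minimal sketch (the function and spec here are hypothetical, purely for illustration):

```python
# Spec (English): "Return the list with duplicates removed,
# preserving the order of first occurrence."

def dedupe(items):
    """AI-generated candidate implementation of the spec above."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

# Verification: ordinary tests, not a model checker.
assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]
assert dedupe([]) == []
assert dedupe(["a", "a", "a"]) == ["a"]
```

No formal specification language, no model checker; the English comment is the spec and the assertions are the conformance check.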
What’s shifted my thinking is simply that AI has gotten much, much better. I now trust GPT-5.4 (high) to do the right thing most of the time. It has earned my trust.
For those living under a rock, a Waymo is a self-driving car. Waymos are not perfect. They still crash. But that’s fine, so long as they crash less often than humans. And they do. Realistically, they probably need to crash 10x-20x less frequently than humans so that our mush brains truly feel comfortable putting our lives in their hands. They probably also should not fail in non-human-like ways, and so on. But the point is that they don’t have to be perfect; just better.
I have started to think about AI-generated code the way I think about Waymos. It doesn’t need to be perfect, just better. And it is. I don’t need to verify that the code is correct; I just need to verify that it’s better than what I would have written. I’m starting to get to the point where that’s the case. Techniques like the ones in Code Quality Gates for Vibe-Coded Projects and Simplifying AI-Generated Code With Lizard help me trust AI-generated code.
Don’t get me wrong, there’s still a place for formal methods, model checkers, and verification. It’s just one tool of many. Tools and techniques such as property-based testing, fuzzing, static analysis, and so on all play in this space.
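To make the property-based testing idea concrete, here is a stripped-down sketch using only the standard library (a real project would use Hypothesis; the encoder is a hypothetical example): instead of hand-picking cases, you generate random inputs and assert an invariant, here a round-trip property.

```python
import random

def run_length_encode(s):
    """Candidate implementation: compress runs of repeated characters."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    return "".join(ch * n for ch, n in pairs)

# Property: decode(encode(s)) == s, checked over many random inputs.
rng = random.Random(0)
for _ in range(1000):
    s = "".join(rng.choice("ab") for _ in range(rng.randrange(20)))
    assert run_length_decode(run_length_encode(s)) == s
```

The test states *what* must hold rather than enumerating examples, which is exactly the kind of layered check that lets you trust generated code without reading every line.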
I also still expect us to evolve more AI-friendly languages (cf. From Human Ergonomics to Agent Ergonomics). I just don’t think that the future of AI-generated code is one in which AIs write code in a language with formal methods properties and then verify it with a model checker.
Antithesis has picked up on this. Though they started with deterministic simulation testing, they’ve hired the creator of Hypothesis (a property-based testing framework), and have released Bombadil, a property-based testing tool for web UIs. Their slogan now reads:
Bug-free systems, unlimited velocity: Unleash your agents
This is the way: defense in depth. Layered quality assurance that convinces our mushy human brains that we can trust what’s going on under the hood without opening it.