Deterministic Simulation Testing Is Really Hard
The more state you explore, the more false positives you'll encounter.
Last year, I spent some time building SlateDB’s deterministic simulation tests (DSTs). It was a fun project that exposed a few bugs:
- Key/value length overflows in SstRowCodecV0
- await_durable writes hang when wal_enabled is false
- Empty scan ranges cause panic
But the test has found exactly zero bugs since then. If you look at the test, you’ll see that it’s too simple to be meaningful.
The test harness randomly generates a configuration for a SlateDB instance and runs read/write/scan/flush operations in a loop. After each operation, it checks relevant invariants. All of this runs in a single thread on a deterministic Tokio async runtime scheduler (there’s a sketch after the list below). This approach misses:
- failure injection (network partitions, process crashes, disk failures, etc.)
- more complex workloads (transactions, compactions, merge operators, TTL, etc.)
- more complex invariants (linearizability, durability, etc.)
- interleaved operations (a read while a write is happening, etc.)
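For concreteness, here’s roughly what that original loop looked like. This is a hand-wavy sketch rather than SlateDB’s actual test code: `gen_config`, `open_db`, `rand_op`, `check_invariants`, and the `Op` enum are hypothetical stand-ins for the real harness.

```rust
use rand::{rngs::StdRng, SeedableRng};

fn run_dst(seed: u64) {
    let mut rng = StdRng::seed_from_u64(seed);
    let config = gen_config(&mut rng); // fuzzed SlateDB configuration

    // Single-threaded runtime; the real harness pins scheduling down so a
    // given seed always replays the same way.
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        let db = open_db(&config).await;
        for _ in 0..1_000 {
            match rand_op(&mut rng) {
                Op::Put(key, value) => { db.put(&key, &value).await; }
                Op::Get(key) => { db.get(&key).await; }
                Op::Scan(range) => { db.scan(range).await; }
                Op::Flush => { db.flush().await; }
            }
            // e.g. acknowledged writes are readable, scans come back ordered
            check_invariants(&db).await;
        }
    });
}
```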
I added most of these missing pieces this week on a local branch. Even with SlateDB’s codebase, which can already be made deterministic, it’s been a real struggle.
The test harness needed a full rewrite to allow operation interleaving. Rather than having the harness pick operations, tests now spawn one or more tasks (“actors”) on the runtime. These tasks can do whatever they like: read, write, flush, compact, restart the database, and so on. I also added the concept of “toxics”, similar to those found in Toxiproxy, which enable fault injection.
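In rough strokes, the new harness looks like this. All of the names here (`Db`, the actor functions, `LatencyToxic`) are illustrative, not the types on my branch:

```rust
use std::{sync::Arc, time::Duration};
use rand::{rngs::StdRng, Rng, SeedableRng};

async fn run_simulation(db: Arc<Db>, seed: u64) {
    let mut rng = StdRng::seed_from_u64(seed);

    // Each actor gets its own derived seed so the run stays reproducible.
    let actors = tokio::task::LocalSet::new();
    actors.spawn_local(writer_actor(db.clone(), rng.gen::<u64>()));
    actors.spawn_local(compactor_actor(db.clone(), rng.gen::<u64>()));
    actors.spawn_local(restart_actor(db.clone(), rng.gen::<u64>()));
    actors.await; // drive every actor to completion on one thread
}

// A toxic in the spirit of Toxiproxy: wrap the object store and, with some
// probability, inject latency (or an error) before forwarding each call.
struct LatencyToxic<S> {
    inner: S,
    probability: f64,
    delay: Duration,
}
```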
These changes dramatically expanded the codepaths under test (state exploration). That’s a good thing; I wanted to exercise more of SlateDB’s codebase. Unfortunately, many of the new states were what I’d call invalid. For example:
- The fuzzer configured the compactor to fence itself faster than it could complete any compaction job, which led to hanging tests.
- An actor that controlled clock progression was shut down before other actors that depended on the clock progressing, resulting in a test hang.
- The mock clock advanced so quickly that database scans timed out, causing tests to fail.
- Injecting faults with a fuzzed probability produced tests that failed too frequently to make any progress.
- The fuzzer configured write sizes such that compaction was never exercised.
These are not bugs, per se; they’re correct behavior in the face of misconfiguration or bad test code. Taken individually, they seem obvious. But I was fuzzing to find edge cases, and the fuzzer kept picking new misconfigurations. I had to chase each one down, figure out what was going on, and fix it.
There is a tension between state exploration and “false positive” failures. Meaningful deterministic simulation tests must explore a large swathe of your codebase. To do that, you must allow for a wide range of configurations and workloads. But the wider the range, the more you’ll encounter invalid states that cause your tests to fail or hang. The trick is constraining your state exploration to meaningful states.
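One way to do that is to derive dependent knobs from one another, so the fuzzer can’t draw an invalid combination in the first place. A hypothetical example (the names and ranges are made up):

```rust
use std::time::Duration;
use rand::{rngs::StdRng, Rng};

// Pick the compaction budget first, then derive the fence interval from it,
// so the compactor can never fence itself before finishing a job.
fn gen_timing(rng: &mut StdRng) -> (Duration, Duration) {
    let max_compaction = Duration::from_millis(rng.gen_range(100u64..10_000));
    let fence_interval = max_compaction * rng.gen_range(2u32..10);
    (max_compaction, fence_interval)
}
```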
Property-based testers such as proptest and Hedgehog can help with this. You’ll at least be able to define invariants that your fuzzed configuration shouldn’t violate.
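With proptest, for instance, you can state those invariants up front so invalid draws are discarded instead of producing hangs you have to triage by hand (`SimConfig` and `run_dst` are hypothetical):

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn dst_survives_any_valid_config(
        compaction_ms in 100u64..60_000,
        fence_ms in 100u64..60_000,
        fault_probability in 0.0f64..1.0,
    ) {
        // Reject draws that violate the configuration invariants rather
        // than chasing down the hangs and failures they'd cause.
        prop_assume!(fence_ms > 2 * compaction_ms);
        prop_assume!(fault_probability < 0.2);
        run_dst(SimConfig { compaction_ms, fence_ms, fault_probability });
    }
}
```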
The challenge is figuring out all the invariants in the first place, though. I don’t see how anything can help with that (other than maybe formal verification and model checking). You just have to run your tests, see what fails, and figure out whether it’s a real bug or not.
Fortunately, GPT 5.4 (and now 5.5) has been a godsend in debugging these issues. When a test hangs, I give the test logs to GPT and it will usually tell me what’s going on:
- Run DST in a loop with random seeds.
- Test hangs after a few minutes.
- Copy logs and ask GPT what’s going on.
- GPT identifies the issue and suggests a fix.
- Implement the fix and repeat.
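The outer loop is nothing fancy; roughly this, assuming the hypothetical `run_dst` from earlier and an arbitrary 60-second hang budget:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    loop {
        let seed: u64 = rand::random();
        let sim = tokio::task::spawn_blocking(move || run_dst(seed));
        if tokio::time::timeout(Duration::from_secs(60), sim).await.is_err() {
            eprintln!("seed {seed} hung; saving logs for triage");
            break; // paste the logs into GPT and start digging
        }
    }
}
```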
This works; it’s just slow going. I’ve added fault injection, workload interleaving, compactor fencing tests, transaction tests, and a few other things. I plan to keep chipping away at it. I’ll keep you posted.