A flaky test is a test that lies. It passes or fails based on timing, not logic - and a real-time 60 fps game loop is the worst-case scenario for that kind of lie.
The loop reads the wall clock. It pulls random numbers for every spawn. It accumulates floating-point time across frames. Run the same test twice and you get two different worlds.
The engine underneath Pan Tvardowski is built to be deterministic where it matters. The test harness pins down the four things that make a game nondeterministic: time, frame pacing, randomness, and input. Pin all four and the same seed replays the same world bit-for-bit. That is what makes 1,866 engine tests and 1,802 game tests trustworthy instead of flaky.
But the honest version matters too: it is bit-exact where the code is pure integer math under a fake clock and an injected seed. It degrades to float-tolerance the moment you sum real dt, and degrades further once a real device's wall clock and Skia layer enter the picture. This post draws that boundary instead of pretending it is not there.
Deterministic game testing starts with four sources of nondeterminism
A game loop has exactly four things that can make it unpredictable across runs. Leave any one of them uncontrolled and no test suite will ever be trustworthy. Pin all four and every test becomes a deterministic function of its inputs.
The @flare-engine/testing package maps one primitive to each source. It is 221 LOC across 5 files and carries 23 tests of its own. Small by design: the harness should be boring.
| Nondeterminism source | Primitive | What it removes |
|---|---|---|
| Wall clock | fake-clock | Date.now() / real requestAnimationFrame |
| Frame pacing | test-world | Variable dt from real render loop |
| Randomness | SeededRNG (in @flare-engine/math) | Unseeded / device-seeded RNG calls |
| Input | mock-input | Human timing, gesture delays |
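Input is the one source in the table that does not get its own section below, so here is a minimal sketch of the idea: events are keyed by frame index rather than by wall time. The names createScriptedInput, InputEvent, and poll are assumptions for illustration and are not the mock-input API.

```typescript
// Hypothetical scripted-input source: events are keyed by frame index,
// so no human timing or gesture delay can leak into a test.
type InputEvent = { type: 'tap'; x: number; y: number } | { type: 'bomb' };

function createScriptedInput(script: Record<number, InputEvent[]>) {
  return {
    // Returns the events scripted for this frame, or an empty list.
    poll(frame: number): InputEvent[] {
      return script[frame] ?? [];
    },
  };
}

// Usage: drive 60 frames and record what the "player" did on each one.
const input = createScriptedInput({
  30: [{ type: 'bomb' }],
  45: [{ type: 'tap', x: 120, y: 300 }],
});
const inputLog: string[] = [];
for (let frame = 0; frame < 60; frame++) {
  for (const ev of input.poll(frame)) inputLog.push(`${frame}:${ev.type}`);
}
// inputLog is ['30:bomb', '45:tap'] on every run.
```

Because the script is plain data, it can sit next to the seed in a test fixture and be replayed together with it.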
The fake clock is the keystone
No test in the engine or game suite ever reads Date.now() directly. No test ever waits on a real requestAnimationFrame. Time only moves when a test explicitly advances it.
// packages/testing/src/fake-clock.ts
advance(seconds) {
  if (seconds < 0) throw new Error('[@flare-engine/testing] cannot advance backwards');
  nowMs += seconds * 1000;
  // Snapshot callbacks so any newly scheduled ones via requestFrame during
  // this flush are deferred to the next advance - mirrors rAF semantics.
  const snapshot = [...pending.entries()];
  pending.clear();
  for (const [, cb] of snapshot) cb(nowMs);
},

The advance(seconds) method is the whole trick. Calling advance(1/60) steps the world by exactly one frame. Calling it 300 times steps it 5 seconds. The pending-callback snapshot ensures that any new requestFrame calls scheduled during a flush are deferred to the next advance - which matches how a real browser rAF queue works.
Because nowMs is a closure variable, not a system call, two runs of the same test are mathematically identical. There is no race, no OS scheduler jitter, no slow CI machine.
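To make the deferral behavior concrete, here is a minimal, self-contained sketch of a fake clock in the same spirit. Only advance(seconds) and its snapshot-then-flush logic come from the excerpt above. The names createFakeClock, requestFrame, and now, and the use of a Map for pending callbacks, are assumptions and may not match the package.

```typescript
// Hypothetical sketch of a fake clock; not the real @flare-engine/testing API.
type FrameCallback = (nowMs: number) => void;

interface FakeClock {
  now(): number;
  requestFrame(cb: FrameCallback): number;
  advance(seconds: number): void;
}

function createFakeClock(): FakeClock {
  let nowMs = 0; // closure state: time never moves on its own
  let nextId = 1;
  const pending = new Map<number, FrameCallback>();

  return {
    now: () => nowMs,
    requestFrame(cb) {
      const id = nextId++;
      pending.set(id, cb);
      return id;
    },
    advance(seconds) {
      if (seconds < 0) throw new Error('cannot advance backwards');
      nowMs += seconds * 1000;
      // Snapshot first, so callbacks scheduled during this flush
      // wait for the next advance.
      const snapshot = Array.from(pending.values());
      pending.clear();
      for (const cb of snapshot) cb(nowMs);
    },
  };
}

// Usage: a self-rescheduling loop runs exactly once per advance.
const clock = createFakeClock();
let frames = 0;
const tick: FrameCallback = () => {
  frames++;
  clock.requestFrame(tick); // deferred to the next advance
};
clock.requestFrame(tick);
for (let i = 0; i < 300; i++) clock.advance(1 / 60);
// frames is now 300: one callback per advance, never more.
```

Without the snapshot, the rescheduled tick would land in the collection being iterated and could run again within the same advance, which is the kind of bug that only shows up as a flaky frame count.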
Seeded RNG: same seed, same sequence, bit-exact
The engine uses SeededRNG from @flare-engine/math, implemented as mulberry32 - a pure integer mixing function with magic constant 0x6d2b79f5. No floating-point in the generator itself, which is why it is bit-exact.
// packages/math/tests/random.test.ts
it('produces deterministic output', () => {
  const a = new SeededRNG(42);
  const b = new SeededRNG(42);
  for (let i = 0; i < 100; i++) {
    expect(a.next()).toBe(b.next());
  }
});

toBe - not toBeCloseTo. The equality is exact because mulberry32 is integer math from seed to output. Two generators seeded identically produce identical sequences forever, regardless of when or where they run.
Every enemy spawn, every drop roll, every wave decision in Pan Tvardowski goes through SeededRNG. Given the seed and the same input sequence, the game can replay any session exactly.
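For readers who have not seen it, this is roughly what a mulberry32-style generator looks like. The sketch follows the widely circulated public-domain routine. The class name SketchRNG and its methods are assumptions, and the engine's SeededRNG may differ in detail. In this version the only floating-point operation is the final division by 2^32, which is exact for any 32-bit integer, so outputs still compare equal with strict equality.

```typescript
// Sketch of a mulberry32-style generator; SketchRNG is a hypothetical name.
class SketchRNG {
  private state: number;

  constructor(seed: number) {
    this.state = seed | 0; // force a 32-bit integer seed
  }

  // Returns a float in [0, 1).
  next(): number {
    this.state = (this.state + 0x6d2b79f5) | 0;
    let t = Math.imul(this.state ^ (this.state >>> 15), 1 | this.state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    // Unsigned 32-bit result scaled into [0, 1); dividing by a power of two is exact.
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  }

  // Integer in [min, max], built on next().
  int(min: number, max: number): number {
    return min + Math.floor(this.next() * (max - min + 1));
  }
}

// Usage: two generators with the same seed stay in lockstep.
const a = new SketchRNG(42);
const b = new SketchRNG(42);
let identical = true;
for (let i = 0; i < 100; i++) {
  if (a.next() !== b.next()) identical = false;
}
// identical stays true: same seed, same sequence.
```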
Golden schedules: locking the deterministic stream
The wave spawner is a pure function: buildZoneSchedule(zone, difficulty, seed). Same triple in, same schedule out. The schedule is not random at runtime - it is generated from the seed and then executed.
// Pan-Tvardowski/tests/balance-schedule.test.ts
test('same (zone, difficulty, seed) yields an identical schedule', () => {
  const a = buildZoneSchedule('z3', 5, 42);
  const b = buildZoneSchedule('z3', 5, 42);
  expect(a).toEqual(b);
});
test('golden d1 spawn counts (locks the deterministic stream)', () => {
  const goldenLengths: Record<ZoneKey, number> = {
    z1: 134, z2: 126, z3: 168, z4: 140,
    z5: 187, z6: 204, z7: 398, z8: 268,
  };
  for (const zone of ZONES) {
    expect(buildZoneSchedule(zone, 1, 1).spawns.length).toBe(goldenLengths[zone]);
  }
});

The golden test pins the exact stream length per zone. Change the math - reorder a spawn condition, adjust a timing constant - and the golden breaks. This is replay testing for content: the game's balance logic is pinned to a specific deterministic output.
Zone 7 generating 398 spawns and zone 2 generating 126 is not an accident. Those numbers are the contract between the balance function and the game. The test enforces it.
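The shape of such a builder can be sketched in a few lines. Everything here is a hypothetical stand-in: buildSketchSchedule, the numeric zoneIndex, the enemy names, and the timing constants are invented for illustration and have nothing to do with the real buildZoneSchedule. What matters is that the RNG is created inside the function from its arguments, so the output depends on nothing else.

```typescript
// Hypothetical stand-in for a seeded schedule builder.
interface Spawn { timeMs: number; enemy: string; }
interface Schedule { spawns: Spawn[]; }

// mulberry32-style generator, included so the sketch is self-contained.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const ENEMIES = ['imp', 'wraith', 'golem']; // invented names

function buildSketchSchedule(zoneIndex: number, difficulty: number, seed: number): Schedule {
  // Mix all three inputs into the seed so each triple gets its own stream.
  const mixed = seed ^ Math.imul(zoneIndex, 0x9e3779b1) ^ Math.imul(difficulty, 0x85ebca6b);
  const rng = mulberry32(mixed);
  const count = 10 + difficulty * 2 + Math.floor(rng() * 5);
  const spawns: Spawn[] = [];
  let timeMs = 0;
  for (let i = 0; i < count; i++) {
    timeMs += 250 + Math.floor(rng() * 750); // integer milliseconds stay exact
    const enemy = ENEMIES[Math.floor(rng() * ENEMIES.length)];
    spawns.push({ timeMs, enemy });
  }
  return { spawns };
}

// Same triple in, same schedule out.
const first = buildSketchSchedule(3, 5, 42);
const second = buildSketchSchedule(3, 5, 42);
const sameSchedule = JSON.stringify(first) === JSON.stringify(second);
```

A golden value would be recorded from a known-good run and pasted into the test. It is deliberately not shown here because it depends entirely on the sketch's made-up constants.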
The boss state model is the same story: plain-data boss state is snapshot-testable at any frame, because there is no live closure to capture - just a record.
Snapshots: replay one world against another
Golden schedules pin the input to the simulation. Snapshots pin the output. The scene-snapshot primitive serializes every entity in the ECS world into a stable, sorted structure, so two worlds built from the same seed can be compared field-for-field.
// packages/testing/src/scene-snapshot.ts
const sortedIds = [...ids].sort((a, b) => a - b);
const entities: SceneSnapshotEntity[] = [];
for (const id of sortedIds) {
  const comps: Record<string, unknown> = {};
  for (const def of components) {
    if (world.has(id, def)) comps[def.name] = world.get(id, def);
  }
  entities.push({ id, components: comps });
}
return { entities };

Sorting by entity id is the whole point. ECS iteration order is an implementation detail, and a snapshot that depended on it would be flaky by construction. With a stable order, the replay assertion is a one-liner: run the same scene twice and the two snapshots must be deep-equal.
// packages/testing/tests/scene-snapshot.test.ts
expect(snapshotsEqual(serializeScene(w1, [Pos]), serializeScene(w2, [Pos]))).toBe(true);

That is what "replay assertion" means in practice: drive a world N frames under the fake clock, snapshot it, and assert it matches a known-good run. A determinism regression - a stray Math.random, an unsorted iteration, a wall-clock read - breaks the snapshot loudly instead of failing once in fifty CI runs.
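The effect of the sort is easy to see with a toy world. The sketch below uses a plain Map as a stand-in for the ECS. MiniWorld, serialize, and equalSnapshots are hypothetical names, and comparing via JSON.stringify is an assumption that only holds for plain JSON-like component data with a stable key order.

```typescript
// Hypothetical mini-world to illustrate order-independent snapshots.
type Components = Record<string, unknown>;
type MiniWorld = Map<number, Components>;

interface Snapshot {
  entities: { id: number; components: Components }[];
}

function serialize(world: MiniWorld, componentNames: string[]): Snapshot {
  // Sort ids numerically so insertion order cannot leak into the snapshot.
  const ids = Array.from(world.keys()).sort((a, b) => a - b);
  const entities = ids.map((id) => {
    const all = world.get(id) ?? {};
    const components: Components = {};
    // Walk the requested names in caller order so key order is stable too.
    for (const name of componentNames) {
      if (name in all) components[name] = all[name];
    }
    return { id, components };
  });
  return { entities };
}

function equalSnapshots(x: Snapshot, y: Snapshot): boolean {
  // Adequate for plain data serialized in a stable order, as above.
  return JSON.stringify(x) === JSON.stringify(y);
}

// The same two entities, inserted in opposite orders.
const w1: MiniWorld = new Map<number, Components>([
  [1, { Pos: { x: 0, y: 1 } }],
  [2, { Pos: { x: 5, y: 5 } }],
]);
const w2: MiniWorld = new Map<number, Components>([
  [2, { Pos: { x: 5, y: 5 } }],
  [1, { Pos: { x: 0, y: 1 } }],
]);
const replayMatches = equalSnapshots(serialize(w1, ['Pos']), serialize(w2, ['Pos']));
// replayMatches is true even though the insertion order differs.
```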
Fixed timestep: the sim never sees a variable dt
The game loop runs at 60 Hz with a fixed-step accumulator. When a real frame delivers 20ms instead of 16.67ms, the accumulator drains in exact 1/60 chunks - so the simulation always steps by exactly fixedDt = 1/60, regardless of frame pacing.
stepFrames(n, dt) in the test world drives this directly: call it with n=60 and dt=1/60, and every system sees exactly 60 steps of 1/60 second, no matter how slow or fast the test runner is.
This is the standard fixed-timestep pattern. The engine's contribution is wiring it so the test harness drives the same accumulator the production loop uses - there is no separate "test mode" path that might behave differently.
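For reference, the generic accumulator pattern looks like the sketch below. It is not the engine's loop: createFixedStepper and the callback shape are illustrative assumptions, and production loops usually also cap the number of catch-up steps per frame, which is left out here for brevity.

```typescript
// Generic fixed-timestep accumulator sketch; names are illustrative.
const FIXED_DT = 1 / 60;

function createFixedStepper(step: (dt: number) => void) {
  let accumulator = 0;
  // Call once per rendered frame with the real elapsed time in seconds.
  // Returns how many simulation steps ran during this frame.
  return function onFrame(frameDt: number): number {
    accumulator += frameDt;
    let steps = 0;
    while (accumulator >= FIXED_DT) {
      step(FIXED_DT); // the simulation only ever sees FIXED_DT
      accumulator -= FIXED_DT;
      steps++;
    }
    return steps;
  };
}

// Usage: feed uneven 20 ms frames and record what the simulation saw.
const seenDts: number[] = [];
const onFrame = createFixedStepper((dt) => {
  seenDts.push(dt);
});
for (let i = 0; i < 50; i++) onFrame(0.02);
// Every entry in seenDts is exactly FIXED_DT; the leftover time stays
// in the accumulator and carries over to the next frame.
```

The number of steps taken for a given stretch of real time can shift by one at the boundary because the accumulator is a float. The dt each step receives cannot.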
Testing a game you can't see
Skia renders text directly to a GPU surface. That text is not in the Android or iOS accessibility tree. getByText('HP: 42') finds nothing - Skia painted the pixels, but no native view carries the value.
The smoke harness works around this by mirroring live game state into __DEV__-only DBG_* sentinels - invisible to players, visible to Maestro.
# .maestro/boss-fight.yaml - deep-link to a boss, assert live game state
- openLink: "pantvardowski://boss/2"
- extendedWaitUntil:
    visible: "DBG_BOSSHP:[1-9][0-9]*/[1-9][0-9]*"
- assertVisible: "DBG_BOSSPHASE:entering"
- assertVisible: "DBG_BOMBS:3"
- tapOn:
    id: "DBG_BOMB_BTN"
- extendedWaitUntil:
    visible: "DBG_BOMBS:2"

The deep link pantvardowski://boss/2 skips the title screen and navigates straight to boss 2 with the seeded state the test expects. DBG_BOSSHP carries the live HP ratio as a text node. Maestro can assert it; a human player never sees it.
This pattern generalizes. Any React Native app with a rendered surface that sits outside the accessibility tree - a Skia chart, a canvas editor, a map view - can mirror its key state into __DEV__ text nodes and use deep links to skip navigation. The sentinels make a "you can't test this" surface testable.
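The state-to-text half of the pattern is a pure function and can be unit-tested on its own. In this sketch the sentinel formats are taken from the flow above, while buildSentinels, DebugState, and the isDev parameter (standing in for the __DEV__ global) are hypothetical. Each returned string would then be rendered as a native text node that the test runner can see.

```typescript
// Hypothetical helper: turn live game state into sentinel strings.
interface DebugState {
  bossHp: number;
  bossMaxHp: number;
  bossPhase: string;
  bombs: number;
}

function buildSentinels(state: DebugState, isDev: boolean): string[] {
  if (!isDev) return []; // nothing is emitted in release builds
  return [
    // Rounded up so the text stays an integer ratio that a simple regex can match.
    `DBG_BOSSHP:${Math.ceil(state.bossHp)}/${Math.ceil(state.bossMaxHp)}`,
    `DBG_BOSSPHASE:${state.bossPhase}`,
    `DBG_BOMBS:${state.bombs}`,
  ];
}

// Usage with example values.
const sentinels = buildSentinels(
  { bossHp: 42, bossMaxHp: 100, bossPhase: 'entering', bombs: 3 },
  true,
);
// sentinels: ['DBG_BOSSHP:42/100', 'DBG_BOSSPHASE:entering', 'DBG_BOMBS:3']
```

Keeping the formatting separate from the rendering means the exact strings the UI flow waits on are covered by fast unit tests as well.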
Limits - where determinism breaks
Float-tolerance at the boundary. Summed dt is asserted with toBeCloseTo, not toBe. The accumulator arithmetic is floating-point: every step uses the same fixedDt constant, but rounding error builds up as those steps are summed over many frames. The test suite acknowledges this with the right matcher.
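A small sketch of where that boundary sits: summing a float step is only safe to compare with a tolerance, while an integer frame counter stays exact and can be converted to seconds on demand. The names simulate and closeTo and the 1e-6 tolerance are assumptions for illustration.

```typescript
// Sketch: summed float time needs a tolerance, an integer frame count does not.
const STEP = 1 / 60;

function simulate(frames: number): { summed: number; derived: number; frameCount: number } {
  let summed = 0;     // float accumulation: rounding error can build up
  let frameCount = 0; // integer: exact for any realistic frame count
  for (let i = 0; i < frames; i++) {
    summed += STEP;
    frameCount++;
  }
  // Deriving seconds from the integer count involves a single division.
  return { summed, derived: frameCount / 60, frameCount };
}

// Roughly what a toBeCloseTo-style matcher checks: |a - b| below a tolerance.
function closeTo(a: number, b: number, tolerance = 1e-6): boolean {
  return Math.abs(a - b) < tolerance;
}

const run = simulate(36000); // ten minutes at 60 Hz
const summedIsClose = closeTo(run.summed, 600);
// summedIsClose is true; strict equality on run.summed is not something to rely on.
// run.frameCount === 36000 and run.derived === 600 hold exactly.
```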
The default seed fallback. Without an injected seed, SeededRNG falls back to Date.now() as the seed. A test that forgets to inject a seed is not deterministic, and the suite has no lint rule enforcing injection. This is a discipline gap, not a technical one - but it is real.
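One cheap way to narrow that gap, short of a lint rule, is a guard used by test helpers that refuses to proceed without an explicit seed. This is a sketch of the idea only: requireSeed is a hypothetical helper, not something the suite is stated to contain.

```typescript
// Hypothetical guard: refuse to build an RNG without an explicit integer seed.
function requireSeed(seed: number | undefined, context: string): number {
  if (seed === undefined || !Number.isInteger(seed)) {
    throw new Error(`[test] missing explicit integer seed for ${context}`);
  }
  return seed;
}

// Usage: a test helper would pass the result to the RNG constructor,
// e.g. new SeededRNG(requireSeed(options.seed, 'wave spawner')).
const okSeed = requireSeed(42, 'wave spawner');

let forgotSeed = false;
try {
  requireSeed(undefined, 'wave spawner');
} catch (e) {
  forgotSeed = true; // the forgotten seed fails loudly instead of using the clock
}
```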
Device wall clock and cold start. Time-to-interactive is measured as a median of 3 cold launches on the benchmark device (Samsung Galaxy A54 5G, release build). A single measurement is not reliable; three gives a usable signal. This is not unit-tested - it is a Maestro scenario run.
Skia text is not in the accessibility tree. The sentinel pattern handles this for CI, but assertions on visual appearance (color, size, animation) are still not possible without a screenshot comparison layer. The harness does not have one.
Performance is measured, not unit-tested. The 1,866 engine tests and 1,802 game tests (156,767 assertions in the game suite alone, 0 fail across both as of 2026-06-26) say the logic is correct. They say nothing about whether it runs at 60 fps on a real device. Frame budgets are verified by the E8 Maestro benchmark suite, not by unit tests.
Close
The rule you can apply this week: inject the seed; assert state, not pixels.
If you have a game loop that reads Date.now() in tests, wrap it in a fake clock first - one function, one closure. If your RNG is unseeded in tests, wire in a seed constant. Those two changes make the difference between a test suite that lies and one you can trust.
The DBG_* sentinel pattern is the same insight applied to rendered surfaces: if you cannot read it from the test harness, mirror it into something you can. It works for a React Native game. It works for any RN app with a Skia surface, a canvas component, or a map view that sits outside the accessibility tree.
For the engine internals that generate the seeded sequences, see Zero-alloc game loops in TypeScript, which covers the zero-alloc loop under test. The serialization argument for why coroutines were not used (a paused generator is the thing you cannot snapshot) lives in TypeScript game coroutines I did not use.