Proving your difficulty ladder is real: round-robin heatmaps, pinned seeds, and the Blended Hard probe

A four-tier AI difficulty selector makes a monotone-progress promise to players. Here's the round-robin tournament, the two heatmaps, and the Math.random() seed bug that turn that promise from marketing into a reproducible assertion.

If your game ships a difficulty ladder — Easy / Medium / Hard / Elite — you have made a monotone-progress promise to every player who touches it. A player beats Easy, graduates to Medium, hits a wall on Hard, and eventually touches Elite. If any step is too small, the tier is filler and players skip it. If any step is too large, the tier is a brick wall and players churn.

That promise has to be measured, not asserted. Playtesting catches the extremes and nothing else — your own hands know the game too well, and a handful of games from friends is a dozen samples against a combinatorial matrix (four tiers, five board sizes, two first-player assignments). Forty cells, maybe thirty data points. You can convince yourself of almost anything with that much noise.

This post is about the round-robin tournament I run on every Cell Division AI change, the two heatmaps it produces, and the things the heatmaps revealed the first time I actually looked at them.

What “graduated” actually means

For the ladder to hold together, two claims both have to be true:

  • Win rate strictly increases up the ladder. Hard beats Medium more than half the time, by more than Medium beats Easy.
  • Score differential grows. A ladder where Hard wins by half a cell is a fragile ladder even if the sign is right — any small rule change upends it.

Both claims have to survive across board sizes and across first-player assignments. A difficulty ladder that only holds up when the AI goes first is not a ladder. It’s a turn-order artifact.
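Once the tournament emits per-rung numbers, both claims are cheap to check mechanically. A sketch, with made-up numbers standing in for real tournament output:

```python
# Hypothetical results for each rung against the tier below it:
# (label, win rate of the higher tier, mean score differential).
ladder = [
    ("Medium vs Easy", 0.62, 1.8),
    ("Hard vs Medium", 0.71, 3.2),
    ("Elite vs Hard",  0.79, 4.9),
]

def ladder_is_graduated(rungs):
    """True iff every rung wins more than half its games, and both the
    win rate and the score differential strictly increase up the ladder."""
    wins  = [w for _, w, _ in rungs]
    diffs = [d for _, _, d in rungs]
    above_coinflip = all(w > 0.5 for w in wins)
    monotone = (all(a < b for a, b in zip(wins, wins[1:]))
                and all(a < b for a, b in zip(diffs, diffs[1:])))
    return above_coinflip and monotone

print(ladder_is_graduated(ladder))  # prints True
```

The point of encoding it is that "graduated" stops being a vibe and becomes a predicate you can run after every tournament.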

The inner loop

The tournament lives at ai/scripts/tournament_heatmap.py. It builds the shipped roster, inserts a non-shipped Blended Hard probe between Medium and Hard (more on that below), and runs an all-pairs round robin for every board size in [4, 5, 6, 7, 8] with both first-player assignments:

n = len(ais)
for board_size in [4, 5, 6, 7, 8]:
    for fp in ["Row first", "Row second"]:
        for i in range(n):
            for j in range(n):
                wins = draws = 0  # tallied per heatmap cell
                for _ in range(n_games):
                    result = play_game(ais[i], ais[j], board_size)
                    diff = result.scores[0] - result.scores[1]
                    if diff > 0:
                        wins += 1
                    elif diff == 0:
                        draws += 1
Fifty games per matchup is the default. With six players, five board sizes, and two first-player columns, that’s 9,000 games per run — cheap enough to run on a laptop in a few minutes because the engine is pure Python and the AIs are mostly 14-feature linear scorers.

Why two charts, not one

Score difference alone lies. An AI that wins by +2.3 points on average can still be losing 40% of its games — the big wins drag the mean. Win rate alone also lies. A 55% win rate with 40% draws is a completely different game from a 55% win rate with 5% draws. The first says “both AIs play it safe and one occasionally slips”; the second says “one AI is clearly stronger.”

So the script produces two PNGs: a red-blue score-difference heatmap and a green-yellow-red win/draw heatmap. Each cell in the win-rate chart shows two numbers stacked — the row’s win percentage over the column on top, and the draw percentage below, suffixed with d.
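A sketch of how those stacked labels get built from raw tallies; the tally layout here is my assumption for illustration, not the script's actual data structure:

```python
# Hypothetical per-cell tally, keyed by (row AI, column AI):
# (row's wins, draws, games played).
tallies = {
    ("Hard", "Medium"):   (31, 6, 50),
    ("Medium", "Hard"):   (13, 6, 50),
}

def cell_label(wins, draws, games):
    """Stacked annotation: the row's win % on top, draw % below with a 'd' suffix."""
    win_pct  = round(100 * wins / games)
    draw_pct = round(100 * draws / games)
    return f"{win_pct}\n{draw_pct}d"

print(cell_label(*tallies[("Hard", "Medium")]))  # prints "62" over "12d"
```

Keeping the draw percentage in the cell is the cheap part; reading it is what separates "safe stalemate machine" from "genuinely stronger tier".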

If you can only look at one number to judge your ladder, you’re going to ship a broken ladder. Score and win rate disagree constantly, and the disagreements are where the interesting bugs live.

The rendered heatmaps themselves are on the Cell Division site — the player-facing writeup that shows “yes, the ladder really is monotone” lives at Tournament Heatmaps: Is the Difficulty Ladder Graduated?.

The RNG trap

The first time I ran the tournament, close matchups flipped between runs. Medium would beat Easy 54% one afternoon and 48% the next. The bug was in the game state, not the tournament — the seed fed to every softmax-sampling AI was Math.random(), freshly generated each new game. Every tournament run sampled a different random stream, so every run was a different measurement.

The one-line fix:

// src/engine/game/GameManager.ts
-    aiSeed: Math.random(),
+    aiSeed: 42,

With the seed pinned, two runs produce bit-identical heatmaps. That turns the chart from a sample into an assertion: if a change to an AI shifts a cell, the change is responsible — not variance. It also makes a before-heatmap and after-heatmap diff trustworthy.

This is the single piece of advice I most often give indie devs shipping a stochastic AI. A stochastic tier (Easy, anything with temperature > 0) still behaves stochastically inside a single game — the seed just makes the sequence of randomness reproducible. You get non-determinism at the gameplay level and determinism at the measurement level, which is exactly the combination you want.
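The same idea as a Python sketch (illustrative, not the engine's code): pin the seed and two runs replay the identical stream, while the draws inside a single run stay as varied as ever.

```python
import random

# Two "tournament runs" with the same pinned seed.
run_a = random.Random(42)
run_b = random.Random(42)

stream_a = [run_a.random() for _ in range(1000)]
stream_b = [run_b.random() for _ in range(1000)]

assert stream_a == stream_b      # bit-identical measurement across runs
assert len(set(stream_a)) > 1    # still plenty of variety within a run
```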

What the heatmap revealed about Medium

The old Medium heuristic, in both the TypeScript game engine and the Python tournament roster, was immediate_ai + immediate_opp at softmax temperature 0.5 — a soft, sampled argmax over “which cell would score me the most right now, minus what it would give the opponent if they took it next.” It was stochastic on purpose, to keep play from feeling robotic.

The heatmap said: it’s too stochastic, and it’s missing the one feature that actually matters off the immediate step. Medium was beating Easy only narrowly on 4×4, and on larger boards the margin barely grew — exactly the “Medium feels like Easy with extra steps” failure mode.

The fix was two lines in src/engine/ai/engine.ts:

-    scores[i] = feats[i][0] + feats[i][1];
+    scores[i] = feats[i][0] + feats[i][1] + 0.3 * feats[i][2];  // + openness
...
-  return cells[softmaxSample(scores, 0.5, seed)];
+  return cells[softmaxSample(scores, 0, seed)];

Adding the openness feature at weight 0.3 gives the AI a reason to prefer cells that still have room to grow. Dropping temperature to zero turns the soft sample into a deterministic argmax. After the change, the heatmap showed the gap opening up: new Medium beat Easy decisively, and still lost to Hard on every board size. The ladder was straighter, but not yet evenly spaced.
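For reference, a sketch of what a temperature-parameterized sampler like this looks like, in Python; I'm assuming the engine's TypeScript softmaxSample behaves this way, including the temperature-zero argmax special case.

```python
import math
import random

def softmax_sample(scores, temperature, rng):
    """Pick an index: T > 0 samples softly, T == 0 degenerates to argmax."""
    if temperature == 0:
        return max(range(len(scores)), key=lambda i: scores[i])
    m = max(scores)  # subtract the max before exp() for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights)[0]

scores = [0.2, 1.4, 0.9]
print(softmax_sample(scores, 0, random.Random(42)))  # argmax: prints 1
```

At T = 0.5 the same seeded rng reproduces the same sampled sequence; at T = 0 the rng is never consulted at all, which is why the retuned Medium plays identically every game.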

None of this was obvious from playing Medium. It was obvious within thirty seconds of looking at the heatmap.

Blended Hard: probing the gap

The remaining problem was the size of the Medium-to-Hard step. Hard is trained with PPO and argmax-greedy; Medium is a three-feature hand-tuned heuristic. Even with the retuned Medium, the gap was big enough that a player who had just started winning on Medium got flattened by Hard on their first try. I didn’t want to dumb Hard down — it’s the last linear scorer before Elite, and it needs to feel like work.

Instead I built an analytical probe: a convex combination of the Medium heuristic weights and the trained Hard weights. The whole class is twelve lines; its heart is the weight blend:

hard_w = np.asarray(hard_weights).ravel()
medium_w = np.zeros_like(hard_w)
medium_w[0] = 1.0   # immediate_ai
medium_w[1] = 1.0   # immediate_opp
medium_w[2] = 0.3   # openness
blended = (1.0 - blend) * medium_w + blend * hard_w

With blend = 0.25, Blended Hard is 75% Medium and 25% trained Hard. It slots cleanly between the two in the heatmap, so you can see the gap smoothly instead of as a cliff.
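To see what the blend does to a single move decision, here is a toy scoring pass; the Hard weights and the feature values are invented for illustration.

```python
import numpy as np

# First three Medium weights match the shipped heuristic; everything else
# here (the Hard weights, the feature values) is made up.
medium_w = np.array([1.0, 1.0, 0.3, 0.0])
hard_w   = np.array([1.6, 0.8, 0.9, 0.4])

blend = 0.25
blended = (1.0 - blend) * medium_w + blend * hard_w   # 75% Medium, 25% Hard

feats = np.array([2.0, -1.0, 3.0, 1.0])  # one candidate cell's features
score = float(blended @ feats)            # linear scorer: weights dot features
```

Because both parents are linear scorers over the same feature vector, the blend is itself just another linear scorer — which is why the probe is twelve lines and not a research project.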

Blended Hard is not a shipped tier. Players never see it in the difficulty picker. It’s the kind of thing you only build because you’re measuring, and it earns its keep by showing you the shape of the gap between two things you do ship. If the gap were suspiciously smooth, something would be wrong with Hard’s training. If the gap were suspiciously sharp, we’d know an intermediate tier was worth shipping. Neither was true — the gap was about what you’d expect from “a linear model with 25% trained signal” — which means the ladder is shaped like it should be.

Four things the heatmaps made obvious

  • Board size is a knob on discriminating power. On 4×4 the whole ladder compresses — there are so few playable cells that even Easy stumbles into competent moves — and tiers blur. On 8×8 the ladder is clean and monotone. If you want a test that separates tiers, test on the biggest board.
  • The diagonal carries signal. Self-play with a fixed seed should produce draws for deterministic AIs. The fact that Hard-vs-Hard and Elite-vs-Elite are mostly draws is evidence the seeding is doing its job. If the diagonal ever drifts, you have a reproducibility bug before you have a difficulty bug.
  • First-player matters, and unevenly. Some tiers have a meaningful first-move advantage; others barely change. A ladder that only holds up in one column plays differently depending on whether the human picks row or column. Showing both columns makes that visible instead of averaged away. (The full follow-up lives in The First-Player Tax.)
  • Blended Hard lives where I wanted it. Medium → Blended Hard → Hard is a smooth gradient in both charts, which means the gap we saw before wasn’t a training artifact — there really is a lot of room between hand-tuned heuristic and PPO-trained greedy.

Putting it in CI

The tournament infrastructure can do more than I currently ask of it. run_round_robin in ai/src/tournament/tournament.py already computes Elo ratings across the full roster — I just don’t render them, because a single Elo number per tier smooths over exactly the board-size and first-player structure the heatmap exists to show.
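For context, the expected-score formula at the core of any Elo pass, sketched here; this is the standard formula, not code lifted from the tournament script.

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """One rating update for A (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

print(elo_expected(1500, 1500))  # prints 0.5
```

One rating per tier is convenient, but it averages over exactly the board-size and first-player structure the heatmaps exist to keep visible.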

There’s also nothing stopping this from running in CI on every AI change and failing the build if a cell moves by more than some threshold. The pinned seed makes that a realistic test, not a flake generator. The only reason I haven’t wired it up is that a change I’m intentionally making — say, retuning Medium — should move cells, and the threshold-vs-flag UX takes more thought than the measurement itself.
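The gate itself would be a few lines — a sketch, assuming the win-rate matrices were dumped as plain numeric arrays next to the PNGs (the storage format is my assumption):

```python
import numpy as np

def assert_no_drift(before, after, threshold=0.05):
    """Fail if any win-rate cell moved by more than `threshold` between runs."""
    drift = np.abs(np.asarray(after) - np.asarray(before))
    moved = np.argwhere(drift > threshold)
    if moved.size:
        cells = ", ".join(f"({i},{j}): {drift[i, j]:.2f}" for i, j in moved)
        raise AssertionError(f"heatmap cells drifted: {cells}")

before = [[0.50, 0.62], [0.38, 0.50]]
after  = [[0.50, 0.63], [0.38, 0.50]]
assert_no_drift(before, after)  # a 0.01 move is within threshold
```

The hard part isn't this function — it's deciding how an intentional retune declares "yes, I meant to move those cells" without training everyone to rubber-stamp the failure.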

For now, the workflow is: run the tournament, diff the heatmaps, look at the PNGs, commit. A twenty-second sanity check that prevents the kind of regression where “Hard feels off” is the first signal, which is about a month too late.

Takeaway

If your game ships a difficulty ladder, you owe your players proof that the rungs are spaced. Round-robin heatmaps are the smallest artifact that actually delivers that proof: cheap to run, reproducible once you pin the seed, and hard to misinterpret when you produce both a score chart and a win-rate chart side by side. Building one cost me a day. The things it revealed the first afternoon — the noisy Medium, the big Medium-to-Hard step, the reproducibility bug hiding in a Math.random() call — would have cost a lot more than that to find in the wild, if I’d ever found them at all.

For the companion posts: the model stack post covers the four AI tiers that are being measured here; The AlphaZero detour covers where Elite came from; and the player-facing take on the heatmaps, with the actual rendered charts, is at Tournament Heatmaps on the Cell Division blog.