Four difficulty tiers, two model families: 14 features, 13 weights, and a distilled CNN

Cell Division ships Easy, Medium, Hard, and Elite. The first three are linear models over 14 hand-crafted features trained with PPO self-play; Elite is a distilled AlphaZero student. Here's why that combination — not just the CNN — is the right shape for a graduated-difficulty game.

Cell Division ships with four AI difficulty tiers that all run on-device, with no server call, on any phone new enough to install the app. Three of them — Easy, Medium, Hard — are linear models on top of fourteen hand-crafted features. The fourth, Elite, is a small CNN distilled from an AlphaZero-style teacher.

The temptation with this kind of stack is to collapse it: use the CNN for everything and derive the weaker tiers by adding noise or lowering search depth. I’ve ended up avoiding that on two games now. This post is the case for two families of model, side by side, and for what each one earns.

Why two families at all

The shipping requirement isn’t “be strong.” It’s that each tier has to feel different. Easy should be playful. Medium should be stubborn. Hard should punish mistakes. Elite should make you work for it. A single model scaled by temperature gets you one personality at four loudness levels, not four personalities.

Linear models over a shared feature vocabulary give you personality almost for free: change the feature subset, change the temperature, change the blend. Each tier can be tuned by hand to a specific feel. For the top tier, where “play genuinely strong moves” is the only requirement, a distilled CNN earns its keep. That’s the split.

The vocabulary: 14 features, [0, 1] normalized

Every candidate cell on the board gets summarized by the same fourteen features, defined in src/engine/ai/features.ts and mirrored in Python at ai/src/ai/features.py so training and inference agree to the last bit. Roughly grouped:

  • Immediate payoff (2): points the AI would score by playing here; points the opponent would score by playing here. Same lookahead, flipped.
  • Local density (6): openness, AI neighbors, opponent neighbors, AI connectivity, opponent connectivity, boundary neighbors. Who owns the neighborhood and how crowded is it.
  • Shape (4): AI underlap, opponent underlap, AI half-axis, opponent half-axis. Bridge potential and asymmetric extension.
  • Second ring (1): empty cells at Chebyshev distance exactly 2.
  • Global (1): game_progress — fraction of the board filled. Only used by the legacy switching-linear Elite to interpolate between early-game and late-game weights.

Every feature is normalized to roughly [0, 1]. That’s the quiet structural decision that does the most work in the whole stack — it means the same trained weight vector works on any board size from 4×4 to 8×8 with zero retraining. Portability is worth more than a few ELO points.
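The portability claim is easiest to see in miniature. Here is a hypothetical sketch of a neighbor-count feature normalized by the number of adjacent cells (the function name and exact formula are illustrative, not the actual features.ts code):

```python
def openness(empty_neighbors: int, total_neighbors: int) -> float:
    """Fraction of adjacent cells that are empty, always in [0, 1].

    Hypothetical sketch: the real definition lives in
    src/engine/ai/features.ts and ai/src/ai/features.py.
    """
    if total_neighbors == 0:
        return 0.0
    return empty_neighbors / total_neighbors

# A corner cell on a 4x4 board has 3 neighbors; an interior cell on an
# 8x8 board has 8. Both land in [0, 1], so one trained weight serves both.
```

Because every feature is a ratio rather than a raw count, the dot product against a fixed weight vector stays on the same scale no matter how big the board is.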

Tiers 1–3: same features, different recipes

The dispatcher in src/engine/ai/engine.ts picks a scoring function, evaluates it on every legal move, and either argmaxes or samples from a softmax. The whole selection loop is a dozen lines.
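That selection loop really is small. A dependency-free sketch of the same shape, with illustrative names (the real dispatcher is TypeScript in src/engine/ai/engine.ts):

```python
import math
import random

def softmax_sample(scores, temperature):
    """Sample an index from softmax(scores / temperature)."""
    mx = max(scores)  # subtract max for numerical stability
    weights = [math.exp((s - mx) / temperature) for s in scores]
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(scores) - 1

def pick_move(moves, score_fn, temperature=None):
    """Score every legal move, then argmax (Medium/Hard) or
    softmax-sample (Easy, with its high temperature)."""
    scores = [score_fn(m) for m in moves]
    if temperature is None:
        return moves[max(range(len(moves)), key=scores.__getitem__)]
    return moves[softmax_sample(scores, temperature)]
```

The whole personality of a linear tier lives in `score_fn` and `temperature`; nothing else changes between Easy, Medium, and Hard.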

Easy — two features and a loose hand

score = openness - 0.3 * boundary_neighbors
move  = softmax_sample(scores, T ≈ 1.5)

No training, no weight vector, two features. The high temperature means even the best move isn’t certain — Easy plays into open space most of the time but occasionally surprises you with a weird choice. Which is exactly what a new player wants: an opponent that loses to basic strategy but not in a way that feels scripted.

Medium — three features, greedy

score = immediate_ai + immediate_opp + 0.3 * openness
move  = argmax(scores)

Medium adds the two immediate-score features and picks deterministically: it always takes the highest-scoring play, with the low-weight openness term acting as a soft tie-breaker between otherwise equal moves. No surprises. It’s a “make the scoring play” opponent, the right shape for someone who’s learning and wants a predictable sparring partner.

Hard — 13 trained weights, blended

Hard is where training starts earning its keep. A 13-weight linear model (LinearNet in ai/src/ai/linear_torch.py) is trained with PPO self-play. At inference time we blend:

blended = 0.75 * MEDIUM_WEIGHTS + 0.25 * HARD_WEIGHTS
score   = features · blended

The 75/25 blend is deliberate and is the move I’d steal for any future project. Pure trained policies make plays that are correct but inscrutable — moves a human can’t explain in terms of the obvious goals of the game. Mixing in the greedy heuristic keeps the tactical-looking moves visible while layering learned positional judgment on top. It’s a dial between “predictable” and “sharp,” and it lets you ship Hard without a UX problem where the opponent looks broken.
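In code the blend is nothing more than a convex combination of the two weight vectors before the dot product. A sketch with illustrative values (the shipped vectors live in src/engine/ai/weights.ts):

```python
HARD_FRACTION = 0.25  # the 75/25 split in favor of the Medium heuristic

def blended_score(features, medium_weights, hard_weights):
    """Dot product of the feature vector against the interpolated weights."""
    return sum(f * ((1 - HARD_FRACTION) * m + HARD_FRACTION * h)
               for f, m, h in zip(features, medium_weights, hard_weights))
```

Because the blend happens in weight space, it costs nothing at inference time: you can precompute the blended vector once and score moves with a single dot product.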

What the learned weights say

Here’s the actual 13-weight vector Hard ships with, straight out of src/engine/ai/weights.ts:

feature                  weight
immediate_ai             +4.09
immediate_opp            +4.34
openness                +14.49
ai_neighbors             −1.73
opp_neighbors            −1.65
ai_connectivity          +1.90
opp_connectivity         +1.82
ai_underlap              −1.84
opp_underlap             −1.52
boundary_neighbors       −3.89
ai_half_axis             −0.77
opp_half_axis            −0.37
second_order_openness    −3.51

Three things I care about in this vector:

  1. openness dominates at +14.49. Training converged on “play into empty space.” Every other feature is noise around this one.
  2. Both immediate_* features land near +4. The greedy baseline Medium uses was not wrong — tactics matter, just not as much as space.
  3. second_order_openness is negative. Once you already have local openness, more distant emptiness is mildly bad. The model learned to prefer cells that are open but bordered by structure, not floating in the middle of nothing. I did not know this before training. That’s the kind of insight a 13-parameter model can hand you that a 10-million-parameter one cannot.

This is the quiet superpower of a tiny model: you can actually read it. A deep net would play about as well, maybe slightly better, and you’d have no idea why.

The PPO loop

Hard and the legacy switching-linear Elite are trained with the same loop — only the model changes. Scripts are ai/scripts/train_linear.py and ai/scripts/train_switching_linear.py:

  • Self-play. Each iteration plays 64 games across board sizes 4 through 8, with randomized starting positions (some pre-placed stones, some blocked cells). Starting-position variety is what kept the model from overfitting an opening.
  • Behavior policy. Moves are sampled from a softmax for roughly the first 75% of each game (minimum T = 0.05), then greedy for the tail. The sampling matters: PPO’s importance ratio divides by the behavior policy’s probability of the action actually taken, so that probability has to stay non-zero.
  • Reward. Each move in a trajectory gets the same final reward: tanh(margin / scale), margin = AI score − opponent score. tanh keeps blowouts from drowning out close games.
  • Update. Learning rate 0.01, clip coefficient 0.2, 4 epochs per batch, entropy bonus 0.05, target KL 0.03 for early stopping.

The learning rate is higher than you’d ever use on a deep net. You can get away with it because the model is tiny: the loss surface is nearly convex, there’s no catastrophic-interference risk from a big step, and the target-KL stop guards the rare case where PPO overshoots. The whole pipeline converges in minutes on a laptop CPU and exports a JSON weight vector that gets checked into src/engine/ai/weights.ts.
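The reward shaping from the loop above is one line. A sketch, where `scale` is an assumed placeholder (the shipped value lives in the training scripts):

```python
import math

def terminal_reward(ai_score: int, opp_score: int, scale: float = 10.0) -> float:
    """Shared reward assigned to every move in a finished trajectory.

    tanh compresses blowout margins toward ±1, so a 50-point rout
    doesn't drown out the signal from games decided by a couple points.
    The `scale` default here is an illustrative assumption.
    """
    return math.tanh((ai_score - opp_score) / scale)
```

The compression is the point: the marginal reward for winning by 100 instead of 50 is nearly zero, while the difference between a tie and a narrow win is still a healthy gradient.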

Elite: where hand-crafted features stop paying rent

Fourteen features is a lot until you try to encode distant threats. A move that sets up a fork four turns out has no signal in any of the fourteen — the fork doesn’t exist yet. The features are local by construction, and that locality is exactly why the linear tiers are tractable. It’s also the ceiling.

Elite is what happens when you stop hand-crafting. A 32-channel, 3-block ResNet with a single policy head (~65K parameters) takes a (3, 8, 8) board tensor directly and outputs 64 logits. No features, no hand-designed signals — the convolutions learn their own vocabulary. Inference is a single forward pass.
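For concreteness, here is a PyTorch sketch of a network at that scale: 32 channels, 3 residual blocks, a single policy head mapping a (3, 8, 8) tensor to 64 logits. This is an assumed layout that matches the stated shape and parameter budget, not the shipped architecture:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = torch.relu(self.b1(self.c1(x)))
        return torch.relu(x + self.b2(self.c2(y)))

class StudentPolicy(nn.Module):
    """Illustrative ~65K-parameter policy net: (3, 8, 8) board -> 64 logits."""
    def __init__(self, ch: int = 32, blocks: int = 3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU())
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        # 1x1 conv down to 2 planes, then a linear layer over 2*8*8 = 128
        self.head = nn.Sequential(
            nn.Conv2d(ch, 2, 1), nn.Flatten(), nn.Linear(128, 64))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

net = StudentPolicy()
logits = net(torch.zeros(1, 3, 8, 8))  # one forward pass per move selection
```

A network this shape lands at roughly 65K parameters, which is what makes a single forward pass cheap enough to run per move on a phone.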

The training pipeline is three stages:

  1. Teacher. ai/scripts/train_cnn_teacher.py runs AlphaZero-style self-play on a larger 64-channel, 6-block ResNet with both policy and value heads. 100–400 MCTS simulations per move, policy targets from visit counts, value targets from game outcome. This is the expensive part.
  2. Teacher data. generate_teacher_data.py replays the frozen teacher at high simulation count, records positions + MCTS action distributions to an .npz.
  3. Student. train_student.py trains the shipping small network to imitate those distributions via cross-entropy. No MCTS, no self-play — supervised learning on a fixed dataset. Runs in minutes.

The student doesn’t have to re-derive the teacher’s reasoning; it only has to match the teacher’s move distribution, which is a much easier learning problem. You get most of the way to the teacher’s strength at about 1% of the size.
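The distillation objective is ordinary cross-entropy, just with soft targets: the teacher’s MCTS visit distribution stands in for a one-hot label. A dependency-free sketch of the per-position loss:

```python
import math

def distill_loss(student_logits, teacher_dist):
    """Cross-entropy H(teacher, student) = -sum_a p_t(a) * log softmax(z)[a].

    Minimized exactly when the student's softmax matches the teacher's
    visit distribution. Sketch only; the real training uses batched
    tensors in ai/scripts/train_student.py.
    """
    mx = max(student_logits)  # log-sum-exp with the usual max shift
    log_z = mx + math.log(sum(math.exp(z - mx) for z in student_logits))
    return -sum(p * (z - log_z)
                for p, z in zip(teacher_dist, student_logits))
```

Against a uniform teacher over n moves the loss bottoms out at log n; against a sharp teacher it rewards the student for piling probability on the same move.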

For the integration side — how this .onnx gets into an Expo app, the Android autolinking gap, and the Jest N-API workaround — see Shipping a CNN game AI on-device in Expo with ONNX. This post is the design of the model stack; that post is the shipping of it.

The legacy switching-linear Elite (still in the web build)

Before the CNN, Elite was a 26-weight SwitchingLinearNet — two separate 13-weight vectors, one for the opening and one for the endgame, interpolated on game_progress:

gp    = game_progress                    // 0 at move 1, 1 at last move
score = Σ feat[j] * (w_early[j] * (1 - gp)
                   + w_late[j]  * gp)

The intuition captures something real: what makes a good opening move (claim space, stay off edges) isn’t what makes a good endgame move (squeeze every last connection axis out of tight territory). One set of weights has to compromise; two sets can specialize and blend smoothly.
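The scoring rule above is a per-feature linear interpolation, which a few lines make concrete (illustrative sketch of the SwitchingLinearNet rule, not the shipped code):

```python
def switching_score(features, w_early, w_late, game_progress):
    """Blend an opening weight vector into an endgame one as the board fills.

    game_progress is 0.0 at the first move and 1.0 at the last, so the
    effective weight vector slides smoothly from w_early to w_late.
    """
    gp = game_progress
    return sum(f * (we * (1 - gp) + wl * gp)
               for f, we, wl in zip(features, w_early, w_late))
```

At gp = 0 only the opening weights matter, at gp = 1 only the endgame weights, and mid-game positions get a proportional mix of both judgments.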

It still ships, as the fallback for the web build where we don’t ship a native ML runtime. If you’re playing Cell Division in a browser, the “Elite” you’re facing is the 26-weight model, not the CNN. On a phone, it’s the distilled network. I find it pleasing that the fallback is still a credible Elite — “degraded” here means “a very good linear model instead of a slightly better neural one,” which is a much better degrade-gracefully story than most ML integrations get.

Why this combination is the right shape

Four tiers, two families:

  • Easy / Medium: hand-written scoring rules. No training at all. Personalities tuned by feature selection.
  • Hard: 13 trained weights + 75/25 blend with the Medium heuristic. Trained in minutes on CPU.
  • Elite (native): ~65K-parameter distilled ResNet, trained offline on a GPU box over a weekend.
  • Elite (web fallback): 26-weight switching-linear model. Still credible.

The whole stack — linear weights, switching-linear weights, distilled ResNet — fits in the app bundle, runs on-device with no network calls, and retrains on a single GPU box on a weekend. That’s the combination I wanted: something you can read, something you can beat, and something that still surprises you.


If you want to play against the result: Cell Division and Jelmata both use this same two-family stack. For the player-facing take on what each tier actually feels like to play against, the game sites have their own writeups: Meet Your Four Opponents on the Cell Division blog covers the same ground without code.