The AI training graveyard: CMA-ES, REINFORCE, a phased linear model, and the 28 weights that survived

Two days in March, five rewritten training pipelines. MCTS lived in the repo for ninety minutes. A phased linear model with 832 parameters got deleted in one commit. The only direct survivor among the shipping weight files is a 28-weight switching-linear model that took its lessons from everything that failed.

Most blog posts about game AI are written from the destination — here is the model we shipped, here is how it works, here is the curve. The player-facing writeup on Jelmata’s site is no different. It walks through the 14-feature linear model that powers Easy through Hard and the distilled CNN that powers Elite. It’s true. It doesn’t tell you what the last three weeks of the repo actually looked like.

Between March 3 and March 5, 2026, Jelmata’s Python training pipeline was rewritten five times. Evolutionary strategies gave way to Monte Carlo tree search, which survived for ninety minutes. REINFORCE self-play replaced it. A neural network briefly became the Hard difficulty, then got rolled back. An 832-parameter phased linear model appeared, trained for a day, and got deleted in one commit along with every checkpoint it had produced. What finally shipped a month later as Elite was a distilled CNN — but the 28-weight switching-linear model that runs as Elite’s fallback is the only direct survivor of that frantic week.

This post is the journey, with commit hashes. Every approach below was written, trained, and killed. A few left fingerprints on the final pipeline; most left nothing but commits and deleted checkpoints. If you’re building a game AI from scratch, the graveyard is probably more useful than the shipping architecture.

CMA-ES, in a file named alphazero.py

The first Python training framework landed on the morning of March 3 as a single 1,289-line file called ai/alphazero.py, plus a src/training/train.py that ran the actual optimizer. The filename was aspirational. The training loop wasn’t AlphaZero-anything; it was Covariance Matrix Adaptation Evolution Strategy — CMA-ES — optimizing a flat 14-feature linear policy against hand-configured Easy and Medium opponents.

Two reasons to start with evolutionary strategies. First, no gradients. A linear policy scored by win rate against a fixed opponent is a black-box function from weights to reward, and CMA-ES doesn’t care that the function isn’t differentiable. Second, embarrassingly parallel. CMA-ES samples a population of candidate weight vectors, each one plays a batch of games on a worker process, and the top survivors shape the next generation’s covariance. A laptop’s worth of cores, an afternoon, usable weights.

By late afternoon the same day, the first Hard weights shipped from generation 59 — the commit message literally says “gen59”. A later commit added Common Random Numbers: every candidate in a generation plays the exact same evaluation boards with the exact same starting seeds, so differences in win rate reflect differences in policy rather than differences in the games that got rolled. Variance reduction is a boring-sounding optimization that saves days of wasted training, and I’d recommend it for any population-based trainer that ranks candidates by noisy fitness.
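The Common Random Numbers idea is small enough to show in full. A minimal sketch, with `play_game` as a stand-in for a real rollout (the function name and its internals are illustrative, not the repo's): freeze one seed list per generation, evaluate every candidate on exactly those seeds.

```python
import random

def play_game(weights, seed):
    # Stand-in for a full game rollout: deterministic given (weights, seed).
    rng = random.Random(seed)
    noise = rng.gauss(0.0, 1.0)
    return 1 if weights[0] + noise > 0 else 0

EVAL_SEEDS = list(range(200))  # frozen once per generation

def fitness(weights):
    # Every candidate sees the exact same 200 boards, so a win-rate gap
    # between two candidates reflects policy, not luck of the draw.
    wins = sum(play_game(weights, s) for s in EVAL_SEEDS)
    return wins / len(EVAL_SEEDS)
```

Because `fitness` is now deterministic for a fixed seed set, the per-generation ranking noise comes only from the candidates themselves, which is exactly what a population trainer needs.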

CMA-ES didn’t survive as the main trainer, but it produced the first weights that were good enough to play against. Every trainer that came after was measured against the baseline CMA-ES set.

MCTS (ninety minutes in the repo)

Later the same day, a new file appeared: ai/src/ai/mcts.py, 187 lines of UCB tree search with a LinearAI rollout policy at temperature 0.3. The idea was the obvious one — if the linear evaluator is any good, you should be able to search with it and get a stronger player for free. Run a few dozen simulations, each one using the linear model to pick moves all the way to terminal, and back the win rates up to the root.

It didn’t ship. The very next commit in the log is titled “Improve game UI polish and remove unused MCTS AI”, and it deletes mcts.py and its 380-line evaluation harness in the same diff that tweaks the score header animation. Ninety minutes, if that, in the repo.

The reason MCTS got cut is the same reason the much-later Elite CNN doesn’t search at inference time: rollouts are a phone budget problem. Even a cheap linear rollout costs dozens of milliseconds per simulation when you multiply it out, and search only pays off if you do many simulations. The AI felt slower without feeling obviously smarter. That’s a bad trade on any device you don’t plug into a wall.

The lesson wasn’t that MCTS is wrong for Jelmata. It’s that search belongs in training, not inference. That idea survived the cut and shaped the final pipeline — the distilled CNN that shipped as Elite was trained by an MCTS teacher, so the search work happens once, offline, and the shipping runtime is a single forward pass.

REINFORCE self-play: gradients, then instability

With MCTS gone and CMA-ES already in production, the next question was whether gradient-based training could do better. Commit 7540217, still on March 3, added ai/selfplay.py — 1,188 lines of REINFORCE self-play. The docstring is worth quoting because it’s the whole thesis:

“Same game and network architecture as alphazero.py, but replaces MCTS self-play with direct policy sampling: each move requires only 1 forward pass (vs ~96 for MCTS). REINFORCE with value baseline for policy gradient. Entropy bonus to prevent premature convergence. The tradeoff: noisier training signal but ~100x faster games, enabling far more self-play data per wall-clock hour.”

That factor of 100 is the whole point. A tree-search trainer spends most of its wall clock on the tree, not the network. If you throw the tree away and sample moves directly from the policy, each game is fast enough to play tens of thousands of them before the trainer gets bored. The price is a much noisier gradient — the policy sees a single trajectory instead of a softened distribution over many — and REINFORCE is notorious for becoming unstable when the noise compounds.

It became unstable almost immediately. The next commit is titled “Fix self-play training stability and update weights” and it clamps log-probabilities to prevent the off-policy ratio from exploding, plus bumps the entropy coefficient from 0.01 to 0.05. The day after, the whole thing was refactored into an on-policy loop that tracks the behavior-policy temperature per move, so corrective importance ratios are computed against the distribution the move was actually sampled from rather than the current policy. Around iteration 300 the weights started looking reasonable, and those weights briefly powered Medium difficulty.
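The shape of the clamping fix is worth spelling out, since it is the standard remedy for this failure mode. A sketch (the clip value here is an assumption; the commit clamps log-probabilities before the ratio is formed):

```python
import math

def clamped_ratio(logp_new, logp_old, clip=5.0):
    # Clamp the log-ratio before exponentiating, so a single stale move
    # whose probability collapsed under the new policy can't blow the
    # gradient up by a factor of e^20 or more.
    log_ratio = max(-clip, min(clip, logp_new - logp_old))
    return math.exp(log_ratio)
```

Without the clamp, `exp(logp_new - logp_old)` for a move the current policy now considers near-impossible produces an importance weight in the thousands, and one such sample can dominate an entire batch's gradient.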

REINFORCE self-play was never the final answer for any difficulty. It was the pipeline that proved gradients could work on this problem at all, and it was the scaffold that every subsequent PPO trainer was built on. Its real legacy is that the phrase “one forward pass per move” became the non-negotiable constraint for everything that followed.

Neural Hard AI: a 24-hour misfire

On March 4, commit 9bda2cb did something that looks fine on the surface and turned out to be a design mistake: it made Hard difficulty a neural network. The commit added on-device ONNX inference through onnxruntime-react-native, bundled assets/models/hard_ai.onnx into the app, and wired Hard to call into it. Later the same day, a follow-up commit stacked a phased linear front-end in front of the network so the model could vary its strategy with the board’s fullness. For about 24 hours, Hard was a neural net.

Both decisions got rolled back the day after. The neural Hard AI disappeared, the ONNX export script got deleted, and Hard went back to a hand-tuned five-feature linear model whose entire strategy can be read off the sign of the weights.

Two reasons.

The first was the difficulty ceiling. If Hard is already a neural net, there is nowhere for Elite to go. The whole point of four difficulty levels is that each represents a genuinely different way of playing, not just a weaker sampling temperature on the same model. When Hard became a CNN, Elite had to be the same CNN but somehow better, and there was no clean way to say what that meant. Rolling Hard back to a transparent 5-feature linear model is what gave Elite the room to land a month later as something meaningfully stronger than everything below it. (The separate difficulty-ladder heatmap work is what actually proved that the re-expanded ladder was monotone in win rate across board sizes.)

The second was interpretability. The Hard weights I can still read — the biggest weight is -6.0 on cluster size created, because Jelmata’s score is a product of component sizes and merging two groups halves your score. That’s a single sentence of strategy a player can understand, discover a counter to, and eventually beat. A neural Hard doesn’t have that story. When a player asks why the AI played a move, “it’s what the model wanted” is a dead end. The hand-tuned weights came back; ONNX moved up to Elite, where it belonged.

If your difficulty ladder has a neural top tier, every tier below it should be something a human can narrate. When every tier is a neural net, they all blur into “same model, different temperature,” which is the thing graduated difficulty is supposed to avoid.

Phased linear: 832 parameters looking for a problem

The phased linear model is the most interesting failure of the week because it was a good idea. Jelmata’s optimal strategy changes over the course of a game — early moves are about claiming corners and spreading out, middle moves are about pressure and contact, late moves are about finishing groups without merging them. A single flat weight vector has to compromise across all three. Why not train a different weight vector for every stage of the game and interpolate between them?

Commit 96eec16 added ai/src/ai/phased_torch.py, a differentiable nn.Module that held a torch.zeros(num_phases, FEATURE_COUNT) parameter tensor — sixty-four phase vectors of thirteen features each, for 832 learnable parameters. At inference, the current board-fill fraction mapped to a floating-point phase index, and the two nearest phase vectors were linearly interpolated. A 450-line trainer, train_phased_linear.py, ran REINFORCE against it. Everything worked.

It didn’t pay off. 832 parameters is an absurd budget for a model whose entire job is to rank maybe twenty candidate moves, and most of those parameters turned out to be learning indistinguishable versions of the same strategy. The phased model played about as well as a well-trained flat linear model and was dramatically harder to reason about, debug, and export.

The punchline is in commit c6045c1 on March 5: the diff deletes the phased trainer (603 lines), the phased PPO variant (818 lines), the phased fine-tuner (418 lines), and every checkpoint in ai/checkpoints/phased/ in a single stroke. In its place it adds a switching-linear model — two 14-feature weight vectors, one for the opening, one for the endgame, blended by how full the board is. Total parameter count: 28. One thirtieth of the phased model. Same idea, stripped to its essentials.

The same commit also deletes a 1,356-line train_alphazero.py, a 1,266-line train_resnet.py, a 612-line ResNet fine-tuner, and a 5.7 MB ResNet checkpoint. A full AlphaZero pipeline with a ResNet policy-value trunk had been started, iterated on, and abandoned in the span of two days. Roughly five thousand lines of training code and about 40 MB of experimental checkpoints disappeared from the repo before anyone outside the project would ever see them. (Cell Division went through the same motion a few weeks later — its own detour is documented at The AlphaZero detour: batched MCTS, three value targets, and a sign bug that hid in plain sight.)

CEM and PPO: the trainers that lived

The same c6045c1 that deleted the graveyard also added the two trainers that survived. The first is ai/scripts/train_linear_cem.py, 1,005 lines of Cross-Entropy Method optimization — evolutionary strategies, like CMA-ES, but simpler: sample a Gaussian population of candidate weight vectors, rank them by tournament win rate, keep the top fraction, fit a new Gaussian to the elites, repeat. Every candidate is evaluated on the same fixed set of episode seeds, so rankings don’t wobble just because one candidate got lucky with its roll-outs. Common Random Numbers again, ported from the earlier CMA-ES trainer.

The second is ai/scripts/train_switching_linear.py — literally the old REINFORCE/PPO linear trainer renamed, with the feature vector augmented to include board-fullness so a single 28-weight model could learn two phases at once. PPO trains the same object CEM was searching, using gradients instead of populations. The two trainers ran against each other for a day and PPO won on win rate; the checkpoint it produced, switching_linear_ppo.json, is the shipping Elite fallback model on every platform where ONNX can’t initialize, including every browser that visits the web build.

Both scripts are still in ai/scripts/ today. CEM stuck around because it’s good for producing weights with different personalities — useful for the hint system and for generating the bot opponents in the online lobby, where a little variety matters more than raw strength. PPO stayed because it produces the single strongest linear model in the repo.

A month of nothing, then the CNN

After c6045c1 the training directory went quiet for almost a month. All shipping difficulties had workable weights, the switching-linear was strong enough to be the top opponent, and the game itself needed attention — the online arena, push notifications, the hint system, the landing page. The AI files in ai/ barely changed from March 5 to April 5.

Then, on April 5, commit ed5649e landed the Elite CNN. A new train_cnn_teacher.py built an AlphaZero-style policy-value network trained with MCTS self-play — the one place where search actually earned its keep, because it was doing its work in training rather than at runtime. A new train_student_cnn.py distilled that teacher into a much smaller student CNN by dumping positions and the teacher’s greedy move into a dataset and learning to imitate it. The student exported to ONNX, got bundled into the app, and became Elite. The mobile plumbing lives at Shipping a distilled CNN to Expo via onnxruntime-react-native; the Jelmata-specific architectural decisions (one CNN for four board sizes via 10×10 padding) live at One AI, four board sizes.

The fascinating part, in retrospect, is that the Elite CNN is almost a straight application of lessons the graveyard taught. MCTS belongs in training, not inference — the student learns from a teacher that searches, but ships as a single forward pass. Interpretability is worth preserving at the bottom of the ladder — Easy, Medium, and Hard are still plain linear models with weights you can read. Every neural approach that was tried too early got cut. Every neural approach that shipped ended up where it could do the most work for the least runtime cost.

What the graveyard teaches

Three lessons survive the experiments.

Search belongs in training. Every attempt to push search into runtime — MCTS as an opponent, MCTS as a rollout harness for phased linear — got reverted within a day. Every attempt to push search into training, from the never-finished ResNet AlphaZero pipeline to the Elite CNN teacher that actually shipped, eventually produced something worth keeping. If your on-device budget is measured in milliseconds, your search budget needs to live on a GPU somewhere else.

A difficulty ladder needs room at the top. The twenty-four hours Hard spent as a neural network were the clearest mistake of the week, because they collapsed the entire difficulty structure. Once every level is a neural net, Elite has no distinct story. Rolling Hard back to a transparent 5-feature model made the Elite CNN meaningful when it landed a month later. A graduated difficulty ladder depends on each tier having a genuinely different character, not a scaled version of the one above it.

The smallest model that tells the strategy is usually the right one. The phased linear model had 832 parameters and couldn’t outplay a 28-weight switching-linear that encoded the same opening/endgame idea. Both the student CNN and the switching-linear model were smaller than the approach they replaced. In a graveyard full of ambitious trainers, the things that survived were always the ones that did more with less.

If you ever want to convince yourself that a new game-AI approach is worth shipping: build it, train it, and then immediately ask whether a twenty-line heuristic does 90% of what it does. Most of the time, the honest answer rolls your model back one level of sophistication, and the game ships stronger for it.


The player-facing take on the same arc — three lessons in plain language, no commit hashes — lives at The AI Training Graveyard on the Jelmata blog. For what actually got shipped, see One AI, four board sizes and Shipping a distilled CNN to Expo via onnxruntime-react-native.