This document summarizes the key findings from the benchmark analysis of Cassville Checkers strategies.
## Performance Rankings
Based on 2-player head-to-head matchups across 20 games each:
| Rank | Strategy | Win Rate | Key Strength |
|---|---|---|---|
| 1 | greedy | ~69% | Score-based optimization balances all priorities |
| 2 | heuristic_advance | ~64% | Advances marbles before deploying new ones |
| 3 | heuristic_balanced | ~61% | Good balance of staging and advancing |
| 4 | random | ~38% | Baseline performance |
| 5 | heuristic_deploy | ~19% | Aggressive deployment causes congestion |
## Key Insights
### 1. Advance Before Deploy
The most critical insight is that advancing existing marbles before deploying new ones is essential for winning. The data clearly shows:
- `heuristic_advance` (goal > ring > staging > home) significantly outperforms `heuristic_deploy` (goal > mercy > home > staging > ring)
- The difference is stark: ~64% vs ~19% win rate
- This holds across all player counts
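For illustration, here is a minimal sketch of how such a fixed-priority heuristic could be expressed in code. The `Move` type, its `kind` field, and the `pick_move` helper are assumptions for this example, not the project's actual implementation:

```python
from typing import NamedTuple, Optional, Sequence

class Move(NamedTuple):
    kind: str     # e.g. "goal", "ring", "staging", "home", "mercy"
    detail: str   # placeholder for the actual move data

ADVANCE_PRIORITY = ("goal", "ring", "staging", "home")          # heuristic_advance
DEPLOY_PRIORITY = ("goal", "mercy", "home", "staging", "ring")  # heuristic_deploy

def pick_move(legal_moves: Sequence[Move], priority: Sequence[str]) -> Optional[Move]:
    """Return the first legal move whose kind appears earliest in the priority order."""
    for kind in priority:
        for move in legal_moves:
            if move.kind == kind:
                return move
    return None  # no legal move available

# With a ring advance and a home deployment both legal, heuristic_advance
# keeps an existing marble moving while heuristic_deploy adds a new one.
legal = [Move("home", "deploy marble A"), Move("ring", "advance marble B")]
print(pick_move(legal, ADVANCE_PRIORITY))  # -> the ring move
print(pick_move(legal, DEPLOY_PRIORITY))   # -> the home move
```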
### 2. Captures Are Costly
Strategies that minimize captures finish games faster and win more:
| Strategy | Captures per Game (2P) | Captures per Game (4P) |
|---|---|---|
| greedy | ~1.8 | ~14 |
| heuristic_balanced | ~1.9 | ~13 |
| heuristic_advance | ~1.5 | ~12 |
| random | ~9 | ~93 |
| heuristic_deploy | ~12 | ~139 |
The `heuristic_deploy` strategy creates massive ring congestion, leading to up to ~10x more captures per game than the efficient strategies.
### 3. Greedy Optimization Works
The greedy agent’s score-based selection is highly effective because it:
- Dynamically balances competing priorities based on game state
- Prioritizes high-value moves (entering goal) when available
- Avoids leaving marbles exposed to capture
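A minimal sketch of this kind of score-based selection is shown below. The `CandidateMove` fields and the weights are illustrative assumptions, not the greedy agent's actual scoring function:

```python
from dataclasses import dataclass

@dataclass
class CandidateMove:
    enters_goal: bool = False        # would this move complete a marble's journey?
    ring_progress: int = 0           # spaces advanced along the ring
    exposes_to_capture: bool = False
    deploys_from_home: bool = False

def score_move(move: CandidateMove, ring_is_congested: bool) -> float:
    """Score a candidate move; higher is better. Weights are made up for illustration."""
    score = 0.0
    if move.enters_goal:
        score += 100.0               # completing a marble dominates everything else
    score += 5.0 * move.ring_progress
    if move.exposes_to_capture:
        score -= 20.0                # avoid leaving a marble capturable
    if move.deploys_from_home and ring_is_congested:
        score -= 10.0                # don't add to a crowded ring
    return score

def greedy_choice(legal_moves, ring_is_congested=False):
    # Pick the highest-scoring legal move; ties resolve to the first encountered.
    return max(legal_moves, key=lambda m: score_move(m, ring_is_congested))
```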
### 4. Player Count Scaling
Games scale predictably with player count:
- 4-player games take roughly 2x as long as 2-player games
- Captures increase dramatically with more players
- Good strategies maintain their advantage regardless of player count
## PPO Agent Analysis
We trained multiple versions of PPO agents using `MaskablePPO` from `sb3-contrib`:
- v2: Trained against `heuristic_balanced` only
- v3: Trained against mixed opponents (balanced, advance, greedy) + v2 PPO
- v4: Same as v3 but with 10x win/loss rewards (5000/-1000 vs 500/-100)
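For reference, here is a minimal sketch of how a MaskablePPO agent can be trained with sb3-contrib. The `CassvilleCheckersEnv` class, its module path, its `opponent` argument, and its `action_masks()` helper are assumptions for this example; only the sb3-contrib calls reflect the real library API:

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

from cassville_checkers.env import CassvilleCheckersEnv  # hypothetical env module

def mask_fn(env):
    # Boolean vector marking which actions are legal in the current state.
    return env.action_masks()  # assumed helper on the env

# Roughly how a v2-style run could look: a single heuristic_balanced opponent.
env = ActionMasker(CassvilleCheckersEnv(opponent="heuristic_balanced"), mask_fn)

model = MaskablePPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[64, 64]),  # matches the 2x64 architecture noted under Future Work
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # ~1M steps, as in the current models
model.save("ppo_cassville_v2")
```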
### 2-Player Performance (200 games each)
| Model | vs balanced | vs advance | vs greedy | Avg |
|---|---|---|---|---|
| v2 | 50.5% | 47.0% | 48.0% | 48.5% |
| v3 | 51.0% | 50.0% | 51.5% | 50.8% |
| v4 | 52.5% | 52.5% | 50.0% | 51.7% |
All models perform at ~50% win rate, meaning they are evenly matched with the heuristic opponents.
### 4-Player Performance (200 games each, fair baseline: 25%)
| Model | vs balanced | vs advance | vs greedy | Avg |
|---|---|---|---|---|
| v2 | 33.5% | 28.0% | 27.5% | 29.7% |
| v3 | 23.0% | 19.5% | 27.0% | 23.2% |
| v4 | 20.5% | 21.0% | 18.5% | 20.0% |
The 4-player agents perform at ~20-30%, which is near the fair baseline of 25% for 4 equally-matched players.
### Key PPO Findings
- PPO agents are competitive with heuristics - they learned to play at a similar skill level
- Training variations made no significant difference - v2, v3, and v4 all perform similarly within statistical uncertainty (±7% for 200 games; see the quick check after this list)
- Diverse opponents didn’t help - training against mixed strategies (v3) didn’t improve generalization
- Increased rewards didn’t help - 10x reward scaling (v4) didn’t improve learning
- PPO matches but doesn’t exceed heuristics - the neural network learned competent play but not superior strategies
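As a quick check on the ±7% figure quoted above, the two-sigma binomial uncertainty for a ~50% win rate over 200 independent games works out to about 7 percentage points:

```python
import math

n, p = 200, 0.5
print(2 * math.sqrt(p * (1 - p) / n))  # ~0.071, i.e. about ±7 percentage points
```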
## Why `heuristic_deploy` Fails
The aggressive deployment strategy fails because:
- Ring congestion: Multiple marbles deployed before any advance
- Capture cascades: More marbles on ring = more capture opportunities
- Wasted progress: Captured marbles lose all ring progress
- Longer games: 5-10x more captures per game than efficient strategies
The data consistently shows this is the worst strategy across all configurations.
## Recommended Strategy
For optimal human or AI play:
1. Prioritize entering the goal when a marble can complete its journey
2. Advance marbles on the ring before deploying new ones from home
3. Move marbles from staging to ring promptly to avoid blocking
4. Use home-to-staging only when the ring is relatively clear
5. Consider score-based evaluation for complex decisions
The `greedy` and `heuristic_advance` strategies best exemplify these principles.
## Future Work
Given that PPO achieved parity with heuristics but not superiority, potential improvements to explore:
- Longer training: Current models trained for 1M steps; 10M+ might help
- Larger networks: Current architecture uses 2x64 hidden layers; deeper/wider networks may learn better representations
- AlphaZero-style approach: MCTS + neural network has proven superior for board games
- Better reward shaping: Intermediate rewards for progress (marbles advanced, captures avoided) rather than just win/loss (a sketch of this idea follows the list)
- Curriculum learning: Start with weaker opponents, gradually increase difficulty
- Population-based training: Evolve a population of agents against each other
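As an illustration of the reward-shaping idea above, here is a hypothetical shaped reward function; the event names and intermediate weights are assumptions, with only the 500/-100 terminal values taken from the current setup:

```python
def shaped_reward(events: dict, done: bool, won: bool) -> float:
    """Combine intermediate progress rewards with the terminal win/loss reward."""
    reward = 0.0
    reward += 1.0 * events.get("marbles_advanced", 0)       # small reward for ring progress
    reward += 10.0 * events.get("marbles_entered_goal", 0)  # larger reward for finishing a marble
    reward -= 5.0 * events.get("own_marbles_captured", 0)   # penalty for being captured
    if done:
        reward += 500.0 if won else -100.0                  # terminal reward, as in v2/v3
    return reward
```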