This document summarizes the key findings from the benchmark analysis of Cassville Checkers strategies.
## Performance Rankings
Based on 2-player head-to-head matchups across 20 games each:
| Rank | Strategy | Win Rate | Key Strength |
|---|---|---|---|
| 1 | greedy | ~69% | Score-based optimization balances all priorities |
| 2 | heuristic_advance | ~64% | Advances marbles before deploying new ones |
| 3 | heuristic_balanced | ~61% | Good balance of staging and advancing |
| 4 | random | ~38% | Baseline performance |
| 5 | heuristic_deploy | ~19% | Aggressive deployment causes congestion |
## Key Insights
### 1. Advance Before Deploy
The most critical insight is that advancing existing marbles before deploying new ones is essential for winning. The data clearly shows:
- `heuristic_advance` (goal > ring > staging > home) significantly outperforms `heuristic_deploy` (goal > mercy > home > staging > ring)
- The difference is stark: ~64% vs ~19% win rate
- This holds across all player counts
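For illustration, here is a minimal sketch of how such a fixed-priority heuristic could be expressed in code. The `Move` type, its `kind` field, and the `pick_move` helper are assumptions for this example, not the project's actual implementation:

```python
from typing import NamedTuple, Optional, Sequence

class Move(NamedTuple):
    kind: str     # e.g. "goal", "ring", "staging", "home", "mercy"
    detail: str   # placeholder for the actual move data

ADVANCE_PRIORITY = ("goal", "ring", "staging", "home")          # heuristic_advance
DEPLOY_PRIORITY = ("goal", "mercy", "home", "staging", "ring")  # heuristic_deploy

def pick_move(legal_moves: Sequence[Move], priority: Sequence[str]) -> Optional[Move]:
    """Return the first legal move whose kind appears earliest in the priority order."""
    for kind in priority:
        for move in legal_moves:
            if move.kind == kind:
                return move
    return None  # no legal move available

# With a ring advance and a home deployment both legal, heuristic_advance
# keeps an existing marble moving while heuristic_deploy adds a new one.
legal = [Move("home", "deploy marble A"), Move("ring", "advance marble B")]
print(pick_move(legal, ADVANCE_PRIORITY))  # -> the ring move
print(pick_move(legal, DEPLOY_PRIORITY))   # -> the home move
```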
### 2. Captures Are Costly
Strategies that minimize captures finish games faster and win more:
| Strategy | Captures per Game (2P) | Captures per Game (4P) |
|---|---|---|
| greedy | ~1.8 | ~14 |
| heuristic_balanced | ~1.9 | ~13 |
| heuristic_advance | ~1.5 | ~12 |
| random | ~9 | ~93 |
| heuristic_deploy | ~12 | ~139 |
The `heuristic_deploy` strategy creates massive ring congestion, leading to up to ~10x more captures per game than the efficient strategies.
### 3. Greedy Optimization Works
The greedy agent’s score-based selection is highly effective because it:
- Dynamically balances competing priorities based on game state
- Prioritizes high-value moves (entering goal) when available
- Avoids leaving marbles exposed to capture
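A minimal sketch of this kind of score-based selection is shown below. The `CandidateMove` fields and the weights are illustrative assumptions, not the greedy agent's actual scoring function:

```python
from dataclasses import dataclass

@dataclass
class CandidateMove:
    enters_goal: bool = False        # would this move complete a marble's journey?
    ring_progress: int = 0           # spaces advanced along the ring
    exposes_to_capture: bool = False
    deploys_from_home: bool = False

def score_move(move: CandidateMove, ring_is_congested: bool) -> float:
    """Score a candidate move; higher is better. Weights are made up for illustration."""
    score = 0.0
    if move.enters_goal:
        score += 100.0               # completing a marble dominates everything else
    score += 5.0 * move.ring_progress
    if move.exposes_to_capture:
        score -= 20.0                # avoid leaving a marble capturable
    if move.deploys_from_home and ring_is_congested:
        score -= 10.0                # don't add to a crowded ring
    return score

def greedy_choice(legal_moves, ring_is_congested=False):
    # Pick the highest-scoring legal move; ties resolve to the first encountered.
    return max(legal_moves, key=lambda m: score_move(m, ring_is_congested))
```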
### 4. Player Count Scaling
Games scale predictably with player count:
- 4-player games take roughly 2x as long as 2-player games
- Captures increase dramatically with more players
- Good strategies maintain their advantage regardless of player count
## PPO Agent Analysis
We trained multiple versions of PPO agents using `MaskablePPO` from `sb3-contrib`:
- v2: Trained against `heuristic_balanced` only
- v3: Trained against mixed opponents (balanced, advance, greedy) + v2 PPO
- v4: Same as v3 but with 10x win/loss rewards (5000/-1000 vs 500/-100)
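For reference, here is a minimal sketch of how a MaskablePPO agent can be trained with sb3-contrib. The `CassvilleCheckersEnv` class, its module path, its `opponent` argument, and its `action_masks()` helper are assumptions for this example; only the sb3-contrib calls reflect the real library API:

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

from cassville_checkers.env import CassvilleCheckersEnv  # hypothetical env module

def mask_fn(env):
    # Boolean vector marking which actions are legal in the current state.
    return env.action_masks()  # assumed helper on the env

# Roughly how a v2-style run could look: a single heuristic_balanced opponent.
env = ActionMasker(CassvilleCheckersEnv(opponent="heuristic_balanced"), mask_fn)

model = MaskablePPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[64, 64]),  # matches the 2x64 architecture noted under Future Work
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # ~1M steps, as in the current models
model.save("ppo_cassville_v2")
```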
### 2-Player Performance (200 games each)
| Model | vs balanced | vs advance | vs greedy | Avg |
|---|---|---|---|---|
| v2 | 50.5% | 47.0% | 48.0% | 48.5% |
| v3 | 51.0% | 50.0% | 51.5% | 50.8% |
| v4 | 52.5% | 52.5% | 50.0% | 51.7% |
All models perform at ~50% win rate, meaning they are evenly matched with the heuristic opponents.
### 4-Player Performance (200 games each, fair baseline: 25%)
| Model | vs balanced | vs advance | vs greedy | Avg |
|---|---|---|---|---|
| v2 | 33.5% | 28.0% | 27.5% | 29.7% |
| v3 | 23.0% | 19.5% | 27.0% | 23.2% |
| v4 | 20.5% | 21.0% | 18.5% | 20.0% |
The 4-player agents perform at ~20-30%, which is near the fair baseline of 25% for 4 equally-matched players.
### Key PPO Findings
- PPO agents are competitive with heuristics - they learned to play at a similar skill level
- Training variations made no significant difference - v2, v3, and v4 all perform similarly within statistical uncertainty (±7% for 200 games; see the quick check after this list)
- Diverse opponents didn’t help - training against mixed strategies (v3) didn’t improve generalization
- Increased rewards didn’t help - 10x reward scaling (v4) didn’t improve learning
- PPO matches but doesn’t exceed heuristics - the neural network learned competent play but not superior strategies
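As a quick check on the ±7% figure quoted above, the two-sigma binomial uncertainty for a ~50% win rate over 200 independent games works out to about 7 percentage points:

```python
import math

n, p = 200, 0.5
print(2 * math.sqrt(p * (1 - p) / n))  # ~0.071, i.e. about ±7 percentage points
```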
## Why `heuristic_deploy` Fails
The aggressive deployment strategy fails because:
- Ring congestion: Multiple marbles deployed before any advance
- Capture cascades: More marbles on ring = more capture opportunities
- Wasted progress: Captured marbles lose all ring progress
- Longer games: 5-10x more captures per game than efficient strategies
The data consistently shows this is the worst strategy across all configurations.
## Recommended Strategy
For optimal human or AI play:
1. Prioritize entering the goal when a marble can complete its journey
2. Advance marbles on the ring before deploying new ones from home
3. Move marbles from staging to ring promptly to avoid blocking
4. Use home-to-staging only when the ring is relatively clear
5. Consider score-based evaluation for complex decisions
The `greedy` and `heuristic_advance` strategies best exemplify these principles.
## Future Work
Given that PPO achieved parity with heuristics but not superiority, potential improvements to explore:
- Longer training: Current models trained for 1M steps; 10M+ might help
- Larger networks: Current architecture uses 2x64 hidden layers; deeper/wider networks may learn better representations
- AlphaZero-style approach: MCTS + neural network has proven superior for board games
- Better reward shaping: Intermediate rewards for progress (marbles advanced, captures avoided) rather than just win/loss (a sketch of this idea follows the list)
- Curriculum learning: Start with weaker opponents, gradually increase difficulty
- Population-based training: Evolve a population of agents against each other
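As an illustration of the reward-shaping idea above, here is a hypothetical shaped reward function; the event names and intermediate weights are assumptions, with only the 500/-100 terminal values taken from the current setup:

```python
def shaped_reward(events: dict, done: bool, won: bool) -> float:
    """Combine intermediate progress rewards with the terminal win/loss reward."""
    reward = 0.0
    reward += 1.0 * events.get("marbles_advanced", 0)       # small reward for ring progress
    reward += 10.0 * events.get("marbles_entered_goal", 0)  # larger reward for finishing a marble
    reward -= 5.0 * events.get("own_marbles_captured", 0)   # penalty for being captured
    if done:
        reward += 500.0 if won else -100.0                  # terminal reward, as in v2/v3
    return reward
```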