This page describes the technical implementation of the Cassville Checkers environment and the AI agents used for analysis.
Gymnasium Environment¶
The game is implemented as a Gymnasium environment, making it compatible with standard reinforcement learning libraries.
Observation Space¶
The environment provides a dictionary observation containing:
| Key | Shape | Description |
|---|---|---|
| marble_positions | [num_players * 5] | Position index for each marble |
| current_player | () | Which player’s turn (0 to num_players-1) |
| die_roll | () | Current die roll (0-5, representing 1-6) |
| lap_completed | [num_players * 5] | Boolean flags for lap completion |
| mercy_counters | [num_players] | Failed deployment attempts per player |
| action_mask | [30] | Binary mask of legal actions |
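As a concrete illustration, an observation space matching the table above could be declared with gymnasium.spaces roughly as follows. This is a sketch only: the player count, the position upper bound, and the mercy-counter cap are assumptions, not values taken from the implementation.

```python
from gymnasium import spaces

NUM_PLAYERS = 4            # assumed player count; the real env may be configurable
MARBLES_PER_PLAYER = 5
NUM_ACTIONS = 30

observation_space = spaces.Dict({
    # 64 is an assumed upper bound on the position index encoding
    "marble_positions": spaces.MultiDiscrete([64] * (NUM_PLAYERS * MARBLES_PER_PLAYER)),
    "current_player": spaces.Discrete(NUM_PLAYERS),
    "die_roll": spaces.Discrete(6),                  # 0-5 encodes a roll of 1-6
    "lap_completed": spaces.MultiBinary(NUM_PLAYERS * MARBLES_PER_PLAYER),
    "mercy_counters": spaces.MultiDiscrete([16] * NUM_PLAYERS),  # assumed counter cap
    "action_mask": spaces.MultiBinary(NUM_ACTIONS),
})
```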
Action Space¶
The action space is discrete with 30 possible actions (6 action types × 5 marbles):
| Action Type | Description |
|---|---|
| HOME_TO_STAGING | Move marble from home to staging (requires roll of 1 or 6) |
| STAGING_TO_RING | Move marble from staging onto the ring |
| RING_MOVE | Advance marble around the ring by die value |
| RING_TO_GOAL | Enter goal area (requires exact landing after lap completion) |
| MERCY_MOVE | Use mercy rule to bypass the 1/6 deployment requirement |
| SKIP | Skip turn when no valid moves are available |
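The exact flat index layout is not documented here, but one natural encoding of 6 action types × 5 marbles is sketched below; the enum ordering and the `type * 5 + marble` formula are illustrative assumptions.

```python
from enum import IntEnum

class ActionType(IntEnum):
    HOME_TO_STAGING = 0
    STAGING_TO_RING = 1
    RING_MOVE = 2
    RING_TO_GOAL = 3
    MERCY_MOVE = 4
    SKIP = 5

def encode_action(action_type: ActionType, marble_index: int) -> int:
    """Map (action type, marble 0-4) to a flat index in [0, 30)."""
    return int(action_type) * 5 + marble_index

def decode_action(action: int) -> tuple:
    """Inverse of encode_action: recover (action type, marble index)."""
    return ActionType(action // 5), action % 5
```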
Reward Structure¶
| Event | Reward |
|---|---|
| Marble reaches goal | +100 |
| Win game | +5000 |
| Capture opponent | +15 |
| Get captured | -10 |
| Time penalty (per turn) | -0.1 |
| Lose game | -1000 |
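For reference, the same values can be collected as constants and combined into a per-step reward roughly as follows. The constant names and the helper are hypothetical; the environment may apply these rewards differently.

```python
# Reward constants from the table above (names are assumptions).
REWARD_GOAL = 100.0
REWARD_WIN = 5000.0
REWARD_CAPTURE = 15.0
REWARD_CAPTURED = -10.0
REWARD_TIME_PENALTY = -0.1
REWARD_LOSS = -1000.0

def step_reward(goals: int, captures: int, times_captured: int, won: bool, lost: bool) -> float:
    """Hypothetical helper: combine one turn's events into a scalar reward."""
    reward = REWARD_TIME_PENALTY                     # applied every turn
    reward += goals * REWARD_GOAL
    reward += captures * REWARD_CAPTURE
    reward += times_captured * REWARD_CAPTURED
    if won:
        reward += REWARD_WIN
    if lost:
        reward += REWARD_LOSS
    return reward
```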
Agent Implementations¶
RandomAgent¶
The simplest baseline—selects uniformly at random from all legal actions. Provides a lower bound on expected performance.
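A minimal sketch of such an agent, assuming the dictionary observation and action mask described above; the `act` interface is an assumption, not the repository's exact API.

```python
import numpy as np

class RandomAgent:
    """Pick uniformly among the actions allowed by the observation's action mask."""

    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)

    def act(self, observation: dict) -> int:
        legal = np.flatnonzero(observation["action_mask"])
        return int(self.rng.choice(legal))
```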
Heuristic Agents¶
Priority-based agents that prefer certain action types over others. Three variants explore different strategic priorities; a sketch of the shared priority-selection mechanism follows the list.
HeuristicAgentAdvance (best heuristic):
Priority: goal > ring > staging > home > mercy > skip. Focuses on advancing marbles already on the ring before deploying new ones.
HeuristicAgentBalanced:
Priority: goal > staging > ring > home > mercy > skip. Moves marbles from staging to the ring first, then advances them.
HeuristicAgentDeploy:
Priority: goal > mercy > home > staging > ring > skip. Aggressively deploys new marbles before advancing existing ones.
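The sketch below shows the shared mechanism, assuming the flat action encoding sketched earlier (`action // 5` gives the action type). The priority list corresponds to HeuristicAgentAdvance; the other variants would only reorder it.

```python
import numpy as np

# Action-type indices follow the assumed ActionType enum ordering from above.
ADVANCE_PRIORITY = [
    3,  # RING_TO_GOAL    ("goal")
    2,  # RING_MOVE       ("ring")
    1,  # STAGING_TO_RING ("staging")
    0,  # HOME_TO_STAGING ("home")
    4,  # MERCY_MOVE      ("mercy")
    5,  # SKIP
]

class HeuristicAgent:
    """Pick the first priority tier with a legal action; tie-break among marbles at random."""

    def __init__(self, priority=ADVANCE_PRIORITY, seed=None):
        self.priority = priority
        self.rng = np.random.default_rng(seed)

    def act(self, observation: dict) -> int:
        legal = np.flatnonzero(observation["action_mask"])
        for action_type in self.priority:
            candidates = [a for a in legal if a // 5 == action_type]
            if candidates:
                return int(self.rng.choice(candidates))
        return int(legal[0])  # fallback; unreachable if SKIP is always maskable
```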
GreedyAgent¶
A score-based agent that assigns point values to each action type:
| Action Type | Base Score | Notes |
|---|---|---|
| RING_TO_GOAL | 100 | Highest priority: finish marbles |
| STAGING_TO_RING | 40 | Get staged marbles moving |
| RING_MOVE | 25 | +15 bonus for lap-completed marbles |
| MERCY_MOVE | 20 | Helps when stuck |
| HOME_TO_STAGING | 15 | Reduced: avoid early deployment |
| SKIP | 0 | Last resort |
The greedy agent selects the legal action with the highest score.
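A sketch of this scoring scheme under the same assumed flat encoding; how the real agent indexes the current player's lap_completed flags is an assumption.

```python
import numpy as np

# Base scores keyed by the assumed action-type index (see the ActionType sketch above).
BASE_SCORES = {0: 15, 1: 40, 2: 25, 3: 100, 4: 20, 5: 0}

class GreedyAgent:
    """Score every legal action and play the highest-scoring one."""

    def act(self, observation: dict) -> int:
        legal = np.flatnonzero(observation["action_mask"])
        player = int(observation["current_player"])

        def score(action: int) -> float:
            action_type, marble = action // 5, action % 5
            value = BASE_SCORES[action_type]
            # RING_MOVE gets a +15 bonus when the marble has completed its lap.
            if action_type == 2 and observation["lap_completed"][player * 5 + marble]:
                value += 15
            return value

        return int(max(legal, key=score))
```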
PPO Agents¶
Neural network policies trained using MaskablePPO from sb3-contrib. Action masking ensures the policy only selects legal moves.
Training configurations tested:
| Version | Training Regime | Win/Loss Rewards |
|---|---|---|
| v2 | vs heuristic_balanced only | 500 / -100 |
| v3 | vs mixed opponents (balanced, advance, greedy) + v2 PPO | 500 / -100 |
| v4 | Same as v3 | 5000 / -1000 (10x) |
All versions used the same training setup (see the sketch below):
- 1 million timesteps
- MLP policy with 2×64 hidden layers
- CPU training (~1000 FPS)
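A minimal training sketch with MaskablePPO matching the listed configuration (1 million timesteps, two 64-unit layers). The environment constructor, import path, and mask accessor are assumptions, and the real train.py may wire opponents and logging differently.

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

from cassville_checkers.env import CassvilleCheckersEnv  # assumed import path

def mask_fn(env):
    # Expose the environment's current legal-action mask to MaskablePPO.
    # action_mask() is an assumed accessor; the wrapper is unnecessary if the
    # env already implements action_masks() itself.
    return env.unwrapped.action_mask()

env = ActionMasker(CassvilleCheckersEnv(), mask_fn)

model = MaskablePPO(
    "MultiInputPolicy",                     # dict observations -> multi-input policy
    env,
    policy_kwargs=dict(net_arch=[64, 64]),  # 2x64 MLP as listed above
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo_cassville")                 # hypothetical output name
```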
Code Structure¶
cassville_checkers/
├── board.py # Board topology, position encoding
├── game_state.py # Game logic, move validation, captures
├── env.py # Gymnasium environment wrapper
├── agents.py # Agent implementations
└── train.py # PPO training utilities
Key Classes¶
Board: Manages the 48-position ring topology and maps between player-relative and absolute positions.
GameState: Handles all game logic including move validation, capturing, bonus turns, and win detection.
CassvilleCheckersEnv: Gymnasium-compatible environment with action masking support.
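A hypothetical interaction loop against the environment using the standard Gymnasium API and the dictionary observation described above; the constructor arguments and import path are assumptions.

```python
import numpy as np
from cassville_checkers.env import CassvilleCheckersEnv  # assumed import path

env = CassvilleCheckersEnv()
obs, info = env.reset(seed=0)
rng = np.random.default_rng(0)

terminated = truncated = False
while not (terminated or truncated):
    legal = np.flatnonzero(obs["action_mask"])       # restrict to legal moves
    action = int(rng.choice(legal))                  # here: a random legal action
    obs, reward, terminated, truncated, info = env.step(action)
```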
Source Code¶
The full implementation is available on GitHub.