Deep Q-Network agent that learns to play 2048. Uses a dueling CNN architecture with prioritized experience replay, targeting Apple Silicon (MPS backend).
- Python 3.13+
- uv package manager
- macOS with Apple Silicon (MPS) or CPU fallback
```
uv sync
```

For development (includes pytest):

```
uv sync --group dev
```

```
# Default: 5000 episodes, 16 parallel environments, dueling DQN + PER
uv run python train.py

# Quick test run
uv run python train.py --episodes 100

# Single environment mode (slower, useful for debugging)
uv run python train.py --episodes 100 --single-env

# Resume from checkpoint
uv run python train.py --resume

# Resume with boosted exploration
uv run python train.py --resume --epsilon 0.3

# Watch training in real-time
uv run python train.py --visualize
```

Training periodically saves `dqn_2048.pt` (model checkpoint) and `training_curves.png` (plots).
```
# Default: 5 games with GUI
uv run python play.py

# More games, faster
uv run python play.py --games 20 --delay 0.05
```

```
uv run python human_play.py
```

Arrow keys or WASD to move. R to restart. ESC to quit.
```
uv run pytest tests/ -v
```

- `config.py` - TrainConfig dataclass with all hyperparameters
- `game.py` - Game logic, state encoding, vectorized environment
- `model.py` - DQN_CNN, DQN_CNN_Dueling, DQN (legacy MLP)
- `agent.py` - DQNAgent, ReplayBuffer, PrioritizedReplayBuffer
- `train.py` - Trainer class with parallel and single-env modes
- `play.py` - Watch a trained agent play with a GUI
- `human_play.py` - Play 2048 yourself
- `plot.py` - Training curve visualization
- `gui.py` - Pygame rendering
- `tests/` - pytest suite (game logic + agent)
- Dueling DQN: Separate value and advantage streams, combined as Q(s,a) = V(s) + A(s,a) - mean(A)
- Double DQN: Policy net selects actions, target net evaluates (reduces overestimation)
- Prioritized Experience Replay: Sum-tree sampling weighted by TD error
- Cosine LR schedule: Learning rate anneals from 5e-4 to 1e-5
- Soft target updates: tau=0.001 each step (no hard copies)
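As one illustrative example of the prioritized sampling above (plain Python, not the sum-tree code in `agent.py`), proportional PER weights each transition by its TD error; `alpha` and `eps` are conventional PER hyperparameters assumed here, not values taken from this project's config:

```python
import random

def sample_proportional(td_errors, k, alpha=0.6, eps=1e-6):
    """Sample k indices with probability proportional to |TD error|^alpha.

    The real PrioritizedReplayBuffer in agent.py does this in O(log n)
    with a sum tree; this linear-scan sketch only shows the sampling
    distribution. `alpha` and `eps` are assumed, conventional defaults.
    """
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(range(len(td_errors)), weights=weights, k=k)
```

Transitions with large TD errors are replayed far more often, which is why fresh priorities must be written back after each learning step.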
The CNN backbone uses 3x3 padded convolutions with batch normalization (~826K parameters), feeding into separate value (1 output) and advantage (4 outputs) heads.
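A minimal plain-Python sketch of how the dueling aggregation, double-DQN target, and soft target update fit together (function names are illustrative, not the actual `model.py`/`agent.py` API, which operates on PyTorch tensors):

```python
def dueling_q(value, advantages):
    """Combine the two streams: Q(s,a) = V(s) + A(s,a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_dqn_target(reward, done, gamma, q_policy_next, q_target_next):
    """Double DQN: the policy net selects the action, the target net
    evaluates it, reducing Q-value overestimation."""
    if done:
        return reward
    best = max(range(len(q_policy_next)), key=lambda a: q_policy_next[a])
    return reward + gamma * q_target_next[best]

def soft_update(target_params, policy_params, tau=0.001):
    """Polyak averaging: target <- tau * policy + (1 - tau) * target."""
    return [tau * p + (1 - tau) * t
            for t, p in zip(target_params, policy_params)]
```

For example, `dueling_q(1.0, [2.0, 0.0, -1.0, -1.0])` yields `[3.0, 1.0, 0.0, 0.0]`: the advantages are mean-centered, so the value stream alone sets the overall scale of Q.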
All hyperparameters live in config.py as a TrainConfig dataclass. Key defaults:
| Parameter | Value | Notes |
|---|---|---|
| `episodes` | 5000 | Total training episodes |
| `num_envs` | 16 | Parallel environments |
| `batch_size` | 128 | Replay sample size |
| `gamma` | 0.99 | Discount factor |
| `lr` | 5e-4 | Initial learning rate |
| `epsilon_decay` | 0.999 | Per-episode multiplicative decay |
| `tau` | 0.001 | Soft target update rate |
| `use_dueling` | True | Dueling architecture |
| `use_per` | True | Prioritized experience replay |
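The table maps onto a dataclass along these lines (field names and defaults taken from the table; the actual `TrainConfig` in `config.py` may define additional fields):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Defaults from the hyperparameter table; the real config.py
    # may include more fields (e.g. replay capacity, warmup steps).
    episodes: int = 5000
    num_envs: int = 16
    batch_size: int = 128
    gamma: float = 0.99
    lr: float = 5e-4
    epsilon_decay: float = 0.999
    tau: float = 0.001
    use_dueling: bool = True
    use_per: bool = True

cfg = TrainConfig(episodes=100)  # e.g. a quick test run
```

With per-episode multiplicative decay, epsilon after n episodes is scaled by `epsilon_decay ** n`; at 0.999, roughly 37% of the initial exploration rate remains after 1000 episodes.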