guojing0/2048-RL

A Deep Q-Network (DQN) agent that learns to play 2048. It uses a dueling CNN architecture with prioritized experience replay and targets Apple Silicon via PyTorch's MPS backend, with a CPU fallback.

Requirements

  • Python 3.13+
  • uv package manager
  • macOS with Apple Silicon (MPS) or CPU fallback

Setup

uv sync

For development (includes pytest):

uv sync --group dev

Usage

Train the agent

# Default: 5000 episodes, 16 parallel environments, dueling DQN + PER
uv run python train.py

# Quick test run
uv run python train.py --episodes 100

# Single environment mode (slower, useful for debugging)
uv run python train.py --episodes 100 --single-env

# Resume from checkpoint
uv run python train.py --resume

# Resume with boosted exploration
uv run python train.py --resume --epsilon 0.3

# Watch training in real-time
uv run python train.py --visualize

Training periodically saves dqn_2048.pt (model checkpoint) and training_curves.png (training-curve plots).

Watch the agent play

# Default: 5 games with GUI
uv run python play.py

# More games, faster
uv run python play.py --games 20 --delay 0.05

Play yourself

uv run python human_play.py

Arrow keys or WASD to move. R to restart. ESC to quit.

Run tests

uv run pytest tests/ -v

Project structure

config.py    - TrainConfig dataclass with all hyperparameters
game.py      - Game logic, state encoding, vectorized environment
model.py     - DQN_CNN, DQN_CNN_Dueling, DQN (legacy MLP)
agent.py     - DQNAgent, ReplayBuffer, PrioritizedReplayBuffer
train.py     - Trainer class with parallel and single-env modes
play.py      - Watch trained agent play with GUI
human_play.py - Play 2048 yourself
plot.py      - Training curve visualization
gui.py       - Pygame rendering
tests/       - pytest suite (game logic + agent)

Architecture

  • Dueling DQN: Separate value and advantage streams, combined as Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
  • Double DQN: Policy net selects actions, target net evaluates (reduces overestimation)
  • Prioritized Experience Replay: Sum-tree sampling weighted by TD error
  • Cosine LR schedule: Learning rate anneals from 5e-4 to 1e-5
  • Soft target updates: tau=0.001 each step (no hard copies)
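The Double DQN target and the soft target update above can be sketched in plain Python. This is illustrative only: the function names are hypothetical, and the repo's agent.py presumably does this with PyTorch tensors rather than lists of floats.

```python
# Double DQN: the policy net *selects* the next action, the target net
# *evaluates* it, which damps the max-operator overestimation bias.

def double_dqn_target(reward, gamma, q_policy_next, q_target_next, done):
    """One-step TD target for a single transition (hypothetical helper)."""
    if done:
        return reward
    # Action selection by the policy network ...
    a_star = max(range(len(q_policy_next)), key=q_policy_next.__getitem__)
    # ... evaluation by the target network.
    return reward + gamma * q_target_next[a_star]

def soft_update(policy_params, target_params, tau=0.001):
    """Polyak-average target parameters toward the policy net each step."""
    return [tau * p + (1 - tau) * t
            for p, t in zip(policy_params, target_params)]

# The policy net prefers action 1, so the target uses the *target* net's
# value there (3.0) rather than the target net's own maximum (4.0):
t = double_dqn_target(1.0, 0.99, [1.0, 5.0, 2.0, 0.0],
                      [0.5, 3.0, 4.0, 1.0], False)
# t == 1.0 + 0.99 * 3.0
```

With tau=0.001, each target parameter moves 0.1% of the way toward its policy counterpart per step, replacing periodic hard copies.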

The CNN backbone uses 3x3 padded convolutions with batch normalization (~826K parameters), feeding into separate value (1 output) and advantage (4 outputs) heads.
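The aggregation of the two heads can be written out as a small pure-Python sketch. The real heads are conv/linear layers in model.py; this shows only the combination rule, where subtracting the mean advantage keeps the V/A decomposition identifiable.

```python
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean_a A(s,a), for one state."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

# Four actions (up/down/left/right): the mean advantage is 1.5, so the
# Q-values are the advantages re-centered around V(s) = 1.0.
q = dueling_q(1.0, [0.0, 1.0, 2.0, 3.0])
# q == [-0.5, 0.5, 1.5, 2.5]
```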

Configuration

All hyperparameters live in config.py as a TrainConfig dataclass. Key defaults:

Parameter      Value   Notes
episodes       5000    Total training episodes
num_envs       16      Parallel environments
batch_size     128     Replay sample size
gamma          0.99    Discount factor
lr             5e-4    Initial learning rate
epsilon_decay  0.999   Per-episode multiplicative decay
tau            0.001   Soft target update rate
use_dueling    True    Dueling architecture
use_per        True    Prioritized experience replay
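A minimal sketch of what the dataclass might look like, reconstructed from the defaults above; the real TrainConfig in config.py may define additional fields not listed here.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    episodes: int = 5000          # total training episodes
    num_envs: int = 16            # parallel environments
    batch_size: int = 128         # replay sample size
    gamma: float = 0.99           # discount factor
    lr: float = 5e-4              # initial learning rate (annealed to 1e-5)
    epsilon_decay: float = 0.999  # per-episode multiplicative decay
    tau: float = 0.001            # soft target update rate
    use_dueling: bool = True      # dueling architecture
    use_per: bool = True          # prioritized experience replay

cfg = TrainConfig(episodes=100)   # override per run, e.g. a quick test
```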
