DavidBarbera/alphazero-simple

AlphaZero from scratch

A small, readable implementation of AlphaZero (Silver et al., Science 2018) built up one stage at a time, plus a polished web UI for actually playing the games. We start with tic-tac-toe because the algorithm is identical across games but tic-tac-toe trains in minutes on a laptop — then we reuse the same code (and the same UI) for harder games.

Why AlphaZero and not the original AlphaGo? The 2016 AlphaGo learned from human games through a multi-stage pipeline. AlphaZero is far simpler: a single network, no human data, learning purely from self-play. It is both easier to build and more elegant.

New to MCTS or reinforcement learning? Read docs/mcts.md — an illustrated, from-scratch explanation of the search algorithm and how AlphaZero builds on it.

The big idea in three sentences

  1. One neural network looks at a board and outputs a policy (which moves look good) and a value (who is likely to win).
  2. A Monte Carlo Tree Search uses that network to look ahead and produce a better policy than the raw network.
  3. The network trains on its own search results from self-play, bootstrapping from random play to strong play (optimal, in the case of tic-tac-toe) with no external data.
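
The second sentence can be made concrete with a small sketch. This is an illustration only, not the code in alphazero/az_mcts.py: the dictionary layout, the function names, and the c_puct value of 1.5 are assumptions chosen for the example. It scores each child move with the PUCT rule from the paper (mean value plus a prior-weighted exploration bonus) and turns visit counts into the improved policy that serves as a training target.

```python
import math


def puct_score(prior, child_visits, child_value_sum, parent_visits, c_puct=1.5):
    """PUCT score Q + U for a single child edge (c_puct is an assumed value)."""
    # Q: mean value of the child so far; 0.0 for an unvisited child.
    q = child_value_sum / child_visits if child_visits > 0 else 0.0
    # U: exploration bonus, large for high-prior, rarely visited moves.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u


def select_move(children, c_puct=1.5):
    """Pick the move with the highest PUCT score.

    `children` maps move -> {"prior", "visits", "value_sum"} (hypothetical layout).
    """
    parent_visits = sum(c["visits"] for c in children.values())
    return max(
        children,
        key=lambda m: puct_score(
            children[m]["prior"],
            children[m]["visits"],
            children[m]["value_sum"],
            parent_visits,
            c_puct,
        ),
    )


def search_policy(children, temperature=1.0):
    """Normalise visit counts into a policy; assumes at least one visit in total."""
    weights = {m: c["visits"] ** (1.0 / temperature) for m, c in children.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}
```

Note that the bonus term shrinks as a move accumulates visits, so the search gradually shifts from trusting the network's prior to trusting its own statistics.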

Project structure

alphazero/
├── alphazero/              # the core library
│   ├── games/
│   │   ├── base.py         # the abstract Game interface (the keystone)
│   │   ├── registry.py     # game auto-discovery for the API/UI
│   │   └── tictactoe.py    # first concrete game             [Stage 1 ✓]
│   ├── agents.py           # opponents (Random, MCTS, AlphaZero)
│   ├── mcts.py             # vanilla MCTS, random rollouts    [Stage 2 ✓]
│   ├── az_mcts.py          # PUCT, network-guided search      [Stage 3 ✓]
│   ├── network.py          # two-headed policy + value net    [Stage 3 ✓]
│   ├── self_play.py        # generate games from self-play    [Stage 4 ✓]
│   └── trainer.py          # the train -> play -> repeat loop  [Stage 4 ✓]
├── server/                 # FastAPI backend (imports the registry)
│   ├── app.py              # endpoints: list games, new game, move
│   ├── game_manager.py     # in-memory game sessions
│   └── schemas.py          # API models (mirror frontend/src/types.ts)
├── frontend/               # Vite + React + TypeScript + Tailwind UI
│   └── src/
│       ├── App.tsx
│       ├── api.ts          # typed API client
│       └── components/     # GamePicker, GameView, Board (generic renderer)
├── docs/                   # illustrated explainers (start with mcts.md)
├── tests/                  # game + API tests
├── main.py                 # CLI entry point
└── pyproject.toml

The design rule that makes everything reusable: all game logic lives in Python behind Game in games/base.py, and the UI renders whatever the API describes. To add a game you implement that interface, decorate it with @register_game, and it appears in the picker automatically — no frontend changes for grid-based games.
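
As a rough sketch of that pattern, the snippet below pairs an abstract interface with a registering class decorator. Only the names Game and register_game come from this README; the method names, the GAMES dictionary, and the toy Nim game are hypothetical and will differ from what games/base.py and games/registry.py actually define. Nim is not grid-based and is used here only to keep the example short.

```python
from abc import ABC, abstractmethod

GAMES = {}  # hypothetical registry: game name -> game class


def register_game(cls):
    """Class decorator that records a game class under its `name` attribute."""
    GAMES[cls.name] = cls
    return cls


class Game(ABC):
    """Assumed minimal interface; the real one in games/base.py may differ."""

    name = "abstract"

    @abstractmethod
    def initial_state(self): ...

    @abstractmethod
    def legal_moves(self, state): ...

    @abstractmethod
    def next_state(self, state, move): ...

    @abstractmethod
    def winner(self, state): ...


@register_game
class Nim(Game):
    """Toy game: players alternate taking 1 or 2 stones; taking the last one wins."""

    name = "nim"

    def initial_state(self):
        return (5, 0)  # (stones left, player to move)

    def legal_moves(self, state):
        stones, _ = state
        return [n for n in (1, 2) if n <= stones]

    def next_state(self, state, move):
        stones, player = state
        return (stones - move, 1 - player)

    def winner(self, state):
        stones, player = state
        # With no stones left, the previous mover (not `player`) took the last one.
        return None if stones > 0 else 1 - player
```

A server could then list the keys of GAMES to populate a picker, which is the kind of auto-discovery the registry is described as providing.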

Running the web UI

You run two dev servers side by side. Both hot-reload, so you rarely refresh.

1. Backend (FastAPI) — from the repo root:

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
uvicorn server.app:app --reload --port 8000

The "vs AlphaZero" opponent needs PyTorch. Add it with pip install -e ".[train]" (or ".[dev,train]" for both).

2. Frontend (Vite) — in a second terminal:

cd frontend
npm install
npm run dev

Then open http://localhost:5173. The Vite server proxies /api to the backend, so there's no CORS setup.

The hot-reload loop

  • Frontend edits (anything under frontend/src) are swapped into the page in-place by Vite's HMR — your in-progress game stays on the board, no refresh.
  • Backend edits (game logic, or a whole new game) auto-restart the server via --reload. The UI refetches the game list when you tab back, so a new game shows up on its own. (A reload clears in-memory games, so a game in progress resets — fine for development.)

Training the agent

A trained tic-tac-toe model ships with this download at checkpoints/tictactoe.pt, so the "vs AlphaZero" opponent plays optimally out of the box. Git ignores checkpoints/, so a fresh clone does not include the model; recreate it with the commands below. Reproducing it takes only a few minutes on a CPU, and no GPU is needed:

pip install -e ".[train]"
python main.py train          # self-play -> train -> repeat; saves the checkpoint

Each iteration prints the loss and a scoreline against a random opponent, so you can watch it climb from random play to unbeatable:

iter  1 | ... | vs random: 18W 1D 1L
iter  6 | ... | vs random: 20W 0D 0L

Evaluate a trained model against a baseline:

python main.py eval --opponent random --games 100   # ~89W 11D 0L  (never loses)
python main.py eval --opponent mcts   --games 40     # mostly draws (optimal play)

The loop minimises the paper's loss (mean-squared value error + policy cross-entropy + L2 weight regularisation), augments the data with the board's 8 symmetries (4 rotations, each with and without a reflection), and adds Dirichlet noise to the root priors during self-play for exploration. See alphazero/trainer.py.
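
To make the loss and the augmentation concrete, here is a minimal pure-Python sketch. It does not reproduce alphazero/trainer.py, which uses PyTorch: the function names, the flat row-major board layout, and the regularisation constant c are assumptions for illustration. It also assumes the policy has one entry per board cell, so the same transform applies to board and policy alike.

```python
import math


def az_loss(z, v, pi, p, params, c=1e-4):
    """Per-example loss (z - v)^2 - pi . log(p) + c * ||params||^2.

    z: game outcome, v: predicted value, pi: search policy (target),
    p: network policy, params: flat list of weights, c: assumed L2 constant.
    """
    value_loss = (z - v) ** 2
    # Clamp p away from zero so the logarithm stays finite.
    policy_loss = -sum(t * math.log(max(q, 1e-12)) for t, q in zip(pi, p) if t > 0)
    l2 = c * sum(w * w for w in params)
    return value_loss + policy_loss + l2


def rotate(cells, n=3):
    """Rotate a flat row-major n x n grid 90 degrees clockwise."""
    return [cells[(n - 1 - col) * n + row] for row in range(n) for col in range(n)]


def mirror(cells, n=3):
    """Reflect a flat row-major n x n grid left to right."""
    return [cells[row * n + (n - 1 - col)] for row in range(n) for col in range(n)]


def symmetries(board, policy, n=3):
    """Return the 8 (board, policy) pairs: 4 rotations, each with and without a mirror."""
    out = []
    for flipped in (False, True):
        b = mirror(board, n) if flipped else list(board)
        p = mirror(policy, n) if flipped else list(policy)
        for _ in range(4):
            out.append((b, p))
            b, p = rotate(b, n), rotate(p, n)
    return out
```

Each self-play position then yields eight training examples for the price of one, which matters on a game as small as tic-tac-toe where distinct positions are scarce.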

Roadmap

  • Stage 0 — Scaffold & Game interface
  • Stage 1 — Tic-tac-toe — game logic + tests
  • Web UI — FastAPI + React app to pick and play any registered game
  • Stage 2 — Vanilla MCTS — UCT tree search with random rollouts (no network); playable in the UI as the "vs MCTS" opponent
  • Stage 3 — Network + PUCT — a two-headed policy/value network and PUCT search (no rollouts); the "vs AlphaZero" opponent
  • Stage 4 — Self-play loop — self-play training that learns tic-tac-toe from scratch to optimal play; a trained model is included
  • Stage 5 — Connect Four & scaling — reuse everything on a bigger game
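
For contrast with the network-guided search of Stage 3, the two ingredients of Stage 2 can be sketched in a few lines. This is an assumed, simplified form rather than the code in alphazero/mcts.py: the function names, the callback-style game access, and the exploration constant sqrt(2) are illustrative choices, and the rollout treats a winner of None as "game still in progress" without modelling draws.

```python
import math
import random


def uct_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 applied to trees: average reward plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # always try an unvisited child first
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def random_rollout(state, legal_moves, next_state, winner, rng=random):
    """Play uniformly random moves until `winner` returns a non-None result."""
    while winner(state) is None:
        state = next_state(state, rng.choice(legal_moves(state)))
    return winner(state)
```

Stage 3 replaces the rollout with the network's value estimate and swaps the UCB1 bonus for one weighted by the network's policy prior.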

Tests

pip install -e ".[dev,train]"   # [train] adds PyTorch for the network/trainer tests
pytest -q

One gotcha to remember early

During self-play, AlphaZero adds a little Dirichlet noise to the move priors at the root node only, for exploration. Leave it out and self-play can collapse to repeating the same handful of games. The settings live in alphazero/config.py (dirichlet_alpha, dirichlet_epsilon) and apply during self-play training.
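
The mixing step itself is short. The sketch below is an assumption about its general shape, not a copy of the repository's code: the function names are hypothetical, and the defaults (alpha 0.3, epsilon 0.25) are the values the paper reports for chess, which may not match what alphazero/config.py uses. A Dirichlet sample is drawn by normalising independent gamma draws.

```python
import random


def sample_dirichlet(alpha, k, rng=random):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def add_root_noise(priors, alpha=0.3, epsilon=0.25, rng=random):
    """Blend root priors with noise: (1 - epsilon) * p + epsilon * eta."""
    noise = sample_dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


# Usage: perturb the priors of the legal moves at the root before searching.
noisy = add_root_noise([0.5, 0.3, 0.2], rng=random.Random(0))
```

Because both inputs sum to one, the blended priors still form a probability distribution, and every legal move keeps at least (1 - epsilon) of its original prior.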

References

  • Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140–1144 (2018).
  • Sutton & Barto, Reinforcement Learning: An Introduction (free online).
  • David Silver's UCL RL lecture course (free) — by the paper's lead author.
  • suragnair/alpha-zero-general on GitHub — a clean reference to compare against.

License

MIT — see LICENSE.
