A small, readable implementation of AlphaZero (Silver et al., Science 2018) built up one stage at a time, plus a polished web UI for actually playing the games. We start with tic-tac-toe because the algorithm is identical across games but tic-tac-toe trains in minutes on a laptop — then we reuse the same code (and the same UI) for harder games.
Why AlphaZero and not the original AlphaGo? The 2016 AlphaGo learned from human games through a multi-stage pipeline. AlphaZero is far simpler: a single network, no human data, learning purely from self-play. It is both easier to build and more elegant.
New to MCTS or reinforcement learning? Read docs/mcts.md — an illustrated, from-scratch explanation of the search algorithm and how AlphaZero builds on it.
- One neural network looks at a board and outputs a policy (which moves look good) and a value (who is likely to win).
- A Monte Carlo Tree Search uses that network to look ahead and produce a better policy than the raw network.
- The network trains on its own search results from self-play, bootstrapping from random play to superhuman with no external data.
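The "better policy" in the second bullet is usually read off the search's visit counts at the root. Here is a minimal sketch of that step, assuming the standard AlphaZero rule that move probabilities are proportional to visit counts raised to 1/temperature. The function name `visits_to_policy` and the dict layout are hypothetical, not this repo's API.

```python
def visits_to_policy(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts into a move distribution.

    visit_counts: dict mapping a move to how often search visited it
                  (assumed to contain at least one visited move).
    temperature:  1.0 keeps probabilities proportional to the counts;
                  values near 0 approach a greedy pick.
    """
    if temperature <= 1e-6:
        # Greedy limit: all probability on the most-visited move.
        best = max(visit_counts, key=visit_counts.get)
        return {move: 1.0 if move == best else 0.0 for move in visit_counts}
    weights = {m: n ** (1.0 / temperature) for m, n in visit_counts.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Hypothetical counts after 100 simulations from some position.
counts = {0: 10, 4: 70, 8: 20}
policy = visits_to_policy(counts)           # proportional to the counts
greedy = visits_to_policy(counts, 0.0)      # puts everything on move 4
```

A distribution like `policy` is what the network's policy head is then trained to match, which is how the search results feed back into the network.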
alphazero/
├── alphazero/ # the core library
│ ├── games/
│ │ ├── base.py # the abstract Game interface (the keystone)
│ │ ├── registry.py # game auto-discovery for the API/UI
│ │ └── tictactoe.py # first concrete game [Stage 1 ✓]
│ ├── agents.py # opponents (Random, MCTS, AlphaZero)
│ ├── mcts.py # vanilla MCTS, random rollouts [Stage 2 ✓]
│ ├── az_mcts.py # PUCT, network-guided search [Stage 3 ✓]
│ ├── network.py # two-headed policy + value net [Stage 3 ✓]
│ ├── self_play.py # generate games from self-play [Stage 4 ✓]
│ └── trainer.py # the train -> play -> repeat loop [Stage 4 ✓]
├── server/ # FastAPI backend (imports the registry)
│ ├── app.py # endpoints: list games, new game, move
│ ├── game_manager.py # in-memory game sessions
│ └── schemas.py # API models (mirror frontend/src/types.ts)
├── frontend/ # Vite + React + TypeScript + Tailwind UI
│ └── src/
│ ├── App.tsx
│ ├── api.ts # typed API client
│ └── components/ # GamePicker, GameView, Board (generic renderer)
├── docs/ # illustrated explainers (start with mcts.md)
├── tests/ # game + API tests
├── main.py # CLI entry point
└── pyproject.toml
The design rule that makes everything reusable: all game logic lives in
Python behind Game in games/base.py, and the UI renders whatever the API
describes. To add a game you implement that interface, decorate it with
@register_game, and it appears in the picker automatically — no frontend
changes for grid-based games.
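To make the design rule concrete, here is a rough sketch of what an interface plus registry of this shape could look like. Only the names `Game` and `register_game` come from this README. The method names, the `GAME_REGISTRY` dict, and the state layout are assumptions for illustration, so check games/base.py and games/registry.py for the real signatures.

```python
from abc import ABC, abstractmethod

# Hypothetical registry: maps a game's name to its class.
GAME_REGISTRY = {}


def register_game(cls):
    """Class decorator that records a Game subclass under its `name`."""
    GAME_REGISTRY[cls.name] = cls
    return cls


class Game(ABC):
    """Assumed minimal interface; the real one lives in games/base.py."""

    name = "abstract"

    @abstractmethod
    def initial_state(self): ...

    @abstractmethod
    def legal_moves(self, state): ...

    @abstractmethod
    def next_state(self, state, move): ...

    @abstractmethod
    def winner(self, state): ...


@register_game
class TicTacToe(Game):
    name = "tictactoe"
    rows = cols = 3
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def initial_state(self):
        # Board is a tuple of 9 cells (0 = empty, 1 = X, -1 = O); X moves first.
        return (0,) * 9, 1

    def legal_moves(self, state):
        board, _ = state
        return [i for i, cell in enumerate(board) if cell == 0]

    def next_state(self, state, move):
        board, player = state
        new_board = board[:move] + (player,) + board[move + 1:]
        return new_board, -player

    def winner(self, state):
        board, _ = state
        for a, b, c in self.LINES:
            if board[a] != 0 and board[a] == board[b] == board[c]:
                return board[a]
        return None  # no winner (yet)


# Look the game up by name, the way an API layer might.
game = GAME_REGISTRY["tictactoe"]()
state = game.initial_state()
for move in (0, 3, 1, 4, 2):  # X takes the top row
    state = game.next_state(state, move)
result = game.winner(state)
```

Because search and UI only talk to the interface, anything that implements these few methods can be dropped in without touching the rest of the code.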
You run two dev servers side by side. Both hot-reload, so you rarely refresh.
1. Backend (FastAPI) — from the repo root:
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
uvicorn server.app:app --reload --port 8000
The vs AlphaZero opponent needs PyTorch. Add it with pip install -e ".[train]" (or ".[dev,train]" for both).
2. Frontend (Vite) — in a second terminal:
cd frontend
npm install
npm run dev
Then open http://localhost:5173. The Vite server proxies /api to the
backend, so there's no CORS setup.
- Frontend edits (anything under frontend/src) are swapped into the page in place by Vite's HMR: your in-progress game stays on the board, no refresh.
- Backend edits (game logic, or a whole new game) auto-restart the server via --reload. The UI refetches the game list when you tab back, so a new game shows up on its own. (A reload clears in-memory games, so a game in progress resets, which is fine for development.)
A trained tic-tac-toe model ships with this download at
checkpoints/tictactoe.pt, so vs AlphaZero plays optimally out of the box.
(Git ignores checkpoints/, so after a fresh clone, recreate it with the
command below.) Reproducing it takes only a few minutes on a CPU — no GPU:
pip install -e ".[train]"
python main.py train # self-play -> train -> repeat; saves the checkpoint
Each iteration prints the loss and a scoreline against a random opponent, so you can watch it climb from random play to unbeatable:
iter 1 | ... | vs random: 18W 1D 1L
iter 6 | ... | vs random: 20W 0D 0L
Evaluate a trained model against a baseline:
python main.py eval --opponent random --games 100 # ~89W 11D 0L (never loses)
python main.py eval --opponent mcts --games 40 # mostly draws (optimal play)
The loop minimises the paper's loss (mean-squared value error + policy
cross-entropy + L2), augments data with the board's 8 symmetries, and uses
Dirichlet root noise during self-play for exploration. See alphazero/trainer.py.
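As a rough illustration of two of those pieces, here is a dependency-free sketch of the per-example loss from the paper, (z - v)^2 - pi · log p + c * ||theta||^2, and of the 8 board symmetries. The function names and the `l2_coeff` default are assumptions. The actual trainer presumably works on PyTorch tensors and averages over a batch, so treat this as a reading aid rather than the code in alphazero/trainer.py.

```python
import math


def alphazero_loss(value_pred, outcome, policy_pred, policy_target,
                   weights=(), l2_coeff=1e-4):
    """Per-example loss: squared value error + policy cross-entropy + L2."""
    value_loss = (outcome - value_pred) ** 2
    # Cross-entropy between the search policy (target) and the network policy;
    # the small constant avoids log(0).
    policy_loss = -sum(t * math.log(p + 1e-12)
                       for t, p in zip(policy_target, policy_pred))
    l2 = l2_coeff * sum(w * w for w in weights)
    return value_loss + policy_loss + l2


def rotate(grid):
    """Rotate a square grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def symmetries(grid):
    """Return the 8 images of a square grid: 4 rotations, each also mirrored."""
    out = []
    for _ in range(4):
        out.append(grid)
        out.append([row[::-1] for row in grid])  # left-right mirror
        grid = rotate(grid)
    return out


loss = alphazero_loss(value_pred=0.5, outcome=1.0,
                      policy_pred=[0.5, 0.5], policy_target=[1.0, 0.0])
board = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
images = symmetries(board)
```

When augmenting a training example, the policy target has to be transformed with the same rotation or mirror as the board so that each probability still points at the right square.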
- Stage 0 — Scaffold & Game interface
- Stage 1 — Tic-tac-toe — game logic + tests
- Web UI — FastAPI + React app to pick and play any registered game
- Stage 2 — Vanilla MCTS — UCT tree search with random rollouts (no network); playable in the UI as the "vs MCTS" opponent
- Stage 3 — Network + PUCT — a two-headed policy/value network and PUCT search (no rollouts); the "vs AlphaZero" opponent
- Stage 4 — Self-play loop — self-play training that learns tic-tac-toe from scratch to optimal play; a trained model is included
- Stage 5 — Connect Four & scaling — reuse everything on a bigger game
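The difference between the Stage 2 and Stage 3 searches mostly comes down to how a child node is scored during selection. A minimal sketch of the two textbook formulas follows: UCT (UCB1 applied to trees) and the PUCT variant from the AlphaZero paper. The function names and exploration constants are assumptions, not values taken from mcts.py or az_mcts.py.

```python
import math


def uct_score(q, child_visits, parent_visits, c=1.4):
    """UCT: mean value plus an exploration bonus that needs no network."""
    if child_visits == 0:
        return math.inf  # try every move at least once
    return q + c * math.sqrt(math.log(parent_visits) / child_visits)


def puct_score(q, prior, child_visits, parent_visits, c_puct=1.5):
    """PUCT: the bonus is scaled by the network's prior for this move."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


# An unvisited move with a strong prior gets a large bonus under PUCT.
fresh = puct_score(q=0.0, prior=0.5, child_visits=0, parent_visits=4)
# The same move after many visits relies mostly on its measured value q.
settled = puct_score(q=0.2, prior=0.5, child_visits=99, parent_visits=100)
```

With UCT the value `q` comes from random rollouts. With PUCT it comes from the network's value head, which is why Stage 3 needs no rollouts.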
pip install -e ".[dev,train]" # [train] adds PyTorch for the network/trainer tests
pytest -q
During self-play, AlphaZero adds a little Dirichlet noise to the move priors
at the root node only, for exploration. Leave it out and self-play can collapse
to repeating the same handful of games. It lives in alphazero/config.py
(dirichlet_alpha, dirichlet_epsilon) and is used during self-play training.
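A small sketch of what that mixing step amounts to, using only the standard library: the noisy prior is (1 - epsilon) * p + epsilon * noise, with the noise drawn from a symmetric Dirichlet. The parameter names mirror `dirichlet_alpha` and `dirichlet_epsilon`, but the default values here (0.3 and 0.25) are assumptions borrowed from the paper's chess settings, not necessarily what alphazero/config.py uses.

```python
import random


def add_dirichlet_noise(priors, alpha=0.3, epsilon=0.25, rng=random):
    """Mix Dirichlet(alpha) noise into root priors; apply at the root only."""
    # A Dirichlet sample is a set of independent Gamma draws, normalised.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    if total == 0.0:
        return list(priors)  # degenerate draw: leave the priors untouched
    noise = [g / total for g in gammas]
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


rng = random.Random(0)  # seeded so the example is repeatable
priors = [0.5, 0.3, 0.2]
noisy = add_dirichlet_noise(priors, rng=rng)
```

The result is still a probability distribution, and every move keeps at least (1 - epsilon) of its original prior, so the noise nudges exploration without overriding the network.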
- Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140–1144 (2018).
- Sutton & Barto, Reinforcement Learning: An Introduction (free online).
- David Silver's UCL RL lecture course (free) — by the paper's lead author.
- suragnair/alpha-zero-general on GitHub — a clean reference to compare against.
MIT — see LICENSE.