AlphaZero from scratch

A small, readable implementation of AlphaZero (Silver et al., Science 2018) built up one stage at a time, plus a polished web UI for actually playing the games. We start with tic-tac-toe because the algorithm is identical across games but tic-tac-toe trains in minutes on a laptop — then we reuse the same code (and the same UI) for harder games.

Why AlphaZero and not the original AlphaGo? The 2016 AlphaGo was bootstrapped from human expert games and trained through a multi-stage pipeline with separate policy and value networks. AlphaZero is far simpler: a single network, no human data, learning purely from self-play. That makes it both easier to build and more elegant.

New to MCTS or reinforcement learning? Read docs/mcts.md — an illustrated, from-scratch explanation of the search algorithm and how AlphaZero builds on it.

The big idea in three sentences

One neural network looks at a board and outputs a policy (which moves look good) and a value (who is likely to win).
A Monte Carlo Tree Search uses that network to look ahead and produce a better policy than the raw network.
The network trains on its own search results from self-play, bootstrapping from random play to superhuman strength with no external data.
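
To make the second sentence concrete, here is a minimal sketch of the PUCT selection rule that AlphaZero-style search uses to decide which move to explore next. It follows the formula from the paper (Q plus a prior-weighted exploration bonus). The names Child, puct_score, select_child and the default c_puct are illustrative assumptions, not the identifiers used in alphazero/az_mcts.py.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    """Statistics kept for one move out of a node (hypothetical layout)."""
    prior: float            # P(s, a): the network's policy probability for this move
    visits: int = 0         # N(s, a): simulations that went through this move
    value_sum: float = 0.0  # W(s, a): sum of values backed up through this move

    @property
    def q(self) -> float:
        # Mean value Q(s, a); an unvisited move defaults to 0.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    """Q + U, where U favours high-prior, rarely visited moves."""
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + u


def select_child(children: dict[int, Child], c_puct: float = 1.5) -> int:
    """Return the move whose child has the highest PUCT score."""
    parent_visits = sum(c.visits for c in children.values())
    return max(children, key=lambda a: puct_score(children[a], parent_visits, c_puct))


# Three candidate moves: one well explored, one promising, one untouched.
children = {
    0: Child(prior=0.6, visits=10, value_sum=2.0),
    1: Child(prior=0.3, visits=1, value_sum=0.5),
    2: Child(prior=0.1),
}
best = select_child(children)
```

Because the bonus shrinks as a move accumulates visits, the search keeps returning to moves the network rates highly until their measured value says otherwise. The visit counts at the root after many simulations become the improved policy the network is trained on.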

Project structure

alphazero/
├── alphazero/              # the core library
│   ├── games/
│   │   ├── base.py         # the abstract Game interface (the keystone)
│   │   ├── registry.py     # game auto-discovery for the API/UI
│   │   └── tictactoe.py    # first concrete game             [Stage 1 ✓]
│   ├── agents.py           # opponents (Random, MCTS, AlphaZero)
│   ├── mcts.py             # vanilla MCTS, random rollouts    [Stage 2 ✓]
│   ├── az_mcts.py          # PUCT, network-guided search      [Stage 3 ✓]
│   ├── network.py          # two-headed policy + value net    [Stage 3 ✓]
│   ├── self_play.py        # generate games from self-play    [Stage 4 ✓]
│   └── trainer.py          # the train -> play -> repeat loop  [Stage 4 ✓]
├── server/                 # FastAPI backend (imports the registry)
│   ├── app.py              # endpoints: list games, new game, move
│   ├── game_manager.py     # in-memory game sessions
│   └── schemas.py          # API models (mirror frontend/src/types.ts)
├── frontend/               # Vite + React + TypeScript + Tailwind UI
│   └── src/
│       ├── App.tsx
│       ├── api.ts          # typed API client
│       └── components/     # GamePicker, GameView, Board (generic renderer)
├── docs/                   # illustrated explainers (start with mcts.md)
├── tests/                  # game + API tests
├── main.py                 # CLI entry point
└── pyproject.toml

The design rule that makes everything reusable: all game logic lives in Python behind Game in games/base.py, and the UI renders whatever the API describes. To add a game you implement that interface, decorate it with @register_game, and it appears in the picker automatically — no frontend changes for grid-based games.
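
As a rough illustration of that rule, the sketch below shows one way an abstract interface and a registering decorator can fit together. Only the names Game and register_game come from this README. The method names (legal_moves, play, winner), the GAME_REGISTRY dict, and the toy Nim game are assumptions made for the example; the real interface in games/base.py may look quite different.

```python
from __future__ import annotations

from abc import ABC, abstractmethod

# Hypothetical registry: maps a game's name to its class.
GAME_REGISTRY: dict[str, type[Game]] = {}


def register_game(cls: type[Game]) -> type[Game]:
    """Class decorator that records a Game subclass under its `name`."""
    GAME_REGISTRY[cls.name] = cls
    return cls


class Game(ABC):
    """Assumed shape of the interface, for illustration only."""
    name: str

    @abstractmethod
    def legal_moves(self) -> list[int]: ...

    @abstractmethod
    def play(self, move: int) -> None: ...

    @abstractmethod
    def winner(self) -> int | None: ...


@register_game
class Nim(Game):
    """Toy game: take 1 or 2 stones per turn; taking the last stone wins."""
    name = "nim"

    def __init__(self, stones: int = 5) -> None:
        self.stones = stones
        self.player = 1        # player 1 moves first, then player -1
        self._winner = None    # set once the last stone is taken

    def legal_moves(self) -> list[int]:
        return [n for n in (1, 2) if n <= self.stones]

    def play(self, move: int) -> None:
        if move not in self.legal_moves():
            raise ValueError(f"illegal move: {move}")
        self.stones -= move
        if self.stones == 0:
            self._winner = self.player
        self.player = -self.player

    def winner(self) -> int | None:
        return self._winner


# Callers look games up by name and never import the concrete class.
game = GAME_REGISTRY["nim"](stones=3)
game.play(2)  # player 1 takes two stones
game.play(1)  # player -1 takes the last stone and wins
```

The point of the pattern is that a server or search routine only needs the registry and the abstract methods, so a new game becomes available by defining and decorating one class.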

Running the web UI

You run two dev servers side by side. Both hot-reload, so you rarely refresh.

1. Backend (FastAPI) — from the repo root:

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
uvicorn server.app:app --reload --port 8000

The "vs AlphaZero" opponent needs PyTorch. Add it with pip install -e ".[train]" (or ".[dev,train]" for both).

2. Frontend (Vite) — in a second terminal:

cd frontend
npm install
npm run dev

Then open http://localhost:5173. The Vite server proxies /api to the backend, so there's no CORS setup.

The hot-reload loop

Frontend edits (anything under frontend/src) are swapped into the page in-place by Vite's HMR — your in-progress game stays on the board, no refresh.
Backend edits (game logic, or a whole new game) auto-restart the server via --reload. The UI refetches the game list when you tab back, so a new game shows up on its own. (A reload clears in-memory games, so a game in progress resets — fine for development.)

Training the agent

A trained tic-tac-toe model ships with this download at checkpoints/tictactoe.pt, so the "vs AlphaZero" opponent plays optimally out of the box. (Git ignores checkpoints/, so after a fresh clone the file is absent; recreate it with the commands below.) Reproducing it takes only a few minutes on a CPU, with no GPU required:

pip install -e ".[train]"
python main.py train          # self-play -> train -> repeat; saves the checkpoint

Each iteration prints the loss and a scoreline against a random opponent, so you can watch it climb from random play to unbeatable:

iter  1 | ... | vs random: 18W 1D 1L
iter  6 | ... | vs random: 20W 0D 0L

Evaluate a trained model against a baseline:

python main.py eval --opponent random --games 100   # ~89W 11D 0L  (never loses)
python main.py eval --opponent mcts   --games 40     # mostly draws (optimal play)

The loop minimises the paper's loss (mean-squared value error + policy cross-entropy + L2), augments data with the board's 8 symmetries, and uses Dirichlet root noise during self-play for exploration. See alphazero/trainer.py.
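
For reference, the paper writes that loss for a single training position as l = (z - v)^2 - pi . log p + c * ||theta||^2. The sketch below computes the three terms in plain Python so each one is visible. It is not the code in alphazero/trainer.py: the function name, the argument layout, and the default L2 coefficient are assumptions for illustration.

```python
import math


def alphazero_loss(
    z: float,              # game outcome from the current player's view, in [-1, 1]
    v: float,              # value head prediction for the same position
    pi: list[float],       # search policy (normalised MCTS visit counts)
    p: list[float],        # policy head output as probabilities
    weights: list[float],  # flattened network parameters
    c: float = 1e-4,       # L2 coefficient (assumed default)
) -> float:
    """Per-position loss: value MSE + policy cross-entropy + L2 penalty."""
    value_loss = (z - v) ** 2
    # Cross-entropy between search policy and network policy.
    # Moves with zero target are skipped so log(0) is never evaluated for them.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty


# Worked example: the network is unsure (v = 0.5) about a won game (z = 1)
# and its policy already matches the search policy.
loss = alphazero_loss(
    z=1.0, v=0.5,
    pi=[0.5, 0.5], p=[0.5, 0.5],
    weights=[1.0, 2.0], c=0.1,
)
# value term 0.25, policy term ln(2), L2 term 0.1 * (1 + 4) = 0.5
```

A PyTorch trainer would typically compute the same terms on batched tensors, and the L2 term is often supplied through the optimizer's weight decay rather than added by hand.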

Roadmap

Stage 0 — Scaffold & Game interface
Stage 1 — Tic-tac-toe — game logic + tests
Web UI — FastAPI + React app to pick and play any registered game
Stage 2 — Vanilla MCTS — UCT tree search with random rollouts (no network); playable in the UI as the "vs MCTS" opponent
Stage 3 — Network + PUCT — a two-headed policy/value network and PUCT search (no rollouts); the "vs AlphaZero" opponent
Stage 4 — Self-play loop — self-play training that learns tic-tac-toe from scratch to optimal play; a trained model is included
Stage 5 — Connect Four & scaling — reuse everything on a bigger game

Tests

pip install -e ".[dev,train]"   # [train] adds PyTorch for the network/trainer tests
pytest -q

One gotcha to remember early

During self-play, AlphaZero adds a little Dirichlet noise to the move priors at the root node only, for exploration. Leave it out and self-play can collapse to repeating the same handful of games. The two parameters, dirichlet_alpha and dirichlet_epsilon, live in alphazero/config.py and are applied during self-play training.
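
The mixing step itself is small. In the paper, the root prior for each move becomes (1 - epsilon) * p + epsilon * eta, with eta drawn from a Dirichlet(alpha) distribution. Below is a minimal standard-library sketch of that step. The function name and signature are assumptions, and the defaults (alpha = 0.3, epsilon = 0.25) are the values the paper reports for chess, not necessarily what alphazero/config.py uses.

```python
from __future__ import annotations

import random


def add_dirichlet_noise(
    priors: list[float],
    alpha: float = 0.3,      # concentration; smaller values give spikier noise
    epsilon: float = 0.25,   # fraction of the prior replaced by noise
    rng: random.Random | None = None,
) -> list[float]:
    """Mix Dirichlet(alpha) noise into root priors: (1 - eps) * p + eps * eta."""
    rng = rng or random.Random()
    # A Dirichlet sample is a vector of Gamma(alpha, 1) draws normalised to sum to 1.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


# Apply once per self-play move, to the root node's priors over legal moves only.
priors = [0.7, 0.2, 0.1]
noisy = add_dirichlet_noise(priors, rng=random.Random(0))
```

The result is still a probability distribution, and every move keeps at least (1 - epsilon) of its original prior, so the network's opinion dominates while unlikely moves occasionally get a real chance to be searched.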

References

Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140–1144 (2018).
Sutton & Barto, Reinforcement Learning: An Introduction (free online).
David Silver's UCL RL lecture course (free) — by the paper's lead author.
suragnair/alpha-zero-general on GitHub — a clean reference to compare against.

License

MIT — see LICENSE.
