Chess Engine with Learned Evaluation

A chess engine in Python that searches with alpha-beta minimax and supports two interchangeable evaluation functions: a hand-crafted classical evaluator and a neural network trained on Stockfish-labelled positions. Built to compare learned vs hand-crafted evaluation at equal search depth.

Results

Metric	Value
NN evaluation MAE (offline)	81 cp (down from 290)
NN evaluation MSE (offline)	17,932 (down from 84,306)
Training data	~200,000 positions from 5,000 games
Features per position	29
NN vs Classical (10 games at 1s/move)	0-10
Search depth reached at 1s/move	Classical: ~4, NN: ~2-3

The neural network achieved good offline accuracy, predicting Stockfish's depth-6 evaluation with a mean absolute error of ~81 centipawns on held-out positions, but lost all 10 games in head-to-head tournament play.
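The two offline metrics in the table can be put on the same scale by taking the square root of the MSE. The snippet below is only a sketch of that arithmetic; the helper names `mae` and `mse` are illustrative and not taken from the project code.

```python
import math

def mae(predictions, targets):
    """Mean absolute error, in the same units as the inputs (centipawns)."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def mse(predictions, targets):
    """Mean squared error, in squared centipawns."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Values reported in the results table.
reported_mae = 81
reported_mse = 17_932

rmse = math.sqrt(reported_mse)   # about 133.9 cp
ratio = rmse / reported_mae      # about 1.65

# For zero-mean normally distributed errors the ratio would be sqrt(pi / 2),
# about 1.25, so a larger ratio hints at a minority of large misses.
gaussian_ratio = math.sqrt(math.pi / 2)
```

An RMSE well above the MAE suggests, but does not prove, that a small share of positions (plausibly sharp tactical ones) account for much of the squared error.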

The cause is the speed-accuracy tradeoff: TensorFlow inference takes ~5-10ms per call, while the classical evaluator runs in microseconds. At the same time budget, the classical engine searches roughly one to two plies deeper (about depth 4 versus 2-3 at 1s/move), and that extra depth outweighs the NN's per-position accuracy advantage. This mirrors the broader history of chess engines: classical evaluation was state-of-the-art for decades, and modern neural approaches (NNUE) succeeded only by designing tiny, fast networks specifically for inference speed within alpha-beta search.
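A back-of-envelope model shows why per-node cost translates into lost depth. It assumes the tree grows by a constant effective branching factor per ply and that iterative deepening only counts a depth once it completes. The branching factor of 6 and the 0.5 ms per-node cost for the classical engine are hypothetical figures chosen for illustration (not measurements from this project); 7.5 ms is the midpoint of the NN inference range quoted above.

```python
def reachable_depth(budget_s, cost_per_node_s, branching):
    """Deepest fully completed iteration within the time budget.

    Assumes iteration d visits roughly branching**d nodes and that the
    time spent on earlier iterations also counts against the budget.
    """
    spent = 0.0
    depth = 0
    while True:
        next_cost = (branching ** (depth + 1)) * cost_per_node_s
        if spent + next_cost > budget_s:
            return depth
        spent += next_cost
        depth += 1

# Hypothetical per-node costs (seconds); only the 7.5 ms figure relates
# to a number quoted in this README.
nn_depth = reachable_depth(budget_s=1.0, cost_per_node_s=0.0075, branching=6)
classical_depth = reachable_depth(budget_s=1.0, cost_per_node_s=0.0005, branching=6)
print(nn_depth, classical_depth)  # prints "2 4" under these assumptions
```

Because node counts grow exponentially with depth, a 15x difference in per-node cost buys only about log(15)/log(6), or roughly 1.5 plies, under this model. Real searches with quiescence and pruning will not follow it exactly.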

How it works

Search. Alpha-beta minimax with several standard refinements:

Quiescence search at leaf nodes to avoid the horizon effect
Move ordering by MVV-LVA (most valuable victim, least valuable attacker) to maximise pruning
Transposition table with Zobrist hashing to avoid re-searching repeated positions
Iterative deepening with a time budget instead of fixed depth
Exception-safe board mutation (try/finally around push/pop) to handle search timeouts safely
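Several of these refinements can be sketched on a much smaller game. The example below uses tic-tac-toe as a stand-in for chess so that it runs with only the standard library; the `Board` class, the `negamax` function, and the table layout are illustrative assumptions, not the engine's actual code. It shows alpha-beta in negamax form, a transposition table keyed by a Zobrist hash, and try/finally around push/pop. Quiescence search, MVV-LVA ordering, and iterative deepening are left out.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
EXACT, LOWER, UPPER = 0, 1, 2

# Zobrist keys: one random 64-bit number per (square, player), plus one
# toggled on every move to encode the side to move.
_rng = random.Random(0)
ZOBRIST = [[_rng.getrandbits(64) for _ in range(2)] for _ in range(9)]
ZOBRIST_SIDE = _rng.getrandbits(64)

class Board:
    def __init__(self):
        self.cells = [None] * 9
        self.side = 0
        self.key = 0          # incrementally updated Zobrist hash
        self.history = []

    def push(self, square):
        self.cells[square] = self.side
        self.key ^= ZOBRIST[square][self.side] ^ ZOBRIST_SIDE
        self.history.append(square)
        self.side ^= 1

    def pop(self):
        square = self.history.pop()
        self.side ^= 1
        self.key ^= ZOBRIST[square][self.side] ^ ZOBRIST_SIDE
        self.cells[square] = None

    def legal_moves(self):
        return [i for i, cell in enumerate(self.cells) if cell is None]

    def winner(self):
        for a, b, c in WIN_LINES:
            if self.cells[a] is not None and self.cells[a] == self.cells[b] == self.cells[c]:
                return self.cells[a]
        return None

def negamax(board, alpha, beta, table):
    """Score from the side to move's perspective: -1 loss, 0 draw, +1 win."""
    if board.winner() is not None:
        return -1             # the previous mover completed a line
    moves = board.legal_moves()
    if not moves:
        return 0
    entry = table.get(board.key)
    if entry is not None:
        flag, value = entry
        if flag == EXACT:
            return value
        if flag == LOWER:
            alpha = max(alpha, value)
        else:
            beta = min(beta, value)
        if alpha >= beta:
            return value
    alpha_before = alpha
    best = -2
    for move in moves:
        board.push(move)
        try:
            score = -negamax(board, -beta, -alpha, table)
        finally:
            board.pop()       # board is restored even if the search raises
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break             # beta cutoff: the opponent will avoid this line
    if best <= alpha_before:
        flag = UPPER
    elif best >= beta:
        flag = LOWER
    else:
        flag = EXACT
    table[board.key] = (flag, best)
    return best

board = Board()
result = negamax(board, -2, 2, {})
print(result)  # 0: tic-tac-toe is a draw with best play
```

Storing a bound flag alongside each value matters: a score found inside a narrowed alpha-beta window is only an upper or lower bound, and reusing it as exact would corrupt later searches.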

Classical evaluation. Piece values plus piece-square tables, in centipawns. The standard hand-crafted approach used by chess engines for decades before NNUE.
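As a minimal sketch of this kind of evaluator, the function below scores the piece-placement field of a FEN string from White's perspective. The piece values are commonly used figures and the `pst_bonus` formula is a toy substitute for real piece-square tables; both are assumptions for illustration and not the values used in this project.

```python
# Commonly quoted centipawn values (assumed, not this project's numbers).
PIECE_VALUE = {"p": 100, "n": 320, "b": 330, "r": 500, "q": 900, "k": 0}

def pst_bonus(piece, file, rank_from_own_side):
    """Toy piece-square bonus: pawns like to advance, minor pieces like the centre."""
    if piece == "p":
        return 5 * (rank_from_own_side - 1)   # 0 on the starting rank
    if piece in "nb":
        ring = min(file, 7 - file, rank_from_own_side, 7 - rank_from_own_side)
        return 10 * ring - 10                 # -10 on the edge, +20 in the centre
    return 0

def evaluate(placement):
    """Material plus piece-square score in centipawns, positive when White is better."""
    score = 0
    for row_index, row in enumerate(placement.split("/")):
        rank = 7 - row_index                  # FEN lists rank 8 first
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)               # run of empty squares
                continue
            piece = ch.lower()
            is_white = ch.isupper()
            # Mirror the rank for Black so one table serves both sides.
            own_rank = rank if is_white else 7 - rank
            value = PIECE_VALUE[piece] + pst_bonus(piece, file, own_rank)
            score += value if is_white else -value
            file += 1
    return score

START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
AFTER_E4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR"
print(evaluate(START), evaluate(AFTER_E4))  # prints "0 10"
```

The evaluator is a handful of additions and table lookups per piece, which is why it costs microseconds rather than milliseconds.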

Neural network evaluation. A feed-forward network (128 → 64 → 32 → 1) trained to predict Stockfish's depth-6 evaluation from White's perspective. Inputs are 29 normalised features:

Per-piece-type material counts for each side
Piece-square table sums per piece type
Mobility for both sides
King safety (castling rights, in check)
Centre control, side to move
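To give a sense of the network's size, here is a standard-library sketch of the forward pass. It assumes the layout means 29 inputs feeding dense layers of 128, 64, 32, and 1 units, with ReLU on the hidden layers and z-score normalisation of the inputs; the activation, the initialisation, and all function names are assumptions, and the project itself uses TensorFlow rather than hand-written loops.

```python
import random

LAYER_SIZES = [29, 128, 64, 32, 1]  # inputs, three hidden layers, one output

def count_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_layers(sizes, seed=0):
    """Random weights and zero biases (illustrative initialisation only)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0, n_in ** -0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def normalise(features, mean, std):
    """Z-score each feature; constant features (std of 0) map to 0."""
    return [(x - m) / s if s > 0 else 0.0
            for x, m, s in zip(features, mean, std)]

def forward(layers, inputs):
    """Evaluate the network on one feature vector and return a single float."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        outputs = [sum(w * a for w, a in zip(row, activations)) + b
                   for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            outputs = [max(0.0, v) for v in outputs]  # ReLU (assumed)
        activations = outputs
    return activations[0]

layers = init_layers(LAYER_SIZES)
print(count_parameters(LAYER_SIZES))  # 14209
```

With roughly 14,000 parameters the arithmetic per position is small, so the cost of each evaluation is likely dominated by framework call overhead rather than by the matrix multiplications themselves.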

Project history

Started as a first deep learning project, motivated by curiosity about how game-playing AIs work (and wanting to beat my brother at chess). The initial version was unusable — moves took minutes due to inefficient code structure (a 13MB CSV loading on every import, model.predict() called hundreds of times per search with massive Python overhead) and the NN was undertrained (100 games, 5 features, MAE of 290 centipawns).

Returned to the project and rebuilt it properly:

Profiled and fixed the bottlenecks → moves now take seconds, not minutes
Added quiescence search, move ordering, transposition tables → engine plays competently
Retrained NN with 50× more data and richer features → MAE dropped from 290 to 81 cp
Built a tournament harness to compare evaluators rigorously
Discovered the speed-accuracy tradeoff through head-to-head play

The most interesting debugging finding came from the tournament: an early implementation showed the NN tying classical 5-5, which turned out to be an artifact of how the eval function was being swapped between turns (monkey-patching that didn't actually propagate). After the comparison harness was fixed, the true result emerged, and it was both more decisive and more informative than the original spurious one.
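One common way a monkey-patch fails to propagate in Python is early binding: a default argument (or a `from module import name`) captures the original function object, so rebinding the name later has no effect on code that already holds the old reference. The sketch below illustrates that mechanism with hypothetical function names; it is an assumption about the kind of bug involved, not a reconstruction of the actual harness.

```python
def classical_eval(position):
    return "classical"

def nn_eval(position):
    return "nn"

evaluate = classical_eval  # module-level name the harness later tries to swap

def search_default(position, eval_fn=evaluate):
    # The default is evaluated once, when the function is defined,
    # so it keeps pointing at classical_eval.
    return eval_fn(position)

def search_global(position):
    # The name is looked up on every call, so it sees the swap.
    return evaluate(position)

evaluate = nn_eval  # the intended "swap" before the NN side moves

stale = search_default("e4")    # still uses the classical evaluator
patched = search_global("e4")   # uses the NN evaluator

def search_explicit(position, eval_fn):
    """More robust: the caller passes the evaluator on every call."""
    return eval_fn(position)

explicit = search_explicit("e4", nn_eval)
print(stale, patched, explicit)  # prints "classical nn nn"
```

If both sides silently use the same evaluator, the match is effectively an engine playing itself, and an even score is the expected outcome. Passing the evaluator explicitly removes the hidden state that makes this kind of bug possible.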

Usage

# Train the neural network (slow; uses Stockfish to label positions)
python game.py

# Play against the AI (uses classical evaluation by default)
python ai.py

# Run the NN vs Classical tournament
python compare_engines.py

Requires: python-chess, tensorflow, scikit-learn, pandas, numpy, and a Stockfish binary at stockfish/stockfish.exe.

Possible future work

Implement an NNUE-style architecture (small, fast networks designed for inference speed within search)
Scale training data 10× and add tactical positions to teach the NN about non-quiet positions
Deploy to Lichess for a public Elo rating
Add killer-move heuristic and null-move pruning for deeper search

