A 6-max no-limit Texas Hold'em bot built on Deep CFR. It finished 15th of 500+ entrants in the Fullhouse Hackathon 2026 (Quadrature Capital), inside a strict sandbox: 2 seconds per decision, 768 MB RAM, no network, no threads, read-only filesystem.
The architecture is Deep CFR. The part worth reading is a call I made against my own benchmark: I shipped v17, which scored +22.41 bb/100, not v23, which scored +23.73. v23's higher number came from a wider 56-feature encoding that the 51-feature runtime can't serve without a full retrain, and its 95% interval overlaps v17's. v17 was also the line that held up against opponents it never trained against (+19.89 bb/100 held-out). On a one-shot submission with no second try against the real field, the servable, general line beats the flashy local number.
- Training (offline): external-sampling MCCFR drives a dueling advantage network; traversal and hand evaluation are Numba-JIT compiled. No poker heuristics go into the learned policy. The net reads a 51-feature state and scores 5 abstract actions (FOLD / CHECK_CALL / BET_50 / BET_POT / ALL_IN).
- Runtime (in the sandbox): pure-Python NumPy inference over the weights in data/deep_cfr_model.npz, with no training-only state at decision time. The one hand-coded exception is short-stack preflop (≤12 bb effective), which uses a push/fold chart; every other decision on every street is the net.
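
The runtime path described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the shipped code: the layer layout, the weight keys (`w1`, `wv`, `wa`, ...), the hidden width, the helper names, and the effective-stack definition are all assumptions, and random weights stand in for data/deep_cfr_model.npz. It also assumes the policy is obtained by regret matching over the advantage outputs, which the text does not state.

```python
import numpy as np

N_FEATURES = 51  # state width from the runtime contract
ACTIONS = ("FOLD", "CHECK_CALL", "BET_50", "BET_POT", "ALL_IN")
PUSH_FOLD_MAX_BB = 12  # short-stack preflop threshold from the text


def init_weights(rng, hidden=64):
    """Random stand-in weights (hypothetical layout, one hidden layer)."""
    n = len(ACTIONS)
    return {
        "w1": rng.normal(0.0, 0.1, (N_FEATURES, hidden)), "b1": np.zeros(hidden),
        "wv": rng.normal(0.0, 0.1, (hidden, 1)), "bv": np.zeros(1),
        "wa": rng.normal(0.0, 0.1, (hidden, n)), "ba": np.zeros(n),
    }


def advantages(weights, features):
    """Dueling forward pass: shared trunk, value stream, advantage stream."""
    h = np.maximum(0.0, features @ weights["w1"] + weights["b1"])  # ReLU
    value = h @ weights["wv"] + weights["bv"]  # shape (1,)
    adv = h @ weights["wa"] + weights["ba"]    # shape (5,)
    return value + adv - adv.mean()            # standard dueling combine


def policy(weights, features, legal_mask):
    """Regret matching restricted to legal actions."""
    adv = advantages(weights, features)
    positive = np.where(legal_mask, np.maximum(adv, 0.0), 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive advantage: play the best legal action with probability 1
    # (the fallback used in the Deep CFR paper; assumed here).
    out = np.zeros(len(ACTIONS))
    out[np.argmax(np.where(legal_mask, adv, -np.inf))] = 1.0
    return out


def use_push_fold(street, hero_stack, live_opp_stacks, big_blind):
    """True when the hand-coded chart applies instead of the net.

    Effective stack is assumed to be min(hero, largest live opponent).
    """
    effective_bb = min(hero_stack, max(live_opp_stacks)) / big_blind
    return street == "preflop" and effective_bb <= PUSH_FOLD_MAX_BB
```

Everything in this sketch is plain array arithmetic, which is the property the sandbox needs: no PyTorch import, no threads, and nothing written to disk at decision time.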
The constraint behind every result: training and serving share one feature encoder, one action menu, one set of weights. Break that contract and a checkpoint can't be served, which is why the strongest experiments never shipped (runtime-contract.md).
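
A contract like this can be checked mechanically before a checkpoint is packaged. The sketch below is one way to do it, assuming the first layer's weight matrix has shape `(features, hidden)` and the last layer's has shape `(hidden, actions)`; the function name and that layout are hypothetical, not taken from runtime-contract.md.

```python
import numpy as np

N_FEATURES, N_ACTIONS = 51, 5  # the shipped runtime contract


def contract_violations(first_layer, last_layer):
    """Return human-readable mismatches; an empty list means servable."""
    problems = []
    if first_layer.shape[0] != N_FEATURES:
        problems.append(
            f"input width {first_layer.shape[0]} != {N_FEATURES} features"
        )
    if last_layer.shape[-1] != N_ACTIONS:
        problems.append(
            f"output width {last_layer.shape[-1]} != {N_ACTIONS} actions"
        )
    return problems


# A 56-dim encoding (the v23 shape) fails on the input side only.
print(contract_violations(np.zeros((56, 64)), np.zeros((64, 5))))
```

Running a check of this kind at packaging time turns a shape mismatch into an explicit rejection instead of a runtime failure inside the sandbox.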
Poker has no loss curve you can trust. Training error doesn't tell you whether a change made the bot play better, so every experiment was scored the same way: change one variable, play it against a fixed pool of opponent bots, read bb/100 with confidence intervals. Promotion required clearing a gate, not just posting a number:
- a CI-clear gain on the standard pool,
- non-negative on a held-out pool the model never trained against,
- no losing opponent segment,
- an exact match to the 51-feature / 5-action runtime contract.
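
The four gate conditions compose into a single predicate. The following is a sketch under assumptions: `EvalResult`, `gate_failures`, and the segment names are hypothetical, and "CI-clear" is read here as "the lower 95% bound exceeds a reference value chosen by the caller", which may differ from the harness's actual definition.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    ci_low: float               # lower 95% bound on the standard pool, bb/100
    held_out: float             # mean bb/100 on the held-out pool
    segments: dict[str, float]  # mean bb/100 per opponent segment
    n_features: int
    n_actions: int


def gate_failures(result, reference=0.0, contract=(51, 5)):
    """List every gate condition the candidate fails; empty means promote."""
    failures = []
    if result.ci_low <= reference:
        failures.append("standard-pool gain is not CI-clear")
    if result.held_out < 0.0:
        failures.append("negative on the held-out pool")
    losing = sorted(name for name, bb in result.segments.items() if bb < 0.0)
    if losing:
        failures.append("losing segments: " + ", ".join(losing))
    if (result.n_features, result.n_actions) != contract:
        failures.append("runtime contract mismatch")
    return failures


# ci_low and held_out are v17's reported values; the segments are invented.
v17_like = EvalResult(19.61, 19.89, {"tight": 5.0, "loose": 30.0}, 51, 5)
print(gate_failures(v17_like))
```

Returning the list of failures rather than a bare boolean keeps the reason for a rejected promotion visible in the ledger.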
| Line | One change | Best bb/100 (95% CI) | Outcome |
|---|---|---|---|
| v17 | 51-feature / 5-action baseline | +22.41 [+19.61, +25.21] | Shipped. Held out at +19.89. |
| v21 | multiway-equity feature (52-dim) | +22.66 [+19.88, +25.45] | Parked. Runtime emits 51 features. |
| v23 | side-pot representation (56-dim) | +23.73 [+20.88, +26.57] | Parked. Strongest line, same mismatch. |
| v18 | strength-percentile feature | +12.39 [+9.73, +15.04] | Refuted. No gain over v17. |
| v22 | 8-action menu | +10.48 [+7.78, +13.17] | Refuted. More heads, same traversal budget. |
| v24 | side-pot + 8-action | +12.49 [+9.84, +15.14] | Refuted. Negative by the final checkpoint. |
These are local-harness numbers against the opponent pool in bots/, with 95%
normal-approximation CIs (±1.96 × stderr). They are not leaderboard results. The only external number
is the 15th-place finish. The full per-version ledger is in experiments.md.
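
For reference, a normal-approximation interval of this form can be computed as below. This is a sketch that assumes the input is a list of per-hand results in big blinds and treats hands as independent; the function name is hypothetical and the harness may aggregate differently.

```python
import math


def bb_per_100_ci(per_hand_bb, z=1.96):
    """Return (mean, low, high) in bb/100 from per-hand results in bb."""
    n = len(per_hand_bb)
    mean = sum(per_hand_bb) / n
    # Sample variance with Bessel's correction.
    variance = sum((x - mean) ** 2 for x in per_hand_bb) / (n - 1)
    stderr = math.sqrt(variance / n)
    return 100 * mean, 100 * (mean - z * stderr), 100 * (mean + z * stderr)
```

Read backwards, the table gives the standard error directly: v17's interval has a half-width of 2.80 bb/100, which corresponds to a standard error of about 1.43 bb/100.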
- Representation changes were the only things that beat baseline (v21, v23), and only when the runtime feature shape stayed byte-aligned.
- Growing the action space without growing the traversal budget just starves the new policy heads (v22, v24).
- The evaluator, not the model, was where most of the engineering went. A bot that beats a narrow pool is overfit to it, and the held-out gate is what caught that.
- The most expensive mistake was changing several structural things in one branch. Attribution collapses and the experiment teaches nothing.
- 2 s per decision, 768 MB RAM, 30 s import warmup
- no network, subprocess, pickle, threads, processes, or writable filesystem
- Python 3.10, dependencies limited to eval7, numpy, scipy, treys, scikit-learn
- bot/: the submitted runtime (feature encoder and NumPy inference)
- training/: Deep CFR (MCCFR traversal, replay buffers, networks, Numba game logic)
- tooling/: benchmark harness, leak diagnosis, and the cluster/GPU infra that ran it at scale
- bots/: the opponent pool every number was measured against
- engine_vendored/: frozen snapshot of the official Fullhouse engine, for offline testing
- docs/research/: the experiment ledger and the runtime contract
main is the shipped runtime. Every major experiment line is preserved as an archive/* tag
(archive/v18 through archive/v31-6act-6max) so the parked and refuted branches stay inspectable.

    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    python scripts/bench.py smoke                    # fast sanity pass
    python scripts/bench.py bench                    # full 6-max benchmark with CIs
    python scripts/bench.py diagnose adv_random_bot  # per-opponent leak diagnosis
    pytest -q                                        # CI runs ruff + pytest on every push

Retrain or repackage:

    python -m training.train --iterations 250 --traversals 30000 --device auto
    python scripts/build_submission.py               # package and validate the submission zip

PyTorch is intentionally unpinned in requirements.txt; the right wheel is platform-dependent. Install
a build separately to run training or the torch-backed tests.

The engine, sandbox runner, and validator under engine_vendored/ are a frozen snapshot of the
official Fullhouse Hackathon engine (Quadrature Capital),
kept only so the bot is testable offline. Everything else is my own work.

