Adam Ma & Ben Marler
A research project exploring two approaches to exploiting suboptimal opponents in Limit Texas Hold'em: a specialist Double DQN agent and a generalist Actor-Critic agent with opponent modeling. Built on the RLCard environment.
Modern poker solvers use Game Theory Optimal (GTO) strategies — mathematically unexploitable, but they leave money on the table against flawed human players. This project investigates whether RL agents can dynamically exploit opponent tendencies to do better.
Two approaches were compared:
- The Specialist (Double DQN): One agent trained per opponent archetype. Learns to maximally exploit a specific, fixed strategy.
- The Generalist (Actor-Critic with Memory): A single agent trained against a mix of all opponent types, using an LSTM to identify and adapt to opponents mid-session.
Four rule-based opponents were implemented as training targets:
| Archetype | Behavior |
|---|---|
| Calling Station | Always calls, never folds |
| Maniac | Always raises if possible |
| Old Man Coffee (OMC) | Only plays premium hands (AA, KK, QQ) |
| Polarizing | Raises strong/nut hands, checks marginal hands, folds the rest |
The classic 2x2 tight/loose × passive/aggressive breakdown used to categorize the four fixed opponents.
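As a rough illustration of how simple the first two archetypes in the table are, the sketch below writes them as functions over the list of legal actions. The action strings are assumed to match RLCard's Limit Hold'em action names (`call`, `raise`, `fold`, `check`); the function names and the fallback order when the preferred action is unavailable are hypothetical, not the code in `players/`.

```python
from typing import Sequence


def calling_station(legal_actions: Sequence[str]) -> str:
    """Never folds: call when facing a bet, otherwise check."""
    for action in ("call", "check"):
        if action in legal_actions:
            return action
    # Assumption: if neither is offered, take any non-fold action.
    non_fold = [a for a in legal_actions if a != "fold"]
    return non_fold[0] if non_fold else legal_actions[0]


def maniac(legal_actions: Sequence[str]) -> str:
    """Raises whenever a raise is legal; otherwise stays in the hand."""
    if "raise" in legal_actions:
        return "raise"
    # Assumption: once the raise cap is hit, behave like a calling station.
    return calling_station(legal_actions)


print(calling_station(["call", "raise", "fold"]))
print(maniac(["call", "raise", "fold"]))
```

Neither policy looks at its cards, which is what makes them fixed, exploitable targets.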
The state vector fed to all agents is a 90-dimensional binary encoding:
s = [c_cards, c_raises, h_rank, d_draw, c_legal, p] ∈ {0,1}^90
52 cards + 20 raise counts + 10 hand ranks + 3 draw flags + 4 legal actions + 1 position
Hand rank and draw flags were explicitly added so agents don't have to learn poker hand semantics from scratch — they can focus purely on strategy.
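A minimal sketch of how the six segments could be assembled into one vector, assuming each segment is computed elsewhere and passed in as a 0/1 array. The segment names, the `build_state` helper, and the example indices are illustrative; only the segment sizes come from the breakdown above.

```python
import numpy as np

# Segment sizes from the breakdown above; the names are illustrative.
SEGMENTS = {
    "cards": 52,
    "raises": 20,
    "hand_rank": 10,
    "draws": 3,
    "legal": 4,
    "position": 1,
}


def build_state(**parts) -> np.ndarray:
    """Concatenate the binary segments, in order, into one 90-dim vector."""
    chunks = []
    for name, size in SEGMENTS.items():
        chunk = np.asarray(parts[name], dtype=np.float32).reshape(-1)
        if chunk.shape[0] != size:
            raise ValueError(f"{name}: expected {size} entries, got {chunk.shape[0]}")
        if not np.isin(chunk, (0.0, 1.0)).all():
            raise ValueError(f"{name}: entries must be 0 or 1")
        chunks.append(chunk)
    return np.concatenate(chunks)


def one_hot(index: int, size: int) -> np.ndarray:
    """Return a length-`size` vector with a single 1 at `index`."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v


cards = np.zeros(52, dtype=np.float32)
cards[[0, 13]] = 1.0  # two visible cards at hypothetical indices

state = build_state(
    cards=cards,
    raises=np.zeros(20),
    hand_rank=one_hot(1, 10),  # hypothetical hand-rank bucket
    draws=[0, 0, 0],
    legal=[1, 1, 1, 0],
    position=[1],
)
print(state.shape)  # (90,)
```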
A Dueling Double DQN with LSTM trained against a single opponent type. Key design choices:
- Double DQN prevents overestimation of Q-values
- Dueling architecture separates state value V(s) from action advantage A(s,a)
- LSTM captures within-hand betting history
- Replay buffer breaks temporal correlations for stable training
- Huber loss handles noisy poker rewards gracefully
Four specialist models were trained: dqn-calling, dqn-maniac, dqn-omc, dqn-polar.
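The Double DQN and Huber loss bullets above can be sketched in a few lines of NumPy. This is a generic illustration of the two techniques, not the training code in `train/`: the function names, the `gamma` value, and the batch layout (one row per transition, one column per action) are assumptions, and legal-action masking on the next state is left out for brevity.

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """y = r + gamma * (1 - done) * Q_target(s', argmax_a Q_online(s', a))."""
    # Action selection uses the online network...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...while the value of that action comes from the target network.
    rows = np.arange(q_target_next.shape[0])
    next_values = q_target_next[rows, best_actions]
    return rewards + gamma * (1.0 - dones) * next_values


def huber_loss(pred, target, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    err = np.abs(pred - target)
    quadratic = np.minimum(err, delta)
    linear = err - quadratic
    return float(np.mean(0.5 * quadratic**2 + delta * linear))


rewards = np.array([1.0, -2.0])
dones = np.array([0.0, 1.0])  # second transition ends the hand
q_online_next = np.array([[0.1, 0.9, 0.0, 0.2], [0.5, 0.4, 0.3, 0.2]])
q_target_next = np.array([[1.0, 0.5, 2.0, 0.0], [9.0, 9.0, 9.0, 9.0]])

targets = double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.9)
print(targets)
print(huber_loss(np.array([1.45, 1.0]), targets))
```

In the first transition the target network's largest value is 2.0, but the online network selects action 1, so the bootstrap term uses 0.5 instead. That decoupling of selection from evaluation is what limits overestimation.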
A 90-dimensional state vector passed through two 128-unit ReLU hidden layers to produce Q-values for each of the 4 legal actions.
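For the output stage, a dueling head combines a scalar state value with per-action advantages, and the greedy action is then chosen among legal actions only. The sketch below shows one common way to do both, assuming the standard mean-subtracted dueling aggregation and a 0/1 legal-action mask; the function names and the example numbers are hypothetical.

```python
import numpy as np


def dueling_q(value: np.ndarray, advantages: np.ndarray) -> np.ndarray:
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    return value + advantages - advantages.mean(axis=-1, keepdims=True)


def greedy_legal_action(q_values: np.ndarray, legal_mask: np.ndarray) -> int:
    """Pick the highest-Q action among those flagged as legal."""
    masked = np.where(legal_mask.astype(bool), q_values, -np.inf)
    return int(np.argmax(masked))


value = np.array([0.5])                      # V(s)
advantages = np.array([2.0, 3.0, -1.0, 0.0])  # A(s, a) for the 4 actions
q = dueling_q(value, advantages)

legal = np.array([1, 0, 1, 1])  # action 1 is not legal in this state
print(q)
print(greedy_legal_action(q, legal))
```

Without the mask the argmax would be action 1; with it, the agent falls back to the best legal action.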
Three architectures were explored, each addressing the failures of the last:
V1 — Concatenation: Game LSTM + Opponent LSTM, outputs concatenated. The actor learned to ignore the opponent context entirely due to high card-luck noise.
V2 — FiLM Conditioning + CTDE: Replaced concatenation with Feature-wise Linear Modulation (inspired by AlphaStar). The critic was given both players' hole cards during training to reduce variance. Meta-batching forced gradients to flow across hands. Reached +1.0 mbb/h overall but adaptation remained flat.
The V2 dual-stream architecture: game state flows through an MLP trunk modulated by the opponent LSTM context via FiLM scale/shift parameters.
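The FiLM step itself is small: the opponent context is projected to a per-feature scale and shift, which are then applied to the trunk's hidden features. A minimal NumPy sketch, assuming the standard form `gamma * h + beta` with linear projections; the layer sizes, weight names, and the identity initialization shown here are illustrative, and the project's exact parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)


def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(context) * features + beta(context)."""
    gamma = context @ w_gamma + b_gamma  # per-feature scale
    beta = context @ w_beta + b_beta     # per-feature shift
    return gamma * features + beta


hidden_dim, context_dim = 128, 32  # illustrative sizes
features = rng.standard_normal(hidden_dim)  # stands in for the MLP trunk output
context = rng.standard_normal(context_dim)  # stands in for the opponent LSTM state

# With zero projection weights, gamma = 1 and beta = 0: the layer is an identity.
w_zero = np.zeros((context_dim, hidden_dim))
identity_out = film(features, context, w_zero, np.ones(hidden_dim),
                    w_zero, np.zeros(hidden_dim))

# With learned (here random) weights, the context rescales every feature.
w_gamma = 0.1 * rng.standard_normal((context_dim, hidden_dim))
w_beta = 0.1 * rng.standard_normal((context_dim, hidden_dim))
modulated = film(features, context, w_gamma, np.ones(hidden_dim),
                 w_beta, np.zeros(hidden_dim))
print(identity_out.shape, modulated.shape)
```

Unlike concatenation, the context here multiplies the game features directly rather than sitting beside them as extra inputs.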
V3 — Fully Recurrent RL²: A single Session LSTM spanning all decision points in a session. Failed due to vanishing gradients across ~1500 BPTT steps.
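A back-of-the-envelope illustration of why ~1500 steps is hard, not a measurement from this project: if every backward step scaled the gradient by a constant factor slightly below 1 (a strong simplification of real LSTM dynamics), the signal reaching the start of the session would shrink geometrically.

```python
def gradient_scale(per_step_factor: float, steps: int = 1500) -> float:
    """Gradient magnitude after `steps` backward steps under a constant per-step factor."""
    return per_step_factor ** steps


for factor in (0.99, 0.999, 1.0):
    print(f"factor={factor}: scale after 1500 steps = {gradient_scale(factor):.3e}")
```

Under this toy model, a per-step factor of 0.99 leaves a gradient on the order of 1e-7 of its original size, so early-session decisions receive almost no learning signal.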
DQN-Calling dominates its target opponent (cyan), while AC models (red/orange) maintain near break-even. DQN-Maniac (dark blue) loses badly — it never learned to beat a passive player.
All DQN models profit against OMC. AC models perform well but with significantly higher variance than DQN.
Surprising result: DQN-Calling crushes the Maniac (cyan), while DQN-Maniac (dark blue) barely breaks even against its own training target.
Meta-batched training drastically stabilized AC-v2's performance, and the agent performed well with a slight upward trend.
The overlapping "First half / Second half" curves confirm that AC models are not adapting to opponents within a session: the core open problem.
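One way to quantify the first-half versus second-half comparison is to average per-hand winnings over each half of a session and take the difference. The sketch below assumes per-hand payoffs are recorded in big blinds and reports milli-big-blinds per hand (mbb/h); the function name and return format are hypothetical, not the interface of the scripts in `evaluation/`.

```python
from statistics import mean
from typing import Dict, Sequence


def half_session_gap(payoffs_bb: Sequence[float]) -> Dict[str, float]:
    """Compare average winnings (mbb/h) in the first and second half of a session."""
    if len(payoffs_bb) < 2:
        raise ValueError("need at least two hands to split a session")
    mid = len(payoffs_bb) // 2
    first = mean(payoffs_bb[:mid]) * 1000.0   # big blinds -> milli-big-blinds
    second = mean(payoffs_bb[mid:]) * 1000.0
    return {"first_half": first, "second_half": second, "gap": second - first}


# Toy 4-hand session: break-even early, winning late.
result = half_session_gap([1.0, -1.0, 0.0, 2.0])
print(result)
```

An agent that adapts within a session should show a consistently positive `gap` when averaged over many sessions; overlapping curves correspond to a gap near zero.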
Key takeaways:
- Specialist DQN agents generally performed strongly against their training targets; DQN-Maniac, which barely broke even against the Maniac, was the exception
- AC models were competitive but showed higher variance and no measurable intra-session adaptation
- Meta-batched training (V2-Meta) substantially stabilized AC performance
- Surprisingly, DQN-Calling was the strongest all-around performer across multiple opponent types
PokerBots/
├── agents/ # DQN and Actor-Critic agent implementations
├── env/ # RLCard environment wrappers and state encoders
├── evaluation/ # Session evaluation scripts and metrics
├── main/ # Entry point for running matchups
├── players/ # Fixed-strategy opponent implementations
├── results/ # Training curves and evaluation plots
├── train/ # Training scripts (train_dqn.py, train_ac.py, train_cfr.py)
├── references/ # Research papers and references
├── Makefile # All commands for training, evaluation, and dev
└── pyproject.toml # Dependencies managed via uv
Requires Python 3.11 and uv.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
make sync
# Verify setup
make check-deps
# Train specialist DQN agents (one per opponent)
make train-dqn-calling
make train-dqn-maniac
make train-dqn-omc
make train-dqn-polar
# Train Actor-Critic agents
make train-ac-pure # No KL regularization
make train-ac-kl # With KL regularization (lambda=0.5)
# Run a specific matchup (e.g., DQN-Calling vs Maniac)
make matchup AGENT=dqn-calling OPPONENT=maniac
# Run session-level evaluation for all agents vs one opponent
make sessions-all
# Or evaluate a specific pairing
make evaluate-sessions ARGS="--agents dqn-calling --opponent maniac"
Available agents: ac-pure, ac-v2, ac-v2-meta, ac-v3, dqn-calling, dqn-maniac, dqn-omc, dqn-polar, random
Available opponents: calling, folder, maniac, omc, polar, random
make pack-models # Compress trained models to models.tar.gz
make unpack-models # Extract models from archive
make clean-models # Remove all .pt / .pkl weight files
Key libraries (managed via uv):
- rlcard: Limit Texas Hold'em environment
- torch 2.2.2: Neural networks (CPU on macOS/Windows, CUDA 12.1 on Linux)
- open-spiel: Additional game-theoretic utilities (non-macOS)
- numpy, matplotlib, seaborn: Data and visualization
- pokerkit: Hand evaluation utilities
- ruff: Linting and formatting







