ballnet

Overview

ballnet uses a GATv2TCN (Graph Attention Network + Temporal Convolutional Network) model to predict NBA statlines. It provides point estimates and calibrated probabilities for any player across the six output statistics: points, assists, rebounds, turnovers, steals, and blocks.

To switch to a new model, change one line in config.py:

ACTIVE_MODEL = "v5"   # was "v3"

Training and evaluation scripts import config.py and will automatically use weights from models/v5/.


Directory Structure

clean/
├── README.md                  # This file
├── game_embeddings.md         # ← Game outcome prediction reference (optional)
├── config.py                  # ← Single source of truth for all paths + hyperparams
├── predictor.py               # Core predictor class (loads artifacts + runs inference)
├── update.py                  # ← Data refresh / rebuild tensors (optional)
│
├── architecture/              # Model architecture source
│   ├── gatv2tcn.py            # GATv2TCN implementation
│   └── tcn.py                 # TCN block implementation
│
├── data/                      # Runtime data (gitignored: *.pkl, *.npy, *.parquet)
│   ├── raw_boxscores.parquet          # Full NBA game log (built by 01_fetch_data.py)
│   ├── game_home_teams.parquet        # ← {GAME_ID: home_team_abbr} from LeagueGameFinder (cached)
│   ├── X_seq.pkl              # (Days, Players, 13) forward-filled stat tensor
│   ├── X_raw.pkl              # (Days, Players, 13) raw sparse stat tensor (no fill)
│   ├── G_seq.pkl              # List of networkx graphs, one per game-day
│   ├── player_ids.pkl         # Ordered list of player IDs (axis 1 of X_seq)
│   ├── game_dates.pkl         # Ordered list of date strings (axis 0 of X_seq)
│   ├── day_seasons.pkl        # Season label per day (e.g. "2024-25")
│   ├── team_temporal.pkl      # (Days, Players, n_teams) per-day team one-hot arrays
│   ├── pos_temporal.pkl       # (Days, Players, 3) per-day position arrays
│   ├── n_teams.pkl            # int — number of unique teams
│   ├── player_id2team.pkl     # {player_id: "LAL"} — most recent team abbreviation
│   ├── player_id2position.pkl # {player_id: [G,F,C] binary array}
│   ├── mu_per_day.npy         # Causal sliding-window normalization means (Days, 1, 13)
│   └── sd_per_day.npy         # Causal sliding-window normalization std devs (Days, 1, 13)
│
├── models/                    # Trained model weights
│   ├── v5/                    # Current active model
│   │   ├── model.pth          # GATv2TCN state dict
│   │   ├── team_emb.pth       # Linear(n_teams, 2)
│   │   ├── pos_emb.pth        # Linear(3, 2)
│   │   └── conformal_residuals.pkl  # Calibration residuals
│   └── ...                    # v1-v4
│
├── scripts/                   # Setup and training scripts
│   ├── 01_fetch_data.py             # Historical NBA boxscore scrape
│   ├── 02_build_tensors.py          # Build tensors from raw data
│   ├── 03_train.py                  # Training script (MPS/CUDA/Colab)
│   ├── 04_calibrate.py              # Compute conformal_residuals.pkl
│   └── prepare_colab.py             # Package for Colab training
│
└── upload/                    # Google Colab upload bundle
    ├── train.ipynb            # Colab bootstrap notebook
    ├── config.py              # Colab path shim
    ├── scripts/03_train.py    # Training script copy
    ├── gatv2tcn.py            # Model source
    ├── tcn.py                 # TCN source
    └── data/                  # Required data files

Typical Workflow

First-time setup

# 1. Fetch all historical NBA data (takes hours, uses kamikaze restart protocol)
python scripts/01_fetch_data.py

# 2. Build all tensor artifacts from raw data
python scripts/02_build_tensors.py

# 3a. Train on Google Colab (recommended — ~15-30 min on T4/A100)
python scripts/prepare_colab.py
# → Upload clean/upload/ to Google Drive root
# → Open upload/train.ipynb in Colab → Runtime → Run all
# → Download clean_download/ → copy .pth files to clean/models/<ACTIVE_MODEL>/

# 3b. OR train locally on MPS/CUDA (expect 1-2+ hours)
python scripts/03_train.py

# 4. Calibrate the model (computes conformal residuals)
python scripts/04_calibrate.py

Daily workflow

# Fetch any new games and update all tensors
python update.py

Retraining (updated data)

python scripts/02_build_tensors.py   # regenerate tensors
python scripts/prepare_colab.py      # rebuild upload/ bundle with fresh data
# ... train on Colab, copy weights ...
python scripts/04_calibrate.py       # recompute residuals for new weights

config.py

The single import every other script depends on.

ACTIVE_MODEL    = "v5"          # ← Change this to switch models everywhere

ROOT            = Path(__file__).resolve().parent
DATA_DIR        = ROOT / "data"
MODEL_DIR       = ROOT / "models" / ACTIVE_MODEL
GATV2_SRC       = ROOT / "architecture"

FEATURE_COLS    = ['PTS','AST','REB','TO','STL','BLK','PLUS_MINUS',
                   'TCHS','PASS','DIST','PACE','USG_PCT','TS_PCT']  # 13 features
PREDICTION_COLS = ['PTS','AST','REB','TO','STL','BLK']             # 6 output stats
PRED_INDICES    = [0,1,2,3,4,5]   # indices of PREDICTION_COLS within FEATURE_COLS
SEQ_LENGTH      = 10              # days of history used as input window
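
As a quick illustration, a downstream script only needs the names above. The snippet below is a hypothetical usage sketch, not code from the repo:

import torch
from config import MODEL_DIR, PREDICTION_COLS, SEQ_LENGTH

# Resolve the active model's weights purely through config.py
state_dict = torch.load(MODEL_DIR / "model.pth", map_location="cpu")
print(f"{len(PREDICTION_COLS)} output stats, {SEQ_LENGTH}-day input window")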

Data Files

X_seq.pkl — shape (Days, Players, 13)

The forward-filled stat tensor. Zero rows (non-playing days) are filled forward with each player's last known stats. This is the version used as model input.

Gotcha: X_seq.pkl is stored un-normalized (raw stat values like PTS=24.0). Normalization happens in memory only inside predictor.py via mu_per_day/ sd_per_day. Never write the normalized version back to disk — it would cause double-normalization on the next load.
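
A minimal sketch of the intended in-memory normalization (the real code lives in predictor.py and may differ in detail):

import pickle
import numpy as np

with open("data/X_seq.pkl", "rb") as f:
    X_seq = pickle.load(f)              # (Days, Players, 13), raw stat values
mu = np.load("data/mu_per_day.npy")     # (Days, 1, 13)
sd = np.load("data/sd_per_day.npy")     # (Days, 1, 13)

sd_safe = np.where(sd < 1e-6, 1.0, sd)  # near-zero stds treated as 1.0 (see below)
X_norm = (X_seq - mu) / sd_safe         # keep in memory; never pickle this back to disk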

X_raw.pkl — shape (Days, Players, 13)

Same as X_seq but without forward-fill (~84% zeros). Used to detect which players actually played on a given day (non-zero rows).
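
For example, a played-on-this-day mask could be derived like this (sketch, assuming X_raw is a numpy array and player_ids is the ordered list from data/):

played = (X_raw[day_idx] != 0).any(axis=1)            # (Players,) — True where a real stat row exists
active_ids = [pid for pid, ok in zip(player_ids, played) if ok]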

player_id2team.pkl — {player_id: "LAL"}

Maps player ID → team abbreviation string (e.g. "LAL", "BOS"). Generated by 02_build_tensors.py from the most recent team per player.

Gotcha: This stores strings not integers. 03_train.py and 04_calibrate.py both apply an alphabetical string→int encoding (sorted(all_teams)) at runtime to get a consistent n_teams=30 integer mapping.

player_id2position.pkl — {player_id: [G, F, C]}

Maps player ID → 3-element binary position vector (e.g. [1, 0, 0] for a Guard). Generated by 02_build_tensors.py using nba_api static player data.

mu_per_day.npy / sd_per_day.npy — shape (Days, 1, 13)

Causal sliding-window normalization statistics. To prevent lookahead bias in backtesting, each day's means and standard deviations are computed from a trailing window of up to 150 active days of purely historical data. sd values < 1e-6 are treated as 1.0.
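
One plausible way these stats could be computed (a sketch, not the actual 02_build_tensors.py logic; the WINDOW constant and the use of X_raw are assumptions):

import numpy as np

WINDOW = 150                                    # trailing active-day cap per the description above
n_days = X_raw.shape[0]                         # X_raw: raw sparse tensor (Days, Players, 13)
mu_per_day = np.zeros((n_days, 1, 13))
sd_per_day = np.ones((n_days, 1, 13))
for d in range(n_days):
    hist = X_raw[max(0, d - WINDOW):d]          # strictly historical days — no lookahead
    vals = hist[hist.any(axis=2)]               # only rows where a player actually played
    if len(vals):
        mu_per_day[d, 0] = vals.mean(axis=0)
        sd = vals.std(axis=0)
        sd_per_day[d, 0] = np.where(sd < 1e-6, 1.0, sd)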


Model Architecture — GATv2TCN

GATv2TCN(
    in_channels        = 17,   # 13 stats + 2 team_emb + 2 pos_emb
    out_channels       = 6,    # PTS, AST, REB, TO, STL, BLK
    len_input          = 10,   # SEQ_LENGTH
    len_output         = 1,
    temporal_filter    = 64,
    out_gatv2conv      = 32,
    dropout_tcn        = 0.25,
    dropout_gatv2conv  = 0.5,
    head_gatv2conv     = 4,
)

Gotcha: The correct kwarg names are len_input, len_output, out_gatv2conv, dropout_tcn, dropout_gatv2conv, head_gatv2conv. Do NOT use seq_length or heads — those don't exist in the gatv2tcn.py constructor and will raise TypeError: unexpected keyword argument.

Embedding layers:

team_emb = nn.Linear(n_teams, 2)   # bias=True (default)
pos_emb  = nn.Linear(3, 2)         # bias=True (default)

Gotcha: Always use default bias=True when creating these layers to load the saved .pth files, which include a bias key. Using bias=False causes RuntimeError: Unexpected key(s) in state_dict: "bias".
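
For example, recreating and loading the layers (sketch; paths assume the v5 layout above, n_teams comes from data/n_teams.pkl):

import torch
import torch.nn as nn

team_emb = nn.Linear(n_teams, 2)   # keep default bias=True to match the saved state dict
pos_emb  = nn.Linear(3, 2)
team_emb.load_state_dict(torch.load("models/v5/team_emb.pth", map_location="cpu"))
pos_emb.load_state_dict(torch.load("models/v5/pos_emb.pth", map_location="cpu"))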

Input construction:

x_t = cat([X_norm[day, :, :], team_emb(team_one_hot), pos_emb(pos_vec)], dim=1)
# shape: (P=805, 17)
# stacked over SEQ_LENGTH=10 days → (1, P, 17, 10)
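
Stacking over the window might look like this (hypothetical sketch; X_norm, team_temporal, and pos_temporal are the in-memory arrays described above):

import torch

frames = []
for d in range(day_idx - SEQ_LENGTH + 1, day_idx + 1):                    # the 10 days ending at day_idx
    stats = torch.tensor(X_norm[d], dtype=torch.float32)                   # (P, 13) normalized stats
    team  = team_emb(torch.tensor(team_temporal[d], dtype=torch.float32))  # (P, 2)
    pos   = pos_emb(torch.tensor(pos_temporal[d], dtype=torch.float32))    # (P, 2)
    frames.append(torch.cat([stats, team, pos], dim=1))                    # (P, 17)
x_in = torch.stack(frames, dim=-1).unsqueeze(0)                            # (1, P, 17, 10)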

Training Notes

Why train on Google Colab?

The model is small (~77K parameters, 76KB .pth). However, the full training loop (300 epochs × 20-day batch × 148 val days) takes 1.5+ hours on Apple MPS but only 15-30 minutes on Colab T4/A100.

Colab training workflow (prepare_colab.py)

python scripts/prepare_colab.py

Builds upload/ (~44 MB):

  • scripts/03_train.py — exact copy of the canonical training script (parity guarantee)
  • config.py — auto-generated Colab path shim so 03_train.py resolves imports correctly
  • train.ipynb — minimal 4-cell bootstrap notebook that runs 03_train.py via subprocess
  • gatv2tcn.py + tcn.py — model source
  • data/ — the 5 required pkl files

Parity guarantee: prepare_colab.py copies scripts/03_train.py directly into the upload bundle rather than duplicating its logic. This means any changes to 03_train.py (hyperparameters, normalization, loss function, etc.) are automatically reflected in Colab training after re-running prepare_colab.py. Never edit training logic in the notebook or in prepare_colab.py directly — always edit 03_train.py.

After training completes, Colab saves output to clean_download/ in your Drive:

  • model.pth, team_emb.pth, pos_emb.pth → copy to clean/models/<ACTIVE_MODEL>/
  • Re-run 04_calibrate.py after copying new weights

Gotcha: Run prepare_colab.py fresh each time you retrain with updated data — it copies the current pkl files and 03_train.py, so stale uploads will train on stale data with stale code.

Understanding the loss numbers

Training uses summed MSE (not averaged) over all days in the batch/val set:

loss = sum(mse_per_day)   # NOT mean(mse_per_day)

This means raw loss values scale linearly with the number of days. Our dataset has ~7× more val days (147) than the original Colab notebook (20), so our val loss will be ~7× larger by construction. This is expected and correct. The per-day loss converges to the same ~0.032 as the original training. Divide the reported val loss by ~147 to compare.
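
Concretely (illustrative numbers only):

n_val_days = 147
reported_val_loss = 4.7                        # example summed-MSE value from training output
per_day_loss = reported_val_loss / n_val_days  # ≈ 0.032 — the number to compare across runs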

tqdm display during training

The progress bar shows all four quantities every epoch:

Training:  35% | 105/300 [train=38.4, val=21.3, best=18.9, saved=★]

The ★ marker appears when a new best validation loss is saved.


predictor.py — GATv2Predictor

The core predictor class used by inference and evaluation scripts.

Setup

p = GATv2Predictor()
p.setup()   # loads all artifacts from data/ and models/<ACTIVE_MODEL>/

setup() loads: X_seq, G_seq, player_ids, game_dates, mu_per_day, sd_per_day, team_temporal, pos_temporal, n_teams, conformal_residuals.pkl, and all three .pth weight files.

Conformal residuals (tiered format): After loading conformal_residuals.pkl, the predictor exposes:

  • self.val_residuals — dict[str, list] keyed as "PTS_low", "PTS_mid", "PTS_high", etc.
  • self.val_bias — dict[str, float] of per-stat mean bias computed at calibration time

Key public helpers:

p.get_residual_std("PTS")   # mid-tier std — used by quantile_test.py (and any tiered SD filter)

Inference methods

Fast path — use for day-level batched predictions

# ONE forward pass for all 805 players, cached per day
pred_matrix = p.predict_all_for_day(day_idx)     # → (P, 6) raw stat units
mc_matrix   = p.predict_all_mc_for_day(day_idx)  # → (20, P, 6) MC-dropout samples

Both methods are memoized by day_idx. Calling them twice for the same day is a free dict lookup. This is ~150× faster than the per-player approach for large batches.

Use _get_day_idx_for_date("2025-02-15") to convert a date string to an index.
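
Putting the two together (sketch; assumes the loaded player_ids list is exposed as p.player_ids):

day_idx = p._get_day_idx_for_date("2025-02-15")
preds = p.predict_all_for_day(day_idx)        # (P, 6), cached after the first call
row = p.player_ids.index(player_id)           # position along axis 1 of X_seq
pts_estimate = preds[row, 0]                  # column order follows PREDICTION_COLS (PTS first)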

Convenience wrappers — use for per-player queries

p.predict_point_estimate(player_id, "PTS")
p.predict_conformal_probability(player_id, "PTS", 22.5)

These call the day-level batched methods internally so they also benefit from caching if called multiple times for the same day.
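
One plausible way the over/under probability could be derived from the calibration artifacts (a sketch consistent with the norm.cdf dependency listed below, not the actual predictor implementation):

from scipy.stats import norm

def over_probability(point_estimate, line, residual_std, bias=0.0):
    # Gaussian tail mass above the betting line, centered on the bias-corrected estimate
    return float(1.0 - norm.cdf(line, loc=point_estimate + bias, scale=residual_std))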

Memory management

p.clear_day_cache()   # frees _day_cache and _mc_cache dicts if RAM is tight

scripts/02_build_tensors.py

Reads data/raw_boxscores.parquet and produces all artifacts in data/.

What it generates

File Description
X_seq.pkl Forward-filled stat tensor (Days, Players, 13)
X_raw.pkl Raw sparse stat tensor (no fill)
G_seq.pkl List of networkx graphs
player_ids.pkl Ordered player ID list
game_dates.pkl Ordered date string list
player_id2team.pkl {pid: "LAL"} — most recent team
player_id2position.pkl {pid: [G,F,C]} — position binary vector
mu_per_day.npy / sd_per_day.npy Causal sliding-window normalization stats (Days, 1, 13)

Gotcha: player_id2team stores team abbreviation strings ("LAL") not integers. 03_train.py and 04_calibrate.py handle this with an alphabetical sort encoding at runtime.

Pre-flight tensor comparison

Before rebuilding, the script compares the current tensor's player count to the incoming data's player count. If counts differ significantly, retraining is required — the model's graph structure is keyed to specific player indices.
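
In sketch form (the column name and variable names here are assumptions, not the script's actual code):

import pickle
import pandas as pd

old_players = len(pickle.load(open("data/player_ids.pkl", "rb")))
new_players = pd.read_parquet("data/raw_boxscores.parquet")["PLAYER_ID"].nunique()
if new_players != old_players:
    print(f"Player count changed ({old_players} -> {new_players}): retraining required")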


scripts/03_train.py

Team/position encoding

# player_id2team.pkl → string→int encoding
all_teams    = sorted(set(team_str_values))  # alphabetical, stable
team_str2int = {t: i for i, t in enumerate(all_teams)}

This handles both string abbreviations (our pipeline) and integer team IDs (original Colab pipeline) automatically.

Mask (active players only)

Loss is computed only on players who appear in the target day's graph (i.e., players who actually played that game-day):

mask = G_out[i].unique()   # node indices in next day's edge tensor
loss = mse_loss(pred[mask], y[mask])

Players who didn't play are forward-filled in y but excluded from the loss. This prevents the model from wasting capacity learning the fill-forward function.


scripts/04_calibrate.py

Loads model weights and runs inference on the validation set (days 50%–75% of the dataset, matching 03_train.py) to compute signed residuals against the forward-shifted target day (t+1):

residual = actual[t+1] - predicted[t]

Residuals are stratified by predicted value magnitude into three tiers per stat (low / mid / high) and mean-centered to remove systematic model bias before saving.

Output format (conformal_residuals.pkl):

{
    "bias": {"PTS": -0.247, "AST": -0.326, ...},   # raw mean residual per stat
    "residuals": {
        "PTS_low": [...],    # mean-centered, for predictions < 12
        "PTS_mid": [...],    # mean-centered, for predictions 12–22
        "PTS_high": [...],   # mean-centered, for predictions ≥ 22
        "AST_low": [...],
        ...                  # all 6 stats × up to 3 tiers
    }
}
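
A rough sketch of how one stat's tiers could be built (PTS boundaries from the table below; the real script handles all six stats plus edge cases):

import numpy as np

def pts_tier(pred):
    return "low" if pred < 12 else ("mid" if pred < 22 else "high")

raw = {"PTS_low": [], "PTS_mid": [], "PTS_high": []}
for pred, actual in zip(pts_preds, pts_actuals):            # validation-day (prediction, actual) pairs
    raw[f"PTS_{pts_tier(pred)}"].append(actual - pred)      # signed residual

bias = {"PTS": float(np.mean([r for v in raw.values() for r in v]))}
residuals = {k: list(np.asarray(v) - np.mean(v)) for k, v in raw.items() if v}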

Tier boundaries:

Stat Low Mid High
PTS <12 12–22 ≥22
AST <4 4–8 ≥8
REB <4 4–8 ≥8
STL <1.5 1.5–3 ≥3
BLK <1 1–2.5 ≥2.5
TO <2 2–4 ≥4

Tiers with fewer than 30 samples fall back to the mid tier (e.g. STL_high and TO_high typically have 0 samples — the model rarely predicts these stats that high).

Always re-run 04_calibrate.py after copying new weights from Colab.

Also re-run if tier boundaries are adjusted — the saved residuals depend on which threshold was used to bin predictions at calibration time.


Known Gotchas

# Issue Details
1 Double normalization X_seq.pkl is raw. Normalize in-memory only. Never write X_seq_norm to disk.
2 Wrong GATv2TCN kwargs Use len_input, len_output, out_gatv2conv, dropout_tcn, dropout_gatv2conv, head_gatv2conv. Never seq_length or heads.
3 bias=True on embeddings nn.Linear(n_teams, 2) default is bias=True. Saved .pth files include bias. Never use bias=False.
4 team strings not ints player_id2team.pkl stores "LAL" strings. Use alphabetical sort encoding before computing n_teams.
5 Loss scale vs Colab Our val loss is ~7× larger by construction (147 val days vs 20). Compare per-day loss (divide by ~147).
6 CWD sensitivity Always run scripts from clean/ or via absolute path. config.py imports fail if clean/ is not importable.
7 Upload/ freshness Re-run prepare_colab.py every time you want to retrain with updated data. It copies fresh pkl files AND 03_train.py.
8 Colab/local parity Never duplicate training logic in prepare_colab.py or the notebook. All training code lives in 03_train.py. Edit 03_train.py → re-run prepare_colab.py → upload.
9 Double-denormalization The model natively outputs raw stat predictions. Never multiply by sd_per_day or add mu in predictor.py or 04_calibrate.py. That inflates predictions (24.5 PTS → 260 PTS).
10 conformal_residuals.pkl format The file uses a tiered format (see scripts/04_calibrate.py). Access it via predictor._get_residuals_for(stat, pred_val) or predictor.get_residual_std(stat) rather than reading the pickle directly.
11 LOG_TRANSFORM semantics predictor.py and 04_calibrate.py must have LOG_TRANSFORM set correctly based on the active model's training configuration.
12 Colab train.ipynb is NOT the canonical script Training logic lives in scripts/03_train.py only. Re-run prepare_colab.py before every Colab upload so the bundle includes the current 03_train.py.

Dependencies

torch torchvision            # model
torch-geometric              # GATv2Conv
networkx                     # graph construction
numpy pandas pyarrow         # data
nba_api                      # schedule data (ScoreboardV3)
scikit-learn statsmodels     # utilities & metrics
scipy                        # conformal probability (norm.cdf)
tqdm seaborn patsy xgboost   # visualization and analysis
matplotlib                   # plotting

The model source (gatv2tcn.py, tcn.py) lives at:

architecture/

This path is referenced in config.py as GATV2_SRC and used by 03_train.py, 04_calibrate.py, and prepare_colab.py (which copies the files into upload/).
