ballnet

Overview

ballnet uses a GATv2TCN (Graph Attention Network + Temporal Convolutional Network) model to predict NBA statlines. It provides point estimates and calibrated probabilities for any player across the six output statistics: points, assists, rebounds, turnovers, steals, and blocks.

To switch to a new model, change one line in config.py:

ACTIVE_MODEL = "v5"   # was "v3"

Training and evaluation scripts import config.py and will automatically use weights from models/v5/.


Directory Structure

clean/
├── README.md                  # This file
├── game_embeddings.md         # ← Game outcome prediction reference (optional)
├── config.py                  # ← Single source of truth for all paths + hyperparams
├── predictor.py               # Core predictor class (loads artifacts + runs inference)
├── update.py                  # ← Data refresh / rebuild tensors (optional)
│
├── architecture/              # Model architecture source
│   ├── gatv2tcn.py            # GATv2TCN implementation
│   └── tcn.py                 # TCN block implementation
│
├── data/                      # Runtime data (gitignored: *.pkl, *.npy, *.parquet)
│   ├── raw_boxscores.parquet          # Full NBA game log (built by 01_fetch_data.py)
│   ├── game_home_teams.parquet        # ← {GAME_ID: home_team_abbr} from LeagueGameFinder (cached)
│   ├── X_seq.pkl              # (Days, Players, 13) forward-filled stat tensor
│   ├── X_raw.pkl              # (Days, Players, 13) raw sparse stat tensor (no fill)
│   ├── G_seq.pkl              # List of networkx graphs, one per game-day
│   ├── player_ids.pkl         # Ordered list of player IDs (axis 1 of X_seq)
│   ├── game_dates.pkl         # Ordered list of date strings (axis 0 of X_seq)
│   ├── day_seasons.pkl        # Season label per day (e.g. "2024-25")
│   ├── team_temporal.pkl      # (Days, Players, n_teams) per-day team one-hot arrays
│   ├── pos_temporal.pkl       # (Days, Players, 3) per-day position arrays
│   ├── n_teams.pkl            # int — number of unique teams
│   ├── player_id2team.pkl     # {player_id: "LAL"} — most recent team abbreviation
│   ├── player_id2position.pkl # {player_id: [G,F,C] binary array}
│   ├── mu_per_day.npy         # Causal sliding-window normalization means (Days, 1, 13)
│   └── sd_per_day.npy         # Causal sliding-window normalization std devs (Days, 1, 13)
│
├── models/                    # Trained model weights
│   ├── v5/                    # Current active model
│   │   ├── model.pth          # GATv2TCN state dict
│   │   ├── team_emb.pth       # Linear(n_teams, 2)
│   │   ├── pos_emb.pth        # Linear(3, 2)
│   │   └── conformal_residuals.pkl  # Calibration residuals
│   └── ...                    # v1-v4
│
├── scripts/                   # Setup and training scripts
│   ├── 01_fetch_data.py             # Historical NBA boxscore scrape
│   ├── 02_build_tensors.py          # Build tensors from raw data
│   ├── 03_train.py                  # Training script (MPS/CUDA/Colab)
│   ├── 04_calibrate.py              # Compute conformal_residuals.pkl
│   └── prepare_colab.py             # Package for Colab training
│
└── upload/                    # Google Colab upload bundle
    ├── train.ipynb            # Colab bootstrap notebook
    ├── config.py              # Colab path shim
    ├── scripts/03_train.py    # Training script copy
    ├── gatv2tcn.py            # Model source
    ├── tcn.py                 # TCN source
    └── data/                  # Required data files

Typical Workflow

First-time setup

# 1. Fetch all historical NBA data (takes hours, uses kamikaze restart protocol)
python scripts/01_fetch_data.py

# 2. Build all tensor artifacts from raw data
python scripts/02_build_tensors.py

# 3a. Train on Google Colab (recommended — ~15-30 min on T4/A100)
python scripts/prepare_colab.py
# → Upload clean/upload/ to Google Drive root
# → Open upload/train.ipynb in Colab → Runtime → Run all
# → Download clean_download/ → copy .pth files to clean/models/<ACTIVE_MODEL>/

# 3b. OR train locally on MPS/CUDA (expect 1-2+ hours)
python scripts/03_train.py

# 4. Calibrate the model (computes conformal residuals)
python scripts/04_calibrate.py

Daily workflow

# Fetch any new games and update all tensors
python update.py

Retraining (updated data)

python scripts/02_build_tensors.py   # regenerate tensors
python scripts/prepare_colab.py      # rebuild upload/ bundle with fresh data
# ... train on Colab, copy weights ...
python scripts/04_calibrate.py       # recompute residuals for new weights

config.py

The single import every other script depends on.

ACTIVE_MODEL    = "v5"          # ← Change this to switch models everywhere

ROOT            = Path(__file__).resolve().parent
DATA_DIR        = ROOT / "data"
MODEL_DIR       = ROOT / "models" / ACTIVE_MODEL
GATV2_SRC       = ROOT / "architecture"

FEATURE_COLS    = ['PTS','AST','REB','TO','STL','BLK','PLUS_MINUS',
                   'TCHS','PASS','DIST','PACE','USG_PCT','TS_PCT']  # 13 features
PREDICTION_COLS = ['PTS','AST','REB','TO','STL','BLK']             # 6 output stats
PRED_INDICES    = [0,1,2,3,4,5]   # indices of PREDICTION_COLS within FEATURE_COLS
SEQ_LENGTH      = 10              # days of history used as input window
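
As a quick illustration, a downstream script only needs the names above. The snippet below is a hypothetical usage sketch, not code from the repo:

import torch
from config import MODEL_DIR, PREDICTION_COLS, SEQ_LENGTH

# Resolve the active model's weights purely through config.py
state_dict = torch.load(MODEL_DIR / "model.pth", map_location="cpu")
print(f"{len(PREDICTION_COLS)} output stats, {SEQ_LENGTH}-day input window")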

Data Files

X_seq.pkl — shape (Days, Players, 13)

The forward-filled stat tensor. Zero rows (non-playing days) are filled forward with each player's last known stats. This is the version used as model input.

Gotcha: X_seq.pkl is stored un-normalized (raw stat values like PTS=24.0). Normalization happens in memory only inside predictor.py via mu_per_day/ sd_per_day. Never write the normalized version back to disk — it would cause double-normalization on the next load.
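
A minimal sketch of the intended in-memory normalization (the real code lives in predictor.py and may differ in detail):

import pickle
import numpy as np

with open("data/X_seq.pkl", "rb") as f:
    X_seq = pickle.load(f)              # (Days, Players, 13), raw stat values
mu = np.load("data/mu_per_day.npy")     # (Days, 1, 13)
sd = np.load("data/sd_per_day.npy")     # (Days, 1, 13)

sd_safe = np.where(sd < 1e-6, 1.0, sd)  # near-zero stds treated as 1.0 (see below)
X_norm = (X_seq - mu) / sd_safe         # keep in memory; never pickle this back to disk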

X_raw.pkl — shape (Days, Players, 13)

Same as X_seq but without forward-fill (~84% zeros). Used to detect which players actually played on a given day (non-zero rows).
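
For example, a played-on-this-day mask could be derived like this (sketch, assuming X_raw is a numpy array and player_ids is the ordered list from data/):

played = (X_raw[day_idx] != 0).any(axis=1)            # (Players,) — True where a real stat row exists
active_ids = [pid for pid, ok in zip(player_ids, played) if ok]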

player_id2team.pkl — {player_id: "LAL"}

Maps player ID → team abbreviation string (e.g. "LAL", "BOS"). Generated by 02_build_tensors.py from the most recent team per player.

Gotcha: This stores strings not integers. 03_train.py and 04_calibrate.py both apply an alphabetical string→int encoding (sorted(all_teams)) at runtime to get a consistent n_teams=30 integer mapping.

player_id2position.pkl — {player_id: [G, F, C]}

Maps player ID → 3-element binary position vector (e.g. [1, 0, 0] for a Guard). Generated by 02_build_tensors.py using nba_api static player data.

mu_per_day.npy / sd_per_day.npy — shape (Days, 1, 13)

Causal sliding-window normalization statistics. To prevent lookahead bias in backtesting, each day's means and standard deviations are computed from a trailing window of up to 150 active days of purely historical data. sd values < 1e-6 are treated as 1.0.
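
One plausible way these stats could be computed (a sketch, not the actual 02_build_tensors.py logic; the WINDOW constant and the use of X_raw are assumptions):

import numpy as np

WINDOW = 150                                    # trailing active-day cap per the description above
n_days = X_raw.shape[0]                         # X_raw: raw sparse tensor (Days, Players, 13)
mu_per_day = np.zeros((n_days, 1, 13))
sd_per_day = np.ones((n_days, 1, 13))
for d in range(n_days):
    hist = X_raw[max(0, d - WINDOW):d]          # strictly historical days — no lookahead
    vals = hist[hist.any(axis=2)]               # only rows where a player actually played
    if len(vals):
        mu_per_day[d, 0] = vals.mean(axis=0)
        sd = vals.std(axis=0)
        sd_per_day[d, 0] = np.where(sd < 1e-6, 1.0, sd)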


Model Architecture — GATv2TCN

GATv2TCN(
    in_channels        = 17,   # 13 stats + 2 team_emb + 2 pos_emb
    out_channels       = 6,    # PTS, AST, REB, TO, STL, BLK
    len_input          = 10,   # SEQ_LENGTH
    len_output         = 1,
    temporal_filter    = 64,
    out_gatv2conv      = 32,
    dropout_tcn        = 0.25,
    dropout_gatv2conv  = 0.5,
    head_gatv2conv     = 4,
)

Gotcha: The correct kwarg names are len_input, len_output, out_gatv2conv, dropout_tcn, dropout_gatv2conv, head_gatv2conv. Do NOT use seq_length or heads — those don't exist in the gatv2tcn.py constructor and will raise TypeError: unexpected keyword argument.

Embedding layers:

team_emb = nn.Linear(n_teams, 2)   # bias=True (default)
pos_emb  = nn.Linear(3, 2)         # bias=True (default)

Gotcha: Always use default bias=True when creating these layers to load the saved .pth files, which include a bias key. Using bias=False causes RuntimeError: Unexpected key(s) in state_dict: "bias".
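
For example, recreating and loading the layers (sketch; paths assume the v5 layout above, n_teams comes from data/n_teams.pkl):

import torch
import torch.nn as nn

team_emb = nn.Linear(n_teams, 2)   # keep default bias=True to match the saved state dict
pos_emb  = nn.Linear(3, 2)
team_emb.load_state_dict(torch.load("models/v5/team_emb.pth", map_location="cpu"))
pos_emb.load_state_dict(torch.load("models/v5/pos_emb.pth", map_location="cpu"))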

Input construction:

x_t = cat([X_norm[day, :, :], team_emb(team_one_hot), pos_emb(pos_vec)], dim=1)
# shape: (P=805, 17)
# stacked over SEQ_LENGTH=10 days → (1, P, 17, 10)
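
Stacking over the window might look like this (hypothetical sketch; X_norm, team_temporal, and pos_temporal are the in-memory arrays described above):

import torch

frames = []
for d in range(day_idx - SEQ_LENGTH + 1, day_idx + 1):                    # the 10 days ending at day_idx
    stats = torch.tensor(X_norm[d], dtype=torch.float32)                   # (P, 13) normalized stats
    team  = team_emb(torch.tensor(team_temporal[d], dtype=torch.float32))  # (P, 2)
    pos   = pos_emb(torch.tensor(pos_temporal[d], dtype=torch.float32))    # (P, 2)
    frames.append(torch.cat([stats, team, pos], dim=1))                    # (P, 17)
x_in = torch.stack(frames, dim=-1).unsqueeze(0)                            # (1, P, 17, 10)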

Training Notes

Why train on Google Colab?

The model is small (~77K parameters, 76KB .pth). However, the full training loop (300 epochs × 20-day batch × 148 val days) takes 1.5+ hours on Apple MPS but only 15-30 minutes on Colab T4/A100.

Colab training workflow (prepare_colab.py)

python scripts/prepare_colab.py

Builds upload/ (~44 MB):

  • scripts/03_train.py — exact copy of the canonical training script (parity guarantee)
  • config.py — auto-generated Colab path shim so 03_train.py resolves imports correctly
  • train.ipynb — minimal 4-cell bootstrap notebook that runs 03_train.py via subprocess
  • gatv2tcn.py + tcn.py — model source
  • data/ — the 5 required pkl files

Parity guarantee: prepare_colab.py copies scripts/03_train.py directly into the upload bundle rather than duplicating its logic. This means any changes to 03_train.py (hyperparameters, normalization, loss function, etc.) are automatically reflected in Colab training after re-running prepare_colab.py. Never edit training logic in the notebook or in prepare_colab.py directly — always edit 03_train.py.

After training completes, Colab saves output to clean_download/ in your Drive:

  • model.pth, team_emb.pth, pos_emb.pth → copy to clean/models/<ACTIVE_MODEL>/
  • Re-run 04_calibrate.py after copying new weights

Gotcha: Run prepare_colab.py fresh each time you retrain with updated data — it copies the current pkl files and 03_train.py, so stale uploads will train on stale data with stale code.

Understanding the loss numbers

Training uses summed MSE (not averaged) over all days in the batch/val set:

loss = sum(mse_per_day)   # NOT mean(mse_per_day)

This means raw loss values scale linearly with the number of days. Our dataset has ~7× more val days (147) than the original Colab notebook (20), so our val loss will be ~7× larger by construction. This is expected and correct. The per-day loss converges to the same ~0.032 as the original training. Divide the reported val loss by ~147 to compare.
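
Concretely (illustrative numbers only):

n_val_days = 147
reported_val_loss = 4.7                        # example summed-MSE value from training output
per_day_loss = reported_val_loss / n_val_days  # ≈ 0.032 — the number to compare across runs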

tqdm display during training

The progress bar shows all four quantities every epoch:

Training:  35% | 105/300 [train=38.4, val=21.3, best=18.9, saved=★]

The ★ marker appears when a new best validation loss is saved.


predictor.py — GATv2Predictor

The core predictor class used by inference and evaluation scripts.

Setup

p = GATv2Predictor()
p.setup()   # loads all artifacts from data/ and models/<ACTIVE_MODEL>/

setup() loads: X_seq, G_seq, player_ids, game_dates, mu_per_day, sd_per_day, team_temporal, pos_temporal, n_teams, conformal_residuals.pkl, and all three .pth weight files.

Conformal residuals (tiered format): After loading conformal_residuals.pkl, the predictor exposes:

  • self.val_residuals — dict[str, list] keyed as "PTS_low", "PTS_mid", "PTS_high", etc.
  • self.val_bias — dict[str, float] of per-stat mean bias computed at calibration time

Key public helpers:

p.get_residual_std("PTS")   # mid-tier std — used by quantile_test.py (and any tiered SD filter)

Inference methods

Fast path — use for day-level batched predictions

# ONE forward pass for all 805 players, cached per day
pred_matrix = p.predict_all_for_day(day_idx)     # → (P, 6) raw stat units
mc_matrix   = p.predict_all_mc_for_day(day_idx)  # → (20, P, 6) MC-dropout samples

Both methods are memoized by day_idx. Calling them twice for the same day is a free dict lookup. This is ~150× faster than the per-player approach for large batches.

Use _get_day_idx_for_date("2025-02-15") to convert a date string to an index.
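
Putting the two together (sketch; assumes the loaded player_ids list is exposed as p.player_ids):

day_idx = p._get_day_idx_for_date("2025-02-15")
preds = p.predict_all_for_day(day_idx)        # (P, 6), cached after the first call
row = p.player_ids.index(player_id)           # position along axis 1 of X_seq
pts_estimate = preds[row, 0]                  # column order follows PREDICTION_COLS (PTS first)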

Convenience wrappers — use for per-player queries

p.predict_point_estimate(player_id, "PTS")
p.predict_conformal_probability(player_id, "PTS", 22.5)

These call the day-level batched methods internally so they also benefit from caching if called multiple times for the same day.
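
One plausible way the over/under probability could be derived from the calibration artifacts (a sketch consistent with the norm.cdf dependency listed below, not the actual predictor implementation):

from scipy.stats import norm

def over_probability(point_estimate, line, residual_std, bias=0.0):
    # Gaussian tail mass above the betting line, centered on the bias-corrected estimate
    return float(1.0 - norm.cdf(line, loc=point_estimate + bias, scale=residual_std))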

Memory management

p.clear_day_cache()   # frees _day_cache and _mc_cache dicts if RAM is tight

scripts/02_build_tensors.py

Reads data/raw_boxscores.parquet and produces all artifacts in data/.

What it generates

File Description
X_seq.pkl Forward-filled stat tensor (Days, Players, 13)
X_raw.pkl Raw sparse stat tensor (no fill)
G_seq.pkl List of networkx graphs
player_ids.pkl Ordered player ID list
game_dates.pkl Ordered date string list
player_id2team.pkl {pid: "LAL"} — most recent team
player_id2position.pkl {pid: [G,F,C]} — position binary vector
mu_per_day.npy / sd_per_day.npy Causal sliding-window normalization stats (Days, 1, 13)

Gotcha: player_id2team stores team abbreviation strings ("LAL") not integers. 03_train.py and 04_calibrate.py handle this with an alphabetical sort encoding at runtime.

Pre-flight tensor comparison

Before rebuilding, the script compares the current tensor's player count to the incoming data's player count. If counts differ significantly, retraining is required — the model's graph structure is keyed to specific player indices.
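
In sketch form (the column name and variable names here are assumptions, not the script's actual code):

import pickle
import pandas as pd

old_players = len(pickle.load(open("data/player_ids.pkl", "rb")))
new_players = pd.read_parquet("data/raw_boxscores.parquet")["PLAYER_ID"].nunique()
if new_players != old_players:
    print(f"Player count changed ({old_players} -> {new_players}): retraining required")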


scripts/03_train.py

Team/position encoding

# player_id2team.pkl → string→int encoding
all_teams    = sorted(set(team_str_values))  # alphabetical, stable
team_str2int = {t: i for i, t in enumerate(all_teams)}

This handles both string abbreviations (our pipeline) and integer team IDs (original Colab pipeline) automatically.

Mask (active players only)

Loss is computed only on players who appear in the target day's graph (i.e., players who actually played that game-day):

mask = G_out[i].unique()   # node indices in next day's edge tensor
loss = mse_loss(pred[mask], y[mask])

Players who didn't play are forward-filled in y but excluded from the loss. This prevents the model from wasting capacity learning the fill-forward function.


scripts/04_calibrate.py

Loads model weights and runs inference on the validation set (days 50%–75% of the dataset, matching 03_train.py) to compute signed residuals against the forward-shifted target day (t+1):

residual = actual[t+1] - predicted[t]

Residuals are stratified by predicted value magnitude into three tiers per stat (low / mid / high) and mean-centered to remove systematic model bias before saving.

Output format (conformal_residuals.pkl):

{
    "bias": {"PTS": -0.247, "AST": -0.326, ...},   # raw mean residual per stat
    "residuals": {
        "PTS_low": [...],    # mean-centered, for predictions < 12
        "PTS_mid": [...],    # mean-centered, for predictions 12–22
        "PTS_high": [...],   # mean-centered, for predictions ≥ 22
        "AST_low": [...],
        ...                  # all 6 stats × up to 3 tiers
    }
}
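
A rough sketch of how one stat's tiers could be built (PTS boundaries from the table below; the real script handles all six stats plus edge cases):

import numpy as np

def pts_tier(pred):
    return "low" if pred < 12 else ("mid" if pred < 22 else "high")

raw = {"PTS_low": [], "PTS_mid": [], "PTS_high": []}
for pred, actual in zip(pts_preds, pts_actuals):            # validation-day (prediction, actual) pairs
    raw[f"PTS_{pts_tier(pred)}"].append(actual - pred)      # signed residual

bias = {"PTS": float(np.mean([r for v in raw.values() for r in v]))}
residuals = {k: list(np.asarray(v) - np.mean(v)) for k, v in raw.items() if v}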

Tier boundaries:

Stat Low Mid High
PTS <12 12–22 ≥22
AST <4 4–8 ≥8
REB <4 4–8 ≥8
STL <1.5 1.5–3 ≥3
BLK <1 1–2.5 ≥2.5
TO <2 2–4 ≥4

Tiers with fewer than 30 samples fall back to the mid tier (e.g. STL_high and TO_high typically have 0 samples — the model rarely predicts these stats that high).

Always re-run 04_calibrate.py after copying new weights from Colab.

Also re-run if tier boundaries are adjusted — the saved residuals depend on which threshold was used to bin predictions at calibration time.


Known Gotchas

# Issue Details
1 Double normalization X_seq.pkl is raw. Normalize in-memory only. Never write X_seq_norm to disk.
2 Wrong GATv2TCN kwargs Use len_input, len_output, out_gatv2conv, dropout_tcn, dropout_gatv2conv, head_gatv2conv. Never seq_length or heads.
3 bias=True on embeddings nn.Linear(n_teams, 2) default is bias=True. Saved .pth files include bias. Never use bias=False.
4 team strings not ints player_id2team.pkl stores "LAL" strings. Use alphabetical sort encoding before computing n_teams.
5 Loss scale vs Colab Our val loss is ~7× larger by construction (147 val days vs 20). Compare per-day loss (divide by ~147).
6 CWD sensitivity Always run scripts from clean/ or via absolute path. config.py imports fail if clean/ is not importable.
7 Upload/ freshness Re-run prepare_colab.py every time you want to retrain with updated data. It copies fresh pkl files AND 03_train.py.
8 Colab/local parity Never duplicate training logic in prepare_colab.py or the notebook. All training code lives in 03_train.py. Edit 03_train.py → re-run prepare_colab.py → upload.
9 Double-denormalization The model natively outputs raw stat predictions. Never multiply by sd_per_day or add mu in predictor.py or 04_calibrate.py. That inflates predictions (24.5 PTS → 260 PTS).
10 conformal_residuals.pkl format The file uses a tiered format (see scripts/04_calibrate.py). Access it via predictor._get_residuals_for(stat, pred_val) or predictor.get_residual_std(stat) rather than reading the pickle directly.
11 LOG_TRANSFORM semantics predictor.py and 04_calibrate.py must have LOG_TRANSFORM set correctly based on the active model's training configuration.
12 Colab train.ipynb is NOT the canonical script Training logic lives in scripts/03_train.py only. Re-run prepare_colab.py before every Colab upload so the bundle includes the current 03_train.py.

Dependencies

torch torchvision            # model
torch-geometric              # GATv2Conv
networkx                     # graph construction
numpy pandas pyarrow         # data
nba_api                      # schedule data (ScoreboardV3)
scikit-learn statsmodels     # utilities & metrics
scipy                        # conformal probability (norm.cdf)
tqdm seaborn patsy xgboost   # visualization and analysis
matplotlib                   # plotting

The model source (gatv2tcn.py, tcn.py) lives at:

architecture/

This path is referenced in config.py as GATV2_SRC and used by 03_train.py, 04_calibrate.py, and prepare_colab.py (which copies the files into upload/).
