Production-grade NFL analytics platform featuring a 5-way Bayesian ensemble, 342 engineered features, and R + Python pipelines backed by TimescaleDB. Includes formal statistical testing frameworks, distributed compute, and comprehensive dissertation documentation.
| Metric | Performance | Notes |
|---|---|---|
| Spread Accuracy | 74.9% | 2025 holdout (179 bets) |
| Props MAE | 44.25 yards | Equal-weight ensemble |
| Features | 368 total | 305 base + 37 synthetic + 26 semantic |
| Dissertation | 693 pages | 12 chapters + appendices |
| Semantic Stack | NEW | LLM-powered NER, sentiment, explanations |
Quick Links:
- CLAUDE.md - Comprehensive project documentation (v6.1)
- PROJECT_STATUS.md - Current project status (updated Dec 9, 2025)
- SETUP.md - Environment setup instructions
Key Documentation:
- Database Reference: Schema, conventions, SQL patterns
- Architecture Reference: System design and components
- Milestones: 122+ project completion summaries
- 2025 Holdout Validation (Dec 9, 2025)
- GNN v2 Integration (Dec 7, 2025)
- Dirichlet Optimization (Nov 27, 2025)
- Dissertation: 693-page dissertation
- 12 chapters: Data foundation through production deployment
- Causal inference, Bayesian hierarchical models, Reinforcement Learning
- Comprehensive appendices (BNN investigation, architecture evolution)
- Database Schema Audit - 56-table comprehensive audit
Below is a minimal local bootstrap.
- Docker and docker compose
- psql (optional; script falls back to container psql)
- R (4.x) and Python (3.10+) if you plan to run ingestors
- Git for version control
- Git LFS (for model binaries):
  brew install git-lfs && git lfs install
  - Required to download Python (.pkl) and R (.rds) model files
- See docs/GIT_LFS_GUIDE.md for comprehensive guide
| Worktree | Directory | Purpose |
|---|---|---|
| Main | nfl-analytics/ | Primary development |
| Experiments | ../nfl-experiments/ | Model experiments |
| Dissertation | ../nfl-dissertation/ | LaTeX compilation |
| Hotfix | ../nfl-hotfix/ | Quick fixes |
| Backtest | ../nfl-backtest/ | Long-running tests |
git worktree list  # View all worktrees

Start the database and apply schema:

bash scripts/dev/init_dev.sh

# Create virtual environment
python -m venv .venv
# Activate (choose your platform)
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows (CMD)
.venv/Scripts/Activate.ps1 # Windows (PowerShell)
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For testing (optional)

- Windows 11 + RTX 4090: PyTorch CUDA support automatically included in requirements.txt
- Mac M4: PyTorch MPS support automatically included
# Install R packages
Rscript -e 'renv::restore()'
# OR
Rscript setup_packages.R

Load schedules (idempotent, 1999-2024):
Rscript --vanilla R/ingestion/ingest_schedules.R

Ingest play-by-play (1999-2024, ~3-5 minutes):
Rscript --vanilla R/ingestion/ingest_pbp.R

Ingest historical odds (requires ODDS_API_KEY in .env):
export ODDS_API_KEY="your_key_here"
python py/ingest_odds_history.py --start-date 2023-09-01 --end-date 2023-09-10

Refresh materialized views:
psql postgresql://dro:sicillionbillions@localhost:5544/devdb01 \
-c "REFRESH MATERIALIZED VIEW mart.game_summary;"
# Optional: refresh enhanced features view (if used)
psql postgresql://dro:sicillionbillions@localhost:5544/devdb01 \
-c "SELECT mart.refresh_game_features();"Build as-of features (leakage-safe, game-level):
python py/features/asof_features.py \
--output analysis/features/asof_team_features.csv \
--season-start 1999 \
--season-end 2024 \
--validate

Run baseline GLM ATS backtest:
python py/backtest/baseline_glm.py \
--start-season 2003 \
--end-season 2024 \
--output-csv analysis/results/glm_baseline_metrics.csv \
--tex analysis/dissertation/figures/out/glm_baseline_table.tex

Optional: apply probability calibration (Platt or isotonic) and change decision thresholds:
python py/backtest/baseline_glm.py \
--start-season 2003 --end-season 2024 \
--calibration platt --cv-folds 5 \
--decision-threshold 0.50 \
--cal-plot analysis/dissertation/figures/out/glm_calibration_platt.png \
--cal-csv analysis/results/glm_calibration_platt.csv \
--output-csv analysis/results/glm_baseline_metrics_cal_platt.csv \
--tex analysis/dissertation/figures/out/glm_baseline_table_cal_platt.tex

Sweep thresholds and compare configs (harness):
python py/backtest/harness.py \
--features-csv analysis/features/asof_team_features.csv \
--start-season 2003 --end-season 2024 \
--thresholds 0.45,0.50,0.55 \
--calibrations none,platt,isotonic --cv-folds 5 \
--cal-bins 10 --cal-out-dir analysis/results/calibration \
--output-csv analysis/results/glm_harness_metrics.csv \
--tex analysis/dissertation/figures/out/glm_harness_table.tex \
--tex-overall analysis/dissertation/figures/out/glm_harness_overall.tex

This writes per-season and overall reliability CSVs/plots under analysis/results/calibration/ and emits an overall comparison table with ECE/MCE alongside Brier/LogLoss.
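For reference, expected calibration error (ECE) is the occupancy-weighted average gap between each bin's mean predicted probability and its observed hit rate, and MCE is the worst single-bin gap. The snippet below is a minimal, self-contained sketch of that computation; it does not call the project's harness code, and the function name and binning scheme are illustrative.

```python
import numpy as np

def calibration_errors(p_pred, y_true, n_bins=10):
    """Expected and maximum calibration error via equal-width probability bins.

    Illustrative sketch; the project's harness may bin or weight differently.
    """
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Assign each prediction to a bin 0..n_bins-1 (clip so p=1.0 lands in the top bin).
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(p_pred[mask].mean() - y_true[mask].mean())  # |confidence - hit rate|
        ece += mask.mean() * gap                              # weight by bin occupancy
        mce = max(mce, gap)
    return ece, mce

# Tiny synthetic example
rng = np.random.default_rng(0)
p = rng.uniform(0.3, 0.7, 500)
y = (rng.uniform(size=500) < p).astype(float)
print(calibration_errors(p, y))
```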
Run formal statistical significance tests:
# Compare models with statistical testing
python -c "
from py.compute.statistics.statistical_tests import PermutationTest
from py.compute.statistics.effect_size import EffectSizeCalculator
# Example: Compare two model performances
perm_test = PermutationTest(n_permutations=5000)
effect_calc = EffectSizeCalculator()
# Your model comparison code here
print('Statistical testing framework ready!')
"Generate automated reports with statistical analysis:
# Create Quarto reports with LaTeX integration
python py/compute/statistics/reporting/quarto_generator.py \
--title "NFL Model Performance Analysis" \
--output analysis/reports/statistical_analysis.qmd

🆕 SETI@home-style distributed computing across your MacBook M4 and Windows 4090 desktop via Google Drive synchronization:
- Move project to Google Drive: Place the nfl-analytics/ folder in your Google Drive
- Install Google Drive on both machines: Ensure sync is enabled for the project folder
- Verify sync: Check that database files (*.db) sync between machines
The system automatically optimizes task assignment based on your hardware:
MacBook M4 (CPU-optimized):
- Monte Carlo simulations (CPU-intensive)
- State-space parameter sweeps
- Statistical analysis tasks
- Unified memory advantages
Windows 4090 (GPU-optimized):
- RL training (DQN/PPO with CUDA)
- XGBoost GPU training
- Large batch processing
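As a rough illustration of this routing policy (not the actual py/compute/hardware/task_router.py logic; the task labels and machine names below are assumptions):

```python
# Minimal sketch of hardware-aware routing; the real task_router is more elaborate.
CPU_BOUND = {"monte_carlo", "state_space_sweep", "statistical_testing"}   # favour the M4
GPU_BOUND = {"rl_training", "xgboost_gpu", "large_batch_inference"}       # favour the 4090

def route_task(task_type: str, has_cuda: bool) -> str:
    """Return the preferred machine for a task type (hypothetical labels)."""
    if task_type in GPU_BOUND and has_cuda:
        return "windows-4090"
    if task_type in CPU_BOUND:
        return "macbook-m4"
    return "any"

print(route_task("rl_training", has_cuda=True))    # windows-4090
print(route_task("monte_carlo", has_cuda=True))    # macbook-m4
```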
# Initialize compute queue with standard tasks
python run_compute.py --init
# Start adaptive compute with bandit optimization
python run_compute.py --intensity medium
# Check performance scoreboard and machine status
python run_compute.py --scoreboard
# Web dashboard with live monitoring
python run_compute.py --dashboard
# View hardware routing report
python -c "from py.compute.hardware.task_router import task_router; print(task_router.get_routing_report())"- SQLite WAL mode: Prevents database corruption during sync
- Automatic conflict resolution: Detects and merges Google Drive conflicts
- Machine identification: Tracks which device completed each task
- File locking: Cross-platform locks prevent concurrent access issues
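A minimal sketch of the SQLite side of these safeguards, assuming the task queue lives in a plain SQLite file (the filename below is hypothetical):

```python
import sqlite3

# Open the shared queue database and enable write-ahead logging so concurrent
# readers on the other machine do not block or corrupt in-progress writes.
conn = sqlite3.connect("compute_queue.db", timeout=30.0)  # hypothetical filename
conn.execute("PRAGMA journal_mode=WAL;")       # concurrent-read-friendly journaling
conn.execute("PRAGMA synchronous=NORMAL;")     # reasonable durability/speed trade-off
conn.execute("PRAGMA busy_timeout=30000;")     # wait instead of failing on a held lock
conn.close()
```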
- RL Training: DQN/PPO with 500-1000 epochs (auto-routed to 4090)
- State-Space Models: Parameter sweeps with Kalman smoothing (auto-routed to M4)
- Monte Carlo: Large-scale simulations 100K-1M scenarios (auto-routed to M4)
- Statistical Testing: Automated A/B testing and significance analysis
- OPE Gates: Off-policy evaluation with robustness grids
- GLM Calibration: Cross-validated probability calibration
# Check task distribution across machines
python -c "
from py.compute.task_queue import TaskQueue
queue = TaskQueue()
stats = queue.get_queue_status()
print('Task distribution:', stats)
"
# Check queue status
python -c "
from py.compute.task_queue import TaskQueue
queue = TaskQueue()
print('Queue status:', queue.get_queue_status())
"Status: Production Ready (Dec 9, 2025)
Ensemble Architecture:
| Component | Weight | Model Type |
|---|---|---|
| State-Space | 27% | Kalman filter with time decay |
| Hierarchical | 27% | brms hierarchical Bayesian |
| XGBoost | 23% | Gradient boosting (342 features) |
| Informative Priors | 13% | Domain-informed Bayesian |
| Hybrid | 10% | Static-incremental combined |
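A minimal sketch of how these fixed weights could be applied to component predictions (the component names and interface are illustrative, not the production enhanced_ensemble_v6_production.py code):

```python
import numpy as np

# Fixed ensemble weights from the table above (sum to 1.0)
WEIGHTS = {
    "state_space": 0.27,
    "hierarchical": 0.27,
    "xgboost": 0.23,
    "informative_priors": 0.13,
    "hybrid": 0.10,
}

def ensemble_predict(component_preds: dict) -> np.ndarray:
    """Weighted average of per-model predictions (hypothetical interface)."""
    return sum(WEIGHTS[name] * np.asarray(pred) for name, pred in component_preds.items())

# Example: combine predicted margins for three games
preds = {name: np.array([3.5, -1.0, 7.0]) for name in WEIGHTS}
print(ensemble_predict(preds))   # identical inputs -> identical output
```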
2025 Holdout Results (Dec 9):
- Spread accuracy: 74.9% (179 games)
- Props MAE: 44.25 yards (equal-weight ensemble)
- Key finding: Equal-weight averaging outperformed Dirichlet v8.0 on 2025 holdout
Train Models:
# Train all Bayesian models (R/brms)
Rscript R/train_hierarchical_v2_canonical.R
Rscript R/train_informative_priors_v2_canonical.R
Rscript R/state_space/train_state_space_v2_canonical.R
# Train XGBoost (Python)
uv run python py/models/train_xgboost_canonical.py
# Generate ensemble predictions
uv run python py/ensemble/enhanced_ensemble_v6_production.py

Key Features (368 total):
- Base features (305): TeamOffensive, TeamDefensive, NextGen, QBR, OpponentAdjusted
- Synthetic interaction (11): QB-OL synergy, defensive balance, explosive mismatch
- Synthetic momentum (11): Form intensity, pressure consistency, RZ efficiency
- Research-driven (15): nfelo EPA weighting (1.6:1.0), coaching rationality
- Semantic features (26): LLM-derived from news/social media (NEW)
See CLAUDE.md for complete feature documentation.
Status: NEW (Dec 21, 2025) - LLM-powered semantic analysis for NFL betting
The Semantic Stack adds AI/LLM capabilities to enhance win rate through semantic understanding of news, injuries, and market signals.
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Decision Support │
│ - ValueBetDetector with LLM-generated explanations │
│ - Human-in-the-loop validation │
└─────────────────────────────────────────────────────────────┘
↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Semantic Feature Engineering (26 features) │
│ - Sentiment (24h, 7d trend, differential) │
│ - Injury severity (LLM-assessed) │
│ - Media buzz, coaching changes │
└─────────────────────────────────────────────────────────────┘
↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: LLM Inference │
│ - Local: Ollama (qwen3:8b) + MLX (Apple Silicon) │
│ - Fallback: Gemini Flash → Claude API │
│ - 4-minute timeout, content caching │
└─────────────────────────────────────────────────────────────┘
↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Data Ingestion │
│ - ESPN news (existing) │
│ - Social media (Twitter/X verified accounts) │
│ - Historical injury records (84K+ samples) │
└─────────────────────────────────────────────────────────────┘
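A minimal sketch of the Layer 2 fallback order (local Ollama first, then hosted APIs), assuming each backend exposes a simple generate callable; the function and backend handles below are illustrative, not the actual py/semantic/models API:

```python
from typing import Callable

def infer_with_fallback(prompt: str, clients: list) -> tuple:
    """Try each inference backend in order (e.g. Ollama -> Gemini Flash -> Claude).

    Each entry is (backend_name, generate_fn). Returns (backend_name, text).
    Illustrative only; the real stack also applies a 4-minute timeout and caching.
    """
    last_error = None
    for name, generate in clients:
        try:
            return name, generate(prompt)
        except Exception as exc:          # fall through to the next backend
            last_error = exc
    raise RuntimeError(f"all inference backends failed: {last_error}")

def flaky_local(prompt: str) -> str:
    raise TimeoutError("local model busy")

backends = [
    ("ollama/qwen3:8b", flaky_local),
    ("gemini-flash", lambda p: f"[gemini] summary of: {p[:30]}"),
]
print(infer_with_fallback("CMC listed as questionable (calf)...", backends))
```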
Key Components:
- py/semantic/models/ - Ollama/MLX/API inference clients
- py/semantic/features/ - SemanticFeatureExtractor (26 features)
- py/semantic/ingestion/ - News and social media ingesters
- py/semantic/training/ - NER/sentiment annotation and fine-tuning
- py/production/value_bet_detector.py - LLM-explained value bets
Usage:
# Semantic features integrated into GameFeatureExtractor
from py.features.game_feature_extractor import GameFeatureExtractor
extractor = GameFeatureExtractor()
features = extractor.extract_game_features(game_id)
# Semantic features prefixed with 'sem_': sem_home_sentiment_24h, etc.
# Value bet detection with explanations
from py.production.value_bet_detector import ValueBetDetector
detector = ValueBetDetector(min_edge=0.03)
bets = detector.detect_value_bets(predictions)
for bet in bets:
print(f"{bet.bet_side}: {bet.edge_pct} edge")
print(f" {bet.explanation}")Training Data (Dec 21, 2025):
- NER training: 5,000 samples (4,500 train / 500 val) from 84K injury records
- Sentiment training: 2,000 samples from game outcomes
- Fine-tuning: Ollama Modelfile + MLX LoRA configs
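For a sense of shape, a single span-labelled NER sample derived from an injury record might look roughly like the following; the entity labels and schema are assumptions for illustration, not the project's exact annotation format.

```python
# Hypothetical NER training sample derived from an injury record.
# Entity labels (PLAYER, INJURY, STATUS) are illustrative only.
sample = {
    "text": "Christian McCaffrey (calf) listed as questionable for Sunday.",
    "entities": [
        {"start": 0,  "end": 19, "label": "PLAYER"},   # "Christian McCaffrey"
        {"start": 21, "end": 25, "label": "INJURY"},   # "calf"
        {"start": 37, "end": 49, "label": "STATUS"},   # "questionable"
    ],
}
for ent in sample["entities"]:
    print(ent["label"], "->", sample["text"][ent["start"]:ent["end"]])
```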
See Semantic Stack Plan for full implementation roadmap.
🆕 CQL Model Training Complete (Oct 9, 2025) - Windows 11 RTX 4090
Train CQL agent for offline RL betting strategy:
# Generate unified features (342 columns)
.venv/Scripts/python.exe py/features/asof_features_unified.py \
--output data/processed/features/asof_team_features.csv
# Create RL logged dataset (5,146 games, 2006-2024)
.venv/Scripts/python.exe py/rl/dataset.py \
--output data/rl_logged_2006_2024.csv \
--season-start 2006 \
--season-end 2024
# Train CQL model (2000 epochs, CUDA acceleration)
.venv/Scripts/python.exe py/rl/cql_agent.py \
--dataset data/rl_logged_2006_2024.csv \
--output models/cql/best_model.pth \
--alpha 0.3 \
--lr 0.0001 \
--hidden-dims 128 64 32 \
--epochs 2000 \
--device cuda \
--log-freq 100

Training Results (RTX 4090):
- Training Time: ~9 minutes (2000 epochs on CUDA)
- Match Rate: 98.5% (policy matches logged behavior)
- Estimated Policy Reward: 1.75% (vs 1.41% baseline = 24% improvement)
- Final Loss: 0.1070 (75% reduction from initial)
- Training Log: models/cql/cql_training_log.json (2000 epochs)
- Model artifacts managed via Git LFS
Platform Support:
- Windows 11 + RTX 4090: CUDA 12.9, PyTorch 2.8.0+cu129 (recommended for training)
- Mac M4: MPS backend, PyTorch 2.8.0 (CPU fallback for inference)
- Cross-platform: Auto-detects CUDA > MPS > CPU
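A minimal sketch of that CUDA > MPS > CPU preference using standard PyTorch checks (not the project's exact helper):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA (RTX 4090), then Apple MPS (M4), then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```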
See CQL Complete Summary for full details.
Status Update: November 10-12, 2025 - Baseline Error Corrected
Initial Claim (INCORRECT): Incremental XGBoost +21% vs static baseline
Corrected Finding: Baseline was mean predictor (55.16 MAE), not XGBoost. Proper comparison shows incremental is -13.7% worse than static XGBoost (39.76 MAE).
Solution: Hybrid 70% static + 30% incremental achieves +4.2% improvement (38.03 vs 39.66 MAE across 2022-2024).
Status: Hybrid model approved for Week 2 integration pending A/B testing.
Nov 10: Claimed incremental XGBoost achieves 43.49 MAE vs 55.16 "static baseline" (+21% improvement)
- Validated on 2024 season only (182 games)
- Accepted baseline without verification
- Celebrated as breakthrough in online learning
Nov 12: Multi-season validation (2022-2024) revealed discrepancy
- 2022: Incremental -8.3% worse
- 2023: Incremental -15.2% worse
- 2024: Incremental +21.1% better
Root Cause: Baseline was np.full(len(y_test), y_train.mean()) (mean predictor), not XGBoost
Corrected Comparison:
Pure Incremental: 45.22 MAE (2022-2024 average)
Static XGBoost: 39.66 MAE (proper baseline)
Delta: -13.7% (incremental WORSE)
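The broader lesson: recompute the naive baseline yourself before quoting an improvement. A minimal sketch of that sanity check (array names and the synthetic data are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def report_vs_baselines(y_train, y_test, model_pred):
    """Compare a candidate model against the naive mean predictor."""
    mean_baseline = np.full(len(y_test), np.mean(y_train))  # the trap: this is not a real model
    print("mean-predictor MAE:", round(mae(y_test, mean_baseline), 2))
    print("candidate MAE     :", round(mae(y_test, model_pred), 2))
    # A fair comparison must also include the strongest existing model (static XGBoost here).

# Synthetic illustration only
rng = np.random.default_rng(0)
y_train = rng.gamma(2.0, 30.0, 4000)          # stand-in for receiving-yard targets
y_test = rng.gamma(2.0, 30.0, 200)
model_pred = y_test + rng.normal(0, 35, 200)  # pretend model with ~35-yard noise
report_vs_baselines(y_train, y_test, model_pred)
```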
Exhaustive Optimization (8 approaches tested):
- Pure incremental: 45.22 MAE (baseline)
- Pure static: 39.66 MAE (best individual)
- Hybrid 50/50: 40.12 MAE
- Hybrid 70/30: 38.03 MAE (BEST) ✅
- Hybrid 80/20: 38.45 MAE
- Hybrid 90/10: 38.89 MAE
- Weighted by recency: 39.23 MAE
- Ensemble with River ARF: 41.67 MAE (failed)
Winner: Hybrid 70/30
- 38.03 MAE vs 39.66 static baseline = +4.2% improvement
- Statistically significant (p = 0.0082, Diebold-Mariano test; see the sketch below)
- Consistent across all 3 seasons
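A simplified sketch of such a comparison on per-game errors, using a basic Diebold-Mariano statistic without the autocorrelation correction a full treatment would apply (scipy assumed available); treat it as illustrative rather than the dissertation's exact procedure:

```python
import numpy as np
from scipy import stats

def dm_test(errors_a, errors_b):
    """Simplified Diebold-Mariano test on absolute-error loss differentials."""
    d = np.abs(np.asarray(errors_a)) - np.abs(np.asarray(errors_b))
    dm_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # studentized mean differential
    p_value = 2 * (1 - stats.norm.cdf(abs(dm_stat)))         # two-sided normal approximation
    return dm_stat, p_value

# errors_a / errors_b: per-game prediction errors of two models on the same games (synthetic here)
rng = np.random.default_rng(1)
errs_static = rng.normal(0, 40, 500)
errs_hybrid = errs_static * 0.95 + rng.normal(0, 5, 500)
print(dm_test(errs_hybrid, errs_static))
```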
Architecture:
# 70% static (stable foundation) + 30% incremental (adaptability)
hybrid_pred = 0.7 * static_xgb.predict(X) + 0.3 * incremental_xgb.predict_one(x)

- Always validate baselines independently - Don't trust variable names
- Multi-season testing is mandatory - Single season can show flukes
- Exhaustive optimization before abandonment - Systematic exploration found working solution
- Document errors honestly - Improves credibility and prevents future mistakes
See:
- POSTMORTEMS.md - Comprehensive error analysis
- Corrected Findings - Full investigation details
- ADR-008 - Production decision record
This project includes comprehensive unit tests, integration tests, and CI/CD workflows.
# Setup testing environment (one-time)
bash scripts/dev/setup_testing.sh
# Run unit tests (fast, no DB required)
pytest tests/unit -m unit
# Run integration tests (requires Docker + Postgres)
docker compose up -d pg
pytest tests/integration -m integration
# Run all tests with coverage
pytest --cov=py --cov-report=html
open htmlcov/index.html

# Install hooks (automatic code quality checks)
pre-commit install
# Run manually on all files
pre-commit run --all-files

Three automated workflows run on push/PR:
- Test Suite: Unit tests, integration tests, coverage reporting
- Pre-commit: Code quality and formatting checks
- Nightly Data Quality: Schema validation and data integrity checks
See tests/README.md and tests/TESTING.md for detailed testing documentation.
Build and start services:
docker compose up -d --build pg app

Run tasks inside container:
# Setup
docker compose exec app bash -lc "bash scripts/dev_setup.sh"
# Data ingestion
docker compose exec app bash -lc "Rscript --vanilla data/ingest_schedules.R"
# Render notebooks
docker compose exec app bash -lc "quarto render notebooks/04_score_validation.qmd"
# RL pipeline
docker compose exec app bash -lc "python py/rl/dataset.py --output data/rl_logged.csv --season-start 2020 --season-end 2024"
docker compose exec app bash -lc "python py/rl/ope_gate.py --dataset data/rl_logged.csv --output analysis/reports/ope_gate.json"Stop services:
docker compose down  # Data persists in pgdata/

Install uv: https://docs.astral.sh/uv/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv && source .venv/bin/activate
uv pip install -r requirements.txt

nfl-analytics/
├── py/ # Python modules (features, models, pricing)
│ ├── compute/ # 🆕 Distributed compute system
│ │ ├── statistics/ # Statistical testing framework
│ │ │ ├── statistical_tests.py # Permutation & bootstrap tests
│ │ │ ├── effect_size.py # Cohen's d, Cliff's delta
│ │ │ ├── multiple_comparisons.py # FDR/FWER correction
│ │ │ ├── power_analysis.py # Sample size & power
│ │ │ ├── experimental_design/ # A/B testing framework
│ │ │ └── reporting/ # Quarto/LaTeX integration
│ │ ├── hardware/ # 🆕 Hardware-aware task routing
│ │ │ └── task_router.py # M4 vs 4090 task optimization
│ │ ├── task_queue.py # Priority-based task management (WAL mode)
│ │ ├── adaptive_scheduler.py # Multi-armed bandit + hardware routing
│ │ ├── performance_tracker.py # Statistical performance tracking
│ │ └── worker.py # Distributed worker system
│ ├── features/ # Feature engineering
│ ├── models/ # ML models
│ ├── pricing/ # Pricing & risk management
│ └── rl/ # Reinforcement learning
├── R/ # R utilities
├── data/ # Data ingestion scripts
├── db/ # SQL schema and migrations
├── notebooks/ # Quarto analysis notebooks
├── tests/ # Test suite (unit, integration, e2e)
├── scripts/ # Automation scripts
├── analysis/ # Outputs, reports, dissertation
├── docker/ # Docker configuration
├── .github/workflows/ # CI/CD workflows
└── pgdata/ # PostgreSQL data volume (do not edit)
- CLAUDE.md: Comprehensive project documentation for AI assistants
- AGENTS.md: Repository guidelines and patterns
- COMPUTE_SYSTEM.md: 🆕 Distributed compute system documentation
- requirements.txt: Python dependencies
- requirements-dev.txt: Testing and development tools
- renv.lock: R package versions
- pytest.ini: Test configuration
- .pre-commit-config.yaml: Pre-commit hook configuration
- scripts/compute/run_compute.py: 🆕 Main compute system entry point
- Host: localhost:5544
- Database: devdb01
- User: dro
- Platform: PostgreSQL 17 + TimescaleDB (time-series optimization)
- Total Size: ~2.5 GB across 56 tables
The database uses a 5-schema design for clean separation of concerns:
- public - Source-of-truth NFL data (games, plays, rosters, injuries)
- mart - Analytical data mart (aggregated features, team metrics)
- predictions - ML predictions and feedback loop (game predictions, props, retrospectives)
- reference - Lookup tables (teams, stadiums, abbreviations)
- monitoring - Observability (model metrics, feature drift, alerts)
📋 Comprehensive Schema Audit: See DATABASE_SCHEMA_AUDIT.md for complete documentation including:
- Full table inventory with sizes and row counts
- Critical schema standards (see the query sketch after this list):
  - Use kickoff (not game_date or gameday) for temporal queries
  - Use spread_close and total_close (not spread_line / total_line) for betting lines
  - Weather data (temp, wind) stored as TEXT - requires parsing to numeric
  - Team abbreviations: Use LAR for Rams, LAC for Chargers
- Data quality findings and recommendations
- Integration patterns for feature engineering
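To make the schema standards above concrete, here is a hedged query sketch that follows them; psycopg2, the column placement, and the home_team filter are assumptions to be checked against DATABASE_SCHEMA_AUDIT.md:

```python
import psycopg2

# Local dev DSN from the Database section above.
DSN = "postgresql://dro:sicillionbillions@localhost:5544/devdb01"

# Illustrative query: kickoff for time filters, spread_close/total_close for lines,
# TEXT weather cast to numeric, LAR/LAC abbreviations. Column placement assumed.
QUERY = """
SELECT game_id,
       kickoff,
       spread_close,
       total_close,
       NULLIF(temp, '')::numeric AS temp_num,
       NULLIF(wind, '')::numeric AS wind_num
FROM public.games
WHERE kickoff >= %s AND kickoff < %s
  AND home_team IN ('LAR', 'LAC')
ORDER BY kickoff;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(QUERY, ("2024-09-01", "2025-03-01"))
    print(f"{cur.rowcount} rows returned")
```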
- public
  - games (game_id PK) – core game metadata and lines
  - plays ((game_id, play_id) PK) – play-by-play with EPA
  - weather (game_id PK) – temp_c, wind_kph, humidity, pressure, precip_mm
  - injuries – per-game injury status records
  - odds_history (Timescale hypertable) – bookmaker/market snapshot history
- mart
  - mart.game_summary (materialized view) – enriched game-level summary
  - mart.game_weather (materialized view) – derived weather features
  - mart.team_epa (table) – per-game EPA summaries by team
  - mart.team_4th_down_features (table) – 4th-down decision metrics
  - mart.team_playoff_context (table) – playoff probabilities/status
  - mart.team_injury_load (table) – injury load metrics by team-week
  - mart.game_features_enhanced (materialized view) – composite modeling features
Full documentation and lineage: docs/database/schema.md.
ER diagram: docs/database/erd.md (PNG: docs/database/erd.png).
Current Data (as of Dec 2025):
- Games: 7,263 rows (1999-2025)
- Plays: 1,254,173 rows (1999-2025)
- Odds: Integration via py/ingest_odds_smart.py (requires API key)
- Database runs on localhost:5544 (see docker-compose.yaml)
- Data volume is mounted at pgdata/; do not edit manually
- Keep secrets in .env; do not commit real keys
- GLM baseline table is auto-included in Chapter 4 if present: analysis/dissertation/figures/out/glm_baseline_table.tex
- Test coverage target: 60%+ overall, 80%+ for critical paths
- Testing issues: See tests/README.md
- Project context: See CLAUDE.md
- Repository patterns: See AGENTS.md
- CI/CD failures: Check .github/workflows/ logs
