najicham/nba-stats-scraper

🏀 NBA Props Platform

Production-ready NBA player props prediction and grading system



🎯 What Is This?

A comprehensive data pipeline that:

  • Scrapes NBA game data, player stats, and betting lines from multiple sources
  • Processes raw data into analytics features (1000+ metrics per player/game)
  • Predicts player prop outcomes using 7 ML systems (including ensemble models)
  • Grades predictions against actual outcomes with 70-90% coverage
  • Monitors system health with Grafana dashboards and automated alerts

Current Status: All 6 phases operational, 614 predictions generated daily across 7 systems


📚 Documentation

🚀 Quick Start

| I need to... | Go here |
|---|---|
| Get oriented | docs/00-start-here/README.md |
| Check system health | docs/STATUS-DASHBOARD.md |
| Daily operations | docs/00-start-here/DAILY-SESSION-START.md |
| Recent changes | docs/09-handoff/ (latest session handoffs) |
| System architecture | docs/01-architecture/quick-reference.md |
| Troubleshooting | docs/02-operations/troubleshooting-matrix.md |

📖 Full Documentation

All documentation lives in docs/:

docs/
├── 00-start-here/          ⭐ Start here for navigation
├── 01-architecture/        System design & decisions
├── 02-operations/          Daily ops, troubleshooting
├── 03-phases/              6 pipeline phases (orchestration → publishing)
├── 04-deployment/          Deployment guides & status
├── 05-development/         How to build (patterns, testing)
├── 06-reference/           Quick lookups (processor cards, data flow)
├── 07-monitoring/          Grafana, alerts, observability
├── 08-projects/            Active work & completed projects
└── 09-handoff/             Session handoffs & status updates

Documentation Index: docs/00-PROJECT-DOCUMENTATION-INDEX.md


🏗️ System Architecture

Pipeline Overview

Phase 1: Orchestration  →  Daily scheduling & coordination
Phase 2: Raw Data       →  Scrape from NBA.com, BallDontLie, OddsAPI
Phase 3: Analytics      →  1000+ features per player/game
Phase 4: Precompute     →  ML feature store, zone analysis
Phase 5: Predictions    →  7 systems (XGBoost, CatBoost, Ensembles)
Phase 6: Publishing     →  API endpoints, dashboards

Tech Stack:

  • Compute: Google Cloud Run, Cloud Functions, Cloud Scheduler
  • Storage: BigQuery (10+ datasets), Cloud Storage
  • Orchestration: Firestore-based distributed locks
  • ML: XGBoost, CatBoost, custom ensemble models
  • Monitoring: Cloud Monitoring, Grafana, custom alerting

📊 System Status

Last Updated: 2026-01-19 (Session 112)

Core Services

| Service | Status | Last Deploy | Notes |
|---|---|---|---|
| Prediction Worker | ✅ Operational | 2026-01-19 07:55 UTC | All 7 systems working |
| Prediction Coordinator | ✅ Operational | 2026-01-19 06:07 UTC | Fixed deployment script |
| Analytics Processors | ✅ Operational | 2026-01-19 06:23 UTC | Session 107 metrics deployed |
| Grading Function | ✅ Operational | Phase 5b | 70-90% coverage |
| Cloud Schedulers | ✅ Enabled | Multiple | Daily triggers working |

Prediction Systems

| System | Status | Performance | Volume (Jan 19) |
|---|---|---|---|
| Moving Average | ✅ | Baseline | 91 predictions |
| Zone Matchup V1 | ✅ | Matchup analysis | 91 predictions |
| Similarity Balanced V1 | ✅ | Historical | 69 predictions |
| XGBoost V1 | ✅ | ML baseline | 91 predictions |
| CatBoost V8 | ✅ | 3.40 MAE (champion) | 91 predictions |
| Ensemble V1 | ✅ | Weighted | 91 predictions |
| Ensemble V1.1 | ✅ | Performance-based (NEW) | 91 predictions |

Total: 614 predictions per day across all systems

Recent Fix (Session 112): Fixed 37-hour outage caused by missing google-cloud-firestore dependency


🚨 Recent Changes

Session 113 (2026-01-26) βœ…

  • βœ… Added comprehensive spot check system for data accuracy verification
  • βœ… 6 automated checks: rolling averages, usage rate, minutes parsing, ML features, cache, points arithmetic
  • βœ… Integrated into daily validation (5 spot checks, 95% accuracy threshold)
  • βœ… Found real data quality issues: Mo Bamba (28% rolling avg error), usage rate precision issues
  • βœ… Documentation: 599-line usage guide + troubleshooting
  • πŸ“ Full guide | Handoff
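
The 95% accuracy threshold mentioned above can be read as a pass-rate gate over individual check results. The following is a minimal sketch under that assumption; `CheckResult`, `accuracy`, and `meets_threshold` are hypothetical names for illustration and are not taken from scripts/spot_check_data_accuracy.py.

```python
from dataclasses import dataclass

ACCURACY_THRESHOLD = 0.95  # the 95% threshold quoted in the Session 113 notes


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one spot check on one sampled player/game (hypothetical shape)."""
    check_name: str
    passed: bool


def accuracy(results: list[CheckResult]) -> float:
    """Fraction of checks that passed; an empty run scores 0.0 so it cannot pass silently."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def meets_threshold(results: list[CheckResult], threshold: float = ACCURACY_THRESHOLD) -> bool:
    """True when the pass rate is at or above the threshold."""
    return accuracy(results) >= threshold
```

Treating an empty result set as a failure is a design choice in this sketch: a validation run that sampled nothing should not be reported as healthy.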

Week 0 Security (2026-01-19) 🔒

  • ✅ Fixed 13 critical security vulnerabilities (97+ individual issues)
  • ✅ SQL injection: 47 queries converted to parameterized format
  • ✅ Authentication: Added API key validation to analytics service
  • ✅ Removed RCE risks: Fixed eval() and pickle deserialization
  • ✅ Input validation: New validation library for all user inputs
  • 📝 Security log
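
To illustrate what converting a query to parameterized format means, here is a small sketch. It uses the standard-library `sqlite3` module as a local stand-in so it runs without cloud access; the project's queries target BigQuery, whose Python client has its own named-parameter mechanism. The table name, column names, and rows are made up for the example.

```python
import sqlite3


def points_for_player(conn: sqlite3.Connection, player_name: str) -> list[tuple[str, int]]:
    """Look up rows for one player using a bound parameter.

    The driver passes player_name as data, so quotes or SQL keywords inside it
    are never interpreted as part of the statement.
    """
    query = (
        "SELECT game_date, points FROM player_stats "
        "WHERE player_name = ? ORDER BY game_date"
    )
    return conn.execute(query, (player_name,)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with illustrative rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE player_stats (player_name TEXT, game_date TEXT, points INTEGER)")
    conn.executemany(
        "INSERT INTO player_stats VALUES (?, ?, ?)",
        [
            ("Player A", "2026-01-18", 21),
            ("Player A", "2026-01-19", 17),
            ("Player B", "2026-01-19", 30),
        ],
    )
    return conn
```

With string formatting, an input such as `x' OR '1'='1` would match every row; with the bound parameter it is compared literally and matches nothing.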

Session 112 (2026-01-19) 🎉

  • ✅ Fixed prediction pipeline outage (37+ hours down)
  • ✅ Root cause: Missing google-cloud-firestore==2.14.0 dependency
  • ✅ Result: All 7 systems operational, 614 predictions generated
  • 📝 Full handoff
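
A missing dependency like the one above can be caught at startup rather than at first use. The sketch below is one possible guard, not something the repository is known to contain: `missing_or_mismatched` and `REQUIRED_PINS` are hypothetical names, and only the google-cloud-firestore pin comes from the notes above.

```python
from importlib import metadata

# The pin named in the Session 112 notes; a real service would list all of its pins.
REQUIRED_PINS = {"google-cloud-firestore": "2.14.0"}


def missing_or_mismatched(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every pin that is absent or at another version."""
    problems: dict[str, str] = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if installed != wanted:
            problems[package] = f"installed {installed}, expected {wanted}"
    return problems
```

A service entry point could call `missing_or_mismatched(REQUIRED_PINS)` before serving traffic and exit with the returned problems, so a broken image fails its deploy health check instead of failing silently for hours.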

Session 111 (2026-01-19)

  • ✅ Deployed 7 Session 107 metrics (variance + star tracking)
  • ✅ Fixed analytics processor schema evolution
  • ✅ Investigated prediction failures (fixed in Session 112)

Session 110 (2026-01-18)

  • ✅ Deployed Ensemble V1.1 with performance-based weights
  • ✅ Added CatBoost V8 to ensemble (45% weight)
  • ✅ Expected MAE improvement: 5.41 → 4.9-5.1 (6-9% better)
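
As a rough illustration of performance-based weighting, the sketch below derives weights from each system's MAE and combines predictions as a weighted average. Inverse-MAE normalization is an assumption chosen for the example; the notes above state only that CatBoost V8 carries a 45% weight, not how Ensemble V1.1 computes its weights. All function names are hypothetical.

```python
def inverse_mae_weights(mae_by_system: dict[str, float]) -> dict[str, float]:
    """Weight each system by 1/MAE, normalized so the weights sum to 1."""
    if any(mae <= 0 for mae in mae_by_system.values()):
        raise ValueError("MAE values must be positive")
    inverse = {name: 1.0 / mae for name, mae in mae_by_system.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_prediction(predictions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the systems that produced a prediction.

    Weights are renormalized over the available systems, so a missing
    system does not pull the result toward zero.
    """
    available = {name: weights[name] for name in predictions if name in weights}
    total = sum(available.values())
    if total == 0:
        raise ValueError("no weighted systems produced a prediction")
    return sum(predictions[name] * w for name, w in available.items()) / total


def relative_improvement(old_mae: float, new_mae: float) -> float:
    """Fractional MAE reduction, e.g. 0.06 means 6% better."""
    return (old_mae - new_mae) / old_mae
```

Applying `relative_improvement` to the figures above gives about 0.057 for 5.41 to 5.1 and about 0.094 for 5.41 to 4.9, which matches the quoted 6-9% range after rounding.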

See full timeline: docs/STATUS-DASHBOARD.md


🛠️ Development

Prerequisites

  • Python 3.11+
  • Google Cloud SDK
  • BigQuery access
  • Service account with appropriate permissions

Environment Variables

Required (All Services):

  • GCP_PROJECT_ID - GCP project identifier (e.g., nba-props-platform)
  • ENVIRONMENT - Environment name (dev, staging, prod)

Security (Week 0 - Required as of 2026-01-19):

  • VALID_API_KEYS - Comma-separated API keys for analytics service authentication
  • BETTINGPROS_API_KEY - BettingPros API key (moved from hardcoded)
  • SENTRY_DSN - Sentry monitoring DSN (moved from hardcoded)

Optional:

  • SLACK_WEBHOOK_URL - Slack notifications
  • GOOGLE_APPLICATION_CREDENTIALS - Path to service account key file

See deployment guide for configuration details.
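
The variables above could be loaded and validated at startup along these lines. This is a sketch, not the project's configuration code: `Settings`, `load_settings`, and `is_valid_api_key` are hypothetical names, and it covers only GCP_PROJECT_ID, ENVIRONMENT, and VALID_API_KEYS, since which services need the other variables is not stated here.

```python
import hmac
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("GCP_PROJECT_ID", "ENVIRONMENT", "VALID_API_KEYS")
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


@dataclass(frozen=True)
class Settings:
    project_id: str
    environment: str
    api_keys: frozenset[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required variables and fail with one message listing everything missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    environment = env["ENVIRONMENT"].strip()
    if environment not in ALLOWED_ENVIRONMENTS:
        raise RuntimeError(f"ENVIRONMENT must be one of {ALLOWED_ENVIRONMENTS}, got {environment!r}")
    # VALID_API_KEYS is comma-separated; blanks from stray commas are ignored.
    keys = frozenset(k.strip() for k in env["VALID_API_KEYS"].split(",") if k.strip())
    if not keys:
        raise RuntimeError("VALID_API_KEYS contains no usable keys")
    return Settings(env["GCP_PROJECT_ID"].strip(), environment, keys)


def is_valid_api_key(candidate: str, settings: Settings) -> bool:
    """Compare against every configured key with hmac.compare_digest."""
    encoded = candidate.encode("utf-8")
    matched = False
    for key in settings.api_keys:
        # compare_digest avoids leaking how many leading bytes matched.
        matched |= hmac.compare_digest(encoded, key.encode("utf-8"))
    return matched
```

Collecting every missing variable into one error keeps a misconfigured deploy from failing once per variable.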

Quick Commands

# Check system health
./monitoring/check-system-health.sh

# Run data accuracy spot checks
python scripts/spot_check_data_accuracy.py --samples 10

# Validate tonight's data
python scripts/validate_tonight_data.py

# Deploy prediction worker
bash bin/predictions/deploy/deploy_prediction_worker.sh

# Deploy analytics processors
bash bin/analytics/deploy/deploy_analytics_processors.sh

# Trigger manual predictions
curl -X POST "https://prediction-coordinator-[PROJECT].run.app/start" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"force": true, "game_date": "2026-01-19"}'
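
The manual trigger above can also be issued from Python. The sketch below builds the same POST with the standard library only; `build_start_request` and `fetch_identity_token` are hypothetical helper names, and the base URL must be supplied by the caller because the `[PROJECT]` placeholder is left unfilled here.

```python
import json
import subprocess
import urllib.request


def fetch_identity_token() -> str:
    """Obtain an identity token via the same gcloud command used in the curl example."""
    result = subprocess.run(
        ["gcloud", "auth", "print-identity-token"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_start_request(
    base_url: str, identity_token: str, game_date: str, force: bool = True
) -> urllib.request.Request:
    """Build (but do not send) the POST to the coordinator's /start endpoint."""
    body = json.dumps({"force": force, "game_date": game_date}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/start",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {identity_token}",
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=30)`. Keeping construction separate from sending makes the payload easy to inspect or test without network access.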

Project Structure

├── bin/                    # Deployment scripts
├── data_processors/        # Analytics & precompute processors
├── predictions/            # ML prediction systems
│   ├── coordinator/        # Batch coordinator
│   └── worker/             # Prediction worker (7 systems)
├── scrapers/               # Raw data scrapers
├── shared/                 # Shared utilities
├── monitoring/             # Health checks & alerts
├── schemas/                # BigQuery schemas
└── docs/                   # Documentation (main resource)

📞 Support & Contact

For Issues

  1. Check recent handoffs: docs/09-handoff/
  2. Review troubleshooting guide: docs/02-operations/troubleshooting-matrix.md
  3. Check system status: docs/STATUS-DASHBOARD.md

For AI Sessions

Starting a new Claude Code session?

  1. Read docs/09-handoff/ for latest status
  2. Review docs/00-start-here/DAILY-SESSION-START.md
  3. Check docs/STATUS-DASHBOARD.md for current health

📊 Key Metrics

  • Prediction Coverage: 150+ players per day
  • Grading Coverage: 70-90%
  • Best Model: CatBoost V8 (3.40 MAE)
  • Systems: 7 concurrent prediction systems
  • Daily Volume: 614 predictions
  • Uptime: 99%+ (after Session 112 fix)

📄 License

Proprietary - All Rights Reserved


Project Contact: NBA Props Platform Team
GCP Project: nba-props-platform
Region: us-west2 (Los Angeles)
Documentation: docs/


🚀 Week 1-4 Improvement Plan (NEW!)

Status: Ready to execute after Week 0 validation
Timeline: 4 weeks, 42 hours total
Goal: 99.7% reliability + $170/month savings + 5x performance

Quick Links

Week 1 Focus: Cost & Reliability Sprint (12 hours)

  • 💰 BigQuery optimization: $60-90/month savings
  • 🔧 Critical scalability fixes
  • 🛡️ Idempotency & data integrity
  • 📈 Structured logging & metrics

Next: Validate Quick Win #1 tomorrow (Jan 21, 8:30 AM ET), then begin Week 1!
