Dev#30
Merged
- tune.py runs a study for one agent at a time (q_learning, sarsa, dqn)
- TPESampler-driven Bayesian search over agent-specific param spaces:
  - q_learning/sarsa: learning_rate, epsilon schedule, decay episodes
  - dqn: lr, hidden_sizes (categorical), batch_size, target_update_interval
- Per-trial budget reduced (600 ep for tabular, 300 for dqn) to keep tuning fast; winner gets full production budget when re-trained in Step 23
- Each trial logged as nested MLflow run under a parent study run; clean hierarchy visible in UI
- save_best_config writes configs/<agent>_tuned.yaml with the winning hyperparameters at full production episode budget
- Smoke-tested with 2 trials end-to-end; configs/<agent>_tuned.yaml roundtrip successful

Refs #8
- multi_seed_eval.py trains the same config N times with different seeds and aggregates eval metrics with mean / std / min / max
- Parent MLflow run owns N nested per-seed runs for clean UI hierarchy
- Per-seed JSONs + summary.json written to experiments/multi_seed/<run_id>/
- Each seed evaluates on 5 held-out eval seeds; total 25 eval episodes per config for robust statistics
- aggregate_results computes the full distribution including range
- 3 tests cover aggregation logic and end-to-end mini run

Satisfies CO1 cross-validation requirement: instead of one number per algorithm, every comparison reports mean ± std across 5 independent training runs.
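The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in multi_seed_eval.py: the function signature, the dictionary keys, and the use of the sample standard deviation are assumptions, and the reward values are made up.

```python
import statistics


def aggregate_results(per_seed_values):
    """Summarise one eval metric across independent training seeds.

    Hypothetical signature: takes one number per training seed.
    """
    values = list(per_seed_values)
    if not values:
        raise ValueError("need at least one per-seed result")
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        # Assumption: sample std (n - 1 denominator); undefined for one
        # run, so 0.0 is reported in that case.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# Illustrative rewards from 5 training seeds (invented numbers).
summary = aggregate_results([10.0, 12.0, 14.0, 16.0, 18.0])
print(f"{summary['mean']:.2f} ± {summary['std']:.2f} "
      f"(range {summary['min']:.1f} to {summary['max']:.1f})")
```

Reporting min and max next to mean ± std makes a single unstable training run visible instead of letting it hide inside the average.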
- Aggregates multi-seed eval summaries with greedy/random baselines from MLflow
- Generates 3 outputs in experiments/figures/:
  - sprint6_comparison.md (drop-in for the report)
  - sprint6_comparison.csv (raw data for further analysis)
  - sprint6_comparison.png (bar chart with error bars, dark theme)
- Multi-seed learners color-coded green with std error bars; baselines gray
- Sorted by descending eval reward so the leader is leftmost
- configs/q_learning_tuned.yaml: Optuna-tuned (30 trials)
- configs/sarsa_tuned.yaml: Optuna-tuned (30 trials)
- configs/dqn_tuned.yaml: Optuna-tuned (15 trials)
- scripts/register_tuned_models.py: registers multi-seed eval results as new versions in the MLflow Model Registry

Closes #8
Feature/hyperparam tuning
- api/schemas.py: Pydantic v2 models for request/response with field validation (rejects NaN/inf observations)
- api/main.py: FastAPI app with /health, /info, /metrics, /predict endpoints
- Module-level state holds loaded policy, model info, prediction counters, rolling latency window
- DQN-specific shortcut extracts Q-values from policy.q_net for response
- _interpret_action maps integer action -> (kind, target_index) using num_donors/num_shelters from model info
- Prediction logging is best-effort (failures don't break the response)
- lifespan context manager loads policy on startup
- /predict returns 503 if no model loaded; 422 if obs has wrong dim

Refs #5
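A rough sketch of the observation checks and the action decoding listed above, written with the standard library only. The action layout (donor indices first, then shelter indices, then a single idle action) is an assumption made for illustration; the commit does not state it. The observation size of 31 and the donor/shelter counts of 5 are taken from other commits in this PR, and the function names are hypothetical stand-ins for the Pydantic validators and _interpret_action.

```python
import math

OBS_DIM = 31  # taken from the demo commit in this PR; treat as an assumption


def validate_observation(obs, obs_dim=OBS_DIM):
    """Reject wrong-sized or non-finite observations (hypothetical helper)."""
    if len(obs) != obs_dim:
        raise ValueError(f"expected {obs_dim} features, got {len(obs)}")
    if not all(math.isfinite(x) for x in obs):
        raise ValueError("observation contains NaN or inf")
    return [float(x) for x in obs]


def interpret_action(action, num_donors=5, num_shelters=5):
    """Map an integer action to (kind, target_index).

    Assumed layout: [0, num_donors) are donors, the next num_shelters
    are shelters, and the final index is idle.
    """
    if 0 <= action < num_donors:
        return ("donor", action)
    if num_donors <= action < num_donors + num_shelters:
        return ("shelter", action - num_donors)
    if action == num_donors + num_shelters:
        return ("idle", None)
    raise ValueError(f"action {action} is out of range")


decoded = [interpret_action(a) for a in (0, 7, 10)]
print(decoded)
```

In the service itself a failed check would surface as the 422 response mentioned above; here it is a plain ValueError.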
…ution

- api/policy_loader.py loads a DQN policy from one of three sources:
  1. MLflow Model Registry (FOOD_RESCUE_MODEL_NAME + _VERSION env vars)
  2. Local file or directory (FOOD_RESCUE_MODEL_PATH)
  3. Default convention: experiments/policies/dqn_tuned.pt or dqn_v1.pt
- Only DQN supported for serving (tabular agents need env-derived state)
- _load_from_mlflow_registry uses mlflow.artifacts.download_artifacts
- meta.json sidecar provides obs_dim, num_actions; num_donors/num_shelters hardcoded to 5 (matches our scenarios)
- api/prediction_log.py is a stub; full SQLite impl in Step 28

Service starts cleanly via 'uvicorn api.main:app --port 8000', all four endpoints (/health, /info, /metrics, /predict) tested end-to-end with curl.
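The three-source lookup order could look something like the sketch below. It only decides where to load from and does not touch MLflow or PyTorch. Several details are assumptions: the version variable is read as FOOD_RESCUE_MODEL_VERSION, both registry variables are required together, dqn_tuned.pt is preferred over dqn_v1.pt, and the function name and the model name "dqn" are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_policy_source(env, policies_dir):
    """Return (kind, location) for the first source that applies."""
    name = env.get("FOOD_RESCUE_MODEL_NAME")
    version = env.get("FOOD_RESCUE_MODEL_VERSION")  # assumed variable name
    if name and version:
        # 1. MLflow Model Registry, expressed as a models:/ URI.
        return ("registry", f"models:/{name}/{version}")

    path = env.get("FOOD_RESCUE_MODEL_PATH")
    if path:
        # 2. Explicit local file or directory.
        return ("file", path)

    # 3. Default convention; tuned policy first (assumed preference order).
    for filename in ("dqn_tuned.pt", "dqn_v1.pt"):
        candidate = Path(policies_dir) / filename
        if candidate.exists():
            return ("file", str(candidate))
    return ("none", None)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dqn_v1.pt").touch()
    from_default = resolve_policy_source({}, tmp)
    from_registry = resolve_policy_source(
        {"FOOD_RESCUE_MODEL_NAME": "dqn", "FOOD_RESCUE_MODEL_VERSION": "3"}, tmp
    )
    from_missing = resolve_policy_source({}, Path(tmp) / "empty")

print(from_default[0], from_registry, from_missing)
```

Passing the environment in as a mapping (os.environ in production) keeps the resolution logic testable without mutating process state.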
- api/prediction_log.py replaces the Step 27 stub
- log_prediction() writes request_id, timestamp, observation (JSON), action, action_kind, model name/version, latency_ms per prediction
- fetch_recent() and fetch_observations() used by drift detector + dashboard
- DB path defaults to experiments/prediction_log.db; override with FOOD_RESCUE_LOG_DB env var (needed for Docker volume mounts)
- CREATE TABLE IF NOT EXISTS is idempotent; no migration tooling needed
- New connection per request avoids SQLite threading issues

Refs #5
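A minimal sketch of the logging pattern described above: an idempotent schema, one fresh connection per call, and observations stored as JSON. The table name, column types, and function signatures are assumptions for illustration and may differ from api/prediction_log.py; the model name and values in the demo are invented.

```python
import json
import sqlite3
import tempfile
import time
import uuid
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    request_id    TEXT PRIMARY KEY,
    timestamp     REAL NOT NULL,
    observation   TEXT NOT NULL,   -- JSON-encoded list of floats
    action        INTEGER NOT NULL,
    action_kind   TEXT NOT NULL,
    model_name    TEXT,
    model_version TEXT,
    latency_ms    REAL NOT NULL
)
"""


def _connect(db_path):
    # New connection per call; running the schema every time is safe
    # because CREATE TABLE IF NOT EXISTS is idempotent.
    conn = sqlite3.connect(str(db_path))
    conn.execute(SCHEMA)
    return conn


def log_prediction(db_path, observation, action, action_kind,
                   model_name, model_version, latency_ms):
    request_id = str(uuid.uuid4())
    conn = _connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (request_id, time.time(), json.dumps(observation), action,
                 action_kind, model_name, model_version, latency_ms),
            )
    finally:
        conn.close()
    return request_id


def fetch_recent(db_path, limit=50):
    conn = _connect(db_path)
    try:
        return conn.execute(
            "SELECT request_id, action, action_kind, latency_ms "
            "FROM predictions ORDER BY timestamp DESC, rowid DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "prediction_log.db"
    first_id = log_prediction(db, [0.1, 0.2], 2, "donor", "dqn", "1", 3.5)
    second_id = log_prediction(db, [0.3, 0.4], 7, "shelter", "dqn", "1", 4.1)
    rows = fetch_recent(db)

print(rows)
```

Since the commit describes logging as best-effort, a caller would typically wrap log_prediction in a try/except so a database failure never breaks the /predict response.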
- monitoring/drift_detector.py: per-feature KS test comparing live request obs (from prediction_log.db) vs training distribution (rolled out from all 3 scenarios, n_reference_episodes=20)
  - DriftReport dataclass with summary(), drifted_features list, p-values
  - Needs ≥30 live samples before reporting drift (avoids false positives)
  - Reference distribution cached in memory after first build
- monitoring/dashboard.py: Streamlit app reading from the prediction log
  - Service health panel (polls /health, /info, /metrics)
  - On-demand drift check with per-feature p-value table
  - Action distribution bar chart + latency line chart
  - Recent predictions table (last 50)
- scipy added to requirements.txt (ks_2samp)

Refs #5
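To illustrate the per-feature check and the 30-sample gate, here is a dependency-free sketch. It is not the detector from this commit: the real code uses scipy's ks_2samp and reports p-values, whereas this version computes the two-sample KS statistic by hand and compares it with the standard large-sample critical value. The function names, the alpha of 0.05, and the synthetic data are assumptions.

```python
import math
from bisect import bisect_right

MIN_LIVE_SAMPLES = 30  # gate from the commit message


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def drifted_features(reference, live, alpha=0.05):
    """Return indices of drifted features, or None if too few live rows.

    reference and live are lists of equal-width observation rows.
    """
    if len(live) < MIN_LIVE_SAMPLES:
        return None
    n, m = len(reference), len(live)
    # Large-sample approximation: reject when D > c(alpha) * sqrt((n+m)/(n*m)),
    # with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    flagged = []
    for j in range(len(reference[0])):
        d = ks_statistic([row[j] for row in reference], [row[j] for row in live])
        if d > critical:
            flagged.append(j)
    return flagged


# Synthetic data: feature 0 matches the reference, feature 1 is shifted by +2.
reference = [[i / 100, i / 100] for i in range(100)]
live = [[i / 40, 2 + i / 40] for i in range(40)]

flagged = drifted_features(reference, live)
too_few = drifted_features(reference, live[:10])
print(flagged, too_few)
```

Returning None below the sample threshold keeps "not enough data" distinct from "no drift", which matters when the dashboard renders the result.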
…k state

- Extract load_policy_from_env_if_needed() wrapper in api/main.py
- autouse fixture patches the wrapper, preventing real policy load
- Fix test_predict_donor/shelter/idle to set state[policy] directly instead of patching q_net return_value after client creation
Feature/serving api
…CORS, fix dup tests

- requirements.txt: removed double scipy + double httpx entries
- docker-compose.yml: removed deprecated 'version' top-level key
- api/main.py: added CORSMiddleware for browser-based demo
- pyproject.toml: add ruff config with sensible ignores
- tests: remove duplicate test_no_update_in_eval_mode definitions that were silently shadowing each other
fix(sprint8): deduplicate deps, remove obsolete compose version, add …
CI (.github/workflows/ci.yml):
- Lint with ruff
- Run pytest with coverage on Python 3.11
- Build both Docker images on push to main/dev
- Smoke-test the serve image by hitting /health

CD (.github/workflows/cd.yml):
- workflow_dispatch with a config dropdown to pick which agent to train
- Auto-trigger on pushes to main with [retrain] in the commit message
- Upload trained policy and MLflow runs as artifacts

Also adds .github/pull_request_template.md to enforce the What/Why/How-to-test/Closes structure for all future PRs.
data_prep.py requires the --scenario argument; 'all' processes all three scenarios (weekday, weekend, holiday_rush) which is what the tests expect to be present.
feat(cicd): add GitHub Actions CI and CD workflows
k8s/:
- 00-namespace.yaml, 10-mlflow.yaml, 20-api.yaml, 30-train-job.yaml
- README.md with apply instructions + ArgoCD GitOps example
- Manifests aren't deployed (no cluster); they document the GitOps pattern

README.md (root):
- Quick demo + docker-compose reproduction
- MLOps capability matrix linking to rubric requirements
- Honest results table linking to MODEL_IMPROVEMENT.md
feat(sprint10): K8s manifests
Previously the 'DQN' option silently fell back to the JS-side greedy
policy — the trained model at experiments/policies/dqn_v1.pt was
never actually used.
Changes:
- Add 'DQN (via API)' dropdown option
- New buildObservationFor(v) constructs the 31-dim vector matching
sim/environment.py:_get_observation exactly
- New policyDqnApi async function POSTs to /predict and decodes the
returned action via decodeAction()
- API status pill in the topbar (green/amber/red)
- Graceful fallback to greedy if the API is unreachable
- stepSim() is now async with for/of loop instead of forEach (because
forEach doesn't await async callbacks)
- Step + play button handlers updated to await stepSim()
API base URL is overridable via localStorage.setItem('api_base', ...)
feat: wire index.html demo to FastAPI /predict endpoint
Phase 1 final: full MLOps stack — Sprints 1-10