Dev#30

Merged
MrPhantom2325 merged 32 commits into main from dev on May 13, 2026

@MrPhantom2325
Owner

Phase 1 final: full MLOps stack — Sprints 1-10

MrPhantom2325 and others added 30 commits May 11, 2026 03:16
- tune.py runs a study for one agent at a time (q_learning, sarsa, dqn)
- TPESampler-driven Bayesian search over agent-specific param spaces:
  - q_learning/sarsa: learning_rate, epsilon schedule, decay episodes
  - dqn: lr, hidden_sizes (categorical), batch_size, target_update_interval
- Per-trial budget reduced (600 ep for tabular, 300 for dqn) to keep tuning
  fast; winner gets full production budget when re-trained in Step 23
- Each trial logged as nested MLflow run under a parent study run; clean
  hierarchy visible in UI
- save_best_config writes configs/<agent>_tuned.yaml with the winning
  hyperparameters at full production episode budget
- Smoke-tested with 2 trials end-to-end; configs/<agent>_tuned.yaml roundtrip
  successful

Refs #8
- multi_seed_eval.py trains the same config N times with different seeds
  and aggregates eval metrics with mean / std / min / max
- Parent MLflow run owns N nested per-seed runs for clean UI hierarchy
- Per-seed JSONs + summary.json written to experiments/multi_seed/<run_id>/
- Each of the 5 training seeds is evaluated on 5 held-out eval seeds, giving
  25 eval episodes per config for robust statistics
- aggregate_results computes the full distribution including range
- 3 tests cover aggregation logic and end-to-end mini run

Satisfies CO1 cross-validation requirement: instead of one number per
algorithm, every comparison reports mean ± std across 5 independent
training runs.
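
The aggregation step described above can be sketched as follows. This is a minimal illustration, not the code in multi_seed_eval.py: the input shape (one flat dict of metrics per seed) and the choice of sample standard deviation are assumptions.

```python
import statistics
from typing import Dict, List


def aggregate_results(per_seed: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Collapse per-seed eval metrics into mean / std / min / max / range.

    ASSUMPTION: every per-seed dict carries the same metric keys.
    """
    if not per_seed:
        raise ValueError("need at least one per-seed result")
    summary: Dict[str, Dict[str, float]] = {}
    for key in per_seed[0]:
        values = [run[key] for run in per_seed]
        summary[key] = {
            "mean": statistics.fmean(values),
            # Sample std (n - 1 denominator); reported as 0.0 for a single seed.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
            "range": max(values) - min(values),
        }
    return summary
```

Reporting the range next to the standard deviation makes a single outlier seed visible even when the mean looks stable.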
- Aggregates multi-seed eval summaries with greedy/random baselines from MLflow
- Generates 3 outputs in experiments/figures/:
  - sprint6_comparison.md (drop-in for the report)
  - sprint6_comparison.csv (raw data for further analysis)
  - sprint6_comparison.png (bar chart with error bars, dark theme)
- Multi-seed learners color-coded green with std error bars; baselines gray
- Sorted by descending eval reward so the leader is leftmost
- configs/q_learning_tuned.yaml: Optuna-tuned (30 trials)
- configs/sarsa_tuned.yaml: Optuna-tuned (30 trials)
- configs/dqn_tuned.yaml: Optuna-tuned (15 trials)
- scripts/register_tuned_models.py: registers multi-seed eval results
  as new versions in the MLflow Model Registry

Closes #8
- api/schemas.py: Pydantic v2 models for request/response with field validation
  (rejects NaN/inf observations)
- api/main.py: FastAPI app with /health, /info, /metrics, /predict endpoints
- Module-level state holds loaded policy, model info, prediction counters,
  rolling latency window
- DQN-specific shortcut extracts Q-values from policy.q_net for response
- _interpret_action maps integer action -> (kind, target_index) using
  num_donors/num_shelters from model info
- Prediction logging is best-effort (failures don't break the response)
- lifespan context manager loads policy on startup
- /predict returns 503 if no model loaded; 422 if obs has wrong dim

Refs #5
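
To illustrate the integer action to (kind, target_index) mapping mentioned above, here is a hedged sketch. The action layout (donor indices first, then shelter indices, then one idle action) is an assumption made for illustration; the real layout is defined by the environment and may differ. The function name drops the leading underscore to mark it as hypothetical.

```python
from typing import Optional, Tuple


def interpret_action(
    action: int, num_donors: int = 5, num_shelters: int = 5
) -> Tuple[str, Optional[int]]:
    """Map a flat action id to (kind, target_index).

    ASSUMED layout (not confirmed by the source):
      [0, num_donors)                          -> visit donor i
      [num_donors, num_donors + num_shelters)  -> visit shelter j
      num_donors + num_shelters                -> idle
    """
    if action < 0:
        raise ValueError(f"action must be non-negative, got {action}")
    if action < num_donors:
        return ("donor", action)
    if action < num_donors + num_shelters:
        return ("shelter", action - num_donors)
    if action == num_donors + num_shelters:
        return ("idle", None)
    raise ValueError(f"action {action} is outside the action space")
```

For example, under this assumed layout `interpret_action(7)` yields `("shelter", 2)`.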
…ution

- api/policy_loader.py loads a DQN policy from one of three sources:
  1. MLflow Model Registry (FOOD_RESCUE_MODEL_NAME + _VERSION env vars)
  2. Local file or directory (FOOD_RESCUE_MODEL_PATH)
  3. Default convention: experiments/policies/dqn_tuned.pt or dqn_v1.pt
- Only DQN supported for serving (tabular agents need env-derived state)
- _load_from_mlflow_registry uses mlflow.artifacts.download_artifacts
- meta.json sidecar provides obs_dim, num_actions; num_donors/num_shelters
  hardcoded to 5 (matches our scenarios)
- api/prediction_log.py is a stub; full SQLite impl in Step 28

Service starts cleanly via 'uvicorn api.main:app --port 8000', all four
endpoints (/health, /info, /metrics, /predict) tested end-to-end with curl.
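
The three-source precedence above could be resolved along these lines. This is a sketch under assumptions: the full name FOOD_RESCUE_MODEL_VERSION is inferred from the "_VERSION" shorthand, the "latest" fallback and the returned (kind, location) tuple are hypothetical, and the actual download and torch loading steps are left out.

```python
import os
from typing import Callable, Mapping, Tuple

# Default convention from the commit message, tried in this order.
DEFAULT_CANDIDATES = (
    "experiments/policies/dqn_tuned.pt",
    "experiments/policies/dqn_v1.pt",
)


def resolve_policy_source(
    env: Mapping[str, str] = os.environ,
    exists: Callable[[str], bool] = os.path.exists,
) -> Tuple[str, str]:
    """Return (kind, location) for the first policy source that applies."""
    # 1. MLflow Model Registry, selected by name (+ optional version).
    name = env.get("FOOD_RESCUE_MODEL_NAME")
    if name:
        # ASSUMPTION: missing version falls back to "latest".
        version = env.get("FOOD_RESCUE_MODEL_VERSION", "latest")
        return ("registry", f"models:/{name}/{version}")
    # 2. Explicit local file or directory.
    path = env.get("FOOD_RESCUE_MODEL_PATH")
    if path:
        return ("path", path)
    # 3. Default convention: first candidate that exists on disk.
    for candidate in DEFAULT_CANDIDATES:
        if exists(candidate):
            return ("default", candidate)
    return ("none", "")
```

Passing `env` and `exists` as parameters keeps the precedence logic testable without touching real environment variables or the filesystem.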
- api/prediction_log.py replaces the Step 27 stub
- log_prediction() writes request_id, timestamp, observation (JSON),
  action, action_kind, model name/version, latency_ms per prediction
- fetch_recent() and fetch_observations() used by drift detector + dashboard
- DB path defaults to experiments/prediction_log.db; override with
  FOOD_RESCUE_LOG_DB env var (needed for Docker volume mounts)
- CREATE TABLE IF NOT EXISTS is idempotent; no migration tooling needed
- New connection per request avoids SQLite threading issues

Refs #5
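
A minimal sketch of the logging pattern described above, using only the standard library. The table name, column types, and the `db_path` parameter are assumptions; the column set, default path, env var override, idempotent schema creation, and one-connection-per-call behaviour follow the commit message.

```python
import json
import os
import sqlite3
import time
import uuid
from typing import Any, Dict, List, Optional

DEFAULT_DB = "experiments/prediction_log.db"

# Idempotent: safe to run on every connection, so no migration step is needed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    request_id    TEXT PRIMARY KEY,
    timestamp     REAL NOT NULL,
    observation   TEXT NOT NULL,
    action        INTEGER NOT NULL,
    action_kind   TEXT NOT NULL,
    model_name    TEXT,
    model_version TEXT,
    latency_ms    REAL
)
"""


def _connect(db_path: Optional[str] = None) -> sqlite3.Connection:
    """Open a fresh connection per call (avoids sharing one across threads)."""
    path = db_path or os.environ.get("FOOD_RESCUE_LOG_DB", DEFAULT_DB)
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_prediction(observation: List[float], action: int, action_kind: str,
                   model_name: str, model_version: str, latency_ms: float,
                   db_path: Optional[str] = None) -> str:
    """Insert one prediction row and return its generated request_id."""
    request_id = str(uuid.uuid4())
    conn = _connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (request_id, time.time(), json.dumps(observation), action,
                 action_kind, model_name, model_version, latency_ms),
            )
    finally:
        conn.close()
    return request_id


def fetch_recent(limit: int = 50, db_path: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return the newest rows first; rowid breaks ties between equal timestamps."""
    conn = _connect(db_path)
    try:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT * FROM predictions ORDER BY timestamp DESC, rowid DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()
    return [dict(row) for row in rows]
```

In the API, the call to `log_prediction` would sit inside a try/except so that a logging failure never breaks the `/predict` response.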
- monitoring/drift_detector.py: per-feature KS test comparing live request
  obs (from prediction_log.db) vs training distribution (rolled out from
  all 3 scenarios, n_reference_episodes=20)
  - DriftReport dataclass with summary(), drifted_features list, p-values
  - Needs ≥30 live samples before reporting drift (avoids false positives)
  - Reference distribution cached in memory after first build
- monitoring/dashboard.py: Streamlit app reading from the prediction log
  - Service health panel (polls /health, /info, /metrics)
  - On-demand drift check with per-feature p-value table
  - Action distribution bar chart + latency line chart
  - Recent predictions table (last 50)
- scipy added to requirements.txt (ks_2samp)

Refs #5
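
The per-feature drift check above can be sketched without SciPy. Note the swap: the commit uses `scipy.stats.ks_2samp`, while this illustration computes the two-sample KS statistic by hand and uses the standard asymptotic series for the p-value, so its p-values are approximations. The `alpha=0.05` threshold and the `DriftReport` field layout are assumptions; the 30-sample minimum comes from the commit message.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sample KS statistic D and an asymptotic (approximate) p-value."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(a), sorted(b)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk both sorted samples, tracking the largest gap between the ECDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        return d, 1.0  # survival function is ~1 here; the series converges poorly
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(max(2.0 * series, 0.0), 1.0)


@dataclass
class DriftReport:
    p_values: List[float]
    drifted_features: List[int]

    def summary(self) -> str:
        if not self.drifted_features:
            return "no drift detected"
        return f"drift detected in features {self.drifted_features}"


def detect_drift(reference: List[List[float]], live: List[List[float]],
                 alpha: float = 0.05, min_samples: int = 30) -> Optional[DriftReport]:
    """Compare each observation feature; None means too few live samples."""
    if len(live) < min_samples:
        return None
    p_values = []
    for f in range(len(reference[0])):
        _, p = ks_two_sample([row[f] for row in reference], [row[f] for row in live])
        p_values.append(p)
    drifted = [f for f, p in enumerate(p_values) if p < alpha]
    return DriftReport(p_values=p_values, drifted_features=drifted)
```

Because one test is run per feature, a 31-dimensional observation gives 31 chances for a false alarm at a fixed `alpha`; a multiple-comparison correction may be worth considering.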
…k state

- Extract load_policy_from_env_if_needed() wrapper in api/main.py
- autouse fixture patches the wrapper, preventing real policy load
- Fix test_predict_donor/shelter/idle to set state[policy] directly
  instead of patching q_net return_value after client creation
…CORS, fix dup tests

- requirements.txt: removed double scipy + double httpx entries
- docker-compose.yml: removed deprecated 'version' top-level key
- api/main.py: added CORSMiddleware for browser-based demo
- pyproject.toml: add ruff config with sensible ignores
- tests: remove duplicate test_no_update_in_eval_mode definitions that
  were silently shadowing each other
fix(sprint8): deduplicate deps, remove obsolete compose version, add …
CI (.github/workflows/ci.yml):
- Lint with ruff
- Run pytest with coverage on Python 3.11
- Build both Docker images on push to main/dev
- Smoke-test the serve image by hitting /health

CD (.github/workflows/cd.yml):
- workflow_dispatch with a config dropdown to pick which agent to train
- Auto-trigger on pushes to main with [retrain] in the commit message
- Upload trained policy and MLflow runs as artifacts

Also adds .github/pull_request_template.md to enforce the
What/Why/How-to-test/Closes structure for all future PRs.
data_prep.py requires the --scenario argument; 'all' processes all
three scenarios (weekday, weekend, holiday_rush) which is what the
tests expect to be present.
feat(cicd): add GitHub Actions CI and CD workflows
k8s/:
- 00-namespace.yaml, 10-mlflow.yaml, 20-api.yaml, 30-train-job.yaml
- README.md with apply instructions + ArgoCD GitOps example
- Manifests aren't deployed (no cluster); they document the GitOps pattern

README.md (root):
- Quick demo + docker-compose reproduction
- MLOps capability matrix linking to rubric requirements
- Honest results table linking to MODEL_IMPROVEMENT.md
Previously the 'DQN' option silently fell back to the JS-side greedy
policy — the trained model at experiments/policies/dqn_v1.pt was
never actually used.

Changes:
- Add 'DQN (via API)' dropdown option
- New buildObservationFor(v) constructs the 31-dim vector matching
  sim/environment.py:_get_observation exactly
- New policyDqnApi async function POSTs to /predict and decodes the
  returned action via decodeAction()
- API status pill in the topbar (green/amber/red)
- Graceful fallback to greedy if the API is unreachable
- stepSim() is now async with for/of loop instead of forEach (because
  forEach doesn't await async callbacks)
- Step + play button handlers updated to await stepSim()

API base URL is overridable via localStorage.setItem('api_base', ...)
feat: wire index.html demo to FastAPI /predict endpoint
@MrPhantom2325 MrPhantom2325 merged commit 08b55d1 into main May 13, 2026
8 checks passed

2 participants