A step-by-step checklist to confirm the W&B integration (Weave tracing, W&B runs, evaluations, and the MCP server) actually works end to end. Run the phases in order; each says what to do, what to expect, and what to capture if it fails. Commands are PowerShell (Windows + Docker).
- .env is correct. Open .env and confirm:
  - WANDB_API_KEY= your key (from https://wandb.ai/authorize)
  - WANDB_ENTITY= your entity. This is not a key; it is your W&B username (personal) or team/org name. Find it in the top-left team switcher on wandb.ai, or in your profile URL wandb.ai/<entity>.
  - WANDB_PROJECT=loopprep (or your choice)
  - TAVILY_API_KEY= your key (from https://app.tavily.com)
  - No ANTHROPIC_* lines (no longer used).
- WANDB_API_KEY is also set as a Windows user environment variable (needed for the MCP server, which runs inside Claude Code, not Docker). Set it with the command below, then restart Claude Code:
  [Environment]::SetEnvironmentVariable("WANDB_API_KEY", "<your-wandb-key>", "User")
Expected: all four values present, Anthropic removed.
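The .env checks above can also be scripted. The following is a minimal Python sketch, assuming the file uses plain KEY=value lines with optional # comments; the helper name check_env_text and its return shape are hypothetical and not part of the project.

```python
# Keys the checklist expects to find in .env.
REQUIRED_KEYS = ("WANDB_API_KEY", "WANDB_ENTITY", "WANDB_PROJECT", "TAVILY_API_KEY")


def check_env_text(text):
    """Return (missing_keys, leftover_anthropic_keys) for .env-style text.

    Hypothetical helper: assumes one KEY=value pair per line, '#' comments,
    and no multi-line or shell-expanded values.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")

    # A key counts as missing if it is absent or set to an empty value.
    missing = [k for k in REQUIRED_KEYS if not values.get(k)]
    # Any ANTHROPIC_* key still present should be removed.
    leftovers = sorted(k for k in values if k.startswith("ANTHROPIC_"))
    return missing, leftovers
```

Calling it with the contents of .env (for example, the result of reading the file from the project root) should return two empty lists when this step passes. It only checks that values are present; it cannot tell whether a key is valid.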
- Build the image:
docker compose build backend
- Confirm the app imports (catches broken imports before runtime):
docker compose run --rm backend python -c "import app.main; print('import OK')"
Expected: build succeeds; prints import OK.
If it fails: capture the full traceback — an ImportError/NameError here
means a code wiring issue, not a W&B problem.
- Start everything:
docker compose up --build
- Backend liveness/readiness:
Invoke-RestMethod http://localhost:8000/api/health
Invoke-RestMethod http://localhost:8000/api/ready
- Frontend loads: open http://localhost:5173
- API docs load: open http://localhost:8000/docs
Expected: health/ready return OK; pages load.
If it fails: the backend should refuse to start if a required env var is
missing (by design) — check the docker compose up logs for
Missing required environment variables: ....
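Because the backend can take a moment to come up after docker compose up, the health and ready checks may need a few retries. This is a minimal Python sketch of such a polling loop, using only the standard library. The base URL and paths come from the commands above; the function names, retry count, and delay are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # from the checklist; adjust if your port differs


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response.
        return False


def wait_until_up(paths=("/api/health", "/api/ready"), attempts=30, delay=2.0,
                  probe=http_ok, sleep=time.sleep):
    """Poll all paths until every one succeeds.

    Returns the 1-based attempt number that succeeded, or None if the
    backend never became healthy within `attempts` rounds.
    """
    for attempt in range(1, attempts + 1):
        if all(probe(BASE_URL + path) for path in paths):
            return attempt
        sleep(delay)
    return None
```

If wait_until_up() returns None, the container logs are the next place to look, as described above. The probe and sleep parameters exist only so the loop can be exercised without a running backend.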
- Create a session (runs the research → format agents):
$body = @{
  job_description = "Senior Backend Engineer building high-throughput payment APIs in Python. Owns reliability and on-call."
  company = "Stripe"
  role = "Senior Backend Engineer"
  seniority = "senior"
} | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://localhost:8000/api/sessions `
  -ContentType "application/json" -Body $body
Expected: HTTP 200 with a session_id, phase, and a questions array
(3 questions with type/text/difficulty).
- Find the Weave trace. In the docker compose up logs, look for a line with a Weave URL (https://wandb.ai/<entity>/loopprep/r/...), or open https://wandb.ai/<entity>/loopprep/weave/traces directly.
Expected in Weave: a trace tree showing research_node and format_node,
each with a nested LLM call (the W&B Inference chat completion via
ChatOpenAI), plus a search (Tavily) span under research. Inputs/outputs,
latency, and model ID should be visible.
Capture if it fails: the API response, and whether the trace appears at all (auth issue) vs. appears but is missing nested LLM spans (patching issue).
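When capturing the API response for a failure report, it can help to state exactly how it differs from the expected shape. The sketch below checks a parsed JSON body against the expectation given for the create-session call (session_id, phase, and 3 questions with type/text/difficulty). The function name is hypothetical, and the check assumes nothing about the API beyond those fields.

```python
def session_response_problems(payload):
    """List the ways payload deviates from the expected session response.

    Assumes the shape described in the checklist: session_id, phase, and a
    questions array of 3 items, each with type, text, and difficulty.
    An empty list means the response looks as expected.
    """
    problems = []
    for field in ("session_id", "phase", "questions"):
        if field not in payload:
            problems.append(f"missing field: {field}")

    if "questions" not in payload:
        return problems
    questions = payload["questions"]
    if not isinstance(questions, list):
        problems.append("questions is not an array")
        return problems

    if len(questions) != 3:
        problems.append(f"expected 3 questions, got {len(questions)}")
    for index, question in enumerate(questions):
        if not isinstance(question, dict):
            problems.append(f"question {index} is not an object")
            continue
        missing = [k for k in ("type", "text", "difficulty") if k not in question]
        if missing:
            problems.append(f"question {index} missing: {', '.join(missing)}")
    return problems
```

Feed it the response body after parsing it as JSON. A non-empty result here points at the agents or the API schema, which is a separate question from whether the Weave trace appears.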
- In the same project, open the Runs/Workspace view:
https://wandb.ai/<entity>/loopprep
Expected: a run named session-xxxxxxxx (matching your session_id), with
config (session_id, job_description_preview) and any logged metrics /
the question-queue-* artifact.
- Run the eval harness:
docker compose run --rm backend python -m app.evals.run_eval
Expected (console): one line per sample, e.g.
eval-backend-001: question_count_ok=1.00, schema_validity=1.00, relevance=0.85,
then a Summary: line with the three _avg metrics.
- Confirm in Weave: open the project's Evals tab:
https://wandb.ai/<entity>/loopprep/weave/evaluations
Expected: an evaluation for model format-agent / dataset jd-samples-v1
with per-prediction scores and the aggregate summary.
Capture if it fails: the console output (a low schema_validity means the
model isn't returning clean JSON; a crash means a logging/API issue).
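To make the console output easier to paste into a report or compare between runs, the per-sample lines can be parsed into numbers. This is a minimal Python sketch that assumes the line format shown in the example above (sample ID, a colon, then comma-separated name=value pairs). The helper names and the 0.5 threshold are illustrative assumptions, not part of the eval harness.

```python
import re

# Assumed format: "<sample-id>: name=value, name=value, ..."
LINE_RE = re.compile(r"^(?P<sample>[\w.-]+):\s*(?P<metrics>.+)$")


def parse_eval_line(line):
    """Parse one console line into (sample_id, {metric_name: float}).

    Returns None for lines that do not match the assumed format.
    """
    match = LINE_RE.match(line.strip().rstrip(","))
    if not match:
        return None
    metrics = {}
    for part in match.group("metrics").split(","):
        name, sep, value = part.strip().partition("=")
        if not sep:
            return None
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            return None
    return match.group("sample"), metrics


def low_metrics(metrics, threshold=0.5):
    """Return metric names scoring below threshold (threshold is arbitrary)."""
    return sorted(name for name, score in metrics.items() if score < threshold)
```

A sample whose low_metrics result includes schema_validity is the case described above, where the model is not returning clean JSON.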
- Start a fresh Claude Code session in this project. When prompted, approve the project-scoped wandb MCP server.
- Confirm it's connected by running /mcp.
Expected: /mcp shows wandb listed as connected.
- Hand it back to me. Tell me it's connected and I'll verify by calling a W&B MCP tool (e.g. list your projects and count the Weave traces from Phase 3, or summarize the eval from Phase 5) to prove the round-trip works.
Capture if it fails: whether /mcp shows wandb at all (config issue) vs.
shows it failing to connect (the ${WANDB_API_KEY} env var from Phase 0 isn't
visible to Claude Code — re-check it and fully restart).
- Phase 1 import: ✅ / ❌ (paste error)
- Phase 2 health: ✅ / ❌
- Phase 3 trace tree visible w/ nested LLM spans: ✅ / ❌
- Phase 4 session run visible: ✅ / ❌
- Phase 5 eval in Evals tab: ✅ / ❌ (paste Summary line)
- Phase 6 MCP connected: ✅ / ❌