[Platform] Judge panel + scoreboard for hackathon PRs #14
Builds the scoring system for the month-long NEST hackathon:
- scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions
  (correctness, test_rigor, api_fit, docs_quality, novelty,
  persona_fidelity) with anchored 1/3/5 examples.
- scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3,
  model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic
  with the rubric block marked cache_control: ephemeral; temperature 0.0;
  GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator
  returns median per dimension (statistics.median_low tie-break), total
  median, and a deterministic 3-sentence consensus. Per-file diffs above
  5000 lines are truncated with a marker.
- scripts/judge/run_all.py: CLI that scores every open hackathon/* PR and
  writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls back to
  a deterministic MockJudgeClient (seeded by head_sha + judge_id) when
  ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads a
  pre-fetched PR list for offline smoke runs.
- docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11
  generated with mock judges (mock: true). Re-run with a live key to
  replace.
- scripts/judge/tests/test_judge.py: 24 unit tests covering median
  aggregation (incl. tie-breaking + missing-judge handling), JSON schema
  round-trip, parse_verdict fault tolerance, diff truncation, persona
  inference, and a fake JudgeClient driving judge_pr end-to-end. Live API
  tests behind @pytest.mark.live, skipped by default.
- pyproject.toml: register scripts/ under pytest testpaths and the "live"
  marker.
- README.md: Hackathon section linking to docs/hackathon/scores.json.
https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
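The per-file diff truncation mentioned in the commit message could look roughly like the sketch below. It assumes the diff is split at `diff --git` headers and that the header line counts toward the limit; `truncate_diff` and the marker text are hypothetical names chosen for illustration, not the code in `judge_pr.py`.

```python
import re
from typing import Tuple

MAX_LINES_PER_FILE = 5000  # threshold stated in the commit message


def truncate_diff(diff: str, max_lines: int = MAX_LINES_PER_FILE) -> Tuple[str, bool]:
    """Clip each per-file section of a unified diff to at most max_lines lines."""
    # Split just before every "diff --git" header so each chunk covers one file.
    chunks = re.split(r"(?m)^(?=diff --git )", diff)
    kept, truncated = [], False
    for chunk in chunks:
        if not chunk:
            continue  # the split leaves an empty string before the first header
        lines = chunk.splitlines()
        if len(lines) > max_lines:
            dropped = len(lines) - max_lines
            # Hypothetical marker format; the real marker may differ.
            lines = lines[:max_lines] + [f"... [truncated {dropped} lines] ..."]
            truncated = True
        kept.append("\n".join(lines))
    return "\n".join(kept), truncated
```

Clipping per file rather than over the whole diff means a single oversized file cannot push the remaining files out of the judge's view.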
The judge panel now supports an OpenAIProvider alongside the existing
AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via
the new `--provider {anthropic,openai}` flag on run_all.py and a
`make_provider()` factory inside judge_pr.py. The Anthropic path is
unchanged when --provider is left at its default, so existing scores
are bit-for-bit identical. OpenAI calls chat.completions with JSON mode,
temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var.
The rubric, JSON output schema, scores.json shape, and median-low
aggregation are shared across both providers.
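A minimal sketch of how such a factory might dispatch on the provider name. The class bodies are placeholders (the real providers wrap the SDK clients), and the `model` argument on `make_provider` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnthropicJudgeClient:
    # Placeholder body; the real client wraps anthropic.AsyncAnthropic.
    model: str = "claude-opus-4-7"


AnthropicProvider = AnthropicJudgeClient  # alias, as described above


@dataclass
class OpenAIProvider:
    # Placeholder body; the real provider calls chat.completions.
    model: str = "gpt-5.5"


_PROVIDERS = {"anthropic": AnthropicProvider, "openai": OpenAIProvider}


def make_provider(name: str = "anthropic", model: Optional[str] = None):
    """Return a provider instance for the given --provider value."""
    try:
        cls = _PROVIDERS[name]
    except KeyError:
        raise ValueError(
            f"unknown provider {name!r}; expected one of {sorted(_PROVIDERS)}"
        ) from None
    # Fall back to the provider's default model when none is given.
    return cls(model=model) if model else cls()
```

Keeping `anthropic` as the default is what leaves the existing path untouched when the flag is omitted.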
Follow-up commit
What's new
Invariants preserved
Tests
10 new tests for the provider factory and the OpenAI provider (mocked SDK; no live API calls). All 299 tests pass; ruff/format/pyright clean.
Generated by Claude Code
…dian
`_build_consensus` was reporting `sum(per-dim medians)` as the score, while
`total_median` in the JSON used `median_low(per-judge totals)`. The two
aggregations are not equivalent and can diverge whenever judges disagree:
e.g., PR #2 was showing "scored 20.0/30" in prose while `median: 21.0` in
JSON, confusing downstream consumers.
- Plumb `total_median` (the value that actually gets written to JSON)
  through to `_build_consensus` so the prose uses the same aggregation.
- Add a regression test
  (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs
  three judges with divergent score patterns where the old buggy
  computation (20) and the new correct computation (21) differ; it asserts
  the consensus string contains `f"{total_median:.1f}/30"`.
- Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10
  entries now have the median field and consensus prose in agreement.
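To see why the two aggregations differ, here is a small worked example. The judge scores are made up for illustration (they are not the actual PR #2 verdicts); they are chosen so the old computation gives 20 and the new one gives 21, matching the numbers quoted above.

```python
from statistics import median_low

DIMENSIONS = (
    "correctness", "test_rigor", "api_fit",
    "docs_quality", "novelty", "persona_fidelity",
)

# Three hypothetical judges; each is one point more generous on a different dimension.
judges = [
    dict(zip(DIMENSIONS, (4, 3, 3, 4, 3, 4))),  # total 21
    dict(zip(DIMENSIONS, (3, 4, 3, 4, 3, 4))),  # total 21
    dict(zip(DIMENSIONS, (3, 3, 4, 4, 4, 4))),  # total 22
]

# Old (buggy) prose score: take the median per dimension, then sum.
per_dim = {d: median_low([j[d] for j in judges]) for d in DIMENSIONS}
sum_of_medians = sum(per_dim.values())

# Value written to JSON: median of each judge's own total.
total_median = median_low([sum(j.values()) for j in judges])

consensus = f"scored {total_median:.1f}/30"
print(sum_of_medians, total_median, consensus)  # 20 21 scored 21.0/30
```

Each judge's extra point lands on a different dimension, so the per-dimension medians discard all of them while every per-judge total still includes one.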
The adapter was reading an invented `{<pr>: {correctness, realism, design,
docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel
(PR #14) actually writes a different shape: a top-level
`{version, generated_at, mock, submissions: [{pr, scores: {6 dims},
median, consensus, ...}]}` with six dimensions (correctness, test_rigor,
api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and
`median` in [6, 30]. The mismatch made every submission render as
"unscored" in the marketplace UI.
- Rewrite `load_scores` to walk `submissions[]`, project each entry into
`{pr_number: JudgeScore}`, copy the canonical `median` field into
`JudgeScore.total`, and stash `consensus` on `notes` for the detail
view to quote. The old flat shape now degrades to `{}` instead of
smuggling stale numbers.
- Update `JudgeScore` (Python + TS) to carry all six real dimensions on
the 1-5 scale; total stays in [6, 30].
- Update the submission detail page `ScoreBar` to render `N/5` per
dimension, and the headline total as `X/30`. Score badge tooltip and
hackathon-card titles updated to `/30`.
- Update the marketplace tests for the new shape + a regression test
that the old flat shape is rejected.
- Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the
fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the
trust layer (3 submissions) now reports a non-null `top_score` (23.0).
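The projection described in the first bullet might be sketched as follows. `JudgeScore` here is a hypothetical stand-in with only the fields named above, and `load_scores` takes the raw JSON text rather than a path so the example stays self-contained; the real adapter's signature may differ.

```python
import json
from dataclasses import dataclass
from typing import Dict

DIMENSIONS = (
    "correctness", "test_rigor", "api_fit",
    "docs_quality", "novelty", "persona_fidelity",
)


@dataclass
class JudgeScore:
    scores: Dict[str, float]  # six dimensions, each on the 1-5 scale
    total: float              # canonical `median`, in [6, 30]
    notes: str = ""           # consensus prose for the detail view


def load_scores(raw: str) -> Dict[int, JudgeScore]:
    """Project the scoreboard JSON into {pr_number: JudgeScore}."""
    data = json.loads(raw)
    submissions = data.get("submissions") if isinstance(data, dict) else None
    if not isinstance(submissions, list):
        return {}  # old flat shape (or anything unexpected) degrades to empty
    out: Dict[int, JudgeScore] = {}
    for entry in submissions:
        scores = entry.get("scores", {})
        if any(dim not in scores for dim in DIMENSIONS):
            continue  # assumption: skip rows missing a dimension
        out[int(entry["pr"])] = JudgeScore(
            scores={dim: float(scores[dim]) for dim in DIMENSIONS},
            total=float(entry["median"]),
            notes=entry.get("consensus", ""),
        )
    return out
```

Returning `{}` for the old flat shape makes a stale file render as "unscored" instead of showing numbers on the wrong scale.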
…l-diff fixture
- judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept
  default=1)
- fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git
  diffs (688-2002 lines per PR, fetched via
  git diff origin/main...origin/<head_ref>)
- docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via
  OpenAI.
The temperature gating in judge_pr.py omits the kwarg for gpt-5.x models (they reject temperature=0). The test was asserting the now-absent kwarg. Update it to assert absence, and add a parallel test for gpt-4o to lock in that the determinism path still works for older models.
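A minimal sketch of that gating, assuming the check is a simple prefix match on the model name. `completion_kwargs` is a hypothetical helper; the actual condition in `judge_pr.py` may be written differently.

```python
from typing import Any, Dict, List


def completion_kwargs(model: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    """Build keyword arguments for a chat.completions call in JSON mode."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "response_format": {"type": "json_object"},  # JSON mode
    }
    # Assumption: every "gpt-5*" model rejects an explicit temperature,
    # so the kwarg is only set for older models.
    if not model.startswith("gpt-5"):
        kwargs["temperature"] = 0.0
    return kwargs
```

The two tests described above then reduce to checking that `"temperature"` is absent for `gpt-5.5` and equal to `0.0` for `gpt-4o`.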
Integration of 5 platform tracks built in parallel by specialist agents:
- platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI
  feedback bot + CONTRIBUTING Definition of Done
- platform/open-problems (PR #13): 10 differentiated open problems across
  10 layers, charter, judging doc
- platform/judge-panel (PR #14): rubric, anthropic + openai providers,
  run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11
- platform/research-harness (PR #15): conditions matrix, claude-CLI live
  runner, collect + analyze, dry-run fixtures + tests
- platform/marketplace-ui (PR #16): /hackathon Next.js section with author
  tags, judge scores, layer browser; Python data adapter
Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types
-> UI) on the 6-dim 1-5 scale with totals in [6, 30].
Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live
marker).
Live judge scoreboard top:
- #2 harvard-phd trust 26.0/30 (EigenTrust + checkable invariants)
- #7 coinbase-crypto payments 26.0/30 (HTLC escrow)
- #6 stanford-ml-phd trust 25.0/30
- #11 google-staff transport 25.0/30
Superseded by #17 (now merged to main).
Why
The NEST hackathon is month-long and aimed at thousands of participants. Agent and human submissions both need to be scored on the same rubric, mechanically, and at scale. This PR ships the judge panel and the scoreboard wiring on the `platform/judge-panel` track.
What's in the box
Rubric (`scripts/judge/rubric.md`, `version: 1`)
Six dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity), each scored 1-5 with concrete 1/3/5 anchors.
Versioned. Re-runs record the rubric version they used, so historical scores stay interpretable when the rubric evolves.
Judge module (`scripts/judge/judge_pr.py`)
- GitHub PR fetched via stdlib `urllib` (optional `GITHUB_TOKEN` for rate-limit headroom).
- N parallel judges via `anthropic.AsyncAnthropic` + `asyncio.gather`. Each judge is self-contained: it sees rubric + PR body + diff, no conversational context, no chained calls.
- The rubric block is marked `cache_control: {"type": "ephemeral"}`. For 3 judges in the same 5-min window, the rubric tokens are written to the cache once and later calls read them at the cached rate.
- `temperature=0.0` on every judge. The output records `model` and `rubric_version`.
- The `anthropic` SDK is imported lazily so the module imports cleanly in CI environments without it.
- `ANTHROPIC_API_KEY` is read from env at call time; a missing key raises a clear error before any I/O. No `.env` file needed or supported.
Aggregation
- Median per dimension via `statistics.median_low` (deterministic tie-break to the low value).
- Each judge entry carries an `error` field. If every judge errors, medians become NaN and the consensus explicitly says so.
Diff truncation
Per-file diffs over 5000 lines are clipped with a marker (`... [truncated N lines ...] ...`) and the output sets `diff_truncated: true`. The truncator walks `diff --git` boundaries, so one giant file doesn't drown the others.
CLI (`scripts/judge/run_all.py`)
- Idempotent on HEAD SHA: a PR is skipped when its `head_sha` matches the prior scoreboard. Pass `--force` to override, or change the rubric version to invalidate everything.
- If `ANTHROPIC_API_KEY` is unset, falls back to a deterministic `MockJudgeClient` whose scores are derived from `sha256(head_sha | judge_id | dimension)`. Same input → same scores, every time. The scoreboard's top-level `mock: true` flag and per-row `model: "mock:..."` make this unmistakable.
- `--prs-cache` lets the CLI run without GitHub access, which is useful for sandboxed CI and offline reproducibility. The committed fixture (`scripts/judge/fixtures/hackathon-prs-2026-05-26.json`) snapshots the 10 hackathon PRs as of this PR's creation.
Scoreboard schema (`docs/hackathon/scores.json`)

    {
      "version": 1,
      "generated_at": "2026-05-26T20:04:17+00:00",
      "mock": true,
      "submissions": [
        {
          "pr": 2,
          "handle": "harvard-phd",
          "layer": "trust",
          "title": "...",
          "author": "mariagorskikh",
          "head_sha": "4e0f4d...",
          "head_ref": "hackathon/harvard-phd-eigentrust",
          "model": "mock:claude-opus-4-7",
          "rubric_version": 1,
          "diff_truncated": false,
          "scores": {"correctness": 3.0, "test_rigor": 2.0, ...},
          "median": 21.0,
          "consensus": "PR #2 from the harvard-phd persona scored ...",
          "judges": [{"judge_id": 0, "scores": {...}, "rationale": "...", "error": null}, ...]
        }
      ]
    }

Persona handle is preferred from the `[Hackathon] <handle>:` title prefix (explicit) and falls back to the branch heuristic. Layer is inferred from title/body against the canonical 12 layers.
Bootstrap scoreboard
The committed `docs/hackathon/scores.json` was generated from mock judges because this sandbox has no `ANTHROPIC_API_KEY`. The schema is fully exercised; the numeric scores are deterministic stand-ins. To replace with real judgments:

    export ANTHROPIC_API_KEY=...
    uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

`mock: true` in the top-level metadata and `model: "mock:claude-opus-4-7"` per submission flag this clearly. A live re-run will flip both.
Cost model
- `max_tokens: 2048` per judge.
- On `claude-opus-4-7` at current pricing, a full live run is on the order of $3-5. Prompt caching roughly halves the cost of the rubric portion.
Tests (`scripts/judge/tests/test_judge.py`)
25 tests covering:
- Median aggregation, including `judges` with `error`.
- `parse_verdict` fault tolerance: plain JSON, fenced markdown, garbage, out-of-range, missing dimension, non-int score, refusals.
- Persona inference: title prefix `stanford-ml-phd`, branch fallback otherwise.
- JSON round-trip of `JudgeResult.to_dict()`.
- `MockJudgeClient` determinism.
- End-to-end `judge_pr()` with a fake `JudgeClient` (no SDK required).
Live API tests are behind `@pytest.mark.live` and skipped by default (`addopts = ["-m", "not live"]`). Run them with `pytest -m live` and a real key.
Local CI parity
All five exit 0.
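The `parse_verdict` fault-tolerance cases listed under Tests could be handled along these lines. This is a sketch under stated assumptions: the judge reply is taken to be a JSON object with a `scores` mapping, and any malformed reply yields `None`; the real function may instead clamp values or raise.

```python
import json
import re
from typing import Dict, Optional

DIMENSIONS = (
    "correctness", "test_rigor", "api_fit",
    "docs_quality", "novelty", "persona_fidelity",
)


def parse_verdict(text: str) -> Optional[Dict[str, int]]:
    """Return {dimension: score} or None when the reply is unusable."""
    # Grab the outermost {...} so a markdown fence or chatter around it is ignored.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None  # refusal or plain prose
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # garbage between the braces
    scores = payload.get("scores")
    if not isinstance(scores, dict):
        return None
    verdict: Dict[str, int] = {}
    for dim in DIMENSIONS:
        value = scores.get(dim)
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None  # missing dimension or non-int score
        if not 1 <= value <= 5:
            return None  # out of range
        verdict[dim] = value
    return verdict
```

Returning `None` rather than raising would let the caller record the judge as failed and still aggregate the rest.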
How to re-run
Live judges against current open PRs:

    export ANTHROPIC_API_KEY=sk-ant-...
    uv run python -m scripts.judge.run_all --output docs/hackathon/scores.json

Subset:

    uv run python -m scripts.judge.run_all --pr 7 --pr 11 \
      --output docs/hackathon/scores.json

Force a full re-score (e.g. after bumping the rubric version):

    uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

Offline schema smoke:

    uv run python -m scripts.judge.run_all --mock \
      --prs-cache scripts/judge/fixtures/hackathon-prs-2026-05-26.json \
      --output /tmp/scores.json

Scope discipline
`anthropic` is already an optional dep in `nest-shell`; the judge imports it lazily and uses stdlib for everything else.