[Platform] Judge panel + scoreboard for hackathon PRs by mariagorskikh · Pull Request #14 · projnanda/nandatown

mariagorskikh · 2026-05-26T20:07:12Z

Why

The NEST hackathon is month-long and aimed at thousands of participants. Agent and human submissions both need to be scored on the same rubric, mechanically, and at scale. This PR ships the judge panel and the scoreboard wiring on the platform/judge-panel track.

What's in the box

Rubric (`scripts/judge/rubric.md`, `version: 1`)

Six dimensions, each scored 1-5 with concrete 1/3/5 anchors:

correctness — does the code do what the PR claims; any obvious bugs.
test_rigor — coverage, property-based vs example-based, adversarial cases.
api_fit — drop-in compatibility, registry wiring, idiomatic to NEST.
docs_quality — PR body clarity, docstrings, runnable examples.
novelty — what's interesting vs. textbook restatement.
persona_fidelity — does it actually reflect the claimed engineer persona.

Versioned. Re-runs record the rubric version they used, so historical scores stay interpretable when the rubric evolves.

Judge module (`scripts/judge/judge_pr.py`)

async def judge_pr(pr_number: int, *, n_judges: int = 3,
                   model: str = "claude-opus-4-7") -> JudgeResult

Fetches the PR diff, body, author, head SHA, and check-run summary via the GitHub REST API (stdlib urllib; optional GITHUB_TOKEN for rate-limit headroom).
Runs N judges in parallel via anthropic.AsyncAnthropic + asyncio.gather. Each judge is self-contained: it sees rubric + PR body + diff, no conversational context, no chained calls.
Prompt caching on the rubric: the rubric is sent as a system block with cache_control: {"type": "ephemeral"}. For 3 judges in the same 5-min window, the rubric tokens are billed once.
Determinism: temperature=0.0 on every judge. The output records model and rubric_version.
The anthropic SDK is imported lazily so the module imports cleanly in CI environments without it.
ANTHROPIC_API_KEY is read from env at call time; missing key raises a clear error before any I/O. No .env file needed or supported.

Aggregation

Median per dimension uses statistics.median_low (deterministic tie-break to the low value).
Missing/errored judges are skipped per-dimension; the verdict is still recorded with its error field. If every judge errors, medians become NaN and the consensus explicitly says so.
Consensus is a deterministic 3-sentence narrative stitched from the numeric medians plus a snippet from the strongest judge's rationale — no second LLM call.

Diff truncation

Per-file diffs over 5000 lines are clipped with a marker (... [truncated N lines ...] ...) and the output sets diff_truncated: true. The truncator walks diff --git boundaries, so one giant file doesn't drown the others.

CLI (`scripts/judge/run_all.py`)

# Live: 10 PRs x 3 judges = 30 Anthropic calls
ANTHROPIC_API_KEY=... uv run python -m scripts.judge.run_all \
    --output docs/hackathon/scores.json

# Offline / CI smoke (mock judges keyed off head_sha)
uv run python -m scripts.judge.run_all --mock \
    --prs-cache scripts/judge/fixtures/hackathon-prs-2026-05-26.json \
    --output docs/hackathon/scores.json

Idempotent: re-running skips any PR whose head_sha matches the prior scoreboard. Pass --force to override, or change the rubric version to invalidate everything.
Offline mode: when ANTHROPIC_API_KEY is unset, falls back to a deterministic MockJudgeClient whose scores are derived from sha256(head_sha | judge_id | dimension). Same input → same scores, every time. The scoreboard's top-level mock: true flag and per-row model: "mock:..." make this unmistakable.
--prs-cache lets the CLI run without GitHub access — useful for sandboxed CI and offline reproducibility. The committed fixture (scripts/judge/fixtures/hackathon-prs-2026-05-26.json) snapshots the 10 hackathon PRs as of this PR's creation.

Scoreboard schema (`docs/hackathon/scores.json`)

{
  "version": 1,
  "generated_at": "2026-05-26T20:04:17+00:00",
  "mock": true,
  "submissions": [
    {
      "pr": 2, "handle": "harvard-phd", "layer": "trust",
      "title": "...", "author": "mariagorskikh",
      "head_sha": "4e0f4d...", "head_ref": "hackathon/harvard-phd-eigentrust",
      "model": "mock:claude-opus-4-7", "rubric_version": 1,
      "diff_truncated": false,
      "scores": {"correctness": 3.0, "test_rigor": 2.0, ...},
      "median": 21.0,
      "consensus": "PR #2 from the harvard-phd persona scored ...",
      "judges": [{"judge_id": 0, "scores": {...}, "rationale": "...", "error": null}, ...]
    }
  ]
}

Persona handle is preferred from the [Hackathon] <handle>: title prefix (explicit) and falls back to the branch heuristic. Layer is inferred from title/body against the canonical 12 layers.

Bootstrap scoreboard

The committed docs/hackathon/scores.json was generated from mock judges because this sandbox has no ANTHROPIC_API_KEY. The schema is fully exercised; the numeric scores are deterministic stand-ins. To replace with real judgments:

export ANTHROPIC_API_KEY=...
uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

mock: true in the top-level metadata and model: "mock:claude-opus-4-7" per submission flag this clearly. A live re-run will flip both.

Cost model

Live: 10 PRs × 3 judges = 30 calls.
Each call: rubric (~1.8k tokens, cached after first call in the 5-min window) + PR body (variable) + diff (capped at ~5k lines/file). Realistic per-call input: ~5k-20k tokens.
Output capped at max_tokens: 2048 per judge.
With claude-opus-4-7 at current pricing, a full live run is on the order of $3-5. Prompt caching cuts that roughly in half on the rubric portion.
Idempotency means re-runs only re-score PRs whose HEAD SHA changed, so steady-state cost during the hackathon is one call per push, not per CI run.

Tests (`scripts/judge/tests/test_judge.py`)

25 tests covering:

Score aggregation: odd/even counts, ties (low tie-break), empty, single, all-judges-errored.
Missing-judge handling: per-dimension skip, still appears in judges with error.
parse_verdict fault tolerance: plain JSON, fenced markdown, garbage, out-of-range, missing dimension, non-int score, refusals.
Diff truncation: short diffs untouched, oversized files clipped, multi-file diffs where only the big one is cut.
Persona inference: title-wins for 3-token cases like stanford-ml-phd, branch fallback otherwise.
JSON schema round-trip on JudgeResult.to_dict().
MockJudgeClient determinism.
End-to-end judge_pr() with a fake JudgeClient (no SDK required).
"One judge errors, two succeed" → result still computes.

Live API tests are behind @pytest.mark.live and skipped by default (addopts = ["-m", "not live"]). Run them with pytest -m live and a real key.

Local CI parity

uv sync
uv run ruff check .            # 0 errors
uv run ruff format --check .   # 100 files clean
uv run pyright                 # 0 errors (strict mode)
uv run pytest -v               # 289 passed, 1 deselected (live)

All five exit 0.

How to re-run

Live judges against current open PRs:

export ANTHROPIC_API_KEY=sk-ant-...
uv run python -m scripts.judge.run_all --output docs/hackathon/scores.json

Subset:

uv run python -m scripts.judge.run_all --pr 7 --pr 11 \
    --output docs/hackathon/scores.json

Force a full re-score (e.g. after bumping the rubric version):

uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

Offline schema smoke:

uv run python -m scripts.judge.run_all --mock \
    --prs-cache scripts/judge/fixtures/hackathon-prs-2026-05-26.json \
    --output /tmp/scores.json

Scope discipline

No participant PR was modified.
No other platform track was touched.
No new runtime dependencies (anthropic is already an optional dep in nest-shell; the judge imports it lazily and uses stdlib for everything else).

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

Generated by Claude Code

Builds the scoring system for the month-long NEST hackathon: - scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity) with anchored 1/3/5 examples. - scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3, model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic with the rubric block marked cache_control: ephemeral; temperature 0.0; GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator returns median per dimension (statistics.median_low tie-break), total median, and a deterministic 3-sentence consensus. Per-file diffs above 5000 lines are truncated with a marker. - scripts/judge/run_all.py: CLI that scores every open hackathon/* PR and writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls back to a deterministic MockJudgeClient (seeded by head_sha + judge_id) when ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads a pre-fetched PR list for offline smoke runs. - docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11 generated with mock judges (mock: true). Re-run with a live key to replace. - scripts/judge/tests/test_judge.py: 24 unit tests covering median aggregation (incl. tie-breaking + missing-judge handling), JSON schema round-trip, parse_verdict fault tolerance, diff truncation, persona inference, and a fake JudgeClient driving judge_pr end-to-end. Live API tests behind @pytest.mark.live, skipped by default. - pyproject.toml: register scripts/ under pytest testpaths and the "live" marker. - README.md: Hackathon section linking to docs/hackathon/scores.json. https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

sourcery-ai

Sorry @mariagorskikh, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

The judge panel now supports an OpenAIProvider alongside the existing AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via the new `--provider {anthropic,openai}` flag on run_all.py and a `make_provider()` factory inside judge_pr.py. The Anthropic path is unchanged when --provider is left at its default, so existing scores are bit-for-bit identical. OpenAI calls chat.completions with JSON mode, temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var. The rubric, JSON output schema, scores.json shape, and median-low aggregation are shared across both providers.

mariagorskikh · 2026-05-26T20:46:03Z

Follow-up commit 696b28b extends the judge panel with an OpenAI provider so live judges can run with either an Anthropic or OpenAI key.

What's new

OpenAIProvider in scripts/judge/judge_pr.py — wraps openai.AsyncOpenAI against chat.completions with temperature=0.0, response_format={"type": "json_object"}, and default model gpt-5.5 (OpenAI's current best reasoning model on chat.completions as of May 2026). The SDK is imported lazily; no prompt-caching gymnastics, since OpenAI's caching is implicit.
AnthropicProvider is now a public alias for the existing AnthropicJudgeClient.
make_provider(name, *, model=None, api_key=None) factory returns the right client; unknown names raise a clear ValueError.
run_all.py gets --provider {anthropic,openai} (default anthropic) and --model <name> (defaults to the provider's recommended model). If the selected provider's API key env var is unset, the mock judge fallback kicks in just like before, and the error message names the right env var.
pyproject.toml adds an optional-dependencies group: [project.optional-dependencies] judge = ["anthropic>=0.30", "openai>=1.0"]. Install with uv sync --extra judge.
New scripts/judge/README.md documents providers, env vars, and CLI usage.

Invariants preserved

Same rubric, same six dimensions, same JSON output schema, same median-low aggregation, same scores.json shape.
Default-provider behavior is bit-for-bit identical to the previous Anthropic-only flow.

Tests

10 new tests for the provider factory and the OpenAI provider (mocked SDK; no live API calls). All 299 tests pass; ruff/format/pyright clean.

Generated by Claude Code

…dian `_build_consensus` was reporting `sum(per-dim medians)` as the score, while `total_median` in the JSON used `median_low(per-judge totals)`. In any non-degenerate case these diverge — e.g., PR #2 was showing "scored 20.0/30" in prose while `median: 21.0` in JSON, confusing downstream consumers. - Plumb `total_median` (the value that actually gets written to JSON) through to `_build_consensus` so the prose uses the same aggregation. - Add a regression test (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs three judges with divergent score patterns where the old buggy computation (20) and the new correct computation (21) differ; it asserts the consensus string contains `f"{total_median:.1f}/30"`. - Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10 entries now have the median field and consensus prose in agreement.

The adapter was reading an invented `{<pr>: {correctness, realism, design, docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel (PR #14) actually writes the PR #14 scoreboard shape: a top-level `{version, generated_at, mock, submissions: [{pr, scores: {6 dims}, median, consensus, ...}]}` with six dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and `median` in [6, 30]. The mismatch made every submission render as "unscored" in the marketplace UI. - Rewrite `load_scores` to walk `submissions[]`, project each entry into `{pr_number: JudgeScore}`, copy the canonical `median` field into `JudgeScore.total`, and stash `consensus` on `notes` for the detail view to quote. The old flat shape now degrades to `{}` instead of smuggling stale numbers. - Update `JudgeScore` (Python + TS) to carry all six real dimensions on the 1-5 scale; total stays in [6, 30]. - Update the submission detail page `ScoreBar` to render `N/5` per dimension, and the headline total as `X/30`. Score badge tooltip and hackathon-card titles updated to `/30`. - Update the marketplace tests for the new shape + a regression test that the old flat shape is rejected. - Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the trust layer (3 submissions) now reports a non-null `top_score` (23.0).

…l-diff fixture - judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept default=1) - fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git diffs (688-2002 lines per PR, fetched via git diff origin/main...origin/<head_ref>) - docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via OpenAI.

The temperature gating in judge_pr.py omits the kwarg for gpt-5.x models (they reject temperature=0). Test was asserting the now-absent kwarg. Update to assert absence, and add a parallel test for gpt-4o to lock in that the determinism path still works for older models.

Integration of 5 platform tracks built in parallel by specialist agents: - platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI feedback bot + CONTRIBUTING Definition of Done - platform/open-problems (PR #13): 10 differentiated open problems across 10 layers, charter, judging doc - platform/judge-panel (PR #14): rubric, anthropic + openai providers, run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11 - platform/research-harness (PR #15): conditions matrix, claude-CLI live runner, collect + analyze, dry-run fixtures + tests - platform/marketplace-ui (PR #16): /hackathon Next.js section with author tags, judge scores, layer browser; Python data adapter Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types -> UI) on the 6-dim 1-5 scale with totals in [6, 30]. Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live marker). Live judge scoreboard top: #2 harvard-phd trust 26.0/30 (EigenTrust + checkable invariants) #7 coinbase-crypto payments 26.0/30 (HTLC escrow) #6 stanford-ml-phd trust 25.0/30 #11 google-staff transport 25.0/30

mariagorskikh · 2026-05-26T22:06:50Z

Superseded by #17 (now merged to main at 1771cdb). Closing — the content of this PR (including the live judge scoreboard, OpenAI provider, schema reconciliation, and gpt-5.5 temperature fix) is part of that integration merge.

Generated by Claude Code

sourcery-ai Bot reviewed May 26, 2026

View reviewed changes

mariagorskikh mentioned this pull request May 26, 2026

[Platform] Integration v2 #17

Merged

claude added 2 commits May 26, 2026 21:22

mariagorskikh merged commit 4cde126 into main May 26, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Platform] Judge panel + scoreboard for hackathon PRs#14

[Platform] Judge panel + scoreboard for hackathon PRs#14
mariagorskikh merged 5 commits into
mainfrom
platform/judge-panel

mariagorskikh commented May 26, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

mariagorskikh commented May 26, 2026

Uh oh!

Uh oh!

mariagorskikh commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mariagorskikh commented May 26, 2026

Why

What's in the box

Rubric (scripts/judge/rubric.md, version: 1)

Judge module (scripts/judge/judge_pr.py)

Aggregation

Diff truncation

CLI (scripts/judge/run_all.py)

Scoreboard schema (docs/hackathon/scores.json)

Bootstrap scoreboard

Cost model

Tests (scripts/judge/tests/test_judge.py)

Local CI parity

How to re-run

Scope discipline

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mariagorskikh commented May 26, 2026

Uh oh!

Uh oh!

mariagorskikh commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Rubric (`scripts/judge/rubric.md`, `version: 1`)

Judge module (`scripts/judge/judge_pr.py`)

CLI (`scripts/judge/run_all.py`)

Scoreboard schema (`docs/hackathon/scores.json`)

Tests (`scripts/judge/tests/test_judge.py`)