Skip to content

[Platform] Judge panel + scoreboard for hackathon PRs#14

Merged
mariagorskikh merged 5 commits into
mainfrom
platform/judge-panel
May 26, 2026
Merged

[Platform] Judge panel + scoreboard for hackathon PRs#14
mariagorskikh merged 5 commits into
mainfrom
platform/judge-panel

Conversation

@mariagorskikh

Copy link
Copy Markdown
Collaborator

Why

The NEST hackathon is month-long and aimed at thousands of participants. Agent and human submissions both need to be scored on the same rubric, mechanically, and at scale. This PR ships the judge panel and the scoreboard wiring on the platform/judge-panel track.

What's in the box

Rubric (scripts/judge/rubric.md, version: 1)

Six dimensions, each scored 1-5 with concrete 1/3/5 anchors:

  • correctness — does the code do what the PR claims; any obvious bugs.
  • test_rigor — coverage, property-based vs example-based, adversarial cases.
  • api_fit — drop-in compatibility, registry wiring, idiomatic to NEST.
  • docs_quality — PR body clarity, docstrings, runnable examples.
  • novelty — what's interesting vs. textbook restatement.
  • persona_fidelity — does it actually reflect the claimed engineer persona.

Versioned. Re-runs record the rubric version they used, so historical scores stay interpretable when the rubric evolves.

Judge module (scripts/judge/judge_pr.py)

async def judge_pr(pr_number: int, *, n_judges: int = 3,
                   model: str = "claude-opus-4-7") -> JudgeResult
  • Fetches the PR diff, body, author, head SHA, and check-run summary via the GitHub REST API (stdlib urllib; optional GITHUB_TOKEN for rate-limit headroom).
  • Runs N judges in parallel via anthropic.AsyncAnthropic + asyncio.gather. Each judge is self-contained: it sees rubric + PR body + diff, no conversational context, no chained calls.
  • Prompt caching on the rubric: the rubric is sent as a system block with cache_control: {"type": "ephemeral"}. For 3 judges in the same 5-min window, the rubric tokens are billed once.
  • Determinism: temperature=0.0 on every judge. The output records model and rubric_version.
  • The anthropic SDK is imported lazily so the module imports cleanly in CI environments without it.
  • ANTHROPIC_API_KEY is read from env at call time; missing key raises a clear error before any I/O. No .env file needed or supported.

Aggregation

  • Median per dimension uses statistics.median_low (deterministic tie-break to the low value).
  • Missing/errored judges are skipped per-dimension; the verdict is still recorded with its error field. If every judge errors, medians become NaN and the consensus explicitly says so.
  • Consensus is a deterministic 3-sentence narrative stitched from the numeric medians plus a snippet from the strongest judge's rationale — no second LLM call.

Diff truncation

Per-file diffs over 5000 lines are clipped with a marker (... [truncated N lines ...] ...) and the output sets diff_truncated: true. The truncator walks diff --git boundaries, so one giant file doesn't drown the others.

CLI (scripts/judge/run_all.py)

# Live: 10 PRs x 3 judges = 30 Anthropic calls
ANTHROPIC_API_KEY=... uv run python -m scripts.judge.run_all \
    --output docs/hackathon/scores.json

# Offline / CI smoke (mock judges keyed off head_sha)
uv run python -m scripts.judge.run_all --mock \
    --prs-cache scripts/judge/fixtures/hackathon-prs-2026-05-26.json \
    --output docs/hackathon/scores.json
  • Idempotent: re-running skips any PR whose head_sha matches the prior scoreboard. Pass --force to override, or change the rubric version to invalidate everything.
  • Offline mode: when ANTHROPIC_API_KEY is unset, falls back to a deterministic MockJudgeClient whose scores are derived from sha256(head_sha | judge_id | dimension). Same input → same scores, every time. The scoreboard's top-level mock: true flag and per-row model: "mock:..." make this unmistakable.
  • --prs-cache lets the CLI run without GitHub access — useful for sandboxed CI and offline reproducibility. The committed fixture (scripts/judge/fixtures/hackathon-prs-2026-05-26.json) snapshots the 10 hackathon PRs as of this PR's creation.

Scoreboard schema (docs/hackathon/scores.json)

{
  "version": 1,
  "generated_at": "2026-05-26T20:04:17+00:00",
  "mock": true,
  "submissions": [
    {
      "pr": 2, "handle": "harvard-phd", "layer": "trust",
      "title": "...", "author": "mariagorskikh",
      "head_sha": "4e0f4d...", "head_ref": "hackathon/harvard-phd-eigentrust",
      "model": "mock:claude-opus-4-7", "rubric_version": 1,
      "diff_truncated": false,
      "scores": {"correctness": 3.0, "test_rigor": 2.0, ...},
      "median": 21.0,
      "consensus": "PR #2 from the harvard-phd persona scored ...",
      "judges": [{"judge_id": 0, "scores": {...}, "rationale": "...", "error": null}, ...]
    }
  ]
}

Persona handle is preferred from the [Hackathon] <handle>: title prefix (explicit) and falls back to the branch heuristic. Layer is inferred from title/body against the canonical 12 layers.

Bootstrap scoreboard

The committed docs/hackathon/scores.json was generated from mock judges because this sandbox has no ANTHROPIC_API_KEY. The schema is fully exercised; the numeric scores are deterministic stand-ins. To replace with real judgments:

export ANTHROPIC_API_KEY=...
uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

mock: true in the top-level metadata and model: "mock:claude-opus-4-7" per submission flag this clearly. A live re-run will flip both.

Cost model

  • Live: 10 PRs × 3 judges = 30 calls.
  • Each call: rubric (~1.8k tokens, cached after first call in the 5-min window) + PR body (variable) + diff (capped at ~5k lines/file). Realistic per-call input: ~5k-20k tokens.
  • Output capped at max_tokens: 2048 per judge.
  • With claude-opus-4-7 at current pricing, a full live run is on the order of $3-5. Prompt caching cuts that roughly in half on the rubric portion.
  • Idempotency means re-runs only re-score PRs whose HEAD SHA changed, so steady-state cost during the hackathon is one call per push, not per CI run.

Tests (scripts/judge/tests/test_judge.py)

25 tests covering:

  • Score aggregation: odd/even counts, ties (low tie-break), empty, single, all-judges-errored.
  • Missing-judge handling: per-dimension skip, still appears in judges with error.
  • parse_verdict fault tolerance: plain JSON, fenced markdown, garbage, out-of-range, missing dimension, non-int score, refusals.
  • Diff truncation: short diffs untouched, oversized files clipped, multi-file diffs where only the big one is cut.
  • Persona inference: title-wins for 3-token cases like stanford-ml-phd, branch fallback otherwise.
  • JSON schema round-trip on JudgeResult.to_dict().
  • MockJudgeClient determinism.
  • End-to-end judge_pr() with a fake JudgeClient (no SDK required).
  • "One judge errors, two succeed" → result still computes.

Live API tests are behind @pytest.mark.live and skipped by default (addopts = ["-m", "not live"]). Run them with pytest -m live and a real key.

Local CI parity

uv sync
uv run ruff check .            # 0 errors
uv run ruff format --check .   # 100 files clean
uv run pyright                 # 0 errors (strict mode)
uv run pytest -v               # 289 passed, 1 deselected (live)

All five exit 0.

How to re-run

Live judges against current open PRs:

export ANTHROPIC_API_KEY=sk-ant-...
uv run python -m scripts.judge.run_all --output docs/hackathon/scores.json

Subset:

uv run python -m scripts.judge.run_all --pr 7 --pr 11 \
    --output docs/hackathon/scores.json

Force a full re-score (e.g. after bumping the rubric version):

uv run python -m scripts.judge.run_all --force --output docs/hackathon/scores.json

Offline schema smoke:

uv run python -m scripts.judge.run_all --mock \
    --prs-cache scripts/judge/fixtures/hackathon-prs-2026-05-26.json \
    --output /tmp/scores.json

Scope discipline

  • No participant PR was modified.
  • No other platform track was touched.
  • No new runtime dependencies (anthropic is already an optional dep in nest-shell; the judge imports it lazily and uses stdlib for everything else).

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW


Generated by Claude Code

Builds the scoring system for the month-long NEST hackathon:

- scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions
  (correctness, test_rigor, api_fit, docs_quality, novelty,
  persona_fidelity) with anchored 1/3/5 examples.
- scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3,
  model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic
  with the rubric block marked cache_control: ephemeral; temperature 0.0;
  GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator
  returns median per dimension (statistics.median_low tie-break), total
  median, and a deterministic 3-sentence consensus. Per-file diffs above
  5000 lines are truncated with a marker.
- scripts/judge/run_all.py: CLI that scores every open hackathon/* PR
  and writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls
  back to a deterministic MockJudgeClient (seeded by head_sha + judge_id)
  when ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads
  a pre-fetched PR list for offline smoke runs.
- docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11
  generated with mock judges (mock: true). Re-run with a live key to
  replace.
- scripts/judge/tests/test_judge.py: 24 unit tests covering median
  aggregation (incl. tie-breaking + missing-judge handling), JSON schema
  round-trip, parse_verdict fault tolerance, diff truncation, persona
  inference, and a fake JudgeClient driving judge_pr end-to-end. Live
  API tests behind @pytest.mark.live, skipped by default.
- pyproject.toml: register scripts/ under pytest testpaths and the
  "live" marker.
- README.md: Hackathon section linking to docs/hackathon/scores.json.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @mariagorskikh, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

The judge panel now supports an OpenAIProvider alongside the existing
AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via
the new `--provider {anthropic,openai}` flag on run_all.py and a
`make_provider()` factory inside judge_pr.py. The Anthropic path is
unchanged when --provider is left at its default, so existing scores
are bit-for-bit identical. OpenAI calls chat.completions with JSON mode,
temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var.

The rubric, JSON output schema, scores.json shape, and median-low
aggregation are shared across both providers.

Copy link
Copy Markdown
Collaborator Author

Follow-up commit 696b28b extends the judge panel with an OpenAI provider so live judges can run with either an Anthropic or OpenAI key.

What's new

  • OpenAIProvider in scripts/judge/judge_pr.py — wraps openai.AsyncOpenAI against chat.completions with temperature=0.0, response_format={"type": "json_object"}, and default model gpt-5.5 (OpenAI's current best reasoning model on chat.completions as of May 2026). The SDK is imported lazily; no prompt-caching gymnastics, since OpenAI's caching is implicit.
  • AnthropicProvider is now a public alias for the existing AnthropicJudgeClient.
  • make_provider(name, *, model=None, api_key=None) factory returns the right client; unknown names raise a clear ValueError.
  • run_all.py gets --provider {anthropic,openai} (default anthropic) and --model <name> (defaults to the provider's recommended model). If the selected provider's API key env var is unset, the mock judge fallback kicks in just like before, and the error message names the right env var.
  • pyproject.toml adds an optional-dependencies group: [project.optional-dependencies] judge = ["anthropic>=0.30", "openai>=1.0"]. Install with uv sync --extra judge.
  • New scripts/judge/README.md documents providers, env vars, and CLI usage.

Invariants preserved

  • Same rubric, same six dimensions, same JSON output schema, same median-low aggregation, same scores.json shape.
  • Default-provider behavior is bit-for-bit identical to the previous Anthropic-only flow.

Tests

10 new tests for the provider factory and the OpenAI provider (mocked SDK; no live API calls). All 299 tests pass; ruff/format/pyright clean.


Generated by Claude Code

…dian

`_build_consensus` was reporting `sum(per-dim medians)` as the score, while
`total_median` in the JSON used `median_low(per-judge totals)`. In any
non-degenerate case these diverge — e.g., PR #2 was showing
"scored 20.0/30" in prose while `median: 21.0` in JSON, confusing
downstream consumers.

- Plumb `total_median` (the value that actually gets written to JSON)
  through to `_build_consensus` so the prose uses the same aggregation.
- Add a regression test
  (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs
  three judges with divergent score patterns where the old buggy
  computation (20) and the new correct computation (21) differ; it
  asserts the consensus string contains `f"{total_median:.1f}/30"`.
- Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10
  entries now have the median field and consensus prose in agreement.
mariagorskikh pushed a commit that referenced this pull request May 26, 2026
The adapter was reading an invented `{<pr>: {correctness, realism, design,
docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel
(PR #14) actually writes the PR #14 scoreboard shape: a top-level
`{version, generated_at, mock, submissions: [{pr, scores: {6 dims},
median, consensus, ...}]}` with six dimensions (correctness, test_rigor,
api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and
`median` in [6, 30]. The mismatch made every submission render as
"unscored" in the marketplace UI.

- Rewrite `load_scores` to walk `submissions[]`, project each entry into
  `{pr_number: JudgeScore}`, copy the canonical `median` field into
  `JudgeScore.total`, and stash `consensus` on `notes` for the detail
  view to quote. The old flat shape now degrades to `{}` instead of
  smuggling stale numbers.
- Update `JudgeScore` (Python + TS) to carry all six real dimensions on
  the 1-5 scale; total stays in [6, 30].
- Update the submission detail page `ScoreBar` to render `N/5` per
  dimension, and the headline total as `X/30`. Score badge tooltip and
  hackathon-card titles updated to `/30`.
- Update the marketplace tests for the new shape + a regression test
  that the old flat shape is rejected.
- Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the
  fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the
  trust layer (3 submissions) now reports a non-null `top_score` (23.0).
claude added 2 commits May 26, 2026 21:22
…l-diff fixture

- judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept default=1)
- fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git diffs
  (688-2002 lines per PR, fetched via git diff origin/main...origin/<head_ref>)
- docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via OpenAI.
The temperature gating in judge_pr.py omits the kwarg for gpt-5.x
models (they reject temperature=0). Test was asserting the now-absent
kwarg. Update to assert absence, and add a parallel test for gpt-4o
to lock in that the determinism path still works for older models.
@mariagorskikh mariagorskikh merged commit 4cde126 into main May 26, 2026
4 checks passed
mariagorskikh added a commit that referenced this pull request May 26, 2026
Integration of 5 platform tracks built in parallel by specialist agents:

- platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI feedback bot + CONTRIBUTING Definition of Done
- platform/open-problems (PR #13): 10 differentiated open problems across 10 layers, charter, judging doc
- platform/judge-panel (PR #14): rubric, anthropic + openai providers, run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11
- platform/research-harness (PR #15): conditions matrix, claude-CLI live runner, collect + analyze, dry-run fixtures + tests
- platform/marketplace-ui (PR #16): /hackathon Next.js section with author tags, judge scores, layer browser; Python data adapter

Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types -> UI) on the 6-dim 1-5 scale with totals in [6, 30].

Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live marker).

Live judge scoreboard top:
  #2  harvard-phd     trust       26.0/30  (EigenTrust + checkable invariants)
  #7  coinbase-crypto payments    26.0/30  (HTLC escrow)
  #6  stanford-ml-phd trust       25.0/30
  #11 google-staff    transport   25.0/30

Copy link
Copy Markdown
Collaborator Author

Superseded by #17 (now merged to main at 1771cdb). Closing — the content of this PR (including the live judge scoreboard, OpenAI provider, schema reconciliation, and gpt-5.5 temperature fix) is part of that integration merge.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants