3 changes: 3 additions & 0 deletions .gitignore
@@ -17,6 +17,9 @@ results/
benchmarks/results/
data/raw_html/
benchmarks/data/raw_html/
benchmarks/openreview_benchmark/data/openreview_pdfs/
benchmarks/openreview_benchmark/data/openreview_raw/
benchmarks/openreview_benchmark/results/

# Review output
review_results/
120 changes: 120 additions & 0 deletions benchmarks/openreview_benchmark/OPENREVIEW.md
@@ -0,0 +1,120 @@
# OpenReview benchmark track (pilot)

This track complements the Refine-based benchmark in `benchmarks/data/benchmark.jsonl`. It uses **public OpenReview** threads (reviews, author replies, meta-review, decision) from ML venues. It is **not** paragraph-anchored like Refine; evaluation should use **semantic overlap** (e.g. LLM-as-judge) between model comments and human review text, not paragraph-location metrics.

All OpenReview-specific assets live under **`benchmarks/openreview_benchmark/`** (data, scripts, this doc).

## Pilot scope (ICLR 2025)

The pilot includes **10 papers** from **ICLR 2025**: a random sample of accepted papers was ranked by **average review length** (sum of `summary`, `strengths`, `weaknesses`, and `questions` per official review, averaged across reviewers), and the longest were kept. Papers also had to have **at least three official reviews** and **at least one author reply** in the thread.
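A minimal sketch of this criterion, for illustration only: the real logic lives in `filter_candidates.py`, and the `"author"` value for `author_type` is an assumption about the normalized schema.

```python
# Illustrative sketch of the pilot selection criterion (see
# filter_candidates.py for the real implementation). `papers` is assumed
# to be a list of dicts shaped like the normalized JSONL schema below.
def avg_review_length(paper: dict) -> float:
    """Sum the four review text fields per official review, averaged
    across reviewers."""
    fields = ("summary", "strengths", "weaknesses", "questions")
    lengths = [
        sum(len(r.get(f) or "") for f in fields)
        for r in paper["reviews"]
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0

def is_eligible(paper: dict) -> bool:
    """At least three official reviews and at least one author reply.
    The "author" author_type value is assumed, not confirmed."""
    has_author_reply = any(
        d.get("author_type") == "author" for d in paper.get("discussions", [])
    )
    return len(paper["reviews"]) >= 3 and has_author_reply

# pilot = sorted(
#     (p for p in papers if is_eligible(p)),
#     key=avg_review_length, reverse=True,
# )[:10]
```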

| Forum ID | Title |
|----------|--------|
| `jj7b3p5kLY` | The AdEMAMix Optimizer: Better, Faster, Older |
| `kOJf7Dklyv` | Air Quality Prediction with Physics-Guided Dual Neural ODEs in Open Systems |
| `ajxAJ8GUX4` | Learning Geometric Reasoning Networks For Robot Task And Motion Planning |
| `XMOaOigOQo` | ContraDiff: Planning Towards High Return States via Contrastive Learning |
| `SFNqrHQTEP` | NExUME: Adaptive Training and Inference for DNNs under Intermittent Power Environments |
| `BC4lIvfSzv` | Generative Representational Instruction Tuning |
| `M992mjgKzI` | OGBench: Benchmarking Offline Goal-Conditioned RL |
| `BM9qfolt6p` | LucidPPN: Unambiguous Prototypical Parts Network for User-centric Interpretable Computer Vision |
| `7b2JrzdLhA` | Graph Neural Ricci Flow: Evolving Feature from a Curvature Perspective |
| `d4qMoUSMLT` | Efficient Training of Neural Stochastic Differential Equations by Matching Finite Dimensional Distributions |

## Data files

| Path | Description |
|------|-------------|
| `benchmarks/openreview_benchmark/data/openreview_raw/<forum_id>.json` | Raw API response: all notes in the forum (`GET /notes?forum=<id>`). **Gitignored**; produce with `collect_openreview.py` if you need to re-run `normalize_openreview.py`. Not required to run eval (committed JSONL is enough). |
| `benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl` | One JSON object per line: normalized paper metadata, reviews, discussions, meta-review, decision. **Committed**; this is what the eval script reads. |

Optional: `filter_candidates.py` can write a ranked list (e.g. `candidate_papers.json`) while discovering the pilot; that file is **not** required to use the benchmark once `openreview_benchmark.jsonl` exists.
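For reference, the raw fetch behind `collect_openreview.py` is one GET per forum. A minimal sketch, without the Cloudflare session handling from `scripts/openreview_http.py` (so a bare request like this may be blocked in practice); the API v2 endpoint constant is an assumption:

```python
# Sketch of the raw fetch (GET /notes?forum=<id>). Real runs need the
# browser/Cloudflare session from scripts/openreview_http.py; a bare
# request may get a 403.
import json
import pathlib
import requests

API = "https://api2.openreview.net/notes"  # assumed API v2 endpoint

def fetch_forum(forum_id: str, out_dir: str = "data/openreview_raw") -> None:
    resp = requests.get(API, params={"forum": forum_id}, timeout=30)
    resp.raise_for_status()
    path = pathlib.Path(out_dir) / f"{forum_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(resp.json(), indent=2))

# fetch_forum("jj7b3p5kLY")
```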

### Locked evaluation artifacts (committed)

| Path | Description |
|------|-------------|
| `benchmarks/openreview_benchmark/reports/` | Frozen full-eval JSON copies for git (`eval_<UTC>.json`); use **repo-relative** `benchmark` / `results_dir` paths inside each file. |
| `benchmarks/openreview_benchmark/REPORT.md` | Human-readable pilot report (tables, caveats, how to reproduce). |

## Scripts

Shared HTTP helpers (Cloudflare session) live in **`benchmarks/openreview_benchmark/scripts/openreview_http.py`** and are imported by the fetch/download scripts below.

| Script | Purpose |
|--------|---------|
| `benchmarks/openreview_benchmark/scripts/collect_openreview.py` | Fetch forums by venue or explicit `--forum-ids`; writes `data/openreview_raw/`. Uses a browser session (visit `openreview.net` first) so API requests are not blocked. |
| `benchmarks/openreview_benchmark/scripts/normalize_openreview.py` | Convert raw forum JSON to `data/openreview_benchmark.jsonl`. |
| `benchmarks/openreview_benchmark/scripts/filter_candidates.py` | List accepted papers for ICLR 2025 + NeurIPS 2025, random sample, rank by review text length; optional pilot discovery. |
| `benchmarks/openreview_benchmark/scripts/validate_openreview_benchmark.py` | Check JSONL schema; optional `--parse-one` downloads the first paper’s PDF and runs `parse_document` (no LLM). Use before a full review run. |
| `benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py` | LLM-judge **precision / recall / F1**; optional `--save-full-report` / `--output`; appends to `eval_history.jsonl` unless `--no-eval-history`. |
| `benchmarks/openreview_benchmark/scripts/download_openreview_pdfs.py` | Download PDFs for papers in `openreview_benchmark.jsonl` into `data/openreview_pdfs/` (gitignored) for `openaireview review <file.pdf>`. |

## Schema (normalized JSONL)

Each line is one paper. Main fields:

- **Paper:** `paper_id`, `forum_url`, `venue`, `year`, `title`, `authors`, `abstract`, `keywords`, `primary_area`, `pdf_url`, `decision`
- **Reviews:** `reviews[]` — each item has `review_id`, `reviewer`, `rating`, `confidence`, `soundness`, `presentation`, `contribution`, `summary`, `strengths`, `weaknesses`, `questions`
- **Discussion:** `discussions[]` — `comment_id`, `replyto`, `author_type`, `comment` (and optional `reviewer` for reviewer comments)
- **Meta-review:** `meta_review` (object or null)
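A minimal sketch of iterating over the committed JSONL, using only the field names listed above:

```python
# Sketch: walk the committed JSONL and print one summary line per paper.
import json

path = "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl"
with open(path) as fh:
    for line in fh:
        paper = json.loads(line)
        print(f'{paper["paper_id"]}  {paper["title"][:50]!r}  '
              f'{len(paper["reviews"])} reviews, '
              f'{len(paper["discussions"])} discussion notes, '
              f'decision={paper["decision"]!r}')
```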

## Evaluation (implemented)

Module: `src/reviewer/evaluate_openreview.py`. CLI: `benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py`.

OpenAIReview outputs **discrete comments** (title, quote, explanation). Human ground truth is **official reviews** with separate fields. Scores are **LLM-as-judge** (configurable model, default `gpt-4o-mini` via `OPENREVIEW_JUDGE_MODEL`).

**Precision** (per paper): among model comments, the fraction for which the judge answers **YES** to: “Does this comment overlap **any** substantive critique or question in the **pooled** human review text (all reviewers combined)?”

**Recall** (per paper): for each official review with non-empty text, the judge answers **YES** if **at least one** model comment addresses a substantive issue in **that** review. **Recall** = (number of YES) / (number of non-empty official reviews). Macro-averaged over papers in the CLI summary.

**F1** = harmonic mean of precision and recall per paper; the script prints per-paper and **mean** P/R/F1.
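A minimal sketch of the per-paper scoring loop under these definitions. `judge_says_yes` stands in for the real LLM-as-judge call, and `reviews` is assumed to already be filtered to non-empty official review text; the exact prompts live in `src/reviewer/evaluate_openreview.py`.

```python
# Sketch of per-paper P/R/F1 under the definitions above. judge_says_yes
# is a placeholder for the LLM judge (YES/NO answer per prompt pair).
def score_paper(comments: list[str], reviews: list[str], judge_says_yes) -> dict:
    # Precision: model comments that overlap the pooled human review text.
    pooled = "\n\n".join(reviews)
    matched = sum(1 for c in comments if judge_says_yes(c, pooled))
    precision = matched / len(comments) if comments else 0.0

    # Recall: non-empty official reviews addressed by at least one comment.
    covered = sum(
        1 for r in reviews
        if any(judge_says_yes(c, r) for c in comments)
    )
    recall = covered / len(reviews) if reviews else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```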

**API keys:** use the same stack as the rest of the package (e.g. `OPENAI_API_KEY` and `REVIEW_PROVIDER=openai` for the judge). Review runs and judge calls can share the provider.

**Get PDFs locally** (unlike arXiv links, OpenReview PDF URLs are not fetched by the CLI):

```bash
python benchmarks/openreview_benchmark/scripts/download_openreview_pdfs.py
# Writes benchmarks/openreview_benchmark/data/openreview_pdfs/<paper_id>.pdf (gitignored)
```

**Run a review** (keep outputs under this track; `results/` is gitignored, though summaries can be committed separately):

```bash
openaireview review benchmarks/openreview_benchmark/data/openreview_pdfs/jj7b3p5kLY.pdf \
  --name jj7b3p5kLY --method zero_shot \
  --output-dir benchmarks/openreview_benchmark/results/reviews
```

**Run evaluation** — `--results-dir` must match where review JSON lives:

```bash
python benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py \
  --results-dir benchmarks/openreview_benchmark/results/reviews \
  --save-full-report
```

That writes a **timestamped** full report under `benchmarks/openreview_benchmark/results/eval_<UTC>.json` and **appends one line** to **`benchmarks/openreview_benchmark/eval_history.jsonl`** (mean P/R/F1, judge model, paper ids, optional pointer to the full report). Commit `eval_history.jsonl` when you want a paper trail for a written report; use `--no-eval-history` to skip the append, or `--output <path.json>` instead of `--save-full-report` for a fixed report path.
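A quick way to inspect the history (each line is a self-contained JSON object, as in the committed example):

```python
# Sketch: print mean P/R/F1 from the most recent eval_history.jsonl line.
import json

path = "benchmarks/openreview_benchmark/eval_history.jsonl"
with open(path) as fh:
    last = json.loads(fh.readlines()[-1])
print(last["generated_at"], last["judge_model"], last["mean"])
```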

For a **PR-ready snapshot**, copy that JSON into **`reports/`**, normalize paths to repo-relative strings, and extend **`REPORT.md`** (see the existing locked run there).

Do **not** use paragraph-index metrics from `evaluate.py` as the primary signal for this track unless human spans are aligned to the paper in a future version.

**Next steps (optional):** atomic human bullets; rebuttal–point linkage; cheaper embedding baselines.

## Local-only files (gitignored: `results/`, `data/openreview_pdfs/`, `data/openreview_raw/`)

| Path | Needed for git / PR? | When you can delete |
|------|----------------------|----------------------|
| `data/openreview_raw/<forum_id>.json` | No | Only for **regenerating** `openreview_benchmark.jsonl` via `normalize_openreview.py`. Eval and the committed pilot do **not** need these files on disk. |
| `results/reviews/<paper_id>.json` | No (local LLM outputs) | Never required for the **committed** scorecard; keep if you want to **re-run eval** without paying for reviews again. |
| `results/eval_<UTC>.json` | No | **Redundant** after you copy metrics into `reports/` (same numbers; `reports/` is the committed snapshot). |
| `data/openreview_pdfs/*.pdf` | No | Safe to remove to save disk if you no longer run `openaireview review` locally; download again with `download_openreview_pdfs.py` if needed. |

## Limitations

- OpenReview is **ML/AI-heavy**; diversity is mostly via topic area within venues.
- API access may require the same session pattern as in `collect_openreview.py` (Cloudflare).
- Review quality and length vary by reviewer; the pilot selection is biased toward **longer** average reviews for denser supervision.
74 changes: 74 additions & 0 deletions benchmarks/openreview_benchmark/REPORT.md
@@ -0,0 +1,74 @@
# OpenReview pilot benchmark — locked evaluation report

**Run:** `generated_at` = `2026-04-23T10:43:02.136255+00:00` (UTC)
**Committed scorecard:** [`reports/eval_20260423T104302Z.json`](reports/eval_20260423T104302Z.json)
**History line:** [`eval_history.jsonl`](eval_history.jsonl) (same run; `full_report` points at the committed JSON under `reports/`)

This report summarizes one completed **LLM-as-judge** pass over all **10** ICLR 2025 pilot papers. It is meant to be citable in a PR; raw review outputs stay under `results/` (gitignored).

---

## What was evaluated

| Role | Model | Notes |
|------|--------|--------|
| **Paper review** (predictions) | `claude-opus-4-6` | `openaireview review … --method zero_shot`; method key `zero_shot__claude-opus-4-6` in each `<paper_id>.json`. |
| **Judge** (precision / recall) | `claude-sonnet-4-6` | Same API stack as reviews (`REVIEW_PROVIDER=openai` + gateway). Judge calls use `temperature=0.0`, `max_tokens=8`, YES/NO prompts per `src/reviewer/evaluate_openreview.py`. |

Metrics are **not** comparable to the Refine benchmark in `benchmarks/REPORT.md` (different ground truth: paragraph-anchored Refine comments vs OpenReview review text overlap).

---

## Metric definitions (short)

See **`OPENREVIEW.md`** and **`src/reviewer/evaluate_openreview.py`** for the exact prompts.

- **Precision:** fraction of model comments the judge says overlap **any** substantive critique or question in **pooled** official review text (all reviewers).
- **Recall:** for each official review with non-empty formatted text, the judge says whether **at least one** model comment addresses a substantive issue in **that** review; recall = YES count / number of such reviews.
- **F1:** harmonic mean of precision and recall **per paper**; the table below matches the committed JSON. **Means** in the JSON are unweighted averages across the 10 papers.

---

## Aggregate results (n = 10)

| Mean precision | Mean recall | Mean F1 |
|----------------|-------------|---------|
| 0.377 | 0.745 | 0.464 |

---

## Per-paper results

| `paper_id` | Precision | Recall | F1 | Predictions | Reviews covered / non-empty |
|------------|-----------|--------|-----|-------------|----------------------------|
| 7b2JrzdLhA | 0.500 | 0.750 | 0.600 | 12 | 3 / 4 |
| ajxAJ8GUX4 | 0.250 | 1.000 | 0.400 | 8 | 4 / 4 |
| BC4lIvfSzv | 0.300 | 1.000 | 0.462 | 10 | 4 / 4 |
| BM9qfolt6p | 0.111 | 0.750 | 0.194 | 9 | 3 / 4 |
| d4qMoUSMLT | 0.500 | 0.750 | 0.600 | 8 | 3 / 4 |
| jj7b3p5kLY | 0.500 | 0.600 | 0.545 | 8 | 3 / 5 |
| kOJf7Dklyv | 0.750 | 0.600 | 0.667 | 8 | 3 / 5 |
| M992mjgKzI | 0.000 | 0.000 | 0.000 | 8 | 0 / 4 |
| SFNqrHQTEP | 0.556 | 1.000 | 0.714 | 9 | 4 / 4 |
| XMOaOigOQo | 0.300 | 1.000 | 0.462 | 10 | 3 / 3 |

---

## Interpretation and caveats

1. **LLM judge variance:** A second run with the same inputs can change YES/NO edges; treat means as **point estimates**, not ground truth.
2. **Strict overlap:** The judge is asked for overlap with **substantive** human critiques. Model comments that are mostly notation or internal consistency may score **no** overlap when humans emphasized contribution, novelty, or positioning (see **`M992mjgKzI`**: all NO in this run despite substantive model comments).
3. **Review vs judge model mismatch:** Reviews used **Opus**, judge **Sonnet**; both are valid for an end-to-end pipeline but should be stated in any write-up.
4. **Infrastructure:** Gateway retries (judge calls in `evaluate_openreview.py` were given a higher retry count during this workstream) absorbed intermittent 503 / Bedrock errors; long runs remain sensitive to outages.
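As an illustration only (the actual retry behaviour lives in the gateway and in `evaluate_openreview.py`), a judge call wrapped in simple exponential backoff might look like:

```python
# Illustrative backoff wrapper for flaky judge calls; not the real
# retry logic, which lives in the gateway and evaluate_openreview.py.
import time

def with_retries(call, attempts: int = 5, base_delay: float = 2.0):
    for i in range(attempts):
        try:
            return call()
        except Exception:  # e.g. intermittent 503 / Bedrock errors
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # exponential backoff
```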

---

## Reproducing (after PDFs and review JSON exist)

```bash
python benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py \
--results-dir benchmarks/openreview_benchmark/results/reviews \
--save-full-report
```

Copy the new `eval_<UTC>.json` into `reports/` with **repo-relative** `benchmark` and `results_dir` fields if you want another locked row for git. You can then delete the duplicate under `results/` to save space; the committed snapshot lives only in `reports/`.
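A minimal sketch of that normalization step, assuming the script is run from the repo root; the field names match the committed JSON, and `lock_report` is a hypothetical helper:

```python
# Sketch: copy a fresh results/eval_<UTC>.json into reports/ with
# repo-relative benchmark/results_dir strings, for a locked git snapshot.
import json
import pathlib

def lock_report(src: str, repo_root: str = ".") -> pathlib.Path:
    root = pathlib.Path(repo_root).resolve()
    report = json.loads(pathlib.Path(src).read_text())
    for key in ("benchmark", "results_dir"):
        p = pathlib.Path(report[key])
        if p.is_absolute():  # assumes the path sits under the repo root
            report[key] = p.relative_to(root).as_posix()
    dst = root / "benchmarks/openreview_benchmark/reports" / pathlib.Path(src).name
    dst.write_text(json.dumps(report, indent=2))
    return dst
```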
10 changes: 10 additions & 0 deletions benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions benchmarks/openreview_benchmark/eval_history.jsonl
@@ -0,0 +1 @@
{"generated_at": "2026-04-23T10:43:02.139100+00:00", "benchmark": "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl", "results_dir": "benchmarks/openreview_benchmark/results/reviews", "judge_model": "claude-sonnet-4-6", "judge_provider": "openai", "method_key": null, "paper_ids_evaluated": ["7b2JrzdLhA", "ajxAJ8GUX4", "BC4lIvfSzv", "BM9qfolt6p", "d4qMoUSMLT", "jj7b3p5kLY", "kOJf7Dklyv", "M992mjgKzI", "SFNqrHQTEP", "XMOaOigOQo"], "num_papers": 10, "mean": {"precision": 0.37667, "recall": 0.745, "f1": 0.4643}, "full_report": "reports/eval_20260423T104302Z.json"}
150 changes: 150 additions & 0 deletions benchmarks/openreview_benchmark/reports/eval_20260423T104302Z.json
@@ -0,0 +1,150 @@
{
"generated_at": "2026-04-23T10:43:02.136255+00:00",
"benchmark": "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl",
"results_dir": "benchmarks/openreview_benchmark/results/reviews",
"judge_model": "claude-sonnet-4-6",
"judge_provider": "openai",
"method_key": null,
"num_papers": 10,
"mean": {
"precision": 0.37667,
"recall": 0.745,
"f1": 0.4643
},
"per_paper": [
{
"precision": 0.5,
"recall": 0.75,
"f1": 0.6,
"num_predictions": 12,
"num_human_reviews": 4,
"num_predictions_matched": 6,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "7b2JrzdLhA",
"title": "Graph Neural Ricci Flow: Evolving Feature from a Curvature P"
},
{
"precision": 0.25,
"recall": 1.0,
"f1": 0.4,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 2,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "ajxAJ8GUX4",
"title": "Learning Geometric Reasoning Networks For Robot Task And Mot"
},
{
"precision": 0.3,
"recall": 1.0,
"f1": 0.4615,
"num_predictions": 10,
"num_human_reviews": 4,
"num_predictions_matched": 3,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "BC4lIvfSzv",
"title": "Generative Representational Instruction Tuning"
},
{
"precision": 0.1111,
"recall": 0.75,
"f1": 0.1935,
"num_predictions": 9,
"num_human_reviews": 4,
"num_predictions_matched": 1,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "BM9qfolt6p",
"title": "LucidPPN: Unambiguous Prototypical Parts Network for User-ce"
},
{
"precision": 0.5,
"recall": 0.75,
"f1": 0.6,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 4,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "d4qMoUSMLT",
"title": "Efficient Training of Neural Stochastic Differential Equatio"
},
{
"precision": 0.5,
"recall": 0.6,
"f1": 0.5455,
"num_predictions": 8,
"num_human_reviews": 5,
"num_predictions_matched": 4,
"num_reviews_covered": 3,
"num_nonempty_reviews": 5,
"judge_model": "claude-sonnet-4-6",
"paper_id": "jj7b3p5kLY",
"title": "The AdEMAMix Optimizer: Better, Faster, Older"
},
{
"precision": 0.75,
"recall": 0.6,
"f1": 0.6667,
"num_predictions": 8,
"num_human_reviews": 5,
"num_predictions_matched": 6,
"num_reviews_covered": 3,
"num_nonempty_reviews": 5,
"judge_model": "claude-sonnet-4-6",
"paper_id": "kOJf7Dklyv",
"title": "Air Quality Prediction with Physics-Guided Dual Neural ODEs "
},
{
"precision": 0.0,
"recall": 0.0,
"f1": 0.0,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 0,
"num_reviews_covered": 0,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "M992mjgKzI",
"title": "OGBench: Benchmarking Offline Goal-Conditioned RL"
},
{
"precision": 0.5556,
"recall": 1.0,
"f1": 0.7143,
"num_predictions": 9,
"num_human_reviews": 4,
"num_predictions_matched": 5,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "SFNqrHQTEP",
"title": "NExUME: Adaptive Training and Inference for DNNs under Inter"
},
{
"precision": 0.3,
"recall": 1.0,
"f1": 0.4615,
"num_predictions": 10,
"num_human_reviews": 3,
"num_predictions_matched": 3,
"num_reviews_covered": 3,
"num_nonempty_reviews": 3,
"judge_model": "claude-sonnet-4-6",
"paper_id": "XMOaOigOQo",
"title": "ContraDiff: Planning Towards High Return States via Contrast"
}
],
"lock": {
"locked_for_repo": "2026-04-23",
"notes": "Duplicate of the eval run identified by generated_at; paths are repo-relative for portability. Raw per-paper review JSON under results/reviews/ remains gitignored; this file is the committed scorecard."
}
}