3 changes: 3 additions & 0 deletions .gitignore
@@ -17,6 +17,9 @@ results/
benchmarks/results/
data/raw_html/
benchmarks/data/raw_html/
benchmarks/openreview_benchmark/data/openreview_pdfs/
benchmarks/openreview_benchmark/data/openreview_raw/
benchmarks/openreview_benchmark/results/

# Review output
review_results/
120 changes: 120 additions & 0 deletions benchmarks/openreview_benchmark/OPENREVIEW.md
@@ -0,0 +1,120 @@
# OpenReview benchmark track (pilot)

This track complements the Refine-based benchmark in `benchmarks/data/benchmark.jsonl`. It uses **public OpenReview** threads (reviews, author replies, meta-review, decision) from ML venues. It is **not** paragraph-anchored like Refine; evaluation should use **semantic overlap** (e.g. LLM-as-judge) between model comments and human review text, not paragraph-location metrics.

All OpenReview-specific assets live under **`benchmarks/openreview_benchmark/`** (data, scripts, this doc).

## Pilot scope (ICLR 2025)

The pilot includes **10 papers** from **ICLR 2025**: a random sample of accepted papers was ranked by **average review length** (sum of `summary`, `strengths`, `weaknesses`, and `questions` per official review, averaged across reviewers), and the longest were kept. Papers also had to have **at least three official reviews** and **at least one author reply** in the thread.
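A minimal sketch of this criterion, for illustration only: the real logic lives in `filter_candidates.py`, and the `"author"` value for `author_type` is an assumption about the normalized schema.

```python
# Illustrative sketch of the pilot selection criterion (see
# filter_candidates.py for the real implementation). `papers` is assumed
# to be a list of dicts shaped like the normalized JSONL schema below.
def avg_review_length(paper: dict) -> float:
    """Sum the four review text fields per official review, averaged
    across reviewers."""
    fields = ("summary", "strengths", "weaknesses", "questions")
    lengths = [
        sum(len(r.get(f) or "") for f in fields)
        for r in paper["reviews"]
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0

def is_eligible(paper: dict) -> bool:
    """At least three official reviews and at least one author reply.
    The "author" author_type value is assumed, not confirmed."""
    has_author_reply = any(
        d.get("author_type") == "author" for d in paper.get("discussions", [])
    )
    return len(paper["reviews"]) >= 3 and has_author_reply

# pilot = sorted(
#     (p for p in papers if is_eligible(p)),
#     key=avg_review_length, reverse=True,
# )[:10]
```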

| Forum ID | Title |
|----------|--------|
| `jj7b3p5kLY` | The AdEMAMix Optimizer: Better, Faster, Older |
| `kOJf7Dklyv` | Air Quality Prediction with Physics-Guided Dual Neural ODEs in Open Systems |
| `ajxAJ8GUX4` | Learning Geometric Reasoning Networks For Robot Task And Motion Planning |
| `XMOaOigOQo` | ContraDiff: Planning Towards High Return States via Contrastive Learning |
| `SFNqrHQTEP` | NExUME: Adaptive Training and Inference for DNNs under Intermittent Power Environments |
| `BC4lIvfSzv` | Generative Representational Instruction Tuning |
| `M992mjgKzI` | OGBench: Benchmarking Offline Goal-Conditioned RL |
| `BM9qfolt6p` | LucidPPN: Unambiguous Prototypical Parts Network for User-centric Interpretable Computer Vision |
| `7b2JrzdLhA` | Graph Neural Ricci Flow: Evolving Feature from a Curvature Perspective |
| `d4qMoUSMLT` | Efficient Training of Neural Stochastic Differential Equations by Matching Finite Dimensional Distributions |

## Data files

| Path | Description |
|------|-------------|
| `benchmarks/openreview_benchmark/data/openreview_raw/<forum_id>.json` | Raw API response: all notes in the forum (`GET /notes?forum=<id>`). **Gitignored**; produce with `collect_openreview.py` if you need to re-run `normalize_openreview.py`. Not required to run eval (committed JSONL is enough). |
| `benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl` | One JSON object per line: normalized paper metadata, reviews, discussions, meta-review, decision. **Committed**; this is what the eval script reads. |

Optional: `filter_candidates.py` can write a ranked list (e.g. `candidate_papers.json`) while discovering the pilot; that file is **not** required to use the benchmark once `openreview_benchmark.jsonl` exists.
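For reference, the raw fetch behind `collect_openreview.py` is one GET per forum. A minimal sketch, without the Cloudflare session handling from `scripts/openreview_http.py` (so a bare request like this may be blocked in practice); the API v2 endpoint constant is an assumption:

```python
# Sketch of the raw fetch (GET /notes?forum=<id>). Real runs need the
# browser/Cloudflare session from scripts/openreview_http.py; a bare
# request may get a 403.
import json
import pathlib
import requests

API = "https://api2.openreview.net/notes"  # assumed API v2 endpoint

def fetch_forum(forum_id: str, out_dir: str = "data/openreview_raw") -> None:
    resp = requests.get(API, params={"forum": forum_id}, timeout=30)
    resp.raise_for_status()
    path = pathlib.Path(out_dir) / f"{forum_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(resp.json(), indent=2))

# fetch_forum("jj7b3p5kLY")
```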

### Locked evaluation artifacts (committed)

| Path | Description |
|------|-------------|
| `benchmarks/openreview_benchmark/reports/` | Frozen full-eval JSON copies for git (`eval_<UTC>.json`); use **repo-relative** `benchmark` / `results_dir` paths inside each file. |
| `benchmarks/openreview_benchmark/REPORT.md` | Human-readable pilot report (tables, caveats, how to reproduce). |

## Scripts

Shared HTTP helpers (Cloudflare session) live in **`benchmarks/openreview_benchmark/scripts/openreview_http.py`** and are imported by the fetch/download scripts below.

| Script | Purpose |
|--------|---------|
| `benchmarks/openreview_benchmark/scripts/collect_openreview.py` | Fetch forums by venue or explicit `--forum-ids`; writes `data/openreview_raw/`. Uses a browser session (visit `openreview.net` first) so API requests are not blocked. |
| `benchmarks/openreview_benchmark/scripts/normalize_openreview.py` | Convert raw forum JSON to `data/openreview_benchmark.jsonl`. |
| `benchmarks/openreview_benchmark/scripts/filter_candidates.py` | List accepted papers for ICLR 2025 + NeurIPS 2025, random sample, rank by review text length; optional pilot discovery. |
| `benchmarks/openreview_benchmark/scripts/validate_openreview_benchmark.py` | Check JSONL schema; optional `--parse-one` downloads the first paper’s PDF and runs `parse_document` (no LLM). Use before a full review run. |
| `benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py` | LLM-judge **precision / recall / F1**; optional `--save-full-report` / `--output`; appends to `eval_history.jsonl` unless `--no-eval-history`. |
| `benchmarks/openreview_benchmark/scripts/download_openreview_pdfs.py` | Download PDFs for papers in `openreview_benchmark.jsonl` into `data/openreview_pdfs/` (gitignored) for `openaireview review <file.pdf>`. |

## Schema (normalized JSONL)

Each line is one paper. Main fields:

- **Paper:** `paper_id`, `forum_url`, `venue`, `year`, `title`, `authors`, `abstract`, `keywords`, `primary_area`, `pdf_url`, `decision`
- **Reviews:** `reviews[]` — each item has `review_id`, `reviewer`, `rating`, `confidence`, `soundness`, `presentation`, `contribution`, `summary`, `strengths`, `weaknesses`, `questions`
- **Discussion:** `discussions[]` — `comment_id`, `replyto`, `author_type`, `comment` (and optional `reviewer` for reviewer comments)
- **Meta-review:** `meta_review` (object or null)
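A minimal sketch of iterating over the committed JSONL, using only the field names listed above:

```python
# Sketch: walk the committed JSONL and print one summary line per paper.
import json

path = "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl"
with open(path) as fh:
    for line in fh:
        paper = json.loads(line)
        print(f'{paper["paper_id"]}  {paper["title"][:50]!r}  '
              f'{len(paper["reviews"])} reviews, '
              f'{len(paper["discussions"])} discussion notes, '
              f'decision={paper["decision"]!r}')
```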

## Evaluation (implemented)

Module: `src/reviewer/evaluate_openreview.py`. CLI: `benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py`.

OpenAIReview outputs **discrete comments** (title, quote, explanation). Human ground truth is **official reviews** with separate fields. Scores are **LLM-as-judge** (configurable model, default `gpt-4o-mini` via `OPENREVIEW_JUDGE_MODEL`).

**Precision** (per paper): among model comments, the fraction for which the judge answers **YES** to: “Does this comment overlap **any** substantive critique or question in the **pooled** human review text (all reviewers combined)?”

**Recall** (per paper): for each official review with non-empty text, the judge answers **YES** if **at least one** model comment addresses a substantive issue in **that** review. **Recall** = (number of YES) / (number of non-empty official reviews). Macro-averaged over papers in the CLI summary.

**F1** = harmonic mean of precision and recall per paper; the script prints per-paper and **mean** P/R/F1.
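A minimal sketch of the per-paper scoring loop under these definitions. `judge_says_yes` stands in for the real LLM-as-judge call, and `reviews` is assumed to already be filtered to non-empty official review text; the exact prompts live in `src/reviewer/evaluate_openreview.py`.

```python
# Sketch of per-paper P/R/F1 under the definitions above. judge_says_yes
# is a placeholder for the LLM judge (YES/NO answer per prompt pair).
def score_paper(comments: list[str], reviews: list[str], judge_says_yes) -> dict:
    # Precision: model comments that overlap the pooled human review text.
    pooled = "\n\n".join(reviews)
    matched = sum(1 for c in comments if judge_says_yes(c, pooled))
    precision = matched / len(comments) if comments else 0.0

    # Recall: non-empty official reviews addressed by at least one comment.
    covered = sum(
        1 for r in reviews
        if any(judge_says_yes(c, r) for c in comments)
    )
    recall = covered / len(reviews) if reviews else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```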

**API keys:** use the same stack as the rest of the package (e.g. `OPENAI_API_KEY` and `REVIEW_PROVIDER=openai` for the judge). Review runs and judge calls can share the provider.

**Get PDFs locally** (unlike arXiv links, OpenReview PDF URLs are not fetched by the CLI):

```bash
python benchmarks/openreview_benchmark/scripts/download_openreview_pdfs.py
# Writes benchmarks/openreview_benchmark/data/openreview_pdfs/<paper_id>.pdf (gitignored)
```

**Run a review** (keep outputs under this track; `results/` is gitignored, though summaries can be committed separately):

```bash
openaireview review benchmarks/openreview_benchmark/data/openreview_pdfs/jj7b3p5kLY.pdf \
  --name jj7b3p5kLY --method zero_shot \
  --output-dir benchmarks/openreview_benchmark/results/reviews
```

**Run evaluation** — `--results-dir` must match where review JSON lives:

```bash
python benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py \
  --results-dir benchmarks/openreview_benchmark/results/reviews \
  --save-full-report
```

That writes a **timestamped** full report under `benchmarks/openreview_benchmark/results/eval_<UTC>.json` and **appends one line** to **`benchmarks/openreview_benchmark/eval_history.jsonl`** (mean P/R/F1, judge model, paper ids, optional pointer to the full report). Commit `eval_history.jsonl` when you want a paper trail for a written report; use `--no-eval-history` to skip the append, or `--output <path.json>` instead of `--save-full-report` for a fixed report path.
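A quick way to inspect the history (each line is a self-contained JSON object, as in the committed example):

```python
# Sketch: print mean P/R/F1 from the most recent eval_history.jsonl line.
import json

path = "benchmarks/openreview_benchmark/eval_history.jsonl"
with open(path) as fh:
    last = json.loads(fh.readlines()[-1])
print(last["generated_at"], last["judge_model"], last["mean"])
```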

For a **PR-ready snapshot**, copy that JSON into **`reports/`**, normalize paths to repo-relative strings, and extend **`REPORT.md`** (see the existing locked run there).

Do **not** use paragraph-index metrics from `evaluate.py` as the primary signal for this track unless human spans are aligned to the paper in a future version.

**Next steps (optional):** atomic human bullets; rebuttal–point linkage; cheaper embedding baselines.

## Local-only files (gitignored: `results/`, `data/openreview_pdfs/`, `data/openreview_raw/`)

| Path | Needed for git / PR? | When you can delete |
|------|----------------------|----------------------|
| `data/openreview_raw/<forum_id>.json` | No | Only for **regenerating** `openreview_benchmark.jsonl` via `normalize_openreview.py`. Eval and the committed pilot do **not** need these files on disk. |
| `results/reviews/<paper_id>.json` | No (local LLM outputs) | Never required for the **committed** scorecard; keep if you want to **re-run eval** without paying for reviews again. |
| `results/eval_<UTC>.json` | No | **Redundant** after you copy metrics into `reports/` (same numbers; `reports/` is the committed snapshot). |
| `data/openreview_pdfs/*.pdf` | No | Safe to remove to save disk if you no longer run `openaireview review` locally; download again with `download_openreview_pdfs.py` if needed. |

## Limitations

- OpenReview is **ML/AI-heavy**; diversity is mostly via topic area within venues.
- API access may require the same session pattern as in `collect_openreview.py` (Cloudflare).
- Review quality and length vary by reviewer; the pilot selection is biased toward **longer** average reviews for denser supervision.
74 changes: 74 additions & 0 deletions benchmarks/openreview_benchmark/REPORT.md
@@ -0,0 +1,74 @@
# OpenReview pilot benchmark — locked evaluation report

**Run:** `generated_at` = `2026-04-23T10:43:02.136255+00:00` (UTC)
**Committed scorecard:** [`reports/eval_20260423T104302Z.json`](reports/eval_20260423T104302Z.json)
**History line:** [`eval_history.jsonl`](eval_history.jsonl) (same run; `full_report` points at the committed JSON under `reports/`)

This report summarizes one completed **LLM-as-judge** pass over all **10** ICLR 2025 pilot papers. It is meant to be citable in a PR; raw review outputs stay under `results/` (gitignored).

---

## What was evaluated

| Role | Model | Notes |
|------|--------|--------|
| **Paper review** (predictions) | `claude-opus-4-6` | `openaireview review … --method zero_shot`; method key `zero_shot__claude-opus-4-6` in each `<paper_id>.json`. |
| **Judge** (precision / recall) | `claude-sonnet-4-6` | Same API stack as reviews (`REVIEW_PROVIDER=openai` + gateway). Judge calls use `temperature=0.0`, `max_tokens=8`, YES/NO prompts per `src/reviewer/evaluate_openreview.py`. |

Metrics are **not** comparable to the Refine benchmark in `benchmarks/REPORT.md` (different ground truth: paragraph-anchored Refine comments vs OpenReview review text overlap).

---

## Metric definitions (short)

See **`OPENREVIEW.md`** and **`src/reviewer/evaluate_openreview.py`** for the exact prompts.

- **Precision:** fraction of model comments the judge says overlap **any** substantive critique or question in **pooled** official review text (all reviewers).
- **Recall:** for each official review with non-empty formatted text, the judge says whether **at least one** model comment addresses a substantive issue in **that** review; recall = YES count / number of such reviews.
- **F1:** harmonic mean of precision and recall **per paper**; the table below matches the committed JSON. **Means** in the JSON are unweighted averages across the 10 papers.

---

## Aggregate results (n = 10)

| Mean precision | Mean recall | Mean F1 |
|----------------|-------------|---------|
| 0.377 | 0.745 | 0.464 |

---

## Per-paper results

| `paper_id` | Precision | Recall | F1 | Predictions | Reviews covered / non-empty |
|------------|-----------|--------|-----|-------------|----------------------------|
| 7b2JrzdLhA | 0.500 | 0.750 | 0.600 | 12 | 3 / 4 |
| ajxAJ8GUX4 | 0.250 | 1.000 | 0.400 | 8 | 4 / 4 |
| BC4lIvfSzv | 0.300 | 1.000 | 0.462 | 10 | 4 / 4 |
| BM9qfolt6p | 0.111 | 0.750 | 0.194 | 9 | 3 / 4 |
| d4qMoUSMLT | 0.500 | 0.750 | 0.600 | 8 | 3 / 4 |
| jj7b3p5kLY | 0.500 | 0.600 | 0.545 | 8 | 3 / 5 |
| kOJf7Dklyv | 0.750 | 0.600 | 0.667 | 8 | 3 / 5 |
| M992mjgKzI | 0.000 | 0.000 | 0.000 | 8 | 0 / 4 |
| SFNqrHQTEP | 0.556 | 1.000 | 0.714 | 9 | 4 / 4 |
| XMOaOigOQo | 0.300 | 1.000 | 0.462 | 10 | 3 / 3 |

---

## Interpretation and caveats

1. **LLM judge variance:** A second run with the same inputs can change YES/NO edges; treat means as **point estimates**, not ground truth.
2. **Strict overlap:** The judge is asked for overlap with **substantive** human critiques. Model comments that are mostly notation or internal consistency may score **no** overlap when humans emphasized contribution, novelty, or positioning (see **`M992mjgKzI`**: all NO in this run despite substantive model comments).
3. **Review vs judge model mismatch:** Reviews used **Opus**, judge **Sonnet**; both are valid for an end-to-end pipeline but should be stated in any write-up.
4. **Infrastructure:** Gateway retries (judge calls in `evaluate_openreview.py` were given a higher retry count during this workstream) absorbed intermittent 503 / Bedrock errors; long runs remain sensitive to outages.
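As an illustration only (the actual retry behaviour lives in the gateway and in `evaluate_openreview.py`), a judge call wrapped in simple exponential backoff might look like:

```python
# Illustrative backoff wrapper for flaky judge calls; not the real
# retry logic, which lives in the gateway and evaluate_openreview.py.
import time

def with_retries(call, attempts: int = 5, base_delay: float = 2.0):
    for i in range(attempts):
        try:
            return call()
        except Exception:  # e.g. intermittent 503 / Bedrock errors
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # exponential backoff
```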

---

## Reproducing (after PDFs and review JSON exist)

```bash
python benchmarks/openreview_benchmark/scripts/evaluate_openreview_benchmark.py \
--results-dir benchmarks/openreview_benchmark/results/reviews \
--save-full-report
```

Copy the new `eval_<UTC>.json` into `reports/` with **repo-relative** `benchmark` and `results_dir` fields if you want another locked row for git. You can then delete the duplicate under `results/` to save space; the committed snapshot lives only in `reports/`.
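A minimal sketch of that normalization step, assuming the script is run from the repo root; the field names match the committed JSON, and `lock_report` is a hypothetical helper:

```python
# Sketch: copy a fresh results/eval_<UTC>.json into reports/ with
# repo-relative benchmark/results_dir strings, for a locked git snapshot.
import json
import pathlib

def lock_report(src: str, repo_root: str = ".") -> pathlib.Path:
    root = pathlib.Path(repo_root).resolve()
    report = json.loads(pathlib.Path(src).read_text())
    for key in ("benchmark", "results_dir"):
        p = pathlib.Path(report[key])
        if p.is_absolute():  # assumes the path sits under the repo root
            report[key] = p.relative_to(root).as_posix()
    dst = root / "benchmarks/openreview_benchmark/reports" / pathlib.Path(src).name
    dst.write_text(json.dumps(report, indent=2))
    return dst
```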
10 changes: 10 additions & 0 deletions benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions benchmarks/openreview_benchmark/eval_history.jsonl
@@ -0,0 +1 @@
{"generated_at": "2026-04-23T10:43:02.139100+00:00", "benchmark": "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl", "results_dir": "benchmarks/openreview_benchmark/results/reviews", "judge_model": "claude-sonnet-4-6", "judge_provider": "openai", "method_key": null, "paper_ids_evaluated": ["7b2JrzdLhA", "ajxAJ8GUX4", "BC4lIvfSzv", "BM9qfolt6p", "d4qMoUSMLT", "jj7b3p5kLY", "kOJf7Dklyv", "M992mjgKzI", "SFNqrHQTEP", "XMOaOigOQo"], "num_papers": 10, "mean": {"precision": 0.37667, "recall": 0.745, "f1": 0.4643}, "full_report": "reports/eval_20260423T104302Z.json"}
150 changes: 150 additions & 0 deletions benchmarks/openreview_benchmark/reports/eval_20260423T104302Z.json
@@ -0,0 +1,150 @@
{
"generated_at": "2026-04-23T10:43:02.136255+00:00",
"benchmark": "benchmarks/openreview_benchmark/data/openreview_benchmark.jsonl",
"results_dir": "benchmarks/openreview_benchmark/results/reviews",
"judge_model": "claude-sonnet-4-6",
"judge_provider": "openai",
"method_key": null,
"num_papers": 10,
"mean": {
"precision": 0.37667,
"recall": 0.745,
"f1": 0.4643
},
"per_paper": [
{
"precision": 0.5,
"recall": 0.75,
"f1": 0.6,
"num_predictions": 12,
"num_human_reviews": 4,
"num_predictions_matched": 6,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "7b2JrzdLhA",
"title": "Graph Neural Ricci Flow: Evolving Feature from a Curvature P"
},
{
"precision": 0.25,
"recall": 1.0,
"f1": 0.4,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 2,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "ajxAJ8GUX4",
"title": "Learning Geometric Reasoning Networks For Robot Task And Mot"
},
{
"precision": 0.3,
"recall": 1.0,
"f1": 0.4615,
"num_predictions": 10,
"num_human_reviews": 4,
"num_predictions_matched": 3,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "BC4lIvfSzv",
"title": "Generative Representational Instruction Tuning"
},
{
"precision": 0.1111,
"recall": 0.75,
"f1": 0.1935,
"num_predictions": 9,
"num_human_reviews": 4,
"num_predictions_matched": 1,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "BM9qfolt6p",
"title": "LucidPPN: Unambiguous Prototypical Parts Network for User-ce"
},
{
"precision": 0.5,
"recall": 0.75,
"f1": 0.6,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 4,
"num_reviews_covered": 3,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "d4qMoUSMLT",
"title": "Efficient Training of Neural Stochastic Differential Equatio"
},
{
"precision": 0.5,
"recall": 0.6,
"f1": 0.5455,
"num_predictions": 8,
"num_human_reviews": 5,
"num_predictions_matched": 4,
"num_reviews_covered": 3,
"num_nonempty_reviews": 5,
"judge_model": "claude-sonnet-4-6",
"paper_id": "jj7b3p5kLY",
"title": "The AdEMAMix Optimizer: Better, Faster, Older"
},
{
"precision": 0.75,
"recall": 0.6,
"f1": 0.6667,
"num_predictions": 8,
"num_human_reviews": 5,
"num_predictions_matched": 6,
"num_reviews_covered": 3,
"num_nonempty_reviews": 5,
"judge_model": "claude-sonnet-4-6",
"paper_id": "kOJf7Dklyv",
"title": "Air Quality Prediction with Physics-Guided Dual Neural ODEs "
},
{
"precision": 0.0,
"recall": 0.0,
"f1": 0.0,
"num_predictions": 8,
"num_human_reviews": 4,
"num_predictions_matched": 0,
"num_reviews_covered": 0,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "M992mjgKzI",
"title": "OGBench: Benchmarking Offline Goal-Conditioned RL"
},
{
"precision": 0.5556,
"recall": 1.0,
"f1": 0.7143,
"num_predictions": 9,
"num_human_reviews": 4,
"num_predictions_matched": 5,
"num_reviews_covered": 4,
"num_nonempty_reviews": 4,
"judge_model": "claude-sonnet-4-6",
"paper_id": "SFNqrHQTEP",
"title": "NExUME: Adaptive Training and Inference for DNNs under Inter"
},
{
"precision": 0.3,
"recall": 1.0,
"f1": 0.4615,
"num_predictions": 10,
"num_human_reviews": 3,
"num_predictions_matched": 3,
"num_reviews_covered": 3,
"num_nonempty_reviews": 3,
"judge_model": "claude-sonnet-4-6",
"paper_id": "XMOaOigOQo",
"title": "ContraDiff: Planning Towards High Return States via Contrast"
}
],
"lock": {
"locked_for_repo": "2026-04-23",
"notes": "Duplicate of the eval run identified by generated_at; paths are repo-relative for portability. Raw per-paper review JSON under results/reviews/ remains gitignored; this file is the committed scorecard."
}
}