diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..e08b3c9
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,2 @@
+*.pdf binary
+*.png binary
diff --git a/README.md b/README.md
index b55a969..044fa0b 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,9 @@ data release pending licensing, privacy, and source-distribution review.
 ## Key Artifacts
 
 - [Draft introduction](paper/LegalBenchPro_intro_draft.pdf)
+- [Presentation page 1: benchmark breakdown](paper/presentation_pages/LegalBenchPro_slide_01_benchmark_breakdown.pdf)
+- [Presentation page 3: model stability and score distribution](paper/presentation_pages/LegalBenchPro_slide_03_model_stability_score_distribution.pdf)
+- [Presentation page 5: literature and benchmark comparison](paper/presentation_pages/LegalBenchPro_slide_05_literature_benchmark_comparison.pdf)
 - [AI-assisted research workflow and safeguards](docs/AI_WORKFLOW.md)
 - [Annotation protocol and scoring design](docs/ANNOTATION_PROTOCOL.md)
 - [Data card](docs/DATA_CARD.md)
@@ -40,12 +43,17 @@ data release pending licensing, privacy, and source-distribution review.
   reviewer; contributed human review of model outputs and professional feedback on the
   scoring rubric.
 
-## Public Preview Overview
+## Presentation Excerpts
 
-LegalBenchPro public preview overview showing task instances, LLM response cells, model configurations, validation rows, source coverage, and Chinese case split design
+The repository includes three cropped PDF pages from the current project presentation.
+They are intended as lightweight visual entry points for readers who want the argument
+before opening the full manuscript draft.
 
-The figure is generated from committed public metadata:
-`data/metadata/dataset_summary.json` and `data/metadata/source_distribution.csv`.
+| Page | PDF | Why it matters |
+| --- | --- | --- |
+| 1 | [Benchmark breakdown](paper/presentation_pages/LegalBenchPro_slide_01_benchmark_breakdown.pdf) | Establishes the benchmark inventory: public-exam rows, Chinese real-case prompts, human-scored pilot rows, model groups, and response-level evaluation counts. It explains the scale of the dataset before any model comparison is interpreted. |
+| 3 | [Model stability and score distribution](paper/presentation_pages/LegalBenchPro_slide_03_model_stability_score_distribution.pdf) | Shows that model performance should be read as a distribution across score bands, not only as a single average. The side-by-side public-exam and real-case panels make the transfer question visible. |
+| 5 | [Literature and benchmark comparison](paper/presentation_pages/LegalBenchPro_slide_05_literature_benchmark_comparison.pdf) | Positions LegalBenchPro against representative legal benchmarks by task coverage, real-document grounding, paired stances, reference-aware scoring, expert validation, and exam-to-case transfer. |
 
 ## At a Glance
 
@@ -56,7 +64,7 @@ The figure is generated from committed public metadata:
 - **Reproducibility:** Python sample extraction, machine-readable metadata, tests,
   data-card documentation, and an explicit workflow audit trail.
 - **Research workflow:** public artifacts are organized so that readers can inspect the
-  path from workbook-derived metadata to samples, documentation, figures, and
+  path from workbook-derived metadata to samples, documentation, presentation pages, and
   manuscript materials.
 
 ## Benchmark Design
@@ -101,8 +109,8 @@ defensible argument structure. This project contributes:
   LLM-generated response cells;
 - a scoring protocol that distinguishes answer matching from citation-aware legal
   reasoning;
-- a reproducible public workflow for sample extraction, metadata generation, figure
-  rendering, and manuscript tracking.
+- a reproducible public workflow for sample extraction, metadata generation,
+  presentation documentation, and manuscript tracking.
 
 For empirical social-science research, the project is also a small example of how
 LLM-assisted analysis can be made auditable: institutional text is treated as data,
@@ -148,8 +156,8 @@ For a quick review of the project, start with:
 - `data/sample/legalbenchpro_public_exam_sample.csv` for public-exam content excerpts;
 - `data/metadata/source_distribution.csv` and `data/metadata/model_configurations.csv`
   for concise metadata;
-- `scripts/extract_public_sample.py` and `scripts/render_benchmark_overview.py` for
-  the reproducible export and figure-rendering workflow.
+- `paper/presentation_pages/` for the selected presentation-page PDFs;
+- `scripts/extract_public_sample.py` for the reproducible public export workflow.
 
 ## Repository Map
 
@@ -158,6 +166,10 @@ paper/
   LegalBenchPro_intro_draft.pdf  # Current draft introduction
   introduction_revised.tex  # Dataset-aligned introduction for Overleaf
   manuscript_working_draft.md  # Working paper skeleton for GitHub readers
+  presentation_pages/  # Cropped PDF excerpts from the project slides
+    LegalBenchPro_slide_01_benchmark_breakdown.pdf
+    LegalBenchPro_slide_03_model_stability_score_distribution.pdf
+    LegalBenchPro_slide_05_literature_benchmark_comparison.pdf
 docs/
   DATA_CARD.md  # Dataset scope, fields, release status, risks
   ANNOTATION_PROTOCOL.md  # Human validation plan and scoring dimensions
@@ -171,11 +183,9 @@ data/
   metadata/dataset_summary.json
   metadata/model_configurations.csv
   metadata/source_distribution.csv
-outputs/
-  figures/benchmark_overview.png  # Public metadata overview figure
 scripts/
   extract_public_sample.py  # Rebuilds the public sample and metadata
-  render_benchmark_overview.py  # Rebuilds the README overview figure
+  render_benchmark_overview.py  # Optional metadata overview renderer
 src/legalbenchpro/
   workbook.py  # Small workbook helpers used by scripts
 tests/
@@ -200,7 +210,6 @@ python scripts/extract_public_sample.py \
   --cn-sample-size 10 \
   --bar-sample-size 20 \
   --max-cell-chars 420
-python scripts/render_benchmark_overview.py
 ```
 
 Windows PowerShell:
@@ -216,7 +225,6 @@ python .\scripts\extract_public_sample.py `
   --cn-sample-size 10 `
   --bar-sample-size 20 `
   --max-cell-chars 420
-python .\scripts\render_benchmark_overview.py
 ```
 
 ## Validation
@@ -245,8 +253,7 @@ manual `PYTHONPATH` setup is not required for local validation.
 This repository is intentionally organized as a research-engineering artifact, not only
 as a dataset announcement. It demonstrates:
 
-- Python scripts that regenerate public samples, metadata, and the README overview
-  figure from structured inputs;
+- Python scripts that regenerate public samples and metadata from structured inputs;
 - explicit dataset documentation, release constraints, and annotation protocol files;
 - lightweight tests for workbook parsing utilities;
 - an audit trail for AI-assisted coding and research workflow decisions;
diff --git a/outputs/figures/benchmark_overview.png b/outputs/figures/benchmark_overview.png
deleted file mode 100644
index 1af2ec2..0000000
Binary files a/outputs/figures/benchmark_overview.png and /dev/null differ
diff --git a/paper/presentation_pages/LegalBenchPro_slide_01_benchmark_breakdown.pdf b/paper/presentation_pages/LegalBenchPro_slide_01_benchmark_breakdown.pdf
new file mode 100644
index 0000000..3376e2e
Binary files /dev/null and b/paper/presentation_pages/LegalBenchPro_slide_01_benchmark_breakdown.pdf differ
diff --git a/paper/presentation_pages/LegalBenchPro_slide_03_model_stability_score_distribution.pdf b/paper/presentation_pages/LegalBenchPro_slide_03_model_stability_score_distribution.pdf
new file mode 100644
index 0000000..32ee773
Binary files /dev/null and b/paper/presentation_pages/LegalBenchPro_slide_03_model_stability_score_distribution.pdf differ
diff --git a/paper/presentation_pages/LegalBenchPro_slide_05_literature_benchmark_comparison.pdf b/paper/presentation_pages/LegalBenchPro_slide_05_literature_benchmark_comparison.pdf
new file mode 100644
index 0000000..b11f2c9
Binary files /dev/null and b/paper/presentation_pages/LegalBenchPro_slide_05_literature_benchmark_comparison.pdf differ