
Evaluation

Synthdocs includes tooling to evaluate how well fact locations are extracted, both for the tool itself and for your own prediction pipelines.

Two Evaluation Modes

1. Batch Evaluation (Recommended for Most Users)

Evaluate a generated batch output folder:

# Evaluate tool-produced fact locations from *.meta.yaml
uv run synthdocs eval fact-locations --target output/

# Evaluate user predictions against ground truth
uv run synthdocs eval fact-locations --target output/ --predictions-dir predictions/

Expected folder structure:

output/
└── cases/
    └── <case_id>/
        ├── case.meta.yaml          # contains introduced_facts per document
        └── documents/
            ├── 01-intake-form.md
            ├── 01-intake-form.meta.yaml   # tool predictions (facts + locations)
            └── ...

When using --predictions-dir, user predictions should mirror this layout with *.predictions.yaml files.
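For illustration, a mirrored predictions folder might look like this (the exact layout beyond the *.predictions.yaml naming is an assumption based on the structure above):

predictions/
└── cases/
    └── <case_id>/
        └── documents/
            ├── 01-intake-form.predictions.yaml
            └── ...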

Reports are written under the target folder:

  • output/reports/fact_location/tool/runs/<run_id>/... (tool predictions)
  • output/reports/fact_location/user/runs/<run_id>/... (user predictions)

2. Spec-Based Evaluation (For Synthdocs Development)

Generate sample documents from specs and evaluate:

# Generate + evaluate
uv run synthdocs eval fact-locations --spec purchase_agreement --samples-per-spec 3

# Evaluate existing documents (no generation)
uv run synthdocs eval fact-locations --skip-generate --documents-dir /path/to/docs

Specs live under reports/fact_location/specs/. See reports/fact_location/specs/README.md for the spec format.

Predictions

Predictions can come from two sources:

| Source | File pattern         | Use case                                                  |
|--------|----------------------|-----------------------------------------------------------|
| Tool   | *.meta.yaml          | Evaluate synthdocs' own fact-location extraction          |
| User   | *.predictions.yaml   | Evaluate your pipeline's predictions against ground truth |

For batch eval, use --predictions-dir to provide a mirrored folder of user predictions.

The Judge Pipeline

Evaluation uses a two-stage pipeline:

1. Deterministic grounding
   └── Does the predicted span exist exactly in the document?
       └── grounded = (slice_text == located_text)

2. LLM judge (only for grounded spans)
   └── Is the fact actually expressed in this context?
   └── How minimal is the extracted span (good boundaries vs. overly broad)?
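The deterministic grounding stage can be sketched as a simple slice comparison. This is illustrative only, not synthdocs' actual implementation; the field names (start, end, located_text) are assumptions about the span shape:

```python
from dataclasses import dataclass

@dataclass
class PredictedSpan:
    start: int          # character offset where the span begins (assumed field name)
    end: int            # character offset where the span ends (assumed field name)
    located_text: str   # text the prediction claims occupies [start:end]

def is_grounded(document: str, span: PredictedSpan) -> bool:
    """Stage 1: a span is grounded only if the document slice at the
    predicted position matches the predicted text exactly."""
    return document[span.start:span.end] == span.located_text
```

Because the comparison is exact, an off-by-one offset or any whitespace drift fails grounding, which is why only grounded spans proceed to the LLM judge.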

Pass Rule

A predicted span passes if:

pass = grounded AND entailed AND tightness_score >= 3

Where:

  • grounded: the located_text substring exists at the predicted position
  • entailed: the LLM judge confirms the fact is expressed in the context window
  • tightness_score (span minimality score): 1-5 scale (3+ means the extracted span is reasonably minimal)
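The pass rule combines the three signals above. A minimal sketch, assuming the threshold of 3 is the default rather than hard-coded:

```python
def span_passes(grounded: bool, entailed: bool, tightness_score: int,
                tightness_threshold: int = 3) -> bool:
    """A span passes only when all three checks succeed:
    grounded (exact slice match), entailed (LLM judge agrees the fact
    is expressed), and a tightness_score at or above the threshold."""
    return grounded and entailed and tightness_score >= tightness_threshold
```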

Key Metrics

| Metric                      | Description                                               |
|-----------------------------|-----------------------------------------------------------|
| fact_location_rate_verified | Required facts with at least one verified (passing) span  |
| verified_span_precision     | Passing spans / total predicted spans                     |
| fact_location_rate_raw      | Required facts with any span (before judging)             |
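The three metrics can be computed from per-fact span results. The input shape here (one list of span dicts per required fact, each with a "passed" flag) is a hypothetical simplification, not synthdocs' internal data model:

```python
def fact_location_metrics(facts: list[list[dict]]) -> dict[str, float]:
    """facts: one inner list per required fact; each element is a
    predicted span dict with a boolean "passed" key (assumed shape)."""
    total_facts = len(facts)
    spans = [s for f in facts for s in f]
    return {
        # fraction of required facts with at least one passing span
        "fact_location_rate_verified":
            sum(any(s["passed"] for s in f) for f in facts) / total_facts,
        # passing spans over all predicted spans
        "verified_span_precision":
            sum(s["passed"] for s in spans) / len(spans),
        # fraction of required facts with any predicted span at all
        "fact_location_rate_raw":
            sum(1 for f in facts if f) / total_facts,
    }
```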

Judge Configuration

Control judge behavior via CLI flags:

uv run synthdocs eval fact-locations --target output/ \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --judge-temperature 0.1 \
  --judge-context-chars 120

Skip the LLM judge entirely (use deterministic grounding only):

uv run synthdocs eval fact-locations --target output/ --no-judge

Judge Benchmark

Sanity-check the judge against a labeled dataset:

uv run synthdocs eval judge-benchmark fact-locations \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --out judge_results.jsonl

This prints:

  • Entailment accuracy (does the judge correctly identify when facts are present?)
  • Span minimality exact-match and MAE
  • Span pass agreement (does the combined pass/fail decision, entailed AND span minimality >= threshold, align with labels?)
  • Confusion matrices
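The benchmark statistics above are straightforward to compute from paired judge/gold labels. A sketch under the assumption that each item yields a judged and a gold entailment flag plus judged and gold minimality scores (the real result objects may differ):

```python
def judge_benchmark_summary(items: list[tuple[bool, bool, int, int]]):
    """items: (judge_entailed, gold_entailed, judge_score, gold_score)
    tuples, one per labeled example (assumed shape)."""
    n = len(items)
    # entailment accuracy: how often the judge's entailment call matches the label
    entail_acc = sum(je == ge for je, ge, _, _ in items) / n
    # span minimality: exact-match rate and mean absolute error on the 1-5 scale
    exact = sum(js == gs for _, _, js, gs in items) / n
    mae = sum(abs(js - gs) for _, _, js, gs in items) / n
    return entail_acc, exact, mae
```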

Using a Custom Backend for the Judge

The CLI supports only the OpenAI and Mistral backends. For a custom backend, use the Python API:

from synthdocs.eval.judge_benchmarks import (
    load_fact_locations_judge_benchmark,
    run_fact_locations_judge_benchmark,
)

result = run_fact_locations_judge_benchmark(
    backend=MyCustomBackend(),  # your own backend implementation (see below)
    items=load_fact_locations_judge_benchmark(),
)

See docs/custom-llm-backend.md for details on implementing and evaluating custom backends.