
Evaluation

Synthdocs includes tooling to evaluate how well fact locations are extracted, both for the tool itself and for your own prediction pipelines.

Two Evaluation Modes

1. Batch Evaluation (Recommended for Most Users)

Evaluate a generated batch output folder:

# Evaluate tool-produced fact locations from *.meta.yaml
uv run synthdocs eval fact-locations --target output/

# Evaluate user predictions against ground truth
uv run synthdocs eval fact-locations --target output/ --predictions-dir predictions/

Expected folder structure:

output/
└── cases/
    └── <case_id>/
        ├── case.meta.yaml          # contains introduced_facts per document
        └── documents/
            ├── 01-intake-form.md
            ├── 01-intake-form.meta.yaml   # tool predictions (facts + locations)
            └── ...

When using --predictions-dir, user predictions should mirror this layout with *.predictions.yaml files.
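For illustration, a mirrored predictions folder might look like this (the exact layout beyond the *.predictions.yaml naming is an assumption based on the structure above):

predictions/
└── cases/
    └── <case_id>/
        └── documents/
            ├── 01-intake-form.predictions.yaml
            └── ...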

Reports are written under the target folder:

  • output/reports/fact_location/tool/runs/<run_id>/... (tool predictions)
  • output/reports/fact_location/user/runs/<run_id>/... (user predictions)

2. Spec-Based Evaluation (For Synthdocs Development)

Generate sample documents from specs and evaluate:

# Generate + evaluate
uv run synthdocs eval fact-locations --spec purchase_agreement --samples-per-spec 3

# Evaluate existing documents (no generation)
uv run synthdocs eval fact-locations --skip-generate --documents-dir /path/to/docs

Specs live under reports/fact_location/specs/. See reports/fact_location/specs/README.md for the spec format.

Predictions

Predictions can come from two sources:

| Source | File pattern         | Use case                                                  |
|--------|----------------------|-----------------------------------------------------------|
| Tool   | *.meta.yaml          | Evaluate synthdocs' own fact-location extraction          |
| User   | *.predictions.yaml   | Evaluate your pipeline's predictions against ground truth |

For batch eval, use --predictions-dir to provide a mirrored folder of user predictions.

The Judge Pipeline

Evaluation uses a two-stage pipeline:

1. Deterministic grounding
   └── Does the predicted span exist exactly in the document?
       └── grounded = (slice_text == located_text)

2. LLM judge (only for grounded spans)
   └── Is the fact actually expressed in this context?
   └── How minimal is the extracted span (good boundaries vs. overly broad)?
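The deterministic grounding stage can be sketched as a simple slice comparison. This is illustrative only, not synthdocs' actual implementation; the field names (start, end, located_text) are assumptions about the span shape:

```python
from dataclasses import dataclass

@dataclass
class PredictedSpan:
    start: int          # character offset where the span begins (assumed field name)
    end: int            # character offset where the span ends (assumed field name)
    located_text: str   # text the prediction claims occupies [start:end]

def is_grounded(document: str, span: PredictedSpan) -> bool:
    """Stage 1: a span is grounded only if the document slice at the
    predicted position matches the predicted text exactly."""
    return document[span.start:span.end] == span.located_text
```

Because the comparison is exact, an off-by-one offset or any whitespace drift fails grounding, which is why only grounded spans proceed to the LLM judge.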

Pass Rule

A predicted span passes if:

pass = grounded AND entailed AND tightness_score >= 3

Where:

  • grounded: the located_text substring exists at the predicted position
  • entailed: the LLM judge confirms the fact is expressed in the context window
  • tightness_score (span minimality score): 1-5 scale (3+ means the extracted span is reasonably minimal)
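The pass rule combines the three signals above. A minimal sketch, assuming the threshold of 3 is the default rather than hard-coded:

```python
def span_passes(grounded: bool, entailed: bool, tightness_score: int,
                tightness_threshold: int = 3) -> bool:
    """A span passes only when all three checks succeed:
    grounded (exact slice match), entailed (LLM judge agrees the fact
    is expressed), and a tightness_score at or above the threshold."""
    return grounded and entailed and tightness_score >= tightness_threshold
```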

Key Metrics

| Metric                      | Description                                               |
|-----------------------------|-----------------------------------------------------------|
| fact_location_rate_verified | Required facts with at least one verified (passing) span  |
| verified_span_precision     | Passing spans / total predicted spans                     |
| fact_location_rate_raw      | Required facts with any span (before judging)             |
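The three metrics can be computed from per-fact span results. The input shape here (one list of span dicts per required fact, each with a "passed" flag) is a hypothetical simplification, not synthdocs' internal data model:

```python
def fact_location_metrics(facts: list[list[dict]]) -> dict[str, float]:
    """facts: one inner list per required fact; each element is a
    predicted span dict with a boolean "passed" key (assumed shape)."""
    total_facts = len(facts)
    spans = [s for f in facts for s in f]
    return {
        # fraction of required facts with at least one passing span
        "fact_location_rate_verified":
            sum(any(s["passed"] for s in f) for f in facts) / total_facts,
        # passing spans over all predicted spans
        "verified_span_precision":
            sum(s["passed"] for s in spans) / len(spans),
        # fraction of required facts with any predicted span at all
        "fact_location_rate_raw":
            sum(1 for f in facts if f) / total_facts,
    }
```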

Judge Configuration

Control judge behavior via CLI flags:

uv run synthdocs eval fact-locations --target output/ \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --judge-temperature 0.1 \
  --judge-context-chars 120

Skip the LLM judge entirely (use deterministic grounding only):

uv run synthdocs eval fact-locations --target output/ --no-judge

Judge Benchmark

Sanity-check the judge against a labeled dataset:

uv run synthdocs eval judge-benchmark fact-locations \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --out judge_results.jsonl

This prints:

  • Entailment accuracy (does the judge correctly identify when facts are present?)
  • Span minimality exact-match and MAE
  • Span pass agreement (does the combined pass/fail decision, entailed AND span minimality >= threshold, align with labels?)
  • Confusion matrices
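The benchmark statistics above are straightforward to compute from paired judge/gold labels. A sketch under the assumption that each item yields a judged and a gold entailment flag plus judged and gold minimality scores (the real result objects may differ):

```python
def judge_benchmark_summary(items: list[tuple[bool, bool, int, int]]):
    """items: (judge_entailed, gold_entailed, judge_score, gold_score)
    tuples, one per labeled example (assumed shape)."""
    n = len(items)
    # entailment accuracy: how often the judge's entailment call matches the label
    entail_acc = sum(je == ge for je, ge, _, _ in items) / n
    # span minimality: exact-match rate and mean absolute error on the 1-5 scale
    exact = sum(js == gs for _, _, js, gs in items) / n
    mae = sum(abs(js - gs) for _, _, js, gs in items) / n
    return entail_acc, exact, mae
```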

Using a Custom Backend for the Judge

The CLI supports only the OpenAI and Mistral backends. For a custom backend, use the Python API:

from synthdocs.eval.judge_benchmarks import (
    load_fact_locations_judge_benchmark,
    run_fact_locations_judge_benchmark,
)

result = run_fact_locations_judge_benchmark(
    backend=MyCustomBackend(),  # your own backend implementation (see below)
    items=load_fact_locations_judge_benchmark(),
)

See docs/custom-llm-backend.md for details on implementing and evaluating custom backends.