Synthdocs includes tooling to evaluate how well fact locations are extracted—both for the tool itself and for your own prediction pipelines.
Evaluate a generated batch output folder:
```shell
# Evaluate tool-produced fact locations from *.meta.yaml
uv run synthdocs eval fact-locations --target output/

# Evaluate user predictions against ground truth
uv run synthdocs eval fact-locations --target output/ --predictions-dir predictions/
```

Expected folder structure:
```
output/
└── cases/
    └── <case_id>/
        ├── case.meta.yaml              # contains introduced_facts per document
        └── documents/
            ├── 01-intake-form.md
            ├── 01-intake-form.meta.yaml   # tool predictions (facts + locations)
            └── ...
```
When using `--predictions-dir`, user predictions should mirror this layout with `*.predictions.yaml` files.
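As a sketch (not tool output), a mirrored predictions folder for the case above might look like this, assuming the same `cases/` tree:

```
predictions/
└── cases/
    └── <case_id>/
        └── documents/
            ├── 01-intake-form.predictions.yaml
            └── ...
```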
Reports are written under the target folder:
- `output/reports/fact_location/tool/runs/<run_id>/...` (tool predictions)
- `output/reports/fact_location/user/runs/<run_id>/...` (user predictions)
Generate sample documents from specs and evaluate:
```shell
# Generate + evaluate
uv run synthdocs eval fact-locations --spec purchase_agreement --samples-per-spec 3

# Evaluate existing documents (no generation)
uv run synthdocs eval fact-locations --skip-generate --documents-dir /path/to/docs
```

Specs live under `reports/fact_location/specs/`. See `reports/fact_location/specs/README.md` for the spec format.
Predictions can come from two sources:
| Source | File pattern | Use case |
|---|---|---|
| Tool | `*.meta.yaml` | Evaluate synthdocs' own fact-location extraction |
| User | `*.predictions.yaml` | Evaluate your pipeline's predictions against ground truth |
For batch eval, use `--predictions-dir` to provide a mirrored folder of user predictions.
Evaluation uses a two-stage pipeline:
1. Deterministic grounding
   - Does the predicted span exist exactly in the document?
   - `grounded = (slice_text == located_text)`
2. LLM judge (only for grounded spans)
   - Is the fact actually expressed in this context?
   - How minimal is the extracted span (good boundaries vs. overly broad)?
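The deterministic grounding stage can be sketched in a few lines of Python. The function and field names below (`start`, `end`, `located_text`) are assumptions for illustration, not the tool's actual schema:

```python
def is_grounded(document: str, start: int, end: int, located_text: str) -> bool:
    """Stage 1: a span is grounded when the document slice matches the
    predicted text exactly (grounded = slice_text == located_text)."""
    return document[start:end] == located_text


doc = "Buyer agrees to pay $500,000 at closing."
print(is_grounded(doc, 20, 28, "$500,000"))  # exact match
print(is_grounded(doc, 20, 28, "$500,00"))   # off-by-one text mismatch
```

Because this stage is a pure string comparison, it is cheap and fully deterministic; only spans that survive it are sent to the LLM judge.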
A predicted span passes if:
```
pass = grounded AND entailed AND tightness_score >= 3
```
Where:
- `grounded`: the `located_text` substring exists at the predicted position
- `entailed`: the LLM judge confirms the fact is expressed in the context window
- `tightness_score` (span minimality score): 1-5 scale (3+ means the extracted span is reasonably minimal)
| Metric | Description |
|---|---|
| `fact_location_rate_verified` | Required facts with at least one verified (passing) span |
| `verified_span_precision` | Passing spans / total predicted spans |
| `fact_location_rate_raw` | Required facts with any span (before judging) |
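As a rough sketch, the three metrics could be computed from per-span verdicts like this. The data shape (a list of facts, each with a list of span dicts) is an illustrative assumption, not the tool's actual report format:

```python
def span_passes(span: dict) -> bool:
    # pass = grounded AND entailed AND tightness_score >= 3
    return span["grounded"] and span["entailed"] and span["tightness"] >= 3


def compute_metrics(facts: list[dict]) -> dict:
    """Each fact: {"spans": [{"grounded": bool, "entailed": bool, "tightness": int}]}."""
    all_spans = [s for f in facts for s in f["spans"]]
    passing = [s for s in all_spans if span_passes(s)]
    return {
        # facts with at least one verified (passing) span
        "fact_location_rate_verified": sum(
            any(span_passes(s) for s in f["spans"]) for f in facts
        ) / len(facts),
        # passing spans / total predicted spans
        "verified_span_precision": len(passing) / len(all_spans) if all_spans else 0.0,
        # facts with any predicted span at all (before judging)
        "fact_location_rate_raw": sum(bool(f["spans"]) for f in facts) / len(facts),
    }


facts = [
    {"spans": [{"grounded": True, "entailed": True, "tightness": 4}]},   # passes
    {"spans": [{"grounded": True, "entailed": False, "tightness": 5}]},  # fails entailment
    {"spans": []},                                                       # no prediction
]
print(compute_metrics(facts))
```

Note the distinction the raw metric captures: the second fact counts toward `fact_location_rate_raw` but not `fact_location_rate_verified`, because its only span fails the judge.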
Control judge behavior via CLI flags:
```shell
uv run synthdocs eval fact-locations --target output/ \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --judge-temperature 0.1 \
  --judge-context-chars 120
```

Skip the LLM judge entirely (use deterministic grounding only):

```shell
uv run synthdocs eval fact-locations --target output/ --no-judge
```

Sanity-check the judge against a labeled dataset:
```shell
uv run synthdocs eval judge-benchmark fact-locations \
  --judge-backend openai \
  --judge-model gpt-4o-mini \
  --out judge_results.jsonl
```

This prints:
- Entailment accuracy (does the judge correctly identify when facts are present?)
- Span minimality exact-match and MAE
- Span pass agreement (does the pass/fail decision, i.e. entailed AND span minimality >= threshold, align with the labels?)
- Confusion matrices
The CLI supports only the OpenAI and Mistral backends. For custom backends, use the Python API:
```python
from synthdocs.eval.judge_benchmarks import (
    load_fact_locations_judge_benchmark,
    run_fact_locations_judge_benchmark,
)

result = run_fact_locations_judge_benchmark(
    backend=MyCustomBackend(),  # your custom backend implementation
    items=load_fact_locations_judge_benchmark(),
)
```

See `docs/custom-llm-backend.md` for details on implementing and evaluating custom backends.