Merged
142 changes: 49 additions & 93 deletions crates/ethos-cli/tests/verify.rs
@@ -127,6 +127,54 @@ fn odl_example() -> PathBuf {
repo_root().join("examples/verify/opendataloader.json")
}

fn verify_alpha_report_cases() -> Vec<(String, Vec<String>, PathBuf)> {
let root = repo_root();
let inventory = json_file(root.join("examples/verify/cases.json"));
let report_cases = inventory["report_cases"]
.as_array()
.expect("verify-alpha report_cases is an array");

report_cases
.iter()
.map(|case| {
let name = case["name"]
.as_str()
.expect("verify-alpha case name is a string")
.to_string();
let mut args = vec![
"verify".to_string(),
root.join(
case["input"]
.as_str()
.expect("verify-alpha case input is a string"),
)
.display()
.to_string(),
];
if let Some(grounding) = case.get("grounding").and_then(Value::as_str) {
args.push("--grounding".to_string());
args.push(grounding.to_string());
}
args.push("--citations".to_string());
args.push(
root.join(
case["citations"]
.as_str()
.expect("verify-alpha case citations is a string"),
)
.display()
.to_string(),
);
let expected = root.join(
case["golden"]
.as_str()
.expect("verify-alpha case golden is a string"),
);
(name, args, expected)
})
.collect()
}

#[test]
fn verify_alpha_schema_report_example_matches_cli_output() {
let root = repo_root();
@@ -147,99 +195,7 @@ fn verify_alpha_schema_report_example_matches_cli_output() {

#[test]
fn verify_alpha_demo_reports_match_goldens() {
let root = repo_root();
let cases: [(&str, Vec<String>, PathBuf); 6] = [
(
"native-grounded",
vec![
"verify".to_string(),
root.join("schemas/examples/document.example.json")
.display()
.to_string(),
"--citations".to_string(),
root.join("examples/verify/native_grounded_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/native_grounded_report.json"),
),
(
"opendataloader-grounded",
vec![
"verify".to_string(),
root.join("examples/verify/opendataloader.json")
.display()
.to_string(),
"--grounding".to_string(),
"opendataloader-json".to_string(),
"--citations".to_string(),
root.join("examples/verify/opendataloader_grounded_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/opendataloader_grounded_report.json"),
),
(
"native-split-quote",
vec![
"verify".to_string(),
root.join("examples/verify/native_split_quote_document.json")
.display()
.to_string(),
"--citations".to_string(),
root.join("examples/verify/native_split_quote_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/native_split_quote_report.json"),
),
(
"native-non-v1-claims",
vec![
"verify".to_string(),
root.join("schemas/examples/document.example.json")
.display()
.to_string(),
"--citations".to_string(),
root.join("examples/verify/native_non_v1_claims_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/native_non_v1_claims_report.json"),
),
(
"native-stale",
vec![
"verify".to_string(),
root.join("schemas/examples/document.example.json")
.display()
.to_string(),
"--citations".to_string(),
root.join("examples/verify/native_stale_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/native_stale_report.json"),
),
(
"opendataloader-capability-limited",
vec![
"verify".to_string(),
root.join("examples/verify/opendataloader_no_tables.json")
.display()
.to_string(),
"--grounding".to_string(),
"opendataloader-json".to_string(),
"--citations".to_string(),
root.join("examples/verify/opendataloader_table_cell_citations.json")
.display()
.to_string(),
],
root.join("examples/verify/goldens/opendataloader_capability_limited_report.json"),
),
];

for (name, args, expected_path) in cases {
for (name, args, expected_path) in verify_alpha_report_cases() {
let args = args.iter().map(String::as_str).collect::<Vec<_>>();
let actual = parse_success(&args);
let expected = json_file(expected_path);
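
The new `verify_alpha_report_cases` helper replaces the hard-coded six-entry array with argument vectors derived from `examples/verify/cases.json`. The argument-building step can be sketched on its own, using only the standard library. The `ReportCase` struct and `build_case` function below are hypothetical names introduced for illustration; the PR itself reads the fields straight from a JSON `Value` and does not define a typed struct. The `/repo` root is likewise a placeholder.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical typed view of one `report_cases` entry (not part of the PR).
struct ReportCase {
    name: String,
    input: String,
    grounding: Option<String>, // optional, mirrors `case.get("grounding")`
    citations: String,
    golden: String,
}

/// Builds (name, CLI args, golden path) in the same order as the helper in the diff:
/// `verify <input> [--grounding <kind>] --citations <file>`.
fn build_case(root: &Path, case: &ReportCase) -> (String, Vec<String>, PathBuf) {
    let mut args = vec![
        "verify".to_string(),
        root.join(&case.input).display().to_string(),
    ];
    // `--grounding` is only emitted when the inventory entry declares it.
    if let Some(grounding) = &case.grounding {
        args.push("--grounding".to_string());
        args.push(grounding.clone());
    }
    args.push("--citations".to_string());
    args.push(root.join(&case.citations).display().to_string());
    (case.name.clone(), args, root.join(&case.golden))
}

fn main() {
    // Values copied from the removed `opendataloader-grounded` array entry.
    let case = ReportCase {
        name: "opendataloader-grounded".to_string(),
        input: "examples/verify/opendataloader.json".to_string(),
        grounding: Some("opendataloader-json".to_string()),
        citations: "examples/verify/opendataloader_grounded_citations.json".to_string(),
        golden: "examples/verify/goldens/opendataloader_grounded_report.json".to_string(),
    };
    let (name, args, golden) = build_case(Path::new("/repo"), &case);
    println!("{name}: {}", args.join(" "));
    println!("golden: {}", golden.display());
}
```

Keeping `grounding` optional is what lets native and OpenDataLoader-style cases share one inventory shape: native entries simply omit the field.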
6 changes: 3 additions & 3 deletions docs/execution-status.md
@@ -16,9 +16,9 @@ The committed implementation now includes:
- `ethos doc parse` / `ethos fingerprint` PDF execution through a worker process with `max_parse_ms` timeout enforcement, stable error-envelope relay, diagnostics-gated worker stderr, and page-range validation/filtering.
- Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements and simple column reading order for the current born-digital fixtures.
- Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema.
- `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, not-found, stale-fingerprint, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
- `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, split-quote, not-found, stale-fingerprint, unsupported non-v1 claim, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
- Verification semantics are now trust-honest at alpha scope: quote containment is explicitly labeled, value/table-cell checks require normalized equality, fingerprint-pinned citations fail closed when source fingerprints are unavailable, and structured capability limits explain why a run is downgraded.
- `make verify-alpha` is the current alpha trust-loop command: it checks native examples, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, usage diagnostics for malformed citations and malformed OpenDataLoader-style inputs, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding.
- `make verify-alpha` is the current alpha trust-loop command: it checks native examples, split-quote evidence matching, unsupported non-v1 claim reporting, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, verify-alpha case inventory coverage, usage diagnostics for malformed citations and malformed OpenDataLoader-style structures, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding.
- Native Ethos verification can emit deterministic, schema-backed crop descriptor JSON artifacts through `--crop-dir`; these bind `document_fingerprint`, page, bbox, and check ids. Native `crop_ref` filenames are logical evidence references derived from document fingerprint, check id, and page, while descriptors still record the exact observed bbox. When `--crop-source-pdf` is supplied, the CLI validates source-PDF fingerprint binding and emits PNG crop artifacts whose filenames, byte hashes, dimensions, and source fingerprint are bound from the descriptor. `make verify-rendered-crops` checks same-host repeated-run stability for the rendered artifact path, and `make compare-rendered-crops` classifies two rendered-crop runs by separating logical evidence identity from rendered artifact byte equality. Cross-platform rendered image determinism is not claimed; the 2026-06-14 macOS arm64 vs Linux x64 validation record in `docs/validation/rendered-crops-2026-06-14.md` preserved document fingerprint and `payload_sha256` but failed rendered artifact byte equality because the evidence bbox differed slightly across platforms.

Still absent or not claimable: public benchmark reports, public competitor-comparison claims, public speed/quality/footprint claims, OCR/image-only support, real table extraction, mature list/heading/layout semantics, semantic/arithmetic verification beyond deterministic evidence lookup, Phase 2 project-maintained PDFium builds, release packaging, and claim-audit approval for any public result wording.
@@ -52,7 +52,7 @@ Milestone A has an accepted internal Gate Zero decision for roadmap control, so
| Layout groundwork | Landed: basic paragraph text blocks and simple column reading order over quantized spans | Tables, headings, lists, rotation/quirk handling, and confidence policy remain future work |
| Font policy groundwork | Partially landed: substitution table and profile policy are present; fixture output uses deterministic substitution IDs | Bundled fallback asset hashing and broader font/CID validation remain open |
| Schema/example validation | Landed: schemas, examples, deterministic profile, referential integrity, and bbox sanity pass the `jsonschema` validation gate | Contract changes still require explicit versioning and compatibility review |
| Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, and demo fixtures | Still needed: evidence matching against richer source structures, semantic/arithmetic claim handling by explicit non-v1 design, real OpenDataLoader table-cell grounding, broader adapter hardening against real output, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
| Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, split-quote fixture coverage, explicit unsupported non-v1 claim reporting, OpenDataLoader-style structure diagnostics for malformed bbox and unknown-page references, verify-alpha case inventory checks, and demo fixtures | Still needed: real OpenDataLoader table-cell grounding, additional adapter hardening against broader real output shapes, future claim-kind expansion outside the current v1 alpha policy, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
| WS-HARNESS readiness | Partially landed: readiness path is green for frozen corpus/hardware and pinned competitors, Gate Zero evidence preflight validates the current `ethos-bench` handoff, and gates fail closed if those records regress | Public-safe comparison report flow, release/package approval, claim-wording approval, and future evidence-refresh workflow still need hardening |

## PM Rule
2 changes: 1 addition & 1 deletion docs/roadmap.md
@@ -13,7 +13,7 @@ Current PM status and blockers: `docs/execution-status.md`.
| --- | --- | --- | --- |
| Week 0 | pre-kickoff | ADRs, governance, corpus freeze, CI bootstrap, competitor pins | All 11 rows done; clock starts |
| A | weeks 1-8 | Contracts (5 schemas, c14n, deterministic profile), trust-boundary artifacts (`GroundingSource`, verification schemas, OpenDataLoader adapter stub, `ethos verify` CLI stub), PDFium Phase 1 spike, harness + competitor adapters, CLI skeleton | **Gate Zero**: ADR-0005 is accepted as `PROCEED` for internal Milestone B continuation. This is not public benchmark, release, package, production, or claim approval. |
| B | weeks 9-14 | **`ethos verify` alpha first**: native Ethos JSON + OpenDataLoader verification demo, stale fingerprint checks, capability-limited reports, deterministic evidence matching; then reading order, blocks, headings, lists, Markdown/text exporters, Python wheel scaffold, quality dashboard, Windows x64 nightly determinism | 13-B exit checklist |
| B | weeks 9-14 | **`ethos verify` alpha first**: native Ethos JSON + synthetic and pinned real OpenDataLoader verification demos, stale fingerprint checks, capability-limited reports, deterministic evidence matching including split-quote coverage, explicit unsupported non-v1 claim reporting, adapter structure diagnostics; then reading order, blocks, headings, lists, Markdown/text exporters, Python wheel scaffold, quality dashboard, Windows x64 nightly determinism | 13-B exit checklist |
| C | weeks 15-22 | Simple/bordered tables; RAG chunker + citations; non-text region coordinates; security report + default-chunk exclusion; debug overlay; internal benchmark snapshot | 13-C exit + first checkpoint |
| D | weeks 23-30 | `verify_citations` v1; crop API; sandbox/subprocess backend; Node beta and MCP experimental only if staffed or accepted by release-scope ADR | 13-D exit |
| E | weeks 31-40 | Public benchmark report (reproducible, labeled tiers); PDFium Phase 2 project-maintained builds; stable CLI/Python docs; proof-of-trust demos; **Public Beta** | Release 1 claim audit + public-beta checkpoint |
27 changes: 27 additions & 0 deletions examples/verify/README.md
@@ -2,6 +2,33 @@

This directory contains verify-alpha fixtures, citations, and golden reports.

`cases.json` is the executable verify-alpha case inventory. `make verify-alpha` fails if a
listed fixture path is missing, if a report golden is not covered by the inventory, if the
real OpenDataLoader fixture manifest hashes drift, or if this README stops naming an inventory
case.
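
Two of the failure conditions above (a report golden not covered by the inventory, and a README that stops naming an inventory case) amount to set-difference checks. The sketch below shows one way such checks could look, assuming the inventory goldens, on-disk goldens, and case names have already been collected as strings. The function names `uncovered_goldens` and `cases_missing_from_readme` are hypothetical; how `make verify-alpha` actually performs these checks is not shown in this diff.

```rust
use std::collections::BTreeSet;

/// Returns golden files present on disk that no inventory case references.
/// Sorted output (via BTreeSet) keeps diagnostics stable across runs.
fn uncovered_goldens(inventory: &[&str], on_disk: &[&str]) -> Vec<String> {
    let covered: BTreeSet<&str> = inventory.iter().copied().collect();
    let present: BTreeSet<&str> = on_disk.iter().copied().collect();
    present
        .difference(&covered)
        .map(|path| path.to_string())
        .collect()
}

/// Returns inventory case names the README does not mention in backticks.
/// Matching on the backticked form avoids treating `native-ungrounded-summary`
/// as a mention of `native-ungrounded`.
fn cases_missing_from_readme(readme: &str, case_names: &[&str]) -> Vec<String> {
    case_names
        .iter()
        .filter(|name| !readme.contains(format!("`{name}`").as_str()))
        .map(|name| name.to_string())
        .collect()
}

fn main() {
    // Illustrative inputs only; a real check would read these from disk.
    let inventory = ["goldens/native_grounded_report.json"];
    let on_disk = [
        "goldens/native_grounded_report.json",
        "goldens/native_stale_report.json",
    ];
    for path in uncovered_goldens(&inventory, &on_disk) {
        eprintln!("golden not covered by inventory: {path}");
    }

    let readme = "| `native-grounded` | Native quote, table-cell, and presence grounding. |";
    for name in cases_missing_from_readme(readme, &["native-grounded", "native-stale"]) {
        eprintln!("README does not name inventory case: {name}");
    }
}
```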

## Verify-Alpha Case Inventory

Report cases:

| Case | Coverage |
| --- | --- |
| `native-grounded` | Native quote, table-cell, and presence grounding. |
| `opendataloader-grounded` | Synthetic OpenDataLoader-style grounding with declared capability limits. |
| `native-split-quote` | Adjacent native text evidence matching. |
| `native-non-v1-claims` | Explicit unsupported non-v1 claim reporting. |
| `native-ungrounded` | Native mismatch and missing element diagnostics. |
| `opendataloader-not-found` | Synthetic OpenDataLoader-style missing element diagnostics. |
| `native-stale` | Fingerprint staleness handling. |
| `opendataloader-capability-limited` | Capability-limited table-cell reporting. |
| `real-opendataloader-grounded` | Pinned real OpenDataLoader grounded fixture. |
| `real-opendataloader-ungrounded` | Pinned real OpenDataLoader ungrounded fixture. |

Usage-error cases: `invalid-table-cell-citation`, `invalid-bbox-citation`,
`opendataloader-malformed-bbox-input`, and `opendataloader-unknown-page-input`.

Summary cases: `native-ungrounded-summary`.

## Native Ethos Grounding

```bash