diff --git a/docs/execution-status.md b/docs/execution-status.md
index 5556490..2863984 100644
--- a/docs/execution-status.md
+++ b/docs/execution-status.md
@@ -14,7 +14,7 @@ The committed implementation now includes:
 - A pinned Phase 1 PDFium profile in `docs/pdfium-profile.md` and `profiles/ethos-deterministic-v1.json`: `chromium/7881`, V8/XFA disabled, platform artifact hashes, runtime library hashes, and provenance are recorded.
 - Runtime checks that reject missing or mismatched PDFium versions, release artifacts, and extracted libraries with stable errors before dynamic loading.
 - `ethos doc parse` / `ethos fingerprint` PDF execution through a worker process with `max_parse_ms` timeout enforcement, stable error-envelope relay, diagnostics-gated worker stderr, and page-range validation/filtering.
-- Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements and simple column reading order for the current born-digital fixtures.
+- Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements and simple column reading order for the current born-digital fixtures. Fixture validation binds selected `fixture.json` expectations to committed extraction/layout goldens so current read-order cases fail closed on drift.
 - Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema.
 - `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, split-quote, not-found, stale-fingerprint, unsupported non-v1 claim, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
 - Verification semantics are now trust-honest at alpha scope: quote containment is explicitly labeled, value/table-cell checks require normalized equality, fingerprint-pinned citations fail closed when source fingerprints are unavailable, and structured capability limits explain why a run is downgraded.
@@ -49,7 +49,7 @@ Milestone A has an accepted internal Gate Zero decision for roadmap control, so
 | PDFium Phase 1 profile | Landed: pinned profile, V8/XFA-disabled state, platform hashes, runtime library hashes, and provenance are recorded | Phase 2 project-maintained builds still block Public Beta |
 | PDFium loader/runtime checks | Landed: missing/mismatched version, artifact, and runtime library hashes fail deterministically | Release packaging and operator setup path still need hardening |
 | Real PDF backend | Landed for simple born-digital PDFs: page count, quantized spans, worker execution, timeout, page filtering, and fingerprint path exist | Wider corpus coverage, failure fixtures, memory-limit behavior, quirk log, and Gate Zero run are still missing |
-| Layout groundwork | Landed: basic paragraph text blocks and simple column reading order over quantized spans | Tables, headings, lists, rotation/quirk handling, and confidence policy remain future work |
+| Layout groundwork | Landed: basic paragraph text blocks, simple column reading order over quantized spans, and fixture metadata checks against committed extraction/layout goldens | Tables, headings, lists, rotation/quirk handling, and confidence policy remain future work |
 | Font policy groundwork | Partially landed: substitution table and profile policy are present; fixture output uses deterministic substitution IDs | Bundled fallback asset hashing and broader font/CID validation remain open |
 | Schema/example validation | Landed: schemas, examples, deterministic profile, referential integrity, and bbox sanity pass the `jsonschema` validation gate | Contract changes still require explicit versioning and compatibility review |
 | Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, split-quote fixture coverage, explicit unsupported non-v1 claim reporting, OpenDataLoader-style structure diagnostics for malformed bbox and unknown-page references, verify-alpha case inventory checks, and demo fixtures | Still needed: real OpenDataLoader table-cell grounding, additional adapter hardening against broader real output shapes, future claim-kind expansion outside the current v1 alpha policy, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
diff --git a/fixtures/README.md b/fixtures/README.md
index 7322243..b2b5ce2 100644
--- a/fixtures/README.md
+++ b/fixtures/README.md
@@ -37,6 +37,15 @@ Successful parse fixtures also carry c14n stage goldens:
 - `extraction.json`: `ethos_core::traits::Extraction` after the PDF backend boundary.
 - `layout.json`: `ethos_core::traits::LayoutOutput` after deterministic layout grouping.
 
+For successful fixtures, `validate_fixtures.py` also binds selected `fixture.json`
+expectations to those committed goldens:
+
+- `expected_pages`: exact `extraction.json` page count.
+- `expected_span_text`: exact `extraction.json` span text order.
+- `expected_elements`: exact `layout.json` element count.
+- `expected_text`: exact `layout.json` element text order. Use a string for a single
+  layout element and a string array when reading order spans multiple elements.
+
 Regenerate them only after reviewing parser/layout drift. First configure the
 pinned profile artifact for your platform; for macOS arm64 this is:
 
diff --git a/fixtures/synthetic/two-columns/fixture.json b/fixtures/synthetic/two-columns/fixture.json
index 57d8ccc..bc01820 100644
--- a/fixtures/synthetic/two-columns/fixture.json
+++ b/fixtures/synthetic/two-columns/fixture.json
@@ -12,6 +12,16 @@
     "Left top Left bottom",
     "Right top Right bottom"
   ],
+  "expected_span_text": [
+    "Right",
+    "top",
+    "Right",
+    "bottom",
+    "Left",
+    "top",
+    "Left",
+    "bottom"
+  ],
   "expected_pages": 1,
   "expected_elements": 2,
   "exercises": [
diff --git a/fixtures/validate_fixtures.py b/fixtures/validate_fixtures.py
index 1b3f29b..7ee52b0 100644
--- a/fixtures/validate_fixtures.py
+++ b/fixtures/validate_fixtures.py
@@ -132,18 +132,18 @@ def validate_c14n_scalar_contract(value, ctx: str) -> None:
     fail(f"{ctx} is not a valid JSON scalar/container")
 
 
-def validate_golden_file(path: Path, stage: str, keys: set[str]) -> None:
+def validate_golden_file(path: Path, stage: str, keys: set[str]):
     if not path.is_file():
         fail(f"{path.relative_to(ROOT)} missing for successful fixture")
-        return
+        return None
 
     golden = load_json(path)
     if golden is None:
-        return
+        return None
     ctx = str(path.relative_to(ROOT))
     if not isinstance(golden, dict):
         fail(f"{ctx} must be an object")
-        return
+        return None
     validate_c14n_scalar_contract(golden, ctx)
     if path.read_bytes() != canonical_json_bytes(golden):
         fail(f"{ctx} must be canonical JSON with one trailing newline")
@@ -160,6 +160,7 @@ def validate_golden_file(path: Path, stage: str, keys: set[str]) -> None:
         if not isinstance(golden.get("elements"), list):
             fail(f"{ctx} elements must be an array")
         validate_projection_items(ctx, "elements", golden.get("elements"), required=True)
+    return golden
 
 
 def validate_projection_items(ctx: str, key: str, value, required: bool) -> None:
@@ -248,6 +249,68 @@ def validate_foreign_fixture_packages() -> int:
     return count
 
 
+def validate_expected_count(value, expected, ctx: str) -> None:
+    if expected is None:
+        return
+    if not isinstance(expected, int) or expected < 0:
+        fail(f"{ctx} must be an integer >= 0")
+        return
+    if len(value) != expected:
+        fail(f"{ctx} expected {expected}, found {len(value)}")
+
+
+def validate_expected_text(metadata, layout, ctx: str) -> None:
+    if "expected_text" not in metadata:
+        return
+    expected = metadata["expected_text"]
+    elements = layout.get("elements") if isinstance(layout, dict) else None
+    if not isinstance(elements, list):
+        return
+    actual = [element.get("text") for element in elements]
+
+    if isinstance(expected, str):
+        if actual != [expected]:
+            fail(f"{ctx} expected_text must match layout element text order")
+    elif isinstance(expected, list) and all(isinstance(item, str) for item in expected):
+        if actual != expected:
+            fail(f"{ctx} expected_text list must match layout reading order")
+    else:
+        fail(f"{ctx} expected_text must be a string or string array")
+
+
+def validate_expected_span_text(metadata, extraction, ctx: str) -> None:
+    if "expected_span_text" not in metadata:
+        return
+    expected = metadata["expected_span_text"]
+    if not isinstance(expected, list) or not all(isinstance(item, str) for item in expected):
+        fail(f"{ctx} expected_span_text must be a string array")
+        return
+    spans = extraction.get("spans") if isinstance(extraction, dict) else None
+    if not isinstance(spans, list):
+        return
+    actual = [span.get("text") for span in spans]
+    if actual != expected:
+        fail(f"{ctx} expected_span_text must match extraction span order")
+
+
+def validate_stage_expectations(metadata_path: Path, metadata, extraction, layout) -> None:
+    ctx = str(metadata_path.relative_to(ROOT))
+    if isinstance(extraction, dict):
+        validate_expected_count(
+            extraction.get("pages", []),
+            metadata.get("expected_pages"),
+            f"{ctx} expected_pages",
+        )
+        validate_expected_span_text(metadata, extraction, ctx)
+    if isinstance(layout, dict):
+        validate_expected_count(
+            layout.get("elements", []),
+            metadata.get("expected_elements"),
+            f"{ctx} expected_elements",
+        )
+        validate_expected_text(metadata, layout, ctx)
+
+
 manifest = load_json(MANIFEST)
 if manifest is None:
     sys.exit(1)
@@ -378,16 +441,23 @@ def validate_foreign_fixture_packages() -> int:
 
     if "failure" not in manifest_subsets:
         fixture_dir = metadata_path.parent
-        validate_golden_file(
+        extraction_golden = validate_golden_file(
             fixture_dir / "extraction.json",
             "extraction",
             EXTRACTION_GOLDEN_KEYS,
         )
-        validate_golden_file(
+        layout_golden = validate_golden_file(
             fixture_dir / "layout.json",
             "layout",
             LAYOUT_GOLDEN_KEYS,
         )
+        if extraction_golden is not None and layout_golden is not None:
+            validate_stage_expectations(
+                metadata_path,
+                metadata,
+                extraction_golden,
+                layout_golden,
+            )
 
 if indexed_files != sorted(indexed_files):
     fail("manifest fixture entries must be sorted by file")
@@ -414,6 +484,7 @@ def validate_foreign_fixture_packages() -> int:
 ok("fixture.json sha256 values match document.pdf bytes")
 ok("fixture manifest has no missing or extra fixture documents")
 ok("successful fixture goldens have valid stage metadata")
+ok("successful fixture metadata expectations match committed stage goldens")
 ok(f"foreign fixture manifests bind {foreign_package_count} package(s) to committed hashes")
 
 if failures: