Merged
docs/benchmark-ownership.md (7 changes: 4 additions & 3 deletions)
@@ -37,8 +37,8 @@ the Ethos source tree or rewrite Ethos parser behavior.
 ## Claim Rule

 No public performance, footprint, quality, table, heading, or parser-speed superlative claim is
-allowed from either repository until signed G1/G2/G3 result files exist for the required hosts
-and the claim audit says the wording is supported.
+allowed from either repository unless a claim audit maps the exact wording to accepted evidence
+and an explicit public-report or release decision allows that surface.

 Allowed current wording:

@@ -63,4 +63,5 @@ cross-platform rendered-crop byte-identity claims
 4. Record cross-host or claim-affecting evidence in `ethos/docs/validation/` or in signed
 `ethos-bench/benchmarks/results/gate-zero/` files, depending on whether the evidence is a
 product-boundary validation or a benchmark result.
-5. Fill ADR-0005 only after the required G1/G2/G3 result files and evidence bundles exist.
+5. Amend ADR-0005 or add a successor decision only after the required result files and evidence
+bundles exist for that decision.
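The revised claim rule is a conjunction of two gates: a claim audit must map the exact wording to accepted evidence, and an explicit public-report or release decision must allow the surface where that wording appears. A minimal Python sketch of that conjunction follows. The function name `claim_allowed`, the shape of `audit_map` (wording to evidence IDs), and the surface names are hypothetical; this is not the claim-audit tooling in either repository.

```python
def claim_allowed(
    wording: str,
    surface: str,
    audit_map: dict[str, frozenset[str]],
    accepted_evidence: frozenset[str],
    approved_surfaces: frozenset[str],
) -> bool:
    """Hypothetical gate mirroring the Claim Rule text; not the project's audit tool."""
    # The audit must map this exact wording; reworded sentences get no lookup match.
    evidence_ids = audit_map.get(wording)
    if not evidence_ids:
        return False
    # Every evidence item the audit cites must already be accepted.
    if not evidence_ids <= accepted_evidence:
        return False
    # A public-report or release decision must explicitly allow this surface.
    return surface in approved_surfaces
```

In this sketch the exact-string lookup is deliberately strict: a reworded sentence needs its own audit entry before it can appear anywhere.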
docs/decisions/README.md (4 changes: 2 additions & 2 deletions)
@@ -5,11 +5,11 @@ Every closing PRD §15 open question gets an ADR here. Output-changing merges re
 | ADR | Title | Status |
 | --- | --- | --- |
 | 0000 | Gate Zero decider | Accepted |
-| 0001 | Staffing confirmation vs plan schedule | **Proposed — blocks Week 0 exit** |
+| 0001 | Staffing confirmation vs plan schedule | Accepted |
 | 0002 | PDFium two-phase distribution path | Accepted |
 | 0003 | Deterministic font policy | Accepted |
 | 0004 | Licensing and dependency policy | Accepted |
-| 0005 | Gate Zero decision | Template (filled week 4) |
+| 0005 | Gate Zero decision | Accepted - PROCEED |
 | 0006 | Package identifiers | Accepted |
 | 0007 | Trust layer first | Accepted |
 | 0008 | Gate Zero G2 footprint policy | Accepted |
docs/execution-status.md (25 changes: 15 additions & 10 deletions)
@@ -2,11 +2,11 @@

 Date: 2026-06-16
 Owner: product / decider
-Status: Pre-alpha / Milestone A implementation. Week 0 governance is accepted, WS-ENGINE Phase 1 has a real narrow PDFium path, WS-VERIFY-ALPHA has real deterministic evidence checks over native Ethos JSON and pinned OpenDataLoader output, WS-HARNESS has fail-closed readiness scaffolding, the Gate Zero corpus/hardware manifest and direct competitor lock are frozen/signed, ADR-0006 closes package identifier/trademark validation, ADR-0007 locks the product direction, and the public-source preflight is green for a source-only pre-alpha GitHub push. Signed host result generation still blocks Gate Zero, public benchmark reports, releases, packages, and all performance/quality claims. The next controlled-run handoff is `docs/gate-zero-evidence-runbook.md`.
+Status: Pre-alpha / Milestone B entry. Week 0 governance is accepted, WS-ENGINE Phase 1 has a real narrow PDFium path, WS-VERIFY-ALPHA has real deterministic evidence checks over native Ethos JSON and pinned OpenDataLoader output, WS-HARNESS has fail-closed readiness scaffolding, the Gate Zero corpus/hardware manifest and direct competitor lock are frozen/signed, ADR-0005 records an accepted `PROCEED` decision for internal Milestone B continuation, ADR-0006 closes package identifier/trademark validation, ADR-0007 locks the product direction, and the public-source preflight is green for a source-only pre-alpha GitHub push. Public benchmark reports, releases, packages, production positioning, and all performance/quality/footprint claims remain blocked. The controlled-run handoff remains `docs/gate-zero-evidence-runbook.md`; the accepted decision record is `docs/decisions/ADR-0005-gate-zero-decision.md`.

 ## Current Reality

-The repository is still pre-alpha, but it is no longer only contract/scaffold code. Real parsing and real alpha verification exist. They are narrow, fixture-backed, and not yet Gate Zero-proven.
+The repository is still pre-alpha, but it is no longer only contract/scaffold code. Real parsing and real alpha verification exist. They are narrow, fixture-backed, and now have an accepted internal Gate Zero decision for roadmap control. That decision is not a public benchmark report, release approval, package approval, production approval, or claim approval.

 The committed implementation now includes:

@@ -21,23 +21,28 @@ The committed implementation now includes:
 - `make verify-alpha` is the current alpha trust-loop command: it checks native examples, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, usage diagnostics for malformed citations and malformed OpenDataLoader-style inputs, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding.
 - Native Ethos verification can emit deterministic, schema-backed crop descriptor JSON artifacts through `--crop-dir`; these bind `document_fingerprint`, page, bbox, and check ids. Native `crop_ref` filenames are logical evidence references derived from document fingerprint, check id, and page, while descriptors still record the exact observed bbox. When `--crop-source-pdf` is supplied, the CLI validates source-PDF fingerprint binding and emits PNG crop artifacts whose filenames, byte hashes, dimensions, and source fingerprint are bound from the descriptor. `make verify-rendered-crops` checks same-host repeated-run stability for the rendered artifact path, and `make compare-rendered-crops` classifies two rendered-crop runs by separating logical evidence identity from rendered artifact byte equality. Cross-platform rendered image determinism is not claimed; the 2026-06-14 macOS arm64 vs Linux x64 validation record in `docs/validation/rendered-crops-2026-06-14.md` preserved document fingerprint and `payload_sha256` but failed rendered artifact byte equality because the evidence bbox differed slightly across platforms.

-Still absent or not claimable: reproducible benchmark result JSON, executed competitor comparisons, public speed/quality/footprint claims, OCR/image-only support, real table extraction, mature list/heading/layout semantics, semantic/arithmetic verification beyond deterministic evidence lookup, Phase 2 project-maintained PDFium builds, release packaging, and full frozen-corpus multi-platform determinism evidence.
+Still absent or not claimable: public benchmark reports, public competitor-comparison claims, public speed/quality/footprint claims, OCR/image-only support, real table extraction, mature list/heading/layout semantics, semantic/arithmetic verification beyond deterministic evidence lookup, Phase 2 project-maintained PDFium builds, release packaging, and claim-audit approval for any public result wording.

 ## Human / External Blockers

 PM execution packet: `benchmarks/gate-zero/FREEZE_PACKET.md`.

+Resolved control point: ADR-0005 is accepted with `PROCEED` for internal Milestone B
+continuation. Its indexed result files and evidence bundles live in the sibling `ethos-bench`
+repository. This does not unblock public benchmark reports, releases, packages, production
+positioning, or claim wording.
+
 | ID | Blocker | Required output | Owner | Blocks |
 | --- | --- | --- | --- | --- |
-| H1 | Generate signed Gate Zero host results | `../ethos-bench/benchmarks/results/gate-zero/{macos-arm64,linux-x64}/g1.json` plus G2/G3 result files are produced from the frozen manifest and pinned lock | Benchmark owner / decider | Valid Gate Zero run, public benchmark trust |
-| H2 | Execute pinned competitor comparisons | Harness executes the pinned OpenDataLoader, EdgeParse, LiteParse, and PyMuPDF4LLM artifacts and records signed comparison rows where applicable in `ethos-bench` | Benchmark owner | Public competitor comparison |
-| H3 | Accept package identifier ADR | Closed by ADR-0006 acceptance on 2026-06-15 | Devrel / decider | Unblocked package identifier/trademark gate; broader public-release checklist still applies |
+| H1 | Execute and review public-safe competitor comparison flow | Harness executes the pinned OpenDataLoader, EdgeParse, LiteParse, and PyMuPDF4LLM artifacts where applicable, then records reviewable comparison rows in `ethos-bench` without unsupported wording | Benchmark owner | Public competitor-comparison report |
+| H2 | Complete public release/package checklist | `docs/public-release-checklist.md` is complete, release/package artifacts are explicitly approved, and claim-language gates still pass | Devrel / decider | Public releases, packages, and production positioning |
+| H3 | Approve public result wording | Claim audit maps any proposed public result sentence to accepted evidence and rejects unsupported wording | Benchmark owner / decider | Public benchmark/result language |

-The corpus/hardware freeze and direct competitor pins are recorded in `benchmarks/gate-zero/manifest.json` and `benchmarks/competitors.lock.json`. The remaining blockers are result production and signed evidence, not manifest/pin placeholders.
+The corpus/hardware freeze and direct competitor pins are recorded in `benchmarks/gate-zero/manifest.json` and `benchmarks/competitors.lock.json`. The remaining blockers are public-report, public-release, and claim-wording blockers, not manifest/pin placeholders.

 ## Current Milestone Posture

-Milestone A is partially implemented, not complete. The product can demonstrate a narrow parser-backed grounding loop today, but cannot yet claim Gate Zero readiness or public benchmark credibility.
+Milestone A has an accepted internal Gate Zero decision for roadmap control, so Milestone B work may proceed internally. The product can demonstrate a narrow parser-backed grounding loop today, but the decision cannot be used as public benchmark credibility.

 | Work item | Current status | Remaining blocker |
 | --- | --- | --- |
@@ -48,8 +53,8 @@ Milestone A is partially implemented, not complete. The product can demonstrate
 | Font policy groundwork | Partially landed: substitution table and profile policy are present; fixture output uses deterministic substitution IDs | Bundled fallback asset hashing and broader font/CID validation remain open |
 | Schema/example validation | Landed: schemas, examples, deterministic profile, referential integrity, and bbox sanity pass the `jsonschema` validation gate | Contract changes still require explicit versioning and compatibility review |
 | Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, and demo fixtures | Still needed: evidence matching against richer source structures, semantic/arithmetic claim handling by explicit non-v1 design, real OpenDataLoader table-cell grounding, broader adapter hardening against real output, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
-| WS-HARNESS readiness | Partially landed: readiness path is green for frozen corpus/hardware and pinned competitors, and fails closed if those records regress | Actual benchmark runner outputs, install-size/RSS/timing collection, competitor execution, and cross-host determinism evidence are still missing |
+| WS-HARNESS readiness | Partially landed: readiness path is green for frozen corpus/hardware and pinned competitors, Gate Zero evidence preflight validates the current `ethos-bench` handoff, and gates fail closed if those records regress | Public-safe comparison report flow, release/package approval, claim-wording approval, and future evidence-refresh workflow still need hardening |

 ## PM Rule

-Public language stays at "pre-alpha / Milestone A implementation" until the remaining external blockers are closed and Gate Zero has reproducible result JSON. Do not describe Ethos as benchmark-validated, release-ready, or broadly parser-complete. Internal parser work should proceed only when it supports Gate Zero evidence or the trust layer; the product-differentiating path remains verification and grounding first, with parser expansion serving that path.
+Public language stays at "pre-alpha / Milestone B internal continuation" until the remaining external blockers are closed and the claim audit approves specific wording. Do not describe Ethos as benchmark-validated, release-ready, production-ready, or broadly parser-complete. Internal parser work should proceed only when it supports accepted evidence paths or the trust layer; the product-differentiating path remains verification and grounding first, with parser expansion serving that path.
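The status text says `make compare-rendered-crops` separates logical evidence identity from rendered-artifact byte equality. The Python sketch below illustrates that separation under stated assumptions: the `RenderedCrop` record, the `artifact_sha256` field, the three result labels, and the choice to key logical identity on document fingerprint, check id, page, and `payload_sha256` (ignoring bbox) are all hypothetical. It is not the implementation behind the make target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderedCrop:
    # Hypothetical descriptor shape; only some field names echo the status doc.
    document_fingerprint: str
    check_id: str
    page: int
    bbox: tuple[float, float, float, float]  # exact observed bbox; may drift per platform
    payload_sha256: str
    artifact_sha256: str  # assumed hash of the rendered PNG bytes

    def logical_key(self) -> tuple[str, str, int, str]:
        # Logical identity deliberately ignores bbox and rendered bytes.
        return (self.document_fingerprint, self.check_id, self.page, self.payload_sha256)


def classify_runs(run_a: list[RenderedCrop], run_b: list[RenderedCrop]) -> str:
    # Assumes at most one crop per logical key within a single run.
    by_key_a = {crop.logical_key(): crop for crop in run_a}
    by_key_b = {crop.logical_key(): crop for crop in run_b}
    if by_key_a.keys() != by_key_b.keys():
        return "logical-mismatch"
    if all(by_key_a[k].artifact_sha256 == by_key_b[k].artifact_sha256 for k in by_key_a):
        return "byte-identical"
    return "logical-match-bytes-differ"
```

In this sketch, a cross-host pair that keeps the same fingerprint and `payload_sha256` while its bbox and PNG bytes drift lands in the third class, which is the shape of the 2026-06-14 macOS/Linux finding.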
docs/gate-zero-evidence-runbook.md (10 changes: 7 additions & 3 deletions)
@@ -136,15 +136,16 @@ For the cross-host G3 bundle, use `GATE_ZERO_PLATFORM=cross-platform` and

 ## Decision Step

-Fill `docs/decisions/ADR-0005-gate-zero-decision.md` only after:
+ADR-0005 is accepted for the current internal Gate Zero decision. For a future amendment or
+successor decision, update the decision record only after:

 - required G1 files exist for both recorded hosts;
 - required G2 files exist for both recorded hosts;
 - G3 has compared the required hosts;
 - evidence bundles exist for the source result files;
 - the decider has reviewed the result JSON and reproduction sidecars.

-Before filling the ADR, run:
+Before amending the decision record, run:

 ```bash
 python3 .github/scripts/gate_zero_evidence_preflight.py decision --ethos-bench ../ethos-bench
@@ -154,9 +155,12 @@ This checks the expected `ethos-bench` result paths, timestamped evidence bundle
 complete reproduction environments, and bundle checksum manifests. It does not decide whether
 Gate Zero passes.

-Until that ADR is filled, public language remains:
+Even after ADR-0005 acceptance, public language remains:

 ```text
 Ethos is pre-alpha. It verifies whether AI citations are grounded in document evidence across
 native Ethos JSON and supported foreign parser outputs.
 ```
+
+ADR-0005 authorizes internal Milestone B continuation only. It does not authorize public benchmark
+reports, releases, packages, production positioning, or unsupported result wording.
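Among other things, the decision preflight checks bundle checksum manifests. The Python sketch below shows one way such a check can work. It assumes a `sha256sum`-style manifest (`<hex digest>  <relative path>` per line) named `SHA256SUMS` inside each bundle; the manifest name, its format, and the `verify_manifest` helper are assumptions and do not describe `gate_zero_evidence_preflight.py`. Like the preflight, it checks evidence integrity only and does not decide whether Gate Zero passes.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large bundle files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bundle_dir: Path, manifest_name: str = "SHA256SUMS") -> list[str]:
    """Return problems found in one bundle; an empty list means every entry matched."""
    manifest = bundle_dir / manifest_name
    if not manifest.is_file():
        return [f"missing manifest: {manifest_name}"]
    problems: list[str] = []
    lines = manifest.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            problems.append(f"line {lineno}: malformed entry")
            continue
        # A leading '*' is sha256sum's binary-mode marker, not part of the path.
        expected, rel_path = parts[0].lower(), parts[1].lstrip("*")
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing file: {rel_path}")
        elif sha256_file(target) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Collecting every problem instead of stopping at the first one lets a reviewer see all broken entries in a bundle in a single pass.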
docs/roadmap.md (9 changes: 5 additions & 4 deletions)
@@ -12,16 +12,17 @@ Current PM status and blockers: `docs/execution-status.md`.
 | Milestone | Window | Contents | Gate |
 | --- | --- | --- | --- |
 | Week 0 | pre-kickoff | ADRs, governance, corpus freeze, CI bootstrap, competitor pins | All 11 rows done; clock starts |
-| A | weeks 1-8 | Contracts (5 schemas, c14n, deterministic profile), trust-boundary artifacts (`GroundingSource`, verification schemas, OpenDataLoader adapter stub, `ethos verify` CLI stub), PDFium Phase 1 spike, harness + competitor adapters, CLI skeleton | **Gate Zero**: G1 throughput, G2 footprint, G3 determinism - PROCEED / G1_RETRY / FALLBACK (ADR-0005). A is incomplete without the trust-boundary artifacts. |
+| A | weeks 1-8 | Contracts (5 schemas, c14n, deterministic profile), trust-boundary artifacts (`GroundingSource`, verification schemas, OpenDataLoader adapter stub, `ethos verify` CLI stub), PDFium Phase 1 spike, harness + competitor adapters, CLI skeleton | **Gate Zero**: ADR-0005 is accepted as `PROCEED` for internal Milestone B continuation. This is not public benchmark, release, package, production, or claim approval. |
 | B | weeks 9-14 | **`ethos verify` alpha first**: native Ethos JSON + OpenDataLoader verification demo, stale fingerprint checks, capability-limited reports, deterministic evidence matching; then reading order, blocks, headings, lists, Markdown/text exporters, Python wheel scaffold, quality dashboard, Windows x64 nightly determinism | 13-B exit checklist |
 | C | weeks 15-22 | Simple/bordered tables; RAG chunker + citations; non-text region coordinates; security report + default-chunk exclusion; debug overlay; internal benchmark snapshot | 13-C exit + first checkpoint |
 | D | weeks 23-30 | `verify_citations` v1; crop API; sandbox/subprocess backend; Node beta and MCP experimental only if staffed or accepted by release-scope ADR | 13-D exit |
 | E | weeks 31-40 | Public benchmark report (reproducible, labeled tiers); PDFium Phase 2 project-maintained builds; stable CLI/Python docs; proof-of-trust demos; **Public Beta** | Release 1 claim audit + public-beta checkpoint |
 | F / Release 2 | post-E | Complex tables, formula/LaTeX, chart classification, optional enrichment modules (never base) | Scoped after E from beta fixtures |

-Fallback charter: if Gate Zero fails on G2/G3 (or a failed G1 retry), Ethos pivots to the
-parser-agnostic trust layer — standalone `ethos-verify` + chunk/citation tooling over foreign
-parser output. The trust layer ships as a Milestone B alpha either way.
+Fallback charter: ADR-0005 selected `PROCEED`. If a future Gate Zero successor decision rejects
+G2/G3 evidence, or rejects G1 after a bounded retry path, Ethos pivots to the parser-agnostic trust
+layer — standalone `ethos-verify` + chunk/citation tooling over foreign parser output. The trust
+layer remains the first Milestone B product path either way.

 Surface labels in Release 1: CLI + Python **stable**. Node **beta** and MCP **experimental**
 ship only if staffed or accepted by release-scope ADR before public claims.
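The pre-change Milestone A row names three Gate Zero outcomes (PROCEED / G1_RETRY / FALLBACK), and the fallback charter routes rejected G2/G3 evidence, or a G1 rejection after a bounded retry, to the trust-layer pivot. A minimal Python sketch of that routing follows. It assumes boolean pass/fail inputs per gate and a single G1 retry; `Outcome` and `gate_zero_outcome` are hypothetical names, and ADR-0005's actual criteria and thresholds are not reproduced here.

```python
from enum import Enum


class Outcome(Enum):
    PROCEED = "PROCEED"
    G1_RETRY = "G1_RETRY"
    FALLBACK = "FALLBACK"


def gate_zero_outcome(g1_pass: bool, g2_pass: bool, g3_pass: bool, *, g1_retry_used: bool) -> Outcome:
    """Hypothetical routing of Gate Zero results; ADR-0005 holds the real criteria."""
    # G2 (footprint) or G3 (determinism) failures trigger the fallback charter directly.
    if not (g2_pass and g3_pass):
        return Outcome.FALLBACK
    if g1_pass:
        return Outcome.PROCEED
    # A G1 (throughput) miss gets one bounded retry, assumed here to be a single retry.
    return Outcome.FALLBACK if g1_retry_used else Outcome.G1_RETRY
```

Checking G2/G3 before G1 means that, in this sketch, a footprint or determinism failure is never masked by a pending throughput retry.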