rework /verify around lavish + live citations; drop proprietary form specs #29

Merged: bensonwong merged 15 commits into main from feat/judgement-marker on Jun 23, 2026

@bensonwong

Summary

Two related cleanups to the public skills, landed on one branch:

1. /verify reworked around lavish styling + DeepCitation live citations

The skill now produces a clean lavish-styled HTML report whose citations are DeepCitation's own interactive citations (click → matched phrase, evidence keyhole, page view), opened in lavish-axi for an annotate → poll → reply review loop. DeepCitation owns verification + the citation UX; lavish owns the report shell + review loop.

Corrected end-to-end against the live verify --html CLI (not assumptions):

  • Binary status model: verified / unverified, anchored on the sourceContext sentence. No variance/pending/isVerbatim and no confidence score in the CLI embed. The internal ambiguity.confidence is a localization signal and is never surfaced.
  • CITATION_DATA must be a single JSON object grouped by attachmentId (a flat list is rejected with "No valid CITATION_DATA block found").
  • verify --html writes {stem}-verified.html (ignores --out); --local-only keeps it off "My Verifications".
  • Coexistence — a post-embed data-lavish-action sweep over [data-citation-key] so citation clicks reach DeepCitation's popover while prose stays commentable (verified against artifact-sdk.js: all three lavish handlers bail on isLavishAction).
  • Accepted prepare inputs documented (PDF / image / Office / CSV-TSV / URL — .txt rejected).
  • Adds a physician chart-prep test scenario (docs/scenario-physician-chart-prep.md) as a manual acceptance script.
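
To illustrate the CITATION_DATA grouping requirement above, here is a minimal Python sketch that turns a flat citation list into a single object keyed by attachmentId. Only attachmentId, sourceContext, and sourceMatch are named in this PR; the `key` field, the list-of-entries inner shape, the sample values, and the helper name `group_by_attachment` are assumptions for illustration, not the CLI's documented schema.

```python
import json
from collections import defaultdict


def group_by_attachment(flat_citations):
    """Group a flat list of citation dicts into one dict keyed by attachmentId."""
    grouped = defaultdict(list)
    for citation in flat_citations:
        # Keep every field except the grouping key itself.
        entry = {name: value for name, value in citation.items() if name != "attachmentId"}
        grouped[citation["attachmentId"]].append(entry)
    return dict(grouped)


# Hypothetical flat input; "key" and the sample values are illustrative only.
flat = [
    {"attachmentId": "att-1", "key": "c1",
     "sourceContext": "BP was 120/80 at intake.", "sourceMatch": "120/80"},
    {"attachmentId": "att-2", "key": "c2",
     "sourceContext": "No known drug allergies.", "sourceMatch": "No known drug allergies"},
    {"attachmentId": "att-1", "key": "c3",
     "sourceContext": "Heart rate 72 bpm.", "sourceMatch": "72 bpm"},
]

# One JSON object at the top level, not a list.
citation_data = group_by_attachment(flat)
print(json.dumps(citation_data, indent=2))
```

The point of the sketch is only the top-level shape: one object whose keys are attachment IDs, rather than a flat array that repeats attachmentId on every entry.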

2. Remove proprietary content from the MIT submodule

Drops the [Judgement]/[Missing] marker teaching plus the medical form specs, rules, scripts, and source PDFs that don't belong in the open-source skills repo.

Test plan

  • /verify pipeline run end-to-end on a real OCR'd chart: auth → prepare → author → verify --html (553 KB interactive report) → popover inspection → lavish coexistence. Every documented behavior is observed, not assumed.
  • Each correction in the skill is backed by an observed CLI result (binary status, attachmentId grouping, -verified.html naming, data-lavish-action exclusion).
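
The -verified.html naming can be checked mechanically before opening the report. A minimal Python sketch, assuming the artifact is written next to the authored file (the PR states the {stem}-verified.html pattern but not the directory); `verified_artifact_path` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path


def verified_artifact_path(authored):
    """Return the expected report path for an authored HTML file."""
    authored = Path(authored)
    # Assumption: same directory as the authored file; only the name pattern is from the PR.
    return authored.with_name(f"{authored.stem}-verified.html")


expected = verified_artifact_path("reports/chart-prep.html")
print(expected, "exists" if expected.exists() else "not found yet")
```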

bensonwong added 14 commits May 16, 2026 23:38
Replace the bracket-token markers with plain prose throughout the
form-fill skill:

- SKILL.md / field-citation-mapping.md: physician-judgment fields now
  state in plain language what the signing physician must confirm,
  instead of appending a [Physician Judgment] token.
- manual-copy-output.md: drop [Missing] from the copy-block exclusion
  list; undocumented fields are stated as prose.
- lint-form-draft.mjs: COPY_BLOCK_FORBIDDEN now detects the prose
  conventions ('not documented in the available records',
  'physician must confirm') rather than the old marker tokens; drop
  the obsolete MISSING_OUTSIDE_COPY rule.

The markers read as distracting, unprofessional UI for the first
release; the verbose prose carries the same meaning to physicians.

The 15 medical form specs + the forms/INDEX.md registry are proprietary,
jurisdiction-specific work product. They were moved to packages/internal-skills
(non-OSS) and are the source of truth for formSpecs.generated.ts. The copies
here were redundant and out of place in the MIT-licensed skill library.

review-perspectives.md no longer points reviewers at a forms/medical/ path —
the active form spec is provided to the workflow at runtime.

Settle the unused verify skill on a single model: lavish-axi owns the report
shell and the annotate/poll/reply review loop; DeepCitation owns verification
and the interactive citation UX, embedded via verify --html. No confidence
scores, no invented verification surfaces (evidence table / status grid /
discrepancy list), no hero framing, no runtime adversarial roster — status is
DeepCitation's discrete verified/variance/unverified/pending badge.

- SKILL.md: embed model, status table, two-lane review loop, invariants
- rules/lavish-loop.md: styling+loop+verify --html embed, [data-citation-key] coexistence
- rules/cloud-sandbox-constraints.md, parallel-generation.md: reconciled to verify --html
- AGENTS.md: router collapsed (roster + playbooks entries removed)
- delete playbooks/* and rules/runtime-roster.md (invented surfaces / roster)
- docs/scenario-physician-chart-prep.md: manual test walkthrough

Local end-to-end testing of the pipeline (prepare → author → verify --html →
lavish) surfaced five factual errors and a model mismatch; fix all:

- verification model is BINARY (verified/unverified), anchored on sourceContext,
  not the web-app's 4-state verified/variance/unverified/pending. Badge follows
  the context sentence, not the key — so author must make sourceContext support
  the claim and sourceMatch a verbatim substring. ambiguity.confidence is a
  localization signal, never rendered, never surfaced.
- CITATION_DATA block must be a single JSON object grouped by attachmentId;
  a flat list fails 'No valid CITATION_DATA block found'. Capture attachmentId
  from prepare.
- prepare rejects plain .txt (PDF/image/Office/CSV-TSV/ODF/URL only).
- verify --html ignores --out and writes {stem}-verified.html; pass --local-only
  to avoid auto-upload. Authored file is <topic>.html, artifact <topic>-verified.html.
- coexistence is a POST-embed sweep: add data-lavish-action to [data-citation-key]
  (keys are hashed at verify time, can't pre-mark). Confirmed against artifact-sdk.js
  that all three handlers (hover/select/click) honor it.
- scenario doc checks updated to binary + 'click is what verifies' caveat.
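
The post-embed sweep described above can be sketched as follows. The review refers to the PR's own sweep as a perl one-liner, which is not reproduced in this conversation; this is the same idea in Python, not the shipped regex. It assumes each tag's attributes contain no '>' character and that a bare data-lavish-action attribute (no value) is sufficient; `mark_citations` is a hypothetical name.

```python
import re

# An opening tag that carries a data-citation-key attribute.
_CITATION_TAG = re.compile(
    r"<([A-Za-z][\w-]*)(\s[^<>]*?(?<![\w-])data-citation-key(?![\w-])[^<>]*?)(/?)>"
)


def mark_citations(html):
    """Add data-lavish-action to every element that has data-citation-key."""
    def add_attr(match):
        tag, attrs, slash = match.groups()
        # Already marked: leave the tag untouched so the sweep is idempotent.
        if re.search(r"(?<![\w-])data-lavish-action(?![\w-])", attrs):
            return match.group(0)
        return f"<{tag}{attrs.rstrip()} data-lavish-action{slash}>"

    return _CITATION_TAG.sub(add_attr, html)


sample = '<p>Claim <span data-citation-key="k1" class="c">[1]</span></p>'
print(mark_citations(sample))
```

Because citation keys are hashed at verify time, a sweep like this has to run on the -verified.html artifact rather than on the authored file.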

Test artifacts kept local under scratch/ (uncommitted).
@claude

claude Bot commented Jun 23, 2026

test

@claude

claude Bot commented Jun 23, 2026

PR Review

Overall this is a well-motivated rework. The division-of-labor model (DeepCitation owns verification + citation UX; lavish owns the report shell + review loop) is a cleaner architecture, and the status model corrections are accurate. A few things worth discussing before merging.

Issues

1. attachmentId capture -- not shown how to get it from stderr

SKILL.md step 2 says the attachmentId is 'printed on stderr' but the code block only redirects stdout, so an agent cannot observe stderr during the run. Since the PR also says it is 'stored as attachmentId in the JSON', the simpler fix is to drop the 'printed on stderr' parenthetical and just say 'read attachmentId from the JSON.' Otherwise show the separate stderr redirect explicitly. The ambiguity could cause an agent to stall looking for something it cannot see.

2. auth.md narrowed verify out of the 'action needed' check -- is that intentional?

Old: 'If prepare or verify output contains action needed...'
New: 'If prepare output contains action needed...'

verify --html hits the API too. If a session token expires between prepare and verify, that error path now has no documented recovery. If this cannot happen in practice (verify reuses an already-established auth context), a one-line rationale in auth.md would close the ambiguity.
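
For reference, the check under discussion amounts to scanning the captured output of both stages for the "action needed" marker. A small Python sketch; the function name, the case-insensitive match, and the sample output strings are assumptions, and the recovery steps themselves live in auth.md and are not reproduced here.

```python
def stages_needing_auth(outputs):
    """Return the stage names whose captured output mentions 'action needed'."""
    return [
        stage
        for stage, text in outputs.items()
        # Assumption: the marker may vary in case, so compare lowercased text.
        if "action needed" in text.lower()
    ]


# Hypothetical captured output from the two stages.
captured = {
    "prepare": "prepared 1 attachment",
    "verify": "Action needed: sign in again",
}
print(stages_needing_auth(captured))
```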

3. Comprehensiveness guidance dropped without a replacement

The old 'Comprehensiveness' section guarded against agents that answer the easy sub-question deeply and the hard one shallowly. Nothing in the new SKILL.md fills that gap. A brief note in step 3 or the invariants would suffice.

4. Per-citation SELF-CHECK removed

The old SKILL.md had a 4-step in-flow self-check and a STOP AND CHECK gate before verify, both enforcing that k is a verbatim substring of f. The new skill documents the rule in the status model section but drops the in-flow reminder. A one-line checkpoint in step 3 or the invariants would prevent agents from discovering the constraint only when verify flags bad anchors.
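
The checkpoint being asked for is cheap to state in code. A minimal Python sketch of the per-citation check, using the f and k field names from this review; `check_anchor` is a hypothetical helper, and exact, case-sensitive matching is an assumption about what "verbatim" means here.

```python
def check_anchor(f, k):
    """Return None when k is a usable anchor for f, otherwise a short reason."""
    if not k.strip():
        return "k is empty"
    # Verbatim means an exact substring: no paraphrase, no case folding.
    if k not in f:
        return "k is not a verbatim substring of f"
    return None


# Hypothetical citation pairs (f, k) to check before running verify.
pairs = [
    ("Heart rate 72 bpm at rest.", "72 bpm"),
    ("Heart rate 72 bpm at rest.", "72 beats per minute"),
]
for f, k in pairs:
    print(repr(k), "->", check_anchor(f, k) or "ok")
```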

Smaller observations

  • verify --html ignoring --out -- correctly documented in SKILL.md step 4 and cloud-sandbox-constraints. Good catch.
  • 2-redirect prohibition on prepare -- explicitly noted and explained. Correct.
  • perl sweep for data-lavish-action -- regex looks right for single-line attributes. Multi-line edge case is unlikely from verify --html output; acceptable as-is.
  • Scenario doc -- medical examples are appropriate in a test file; the disclaimer is correctly placed; acceptance checks are specific and testable. LGTM.
  • Form-fill removal -- clean. Worth confirming no downstream host caches the old skill list in a way that ghost-routes /form-fill prompts post-merge.
  • --local-only -- consistently documented across SKILL.md, cloud-sandbox-constraints, and lavish-loop. Good.

Summary

| Area | Status |
| --- | --- |
| Status model (binary verified/unverified) | Correct |
| Output path (stem-verified.html, ignores --out) | Correct |
| lavish/DeepCitation division of labor | Clean |
| data-lavish-action coexistence sweep | Well-documented |
| prepare stderr / attachmentId capture | Needs clarification |
| auth.md scope narrowing | Needs rationale or reversion |
| Comprehensiveness guidance | Lost in revision |
| Per-citation SELF-CHECK | Lost in revision |
| form-fill removal | Appropriate |
| Scenario test doc | Correctly scoped |

…JSON, restore self-check

- auth.md: 'action needed' recovery now explicitly covers verify, not just prepare
  (the session token can expire between prepare and verify — observed in live testing).
- SKILL.md step 2: read attachmentId from the prepare JSON (the only redirect is stdout,
  so stderr isn't observable mid-run).
- SKILL.md step 3: restore the in-flow per-citation self-check (k must be a verbatim
  substring of f) and the comprehensiveness reminder dropped in the rework.
@bensonwong

Thanks — addressed all four findings in 5acae36:

  1. attachmentId capture — step 2 now reads attachmentId from the prepare JSON (attachmentId field) instead of stderr, since the run only redirects stdout and stderr isn't observable mid-run.
  2. auth scope — reverted the narrowing. auth.md now applies the "action needed" recovery to prepare or verify, with a one-line rationale: the session token can expire between the two (both hit the API). This wasn't hypothetical — it happened during live end-to-end testing of the skill.
  3. Comprehensiveness — restored as an in-flow line in step 3: "answer the hard part as fully as the easy part; a deep answer to the easy half is a failure."
  4. Per-citation self-check — restored in step 3: find the verbatim f first, derive k as a word-for-word substring of f, fix f first if it isn't — don't wait for verify to flag a bad anchor.

Smaller observations (form-fill ghost-routing, multi-line perl edge case) noted; both are acceptable as-is per your assessment.

@bensonwong bensonwong merged commit d5605db into main Jun 23, 2026
@bensonwong bensonwong deleted the feat/judgement-marker branch June 23, 2026 23:52