
feat(verify): add HTML embed path and extract parallel-generation rules#26

Closed
bensonwong wants to merge 1 commit into main from feat/html-embed-citation-path

@bensonwong bensonwong commented Apr 16, 2026

Contributor

Summary

  • New: HTML annotation path. When the user's claim source is a static HTML file they want to preserve, agents now annotate it directly instead of generating a new report. The agent adds data-cite="N" attributes to any HTML element and appends <<<CITATION_DATA>>> as raw text after </html>; parseCitationData() strips that block before output, so it never renders in the browser. The agent then runs verify --html to inject the CDN runtime while preserving the original structure and styling.
  • New §2 triage row — routes "embed citations into static HTML" to the HTML annotation path; narrows the existing "Existing verified HTML" row to clarify it covers only prior CLI output (already has data-citation-key attributes).
  • §4 --html note updated — now covers both entry points: embed-into case (.deepcitation/{draft}-body.html) and re-verify of prior CLI output.
  • Parallel generation extracted — the full parallel-agent pipeline (evidence tagging, split/overlap math, merge failure recovery) moves to rules/parallel-generation.md. SKILL.md defers to it for 100+ page / 3+ file scenarios; single-topic and sub-100-page cases stay inline.
  • AGENTS.md guidance router — new entry pointing to rules/parallel-generation.md.
  • Cleanup: removed the cloud-sandbox probe block, proxy invariants, tool alternatives enumeration, and results summary emoji line from SKILL.md (each is covered by its respective rules file).
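
As a rough illustration of the annotation format described in the summary above, the sketch below shows a tiny annotated document and collects its data-cite values. The sample HTML, the payload between the markers, and the helper names (CiteCollector, collect_cite_ids) are all hypothetical; the real payload format is not specified in this PR.

```python
from html.parser import HTMLParser

MARKER = "<<<CITATION_DATA>>>"

# Hypothetical annotated file. The payload after the marker is a placeholder,
# not the documented citation-data format.
ANNOTATED = """<html><body>
<p data-cite="1">First claim to verify.</p>
<p>Unannotated text stays untouched.</p>
<span data-cite="2">Second claim to verify.</span>
</body></html>
<<<CITATION_DATA>>>
(citation payload goes here)
"""


class CiteCollector(HTMLParser):
    """Records the value of every data-cite attribute it sees."""

    def __init__(self):
        super().__init__()
        self.cite_ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-cite" and value is not None:
                self.cite_ids.append(value)


def collect_cite_ids(document: str) -> list[str]:
    # Parse only the HTML part; the raw citation block follows </html>.
    html_part, _, _ = document.partition(MARKER)
    collector = CiteCollector()
    collector.feed(html_part)
    collector.close()
    return collector.cite_ids


cite_ids = collect_cite_ids(ANNOTATED)
block_after_html = ANNOTATED.index("</html>") < ANNOTATED.index(MARKER)
```

A check like this could help confirm that every data-cite number has a matching entry in the citation block before running verify --html, though that pairing logic is left out here because the payload format is an unknown.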

Test plan

  • Read SKILL.md §1–§4 end-to-end: confirm read-only, verify/markdown, and verify/html-embed are each clearly differentiated in Orient and Triage
  • Trace the HTML embed path: §2 triage → §3 HTML annotation path → data-cite wrapping → <<<CITATION_DATA>>> after </html> → §4 verify --html .deepcitation/{draft}-body.html
  • Confirm rules/parallel-generation.md exists and contains the full pipeline (evidence tagging, split math, merge failure recovery)
  • Confirm AGENTS.md guidance router lists parallel-generation.md under the correct trigger keywords

…es file

- New §3 "HTML annotation path": guides agents to annotate a source HTML
  file with data-cite="N" attributes and append <<<CITATION_DATA>>> after
  </html>, then run verify --html to preserve original HTML structure
- New §2 triage row for "embed citations into static HTML" case;
  narrows the existing "Existing verified HTML" row to CLI-prior-run only
- §4 --html note updated to cover both the embed-into and re-verify cases
- Parallel generation guidance extracted to rules/parallel-generation.md;
  SKILL.md now defers to it for 100+ page / 3+ file scenarios
- AGENTS.md guidance router updated with parallel-generation.md entry
- Removed cloud-sandbox probe block, proxy invariants, and tool
  alternatives list (moved to their respective rules files)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude Bot commented Apr 16, 2026


PR Review

Overall this is a well-structured refactor. Extracting the parallel pipeline into its own rules file and adding the HTML embed path are both good moves. A few issues worth addressing before merge.


Blocking / Correctness

Threshold mismatch: three inconsistent definitions of when to use parallel generation

  • SKILL.md section heading: "100+ pages with 3+ files"
  • parallel-generation.md (When to use): "100 or more pages AND the question spans 2 or more distinct topics"
  • AGENTS.md guidance router: "100+ page documents" (no topic or file condition)

The original SKILL.md was internally consistent ("2+ distinct sections / 2+ top-level headings"). Pick one formulation and apply it in all three places.
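
To make the mismatch concrete, here is a minimal sketch that encodes each of the three quoted wordings as a predicate and evaluates them on one input. The function names and the sample workload are hypothetical, and reading "with" in the SKILL.md heading as a logical AND is an assumption.

```python
def skill_md_rule(pages: int, files: int, topics: int) -> bool:
    # SKILL.md heading: "100+ pages with 3+ files" (read here as AND)
    return pages >= 100 and files >= 3


def parallel_generation_md_rule(pages: int, files: int, topics: int) -> bool:
    # parallel-generation.md: "100 or more pages AND the question spans
    # 2 or more distinct topics"
    return pages >= 100 and topics >= 2


def agents_md_rule(pages: int, files: int, topics: int) -> bool:
    # AGENTS.md router: "100+ page documents" (no file or topic condition)
    return pages >= 100


# Hypothetical workload: one 120-page file, question spanning two topics.
case = {"pages": 120, "files": 1, "topics": 2}

results = {
    "SKILL.md": skill_md_rule(**case),
    "parallel-generation.md": parallel_generation_md_rule(**case),
    "AGENTS.md": agents_md_rule(**case),
}
```

For this input the SKILL.md reading says "stay inline" while the other two say "go parallel", so an agent's behavior would depend on which file it happened to load first.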

"No direct-read fallback" rule silently gutted in the Hard rules section

Old text named the forbidden tools (Read, pdfplumber, urllib, web fetch, etc.) and explained why the rule exists: a hand-built report that mimics the verified format presents unverified text as verified.

New text collapses to: "If prepare/verify cannot complete, the deliverable is not producible."

This is the exact moment agents are most tempted to reach for an alternative tool. The explanation needs to live here — either restore it inline or add an explicit pointer to the file that carries the full rule.


Significant

Cloud-sandbox probe removed with no guaranteed load trigger

The original section warned: "false negatives are catastrophic because the agent loses awareness of the 45 s bash timeout, the __DC_ERROR__ protocol, and the no-fallback invariant."

The proactive probe (env | grep ..., whoami) is gone. cloud-sandbox-constraints.md now only appears in the AGENTS.md router under the keyword "cloud sandbox behavior" — an agent that doesn't already suspect it's in a sandbox will never load that file, which is the false-negative scenario the original text warned about. At minimum restore a one-liner in the Hard rules: "Before running any CLI command, probe for cloud-sandbox markers — see rules/cloud-sandbox-constraints.md."
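
For reference, a proactive probe of this kind could look roughly like the sketch below. It is only an approximation of the removed env | grep ... step: the marker names are deliberately left to the caller because the original pattern is not quoted here, the function name is hypothetical, and unlike grep it matches variable names only, not values.

```python
import os
from typing import Iterable, Mapping, Optional


def probe_sandbox_markers(
    markers: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names of environment variables that contain any marker.

    `markers` must come from rules/cloud-sandbox-constraints.md; none are
    hard-coded here because this sketch does not know the real list.
    """
    env = os.environ if env is None else env
    wanted = [m.lower() for m in markers]
    return sorted(
        name for name in env if any(w in name.lower() for w in wanted)
    )


# Usage with a made-up marker and a fake environment:
hits = probe_sandbox_markers(
    ["EXAMPLE_SANDBOX"],
    {"EXAMPLE_SANDBOX_ID": "abc", "PATH": "/usr/bin"},
)
```

A non-empty result would be the cue to load the sandbox rules file before running any CLI command; an empty result proves nothing on its own, which is why the marker list needs to be maintained in one place.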

Proxy invariant dropped from Invariants

"Never modify proxy environment variables on individual command runs" was listed in Invariants — the section that applies everywhere, sandbox or not — precisely because this failure mode appears outside sandboxes too. The rules file is on-demand only. Restoring this line to Invariants is one sentence and closes a real gap.


Minor / Polish

Triage row example text is awkward

Old: (e.g. index.html, draft.md) — New: (e.g. draft.md, or files that have made unfounded claims that need verification)

A file doesn't make claims. Consider: (e.g. draft.md, a report, or any document containing claims to check).

Results summary format dropped with no pointer

The verified/partial/not-found summary format was removed from the closing step. None of the linked rules files appear to own it. If it's intentionally retired, say so in the section; otherwise agents will invent inconsistent formats.

parseCitationData() is an internal function name in agent-facing docs

Fragile if the function is renamed. Rewrite as observable behavior: "The CLI strips everything from <<<CITATION_DATA>>> to <<<END_CITATION_DATA>>> before writing the output file, so this block never appears in the browser."
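
The observable behavior in that suggested wording can be sketched as follows. This is not the CLI's actual implementation; the function name is hypothetical, and the choice to leave the document unchanged when the end marker is missing is an assumption.

```python
import re

START = "<<<CITATION_DATA>>>"
END = "<<<END_CITATION_DATA>>>"

# Non-greedy match from the start marker through the end marker,
# spanning newlines.
_BLOCK = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def strip_citation_block(document: str) -> str:
    """Remove the citation block so it never reaches the rendered output."""
    stripped = _BLOCK.sub("", document)
    if stripped == document:
        # No complete block found (assumed behavior: leave input as-is).
        return document
    return stripped.rstrip() + "\n"


sample = (
    "<html><body><p data-cite=\"1\">Claim.</p></body></html>\n"
    "<<<CITATION_DATA>>>\n(payload)\n<<<END_CITATION_DATA>>>\n"
)
cleaned = strip_citation_block(sample)
```

Describing the behavior this way keeps the docs valid even if parseCitationData() is renamed or split.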

CLI version note (deepTextPromptPortion fallback) was removed

Fine if the minimum supported CLI now always emits deepTextPages — but worth a brief comment in the PR description so this is explicitly intentional rather than an accidental omission.

@bensonwong bensonwong closed this Apr 16, 2026
@bensonwong bensonwong deleted the feat/html-embed-citation-path branch April 19, 2026 16:48