test: add decline-token coverage to the verdict-parser contract (#88) by JFK · Pull Request #95 · JFK/gh-issue-driven

JFK · 2026-06-13T05:33:03Z

Closes #88

Summary

Add decline-token coverage to the verdict-parser contract test. The reference parser (tests/test_verdict_parser.py) modeled only the advisory (green|yellow|red) and binary (pass|fail) tokens, but gate1 also emits decline (the /ceo-escalation signal, start.md step 10). The contract test and the command had therefore drifted: a structured ## Verdict: decline line was unguarded.

Implementation notes

New gate1 parser mode; GATE1_PATTERN mirrors start.md step 10 exactly: ^\s*##\s*Verdict:\s*(green|yellow|red|decline)\b.
decline is structured-only — a free-form mention in the reasoning body must not trigger it; the no-structured-line fallback reuses the advisory classifier (which can never return decline).
Three fixtures lock the behavior: gate1-decline-structured → decline; gate1-decline-then-green → green (last-wins, no escalation); gate1-decline-in-body → green.
28 fixtures pass (was 25). Module docstring updated to three modes.
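
The gate1 mode described in these notes can be sketched roughly as follows. This is a minimal illustration, not the code in tests/test_verdict_parser.py: the regex is the one quoted above, but the `re.IGNORECASE | re.MULTILINE` flags, the function names `parse_gate1` and `classify_advisory`, the keyword heuristic inside the fallback, and the fixture texts are all assumptions made for the example. Only the fixture names and expected verdicts come from this PR.

```python
import re

# Pattern as quoted in the implementation notes (mirrors start.md step 10).
# The flags are an assumption of this sketch.
GATE1_PATTERN = re.compile(
    r"^\s*##\s*Verdict:\s*(green|yellow|red|decline)\b",
    re.IGNORECASE | re.MULTILINE,
)


def classify_advisory(text: str) -> str:
    """Hypothetical stand-in for the advisory fallback classifier.

    It only ever returns green, yellow, or red, so a free-form mention
    of "decline" in the body cannot escalate.
    """
    lowered = text.lower()
    for token in ("red", "yellow"):
        if re.search(rf"\b{token}\b", lowered):
            return token
    return "green"


def parse_gate1(text: str) -> str:
    """Return the last structured verdict, else fall back to the heuristic."""
    matches = GATE1_PATTERN.findall(text)
    if matches:
        return matches[-1].lower()  # last structured line wins
    return classify_advisory(text)


# Illustrative fixture texts (hypothetical); names and expected values
# follow the three fixtures listed in this PR.
FIXTURES = {
    "gate1-decline-structured": ("Needs a CEO call.\n## Verdict: decline\n", "decline"),
    "gate1-decline-then-green": ("## Verdict: decline\nOn reflection:\n## Verdict: green\n", "green"),
    "gate1-decline-in-body": ("I considered whether to decline, but the design is sound.\n", "green"),
}

for name, (text, expected) in FIXTURES.items():
    assert parse_gate1(text) == expected, name
```

Keeping decline out of the fallback classifier is what makes the token structured-only: the only path that can return it is a match on the anchored `## Verdict:` line.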

⚠️ Scope note (important)

Gate1 (design review) found that #88 as written is ~90% already implemented: tests/test_verdict_parser.py's existing 25 fixtures already cover last-wins (edge-multiple-verdicts), trailing punctuation (edge-verdict-with-punctuation), case-insensitive, leading-whitespace, structured-overrides-heuristic, and both advisory + binary modes; enum-sync-check.sh and jq-sync-check.sh already cover enum + jq sync in CI. This PR therefore closes the one genuine remaining gap — the decline token — rather than re-adding existing coverage. The milestone issue was drafted before fully auditing existing test coverage.
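
For readers unfamiliar with the edge cases named above, the following sketch shows how a single anchored pattern can satisfy last-wins, trailing punctuation, case-insensitivity, and leading whitespace at once, and why a structured decline line fell through before this PR. The pattern `ADVISORY_PATTERN`, its flags, the helper `last_structured_verdict`, and the sample strings are assumptions modeled on the gate1 pattern quoted in this PR, not the actual advisory pattern from the test module.

```python
import re

# Assumed advisory-mode pattern: the gate1 pattern quoted in this PR with
# the decline token removed. Flags are likewise an assumption.
ADVISORY_PATTERN = re.compile(
    r"^\s*##\s*Verdict:\s*(green|yellow|red)\b",
    re.IGNORECASE | re.MULTILINE,
)


def last_structured_verdict(text):
    """Return the last structured verdict token, or None if there is none."""
    matches = ADVISORY_PATTERN.findall(text)
    return matches[-1].lower() if matches else None


# Hypothetical inputs, one per edge case listed in the scope note.
CASES = {
    "last-wins": ("## Verdict: red\n## Verdict: yellow\n", "yellow"),
    "trailing punctuation": ("## Verdict: green.\n", "green"),
    "case-insensitive": ("## VERDICT: Yellow\n", "yellow"),
    "leading whitespace": ("   ## Verdict: red\n", "red"),
    # The gap closed by this PR: the advisory pattern has no decline token,
    # so a structured decline line yields no structured match at all.
    "structured decline": ("## Verdict: decline\n", None),
}

for name, (text, expected) in CASES.items():
    assert last_structured_verdict(text) == expected, name
```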

Pre-PR review summary

gate2 mode: advisor-only (gate2.binary_gate = none)
cso: green / qa-lead: green / cto: green
gate1: yellow via /claude-c-suite:ask (scoped down — see Scope note)
review provider: code-review

🤖 Generated via /gh-issue-driven:ship (autonomous=red-only, milestone v0.14.0)

The reference parser (tests/test_verdict_parser.py) modeled only the advisory (green|yellow|red) and binary (pass|fail) tokens, but gate1 also emits `decline` (the /ceo-escalation signal, start.md step 10), so the contract test and the command had drifted: a structured `## Verdict: decline` was unguarded.

Add a `gate1` parser mode whose pattern mirrors start.md step 10 (green|yellow|red|decline), with `decline` as a structured-only token (a free-form mention in the reasoning body must not trigger it; the no-structured fallback reuses the advisory classifier, which can never return decline). Three fixtures lock the behavior: structured-decline → decline; decline-then-green → green (last-wins, no escalation); decline-in-body-only → green.

Scope note: the broader "#88 parser-contract tests" was found ~90% already covered by the existing 25 fixtures + enum-sync-check.sh + jq-sync-check.sh (gate1, yellow). This PR closes the one genuine remaining gap: `decline`. 28 fixtures pass (was 25).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The v0.14.0 "token + precision optimization" milestone was re-audited against the actual files (using the token-baseline tool from #94). The premise turned out to be largely invalid:

- The "~28% / ~6,400 token" compression is not achievable: slash commands load whole (no runtime include / conditional load), so relocating sections to an appendix saves nothing, and the bulk of start.md/ship.md is load-bearing executable spec that must not be compressed.
- The claimed precision bugs were phantom: step-18b precedence is already an If/Else-if chain; verdict last-wins is already explicit (and now test-guarded by #95); the propose.md "parallel Skill" instruction is correct (batched Skill calls are supported); the propose.md "regex mismatch" is a harmless subset, not a contradiction.

This commit ships the ONLY verified-safe, genuinely-beneficial residue:

- start.md: delete a verbatim-redundant `lang != "en"` localization line (649) that duplicated line 647.
- goal.md: convert the red-verdict force-continue prose (phase-aware bullets) into a compact decision table, preserving every load-bearing detail (the gate2.binary_gate `fail` exception, phase routing, continue-to steps).

Net effect (per tests/token-baseline.sh): TOTAL ~78,424 -> ~78,355 tokens (-69 tokens, -0.09%). The negligible number is itself the finding: it demonstrates the milestone's compression premise was unfounded, and the token-baseline tool (#87/#94) measuring it is working as intended.

Closes #89

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
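
The net-effect figures quoted in that commit message can be checked with a few lines of arithmetic. The sketch below uses only the two totals reported there; the variable names are illustrative, and it does not reproduce how tests/token-baseline.sh counts tokens.

```python
# Totals as reported in the commit message (approximate token counts).
before = 78_424
after = 78_355

delta = after - before            # change in tokens
percent = 100 * delta / before    # change relative to the starting total

print(delta, round(percent, 2))   # -69 -0.09
```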

JFK marked this pull request as ready for review June 13, 2026 05:35

JFK merged commit 1a8ff03 into main Jun 13, 2026
1 check passed

JFK deleted the 88-test/verdict-decline-coverage branch June 13, 2026 05:37

JFK mentioned this pull request Jun 13, 2026

refactor: v0.14.0 verified-safe cleanups (close the inflated compression milestone) #96

Merged

JFK merged 1 commit into main from 88-test/verdict-decline-coverage