Skip to content

test: add decline-token coverage to the verdict-parser contract (#88)#95

Merged
JFK merged 1 commit into
mainfrom
88-test/verdict-decline-coverage
Jun 13, 2026
Merged

test: add decline-token coverage to the verdict-parser contract (#88)#95
JFK merged 1 commit into
mainfrom
88-test/verdict-decline-coverage

Conversation

@JFK

@JFK JFK commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Closes #88

Summary

Add decline-token coverage to the verdict-parser contract test. The reference parser (tests/test_verdict_parser.py) modeled only advisory (green|yellow|red) and binary (pass|fail), but gate1 also emits decline (the /ceo-escalation signal, start.md step 10) — so the contract test and the command had drifted: a structured ## Verdict: decline was unguarded.

Implementation notes

  • New gate1 parser mode; GATE1_PATTERN mirrors start.md step 10 exactly: ^\s*##\s*Verdict:\s*(green|yellow|red|decline)\b.
  • decline is structured-only — a free-form mention in the reasoning body must not trigger it; the no-structured-line fallback reuses the advisory classifier (which can never return decline).
  • Three fixtures lock the behavior: gate1-decline-structureddecline; gate1-decline-then-greengreen (last-wins, no escalation); gate1-decline-in-bodygreen.
  • 28 fixtures pass (was 25). Module docstring updated to three modes.

⚠️ Scope note (important)

Gate1 (design review) found that #88 as written is ~90% already implemented: tests/test_verdict_parser.py's existing 25 fixtures already cover last-wins (edge-multiple-verdicts), trailing punctuation (edge-verdict-with-punctuation), case-insensitive, leading-whitespace, structured-overrides-heuristic, and both advisory + binary modes; enum-sync-check.sh and jq-sync-check.sh already cover enum + jq sync in CI. This PR therefore closes the one genuine remaining gap — the decline token — rather than re-adding existing coverage. The milestone issue was drafted before fully auditing existing test coverage.

Pre-PR review summary

  • gate2 mode: advisor-only (gate2.binary_gate = none)
  • cso: green / qa-lead: green / cto: green
  • gate1: yellow via /claude-c-suite:ask (scoped down — see Scope note)
  • review provider: code-review

🤖 Generated via /gh-issue-driven:ship (autonomous=red-only, milestone v0.14.0)

The reference parser (tests/test_verdict_parser.py) modeled only the advisory
(green|yellow|red) and binary (pass|fail) tokens, but gate1 also emits `decline`
(the /ceo-escalation signal, start.md step 10) — so the contract test and the
command had drifted: a structured `## Verdict: decline` was unguarded.

Add a `gate1` parser mode whose pattern mirrors start.md step 10
(green|yellow|red|decline), with `decline` as a structured-only token (a
free-form mention in the reasoning body must not trigger it; the no-structured
fallback reuses the advisory classifier, which can never return decline).

Three fixtures lock the behavior: structured-decline → decline; decline-then-
green → green (last-wins, no escalation); decline-in-body-only → green.

Scope note: the broader "#88 parser-contract tests" was found ~90% already
covered by the existing 25 fixtures + enum-sync-check.sh + jq-sync-check.sh
(gate1, yellow). This PR closes the one genuine remaining gap — `decline`.

28 fixtures pass (was 25).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JFK JFK marked this pull request as ready for review June 13, 2026 05:35
@JFK JFK merged commit 1a8ff03 into main Jun 13, 2026
1 check passed
@JFK JFK deleted the 88-test/verdict-decline-coverage branch June 13, 2026 05:37
JFK added a commit that referenced this pull request Jun 13, 2026
The v0.14.0 "token + precision optimization" milestone was re-audited against
the actual files (using the token-baseline tool from #94). The premise turned
out to be largely invalid:

- The "~28% / ~6,400 token" compression is not achievable: slash commands load
  whole (no runtime include / conditional load), so relocating sections to an
  appendix saves nothing, and the bulk of start.md/ship.md is load-bearing
  executable spec that must not be compressed.
- The claimed precision bugs were phantom: step-18b precedence is already an
  If/Else-if chain; verdict last-wins is already explicit (and now test-guarded
  by #95); the propose.md "parallel Skill" instruction is correct (batched Skill
  calls are supported); the propose.md "regex mismatch" is a harmless subset, not
  a contradiction.

This commit ships the ONLY verified-safe, genuinely-beneficial residue:

- start.md: delete a verbatim-redundant `lang != "en"` localization line (649)
  that duplicated line 647.
- goal.md: convert the red-verdict force-continue prose (phase-aware bullets)
  into a compact decision table, preserving every load-bearing detail (the
  gate2.binary_gate `fail` exception, phase routing, continue-to steps).

Net effect (per tests/token-baseline.sh): TOTAL ~78,424 -> ~78,355 tokens
(-69 tokens, -0.09%). The negligible number is itself the finding — it
demonstrates the milestone's compression premise was unfounded, and the
token-baseline tool (#87/#94) measuring it is working as intended.

Closes #89

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: parser-contract tests (verdict last-wins, gate enums, jq sync)

1 participant