
test: add deterministic token-count baseline for command files (#87)#94

Merged
JFK merged 3 commits into main from 87-test/test-deterministic-token-count-baseline
Jun 13, 2026

Conversation

@JFK JFK commented Jun 13, 2026

Owner

Closes #87

Summary

Implementation notes

  • Reproducibility across Windows/WSL and Linux CI: the script strips CR before counting, and commands/*.md, the snapshot, and *.sh are pinned to LF via .gitattributes. (The working tree was CRLF — raw wc -c differed from the normalized count, so this is load-bearing for the milestone's "prove reduction" goal.)
  • Wired into .github/workflows/lint.yml: the self-test gates CI; --check runs as an informational step.
  • CONTRIBUTING.md documents the refresh procedure.
  • Baseline at this branch: TOTAL 4573 lines / 313710 bytes / ~78424 tokens (top cost centers: start.md ~21277, ship.md ~12214).
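The CR-stripping step can be illustrated with a minimal sketch. This is not the PR's actual `tests/token-baseline.sh`; the function name `census_file` and the output layout are hypothetical, and integer division by 4 is an assumed reading of "~tokens ≈ bytes/4".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's script): size census of one file,
# counted after stripping carriage returns.
census_file() {
  local file="$1"
  local lines bytes tokens
  # Strip CR so CRLF and LF checkouts of the same content count identically.
  lines=$(tr -d '\r' < "$file" | wc -l | tr -d ' ')
  bytes=$(tr -d '\r' < "$file" | wc -c | tr -d ' ')
  # Assumption: integer division, roughly 4 bytes per token.
  tokens=$(( bytes / 4 ))
  printf '%s %s %s %s\n' "$(basename "$file")" "$lines" "$bytes" "$tokens"
}
```

For a two-line file saved with CRLF endings, raw `wc -c` counts two bytes more than this function does, which is the kind of mismatch the note above describes.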

Pre-PR review summary

  • gate2 mode: advisor-only (gate2.binary_gate = none)
  • audit: skipped
  • cso: green
  • qa-lead: yellow → addressed in commit 8e36504 (added the missing drift-case assertion the review flagged)
  • cto: green
  • gate1: green via /claude-c-suite:ask (QA Lead lens)
  • review provider: code-review

Full reviews are saved in the plugin cache (<branch-flat>.gate1.md / .gate2.md).

🤖 Generated via /gh-issue-driven:ship (autonomous=red-only, milestone v0.14.0)

JFK and others added 2 commits June 13, 2026 14:20
Add tests/token-baseline.sh — a deterministic size census of commands/*.md
(lines, bytes, ~tokens ≈ bytes/4) with a committed snapshot at
tests/fixtures/token-baseline.txt, so the v0.14.0 compression milestone
(#89-#92) can prove per-command reductions and catch accidental bloat
without an LLM.

- --check prints the table + drift vs snapshot, ALWAYS exits 0 (informational;
  a bloat hard-fail guard is deferred per gate1)
- --update refreshes the snapshot
- CR stripped before counting, and commands/*.md, the snapshot, and *.sh pinned
  to LF via .gitattributes, so byte counts are reproducible across Windows/WSL
  and Linux CI
- tests/token-baseline-test.sh self-tests the census tool; both wired into
  .github/workflows/lint.yml (self-test gates, --check is informational)
- CONTRIBUTING.md documents the refresh procedure
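
As a rough illustration of the LF pinning, the attributes could look like the lines written below. The exact patterns are assumed from the description and are not copied from the repository's `.gitattributes`; the helper name `write_gitattributes` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: writes an assumed set of LF-pinning rules to the given path.
# Note: this overwrites the target file, so point it at a scratch path first.
write_gitattributes() {
  local target="${1:?usage: write_gitattributes <path>}"
  cat > "$target" <<'EOF'
commands/*.md text eol=lf
tests/fixtures/token-baseline.txt text eol=lf
*.sh text eol=lf
EOF
}
```

`text eol=lf` is standard gitattributes syntax. After adding such rules, already-tracked files typically need `git add --renormalize .` before the index reflects the new line endings.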

Baseline at this commit: TOTAL 4573 lines / 313710 bytes / ~78424 tokens
(start.md ~21277, ship.md ~12214 are the top cost centers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Gate2 (qa-lead) flagged that the tool's core contract — AC #3, "snapshot
drift must still exit 0 (informational) and warn" — was only verified in the
match case, never under actual drift. Add a drift-case assertion: append a
sentinel row to the snapshot, run --check, assert exit 0 AND a "WARN: size
drift" line (captured from stderr), then restore the snapshot via --update.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
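
The drift-case assertion described in this commit might be sketched as follows. `check_snapshot` is a hypothetical stand-in for the tool's `--check` mode (the real script is not shown in this PR page), and the sentinel row format is invented for illustration. The restore step here copies a backup, whereas the commit restores via `--update`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for --check: warn on drift, but never fail.
check_snapshot() {
  local snapshot="$1" current="$2"
  if ! diff -q "$snapshot" "$current" > /dev/null 2>&1; then
    echo "WARN: size drift vs snapshot" >&2
  fi
  return 0   # informational: drift must not change the exit status
}

# Drift-case assertion: force drift, expect exit 0 AND a WARN line on stderr.
assert_drift_is_informational() {
  local snapshot="$1" current="$2" backup stderr_out status
  backup=$(mktemp)
  cp "$snapshot" "$backup"
  echo "sentinel.md 1 4 1" >> "$snapshot"   # sentinel row (format assumed)
  # Capture stderr only: stderr goes to the substitution, stdout is discarded.
  stderr_out=$(check_snapshot "$snapshot" "$current" 2>&1 > /dev/null)
  status=$?
  cp "$backup" "$snapshot"                  # restore the original snapshot
  rm -f "$backup"
  [ "$status" -eq 0 ] && printf '%s\n' "$stderr_out" | grep -q '^WARN: size drift'
}
```

Checking both conditions matters: an exit-0 assertion alone would also pass if the tool silently ignored the drift.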
@JFK JFK marked this pull request as ready for review June 13, 2026 05:27
The scripts were committed mode 100644 because `chmod +x` on Windows does not
set git's executable bit. CI invokes them via `bash` (which works), but the
self-test's `[ -x ]` contract assertion correctly failed on a clean Linux
checkout — the repo convention is executable tests/*.sh (rwxr-xr-x). Set the
git exec bit via `git update-index --chmod=+x` on both scripts (blob content
unchanged, mode 100644 -> 100755).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
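
For reference, the mode change can be reproduced and verified with a small sketch like the one below. The helper name `mark_executable` is hypothetical; `git update-index --chmod=+x` and `git ls-files -s` are the real commands, and the path must already be in the index.

```shell
#!/usr/bin/env bash
# Sketch: set git's executable bit in the index and report the recorded mode.
# Works regardless of whether the filesystem honors chmod.
mark_executable() {
  local path="$1"
  git update-index --chmod=+x "$path"
  # `git ls-files -s` prints "<mode> <blob> <stage><TAB><path>"; keep the mode.
  git ls-files -s -- "$path" | awk '{ print $1 }'
}
```

A result of `100755` indicates the index now records the file as executable, so a clean Linux checkout would satisfy a `[ -x ]` check.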
@JFK JFK merged commit 583cef7 into main Jun 13, 2026
1 check passed
@JFK JFK deleted the 87-test/test-deterministic-token-count-baseline branch June 13, 2026 05:40
JFK added a commit that referenced this pull request Jun 13, 2026
The v0.14.0 "token + precision optimization" milestone was re-audited against
the actual files (using the token-baseline tool from #94). The premise turned
out to be largely invalid:

- The "~28% / ~6,400 token" compression is not achievable: slash commands load
  whole (no runtime include / conditional load), so relocating sections to an
  appendix saves nothing, and the bulk of start.md/ship.md is load-bearing
  executable spec that must not be compressed.
- The claimed precision bugs were phantom: step-18b precedence is already an
  If/Else-if chain; verdict last-wins is already explicit (and now test-guarded
  by #95); the propose.md "parallel Skill" instruction is correct (batched Skill
  calls are supported); the propose.md "regex mismatch" is a harmless subset, not
  a contradiction.

This commit ships the ONLY verified-safe, genuinely-beneficial residue:

- start.md: delete a verbatim-redundant `lang != "en"` localization line (649)
  that duplicated line 647.
- goal.md: convert the red-verdict force-continue prose (phase-aware bullets)
  into a compact decision table, preserving every load-bearing detail (the
  gate2.binary_gate `fail` exception, phase routing, continue-to steps).

Net effect (per tests/token-baseline.sh): TOTAL ~78,424 -> ~78,355 tokens
(-69 tokens, -0.09%). The negligible number is itself the finding — it
demonstrates the milestone's compression premise was unfounded, and the
token-baseline tool (#87/#94) measuring it is working as intended.
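
The net-effect arithmetic can be checked with a short sketch. The function name `token_delta` is hypothetical; the two totals are the ones quoted above.

```shell
#!/usr/bin/env bash
# Sketch: absolute and percentage change between two token totals.
token_delta() {
  local before="$1" after="$2"
  awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%d %.2f%%\n", a - b, (a - b) * 100 / b }'
}

token_delta 78424 78355   # prints: -69 -0.09%
```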

Closes #89

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

test: deterministic token-count baseline for command files

1 participant