ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354) by davidlabianca · Pull Request #367 · cosai-oasis/secure-ai-tooling

davidlabianca · 2026-05-29T20:29:17Z

ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354)

Closes: #354

Summary

This is the fourth and final sub-issue of #353. It carries a new ADR and a contract-level tokenizer change.

Per the newly-Accepted ADR-028:

Token contract raise (D1–D4): Token gains a third field shape: Literal["complete","open","close","neutral"] (default "neutral"); the public surface of _prose_tokens.py stays exactly Token / TokenKind / tokenize().
Shape at emission (D3): new private _classify_emphasis_shape tags each emphasis token from its interior edge-whitespace; _RE_ITALIC_UNDERSCORE tightened to whitespace-flanking (D-Open-21), fixing the intraword false positive where home_bar and foo_baz matched as italic (Spike S3: 0 changes / 569 fields).
Depth-counter linter (D5): the prose-subset linter's emphasis rejection is now a single bracket-matching pass over an integer depth counter with two one-line predicates (nested-emphasis + emphasis-wrapped-sentinel).
Docs: new docs/adr/028-...md; additive ADR-017 D1/D5 cross-ref amendments; README ADR index row.
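To make the contract raise concrete, here is a minimal sketch of a three-field Token and an edge-whitespace classifier. It is not the code in _prose_tokens.py: the `str` type for `kind`, the public name `classify_emphasis_shape`, and its `delim` parameter are illustrative assumptions. The PR text states only the both-edges-whitespace "open" convention; the mapping of a single whitespace edge to "open" or "close" below is an assumption of this sketch.

```python
from typing import Literal, NamedTuple

Shape = Literal["complete", "open", "close", "neutral"]


class Token(NamedTuple):
    kind: str                 # assumption: the real module uses TokenKind here
    value: str
    shape: Shape = "neutral"  # third field with a default, per D1


def classify_emphasis_shape(value: str, delim: str) -> Shape:
    """Classify a matched emphasis span from its interior edge whitespace.

    Uses str.isspace() only, no regex, mirroring the approach named in D3.
    """
    interior = value[len(delim):len(value) - len(delim)]
    leading = interior[:1].isspace()    # slicing keeps an empty interior safe
    trailing = interior[-1:].isspace()
    if leading and trailing:
        return "open"      # both-edges convention stated in the PR (D-Open-7)
    if trailing:
        return "open"      # ASSUMPTION: trailing interior whitespace opens
    if leading:
        return "close"     # ASSUMPTION: leading interior whitespace closes
    return "complete"


# Two-positional construction stays valid and equality stays structural.
plain = Token("TEXT", "hello")
tagged = Token("BOLD", "**hello**", classify_emphasis_shape("**hello**", "**"))
```

Because the new field has a default, consumers that build or compare tokens with only `kind` and `value` keep working unchanged, which is the property the ADR cross-refs rely on.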

Commit walk (9 commits — audit trail preserved)

| # | SHA | One-liner |
|---|-----|-----------|
| 1 | `82712a6` | ADR-028 lands `Status: Draft` + ADR-017 D5/D1 amendments + README index |
| 2 | `025e059` | Draft → Accepted after Spike focused on underscore usage (underscore corpus diff: 0/569) |
| 3 | `e149737` | D5 erratum #1: close-branch emit when `depth > 0` + `max(0, depth-1)` floor |
| 4 | `cc7df7d` | RED: failing tests (Token.shape contract, classifier cell-grid, depth-walk, underscore) |
| 5 | `cb170b0` | GREEN: implementation |
| 6 | `5e79d87` | D5 erratum #2: wrapped-sentinel prose; corrected fictional `SENTINEL_INNER_RE` to the two internal `_RE_SENTINEL_*_INNER` constants |
| 7 | `455f364` | Remove stale RED-phase scaffolding language |
| 8 | `15dd2d8` | D5 erratum #3: greenfield-not-refactor framing + double-emit note |
| 9 | `121f14e` | Independent-review fixes: `_delim_for_token` fail-loud guard, 6 D-Open-18 emphasis-shape fixtures, double-emit test, corrected docstrings |

Reviewer focus

docs/adr/028-...md — D1–D7 + three 2026-05-29 Addenda (errata).
_prose_tokens.py — Token.shape, _classify_emphasis_shape, the tightened _RE_ITALIC_UNDERSCORE.
validate_yaml_prose_subset.py — the depth-counter walk + two predicates, the _delim_for_token guard, and the documented _RE_SENTINEL_*_INNER cross-module coupling (D4-sanctioned).
Tests — TestEmphasisShapeFixtures + accepting/emphasis_shapes/, TestNestedEmphasisRejection (incl. double-emit), TestDelimForTokenGuard, classifier cell-grid.

Gates (all green at branch tip `121f14e`)

pytest: 2588 passed / 6 skipped.
ruff check + ruff format --check: clean.
pre-commit run --all-files: clean.
Both prose linters --block on risk-map/yaml/*.yaml → exit 0.
D6 diagnostic format preserved byte-for-byte; prior tokenizer/subset tests unmodified.

ADR cross-refs

ADR-028 (new) — decision of record for this PR.
ADR-017 D1/D4/D5 — D1/D5 amended (additive pointers); D4 diagnostic format preserved.
ADR-016 D5/D6 / ADR-015 D3 — other tokenizer consumers read kind/value only; shape default is transparent; partition-of-input + site/lint parity preserved.
ADR-020 D4 — pre-existing folded-bullet citation defect explicitly deferred (D7).

…ontract raise) as Status: Draft, with ADR-017 D5 + D1 amendments

Supersedes three accreted whitespace-adjacency-heuristic commits on the archived feature/353-c4-followons branch (a71ae47, 817e00d, 5e72a2a) and the stashed Path B derived-methods restructure attempt; both attempts left the emphasis-shape decision split between the tokenizer (greedy regex side effects) and the linter (regex/string-operation recovery).

ADR-028 D1-D4 raise the Token contract to ADR grade; D5 specifies a depth-counter bracket-matching pass over the new shape field; D6 locks diagnostic-format preservation; D7 captures the two ADR-017 amendments and four one-way cross-references.

Analytical input: working-plans/028-prose-linter-planning-inventory.md (21 locked decisions). D-Open-21 (underscore-italic intraword tightening) is in-scope per §8 but gated by Spike S3 (live-corpus diff probe) before maintainer flips Status: Draft -> Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>

…rms the D-Open-21 lock

Spike S3 (underscore-italic corpus impact probe) ran the current and proposed whitespace-flanked _RE_ITALIC_UNDERSCORE over all 569 prose fields in the four content YAMLs: zero tokenization changes. Probe validated non-blind against the home_bar and foo_baz motivating case. D-Open-21 lock confirmed; first ADR-028 section 9.3 outcome (empty diff).

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
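The whitespace-flanking idea can be illustrated with a small pattern. This is a hypothetical sketch, not the actual _RE_ITALIC_UNDERSCORE from the PR (whose source is not shown here); the name `RE_ITALIC_UNDERSCORE_SKETCH` and the helper `italic_spans` are made up for illustration.

```python
import re

# HYPOTHETICAL pattern: the opening underscore must follow whitespace or the
# start of the string, the closing underscore must precede whitespace or the
# end of the string, and a doubled opening underscore is rejected.
RE_ITALIC_UNDERSCORE_SKETCH = re.compile(r"(?<!\S)_(?!_)([^_]+)_(?!\S)")


def italic_spans(text: str) -> list:
    """Return every substring the sketch pattern would treat as italic."""
    return [m.group(0) for m in RE_ITALIC_UNDERSCORE_SKETCH.finditer(text)]


intraword = italic_spans("home_bar and foo_baz")  # motivating false positive
flanked = italic_spans("an _italic_ word")
doubled = italic_spans("__double__")
```

Under this sketch the intraword underscores in the motivating case produce no italic span, a whitespace-flanked `_italic_` still matches, and `__double__` is left alone.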

…nd depth-counter prose linter (RED)

Authors the RED-phase test suite for Accepted ADR-028 ahead of implementation (ADR-025 D2 TDD chain). New coverage:

- TestEmphasisShapeClassification + TestTokenShapeField (test_prose_tokens.py): the Token.shape contract (D1-D3) and _classify_emphasis_shape cell-grid (D-Open-16), incl. the D-Open-7 both-edges-whitespace "open" convention and the ADR-025 D10 wire-up (shape inspected on tokenize() output).
- TestIntrawordUnderscoreRejection (test_prose_tokens.py): the D-Open-21 whitespace-flanked underscore tightening (intraword \S_\S is not italic; boundary/whitespace-flanked _italic_ still is; __double__ stays TEXT).
- TestNestedEmphasisRejection + TestEmphasisDiagnosticFormat (test_validate_yaml_prose_subset.py): the D5 depth-counter walk + the two predicates, with the faux-depth false-positive guard, and the D6 byte-for-byte diagnostic format / exact reason strings.
- TestTripleAsteriskProbe (S1) and TestLiveCorpusBaseline (S2, live_corpus marker): the Spike S1 tokenizer ground-truth lock and the zero-diagnostic corpus regression baseline gating the GREEN pass.

35 new tests fail for the right reason (missing Token.shape field, absent _classify_emphasis_shape, current intraword-underscore match, and the not-yet-built emphasis rejection in check_prose_field); the pre-existing tokenizer and prose-subset suites stay green and unmodified.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>

… predicate prose

The close branch lacked the nested-emphasis emit and carried a false depth-floor comment. It now emits when depth > 0 before decrementing and floors with max(0, depth - 1), reconciling the pseudocode with D5's own predicate prose and planning-inventory section 5.2's load-bearing case. Erratum only: no section 8 lock changes; Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
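The amended close branch can be sketched as a single pass over token shapes. This is an illustrative reduction, not the check_prose_field implementation: the function name `nested_emphasis_indices` is hypothetical, it returns token indices rather than Diagnostic objects, and it handles only the "open" and "close" shapes (how "complete" tokens interact with the counter is not covered here).

```python
from typing import List, Literal, NamedTuple

Shape = Literal["complete", "open", "close", "neutral"]


class Token(NamedTuple):
    kind: str  # assumption: a plain string stands in for TokenKind
    value: str
    shape: Shape = "neutral"


def nested_emphasis_indices(tokens: List[Token]) -> List[int]:
    """Return indices of close tokens that arrive while depth > 0."""
    depth = 0            # bare integer counter
    hits: List[int] = []
    for index, tok in enumerate(tokens):
        if tok.shape == "open":
            depth += 1
        elif tok.shape == "close":
            if depth > 0:                 # emit before decrementing
                hits.append(index)
            depth = max(0, depth - 1)     # floor: a stray close never goes negative
    return hits


# Canonical nested case: [open, text, close] attributes the hit to the close.
stream = [
    Token("BOLD", "**foo **", "open"),
    Token("TEXT", "bar"),
    Token("BOLD", "** baz**", "close"),
]
```

In the canonical stream the only hit is the close token at index 2, and a lone close token with no preceding open produces no hit because the counter is already at its floor.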

…nter prose linter (GREEN)

Turns the RED suite from cc7df7d green by building the Accepted ADR-028 design.

Tokenizer (scripts/hooks/precommit/_prose_tokens.py):

- Token NamedTuple gains shape: Literal["complete","open","close","neutral"] as the third field, default "neutral" (D1 / D-Open-1..4); two-positional construction stays valid and equality stays structural.
- _classify_emphasis_shape() classifies a matched span from interior edge whitespace via .isspace(), no regex (D3 / D-Open-5..7); both-edges -> "open".
- The three emphasis emission sites pass shape= (D3 / Section 5.3).
- _RE_ITALIC_UNDERSCORE tightened to require whitespace-or-boundary flanking so intraword \S_\S no longer tokenizes as italic (D3 invariant 3 / D-Open-21, cosai-simpler form); __double__ stays TEXT; asterisk italic untouched.

Linter (scripts/hooks/precommit/validate_yaml_prose_subset.py):

- check_prose_field gains the D5 single-pass depth-counter walk with a bare integer counter (D-Open-10) and the two one-line predicates; reason strings "nested emphasis" / "emphasis-wrapped sentinel" route through the existing Diagnostic + format_diagnostic_line, byte-for-byte preserved (D6).
- The close branch emits when depth > 0 before decrementing, with a max(0, depth-1) floor, per ADR-028 D5 as amended 2026-05-29 (e149737).
- The wrapped-sentinel predicate reuses the tokenizer's own sentinel-inner patterns; no new emphasis-shape re.compile is introduced.

This is a fresh build on the clean base (D-Open-11): the pre-existing regex-driven emphasis layer described in the ADR's Consequences never landed on this branch (it lived on the abandoned R-commits a71ae47 / 817e00d / 5e72a2a), so the design is implemented directly rather than refactored.

Live corpus stays at zero diagnostics for both prose linters in --block; the test docstring at test_nested_bold_produces_one_nested_emphasis_diagnostic is reconciled with the amended D5.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>

…e test docstring

D5 named a non-existent unified SENTINEL_INNER_RE 'shared with the references linter'. Reality: two internal _RE_SENTINEL_*_INNER constants imported by the prose-subset linter via documented coupling (sanctioned by D4's internal-_RE_* posture), not used by the references linter (which resolves via _resolve_intra_sentinel). Adds a second D5 Addendum and fixes the matching docstring. Doc-only erratum: impl unchanged, no section 8 lock change, Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>

…-028 test and comment prose

The ADR-028 emphasis feature is implemented and green; comment and docstring text that narrated the transient test-first chain state ("RED until …", "after the … pass adds …", class-header "(RED — …)" annotations) is now stale and is reworded to describe what each test guards, in the present tense.

- validate_yaml_prose_subset.py: the close-branch comment no longer claims the ADR D5 pseudocode "omits the emit on close" / "the test is canonical"; that predated the D5 amendment (e149737). It now states the close token is the attribution point for the canonical [open, text, close] nested case and cites ADR-028 D5 as amended 2026-05-29.
- test_prose_tokens.py + test_validate_yaml_prose_subset.py: ~37 docstring, comment, and assertion-message sites reworded.

Doc-only: no assertion expression, expected value, or logic changed; the full suite stays at 2576 passed / 6 skipped.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>

…ouble-emit note)

Independent architecture review flagged D5's 'are deleted' / Consequences 'linter shrinks' as false against this clean base: the named helpers never existed here (D-Open-11; they lived on the abandoned feature/353-c4-followons). The Addendum reframes that language as logical supersession of the archived design, not a diff, and documents that the nested-emphasis and wrapped-sentinel predicates are independent and may both fire on one token. Doc-only; no code or section 8 lock change; Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
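The double-emit behaviour follows from evaluating the two predicates independently. The sketch below shows that shape only. The sentinel syntax `{{kind:id}}`, the pattern `RE_SENTINEL_INNER_SKETCH`, and the helper names are all hypothetical stand-ins, since the real _RE_SENTINEL_*_INNER constants are not shown in this PR text; only the two reason strings come from the source.

```python
import re
from typing import List

# HYPOTHETICAL sentinel syntax, invented for this sketch only.
RE_SENTINEL_INNER_SKETCH = re.compile(r"\{\{[a-z]+:[^}]+\}\}")


def wraps_sentinel(value: str) -> bool:
    """True when the emphasis interior is nothing but one sentinel."""
    interior = value.strip("*_").strip()
    return RE_SENTINEL_INNER_SKETCH.fullmatch(interior) is not None


def reasons_for_close(value: str, depth: int) -> List[str]:
    """Evaluate both predicates; neither short-circuits the other."""
    reasons: List[str] = []
    if depth > 0:
        reasons.append("nested emphasis")
    if wraps_sentinel(value):
        reasons.append("emphasis-wrapped sentinel")
    return reasons


both = reasons_for_close("** {{ref:abc}}**", depth=1)
only_nested = reasons_for_close("** baz**", depth=1)
```

Keeping the predicates as separate `if` statements rather than an `if`/`elif` chain is what allows a single token to carry both reasons.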

…prose linter

Fail-loud guard in _delim_for_token (raises ValueError on a non-emphasis value instead of silently returning '_'); 6 additive emphasis-shape fixtures under accepting/emphasis_shapes/ honoring D-Open-18 Path 3b (open/close/both-edges/nested/sentinel-wrapped token streams); a characterization test pinning the intended nested+wrapped double-emit; and corrected the stale 'tokenizer/fixture-dir NOT modified' docstring plus the fixture-pair count. pytest 2588/6, ruff clean, both prose linters --block exit 0.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
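A fail-loud guard of the kind described here might look like the following. This is a sketch under assumptions: the function name `delim_for_token`, its signature, and the set of delimiters checked are illustrative, and the real _delim_for_token may inspect the token differently.

```python
def delim_for_token(value: str) -> str:
    """Return the emphasis delimiter wrapping `value`, or raise.

    Longest delimiter first so a bold span is not misread as italic.
    """
    for delim in ("**", "*", "_"):
        wrapped = value.startswith(delim) and value.endswith(delim)
        if wrapped and len(value) > 2 * len(delim):
            return delim
    # Fail loudly instead of silently falling back to "_".
    raise ValueError(f"not an emphasis token value: {value!r}")


bold_delim = delim_for_token("**bold**")
italic_delim = delim_for_token("_italic_")

try:
    delim_for_token("plain text")
    guard_raised = False
except ValueError:
    guard_raised = True
```

Raising on unexpected input keeps a tokenizer or linter bug from being masked by a plausible-looking default.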

shrey-bagga

Can we please add regression coverage for nested underscore italic and punctuation-boundary underscore italic? With the current _RE_ITALIC_UNDERSCORE, _foo _nested_ bar_ tokenizes as one complete ITALIC token plus TEXT and produces no nested emphasis diagnostic, despite ADR-017’s “nesting another italic inside is rejected” rule. It also stops recognizing boundary forms like "See _important_, then act." as italic. After adding coverage, can we also adjust the tokenizer so snake_case still stays TEXT without losing those ADR-017/ADR-028 cases?
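One possible direction for the punctuation-boundary half of this request is to flank on word characters rather than on whitespace. The pattern below is a hypothetical sketch, not a proposed patch to _RE_ITALIC_UNDERSCORE, and it does not address the nested `_foo _nested_ bar_` case, which depends on how the tokenizer assigns open and close shapes.

```python
import re

# HYPOTHETICAL alternative: an underscore only opens or closes italic when
# the character outside it is not a letter, digit, or underscore.
RE_ITALIC_WORD_FLANKED = re.compile(
    r"(?<![A-Za-z0-9_])_(?!_)([^_]+)_(?![A-Za-z0-9_])"
)


def italic_spans(text: str) -> list:
    """Return every substring the sketch pattern would treat as italic."""
    return [m.group(0) for m in RE_ITALIC_WORD_FLANKED.finditer(text)]


# Candidate regression inputs taken from the review comment and the PR summary.
punctuation_case = italic_spans("See _important_, then act.")
snake_case = italic_spans("home_bar and foo_baz")
```

With this flanking rule the comma after `_important_` no longer blocks the match, while intraword underscores in identifiers such as home_bar still yield no italic span.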

davidlabianca and others added 9 commits May 28, 2026 19:42

davidlabianca self-assigned this May 29, 2026

davidlabianca added the tech-debt, adr (Architecture Decision Records), and tooling labels May 29, 2026

davidlabianca marked this pull request as ready for review May 29, 2026 20:32

davidlabianca requested review from santosomar and shrey-bagga May 29, 2026 20:32

shrey-bagga requested changes May 30, 2026

davidlabianca wants to merge 9 commits into cosai-oasis:main from davidlabianca:feature/354-prose-linter-token-contract
