ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354) #367
Open
davidlabianca wants to merge 9 commits into
…ontract raise) as Status: Draft, with ADR-017 D5 + D1 amendments
Supersedes three accreted whitespace-adjacency-heuristic commits on the archived feature/353-c4-followons branch (a71ae47, 817e00d, 5e72a2a) and the stashed Path B derived-methods restructure attempt; both attempts left the emphasis-shape decision split between the tokenizer (greedy regex side effects) and the linter (regex/string-operation recovery).
ADR-028 D1-D4 raise the Token contract to ADR grade; D5 specifies a depth-counter bracket-matching pass over the new shape field; D6 locks diagnostic-format preservation; D7 captures the two ADR-017 amendments and four one-way cross-references.
Analytical input: working-plans/028-prose-linter-planning-inventory.md (21 locked decisions). D-Open-21 (underscore-italic intraword tightening) is in-scope per §8 but gated by Spike S3 (live-corpus diff probe) before maintainer flips Status: Draft -> Accepted.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…rms the D-Open-21 lock
Spike S3 (underscore-italic corpus impact probe) ran the current and proposed whitespace-flanked _RE_ITALIC_UNDERSCORE over all 569 prose fields in the four content YAMLs: zero tokenization changes. Probe validated non-blind against the home_bar and foo_baz motivating case. D-Open-21 lock confirmed; first ADR-028 section 9.3 outcome (empty diff).
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…nd depth-counter prose linter (RED)
Authors the RED-phase test suite for Accepted ADR-028 ahead of implementation (ADR-025 D2 TDD chain). New coverage:
- TestEmphasisShapeClassification + TestTokenShapeField (test_prose_tokens.py): the Token.shape contract (D1-D3) and _classify_emphasis_shape cell-grid (D-Open-16), incl. the D-Open-7 both-edges-whitespace "open" convention and the ADR-025 D10 wire-up (shape inspected on tokenize() output).
- TestIntrawordUnderscoreRejection (test_prose_tokens.py): the D-Open-21 whitespace-flanked underscore tightening (intraword \S_\S is not italic; boundary/whitespace-flanked _italic_ still is; __double__ stays TEXT).
- TestNestedEmphasisRejection + TestEmphasisDiagnosticFormat (test_validate_yaml_prose_subset.py): the D5 depth-counter walk + the two predicates, with the faux-depth false-positive guard, and the D6 byte-for-byte diagnostic format / exact reason strings.
- TestTripleAsteriskProbe (S1) and TestLiveCorpusBaseline (S2, live_corpus marker): the Spike S1 tokenizer ground-truth lock and the zero-diagnostic corpus regression baseline gating the GREEN pass.
35 new tests fail for the right reason (missing Token.shape field, absent _classify_emphasis_shape, current intraword-underscore match, and the not-yet-built emphasis rejection in check_prose_field); the pre-existing tokenizer and prose-subset suites stay green and unmodified.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
… predicate prose
The close branch lacked the nested-emphasis emit and carried a false depth-floor comment. It now emits when depth > 0 before decrementing and floors with max(0, depth - 1), reconciling the pseudocode with D5's own predicate prose and planning-inventory section 5.2's load-bearing case.
Erratum only: no section 8 lock changes; Status stays Accepted.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
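As a rough illustration of the amended close branch described in this commit message, the walk can be sketched over a plain list of shape strings. The function name `walk_emphasis_depth`, the tuple return type, and the behavior of the "open" branch (a bare increment) are assumptions made for illustration; only the `depth > 0` emit before decrementing, the `max(0, depth - 1)` floor, and the "nested emphasis" reason string come from the text above.

```python
from typing import List, Sequence, Tuple


def walk_emphasis_depth(shapes: Sequence[str]) -> List[Tuple[int, str]]:
    """Hypothetical single-pass depth-counter walk over token shapes.

    Returns (token index, reason) pairs. This is a sketch, not the
    check_prose_field implementation from the repository.
    """
    depth = 0  # bare integer counter
    diagnostics: List[Tuple[int, str]] = []
    for index, shape in enumerate(shapes):
        if shape == "open":
            # Assumption: an open-shaped token only raises the depth.
            depth += 1
        elif shape == "close":
            # Emit before decrementing, so the close token is the
            # attribution point for an [open, text, close] stream.
            if depth > 0:
                diagnostics.append((index, "nested emphasis"))
            # Floor at zero so a stray close cannot drive depth negative.
            depth = max(0, depth - 1)
        # "complete" and "neutral" leave the counter untouched here.
    return diagnostics
```

Under these assumptions, `walk_emphasis_depth(["open", "neutral", "close"])` attributes a single diagnostic to the close token, and a lone `"close"` produces none because the counter never rose above zero.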
…nter prose linter (GREEN)
Turns the RED suite from cc7df7d green by building the Accepted ADR-028 design.
Tokenizer (scripts/hooks/precommit/_prose_tokens.py):
- Token NamedTuple gains shape: Literal["complete","open","close","neutral"] as the third field, default "neutral" (D1 / D-Open-1..4); two-positional construction stays valid and equality stays structural.
- _classify_emphasis_shape() classifies a matched span from interior edge whitespace via .isspace(), no regex (D3 / D-Open-5..7); both-edges -> "open".
- The three emphasis emission sites pass shape= (D3 / Section 5.3).
- _RE_ITALIC_UNDERSCORE tightened to require whitespace-or-boundary flanking so intraword \S_\S no longer tokenizes as italic (D3 invariant 3 / D-Open-21, cosai-simpler form); __double__ stays TEXT; asterisk italic untouched.
Linter (scripts/hooks/precommit/validate_yaml_prose_subset.py):
- check_prose_field gains the D5 single-pass depth-counter walk with a bare integer counter (D-Open-10) and the two one-line predicates; reason strings "nested emphasis" / "emphasis-wrapped sentinel" route through the existing Diagnostic + format_diagnostic_line, byte-for-byte preserved (D6).
- The close branch emits when depth > 0 before decrementing, with a max(0, depth-1) floor, per ADR-028 D5 as amended 2026-05-29 (e149737).
- The wrapped-sentinel predicate reuses the tokenizer's own sentinel-inner patterns; no new emphasis-shape re.compile is introduced.
This is a fresh build on the clean base (D-Open-11): the pre-existing regex-driven emphasis layer described in the ADR's Consequences never landed on this branch (it lived on the abandoned R-commits a71ae47 / 817e00d / 5e72a2a), so the design is implemented directly rather than refactored.
Live corpus stays at zero diagnostics for both prose linters in --block; the test docstring at test_nested_bold_produces_one_nested_emphasis_diagnostic is reconciled with the amended D5.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
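The Token contract and the whitespace-based classifier described in this commit message might look roughly like the sketch below. The `TokenKind` members, the `Shape` alias, and the function name `classify_emphasis_shape` are stand-ins, and the mapping of a trailing-only interior space to "open" and a leading-only interior space to "close" is an assumption; the commit message confirms only the four shape values, the "neutral" default, the use of `.isspace()` without regex, and the both-edges -> "open" convention.

```python
from enum import Enum
from typing import Literal, NamedTuple

Shape = Literal["complete", "open", "close", "neutral"]


class TokenKind(Enum):
    # Hypothetical members; the real TokenKind may differ.
    TEXT = "text"
    BOLD = "bold"
    ITALIC = "italic"


class Token(NamedTuple):
    kind: TokenKind
    value: str
    shape: Shape = "neutral"  # third field; default keeps two-positional calls valid


def classify_emphasis_shape(interior: str) -> Shape:
    """Classify an emphasis span from the whitespace at its interior edges."""
    leading = interior[:1].isspace()    # "" -> False, so empty input is safe
    trailing = interior[-1:].isspace()
    if leading and trailing:
        return "open"       # both-edges convention stated in the commit message
    if trailing:
        return "open"       # assumption: trailing-only space marks an opener
    if leading:
        return "close"      # assumption: leading-only space marks a closer
    return "complete"
```

Because `Token` is a NamedTuple, `Token(TokenKind.TEXT, "hi")` compares equal to `Token(TokenKind.TEXT, "hi", "neutral")`, which is the structural-equality property the commit message relies on.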
…e test docstring
D5 named a non-existent unified SENTINEL_INNER_RE 'shared with the references linter'. Reality: two internal _RE_SENTINEL_*_INNER constants imported by the prose-subset linter via documented coupling (sanctioned by D4's internal-_RE_* posture), not used by the references linter (which resolves via _resolve_intra_sentinel). Adds a second D5 Addendum and fixes the matching docstring.
Doc-only erratum: impl unchanged, no section 8 lock change, Status stays Accepted.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…-028 test and comment prose
The ADR-028 emphasis feature is implemented and green; comment and docstring
text that narrated the transient test-first chain state ("RED until …", "after
the … pass adds …", class-header "(RED — …)" annotations) is now stale and is
reworded to describe what each test guards, in the present tense.
- validate_yaml_prose_subset.py: the close-branch comment no longer claims the
ADR D5 pseudocode "omits the emit on close" / "the test is canonical" — that
predated the D5 amendment (e149737). It now states the close token is the
attribution point for the canonical [open, text, close] nested case and cites
ADR-028 D5 as amended 2026-05-29.
- test_prose_tokens.py + test_validate_yaml_prose_subset.py: ~37 docstring,
comment, and assertion-message sites reworded. Doc-only — no assertion
expression, expected value, or logic changed; the full suite stays at
2576 passed / 6 skipped.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…ouble-emit note)
Independent architecture review flagged D5's 'are deleted' / Consequences 'linter shrinks' as false against this clean base: the named helpers never existed here (D-Open-11; they lived on the abandoned feature/353-c4-followons). The Addendum reframes that language as logical supersession of the archived design, not a diff, and documents that the nested-emphasis and wrapped-sentinel predicates are independent and may both fire on one token.
Doc-only; no code or section 8 lock change; Status stays Accepted.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…prose linter
Fail-loud guard in _delim_for_token (raises ValueError on a non-emphasis value instead of silently returning '_'); 6 additive emphasis-shape fixtures under accepting/emphasis_shapes/ honoring D-Open-18 Path 3b (open/close/both-edges/nested/sentinel-wrapped token streams); a characterization test pinning the intended nested+wrapped double-emit; and corrections to the stale 'tokenizer/fixture-dir NOT modified' docstring and the fixture-pair count.
pytest 2588/6, ruff clean, both prose linters --block exit 0.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
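The fail-loud guard mentioned in this commit message can be pictured with a small sketch. The function name `delim_for_value`, its string argument, and the delimiter set ("**", "*", "_") are assumptions for illustration; the commit message states only that the guard raises ValueError on a non-emphasis value instead of silently returning '_'.

```python
def delim_for_value(value: str) -> str:
    """Return the emphasis delimiter wrapping value, or raise.

    Hypothetical stand-in for the guarded helper; the real
    _delim_for_token signature is not shown in this PR page.
    """
    # Check the two-character delimiter first so "**x**" is not read as "*".
    for delim in ("**", "*", "_"):
        wrapped = value.startswith(delim) and value.endswith(delim)
        if wrapped and len(value) >= 2 * len(delim):
            return delim
    # Fail loudly rather than falling through to a default delimiter.
    raise ValueError(f"not an emphasis value: {value!r}")
```

Raising here turns a caller bug (passing a TEXT value) into an immediate error rather than a diagnostic that names the wrong delimiter.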
shrey-bagga (Contributor) requested changes on May 30, 2026 and left a comment:
Can we please add regression coverage for nested underscore italic and punctuation-boundary underscore italic? With the current _RE_ITALIC_UNDERSCORE, _foo _nested_ bar_ tokenizes as one complete ITALIC token plus TEXT and produces no nested emphasis diagnostic, despite ADR-017’s “nesting another italic inside is rejected” rule. It also stops recognizing boundary forms like "See _important_, then act." as italic. After adding coverage, can we also adjust the tokenizer so that snake_case still stays TEXT without losing those ADR-017/ADR-028 cases?
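One possible direction for the punctuation-boundary and snake_case cases raised in this comment is to flank the underscores with "not a word character" lookarounds instead of requiring whitespace. The pattern and the helper name `find_italic_spans` below are hypothetical and are not the repository's _RE_ITALIC_UNDERSCORE; the sketch also does not attempt the nested `_foo _nested_ bar_` case, which depends on the tokenizer's open/close shapes.

```python
import re
from typing import List

# Hypothetical pattern: the opening "_" must not follow a letter, digit, or
# "_", and the closing "_" must not precede one. Punctuation and string
# boundaries therefore count as valid flanks, while intraword "_" does not.
ITALIC_UNDERSCORE_SKETCH = re.compile(
    r"(?<![A-Za-z0-9_])_([^_\n]+)_(?![A-Za-z0-9_])"
)


def find_italic_spans(text: str) -> List[str]:
    """Return every underscore-italic span matched by the sketch pattern."""
    return [m.group(0) for m in ITALIC_UNDERSCORE_SKETCH.finditer(text)]


# Regression-style checks mirroring the cases named in the review comment.
assert find_italic_spans("See _important_, then act.") == ["_important_"]
assert find_italic_spans("home_bar and foo_baz") == []
assert find_italic_spans("__double__") == []
```

Whether this flanking rule is compatible with the D-Open-21 lock is a question for the ADR; the checks only show that the three behaviors are not mutually exclusive at the regex level.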
ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354)
Closes: #354
Summary
This is the fourth and final sub-issue of #353. It carries a new ADR and a contract-level tokenizer change.
Per the newly-Accepted ADR-028:
- Token gains a third field shape: Literal["complete","open","close","neutral"] (default "neutral"); the public surface of _prose_tokens.py stays exactly Token / TokenKind / tokenize().
- _classify_emphasis_shape tags each emphasis token from its interior edge-whitespace; _RE_ITALIC_UNDERSCORE tightened to whitespace-flanking (D-Open-21), fixing the intraword false positive where home_bar and foo_baz matched as italic (Spike S3: 0 changes / 569 fields).
- docs/adr/028-...md; additive ADR-017 D1/D5 cross-ref amendments; README ADR index row.
Commit walk (9 commits, audit trail preserved)
- 82712a6: Status: Draft + ADR-017 D5/D1 amendments + README index
- 025e059
- e149737: depth > 0 + max(0, depth-1) floor
- cc7df7d
- cb170b0
- 5e79d87: SENTINEL_INNER_RE to the two internal _RE_SENTINEL_*_INNER constants
- 455f364
- 15dd2d8
- 121f14e: _delim_for_token fail-loud guard, 6 D-Open-18 emphasis-shape fixtures, double-emit test, corrected docstrings
Reviewer focus
- docs/adr/028-...md: D1–D7 + three 2026-05-29 Addenda (errata).
- _prose_tokens.py: Token.shape, _classify_emphasis_shape, the tightened _RE_ITALIC_UNDERSCORE.
- validate_yaml_prose_subset.py: the depth-counter walk + two predicates, the _delim_for_token guard, and the documented _RE_SENTINEL_*_INNER cross-module coupling (D4-sanctioned).
- TestEmphasisShapeFixtures + accepting/emphasis_shapes/, TestNestedEmphasisRejection (incl. double-emit), TestDelimForTokenGuard, classifier cell-grid.
Gates (all green at branch tip 121f14e)
- pytest: 2588 passed / 6 skipped.
- ruff check + ruff format --check: clean.
- pre-commit run --all-files: clean.
- --block on risk-map/yaml/*.yaml → exit 0.
ADR cross-refs
- kind/value only; shape default is transparent; partition-of-input + site/lint parity preserved.