ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354)#367

Open
davidlabianca wants to merge 9 commits into
cosai-oasis:mainfrom
davidlabianca:feature/354-prose-linter-token-contract

Conversation

@davidlabianca
Contributor

ADR-028: prose-linter Token.shape contract raise + depth-counter bracket-matching (#354)

Closes: #354

Summary

This PR is the fourth and final sub-issue of #353. It carries a new ADR and a contract-level tokenizer change.

Per the newly-Accepted ADR-028:

  • Token contract raise (D1–D4): Token gains a third field shape: Literal["complete","open","close","neutral"] (default "neutral"); the public surface of _prose_tokens.py stays exactly Token / TokenKind / tokenize().
  • Shape at emission (D3): new private _classify_emphasis_shape tags each emphasis token from its interior edge-whitespace; _RE_ITALIC_UNDERSCORE tightened to whitespace-flanking (D-Open-21), fixing the intraword false positive where home_bar and foo_baz matched as italic (Spike S3: 0 changes / 569 fields).
  • Depth-counter linter (D5): the prose-subset linter's emphasis rejection is now a single bracket-matching pass over an integer depth counter with two one-line predicates (nested-emphasis + emphasis-wrapped-sentinel).
  • Docs: new docs/adr/028-...md; additive ADR-017 D1/D5 cross-ref amendments; README ADR index row.
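
A minimal sketch of the Token contract raise described above, assuming a `NamedTuple` with `kind` and `value` as the two existing fields. The `TokenKind` members shown here are hypothetical placeholders; the real enum is not listed in this PR description.

```python
from enum import Enum
from typing import Literal, NamedTuple

# The four shape values named in ADR-028 D1-D4.
Shape = Literal["complete", "open", "close", "neutral"]


class TokenKind(Enum):
    # Hypothetical subset for illustration only.
    TEXT = "text"
    BOLD = "bold"
    ITALIC = "italic"


class Token(NamedTuple):
    kind: TokenKind
    value: str
    # Third field, defaulted so existing two-positional call sites keep working.
    shape: Shape = "neutral"


# Two-positional construction stays valid and picks up the default shape.
legacy = Token(TokenKind.TEXT, "home_bar")

# Emphasis emission sites would pass shape= explicitly.
tagged = Token(TokenKind.BOLD, "**bold**", shape="complete")

# Equality stays structural: a defaulted token equals an explicit "neutral" one.
same = legacy == Token(TokenKind.TEXT, "home_bar", "neutral")
```

Because the new field is defaulted and trailing, consumers that read only `kind` and `value` by attribute are unaffected, which is the transparency property the ADR cross-refs rely on.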

Commit walk (9 commits — audit trail preserved)

| # | SHA | One-liner |
|---|---------|-----------|
| 1 | 82712a6 | ADR-028 lands Status: Draft + ADR-017 D5/D1 amendments + README index |
| 2 | 025e059 | Draft → Accepted after Spike focused on underscore usage (underscore corpus diff: 0/569) |
| 3 | e149737 | D5 erratum #1: close-branch emit when depth > 0 + max(0, depth-1) floor |
| 4 | cc7df7d | RED: failing tests (Token.shape contract, classifier cell-grid, depth-walk, underscore) |
| 5 | cb170b0 | GREEN: implementation |
| 6 | 5e79d87 | D5 erratum #2: wrapped-sentinel prose; corrected fictional SENTINEL_INNER_RE to the two internal _RE_SENTINEL_*_INNER constants |
| 7 | 455f364 | Remove stale RED-phase scaffolding language |
| 8 | 15dd2d8 | D5 erratum #3: greenfield-not-refactor framing + double-emit note |
| 9 | 121f14e | Independent-review fixes: _delim_for_token fail-loud guard, 6 D-Open-18 emphasis-shape fixtures, double-emit test, corrected docstrings |

Reviewer focus

  • docs/adr/028-...md: D1–D7 + three 2026-05-29 Addenda (errata).
  • _prose_tokens.py: Token.shape, _classify_emphasis_shape, the tightened _RE_ITALIC_UNDERSCORE.
  • validate_yaml_prose_subset.py: the depth-counter walk + two predicates, the _delim_for_token guard, and the documented _RE_SENTINEL_*_INNER cross-module coupling (D4-sanctioned).
  • Tests: TestEmphasisShapeFixtures + accepting/emphasis_shapes/, TestNestedEmphasisRejection (incl. double-emit), TestDelimForTokenGuard, classifier cell-grid.

Gates (all green at branch tip 121f14e)

  • pytest: 2588 passed / 6 skipped.
  • ruff check + ruff format --check: clean.
  • pre-commit run --all-files: clean.
  • Both prose linters --block on risk-map/yaml/*.yaml → exit 0.
  • D6 diagnostic format preserved byte-for-byte; prior tokenizer/subset tests unmodified.

ADR cross-refs

  • ADR-028 (new) — decision of record for this PR.
  • ADR-017 D1/D4/D5 — D1/D5 amended (additive pointers); D4 diagnostic format preserved.
  • ADR-016 D5/D6 / ADR-015 D3 — other tokenizer consumers read kind/value only; shape default is transparent; partition-of-input + site/lint parity preserved.
  • ADR-020 D4 — pre-existing folded-bullet citation defect explicitly deferred (D7).

davidlabianca and others added 9 commits May 28, 2026 19:42
…ontract raise) as Status: Draft, with ADR-017 D5 + D1 amendments

Supersedes three accreted whitespace-adjacency-heuristic commits on the
archived feature/353-c4-followons branch (a71ae47, 817e00d, 5e72a2a) and
the stashed Path B derived-methods restructure attempt; both attempts
left the emphasis-shape decision split between the tokenizer (greedy
regex side effects) and the linter (regex/string-operation recovery).
ADR-028 D1-D4 raise the Token contract to ADR grade; D5 specifies a
depth-counter bracket-matching pass over the new shape field; D6 locks
diagnostic-format preservation; D7 captures the two ADR-017 amendments
and four one-way cross-references.

Analytical input: working-plans/028-prose-linter-planning-inventory.md
(21 locked decisions). D-Open-21 (underscore-italic intraword
tightening) is in-scope per §8 but gated by Spike S3 (live-corpus
diff probe) before maintainer flips Status: Draft -> Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…rms the D-Open-21 lock

Spike S3 (underscore-italic corpus impact probe) ran the current and
proposed whitespace-flanked _RE_ITALIC_UNDERSCORE over all 569 prose
fields in the four content YAMLs: zero tokenization changes. Probe
validated non-blind against the home_bar and foo_baz motivating case.
D-Open-21 lock confirmed; first ADR-028 section 9.3 outcome (empty diff).

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…nd depth-counter prose linter (RED)

Authors the RED-phase test suite for Accepted ADR-028 ahead of implementation
(ADR-025 D2 TDD chain). New coverage:

- TestEmphasisShapeClassification + TestTokenShapeField (test_prose_tokens.py):
  the Token.shape contract (D1-D3) and _classify_emphasis_shape cell-grid
  (D-Open-16), incl. the D-Open-7 both-edges-whitespace "open" convention and
  the ADR-025 D10 wire-up (shape inspected on tokenize() output).
- TestIntrawordUnderscoreRejection (test_prose_tokens.py): the D-Open-21
  whitespace-flanked underscore tightening (intraword \S_\S is not italic;
  boundary/whitespace-flanked _italic_ still is; __double__ stays TEXT).
- TestNestedEmphasisRejection + TestEmphasisDiagnosticFormat
  (test_validate_yaml_prose_subset.py): the D5 depth-counter walk + the two
  predicates, with the faux-depth false-positive guard, and the D6 byte-for-byte
  diagnostic format / exact reason strings.
- TestTripleAsteriskProbe (S1) and TestLiveCorpusBaseline (S2, live_corpus
  marker): the Spike S1 tokenizer ground-truth lock and the zero-diagnostic
  corpus regression baseline gating the GREEN pass.

35 new tests fail for the right reason (missing Token.shape field, absent
_classify_emphasis_shape, current intraword-underscore match, and the not-yet-
built emphasis rejection in check_prose_field); the pre-existing tokenizer and
prose-subset suites stay green and unmodified.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
… predicate prose

The close branch lacked the nested-emphasis emit and carried a false depth-floor comment. It now emits when depth > 0 before decrementing and floors with max(0, depth - 1), reconciling the pseudocode with D5's own predicate prose and planning-inventory section 5.2's load-bearing case. Erratum only: no section 8 lock changes; Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…nter prose linter (GREEN)

Turns the RED suite from cc7df7d green by building the Accepted ADR-028 design.

Tokenizer (scripts/hooks/precommit/_prose_tokens.py):
- Token NamedTuple gains shape: Literal["complete","open","close","neutral"]
  as the third field, default "neutral" (D1 / D-Open-1..4); two-positional
  construction stays valid and equality stays structural.
- _classify_emphasis_shape() classifies a matched span from interior edge
  whitespace via .isspace(), no regex (D3 / D-Open-5..7); both-edges -> "open".
- The three emphasis emission sites pass shape= (D3 / Section 5.3).
- _RE_ITALIC_UNDERSCORE tightened to require whitespace-or-boundary flanking
  so intraword \S_\S no longer tokenizes as italic (D3 invariant 3 / D-Open-21,
  cosai-simpler form); __double__ stays TEXT; asterisk italic untouched.
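
A sketch of how a classifier like _classify_emphasis_shape might look, based only on what this message states: interior edge whitespace, `.isspace()`, no regex, and both-edges mapping to "open". The function name, the `delim` parameter, and the single-edge mapping (trailing whitespace as "open", leading whitespace as "close") are assumptions for illustration, not the committed code.

```python
from typing import Literal

Shape = Literal["complete", "open", "close", "neutral"]


def classify_emphasis_shape(span: str, delim: str) -> Shape:
    """Classify a matched emphasis span from whitespace at its interior edges.

    Hypothetical signature: `span` is the full matched text including
    delimiters, `delim` is the delimiter string ("**", "*", or "_").
    """
    interior = span[len(delim):len(span) - len(delim)]
    # Slicing (not indexing) keeps an empty interior from raising IndexError;
    # "".isspace() is False, so an empty interior falls through to "complete".
    leading_ws = interior[:1].isspace()
    trailing_ws = interior[-1:].isspace()

    if leading_ws and trailing_ws:
        return "open"      # stated in the commit message: both-edges -> "open"
    if trailing_ws:
        return "open"      # ASSUMPTION: "**bold **" reads as an opener
    if leading_ws:
        return "close"     # ASSUMPTION: "** more**" reads as a closer
    return "complete"


shapes = [
    classify_emphasis_shape("**bold**", "**"),
    classify_emphasis_shape("**bold **", "**"),
    classify_emphasis_shape("** more**", "**"),
    classify_emphasis_shape("_ x _", "_"),
]
```

The single-edge mapping is chosen so that a greedy match over `**bold **nested** more**` yields the [open, text, close] stream that the depth-counter walk expects.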

Linter (scripts/hooks/precommit/validate_yaml_prose_subset.py):
- check_prose_field gains the D5 single-pass depth-counter walk with a bare
  integer counter (D-Open-10) and the two one-line predicates; reason strings
  "nested emphasis" / "emphasis-wrapped sentinel" route through the existing
  Diagnostic + format_diagnostic_line, byte-for-byte preserved (D6).
- The close branch emits when depth > 0 before decrementing, with a
  max(0, depth-1) floor, per ADR-028 D5 as amended 2026-05-29 (e149737).
- The wrapped-sentinel predicate reuses the tokenizer's own sentinel-inner
  patterns; no new emphasis-shape re.compile is introduced.
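
The walk described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in validate_yaml_prose_subset.py: the simplified `Token`, the `emphasis_reasons` helper, and the `{{...}}` sentinel pattern are all hypothetical, and only "complete" and "close" tokens are assumed to emit the nested-emphasis reason.

```python
import re
from typing import List, Literal, NamedTuple, Tuple


class Token(NamedTuple):
    kind: str  # simplified to a string for this sketch
    value: str
    shape: Literal["complete", "open", "close", "neutral"] = "neutral"


# HYPOTHETICAL stand-in for the tokenizer's sentinel-inner patterns.
SENTINEL_INNER = re.compile(r"\{\{[^{}]+\}\}")


def emphasis_reasons(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Single pass over the token stream with a bare integer depth counter."""
    reasons: List[Tuple[str, str]] = []
    depth = 0
    for tok in tokens:
        if tok.shape == "neutral":
            continue
        # Predicate 1: an emphasis token seen while another is still open.
        if tok.shape in ("complete", "close") and depth > 0:
            reasons.append((tok.value, "nested emphasis"))
        # Predicate 2: the emphasis interior is exactly a sentinel.
        interior = tok.value.strip("*_").strip()
        if SENTINEL_INNER.fullmatch(interior):
            reasons.append((tok.value, "emphasis-wrapped sentinel"))
        # Depth bookkeeping: the close branch emits first, then floors at 0.
        if tok.shape == "open":
            depth += 1
        elif tok.shape == "close":
            depth = max(0, depth - 1)
    return reasons


nested = emphasis_reasons([
    Token("BOLD", "**bold **", "open"),
    Token("TEXT", "nested"),
    Token("BOLD", "** more**", "close"),
])
stray_close = emphasis_reasons([Token("BOLD", "** x**", "close")])
```

The two predicates are evaluated independently, so a single token can produce both reasons. The floor keeps a stray close at depth 0 from driving the counter negative and from emitting a false positive.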

This is a fresh build on the clean base (D-Open-11): the pre-existing
regex-driven emphasis layer described in the ADR's Consequences never landed
on this branch (it lived on the abandoned R-commits a71ae47 / 817e00d /
5e72a2a), so the design is implemented directly rather than refactored.
Live corpus stays at zero diagnostics for both prose linters in --block;
the test docstring at test_nested_bold_produces_one_nested_emphasis_diagnostic
is reconciled with the amended D5.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…e test docstring

D5 named a non-existent unified SENTINEL_INNER_RE 'shared with the references linter'. Reality: two internal _RE_SENTINEL_*_INNER constants imported by the prose-subset linter via documented coupling (sanctioned by D4's internal-_RE_* posture), not used by the references linter (which resolves via _resolve_intra_sentinel). Adds a second D5 Addendum and fixes the matching docstring. Doc-only erratum: impl unchanged, no section 8 lock change, Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…-028 test and comment prose

The ADR-028 emphasis feature is implemented and green; comment and docstring
text that narrated the transient test-first chain state ("RED until …", "after
the … pass adds …", class-header "(RED — …)" annotations) is now stale and is
reworded to describe what each test guards, in the present tense.

- validate_yaml_prose_subset.py: the close-branch comment no longer claims the
  ADR D5 pseudocode "omits the emit on close" / "the test is canonical" — that
  predated the D5 amendment (e149737). It now states the close token is the
  attribution point for the canonical [open, text, close] nested case and cites
  ADR-028 D5 as amended 2026-05-29.
- test_prose_tokens.py + test_validate_yaml_prose_subset.py: ~37 docstring,
  comment, and assertion-message sites reworded. Doc-only — no assertion
  expression, expected value, or logic changed; the full suite stays at
  2576 passed / 6 skipped.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…ouble-emit note)

Independent architecture review flagged D5's 'are deleted' / Consequences 'linter shrinks' as false against this clean base: the named helpers never existed here (D-Open-11; they lived on the abandoned feature/353-c4-followons). The Addendum reframes that language as logical supersession of the archived design, not a diff, and documents that the nested-emphasis and wrapped-sentinel predicates are independent and may both fire on one token. Doc-only; no code or section 8 lock change; Status stays Accepted.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…prose linter

Fail-loud guard in _delim_for_token (raises ValueError on a non-emphasis value instead of silently returning '_'); 6 additive emphasis-shape fixtures under accepting/emphasis_shapes/ honoring D-Open-18 Path 3b (open/close/both-edges/nested/sentinel-wrapped token streams); a characterization test pinning the intended nested+wrapped double-emit; and corrected the stale 'tokenizer/fixture-dir NOT modified' docstring plus the fixture-pair count. pytest 2588/6, ruff clean, both prose linters --block exit 0.
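
A minimal sketch of the fail-loud guard idea, assuming the helper maps an emphasis token's value to its delimiter. The function body and the error message are hypothetical; only the behavior change (raise ValueError instead of silently returning '_') comes from this commit message.

```python
def delim_for_token(value: str) -> str:
    """Return the emphasis delimiter wrapping `value`, or raise ValueError."""
    # Check the longest delimiter first so "**bold**" is not read as "*".
    for delim in ("**", "*", "_"):
        wrapped = value.startswith(delim) and value.endswith(delim)
        if wrapped and len(value) >= 2 * len(delim):
            return delim
    # Fail loud: a non-emphasis value reaching this helper is a caller bug.
    raise ValueError(f"not an emphasis token value: {value!r}")


delims = [delim_for_token(v) for v in ("**b**", "*i*", "_i_")]
```

Raising here surfaces a misuse at the call site instead of letting a wrong delimiter flow into a diagnostic.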

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
@davidlabianca self-assigned this on May 29, 2026
@davidlabianca added the tech-debt, adr (Architecture Decision Records), and tooling labels on May 29, 2026
@davidlabianca davidlabianca marked this pull request as ready for review May 29, 2026 20:32
Contributor

@shrey-bagga left a comment

Can we please add regression coverage for nested underscore italic and punctuation-boundary underscore italic? With the current _RE_ITALIC_UNDERSCORE, _foo _nested_ bar_ tokenizes as one complete ITALIC token plus TEXT and produces no nested emphasis diagnostic, despite ADR-017’s “nesting another italic inside is rejected” rule. It also stops recognizing boundary forms like "See _important_, then act." as italic. After adding coverage, can we also adjust the tokenizer so that snake_case stays TEXT without losing those ADR-017/ADR-028 cases?
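
The behaviors described in this comment can be reproduced with a stand-in pattern. The real _RE_ITALIC_UNDERSCORE is not quoted in this thread, so `WS_FLANKED` below is an assumed whitespace-or-boundary-flanked form. `WORD_FLANKED` is one hypothetical direction for the punctuation-boundary case only, not a proposed fix.

```python
import re
from typing import List

# ASSUMED stand-in: underscores must be flanked by whitespace or string edges.
WS_FLANKED = re.compile(r"(?<!\S)_(?!_)(.+?)(?<!_)_(?!\S)")

# HYPOTHETICAL alternative: forbid only word characters outside the
# delimiters, so trailing punctuation such as "," is allowed.
WORD_FLANKED = re.compile(
    r"(?<![A-Za-z0-9_])_(?![_\s])(.+?)(?<![_\s])_(?![A-Za-z0-9_])"
)


def italic_spans(pattern: "re.Pattern[str]", text: str) -> List[str]:
    """Return every span the pattern would tokenize as underscore italic."""
    return [m.group(0) for m in pattern.finditer(text)]


# Intraword underscores stay TEXT under both patterns.
snake_ws = italic_spans(WS_FLANKED, "home_bar")
snake_word = italic_spans(WORD_FLANKED, "home_bar")

# Nested case: both patterns stop at the first valid closer, so the stream
# is one ITALIC span followed by TEXT and the inner italic is never seen.
nested_ws = italic_spans(WS_FLANKED, "_foo _nested_ bar_")
nested_word = italic_spans(WORD_FLANKED, "_foo _nested_ bar_")

# Punctuation boundary: whitespace flanking rejects it, word flanking keeps it.
punct_ws = italic_spans(WS_FLANKED, "See _important_, then act.")
punct_word = italic_spans(WORD_FLANKED, "See _important_, then act.")
```

Under these assumptions the word-flanked variant recovers "_important_" while keeping snake_case as TEXT, but it still collapses the nested case into a single span, so nested underscore italic would need separate handling.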

Development

Successfully merging this pull request may close these issues.

tooling: prose tokenizer/linter enforcement for nested emphasis + emphasis-wrapped sentinels (ADR-017 subset)

2 participants