diff --git a/docs/adr/017-yaml-prose-authoring-subset.md b/docs/adr/017-yaml-prose-authoring-subset.md index 5c06b987..12898ca3 100644 --- a/docs/adr/017-yaml-prose-authoring-subset.md +++ b/docs/adr/017-yaml-prose-authoring-subset.md @@ -38,6 +38,8 @@ Authors may use exactly these forms in any prose field. Everything else is rejec Bold and italic may compose (`**emphatically *not* this**` is valid). Sentinels are atomic identifier tokens; they do not nest into bold or italic. An author who wants the rendered title to appear bold relies on the renderer's stylesheet, not on wrapping `**` around `{{}}`. +The canonical mechanism the lint uses to enforce the "no nested same-family emphasis" rule and the "no emphasis-wrapped sentinel" rule is the depth-counter bracket-matching pass specified in [ADR-028](028-prose-linter-bracket-matching-architecture.md) D5. + The subset operates on the string contents of each prose paragraph. Paragraph and hard-break shape is carried by the YAML *array* structure ([ADR-011](011-persona-site-data-schema-contract.md) `definitions/prose`); prose strings are not list-bearing. ### D2. Disallowed by construction @@ -106,6 +108,8 @@ The two hooks have overlapping rejection sets (raw `` blocked by both, bare c If ADR-016 lands its hook before this one, the bare-camelCase and raw-`` checks live in `validate_prose_references.py` until ADR-017's hook ships and the shared tokenizer is extracted. The end state is the two-hook-shared-tokenizer split above. +The shared tokenizer's Token contract — the `Token` NamedTuple structure (including the emphasis-shape field), the `TokenKind` enumeration, the tokenizer's emission invariants, and the consumer surface — is formally specified in [ADR-028](028-prose-linter-bracket-matching-architecture.md) D1-D4. ADR-017 owns the grammar's authoring rules; ADR-028 owns the contract every consumer of the shared tokenizer reads. + ### D6. 
Redistribution contract surface Per [ADR-014](014-yaml-content-security-posture.md) P5, the framework guarantees "shape via schemas" to downstream consumers. ADR-017 is the canonical statement of "content within prose strings." After the conformance sweep closes, the contract becomes strictly: **YAML prose contains no URLs at all.** Every URL lives in a structured `externalReferences` entry (ADR-016) and is referenced from prose by sentinel. This is a stronger guarantee than "YAML prose contains some URLs you must sanitize" — a third-party redistributor parsing the YAML knows that any URL it ingests came through a typed, schema-validated structured field. diff --git a/docs/adr/028-prose-linter-bracket-matching-architecture.md b/docs/adr/028-prose-linter-bracket-matching-architecture.md new file mode 100644 index 00000000..17115052 --- /dev/null +++ b/docs/adr/028-prose-linter-bracket-matching-architecture.md @@ -0,0 +1,220 @@ +# ADR-028: Prose-linter emphasis enforcement via bracket-matching depth pass over a Token-shape contract + +**Status:** Accepted +**Date:** 2026-05-27 (Draft); 2026-05-28 (Accepted) +**Authors:** Architect agent, with maintainer review + +--- + +## Context + +[ADR-017](017-yaml-prose-authoring-subset.md) D1 admits one nesting level for `**bold**`, `*italic*`, and `_italic_`; nesting another same-family emphasis inside is rejected. D5 commits to a single shared tokenizer at `scripts/hooks/precommit/_prose_tokens.py` consumed by both `validate-yaml-prose-subset` (ADR-017) and `validate_prose_references.py` ([ADR-016](016-reference-strategy.md) D6). The grammar lives in ADR-017; the enforcement mechanism for the "one nesting level" rule does not. + +The mechanism implemented on the archive branch split the emphasis-shape decision across two layers. 
The tokenizer's non-greedy bold and italic regexes produce delimiter-bounded spans whose interior whitespace edges carry the structural signal — `**foo **` retains a trailing space because the regex closed at what the author intended as an inner open. The linter then runs a second set of regex constants against each emphasis token's `value` to recover that signal. The shape decision is encoded as a side effect at one layer and decoded at the other, with the load-bearing knowledge ("whitespace-at-edge means greedy early-close") implicit in both. The two layers must stay in sync, and a reader of either alone cannot tell that the synchronization is load-bearing. + +This split is the architectural problem ADR-028 addresses. Attempts to restructure the linter without first raising the contract failed for three reasons that are properties of the design surface, not of any specific implementation. First, a token stream that carries no structural open/close events cannot support a stack-based or counter-based walk: any algorithm that treats every emphasis token as a "push" degenerates to faux-depth, because sibling complete-emphasis spans separated by plain text (`**hello** world **goodbye**`) become indistinguishable from one emphasis nested inside another. Second, the whitespace-at-edge heuristic can be relocated freely — from a linter-side regex constant to a derived method, an inline string check, a comment, anywhere — without ever being eliminated; the load-bearing knowledge that "trailing space inside the delimiter means greedy early-close" still has to live somewhere, and whichever layer hosts it carries the synchronization burden. Third, any shape information that is meaningful only for some token kinds produces an asymmetric API (`str | None` returns, value-passthrough fallbacks, branching on kind before reading shape), which forces every consumer to handle a "doesn't apply" case at every read site. 
All three failure modes resolve to the same root cause: the `Token` contract under-specifies what consumers may assume about shape, so each refactor has to re-establish invariants that should already exist on the token. Without raising the contract, the restructure surface stays load-bearing. + +ADR-028 specifies the contract this surface needs. The `Token` NamedTuple structure, the `TokenKind` enumeration, the invariants every token stream from `tokenize()` satisfies, and the consumer surface every reader accesses are documented here as ADR-grade architecture rather than implementation detail; emphasis-shape classification is part of the contract, carried on the token rather than reconstructed downstream. The full planning inventory — alternatives, consumer audit, fixture corpus, cross-ADR alignment matrix, shape-determination rules, and 21 locked decision points — was produced at `working-plans/028-prose-linter-planning-inventory.md` (local, untracked) and is the analytical input. + +Line references in this ADR are GitHub permalinks pinned against `upstream/main` at commit `7320136`. + +## Decision + +The tokenizer emits `Token` instances carrying a `shape` field classified at emission time; the prose-subset linter walks tokens once with a depth counter and reads `shape` directly. The decision has seven components. + +### D1. Token contract + +The `Token` NamedTuple at `scripts/hooks/precommit/_prose_tokens.py` is the durable contract between the tokenizer and every consumer. Fields, in declaration order: + +| Field | Type | Semantics | +|---|---|---| +| `kind` | `TokenKind` | Token classification per D2. | +| `value` | `str` | Exact substring from the input the token covers. For INVALID_* kinds, the offending substring. | +| `shape` | `Literal["complete", "open", "close", "neutral"]` | Emphasis classification per D3. `"neutral"` for every non-emphasis token. Default `"neutral"`. | + +`shape` is declared as the **third** field with a default of `"neutral"`. 
Two-positional construction (`Token(TokenKind.X, value)`) continues to compile and yields a token with `shape="neutral"`; emission sites that classify emphasis pass `shape=` as a keyword argument. + +Equality is structural NamedTuple equality. Two tokens with identical `kind` and `value` but different `shape` values are not equal. Test code that constructs tokens manually for comparison against `tokenize()` output must either supply the matching `shape` or compare via the fixture-format projection that drops `shape` (see D4 and the fixture migration in Follow-up). + +The NamedTuple structure, defaults, and equality semantics are part of the ADR-grade specification rather than implementation detail. Future field additions (e.g., a hypothetical `source_line`) require a new ADR or an amendment here; new `shape` values likewise. + +### D2. TokenKind enumeration + +The tokenizer emits exactly the following sixteen kinds. The accept/reject column matches `_REJECTED_KINDS` in [`validate_yaml_prose_subset.py:49-62`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/precommit/validate_yaml_prose_subset.py#L49-L62) with one carve-out documented after the table. + +| TokenKind | Defining regex / character class | Accept / Reject | Semantic meaning | Defining ADR | +|---|---|---|---|---| +| `BOLD` | `\*\*(.+?)\*\*` (`re.DOTALL`) | Accept | One nesting level bold; non-greedy close. Italic inside permitted; nested `**bold**` rejected by D5. | [ADR-017](017-yaml-prose-authoring-subset.md) D1 | +| `ITALIC` | `\*(.+?)\*` or whitespace-flanked `_(.+?)_` | Accept | One nesting level italic. Two delimiters so authors can italicize text containing the other delimiter. | [ADR-017](017-yaml-prose-authoring-subset.md) D1 | +| `SENTINEL_INTRA` | inner: `(risk\|control\|component\|persona)[A-Z]\w*` | Accept | Intra-document reference sentinel. 
| [ADR-016](016-reference-strategy.md) D2 | +| `SENTINEL_REF` | inner: `ref:[A-Za-z0-9_.\-]+` | Accept | External-reference sentinel. | [ADR-016](016-reference-strategy.md) D2 | +| `TEXT` | catch-all run accumulator | Accept | Plain prose. | [ADR-017](017-yaml-prose-authoring-subset.md) D1 | +| `INVALID_HTML` | `<[A-Za-z/][^>]*>` | Reject | Any HTML tag. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_URL` | scheme-with-authority, opaque-data scheme, or markdown link | Reject | Inline URL in any form. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 / D4 rule 2 | +| `INVALID_HEADING` | `#+[^\n]*` at line start | Reject | Markdown heading. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_LIST` | `-\s`, `\*\s`, `\d+\.\s` at column 0 | Reject | Markdown list marker at column 0. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_CODE` | ```` ```...``` ```` (`re.DOTALL`) or `` `[^`]+` `` | Reject | Fenced or inline code. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_IMAGE` | `!\[[^\]]*\]\([^)]*\)` | Reject | Markdown image. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_BLOCKQUOTE` | `>[^\n]*` at line start | Reject | Markdown blockquote. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_TABLE` | `\|[^\n]*(?:\n\|$)` at line start | Reject | Markdown pipe-table row. | [ADR-017](017-yaml-prose-authoring-subset.md) D2 | +| `INVALID_FOLDED_BULLET` | `\s+\-\s[^\n]*(?:\n\|$)` at line start | Reject | Folded-bullet drift — leading-whitespace `- ` line. | [ADR-020](020-controls-schema.md) follow-up (citation defect — see D7) | +| `INVALID_CAMELCASE_ID` | `(risk\|control\|component\|persona)[A-Z]\w*` | Reject (**delegated**) | Bare entity-prefix camelCase outside a sentinel. | [ADR-017](017-yaml-prose-authoring-subset.md) D4 rule 5 / [ADR-016](016-reference-strategy.md) D6 | +| `INVALID_SENTINEL` | `\{\{ ... 
\}\}` with brace-depth scan, inner fails both sentinel forms | Reject | Structurally well-formed `{{ }}` with content matching neither sentinel grammar. | [ADR-016](016-reference-strategy.md) D2 | + +**Delegation carve-out.** `INVALID_CAMELCASE_ID` is the only rejecting kind that the prose-subset linter intentionally excludes from `_REJECTED_KINDS`. Ownership lives in `validate_prose_references.py` per ADR-017 D4 rule 5 and ADR-016 D6. A consumer that iterates `if kind.name.startswith("INVALID")` to count rejections will diverge from the prose-subset linter's behavior unless it applies the same exclusion. The carve-out is load-bearing for both consumers and any third-party reader; it is part of the contract. + +The enumeration is closed for ADR-028. Adding a new kind requires an amendment here. + +### D3. Tokenizer emission invariants + +Every token stream produced by `tokenize()` satisfies four invariants. + +1. **Partition-of-input.** `"".join(t.value for t in tokenize(text)) == text` for every input. Every character lands in exactly one token's `value`. Asserted in the test corpus at [`test_prose_tokens.py:1043-1072`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/tests/test_prose_tokens.py#L1043-L1072) (`TestMixedRuns.test_tokens_cover_full_input`). +2. **Rule-precedence ordering.** The sixteen rules apply per-character in the precedence order documented at [`_prose_tokens.py:11-27`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/precommit/_prose_tokens.py#L11-L27). Higher-priority match wins; ties do not exist because each rule anchors on a distinct character or substring. Line-anchored rules (headings, list markers, blockquotes, pipe-tables, folded-bullet drift) fire only at index 0 or immediately after `\n` per `at_line_start()`. +3. 
**Greedy / non-greedy specification.** Bold non-greedy on close; italic non-greedy on close — asterisk italic is intraword-permissive, underscore italic additionally requires whitespace-or-boundary flanking per ADR-017 D1, so an intraword `\S_\S` underscore does not qualify as a delimiter and the run consumes as TEXT; URL regexes greedy with bracketed exclusion `[^\s{]+` so a URL adjacent to a `{{` sentinel terminates at the brace boundary; fenced code non-greedy across newlines with `re.DOTALL`; sentinel scan brace-depth-aware so `{{id{{ref:x}}}}` consumes as one `INVALID_SENTINEL` token. Unclosed `{{` returns `None` from `_match_sentinel`; the caller emits the remainder as `TEXT`. +4. **Shape classification at emission.** Every emphasis token (`BOLD`, `ITALIC`) carries a `shape` value classified at emission time from leading/trailing whitespace in the matched span's interior. Every non-emphasis token carries `shape="neutral"`. The classification rules per delimiter (`**`, `*`, `_`): + +| Matched span shape | `shape` value | Rationale | +|---|---|---| +| `**foo**` — neither edge whitespace in interior | `"complete"` | Well-formed emphasis. | +| `**foo bar**` — internal whitespace only | `"complete"` | Only **edge** whitespace is the signal. | +| `**foo **` — trailing whitespace before close | `"open"` | The non-greedy regex closed early on what the author intended as an inner open. | +| `** bar**` — leading whitespace after open | `"close"` | Mirror — trailing half of the same early-close pattern. | +| `** foo **` — both edges whitespace | `"open"` | Convention: leading-whitespace test fires first; consistent with the wrapped-sentinel case `**\n{{ref:x}}\n**` rendering as `"open"`. | +| `**\n**` — interior is a single `\n` | `"open"` | `\n` counts as `isspace()`; both-edges-whitespace falls under the `"open"` convention.
| + +Edge cases the tokenizer does not emit as emphasis at all carry `shape="neutral"` like any other TEXT: `**` alone; `**foo` unclosed; `**` paired with no inner character; `****` consumed as two TEXT runs because `(.+?)` requires at least one inner character; `__bold__` consumed as TEXT per ADR-017 D1's asterisk-only-bold stance; and **intraword underscore runs like `home_bar and foo_baz` consumed as TEXT** because the underscores fail the whitespace-flanking requirement — a snake_case identifier pair is not an italic span. + +The classification is implemented as a private helper `_classify_emphasis_shape(span, delim) -> str` in `_prose_tokens.py`, invoked from the three emphasis emission sites at [`_prose_tokens.py:421-440`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/precommit/_prose_tokens.py#L421-L440) (one per delimiter family). The helper is a pure function over the matched span; it takes no global state. Emit-site call shape: `emit(TokenKind.BOLD, m.group(), shape=_classify_emphasis_shape(m.group(), "**"))`. + +The shape decision is encapsulated in the tokenizer. The linter never reads whitespace edges directly; it reads `token.shape`. + +### D4. Consumer contract + +The public surface of `_prose_tokens.py` is exactly three names: `Token`, `TokenKind`, and `tokenize()`. The leading underscore in the module name marks the module as one of the precommit-sibling internals coordinated by ADR-017 D5 and ADR-016 D6; it does **not** mark every name in the module as private. Internal helpers (`_match_sentinel`, `_classify_emphasis_shape`, `emit`, `flush_text`, `pending_text_start`, the `_RE_*` regex constants) are not part of the contract and may be reorganized without an ADR amendment. + +Stable consumer API: + +- **Field access** on tokens is stable: `token.kind`, `token.value`, `token.shape`. +- **Iteration** is stable: `for token in tokens:` and indexed access (`tokens[N]`) for positional reads. 
+- **Positional unpacking** is **not** stable. No `kind, value = token` destructuring appears in production code or tests; future contributors must not introduce it. Tuple positionality is therefore reserved for forward-compatible field additions (D1's `shape` is the first such addition). +- **Construction** is via `tokenize()`. Test code may construct `Token(...)` via keyword arguments for assertion-side comparisons; production code outside `_prose_tokens.py` itself must not construct `Token` instances directly. + +There are three known production consumers: `validate_yaml_prose_subset.py` (the prose-subset linter; reads `kind`, `value`, and `shape` per D1); `validate_prose_references.py` (the references linter per ADR-016 D6; reads `kind` and `value` only); and `_sentinel_expansion.py` (the shared sentinel expander per ADR-016 D5; reads `kind` and `value` only). The latter two consumers are unaffected by the `shape` field's introduction because they never read it; under D1's default-`"neutral"` posture, they see tokens whose new field carries the no-op value. + +The test corpus's contract is the `_tokens_to_dicts` projection at [`test_prose_tokens.py:203`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/tests/test_prose_tokens.py#L203), currently `{"kind": t.kind.name, "value": t.value}`. The projection drops `shape` and remains stable. Fixtures continue to assert two-field equality. The fixture migration (Follow-up below) is **additive**: existing emphasis-bearing fixtures stay byte-identical and implicitly assert `shape="complete"` by virtue of the tokenizer emitting that value; new fixtures cover `"open"`, `"close"`, and other shape cases without disturbing the existing corpus. + +### D5. Linter algorithm — depth-counter bracket-matching pass + +The prose-subset linter's emphasis-rejection logic in `validate_yaml_prose_subset.py` is a single pass over `field.tokens` with an integer depth counter and two one-line predicates.
The current helpers `_detect_nested_emphasis_indices` and `_is_emphasis_wrapped_sentinel`, along with the three `_RE_*_EARLY_CLOSE` and three `_RE_*_WRAPPED_SENTINEL` regex constants, are deleted. + +Algorithm (textbook bracket-matching shape): + +```text +depth = 0 +for token in field.tokens: + if token.kind in (BOLD, ITALIC): + if token.shape == "open": + if depth > 0: # already inside emphasis of same family + emit_diagnostic(...) + depth += 1 + elif token.shape == "close": + if depth > 0: # close of an unmatched open -> nested emphasis + emit_diagnostic(...) + depth = max(0, depth - 1) # floor: a standalone close-shape (e.g. " ** bar**") would underflow + elif token.shape == "complete": + if depth > 0: # nested complete-emphasis inside open emphasis + emit_diagnostic(...) + # complete = open + close, net depth change 0 + # shape == "neutral" on emphasis kinds is not emitted (see D3) + if _is_emphasis_wrapped_sentinel(token): + emit_diagnostic(...) +``` + +The two predicates: + +- **Nested-emphasis predicate.** Fires when an emphasis token is encountered with `depth > 0` and a same-family emphasis open is unmatched. Implemented as the `depth > 0` check inline in the walk above, applied on the `open`, `complete`, **and** `close` branches; the `close` branch checks `depth > 0` *before* decrementing, and is the attribution point for the canonical split-token nested case (`**foo **nested** bar**` tokenizes as `[open, text, close]`, and the `close` token at `depth == 1` is where the single diagnostic lands). The kind comparison is via `token.kind` directly (a bare counter suffices; a kind-stack is not needed for the current ADR-017 D1 rule, which detects any nesting regardless of delimiter family). 
+- **Emphasis-wrapped-sentinel predicate.** Fires when an emphasis token's interior (its `value` minus the delimiter pair) `.strip()`s to a string that fullmatches either of the tokenizer's two internal sentinel-inner regexes — `_RE_SENTINEL_INTRA_INNER` or `_RE_SENTINEL_REF_INNER` (defined in `_prose_tokens.py`). The prose-subset linter imports these `_RE_*` constants directly: a deliberate cross-module coupling permitted by D4 (the `_RE_*` constants are internal and reorganizable) and flagged with an inline coupling comment at the import site. They are **not** shared with the references linter, which resolves sentinels structurally via `_resolve_intra_sentinel` against the id-index rather than by inner-regex match. Independent of the depth state. + +Both predicates are one-line expressions over `token.shape` (or `token.kind` and `token.value`) and the depth state. There are no regex constants in the linter for shape detection. The whitespace-adjacency heuristic that previously lived in `_RE_*_EARLY_CLOSE` has been moved into the tokenizer's `_classify_emphasis_shape` per D3; the linter cannot see it and cannot drift from it. + +The walk handles the false-positive patterns a naive stack would faux-depth on. `**hello** world **goodbye**` tokenizes as `[BOLD shape="complete", TEXT shape="neutral", BOLD shape="complete"]` and walks depth `0 → 0 → 0 → 0`; no diagnostic. `**hello** world {{ref:x}}` tokenizes as `[BOLD shape="complete", TEXT, SENTINEL_REF]` and the sentinel is at `depth == 0`; no diagnostic. The prospective D6-of-ADR-017 rules ("no sentinel inside any emphasis", "no link text containing emphasis") would express as `depth > 0 and token is sentinel/link` — predicates with a real depth value, not a faux one. + +Reason strings (`_REASON_NESTED_EMPHASIS`, `_REASON_EMPHASIS_WRAPPED_SENTINEL`) and the diagnostic format are preserved per D6. 
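The depth-counter walk described in D5 can be sketched in Python. This is a minimal illustration under stated assumptions, not the linter's actual code: `Token` and `TokenKind` are trimmed stand-ins for the D1 and D2 contract, the function name `nested_emphasis_indices` is hypothetical, and the token lists are built by hand with the shapes D5 describes rather than produced by `tokenize()`. The emphasis-wrapped-sentinel predicate is left out.

```python
from enum import Enum, auto
from typing import List, Literal, NamedTuple, Sequence

Shape = Literal["complete", "open", "close", "neutral"]


class TokenKind(Enum):
    # Trimmed stand-in for the sixteen-kind enumeration in D2.
    BOLD = auto()
    ITALIC = auto()
    TEXT = auto()


class Token(NamedTuple):
    # Same field order and default as the D1 table.
    kind: TokenKind
    value: str
    shape: Shape = "neutral"


_EMPHASIS = (TokenKind.BOLD, TokenKind.ITALIC)


def nested_emphasis_indices(tokens: Sequence[Token]) -> List[int]:
    """Hypothetical helper: indices where the nested-emphasis diagnostic lands."""
    depth = 0
    hits: List[int] = []
    for index, token in enumerate(tokens):
        if token.kind not in _EMPHASIS:
            continue
        if token.shape == "open":
            if depth > 0:
                hits.append(index)
            depth += 1
        elif token.shape == "close":
            if depth > 0:  # checked before the decrement
                hits.append(index)
            depth = max(0, depth - 1)  # floor: a standalone close must not underflow
        elif token.shape == "complete":
            if depth > 0:
                hits.append(index)
            # complete = open + close, so depth is unchanged
    return hits


# Hand-built token lists (assumed shapes, following the D5 prose).
split_nested = [
    Token(TokenKind.BOLD, "**foo **", shape="open"),
    Token(TokenKind.TEXT, "nested"),
    Token(TokenKind.BOLD, "** bar**", shape="close"),
]
siblings = [
    Token(TokenKind.BOLD, "**hello**", shape="complete"),
    Token(TokenKind.TEXT, " world "),
    Token(TokenKind.BOLD, "**goodbye**", shape="complete"),
]
standalone_close = [Token(TokenKind.BOLD, "** bar**", shape="close")]

print(nested_emphasis_indices(split_nested))      # [2]
print(nested_emphasis_indices(siblings))          # []
print(nested_emphasis_indices(standalone_close))  # []
```

The split-token case attributes its single diagnostic to the `close` token, the sibling case never leaves depth 0, and the standalone `close` shows why the decrement is floored.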
+ +**Addendum (2026-05-29) — erratum.** The original D5 draft omitted the `close`-branch emit and the depth floor from the pseudocode, leaving the pseudocode inconsistent with the Nested-emphasis predicate prose (which has always covered every emphasis token, including `close`-shape) and with planning-inventory §5.2's load-bearing-case analysis of `**foo **nested** bar**` (tokens `[open, text, close]`, where the `close` token at `depth == 1` is the only attribution point). The pseudocode now (1) emits the nested-emphasis diagnostic in the `close` branch when `depth > 0`, checked before the decrement, and (2) floors the decrement with `max(0, depth - 1)` because a standalone leading-space bold classifies as `shape="close"` and would otherwise drive `depth` to `-1` (the prior `# depth-floor enforced by tokenizer invariants` comment was false). This is an erratum reconciling the pseudocode with D5's own governing predicate prose; it is **not** a new decision — §5.2 flagged "clarify the predicate" but §8 never created a corresponding locked D-Open decision, so no §8 lock changes. Status remains **Accepted**. + +**Addendum (2026-05-29) — second erratum.** The original D5 Emphasis-wrapped-sentinel predicate bullet named a single unified `SENTINEL_INNER_RE` "defined once in `_prose_tokens.py` and shared with the references linter." Both claims were false against the as-built code. No `SENTINEL_INNER_RE` exists; the tokenizer carries two internal constants, `_RE_SENTINEL_INTRA_INNER` and `_RE_SENTINEL_REF_INNER` (`_prose_tokens.py:174-175`), which the prose-subset linter imports directly with an inline coupling comment (`validate_yaml_prose_subset.py:39-45`). That direct import of internal `_RE_*` constants is exactly the coupling D4 sanctions (the `_RE_*` constants are internal and reorganizable), and a single public `SENTINEL_INNER_RE` would have been a 4th public name contradicting D4's "exactly three names" surface. 
The constants are **not** shared with the references linter, which resolves sentinels structurally via `_resolve_intra_sentinel` (prefix + id-index, dispatched on `token.kind`) rather than by inner-regex match. The predicate bullet above is corrected to describe this reality. This is a doc-accuracy erratum only: the implementation already matches the corrected text, no code changes, no §8 lock changes, and Status remains **Accepted**. + +**Addendum (2026-05-29) — third erratum.** The "are deleted" framing in the D5 prose above (the named helpers `_detect_nested_emphasis_indices` and `_is_emphasis_wrapped_sentinel`, plus the three `_RE_*_EARLY_CLOSE` and three `_RE_*_WRAPPED_SENTINEL` constants) and the matching "linter shrinks" bullet in Consequences describe a delete-and-replace diff that did not occur on this branch. This is a fresh branch off clean `upstream/main = 7320136` per D-Open-11; none of those symbols ever existed on this base. The prior emphasis-rejection layer they describe lived only on the abandoned archive branch `feature/353-c4-followons`. The as-built is therefore a greenfield addition of the depth-walk emphasis enforcement, not a reduction: the "are deleted" / "linter shrinks" language records the logical supersession of that archived design, not a diff against this branch's base. The two predicates are also independent and may both fire on a single emphasis token — a `close`-shape token at `depth > 0` whose stripped interior matches a sentinel-inner regex emits **both** the `nested emphasis` and the `emphasis-wrapped sentinel` diagnostics (e.g. `**foo **{{ref:x}}** bar**`); the wrapped-sentinel predicate runs unconditionally on every emphasis token while the nested-emphasis predicate gates on depth, so the two are orthogonal and the double-emit is intended. This is a doc-accuracy erratum only: no code change, no §8 lock change, and Status remains **Accepted**. + +### D6. 
Diagnostic conformance + +ADR-017 D4's diagnostic format spec is preserved byte-for-byte: + +```text +validate-yaml-prose-subset: ::[]: at +``` + +The two reason strings — `"nested emphasis"` and `"emphasis-wrapped sentinel"` — are unchanged. The nested-index `[][]` second-segment format introduced by PR #286 (per ADR-017 D4 line 81's "Addendum" amendment) is unchanged. The token-snippet is the same `token.value` substring the existing implementation emits. + +`TestDiagnosticFormat`, the live-corpus baseline, and third-party tooling that scrapes the linter's stderr output continue to work without modification. + +### D7. Cross-ADR alignment + +**In scope for the ADR-028 commit:** + +- **ADR-017 D5 amendment.** One additive paragraph at the end of D5 stating that the Token contract (NamedTuple structure including emphasis-shape field, TokenKind enumeration, emission invariants, consumer surface) is formally specified in ADR-028 D1-D4. Does not amend any existing rule; only adds the pointer. +- **ADR-017 D1 amendment.** One forward pointer at the end of D1 indicating that the canonical mechanism for enforcing the "no nested same-family emphasis" and "no emphasis-wrapped sentinel" rules is the depth-counter pass specified in ADR-028 D5. + +**One-way cross-references this ADR makes:** + +- [ADR-005](005-pre-commit-framework.md): `validate-yaml-prose-subset` is one of the local hooks declared under the pre-commit framework; ADR-005's hook-orchestration shape is preserved. +- [ADR-012](012-static-spa-architecture.md): the zero-dep posture for the SPA's runtime path is the prior art that motivated ADR-015's hand-rolled sanitizer and ADR-017's hand-rolled tokenizer. ADR-028 keeps the hand-rolled approach; no library dependency is introduced. +- [ADR-015](015-site-content-sanitization-invariants.md) D3: the bounded-emission property and the hand-rolled-sanitizer-with-grammar-sync rationale apply symmetrically to the lint-side tokenizer. 
Neither the emitted token kinds nor the partition-of-input invariant changes, so the site/lint grammar parity is preserved. +- [ADR-016](016-reference-strategy.md) D5 and D6: `validate_prose_references.py` and `_sentinel_expansion.py` are the two consumers of the shared tokenizer besides the prose-subset linter itself. Per D4 above, neither reads `shape`; the contract raise is transparent to them. +- [ADR-025](025-testing-strategy.md) D2 and D10: the testing chain (Testing → Code-Reviewer → SWE → Code-Reviewer) authors failing tests before implementation; the `_classify_emphasis_shape` helper and the depth-counter walk both flow through the chain. D10's wire-up requirement is satisfied by including at least one test that calls `tokenize()` and inspects `shape` on the emitted tokens (rather than only testing the classifier helper in isolation). + +**Deferred to future maintenance edits:** + +- **ADR-016 D6 amendment to mention the `shape` field's existence.** Not in scope. `validate_prose_references.py` does not read `shape`; the courtesy amendment would add scope without value. A future amendment lands when the field becomes load-bearing for the references linter. +- **ADR-020 D4 / folded-bullet citation defect.** The `INVALID_FOLDED_BULLET` kind is documented in `_prose_tokens.py` and `_REASONS` strings as belonging to ADR-020 D4, but D4 is the prose-shape decision; the folded-bullet rule's actual home is ADR-020's follow-up section. This is a pre-existing documentation defect, not load-bearing here. Cleanup lands in a separate one-line commit. + +## Alternatives Considered + +- **Status quo — regex-driven shape detection in the linter.** Three `_RE_*_EARLY_CLOSE` constants in `validate_yaml_prose_subset.py` extract the whitespace-edge signal from token values that the early-closing non-greedy bold/italic regexes encoded as side effects.
Rejected because the shape decision is split across two layers, the regex constants encode an implicit channel from the tokenizer to the linter, and three accreted commits (`a71ae47`, `817e00d`, `5e72a2a`) on the archived `feature/353-c4-followons` branch motivated the restructure in the first place. +- **Path B — derived methods `Token.delim()` and `Token.interior()`, shape inference at the linter via `interior().endswith(whitespace)`.** Stashed at `stash@{0}` on `.worktree/353-c4-followons`. Rejected on four grounds: (1) faux-depth — the kind-stack had no pop conditions, so `**hello** world **goodbye**` would false-positive as depth-2 nested; (2) the whitespace heuristic was renamed-not-removed — moved from a regex constant to a method-then-string-operation chain with the same load-bearing knowledge requirement; (3) D6 forward-compat predicates degenerated to "is there any prior emphasis in this string"; (4) asymmetric API — `delim()` returned `str | None` and `interior()` returned `value` unchanged for non-emphasis kinds, forcing every consumer to handle two different "doesn't apply" fallbacks. +- **Open/close event tokens — emit `BOLD_OPEN` and `BOLD_CLOSE` as distinct kinds.** Rejected because it breaks the partition-of-input invariant: an event token has no `value` substring of its own, or the `value` must be split across multiple tokens (the open's `**` and the close's `**`), neither of which the consumer audit and existing fixture format tolerate. BOLD and ITALIC remain single tokens covering the delimiter-bounded span; shape is a field on the token, not a kind. +- **Extend the test projection to include `shape` and annotate every emphasis-bearing fixture.** Rejected. The fixture corpus has 56 pairs; the migration would touch ~150 `shape` keys across files, most of them defaulting to `"neutral"` with low signal density. 
The existing five emphasis-bearing fixtures already implicitly lock in `shape="complete"` (because the tokenizer emits that value and the projection drops it). +- **Suppress the projection extension and assert `shape` only via per-test attribute checks, never via fixtures.** Rejected. Loses the auditability of fixture-driven shape assertions for new `"open"` / `"close"` cases. +- **Use a markdown library (`markdown-it-py`, `commonmark`).** Rejected per ADR-017 D5 line 80 and ADR-015 D3's zero-dep posture for the matching sanitizer. The grammar is small enough to hand-roll; a library brings a configuration surface larger than the replacement code. +- **Kind-stack rather than bare depth counter in the linter.** Reasonable for prospective D6-of-ADR-017 rules that would need to know the enclosing kind (e.g., "no link text containing emphasis"), but the current rule set — ADR-017 D1's "no nested same-family emphasis" plus the wrapped-sentinel check — needs only a depth value; kind comparison happens via `token.kind` directly on the emphasis token under inspection. A kind-stack adds zero current capability and complicates the algorithm's textbook shape. If a future rule needs the enclosing-kind dimension, the stack arrives then. + +## Consequences + +**Positive** + +- **One shape decision, encapsulated in the tokenizer.** The whitespace-adjacency heuristic exists in exactly one place — `_classify_emphasis_shape` — and runs on the matched span before the token is appended to the stream. The linter sees a four-value enum and dispatches on it; future contributors do not need to know that whitespace-at-edge is load-bearing to read the linter. +- **Faux-depth false positives are eliminated by construction.** `**hello** world **goodbye**` tokenizes as two `shape="complete"` BOLDs, walks depth `0 → 0`, and emits no diagnostic. Prospective D6-of-ADR-017 rules can use `depth > 0` with confidence that the depth value is structurally meaningful. 
+- **The Token API is symmetric.** Every token carries `shape: Literal[…]`; there is no `None` fallback to branch on. Consumers that need shape read it; consumers that do not read it (`validate_prose_references.py`, `_sentinel_expansion.py`) are unaffected by its presence. +- **The contract is ADR-grade.** The Token NamedTuple, TokenKind enumeration, tokenizer emission invariants, and consumer surface are specified here. Future refactors of the tokenizer-internal helpers (`_match_sentinel`, `_classify_emphasis_shape`, the `_RE_*` constants) need no ADR amendment; surface changes (new fields, new kinds, new emission invariants) do. +- **The linter shrinks.** Three `_RE_*_EARLY_CLOSE` constants, three `_RE_*_WRAPPED_SENTINEL` constants, `_detect_nested_emphasis_indices`, and `_is_emphasis_wrapped_sentinel` are deleted from `validate_yaml_prose_subset.py`. The replacement is a single-pass walk with two one-line predicates over `token.shape` and the depth state. +- **Diagnostic format is preserved.** The existing test corpus (including `TestDiagnosticFormat` and the live-corpus baseline) and any third-party tooling that scrapes the linter's stderr output continue to work without modification. + +**Negative** + +- **The classifier helper is now part of the contract.** `_classify_emphasis_shape` is internal to `_prose_tokens.py` per D4, but its behavior — the both-edges-whitespace `"open"` convention, the leading-whitespace-first ordering — is observable through the `shape` field on emphasis tokens. A future change to the convention is a tokenizer change; downstream tests assert against the values it emits. +- **Two BOLD tokens with different `shape` are not structurally equal.** D1's NamedTuple equality posture means test code that constructs a token manually for comparison against `tokenize()` output must supply the correct `shape` or compare via the projection that drops `shape`. The TDD chain's reviewers carry this as a checklist item. 
+- **The `shape` field name is generic.** A future ADR adding (e.g.) `position` or `category` to the NamedTuple may force a rename to `emphasis_shape` for clarity. The generic name is accepted on the basis that the field's semantics are documented in D1. +- **The underscore-italic tightening (D3 invariant 3) is content-visible.** Snake_case identifier pairs like `home_bar and foo_baz` no longer tokenize as italic — a corpus-observable shift even though no new rejection is added. The corpus impact probe (Follow-up below) runs before Status flips from Draft to Accepted; if surprises surface, the rule is revisited. +- **Authors who copy prose containing `_X Y_` constructs from upstream where they were intended as italic must now flank with whitespace explicitly.** The friction is intentional; the false-positive case it eliminates is the load-bearing reason. + +**Follow-up** + +- **Underscore-italic corpus diff — the lockdown validation gate.** Before maintainer flips Status: Draft → Accepted, a one-off script runs both the current `_RE_ITALIC_UNDERSCORE` and the proposed whitespace-flanked variant against all four content YAMLs (`risks.yaml`, `controls.yaml`, `components.yaml`, `personas.yaml`) and emits a diff listing every prose field whose tokenization changes (e.g., `::: was ITALIC("_X Y_"), now TEXT`). Two outcomes: diff is empty or only matches expected snake_case false positives → lock confirmed; diff includes authorially-intended italic spans → maintainer reopens the underscore decision before the ADR is Accepted. **Resolved 2026-05-28:** probe run over all 569 prose fields in the four content YAMLs reported **zero** tokenization changes (first outcome). The probe was validated as non-blind — `home_bar and foo_baz` flips ITALIC→TEXT as designed while genuine whitespace-flanked italics and `__double__` are preserved. D-Open-21 lock confirmed; Status flipped Draft → Accepted. 
+- **Triple-asterisk tokenizer probe.** Captures the tokenizer's output for `***foo***`, `****`, `*****foo*****`, `**foo***`, `***foo**`, and `***`. Feeds the classifier test fixtures. Runs as part of the testing agent's RED phase, parallel to the SWE implementation. +- **Live-corpus regression baseline.** Captures the current zero-diagnostic state of `validate-yaml-prose-subset` against the four content YAMLs as a `pytest.mark.live_corpus` test. Gates the post-migration regression check; the new linter must produce the same zero-diagnostic result. +- **Fixture migration — additive only.** Add 6-10 new emphasis-shape fixtures under (likely) `scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/` or equivalent, covering `shape="open"`, `shape="close"`, depth-3 nested input asserted multi-token, and italic variants per delimiter. The existing five emphasis-bearing fixtures stay byte-identical; they implicitly assert `shape="complete"` via the tokenizer's emission and the projection's drop-of-`shape`. The `_tokens_to_dicts` projection at [`test_prose_tokens.py:203`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/tests/test_prose_tokens.py#L203) stays at `{"kind", "value"}`; tests that need to assert `shape` use per-test attribute checks. +- **Stale corpus-size comments.** The module docstring at [`_prose_tokens.py:266`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/precommit/_prose_tokens.py#L266) references "42 grammar cases"; the test header at [`test_prose_tokens.py:90-97`](https://github.com/cosai-oasis/secure-ai-tooling/blob/7320136/scripts/hooks/tests/test_prose_tokens.py#L90-L97) references "55 fixture-parametrized pairs". The current corpus is 56 pairs (with the additive emphasis-shape fixtures, 62-66). Cleanup lands with the fixture migration commit. 
+- **Future kind additions.** Adding a new TokenKind member (e.g., for a hypothetical fourth allowed authoring token) is an amendment to D2 here, plus the corresponding ADR-017 D1 amendment. The contract is ADR-grade, so future additions surface as ADR work, not implementation drift. +- **Future shape values.** Adding a fifth `shape` value (e.g., for cross-delimiter italic-in-italic detection per ADR-017 D1 line 36's deferred case) is an amendment to D1 and D3 here. The cross-delimiter case continues to tokenize as a single ITALIC token because the outer match consumes the inner. diff --git a/docs/adr/README.md b/docs/adr/README.md index dc1327ae..8a0a9ea3 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -43,6 +43,7 @@ If a decision is about *how the Risk Map content model is shaped*, it belongs in | [025](025-testing-strategy.md) | Testing strategy and posture across Python, site JS, schemas, and infrastructure | Accepted | 2026-05-05 | | [026](026-issue-template-domain.md) | Issue-template domain — generator scope, schema-derived enums, and ADR-content alignment contract | Accepted | 2026-05-20 | | [027](027-framework-versioning-and-mapping-convention.md) | Per-mapping framework version pinning | Accepted | 2026-05-22 | +| [028](028-prose-linter-bracket-matching-architecture.md) | Prose-linter emphasis enforcement via bracket-matching depth pass over a Token-shape contract | Accepted | 2026-05-28 | ## Conventions diff --git a/pyproject.toml b/pyproject.toml index 17b7a1eb..ecfc3239 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -2,6 +2,7 @@ pythonpath = ["scripts/hooks"] markers = [ "slow: marks tests as slow (deselect with '-m \"not slow\"')", + "live_corpus: marks tests that read the live risk-map YAML corpus (deselect with '-m \"not live_corpus\"')", ] [tool.coverage.run] diff --git a/scripts/hooks/precommit/_prose_tokens.py b/scripts/hooks/precommit/_prose_tokens.py index 7d149dba..b31f1eeb 100644 --- a/scripts/hooks/precommit/_prose_tokens.py 
+++ b/scripts/hooks/precommit/_prose_tokens.py @@ -38,7 +38,7 @@ import re from enum import Enum -from typing import NamedTuple +from typing import Literal, NamedTuple class TokenKind(Enum): @@ -89,10 +89,15 @@ class Token(NamedTuple): Attributes: kind: The token's classification (accepting or rejecting). value: The exact substring from the input that this token covers. + shape: Emphasis classification per ADR-028 D3. 'neutral' for every + non-emphasis token. One of 'complete', 'open', 'close', 'neutral'. + Default 'neutral' so two-positional construction Token(kind, value) + still compiles and yields a token with shape='neutral'. """ kind: TokenKind value: str + shape: Literal["complete", "open", "close", "neutral"] = "neutral" # --------------------------------------------------------------------------- @@ -177,8 +182,18 @@ class Token(NamedTuple): _RE_ITALIC_ASTERISK = re.compile(r"\*(.+?)\*", re.DOTALL) # Italic underscore: _..._ — single underscore only; __ is NOT italic (ADR-017 D1). -# Lookahead/lookbehind prevent matching when adjacent to another underscore. -_RE_ITALIC_UNDERSCORE = re.compile(r"(? Literal["complete", "open", "close", "neutral"]: + """Classify the emphasis shape of a matched span by examining interior edge whitespace. + + The tokenizer calls this at emission time for BOLD, ITALIC-asterisk, and + ITALIC-underscore tokens. The shape drives the depth-counter walk in the + prose-subset linter (ADR-028 D5). + + Rules (ADR-028 D3 table): + - Both edges whitespace -> 'open' (convention: leading test fires first) + - Trailing whitespace only -> 'open' (greedy close on intended inner open) + - Leading whitespace only -> 'close' (trailing half of an early-closed match) + - Neither edge whitespace -> 'complete' (well-formed span) + - Empty interior -> 'neutral' (defensive; not emitted in practice) + + Uses str.isspace() — no regex (ADR-028 D-Open-6). + + Args: + span: The full matched span including delimiters (e.g. '**foo **'). 
+ delim: The delimiter string ('**', '*', or '_'). + + Returns: + One of 'complete', 'open', 'close', 'neutral'. + """ + interior = span[len(delim) : -len(delim)] + if not interior: + return "neutral" + leading_ws = interior[0].isspace() + trailing_ws = interior[-1].isspace() + if leading_ws and trailing_ws: + return "open" + if leading_ws: + return "close" + if trailing_ws: + return "open" + return "complete" + + # --------------------------------------------------------------------------- # Sentinel helper # --------------------------------------------------------------------------- @@ -263,7 +320,7 @@ def tokenize(text: str) -> list[Token]: The `text` argument is expected to be a single prose field value as decoded by PyYAML — not raw YAML, not a file path. - Test fixtures for all 42 grammar cases live at: + Test fixtures live at: scripts/hooks/tests/fixtures/prose_subset/ Args: @@ -286,11 +343,11 @@ def flush_text(end: int) -> None: tokens.append(Token(TokenKind.TEXT, text[pending_text_start:end])) pending_text_start = -1 - def emit(kind: TokenKind, value: str) -> None: - """Flush any pending TEXT, then emit the given token.""" + def emit(kind: TokenKind, value: str, *, shape: str = "neutral") -> None: + """Flush any pending TEXT, then emit the given token with the given shape.""" nonlocal i flush_text(i) - tokens.append(Token(kind, value)) + tokens.append(Token(kind, value, shape)) i += len(value) def at_line_start() -> bool: @@ -422,21 +479,21 @@ def at_line_start() -> bool: if ch == "*" and i + 1 < len(text) and text[i + 1] == "*": m = _RE_BOLD.match(text, i) if m: - emit(TokenKind.BOLD, m.group()) + emit(TokenKind.BOLD, m.group(), shape=_classify_emphasis_shape(m.group(), "**")) continue # --- Rule 13: Italic asterisk *...* --- if ch == "*": m = _RE_ITALIC_ASTERISK.match(text, i) if m: - emit(TokenKind.ITALIC, m.group()) + emit(TokenKind.ITALIC, m.group(), shape=_classify_emphasis_shape(m.group(), "*")) continue # --- Rule 14: Italic underscore _..._ 
(single underscore only) --- if ch == "_": m = _RE_ITALIC_UNDERSCORE.match(text, i) if m: - emit(TokenKind.ITALIC, m.group()) + emit(TokenKind.ITALIC, m.group(), shape=_classify_emphasis_shape(m.group(), "_")) continue # --- Rule 15: Bare camelCase entity-prefix identifier --- diff --git a/scripts/hooks/precommit/validate_yaml_prose_subset.py b/scripts/hooks/precommit/validate_yaml_prose_subset.py index 09acff0f..34ad1bb5 100644 --- a/scripts/hooks/precommit/validate_yaml_prose_subset.py +++ b/scripts/hooks/precommit/validate_yaml_prose_subset.py @@ -30,7 +30,20 @@ from precommit._linter_types import Diagnostic, ProseField, format_diagnostic_line # noqa: E402 from precommit._prose_fields import find_prose_fields # noqa: E402 -from precommit._prose_tokens import TokenKind # noqa: E402 +from precommit._prose_tokens import ( # noqa: E402 + _RE_SENTINEL_INTRA_INNER, + _RE_SENTINEL_REF_INNER, + TokenKind, +) + +# Deliberate cross-module coupling: _RE_SENTINEL_INTRA_INNER and +# _RE_SENTINEL_REF_INNER are internal to _prose_tokens (leading-underscore per +# ADR-028 D4). The wrapped-sentinel predicate (ADR-028 D5) reuses them directly +# so the linter's notion of a "sentinel" cannot drift from the tokenizer's own +# classification. They are NOT promoted to public constants — ADR-028 D4 fixes +# the public surface of _prose_tokens at exactly Token, TokenKind, and tokenize(); +# a consumer importing these _RE_* names accepts the reorganization-coupling risk +# that D4 describes. # Re-export so callers can import ProseField and Diagnostic from this module # (the test suite imports both from here, not from _linter_types). @@ -87,6 +100,69 @@ "If you added a new INVALID_* kind to _REJECTED_KINDS, add its reason to _REASONS too." ) +# Reason strings for emphasis violations (ADR-028 D6). These are stable +# constants; any change requires a D6 amendment. 
+_REASON_NESTED_EMPHASIS = "nested emphasis" +_REASON_EMPHASIS_WRAPPED_SENTINEL = "emphasis-wrapped sentinel" + +# The two emphasis token kinds; used in the depth-counter walk (ADR-028 D5). +_EMPHASIS_KINDS: frozenset[TokenKind] = frozenset({TokenKind.BOLD, TokenKind.ITALIC}) + + +def _is_emphasis_wrapped_sentinel(token_value: str, delim: str) -> bool: + """Return True if the emphasis token wraps exactly one sentinel. + + Strips the emphasis delimiter pair from token_value, .strip()s whitespace, + then checks whether the result is a `{{ }}` span whose inner content + fullmatches either the intra-doc or ref sentinel inner regex. + + This mirrors how _match_sentinel classifies sentinels: outer {{ }} are + stripped first, then the inner content is matched against the patterns. + + Args: + token_value: The full emphasis token value including delimiters. + delim: The delimiter string ('**', '*', or '_'). + + Returns: + True if the stripped interior is a well-formed sentinel. + """ + interior = token_value[len(delim) : -len(delim)].strip() + # Interior must be wrapped in {{ }} to be a sentinel form. + if not (interior.startswith("{{") and interior.endswith("}}")): + return False + inner = interior[2:-2] + return bool(_RE_SENTINEL_INTRA_INNER.fullmatch(inner) or _RE_SENTINEL_REF_INNER.fullmatch(inner)) + + +def _delim_for_token(token_value: str) -> str: + """Return the delimiter prefix for an emphasis token value. + + Inspects the leading characters to distinguish '**' (BOLD) from '*' (ITALIC + asterisk) from '_' (ITALIC underscore). Called only on BOLD/ITALIC tokens, + whose values always start with one of those delimiters. + + Args: + token_value: The full token value string (a BOLD or ITALIC token). + + Returns: + The delimiter string: '**', '*', or '_'. + + Raises: + ValueError: if token_value does not start with '**', '*', or '_'. 
The + helper fails loud rather than guessing a delimiter, so a future + emphasis kind that reaches it with an unhandled delimiter surfaces + immediately instead of silently mis-slicing the token interior. + """ + if token_value.startswith("**"): + return "**" + if token_value.startswith("*"): + return "*" + if token_value.startswith("_"): + return "_" + raise ValueError( + f"_delim_for_token expects a BOLD/ITALIC token value starting with '**', '*', or '_'; got {token_value!r}" + ) + def check_prose_field(field: ProseField) -> list[Diagnostic]: """Check one ProseField against the ADR-017 D4 grammar rejection rules. @@ -96,6 +172,9 @@ def check_prose_field(field: ProseField) -> list[Diagnostic]: — ADR-017 D4 rule 5 delegates bare-camelCase rejection to validate_prose_references. + Also runs the ADR-028 D5 depth-counter emphasis-rejection walk, emitting + diagnostics for nested emphasis and emphasis-wrapped sentinels. + Args: field: A ProseField with tokens already populated by tokenize(). @@ -103,14 +182,8 @@ def check_prose_field(field: ProseField) -> list[Diagnostic]: List of Diagnostic objects (empty if the field is clean). """ diagnostics: list[Diagnostic] = [] - for token in field.tokens: - if token.kind not in _REJECTED_KINDS: - continue - base_reason = _REASONS[token.kind] - # ADR-017 D4: append the offending token value as a snippet for context. - # Only append when token.value is non-empty (tokenizer guarantees this, - # but guard defensively to avoid "at ''" in edge cases). 
- reason = f"{base_reason} at {token.value!r}" if token.value else base_reason + + def _emit_diag(reason: str) -> None: diagnostics.append( Diagnostic( hook_id=_HOOK_ID, @@ -122,6 +195,53 @@ def check_prose_field(field: ProseField) -> list[Diagnostic]: nested_index=field.nested_index, ) ) + + # --- INVALID_* token rejection (ADR-017 D4) --- + for token in field.tokens: + if token.kind not in _REJECTED_KINDS: + continue + base_reason = _REASONS[token.kind] + # ADR-017 D4: append the offending token value as a snippet for context. + # Only append when token.value is non-empty (tokenizer guarantees this, + # but guard defensively to avoid "at ''" in edge cases). + reason = f"{base_reason} at {token.value!r}" if token.value else base_reason + _emit_diag(reason) + + # --- ADR-028 D5 depth-counter emphasis walk --- + # Single pass over the token stream with a bare integer depth counter. + # Emphasis tokens with shape='open' increment depth; 'close' decrements. + # Any emphasis token arriving at depth > 0 is a nested-emphasis violation. + # The wrapped-sentinel predicate is independent of depth state. + depth = 0 + for token in field.tokens: + if token.kind not in _EMPHASIS_KINDS: + continue + + # Nested-emphasis predicate (ADR-028 D5). + if token.shape == "open": + if depth > 0: + _emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}") + depth += 1 + elif token.shape == "close": + # Check before decrementing: the close token is the one arriving + # at depth > 0 in the canonical [open, text, close] stream + # (e.g. **foo **nested** bar**), so it is the attribution point for + # the single nested-emphasis diagnostic. ADR-028 D5 (as amended + # 2026-05-29) emits in the close branch when depth > 0, before the + # decrement. 
+ if depth > 0: + _emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}") + depth = max(0, depth - 1) + elif token.shape == "complete": + if depth > 0: + _emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}") + # complete = open + close, net depth change 0 + + # Emphasis-wrapped-sentinel predicate (independent of depth state). + delim = _delim_for_token(token.value) + if _is_emphasis_wrapped_sentinel(token.value, delim): + _emit_diag(f"{_REASON_EMPHASIS_WRAPPED_SENTINEL} at {token.value!r}") + return diagnostics diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.tokens.json new file mode 100644 index 00000000..d6056767 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.tokens.json @@ -0,0 +1 @@ +[{"kind": "BOLD", "value": "** foo **"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.txt b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.txt new file mode 100644 index 00000000..1619c8c3 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_both_edges_whitespace.txt @@ -0,0 +1 @@ +** foo ** \ No newline at end of file diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.tokens.json new file mode 100644 index 00000000..47427de5 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.tokens.json @@ -0,0 +1 @@ +[{"kind": "BOLD", "value": "** bar**"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.txt 
b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.txt new file mode 100644 index 00000000..0aa56ec1 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_close.txt @@ -0,0 +1 @@ +** bar** \ No newline at end of file diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.tokens.json new file mode 100644 index 00000000..e53ba84e --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.tokens.json @@ -0,0 +1 @@ +[{"kind": "BOLD", "value": "**foo **"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.txt b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.txt new file mode 100644 index 00000000..e7134b12 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_open.txt @@ -0,0 +1 @@ +**foo ** \ No newline at end of file diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.tokens.json new file mode 100644 index 00000000..e05ff617 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.tokens.json @@ -0,0 +1 @@ +[{"kind": "BOLD", "value": "**x **"}, {"kind": "TEXT", "value": "y"}, {"kind": "BOLD", "value": "**{{ref:x}}**"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.txt b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.txt new file mode 100644 index 00000000..9328e3d4 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/bold_wraps_sentinel_nested.txt @@ -0,0 +1 @@ +**x 
**y**{{ref:x}}** \ No newline at end of file diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.tokens.json new file mode 100644 index 00000000..4711774f --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.tokens.json @@ -0,0 +1 @@ +[{"kind": "ITALIC", "value": "*foo *"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.txt b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.txt new file mode 100644 index 00000000..c9628810 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/italic_asterisk_open.txt @@ -0,0 +1 @@ +*foo * \ No newline at end of file diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.tokens.json b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.tokens.json new file mode 100644 index 00000000..6951e13f --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.tokens.json @@ -0,0 +1 @@ +[{"kind": "BOLD", "value": "**foo **"}, {"kind": "TEXT", "value": "nested"}, {"kind": "BOLD", "value": "** bar**"}] diff --git a/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.txt b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.txt new file mode 100644 index 00000000..60fdb6f2 --- /dev/null +++ b/scripts/hooks/tests/fixtures/prose_subset/accepting/emphasis_shapes/nested_bold_three_token.txt @@ -0,0 +1 @@ +**foo **nested** bar** \ No newline at end of file diff --git a/scripts/hooks/tests/test_prose_tokens.py b/scripts/hooks/tests/test_prose_tokens.py index e872253b..e4cf5712 100644 --- 
a/scripts/hooks/tests/test_prose_tokens.py +++ b/scripts/hooks/tests/test_prose_tokens.py @@ -87,8 +87,9 @@ Test Summary ============ -Total fixture-parametrized pairs: 55 +Total fixture-parametrized pairs: 62 - accepting/: 7 fixture pairs (inc. double_underscore_not_bold) +- accepting/emphasis_shapes/: 6 fixture pairs (ADR-028 D-Open-18, Path 3b) - sentinels/: 7 fixture pairs - rejecting/: 16 fixture pairs (existing) + 14 new URL fixture pairs (commit 5) - folded_bullets/: 2 fixture pairs @@ -345,6 +346,23 @@ def test_tokenize_returns_list(self): "accepting/double_underscore_not_bold", ] +# ADR-028 D-Open-18 (Path 3b): additive emphasis-shape fixtures. These lock the +# tokenizer's token STREAM (kind + value) for the open / close / both-edges / +# nested emphasis inputs the shape classifier (ADR-028 D3) tags. The fixture +# projection drops `shape` per D-Open-20, so shape itself is asserted directly +# in the classifier tests; these fixtures lock the underlying greedy-match +# stream the classifier and the depth-counter linter depend on (e.g. that +# `**foo **nested** bar**` splits into [BOLD, TEXT, BOLD], the precondition for +# the nested-emphasis diagnostic). +_EMPHASIS_SHAPE_FIXTURES = [ + "accepting/emphasis_shapes/bold_open", + "accepting/emphasis_shapes/bold_close", + "accepting/emphasis_shapes/bold_both_edges_whitespace", + "accepting/emphasis_shapes/italic_asterisk_open", + "accepting/emphasis_shapes/nested_bold_three_token", + "accepting/emphasis_shapes/bold_wraps_sentinel_nested", +] + _SENTINEL_FIXTURES = [ "sentinels/intra_risk", "sentinels/intra_control", @@ -457,6 +475,31 @@ def test_no_invalid_tokens(self, fixture_path: str): assert token.kind not in invalid_kinds, f"Fixture {fixture_path!r}: unexpected INVALID token {token!r}" +class TestEmphasisShapeFixtures: + """ + Verify the token stream for emphasis-shape inputs (ADR-028 D-Open-18, Path 3b). 
+ + Given: an emphasis input exercising open / close / both-edges / nested shapes + When: tokenize() is called + Then: the stream matches the fixture's .tokens.json exactly (kind + value). + + These fixtures lock the greedy-match behaviour the shape classifier and the + depth-counter linter depend on; `shape` is dropped by the fixture projection + (D-Open-20) and asserted directly in the classifier tests. + """ + + @pytest.mark.parametrize("fixture_path", _EMPHASIS_SHAPE_FIXTURES) + def test_emphasis_shape_token_stream(self, fixture_path: str): + """ + Given: input from accepting/emphasis_shapes/.txt + When: tokenize() is called + Then: output matches the fixture's .tokens.json exactly (kind + value) + """ + input_text, expected = _load_fixture_pair(fixture_path) + result = _tokens_to_dicts(tokenize(input_text)) + assert result == expected, f"Fixture {fixture_path!r}: expected {expected!r}, got {result!r}" + + class TestSentinels: """ Verify sentinel tokenisation for both intra-document ({{riskXxx}}, {{controlXxx}}, @@ -1932,3 +1975,614 @@ def test_newline_only(self): assert not token.kind.name.startswith("INVALID"), ( f"Bare newline should not produce INVALID token, got {token!r}" ) + + +# =========================================================================== +# TestTripleAsteriskProbe (Spike S1 — descriptive locks) +# =========================================================================== +# These tests lock the ground-truth token streams for six triple-asterisk +# inputs. They record the tokenizer's current output (emphasis emission +# sites are extended, not changed). +# Feeds into TestEmphasisShapeClassification fixture grounding. +# =========================================================================== + + +class TestTripleAsteriskProbe: + r""" + Lock current tokenizer output for triple-/multi-asterisk edge-case inputs. 
+ + ADR-028 §9.3 Spike S1 — run empirically before classifier tests are written + to ensure the _classify_emphasis_shape helper handles each case consistently. + + All assertions are descriptive: they record the observed stream. + """ + + def test_triple_star_foo_tokenizes_as_bold_plus_trailing_text(self): + """ + Given: '***foo***' + When: tokenize() is called + Then: BOLD('***foo**') + TEXT('*') + + The non-greedy _RE_BOLD closes at the first '**' after open, consuming + '***foo**' (interior='*foo'). The final lone '*' is TEXT. + Shape: the BOLD interior '*foo' has no edge whitespace -> + _classify_emphasis_shape yields 'complete'. + """ + _require_module() + tokens = tokenize("***foo***") + assert len(tokens) == 2, f"Expected 2 tokens, got {tokens!r}" + assert tokens[0].kind == TokenKind.BOLD + assert tokens[0].value == "***foo**" + assert tokens[1].kind == TokenKind.TEXT + assert tokens[1].value == "*" + + def test_four_stars_tokenizes_as_italic_plus_trailing_text(self): + """ + Given: '****' + When: tokenize() is called + Then: ITALIC('***') + TEXT('*') + + Rule 12 (bold) checks ch=='*' and text[i+1]=='*', tries _RE_BOLD which + needs at least one inner char between '**...**'. '****' has interior='' + which fails (.+?). Rule 13 (italic asterisk) then fires on the lone '*' + and _RE_ITALIC_ASTERISK matches '***' (interior='*'). The remaining '*' + is TEXT. + Shape: ITALIC interior '*' has no edge whitespace -> 'complete'. + """ + _require_module() + tokens = tokenize("****") + assert len(tokens) == 2, f"Expected 2 tokens, got {tokens!r}" + assert tokens[0].kind == TokenKind.ITALIC + assert tokens[0].value == "***" + assert tokens[1].kind == TokenKind.TEXT + assert tokens[1].value == "*" + + def test_five_star_foo_tokenizes_as_bold_text_bold(self): + """ + Given: '*****foo*****' + When: tokenize() is called + Then: BOLD('*****') + TEXT('foo') + BOLD('*****') + + _RE_BOLD non-greedy: matches '**' open, then (.+?) closes at first '**'. 
+ Given '*****foo*****', opening at i=0, the first '**' close is at i=3 + (positions 3-4), consuming '*****' (interior='*'). TEXT('foo'). + Then BOLD('*****') again. + Shape: interior '*' has no edge whitespace -> 'complete'. + """ + _require_module() + tokens = tokenize("*****foo*****") + assert len(tokens) == 3, f"Expected 3 tokens, got {tokens!r}" + assert tokens[0].kind == TokenKind.BOLD + assert tokens[0].value == "*****" + assert tokens[1].kind == TokenKind.TEXT + assert tokens[1].value == "foo" + assert tokens[2].kind == TokenKind.BOLD + assert tokens[2].value == "*****" + + def test_bold_then_one_star_tokenizes_as_bold_plus_text(self): + """ + Given: '**foo***' + When: tokenize() is called + Then: BOLD('**foo**') + TEXT('*') + + Standard bold match; trailing lone '*' is TEXT. + Shape: interior 'foo' has no edge whitespace -> 'complete'. + """ + _require_module() + tokens = tokenize("**foo***") + assert len(tokens) == 2, f"Expected 2 tokens, got {tokens!r}" + assert tokens[0].kind == TokenKind.BOLD + assert tokens[0].value == "**foo**" + assert tokens[1].kind == TokenKind.TEXT + assert tokens[1].value == "*" + + def test_one_star_then_bold_tokenizes_as_single_bold(self): + """ + Given: '***foo**' + When: tokenize() is called + Then: BOLD('***foo**') — single token + + _RE_BOLD opens at '**' (positions 0-1), (.+?) matches '*foo' (leading + '*' is inner content), closes at '**' (positions 6-7). Full span is + '***foo**'. Interior is '*foo'; no edge whitespace -> 'complete'. + Shape: 'complete'. + """ + _require_module() + tokens = tokenize("***foo**") + assert len(tokens) == 1, f"Expected 1 token, got {tokens!r}" + assert tokens[0].kind == TokenKind.BOLD + assert tokens[0].value == "***foo**" + + def test_three_stars_alone_tokenizes_as_italic(self): + """ + Given: '***' + When: tokenize() is called + Then: ITALIC('***') — single token + + Bold rule fails (no closing '**' after inner content).
Italic-asterisk + rule succeeds: _RE_ITALIC_ASTERISK matches '*...*' = '***' (interior='*'). + Shape: interior '*' has no edge whitespace -> 'complete'. + """ + _require_module() + tokens = tokenize("***") + assert len(tokens) == 1, f"Expected 1 token, got {tokens!r}" + assert tokens[0].kind == TokenKind.ITALIC + assert tokens[0].value == "***" + + +# =========================================================================== +# TestEmphasisShapeClassification — ADR-028 D3 shape classifier +# =========================================================================== +# These tests import _classify_emphasis_shape from precommit._prose_tokens. +# The import block below uses the same lazy-guard pattern as the module's +# top-level _IMPORT_ERROR guard: a collection-time import failure sets +# _CLASSIFY_IMPORT_ERROR and each test calls _require_classify() to fail +# with a clear message rather than a collection crash. +# =========================================================================== + +_CLASSIFY_IMPORT_ERROR: ImportError | None = None +try: + from precommit._prose_tokens import _classify_emphasis_shape # noqa: E402 +except ImportError as _ce: + _CLASSIFY_IMPORT_ERROR = _ce + _classify_emphasis_shape = None # type: ignore[assignment] + + +def _require_classify() -> None: + """Fail with assertion if _classify_emphasis_shape could not be imported.""" + if _CLASSIFY_IMPORT_ERROR is not None: + pytest.fail( + f"_classify_emphasis_shape not importable from precommit._prose_tokens.\n" + f"Original error: {_CLASSIFY_IMPORT_ERROR}\n" + "This indicates _classify_emphasis_shape is missing from precommit._prose_tokens." + ) + + +class TestEmphasisShapeClassification: + r""" + Cell-grid tests for _classify_emphasis_shape(span, delim) -> str. + + ADR-028 D3 shape rules, D-Open-16 (12-14 cases), D-Open-7 (both-edges -> 'open'). + + Grid: delimiter {'**', '*', '_'} x shape {'complete', 'open', 'close'} + plus edge cases from Spike S1 (triple-asterisk inputs). 
+ + No `neutral`-return case is tested because the classifier only runs on + matched emphasis spans; non-emphasis tokens carry shape='neutral' by + construction in the tokenizer (not via the classifier). + + Shape rules (from ADR-028 D3 table): + - Neither edge has whitespace in interior -> 'complete' + - Only trailing whitespace in interior -> 'open' + - Only leading whitespace in interior -> 'close' + - Both edges whitespace -> 'open' (convention, D-Open-7) + - Single whitespace-only interior (\n) -> 'open' + """ + + # --- BOLD '**' delimiter --- + + def test_bold_complete_no_edge_whitespace(self): + """ + Given: span='**foo**', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'complete' + + Interior is 'foo'; neither first nor last char is whitespace. + """ + _require_classify() + assert _classify_emphasis_shape("**foo**", "**") == "complete" + + def test_bold_complete_with_internal_space_not_at_edge(self): + """ + Given: span='**foo bar**', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'complete' + + Interior is 'foo bar'; edge chars are 'f' and 'r' — neither is whitespace. + Only edge whitespace is the signal (ADR-028 D3). + """ + _require_classify() + assert _classify_emphasis_shape("**foo bar**", "**") == "complete" + + def test_bold_open_trailing_whitespace(self): + """ + Given: span='**foo **', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'open' + + Interior is 'foo '; last char is space -> trailing edge whitespace -> + the non-greedy regex closed on what the author intended as an inner open. + """ + _require_classify() + assert _classify_emphasis_shape("**foo **", "**") == "open" + + def test_bold_close_leading_whitespace(self): + """ + Given: span='** bar**', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'close' + + Interior is ' bar'; first char is space -> leading edge whitespace -> + this is the trailing half of an early-closed greedy match. 
+ """ + _require_classify() + assert _classify_emphasis_shape("** bar**", "**") == "close" + + def test_bold_both_edges_whitespace_is_open(self): + """ + Given: span='** foo **', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'open' + + Interior is ' foo '; both edges whitespace. D-Open-7 convention: + the both-edges case resolves to 'open', not 'close'. Consistent with the + wrapped-sentinel case '**\\n{{ref:x}}\\n**' classifying as 'open'. + """ + _require_classify() + assert _classify_emphasis_shape("** foo **", "**") == "open" + + def test_bold_newline_interior_is_open(self): + """ + Given: span='**\\n**', delim='**' + When: _classify_emphasis_shape is called + Then: returns 'open' + + Interior is '\\n'; \\n.isspace() is True -> both-edges convention + fires (both leading and trailing are the same whitespace char) -> 'open'. + ADR-028 D3 table: '**\\n**' -> 'open'. + """ + _require_classify() + assert _classify_emphasis_shape("**\n**", "**") == "open" + + # --- ITALIC asterisk '*' delimiter --- + + def test_italic_asterisk_complete(self): + """ + Given: span='*foo*', delim='*' + When: _classify_emphasis_shape is called + Then: returns 'complete' + """ + _require_classify() + assert _classify_emphasis_shape("*foo*", "*") == "complete" + + def test_italic_asterisk_open_trailing_space(self): + """ + Given: span='*foo *', delim='*' + When: _classify_emphasis_shape is called + Then: returns 'open' + """ + _require_classify() + assert _classify_emphasis_shape("*foo *", "*") == "open" + + def test_italic_asterisk_close_leading_space(self): + """ + Given: span='* bar*', delim='*' + When: _classify_emphasis_shape is called + Then: returns 'close' + """ + _require_classify() + assert _classify_emphasis_shape("* bar*", "*") == "close" + + # --- ITALIC underscore '_' delimiter --- + + def test_italic_underscore_complete(self): + """ + Given: span='_foo_', delim='_' + When: _classify_emphasis_shape is called + Then: returns 'complete' + """ +
_require_classify() + assert _classify_emphasis_shape("_foo_", "_") == "complete" + + def test_italic_underscore_open_trailing_space(self): + """ + Given: span='_foo _', delim='_' + When: _classify_emphasis_shape is called + Then: returns 'open' + """ + _require_classify() + assert _classify_emphasis_shape("_foo _", "_") == "open" + + def test_italic_underscore_close_leading_space(self): + """ + Given: span='_ bar_', delim='_' + When: _classify_emphasis_shape is called + Then: returns 'close' + """ + _require_classify() + assert _classify_emphasis_shape("_ bar_", "_") == "close" + + # --- Triple-asterisk edge cases (grounded by Spike S1) --- + + def test_triple_star_bold_interior_no_edge_ws_is_complete(self): + """ + Given: span='***foo**', delim='**' (from '***foo***' S1 probe) + When: _classify_emphasis_shape is called + Then: returns 'complete' + + Interior is '*foo'; first char '*' is not whitespace, last char 'o' + is not whitespace -> 'complete'. + """ + _require_classify() + assert _classify_emphasis_shape("***foo**", "**") == "complete" + + def test_five_star_bold_pure_star_interior_is_complete(self): + """ + Given: span='*****', delim='**' (from '*****foo*****' S1 probe) + When: _classify_emphasis_shape is called + Then: returns 'complete' + + Interior is '*' (the span minus its two '**' delimiters); '*' is not whitespace -> 'complete'. + """ + _require_classify() + assert _classify_emphasis_shape("*****", "**") == "complete" + + +# =========================================================================== +# TestTokenShapeField — ADR-028 D1 Token.shape wire-up +# =========================================================================== +# ADR-028 D1: Token gains shape: Literal['complete','open','close','neutral'] +# as third field with default 'neutral'. These tests call tokenize() and +# inspect .shape on the resulting tokens, satisfying ADR-025 D10 wire-up. 
+# =========================================================================== + + +class TestTokenShapeField: + r""" + Wire-up tests: tokenize() emits tokens whose .shape field carries the + ADR-028 D3 classification. Satisfies ADR-025 D10 (at least one test + that calls tokenize() and reads .shape on the result). + """ + + def _assert_has_shape(self, token: object) -> None: + """Assert token has a .shape attribute; fail with diagnostic if not.""" + assert hasattr(token, "shape"), ( + f"Token {token!r} has no .shape attribute (ADR-028 D1 requires Token.shape)." + ) + + def test_simple_bold_has_complete_shape(self): + """ + Given: tokenize('**foo**') + When: .shape is read on the first token + Then: shape == 'complete' + + ADR-028 D3: interior 'foo' has no edge whitespace -> 'complete'. + """ + _require_module() + tokens = tokenize("**foo**") + assert len(tokens) == 1 + self._assert_has_shape(tokens[0]) + assert tokens[0].shape == "complete", f"Expected shape='complete' for '**foo**', got {tokens[0].shape!r}" + + def test_simple_italic_asterisk_has_complete_shape(self): + """ + Given: tokenize('*foo*') + When: .shape is read on the first token + Then: shape == 'complete' + """ + _require_module() + tokens = tokenize("*foo*") + assert len(tokens) == 1 + self._assert_has_shape(tokens[0]) + assert tokens[0].shape == "complete" + + def test_simple_italic_underscore_has_complete_shape(self): + """ + Given: tokenize(' _foo_ ') (whitespace-flanked for current regex to match) + When: .shape is read on the ITALIC token + Then: shape == 'complete' + """ + _require_module() + tokens = tokenize(" _foo_ ") + italic_tokens = [t for t in tokens if t.kind == TokenKind.ITALIC] + assert len(italic_tokens) == 1, f"Expected one ITALIC token, got {tokens!r}" + self._assert_has_shape(italic_tokens[0]) + assert italic_tokens[0].shape == "complete" + + def test_nested_bold_shapes_are_open_neutral_close(self): + """ + Given: tokenize('**foo **nested** bar**') + When: .shape is read on the 
3 resulting tokens + Then: shapes are ['open', 'neutral', 'close'] + + The three tokens are BOLD('**foo **'), TEXT('nested'), BOLD('** bar**'). + BOLD('**foo **'): interior 'foo ' has trailing whitespace -> 'open'. + TEXT('nested'): non-emphasis token -> 'neutral'. + BOLD('** bar**'): interior ' bar' has leading whitespace -> 'close'. + """ + _require_module() + tokens = tokenize("**foo **nested** bar**") + assert len(tokens) == 3 + for t in tokens: + self._assert_has_shape(t) + assert tokens[0].shape == "open", f"Expected 'open', got {tokens[0].shape!r}" + assert tokens[1].shape == "neutral", f"Expected 'neutral', got {tokens[1].shape!r}" + assert tokens[2].shape == "close", f"Expected 'close', got {tokens[2].shape!r}" + + def test_all_non_emphasis_tokens_carry_neutral_shape(self): + """ + Given: tokenize('hello {{riskFoo}} world') + When: .shape is read on every token + Then: every non-BOLD/ITALIC token has shape == 'neutral' + + ADR-028 D3 invariant 4: non-emphasis tokens carry shape='neutral'. + """ + _require_module() + tokens = tokenize("hello {{riskFoo}} world") + for t in tokens: + self._assert_has_shape(t) + if t.kind not in (TokenKind.BOLD, TokenKind.ITALIC): + assert t.shape == "neutral", f"Non-emphasis token {t!r} must have shape='neutral', got {t.shape!r}" + + def test_invalid_token_carries_neutral_shape(self): + """ + Given: tokenize('See https://example.com') + When: .shape is read on the INVALID_URL token + Then: shape == 'neutral' + + All INVALID_* tokens are non-emphasis -> shape='neutral'. 
+ """ + _require_module() + tokens = tokenize("See https://example.com") + url_tokens = [t for t in tokens if t.kind == TokenKind.INVALID_URL] + assert len(url_tokens) >= 1 + for t in url_tokens: + self._assert_has_shape(t) + assert t.shape == "neutral", f"INVALID_URL must have shape='neutral', got {t.shape!r}" + + def test_sentinel_token_carries_neutral_shape(self): + """ + Given: tokenize('{{riskPromptInjection}}') + When: .shape is read on the SENTINEL_INTRA token + Then: shape == 'neutral' + """ + _require_module() + tokens = tokenize("{{riskPromptInjection}}") + assert len(tokens) == 1 + assert tokens[0].kind == TokenKind.SENTINEL_INTRA + self._assert_has_shape(tokens[0]) + assert tokens[0].shape == "neutral" + + def test_text_token_carries_neutral_shape(self): + """ + Given: tokenize('plain prose text') + When: .shape is read on the TEXT token + Then: shape == 'neutral' + """ + _require_module() + tokens = tokenize("plain prose text") + assert len(tokens) == 1 + assert tokens[0].kind == TokenKind.TEXT + self._assert_has_shape(tokens[0]) + assert tokens[0].shape == "neutral" + + def test_both_edges_whitespace_bold_has_open_shape(self): + """ + Given: tokenize('** foo **') + When: .shape is read on the BOLD token + Then: shape == 'open' + + D-Open-7: both-edges-whitespace -> 'open' by convention. + """ + _require_module() + tokens = tokenize("** foo **") + bold_tokens = [t for t in tokens if t.kind == TokenKind.BOLD] + assert len(bold_tokens) == 1, f"Expected 1 BOLD, got {tokens!r}" + self._assert_has_shape(bold_tokens[0]) + assert bold_tokens[0].shape == "open", f"D-Open-7: both-edges-ws -> 'open', got {bold_tokens[0].shape!r}" + + +# =========================================================================== +# TestIntrawordUnderscoreRejection — ADR-028 D3 invariant 3, D-Open-21 +# =========================================================================== +# ADR-028 D3 invariant 3: intraword \S_\S does NOT qualify as an italic +# delimiter. 
_RE_ITALIC_UNDERSCORE requires whitespace-or-boundary flanking +# on the opening '_' (left side) and closing '_' (right side). +# =========================================================================== + + +class TestIntrawordUnderscoreRejection: + r""" + Tests for D-Open-21: tightened _RE_ITALIC_UNDERSCORE. + + _RE_ITALIC_UNDERSCORE requires whitespace-or-boundary flanking on the + opening '_' (left side) and closing '_' (right side). Intraword + underscore pairs like 'home_bar and foo_baz' must NOT tokenize as ITALIC. + """ + + def test_intraword_underscore_pair_produces_no_italic(self): + """ + Given: 'home_bar and foo_baz' + When: tokenize() is called + Then: NO ITALIC token in the stream + + _RE_ITALIC_UNDERSCORE requires whitespace-or-boundary flanking; + intraword underscores do not satisfy this — all tokens are TEXT. + """ + _require_module() + tokens = tokenize("home_bar and foo_baz") + italic_tokens = [t for t in tokens if t.kind == TokenKind.ITALIC] + assert len(italic_tokens) == 0, ( + f"D-Open-21: intraword '_' must NOT produce ITALIC. Got ITALIC tokens: {italic_tokens!r}." + ) + + def test_adjacent_intraword_underscore_produces_no_italic(self): + """ + Given: 'a_b_c' (both underscores intraword) + When: tokenize() is called + Then: NO ITALIC token — entire string is TEXT + + Both underscores have non-whitespace flanking characters; the + tightened regex rejects them as italic delimiters. + """ + _require_module() + tokens = tokenize("a_b_c") + italic_tokens = [t for t in tokens if t.kind == TokenKind.ITALIC] + assert len(italic_tokens) == 0, f"D-Open-21: 'a_b_c' must produce no ITALIC. Got: {italic_tokens!r}." + + def test_whitespace_flanked_underscore_italic_still_tokenizes(self): + """ + Given: 'prefix _italic_ suffix' (whitespace on both sides) + When: tokenize() is called + Then: ITALIC('_italic_') is still produced + + Whitespace flanking satisfies _RE_ITALIC_UNDERSCORE's flanking + requirement — this is the canonical accepted form. 
+ """ + _require_module() + tokens = tokenize("prefix _italic_ suffix") + italic_tokens = [t for t in tokens if t.kind == TokenKind.ITALIC] + assert len(italic_tokens) == 1, f"Expected 1 ITALIC for whitespace-flanked '_italic_', got {tokens!r}" + assert italic_tokens[0].value == "_italic_" + + def test_string_boundary_flanked_underscore_italic_tokenizes(self): + """ + Given: '_foo_' (start/end-of-string flanking) + When: tokenize() is called + Then: ITALIC('_foo_') is produced + + Start-of-string counts as whitespace-or-boundary flanking per + D-Open-21.2(a) cosai-simpler form. + """ + _require_module() + tokens = tokenize("_foo_") + assert len(tokens) == 1, f"Expected 1 token, got {tokens!r}" + assert tokens[0].kind == TokenKind.ITALIC + assert tokens[0].value == "_foo_" + + def test_end_of_string_close_flanked_underscore_italic_tokenizes(self): + """ + Given: 'some text _foo_' (end-of-string after close underscore) + When: tokenize() is called + Then: ITALIC('_foo_') is produced + + D-Open-21.2(a): end-of-string after the closing '_' qualifies as a + word boundary and satisfies the whitespace-or-boundary flanking + requirement on the close side. + + Empirically verified (2026-05-29): + tokenize('some text _foo_') -> + TEXT('some text ') + ITALIC('_foo_') + """ + _require_module() + tokens = tokenize("some text _foo_") + italic_tokens = [t for t in tokens if t.kind == TokenKind.ITALIC] + assert len(italic_tokens) == 1, ( + f"Expected 1 ITALIC token for end-of-string close-flanked '_foo_', got {tokens!r}" + ) + assert italic_tokens[0].value == "_foo_", f"Expected ITALIC value '_foo_', got {italic_tokens[0].value!r}" + + def test_double_underscore_still_produces_no_italic_or_bold(self): + """ + Given: '__double__' + When: tokenize() is called + Then: no BOLD and no ITALIC token + + ADR-017 D1: __bold__ is NOT recognized. 
+ """ + _require_module() + tokens = tokenize("__double__") + assert all(t.kind not in (TokenKind.BOLD, TokenKind.ITALIC) for t in tokens), ( + f"'__double__' must produce neither BOLD nor ITALIC. Got: {tokens!r}" + ) diff --git a/scripts/hooks/tests/test_validate_yaml_prose_subset.py b/scripts/hooks/tests/test_validate_yaml_prose_subset.py index d305a0bb..cb0cae01 100644 --- a/scripts/hooks/tests/test_validate_yaml_prose_subset.py +++ b/scripts/hooks/tests/test_validate_yaml_prose_subset.py @@ -43,8 +43,10 @@ — reference_violations/ : (used by the references linter only) — schemas/ : minimal mock schemas for introspection tests -The tokenizer (_prose_tokens.py, locked at 25e3d22) is NOT modified. -The prose_subset/ fixture directory is NOT modified. +The tokenizer (_prose_tokens.py) is extended on this branch per ADR-028 D1/D3 +(the Token.shape field, _classify_emphasis_shape, and the tightened +_RE_ITALIC_UNDERSCORE). The prose_subset/ fixture directory gains the +accepting/emphasis_shapes/ pairs (ADR-028 D-Open-18, Path 3b). Test Coverage ============= @@ -93,6 +95,7 @@ from validate_yaml_prose_subset import ( # noqa: E402 Diagnostic, ProseField, + _delim_for_token, check_prose_field, find_prose_fields, main, @@ -104,6 +107,7 @@ # Stub names so module-level references do not raise NameError at load time. 
Diagnostic = None # type: ignore[assignment,misc] ProseField = None # type: ignore[assignment,misc] + _delim_for_token = None # type: ignore[assignment] check_prose_field = None # type: ignore[assignment] find_prose_fields = None # type: ignore[assignment] main = None # type: ignore[assignment] @@ -2023,3 +2027,509 @@ def test_flat_array_diagnostic_emits_single_bracket_only(self, tmp_path, capsys) assert not re.search(r"shortDescription\[\d+\]\[\d+\]:", line), ( f"Flat-array line must not contain double brackets: {line!r}" ) + + +# =========================================================================== +# TestLiveCorpusBaseline (Spike S2) +# =========================================================================== + + +@pytest.mark.live_corpus +class TestLiveCorpusBaseline: + r""" + Spike S2: live-corpus regression baseline for the prose-subset linter. + + Guards that the linter produces ZERO diagnostics across the four content + YAMLs in --block mode (confirmed by the Spike S3 probe before ADR-028 + was flipped to Accepted). + + ADR-028 §9.3 Spike S2 — gates Phase 5 regression check. + """ + + _REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent + _YAML_DIR = _REPO_ROOT / "risk-map" / "yaml" + _CONTENT_YAMLS = ["risks.yaml", "controls.yaml", "components.yaml", "personas.yaml"] + + def _run_block(self, yaml_file: str) -> "subprocess.CompletedProcess[str]": + """Run the linter in --block mode on a single content YAML.""" + import subprocess as _sp + + script = Path(__file__).parent.parent / "precommit" / "validate_yaml_prose_subset.py" + return _sp.run( + [sys.executable, str(script), "--block", str(self._YAML_DIR / yaml_file)], + capture_output=True, + text=True, + ) + + def test_risks_yaml_produces_zero_diagnostics(self): + """ + Given: risk-map/yaml/risks.yaml (current corpus) + When: validate_yaml_prose_subset --block is run + Then: exits 0 (zero diagnostics) + + Baseline captured 2026-05-28. 
+ """ + result = self._run_block("risks.yaml") + assert result.returncode == 0, f"risks.yaml produced diagnostics:\n{result.stderr}" + assert result.stderr.strip() == "", f"risks.yaml produced unexpected stderr:\n{result.stderr}" + + def test_controls_yaml_produces_zero_diagnostics(self): + """ + Given: risk-map/yaml/controls.yaml (current corpus) + When: validate_yaml_prose_subset --block is run + Then: exits 0 (zero diagnostics) + """ + result = self._run_block("controls.yaml") + assert result.returncode == 0, f"controls.yaml produced diagnostics:\n{result.stderr}" + assert result.stderr.strip() == "", f"controls.yaml produced unexpected stderr:\n{result.stderr}" + + def test_components_yaml_produces_zero_diagnostics(self): + """ + Given: risk-map/yaml/components.yaml (current corpus) + When: validate_yaml_prose_subset --block is run + Then: exits 0 (zero diagnostics) + """ + result = self._run_block("components.yaml") + assert result.returncode == 0, f"components.yaml produced diagnostics:\n{result.stderr}" + assert result.stderr.strip() == "", f"components.yaml produced unexpected stderr:\n{result.stderr}" + + def test_personas_yaml_produces_zero_diagnostics(self): + """ + Given: risk-map/yaml/personas.yaml (current corpus) + When: validate_yaml_prose_subset --block is run + Then: exits 0 (zero diagnostics) + """ + result = self._run_block("personas.yaml") + assert result.returncode == 0, f"personas.yaml produced diagnostics:\n{result.stderr}" + assert result.stderr.strip() == "", f"personas.yaml produced unexpected stderr:\n{result.stderr}" + + +# =========================================================================== +# TestNestedEmphasisRejection — ADR-028 D5 depth-counter linter +# =========================================================================== +# These tests assert the ADR-028 D5 depth-counter walk in check_prose_field(). 
+# =========================================================================== + + +class TestNestedEmphasisRejection: + r""" + Tests for ADR-028 D5: depth-counter emphasis-rejection walk in check_prose_field. + + Uses the same _make_field() idiom as TestSingleViolationDetection to build + synthetic ProseField objects and call check_prose_field() directly. + + Tests that assert zero diagnostics guard against false positives. + """ + + def _make_field( + self, + raw_text: str, + entry_id: str = "riskAlpha", + field_name: str = "shortDescription", + index: int = 0, + ) -> "ProseField": + """Build a ProseField with tokens populated from the tokenizer.""" + import sys as _sys + from pathlib import Path as _Path + + _sys.path.insert(0, str(_Path(__file__).parent.parent / "precommit")) + from precommit._prose_tokens import tokenize as _tok # noqa: PLC0415 + + tokens = _tok(raw_text) + return ProseField( + file_path=_Path("test.yaml"), + entry_id=entry_id, + field_name=field_name, + index=index, + raw_text=raw_text, + tokens=tokens, + ) + + # --- Tests that MUST produce a diagnostic --- + + def test_nested_bold_produces_one_nested_emphasis_diagnostic(self): + """ + Given: '**foo **nested** bar**' -> [BOLD(open), TEXT, BOLD(close)] + When: check_prose_field is called + Then: exactly ONE diagnostic with reason containing 'nested emphasis' + and snippet "at '** bar**'" (the close token) + + ADR-028 D5: BOLD('**foo **') has shape='open' -> depth 0->1. + BOLD('** bar**') has shape='close' and arrives at depth==1 -> nested emphasis. + + Close-branch emit: ADR-028 D5 (as amended 2026-05-29) requires the + close-branch emit when depth > 0, checked before the decrement; this + test verifies it. BOLD('** bar**') is the only token in the stream + [open, text, close] that arrives at depth > 0 (depth==1 before the + decrement), so the single diagnostic's snippet is the close token's + value: "at '** bar**'". 
+ """ + field = self._make_field("**foo **nested** bar**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 'nested emphasis' diagnostic, got {len(diags)}: {diags!r}." + assert "nested emphasis" in diags[0].reason, ( + f"Expected reason containing 'nested emphasis', got {diags[0].reason!r}" + ) + assert "at '** bar**'" in diags[0].reason, ( + f"Expected close-token snippet \"at '** bar**'\" in reason, got {diags[0].reason!r}" + ) + + def test_nested_italic_produces_one_nested_emphasis_diagnostic(self): + """ + Given: '*foo *nested* bar*' -> [ITALIC(open), TEXT, ITALIC(close)] + When: check_prose_field is called + Then: ONE diagnostic with reason containing 'nested emphasis' + + Same depth-counter logic for italic-asterisk delimiter as for bold. + """ + field = self._make_field("*foo *nested* bar*") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic for nested italic, got {len(diags)}: {diags!r}." + assert "nested emphasis" in diags[0].reason + + def test_italic_after_open_bold_produces_nested_emphasis_diagnostic(self): + """ + Given: '**A ** *B* C**' -> [BOLD(open='**A **'), TEXT(' '), ITALIC(complete='*B*'), TEXT(' C**')] + When: check_prose_field is called + Then: exactly ONE diagnostic with reason containing 'nested emphasis' + + This test covers the 'complete at depth > 0' branch of ADR-028 D5. + + Empirically verified token stream (2026-05-29): + tokenize('**A ** *B* C**') -> + BOLD('**A **') shape='open' -> depth 0->1 + TEXT(' ') shape='neutral' + ITALIC('*B*') shape='complete' -> depth==1 -> emit_diagnostic + TEXT(' C**') shape='neutral' + + The reviewer's suggested string '**A ** *B* C**' was verified to yield + this exact stream. BOLD('**A **') has trailing interior whitespace + ('A ') -> shape='open'; ITALIC('*B*') is a complete-shape token that + arrives at depth==1 after the open bold. The diagnostic fires on the + ITALIC token because it is a complete-emphasis token inside an open span. 
+ """ + field = self._make_field("**A ** *B* C**") + diags = check_prose_field(field) + assert len(diags) == 1, ( + f"Expected 1 'nested emphasis' diagnostic for complete italic at depth>0, got {len(diags)}: {diags!r}." + ) + assert "nested emphasis" in diags[0].reason, ( + f"Expected reason containing 'nested emphasis', got {diags[0].reason!r}" + ) + + def test_emphasis_wrapped_sentinel_intra_produces_diagnostic(self): + """ + Given: '**{{riskPromptInjection}}**' + When: check_prose_field is called + Then: ONE diagnostic with reason containing 'emphasis-wrapped sentinel' + + ADR-028 D5: emphasis-wrapped-sentinel predicate fires when emphasis + token interior (stripped) fullmatches the sentinel inner regex. + """ + field = self._make_field("**{{riskPromptInjection}}**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 'emphasis-wrapped sentinel' diagnostic, got {len(diags)}: {diags!r}." + assert "emphasis-wrapped sentinel" in diags[0].reason, ( + f"Expected 'emphasis-wrapped sentinel' in reason, got {diags[0].reason!r}" + ) + + def test_emphasis_wrapped_ref_sentinel_produces_diagnostic(self): + """ + Given: '**{{ref:x}}**' + When: check_prose_field is called + Then: ONE diagnostic with reason containing 'emphasis-wrapped sentinel' + + The wrapped-sentinel predicate applies to both SENTINEL_INTRA and + SENTINEL_REF inner forms — the test strips the delimiter pair and + fullmatches the tokenizer's two internal regexes, + _RE_SENTINEL_INTRA_INNER and _RE_SENTINEL_REF_INNER. + """ + field = self._make_field("**{{ref:x}}**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic for '**{{ref:x}}**', got {len(diags)}: {diags!r}." 
+ assert "emphasis-wrapped sentinel" in diags[0].reason + + def test_emphasis_wrapped_sentinel_with_newlines_produces_diagnostic(self): + """ + Given: '**\\n{{ref:x}}\\n**' + When: check_prose_field is called + Then: ONE diagnostic with reason containing 'emphasis-wrapped sentinel' + + ADR-028 D3: '**\\n**' has both-edges whitespace -> shape='open'. + The emphasis-wrapped-sentinel predicate uses .strip() on the interior, + so leading/trailing newlines do not prevent detection. + """ + field = self._make_field("**\n{{ref:x}}\n**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic for newline-wrapped sentinel, got {len(diags)}: {diags!r}." + assert "emphasis-wrapped sentinel" in diags[0].reason + + # --- Tests that MUST produce ZERO diagnostics (ADR-028 D5 faux-depth guard) --- + + def test_sibling_complete_bold_spans_produce_zero_diagnostics(self): + """ + Given: '**hello** world **goodbye**' + When: check_prose_field is called + Then: ZERO diagnostics + + ADR-028 D5 faux-depth guard: the two BOLD tokens have shape='complete' + -> depth stays at 0 throughout -> no nested-emphasis diagnostic. + """ + field = self._make_field("**hello** world **goodbye**") + diags = check_prose_field(field) + assert len(diags) == 0, ( + f"Sibling complete bold spans must produce 0 diagnostics (faux-depth guard). Got: {diags!r}" + ) + + def test_sentinel_at_depth_zero_produces_zero_diagnostics(self): + """ + Given: '**hello** world {{ref:x}}' + When: check_prose_field is called + Then: ZERO diagnostics + + The sentinel is at depth==0 (outside any open emphasis). + ADR-028 D5: the emphasis-wrapped-sentinel predicate checks the emphasis + token's interior, not any subsequent sentinel at depth 0. + """ + field = self._make_field("**hello** world {{ref:x}}") + diags = check_prose_field(field) + assert len(diags) == 0, f"Sentinel at depth 0 must produce 0 diagnostics. 
Got: {diags!r}" + + def test_clean_bold_produces_zero_diagnostics(self): + """ + Given: '**bold**' + When: check_prose_field is called + Then: ZERO diagnostics + + Simple complete-shape bold — no nesting, no sentinel inside. + """ + field = self._make_field("**bold**") + diags = check_prose_field(field) + assert len(diags) == 0, f"Clean bold must produce 0 diagnostics. Got: {diags!r}" + + def test_clean_italic_asterisk_produces_zero_diagnostics(self): + """ + Given: '*italic*' + When: check_prose_field is called + Then: ZERO diagnostics + """ + field = self._make_field("*italic*") + diags = check_prose_field(field) + assert len(diags) == 0, f"Clean italic must produce 0 diagnostics. Got: {diags!r}" + + def test_clean_italic_underscore_produces_zero_diagnostics(self): + """ + Given: '_italic_' at string boundary + When: check_prose_field is called + Then: ZERO diagnostics + """ + field = self._make_field("_italic_") + diags = check_prose_field(field) + assert len(diags) == 0, f"Clean underscore italic must produce 0 diagnostics. Got: {diags!r}" + + def test_bold_containing_italic_produces_zero_diagnostics(self): + """ + Given: '**bold *italic* inside**' + When: check_prose_field is called + Then: ZERO diagnostics + + ADR-017 D1: italic inside bold is one permitted nesting level. + The tokenizer emits a single BOLD token for this span (italic-in-bold + is absorbed atomically). No depth-counter violation. + """ + field = self._make_field("**bold *italic* inside**") + diags = check_prose_field(field) + assert len(diags) == 0, f"Bold-with-italic-inside must produce 0 diagnostics. Got: {diags!r}" + + def test_nested_bold_wrapping_sentinel_emits_both_diagnostics(self): + """ + Given: '**x **y**{{ref:x}}**' -> [BOLD(open), TEXT, BOLD(complete '**{{ref:x}}**')] + When: check_prose_field is called + Then: the trailing BOLD token emits BOTH a 'nested emphasis' AND an + 'emphasis-wrapped sentinel' diagnostic (two diagnostics total). 
+ + ADR-028 D5 (Addendum 2026-05-29, third erratum): the nested-emphasis and + wrapped-sentinel predicates are independent and may both fire on a single + emphasis token. The complete token '**{{ref:x}}**' arrives at depth > 0 + (nested, via the preceding open '**x **') and its stripped interior is a + sentinel (wrapped). This pins the intended double-emit so a future change + cannot silently collapse it to one diagnostic. + """ + field = self._make_field("**x **y**{{ref:x}}**") + diags = check_prose_field(field) + reasons = [d.reason for d in diags] + assert len(diags) == 2, f"Expected exactly 2 diagnostics (nested + wrapped). Got: {diags!r}" + assert "nested emphasis at '**{{ref:x}}**'" in reasons, reasons + assert "emphasis-wrapped sentinel at '**{{ref:x}}**'" in reasons, reasons + + +# =========================================================================== +# TestEmphasisDiagnosticFormat — ADR-028 D6 diagnostic format locks +# =========================================================================== +# Locks the exact diagnostic format strings for the two reason constants: +# ADR-028 D6: 'nested emphasis' and 'emphasis-wrapped sentinel', plus the +# token-snippet convention ('at '<token.value>''). +# =========================================================================== + + +class TestEmphasisDiagnosticFormat: + r""" + Tests for ADR-028 D6: diagnostic format for emphasis violations. + + The ADR-017 D4 format is preserved byte-for-byte: + validate-yaml-prose-subset: <file>:<entry_id>:<field_name>[<index>]: <reason> + + The <reason> for emphasis violations follows the existing 'at '<snippet>'' + pattern: the reason string ends with "at '<token.value>'" where token.value + is the offending emphasis token's full value (including delimiters). 
+ """ + + def _make_field( + self, + raw_text: str, + entry_id: str = "riskAlpha", + field_name: str = "shortDescription", + index: int = 0, + ) -> "ProseField": + """Build a ProseField with tokens from the tokenizer.""" + import sys as _sys + from pathlib import Path as _Path + + _sys.path.insert(0, str(_Path(__file__).parent.parent / "precommit")) + from precommit._prose_tokens import tokenize as _tok # noqa: PLC0415 + + tokens = _tok(raw_text) + return ProseField( + file_path=_Path("test.yaml"), + entry_id=entry_id, + field_name=field_name, + index=index, + raw_text=raw_text, + tokens=tokens, + ) + + def test_nested_emphasis_diagnostic_reason_string(self): + """ + Given: '**foo **nested** bar**' triggers nested emphasis + When: check_prose_field produces a Diagnostic + Then: reason starts with 'nested emphasis' and ends with "at '** bar**'" + + ADR-028 D6: reason string is 'nested emphasis' (unchanged); the snippet + convention follows the existing INVALID_* pattern: "at '<token.value>'". + The offending token is the BOLD('** bar**') (the 'close'-shape token at + depth > 0). + """ + field = self._make_field("**foo **nested** bar**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic, got {diags!r}." + reason = diags[0].reason + assert reason.startswith("nested emphasis"), f"Reason must start with 'nested emphasis', got {reason!r}" + assert "at '** bar**'" in reason, f"Reason must contain \"at '** bar**'\", got {reason!r}" + + def test_nested_emphasis_format_diagnostic_line(self): + """ + Given: a Diagnostic for nested emphasis + When: format_diagnostic_line is called + Then: output matches ADR-017 D4 format with 'nested emphasis' reason + + Asserts the full committed format string including hook_id prefix. 
+ """ + from precommit._linter_types import format_diagnostic_line # noqa: PLC0415 + + field = self._make_field( + "**foo **nested** bar**", + entry_id="riskAlpha", + field_name="shortDescription", + index=0, + ) + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic, got {diags!r}." + line = format_diagnostic_line(diags[0]) + # Format: validate-yaml-prose-subset: test.yaml:riskAlpha:shortDescription[0]: nested emphasis at '...' + assert line.startswith("validate-yaml-prose-subset: "), f"Expected hook_id prefix, got {line!r}" + assert "riskAlpha" in line + assert "shortDescription[0]" in line + assert "nested emphasis" in line + assert _DIAG_PATTERN.match(line), f"Diagnostic line does not match committed pattern: {line!r}" + + def test_emphasis_wrapped_sentinel_diagnostic_reason_string(self): + """ + Given: '**{{riskPromptInjection}}**' triggers emphasis-wrapped sentinel + When: check_prose_field produces a Diagnostic + Then: reason starts with 'emphasis-wrapped sentinel' and contains the token value + + ADR-028 D6: reason string is 'emphasis-wrapped sentinel'; snippet is + the full BOLD token value '**{{riskPromptInjection}}**'. + """ + field = self._make_field("**{{riskPromptInjection}}**") + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic, got {diags!r}." + reason = diags[0].reason + assert reason.startswith("emphasis-wrapped sentinel"), ( + f"Reason must start with 'emphasis-wrapped sentinel', got {reason!r}" + ) + assert "at '**{{riskPromptInjection}}**'" in reason, f"Reason must contain token snippet, got {reason!r}" + + def test_emphasis_wrapped_sentinel_format_diagnostic_line_matches_pattern(self): + """ + Given: a Diagnostic for emphasis-wrapped sentinel + When: format_diagnostic_line is called + Then: output matches the committed _DIAG_PATTERN regex + + Verifies the emphasis violation slots into the existing format contract + without modifying the pattern. 
+ """ + from precommit._linter_types import format_diagnostic_line # noqa: PLC0415 + + field = self._make_field( + "**{{riskPromptInjection}}**", + entry_id="riskBeta", + field_name="shortDescription", + index=1, + ) + diags = check_prose_field(field) + assert len(diags) == 1, f"Expected 1 diagnostic, got {diags!r}." + line = format_diagnostic_line(diags[0]) + assert _DIAG_PATTERN.match(line), ( + f"Emphasis-wrapped-sentinel diagnostic does not match committed pattern: {line!r}" + ) + + +# =========================================================================== +# TestDelimForTokenGuard — _delim_for_token fail-loud contract +# =========================================================================== +# _delim_for_token is called only on BOLD/ITALIC tokens, so its input always +# starts with '**', '*', or '_'. It must fail loud on any other value rather +# than silently returning a wrong delimiter (which would make +# _is_emphasis_wrapped_sentinel slice the wrong interior). +# =========================================================================== + + +class TestDelimForTokenGuard: + """Tests for the _delim_for_token delimiter-dispatch helper (ADR-028 D5).""" + + def test_bold_delimiter(self): + """'**...**' values return the two-character bold delimiter.""" + assert _delim_for_token("**foo**") == "**" + + def test_italic_asterisk_delimiter(self): + """'*...*' values return the single asterisk delimiter.""" + assert _delim_for_token("*foo*") == "*" + + def test_italic_underscore_delimiter(self): + """'_..._' values return the underscore delimiter.""" + assert _delim_for_token("_foo_") == "_" + + def test_unrecognized_value_raises(self): + """ + A value that is not a BOLD/ITALIC token (no '**', '*', or '_' prefix) + must raise rather than silently returning a delimiter. Guards against a + future emphasis kind reaching the helper with an unhandled delimiter. 
+ """ + with pytest.raises(ValueError): + _delim_for_token("plain text") + + def test_empty_value_raises(self): + """An empty string is not a valid emphasis token value and must raise.""" + with pytest.raises(ValueError): + _delim_for_token("")
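The fail-loud contract exercised by TestDelimForTokenGuard can be illustrated with a minimal sketch. This is not the repository's `_delim_for_token`. The names `delim_for_token_sketch` and `interior_sketch` are hypothetical, and the only behavior assumed is what the tests above pin down: `'**'`, `'*'`, or `'_'` is returned for matching prefixes, `ValueError` is raised otherwise, and the wrapped-sentinel check strips whitespace from the sliced interior.

```python
def delim_for_token_sketch(value: str) -> str:
    """Hypothetical stand-in for _delim_for_token; not the repository's code."""
    # Check '**' before '*': every bold value also starts with a single '*'.
    for delim in ("**", "*", "_"):
        if value.startswith(delim):
            return delim
    # Fail loud so a caller never slices an interior with the wrong delimiter.
    raise ValueError(f"not an emphasis token value: {value!r}")


def interior_sketch(value: str) -> str:
    """Remove the matched delimiter from both ends, then trim whitespace."""
    delim = delim_for_token_sketch(value)
    return value[len(delim):len(value) - len(delim)].strip()
```

Ordering the candidates longest-first is the design choice that matters here: testing `'*'` before `'**'` would misreport every bold token as italic and shift the interior slice by one character on each side.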