4 changes: 4 additions & 0 deletions docs/adr/017-yaml-prose-authoring-subset.md
Expand Up @@ -38,6 +38,8 @@ Authors may use exactly these forms in any prose field. Everything else is rejec

Bold and italic may compose (`**emphatically *not* this**` is valid). Sentinels are atomic identifier tokens; they do not nest into bold or italic. An author who wants the rendered title to appear bold relies on the renderer's stylesheet, not on wrapping `**` around `{{<entity-id>}}`.

The linter enforces the "no nested same-family emphasis" rule and the "no emphasis-wrapped sentinel" rule with the depth-counter bracket-matching pass specified in [ADR-028](028-prose-linter-bracket-matching-architecture.md) D5, which is the canonical mechanism for both rules.

The subset operates on the string contents of each prose paragraph. Paragraph and hard-break shape is carried by the YAML *array* structure ([ADR-011](011-persona-site-data-schema-contract.md) `definitions/prose`); prose strings are not list-bearing.

### D2. Disallowed by construction
Expand Down Expand Up @@ -106,6 +108,8 @@ The two hooks have overlapping rejection sets (raw `<a>` blocked by both, bare c

If ADR-016 lands its hook before this one, the bare-camelCase and raw-`<a>` checks live in `validate_prose_references.py` until ADR-017's hook ships and the shared tokenizer is extracted. The end state is the two-hook-shared-tokenizer split above.

The shared tokenizer's Token contract — the `Token` NamedTuple structure (including the emphasis-shape field), the `TokenKind` enumeration, the tokenizer's emission invariants, and the consumer surface — is formally specified in [ADR-028](028-prose-linter-bracket-matching-architecture.md) D1-D4. ADR-017 owns the grammar's authoring rules; ADR-028 owns the contract every consumer of the shared tokenizer reads.

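The backward-compatibility property of the Token contract (a defaulted third field) can be illustrated with a minimal sketch. It assumes a three-member `TokenKind` with string values chosen for illustration; the real enumeration is larger and its member values are not shown in this diff.

```python
from enum import Enum
from typing import Literal, NamedTuple


class TokenKind(Enum):
    # ASSUMPTION: only the kinds this sketch needs; values are illustrative.
    TEXT = "TEXT"
    BOLD = "BOLD"
    ITALIC = "ITALIC"


class Token(NamedTuple):
    kind: TokenKind
    value: str
    # Defaulted field: existing Token(kind, value) call sites keep working.
    shape: Literal["complete", "open", "close", "neutral"] = "neutral"


plain = Token(TokenKind.TEXT, "hello")            # two-positional construction
bold = Token(TokenKind.BOLD, "**foo **", "open")  # explicit shape

print(plain.shape)  # neutral
print(bold.shape)   # open
print(len(plain))   # 3
```

One consequence worth noting: because a NamedTuple is still a tuple, any consumer that unpacks a token as `kind, value = token` would now fail with three fields, so attribute access is the safer way to read a token.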
### D6. Redistribution contract surface

Per [ADR-014](014-yaml-content-security-posture.md) P5, the framework guarantees "shape via schemas" to downstream consumers. ADR-017 is the canonical statement of "content within prose strings." After the conformance sweep closes, the contract becomes strictly: **YAML prose contains no URLs at all.** Every URL lives in a structured `externalReferences` entry (ADR-016) and is referenced from prose by sentinel. This is a stronger guarantee than "YAML prose contains some URLs you must sanitize" — a third-party redistributor parsing the YAML knows that any URL it ingests came through a typed, schema-validated structured field.
Expand Down
220 changes: 220 additions & 0 deletions docs/adr/028-prose-linter-bracket-matching-architecture.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/adr/README.md
Expand Up @@ -43,6 +43,7 @@ If a decision is about *how the Risk Map content model is shaped*, it belongs in
| [025](025-testing-strategy.md) | Testing strategy and posture across Python, site JS, schemas, and infrastructure | Accepted | 2026-05-05 |
| [026](026-issue-template-domain.md) | Issue-template domain — generator scope, schema-derived enums, and ADR-content alignment contract | Accepted | 2026-05-20 |
| [027](027-framework-versioning-and-mapping-convention.md) | Per-mapping framework version pinning | Accepted | 2026-05-22 |
| [028](028-prose-linter-bracket-matching-architecture.md) | Prose-linter emphasis enforcement via bracket-matching depth pass over a Token-shape contract | Accepted | 2026-05-28 |

## Conventions

Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Expand Up @@ -2,6 +2,7 @@
pythonpath = ["scripts/hooks"]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"live_corpus: marks tests that read the live risk-map YAML corpus (deselect with '-m \"not live_corpus\"')",
]

[tool.coverage.run]
Expand Down
77 changes: 67 additions & 10 deletions scripts/hooks/precommit/_prose_tokens.py
Expand Up @@ -38,7 +38,7 @@

import re
from enum import Enum
from typing import NamedTuple
from typing import Literal, NamedTuple


class TokenKind(Enum):
Expand Down Expand Up @@ -89,10 +89,15 @@ class Token(NamedTuple):
Attributes:
kind: The token's classification (accepting or rejecting).
value: The exact substring from the input that this token covers.
shape: Emphasis classification per ADR-028 D3. One of 'complete', 'open',
'close', 'neutral'; always 'neutral' for non-emphasis tokens. Defaults to
'neutral' so two-positional construction Token(kind, value) keeps working
and yields a token with shape='neutral'.
"""

kind: TokenKind
value: str
shape: Literal["complete", "open", "close", "neutral"] = "neutral"


# ---------------------------------------------------------------------------
Expand Down Expand Up @@ -177,15 +182,67 @@ class Token(NamedTuple):
_RE_ITALIC_ASTERISK = re.compile(r"\*(.+?)\*", re.DOTALL)

# Italic underscore: _..._ — single underscore only; __ is NOT italic (ADR-017 D1).
# Lookahead/lookbehind prevent matching when adjacent to another underscore.
_RE_ITALIC_UNDERSCORE = re.compile(r"(?<![_])_(?![_])(.+?)(?<![_])_(?![_])")
# ADR-028 D3 invariant 3: intraword \S_\S does NOT qualify as an italic delimiter.
# Requirements (combined):
# - Opening _: preceded by whitespace or start-of-string (left-flank)
# - Opening _: NOT adjacent to another _ (no __)
# - Opening _: followed by non-whitespace (interior must start immediately)
# - Closing _: preceded by non-whitespace
# - Closing _: NOT adjacent to another _ (no __)
# - Closing _: followed by whitespace or end-of-string (right-flank)
# The (?=\S) after the open and (?<=\S) before the close structurally guarantee
# non-whitespace at both interior edges, so _classify_emphasis_shape(span, "_")
# always returns "complete" — there is no open/close-shape underscore-italic token.
_RE_ITALIC_UNDERSCORE = re.compile(r"(?:^|(?<=\s))(?<![_])_(?![_])(?=\S)(.+?)(?<=\S)(?<![_])_(?![_])(?=\s|$)")

# Bare camelCase entity-prefix identifier: (risk|control|component|persona) immediately
# followed by a capital letter, then the rest of the identifier word.
# This fires only on plain prose; the sentinel branch consumes it first when inside {{}}.
_RE_BARE_CAMELCASE = re.compile(r"(risk|control|component|persona)([A-Z]\w*)")

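The flanking requirements listed for `_RE_ITALIC_UNDERSCORE` can be checked with a small sketch. The pattern is copied from the hunk above; `italic_at` is a hypothetical helper that mimics how the tokenizer calls `.match(text, i)` at a candidate position.

```python
import re

# Pattern copied verbatim from the hunk above.
_RE_ITALIC_UNDERSCORE = re.compile(
    r"(?:^|(?<=\s))(?<![_])_(?![_])(?=\S)(.+?)(?<=\S)(?<![_])_(?![_])(?=\s|$)"
)


def italic_at(text: str, pos: int) -> "str | None":
    """Hypothetical helper: return the italic span starting at pos, if any."""
    m = _RE_ITALIC_UNDERSCORE.match(text, pos)
    return m.group() if m else None


print(italic_at("_foo_", 0))            # _foo_  (start-of-string left flank)
print(italic_at("say _hi_ now", 4))     # _hi_   (whitespace on both flanks)
print(italic_at("snake_case_name", 5))  # None   (intraword underscore)
print(italic_at("__dunder__", 0))       # None   (adjacent underscore)
print(italic_at("_ foo_", 0))           # None   (interior starts with whitespace)
print(italic_at("_foo _", 0))           # None   (interior ends with whitespace)
```

The intraword case relies on a documented property of `Pattern.match(string, pos)`: `^` anchors at the real start of the string, not at `pos`, and lookbehinds can still see the characters before `pos`.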

# ---------------------------------------------------------------------------
# Emphasis-shape classifier (ADR-028 D3)
# ---------------------------------------------------------------------------


def _classify_emphasis_shape(span: str, delim: str) -> Literal["complete", "open", "close", "neutral"]:
"""Classify the emphasis shape of a matched span by examining interior edge whitespace.

The tokenizer calls this at emission time for BOLD, ITALIC-asterisk, and
ITALIC-underscore tokens. The shape drives the depth-counter walk in the
prose-subset linter (ADR-028 D5).

Rules (ADR-028 D3 table):
- Both edges whitespace -> 'open' (convention: the both-edges check runs before the single-edge checks)
- Trailing whitespace only -> 'open' (match closed early on the delimiter meant to open an inner span)
- Leading whitespace only -> 'close' (trailing half of an early-closed match)
- Neither edge whitespace -> 'complete' (well-formed span)
- Empty interior -> 'neutral' (defensive; not emitted in practice)

Uses str.isspace() — no regex (ADR-028 D-Open-6).

Args:
span: The full matched span including delimiters (e.g. '**foo **').
delim: The delimiter string ('**', '*', or '_').

Returns:
One of 'complete', 'open', 'close', 'neutral'.
"""
interior = span[len(delim) : -len(delim)]
if not interior:
return "neutral"
leading_ws = interior[0].isspace()
trailing_ws = interior[-1].isspace()
if leading_ws and trailing_ws:
return "open"
if leading_ws:
return "close"
if trailing_ws:
return "open"
return "complete"

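The rule table in the docstring above can be exercised against the fixture strings added in this PR. The sketch below restates the function body from the hunk under a hypothetical public name (`classify_emphasis_shape`) with a hypothetical `Shape` alias, so that it runs on its own.

```python
from typing import Literal

# Hypothetical alias for the four shape values; not part of the diff.
Shape = Literal["complete", "open", "close", "neutral"]


def classify_emphasis_shape(span: str, delim: str) -> Shape:
    """Same logic as _classify_emphasis_shape in the hunk above."""
    interior = span[len(delim) : -len(delim)]
    if not interior:
        return "neutral"
    leading_ws = interior[0].isspace()
    trailing_ws = interior[-1].isspace()
    if leading_ws and trailing_ws:
        return "open"  # both-edges convention
    if leading_ws:
        return "close"
    if trailing_ws:
        return "open"
    return "complete"


print(classify_emphasis_shape("** foo **", "**"))  # open
print(classify_emphasis_shape("** bar**", "**"))   # close
print(classify_emphasis_shape("**foo **", "**"))   # open
print(classify_emphasis_shape("*foo *", "*"))      # open
print(classify_emphasis_shape("**foo**", "**"))    # complete
print(classify_emphasis_shape("****", "**"))       # neutral (empty interior)
```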

# ---------------------------------------------------------------------------
# Sentinel helper
# ---------------------------------------------------------------------------
Expand Down Expand Up @@ -263,7 +320,7 @@ def tokenize(text: str) -> list[Token]:
The `text` argument is expected to be a single prose field value as
decoded by PyYAML — not raw YAML, not a file path.

Test fixtures for all 42 grammar cases live at:
Test fixtures live at:
scripts/hooks/tests/fixtures/prose_subset/

Args:
Expand All @@ -286,11 +343,11 @@ def flush_text(end: int) -> None:
tokens.append(Token(TokenKind.TEXT, text[pending_text_start:end]))
pending_text_start = -1

def emit(kind: TokenKind, value: str) -> None:
"""Flush any pending TEXT, then emit the given token."""
def emit(kind: TokenKind, value: str, *, shape: Literal["complete", "open", "close", "neutral"] = "neutral") -> None:
"""Flush any pending TEXT, then emit the given token with the given shape."""
nonlocal i
flush_text(i)
tokens.append(Token(kind, value))
tokens.append(Token(kind, value, shape))
i += len(value)

def at_line_start() -> bool:
Expand Down Expand Up @@ -422,21 +479,21 @@ def at_line_start() -> bool:
if ch == "*" and i + 1 < len(text) and text[i + 1] == "*":
m = _RE_BOLD.match(text, i)
if m:
emit(TokenKind.BOLD, m.group())
emit(TokenKind.BOLD, m.group(), shape=_classify_emphasis_shape(m.group(), "**"))
continue

# --- Rule 13: Italic asterisk *...* ---
if ch == "*":
m = _RE_ITALIC_ASTERISK.match(text, i)
if m:
emit(TokenKind.ITALIC, m.group())
emit(TokenKind.ITALIC, m.group(), shape=_classify_emphasis_shape(m.group(), "*"))
continue

# --- Rule 14: Italic underscore _..._ (single underscore only) ---
if ch == "_":
m = _RE_ITALIC_UNDERSCORE.match(text, i)
if m:
emit(TokenKind.ITALIC, m.group())
emit(TokenKind.ITALIC, m.group(), shape=_classify_emphasis_shape(m.group(), "_"))
continue

# --- Rule 15: Bare camelCase entity-prefix identifier ---
Expand Down
138 changes: 129 additions & 9 deletions scripts/hooks/precommit/validate_yaml_prose_subset.py
Expand Up @@ -30,7 +30,20 @@

from precommit._linter_types import Diagnostic, ProseField, format_diagnostic_line # noqa: E402
from precommit._prose_fields import find_prose_fields # noqa: E402
from precommit._prose_tokens import TokenKind # noqa: E402
from precommit._prose_tokens import ( # noqa: E402
_RE_SENTINEL_INTRA_INNER,
_RE_SENTINEL_REF_INNER,
TokenKind,
)

# Deliberate cross-module coupling: _RE_SENTINEL_INTRA_INNER and
# _RE_SENTINEL_REF_INNER are internal to _prose_tokens (leading-underscore per
# ADR-028 D4). The wrapped-sentinel predicate (ADR-028 D5) reuses them directly
# so the linter's notion of a "sentinel" cannot drift from the tokenizer's own
# classification. They are NOT promoted to public constants — ADR-028 D4 fixes
# the public surface of _prose_tokens at exactly Token, TokenKind, and tokenize();
# a consumer importing these _RE_* names accepts the reorganization-coupling risk
# that D4 describes.

# Re-export so callers can import ProseField and Diagnostic from this module
# (the test suite imports both from here, not from _linter_types).
Expand Down Expand Up @@ -87,6 +100,69 @@
"If you added a new INVALID_* kind to _REJECTED_KINDS, add its reason to _REASONS too."
)

# Reason strings for emphasis violations (ADR-028 D6). These are stable
# constants; any change requires a D6 amendment.
_REASON_NESTED_EMPHASIS = "nested emphasis"
_REASON_EMPHASIS_WRAPPED_SENTINEL = "emphasis-wrapped sentinel"

# The two emphasis token kinds; used in the depth-counter walk (ADR-028 D5).
_EMPHASIS_KINDS: frozenset[TokenKind] = frozenset({TokenKind.BOLD, TokenKind.ITALIC})


def _is_emphasis_wrapped_sentinel(token_value: str, delim: str) -> bool:
"""Return True if the emphasis token wraps exactly one sentinel.

Strips the emphasis delimiter pair from token_value, .strip()s whitespace,
then checks whether the result is a `{{ }}` span whose inner content
fullmatches either the intra-doc or ref sentinel inner regex.

This mirrors how _match_sentinel classifies sentinels: outer {{ }} are
stripped first, then the inner content is matched against the patterns.

Args:
token_value: The full emphasis token value including delimiters.
delim: The delimiter string ('**', '*', or '_').

Returns:
True if the stripped interior is a well-formed sentinel.
"""
interior = token_value[len(delim) : -len(delim)].strip()
# Interior must be wrapped in {{ }} to be a sentinel form.
if not (interior.startswith("{{") and interior.endswith("}}")):
return False
inner = interior[2:-2]
return bool(_RE_SENTINEL_INTRA_INNER.fullmatch(inner) or _RE_SENTINEL_REF_INNER.fullmatch(inner))

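A rough illustration of the wrapped-sentinel predicate follows. The two regexes are stand-ins, not the real ones: `_RE_SENTINEL_INTRA_INNER` and `_RE_SENTINEL_REF_INNER` are not shown in this diff, so the patterns below are assumptions, guessed from the `{{ref:x}}` fixture and the entity-prefix convention. Only the strip-then-fullmatch logic mirrors the hunk above.

```python
import re

# ASSUMPTION: hypothetical stand-ins for the tokenizer's private sentinel
# regexes, which this diff imports but does not show.
HYPOTHETICAL_INTRA_INNER = re.compile(r"(risk|control|component|persona)[A-Z]\w*")
HYPOTHETICAL_REF_INNER = re.compile(r"ref:[\w-]+")


def is_emphasis_wrapped_sentinel(token_value: str, delim: str) -> bool:
    """Mirror of the predicate above, using the stand-in patterns."""
    interior = token_value[len(delim) : -len(delim)].strip()
    if not (interior.startswith("{{") and interior.endswith("}}")):
        return False
    inner = interior[2:-2]
    return bool(
        HYPOTHETICAL_INTRA_INNER.fullmatch(inner)
        or HYPOTHETICAL_REF_INNER.fullmatch(inner)
    )


print(is_emphasis_wrapped_sentinel("**{{ref:x}}**", "**"))               # True
print(is_emphasis_wrapped_sentinel("** {{ref:x}} **", "**"))             # True
print(is_emphasis_wrapped_sentinel("*{{riskPromptInjection}}*", "*"))    # True
print(is_emphasis_wrapped_sentinel("**see {{ref:x}}**", "**"))           # False
print(is_emphasis_wrapped_sentinel("**{{ref:x}} and {{ref:y}}**", "**")) # False
```

The last case shows why `fullmatch` matters: the interior starts with `{{` and ends with `}}`, but the text between them is not a single sentinel, so the predicate does not fire.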

def _delim_for_token(token_value: str) -> str:
"""Return the delimiter prefix for an emphasis token value.

Inspects the leading characters to distinguish '**' (BOLD) from '*' (ITALIC
asterisk) from '_' (ITALIC underscore). Called only on BOLD/ITALIC tokens,
whose values always start with one of those delimiters.

Args:
token_value: The full token value string (a BOLD or ITALIC token).

Returns:
The delimiter string: '**', '*', or '_'.

Raises:
ValueError: if token_value does not start with '**', '*', or '_'. The
helper fails loud rather than guessing a delimiter, so a future
emphasis kind that reaches it with an unhandled delimiter surfaces
immediately instead of silently mis-slicing the token interior.
"""
if token_value.startswith("**"):
return "**"
if token_value.startswith("*"):
return "*"
if token_value.startswith("_"):
return "_"
raise ValueError(
f"_delim_for_token expects a BOLD/ITALIC token value starting with '**', '*', or '_'; got {token_value!r}"
)


def check_prose_field(field: ProseField) -> list[Diagnostic]:
"""Check one ProseField against the ADR-017 D4 grammar rejection rules.
Expand All @@ -96,21 +172,18 @@ def check_prose_field(field: ProseField) -> list[Diagnostic]:
— ADR-017 D4 rule 5 delegates bare-camelCase rejection to
validate_prose_references.

Also runs the ADR-028 D5 depth-counter emphasis-rejection walk, emitting
diagnostics for nested emphasis and emphasis-wrapped sentinels.

Args:
field: A ProseField with tokens already populated by tokenize().

Returns:
List of Diagnostic objects (empty if the field is clean).
"""
diagnostics: list[Diagnostic] = []
for token in field.tokens:
if token.kind not in _REJECTED_KINDS:
continue
base_reason = _REASONS[token.kind]
# ADR-017 D4: append the offending token value as a snippet for context.
# Only append when token.value is non-empty (tokenizer guarantees this,
# but guard defensively to avoid "at ''" in edge cases).
reason = f"{base_reason} at {token.value!r}" if token.value else base_reason

def _emit_diag(reason: str) -> None:
diagnostics.append(
Diagnostic(
hook_id=_HOOK_ID,
Expand All @@ -122,6 +195,53 @@ def check_prose_field(field: ProseField) -> list[Diagnostic]:
nested_index=field.nested_index,
)
)

# --- INVALID_* token rejection (ADR-017 D4) ---
for token in field.tokens:
if token.kind not in _REJECTED_KINDS:
continue
base_reason = _REASONS[token.kind]
# ADR-017 D4: append the offending token value as a snippet for context.
# Only append when token.value is non-empty (tokenizer guarantees this,
# but guard defensively to avoid "at ''" in edge cases).
reason = f"{base_reason} at {token.value!r}" if token.value else base_reason
_emit_diag(reason)

# --- ADR-028 D5 depth-counter emphasis walk ---
# Single pass over the token stream with a bare integer depth counter.
# Emphasis tokens with shape='open' increment depth; 'close' decrements.
# Any emphasis token arriving at depth > 0 is a nested-emphasis violation.
# The wrapped-sentinel predicate is independent of depth state.
depth = 0
for token in field.tokens:
if token.kind not in _EMPHASIS_KINDS:
continue

# Nested-emphasis predicate (ADR-028 D5).
if token.shape == "open":
if depth > 0:
_emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}")
depth += 1
elif token.shape == "close":
# Check before decrementing: the close token is the one arriving
# at depth > 0 in the canonical [open, text, close] stream
# (e.g. **foo **nested** bar**), so it is the attribution point for
# the single nested-emphasis diagnostic. ADR-028 D5 (as amended
# 2026-05-29) emits in the close branch when depth > 0, before the
# decrement.
if depth > 0:
_emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}")
depth = max(0, depth - 1)
elif token.shape == "complete":
if depth > 0:
_emit_diag(f"{_REASON_NESTED_EMPHASIS} at {token.value!r}")
# complete = open + close, net depth change 0

# Emphasis-wrapped-sentinel predicate (independent of depth state).
delim = _delim_for_token(token.value)
if _is_emphasis_wrapped_sentinel(token.value, delim):
_emit_diag(f"{_REASON_EMPHASIS_WRAPPED_SENTINEL} at {token.value!r}")

return diagnostics

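The depth-counter walk above can be traced on fixtures from this PR. The sketch below is a simplified stand-in, not the hook itself: `Tok` and `nested_emphasis_reasons` are hypothetical names, token kinds are plain strings, and the wrapped-sentinel predicate is left out. The depth check is hoisted above the increment and decrement, which is equivalent to checking inside each branch as the hunk does.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    # Hypothetical stand-in for Token; kinds are plain strings here.
    kind: str
    value: str
    shape: str = "neutral"


EMPHASIS_KINDS = frozenset({"BOLD", "ITALIC"})


def nested_emphasis_reasons(tokens: "list[Tok]") -> "list[str]":
    """Return one reason string per emphasis token that arrives at depth > 0."""
    reasons: "list[str]" = []
    depth = 0
    for tok in tokens:
        if tok.kind not in EMPHASIS_KINDS:
            continue
        # Check before mutating depth, as the walk above does.
        if depth > 0 and tok.shape in ("open", "close", "complete"):
            reasons.append(f"nested emphasis at {tok.value!r}")
        if tok.shape == "open":
            depth += 1
        elif tok.shape == "close":
            depth = max(0, depth - 1)  # a stray close never goes negative
    return reasons


# Fixture: **foo **nested** bar**
stream = [
    Tok("BOLD", "**foo **", "open"),
    Tok("TEXT", "nested"),
    Tok("BOLD", "** bar**", "close"),
]
print(nested_emphasis_reasons(stream))
# ["nested emphasis at '** bar**'"]

# Fixture: **x **y**{{ref:x}}**
stream2 = [
    Tok("BOLD", "**x **", "open"),
    Tok("TEXT", "y"),
    Tok("BOLD", "**{{ref:x}}**", "complete"),
]
print(nested_emphasis_reasons(stream2))
# ["nested emphasis at '**{{ref:x}}**'"]
```

In the first fixture the open token arrives at depth 0 and raises nothing; the close token arrives at depth 1 and carries the single diagnostic, which matches the attribution point described in the close-branch comment.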

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "BOLD", "value": "** foo **"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
** foo **
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "BOLD", "value": "** bar**"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
** bar**
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "BOLD", "value": "**foo **"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
**foo **
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "BOLD", "value": "**x **"}, {"kind": "TEXT", "value": "y"}, {"kind": "BOLD", "value": "**{{ref:x}}**"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
**x **y**{{ref:x}}**
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "ITALIC", "value": "*foo *"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*foo *
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"kind": "BOLD", "value": "**foo **"}, {"kind": "TEXT", "value": "nested"}, {"kind": "BOLD", "value": "** bar**"}]
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
**foo **nested** bar**