
Add NormalizationTest Part 2 + GraphemeBreakTest; fix GB9c#2

Merged
redvers merged 1 commit into main from conformance/part2-and-grapheme on May 28, 2026

redvers commented on May 28, 2026

Follow-up to the M9 conformance work. Adds the two pieces I called out as missing in the prior review and fixes the bug the new grapheme runner caught.

Conformance additions

NormalizationTest.txt Part 2. The UAX #15 conformance clause requires that every assigned codepoint X not listed in @part1 of the test file satisfies `X == NFC(X) == NFD(X) == NFKC(X) == NFKD(X)`. The runner now records every codepoint that appears as c1 on a @part1 line in a `HashSet[U32]`, then iterates the assigned codepoint space (~275k cps) and runs the identity check on the complement.
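The Part 2 check can be sketched in a few lines. This is an illustrative Python version, not the Pony runner from this PR: `part2_failures` and its parameters are hypothetical names, "assigned" is assumed to mean "general category is not Cn" (surrogates are skipped too), and Python's `unicodedata` uses whatever Unicode version the interpreter bundles, which may not be 16.0.0.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def part2_failures(part1_cps, candidates=range(0x110000)):
    """Return codepoints outside part1_cps that change under any normalization form."""
    failures = []
    for cp in candidates:
        if cp in part1_cps:
            continue  # already covered by the explicit Part 1 invariants
        ch = chr(cp)
        # Assumption: unassigned (Cn) and surrogate (Cs) codepoints are out of scope.
        if unicodedata.category(ch) in ("Cn", "Cs"):
            continue
        if any(unicodedata.normalize(form, ch) != ch for form in FORMS):
            failures.append(cp)
    return failures
```

An empty result means the identity invariant held for every candidate outside the Part 1 set.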

GraphemeBreakTest.txt runner (`unicode_grapheme_conform_main`). Parses the ÷/× markers and hex codepoints on each line, builds the UTF-8 string while recording per-codepoint byte offsets, and compares the expected cluster boundary set against the `Graphemes.ranges` output.
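As a rough illustration of the parsing step, the Python sketch below turns one test line into a string plus the byte offsets where a break is expected. `parse_break_line` is a hypothetical helper, not the runner's actual code, and it assumes that ÷ marks a break, × marks no break, and everything after `#` is a comment.

```python
def parse_break_line(line):
    """Parse one GraphemeBreakTest-style line into (text, expected_break_offsets)."""
    body = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not body:
        return None  # blank or comment-only line
    cps, breaks, offset = [], [], 0
    for token in body.split():
        if token == "÷":
            breaks.append(offset)  # break allowed at this byte offset
        elif token == "×":
            continue  # no break here
        else:
            cp = int(token, 16)
            cps.append(cp)
            offset += len(chr(cp).encode("utf-8"))  # advance by UTF-8 width
    return "".join(map(chr, cps)), breaks
```

The returned offsets include the start and end of the string, since the test lines mark both; a runner would compare them with the boundaries implied by its own cluster ranges.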

GB9c fix (real bug caught by the new runner)

The grapheme runner failed 7 cases on first run, all Devanagari conjunct sequences (KA + VIRAMA + TA, etc.). Root cause: my M4 grapheme cursor predates UAX #29 GB9c (Indic_Conjunct_Break, Unicode 15.1), so it still broke between the linker and the following consonant.

The fix:

  • New codegen pass over DerivedCoreProperties.txt's InCB entries → `_UcdIndicConjunctBreak` lookup table.
  • Hand-written `IndicConjunctBreak` closed union (None / Consonant / Linker / Extend).
  • `_GraphemeCursor` extended with a two-flag GB9c state machine (`consonant_seen`, `linker_seen`) updated alongside the existing `GraphemeBreak` state.
  • New `Codepoints.indic_conjunct_break` accessor.
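The two-flag state machine can be pictured with the Python sketch below. It is an assumption about how such a cursor might track GB9c, not the `_GraphemeCursor` code itself: `gb9c_joins` is a hypothetical name, and the `INCB` dictionary is a tiny hand-picked subset standing in for the generated lookup table.

```python
# Illustrative subset of Indic_Conjunct_Break values (stand-in for the real table).
INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0924: "Consonant",  # DEVANAGARI LETTER TA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}

def gb9c_joins(cps):
    """Return the indices i where GB9c forbids a break before cps[i]."""
    consonant_seen = False
    linker_seen = False
    joins = []
    for i, cp in enumerate(cps):
        incb = INCB.get(cp, "None")
        if incb == "Consonant":
            # Consonant (Extend|Linker)* Linker (Extend|Linker)* x Consonant
            if consonant_seen and linker_seen:
                joins.append(i)
            consonant_seen, linker_seen = True, False  # start a new candidate
        elif incb == "Linker":
            if consonant_seen:
                linker_seen = True
        elif incb == "Extend":
            pass  # Extend keeps the current state
        else:
            consonant_seen = linker_seen = False  # anything else resets
    return joins
```

For KA + VIRAMA + TA this reports index 2, the position where the pre-GB9c cursor would have broken. The other UAX #29 rules (such as GB9 for the break before VIRAMA) are outside this sketch.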

Final tally on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 146 / 146 |
| NormalizationTest Part 1 (explicit invariants) | 19,965 / 19,965 |
| NormalizationTest Part 2 (identity for cps not in @part1) | 275,446 / 275,446 |
| GraphemeBreakTest.txt (UAX #29 cluster boundaries) | 1,093 / 1,093 |

`make ci` runs all three suites; the PR workflow runs the same.

Test plan

  • `make ci` locally — all green
  • PR CI runs the same

Two new conformance checks, plus the implementation fix the
grapheme runner exposed:

  NormalizationTest.txt Part 2 — the conformance clause that says
  "for every assigned cp X not in @part1, X is its own NFC / NFD /
  NFKC / NFKD." Tracked via a HashSet[U32] of cps that appear as c1
  in @part1 lines; then iterates the assigned cp space (~275k cps)
  and runs the identity check on the complement. Wired into the
  existing unicode_conform_main.

  GraphemeBreakTest.txt — new unicode_grapheme_conform_main runner.
  Parses ÷/× markers + hex cps from each line, builds a UTF-8
  string with byte-offset accounting, and compares our
  Graphemes.ranges output to the expected break set.

The grapheme runner caught 7 failures on first run — all
Devanagari conjunct sequences. Root cause: my M4 grapheme cursor
predates UAX #29 GB9c (Indic_Conjunct_Break rule, added in
Unicode 15.1).

Fix:
  - codegen InCB property from DerivedCoreProperties.txt
    (unicode_build/incb_table.pony) → _UcdIndicConjunctBreak
  - hand-write IndicConjunctBreak closed union (4 values: None,
    Consonant, Linker, Extend)
  - extend _GraphemeCursor with a two-flag GB9c state machine
    (consonant_seen, linker_seen) updated alongside the existing
    GraphemeBreak state
  - surface as Codepoints.indic_conjunct_break
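
For readers unfamiliar with the data file, the codegen step amounts to collecting the InCB ranges and looking codepoints up by range. The Python sketch below is only an illustration of that idea, not the contents of unicode_build/incb_table.pony. `parse_incb` and `lookup_incb` are hypothetical names, and the three-field `range ; InCB; value` line shape is an assumption about the DerivedCoreProperties.txt format.

```python
import bisect

def parse_incb(lines):
    """Collect sorted (start, end, value) ranges from InCB property lines."""
    ranges = []
    for line in lines:
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) != 3 or fields[1] != "InCB":
            continue  # comment, blank line, or some other property
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single codepoint or range
        ranges.append((start, end, fields[2]))
    ranges.sort()
    return ranges

def lookup_incb(ranges, cp):
    """Binary-search the sorted ranges; codepoints not listed map to 'None'."""
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "None"
```

A generator could emit the sorted ranges as a static table and keep the same binary-search lookup at runtime.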

Final tally on Unicode 16.0.0:
  146 unit tests
  Part 1: 19,965 / 19,965  (17,085 cps tracked)
  Part 2: 275,446 / 275,446
  GraphemeBreakTest: 1,093 / 1,093

`make ci` now runs all three suites.

redvers merged commit de6c887 into main on May 28, 2026
6 checks passed