Add NormalizationTest Part 2 + GraphemeBreakTest; fix GB9c by redvers · Pull Request #2 · contact-red/unicode

redvers · 2026-05-28T06:50:13Z

Follow-up to the M9 conformance work. Adds the two pieces I called out as missing in the prior review and fixes the bug the new grapheme runner caught.

Conformance additions

NormalizationTest.txt Part 2. The UAX #15 conformance clause requires that every assigned codepoint X not listed in @part1 of the test file satisfies `X == NFC(X) == NFD(X) == NFKC(X) == NFKD(X)`. The runner now tracks the @part1 codepoint set via `HashSet[U32]` and iterates the assigned-cp space (~275k cps) running the identity check on the complement.
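
The identity check described above can be sketched roughly as follows. This is an illustration in Python, not the repository's Pony code: `part1_first_columns`, `is_assigned` and `part2_failures` are hypothetical names, and "assigned" is assumed here to mean any general category other than Cn (unassigned) or Cs (surrogate). Python's `unicodedata` tracks whatever Unicode version the interpreter ships with, so results may differ from a 16.0.0 run.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def part1_first_columns(lines):
    """Collect the c1 code point of every data line inside the @Part1 section."""
    seen = set()
    in_part1 = False
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        if line.startswith("@"):              # section marker such as "@Part1"
            in_part1 = line.lower().startswith("@part1")
            continue
        if in_part1:
            c1 = line.split(";", 1)[0].split()
            if len(c1) == 1:                  # Part 1 lists one code point in c1
                seen.add(int(c1[0], 16))
    return seen

def is_assigned(cp):
    # Assumption: "assigned" excludes Cn (unassigned) and Cs (surrogates).
    return unicodedata.category(chr(cp)) not in ("Cn", "Cs")

def part2_failures(part1, candidates):
    """Return code points outside part1 that are changed by any normalization form."""
    failures = []
    for cp in candidates:
        if cp in part1 or not is_assigned(cp):
            continue
        s = chr(cp)
        if any(unicodedata.normalize(form, s) != s for form in FORMS):
            failures.append(cp)
    return failures
```

A full run would pass `range(0x110000)` as `candidates` and expect an empty failure list.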

GraphemeBreakTest.txt runner (`unicode_grapheme_conform_main`). Parses ÷/× markers + hex cps, builds the UTF-8 with per-cp byte offsets, and compares the cluster boundary set against `Graphemes.ranges` output.
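
As a rough illustration of the parsing step, the sketch below turns one GraphemeBreakTest.txt line into UTF-8 bytes plus the set of byte offsets where a break is expected. It is written in Python rather than Pony, `parse_break_line` is a hypothetical name, and it records break offsets directly instead of keeping a separate per-cp offset list; whether `Graphemes.ranges` reports the offsets at 0 and at end-of-text in the same way is an assumption.

```python
def parse_break_line(line):
    """Parse a line such as '÷ 0020 × 0308 ÷ 0020 ÷  # comment'.

    Returns (utf8_bytes, expected_break_offsets), or None for blank/comment lines.
    """
    body = line.split("#", 1)[0].strip()      # the comment may itself contain ÷/×
    if not body:
        return None
    data = bytearray()
    breaks = set()
    for token in body.split():
        if token == "÷":
            breaks.add(len(data))             # break allowed at the current byte offset
        elif token == "×":
            continue                          # no break here
        else:
            # "surrogatepass" keeps the sketch from raising if a line
            # contains a lone surrogate code point.
            data += chr(int(token, 16)).encode("utf-8", "surrogatepass")
    return bytes(data), breaks
```

A runner would then compare `breaks` with the boundaries derived from the segmenter's output for the same bytes.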

GB9c fix (real bug caught by the new runner)

The grapheme runner failed 7 cases on first run — all Devanagari conjunct sequences (KA + VIRAMA + TA, etc.). Root cause: my M4 grapheme cursor predates UAX #29 GB9c (Indic_Conjunct_Break, Unicode 15.1).

The fix:

- New codegen pass over DerivedCoreProperties.txt's InCB entries → `_UcdIndicConjunctBreak` lookup table.
- Hand-written `IndicConjunctBreak` closed union (None / Consonant / Linker / Extend).
- `_GraphemeCursor` extended with a two-flag GB9c state machine (`consonant_seen`, `linker_seen`) updated alongside the existing `GraphemeBreak` state.
- New `Codepoints.indic_conjunct_break` accessor.
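
A two-flag state machine of the kind listed above might look like the following sketch. It is Python, not the Pony `_GraphemeCursor`, and it covers only the GB9c rule (Consonant, then any mix of Extend and Linker containing at least one Linker, then no break before the next Consonant). `InCB`, `SAMPLE_INCB` and `gb9c_joins` are hypothetical names, and the table is a tiny hand-picked subset rather than the generated lookup.

```python
from enum import Enum

class InCB(Enum):
    NONE = 0
    CONSONANT = 1
    LINKER = 2
    EXTEND = 3

# Hand-picked subset for illustration; a real table is generated from
# the InCB entries in DerivedCoreProperties.txt.
SAMPLE_INCB = {
    0x0915: InCB.CONSONANT,  # DEVANAGARI LETTER KA
    0x0924: InCB.CONSONANT,  # DEVANAGARI LETTER TA
    0x094D: InCB.LINKER,     # DEVANAGARI SIGN VIRAMA
    0x093C: InCB.EXTEND,     # DEVANAGARI SIGN NUKTA
    0x200D: InCB.EXTEND,     # ZERO WIDTH JOINER
}

def gb9c_joins(cps, table=SAMPLE_INCB):
    """Return the indices i where GB9c forbids a break before cps[i]."""
    consonant_seen = False   # a Consonant started the current candidate run
    linker_seen = False      # at least one Linker followed that Consonant
    joins = []
    for i, cp in enumerate(cps):
        incb = table.get(cp, InCB.NONE)
        if incb is InCB.CONSONANT:
            if consonant_seen and linker_seen:
                joins.append(i)               # Consonant ... Linker ... x Consonant
            consonant_seen, linker_seen = True, False
        elif incb is InCB.LINKER:
            linker_seen = consonant_seen      # only counts after a Consonant
        elif incb is InCB.EXTEND:
            pass                              # Extend leaves both flags unchanged
        else:
            consonant_seen = linker_seen = False
    return joins
```

For KA + VIRAMA + TA (`[0x0915, 0x094D, 0x0924]`) the sketch reports a join before index 2, which is the kind of conjunct sequence the failing cases involved.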

Final tally on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 146 / 146 |
| NormalizationTest Part 1 (explicit invariants) | 19,965 / 19,965 |
| NormalizationTest Part 2 (identity for cps not in @part1) | 275,446 / 275,446 |
| GraphemeBreakTest.txt (UAX #29 cluster boundaries) | 1,093 / 1,093 |

`make ci` runs all three suites; the PR workflow runs the same.

Test plan

`make ci` locally — all green
PR CI runs the same


Two new conformance checks, plus the implementation fix the grapheme runner exposed:

NormalizationTest.txt Part 2: the conformance clause that says "for every assigned cp X not in @part1, X is its own NFC / NFD / NFKC / NFKD." Tracked via a `HashSet[U32]` of cps that appear as c1 in @part1 lines; the runner then iterates the assigned cp space (~275k cps) and runs the identity check on the complement. Wired into the existing `unicode_conform_main`.

GraphemeBreakTest.txt: new `unicode_grapheme_conform_main` runner. Parses ÷/× markers + hex cps from each line, builds a UTF-8 string with byte-offset accounting, and compares our `Graphemes.ranges` output to the expected break set.

The grapheme runner caught 7 failures on first run, all Devanagari conjunct sequences. Root cause: my M4 grapheme cursor predates UAX #29 GB9c (Indic_Conjunct_Break rule, added in Unicode 15.1).

Fix:
- codegen InCB property from DerivedCoreProperties.txt (`unicode_build/incb_table.pony`) → `_UcdIndicConjunctBreak`
- hand-write `IndicConjunctBreak` closed union (4 values: None, Consonant, Linker, Extend)
- extend `_GraphemeCursor` with a two-flag GB9c state machine (`consonant_seen`, `linker_seen`) updated alongside the existing `GraphemeBreak` state
- surface as `Codepoints.indic_conjunct_break`

Final tally on Unicode 16.0.0:
- 146 unit tests
- Part 1: 19,965 / 19,965 (17,085 cps tracked)
- Part 2: 275,446 / 275,446
- GraphemeBreakTest: 1,093 / 1,093

`make ci` now runs all three suites.

redvers merged commit de6c887 into main May 28, 2026
6 checks passed

redvers merged 1 commit into main from conformance/part2-and-grapheme
