Add NormalizationTest Part 2 + GraphemeBreakTest; fix GB9c #2
Merged
Two new conformance checks, plus the implementation fix the grapheme runner exposed:

- NormalizationTest.txt Part 2: the conformance clause that says "for every assigned cp X not in @part1, X is its own NFC / NFD / NFKC / NFKD." Tracked via a HashSet[U32] of cps that appear as c1 in @part1 lines; then iterates the assigned cp space (~275k cps) and runs the identity check on the complement. Wired into the existing unicode_conform_main.
- GraphemeBreakTest.txt: new unicode_grapheme_conform_main runner. Parses ÷/× markers + hex cps from each line, builds a UTF-8 string with byte-offset accounting, and compares our Graphemes.ranges output to the expected break set.

The grapheme runner caught 7 failures on first run, all Devanagari conjunct sequences. Root cause: my M4 grapheme cursor predates UAX #29 GB9c (Indic_Conjunct_Break rule, added in Unicode 15.1).

Fix:

- codegen InCB property from DerivedCoreProperties.txt (unicode_build/incb_table.pony) → _UcdIndicConjunctBreak
- hand-write IndicConjunctBreak closed union (4 values: None, Consonant, Linker, Extend)
- extend _GraphemeCursor with a two-flag GB9c state machine (consonant_seen, linker_seen) updated alongside the existing GraphemeBreak state
- surface as Codepoints.indic_conjunct_break

Final tally on Unicode 16.0.0:

- 146 unit tests
- Part 1: 19,965 / 19,965 (17,085 cps tracked)
- Part 2: 275,446 / 275,446
- GraphemeBreakTest: 1,093 / 1,093

`make ci` now runs all three suites.
Follow-up to the M9 conformance work. Adds the two pieces I called out as missing in the prior review and fixes the bug the new grapheme runner caught.
Conformance additions
NormalizationTest.txt Part 2. The UAX #15 conformance clause requires that every assigned codepoint X not listed in @part1 of the test file satisfies `X == NFC(X) == NFD(X) == NFKC(X) == NFKD(X)`. The runner now tracks the @part1 codepoint set via `HashSet[U32]` and iterates the assigned-cp space (~275k cps) running the identity check on the complement.
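As a rough illustration of the Part 2 check, here is a Python sketch (not the Pony runner from this PR). It uses the standard `unicodedata` module as a stand-in normalizer, so it checks whichever Unicode version the interpreter ships. The names `part1_codepoints` and `part2_failures` are hypothetical, and skipping surrogates is an assumption of the sketch.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def part1_codepoints(lines):
    """Collect every single-codepoint c1 value listed under @Part1."""
    tracked = set()
    in_part1 = False
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        if line.startswith("@"):
            in_part1 = line.lower().startswith("@part1")
            continue
        if in_part1:
            c1 = line.split(";")[0].split()   # first column, as hex tokens
            if len(c1) == 1:
                tracked.add(int(c1[0], 16))
    return tracked


def part2_failures(tracked, cps):
    """Return assigned codepoints outside `tracked` that change under any form."""
    failures = []
    for cp in cps:
        if cp in tracked or 0xD800 <= cp <= 0xDFFF:
            continue                          # assumption: skip surrogates
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":
            continue                          # unassigned, not covered by the clause
        if any(unicodedata.normalize(form, ch) != ch for form in FORMS):
            failures.append(cp)
    return failures
```

A full run would pass the lines of NormalizationTest.txt to `part1_codepoints` and `range(0x110000)` to `part2_failures`; an empty result means the identity check holds on the complement.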
GraphemeBreakTest.txt runner (`unicode_grapheme_conform_main`). Parses the ÷/× markers and hex cps from each line, builds the UTF-8 string with per-cp byte offsets, and compares the expected cluster boundary set against the `Graphemes.ranges` output.
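The line parsing can be sketched in Python as follows (an illustration only, not the Pony runner). `parse_break_line` and `breaks_from_ranges` are hypothetical names, and the sketch assumes the segmenter reports clusters as `(start, end)` byte ranges, which may not match the actual shape of `Graphemes.ranges`.

```python
def parse_break_line(raw):
    """Turn one GraphemeBreakTest.txt line into (utf8_bytes, break_offsets)."""
    line = raw.split("#", 1)[0].strip()   # the trailing comment also contains ÷/×
    if not line:
        return None                       # blank or comment-only line
    text = bytearray()
    breaks = set()
    for token in line.split():
        if token == "÷":
            breaks.add(len(text))         # boundary at the current byte offset
        elif token == "×":
            continue                      # explicitly no boundary here
        else:
            # "surrogatepass" is a precaution in case a line has a lone surrogate
            text += chr(int(token, 16)).encode("utf-8", "surrogatepass")
    return bytes(text), breaks


def breaks_from_ranges(ranges):
    """Flatten (start, end) byte ranges into the set of boundary offsets."""
    offsets = set()
    for start, end in ranges:
        offsets.add(start)
        offsets.add(end)
    return offsets
```

A test case then passes when `breaks_from_ranges(actual_ranges)` equals the expected set returned by `parse_break_line`. Comparing sets of byte offsets keeps the check independent of how the clusters themselves are represented.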
GB9c fix (real bug caught by the new runner)
The grapheme runner failed 7 cases on first run — all Devanagari conjunct sequences (KA + VIRAMA + TA, etc.). Root cause: my M4 grapheme cursor predates UAX #29 GB9c (Indic_Conjunct_Break, Unicode 15.1).
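For reference, GB9c forbids a break before a consonant when the preceding run is a consonant followed by any mix of InCB=Extend and InCB=Linker characters containing at least one Linker. A minimal Python sketch of a two-flag tracker for that rule is below. It is not the `_GraphemeCursor` code from this PR; the class name and the four-entry `SAMPLE_INCB` table are made up for the example.

```python
# Hand-picked InCB values for the demo; a real table comes from the UCD.
SAMPLE_INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0924: "Consonant",  # DEVANAGARI LETTER TA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}


class Gb9cState:
    """Tracks whether a Consonant and then a Linker have been seen."""

    def __init__(self):
        self.consonant_seen = False
        self.linker_seen = False

    def step(self, incb):
        """Feed the next InCB value; True means GB9c forbids a break before it."""
        if incb == "Consonant":
            joins = self.consonant_seen and self.linker_seen
            self.consonant_seen = True    # this consonant can start the next conjunct
            self.linker_seen = False
            return joins
        if incb == "Linker":
            if self.consonant_seen:
                self.linker_seen = True
        elif incb != "Extend":            # anything else interrupts the sequence
            self.consonant_seen = False
            self.linker_seen = False
        return False                      # GB9c only applies before a Consonant


def gb9c_joins(cps):
    """For each codepoint, report whether GB9c glues it to the previous one."""
    state = Gb9cState()
    return [state.step(SAMPLE_INCB.get(cp, "None")) for cp in cps]
```

For KA + VIRAMA + TA, `gb9c_joins([0x0915, 0x094D, 0x0924])` gives `[False, False, True]`: the tracker only answers for the position before TA, while the VIRAMA is already attached to KA by the existing Extend handling (GB9).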
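The fix:

- codegen the InCB property from DerivedCoreProperties.txt (`unicode_build/incb_table.pony`) → `_UcdIndicConjunctBreak`
- hand-write the `IndicConjunctBreak` closed union (4 values: None, Consonant, Linker, Extend)
- extend `_GraphemeCursor` with a two-flag GB9c state machine (`consonant_seen`, `linker_seen`) updated alongside the existing GraphemeBreak state
- surface as `Codepoints.indic_conjunct_break`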
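The InCB data lives in DerivedCoreProperties.txt as lines of the form `0915..0939 ; InCB; Consonant # ...`. A small Python sketch of extracting and querying that property is below. It illustrates the data format only and is not the generator in `unicode_build/incb_table.pony`; `parse_incb` and `lookup_incb` are hypothetical names.

```python
import bisect


def parse_incb(lines):
    """Extract sorted (first, last, value) ranges for the InCB property."""
    ranges = []
    for raw in lines:
        fields = [f.strip() for f in raw.split("#", 1)[0].split(";")]
        if len(fields) != 3 or fields[1] != "InCB":
            continue                      # other properties, comments, blanks
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16), fields[2]))
    ranges.sort()
    return ranges


def lookup_incb(ranges, cp):
    """Binary-search the ranges; codepoints not listed default to "None"."""
    # (cp, 0x110000, "") sorts after every real range that starts at cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "None"
```

A sorted range list with binary search is one common layout for sparse properties like this; the generated Pony table may well use a different one.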
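Final tally on Unicode 16.0.0

- 146 unit tests
- Part 1: 19,965 / 19,965 (17,085 cps tracked)
- Part 2: 275,446 / 275,446
- GraphemeBreakTest: 1,093 / 1,093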
`make ci` runs all three suites; the PR workflow runs the same.
Test plan