0.2.0: UAX #29 word segmentation (Words) #3
Merged
Mirrors the Graphemes infrastructure for word boundaries:
unicode/word_break.pony: hand-written closed union with 20 primitive values
unicode_build/word_break_codes: codegen-side name → byte map
unicode_build/word_break_table: emits _UcdWordBreak from WordBreakProperty.txt + emoji-data (Extended_Pictographic for WB3c)
unicode/_word_break_cursor.pony: UAX #29 word-boundary state machine; two-pass design with a precomputed (offset, class) array so WB6, WB7b, and WB12 (which need lookahead) work cleanly
unicode/_word_iterators.pony: range + slice iterators
unicode/words.pony: Words topical primitive (count, ranges, iter)
unicode_word_conform_main: WordBreakTest.txt runner
make conform-word: runs the conformance suite
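As a rough illustration of the two-pass idea behind _word_break_cursor.pony (this is not the Pony code from this change), the Python sketch below first builds an (offset, class) array and then walks it, peeking past Extend/Format/ZWJ entries to resolve WB6. The classifier `toy_class`, the use of UTF-8 byte offsets, and all function names are assumptions made for the example; only WB1, WB2, WB4, WB5, WB6, WB7, and WB999 are modeled.

```python
# Illustrative sketch only: a toy subset of UAX #29 word rules.
TRANSPARENT = {"Extend", "Format", "ZWJ"}                 # skipped under WB4
MID_LETTER_Q = {"MidLetter", "MidNumLet", "Single_Quote"}


def toy_class(ch):
    """Hypothetical stand-in for a real Word_Break property lookup."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    return {
        ":": "MidLetter",
        ".": "MidNumLet",
        "'": "Single_Quote",
        "\u0308": "Extend",   # combining diaeresis
        "\u00ad": "Format",   # soft hyphen
        "\u200d": "ZWJ",
    }.get(ch, "Other")


def classify(text):
    """Pass 1: precompute (byte offset, class) for every code point."""
    out, offset = [], 0
    for ch in text:
        out.append((offset, toy_class(ch)))
        offset += len(ch.encode("utf-8"))
    return out


def next_effective(classes, i):
    """Index of the first non-transparent entry after i, or None."""
    j = i + 1
    while j < len(classes) and classes[j][1] in TRANSPARENT:
        j += 1
    return j if j < len(classes) else None


def boundaries(text):
    """Pass 2: walk the array and return boundary byte offsets."""
    classes = classify(text)
    if not classes:
        return []
    breaks = [0]                      # WB1: break at start of text
    prev = classes[0][1]              # effective previous class
    prev2 = None                      # effective class before that
    for i in range(1, len(classes)):
        offset, cls = classes[i]
        if cls in TRANSPARENT:
            continue                  # WB4: no break, lookback unchanged
        if prev == "ALetter" and cls == "ALetter":
            no_break = True           # WB5
        elif prev == "ALetter" and cls in MID_LETTER_Q:
            j = next_effective(classes, i)   # WB6 needs lookahead
            no_break = j is not None and classes[j][1] == "ALetter"
        elif prev2 == "ALetter" and prev in MID_LETTER_Q and cls == "ALetter":
            no_break = True           # WB7
        else:
            no_break = False          # WB999: break everywhere else
        if not no_break:
            breaks.append(offset)
        prev2, prev = prev, cls
    breaks.append(len(text.encode("utf-8")))  # WB2: break at end of text
    return breaks
```

With this toy classifier, `boundaries("can't stop")` returns `[0, 5, 6, 10]`: the apostrophe stays inside "can't" because the lookahead finds a letter after it. Having the whole array available up front means the lookahead is a plain index scan rather than extra state in the cursor.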
All rules WB1..WB16 implemented including the lookahead-dependent
ones (WB6 AHLetter × Mid* AHLetter, WB7b Hebrew × DQuote Hebrew,
WB12 Numeric × Mid* Numeric). WB4 transparency handled via
"effective" prev/prev2 state that skips Extend/Format/ZWJ.
Final tally on Unicode 16.0.0:
152 unit tests
NormalizationTest Part 1: 19,965 / 19,965
NormalizationTest Part 2: 275,446 / 275,446
GraphemeBreakTest: 1,093 / 1,093
WordBreakTest: 1,826 / 1,826 ← new (100% UAX #29)
First piece of the 0.2.0 segments theme. UAX #29 word boundaries — mirrors the Graphemes architecture.
What's new
WB4 transparency
WB4 ("X (Extend|Format|ZWJ)*") is handled by tracking "effective" prev/prev2 classes that skip transparent codepoints. The state machine maintains both the raw prev (for WB3, WB3a, WB3b, WB3c, WB3d) and the effective lookback (for WB5..WB13b).
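The split between raw and effective lookback can be pictured with a small Python sketch. It is an assumption-laden illustration, not the state machine from this PR: the `Lookback` class, its field names, and the `wb3c_applies` helper are hypothetical, and the WB4 exception after CR, LF, and Newline is left out.

```python
from dataclasses import dataclass
from typing import Optional

TRANSPARENT = frozenset({"Extend", "Format", "ZWJ"})


@dataclass
class Lookback:
    """Hypothetical lookback state: raw for WB3*, effective for WB5 onward."""
    raw_prev: Optional[str] = None    # class of the literal previous code point
    eff_prev: Optional[str] = None    # previous class, transparent ones skipped
    eff_prev2: Optional[str] = None   # class before eff_prev, same skipping

    def push(self, cls: str) -> None:
        # The raw slot always advances.
        self.raw_prev = cls
        # WB4: a transparent code point attaches to what came before it,
        # so the effective slots stay put (unless nothing precedes it).
        if cls in TRANSPARENT and self.eff_prev is not None:
            return
        self.eff_prev2, self.eff_prev = self.eff_prev, cls


def wb3c_applies(state: Lookback, cls: str) -> bool:
    """WB3c (ZWJ x Extended_Pictographic) looks at the raw previous class."""
    return state.raw_prev == "ZWJ" and cls == "Extended_Pictographic"


state = Lookback()
for c in ["ALetter", "Extend", "ZWJ"]:
    state.push(c)
```

After the three pushes, `state.raw_prev` is `"ZWJ"` while `state.eff_prev` is still `"ALetter"`, so a WB3c check sees the ZWJ and a WB5-style check sees the letter.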
Test results on Unicode 16.0.0
100% UAX #29 word conformance on first complete build.
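For readers unfamiliar with WordBreakTest.txt, each data line interleaves ÷ (break) and × (no break) with hexadecimal code points, followed by a `#` comment. The Python sketch below shows one way such a line could be parsed; `parse_test_line` is a hypothetical helper for illustration and is not taken from the conformance runner in this PR.

```python
def parse_test_line(line):
    """Parse one WordBreakTest.txt line.

    Returns (text, breaks), where breaks lists the code point indices at
    which a boundary is expected, or None for blank/comment-only lines.
    """
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    codepoints = []
    breaks = []
    for token in data.split():
        if token == "÷":
            breaks.append(len(codepoints))  # boundary before the next code point
        elif token == "×":
            continue                        # explicitly no boundary here
        else:
            codepoints.append(int(token, 16))
    return "".join(chr(cp) for cp in codepoints), breaks


sample = "÷ 0061 × 0308 ÷ 0062 ÷\t# illustrative comment"
text, expected = parse_test_line(sample)
```

For the sample line, `text` is `"a\u0308b"` and `expected` is `[0, 2, 3]`; a runner would compare that list against the boundaries its own segmenter reports for the same text.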
Test plan