0.2.0: UAX #29 sentence segmentation (Sentences) #4
Merged
Second piece of the 0.2.0 segments theme. Mirrors the Words
infrastructure but with the rule set tailored for sentences.
unicode/sentence_break.pony - hand-written closed union with 15 primitives
unicode_build/sentence_break_codes - codegen-side name → byte
unicode_build/sentence_break_table - emits _UcdSentenceBreak from SentenceBreakProperty
unicode/_sentence_break_cursor.pony - UAX #29 sentence-boundary state machine; two-pass design with backward lookback (SB7) and forward scan (SB8 Lower lookup)
unicode/_sentence_iterators.pony - range + slice iterators
unicode/sentences.pony - Sentences topical primitive
unicode_sentence_conform_main - SentenceBreakTest.txt runner
make conform-sentence - runs the suite
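For readers unfamiliar with the conformance data, here is a rough Python sketch of how one line of SentenceBreakTest.txt can be read. It is not the Pony runner from this PR, whose source is not shown here; the function name `parse_break_test_line` is hypothetical. It assumes the usual UCD break-test layout: `÷` marks an expected boundary, `×` marks an expected non-boundary, codepoints are hex, and everything after `#` is a comment.

```python
BREAK = "\u00f7"     # "÷": a boundary is expected at this position
NO_BREAK = "\u00d7"  # "×": no boundary is expected at this position


def parse_break_test_line(line):
    """Parse one test line into (codepoints, boundaries).

    Returns None for blank or comment-only lines. Boundaries are
    indexes into the codepoint list (0 = before the first codepoint).
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    codepoints = []
    boundaries = []
    for token in data.split():
        if token == BREAK:
            boundaries.append(len(codepoints))
        elif token == NO_BREAK:
            continue
        else:
            codepoints.append(int(token, 16))  # hex codepoint
    return codepoints, boundaries
```

A runner would then segment the decoded codepoints and compare its boundary list against the parsed one, counting a case as passed only when the two lists are equal.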
Sentences differ from graphemes/words in two important ways:
1. The default is NO break (SB998) — sentences only break at
paragraph separators (SB4) or after the end-of-sentence
terminator pattern (SB11). Most pairs of codepoints have no
applicable rule and the chain continues.
2. SB9 and SB10 must be distinguished by chain phase. SB9
(SATerm Close* × Close|Sp|ParaSep) applies only while the
chain has not seen any Sp; SB10 (SATerm Close* Sp* × Sp|ParaSep)
applies once at least one Sp has appeared and is more
restrictive — a Close in Sp-phase BREAKS rather than
extending. The `last_in_chain` value tells the cursor which
phase it's in.
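Point 1 above can be illustrated with a deliberately incomplete Python sketch (not the Pony cursor from this PR). It applies only SB1/SB2 (text edges), SB3 (CR × LF), SB4 (break after ParaSep) and SB998 (otherwise no break), and it assumes the input has already been mapped to Sentence_Break class names. The function name `default_boundaries` is hypothetical, and SB5 through SB11 are left out on purpose, so terminator-driven breaks are not found.

```python
PARASEP = {"Sep", "CR", "LF"}


def default_boundaries(classes):
    """Boundaries from SB1-SB4 and SB998 only (SB5-SB11 omitted)."""
    if not classes:
        return []
    boundaries = [0]  # SB1: break at start of text
    for i in range(1, len(classes)):
        prev, nxt = classes[i - 1], classes[i]
        if prev == "CR" and nxt == "LF":
            continue  # SB3: never split CR LF
        if prev in PARASEP:
            boundaries.append(i)  # SB4: break after a paragraph separator
        # SB998: any other pair continues the current sentence
    boundaries.append(len(classes))  # SB2: break at end of text
    return boundaries
```

With the classes `["OLetter", "Sp", "OLetter", "CR", "LF", "OLetter"]` this yields `[0, 5, 6]`: the only interior boundary is the one after the CR LF pair.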
Final tally on Unicode 16.0.0:
158 unit tests
NormalizationTest Part 1: 19,965 / 19,965
NormalizationTest Part 2: 275,446 / 275,446
GraphemeBreakTest: 1,093 / 1,093
WordBreakTest: 1,826 / 1,826
SentenceBreakTest: 512 / 512 ← new (100% UAX #29)
Second piece of the 0.2.0 segments theme. Sentences mirror the Words architecture but with a different rule shape.
What's new
Sentences differ from graphemes/words in two important ways
1. Default is NO break (SB998). Sentences only break at paragraph separators (SB4) or after the end-of-sentence terminator pattern matching SB11. Most pairs of codepoints have no applicable rule and the chain continues.
2. SB9 vs SB10 must be distinguished by chain phase. SB9 (`SATerm Close* × (Close|Sp|ParaSep)`) applies only while no Sp has appeared yet; SB10 (`SATerm Close* Sp* × (Sp|ParaSep)`) applies once at least one Sp has appeared, and is more restrictive: a Close in Sp-phase BREAKS rather than extending the chain. The `last_in_chain` value returned by `_saterm_lookback` tells the cursor which phase it's in. The initial implementation conflated the two and failed 6 cases, all of which featured `SATerm Close+ Sp Close → next-sentence` patterns; distinguishing the phase fixed all of them.
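The phase distinction in point 2 can be sketched roughly as follows. This is Python, not the Pony implementation. The names `saterm_lookback` and `breaks_after_chain` are illustrative stand-ins, and the real `_saterm_lookback` may return something shaped differently. The sketch assumes class names are already available per codepoint and omits SB5 (Extend/Format skipping) and the ATerm-specific rules SB6, SB7 and SB8.

```python
SATERM = {"STerm", "ATerm"}
PARASEP = {"Sep", "CR", "LF"}


def saterm_lookback(classes, i):
    """Classify the chain `SATerm Close* Sp*` ending just before index i.

    Returns None when no such chain ends there, otherwise the class of
    the last element in the chain: "SATerm", "Close" or "Sp".
    """
    j = i - 1
    saw_sp = saw_close = False
    while j >= 0 and classes[j] == "Sp":
        saw_sp = True
        j -= 1
    while j >= 0 and classes[j] == "Close":
        saw_close = True
        j -= 1
    if j < 0 or classes[j] not in SATERM:
        return None
    if saw_sp:
        return "Sp"
    return "Close" if saw_close else "SATerm"


def breaks_after_chain(classes, i):
    """True = break before classes[i], False = no break,
    None = not after a terminator chain (other rules decide)."""
    last_in_chain = saterm_lookback(classes, i)
    if last_in_chain is None:
        return None
    nxt = classes[i]
    if nxt == "SContinue" or nxt in SATERM:
        return False  # SB8a
    if last_in_chain != "Sp" and (nxt in {"Close", "Sp"} or nxt in PARASEP):
        return False  # SB9: only before any Sp has been seen
    if nxt == "Sp" or nxt in PARASEP:
        return False  # SB10: Sp-phase accepts only Sp or ParaSep
    return True       # SB11: the chain is over, so break
```

For `["STerm", "Close", "Sp", "Close", "OLetter"]` the calls at indexes 1 and 2 return `False` (SB9), while index 3 returns `True`: the second Close arrives in Sp-phase, so it starts the next sentence, which is the pattern behind the 6 failing cases described above.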
The cursor uses the same two-pass design as Words: decode all codepoints into a (offset, class) array, then run rules with full lookback (SB7 two-step) and forward scan (SB8 Lower lookup) available.
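A rough Python sketch of the two-pass idea, under stated assumptions: `toy_class` is a hypothetical ASCII-oriented stand-in for the generated `_UcdSentenceBreak` lookup, offsets are UTF-8 byte offsets, and `lower_follows` shows only the SB8 forward scan rather than the full rule set. None of these names come from the PR.

```python
def toy_class(ch):
    """Toy stand-in for the real Sentence_Break property lookup."""
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch == " ":
        return "Sp"
    if ch in "()\"'":
        return "Close"
    if ch == "\n":
        return "LF"
    if ch.islower():
        return "Lower"
    if ch.isupper():
        return "Upper"
    if ch.isdigit():
        return "Numeric"
    return "Other"


def decode(text):
    """Pass 1: build the (byte offset, class) array for the whole text."""
    entries = []
    offset = 0
    for ch in text:
        entries.append((offset, toy_class(ch)))
        offset += len(ch.encode("utf-8"))
    return entries


# Classes that stop the SB8 scan: OLetter, Upper, Lower, ParaSep, SATerm.
SB8_STOP = {"OLetter", "Upper", "Lower", "Sep", "CR", "LF", "STerm", "ATerm"}


def lower_follows(entries, i):
    """Pass 2 helper (SB8): skip non-stopping classes from index i and
    report whether the first stopping class is Lower."""
    j = i
    while j < len(entries) and entries[j][1] not in SB8_STOP:
        j += 1
    return j < len(entries) and entries[j][1] == "Lower"
```

Because pass 1 materializes every entry up front, a pass 2 rule can index backward or forward freely. For example, `lower_follows(decode("etc. and"), 5)` is true, so SB8 suppresses the break after the period, whereas it is false for `"etc. And"`.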
Test results on Unicode 16.0.0
100% UAX #29 sentence conformance.
Test plan