0.2.0: UAX #29 sentence segmentation (Sentences) by redvers · Pull Request #4 · contact-red/unicode

redvers · 2026-05-28T07:37:30Z

Second piece of the 0.2.0 segments theme. Sentences mirror the Words architecture but with a different rule shape.

What's new

`SentenceBreak` closed union (15 values: Other, CR, LF, Extend, Sep, Format, Sp, Lower, Upper, OLetter, Numeric, ATerm, STerm, Close, SContinue).
Auto-generated `_UcdSentenceBreak` cp-range table from `SentenceBreakProperty.txt`.
`_SentenceBreakCursor` state machine implementing SB1..SB11.
`Sentences` topical primitive with `count`, `ranges`, `iter`.
`make conform-sentence` runs SentenceBreakTest.txt (now part of `make ci`).
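
The PR does not show how the generated `_UcdSentenceBreak` table is laid out. As a rough illustration of what a cp-range lookup can look like, here is a minimal Python sketch (not the Pony code from this PR). The `RANGES` list is a small hand-copied excerpt for illustration only, and the names `RANGES` and `sentence_break` are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only: sorted, non-overlapping (first, last, class) ranges.
# The real table is generated from SentenceBreakProperty.txt and is far larger.
RANGES = [
    (0x0009, 0x0009, "Sp"),
    (0x000A, 0x000A, "LF"),
    (0x000B, 0x000C, "Sp"),
    (0x000D, 0x000D, "CR"),
    (0x0020, 0x0020, "Sp"),
    (0x0021, 0x0021, "STerm"),
    (0x0022, 0x0022, "Close"),
    (0x0027, 0x0029, "Close"),
    (0x002C, 0x002D, "SContinue"),
    (0x002E, 0x002E, "ATerm"),
    (0x0030, 0x0039, "Numeric"),
    (0x003F, 0x003F, "STerm"),
    (0x0041, 0x005A, "Upper"),
    (0x0061, 0x007A, "Lower"),
    (0x0085, 0x0085, "Sep"),
    (0x2028, 0x2029, "Sep"),
]
STARTS = [first for first, _, _ in RANGES]


def sentence_break(cp):
    """Binary-search the range table; unlisted codepoints default to Other."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Other"
```

Codepoints that fall in a gap between ranges resolve to `Other`, which matches the default value of the property.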

Sentences differ from graphemes/words in two important ways

1. The default is NO break (SB998). Sentences break only after paragraph separators (SB4) or after the end-of-sentence terminator pattern (SB11). For most pairs of codepoints no rule applies and the chain continues.

2. SB9 vs SB10 must be distinguished by chain phase. SB9 (`SATerm Close* × (Close|Sp|ParaSep)`) applies only while no Sp has appeared yet; SB10 (`SATerm Close* Sp* × (Sp|ParaSep)`) applies once at least one Sp has appeared, and is more restrictive — a Close in Sp-phase BREAKS rather than extending. The `last_in_chain` value returned by `_saterm_lookback` tells the cursor which phase it's in. The initial implementation conflated the two and failed 6 cases (all featured `SATerm Close+ Sp Close → next-sentence` patterns); distinguishing the phase fixed all of them.
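
The phase distinction can be sketched in a few lines. The following Python is a simplified illustration, not the Pony cursor from this PR. It assumes Extend/Format have already been absorbed (SB5) and covers only SB9, SB10, SB11 and SB998, leaving out SB6 through SB8a. The functions `saterm_lookback` and `breaks_before` are hypothetical analogues of the names mentioned above.

```python
PARA_SEP = {"Sep", "CR", "LF"}
SATERM = {"STerm", "ATerm"}


def saterm_lookback(classes, i):
    """If classes[:i] ends with SATerm Close* Sp*, return the last class
    in that chain (the phase marker); otherwise return None."""
    j = i - 1
    while j >= 0 and classes[j] == "Sp":      # skip the Sp* tail
        j -= 1
    while j >= 0 and classes[j] == "Close":   # then the Close* run
        j -= 1
    if j >= 0 and classes[j] in SATERM:       # chain must start at a terminator
        return classes[i - 1]
    return None


def breaks_before(classes, i):
    """Decide whether there is a sentence break before classes[i]."""
    last_in_chain = saterm_lookback(classes, i)
    if last_in_chain is None:
        return False                          # SB998: default is no break
    nxt = classes[i]
    # SB9: only while no Sp has been seen may Close extend the chain.
    if last_in_chain != "Sp" and (nxt in ("Close", "Sp") or nxt in PARA_SEP):
        return False
    # SB10: in Sp-phase only Sp or ParaSep extend the chain.
    if nxt == "Sp" or nxt in PARA_SEP:
        return False
    return True                               # SB11: break after the chain
```

For the class sequence `ATerm Close Sp Close`, this sketch reports no break before the first `Close` or before `Sp`, and a break before the second `Close`, which is the `SATerm Close+ Sp Close` shape described above.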

The cursor uses the same two-pass design as Words: decode all codepoints into a (offset, class) array, then run rules with full lookback (SB7 two-step) and forward scan (SB8 Lower lookup) available.
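
To make the two-pass idea concrete, here is a minimal Python sketch under stated assumptions; it is not the Pony implementation. Pass one records a UTF-8 byte offset and a class per codepoint. The two helpers show why random access to the class array is convenient for SB7 and SB8. The `classify` parameter and all function names are hypothetical, and Extend/Format handling (SB5) is omitted.

```python
def decode(text, classify):
    """Pass 1: build a list of (utf8_byte_offset, class) per codepoint."""
    out = []
    offset = 0
    for ch in text:
        out.append((offset, classify(ord(ch))))
        offset += len(ch.encode("utf-8"))
    return out


def sb7_applies(classes, i):
    """SB7: (Upper | Lower) ATerm x Upper, a two-step lookback from i."""
    return (
        i >= 2
        and i < len(classes)
        and classes[i] == "Upper"
        and classes[i - 1] == "ATerm"
        and classes[i - 2] in ("Upper", "Lower")
    )


# Classes that stop the SB8 forward scan: letters, ParaSep and SATerm.
SB8_STOP = {"OLetter", "Upper", "Lower", "Sep", "CR", "LF", "STerm", "ATerm"}


def sb8_lower_ahead(classes, i):
    """SB8 forward scan: skip non-stopping classes from i and report
    whether the first stopping class reached is Lower."""
    j = i
    while j < len(classes) and classes[j] not in SB8_STOP:
        j += 1
    return j < len(classes) and classes[j] == "Lower"
```

A single streaming pass would need to buffer an unbounded run of codepoints to answer the SB8 question, so materialising the array first keeps the rule code simple at the cost of one extra allocation.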

Test results on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 158 / 158 |
| NormalizationTest Part 1 | 19,965 / 19,965 |
| NormalizationTest Part 2 | 275,446 / 275,446 |
| GraphemeBreakTest | 1,093 / 1,093 |
| WordBreakTest | 1,826 / 1,826 |
| SentenceBreakTest | 512 / 512 |

100% UAX #29 sentence conformance.
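
For readers unfamiliar with the conformance file, each data line of SentenceBreakTest.txt lists hex codepoints separated by `÷` (break) or `×` (no break), followed by a `#` comment. A minimal Python sketch of parsing one such line is shown below; it is an illustration only and not the `unicode_sentence_conform_main` runner from this PR. The function name `parse_break_test_line` is hypothetical.

```python
def parse_break_test_line(line):
    """Return (codepoints, break_positions) for one test line,
    or None for blank or comment-only lines.

    break_positions are codepoint indices: 0 means before the first
    codepoint, len(codepoints) means after the last one.
    """
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    codepoints = []
    breaks = []
    for token in data.split():
        if token == "\u00f7":              # DIVISION SIGN: break here
            breaks.append(len(codepoints))
        elif token == "\u00d7":            # MULTIPLICATION SIGN: no break
            continue
        else:
            codepoints.append(int(token, 16))
    return codepoints, breaks
```

A runner would then segment the decoded codepoints and compare the boundaries it finds against `break_positions`, counting a case as passed only on an exact match.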

Test plan

`make ci` locally — all five suites green
PR CI runs the same

Second piece of the 0.2.0 segments theme. Mirrors the Words infrastructure but with the rule set tailored for sentences.

unicode/sentence_break.pony: hand-written closed union with 15 primitives
unicode_build/sentence_break_codes: codegen-side name → byte
unicode_build/sentence_break_table: emits _UcdSentenceBreak from SentenceBreakProperty
unicode/_sentence_break_cursor.pony: UAX #29 sentence-boundary state machine; two-pass design with backward lookback (SB7) and forward scan (SB8 Lower lookup)
unicode/_sentence_iterators.pony: range + slice iterators
unicode/sentences.pony: Sentences topical primitive
unicode_sentence_conform_main: SentenceBreakTest.txt runner
make conform-sentence: runs the suite

Sentences differ from graphemes/words in two important ways:

1. The default is NO break (SB998): sentences only break at paragraph separators (SB4) or after the end-of-sentence terminator pattern (SB11). Most pairs of codepoints have no applicable rule and the chain continues.

2. SB9 and SB10 must be distinguished by chain phase. SB9 (SATerm Close* × Close|Sp|ParaSep) applies only while the chain has not seen any Sp; SB10 (SATerm Close* Sp* × Sp|ParaSep) applies once at least one Sp has appeared and is more restrictive: a Close in Sp-phase BREAKS rather than extending. The `last_in_chain` value tells the cursor which phase it's in.

Final tally on Unicode 16.0.0:

158 unit tests
NormalizationTest Part 1: 19,965 / 19,965
NormalizationTest Part 2: 275,446 / 275,446
GraphemeBreakTest: 1,093 / 1,093
WordBreakTest: 1,826 / 1,826
SentenceBreakTest: 512 / 512 ← new (100% UAX #29)

redvers merged commit 1cdc57b into main May 28, 2026
6 checks passed

redvers merged 1 commit into main from segments/sentences
