0.2.0: UAX #29 sentence segmentation (Sentences) #4

Merged: redvers merged 1 commit into main from segments/sentences on May 28, 2026

redvers commented on May 28, 2026

Second piece of the 0.2.0 segments theme. Sentences mirror the Words architecture but with a different rule shape.

What's new

  • `SentenceBreak` closed union (15 values: Other, CR, LF, Extend, Sep, Format, Sp, Lower, Upper, OLetter, Numeric, ATerm, STerm, Close, SContinue).
  • Auto-generated `_UcdSentenceBreak` cp-range table from `SentenceBreakProperty.txt`.
  • `_SentenceBreakCursor` state machine implementing SB1..SB11.
  • `Sentences` topical primitive with `count`, `ranges`, `iter`.
  • `make conform-sentence` runs SentenceBreakTest.txt (now part of `make ci`).
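
To make the cp-range table idea concrete, here is a minimal sketch in Python rather than Pony. It is an illustration of a sorted range table with binary-search lookup, not the generated `_UcdSentenceBreak` code. The names `RANGES`, `STARTS`, and `sentence_break` are hypothetical, and the table holds only a handful of ASCII entries instead of the full contents of `SentenceBreakProperty.txt`.

```python
from bisect import bisect_right

# Hypothetical miniature of a generated range table: sorted, non-overlapping
# (first, last, class) entries. Only a few ASCII ranges are listed here; a
# real table would be generated from SentenceBreakProperty.txt.
RANGES = [
    (0x0009, 0x0009, "Sp"),
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0020, 0x0020, "Sp"),
    (0x0021, 0x0021, "STerm"),
    (0x0022, 0x0022, "Close"),
    (0x002E, 0x002E, "ATerm"),
    (0x0030, 0x0039, "Numeric"),
    (0x003F, 0x003F, "STerm"),
    (0x0041, 0x005A, "Upper"),
    (0x0061, 0x007A, "Lower"),
]
STARTS = [first for first, _, _ in RANGES]


def sentence_break(cp: int) -> str:
    """Return the class of codepoint `cp`, defaulting to "Other"."""
    # Find the last range whose first codepoint is <= cp.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, last, cls = RANGES[i]
        if cp <= last:
            return cls
    # Codepoints that fall in a gap between ranges have no explicit class.
    return "Other"
```

Codepoints not covered by any range fall through to `Other`, which matches how the property file treats unlisted codepoints.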

Sentences differ from graphemes/words in two important ways

1. Default is NO break (SB998). Sentences break only at paragraph separators (SB4) or after an end-of-sentence terminator sequence that falls through to SB11. Most pairs of codepoints match no rule, so the chain simply continues.

2. SB9 vs SB10 must be distinguished by chain phase. SB9 (`SATerm Close* × (Close|Sp|ParaSep)`) applies only while no Sp has appeared yet; SB10 (`SATerm Close* Sp* × (Sp|ParaSep)`) applies once at least one Sp has appeared, and is more restrictive: a Close in Sp-phase BREAKS rather than extending the chain. The `last_in_chain` value returned by `_saterm_lookback` tells the cursor which phase it is in. The initial implementation conflated the two rules and failed 6 cases, all of which featured `SATerm Close+ Sp Close → next-sentence` patterns; distinguishing the phase fixed all of them.
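
The phase distinction in point 2 can be sketched as follows. This is a simplified Python illustration, not the Pony cursor: `saterm_lookback` and `breaks_before` are hypothetical analogues that operate on a plain list of class names, skip the Extend/Format handling of SB5, and leave out SB6 through SB8a, so they are only meaningful for inputs where those rules do not apply.

```python
SATERM = {"ATerm", "STerm"}
PARASEP = {"Sep", "CR", "LF"}


def saterm_lookback(classes, i):
    """Scan backward from the boundary before classes[i].

    Returns None when the preceding text is not `SATerm Close* Sp*`.
    Otherwise returns the class of the last element in the chain:
    "Sp", "Close", or the terminator class itself.
    """
    j = i - 1
    last = None
    while j >= 0 and classes[j] == "Sp":
        last = last or "Sp"
        j -= 1
    while j >= 0 and classes[j] == "Close":
        last = last or "Close"
        j -= 1
    if j >= 0 and classes[j] in SATERM:
        return last or classes[j]
    return None


def breaks_before(classes, i):
    """Decide the boundary before classes[i] using SB9, SB10, SB11, SB998."""
    last_in_chain = saterm_lookback(classes, i)
    if last_in_chain is None:
        return False  # SB998: no terminator chain, so no break
    nxt = classes[i]
    if nxt == "Sp" or nxt in PARASEP:
        return False  # SB9 / SB10: the chain absorbs Sp and ParaSep
    if nxt == "Close" and last_in_chain != "Sp":
        return False  # SB9 only: Close extends the chain before any Sp
    return True  # SB11: break after the completed chain


# Classes for a pattern shaped like `Ab.) (Cd`.
classes = ["Upper", "Lower", "ATerm", "Close", "Sp", "Close", "Upper", "Lower"]
breaks = [i for i in range(1, len(classes)) if breaks_before(classes, i)]
```

In this example the only break falls before the second Close (index 5): the chain has already seen an Sp, so SB9 no longer applies and SB11 fires.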

The cursor uses the same two-pass design as Words: decode all codepoints into a (offset, class) array, then run rules with full lookback (SB7 two-step) and forward scan (SB8 Lower lookup) available.
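
A rough Python sketch of the two-pass shape, under stated assumptions: `classify` is a stand-in that covers a few ASCII characters only, offsets are taken to be UTF-8 byte offsets (the description above says only "offset"), and `lower_follows` is a hypothetical name for the SB8 forward scan. None of this is the Pony code from this PR.

```python
def classify(ch: str) -> str:
    """Stand-in classifier for a few ASCII cases; everything else is Other."""
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch == " ":
        return "Sp"
    if ch in "\"()":
        return "Close"
    if "a" <= ch <= "z":
        return "Lower"
    if "A" <= ch <= "Z":
        return "Upper"
    if "0" <= ch <= "9":
        return "Numeric"
    return "Other"


def decode(text: str):
    """Pass 1: one (utf8_byte_offset, class) entry per codepoint."""
    entries, offset = [], 0
    for ch in text:
        entries.append((offset, classify(ch)))
        offset += len(ch.encode("utf-8"))
    return entries


# Classes that stop the SB8 scan: OLetter, Upper, Lower, ParaSep, SATerm.
STOP = {"OLetter", "Upper", "Lower", "Sep", "CR", "LF", "STerm", "ATerm"}


def lower_follows(entries, i) -> bool:
    """SB8 forward scan: skip classes outside STOP, then require Lower."""
    while i < len(entries) and entries[i][1] not in STOP:
        i += 1
    return i < len(entries) and entries[i][1] == "Lower"
```

With the whole array available, the rule pass can index backward or forward freely. For `decode("€5. ok")`, the scan starting just after the ATerm finds a Lower, so SB8 would suppress the break; for `decode("A. (1) B")`, the scan stops at an Upper and the break is left to the later rules.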

Test results on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 158 / 158 |
| NormalizationTest Part 1 | 19,965 / 19,965 |
| NormalizationTest Part 2 | 275,446 / 275,446 |
| GraphemeBreakTest | 1,093 / 1,093 |
| WordBreakTest | 1,826 / 1,826 |
| SentenceBreakTest | 512 / 512 |

100% UAX #29 sentence conformance.

Test plan

  • `make ci` locally — all five suites green
  • PR CI runs the same

Second piece of the 0.2.0 segments theme. Mirrors the Words
infrastructure but with the rule set tailored for sentences.

  unicode/sentence_break.pony           — hand-written closed
                                          union with 15 primitives
  unicode_build/sentence_break_codes    — codegen-side name → byte
  unicode_build/sentence_break_table    — emits _UcdSentenceBreak
                                          from SentenceBreakProperty
  unicode/_sentence_break_cursor.pony   — UAX #29 sentence-boundary
                                          state machine; two-pass
                                          design with backward
                                          lookback (SB7) and forward
                                          scan (SB8 Lower lookup)
  unicode/_sentence_iterators.pony      — range + slice iterators
  unicode/sentences.pony                — Sentences topical primitive
  unicode_sentence_conform_main         — SentenceBreakTest.txt runner
  make conform-sentence                 — runs the suite

Sentences differ from graphemes/words in two important ways:

  1. The default is NO break (SB998) — sentences only break at
     paragraph separators (SB4) or after the end-of-sentence
     terminator pattern (SB11). Most pairs of codepoints have no
     applicable rule and the chain continues.
  2. SB9 and SB10 must be distinguished by chain phase. SB9
     (SATerm Close* × Close|Sp|ParaSep) applies only while the
     chain has not seen any Sp; SB10 (SATerm Close* Sp* × Sp|ParaSep)
     applies once at least one Sp has appeared and is more
     restrictive — a Close in Sp-phase BREAKS rather than
     extending. The `last_in_chain` value tells the cursor which
     phase it's in.

Final tally on Unicode 16.0.0:

  158 unit tests
  NormalizationTest Part 1:  19,965 / 19,965
  NormalizationTest Part 2:  275,446 / 275,446
  GraphemeBreakTest:         1,093  / 1,093
  WordBreakTest:             1,826  / 1,826
  SentenceBreakTest:         512    / 512    ← new (100% UAX #29)

redvers merged commit 1cdc57b into main on May 28, 2026
6 checks passed