Unicode-correct text processing for Pony — graphemes, normalization, case folding, search, segmentation, and more.
Pre-release (VERSION 0.0.0). Foundation milestones M0–M9 complete:
- Validated UTF-8
Textwith optional grapheme bitmap index - Phantom-typed
ByteIndex/CodepointIndex/GraphemeIndex - UAX #29 extended grapheme cluster iteration
- UAX #15 normalization (NFC / NFD / NFKC / NFKD) — 100% NormalizationTest.txt conformance (18,992/18,992 test cases pass)
- Case operations (upper / lower / title / fold) with full multi-cp expansions
- Comparison primitives (byte / canonical / compat / caseless / caseless-canonical per UAX #21 D146)
- Search / Split / Trim / Replace on UTF-8
- Full UCD-backed predicates: 30 General Categories, 163 Scripts, ~58 binary properties, ~30k Unicode names, case mappings, decomposition tables
146 PonyCheck unit tests + the NormalizationTest conformance suite.
- Install corral
corral add github.com/contact-red/unicode.git --version 0.1.0corral fetchto fetch dependenciesuse "unicode"to include this packagecorral run -- ponycto compile your application
use "unicode"
actor Main
new create(env: Env) =>
try
// Validated UTF-8, codepoint/grapheme counts
let t = Text.from_string("café 🇫🇷👨👩👧")?
env.out.print("graphemes: " + t.size_graphemes().string())
env.out.print("codepoints: " + t.size_codepoints().string())
env.out.print("bytes: " + t.size_bytes().string())
// Normalization — precomposed and decomposed forms compare equal
let pre = "café" // pre-composed é (U+00E9)
let dec = "café" // e + combining acute
match Compares.equal_canonical(pre, dec)
| true => env.out.print("canonically equal: yes")
end
// Case folding for caseless matching
match Compares.equal_caseless("MASSE", "Maße")
| true => env.out.print("Maße == MASSE under fold")
end
// Search / Split / Trim / Replace
match Search.contains("hello world", "world")
| true => env.out.print("found")
end
match Replace.all("foo bar foo", "foo", "qux")
| let s: String iso => env.out.print(consume s) // "qux bar qux"
end
// Per-codepoint properties
env.out.print("'A' script: " +
match Codepoints.script(U32('A'))
| let _: ScriptLatin => "Latin"
else "other"
end)
else
env.out.print("invalid UTF-8")
endGenerated by pony-doc. See design notes below for the full surface plan.
The package is dual-surface, single-truth, Unicode-correct text processing:
- Typed surface:
class Text(default capval) — the canonical entry point. Constructors validate UTF-8 via?partials; the invariant lets methods skip re-validation. - Free-function surface: topical primitives (
Graphemes,Codepoints,Search,Split,Trim,Replace,Normalize,Case,Compare, etc.) for one-shot operations overString box. - Both surfaces delegate to the same package-private underscore methods on the topical primitives — behavior cannot drift.
Key design choices:
- No
_form(normal-form) tag on Text. Normalization is explicit; callers control it. Codepointis aclass valwrappingU32. Hot iteration yields bareU32viaText.codepoints()(no per-element allocation); typed-formCodepoints.from_u32(u)+Text.codepoints_typed()is available for the type-safe boundary case.- Graphemes are
String valslices, not a distinct type. UAX #29 cluster boundaries; iterators yieldString val(zero byte-copy) or(USize, USize)byte ranges for zero-allocation paths. - Phantom-typed indices (
ByteIndex,CodepointIndex,GraphemeIndex) prevent unit confusion at compile time. - Optional bitmap index opt-in at
Textconstruction for fast random grapheme access (~12.5% memory overhead). - Closed unions for
Script,Category,Property,NormalForm— exhaustive matching; Unicode-version bumps treated as semver-significant. - UCD generated at build time by the
unicode-buildtool; compiled-in asvalstatic tables; Pony's dead-code elimination strips unused tables at link time.
| Release | Theme |
|---|---|
0.1.0 |
Foundation: validated UTF-8 Text, indices, graphemes, codepoints, names, normalize, case-fold, compare, everyday text ops (search/split/trim/replace/insert/delete) |
0.2.0 |
Segments (words/sentences/lines), scripts |
0.3.0 |
Encodings beyond UTF-8 (Latin-1, UTF-16, UTF-32, …) |
0.4.0 |
Locale-aware collation |
0.5.0 |
Confusables + safe identifier matching (eq_identifier) |
0.6.0 |
IDNA |
0.7.0 |
Bidi (UAX #9) |
1.0.0 |
API freeze |
See design/candidate-v3.md for the full design document. Note: eq_caseless_normalized (lands in 0.1.0) is NOT safe for security-critical identifier matching against homograph attacks — wait for 0.5.0 eq_identifier. See "Identifier matching" section in the design document.
make test # build and run the test suite
make cleanTBD