unicode

Unicode-correct text processing for Pony — graphemes, normalization, case folding, search, segmentation, and more.

Status

Pre-release (VERSION 0.0.0). Foundation milestones M0–M9 complete:

Validated UTF-8 Text with optional grapheme bitmap index
Phantom-typed ByteIndex / CodepointIndex / GraphemeIndex
UAX #29 extended grapheme cluster iteration
UAX #15 normalization (NFC / NFD / NFKC / NFKD) — 100% NormalizationTest.txt conformance (18,992/18,992 test cases pass)
Case operations (upper / lower / title / fold) with full multi-cp expansions
Comparison primitives (byte / canonical / compat / caseless / caseless-canonical per UAX #21 D146)
Search / Split / Trim / Replace on UTF-8
Full UCD-backed predicates: 30 General Categories, 163 Scripts, ~58 binary properties, ~30k Unicode names, case mappings, decomposition tables

146 PonyCheck unit tests + the NormalizationTest conformance suite.

Installation

Install corral
Run corral add github.com/contact-red/unicode.git --version 0.1.0 to add the dependency
Run corral fetch to fetch dependencies
Add use "unicode" to your source files to include this package
Run corral run -- ponyc to compile your application

Quick start

use "unicode"

actor Main
  new create(env: Env) =>
    try
      // Validated UTF-8, codepoint/grapheme counts
      let t = Text.from_string("café 🇫🇷👨‍👩‍👧")?
      env.out.print("graphemes:  " + t.size_graphemes().string())
      env.out.print("codepoints: " + t.size_codepoints().string())
      env.out.print("bytes:      " + t.size_bytes().string())

      // Normalization — precomposed and decomposed forms compare equal
      let pre = "caf\u00E9"   // pre-composed é (U+00E9)
      let dec = "cafe\u0301"  // e + combining acute (U+0301)
      match Compares.equal_canonical(pre, dec)
      | true => env.out.print("canonically equal: yes")
      end

      // Case folding for caseless matching
      match Compares.equal_caseless("MASSE", "Maße")
      | true => env.out.print("Maße == MASSE under fold")
      end

      // Search / Split / Trim / Replace
      match Search.contains("hello world", "world")
      | true => env.out.print("found")
      end
      match Replace.all("foo bar foo", "foo", "qux")
      | let s: String iso => env.out.print(consume s)  // "qux bar qux"
      end

      // Per-codepoint properties
      env.out.print("'A' script: " +
        match Codepoints.script(U32('A'))
        | let _: ScriptLatin => "Latin"
        else "other"
        end)
    else
      env.out.print("invalid UTF-8")
    end
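
The three counts in the Quick start differ because each measures a different unit. As a rough check that needs only the Pony standard library, the sketch below spells out the same sample string with escapes and prints its byte and codepoint counts. `Sample` is a hypothetical helper name, and the é is assumed to be the precomposed U+00E9. The standard library has no grapheme counter, so the grapheme figure is left to the package; under UAX #29 rules the string groups into 7 clusters (c, a, f, é, space, the flag, the family).

```pony
// Standard library only; Sample is a hypothetical helper, not part of the package.
primitive Sample
  fun text(): String =>
    // "café", space, regional indicators F + R, then man ZWJ woman ZWJ girl
    "caf\u00E9 \U01F1EB\U01F1F7\U01F468\u200D\U01F469\u200D\U01F467"

actor Main
  new create(env: Env) =>
    let s = Sample.text()
    // 3 ASCII + 2 (é) + 1 (space) + 2*4 (flag) + 3*4 + 2*3 (family) = 32 bytes
    env.out.print("bytes:      " + s.size().string())
    // 4 + 1 + 2 + 5 = 12 codepoints
    env.out.print("codepoints: " + s.codepoints().string())
```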

API documentation

Generated by pony-doc. See design notes below for the full surface plan.

Design

The package exposes two API surfaces backed by a single implementation:

Typed surface: class Text (default cap val) — the canonical entry point. Constructors validate UTF-8 via ? partials; the invariant lets methods skip re-validation.
Free-function surface: topical primitives (Graphemes, Codepoints, Search, Split, Trim, Replace, Normalize, Case, Compare, etc.) for one-shot operations over String box.
Both surfaces delegate to the same package-private underscore methods on the topical primitives — behavior cannot drift.
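
The delegation pattern described above can be sketched in a few lines. This is an illustration only: `MyText` and `Counts` are hypothetical names, not the package's types, and the shared private method simply calls the standard library's `String.codepoints()`.

```pony
// Hypothetical names throughout; this is not the package's real code.
primitive Counts
  // Free-function surface: one-shot call on any readable String.
  fun codepoints(s: String box): USize =>
    _codepoints(s)

  // Package-private implementation shared by both surfaces.
  fun _codepoints(s: String box): USize =>
    s.codepoints()

class val MyText
  let _s: String val

  new val create(s: String val) =>
    _s = s

  // Typed surface: delegates to the same private method.
  fun size_codepoints(): USize =>
    Counts._codepoints(_s)

actor Main
  new create(env: Env) =>
    let s = "caf\u00E9"
    env.out.print(Counts.codepoints(s).string())        // 4
    env.out.print(MyText(s).size_codepoints().string()) // 4
```

Because both public methods route through `_codepoints`, a fix or behavior change made there reaches both surfaces at once.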

Key design choices:

No _form (normal-form) tag on Text. Normalization is explicit; callers control it.
Codepoint is a class val wrapping U32. Hot iteration yields bare U32 values via Text.codepoints() (no per-element allocation); the typed forms Codepoints.from_u32(u) and Text.codepoints_typed() are available where type safety matters at an API boundary.
Graphemes are String val slices, not a distinct type. Boundaries follow UAX #29 extended grapheme clusters; iterators yield String val (zero byte-copy) or (USize, USize) byte ranges for zero-allocation paths.
Phantom-typed indices (ByteIndex, CodepointIndex, GraphemeIndex) prevent unit confusion at compile time.
Bitmap index is opt-in at Text construction, for fast random grapheme access (~12.5% memory overhead).
Closed unions for Script, Category, Property, NormalForm — exhaustive matching; Unicode-version bumps treated as semver-significant.
UCD generated at build time by the unicode-build tool; compiled-in as val static tables; Pony's dead-code elimination strips unused tables at link time.
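
To show how phantom-typed indices can rule out unit confusion at compile time, here is a minimal sketch. It assumes one possible encoding (a generic wrapper whose type parameter is never used at runtime); the package's actual ByteIndex, CodepointIndex, and GraphemeIndex definitions may look different. `Index`, `ByteUnit`, `CodepointUnit`, `ByteIdx`, `CodepointIdx`, and `Slicer` are all hypothetical names.

```pony
// Hypothetical sketch; not the package's real index types.
primitive ByteUnit
primitive CodepointUnit

// Unit is a phantom parameter: it tags the type but stores nothing.
class val Index[Unit: Any val]
  let value: USize

  new val create(value': USize) =>
    value = value'

type ByteIdx is Index[ByteUnit]
type CodepointIdx is Index[CodepointUnit]

primitive Slicer
  // Accepts only byte offsets; errors if the offset is out of range.
  fun byte_at(s: String box, i: ByteIdx): U8 ? =>
    s(i.value)?

actor Main
  new create(env: Env) =>
    let s = "caf\u00E9"
    try
      // Byte 3 is the first byte of the UTF-8 encoding of é (0xC3 = 195).
      env.out.print(Slicer.byte_at(s, ByteIdx(3))?.string())
    end
    // Slicer.byte_at(s, CodepointIdx(3))? does not type-check:
    // Index[CodepointUnit] is not a subtype of Index[ByteUnit].
```

The wrapper costs one small object per index in this sketch; whether the real package pays that cost or avoids it is not stated in this README.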

Release plan

Release	Theme
`0.1.0`	Foundation: validated UTF-8 `Text`, indices, graphemes, codepoints, names, normalize, case-fold, compare, everyday text ops (search/split/trim/replace/insert/delete)
`0.2.0`	Segments (words/sentences/lines), scripts
`0.3.0`	Encodings beyond UTF-8 (Latin-1, UTF-16, UTF-32, …)
`0.4.0`	Locale-aware collation
`0.5.0`	Confusables + safe identifier matching (`eq_identifier`)
`0.6.0`	IDNA
`0.7.0`	Bidi (UAX #9)
`1.0.0`	API freeze

See design/candidate-v3.md for the full design document.

Note: eq_caseless_normalized (lands in 0.1.0) is NOT safe for security-critical identifier matching, because it does not defend against homograph attacks. For that use case, wait for eq_identifier in 0.5.0. See the "Identifier matching" section in the design document.
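
The homograph caveat is easy to see with a lookalike pair. The sketch below uses only the Pony standard library and a hypothetical `Lookalikes` helper: the second string swaps the Latin "a" for CYRILLIC SMALL LETTER A (U+0430). That letter has no decomposition and case-folds to itself, so neither normalization nor case folding maps it to the Latin letter; the two strings stay distinct even though they can render identically.

```pony
// Standard library only; Lookalikes is a hypothetical helper.
primitive Lookalikes
  fun latin(): String => "paypal"
  // Second letter is CYRILLIC SMALL LETTER A (U+0430), not Latin "a".
  fun mixed(): String => "p\u0430ypal"

actor Main
  new create(env: Env) =>
    let a = Lookalikes.latin()
    let b = Lookalikes.mixed()
    // Same number of codepoints, but different codepoints and bytes.
    env.out.print("equal:      " + (a == b).string())
    env.out.print("codepoints: " + a.codepoints().string() +
      " vs " + b.codepoints().string())
    env.out.print("bytes:      " + a.size().string() +
      " vs " + b.size().string())
```

Catching pairs like this needs confusable detection, which the release plan lists under 0.5.0.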

Building

make test     # build and run the test suite
make clean

License

TBD
