Skip to content

contact-red/unicode

Repository files navigation

unicode

CI

Unicode-correct text processing for Pony — graphemes, normalization, case folding, search, segmentation, and more.

Status

Pre-release (VERSION 0.0.0). Foundation milestones M0–M9 complete:

  • Validated UTF-8 Text with optional grapheme bitmap index
  • Phantom-typed ByteIndex / CodepointIndex / GraphemeIndex
  • UAX #29 extended grapheme cluster iteration
  • UAX #15 normalization (NFC / NFD / NFKC / NFKD) — 100% NormalizationTest.txt conformance (18,992/18,992 test cases pass)
  • Case operations (upper / lower / title / fold) with full multi-cp expansions
  • Comparison primitives (byte / canonical / compat / caseless / caseless-canonical per UAX #21 D146)
  • Search / Split / Trim / Replace on UTF-8
  • Full UCD-backed predicates: 30 General Categories, 163 Scripts, ~58 binary properties, ~30k Unicode names, case mappings, decomposition tables

146 PonyCheck unit tests + the NormalizationTest conformance suite.

Installation

  • Install corral
  • corral add github.com/contact-red/unicode.git --version 0.1.0
  • corral fetch to fetch dependencies
  • use "unicode" to include this package
  • corral run -- ponyc to compile your application

Quick start

use "unicode"

actor Main
  new create(env: Env) =>
    try
      // Validated UTF-8, codepoint/grapheme counts
      let t = Text.from_string("café 🇫🇷👨‍👩‍👧")?
      env.out.print("graphemes:  " + t.size_graphemes().string())
      env.out.print("codepoints: " + t.size_codepoints().string())
      env.out.print("bytes:      " + t.size_bytes().string())

      // Normalization — precomposed and decomposed forms compare equal
      let pre = "café"  // pre-composed é (U+00E9)
      let dec = "café"  // e + combining acute
      match Compares.equal_canonical(pre, dec)
      | true => env.out.print("canonically equal: yes")
      end

      // Case folding for caseless matching
      match Compares.equal_caseless("MASSE", "Maße")
      | true => env.out.print("Maße == MASSE under fold")
      end

      // Search / Split / Trim / Replace
      match Search.contains("hello world", "world")
      | true => env.out.print("found")
      end
      match Replace.all("foo bar foo", "foo", "qux")
      | let s: String iso => env.out.print(consume s)  // "qux bar qux"
      end

      // Per-codepoint properties
      env.out.print("'A' script: " +
        match Codepoints.script(U32('A'))
        | let _: ScriptLatin => "Latin"
        else "other"
        end)
    else
      env.out.print("invalid UTF-8")
    end

API documentation

Generated by pony-doc. See design notes below for the full surface plan.

Design

The package is dual-surface, single-truth, Unicode-correct text processing:

  • Typed surface: class Text (default cap val) — the canonical entry point. Constructors validate UTF-8 via ? partials; the invariant lets methods skip re-validation.
  • Free-function surface: topical primitives (Graphemes, Codepoints, Search, Split, Trim, Replace, Normalize, Case, Compare, etc.) for one-shot operations over String box.
  • Both surfaces delegate to the same package-private underscore methods on the topical primitives — behavior cannot drift.

Key design choices:

  • No _form (normal-form) tag on Text. Normalization is explicit; callers control it.
  • Codepoint is a class val wrapping U32. Hot iteration yields bare U32 via Text.codepoints() (no per-element allocation); typed-form Codepoints.from_u32(u) + Text.codepoints_typed() is available for the type-safe boundary case.
  • Graphemes are String val slices, not a distinct type. UAX #29 cluster boundaries; iterators yield String val (zero byte-copy) or (USize, USize) byte ranges for zero-allocation paths.
  • Phantom-typed indices (ByteIndex, CodepointIndex, GraphemeIndex) prevent unit confusion at compile time.
  • Optional bitmap index opt-in at Text construction for fast random grapheme access (~12.5% memory overhead).
  • Closed unions for Script, Category, Property, NormalForm — exhaustive matching; Unicode-version bumps treated as semver-significant.
  • UCD generated at build time by the unicode-build tool; compiled-in as val static tables; Pony's dead-code elimination strips unused tables at link time.

Release plan

Release Theme
0.1.0 Foundation: validated UTF-8 Text, indices, graphemes, codepoints, names, normalize, case-fold, compare, everyday text ops (search/split/trim/replace/insert/delete)
0.2.0 Segments (words/sentences/lines), scripts
0.3.0 Encodings beyond UTF-8 (Latin-1, UTF-16, UTF-32, …)
0.4.0 Locale-aware collation
0.5.0 Confusables + safe identifier matching (eq_identifier)
0.6.0 IDNA
0.7.0 Bidi (UAX #9)
1.0.0 API freeze

See design/candidate-v3.md for the full design document. Note: eq_caseless_normalized (lands in 0.1.0) is NOT safe for security-critical identifier matching against homograph attacks — wait for 0.5.0 eq_identifier. See "Identifier matching" section in the design document.

Building

make test     # build and run the test suite
make clean

License

TBD

About

Unicode implementation in pure pony

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors