Skip to content

regex: \p{Surrogate} Unicode property rejected ('invalid pattern') — Rust regex crate has no surrogate scalar values; blocks string-width/ink #4884

@proggeramlug

Description

@proggeramlug

Summary

The Unicode property escape \p{Surrogate} (general category Cs) is rejected at regex-compile time: new RegExp('\\p{Surrogate}', 'u')SyntaxError: Invalid regular expression: /\p{Surrogate}/: invalid pattern. Every other property in the same family works.

Root cause: Perry compiles JS regexes to the Rust regex crate, which matches over Unicode scalar values (UTF-8) — surrogate code points (U+D800–U+DFFF) can't occur there, so the crate rejects the Surrogate/Cs property outright instead of treating it as a never-matching class.

This is the next ink wall after #4877 (Segmenter). string-width/index.js builds two regexes at module top-level that include \p{Surrogate}, so importing string-width (→ ink) throws at init before any user code:

// node_modules/string-width/index.js:19
const zeroWidthClusterRegex   = /^(?:\p{Default_Ignorable_Code_Point}|\p{Control}|\p{Format}|\p{Mark}|\p{Surrogate})+$/v;
// :22
const leadingNonPrintingRegex = /^[\p{Default_Ignorable_Code_Point}\p{Control}\p{Format}\p{Mark}\p{Surrogate}]+/v;

Isolation (exactly one token is at fault)

OK    /\p{Control}/u
OK    /\p{Mark}/u
OK    /\p{Format}/u
OK    /\p{Default_Ignorable_Code_Point}/u
OK    /[\p{Control}]/v                         ← the /v (unicodeSets) flag itself is fine
FAIL  /\p{Surrogate}/u                         → invalid pattern
FAIL  /^(?:…|\p{Surrogate})+$/u                → invalid pattern (same with /v)

So it's specifically \p{Surrogate} — not the /v flag, not the other properties.

Expected (Node)

/\p{Surrogate}/u compiles and matches lone surrogate code units ("\uD800"). Node/V8 matches per UTF-16 code unit, so surrogates are matchable.

Suggested fix

In the JS→Rust regex translation (crates/perry-runtime/src/regex/compile.rs / grammar.rs), special-case \p{Surrogate} / \p{gc=Cs} / \P{Surrogate} rather than passing it through to the Rust crate:

Impact

  • string-width@7+wrap-ansi, cli-truncate, slice-ansiink end-to-end (Compile ink (React-based TUI framework) end-to-end via perry.compilePackages #348). After this, ink's next gate is yoga-layout's WASM runtime (the documented out-of-scope rock).
  • Any package using \p{Surrogate} for input sanitization / width calculation.
  • Sits at the intersection of the two known categorical gaps in CLAUDE.md (Rust regex crate limits + lone-surrogate/WTF-8 handling), but is a narrow, self-contained translation fix.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions