Summary
The Unicode property of strings \p{RGI_Emoji} (ES2024 /v/unicodeSets only) is rejected at compile time: /^\p{RGI_Emoji}$/v → SyntaxError: Invalid regular expression: /^\p{RGI_Emoji}$/: invalid pattern.
Unlike ordinary \p{…} character properties (which match a single code point), a property of strings can match a multi-code-point cluster (emoji ZWJ sequences, flags, keycaps, skin-tone modifiers). Rust's regex crate has no support for properties of strings, so the token is unrepresentable as-is.
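To make the distinction concrete, here is a small JavaScript sketch. The helper name `codePoints` is made up for illustration. It shows that one visible emoji can span several code points, which is why an ordinary single-code-point property anchored with ^…$ cannot stand in for a property of strings.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
  return [...s].map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase());
}

const thumbsUp = "\u{1F44D}";                // one code point
const thumbsUpMedium = "\u{1F44D}\u{1F3FD}"; // base + skin-tone modifier
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // joined with ZWJ (U+200D)

console.log(codePoints(thumbsUp));
console.log(codePoints(thumbsUpMedium));
console.log(codePoints(family));

// An ordinary property escape consumes exactly one code point, so an anchored
// match succeeds only for the single-code-point case.
const oneEmojiCodePoint = /^\p{Emoji}$/u;
console.log(oneEmojiCodePoint.test(thumbsUp));       // true
console.log(oneEmojiCodePoint.test(thumbsUpMedium)); // false
console.log(oneEmojiCodePoint.test(family));         // false
```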
This is the next ink wall after #4884 (\p{Surrogate}). string-width/index.js:25 builds this regex at module top-level, so importing string-width (→ ink) throws at init before any user code:
// node_modules/string-width/index.js:25
const rgiEmojiRegex = /^\p{RGI_Emoji}$/v; // "is this whole cluster one RGI emoji?" → width 2
It's the only property-of-strings in the ink dep tree.
What works vs. what doesn't (probed on the #4887 build)
OK /\p{Extended_Pictographic}/u → true on "😀"
OK /\p{Emoji}/u → true
OK /\p{Emoji_Presentation}/u → true
FAIL /^\p{RGI_Emoji}$/v → invalid pattern
So Rust does support the single-code-point emoji building blocks; only the string-property aggregate is missing. That makes a translation feasible without new Unicode tables.
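A probe like the one summarized above could be scripted roughly as follows. This is a sketch, not the script that was actually run: the `probe` helper and the `candidates` list are illustrative names, and the outcome of the last entry depends on the engine it runs on.

```javascript
// Try to compile a pattern and report the outcome instead of throwing.
function probe(source, flags) {
  try {
    new RegExp(source, flags);
    return { source, flags, ok: true };
  } catch (err) {
    return { source, flags, ok: false, error: err.message };
  }
}

const candidates = [
  ["\\p{Extended_Pictographic}", "u"],
  ["\\p{Emoji}", "u"],
  ["\\p{Emoji_Presentation}", "u"],
  ["^\\p{RGI_Emoji}$", "v"],
];

for (const [source, flags] of candidates) {
  const result = probe(source, flags);
  console.log(result.ok ? "OK  " : "FAIL", `/${source}/${flags}`, result.error ?? "");
}
```

Using the `RegExp` constructor rather than a literal keeps a rejected pattern from aborting the whole script, so every candidate gets reported.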
Expected (Node)
/^\p{RGI_Emoji}$/v.test("👍") → true; .test("👨👩👧") (ZWJ family) → true; .test("🇬🇧") (flag) → true; .test("ab") → false.
Suggested fix (JS→Rust regex translation, crates/perry-runtime/src/regex/{grammar,compile}.rs)
Per UTS #51, RGI_Emoji = Basic_Emoji | Emoji_Keycap_Sequence | RGI_Emoji_Flag_Sequence | RGI_Emoji_Tag_Sequence | RGI_Emoji_Modifier_Sequence | RGI_Emoji_ZWJ_Sequence. Two paths:
- Pragmatic (unblocks ink, no data tables): expand \p{RGI_Emoji} to an alternation over the supported primitives, e.g.
(?:
[\u{1F1E6}-\u{1F1FF}]{2} # flag pair (regional indicators)
| \p{Emoji}\u{FE0F}\u{20E3} # keycap (base + VS16 + U+20E3)
| \p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}? # base (+ skin tone, + VS16)
(?:\u{200D}\p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}?)* # ZWJ (U+200D) continuation
)
Anchored as ^(…)$, this classifies single emoji clusters correctly for string-width's width-2 decision (it already segments into clusters via Intl.Segmenter first, so the input is always one cluster). It is approximate at the edges (rare tag sequences) but behavior-preserving for the width use case.
- Faithful: generate the actual RGI sequence set from the Unicode emoji-sequences.txt / emoji-zwj-sequences.txt data into a table and emit an exact alternation (or a separate matcher). Heavier; only needed if exact \p{RGI_Emoji} semantics matter beyond width.
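The pragmatic expansion can be sanity-checked in JavaScript before it is ported to the Rust translation. The sketch below rebuilds the alternation with the u flag only (no /v), writes the invisible code points as explicit escapes, and runs it over clusters produced by Intl.Segmenter. The names `approxRgiEmoji` and `clusters` are illustrative, and the pattern is an approximation of \p{RGI_Emoji}, not an exact equivalent.

```javascript
// Building blocks that only need single-code-point properties (u flag).
const SKIN_TONE = "[\\u{1F3FB}-\\u{1F3FF}]";
const ELEMENT = `\\p{Extended_Pictographic}${SKIN_TONE}?\\u{FE0F}?`;

const approxRgiEmoji = new RegExp(
  "^(?:" +
    "[\\u{1F1E6}-\\u{1F1FF}]{2}" +          // flag: two regional indicators
    "|\\p{Emoji}\\u{FE0F}\\u{20E3}" +       // keycap: base + VS16 + U+20E3
    `|${ELEMENT}(?:\\u{200D}${ELEMENT})*` + // base, then optional ZWJ-joined elements
  ")$",
  "u"
);

// Split into grapheme clusters first, as string-width does via Intl.Segmenter.
const segmenter = new Intl.Segmenter();
function clusters(text) {
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// "a", thumbs-up with skin tone, ZWJ family, GB flag, "b"
const sample =
  "a\u{1F44D}\u{1F3FD}\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u{1F1EC}\u{1F1E7}b";

for (const cluster of clusters(sample)) {
  console.log(JSON.stringify(cluster), approxRgiEmoji.test(cluster));
}
```

Because the pattern accepts any Extended_Pictographic base, it also accepts some code points that RGI_Emoji lists only together with VS16, which is part of the approximation noted above.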
\P{RGI_Emoji} and use inside larger /v set operations can stay unsupported initially (string-width doesn't need them) — but should error clearly rather than mis-compile.
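If the faithful path is taken up later, the table could be built along these lines. This is a minimal sketch, assuming the `code_point(s) ; type_field ; description # comments` layout used by emoji-sequences.txt and emoji-zwj-sequences.txt. `SAMPLE_DATA` is a hand-abbreviated excerpt in that layout rather than the full files, and `parseSequences` is a hypothetical helper name.

```javascript
// Abbreviated sample lines; the real files are much longer.
const SAMPLE_DATA = `
# comment lines and blank lines are skipped
231A..231B ; Basic_Emoji ; watch..hourglass done # E0.6
00A9 FE0F ; Basic_Emoji ; copyright
1F1EC 1F1E7 ; RGI_Emoji_Flag_Sequence ; flag: United Kingdom
1F468 200D 1F469 200D 1F467 ; RGI_Emoji_ZWJ_Sequence ; family: man, woman, girl
`;

// Collect every listed sequence as a JavaScript string.
function parseSequences(text) {
  const sequences = new Set();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop trailing comment
    if (line === "") continue;
    const codePointsField = line.split(";")[0].trim();
    if (codePointsField.includes("..")) {
      // A range such as 231A..231B lists single-code-point emoji.
      const [first, last] = codePointsField.split("..").map((h) => parseInt(h, 16));
      for (let cp = first; cp <= last; cp++) {
        sequences.add(String.fromCodePoint(cp));
      }
    } else {
      // A space-separated list is one multi-code-point sequence.
      const cps = codePointsField.split(/\s+/).map((h) => parseInt(h, 16));
      sequences.add(String.fromCodePoint(...cps));
    }
  }
  return sequences;
}

const rgi = parseSequences(SAMPLE_DATA);
console.log(rgi.size);
console.log(rgi.has("\u{1F1EC}\u{1F1E7}")); // exact lookup of one whole cluster
```

With the input already segmented into single clusters, a set lookup like this is one possible form of the "separate matcher"; emitting an exact alternation from the same set would be the other.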
Impact
string-width@7+ → wrap-ansi, cli-truncate, slice-ansi → ink end-to-end (Compile ink (React-based TUI framework) end-to-end via perry.compilePackages #348). This is the last string-width init regex; after it, ink's next gate is yoga-layout's WASM runtime (the documented out-of-scope rock).
Related
- Compile ink (React-based TUI framework) end-to-end via perry.compilePackages #348 (ink end-to-end smoke test, where this surfaced)
- \p{Surrogate} Unicode property rejected ('invalid pattern'): Rust regex crate has no surrogate scalar values; blocks string-width/ink #4884 / fix(regex): rewrite \p{Surrogate} to a never-matching class (#4884) #4887 (previous ink wall: \p{Surrogate})
- Intl.Segmenter (grapheme segmentation): new Intl.Segmenter() throws, blocks string-width/wrap-ansi/ink #4877 / feat(intl): implement Intl.Segmenter (grapheme/word/sentence), closes #4877 #4882 (Intl.Segmenter), Codegen: new MessageChannel() global constructor unlinked/non-constructible, routes to stdlib symbol, breaks React scheduler init #4873 / fix(hir): global new MessageChannel() routes to always-linked runtime constructor (#4873) #4875 (new MessageChannel())