regex: \p{RGI_Emoji} (property of strings, /v flag) rejected — Rust regex crate has no string-properties; blocks string-width/ink #4889

@proggeramlug

Summary

The Unicode property of strings \p{RGI_Emoji} (ES2024 /v/unicodeSets only) is rejected at compile time: /^\p{RGI_Emoji}$/v → SyntaxError: Invalid regular expression: /^\p{RGI_Emoji}$/: invalid pattern.

Unlike ordinary \p{…} character properties (which match a single code point), a property of strings can match a multi-code-point cluster (emoji ZWJ sequences, flags, keycaps, skin-tone modifiers). Rust's regex crate has no support for properties of strings, so the token is unrepresentable as-is.

This is the next ink wall after #4884 (\p{Surrogate}). string-width/index.js:25 builds this regex at module top-level, so importing string-width (→ ink) throws at init before any user code:

// node_modules/string-width/index.js:25
const rgiEmojiRegex = /^\p{RGI_Emoji}$/v;   // "is this whole cluster one RGI emoji?" → width 2

It's the only property-of-strings in the ink dep tree.

What works vs. what doesn't (probed on the #4887 build)

OK    /\p{Extended_Pictographic}/u      → true on "😀"
OK    /\p{Emoji}/u                      → true
OK    /\p{Emoji_Presentation}/u         → true
FAIL  /^\p{RGI_Emoji}$/v                → invalid pattern

So Rust does support the single-code-point emoji building blocks; only the string-property aggregate is missing. That makes a translation feasible without new Unicode tables.

Expected (Node)

/^\p{RGI_Emoji}$/v.test("👍") → true; .test("👨‍👩‍👧") (ZWJ family) → true; .test("🇬🇧") (flag) → true; .test("ab") → false.

Suggested fix (JS→Rust regex translation, crates/perry-runtime/src/regex/{grammar,compile}.rs)

Per UTS #51, RGI_Emoji = Basic_Emoji | Emoji_Keycap_Sequence | RGI_Emoji_Flag_Sequence | RGI_Emoji_Tag_Sequence | RGI_Emoji_Modifier_Sequence | RGI_Emoji_ZWJ_Sequence. Two paths:

  • Pragmatic (unblocks ink, no data tables): expand \p{RGI_Emoji} to an alternation over the supported primitives — e.g.

    (?:
        [\u{1F1E6}-\u{1F1FF}]{2}                                          # flag pair (regional indicators)
      | \p{Emoji}\u{FE0F}\u{20E3}                                         # keycap (base + VS16 + U+20E3)
      | \p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}?          # base (+ skin tone, + VS16)
          (?:\u{200D}\p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}?)*  # ZWJ continuation
    )
    

    Anchored as ^(…)$ this classifies single emoji clusters correctly for string-width's width-2 decision (it already segments into clusters via Intl.Segmenter first, so the input is always one cluster). Approximate at the edges (rare tag sequences) but behavior-preserving for the width use case.

  • Faithful: generate the actual RGI sequence set from the Unicode emoji-sequences.txt / emoji-zwj-sequences.txt data into a table and emit an exact alternation (or a separate matcher). Heavier; only needed if exact \p{RGI_Emoji} semantics matter beyond width.
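
A quick way to sanity-check the pragmatic expansion is to run the same alternation in Node under the /u flag, where every primitive it uses is already available. This is only a sketch: the names `approxRgiEmoji` and `isSingleEmojiCluster` are made up for illustration, and the pattern is the approximation from this issue, not the exact RGI set.

```javascript
// Sketch only: the pragmatic \p{RGI_Emoji} expansion, written with the /u flag
// so it uses nothing but single-code-point properties.
// All names here are hypothetical, not part of string-width or perry-runtime.
const SKIN = String.raw`[\u{1F3FB}-\u{1F3FF}]?`;   // optional skin-tone modifier
const VS16 = String.raw`\uFE0F?`;                  // optional variation selector-16
const ELEMENT = String.raw`\p{Extended_Pictographic}` + SKIN + VS16;

const approxRgiEmoji = new RegExp(
  "^(?:" +
    String.raw`[\u{1F1E6}-\u{1F1FF}]{2}` + "|" +          // flag pair (regional indicators)
    String.raw`\p{Emoji}\uFE0F\u20E3` + "|" +             // keycap
    ELEMENT + String.raw`(?:\u200D` + ELEMENT + ")*" +    // base + ZWJ continuations
  ")$",
  "u"
);

// Assumes the caller passes exactly one grapheme cluster, as string-width does
// after segmenting with Intl.Segmenter.
function isSingleEmojiCluster(cluster) {
  return approxRgiEmoji.test(cluster);
}

const samples = [
  ["\u{1F44D}", "thumbs up"],
  ["\u{1F468}\u200D\u{1F469}\u200D\u{1F467}", "ZWJ family"],
  ["\u{1F1EC}\u{1F1E7}", "flag"],
  ["1\uFE0F\u20E3", "keycap"],
  ["ab", "plain text"],
  // Known gap: tag sequences (e.g. the England flag) are not covered.
  ["\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}", "tag sequence"],
];
for (const [text, label] of samples) {
  console.log(label, isSingleEmojiCluster(text));
}
```

The first four samples are accepted and the last two rejected. The flag branch also accepts any two regional indicators, including pairs that are not real flags, which is part of the over-approximation the pragmatic path trades for not shipping data tables.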

\P{RGI_Emoji} and use inside larger /v set operations can stay unsupported initially (string-width doesn't need them) — but should error clearly rather than mis-compile.
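
To make the "error clearly" part concrete, here is a rough model of the rewrite step in JavaScript. The real change would live in the Rust translator; `translateRgiEmoji` and `RGI_EMOJI_EXPANSION` are hypothetical names, and the bracket tracking is deliberately simplistic. As far as I know, ECMAScript itself rejects the negated form \P{RGI_Emoji}, so throwing there should match Node rather than diverge from it.

```javascript
// Sketch only: rewrite \p{RGI_Emoji} to the pragmatic expansion and reject the
// forms that are out of scope. Hypothetical names; not the perry-runtime API.
const RGI_EMOJI_EXPANSION = String.raw`(?:[\u{1F1E6}-\u{1F1FF}]{2}` +
  String.raw`|\p{Emoji}\u{FE0F}\u{20E3}` +
  String.raw`|\p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}?` +
  String.raw`(?:\u{200D}\p{Extended_Pictographic}[\u{1F3FB}-\u{1F3FF}]?\u{FE0F}?)*)`;

function translateRgiEmoji(source) {
  let out = "";
  let classDepth = 0; // nesting depth of [...]; /v allows nested classes
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      const m = /^\\([pP])\{RGI_Emoji\}/.exec(source.slice(i));
      if (m) {
        if (m[1] === "P") {
          throw new SyntaxError("\\P{RGI_Emoji}: negated property of strings is not supported");
        }
        if (classDepth > 0) {
          throw new SyntaxError("\\p{RGI_Emoji} inside a character class is not supported yet");
        }
        out += RGI_EMOJI_EXPANSION;
        i += m[0].length - 1; // skip the rest of the token
        continue;
      }
      // Any other escape: copy the backslash and the escaped character verbatim.
      out += source.slice(i, i + 2);
      i += 1;
      continue;
    }
    if (ch === "[") classDepth++;
    else if (ch === "]" && classDepth > 0) classDepth--;
    out += ch;
  }
  return out;
}

console.log(translateRgiEmoji(String.raw`^\p{RGI_Emoji}$`));
```

Patterns without the token pass through unchanged, so the rewrite can run unconditionally before the pattern is handed to the regex engine.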

Impact

Related
