
fix(regex): expand \p{RGI_Emoji} & friends (properties of strings) to supported emoji primitives (#4889) #4891

Merged: proggeramlug merged 1 commit into main from worktree-fix-4889-rgi-emoji on Jun 10, 2026

@proggeramlug

Contributor

Fixes #4889.

Problem

ES2024 /v-mode Unicode properties of strings (\p{RGI_Emoji} et al.) can match multi-code-point clusters (ZWJ sequences, flag pairs, keycaps, skin-tone sequences). The Rust regex crate has no notion of properties of strings, so it rejected them as invalid pattern. string-width@7+ builds /^\p{RGI_Emoji}$/v at module top level (index.js:25), so importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw SyntaxError at init before any user code — the next ink wall after #4887 (\p{Surrogate}).

Fix

js_regex_to_rust (the same interception point as the #4887 surrogate rewrite) now expands all seven properties of strings — RGI_Emoji, Basic_Emoji, Emoji_Keycap_Sequence, RGI_Emoji_Flag_Sequence, RGI_Emoji_Tag_Sequence, RGI_Emoji_Modifier_Sequence, RGI_Emoji_ZWJ_Sequence — into alternations over the single-code-point emoji properties the crate does support, per the UTS #51 sequence grammar (the issue's "pragmatic" path, no new Unicode tables):

RGI_Emoji ≈ flag-pair | keycap | tag-seq | ELEMENT (ZWJ ELEMENT)*
ELEMENT   = \p{Emoji_Modifier_Base}\p{Emoji_Modifier}     # skin-tone seq
          | \p{Emoji}\x{FE0F}                             # text-default + VS16
          | [\p{Emoji_Presentation}&&[^RI]]               # default emoji presentation

Grammar-based, not the enumerated RGI data files, so it over-matches at rare edges (unlisted flag pairs / ZWJ combos) — behavior-preserving for the width use case, as scoped in the issue.
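
To make the grammar above concrete, here is a minimal sketch that rebuilds it in TypeScript from the single-code-point properties available in plain /u mode. It is not the pattern js_regex_to_rust emits: every name below (RI, FLAG_PAIR, KEYCAP, TAG_SEQ, ELEMENT, rgiEmojiApprox, isEmojiCluster) is hypothetical, and the keycap and tag-sequence shapes are assumptions taken from the UTS #51 definitions rather than from this PR's code.

```typescript
// Hypothetical sketch: the PR's grammar expressed with /u-mode properties.
// Illustrative only; not the Rust pattern produced by js_regex_to_rust.
const RI = "[\\u{1F1E6}-\\u{1F1FF}]"; // regional indicators
const FLAG_PAIR = `${RI}${RI}`; // any two indicators (over-matches unlisted pairs)
const KEYCAP = "[0-9#*]\\uFE0F\\u20E3"; // assumed UTS #51 keycap shape
const TAG_SEQ = "\\u{1F3F4}[\\u{E0020}-\\u{E007E}]+\\u{E007F}"; // assumed tag shape

const ELEMENT = [
  "\\p{Emoji_Modifier_Base}\\p{Emoji_Modifier}", // skin-tone sequence
  "\\p{Emoji}\\uFE0F", // text-default emoji + VS16
  `(?!${RI})\\p{Emoji_Presentation}`, // default emoji presentation, minus RI
].join("|");

// flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
const rgiEmojiApprox = new RegExp(
  `^(?:${FLAG_PAIR}|${KEYCAP}|${TAG_SEQ}|(?:${ELEMENT})(?:\\u200D(?:${ELEMENT}))*)$`,
  "u",
);

function isEmojiCluster(s: string): boolean {
  return rgiEmojiApprox.test(s);
}
```

Under these assumptions, isEmojiCluster("\u{1F1EC}\u{1F1E7}") (a flag pair) is true while isEmojiCluster("\u{1F1EC}") (a lone regional indicator) is false. Because it follows the grammar rather than the RGI data files, the sketch accepts any two regional indicators, which is the same kind of over-match described above.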

  • Negated / in-class uses (\P{RGI_Emoji}, [\p{RGI_Emoji}]) pass through unchanged, so RegExp construction throws a clear SyntaxError instead of mis-compiling. Node also rejects \P{RGI_Emoji}.
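
The negated case can be probed with a small helper like the one below. This is a sketch, not code from the PR: tryCompile and negatedResult are hypothetical names, and the helper only reports the name of whatever error RegExp construction throws.

```typescript
// Hypothetical probe: report the error name thrown by RegExp construction,
// or "ok" when the pattern compiles.
function tryCompile(source: string, flags: string): string {
  try {
    new RegExp(source, flags);
    return "ok";
  } catch (e) {
    return e instanceof Error ? e.name : "unknown";
  }
}

// Negating a property of strings is a SyntaxError in /v mode.
const negatedResult = tryCompile("\\P{RGI_Emoji}", "v");
console.log(negatedResult);
```

Running the same probe under Node and under Perry is one way to confirm that both surface a SyntaxError for the negated form.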

Validation

  • Unit test in grammar.rs with 22 match/no-match cases verified against Node's /v implementation (ZWJ family 👨‍👩‍👧, flags 🇬🇧, keycaps 1️⃣, tag sequences 🏴󠁧󠁢󠁥󠁮󠁧󠁿, skin tones 👍🏻, VS16-in-ZWJ ❤️‍🔥, lone skin tone 🏻; negatives: lone regional indicator 🇬, bare 0/#, /© without VS16), plus the sibling properties and the SyntaxError pass-through.
  • e2e on the issue's probes, byte-identical to node --experimental-strip-types:
    • /^\p{RGI_Emoji}$/v → true/true/true/true/true/false/false; new RegExp("\\P{RGI_Emoji}","v") → SyntaxError
    • real string-width import now passes module init: stringWidth("abc")=3, stringWidth("👍")=2, stringWidth("👨‍👩‍👧")=2 ✅
  • cargo test -p perry-runtime (RUST_TEST_THREADS=1): 1007 passed, 1 failed — date::tests::test_full_year_setters_revive_invalid_date_only, which fails identically on pristine main (environment-dependent, unrelated).
  • cargo fmt --check + check_file_size.sh clean.

Next wall (out of scope)

With init unblocked, mixed-content stringWidth("a👍b") hits a pre-existing bug: for-of over a dynamically-typed string value throws (string).next is not a function (minimal repro: function f(v: any) { for (const c of v) {} } f("ab")). Same family as the ".entries()-on-any" note in #4800. Statically-typed for-of over strings works fine.
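
For reference, the minimal repro can be expanded into a small check of the behavior a standard JavaScript engine gives. This is a sketch: collectChars and collected are hypothetical names, and the only point is that the parameter is typed any, so the iteration goes through the dynamic path.

```typescript
// Hypothetical expansion of the repro: for-of over a string that is only
// known as `any` at compile time.
function collectChars(v: any): string[] {
  const out: string[] = [];
  for (const c of v) {
    out.push(c);
  }
  return out;
}

const collected = collectChars("ab");
console.log(collected.join(","));
```

On Node the loop completes and yields the two characters; per the description above, the same call currently throws under Perry.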

… supported emoji primitives (#4889)

ES2024 `/v`-mode Unicode *properties of strings* (`\p{RGI_Emoji}`,
`\p{Basic_Emoji}`, `\p{Emoji_Keycap_Sequence}`, `\p{RGI_Emoji_Flag_Sequence}`,
`\p{RGI_Emoji_Tag_Sequence}`, `\p{RGI_Emoji_Modifier_Sequence}`,
`\p{RGI_Emoji_ZWJ_Sequence}`) can match multi-code-point clusters, which the
Rust `regex` crate cannot represent — it rejected them as `invalid pattern`.
`string-width@7+` builds `/^\p{RGI_Emoji}$/v` at module top level, so
importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw a
SyntaxError at init — the next ink wall after #4887 (`\p{Surrogate}`).

`js_regex_to_rust` now expands each property of strings into an alternation
over the single-code-point emoji properties the crate does support, following
the UTS #51 sequence grammar:

  flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*

where ELEMENT is a skin-tone modifier sequence, an emoji presentation
sequence (text-default emoji + VS16), or a default-emoji-presentation
character (minus regional indicators, which only count in pairs). This is the
grammar, not the enumerated RGI data files, so it over-matches at rare edges
(unlisted flag pairs / ZWJ combinations) but classifies real emoji clusters
the way Node does — verified against Node's /v implementation on 22 cases
(ZWJ family, flags, keycaps, tag sequences, skin tones, VS16-in-ZWJ, lone
skin tone, and the negatives: lone regional indicator, bare keycap base,
text-default emoji without VS16).

Negated (`\P{RGI_Emoji}`) and in-class uses pass through unchanged so RegExp
construction throws a clear SyntaxError instead of mis-compiling — Node also
rejects `\P{RGI_Emoji}`.

With this, `stringWidth("👍")` / `stringWidth("👨‍👩‍👧")` compile and return 2
under Perry, byte-identical to Node. The next string-width gap is unrelated:
for-of over a dynamically-typed string value throws
`(string).next is not a function` (pre-existing, also noted in #4800).

Development

Successfully merging this pull request may close these issues.

regex: \p{RGI_Emoji} (property of strings, /v flag) rejected — Rust regex crate has no string-properties; blocks string-width/ink