fix(regex): expand \p{RGI_Emoji} & friends (properties of strings) to supported emoji primitives (#4889) #4891
Merged
… supported emoji primitives (#4889)

ES2024 `/v`-mode Unicode *properties of strings* (`\p{RGI_Emoji}`, `\p{Basic_Emoji}`, `\p{Emoji_Keycap_Sequence}`, `\p{RGI_Emoji_Flag_Sequence}`, `\p{RGI_Emoji_Tag_Sequence}`, `\p{RGI_Emoji_Modifier_Sequence}`, `\p{RGI_Emoji_ZWJ_Sequence}`) can match multi-code-point clusters, which the Rust `regex` crate cannot represent, so it rejected them as `invalid pattern`. `string-width@7+` builds `/^\p{RGI_Emoji}$/v` at module top level, so importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw a SyntaxError at init: the next ink wall after #4887 (`\p{Surrogate}`).

`js_regex_to_rust` now expands each property of strings into an alternation over the single-code-point emoji properties the crate does support, following the UTS #51 sequence grammar:

flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*

where ELEMENT is a skin-tone modifier sequence, an emoji presentation sequence (text-default emoji + VS16), or a default-emoji-presentation character (minus regional indicators, which only count in pairs).

This is the grammar, not the enumerated RGI data files, so it over-matches at rare edges (unlisted flag pairs / ZWJ combinations) but classifies real emoji clusters the way Node does. This was verified against Node's `/v` implementation on 22 cases (ZWJ family, flags, keycaps, tag sequences, skin tones, VS16-in-ZWJ, lone skin tone, and the negatives: lone regional indicator, bare keycap base, text-default emoji without VS16).

Negated (`\P{RGI_Emoji}`) and in-class uses pass through unchanged, so RegExp construction throws a clear SyntaxError instead of mis-compiling. Node also rejects `\P{RGI_Emoji}`.

With this, `stringWidth("👍")` / `stringWidth("👨👩👧")` compile and return 2 under Perry, byte-identical to Node. The next string-width gap is unrelated: for-of over a dynamically-typed string value throws `(string).next is not a function` (pre-existing, also noted in #4800).
Fixes #4889.
Problem
ES2024 `/v`-mode Unicode properties of strings (`\p{RGI_Emoji}` et al.) can match multi-code-point clusters (ZWJ sequences, flag pairs, keycaps, skin-tone sequences). The Rust `regex` crate has no notion of properties of strings, so it rejected them as `invalid pattern`. `string-width@7+` builds `/^\p{RGI_Emoji}$/v` at module top level (`index.js:25`), so importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw `SyntaxError` at init before any user code: the next ink wall after #4887 (`\p{Surrogate}`).
Fix
`js_regex_to_rust` (the same interception point as the #4887 surrogate rewrite) now expands all seven properties of strings (`RGI_Emoji`, `Basic_Emoji`, `Emoji_Keycap_Sequence`, `RGI_Emoji_Flag_Sequence`, `RGI_Emoji_Tag_Sequence`, `RGI_Emoji_Modifier_Sequence`, `RGI_Emoji_ZWJ_Sequence`) into alternations over the single-code-point emoji properties the crate does support, per the UTS #51 sequence grammar (the issue's "pragmatic" path, no new Unicode tables):
flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
where ELEMENT is a skin-tone modifier sequence, an emoji presentation sequence (text-default emoji + VS16), or a default-emoji-presentation character (minus regional indicators, which only count in pairs).
Grammar-based, not the enumerated RGI data files, so it over-matches at rare edges (unlisted flag pairs / ZWJ combos). This is behavior-preserving for the width use case, as scoped in the issue.
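To make the grammar-based approach concrete, here is a minimal TypeScript sketch that composes an approximation of `\p{RGI_Emoji}` from single-code-point properties under the plain `u` flag. It is not the pattern `js_regex_to_rust` emits: the names (`rgiEmojiApprox`, `matchesRgiEmojiApprox`), the exact sub-patterns, and the exclusion of keycap bases from the presentation-sequence branch are all assumptions made for illustration.

```typescript
// Illustrative sketch only; every name and sub-pattern here is hypothetical.
// Building blocks use single-code-point properties and literal code points.
const RI = String.raw`\p{Regional_Indicator}`;
const ZWJ = String.raw`\u{200D}`;
const VS16 = String.raw`\u{FE0F}`;

// flag pair: two regional indicators (any pair, so unlisted flags over-match)
const flagPair = `${RI}{2}`;
// keycap: base character + VS16 + combining enclosing keycap
const keycap = String.raw`[0-9#*]\u{FE0F}\u{20E3}`;
// tag sequence: black flag + tag characters + cancel tag
const tagSequence = String.raw`\u{1F3F4}[\u{E0020}-\u{E007E}]+\u{E007F}`;

// ELEMENT alternatives
const modifierSequence = String.raw`\p{Emoji_Modifier_Base}\p{Emoji_Modifier}`;
// Assumption: keycap bases and regional indicators are excluded here so that
// "1" + VS16 or a lone indicator + VS16 does not count as an element.
const presentationSequence =
  String.raw`(?![0-9#*\p{Regional_Indicator}])\p{Emoji}` + VS16;
const defaultPresentation =
  String.raw`(?!\p{Regional_Indicator})\p{Emoji_Presentation}`;
const element =
  `(?:${modifierSequence}|${presentationSequence}|${defaultPresentation})`;

// flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
const rgiEmojiApprox = new RegExp(
  `^(?:${flagPair}|${keycap}|${tagSequence}|${element}(?:${ZWJ}${element})*)$`,
  "u",
);

function matchesRgiEmojiApprox(s: string): boolean {
  return rgiEmojiApprox.test(s);
}

// ZWJ family (man, ZWJ, woman, ZWJ, girl)
console.log(matchesRgiEmojiApprox("\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}")); // true
// lone regional indicator
console.log(matchesRgiEmojiApprox("\u{1F1EC}")); // false
// text-default emoji (cloud) without VS16
console.log(matchesRgiEmojiApprox("\u{2601}")); // false
```

The sketch relies on lookahead to subtract regional indicators. The Rust `regex` crate has no lookaround, so an equivalent rewrite there would presumably use class set difference (for example `[\p{Emoji_Presentation}--\p{Regional_Indicator}]`) instead.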
Negated and in-class uses (`\P{RGI_Emoji}`, `[\p{RGI_Emoji}]`) pass through unchanged, so RegExp construction throws a clear SyntaxError instead of mis-compiling. Node also rejects `\P{RGI_Emoji}`.
Validation
- grammar.rs with 22 match/no-match cases verified against Node's `/v` implementation (ZWJ family 👨👩👧, flags 🇬🇧, keycaps 1️⃣, tag sequences 🏴, skin tones 👍🏻, VS16-in-ZWJ ❤️🔥, lone skin tone 🏻; negatives: lone regional indicator 🇬, bare `0`/`#`, `☁`/`©` without VS16), plus the sibling properties and the SyntaxError pass-through.
- `node --experimental-strip-types`: `/^\p{RGI_Emoji}$/v` → true/true/true/true/true/false/false, `new RegExp("\\P{RGI_Emoji}","v")` → SyntaxError
- `string-width` import now passes module init: `stringWidth("abc")`=3, `stringWidth("👍")`=2, `stringWidth("👨👩👧")`=2 ✅
- `cargo test -p perry-runtime` (`RUST_TEST_THREADS=1`): 1007 passed, 1 failed: `date::tests::test_full_year_setters_revive_invalid_date_only`, which fails identically on pristine main (environment-dependent, unrelated).
- `cargo fmt --check` + `check_file_size.sh` clean.
Next wall (out of scope)
With init unblocked, mixed-content `stringWidth("a👍b")` hits a pre-existing bug: for-of over a dynamically-typed string value throws `(string).next is not a function` (minimal repro: `function f(v: any) { for (const c of v) {} } f("ab")`). Same family as the ".entries()-on-any" note in #4800. Statically-typed for-of over strings works fine.
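For reference, the sketch below shows the behavior a spec-conforming engine gives for the same shape of code, next to the iterator-protocol steps that for-of performs. The helper names (`collectForOf`, `collectManually`) are hypothetical and exist only for this illustration; the sketch says nothing about how Perry lowers the loop.

```typescript
// Hypothetical helpers for illustration; `v` is deliberately typed `any`
// to mirror the minimal repro above.
function collectForOf(v: any): string[] {
  const out: string[] = [];
  for (const c of v) {
    out.push(c);
  }
  return out;
}

// What for-of does per the iteration protocol: obtain an iterator from
// v[Symbol.iterator](), then call next() on that iterator, not on v itself.
function collectManually(v: any): string[] {
  const out: string[] = [];
  const it = v[Symbol.iterator]();
  for (let step = it.next(); !step.done; step = it.next()) {
    out.push(step.value);
  }
  return out;
}

console.log(collectForOf("ab")); // [ 'a', 'b' ]
// String iteration yields code points, so the emoji is a single element
// (assuming an ES2015+ compile target).
console.log(collectManually("a\u{1F44D}b").length); // 3
```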