fix(regex): rewrite \p{Surrogate} to a never-matching class (#4884) #4887
Merged
added 2 commits
June 10, 2026 10:08
The Rust `regex` crate matches over Unicode scalar values, which exclude
surrogate code points (U+D800..=U+DFFF), so it rejects `\p{Surrogate}` /
`\p{gc=Cs}` outright with `invalid pattern` instead of treating it as a
never-matching class. Every other property in the family compiles fine.
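The scalar-value constraint described above is visible in Rust's own `char` type, which helps explain why the crate has no way to express a surrogate class. A minimal sketch using only the standard library; the helper name `is_surrogate` is illustrative and not taken from the PR:

```rust
/// Hypothetical helper (not from the PR): true for code points in the
/// surrogate block U+D800..=U+DFFF.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    // A Rust `char` is a Unicode scalar value, so a surrogate cannot be built.
    assert!(char::from_u32(0xD800).is_none());
    assert!(char::from_u32(0xDFFF).is_none());

    // The code points on either side of the block are valid scalar values.
    assert!(char::from_u32(0xD7FF).is_some());
    assert!(char::from_u32(0xE000).is_some());

    // Across the whole code space, a code point is a valid `char` exactly
    // when it is not a surrogate.
    for cp in 0..=0x10FFFFu32 {
        assert_eq!(char::from_u32(cp).is_some(), !is_surrogate(cp));
    }
    println!("no surrogate code point is representable as a Rust char");
}
```

Since a `&str` is a sequence of such scalar values, a class that only contains surrogates can never match valid input, which is what makes the never-matching rewrite safe.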
`string-width@7+` builds two module-top-level regexes that include
`\p{Surrogate}`, so importing it (→ wrap-ansi / cli-truncate / slice-ansi
→ ink, #348) threw `SyntaxError` at module init before any user code ran —
the next ink wall after #4877 (Intl.Segmenter).
`js_regex_to_rust` now intercepts the `\p{...}` / `\P{...}` brace form,
loosely normalizes the property value (stripping `gc=` / `general_category=`
and `_`/spaces), and for the surrogate category rewrites:
- positive, outside a class → `[^\s\S]` (never matches)
- negated, outside a class → `[\s\S]` (any scalar value)
- positive, inside a class → dropped (a never-matching union member)
- negated, inside a class → `\s\S`
All other properties pass through to the crate unchanged. Since valid input
carries no surrogate scalar values, this is behavior-preserving and matches
Node byte-for-byte on the string-width width/zero-width predicates.
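The four rewrite cases above can be sketched as a small lookup. This is an illustration of the described behavior, not the actual `js_regex_to_rust` code: the function names `normalize_property` and `rewrite_surrogate` are hypothetical, and the lowercasing step is an assumption about what "loosely normalizes" means.

```rust
/// Hypothetical normalizer: drop `_` and spaces, lowercase (assumed), then
/// strip a leading `gc=` / `general_category=` (which reads
/// `generalcategory=` once the underscore is gone).
fn normalize_property(raw: &str) -> String {
    let s = raw
        .chars()
        .filter(|c| !matches!(c, '_' | ' '))
        .collect::<String>()
        .to_lowercase();
    for prefix in ["gc=", "generalcategory="] {
        if let Some(rest) = s.strip_prefix(prefix) {
            return rest.to_string();
        }
    }
    s
}

/// Returns the replacement text for a surrogate property escape, or `None`
/// when the property is anything else and should pass through unchanged.
fn rewrite_surrogate(raw: &str, negated: bool, in_class: bool) -> Option<&'static str> {
    match normalize_property(raw).as_str() {
        "surrogate" | "cs" => Some(match (negated, in_class) {
            (false, false) => r"[^\s\S]", // never matches
            (true, false) => r"[\s\S]",   // any scalar value
            (false, true) => "",          // dropped from the union
            (true, true) => r"\s\S",      // any scalar value, as class members
        }),
        _ => None,
    }
}

fn main() {
    assert_eq!(rewrite_surrogate("Surrogate", false, false), Some(r"[^\s\S]"));
    assert_eq!(rewrite_surrogate("gc=Cs", true, false), Some(r"[\s\S]"));
    assert_eq!(rewrite_surrogate("General_Category=Surrogate", false, true), Some(""));
    assert_eq!(rewrite_surrogate("Cs", true, true), Some(r"\s\S"));
    assert_eq!(rewrite_surrogate("Mark", false, false), None);
    println!("all four surrogate rewrites behave as described");
}
```

Returning `None` for every other property keeps the pass-through path explicit, so the caller can emit the original `\p{...}` text untouched.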
Unit test in regex.rs covers the rewrite spellings and constructs both
string-width patterns.
regex.rs hit the 2000-line lint gate (2007). Relocate the new surrogate-property test into grammar.rs alongside the js_regex_to_rust logic it exercises, validating compilation via regex::Regex::new directly instead of js_regexp_new. regex.rs back to 1967 lines.
proggeramlug
added a commit
that referenced
this pull request
Jun 10, 2026
… supported emoji primitives (#4889) (#4891)
ES2024 `/v`-mode Unicode *properties of strings* (`\p{RGI_Emoji}`, `\p{Basic_Emoji}`, `\p{Emoji_Keycap_Sequence}`, `\p{RGI_Emoji_Flag_Sequence}`, `\p{RGI_Emoji_Tag_Sequence}`, `\p{RGI_Emoji_Modifier_Sequence}`, `\p{RGI_Emoji_ZWJ_Sequence}`) can match multi-code-point clusters, which the Rust `regex` crate cannot represent; it rejected them as `invalid pattern`.
`string-width@7+` builds `/^\p{RGI_Emoji}$/v` at module top level, so importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw a SyntaxError at init. This was the next ink wall after #4887 (`\p{Surrogate}`).
`js_regex_to_rust` now expands each property of strings into an alternation over the single-code-point emoji properties the crate does support, following the UTS #51 sequence grammar:
    flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
where ELEMENT is a skin-tone modifier sequence, an emoji presentation sequence (text-default emoji + VS16), or a default-emoji-presentation character (minus regional indicators, which only count in pairs).
This is the grammar, not the enumerated RGI data files, so it over-matches at rare edges (unlisted flag pairs / ZWJ combinations) but classifies real emoji clusters the way Node does, verified against Node's /v implementation on 22 cases (ZWJ family, flags, keycaps, tag sequences, skin tones, VS16-in-ZWJ, lone skin tone, and the negatives: lone regional indicator, bare keycap base, text-default emoji without VS16).
Negated (`\P{RGI_Emoji}`) and in-class uses pass through unchanged so RegExp construction throws a clear SyntaxError instead of mis-compiling; Node also rejects `\P{RGI_Emoji}`.
With this, `stringWidth("👍")` / `stringWidth("👨👩👧")` compile and return 2 under Perry, byte-identical to Node.
The next string-width gap is unrelated: for-of over a dynamically-typed string value throws `(string).next is not a function` (pre-existing, also noted in #4800).
Co-authored-by: Ralph Küpper <ralph@skelpo.com>
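The grammar-based expansion in the commit message above can be sketched as plain string assembly. This is not Perry's actual expansion: the function name `rgi_emoji_pattern` is hypothetical, the exact sub-patterns are one reading of the UTS #51 grammar, and the sketch assumes the `regex` crate accepts the `\p{Regional_Indicator}`, `\p{Emoji}`, `\p{Emoji_Presentation}`, `\p{Emoji_Modifier_Base}` and `\p{Emoji_Modifier}` properties and the `--` class difference. It only builds the pattern text and does not compile it.

```rust
/// Hypothetical builder for a pattern following the sequence grammar
///   flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
fn rgi_emoji_pattern() -> String {
    // Flag: exactly two regional indicators.
    let flag = r"\p{Regional_Indicator}{2}";
    // Keycap: base character, VS16, combining enclosing keycap.
    let keycap = r"[0-9#*]\x{FE0F}\x{20E3}";

    // ELEMENT: modifier sequence | presentation sequence (emoji + VS16)
    // | default-emoji-presentation character minus regional indicators.
    let element = format!(
        "(?:{})",
        [
            r"\p{Emoji_Modifier_Base}\p{Emoji_Modifier}",
            r"\p{Emoji}\x{FE0F}",
            r"[\p{Emoji_Presentation}--\p{Regional_Indicator}]",
        ]
        .join("|")
    );

    // Tag sequence: base element, one or more tag characters, cancel tag.
    let tag = [element.as_str(), r"[\x{E0020}-\x{E007E}]+\x{E007F}"].concat();
    // ZWJ chain: an element followed by any number of ZWJ-joined elements.
    let zwj_chain = [element.as_str(), r"(?:\x{200D}", element.as_str(), ")*"].concat();

    let alternatives = [flag.to_string(), keycap.to_string(), tag, zwj_chain];
    format!("(?:{})", alternatives.join("|"))
}

fn main() {
    // Print the assembled pattern so it can be inspected or fed to a regex engine.
    println!("{}", rgi_emoji_pattern());
}
```

Because the pattern encodes the grammar and not the RGI data files, it shares the over-matching caveat noted above, for example any pair of regional indicators is accepted as a flag.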
Summary
Fixes #4884.
`new RegExp('\\p{Surrogate}', 'u')` (and the `\p{gc=Cs}` / `\P{Surrogate}` spellings) threw `SyntaxError: Invalid regular expression: invalid pattern`.
Root cause: Perry compiles JS regexes to the Rust `regex` crate, which matches over Unicode scalar values. Surrogate code points (U+D800–U+DFFF) can't occur there, so the crate rejects the `Surrogate` / `Cs` general category outright instead of treating it as a never-matching class. `\p{Surrogate}` was falling through `js_regex_to_rust`'s catch-all and reaching the crate verbatim.
Why it matters: `string-width@7+` builds two module-top-level regexes containing `\p{Surrogate}`, so importing it (→ `wrap-ansi` / `cli-truncate` / `slice-ansi` → ink, #348) threw at init before any user code. This was the next ink wall after #4877 (`Intl.Segmenter`).
Fix
`js_regex_to_rust` now intercepts the `\p{...}` / `\P{...}` brace form, loosely normalizes the property value (strips `gc=` / `general_category=` and `_` / spaces), and for the surrogate category rewrites:

| | `\p` | `\P` |
|---|---|---|
| outside a class | `[^\s\S]` (never matches) | `[\s\S]` (any scalar) |
| inside a class | (dropped) | `\s\S` |

All other properties pass through unchanged. This is the minimum option from the issue: behavior-preserving for all valid (non-surrogate) input, which is everything `string-width` ever sees. (The "faithful" V8-matching option isn't viable here: the base `regex` / `fancy-regex` engines can't represent surrogate code points at all, so a true `[\u{D800}-\u{DFFF}]` would itself fail to compile.)
Validation
`surrogate_property_rewrites_to_never_match` in `regex.rs` covers the rewrite spellings and constructs both string-width patterns. Full `regex::` suite green (17 passed).
(`\p{Control}`, `\p{Mark}`, `\p{Format}`, `\p{Default_Ignorable_Code_Point}`, the `/v` flag, negation in/out of classes, `gc=Cs` spelling).
Notes
Per CLAUDE.md's contributor guidance I left the version bump + CHANGELOG entry for the maintainer to fold in at merge.
After this, ink's next documented gate is yoga-layout's WASM runtime.
Closes #4884.