fix(regex): rewrite \p{Surrogate} to a never-matching class (#4884) #4887

Merged: proggeramlug merged 2 commits into main from fix-4884-surrogate on Jun 10, 2026

Conversation

@proggeramlug

Contributor

Summary

Fixes #4884. new RegExp('\\p{Surrogate}', 'u') (and the \p{gc=Cs} / \P{Surrogate} spellings) threw SyntaxError: Invalid regular expression: invalid pattern.

Root cause: Perry compiles JS regexes to the Rust regex crate, which matches over Unicode scalar values — surrogate code points (U+D800–U+DFFF) can't occur there, so the crate rejects the Surrogate/Cs general category outright instead of treating it as a never-matching class. \p{Surrogate} was falling through js_regex_to_rust's catch-all and reaching the crate verbatim.
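The scalar-value restriction can be seen without the regex crate at all, because Rust's own `char` type is defined as a Unicode scalar value. The following is a minimal illustrative sketch, not code from this PR; the helper names `is_surrogate` and `any_surrogate_char` are hypothetical.

```rust
/// Returns true if `cp` lies in the UTF-16 surrogate range U+D800..=U+DFFF.
/// (Hypothetical helper, used only for this illustration.)
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// Walks every code point up to U+10FFFF and reports whether any value that
/// converts to a `char` falls inside the surrogate range.
fn any_surrogate_char() -> bool {
    (0..=0x10FFFF_u32)
        .filter_map(char::from_u32) // `None` for surrogates and out-of-range values
        .any(|c| is_surrogate(c as u32))
}
```

Since no `char` can hold a surrogate, an engine that matches over `char`s has nothing for `\p{Surrogate}` to match, which is why a never-matching class is a behavior-preserving stand-in for valid input.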

Why it matters: string-width@7+ builds two module-top-level regexes containing \p{Surrogate}, so importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw at init before any user code ran. This was the next ink wall after #4877 (Intl.Segmenter).

Fix

js_regex_to_rust now intercepts the \p{...} / \P{...} brace form, loosely normalizes the property value (strips gc= / general_category= and _/spaces), and for the surrogate category rewrites:

| context | positive \p | negated \P |
| --- | --- | --- |
| outside a class | [^\s\S] (never matches) | [\s\S] (any scalar) |
| inside a class | dropped (never-matching union member) | \s\S |

All other properties pass through unchanged. This is the minimum option from the issue — behavior-preserving for all valid (non-surrogate) input, which is everything string-width ever sees. (The "faithful" V8-matching option isn't viable here: the base regex/fancy-regex engines can't represent surrogate code points at all, so a true [\u{D800}-\u{DFFF}] would itself fail to compile.)
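The rewrite rules above can be sketched as a small standalone pass over the pattern text. This is an illustration under stated assumptions, not the actual `js_regex_to_rust` code: the function names `is_surrogate_property` and `rewrite_surrogate` are hypothetical, class nesting is tracked with a simple bracket counter, and the property value is compared case-sensitively after stripping `_` and spaces.

```rust
/// Loosely normalizes a `\p{...}` body and reports whether it names the
/// surrogate general category. Hypothetical helper; the real normalization
/// may differ in detail.
fn is_surrogate_property(body: &str) -> bool {
    let value = match body.split_once('=') {
        Some((key, value)) => {
            // Accept `gc=` / `general_category=` style keys.
            let key: String = key
                .chars()
                .filter(|c| *c != '_' && *c != ' ')
                .map(|c| c.to_ascii_lowercase())
                .collect();
            if key != "gc" && key != "generalcategory" {
                return false;
            }
            value
        }
        None => body,
    };
    let value: String = value.chars().filter(|c| *c != '_' && *c != ' ').collect();
    value == "Surrogate" || value == "Cs"
}

/// Rewrites surrogate property escapes; every other escape and character is
/// copied through unchanged.
fn rewrite_surrogate(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut class_depth = 0usize; // > 0 while inside [...]
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\\' && i + 1 < chars.len() {
            let next = chars[i + 1];
            if (next == 'p' || next == 'P') && chars.get(i + 2) == Some(&'{') {
                if let Some(len) = chars[i + 3..].iter().position(|&ch| ch == '}') {
                    let body: String = chars[i + 3..i + 3 + len].iter().collect();
                    if is_surrogate_property(&body) {
                        // (negated, inside a class) selects the replacement.
                        out.push_str(match (next == 'P', class_depth > 0) {
                            (false, false) => r"[^\s\S]", // never matches
                            (true, false) => r"[\s\S]",   // any scalar value
                            (false, true) => "",          // drop the union member
                            (true, true) => r"\s\S",
                        });
                        i += 3 + len + 1; // skip `\p{`, the body, and `}`
                        continue;
                    }
                }
            }
            // Any other escape: copy the backslash and the escaped character.
            out.push(c);
            out.push(next);
            i += 2;
            continue;
        }
        if c == '[' {
            class_depth += 1;
        } else if c == ']' && class_depth > 0 {
            class_depth -= 1;
        }
        out.push(c);
        i += 1;
    }
    out
}
```

For example, `rewrite_surrogate(r"[\p{Surrogate}a]")` yields `[a]`. One thing this sketch does not handle is a class whose only member is the surrogate property: dropping it would leave `[]`, which would need separate treatment.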

Validation

  • New unit test surrogate_property_rewrites_to_never_match in regex.rs covers the rewrite spellings and constructs both string-width patterns. Full regex:: suite green (17 passed).
  • The exact string-width regexes plus the issue's isolation cases now compile and match Node byte-for-byte (\p{Control}, \p{Mark}, \p{Format}, \p{Default_Ignorable_Code_Point}, the /v flag, negation in/out of classes, gc=Cs spelling).

Notes

Per CLAUDE.md's contributor guidance I left the version bump + CHANGELOG entry for the maintainer to fold in at merge.

After this, ink's next documented gate is yoga-layout's WASM runtime.

Closes #4884.

Ralph Küpper added 2 commits June 10, 2026 10:08
The Rust `regex` crate matches over Unicode scalar values, which exclude
surrogate code points (U+D800..=U+DFFF), so it rejects `\p{Surrogate}` /
`\p{gc=Cs}` outright with `invalid pattern` instead of treating it as a
never-matching class. Every other property in the family compiles fine.

`string-width@7+` builds two module-top-level regexes that include
`\p{Surrogate}`, so importing it (→ wrap-ansi / cli-truncate / slice-ansi
→ ink, #348) threw `SyntaxError` at module init before any user code ran —
the next ink wall after #4877 (Intl.Segmenter).

`js_regex_to_rust` now intercepts the `\p{...}` / `\P{...}` brace form,
loosely normalizes the property value (stripping `gc=` / `general_category=`
and `_`/spaces), and for the surrogate category rewrites:
  - positive, outside a class → `[^\s\S]` (never matches)
  - negated,  outside a class → `[\s\S]`  (any scalar value)
  - positive, inside  a class → dropped (a never-matching union member)
  - negated,  inside  a class → `\s\S`
All other properties pass through to the crate unchanged. Since valid input
carries no surrogate scalar values, this is behavior-preserving and matches
Node byte-for-byte on the string-width width/zero-width predicates.

Unit test in regex.rs covers the rewrite spellings and constructs both
string-width patterns.

regex.rs hit the 2000-line lint gate (2007). Relocate the new
surrogate-property test into grammar.rs alongside the js_regex_to_rust
logic it exercises, validating compilation via regex::Regex::new directly
instead of js_regexp_new. regex.rs back to 1967 lines.
@proggeramlug proggeramlug merged commit 837d2a1 into main Jun 10, 2026
12 of 13 checks passed
@proggeramlug proggeramlug deleted the fix-4884-surrogate branch June 10, 2026 08:24
proggeramlug added a commit that referenced this pull request Jun 10, 2026
… supported emoji primitives (#4889) (#4891)

ES2024 `/v`-mode Unicode *properties of strings* (`\p{RGI_Emoji}`,
`\p{Basic_Emoji}`, `\p{Emoji_Keycap_Sequence}`, `\p{RGI_Emoji_Flag_Sequence}`,
`\p{RGI_Emoji_Tag_Sequence}`, `\p{RGI_Emoji_Modifier_Sequence}`,
`\p{RGI_Emoji_ZWJ_Sequence}`) can match multi-code-point clusters, which the
Rust `regex` crate cannot represent — it rejected them as `invalid pattern`.
`string-width@7+` builds `/^\p{RGI_Emoji}$/v` at module top level, so
importing it (→ wrap-ansi / cli-truncate / slice-ansi → ink, #348) threw a
SyntaxError at init — the next ink wall after #4887 (`\p{Surrogate}`).

`js_regex_to_rust` now expands each property of strings into an alternation
over the single-code-point emoji properties the crate does support, following
the UTS #51 sequence grammar:

  flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*

where ELEMENT is a skin-tone modifier sequence, an emoji presentation
sequence (text-default emoji + VS16), or a default-emoji-presentation
character (minus regional indicators, which only count in pairs). This is the
grammar, not the enumerated RGI data files, so it over-matches at rare edges
(unlisted flag pairs / ZWJ combinations) but classifies real emoji clusters
the way Node does — verified against Node's /v implementation on 22 cases
(ZWJ family, flags, keycaps, tag sequences, skin tones, VS16-in-ZWJ, lone
skin tone, and the negatives: lone regional indicator, bare keycap base,
text-default emoji without VS16).
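As a rough illustration of the grammar described above, the alternation could be assembled as a pattern string along these lines. This is a hedged sketch, not the expansion the commit actually emits: the function name `rgi_emoji_pattern` is hypothetical, and it assumes the target engine accepts the property names `Regional_Indicator`, `Emoji`, `Emoji_Presentation`, `Emoji_Modifier` and `Emoji_Modifier_Base`, the `\x{...}` escape form, and `--` for set difference inside a class.

```rust
/// Builds an approximate, grammar-based stand-in for `\p{RGI_Emoji}`:
///   flag pair | keycap | tag sequence | ELEMENT (ZWJ ELEMENT)*
/// Illustrative only; property names and syntax are assumptions about the
/// target regex engine.
fn rgi_emoji_pattern() -> String {
    // Two regional indicators form a flag.
    let flag = r"\p{Regional_Indicator}{2}";
    // Keycap base + VS16 + combining enclosing keycap.
    let keycap = r"[0-9#*]\x{FE0F}\x{20E3}";
    // Black flag + tag characters + cancel tag.
    let tag = r"\x{1F3F4}[\x{E0020}-\x{E007E}]+\x{E007F}";
    // ELEMENT: modifier sequence, presentation sequence, or a
    // default-emoji-presentation character that is not a regional indicator.
    let element = concat!(
        r"(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}",
        r"|\p{Emoji}\x{FE0F}",
        r"|[\p{Emoji_Presentation}--\p{Regional_Indicator}])"
    );
    // `{{` / `}}` are escaped braces in `format!`, producing `\x{200D}` (ZWJ).
    format!(r"(?:{flag}|{keycap}|{tag}|{element}(?:\x{{200D}}{element})*)")
}
```

Because this follows the sequence grammar rather than the enumerated RGI data, it would accept some sequences that are not recommended for general interchange, in line with the over-matching noted above.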

Negated (`\P{RGI_Emoji}`) and in-class uses pass through unchanged so RegExp
construction throws a clear SyntaxError instead of mis-compiling — Node also
rejects `\P{RGI_Emoji}`.

With this, `stringWidth("👍")` / `stringWidth("👨‍👩‍👧")` compile and return 2
under Perry, byte-identical to Node. The next string-width gap is unrelated:
for-of over a dynamically-typed string value throws
`(string).next is not a function` (pre-existing, also noted in #4800).

Co-authored-by: Ralph Küpper <ralph@skelpo.com>

Development

Successfully merging this pull request may close these issues.

regex: \p{Surrogate} Unicode property rejected ('invalid pattern') — Rust regex crate has no surrogate scalar values; blocks string-width/ink