fix: handle non-BMP Unicode codepoints in foldl, foldr, and %c format #606

Merged
stephenamar-db merged 1 commit into databricks:master from JoshRosen:fix-unicode-foldl-foldr-format-c
Feb 23, 2026
@JoshRosen
Contributor

This PR fixes two more non-BMP Unicode bugs:

  • foldl/foldr iterated strings by UTF-16 code unit (for (char <- s.value)), splitting non-BMP characters like emoji into surrogate pair halves. Use codePointAt/codePointBefore with Character.charCount for correct codepoint iteration.
  • The %c format conversion used s.toChar.toString which truncates codepoints above U+FFFF to 16 bits. Use Character.toString(s.toInt) instead.
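To illustrate the two fixes, here is a minimal standalone sketch. It is not the PR's actual diff: the object and method names (`CodePointSketch`, `codePointsForward`, `codePointsBackward`, `formatChar`) are hypothetical, and only the `java.lang` calls named above are taken from the description. `Character.toString(Int)` requires Java 11 or later.

```scala
// Hypothetical helper object for illustration; not code from the PR.
object CodePointSketch {

  // Walk the string left to right by code point, as a foldl would.
  def codePointsForward(s: String): List[String] = {
    val out = List.newBuilder[String]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)            // full code point starting at index i
      out += new String(Character.toChars(cp))
      i += Character.charCount(cp)         // advance by 1 (BMP) or 2 (surrogate pair)
    }
    out.result()
  }

  // Walk the string right to left by code point, as a foldr would.
  def codePointsBackward(s: String): List[String] = {
    val out = List.newBuilder[String]
    var i = s.length
    while (i > 0) {
      val cp = s.codePointBefore(i)        // full code point ending just before index i
      out += new String(Character.toChars(cp))
      i -= Character.charCount(cp)
    }
    out.result()
  }

  // %c-style conversion: keeps code points above U+FFFF intact (Java 11+).
  def formatChar(cp: Int): String = Character.toString(cp)
}
```

For the input `"a\uD83D\uDE00b"` (the letter a, U+1F600, the letter b), `"...".length` is 4 because the emoji occupies two UTF-16 code units, while `codePointsForward` yields three elements. By contrast, `0x1F600.toChar.toString` keeps only the low 16 bits and produces U+F600, which is the truncation the second bullet describes.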

All code written by Claude Opus 4.6.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stephenamar-db stephenamar-db merged commit 43bdcd6 into databricks:master Feb 23, 2026
43 of 48 checks passed