fix(buffer): use Latin-1 encoding for atob() per WHATWG spec #1284

Merged
richarddavison merged 5 commits into awslabs:main from
chessbyte:fix/atob-latin1-encoding
Dec 22, 2025

Conversation

@chessbyte
Contributor

@chessbyte chessbyte commented Dec 12, 2025

Issue # (if available)

Fixes #966

Description of changes

The atob() function was incorrectly treating decoded bytes as UTF-8, causing bytes >= 128 to be replaced with U+FFFD (replacement character).

Per the WHATWG spec, atob() should return a "binary string" where each character's code point directly represents a byte value (0-255). This matches Latin-1 (ISO-8859-1) encoding where each byte maps directly to a Unicode code point.
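The byte-to-code-point mapping described above can be sketched in Rust as follows. This is an illustration only, not the code from this PR; the function names `bytes_to_binary_string` and `bytes_to_lossy_utf8` are hypothetical, and the second function is included just to contrast the lossy UTF-8 interpretation.

```rust
/// Decode raw bytes into a "binary string": each byte becomes the character
/// with the same code point (U+0000 to U+00FF), i.e. a Latin-1 mapping.
/// Hypothetical helper for illustration, not LLRT's implementation.
fn bytes_to_binary_string(bytes: &[u8]) -> String {
    // `u8 as char` is always valid: every value 0..=255 is a Unicode scalar value.
    bytes.iter().map(|&b| b as char).collect()
}

/// The lossy UTF-8 interpretation, for contrast: bytes that do not form
/// valid UTF-8 are replaced with U+FFFD.
fn bytes_to_lossy_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For the input `[0x41, 0x80, 0xFF]`, the Latin-1 mapping keeps all three byte values as code points, while the lossy UTF-8 decode turns both high bytes into U+FFFD.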

This fix resolves JWT verification issues (issue #966) where signature bytes containing values >= 128 were being corrupted, causing signature length mismatches (248 vs expected 256 bytes) and verification failures.
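One plausible way a length mismatch like this arises, sketched under the assumption that the old path decoded the bytes as lossy UTF-8: high bytes that happen to form a valid multi-byte sequence collapse into a single character, so the resulting string is shorter than the byte count. The helper names and the sample bytes below are made up for illustration; the actual signature bytes from issue #966 are not shown in this PR.

```rust
/// Number of characters seen after a lossy UTF-8 decode.
/// Hypothetical helper for illustration.
fn lossy_utf8_len(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes).chars().count()
}

/// Number of characters in a Latin-1 binary string: always one per byte.
fn latin1_len(bytes: &[u8]) -> usize {
    bytes.iter().map(|&b| b as char).count()
}

/// Illustrative input: 0xC3 0xA9 is a valid two-byte UTF-8 sequence,
/// 0xFF is never valid in UTF-8.
const SAMPLE: [u8; 5] = [0x12, 0xC3, 0xA9, 0xFF, 0x7F];
```

With `SAMPLE`, the Latin-1 view has 5 characters while the lossy UTF-8 view has 4, because `0xC3 0xA9` merges into one character and `0xFF` becomes U+FFFD.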

Added comprehensive tests for atob/btoa including:

  • Basic encoding/decoding
  • High-byte values (128-255) as Latin-1 characters
  • Full 0-255 byte range verification
  • Comparison with Buffer.from() for binary data

Sources:

Checklist

  • Created unit tests in tests/unit and/or in Rust for my feature if needed
  • Ran make fix to format JS and apply Clippy auto fixes
  • Made sure my code didn't add any additional warnings: make check
  • Added relevant type info in types/ directory
  • Updated documentation if needed (API.md/README.md/Other)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@chessbyte
Contributor Author

Tried fixing seemingly unrelated Windows test in #1285

@chessbyte chessbyte force-pushed the fix/atob-latin1-encoding branch from 9ac3816 to f038a0e Compare December 13, 2025 18:06
Collaborator

@richarddavison richarddavison left a comment

Nice! LGTM thank you

@Sytten
Collaborator

Sytten commented Dec 14, 2025

I had to do some tricks in the string decoder. The problem, though, is that the Rust-to-JS conversion only works with UTF-8. They added a raw 2-byte read but not a write.

@chessbyte
Contributor Author

@Sytten can you provide a failing scenario here? I will convert it to a test and ensure it passes.

@Sytten
Collaborator

Sytten commented Dec 15, 2025

Honestly, it's a bit too far back in my head; I would need to recheck. It was something to do with unsurrogate pair. Might not be relevant here, just know that UTF-16 to UTF-8 is a PITA.

@chessbyte chessbyte force-pushed the fix/atob-latin1-encoding branch from f038a0e to d6976af Compare December 16, 2025 00:45
@chessbyte
Contributor Author

@Sytten I re-read your concerns and added a second commit:

btoa() was incorrectly encoding the UTF-8 bytes of the input string
instead of treating each character as a byte value 0-255 (Latin-1).

This caused:

  • btoa(String.fromCharCode(255)) to return "w78=" instead of "/w=="
  • btoa("€") to return "4oKs" instead of throwing InvalidCharacterError

The fix iterates over character code points, validates each is ≤ 255, and converts directly to bytes. This ensures proper roundtrip with atob().
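A minimal sketch of that validate-and-convert step, together with the inverse mapping needed for the roundtrip. The names are hypothetical and this is not the code merged in this PR; the `Err(char)` return value stands in for throwing InvalidCharacterError.

```rust
/// Convert a binary string to bytes, rejecting any character above U+00FF.
/// On failure, returns the first offending character.
/// Hypothetical helper for illustration, not LLRT's implementation.
fn binary_string_to_bytes(s: &str) -> Result<Vec<u8>, char> {
    s.chars()
        .map(|c| if (c as u32) <= 0xFF { Ok(c as u8) } else { Err(c) })
        .collect()
}

/// Inverse direction (what atob() produces after Base64 decoding):
/// one character per byte, with the same code point.
fn bytes_to_binary_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

Because both directions use the same one-to-one mapping, every byte value 0 to 255 survives a roundtrip, while a character such as "€" (U+20AC) is rejected instead of being silently expanded to its UTF-8 bytes.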

This addresses concerns raised by @Sytten about UTF-16 to UTF-8 string handling edge cases. Sytten noted that rquickjs's Rust-to-JS string interface primarily works with UTF-8 and mentioned potential issues with "unsurrogate pairs" - referring to lone surrogate code units.

Background on surrogate pairs: JavaScript strings are UTF-16 encoded internally. Characters outside the Basic Multilingual Plane (code points U+10000 and above, like emoji) are represented as surrogate pairs - two 16-bit code units where a high surrogate (0xD800-0xDBFF) is followed by a low surrogate (0xDC00-0xDFFF). A "lone surrogate" occurs when one appears without its pair, which cannot be validly converted to UTF-8.

For btoa(), surrogate code units (0xD800-0xDFFF) are > 255, so they correctly trigger an error per the WHATWG spec requirement that btoa() only accepts characters with code points 0-255.
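The surrogate behaviour can be checked with Rust's standard library alone. The sketch below uses hypothetical helper names and models a JavaScript string as a slice of UTF-16 code units; it is an illustration, not how LLRT or rquickjs represent strings.

```rust
/// UTF-16 code units for a string, as a JavaScript engine would store them.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// True if the code unit lies in the surrogate range 0xD800..=0xDFFF.
fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}

/// btoa()-style check on a single code unit: only 0..=255 is acceptable.
fn fits_in_byte(unit: u16) -> bool {
    unit <= 0xFF
}

/// A lone high surrogate is not valid UTF-16, so it cannot become a Rust
/// `String` (which is always valid UTF-8).
fn lone_surrogate_is_rejected() -> bool {
    String::from_utf16(&[0xD83D]).is_err()
}
```

U+1F600 encodes as the pair `[0xD83D, 0xDE00]`; both units are surrogates and both fail the byte-range check, which is why a surrogate can never pass btoa() validation whether paired or lone.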

Node.js Comparison Tests:

| Test | Node.js | LLRT (before) | LLRT (after) |
| --- | --- | --- | --- |
| btoa(String.fromCharCode(255)) | "/w==" | "w78=" ❌ | "/w==" ✅ |
| btoa("€") | throws "Invalid character" | "4oKs" ❌ | throws ✅ |
| btoa(surrogate) | throws "Invalid character" | throws (UTF-8 error) | throws ✅ |
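The "before" and "after" values for the first two rows can be reproduced with a small self-contained sketch. The Base64 encoder and the two `btoa_*` functions below are hypothetical stand-ins written for this illustration, not LLRT's code; `None` stands in for throwing InvalidCharacterError.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Minimal standard Base64 encoder with '=' padding.
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Old behaviour as described in this PR: encode the string's UTF-8 bytes.
fn btoa_utf8_bug(s: &str) -> String {
    base64_encode(s.as_bytes())
}

/// Fixed behaviour: one byte per character, or `None` if any is above U+00FF.
fn btoa_latin1(s: &str) -> Option<String> {
    let bytes: Option<Vec<u8>> = s
        .chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
        .collect();
    bytes.map(|b| base64_encode(&b))
}
```

U+00FF is the single byte `0xFF` in Latin-1 but the two bytes `0xC3 0xBF` in UTF-8, which accounts for "/w==" versus "w78=".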

@chessbyte
Contributor Author

@Sytten @richarddavison what is the status of this PR? It has been sitting since my changes on Monday without any activity.

Collaborator

@richarddavison richarddavison left a comment

Sorry for the delay, xmas times... Minor perf suggestion!

Use is_ascii() check to skip character-by-character validation for the
common case of ASCII input, falling back to Latin-1 validation only for
non-ASCII strings.
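A rough sketch of what such a fast path could look like, assuming the input is available as a Rust `&str`. The function name is hypothetical and the merged code may be structured differently.

```rust
/// Convert a binary string to bytes with an ASCII fast path.
/// Hypothetical helper for illustration, not LLRT's implementation.
fn binary_string_to_bytes(s: &str) -> Result<Vec<u8>, char> {
    if s.is_ascii() {
        // Fast path: ASCII is byte-identical in UTF-8 and Latin-1,
        // so the underlying bytes can be copied without per-character checks.
        return Ok(s.as_bytes().to_vec());
    }
    // Slow path: validate each character and narrow it to a single byte,
    // returning the first character above U+00FF as the error.
    s.chars()
        .map(|c| if (c as u32) <= 0xFF { Ok(c as u8) } else { Err(c) })
        .collect()
}
```

The fast path is safe because `str::is_ascii` guarantees every byte is below 128, where the UTF-8 and Latin-1 encodings coincide.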
@chessbyte
Contributor Author

The failing tests should be addressed by #1295

@richarddavison richarddavison enabled auto-merge (squash) December 22, 2025 21:03
@richarddavison richarddavison merged commit b6a70bc into awslabs:main Dec 22, 2025
1 check passed
@chessbyte chessbyte deleted the fix/atob-latin1-encoding branch December 22, 2025 22:38
Development

Successfully merging this pull request may close these issues.

Error: PKCS#8 ASN.1 error: ASN.1 INTEGER not canonically encoded as DER

3 participants