fix(buffer): use Latin-1 encoding for atob() per WHATWG spec #1284
richarddavison merged 5 commits into awslabs:main
Conversation
Tried fixing seemingly unrelated Windows test in #1285
Force-pushed 9ac3816 to f038a0e
richarddavison
left a comment
Nice! LGTM thank you
I had to do some tricks in the string decoder. The problem, though, is that the Rust-to-JS string interface only works with UTF-8. They added a raw 2-byte read but not a write.
@Sytten can you provide a failing scenario here? I will convert it to a test and ensure it passes.
Honestly it's a bit too far back in my head; I would need to recheck. It was something to do with "unsurrogate" pairs. Might not be relevant here, just know that UTF-16 to UTF-8 conversion is a PITA.
The atob() function was incorrectly treating decoded bytes as UTF-8, causing bytes >= 128 to be replaced with U+FFFD (the replacement character). Per the WHATWG spec, atob() should return a "binary string" where each character's code point directly represents a byte value (0-255). This matches Latin-1 (ISO-8859-1) encoding, where each byte maps directly to a Unicode code point.

This fix resolves JWT verification issues (issue awslabs#966) where signature bytes containing values >= 128 were being corrupted, causing signature length mismatches (248 vs the expected 256 bytes) and verification failures.

Added comprehensive tests for atob/btoa including:
- Basic encoding/decoding
- High-byte values (128-255) as Latin-1 characters
- Full 0-255 byte range verification
- Comparison with Buffer.from() for binary data

Fixes awslabs#966
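The binary-string contract described above can be sketched directly; the snippet below assumes a runtime whose atob()/btoa() follow the WHATWG spec (Node.js 16+ does, and LLRT does after this fix):

```javascript
// atob() must map each decoded byte to one character whose code point
// equals the byte value (0-255), i.e. Latin-1, not UTF-8.
const decoded = atob("/w=="); // base64 for the single byte 0xFF
console.log(decoded.length); // 1
console.log(decoded.charCodeAt(0)); // 255 (not U+FFFD / 65533)

// Round-trip over the full 0-255 byte range:
let all = "";
for (let i = 0; i < 256; i++) all += String.fromCharCode(i);
console.log(atob(btoa(all)) === all); // true
```

Under the broken UTF-8 interpretation, every byte >= 128 collapses to U+FFFD, which is exactly why the JWT signature in issue #966 shrank from 256 to 248 bytes.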
btoa() was incorrectly encoding the UTF-8 bytes of the input string
instead of treating each character as a byte value 0-255 (Latin-1).
This caused:
- btoa(String.fromCharCode(255)) to return "w78=" instead of "/w=="
- btoa("€") to return "4oKs" instead of throwing InvalidCharacterError
The fix iterates over character code points, validates each is ≤ 255,
and converts directly to bytes. This ensures proper roundtrip with atob().
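The before/after behavior is easy to check in a spec-compliant runtime (Node.js is assumed here):

```javascript
// Each character is treated as a byte value 0-255 (Latin-1):
console.log(btoa(String.fromCharCode(255))); // "/w==", the single byte 0xFF

// Characters with code points > 255 must be rejected, not UTF-8 encoded:
try {
  btoa("\u20AC"); // "€" has code point 0x20AC
} catch (e) {
  console.log(e.name); // "InvalidCharacterError"
}
```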
This addresses concerns raised by @Sytten in PR awslabs#1284 about UTF-16 to
UTF-8 string handling edge cases. Sytten noted that rquickjs's Rust-to-JS
string interface primarily works with UTF-8 and mentioned potential issues
with "unsurrogate pairs" - referring to lone surrogate code units.
Background on surrogate pairs: JavaScript strings are UTF-16 encoded
internally. Characters outside the Basic Multilingual Plane (code points
U+10000 and above, like emoji) are represented as surrogate pairs - two
16-bit code units where a high surrogate (0xD800-0xDBFF) is followed by
a low surrogate (0xDC00-0xDFFF). A "lone surrogate" occurs when one
appears without its pair, which cannot be validly converted to UTF-8.
For btoa(), surrogate code units (0xD800-0xDFFF) are > 255, so they
correctly trigger an error per the WHATWG spec requirement that btoa()
only accepts characters with code points 0-255.
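A small illustration of the surrogate-pair mechanics described above (plain JavaScript; the btoa() behavior assumes a spec-compliant runtime such as Node.js):

```javascript
const emoji = "\u{1F600}"; // 😀, U+1F600, outside the BMP
console.log(emoji.length); // 2 (two UTF-16 code units)
console.log(emoji.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // "de00" (low surrogate)

// Both code units are > 255, so btoa() rejects them per spec,
// whether they appear paired or as a lone surrogate:
try { btoa(emoji); } catch (e) { console.log(e.name); }
```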
Force-pushed f038a0e to d6976af
@Sytten I re-read your concerns and added a 2nd commit: btoa() was incorrectly encoding the UTF-8 bytes of the input string instead of treating each character as a byte value 0-255 (Latin-1). The fix iterates over character code points, validates each is ≤ 255, and converts directly to bytes, ensuring a proper roundtrip with atob(). Surrogate code units (0xD800-0xDFFF) are > 255, so they correctly trigger an error per the WHATWG spec requirement that btoa() only accept characters with code points 0-255. Node.js comparison tests added.
@Sytten @richarddavison what is the status of this PR? It has been sitting since my changes on Monday without any activity.
richarddavison
left a comment
Sorry for the delay, xmas times... Minor perf suggestion!
Use an is_ascii() check to skip character-by-character validation for the common case of ASCII input, falling back to Latin-1 validation only for non-ASCII strings.
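As an illustrative sketch only: the suggestion targets Rust's str::is_ascii() in the actual implementation, but the shape of the fast path can be shown in JavaScript (isValidBtoaInput is a hypothetical helper, not the real code):

```javascript
// Hypothetical sketch of the suggested fast path: an all-ASCII input is
// trivially valid btoa() input (every code unit < 128 <= 255), so the
// per-character Latin-1 bound check only runs for non-ASCII strings.
function isValidBtoaInput(s) {
  // Fast path: whole-string ASCII check (one scan, no branching per char).
  if (/^[\x00-\x7F]*$/.test(s)) return true;
  // Slow path: Latin-1 validation, character by character.
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 255) return false;
  }
  return true;
}

console.log(isValidBtoaInput("hello"));  // true  (fast path)
console.log(isValidBtoaInput("\u00FF")); // true  (slow path, still Latin-1)
console.log(isValidBtoaInput("\u20AC")); // false (code point > 255)
```

In Rust the same split would be `if s.is_ascii() { /* skip validation */ }` before the per-char bound check.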
The failing tests should be addressed by #1295
Issue # (if available)
Fixes #966
Description of changes
The atob() function was incorrectly treating decoded bytes as UTF-8, causing bytes >= 128 to be replaced with U+FFFD (replacement character).
Per the WHATWG spec, atob() should return a "binary string" where each character's code point directly represents a byte value (0-255). This matches Latin-1 (ISO-8859-1) encoding where each byte maps directly to a Unicode code point.
This fix resolves JWT verification issues (issue #966) where signature bytes containing values >= 128 were being corrupted, causing signature length mismatches (248 vs expected 256 bytes) and verification failures.
Added comprehensive tests for atob/btoa including:
- Basic encoding/decoding
- High-byte values (128-255) as Latin-1 characters
- Full 0-255 byte range verification
- Comparison with Buffer.from() for binary data
Checklist
- [ ] Created unit tests in tests/unit and/or in Rust for my feature if needed
- [ ] Ran make fix to format JS and apply Clippy auto fixes
- [ ] Made sure no errors or warnings are reported by make check
- [ ] Updated the types/ directory if needed

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.