fix(buffer): use Latin-1 encoding for atob() per WHATWG spec by chessbyte · Pull Request #1284 · awslabs/llrt

chessbyte · 2025-12-12T14:16:31Z

Issue # (if available)

Fixes #966

Description of changes

The atob() function was incorrectly treating decoded bytes as UTF-8, causing bytes >= 128 to be replaced with U+FFFD (replacement character).

Per the WHATWG spec, atob() should return a "binary string" where each character's code point directly represents a byte value (0-255). This matches Latin-1 (ISO-8859-1) encoding where each byte maps directly to a Unicode code point.
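The difference between the two decodings can be sketched in a few lines of Rust. This is a minimal illustration, not the code in `buffer.rs`; the function names `latin1_to_string` and `utf8_lossy_to_string` are hypothetical.

```rust
/// Decode bytes as Latin-1: every byte becomes the code point of the same value.
/// (Hypothetical helper for illustration, not LLRT's implementation.)
fn latin1_to_string(bytes: &[u8]) -> String {
    // `u8 as char` maps 0..=255 onto U+0000..=U+00FF one-to-one.
    bytes.iter().map(|&b| b as char).collect()
}

/// Decode bytes as UTF-8, replacing invalid sequences with U+FFFD.
/// (Hypothetical helper that mimics the behavior described as the bug.)
fn utf8_lossy_to_string(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For the bytes `[0x48, 0xFF, 0xC3, 0xA9]`, the Latin-1 helper yields four characters with code points equal to the byte values. The lossy UTF-8 helper yields three: `H`, U+FFFD, and `é`, because `0xC3 0xA9` happens to be a valid two-byte sequence. A collapse of this kind is one way a decoded length can come out shorter than the byte count.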

This fix resolves JWT verification issues (issue #966) where signature bytes containing values >= 128 were being corrupted, causing signature length mismatches (248 vs expected 256 bytes) and verification failures.

Added comprehensive tests for atob/btoa including:

- Basic encoding/decoding
- High-byte values (128-255) as Latin-1 characters
- Full 0-255 byte range verification
- Comparison with Buffer.from() for binary data

Sources:

Checklist

- Created unit tests in tests/unit and/or in Rust for my feature if needed
- Ran make fix to format JS and apply Clippy auto fixes
- Made sure my code didn't add any additional warnings: make check
- Added relevant type info in types/ directory
- Updated documentation if needed (API.md/README.md/Other)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

chessbyte · 2025-12-12T19:19:45Z

Tried fixing the seemingly unrelated Windows test failure in #1285.

richarddavison

Nice! LGTM thank you

Sytten · 2025-12-14T17:54:56Z

I had to do some tricks in the string decoder. The problem, though, is that the Rust-to-JS conversion only works with UTF-8. They added a raw 2-byte read but not a write.

chessbyte · 2025-12-14T18:17:56Z

@Sytten can you provide a failing scenario here? I will convert it to a test and ensure it passes.

Sytten · 2025-12-15T00:04:15Z

Honestly, it's a bit too far back in my head; I would need to recheck. It was something to do with unsurrogate pair. Might not be relevant here, just know that UTF-16 to UTF-8 is a PITA.

chessbyte · 2025-12-16T00:50:27Z

@Sytten I re-read your concerns and added a second commit:

btoa() was incorrectly encoding the UTF-8 bytes of the input string instead of treating each character as a byte value 0-255 (Latin-1).

This caused:

- btoa(String.fromCharCode(255)) to return "w78=" instead of "/w=="
- btoa("€") to return "4oKs" instead of throwing InvalidCharacterError

The fix iterates over character code points, validates each is ≤ 255, and converts directly to bytes. This ensures proper roundtrip with atob().
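The validation step described above could look roughly like the following. It is a sketch under assumptions: the name `binary_string_to_bytes` and the choice to return the offending character as the error are hypothetical, and the real code reports an InvalidCharacterError to JavaScript instead.

```rust
/// Convert a "binary string" into bytes, failing on the first character
/// whose code point is above U+00FF.
/// (Hypothetical helper; LLRT's actual function and error type may differ.)
fn binary_string_to_bytes(input: &str) -> Result<Vec<u8>, char> {
    input
        .chars()
        .map(|c| {
            if (c as u32) <= 0xFF {
                // The code point itself is the byte value.
                Ok(c as u8)
            } else {
                // Outside Latin-1: btoa() must reject the whole input.
                Err(c)
            }
        })
        .collect()
}
```

Note that `"\u{FF}".as_bytes()` is `[0xC3, 0xBF]` in Rust, since `&str` is UTF-8. Encoding those two bytes instead of the single byte `0xFF` is the behavior the commit describes as the bug.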

This addresses concerns raised by @Sytten about UTF-16 to UTF-8 string handling edge cases. Sytten noted that rquickjs's Rust-to-JS string interface primarily works with UTF-8 and mentioned potential issues with "unsurrogate pairs" - referring to lone surrogate code units.

Background on surrogate pairs: JavaScript strings are UTF-16 encoded internally. Characters outside the Basic Multilingual Plane (code points U+10000 and above, like emoji) are represented as surrogate pairs - two 16-bit code units where a high surrogate (0xD800-0xDBFF) is followed by a low surrogate (0xDC00-0xDFFF). A "lone surrogate" occurs when one appears without its pair, which cannot be validly converted to UTF-8.

For btoa(), surrogate code units (0xD800-0xDFFF) are > 255, so they correctly trigger an error per the WHATWG spec requirement that btoa() only accepts characters with code points 0-255.
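To make the surrogate argument concrete, here is a small sketch that works on raw UTF-16 code units, the way a JavaScript engine sees a string. The helper names are hypothetical; only `std::char::decode_utf16` and `String::from_utf16` are standard library calls.

```rust
/// Apply the btoa() range rule to UTF-16 code units: every unit must fit
/// in one byte. On failure, returns the index of the first offending unit.
/// (Hypothetical helper for illustration.)
fn code_units_to_bytes(units: &[u16]) -> Result<Vec<u8>, usize> {
    units
        .iter()
        .enumerate()
        .map(|(i, &u)| if u <= 0xFF { Ok(u as u8) } else { Err(i) })
        .collect()
}

/// True if the code units contain a surrogate that is not part of a valid
/// high/low pair, i.e. a lone surrogate.
fn has_lone_surrogate(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}
```

An emoji such as U+1F600 is the pair `[0xD83D, 0xDE00]`, and a lone `0xD83D` cannot be turned into a Rust `String` at all (`String::from_utf16` returns an error). In both cases the range check rejects the input at the first surrogate, so the lone-surrogate case never needs a UTF-8 conversion.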

Node.js Comparison Tests:

| Test | Node.js | LLRT (before) | LLRT (after) |
| --- | --- | --- | --- |
| btoa(String.fromCharCode(255)) | "/w==" | "w78=" ❌ | "/w==" ✅ |
| btoa("€") | throws "Invalid character" | "4oKs" ❌ | throws ✅ |
| btoa(surrogate) | throws "Invalid character" | throws (UTF-8 error) | throws ✅ |
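The "before" values in the comparison can be reproduced by hand: they are what standard base64 produces when it is fed the UTF-8 bytes of the input rather than the Latin-1 byte values. The encoder below is a minimal illustrative one written for this example (LLRT uses its own implementation); `base64_encode` and `ALPHABET` are hypothetical names.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Minimal standard base64 encoder with `=` padding.
/// (Illustrative only; not the encoder used by LLRT.)
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32,
        // treating missing bytes as zero.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit groups, padding where input bytes were missing.
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[((n >> 6) & 63) as usize] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[(n & 63) as usize] as char
        } else {
            '='
        });
    }
    out
}
```

With this encoder, `base64_encode(&[0xFF])` gives `"/w=="`, while `base64_encode("\u{FF}".as_bytes())` (the UTF-8 bytes `0xC3 0xBF`) gives `"w78="`, and the UTF-8 bytes of `"€"` give `"4oKs"`.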

chessbyte · 2025-12-19T17:23:26Z

@Sytten @richarddavison what is the status of this PR? It has been sitting since my changes on Monday without any activity.

richarddavison

Sorry for the delay, xmas times... Minor perf suggestion!

modules/llrt_buffer/src/buffer.rs

Use is_ascii() check to skip character-by-character validation for the common case of ASCII input, falling back to Latin-1 validation only for non-ASCII strings.
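A rough sketch of what such a fast path might look like, assuming the input is available as a Rust `&str`. The function name is hypothetical and the merged code may be structured differently.

```rust
/// Convert a "binary string" to bytes with an ASCII fast path.
/// (Hypothetical sketch of the suggestion; not the merged code.)
fn binary_string_to_bytes_fast(input: &str) -> Result<Vec<u8>, char> {
    if input.is_ascii() {
        // For pure ASCII, the UTF-8 bytes already equal the code points,
        // so no per-character validation is needed.
        return Ok(input.as_bytes().to_vec());
    }
    // Slow path: validate each character against the Latin-1 range.
    input
        .chars()
        .map(|c| if (c as u32) <= 0xFF { Ok(c as u8) } else { Err(c) })
        .collect()
}
```

The fast path is safe because ASCII is the one range where UTF-8 and Latin-1 agree byte for byte. Any input containing a character at or above U+0080 falls through to the per-character check.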

chessbyte · 2025-12-22T13:48:05Z

The failing tests should be addressed by #1295

chessbyte force-pushed the fix/atob-latin1-encoding branch from 9ac3816 to f038a0e Compare December 13, 2025 18:06

richarddavison approved these changes Dec 14, 2025

chessbyte added 2 commits December 15, 2025 19:44

chessbyte force-pushed the fix/atob-latin1-encoding branch from f038a0e to d6976af Compare December 16, 2025 00:45

richarddavison requested changes Dec 22, 2025

modules/llrt_buffer/src/buffer.rs

perf(buffer): add SIMD-optimized fast path for ASCII in btoa()

b372fcb

Use is_ascii() check to skip character-by-character validation for the common case of ASCII input, falling back to Latin-1 validation only for non-ASCII strings.

Merge branch 'main' into fix/atob-latin1-encoding

45a4449

richarddavison approved these changes Dec 22, 2025

richarddavison enabled auto-merge (squash) December 22, 2025 21:03

Merge branch 'main' into fix/atob-latin1-encoding

c146573

richarddavison merged commit b6a70bc into awslabs:main Dec 22, 2025
1 check passed

chessbyte deleted the fix/atob-latin1-encoding branch December 22, 2025 22:38

richarddavison merged 5 commits into awslabs:main from chessbyte:fix/atob-latin1-encoding
