datadog: fix UTF-8 continuation byte handling in metric name sanitiza…#212

Merged
kevinburkesegment merged 1 commit into main from sanitization-cleanups
Nov 4, 2025
@kevinburkesegment kevinburkesegment commented Nov 4, 2025

…tion

The previous implementation had a subtle bug when processing invalid UTF-8 sequences. When it encountered an orphaned continuation byte (0x80-0xBF) or an incomplete multi-byte sequence, the code inserted a replacement character but failed to skip the continuation bytes that followed. This could result in:

  1. Invalid UTF-8 output when continuation bytes were processed as standalone characters
  2. Multiple consecutive replacement characters instead of a single collapsed one
  3. Incorrect handling of mixed valid/invalid UTF-8 sequences
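
The difference between replacing every invalid byte and skipping trailing continuation bytes can be sketched as follows. This is a minimal illustration, not the package's code: the function names and the `_` replacement byte are assumptions, and the naive variant models only the "multiple consecutive replacement characters" symptom.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// replacement is an assumed placeholder byte, not necessarily what the package emits.
const replacement = '_'

// isContinuation reports whether b has the form 10xxxxxx (0x80-0xBF).
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// replaceEachInvalidByte (hypothetical) emits one replacement per invalid byte,
// so every stray continuation byte produces its own replacement.
func replaceEachInvalidByte(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			b.WriteByte(replacement)
		} else {
			b.WriteRune(r)
		}
		i += size
	}
	return b.String()
}

// skipTrailingContinuations (hypothetical) emits one replacement for an invalid
// byte and then skips every continuation byte that directly follows it.
func skipTrailingContinuations(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r != utf8.RuneError || size > 1 {
			// Valid rune; a literal U+FFFD decodes with size 3 and is kept.
			b.WriteRune(r)
			i += size
			continue
		}
		b.WriteByte(replacement)
		i++
		for i < len(s) && isContinuation(s[i]) {
			i++
		}
	}
	return b.String()
}

func main() {
	in := "cpu\x80\x80load" // two orphaned continuation bytes
	fmt.Printf("%q\n", replaceEachInvalidByte(in))    // "cpu__load"
	fmt.Printf("%q\n", skipTrailingContinuations(in)) // "cpu_load"
}
```

Note that this sketch collapses only continuation bytes that trail an invalid byte; two unrelated invalid lead bytes in a row would still yield two replacements.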

This fix skips continuation bytes after detecting an invalid sequence, ensuring the output is always valid UTF-8. The change also renames accentMap to latin1SupplementMap and improves its documentation to clarify that the table maps Unicode code points in the Latin-1 Supplement range (U+00C0-U+00FF), not the array indices themselves.
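
One way to read the code point versus array index distinction is a table that is addressed by code point offset. The sketch below assumes the index is `codepoint - 0xC0`; that layout, the name `latin1Sketch`, the helper `foldLatin1`, and the handful of entries are all illustrative and are not taken from the package.

```go
package main

import "fmt"

const (
	latin1Start = 0x00C0 // first code point covered by the table
	latin1End   = 0x00FF // last code point covered by the table
)

// latin1Sketch is a hypothetical, partial table. Entry i describes code point
// latin1Start+i, so lookups must subtract latin1Start first. Zero means
// "no ASCII fold defined in this sketch".
var latin1Sketch = [latin1End - latin1Start + 1]byte{
	'\u00C0' - latin1Start: 'A', // A with grave
	'\u00C9' - latin1Start: 'E', // E with acute
	'\u00E9' - latin1Start: 'e', // e with acute
	'\u00F1' - latin1Start: 'n', // n with tilde
	'\u00FC' - latin1Start: 'u', // u with diaeresis
}

// foldLatin1 returns the ASCII fold for r if r lies in U+00C0-U+00FF and the
// sketch table has an entry for it.
func foldLatin1(r rune) (byte, bool) {
	if r < latin1Start || r > latin1End {
		return 0, false
	}
	c := latin1Sketch[r-latin1Start]
	return c, c != 0
}

func main() {
	for _, r := range "\u00C9t\u00E9 \u00D7" {
		if c, ok := foldLatin1(r); ok {
			fmt.Printf("%q -> %q\n", r, c)
		} else {
			fmt.Printf("%q has no entry in this sketch\n", r)
		}
	}
}
```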

Added comprehensive test cases covering edge cases like orphaned continuation bytes, incomplete sequences, invalid surrogates, and mixed valid/invalid UTF-8. Also added fuzz testing and benchmarks to validate correctness and performance.
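
A fuzz test for this kind of sanitizer usually asserts invariants rather than exact outputs. The sketch below is an assumption about what such a test could look like, not the test added in this PR: `sanitizeStandIn` wraps the standard library's `strings.ToValidUTF8` in place of the real sanitizer, and `checkInvariants`, `seeds`, and `FuzzSanitizeSketch` are hypothetical names. In a real repository the fuzz function would live in a `_test.go` file; `main` is included only so the sketch runs on its own.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"unicode/utf8"
)

// sanitizeStandIn is NOT the package's sanitizer. It uses the standard
// library to replace each run of invalid bytes with a single "_".
func sanitizeStandIn(s string) string {
	return strings.ToValidUTF8(s, "_")
}

// checkInvariants verifies two properties for any input: the output is valid
// UTF-8, and sanitizing an already sanitized string changes nothing.
func checkInvariants(in string) error {
	out := sanitizeStandIn(in)
	if !utf8.ValidString(out) {
		return fmt.Errorf("invalid UTF-8 output %q for input %q", out, in)
	}
	if again := sanitizeStandIn(out); again != out {
		return fmt.Errorf("not idempotent: %q became %q", out, again)
	}
	return nil
}

// seeds mirror the edge cases named in the description.
var seeds = []string{
	"cpu\x80\x80load",   // orphaned continuation bytes
	"a\xE2\x82b",        // incomplete three-byte sequence
	"\xED\xA0\x80",      // encoded surrogate, invalid in UTF-8
	"caf\u00e9\xFFlatency", // mixed valid and invalid
}

// FuzzSanitizeSketch feeds the seeds to Go's native fuzzer
// (run with: go test -fuzz=FuzzSanitizeSketch).
func FuzzSanitizeSketch(f *testing.F) {
	for _, s := range seeds {
		f.Add(s)
	}
	f.Fuzz(func(t *testing.T, in string) {
		if err := checkInvariants(in); err != nil {
			t.Fatal(err)
		}
	})
}

func main() {
	for _, s := range seeds {
		if err := checkInvariants(s); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Printf("ok %q\n", s)
	}
}
```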

@kevinburkesegment kevinburkesegment force-pushed the sanitization-cleanups branch 2 times, most recently from ecba2a6 to 9be582d Compare November 4, 2025 17:21
etiennep previously approved these changes Nov 4, 2025
…tion

The previous implementation had a subtle bug when processing invalid UTF-8
sequences. When encountering an orphaned continuation byte (0x80-0xBF) or
incomplete multi-byte sequence, the code would insert a replacement character
but fail to skip the continuation bytes that followed. This could result in:

1. Invalid UTF-8 output when continuation bytes were processed as standalone
   characters
2. Multiple consecutive replacement characters instead of collapsing them
3. Incorrect handling of mixed valid/invalid UTF-8 sequences

This fix properly skips continuation bytes after detecting invalid sequences,
ensuring the output is always valid UTF-8. The change also renames `accentMap`
to `latin1SupplementMap` with improved documentation to clarify that it maps
Unicode codepoints in the Latin-1 Supplement range (U+00C0-U+00FF), not the
array indices themselves.

Added comprehensive test cases covering edge cases like orphaned continuation
bytes, incomplete sequences, invalid surrogates, and mixed valid/invalid UTF-8.
Also added fuzz testing and benchmarks to validate correctness and performance.

Co-Authored-By: Claude <noreply@anthropic.com>
@kevinburkesegment kevinburkesegment merged commit d6e7f85 into main Nov 4, 2025
8 checks passed
@kevinburkesegment kevinburkesegment deleted the sanitization-cleanups branch November 4, 2025 17:30