datadog: fix UTF-8 continuation byte handling in metric name sanitiza… by kevinburkesegment · Pull Request #212 · segmentio/stats

kevinburkesegment · 2025-11-04T17:13:58Z

…tion

The previous implementation had a subtle bug when processing invalid UTF-8 sequences. When encountering an orphaned continuation byte (0x80-0xBF) or incomplete multi-byte sequence, the code would insert a replacement character but fail to skip the continuation bytes that followed. This could result in:

1. Invalid UTF-8 output when continuation bytes were processed as standalone characters
2. Multiple consecutive replacement characters instead of collapsing them
3. Incorrect handling of mixed valid/invalid UTF-8 sequences

This fix properly skips continuation bytes after detecting invalid sequences, ensuring the output is always valid UTF-8. The change also renames `accentMap` to `latin1SupplementMap` with improved documentation to clarify that it maps Unicode codepoints in the Latin-1 Supplement range (U+00C0-U+00FF), not the array indices themselves.

Added comprehensive test cases covering edge cases like orphaned continuation bytes, incomplete sequences, invalid surrogates, and mixed valid/invalid UTF-8. Also added fuzz testing and benchmarks to validate correctness and performance.

Co-Authored-By: Claude <noreply@anthropic.com>

kevinburkesegment force-pushed the sanitization-cleanups branch 2 times, most recently from ecba2a6 to 9be582d Compare November 4, 2025 17:21

etiennep previously approved these changes Nov 4, 2025


kevinburkesegment dismissed etiennep’s stale review via d6e7f85 November 4, 2025 17:23

kevinburkesegment force-pushed the sanitization-cleanups branch from 9be582d to d6e7f85 Compare November 4, 2025 17:23

etiennep approved these changes Nov 4, 2025


kevinburkesegment merged commit d6e7f85 into main Nov 4, 2025
8 checks passed

kevinburkesegment deleted the sanitization-cleanups branch November 4, 2025 17:30

kevinburkesegment merged 1 commit into main from sanitization-cleanups


2 participants
