fix: sanitize input text to remove null bytes and control characters #75

echobt · 2026-01-21T11:12:06Z

Description

This PR fixes a tokenizer failure: null bytes (\x00) and other control characters in the input text caused the tokenizer to fail with a generic error instead of a meaningful message.

Changes

Added a sanitize_text function in src/core/embeddings.rs that filters out control characters while preserving newlines (\n), tabs (\t), and carriage returns (\r).
Updated embed_batch to sanitize input text before passing it to the tokenizer.
Added unit tests to verify the sanitization logic.
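
The changes listed above could look roughly like the following. This is a minimal sketch, not the code from the PR: the body of sanitize_text is inferred from the description, and sanitize_batch is a hypothetical helper standing in for the sanitization step added to embed_batch, whose real signature is not shown here.

```rust
/// Sketch of the sanitization described above (assumed implementation;
/// the real function in src/core/embeddings.rs may differ).
/// Drops every Unicode control character except '\n', '\t', and '\r'.
fn sanitize_text(text: &str) -> String {
    text.chars()
        .filter(|&c| !c.is_control() || matches!(c, '\n' | '\t' | '\r'))
        .collect()
}

/// Hypothetical helper: sanitize each input before it reaches the tokenizer,
/// standing in for the step added to embed_batch.
fn sanitize_batch(texts: &[&str]) -> Vec<String> {
    texts.iter().map(|t| sanitize_text(t)).collect()
}
```

With this sketch, sanitize_text("a\0b") yields "ab", while newlines, tabs, and carriage returns pass through unchanged. Note that char::is_control also matches the C1 range (U+0080 to U+009F), so this version strips those characters as well.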

Testing

Ran cargo test to verify the new sanitization logic and ensure no regressions.
The new test test_sanitize_text passes, confirming that null bytes and unwanted control characters are removed.

Commit 4937dcf: fix: sanitize input text to remove null bytes and control characters

echobt mentioned this pull request Jan 21, 2026

[BUG] Null/Control Characters Cause Tokenization Failure Without Proper Error Message PlatformNetwork/bounty-challenge#218

Closed
