
Conversation

@echobt (Contributor) commented Jan 21, 2026

Description

This PR addresses an issue where null bytes (\x00) and other control characters in the input text cause the tokenizer to fail with a generic error.

Changes

  • Added a sanitize_text function in src/core/embeddings.rs that filters out control characters while preserving newlines (\n), tabs (\t), and carriage returns (\r).
  • Updated embed_batch to sanitize input text before passing it to the tokenizer.
  • Added unit tests to verify the sanitization logic.
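The filtering described above can be sketched as follows. This is a minimal illustration of the approach, not the actual code from `src/core/embeddings.rs`; the real function signature and filtering rules may differ.

```rust
/// Illustrative sketch: strip control characters from input text,
/// keeping newlines, tabs, and carriage returns (as the PR describes).
fn sanitize_text(input: &str) -> String {
    input
        .chars()
        // `char::is_control` matches C0/C1 control characters,
        // including the null byte (\x00) that broke the tokenizer.
        .filter(|c| !c.is_control() || matches!(c, '\n' | '\t' | '\r'))
        .collect()
}

fn main() {
    let dirty = "hello\x00world\n\tok\x07";
    // Null byte and bell are removed; newline and tab survive.
    assert_eq!(sanitize_text(dirty), "helloworld\n\tok");
    println!("{:?}", sanitize_text(dirty));
}
```

Filtering before tokenization (rather than catching the tokenizer error) keeps the failure mode out of the hot path entirely and makes the behavior easy to unit-test in isolation.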

Testing

  • Ran cargo test to verify the new sanitization logic and ensure no regressions.
  • The new test test_sanitize_text passes, confirming that null bytes and other unwanted control characters are stripped while newlines, tabs, and carriage returns are preserved.
