
fix(embedder): add unified input truncation to prevent token limit errors#760

Closed
yangxinxin-7 wants to merge 4 commits into volcengine:main from yangxinxin-7:fix/embedding-input-truncation

Conversation

@yangxinxin-7
Collaborator

Summary

Fixes #616 and #686: embedding requests fail when input text exceeds model token limits.

  • Add truncate_text_by_tokens(text, max_tokens) to base.py with three-layer truncation:
    1. Fast path: if utf-8 byte count ≤ max_tokens, return immediately (provably safe: token count ≤ utf-8 bytes for BPE)
    2. tiktoken path: encode with cl100k_base, slice to max_tokens, decode
    3. Fallback: byte-based truncation when tiktoken is unavailable
  • Add max_input_tokens property and _truncate_input() to EmbedderBase; all embedders call it at embed() / embed_batch() entry
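The three-layer strategy above can be sketched as follows (a minimal illustration, not the exact base.py implementation; the real code also caches the encoder and logs a warning on truncation):

```python
def truncate_text_by_tokens(text: str, max_tokens: int) -> str:
    # Layer 1 (fast path): for BPE tokenizers the token count never
    # exceeds the UTF-8 byte count, so short inputs are provably safe.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    try:
        # Layer 2: tokenize with cl100k_base and slice to the limit.
        # disallowed_special=() makes strings like "<|endoftext|>"
        # tokenize as plain text instead of raising.
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text, disallowed_special=())
        # Strip any torn multi-byte tail left by replacement decoding.
        return enc.decode(tokens[:max_tokens]).rstrip("\ufffd")
    except Exception:
        # Layer 3: byte-based fallback when tiktoken is unavailable;
        # errors="ignore" drops a torn trailing multi-byte sequence,
        # so the result is always valid UTF-8.
        return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```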

Per-model token limits

| Embedder | Limit |
| --- | --- |
| OpenAI | 8000 (default) |
| Jina v5-text-small | 32000 |
| Jina v5-text-nano | 8000 |
| Voyage | 8000 (default) |
| Volcengine (vision) | 8000 |
| Volcengine (non-vision) | 4000 |
| VikingDB bge-large-zh | 500 |
| VikingDB bge-m3 / bge-visualized-m3 | 8000 |
| VikingDB doubao-embedding-large | 4000 |
| VikingDB doubao-embedding-vision | 8000 |
| VikingDB doubao-embedding | 4000 |
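For illustration, the VikingDB rows of this table could be encoded as a lookup like the one below. The real constant is VIKINGDB_MODEL_MAX_TOKENS in vikingdb_embedders.py; the matching logic here is a hypothetical sketch:

```python
# Values taken from the table above; the lookup logic is illustrative.
VIKINGDB_MODEL_MAX_TOKENS = {
    "bge-large-zh": 500,
    "bge-m3": 8000,
    "bge-visualized-m3": 8000,
    "doubao-embedding-large": 4000,
    "doubao-embedding-vision": 8000,
    "doubao-embedding": 4000,
}
DEFAULT_MAX_TOKENS = 8000

def max_tokens_for(model_name: str) -> int:
    # Check longer keys first so "doubao-embedding-large" is not
    # shadowed by the shorter "doubao-embedding" prefix.
    for key in sorted(VIKINGDB_MODEL_MAX_TOKENS, key=len, reverse=True):
        if key in model_name:
            return VIKINGDB_MODEL_MAX_TOKENS[key]
    return DEFAULT_MAX_TOKENS
```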

Test plan

  • New test file tests/unit/test_truncate_input.py — 15 cases (all passing):
    • ASCII and CJK text truncated within token limit
    • Special token strings (e.g. <|endoftext|>) do not raise
    • tiktoken unavailable → byte-based fallback
    • Short text takes utf-8 fast path without calling tiktoken
    • Warning logged when truncation occurs
    • Truncated result is always valid UTF-8

Closes #616
Closes #686

yangxinxin-7 and others added 2 commits March 19, 2026 11:54
…rors

Add truncate_text_by_tokens() with tiktoken (cl100k_base) as primary
tokenizer, utf-8 byte count as fast path, and byte-based truncation as
fallback. Apply _truncate_input() across all embedder implementations
(OpenAI, Jina, Voyage, Volcengine, VikingDB) with per-model token limits.

Fixes volcengine#616
Fixes volcengine#686

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused `import re` in base.py
- Add warning log when tiktoken initialization fails instead of silent pass
- Fix double blank line (PEP 8)
- Use identity check (`is not`) in _truncate_input to avoid full string comparison on long texts
- Add double-checked locking for thread-safe tiktoken singleton initialization
- Cache `_get_max_input_tokens()` result in VikingDB embedder __init__ to avoid repeated string matching per batch item

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yangxinxin-7 and others added 2 commits March 19, 2026 12:09
- Use sentinel _TIKTOKEN_NOT_AVAILABLE to cache failed tiktoken init,
  preventing repeated import attempts and log spam on every embed() call
  when tiktoken is unavailable
- Replace `is not` identity check with `len(truncated) < len(text)` in
  _truncate_input to correctly detect truncation regardless of string
  object identity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fast path (utf-8 bytes <= max_tokens) is safe for ALL models including BGE,
  since token count <= utf-8 byte count holds universally
- When the fast path misses, cl100k_base may undercount BGE SentencePiece tokens
  (BGE's tokenizer produces ~1.76 tokens per CJK character, while cl100k_base
  merges CJK more aggressively); mitigated by conservative per-model limits
  (bge-large-zh=500, bge-m3=8000)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@qin-ptr left a comment


Summary

This PR fixes embedding failures when input text exceeds model token limits (closes #616, #686). The implementation adds truncate_text_by_tokens() with a three-tier strategy (UTF-8 fast path → tiktoken → byte fallback) and applies _truncate_input() across all embedder implementations.

Overall assessment: The truncation implementation is solid and well-tested, but there are two blocking issues that must be addressed before merging.


Blocking Issues

1. CI lint failure

The lint / lint check is failing. This must be fixed before the PR can be merged.

2. Incomplete fix for issue #686

Location: openviking/models/embedder/base.py (_truncate_input method)

Issue #686 identifies two root causes:

  1. No input truncation guard (✅ fixed by this PR)
  2. Unhandled exception in embedding queue blocks uvicorn (❌ not addressed)

From issue #686's logs:

RuntimeError: OpenAI API error: Error code: 400 - ...

This exception propagates from openai_embedders.py:embed() through collection_schemas.py:on_dequeue and appears to block the async event loop, making the entire HTTP server unresponsive.

Why this matters: Even with truncation in place, other embedding errors (network failures, API rate limits, etc.) could still crash the server.

Recommendation: Either:

  • Add exception handling in collection_schemas.py:on_dequeue to gracefully handle embedding failures without blocking the event loop
  • Or add a TODO comment and create a follow-up issue to track this work
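The first option could look roughly like the sketch below. This is hedged: the real on_dequeue in collection_schemas.py has a different signature, and embed_fn here is a stand-in for the embedder call.

```python
import asyncio
import logging

async def on_dequeue(item, embed_fn):
    """Illustrative defensive wrapper around the embedding call."""
    try:
        return await embed_fn(item)
    except Exception:
        # Log with traceback and swallow the error so one bad item
        # cannot stall the queue consumer or the event loop.
        logging.exception("embedding failed for queued item; skipping")
        return None
```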

Non-blocking Suggestions

3. Design choice: truncation vs chunking

Location: openviking/models/embedder/base.py (truncate_text_by_tokens function)

Simple truncation causes information loss. Related issues (#530 "long memory should use chunked vectorization", #531 "truncation vs chunking responsibilities") discuss chunking as a better long-term solution.

Suggestion:

  • Document in the PR description or code comments why truncation was chosen over chunking
  • Or add a config option to let users choose behavior (truncate / error / future chunking support)
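The config-option idea could be sketched as below. The OverflowPolicy names are hypothetical (no such option exists in OpenViking today), and byte-based truncation stands in for the real tokenizer-aware truncate_text_by_tokens():

```python
from enum import Enum

class OverflowPolicy(str, Enum):
    TRUNCATE = "truncate"
    ERROR = "error"
    # CHUNK = "chunk"  # possible future behavior (see #530 / #531)

def handle_overflow(text: str, max_tokens: int, policy: OverflowPolicy) -> str:
    # UTF-8 byte count is a cheap upper bound on the token count.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    if policy is OverflowPolicy.ERROR:
        raise ValueError(f"input exceeds {max_tokens} tokens")
    # TRUNCATE: byte-based truncation as a simplified stand-in.
    return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```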

4. Commit organization

The 4 commits should be squashed into one. Commits 2-4 ("fix code review issues", "fix tiktoken sentinel", "docs clarify") are incremental fixes to the initial implementation and should be merged for a cleaner git history.

5. Warning log detail

Location: openviking/models/embedder/base.py:197

The truncation warning should include original length:

```python
logging.warning(
    f"[{self.__class__.__name__}] Input truncated from {len(text)} chars "
    f"to {self.max_input_tokens} tokens (result: {len(truncated)} chars)"
)
```

6. Backward compatibility impact

Location: PR description

Should document that truncation changes embedding results. For existing vector databases, re-embedding the same text will produce different vectors (if text exceeded limits), potentially affecting query results.

7. Configurable token limits

Location: openviking/models/embedder/vikingdb_embedders.py (VIKINGDB_MODEL_MAX_TOKENS)

Token limits are hardcoded. Consider allowing users to override these in config for different deployment scenarios.
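One lightweight shape for such an override (the `embedder_max_tokens` config key is hypothetical; only the default values come from this PR):

```python
# Built-in defaults (subset of the per-model table in this PR).
DEFAULT_MAX_TOKENS = {"bge-large-zh": 500, "bge-m3": 8000}

def resolve_limits(config: dict) -> dict:
    limits = dict(DEFAULT_MAX_TOKENS)  # start from built-in defaults
    # User-supplied overrides win over the hardcoded values.
    limits.update(config.get("embedder_max_tokens", {}))
    return limits
```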


Highlights

  • Comprehensive test coverage: 15 test cases covering ASCII/CJK text, special tokens, tiktoken fallback, UTF-8 boundaries
  • Thread safety: Double-checked locking for tiktoken singleton
  • Performance optimization: UTF-8 fast path avoids unnecessary tokenization; VikingDB caches token limit lookup
  • Graceful fallback: Byte-based truncation when tiktoken is unavailable
  • Clear documentation: Explains cl100k_base limitations for non-OpenAI models

Recommendation

REQUEST_CHANGES due to:

  1. CI lint failure (must fix)
  2. Incomplete fix for issue #686's second root cause (exception handling)

Once these are addressed, the PR is ready to merge. The truncation approach, while not optimal long-term (chunking is better), is acceptable as a quick fix.


🤖 I am a bot owned by @qin-ctx.

@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 19, 2026