feat: use tokie for accurate token counting #31
…providers

Replace the rough len/5 heuristic with real tokenizer-based counting via tokie. Tokenizers are lazily loaded from HuggingFace Hub and cached. For models without a known tokenizer, falls back to cl100k (GPT-4).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploying catsu with
| Latest commit: | aab3170 |
|---|---|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://15a6a257.catsu-3ib.pages.dev |
| Branch Preview URL: | https://feat-tokie-token-counting.catsu-3ib.pages.dev |
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | catsu-docs | aab3170 | Commit Preview URL Branch Preview URL | Apr 21 2026, 01:28 AM |
Pull request overview
This PR introduces accurate token counting via the tokie tokenizer library for providers that don’t return token usage (notably Cloudflare and Gemini), using lazily loaded/cached tokenizers with sensible fallbacks.
Changes:
- Add `tokie` dependency and introduce a cached tokenizer loader/token counter module.
- Extend the model catalog data model to include optional tokenizer metadata (`TokenizerInfo`) loaded from `models.json`.
- Replace `len/5` token estimation with `tokie`-based counting in Cloudflare and Gemini providers (with fallbacks).
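
The fallback order described in this PR (the model's own tokenizer, then cl100k, then the `len/5` estimate) can be sketched as below. This is only an illustration under assumptions: `count_with_fallbacks` and both counter parameters are hypothetical names, the counters are plain closures rather than tokie tokenizers, and the `len/5` step is assumed to operate on the byte length of the text.

```rust
/// Counts tokens using the first counter that succeeds.
///
/// `model_counter` stands in for a model-specific tokenizer (absent when the
/// catalog has no tokenizer metadata); `default_counter` stands in for cl100k.
/// Each counter returns `None` when its tokenizer could not be loaded.
pub fn count_with_fallbacks<M, D>(
    text: &str,
    model_counter: Option<M>,
    default_counter: D,
) -> usize
where
    M: Fn(&str) -> Option<usize>,
    D: Fn(&str) -> Option<usize>,
{
    model_counter
        // 1. Try the model's own tokenizer, if one is configured.
        .and_then(|count| count(text))
        // 2. Otherwise try the default tokenizer.
        .or_else(|| default_counter(text))
        // 3. Last resort: the rough byte-length heuristic.
        .unwrap_or(text.len() / 5)
}
```

Keeping the heuristic as the last step means a caller always gets some count back, even when no tokenizer can be loaded at all.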
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Cargo.toml | Adds tokie dependency (HF-enabled) to support accurate tokenization. |
| src/tokenizer.rs | New module for cached tokenizer loading and token counting with fallbacks. |
| src/models.rs | Adds TokenizerInfo and attaches optional tokenizer config to ModelInfo. |
| src/catalog.rs | Deserializes optional tokenizer metadata from the embedded catalog JSON. |
| src/providers/cloudflare.rs | Uses tokenizer-based counting (via spawn_blocking) instead of len/5. |
| src/providers/gemini.rs | Uses tokenizer-based counting (via spawn_blocking) instead of len/5. |
| src/lib.rs | Registers tokenizer module and re-exports TokenizerInfo. |
    let mut cache = TOKENIZER_CACHE.lock().ok()?;

    if let Some(tok) = cache.get(name) {
        return Some(Arc::clone(tok));
    }

    match Tokenizer::from_pretrained(name) {
        Ok(tok) => {
            let tok = Arc::new(tok);
get_or_load_tokenizer holds the global TOKENIZER_CACHE mutex while calling Tokenizer::from_pretrained(name), which can perform disk/network I/O and block for a long time. This effectively serializes all concurrent token counting calls and can stall unrelated requests. Consider dropping the lock before loading (double-checked locking: check cache, release lock, load, then re-lock to insert) or using a per-tokenizer OnceCell/DashMap so only the same tokenizer load is deduplicated.
Suggested change

    - let mut cache = TOKENIZER_CACHE.lock().ok()?;
    - if let Some(tok) = cache.get(name) {
    -     return Some(Arc::clone(tok));
    - }
    - match Tokenizer::from_pretrained(name) {
    -     Ok(tok) => {
    -         let tok = Arc::new(tok);
    + {
    +     let cache = TOKENIZER_CACHE.lock().ok()?;
    +     if let Some(tok) = cache.get(name) {
    +         return Some(Arc::clone(tok));
    +     }
    + }
    + match Tokenizer::from_pretrained(name) {
    +     Ok(tok) => {
    +         let tok = Arc::new(tok);
    +         let mut cache = TOKENIZER_CACHE.lock().ok()?;
    +         if let Some(existing) = cache.get(name) {
    +             return Some(Arc::clone(existing));
    +         }
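
For readers who want the double-checked pattern from this comment in complete form, here is a minimal std-only sketch. It is not the code in `src/tokenizer.rs`: `FakeTokenizer`, `cache`, and `get_or_load` are hypothetical names, and the `load` closure stands in for the potentially slow `Tokenizer::from_pretrained(name)` call.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a loaded tokenizer (not tokie's real type).
pub struct FakeTokenizer {
    pub name: String,
}

/// Global cache keyed by tokenizer name, created on first use.
fn cache() -> &'static Mutex<HashMap<String, Arc<FakeTokenizer>>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<FakeTokenizer>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the cached tokenizer for `name`, calling `load` on a miss.
/// The mutex is never held while `load` runs.
pub fn get_or_load<F>(name: &str, load: F) -> Option<Arc<FakeTokenizer>>
where
    F: FnOnce(&str) -> Option<FakeTokenizer>,
{
    // First check: the guard is dropped at the end of this block.
    {
        let guard = cache().lock().ok()?;
        if let Some(tok) = guard.get(name) {
            return Some(Arc::clone(tok));
        }
    }

    // Slow path: disk or network I/O would happen here, with no lock held.
    let loaded = Arc::new(load(name)?);

    // Second check: if another thread inserted this name in the meantime,
    // keep its entry and discard ours.
    let mut guard = cache().lock().ok()?;
    let entry = guard.entry(name.to_string()).or_insert(loaded);
    Some(Arc::clone(&*entry))
}
```

The trade-off is that two threads missing the cache at the same time may both load the same tokenizer, and the later result is thrown away. The per-tokenizer `OnceCell` approach mentioned in the comment avoids that duplicate work at the cost of more machinery.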
    .await
    .unwrap_or_else(|_| {
        crate::tokenizer::fallback_count(&request.inputs)
    });
In the spawn_blocking(...).await error path, the fallback calls crate::tokenizer::fallback_count(&request.inputs) on the async runtime thread. Since fallback_count may attempt to load a tokenizer (and uses a blocking std::sync::Mutex), this can block the reactor thread. Consider making the error-path fallback a pure len/5 estimate, or running fallback_count in spawn_blocking as well.
    .await
    .unwrap_or_else(|_| {
        crate::tokenizer::fallback_count(&request.inputs)
    });
In the spawn_blocking(...).await error path, the fallback calls crate::tokenizer::fallback_count(&request.inputs) on the async runtime thread. Since fallback_count may attempt to load a tokenizer (and uses a blocking std::sync::Mutex), this can block the runtime thread. Consider making the error-path fallback a pure len/5 estimate, or running fallback_count inside spawn_blocking too.
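
One of the two options raised in these comments, a pure `len/5` estimate on the error path, might look like the sketch below. The names `estimate_tokens` and `count_or_estimate` are hypothetical, `inputs` is assumed to be a slice of `String`, the estimate is assumed to divide the total byte length by 5, and a generic `Result` stands in for the value returned by awaiting `spawn_blocking`.

```rust
/// Rough estimate with no locks and no I/O, so it is safe to call on an
/// async runtime thread: total byte length of all inputs divided by 5.
pub fn estimate_tokens(inputs: &[String]) -> usize {
    inputs.iter().map(|s| s.len()).sum::<usize>() / 5
}

/// Uses the count produced by the blocking task when it completed, and
/// falls back to the cheap estimate when the task failed to join.
pub fn count_or_estimate<E>(joined: Result<usize, E>, inputs: &[String]) -> usize {
    joined.unwrap_or_else(|_| estimate_tokens(inputs))
}
```

Because the error path does nothing but arithmetic, it cannot block on tokenizer loading or on the cache mutex.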
Summary
- `xenova/gpt-4`)
- `len/5` fallback if cl100k itself can't be loaded
- `tokenizer` metadata from `models.json` into `TokenizerInfo` on `ModelInfo`

Coverage

Files changed
- `Cargo.toml`: added `tokie` dependency
- `src/models.rs`: added `TokenizerInfo` struct and field on `ModelInfo`
- `src/catalog.rs`: deserialize tokenizer field from `models.json`
- `src/tokenizer.rs`: new module for cached tokenizer loading and counting
- `src/providers/cloudflare.rs`: replaced `len/5` with tokie
- `src/providers/gemini.rs`: replaced `len/5` with tokie
- `src/lib.rs`: registered new module, exported `TokenizerInfo`

Test plan
- `cargo build` compiles cleanly
- `cargo test`: all 11 tests + 2 doc-tests pass

🤖 Generated with Claude Code