
feat: use tokie for accurate token counting#31

Open
chonknick wants to merge 1 commit into main from feat/tokie-token-counting


@chonknick


Summary

  • Adds tokie as a dependency for accurate token counting in providers that don't return token counts from their API (Cloudflare and Gemini)
  • Tokenizers are lazily loaded from HuggingFace Hub and cached in-process for reuse
  • Supports both HuggingFace tokenizers and tiktoken (cl100k_base mapped via xenova/gpt-4)
  • Falls back to cl100k tokenizer when a model's specific tokenizer is unavailable, with a final len/5 fallback if cl100k itself can't be loaded
  • Deserializes the existing tokenizer metadata from models.json into TokenizerInfo on ModelInfo
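The fallback order described above (model tokenizer, then cl100k, then len/5) can be sketched roughly as follows. This is only an illustration, not the PR's code: the `CountTokens` trait, `WhitespaceTokenizer`, and both function names are hypothetical, and it assumes "len/5" means total byte length divided by five.

```rust
/// Hypothetical stand-in for a loaded tokenizer (not tokie's API).
pub trait CountTokens {
    fn count(&self, text: &str) -> usize;
}

/// Last-resort estimate. Assumption: "len/5" is total byte length / 5.
pub fn len_over_five(inputs: &[String]) -> usize {
    inputs.iter().map(|s| s.len()).sum::<usize>() / 5
}

/// Prefer the model's own tokenizer, then cl100k, then the len/5 estimate.
pub fn count_with_fallback<'a>(
    inputs: &[String],
    model_tokenizer: Option<&'a dyn CountTokens>,
    cl100k: Option<&'a dyn CountTokens>,
) -> usize {
    match model_tokenizer.or(cl100k) {
        Some(tok) => inputs.iter().map(|s| tok.count(s)).sum(),
        None => len_over_five(inputs),
    }
}

/// Toy tokenizer used only to exercise the chain:
/// one token per whitespace-separated word.
pub struct WhitespaceTokenizer;

impl CountTokens for WhitespaceTokenizer {
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}
```

Passing `None` for both tokenizers exercises the final estimate, which is the path taken when neither the model tokenizer nor cl100k can be loaded.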

Coverage

| Provider | Model | Token Counting |
| --- | --- | --- |
| Cloudflare | bge-small/base/large-en-v1.5, bge-m3, qwen3-embedding-0.6b | Accurate (tokie) |
| Cloudflare | embeddinggemma-300m, plamo-embedding-1b | cl100k fallback |
| Gemini | gemini-embedding-001 | Accurate (cl100k via tokie) |

Files changed

  • Cargo.toml — added tokie dependency
  • src/models.rs — added TokenizerInfo struct and field on ModelInfo
  • src/catalog.rs — deserialize tokenizer field from models.json
  • src/tokenizer.rs — new module: cached tokenizer loading and counting
  • src/providers/cloudflare.rs — replaced len/5 with tokie
  • src/providers/gemini.rs — replaced len/5 with tokie
  • src/lib.rs — registered new module, exported TokenizerInfo

Test plan

  • cargo build compiles cleanly
  • cargo test — all 11 tests + 2 doc-tests pass
  • Manual test with Cloudflare provider to verify accurate token counts
  • Manual test with Gemini provider to verify cl100k counting

🤖 Generated with Claude Code

…providers

Replace the rough len/5 heuristic with real tokenizer-based counting
via tokie. Tokenizers are lazily loaded from HuggingFace Hub and cached.
For models without a known tokenizer, falls back to cl100k (GPT-4).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 01:27
@cloudflare-workers-and-pages


Deploying catsu with Cloudflare Pages

Latest commit: aab3170
Status: ✅  Deploy successful!
Preview URL: https://15a6a257.catsu-3ib.pages.dev
Branch Preview URL: https://feat-tokie-token-counting.catsu-3ib.pages.dev


@cloudflare-workers-and-pages


Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! | catsu-docs | aab3170 | Commit Preview URL, Branch Preview URL | Apr 21 2026, 01:28 AM |

Copilot AI left a comment


Pull request overview

This PR introduces accurate token counting via the tokie tokenizer library for providers that don’t return token usage (notably Cloudflare and Gemini), using lazily loaded/cached tokenizers with sensible fallbacks.

Changes:

  • Add tokie dependency and introduce a cached tokenizer loader/token counter module.
  • Extend the model catalog data model to include optional tokenizer metadata (TokenizerInfo) loaded from models.json.
  • Replace len/5 token estimation with tokie-based counting in Cloudflare and Gemini providers (with fallbacks).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| Cargo.toml | Adds tokie dependency (HF-enabled) to support accurate tokenization. |
| src/tokenizer.rs | New module for cached tokenizer loading and token counting with fallbacks. |
| src/models.rs | Adds TokenizerInfo and attaches optional tokenizer config to ModelInfo. |
| src/catalog.rs | Deserializes optional tokenizer metadata from the embedded catalog JSON. |
| src/providers/cloudflare.rs | Uses tokenizer-based counting (via spawn_blocking) instead of len/5. |
| src/providers/gemini.rs | Uses tokenizer-based counting (via spawn_blocking) instead of len/5. |
| src/lib.rs | Registers tokenizer module and re-exports TokenizerInfo. |


Comment thread src/tokenizer.rs
Comment on lines +41 to +49
let mut cache = TOKENIZER_CACHE.lock().ok()?;

if let Some(tok) = cache.get(name) {
    return Some(Arc::clone(tok));
}

match Tokenizer::from_pretrained(name) {
    Ok(tok) => {
        let tok = Arc::new(tok);

Copilot AI Apr 21, 2026


get_or_load_tokenizer holds the global TOKENIZER_CACHE mutex while calling Tokenizer::from_pretrained(name), which can perform disk/network I/O and block for a long time. This effectively serializes all concurrent token counting calls and can stall unrelated requests. Consider dropping the lock before loading (double-checked locking: check cache, release lock, load, then re-lock to insert) or using a per-tokenizer OnceCell/DashMap so only the same tokenizer load is deduplicated.

Suggested change
- let mut cache = TOKENIZER_CACHE.lock().ok()?;
- if let Some(tok) = cache.get(name) {
-     return Some(Arc::clone(tok));
- }
- match Tokenizer::from_pretrained(name) {
-     Ok(tok) => {
-         let tok = Arc::new(tok);
+ {
+     let cache = TOKENIZER_CACHE.lock().ok()?;
+     if let Some(tok) = cache.get(name) {
+         return Some(Arc::clone(tok));
+     }
+ }
+ match Tokenizer::from_pretrained(name) {
+     Ok(tok) => {
+         let tok = Arc::new(tok);
+         let mut cache = TOKENIZER_CACHE.lock().ok()?;
+         if let Some(existing) = cache.get(name) {
+             return Some(Arc::clone(existing));
+         }
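
For reference, the check, release, load, re-lock sequence described in this comment can be written as a standalone, std-only sketch. It is not the PR's code: `LoaderCache`, `get_or_load`, and the `load` closure are hypothetical names, and the generic `T` stands in for the real tokenizer type. Note that two threads may still both run `load` for the same name; the second check only guarantees that a single instance ends up shared.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Cache keyed by tokenizer name. `T` stands in for the real tokenizer type.
pub struct LoaderCache<T> {
    entries: Mutex<HashMap<String, Arc<T>>>,
}

impl<T> LoaderCache<T> {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Return the cached value for `name`, or run `load` with the lock released.
    pub fn get_or_load<F>(&self, name: &str, load: F) -> Option<Arc<T>>
    where
        F: FnOnce(&str) -> Option<T>,
    {
        // First check: the guard is dropped at the end of this block.
        {
            let cache = self.entries.lock().ok()?;
            if let Some(existing) = cache.get(name) {
                return Some(Arc::clone(existing));
            }
        }

        // Slow path: possibly blocking I/O, performed without holding the lock.
        let loaded = Arc::new(load(name)?);

        // Second check: another thread may have inserted while we were loading.
        let mut cache = self.entries.lock().ok()?;
        if let Some(existing) = cache.get(name) {
            return Some(Arc::clone(existing));
        }
        cache.insert(name.to_string(), Arc::clone(&loaded));
        Some(loaded)
    }
}
```

A failed load (`None` from the closure) inserts nothing, so a later call can retry.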

Copilot uses AI. Check for mistakes.
Comment thread src/providers/gemini.rs
Comment on lines +175 to +178
.await
.unwrap_or_else(|_| {
    crate::tokenizer::fallback_count(&request.inputs)
});

Copilot AI Apr 21, 2026


In the spawn_blocking(...).await error path, the fallback calls crate::tokenizer::fallback_count(&request.inputs) on the async runtime thread. Since fallback_count may attempt to load a tokenizer (and uses a blocking std::sync::Mutex), this can block the reactor thread. Consider making the error-path fallback a pure len/5 estimate, or running fallback_count in spawn_blocking as well.
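
The first option mentioned here, a pure len/5 estimate on the error path, might look like the sketch below. The names `cheap_estimate` and `count_or_estimate` are hypothetical, `joined` stands in for whatever awaiting the blocking task's handle yields, and the inputs are assumed to be a slice of `String` with "len/5" meaning total byte length divided by five.

```rust
/// Pure arithmetic estimate: takes no locks and does no I/O, so it does not
/// block the calling thread. Assumption: "len/5" is total byte length / 5.
pub fn cheap_estimate(inputs: &[String]) -> usize {
    inputs.iter().map(String::len).sum::<usize>() / 5
}

/// `joined` stands in for the result of awaiting the blocking task's handle.
/// On error, fall back to the cheap estimate instead of loading a tokenizer.
pub fn count_or_estimate<E>(joined: Result<usize, E>, inputs: &[String]) -> usize {
    joined.unwrap_or_else(|_| cheap_estimate(inputs))
}
```

The trade-off is accuracy: the error path no longer attempts cl100k, but it cannot stall the runtime either.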

Comment on lines +174 to +177
.await
.unwrap_or_else(|_| {
    crate::tokenizer::fallback_count(&request.inputs)
});

Copilot AI Apr 21, 2026


In the spawn_blocking(...).await error path, the fallback calls crate::tokenizer::fallback_count(&request.inputs) on the async runtime thread. Since fallback_count may attempt to load a tokenizer (and uses a blocking std::sync::Mutex), this can block the runtime thread. Consider making the error-path fallback a pure len/5 estimate, or running fallback_count inside spawn_blocking too.
