feat: add build module and tokie CLI#4
Conversation
Spec for adding a `build` feature to the core crate with convert/verify/upload functions, plus a tokie-cli crate that wires them to a `tokie` CLI binary.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a feature-gated Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (tokie convert)
participant Build as tokie::build::convert
participant HFHub as HF Hub API
participant FS as Filesystem
participant Verify as tokie::build::verify
participant Ref as Reference Backend
CLI->>Build: convert(repo_id, output)
Build->>HFHub: Download tokenizer.json
HFHub-->>Build: tokenizer.json
Build->>Build: Load into Tokenizer
Build->>FS: Write .tkz file
Build-->>CLI: ConvertResult{vocab_size, file_size}
alt verify requested
CLI->>Verify: verify(repo_id, tkz_path)
Verify->>FS: Load .tkz
Verify->>Ref: Resolve reference tokenizer
Ref-->>Verify: Reference encoder
Verify->>Verify: Compare token IDs (chunked)
Verify-->>CLI: VerifyResult{passed, failed, mismatches}
end
alt upload requested
CLI->>Build: upload(tkz_path, tokiers_name, token)
Build->>Build: Resolve auth token
Build->>FS: Read .tkz bytes
Build->>HFHub: PUT /tokiers/{name}/tokenizer.tkz
HFHub-->>Build: HTTP 2xx
Build-->>CLI: Success
end
sequenceDiagram
participant CLI as CLI (tokie verify)
participant Build as tokie::build::verify
participant HFHub as HF Hub API
participant FS as Filesystem
participant Ref as Reference Backend
CLI->>Build: verify(repo_id, tkz_path?)
alt tkz_path provided
Build->>FS: Load .tkz
else tkz_path absent
Build->>HFHub: Download tokiers/{name}/tokenizer.tkz
HFHub-->>Build: .tkz bytes
Build->>FS: Write temp .tkz
end
Build->>Build: Load Tokenizer from .tkz
Build->>Ref: Resolve reference (tiktoken-rs or HF tokenizers)
Ref-->>Build: Reference encoder
Build->>Build: Encode test texts with both
Build->>Build: Collect mismatches
Build-->>CLI: VerifyResult{total, passed, failed, mismatches}
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Adds a feature-gated “build” workflow to the tokie library and a new tokie-cli crate to expose conversion/verification/upload from the command line, while moving heavy reference deps behind an optional feature.
Changes:
- Introduce
tokie::build(behindfeature = "build") withconvert(),verify(), andupload()APIs plus related result/error types. - Add
tokie-clicrate providingtokie convert/tokie verifycommands (and optional upload). - Move
tokenizers,tiktoken-rs, andureqbehind the newbuildfeature; update example feature requirements accordingly.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/superpowers/specs/2026-04-21-build-module-and-cli-design.md | Design spec for the build module + CLI surface. |
| docs/superpowers/plans/2026-04-21-build-module-and-cli.md | Implementation plan detailing steps and intended behavior. |
| crates/tokie/src/lib.rs | Exposes the new build module/types behind cfg(feature = "build"). |
| crates/tokie/src/build.rs | Implements conversion, verification, and upload logic. |
| crates/tokie/Cargo.toml | Adds build feature and makes reference deps optional; adjusts examples’ required features. |
| crates/tokie-cli/src/main.rs | Clap-based CLI wiring for convert/verify (and upload). |
| crates/tokie-cli/Cargo.toml | New CLI crate manifest and dependencies. |
| Cargo.lock | Lockfile updates for the new crate/dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| //! | ||
| //! This module is only available when the `build` feature is enabled: | ||
| //! ```toml | ||
| //! tokie = { version = "0.1", features = ["build"] } |
There was a problem hiding this comment.
The module-level doc example uses tokie = { version = "0.1", ... }, but the crate version in this repo is currently 0.0.8. This is likely to confuse users copying the snippet; consider either matching the current crate version or omitting the version in the example.
| //! tokie = { version = "0.1", features = ["build"] } | |
| //! tokie = { features = ["build"] } |
| if let Some(parent) = output.parent() { | ||
| std::fs::create_dir_all(parent).ok(); | ||
| } |
There was a problem hiding this comment.
std::fs::create_dir_all(parent).ok(); discards directory creation errors (e.g., permissions, invalid path). This can make failures harder to diagnose and may lead to misleading downstream errors; consider propagating the I/O error (e.g., by mapping it into BuildError).
| "p50k_base" => tiktoken_rs::p50k_base(), | ||
| _ => unreachable!(), | ||
| } | ||
| .expect("failed to load tiktoken encoding"); |
There was a problem hiding this comment.
expect("failed to load tiktoken encoding") will panic the process if the encoding can't be loaded. Since this is a library API, prefer returning a BuildError instead of panicking so callers (including the CLI) can handle/report the failure cleanly.
| .expect("failed to load tiktoken encoding"); | |
| .map_err(|e| BuildError::Download(format!("failed to load tiktoken encoding: {e}")))?; |
| Ok(VerifyResult { | ||
| total, | ||
| passed, | ||
| failed, | ||
| mismatches, | ||
| }) |
There was a problem hiding this comment.
BuildError includes a Verification { result: VerifyResult } variant, but verify() currently returns Ok(VerifyResult) even when there are mismatches. This makes it easy for callers to treat a failed verification as success by mistake; consider returning Err(BuildError::Verification { result }) when failed > 0 (or remove the error variant if success-on-mismatch is the intended contract).
| Ok(VerifyResult { | |
| total, | |
| passed, | |
| failed, | |
| mismatches, | |
| }) | |
| let result = VerifyResult { | |
| total, | |
| passed, | |
| failed, | |
| mismatches, | |
| }; | |
| if result.failed > 0 { | |
| Err(BuildError::Verification { result }) | |
| } else { | |
| Ok(result) | |
| } |
| #[arg(short, long)] | ||
| upload: bool, | ||
|
|
||
| /// HuggingFace API token (falls back to HF_TOKEN env or cached token) |
There was a problem hiding this comment.
The CLI help text says the token "falls back to HF_TOKEN env or cached token", but tokie::build::upload() only checks the explicit --token value and HF_TOKEN. Either implement reading the cached HuggingFace token (e.g., from the standard HF token file) or adjust the help text to match the actual behavior.
| /// HuggingFace API token (falls back to HF_TOKEN env or cached token) | |
| /// HuggingFace API token (falls back to HF_TOKEN env) |
| } else { | ||
| let name = repo_id.rsplit('/').next().unwrap_or(repo_id); | ||
| print!("Downloading tokiers/{name} ... "); | ||
| let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap(); |
There was a problem hiding this comment.
hf_hub::api::sync::ApiBuilder::new().build().unwrap() will panic if the Hub client can't be initialized (e.g., invalid config / cache dir permissions). Since this is user-facing CLI code, handle the error and exit with a clear message instead of panicking.
| let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap(); | |
| let api = match hf_hub::api::sync::ApiBuilder::new().build() { | |
| Ok(api) => api, | |
| Err(e) => { | |
| println!("FAILED: could not initialize HuggingFace Hub client: {e}"); | |
| std::process::exit(1); | |
| } | |
| }; |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
crates/tokie-cli/src/main.rs (1)
111-113: Reuse the canonicaltokiersrepo-name mapping here.Both upload and download derive
tokiers/<name>by stripping everything before the last/. Ifcrates/tokie/src/hub.rs::tokiers_repo_namealready owns aliases or special cases, this CLI can upload under one name and later verify/download under another. Please route both paths through the same helper instead of re-deriving the basename locally.Also applies to: 128-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tokie-cli/src/main.rs` around lines 111 - 113, The CLI currently computes the "tokiers/<name>" basename by rsplit on repo_id in both upload and download branches (variables like tokiers_name); replace those local rsplit derivations with a call to the canonical helper tokie::hub::tokiers_repo_name (defined in crates/tokie/src/hub.rs) so both upload (where tokiers_name is set) and download paths (lines 128-131) use the same alias/special-case mapping; update usages that build "tokiers/{...}" to call tokiers_repo_name(repo_id) instead of repo_id.rsplit(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokie-cli/Cargo.toml`:
- Line 14: Update the tokie dependency declaration so it includes a version
field in addition to the path (e.g., add version = "0.0.8") so the crate can be
installed from crates.io while still supporting local workspace builds; modify
the existing tokie entry (tokie = { path = "../tokie", features = ["hf",
"build"] }) to include the version attribute.
In `@crates/tokie-cli/src/main.rs`:
- Around line 128-130: The ApiBuilder::new().build().unwrap() call in main
(constructing the hf_hub client) can fail and currently panics; change it to
handle the Err case and emit the CLI-style failure and exit path instead of
unwrap. Locate the hf_hub::api::sync::ApiBuilder usage (the
ApiBuilder::new().build() call) and match on the Result, logging a message like
"FAILED: could not initialize HF Hub client: {error}" (or using the existing
failure/log helper the CLI uses) and exit with code 1 when Err, otherwise
continue with the Ok(api) value.
In `@crates/tokie/src/build.rs`:
- Around line 122-129: The code currently calls
tiktoken_rs::*_base().expect(...) in the block that handles encoding_name
(matching "cl100k_base", "o200k_base", "p50k_base") inside the build logic,
which will abort the process on failure; change this to propagate a BuildError
through the public verify() API instead: replace the .expect(...) with error
mapping (e.g., map_err or try/?) to convert the tiktoken_rs initialization error
into the crate's BuildError and return Err from the surrounding function,
ensuring any failure to create the tokenizer (from tiktoken_rs::cl100k_base(),
o200k_base(), p50k_base()) yields a Result::Err rather than panicking; reference
the variables and symbols encoding_name, tiktoken, and the tiktoken_rs::*_base()
calls so you update the correct spot.
- Around line 195-202: The upload() token resolution currently only checks the
CLI arg and HF_TOKEN env var; update it to attempt reading the cached HF token
via hf_hub::cache::Cache::new()?.token() before returning the
BuildError::Upload. Specifically, in the code building token_str (where
token.map(...).or_else(...).ok_or_else(...)), add an intermediate attempt to
call Cache::new()?.token().ok() (or equivalent) as another or_else branch so
that token resolution order is: explicit CLI token, HF_TOKEN env var, cached
token from hf-hub cache, and only then return BuildError::Upload; ensure any
errors constructing the Cache are handled/ignored (treated as no cached token)
not propagated as a hard error.
---
Nitpick comments:
In `@crates/tokie-cli/src/main.rs`:
- Around line 111-113: The CLI currently computes the "tokiers/<name>" basename
by rsplit on repo_id in both upload and download branches (variables like
tokiers_name); replace those local rsplit derivations with a call to the
canonical helper tokie::hub::tokiers_repo_name (defined in
crates/tokie/src/hub.rs) so both upload (where tokiers_name is set) and download
paths (lines 128-131) use the same alias/special-case mapping; update usages
that build "tokiers/{...}" to call tokiers_repo_name(repo_id) instead of
repo_id.rsplit(...).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 38bf866c-25f1-49a6-99b1-d0cf5b6addff
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
crates/tokie-cli/Cargo.tomlcrates/tokie-cli/src/main.rscrates/tokie/Cargo.tomlcrates/tokie/src/build.rscrates/tokie/src/lib.rsdocs/superpowers/plans/2026-04-21-build-module-and-cli.mddocs/superpowers/specs/2026-04-21-build-module-and-cli-design.md
| path = "src/main.rs" | ||
|
|
||
| [dependencies] | ||
| tokie = { path = "../tokie", features = ["hf", "build"] } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check if the files exist and locate them
find . -name "Cargo.toml" -type f | grep -E "(tokie-cli|tokie)" | head -20Repository: chonkie-inc/tokie
Length of output: 178
🏁 Script executed:
# Check the structure of the repository
ls -la crates/ 2>/dev/null || echo "crates directory structure check"Repository: chonkie-inc/tokie
Length of output: 399
🏁 Script executed:
# Read the tokie-cli Cargo.toml file
cat -n crates/tokie-cli/Cargo.toml 2>/dev/null || echo "File not found"Repository: chonkie-inc/tokie
Length of output: 639
🏁 Script executed:
# Read the tokie Cargo.toml file to get its version
cat -n crates/tokie/Cargo.toml 2>/dev/null || echo "File not found"Repository: chonkie-inc/tokie
Length of output: 3317
Add a versioned dependency spec for tokie to support cargo install from crates.io.
Path-only dependencies cannot be resolved when installing from crates.io. Add version = "0.0.8" alongside the path attribute to enable both local workspace builds and crates.io installation, which is required for the stated cargo install tokie-cli objective.
Suggested manifest change
-tokie = { path = "../tokie", features = ["hf", "build"] }
+tokie = { version = "0.0.8", path = "../tokie", features = ["hf", "build"] }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tokie = { path = "../tokie", features = ["hf", "build"] } | |
| tokie = { version = "0.0.8", path = "../tokie", features = ["hf", "build"] } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie-cli/Cargo.toml` at line 14, Update the tokie dependency
declaration so it includes a version field in addition to the path (e.g., add
version = "0.0.8") so the crate can be installed from crates.io while still
supporting local workspace builds; modify the existing tokie entry (tokie = {
path = "../tokie", features = ["hf", "build"] }) to include the version
attribute.
| let name = repo_id.rsplit('/').next().unwrap_or(repo_id); | ||
| print!("Downloading tokiers/{name} ... "); | ||
| let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap(); |
There was a problem hiding this comment.
Handle HF Hub initialization failure instead of panicking.
Line 130 unwraps the Hub client build. When that fails, tokie verify <repo_id> aborts with a panic instead of the CLI's normal FAILED: ... + exit code 1 flow.
Suggested fix
- let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap();
+ let api = match hf_hub::api::sync::ApiBuilder::new().build() {
+ Ok(api) => api,
+ Err(e) => {
+ println!("FAILED: HF Hub init failed: {e}");
+ std::process::exit(1);
+ }
+ };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie-cli/src/main.rs` around lines 128 - 130, The
ApiBuilder::new().build().unwrap() call in main (constructing the hf_hub client)
can fail and currently panics; change it to handle the Err case and emit the
CLI-style failure and exit path instead of unwrap. Locate the
hf_hub::api::sync::ApiBuilder usage (the ApiBuilder::new().build() call) and
match on the Result, logging a message like "FAILED: could not initialize HF Hub
client: {error}" (or using the existing failure/log helper the CLI uses) and
exit with code 1 when Err, otherwise continue with the Ok(api) value.
| if let Some(encoding_name) = tiktoken_encoding_name(repo_id) { | ||
| let tiktoken = match encoding_name { | ||
| "cl100k_base" => tiktoken_rs::cl100k_base(), | ||
| "o200k_base" => tiktoken_rs::o200k_base(), | ||
| "p50k_base" => tiktoken_rs::p50k_base(), | ||
| _ => unreachable!(), | ||
| } | ||
| .expect("failed to load tiktoken encoding"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and read it
fd build.rs | head -20Repository: chonkie-inc/tokie
Length of output: 86
🏁 Script executed:
# Get the build.rs file location and size
find . -name build.rs -path "*/tokie/*" -exec wc -l {} + 2>/dev/nullRepository: chonkie-inc/tokie
Length of output: 92
🏁 Script executed:
# Read the build.rs file to understand context around lines 122-129
cat -n crates/tokie/src/build.rs | sed -n '100,150p'Repository: chonkie-inc/tokie
Length of output: 2143
🏁 Script executed:
# Check the verify function signature and its return type
ast-grep --pattern 'fn verify($_) -> $_ {
$$$
}'Repository: chonkie-inc/tokie
Length of output: 43
🏁 Script executed:
# Search for BuildError definition to understand error handling
rg "enum BuildError|struct BuildError" -A 10Repository: chonkie-inc/tokie
Length of output: 2511
Don't panic when the reference tokenizer backend fails to initialize.
verify() is a public Result-returning API, but line 129 uses .expect(...). If tiktoken_rs::*_base() fails, callers get a process abort instead of a BuildError.
Suggested fix
let tiktoken = match encoding_name {
"cl100k_base" => tiktoken_rs::cl100k_base(),
"o200k_base" => tiktoken_rs::o200k_base(),
"p50k_base" => tiktoken_rs::p50k_base(),
_ => unreachable!(),
}
- .expect("failed to load tiktoken encoding");
+ .map_err(|e| {
+ BuildError::Download(format!(
+ "failed to initialize tiktoken backend {encoding_name}: {e}"
+ ))
+ })?;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` around lines 122 - 129, The code currently calls
tiktoken_rs::*_base().expect(...) in the block that handles encoding_name
(matching "cl100k_base", "o200k_base", "p50k_base") inside the build logic,
which will abort the process on failure; change this to propagate a BuildError
through the public verify() API instead: replace the .expect(...) with error
mapping (e.g., map_err or try/?) to convert the tiktoken_rs initialization error
into the crate's BuildError and return Err from the surrounding function,
ensuring any failure to create the tokenizer (from tiktoken_rs::cl100k_base(),
o200k_base(), p50k_base()) yields a Result::Err rather than panicking; reference
the variables and symbols encoding_name, tiktoken, and the tiktoken_rs::*_base()
calls so you update the correct spot.
| let token_str = token | ||
| .map(|t| t.to_string()) | ||
| .or_else(|| std::env::var("HF_TOKEN").ok()) | ||
| .ok_or_else(|| { | ||
| BuildError::Upload( | ||
| "no HF token found — pass --token or set HF_TOKEN env var".to_string(), | ||
| ) | ||
| })?; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Get overview of repository structure
git ls-files crates/tokie/ | head -20Repository: chonkie-inc/tokie
Length of output: 830
🏁 Script executed:
# Look at the build.rs file to understand context
wc -l crates/tokie/src/build.rsRepository: chonkie-inc/tokie
Length of output: 90
🏁 Script executed:
# Search for any mentions of "cached" in the codebase
rg "cached.*token|token.*cached" --type-listRepository: chonkie-inc/tokie
Length of output: 450
🏁 Script executed:
# Search for CLI help text mentioning token or HF_TOKEN
rg "HF_TOKEN|HF token|cached.*HF|Hugging Face.*token" -A 3 -B 3Repository: chonkie-inc/tokie
Length of output: 18003
🏁 Script executed:
# Read the CLI help text in detail
cat -n crates/tokie-cli/src/main.rs | grep -A 5 -B 5 "HuggingFace API token"Repository: chonkie-inc/tokie
Length of output: 475
🏁 Script executed:
# Check the full upload function context in build.rs
sed -n '180,210p' crates/tokie/src/build.rsRepository: chonkie-inc/tokie
Length of output: 1117
🏁 Script executed:
# Search for any cached token handling utilities in the codebase
rg "cache|\.hf\(|home|\.hugging" --type rustRepository: chonkie-inc/tokie
Length of output: 21120
🏁 Script executed:
# Check dependencies to understand what HF crates are available
cat crates/tokie/Cargo.toml | grep -A 3 -B 3 "hf-\|hugging"Repository: chonkie-inc/tokie
Length of output: 415
🏁 Script executed:
# Check if hf-hub provides any cached token utilities
rg "hf-hub|hf_hub" -A 5 crates/tokie/src/build.rs | head -40Repository: chonkie-inc/tokie
Length of output: 716
🏁 Script executed:
# Look for imports and usage of hf-hub in build.rs
grep -n "use\|hf-hub\|hf_hub" crates/tokie/src/build.rs | head -20Repository: chonkie-inc/tokie
Length of output: 475
🏁 Script executed:
# Search HuggingFace documentation/examples for cached token pattern
web_search
What does hf-hub crate provide for HuggingFace token caching and retrievalRepository: chonkie-inc/tokie
Length of output: 154
🌐 Web query:
hf-hub crate cached token retrieval HuggingFace authentication
💡 Result:
The hf-hub Rust crate retrieves Hugging Face authentication tokens from the standard cache location at ~/.cache/huggingface/token (or overridden by HF_HOME/token). This is done via the Cache struct's token method, which reads the token file if it exists. To set up the token: 1. Generate a User Access Token from https://huggingface.co/settings/tokens (read or write scope as needed). 2. Use huggingface-cli login (install via pip install -U huggingface_hub) to save it to the cache file. 3. Alternatively, manually write the token to ~/.cache/huggingface/token. When creating an Api (sync or async), it automatically uses Cache::from_env and pulls the token via cache.token if present for authenticated requests (e.g., private repos). A log message like "Token file not found" appears if absent but does not error for public access. Example usage: use hf_hub::api::sync::Api; let api = Api::new?; // Automatically checks cache for token let repo = api.model("bert-base-uncased".to_string); let _ = repo.get("config.json")?; For the newer huggingface-hub Rust crate (github.com/huggingface/huggingface_hub_rust), tokens are resolved in order: HfApiBuilder::token, HF_TOKEN env var, HF_TOKEN_PATH, or ~/.cache/huggingface/token.
Citations:
- 1: https://crates.io/crates/hf-hub/0.4.2
- 2: https://docs.rs/hf-hub/latest/hf_hub/struct.Cache.html
- 3: https://github.com/huggingface/hf-hub/blob/357331db/src/lib.rs
- 4: Token file not found "token" huggingface/hf-hub#54
- 5: https://docs.rs/crate/hf-hub/0.2.0/source/src/api/sync.rs
🏁 Script executed:
# Get the complete upload function implementation
sed -n '186,226p' crates/tokie/src/build.rsRepository: chonkie-inc/tokie
Length of output: 1486
The upload() function does not implement the advertised cached-token fallback.
The CLI help explicitly states: "HuggingFace API token (falls back to HF_TOKEN env or cached token)," and the design spec promises token resolution from "environment/cache." However, the implementation (lines 195–202) only checks the provided argument and HF_TOKEN env var; it ignores the cached Hugging Face token at ~/.cache/huggingface/token. Users logged in via huggingface-cli login will fail here unless they also export HF_TOKEN manually.
The hf-hub crate (already a dependency) provides a Cache struct with a token() method that can read the cached token—integrating cached fallback is straightforward and necessary to match the advertised behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` around lines 195 - 202, The upload() token
resolution currently only checks the CLI arg and HF_TOKEN env var; update it to
attempt reading the cached HF token via hf_hub::cache::Cache::new()?.token()
before returning the BuildError::Upload. Specifically, in the code building
token_str (where token.map(...).or_else(...).ok_or_else(...)), add an
intermediate attempt to call Cache::new()?.token().ok() (or equivalent) as
another or_else branch so that token resolution order is: explicit CLI token,
HF_TOKEN env var, cached token from hf-hub cache, and only then return
BuildError::Upload; ensure any errors constructing the Cache are handled/ignored
(treated as no cached token) not propagated as a hard error.
Download and cache enwik8 (~100MB), split into ~1KB chunks, and compare token IDs against the reference backend. This gives ~80K test cases instead of 8, catching edge cases around unicode and special characters. Also improve CLI output: show first divergence point per mismatch, limit to 5 shown with a count of remaining.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokie/src/build.rs`:
- Line 191: The error mapping for Tokenizer::from_file is incorrect: it
currently maps load failures to BuildError::SaveTkz making messages misleading;
change the mapping to a load-specific variant (e.g., add BuildError::LoadTkz) or
reuse an existing load variant like BuildError::LoadJson, update the
Tokenizer::from_file call to map_err(BuildError::LoadTkz or
BuildError::LoadJson) and add the new enum variant and any needed Display/From
impls so error messages correctly indicate a failed .tkz load instead of a save.
- Around line 298-303: Remove the unreachable HTTP status check block that
inspects response.status() and returns BuildError::Upload (the if
response.status() >= 400 { ... } block referencing BuildError::Upload and
response); rely on the existing map_err(...) and ? operator (used just before)
which already converts ureq::Error::Status into an error, so delete lines
containing that status check and its return to eliminate dead code in the upload
path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8aad450d-3464-47cc-bf5d-176ba2cf366b
📒 Files selected for processing (3)
crates/tokie-cli/src/main.rscrates/tokie/Cargo.tomlcrates/tokie/src/build.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/tokie/Cargo.toml
- crates/tokie-cli/src/main.rs
| /// Downloads and caches enwik8 (~100MB), splits into ~1KB chunks, and compares | ||
| /// token IDs from tokie against the reference backend (HF tokenizers or tiktoken-rs). | ||
| pub fn verify(repo_id: &str, tkz_path: &Path) -> Result<VerifyResult, BuildError> { | ||
| let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::SaveTkz)?; |
There was a problem hiding this comment.
Wrong error variant: use a load-specific error instead of SaveTkz.
Tokenizer::from_file is a load operation, but the error is mapped to BuildError::SaveTkz. This produces misleading error messages like "failed to save .tkz" when the actual failure is loading the file. Consider adding a LoadTkz variant or reusing LoadJson.
Suggested fix
Add a new error variant or reuse an existing one:
pub enum BuildError {
Download(String),
LoadJson(JsonLoadError),
+ LoadTkz(SerdeError),
SaveTkz(SerdeError),
Verification { result: VerifyResult },
Upload(String),
HubInit(String),
}Then in verify:
- let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::SaveTkz)?;
+ let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::LoadTkz)?;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` at line 191, The error mapping for
Tokenizer::from_file is incorrect: it currently maps load failures to
BuildError::SaveTkz making messages misleading; change the mapping to a
load-specific variant (e.g., add BuildError::LoadTkz) or reuse an existing load
variant like BuildError::LoadJson, update the Tokenizer::from_file call to
map_err(BuildError::LoadTkz or BuildError::LoadJson) and add the new enum
variant and any needed Display/From impls so error messages correctly indicate a
failed .tkz load instead of a save.
| if response.status() >= 400 { | ||
| return Err(BuildError::Upload(format!( | ||
| "upload returned HTTP {}: check your token has write access to tokiers/ org", | ||
| response.status() | ||
| ))); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
ureq 2.0 Rust crate default behavior for HTTP 4xx 5xx error status codes
💡 Result:
In ureq version 2.x (including 2.0), the default behavior for HTTP 4xx and 5xx status codes is to treat them as errors. When a request receives a 4xx or 5xx response, it returns a ureq::Error::Status(status_code, response) variant. This allows access to both the status code (u16) and the full Response object (including headers and body) for error handling, such as reading error messages from the server. This is the default since ureq 2.0.0, as noted in the changelog: "Non-2xx responses are considered Error by default." Example from docs: match ureq::get("http://example.com/").call { Ok(response) => { /* success / }, Err(ureq::Error::Status(code, response)) => { / 4xx/5xx: access response.body etc. / } Err(_) => { / network etc. */ } } The behavior can be changed with .or_any_status to treat them as success, or in later versions (3.x) via config.http_status_as_error(false), but for 2.0 it is on by default.
Citations:
- 1: https://docs.rs/ureq/latest/ureq/enum.Error.html
- 2: https://docs.rs/ureq/2.0.2/ureq/enum.Error.html
- 3: https://docs.rs/ureq/2.2.0/ureq/enum.Error.html
- 4: https://github.com/algesten/ureq/blob/main/CHANGELOG.md
- 5: https://docs.rs/ureq/2.0.0-rc4/ureq/enum.Error.html
🏁 Script executed:
cat -n crates/tokie/src/build.rs | sed -n '290,310p'Repository: chonkie-inc/tokie
Length of output: 703
🏁 Script executed:
rg 'or_any_status' crates/tokie/src/build.rsRepository: chonkie-inc/tokie
Length of output: 43
🏁 Script executed:
rg 'or_any_status|\.call\(' crates/tokie/src/build.rs -A 2 -B 2Repository: chonkie-inc/tokie
Length of output: 255
🏁 Script executed:
cat > /tmp/check_ureq.py << 'EOF'
import json
# Based on web search and ureq documentation:
# ureq 2.x behavior for .send_bytes():
# - Returns Result<Response, Error>
# - For 4xx/5xx responses: returns Err(ureq::Error::Status(code, response))
# - No .or_any_status() is being used, so default behavior applies
# - map_err catches the error and ? operator exits early
behavior = {
"scenario": "ureq::put().send_bytes() with 4xx/5xx response",
"ureq_version": "2.x",
"default_behavior": "Treats 4xx/5xx as errors",
"returns": "Err(ureq::Error::Status(code, response))",
"code_flow": [
"Line 292-296: .send_bytes().map_err(...)?",
"If 4xx/5xx: Error variant is caught by map_err",
"? operator exits function with error",
"Lines 298-303: Never reached"
],
"status_check_reachable": False
}
print(json.dumps(behavior, indent=2))
EOF
python /tmp/check_ureq.pyRepository: chonkie-inc/tokie
Length of output: 498
Remove unreachable status code check (lines 298-303).
With ureq 2.x, HTTP 4xx/5xx responses return Err(ureq::Error::Status(...)) by default, which is caught by the map_err on line 296. The ? operator exits before reaching the status check, making lines 298-303 unreachable dead code that should be removed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` around lines 298 - 303, Remove the unreachable
HTTP status check block that inspects response.status() and returns
BuildError::Upload (the if response.status() >= 400 { ... } block referencing
BuildError::Upload and response); rely on the existing map_err(...) and ?
operator (used just before) which already converts ureq::Error::Status into an
error, so delete lines containing that status check and its return to eliminate
dead code in the upload path.
The accuracy test suite was only running on push to main. Now it also runs on pull_request, so accuracy regressions are caught before merge. Also update the feature gate from hf to build since accuracy.rs imports tokenizers and tiktoken-rs which are now behind the build feature.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokie/tests/accuracy.rs`:
- Line 7: The test file now uses the cfg attribute #[cfg(feature = "build")] but
the top-of-file test invocation comment still references --features hf; update
that header invocation to --features build (or document both if intended) so
local runs match the new feature gate, and ensure any README or comment lines
near the file header that mention "--features hf" are changed to "--features
build" to avoid confusion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 54eaf831-12bc-4099-8871-56e5e7749990
📒 Files selected for processing (2)
.github/workflows/tokenizer-accuracy.ymlcrates/tokie/tests/accuracy.rs
| //! Requires network access and benches/data/enwik8 (1MB used). | ||
|
|
||
| #![cfg(feature = "hf")] | ||
| #![cfg(feature = "build")] |
There was a problem hiding this comment.
Update the test invocation docs to match the new feature gate.
Line 7 now gates this file behind build, but the header command still says --features hf (Line 3). That command is stale and can mislead local test runs.
Suggested patch
-//! Run with: cargo test -p tokie --test accuracy --features hf -- --ignored
+//! Run with: cargo test -p tokie --test accuracy --features build -- --ignored🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/tests/accuracy.rs` at line 7, The test file now uses the cfg
attribute #[cfg(feature = "build")] but the top-of-file test invocation comment
still references --features hf; update that header invocation to --features
build (or document both if intended) so local runs match the new feature gate,
and ensure any README or comment lines near the file header that mention
"--features hf" are changed to "--features build" to avoid confusion.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.github/workflows/speed-regression.yml (1)
64-69: Pathspec checkout may leave extraneous files from the current commit.
git checkout "$BASE_SHA" -- .overwrites tracked files but does not remove files that exist in the working tree but are absent in$BASE_SHA. If the PR adds new workspace crates or files that influence the Cargo build graph, those extra files could affect the base benchmark, leading to inconsistent comparisons.Consider using
git checkout --force "$BASE_SHA"unconditionally for a cleaner baseline, or addgit clean -fdafter the pathspec checkout to remove untracked files.Also note that the stash created on line 64 is never popped—likely fine for CI, but worth documenting if intentional.
🔧 Optional fix: ensure clean base state
git stash --include-untracked || true - git checkout "$BASE_SHA" -- . || git checkout --force "$BASE_SHA" + git checkout --force "$BASE_SHA" cargo run --example speed_regression --release --features hf \ > benchmark_base.json 2>benchmark_base.log || echo '{"results":[]}' > benchmark_base.json cat benchmark_base.log git checkout --force - || trueOr, if you prefer the pathspec approach for speed:
git stash --include-untracked || true git checkout "$BASE_SHA" -- . || git checkout --force "$BASE_SHA" + git clean -fd || true cargo run --example speed_regression --release --features hf \🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/speed-regression.yml around lines 64 - 69, The current checkout using git checkout "$BASE_SHA" -- . may leave new files from the PR in the working tree; update the workflow step that runs the base benchmark (the block invoking git stash, git checkout "$BASE_SHA" -- ., cargo run --example speed_regression ...) to ensure a truly clean base state by replacing the pathspec checkout with an unconditional git checkout --force "$BASE_SHA" or by running git clean -fd immediately after the pathspec checkout; also either pop or document the initial git stash created by git stash --include-untracked so it’s clear the stash is intentionally left in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/speed-regression.yml:
- Around line 64-69: The current checkout using git checkout "$BASE_SHA" -- .
may leave new files from the PR in the working tree; update the workflow step
that runs the base benchmark (the block invoking git stash, git checkout
"$BASE_SHA" -- ., cargo run --example speed_regression ...) to ensure a truly
clean base state by replacing the pathspec checkout with an unconditional git
checkout --force "$BASE_SHA" or by running git clean -fd immediately after the
pathspec checkout; also either pop or document the initial git stash created by
git stash --include-untracked so it’s clear the stash is intentionally left in
CI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 4a4620e0-7a38-4240-b569-caeb6029cea4
📒 Files selected for processing (1)
.github/workflows/speed-regression.yml
Chunking enwik8 into 1KB pieces created false positives at chunk boundaries because BPE tokenization is context-dependent. Now encodes the first 1MB as a single string, matching how the accuracy test suite works. GPT-2 is 100% accurate (286273/286273 tokens match).
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
crates/tokie/src/build.rs (3)
155-155:⚠️ Potential issue | 🟡 MinorMap
.tkzload failures to a load-oriented error.
Tokenizer::from_file()is a read path, but this maps failures toBuildError::SaveTkz, which produces misleading diagnostics for callers debugging a bad or missing.tkzfile.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tokie/src/build.rs` at line 155, The code maps failures from Tokenizer::from_file(...) to the SaveTkz variant, producing misleading diagnostics; change the map_err call on Tokenizer::from_file(tkz_path) (where tokie_tok is set) to map errors to a load-oriented BuildError variant (e.g., BuildError::LoadTkz) so read/load failures are reported correctly, and if BuildError lacks a LoadTkz variant, add one that wraps the underlying io/error and use it here.
166-173:⚠️ Potential issue | 🟠 MajorPropagate
tiktoken-rsinit errors instead of panicking.
verify()is a publicResultAPI, but this branch still uses.expect(...). A backend initialization failure should come back asErr(BuildError::...), not abort the process.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tokie/src/build.rs` around lines 166 - 173, The code in build.rs currently calls .expect(...) when constructing the tiktoken encoding inside the verify() flow, which will panic on initialization failure; change this to propagate the error through the public Result by replacing the .expect call with proper error conversion (use ? or map_err) and return a BuildError variant (or convert into the existing BuildError) instead of aborting. Specifically, in the block that builds tiktoken (the match over tiktoken_encoding_name(repo_id) that calls tiktoken_rs::cl100k_base()/o200k_base()/p50k_base()), replace the trailing .expect("failed to load tiktoken encoding") with a fallible propagation like .map_err(|e| BuildError::... or e.into())? so verify() returns Err on init failures rather than panicking.
241-248:⚠️ Potential issue | 🟠 MajorHonor the documented cached-token fallback.
Token resolution still stops at the explicit arg and
HF_TOKEN. That means users authenticated viahuggingface-cli loginwill still fail here unless they also exportHF_TOKEN, which doesn't match the advertised behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tokie/src/build.rs` around lines 241 - 248, The token resolution currently only checks the explicit arg (`token`) and the HF_TOKEN env var, but must also fall back to the CLI-cached token; update the token resolution for `token_str` to, after checking `HF_TOKEN`, try the Hugging Face CLI cache (e.g., use a helper that returns the cached token such as `huggingface_hub::cached_token()` if available, or read the conventional cache file paths like "~/.cache/huggingface/token" or "~/.huggingface/token") and return that value before failing; ensure the new fallback is included in the same chain where `token.map(|t| t.to_string()).or_else(|| std::env::var("HF_TOKEN").ok())` is constructed so `token_str` will use the cached token when present.
🧹 Nitpick comments (1)
crates/tokie/src/build.rs (1)
157-161: Only read the 1MB sample you actually verify.The comment says verification uses the first 1MB, but
std::fs::read()still pulls the full cachedenwik8file into memory on every run. Streaming just the prefix keeps the hot path aligned with the advertised cost.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/tokie/src/build.rs` around lines 157 - 161, The code currently calls std::fs::read(&enwik8_path) which loads the entire enwik8 into memory; change this to open the file (std::fs::File::open(enwik8_path) or equivalent) and stream only the first 1_000_000 bytes into a buffer using std::io::Read::take(...).read_to_end(&mut raw) so the hot path only reads the 1MB sample used for verification; preserve the same BuildError::Download mapping on errors and keep the existing truncated variable logic (use raw[..raw.len().min(1_000_000)] or simply use the buffer filled by take) and reference ensure_enwik8(), enwik8_path, and BuildError::Download when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokie/src/build.rs`:
- Around line 73-113: ensure_enwik8() currently downloads enwik8 over plain HTTP
and extracts it without integrity checks; change ENWIK8_URL to a secure (HTTPS)
source and add a pinned digest constant (e.g., ENWIK8_SHA256) for the downloaded
blob/zip, compute the SHA-256 (or other agreed hash) of the downloaded bytes
(zip_path or bytes) and compare it to the pinned digest before proceeding to
write/extract; if the checksum fails, return a BuildError::Download with a clear
failure message and do not extract, and keep the existing error handling around
reading/extracting to ensure only verified content is written to enwik8 (update
references in ensure_enwik8()).
- Around line 77-79: The fallback PathBuf::from("/tmp") is Unix-specific;
replace that fallback with a cross-platform temporary directory by using
std::env::temp_dir() when dirs_next::cache_dir() returns None so the computed
cache_dir (the variable created from
dirs_next::cache_dir().unwrap_or_else(...).join("tokie")) works correctly on
Windows and other platforms.
---
Duplicate comments:
In `@crates/tokie/src/build.rs`:
- Line 155: The code maps failures from Tokenizer::from_file(...) to the SaveTkz
variant, producing misleading diagnostics; change the map_err call on
Tokenizer::from_file(tkz_path) (where tokie_tok is set) to map errors to a
load-oriented BuildError variant (e.g., BuildError::LoadTkz) so read/load
failures are reported correctly, and if BuildError lacks a LoadTkz variant, add
one that wraps the underlying io/error and use it here.
- Around line 166-173: The code in build.rs currently calls .expect(...) when
constructing the tiktoken encoding inside the verify() flow, which will panic on
initialization failure; change this to propagate the error through the public
Result by replacing the .expect call with proper error conversion (use ? or
map_err) and return a BuildError variant (or convert into the existing
BuildError) instead of aborting. Specifically, in the block that builds tiktoken
(the match over tiktoken_encoding_name(repo_id) that calls
tiktoken_rs::cl100k_base()/o200k_base()/p50k_base()), replace the trailing
.expect("failed to load tiktoken encoding") with a fallible propagation like
.map_err(|e| BuildError::... or e.into())? so verify() returns Err on init
failures rather than panicking.
- Around line 241-248: The token resolution currently only checks the explicit
arg (`token`) and the HF_TOKEN env var, but must also fall back to the
CLI-cached token; update the token resolution for `token_str` to, after checking
`HF_TOKEN`, try the Hugging Face CLI cache (e.g., use a helper that returns the
cached token such as `huggingface_hub::cached_token()` if available, or read the
conventional cache file paths like "~/.cache/huggingface/token" or
"~/.huggingface/token") and return that value before failing; ensure the new
fallback is included in the same chain where `token.map(|t|
t.to_string()).or_else(|| std::env::var("HF_TOKEN").ok())` is constructed so
`token_str` will use the cached token when present.
---
Nitpick comments:
In `@crates/tokie/src/build.rs`:
- Around line 157-161: The code currently calls std::fs::read(&enwik8_path)
which loads the entire enwik8 into memory; change this to open the file
(std::fs::File::open(enwik8_path) or equivalent) and stream only the first
1_000_000 bytes into a buffer using std::io::Read::take(...).read_to_end(&mut
raw) so the hot path only reads the 1MB sample used for verification; preserve
the same BuildError::Download mapping on errors and keep the existing truncated
variable logic (use raw[..raw.len().min(1_000_000)] or simply use the buffer
filled by take) and reference ensure_enwik8(), enwik8_path, and
BuildError::Download when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f17e14da-c38b-42f7-9d59-2551dfc07603
📒 Files selected for processing (1)
crates/tokie/src/build.rs
| const ENWIK8_URL: &str = "http://mattmahoney.net/dc/enwik8.zip"; | ||
|
|
||
| /// Download and cache enwik8. Returns the path to the cached file. | ||
| fn ensure_enwik8() -> Result<PathBuf, BuildError> { | ||
| let cache_dir = dirs_next::cache_dir() | ||
| .unwrap_or_else(|| PathBuf::from("/tmp")) | ||
| .join("tokie"); | ||
| let enwik8_path = cache_dir.join("enwik8"); | ||
|
|
||
| if enwik8_path.exists() { | ||
| return Ok(enwik8_path); | ||
| } | ||
|
|
||
| std::fs::create_dir_all(&cache_dir).ok(); | ||
|
|
||
| let zip_path = cache_dir.join("enwik8.zip"); | ||
| let response = ureq::get(ENWIK8_URL) | ||
| .call() | ||
| .map_err(|e| BuildError::Download(format!("failed to download enwik8: {e}")))?; | ||
|
|
||
| let mut bytes = Vec::new(); | ||
| response | ||
| .into_reader() | ||
| .read_to_end(&mut bytes) | ||
| .map_err(|e| BuildError::Download(format!("failed to read enwik8 response: {e}")))?; | ||
| std::fs::write(&zip_path, &bytes) | ||
| .map_err(|e| BuildError::Download(format!("failed to write enwik8.zip: {e}")))?; | ||
|
|
||
| // Extract the zip | ||
| let file = std::fs::File::open(&zip_path) | ||
| .map_err(|e| BuildError::Download(format!("failed to open enwik8.zip: {e}")))?; | ||
| let mut archive = zip::ZipArchive::new(file) | ||
| .map_err(|e| BuildError::Download(format!("failed to read enwik8.zip: {e}")))?; | ||
| let mut entry = archive | ||
| .by_index(0) | ||
| .map_err(|e| BuildError::Download(format!("failed to extract enwik8: {e}")))?; | ||
| let mut content = Vec::new(); | ||
| std::io::Read::read_to_end(&mut entry, &mut content) | ||
| .map_err(|e| BuildError::Download(format!("failed to read enwik8 entry: {e}")))?; | ||
| std::fs::write(&enwik8_path, &content) | ||
| .map_err(|e| BuildError::Download(format!("failed to write enwik8: {e}")))?; |
There was a problem hiding this comment.
Pin the verification corpus before trusting it.
ensure_enwik8() downloads and extracts the reference corpus over plain HTTP with no checksum validation. That lets a MITM or corrupted mirror silently change the verification sample and makes verify() results untrustworthy. Please switch to an authenticated source or hard-verify a pinned digest before extracting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` around lines 73 - 113, ensure_enwik8() currently
downloads enwik8 over plain HTTP and extracts it without integrity checks;
change ENWIK8_URL to a secure (HTTPS) source and add a pinned digest constant
(e.g., ENWIK8_SHA256) for the downloaded blob/zip, compute the SHA-256 (or other
agreed hash) of the downloaded bytes (zip_path or bytes) and compare it to the
pinned digest before proceeding to write/extract; if the checksum fails, return
a BuildError::Download with a clear failure message and do not extract, and keep
the existing error handling around reading/extracting to ensure only verified
content is written to enwik8 (update references in ensure_enwik8()).
| let cache_dir = dirs_next::cache_dir() | ||
| .unwrap_or_else(|| PathBuf::from("/tmp")) | ||
| .join("tokie"); |
There was a problem hiding this comment.
Use a cross-platform temp fallback here.
PathBuf::from("/tmp") is Unix-specific. If dirs_next::cache_dir() returns None on Windows or other non-Unix environments, this fallback points at the wrong location and can break verification unnecessarily.
Suggested change
- let cache_dir = dirs_next::cache_dir()
- .unwrap_or_else(|| PathBuf::from("/tmp"))
- .join("tokie");
+ let cache_dir = dirs_next::cache_dir()
+ .unwrap_or_else(std::env::temp_dir)
+ .join("tokie");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let cache_dir = dirs_next::cache_dir() | |
| .unwrap_or_else(|| PathBuf::from("/tmp")) | |
| .join("tokie"); | |
| let cache_dir = dirs_next::cache_dir() | |
| .unwrap_or_else(std::env::temp_dir) | |
| .join("tokie"); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokie/src/build.rs` around lines 77 - 79, The fallback
PathBuf::from("/tmp") is Unix-specific; replace that fallback with a
cross-platform temporary directory by using std::env::temp_dir() when
dirs_next::cache_dir() returns None so the computed cache_dir (the variable
created from dirs_next::cache_dir().unwrap_or_else(...).join("tokie")) works
correctly on Windows and other platforms.
Summary
buildfeature to coretokiecrate withconvert(),verify(),upload()functionstokie-clicrate that exposes these astokie convertandtokie verifyCLI commandstokenizersandtiktoken-rsfrom dev-deps to optional deps behind thebuildfeatureCLI Surface
Test Results
All 124 existing lib tests pass.
Test plan
tokie convertfor BPE model (gpt2)tokie convert -vfor WordPiece model (bert)tokie convert -vfor tiktoken model (cl100k via Xenova/gpt-4)tokie verifystandalone with--tkztokie verifywith Hub downloadcargo check -p tokie(no build feature) still compiles without extra depsSummary by CodeRabbit