Skip to content

feat: add build module and tokie CLI#4

Open
chonknick wants to merge 12 commits into
mainfrom
feat/build-module-and-cli
Open

feat: add build module and tokie CLI#4
chonknick wants to merge 12 commits into
mainfrom
feat/build-module-and-cli

Conversation

@chonknick

@chonknick chonknick commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add build feature to core tokie crate with convert(), verify(), upload() functions
  • Add tokie-cli crate that exposes these as tokie convert and tokie verify CLI commands
  • Move tokenizers and tiktoken-rs from dev-deps to optional deps behind the build feature
  • Verification auto-detects reference backend (tiktoken-rs for Xenova repos, HF tokenizers for everything else)

CLI Surface

cargo install tokie-cli

tokie convert <repo_id> -o <output.tkz>        # convert only
tokie convert <repo_id> -o <output.tkz> -v     # convert + verify
tokie convert <repo_id> -o <output.tkz> -v -u  # convert + verify + upload
tokie verify <repo_id> [--tkz <path>]           # verify existing .tkz

Test Results

$ tokie convert openai-community/gpt2 -o /tmp/gpt2.tkz -v
Converting openai-community/gpt2 ... OK  vocab=50257 size=3223.6KB
Verifying ... OK  8/8 texts pass

$ tokie convert bert-base-uncased -o /tmp/bert.tkz -v
Converting bert-base-uncased ... OK  vocab=30522 size=1697.3KB
Verifying ... OK  8/8 texts pass

$ tokie convert Xenova/gpt-4 -o /tmp/cl100k.tkz -v
Converting Xenova/gpt-4 ... OK  vocab=100263 size=6762.5KB
Verifying ... OK  8/8 texts pass

All 124 existing lib tests pass.

Test plan

  • tokie convert for BPE model (gpt2)
  • tokie convert -v for WordPiece model (bert)
  • tokie convert -v for tiktoken model (cl100k via Xenova/gpt-4)
  • tokie verify standalone with --tkz
  • tokie verify with Hub download
  • Error case (invalid repo) exits with code 1
  • cargo check -p tokie (no build feature) still compiles without extra deps
  • All lib tests pass

Summary by CodeRabbit

  • New Features
    • Added a feature-gated "build" capability with convert/verify/upload operations and a new tokie CLI (convert and verify subcommands, optional verify/upload, download-if-missing).
  • Tests
    • Tokenizer accuracy test now gated by the build feature; CI workflow updated to run it on pull requests.
  • Documentation
    • Added design and plan docs describing the build feature and CLI.
  • Chores
    • Updated example gating and CI workflow steps to use the build feature.

Copilot AI review requested due to automatic review settings April 22, 2026 00:43
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a feature-gated build module to tokie exposing convert, verify, and upload for tokenizer packaging, verification, and Hub upload, and a new tokie-cli binary with convert and verify subcommands that call into tokie::build.

Changes

Cohort / File(s) Summary
CLI Binary
crates/tokie-cli/Cargo.toml, crates/tokie-cli/src/main.rs
New tokie-cli crate and tokie binary. Implements Clap-based convert and verify subcommands, progress printing, optional verification and upload flows, and HF Hub download logic for missing .tkz.
Build Module Implementation
crates/tokie/src/build.rs
New feature-gated build module implementing convert(), verify(), and upload(). Adds public types ConvertResult, VerifyResult, Mismatch, and BuildError. Handles HF Hub init/download, tokenizer JSON loading, .tkz writing/reading, chunked verification against reference backends, and direct PUT uploads.
Library Feature & Re-exports
crates/tokie/Cargo.toml, crates/tokie/src/lib.rs
Adds build Cargo feature enabling optional deps (tokenizers, tiktoken-rs, ureq, zip, dirs-next); updates example targets to require build. Adds #[cfg(feature = "build")] pub mod build; and re-exports BuildError, ConvertResult, VerifyResult, Mismatch.
Tests & CI
.github/workflows/tokenizer-accuracy.yml, .github/workflows/speed-regression.yml, crates/tokie/tests/accuracy.rs
Switched accuracy CI/test gating from hf to build. Adjusted speed-regression workflow checkout logic. Accuracy test now compiles under build.
Docs / Specs
docs/superpowers/plans/2026-04-21-build-module-and-cli.md, docs/superpowers/specs/2026-04-21-build-module-and-cli-design.md
New plan and spec docs describing the build feature, CLI behavior, verification strategy, upload flow, and example updates.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (tokie convert)
    participant Build as tokie::build::convert
    participant HFHub as HF Hub API
    participant FS as Filesystem
    participant Verify as tokie::build::verify
    participant Ref as Reference Backend

    CLI->>Build: convert(repo_id, output)
    Build->>HFHub: Download tokenizer.json
    HFHub-->>Build: tokenizer.json
    Build->>Build: Load into Tokenizer
    Build->>FS: Write .tkz file
    Build-->>CLI: ConvertResult{vocab_size, file_size}

    alt verify requested
        CLI->>Verify: verify(repo_id, tkz_path)
        Verify->>FS: Load .tkz
        Verify->>Ref: Resolve reference tokenizer
        Ref-->>Verify: Reference encoder
        Verify->>Verify: Compare token IDs (chunked)
        Verify-->>CLI: VerifyResult{passed, failed, mismatches}
    end

    alt upload requested
        CLI->>Build: upload(tkz_path, tokiers_name, token)
        Build->>Build: Resolve auth token
        Build->>FS: Read .tkz bytes
        Build->>HFHub: PUT /tokiers/{name}/tokenizer.tkz
        HFHub-->>Build: HTTP 2xx
        Build-->>CLI: Success
    end
Loading
sequenceDiagram
    participant CLI as CLI (tokie verify)
    participant Build as tokie::build::verify
    participant HFHub as HF Hub API
    participant FS as Filesystem
    participant Ref as Reference Backend

    CLI->>Build: verify(repo_id, tkz_path?)

    alt tkz_path provided
        Build->>FS: Load .tkz
    else tkz_path absent
        Build->>HFHub: Download tokiers/{name}/tokenizer.tkz
        HFHub-->>Build: .tkz bytes
        Build->>FS: Write temp .tkz
    end

    Build->>Build: Load Tokenizer from .tkz
    Build->>Ref: Resolve reference (tiktoken-rs or HF tokenizers)
    Ref-->>Build: Reference encoder
    Build->>Build: Encode test texts with both
    Build->>Build: Collect mismatches
    Build-->>CLI: VerifyResult{total, passed, failed, mismatches}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐰 I nibbled bytes and wrapped them tight,
I matched the tokens through day and night,
I hopped to Hub with tiny paws,
Converted, verified — obeying laws,
A rabbit seals the build with glee ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: add build module and tokie CLI' directly and accurately describes the main changes: addition of a build module to the core tokie crate and introduction of a new tokie-cli crate with CLI commands.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/build-module-and-cli

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a feature-gated “build” workflow to the tokie library and a new tokie-cli crate to expose conversion/verification/upload from the command line, while moving heavy reference deps behind an optional feature.

Changes:

  • Introduce tokie::build (behind feature = "build") with convert(), verify(), and upload() APIs plus related result/error types.
  • Add tokie-cli crate providing tokie convert / tokie verify commands (and optional upload).
  • Move tokenizers, tiktoken-rs, and ureq behind the new build feature; update example feature requirements accordingly.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
docs/superpowers/specs/2026-04-21-build-module-and-cli-design.md Design spec for the build module + CLI surface.
docs/superpowers/plans/2026-04-21-build-module-and-cli.md Implementation plan detailing steps and intended behavior.
crates/tokie/src/lib.rs Exposes the new build module/types behind cfg(feature = "build").
crates/tokie/src/build.rs Implements conversion, verification, and upload logic.
crates/tokie/Cargo.toml Adds build feature and makes reference deps optional; adjusts examples’ required features.
crates/tokie-cli/src/main.rs Clap-based CLI wiring for convert/verify (and upload).
crates/tokie-cli/Cargo.toml New CLI crate manifest and dependencies.
Cargo.lock Lockfile updates for the new crate/dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tokie/src/build.rs
//!
//! This module is only available when the `build` feature is enabled:
//! ```toml
//! tokie = { version = "0.1", features = ["build"] }

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module-level doc example uses tokie = { version = "0.1", ... }, but the crate version in this repo is currently 0.0.8. This is likely to confuse users copying the snippet; consider either matching the current crate version or omitting the version in the example.

Suggested change
//! tokie = { version = "0.1", features = ["build"] }
//! tokie = { features = ["build"] }

Copilot uses AI. Check for mistakes.
Comment thread crates/tokie/src/build.rs
Comment on lines +99 to +101
if let Some(parent) = output.parent() {
std::fs::create_dir_all(parent).ok();
}

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::fs::create_dir_all(parent).ok(); discards directory creation errors (e.g., permissions, invalid path). This can make failures harder to diagnose and may lead to misleading downstream errors; consider propagating the I/O error (e.g., by mapping it into BuildError).

Copilot uses AI. Check for mistakes.
Comment thread crates/tokie/src/build.rs
"p50k_base" => tiktoken_rs::p50k_base(),
_ => unreachable!(),
}
.expect("failed to load tiktoken encoding");

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect("failed to load tiktoken encoding") will panic the process if the encoding can't be loaded. Since this is a library API, prefer returning a BuildError instead of panicking so callers (including the CLI) can handle/report the failure cleanly.

Suggested change
.expect("failed to load tiktoken encoding");
.map_err(|e| BuildError::Download(format!("failed to load tiktoken encoding: {e}")))?;

Copilot uses AI. Check for mistakes.
Comment thread crates/tokie/src/build.rs
Comment on lines +175 to +180
Ok(VerifyResult {
total,
passed,
failed,
mismatches,
})

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BuildError includes a Verification { result: VerifyResult } variant, but verify() currently returns Ok(VerifyResult) even when there are mismatches. This makes it easy for callers to treat a failed verification as success by mistake; consider returning Err(BuildError::Verification { result }) when failed > 0 (or remove the error variant if success-on-mismatch is the intended contract).

Suggested change
Ok(VerifyResult {
total,
passed,
failed,
mismatches,
})
let result = VerifyResult {
total,
passed,
failed,
mismatches,
};
if result.failed > 0 {
Err(BuildError::Verification { result })
} else {
Ok(result)
}

Copilot uses AI. Check for mistakes.
#[arg(short, long)]
upload: bool,

/// HuggingFace API token (falls back to HF_TOKEN env or cached token)

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI help text says the token "falls back to HF_TOKEN env or cached token", but tokie::build::upload() only checks the explicit --token value and HF_TOKEN. Either implement reading the cached HuggingFace token (e.g., from the standard HF token file) or adjust the help text to match the actual behavior.

Suggested change
/// HuggingFace API token (falls back to HF_TOKEN env or cached token)
/// HuggingFace API token (falls back to HF_TOKEN env)

Copilot uses AI. Check for mistakes.
} else {
let name = repo_id.rsplit('/').next().unwrap_or(repo_id);
print!("Downloading tokiers/{name} ... ");
let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap();

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hf_hub::api::sync::ApiBuilder::new().build().unwrap() will panic if the Hub client can't be initialized (e.g., invalid config / cache dir permissions). Since this is user-facing CLI code, handle the error and exit with a clear message instead of panicking.

Suggested change
let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap();
let api = match hf_hub::api::sync::ApiBuilder::new().build() {
Ok(api) => api,
Err(e) => {
println!("FAILED: could not initialize HuggingFace Hub client: {e}");
std::process::exit(1);
}
};

Copilot uses AI. Check for mistakes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
crates/tokie-cli/src/main.rs (1)

111-113: Reuse the canonical tokiers repo-name mapping here.

Both upload and download derive tokiers/<name> by stripping everything before the last /. If crates/tokie/src/hub.rs::tokiers_repo_name already owns aliases or special cases, this CLI can upload under one name and later verify/download under another. Please route both paths through the same helper instead of re-deriving the basename locally.

Also applies to: 128-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie-cli/src/main.rs` around lines 111 - 113, The CLI currently
computes the "tokiers/<name>" basename by rsplit on repo_id in both upload and
download branches (variables like tokiers_name); replace those local rsplit
derivations with a call to the canonical helper tokie::hub::tokiers_repo_name
(defined in crates/tokie/src/hub.rs) so both upload (where tokiers_name is set)
and download paths (lines 128-131) use the same alias/special-case mapping;
update usages that build "tokiers/{...}" to call tokiers_repo_name(repo_id)
instead of repo_id.rsplit(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokie-cli/Cargo.toml`:
- Line 14: Update the tokie dependency declaration so it includes a version
field in addition to the path (e.g., add version = "0.0.8") so the crate can be
installed from crates.io while still supporting local workspace builds; modify
the existing tokie entry (tokie = { path = "../tokie", features = ["hf",
"build"] }) to include the version attribute.

In `@crates/tokie-cli/src/main.rs`:
- Around line 128-130: The ApiBuilder::new().build().unwrap() call in main
(constructing the hf_hub client) can fail and currently panics; change it to
handle the Err case and emit the CLI-style failure and exit path instead of
unwrap. Locate the hf_hub::api::sync::ApiBuilder usage (the
ApiBuilder::new().build() call) and match on the Result, logging a message like
"FAILED: could not initialize HF Hub client: {error}" (or using the existing
failure/log helper the CLI uses) and exit with code 1 when Err, otherwise
continue with the Ok(api) value.

In `@crates/tokie/src/build.rs`:
- Around line 122-129: The code currently calls
tiktoken_rs::*_base().expect(...) in the block that handles encoding_name
(matching "cl100k_base", "o200k_base", "p50k_base") inside the build logic,
which will abort the process on failure; change this to propagate a BuildError
through the public verify() API instead: replace the .expect(...) with error
mapping (e.g., map_err or try/?) to convert the tiktoken_rs initialization error
into the crate's BuildError and return Err from the surrounding function,
ensuring any failure to create the tokenizer (from tiktoken_rs::cl100k_base(),
o200k_base(), p50k_base()) yields a Result::Err rather than panicking; reference
the variables and symbols encoding_name, tiktoken, and the tiktoken_rs::*_base()
calls so you update the correct spot.
- Around line 195-202: The upload() token resolution currently only checks the
CLI arg and HF_TOKEN env var; update it to attempt reading the cached HF token
via hf_hub::cache::Cache::new()?.token() before returning the
BuildError::Upload. Specifically, in the code building token_str (where
token.map(...).or_else(...).ok_or_else(...)), add an intermediate attempt to
call Cache::new()?.token().ok() (or equivalent) as another or_else branch so
that token resolution order is: explicit CLI token, HF_TOKEN env var, cached
token from hf-hub cache, and only then return BuildError::Upload; ensure any
errors constructing the Cache are handled/ignored (treated as no cached token)
not propagated as a hard error.

---

Nitpick comments:
In `@crates/tokie-cli/src/main.rs`:
- Around line 111-113: The CLI currently computes the "tokiers/<name>" basename
by rsplit on repo_id in both upload and download branches (variables like
tokiers_name); replace those local rsplit derivations with a call to the
canonical helper tokie::hub::tokiers_repo_name (defined in
crates/tokie/src/hub.rs) so both upload (where tokiers_name is set) and download
paths (lines 128-131) use the same alias/special-case mapping; update usages
that build "tokiers/{...}" to call tokiers_repo_name(repo_id) instead of
repo_id.rsplit(...).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 38bf866c-25f1-49a6-99b1-d0cf5b6addff

📥 Commits

Reviewing files that changed from the base of the PR and between 06106e6 and 1cf2e82.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • crates/tokie-cli/Cargo.toml
  • crates/tokie-cli/src/main.rs
  • crates/tokie/Cargo.toml
  • crates/tokie/src/build.rs
  • crates/tokie/src/lib.rs
  • docs/superpowers/plans/2026-04-21-build-module-and-cli.md
  • docs/superpowers/specs/2026-04-21-build-module-and-cli-design.md

path = "src/main.rs"

[dependencies]
tokie = { path = "../tokie", features = ["hf", "build"] }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the files exist and locate them
find . -name "Cargo.toml" -type f | grep -E "(tokie-cli|tokie)" | head -20

Repository: chonkie-inc/tokie

Length of output: 178


🏁 Script executed:

# Check the structure of the repository
ls -la crates/ 2>/dev/null || echo "crates directory structure check"

Repository: chonkie-inc/tokie

Length of output: 399


🏁 Script executed:

# Read the tokie-cli Cargo.toml file
cat -n crates/tokie-cli/Cargo.toml 2>/dev/null || echo "File not found"

Repository: chonkie-inc/tokie

Length of output: 639


🏁 Script executed:

# Read the tokie Cargo.toml file to get its version
cat -n crates/tokie/Cargo.toml 2>/dev/null || echo "File not found"

Repository: chonkie-inc/tokie

Length of output: 3317


Add a versioned dependency spec for tokie to support cargo install from crates.io.

Path-only dependencies cannot be resolved when installing from crates.io. Add version = "0.0.8" alongside the path attribute to enable both local workspace builds and crates.io installation, which is required for the stated cargo install tokie-cli objective.

Suggested manifest change
-tokie = { path = "../tokie", features = ["hf", "build"] }
+tokie = { version = "0.0.8", path = "../tokie", features = ["hf", "build"] }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tokie = { path = "../tokie", features = ["hf", "build"] }
tokie = { version = "0.0.8", path = "../tokie", features = ["hf", "build"] }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie-cli/Cargo.toml` at line 14, Update the tokie dependency
declaration so it includes a version field in addition to the path (e.g., add
version = "0.0.8") so the crate can be installed from crates.io while still
supporting local workspace builds; modify the existing tokie entry (tokie = {
path = "../tokie", features = ["hf", "build"] }) to include the version
attribute.

Comment on lines +128 to +130
let name = repo_id.rsplit('/').next().unwrap_or(repo_id);
print!("Downloading tokiers/{name} ... ");
let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle HF Hub initialization failure instead of panicking.

Line 130 unwraps the Hub client build. When that fails, tokie verify <repo_id> aborts with a panic instead of the CLI's normal FAILED: ... + exit code 1 flow.

Suggested fix
-        let api = hf_hub::api::sync::ApiBuilder::new().build().unwrap();
+        let api = match hf_hub::api::sync::ApiBuilder::new().build() {
+            Ok(api) => api,
+            Err(e) => {
+                println!("FAILED: HF Hub init failed: {e}");
+                std::process::exit(1);
+            }
+        };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie-cli/src/main.rs` around lines 128 - 130, The
ApiBuilder::new().build().unwrap() call in main (constructing the hf_hub client)
can fail and currently panics; change it to handle the Err case and emit the
CLI-style failure and exit path instead of unwrap. Locate the
hf_hub::api::sync::ApiBuilder usage (the ApiBuilder::new().build() call) and
match on the Result, logging a message like "FAILED: could not initialize HF Hub
client: {error}" (or using the existing failure/log helper the CLI uses) and
exit with code 1 when Err, otherwise continue with the Ok(api) value.

Comment thread crates/tokie/src/build.rs Outdated
Comment on lines +122 to +129
if let Some(encoding_name) = tiktoken_encoding_name(repo_id) {
let tiktoken = match encoding_name {
"cl100k_base" => tiktoken_rs::cl100k_base(),
"o200k_base" => tiktoken_rs::o200k_base(),
"p50k_base" => tiktoken_rs::p50k_base(),
_ => unreachable!(),
}
.expect("failed to load tiktoken encoding");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read it
fd build.rs | head -20

Repository: chonkie-inc/tokie

Length of output: 86


🏁 Script executed:

# Get the build.rs file location and size
find . -name build.rs -path "*/tokie/*" -exec wc -l {} + 2>/dev/null

Repository: chonkie-inc/tokie

Length of output: 92


🏁 Script executed:

# Read the build.rs file to understand context around lines 122-129
cat -n crates/tokie/src/build.rs | sed -n '100,150p'

Repository: chonkie-inc/tokie

Length of output: 2143


🏁 Script executed:

# Check the verify function signature and its return type
ast-grep --pattern 'fn verify($_) -> $_ {
  $$$
}'

Repository: chonkie-inc/tokie

Length of output: 43


🏁 Script executed:

# Search for BuildError definition to understand error handling
rg "enum BuildError|struct BuildError" -A 10

Repository: chonkie-inc/tokie

Length of output: 2511


Don't panic when the reference tokenizer backend fails to initialize.

verify() is a public Result-returning API, but line 129 uses .expect(...). If tiktoken_rs::*_base() fails, callers get a process abort instead of a BuildError.

Suggested fix
         let tiktoken = match encoding_name {
             "cl100k_base" => tiktoken_rs::cl100k_base(),
             "o200k_base" => tiktoken_rs::o200k_base(),
             "p50k_base" => tiktoken_rs::p50k_base(),
             _ => unreachable!(),
         }
-        .expect("failed to load tiktoken encoding");
+        .map_err(|e| {
+            BuildError::Download(format!(
+                "failed to initialize tiktoken backend {encoding_name}: {e}"
+            ))
+        })?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 122 - 129, The code currently calls
tiktoken_rs::*_base().expect(...) in the block that handles encoding_name
(matching "cl100k_base", "o200k_base", "p50k_base") inside the build logic,
which will abort the process on failure; change this to propagate a BuildError
through the public verify() API instead: replace the .expect(...) with error
mapping (e.g., map_err or try/?) to convert the tiktoken_rs initialization error
into the crate's BuildError and return Err from the surrounding function,
ensuring any failure to create the tokenizer (from tiktoken_rs::cl100k_base(),
o200k_base(), p50k_base()) yields a Result::Err rather than panicking; reference
the variables and symbols encoding_name, tiktoken, and the tiktoken_rs::*_base()
calls so you update the correct spot.

Comment thread crates/tokie/src/build.rs
Comment on lines +195 to +202
let token_str = token
.map(|t| t.to_string())
.or_else(|| std::env::var("HF_TOKEN").ok())
.ok_or_else(|| {
BuildError::Upload(
"no HF token found — pass --token or set HF_TOKEN env var".to_string(),
)
})?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Get overview of repository structure
git ls-files crates/tokie/ | head -20

Repository: chonkie-inc/tokie

Length of output: 830


🏁 Script executed:

# Look at the build.rs file to understand context
wc -l crates/tokie/src/build.rs

Repository: chonkie-inc/tokie

Length of output: 90


🏁 Script executed:

# Search for any mentions of "cached" in the codebase
rg "cached.*token|token.*cached" --type-list

Repository: chonkie-inc/tokie

Length of output: 450


🏁 Script executed:

# Search for CLI help text mentioning token or HF_TOKEN
rg "HF_TOKEN|HF token|cached.*HF|Hugging Face.*token" -A 3 -B 3

Repository: chonkie-inc/tokie

Length of output: 18003


🏁 Script executed:

# Read the CLI help text in detail
cat -n crates/tokie-cli/src/main.rs | grep -A 5 -B 5 "HuggingFace API token"

Repository: chonkie-inc/tokie

Length of output: 475


🏁 Script executed:

# Check the full upload function context in build.rs
sed -n '180,210p' crates/tokie/src/build.rs

Repository: chonkie-inc/tokie

Length of output: 1117


🏁 Script executed:

# Search for any cached token handling utilities in the codebase
rg "cache|\.hf\(|home|\.hugging" --type rust

Repository: chonkie-inc/tokie

Length of output: 21120


🏁 Script executed:

# Check dependencies to understand what HF crates are available
cat crates/tokie/Cargo.toml | grep -A 3 -B 3 "hf-\|hugging"

Repository: chonkie-inc/tokie

Length of output: 415


🏁 Script executed:

# Check if hf-hub provides any cached token utilities
rg "hf-hub|hf_hub" -A 5 crates/tokie/src/build.rs | head -40

Repository: chonkie-inc/tokie

Length of output: 716


🏁 Script executed:

# Look for imports and usage of hf-hub in build.rs
grep -n "use\|hf-hub\|hf_hub" crates/tokie/src/build.rs | head -20

Repository: chonkie-inc/tokie

Length of output: 475


🏁 Script executed:

# Search HuggingFace documentation/examples for cached token pattern
web_search
What does hf-hub crate provide for HuggingFace token caching and retrieval

Repository: chonkie-inc/tokie

Length of output: 154


🌐 Web query:

hf-hub crate cached token retrieval HuggingFace authentication

💡 Result:

The hf-hub Rust crate retrieves Hugging Face authentication tokens from the standard cache location at ~/.cache/huggingface/token (or overridden by HF_HOME/token). This is done via the Cache struct's token method, which reads the token file if it exists. To set up the token: 1. Generate a User Access Token from https://huggingface.co/settings/tokens (read or write scope as needed). 2. Use huggingface-cli login (install via pip install -U huggingface_hub) to save it to the cache file. 3. Alternatively, manually write the token to ~/.cache/huggingface/token. When creating an Api (sync or async), it automatically uses Cache::from_env and pulls the token via cache.token if present for authenticated requests (e.g., private repos). A log message like "Token file not found" appears if absent but does not error for public access. Example usage: use hf_hub::api::sync::Api; let api = Api::new?; // Automatically checks cache for token let repo = api.model("bert-base-uncased".to_string); let _ = repo.get("config.json")?; For the newer huggingface-hub Rust crate (github.com/huggingface/huggingface_hub_rust), tokens are resolved in order: HfApiBuilder::token, HF_TOKEN env var, HF_TOKEN_PATH, or ~/.cache/huggingface/token.

Citations:


🏁 Script executed:

# Get the complete upload function implementation
sed -n '186,226p' crates/tokie/src/build.rs

Repository: chonkie-inc/tokie

Length of output: 1486


The upload() function does not implement the advertised cached-token fallback.

The CLI help explicitly states: "HuggingFace API token (falls back to HF_TOKEN env or cached token)," and the design spec promises token resolution from "environment/cache." However, the implementation (lines 195–202) only checks the provided argument and HF_TOKEN env var; it ignores the cached Hugging Face token at ~/.cache/huggingface/token. Users logged in via huggingface-cli login will fail here unless they also export HF_TOKEN manually.

The hf-hub crate (already a dependency) provides a Cache struct with a token() method that can read the cached token—integrating cached fallback is straightforward and necessary to match the advertised behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 195 - 202, The upload() token
resolution currently only checks the CLI arg and HF_TOKEN env var; update it to
attempt reading the cached HF token via hf_hub::cache::Cache::new()?.token()
before returning the BuildError::Upload. Specifically, in the code building
token_str (where token.map(...).or_else(...).ok_or_else(...)), add an
intermediate attempt to call Cache::new()?.token().ok() (or equivalent) as
another or_else branch so that token resolution order is: explicit CLI token,
HF_TOKEN env var, cached token from hf-hub cache, and only then return
BuildError::Upload; ensure any errors constructing the Cache are handled/ignored
(treated as no cached token) not propagated as a hard error.

Download and cache enwik8 (~100MB), split into ~1KB chunks, and compare
token IDs against the reference backend. This gives ~80K test cases
instead of 8, catching edge cases around unicode and special characters.

Also improve CLI output: show first divergence point per mismatch,
limit to 5 shown with a count of remaining.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokie/src/build.rs`:
- Line 191: The error mapping for Tokenizer::from_file is incorrect: it
currently maps load failures to BuildError::SaveTkz making messages misleading;
change the mapping to a load-specific variant (e.g., add BuildError::LoadTkz) or
reuse an existing load variant like BuildError::LoadJson, update the
Tokenizer::from_file call to map_err(BuildError::LoadTkz or
BuildError::LoadJson) and add the new enum variant and any needed Display/From
impls so error messages correctly indicate a failed .tkz load instead of a save.
- Around line 298-303: Remove the unreachable HTTP status check block that
inspects response.status() and returns BuildError::Upload (the if
response.status() >= 400 { ... } block referencing BuildError::Upload and
response); rely on the existing map_err(...) and ? operator (used just before)
which already converts ureq::Error::Status into an error, so delete lines
containing that status check and its return to eliminate dead code in the upload
path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8aad450d-3464-47cc-bf5d-176ba2cf366b

📥 Commits

Reviewing files that changed from the base of the PR and between 1cf2e82 and f72b0c8.

📒 Files selected for processing (3)
  • crates/tokie-cli/src/main.rs
  • crates/tokie/Cargo.toml
  • crates/tokie/src/build.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/tokie/Cargo.toml
  • crates/tokie-cli/src/main.rs

Comment thread crates/tokie/src/build.rs
/// Downloads and caches enwik8 (~100MB), splits into ~1KB chunks, and compares
/// token IDs from tokie against the reference backend (HF tokenizers or tiktoken-rs).
pub fn verify(repo_id: &str, tkz_path: &Path) -> Result<VerifyResult, BuildError> {
let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::SaveTkz)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Wrong error variant: use a load-specific error instead of SaveTkz.

Tokenizer::from_file is a load operation, but the error is mapped to BuildError::SaveTkz. This produces misleading error messages like "failed to save .tkz" when the actual failure is loading the file. Consider adding a LoadTkz variant or reusing LoadJson.

Suggested fix

Add a new error variant or reuse an existing one:

 pub enum BuildError {
     Download(String),
     LoadJson(JsonLoadError),
+    LoadTkz(SerdeError),
     SaveTkz(SerdeError),
     Verification { result: VerifyResult },
     Upload(String),
     HubInit(String),
 }

Then in verify:

-    let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::SaveTkz)?;
+    let tokie_tok = Tokenizer::from_file(tkz_path).map_err(BuildError::LoadTkz)?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` at line 191, The error mapping for
Tokenizer::from_file is incorrect: it currently maps load failures to
BuildError::SaveTkz making messages misleading; change the mapping to a
load-specific variant (e.g., add BuildError::LoadTkz) or reuse an existing load
variant like BuildError::LoadJson, update the Tokenizer::from_file call to
map_err(BuildError::LoadTkz or BuildError::LoadJson) and add the new enum
variant and any needed Display/From impls so error messages correctly indicate a
failed .tkz load instead of a save.

Comment thread crates/tokie/src/build.rs
Comment on lines +298 to +303
if response.status() >= 400 {
return Err(BuildError::Upload(format!(
"upload returned HTTP {}: check your token has write access to tokiers/ org",
response.status()
)));
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

ureq 2.0 Rust crate default behavior for HTTP 4xx 5xx error status codes

💡 Result:

In ureq version 2.x (including 2.0), the default behavior for HTTP 4xx and 5xx status codes is to treat them as errors. When a request receives a 4xx or 5xx response, it returns a ureq::Error::Status(status_code, response) variant. This allows access to both the status code (u16) and the full Response object (including headers and body) for error handling, such as reading error messages from the server. This is the default since ureq 2.0.0, as noted in the changelog: "Non-2xx responses are considered Error by default." Example from docs: match ureq::get("http://example.com/").call { Ok(response) => { /* success / }, Err(ureq::Error::Status(code, response)) => { / 4xx/5xx: access response.body etc. / } Err(_) => { / network etc. */ } } The behavior can be changed with .or_any_status to treat them as success, or in later versions (3.x) via config.http_status_as_error(false), but for 2.0 it is on by default.

Citations:


🏁 Script executed:

cat -n crates/tokie/src/build.rs | sed -n '290,310p'

Repository: chonkie-inc/tokie

Length of output: 703


🏁 Script executed:

rg 'or_any_status' crates/tokie/src/build.rs

Repository: chonkie-inc/tokie

Length of output: 43


🏁 Script executed:

rg 'or_any_status|\.call\(' crates/tokie/src/build.rs -A 2 -B 2

Repository: chonkie-inc/tokie

Length of output: 255


🏁 Script executed:

cat > /tmp/check_ureq.py << 'EOF'
import json

# Based on web search and ureq documentation:
# ureq 2.x behavior for .send_bytes():
# - Returns Result<Response, Error>
# - For 4xx/5xx responses: returns Err(ureq::Error::Status(code, response))
# - No .or_any_status() is being used, so default behavior applies
# - map_err catches the error and ? operator exits early

behavior = {
    "scenario": "ureq::put().send_bytes() with 4xx/5xx response",
    "ureq_version": "2.x",
    "default_behavior": "Treats 4xx/5xx as errors",
    "returns": "Err(ureq::Error::Status(code, response))",
    "code_flow": [
        "Line 292-296: .send_bytes().map_err(...)?",
        "If 4xx/5xx: Error variant is caught by map_err",
        "? operator exits function with error",
        "Lines 298-303: Never reached"
    ],
    "status_check_reachable": False
}

print(json.dumps(behavior, indent=2))
EOF
python /tmp/check_ureq.py

Repository: chonkie-inc/tokie

Length of output: 498


Remove unreachable status code check (lines 298-303).

With ureq 2.x, HTTP 4xx/5xx responses return Err(ureq::Error::Status(...)) by default, which is caught by the map_err on line 296. The ? operator exits before reaching the status check, making lines 298-303 unreachable dead code that should be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 298 - 303, Remove the unreachable
HTTP status check block that inspects response.status() and returns
BuildError::Upload (the if response.status() >= 400 { ... } block referencing
BuildError::Upload and response); rely on the existing map_err(...) and ?
operator (used just before) which already converts ureq::Error::Status into an
error, so delete lines containing that status check and its return to eliminate
dead code in the upload path.

The accuracy test suite was only running on push to main. Now it also
runs on pull_request, so accuracy regressions are caught before merge.
Also update the feature gate from hf to build since accuracy.rs imports
tokenizers and tiktoken-rs which are now behind the build feature.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokie/tests/accuracy.rs`:
- Line 7: The test file now uses the cfg attribute #[cfg(feature = "build")] but
the top-of-file test invocation comment still references --features hf; update
that header invocation to --features build (or document both if intended) so
local runs match the new feature gate, and ensure any README or comment lines
near the file header that mention "--features hf" are changed to "--features
build" to avoid confusion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 54eaf831-12bc-4099-8871-56e5e7749990

📥 Commits

Reviewing files that changed from the base of the PR and between f72b0c8 and 987a8e8.

📒 Files selected for processing (2)
  • .github/workflows/tokenizer-accuracy.yml
  • crates/tokie/tests/accuracy.rs

//! Requires network access and benches/data/enwik8 (1MB used).

#![cfg(feature = "hf")]
#![cfg(feature = "build")]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update the test invocation docs to match the new feature gate.

Line 7 now gates this file behind build, but the header command still says --features hf (Line 3). That command is stale and can mislead local test runs.

Suggested patch
-//! Run with: cargo test -p tokie --test accuracy --features hf -- --ignored
+//! Run with: cargo test -p tokie --test accuracy --features build -- --ignored
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/tests/accuracy.rs` at line 7, The test file now uses the cfg
attribute #[cfg(feature = "build")] but the top-of-file test invocation comment
still references --features hf; update that header invocation to --features
build (or document both if intended) so local runs match the new feature gate,
and ensure any README or comment lines near the file header that mention
"--features hf" are changed to "--features build" to avoid confusion.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/speed-regression.yml (1)

64-69: Pathspec checkout may leave extraneous files from the current commit.

git checkout "$BASE_SHA" -- . overwrites tracked files but does not remove files that exist in the working tree but are absent in $BASE_SHA. If the PR adds new workspace crates or files that influence the Cargo build graph, those extra files could affect the base benchmark, leading to inconsistent comparisons.

Consider using git checkout --force "$BASE_SHA" unconditionally for a cleaner baseline, or add git clean -fd after the pathspec checkout to remove untracked files.

Also note that the stash created on line 64 is never popped—likely fine for CI, but worth documenting if intentional.

🔧 Optional fix: ensure clean base state
          git stash --include-untracked || true
-          git checkout "$BASE_SHA" -- . || git checkout --force "$BASE_SHA"
+          git checkout --force "$BASE_SHA"
          cargo run --example speed_regression --release --features hf \
            > benchmark_base.json 2>benchmark_base.log || echo '{"results":[]}' > benchmark_base.json
          cat benchmark_base.log
          git checkout --force - || true

Or, if you prefer the pathspec approach for speed:

          git stash --include-untracked || true
          git checkout "$BASE_SHA" -- . || git checkout --force "$BASE_SHA"
+          git clean -fd || true
          cargo run --example speed_regression --release --features hf \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/speed-regression.yml around lines 64 - 69, The current
checkout using git checkout "$BASE_SHA" -- . may leave new files from the PR in
the working tree; update the workflow step that runs the base benchmark (the
block invoking git stash, git checkout "$BASE_SHA" -- ., cargo run --example
speed_regression ...) to ensure a truly clean base state by replacing the
pathspec checkout with an unconditional git checkout --force "$BASE_SHA" or by
running git clean -fd immediately after the pathspec checkout; also either pop
or document the initial git stash created by git stash --include-untracked so
it’s clear the stash is intentionally left in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/workflows/speed-regression.yml:
- Around line 64-69: The current checkout using git checkout "$BASE_SHA" -- .
may leave new files from the PR in the working tree; update the workflow step
that runs the base benchmark (the block invoking git stash, git checkout
"$BASE_SHA" -- ., cargo run --example speed_regression ...) to ensure a truly
clean base state by replacing the pathspec checkout with an unconditional git
checkout --force "$BASE_SHA" or by running git clean -fd immediately after the
pathspec checkout; also either pop or document the initial git stash created by
git stash --include-untracked so it’s clear the stash is intentionally left in
CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4a4620e0-7a38-4240-b569-caeb6029cea4

📥 Commits

Reviewing files that changed from the base of the PR and between 987a8e8 and f67ba6d.

📒 Files selected for processing (1)
  • .github/workflows/speed-regression.yml

Chunking enwik8 into 1KB pieces created false positives at chunk
boundaries because BPE tokenization is context-dependent. Now encodes
the first 1MB as a single string, matching how the accuracy test suite
works. GPT-2 is 100% accurate (286273/286273 tokens match).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
crates/tokie/src/build.rs (3)

155-155: ⚠️ Potential issue | 🟡 Minor

Map .tkz load failures to a load-oriented error.

Tokenizer::from_file() is a read path, but this maps failures to BuildError::SaveTkz, which produces misleading diagnostics for callers debugging a bad or missing .tkz file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` at line 155, The code maps failures from
Tokenizer::from_file(...) to the SaveTkz variant, producing misleading
diagnostics; change the map_err call on Tokenizer::from_file(tkz_path) (where
tokie_tok is set) to map errors to a load-oriented BuildError variant (e.g.,
BuildError::LoadTkz) so read/load failures are reported correctly, and if
BuildError lacks a LoadTkz variant, add one that wraps the underlying io/error
and use it here.

166-173: ⚠️ Potential issue | 🟠 Major

Propagate tiktoken-rs init errors instead of panicking.

verify() is a public Result API, but this branch still uses .expect(...). A backend initialization failure should come back as Err(BuildError::...), not abort the process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 166 - 173, The code in build.rs
currently calls .expect(...) when constructing the tiktoken encoding inside the
verify() flow, which will panic on initialization failure; change this to
propagate the error through the public Result by replacing the .expect call with
proper error conversion (use ? or map_err) and return a BuildError variant (or
convert into the existing BuildError) instead of aborting. Specifically, in the
block that builds tiktoken (the match over tiktoken_encoding_name(repo_id) that
calls tiktoken_rs::cl100k_base()/o200k_base()/p50k_base()), replace the trailing
.expect("failed to load tiktoken encoding") with a fallible propagation like
.map_err(|e| BuildError::... or e.into())? so verify() returns Err on init
failures rather than panicking.

241-248: ⚠️ Potential issue | 🟠 Major

Honor the documented cached-token fallback.

Token resolution still stops at the explicit arg and HF_TOKEN. That means users authenticated via huggingface-cli login will still fail here unless they also export HF_TOKEN, which doesn't match the advertised behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 241 - 248, The token resolution
currently only checks the explicit arg (`token`) and the HF_TOKEN env var, but
must also fall back to the CLI-cached token; update the token resolution for
`token_str` to, after checking `HF_TOKEN`, try the Hugging Face CLI cache (e.g.,
use a helper that returns the cached token such as
`huggingface_hub::cached_token()` if available, or read the conventional cache
file paths like "~/.cache/huggingface/token" or "~/.huggingface/token") and
return that value before failing; ensure the new fallback is included in the
same chain where `token.map(|t| t.to_string()).or_else(||
std::env::var("HF_TOKEN").ok())` is constructed so `token_str` will use the
cached token when present.
🧹 Nitpick comments (1)
crates/tokie/src/build.rs (1)

157-161: Only read the 1MB sample you actually verify.

The comment says verification uses the first 1MB, but std::fs::read() still pulls the full cached enwik8 file into memory on every run. Streaming just the prefix keeps the hot path aligned with the advertised cost.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 157 - 161, The code currently calls
std::fs::read(&enwik8_path) which loads the entire enwik8 into memory; change
this to open the file (std::fs::File::open(enwik8_path) or equivalent) and
stream only the first 1_000_000 bytes into a buffer using
std::io::Read::take(...).read_to_end(&mut raw) so the hot path only reads the
1MB sample used for verification; preserve the same BuildError::Download mapping
on errors and keep the existing truncated variable logic (use
raw[..raw.len().min(1_000_000)] or simply use the buffer filled by take) and
reference ensure_enwik8(), enwik8_path, and BuildError::Download when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokie/src/build.rs`:
- Around line 73-113: ensure_enwik8() currently downloads enwik8 over plain HTTP
and extracts it without integrity checks; change ENWIK8_URL to a secure (HTTPS)
source and add a pinned digest constant (e.g., ENWIK8_SHA256) for the downloaded
blob/zip, compute the SHA-256 (or other agreed hash) of the downloaded bytes
(zip_path or bytes) and compare it to the pinned digest before proceeding to
write/extract; if the checksum fails, return a BuildError::Download with a clear
failure message and do not extract, and keep the existing error handling around
reading/extracting to ensure only verified content is written to enwik8 (update
references in ensure_enwik8()).
- Around line 77-79: The fallback PathBuf::from("/tmp") is Unix-specific;
replace that fallback with a cross-platform temporary directory by using
std::env::temp_dir() when dirs_next::cache_dir() returns None so the computed
cache_dir (the variable created from
dirs_next::cache_dir().unwrap_or_else(...).join("tokie")) works correctly on
Windows and other platforms.

---

Duplicate comments:
In `@crates/tokie/src/build.rs`:
- Line 155: The code maps failures from Tokenizer::from_file(...) to the SaveTkz
variant, producing misleading diagnostics; change the map_err call on
Tokenizer::from_file(tkz_path) (where tokie_tok is set) to map errors to a
load-oriented BuildError variant (e.g., BuildError::LoadTkz) so read/load
failures are reported correctly, and if BuildError lacks a LoadTkz variant, add
one that wraps the underlying io/error and use it here.
- Around line 166-173: The code in build.rs currently calls .expect(...) when
constructing the tiktoken encoding inside the verify() flow, which will panic on
initialization failure; change this to propagate the error through the public
Result by replacing the .expect call with proper error conversion (use ? or
map_err) and return a BuildError variant (or convert into the existing
BuildError) instead of aborting. Specifically, in the block that builds tiktoken
(the match over tiktoken_encoding_name(repo_id) that calls
tiktoken_rs::cl100k_base()/o200k_base()/p50k_base()), replace the trailing
.expect("failed to load tiktoken encoding") with a fallible propagation like
.map_err(|e| BuildError::... or e.into())? so verify() returns Err on init
failures rather than panicking.
- Around line 241-248: The token resolution currently only checks the explicit
arg (`token`) and the HF_TOKEN env var, but must also fall back to the
CLI-cached token; update the token resolution for `token_str` to, after checking
`HF_TOKEN`, try the Hugging Face CLI cache (e.g., use a helper that returns the
cached token such as `huggingface_hub::cached_token()` if available, or read the
conventional cache file paths like "~/.cache/huggingface/token" or
"~/.huggingface/token") and return that value before failing; ensure the new
fallback is included in the same chain where `token.map(|t|
t.to_string()).or_else(|| std::env::var("HF_TOKEN").ok())` is constructed so
`token_str` will use the cached token when present.

---

Nitpick comments:
In `@crates/tokie/src/build.rs`:
- Around line 157-161: The code currently calls std::fs::read(&enwik8_path)
which loads the entire enwik8 into memory; change this to open the file
(std::fs::File::open(enwik8_path) or equivalent) and stream only the first
1_000_000 bytes into a buffer using std::io::Read::take(...).read_to_end(&mut
raw) so the hot path only reads the 1MB sample used for verification; preserve
the same BuildError::Download mapping on errors and keep the existing truncated
variable logic (use raw[..raw.len().min(1_000_000)] or simply use the buffer
filled by take) and reference ensure_enwik8(), enwik8_path, and
BuildError::Download when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f17e14da-c38b-42f7-9d59-2551dfc07603

📥 Commits

Reviewing files that changed from the base of the PR and between f67ba6d and f783c82.

📒 Files selected for processing (1)
  • crates/tokie/src/build.rs

Comment thread crates/tokie/src/build.rs
Comment on lines +73 to +113
const ENWIK8_URL: &str = "http://mattmahoney.net/dc/enwik8.zip";

/// Download and cache enwik8. Returns the path to the cached file.
fn ensure_enwik8() -> Result<PathBuf, BuildError> {
let cache_dir = dirs_next::cache_dir()
.unwrap_or_else(|| PathBuf::from("/tmp"))
.join("tokie");
let enwik8_path = cache_dir.join("enwik8");

if enwik8_path.exists() {
return Ok(enwik8_path);
}

std::fs::create_dir_all(&cache_dir).ok();

let zip_path = cache_dir.join("enwik8.zip");
let response = ureq::get(ENWIK8_URL)
.call()
.map_err(|e| BuildError::Download(format!("failed to download enwik8: {e}")))?;

let mut bytes = Vec::new();
response
.into_reader()
.read_to_end(&mut bytes)
.map_err(|e| BuildError::Download(format!("failed to read enwik8 response: {e}")))?;
std::fs::write(&zip_path, &bytes)
.map_err(|e| BuildError::Download(format!("failed to write enwik8.zip: {e}")))?;

// Extract the zip
let file = std::fs::File::open(&zip_path)
.map_err(|e| BuildError::Download(format!("failed to open enwik8.zip: {e}")))?;
let mut archive = zip::ZipArchive::new(file)
.map_err(|e| BuildError::Download(format!("failed to read enwik8.zip: {e}")))?;
let mut entry = archive
.by_index(0)
.map_err(|e| BuildError::Download(format!("failed to extract enwik8: {e}")))?;
let mut content = Vec::new();
std::io::Read::read_to_end(&mut entry, &mut content)
.map_err(|e| BuildError::Download(format!("failed to read enwik8 entry: {e}")))?;
std::fs::write(&enwik8_path, &content)
.map_err(|e| BuildError::Download(format!("failed to write enwik8: {e}")))?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Pin the verification corpus before trusting it.

ensure_enwik8() downloads and extracts the reference corpus over plain HTTP with no checksum validation. That lets a MITM or corrupted mirror silently change the verification sample and makes verify() results untrustworthy. Please switch to an authenticated source or hard-verify a pinned digest before extracting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 73 - 113, ensure_enwik8() currently
downloads enwik8 over plain HTTP and extracts it without integrity checks;
change ENWIK8_URL to a secure (HTTPS) source and add a pinned digest constant
(e.g., ENWIK8_SHA256) for the downloaded blob/zip, compute the SHA-256 (or other
agreed hash) of the downloaded bytes (zip_path or bytes) and compare it to the
pinned digest before proceeding to write/extract; if the checksum fails, return
a BuildError::Download with a clear failure message and do not extract, and keep
the existing error handling around reading/extracting to ensure only verified
content is written to enwik8 (update references in ensure_enwik8()).

Comment thread crates/tokie/src/build.rs
Comment on lines +77 to +79
let cache_dir = dirs_next::cache_dir()
.unwrap_or_else(|| PathBuf::from("/tmp"))
.join("tokie");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use a cross-platform temp fallback here.

PathBuf::from("/tmp") is Unix-specific. If dirs_next::cache_dir() returns None on Windows or other non-Unix environments, this fallback points at the wrong location and can break verification unnecessarily.

Suggested change
-    let cache_dir = dirs_next::cache_dir()
-        .unwrap_or_else(|| PathBuf::from("/tmp"))
-        .join("tokie");
+    let cache_dir = dirs_next::cache_dir()
+        .unwrap_or_else(std::env::temp_dir)
+        .join("tokie");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let cache_dir = dirs_next::cache_dir()
.unwrap_or_else(|| PathBuf::from("/tmp"))
.join("tokie");
let cache_dir = dirs_next::cache_dir()
.unwrap_or_else(std::env::temp_dir)
.join("tokie");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokie/src/build.rs` around lines 77 - 79, The fallback
PathBuf::from("/tmp") is Unix-specific; replace that fallback with a
cross-platform temporary directory by using std::env::temp_dir() when
dirs_next::cache_dir() returns None so the computed cache_dir (the variable
created from dirs_next::cache_dir().unwrap_or_else(...).join("tokie")) works
correctly on Windows and other platforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants