Update dependency tokenizers to v0.23.1 #360
Open
red-hat-konflux[bot] wants to merge 1 commit into
Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##             main     #360   +/-   ##
=======================================
  Coverage   52.61%   52.61%
=======================================
  Files          10       10
  Lines         745      745
=======================================
  Hits          392      392
  Misses        353      353
This PR contains the following updates:
==0.22.2 → ==0.23.1
Release Notes
huggingface/tokenizers (tokenizers)
v0.23.1
TL;DR
tokenizers 0.23.1 is the first proper stable release in the 0.23 line. 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (the Node side hadn't shipped multi-platform binaries since 2023, and the Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.
There is no functional 0.23.0 published; we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.
🚨 Breaking changes
- requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
- add_tokens normalizes content at insertion (#1995): a re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
- Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors, such as RobertaProcessing's __init__.
- PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.
⚡ Performance: measured locally on this Mac, not lifted from PRs
Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matter, and absolute numbers will differ on CI hardware.
Added-vocabulary deserialize: the headline win (#1995, #1999)
bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload.
Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".
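To check the load-time claim on your own tokenizer.json rather than on the maintainers' laptop, a small timing helper is enough. This is a minimal sketch, not the project's benchmark: `median_seconds`, `relative_delta`, and `time_tokenizer_load` are hypothetical helper names, and the last one assumes the `tokenizers` package is installed and that a tokenizer.json file exists at the path you pass in.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_delta(baseline: float, candidate: float) -> float:
    """Signed change of candidate vs. baseline; -0.25 means 25% faster."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


def time_tokenizer_load(path: str, repeats: int = 5) -> float:
    """Median load time for one tokenizer.json file.

    Assumption: `tokenizers` is installed in the current environment.
    Run this once per environment (0.22.2 and 0.23.1) and compare the
    two results with relative_delta().
    """
    from tokenizers import Tokenizer  # imported lazily on purpose

    return median_seconds(lambda: Tokenizer.from_file(path), repeats)
```

As with the cargo bench numbers, only the relative delta between the two versions on the same machine is meaningful.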
BPE encode
- BPE GPT2 encode batch, no cache
- BPE GPT2 encode batch (cached)
- BPE GPT2 encode (single)
- BPE Train (small)
- BPE Train (big)
The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.
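The reason a per-thread cache helps mostly under contention can be illustrated in a few lines. The sketch below is a Python analogy only, not the Rust code from #2028: `PerThreadCache` is a hypothetical name, and the function it wraps stands in for the expensive word-to-pieces step. Each thread memoizes into its own dictionary, so lookups never wait on a shared lock, at the cost of each thread warming its cache separately.

```python
import threading
from typing import Callable


class PerThreadCache:
    """Memoize fn(word) in a dict private to each thread (no locking)."""

    def __init__(self, fn: Callable[[str], list[str]]) -> None:
        self._fn = fn
        self._local = threading.local()

    def _cache(self) -> dict[str, list[str]]:
        # threading.local gives every thread its own attribute namespace.
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = {}
            self._local.cache = cache
        return cache

    def __call__(self, word: str) -> list[str]:
        cache = self._cache()
        hit = cache.get(word)
        if hit is None:
            hit = self._fn(word)  # cache miss: compute and remember
            cache[word] = hit
        return hit
```

With a single thread the behaviour is the same as an ordinary memo table, which matches the note above that single-thread numbers barely move.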
Llama-3 encode
- llama3-encode (single)
- llama3-batch
- llama3-offsets
Truncation early exit (#1990)
Right-direction truncation no longer pre-tokenizes past max_length. The new truncation_benchmark doesn't exist on v0.22.2, so there's no apples-to-apples comparison here, but the PR's own measurements on the same machine showed −20% to −28% across a range of max_length values for right-truncation; left-truncation is unchanged.
Other perf improvements (no direct comparable bench)
- BPE::Builder::build no longer formats strings in a hot loop (#2010): ~45% faster Tokenizer::from_file on Llama-3 in the PR's profile.
🔄 Serialization / deserialization
The tokenizer.json format is forward-compatible: existing files load on 0.23 unchanged. Two things to know if you re-save:
- added_tokens entries created via add_tokens(..., normalized=True) will have their content normalized at save time; see the breaking-change note above.
- tokenizer.train(...) no longer keeps a redundant added_tokens / special_tokens Vec separate from the added_tokens_map_r. Public API surface unchanged; only the internal struct shape moved.
bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) lands a more realistic micro-benchmark for this surface; if you're tracking deserialize perf in your own CI, the new bench is the one to compare against.
🐍 Python: free-threaded 3.14t support
Dedicated wheels for python3.14t (the free-threaded build introduced in PEP 703). The wheel:
- Py_MOD_GIL_NOT_USED, so importing tokenizers does not force the GIL back on.
- abi3 cargo feature (free-threaded Python doesn't expose the limited API).
- Arc<RwLock<Tokenizer>> for the inner state so concurrent setters and encoders don't race PyO3's per-pyclass borrow check.
A new stress-test module tests/test_freethreaded.py exercises N-encoder × M-setter races on a single Tokenizer and asserts no RuntimeError: Already borrowed, no RwLock poisoning, and that sys._is_gil_enabled() is False post-import.
For the regular CPython wheel everything is unchanged.
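A rough local equivalent of that stress test could look like the sketch below. It is not the contents of tests/test_freethreaded.py: `gil_enabled`, `run_concurrently`, and `stress_tokenizer` are hypothetical helpers, and `stress_tokenizer` assumes `tokenizers` is installed and that the given path points at a tokenizer.json file. `sys._is_gil_enabled` only exists on Python 3.13 and later, so the probe falls back to reporting that the GIL is on.

```python
import sys
import threading
from typing import Callable


def gil_enabled() -> bool:
    """True when the GIL is active; older builds always have one."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


def run_concurrently(workers: list[Callable[[], None]]) -> list[BaseException]:
    """Run each worker in its own thread; return the exceptions raised."""
    if not workers:
        return []
    errors: list[BaseException] = []
    errors_lock = threading.Lock()
    barrier = threading.Barrier(len(workers))

    def guard(work: Callable[[], None]) -> None:
        try:
            barrier.wait()  # release all threads at the same moment
            work()
        except BaseException as exc:
            with errors_lock:
                errors.append(exc)

    threads = [threading.Thread(target=guard, args=(w,)) for w in workers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


def stress_tokenizer(path: str, encoders: int = 4, setters: int = 2) -> list[BaseException]:
    """Race encoders against setters on one Tokenizer (assumes tokenizers is installed)."""
    from tokenizers import Tokenizer

    tok = Tokenizer.from_file(path)

    def encode() -> None:
        for _ in range(200):
            tok.encode("hello world")

    def toggle() -> None:
        for _ in range(200):
            tok.enable_truncation(max_length=8)
            tok.no_truncation()

    return run_concurrently([encode] * encoders + [toggle] * setters)
```

On a 3.14t interpreter, an empty list from `stress_tokenizer` together with `gil_enabled()` returning False after the import would be consistent with the behaviour described above.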
📦 Node.js bindings: first proper multi-platform release since 2023
The npm package now ships 13 platforms (macOS x64/arm64/universal, Windows x64/i686/arm64, Linux x64/arm64/armv7 in both glibc and musl, Android arm64/armv7). Previous workflows only built 3 of those, leaving Apple Silicon / Linux ARM / Alpine users with package-not-found errors since 2023 (#1365, #1703, #1922). Fixed via #1970 + #2034, which also bump @napi-rs/cli to v3 and switch cross-builds to cargo-zigbuild.
🧷 Type hints & typing for all classes (#1928, #1997)
Every class in the Python bindings now ships proper .pyi stubs: Tokenizer, AddedToken, Encoding, every decoder / model / normalizer / pre-tokenizer / processor / trainer. Editors and type checkers (mypy, pyright, ty) see real signatures with types and docstrings instead of falling back to Any.
The stubs are generated automatically from the compiled extension via tools/stub-gen (Rust binary using pyo3-introspection). Re-running make style regenerates them; CI guards against regenerated-vs-checked-in drift. If the generator ever returns 0 docstrings (e.g. because the [patch.crates-io] pin in .cargo/config.toml falls out of sync with the pyo3 dep version), it now hard-aborts with a precise diagnostic instead of silently emitting bare-bones stubs.
mypy --strict.
✨ Other features
- models.Unigram now exposes alpha and nbest_size for subword regularization (parity with Google's implementation, #1994). Closes long-standing requests #730 and #849.
- Tokenizer (#1958): useful for long-lived caches that don't want to keep tokenizers alive.
- ci_benchmark against the stored baseline and posts a comparison chart to the PR.
🛠 Other fixes
- EncodingVisualizer: unclosed annotation span fixed (#1911), HTML escape applied to output (#1937).
- __copy__ / __deepcopy__ (#1930).
- to_vec() from slice (#1964).
- wget / norvig URL with HF Hub downloads in test data fetch (#2018).
- uv support in the Python Makefile (#1977).
👥 Contributors
Thanks to everyone who shipped commits between v0.22.2 and v0.23.1: @ArthurZucker, @finnagin, @gordonmessmer, @jberg5, @kennethsible, @llukito, @MayCXC, @McPatate, @michaelfeil, @mrkm4ntr, @musicinmybrain, @ngoldbaum, @OhashiReon, @paulinebm, @podarok, @rtrompier, @sebpop, @Shivam-Bhardwaj, @threexc, @wheynelau, @xanderlent, plus @dependabot and @hf-security-analysis for keeping pins fresh.
Full Changelog: huggingface/tokenizers@v0.22.2...v0.23.1
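Before merging the pin bump from ==0.22.2 to ==0.23.1, two of the breaking changes listed in the release notes can be checked mechanically: the Python floor of 3.10, and whether re-saving a tokenizer.json alters the content of any added token. The following is a minimal sketch using only the standard library. All function names are hypothetical, and it assumes each entry in the added_tokens block carries `id` and `content` keys.

```python
import json
import sys
from importlib import metadata
from typing import Any, Optional

MIN_PYTHON = (3, 10)  # from the release notes: requires-python = ">=3.10"


def python_supported(version_info: Any = sys.version_info) -> bool:
    """True if the interpreter meets the 0.23.x minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "tokenizers") -> Optional[str]:
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def added_token_changes(
    before: dict[str, Any], after: dict[str, Any]
) -> list[tuple[int, str, str]]:
    """List (id, old_content, new_content) for added tokens that differ."""
    old = {tok["id"]: tok["content"] for tok in before.get("added_tokens", [])}
    changes = []
    for tok in after.get("added_tokens", []):
        previous = old.get(tok["id"])
        if previous is not None and previous != tok["content"]:
            changes.append((tok["id"], previous, tok["content"]))
    return changes


def load_json(path: str) -> dict[str, Any]:
    """Read a tokenizer.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical use is to load the file saved under 0.22.2 and the file re-saved under 0.23.1 with `load_json`, pass both to `added_token_changes`, and review any tuples it returns.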
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
To execute skipped test pipelines, write the comment /ok-to-test.
Documentation
Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.