Update dependency tokenizers to v0.23.1 by red-hat-konflux[bot] · Pull Request #360 · ansible/aap-rag-content

red-hat-konflux · 2026-06-03T20:14:22Z

This PR contains the following updates:

Package	Change
tokenizers	`==0.22.2` → `==0.23.1`

Release Notes

huggingface/tokenizers (tokenizers)

`v0.23.1`

Compare Source

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line — 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (Node side hadn't shipped multi-platform binaries since 2023, Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.

🚨 Breaking changes

Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
Type stubs are precise (#1928, #1997): methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors, such as RobertaProcessing's __init__.
3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.
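The Python 3.9 cutoff above can be expressed as a small selection rule. This is only a sketch: the helper name tokenizers_pin is hypothetical, and it assumes 0.22.2 (the version this PR upgrades from) is the pin to stay on for 3.9.

```python
import sys


def tokenizers_pin(version_info=sys.version_info):
    """Return a requirement string suited to the given interpreter version.

    Hypothetical helper: 0.23.1 declares requires-python >= 3.10,
    so older interpreters are assumed to stay on the 0.22.x line.
    """
    if tuple(version_info[:2]) >= (3, 10):
        return "tokenizers==0.23.1"
    return "tokenizers==0.22.2"


print(tokenizers_pin())
```

In a requirements file, an environment marker such as `tokenizers==0.23.1; python_version >= "3.10"` is one way to apply the same rule declaratively.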

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matter, and absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

The added_vocab_deserialize benchmark, reworked in "bench: improve added_vocab_deserialize to reflect real-world workloads" (#2000), is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

benchmark	v0.22.2	v0.23.1	change
100k tokens, special, no norm	~410 ms	248 ms	−40%
100k tokens, non-special, no norm	~7.1 s	273 ms	−96%
100k tokens, special, NFKC	~395 ms	235 ms	−40%
100k tokens, non-special, NFKC	~7.4 s	290 ms	−96%
400k tokens, special, no norm	~15 s	980 ms	−94%

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".
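For readers who want to recompute the "change" column from the table above, a minimal sketch follows. The function name pct_change is an illustrative choice; the timings are copied from three of the rows, with seconds converted to milliseconds. Baselines marked "~" are approximate, so a recomputed value can differ from the printed one by a point.

```python
def pct_change(baseline_ms, new_ms):
    """Relative change of new_ms against baseline_ms, in percent."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


# (v0.22.2, v0.23.1) wall-clock timings in milliseconds, taken from the table.
rows = {
    "100k special, no norm": (410, 248),
    "100k non-special, no norm": (7100, 273),
    "100k non-special, NFKC": (7400, 290),
}

deltas = {name: round(pct_change(old, new)) for name, (old, new) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta}%")
```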

BPE encode

benchmark	v0.22.2	v0.23.1	change
`BPE GPT2 encode batch, no cache`	530 ms	446 ms	−16%
`BPE GPT2 encode batch` (cached)	690 ms	685 ms	noise
`BPE GPT2 encode` (single)	1.95 s	1.94 s	noise
`BPE Train (small)`	32.6 ms	31.5 ms	−3%
`BPE Train (big)`	1.01 s	988 ms	−2%

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.
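The idea behind a per-thread cache can be sketched in a few lines of Python. This is not the code from #2028 (which lives in the Rust core); it is a hedged illustration of the general technique, with _merge as a made-up stand-in for the expensive merge step. Each thread owns its own dictionary, so lookups need no lock and threads never contend with each other.

```python
import threading

_thread_state = threading.local()


def _merge(word):
    # Stand-in for the expensive BPE merge loop; not the real algorithm.
    return tuple(word)


def encode_word(word):
    """Look the word up in the calling thread's private cache."""
    cache = getattr(_thread_state, "cache", None)
    if cache is None:
        cache = _thread_state.cache = {}
    try:
        return cache[word]
    except KeyError:
        result = cache[word] = _merge(word)
        return result


def cache_size():
    """Number of entries in the calling thread's cache."""
    return len(getattr(_thread_state, "cache", {}))


def worker(words, sizes, slot):
    for w in words:
        encode_word(w)
    sizes[slot] = cache_size()


sizes = [None, None]
threads = [
    threading.Thread(target=worker, args=(["low", "lower", "low"], sizes, 0)),
    threading.Thread(target=worker, args=(["new"], sizes, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off is memory: every thread warms its own cache, which is why the benefit shows up mainly when many threads would otherwise contend on a shared one.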

Llama-3 encode

benchmark	v0.22.2	v0.23.1	change
`llama3-encode` (single)	2.10 s	2.02 s	−4%
`llama3-batch`	438 ms	408 ms	−7%
`llama3-offsets`	410 ms	395 ms	−4%

Truncation early exit (#1990)

Right-direction truncation no longer pre-tokenizes past max_length. The new truncation_benchmark doesn't exist on v0.22.2, so there is no apples-to-apples comparison here, but the PR's own measurements on the same machine showed −20–28% across a range of max_length values for right-truncation; left-truncation is unchanged.
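The early-exit idea can be illustrated with a lazy generator. This is a simplified sketch, not the library's implementation: whitespace splitting stands in for real pre-tokenization, and the names iter_tokens, counted, and truncate_right are hypothetical.

```python
import re
from itertools import islice


def iter_tokens(text):
    """Lazily yield whitespace-delimited tokens (stand-in for pre-tokenization)."""
    for match in re.finditer(r"\S+", text):
        yield match.group()


def counted(tokens, counter):
    """Pass tokens through while counting how many were actually produced."""
    for tok in tokens:
        counter[0] += 1
        yield tok


def truncate_right(text, max_length, counter):
    """Keep the first max_length tokens and stop scanning once they are found."""
    return list(islice(counted(iter_tokens(text), counter), max_length))


counter = [0]
kept = truncate_right("one two three four five six", 3, counter)
print(kept, counter[0])
```

Left truncation keeps the tail of the input, so the whole text still has to be scanned, which is consistent with left-truncation timings being unchanged.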

Other perf improvements (no direct comparable bench)

BPE::Builder::build no longer formats strings in a hot loop (#2010) — ~45% faster Tokenizer::from_file on Llama-3 in the PR's profile.
BPE per-thread cache (#2028) — see Vera numbers in PR description for parallel scale-out.

🔄 Serialization / deserialization

The tokenizer.json format is forward-compatible: existing files load on 0.23 unchanged. Two things to know if you re-save:

added_tokens entries created via add_tokens(..., normalized=True) have their content normalized at insertion, so the added_tokens block of a re-saved file may differ (see breaking-change note above).
tokenizer.train(...) no longer keeps a redundant added_tokens/special_tokens Vec separate from the added_tokens_map_r. Public API surface unchanged; only the internal struct shape moved.
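If you want to see whether re-saving changed the added_tokens block, a comparison along these lines may help. It is a sketch using only the standard library; the helper name changed_added_tokens and the sample data (including the lowercased "hello") are made up for illustration.

```python
import json


def changed_added_tokens(old_json, new_json):
    """Return {id: (old_content, new_content)} for added tokens whose content differs."""
    old = {t["id"]: t["content"] for t in json.loads(old_json).get("added_tokens", [])}
    new = {t["id"]: t["content"] for t in json.loads(new_json).get("added_tokens", [])}
    return {i: (old[i], new[i]) for i in old.keys() & new.keys() if old[i] != new[i]}


# Hypothetical before/after snippets of a tokenizer.json file.
before = json.dumps({"added_tokens": [
    {"id": 100, "content": "<pad>", "special": True, "normalized": False},
    {"id": 101, "content": "Hello", "special": False, "normalized": True},
]})
after = json.dumps({"added_tokens": [
    {"id": 100, "content": "<pad>", "special": True, "normalized": False},
    {"id": 101, "content": "hello", "special": False, "normalized": True},
]})

diff = changed_added_tokens(before, after)
print(diff)
```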

The reworked added_vocab_deserialize benchmark (#2000) is a more realistic micro-benchmark for this surface; if you're tracking deserialize perf in your own CI, the new bench is the one to compare against.

🐍 Python: free-threaded 3.14t support

Dedicated wheels for python3.14t (the free-threaded build introduced in PEP 703). The wheel:

Declares Py_MOD_GIL_NOT_USED, so importing tokenizers does not force the GIL back on.
Builds without the abi3 cargo feature (free-threaded Python doesn't expose the limited API).
Goes through Arc<RwLock<Tokenizer>> for the inner state so concurrent setters and encoders don't race PyO3's per-pyclass borrow check.

A new stress-test module tests/test_freethreaded.py exercises N-encoder × M-setter races on a single Tokenizer and asserts no RuntimeError: Already borrowed, no RwLock poisoning, and that sys._is_gil_enabled() is False post-import.
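The shape of such an N-encoder × M-setter stress test can be sketched with the standard library alone. This is not the contents of tests/test_freethreaded.py: StandInTokenizer is a hypothetical pure-Python class guarded by a plain lock, used only to show the pattern of racing readers against writers and collecting any exception raised.

```python
import sys
import threading


class StandInTokenizer:
    """Hypothetical stand-in; the real Tokenizer guards its state in Rust."""

    def __init__(self):
        self._lock = threading.Lock()
        self._max_length = 8

    def set_max_length(self, value):
        with self._lock:
            self._max_length = value

    def encode(self, text):
        with self._lock:
            return text.split()[: self._max_length]


def stress(tokenizer, n_encoders=4, m_setters=2, iterations=200):
    """Race encoders against setters and return every exception that was raised."""
    errors = []

    def encoder():
        try:
            for _ in range(iterations):
                tokenizer.encode("a b c d e f g h i j")
        except Exception as exc:
            errors.append(exc)

    def setter(seed):
        try:
            for i in range(iterations):
                tokenizer.set_max_length(1 + (seed + i) % 8)
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=encoder) for _ in range(n_encoders)]
    threads += [threading.Thread(target=setter, args=(s,)) for s in range(m_setters)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def gil_enabled():
    # sys._is_gil_enabled() exists on Python 3.13+; older interpreters always have the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else probe()


errors = stress(StandInTokenizer())
print(errors, gil_enabled())
```

On a free-threaded interpreter gil_enabled() is expected to report False unless an imported extension forced the GIL back on, which is the condition the real test asserts after importing tokenizers.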

For the regular CPython wheel everything is unchanged.

📦 Node.js bindings: first proper multi-platform release since 2023

The npm package now ships 13 platforms (macOS x64/arm64/universal, Windows x64/i686/arm64, Linux x64/arm64/armv7 in both glibc and musl, Android arm64/armv7) — previous workflows only built 3 of those, leaving Apple Silicon / Linux ARM / Alpine users with package-not-found errors since 2023 (#1365, #1703, #1922). Fixed via #1970 + #2034, which also bumps @napi-rs/cli to v3 and switches cross-builds to cargo-zigbuild.

🧷 Type hints & typing for all classes (#1928, #1997)

Every class in the Python bindings now ships proper .pyi stubs: Tokenizer, AddedToken, Encoding, and every decoder / model / normalizer / pre-tokenizer / processor / trainer. Editors and type checkers (mypy, pyright, ty) see real signatures with types and docstrings instead of falling back to Any.

The stubs are generated automatically from the compiled extension via tools/stub-gen (Rust binary using pyo3-introspection). Re-running make style regenerates them; CI guards against regenerated-vs-checked-in drift. If the generator ever returns 0 docstrings (e.g. because the [patch.crates-io] pin in .cargo/config.toml falls out of sync with the pyo3 dep version), it now hard-aborts with a precise diagnostic instead of silently emitting bare-bones stubs.

>>> from tokenizers import Tokenizer
>>> # IDEs now resolve every method, every kwarg, every return type
>>> Tokenizer.from_pretrained("bert-base-cased")

⚠️ As called out in breaking changes: stricter type info means previously-hidden type errors in user code may now surface under mypy --strict.

✨ Other features

Unigram sampling: models.Unigram now exposes alpha and nbest_size for subword regularization (parity with Google's implementation, #1994). Closes long-standing requests #730 and #849.
Weakref support on Tokenizer (#1958) — useful for long-lived caches that don't want to keep tokenizers alive.
CI benchmark regression detection on PRs (#2013) — every PR runs ci_benchmark against the stored baseline and posts a comparison chart to the PR.
Longer-context Llama-3 benchmarks (#1971) for tracking head-room on multi-thousand-token inputs.
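The weakref item above enables caches that do not keep their entries alive. The sketch below shows the pattern with weakref.WeakValueDictionary; StandInTokenizer and get_tokenizer are hypothetical names, and the stand-in class is used so the example runs without the library installed. The assumption is that a real Tokenizer can be stored the same way as of #1958.

```python
import gc
import weakref


class StandInTokenizer:
    """Hypothetical stand-in for an object that supports weak references."""

    def __init__(self, name):
        self.name = name


# Values are held weakly: an entry disappears once nothing else references it.
cache = weakref.WeakValueDictionary()


def get_tokenizer(name):
    tok = cache.get(name)
    if tok is None:
        tok = StandInTokenizer(name)
        cache[name] = tok
    return tok


first = get_tokenizer("demo")
same = get_tokenizer("demo")
reused = first is same          # the cache hands back the live object

del first, same                 # drop the only strong references
gc.collect()
evicted = "demo" not in cache   # the cache did not keep the object alive
```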

🛠 Other fixes

EncodingVisualizer: unclosed annotation span fixed (#1911), HTML escape applied to output (#1937).
DecodeStream: __copy__ / __deepcopy__ (#1930).
Pre-tokenize: removed an unnecessary to_vec() from slice (#1964).
Replace wget / norvig URL with HF Hub downloads in test data fetch (#2018).
uv support in the Python Makefile (#1977).
Several security-pin bumps on workflow SHAs (#2004, #2005, #2006, #2016, #2017).

👥 Contributors

Thanks to everyone who shipped commits between v0.22.2 and v0.23.1:

@ArthurZucker, @finnagin, @gordonmessmer, @jberg5, @kennethsible, @llukito, @MayCXC, @McPatate, @michaelfeil, @mrkm4ntr, @musicinmybrain, @ngoldbaum, @OhashiReon, @paulinebm, @podarok, @rtrompier, @sebpop, @Shivam-Bhardwaj, @threexc, @wheynelau, @xanderlent — plus @dependabot and @hf-security-analysis for keeping pins fresh.

Full Changelog: huggingface/tokenizers@v0.22.2...v0.23.1

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

To execute skipped test pipelines write comment /ok-to-test.

Documentation

Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.

Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>

codecov-commenter · 2026-06-03T20:17:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.61%. Comparing base (17437c6) to head (7c7a174).

@@           Coverage Diff           @@
##             main     #360   +/-   ##
=======================================
  Coverage   52.61%   52.61%           
=======================================
  Files          10       10           
  Lines         745      745           
=======================================
  Hits          392      392           
  Misses        353      353

Flag	Coverage Δ
python	`52.61% <ø> (ø)`

Update dependency tokenizers to v0.23.1

7c7a174

Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>

red-hat-konflux[bot] wants to merge 1 commit into main from konflux/mintmaker/main/tokenizers-0.x
