This repository was archived by the owner on Apr 29, 2026. It is now read-only.
Release v0.2.0: O(n²) overlap fix, Mapping/Shield hardening, docs#16
Merged
added 4 commits
April 26, 2026 15:23
The greedy interval scheduler in `Anonymizer._resolve_overlaps` did `any(_overlaps(m, t) for t in taken)` for every candidate, costing O(n^2) on documents with thousands of PII items. On a synthetic 100 KiB input with ~4900 candidate matches that's roughly 12 million overlap checks in pure Python.

Since `taken` is invariantly sorted by start and pairwise non-overlapping, the only intervals that can overlap a new candidate are its left and right neighbors in start order. A single `bisect_left` lookup finds both in O(log n). Output is byte-identical to the previous algorithm.

Tests: existing 19 anonymizer tests pass unchanged. Added a stress suite (1000 non-overlapping, 100 identical-span, 500 clusters of 5 vs a naive reference) and a Hypothesis property test asserting equivalence with the old algorithm over arbitrary match sets.
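The neighbor-only check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Match` tuple, the `_rank` ordering (higher priority first, then longer span, then earlier start), and the function names are assumptions made for the example. It also assumes half-open, non-empty spans (`start < end`), which is what guarantees that accepted intervals never share a start.

```python
from bisect import bisect_left
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical match record: half-open span [start, end) with a priority."""
    start: int
    end: int
    priority: int = 0


def _overlaps(a: Match, b: Match) -> bool:
    # Half-open intervals overlap when each one starts before the other ends.
    return a.start < b.end and b.start < a.end


def _rank(m: Match) -> tuple[int, int, int]:
    # Assumed greedy order: higher priority, then longer span, then earlier start.
    return (-m.priority, -(m.end - m.start), m.start)


def resolve_naive(matches: list[Match]) -> list[Match]:
    """Reference O(n^2) version: compare each candidate with every taken span."""
    taken: list[Match] = []
    for m in sorted(matches, key=_rank):
        if not any(_overlaps(m, t) for t in taken):
            taken.append(m)
    return sorted(taken, key=lambda t: t.start)


def resolve_bisect(matches: list[Match]) -> list[Match]:
    """Same greedy choice, but only the two start-order neighbors are checked."""
    starts: list[int] = []   # start offsets of `taken`, kept sorted
    taken: list[Match] = []  # accepted spans, sorted by start, pairwise disjoint
    for m in sorted(matches, key=_rank):
        i = bisect_left(starts, m.start)
        left_ok = i == 0 or not _overlaps(taken[i - 1], m)
        right_ok = i == len(taken) or not _overlaps(m, taken[i])
        if left_ok and right_ok:
            starts.insert(i, m.start)
            taken.insert(i, m)
    return taken


demo = [Match(0, 5, 1), Match(3, 8, 2), Match(8, 10, 1), Match(9, 12, 1)]
assert resolve_bisect(demo) == resolve_naive(demo)
```

Note that `list.insert` still shifts elements, so each accepted span costs O(n) in the worst case; the saving in this sketch comes from replacing the per-candidate linear scan of Python-level overlap checks with one binary search and at most two comparisons.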
`Mapping.from_dict` now validates every field at runtime: token shape matches `[TYPE_NNN]`, the token's prefix matches its declared type, counters cover the maximum issued counter for each type, and counters/values have the expected scalar types. Tampered mapping JSON now fails loudly at load time instead of silently corrupting the Mapping.

Shield gains:
- `reset()`: drop the accumulated mapping between unrelated documents or users to prevent cross-document token leakage on round-trip.
- `max_input_bytes` constructor option: refuses inputs whose UTF-8 byte length exceeds the cap. Default unbounded; recommended for pipelines that ingest untrusted text.
- Docstring covers thread-safety and the cross-document leakage class.

Anonymizer constructor now rejects:
- Duplicate detector names (would silently overwrite the priority map and break overlap-resolution determinism).
- Strategy values other than TOKEN (the only implemented strategy in v0.1; passing anything else previously was a silent no-op).

Detector base class enforces `pii_type` and `name` presence at class definition via `__init_subclass__` instead of failing at first `detect()` call.

CLI:
- anonymize/deanonymize gain `--force`; refuse to overwrite existing output and mapping files otherwise.
- All three subcommands gain `--max-bytes` (default 64 MiB) to refuse pathologically large stdin/file inputs.
- `detect --format` is now case-insensitive.

Performance polish:
- Mapping uses `__slots__`.
- `Mapping.token_for` uses an f-string instead of `str.format`.
- Anonymizer caches the priority dict in `__init__` instead of rebuilding it on every `_resolve_overlaps` call.

`importlib.metadata.version()` in `__init__.py` now falls back to a "0.0.0+local" sentinel if the package metadata is missing, so import works when the source tree is read via PYTHONPATH without an editable install.

Tests: +24 in `tests/test_security_hardening.py` covering all of the above. Suite total 295 -> 319, all passing.
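To make the load-time checks concrete, here is a rough sketch of the kind of validation the commit describes for `Mapping.from_dict`. The JSON layout (`entries` and `counters` keys, each entry carrying `type` and `value`), the regex, and the function name `validate_mapping_payload` are all assumptions for illustration; the real serialized format is not shown in this pull request and may differ.

```python
import re

# Assumed token shape: "[TYPE_NNN]", e.g. "[EMAIL_001]" or "[CREDIT_CARD_002]".
_TOKEN_RE = re.compile(r"\[([A-Z][A-Z_]*)_(\d+)\]")


def validate_mapping_payload(payload: object) -> None:
    """Raise ValueError if a (hypothetical) serialized mapping looks tampered."""
    if not isinstance(payload, dict):
        raise ValueError("mapping payload must be a JSON object")
    entries = payload.get("entries")
    counters = payload.get("counters")
    if not isinstance(entries, dict) or not isinstance(counters, dict):
        raise ValueError("'entries' and 'counters' must both be objects")

    # Counters must be non-negative ints (bool is an int subclass, so exclude it).
    for pii_type, count in counters.items():
        if (not isinstance(pii_type, str) or isinstance(count, bool)
                or not isinstance(count, int) or count < 0):
            raise ValueError(f"bad counter for {pii_type!r}: {count!r}")

    max_seen: dict[str, int] = {}
    for token, entry in entries.items():
        m = _TOKEN_RE.fullmatch(token) if isinstance(token, str) else None
        if m is None:
            raise ValueError(f"token does not match [TYPE_NNN]: {token!r}")
        if not isinstance(entry, dict):
            raise ValueError(f"entry for {token} must be an object")
        pii_type, value = entry.get("type"), entry.get("value")
        if not isinstance(pii_type, str) or not isinstance(value, str):
            raise ValueError(f"entry for {token} has non-string type/value")
        if m.group(1) != pii_type:
            raise ValueError(f"{token} prefix does not match type {pii_type!r}")
        max_seen[pii_type] = max(max_seen.get(pii_type, 0), int(m.group(2)))

    # Each counter must cover the highest token number issued for its type,
    # otherwise the next issued token could collide with an existing one.
    for pii_type, highest in max_seen.items():
        if counters.get(pii_type, 0) < highest:
            raise ValueError(f"counter for {pii_type!r} is below {highest}")


good = {
    "entries": {"[EMAIL_001]": {"type": "EMAIL", "value": "a@example.com"}},
    "counters": {"EMAIL": 1},
}
validate_mapping_payload(good)  # passes silently
```

Running all checks before constructing the object means a malformed file fails as a single `ValueError` at load time rather than surfacing later as a token collision or a wrong substitution.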
Bumps `pyproject.toml` 0.1.0 -> 0.2.0 and synchronises the user-facing documentation with the algorithmic-perf and security/hardening changes landed earlier on this branch.

- CHANGELOG.md: detailed [0.2.0] entry under Added / Changed / Fixed, plus a "Migration notes for 0.1.x -> 0.2.0" subsection covering the two breaking changes (CLI now requires `--force` to overwrite outputs; `Mapping.from_dict` now raises on tampered/malformed JSON).
- README.md: status line updated to v0.2.0 with the test count and a one-paragraph summary of the service-pack scope; quick-example section mentions `Shield.reset()` and `max_input_bytes`; CLI section shows `--force`; roadmap reflects v0.2.0 done and pushes NER to "next 0.x".
- docs/quickstart.md: new sections covering `Shield.reset()` between unrelated documents, `max_input_bytes` for untrusted-input pipelines, strict `Mapping.from_dict` / `from_json` validation; CLI block updated for `--force` and `--max-bytes`.
- docs/limitations.md: threat model gains a "cross-document leakage on round-trip via a long-lived Shield" item plus a "v0.2.0 hardening you should opt into" subsection naming the three boundary controls.
- examples/cli_usage.md: `--force` usage on anonymize and deanonymize, new "Input-size cap" section covering `--max-bytes`.
- examples/hardening.py: new runnable demo of `Shield.reset()` and `max_input_bytes`, end-to-end with realistic inputs.
- notebooks/quickstart.ipynb: replaces the "NER ships in v0.1.1" line with a version-agnostic mention of "a later 0.x release"; inserts a new "Reusing a Shield safely" section before the closing summary, demonstrating `reset()` and `max_input_bytes`.

Local smoke tests: pytest 319/319 green; `python -m build` produces clean sdist + wheel; `twine check` PASSED on both; pip install of the wheel into a fresh venv reports `__version__ == 0.2.0` and exercises the new APIs cleanly.
Six ruff findings were tripped by edits in the previous three commits;
this fix-up brings the branch back to green on all four CI gates.
- E501 (line too long): wrapped the `--force` typer.Option in
`cli.py` and shortened a long print() line in the new notebook
cell.
- RUF023 (unsorted __slots__): re-sorted `Mapping.__slots__` to
`("_counters", "_forward", "_reverse")`.
- RUF007: replaced `zip(matches, matches[1:], strict=False)` in
`tests/test_anonymizer.py` with `itertools.pairwise(matches)`.
- I001 / F401 in `tests/test_security_hardening.py`: removed the
unused `Detector` import and let ruff re-sort the import block.
Also applied `ruff format` across five files that had long-line or
other minor formatting drift; no code-behavior changes.
Verified locally:
ruff check . -> All checks passed
ruff format --check . -> 56 files already formatted
mypy -> no issues found in 23 source files
pytest -q -> 319 passed