Release v0.2.0: O(n²) overlap fix, Mapping/Shield hardening, docs by Tatarinho · Pull Request #16 · Tatarinho/llm-safe-pl

Tatarinho · 2026-04-26T13:52:42Z

No description provided.

The greedy interval scheduler in Anonymizer._resolve_overlaps did `any(_overlaps(m, t) for t in taken)` for every candidate, costing O(n^2) on documents with thousands of PII items. On a synthetic 100 KiB input with ~4900 candidate matches that's roughly 12 million overlap checks in pure Python.

Since `taken` is invariantly sorted-by-start and pairwise non-overlapping, the only intervals that can overlap a new candidate are its left and right neighbors in start order. A single bisect_left lookup finds both in O(log n). Output is byte-identical to the previous algorithm.

Tests: existing 19 anonymizer tests pass unchanged. Added a stress suite (1000 non-overlapping, 100 identical-span, 500 clusters of 5 vs a naive reference) and a Hypothesis property test asserting equivalence with the old algorithm over arbitrary match sets.
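The neighbor argument above can be sketched as follows. This is a minimal illustration, not the code from `Anonymizer._resolve_overlaps`: the function names, the plain `(start, end)` tuples, and the half-open, non-empty span convention are all assumptions made for the example, and candidates are assumed to arrive already in priority order.

```python
from bisect import bisect_left

Span = tuple[int, int]  # assumed: half-open [start, end) with start < end


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def resolve_naive(candidates: list[Span]) -> list[Span]:
    """Reference version: compare each candidate against every taken span."""
    taken: list[Span] = []
    for m in candidates:
        if not any(overlaps(m, t) for t in taken):
            taken.append(m)
    return sorted(taken)


def resolve_bisect(candidates: list[Span]) -> list[Span]:
    """Same greedy choice, but only the two start-order neighbors are checked."""
    taken: list[Span] = []  # invariant: sorted by start, pairwise non-overlapping
    starts: list[int] = []  # parallel list of start offsets, used for bisect
    for start, end in candidates:
        i = bisect_left(starts, start)
        # Left neighbor: the taken span with the largest start below `start`.
        if i > 0 and taken[i - 1][1] > start:
            continue
        # Right neighbor: the taken span with the smallest start >= `start`.
        if i < len(taken) and taken[i][0] < end:
            continue
        taken.insert(i, (start, end))
        starts.insert(i, start)
    return taken
```

The lookup is O(log n); `list.insert` still shifts elements, so this sketch does not claim a strict O(n log n) bound overall, only that the per-candidate scan over all of `taken` is gone.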

Mapping.from_dict now validates every field at runtime: token shape matches [TYPE_NNN], the token's prefix matches its declared type, counters cover the maximum issued counter for each type, and counters/values have the expected scalar types. Tampered mapping JSON now fails loudly at load time instead of silently corrupting the Mapping.

Shield gains:
- reset(): drop the accumulated mapping between unrelated documents or users to prevent cross-document token leakage on round-trip.
- max_input_bytes constructor option: refuses inputs whose UTF-8 byte length exceeds the cap. Default unbounded; recommended for pipelines that ingest untrusted text.
- Docstring covers thread-safety and the cross-document leakage class.

Anonymizer constructor now rejects:
- Duplicate detector names (would silently overwrite the priority map and break overlap-resolution determinism).
- Strategy values other than TOKEN (the only implemented strategy in v0.1; passing anything else previously was a silent no-op).

Detector base class enforces pii_type and name presence at class definition via __init_subclass__ instead of failing at first detect() call.

CLI:
- anonymize/deanonymize gain --force; refuse to overwrite existing output and mapping files otherwise.
- All three subcommands gain --max-bytes (default 64 MiB) to refuse pathologically large stdin/file inputs.
- detect --format is now case-insensitive.

Performance polish:
- Mapping uses __slots__.
- Mapping.token_for uses an f-string instead of str.format.
- Anonymizer caches the priority dict in __init__ instead of rebuilding it on every _resolve_overlaps call.

importlib.metadata.version() in __init__.py now falls back to a "0.0.0+local" sentinel if the package metadata is missing, so import works when the source tree is read via PYTHONPATH without an editable install.

Tests: +24 in tests/test_security_hardening.py covering all of the above. Suite total 295 -> 319, all passing.
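To make the four Mapping checks concrete, here is a rough sketch of load-time validation. It is not the `Mapping.from_dict` implementation: the payload layout (`counters` plus a list of `entries` with `token`, `type`, `value`), the exact token regex, and the function names are assumptions chosen for illustration.

```python
import re

# Assumed token shape "[TYPE_NNN]": uppercase type name, underscore, digits.
TOKEN_RE = re.compile(r"\[([A-Z][A-Z0-9_]*)_([0-9]+)\]")


def validate_payload(data: object) -> None:
    """Raise ValueError if a (hypothetical) mapping payload is malformed."""
    if not isinstance(data, dict):
        raise ValueError("payload must be a dict")
    counters = data.get("counters")
    entries = data.get("entries")
    if not isinstance(counters, dict) or not isinstance(entries, list):
        raise ValueError("missing or mistyped 'counters' / 'entries'")

    for pii_type, count in counters.items():
        # bool is a subclass of int, so it has to be rejected explicitly.
        if not isinstance(pii_type, str) or type(count) is not int or count < 0:
            raise ValueError(f"bad counter for {pii_type!r}")

    for entry in entries:
        if not isinstance(entry, dict):
            raise ValueError("entry must be a dict")
        token, pii_type, value = (entry.get(k) for k in ("token", "type", "value"))
        if not all(isinstance(x, str) for x in (token, pii_type, value)):
            raise ValueError("token, type and value must be strings")
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"token {token!r} is not of the form [TYPE_NNN]")
        prefix, number = match.group(1), int(match.group(2))
        if prefix != pii_type:
            raise ValueError(f"token {token!r} does not match type {pii_type!r}")
        # The stored counter must be at least the highest number already issued.
        if counters.get(pii_type, 0) < number:
            raise ValueError(f"counter for {pii_type!r} is below {number}")


def is_valid(data: object) -> bool:
    try:
        validate_payload(data)
    except ValueError:
        return False
    return True
```

Checking the counter against the highest issued number matters because a counter that is too low would let a later call hand out a token that already exists in the mapping.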

Bumps `pyproject.toml` 0.1.0 -> 0.2.0 and synchronises the user-facing documentation with the algorithmic-perf and security/hardening changes landed earlier on this branch.

- CHANGELOG.md: detailed [0.2.0] entry under Added / Changed / Fixed, plus a "Migration notes for 0.1.x -> 0.2.0" subsection covering the two breaking changes (CLI now requires --force to overwrite outputs; Mapping.from_dict now raises on tampered/malformed JSON).
- README.md: status line updated to v0.2.0 with the test count and a one-paragraph summary of the service-pack scope; quick-example section mentions Shield.reset() and max_input_bytes; CLI section shows --force; roadmap reflects v0.2.0 done and pushes NER to "next 0.x".
- docs/quickstart.md: new sections covering Shield.reset() between unrelated documents, max_input_bytes for untrusted-input pipelines, strict Mapping.from_dict / from_json validation; CLI block updated for --force and --max-bytes.
- docs/limitations.md: threat model gains a "cross-document leakage on round-trip via a long-lived Shield" item plus a "v0.2.0 hardening you should opt into" subsection naming the three boundary controls.
- examples/cli_usage.md: --force usage on anonymize and deanonymize, new "Input-size cap" section covering --max-bytes.
- examples/hardening.py: new runnable demo of Shield.reset() and max_input_bytes, end-to-end with realistic inputs.
- notebooks/quickstart.ipynb: replaces the "NER ships in v0.1.1" line with a version-agnostic mention of "a later 0.x release"; inserts a new "Reusing a Shield safely" section before the closing summary, demonstrating reset() and max_input_bytes.

Local smoke tests: pytest 319/319 green; python -m build produces clean sdist + wheel; twine check PASSED on both; pip install of the wheel into a fresh venv reports __version__ == 0.2.0 and exercises the new APIs cleanly.
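For readers without the repository at hand, the two Shield controls named above can be illustrated with a toy stand-in. `ToyShield` below is hypothetical and is not the library's `Shield`: only the names `reset()` and `max_input_bytes` come from this PR, while the email-only detection, the method names `anonymize` / `deanonymize`, and the `ValueError` are assumptions.

```python
from __future__ import annotations

import re


class ToyShield:
    """Hypothetical stand-in for illustration; not the llm-safe-pl Shield."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def __init__(self, max_input_bytes: int | None = None) -> None:
        self.max_input_bytes = max_input_bytes  # None means unbounded
        self._forward: dict[str, str] = {}  # original value -> token
        self._reverse: dict[str, str] = {}  # token -> original value

    def anonymize(self, text: str) -> str:
        # The cap applies to the encoded UTF-8 size, not the character count.
        size = len(text.encode("utf-8"))
        if self.max_input_bytes is not None and size > self.max_input_bytes:
            raise ValueError(f"input is {size} bytes, cap is {self.max_input_bytes}")

        def replace(match: re.Match[str]) -> str:
            value = match.group(0)
            if value not in self._forward:
                token = f"[EMAIL_{len(self._forward) + 1:03d}]"
                self._forward[value] = token
                self._reverse[token] = value
            return self._forward[value]

        return self._EMAIL.sub(replace, text)

    def deanonymize(self, text: str) -> str:
        for token, value in self._reverse.items():
            text = text.replace(token, value)
        return text

    def reset(self) -> None:
        # Forget everything learned so far, so tokens from one document
        # can no longer be resolved while handling the next one.
        self._forward.clear()
        self._reverse.clear()
```

Without the `reset()` call, a token such as `[EMAIL_001]` that appears in a response for a second, unrelated document would still be expanded to the first document's address. The byte cap also differs from a character cap for Polish text: `"zażółć"` is 6 characters but 10 bytes in UTF-8.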

Six ruff findings were tripped by edits in the previous three commits; this fix-up brings the branch back to green on all four CI gates.

- E501 (line too long): wrapped the `--force` typer.Option in `cli.py` and shortened a long print() line in the new notebook cell.
- RUF023 (unsorted __slots__): re-sorted `Mapping.__slots__` to `("_counters", "_forward", "_reverse")`.
- RUF007: replaced `zip(matches, matches[1:], strict=False)` in `tests/test_anonymizer.py` with `itertools.pairwise(matches)`.
- I001 / F401 in `tests/test_security_hardening.py`: removed the unused `Detector` import and let ruff re-sort the import block.

Also applied `ruff format` across five files that had long-line or other minor formatting drift; no code-behavior changes.

Verified locally:
  ruff check .           -> All checks passed
  ruff format --check .  -> 56 files already formatted
  mypy                   -> no issues found in 23 source files
  pytest -q              -> 319 passed

Michal Piotrowski added 4 commits April 26, 2026 15:23

Tatarinho merged commit c9bcd08 into main Apr 26, 2026
7 checks passed

Tatarinho merged 4 commits into main from fix/overlap-resolution-bisect
