This repository was archived by the owner on Apr 29, 2026. It is now read-only.

Release v0.2.0: O(n²) overlap fix, Mapping/Shield hardening, docs #16

Merged
Tatarinho merged 4 commits into main from fix/overlap-resolution-bisect on Apr 26, 2026

Conversation

@Tatarinho
Owner

No description provided.

Michal Piotrowski added 4 commits April 26, 2026 15:23
The greedy interval scheduler in Anonymizer._resolve_overlaps did
`any(_overlaps(m, t) for t in taken)` for every candidate, costing
O(n^2) on documents with thousands of PII items. On a synthetic
100 KiB input with ~4900 candidate matches that's roughly 12 million
overlap checks in pure Python.

Since `taken` is invariantly sorted-by-start and pairwise
non-overlapping, the only intervals that can overlap a new candidate
are its left and right neighbors in start order. A single
bisect_left lookup finds both in O(log n).
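The neighbor argument above can be sketched as follows. This is a minimal illustration, not the code from `Anonymizer._resolve_overlaps`: it assumes matches are half-open `(start, end)` spans with `start < end`, that candidates arrive already ordered by priority, and the function names are hypothetical.

```python
from bisect import bisect_left

Span = tuple[int, int]  # assumed shape: half-open (start, end), start < end


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def resolve_naive(candidates: list[Span]) -> list[Span]:
    """Reference: compare each candidate against every accepted span."""
    taken: list[Span] = []
    for m in candidates:
        if not any(overlaps(m, t) for t in taken):
            taken.append(m)
    return sorted(taken)


def resolve_bisect(candidates: list[Span]) -> list[Span]:
    """Same greedy choice, checking only the two start-order neighbors."""
    taken: list[Span] = []  # kept sorted by start, pairwise non-overlapping
    starts: list[int] = []  # parallel list of start offsets for bisect
    for m in candidates:
        i = bisect_left(starts, m[0])
        # Left neighbor: the accepted span with the largest start below m's.
        if i > 0 and overlaps(taken[i - 1], m):
            continue
        # Right neighbor: the accepted span with the smallest start >= m's.
        if i < len(taken) and overlaps(taken[i], m):
            continue
        taken.insert(i, m)
        starts.insert(i, m[0])
    return taken


candidates = [(0, 5), (3, 8), (5, 9), (20, 25), (8, 21), (9, 20)]
assert resolve_bisect(candidates) == resolve_naive(candidates)
print(resolve_bisect(candidates))  # [(0, 5), (5, 9), (9, 20), (20, 25)]
```

Because accepted spans never overlap, their end offsets are sorted in the same order as their starts, which is why checking the two neighbors is sufficient. In this sketch the lookup is O(log n), while `list.insert` still shifts elements on each accepted span.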

Output is byte-identical to the previous algorithm.

Tests: existing 19 anonymizer tests pass unchanged. Added a stress
suite (1000 non-overlapping, 100 identical-span, 500 clusters of 5
vs a naive reference) and a Hypothesis property test asserting
equivalence with the old algorithm over arbitrary match sets.

Mapping.from_dict now validates every field at runtime: token shape
matches [TYPE_NNN], the token's prefix matches its declared type,
counters cover the maximum issued counter for each type, and
counters/values have the expected scalar types. Tampered mapping
JSON now fails loudly at load time instead of silently corrupting
the Mapping.
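To make the four checks concrete, here is a rough sketch of load-time validation. The payload layout (`counters` plus a list of `entries` with `token`, `type`, `value` keys), the regular expression, and the function name are all assumptions made for illustration; the commit message does not show the real schema or the real `Mapping.from_dict` code.

```python
import re

# Assumed token grammar: "[" TYPE "_" digits "]", e.g. "[EMAIL_001]".
TOKEN_RE = re.compile(r"\[([A-Z][A-Z0-9_]*)_(\d+)\]")


def validate_mapping_payload(payload: dict) -> None:
    """Raise ValueError if the (hypothetical) payload is inconsistent."""
    counters = payload.get("counters")
    entries = payload.get("entries")
    if not isinstance(counters, dict) or not isinstance(entries, list):
        raise ValueError("expected 'counters' (dict) and 'entries' (list)")

    for pii_type, count in counters.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if (
            not isinstance(pii_type, str)
            or isinstance(count, bool)
            or not isinstance(count, int)
        ):
            raise ValueError(f"bad counter entry for {pii_type!r}")

    for entry in entries:
        if not isinstance(entry, dict):
            raise ValueError("each entry must be an object")
        token = entry.get("token")
        pii_type = entry.get("type")
        value = entry.get("value")
        if not all(isinstance(x, str) for x in (token, pii_type, value)):
            raise ValueError("token, type and value must be strings")
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"malformed token {token!r}")
        if match.group(1) != pii_type:
            raise ValueError(f"token {token!r} does not match type {pii_type!r}")
        # The counter must cover the highest number issued for this type.
        if int(match.group(2)) > counters.get(pii_type, 0):
            raise ValueError(f"counter for {pii_type!r} is below {token!r}")


good = {
    "counters": {"EMAIL": 1},
    "entries": [{"token": "[EMAIL_001]", "type": "EMAIL", "value": "a@example.com"}],
}
validate_mapping_payload(good)  # passes silently
```

Raising at load time means a tampered file is rejected before any token is reissued or resolved against it.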

Shield gains:
 - reset(): drop the accumulated mapping between unrelated documents
   or users to prevent cross-document token leakage on round-trip.
 - max_input_bytes constructor option: refuses inputs whose UTF-8
   byte length exceeds the cap. Default unbounded; recommended for
   pipelines that ingest untrusted text.
 - Docstring covers thread-safety and the cross-document leakage
   class.

Anonymizer constructor now rejects:
 - Duplicate detector names (would silently overwrite the priority
   map and break overlap-resolution determinism).
 - Strategy values other than TOKEN (the only implemented strategy
   in v0.1; passing anything else previously was a silent no-op).

Detector base class enforces pii_type and name presence at class
definition via __init_subclass__ instead of failing at first
detect() call.
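A minimal sketch of definition-time enforcement with `__init_subclass__` follows. The base class body and the subclass names are assumptions; only the attribute names `pii_type` and `name` are taken from the commit message.

```python
class Detector:
    """Hypothetical base class; only the enforcement hook is shown."""

    pii_type: str
    name: str

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        for attr in ("pii_type", "name"):
            value = getattr(cls, attr, None)
            if not isinstance(value, str) or not value:
                raise TypeError(
                    f"{cls.__name__} must define a non-empty class attribute {attr!r}"
                )


class EmailDetector(Detector):
    pii_type = "EMAIL"
    name = "email"


try:
    class BrokenDetector(Detector):  # forgets to set name
        pii_type = "PHONE"
except TypeError as exc:
    print(exc)  # BrokenDetector must define a non-empty class attribute 'name'
```

The error surfaces when the module defining the subclass is imported, rather than when the first document is processed. One trade-off of this simple version is that it also rejects intermediate abstract subclasses that leave the attributes unset.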

CLI:
 - anonymize/deanonymize gain --force; refuse to overwrite existing
   output and mapping files otherwise.
 - All three subcommands gain --max-bytes (default 64 MiB) to
   refuse pathologically large stdin/file inputs.
 - detect --format is now case-insensitive.
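The two file-handling guards behind `--force` and `--max-bytes` could look roughly like this. The helper names are hypothetical and the sketch uses only the standard library; it is not the project's `cli.py`.

```python
from pathlib import Path

DEFAULT_MAX_BYTES = 64 * 1024 * 1024  # 64 MiB, the default named above


def read_capped(path: Path, max_bytes: int = DEFAULT_MAX_BYTES) -> bytes:
    """Read a file, refusing anything larger than max_bytes."""
    with path.open("rb") as fh:
        # Reading one byte past the cap reveals an oversized input
        # without loading the whole file.
        data = fh.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"input exceeds {max_bytes} bytes")
    return data


def write_guarded(path: Path, data: bytes, force: bool = False) -> None:
    """Write data, refusing to clobber an existing file unless force is set."""
    # Mode "x" raises FileExistsError if the path already exists.
    mode = "wb" if force else "xb"
    with path.open(mode) as fh:
        fh.write(data)
```

Opening with exclusive-create mode avoids a separate existence check followed by a write, so there is no window in which another process could create the file between the two steps.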

Performance polish:
 - Mapping uses __slots__.
 - Mapping.token_for uses an f-string instead of str.format.
 - Anonymizer caches the priority dict in __init__ instead of
   rebuilding it on every _resolve_overlaps call.

importlib.metadata.version() in __init__.py now falls back to a
"0.0.0+local" sentinel if the package metadata is missing, so
import works when the source tree is read via PYTHONPATH without
an editable install.
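The fallback pattern described above is commonly written like this. The helper name and the distribution name passed to it are placeholders; only the `"0.0.0+local"` sentinel comes from the commit message.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0.0.0+local") -> str:
    """Return the installed version, or a sentinel if metadata is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Source tree on PYTHONPATH without an installed distribution.
        return fallback


# Placeholder distribution name, chosen so that no metadata exists for it.
print(resolve_version("no-such-distribution-xyz-12345"))  # 0.0.0+local
```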

Tests: +24 in tests/test_security_hardening.py covering all of the
above. Suite total 295 -> 319, all passing.

Bumps `pyproject.toml` 0.1.0 -> 0.2.0 and synchronises the user-facing
documentation with the algorithmic-perf and security/hardening changes
landed earlier on this branch.

CHANGELOG.md: detailed [0.2.0] entry under Added / Changed / Fixed,
plus a "Migration notes for 0.1.x -> 0.2.0" subsection covering the
two breaking changes (CLI now requires --force to overwrite outputs;
Mapping.from_dict now raises on tampered/malformed JSON).

README.md: status line updated to v0.2.0 with the test count and a
one-paragraph summary of the service-pack scope; quick-example section
mentions Shield.reset() and max_input_bytes; CLI section shows
--force; roadmap reflects v0.2.0 done and pushes NER to "next 0.x".

docs/quickstart.md: new sections covering Shield.reset() between
unrelated documents, max_input_bytes for untrusted-input pipelines,
strict Mapping.from_dict / from_json validation; CLI block updated
for --force and --max-bytes.

docs/limitations.md: threat model gains a "cross-document leakage on
round-trip via a long-lived Shield" item plus a "v0.2.0 hardening you
should opt into" subsection naming the three boundary controls.

examples/cli_usage.md: --force usage on anonymize and deanonymize,
new "Input-size cap" section covering --max-bytes.

examples/hardening.py: new runnable demo of Shield.reset() and
max_input_bytes, end-to-end with realistic inputs.

notebooks/quickstart.ipynb: replaces the "NER ships in v0.1.1" line
with a version-agnostic mention of "a later 0.x release"; inserts a
new "Reusing a Shield safely" section before the closing summary,
demonstrating reset() and max_input_bytes.

Local smoke tests: pytest 319/319 green; python -m build produces
clean sdist + wheel; twine check PASSED on both; pip install of the
wheel into a fresh venv reports __version__ == 0.2.0 and exercises
the new APIs cleanly.

Six ruff findings were tripped by edits in the previous three commits;
this fix-up brings the branch back to green on all four CI gates.

- E501 (line too long): wrapped the `--force` typer.Option in
  `cli.py` and shortened a long print() line in the new notebook
  cell.
- RUF023 (unsorted __slots__): re-sorted `Mapping.__slots__` to
  `("_counters", "_forward", "_reverse")`.
- RUF007: replaced `zip(matches, matches[1:], strict=False)` in
  `tests/test_anonymizer.py` with `itertools.pairwise(matches)`.
- I001 / F401 in `tests/test_security_hardening.py`: removed the
  unused `Detector` import and let ruff re-sort the import block.

Also applied `ruff format` across five files that had long-line or
other minor formatting drift; no code-behavior changes.

Verified locally:
  ruff check .             -> All checks passed
  ruff format --check .    -> 56 files already formatted
  mypy                     -> no issues found in 23 source files
  pytest -q                -> 319 passed
@Tatarinho Tatarinho merged commit c9bcd08 into main Apr 26, 2026
7 checks passed