Tatarinho · Apr 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,60 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.2.0] - 2026-04-26
+
+Service-pack release: a large algorithmic-perf fix and a security/hardening
+sweep on the public API. Same library, same nine detectors, same checksums —
+just much faster on large documents and stricter about untrusted inputs.
+
+### Added
+
+- `Shield.reset()`: discard the accumulated Mapping (counters and entries) without rebuilding the Shield. Use between unrelated documents or users to prevent cross-document token leakage on `deanonymize`. Detector list and `max_input_bytes` are preserved.
+- `Shield(max_input_bytes=...)` constructor option: refuses inputs whose UTF-8 byte length exceeds the cap. Default unbounded; recommended for pipelines that ingest untrusted text since `Shield.anonymize` allocates O(n) memory in input size.
+- CLI `--force` flag on `anonymize` and `deanonymize`: required to overwrite an existing output or mapping file. Without it the command refuses with a clear error instead of silently clobbering.
+- CLI `--max-bytes` flag on every subcommand (default 64 MiB): refuses pathologically large stdin or file inputs without crashing the process.
+- `Shield` docstring documents thread-safety and the cross-document leakage class.
+- `tests/test_security_hardening.py`: 24 new tests covering `Mapping.from_dict` validation paths, `Anonymizer` constructor enforcement, `Shield` input-size guard and reset behavior, and `Detector.__init_subclass__` enforcement.
+- `tests/test_overlap_property.py`: Hypothesis-driven property test asserting the new bisect-based overlap resolution is set-equivalent to the previous quadratic algorithm over arbitrary match sets.
+
+### Changed
+
+- `Anonymizer._resolve_overlaps` now uses a `bisect_left`-based neighbor check instead of a linear `any(...)` scan over `taken`. Total lookup cost drops from O(n²) to O(n log n) in the worst case; each insertion into `taken` is still O(n) because of list shifts. On a 100 KiB synthetic document with ~4900 candidate matches, the median `Shield.anonymize()` latency drops from ~1700 ms to ~70 ms (≈25× faster); 1 MiB inputs that previously timed out in the harness now complete in ~1.5 s. Output is byte-identical to the previous algorithm.
+- `Mapping.from_dict` now validates every field at runtime: token shape (`[TYPE_NNN]`), token-prefix vs declared type, counter coverage of issued tokens, and the scalar types of values and counters. **Breaking** for callers that previously fed malformed JSON and relied on lenient acceptance — those calls now raise `ValueError`.
+- `Anonymizer.__init__` now rejects:
+  - Detector lists with duplicate `name` attributes (previously silently overwrote the priority dict and broke overlap-resolution determinism).
+  - `Strategy` values other than `Strategy.TOKEN` (the only implemented strategy in v0.1; passing anything else previously was a silent no-op). The strategy is also stored on the instance now, ready for future `MASK` / `FAKE` dispatch.
+- `Detector` base class now enforces `pii_type` and `name` presence at class-definition time via `__init_subclass__`. Subclasses missing either previously instantiated successfully and crashed on first `detect()` call.
+- CLI `anonymize` / `deanonymize` now refuse to overwrite an existing output or mapping file unless `--force` is passed. **Breaking** for scripts that relied on auto-overwrite — add `--force` to preserve previous behavior.
+- CLI `detect --format` is now case-insensitive (`JSON`, `Json`, `json` all accepted); previously only lowercase worked.
+- `Mapping` now uses `__slots__` and `Mapping.token_for` uses an f-string instead of `str.format`. Internal performance polish; no API change.
+- `Anonymizer` now caches the priority dict in `__init__` instead of rebuilding it on every `_resolve_overlaps` call. Internal; no API change.
+- `__version__` (in `__init__.py`) now falls back to a `"0.0.0+local"` sentinel when `importlib.metadata.version("llm-safe-pl")` raises `PackageNotFoundError`. This keeps `import llm_safe_pl` working when the source tree is loaded via `PYTHONPATH` without an editable install — useful for development workflows and CI checkout-only steps.
+- `examples/cli_usage.md` updated for the new `--force` and `--max-bytes` flags.
+- `docs/quickstart.md`, `docs/limitations.md`, and `README.md` updated to mention the new `Shield.reset` and `max_input_bytes` capabilities and to call out the breaking CLI behavior.
+
+### Fixed
+
+- Removed silent failure modes when a custom detector subclass omitted required class variables (now raised at class-definition time, see `Detector.__init_subclass__` change above).
+
+### Migration notes for 0.1.x → 0.2.0
+
+The two changes that may surprise existing users:
+
+1. **CLI overwrite now requires `--force`.** A cron job that runs
+   `llm-safe anonymize doc.txt -o out.txt -m map.json` daily will now fail on
+   the second run because `out.txt` already exists. Add `-f` / `--force`:
+   `llm-safe anonymize doc.txt -o out.txt -m map.json --force`.
+2. **`Mapping.from_dict` now raises on malformed JSON** that previously
+   loaded leniently. If you persist mappings from one process and load them
+   in another, mappings produced by 0.1.0 still load cleanly in 0.2.0
+   (round-trip is preserved); only hand-crafted or tampered JSON triggers
+   the new errors.
+
+If neither applies to you, 0.2.0 is a drop-in upgrade with a ~25× speedup on
+larger documents and the new `Shield.reset()` / `max_input_bytes` options
+available when you want them.
+
 ## [0.1.0] - 2026-04-22
 
 ### Added
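
Note on the `_resolve_overlaps` entry in the CHANGELOG diff above: the sketch below illustrates why a `bisect_left` neighbor check can replace a linear scan. It is a minimal sketch under assumptions, not the library's code: matches are modeled as non-empty half-open `(start, end)` tuples that arrive already ordered by priority, and the names `resolve_quadratic` and `resolve_bisect` are hypothetical.

```python
from bisect import bisect_left

Span = tuple[int, int]  # hypothetical stand-in for a match: half-open [start, end)


def resolve_quadratic(candidates: list[Span]) -> list[Span]:
    """Reference version: linear any(...) scan over everything accepted so far."""
    taken: list[Span] = []
    for start, end in candidates:
        if not any(start < t_end and t_start < end for t_start, t_end in taken):
            taken.append((start, end))
    return sorted(taken)


def resolve_bisect(candidates: list[Span]) -> list[Span]:
    """Keep `taken` sorted and disjoint, so only two neighbors can overlap."""
    taken: list[Span] = []
    for start, end in candidates:
        i = bisect_left(taken, (start, end))
        # Left neighbor starts at or before `start`; it overlaps if it ends after `start`.
        if i > 0 and taken[i - 1][1] > start:
            continue
        # Right neighbor starts at or after `start`; it overlaps if it starts before `end`.
        if i < len(taken) and taken[i][0] < end:
            continue
        taken.insert(i, (start, end))  # O(log n) lookup, O(n) list shift
    return taken


# Candidates in (assumed) priority order; earlier entries win on overlap.
candidates = [(0, 5), (3, 8), (5, 9), (10, 12), (11, 13), (9, 10)]
result = resolve_bisect(candidates)
```

Because accepted spans never overlap each other, a new candidate can only collide with the span immediately before or after its insertion point, which is what makes the two-neighbor check sufficient.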

diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@
 
 Reversible PII anonymization for Polish documents, designed for LLM workflows.
 
-> **Status: alpha (v0.1.0).** Core regex + checksum detection, anonymization, deanonymization, and the CLI are implemented and tested (280+ tests, ~99% coverage). The optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION is scheduled for v0.1.1. See [CHANGELOG.md](CHANGELOG.md) and [Roadmap](#roadmap).
+> **Status: alpha (v0.2.0).** Core regex + checksum detection, anonymization, deanonymization, and the CLI are implemented and tested (319 tests, ~99% coverage). v0.2.0 is a service-pack release: ~25× faster `Shield.anonymize()` on documents with thousands of PII items, plus a security-hardening pass (strict `Mapping.from_dict` validation, `Shield(max_input_bytes=...)`, `Shield.reset()`, CLI `--force` / `--max-bytes`). The optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION is still scheduled for a later 0.x release. See [CHANGELOG.md](CHANGELOG.md) and [Roadmap](#roadmap).
 
 ---
 
@@ -64,7 +64,9 @@ restored = shield.deanonymize(result.text)
 
 The same value always maps to the same token within a `Shield` instance, including across multiple `anonymize()` calls. Formatted identifiers (e.g. `526-000-12-46`) round-trip exactly — the dashes are preserved.
 
-PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is part of Phase 6. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.
+If you process unrelated documents (different users, different requests) through one `Shield`, call `shield.reset()` between them to drop the accumulated mapping and prevent cross-document token leakage. For pipelines that ingest untrusted text, pass `Shield(max_input_bytes=...)` to refuse oversized inputs at the boundary, since `Shield.anonymize` allocates memory proportional to input size.
+
+PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is scheduled for a later 0.x release. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.
 
 ## Try it live in Colab
 
@@ -82,11 +84,15 @@ llm-safe detect document.txt --format text
 # Anonymize: writes rewritten text and a reversible mapping
 llm-safe anonymize document.txt -o anon.txt -m mapping.json
 
+# Re-running on the same outputs requires --force (otherwise the CLI refuses
+# to overwrite, since v0.2.0)
+llm-safe anonymize document.txt -o anon.txt -m mapping.json --force
+
 # Restore original values (prints to stdout, or use -o FILE)
 llm-safe deanonymize anon.txt -m mapping.json
 ```
 
-The CLI reads UTF-8 (with or without BOM) and UTF-16 (when a BOM is present), so files produced by PowerShell's default `>` redirection work without manual conversion. Output is always canonical UTF-8.
+The CLI reads UTF-8 (with or without BOM) and UTF-16 (when a BOM is present), so files produced by PowerShell's default `>` redirection work without manual conversion. Output is always canonical UTF-8. Each subcommand also supports `--max-bytes` (default 64 MiB) to refuse pathologically large inputs.
 
 ## What's supported
 
@@ -155,14 +161,15 @@ The 80% coverage gate is enforced in `pyproject.toml`.
 
 ## Roadmap
 
-- **Phase 0** — Scaffolding: packaging, CI, locked public API surface, tests green. **Done.**
-- **Phase 1** — `models.py`: `Match`, `Mapping`, `AnonymizeResult`, `PIIType`. **Done.**
-- **Phase 2** — Checksum validators: PESEL, NIP, REGON, Luhn, mod-97 IBAN. **Done.**
-- **Phase 3** — Nine regex + checksum detectors. **Done.**
-- **Phase 4** — `Anonymizer` / `Deanonymizer` with consistent tokens. **Done.**
-- **Phase 5** — `Shield` facade + CLI subcommands. **Done.**
-- **Phase 6** — Optional spaCy NER recognizer. *Next — planned for v0.1.1.*
-- **v0.2.0+** — Faker-based fake substitution, PDF/DOCX parsing, broader IBAN detector scope.
+- **Phase 0** — Scaffolding: packaging, CI, locked public API surface, tests green. **Done in v0.1.0.**
+- **Phase 1** — `models.py`: `Match`, `Mapping`, `AnonymizeResult`, `PIIType`. **Done in v0.1.0.**
+- **Phase 2** — Checksum validators: PESEL, NIP, REGON, Luhn, mod-97 IBAN. **Done in v0.1.0.**
+- **Phase 3** — Nine regex + checksum detectors. **Done in v0.1.0.**
+- **Phase 4** — `Anonymizer` / `Deanonymizer` with consistent tokens. **Done in v0.1.0.**
+- **Phase 5** — `Shield` facade + CLI subcommands. **Done in v0.1.0.**
+- **v0.2.0** — Algorithmic perf fix (`Shield.anonymize()` ~25× faster on large docs), security-hardening pass (`Mapping.from_dict` strict validation, `Shield.reset()`, `Shield(max_input_bytes=...)`, CLI `--force` / `--max-bytes`). **Done.** See [CHANGELOG.md](CHANGELOG.md).
+- **Next 0.x** — Optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION via `pip install "llm-safe-pl[ner]"`.
+- **Later** — Faker-based fake substitution, PDF/DOCX parsing, broader IBAN detector scope.
 
 ## Non-goals
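
Note on the `--force` behavior described in the README diff above: one way a refuse-to-overwrite write can be sketched is with Python's exclusive-creation file mode. The helper `write_output` is hypothetical and only illustrates the idea; the diff does not show how the CLI actually implements the check.

```python
import tempfile
from pathlib import Path


def write_output(path: Path, text: str, force: bool = False) -> None:
    """Write `text` to `path`; refuse to replace an existing file unless `force`."""
    # Mode "x" raises FileExistsError if the file already exists,
    # so there is no separate exists-then-write step to race against.
    mode = "w" if force else "x"
    try:
        with open(path, mode, encoding="utf-8") as fh:
            fh.write(text)
    except FileExistsError:
        raise FileExistsError(f"{path.name} exists; pass --force to overwrite") from None


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.txt"
    write_output(target, "first run")           # file does not exist yet: succeeds
    try:
        write_output(target, "second run")      # file exists, no force: refused
        refused = False
    except FileExistsError:
        refused = True
    write_output(target, "forced run", force=True)  # explicit overwrite
    final_text = target.read_text(encoding="utf-8")
```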
 

diff --git a/docs/limitations.md b/docs/limitations.md
@@ -101,6 +101,24 @@ Detectors are whitespace-sensitive for the phone, IBAN, and credit card formats.
 - **PII types the library does not detect.** Names, organizations, and locations without the `[ner]` extra; street addresses, landline phones with parens, dates of birth, legacy bank account formats, non-Polish identifiers. See the rest of this document for the full list.
 - **Active adversaries inside your process.** If a compromised dependency or malicious import runs before `Shield.anonymize`, the raw document is already in memory.
 - **Side channels outside the prompt body.** Request metadata, IP address, timing, response-size-based inference, retained billing records.
+- **Cross-document leakage on round-trip via a long-lived Shield.** A single Shield's Mapping accumulates across every `anonymize()` call. If a process anonymizes document A (sensitive) and later runs `deanonymize` on document B (attacker-controlled) using the same Shield, any literal `[PESEL_001]` substring in B is substituted with A's PESEL. Call `Shield.reset()` between unrelated documents/users, or instantiate a fresh `Shield` per request.
+
+### v0.2.0 hardening you should opt into
+
+The library exposes three boundary controls. The first two are opt-in
+because they require a deployment decision; turn them on when you are
+processing untrusted text. The third is always on and needs no setup:
+
+- `Shield(max_input_bytes=...)` — refuses inputs whose UTF-8 byte length
+  exceeds the cap. Without it, `Shield.anonymize` allocates O(n) memory in
+  input size, so unbounded input is a denial-of-service vector.
+- `Shield.reset()` between unrelated calls — drops the accumulated Mapping
+  so cross-document leakage on round-trip cannot occur (see previous
+  section).
+- Persisted `Mapping` JSON is validated strictly on load
+  (`Mapping.from_dict` / `from_json` raise on tampered or malformed input).
+  This protects you from accepting a hostile mapping file that would
+  otherwise silently corrupt subsequent `deanonymize` calls.
 
 ### Assumptions
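
Note on the strict `Mapping` validation described in the limitations diff above: the checks named there (token shape, token prefix versus declared type, counter coverage, scalar types) could look roughly like the sketch below. The JSON layout (`entries` and `counters` keys, a `type` and `value` per entry), the regex, and the function names are all assumptions made for illustration; the real `Mapping.from_dict` schema is not shown in this diff.

```python
import re

TOKEN_RE = re.compile(r"\[([A-Z_]+)_(\d{3,})\]")  # assumed token shape: [TYPE_NNN]


def validate_mapping(data: dict) -> None:
    """Raise ValueError on malformed input. The dict layout here is hypothetical."""
    entries = data.get("entries")
    counters = data.get("counters")
    if not isinstance(entries, dict) or not isinstance(counters, dict):
        raise ValueError("entries and counters must be objects")
    for pii_type, count in counters.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            raise ValueError(f"counter for {pii_type!r} must be a non-negative int")
    for token, entry in entries.items():
        m = TOKEN_RE.fullmatch(token)
        if m is None:
            raise ValueError(f"bad token shape: {token!r}")
        if not isinstance(entry, dict) or not isinstance(entry.get("value"), str):
            raise ValueError(f"value for {token!r} must be a string")
        prefix, number = m.group(1), int(m.group(2))
        if entry.get("type") != prefix:
            raise ValueError(f"{token!r} does not match its declared type")
        # Counter coverage: a token numbered N requires a counter of at least N.
        if number < 1 or number > counters.get(prefix, 0):
            raise ValueError(f"{token!r} is not covered by the {prefix!r} counter")


def is_valid(data: dict) -> bool:
    """Convenience wrapper that turns the ValueError into a boolean."""
    try:
        validate_mapping(data)
    except ValueError:
        return False
    return True


good = {
    "entries": {"[PESEL_001]": {"type": "PESEL", "value": "44051401359"}},
    "counters": {"PESEL": 1},
}
tampered = {
    "entries": {"[PESEL_002]": {"type": "PESEL", "value": "44051401359"}},
    "counters": {"PESEL": 1},
}
```

Here `good` passes, while `tampered` is rejected because token number 2 exceeds the `PESEL` counter.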
 

diff --git a/docs/quickstart.md b/docs/quickstart.md
@@ -59,6 +59,41 @@ A few things to notice:
 - `shield.deanonymize(text)` with no mapping argument uses the Shield's own mapping. Pass an explicit `Mapping` to deanonymize against a saved state.
 - Detected PII formats are preserved: `526-000-12-46` stays dashed, `4532 0151 1283 0366` stays spaced. The round-trip reproduces the source byte-for-byte.
 
+## Reusing a Shield across unrelated documents
+
+Because the Mapping is shared across calls, processing two unrelated documents through the same Shield mixes their tokens. If the second document contains attacker-controlled text with a literal `[PESEL_001]` substring, `deanonymize` will substitute it with the *first* document's PESEL value. Use `Shield.reset()` between unrelated documents to drop the accumulated mapping:
+
+```python
+shield = Shield()
+
+# Document A — internal, trusted.
+result_a = shield.anonymize(doc_a)
+restored_a = shield.deanonymize(llm_response_a)
+
+# Discard A's tokens before touching B.
+shield.reset()
+
+# Document B — could be untrusted.
+result_b = shield.anonymize(doc_b)
+restored_b = shield.deanonymize(llm_response_b)
+```
+
+`reset()` keeps the detector list and any `max_input_bytes` setting; it only drops the Mapping. It is equivalent to constructing a fresh `Shield` with the same arguments, but cheaper if you have a custom detector list.
+
+## Guarding against oversized input
+
+`Shield.anonymize` allocates O(n) memory in input size. For pipelines that ingest untrusted text, set `max_input_bytes` to refuse oversized inputs at the boundary instead of letting them OOM the process:
+
+```python
+# Refuse anything over 1 MiB.
+shield = Shield(max_input_bytes=1024 * 1024)
+
+shield.anonymize(very_large_text)
+# ValueError: input is 5242880 bytes; max_input_bytes=1048576
+```
+
+Default is unbounded. Set it whenever the upstream caller can't be trusted.
+
 ## CLI
 
 Everything the Python API does is also available from a shell:
@@ -70,11 +105,14 @@ llm-safe detect document.txt
 # Anonymize; writes two files.
 llm-safe anonymize document.txt -o anon.txt -m mapping.json
 
+# Re-running on the same outputs requires --force (since v0.2.0).
+llm-safe anonymize document.txt -o anon.txt -m mapping.json --force
+
 # Restore originals.
 llm-safe deanonymize anon.txt -m mapping.json -o restored.txt
 ```
 
-See [`cli_usage.md`](../examples/cli_usage.md) for more.
+Each subcommand also supports `--max-bytes` (default 64 MiB) to refuse oversized stdin or file inputs. See [`cli_usage.md`](../examples/cli_usage.md) for more.
 
 ## Saving and loading mappings
 
@@ -95,13 +133,15 @@ shield = Shield(mapping=loaded)
 # Any anonymize() call will reuse tokens already allocated in `loaded`.
 ```
 
+`Mapping.from_dict` / `from_json` validate every field at load time: token shape, type-prefix consistency, and counter coverage of issued tokens. Tampered or hand-edited mapping JSON raises `ValueError` rather than loading silently. Mappings produced by `Mapping.to_json` always round-trip cleanly.
+
 ## What Shield detects
 
 - PESEL, NIP, REGON (Polish government IDs, all checksum-validated)
-- Polish ID card (dowód osobisty), passport (regex-only for v0.1)
+- Polish ID card (dowód osobisty), passport (regex-only)
 - Phone, email, PL IBAN, credit card (Luhn-validated, 13-19 digits)
 
-Person, organization, and location names require the optional `[ner]` extra — planned for v0.1.1.
+Person, organization, and location names require the optional `[ner]` extra — scheduled for a later 0.x release.
 
 ## Next steps
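
Note on `max_input_bytes` in the quickstart diff above: the cap applies to UTF-8 byte length, not character count, which matters for Polish text because most diacritics encode to two bytes. The sketch below shows the difference. `check_input_size` is a hypothetical helper, not the library's internal guard; only the error message format is taken from the quickstart example.

```python
from typing import Optional


def check_input_size(text: str, max_input_bytes: Optional[int]) -> None:
    """Raise ValueError if the UTF-8 encoding of `text` exceeds the cap."""
    if max_input_bytes is None:  # unbounded, matching the documented default
        return
    size = len(text.encode("utf-8"))
    if size > max_input_bytes:
        raise ValueError(f"input is {size} bytes; max_input_bytes={max_input_bytes}")


sample = "Zażółć gęślą jaźń"  # 17 characters, 26 bytes in UTF-8
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))

try:
    # A 20-byte cap would pass a character-count check but fails on bytes.
    check_input_size(sample, max_input_bytes=20)
    message = "accepted"
except ValueError as exc:
    message = str(exc)
```

When sizing the cap for Polish documents, budget in bytes rather than characters.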
 

diff --git a/examples/cli_usage.md b/examples/cli_usage.md
@@ -46,6 +46,17 @@ cat mapping.json
 
 Now it's safe to send `anonymized.txt` to any LLM API.
 
+Re-running on the same outputs requires `--force` (since v0.2.0). The CLI refuses to silently overwrite an existing `-o` or `-m` file:
+
+```bash
+llm-safe anonymize document.txt -o anonymized.txt -m mapping.json
+# Usage: llm-safe anonymize ...
+# Error: anonymized.txt exists; pass --force to overwrite
+
+llm-safe anonymize document.txt -o anonymized.txt -m mapping.json --force
+# (overwrites both)
+```
+
 ## Deanonymize
 
 Restore original values using a mapping produced by `anonymize`.
@@ -56,6 +67,9 @@ llm-safe deanonymize anonymized.txt -m mapping.json
 
 # To a file
 llm-safe deanonymize anonymized.txt -m mapping.json -o restored.txt
+
+# --force is required to overwrite an existing output file (since v0.2.0)
+llm-safe deanonymize anonymized.txt -m mapping.json -o restored.txt --force
 ```
 
 ## End-to-end round-trip in one shell
@@ -96,6 +110,17 @@ The CLI accepts UTF-8 (with or without BOM) and UTF-16 LE/BE when a BOM is prese
 
 Output is always canonical UTF-8 without BOM.
 
+## Input-size cap
+
+Every subcommand supports `--max-bytes` (default 64 MiB). Inputs larger than that are refused with a clear error instead of being slurped into memory. Useful when piping from an untrusted source:
+
+```bash
+# Refuse anything over 1 MiB.
+some_user_program | llm-safe anonymize - -o out.txt -m map.json --max-bytes $((1024 * 1024))
+```
+
+Set it lower than the default if you know your real inputs are bounded. Raising it above 64 MiB is allowed, but memory use then scales with whatever cap you choose, so size it against the host's RAM.
+
 ## Help
 
 ```bash

diff --git a/examples/hardening.py b/examples/hardening.py
@@ -0,0 +1,57 @@
+"""Hardening features added in v0.2.0: Shield.reset() and max_input_bytes.
+
+Two short demos, each independent of the other.
+
+Run: python examples/hardening.py
+"""
+
+from llm_safe_pl import Shield
+
+
+def demo_reset() -> None:
+    """Reset the accumulated mapping between unrelated documents."""
+    print("--- Demo 1: Shield.reset() ---")
+    shield = Shield()
+
+    # Document A — sensitive, internal.
+    doc_a = "Klient: PESEL 44051401359."
+    result_a = shield.anonymize(doc_a)
+    print(f"After document A: mapping has {len(shield.mapping)} entry/entries.")
+    print(f"  text: {result_a.text}")
+
+    # Without reset(), document A's tokens persist into the next call.
+    # If document B happens to contain a literal '[PESEL_001]' (e.g. an LLM
+    # response that the caller forgot to validate), `deanonymize` would
+    # substitute it with A's PESEL.
+    shield.reset()
+    print(f"After reset(): mapping has {len(shield.mapping)} entry/entries.")
+
+    # Document B — different user, different request.
+    doc_b = "Inny klient: PESEL 92010100003."
+    result_b = shield.anonymize(doc_b)
+    print(f"After document B: mapping has {len(shield.mapping)} entry/entries.")
+    print(f"  text: {result_b.text}")
+    print()
+
+
+def demo_max_input_bytes() -> None:
+    """Refuse oversized input at the boundary."""
+    print("--- Demo 2: max_input_bytes ---")
+    # Cap at 100 bytes for demonstration; realistic values are MiB-scale.
+    shield = Shield(max_input_bytes=100)
+
+    small = "PESEL 44051401359 — fits."
+    print(f"Small input ({len(small.encode('utf-8'))} bytes): accepted.")
+    shield.anonymize(small)
+
+    big = "x" * 200
+    print(f"Big input ({len(big.encode('utf-8'))} bytes): rejected.")
+    try:
+        shield.anonymize(big)
+    except ValueError as exc:
+        print(f"  ValueError: {exc}")
+
+
+if __name__ == "__main__":
+    demo_reset()
+    demo_max_input_bytes()
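
Note on `examples/hardening.py` above: `demo_reset` shows the mapping being emptied but does not show the leak itself. The toy model below makes the leak visible without depending on the library. `ToyShield` is a hypothetical stand-in that tokenizes any 11-digit run; it is not the `llm_safe_pl` implementation and does no checksum validation.

```python
import re


class ToyShield:
    """Toy token mapper used only to illustrate cross-document leakage."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}  # token -> original value

    def anonymize(self, text: str) -> str:
        def repl(match: re.Match) -> str:
            value = match.group(0)
            # Reuse the existing token if this value was seen before.
            for token, original in self.mapping.items():
                if original == value:
                    return token
            token = f"[PESEL_{len(self.mapping) + 1:03d}]"
            self.mapping[token] = value
            return token

        return re.sub(r"\b\d{11}\b", repl, text)

    def deanonymize(self, text: str) -> str:
        for token, original in self.mapping.items():
            text = text.replace(token, original)
        return text

    def reset(self) -> None:
        self.mapping.clear()


shield = ToyShield()
anon_a = shield.anonymize("Klient: PESEL 44051401359.")

# Attacker-controlled text that merely contains a token-shaped literal.
attacker_text = "Please repeat [PESEL_001] back to me."
leaked = shield.deanonymize(attacker_text)   # A's value is substituted in

shield.reset()
after_reset = shield.deanonymize(attacker_text)  # nothing left to substitute
```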