This repository was archived by the owner on Apr 29, 2026. It is now read-only.
Merged
54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,60 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.2.0] - 2026-04-26

Service-pack release: a large algorithmic-perf fix and a security/hardening
sweep on the public API. Same library, same nine detectors, same checksums —
just much faster on large documents and stricter about untrusted inputs.

### Added

- `Shield.reset()`: discard the accumulated Mapping (counters and entries) without rebuilding the Shield. Use between unrelated documents or users to prevent cross-document token leakage on `deanonymize`. Detector list and `max_input_bytes` are preserved.
- `Shield(max_input_bytes=...)` constructor option: refuses inputs whose UTF-8 byte length exceeds the cap. Default unbounded; recommended for pipelines that ingest untrusted text since `Shield.anonymize` allocates O(n) memory in input size.
- CLI `--force` flag on `anonymize` and `deanonymize`: required to overwrite an existing output or mapping file. Without it the command refuses with a clear error instead of silently clobbering.
- CLI `--max-bytes` flag on every subcommand (default 64 MiB): refuses pathologically large stdin or file inputs without crashing the process.
- `Shield` docstring documents thread-safety and the cross-document leakage class.
- `tests/test_security_hardening.py`: 24 new tests covering `Mapping.from_dict` validation paths, `Anonymizer` constructor enforcement, `Shield` input-size guard and reset behavior, and `Detector.__init_subclass__` enforcement.
- `tests/test_overlap_property.py`: Hypothesis-driven property test asserting the new bisect-based overlap resolution is set-equivalent to the previous quadratic algorithm over arbitrary match sets.

### Changed

- `Anonymizer._resolve_overlaps` now uses a `bisect_left`-based neighbor check instead of a linear `any(...)` scan over `taken`. Total lookup cost across n candidate matches drops from O(n²) to O(n log n); each insertion into the sorted list is still O(n) because of element shifts. On a 100 KiB synthetic document with ~4900 candidate matches the median `Shield.anonymize()` latency drops from ~1700 ms to ~70 ms (≈25× faster); 1 MiB inputs that previously timed out the harness now complete in ~1.5 s. Output is byte-identical to the previous algorithm.
- `Mapping.from_dict` now validates every field at runtime: token shape (`[TYPE_NNN]`), token-prefix vs declared type, counter coverage of issued tokens, and the scalar types of values and counters. **Breaking** for callers that previously fed malformed JSON and relied on lenient acceptance — those calls now raise `ValueError`.
- `Anonymizer.__init__` now rejects:
  - Detector lists with duplicate `name` attributes (previously silently overwrote the priority dict and broke overlap-resolution determinism).
  - `Strategy` values other than `Strategy.TOKEN` (the only implemented strategy in v0.1; passing anything else previously was a silent no-op). The strategy is also stored on the instance now, ready for future `MASK` / `FAKE` dispatch.
- `Detector` base class now enforces `pii_type` and `name` presence at class-definition time via `__init_subclass__`. Subclasses missing either previously instantiated successfully and crashed on first `detect()` call.
- CLI `anonymize` / `deanonymize` now refuse to overwrite an existing output or mapping file unless `--force` is passed. **Breaking** for scripts that relied on auto-overwrite — add `--force` to preserve previous behavior.
- CLI `detect --format` is now case-insensitive (`JSON`, `Json`, `json` all accepted); previously only lowercase worked.
- `Mapping` now uses `__slots__` and `Mapping.token_for` uses an f-string instead of `str.format`. Internal performance polish; no API change.
- `Anonymizer` now caches the priority dict in `__init__` instead of rebuilding it on every `_resolve_overlaps` call. Internal; no API change.
- `__version__` (in `__init__.py`) now falls back to a `"0.0.0+local"` sentinel when `importlib.metadata.version("llm-safe-pl")` raises `PackageNotFoundError`. This keeps `import llm_safe_pl` working when the source tree is loaded via `PYTHONPATH` without an editable install — useful for development workflows and CI checkout-only steps.
- `examples/cli_usage.md` updated for the new `--force` and `--max-bytes` flags.
- `docs/quickstart.md`, `docs/limitations.md`, and `README.md` updated to mention the new `Shield.reset` and `max_input_bytes` capabilities and to call out the breaking CLI behavior.
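
The `bisect_left`-based neighbor check described in the `_resolve_overlaps` entry can be sketched roughly as follows. This is an illustration only, not the library's code: the `(start, end)` tuple representation, the function names, and the assumption that candidates arrive already ordered by priority with non-empty half-open spans are all hypothetical.

```python
from bisect import bisect_left


def resolve_overlaps(candidates):
    """Keep non-overlapping half-open (start, end) spans; earlier candidates win.

    Assumes `candidates` is already ordered by priority and start < end
    for every span (both are assumptions made for this sketch).
    """
    taken = []  # accepted spans, kept sorted by start and pairwise disjoint
    for start, end in candidates:
        i = bisect_left(taken, (start, end))
        # Accepted spans are sorted and disjoint, so only the two neighbours
        # around the insertion point can overlap the candidate.
        overlaps_left = i > 0 and taken[i - 1][1] > start
        overlaps_right = i < len(taken) and taken[i][0] < end
        if not (overlaps_left or overlaps_right):
            taken.insert(i, (start, end))  # O(log n) lookup, O(n) list shift
    return taken


def resolve_overlaps_quadratic(candidates):
    """Reference version: linear scan over every accepted span."""
    taken = []
    for start, end in candidates:
        if not any(s < end and start < e for s, e in taken):
            taken.append((start, end))
    return sorted(taken)
```

Comparing the two functions on arbitrary inputs is the same idea as the set-equivalence property test listed under Added, just without Hypothesis.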

### Fixed

- Removed silent failure modes when a custom detector subclass omitted required class variables (an error is now raised at class-definition time; see the `Detector.__init_subclass__` change above).
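
For readers unfamiliar with the mechanism, class-definition-time enforcement via `__init_subclass__` might look like the sketch below. The toy `Detector` class, the `_REQUIRED` tuple, and the choice of `TypeError` are assumptions for illustration; the changelog does not say which exception the library raises.

```python
class Detector:
    """Toy base class; only the class-definition-time check is illustrated."""

    _REQUIRED = ("pii_type", "name")  # hypothetical helper attribute

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [attr for attr in cls._REQUIRED if not hasattr(cls, attr)]
        if missing:
            # Exception type is an assumption; the real library may differ.
            raise TypeError(f"{cls.__name__} must define: {', '.join(missing)}")


class PeselDetector(Detector):
    pii_type = "PESEL"
    name = "pesel"


try:
    # The class statement itself fails; no instance is ever created.
    class BrokenDetector(Detector):
        name = "broken"  # pii_type is missing
except TypeError as exc:
    error = str(exc)
```

The point of the design is that the failure surfaces at import time of the module defining the subclass, rather than on the first `detect()` call.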

### Migration notes for 0.1.x → 0.2.0

The two changes that may surprise existing users:

1. **CLI overwrite now requires `--force`.** A cron job that runs
`llm-safe anonymize doc.txt -o out.txt -m map.json` daily will now fail on
the second run because `out.txt` already exists. Add `-f` / `--force`:
`llm-safe anonymize doc.txt -o out.txt -m map.json --force`.
2. **`Mapping.from_dict` now raises on malformed JSON** that previously
loaded leniently. If you persist mappings from one process and load them
in another, mappings produced by 0.1.0 still load cleanly in 0.2.0
(round-trip is preserved); only hand-crafted or tampered JSON triggers
the new errors.

If neither applies to you, 0.2.0 is a drop-in upgrade with a 25× speedup on
larger documents and the new `Shield.reset()` / `max_input_bytes` options
available when you want them.

## [0.1.0] - 2026-04-22

### Added
29 changes: 18 additions & 11 deletions README.md
@@ -8,7 +8,7 @@

Reversible PII anonymization for Polish documents, designed for LLM workflows.

> **Status: alpha (v0.1.0).** Core regex + checksum detection, anonymization, deanonymization, and the CLI are implemented and tested (280+ tests, ~99% coverage). The optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION is scheduled for v0.1.1. See [CHANGELOG.md](CHANGELOG.md) and [Roadmap](#roadmap).
> **Status: alpha (v0.2.0).** Core regex + checksum detection, anonymization, deanonymization, and the CLI are implemented and tested (319 tests, ~99% coverage). v0.2.0 is a service-pack release: ~25× faster `Shield.anonymize()` on documents with thousands of PII items, plus a security-hardening pass (strict `Mapping.from_dict` validation, `Shield(max_input_bytes=...)`, `Shield.reset()`, CLI `--force` / `--max-bytes`). The optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION is still scheduled for a later 0.x release. See [CHANGELOG.md](CHANGELOG.md) and [Roadmap](#roadmap).

---

@@ -64,7 +64,9 @@ restored = shield.deanonymize(result.text)

The same value always maps to the same token within a `Shield` instance, including across multiple `anonymize()` calls. Formatted identifiers (e.g. `526-000-12-46`) round-trip exactly — the dashes are preserved.

PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is part of Phase 6. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.
If you process unrelated documents (different users, different requests) through one Shield, call `shield.reset()` between them to drop the accumulated mapping and prevent cross-document token leakage. For pipelines that ingest untrusted text, pass `Shield(max_input_bytes=...)` to refuse oversized inputs at the boundary instead of letting them turn into an O(n) memory blowup.

PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is scheduled for a later 0.x release. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.

## Try it live in Colab

@@ -82,11 +84,15 @@ llm-safe detect document.txt --format text
# Anonymize: writes rewritten text and a reversible mapping
llm-safe anonymize document.txt -o anon.txt -m mapping.json

# Re-running on the same outputs requires --force (otherwise the CLI refuses
# to overwrite, since v0.2.0)
llm-safe anonymize document.txt -o anon.txt -m mapping.json --force

# Restore original values (prints to stdout, or use -o FILE)
llm-safe deanonymize anon.txt -m mapping.json
```

The CLI reads UTF-8 (with or without BOM) and UTF-16 (when a BOM is present), so files produced by PowerShell's default `>` redirection work without manual conversion. Output is always canonical UTF-8.
The CLI reads UTF-8 (with or without BOM) and UTF-16 (when a BOM is present), so files produced by PowerShell's default `>` redirection work without manual conversion. Output is always canonical UTF-8. Each subcommand also supports `--max-bytes` (default 64 MiB) to refuse pathologically large inputs.

## What's supported

@@ -155,14 +161,15 @@ The 80% coverage gate is enforced in `pyproject.toml`.

## Roadmap

- **Phase 0** — Scaffolding: packaging, CI, locked public API surface, tests green. **Done.**
- **Phase 1** — `models.py`: `Match`, `Mapping`, `AnonymizeResult`, `PIIType`. **Done.**
- **Phase 2** — Checksum validators: PESEL, NIP, REGON, Luhn, mod-97 IBAN. **Done.**
- **Phase 3** — Nine regex + checksum detectors. **Done.**
- **Phase 4** — `Anonymizer` / `Deanonymizer` with consistent tokens. **Done.**
- **Phase 5** — `Shield` facade + CLI subcommands. **Done.**
- **Phase 6** — Optional spaCy NER recognizer. *Next — planned for v0.1.1.*
- **v0.2.0+** — Faker-based fake substitution, PDF/DOCX parsing, broader IBAN detector scope.
- **Phase 0** — Scaffolding: packaging, CI, locked public API surface, tests green. **Done in v0.1.0.**
- **Phase 1** — `models.py`: `Match`, `Mapping`, `AnonymizeResult`, `PIIType`. **Done in v0.1.0.**
- **Phase 2** — Checksum validators: PESEL, NIP, REGON, Luhn, mod-97 IBAN. **Done in v0.1.0.**
- **Phase 3** — Nine regex + checksum detectors. **Done in v0.1.0.**
- **Phase 4** — `Anonymizer` / `Deanonymizer` with consistent tokens. **Done in v0.1.0.**
- **Phase 5** — `Shield` facade + CLI subcommands. **Done in v0.1.0.**
- **v0.2.0** — Algorithmic perf fix (`Shield.anonymize()` ~25× faster on large docs), security-hardening pass (`Mapping.from_dict` strict validation, `Shield.reset()`, `Shield(max_input_bytes=...)`, CLI `--force` / `--max-bytes`). **Done.** See [CHANGELOG.md](CHANGELOG.md).
- **Next 0.x** — Optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION via `pip install "llm-safe-pl[ner]"`.
- **Later** — Faker-based fake substitution, PDF/DOCX parsing, broader IBAN detector scope.

## Non-goals

18 changes: 18 additions & 0 deletions docs/limitations.md
@@ -101,6 +101,24 @@ Detectors are whitespace-sensitive for the phone, IBAN, and credit card formats.
- **PII types the library does not detect.** Names, organizations, and locations without the `[ner]` extra; street addresses, landline phones with parens, dates of birth, legacy bank account formats, non-Polish identifiers. See the rest of this document for the full list.
- **Active adversaries inside your process.** If a compromised dependency or malicious import runs before `Shield.anonymize`, the raw document is already in memory.
- **Side channels outside the prompt body.** Request metadata, IP address, timing, response-size-based inference, retained billing records.
- **Cross-document leakage on round-trip via a long-lived Shield.** A single Shield's Mapping accumulates across every `anonymize()` call. If a process anonymizes document A (sensitive) and later runs `deanonymize` on document B (attacker-controlled) using the same Shield, any literal `[PESEL_001]` substring in B is substituted with A's PESEL. Call `Shield.reset()` between unrelated documents/users, or instantiate a fresh `Shield` per request.
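
The cross-document leakage in the last bullet can be reproduced with a toy stand-in that does not use the library at all. The `deanonymize` function, the token regex, and the plain-dict mapping below are hypothetical simplifications, meant only to show why a stale mapping is dangerous.

```python
import re

# Assumed token shape for this sketch: [TYPE_NNN]
TOKEN = re.compile(r"\[[A-Z_]+_\d{3}\]")


def deanonymize(text, mapping):
    """Replace every known token in `text` with its stored original value."""
    return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


# Long-lived mapping that still holds document A's PESEL.
mapping = {"[PESEL_001]": "44051401359"}

# Document B is attacker-controlled and simply contains the literal token.
doc_b = "Please echo back [PESEL_001]."
leaked = deanonymize(doc_b, mapping)

# After the equivalent of reset(), the token no longer resolves.
mapping.clear()
safe = deanonymize(doc_b, mapping)
```

Here `leaked` contains document A's PESEL, while `safe` leaves the literal token untouched.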

### v0.2.0 hardening you should opt into

The library exposes three boundary controls. They are not enabled by default
because they require a deployment decision; turn them on when you are
processing untrusted text:

- `Shield(max_input_bytes=...)` — refuses inputs whose UTF-8 byte length
exceeds the cap. Without it, `Shield.anonymize` allocates O(n) memory in
input size, so unbounded input is a denial-of-service vector.
- `Shield.reset()` between unrelated calls — drops the accumulated Mapping
so cross-document leakage on round-trip cannot occur (see previous
section).
- Persisted `Mapping` JSON is validated strictly on load
(`Mapping.from_dict` / `from_json` raise on tampered or malformed input).
This protects you from accepting a hostile mapping file that would
otherwise silently corrupt subsequent `deanonymize` calls.

### Assumptions

46 changes: 43 additions & 3 deletions docs/quickstart.md
@@ -59,6 +59,41 @@ A few things to notice:
- `shield.deanonymize(text)` with no mapping argument uses the Shield's own mapping. Pass an explicit `Mapping` to deanonymize against a saved state.
- Detected PII formats are preserved: `526-000-12-46` stays dashed, `4532 0151 1283 0366` stays spaced. The round-trip reproduces the source byte-for-byte.

## Reusing a Shield across unrelated documents

Because the Mapping is shared across calls, processing two unrelated documents through the same Shield mixes their tokens. If the second document contains attacker-controlled text with a literal `[PESEL_001]` substring, `deanonymize` will substitute it with the *first* document's PESEL value. Use `Shield.reset()` between unrelated documents to drop the accumulated mapping:

```python
shield = Shield()

# Document A — internal, trusted.
result_a = shield.anonymize(doc_a)
restored_a = shield.deanonymize(llm_response_a)

# Discard A's tokens before touching B.
shield.reset()

# Document B — could be untrusted.
result_b = shield.anonymize(doc_b)
restored_b = shield.deanonymize(llm_response_b)
```

`reset()` keeps the detector list and any `max_input_bytes` setting; it only drops the Mapping. This is equivalent to instantiating a fresh `Shield()`, but cheaper if you have a custom detector list.

## Guarding against oversized input

`Shield.anonymize` allocates O(n) memory in input size. For pipelines that ingest untrusted text, set `max_input_bytes` to refuse oversized inputs at the boundary instead of letting them OOM the process:

```python
# Refuse anything over 1 MiB.
shield = Shield(max_input_bytes=1024 * 1024)

shield.anonymize(very_large_text)
# ValueError: input is 5242880 bytes; max_input_bytes=1048576
```

Default is unbounded. Set it whenever the upstream caller can't be trusted.

## CLI

Everything the Python API does is also available from a shell:
@@ -70,11 +105,14 @@ llm-safe detect document.txt
# Anonymize; writes two files.
llm-safe anonymize document.txt -o anon.txt -m mapping.json

# Re-running on the same outputs requires --force (since v0.2.0).
llm-safe anonymize document.txt -o anon.txt -m mapping.json --force

# Restore originals.
llm-safe deanonymize anon.txt -m mapping.json -o restored.txt
```

See [`cli_usage.md`](../examples/cli_usage.md) for more.
Each subcommand also supports `--max-bytes` (default 64 MiB) to refuse oversized stdin or file inputs. See [`cli_usage.md`](../examples/cli_usage.md) for more.

## Saving and loading mappings

@@ -95,13 +133,15 @@ shield = Shield(mapping=loaded)
# Any anonymize() call will reuse tokens already allocated in `loaded`.
```

`Mapping.from_dict` / `from_json` validate every field at load time: token shape, type-prefix consistency, and counter coverage of issued tokens. Tampered or hand-edited mapping JSON raises `ValueError` rather than loading silently. Mappings produced by `Mapping.to_json` always round-trip cleanly.
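
To make the three checks concrete, here is a rough sketch of what such load-time validation could look like. The JSON layout (`entries` / `counters` keys, per-entry `type` and `value`) and the function name `validate_mapping` are invented for illustration; the real serialized format and implementation may differ.

```python
import re

# Documented token shape: [TYPE_NNN]
TOKEN_RE = re.compile(r"\[([A-Z_]+)_(\d{3})\]")


def validate_mapping(data):
    """Raise ValueError if the (hypothetical) mapping layout is inconsistent."""
    entries = data.get("entries")
    counters = data.get("counters")
    if not isinstance(entries, dict) or not isinstance(counters, dict):
        raise ValueError("entries and counters must be objects")
    for pii_type, count in counters.items():
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(count, int) or isinstance(count, bool) or count < 0:
            raise ValueError(f"bad counter for {pii_type!r}")
    for token, entry in entries.items():
        match = TOKEN_RE.fullmatch(token) if isinstance(token, str) else None
        if match is None:
            raise ValueError(f"bad token shape: {token!r}")
        if not isinstance(entry, dict) or not isinstance(entry.get("value"), str):
            raise ValueError(f"bad value for {token!r}")
        prefix, number = match.group(1), int(match.group(2))
        if entry.get("type") != prefix:
            raise ValueError(f"type mismatch for {token!r}")
        # Counter coverage: token N must not exceed the issued count.
        if number < 1 or number > counters.get(prefix, 0):
            raise ValueError(f"counter does not cover {token!r}")
    return data
```

A mapping whose token prefix disagrees with its declared type, or whose counter is lower than an issued token number, is rejected before any `deanonymize` call can use it.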

## What Shield detects

- PESEL, NIP, REGON (Polish government IDs, all checksum-validated)
- Polish ID card (dowód osobisty), passport (regex-only for v0.1)
- Polish ID card (dowód osobisty), passport (regex-only)
- Phone, email, PL IBAN, credit card (Luhn-validated, 13-19 digits)

Person, organization, and location names require the optional `[ner]` extra — planned for v0.1.1.
Person, organization, and location names require the optional `[ner]` extra — scheduled for a later 0.x release.

## Next steps

25 changes: 25 additions & 0 deletions examples/cli_usage.md
@@ -46,6 +46,17 @@ cat mapping.json

Now it's safe to send `anonymized.txt` to any LLM API.

Re-running on the same outputs requires `--force` (since v0.2.0). The CLI refuses to silently overwrite an existing `-o` or `-m` file:

```bash
llm-safe anonymize document.txt -o anonymized.txt -m mapping.json
# Usage: llm-safe anonymize ...
# Error: anonymized.txt exists; pass --force to overwrite

llm-safe anonymize document.txt -o anonymized.txt -m mapping.json --force
# (overwrites both)
```

## Deanonymize

Restore original values using a mapping produced by `anonymize`.
@@ -56,6 +67,9 @@ llm-safe deanonymize anonymized.txt -m mapping.json

# To a file
llm-safe deanonymize anonymized.txt -m mapping.json -o restored.txt

# --force is required to overwrite an existing output file (since v0.2.0)
llm-safe deanonymize anonymized.txt -m mapping.json -o restored.txt --force
```

## End-to-end round-trip in one shell
@@ -96,6 +110,17 @@ The CLI accepts UTF-8 (with or without BOM) and UTF-16 LE/BE when a BOM is prese

Output is always canonical UTF-8 without BOM.
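
BOM-based input decoding of this kind can be approximated with the standard library alone. The sketch below is an assumption about how one might do it, not the CLI's actual code; `decode_input` is a hypothetical name.

```python
import codecs


def decode_input(raw: bytes) -> str:
    """Decode bytes as UTF-16 when a UTF-16 BOM is present, else as UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and picks the byte order from it.
        return raw.decode("utf-16")
    # "utf-8-sig" strips a leading UTF-8 BOM if there is one.
    return raw.decode("utf-8-sig")


def encode_output(text: str) -> bytes:
    """Encode as plain UTF-8; str.encode("utf-8") never writes a BOM."""
    return text.encode("utf-8")
```

Without a BOM, UTF-16 input is deliberately not guessed at: it falls through to the UTF-8 path and will typically fail to decode rather than produce mojibake.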

## Input-size cap

Every subcommand supports `--max-bytes` (default 64 MiB). Inputs larger than that are refused with a clear error instead of being slurped into memory. Useful when piping from an untrusted source:

```bash
# Refuse anything over 1 MiB.
some_user_program | llm-safe anonymize - -o out.txt -m map.json --max-bytes $((1024 * 1024))
```

Set it lower than the default if you know your real inputs are bounded. Raising it above 64 MiB is allowed, but the host's RAM then becomes the only ceiling.
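
One common way to enforce such a cap without first reading the whole input is to request one byte more than the limit. The following is a sketch of that technique under assumed names (`read_capped`, `DEFAULT_MAX_BYTES`); it is not taken from the CLI source.

```python
import io

DEFAULT_MAX_BYTES = 64 * 1024 * 1024  # mirrors the documented 64 MiB default


def read_capped(stream, max_bytes=DEFAULT_MAX_BYTES):
    """Read from a buffered binary stream, refusing anything over `max_bytes`."""
    # Asking for one extra byte distinguishes "exactly at the cap" from "over",
    # while never holding more than max_bytes + 1 bytes in memory.
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"input exceeds the {max_bytes}-byte cap")
    return data


# Usage with an in-memory stream; a CLI would pass sys.stdin.buffer instead.
ok = read_capped(io.BytesIO(b"x" * 10), max_bytes=10)
try:
    read_capped(io.BytesIO(b"x" * 11), max_bytes=10)
    rejected = False
except ValueError:
    rejected = True
```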

## Help

```bash
57 changes: 57 additions & 0 deletions examples/hardening.py
@@ -0,0 +1,57 @@
"""Hardening features added in v0.2.0: Shield.reset() and max_input_bytes.

Two short demos, each independent of the other.

Run: python examples/hardening.py
"""

from llm_safe_pl import Shield


def demo_reset() -> None:
    """Reset the accumulated mapping between unrelated documents."""
    print("--- Demo 1: Shield.reset() ---")
    shield = Shield()

    # Document A — sensitive, internal.
    doc_a = "Klient: PESEL 44051401359."
    result_a = shield.anonymize(doc_a)
    print(f"After document A: mapping has {len(shield.mapping)} entry/entries.")
    print(f" text: {result_a.text}")

    # Without reset(), document A's tokens persist into the next call.
    # If document B happens to contain a literal '[PESEL_001]' (e.g. an LLM
    # response that the caller forgot to validate), `deanonymize` would
    # substitute it with A's PESEL.
    shield.reset()
    print(f"After reset(): mapping has {len(shield.mapping)} entry/entries.")

    # Document B — different user, different request.
    doc_b = "Inny klient: PESEL 92010100003."
    result_b = shield.anonymize(doc_b)
    print(f"After document B: mapping has {len(shield.mapping)} entry/entries.")
    print(f" text: {result_b.text}")
    print()


def demo_max_input_bytes() -> None:
    """Refuse oversized input at the boundary."""
    print("--- Demo 2: max_input_bytes ---")
    # Cap at 100 bytes for demonstration; realistic values are MiB-scale.
    shield = Shield(max_input_bytes=100)

    small = "PESEL 44051401359 — fits."
    print(f"Small input ({len(small.encode('utf-8'))} bytes): accepted.")
    shield.anonymize(small)

    big = "x" * 200
    print(f"Big input ({len(big.encode('utf-8'))} bytes): rejected.")
    try:
        shield.anonymize(big)
    except ValueError as exc:
        print(f" ValueError: {exc}")


if __name__ == "__main__":
    demo_reset()
    demo_max_input_bytes()