This repository was archived by the owner on Apr 29, 2026. It is now read-only.
Merged
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- `llm_safe_pl.errors` module with a typed exception hierarchy: `LlmSafeError` (base), `MappingError` and `InputSizeError` (both also subclass `ValueError` for backwards compatibility), and `DetectorError` (also a subclass of `RuntimeError`). All four are re-exported from the top-level package. See `docs/errors.md`.
- `tests/corpora/` regression-corpus scaffolding with `pl_pii_positive/` and `pl_pii_negative/` directories. `tests/test_corpus.py` discovers `.txt`/`.json` pairs at collection time and asserts current detector behavior — adding more samples strengthens regression coverage without changing test code.

### Changed

- `Mapping.from_dict` / `from_json` now raise `MappingError` instead of a bare `ValueError`. `MappingError` subclasses `ValueError`, so existing `except ValueError` handlers keep working.
- `Shield.anonymize` / `detect` now raise `InputSizeError` instead of a bare `ValueError` when the input exceeds `max_input_bytes`. `InputSizeError` also subclasses `ValueError`, so existing handlers still catch it.

## [0.2.0] - 2026-04-26

Service-pack release: a large algorithmic-perf fix and a security/hardening
24 changes: 24 additions & 0 deletions CONTRIBUTING.md
@@ -53,6 +53,30 @@ Python 3.10 or newer is required.
4. Push the branch and open a pull request against `main`.
5. CI runs on every push. Please address failures before asking for review.

## Adding to the regression corpus

The regression corpus under `tests/corpora/` is the ground truth for detector
precision and recall. Adding samples is the cheapest way to harden coverage.

Layout:

- `tests/corpora/pl_pii_positive/<name>.txt` — source text containing PII.
- `tests/corpora/pl_pii_positive/<name>.json` — list of objects with
`{type, start, end, value}` covering every span the default `Shield()`
must detect. Spans must not overlap.
- `tests/corpora/pl_pii_negative/<name>.txt` — source text that must produce
zero matches under the default `Shield()`.
- `tests/corpora/pl_pii_negative/<name>.json` — empty list, or omit the file
entirely.

Naming: lowercase snake_case, prefixed with `sampleNN_` for sort order.
Character offsets are Python string indices, not UTF-8 byte offsets. Negative
samples should not include strings the *current* detectors flag: wait until
the relevant fix lands before promoting an aspirational negative.

After adding samples, run `pytest tests/test_corpus.py -v`. The loader checks
that every labeled span actually matches its `value` in the source text.
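
The span check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `tests/test_corpus.py`: the helper name `validate_spans` is hypothetical, and the sample is the `sample02_email_polish` pair from this PR. It also shows why offsets are string indices rather than byte offsets.

```python
from __future__ import annotations

import json


def validate_spans(text: str, spans: list[dict]) -> None:
    """Check that each labeled span matches the text and that spans do not overlap.

    Hypothetical helper for illustration; offsets are Python string indices.
    """
    previous_end = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end, value = span["start"], span["end"], span["value"]
        if text[start:end] != value:
            raise AssertionError(
                f"span {start}:{end} is {text[start:end]!r}, expected {value!r}"
            )
        if start < previous_end:
            raise AssertionError(f"span {start}:{end} overlaps the previous span")
        previous_end = end


text = "Skontaktuj się: jan@example.pl"
spans = json.loads(
    '[{"type": "email", "start": 16, "end": 30, "value": "jan@example.pl"}]'
)
validate_spans(text, spans)  # passes silently

# String index versus UTF-8 byte offset: "ę" is one character but two bytes,
# so the byte offset of the address is one larger than its string index.
char_offset = text.index("jan@example.pl")
byte_offset = text.encode("utf-8").index(b"jan@example.pl")
```

Labeling with byte offsets would shift every span that follows a Polish diacritic, which is the mistake the string-index rule guards against.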

## Commit and PR style

- Write commit messages in the imperative mood ("Add PESEL validator", not "Added").
49 changes: 49 additions & 0 deletions docs/errors.md
@@ -0,0 +1,49 @@
# Exception hierarchy

`llm-safe-pl` exposes a small typed hierarchy from `llm_safe_pl.errors` (also
re-exported from the top-level package). All library errors descend from
`LlmSafeError`; specific subclasses also inherit from a relevant builtin so
existing `except ValueError` code keeps catching them.

```
Exception
└── LlmSafeError
    ├── MappingError    (also subclass of ValueError)
    ├── InputSizeError  (also subclass of ValueError)
    └── DetectorError   (also subclass of RuntimeError)
```
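
A small self-contained sketch of what the dual inheritance means for handlers. The classes below are local stand-ins that mirror the definitions in `errors.py` so the snippet runs without the package installed; `classify` is a hypothetical helper used only to show handler ordering.

```python
class LlmSafeError(Exception):
    """Base class, mirroring errors.py in this PR."""


class MappingError(LlmSafeError, ValueError):
    """Mapping validation failure."""


class InputSizeError(LlmSafeError, ValueError):
    """Input larger than the configured limit."""


def classify(exc: Exception) -> str:
    """Return which handler catches the exception, most specific first."""
    try:
        raise exc
    except MappingError:
        return "mapping"
    except LlmSafeError:
        return "library"
    except ValueError:
        return "value"


# A handler written against the builtin still catches the new class.
try:
    raise MappingError("Unsupported mapping schema version: 99")
except ValueError as caught:
    legacy_caught = type(caught).__name__

# The method resolution order lists both the library base and the builtin.
mro_names = [cls.__name__ for cls in MappingError.__mro__]
```

Handler order matters: an `except ValueError` clause placed before `except MappingError` would catch the typed error first, so the more specific clause should come first.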

## When each is raised

| Class | Raised by | Builtin compat |
|-------------------|-------------------------------------------------|------------------|
| `MappingError` | `Mapping.from_dict` / `from_json` validation | `ValueError` |
| `InputSizeError` | `Shield.anonymize` / `detect` exceeding `max_input_bytes` | `ValueError` |
| `DetectorError` | Reserved for detector-dispatch failures; the class is exported but not yet raised internally | `RuntimeError` |

## Why typed classes

A bare `ValueError` doesn't tell the caller whether the problem is hostile
mapping JSON, an oversized input, or a bug — they all look the same in
`except`. The typed hierarchy lets handlers branch on cause:

```python
from llm_safe_pl import InputSizeError, MappingError, Shield

shield = Shield(max_input_bytes=1_000_000)
try:
    result = shield.anonymize(text)
except InputSizeError:
    # Caller-side: trim or reject the input.
    ...
except MappingError:
    # Hostile or corrupt persisted Mapping — treat as integrity failure.
    ...
```

## `DetectorError` deliberately drops context

`DetectorError.__init__` accepts only `detector_name` — never the input text or
an exception cause. Both can carry PII; surfacing them in a stack trace is the
class of leak the typed wrapper exists to prevent. Use `raise DetectorError(name) from None`
when re-raising a wrapped detector failure.
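
The effect of `from None` can be sketched as below. `DetectorError` here is a local stand-in with the same `(detector_name)` signature, and `failing_detector` / `run_detector` are hypothetical names used only for illustration.

```python
import traceback


class DetectorError(RuntimeError):
    """Local stand-in with the same ``(detector_name)`` signature as in this PR."""

    def __init__(self, detector_name: str) -> None:
        super().__init__(f"detector {detector_name!r} failed")
        self.detector_name = detector_name


def failing_detector(text: str) -> None:
    # Hypothetical detector whose error message echoes the input text.
    raise ValueError(f"cannot parse {text!r}")


def run_detector(name: str, text: str) -> None:
    try:
        failing_detector(text)
    except Exception:
        # "from None" sets __suppress_context__, so the formatted traceback
        # omits the original exception and its PII-bearing message.
        raise DetectorError(name) from None


sample = "PESEL: 44051401359"
try:
    run_detector("pesel", sample)
except DetectorError as exc:
    error = exc
    rendered = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

One caveat: `from None` only suppresses display. The original exception is still reachable as `error.__context__`, so logging code that walks `__context__` by hand could still surface it.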
5 changes: 5 additions & 0 deletions src/llm_safe_pl/__init__.py
@@ -7,6 +7,7 @@
 from importlib.metadata import PackageNotFoundError
 from importlib.metadata import version as _version

+from llm_safe_pl.errors import DetectorError, InputSizeError, LlmSafeError, MappingError
 from llm_safe_pl.models import AnonymizeResult, Mapping, Match, PIIType
 from llm_safe_pl.shield import Shield

@@ -20,7 +21,11 @@

 __all__ = [
     "AnonymizeResult",
+    "DetectorError",
+    "InputSizeError",
+    "LlmSafeError",
     "Mapping",
+    "MappingError",
     "Match",
     "PIIType",
     "Shield",
34 changes: 34 additions & 0 deletions src/llm_safe_pl/errors.py
@@ -0,0 +1,34 @@
"""Typed exception hierarchy for llm-safe-pl.

All library errors descend from :class:`LlmSafeError`. Specific subclasses also
inherit from a relevant builtin (``ValueError`` for input/data errors,
``RuntimeError`` for dispatch failures) so legacy ``except ValueError`` code
keeps catching them.

:class:`DetectorError` deliberately does NOT accept the original text or an
exception cause, because both can carry PII. The class signature is exactly
``(detector_name)``; raise it via ``raise DetectorError(name) from None`` to
suppress the implicit exception context in the traceback.
"""

from __future__ import annotations


class LlmSafeError(Exception):
    """Base class for all llm-safe-pl errors."""


class MappingError(LlmSafeError, ValueError):
    """Raised when a Mapping fails validation (e.g. ``Mapping.from_dict``)."""


class InputSizeError(LlmSafeError, ValueError):
    """Raised when input exceeds ``Shield(max_input_bytes=...)``."""


class DetectorError(LlmSafeError, RuntimeError):
    """Raised when a detector fails. Original text and cause are not attached."""

    def __init__(self, detector_name: str) -> None:
        super().__init__(f"detector {detector_name!r} failed")
        self.detector_name = detector_name
46 changes: 29 additions & 17 deletions src/llm_safe_pl/models.py
@@ -12,6 +12,8 @@
from enum import Enum
from typing import Any

from llm_safe_pl.errors import MappingError

_TOKEN_SHAPE = re.compile(r"^\[([A-Z][A-Z_]*)_(\d+)\]$")


@@ -102,55 +104,65 @@ def to_dict(self) -> dict[str, Any]:
     def from_dict(cls, data: dict[str, Any]) -> Mapping:
         """Load a Mapping from its JSON-dict shape with strict validation.

-        Raises ``ValueError`` on any of: wrong schema version, malformed
-        token shape, type/token-prefix mismatch, counters that don't cover
-        their entries, non-int counter values, missing required fields.
+        Raises :class:`~llm_safe_pl.errors.MappingError` on any of: wrong schema
+        version, malformed token shape, type/token-prefix mismatch, counters
+        that don't cover their entries, non-int counter values, missing
+        required fields. ``MappingError`` subclasses ``ValueError`` so existing
+        ``except ValueError`` code keeps catching it.

         Validation matters because Mapping JSON is the cross-process trust
         boundary — a tampered file should fail loudly, not silently corrupt
         the Mapping.
         """
         if not isinstance(data, dict):
-            raise ValueError(f"Mapping.from_dict expected a dict, got {type(data).__name__}")
+            raise MappingError(f"Mapping.from_dict expected a dict, got {type(data).__name__}")
         version = data.get("schema_version")
         if version != cls.SCHEMA_VERSION:
-            raise ValueError(f"Unsupported mapping schema version: {version!r}")
+            raise MappingError(f"Unsupported mapping schema version: {version!r}")

         raw_counters = data.get("counters", {})
         if not isinstance(raw_counters, dict):
-            raise ValueError(f"counters must be a dict, got {type(raw_counters).__name__}")
+            raise MappingError(f"counters must be a dict, got {type(raw_counters).__name__}")
         counters: dict[PIIType, int] = {}
         for t, n in raw_counters.items():
             if not isinstance(n, int) or isinstance(n, bool) or n < 0:
-                raise ValueError(f"counter for {t!r} must be a non-negative int, got {n!r}")
-            counters[PIIType(t)] = n
+                raise MappingError(f"counter for {t!r} must be a non-negative int, got {n!r}")
+            try:
+                counters[PIIType(t)] = n
+            except ValueError as exc:
+                raise MappingError(f"unknown PII type in counters: {t!r}") from exc

         raw_entries = data.get("entries")
         if raw_entries is None:
-            raise ValueError("Mapping.from_dict requires an 'entries' field")
+            raise MappingError("Mapping.from_dict requires an 'entries' field")
         if not isinstance(raw_entries, list):
-            raise ValueError(f"entries must be a list, got {type(raw_entries).__name__}")
+            raise MappingError(f"entries must be a list, got {type(raw_entries).__name__}")

         m = cls()
         m._counters = counters
         max_per_type: dict[PIIType, int] = {}
         for entry in raw_entries:
             if not isinstance(entry, dict):
-                raise ValueError(f"each entry must be a dict, got {type(entry).__name__}")
+                raise MappingError(f"each entry must be a dict, got {type(entry).__name__}")
             for required in ("token", "type", "value"):
                 if required not in entry:
-                    raise ValueError(f"entry missing required field {required!r}: {entry!r}")
+                    raise MappingError(f"entry missing required field {required!r}: {entry!r}")
             token = entry["token"]
             value = entry["value"]
             if not isinstance(token, str) or not isinstance(value, str):
-                raise ValueError(f"entry token and value must be strings: {entry!r}")
-            pii_type = PIIType(entry["type"])
+                raise MappingError(f"entry token and value must be strings: {entry!r}")
+            try:
+                pii_type = PIIType(entry["type"])
+            except ValueError as exc:
+                raise MappingError(
+                    f"unknown PII type in entry {entry!r}: {entry['type']!r}"
+                ) from exc
             shape = _TOKEN_SHAPE.fullmatch(token)
             if shape is None:
-                raise ValueError(f"token {token!r} does not match [TYPE_NNN] shape")
+                raise MappingError(f"token {token!r} does not match [TYPE_NNN] shape")
             token_type_prefix = shape.group(1)
             if token_type_prefix != pii_type.value.upper():
-                raise ValueError(f"token {token!r} prefix does not match type {pii_type.value!r}")
+                raise MappingError(f"token {token!r} prefix does not match type {pii_type.value!r}")
             counter_n = int(shape.group(2))
             prev = max_per_type.get(pii_type, 0)
             if counter_n > prev:
@@ -161,7 +173,7 @@ def from_dict(cls, data: dict[str, Any]) -> Mapping:
         for pii_type, observed_max in max_per_type.items():
             declared = counters.get(pii_type, 0)
             if declared < observed_max:
-                raise ValueError(
+                raise MappingError(
                     f"counter for {pii_type.value!r} is {declared} but entry "
                     f"counter {observed_max} was issued"
                 )
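
The per-entry checks in this hunk (enum lookup wrapped into `MappingError`, token shape, prefix match) can be sketched in isolation as below. This is an illustration under assumptions, not the library code: `parse_token` is a hypothetical helper, `PIIType` is a two-member stand-in whose lowercase values are assumed from the corpus labels in this PR, and `MappingError` is a local stand-in. Only the regular expression is copied from `models.py`.

```python
from __future__ import annotations

import re
from enum import Enum

# Same pattern as _TOKEN_SHAPE in models.py.
TOKEN_SHAPE = re.compile(r"^\[([A-Z][A-Z_]*)_(\d+)\]$")


class PIIType(Enum):
    # Hypothetical subset; the real enum lives in llm_safe_pl.models.
    PESEL = "pesel"
    EMAIL = "email"


class MappingError(ValueError):
    """Local stand-in for llm_safe_pl.errors.MappingError."""


def parse_token(token: str, type_name: str) -> tuple[PIIType, int]:
    """Validate one entry's token against its declared type."""
    try:
        pii_type = PIIType(type_name)
    except ValueError as exc:
        # Enum lookup raises a bare ValueError; re-raise it as the typed error.
        raise MappingError(f"unknown PII type: {type_name!r}") from exc
    shape = TOKEN_SHAPE.fullmatch(token)
    if shape is None:
        raise MappingError(f"token {token!r} does not match [TYPE_NNN] shape")
    if shape.group(1) != pii_type.value.upper():
        raise MappingError(f"token {token!r} prefix does not match type {type_name!r}")
    return pii_type, int(shape.group(2))


parsed = parse_token("[PESEL_001]", "pesel")
```

Without the `try`/`except` around the enum lookup, an unknown type would surface as a plain `ValueError`, which `except MappingError` handlers would miss.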
16 changes: 10 additions & 6 deletions src/llm_safe_pl/shield.py
@@ -25,6 +25,7 @@
from llm_safe_pl.deanonymizer import Deanonymizer
from llm_safe_pl.detectors import DEFAULT_DETECTORS
from llm_safe_pl.detectors.base import Detector
from llm_safe_pl.errors import InputSizeError
from llm_safe_pl.models import AnonymizeResult, Mapping, Match
from llm_safe_pl.strategies import Strategy

@@ -36,11 +37,14 @@ class Shield:
         detectors: Custom detector list (default: ``DEFAULT_DETECTORS``).
         mapping: Preloaded Mapping (default: empty Mapping).
         strategy: Anonymization strategy (only ``TOKEN`` in v0.1).
-        max_input_bytes: If set, ``anonymize``/``detect`` raise ``ValueError``
-            for inputs whose UTF-8 byte length exceeds this. Default ``None``
-            (unlimited). Recommended for hardened pipelines that ingest
-            untrusted text — ``Shield.anonymize`` allocates O(n) memory in
-            input size, so an unbounded input is a DoS vector.
+        max_input_bytes: If set, ``anonymize``/``detect`` raise
+            :class:`~llm_safe_pl.errors.InputSizeError` for inputs whose UTF-8
+            byte length exceeds this. ``InputSizeError`` subclasses
+            ``ValueError`` so existing ``except ValueError`` code keeps
+            catching it. Default ``None`` (unlimited). Recommended for
+            hardened pipelines that ingest untrusted text — ``Shield.anonymize``
+            allocates O(n) memory in input size, so an unbounded input is a
+            DoS vector.
     """

     def __init__(
@@ -84,7 +88,7 @@ def _check_input_size(self, text: str) -> None:
             return
         size = len(text.encode("utf-8"))
         if size > self._max_input_bytes:
-            raise ValueError(f"input is {size} bytes; max_input_bytes={self._max_input_bytes}")
+            raise InputSizeError(f"input is {size} bytes; max_input_bytes={self._max_input_bytes}")

     def anonymize(self, text: str) -> AnonymizeResult:
         self._check_input_size(text)
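
Because the limit is measured in UTF-8 bytes rather than characters, Polish text reaches it sooner than `len(text)` suggests. A minimal standalone sketch of the same check follows; `check_input_size` is a hypothetical free function and `InputSizeError` is a local stand-in for the class added in this PR.

```python
from __future__ import annotations


class InputSizeError(ValueError):
    """Local stand-in for llm_safe_pl.errors.InputSizeError."""


def check_input_size(text: str, max_input_bytes: int | None) -> int:
    """Return the UTF-8 size of ``text``, raising if it exceeds the limit."""
    size = len(text.encode("utf-8"))
    if max_input_bytes is not None and size > max_input_bytes:
        raise InputSizeError(f"input is {size} bytes; max_input_bytes={max_input_bytes}")
    return size


text = "Skontaktuj się"  # 14 characters, 15 bytes: "ę" encodes to two bytes
size_in_bytes = check_input_size(text, max_input_bytes=15)

try:
    check_input_size(text, max_input_bytes=14)
except ValueError as exc:  # a legacy handler still catches the typed error
    rejected = type(exc).__name__
```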
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_negative/sample01_office_text.json
@@ -0,0 +1 @@
[]
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_negative/sample01_office_text.txt
@@ -0,0 +1 @@
Sklep otwarty od 9:00 do 17:00. Spotkanie planowe 12 maja 2024.
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_negative/sample02_short_codes.json
@@ -0,0 +1 @@
[]
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_negative/sample02_short_codes.txt
@@ -0,0 +1 @@
Numer referencyjny: 12345. Kod produktu: 9876. ID zamowienia: ABC123.
3 changes: 3 additions & 0 deletions tests/corpora/pl_pii_positive/sample01_pesel_simple.json
@@ -0,0 +1,3 @@
[
{"type": "pesel", "start": 7, "end": 18, "value": "44051401359"}
]
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_positive/sample01_pesel_simple.txt
@@ -0,0 +1 @@
PESEL: 44051401359
3 changes: 3 additions & 0 deletions tests/corpora/pl_pii_positive/sample02_email_polish.json
@@ -0,0 +1,3 @@
[
{"type": "email", "start": 16, "end": 30, "value": "jan@example.pl"}
]
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_positive/sample02_email_polish.txt
@@ -0,0 +1 @@
Skontaktuj się: jan@example.pl
3 changes: 3 additions & 0 deletions tests/corpora/pl_pii_positive/sample03_iban.json
@@ -0,0 +1,3 @@
[
{"type": "iban", "start": 7, "end": 35, "value": "PL61109010140000071219812874"}
]
1 change: 1 addition & 0 deletions tests/corpora/pl_pii_positive/sample03_iban.txt
@@ -0,0 +1 @@
Konto: PL61109010140000071219812874
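
For readers wondering how `.txt`/`.json` pairs like the ones above might be collected, here is one possible sketch. It is not the loader in `tests/test_corpus.py`, which this diff does not show; `discover_samples` and the `sample02_no_labels` file name are hypothetical, and the snippet builds a temporary directory so it runs on its own.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def discover_samples(corpus_dir: Path) -> list[tuple[str, str, list[dict]]]:
    """Return (name, text, spans) for every .txt file, sorted by file name.

    A missing .json is treated as an empty span list, matching the rule that
    negative samples may omit it.
    """
    samples = []
    for txt_path in sorted(corpus_dir.glob("*.txt")):
        json_path = txt_path.with_suffix(".json")
        if json_path.exists():
            spans = json.loads(json_path.read_text(encoding="utf-8"))
        else:
            spans = []
        samples.append((txt_path.stem, txt_path.read_text(encoding="utf-8"), spans))
    return samples


# Build a throwaway corpus directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sample01_pesel_simple.txt").write_text("PESEL: 44051401359", encoding="utf-8")
    (root / "sample01_pesel_simple.json").write_text(
        '[{"type": "pesel", "start": 7, "end": 18, "value": "44051401359"}]',
        encoding="utf-8",
    )
    (root / "sample02_no_labels.txt").write_text("Kod produktu: 9876.", encoding="utf-8")
    found = discover_samples(root)
```

A list shaped like `found` could be passed to `pytest.mark.parametrize`, so each new sample becomes its own test case without any change to test code.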