Tatarinho · Apr 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- `llm_safe_pl.errors` module with typed exception hierarchy: `LlmSafeError` (base), `MappingError` and `InputSizeError` (both also subclass `ValueError` for backwards compatibility), and `DetectorError` (also subclass of `RuntimeError`). All four are re-exported from the top-level package. See `docs/errors.md`.
+- `tests/corpora/` regression-corpus scaffolding with `pl_pii_positive/` and `pl_pii_negative/` directories. `tests/test_corpus.py` discovers `.txt`/`.json` pairs at collection time and asserts current detector behavior — adding more samples strengthens regression coverage without changing test code.
+
+### Changed
+
+- `Mapping.from_dict` / `from_json` now raise `MappingError` instead of bare `ValueError` (the new class subclasses `ValueError`, so existing `except ValueError` handlers keep working).
+- `Shield.anonymize` / `detect` now raise `InputSizeError` instead of bare `ValueError` when input exceeds `max_input_bytes` (also still caught by `except ValueError`).
+
 ## [0.2.0] - 2026-04-26
 
 Service-pack release: a large algorithmic-perf fix and a security/hardening

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -53,6 +53,30 @@ Python 3.10 or newer is required.
 4. Push the branch and open a pull request against `main`.
 5. CI runs on every push. Please address failures before asking for review.
 
+## Adding to the regression corpus
+
+The regression corpus under `tests/corpora/` is the ground truth for detector
+precision and recall. Adding samples is the cheapest way to harden coverage.
+
+Layout:
+
+- `tests/corpora/pl_pii_positive/<name>.txt` — source text containing PII.
+- `tests/corpora/pl_pii_positive/<name>.json` — list of objects with
+  `{type, start, end, value}` covering every span the default `Shield()`
+  must detect. Spans must not overlap.
+- `tests/corpora/pl_pii_negative/<name>.txt` — source text that must produce
+  zero matches under the default `Shield()`.
+- `tests/corpora/pl_pii_negative/<name>.json` — empty list, or omit the file
+  entirely.
+
+Naming: lowercase snake_case, prefixed with `sampleNN_` for sort order.
+Offsets are Python string indices (not UTF-8 bytes) and `end` is exclusive, so
+`text[start:end] == value`. Negative samples must not contain strings the
+*current* detectors flag; add such a sample only after the relevant fix lands.
+
+After adding samples, run `pytest tests/test_corpus.py -v`. The loader checks
+that every labeled span actually matches its `value` in the source text.
+
 ## Commit and PR style
 
 - Write commit messages in the imperative mood ("Add PESEL validator", not "Added").
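
The span rules described in the CONTRIBUTING hunk above (offsets are string indices, `end` is exclusive, spans must not overlap) can be sketched as a small check. This is a minimal illustration only: `tests/test_corpus.py` is not part of the visible diff, and `check_spans` is a hypothetical helper, not the loader's real function. The sample texts and labels are copied from the corpus files added in this PR.

```python
import json


def check_spans(text: str, spans: list[dict]) -> None:
    """Raise ValueError if a span mislabels the text or overlaps another span.

    Hypothetical helper; the real loader in tests/test_corpus.py may differ.
    """
    prev_end = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end = span["start"], span["end"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {span!r}")
        # Offsets index the Python str, so slicing must reproduce the value.
        if text[start:end] != span["value"]:
            raise ValueError(f"value mismatch: {span!r}")
        if start < prev_end:
            raise ValueError(f"span overlaps its predecessor: {span!r}")
        prev_end = end


# Texts and labels copied from tests/corpora/pl_pii_positive/ in this PR.
SAMPLES = [
    ("PESEL: 44051401359",
     '[{"type": "pesel", "start": 7, "end": 18, "value": "44051401359"}]'),
    ("Skontaktuj się: jan@example.pl",
     '[{"type": "email", "start": 16, "end": 30, "value": "jan@example.pl"}]'),
    ("Konto: PL61109010140000071219812874",
     '[{"type": "iban", "start": 7, "end": 35, "value": "PL61109010140000071219812874"}]'),
]

for text, labels in SAMPLES:
    check_spans(text, json.loads(labels))
```

The email sample is the one where byte offsets would go wrong: `ę` occupies two bytes in UTF-8, so a byte-based `start` would be 17 rather than 16.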

diff --git a/docs/errors.md b/docs/errors.md
@@ -0,0 +1,49 @@
+# Exception hierarchy
+
+`llm-safe-pl` exposes a small typed hierarchy from `llm_safe_pl.errors` (also
+re-exported from the top-level package). All library errors descend from
+`LlmSafeError`; specific subclasses also inherit from a relevant builtin so
+existing `except ValueError` code keeps catching them.
+
+```
+Exception
+└── LlmSafeError
+    ├── MappingError      (also subclass of ValueError)
+    ├── InputSizeError    (also subclass of ValueError)
+    └── DetectorError     (also subclass of RuntimeError)
+```
+
+## When each is raised
+
+| Class             | Raised by                                       | Builtin compat   |
+|-------------------|-------------------------------------------------|------------------|
+| `MappingError`    | `Mapping.from_dict` / `from_json` validation    | `ValueError`     |
+| `InputSizeError`  | `Shield.anonymize` / `detect` exceeding `max_input_bytes` | `ValueError` |
+| `DetectorError`   | Reserved for detector-dispatch failures; the class is exported but not yet raised internally | `RuntimeError` |
+
+## Why typed classes
+
+A bare `ValueError` doesn't tell the caller whether the problem is hostile
+mapping JSON, an oversized input, or a bug — they all look the same in
+`except`. The typed hierarchy lets handlers branch on cause:
+
+```python
+from llm_safe_pl import InputSizeError, MappingError, Shield
+
+shield = Shield(max_input_bytes=1_000_000)
+try:
+    result = shield.anonymize(text)
+except InputSizeError:
+    # Caller-side: trim or reject the input.
+    ...
+except MappingError:
+    # Hostile or corrupt persisted Mapping — treat as integrity failure.
+    ...
+```
+
+## `DetectorError` deliberately drops context
+
+`DetectorError.__init__` accepts only `detector_name` — never the input text or
+an exception cause. Both can carry PII; surfacing them in a stack trace is the
+class of leak the typed wrapper exists to prevent. Use `raise DetectorError(name) from None`
+when re-raising a wrapped detector failure.
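
The compatibility claim and the `from None` advice in `docs/errors.md` can be checked with a self-contained sketch. The classes below mirror `src/llm_safe_pl/errors.py` from this diff so the snippet runs without the package installed; `run_detector` and its simulated failure are hypothetical stand-ins, since no detector in the diff raises `DetectorError` yet.

```python
class LlmSafeError(Exception):
    """Base class, mirroring errors.py in this PR."""


class MappingError(LlmSafeError, ValueError):
    """Mirrors errors.py: typed, but still a ValueError."""


class DetectorError(LlmSafeError, RuntimeError):
    """Mirrors errors.py: carries only the detector name."""

    def __init__(self, detector_name: str) -> None:
        super().__init__(f"detector {detector_name!r} failed")
        self.detector_name = detector_name


def legacy_load() -> str:
    """Pre-existing caller code that only knows about ValueError."""
    try:
        raise MappingError("Unsupported mapping schema version: 99")
    except ValueError as exc:
        return type(exc).__name__


def run_detector(name: str, text: str) -> None:
    """Hypothetical dispatch wrapper; the inner failure embeds the input."""
    try:
        raise KeyError(text)  # simulated detector bug that echoes the text
    except KeyError:
        raise DetectorError(name) from None


print(legacy_load())

try:
    run_detector("pesel", "44051401359")
except DetectorError as err:
    caught = err

print(caught)
print(caught.__cause__, caught.__suppress_context__)
```

One caveat worth keeping in mind: `from None` sets `__suppress_context__` so the default traceback omits the original exception, but `__context__` still references it. Code that serializes exceptions by walking `__context__` directly could still reach the original message.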
diff --git a/src/llm_safe_pl/__init__.py b/src/llm_safe_pl/__init__.py
@@ -7,6 +7,7 @@
 from importlib.metadata import PackageNotFoundError
 from importlib.metadata import version as _version
 
+from llm_safe_pl.errors import DetectorError, InputSizeError, LlmSafeError, MappingError
 from llm_safe_pl.models import AnonymizeResult, Mapping, Match, PIIType
 from llm_safe_pl.shield import Shield
 
@@ -20,7 +21,11 @@
 
 __all__ = [
     "AnonymizeResult",
+    "DetectorError",
+    "InputSizeError",
+    "LlmSafeError",
     "Mapping",
+    "MappingError",
     "Match",
     "PIIType",
     "Shield",

diff --git a/src/llm_safe_pl/errors.py b/src/llm_safe_pl/errors.py
@@ -0,0 +1,34 @@
+"""Typed exception hierarchy for llm-safe-pl.
+
+All library errors descend from :class:`LlmSafeError`. Specific subclasses also
+inherit from a relevant builtin (``ValueError`` for input/data errors,
+``RuntimeError`` for dispatch failures) so legacy ``except ValueError`` code
+keeps catching them.
+
+:class:`DetectorError` deliberately does NOT accept the original text or an
+exception cause, because both can carry PII. The class signature is exactly
+``(detector_name)``; raise it via ``raise DetectorError(name) from None`` to
+keep the implicit exception context out of the traceback.
+"""
+
+from __future__ import annotations
+
+
+class LlmSafeError(Exception):
+    """Base class for all llm-safe-pl errors."""
+
+
+class MappingError(LlmSafeError, ValueError):
+    """Raised when a Mapping fails validation (e.g. ``Mapping.from_dict``)."""
+
+
+class InputSizeError(LlmSafeError, ValueError):
+    """Raised when input exceeds ``Shield(max_input_bytes=...)``."""
+
+
+class DetectorError(LlmSafeError, RuntimeError):
+    """Raised when a detector fails. Original text and cause are not attached."""
+
+    def __init__(self, detector_name: str) -> None:
+        super().__init__(f"detector {detector_name!r} failed")
+        self.detector_name = detector_name
diff --git a/src/llm_safe_pl/models.py b/src/llm_safe_pl/models.py
@@ -12,6 +12,8 @@
 from enum import Enum
 from typing import Any
 
+from llm_safe_pl.errors import MappingError
+
 _TOKEN_SHAPE = re.compile(r"^\[([A-Z][A-Z_]*)_(\d+)\]$")
 
 
@@ -102,55 +104,65 @@ def to_dict(self) -> dict[str, Any]:
     def from_dict(cls, data: dict[str, Any]) -> Mapping:
         """Load a Mapping from its JSON-dict shape with strict validation.
 
-        Raises ``ValueError`` on any of: wrong schema version, malformed
-        token shape, type/token-prefix mismatch, counters that don't cover
-        their entries, non-int counter values, missing required fields.
+        Raises :class:`~llm_safe_pl.errors.MappingError` on any of: wrong schema
+        version, malformed token shape, type/token-prefix mismatch, counters
+        that don't cover their entries, non-int counter values, missing
+        required fields. ``MappingError`` subclasses ``ValueError`` so existing
+        ``except ValueError`` code keeps catching it.
 
         Validation matters because Mapping JSON is the cross-process trust
         boundary — a tampered file should fail loudly, not silently corrupt
         the Mapping.
         """
         if not isinstance(data, dict):
-            raise ValueError(f"Mapping.from_dict expected a dict, got {type(data).__name__}")
+            raise MappingError(f"Mapping.from_dict expected a dict, got {type(data).__name__}")
         version = data.get("schema_version")
         if version != cls.SCHEMA_VERSION:
-            raise ValueError(f"Unsupported mapping schema version: {version!r}")
+            raise MappingError(f"Unsupported mapping schema version: {version!r}")
 
         raw_counters = data.get("counters", {})
         if not isinstance(raw_counters, dict):
-            raise ValueError(f"counters must be a dict, got {type(raw_counters).__name__}")
+            raise MappingError(f"counters must be a dict, got {type(raw_counters).__name__}")
         counters: dict[PIIType, int] = {}
         for t, n in raw_counters.items():
             if not isinstance(n, int) or isinstance(n, bool) or n < 0:
-                raise ValueError(f"counter for {t!r} must be a non-negative int, got {n!r}")
-            counters[PIIType(t)] = n
+                raise MappingError(f"counter for {t!r} must be a non-negative int, got {n!r}")
+            try:
+                counters[PIIType(t)] = n
+            except ValueError as exc:
+                raise MappingError(f"unknown PII type in counters: {t!r}") from exc
 
         raw_entries = data.get("entries")
         if raw_entries is None:
-            raise ValueError("Mapping.from_dict requires an 'entries' field")
+            raise MappingError("Mapping.from_dict requires an 'entries' field")
         if not isinstance(raw_entries, list):
-            raise ValueError(f"entries must be a list, got {type(raw_entries).__name__}")
+            raise MappingError(f"entries must be a list, got {type(raw_entries).__name__}")
 
         m = cls()
         m._counters = counters
         max_per_type: dict[PIIType, int] = {}
         for entry in raw_entries:
             if not isinstance(entry, dict):
-                raise ValueError(f"each entry must be a dict, got {type(entry).__name__}")
+                raise MappingError(f"each entry must be a dict, got {type(entry).__name__}")
             for required in ("token", "type", "value"):
                 if required not in entry:
-                    raise ValueError(f"entry missing required field {required!r}: {entry!r}")
+                    raise MappingError(f"entry missing required field {required!r}: {entry!r}")
             token = entry["token"]
             value = entry["value"]
             if not isinstance(token, str) or not isinstance(value, str):
-                raise ValueError(f"entry token and value must be strings: {entry!r}")
-            pii_type = PIIType(entry["type"])
+                raise MappingError(f"entry token and value must be strings: {entry!r}")
+            try:
+                pii_type = PIIType(entry["type"])
+            except ValueError as exc:
+                raise MappingError(
+                    f"unknown PII type in entry {entry!r}: {entry['type']!r}"
+                ) from exc
             shape = _TOKEN_SHAPE.fullmatch(token)
             if shape is None:
-                raise ValueError(f"token {token!r} does not match [TYPE_NNN] shape")
+                raise MappingError(f"token {token!r} does not match [TYPE_NNN] shape")
             token_type_prefix = shape.group(1)
             if token_type_prefix != pii_type.value.upper():
-                raise ValueError(f"token {token!r} prefix does not match type {pii_type.value!r}")
+                raise MappingError(f"token {token!r} prefix does not match type {pii_type.value!r}")
             counter_n = int(shape.group(2))
             prev = max_per_type.get(pii_type, 0)
             if counter_n > prev:
@@ -161,7 +173,7 @@ def from_dict(cls, data: dict[str, Any]) -> Mapping:
         for pii_type, observed_max in max_per_type.items():
             declared = counters.get(pii_type, 0)
             if declared < observed_max:
-                raise ValueError(
+                raise MappingError(
                     f"counter for {pii_type.value!r} is {declared} but entry "
                     f"counter {observed_max} was issued"
                 )
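
The two less obvious checks in the `from_dict` hunk above are the `isinstance(n, bool)` guard (needed because `bool` is a subclass of `int` in Python) and the `try`/`except` that converts the enum's own `ValueError` into `MappingError`. A minimal sketch of just that counter-parsing step follows. The `PIIType` members here are an assumed subset taken from the sample JSON in this PR, `MappingError` is a simplified stand-in, and `parse_counters` is a hypothetical name for logic that lives inline in `Mapping.from_dict`.

```python
from enum import Enum


class MappingError(ValueError):
    """Stand-in; the real class also derives from LlmSafeError."""


class PIIType(Enum):
    # Assumed subset; the full enum is not shown in the diff.
    PESEL = "pesel"
    EMAIL = "email"
    IBAN = "iban"


def parse_counters(raw: dict) -> dict:
    """Validate a raw counters dict the way the hunk above does."""
    counters = {}
    for t, n in raw.items():
        # bool passes isinstance(n, int), so it has to be excluded explicitly.
        if not isinstance(n, int) or isinstance(n, bool) or n < 0:
            raise MappingError(f"counter for {t!r} must be a non-negative int, got {n!r}")
        try:
            counters[PIIType(t)] = n
        except ValueError as exc:
            # Enum lookup failures surface as the library's typed error.
            raise MappingError(f"unknown PII type in counters: {t!r}") from exc
    return counters


print(parse_counters({"pesel": 2, "email": 0}))
```

Before this change an unknown type string would have escaped as the enum's plain `ValueError`; after it, every validation failure in the loop has the same typed class, with the enum error kept as `__cause__`.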

diff --git a/src/llm_safe_pl/shield.py b/src/llm_safe_pl/shield.py
@@ -25,6 +25,7 @@
 from llm_safe_pl.deanonymizer import Deanonymizer
 from llm_safe_pl.detectors import DEFAULT_DETECTORS
 from llm_safe_pl.detectors.base import Detector
+from llm_safe_pl.errors import InputSizeError
 from llm_safe_pl.models import AnonymizeResult, Mapping, Match
 from llm_safe_pl.strategies import Strategy
 
@@ -36,11 +37,14 @@ class Shield:
         detectors: Custom detector list (default: ``DEFAULT_DETECTORS``).
         mapping: Preloaded Mapping (default: empty Mapping).
         strategy: Anonymization strategy (only ``TOKEN`` in v0.1).
-        max_input_bytes: If set, ``anonymize``/``detect`` raise ``ValueError``
-            for inputs whose UTF-8 byte length exceeds this. Default ``None``
-            (unlimited). Recommended for hardened pipelines that ingest
-            untrusted text — ``Shield.anonymize`` allocates O(n) memory in
-            input size, so an unbounded input is a DoS vector.
+        max_input_bytes: If set, ``anonymize``/``detect`` raise
+            :class:`~llm_safe_pl.errors.InputSizeError` for inputs whose UTF-8
+            byte length exceeds this. ``InputSizeError`` subclasses
+            ``ValueError`` so existing ``except ValueError`` code keeps
+            catching it. Default ``None`` (unlimited). Recommended for
+            hardened pipelines that ingest untrusted text — ``Shield.anonymize``
+            allocates O(n) memory in input size, so an unbounded input is a
+            DoS vector.
     """
 
     def __init__(
@@ -84,7 +88,7 @@ def _check_input_size(self, text: str) -> None:
             return
         size = len(text.encode("utf-8"))
         if size > self._max_input_bytes:
-            raise ValueError(f"input is {size} bytes; max_input_bytes={self._max_input_bytes}")
+            raise InputSizeError(f"input is {size} bytes; max_input_bytes={self._max_input_bytes}")
 
     def anonymize(self, text: str) -> AnonymizeResult:
         self._check_input_size(text)
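
The limit in `_check_input_size` is measured in UTF-8 bytes, not characters, which matters for Polish text because each diacritic encodes to two bytes. A minimal sketch of the same check outside the class is shown below. `check_input_size` is a hypothetical free function modelled on the method in this hunk, and `InputSizeError` is a simplified stand-in for the class added in `errors.py`.

```python
from __future__ import annotations


class InputSizeError(ValueError):
    """Stand-in; the real class also derives from LlmSafeError."""


def check_input_size(text: str, max_input_bytes: int | None) -> None:
    """Reject text whose UTF-8 encoding exceeds the limit (None = unlimited)."""
    if max_input_bytes is None:
        return
    size = len(text.encode("utf-8"))
    if size > max_input_bytes:
        raise InputSizeError(f"input is {size} bytes; max_input_bytes={max_input_bytes}")


text = "zażółć gęślą jaźń"
print(len(text), len(text.encode("utf-8")))  # character count vs byte count

check_input_size(text, None)  # unlimited: always passes
check_input_size(text, 26)    # exactly at the limit: passes

try:
    check_input_size(text, 20)
except ValueError as exc:     # legacy handlers still catch the typed error
    print(type(exc).__name__, exc)
```

A caller sizing `max_input_bytes` from a character budget should therefore leave headroom for multi-byte characters.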

diff --git a/tests/corpora/pl_pii_negative/sample01_office_text.json b/tests/corpora/pl_pii_negative/sample01_office_text.json
@@ -0,0 +1 @@
+[]
diff --git a/tests/corpora/pl_pii_negative/sample01_office_text.txt b/tests/corpora/pl_pii_negative/sample01_office_text.txt
@@ -0,0 +1 @@
+Sklep otwarty od 9:00 do 17:00. Spotkanie planowe 12 maja 2024.
diff --git a/tests/corpora/pl_pii_negative/sample02_short_codes.json b/tests/corpora/pl_pii_negative/sample02_short_codes.json
@@ -0,0 +1 @@
+[]
diff --git a/tests/corpora/pl_pii_negative/sample02_short_codes.txt b/tests/corpora/pl_pii_negative/sample02_short_codes.txt
@@ -0,0 +1 @@
+Numer referencyjny: 12345. Kod produktu: 9876. ID zamowienia: ABC123.
diff --git a/tests/corpora/pl_pii_positive/sample01_pesel_simple.json b/tests/corpora/pl_pii_positive/sample01_pesel_simple.json
@@ -0,0 +1,3 @@
+[
+  {"type": "pesel", "start": 7, "end": 18, "value": "44051401359"}
+]
diff --git a/tests/corpora/pl_pii_positive/sample01_pesel_simple.txt b/tests/corpora/pl_pii_positive/sample01_pesel_simple.txt
@@ -0,0 +1 @@
+PESEL: 44051401359
diff --git a/tests/corpora/pl_pii_positive/sample02_email_polish.json b/tests/corpora/pl_pii_positive/sample02_email_polish.json
@@ -0,0 +1,3 @@
+[
+  {"type": "email", "start": 16, "end": 30, "value": "jan@example.pl"}
+]
diff --git a/tests/corpora/pl_pii_positive/sample02_email_polish.txt b/tests/corpora/pl_pii_positive/sample02_email_polish.txt
@@ -0,0 +1 @@
+Skontaktuj się: jan@example.pl
diff --git a/tests/corpora/pl_pii_positive/sample03_iban.json b/tests/corpora/pl_pii_positive/sample03_iban.json
@@ -0,0 +1,3 @@
+[
+  {"type": "iban", "start": 7, "end": 35, "value": "PL61109010140000071219812874"}
+]
diff --git a/tests/corpora/pl_pii_positive/sample03_iban.txt b/tests/corpora/pl_pii_positive/sample03_iban.txt
@@ -0,0 +1 @@
+Konto: PL61109010140000071219812874