This repository was archived by the owner on Apr 29, 2026. It is now read-only.
Merged
8 changes: 8 additions & 0 deletions README.md
@@ -4,6 +4,7 @@
[![Python versions](https://img.shields.io/pypi/pyversions/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/)
[![Tests](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml/badge.svg)](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)

Reversible PII anonymization for Polish documents, designed for LLM workflows.

@@ -65,6 +66,12 @@ The same value always maps to the same token within a `Shield` instance, includi

PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is part of Phase 6. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.
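
The quickstart notebook added in this change notes that structured identifiers are checksum-validated where applicable. As a rough illustration of what such a check involves, the sketch below applies the publicly documented PESEL check-digit rule (weights 1, 3, 7, 9 repeated over the first ten digits). The function name `pesel_checksum_ok` is made up for this example, and the code shows the general technique only; it is not taken from the library.

```python
def pesel_checksum_ok(pesel: str) -> bool:
    """Return True if an 11-digit PESEL string has a valid check digit.

    Hypothetical helper for illustration; not part of llm-safe-pl.
    """
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    # zip() stops after the ten weights, so the check digit is excluded.
    total = sum(w * int(d) for w, d in zip(weights, pesel))
    return (10 - total % 10) % 10 == int(pesel[10])


print(pesel_checksum_ok("44051401359"))  # True
print(pesel_checksum_ok("44051401358"))  # False
```

The first value is the PESEL from the notebook's demo email; changing its last digit makes the check fail, which is the kind of valid-looking-but-wrong number a checksum filter rejects.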

## Try it live in Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)

No install needed — the notebook walks through a full anonymize → LLM → deanonymize round-trip in a Polish customer-service scenario.
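
To make the round-trip concrete without installing anything, here is a deliberately tiny stand-in written with the standard library only. `ToyShield`, its email-only regex, and the hard-coded model reply are all hypothetical and exist just for illustration; the real `Shield` covers many more identifier types and validates them, so treat this as a sketch of the pattern, not of the library's implementation.

```python
import re

# Simplified email pattern, an assumption for this sketch only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class ToyShield:
    """Hypothetical stand-in: stable tokens out, original values back in."""

    def __init__(self):
        self.token_by_value = {}
        self.value_by_token = {}

    def anonymize(self, text):
        def swap(match):
            value = match.group(0)
            if value not in self.token_by_value:
                # First sighting of this value: mint the next numbered token.
                token = f"[EMAIL_{len(self.value_by_token) + 1:03d}]"
                self.token_by_value[value] = token
                self.value_by_token[token] = value
            return self.token_by_value[value]

        return EMAIL_RE.sub(swap, text)

    def deanonymize(self, text):
        for token, value in self.value_by_token.items():
            text = text.replace(token, value)
        return text


shield = ToyShield()
safe = shield.anonymize("Contact: jan.kowalski@example.pl, cc: jan.kowalski@example.pl")
print(safe)  # Contact: [EMAIL_001], cc: [EMAIL_001]

# Stands in for the LLM call; the model only ever sees the token.
reply = "Please reply to [EMAIL_001]."
restored = shield.deanonymize(reply)
print(restored)  # Please reply to jan.kowalski@example.pl.
```

Both occurrences of the address collapse to the same token, which is the property that lets the reply be restored with a plain lookup.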

## Quick example — CLI

```bash
@@ -115,6 +122,7 @@ Anything else is an implementation detail and may change without a major version

## More examples and documentation

- [`notebooks/quickstart.ipynb`](notebooks/quickstart.ipynb) — interactive Colab walk-through of the full round-trip. [Open in Colab ↗](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb).
- [`examples/basic.py`](examples/basic.py) — minimal programmatic use.
- [`examples/openai_integration.py`](examples/openai_integration.py) — full round-trip against OpenAI.
- [`examples/anthropic_integration.py`](examples/anthropic_integration.py) — same for the Anthropic API.
220 changes: 220 additions & 0 deletions notebooks/quickstart.ipynb
@@ -0,0 +1,220 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# llm-safe-pl — anonymize Polish PII before sending documents to an LLM\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)\n",
"[![PyPI version](https://img.shields.io/pypi/v/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/)\n",
"[![GitHub](https://img.shields.io/badge/github-Tatarinho%2Fllm--safe--pl-blue)](https://github.com/Tatarinho/llm-safe-pl)\n",
"\n",
"This notebook walks through the full round-trip that `llm-safe-pl` is built for:\n",
"\n",
"1. **Detect** PII in a Polish document — PESEL, NIP, REGON, IBAN, phone, email — all checksum-validated where applicable.\n",
"2. **Anonymize** — replace each hit with a stable `[TYPE_NNN]` token and return a reversible mapping.\n",
"3. **Call an LLM** on the anonymized text so the model provider never sees raw PII.\n",
"4. **Deanonymize** — restore original values in the LLM's response using the mapping.\n",
"\n",
"The LLM call uses OpenAI if `OPENAI_API_KEY` is set, and falls back to a hand-crafted response otherwise, so the notebook runs end-to-end with no API key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q llm-safe-pl openai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The scenario\n",
"\n",
"A Polish customer-service email containing multiple identifiers. This is the kind of text you might want an LLM to summarize without exposing the underlying PII to your model provider."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"document = \"\"\"\\\n",
"Dzień dobry,\n",
"\n",
"Piszę w sprawie zamówienia nr INV-2025-00412. Klient: Jan Kowalski,\n",
"PESEL 44051401359, kontakt +48 600 123 456, e-mail jan.kowalski@example.pl.\n",
"Faktura wystawiona na firmę Acme Polska Sp. z o.o., NIP 526-000-12-46,\n",
"REGON 123456785, adres: ul. Marszałkowska 1, 00-001 Warszawa.\n",
"Przelew na IBAN PL61 1090 1014 0000 0712 1981 2874 nie dotarł do 2025-03-18.\n",
"\n",
"Proszę o sprawdzenie statusu i kontakt zwrotny.\n",
"Pozdrawiam,\n",
"Anna Nowak\n",
"\"\"\"\n",
"\n",
"print(document)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 — detect\n",
"\n",
"`Shield.detect()` returns an ordered tuple of `Match` objects. Each match carries its type, the raw value, its character offsets, and which detector fired — useful when you want an audit trail without yet committing to rewriting the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from llm_safe_pl import Shield\n\nshield = Shield()\nmatches = shield.detect(document)\n\nprint(f\"Found {len(matches)} PII hits:\\n\")\nfor m in matches:\n    print(f\" [{m.type.value:<11}] {m.value!r:<40} at {m.start}-{m.end} (detector: {m.detector})\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 — anonymize\n",
"\n",
"`Shield.anonymize()` runs the same pipeline as `detect()`, but also rewrites the text and updates the shield's `Mapping`. The same `(type, value)` pair always maps to the same token within one shield, so if `jan.kowalski@example.pl` appears in three documents, it gets the same `[EMAIL_001]` in all three. (A name such as `Jan Kowalski` would get a stable `[PERSON_001]` in the same way, but only with the optional `[ner]` extra installed.)\n",
"\n",
"Formatted identifiers (dashes in NIP, spaces in IBAN) are preserved byte-for-byte on round-trip."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = shield.anonymize(document)\n",
"\n",
"print(\"Anonymized — safe to send to an LLM:\\n\")\n",
"print(result.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 — call the LLM\n",
"\n",
"The anonymized text is what you would send to OpenAI, Anthropic, or any local model. Tokens like `[PESEL_001]` look like ordinary placeholders and the model will happily quote them back in its reply.\n",
"\n",
"**System-prompt tip:** ask the model to keep `[TYPE_NNN]` tokens intact. Not strictly required, but it reduces the chance the model rephrases `[PESEL_001]` into something else.\n",
"\n",
"If `OPENAI_API_KEY` is set in your Colab environment, the real API is called. Otherwise we simulate a plausible summary so the rest of the notebook still works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import os\n\nSYSTEM = (\n    \"You are a Polish-language customer service assistant. \"\n    \"Summarize the user's message in 3 bullet points. \"\n    \"Keep every placeholder of the form [TYPE_NNN] intact: do not rename, \"\n    \"translate, or expand them.\"\n)\n\nif os.environ.get(\"OPENAI_API_KEY\"):\n    from openai import OpenAI\n\n    client = OpenAI()\n    response = client.chat.completions.create(\n        model=\"gpt-4o-mini\",\n        messages=[\n            {\"role\": \"system\", \"content\": SYSTEM},\n            {\"role\": \"user\", \"content\": result.text},\n        ],\n    )\n    llm_output = response.choices[0].message.content or \"\"\nelse:\n    llm_output = (\n        \"Podsumowanie zgłoszenia:\\n\"\n        \"- Klient [PESEL_001] zgłasza brak przelewu dla zamówienia INV-2025-00412.\\n\"\n        \"- Kontakt zwrotny: [PHONE_001] lub [EMAIL_001].\\n\"\n        \"- Faktura VAT [NIP_001], REGON [REGON_001].\\n\"\n        \"- IBAN [IBAN_001]: sprawdzić status transakcji.\"\n    )\n\nprint(\"LLM response (still anonymized):\\n\")\nprint(llm_output)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 — deanonymize\n",
"\n",
"`Shield.deanonymize()` uses the mapping built during `anonymize()` to put real values back. This is the step that closes the loop: the model never saw the originals, but downstream systems — a ticket, a CRM write, an outgoing email — see fully restored text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"restored = shield.deanonymize(llm_output)\n",
"\n",
"print(\"Final, de-anonymized output:\\n\")\n",
"print(restored)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Persisting the mapping\n",
"\n",
"Real LLM workflows are often async: you anonymize now, call the model later, and deanonymize when the response comes back — possibly from a different process. `Mapping` is JSON-serializable so it rides along with whatever queue or task record you are using."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from llm_safe_pl import Mapping\n",
"\n",
"serialized = result.mapping.to_json()\n",
"print(\"Serialized mapping:\\n\")\n",
"print(json.dumps(json.loads(serialized), indent=2, ensure_ascii=False))\n",
"\n",
"rehydrated = Mapping.from_json(serialized)\n",
"print(\"\\nRestored via a rehydrated mapping (e.g. from a queue):\\n\")\n",
"print(shield.deanonymize(llm_output, mapping=rehydrated))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What this install did — and did not — catch\n",
"\n",
"The core install uses **stdlib + `typer` only**. That is enough for every identifier with a deterministic format: PESEL, NIP, REGON, ID card, passport, phone, email, IBAN, credit card. Checksum validators (PESEL, NIP, REGON, Luhn, mod-97 IBAN) filter out valid-looking-but-wrong numbers so the audit log is not flooded with false positives.\n",
"\n",
"**Not caught in the core install** — these are in the document above but absent from the audit trail:\n",
"\n",
"- `Jan Kowalski`, `Anna Nowak` — person names\n",
"- `Acme Polska Sp. z o.o.` — organization\n",
"- `ul. Marszałkowska 1, 00-001 Warszawa` — address\n",
"\n",
"These require the optional `[ner]` extra, which pulls in spaCy and a Polish model:\n",
"\n",
"```bash\n",
"pip install \"llm-safe-pl[ner]\"\n",
"python -m spacy download pl_core_news_lg\n",
"```\n",
"\n",
"NER support ships in v0.1.1 — see the [roadmap](https://github.com/Tatarinho/llm-safe-pl#roadmap).\n",
"\n",
"## Next steps\n",
"\n",
"- [README](https://github.com/Tatarinho/llm-safe-pl) — full feature list and install options.\n",
"- [`docs/llm_workflow.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/llm_workflow.md) — deeper guidance on the anonymize → LLM → deanonymize pattern.\n",
"- [`docs/limitations.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/limitations.md) — read before shipping to production.\n",
"\n",
"Found a false positive, a missed identifier, or have a feature idea? [Open an issue](https://github.com/Tatarinho/llm-safe-pl/issues). Stars welcome."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}