Tatarinho · Apr 23, 2026
diff --git a/README.md b/README.md
@@ -4,6 +4,7 @@
 [![Python versions](https://img.shields.io/pypi/pyversions/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/)
 [![Tests](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml/badge.svg)](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)
 
 Reversible PII anonymization for Polish documents, designed for LLM workflows.
 
@@ -65,6 +66,12 @@ The same value always maps to the same token within a `Shield` instance, includi
 
 PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is part of Phase 6. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized.
 
+## Try it live in Colab
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)
+
+No local install needed: the notebook installs the package inside Colab and walks through a full anonymize → LLM → deanonymize round-trip in a Polish customer-service scenario.
+
 ## Quick example — CLI
 
 ```bash
@@ -115,6 +122,7 @@ Anything else is an implementation detail and may change without a major version
 
 ## More examples and documentation
 
+- [`notebooks/quickstart.ipynb`](notebooks/quickstart.ipynb) — interactive Colab walk-through of the full round-trip. [Open in Colab ↗](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb).
 - [`examples/basic.py`](examples/basic.py) — minimal programmatic use.
 - [`examples/openai_integration.py`](examples/openai_integration.py) — full round-trip against OpenAI.
 - [`examples/anthropic_integration.py`](examples/anthropic_integration.py) — same for the Anthropic API.

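Review note: the stable `[TYPE_NNN]` tokens that this README change points to can be pictured with a toy mapping. The sketch below is not the `llm-safe-pl` implementation. `ToyMapping`, `token_for`, and `restore` are hypothetical names chosen for illustration, and it assumes tokens are numbered per type in first-seen order.

```python
import re
from collections import defaultdict


class ToyMapping:
    """Hypothetical stand-in for a reversible token mapping (not the library's class)."""

    def __init__(self) -> None:
        self._token_by_value: dict[tuple[str, str], str] = {}
        self._value_by_token: dict[str, str] = {}
        self._counters: defaultdict[str, int] = defaultdict(int)

    def token_for(self, kind: str, value: str) -> str:
        """Return the token for (kind, value), minting a new one on first sight."""
        key = (kind, value)
        if key not in self._token_by_value:
            self._counters[kind] += 1
            token = f"[{kind}_{self._counters[kind]:03d}]"
            self._token_by_value[key] = token
            self._value_by_token[token] = value
        return self._token_by_value[key]

    def restore(self, text: str) -> str:
        """Swap known tokens back to their raw values; unknown tokens stay as-is."""
        return re.sub(
            r"\[[A-Z_]+_\d+\]",
            lambda m: self._value_by_token.get(m.group(0), m.group(0)),
            text,
        )
```

Calling `token_for("NIP", "526-000-12-46")` twice yields the same token both times, and `restore()` puts the raw string back exactly as it was stored, dashes included. Tokens the mapping has never issued pass through `restore()` untouched.
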
diff --git a/notebooks/quickstart.ipynb b/notebooks/quickstart.ipynb
@@ -0,0 +1,220 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# llm-safe-pl — anonymize Polish PII before sending documents to an LLM\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)\n",
+    "[![PyPI version](https://img.shields.io/pypi/v/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/)\n",
+    "[![GitHub](https://img.shields.io/badge/github-Tatarinho%2Fllm--safe--pl-blue)](https://github.com/Tatarinho/llm-safe-pl)\n",
+    "\n",
+    "This notebook walks through the full round-trip that `llm-safe-pl` is built for:\n",
+    "\n",
+    "1. **Detect** PII in a Polish document — PESEL, NIP, REGON, IBAN, phone, email — all checksum-validated where applicable.\n",
+    "2. **Anonymize** — replace each hit with a stable `[TYPE_NNN]` token and return a reversible mapping.\n",
+    "3. **Call an LLM** on the anonymized text so the model provider never sees raw PII.\n",
+    "4. **Deanonymize** — restore original values in the LLM's response using the mapping.\n",
+    "\n",
+    "The LLM call uses OpenAI if `OPENAI_API_KEY` is set, and falls back to a hand-crafted response otherwise, so the notebook runs end-to-end with no API key."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q llm-safe-pl openai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The scenario\n",
+    "\n",
+    "A Polish customer-service email containing multiple identifiers. This is the kind of text you might want an LLM to summarize without exposing the underlying PII to your model provider."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "document = \"\"\"\\\n",
+    "Dzień dobry,\n",
+    "\n",
+    "Piszę w sprawie zamówienia nr INV-2025-00412. Klient: Jan Kowalski,\n",
+    "PESEL 44051401359, kontakt +48 600 123 456, e-mail jan.kowalski@example.pl.\n",
+    "Faktura wystawiona na firmę Acme Polska Sp. z o.o., NIP 526-000-12-46,\n",
+    "REGON 123456785, adres: ul. Marszałkowska 1, 00-001 Warszawa.\n",
+    "Przelew na IBAN PL61 1090 1014 0000 0712 1981 2874 nie dotarł do 2025-03-18.\n",
+    "\n",
+    "Proszę o sprawdzenie statusu i kontakt zwrotny.\n",
+    "Pozdrawiam,\n",
+    "Anna Nowak\n",
+    "\"\"\"\n",
+    "\n",
+    "print(document)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1 — detect\n",
+    "\n",
+    "`Shield.detect()` returns an ordered tuple of `Match` objects. Each match carries its type, the raw value, its character offsets, and which detector fired — useful when you want an audit trail without yet committing to rewriting the text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "from llm_safe_pl import Shield\n\nshield = Shield()\nmatches = shield.detect(document)\n\nprint(f\"Found {len(matches)} PII hits:\\n\")\nfor m in matches:\n    print(f\"  [{m.type.value:<11}] {m.value!r:<40} at {m.start}-{m.end}  (detector: {m.detector})\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2 — anonymize\n",
+    "\n",
+    "`Shield.anonymize()` runs the same pipeline as `detect()`, but also rewrites the text and updates the shield's `Mapping`. The same `(type, value)` pair always maps to the same token within one shield, so if PESEL `44051401359` appears in three documents, it gets the same `[PESEL_001]` in all three.\n",
+    "\n",
+    "Formatted identifiers (dashes in NIP, spaces in IBAN) are preserved byte-for-byte on round-trip."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result = shield.anonymize(document)\n",
+    "\n",
+    "print(\"Anonymized — safe to send to an LLM:\\n\")\n",
+    "print(result.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3 — call the LLM\n",
+    "\n",
+    "The anonymized text is what you would send to OpenAI, Anthropic, or any local model. Tokens like `[PESEL_001]` look like ordinary placeholders, and models usually quote them back verbatim in their replies.\n",
+    "\n",
+    "**System-prompt tip:** ask the model to keep `[TYPE_NNN]` tokens intact. Not strictly required, but it reduces the chance the model rephrases `[PESEL_001]` into something else.\n",
+    "\n",
+    "If `OPENAI_API_KEY` is set in your Colab environment, the real API is called. Otherwise we simulate a plausible summary so the rest of the notebook still works."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os\n\nSYSTEM = (\n    \"You are a Polish-language customer service assistant. \"\n    \"Summarize the user's message in 3 bullet points. \"\n    \"Keep every placeholder of the form [TYPE_NNN] intact — do not rename, \"\n    \"translate, or expand them.\"\n)\n\nif os.environ.get(\"OPENAI_API_KEY\"):\n    from openai import OpenAI\n\n    client = OpenAI()\n    response = client.chat.completions.create(\n        model=\"gpt-4o-mini\",\n        messages=[\n            {\"role\": \"system\", \"content\": SYSTEM},\n            {\"role\": \"user\", \"content\": result.text},\n        ],\n    )\n    llm_output = response.choices[0].message.content or \"\"\nelse:\n    # No API key: simulate a three-bullet reply so the remaining cells still run.\n    llm_output = (\n        \"Podsumowanie zgłoszenia:\\n\"\n        \"- Klient [PESEL_001] zgłasza brak przelewu dla zamówienia INV-2025-00412.\\n\"\n        \"- Kontakt zwrotny: [PHONE_001] lub [EMAIL_001].\\n\"\n        \"- Faktura VAT [NIP_001], REGON [REGON_001]; sprawdzić status przelewu na IBAN [IBAN_001].\"\n    )\n\nprint(\"LLM response (still anonymized):\\n\")\nprint(llm_output)"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4 — deanonymize\n",
+    "\n",
+    "`Shield.deanonymize()` uses the mapping built during `anonymize()` to put real values back. This is the step that closes the loop: the model never saw the originals, but downstream systems — a ticket, a CRM write, an outgoing email — see fully restored text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "restored = shield.deanonymize(llm_output)\n",
+    "\n",
+    "print(\"Final, de-anonymized output:\\n\")\n",
+    "print(restored)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Persisting the mapping\n",
+    "\n",
+    "Real LLM workflows are often async: you anonymize now, call the model later, and deanonymize when the response comes back — possibly from a different process. `Mapping` is JSON-serializable so it rides along with whatever queue or task record you are using."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "from llm_safe_pl import Mapping\n",
+    "\n",
+    "serialized = result.mapping.to_json()\n",
+    "print(\"Serialized mapping:\\n\")\n",
+    "print(json.dumps(json.loads(serialized), indent=2, ensure_ascii=False))\n",
+    "\n",
+    "rehydrated = Mapping.from_json(serialized)\n",
+    "print(\"\\nRestored via a rehydrated mapping (e.g. from a queue):\\n\")\n",
+    "print(shield.deanonymize(llm_output, mapping=rehydrated))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What this install did — and did not — catch\n",
+    "\n",
+    "The core install uses **stdlib + `typer` only**. That is enough for every identifier with a deterministic format: PESEL, NIP, REGON, ID card, passport, phone, email, IBAN, credit card. Checksum validators (PESEL, NIP, REGON, Luhn, mod-97 IBAN) filter out valid-looking-but-wrong numbers so the audit log is not flooded with false positives.\n",
+    "\n",
+    "**Not caught in the core install** — these are in the document above but absent from the audit trail:\n",
+    "\n",
+    "- `Jan Kowalski`, `Anna Nowak` — person names\n",
+    "- `Acme Polska Sp. z o.o.` — organization\n",
+    "- `ul. Marszałkowska 1, 00-001 Warszawa` — address\n",
+    "\n",
+    "These require the optional `[ner]` extra, which pulls in spaCy and a Polish model:\n",
+    "\n",
+    "```bash\n",
+    "pip install \"llm-safe-pl[ner]\"\n",
+    "python -m spacy download pl_core_news_lg\n",
+    "```\n",
+    "\n",
+    "NER support ships in v0.1.1 — see the [roadmap](https://github.com/Tatarinho/llm-safe-pl#roadmap).\n",
+    "\n",
+    "## Next steps\n",
+    "\n",
+    "- [README](https://github.com/Tatarinho/llm-safe-pl) — full feature list and install options.\n",
+    "- [`docs/llm_workflow.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/llm_workflow.md) — deeper guidance on the anonymize → LLM → deanonymize pattern.\n",
+    "- [`docs/limitations.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/limitations.md) — read before shipping to production.\n",
+    "\n",
+    "Found a false positive, a missed identifier, or have a feature idea? [Open an issue](https://github.com/Tatarinho/llm-safe-pl/issues). Stars welcome."
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
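
Review note: the closing notebook cell lists checksum validators (PESEL, NIP, REGON, mod-97 IBAN) without showing what they check. Below is a minimal sketch of the publicly documented checksum rules. It assumes nothing about how `llm-safe-pl` implements them, and the function names are hypothetical. The sample values used to exercise it come from the notebook's own document.

```python
def _digits(raw: str) -> list[int]:
    """Keep only digits, so '526-000-12-46' and '5260001246' are treated alike."""
    return [int(ch) for ch in raw if ch.isdigit()]


def pesel_ok(raw: str) -> bool:
    """PESEL: 11 digits; weighted sum of the first 10 determines the last."""
    d = _digits(raw)
    if len(d) != 11:
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * x for w, x in zip(weights, d))
    return (10 - total % 10) % 10 == d[10]


def nip_ok(raw: str) -> bool:
    """NIP: 10 digits; weighted sum mod 11 must equal the last digit (10 is invalid)."""
    d = _digits(raw)
    if len(d) != 10:
        return False
    weights = (6, 5, 7, 2, 3, 4, 5, 6, 7)
    check = sum(w * x for w, x in zip(weights, d)) % 11
    return check != 10 and check == d[9]


def regon9_ok(raw: str) -> bool:
    """9-digit REGON: weighted sum mod 11, with a remainder of 10 mapped to 0."""
    d = _digits(raw)
    if len(d) != 9:
        return False
    weights = (8, 9, 2, 3, 4, 5, 6, 7)
    check = sum(w * x for w, x in zip(weights, d)) % 11 % 10
    return check == d[8]


def iban_ok(raw: str) -> bool:
    """IBAN mod-97: move the first 4 chars to the end, map A-Z to 10-35, expect remainder 1."""
    s = "".join(raw.split()).upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

All four identifiers in the sample email pass these checks, and changing the last digit of any of them makes the check fail. A production validator would also check things this sketch omits, such as the birth date encoded in a PESEL and the country-specific IBAN length.
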