diff --git a/README.md b/README.md index 40e8fca..fbd64a2 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,7 @@ [![Python versions](https://img.shields.io/pypi/pyversions/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/) [![Tests](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml/badge.svg)](https://github.com/Tatarinho/llm-safe-pl/actions/workflows/tests.yml) [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb) Reversible PII anonymization for Polish documents, designed for LLM workflows. @@ -65,6 +66,12 @@ The same value always maps to the same token within a `Shield` instance, includi PERSON detection (`Jan Kowalski` in the example) requires `pip install "llm-safe-pl[ner]"` and is part of Phase 6. Without the extra, names remain visible and structured identifiers (PESEL, NIP, IBAN, etc.) are tokenized. +## Try it live in Colab + +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb) + +No install needed — the notebook walks through a full anonymize → LLM → deanonymize round-trip in a Polish customer-service scenario. + ## Quick example — CLI ```bash @@ -115,6 +122,7 @@ Anything else is an implementation detail and may change without a major version ## More examples and documentation +- [`notebooks/quickstart.ipynb`](notebooks/quickstart.ipynb) — interactive Colab walk-through of the full round-trip. [Open in Colab ↗](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb). - [`examples/basic.py`](examples/basic.py) — minimal programmatic use. - [`examples/openai_integration.py`](examples/openai_integration.py) — full round-trip against OpenAI. 
- [`examples/anthropic_integration.py`](examples/anthropic_integration.py) — same for the Anthropic API. diff --git a/notebooks/quickstart.ipynb b/notebooks/quickstart.ipynb new file mode 100644 index 0000000..8381411 --- /dev/null +++ b/notebooks/quickstart.ipynb @@ -0,0 +1,220 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# llm-safe-pl — anonymize Polish PII before sending documents to an LLM\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tatarinho/llm-safe-pl/blob/main/notebooks/quickstart.ipynb)\n", + "[![PyPI version](https://img.shields.io/pypi/v/llm-safe-pl.svg)](https://pypi.org/project/llm-safe-pl/)\n", + "[![GitHub](https://img.shields.io/badge/github-Tatarinho%2Fllm--safe--pl-blue)](https://github.com/Tatarinho/llm-safe-pl)\n", + "\n", + "This notebook walks through the full round-trip that `llm-safe-pl` is built for:\n", + "\n", + "1. **Detect** PII in a Polish document — PESEL, NIP, REGON, IBAN, phone, email — all checksum-validated where applicable.\n", + "2. **Anonymize** — replace each hit with a stable `[TYPE_NNN]` token and return a reversible mapping.\n", + "3. **Call an LLM** on the anonymized text so the model provider never sees raw PII.\n", + "4. **Deanonymize** — restore original values in the LLM's response using the mapping.\n", + "\n", + "The LLM call uses OpenAI if `OPENAI_API_KEY` is set, and falls back to a hand-crafted response otherwise, so the notebook runs end-to-end with no API key." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q llm-safe-pl openai" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The scenario\n", + "\n", + "A Polish customer-service email containing multiple identifiers. 
This is the kind of text you might want an LLM to summarize without exposing the underlying PII to your model provider." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "document = \"\"\"\\\n", + "Dzień dobry,\n", + "\n", + "Piszę w sprawie zamówienia nr INV-2025-00412. Klient: Jan Kowalski,\n", + "PESEL 44051401359, kontakt +48 600 123 456, e-mail jan.kowalski@example.pl.\n", + "Faktura wystawiona na firmę Acme Polska Sp. z o.o., NIP 526-000-12-46,\n", + "REGON 123456785, adres: ul. Marszałkowska 1, 00-001 Warszawa.\n", + "Przelew na IBAN PL61 1090 1014 0000 0712 1981 2874 nie dotarł do 2025-03-18.\n", + "\n", + "Proszę o sprawdzenie statusu i kontakt zwrotny.\n", + "Pozdrawiam,\n", + "Anna Nowak\n", + "\"\"\"\n", + "\n", + "print(document)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1 — detect\n", + "\n", + "`Shield.detect()` returns an ordered tuple of `Match` objects. Each match carries its type, the raw value, its character offsets, and which detector fired — useful when you want an audit trail without yet committing to rewriting the text." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "from llm_safe_pl import Shield\n\nshield = Shield()\nmatches = shield.detect(document)\n\nprint(f\"Found {len(matches)} PII hits:\\n\")\nfor m in matches:\n print(f\" [{m.type.value:<11}] {m.value!r:<40} at {m.start}-{m.end} (detector: {m.detector})\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2 — anonymize\n", + "\n", + "`Shield.anonymize()` runs the same pipeline as `detect()`, but also rewrites the text and updates the shield's `Mapping`. 
The same `(type, value)` pair always maps to the same token within one shield, so if the PESEL `44051401359` appears in three documents anonymized by that shield, it gets the same `[PESEL_001]` in all three.\n", + "\n", + "Formatted identifiers (dashes in NIP, spaces in IBAN) are preserved byte-for-byte on round-trip." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = shield.anonymize(document)\n", + "\n", + "print(\"Anonymized — safe to send to an LLM:\\n\")\n", + "print(result.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3 — call the LLM\n", + "\n", + "The anonymized text is what you would send to OpenAI, Anthropic, or any local model. Tokens like `[PESEL_001]` look like ordinary placeholders and the model will happily quote them back in its reply.\n", + "\n", + "**System-prompt tip:** ask the model to keep `[TYPE_NNN]` tokens intact. Not strictly required, but it reduces the chance the model rephrases `[PESEL_001]` into something else.\n", + "\n", + "If `OPENAI_API_KEY` is set in your Colab environment, the real API is called. Otherwise we simulate a plausible summary so the rest of the notebook still works." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import os\n\nSYSTEM = (\n \"You are a Polish-language customer service assistant. \"\n \"Summarize the user's message in 4 bullet points. 
\"\n \"Keep every placeholder of the form [TYPE_NNN] intact — do not rename, \"\n \"translate, or expand them.\"\n)\n\nif os.environ.get(\"OPENAI_API_KEY\"):\n from openai import OpenAI\n\n client = OpenAI()\n response = client.chat.completions.create(\n model=\"gpt-4o-mini\",\n messages=[\n {\"role\": \"system\", \"content\": SYSTEM},\n {\"role\": \"user\", \"content\": result.text},\n ],\n )\n llm_output = response.choices[0].message.content or \"\"\nelse:\n llm_output = (\n \"Podsumowanie zgłoszenia:\\n\"\n \"- Klient [PESEL_001] zgłasza brak przelewu dla zamówienia INV-2025-00412.\\n\"\n \"- Kontakt zwrotny: [PHONE_001] lub [EMAIL_001].\\n\"\n \"- Faktura VAT [NIP_001], REGON [REGON_001].\\n\"\n \"- IBAN [IBAN_001] — sprawdzić status transakcji.\"\n )\n\nprint(\"LLM response (still anonymized):\\n\")\nprint(llm_output)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4 — deanonymize\n", + "\n", + "`Shield.deanonymize()` uses the mapping built during `anonymize()` to put real values back. This is the step that closes the loop: the model never saw the originals, but downstream systems — a ticket, a CRM write, an outgoing email — see fully restored text." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "restored = shield.deanonymize(llm_output)\n", + "\n", + "print(\"Final, de-anonymized output:\\n\")\n", + "print(restored)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Persisting the mapping\n", + "\n", + "Real LLM workflows are often async: you anonymize now, call the model later, and deanonymize when the response comes back — possibly from a different process. `Mapping` is JSON-serializable so it rides along with whatever queue or task record you are using." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "from llm_safe_pl import Mapping\n", + "\n", + "serialized = result.mapping.to_json()\n", + "print(\"Serialized mapping:\\n\")\n", + "print(json.dumps(json.loads(serialized), indent=2, ensure_ascii=False))\n", + "\n", + "rehydrated = Mapping.from_json(serialized)\n", + "print(\"\\nRestored via a rehydrated mapping (e.g. from a queue):\\n\")\n", + "print(shield.deanonymize(llm_output, mapping=rehydrated))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What this install did — and did not — catch\n", + "\n", + "The core install uses **stdlib + `typer` only**. That is enough for every identifier with a deterministic format: PESEL, NIP, REGON, ID card, passport, phone, email, IBAN, credit card. Checksum validators (PESEL, NIP, REGON, Luhn, mod-97 IBAN) filter out valid-looking-but-wrong numbers so the audit log is not flooded with false positives.\n", + "\n", + "**Not caught in the core install** — these are in the document above but absent from the audit trail:\n", + "\n", + "- `Jan Kowalski`, `Anna Nowak` — person names\n", + "- `Acme Polska Sp. z o.o.` — organization\n", + "- `ul. 
Marszałkowska 1, 00-001 Warszawa` — address\n", + "\n", + "These require the optional `[ner]` extra, which pulls in spaCy and a Polish model:\n", + "\n", + "```bash\n", + "pip install \"llm-safe-pl[ner]\"\n", + "python -m spacy download pl_core_news_lg\n", + "```\n", + "\n", + "NER support ships in v0.1.1 — see the [roadmap](https://github.com/Tatarinho/llm-safe-pl#roadmap).\n", + "\n", + "## Next steps\n", + "\n", + "- [README](https://github.com/Tatarinho/llm-safe-pl) — full feature list and install options.\n", + "- [`docs/llm_workflow.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/llm_workflow.md) — deeper guidance on the anonymize → LLM → deanonymize pattern.\n", + "- [`docs/limitations.md`](https://github.com/Tatarinho/llm-safe-pl/blob/main/docs/limitations.md) — read before shipping to production.\n", + "\n", + "Found a false positive, a missed identifier, or have a feature idea? [Open an issue](https://github.com/Tatarinho/llm-safe-pl/issues). Stars welcome." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file