phi-redactor

The only LLM proxy built to HIPAA §164.514(c) — real PHI never reaches the LLM

phi-redactor is an open-source, drop-in PHI masking proxy that sits between your healthcare AI applications and LLM providers (OpenAI, Anthropic). It automatically detects all 18 HIPAA PHI identifier categories and replaces them with synthetic tokens before the request leaves your network — then restores original values locally from an encrypted vault.

The core guarantee: real PHI never reaches the LLM. Ever.

The LLM provider receives only synthetic values (e.g. "James Wilson", "07/22/1957") that are generated independently of the originals and have no mathematical or derivable relationship to them. The synthetic values themselves therefore reveal nothing about the originals: the only mapping between synthetic and original values lives in a Fernet-encrypted SQLite vault that never leaves your infrastructure.

This is fundamentally different from:

- Redaction ([REDACTED]): destroys clinical context, so the LLM cannot reason effectively
- Hashing: deterministic, so low-entropy values (SSNs, dates, phone numbers) can be recovered by brute force or rainbow tables
- Truncation: partial PHI is still present
- Safe Harbor removal: identifiers are removed from a document that still describes a real patient
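
To make the hashing point concrete, here is a minimal sketch of recovering a hashed value by enumerating a small input space. The 4-digit suffix and the helper names (`sha256_hex`, `recover_by_enumeration`) are assumptions made for illustration; they are not part of phi-redactor.

```python
import hashlib
from typing import Iterable, Optional


def sha256_hex(value: str) -> str:
    """Return the hex SHA-256 digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover_by_enumeration(target_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Hash every candidate until one matches the target digest."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


# Hypothetical example: a 4-digit suffix has only 10,000 possible values,
# so trying all of them takes a fraction of a second.
leaked_digest = sha256_hex("6789")
all_suffixes = (f"{n:04d}" for n in range(10_000))
recovered = recover_by_enumeration(leaked_digest, all_suffixes)
print(recovered)
```

The same loop works for any identifier whose format limits the number of possible values, which is why a bare hash is a weak substitute for a value generated independently of the original.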

With phi-redactor, the LLM provider receives data that describes a fictional patient. The original patient identity exists only in your encrypted local vault.
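
The idea can be sketched in a few lines, assuming detection has already produced the list of PHI strings. `SessionMapping`, `mask`, `rehydrate`, and the small `FAKE_NAMES` pool are hypothetical names used only for illustration; the real project generates values with Faker and keeps the mapping in its encrypted vault rather than in a plain dict.

```python
import secrets
from typing import Dict, Iterable, List, Tuple

# Hypothetical pool of fake names; per the README the real project uses Faker.
FAKE_NAMES = ["James Wilson", "Laura Bennett", "Peter Novak", "Dana Whitfield"]


class SessionMapping:
    """In-memory stand-in for the encrypted vault (illustration only)."""

    def __init__(self) -> None:
        self._to_synthetic: Dict[str, str] = {}

    def synthetic_for(self, original: str) -> str:
        """Return the synthetic value for `original`, creating it on first use."""
        if original not in self._to_synthetic:
            used = set(self._to_synthetic.values())
            unused = [n for n in FAKE_NAMES if n not in used and n != original]
            if not unused:
                raise RuntimeError("fake-name pool exhausted")
            # Chosen at random, independently of the original value.
            self._to_synthetic[original] = secrets.choice(unused)
        return self._to_synthetic[original]

    def pairs(self) -> List[Tuple[str, str]]:
        """Return the (original, synthetic) pairs recorded so far."""
        return list(self._to_synthetic.items())


def mask(text: str, detected: Iterable[str], mapping: SessionMapping) -> str:
    """Replace each detected PHI string with its synthetic counterpart."""
    for original in detected:
        text = text.replace(original, mapping.synthetic_for(original))
    return text


def rehydrate(text: str, mapping: SessionMapping) -> str:
    """Swap synthetic values back to the originals, locally."""
    for original, synthetic in mapping.pairs():
        text = text.replace(synthetic, original)
    return text


mapping = SessionMapping()
prompt = "Patient John Smith has Type 2 Diabetes."
masked = mask(prompt, ["John Smith"], mapping)  # what would leave the network
restored = rehydrate(masked, mapping)           # what the caller gets back
print(masked)
print(restored)
```

Because the mapping is reused, the same original always maps to the same synthetic value within a session, which is what keeps multi-turn conversations consistent.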

Legal posture: phi-redactor is a privacy-by-design PHI minimization proxy. This technique (pseudonymization with encrypted local token mapping) is not the HIPAA Safe Harbor method (45 CFR §164.514(b)(2)), which requires removal rather than replacement. Healthcare organizations should maintain a BAA with their LLM provider and consult legal counsel. See Compliance Posture for a full breakdown.

Your App  -->  phi-redactor (localhost:8080)  -->  OpenAI / Anthropic
                    |                                      |
              [detect PHI]                          [masked request]
              [mask with fakes]                     [LLM processes]
              [vault mapping]                       [response back]
                    |                                      |
              [rehydrate response]  <--  [masked response]

Why phi-redactor?

Problem	Solution
PHI reaches cloud LLM providers	Proxy intercepts and replaces PHI with synthetic tokens — LLM never receives real data
Concern that synthetic tokens can be reversed	The mapping lives only in an encrypted local vault; the synthetic values alone carry no information about the originals
Redaction destroys clinical context	Synthetic values are clinically coherent — the LLM reasons about a fictional patient with intact context
No audit trail	Tamper-evident hash-chain audit log records every redaction event
Complex integration	Zero code changes — just update your base URL
Multi-turn context loss	Encrypted vault preserves token mappings across conversation turns

Quick Start

Install

pip install phi-redactor
python -m spacy download en_core_web_lg

Start the proxy

phi-redactor serve --port 8080

Use with OpenAI (zero code changes)

from openai import OpenAI

# Just change the base_url -- everything else stays the same
client = OpenAI(
    api_key="your-openai-key",
    base_url="http://localhost:8080/v1",  # <-- only change needed
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Patient John Smith (SSN: 123-45-6789) has Type 2 Diabetes."
    }]
)
print(response.choices[0].message.content)
# PHI is automatically redacted before reaching OpenAI,
# and restored in the response you receive

Use with Anthropic

import anthropic

client = anthropic.Anthropic(
    api_key="your-anthropic-key",
    base_url="http://localhost:8080/anthropic",
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Dr. Maria Garcia (NPI: 1234567890) prescribed metformin."
    }]
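)
print(message.content[0].text)
# As with OpenAI, PHI is redacted before the request leaves your network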

Library API (no LLM needed)

import httpx

# Redact text directly
resp = httpx.post("http://localhost:8080/api/v1/redact", json={
    "text": "Patient Jane Doe SSN 987-65-4321 seen on 01/15/2026."
})
result = resp.json()
print(result["redacted_text"])  # PHI replaced with synthetic values
session_id = result["session_id"]

# Rehydrate later
resp = httpx.post("http://localhost:8080/api/v1/rehydrate", json={
    "text": result["redacted_text"],
    "session_id": session_id,
})
print(resp.json()["text"])  # Original PHI restored

All 18 HIPAA PHI Identifier Categories

phi-redactor detects and pseudonymizes all 18 PHI identifier categories defined under HIPAA (45 CFR §164.514(b)). Note: replacement values are synthetic but realistic — this is pseudonymization, not de-identification.

#	Category	Detection Method	Example
1	Person Names	NER + Pattern	John Smith -> James Wilson
2	Geographic Data	NER + Pattern	Springfield, IL -> Portland, OR
3	Dates	Pattern + NER	03/15/1956 -> 07/22/1955
4	Phone Numbers	Pattern	(555) 123-4567 -> (555) 987-6543
5	Fax Numbers	Pattern	Fax: 555-0100 -> Fax: 555-0299
6	Email Addresses	Pattern	john@test.com -> james@example.net
7	SSN	Pattern	123-45-6789 -> 987-65-4321
8	Medical Record Numbers	Pattern	MRN: 00456789 -> MRN: 00891234
9	Health Plan IDs	Pattern	BCBS-987654321 -> AETNA-123456789
10	Account Numbers	Pattern	ACC-00112233 -> ACC-99887766
11	License/DEA/NPI	Pattern	NPI: 1234567890 -> NPI: 9876543210
12	Vehicle IDs	Pattern	VIN: 1HGBH41... -> VIN: 2FGCD52...
13	Device IDs (UDI)	Pattern	UDI: (01)12345... -> UDI: (01)98765...
14	URLs	Pattern	https://patient-portal.com -> https://example.com
15	IP Addresses	Pattern	192.168.1.100 -> 10.0.0.42
16	Biometric IDs	Pattern	Fingerprint hash -> BIO-a1b2c3d4
17	Photos	Detection	[REDACTED_PHOTO]
18	Other Unique IDs	Pattern	ID-12345678 -> ID-87654321
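
As a rough illustration of what the "Pattern" detection method in the table involves, the sketch below runs a handful of regular expressions over a text and reports where each match starts and ends. The patterns, the `Finding` type, and `find_phi` are simplified assumptions for this example; they are not the project's recognizers, which per the Core Components section build on Presidio and spaCy.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    category: str
    start: int
    end: int
    text: str


# Simplified, illustrative patterns. Production recognizers need many more
# formats, context words, and checksum validation than this.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_phi(text: str) -> List[Finding]:
    """Return every pattern match, ordered by position in the text."""
    findings = [
        Finding(category, m.start(), m.end(), m.group())
        for category, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


sample = "SSN 123-45-6789, call (555) 123-4567 or john@test.com from 192.168.1.100."
findings = find_phi(sample)
for finding in findings:
    print(finding.category, finding.text)
```

Names, locations, and free-text dates do not follow fixed formats, which is why the table lists NER alongside patterns for those categories.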

Architecture

+------------------+     +-------------------+     +------------------+
|   Your App       | --> |   phi-redactor    | --> |  LLM Provider    |
|   (OpenAI SDK)   |     |   (localhost)     |     |  (OpenAI/Claude) |
+------------------+     +-------------------+     +------------------+
                          |                 |
                    +-----+-----+     +-----+-----+
                    | Detection |     |  Masking   |
                    | Engine    |     |  Engine    |
                    | (Presidio |     | (Faker +   |
                    |  + spaCy) |     |  Custom)   |
                    +-----------+     +-----------+
                          |                 |
                    +-----+-----+     +-----+-----+
                    | Encrypted |     |   Audit   |
                    | Vault     |     |   Trail   |
                    | (SQLite + |     | (Hash-    |
                    |  Fernet)  |     |  chain)   |
                    +-----------+     +-----------+

Core Components

Component	Description
Detection Engine	Presidio + spaCy NER + 8 custom HIPAA recognizers
Masking Engine	Faker-based semantic replacement with healthcare-specific providers
Encrypted Vault	Fernet-encrypted SQLite for PHI-to-synthetic mappings
Proxy Server	FastAPI reverse proxy with OpenAI + Anthropic adapters
Audit Trail	Append-only hash-chain JSON Lines log (tamper-evident)
Audit Reports	PHI detection and redaction activity report generator

API Endpoints

Proxy Routes

Method	Path	Description
POST	`/v1/chat/completions`	OpenAI chat proxy (drop-in compatible)
POST	`/v1/embeddings`	OpenAI embeddings proxy
POST	`/anthropic/v1/messages`	Anthropic Messages API proxy

Library Routes

Method	Path	Description
POST	`/api/v1/redact`	Detect and redact PHI from text
POST	`/api/v1/rehydrate`	Restore original PHI from redacted text

Management Routes

Method	Path	Description
GET	`/api/v1/health`	Health check and system info
GET	`/api/v1/stats`	Aggregate redaction statistics
GET	`/api/v1/sessions`	List all sessions
GET	`/api/v1/compliance/report`	PHI detection and redaction activity report
GET	`/api/v1/compliance/summary`	Quick redaction activity summary
GET	`/api/v1/audit`	Query audit trail events

CLI Commands

phi-redactor serve [--port 8080] [--host 0.0.0.0]   # Start the proxy
phi-redactor redact --file patient_notes.txt          # Batch file redaction
phi-redactor report --full --output report.json       # Compliance report
phi-redactor version                                   # Show version

Configuration

All settings can be configured via environment variables with the PHI_REDACTOR_ prefix:

PHI_REDACTOR_PORT=8080              # Proxy port
PHI_REDACTOR_HOST=0.0.0.0          # Bind address
PHI_REDACTOR_SENSITIVITY=0.5       # Detection sensitivity (0.0=aggressive, 1.0=permissive)
PHI_REDACTOR_LOG_LEVEL=INFO        # Logging level
PHI_REDACTOR_VAULT_PASSPHRASE=...  # Optional vault encryption passphrase
PHI_REDACTOR_SESSION_IDLE_TIMEOUT=1800   # Session idle timeout (seconds)
PHI_REDACTOR_SESSION_MAX_LIFETIME=86400  # Session max lifetime (seconds)
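
For readers wiring these variables into their own tooling, here is a minimal sketch of reading the PHI_REDACTOR_ prefixed variables into a typed settings object. The `Settings` class and `load_settings` function are hypothetical, and the defaults simply reuse the example values above; this is not the project's own configuration loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "PHI_REDACTOR_"


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the example values above (an assumption, not confirmed).
    port: int = 8080
    host: str = "0.0.0.0"
    sensitivity: float = 0.5
    log_level: str = "INFO"
    vault_passphrase: Optional[str] = None
    session_idle_timeout: int = 1800
    session_max_lifetime: int = 86400


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from prefixed variables, falling back to defaults."""
    source = os.environ if env is None else env
    defaults = Settings()

    def raw(name: str) -> Optional[str]:
        # Unset and empty variables both fall back to the default.
        return source.get(PREFIX + name) or None

    settings = Settings(
        port=int(raw("PORT") or defaults.port),
        host=raw("HOST") or defaults.host,
        sensitivity=float(raw("SENSITIVITY") or defaults.sensitivity),
        log_level=raw("LOG_LEVEL") or defaults.log_level,
        vault_passphrase=raw("VAULT_PASSPHRASE"),
        session_idle_timeout=int(raw("SESSION_IDLE_TIMEOUT") or defaults.session_idle_timeout),
        session_max_lifetime=int(raw("SESSION_MAX_LIFETIME") or defaults.session_max_lifetime),
    )
    if not 0.0 <= settings.sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    return settings


example = load_settings({"PHI_REDACTOR_PORT": "9090", "PHI_REDACTOR_SENSITIVITY": "0.0"})
print(example.port, example.sensitivity, example.host)
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test.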

Security Design

PHI never logged: PHI-safe log formatter scrubs all known patterns
Encryption at rest: Fernet encryption (AES-128-CBC with HMAC-SHA256 authentication) for vault entries
Hash-chain audit: Every redaction event is chained via SHA-256 hashes
Fail-safe: Detection/masking failures block requests (never pass through)
Session isolation: Each session has independent vault mappings
Key rotation: Built-in support for encryption key rotation
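
The hash-chain idea can be sketched as follows: each log line stores the hash of the previous line, so editing any earlier entry breaks every later link. The field names (`event`, `prev_hash`, `hash`), the all-zero genesis value, and the function names are assumptions for illustration, not the project's actual log format.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed predecessor value for the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: List[str], event: Dict) -> None:
    """Append one JSON line that is chained to the previous line."""
    prev_hash = json.loads(log[-1])["hash"] if log else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(json.dumps(entry, sort_keys=True))


def verify_chain(log: List[str]) -> bool:
    """Recompute every hash and confirm each entry links to its predecessor."""
    prev_hash = GENESIS_HASH
    for line in log:
        entry = json.loads(line)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Events carry only categories and counts, never the PHI itself.
audit_log: List[str] = []
append_event(audit_log, {"action": "redact", "category": "SSN", "count": 1})
append_event(audit_log, {"action": "redact", "category": "PERSON", "count": 2})
print(verify_chain(audit_log))
```

A chain like this detects edits and deletions in the middle of the log. Detecting truncation from the end requires recording the latest hash somewhere outside the log.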

Development

# Clone and install
git clone https://github.com/dilawar-gopang/phi-redactor.git
cd phi-redactor
pip install -e ".[dev]"
python -m spacy download en_core_web_lg

# Run tests
pytest

# Lint and type check
ruff check src/ tests/
mypy src/

Compliance Posture

The security guarantee

phi-redactor makes one hard guarantee: the LLM provider never receives real PHI.

Here is exactly what happens at the cryptographic level:

1. PHI is detected in the request (e.g., "John Smith").
2. A synthetic token is generated via Faker (e.g., "James Wilson") with no derivable relationship to the original.
3. The mapping John Smith → James Wilson is stored in the vault as:
   - Key (lookup): SHA-256 hash of "John Smith", a one-way digest used only to find the entry
   - Original: Fernet-encrypted ciphertext (AES-128-CBC with HMAC-SHA256), key stored only on your machine
   - Synthetic: plaintext "James Wilson", safe to store because it reveals nothing
4. The LLM receives "James Wilson", a fictional person with no connection to your patient.
5. After the LLM responds, rehydration replaces "James Wilson" with "John Smith" locally.

An attacker who intercepts the network traffic sees only synthetic tokens. An attacker who steals the SQLite vault file without the key file sees only lookup hashes, ciphertext, and synthetic values. Neither has what is needed to decrypt the original PHI.
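
The record layout described above can be sketched with the standard-library `sqlite3` module. The table name, column names, and the `VaultSketch` class are assumptions for illustration, and the cipher is passed in as a pair of callables so the example does not depend on a third-party package; in a real deployment those callables would be the `encrypt` and `decrypt` methods of a `cryptography.fernet.Fernet` instance.

```python
import hashlib
import sqlite3
from typing import Callable, Optional

Cipher = Callable[[bytes], bytes]


class VaultSketch:
    """Hashed lookup key, encrypted original, plaintext synthetic, per session."""

    def __init__(self, encrypt: Cipher, decrypt: Cipher) -> None:
        self._encrypt = encrypt
        self._decrypt = decrypt
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE mappings ("
            " session_id TEXT NOT NULL,"
            " lookup_hash TEXT NOT NULL,"
            " original_ciphertext BLOB NOT NULL,"
            " synthetic TEXT NOT NULL,"
            " PRIMARY KEY (session_id, lookup_hash))"
        )

    @staticmethod
    def _lookup_hash(original: str) -> str:
        return hashlib.sha256(original.encode("utf-8")).hexdigest()

    def store(self, session_id: str, original: str, synthetic: str) -> None:
        """Record a mapping; the original is only ever written as ciphertext."""
        ciphertext = self._encrypt(original.encode("utf-8"))
        self._db.execute(
            "INSERT OR IGNORE INTO mappings VALUES (?, ?, ?, ?)",
            (session_id, self._lookup_hash(original), ciphertext, synthetic),
        )

    def synthetic_for(self, session_id: str, original: str) -> Optional[str]:
        """Find an existing synthetic value without decrypting anything."""
        row = self._db.execute(
            "SELECT synthetic FROM mappings WHERE session_id = ? AND lookup_hash = ?",
            (session_id, self._lookup_hash(original)),
        ).fetchone()
        return row[0] if row else None

    def original_for(self, session_id: str, synthetic: str) -> Optional[str]:
        """Rehydration path: decrypt the original for a synthetic value."""
        row = self._db.execute(
            "SELECT original_ciphertext FROM mappings WHERE session_id = ? AND synthetic = ?",
            (session_id, synthetic),
        ).fetchone()
        return self._decrypt(row[0]).decode("utf-8") if row else None


# Usage with the cryptography package (not imported here):
#   from cryptography.fernet import Fernet
#   fernet = Fernet(Fernet.generate_key())
#   vault = VaultSketch(fernet.encrypt, fernet.decrypt)
#   vault.store("session-1", "John Smith", "James Wilson")
```

Scoping every query by `session_id` is one simple way to get the session isolation listed under Security Design.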

What this is (and is not) legally

	phi-redactor	HIPAA Safe Harbor (45 CFR §164.514(b)(2))
PHI reaches LLM provider	Never	N/A — assumes data already de-identified
Identifiers present in transmitted data	Yes (synthetic/fictional)	No — all 18 must be removed
Re-identification possible by LLM provider	No, not from the synthetic identifiers alone	N/A
Qualifies as de-identification	No	Yes (if all 18 removed)
BAA with LLM provider still required	Yes	Not applicable

phi-redactor is a privacy-by-design pseudonymization proxy — not a de-identification engine under the Safe Harbor definition. Safe Harbor requires removal; phi-redactor does something different: it ensures the entity receiving the data (the LLM provider) cannot link it to a real person.

License

Apache License 2.0. See LICENSE for details.

Contributing

Contributions welcome! Please open an issue first to discuss what you'd like to change.

Built to HIPAA §164.514(c). The LLM gets a fictional patient. Your vault keeps the truth.
