diff --git a/README.md b/README.md index eca4b99..e5e2f3e 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,8 @@ psql -d localdb < subset.sql - **Zero-config start** -- Introspects schema automatically, no data model file required - **Single command** -- Extract complete data subsets with one CLI invocation - **Safe by default** -- Auto-detects and anonymizes sensitive fields (emails, phones, SSNs, etc.) +- **Compliance profiles** -- Built-in GDPR, HIPAA Safe Harbor, and PCI-DSS profiles with two-phase PII scanning +- **Column mapping UI** -- Local browser UI to visually map columns, apply compliance profiles, and export config - **Multiple output formats** -- SQL, JSON, and CSV - **Streaming** -- Memory-efficient extraction for large datasets (100K+ rows) - **Virtual foreign keys** -- Support for Django GenericForeignKeys and implicit relationships via config @@ -100,6 +102,44 @@ dbslice extract postgres://... --seed "users.id=1" --anonymize dbslice extract postgres://... --seed "users.id=1" --anonymize --redact "audit_logs.ip_address" ``` +### Column Mapping UI + +Map columns visually, apply compliance profiles, and generate a ready-to-use config — all from a local browser UI. + +```bash +dbslice map postgres://localhost/myapp + +# Custom port +dbslice map postgres://localhost/myapp --port 8888 + +# Also works with uvx (no install needed) +uvx dbslice map postgres://localhost/myapp +``` + + + + + + + + + + +
[Screenshots: "Column mapping" (Map columns to anonymization rules) | "Generated config" (Generate and export config)]
+ +Runs on `127.0.0.1:9473` with a one-time session token — no data leaves your machine. Apply GDPR, HIPAA, or PCI-DSS profiles with one click, review what gets masked, then download the YAML. + +### Compliance Profiles + +```bash +# HIPAA Safe Harbor — auto-masks all 18 identifier types +dbslice extract postgres://... --seed "patients.id=1" --compliance hipaa --compliance-strict + +# Multiple profiles + audit manifest +dbslice extract postgres://... --seed "users.id=1" --compliance gdpr --compliance pci-dss -f subset.sql +# Produces subset.sql + subset.manifest.json +``` + ### Output Formats ```bash @@ -170,6 +210,9 @@ dbslice extract --config dbslice.yaml --seed "orders.id=12345" | Configuration | Zero-config | Requires model file | Config required | Manual YAML | | Setup time | Seconds | Hours | Medium | Medium | | Anonymization | Built-in (Faker) | Plugin-based | Advanced transformers | Not available | +| Compliance profiles | GDPR, HIPAA, PCI-DSS | None | None | None | +| Column mapping UI | Built-in (local) | None | None | None | +| PII value scanning | Two-phase (pre/post mask) | None | None | None | | Subsetting | FK traversal | FK traversal | Limited | FK traversal | | Output formats | SQL, JSON, CSV | SQL, XML, CSV | SQL | SQL only | | Cycle handling | Automatic | Manual config | N/A | Manual | diff --git a/docs/assets/mapping.png b/docs/assets/mapping.png new file mode 100644 index 0000000..268d415 Binary files /dev/null and b/docs/assets/mapping.png differ diff --git a/docs/assets/mapping_instructions.png b/docs/assets/mapping_instructions.png new file mode 100644 index 0000000..5ed636f Binary files /dev/null and b/docs/assets/mapping_instructions.png differ diff --git a/docs/help/best-practices.md b/docs/help/best-practices.md index c1de711..affa47c 100644 --- a/docs/help/best-practices.md +++ b/docs/help/best-practices.md @@ -20,6 +20,19 @@ dbslice extract postgres://prod/db --seed "users.id=1" --anonymize Never extract production data without 
`--anonymize`. Foreign keys are preserved automatically. +## 2b. Use Compliance Profiles for Regulated Data + +```bash +dbslice extract postgres://prod/db --seed "users.id=1" \ + --compliance hipaa --compliance-strict +``` + +Compliance profiles (GDPR, HIPAA, PCI-DSS) auto-configure anonymization, run value-based PII scanning, and generate audit manifests. Use `--compliance-strict` to fail if unmasked PII is detected. + +## 2c. Treat Output as Pseudonymized Data + +Deterministic mode is **pseudonymization**, not full anonymization. For higher privacy, set `anonymization.deterministic: false` and still keep operational controls (least privilege DB account, restricted output location, and manifest review). + ## 3. Validate Extractions ```bash @@ -96,6 +109,29 @@ dropdb test_import Always verify extracted data loads cleanly into an isolated database before relying on it. +## 11. Use a Compliance Runbook in CI + +Suggested CI flow: +1. `dbslice inspect --compliance-check ... --compliance-output json` on target schema. +2. `dbslice extract ... --out-file ...` with compliance profiles. +3. `dbslice verify-manifest ...` to confirm output file hashes. +4. Optionally sign manifest + output with an external tool (cosign, GPG) for non-repudiation. +5. Archive artifacts to immutable storage (S3 Object Lock, GCS retention, etc.). + +## 12. Compliance Controls (Quick Reference) + +These are **runtime CLI checks**, not an IAM or governance system. They reduce accidental mistakes but are not a substitute for network-level controls, access policies, or encryption at rest. + +| Risk | Control | Limitation | +|------|---------|------------| +| Unmasked PII reaches dev/test | `--compliance ... 
--compliance-strict`, profile rules, residual scan | Pattern-based detection only; may miss PII in unusual column names or embedded in binary data | +| Unsafe ad-hoc extraction | `compliance.policy_mode: standard`, breakglass override with reason + ticket | CLI flags can be bypassed by not using the config file | +| Unknown data source used | `compliance.allow_url_patterns` / `deny_url_patterns` | Regex on URL string; does not prevent DNS aliasing or network-level bypass | +| Non-TLS DB connection | `compliance.required_sslmode` | Checks URL query param only; does not verify actual TLS handshake | +| Non-CI execution | `compliance.require_ci: true` | Checks `CI=true` env var, which can be set manually | +| Output tampering | Manifest `output_file_hashes` + `dbslice verify-manifest` | SHA256 file hashes detect changes after the fact | +| Manifest tampering | `compliance.sign_manifest: true` with HMAC-SHA256 | Symmetric key — tamper detection only, **not** non-repudiation. For provable origin, wrap with external signing (cosign, GPG) | + --- ## See Also diff --git a/docs/user-guide/advanced-usage.md b/docs/user-guide/advanced-usage.md index 817b198..6934551 100644 --- a/docs/user-guide/advanced-usage.md +++ b/docs/user-guide/advanced-usage.md @@ -117,6 +117,153 @@ dbslice extract \ Validation confirms all FK references remain intact after anonymization. +### Non-Deterministic Mode + +For stronger privacy guarantees, use non-deterministic mode where each value gets a random Faker seed instead of a deterministic one: + +```bash +dbslice extract \ + postgres://prod:5432/app \ + --seed "users.id=1" \ + --anonymize \ + --non-deterministic \ + --out-file strong_privacy.sql +``` + +Or in config: + +```yaml +anonymization: + enabled: true + deterministic: false +``` + +**Trade-off**: Same value in different tables may produce different fake values (e.g., "alice@example.com" might become "john@foo.com" in one table and "jane@bar.org" in another). 
Use deterministic mode when cross-table consistency matters. + +**Legal note**: Deterministic anonymization is technically **pseudonymization** under GDPR (same seed + input = same output = reversible). Non-deterministic mode is closer to true anonymization but structural linkage may still allow re-identification. + +--- + +## Compliance Profiles + +dbslice includes built-in compliance profiles for GDPR, HIPAA Safe Harbor, and PCI-DSS v4.0. Profiles auto-configure anonymization patterns, run value-based PII scanning, and generate audit manifests. + +### Using Compliance Profiles + +```bash +# HIPAA-compliant extraction +dbslice extract \ + postgres://medical-db:5432/ehr \ + --seed "patients.id=1" \ + --compliance hipaa \ + --out-file patient_subset.sql + +# Multiple profiles +dbslice extract \ + postgres://prod:5432/app \ + --seed "users.id=1" \ + --compliance gdpr \ + --compliance pci-dss \ + --out-file compliant_subset.sql +``` + +Or in config: + +```yaml +compliance: + profiles: [hipaa, gdpr] + strict: true + generate_manifest: true +``` + +### Available Profiles + +| Profile | Description | Key Coverage | +|---------|-------------|-------------| +| `gdpr` | EU General Data Protection Regulation | Names, email, phone, address, IP, DOB, SSN, financial IDs, online identifiers | +| `hipaa` | HIPAA Safe Harbor de-identification | All 18 Safe Harbor identifiers: names, dates, geographic data, phone, fax, email, SSN, medical record numbers, health plan IDs, account numbers, license numbers, vehicle/device IDs, URLs, IPs, biometrics, photos, unique IDs | +| `pci-dss` | PCI-DSS v4.0 | PAN (credit card), cardholder name, expiration date, service code; CVV/PIN NULLed (never faked) | + +### What Compliance Profiles Do + +When a profile is active: + +1. **Auto-enable anonymization** -- no need for `--anonymize` +2. **Merge column patterns** -- profile-defined patterns are added to your anonymization config +3. 
**Apply security NULL rules** -- profile-specific fields are forced to NULL (e.g., CVV for PCI-DSS) +4. **Run value-based PII scanning** -- regex patterns scan actual data values (not just column names) for email, SSN, phone numbers, IP addresses, and credit card numbers (with Luhn validation) +5. **Flag free-text columns** -- columns like `notes`, `comments`, `description` are flagged as potential PII containers +6. **Generate audit manifest** -- a JSON manifest documenting what was anonymized + +### Strict Mode + +In strict mode, extraction fails if the PII scanner detects unmasked PII in the output: + +```bash +dbslice extract \ + postgres://prod:5432/app \ + --seed "users.id=1" \ + --compliance hipaa \ + --compliance-strict \ + --out-file subset.sql +``` + +This ensures no PII slips through to dev/test environments. + +### Audit Manifest + +When compliance profiles are active (or `--manifest` is passed), dbslice writes a `*.manifest.json` file alongside the output: + +```json +{ + "extraction_id": "550e8400-e29b-41d4-a716-446655440000", + "timestamp": "2026-03-06T10:30:00Z", + "dbslice_version": "0.5.0", + "masking_type": "deterministic_pseudonymization", + "compliance_profiles": ["hipaa"], + "seed_hash": "sha256:a1b2c3d4e5f6...", + "tables": { + "patients": { + "rows_extracted": 1, + "fields_masked": [ + {"column": "email", "method": "email", "category": ""}, + {"column": "ssn", "method": "ssn", "category": ""} + ], + "fields_nulled": [ + {"column": "password_hash", "reason": "security_null_pattern"} + ], + "fields_preserved_fk": ["id", "doctor_id"], + "fields_unmasked": ["created_at", "status"] + } + }, + "pii_scan_results": [], + "output_file_hashes": { + "subset.sql": "sha256:a1b2c3..." 
+ }, + "breakglass": {}, + "signature_algorithm": "", + "signature": "", + "warnings": [ + {"table": "visits", "column": "notes", "reason": "Free-text column may contain embedded PII", "severity": "warning"} + ] +} +``` + +This manifest provides structured evidence for audit reviews. It documents what dbslice did but is not a substitute for infrastructure-level audit logging. + +You can verify output file integrity later: + +```bash +# Verify output file hashes match +dbslice verify-manifest subset.manifest.json --no-verify-signature + +# Verify hashes + HMAC signature (if signing was enabled) +export DBSLICE_MANIFEST_SIGNING_KEY="your-key" +dbslice verify-manifest subset.manifest.json +``` + +Note: HMAC signing uses a shared symmetric key. It provides tamper detection (was the manifest modified after creation?) but not non-repudiation (it cannot prove *who* created it). For provable origin, wrap with an external signing tool (e.g., cosign, GPG) in your CI pipeline. + ### Compliance Use Cases **GDPR Right to Erasure** -- extract and anonymize before deletion: @@ -125,22 +272,71 @@ Validation confirms all FK references remain intact after anonymization. 
dbslice extract \ postgres://prod:5432/app \ --seed "users.id=12345" \ - --anonymize \ + --compliance gdpr \ --out-file gdpr_erasure_backup.sql ``` -**HIPAA De-identification** -- anonymize plus redact clinical fields: +**HIPAA Safe Harbor De-identification**: ```bash dbslice extract \ postgres://medical-db:5432/ehr \ --seed "patients.mrn='12345'" \ - --anonymize \ - --redact "patients.social_security" \ - --redact "visits.notes" \ + --compliance hipaa \ + --compliance-strict \ --out-file patient_deidentified.sql ``` +**PCI-DSS: No Real PANs in Dev/Test** (Requirement 6.5.5): + +```bash +dbslice extract \ + postgres://billing:5432/payments \ + --seed "transactions.id=999" \ + --compliance pci-dss \ + --out-file test_transactions.sql +``` + +--- + +## Column Mapping UI + +Instead of manually writing anonymization config, use the built-in browser UI to visually map columns. + +### Launch + +```bash +dbslice map postgresql://localhost/myapp + +# Custom port +dbslice map postgresql://localhost/myapp --port 8888 +``` + +This opens a local server on `127.0.0.1:9473` with a session token for security. No data leaves your machine: the browser connects to the local `dbslice` process, which connects to the database. + +### Workflow + +1. **Introspect** -- Enter your database URL, click Introspect Schema. Only metadata is read. +2. **Apply profiles** -- Click GDPR, HIPAA, or PCI-DSS to auto-map columns matching the profile's rules. +3. **Review** -- For each column, set action to Keep, Anonymize, or NULL. Pick a provider from the dropdown. +4. **Export** -- Click Generate Config to produce a `dbslice.yaml`. Download it. +5. 
**Use** -- `dbslice extract --config dbslice.yaml --seed "table.column=value"` + +### What the UI shows + +- **Table list** with progress bars showing how many columns are mapped per table +- **Compliance profile chips** that overlay suggested mappings with one click +- **Provider dropdown** with descriptions (not a raw text input) +- **Summary panel** at the bottom: click "14 masked" to see all masked fields across all tables, grouped by table +- **Live YAML preview** that updates as you change mappings +- **Bulk actions** per table: Anonymize all, NULL all, Reset + +### Security + +- Server binds to `127.0.0.1` only (not `0.0.0.0`) +- Random session token generated at startup, required on all API requests +- No persistent state, no cookies, no external requests (except Tailwind CSS CDN for styling) + --- ## Streaming Large Datasets diff --git a/docs/user-guide/cli-reference.md b/docs/user-guide/cli-reference.md index 2ed97ac..a0d998f 100644 --- a/docs/user-guide/cli-reference.md +++ b/docs/user-guide/cli-reference.md @@ -9,6 +9,8 @@ Complete reference for the dbslice command-line interface. 
- [extract](#extract) - [init](#init) - [inspect](#inspect) + - [map](#map) + - [verify-manifest](#verify-manifest) - [Global Options](#global-options) - [Environment Variables](#environment-variables) - [Exit Codes](#exit-codes) @@ -91,6 +93,20 @@ dbslice extract [OPTIONS] [DATABASE_URL] |--------|------|---------|-------------| | `--anonymize` / `--no-anonymize`, `-a` | FLAG | Disabled | Enable/disable automatic anonymization of sensitive fields | | `--redact`, `-r` | TEXT | - | Additional fields to redact (repeatable, format: `table.column`) | +| `--non-deterministic` / `--deterministic` | FLAG | Deterministic | Use non-deterministic anonymization (random output each run, stronger privacy but no cross-table consistency) | + +##### Compliance Options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `--compliance` | TEXT | - | Compliance profile(s) to apply (repeatable): `gdpr`, `hipaa`, `pci-dss` | +| `--compliance-strict` / `--no-compliance-strict` | FLAG | Disabled | Fail extraction if value-based PII scanning detects unmasked PII | +| `--manifest` / `--no-manifest` | FLAG | Auto | Generate audit manifest (auto-enabled with `--compliance`) | +| `--allow-raw` | FLAG | Disabled | Breakglass override for compliance policy gates (requires reason + ticket) | +| `--breakglass-reason` | TEXT | - | Required justification when `--allow-raw` is used | +| `--ticket-id` | TEXT | - | Required tracking ticket/incident ID when `--allow-raw` is used | + +When compliance profiles are active, anonymization is auto-enabled and profile patterns are merged as fallback wildcard rules (`user exact fields > user patterns > profile patterns > built-ins`). Value-based scanning runs in two phases: coverage (pre-mask) identifies where PII exists, then residual (post-mask) checks only unprotected columns. Strict mode fails only on residual detections — it won't false-positive on correctly anonymized fields. 
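The precedence order described above (`user exact fields > user patterns > profile patterns > built-ins`) can be sketched roughly as below. This is an illustration only, not dbslice's implementation: `resolve_provider`, its arguments, and the sample rule names are hypothetical, and matching wildcards with `fnmatch` is an assumption about how the glob patterns behave.

```python
from fnmatch import fnmatchcase


def resolve_provider(column, exact_fields, user_patterns, profile_patterns, builtin_patterns):
    """Return the provider for a 'table.column' key, or None if nothing matches.

    Hypothetical helper: exact user fields are checked first, then each
    wildcard layer in precedence order (user > profile > built-in).
    """
    if column in exact_fields:
        return exact_fields[column]
    for layer in (user_patterns, profile_patterns, builtin_patterns):
        for pattern, provider in layer.items():
            if fnmatchcase(column, pattern):
                return provider
    return None


# Sample rules, invented for illustration.
exact = {"users.email": "safe_email"}
user = {"*.phone*": "phone_number"}
profile = {"*.email": "email", "*.phone": "msisdn", "*.ssn": "ssn"}
builtin = {"*.ssn": "random_number"}

print(resolve_provider("users.email", exact, user, profile, builtin))   # exact field wins
print(resolve_provider("orders.phone", exact, user, profile, builtin))  # user pattern beats profile
print(resolve_provider("patients.ssn", exact, user, profile, builtin))  # profile beats built-in
print(resolve_provider("orders.total", exact, user, profile, builtin))  # no rule applies
```

Treating profile patterns as a fallback layer means an explicit user rule is never silently overridden when a profile is switched on.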
##### Validation Options @@ -227,6 +243,36 @@ dbslice extract postgresql://localhost/myapp \ --redact customers.tax_id ``` +##### Compliance + +```bash +# Extract with HIPAA compliance profile +dbslice extract postgresql://localhost/myapp \ + -s "patients.id=1" \ + --compliance hipaa + +# Multiple compliance profiles with strict mode +dbslice extract postgresql://localhost/myapp \ + -s "users.id=1" \ + --compliance gdpr \ + --compliance pci-dss \ + --compliance-strict + +# Non-deterministic anonymization for stronger privacy +dbslice extract postgresql://localhost/myapp \ + -s "users.id=1" \ + --compliance gdpr \ + --non-deterministic + +# Generate audit manifest without compliance profile +dbslice extract postgresql://localhost/myapp \ + -s "users.id=1" \ + --anonymize \ + --manifest \ + -f subset.sql +# Writes subset.sql + subset.manifest.json +``` + ##### JSON Output ```bash @@ -439,6 +485,9 @@ dbslice inspect [OPTIONS] [DATABASE_URL] |--------|------|---------|-------------| | `--table`, `-t` | TEXT | - | Show details for a specific table | | `--schema` | TEXT | `public` | PostgreSQL schema name | +| `--compliance-check` | TEXT | - | Run compliance coverage check for profile(s): `gdpr`, `hipaa`, `pci-dss` | +| `--compliance-output` | TEXT | `human` | Compliance report output format: `human` or `json` | +| `--sample-rows` | INTEGER | `100` | Rows sampled per table for value-based compliance scan | #### Examples @@ -513,6 +562,101 @@ for table in users orders products; do done ``` +##### Compliance Coverage Check + +```bash +# Human-readable compliance check +dbslice inspect postgresql://localhost/myapp \ + --compliance-check gdpr + +# JSON report for CI pipelines +dbslice inspect postgresql://localhost/myapp \ + --compliance-check hipaa \ + --compliance-output json +``` + +--- + +### verify-manifest + +Verify manifest file hashes and optional HMAC signature. 
+ +#### Synopsis + +```bash +dbslice verify-manifest [OPTIONS] MANIFEST_FILE +``` + +#### Options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `--verify-signature` / `--no-verify-signature` | FLAG | Enabled | Verify HMAC signature when present | +| `--key-env` | TEXT | `DBSLICE_MANIFEST_SIGNING_KEY` | Env var containing signature key | + +#### Examples + +```bash +# Verify output hashes only +dbslice verify-manifest subset.manifest.json --no-verify-signature + +# Verify hashes + HMAC signature +export DBSLICE_MANIFEST_SIGNING_KEY="super-secret" +dbslice verify-manifest subset.manifest.json +``` + +--- + +### map + +Launch a local browser UI for visually mapping database columns to anonymization rules. + +#### Synopsis + +```bash +dbslice map [OPTIONS] [DATABASE_URL] +``` + +#### Arguments + +| Argument | Description | +|----------|-------------| +| `DATABASE_URL` | Optional database connection URL. Can also be entered in the browser UI. | + +#### Options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `--schema` | TEXT | `public` | PostgreSQL schema name | +| `--port`, `-p` | INTEGER | `9473` | Port for the local server | +| `--open-browser` / `--no-open-browser` | FLAG | Enabled | Auto-open browser on launch | + +#### Security + +The server binds to `127.0.0.1` only — it is not accessible from the network. A random session token is generated at startup and required for all requests. The token is passed via the URL when the browser opens. + +#### Examples + +```bash +# Launch mapping UI (enter URL in browser) +dbslice map + +# Pre-fill database URL +dbslice map postgresql://localhost/myapp + +# Custom port, no auto-open +dbslice map postgresql://localhost/myapp --port 8888 --no-open-browser +``` + +#### Workflow + +1. Enter database URL and click **Introspect Schema** +2. Optionally click **GDPR**, **HIPAA**, or **PCI-DSS** to apply compliance profile suggestions +3. 
Review each column: set action to **Keep**, **Anonymize**, or **NULL** +4. For anonymized columns, select a provider from the dropdown (e.g., `email`, `ssn`, `hipaa_zip3`) +5. Click **Generate Config** to export a `dbslice.yaml` +6. Use the config: `dbslice extract --config dbslice.yaml --seed "table.column=value"` + --- ## Global Options diff --git a/docs/user-guide/configuration.md b/docs/user-guide/configuration.md index d2b8b44..d9d792e 100644 --- a/docs/user-guide/configuration.md +++ b/docs/user-guide/configuration.md @@ -12,6 +12,7 @@ Complete reference for dbslice YAML configuration files. - [database](#database) - [extraction](#extraction) - [anonymization](#anonymization) + - [compliance](#compliance) - [output](#output) - [tables](#tables) - [performance](#performance) @@ -69,6 +70,7 @@ version: "1.0" # Optional config version tag (informational) database: # Database connection settings extraction: # Extraction behavior settings anonymization: # Anonymization configuration +compliance: # Compliance profiles and audit manifest (optional) output: # Output format settings tables: # Per-table configuration (optional) performance: # Performance tuning (optional) @@ -245,6 +247,7 @@ anonymization: fields: object # Exact table.column -> provider patterns: object # Wildcard table.column glob -> provider security_null_fields: list # Wildcard table.column globs to force NULL + deterministic: boolean # Use deterministic anonymization (default: true) ``` #### Fields @@ -256,6 +259,7 @@ anonymization: | `fields` | Object | No | `{}` | Exact map of `table.column` to Faker method | | `patterns` | Object | No | `{}` | Wildcard map of `table.column` glob to Faker method | | `security_null_fields` | List[String] | No | `[]` | Wildcard `table.column` globs to force `NULL` | +| `deterministic` | Boolean | No | `true` | Deterministic mode (same input = same output). 
Set `false` for non-deterministic anonymization with stronger privacy guarantees | Notes: - `fields` keys must be exact `table.column` entries (no wildcards). @@ -361,6 +365,118 @@ anonymization: --- +### compliance + +Compliance profile and audit manifest configuration. + +#### Schema + +```yaml +compliance: + profiles: list[string] # Compliance profiles to apply + strict: boolean # Fail if uncovered PII detected + generate_manifest: boolean # Generate audit manifest + policy_mode: string # Runtime policy gates: off|standard|strict + allow_url_patterns: list[string]# Regex allow-list for source DB URL + deny_url_patterns: list[string] # Regex deny-list for source DB URL + required_sslmode: string # Required sslmode query value in DB URL + require_ci: boolean # Require CI=true environment + sign_manifest: boolean # HMAC-sign manifest when key is available + manifest_key_env: string # Env var name containing signing key +``` + +#### Fields + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `profiles` | List[String] | No | `[]` | Compliance profiles: `gdpr`, `hipaa`, `pci-dss` | +| `strict` | Boolean | No | `false` | Fail extraction if value-based PII scanning detects unmasked PII | +| `generate_manifest` | Boolean | No | `false` | Generate a JSON audit manifest alongside output (auto-enabled when profiles are active) | +| `policy_mode` | String | No | `"off"` | Compliance policy gates: `off`, `standard`, `strict` | +| `allow_url_patterns` | List[String] | No | `[]` | Source DB URL must match one of these regex patterns (if set) | +| `deny_url_patterns` | List[String] | No | `[]` | Source DB URL must not match any of these regex patterns | +| `required_sslmode` | String | No | - | Required PostgreSQL `sslmode` query parameter value | +| `require_ci` | Boolean | No | `false` | Fail when running outside CI (`CI=true` expected) | +| `sign_manifest` | Boolean | No | `false` | Sign manifest with HMAC-SHA256 
(tamper detection, not non-repudiation) | +| `manifest_key_env` | String | No | `"DBSLICE_MANIFEST_SIGNING_KEY"` | Env var containing HMAC signing key (shared secret) | + +#### Compliance Profiles + +| Profile | Description | Key Coverage | +|---------|-------------|-------------| +| `gdpr` | EU General Data Protection Regulation | Names, email, phone, address, IP, DOB, SSN, financial IDs | +| `hipaa` | HIPAA Safe Harbor (18 identifiers) | All 18 Safe Harbor identifiers including medical record numbers, device IDs, dates | +| `pci-dss` | PCI-DSS v4.0 | PAN, cardholder name, expiration, CVV/PIN (NULLed) | + +When a compliance profile is active: +- Anonymization is auto-enabled (no need for `anonymization.enabled: true`) +- Profile-defined column patterns are merged as **fallback wildcard rules** (`user exact fields > user patterns > profile patterns > built-ins`) +- Value-based scanning runs in two phases: + - coverage scan (pre-mask) to detect PII presence + - residual scan (post-mask) on unprotected columns only (strict mode fails only here) +- Free-text columns (notes, comments, descriptions) are flagged as warnings +- Audit manifest is generated by default + +#### Policy Modes + +`policy_mode` adds runtime guardrails when compliance profiles are active. These are CLI-level checks that prevent accidental misconfiguration — they are not a security boundary. + +- `off`: No policy gates (default). +- `standard` / `strict`: Block risky defaults — stdout output, `--allow-unsafe-where`, and non-masked extraction are rejected unless overridden with `--allow-raw`. Both modes currently apply the same gates; `strict` is reserved for future tightening. + +Breakglass override: `--allow-raw --breakglass-reason "..." --ticket-id "..."`. The reason and ticket ID are recorded in the manifest for audit purposes. 
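The gate and breakglass behaviour described above could look something like the following sketch. It is a simplified model under stated assumptions, not the code in `cli.py`: `RunOptions` and `check_policy` are hypothetical names, and the messages are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunOptions:
    """Hypothetical container for the settings the policy gates look at."""
    policy_mode: str = "off"          # off | standard | strict
    writes_to_stdout: bool = False
    allow_unsafe_where: bool = False
    anonymize: bool = True
    allow_raw: bool = False           # breakglass flag
    breakglass_reason: str = ""
    ticket_id: str = ""


def check_policy(opts: RunOptions) -> list[str]:
    """Return the policy violations that should block the run."""
    if opts.policy_mode == "off":
        return []
    if opts.allow_raw:
        # Breakglass is only valid with both audit fields filled in.
        required = (
            ("--breakglass-reason", opts.breakglass_reason),
            ("--ticket-id", opts.ticket_id),
        )
        return [f"--allow-raw requires {flag}" for flag, value in required if not value.strip()]
    violations = []
    if opts.writes_to_stdout:
        violations.append("stdout output is blocked")
    if opts.allow_unsafe_where:
        violations.append("--allow-unsafe-where is blocked")
    if not opts.anonymize:
        violations.append("non-masked extraction is blocked")
    return violations
```

Requiring both a reason and a ticket before the override takes effect keeps every bypass traceable in the manifest, which is the point of a breakglass path.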
+ +#### Important: Pseudonymization vs Anonymization + +dbslice's anonymization is technically **pseudonymization** under GDPR (deterministic mode: same input = same output, reversible with seed knowledge). For stronger privacy guarantees, use `anonymization.deterministic: false` (non-deterministic mode), which uses random seeds per value but loses cross-table consistency. + +True GDPR anonymization (where re-identification is "not reasonably possible") may require additional measures beyond what dbslice provides (k-anonymity, data generalization, etc.). + +#### Audit Manifest + +When `generate_manifest` is enabled, dbslice writes a `*.manifest.json` file alongside the output containing: + +- Extraction metadata (timestamp, version, seed hash) +- Per-table breakdown of masked, NULLed, FK-preserved, and unmasked fields +- Residual PII scan results from value-based scanning +- Compliance warnings (e.g., free-text columns that may contain embedded PII) +- Output file hash set (`sha256`) for produced artifacts +- Optional breakglass metadata (reason + ticket) when override is used +- Optional HMAC-SHA256 signature for tamper detection (symmetric key — integrity checking, not non-repudiation) + +This manifest provides structured evidence for audit reviews. For non-repudiation (provable origin), sign the manifest externally with cosign or GPG in your CI pipeline. + +#### Examples + +```yaml +# HIPAA-compliant extraction +compliance: + profiles: [hipaa] + strict: true + generate_manifest: true + +anonymization: + enabled: true + seed: "hipaa-compliant-seed-2024" + +# Multiple compliance profiles +compliance: + profiles: [gdpr, pci-dss] + strict: false + generate_manifest: true + +# Non-deterministic mode for stronger privacy +compliance: + profiles: [gdpr] + strict: true + +anonymization: + enabled: true + deterministic: false # Random output each run +``` + +--- + ### output Output format and generation configuration. 
@@ -779,6 +895,51 @@ dbslice extract \ --- +### HIPAA-Compliant Extraction + +**config/hipaa_compliant.yaml:** +```yaml +version: "1.0" + +database: + url: ${MEDICAL_DATABASE_URL} + +extraction: + default_depth: 3 + direction: both + exclude_tables: + - audit_logs + - system_events + validate: true + fail_on_validation_error: true + +compliance: + profiles: [hipaa] + strict: true # Fail if PII detected in output + generate_manifest: true # Generate audit trail + +anonymization: + enabled: true + seed: "hipaa-compliant-extraction-2024" + deterministic: false # Non-deterministic for stronger privacy +``` + +**Usage:** +```bash +export MEDICAL_DATABASE_URL="postgresql://medical-db.example.com/ehr" + +dbslice extract \ + --config config/hipaa_compliant.yaml \ + --seed "patients.id=12345" \ + --out-file patient_subset.sql + +# Output: +# patient_subset.sql (anonymized data) +# patient_subset.manifest.json (audit manifest for compliance team) +``` + +--- + ### Test Fixture Generation **config/test_fixtures.yaml:** diff --git a/src/dbslice/cli.py b/src/dbslice/cli.py index f1e04df..ba6fa6e 100644 --- a/src/dbslice/cli.py +++ b/src/dbslice/cli.py @@ -1,6 +1,10 @@ +import itertools +import json import os +import re from pathlib import Path from typing import Annotated +from urllib.parse import parse_qs, urlparse import typer from rich.console import Console @@ -204,9 +208,7 @@ def _parse_and_validate_seeds( parsed_seeds = [] for s in seeds: try: - parsed_seeds.append( - SeedSpec.parse(s, allow_unsafe_subqueries=allow_unsafe_subqueries) - ) + parsed_seeds.append(SeedSpec.parse(s, allow_unsafe_subqueries=allow_unsafe_subqueries)) except ValueError as e: raise InvalidSeedError(s, str(e)) @@ -348,11 +350,19 @@ def _show_extraction_settings( seed_desc = f"{s.table}.{s.column}={s.value}" if s.column else f"{s.table}:{s.where_clause}" console.print(f" - {seed_desc}") if config.anonymize: - console.print(" [yellow]Anonymization: ENABLED[/yellow]") + mode = "deterministic" if 
config.deterministic else "non-deterministic" + console.print(f" [yellow]Anonymization: ENABLED ({mode})[/yellow]") if config.redact_fields: console.print(" Additional redacted fields:") for field in config.redact_fields: console.print(f" - {field}") + if config.compliance_profiles: + profiles_str = ", ".join(p.upper() for p in config.compliance_profiles) + console.print(f" [yellow]Compliance profiles: {profiles_str}[/yellow]") + if config.compliance_strict: + console.print(" [yellow]Strict mode: ENABLED (will fail on PII detection)[/yellow]") + if config.generate_manifest: + console.print(" [yellow]Audit manifest: ENABLED[/yellow]") console.print() @@ -408,8 +418,12 @@ def _show_extraction_summary( if result.has_cycles: console.print() if result.used_deferred_cycle_strategy: - console.print("[yellow]⚠ Circular dependencies detected (deferred-constraint strategy)[/yellow]") - console.print(" Strategy: [cyan]Deterministic order + SET CONSTRAINTS ALL DEFERRED[/cyan]") + console.print( + "[yellow]⚠ Circular dependencies detected (deferred-constraint strategy)[/yellow]" + ) + console.print( + " Strategy: [cyan]Deterministic order + SET CONSTRAINTS ALL DEFERRED[/cyan]" + ) console.print(f" Cycles: [cyan]{len(result.cycle_infos)}[/cyan]") else: console.print("[yellow]⚠ Circular dependencies detected and resolved[/yellow]") @@ -485,7 +499,7 @@ def _generate_and_output_sql( disable_fk_checks: bool, output_file_mode: int, db_schema: str | None = None, -) -> None: +) -> list[Path]: """ Generate SQL output and write to file or stdout. 
@@ -524,11 +538,13 @@ def _generate_and_output_sql( console.print( f"[green]Wrote {result.total_rows()} rows to [bold]{out_file}[/bold][/green]" ) + return [out_file.resolve()] else: if not no_progress: console.print() console.print("[dim]--- SQL Output ---[/dim]") stdout_console.print(sql_output) + return [] def _generate_and_output_json( @@ -541,7 +557,7 @@ def _generate_and_output_json( console: Console, stdout_console: Console, output_file_mode: int, -) -> None: +) -> list[Path]: """ Generate JSON output and write to file(s) or stdout. @@ -584,19 +600,23 @@ def _generate_and_output_json( console.print( f"[green]Wrote {result.total_rows()} rows to [bold]{out_file}[/bold][/green]" ) + return [out_file.resolve()] else: assert isinstance(json_output, dict) out_file.mkdir(parents=True, exist_ok=True) + written_files: list[Path] = [] for table_name, table_json in json_output.items(): table_file = out_file / f"{table_name}.json" write_text_file_secure( table_file, table_json, file_mode=output_file_mode, encoding="utf-8" ) + written_files.append(table_file.resolve()) if not no_progress: console.print() console.print( f"[green]Wrote {result.table_count()} tables ({result.total_rows()} rows) to [bold]{out_file}[/bold][/green]" ) + return written_files else: # Output to stdout (only single mode makes sense) if mode == "per-table": @@ -616,6 +636,7 @@ def _generate_and_output_json( console.print() console.print("[dim]--- JSON Output ---[/dim]") stdout_console.print(json_output) + return [] def _generate_and_output_csv( @@ -628,7 +649,7 @@ def _generate_and_output_csv( console: Console, stdout_console: Console, output_file_mode: int, -) -> None: +) -> list[Path]: """ Generate CSV output and write to file(s) or stdout. 
@@ -671,19 +692,23 @@ def _generate_and_output_csv( console.print( f"[green]Wrote {result.total_rows()} rows to [bold]{out_file}[/bold][/green]" ) + return [out_file.resolve()] else: assert isinstance(csv_output, dict) out_file.mkdir(parents=True, exist_ok=True) + written_files: list[Path] = [] for table_name, table_csv in csv_output.items(): table_file = out_file / f"{table_name}.csv" write_text_file_secure( table_file, table_csv, file_mode=output_file_mode, encoding="utf-8" ) + written_files.append(table_file.resolve()) if not no_progress: console.print() console.print( f"[green]Wrote {result.table_count()} tables ({result.total_rows()} rows) to [bold]{out_file}[/bold][/green]" ) + return written_files else: # Output to stdout (only single mode makes sense) if mode == "per-table": @@ -703,6 +728,7 @@ def _generate_and_output_csv( console.print() console.print("[dim]--- CSV Output ---[/dim]") stdout_console.print(csv_output) + return [] def _handle_output_format( @@ -720,7 +746,7 @@ def _handle_output_format( console: Console, stdout_console: Console, db_schema: str | None = None, -) -> None: +) -> list[Path]: """ Handle output generation based on configured format. 
@@ -743,7 +769,7 @@ def _handle_output_format( typer.Exit: If format is not yet implemented (exits with code 1) """ if output_format == OutputFormat.SQL: - _generate_and_output_sql( + return _generate_and_output_sql( result, schema, database_url, @@ -758,7 +784,7 @@ def _handle_output_format( db_schema=db_schema, ) elif output_format == OutputFormat.JSON: - _generate_and_output_json( + return _generate_and_output_json( result, schema, out_file, @@ -770,7 +796,7 @@ def _handle_output_format( output_file_mode=extract_config.output_file_mode, ) elif output_format == OutputFormat.CSV: - _generate_and_output_csv( + return _generate_and_output_csv( result, schema, out_file, @@ -782,6 +808,113 @@ def _handle_output_format( output_file_mode=extract_config.output_file_mode, ) + return [] + + +def _is_truthy_env(value: str | None) -> bool: + """Interpret common truthy environment values.""" + if value is None: + return False + return value.strip().lower() in {"1", "true", "yes", "on"} + + +def _enforce_source_guardrails(config: ExtractConfig) -> None: + """Apply source guardrail checks for compliance-sensitive runs.""" + if config.compliance_require_ci and not _is_truthy_env(os.environ.get("CI")): + raise ValueError("Compliance policy requires CI environment, but CI is not set") + + url = config.database_url + for pattern in config.compliance_denied_url_patterns: + if re.search(pattern, url): + raise ValueError(f"Database URL rejected by compliance deny pattern: {pattern}") + + if config.compliance_allowed_url_patterns and not any( + re.search(pattern, url) for pattern in config.compliance_allowed_url_patterns + ): + raise ValueError("Database URL does not match any compliance allow pattern") + + if config.compliance_required_sslmode: + parsed = urlparse(url) + query = parse_qs(parsed.query) + sslmode = query.get("sslmode", [None])[0] + if sslmode != config.compliance_required_sslmode: + raise ValueError( + "Database URL sslmode does not satisfy compliance requirement " + 
f"(expected '{config.compliance_required_sslmode}', got '{sslmode}')" + ) + + +def _enforce_compliance_policy( + config: ExtractConfig, + out_file: Path | None, + allow_raw: bool, + breakglass_reason: str | None, + ticket_id: str | None, +) -> tuple[bool, str | None, str | None]: + """ + Apply policy-mode gates for compliance runs. + + Returns: + Tuple of (breakglass_applied, reason, ticket_id) + """ + mode = config.compliance_policy_mode.lower() + policy_active = mode in {"standard", "strict"} and bool(config.compliance_profiles) + + risky_reasons: list[str] = [] + if policy_active: + if out_file is None: + risky_reasons.append("stdout output is blocked when compliance profiles are active") + if config.allow_unsafe_where: + risky_reasons.append("unsafe WHERE subqueries are blocked under compliance policy") + if not config.anonymize: + risky_reasons.append("masking/anonymization is required under compliance policy") + + if allow_raw: + if not risky_reasons: + raise ValueError("--allow-raw was provided but no policy gate requires breakglass") + if not breakglass_reason: + raise ValueError("--allow-raw requires --breakglass-reason") + if not ticket_id: + raise ValueError("--allow-raw requires --ticket-id") + return True, breakglass_reason, ticket_id + + if risky_reasons: + details = "\n".join(f" - {reason}" for reason in risky_reasons) + raise ValueError( + "Compliance policy blocked extraction:\n" + f"{details}\n" + "Use --allow-raw with --breakglass-reason and --ticket-id only for approved exceptions." 
+ ) + + if breakglass_reason or ticket_id: + raise ValueError("--breakglass-reason/--ticket-id require --allow-raw") + + return False, None, None + + +def _write_compliance_manifest( + manifest: object, + out_file: Path | None, + console: Console, + no_progress: bool, +) -> None: + """Write compliance manifest to a JSON file alongside the output.""" + from dbslice.compliance.manifest import ComplianceManifest + from dbslice.utils.fileio import write_text_file_secure + + assert isinstance(manifest, ComplianceManifest) + + if out_file: + manifest_path = out_file.with_suffix(".manifest.json") + else: + manifest_path = Path("dbslice_manifest.json") + + manifest_json = manifest.to_json(pretty=True) + write_text_file_secure(manifest_path, manifest_json, file_mode=0o600) + + if not no_progress: + console.print(f"[green]Wrote compliance manifest to [bold]{manifest_path}[/bold][/green]") + @app.command() def extract( @@ -986,6 +1119,58 @@ def extract( ), ), ] = None, + compliance: Annotated[ + list[str] | None, + typer.Option( + "--compliance", + help="Compliance profile(s) to apply: gdpr, hipaa, pci-dss", + ), + ] = None, + compliance_strict: Annotated[ + bool | None, + typer.Option( + "--compliance-strict/--no-compliance-strict", + help="Fail extraction if uncovered PII is detected by value scanning", + ), + ] = None, + manifest: Annotated[ + bool | None, + typer.Option( + "--manifest/--no-manifest", + help="Generate audit manifest alongside output (auto-enabled with --compliance)", + ), + ] = None, + non_deterministic: Annotated[ + bool | None, + typer.Option( + "--non-deterministic/--deterministic", + help="Use non-deterministic anonymization (random output each run, stronger privacy)", + ), + ] = None, + allow_raw: Annotated[ + bool, + typer.Option( + "--allow-raw", + help=( + "Breakglass override for compliance policy gates (requires " + "--breakglass-reason and --ticket-id)" + ), + ), + ] = False, + breakglass_reason: Annotated[ + str | None, + typer.Option( + 
"--breakglass-reason", + help="Required breakglass justification when using --allow-raw", + ), + ] = None, + ticket_id: Annotated[ + str | None, + typer.Option( + "--ticket-id", + help="Required tracking ticket/incident ID when using --allow-raw", + ), + ] = None, ): """ Extract a database subset starting from seed record(s). @@ -1037,15 +1222,11 @@ def extract( direction_override = direction if direction_override is None: - direction_override = _parse_env_choice( - "DBSLICE_DIRECTION", {"up", "down", "both"} - ) + direction_override = _parse_env_choice("DBSLICE_DIRECTION", {"up", "down", "both"}) output_override = output if output_override is None: - output_override = _parse_env_choice( - "DBSLICE_OUTPUT_FORMAT", {"sql", "json", "csv"} - ) + output_override = _parse_env_choice("DBSLICE_OUTPUT_FORMAT", {"sql", "json", "csv"}) anonymize_override = anonymize if anonymize_override is None: @@ -1103,12 +1284,39 @@ def extract( allow_unsafe_where_override if allow_unsafe_where_override is not None else ( - loaded_config.extraction.allow_unsafe_where - if loaded_config is not None - else False + loaded_config.extraction.allow_unsafe_where if loaded_config is not None else False ) ) + # Compliance settings + effective_compliance = compliance or [] + effective_compliance_strict = compliance_strict if compliance_strict is not None else False + effective_manifest = manifest if manifest is not None else bool(effective_compliance) + effective_deterministic = not non_deterministic if non_deterministic is not None else True + + if loaded_config: + if not compliance: + effective_compliance = loaded_config.compliance.profiles + if compliance_strict is None: + effective_compliance_strict = loaded_config.compliance.strict + if manifest is None: + effective_manifest = loaded_config.compliance.generate_manifest or bool( + effective_compliance + ) + if non_deterministic is None: + effective_deterministic = loaded_config.anonymization.deterministic + + # Validate compliance profile 
names + if effective_compliance: + from dbslice.compliance.profiles import get_profile + + for profile_name in effective_compliance: + try: + get_profile(profile_name) + except ValueError as e: + console.print(f"[red]Compliance Error:[/red] {e}") + raise typer.Exit(1) + resolved_database_url = database_url_override if not resolved_database_url and loaded_config: resolved_database_url = loaded_config.database.url @@ -1143,12 +1351,17 @@ def extract( validate_exclude_tables(passthrough) # Same validation as exclude if redact_override: validate_redact_fields(redact_override) - if ( - direction_override is not None - and direction_override.lower() not in {"up", "down", "both"} - ): + if direction_override is not None and direction_override.lower() not in { + "up", + "down", + "both", + }: raise ValueError("Invalid direction. Use: up, down, both") - if output_override is not None and output_override.lower() not in {"sql", "json", "csv"}: + if output_override is not None and output_override.lower() not in { + "sql", + "json", + "csv", + }: raise ValueError("Invalid output format. 
Use: sql, json, csv") if effective_json_mode not in ("auto", "single", "per-table"): raise ValueError( @@ -1184,9 +1397,7 @@ def extract( else None ) output_format_enum = ( - OutputFormat(effective_output.lower()) - if output_override is not None - else None + OutputFormat(effective_output.lower()) if output_override is not None else None ) extract_config = loaded_config.to_extract_config( @@ -1245,18 +1456,45 @@ def extract( allow_unsafe_where=effective_allow_unsafe_where, ) + # Apply compliance settings to extract config + extract_config.compliance_profiles = effective_compliance + extract_config.compliance_strict = effective_compliance_strict + extract_config.generate_manifest = effective_manifest + extract_config.deterministic = effective_deterministic + + # Compliance profiles auto-enable anonymization + if effective_compliance and not extract_config.anonymize and not allow_raw: + extract_config.anonymize = True + + try: + _enforce_source_guardrails(extract_config) + ( + breakglass_applied, + breakglass_applied_reason, + breakglass_applied_ticket, + ) = _enforce_compliance_policy( + extract_config, + out_file, + allow_raw=allow_raw, + breakglass_reason=breakglass_reason, + ticket_id=ticket_id, + ) + except ValueError as e: + console.print(f"[red]Compliance Policy Error:[/red] {e}") + raise typer.Exit(1) + if verbose and not no_progress: _show_extraction_settings(extract_config, console) - result, schema, engine = _execute_extraction(extract_config, console) + result, schema_graph, engine = _execute_extraction(extract_config, console) if not no_progress: _show_extraction_summary(result, extract_config, engine, console) - _handle_output_format( + output_files = _handle_output_format( output_format=output_format, result=result, - schema=schema, + schema=schema_graph, extract_config=extract_config, database_url=extract_config.database_url, out_file=out_file, @@ -1270,6 +1508,30 @@ def extract( db_schema=extract_config.schema, ) + # Write compliance manifest after 
output so file hashes can be recorded. + engine_manifest = getattr(engine, "manifest", None) + if engine_manifest and effective_manifest: + engine_manifest.add_output_file_hashes(output_files, base_dir=Path.cwd()) + + if breakglass_applied and breakglass_applied_reason and breakglass_applied_ticket: + engine_manifest.set_breakglass( + reason=breakglass_applied_reason, + ticket_id=breakglass_applied_ticket, + ) + + if extract_config.compliance_manifest_sign: + signing_key = os.environ.get(extract_config.compliance_manifest_key_env) + if not signing_key: + console.print( + "[red]Compliance Policy Error:[/red] " + "Manifest signing is enabled but signing key environment variable " + f"'{extract_config.compliance_manifest_key_env}' is not set" + ) + raise typer.Exit(1) + engine_manifest.sign(signing_key) + + _write_compliance_manifest(engine_manifest, out_file, console, no_progress) + except ConnectionError as e: logger.error("Database connection failed", error=e.reason, exc_info=True) console.print(f"[red]Connection failed:[/red] {e.reason}") @@ -1590,6 +1852,124 @@ def _detect_potential_implicit_fks(schema) -> list[tuple[str, str, str]]: return sorted(candidates, key=lambda item: (item[0], item[1], item[2])) +def _run_compliance_check_report( + adapter, + db_schema, + profiles: list[str], + sample_rows: int, + output_mode: str, + target_table: str | None, + console: Console, +) -> None: + """Run profile-aware coverage scanning and print human/json compliance report.""" + from dbslice.compliance.profiles import get_profile + from dbslice.compliance.scanner import PIIScanner + from dbslice.utils.anonymizer import DeterministicAnonymizer + + scan_patterns: set[str] = set() + fallback_patterns: dict[str, str] = {} + security_null_fields: list[str] = [] + profile_summaries: list[dict[str, object]] = [] + + for profile_name in profiles: + profile = get_profile(profile_name) + scan_patterns.update(profile.value_scan_patterns) + for pattern, provider in 
profile.required_column_patterns.items(): + fallback_patterns.setdefault(f"*.{pattern}*", provider) + for pattern in profile.required_null_patterns: + glob = f"*.{pattern}*" + if glob not in security_null_fields: + security_null_fields.append(glob) + profile_summaries.append( + { + "profile": profile.name, + "display_name": profile.display_name, + "identifier_categories": len(profile.identifiers), + "required_column_patterns": len(profile.required_column_patterns), + "required_null_patterns": len(profile.required_null_patterns), + } + ) + + scanner = PIIScanner( + patterns=sorted(scan_patterns) + if scan_patterns + else ["email", "ssn", "phone", "credit_card"] + ) + anonymizer = DeterministicAnonymizer(seed="compliance-check") + anonymizer.configure( + [], + patterns={}, + fallback_patterns=fallback_patterns, + security_null_fields=security_null_fields, + ) + + if target_table: + tables_to_scan = [target_table] + else: + tables_to_scan = sorted(db_schema.tables.keys()) + + detections: list[dict[str, object]] = [] + for table_name in tables_to_scan: + rows = list(itertools.islice(adapter.fetch_rows(table_name, "TRUE", ()), sample_rows)) + if not rows: + continue + for detection in scanner.scan_rows(table_name, rows): + protected = anonymizer.should_anonymize( + detection.table, detection.column + ) or anonymizer.should_null(detection.table, detection.column) + detections.append( + { + "table": detection.table, + "column": detection.column, + "pattern": detection.pattern_name, + "match_count": detection.match_count, + "sample_size": detection.sample_size, + "match_rate": round(detection.match_rate, 4), + "confidence": detection.confidence, + "protected": protected, + } + ) + + uncovered = [item for item in detections if not item["protected"]] + report = { + "profiles": profiles, + "profile_summaries": profile_summaries, + "tables_scanned": len(tables_to_scan), + "sample_rows_per_table": sample_rows, + "detections_total": len(detections), + "detections_protected": 
len(detections) - len(uncovered), + "detections_uncovered": len(uncovered), + "status": "pass" if not uncovered else "gaps_found", + "uncovered_detections": uncovered, + } + + if output_mode == "json": + console.print(json.dumps(report, indent=2)) + return + + console.print("\n[bold]Compliance Coverage Check[/bold]") + console.print(f" Profiles: [cyan]{', '.join(profile.upper() for profile in profiles)}[/cyan]") + console.print(f" Tables scanned: [cyan]{report['tables_scanned']}[/cyan]") + console.print(f" Sample rows/table: [cyan]{sample_rows}[/cyan]") + console.print(f" Detections: [cyan]{report['detections_total']}[/cyan]") + console.print(f" Protected detections: [green]{report['detections_protected']}[/green]") + if uncovered: + console.print(f" Uncovered detections: [red]{len(uncovered)}[/red]") + console.print("\n[red]Potential Compliance Gaps:[/red]") + for finding in uncovered[:50]: + console.print( + " " + f"{finding['table']}.{finding['column']}: " + f"{finding['pattern']} ({finding['match_count']}/{finding['sample_size']}, " + f"{finding['confidence']})" + ) + if len(uncovered) > 50: + console.print(f" [dim]... 
and {len(uncovered) - 50} more[/dim]") + else: + console.print(" Uncovered detections: [green]0[/green]") + console.print("[green]Status: PASS[/green]") + + @app.command() def inspect( database_url: Annotated[ @@ -1611,6 +1991,28 @@ def inspect( help="PostgreSQL schema name (default: 'public')", ), ] = None, + compliance_check: Annotated[ + list[str] | None, + typer.Option( + "--compliance-check", + help="Run compliance coverage check for profile(s): gdpr, hipaa, pci-dss", + ), + ] = None, + compliance_output: Annotated[ + str, + typer.Option( + "--compliance-output", + help="Compliance report output format: human or json", + ), + ] = "human", + sample_rows: Annotated[ + int, + typer.Option( + "--sample-rows", + help="Rows sampled per table for compliance value scanning", + min=1, + ), + ] = 100, ): """ Inspect database schema without extracting data. @@ -1635,6 +2037,8 @@ def inspect( from dbslice.input_validators import validate_table_name validate_table_name(table) + if compliance_output not in {"human", "json"}: + raise ValueError("--compliance-output must be one of: human, json") except (ValidationError, ValueError) as e: console.print(f"[red]Validation Error:[/red] {e}") raise typer.Exit(1) @@ -1654,6 +2058,27 @@ def inspect( with console.status("[bold blue]Introspecting schema...[/bold blue]"): db_schema = adapter.get_schema() + if compliance_check: + from dbslice.compliance.profiles import get_profile + + for profile_name in compliance_check: + try: + get_profile(profile_name) + except ValueError as e: + console.print(f"[red]Compliance Error:[/red] {e}") + raise typer.Exit(1) + + _run_compliance_check_report( + adapter=adapter, + db_schema=db_schema, + profiles=compliance_check, + sample_rows=sample_rows, + output_mode=compliance_output, + target_table=table, + console=console, + ) + return + if table: table_info = db_schema.get_table(table) if not table_info: @@ -1729,9 +2154,7 @@ def inspect( for src_table, src_col, target_table in 
implicit_candidates[:25]: console.print(f" {src_table}.{src_col} -> [cyan]{target_table}[/cyan].id") if len(implicit_candidates) > 25: - console.print( - f" [dim]... and {len(implicit_candidates) - 25} more[/dim]" - ) + console.print(f" [dim]... and {len(implicit_candidates) - 25} more[/dim]") console.print( " [dim]Tip: define virtual_foreign_keys for confirmed implicit links.[/dim]" ) @@ -1752,6 +2175,134 @@ def inspect( raise typer.Exit(1) +@app.command("verify-manifest") +def verify_manifest( + manifest_file: Annotated[ + Path, + typer.Argument(help="Path to compliance manifest JSON file"), + ], + verify_signature: Annotated[ + bool, + typer.Option( + "--verify-signature/--no-verify-signature", + help="Verify HMAC manifest signature when present", + ), + ] = True, + key_env: Annotated[ + str, + typer.Option( + "--key-env", + help="Environment variable containing manifest signing key", + ), + ] = "DBSLICE_MANIFEST_SIGNING_KEY", +): + """Verify compliance manifest output hashes and optional HMAC signature.""" + try: + if not manifest_file.exists(): + console.print(f"[red]Error:[/red] Manifest file not found: {manifest_file}") + raise typer.Exit(1) + if not manifest_file.is_file(): + console.print(f"[red]Error:[/red] Not a file: {manifest_file}") + raise typer.Exit(1) + + try: + payload = json.loads(manifest_file.read_text(encoding="utf-8")) + except json.JSONDecodeError as e: + console.print(f"[red]Error:[/red] Invalid manifest JSON: {e}") + raise typer.Exit(1) + + from dbslice.compliance.manifest import verify_manifest_payload + + signing_key: str | None = None + if verify_signature: + signing_key = os.environ.get(key_env) + + valid, errors = verify_manifest_payload( + payload=payload, + manifest_path=manifest_file, + signing_key=signing_key, + verify_signature=verify_signature, + ) + + if valid: + console.print("[green]Manifest verification passed[/green]") + return + + console.print("[red]Manifest verification failed:[/red]") + for error in errors: + 
console.print(f" - {error}") + raise typer.Exit(1) + except typer.Exit: + raise + except Exception as e: + console.print(f"[red]Unexpected error:[/red] {e}") + raise typer.Exit(1) + + +@app.command() +def map( + database_url: Annotated[ + str | None, + typer.Argument(help="Optional database URL (can also enter in the UI)"), + ] = None, + schema: Annotated[ + str | None, + typer.Option("--schema", help="PostgreSQL schema name (default: 'public')"), + ] = None, + port: Annotated[ + int, + typer.Option("--port", "-p", help="Port for local mapping UI server", min=1024, max=65535), + ] = 9473, + open_browser: Annotated[ + bool, + typer.Option("--open-browser/--no-open-browser", help="Auto-open browser"), + ] = True, +): + """ + Launch local column-mapping UI. + + Opens a browser-based interface for reviewing database columns and + configuring anonymization mappings. Generates a ready-to-use dbslice.yaml. + + The server runs locally on 127.0.0.1 only and requires a session token. + + Examples: + + # Launch mapping UI (enter URL in browser) + dbslice map + + # Launch with pre-filled database URL + dbslice map postgresql://localhost/myapp + + # Custom port, no auto-open + dbslice map postgresql://localhost/myapp --port 8888 --no-open-browser + """ + from dbslice.mapping.server import MappingServer + + resolved_url = database_url + if resolved_url is None: + resolved_url = os.environ.get("DATABASE_URL") + + server = MappingServer( + port=port, + database_url=resolved_url or "", + schema=schema, + ) + + console.print("\n[bold]dbslice Column Mapping[/bold]") + console.print(f" URL: [cyan]{server.url}[/cyan]") + console.print(" Bound to 127.0.0.1 only (local access)") + console.print("\n Press Ctrl+C to stop.\n") + + try: + server.start(open_browser=open_browser) + except OSError as e: + console.print(f"[red]Error:[/red] Could not start server on port {port}: {e}") + raise typer.Exit(1) + except KeyboardInterrupt: + console.print("\n[dim]Mapping UI stopped.[/dim]") + + 
@app.command() def docs( port: Annotated[ diff --git a/src/dbslice/compliance/__init__.py b/src/dbslice/compliance/__init__.py new file mode 100644 index 0000000..d04022b --- /dev/null +++ b/src/dbslice/compliance/__init__.py @@ -0,0 +1,18 @@ +from dbslice.compliance.manifest import ( + ComplianceManifest, + ManifestEntry, + verify_manifest_payload, +) +from dbslice.compliance.profiles import ComplianceProfile, get_profile, list_profiles +from dbslice.compliance.scanner import PIIDetection, PIIScanner + +__all__ = [ + "ComplianceManifest", + "ComplianceProfile", + "ManifestEntry", + "PIIDetection", + "PIIScanner", + "get_profile", + "list_profiles", + "verify_manifest_payload", +] diff --git a/src/dbslice/compliance/manifest.py b/src/dbslice/compliance/manifest.py new file mode 100644 index 0000000..fb33e52 --- /dev/null +++ b/src/dbslice/compliance/manifest.py @@ -0,0 +1,319 @@ +import hashlib +import hmac +import json +from dataclasses import asdict, dataclass, field +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +from dbslice import __version__ +from dbslice.compliance.scanner import PIIDetection + + +@dataclass +class ManifestFieldEntry: + """Record of anonymization applied to a single field.""" + + table: str + column: str + method: str + category: str = "" # e.g., "direct_identifier", "hipaa_identifier_7" + + +@dataclass +class ManifestNullEntry: + """Record of a field forced to NULL.""" + + table: str + column: str + reason: str # e.g., "security_null_pattern" + + +@dataclass +class ManifestWarning: + """A compliance warning.""" + + table: str + column: str + reason: str + severity: str = "warning" # "warning" or "error" + + +@dataclass +class ManifestTableEntry: + """Per-table manifest data.""" + + rows_extracted: int = 0 + fields_masked: list[ManifestFieldEntry] = field(default_factory=list) + fields_nulled: list[ManifestNullEntry] = field(default_factory=list) + fields_preserved_fk: list[str] = 
field(default_factory=list) + fields_unmasked: list[str] = field(default_factory=list) + + +@dataclass +class ManifestEntry: + """A single entry in the compliance manifest (for external use).""" + + table: str + column: str + action: str # "masked", "nulled", "preserved_fk", "unmasked" + method: str = "" + reason: str = "" + + +@dataclass +class ComplianceManifest: + """ + Full compliance audit manifest. + + Generated alongside extraction output to document what anonymization + was applied and provide evidence for compliance audits. + """ + + extraction_id: str = "" + timestamp: str = "" + dbslice_version: str = "" + masking_type: str = "deterministic_pseudonymization" + compliance_profiles: list[str] = field(default_factory=list) + tables: dict[str, ManifestTableEntry] = field(default_factory=dict) + pii_scan_results: list[PIIDetection] = field(default_factory=list) + warnings: list[ManifestWarning] = field(default_factory=list) + seed_hash: str = "" + output_file_hashes: dict[str, str] = field(default_factory=dict) + breakglass: dict[str, str] = field(default_factory=dict) + signature_algorithm: str = "" + signature: str = "" + + def initialize( + self, + extraction_id: str, + compliance_profiles: list[str] | None = None, + anonymization_seed: str | None = None, + deterministic: bool = True, + ) -> None: + """ + Initialize manifest metadata. 
+ + Args: + extraction_id: Unique ID for this extraction + compliance_profiles: Names of active compliance profiles + anonymization_seed: The anonymization seed (hashed, not stored raw) + deterministic: Whether deterministic mode is used + """ + self.extraction_id = extraction_id + self.timestamp = datetime.now(timezone.utc).isoformat() + self.dbslice_version = __version__ + self.compliance_profiles = compliance_profiles or [] + self.masking_type = ( + "deterministic_pseudonymization" + if deterministic + else "non_deterministic_pseudonymization" + ) + if anonymization_seed: + self.seed_hash = ( + f"sha256:{hashlib.sha256(anonymization_seed.encode()).hexdigest()[:16]}" + ) + + def record_masked_field( + self, + table: str, + column: str, + method: str, + category: str = "", + ) -> None: + """Record that a field was masked/anonymized.""" + entry = self.tables.setdefault(table, ManifestTableEntry()) + entry.fields_masked.append( + ManifestFieldEntry(table=table, column=column, method=method, category=category) + ) + + def record_nulled_field(self, table: str, column: str, reason: str) -> None: + """Record that a field was set to NULL.""" + entry = self.tables.setdefault(table, ManifestTableEntry()) + entry.fields_nulled.append(ManifestNullEntry(table=table, column=column, reason=reason)) + + def record_fk_preserved(self, table: str, column: str) -> None: + """Record that a FK column was preserved (not anonymized).""" + entry = self.tables.setdefault(table, ManifestTableEntry()) + if column not in entry.fields_preserved_fk: + entry.fields_preserved_fk.append(column) + + def record_unmasked_field(self, table: str, column: str) -> None: + """Record that a field was not masked.""" + entry = self.tables.setdefault(table, ManifestTableEntry()) + if column not in entry.fields_unmasked: + entry.fields_unmasked.append(column) + + def set_table_row_count(self, table: str, count: int) -> None: + """Set the extracted row count for a table.""" + entry = 
self.tables.setdefault(table, ManifestTableEntry()) + entry.rows_extracted = count + + def add_warning( + self, + table: str, + column: str, + reason: str, + severity: str = "warning", + ) -> None: + """Add a compliance warning.""" + self.warnings.append( + ManifestWarning(table=table, column=column, reason=reason, severity=severity) + ) + + def add_pii_detections(self, detections: list[PIIDetection]) -> None: + """Add PII scan results.""" + self.pii_scan_results.extend(detections) + + def set_breakglass(self, reason: str, ticket_id: str) -> None: + """Record breakglass metadata for raw/unsafe extraction exceptions.""" + self.breakglass = { + "reason": reason, + "ticket_id": ticket_id, + "timestamp": datetime.now(timezone.utc).isoformat(), + } + + def add_output_file_hashes( + self, output_files: list[Path], base_dir: Path | None = None + ) -> None: + """Record deterministic SHA256 hashes for generated output files.""" + root = (base_dir or Path.cwd()).resolve() + hashes: dict[str, str] = {} + + for file_path in sorted((Path(p).resolve() for p in output_files), key=lambda p: str(p)): + if not file_path.exists() or not file_path.is_file(): + continue + digest = _sha256_file(file_path) + key: str + try: + key = str(file_path.relative_to(root)) + except ValueError: + key = str(file_path) + hashes[key] = f"sha256:{digest}" + + self.output_file_hashes = hashes + + def sign(self, signing_key: str) -> None: + """Sign manifest payload using HMAC-SHA256.""" + payload = self._signable_dict() + digest = _manifest_hmac(payload, signing_key) + self.signature_algorithm = "hmac-sha256" + self.signature = f"hmac-sha256:{digest}" + + def to_dict(self) -> dict[str, Any]: + """Convert to a JSON-serializable dictionary.""" + tables_dict: dict[str, Any] = {} + for table_name, table_entry in self.tables.items(): + tables_dict[table_name] = { + "rows_extracted": table_entry.rows_extracted, + "fields_masked": [ + {"column": f.column, "method": f.method, "category": f.category} + for f in 
table_entry.fields_masked + ], + "fields_nulled": [ + {"column": f.column, "reason": f.reason} for f in table_entry.fields_nulled + ], + "fields_preserved_fk": table_entry.fields_preserved_fk, + "fields_unmasked": table_entry.fields_unmasked, + } + + pii_results = [ + { + "table": d.table, + "column": d.column, + "pattern": d.pattern_name, + "match_count": d.match_count, + "sample_size": d.sample_size, + "confidence": d.confidence, + } + for d in self.pii_scan_results + ] + + warnings = [asdict(w) for w in self.warnings] + + return { + "extraction_id": self.extraction_id, + "timestamp": self.timestamp, + "dbslice_version": self.dbslice_version, + "masking_type": self.masking_type, + "compliance_profiles": self.compliance_profiles, + "seed_hash": self.seed_hash, + "tables": tables_dict, + "pii_scan_results": pii_results, + "warnings": warnings, + "output_file_hashes": self.output_file_hashes, + "breakglass": self.breakglass, + "signature_algorithm": self.signature_algorithm, + "signature": self.signature, + } + + def to_json(self, pretty: bool = True) -> str: + """Serialize manifest to JSON string.""" + return json.dumps(self.to_dict(), indent=2 if pretty else None, default=str) + + def _signable_dict(self) -> dict[str, Any]: + payload = self.to_dict() + payload.pop("signature_algorithm", None) + payload.pop("signature", None) + return payload + + +def _sha256_file(path: Path) -> str: + digest = hashlib.sha256() + with path.open("rb") as handle: + for chunk in iter(lambda: handle.read(1024 * 1024), b""): + digest.update(chunk) + return digest.hexdigest() + + +def _manifest_hmac(payload: dict[str, Any], signing_key: str) -> str: + canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True) + return hmac.new(signing_key.encode("utf-8"), canonical.encode("utf-8"), hashlib.sha256).hexdigest() + + +def verify_manifest_payload( + payload: dict[str, Any], + manifest_path: Path, + signing_key: str | None = None, + verify_signature: bool = True, 
+) -> tuple[bool, list[str]]: + """Verify output file hashes and optional HMAC signature for a manifest payload.""" + errors: list[str] = [] + manifest_dir = manifest_path.parent.resolve() + + file_hashes = payload.get("output_file_hashes", {}) + if not isinstance(file_hashes, dict): + errors.append("'output_file_hashes' must be an object") + return False, errors + + for rel_path, expected_hash in file_hashes.items(): + if not isinstance(rel_path, str) or not isinstance(expected_hash, str): + errors.append("Invalid output_file_hashes entry") + continue + target = (manifest_dir / rel_path).resolve() + if not target.exists(): + errors.append(f"Missing output file: {rel_path}") + continue + actual = f"sha256:{_sha256_file(target)}" + if actual != expected_hash: + errors.append( + f"Hash mismatch for {rel_path}: expected {expected_hash}, got {actual}" + ) + + if verify_signature: + signature = payload.get("signature") + signature_algorithm = payload.get("signature_algorithm") + if signature: + if signature_algorithm != "hmac-sha256": + errors.append("Unsupported signature_algorithm (expected hmac-sha256)") + elif signing_key is None: + errors.append("Manifest is signed but no signing key was provided") + else: + signable = dict(payload) + signable.pop("signature", None) + signable.pop("signature_algorithm", None) + expected = f"hmac-sha256:{_manifest_hmac(signable, signing_key)}" + if signature != expected: + errors.append("Manifest signature verification failed") + + return len(errors) == 0, errors diff --git a/src/dbslice/compliance/profiles.py b/src/dbslice/compliance/profiles.py new file mode 100644 index 0000000..fe9b28d --- /dev/null +++ b/src/dbslice/compliance/profiles.py @@ -0,0 +1,366 @@ +from dataclasses import dataclass, field + + +@dataclass(frozen=True) +class ComplianceProfile: + """A compliance profile defining anonymization requirements for a regulatory framework.""" + + name: str + """Profile identifier (e.g., 'gdpr', 'hipaa', 'pci-dss').""" + + 
display_name: str + """Human-readable name (e.g., 'GDPR', 'HIPAA Safe Harbor').""" + + description: str + """Brief description of what this profile covers.""" + + required_column_patterns: dict[str, str] = field(default_factory=dict) + """Column name substring -> Faker provider mappings that MUST be anonymized.""" + + required_null_patterns: list[str] = field(default_factory=list) + """Column name patterns that must be NULLed (security-sensitive data).""" + + value_scan_patterns: list[str] = field(default_factory=list) + """Names of value-based PII scanner patterns to run (e.g., 'email', 'ssn', 'credit_card').""" + + warn_freetext_columns: list[str] = field(default_factory=list) + """Column name patterns that may contain embedded PII in free text.""" + + identifiers: list[str] = field(default_factory=list) + """List of identifier categories this profile covers (for compliance reports).""" + + +GDPR_PROFILE = ComplianceProfile( + name="gdpr", + display_name="GDPR", + description=( + "EU General Data Protection Regulation. Covers direct identifiers and " + "flags quasi-identifiers that could enable singling out or linkage attacks." 
+ ), + required_column_patterns={ + # Direct identifiers + "email": "email", + "first_name": "first_name", + "last_name": "last_name", + "firstname": "first_name", + "lastname": "last_name", + "full_name": "name", + "fullname": "name", + "name": "name", + "phone": "phone_number", + "mobile": "phone_number", + "fax": "phone_number", + # Address / location + "address": "address", + "street": "street_address", + "city": "city", + "zip": "zipcode", + "zipcode": "zipcode", + "postal": "zipcode", + # Identity documents + "ssn": "ssn", + "passport": "passport_number", + "driver_license": "license_plate", + # Financial + "credit_card": "credit_card_number", + "card_number": "credit_card_number", + "iban": "iban", + "bank_account": "bban", + "account_number": "bban", + # Network identifiers + "ip_address": "ipv4", + "ipaddress": "ipv4", + "ip": "ipv4", + "ipv6": "ipv6", + "mac_address": "mac_address", + # Online identifiers + "username": "user_name", + "user_name": "user_name", + # Biographic + "dob": "date_of_birth", + "date_of_birth": "date_of_birth", + "birthdate": "date_of_birth", + "birth_date": "date_of_birth", + }, + required_null_patterns=[ + "password", + "passwd", + "pwd", + "hash", + "salt", + "token", + "secret", + "api_key", + "apikey", + "private_key", + "public_key", + "certificate", + "session_id", + ], + value_scan_patterns=["email", "phone", "ipv4", "ipv6"], + warn_freetext_columns=[ + "note", + "notes", + "comment", + "comments", + "description", + "message", + "body", + "content", + "text", + "bio", + "about", + "reason", + "feedback", + "review", + ], + identifiers=[ + "Names", + "Email addresses", + "Phone numbers", + "Physical addresses", + "IP addresses", + "Date of birth", + "Identity documents (SSN, passport)", + "Financial identifiers (credit card, IBAN)", + "Online identifiers (username)", + "Biometric identifiers (flagged via value scan)", + ], +) + +HIPAA_PROFILE = ComplianceProfile( + name="hipaa", + display_name="HIPAA Safe Harbor", + 
description=( + "HIPAA Safe Harbor de-identification method. Requires removal or masking " + "of all 18 specified identifier types per 45 CFR 164.514(b)(2)." + ), + required_column_patterns={ + # 1. Names + "name": "name", + "first_name": "first_name", + "last_name": "last_name", + "firstname": "first_name", + "lastname": "last_name", + "full_name": "name", + "fullname": "name", + # 2. Geographic (smaller than state) — Safe Harbor requires ZIP3 with population check + "address": "address", + "street": "street_address", + "city": "city", + "zip": "hipaa_zip3", + "zipcode": "hipaa_zip3", + "postal": "hipaa_zip3", + "county": "city", + # 3. Dates (except year) — Safe Harbor requires year-only + "dob": "year_only", + "date_of_birth": "year_only", + "birthdate": "year_only", + "birth_date": "year_only", + "admission_date": "year_only", + "discharge_date": "year_only", + "death_date": "year_only", + "service_date": "year_only", + "visit_date": "year_only", + # 4. Phone numbers + "phone": "phone_number", + "mobile": "phone_number", + "telephone": "phone_number", + "cell": "phone_number", + # 5. Fax numbers + "fax": "phone_number", + # 6. Email addresses + "email": "email", + # 7. SSN + "ssn": "ssn", + "social_security": "ssn", + # 8. Medical record numbers + "medical_record": "pystr", + "mrn": "pystr", + "patient_id": "pystr", + # 9. Health plan beneficiary numbers + "beneficiary": "pystr", + "member_id": "pystr", + "subscriber_id": "pystr", + # 10. Account numbers + "account_number": "bban", + "bank_account": "bban", + # 11. Certificate/license numbers + "license_number": "license_plate", + "certificate_number": "pystr", + "driver_license": "license_plate", + "passport": "passport_number", + # 12. Vehicle identifiers + "vin": "pystr", + "vehicle_id": "pystr", + "license_plate": "license_plate", + # 13. Device identifiers + "device_id": "pystr", + "serial_number": "pystr", + "device_serial": "pystr", + # 14. Web URLs + "url": "url", + "website": "url", + # 15. 
IP addresses + "ip_address": "ipv4", + "ipaddress": "ipv4", + "ip": "ipv4", + "ipv6": "ipv6", + # 16. Biometric identifiers (column names are hints) + "fingerprint": "pystr", + "biometric": "pystr", + "retina": "pystr", + "voiceprint": "pystr", + # 17. Full-face photographs (binary columns - flag as warning) + # 18. Any other unique identifier + "unique_id": "pystr", + }, + required_null_patterns=[ + "password", + "passwd", + "pwd", + "hash", + "salt", + "token", + "secret", + "api_key", + "apikey", + "private_key", + "public_key", + "certificate", + "session_id", + ], + value_scan_patterns=["email", "ssn", "phone", "credit_card", "ipv4", "ipv6"], + warn_freetext_columns=[ + "note", + "notes", + "comment", + "comments", + "description", + "message", + "body", + "content", + "text", + "diagnosis", + "treatment", + "history", + "narrative", + "clinical_notes", + "progress_notes", + "discharge_summary", + ], + identifiers=[ + "1. Names", + "2. Geographic data (smaller than state)", + "3. Dates (except year)", + "4. Phone numbers", + "5. Fax numbers", + "6. Email addresses", + "7. Social Security numbers", + "8. Medical record numbers", + "9. Health plan beneficiary numbers", + "10. Account numbers", + "11. Certificate/license numbers", + "12. Vehicle identifiers", + "13. Device identifiers", + "14. Web URLs", + "15. IP addresses", + "16. Biometric identifiers", + "17. Full-face photographs (flag only)", + "18. Any other unique identifying number", + ], +) + +PCI_DSS_PROFILE = ComplianceProfile( + name="pci-dss", + display_name="PCI-DSS v4.0", + description=( + "Payment Card Industry Data Security Standard v4.0. " + "Real PANs are PROHIBITED in dev/test environments (Req 6.5.5). " + "Cardholder data must be fully replaced with synthetic data." 
+ ), + required_column_patterns={ + # Primary Account Number (PAN) + "credit_card": "credit_card_number", + "card_number": "credit_card_number", + "card_num": "credit_card_number", + "pan": "credit_card_number", + "account_number": "bban", + # Cardholder name + "cardholder": "name", + "card_holder": "name", + "cardholder_name": "name", + # Expiration + "expiry": "credit_card_expire", + "expiration": "credit_card_expire", + "exp_date": "credit_card_expire", + "card_expiry": "credit_card_expire", + # Service code (3 digits) and card verification codes (3-4 digits) + "service_code": "pystr", + "cvv": "credit_card_security_code", + "cvc": "credit_card_security_code", + "cvv2": "credit_card_security_code", + }, + required_null_patterns=[ + # Sensitive authentication data - MUST be removed post-authorization + "pin", + "pin_block", + "pin_number", + "cvv", + "cvc", + "cvv2", + "cvc2", + "magnetic_stripe", + "track_data", + "track1", + "track2", + ], + value_scan_patterns=["credit_card"], + warn_freetext_columns=[ + "note", + "notes", + "comment", + "description", + "transaction_detail", + "memo", + ], + identifiers=[ + "Primary Account Number (PAN)", + "Cardholder name", + "Expiration date", + "Service code", + "Sensitive authentication data (CVV/PIN)", + ], +) + + +_PROFILES: dict[str, ComplianceProfile] = { + "gdpr": GDPR_PROFILE, + "hipaa": HIPAA_PROFILE, + "pci-dss": PCI_DSS_PROFILE, +} + + +def get_profile(name: str) -> ComplianceProfile: + """ + Get a compliance profile by name. + + Args: + name: Profile name (case-insensitive) + + Returns: + ComplianceProfile + + Raises: + ValueError: If profile not found + """ + profile = _PROFILES.get(name.lower()) + if profile is None: + available = ", ".join(sorted(_PROFILES.keys())) + raise ValueError(f"Unknown compliance profile '{name}'. 
Available: {available}") + return profile + + +def list_profiles() -> list[ComplianceProfile]: + """Return all available compliance profiles.""" + return list(_PROFILES.values()) diff --git a/src/dbslice/compliance/scanner.py b/src/dbslice/compliance/scanner.py new file mode 100644 index 0000000..8ffaeee --- /dev/null +++ b/src/dbslice/compliance/scanner.py @@ -0,0 +1,203 @@ +import re +from dataclasses import dataclass, field +from typing import Any + + +@dataclass +class PIIDetection: + """A single PII detection result.""" + + table: str + column: str + pattern_name: str + match_count: int + sample_size: int + confidence: str # "high", "medium", "low" + + @property + def match_rate(self) -> float: + """Fraction of sampled values that matched.""" + if self.sample_size == 0: + return 0.0 + return self.match_count / self.sample_size + + +# Compiled regex patterns for PII detection +_PII_PATTERNS: dict[str, tuple[re.Pattern[str], str]] = { + "email": ( + re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b"), + "high", + ), + "ssn": ( + re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), + "high", + ), + "phone": ( + re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"), + "medium", + ), + "ipv4": ( + re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"), + "medium", + ), + "ipv6": ( + re.compile(r"\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b"), + "medium", + ), + "credit_card": ( + re.compile(r"(? 
bool: + """Validate a number string using the Luhn algorithm.""" + digits = [int(d) for d in number if d.isdigit()] + if len(digits) < 13: + return False + checksum = 0 + for i, d in enumerate(reversed(digits)): + if i % 2 == 1: + d *= 2 + if d > 9: + d -= 9 + checksum += d + return checksum % 10 == 0 + + +def _extract_pan_candidates(text: str) -> list[str]: + """Extract PAN-like candidates and normalize separators before Luhn checks.""" + candidates: list[str] = [] + for match in _PAN_CANDIDATE_RE.findall(text): + digits_only = "".join(ch for ch in match if ch.isdigit()) + if 13 <= len(digits_only) <= 19: + candidates.append(digits_only) + return candidates + + +@dataclass +class PIIScanner: + """ + Scans data values for PII using regex patterns. + + Usage: + scanner = PIIScanner(patterns=["email", "ssn", "credit_card"]) + detections = scanner.scan_column("users", "notes", sample_values) + """ + + patterns: list[str] = field(default_factory=lambda: list(_PII_PATTERNS.keys())) + """Which PII patterns to scan for.""" + + min_match_rate: float = 0.1 + """Minimum fraction of values that must match to report a detection (default: 10%).""" + + def scan_column( + self, + table: str, + column: str, + values: list[Any], + ) -> list[PIIDetection]: + """ + Scan a list of column values for PII patterns. 
+ + Args: + table: Table name + column: Column name + values: Sample of values from the column + + Returns: + List of PIIDetection results for patterns that matched + """ + # Only scan string-like values + str_values = [str(v) for v in values if v is not None and str(v).strip()] + if not str_values: + return [] + + detections: list[PIIDetection] = [] + sample_size = len(str_values) + + for pattern_name in self.patterns: + if pattern_name not in _PII_PATTERNS: + continue + + regex, base_confidence = _PII_PATTERNS[pattern_name] + match_count = 0 + + for val in str_values: + if pattern_name == "credit_card": + pan_candidates = _extract_pan_candidates(val) + if any(_luhn_check(candidate) for candidate in pan_candidates): + match_count += 1 + else: + matches = regex.findall(val) + if matches: + match_count += 1 + + if match_count == 0: + continue + + match_rate = match_count / sample_size + if match_rate < self.min_match_rate: + continue + + # Adjust confidence based on match rate + if match_rate >= 0.8: + confidence = "high" + elif match_rate >= 0.3: + confidence = base_confidence + else: + confidence = "low" if base_confidence == "medium" else "medium" + + detections.append( + PIIDetection( + table=table, + column=column, + pattern_name=pattern_name, + match_count=match_count, + sample_size=sample_size, + confidence=confidence, + ) + ) + + return detections + + def scan_rows( + self, + table: str, + rows: list[dict[str, Any]], + skip_columns: set[str] | None = None, + ) -> list[PIIDetection]: + """ + Scan all text columns in a set of rows for PII. 
+ + Args: + table: Table name + rows: List of row dictionaries + skip_columns: Columns to skip (e.g., already anonymized) + + Returns: + List of PIIDetection results + """ + if not rows: + return [] + + skip = skip_columns or set() + all_detections: list[PIIDetection] = [] + + # Collect values per column + columns: dict[str, list[Any]] = {} + for row in rows: + for col, val in row.items(): + if col in skip: + continue + if val is not None and isinstance(val, (str, int, float)): + columns.setdefault(col, []).append(val) + + for col, values in columns.items(): + detections = self.scan_column(table, col, values) + all_detections.extend(detections) + + return all_detections diff --git a/src/dbslice/compliance/transformers.py b/src/dbslice/compliance/transformers.py new file mode 100644 index 0000000..9df44ba --- /dev/null +++ b/src/dbslice/compliance/transformers.py @@ -0,0 +1,183 @@ +from __future__ import annotations + +import datetime +import re +from typing import Any + +# Per 45 CFR 164.514(b)(2)(i)(B): Geographic data smaller than state must be +# removed, EXCEPT the initial 3 digits of a ZIP code may be retained if the +# geographic unit formed by combining all ZIP codes with the same 3 initial +# digits contains more than 20,000 people. +# +# The following 3-digit ZIP prefixes have a population of 20,000 or fewer per US Census +# and must be changed to "000" under Safe Harbor. +# +# Source: US Census Bureau, derived from ZCTA population data. +# These prefixes are stable across census cycles. Last verified: 2020 Census. + +_LOW_POPULATION_ZIP3: frozenset[str] = frozenset({ + "036", # NH + "059", # VT + "063", # CT + "102", # NY (small area) + "203", # DC (small overlap) + "556", # MN + "692", # NE + "790", # TX (small area) + "821", # WY + "823", # WY + "830", # WY + "831", # WY + "878", # NM + "879", # NM + "884", # NM + "890", # NV + "893", # NV +}) + + +def hipaa_safe_harbor_zip3(value: Any) -> str: + """ + HIPAA Safe Harbor ZIP code transformation. 
+ + Retains only the first 3 digits of a ZIP code. If the 3-digit prefix + has population < 20,000 (per Census data), returns "000" instead. + + Per 45 CFR 164.514(b)(2)(i)(B). + + Args: + value: Original ZIP code (string or int) + + Returns: + 3-digit ZIP prefix, or "000" if low-population area + """ + raw = str(value).strip() + # Extract digits only (handles "12345-6789" format) + digits = re.sub(r"[^0-9]", "", raw) + if len(digits) < 3: + return "000" + + prefix = digits[:3] + if prefix in _LOW_POPULATION_ZIP3: + return "000" + return prefix + + +def year_only(value: Any) -> str: + """ + HIPAA Safe Harbor date transformation. + + Extracts only the year from a date value. Per 45 CFR 164.514(b)(2)(i)(C), + all date elements (except year) must be removed for dates directly related + to an individual. + + Args: + value: Original date value (date, datetime, string, or int) + + Returns: + Year string (e.g., "1985") + """ + if value is None: + return "" + + # datetime/date objects + if isinstance(value, (datetime.datetime, datetime.date)): + return str(value.year) + + raw = str(value).strip() + + # ISO format: 2024-03-15 or 2024-03-15T10:30:00 + iso_match = re.match(r"(\d{4})-\d{2}-\d{2}", raw) + if iso_match: + return iso_match.group(1) + + # US format: 03/15/2024 or 03-15-2024 + us_match = re.match(r"\d{1,2}[/-]\d{1,2}[/-](\d{4})", raw) + if us_match: + return us_match.group(1) + + # Just a 4-digit year + year_match = re.match(r"^(\d{4})$", raw) + if year_match: + return year_match.group(1) + + # Fallback: try to find any 4-digit year in the string + any_year = re.search(r"\b(19|20)\d{2}\b", raw) + if any_year: + return any_year.group(0) + + return "" + + +def age_bucket(value: Any) -> str: + """ + HIPAA Safe Harbor age bucketing. + + Per 45 CFR 164.514(b)(2)(i)(C), ages over 89 must be aggregated into + a single category of "90 or over." 
+ + Args: + value: Age as integer or string + + Returns: + Original age as string if <= 89, or "90+" if > 89 + """ + try: + age = int(value) + except (ValueError, TypeError): + return str(value) + + if age > 89: + return "90+" + return str(age) + + +_FREETEXT_REDACTION_PATTERNS: list[tuple[re.Pattern[str], str]] = [ + # Email + (re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b"), "[REDACTED_EMAIL]"), + # SSN + (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"), + # US Phone + (re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[REDACTED_PHONE]"), + # Credit card (with separators) + (re.compile(r"(? str: + """ + Inline PII redaction for free-text fields. + + Replaces detected PII patterns with placeholder tokens while preserving + the surrounding text structure. This is for NOT NULL text columns where + NULLing is not possible. + + Args: + value: Original text value + + Returns: + Text with PII patterns replaced by [REDACTED_*] placeholders + """ + if value is None: + return "" + + text = str(value) + for pattern, replacement in _FREETEXT_REDACTION_PATTERNS: + text = pattern.sub(replacement, text) + return text + + +BINARY_SENTINEL = b"\x00" +"""Sentinel value for NOT NULL binary columns when compliance requires NULLing.""" + + +CUSTOM_TRANSFORMERS: dict[str, Any] = { + "hipaa_zip3": hipaa_safe_harbor_zip3, + "year_only": year_only, + "age_bucket": age_bucket, + "redact_freetext": redact_freetext, +} diff --git a/src/dbslice/config.py b/src/dbslice/config.py index 3ae619b..98a2dea 100644 --- a/src/dbslice/config.py +++ b/src/dbslice/config.py @@ -323,3 +323,20 @@ class ExtractConfig: virtual_foreign_keys: list[VirtualForeignKey] = field(default_factory=list) schema: str | None = None # PostgreSQL schema name (default: public) allow_unsafe_where: bool = False + compliance_profiles: list[str] = field(default_factory=list) + compliance_strict: bool = False # Fail if uncovered PII detected + generate_manifest: bool = False # 
Generate audit manifest + deterministic: bool = True # False = non-deterministic anonymization + compliance_policy_mode: str = "off" # off, standard, strict + compliance_allowed_url_patterns: list[str] = field(default_factory=list) + compliance_denied_url_patterns: list[str] = field(default_factory=list) + compliance_required_sslmode: str | None = None + compliance_require_ci: bool = False + compliance_manifest_sign: bool = False + compliance_manifest_key_env: str = "DBSLICE_MANIFEST_SIGNING_KEY" + freetext_action: str = "warn" # warn, null, redact + binary_action: str = "warn" # warn, null, sentinel + compliance_sample_rows: int = 100 # PII scan sample size during extract + k_anonymity_min_k: int | None = None # None = disabled, 2+ = check + k_anonymity_quasi_identifiers: list[str] = field(default_factory=list) + k_anonymity_action: str = "warn" # warn, fail diff --git a/src/dbslice/config_file.py b/src/dbslice/config_file.py index f1a45c6..78a8e3c 100644 --- a/src/dbslice/config_file.py +++ b/src/dbslice/config_file.py @@ -30,6 +30,7 @@ "performance", "tables", "virtual_foreign_keys", + "compliance", } _DATABASE_KEYS = {"url", "schema", "options"} _EXTRACTION_KEYS = { @@ -48,6 +49,19 @@ "fields", "patterns", "security_null_fields", + "deterministic", +} +_COMPLIANCE_KEYS = { + "profiles", + "strict", + "generate_manifest", + "policy_mode", + "allow_url_patterns", + "deny_url_patterns", + "required_sslmode", + "require_ci", + "sign_manifest", + "manifest_key_env", } _OUTPUT_KEYS = { "format", @@ -233,6 +247,7 @@ def _yaml_quote(value: str) -> str: "DatabaseConfig", "ExtractionConfig", "AnonymizationConfig", + "ComplianceConfig", "OutputConfig", "PerformanceConfig", "StreamingConfig", @@ -326,6 +341,44 @@ class AnonymizationConfig: Example: ["users.password*", "*.api_key"] """ + deterministic: bool = True + """Use deterministic anonymization (same input → same output). 
Set to false for stronger privacy.""" + + +@dataclass +class ComplianceConfig: + """Compliance configuration.""" + + profiles: list[str] = field(default_factory=list) + """Compliance profiles to apply (e.g., ['gdpr', 'hipaa', 'pci-dss']).""" + + strict: bool = False + """Fail extraction if uncovered PII is detected by value scanning.""" + + generate_manifest: bool = False + """Generate an audit manifest alongside extraction output.""" + + policy_mode: str = "off" + """Policy gate mode: off, standard, or strict.""" + + allow_url_patterns: list[str] = field(default_factory=list) + """Allow-list regex patterns for source database URLs.""" + + deny_url_patterns: list[str] = field(default_factory=list) + """Deny-list regex patterns for source database URLs.""" + + required_sslmode: str | None = None + """Required PostgreSQL sslmode query parameter value.""" + + require_ci: bool = False + """Require CI environment for compliance-active extraction.""" + + sign_manifest: bool = False + """Sign compliance manifests with HMAC.""" + + manifest_key_env: str = "DBSLICE_MANIFEST_SIGNING_KEY" + """Environment variable name containing HMAC signing key.""" + @dataclass class OutputConfig: @@ -438,6 +491,7 @@ class DbsliceConfig: database: DatabaseConfig = field(default_factory=DatabaseConfig) extraction: ExtractionConfig = field(default_factory=ExtractionConfig) anonymization: AnonymizationConfig = field(default_factory=AnonymizationConfig) + compliance: ComplianceConfig = field(default_factory=ComplianceConfig) output: OutputConfig = field(default_factory=OutputConfig) performance: PerformanceConfig = field(default_factory=PerformanceConfig) tables: dict[str, TableOverride] = field(default_factory=dict) @@ -583,12 +637,100 @@ def _from_dict(cls, data: dict[str, Any]) -> "DbsliceConfig": for pattern in security_null_fields: _validate_glob_field_pattern(pattern, "'anonymization.security_null_fields'") + deterministic_val = anon_data.get("deterministic", True) + if not 
isinstance(deterministic_val, bool): + raise ValueError("'anonymization.deterministic' must be true or false") + anonymization = AnonymizationConfig( enabled=anon_data.get("enabled", False), seed=anon_data.get("seed"), fields=fields, patterns=patterns, security_null_fields=security_null_fields, + deterministic=deterministic_val, + ) + + compliance_data = data.get("compliance", {}) + if not isinstance(compliance_data, dict): + raise ValueError("'compliance' section must be a mapping") + _validate_unknown_keys("compliance", compliance_data, _COMPLIANCE_KEYS) + + compliance_profiles_raw = compliance_data.get("profiles", []) + if not isinstance(compliance_profiles_raw, list): + raise ValueError("'compliance.profiles' must be a list") + + # Validate profile names + from dbslice.compliance.profiles import get_profile + for profile_name in compliance_profiles_raw: + if not isinstance(profile_name, str): + raise ValueError("'compliance.profiles' entries must be strings") + get_profile(profile_name) # Raises ValueError if unknown + + compliance_strict = compliance_data.get("strict", False) + if not isinstance(compliance_strict, bool): + raise ValueError("'compliance.strict' must be true or false") + compliance_manifest = compliance_data.get("generate_manifest", False) + if not isinstance(compliance_manifest, bool): + raise ValueError("'compliance.generate_manifest' must be true or false") + compliance_policy_mode = compliance_data.get("policy_mode", "off") + if compliance_policy_mode not in {"off", "standard", "strict"}: + raise ValueError("'compliance.policy_mode' must be one of: off, standard, strict") + + allow_url_patterns = compliance_data.get("allow_url_patterns", []) + if not isinstance(allow_url_patterns, list) or not all( + isinstance(item, str) for item in allow_url_patterns + ): + raise ValueError("'compliance.allow_url_patterns' must be a list of strings") + for pattern in allow_url_patterns: + try: + re.compile(pattern) + except re.error as e: + raise 
ValueError( + f"'compliance.allow_url_patterns' contains invalid regex '{pattern}': {e}" + ) from e + + deny_url_patterns = compliance_data.get("deny_url_patterns", []) + if not isinstance(deny_url_patterns, list) or not all( + isinstance(item, str) for item in deny_url_patterns + ): + raise ValueError("'compliance.deny_url_patterns' must be a list of strings") + for pattern in deny_url_patterns: + try: + re.compile(pattern) + except re.error as e: + raise ValueError( + f"'compliance.deny_url_patterns' contains invalid regex '{pattern}': {e}" + ) from e + + required_sslmode = compliance_data.get("required_sslmode") + if required_sslmode is not None and ( + not isinstance(required_sslmode, str) or not required_sslmode.strip() + ): + raise ValueError("'compliance.required_sslmode' must be a non-empty string when set") + + require_ci = compliance_data.get("require_ci", False) + if not isinstance(require_ci, bool): + raise ValueError("'compliance.require_ci' must be true or false") + + sign_manifest = compliance_data.get("sign_manifest", False) + if not isinstance(sign_manifest, bool): + raise ValueError("'compliance.sign_manifest' must be true or false") + + manifest_key_env = compliance_data.get("manifest_key_env", "DBSLICE_MANIFEST_SIGNING_KEY") + if not isinstance(manifest_key_env, str) or not manifest_key_env: + raise ValueError("'compliance.manifest_key_env' must be a non-empty string") + + compliance = ComplianceConfig( + profiles=compliance_profiles_raw, + strict=compliance_strict, + generate_manifest=compliance_manifest, + policy_mode=compliance_policy_mode, + allow_url_patterns=allow_url_patterns, + deny_url_patterns=deny_url_patterns, + required_sslmode=required_sslmode, + require_ci=require_ci, + sign_manifest=sign_manifest, + manifest_key_env=manifest_key_env, ) output_data = data.get("output", {}) @@ -805,6 +947,7 @@ def _from_dict(cls, data: dict[str, Any]) -> "DbsliceConfig": database=database, extraction=extraction, anonymization=anonymization, + 
compliance=compliance, output=output, performance=performance, tables=tables, @@ -1052,6 +1195,18 @@ def to_extract_config( virtual_foreign_keys=virtual_fks, schema=final_schema, allow_unsafe_where=final_allow_unsafe_where, + compliance_profiles=self.compliance.profiles, + compliance_strict=self.compliance.strict, + generate_manifest=self.compliance.generate_manifest + or bool(self.compliance.profiles), + deterministic=self.anonymization.deterministic, + compliance_policy_mode=self.compliance.policy_mode, + compliance_allowed_url_patterns=list(self.compliance.allow_url_patterns), + compliance_denied_url_patterns=list(self.compliance.deny_url_patterns), + compliance_required_sslmode=self.compliance.required_sslmode, + compliance_require_ci=self.compliance.require_ci, + compliance_manifest_sign=self.compliance.sign_manifest, + compliance_manifest_key_env=self.compliance.manifest_key_env, ) def to_yaml(self, include_comments: bool = True) -> str: @@ -1136,6 +1291,38 @@ def to_yaml(self, include_comments: bool = True) -> str: output.append(" security_null_fields:") for pattern in self.anonymization.security_null_fields: output.append(f" - {_yaml_quote(pattern)}") + output.append(f" deterministic: {str(self.anonymization.deterministic).lower()}") + if include_comments: + output.append( + " # deterministic=false increases privacy but may reduce repeatability" + ) + output.append("") + + if include_comments: + output.append("# Compliance settings") + output.append("compliance:") + if self.compliance.profiles: + output.append(" profiles:") + for profile in self.compliance.profiles: + output.append(f" - {profile}") + else: + output.append(" profiles: []") + output.append(f" strict: {str(self.compliance.strict).lower()}") + output.append(f" generate_manifest: {str(self.compliance.generate_manifest).lower()}") + output.append(f" policy_mode: {_yaml_quote(self.compliance.policy_mode)}") + if self.compliance.allow_url_patterns: + output.append(" allow_url_patterns:") + for 
pattern in self.compliance.allow_url_patterns: + output.append(f" - {_yaml_quote(pattern)}") + if self.compliance.deny_url_patterns: + output.append(" deny_url_patterns:") + for pattern in self.compliance.deny_url_patterns: + output.append(f" - {_yaml_quote(pattern)}") + if self.compliance.required_sslmode: + output.append(f" required_sslmode: {self.compliance.required_sslmode}") + output.append(f" require_ci: {str(self.compliance.require_ci).lower()}") + output.append(f" sign_manifest: {str(self.compliance.sign_manifest).lower()}") + output.append(f" manifest_key_env: {self.compliance.manifest_key_env}") output.append("") if include_comments: diff --git a/src/dbslice/core/engine.py b/src/dbslice/core/engine.py index b447851..e9f9266 100644 --- a/src/dbslice/core/engine.py +++ b/src/dbslice/core/engine.py @@ -117,18 +117,52 @@ def __init__( self.adapter: DatabaseAdapter | None = None self.schema: SchemaGraph | None = None self.progress_callback = progress_callback + self.manifest: Any = None # ComplianceManifest or None + + if config.generate_manifest or config.compliance_profiles: + from dbslice.compliance.manifest import ComplianceManifest + + self.manifest = ComplianceManifest() + import uuid + + self.manifest.initialize( + extraction_id=str(uuid.uuid4()), + compliance_profiles=config.compliance_profiles, + anonymization_seed=config.anonymization_seed, + deterministic=config.deterministic, + ) + + effective_field_providers = dict(config.anonymization_field_providers) + effective_patterns = dict(config.anonymization_patterns) + effective_profile_patterns: dict[str, str] = {} + effective_security_null = list(config.security_null_fields) + if config.compliance_profiles: + from dbslice.compliance.profiles import get_profile + + for profile_name in config.compliance_profiles: + profile = get_profile(profile_name) + # Merge profile patterns as wildcard fallback rules (lower priority than user rules). 
+ for pattern, method in profile.required_column_patterns.items(): + effective_profile_patterns.setdefault(f"*.{pattern}*", method) + for null_pattern in profile.required_null_patterns: + glob = f"*.{null_pattern}*" + if glob not in effective_security_null: + effective_security_null.append(glob) - # Initialize anonymizer if needed (schema will be set after introspection) self.anonymizer: DeterministicAnonymizer | None = None - if config.anonymize or config.redact_fields: + needs_anonymize = config.anonymize or config.redact_fields or config.compliance_profiles + if needs_anonymize: self.anonymizer = DeterministicAnonymizer( - seed=config.anonymization_seed or DEFAULT_ANONYMIZATION_SEED + seed=config.anonymization_seed or DEFAULT_ANONYMIZATION_SEED, + deterministic=config.deterministic, + manifest=self.manifest, ) self.anonymizer.configure( config.redact_fields, - field_providers=config.anonymization_field_providers, - patterns=config.anonymization_patterns, - security_null_fields=config.security_null_fields, + field_providers=effective_field_providers, + patterns=effective_patterns, + fallback_patterns=effective_profile_patterns, + security_null_fields=effective_security_null, ) def _log(self, stage: str, message: str, current: int = 0, total: int = 0) -> None: @@ -371,7 +405,6 @@ def _do_extract(self, db_type: DatabaseType) -> ExtractionResult: logger.info("Starting data fetch phase", table_count=len(all_records)) tables_data: dict[str, list[dict[str, Any]]] = {} - stats: dict[str, int] = {} total_tables = len(all_records) for i, (table, pk_values) in enumerate(all_records.items()): @@ -392,29 +425,47 @@ def _do_extract(self, db_type: DatabaseType) -> ExtractionResult: ): rows = list(self.adapter.fetch_by_pk(table, pk_columns, pk_values)) - # Anonymize if enabled - if self.anonymizer: - with logger.timed_operation("anonymize_table_data", table=table): - rows = self._anonymize_table_data(table, rows) - logger.debug("Table data anonymized", table=table, 
row_count=len(rows)) - tables_data[table] = rows - stats[table] = len(rows) logger.debug("Table data fetched", table=table, row_count=len(rows)) if self._has_row_limits(): self._log("limits", "Applying deterministic row limits with integrity closure...") with logger.timed_operation("apply_row_limits"): tables_data = self._apply_row_limits(tables_data) - stats = {table: len(rows) for table, rows in tables_data.items()} logger.info( "Row limits applied", global_limit=self.config.row_limit_global, per_table_limits=len(self.config.row_limit_per_table), - total_rows=sum(stats.values()), + total_rows=sum(len(rows) for rows in tables_data.values()), ) self._log("limits", "Row limits applied") + scan_pre_mask_data: dict[str, list[dict[str, Any]]] | None = None + if self.config.compliance_profiles and self.anonymizer: + # Pre-mask snapshot used for coverage scan decisions. + scan_pre_mask_data = { + table: [dict(row) for row in rows] for table, rows in tables_data.items() + } + + if self.anonymizer: + self._log("anonymize", "Applying anonymization rules...") + total_tables = len(tables_data) + for i, table in enumerate(sorted(tables_data.keys())): + rows = tables_data[table] + if not rows: + continue + self._log( + "anonymize", + f"Anonymizing {len(rows)} rows in {table}", + i + 1, + total_tables, + ) + with logger.timed_operation("anonymize_table_data", table=table): + tables_data[table] = self._anonymize_table_data(table, rows) + logger.debug("Table data anonymized", table=table, row_count=len(rows)) + + stats: dict[str, int] = {table: len(rows) for table, rows in tables_data.items()} + deferred_updates = [] if broken_fks: from dbslice.core.cycles import build_deferred_updates @@ -472,6 +523,19 @@ def _do_extract(self, db_type: DatabaseType) -> ExtractionResult: ) raise ExtractionError(error_msg) + if self.config.compliance_profiles and self.schema: + self._apply_freetext_and_binary_handling(tables_data) + + if self.config.compliance_profiles and self.anonymizer and 
scan_pre_mask_data is not None: + self._run_pii_scan(scan_pre_mask_data, tables_data) + + if self.config.k_anonymity_min_k is not None: + self._check_k_anonymity(tables_data) + + if self.manifest: + for table, rows in tables_data.items(): + self.manifest.set_table_row_count(table, len(rows)) + return ExtractionResult( tables=tables_data, insert_order=insert_order, @@ -504,6 +568,276 @@ def _anonymize_table_data(self, table: str, rows: list[dict[str, Any]]) -> list[ return [self.anonymizer.anonymize_row(table, row) for row in rows] + def _run_pii_scan( + self, + pre_mask_data: dict[str, list[dict[str, Any]]], + post_mask_data: dict[str, list[dict[str, Any]]], + ) -> None: + """ + Run two-phase compliance value scanning. + + 1) Coverage scan (pre-mask): identify where PII exists in extracted values. + 2) Residual scan (post-mask): re-scan only columns not expected to be protected. + """ + from dbslice.compliance.profiles import get_profile + from dbslice.compliance.scanner import PIIScanner + + assert self.anonymizer is not None + + # Collect all scan patterns from active profiles + scan_patterns: set[str] = set() + freetext_patterns: set[str] = set() + for profile_name in self.config.compliance_profiles: + profile = get_profile(profile_name) + scan_patterns.update(profile.value_scan_patterns) + freetext_patterns.update(profile.warn_freetext_columns) + + if not scan_patterns: + return + + scanner = PIIScanner(patterns=sorted(scan_patterns)) + self._log("compliance", "Running compliance coverage scan...") + + coverage_detections = [] + for table, rows in pre_mask_data.items(): + if not rows: + continue + sample = rows[:100] + detections = scanner.scan_rows(table, sample) + coverage_detections.extend(detections) + + # Check for freetext columns that might contain embedded PII + if freetext_patterns: + for col in rows[0].keys(): + col_lower = col.lower() + for pattern in freetext_patterns: + if pattern in col_lower: + if self.manifest: + self.manifest.add_warning( + 
table, col, + f"Free-text column may contain embedded PII (matched pattern: {pattern})", + ) + break + + unprotected_columns: dict[str, set[str]] = {} + for detection in coverage_detections: + is_protected = self.anonymizer.should_anonymize( + detection.table, detection.column + ) or self.anonymizer.should_null(detection.table, detection.column) + if not is_protected: + unprotected_columns.setdefault(detection.table, set()).add(detection.column) + if self.manifest: + self.manifest.add_warning( + detection.table, + detection.column, + "PII detected in coverage scan but field is not configured for masking", + ) + + residual_detections = [] + if unprotected_columns: + self._log("compliance", "Running compliance residual scan...") + for table, rows in post_mask_data.items(): + if not rows: + continue + columns_to_scan = unprotected_columns.get(table) + if not columns_to_scan: + continue + sample = rows[:100] + skip_columns = {col for col in rows[0].keys() if col not in columns_to_scan} + detections = scanner.scan_rows(table, sample, skip_columns=skip_columns) + residual_detections.extend(detections) + + if self.manifest: + self.manifest.add_pii_detections(residual_detections) + + if residual_detections: + logger.warning( + "Residual PII detected in post-mask scan", + detection_count=len(residual_detections), + tables_affected=len({d.table for d in residual_detections}), + ) + self._log( + "compliance", + f"Residual scan found {len(residual_detections)} unprotected PII detection(s)", + ) + + if self.config.compliance_strict: + detection_details = [ + f" {d.table}.{d.column}: {d.pattern_name} ({d.match_count}/{d.sample_size} matches, {d.confidence} confidence)" + for d in residual_detections + ] + raise ExtractionError( + "Compliance strict mode: residual unprotected PII detected after masking.\n" + + "\n".join(detection_details) + ) + else: + if coverage_detections: + self._log( + "compliance", + "Coverage scan detected PII in source values; residual scan is clean", + ) 
+ else: + self._log("compliance", "Coverage scan clean: no PII detected in sampled values") + + def _apply_freetext_and_binary_handling( + self, tables_data: dict[str, list[dict[str, Any]]] + ) -> None: + """Apply free-text redaction and binary column handling based on compliance config.""" + assert self.schema is not None + + from dbslice.compliance.profiles import get_profile + from dbslice.compliance.transformers import BINARY_SENTINEL, redact_freetext + + freetext_action = self.config.freetext_action + binary_action = self.config.binary_action + + # Collect freetext column patterns from active profiles + freetext_patterns: set[str] = set() + for profile_name in self.config.compliance_profiles: + profile = get_profile(profile_name) + freetext_patterns.update(profile.warn_freetext_columns) + + # Binary-like PostgreSQL types + binary_types = {"bytea", "blob", "binary", "varbinary", "image", "lo"} + + for table, rows in tables_data.items(): + if not rows: + continue + table_info = self.schema.get_table(table) + if not table_info: + continue + + for col_obj in table_info.columns: + col = col_obj.name + col_lower = col.lower() + col_type_lower = col_obj.data_type.lower() + + # Binary column handling + if any(bt in col_type_lower for bt in binary_types): + if binary_action == "null": + for row in rows: + if col in row: + row[col] = None + elif binary_action == "sentinel": + for row in rows: + if col in row and row[col] is not None: + if col_obj.nullable: + row[col] = None + else: + row[col] = BINARY_SENTINEL + if self.manifest: + self.manifest.add_warning( + table, col, + f"Binary column ({col_obj.data_type}) handled with action={binary_action}", + ) + continue + + # Free-text column handling + is_freetext = any(pat in col_lower for pat in freetext_patterns) + if not is_freetext: + continue + + # Skip columns already handled by anonymizer + if self.anonymizer and ( + self.anonymizer.should_anonymize(table, col) + or self.anonymizer.should_null(table, col) + ): + 
continue + + if freetext_action == "null": + for row in rows: + if col in row: + if col_obj.nullable: + row[col] = None + else: + # NOT NULL: fall back to redact + row[col] = redact_freetext(row[col]) + elif freetext_action == "redact": + for row in rows: + if col in row and row[col] is not None: + row[col] = redact_freetext(row[col]) + + if self.manifest and freetext_action != "warn": + effective = freetext_action + if freetext_action == "null" and not col_obj.nullable: + effective = "redact (NOT NULL fallback)" + self.manifest.add_warning( + table, col, + f"Free-text column handled with action={effective}", + ) + + def _check_k_anonymity(self, tables_data: dict[str, list[dict[str, Any]]]) -> None: + """ + Post-extraction k-anonymity verification. + + Checks that every combination of configured quasi-identifiers appears + at least k times in the output. Fail-only — does not modify data. + """ + min_k = self.config.k_anonymity_min_k + qi_specs = self.config.k_anonymity_quasi_identifiers + action = self.config.k_anonymity_action + + if not min_k or not qi_specs: + return + + self._log("compliance", f"Running k-anonymity check (k={min_k})...") + + # Parse quasi-identifier specs: "table.column" format + qi_by_table: dict[str, list[str]] = {} + for spec in qi_specs: + parts = spec.split(".", 1) + if len(parts) == 2: + qi_by_table.setdefault(parts[0].lower(), []).append(parts[1].lower()) + + violations: list[str] = [] + for table, qi_columns in qi_by_table.items(): + rows = tables_data.get(table, []) + if not rows: + continue + + # Check which columns actually exist + available = {c.lower() for c in rows[0].keys()} + active_qi = [c for c in qi_columns if c in available] + if not active_qi: + continue + + # Count combinations + from collections import Counter + + combos = Counter( + tuple(str(row.get(c, "")) for c in active_qi) + for row in rows + ) + + for combo, count in combos.items(): + if count < min_k: + combo_str = ", ".join(f"{c}={v}" for c, v in zip(active_qi, 
combo)) + violations.append(f"{table}: [{combo_str}] appears {count} time(s)") + + if not violations: + self._log("compliance", f"k-anonymity check passed (k={min_k})") + return + + msg = f"k-anonymity violation: {len(violations)} combination(s) appear fewer than {min_k} times" + logger.warning(msg, violation_count=len(violations), min_k=min_k) + + if self.manifest: + for v in violations[:50]: + self.manifest.add_warning("_k_anonymity", "quasi_identifiers", v) + + detail_lines = [f" {v}" for v in violations[:20]] + if len(violations) > 20: + detail_lines.append(f" ... and {len(violations) - 20} more") + + self._log("compliance", msg) + + if action == "fail": + raise ExtractionError( + f"k-anonymity check failed (k={min_k}): " + f"{len(violations)} quasi-identifier combination(s) are unique or below threshold.\n" + + "\n".join(detail_lines) + ) + def _has_row_limits(self) -> bool: """Check whether any row-limit configuration is active.""" return self.config.row_limit_global is not None or bool(self.config.row_limit_per_table) diff --git a/src/dbslice/mapping/__init__.py b/src/dbslice/mapping/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/src/dbslice/mapping/__init__.py @@ -0,0 +1 @@ + diff --git a/src/dbslice/mapping/server.py b/src/dbslice/mapping/server.py new file mode 100644 index 0000000..dd904e0 --- /dev/null +++ b/src/dbslice/mapping/server.py @@ -0,0 +1,489 @@ +from __future__ import annotations + +import inspect +import json +import secrets +import threading +from http.server import BaseHTTPRequestHandler, HTTPServer +from typing import Any +from urllib.parse import parse_qs, urlparse + +from dbslice.logging import get_logger +from dbslice.models import SchemaGraph + +logger = get_logger(__name__) + + +class MappingServer: + """Local mapping UI HTTP server.""" + + def __init__( + self, + port: int = 9473, + database_url: str | None = None, + schema: str | None = None, + ): + self.port = port + self.database_url = database_url + 
self.schema_name = schema + self.token = secrets.token_urlsafe(32) + self._server: HTTPServer | None = None + self._cached_schema: SchemaGraph | None = None + self._cached_adapter: Any = None + + @property + def url(self) -> str: + return f"http://127.0.0.1:{self.port}?token={self.token}" + + def start(self, open_browser: bool = True) -> None: + """Start the server and optionally open a browser.""" + handler = _make_handler(self) + self._server = HTTPServer(("127.0.0.1", self.port), handler) + + if open_browser: + import webbrowser + + threading.Timer(0.5, webbrowser.open, args=[self.url]).start() + + logger.info("Mapping UI server starting", url=self.url) + try: + self._server.serve_forever() + except KeyboardInterrupt: + pass + finally: + self._server.server_close() + if self._cached_adapter: + try: + self._cached_adapter.close() + except Exception: + pass + + def _introspect(self, database_url: str, schema: str | None, detect_sensitive: bool) -> dict: + """Connect to database and introspect schema.""" + from dbslice.adapters.postgresql import PostgreSQLAdapter + from dbslice.compliance.profiles import list_profiles + from dbslice.input_validators import validate_database_url + from dbslice.utils.anonymizer import ( + _SECURITY_NULL_PATTERNS, + ) + from dbslice.utils.connection import parse_database_url + + validate_database_url(database_url) + db_config = parse_database_url(database_url) + + if self._cached_adapter: + try: + self._cached_adapter.close() + except Exception: + pass + + adapter = PostgreSQLAdapter(schema=schema) + adapter.connect(database_url) + self._cached_adapter = adapter + + db_schema = adapter.get_schema() + self._cached_schema = db_schema + + tables = [] + sensitive_suggestions: dict[str, str] = {} + + if detect_sensitive: + sensitive_patterns = { + "email": "email", + "e_mail": "email", + "email_address": "email", + "phone": "phone_number", + "telephone": "phone_number", + "mobile": "phone_number", + "cell": "phone_number", + "first_name": 
"first_name", + "firstname": "first_name", + "last_name": "last_name", + "lastname": "last_name", + "full_name": "name", + "fullname": "name", + "address": "address", + "street": "street_address", + "city": "city", + "postal_code": "postcode", + "zipcode": "postcode", + "ssn": "ssn", + "social_security": "ssn", + "passport": "passport_number", + "driver_license": "license_plate", + "credit_card": "credit_card_number", + "card_number": "credit_card_number", + "ip_address": "ipv4", + "ip": "ipv4", + "ipv4": "ipv4", + "ipv6": "ipv6", + "dob": "date_of_birth", + "date_of_birth": "date_of_birth", + "username": "user_name", + } + for table_name, table in db_schema.tables.items(): + for column in table.columns: + col_lower = column.name.lower() + if col_lower in sensitive_patterns: + sensitive_suggestions[f"{table_name}.{column.name}"] = sensitive_patterns[ + col_lower + ] + else: + for pattern, provider in sensitive_patterns.items(): + if pattern in col_lower: + sensitive_suggestions[f"{table_name}.{column.name}"] = provider + break + + fk_columns: set[tuple[str, str]] = set() + for fk in db_schema.edges: + for col in fk.source_columns: + fk_columns.add((fk.source_table, col)) + + null_columns: set[str] = set() + for tbl_name, tbl in db_schema.tables.items(): + for col_obj in tbl.columns: + col_lower = col_obj.name.lower() + for pat in _SECURITY_NULL_PATTERNS: + if pat in col_lower: + null_columns.add(f"{tbl_name}.{col_obj.name}") + break + + from dbslice.models import Column as ColumnModel + + for table_name in sorted(db_schema.tables.keys()): + table_info = db_schema.tables[table_name] + columns: list[dict[str, Any]] = [] + col_obj2: ColumnModel + for col_obj2 in table_info.columns: + full_name = f"{table_name}.{col_obj2.name}" + is_fk = (table_name, col_obj2.name) in fk_columns + suggested = sensitive_suggestions.get(full_name) + is_null_target = full_name in null_columns + + action = "keep" + provider = "" + if is_fk: + action = "locked_fk" + elif 
col_obj2.is_primary_key: + action = "locked_pk" + elif is_null_target: + action = "null" + elif suggested: + action = "anonymize" + provider = suggested + + columns.append( + { + "name": col_obj2.name, + "data_type": col_obj2.data_type, + "nullable": col_obj2.nullable, + "is_pk": col_obj2.is_primary_key, + "is_fk": is_fk, + "suggested_action": action, + "suggested_provider": provider, + } + ) + + tables.append( + { + "name": table_name, + "primary_key": list(table_info.primary_key), + "columns": columns, + } + ) + + profiles = [ + {"name": p.name, "display_name": p.display_name, "description": p.description} + for p in list_profiles() + ] + + common_providers = [ + "email", + "phone_number", + "first_name", + "last_name", + "name", + "address", + "street_address", + "city", + "zipcode", + "ssn", + "credit_card_number", + "ipv4", + "ipv6", + "company", + "url", + "date_of_birth", + "user_name", + "passport_number", + "iban", + "pystr", + "random_int", + "year_only", + "hipaa_zip3", + "age_bucket", + "redact_freetext", + ] + + return { + "database": db_config.database, + "table_count": len(tables), + "tables": tables, + "sensitive_suggestions": sensitive_suggestions, + "compliance_profiles": profiles, + "common_providers": common_providers, + } + + def _apply_profile(self, profile_name: str, current_mappings: dict) -> dict: + """Apply a compliance profile's patterns to the current schema.""" + from dbslice.compliance.profiles import get_profile + + profile = get_profile(profile_name) + if not self._cached_schema: + return {"error": "No schema loaded. 
Run introspection first."} + + additions: dict[str, str] = {} + null_additions: list[str] = [] + + for table_name, table in self._cached_schema.tables.items(): + for column in table.columns: + full_name = f"{table_name}.{column.name}" + if full_name in current_mappings: + continue + + col_lower = column.name.lower() + + for pat in profile.required_null_patterns: + if pat in col_lower: + null_additions.append(full_name) + break + else: + for pat, method in profile.required_column_patterns.items(): + if pat in col_lower: + additions[full_name] = method + break + + return { + "profile": profile_name, + "display_name": profile.display_name, + "field_additions": additions, + "null_additions": null_additions, + "identifiers_covered": profile.identifiers, + } + + @staticmethod + def _generate_config(mappings: dict) -> dict: + """Generate YAML config from column mappings.""" + fields: dict[str, str] = {} + null_fields: list[str] = [] + + for full_name, action_data in mappings.items(): + action = action_data.get("action", "keep") + if action == "anonymize": + provider = action_data.get("provider", "pystr") + fields[full_name] = provider + elif action == "null": + null_fields.append(full_name) + + lines = [ + "# Generated by dbslice map", + "", + "database:", + " url: ${DATABASE_URL}", + "", + "anonymization:", + " enabled: true", + ] + + if fields: + lines.append(" fields:") + for field_name, provider in sorted(fields.items()): + lines.append(f" {field_name}: {provider}") + + if null_fields: + lines.append(" security_null_fields:") + for field_name in sorted(null_fields): + lines.append(f" - {field_name}") + + lines.extend( + [ + "", + "extraction:", + " default_depth: 3", + " direction: both", + " validate: true", + "", + "output:", + " format: sql", + " include_transaction: true", + ] + ) + + yaml_content = "\n".join(lines) + "\n" + + cmd = 'dbslice extract --config dbslice.yaml --seed ""' + + return { + "yaml": yaml_content, + "command_template": cmd, + "field_count": 
len(fields), + "null_count": len(null_fields), + } + + @staticmethod + def _validate_provider(provider: str) -> dict: + """Validate a Faker provider name.""" + from dbslice.compliance.transformers import CUSTOM_TRANSFORMERS + + if provider in CUSTOM_TRANSFORMERS: + return {"valid": True, "provider": provider, "source": "custom_transformer"} + + try: + from faker import Faker + except ImportError: + return {"valid": False, "error": "Faker not installed"} + + fake = Faker() + method = getattr(fake, provider, None) + if method is None or not callable(method): + return {"valid": False, "error": f"Unknown provider '{provider}'"} + + try: + sig = inspect.signature(method) + for param in sig.parameters.values(): + if param.kind in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD): + continue + if param.default is inspect.Parameter.empty: + return { + "valid": False, + "error": f"Provider '{provider}' requires argument '{param.name}'", + } + except (TypeError, ValueError): + pass + + return {"valid": True, "provider": provider, "source": "faker"} + + +def _make_handler(server: MappingServer): + """Create a request handler class bound to the server instance.""" + + class Handler(BaseHTTPRequestHandler): + def log_message(self, format, *args): + pass + + def _check_token(self) -> bool: + token = self.headers.get("X-DBSLICE-Token") + if token != server.token: + self._json_error(403, "Invalid or missing session token") + return False + return True + + def _json_response(self, data: dict, status: int = 200) -> None: + body = json.dumps(data, default=str).encode("utf-8") + self.send_response(status) + self.send_header("Content-Type", "application/json") + self.send_header("Content-Length", str(len(body))) + self.end_headers() + self.wfile.write(body) + + def _json_error(self, status: int, message: str) -> None: + self._json_response({"error": message}, status) + + def _read_json_body(self) -> dict[str, Any] | None: + length = 
int(self.headers.get("Content-Length", 0)) + if length == 0: + self._json_error(400, "Empty request body") + return None + try: + result: dict[str, Any] = json.loads(self.rfile.read(length)) + return result + except json.JSONDecodeError: + self._json_error(400, "Invalid JSON") + return None + + def do_GET(self) -> None: + parsed = urlparse(self.path) + + if parsed.path == "/" or parsed.path == "": + query = parse_qs(parsed.query) + url_token = query.get("token", [None])[0] + if url_token != server.token: + self.send_response(403) + self.send_header("Content-Type", "text/plain") + self.end_headers() + self.wfile.write(b"Invalid session token") + return + + from dbslice.mapping.ui import get_ui_html + + html = get_ui_html(server.token, server.database_url or "") + body = html.encode("utf-8") + self.send_response(200) + self.send_header("Content-Type", "text/html; charset=utf-8") + self.send_header("Content-Length", str(len(body))) + self.end_headers() + self.wfile.write(body) + else: + self._json_error(404, "Not found") + + def do_POST(self) -> None: + parsed = urlparse(self.path) + + if not self._check_token(): + return + + if parsed.path == "/api/introspect": + body = self._read_json_body() + if body is None: + return + try: + result = server._introspect( + database_url=body.get("database_url", ""), + schema=body.get("schema"), + detect_sensitive=body.get("detect_sensitive", True), + ) + self._json_response(result) + except Exception as e: + self._json_error(400, str(e)) + + elif parsed.path == "/api/apply-profile": + body = self._read_json_body() + if body is None: + return + try: + result = server._apply_profile( + profile_name=body.get("profile", ""), + current_mappings=body.get("current_mappings", {}), + ) + self._json_response(result) + except Exception as e: + self._json_error(400, str(e)) + + elif parsed.path == "/api/generate-config": + body = self._read_json_body() + if body is None: + return + try: + result = server._generate_config( + 
mappings=body.get("mappings", {}), + ) + self._json_response(result) + except Exception as e: + self._json_error(400, str(e)) + + elif parsed.path == "/api/validate-provider": + body = self._read_json_body() + if body is None: + return + try: + result = server._validate_provider( + provider=body.get("provider", ""), + ) + self._json_response(result) + except Exception as e: + self._json_error(400, str(e)) + + else: + self._json_error(404, "Not found") + + return Handler diff --git a/src/dbslice/mapping/static/index.html b/src/dbslice/mapping/static/index.html new file mode 100644 index 0000000..d8b6b57 --- /dev/null +++ b/src/dbslice/mapping/static/index.html @@ -0,0 +1,591 @@ + + + + + +dbslice — Column Mapping + + + + + + + + + + + + +
+ [index.html markup not preserved; surviving UI text] dbslice | Map | 127.0.0.1 | Connect to a database | Enter your PostgreSQL connection URL in the sidebar and introspect the schema to begin mapping columns.
+ + + + diff --git a/src/dbslice/mapping/ui.py b/src/dbslice/mapping/ui.py new file mode 100644 index 0000000..1d4b80b --- /dev/null +++ b/src/dbslice/mapping/ui.py @@ -0,0 +1,18 @@ +from pathlib import Path + +_STATIC_DIR = Path(__file__).parent / "static" +_TEMPLATE_CACHE: str | None = None + + +def get_ui_html(token: str, initial_url: str) -> str: + """Return the complete HTML page with token and initial URL embedded.""" + global _TEMPLATE_CACHE # noqa: PLW0603 + if _TEMPLATE_CACHE is None: + template_path = _STATIC_DIR / "index.html" + _TEMPLATE_CACHE = template_path.read_text(encoding="utf-8") + + return ( + _TEMPLATE_CACHE + .replace("{{TOKEN}}", token) + .replace("{{INITIAL_URL}}", initial_url) + ) diff --git a/src/dbslice/utils/anonymizer.py b/src/dbslice/utils/anonymizer.py index 7fe836e..afbfddd 100644 --- a/src/dbslice/utils/anonymizer.py +++ b/src/dbslice/utils/anonymizer.py @@ -1,4 +1,5 @@ import hashlib +import secrets from fnmatch import fnmatchcase from typing import TYPE_CHECKING, Any @@ -16,6 +17,7 @@ Faker = None # type: ignore if TYPE_CHECKING: + from dbslice.compliance.manifest import ComplianceManifest from dbslice.models import SchemaGraph @@ -126,13 +128,21 @@ class DeterministicAnonymizer: tables or rows. Uses Faker with deterministic seeding based on input values. """ - def __init__(self, seed: str = DEFAULT_ANONYMIZATION_SEED, schema: "SchemaGraph | None" = None): + def __init__( + self, + seed: str = DEFAULT_ANONYMIZATION_SEED, + schema: "SchemaGraph | None" = None, + deterministic: bool = True, + manifest: "ComplianceManifest | None" = None, + ): """ Initialize the anonymizer with a global seed. 
Args: seed: Global seed for deterministic anonymization schema: Optional schema graph for FK detection (prevents anonymizing FK columns) + deterministic: If False, use random seeds per value (stronger privacy, no cross-table consistency) + manifest: Optional compliance manifest to record anonymization actions Raises: ImportError: If Faker is not installed @@ -142,16 +152,21 @@ def __init__(self, seed: str = DEFAULT_ANONYMIZATION_SEED, schema: "SchemaGraph "Faker is required for anonymization. Install it with: pip install faker" ) - logger.info("Initializing anonymizer", seed=seed[:20] + "...") # Truncate seed in logs + mode = "deterministic" if deterministic else "non-deterministic" + logger.info("Initializing anonymizer", seed=seed[:20] + "...", mode=mode) self.global_seed = seed + self.deterministic = deterministic self.fake = Faker() self._cache: dict[tuple, Any] = {} self.redact_fields: set[str] = set() # Set of normalized "table.column" self.field_providers: dict[str, str] = {} self.custom_patterns: list[tuple[str, str]] = [] + self.fallback_patterns: list[tuple[str, str]] = [] self.security_null_fields: list[str] = [] self.schema = schema self._fk_columns_cache: dict[str, set[str]] = {} # Cache of FK columns per table + self.manifest = manifest + self._manifest_recorded: set[tuple[str, str]] = set() # Track which fields we've recorded def _normalize_field(self, table: str, column: str) -> str: """Return normalized table.column field name for matching.""" @@ -161,9 +176,11 @@ def _match_glob(self, pattern: str, field: str) -> bool: """Case-insensitive shell-style glob match for table.column patterns.""" return fnmatchcase(field, pattern.lower()) - def _resolve_custom_pattern_provider(self, table: str, column: str) -> str | None: + def _resolve_pattern_provider( + self, table: str, column: str, patterns: list[tuple[str, str]] + ) -> str | None: """ - Resolve provider from custom wildcard patterns. + Resolve provider from wildcard patterns. 
Resolution policy: - Most specific pattern wins (longest non-wildcard literal). @@ -173,7 +190,7 @@ def _resolve_custom_pattern_provider(self, table: str, column: str) -> str | Non best_provider: str | None = None best_specificity = -1 - for pattern, provider in self.custom_patterns: + for pattern, provider in patterns: if not self._match_glob(pattern, field): continue @@ -184,6 +201,14 @@ def _resolve_custom_pattern_provider(self, table: str, column: str) -> str | Non return best_provider + def _resolve_custom_pattern_provider(self, table: str, column: str) -> str | None: + """Resolve provider from user-defined wildcard patterns.""" + return self._resolve_pattern_provider(table, column, self.custom_patterns) + + def _resolve_fallback_pattern_provider(self, table: str, column: str) -> str | None: + """Resolve provider from fallback wildcard patterns (e.g., compliance profiles).""" + return self._resolve_pattern_provider(table, column, self.fallback_patterns) + def _resolve_exact_field_provider(self, table: str, column: str) -> str | None: """Resolve provider from exact field mappings.""" return self.field_providers.get(self._normalize_field(table, column)) @@ -192,9 +217,10 @@ def _resolve_faker_method(self, table: str, column: str) -> str: """ Resolve faker method with precedence: 1. Exact field provider mapping - 2. Custom wildcard pattern mapping - 3. Built-in column substring mapping - 4. pystr fallback + 2. User wildcard pattern mapping + 3. Fallback wildcard pattern mapping + 4. Built-in column substring mapping + 5. 
pystr fallback """ exact_provider = self._resolve_exact_field_provider(table, column) if exact_provider: @@ -204,6 +230,10 @@ def _resolve_faker_method(self, table: str, column: str) -> str: if pattern_provider: return pattern_provider + fallback_pattern_provider = self._resolve_fallback_pattern_provider(table, column) + if fallback_pattern_provider: + return fallback_pattern_provider + return self.get_faker_method(column) def configure( @@ -211,6 +241,7 @@ def configure( redact_fields: list[str], field_providers: dict[str, str] | None = None, patterns: dict[str, str] | None = None, + fallback_patterns: dict[str, str] | None = None, security_null_fields: list[str] | None = None, ): """ @@ -219,7 +250,8 @@ def configure( Args: redact_fields: List of exact fields in "table.column" format. field_providers: Exact field to faker-provider mappings. - patterns: Wildcard table.column glob to faker-provider mappings. + patterns: User wildcard table.column glob to faker-provider mappings. + fallback_patterns: Lower-priority wildcard mappings (e.g., compliance profiles). security_null_fields: Wildcard table.column globs to force NULL. 
         """
         self.redact_fields = {field.lower() for field in redact_fields}
@@ -229,13 +261,17 @@ def configure(
         self.custom_patterns = [
             (pattern.lower(), provider) for pattern, provider in (patterns or {}).items()
         ]
+        self.fallback_patterns = [
+            (pattern.lower(), provider) for pattern, provider in (fallback_patterns or {}).items()
+        ]
         self.security_null_fields = [pattern.lower() for pattern in (security_null_fields or [])]

         logger.info(
             "Anonymizer configured",
             redact_field_count=len(self.redact_fields),
             exact_provider_count=len(self.field_providers),
-            pattern_count=len(self.custom_patterns),
+            user_pattern_count=len(self.custom_patterns),
+            fallback_pattern_count=len(self.fallback_patterns),
             security_null_pattern_count=len(self.security_null_fields),
         )

@@ -297,6 +333,10 @@ def should_anonymize(self, table: str, column: str) -> bool:
         if self._resolve_custom_pattern_provider(table, column):
             return True

+        # Fallback wildcard patterns (e.g., compliance profiles)
+        if self._resolve_fallback_pattern_provider(table, column):
+            return True
+
         # Pattern matching on column name
         col_lower = column.lower()
         for pattern in _DEFAULT_ANONYMIZATION_PATTERNS:
@@ -380,34 +420,53 @@ def anonymize_value(self, value: Any, table: str, column: str) -> Any:

         # FK integrity has highest priority over nulling/anonymization rules.
         if self._is_foreign_key_column(table, column):
+            self._record_manifest_fk(table, column)
             return value

         if self.should_null(table, column):
+            self._record_manifest_null(table, column)
             return None

         if not self.should_anonymize(table, column):
+            self._record_manifest_unmasked(table, column)
             return value

         faker_method = self._resolve_faker_method(table, column)
-        cache_key = (str(value), column, faker_method)
-        if cache_key in self._cache:
-            return self._cache[cache_key]
-
-        # Generate deterministic seed from global seed + column/provider + original value
-        # Including column name ensures same value in different column types gets different output
-        hash_input = f"{self.global_seed}:{column}:{faker_method}:{value}".encode()
-        seed_int = int.from_bytes(hashlib.sha256(hash_input).digest()[:8], "big")
-
-        self.fake.seed_instance(seed_int)
-
-        try:
-            anonymized = getattr(self.fake, faker_method)()
-        except (AttributeError, TypeError):
-            # Fallback if Faker method doesn't exist or fails
-            anonymized = self.fake.pystr()
-
-        self._cache[cache_key] = anonymized
-        return anonymized
+        self._record_manifest_masked(table, column, faker_method)
+
+        # Check for custom compliance transformers first (these take the value as input)
+        custom_fn = self._get_custom_transformer(faker_method)
+        if custom_fn is not None:
+            return custom_fn(value)
+
+        if self.deterministic:
+            cache_key = (str(value), column, faker_method)
+            if cache_key in self._cache:
+                return self._cache[cache_key]
+
+            # Generate deterministic seed from global seed + column/provider + original value
+            # Including column name ensures same value in different column types gets different output
+            hash_input = f"{self.global_seed}:{column}:{faker_method}:{value}".encode()
+            seed_int = int.from_bytes(hashlib.sha256(hash_input).digest()[:8], "big")
+            self.fake.seed_instance(seed_int)
+
+            try:
+                anonymized = getattr(self.fake, faker_method)()
+            except (AttributeError, TypeError):
+                anonymized = self.fake.pystr()
+
+            self._cache[cache_key] = anonymized
+            return anonymized
+        else:
+            seed_int = int.from_bytes(secrets.token_bytes(8), "big")
+            self.fake.seed_instance(seed_int)
+
+            try:
+                anonymized = getattr(self.fake, faker_method)()
+            except (AttributeError, TypeError):
+                anonymized = self.fake.pystr()
+
+            return anonymized

     def anonymize_row(self, table: str, row: dict[str, Any]) -> dict[str, Any]:
         """
@@ -451,5 +510,49 @@ def get_statistics(self) -> dict[str, int]:
             "redact_fields_count": len(self.redact_fields),
             "exact_provider_count": len(self.field_providers),
             "pattern_count": len(self.custom_patterns),
+            "fallback_pattern_count": len(self.fallback_patterns),
             "security_null_pattern_count": len(self.security_null_fields),
         }
+
+    @staticmethod
+    def _get_custom_transformer(method_name: str) -> Any | None:
+        """Look up a custom compliance transformer function by name."""
+        from dbslice.compliance.transformers import CUSTOM_TRANSFORMERS
+
+        return CUSTOM_TRANSFORMERS.get(method_name)
+
+    def _record_manifest_masked(self, table: str, column: str, method: str) -> None:
+        """Record a masked field in the manifest (once per table.column)."""
+        if not self.manifest:
+            return
+        key = (table, column)
+        if key not in self._manifest_recorded:
+            self._manifest_recorded.add(key)
+            self.manifest.record_masked_field(table, column, method)
+
+    def _record_manifest_null(self, table: str, column: str) -> None:
+        """Record a NULLed field in the manifest (once per table.column)."""
+        if not self.manifest:
+            return
+        key = (table, column)
+        if key not in self._manifest_recorded:
+            self._manifest_recorded.add(key)
+            self.manifest.record_nulled_field(table, column, "security_null_pattern")
+
+    def _record_manifest_fk(self, table: str, column: str) -> None:
+        """Record a preserved FK field in the manifest (once per table.column)."""
+        if not self.manifest:
+            return
+        key = (table, column)
+        if key not in self._manifest_recorded:
+            self._manifest_recorded.add(key)
+            self.manifest.record_fk_preserved(table, column)
+
+    def _record_manifest_unmasked(self, table: str, column: str) -> None:
+        """Record an unmasked field in the manifest (once per table.column)."""
+        if not self.manifest:
+            return
+        key = (table, column)
+        if key not in self._manifest_recorded:
+            self._manifest_recorded.add(key)
+            self.manifest.record_unmasked_field(table, column)
diff --git a/tests/test_anonymizer.py b/tests/test_anonymizer.py
index 6654416..dc86580 100644
--- a/tests/test_anonymizer.py
+++ b/tests/test_anonymizer.py
@@ -240,6 +240,27 @@ def test_custom_pattern_tie_uses_first_defined(self):

         assert anon._resolve_faker_method("users", "user_id") == "name"

+    def test_fallback_patterns_apply_when_user_patterns_missing(self):
+        anon = DeterministicAnonymizer()
+        anon.configure(
+            [],
+            patterns={},
+            fallback_patterns={"*.admission_date*": "date"},
+        )
+
+        assert anon.should_anonymize("visits", "admission_date")
+        assert anon._resolve_faker_method("visits", "admission_date") == "date"
+
+    def test_user_patterns_override_fallback_patterns(self):
+        anon = DeterministicAnonymizer()
+        anon.configure(
+            [],
+            patterns={"*.*date*": "date_time"},
+            fallback_patterns={"*.admission_date*": "date"},
+        )
+
+        assert anon._resolve_faker_method("visits", "admission_date") == "date_time"
+
     def test_security_null_fields_applies(self):
         anon = DeterministicAnonymizer()
         anon.configure(
diff --git a/tests/test_compliance.py b/tests/test_compliance.py
new file mode 100644
index 0000000..c50b52b
--- /dev/null
+++ b/tests/test_compliance.py
@@ -0,0 +1,701 @@
+"""Tests for the compliance module: profiles, scanner, manifest, and integration."""
+
+import json
+from unittest.mock import MagicMock
+
+import pytest
+import yaml
+
+import dbslice.cli as cli
+from dbslice.compliance.manifest import ComplianceManifest
+from dbslice.compliance.profiles import (
+    get_profile,
+    list_profiles,
+)
+from dbslice.compliance.scanner import PIIDetection, PIIScanner, _luhn_check
+from dbslice.config import ExtractConfig
+from dbslice.core.engine import ExtractionEngine
+from dbslice.exceptions import ExtractionError
+from dbslice.models import Column, SchemaGraph, Table
+
+# ──────────────────────────────────────────────────────────
+# Profile tests
+# ──────────────────────────────────────────────────────────
+
+
+class TestProfiles:
+    def test_get_gdpr_profile(self):
+        profile = get_profile("gdpr")
+        assert profile.name == "gdpr"
+        assert profile.display_name == "GDPR"
+        assert "email" in profile.required_column_patterns
+
+    def test_get_hipaa_profile(self):
+        profile = get_profile("hipaa")
+        assert profile.name == "hipaa"
+        assert "ssn" in profile.required_column_patterns
+        assert len(profile.identifiers) == 18
+
+    def test_get_pci_dss_profile(self):
+        profile = get_profile("pci-dss")
+        assert profile.name == "pci-dss"
+        assert "credit_card" in profile.required_column_patterns
+        assert "cvv" in profile.required_null_patterns
+
+    def test_get_profile_case_insensitive(self):
+        assert get_profile("GDPR").name == "gdpr"
+        assert get_profile("Hipaa").name == "hipaa"
+        assert get_profile("PCI-DSS").name == "pci-dss"
+
+    def test_get_profile_unknown_raises(self):
+        with pytest.raises(ValueError, match="Unknown compliance profile"):
+            get_profile("unknown")
+
+    def test_list_profiles(self):
+        profiles = list_profiles()
+        assert len(profiles) >= 3
+        names = {p.name for p in profiles}
+        assert "gdpr" in names
+        assert "hipaa" in names
+        assert "pci-dss" in names
+
+    def test_gdpr_covers_direct_identifiers(self):
+        profile = get_profile("gdpr")
+        expected_patterns = ["email", "phone", "first_name", "last_name", "ssn", "ip_address"]
+        for pattern in expected_patterns:
+            assert pattern in profile.required_column_patterns, f"Missing: {pattern}"
+
+    def test_hipaa_has_18_identifiers(self):
+        profile = get_profile("hipaa")
+        assert len(profile.identifiers) == 18
+        assert profile.identifiers[0].startswith("1.")
+        assert profile.identifiers[17].startswith("18.")
+
+    def test_pci_dss_covers_pan_fields(self):
+        profile = get_profile("pci-dss")
+        for pattern in ["credit_card", "card_number", "pan"]:
+            assert pattern in profile.required_column_patterns
+
+    def test_profiles_have_value_scan_patterns(self):
+        for profile in list_profiles():
+            assert len(profile.value_scan_patterns) > 0
+
+    def test_profiles_have_freetext_warnings(self):
+        for profile in list_profiles():
+            assert len(profile.warn_freetext_columns) > 0
+
+    def test_profile_is_frozen(self):
+        profile = get_profile("gdpr")
+        with pytest.raises(AttributeError):
+            profile.name = "hacked"  # type: ignore[misc]
+
+
+# ──────────────────────────────────────────────────────────
+# Scanner tests
+# ──────────────────────────────────────────────────────────
+
+
+class TestPIIScanner:
+    def test_detect_emails(self):
+        scanner = PIIScanner(patterns=["email"])
+        values = ["john@example.com", "jane@test.org", "not-an-email", "bob@foo.co"]
+        detections = scanner.scan_column("users", "notes", values)
+        assert len(detections) == 1
+        assert detections[0].pattern_name == "email"
+        assert detections[0].match_count == 3
+
+    def test_detect_ssn(self):
+        scanner = PIIScanner(patterns=["ssn"])
+        values = ["123-45-6789", "987-65-4321", "not-ssn", "hello"]
+        detections = scanner.scan_column("users", "data", values)
+        assert len(detections) == 1
+        assert detections[0].pattern_name == "ssn"
+        assert detections[0].match_count == 2
+
+    def test_detect_credit_card_with_luhn(self):
+        scanner = PIIScanner(patterns=["credit_card"])
+        # 4111111111111111 is a valid Luhn number (Visa test card)
+        values = ["4111111111111111", "1234567890123456", "not-a-card"]
+        detections = scanner.scan_column("orders", "memo", values)
+        assert len(detections) == 1
+        assert detections[0].pattern_name == "credit_card"
+        # Only the Luhn-valid one should match
+        assert detections[0].match_count >= 1
+
+    def test_detect_credit_card_with_grouped_format(self):
+        scanner = PIIScanner(patterns=["credit_card"])
+        values = ["4111-1111-1111-1111", "not-a-card"]
+        detections = scanner.scan_column("orders", "memo", values)
+        assert len(detections) == 1
+        assert detections[0].pattern_name == "credit_card"
+
+    def test_detect_credit_card_embedded_in_text(self):
+        scanner = PIIScanner(patterns=["credit_card"])
+        values = ["card=4111 1111 1111 1111 expires soon", "hello"]
+        detections = scanner.scan_column("orders", "memo", values)
+        assert len(detections) == 1
+        assert detections[0].match_count == 1
+
+    def test_detect_ipv4(self):
+        scanner = PIIScanner(patterns=["ipv4"])
+        values = ["192.168.1.1", "10.0.0.1", "not-ip", "256.1.1.1"]
+        detections = scanner.scan_column("logs", "source", values)
+        assert len(detections) >= 1
+        assert detections[0].pattern_name == "ipv4"
+
+    def test_no_detection_below_threshold(self):
+        scanner = PIIScanner(patterns=["email"], min_match_rate=0.5)
+        # Only 1 out of 10 is an email — below 50% threshold
+        values = ["john@example.com"] + ["no-email"] * 9
+        detections = scanner.scan_column("users", "notes", values)
+        assert len(detections) == 0
+
+    def test_scan_rows(self):
+        scanner = PIIScanner(patterns=["email"])
+        rows = [
+            {"id": 1, "notes": "contact john@example.com"},
+            {"id": 2, "notes": "call jane@test.org"},
+            {"id": 3, "notes": "nothing here"},
+        ]
+        detections = scanner.scan_rows("users", rows)
+        assert any(d.column == "notes" and d.pattern_name == "email" for d in detections)
+
+    def test_scan_rows_skip_columns(self):
+        scanner = PIIScanner(patterns=["email"])
+        rows = [
+            {"id": 1, "email": "john@example.com", "notes": "call john@example.com"},
+        ]
+        detections = scanner.scan_rows("users", rows, skip_columns={"email"})
+        # Should only detect in "notes", not in "email" (skipped)
+        assert all(d.column != "email" for d in detections)
+
+    def test_scan_empty_rows(self):
+        scanner = PIIScanner()
+        assert scanner.scan_rows("users", []) == []
+
+    def test_scan_none_values(self):
+        scanner = PIIScanner(patterns=["email"])
+        values = [None, None, None]
+        detections = scanner.scan_column("users", "email", values)
+        assert len(detections) == 0
+
+    def test_confidence_levels(self):
+        scanner = PIIScanner(patterns=["email"], min_match_rate=0.01)
+        # High match rate = high confidence
+        values = ["a@b.com"] * 10
+        detections = scanner.scan_column("t", "c", values)
+        assert detections[0].confidence == "high"
+
+    def test_match_rate_property(self):
+        detection = PIIDetection(
+            table="t",
+            column="c",
+            pattern_name="email",
+            match_count=3,
+            sample_size=10,
+            confidence="high",
+        )
+        assert detection.match_rate == 0.3
+
+    def test_match_rate_zero_sample(self):
+        detection = PIIDetection(
+            table="t",
+            column="c",
+            pattern_name="email",
+            match_count=0,
+            sample_size=0,
+            confidence="low",
+        )
+        assert detection.match_rate == 0.0
+
+
+class TestLuhnCheck:
+    def test_valid_visa(self):
+        assert _luhn_check("4111111111111111") is True
+
+    def test_valid_mastercard(self):
+        assert _luhn_check("5500000000000004") is True
+
+    def test_invalid_number(self):
+        assert _luhn_check("1234567890123456") is False
+
+    def test_too_short(self):
+        assert _luhn_check("123") is False
+
+
+# ──────────────────────────────────────────────────────────
+# Manifest tests
+# ──────────────────────────────────────────────────────────
+
+
+class TestComplianceManifest:
+    def test_initialize(self):
+        manifest = ComplianceManifest()
+        manifest.initialize(
+            extraction_id="test-123",
+            compliance_profiles=["gdpr"],
+            anonymization_seed="my_seed",
+            deterministic=True,
+        )
+        assert manifest.extraction_id == "test-123"
+        assert manifest.compliance_profiles == ["gdpr"]
+        assert manifest.masking_type == "deterministic_pseudonymization"
+        assert manifest.seed_hash.startswith("sha256:")
+        assert manifest.dbslice_version
+
+    def test_initialize_non_deterministic(self):
+        manifest = ComplianceManifest()
+        manifest.initialize(
+            extraction_id="test-456",
+            deterministic=False,
+        )
+        assert manifest.masking_type == "non_deterministic_pseudonymization"
+
+    def test_record_masked_field(self):
+        manifest = ComplianceManifest()
+        manifest.record_masked_field("users", "email", "email", category="direct_identifier")
+        assert len(manifest.tables["users"].fields_masked) == 1
+        assert manifest.tables["users"].fields_masked[0].method == "email"
+
+    def test_record_nulled_field(self):
+        manifest = ComplianceManifest()
+        manifest.record_nulled_field("users", "password_hash", "security_null_pattern")
+        assert len(manifest.tables["users"].fields_nulled) == 1
+
+    def test_record_fk_preserved(self):
+        manifest = ComplianceManifest()
+        manifest.record_fk_preserved("orders", "user_id")
+        assert "user_id" in manifest.tables["orders"].fields_preserved_fk
+
+    def test_record_unmasked_field(self):
+        manifest = ComplianceManifest()
+        manifest.record_unmasked_field("orders", "status")
+        assert "status" in manifest.tables["orders"].fields_unmasked
+
+    def test_set_table_row_count(self):
+        manifest = ComplianceManifest()
+        manifest.set_table_row_count("users", 150)
+        assert manifest.tables["users"].rows_extracted == 150
+
+    def test_add_warning(self):
+        manifest = ComplianceManifest()
+        manifest.add_warning("notes", "body", "may contain PII")
+        assert len(manifest.warnings) == 1
+        assert manifest.warnings[0].table == "notes"
+
+    def test_add_pii_detections(self):
+        manifest = ComplianceManifest()
+        detection = PIIDetection(
+            table="logs",
+            column="message",
+            pattern_name="email",
+            match_count=5,
+            sample_size=100,
+            confidence="high",
+        )
+        manifest.add_pii_detections([detection])
+        assert len(manifest.pii_scan_results) == 1
+
+    def test_to_dict(self):
+        manifest = ComplianceManifest()
+        manifest.initialize(
+            extraction_id="test-789",
+            compliance_profiles=["hipaa"],
+            anonymization_seed="seed123",
+        )
+        manifest.record_masked_field("users", "email", "email")
+        manifest.record_nulled_field("users", "password", "security_null")
+        manifest.set_table_row_count("users", 50)
+        manifest.add_warning("notes", "body", "freetext PII risk")
+
+        d = manifest.to_dict()
+        assert d["extraction_id"] == "test-789"
+        assert d["compliance_profiles"] == ["hipaa"]
+        assert "users" in d["tables"]
+        assert d["tables"]["users"]["rows_extracted"] == 50
+        assert len(d["tables"]["users"]["fields_masked"]) == 1
+        assert len(d["warnings"]) == 1
+
+    def test_to_json(self):
+        manifest = ComplianceManifest()
+        manifest.initialize(extraction_id="test-json")
+        manifest.record_masked_field("t", "c", "email")
+        json_str = manifest.to_json()
+        parsed = json.loads(json_str)
+        assert parsed["extraction_id"] == "test-json"
+
+    def test_to_json_compact(self):
+        manifest = ComplianceManifest()
+        manifest.initialize(extraction_id="test-compact")
+        json_str = manifest.to_json(pretty=False)
+        assert "\n" not in json_str
+
+
+# ──────────────────────────────────────────────────────────
+# Integration: anonymizer + manifest
+# ──────────────────────────────────────────────────────────
+
+
+class TestAnonymizerManifestIntegration:
+    def test_anonymizer_records_to_manifest(self):
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        manifest = ComplianceManifest()
+        anonymizer = DeterministicAnonymizer(
+            seed="test_seed",
+            deterministic=True,
+            manifest=manifest,
+        )
+        anonymizer.configure(redact_fields=["users.email"])
+
+        row = {"id": 1, "email": "john@example.com", "status": "active"}
+        anonymizer.anonymize_row("users", row)
+
+        # email should be recorded as masked
+        assert any(
+            f.column == "email" for f in manifest.tables.get("users", MagicMock()).fields_masked
+        )
+
+    def test_anonymizer_non_deterministic_mode(self):
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        anonymizer = DeterministicAnonymizer(
+            seed="test_seed",
+            deterministic=False,
+        )
+        anonymizer.configure(redact_fields=["users.email"])
+
+        # Same input should produce different outputs in non-deterministic mode
+        results = set()
+        for _ in range(10):
+            row = {"email": "john@example.com"}
+            result = anonymizer.anonymize_row("users", row)
+            results.add(result["email"])
+
+        # With 10 random seeds, we should get multiple distinct values
+        assert len(results) > 1
+
+
+# ──────────────────────────────────────────────────────────
+# Config file integration
+# ──────────────────────────────────────────────────────────
+
+
+class TestComplianceConfig:
+    def test_config_file_compliance_section(self, tmp_path):
+        from dbslice.config_file import DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text("""
+database:
+  url: postgres://localhost/test
+
+compliance:
+  profiles: [gdpr]
+  strict: true
+  generate_manifest: true
+
+anonymization:
+  enabled: true
+  deterministic: false
+""")
+        config = DbsliceConfig.from_yaml(config_file)
+        assert config.compliance.profiles == ["gdpr"]
+        assert config.compliance.strict is True
+        assert config.compliance.generate_manifest is True
+        assert config.anonymization.deterministic is False
+
+    def test_config_file_invalid_profile(self, tmp_path):
+        from dbslice.config_file import ConfigFileError, DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text("""
+compliance:
+  profiles: [nonexistent]
+""")
+        with pytest.raises(ConfigFileError, match="Unknown compliance profile"):
+            DbsliceConfig.from_yaml(config_file)
+
+    def test_config_file_compliance_unknown_key(self, tmp_path):
+        from dbslice.config_file import ConfigFileError, DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text("""
+compliance:
+  profiles: [gdpr]
+  invalid_key: true
+""")
+        with pytest.raises(ConfigFileError, match="Unknown key"):
+            DbsliceConfig.from_yaml(config_file)
+
+    def test_to_extract_config_with_compliance(self, tmp_path):
+        from dbslice.config import SeedSpec
+        from dbslice.config_file import DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text("""
+database:
+  url: postgres://localhost/test
+
+compliance:
+  profiles: [hipaa]
+  strict: true
+  generate_manifest: true
+
+anonymization:
+  enabled: true
+  deterministic: false
+""")
+        config = DbsliceConfig.from_yaml(config_file)
+        seed = SeedSpec(table="users", column="id", value=1, where_clause=None)
+        extract_config = config.to_extract_config(seeds=[seed])
+
+        assert extract_config.compliance_profiles == ["hipaa"]
+        assert extract_config.compliance_strict is True
+        assert extract_config.generate_manifest is True
+        assert extract_config.deterministic is False
+
+    def test_to_extract_config_with_compliance_policy_fields(self, tmp_path):
+        from dbslice.config import SeedSpec
+        from dbslice.config_file import DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text(
+            """
+database:
+  url: postgres://localhost/test
+compliance:
+  profiles: [gdpr]
+  policy_mode: strict
+  allow_url_patterns:
+    - ".*localhost.*"
+  deny_url_patterns:
+    - ".*prod.*"
+  required_sslmode: require
+  require_ci: true
+  sign_manifest: true
+  manifest_key_env: DBSLICE_SIGN_KEY
+"""
+        )
+        config = DbsliceConfig.from_yaml(config_file)
+        seed = SeedSpec(table="users", column="id", value=1, where_clause=None)
+        extract_config = config.to_extract_config(seeds=[seed])
+
+        assert extract_config.compliance_policy_mode == "strict"
+        assert extract_config.compliance_allowed_url_patterns == [".*localhost.*"]
+        assert extract_config.compliance_denied_url_patterns == [".*prod.*"]
+        assert extract_config.compliance_required_sslmode == "require"
+        assert extract_config.compliance_require_ci is True
+        assert extract_config.compliance_manifest_sign is True
+        assert extract_config.compliance_manifest_key_env == "DBSLICE_SIGN_KEY"
+
+    def test_compliance_empty_section_ok(self, tmp_path):
+        from dbslice.config_file import DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text("""
+compliance: {}
+""")
+        config = DbsliceConfig.from_yaml(config_file)
+        assert config.compliance.profiles == []
+        assert config.compliance.strict is False
+
+    def test_to_yaml_includes_deterministic_and_compliance(self, tmp_path):
+        from dbslice.config_file import DbsliceConfig
+
+        config_file = tmp_path / "dbslice.yaml"
+        config_file.write_text(
+            """
+database:
+  url: postgres://localhost/test
+anonymization:
+  enabled: true
+  deterministic: false
+compliance:
+  profiles: [gdpr]
+  strict: true
+  generate_manifest: true
+  policy_mode: strict
+"""
+        )
+        config = DbsliceConfig.from_yaml(config_file)
+        exported = config.to_yaml(include_comments=False)
+        parsed = yaml.safe_load(exported)
+        assert parsed["anonymization"]["deterministic"] is False
+        assert parsed["compliance"]["profiles"] == ["gdpr"]
+        assert parsed["compliance"]["strict"] is True
+        assert parsed["compliance"]["policy_mode"] == "strict"
+
+
+class TestComplianceScanSemantics:
+    def _engine(self, strict: bool) -> ExtractionEngine:
+        config = ExtractConfig(
+            database_url="postgresql://localhost/test",
+            seeds=[],
+            anonymize=True,
+            compliance_profiles=["gdpr"],
+            compliance_strict=strict,
+        )
+        return ExtractionEngine(config)
+
+    def test_strict_mode_ignores_masked_synthetic_values(self):
+        engine = self._engine(strict=True)
+        pre_mask = {"users": [{"email": "alice@example.com"}]}
+        post_mask = {"users": [{"email": "xcooper@example.org"}]}
+        # No exception: email column is protected by profile rules.
+        engine._run_pii_scan(pre_mask, post_mask)
+
+    def test_strict_mode_fails_on_residual_unprotected_detections(self):
+        engine = self._engine(strict=True)
+        pre_mask = {"logs": [{"message": "contact alice@example.com"}]}
+        post_mask = {"logs": [{"message": "contact alice@example.com"}]}
+        with pytest.raises(ExtractionError, match="residual unprotected PII"):
+            engine._run_pii_scan(pre_mask, post_mask)
+
+
+class TestManifestVerification:
+    def test_manifest_file_hash_and_signature_verification(self, tmp_path):
+        from dbslice.compliance.manifest import verify_manifest_payload
+
+        data_file = tmp_path / "subset.sql"
+        data_file.write_text("select 1;\n", encoding="utf-8")
+
+        manifest = ComplianceManifest()
+        manifest.initialize(extraction_id="verify-1")
+        manifest.add_output_file_hashes([data_file], base_dir=tmp_path)
+        manifest.sign("secret-key")
+
+        manifest_path = tmp_path / "subset.manifest.json"
+        manifest_path.write_text(manifest.to_json(pretty=True), encoding="utf-8")
+        payload = json.loads(manifest_path.read_text(encoding="utf-8"))
+
+        ok, errors = verify_manifest_payload(
+            payload,
+            manifest_path,
+            signing_key="secret-key",
+            verify_signature=True,
+        )
+        assert ok is True
+        assert errors == []
+
+        data_file.write_text("tampered\n", encoding="utf-8")
+        ok, errors = verify_manifest_payload(
+            payload,
+            manifest_path,
+            signing_key="secret-key",
+            verify_signature=True,
+        )
+        assert ok is False
+        assert any("Hash mismatch" in err for err in errors)
+
+    def test_verify_manifest_cli_command(self, tmp_path, monkeypatch):
+        data_file = tmp_path / "subset.sql"
+        data_file.write_text("select 1;\n", encoding="utf-8")
+
+        manifest = ComplianceManifest()
+        manifest.initialize(extraction_id="verify-cli")
+        manifest.add_output_file_hashes([data_file], base_dir=tmp_path)
+        manifest.sign("secret-key")
+
+        manifest_path = tmp_path / "subset.manifest.json"
+        manifest_path.write_text(manifest.to_json(pretty=True), encoding="utf-8")
+
+        monkeypatch.setenv("DBSLICE_MANIFEST_SIGNING_KEY", "secret-key")
+        cli.verify_manifest(
+            manifest_file=manifest_path,
+            verify_signature=True,
+            key_env="DBSLICE_MANIFEST_SIGNING_KEY",
+        )
+
+
+class TestCompliancePolicyGates:
+    def test_policy_blocks_stdout_without_breakglass(self):
+        config = ExtractConfig(
+            database_url="postgresql://localhost/test",
+            seeds=[],
+            compliance_profiles=["gdpr"],
+            compliance_policy_mode="standard",
+            anonymize=True,
+        )
+        with pytest.raises(ValueError, match="stdout output is blocked"):
+            cli._enforce_compliance_policy(
+                config,
+                out_file=None,
+                allow_raw=False,
+                breakglass_reason=None,
+                ticket_id=None,
+            )
+
+    def test_policy_breakglass_requires_metadata(self):
+        config = ExtractConfig(
+            database_url="postgresql://localhost/test",
+            seeds=[],
+            compliance_profiles=["gdpr"],
+            compliance_policy_mode="strict",
+            anonymize=False,
+        )
+        with pytest.raises(ValueError, match="--breakglass-reason"):
+            cli._enforce_compliance_policy(
+                config,
+                out_file=None,
+                allow_raw=True,
+                breakglass_reason=None,
+                ticket_id="SEC-123",
+            )
+
+    def test_source_guardrails_validate_sslmode_and_ci(self, monkeypatch):
+        config = ExtractConfig(
+            database_url="postgresql://localhost/test?sslmode=require",
+            seeds=[],
+            compliance_profiles=["gdpr"],
+            compliance_required_sslmode="require",
+            compliance_require_ci=True,
+        )
+        monkeypatch.setenv("CI", "true")
+        cli._enforce_source_guardrails(config)
+
+        bad = ExtractConfig(
+            database_url="postgresql://localhost/test?sslmode=disable",
+            seeds=[],
+            compliance_profiles=["gdpr"],
+            compliance_required_sslmode="require",
+        )
+        with pytest.raises(ValueError, match="sslmode"):
+            cli._enforce_source_guardrails(bad)
+
+
+class TestComplianceInspectReport:
+    def test_report_detects_uncovered_columns(self):
+        class FakeAdapter:
+            def fetch_rows(self, table: str, where_clause: str, params: tuple[object, ...]):
+                assert where_clause == "TRUE"
+                assert params == ()
+                if table == "logs":
+                    yield {"id": 1, "message": "contact jane@example.com"}
+
+        schema = SchemaGraph(
+            tables={
+                "logs": Table(
+                    name="logs",
+                    schema="public",
+                    columns=[
+                        Column("id", "integer", False, True),
+                        Column("message", "text", True, False),
+                    ],
+                    primary_key=("id",),
+                    foreign_keys=[],
+                )
+            },
+            edges=[],
+        )
+
+        # Smoke test that helper executes and prints JSON report
+        cli._run_compliance_check_report(
+            adapter=FakeAdapter(),
+            db_schema=schema,
+            profiles=["gdpr"],
+            sample_rows=10,
+            output_mode="json",
+            target_table=None,
+            console=MagicMock(),
+        )
diff --git a/tests/test_compliance_gaps.py b/tests/test_compliance_gaps.py
new file mode 100644
index 0000000..2bd30f0
--- /dev/null
+++ b/tests/test_compliance_gaps.py
@@ -0,0 +1,328 @@
+"""
+Tests for compliance gap fixes: HIPAA transformers, free-text handling,
+binary columns, k-anonymity, and configurable scan sample size.
+
+These tests verify END-TO-END behavior — not just that functions exist,
+but that data is actually transformed correctly through the full pipeline.
+"""
+
+from dbslice.compliance.transformers import (
+    BINARY_SENTINEL,
+    age_bucket,
+    hipaa_safe_harbor_zip3,
+    redact_freetext,
+    year_only,
+)
+
+# ──────────────────────────────────────────────────────────
+# Phase A: HIPAA-specific transformers
+# ──────────────────────────────────────────────────────────
+
+
+class TestYearOnly:
+    def test_iso_date(self):
+        assert year_only("2024-03-15") == "2024"
+
+    def test_iso_datetime(self):
+        assert year_only("2024-03-15T10:30:00") == "2024"
+
+    def test_us_date_slash(self):
+        assert year_only("03/15/2024") == "2024"
+
+    def test_us_date_dash(self):
+        assert year_only("03-15-2024") == "2024"
+
+    def test_datetime_object(self):
+        import datetime
+
+        assert year_only(datetime.date(1985, 6, 15)) == "1985"
+
+    def test_datetime_datetime_object(self):
+        import datetime
+
+        assert year_only(datetime.datetime(1985, 6, 15, 10, 30)) == "1985"
+
+    def test_just_year(self):
+        assert year_only("1990") == "1990"
+
+    def test_none(self):
+        assert year_only(None) == ""
+
+    def test_garbage(self):
+        assert year_only("not a date") == ""
+
+    def test_embedded_year(self):
+        assert year_only("born in 1985 somewhere") == "1985"
+
+
+class TestHipaaZip3:
+    def test_normal_5digit_zip(self):
+        result = hipaa_safe_harbor_zip3("12345")
+        assert result == "123"
+
+    def test_zip_plus_4(self):
+        result = hipaa_safe_harbor_zip3("12345-6789")
+        assert result == "123"
+
+    def test_low_population_zip_suppressed(self):
+        # 036xx is in NH, low population
+        result = hipaa_safe_harbor_zip3("03601")
+        assert result == "000"
+
+    def test_another_low_pop(self):
+        # 821xx is in WY
+        result = hipaa_safe_harbor_zip3("82101")
+        assert result == "000"
+
+    def test_high_population_zip_retained(self):
+        # 100xx is NYC — high population
+        result = hipaa_safe_harbor_zip3("10001")
+        assert result == "100"
+
+    def test_short_zip(self):
+        result = hipaa_safe_harbor_zip3("12")
+        assert result == "000"
+
+    def test_integer_zip(self):
+        result = hipaa_safe_harbor_zip3(90210)
+        assert result == "902"
+
+
+class TestAgeBucket:
+    def test_normal_age(self):
+        assert age_bucket(45) == "45"
+
+    def test_age_89(self):
+        assert age_bucket(89) == "89"
+
+    def test_age_90_bucketed(self):
+        assert age_bucket(90) == "90+"
+
+    def test_age_105_bucketed(self):
+        assert age_bucket(105) == "90+"
+
+    def test_string_age(self):
+        assert age_bucket("75") == "75"
+
+    def test_string_age_over_89(self):
+        assert age_bucket("92") == "90+"
+
+    def test_non_numeric(self):
+        assert age_bucket("unknown") == "unknown"
+
+
+class TestCustomTransformersInAnonymizer:
+    """Verify custom transformers are actually called by the anonymizer."""
+
+    def test_year_only_through_anonymizer(self):
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        anon = DeterministicAnonymizer(seed="test")
+        anon.configure(
+            redact_fields=[],
+            field_providers={"patients.admission_date": "year_only"},
+        )
+        result = anon.anonymize_value("2024-03-15", "patients", "admission_date")
+        assert result == "2024"
+
+    def test_hipaa_zip3_through_anonymizer(self):
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        anon = DeterministicAnonymizer(seed="test")
+        anon.configure(
+            redact_fields=[],
+            field_providers={"patients.zipcode": "hipaa_zip3"},
+        )
+        result = anon.anonymize_value("03601", "patients", "zipcode")
+        assert result == "000"  # Low population area
+
+    def test_age_bucket_through_anonymizer(self):
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        anon = DeterministicAnonymizer(seed="test")
+        anon.configure(
+            redact_fields=[],
+            field_providers={"patients.age": "age_bucket"},
+        )
+        assert anon.anonymize_value(92, "patients", "age") == "90+"
+        assert anon.anonymize_value(45, "patients", "age") == "45"
+
+    def test_hipaa_profile_uses_year_only_for_dates(self):
+        """End-to-end: HIPAA profile maps date columns to year_only transformer."""
+        from dbslice.compliance.profiles import get_profile
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        profile = get_profile("hipaa")
+        # Build fallback patterns like the engine does
+        fallback_patterns = {
+            f"*.{pattern}*": method for pattern, method in profile.required_column_patterns.items()
+        }
+
+        anon = DeterministicAnonymizer(seed="test")
+        anon.configure(
+            redact_fields=[],
+            fallback_patterns=fallback_patterns,
+        )
+
+        # admission_date should use year_only
+        result = anon.anonymize_value("2024-03-15", "visits", "admission_date")
+        assert result == "2024"
+
+    def test_hipaa_profile_uses_zip3_for_zipcodes(self):
+        """End-to-end: HIPAA profile maps ZIP columns to hipaa_zip3 transformer."""
+        from dbslice.compliance.profiles import get_profile
+        from dbslice.utils.anonymizer import DeterministicAnonymizer
+
+        profile = get_profile("hipaa")
+        fallback_patterns = {
+            f"*.{pattern}*": method for pattern, method in profile.required_column_patterns.items()
+        }
+
+        anon = DeterministicAnonymizer(seed="test")
+        anon.configure(
+            redact_fields=[],
+            fallback_patterns=fallback_patterns,
+        )
+
+        result = anon.anonymize_value("82101", "addresses", "zipcode")
+        assert result == "000"  # Wyoming, low population
+
+
+# ──────────────────────────────────────────────────────────
+# Phase B: Free-text handling
+# ──────────────────────────────────────────────────────────
+
+
+class TestFreetextRedaction:
+    def test_redact_email(self):
+        text = "Contact john@example.com for details"
+        result = redact_freetext(text)
+        assert "[REDACTED_EMAIL]" in result
+        assert "john@example.com" not in result
+        assert "Contact" in result
+        assert "for details" in result
+
+    def test_redact_ssn(self):
+        text = "Patient SSN: 123-45-6789"
+        result = redact_freetext(text)
+        assert "[REDACTED_SSN]" in result
+        assert "123-45-6789" not in result
+
+    def test_redact_phone(self):
+        text = "Call 555-123-4567"
+        result = redact_freetext(text)
+        assert "[REDACTED_PHONE]" in result
+
+    def test_redact_multiple(self):
+        text = "Email: alice@test.com, SSN: 123-45-6789"
+        result = redact_freetext(text)
+        assert "[REDACTED_EMAIL]" in result
+        assert "[REDACTED_SSN]" in result
+        assert "alice@test.com" not in result
+        assert "123-45-6789" not in result
+
+    def test_no_pii_unchanged(self):
+        text = "This is a normal note with no PII"
+        assert redact_freetext(text) == text
+
+    def test_none_returns_empty(self):
+        assert redact_freetext(None) == ""
+
+    def test_redact_ip(self):
+        text = "Source IP: 192.168.1.100"
+        result = redact_freetext(text)
+        assert "[REDACTED_IP]" in result
+        assert "192.168.1.100" not in result
+
+
+# ──────────────────────────────────────────────────────────
+# Phase E: k-anonymity
+# ──────────────────────────────────────────────────────────
+
+
+class TestKAnonymityCheck:
+    """Test k-anonymity verification logic directly on data."""
+
+    def test_passes_when_k_satisfied(self):
+        """Each combination appears >= 2 times."""
+        rows = [
+            {"age": "30", "gender": "M", "zip": "100"},
+            {"age": "30", "gender": "M", "zip": "100"},
+            {"age": "40", "gender": "F", "zip": "200"},
+            {"age": "40", "gender": "F", "zip": "200"},
+        ]
+        violations = self._check(rows, ["age", "gender", "zip"], k=2)
+        assert len(violations) == 0
+
+    def test_fails_when_unique_combination(self):
+        """One person with a unique combination."""
+        rows = [
+            {"age": "30", "gender": "M", "zip": "100"},
+            {"age": "30", "gender": "M", "zip": "100"},
+            {"age": "99", "gender": "X", "zip": "999"},  # unique
+        ]
+        violations = self._check(rows, ["age", "gender", "zip"], k=2)
+        assert len(violations) == 1
+
+    def test_k_1_always_passes(self):
+        rows = [{"age": "unique_value", "gender": "unique"}]
+        violations = self._check(rows, ["age", "gender"], k=1)
+        assert len(violations) == 0
+
+    @staticmethod
+    def _check(rows, qi_columns, k):
+        from collections import Counter
+
+        combos = Counter(tuple(str(row.get(c, "")) for c in qi_columns) for row in rows)
+        return [(combo, count) for combo, count in combos.items() if count < k]
+
+# ────────────────────────────────────────────────────────── +# Phase C: Binary column sentinel +# ────────────────────────────────────────────────────────── + + +class TestBinarySentinel: + def test_sentinel_value(self): + assert BINARY_SENTINEL == b"\x00" + + +# ────────────────────────────────────────────────────────── +# Integration: verify config fields exist +# ────────────────────────────────────────────────────────── + + +class TestConfigFields: + def test_extract_config_has_new_fields(self): + from dbslice.config import ExtractConfig, SeedSpec + + seed = SeedSpec(table="t", column="c", value=1, where_clause=None) + config = ExtractConfig( + database_url="postgres://localhost/test", + seeds=[seed], + freetext_action="redact", + binary_action="sentinel", + compliance_sample_rows=500, + k_anonymity_min_k=3, + k_anonymity_quasi_identifiers=["users.age", "users.zip"], + k_anonymity_action="fail", + ) + assert config.freetext_action == "redact" + assert config.binary_action == "sentinel" + assert config.compliance_sample_rows == 500 + assert config.k_anonymity_min_k == 3 + assert config.k_anonymity_action == "fail" + + def test_defaults(self): + from dbslice.config import ExtractConfig, SeedSpec + + seed = SeedSpec(table="t", column="c", value=1, where_clause=None) + config = ExtractConfig( + database_url="postgres://localhost/test", + seeds=[seed], + ) + assert config.freetext_action == "warn" + assert config.binary_action == "warn" + assert config.compliance_sample_rows == 100 + assert config.k_anonymity_min_k is None + assert config.k_anonymity_action == "warn" diff --git a/tests/test_mapping_ui.py b/tests/test_mapping_ui.py new file mode 100644 index 0000000..403f560 --- /dev/null +++ b/tests/test_mapping_ui.py @@ -0,0 +1,204 @@ +"""Tests for the column mapping UI server and API.""" + +import json +import threading +import time +from http.client import HTTPConnection + +import pytest + +from dbslice.mapping.server import MappingServer + + +@pytest.fixture() 
+def server(): + """Start a mapping UI server on a fixed local port for testing.""" + srv = MappingServer(port=19473, database_url="", schema=None) + thread = threading.Thread(target=srv.start, kwargs={"open_browser": False}, daemon=True) + thread.start() + time.sleep(0.3) # Wait for server to bind + yield srv + if srv._server: + srv._server.shutdown() + + +def _conn(): + return HTTPConnection("127.0.0.1", 19473, timeout=5) + + +def _post(conn, path, body, token): + conn.request( + "POST", + path, + json.dumps(body).encode(), + {"Content-Type": "application/json", "X-DBSLICE-Token": token}, + ) + return conn.getresponse() + + +class TestTokenSecurity: + def test_get_without_token_fails(self, server): + conn = _conn() + conn.request("GET", "/") + resp = conn.getresponse() + assert resp.status == 403 + + def test_get_with_wrong_token_fails(self, server): + conn = _conn() + conn.request("GET", "/?token=wrong") + resp = conn.getresponse() + assert resp.status == 403 + + def test_get_with_valid_token_serves_html(self, server): + conn = _conn() + conn.request("GET", f"/?token={server.token}") + resp = conn.getresponse() + assert resp.status == 200 + body = resp.read().decode() + assert "dbslice" in body + assert "Column Mapping" in body + + def test_post_without_token_fails(self, server): + conn = _conn() + conn.request( + "POST", + "/api/validate-provider", + json.dumps({"provider": "email"}).encode(), + {"Content-Type": "application/json"}, + ) + resp = conn.getresponse() + assert resp.status == 403 + + def test_post_with_wrong_token_fails(self, server): + conn = _conn() + resp = _post(conn, "/api/validate-provider", {"provider": "email"}, "bad-token") + assert resp.status == 403 + + +class TestValidateProviderAPI: + def test_valid_faker_provider(self, server): + conn = _conn() + resp = _post(conn, "/api/validate-provider", {"provider": "email"}, server.token) + assert resp.status == 200 + data = json.loads(resp.read()) + assert data["valid"] is True + assert 
data["source"] == "faker" + + def test_valid_custom_transformer(self, server): + conn = _conn() + resp = _post(conn, "/api/validate-provider", {"provider": "year_only"}, server.token) + assert resp.status == 200 + data = json.loads(resp.read()) + assert data["valid"] is True + assert data["source"] == "custom_transformer" + + def test_invalid_provider(self, server): + conn = _conn() + resp = _post(conn, "/api/validate-provider", {"provider": "not_a_real_provider_xyz"}, server.token) + assert resp.status == 200 + data = json.loads(resp.read()) + assert data["valid"] is False + + def test_hipaa_zip3_is_valid(self, server): + conn = _conn() + resp = _post(conn, "/api/validate-provider", {"provider": "hipaa_zip3"}, server.token) + data = json.loads(resp.read()) + assert data["valid"] is True + + +class TestGenerateConfigAPI: + def test_generate_basic_config(self, server): + conn = _conn() + mappings = { + "users.email": {"action": "anonymize", "provider": "email"}, + "users.password_hash": {"action": "null", "provider": ""}, + } + resp = _post(conn, "/api/generate-config", {"mappings": mappings}, server.token) + assert resp.status == 200 + data = json.loads(resp.read()) + assert "yaml" in data + assert "users.email: email" in data["yaml"] + assert "users.password_hash" in data["yaml"] + assert data["field_count"] == 1 + assert data["null_count"] == 1 + assert "command_template" in data + assert "dbslice extract" in data["command_template"] + + def test_generate_empty_config(self, server): + conn = _conn() + resp = _post(conn, "/api/generate-config", {"mappings": {}}, server.token) + assert resp.status == 200 + data = json.loads(resp.read()) + assert data["field_count"] == 0 + assert data["null_count"] == 0 + + def test_keep_action_excluded(self, server): + conn = _conn() + mappings = { + "users.email": {"action": "keep", "provider": ""}, + } + resp = _post(conn, "/api/generate-config", {"mappings": mappings}, server.token) + data = json.loads(resp.read()) + assert 
"users.email" not in data["yaml"] + + def test_generated_yaml_is_valid(self, server): + """The generated YAML should parse without errors.""" + import yaml + + conn = _conn() + mappings = { + "users.email": {"action": "anonymize", "provider": "email"}, + "users.ssn": {"action": "anonymize", "provider": "ssn"}, + "users.token": {"action": "null", "provider": ""}, + } + resp = _post(conn, "/api/generate-config", {"mappings": mappings}, server.token) + data = json.loads(resp.read()) + parsed = yaml.safe_load(data["yaml"]) + assert parsed["anonymization"]["enabled"] is True + assert "users.email" in parsed["anonymization"]["fields"] + assert "users.token" in parsed["anonymization"]["security_null_fields"] + + +class TestNotFoundRoutes: + def test_unknown_get(self, server): + conn = _conn() + conn.request("GET", f"/unknown?token={server.token}") + resp = conn.getresponse() + assert resp.status == 404 + + def test_unknown_post(self, server): + conn = _conn() + resp = _post(conn, "/api/unknown", {}, server.token) + assert resp.status == 404 + + +class TestUIContent: + def test_html_contains_token(self, server): + conn = _conn() + conn.request("GET", f"/?token={server.token}") + resp = conn.getresponse() + body = resp.read().decode() + assert server.token in body + + def test_html_loads_from_static_file(self, server): + """UI should load from the static HTML file, not inline string.""" + conn = _conn() + conn.request("GET", f"/?token={server.token}") + resp = conn.getresponse() + body = resp.read().decode() + # Should contain Tailwind CDN (intentional external resource) + assert "tailwindcss" in body + # Should contain key UI elements + assert "Column Mapping" in body + assert "Introspect Schema" in body + assert "Compliance Profiles" in body + + def test_html_has_proper_structure(self, server): + """UI should have accessibility and structural elements.""" + conn = _conn() + conn.request("GET", f"/?token={server.token}") + resp = conn.getresponse() + body = 
resp.read().decode() + assert 'lang="en"' in body + assert 'aria-label' in body + assert 'aria-live="polite"' in body
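Review note: the HIPAA transformer tests in this patch pin down the observable behavior of `year_only`, `hipaa_safe_harbor_zip3`, and `age_bucket`, but this part of the diff does not show their implementations. The sketch below is one plausible way to satisfy those assertions. It is not dbslice's code. The `sketch_*` function names, the handling of ZIP values with fewer than five digits, and the `None` handling in the age helper are assumptions made for illustration. The low-population prefix list is the one commonly cited from HHS Safe Harbor guidance (2000 Census data) and should be checked against current guidance.

```python
import datetime
import re
from typing import Any

# Assumption: 3-digit ZIP prefixes covering 20,000 or fewer people, as commonly
# cited from HHS Safe Harbor guidance. Verify before relying on this list.
LOW_POPULATION_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})

# A standalone run of exactly four digits, e.g. the "2024" in "03/15/2024".
_FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")


def sketch_year_only(value: Any) -> str:
    """Reduce a date-like value to its year, or "" if no year is found."""
    if value is None:
        return ""
    # datetime.datetime is a subclass of datetime.date, so one check covers both.
    if isinstance(value, datetime.date):
        return str(value.year)
    match = _FOUR_DIGITS.search(str(value))
    return match.group(0) if match else ""


def sketch_zip3(value: Any) -> str:
    """Keep the first three ZIP digits, or return "000" when suppressed."""
    digits = re.sub(r"\D", "", str(value)) if value is not None else ""
    # Assumption: anything shorter than a 5-digit ZIP is treated as unusable.
    if len(digits) < 5:
        return "000"
    prefix = digits[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def sketch_age_bucket(value: Any) -> str:
    """Collapse ages of 90 and above into "90+"; pass other values through."""
    if value is None:
        return ""  # Assumption: the tests in the patch do not cover None.
    try:
        age = int(value)
    except (TypeError, ValueError):
        return str(value)  # Non-numeric input is returned unchanged.
    return "90+" if age >= 90 else str(age)
```

The inputs used in the patch's tests (for example `"03/15/2024"`, `"12345-6789"`, `90210`, and `"unknown"`) can be fed to these helpers to compare against the expected values in `TestYearOnly`, `TestHipaaZip3`, and `TestAgeBucket`.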