[REVIEW] ai-data-privacy: add AI data deletion propagation gates #1382

@catcherintheroad-hub

Skill Being Reviewed

Skill name: ai-data-privacy
Skill path: skills/ai-security/ai-data-privacy/

False Positive Analysis

Benign-looking deletion control that can be over-credited:

privacy:
  dsar_endpoint: /privacy/delete
  primary_store_delete: true
rag:
  vector_store: pgvector
  embedding_ttl_days: 365
analytics:
  prompt_logs: retained
backups:
  retention_days: 90

Why this is a false positive:

The system can delete the primary user record, but nothing in this configuration shows that deletion propagates to embeddings, vector-store replicas, cached retrieval chunks, prompt/completion logs, training dataset snapshots, fine-tuning artifacts, analytics exports, or backups. A DSAR endpoint can exist while AI-derived copies of personal data persist and remain retrievable.
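As a minimal sketch of how a reviewer might flag this, the check below walks the config above and lists derived stores that declare no deletion propagation. The `delete_on_dsar` key and the `DERIVED_STORES` tuple are hypothetical names introduced for illustration; they are not part of the skill or of any real schema.

```python
# Hypothetical gate: each AI-derived store should declare deletion propagation.
# Store names mirror the sample config; `delete_on_dsar` is an assumed key.
DERIVED_STORES = ("rag", "analytics", "backups")


def missing_propagation(config: dict) -> list:
    """Return derived stores that do not declare `delete_on_dsar: true`."""
    return [
        store
        for store in DERIVED_STORES
        if store in config and not config[store].get("delete_on_dsar", False)
    ]


config = {
    "privacy": {"dsar_endpoint": "/privacy/delete", "primary_store_delete": True},
    "rag": {"vector_store": "pgvector", "embedding_ttl_days": 365},
    "analytics": {"prompt_logs": "retained"},
    "backups": {"retention_days": 90},
}

gaps = missing_propagation(config)
print(gaps)  # ['rag', 'analytics', 'backups']
```

With the sample config, every derived store is flagged even though `primary_store_delete` is true, which is the over-crediting described above.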

Coverage Gaps

Missed variant 1: source document deletion does not remove embeddings

The source document is deleted, but vector rows, chunk text, and search indexes remain available to RAG retrieval.
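The gap can be illustrated with a toy in-memory store. `RagStore` below is a hypothetical stand-in, not a pgvector client; it only shows how chunks keyed by a source document ID become orphaned when deletion stops at the document table.

```python
from dataclasses import dataclass, field


@dataclass
class RagStore:
    """Toy stand-in for a document table plus a vector table (illustrative only)."""
    documents: dict = field(default_factory=dict)  # doc_id -> text
    chunks: dict = field(default_factory=dict)     # chunk_id -> (doc_id, text, embedding)

    def delete_document(self, doc_id: str, propagate: bool) -> None:
        """Delete the source document, optionally cascading to derived chunks."""
        self.documents.pop(doc_id, None)
        if propagate:
            self.chunks = {
                cid: chunk for cid, chunk in self.chunks.items() if chunk[0] != doc_id
            }

    def orphaned_chunks(self) -> list:
        """Chunk IDs whose source document no longer exists."""
        return sorted(
            cid for cid, chunk in self.chunks.items() if chunk[0] not in self.documents
        )


def build_store() -> RagStore:
    return RagStore(
        documents={"d1": "Alice's support ticket", "d2": "Public FAQ"},
        chunks={
            "c1": ("d1", "Alice's", [0.1, 0.2]),
            "c2": ("d1", "support ticket", [0.3, 0.4]),
            "c3": ("d2", "Public FAQ", [0.5, 0.6]),
        },
    )


naive = build_store()
naive.delete_document("d1", propagate=False)
print(naive.orphaned_chunks())  # ['c1', 'c2'] remain retrievable

gated = build_store()
gated.delete_document("d1", propagate=True)
print(gated.orphaned_chunks())  # []
```

An orphan check of this shape is one possible proof artifact: an empty result after a DSAR run is evidence that vector rows and chunk text were removed along with the source.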

Missed variant 2: training snapshots retain opted-out data

Consent withdrawal stops new data from being ingested, but fine-tuning datasets and model checkpoints that were already created are not flagged for retraining, unlearning, or exclusion.
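A rough sketch of the missing flagging step, assuming each snapshot carries a manifest of the subject IDs it contains. The manifest format, snapshot names, and subject IDs are all hypothetical.

```python
def snapshots_needing_action(snapshots: dict, withdrawn: set) -> dict:
    """Map each snapshot name to the withdrawn subject IDs it still contains."""
    flagged = {}
    for name, subjects in snapshots.items():
        overlap = subjects & withdrawn
        if overlap:
            flagged[name] = overlap
    return flagged


# Assumed manifests: snapshot name -> set of subject IDs included at build time.
snapshots = {
    "ft-dataset-v1": {"u1", "u2", "u3"},
    "ft-dataset-v2": {"u2", "u4"},
    "eval-set": {"u5"},
}
withdrawn = {"u2", "u5"}

flagged = snapshots_needing_action(snapshots, withdrawn)
print(sorted(flagged))  # ['eval-set', 'ft-dataset-v1', 'ft-dataset-v2']
```

Each flagged snapshot would then need an explicit decision (retrain, unlearn, or exclude), along with any checkpoint trained from it.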

Missed variant 3: analytics and backup stores extend retention

Prompt/completion logs, BI exports, and backup systems retain PII beyond the primary AI store's retention period.
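One way to surface this variant is to compare each store's retention against the primary store's. In the sketch below, the 30-day primary retention and the `session_cache` entry are assumptions added for illustration; the other values echo the sample config, with `None` standing for logs marked `retained` with no limit.

```python
from typing import Dict, Optional


def retention_overruns(primary_days: int, stores: Dict[str, Optional[int]]) -> Dict[str, str]:
    """Report stores whose retention exceeds the primary store's retention."""
    overruns = {}
    for name, days in stores.items():
        if days is None:
            # No declared limit: treat as indefinite retention.
            overruns[name] = "indefinite"
        elif days > primary_days:
            overruns[name] = f"+{days - primary_days} days"
    return overruns


stores = {"embeddings": 365, "prompt_logs": None, "backups": 90, "session_cache": 7}
overruns = retention_overruns(primary_days=30, stores=stores)
print(overruns)
```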

Edge Cases

  • Deleting embeddings may require re-indexing or tombstoning if physical deletion is asynchronous.
  • Legal hold can override deletion but must be documented with scope, authority, and expiration.
  • Provider-hosted LLM retention and zero-data-retention settings need their own evidence, separate from the evidence for first-party stores.
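The legal hold edge case can be sketched as a small predicate: a hold is credited as overriding deletion only when scope, authority, and expiration are all recorded and the hold has not expired. The field names and the rule that an undocumented hold is ignored are assumptions made for this example, not requirements stated by the skill.

```python
from datetime import date

REQUIRED_FIELDS = ("scope", "authority", "expires")  # assumed field names


def legal_hold_blocks_deletion(hold: dict, today: date) -> bool:
    """Credit a hold only if it is fully documented and still in force."""
    if any(not hold.get(name) for name in REQUIRED_FIELDS):
        return False  # undocumented hold: do not treat as a valid override
    return hold["expires"] >= today


today = date(2025, 1, 1)
documented = {
    "scope": "billing records for subject 123",
    "authority": "litigation counsel",
    "expires": date(2030, 1, 1),
}
missing_authority = {"scope": "billing records", "expires": date(2030, 1, 1)}
expired = dict(documented, expires=date(2024, 1, 1))

print(legal_hold_blocks_deletion(documented, today))         # True
print(legal_hold_blocks_deletion(missing_authority, today))  # False
print(legal_hold_blocks_deletion(expired, today))            # False
```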

Remediation Quality

  • Fix resolves the vulnerability
  • Fix doesn't introduce new security issues
  • Fix doesn't break functionality
  • Issues found: Add deletion propagation evidence gates for source data, embeddings, vector indexes, logs, training snapshots, model artifacts, analytics exports, backups, provider retention, and legal holds.

Comparison to Other Tools

| Tool | Catches this? | Notes |
| --- | --- | --- |
| DSAR workflow tools | Partial | Usually track primary application records, not all AI-derived data stores. |
| Data catalogs | Partial | Can inventory assets, but reviewers must verify propagation and deletion proofs. |
| Vector DB TTLs | Partial | May expire records eventually, but DSARs require targeted propagation and evidence. |

Overall Assessment

Strengths: Strong privacy lifecycle coverage for training data, prompt/completion PII, retention, memorization, EU AI Act, and consent.

Needs improvement: Add operational deletion propagation evidence so reviewers can distinguish a DSAR endpoint from actual removal across AI-specific derived stores.

Priority recommendations:

  1. Add a deletion propagation evidence checklist under data retention or consent.
  2. Require source-to-derived mapping for embeddings, chunks, logs, snapshots, model artifacts, backups, analytics, and third-party provider stores.
  3. Add output fields for propagation status, proof artifact, residual data risk, legal hold, and re-index/retrain/unlearning decision.
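The output fields in recommendation 3 could be shaped roughly as follows. This is an illustrative sketch, not a proposed schema for the skill: the class name, the allowed status strings, and the gate logic are all assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PropagationFinding:
    """One finding per derived store (hypothetical structure)."""
    store: str
    propagation_status: str        # assumed values: "verified", "pending", "not_propagated"
    proof_artifact: Optional[str]  # e.g. a deletion log reference; None if absent
    residual_data_risk: str        # assumed values: "low", "medium", "high"
    legal_hold: bool
    followup_decision: str         # "re-index", "retrain", "unlearning", or "none"

    def passes_gate(self) -> bool:
        """Pass only on verified deletion with proof, or a recorded legal hold."""
        verified = self.propagation_status == "verified" and self.proof_artifact is not None
        return verified or self.legal_hold


finding = PropagationFinding(
    store="vector_index",
    propagation_status="pending",
    proof_artifact=None,
    residual_data_risk="high",
    legal_hold=False,
    followup_decision="re-index",
)
print(finding.passes_gate())  # False
print(asdict(finding)["followup_decision"])  # re-index
```

Keeping the proof artifact as a required input to the gate is what separates "a DSAR endpoint exists" from "removal was demonstrated" in the reviewer's output.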

Sources Checked

Bounty Info

  • I have read and agree to the CONTRIBUTING.md bounty terms
  • Preferred payment method: GitHub Sponsors
