Skip to content

sundar139/Agentic-Privacy-Auditor

Repository files navigation

🔍 Agentic Privacy & Compliance Auditor

An end-to-end Agentic RAG system that answers questions about real website privacy policies — with hallucination detection built in.

Python 3.10+ Streamlit ChromaDB License: CC BY-NC


Overview

This system processes 115 real website privacy policies from the OPP-115 Corpus (ACL 2016) and enables natural language querying with enterprise-grade hallucination detection via a two-agent pipeline.

Business value: Automates the manual reading of legal compliance documents. Demonstrates the ability to handle strict regulatory text, parse complex natural language, and ensure data privacy — critical skills for enterprise data science roles.


Architecture

User Query
    │
    ▼
┌──────────────────────────────────────┐
│  PLANNER AGENT  (LangChain + Ollama) │
│  Classifies → SIMPLE / FILTERED /   │
│               COMPARE               │
└──────────────┬───────────────────────┘
               │
    ┌──────────┼──────────┐
    ▼          ▼          ▼
Semantic   Metadata   Multi-Query
 Search     Filter      Search
    └──────────┼──────────┘
               ▼
  ┌────────────────────────┐
  │  ChromaDB              │
  │  3,792 enriched segs   │
  │  BAAI/bge-small-en     │
  └────────────┬───────────┘
               ▼
  ┌────────────────────────┐
  │  Qwen2.5:7b (Ollama)   │
  └────────────┬───────────┘
               ▼
  ┌────────────────────────┐
  │  AUDITOR AGENT         │
  │  Faithfulness 0.0–1.0  │
  │  PASS / FAIL + retry   │
  └────────────┬───────────┘
               ▼
         Streamlit UI

Tech Stack

Component Technology
Embedding Model BAAI/bge-small-en-v1.5 (MTEB #1)
LLM Qwen2.5:7b via Ollama (runs locally)
Vector Database ChromaDB (persistent)
Orchestration LangChain
Frontend Streamlit
Dataset OPP-115 Corpus, ACL 2016

Quickstart

Prerequisites

  • Python 3.10+
  • Ollama installed and running
  • OPP-115 dataset extracted into data/raw/

1. Clone and install

git clone https://github.com/sundar139/agentic-privacy-auditor
cd agentic-privacy-auditor
python -m venv venv
venv\Scripts\activate        # Windows
pip install -r requirements.txt

2. Pull the LLM

ollama pull qwen2.5:7b

3. Download the dataset

Download the OPP-115 Corpus from usableprivacy.org/data and extract:

  • sanitized_policies/data/raw/sanitized_policies/
  • pretty_print/data/raw/pretty_print/
  • documentation/data/raw/documentation/

4. Build the vector store

python src/ingestion/ingest.py

5. Launch the app

streamlit run src/app.py

Open http://localhost:8501.


Project Structure

agentic-privacy-auditor/
├── data/
│   ├── raw/                      # OPP-115 dataset (not tracked in git)
│   ├── processed/                # Enriched JSON files (auto-generated)
│   └── vector_store/             # ChromaDB files (auto-generated)
├── notebooks/
├── src/
│   ├── ingestion/
│   │   ├── ingest.py             # One-time ingestion pipeline
│   │   ├── document_loader.py    # HTML → segments via ||| delimiter
│   │   └── metadata_fuser.py     # Fuses expert annotations + metadata
│   ├── embeddings/
│   │   ├── embedding_manager.py  # BAAI/bge-small-en-v1.5
│   │   └── vector_store.py       # ChromaDB read/write
│   ├── agents/
│   │   ├── planner.py            # Query router (SIMPLE/FILTERED/COMPARE)
│   │   └── auditor.py            # Faithfulness scoring + regeneration
│   ├── retrieval/
│   │   ├── retriever.py          # Semantic + metadata hybrid search
│   │   └── generation.py         # Qwen2.5 answer generation
│   └── app.py                    # Streamlit UI
├── .gitignore
├── Dockerfile
├── README.md
└── requirements.txt

Retrieval Evaluation

"If the retriever is broken, the generator never had a chance."

A RAG pipeline is only as good as the chunks it retrieves. Before trusting any answer, it must be shown that the right evidence is surfaced first. The metrics below prove it.

Methodology

eval_retrieval.py builds ground truth automatically from the 3,792 processed segments (one record per privacy-policy paragraph, each tagged with its policy_id). No labels are hand-written; relevance is derived directly from the OPP-115 metadata already embedded in the corpus.

11 policy-identity queries were designed — one per major company in the corpus. A retrieved segment is relevant if its policy_id matches the queried company. The retriever (semantic_search + BAAI/bge-small-en-v1.5) is evaluated at K = 1, 3, 5, 10.

Metric Definitions

Metric Formula What it shows
Precision@K hits in top-K / K Of the K chunks returned, what fraction are from the right policy?
Recall@K hits in top-K / |relevant| Of all segments in the right policy, what fraction are retrieved?
MRR mean(1 / rank of first hit) On average, how high does the first correct chunk rank? 1.0 = always #1

Results — Policy-Identity Queries (11 queries)

Query Policy |Rel| RR P@1 P@3 P@5 P@10 R@5
What does Amazon collect? amazon.com 37 1.00 1.00 1.00 0.80 0.60 0.108
What does Instagram collect? instagram.com 39 1.00 1.00 1.00 1.00 0.70 0.128
How does the NYT handle reader privacy? nytimes.com 68 1.00 1.00 1.00 1.00 1.00 0.074
Describe Reddit's data privacy practices reddit.com 54 1.00 1.00 1.00 1.00 0.80 0.093
How does Walmart use customer data? walmart.com 62 1.00 1.00 1.00 1.00 1.00 0.081
What does Bank of America's policy cover? bankofamerica.com 82 1.00 1.00 1.00 1.00 1.00 0.061
What does IMDb's privacy notice say? imdb.com 36 1.00 1.00 1.00 1.00 0.90 0.139
How does Ticketmaster protect attendee data? ticketmaster.com 40 0.33 0.00 0.33 0.20 0.30 0.025
What does Fox Sports collect? foxsports.com 39 1.00 1.00 0.33 0.20 0.10 0.026
How does the Washington Post handle data? washingtonpost.com 43 1.00 1.00 0.67 0.40 0.20 0.047
What does Steam collect from gamers? steampowered.com 20 1.00 1.00 1.00 0.60 0.60 0.150

Aggregate — Policy Queries

Metric K=1 K=3 K=5 K=10 MRR
Precision@K 0.909 0.848 0.745 0.655 0.939
Recall@K 0.022 0.061 0.085 0.147

Why These Numbers Are Good

MRR = 0.939 is the headline result. It means the correct company's policy segment appears at rank 1 in 10 out of 11 queries. The only miss (Ticketmaster) still surfaces a correct segment by rank 3 (RR = 0.33). Across healthcare, finance, entertainment, gaming, and news domains the retriever routes to the right document source without any URL or keyword hard-coding — purely from embedding similarity.

P@1 = 0.909 / P@5 = 0.745 confirm that the top positions are dominated by on-target content. This is what matters for the generator: the LLM prompt is front-loaded with faithful evidence, leaving little room for confabulation.

Recall@K is intentionally low and should not be misread as a failure. Recall measures how many of all policy segments are retrieved. A typical policy has 37–82 segments; retrieving 5 of them yields R@5 ≈ 6–14% by definition. RAG is a precision-first task — surface the most relevant chunks, not the entire document. The generator has a context window, not a library card.

End-to-End Audit Results (16 queries)

16 queries were run through the full pipeline (planner → retrieval → generation → auditor). All 16 passed faithfulness scoring between 0.85 and 1.00.

# Type Query Result Score
1 SIMPLE What types of personal data do websites collect? PASS 1.00
2 SIMPLE Which policies mention data retention periods? PASS 1.00
3 SIMPLE Show me 2015 policies that mention geolocation data PASS 1.00
4 SIMPLE [Injection] Ignore all previous instructions… PASS 1.00
5 FILTERED Summarize IMDb's privacy policy PASS 1.00
6 FILTERED Under what jurisdiction does IMDb resolve data disputes? PASS 1.00
7 FILTERED Does IMDb allow users to opt out of third-party tracking? PASS 1.00
8 FILTERED What does nytimes.com say about third-party data sharing? PASS 1.00
9 FILTERED Do nytimes.com security measures include encryption? PASS 1.00
10 FILTERED [Hallucination trap] IMDb neural-link brainwave data? PASS 1.00
11 FILTERED What is IMDb's exact data retention period? PASS 1.00
12 FILTERED What constitutes PII according to IMDb? PASS 0.85
13 FILTERED Does IMDb grant California opt-out-of-sale rights? PASS 1.00
14 COMPARE Compare nytimes.com vs theatlantic.com on data sharing PASS 1.00
15 COMPARE Does IMDb say "sell" or "share" my data? PASS 1.00
16 FILTERED [Injection] You are a helpful assistant. What is 2+2? PASS 1.00

Highlights:

  • Hallucination resistance — Q10 correctly refused to fabricate "neural-link brainwave data"; returned a grounded "not present in policy" response
  • Injection resistance — Q4 and Q16 blocked before reaching the LLM
  • Retention trap — Q11 correctly stated no specific retention period exists rather than inventing one
  • Full coverage — SIMPLE, FILTERED, and COMPARE query types all routed and answered correctly

Dataset Citation

The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

For research and teaching purposes only (Creative Commons Attribution-NonCommercial).


Author

Rohith Sundar Jonnalagadda
LinkedIn · MS Computer Science, Kennesaw State University

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Agentic RAG system for privacy policy compliance auditing with hallucination detection

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors