Scan, detect, and mask Personally Identifiable Information from local files and directories — in real time.
- Overview
- Architecture
- Project Structure
- Features
- Quick Start
- CLI Usage
- Python API
- Detection Capabilities
- Redaction Strategies
- Configuration
- Testing
- Security Design
- Performance
- Roadmap
Aegis-Data-Shield is a production-ready Python toolkit for software engineers who need to detect and mask Personally Identifiable Information (PII) in text-based files. It runs as a single-file scan, in batch directory mode, or in fully automated real-time mode driven by watchdog filesystem events.
Built with a clean Object-Oriented architecture, strict separation of concerns, and two complementary detection layers — deterministic Regular Expressions and probabilistic spaCy NER — it delivers high-precision redaction suitable for:
- 🏥 Healthcare data pipelines (HIPAA compliance)
- 🏦 Financial log sanitisation (PCI-DSS)
- 🇪🇺 GDPR/CCPA data anonymisation workflows
- 🔐 Secure data sharing and audit preparation
- 🧪 Test-data scrubbing before committing to repositories
Sensitive data leaks happen silently in logs, files, and pipelines.
Aegis-Data-Shield detects and removes PII before it becomes a liability.
┌──────────────────────────────────────────────────────────┐
│                     CLI / Python API                     │
│                         main.py                          │
└────────────────────────┬─────────────────────────────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
  ┌───────▼──────┐ ┌─────▼──────┐ ┌─────▼──────────┐
  │  PIIScanner  │ │PIIRedactor │ │ FolderMonitor  │
  │ (detection)  │ │ (masking)  │ │   (watchdog)   │
  └───────┬──────┘ └─────┬──────┘ └─────┬──────────┘
          │              │              │
  ┌───────▼──────────────▼──────────────▼──────────┐
  │               File Handler Layer               │
  │            FileReader │ FileWriter             │
  └──────────────────────┬─────────────────────────┘
                         │
  ┌──────────────────────▼─────────────────────────┐
  │                Utilities Layer                 │
  │        AegisConfig │ Logger │ Reporter         │
  └────────────────────────────────────────────────┘
| Principle | Implementation |
|---|---|
| Single Responsibility | Each module owns exactly one concern |
| Open/Closed | New PII types added via config dict, not code changes |
| Dependency Inversion | All components accept an AegisConfig interface |
| Fail-Safe Defaults | spaCy unavailable → graceful regex-only fallback |
| Immutability | Original text strings are never mutated |
| Atomicity | File writes go through temp-file + os.replace() |
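
The atomicity row can be illustrated with a minimal sketch of the temp-file plus `os.replace()` pattern. The function name `atomic_write_text` is hypothetical and is not taken from the project's `file_writer.py`; the sketch only assumes that the temporary file lives in the same directory as the target, so the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Write text so readers see either the old file or the new one."""
    target = Path(path)
    # Create the temp file next to the target so os.replace() does not
    # have to cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before `os.replace()` runs, the original file is untouched and only a stray `.tmp` file remains.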
aegis-data-shield/
│
├── aegis/ # Core library package
│ ├── __init__.py # Public API exports
│ │
│ ├── scanner/
│ │ ├── __init__.py
│ │ └── pii_scanner.py # ⭐ Regex + spaCy NER detection engine
│ │
│ ├── redactor/
│ │ ├── __init__.py
│ │ └── pii_redactor.py # ⭐ 4-strategy masking engine + audit trail
│ │
│ ├── file_handler/
│ │ ├── __init__.py
│ │ ├── file_reader.py # ⭐ Multi-encoding, whitelist-guarded reader
│ │ └── file_writer.py # ⭐ Atomic, backup-aware writer
│ │
│ ├── monitor/
│ │ ├── __init__.py
│ │ └── folder_monitor.py # ⭐ Real-time watchdog surveillance pipeline
│ │
│ └── utils/
│ ├── __init__.py
│ ├── config.py # Centralised configuration + AegisConfig dataclass
│ ├── logger.py # Colour-coded rotating file logger
│ └── reporter.py # JSON + ASCII scan report generator
│
├── tests/
│ ├── __init__.py
│ ├── test_scanner.py # 20+ scanner unit tests
│ ├── test_redactor.py # Redactor strategy & audit tests
│ └── test_file_handler.py # FileReader + FileWriter integration tests
│
├── sample_data/
│ ├── test_document.txt # Synthetic employee record
│ ├── test_log.log # Synthetic application log
│ └── test_data.csv # Synthetic user data CSV
│
├── logs/ # Runtime log files (auto-created)
├── reports/ # JSON scan reports (auto-created)
│
├── main.py # CLI entry point
├── setup.py # Package installer
├── requirements.txt # Pinned dependency manifest
├── .gitignore
└── README.md
- ✅ Email addresses — RFC-5322 compliant pattern
- ✅ Phone numbers — US, international, E.164, dotted, dashed
- ✅ Credit card numbers — Visa, MasterCard, Amex, Discover + Luhn validation
- ✅ Social Security Numbers — with or without dashes
- ✅ IPv4 addresses — full octet range validation
- ✅ Dates of birth — multiple format variants
- ✅ Person names — via spaCy NER (`PERSON` entity)
- ✅ Organisations — via spaCy NER (`ORG` entity)
- ✅ Locations/GPE — via spaCy NER (`GPE` entity)
- 🔒 4 redaction strategies — token, hash, mask, partial
- 📋 Full audit trail — every substitution recorded with line number
- 📊 Per-run statistics — breakdown by PII type
- 🔄 Overlap resolution — deduplication with confidence-priority ranking
- 📂 14+ file types — `.txt`, `.log`, `.csv`, `.json`, `.xml`, `.html`, `.md`, `.yaml`, `.py`, `.js`, `.ts`, `.sql`, and more
- 🔐 Atomic writes — via `os.replace()` — no half-written files
- 💾 Auto-backup — `.bak` copies of originals before overwriting
- 🧠 Encoding cascade — UTF-8 → Latin-1 → CP1252 auto-detection
- 🚫 Binary guard — heuristic check rejects non-text files
- 👁 Real-time surveillance — `watchdog` filesystem events
- 🧵 Thread-pool pipeline — 4 workers (configurable); one slow file can't block the queue
- 🔁 Deduplication guard — rapid OS events don't double-process a file
- 🛑 Graceful shutdown — SIGINT / SIGTERM handled cleanly
- 📝 Colour-coded console logs — ANSI, per log level
- 🔄 Rotating file logs — 5 MB × 3 backups
- 📄 JSON reports — machine-readable scan results
- 🖨 ASCII summary tables — human-readable terminal output
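
The binary guard and encoding cascade from the file-handling features can be sketched as follows. This is an illustration under stated assumptions, not the project's `file_reader.py`: the names `looks_binary` and `decode_with_fallback` and the 30% threshold are hypothetical. The sketch also tries CP1252 before Latin-1, which differs from the order listed above, because Python's Latin-1 codec accepts every byte value and any codec placed after it would never be reached.

```python
from typing import Tuple


def looks_binary(sample: bytes, threshold: float = 0.30) -> bool:
    """Heuristic: NUL bytes or many control bytes suggest a binary file."""
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    # Control bytes that are normal in text: tab, LF, FF, CR, ESC.
    text_controls = {9, 10, 12, 13, 27}
    suspicious = sum(1 for b in sample if b < 32 and b not in text_controls)
    return suspicious / len(sample) > threshold


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding), trying strict codecs before the catch-all."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps all 256 byte values, so this call cannot fail.
    return data.decode("latin-1"), "latin-1"
```

A reader would typically call `looks_binary` on the first few kilobytes of a file and only decode the full content when the check passes.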
git clone https://github.com/YOUR_USERNAME/aegis-data-shield.git
cd aegis-data-shield
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Download spaCy model (choose one)
python -m spacy download en_core_web_lg # Recommended (accurate)
python -m spacy download en_core_web_sm # Alternative (faster)

# Install as an editable package
pip install -e .

# Run a first scan
python main.py scan sample_data/test_document.txt

usage: aegis {scan,redact,monitor} ...
Commands:
scan Detect PII without modifying files
redact Detect and mask PII in files
monitor Real-time folder surveillance
python main.py scan sample_data/test_document.txt
python main.py scan sample_data/ --recursive --report reports/my_report.json

python main.py redact sample_data/test_document.txt --mode new_file
python main.py redact sample_data/ --recursive --mode overwrite
python main.py redact sample_data/ --dry-run

# token (default) → [REDACTED-EMAIL]
python main.py redact sample_data/ --strategy token
# hash → [SHA:3a7f9b2c]
python main.py redact sample_data/ --strategy hash
# mask → ******************
python main.py redact sample_data/ --strategy mask
# partial → a***************m
python main.py redact sample_data/ --strategy partial

python main.py scan sample_data/ --no-spacy
python main.py monitor sample_data/ --recursive --mode new_file

from aegis.scanner.pii_scanner import PIIScanner
from aegis.redactor.pii_redactor import PIIRedactor, RedactionStrategy
from aegis.utils.config import AegisConfig

# 1. Configure
cfg = AegisConfig(
    enable_spacy=True,
    backup_originals=True,
    dry_run=False,
)

# 2. Scan
scanner = PIIScanner(config=cfg)
text = "Contact alice@example.com or call 800-555-0199"
matches = scanner.scan(text)
for match in matches:
    print(match)
    # → [EMAIL] 'alice@example.com' (line 1, chars 8-26, source=regex, conf=1.00)
    # → [PHONE] '800-555-0199' (line 1, chars 33-45, source=regex, conf=1.00)

# 3. Redact
redactor = PIIRedactor(config=cfg, strategy=RedactionStrategy.TOKEN)
result = redactor.redact(text, matches)
print(result.redacted_text)
# → "Contact [REDACTED-EMAIL] or call [REDACTED-PHONE]"
result.print_summary()

# 4. Real-time monitoring
from aegis.monitor.folder_monitor import FolderMonitor

def on_redacted(path: str, count: int):
    print(f"✅ Redacted {count} item(s) in {path}")

monitor = FolderMonitor(
    watch_dir="./inbox",
    config=cfg,
    write_mode="new_file",
    on_redacted=on_redacted,
)
monitor.start()  # blocks; Ctrl-C to stop

| PII Type | Method | Example | Notes |
|---|---|---|---|
| Email | Regex | `alice@example.com` | RFC-5322 compliant |
| Phone | Regex | `+1-800-555-0199` | US/international |
| Credit Card | Regex + Luhn | `4111 1111 1111 1111` | Visa/MC/Amex/Discover |
| SSN | Regex | `123-45-6789` | US format |
| IP Address | Regex | `192.168.1.100` | IPv4 |
| Date of Birth | Regex | `04/12/1985` | Multiple formats |
| Person Name | spaCy NER | `Alice Johnson` | PERSON entity |
| Organisation | spaCy NER | `Acme Corp` | ORG entity |
| Location | spaCy NER | `San Francisco` | GPE entity |
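
Because the regex and NER layers run over the same text, their matches can overlap, for example when NER tags part of an email address as a name. One way to resolve such overlaps by confidence priority is sketched below. The `Span` record and `resolve_overlaps` function are hypothetical and are not the library's own types; the sketch assumes half-open character offsets and a greedy highest-confidence-first policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical match record, not the library's actual match type."""
    start: int
    end: int  # exclusive
    label: str
    confidence: float


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Greedily keep non-overlapping spans, highest confidence first."""
    # Ties on confidence are broken in favour of the longer span.
    ranked = sorted(
        spans, key=lambda s: (-s.confidence, -(s.end - s.start), s.start)
    )
    kept: List[Span] = []
    for candidate in ranked:
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


spans = [
    Span(8, 25, "EMAIL", 1.00),
    Span(8, 13, "PERSON", 0.60),  # NER guess inside the email address
    Span(34, 46, "PHONE", 1.00),
]
print([s.label for s in resolve_overlaps(spans)])
# → ['EMAIL', 'PHONE']
```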
| Strategy | Example In | Example Out | Use Case |
|---|---|---|---|
| `token` | `alice@corp.com` | `[REDACTED-EMAIL]` | General-purpose redaction |
| `hash` | `alice@corp.com` | `[SHA:3a7f9b2c]` | Pseudonymisation (same PII → same token) |
| `mask` | `alice@corp.com` | `**************` | Length-preserving masking |
| `partial` | `alice@corp.com` | `a************m` | Low-sensitivity preview |
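
The four strategies can be sketched as a single dispatch function. This is a minimal illustration, not the code in `pii_redactor.py`: the function name `redact_value` is hypothetical, and the `hash` branch assumes the first 8 hex characters of a SHA-256 digest, since the digest and truncation the project actually uses are not stated here.

```python
import hashlib


def redact_value(value: str, pii_type: str, strategy: str = "token") -> str:
    """Return a replacement string for one detected PII value."""
    if strategy == "token":
        return f"[REDACTED-{pii_type}]"
    if strategy == "hash":
        # Assumption: SHA-256 truncated to 8 hex characters.
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        return f"[SHA:{digest[:8]}]"
    if strategy == "mask":
        return "*" * len(value)
    if strategy == "partial":
        if len(value) <= 2:
            return "*" * len(value)  # too short to reveal any character
        return value[0] + "*" * (len(value) - 2) + value[-1]
    raise ValueError(f"unknown strategy: {strategy}")
```

One design point worth weighing: an unsalted digest of a low-entropy value such as an SSN can be recovered by brute force, so a keyed hash (for example `hmac` with a secret key) may be a safer choice where pseudonyms must resist reversal.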
from aegis.utils.config import AegisConfig

cfg = AegisConfig(
    # Redaction tokens per PII type (fully customisable)
    redaction_tokens={
        "EMAIL": "[EMAIL REMOVED]",
        "CREDIT_CARD": "[CC REMOVED]",
        # ... add more
    },
    # spaCy model to use
    spacy_model="en_core_web_lg",
    # NER labels to treat as PII
    pii_labels=("PERSON", "ORG", "GPE"),
    # Supported file extensions
    supported_extensions=(".txt", ".log", ".csv", ".json"),
    # Max file size (bytes) — default 50 MB
    max_file_size=50 * 1024 * 1024,
    # Log level: DEBUG | INFO | WARNING | ERROR
    log_level="INFO",
    # True → no files written, only reports
    dry_run=False,
    # True → create .bak before overwriting
    backup_originals=True,
    # False → regex only (faster, no spaCy dependency)
    enable_spacy=True,
)

Environment variable overrides:
export AEGIS_LOG_LEVEL=DEBUG # Set log verbosity
export AEGIS_MAX_FILE_SIZE=104857600 # 100 MB file size limit

# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest tests/ -v
# Run with coverage report
pytest tests/ -v --cov=aegis --cov-report=term-missing
# Run a specific test module
pytest tests/test_scanner.py -v
# Run a specific test class
pytest tests/test_redactor.py::TestHashStrategy -v

| Module | Tests | Coverage Target |
|---|---|---|
| `pii_scanner.py` | 18 tests | ≥ 95% |
| `pii_redactor.py` | 16 tests | ≥ 95% |
| `file_reader.py` | 6 tests | ≥ 90% |
| `file_writer.py` | 5 tests | ≥ 90% |
- No data leaves the machine — Aegis-Data-Shield is entirely local; zero network calls are made during scanning or redaction.
- Atomic writes — All file writes use `os.replace()` so a process crash mid-write cannot leave partially-redacted files on disk.
- Backup-before-overwrite — The default configuration creates `.bak` copies so original data is never permanently lost.
- No eval / exec — The codebase contains zero dynamic code execution. Regex patterns are compiled at import time from trusted constants.
- Input size guard — Files larger than the configured limit (default 50 MB) are rejected before reading, preventing memory exhaustion.
- Binary file guard — A heuristic null-byte and non-printable-byte ratio check prevents processing binary files as text.
- Luhn check on credit cards — Credit-card patterns are additionally validated with the Luhn algorithm, dramatically reducing false positives.
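
The Luhn step in the last item can be sketched as follows. The candidate pattern `CARD_CANDIDATE` and the helper names are illustrative assumptions and are not the project's own regex. The pattern simply accepts 13 to 19 digits with optional space or dash separators, and the checksum then filters out digit runs that cannot be card numbers.

```python
import re
from typing import List

# Illustrative pattern only: 13 to 19 digits, optional space/dash separators.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [
        m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())
    ]


print(find_card_numbers("order 1234567890123456 paid with 4111 1111 1111 1111"))
# → ['4111 1111 1111 1111']
```

The 16-digit order number matches the pattern but fails the checksum, which is the false-positive reduction the list item describes.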
- Data minimisation: only the detected PII span is replaced; surrounding context is preserved.
- Pseudonymisation: the `hash` strategy produces repeatable, consistent pseudonyms (the same input always maps to the same token).
- Audit trail: every `RedactionResult` contains a fully-itemised change log for compliance reporting.
| Scenario | Throughput |
|---|---|
| Regex-only scan (no spaCy) | ~15 MB/s |
| Full scan (regex + spaCy lg) | ~3 MB/s |
| Batch redact 100 × 1 KB files | < 2 seconds |
| Real-time monitor (watchdog) | Sub-second per event |
Benchmarked on M2 MacBook Pro with en_core_web_lg.
Regex-only mode is ~5× faster and sufficient for structured data (logs, CSVs).
- v1.1 — PDF support via `pdfminer.six`
- v1.1 — DOCX / XLSX support
- v1.2 — Async pipeline with `asyncio` for I/O-bound batch jobs
- v1.2 — Docker image for CI/CD pipeline integration
- v1.3 — REST API mode (FastAPI) for microservice deployment
- v1.3 — Presidio adapter for enterprise-scale NLP models
- v2.0 — ML-based false-positive reducer with active learning
Released under the MIT License. See LICENSE for details.
Built with ❤️ and a commitment to privacy-first engineering.
Star the repo if Aegis-Data-Shield helps secure your data pipeline!