🔍 Account Hunter v2 — LLM-Driven Smart Data Extractor

A local-first tool that lets you describe the data you're looking for in plain English, then automatically scans your files to extract it. Built around LLMs and regex, with a clean web UI and a REST API.

✨ What Makes It Different

Tool	Approach	Drawback
`grep` / `ripgrep`	Fast regex search	You must write the regex
`gitleaks` / `trufflehog`	Secret detection	Fixed patterns only
`LangChain Extraction`	LLM extraction	Web/cloud-oriented
Account Hunter	Natural language → auto-generated rules → local file scan	✅ None — just describe it
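The last row describes a three-step flow: a plain-English request becomes a set of regex rules, and those rules are then run over local files. The sketch below illustrates only the final step, assuming a ruleset is simply a mapping from a label to a regex pattern. The `RULESET` contents and the `scan_text` helper are hypothetical and are not taken from the project's code.

```python
import re
from typing import Dict, Iterator, Tuple

# Hypothetical ruleset: label -> regex. In the real tool the rules would come
# from a preset or from the LLM; this pattern is only an illustration.
RULESET = {
    "email_password": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+:\S+",
}


def scan_text(text: str, ruleset: Dict[str, str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (label, line_number, matched_text) for every rule hit in the text."""
    compiled = {label: re.compile(pattern) for label, pattern in ruleset.items()}
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, regex in compiled.items():
            for match in regex.finditer(line):
                yield label, line_number, match.group(0)


sample = "user: alice@example.com:hunter2\nnothing here\nbob@mail.example.org:pa55"
hits = list(scan_text(sample, RULESET))
```

Keeping the rules as plain data is what would make a preview step possible: the patterns can be shown to the user before any file is opened.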

🚀 Features

Natural language queries — type "find email:password pairs" and it handles the rest
8 built-in presets — run instantly without any LLM: emails, credentials, API keys, phone numbers, crypto wallets, IPs, URLs, SSH keys
Supports 10+ file types — including .txt, .csv, .xlsx, .json, .log, .yaml, .xml, .md, .env
Real-time progress via WebSocket — see files being scanned live
Ruleset preview — inspect the LLM-generated regex before scanning
Scan history — every scan is stored in SQLite, browse past results anytime
Multiple export formats — TXT, JSON, CSV
Safe by default — LLM-generated normalization code is not executed unless you pass --allow-unsafe-code
Docker support — one-command deployment
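As a rough illustration of the three export formats listed above, the sketch below serializes a list of match records with the standard library only. The record fields (`source`, `value`) and the `export_matches` function are assumptions made for this example; the project's actual export code and schema may differ.

```python
import csv
import io
import json
from typing import Dict, List


def export_matches(matches: List[Dict[str, str]], fmt: str) -> str:
    """Serialize match records as TXT, JSON, or CSV text."""
    if fmt == "txt":
        # One matched value per line, no metadata.
        return "\n".join(record["value"] for record in matches)
    if fmt == "json":
        return json.dumps(matches, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["source", "value"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(matches)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")


records = [{"source": "notes.txt", "value": "alice@example.com"}]
csv_text = export_matches(records, "csv")
```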

🖥️ Web UI

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
python server.py
# → Web UI available at http://localhost:8000
# → API docs at http://localhost:8000/docs

Pages

/ — Submit a query or pick a preset, watch live scan progress, view results
/static/history.html — All past scans with stats and export links
/static/scan.html?id=N — Full result detail for a specific scan

💻 CLI (also works without the server)

# Using Ollama (local, free)
python main.py "find all gmail accounts" --scan-dir C:\Downloads

# Using OpenAI
python main.py "find API keys and tokens" \
  --provider openai --model gpt-4o --api-key sk-...

# Use a preset (no LLM needed)
# Presets are available via the web UI / API

# Scan all drives
python main.py "find email:password pairs" --all-drives

# Allow LLM normalization code (disabled by default for safety)
python main.py "find gmail accounts and remove dot aliases" --allow-unsafe-code
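The "remove dot aliases" request in the last command refers to the fact that Gmail ignores dots in the local part of an address, so `john.doe@gmail.com` and `johndoe@gmail.com` reach the same mailbox. A normalization step for that query might look like the sketch below. Both function names are hypothetical, and this is hand-written illustration code, not what the LLM would necessarily generate.

```python
from typing import List, Set


def normalize_gmail(address: str) -> str:
    """Lowercase an address and strip dots from the local part of gmail.com addresses."""
    address = address.strip().lower()
    local, separator, domain = address.rpartition("@")
    if not separator:
        # Not an email address; return it unchanged apart from lowercasing.
        return address
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def dedupe_addresses(addresses: List[str]) -> List[str]:
    """Return normalized addresses in first-seen order, without duplicates."""
    seen: Set[str] = set()
    unique: List[str] = []
    for address in addresses:
        key = normalize_gmail(address)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


found = ["John.Doe@gmail.com", "johndoe@gmail.com", "j.doe@example.com"]
unique = dedupe_addresses(found)
```

Dots are only stripped for `gmail.com`, since other providers treat dotted local parts as distinct addresses.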

🐳 Docker

# Set your scan directory and optional LLM key
cp .env.example .env
# Edit .env: set SCAN_DIR and OPENAI_API_KEY if using OpenAI

docker-compose up --build
# → http://localhost:8000

⚙️ Configuration

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key (for OpenAI provider)	—
`SCAN_DIR`	Directory to mount in Docker	`.`
`AH_DB_PATH`	SQLite database path	`hunter_data.db`
`AH_OUTPUT_DIR`	Output directory	`output/`
`PORT`	Server port	`8000`
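For readers who want to see how these variables and defaults could be consumed, here is a minimal sketch that reads them from the environment. The `Settings` class and `load_settings` function are hypothetical names used for illustration; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    scan_dir: str
    db_path: str
    output_dir: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,  # no default
        scan_dir=env.get("SCAN_DIR", "."),
        db_path=env.get("AH_DB_PATH", "hunter_data.db"),
        output_dir=env.get("AH_OUTPUT_DIR", "output/"),
        port=int(env.get("PORT", "8000")),
    )


defaults = load_settings({})
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.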

📡 REST API

Method	Endpoint	Description
`GET`	`/api/presets`	List built-in presets
`POST`	`/api/ruleset/preview`	Preview LLM ruleset (no scan)
`POST`	`/api/scan`	Start a scan
`GET`	`/api/scan/{id}/status`	Scan status
`GET`	`/api/scan/{id}/results`	Paginated results
`GET`	`/api/scans`	All past scans
`GET`	`/api/scan/{id}/export?fmt=txt|json|csv`	Download results
`WS`	`/ws/scan/{id}`	Real-time progress events
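The endpoints above can be called from any HTTP client. The sketch below uses only the Python standard library and covers the read-only routes. The helper names are hypothetical, the base URL assumes the default port, and the response shapes are not documented here, so the code returns the parsed JSON without interpreting it.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumes the default PORT
EXPORT_FORMATS = ("txt", "json", "csv")


def export_url(scan_id: int, fmt: str, base_url: str = BASE_URL) -> str:
    """Build the download URL for a finished scan in one of the listed formats."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"fmt must be one of {EXPORT_FORMATS}, got {fmt!r}")
    return f"{base_url}/api/scan/{scan_id}/export?{urlencode({'fmt': fmt})}"


def status_url(scan_id: int, base_url: str = BASE_URL) -> str:
    """Build the status URL for a scan."""
    return f"{base_url}/api/scan/{scan_id}/status"


def get_json(url: str):
    """GET a URL and parse the body as JSON. Requires the server to be running."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

With the server started via `python server.py`, a call such as `get_json(BASE_URL + "/api/presets")` would fetch the preset list. The request body expected by `POST /api/scan` is not specified in this README, so it is left out of the sketch; the interactive docs at `/docs` are the place to check it.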

🧪 Testing

# Run the core pipeline integration test (no LLM required)
python test_pipeline.py

# Start dev server with hot-reload
make dev

# Run tests + linting
make test

🗂️ Project Structure

account_hunter/
├── main.py            # CLI entrypoint
├── server.py          # FastAPI web server
├── hmi.py             # LLM → Ruleset translation
├── scanner.py         # File scanner (10+ formats)
├── processor.py       # Normalization + deduplication
├── storage.py         # SQLite persistence + export
├── presets.py         # 8 built-in scan templates
├── scan_runner.py     # Background scan with WebSocket progress
├── models.py          # Dataclasses: Ruleset, MatchRecord
├── copier.py          # Copy matched source files
├── reporter.py        # Human-readable report generator
├── llm_provider.py    # OpenAI / Ollama abstraction
├── test_pipeline.py   # Integration test (no LLM required)
├── static/            # Web UI (HTML/CSS/JS)
├── Dockerfile
├── docker-compose.yml
└── Makefile

⚠️ Security Notes

LLM-generated normalization code is not executed by default. Use --allow-unsafe-code only with trusted LLM providers.
Never scan drives you don't own. This tool is for your own data recovery and organization.
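To make the opt-in behaviour concrete, the sketch below shows one way a flag like `--allow-unsafe-code` could gate generated code. It assumes the CLI is built with `argparse`, which this README does not confirm, and the `build_parser` and `should_run_generated_code` helpers are hypothetical.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Minimal stand-in for the CLI: a query plus the opt-in safety flag."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("query", help="plain-English description of the data to find")
    parser.add_argument(
        "--allow-unsafe-code",
        action="store_true",  # absent means False, so the safe path is the default
        help="permit execution of LLM-generated normalization code",
    )
    return parser


def should_run_generated_code(
    args: argparse.Namespace, generated_code: Optional[str]
) -> bool:
    """Generated code runs only if it exists and the user explicitly opted in."""
    return bool(generated_code) and args.allow_unsafe_code


safe_args = build_parser().parse_args(["find gmail accounts"])
unsafe_args = build_parser().parse_args(["find gmail accounts", "--allow-unsafe-code"])
```

Using `store_true` means forgetting the flag can never enable execution; the user has to type it on every run.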

📋 Requirements

Python 3.11+
For LLM queries: either Ollama running locally, or an OpenAI API key
For local Ollama: ollama run llama3 (free, private, no internet needed once the model has been downloaded)
