Astoriel/Account-Hunter-v2

🔍 Account Hunter v2 — LLM-Driven Smart Data Extractor

[Screenshots: Account Hunter - New Scan, Account Hunter - History]

Python 3.11+ | FastAPI | License: MIT

A local-first tool that lets you describe the data you're looking for in plain English, then automatically scans your files to extract it. It combines LLM-generated rules with regex matching, and ships with a web UI and a REST API.


✨ What Makes It Different

Tool                   Approach                                                   Drawback
grep / ripgrep         Fast regex search                                          You must write the regex
gitleaks / trufflehog  Secret detection                                           Fixed patterns only
LangChain Extraction   LLM extraction                                             Web/cloud-oriented
Account Hunter         Natural language → auto-generated rules → local file scan  ✅ None: just describe it

🚀 Features

  • Natural language queries — type "find email:password pairs" and it handles the rest
  • 8 built-in presets — run instantly without any LLM: emails, credentials, API keys, phone numbers, crypto wallets, IPs, URLs, SSH keys
  • Supports 10+ file types, including .txt, .csv, .xlsx, .json, .log, .yaml, .xml, .md, .env
  • Real-time progress via WebSocket — see files being scanned live
  • Ruleset preview — inspect and review the LLM-generated regex before scanning
  • Scan history — every scan is stored in SQLite, browse past results anytime
  • Multiple export formats — TXT, JSON, CSV
  • Safe by default: LLM-generated normalization code is not executed unless you pass --allow-unsafe-code
  • Docker support — one-command deployment
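
As a rough illustration of what a preset-style scan involves, the sketch below walks a directory and applies a single regex to text-like files. The `EMAIL_RE` pattern, the `TEXT_SUFFIXES` set, and the `scan_dir` function are hypothetical names chosen for this example; they are not the project's actual rules or API.

```python
import re
from pathlib import Path

# Hypothetical, simplified email pattern; the built-in preset may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Text-like types from the feature list (.xlsx would need a spreadsheet reader).
TEXT_SUFFIXES = {".txt", ".csv", ".json", ".log", ".yaml", ".xml", ".md", ".env"}


def scan_dir(root: str) -> list[tuple[str, int, str]]:
    """Return (file path, line number, match) for each email-like string under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # A bare ".env" file has an empty suffix, so fall back to its name.
        kind = path.suffix.lower() or path.name.lower()
        if not path.is_file() or kind not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in EMAIL_RE.finditer(line):
                hits.append((str(path), lineno, match.group(0)))
    return hits
```

Calling `scan_dir("some/folder")` returns one tuple per match, which is roughly the kind of record a scan history could store.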

🖥️ Web UI

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
python server.py
# → Web UI available at http://localhost:8000
# → API docs at http://localhost:8000/docs

Pages

  • / — Submit a query or pick a preset, watch live scan progress, view results
  • /static/history.html — All past scans with stats and export links
  • /static/scan.html?id=N — Full result detail for a specific scan

💻 CLI (also works without the server)

# Using Ollama (local, free)
python main.py "find all gmail accounts" --scan-dir C:\Downloads

# Using OpenAI
python main.py "find API keys and tokens" \
  --provider openai --model gpt-4o --api-key sk-...

# Presets need no LLM; they are available via the web UI / API

# Scan all drives
python main.py "find email:password pairs" --all-drives

# Allow LLM normalization code (disabled by default for safety)
python main.py "find gmail accounts and remove dot aliases" --allow-unsafe-code
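
The flags used in the commands above could be declared with `argparse` along the following lines. This is a minimal sketch, not the actual contents of main.py: the `build_parser` function, the help strings, and the choice to leave every optional value defaulting to `None` are assumptions. The one behavior taken from this README is that `--allow-unsafe-code` is off unless passed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("query", help="plain-English description of the data to find")
    parser.add_argument("--scan-dir", help="directory to scan")
    parser.add_argument("--all-drives", action="store_true", help="scan all drives")
    parser.add_argument("--provider", help="LLM provider, e.g. openai")
    parser.add_argument("--model", help="model name, e.g. gpt-4o")
    parser.add_argument("--api-key", help="API key for a hosted provider")
    # store_true means the flag is False unless the user passes it explicitly.
    parser.add_argument(
        "--allow-unsafe-code",
        action="store_true",
        help="allow LLM-generated normalization code to run",
    )
    return parser


args = build_parser().parse_args(["find all gmail accounts", "--scan-dir", "data"])
print(args.query, args.scan_dir, args.allow_unsafe_code)
```

Using `store_true` keeps the risky behavior opt-in: omitting the flag can never enable it by accident.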

🐳 Docker

# Set your scan directory and optional LLM key
cp .env.example .env
# Edit .env: set SCAN_DIR and OPENAI_API_KEY if using OpenAI

docker-compose up --build
# → http://localhost:8000

⚙️ Configuration

Variable        Description                           Default
OPENAI_API_KEY  OpenAI API key (for OpenAI provider)  (none)
SCAN_DIR        Directory to mount in Docker          .
AH_DB_PATH      SQLite database path                  hunter_data.db
AH_OUTPUT_DIR   Output directory                      output/
PORT            Server port                           8000

📡 REST API

Method  Endpoint                                Description
GET     /api/presets                            List built-in presets
POST    /api/ruleset/preview                    Preview LLM ruleset (no scan)
POST    /api/scan                               Start a scan
GET     /api/scan/{id}/status                   Scan status
GET     /api/scan/{id}/results                  Paginated results
GET     /api/scans                              All past scans
GET     /api/scan/{id}/export?fmt=txt|json|csv  Download results
WS      /ws/scan/{id}                           Real-time progress events
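
The read-only endpoints can be exercised from Python's standard library. The sketch below is a hedged example: `export_url` and `get_json` are hypothetical helper names, the base URL is the default from the Web UI section, and the assumption that GET responses are JSON is not confirmed by this README. Request bodies for the POST endpoints are not documented here, so they are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default address from the Web UI section
EXPORT_FORMATS = ("txt", "json", "csv")


def export_url(scan_id: int, fmt: str = "json", base_url: str = BASE_URL) -> str:
    """Build the download URL for a finished scan in one of the documented formats."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"fmt must be one of {EXPORT_FORMATS}, got {fmt!r}")
    return f"{base_url}/api/scan/{scan_id}/export?{urlencode({'fmt': fmt})}"


def get_json(path: str, base_url: str = BASE_URL):
    """GET an API path and decode the body, assuming the server answers with JSON."""
    with urlopen(f"{base_url}{path}", timeout=10) as resp:
        return json.load(resp)


# With the server running, these calls would list presets and past scans:
#   get_json("/api/presets")
#   get_json("/api/scans")
print(export_url(1, "csv"))
```

The interactive docs at http://localhost:8000/docs remain the place to check the exact request and response shapes.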

🧪 Testing

# Run the core pipeline integration test (no LLM required)
python test_pipeline.py

# Start dev server with hot-reload
make dev

# Run tests + linting
make test

🗂️ Project Structure

account_hunter/
├── main.py            # CLI entrypoint
├── server.py          # FastAPI web server
├── hmi.py             # LLM → Ruleset translation
├── scanner.py         # File scanner (10+ formats)
├── processor.py       # Normalization + deduplication
├── storage.py         # SQLite persistence + export
├── presets.py         # 8 built-in scan templates
├── scan_runner.py     # Background scan with WebSocket progress
├── models.py          # Dataclasses: Ruleset, MatchRecord
├── copier.py          # Copy matched source files
├── reporter.py        # Human-readable report generator
├── llm_provider.py    # OpenAI / Ollama abstraction
├── test_pipeline.py   # Integration test (no LLM required)
├── static/            # Web UI (HTML/CSS/JS)
├── Dockerfile
├── docker-compose.yml
└── Makefile
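
To make the normalization and deduplication step concrete, here is a small sketch of how matches might be collapsed, using the Gmail dot-alias case mentioned in the CLI examples (Gmail ignores dots in the local part of an address). The fields on `MatchRecord` and the `normalize_gmail` and `dedupe` functions are assumptions for illustration; the real definitions in models.py and processor.py may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchRecord:
    # Hypothetical fields; the real dataclass in models.py may differ.
    value: str
    source_file: str
    line: int


def normalize_gmail(address: str) -> str:
    """Lowercase an address and drop dots from the local part of gmail.com addresses."""
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # not an email address; leave it as-is
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def dedupe(records: list[MatchRecord]) -> list[MatchRecord]:
    """Keep the first record for each normalized value, preserving input order."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        key = normalize_gmail(rec.value)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

A fixed, reviewed function like this is the kind of normalization that does not need `--allow-unsafe-code`, since no LLM-generated code is executed.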

⚠️ Security Notes

  • LLM-generated normalization code is not executed by default. Use --allow-unsafe-code only with trusted LLM providers.
  • Never scan drives you don't own. This tool is for your own data recovery and organization.

📋 Requirements

  • Python 3.11+
  • For LLM queries: either Ollama running locally, or an OpenAI API key
  • For local Ollama: ollama run llama3 (free and private; no internet needed once the model is downloaded)
