ShinichiShi/Vestigo

 
 

Vestigo: Firmware analysis & crypto-detection pipeline

Vestigo is a collection of tools, scripts, and services that automate four steps: (1) producing cross-compiled test binaries, (2) statically and dynamically analyzing firmware and binaries, (3) extracting ML-ready features, and (4) producing datasets and inference results for cryptographic-function detection. The repo combines headless Ghidra-based extraction, Qiling-based dynamic tracing, a dataset generation pipeline (including optional LLM-assisted labeling), and a small backend and frontend for web access.

This README gives a concise, practical overview and quickstart so you can get the pipeline running and contribute.

Key project goals

  • Extract function-level and trace-level features suitable for ML
  • Provide utilities for static (Ghidra) and dynamic (Qiling) analysis
  • Offer scripts to build training CSVs and run inference
  • Provide a backend API and frontend for file upload and analysis

Quick facts / highlights

  • Languages: Python (main tooling & backend), TypeScript/React frontend
  • Major folders: ghidra_scripts, qiling_analysis, ml, backend, frontend
  • Important entry points:
    • scripts/generate_dataset.py - create ML CSVs from Ghidra JSONs
    • scripts/analyzer.py, bare_metal.py, main.py - orchestrate analysis flows
    • factory/builder.py - cross-compile sources across arch/opt matrix
    • qiling_analysis/ - dynamic tracing & batch extraction pipeline
    • backend/ - FastAPI backend with analysis endpoints

Quick Setup

1. Automated Installation

./setup.sh
source activate_vestigo.sh

What gets installed:

  • Python environment with all dependencies (FastAPI, Qiling, ML libraries)
  • Ghidra headless analyzer (/opt/ghidra)
  • Qiling framework + rootfs
  • Cross-compiler toolchains (ARM, MIPS, AArch64)
  • Container runtime (Podman/Docker)

setup.sh options: --minimal | --skip-ghidra | --skip-ml | --help

2. Manual Steps Required

Frontend (Node.js 18+):

# Install Node.js for your OS, then:
cd frontend && npm install && cd ..

Database (PostgreSQL):

# Option A: Local
sudo apt install postgresql && sudo -u postgres createdb vestigo

# Option B: Cloud (https://neon.tech - recommended)
# Get connection string and add to .env

Configure .env:

DATABASE_URL=postgresql://user:pass@host:5432/vestigo
# Get the key from platform.openai.com
OPENAI_API_KEY=sk-your-key-here
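
A quick way to catch a malformed .env before starting the backend is a small check like the one below. It is a minimal sketch, not part of the repo: parse_env and check_env are hypothetical helper names, and the only assumptions are the two keys shown above and a postgresql:// style URL.

```python
from urllib.parse import urlsplit

# Keys taken from the .env example above; the helpers are illustrative only.
REQUIRED_KEYS = ("DATABASE_URL", "OPENAI_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" so it never leaks into the value.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def check_env(values):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = [f"{key} is missing" for key in REQUIRED_KEYS if not values.get(key)]
    url = values.get("DATABASE_URL", "")
    if url:
        parts = urlsplit(url)
        if parts.scheme not in ("postgresql", "postgres"):
            problems.append("DATABASE_URL should start with postgresql://")
        if not parts.path.strip("/"):
            problems.append("DATABASE_URL has no database name")
    return problems


sample = """DATABASE_URL=postgresql://user:pass@host:5432/vestigo
# Get the key from platform.openai.com
OPENAI_API_KEY=sk-your-key-here
"""
print(check_env(parse_env(sample)) or "ok")
```

The check only validates the shape of the values; it does not connect to the database or call the OpenAI API.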

Initialize Database:

cd backend && prisma db push && prisma generate && cd ..

Usage

Always activate the environment first: source activate_vestigo.sh

Static Analysis (Ghidra)

python3 scripts/analyzer.py <binary>
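
For orientation, headless Ghidra runs are normally started through the analyzeHeadless launcher under $GHIDRA_HOME/support. The sketch below only builds such a command line. Whether scripts/analyzer.py invokes Ghidra this way is an assumption, and the project name vestigo_tmp and the post-script name extract_features.py are hypothetical placeholders.

```python
import os
from pathlib import Path


def headless_command(binary, project_dir, script, ghidra_home=None):
    """Build (but do not run) an analyzeHeadless command line."""
    ghidra_home = ghidra_home or os.environ.get("GHIDRA_HOME", "/opt/ghidra")
    launcher = Path(ghidra_home) / "support" / "analyzeHeadless"
    return [
        str(launcher),
        str(project_dir),          # where the temporary Ghidra project lives
        "vestigo_tmp",             # hypothetical project name
        "-import", str(binary),    # binary to import and auto-analyze
        "-scriptPath", "ghidra_scripts",
        "-postScript", script,     # runs after auto-analysis finishes
        "-deleteProject",          # discard the project once done
    ]


cmd = headless_command(
    "firmware.elf",
    "/tmp/ghidra_proj",
    "extract_features.py",         # hypothetical script name
    ghidra_home="/opt/ghidra",
)
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run would start the analysis; building the list separately keeps it easy to log and inspect.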

Dynamic Analysis (Qiling)

python3 qiling_analysis/tests/verify_crypto.py <binary>

Generate ML Dataset

python3 scripts/generate_dataset.py --input-dir ghidra_output --output dataset.csv
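
Conceptually, this step flattens per-function records from the Ghidra JSON output into one CSV row each. The sketch below shows that idea only. The JSON layout (a top-level "functions" list) and the field names xor_count and loop_count are assumptions made for illustration, not the schema that generate_dataset.py actually reads.

```python
import csv
import io
import json


def rows_from_ghidra_json(doc, source):
    """Yield one flat row per function record (assumed layout)."""
    for func in doc.get("functions", []):
        row = {"source": source}
        row.update(func)
        yield row


def write_csv(rows, out):
    """Write rows to a file-like object and return the column order."""
    rows = list(rows)
    # Union of keys in first-seen order, so records with extra fields still fit.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Hypothetical sample record; real Ghidra exports will contain other fields.
sample = json.loads(
    '{"functions": ['
    '{"name": "aes_encrypt", "xor_count": 34},'
    '{"name": "memcpy", "xor_count": 0, "loop_count": 1}]}'
)
buffer = io.StringIO()
columns = write_csv(rows_from_ghidra_json(sample, "demo.json"), buffer)
print(buffer.getvalue())
```

Missing fields are written as empty cells, which keeps the CSV rectangular when different binaries expose different features.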

Batch Processing

python3 qiling_analysis/batch_extract_features.py \
    --dataset-dir ./dataset_binaries --output-dir ./results --parallel 4
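
The --parallel flag suggests a worker pool that fans out over the files in the dataset directory. The following is a rough sketch of that pattern, not the script's actual implementation: extract_one is a stand-in that records only file size and SHA-256, and a real emulation workload would more likely use processes than threads.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
from pathlib import Path
import tempfile


def extract_one(path):
    """Stand-in for real feature extraction: size and SHA-256 of one file."""
    data = Path(path).read_bytes()
    return {
        "file": Path(path).name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def extract_batch(dataset_dir, parallel=4):
    """Run extract_one over every file in dataset_dir with a worker pool."""
    files = sorted(p for p in Path(dataset_dir).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # map() preserves input order even when workers finish out of order.
        return list(pool.map(extract_one, files))


with tempfile.TemporaryDirectory() as tmp:
    for name, payload in (("a.bin", b"\x00" * 16), ("b.bin", b"\x7fELF")):
        (Path(tmp) / name).write_bytes(payload)
    results = extract_batch(tmp, parallel=4)
    print(results)
```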

Cross-Compile Binaries

python3 factory/builder.py --source algorithm.c
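
An architecture/optimization matrix can be pictured as the Cartesian product of compilers and optimization flags. The sketch below generates the compile commands for such a matrix without running them. The compiler names are the usual Debian/Ubuntu cross-toolchain binaries, and the flag set, the -static option, and the output naming scheme are assumptions, not the behavior of factory/builder.py.

```python
from itertools import product

# Assumed toolchain names (Debian/Ubuntu cross-gcc packages).
COMPILERS = {
    "arm": "arm-linux-gnueabi-gcc",
    "mips": "mips-linux-gnu-gcc",
    "aarch64": "aarch64-linux-gnu-gcc",
}
# Assumed optimization levels for the matrix.
OPT_LEVELS = ("-O0", "-O1", "-O2", "-O3", "-Os")


def build_matrix(source, out_dir="build"):
    """Return one compile command (as an argv list) per arch/opt pair."""
    stem = source.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    jobs = []
    for (arch, cc), opt in product(COMPILERS.items(), OPT_LEVELS):
        output = f"{out_dir}/{stem}_{arch}_{opt.lstrip('-')}.elf"
        jobs.append([cc, opt, "-static", source, "-o", output])
    return jobs


jobs = build_matrix("algorithm.c")
for job in jobs[:3]:
    print(" ".join(job))
```

Encoding the architecture and optimization level in the output file name keeps the label recoverable later, when the binaries are turned into dataset rows.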

LLM Crypto Analysis

python3 qiling_analysis/tests/llm/crypto_deep_analyzer.py --strace trace.log --output analysis.json

Run Web Interface

# Backend (terminal 1)
cd backend && uvicorn main:app --reload

# Frontend (terminal 2)
cd frontend && npm run dev

Project Structure

vestigo-data/
├── setup.sh                 # Automated installation
├── activate_vestigo.sh      # Environment activation
├── backend/                 # FastAPI server
├── frontend/                # React UI
├── factory/                 # Cross-compilation tools
├── ghidra_scripts/          # Ghidra analysis scripts
├── qiling_analysis/         # Dynamic tracing pipeline
├── ml/                      # ML models and training
├── scripts/                 # Analysis orchestration
└── dataset_binaries/        # Sample binaries

Key Scripts:

  • scripts/analyzer.py - Ghidra static analysis
  • scripts/generate_dataset.py - Create ML datasets
  • qiling_analysis/tests/verify_crypto.py - Dynamic analysis
  • factory/builder.py - Cross-compilation

Troubleshooting

Issue                          | Solution
-------------------------------+----------------------------------------------------------------
Virtual environment not found  | Run ./setup.sh
Import errors                  | pip install -r requirements.txt -r backend/requirements.txt
Qiling rootfs missing          | git clone --depth 1 https://github.com/qilingframework/rootfs.git qiling_analysis/rootfs
Ghidra not found               | Run export GHIDRA_HOME=/opt/ghidra
Database errors                | Check DATABASE_URL in .env, then run prisma generate
OpenAI quota exceeded          | Check billing at platform.openai.com
Frontend won't start           | cd frontend && rm -rf node_modules && npm install
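
Several of these issues can be detected up front with a small preflight check. The sketch below is a hypothetical helper, not something shipped in the repo. It assumes the standard Ghidra layout (support/analyzeHeadless under GHIDRA_HOME) and the rootfs and .env locations mentioned in this README.

```python
import os
from pathlib import Path
import tempfile


def preflight(repo_root, env=None):
    """Return a list of setup problems found under repo_root."""
    env = os.environ if env is None else env
    root = Path(repo_root)
    problems = []

    ghidra_home = env.get("GHIDRA_HOME")
    if not ghidra_home:
        problems.append("GHIDRA_HOME is not set")
    elif not (Path(ghidra_home) / "support" / "analyzeHeadless").exists():
        problems.append("analyzeHeadless not found under GHIDRA_HOME")

    if not (root / "qiling_analysis" / "rootfs").is_dir():
        problems.append("Qiling rootfs missing")

    if not (root / ".env").is_file():
        problems.append(".env file missing")

    return problems


# Demo against a throwaway directory that has a rootfs but nothing else.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "qiling_analysis" / "rootfs").mkdir(parents=True)
    report = preflight(tmp, env={})
    print(report)
```

Run from the repository root as preflight("."), an empty list would mean none of the checked problems were found.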

System Requirements

  • OS: Ubuntu/Debian, Fedora/RHEL, Arch, macOS
  • RAM: 8GB min, 16GB recommended
  • Disk: ~10GB
  • Python: 3.9+ (3.11 recommended)
  • Node.js: 18+ (for frontend)

Documentation

  • qiling_analysis/QUICKSTART_GUIDE.md - Dynamic analysis guide
  • CONTRIBUTING.md - Contribution guidelines

License

Apache-2.0 - See LICENSE

About

An end-to-end pipeline for firmware analysis and cryptographic function detection. Automates cross-compilation (x86, ARM, RISC-V, etc.), static analysis (Ghidra), dynamic tracing (Qiling), and feature extraction to produce ML-ready datasets for binary security research.
