GhostCVEs V2 is a complete architectural overhaul with a fresh database schema, a 6-stage processing pipeline, and machine learning capabilities. This guide covers migration from the original system (V1) to V2.
Key Decision: V2 uses a fresh database schema with NO backward compatibility. This is intentional: the old data has quality issues (40-60% false positives), and the new system requires different data structures.
- High False Positive Rate (40-60%)
  - No grace period for technical sync delays
  - No confidence scoring
  - No source reliability tracking
  - Many "ghosts" were just normal publication lag
- Missing Context
  - No disclosure classification
  - No root cause attribution
  - No CNA information
  - No multi-source validation
- Inconsistent Validation
  - Only used local NVD data (often stale)
  - No CVE.org API integration
  - No validation caching
  - No fallback chain
- No Learning System
  - Fixed confidence scores
  - No resolution tracking
  - No historical analysis
  - No pattern recognition
- Clean Baseline: Start with world-class detection from day one
- New Data Model: Supports 6-stage pipeline and learning system
- Better Performance: Optimized indexes and queries
- No Legacy Baggage: No need to support old structures
Best for: Most users, production deployments
Steps:
- Back up existing database
- Create fresh V2 database
- Initialize with default sources and CNAs
- Start hunting with new system
- Let learning system build reliability scores
Advantages:
- No data quality issues
- Immediate access to new features
- Clean audit trail from V2 start
Disadvantages:
- Lose historical ghost records
- No pre-trained reliability scores
Best for: Research, trend analysis
Steps:
- Keep old database as read-only archive
- Create separate V2 database
- Optionally analyze old data for insights
- Do NOT import old data into V2
Use Cases:
- Compare V1 vs V2 false positive rates
- Study historical ghost patterns
- Validate V2 improvements
Python Environment:
python --version # Must be 3.11+
Disk Space:
- 5GB for local CVE data
- 100MB for database
- 50MB for reports
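The prerequisites above can be checked programmatically before migrating. The following is a minimal sketch, not part of GhostCVEs itself: the helper names `python_ok` and `disk_ok` are hypothetical, and the byte total simply adds up the three figures listed above.

```python
import shutil
import sys

# Figures taken from the requirements listed above.
MIN_PYTHON = (3, 11)
REQUIRED_BYTES = 5 * 1024**3 + 100 * 1024**2 + 50 * 1024**2  # CVE data + database + reports


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) interpreter version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def disk_ok(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print("Python 3.11+:", "OK" if python_ok() else "TOO OLD")
    print("Disk space:", "OK" if disk_ok() else "INSUFFICIENT")
```

Run it from the directory that will hold the database and local CVE data, since free space is measured on that filesystem.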
Dependencies:
pip install -r requirements.txt
# Backup current database
cp ghost_log.db ghost_log_v1_backup_$(date +%Y%m%d_%H%M%S).db
# Verify backup
ls -lh ghost_log_v1_backup_*.db
Important: Keep this backup! You may want to analyze old data later.
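If a hunt might be writing to the database while you copy it, a plain `cp` can capture a half-written file. As an alternative, here is a hedged sketch that uses Python's standard `sqlite3` backup API and then runs SQLite's `PRAGMA integrity_check` on the copy. The function name `backup_database` is hypothetical; the file naming mirrors the `cp` command above.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(source="ghost_log.db", dest_dir="."):
    """Copy `source` with SQLite's online backup API and verify the copy."""
    if not Path(source).is_file():
        raise FileNotFoundError(source)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = Path(dest_dir) / f"ghost_log_v1_backup_{stamp}.db"

    src_conn = sqlite3.connect(source)
    dst_conn = sqlite3.connect(dest)
    try:
        src_conn.backup(dst_conn)  # consistent snapshot of the source
        status = dst_conn.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        src_conn.close()
        dst_conn.close()

    if status != "ok":
        raise RuntimeError(f"Backup failed integrity check: {status}")
    return dest
```

The function returns the path of the verified copy, so a wrapper script can print it or pass it to later analysis.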
# Run automated migration
python scripts/migrate_to_v2.py
# Expected output:
# ========================================
# GhostCVEs Database Migration to V2
# ========================================
#
# Backing up existing database...
# ✓ Backup created: ghost_log_backup_20260310_120000.db
#
# Creating fresh V2 schema...
# ✓ Schema created with 6 tables
#
# Initializing default data...
# ✓ Initialized 21 discovery sources
# ✓ Initialized 6 CNAs
#
# Migration complete!
#
# Next steps:
# 1. Review configuration in src/config.py
# 2. Set environment variables (GITHUB_TOKEN, CVE_ORG_API_KEY)
# 3. Run first hunt: python main.py --hunt
# Check database structure
sqlite3 ghost_log.db ".schema"
# Should show 6 tables:
# - cves
# - discovery_sources
# - source_reliability
# - cna_registry
# - resolution_history
# - validation_cache
# Check default data
sqlite3 ghost_log.db "SELECT COUNT(*) FROM source_reliability;"
# Should return: 21
sqlite3 ghost_log.db "SELECT COUNT(*) FROM cna_registry;"
# Should return: 6
# Set optional API keys
export GITHUB_TOKEN="ghp_your_token_here"
export CVE_ORG_API_KEY="your_cve_org_api_key"
# Verify configuration
python -c "import os; print('GitHub:', 'SET' if os.getenv('GITHUB_TOKEN') else 'NOT SET'); print('CVE.org:', 'SET' if os.getenv('CVE_ORG_API_KEY') else 'NOT SET')"
# Execute first V2 hunt
python main.py --hunt
# Monitor output
# Should see:
# - 21 discovery sources initialized
# - CVE discoveries from multiple sources
# - Disclosure classification
# - Multi-source validation
# - Ghost analysis with confidence scores
# - Root cause detection
# Check database
sqlite3 ghost_log.db "SELECT COUNT(*) FROM cves;"
sqlite3 ghost_log.db "SELECT COUNT(*) FROM cves WHERE is_ghost = 1;"
sqlite3 ghost_log.db "SELECT cve_id, confidence_score, root_cause FROM cves WHERE is_ghost = 1 ORDER BY confidence_score DESC LIMIT 10;"
# Generate reports
python main.py --report --format all
# View dashboard
python main.py --dashboard
File: src/config.py
Check These Settings:
# Discovery configuration
class DiscoveryConfig:
    MAX_WORKERS = 5  # Adjust based on system resources
    REQUEST_TIMEOUT = 10  # Seconds
# Validation configuration
class ValidationConfig:
    CACHE_TTL_HOURS = 1  # Adjust based on hunt frequency
    GRACE_PERIOD_HOURS = 6  # Don't change without good reason
# Confidence thresholds
class ConfidenceConfig:
    GHOST_THRESHOLD = 0.60  # Minimum confidence for ghost detection
    FAKE_CVE_THRESHOLD = 0.40  # Maximum reliability for fake sources
Time Required: 1-2 weeks
What Happens:
- Sources get initial reliability scores (default 0.75)
- Resolutions are detected and recorded
- Reliability scores adjust based on accuracy
- Fast sources (<3 days) get bonus points
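To make the learning steps above concrete, here is one way such an update could work. This is an illustrative sketch only: the guide does not specify the formula V2 uses. The learning rate, the bonus size, and the reading of "fast" as "resolved in under 3 days" are all assumptions; only the 0.75 default and the 3-day figure come from the list above.

```python
DEFAULT_RELIABILITY = 0.75  # from the guide
FAST_RESOLUTION_DAYS = 3    # from the guide
LEARNING_RATE = 0.1         # assumption: how strongly one outcome moves the score
FAST_BONUS = 0.02           # assumption: size of the bonus for fast sources


def update_reliability(score, was_true_ghost, resolution_days):
    """Nudge `score` toward 1.0 after a true ghost and toward 0.0 after a false positive."""
    target = 1.0 if was_true_ghost else 0.0
    score += LEARNING_RATE * (target - score)
    # Assumed bonus rule: only accurate, fast findings are rewarded.
    if was_true_ghost and resolution_days < FAST_RESOLUTION_DAYS:
        score += FAST_BONUS
    return min(1.0, max(0.0, score))  # keep the score inside [0, 1]
```

With these assumed constants, a new source at 0.75 moves to 0.775 after one confirmed ghost and to 0.675 after one false positive, which illustrates why scores need a week or two of resolutions before they separate good sources from poor ones.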
Monitor Progress:
# Check resolution history
sqlite3 ghost_log.db "SELECT COUNT(*) FROM resolution_history;"
# View source reliability
sqlite3 ghost_log.db "SELECT source_name, reliability_score, total_discoveries, true_positives, false_positives FROM source_reliability ORDER BY reliability_score DESC;"
# Check false positive rate
sqlite3 ghost_log.db "SELECT CAST(SUM(CASE WHEN was_true_ghost = 0 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100 AS false_positive_rate FROM resolution_history;"
GitHub Actions (if using GitHub repo):
- Already configured in .github/workflows/hunt.yml
- Runs every 6 hours automatically
- Commits updated database and reports
Cron Job (for local deployment):
# Add to crontab
crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/GhostCVEs && /path/to/venv/bin/python main.py --hunt --check-resolutions >> /var/log/ghostcves.log 2>&1
Key Metrics:
- Ghost detection rate (should be 5-15 per hunt initially)
- False positive rate (target: <10%)
- Average confidence (target: >0.70)
- Resolution time (lower is better)
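Two of these metrics can be read straight from the database. The sketch below assumes the table and column names used in the queries elsewhere in this guide (`cves.is_ghost`, `cves.confidence_score`, `resolution_history.was_true_ghost`); the function names are hypothetical. Note that the ghost count here is a running total, not a per-hunt figure.

```python
import sqlite3

FP_RATE_TARGET = 10.0     # percent, from the guide
CONFIDENCE_TARGET = 0.70  # from the guide


def key_metrics(conn):
    """Return total ghosts, average ghost confidence, and false positive rate (%)."""
    ghosts, avg_conf = conn.execute(
        "SELECT COUNT(*), AVG(confidence_score) FROM cves WHERE is_ghost = 1"
    ).fetchone()
    (fp_rate,) = conn.execute(
        "SELECT AVG(CASE WHEN was_true_ghost = 0 THEN 1.0 ELSE 0.0 END) * 100 "
        "FROM resolution_history"
    ).fetchone()
    return {
        "ghosts": ghosts,
        "avg_confidence": avg_conf,      # None until a ghost is recorded
        "false_positive_rate": fp_rate,  # None until a resolution is recorded
    }


def meets_targets(metrics):
    """True only when both targets are met; missing data counts as not met."""
    fp = metrics["false_positive_rate"]
    conf = metrics["avg_confidence"]
    if fp is None or conf is None:
        return False
    return fp < FP_RATE_TARGET and conf > CONFIDENCE_TARGET
```

Call it as `key_metrics(sqlite3.connect("ghost_log.db"))`. During the first days the rate will be `None` because `resolution_history` is still empty.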
Dashboard Command:
python main.py --dashboard
If you want to compare V1 ghosts with V2 validation:
# analyze_v1_ghosts.py
import sqlite3
# Connect to old database
old_db = sqlite3.connect("ghost_log_v1_backup_XXXXXXXX.db")
old_cursor = old_db.cursor()
# Connect to new database
new_db = sqlite3.connect("ghost_log.db")
new_cursor = new_db.cursor()
# Get V1 ghosts
old_ghosts = old_cursor.execute(
    "SELECT cve_id FROM ghost_cves WHERE is_ghost = 1"
).fetchall()
print(f"V1 had {len(old_ghosts)} ghosts")
# Check how many are still ghosts in V2
still_ghosts = 0
resolved = 0
not_found = 0
for (cve_id,) in old_ghosts:
    result = new_cursor.execute(
        "SELECT is_ghost, registry_status FROM cves WHERE cve_id = ?",
        (cve_id,)
    ).fetchone()
    if result:
        if result[0]:  # is_ghost
            still_ghosts += 1
        else:
            resolved += 1
    else:
        not_found += 1
print(f"Still ghosts in V2: {still_ghosts}")
print(f"Resolved (published): {resolved}")
print(f"Not in V2 yet: {not_found}")
# Guard against division by zero when V1 recorded no ghosts
if old_ghosts:
    print(f"False positive rate: {resolved / len(old_ghosts) * 100:.1f}%")
old_db.close()
new_db.close()
Typical V1 vs V2 Comparison:
- V1 ghosts: 80-100
- Still ghosts in V2: 5-10 (true ghosts)
- Resolved: 60-80 (were false positives)
- Not in V2: 10-15 (low confidence sources)
This demonstrates:
- V1 false positive rate: 60-80%
- V2 false positive rate: <10% (after learning)
Error: sqlite3.OperationalError: table cves already exists
Cause: Database already has V2 schema
Solution:
# Option 1: Use existing V2 database
# No action needed
# Option 2: Force fresh migration
rm ghost_log.db
python scripts/migrate_to_v2.py
Error: ModuleNotFoundError: No module named 'httpx'
Solution:
pip install -r requirements.txt --upgrade
Symptom: Many validation failures, slow hunts
Solution:
# Set API keys for higher limits
export CVE_ORG_API_KEY="your_key"
export GITHUB_TOKEN="ghp_your_token"
# Or adjust cache TTL
# Edit src/config.py:
# CACHE_TTL_HOURS = 2 # Longer cache = fewer API calls
Symptom: First hunt finds 0 ghosts
Expected: This is normal initially!
Reasons:
- 6-hour grace period filters out recent CVEs
- 60% confidence threshold is strict
- Multi-source validation is thorough
- Most CVEs are properly published
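The reasons above amount to three filters applied in sequence. The sketch below shows how they can combine to yield zero ghosts on a first run. It is an assumption about the logic, not the project's actual code: the function and parameter names are hypothetical, and only the two constants come from `src/config.py` as shown in this guide.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD_HOURS = 6  # from src/config.py in the guide
GHOST_THRESHOLD = 0.60  # from src/config.py in the guide


def is_reportable_ghost(first_seen, confidence, published, now=None):
    """Apply the filters in order: published, grace period, confidence."""
    now = now or datetime.now(timezone.utc)
    if published:
        return False  # properly published CVEs are never ghosts
    if now - first_seen < timedelta(hours=GRACE_PERIOD_HOURS):
        return False  # still inside the grace period for sync delays
    # Assumes the threshold is inclusive ("minimum confidence").
    return confidence >= GHOST_THRESHOLD
```

On a first hunt every candidate was first seen moments ago, so the grace-period check alone rejects all of them. That is why waiting and running again is the recommended next step.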
What to Do:
- Wait 24 hours and run again
- Check for CVEs in grace period: SELECT COUNT(*) FROM cves WHERE is_ghost = 0
- Review low-confidence findings: SELECT * FROM cves ORDER BY confidence_score DESC
Symptom: Python process uses >2GB RAM
Causes:
- Too many concurrent workers
- Large validation cache
- Many discovery sources
Solution:
# Edit src/config.py
class DiscoveryConfig:
    MAX_WORKERS = 3  # Reduce from 5
class ValidationConfig:
    CACHE_TTL_HOURS = 0.5  # Reduce from 1
If V2 migration causes issues and you need to roll back:
# Stop any running hunts
pkill -f "python main.py"
# Restore backup
cp ghost_log_v1_backup_XXXXXXXX.db ghost_log.db
# Checkout V1 code
git log --oneline # Find V1 commit hash
git checkout <v1-commit-hash>
# Check database structure
sqlite3 ghost_log.db ".schema" | head -20
# Should show V1 tables (ghost_cves, discovery_sources)
# Run V1 hunt
python main.py --hunt
If you had to roll back, please report the issue:
- GitHub: https://github.com/rogolabs/GhostCVEs/issues
- Include: Error messages, system info, database size
A: Not recommended. Old ghosts have quality issues and lack the context (disclosure classification, root cause, etc.) that V2 requires. Better to start fresh.
A: 1-2 weeks. The learning system needs time to:
- Collect resolution data (RESERVED → PUBLISHED transitions)
- Calculate source reliability scores
- Build historical patterns
Tip: Run --check-resolutions daily to speed up learning.
A: No. V2 will find fewer but higher quality ghosts. V1's high false positive rate means most "ghosts" weren't real issues.
A: Yes, but use different database files:
# V1 hunt
python main.py --hunt --database ghost_log_v1.db
# V2 hunt
python main.py --hunt --database ghost_log_v2.db
A: V2 automatically falls back to:
- Local CVElist V5 repo (~2GB, official data)
- Local NVD JSON database (~1.4GB)
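The fallback order above can be pictured as a chain of lookups tried in sequence. The following is a minimal sketch under stated assumptions: the primary lookup is taken to be the CVE.org API, the function name `validate_with_fallback` is hypothetical, and the in-memory dicts merely stand in for the local datasets.

```python
from typing import Callable, Optional, Sequence, Tuple

Lookup = Callable[[str], Optional[dict]]


def validate_with_fallback(cve_id: str, chain: Sequence[Tuple[str, Lookup]]):
    """Try each (name, lookup) pair in order; return the first hit and its source name."""
    for name, lookup in chain:
        try:
            record = lookup(cve_id)
        except Exception:
            continue  # treat an unavailable source as a miss and move on
        if record is not None:
            return name, record
    return None, None


# Usage with stub lookups (hypothetical data, not real registry records).
def api_unavailable(cve_id):
    raise ConnectionError("CVE.org API unreachable")


cvelist_stub = {"CVE-2024-0001": {"state": "PUBLISHED"}}
nvd_stub = {"CVE-2024-0002": {"state": "PUBLISHED"}}
example_chain = [
    ("CVE.org API", api_unavailable),
    ("local CVElist V5", cvelist_stub.get),
    ("local NVD JSON", nvd_stub.get),
]
print(validate_with_fallback("CVE-2024-0001", example_chain))
```

Returning the source name alongside the record lets the caller log which tier answered, which is useful when diagnosing slow or failing hunts.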
A: Automatic! V2 fetches fresh data on first run and caches it. To force update:
# Clear cache
rm -rf ~/.cache/ghostcves/
# Next hunt will re-download
python main.py --hunt
A: Yes, edit src/config.py:
class ConfidenceConfig:
    GHOST_THRESHOLD = 0.60  # Adjust 0.50-0.80
    # Lower = more ghosts (higher false positive rate)
    # Higher = fewer ghosts (might miss some)
A: See ARCHITECTURE.md for details. Basic steps:
- Create new module in src/discovery/
- Inherit from BaseDiscoveryModule
- Implement discover() method
- Add to src/config.py with confidence score
- Initialize in database: INSERT INTO source_reliability ...
- ARCHITECTURE.md: Complete system design
- README.md: User guide and features
- GitHub Issues: Bug reports and questions
Need Help?
- GitHub Issues: https://github.com/rogolabs/GhostCVEs/issues
- Email: support@rogolabs.net
Last Updated: 2026-03-10
Migration Version: 1.0 → 2.0