🔍 NFTruth - AI-Powered NFT Authenticity Detector (python base)

Fighting NFT scams with machine learning, one collection at a time! 🚀

🎯 What is NFTruth?

NFTruth is an intelligent system that analyzes NFT collections to determine their legitimacy and detect potential scams. Using ensemble machine learning algorithms trained on multi-source data (OpenSea marketplace data, Reddit social sentiment, and Ethereum blockchain metrics), it provides comprehensive risk assessments for NFT collections.

🏗️ System Architecture

NFTruth/
├── 🎯 app/
│   ├── 📊 data/
│   │   ├── opensea_collector.py      # OpenSea API integration & data collection
│   │   ├── reddit_collector.py       # Reddit OAuth + sentiment analysis pipeline
│   │   ├── etherscan_collector.py    # Ethereum blockchain analysis (placeholder)
│   │   └── ml_data_transformer.py    # Feature engineering & ML data preparation
│   ├── 🤖 models/
│   │   ├── model.py                  # Ensemble ML model implementation
│   │   ├── model_notebook.ipynb      # Technical documentation & explanation
│   │   └── opensea_known_legit.py    # Curated legitimate collections database
│   ├── 📈 model_training.py          # Synthetic data generation & training pipeline
│   ├── 🔮 predict.py                 # Prediction interface & risk assessment
│   └── 📋 opensea_collections.py     # Collection slug mappings
├── 🏆 model_outputs/
│   └── rule_based_model.json         # Rule-based baseline model
├── 📚 training_data/                 # Generated training datasets
├── 🧪 tests/
│   ├── test_model_setup.py          # ML functionality validation
│   └── test_opensea.py              # API connection testing
├── 📋 requirements.txt              # Python dependencies
└── 📖 README.md                     # This documentation

🧠 How The System Works

📊 Multi-Source Data Collection Pipeline

🏪 OpenSea API Integration (`opensea_collector.py`)

# Comprehensive collection metrics extraction
- Collection verification status (safelist_status)
- Trading statistics (total_volume, floor_price, market_cap)
- Social presence (Discord, Twitter links)
- Ownership metrics (total_supply, num_owners)
- Price dynamics (average_price, price_changes)
- Collection features (trait_offers_enabled, collection_offers_enabled)

💬 Reddit Social Intelligence (`reddit_collector.py`)

# Advanced social sentiment analysis pipeline
- OAuth 2.0 authentication with Reddit API
- Multi-subreddit targeted data collection:
  * crypto_general: ['cryptocurrency', 'crypto', 'CryptoMarkets']
  * nft_specific: ['NFT', 'NFTs', 'opensea', 'NFTsMarketplace']
  * ethereum: ['ethereum', 'ethtrader', 'ethfinance']
  * trading_focused: ['wallstreetbets', 'CryptoMoonShots']
- VADER sentiment analysis integration
- Scam keyword detection: ['scam', 'rugpull', 'fake', 'fraud']
- Hype indicator tracking: ['moon', 'diamond hands', 'hodl', 'wagmi']

⛓️ Blockchain Analysis (`etherscan_collector.py`)

# Creator wallet and transaction analysis (framework ready)
- Wallet age and transaction history
- Suspicious pattern detection (wash trading)
- Creator balance and activity analysis
- Mint distribution pattern recognition

🔬 Advanced Feature Engineering (`ml_data_transformer.py`)

The MLDataTransformer class transforms raw data into 20+ meaningful ML features:

💰 Market Intelligence Features

# Liquidity and market efficiency indicators
volume_per_owner = total_volume / num_owners        # Liquidity quality
market_efficiency = market_cap / total_volume       # Market maturity  
price_premium = average_price / floor_price         # Pricing structure
avg_daily_volume = total_volume / collection_age_days

🗣️ Social Sentiment Scoring

# Community engagement metrics using VADER sentiment analysis
social_score = reddit_mentions + reddit_engagement
sentiment_analysis = SentimentIntensityAnalyzer().polarity_scores()
scam_keyword_density = scam_mentions / total_mentions
hype_indicator = hype_keywords_count / total_discussion_volume

⛓️ Blockchain Forensics

# Creator and trading pattern analysis
creator_wallet_age = days_since_first_transaction
wash_trading_score = circular_transactions / total_transactions
mint_distribution_score = distribution_uniformity_analysis
whale_concentration = 1 - (num_owners / total_supply)

🤖 Ensemble Machine Learning Architecture (`model.py`)

The NFTAuthenticityModel implements four specialized algorithms:

🌳 Random Forest Classifier

Strength: Handles complex feature interactions and mixed data types
Use Case: Captures non-linear relationships between market metrics
Implementation: RandomForestClassifier(n_estimators=100, random_state=42)

🚀 Gradient Boosting Classifier

Strength: Sequential learning builds strong patterns from weak signals
Use Case: Detects subtle scam indicators through iterative improvement
Implementation: GradientBoostingClassifier(random_state=42)

📈 Logistic Regression

Strength: Interpretable linear relationships with feature scaling
Use Case: Provides explainable risk factors and coefficients
Implementation: LogisticRegression(random_state=42) with StandardScaler

🎯 Support Vector Machine (SVM)

Strength: Finds optimal decision boundaries in high-dimensional space
Use Case: Separates legitimate from suspicious collections precisely
Implementation: SVC(probability=True, random_state=42) with scaling

🏷️ Intelligent Labeling System

Since NFT authenticity ground truth is rare, the system uses a sophisticated scoring methodology in create_synthetic_labels():

# Legitimacy scoring algorithm (from model.py)
def create_synthetic_labels(self, df):
    scores = []
    for _, row in df.iterrows():
        score = 0
        
        # Verification signals (+3 points)
        if row.get('is_verified', False): score += 3
        
        # Social presence (+1 point each)
        if row.get('has_discord', False): score += 1
        if row.get('has_twitter', False): score += 1
        
        # Market signals (+2 points each)  
        if row.get('floor_price', 0) > 0: score += 2
        if row.get('total_volume', 0) > 1000: score += 2
        
        # Community signals (+1 point each)
        if row.get('num_owners', 0) > 1000: score += 1
        if row.get('reddit_mentions', 0) > 0: score += 1
        
        # Whitelist verification (+3 points)
        collection_name = row.get('collection', '').lower()
        if any(legit in collection_name for legit in self.known_legitimate):
            score += 3
            
        scores.append(score)
    
    # Classification threshold: score >= 6 = legitimate
    return [1 if score >= 6 else 0 for score in scores]

📈 Synthetic Data Generation (`model_training.py`)

The generate_synthetic_nft_data() function creates realistic training data:

# Creates balanced dataset with realistic feature distributions
- 65% legitimate collections (default ratio)
- 35% suspicious collections
- Realistic correlations between features
- Saves timestamped JSON files in training_data/

🔍 Complete Feature Analysis

📊 Market Intelligence (9 features)

total_volume, floor_price, average_price, market_cap
volume_per_owner, market_efficiency, price_premium
avg_daily_volume, liquidity_indicator

🏷️ Collection Properties (8 features)

is_verified, safelist_status, has_discord, has_twitter
trait_offers_enabled, collection_offers_enabled
total_supply, num_owners

💬 Social Intelligence (6 features)

reddit_mentions, reddit_engagement, social_score
reddit_sentiment, scam_keyword_density, hype_indicator

⛓️ Blockchain Forensics (7 features)

creator_wallet_age_days, creator_transaction_count
wash_trading_score, suspicious_transaction_patterns
mint_distribution_score, whale_concentration
creator_balance_eth

🎯 Prediction Interface (`predict.py`)

The system provides comprehensive risk analysis through the prediction interface:

# Example prediction output
{
    "collection": "example-nft-collection",
    "prediction": "Legitimate" | "Suspicious", 
    "confidence": {
        "legitimate": 0.847,    # 84.7% confidence
        "suspicious": 0.153     # 15.3% risk
    },
    "risk_score": 0.153,        # Inverse of legitimacy probability
    "risk_level": "Low",        # Categorical assessment
    "model_used": "LogisticRegression",
    "features_analyzed": {...}, # All extracted features
    "timestamp": "2025-06-20T..."
}

⚠️ Risk Classification System

Risk Level	Score Range	Characteristics	Action Recommended
🟢 Low Risk	0-30%	Verified, high volume, strong community	✅ Relatively safe to proceed
🟡 Medium Risk	31-50%	Mixed signals, some concerns	⚠️ Proceed with caution
🟠 High Risk	51-70%	Multiple red flags detected	🚨 High caution advised
🔴 Very High Risk	71-100%	Strong scam indicators	❌ Avoid completely

🛠️ Technical Implementation Stack

Core Dependencies (`requirements.txt`)

# Machine Learning & Data Processing
numpy, pandas, scikit-learn, joblib

# API & Web Functionality  
requests, python-dotenv

# Natural Language Processing
nltk, vaderSentiment, textblob

# Data Visualization
matplotlib, seaborn

# Date Handling
python-dateutil, pytz

External APIs Required

🏪 OpenSea API - Collection marketplace data
💬 Reddit API - Social sentiment analysis (OAuth 2.0)
⛓️ Etherscan API - Ethereum blockchain data

📊 Model Performance Metrics

Based on synthetic training data evaluation:

Logistic Regression is by far the most optimal!

Model	Key Strengths	Use Case
🏆 Logistic Regression	Interpretable, fast, linear separability	Primary classifier for NFT authenticity
🌳 Random Forest	Feature importance, non-linear patterns	Complex interaction detection
🚀 Gradient Boosting	Sequential improvement, weak signal boosting	Subtle scam pattern recognition
🎯 SVM	Maximum margin, high-dimensional separation	Precise decision boundaries

🧪 Testing Suite

Model Validation (`test_model_setup.py`)

Model initialization and training pipeline validation
Feature engineering functionality testing
Prediction interface verification

API Integration (`test_opensea.py`)

OpenSea API connection and data extraction testing
Rate limiting and error handling validation

📚 Known Legitimate Collections (`opensea_known_legit.py`)

Curated database of verified legitimate NFT collections:

CryptoPunks, Bored Ape Yacht Club, Azuki
Doodles, CloneX, Meebits, World of Women
Used for ground truth labeling and validation

🔄 Data Pipeline Flow

Collection Input → Collection slug or name
Data Collection → Multi-source API calls (OpenSea, Reddit)
Feature Engineering → Transform raw data into ML features
Model Prediction → Ensemble voting across 4 algorithms
Risk Assessment → Confidence scores and risk categorization
Output Generation → Structured JSON response with analysis

🚀 Future Enhancement Roadmap

🔄 Real-time monitoring - Live collection tracking dashboard
🌐 Web interface - User-friendly analysis portal
🤝 Community reporting - Crowdsourced scam detection system

⚖️ Important Disclaimers

⚠️ Investment Warning: This tool provides risk assessments based on observable data patterns and should not be the sole factor in investment decisions. The NFT market is highly speculative and volatile.

🔬 Research Tool: NFTruth is designed as a research and educational tool to demonstrate machine learning applications in blockchain analysis.

📊 Data Limitations: Predictions are based on publicly available data and may not capture all risk factors or market dynamics.

🎓 Educational Purpose: This system demonstrates advanced ML techniques for blockchain analysis and should be used for learning and research purposes.

Always conduct your own research (DYOR) before making any financial decisions 🧠

Built with ❤️ to make the NFT space safer for everyone 🛡️

⭐ Star this repository if you found it helpful!

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.vercel		.vercel
__pycache__		__pycache__
api		api
app		app
k8s		k8s
model_outputs		model_outputs
tests		tests
training_data		training_data
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
KUBERNETES_DEPLOYMENT.md		KUBERNETES_DEPLOYMENT.md
README.md		README.md
__init__.py		__init__.py
docker-compose.yml		docker-compose.yml
docker-scripts.sh		docker-scripts.sh
docker_fix.md		docker_fix.md
requirements.txt		requirements.txt

RushilMahadevu/NFTruth

Folders and files

Latest commit

History

Repository files navigation

🔍 NFTruth - AI-Powered NFT Authenticity Detector (python base)

🎯 What is NFTruth?

🏗️ System Architecture

🧠 How The System Works

📊 Multi-Source Data Collection Pipeline

🏪 OpenSea API Integration (opensea_collector.py)

💬 Reddit Social Intelligence (reddit_collector.py)

⛓️ Blockchain Analysis (etherscan_collector.py)

🔬 Advanced Feature Engineering (ml_data_transformer.py)

💰 Market Intelligence Features

🗣️ Social Sentiment Scoring

⛓️ Blockchain Forensics

🤖 Ensemble Machine Learning Architecture (model.py)

🌳 Random Forest Classifier

🚀 Gradient Boosting Classifier

📈 Logistic Regression

🎯 Support Vector Machine (SVM)

🏷️ Intelligent Labeling System

📈 Synthetic Data Generation (model_training.py)

🔍 Complete Feature Analysis

📊 Market Intelligence (9 features)

🏷️ Collection Properties (8 features)

💬 Social Intelligence (6 features)

⛓️ Blockchain Forensics (7 features)

🎯 Prediction Interface (predict.py)

⚠️ Risk Classification System

🛠️ Technical Implementation Stack

Core Dependencies (requirements.txt)

External APIs Required

📊 Model Performance Metrics

🧪 Testing Suite

Model Validation (test_model_setup.py)

API Integration (test_opensea.py)

📚 Known Legitimate Collections (opensea_known_legit.py)

🔄 Data Pipeline Flow

🚀 Future Enhancement Roadmap

⚖️ Important Disclaimers

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages