This is a machine learning pipeline for detecting fraudulent blockchain transactions. It classifies transactions into three risk tiers (low, moderate, and high) using behavioral and transactional features, with built-in handling for severe class imbalance via SMOTE oversampling.
Online fraud in blockchain ecosystems costs billions annually. This project applies supervised classification to 78,600 blockchain transactions, comparing seven ML algorithms including gradient boosting (XGBoost, LightGBM, CatBoost), neural networks (MLP), and a stacking ensemble to identify the most reliable fraud indicators. Key challenges include severe class imbalance (80.8% low-risk vs 8.3% high-risk) and the need for interpretable risk signals.
Exploratory analysis and initial model runs showed a suspicious 100% test accuracy across KNN, Logistic Regression, and Random Forest. A systematic investigation revealed two features that leak the target label:
- `risk_score` — A pre-computed risk metric with completely non-overlapping ranges per class (low_risk: 15–59, moderate_risk: 62–84, high_risk: 90–100). This single feature achieves 100% accuracy on its own.
- `transaction_type` — Contains values `"scam"` and `"phishing"` that map exclusively to `high_risk`, and `"purchase"`/`"transfer"` that map exclusively to `low_risk`.
Both features were removed from the pipeline. The investigation also confirmed no issues with the train/test split (stratified, zero index overlap), no meaningful duplicate rows, and the label-shuffle sanity check ruled out pipeline bugs.
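The kind of single-feature probe that exposes this leakage can be sketched as follows. The data here is synthetic, generated to mimic the non-overlapping `risk_score` ranges reported above; this is a minimal illustration, not the repo's actual investigation code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the leaky column: each class draws risk_score
# from a completely non-overlapping range, as observed in the raw data.
ranges = {"low_risk": (15, 59), "moderate_risk": (62, 84), "high_risk": (90, 100)}
X, y = [], []
for label, (lo, hi) in ranges.items():
    X.append(rng.uniform(lo, hi, size=500))
    y.extend([label] * 500)
X = np.concatenate(X).reshape(-1, 1)
y = np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A shallow tree on this single feature separates the classes perfectly --
# the signature of target leakage rather than genuine predictive power.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # 1.0
```

Any feature that lets a depth-2 tree hit 100% held-out accuracy on its own deserves this kind of scrutiny before it reaches a model.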
With leaking features removed, models produce realistic scores on the held-out test set (15% of data, stratified):
| Model | Test Accuracy | F1 (macro) | ROC-AUC (macro) | CV Accuracy (5-fold) |
|---|---|---|---|---|
| CatBoost | 85.3% | 0.608 | 0.911 | 90.8% +/- 0.061 |
| LightGBM | 85.1% | 0.601 | 0.910 | 92.0% +/- 0.064 |
| XGBoost | 82.6% | 0.640 | 0.907 | 91.1% +/- 0.044 |
| Stacking Ensemble | 81.7% | 0.646 | 0.905 | N/A (ensemble) |
| Random Forest | 80.3% | 0.649 | 0.904 | 92.5% +/- 0.017 |
| MLP (Neural Net) | 72.8% | 0.654 | 0.903 | 88.4% +/- 0.004 |
| Logistic Regression | 63.1% | 0.574 | 0.886 | 82.3% +/- 0.001 |
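As a rough illustration of how a stacking ensemble like the one in the table is wired, here is a scikit-learn-only sketch on synthetic data. The repo's ensemble stacks RF + XGBoost + LightGBM + MLP; this sketch substitutes two sklearn base learners and a logistic-regression meta-learner to stay dependency-light, so it is a simplified stand-in rather than the project's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy 3-class, imbalanced dataset standing in for the transaction features.
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           n_classes=3, weights=[0.8, 0.11, 0.09],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Base learners produce out-of-fold predictions (cv=5); the meta-learner
# is trained on those predictions rather than the raw features.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"test accuracy: {stack.score(X_te, y_te):.3f}")
```

The out-of-fold scheme matters: fitting the meta-learner on in-sample base predictions would overstate their reliability, which is the stacking analogue of the leakage issue above.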
Permutation importance (accuracy-based, 30 repeats) identifies the features that matter most once the leaking columns are removed. The ranking is consistent across all seven models:
| Feature | Importance (CatBoost) | Role |
|---|---|---|
| `hour_of_day` | 0.079 | Strongest signal - certain hours carry significantly higher fraud risk, suggesting time-based behavioral patterns |
| `amount` | 0.055 | Larger or unusually sized transactions are more indicative of moderate- and high-risk activity |
| `age_group` | 0.035 | Account maturity matters - newer accounts show different risk profiles than established or veteran ones |
| `session_duration` | 0.008 | Weak individual signal, though it contributes in ensemble context |
| `purchase_pattern` | 0.004 | Minimal standalone impact |
| `login_frequency` | −0.001 | Negligible - permuting this feature does not degrade accuracy |
| `location_region` | −0.001 | Negligible - geographic region alone is not predictive |
Key takeaway: After removing the leaking features, no single remaining
feature dominates the way risk_score did. The top three features
(hour_of_day, amount, age_group) combine behavioral timing, transaction
size, and account maturity — a reasonable fraud signal that aligns with
domain knowledge. The relatively modest importance scores (all < 0.08)
confirm that the models are learning a genuine multi-feature pattern rather
than relying on a single shortcut.
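The protocol behind the table (accuracy-based permutation importance, 30 repeats) maps directly onto scikit-learn's `permutation_importance`. The sketch below runs it on a synthetic dataset, so the feature names are purely illustrative labels, not the real columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative names attached to synthetic columns.
feature_names = ["hour_of_day", "amount", "age_group", "session_duration",
                 "purchase_pattern", "login_frequency", "location_region"]
X, y = make_classification(n_samples=3000, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Mean accuracy drop when each column is shuffled, averaged over 30 repeats --
# the same protocol as the table above. Near-zero or negative values mean
# the feature carries no signal the model relies on.
result = permutation_importance(model, X_te, y_te, scoring="accuracy",
                                n_repeats=30, random_state=0)
for name, mean in sorted(zip(feature_names, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name:>18}: {mean:+.3f}")
```

Computing importance on the held-out set, as here, measures what the model actually uses at prediction time rather than what it memorized during training.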
```
fraud_data.csv
      │
      ▼
┌──────────┐    ┌──────────────┐    ┌────────────┐
│  Loader  │───▶│ Preprocessor │───▶│  Splitter  │
│ validate │    │ encode/scale │    │ stratified │
└──────────┘    └──────────────┘    │  + SMOTE   │
                                    └─────┬──────┘
                       ┌──────────────────┼──────────────┐
                       ▼                  ▼              ▼
                   train set           val set        test set
                       │                  │              │
                       ▼                  │              │
                ┌─────────────┐           │              │
                │ Model Train │◀──────────┘              │
                │ XGB / LGBM /│  hyperparam              │
                │ CB / RF / … │  selection               │
                └──────┬──────┘                          │
                       │                                 │
                       ▼                                 ▼
                ┌─────────────┐                  ┌──────────────┐
                │  Serialize  │                  │   Evaluate   │
                │  (joblib)   │                  │ metrics/plots│
                └─────────────┘                  └──────────────┘
```
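The Splitter stage can be sketched with scikit-learn alone. The 70/15/15 proportions here are an assumption (only the 15% test share is stated above), and SMOTE, which comes from imbalanced-learn, appears only as a comment because it must touch the training partition exclusively:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic labels with roughly the dataset's class proportions.
y = rng.choice(["low_risk", "moderate_risk", "high_risk"],
               size=10000, p=[0.808, 0.109, 0.083])
X = rng.normal(size=(10000, 7))

# Assumed 70/15/15 split, stratified twice so every partition keeps the
# original class ratios.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=42)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    frac = (part == "high_risk").mean()
    print(f"{name}: {len(part)} rows, {frac:.1%} high_risk")

# SMOTE (imbalanced-learn) is then fit on the *training* partition only,
# e.g. X_train, y_train = SMOTE().fit_resample(X_train, y_train),
# so synthetic minority samples never reach the validation or test sets.
```

Stratifying both splits is what keeps the 8.3% high_risk share intact in each partition; an unstratified split could leave the smallest class badly underrepresented in a 15% slice.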
78,600 blockchain transactions with 14 features:
| Feature | Type | Description |
|---|---|---|
| `location_region` | Categorical | Europe, Asia, N. America, S. America, Africa |
| `purchase_pattern` | Categorical | focused, random, high_value |
| `age_group` | Categorical | new, established, veteran |
| `hour_of_day` | Numerical | 0–23 |
| `amount` | Numerical | Transaction amount |
| `login_frequency` | Numerical | Login count per session |
| `session_duration` | Numerical | Minutes per session |
| `anomaly` | Target | low_risk (80.8%), moderate_risk (10.9%), high_risk (8.3%) |
`risk_score` and `transaction_type` are present in the raw data but dropped during preprocessing due to target leakage (see above).
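Dropping the leaky columns is a one-liner in pandas; the tiny frame below is a hypothetical stand-in for the real CSV, used only to show the shape of the step:

```python
import pandas as pd

# Tiny in-memory stand-in for data/raw/fraud_data.csv.
df = pd.DataFrame({
    "hour_of_day": [3, 14, 22],
    "amount": [120.0, 9800.0, 45.5],
    "transaction_type": ["purchase", "scam", "transfer"],  # leaks the label
    "risk_score": [31, 95, 44],                            # leaks the label
    "anomaly": ["low_risk", "high_risk", "low_risk"],
})

LEAKY = ["risk_score", "transaction_type"]

# Drop the leaking columns (and the target) before any model sees the data;
# errors="ignore" keeps the step idempotent if a column is already absent.
X = df.drop(columns=LEAKY + ["anomaly"], errors="ignore")
y = df["anomaly"]
print(list(X.columns))  # ['hour_of_day', 'amount']
```

Doing this once, early in the pipeline, guarantees every downstream model and every cross-validation fold sees the same leak-free feature set.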
Requirements: Python 3.9+

```bash
git clone https://github.com/aengusmartindonaire/blockchain-fraud-detection.git
cd blockchain-fraud-detection
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Train all seven models, or a single one:

```bash
python cli.py train --model all
python cli.py train --model xgboost
```

Evaluate a trained model on the held-out test set:

```bash
python cli.py evaluate --model xgboost
```

Score new transactions:

```bash
python cli.py predict --model catboost --input new_transactions.csv
python cli.py predict --model catboost --input new_transactions.csv --output predictions.csv
```

Generate figures:

```bash
python cli.py visualize --type all
python cli.py visualize --type distributions
python cli.py visualize --type evaluation --model xgboost
```

Run the test suite:

```bash
pytest tests/ -v
```

```
blockchain-fraud-detection/
├── cli.py                      # CLI entry point (train/evaluate/predict/visualize)
├── config/
│   └── default.yaml            # All hyperparameters and settings
├── data/
│   └── raw/fraud_data.csv      # Source dataset (78,600 records)
├── src/
│   ├── data/
│   │   ├── loader.py           # CSV loading and schema validation
│   │   ├── preprocessing.py    # Encoding, scaling, outlier handling
│   │   └── splitter.py         # Stratified split + SMOTE
│   ├── models/
│   │   ├── base.py             # Abstract model interface
│   │   ├── xgboost_model.py    # XGBoost
│   │   ├── lightgbm_model.py   # LightGBM
│   │   ├── catboost_model.py   # CatBoost
│   │   ├── random_forest.py    # Random Forest
│   │   ├── mlp.py              # Multi-Layer Perceptron
│   │   ├── logistic.py         # Logistic Regression
│   │   ├── stacking.py         # Stacking Ensemble (RF+XGB+LGBM+MLP)
│   │   └── registry.py         # Model lookup by name
│   ├── evaluation/
│   │   ├── metrics.py          # Accuracy, precision, recall, F1, ROC-AUC, cross-val
│   │   └── importance.py       # Permutation feature importance
│   ├── visualization/
│   │   ├── distributions.py    # Class distribution charts
│   │   ├── tuning_curves.py    # Hyperparameter tuning plots
│   │   ├── evaluation_plots.py # Confusion matrices, ROC curves, model comparison
│   │   └── feature_plots.py    # Feature importance bars, box plots
│   └── pipeline.py             # End-to-end orchestrator
├── tests/                      # 115 tests
├── notebooks/                  # Exploratory analysis
├── outputs/                    # Generated models, figures, reports
├── Makefile                    # Common commands
├── requirements.txt
└── pyproject.toml
```
- Python 3.9+
- scikit-learn — classifiers, preprocessing, metrics, stacking ensemble
- XGBoost — gradient boosting (XGBClassifier)
- LightGBM — gradient boosting (LGBMClassifier)
- CatBoost — gradient boosting with native categorical support
- imbalanced-learn — SMOTE oversampling
- pandas / numpy — data manipulation
- matplotlib / seaborn — visualization
- PyYAML — configuration
- joblib — model serialization
- Click — CLI framework
- pytest — testing
MIT