🇷🇺 Russian version | 🇬🇧 English version
Final project: Machine Learning Course (Central University)
Timeline: November — December 2024
Build a machine learning system to detect fraudulent e-commerce transactions based on user behavioral patterns, transaction data, and security parameters. Fraud detection is mission-critical for online business: missed fraud = direct losses, false positives = lost customers.
Source: E-Commerce Fraud Detection Dataset (Kaggle)
- Size: 299,695 transactions
- Features: 17 original features
- Target variable: `is_fraud` (binary classification)
- Class imbalance: 2.206% fraud (6,612 out of 299,695)
- Time period: 2024
User profile:
- `account_age_days` — account age
- `total_transactions_user` — number of transactions by the user
- `avg_amount_user` — user's average transaction amount
Transaction:
- `amount` — transaction amount
- `shipping_distance_km` — shipping distance
- `promo_used` — whether a promo code was used
- `merchant_category` — merchant category
- `channel` — transaction channel (web/app)
Security:
- `avs_match` — Address Verification System match
- `cvv_result` — CVV verification result
- `three_ds_flag` — 3D Secure flag
Geo:
- `country` — user country
- `bin_country` — card issuer country
Time:
- `transaction_time` — transaction timestamp
Project completed in 3 stages following university course roadmap:
Data insights:
- Fraud rate significantly higher for cross-border transactions (11.28% vs 1.43%)
- Top-2 fraud countries: TR (2.80%), RO (2.40%)
- Low fraud rate for transactions with all security checks (0.58% for AVS+CVV+3DS)
- Outliers: 11 extreme values in `amount` and `shipping_distance_km`
Baseline:
- Model: CatBoostClassifier
- Validation strategy: Stratified 80/20 split
- Metrics: ROC-AUC = 0.97784, PR-AUC = 0.85269
Anomaly detection:
3 approaches applied:
1. **Statistical methods (Z-score, IQR):**
   - `amount`: Z-score outliers (1.79% of data) → fraud rate 30.16% (14× base rate)
   - `shipping_distance_km`: Z-score outliers (3.61%) → fraud rate 17.27% (8× higher)
   - IQR outliers for `shipping_distance_km` → fraud rate 12.81%
2. **Extreme outlier removal:**
   - Removed points with `amount > 10000` and `shipping_distance_km > 10000`
   - Result: improved model stability
3. **ML methods for complex anomaly detection:**
   - Applied: Isolation Forest, LOF, One-Class SVM, Elliptic Envelope
   - Created `anomaly_count` (consensus anomaly counter)
   - Created `consensus_strong_anomaly` (points flagged by ≥2 methods)
   - 11,808 points identified as strong anomalies → fraud rate 23.76% (11× higher!)
   - Insight: Isolation Forest and LOF showed the best precision/recall trade-off
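The consensus scheme above can be sketched with scikit-learn's four detectors. This is a minimal illustration on synthetic 2-D data standing in for `amount` and `shipping_distance_km`; the contamination rate and cluster positions are illustrative, not the project's actual settings:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy 2-D stand-in for (amount, shipping_distance_km): normal traffic
# plus a small cluster of extreme transactions.
normal = rng.normal(loc=[100, 50], scale=[30, 20], size=(1000, 2))
extreme = rng.normal(loc=[900, 600], scale=[50, 50], size=(30, 2))
X = np.vstack([normal, extreme])

# Each detector votes: fit_predict returns -1 for anomalies, 1 for inliers.
detectors = [
    IsolationForest(contamination=0.03, random_state=0),
    LocalOutlierFactor(contamination=0.03),
    OneClassSVM(nu=0.03),
    EllipticEnvelope(contamination=0.03, random_state=0),
]
anomaly_count = np.zeros(len(X), dtype=int)   # consensus counter
for det in detectors:
    anomaly_count += (det.fit_predict(X) == -1).astype(int)

# Strong anomaly: flagged by at least 2 of the 4 methods.
consensus_strong_anomaly = anomaly_count >= 2
print("strong anomalies:", consensus_strong_anomaly.sum())
```

Requiring agreement between methods with different inductive biases (tree isolation, local density, kernel boundary, robust covariance) filters out points that only one detector considers unusual.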
Feature Engineering:
4 groups of new features created:
1. **Target Encoding (leak-free):**
   - `merchant_category_te`, `country_te`, `bin_country_te`
   - `channel` encoded via One-Hot Encoding
2. **User behavior features:**
   - `amount_zscore_user` — Z-score of amount relative to the user's average
   - `dist_zscore_user` — Z-score of shipping distance
   - `merchant_category` × `amount` cross-features
   - kNN-based: transaction density in feature space
3. **Temporal features (sin/cos encoding):**
   - `hour_sin`/`hour_cos`, `dow_sin`/`dow_cos`, `month_sin`/`month_cos`
   - `is_night`, `is_business_hours`, `is_evening`, `is_weekend`
4. **Domain-specific features:**
   - `is_cross_border` — country ≠ bin_country (strong fraud indicator)
   - `security_score` — weighted combination of AVS, CVV, and 3DS
   - `all_security_passed` / `no_security` — flags
   - `amount_to_avg_ratio`, `amount_diff_from_avg` — deviations from the user's baseline
   - `is_long_distance` — long shipping distance flag (90th percentile)
   - `risk_score` — composite score:
     `risk_score = 3×is_cross_border + 2×no_security + 1×three_ds_flag + is_long_distance + is_night`
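Three of the feature families above can be sketched in pandas. The toy frame and column values are illustrative (the `hour` column stands in for the hour extracted from `transaction_time`); only the `risk_score` formula is taken directly from the project:

```python
import numpy as np
import pandas as pd

# Toy frame using the column names from the feature list above.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "amount": [50.0, 60.0, 500.0, 20.0, 25.0],
    "hour": [3, 14, 2, 10, 23],
    "country": ["DE", "DE", "TR", "US", "US"],
    "bin_country": ["DE", "DE", "DE", "US", "RO"],
    "no_security": [0, 0, 1, 0, 1],
    "three_ds_flag": [0, 1, 0, 0, 0],
    "is_long_distance": [0, 0, 1, 0, 0],
})

# 1) User-behavior feature: z-score of amount within each user's history.
grp = df.groupby("user_id")["amount"]
df["amount_zscore_user"] = (df["amount"] - grp.transform("mean")) / grp.transform("std")

# 2) Cyclical encoding: hour 23 and hour 0 land close together in (sin, cos).
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# 3) Domain features, ending with the composite risk_score from above.
df["is_cross_border"] = (df["country"] != df["bin_country"]).astype(int)
df["is_night"] = df["hour"].isin(range(0, 6)).astype(int)
df["risk_score"] = (
    3 * df["is_cross_border"]
    + 2 * df["no_security"]
    + 1 * df["three_ds_flag"]
    + df["is_long_distance"]
    + df["is_night"]
)
print(df[["amount_zscore_user", "hour_sin", "risk_score"]].round(2))
```

The sin/cos pair matters for tree models less than for linear ones, but it lets any model treat midnight and 11 PM as neighbors rather than opposite ends of a 0–23 scale.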
Feature selection:
- Applied CatBoost feature importances
- Selected top-25 features for final model
- Removed unstable and duplicate features
Result: Baseline improved after feature engineering (details in notebook)
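A minimal sketch of the importance-based selection step, using scikit-learn's RandomForestClassifier (the project used CatBoost importances; synthetic data stands in for the engineered feature matrix and `k` for the top-25 cutoff):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=2000, n_features=40,
                           n_informative=10, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance and keep the top-k,
# mirroring the "select top-25 features" step above.
k = 25
top_idx = np.argsort(rf.feature_importances_)[::-1][:k]
X_selected = X[:, top_idx]
print(X_selected.shape)
```

In the real pipeline the retained indices would be mapped back to feature names so unstable or duplicate features can be inspected before removal.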
Model interpretation:
1. **SHAP global interpretation:**
   - Built SHAP summary plots for CatBoost
   - Top influential features: `security_score`, `risk_score`, `amount`, `cross_border`, `shipping_distance`
2. **LIME local interpretation:**
   - Local explanations for individual fraud transactions
   - LIME vs SHAP comparison: LIME gives simpler linear approximations, SHAP provides the full interaction picture
3. **Model comparison:**
   - Compared Logistic Regression (with StandardScaler) and CatBoost
   - SHAP summary plots show CatBoost better captures non-linear patterns (e.g., U-shaped dependencies)
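The project computes SHAP values with the `shap` package (TreeExplainer); since that library may not be installed everywhere, here is a self-contained brute-force computation of exact Shapley values, which is what SHAP approximates efficiently. The toy linear "risk model" and feature vector are illustrative:

```python
from itertools import combinations
from math import comb

import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley attribution for one instance x with n features.

    f: scoring function over a 1-D feature vector; baseline: reference
    vector whose values replace 'absent' features. O(2^n) - toy sizes only.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                w = 1.0 / (n * comb(n - 1, r))   # Shapley weight of coalition S
                z = baseline.copy()
                z[list(S)] = x[list(S)]          # coalition present, i absent
                without_i = f(z)
                z[i] = x[i]                      # now add feature i
                phi[i] += w * (f(z) - without_i)
    return phi

# Toy linear risk model: Shapley values equal coefficient * (x - baseline)
# and sum to f(x) - f(baseline) (the efficiency property).
f = lambda v: 3 * v[0] + 2 * v[1] + v[2]
x = np.array([1.0, 1.0, 0.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)
print(phi)  # -> [3. 2. 0.]
```

TreeExplainer gets the same quantities in polynomial time by exploiting tree structure, which is what makes per-transaction SHAP summaries feasible on ~300k rows.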
SHAP embeddings and anomalies:
1. **SHAP embedding creation:**
   - Function `get_shap_embeddings(model, X_data, shap_feature)` for extracting SHAP values
   - Built SHAP embeddings for train and test
2. **Anomaly detection on SHAP embeddings:**
   - Isolation Forest with `contamination=0.01` on the SHAP space
   - Identified 2,398 SHAP anomalies
   - Result: ROC-AUC = 0.97340 (slight decrease), but the model became more stable
3. **SHAP embedding clustering:**
   - PCA for dimensionality reduction to 2 components
   - k-Means (k=5) → added a `cluster` feature
   - Retrained CatBoost with `cat_features=['cluster']`
   - Result: ROC-AUC = 0.97566 (small improvement)
   - DBSCAN: no significant improvement (many outliers in cluster = -1)
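Steps 2 and 3 above can be sketched end to end. A random matrix stands in for the real per-transaction SHAP embedding (rows = transactions, columns = per-feature SHAP values); parameters `contamination=0.01` and `k=5` follow the description above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Stand-in for the SHAP embedding matrix (n_transactions x n_features).
emb = rng.normal(size=(1500, 25))

# Isolation Forest on the SHAP space: flags transactions whose
# attribution pattern (not raw features) is unusual.
iso = IsolationForest(contamination=0.01, random_state=1)
shap_anomaly = iso.fit_predict(emb) == -1
print("SHAP-space anomalies:", shap_anomaly.sum())

# PCA to 2 components, then k-Means (k=5): the resulting labels become
# a categorical 'cluster' feature for retraining the classifier.
emb_2d = PCA(n_components=2, random_state=1).fit_transform(emb)
cluster = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(emb_2d)
print("cluster sizes:", np.bincount(cluster))
```

Clustering in attribution space groups transactions by *why* the model scores them the way it does, which is different information from clustering raw features.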
Shapley Flow analysis:
1. **Feature interaction graph:**
   - Built a graph from SHAP value correlations (|corr| > 0.5)
   - NetworkX for visualization
   - Community detection (greedy modularity) → identified 18 feature groups
2. **Key communities:**
   - Security cluster: `shap_security_score`, `shap_avs_match`, `shap_risk_score`, `shap_all_security_passed`
   - Geography cluster: `shap_shipping_distance_km`, `shap_is_cross_border`, `shap_is_long_distance`
   - User behavior: `shap_user_amount_std`, `shap_avg_amount_user`, `shap_anomaly_consensus`
3. **Train vs test comparison:**
   - The test graph is sparser (5 communities vs 4 in train)
   - 4 stable groups persist between train and test
   - Insight: `shap_is_cross_border` and `shap_shipping_distance_km` always cluster together → strong relationship
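The graph construction and community step can be sketched with NetworkX. Synthetic columns stand in for real SHAP values (two correlated pairs mimicking the security and geography clusters); the |corr| > 0.5 threshold is the one stated above:

```python
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities
import networkx as nx

rng = np.random.default_rng(2)
# Stand-in SHAP columns: two latent factors produce two correlated pairs.
sec = rng.normal(size=1000)
geo = rng.normal(size=1000)
cols = {
    "shap_security_score": sec + 0.1 * rng.normal(size=1000),
    "shap_avs_match": sec + 0.1 * rng.normal(size=1000),
    "shap_is_cross_border": geo + 0.1 * rng.normal(size=1000),
    "shap_shipping_distance_km": geo + 0.1 * rng.normal(size=1000),
}
names = list(cols)
corr = np.corrcoef(np.column_stack([cols[n] for n in names]), rowvar=False)

# Edge between two features when |corr| of their SHAP values > 0.5.
G = nx.Graph()
G.add_nodes_from(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.5:
            G.add_edge(names[i], names[j], weight=abs(corr[i, j]))

# Greedy modularity maximization recovers the two feature communities.
communities = greedy_modularity_communities(G)
for c in communities:
    print(sorted(c))
```

Comparing the community structure of train and test graphs, as done above, is a cheap stability check: groups that persist across splits reflect genuine feature interactions rather than sampling noise.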
Final validation:
Comparison of 3 approaches:
- SHAP embeddings + Isolation Forest: ROC-AUC = 0.97340
- SHAP embeddings + clustering: ROC-AUC = 0.97566
- Original features (hold-out validation): ROC-AUC = 0.97640
SHAP embeddings alone (without original features): ROC-AUC = 0.96381 (1.26pp lower)
| Metric | Score |
|---|---|
| ROC-AUC | 0.97640 |
| PR-AUC (Average Precision) | 0.85556 |
| Fraud Precision | 0.34 |
| Fraud Recall | 0.90 |
| Fraud F1-Score | 0.49 |
Improvement over baseline:
- ROC-AUC: +0.00144 (+0.15%)
- PR-AUC: +0.00287 (+0.34%)
- Fraud Precision: +0.04 (+13.3%)
- Fraud Recall: +0.02 (+2.3%)
- Fraud F1: +0.04 (+8.9%)
Business findings:
- Cross-border transactions — primary fraud indicator (fraud rate 11.28% vs 1.43%)
- Security checks are critical: the AVS+CVV+3DS combination cuts the fraud rate ≈31× (0.58% vs 18.08%)
- Amount and distance anomalies — strong signals (fraud rate up to 30% for outliers)
- Geography matters: TR and RO — top fraud countries
Technical findings:
- Feature engineering has greater impact than hyperparameter tuning
- SHAP embeddings useful for interpretation but don't replace original features
- Anomaly detection methods help identify complex patterns (consensus approach effective)
- Shapley Flow reveals feature interaction structure
Data processing:
- pandas, numpy
- scikit-learn (preprocessing, imputation, feature selection)
Visualization:
- matplotlib, seaborn
- plotly (interactive graphs)
Anomaly detection:
- Isolation Forest, LOF, One-Class SVM, Elliptic Envelope
Modeling:
- CatBoost (main model)
- Logistic Regression (for comparison)
- RandomForest (feature importance)
Interpretability:
- SHAP (TreeExplainer, summary plots, dependence plots)
- LIME (Tabular explainer, SP-LIME)
Graph analysis:
- NetworkX (Shapley Flow, community detection)
Clustering:
- k-Means, DBSCAN
- PCA, UMAP (dimensionality reduction)
- Platform: Google Colab
- Language: Python 3.12
- GPU: NVIDIA Tesla T4 (for CatBoost acceleration)
Project completed as final team assignment for "Machine Learning" course at Central University. Project structure follows 3-stage roadmap:
- Checkpoint 1 (Nov 17-26): EDA + Baseline
- Checkpoint 2 (Nov 24 — Dec 3): Anomaly Detection + Feature Engineering
- Checkpoint 3 (Dec 1-10): SHAP Interpretability + Shapley Flow
Team completed all checkpoints, applying both classical statistical methods and SHAP embeddings with graph-based feature analysis.