AI Revenue Leakage Detection — Hybrid ML System

End-to-end revenue leakage / payment fraud detection on IEEE-CIS transactional data, with optional Elliptic Bitcoin graph analytics, gradient-boosted trees, deep learning, anomaly detection, and a hybrid fusion stage.

Problem statement

Revenue leakage in digital payments appears as a small fraction of high-risk transactions hidden in a large volume of legitimate traffic. Class imbalance, noisy identity fields, and nonlinear fraud patterns make single-model solutions brittle. This project frames detection as supervised fraud classification and fuses three complementary signals (tabular GBDT scores, neural risk scores, unsupervised anomaly scores), aiming for a better precision–recall tradeoff than any single channel achieves alone.

Datasets

Source	Role	Location
IEEE-CIS Fraud Detection	Primary tabular train slice: transactions + identity merge, engineered features, stages 1–4	`DATASET_ieee-cis-elliptic/` (see bundle README)
Elliptic++ (Bitcoin)	Graph-oriented EDA and Stage 5 experiments (temporal holdout)	Same bundle; cleaned table in `processed_data/elliptic_transactions_cleaned.csv`
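
The temporal holdout mentioned for the Elliptic data can be illustrated with a minimal sketch. It assumes the cleaned table carries a discrete time-step column; the column name `time_step`, the helper `temporal_holdout`, and the 40% test fraction are hypothetical and not taken from the project code.

```python
import pandas as pd


def temporal_holdout(df: pd.DataFrame, time_col: str, test_fraction: float = 0.3):
    """Split so that every test row is strictly later than every training row."""
    steps = sorted(df[time_col].unique())
    n_test = max(1, int(round(len(steps) * test_fraction)))
    cutoff = steps[-n_test]  # first time step assigned to the test set
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


# Toy frame; `time_step` is a hypothetical column name.
toy = pd.DataFrame({
    "time_step": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "label":     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})
train_df, test_df = temporal_holdout(toy, "time_step", test_fraction=0.4)
```

Splitting on whole time steps, rather than shuffling rows, keeps future transactions out of training, which is the point of a temporal holdout.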

After running the pipeline, processed IEEE tables, predictions, and configs live under processed_data/ (included in this submission where present).

Methodology

EDA & data prep (Stage 1) — Merge IEEE transaction + identity, missingness-driven column drops, imputation, time/amount feature engineering, class and temporal EDA; Elliptic cleaning and graph-derived degree features.
Feature engineering — Ratios to peer/card means, calendar features, log transforms; high-missing columns removed with documented thresholds (preprocessing_config.json).
Models (Stages 2–3) — GBDT (LightGBM / XGBoost) with optional SMOTE/tuning; attention / MLP-style DNN baselines; Isolation Forest anomaly scores on normalized features; intermediate prediction CSVs for downstream fusion.
Hybrid system (Stage 4) — Weighted or learned combination of GBDT, DNN, and anomaly channels with threshold tuning on validation (F1-oriented under imbalance); test metrics and score exports.
Graph extension (Stage 5) — Elliptic temporal split and graph-model experiments (elliptic_graph_experiments.csv).
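
The Stage 1 and feature-engineering steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the column names `TransactionID`, `TransactionDT`, `TransactionAmt`, and `card1` come from the public IEEE-CIS dataset, while the 0.9 missingness threshold, the derived column names, and the helper `prepare` are hypothetical (the real thresholds are recorded in preprocessing_config.json).

```python
import numpy as np
import pandas as pd

# Assumed value for illustration; the project's threshold lives in preprocessing_config.json.
MISSING_THRESHOLD = 0.9


def prepare(transactions: pd.DataFrame, identity: pd.DataFrame) -> pd.DataFrame:
    # Left join keeps every transaction, including those with no identity row.
    df = transactions.merge(identity, on="TransactionID", how="left")
    # Drop columns whose missing fraction exceeds the threshold.
    missing_frac = df.isna().mean()
    df = df.loc[:, missing_frac <= MISSING_THRESHOLD].copy()
    # Log transform tames the heavy right tail of transaction amounts.
    df["amt_log"] = np.log1p(df["TransactionAmt"])
    # Ratio of each amount to the mean amount seen on the same card.
    card_mean = df.groupby("card1")["TransactionAmt"].transform("mean")
    df["amt_to_card_mean"] = df["TransactionAmt"] / card_mean
    # Calendar feature, treating TransactionDT as an offset in seconds (assumption).
    df["hour"] = (df["TransactionDT"] // 3600) % 24
    return df


tx = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionDT": [3600, 7200, 90000, 93600],
    "TransactionAmt": [10.0, 30.0, 50.0, 50.0],
    "card1": [111, 111, 222, 222],
})
ident = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
    "id_sparse": [np.nan, np.nan],  # fully missing, so it should be dropped
})
features = prepare(tx, ident)
```

Imputation is left out of the sketch because the source does not say which strategy the pipeline uses.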

Models used

Family	Implementation	Notes
GBDT	LightGBM / XGBoost via `stage02_gbdt.py`	Strong tabular baseline; high recall, lower precision at default 0.5 threshold
Deep learning	TensorFlow/Keras attention + MLP paths in `stage03_deep_anomaly.py`	Highest ROC-AUC in the bundled run (0.946); more balanced precision and recall than the GBDT
Anomaly detection	Normalized Isolation Forest scores	Captures outliers; weak alone on fraud label
Hybrid	Fusion in `stage04_fusion.py`	Calibrated combination edges out the best single-channel F1 on the bundled run (0.705 vs 0.702 for the DNN)
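
One plausible reading of "normalized Isolation Forest scores" is sketched below with scikit-learn. The synthetic data, the min-max normalization against the training range, and the variable names are assumptions for illustration; the project may normalize differently.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
# 20 ordinary points plus one obvious outlier in the last row.
X_test = np.vstack([rng.normal(size=(20, 4)), np.full((1, 4), 8.0)])

# Standardize features, then fit the forest on the scaled training data.
scaler = StandardScaler().fit(X_train)
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(scaler.transform(X_train))

# score_samples returns higher values for more normal points,
# so negate it to get "higher means more anomalous".
train_raw = -forest.score_samples(scaler.transform(X_train))
test_raw = -forest.score_samples(scaler.transform(X_test))

# Min-max normalize with the training range and clip to [0, 1].
lo, hi = train_raw.min(), train_raw.max()
anomaly_score = np.clip((test_raw - lo) / (hi - lo), 0.0, 1.0)
```

Scaling the score to [0, 1] puts it on the same footing as the GBDT and DNN probabilities, which makes a weighted combination easier to reason about.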

Results summary (bundled run)

Values from processed_data/report_table_1_ieee_cis.csv and final_hybrid_comparison_metrics.csv (IEEE-CIS test split, this checkout):

Model	Precision	Recall	F1	AUC (ROC)
XGBoost (GBDT)	0.260	0.822	0.395	0.943
Deep neural network	0.827	0.611	0.702	0.946
Isolation forest	0.290	0.122	0.172	0.769
Proposed hybrid	0.801	0.630	0.705	0.945
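
As a quick sanity check, the F1 column can be recomputed from the precision and recall columns with F1 = 2PR / (P + R). The snippet below uses only the values in the table; small gaps are expected because the inputs are rounded to three decimals.

```python
# (precision, recall, reported F1) copied from the table above.
rows = {
    "XGBoost (GBDT)":      (0.260, 0.822, 0.395),
    "Deep neural network": (0.827, 0.611, 0.702),
    "Isolation forest":    (0.290, 0.122, 0.172),
    "Proposed hybrid":     (0.801, 0.630, 0.705),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_from(p, r) for name, (p, r, _) in rows.items()}
max_gap = max(abs(recomputed[name] - rows[name][2]) for name in rows)

for name, value in recomputed.items():
    print(f"{name:22s} reported={rows[name][2]:.3f} recomputed={value:.4f}")
```

Every recomputed value lands within about 0.001 of the reported figure, which is consistent with rounding.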

Additional PR-AUC for the hybrid (weighted) on the same run: 0.736 (final_hybrid_comparison_metrics.csv).
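
For readers who want to reproduce a PR-AUC figure from exported scores, one common estimator is scikit-learn's `average_precision_score`. The source does not state which estimator the pipeline uses, so treat this as an assumption; the toy labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy data with 20% positives (hypothetical, not project output).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
perfect_scores = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.9, 0.95])
constant_scores = np.full(10, 0.5)

ap_perfect = average_precision_score(y_true, perfect_scores)
ap_constant = average_precision_score(y_true, constant_scores)
roc_constant = roc_auc_score(y_true, constant_scores)
```

An uninformative scorer gets a ROC-AUC of 0.5 regardless of class balance, but its average precision falls to the positive rate. That is why PR-AUC is the more demanding metric on imbalanced fraud data.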

Why the hybrid works best

GBDT excels at sharp nonlinear splits on mixed-type tabular features, but at the default 0.5 threshold it favors recall over precision on imbalanced fraud data (precision 0.260, recall 0.822 in the bundled run).
The DNN learns smooth, high-dimensional representations and achieves strong precision with moderate recall.
Anomaly scores add a complementary “unknown pattern” signal but are noisy as a standalone classifier.
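Fusion reweights channels so that high-confidence DNN and tree signals are not drowned out by the anomaly channel. In the bundled run this yields the highest F1 (0.705, a narrow margin over 0.702 for the DNN alone) by trading a little precision for recall, which gives a practical precision–recall balance for operations.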
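
The weighted variant of the fusion stage, with F1-oriented threshold tuning on validation data, could look roughly like the sketch below. The weights `(0.4, 0.5, 0.1)`, the threshold grid, the synthetic scores, and the function names are all hypothetical; the actual logic lives in `stage04_fusion.py`.

```python
import numpy as np


def f1_score_at(y_true, scores, threshold):
    """F1 of the rule `score >= threshold`, computed from raw counts."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def fuse(gbdt, dnn, anomaly, weights=(0.4, 0.5, 0.1)):
    """Convex combination of the three channels (illustrative weights)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w[0] * gbdt + w[1] * dnn + w[2] * anomaly


def best_threshold(y_val, fused_val, grid=None):
    """Pick the grid threshold that maximizes validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    f1s = [f1_score_at(y_val, fused_val, t) for t in grid]
    best = int(np.argmax(f1s))
    return float(grid[best]), f1s[best]


# Synthetic validation set with roughly 5% fraud (made-up scores).
rng = np.random.default_rng(42)
y_val = (rng.random(2000) < 0.05).astype(int)
gbdt = np.clip(0.6 * y_val + rng.normal(0.30, 0.15, 2000), 0, 1)
dnn = np.clip(0.7 * y_val + rng.normal(0.15, 0.10, 2000), 0, 1)
anomaly = rng.random(2000)  # deliberately uninformative channel

fused = fuse(gbdt, dnn, anomaly)
threshold, val_f1 = best_threshold(y_val, fused)
```

The threshold chosen on validation is then frozen and applied to the test split. Tuning it on test data would inflate the reported F1.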

Streamlit dashboard

Entry: app/streamlit_app.py (same home experience as app/app.py).
Home — Hero KPIs, pipeline stepper, dataset overview, key equations, inlined Tables I–IV, figure gallery with short inferences, business framing, threshold explorer.
Pages (app/pages/) — EDA gallery, GBDT/SHAP, deep + anomaly, fusion downloads, pipeline runner, Elliptic graph, reports.

Theme and layout: app/assets/custom.css, .streamlit/config.toml.

Screenshots (optional)

Add PNGs under docs/dashboard_screenshots/ and link them here for your report; figures from the pipeline remain under figures/.

How to run

Python: 3.10, 3.11, or 3.12 (see requirements.txt header).

cd AI_Revenue_Project
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
streamlit run app/streamlit_app.py

Windows shortcut: double-click run_dashboard.bat from this folder (runs the same command).

Full pipeline (optional, long run):

python main.py

Folder structure

AI_Revenue_Project/
├── README.md                 # This file
├── requirements.txt
├── pyproject.toml
├── main.py                   # CLI pipeline entry
├── run_dashboard.bat         # Windows launcher
├── LICENSE
├── .gitignore
├── .streamlit/config.toml
├── app/
│   ├── streamlit_app.py      # Primary Streamlit entry
│   ├── app.py                # Alternate entry
│   ├── assets/, components/, pages/
├── src/fraud_ml/             # Pipeline package
├── DATASET_ieee-cis-elliptic/   # Raw data bundle
├── processed_data/           # Cleaned CSVs, preds, metrics, configs
├── figures/                  # Stage plots (EDA → graph)
├── docs/                     # Demo, checklists, stage report
├── manuscript/               # Draft narrative (excludes obsolete TeX)
├── scripts/                  # e.g. appendix table embed helper
├── data/, models/, results/  # README stubs → see paths above
├── .github/workflows/        # CI smoke (optional)
└── ML_Report_Latest.tex      # LaTeX report (if present)

License and contributing

See LICENSE and CONTRIBUTING.md.
