Health care fraud is a crime that involves misrepresenting information, concealing information, or deceiving a person or entity in order to receive benefits or make a financial profit. Both individuals and health care providers commit health care fraud in a number of different ways.
The goal of this project is to understand the behavior of potentially fraudulent providers, to identify the parameters that help characterize them, and finally to build a predictive model that flags potentially fraudulent providers based on the fields in their claims.
The data was downloaded from: https://www.kaggle.com/rohitrox/healthcare-provider-fraud-detection-analysis
All other parameters and their relationships with the fraud providers were examined in detail to determine which data, as well as the extended statistics derived from them, would be used in the final model.
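As a rough illustration of what deriving provider-level statistics from claim-level data can look like, the sketch below collapses a tiny, made-up claims table to one row per provider. The column names (`Provider`, `ClaimID`, `BeneID`, `ClaimAmount`) and the chosen statistics are assumptions for this example only, not necessarily the features used in the notebooks.

```python
import pandas as pd

# Hypothetical claim-level rows; column names and values are assumptions.
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2", "P2"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "BeneID": ["B1", "B2", "B3", "B3", "B4"],
    "ClaimAmount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Collapse the claims to one row per provider with summary statistics.
provider_features = (
    claims.groupby("Provider")
    .agg(
        n_claims=("ClaimID", "count"),            # number of claims filed
        n_beneficiaries=("BeneID", "nunique"),    # distinct patients served
        mean_claim_amount=("ClaimAmount", "mean"),
        total_claim_amount=("ClaimAmount", "sum"),
    )
    .reset_index()
)

print(provider_features)
```

Aggregating to the provider level matches the unit being flagged: the fraud label applies to a provider, not to an individual claim.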
A wide range of commonly used machine learning models was employed, including Linear Regression, Logistic Regression, Ridge Regression, Lasso Regression, Elastic Net, Random Forest, Gradient Boosting Machine (GBM), AdaBoost, XGBoost, LightGBM, CatBoost, k-Nearest Neighbors (k-NN), and Feedforward Neural Networks (FNN / MLP).
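A minimal sketch of comparing several of these model families on a shared train/test split is shown below. It uses scikit-learn with a synthetic, imbalanced dataset as a stand-in for the real feature table; the data, the three models picked, and all hyperparameters are assumptions for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real feature table (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratify so both splits keep the same share of the minority class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A small subset of the model families listed above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

# Report models from best to worst held-out accuracy.
for name, acc in sorted(test_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.4f}")
```

Keeping the models in a dictionary makes it easy to add further estimators and evaluate them all on the same held-out data.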
The analyses are presented in four notebooks:
- Exploratory Data Analysis (Part I)
- Exploratory Data Analysis (Part II)
- Exploratory Data Analysis (Part III)
- Feature Engineering and Model Section
Evaluation Metrics:
Model: Accuracy -- F1 -- Precision -- Recall
XGBoost: 1.000000 -- 1.000000 -- 1.000000 -- 1.000000
CatBoost: 0.999991 -- 0.999988 -- 1.000000 -- 0.999976
Blended Model: 0.999991 -- 0.999988 -- 1.000000 -- 0.999976
Random Forest: 0.999946 -- 0.999929 -- 1.000000 -- 0.999859
GBM: 0.999937 -- 0.999918 -- 0.999953 -- 0.999882
Neural Network: 0.999364 -- 0.999163 -- 0.999670 -- 0.998657
LightGBM: 0.998647 -- 0.998218 -- 1.000000 -- 0.996442
KNN: 0.969788 -- 0.959888 -- 0.968956 -- 0.950987
Lasso: 0.952429 -- 0.936016 -- 0.957601 -- 0.915382
Ridge: 0.951730 -- 0.935039 -- 0.957181 -- 0.913898
Logistic Regression: 0.947637 -- 0.929239 -- 0.955374 -- 0.904496
ElasticNet: 0.946239 -- 0.927066 -- 0.957096 -- 0.898864
AdaBoost: 0.924948 -- 0.897513 -- 0.933111 -- 0.864532
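For reference, the four metrics can be computed from the counts of true and false positives and negatives. The sketch below uses the standard binary definitions with fraud as the positive class (label 1); the helper name `classification_metrics` and the toy labels are assumptions for illustration, and the notebooks may well rely on a library implementation instead.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, F1, precision and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed fraud
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


# Toy labels (assumption): 3 fraudulent and 5 legitimate providers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one fraudulent provider is missed and one legitimate provider is flagged, so precision and recall are both 2/3 while accuracy is 0.75. Precision measures how many flagged providers are truly fraudulent; recall measures how many fraudulent providers are caught.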
ROC Curves and AUC Scores:
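As a minimal sketch of how a ROC curve and its AUC can be derived from predicted fraud scores: sweep a threshold from the highest score downward, record the true and false positive rates at each step, and integrate with the trapezoidal rule. The function names and the toy scores are assumptions for illustration; the sketch also assumes both classes are present in `y_true`.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping a threshold over the scores.

    Assumes binary labels (1 = fraud) with at least one example of each class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Toy example (assumption): two legitimate and two fraudulent providers.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
roc_auc = auc(fpr, tpr)
print(list(zip(fpr, tpr)), roc_auc)
```

Here one legitimate provider (score 0.4) outranks one fraudulent provider (score 0.35), so three of the four fraud/legitimate pairs are ordered correctly and the AUC is 0.75.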
With further improvement, this model can help predict provider fraud, supporting insurance companies and government agencies in taking action against fraudulent health care providers and, in turn, improving the health of the economy.