A comprehensive machine learning pipeline for predicting customer churn in the telecommunications industry.
This repo contains an end-to-end solution for predicting customer churn, helping businesses identify customers at risk of leaving their service. The pipeline implements best practices in data preprocessing, feature engineering, model training, and evaluation.
- Robust Data Preprocessing: Handles categorical variables, missing values, and data transformations for 7,043 customer records
- Advanced Feature Engineering: Creates 9 powerful derived features that capture customer behavior patterns
- Cross-Validated Model Training: Ensures model reliability and stability
- Comprehensive Evaluation: Multiple metrics including accuracy, precision, recall, F1-score, and AUC-ROC
- Insightful Visualizations: Various plots to understand churn drivers and model decisions
- Production-Ready Prediction: Returns probability, risk level, and binary prediction
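The three prediction outputs named in the last bullet could be packaged as in the minimal sketch below. `ChurnPrediction`, `to_prediction`, and all three cutoff values are hypothetical names and numbers chosen for illustration; the README does not state the thresholds `prediction.py` actually uses.

```python
from dataclasses import dataclass


@dataclass
class ChurnPrediction:
    probability: float  # estimated churn probability, in [0, 1]
    risk_level: str     # "Low", "Medium", or "High"
    will_churn: bool    # binary prediction


def to_prediction(probability: float,
                  decision_threshold: float = 0.5,
                  medium_cutoff: float = 0.3,
                  high_cutoff: float = 0.7) -> ChurnPrediction:
    """Turn a churn probability into probability, risk level, and binary label.

    All cutoffs are illustrative assumptions, not values taken from the repo.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_cutoff:
        risk = "High"
    elif probability >= medium_cutoff:
        risk = "Medium"
    else:
        risk = "Low"
    return ChurnPrediction(probability, risk, probability >= decision_threshold)


print(to_prediction(0.82))
```

Keeping the risk bands separate from the decision threshold lets the binary cutoff be tuned (for example, lowered to favor recall) without changing how risk levels are reported.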
- Data Preprocessing (data_preprocessing.py)
  - Standardizes data formats
  - Converts categorical variables to numeric
  - One-hot encodes multi-class categorical features
- Feature Engineering (feature_engineering.py)
  - Creates basic and advanced derived features:
    - AvgMonthlyCharge: Average charge per month
    - HighValueFlag: Identifies high-value customers
    - TenureToChargeRatio: Relationship between tenure and charges
    - ServiceDiversity: Variety of services used
    - TenureByContract: Interaction between tenure and contract type
    - TotalServices: Count of services subscribed
    - ServiceDensity: Service concentration
    - PotentialLTV: Estimated lifetime value
    - ChurnRiskScore: Aggregated risk score
- Model Training (model_training.py)
  - Implements machine learning model with cross-validation
  - Current performance metrics (still tuning):
    - Accuracy: 76%
    - Precision: 53%
    - Recall: 73%
    - F1-Score: 61%
    - AUC-ROC: 84%
- Visualization (visualization.py)
  - Creates multiple visualizations:
    - Contract vs Churn
    - Monthly Charges vs Churn
    - Tenure vs Churn
    - Internet Service vs Churn
    - Payment Method vs Churn
    - Correlation heatmap
    - SHAP visualizations for model explainability
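To make the Feature Engineering step more concrete, here is a rough sketch of how a few of the derived features might be computed for a single customer record. The column names (`tenure`, `MonthlyCharges`, `TotalCharges`, and the service columns), the formulas, and the high-value threshold are all assumptions for illustration; the actual definitions live in `feature_engineering.py` and may differ.

```python
# Assumed service columns; the real dataset may contain more or different ones.
SERVICE_COLUMNS = ["PhoneService", "OnlineSecurity", "OnlineBackup",
                   "TechSupport", "StreamingTV"]


def add_derived_features(row: dict, high_value_threshold: float = 70.0) -> dict:
    """Return a copy of `row` with four illustrative derived features added."""
    tenure = row["tenure"]
    monthly = row["MonthlyCharges"]
    total = row["TotalCharges"]

    features = dict(row)
    # Average charge per month; fall back to the current monthly charge
    # for brand-new customers to avoid dividing by zero.
    features["AvgMonthlyCharge"] = total / tenure if tenure > 0 else monthly
    # Count of services the customer subscribes to.
    features["TotalServices"] = sum(
        1 for column in SERVICE_COLUMNS if row.get(column) == "Yes"
    )
    # 1 if the monthly charge meets the (assumed) high-value threshold.
    features["HighValueFlag"] = int(monthly >= high_value_threshold)
    # Months of tenure per unit of monthly charge.
    features["TenureToChargeRatio"] = tenure / monthly if monthly > 0 else 0.0
    return features


# Made-up record, for illustration only.
example = {"tenure": 12, "MonthlyCharges": 80.0, "TotalCharges": 960.0,
           "PhoneService": "Yes", "OnlineSecurity": "No", "TechSupport": "Yes"}
print(add_derived_features(example))
```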
# Clone the repository
git clone https://github.com/MessoJ/Customer-Churn-Prediction.git
cd Customer-Churn-Prediction
# Install dependencies
pip install -r requirements.txt

# Run the full pipeline
python data_preprocessing.py
python feature_engineering.py
python model_training.py
python model_validation.py
python model_evaluation.py
python visualization.py
python prediction.py

The current model prioritizes recall (73%) over precision (53%), making it well-suited for identifying as many potential churners as possible, even at the cost of some false positives. This approach is business-oriented, as the cost of missing a potential churner typically exceeds the cost of incorrectly flagging a loyal customer.
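The recall-versus-precision trade-off can be illustrated with a small dependency-free sketch. It first checks that the reported F1-score follows from the reported precision and recall, then shows how lowering the decision threshold raises recall at the expense of precision. The labels and probabilities are made-up toy values, and `precision_recall` is a hypothetical helper, not code from this repo.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting churn for y_prob >= threshold."""
    tp = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Reported precision (53%) and recall (73%) imply an F1 of about 61%.
print(f"F1 = {f1_score(0.53, 0.73):.2f}")

# Toy data (1 = churned), invented purely to show the threshold effect.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2, 0.45, 0.35]
for threshold in (0.5, 0.3):
    p, r = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On the toy data, moving the threshold from 0.5 to 0.3 catches every churner but flags more loyal customers, which mirrors the trade-off described above.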
Key findings from the analysis:
- Contract type is a strong predictor of churn (month-to-month contracts have higher churn)
- Fiber optic internet service users show higher churn rates
- Electronic check payment method is associated with higher churn
- Short tenure strongly predicts churn likelihood
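Findings like these typically come from comparing churn rates across the values of a categorical column. The sketch below shows one way to do that in plain Python; the column names `Contract` and `Churn`, the `"Yes"` label, and the sample rows are assumptions for illustration and are not taken from `telco_churn.csv`.

```python
from collections import defaultdict


def churn_rate_by(rows, column, target="Churn", positive="Yes"):
    """Return {category: share of rows with target == positive} for `column`."""
    counts = defaultdict(lambda: [0, 0])  # category -> [churned, total]
    for row in rows:
        bucket = counts[row[column]]
        bucket[1] += 1
        if row[target] == positive:
            bucket[0] += 1
    return {category: churned / total
            for category, (churned, total) in counts.items()}


# Made-up rows, for illustration only.
sample = [
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
]
for contract, rate in churn_rate_by(sample, "Contract").items():
    print(f"{contract}: {rate:.0%}")
```

The same function could be pointed at rows loaded with `csv.DictReader`, using other columns such as internet service or payment method, assuming the file uses matching column names.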
Customer-Churn-Prediction/
├── data/
│ └── telco_churn.csv
├── data_preprocessing.py
├── feature_engineering.py
├── model_evaluation.py
├── model_training.py
├── model_validation.py
├── prediction.py
├── visualization.py
├── utils.py
├── requirements.txt
└── README.md
- Implementing hyperparameter tuning
- Exploring ensemble models (Random Forest, XGBoost)
- Adding more interaction features
- Developing customer segmentation before classification
MIT
This project was developed using telecommunication customer data to help businesses improve customer retention strategies.