Tarun110/Scalable-Churn-Prediction-Using-PySpark

📞 Big Data Customer Churn Prediction using PySpark

📌 Overview

Customer churn is one of the most critical challenges faced by telecom companies, directly impacting revenue, customer acquisition costs, and long-term business growth.

This project leverages Apache Spark and PySpark MLlib to build a scalable machine learning pipeline capable of identifying customers at risk of churning. Using distributed data processing and predictive modeling techniques, the solution analyzes customer behavior patterns, service usage metrics, and subscription characteristics to proactively predict churn and support customer retention strategies.

The project demonstrates an end-to-end big data analytics workflow, including data preprocessing, feature engineering, class imbalance handling, model training, evaluation, and performance comparison across multiple machine learning algorithms.


🚀 Key Features

⚡ Distributed Data Processing

  • Built using Apache Spark and PySpark for scalable analysis of large customer datasets.
  • Efficiently processes telecom customer records using distributed computing.

🧹 Data Preprocessing & Feature Engineering

  • Cleaned and transformed raw customer data.
  • Encoded categorical variables into machine learning-ready features.
  • Removed redundant attributes and optimized feature selection.

⚖️ Churn Class Balancing

  • Addressed class imbalance using stratified sampling techniques.
  • Improved model reliability when identifying high-risk customers.
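One possible way to do the stratified balancing described above is to undersample the majority class with Spark's `DataFrame.sampleBy`. The sketch below is an assumption about the approach, not the project's confirmed code: the helper names, the `label` column, and the choice of undersampling rather than oversampling are all hypothetical.

```python
def undersample_fractions(class_counts):
    """Map each class to the fraction that shrinks it to the minority-class size."""
    if not class_counts:
        raise ValueError("class_counts must not be empty")
    minority = min(class_counts.values())
    return {label: minority / count for label, count in class_counts.items()}


def balance_by_label(df, label_col="label", seed=42):
    """Undersample a Spark DataFrame so each class is roughly the minority size.

    `label_col` is an assumed column name. sampleBy draws each row
    independently, so the resulting class sizes are approximate.
    """
    rows = df.groupBy(label_col).count().collect()
    counts = {row[label_col]: row["count"] for row in rows}
    return df.sampleBy(label_col, fractions=undersample_fractions(counts), seed=seed)
```

Balancing only the training split keeps the test set at the true churn rate, so the reported metrics still reflect the real class distribution.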

🤖 Machine Learning Models

Implemented and compared multiple classification algorithms:

  • Decision Tree Classifier
  • Random Forest Classifier
  • Gradient Boosted Trees (GBT)
  • Naive Bayes

📈 Model Evaluation & Comparison

Evaluated models using:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
  • Confusion Matrix
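As a reminder of how the first four metrics relate to the confusion matrix, here is a minimal plain-Python sketch. The function name and the convention that "positive" means "churned" are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts for the positive (assumed: churned) class.
    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On an imbalanced churn dataset, accuracy alone can look high even when recall on churners is poor, which is why the other metrics are reported alongside it.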

📊 Advanced Visualization

  • Pairwise feature analysis
  • Correlation insights
  • ROC Curve comparison
  • Confusion Matrix heatmaps
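For the ROC curve comparison, the points of each curve can be computed from collected labels and predicted churn probabilities. This is an illustrative plain-Python sketch, assuming binary labels coded as 0/1 and a small enough sample to collect to the driver; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sweep the decision threshold from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so emit a point only after the last tie.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Plotting one such point list per model on shared axes (for example with Matplotlib) gives the side-by-side ROC comparison.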

🔧 Tech Stack

Big Data Technologies

  • Apache Spark
  • PySpark MLlib

Programming

  • Python

Machine Learning

  • Decision Trees
  • Random Forest
  • Gradient Boosted Trees
  • Naive Bayes

Data Analysis & Visualization

  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn

Model Validation

  • Cross Validation
  • ROC-AUC Analysis
  • Classification Metrics

📂 Project Workflow

1️⃣ Data Ingestion

  • Loaded telecom customer datasets using Spark DataFrames.
  • Performed schema inference and data exploration.
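The ingestion step might look roughly like the sketch below. The file path, the `churn` column name, and the helper names are assumptions; the functions take an existing `SparkSession` so they carry no import of their own.

```python
def load_customers(spark, path):
    """Read a headered CSV into a Spark DataFrame, letting Spark infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def summarize(df, label_col="churn"):
    """Print the inferred schema and the class distribution of the label column."""
    df.printSchema()
    df.groupBy(label_col).count().show()


# Example wiring (requires a Spark installation; the path is hypothetical):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("churn-prediction").getOrCreate()
# df = load_customers(spark, "telecom_churn.csv")
# summarize(df)
```

Schema inference costs an extra pass over the file, so for very large inputs an explicit schema is usually the cheaper option.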

2️⃣ Data Preparation

  • Removed redundant variables.
  • Encoded categorical features.
  • Converted churn labels into a numeric, machine-learning-compatible format.
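A minimal sketch of this preparation step follows. It assumes the churn column is named `churn` and holds booleans or "True"/"False" strings; the column names and the helper name are hypothetical, and other label encodings would need a different conversion.

```python
def prepare_labels(df, redundant_cols, churn_col="churn", label_col="label"):
    """Drop redundant columns and add a 0.0/1.0 label derived from the churn column.

    Assumes `churn_col` is boolean or a "True"/"False" string; values that
    Spark cannot cast to boolean become null rather than raising an error.
    """
    cleaned = df.drop(*redundant_cols)
    label = cleaned[churn_col].cast("boolean").cast("double")
    return cleaned.withColumn(label_col, label)
```

Checking the new column for nulls afterwards is a cheap way to catch label values the cast did not recognize.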

3️⃣ Exploratory Data Analysis

  • Analyzed customer behavior patterns.
  • Investigated feature relationships.
  • Visualized churn-related attributes.

4️⃣ Feature Engineering

  • Created feature vectors using Spark ML pipelines.
  • Indexed categorical variables.
  • Prepared structured inputs for model training.
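The indexing and vector-assembly steps could be sketched as follows. This is an assumption about the pipeline layout, not the repository's actual code: the `_idx` suffix, the `features` output column, and both helper names are illustrative. The PySpark import sits inside the function so the column-splitting helper stays usable without a Spark installation.

```python
def split_columns(dtypes, label_col):
    """Split (name, type) pairs from `df.dtypes` into categorical and numeric features."""
    categorical = [name for name, kind in dtypes if kind == "string" and name != label_col]
    numeric = [name for name, kind in dtypes if kind != "string" and name != label_col]
    return categorical, numeric


def build_feature_stages(dtypes, label_col="label"):
    """Return pipeline stages: one StringIndexer per categorical column, then a VectorAssembler."""
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    categorical, numeric = split_columns(dtypes, label_col)
    indexers = [
        StringIndexer(inputCol=col, outputCol=col + "_idx", handleInvalid="keep")
        for col in categorical
    ]
    assembler = VectorAssembler(
        inputCols=[col + "_idx" for col in categorical] + numeric,
        outputCol="features",
    )
    return indexers + [assembler]


# Example wiring (requires Spark):
# from pyspark.ml import Pipeline
# stages = build_feature_stages(df.dtypes)
# prepared = Pipeline(stages=stages).fit(df).transform(df)
```

Fitting the pipeline on the training split only, then applying it to the test split, avoids leaking category information from held-out data.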

5️⃣ Model Development

Built and trained:

  • Decision Tree
  • Random Forest
  • Gradient Boosted Trees
  • Naive Bayes
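A hedged sketch of how the four classifiers might be instantiated and trained with PySpark MLlib is shown below. The hyperparameter values, column names, and helper names are assumptions chosen for illustration, not settings taken from the project.

```python
def make_classifiers(label_col="label", features_col="features", seed=42, max_bins=64):
    """Build the four classifiers compared in this project, keyed by a short name."""
    from pyspark.ml.classification import (
        DecisionTreeClassifier,
        GBTClassifier,
        NaiveBayes,
        RandomForestClassifier,
    )

    common = {"labelCol": label_col, "featuresCol": features_col}
    tree = {"seed": seed, "maxBins": max_bins, **common}
    return {
        "decision_tree": DecisionTreeClassifier(**tree),
        "random_forest": RandomForestClassifier(numTrees=100, **tree),
        "gbt": GBTClassifier(maxIter=50, **tree),
        # Spark's default (multinomial) Naive Bayes requires non-negative features.
        "naive_bayes": NaiveBayes(**common),
    }


def fit_all(estimators, train_df):
    """Fit every estimator on the same training DataFrame."""
    return {name: estimator.fit(train_df) for name, estimator in estimators.items()}


# Example wiring (requires Spark):
# train_df, test_df = prepared.randomSplit([0.8, 0.2], seed=42)
# models = fit_all(make_classifiers(), train_df)
```

Tree-based learners need `maxBins` to be at least as large as the number of values in the widest categorical feature, which is why it is exposed as a parameter here.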

6️⃣ Model Evaluation

Compared model performance using:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
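The comparison could be driven by Spark's built-in evaluators, roughly as sketched below. The helper names and the `label` column are assumptions. Note that `weightedPrecision` and `weightedRecall` are class-weighted averages, not the precision and recall of the churn class alone.

```python
def evaluate_model(model, test_df, label_col="label"):
    """Score one fitted model on the test split with Spark's evaluators."""
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator,
        MulticlassClassificationEvaluator,
    )

    predictions = model.transform(test_df)
    binary = BinaryClassificationEvaluator(
        labelCol=label_col, rawPredictionCol="rawPrediction", metricName="areaUnderROC"
    )
    multi = MulticlassClassificationEvaluator(labelCol=label_col, predictionCol="prediction")
    scores = {"roc_auc": binary.evaluate(predictions)}
    for metric in ("accuracy", "f1", "weightedPrecision", "weightedRecall"):
        # Override the metric per call instead of building four evaluators.
        scores[metric] = multi.evaluate(predictions, {multi.metricName: metric})
    return scores


def rank_models(results, metric="roc_auc"):
    """Order model names from best to worst on the chosen metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


# Example wiring (requires Spark and fitted models):
# results = {name: evaluate_model(m, test_df) for name, m in models.items()}
# print(rank_models(results))
```

Ranking on ROC-AUC rather than accuracy is a common choice for churn data because it does not depend on a single decision threshold.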

7️⃣ Business Insights

Generated actionable insights to help telecom providers:

  • Identify high-risk customers.
  • Improve retention campaigns.
  • Reduce customer acquisition costs.
  • Increase customer lifetime value.

📈 Key Insights

📉 Churn Drivers

Customer service usage patterns, international plans, and call activity showed strong relationships with churn behavior.

🎯 Model Performance

Ensemble methods such as Random Forest and Gradient Boosted Trees demonstrated superior predictive capability compared to baseline models.

⚡ Scalability

Apache Spark enabled efficient processing and model training on large-scale customer datasets, making the solution suitable for enterprise analytics environments.

💡 Business Impact

Early identification of churn-prone customers allows organizations to deploy targeted retention strategies and improve overall customer satisfaction.


🏆 Skills Demonstrated

  • Big Data Analytics
  • Apache Spark
  • PySpark MLlib
  • Machine Learning
  • Customer Churn Prediction
  • Feature Engineering
  • Predictive Analytics
  • Telecom Analytics
  • Model Evaluation
  • Data Visualization
  • Distributed Computing

🎯 Business Value

By predicting customer churn before it occurs, organizations can proactively engage at-risk customers, optimize retention campaigns, and reduce the revenue lost to customer attrition.

About

Big data analytics project leveraging PySpark, distributed machine learning, and predictive modeling to forecast customer churn, optimize retention efforts, and generate actionable business insights from telecom customer data.
