📞 Big Data Customer Churn Prediction using PySpark

📌 Overview

Customer churn is one of the most critical challenges faced by telecom companies, directly impacting revenue, customer acquisition costs, and long-term business growth.

This project leverages Apache Spark and PySpark MLlib to build a scalable machine learning pipeline capable of identifying customers at risk of churning. Using distributed data processing and predictive modeling techniques, the solution analyzes customer behavior patterns, service usage metrics, and subscription characteristics to proactively predict churn and support customer retention strategies.

The project demonstrates an end-to-end big data analytics workflow, including data preprocessing, feature engineering, class imbalance handling, model training, evaluation, and performance comparison across multiple machine learning algorithms.

🚀 Key Features

⚡ Distributed Data Processing

Built using Apache Spark and PySpark for scalable analysis of large customer datasets.
Efficiently processes telecom customer records using distributed computing.

🧹 Data Preprocessing & Feature Engineering

Cleaned and transformed raw customer data.
Encoded categorical variables into machine learning-ready features.
Removed redundant attributes and optimized feature selection.

⚖️ Churn Class Balancing

Addressed class imbalance using stratified sampling techniques.
Improved model reliability when identifying high-risk customers.
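
One way the balancing step could look is sketched below. It assumes downsampling of the majority class with `DataFrame.sampleBy`; the column name `churn`, the seed, and the helper names are illustrative assumptions, not taken from the project code.

```python
def balancing_fractions(class_counts):
    """Map each class to the fraction of its rows to keep so that every
    class is downsampled to roughly the size of the rarest class."""
    smallest = min(class_counts.values())
    return {label: smallest / count for label, count in class_counts.items()}


def balance_by_label(df, label_col="churn", seed=42):
    """Downsample a Spark DataFrame per class with DataFrame.sampleBy.

    `df` is expected to be a pyspark.sql.DataFrame; `label_col` is an
    assumed column name.
    """
    counts = {
        row[label_col]: row["count"]
        for row in df.groupBy(label_col).count().collect()
    }
    return df.sampleBy(label_col, fractions=balancing_fractions(counts), seed=seed)


# Illustrative counts only (not taken from the project's data).
print(balancing_fractions({0: 900, 1: 100}))
```

`sampleBy` samples each stratum independently, so the resulting class sizes are approximately, not exactly, equal. Balancing would normally be applied to the training split only, so that the test set keeps the original class ratio.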

🤖 Machine Learning Models

Implemented and compared multiple classification algorithms:

Decision Tree Classifier
Random Forest Classifier
Gradient Boosted Trees (GBT)
Naive Bayes

📈 Model Evaluation & Comparison

Evaluated models using:

Accuracy
Precision
Recall
F1 Score
ROC-AUC
Confusion Matrix
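
For reference, the first four metrics can all be derived from the four cells of a binary confusion matrix. The sketch below treats churn as the positive class; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts, with churn taken as the positive class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, fp=10, fn=20, tn=130))
```

On an imbalanced churn dataset, accuracy alone can look high even when recall on churners is poor, which is why precision, recall, and F1 are reported alongside it.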

📊 Advanced Visualization

Pairwise feature analysis
Correlation insights
ROC Curve comparison
Confusion Matrix heatmaps
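
To make the ROC comparison concrete, here is a minimal pure-Python sketch of how ROC points and the area under the curve can be computed from labels and predicted scores collected to the driver. The function names and the toy inputs are assumptions for illustration; the resulting points could be passed to Matplotlib for plotting.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, sweeping the
    decision threshold from the highest score to the lowest.

    labels: iterable of 0/1 values (1 = churn); scores: predicted churn scores.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    positives = sum(label for _, label in pairs)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to draw a ROC curve")

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all rows tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Toy labels and scores, for illustration only.
curve = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(curve, auc(curve))
```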

🔧 Tech Stack

Big Data Technologies

Apache Spark
PySpark MLlib

Programming

Python

Machine Learning

Decision Trees
Random Forest
Gradient Boosted Trees
Naive Bayes

Data Analysis & Visualization

Pandas
NumPy
Matplotlib
Seaborn

Model Validation

Cross Validation
ROC-AUC Analysis
Classification Metrics

📂 Project Workflow

1️⃣ Data Ingestion

Loaded telecom customer datasets using Spark DataFrames.
Performed schema inference and data exploration.
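
A minimal sketch of this step, assuming the data is a CSV file with a header row. The helper names, the application name, and the `churn` column are illustrative assumptions rather than the project's actual code.

```python
def build_session(app_name="churn-prediction"):
    """Create (or reuse) a SparkSession.

    The import is deferred so the helpers below can be defined on a machine
    without Spark installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_customers(spark, path):
    """Read a CSV file into a Spark DataFrame, letting Spark infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def explore(df, label_col="churn"):
    """Print the inferred schema and the class distribution of the label."""
    df.printSchema()
    df.groupBy(label_col).count().show()
```

Typical usage would be `spark = build_session()` followed by `train = load_customers(spark, "Training_data_set.csv")`, where the path is a placeholder for wherever the data actually lives. Schema inference costs an extra pass over the file, so an explicit schema may be preferable for very large inputs.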

2️⃣ Data Preparation

Removed redundant variables.
Encoded categorical features.
Converted churn labels into a machine-learning-compatible format.

3️⃣ Exploratory Data Analysis

Analyzed customer behavior patterns.
Investigated feature relationships.
Visualized churn-related attributes.

4️⃣ Feature Engineering

Created feature vectors using Spark ML pipelines.
Indexed categorical variables.
Prepared structured inputs for model training.
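
The steps above can be sketched as a Spark ML `Pipeline` that indexes categorical columns and assembles a single feature vector. The column names in the usage note and the `_idx` suffix are hypothetical, and the label column is assumed to be of string or numeric type.

```python
def assembler_inputs(categorical_cols, numeric_cols, suffix="_idx"):
    """Names of the columns fed to the VectorAssembler: indexed categorical
    columns first, then the numeric columns unchanged."""
    return [c + suffix for c in categorical_cols] + list(numeric_cols)


def build_feature_pipeline(categorical_cols, numeric_cols, label_col="churn"):
    """Build an unfitted Pipeline producing `features` and `label` columns.

    Requires an active Spark session; imports are deferred so this module
    loads without Spark installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # One indexer per categorical column; unseen categories get an extra index.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
        for c in categorical_cols
    ]
    # Index the churn label into a numeric `label` column.
    label_indexer = StringIndexer(inputCol=label_col, outputCol="label")
    assembler = VectorAssembler(
        inputCols=assembler_inputs(categorical_cols, numeric_cols),
        outputCol="features",
    )
    return Pipeline(stages=indexers + [label_indexer, assembler])
```

For example, `build_feature_pipeline(["international_plan"], ["total_day_minutes"]).fit(train).transform(train)` would add `features` and `label` columns to `train`. By default `StringIndexer` assigns index 0 to the most frequent value, so it is worth checking which index the churn class receives before interpreting precision and recall.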

5️⃣ Model Development

Built and trained:

Decision Tree
Random Forest
Gradient Boosted Trees
Naive Bayes
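
A hedged sketch of how the four classifiers might be instantiated and trained with PySpark MLlib. The hyperparameter values are placeholders, not the project's settings, and the `label` and `features` column names are assumptions.

```python
def build_classifiers(seed=42):
    """Return the four estimators keyed by display name.

    Requires an active Spark session; imports are deferred so this module
    loads without Spark installed. Hyperparameters are placeholder values.
    """
    from pyspark.ml.classification import (
        DecisionTreeClassifier,
        GBTClassifier,
        NaiveBayes,
        RandomForestClassifier,
    )

    common = {"labelCol": "label", "featuresCol": "features"}
    return {
        "Decision Tree": DecisionTreeClassifier(seed=seed, **common),
        "Random Forest": RandomForestClassifier(numTrees=100, seed=seed, **common),
        "Gradient Boosted Trees": GBTClassifier(maxIter=50, seed=seed, **common),
        "Naive Bayes": NaiveBayes(smoothing=1.0, **common),
    }


def fit_all(estimators, train_df):
    """Fit every estimator on the same training DataFrame."""
    return {name: est.fit(train_df) for name, est in estimators.items()}
```

Note that MLlib's default (multinomial) Naive Bayes expects non-negative feature values, so any negative features would need to be shifted or rescaled before that model is trained.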

6️⃣ Model Evaluation

Compared model performance using:

Accuracy
Precision
Recall
F1 Score
ROC-AUC
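
These metrics can be obtained from MLlib's built-in evaluators. The sketch below is one possible arrangement, assuming the default `prediction` and `rawPrediction` output columns; the helper names are hypothetical. The precision and recall here are the class-weighted variants, which is an assumption about how the comparison was made.

```python
def evaluate_predictions(predictions, label_col="label"):
    """Compute the comparison metrics for one model's predictions DataFrame.

    Imports are deferred so this module loads without Spark installed.
    """
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator,
        MulticlassClassificationEvaluator,
    )

    metrics = {
        "roc_auc": BinaryClassificationEvaluator(
            labelCol=label_col,
            rawPredictionCol="rawPrediction",
            metricName="areaUnderROC",
        ).evaluate(predictions)
    }
    # Weighted precision/recall average the per-class values by class frequency.
    for name, metric in [
        ("accuracy", "accuracy"),
        ("precision", "weightedPrecision"),
        ("recall", "weightedRecall"),
        ("f1", "f1"),
    ]:
        metrics[name] = MulticlassClassificationEvaluator(
            labelCol=label_col, predictionCol="prediction", metricName=metric
        ).evaluate(predictions)
    return metrics


def rank_models(results, metric="roc_auc"):
    """Order model names from best to worst on the chosen metric.

    results: dict mapping model name -> metrics dict from evaluate_predictions.
    """
    return sorted(results, key=lambda name: results[name][metric], reverse=True)
```

Calling `evaluate_predictions(model.transform(test))` for each fitted model and passing the collected results to `rank_models` gives a simple leaderboard on the held-out test set.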

7️⃣ Business Insights

Generated actionable insights to help telecom providers:

Identify high-risk customers.
Improve retention campaigns.
Reduce customer acquisition costs.
Increase customer lifetime value.

📈 Key Insights

📉 Churn Drivers

Customer service usage patterns, international plans, and call activity showed strong relationships with churn behavior.

🎯 Model Performance

Ensemble methods such as Random Forest and Gradient Boosted Trees demonstrated superior predictive capability compared to baseline models.

⚡ Scalability

Apache Spark enabled efficient processing and model training on large-scale customer datasets, making the solution suitable for enterprise analytics environments.

💡 Business Impact

Early identification of churn-prone customers allows organizations to deploy targeted retention strategies and improve overall customer satisfaction.

🏆 Skills Demonstrated

Big Data Analytics
Apache Spark
PySpark MLlib
Machine Learning
Customer Churn Prediction
Feature Engineering
Predictive Analytics
Telecom Analytics
Model Evaluation
Data Visualization
Distributed Computing

🎯 Business Value

By predicting customer churn before it occurs, organizations can proactively engage at-risk customers, optimize retention campaigns, and reduce the revenue lost to customer attrition.

