WHY ARE CUSTOMERS CHURNING

[Project Description] [Project Planning] [Key Findings] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Statistical Analysis] [Modeling] [Conclusion]

Project Description:

In this project we use the Telco data set. While exploring the data, we use statistical tests to find which features are dependent on or independent of churn, then feed the selected features into a model that predicts whether a customer will churn. The goal is to beat baseline accuracy using one of three algorithms: K-Nearest Neighbors, Random Forest, and Logistic Regression.

[Back to top]

Project Planning:

[Back to top]

Objective

Predict whether a customer will churn by training three algorithms and evaluating the best-scoring one on the test set.

Target variable

The target variable of this project is churn.

Need to haves (Deliverables):

- Explore the data.
- Run features through statistical tests.
- Select features for modeling.
- Run features through three different algorithms.

Nice to haves (With more time):

Explore additional features to see whether accuracy can be improved.

Key Findings:

The longer a customer has been with the company, the less likely they are to churn.
Being male or female was independent of churn.
The higher the monthly charges, the more likely the customer is to have churned.

[Back to top]

Data Dictionary

[Back to top]

Data Used

Attribute	Definition	Data Type
total_charges	total accumulated charges	float
monthly_charges	customer's monthly charges	float
tenure	months a customer has been with company	int
gender_female	customer is or is not female	int
gender	sex of customer	object
senior_citizen	customer status: senior or not senior	int64
partner	has partner or does not	int64
dependents	does customer have dependents	int64
phone_service	customer purchased phone service	int64
multiple_lines	customer has multiple lines	object
online_security	customer signed up for online security	object
online_backup	customer opted in to online backup	object
device_protection	customer is enrolled in device protection	object
tech_support	customer opted in to tech support	object
streaming_tv	customer signed up for streaming television	object
streaming_movies	customer signed up for streaming movies	object
paperless_billing	enrolled in e-bill	int64
churn	customer is active or is not active	object
contract_type	service contract customer selected	object
internet_service_type	info for what kind of internet service customer chose	object
payment_type	customer's preferred payment method	object
has_churned	whether a customer has churned	int64
multiple_lines_No phone service	multiple phone lines	uint8
multiple_lines_Yes	multiple lines	uint8
online_security_No internet service	customer does not have online security	uint8
online_security_Yes	customer has online security	uint8
online_backup_No internet service	customer does not have online back up	uint8
online_backup_Yes	customer has online back up	uint8
device_protection_No internet service	customer does not have device protection	uint8
device_protection_Yes	customer does have device protection	uint8
tech_support_No internet service	customer does not have tech support	uint8
tech_support_Yes	customer has tech support	uint8
streaming_tv_No internet service	customer does not have tv streaming	uint8
streaming_tv_Yes	customer can stream tv	uint8
streaming_movies_No internet service	customer cannot stream movies	uint8
streaming_movies_Yes	customer is able to stream movies	uint8
contract_type_One year	customer is on a one year contract	uint8
contract_type_Two year	customer is on a two year contract	uint8
internet_service_type_Fiber optic	customer has fiber internet	uint8
internet_service_type_None	customer does not have internet	uint8
payment_type_Credit card (automatic)	customer pays via credit card	uint8
payment_type_Electronic check	customer pays via e-check	uint8
payment_type_Mailed check	customer pays with mail-in check	uint8

Data Acquisition and Preparation

[Back to top]

Wrangle steps:

- Dropped unwanted columns.
- Created dummy variables for certain features.
- Replaced string values with numeric values where conversion was needed.
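The wrangle steps above might look roughly like the following pandas sketch. The toy rows, the `customer_id` column, the `prep_telco` helper, and the choice to turn blank `total_charges` strings into 0 are all assumptions made for illustration; they are not taken from prepare.py.

```python
import pandas as pd

# Toy rows standing in for the Telco data; the values are illustrative only.
df = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3"],          # hypothetical unwanted column
    "contract_type": ["Month-to-month", "One year", "Two year"],
    "total_charges": ["29.85", " ", "108.15"],  # stored as strings, one blank
    "churn": ["Yes", "No", "No"],
})


def prep_telco(df):
    """Hypothetical helper: drop, convert, and encode columns."""
    # Step 1: drop a column that is not wanted for modeling.
    df = df.drop(columns=["customer_id"])
    # Step 2: replace strings with numeric values (blank becomes 0 here,
    # which is an assumption about how missing charges are handled).
    df["total_charges"] = (
        df["total_charges"].str.strip().replace("", "0").astype(float)
    )
    df["has_churned"] = df["churn"].map({"No": 0, "Yes": 1})
    # Step 3: create dummy columns, dropping the first level of each feature.
    dummies = pd.get_dummies(df[["contract_type"]], drop_first=True)
    return pd.concat([df, dummies], axis=1)


prepped = prep_telco(df)
```

Dropping the first level is what yields column names such as `contract_type_One year` and `contract_type_Two year`, as listed in the data dictionary.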

Data Exploration:

[Back to top]

Python files used for exploration:
- prepare.py
- acquire.py

Takeaways from exploration:

Four features were chosen for statistical testing: Tenure, Total Charges, Monthly Charges, and Gender.

Statistical Analysis

[Back to top]

Stats Test 1: Chi2 Test

The chi-square test is a statistical method used to examine the relationship between categorical variables.

By using the chi-square test, we aim to determine whether there is a significant relationship between the independent variable and the dependent variable. The test helps us assess if the observed frequencies of the categorical variables differ significantly from what we would expect under the assumption of independence.

To perform the chi-square test in Python, we can use the chi2_contingency function from the scipy.stats module. This function takes a contingency table of observed frequencies as input and returns the chi-square statistic (chi2), the p-value (p), the degrees of freedom, and the expected frequencies. The chi-square statistic measures how far the observed frequencies deviate from the frequencies expected under independence, while the p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
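As a minimal sketch of that call, the example below builds a contingency table with `pd.crosstab` and passes it to `chi2_contingency`. The eight toy rows are made up for illustration and are not drawn from the Telco data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: churn is split identically across both genders.
df = pd.DataFrame({
    "gender": ["Female", "Male"] * 4,
    "churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "No"],
})

# Contingency table of observed counts: rows are gender, columns are churn.
observed = pd.crosstab(df["gender"], df["churn"])

# Returns the statistic, p-value, degrees of freedom, and expected counts.
chi2, p, dof, expected = chi2_contingency(observed)
```

Because the observed counts here equal the expected counts exactly, the statistic is zero and the test gives no evidence against independence.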

Stats Test 2: Independent T-Test

The independent t-test is a statistical method that compares the mean of a continuous variable between two independent groups, here customers who churned and customers who did not.

By using the independent t-test, we aim to determine whether there is a significant association between a two-level categorical variable and a continuous variable.
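A small sketch of an independent t-test with `scipy.stats.ttest_ind` follows. The two tenure samples are invented for illustration, and the call uses the function's default of assuming equal variances; the source does not say which setting the project used.

```python
from scipy.stats import ttest_ind

# Made-up tenure values (months) for two groups of customers.
tenure_churned = [1, 2, 3, 5, 8]
tenure_retained = [24, 36, 48, 60, 72]

# Two-sided test of whether the group means differ.
t_stat, p_value = ttest_ind(tenure_churned, tenure_retained)

alpha = 0.05
reject_null = p_value < alpha
```

A negative t statistic here means the first group (churned) has the lower mean tenure.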

Hypothesis

In summary, the hypotheses for the independent t-test and chi2 test can be stated as follows:

Null Hypothesis (H0): Tenure is not associated with churn. Alternative Hypothesis (H1): Tenure is associated with churn.

2nd Hypothesis

Null Hypothesis (H0): Monthly charges is not associated with churn. Alternative Hypothesis (H1): Monthly charges is associated with churn.

3rd Hypothesis

Null Hypothesis (H0): Total charges is not associated with churn. Alternative Hypothesis (H1): Total charges is associated with churn.

4th Hypothesis

Null Hypothesis (H0): Sex is independent of churn. Alternative Hypothesis (H1): Sex is not independent of churn.

Confidence level and alpha value:

I established a 95% confidence level.
alpha = 1 - confidence, therefore alpha is 0.05.

Results:

Feature	P-Value	Less than Alpha
Tenure	4.577513863553669e-115	True
Monthly Charges	1.0736272928972876e-35	True
Total Charges	1.2955473562990627e-34	True
Gender	1.0	False

Summary:

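Tenure, Total Charges, and Monthly Charges each have a p-value less than alpha (0.05), so we reject the null hypothesis for those features. Gender has a p-value greater than alpha, so we fail to reject the null hypothesis that sex is independent of churn. We will use Tenure, Total Charges, and Monthly Charges for modeling.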
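The selection rule can be sketched in a few lines of plain Python using the p-values reported in the results table. The variable names are illustrative.

```python
confidence = 0.95
alpha = 1 - confidence  # approximately 0.05

# P-values as reported in the results table.
p_values = {
    "Tenure": 4.577513863553669e-115,
    "Monthly Charges": 1.0736272928972876e-35,
    "Total Charges": 1.2955473562990627e-34,
    "Gender": 1.0,
}

# A feature is kept for modeling when its p-value is below alpha.
less_than_alpha = {feature: p < alpha for feature, p in p_values.items()}
selected = [feature for feature, keep in less_than_alpha.items() if keep]
```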

Modeling:

[Back to top]

Baseline

Baseline Results:

Model	Train Score
Baseline	0.73
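One common way to get a baseline for a classification problem is to always predict the most frequent class. The sketch below assumes that approach; the source does not state how its baseline was computed. The `baseline_accuracy` helper and the toy labels are hypothetical, with the class split chosen only to mirror the reported 0.73.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Toy training labels: 73 customers who stayed (0) and 27 who churned (1).
y_train = [0] * 73 + [1] * 27

label, accuracy = baseline_accuracy(y_train)
```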

Selected features to input into models:
- features = Tenure, Total Charges and Monthly Charges

Models

Model 1: K-Nearest Neighbors (KNN)

The KNN model had a train accuracy of 82%, 9 percentage points over baseline, and a validation score of 78%.

Model 2: Random Forest (RF)

The Random Forest model had a train accuracy of 81%, 8 percentage points over baseline, and a validation score of 78%.

Model 3: Logistic Regression

The Logistic Regression model had a train accuracy of 79%, 6 percentage points over baseline, and a validation score of 78%.
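The sketch below shows one way to fit the three model types with scikit-learn and collect train and validation accuracy for each. It runs on synthetic data from `make_classification` with three features, not on the Telco data. The split size, random seeds, and hyperparameters are assumptions chosen for illustration and are not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for three numeric features and a binary target.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative hyperparameters, not the values used in the project.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(max_depth=5, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Map each model name to (train accuracy, validation accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_val, y_val))
```

Scoring on a held-out validation split, rather than on the training data alone, is what makes an overfit model visible.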

Selecting the Best Model:

The table below compares all modeling results:

Model	Train Score	Validation Score
Baseline	0.73
KNN	0.82	0.78
Random Forest	0.81	0.78
Logistic Regression	0.79	0.78

KNN performed best.
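As a sketch of the comparison, the snippet below ranks the models by validation score and breaks ties on train score, using the numbers from the table. It also computes each model's margin over baseline in percentage points. The tie-break rule is coded from the rationale given in the conclusion; the variable names are illustrative.

```python
baseline = 0.73

# (train score, validation score) as reported in the comparison table.
results = {
    "KNN": (0.82, 0.78),
    "Random Forest": (0.81, 0.78),
    "Logistic Regression": (0.79, 0.78),
}

# Rank by validation score first, then by train score to break ties.
best_model = max(results, key=lambda name: (results[name][1], results[name][0]))

# Train accuracy margin over baseline, in percentage points.
points_over_baseline = {
    name: round((train - baseline) * 100) for name, (train, _) in results.items()
}
```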

Testing the Model

Model Testing Results

Model	Max Depth	Train Score	Validation Score	Test Score
KNN	9	0.82	0.78	0.78

Conclusion:

The KNN model had the highest train accuracy at 82%, 9 percentage points over the baseline.

The Random Forest model had a slightly lower train accuracy of 81%, 8 percentage points over the baseline.

The Logistic Regression model had a train accuracy of 79%, 6 percentage points over the baseline.

Because all three models had the same validation score (78%), KNN was chosen for its slight training-accuracy advantage.

On the test set, the KNN model scored 0.78.

[Back to top]
