GitHub - PedroHCouto/Santander-Case: Project developed together with Santander Bank. Here you will find: Classification, Feature Selection, ML, Bayesian Optimization and Clustering.

SANTANDER DATA MASTERS CASE

This project was made as part of the mentorship program that I received for winning the Data Masters Program of Santander Bank

About The Project

The Santander group is a global banking group, led by Banco Santander S.A., the largest bank in the euro area. It has its origin in Santander, Cantabria, Spain. As part of their empowerment culture, they create the Data Master program.

This repository contains the 3 Jupyter Notebooks used to solve the 3 parts of the challenge purpose by Santander.

Competition website: https://www.becas-santander.com/en/program/santanderdatamasters-cienciadedados2020

The Program

Santander Data Masters - Data Science is a learning and certification program provided by the corporate university of Academia Santander. The program consists of a mix of training and competition, where Santander University provides content and selects the best participants.

Competition

The program has 3 evaluation phases, which are:

Phase 1: the candidates must take the Cultural Fit, General Knowledge and Logic Reasoning assessments. In this phase, there were about 3200 candidates and only 100 were selected.
Phase 2: the candidates have 33 days to study all material selected by Santander Academy. In the end, there was a technical assessment and it is needed a performance over 50% to get certified.
Phase 3: the top 3 participants receive the opportunity to do a Project provided by Santander as well as a mentorship session with Santander staff.

Learning

After selecting the 100 participants, the academy provided structured learning material covering the most important knowledge and skills that a data scientist should have. The following topics were addressed:

Probability and Statistics: from basics to advanced concepts;
SQL and NoSQL;
Big Data concepts and applications;
Programming;
Regression: Linear, Multiple, Ridge & Lasso, Variable Selection
Classification: Logistic Regression, Decision Tree, Random Forest and Naive Bayes;
Clustering: K-means, Hierarchical Clustering (divisive and agglomerative), Latent Dirichlet Allocation;
Performance Metrics.

Case Introduction

As the winner of the program I receive the opportunity to solve a case provided by Santander's Staff.

The bank wants to know better when to apply its customer retention program. The program should apply only to those customers that are really unsatisfied (Target = 1) and customers not unsatisfied (Target = 0) should be ignored. The 3 problems purposed round around to maximize the expected profit by reducing the misclassification amount.

The problems are:

Part A: Classification

In this part, we want to classify if the customer is unsatisfied or not. The retention program cost $10 for each customer and an effective application (in customers that are really unsatisfied) returns a profit of $100. In the classification task we can have the following scenarios:

1. False Positive(FP): classify the customer as UNSATISFIED but he is SATISFIED. Cost: $ 10, Earn: $ 0.
2. False Negative(FN): classify the customer as SATISFIED but he is DISSATISFIED. Cost: $ 0, Earn: $ 0
3. True Positive(TP): classify the customer as UNSATISFIED and he is UNSATISFIED. Cost: $ 10, Earn: $ 100.
4. True Negative(TN): classify the customer as SATISFIED and he is SATISFIED. Cost: $ 0, Earn: $ 0.

In summary we want to minimize the rate of FP and FN as well as maximize the rate of TP. To do so, we will use the metric AUC of ROC Curve, because it returns to us the best model as well as the best threshold.

Part B: Net Promoter Score (NPS)

This task consists of giving a rating from 1 to 5 for each customer of the test base, respecting the variable 'TARGET', that is, their level of satisfaction, 1 being the most dissatisfied and 5 the most satisfied. The retention program should only be applied on customers we NPS 1.

It means we should create system to evaluate the Net Promoter Score of each customer.

Part C: Clustering

The third task is to find the three natural groups that have the highest expected profits per customer. That means the 3 groups that have the highest amount of unsatisfied customers

Built With

Jupyter Notebook
Python
Scikit-learn

Contact

E-mail: pedrocouto39@gmail.com
LinkedIn: https://www.linkedin.com/in/pedrocouto39/?locale=en_US
Kaggle: https://www.kaggle.com/pedrocouto39
XING: https://www.xing.com/profile/Pedro_Couto8/cv

Project Link: https://github.com/PedroHCouto/Santander-Case

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
Data		Data
images		images
Part A - Classification.ipynb		Part A - Classification.ipynb
Part B - Net Promoter Score.ipynb		Part B - Net Promoter Score.ipynb
Part C - Clustering.ipynb		Part C - Clustering.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SANTANDER DATA MASTERS CASE

Table of Contents

About The Project

The Program

Competition

Learning

Case Introduction

Part A: Classification

Part B: Net Promoter Score (NPS)

Part C: Clustering

Built With

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SANTANDER DATA MASTERS CASE

Table of Contents

About The Project

The Program

Competition

Learning

Case Introduction

Part A: Classification

Part B: Net Promoter Score (NPS)

Part C: Clustering

Built With

Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages