AIAP Batch 11 Assessment 2

Full name: Chan Sze Kong
Email address: szekong@gmail.com
Github Repository: aiap11

Overview of folder structure

.github/workflows : directory for the GitHub workflow YAML file
- github-actions.yml : defines the workflow that builds and tests the project on each code check-in
src : contains the .py classes for data processing and the machine learning model
- data_procesor.py : the DataProcessor class handles the data processing tasks of the ML pipeline, which include data loading, data cleaning and data preparation.
- model.py : the Model class builds the ML model and handles data preparation, model training, model evaluation and model tuning
eda.ipynb : Jupyter notebook for EDA
main.py : script that coordinates the data processor and model training. It runs an initial evaluation of the Random Forest, Gradient Boost and Ada Boost classifiers, then performs parameter tuning for the best model
requirements.txt : dependency specification
run.sh : bash script for executing the pipeline

Instructions

The following are the instructions for executing the pipeline and modifying any parameters.

Check out the latest code from https://github.com/szekongchan/aiap11-chan-sze-kong-478F
Install dependencies

python -m pip install --upgrade pip
pip install -r requirements.txt
Run the executable bash script

bash ./run.sh

Flow of ML pipeline

The following are the steps of the pipeline

Data Loading
Data Cleaning
Handling of Missing Data
Feature Engineering
Data Normalisation
Feature Selection
One Hot Encoding
Data Split
Model Training
Performance Evaluation
Parameter Tuning

EDA Key Findings

The following features were found to be correlated with the target no_show field (evidence provided by distribution plots and chi-square test p-values)

first_time
branch
room
country
month information: booking_month, arrival_month and checkout_month
date information: arrival_day and checkout_day
price

Features that have low correlation with the target no_show field are

platform
num_adult
num_children
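The chi-square check mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper name `chi_square_p_value` and the toy data are assumptions, and it relies on `scipy.stats.chi2_contingency` applied to a contingency table of feature value against no_show.

```python
from collections import Counter

from scipy.stats import chi2_contingency


def chi_square_p_value(feature_values, target_values):
    """Return the chi-square p-value for independence of a categorical
    feature and the target (hypothetical helper, for illustration)."""
    counts = Counter(zip(feature_values, target_values))
    categories = sorted({feature for feature, _ in counts})
    classes = sorted({target for _, target in counts})
    # Rows are feature categories, columns are target classes.
    table = [[counts.get((c, k), 0) for k in classes] for c in categories]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


# Toy data (made up): first_time is associated with no_show, platform is not.
first_time = ["yes"] * 40 + ["no"] * 40
no_show_a = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30

platform = ["web"] * 40 + ["email"] * 40
no_show_b = [1] * 20 + [0] * 20 + [1] * 20 + [0] * 20

p_associated = chi_square_p_value(first_time, no_show_a)
p_independent = chi_square_p_value(platform, no_show_b)
```

A small p-value (conventionally below 0.05) suggests the feature and no_show are not independent; a large one gives no evidence of association.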

Transformation

price: this feature was stated in both USD and SGD. It was harmonized by parsing the string value into a numeric value. Prices in USD were multiplied by an exchange rate of 1.4 to convert them to SGD, giving a common currency for comparison and analysis. This feature was also used to fill in missing values in the room feature, based on the price clusters discovered during EDA. Subsequently, we dropped the price feature because it is highly correlated with room, and missing price values are more specific and therefore harder to fill than missing room values.
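The price harmonization can be sketched as below. The raw string format (a currency code, an optional "$", then an amount) and the function name `parse_price` are assumptions, since the README does not show the raw values. The 1.4 rate comes from the text; treating the result as SGD is an inference from that rate.

```python
import re
from typing import Optional

USD_TO_SGD = 1.4  # exchange rate stated in this README

# Assumed raw format, e.g. "USD$ 100.50" or "SGD$ 1,200.00".
# The real dataset may use a different layout.
_PRICE_PATTERN = re.compile(r"^\s*(USD|SGD)\s*\$?\s*([\d,]+(?:\.\d+)?)\s*$")


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Return the price in SGD, or None if the value is missing or unparseable."""
    if raw is None:
        return None
    match = _PRICE_PATTERN.match(raw)
    if match is None:
        return None
    currency, amount_text = match.groups()
    amount = float(amount_text.replace(",", ""))
    # USD amounts are converted; SGD amounts are kept as they are.
    return amount * USD_TO_SGD if currency == "USD" else amount
```

Returning None for unparseable values keeps missing prices explicit, so a later step can decide how to handle them.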

Missing Value Handling

Remove records with all features missing
Remove the booking id column
Missing room values are filled using the price clusters discovered during EDA: the price value is used to look up the room type.
Remove the price feature, as it is highly correlated with room and is more difficult to fill. Filling the missing room values was chosen over filling the missing price values.
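The price-to-room lookup might look like the sketch below. The README does not list the price clusters, so the price bands and the room labels (`room_a`, `room_b`, `room_c`) are hypothetical placeholders, as are the function names.

```python
from typing import Optional

# Hypothetical price bands (lower bound inclusive, upper bound exclusive)
# and placeholder room labels. The actual clusters come from the EDA
# notebook and are not listed in this README.
PRICE_BANDS = [
    (0.0, 700.0, "room_a"),
    (700.0, 1000.0, "room_b"),
    (1000.0, 1500.0, "room_c"),
]


def room_from_price(price: Optional[float]) -> Optional[str]:
    """Look up a room type from a price; return None when no band matches."""
    if price is None:
        return None
    for lower, upper, room in PRICE_BANDS:
        if lower <= price < upper:
            return room
    return None


def fill_missing_room(record: dict) -> dict:
    """Return a copy of the record with a missing room filled from its price."""
    filled = dict(record)
    if filled.get("room") is None:
        filled["room"] = room_from_price(filled.get("price"))
    return filled
```

Records that already have a room are left untouched, and records with neither room nor price stay missing rather than being guessed.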

Feature Engineering / Normalization / One Hot Encoding

A new feature total_ppl, the sum of num_adult and num_children, was created
Numerical features ('arrival_day', 'checkout_day', 'num_adults', 'num_children', 'total_ppl') are normalized to the range 0..1
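As an illustration of these two steps, the sketch below derives total_ppl and applies min-max scaling in plain Python. The helper names and sample rows are assumptions, and the column name `num_adults` follows the normalization list above. The actual pipeline may use a library scaler instead; the formula (x - min) / (max - min) is the standard way to map values onto 0..1.

```python
from typing import Dict, List


def add_total_ppl(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the rows with total_ppl = num_adults + num_children."""
    return [
        {**row, "total_ppl": row["num_adults"] + row["num_children"]}
        for row in rows
    ]


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Made-up sample rows for illustration.
rows = add_total_ppl([
    {"num_adults": 1, "num_children": 0},
    {"num_adults": 2, "num_children": 2},
    {"num_adults": 2, "num_children": 0},
])
scaled_total_ppl = min_max_scale([row["total_ppl"] for row in rows])
```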

Model Evaluation

For this binary classifier model, we have selected the following metrics to evaluate the model performance

Precision score: this score provides guidance on keeping false positives low while driving true positives. This is useful as we don't want to flag a genuine booking as a no-show unnecessarily.
Recall score: this score provides guidance on keeping false negatives low while driving true positives, i.e. on not missing bookings that do turn out to be no-shows.
F1 score: this score provides the harmonic mean of precision and recall
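For reference, the three metrics can be computed from raw predictions as in the sketch below. It assumes no_show = 1 is the positive class; the function name and sample labels are illustrative only.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```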

Model Selection and Tuning

The following models were built and evaluated

Random Forest v0 (Benchmark)
Random Forest v1: F1=0.676, Precision=0.717, Recall=0.639
Ada Boost: F1=0.676, Precision=0.717, Recall=0.639
Gradient Boost: F1=0.563, Precision=0.672, Recall=0.484

The best model, Random Forest, was selected for parameter tuning using GridSearchCV.
The final result from the tuning is F1=0.677, Precision=0.718, Recall=0.640
Optimized parameter: {'n_estimators': 160}

During the optimization step, various parameters were tested over different ranges.
Part of the challenge was keeping the search space small so that the search time stayed manageable. After several trials, no single combination was clearly optimal; for example, some parameter sets gave better precision at the expense of recall. We therefore set n_estimators to 160 and left the remaining parameters at their defaults. This setting gives decent precision without sacrificing recall.
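A small grid search of this kind can be sketched with scikit-learn as follows. This is not the project's actual tuning code: the synthetic data from `make_classification`, the candidate values in `param_grid`, the `scoring="f1"` choice and `cv=3` are all assumptions made to keep the example self-contained and quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real pipeline uses the processed booking data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A deliberately small grid: only n_estimators is searched, and every other
# parameter stays at its default. The candidate values are illustrative.
param_grid = {"n_estimators": [40, 80, 160]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # assumed selection metric
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# score() uses the same scorer as the search, here F1 on the held-out split.
test_f1 = search.score(X_test, y_test)
```

Each extra parameter multiplies the number of fits (grid size times number of folds), which is why keeping the grid narrow keeps the search time manageable.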
