AIAP Batch 11 Assessment 2

  • Full name: Chan Sze Kong
  • Email address: szekong@gmail.com
  • GitHub repository: aiap11

Overview of folder structure

  • .github/workflow : directory for the GitHub workflow YAML file
    • github-actions.yml : defines the workflow that builds and tests the project on each code check-in
  • src : contains the .py classes for data processing and the machine learning model
    • data_procesor.py : the DataProcessor class handles the data processing tasks of the ML pipeline, which include data loading, data cleaning and data preparation
    • model.py : the Model class builds the ML model and handles data preparation, model training, model evaluation and model tuning
  • eda.ipynb : Jupyter notebook for EDA
  • main.py : script that coordinates the data processor and model training. It runs an initial evaluation of the Random Forest, Gradient Boost and Ada Boost classifiers, then performs parameter tuning for the best model
  • requirements.txt : dependency specification
  • run.sh : bash script for executing pipeline

Instructions

The following are the instructions for executing the pipeline and modifying any parameters.

  1. Check out the latest code from https://github.com/szekongchan/aiap11-chan-sze-kong-478F
  2. Install dependencies

    python -m pip install --upgrade pip
    pip install -r requirements.txt

  3. Run executable bash script

    bash ./run.sh

Flow of ML pipeline

The following are the steps of the pipeline

  1. Data Loading
  2. Data Cleaning
  3. Handling of Missing Data
  4. Feature Engineering
  5. Data Normalisation
  6. Feature Selection
  7. One Hot Encoding
  8. Data Split
  9. Model Training
  10. Performance Evaluation
  11. Parameter Tuning

EDA Key Findings

The following features were found to be associated with the target no_show field (evidence: distribution plots and chi-square test p-values):

  • first_time
  • branch
  • room
  • country
  • month information: booking_month, arrival_month and checkout_month
  • date information: arrival_day and checkout_day
  • price

Features that show only a weak association with the target no_show field are:

  • platform
  • num_adult
  • num_children
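
The chi-square check mentioned in this section can be sketched without external libraries. This is a minimal illustration, not the notebook's actual code: the helper name `chi_square_statistic` is hypothetical, and the toy values are invented and not taken from the project's dataset.

```python
from collections import Counter


def chi_square_statistic(feature_values, target_values):
    """Return (chi2, dof) for the contingency table of two categorical sequences."""
    observed = Counter(zip(feature_values, target_values))
    row_totals = Counter(feature_values)
    col_totals = Counter(target_values)
    n = len(feature_values)

    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Expected count if the feature and the target were independent.
            expected = row_total * col_total / n
            chi2 += (observed[(row, col)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Toy data, invented for illustration only.
first_time = ["Yes"] * 50 + ["No"] * 50
no_show = [1] * 30 + [0] * 20 + [1] * 10 + [0] * 40

chi2, dof = chi_square_statistic(first_time, no_show)
print(f"chi2={chi2:.3f}, dof={dof}")
```

A larger statistic for a given number of degrees of freedom corresponds to a smaller p-value. This sketch computes only the statistic (without continuity correction); a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.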

Transformation

  • price: this feature was stated in both USD and SGD. It was harmonized by parsing the string value into a numeric value; prices in USD were multiplied by an exchange rate of 1.4 so that all prices share a common currency (SGD) for comparison and analysis. The price was also used to fill in missing values of the room feature, based on the price clusters discovered during EDA. Subsequently, the price feature was dropped because it is highly correlated with room, and missing price values are more specific and harder to fill than missing room values.
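
The price harmonization described above might look like the following sketch. The raw string format (for example `"USD$ 100.00"`) and the function name `parse_price_sgd` are assumptions made for illustration; only the exchange rate of 1.4 comes from this README.

```python
import re

USD_TO_SGD = 1.4  # exchange rate stated in this README


def parse_price_sgd(raw):
    """Convert a price string to a float in SGD, or None if it cannot be parsed."""
    if raw is None:
        return None
    # Assumed format: a currency label followed by a number, e.g. "SGD$ 492.98".
    match = re.search(r"(\d+(?:\.\d+)?)", raw.replace(",", ""))
    if match is None:
        return None
    amount = float(match.group(1))
    if "USD" in raw.upper():
        amount *= USD_TO_SGD
    return round(amount, 2)


print(parse_price_sgd("SGD$ 492.98"))
print(parse_price_sgd("USD$ 100.00"))
print(parse_price_sgd(None))
```

Returning `None` for unparseable input keeps missing prices visible, so they can be handled in the missing-value step rather than silently becoming zero.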

Missing Value Handling

  • Records with all features missing are removed.
  • The booking id column is removed.
  • Missing room values are filled using the price clusters discovered during EDA: the price value is used to look up the room type.
  • The price feature is then removed, as it is highly correlated with room and is harder to fill. Filling the missing room values was chosen over filling the missing price values.
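
A price-to-room lookup of the kind described above could be sketched as follows. The price bands and the room labels (`room_a`, `room_b`, `room_c`) are hypothetical placeholders; the real cut-offs come from the EDA and are not listed in this README.

```python
# Hypothetical price bands in SGD as (low, high, room type); placeholders only.
PRICE_BANDS = [
    (0.0, 700.0, "room_a"),
    (700.0, 1000.0, "room_b"),
    (1000.0, float("inf"), "room_c"),
]


def fill_room(room, price_sgd):
    """Keep an existing room value; otherwise look it up from the price band."""
    if room is not None:
        return room
    if price_sgd is None:
        return None  # nothing to infer from
    for low, high, room_type in PRICE_BANDS:
        if low <= price_sgd < high:
            return room_type
    return None


records = [
    {"room": "room_a", "price": 450.0},
    {"room": None, "price": 820.0},
    {"room": None, "price": None},
]
filled = [fill_room(r["room"], r["price"]) for r in records]
print(filled)
```

Records with neither a room nor a price stay missing in this sketch, so they still need a separate decision (for example, removal).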

Feature Engineering / Normalization / One Hot Encoding

  • A new feature, total_ppl, was created as the sum of num_adult and num_children.
  • Numerical features ('arrival_day', 'checkout_day', 'num_adults', 'num_children', 'total_ppl') are normalized to the range 0..1.
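
As a minimal sketch of these two steps, assuming plain min-max scaling is what "normalized to 0..1" refers to, the derived feature and its scaling can be written as below. The sample values and the helper name `min_max_scale` are invented for illustration.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy values, invented for illustration only.
num_adult = [1, 2, 2, 1]
num_children = [0, 1, 2, 0]

total_ppl = [a + c for a, c in zip(num_adult, num_children)]
scaled_total_ppl = min_max_scale(total_ppl)

print(total_ppl)
print(scaled_total_ppl)
```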

Model Evaluation

For this binary classifier model, we have selected the following metrics to evaluate the model performance

  • Precision score: this score guides us in keeping false positives low while increasing true positives. This is useful because we do not want to flag a genuine booking as a no-show unnecessarily.
  • Recall score: this score guides us in keeping false negatives low while increasing true positives, so that actual no-shows are not missed.
  • F1 score: this score is the harmonic mean of precision and recall.
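
For reference, the three metrics can be computed from confusion-matrix counts as in the sketch below. The counts are made up for illustration and are not results from this project; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 60 no-shows caught, 20 genuine bookings flagged, 40 no-shows missed.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores, which is why a model with high precision but poor recall still gets a modest F1.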

Model Selection and Tuning

The following models were built and evaluated:

  • Random Forest v0 (Benchmark)
  • Random Forest v1: F1=0.676, Precision=0.717, Recall=0.639
  • Ada Boost: F1=0.676, Precision=0.717, Recall=0.639
  • Gradient Boost: F1=0.563, Precision=0.672, Recall=0.484

The best model, Random Forest, was selected for parameter tuning using GridSearchCV.
The final result from the tuning is F1=0.677, Precision=0.718, Recall=0.640, with the optimized parameter {'n_estimators': 160}.

During the optimization step, various parameters were tested over different ranges.
Part of the challenge was keeping the search space small so that the search ran in a manageable time. After several trials, no single combination was clearly optimal; for example, some parameter sets gave better precision at the expense of recall. We therefore set n_estimators to 160 and left the remaining parameters at their defaults. This setting gives decent precision without sacrificing recall.
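
A small search of this kind might be set up as in the sketch below. It uses synthetic data from `make_classification` rather than the project's dataset, and the candidate values other than 160, the F1 scoring choice and the 3-fold setting are assumptions for illustration, not the settings used in main.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real pipeline would use the processed booking features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A deliberately small grid keeps the search time manageable.
param_grid = {"n_estimators": [40, 80, 160]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumed scoring metric
    cv=3,          # assumed number of folds
)
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_)
print(f"test F1={test_f1:.3f}")
```

Searching one parameter at a time, as here, trades coverage for speed: the number of fits grows with the product of the candidate counts, so each extra parameter in the grid multiplies the run time.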
