- Full name: Chan Sze Kong
- Email address: szekong@gmail.com
- GitHub repository: aiap11
- .github/workflow : directory for the GitHub workflow YAML file
- github-actions.yml : defines the workflow that builds and tests the project on each code check-in
- src : contains .py classes for handling data processing and machine learning model
- data_procesor.py : the DataProcessor class handles the data processing tasks of the ML pipeline, which include data loading, data cleaning and data preparation.
- model.py : the Model class builds the ML model and provides the data preparation, model training, model evaluation and model tuning functions.
- eda.ipynb : Jupyter notebook for EDA (exploratory data analysis)
- main.py : script that coordinates the data processor and model training execution. It runs an initial evaluation of the Random Forest, Gradient Boost and Ada Boost classifiers, then performs parameter tuning on the best model.
- requirements.txt : dependency specification
- run.sh : bash script for executing pipeline
The following are the instructions for executing the pipeline and modifying any parameters.
- Check out the latest code from https://github.com/szekongchan/aiap11-chan-sze-kong-478F
- Install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
- Run executable bash script
bash ./run.sh
The following are the steps of the pipeline
- Data Loading
- Data Cleaning
- Handling of Missing Data
- Feature Engineering
- Data Normalisation
- Feature Selection
- One Hot Encoding
- Data Split
- Model Training
- Performance Evaluation
- Parameter Tuning
The following features were found to be associated with the target no_show field (evidence provided by distribution plots and chi-square test p-values)
- first_time
- branch
- room
- country
- month information: booking_month, arrival_month and checkout_month
- date information: arrival_day and checkout_day
- price
Features that have a low association with the target no_show field are
- platform
- num_adult
- num_children
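As a rough illustration of the chi-square check mentioned above, the sketch below tests whether one categorical feature is independent of no_show. The small DataFrame is invented for illustration and is not the project's data, the helper name `chi_square_p_value` is hypothetical, and `scipy` is assumed to be installed (it may not be listed in requirements.txt).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, invented for illustration only (not the real booking data).
df = pd.DataFrame({
    "first_time": ["Yes"] * 60 + ["No"] * 40,
    "no_show":    [1] * 40 + [0] * 20 + [1] * 5 + [0] * 35,
})

def chi_square_p_value(frame, feature, target="no_show"):
    """Return the p-value of a chi-square test of independence."""
    # Contingency table: one row per feature value, one column per target class.
    contingency = pd.crosstab(frame[feature], frame[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value

p = chi_square_p_value(df, "first_time")
print(f"first_time vs no_show: p-value = {p:.4g}")
```

A small p-value (for example below 0.05) suggests that the feature and the target are not independent. It does not measure how strong the relationship is.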
Transformation
- price: this feature was stated in both USD and SGD. It was harmonized by parsing the string value into a numeric value. Prices in USD were multiplied by an exchange rate of 1.4 so that the comparison and analysis are based on a common currency (SGD). This feature was also used to fill in the missing values of the room feature, based on the price clusters discovered during EDA. Subsequently, we dropped the price feature because it is highly correlated with room, and missing price values are harder to fill accurately than missing room values.
- Remove records where all features are missing
- Remove the booking id column
- Missing room values are filled using the price clusters discovered during EDA: the price value is used to look up the room type.
- Remove the price feature, as it is highly correlated with room and is more difficult to fill. Filling the missing room values was chosen over filling the missing price values.
- A new feature total_ppl, the sum of num_adult and num_children, was created
- Numerical features ('arrival_day', 'checkout_day', 'num_adults', 'num_children', 'total_ppl') are normalized to the range 0..1
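The transformations above could look roughly like the sketch below. It is not the code in data_procesor.py: the price string format ("SGD$ 700.00", "USD$ 500.00"), the helper name `price_to_sgd` and the toy rows are assumptions made for illustration. The room lookup is left out because the price clusters are not listed in this README.

```python
import re
import pandas as pd

USD_TO_SGD = 1.4  # exchange rate stated in this README

def price_to_sgd(raw):
    """Parse a price string and return the amount in SGD (None if unparseable)."""
    if not isinstance(raw, str):
        return None
    match = re.search(r"(\d+(?:\.\d+)?)", raw.replace(",", ""))
    if match is None:
        return None
    amount = float(match.group(1))
    # Convert only values marked as USD; everything else is treated as SGD.
    return amount * USD_TO_SGD if "USD" in raw.upper() else amount

# Toy rows; the real column formats are an assumption.
df = pd.DataFrame({
    "price": ["SGD$ 700.00", "USD$ 500.00", None],
    "num_adults": [1, 2, 2],
    "num_children": [0, 1, 2],
})

df["price_sgd"] = df["price"].map(price_to_sgd)
df["total_ppl"] = df["num_adults"] + df["num_children"]

# Min-max scaling of the numerical features to the 0..1 range.
numeric_cols = ["num_adults", "num_children", "total_ppl"]
col_min = df[numeric_cols].min()
col_range = df[numeric_cols].max() - col_min
df[numeric_cols] = (df[numeric_cols] - col_min) / col_range

print(df)
```

In a real pipeline the minimum and range would normally be computed on the training split only and then reused for the test split, so that no information leaks from the test data.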
For this binary classifier model, we have selected the following metrics to evaluate the model performance
- Precision score: the share of predicted no-shows that are actual no-shows. This score guides us in keeping false positives low while driving true positives up. This is useful as we don't want to flag a genuine booking as a no-show unnecessarily.
- Recall score: the share of actual no-shows that the model detects. This score guides us in keeping false negatives low while driving true positives up, so that fewer real no-shows are missed.
- F1 score: this score provides the harmonic mean of precision and recall
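As a small worked example of how the three metrics relate, the sketch below computes them from confusion-matrix counts and then recomputes F1 from the precision and recall values reported further down in this README. The counts and the function names are made up for illustration.

```python
def precision(tp, fp):
    """Share of predicted no-shows that are actual no-shows."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual no-shows that the model detects."""
    return tp / (tp + fn)

def f1_from(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Invented counts, for illustration only.
tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(f"toy counts: precision={p:.3f} recall={r:.3f} f1={f1_from(p, r):.3f}")

# Precision/recall pairs as reported in this README.
reported = {
    "Random Forest v1": (0.717, 0.639),
    "Gradient Boost": (0.672, 0.484),
}
for name, (rep_p, rep_r) in reported.items():
    print(f"{name}: F1 = {f1_from(rep_p, rep_r):.3f}")
```

The recomputed F1 values (0.676 and 0.563) match the figures reported for these models.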
The following models were built and evaluated
- Random Forest v0 (Benchmark)
- Random Forest v1: F1=0.676, Precision=0.717, Recall=0.639
- Ada Boost: F1=0.676, Precision=0.717, Recall=0.639
- Gradient Boost: F1=0.563, Precision=0.672, Recall=0.484
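A comparison like the one above can be sketched as follows. This is not the code in model.py: it uses synthetic data from `make_classification` in place of the processed booking data and default hyperparameters, so the printed scores will not match the numbers listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the real pipeline uses the booking data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(random_state=0),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Fit on the training split, score on the held-out split.
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "f1": f1_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})
```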
The best model, Random Forest, was selected for parameter tuning using GridSearchCV.
The final result from the tuning is F1=0.677, Precision=0.718, Recall=0.640
Optimized parameter: {'n_estimators': 160}
During the optimization step, various parameters were tested over different ranges.
Part of the challenge was to keep the search space small so that the time taken to run the search
stayed manageable. After several trials, no single combination was clearly optimal; for example, some
parameter sets gave better precision at the expense of recall. In the end we set n_estimators to 160
and left the rest of the parameters at their defaults. This setting gives us decent precision without
losing out on recall.
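A minimal sketch of such a search is shown below, assuming a deliberately small grid over n_estimators only. The grid values and the synthetic data are illustrative assumptions, not the ranges or data actually used. Scoring several metrics at once makes the precision/recall trade-off visible for each candidate, while `refit="f1"` picks the final model by F1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the real pipeline uses the booking data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid, kept small so the search finishes quickly.
param_grid = {"n_estimators": [40, 80, 160]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring=["f1", "precision", "recall"],
    refit="f1",  # choose the best candidate by F1
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
results = search.cv_results_
for params, f1, prec, rec in zip(results["params"],
                                 results["mean_test_f1"],
                                 results["mean_test_precision"],
                                 results["mean_test_recall"]):
    print(params, f"f1={f1:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Each extra parameter multiplies the number of candidates, and every candidate is fitted once per cross-validation fold, which is why keeping the grid small keeps the run time manageable.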