Kaiser Permanente Data Science Competition 2018

Team: Decision Tree Huggers (4th out of 30 teams)

Members: Chris Kennedy, Nerissa Nance, Makdine Dontsi, Noel Pimentel

Setup

train.csv and test.csv should be placed into data-raw/
Required R packages will need to be installed.
- Run:
```
source("R/_startup.R")
# Install the needed packages from CRAN.
install.packages(attr(startup, "packages_cran"))
# Install a few extra packages from Github.
devtools::install_github(attr(startup, "packages_github"))
```
- If you run into errors, you may need to install a few packages manually, e.g. h2o.
- You also may want to run Session -> Restart R if you get any weird messages about "lazy-load database something.rdb is corrupt".

Analyze

Run or knit clean.Rmd to import the raw data and generate data/clean.RData.
Run or knit eda.Rmd to conduct exploratory data analysis.
Run or knit one of the model files, such as model-sl-slower.Rmd to build a model and generate a submission export.

Methodology

The following summarizes the Decision Tree Huggers' work on the Higgs Boson challenge.

Cleaning

Categorical variables: converted pri_jet_num to a factor and then to a series of indicators.
Missing values: added missingness indicators for missing values and imputed missing values to the median/mode. Tried random forest missing value imputation but did not find a benefit from leaderboard submission.
We explored removing highly correlated variables (pri_jet_leading_pt, pri_jet_all_pt), adding log transformations of skewed variables, and trimming outliers but ran out of time to submit based on that additional cleaning.

Exploration

We examined univariate correlation with the outcome, OLS coefficients, and correlation heatmaps to review covariate relationships.
We used boxplots and outlier plots to review extreme values.
We ran out of time to complete tSNE and UMAP analyses.

Modeling

Our model consisted of a SuperLearner ensemble of XGBoost models and a random forest (Ranger, 200 trees) estimated with 20-fold cross-validation. This included a grid of 12 different XGBoost configurations: {300, 1000, or 2000 trees} x {4 or 8 maximum depth} x {0.01 or 0.1 shrinkage}. Logistic regression and lasso regression were also available but were not selected in the ensemble.
The SuperLearner ensemble was optimized to mean-squared error but with more time we might have tried AUC or negative log likelihood optimization.
We tried h2o.ai's auto_ml() feature targeted to AUC optimization with 10-fold cross-validation but did not find increased performance compared to our SuperLearner ensemble.
We tried feature selection based on univariate correlation (top 5, 15, 25, or 30 covariates) but found the best performance when all variables were included.

Submission

We analyzed different threshold options and found that using the mean of the outcome for the training data (0.3427) was the best.

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
R		R
analysis		analysis
archive		archive
data-raw		data-raw
data		data
exports		exports
tables		tables
visuals		visuals
.gitignore		.gitignore
README.md		README.md
clean-eda-md.Rmd		clean-eda-md.Rmd
clean-eda-md.r		clean-eda-md.r
clean-eda-md.rmd		clean-eda-md.rmd
clean.Rmd		clean.Rmd
clean_md1_rcomp.Rmd		clean_md1_rcomp.Rmd
clean_md2.Rmd		clean_md2.Rmd
clean_missforest.Rmd		clean_missforest.Rmd
eda.Rmd		eda.Rmd
kp-dsc-2018.Rproj		kp-dsc-2018.Rproj
model-h2o.Rmd		model-h2o.Rmd
model-sl-basic.Rmd		model-sl-basic.Rmd
model-sl-slower.Rmd		model-sl-slower.Rmd
old-edaMD.R		old-edaMD.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kaiser Permanente Data Science Competition 2018

Setup

Analyze

Methodology

Cleaning

Exploration

Modeling

Submission

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Kaiser Permanente Data Science Competition 2018

Setup

Analyze

Methodology

Cleaning

Exploration

Modeling

Submission

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages