---
title: "Remote Sensing Water Quality Prediction Using Random Forest and XGBoost"
author: "Matthew Ross"
date: "2024-11-22"
output: html_document
---


# Objective

This assignment guides you through a hands-on exploration of modeling water quality with remote sensing data. Specifically, you will predict Secchi disk depth (SDD), a measure of water clarity reported in meters. High SDD values indicate a deep, clear, blue lake, while low values indicate murky lakes, potentially because algal particles or suspended sediment are occluding light. You'll start with data exploration and simple models before comparing the performance of two machine learning techniques: Random Forest and XGBoost.

## Steps with Explanations and Tasks:

### 1. Setup and Libraries

The provided code initializes necessary libraries for data manipulation, plotting, and modeling.

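Explanation: The tidyverse package is used for data wrangling and visualization, while randomForest and xgboost are machine learning packages for building prediction models. sf and mapview handle spatial data and interactive maps, and Metrics supplies the evaluation functions.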

```{r}
library(tidyverse) # Data manipulation and visualization
library(xgboost) # Gradient Boosting
library(randomForest) # Random Forest
library(sf) # Spatial data handling
library(mapview) # Interactive maps
library(Metrics) # Evaluation metrics
```


### 2. Data Exploration

Start by loading the dataset and performing exploratory data analysis (EDA) to understand the relationships between variables.

Explanation: Scatter plots with logarithmic scales and linear regression trends help identify correlations between the response variable (harmonized_value) and predictors.

```{r}
sdd <- read_csv('data/western_sdd.csv')

# Summary of the target variable
summary(sdd$harmonized_value)

# Relationships with key variables
ggplot(sdd, aes(x = harmonized_value, y = red_corr7)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = 'lm', se = FALSE)

ggplot(sdd, aes(x = harmonized_value, y = green_corr7)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = 'lm', se = FALSE)

ggplot(sdd, aes(x = harmonized_value, y = BR_G)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = 'lm', se = FALSE)
```
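
To put numbers on what scatter plots like these show, one option is to compute correlations with the log-transformed predictor. The following is a minimal sketch on simulated data: the `toy` data frame and its columns `red` and `sdd_m` are made up for illustration and are not taken from `western_sdd.csv`.

```{r}
# Simulated stand-in for the real data: clarity falls as red reflectance rises
set.seed(1)
toy <- data.frame(red = exp(rnorm(200, mean = -3, sd = 0.5)))
toy$sdd_m <- exp(-1 - 0.8 * log(toy$red) + rnorm(200, sd = 0.3))

# Pearson correlation with the logged predictor (mirrors scale_y_log10 in the plots)
cor_log <- cor(toy$sdd_m, log10(toy$red))

# Spearman rank correlation does not change under a monotonic transform such as log
cor_rank_raw <- cor(toy$sdd_m, toy$red, method = "spearman")
cor_rank_log <- cor(toy$sdd_m, log10(toy$red), method = "spearman")

c(pearson_log = cor_log, spearman_raw = cor_rank_raw, spearman_log = cor_rank_log)
```

With the real data, the same calls could be pointed at `sdd$harmonized_value` and a band column such as `sdd$red_corr7`, assuming those columns contain no missing or non-positive values.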


### 3. Mapping Site Locations

Generate a quick map of sampling sites using mapview.

Explanation: Using spatial data visualization, we can verify if site locations correspond to different study parts.

```{r}
sdd_sites <- sdd %>%
  distinct(part, lat = WGS84_Latitude, long = WGS84_Longitude) %>%
  st_as_sf(., coords = c('long', 'lat'), crs = 4326) # EPSG:4326 is WGS84 longitude/latitude

# Interactive map
mapview(sdd_sites, zcol = 'part')
```


### 4. Simple Linear Model

Explanation: A simple linear regression model serves as a baseline to see how much of the variation in harmonized_value (SDD) linear relationships can explain.

```{r}
# Linear regression model
simple_mod <- lm(harmonized_value ~ red_corr7 * blue_corr7 * green_corr7 * BR_G, data = sdd)

# Summary of the model
summary(simple_mod)


```
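
The `*` operator in an R formula expands to every main effect plus every interaction among the terms, so the baseline above is larger than it may look. A small sketch with four made-up predictors (`b1` to `b4` are placeholders, not columns from the dataset) shows how many coefficients that implies:

```{r}
# Four made-up predictors standing in for the band columns
set.seed(2)
toy_bands <- data.frame(b1 = runif(50), b2 = runif(50), b3 = runif(50), b4 = runif(50))

# `*` expands to all main effects plus every interaction up to 4-way
full_terms <- colnames(model.matrix(~ b1 * b2 * b3 * b4, data = toy_bands))
# `+` keeps only the main effects
main_terms <- colnames(model.matrix(~ b1 + b2 + b3 + b4, data = toy_bands))

length(full_terms)  # 16 columns: intercept + 4 main + 6 two-way + 4 three-way + 1 four-way
length(main_terms)  # 5 columns: intercept + 4 main effects
```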


## Machine Learning Demos

### 5. Random Forest - Naive Splitting

Explanation: A naive random split into training and testing sets tends to make performance look artificially high. It ignores data leakage: observations that are closely related (for example, from the same region) can end up on both sides of the split, so the test set is not truly unseen.


```{r}

set.seed(221432)

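# Selecting important variables, plus a row id so the split can be tracked exactly
sdd_prepped <- sdd %>%
  select(harmonized_value, c('R_BS', 'R_BN', 'B_RG', 'BG', 'NmR', 'green_corr7', 'BR_G', 'GR_2', 'fai', 'red_corr7', 'G_BN', 'NmS')) %>%
  mutate(row_id = row_number())

# Random test-train split (joining on row_id avoids dropping duplicated rows from training)
test_sdd <- sdd_prepped %>% sample_frac(0.2)
train_sdd <- sdd_prepped %>% anti_join(test_sdd, by = 'row_id') %>% select(-row_id)
test_sdd <- test_sdd %>% select(-row_id)

# Random Forest model
rf_mod <- randomForest(harmonized_value ~ ., data = train_sdd, importance = FALSE, ntree = 250)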

# Predictions and visualization
test_sdd$sdd_pred <- predict(rf_mod, test_sdd)

ggplot(test_sdd, aes(y = sdd_pred, x = harmonized_value)) +
  geom_point() +
  xlab('Observed') +
  ylab('Predicted') +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(intercept = 0, slope = 1, color = 'red')

# Evaluation metrics
mape(test_sdd$harmonized_value, test_sdd$sdd_pred)
rmse(test_sdd$harmonized_value, test_sdd$sdd_pred)

```
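
For reference, the two metrics can be written out by hand. This is a sketch of the standard definitions, which `rmse()` and `mape()` from Metrics follow; the function names `rmse_manual` and `mape_manual` and the example values are made up for illustration.

```{r}
# Hand-rolled versions of the two metrics, for illustration only
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape_manual <- function(actual, predicted) mean(abs(actual - predicted) / abs(actual))

obs_toy  <- c(1, 2, 4, 8)      # made-up observed SDD values (m)
pred_toy <- c(1.5, 2, 3, 10)   # made-up predictions (m)

rmse_manual(obs_toy, pred_toy)  # in the units of the response (m)
mape_manual(obs_toy, pred_toy)  # a fraction: 0.25 means a 25% average error
```

RMSE penalizes large misses more heavily, while MAPE weights errors relative to the observed value, so a 1 m miss counts for more on a shallow reading than on a deep one.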


### 6. Random Forest - Spatial Splitting

Explanation: Splitting on spatial or temporal characteristics (here, `part`) makes the test set a better stand-in for unseen conditions. `part` is a column that splits the data evenly across space into five domains. The code below trains on part 5 and tests on all other parts.

```{r}

# Splitting data by 'part'
test_sdd <- sdd %>%
  filter(part != 5) %>%
  select(harmonized_value, c('R_BS', 'R_BN', 'B_RG', 'BG', 'NmR', 'green_corr7', 'BR_G', 'GR_2', 'fai', 'red_corr7', 'G_BN', 'NmS'))

train_sdd <- sdd %>%
  filter(part == 5) %>%
  select(harmonized_value, c('R_BS', 'R_BN', 'B_RG', 'BG', 'NmR', 'green_corr7', 'BR_G', 'GR_2', 'fai', 'red_corr7', 'G_BN', 'NmS'))

# Random Forest model
rf_mod <- randomForest(harmonized_value ~ ., data = train_sdd, importance = FALSE, ntree = 250)

# Predictions
test_sdd$sdd_pred <- predict(rf_mod, test_sdd)

# Visualization
ggplot(test_sdd, aes(y = sdd_pred, x = harmonized_value)) +
  geom_point() +
  xlab('Observed') +
  ylab('Predicted') +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(intercept = 0, slope = 1, color = 'red')

# Evaluation metrics
mape(test_sdd$harmonized_value, test_sdd$sdd_pred)
rmse(test_sdd$harmonized_value, test_sdd$sdd_pred)

```
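
A single held-out split gives one estimate of performance. One possible extension is to let each group take a turn as the test set (leave-one-group-out). The sketch below uses simulated data and a plain linear model as a stand-in; `toy_sp`, `region`, and `region_shift` are hypothetical names, and the same loop could in principle wrap `randomForest()` with `part` as the grouping column.

```{r}
# Simulated data: 5 regions, each with its own offset that the model cannot see
set.seed(3)
toy_sp <- data.frame(region = rep(1:5, each = 40), x = runif(200))
region_shift <- c(-2, -1, 0, 1, 2)
toy_sp$y <- 3 * toy_sp$x + region_shift[toy_sp$region] + rnorm(200, sd = 0.2)

rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Leave-one-region-out: each region is held out once and predicted from the rest
logo_rmse <- sapply(sort(unique(toy_sp$region)), function(r) {
  train_part <- toy_sp[toy_sp$region != r, ]
  test_part  <- toy_sp[toy_sp$region == r, ]
  fit <- lm(y ~ x, data = train_part)
  rmse_fn(test_part$y, predict(fit, newdata = test_part))
})

round(logo_rmse, 2)  # one RMSE per held-out region
mean(logo_rmse)      # summary across regions
```

The spread across held-out regions is informative in itself: regions that differ most from the rest are predicted worst.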

### 7. XGBoost

XGBoost is a tree-based algorithm like random forest, but it builds its trees differently. Random forest grows many independent trees and averages them, while XGBoost uses gradient boosting: trees are added one at a time, each fit to correct the errors of the ensemble so far. More on XGBoost here (https://www.nvidia.com/en-us/glossary/xgboost/)
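
To make the boosting idea concrete, here is a toy version written in base R. It is a sketch of plain gradient boosting with one-split trees ("stumps") and squared-error loss, not of XGBoost itself, which adds regularization and other refinements. All names (`fit_stump`, `predict_stump`, `x_toy`, `y_toy`) and the simulated data are made up for illustration.

```{r}
# Fit a one-split "stump": pick the threshold on x that minimizes squared error
fit_stump <- function(x, r) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2   # midpoints between neighboring values
  sse <- sapply(cuts, function(ct) {
    left <- r[x <= ct]
    right <- r[x > ct]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

set.seed(4)
x_toy <- runif(200, 0, 10)
y_toy <- sin(x_toy) + rnorm(200, sd = 0.2)

eta <- 0.1                                      # learning rate: how much of each stump to add
nrounds <- 100                                  # number of boosting rounds
pred_boost <- rep(mean(y_toy), length(y_toy))   # start from the overall mean
train_rmse <- numeric(nrounds)

for (i in seq_len(nrounds)) {
  res <- y_toy - pred_boost                     # current errors of the ensemble
  stump <- fit_stump(x_toy, res)                # new tree is fit to those errors
  pred_boost <- pred_boost + eta * predict_stump(stump, x_toy)
  train_rmse[i] <- sqrt(mean((y_toy - pred_boost)^2))
}

train_rmse[c(1, 10, 100)]  # training error shrinks as rounds are added
```

Training error keeps falling with more rounds, which is why boosted models are judged on held-out data rather than on the data they were fit to.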

Use the xgb.DMatrix() function to prepare the data for XGBoost, and configure the model using xgboost().


```{r}

# Check column names before building the matrices
names(train_sdd)
names(test_sdd)

# Drop the response column so only predictors go into the matrix
train_matrix <- xgb.DMatrix(data = as.matrix(select(train_sdd, -harmonized_value)),
                            label = train_sdd$harmonized_value)

# Drop the response and the random forest predictions (sdd_pred)
test_matrix <- xgb.DMatrix(data = as.matrix(select(test_sdd, -harmonized_value, -sdd_pred)),
                           label = test_sdd$harmonized_value)

# Train XGBoost model
# Note: no separate validation set is supplied, so early stopping here
# monitors error on the training data
xgb_mod <- xgboost(data = train_matrix,
                   nrounds = 250,
                   objective = "reg:squarederror",
                   print_every_n = 50,
                   early_stopping_rounds = 5)


# Predictions
test_sdd$sdd_pred_xgb <- predict(xgb_mod, test_matrix)

# Visualization and evaluation
ggplot(test_sdd, aes(y = sdd_pred_xgb, x = harmonized_value)) +
  geom_point() +
  xlab('Observed') +
  ylab('Predicted') +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(intercept = 0, slope = 1, color = 'red')

mape(test_sdd$harmonized_value, test_sdd$sdd_pred_xgb)
rmse(test_sdd$harmonized_value, test_sdd$sdd_pred_xgb)


```

# Playground

Both `xgboost` and `randomForest` have dozens of hyperparameters that you can tune (such as `eta`, the learning rate for xgboost). I encourage you to spend 30 minutes to an hour trying to improve the performance of our random forest or XGBoost model by changing these hyperparameters. Doing so will give you a sense of what people in machine learning spend all of their time doing! It will also be the start of your journey to understanding which hyperparameters matter and why. ChatGPT can give pretty helpful advice on how to improve the models, and I encourage you to use it; you can send it parts of this code and ask how to alter it.
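
One common pattern for keeping track of such experiments is a small grid search scored on a validation set. The sketch below is only an illustration under stated assumptions: it uses simulated data and base R's `loess()` as a stand-in model, with `span` and `degree` playing the role that `eta` and `max_depth` (xgboost) or `mtry` and `nodesize` (randomForest) would play. The names `toy_tune`, `tune_train`, `tune_valid`, and `tune_grid` are hypothetical.

```{r}
set.seed(5)
toy_tune <- data.frame(x = runif(300, 0, 10))
toy_tune$y <- sin(toy_tune$x) + rnorm(300, sd = 0.3)

# Split into training and validation rows; tuning decisions use validation only
split_label <- sample(rep(c("train", "valid"), times = c(200, 100)))
tune_train <- toy_tune[split_label == "train", ]
tune_valid <- toy_tune[split_label == "valid", ]

# Every combination of the candidate hyperparameter values
tune_grid <- expand.grid(span = c(0.2, 0.5, 0.8), degree = c(1, 2))

# Fit one model per row of the grid and record its validation RMSE
tune_grid$valid_rmse <- mapply(function(span, degree) {
  fit <- loess(y ~ x, data = tune_train, span = span, degree = degree,
               control = loess.control(surface = "direct"))  # allows prediction outside the training range
  p <- predict(fit, newdata = tune_valid)
  sqrt(mean((tune_valid$y - p)^2))
}, tune_grid$span, tune_grid$degree)

tune_grid[order(tune_grid$valid_rmse), ]  # best settings first
```

Because the validation set is used to choose settings, a final untouched test set is still needed for an honest estimate of performance.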

How much improvement do you get?

What would be a systematic way to