sme/lm.Rmd at main · uvastatlab/sme · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
---
title: "Linear Modeling"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
bibliography: refs.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = '')
```


### Linear model with no predictors (intercept only)

A new reading program is being evaluated at an elementary school. A random sample of 20 students were tested to determine reading speed. Speed was measured in minutes. [@ott2004]

```{r}
speed <- c(5, 7, 15, 12, 8, 7, 10, 11, 9, 13, 10,
           6, 11, 8, 10, 8, 7, 6, 11, 8, 20)
```

We can visualize the distribution of data using a histogram as follows:

```{r}
hist(speed)
```


We can use the `t.test()` function to calculate both the mean of speed and a 95% confidence interval on the mean. The mean reading speed was about 9.62 with a 95% CI of [8.04, 11.19].

```{r}
t.test(speed)
```

To extract just the confidence interval we can save the result of the t-test and view the `conf.int` element of the saved object.

```{r}
tout <- t.test(speed)
tout$conf.int
```

We can also calculate the 95% confidence interval using an intercept-only linear model. Below we model speed as a function of the number 1 using the `lm()` function. This is the R syntax for fitting an intercept-only model. We save the result into an object we name "m" and use the `confint()` function to view the 95% confidence interval.

```{r}
m <- lm(speed ~ 1)
confint(m)
```

Does the sample appear to be from a Normal distribution? If so, the points in a QQ plot should lie close to the diagonal line. See [this article](https://data.library.virginia.edu/understanding-q-q-plots/) for information on QQ plots. This plot looks ok.

```{r}
qqnorm(speed)
qqline(speed)
```

We can also obtain a QQ plot by calling the `plot()` function on the model object. By default the `plot()` method for model objects returns four plots. Setting `which = 2` produces just the QQ plot.

```{r}
plot(m, which = 2)
```

A few observations seem unusual: 1, 3, 21. We can extract their speed values using index subsetting.

```{r}
speed[c(1, 3, 21)]
```

There was a fast reader and a couple of slower readers.

If a Normal QQ plot of your data looks suspect, we can try comparing it to other Normal QQ plots created with Normal data. Below we calculate the mean and standard deviation of our data, re-draw the original QQ Plot, and then generate 24 additional QQ plots from a Normal distribution with the same mean and standard deviation as our data. The QQ plot of our observed data doesn't seem too different from the QQ plots of Normal data.

```{r}
m <- mean(speed)
s <- sd(speed)

# create a 5x5 grid of plotting areas
op <- par(mar = c(2,2,1,1), mfrow = c(5,5))

# create first qq plot
qqnorm(speed, xlab = "", ylab = "", main = "")
qqline(speed)

# now create 24 qq plots using observed mean and sd
for(i in 1:24){
  # rnorm() samples from a Normal dist'n
  # with specified mean and sd
  d <- rnorm(20, mean = m, sd = s)
  qqnorm(d, xlab = "", ylab = "", main = "")
  qqline(d)
}
par(op)
```


### Linear model with categorical predictors


The `schooldays` data frame from the HSAUR3 package has 154 rows and 5 columns. It contains data from a sociological study of Australian Aboriginal and white children. Model `absent` (number of days absent during school year) as a function of `race`, `gender`, `school` (school type), and `learner` (average or slow learner). [@hsaur3]

```{r}
library(HSAUR3)
data("schooldays")
```

First we summarize and explore the data.

```{r}
# plots to look at data
library(ggplot2)
library(dplyr)

# histogram of absences
hist(schooldays$absent)

# base r strip chart of absences by gender
stripchart(absent ~ gender, data = schooldays,
           ylab = 'number of absences', xlab = 'gender',
           main = 'Stripchart of Absences by gender')

# ggplot strip chart of absences by gender
ggplot(schooldays, aes(x = gender, y = absent, color=learner)) +
  geom_jitter(position = position_jitter(0.2))
```

Next we model `absent` as a function of all other predictors.


```{r}
m <- lm(absent ~ race + gender + school + learner, schooldays)
summary(m)
```


In the first section we would like to see Residuals centered around 0, with the min/max and 1Q/3Q having roughly the same absolute value. This indicates a uniform scatter of residuals, which is what a linear model assumes.

- Positive residuals occur when observed response values are _greater than_ predicted response values.
- Negative residuals occur when observed response values are _less than_ predicted response values.

Residuals do not seem uniformly scatter based on these summary statistics.

The Coefficients section shows the fitted model in the Estimate column. A mathematical expression of the model is as follows:

$$\text{absent} = 17.03 + -7.83\text{ non-aboriginal} + 2.64\text{ male} + -1.78\text{ F1} + 5.35\text{ F2} + 3.29\text{ F3} + 0.74\text{ slow learner}$$

For example, we might use our model to predict expected number of days absent for non-aboriginal females in an F1 school who are average learners.

$$\text{absent} = 17.03 + -7.83(1) + 2.64(0) + -1.78(1) + 5.35(0) + 3.28(0) + 0.74(0)$$

Notice we plug in 1 if our subject belongs to the category and 0 otherwise. We can carry out this prediction using the `predict()` function. The input values need to be entered in a data frame using the same variable names and category levels as our analysis data set.

```{r}
predict(m, newdata = data.frame(race = "non-aboriginal",
                                gender = "female",
                                school = "F1",
                                learner = "average"))
```

Our model predicts about 7 days absence. Adding `interval = "confidence"` to `predict()` reports a 95% confidence interval on this prediction.

```{r}
predict(m, newdata = data.frame(race = "non-aboriginal",
                                gender = "female",
                                school = "F1",
                                learner = "average"),
        interval = "confidence")
```

The estimate is rather wide, ranging from 1 day to 14 days.

The "Std. Error" column in the summary output quantifies uncertainty in the estimated coefficients. The "t value" column is the ratio of the estimated coefficient to the standard error. Ratios larger than 2 or 3 in absolute value give us confidence in the direction of the coefficient. Other than the intercept, the only coefficient that we're confident about is the race coefficient. It appears `non-aboriginal` students have a lower rate of absence by about 7 days.

The "Pr(>|t|)" column reports p-values for hypothesis tests for each t value. The null hypothesis is the t value is 0. The reported p-value is the probability of seeing a t value that large or larger in magnitude if the null is true. In all seven hypothesis tests are reported. Two appear to be "significant" in the sense their p-values fall below traditional thresholds.

The "Residual standard error" is the expected amount a predicted response value will differ from its observed value. The reported value of 15.32 tells us the model's predicted value for days absent will be off by about 15 days.

The Multiple and Adjusted R-squareds are 0.10 and 0.07. This summarizes the proportion of variability explained by the model. Of the variability in days absent, it looks like this model explains about 7% of the variance. The Adjusted R-squared is adjusted for the number of predictors and is the preferred statistic of the two. (Multiple R-squared always increases when variables are added to a model, even variables unrelated to the response.)

The F-statistic tests the null hypothesis that all coefficients (other than the intercept) are 0. Small p-values provide evidence against this hypothesis.

R provides some built-in diagnostic plots. A good one to inspect is the Residuals vs Fitted plot.

```{r}
plot(m, which = 1)
```

Residuals above 0 are instances where the observed values are larger than the predicted values. It seems our model is under-predicting absences to a greater degree than over-predicting absences. The labeled points of 77, 111, and 61 are the rows of the data set with the largest residuals. These data points may be worth investigating. For example, point 111 is a child that we predict would be absent about 15 days (on the x-axis), but the residual of about 50 (on the y-axis) tells us this student was absent about 65 days.

A QQ Plot can help us assess the Normality assumption of the residuals.

```{r}
plot(m, which = 2)
```

We would like the points to fall along the diagonal line. This plot doesn't look great. Since our model doesn't seem to be very good based on the previous plot, it's probably not worth worrying too much about the QQ Plot.

Another version of the Residuals vs Fitted plot is the Scale-Location plot.

```{r}
plot(m, which = 3)
```

In this plot the residuals are standardized and strictly positive. We would like the smooth red line to trend straight. The fact it's trending up provides some evidence that our residuals are getting larger as our fitted values get larger. This could mean a violation of the constant variance assumption.

One final plot to inspect is the Residuals vs Leverage plot. This can help us see if any data points are influencing the model.

```{r}
plot(m, which = 5)
```

The x-axis is Leverage, also called hat-values. Higher values of leverage mean a higher potential to influence a model. The y-axis shows standardized residuals. Also plotted is Cook's distance contour lines if any points exceed a particular threshold. Like leverage, Cook's distance also quantifies the influence of a data point. In this plot it doesn't appear any points are unduly influencing the model.

Conclusion: it looks like the number of days absent cannot be properly modeled or understood with these categorical predictors.


### Linear model with categorical and numeric predictors

The `bp.obese` data frame from the ISwR package has 102 rows and 3 columns. It contains data from a random sample of Mexican-American adults in a small California town. Analyze [systolic blood pressure](https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/blood-pressure/art-20050982) in mm Hg (`bp`) as a function of obesity (`obese`) and sex (`sex`), where male = 0 and female = 1. Here `obese` is a ratio of actual weight to ideal weight from New York Metropolitan Life Tables. [@ISwR]

```{r}
library(ISwR)
data("bp.obese")
```

First, we will look at summary statistics of the data.

```{r}
summary(bp.obese)
```

Sex is encoded with 0s (male) and 1s (female). Sex's mean of 0.5686 tells us that about half of the people are female. Obese is a ratio representing participants weight. 1 indicates actual weight equals ideal weight. >1 indicates actual weight is greater than ideal weight. And <1 the opposite.


Next, we will visualize the distribution of blood pressure, for all subjects and by sex.

```{r}
library(ggplot2)

# ggplot(bp.obese, aes(x = bp)) +
#   geom_density()

ggplot() +
  geom_density(bp.obese, mapping = aes(x = bp, fill=as.factor(sex)), alpha=1/3) +
  labs(fill='sex') +
  scale_fill_manual(values = c('blue', 'red'),
                    labels=c('male','female'))
```

The distribution of bp appears to skew right, and does not differ much between male and female.

Next, we will plot the relationship between the variables bp and obese.

```{r}
ggplot(bp.obese, aes(x = obese, y = bp)) +
  geom_point()
```

Most participants have above ideal weight and below 150 bp. The relationship between obese and bp, in general, appears to be positive and somewhat linear. But obese does not entirely explain bp. For example, the 3 participants with obese  greater than 2 have bps at about or below 150. Then, the 2 participants with bp greater than 180 have obese less that 1.75.

Below we model bp as a function of the variables obese and sex.

```{r}
bp.obese$sex <- as.factor(bp.obese$sex)
m3 <- lm(bp ~ obese + sex, bp.obese)
summary(m3)
```

One way we can assess the model is by examining diagnostic plots. We will view the fitted vs. residuals plot.

```{r}
plot(m3, which=1)
```

The fitted vs. residuals plot summarizes the spread of residuals. For a 'good' model, we expect to see residuals centered around 0, with an even distribution above and below 0. The red trend line helps us detect any patterns in the residuals. We hope to see it approximately horizontal around 0.

Many (but not all) of the values on the plot are within 20 units of 0. This is what we would expect based on the residual standard error, which is 17. Linear modeling assumes the distribution of residuals has a mean of 0 and a constant standard deviation. This standard deviation is estimated as the residual standard error. The residual standard error is how much we can expect a prediction to deviate from its true value.

The residual versus fitted plot also reveals several subjects with residuals well above 20. This means our model is severely underfitting their blood pressure values. For example subject 15 has a fitted value of about 130, but a residual of 60. This means their observed blood pressure was 190, but our model predicts 130. There are many subjects experiencing this underfitting which indicates our model probably isn't very good.

Additionally, this model has an adjusted R-squared value of 0.1265. This roughly means the model explains 13% of the variability in bp. This is another indication that the model is not good.

Another way to assess model fit is through a calibration plot, which plots predicted values against observed values. The values should cluster around the slope=1 line shown in red.

```{r}
plot(bp.obese$bp, predict(m3))
abline(a=0, b=1, col='red')
```

These values do not cluster around the slope=1 line, another signal that this model does not perform well. It is likely that factors outside of the predictors in the model also affect bp.

Next, we will calculate confidence intervals for each of the coefficients in the model.

```{r}
confint(m3)
```

For the obese variable, the 95% confidence interval goes from 14.8 to 43.3. According to the confidence interval, generally a 1 unit increase in obese will increase bp by about 14 to 43 units. If a participant's obese measure increased from 1 to 2, then they started at their "ideal" weight and increased to twice their ideal weight.

We can use an example to better understand the relevance of the obese variable. A participant weighs 200 pounds and their ideal weight is 150 pounds. This makes their obese measure about 1.3. Then the participant gains 20 pounds, making their obese measure about 1.5. What happens if the obese measure increases by 0.2? To determine how much bp increases for every 0.1 unit increase in obese, we can divide the confidence interval by 10.

```{r}
confint(m3)['obese',]/10
```

So generally, for every 0.1 unit increase in obese, bp increases by 1.5 to 4.3 units. If a participant increases in obese by 0.2, the model predicts about a 2.0 to 8.6 unit increase in bp.

For the sex variable, a 1 unit difference means switching from one sex to another. According to the model, switching sex from male to female generally decreases bp.


### Linear model with time ordered data

The heating equipment data set from _Applied Linear Statistical Models (5th Ed)_ [@alsm] contains data on heating equipment orders over a span of four years. The data are listed in time order. Develop a reasonable predictor model for `orders`. Determine whether or not autocorrelation is present. If so, revise model as needed.

```{r}
URL <- "https://github.com/uvastatlab/sme/raw/main/data/heating.rds"
heating <- readRDS(url(URL))
names(heating)
```

The variables are as follows:

1. `orders`: number orders during month
2. `int_rate`: prime interest rate in effect during month
3. `new_homes`: number of new homes built and for sale during month
4. `discount`: percent discount offered to distributors during month
5. `inventories`: distributor inventories during month
6. `sell_through`: number of units sold by distributor to contractors in previous month
7. `temp_dev`: difference in avg temperature for month and 30-year avg for that month
8. `year`: year (1999, 2000, 2001, 2002)
9. `month`: month (1 - 12)

First we look at the distribution of orders.

```{r}
hist(heating$orders)
```

We had negative orders one month and zero orders a couple of months

```{r}
summary(heating$orders)
head(sort(heating$orders))
```

Let's review summaries of potential predictors. Discounts are almost always 0.

```{r}
summary(heating[,2:7])
```

Tabling up discounts shows there were only 5 occasions when a discount was not 0.

```{r}
table(heating$discount)
```


Pairwise scatterplots again reveal that discount is mostly 0. But when greater than 0, it appears to be associated with higher orders.

```{r}
pairs(heating[,1:7], lower.panel = NULL)
```

Since the data is in time order, we can plot `orders` against its index to investigate any trends in time. There appears to be an upward trend in the first 9 months. After that orders seem rather sporadic.

```{r}
plot(heating$orders, type = "b")
```

The `acf()` function allows us to formally check for autocorrelation (ie,  self-correlation). At lag 0, we see `orders` is perfectly correlated with itself. That will always be the case for any variable. Of interest is autocorrelation at lag 1 and beyond. The area between the horizontal dashed lines is where we would expect to see random autocorrelation values hover for pure noise (ie, data with no autocorrelation). Other than at lag 5, we so no evidence for autocorrelation. And even there it appears rather weak.

```{r}
acf(heating$orders)
```

With only 43 observations and little knowledge of this industry, we decide to pursue a simple model with no interactions or non-linear effects. To help us decide which predictors to use, will use a bootstrap procedure to investigate the variability of model selection under the `stepAIC()` function from the MASS package [@MASS], which performs stepwise model selection by AIC. The `bootStepAIC()` function from the package of the same name allows us to easily implement this procedure. [@bootStepAIC]

First we fit a "full" model with all predictors we would like to entertain, with the exception of year and month.

```{r}
m <- lm(orders ~ . - year - month, data = heating)
```

Then we bootstrap the `stepAIC()` function 999 times and investigate which variables appear to be selected most often.

```{r eval=FALSE}
library(bootStepAIC)
b.out <- boot.stepAIC(m, heating, B = 999)
```

```{r echo=FALSE}
b.out <- readRDS(file = "data/bout.rds")
```

The Covariates element of the BootStep object shows three predictors being selected most often: `sell_through`, `discount`, and `inventories`

```{r}
b.out$Covariates
```

The Sign element of the BootStep object shows that `sell_through` had a positive coefficient 100% of the time, while `discount` had a positive coefficient about 99.9% of the time. The sign of the coefficient for `inventories` was negative about 99.5% of the time.

```{r}
b.out$Sign
```

Let's proceed with these three predictors in a simple additive model.

```{r}
m <- lm(orders ~ discount + inventories + sell_through, data = heating)
summary(m)
```

Before we get too invested in the model output, we investigate the residuals for autocorrelation since our data is in time order. One (subjective) way to do that is to plot residuals over time. Since our data is already in time order, we can just plot residuals versus its index. We're looking to see if residuals are consistently above or below 0 for extended times. It appears we may have some instances of this phenomenon but nothing too severe.

```{r}
plot(residuals(m), type = "b")
segments(x0 = 0, y0 = 0, x1 = 43, y1 = 0, lty = 2)
```

A test for autocorrelation is the Durbin-Watson Test. The car package [@car] provides a function for this test. The null hypothesis is the autocorrelation parameter, $\rho$, is 0. Rejecting this test with a small p-value may provide evidence of serious autocorrelation. The test result below fails to provide good evidence against the null.

```{r}
library(car)
durbinWatsonTest(m)
```

Based on our residual versus time plot and the result of the Durbin-Watson test, we decide to assume our residual errors are independent.

Our Residuals versus Fitted plot shows a couple of observations (22 and 18) that the model massively under-predicts. Observation 18, for example, has a fitted value of over 4000, but its residual is about 1000, indicating the model under-predicted its order value by about 1000 units.

```{r}
plot(m, which = 1)
```

The Residuals versus Leverage plot indicates that observation 18 may be exerting a strong influence on the model.

```{r}
plot(m, which = 5)
```

The "18" label allows us to easily investigate this observation. It appears this is one of the rare times there was a discount. And at 5%, this is the largest observed discount.

```{r}
heating[18, c("orders", "discount", "inventories", "sell_through")]
```

We can assess how our model changes by re-fitting without observation 18.

```{r}
m2 <- update(m, subset = -18)
```

The `compareCoefs()` function from the car package makes it easy to compare model coefficients side-by-side. It doesn't appear to change the substance of the results and we elect to keep the observation moving forward.

```{r}
compareCoefs(m, m2)
```

Computing confidence intervals on the coefficients can help with interpretation. It appears every additional one percentage point discount could boost orders by about 500 units, perhaps as much as 750 or higher. Of course with only 5 instances of discounts being used, this is far from a sure thing, especially for forecasting future orders.

The `inventories` coefficient is very small. One reason for that is because it's on the scale of single units. Multiplying by 100 yields a CI of [-21, -1]. This suggests every additional 100 units in stock may slightly decrease orders. Same with the `sell_through` coefficient. If we multiple by 100 we get a CI of [29, 119]. Every 100 number units sold the previous month suggests an increase of a few dozen orders.

```{r}
confint(m)
```

We can use our model to make a prediction. What's the expected mean number of orders with `inventories` at 2500 and `sell_through` at 900, assuming a 1% discount?

```{r}
predict(m, newdata = data.frame(inventories = 2500,
                                sell_through = 900,
                                discount = 1),
        interval = "confidence")
```

Based on the confidence interval, a conservative estimate would be about 1200 orders.

### Inverse Gaussian Linear Model

The {faraway} package [@faraway] provides data on an experiment that investigates the effects of various treatments on certain toxic agents. 48 rats were randomly poisoned with one of three types of poison, and then treated with one of four available treatments. Of interest is which treatment helped the rats survive the longest. The data were original published in @boxcox.

```{r}
data("rats", package = "faraway")
```

We see the data is balanced with 4 rats per poison/treatment combination.

```{r}
table(rats$poison, rats$treat)
```

The outcome is survival time. It is measured in units of 10 hours.

```{r}
summary(rats$time)
```

It looks like most rats survived about 5 hours, but a few lived a bit longer.

```{r}
plot(density(rats$time))
```

Stripcharts help visualize the variability in time within each group. It appears poison III was the most lethal, and that treatment A is the least efficacious.

```{r}
stripchart(time ~ poison, data = rats, method = "jitter",
           ylab = "poison", las = 1)
stripchart(time ~ treat, data = rats, method = "jitter",
           xlab = "treatment", las = 1)
```

An interaction plot can help us assess if treatment effect depends on the poison. There doesn't appear to be any major interactions. It looks like treatment B may be preferred to increase survival time, at least for poisons I and II.

```{r}
with(rats, interaction.plot(x.factor = treat,
                            trace.factor = poison,
                            response = time))
```

Next we model time as a function of treat, poison, and their interaction. The summary is somewhat difficult to interpret with interactions.

```{r}
m <- lm(time ~ treat * poison, data = rats)
summary(m)
```

Before we try to interpret this model, let's check the residuals versus fitted values plot.

```{r}
plot(m, which = 1)
```

We see noticeable _heteroscedasticity_. As our fitted values get bigger, so do the residuals. Ideally we would like the residuals to be uniformly scattered around 0, which would indicate our assumption of constant variance is satisfied. That is not the case here.

To address this, we can attempt to transform the outcome. Common transformations include a log transformation and square root transformation. We can use the `powerTransform()` function from the {car} package [@car] to help us identify a suitable transformation.

```{r}
car::powerTransform(m)
```

The suggested parameter, -0.81, is the power we could raise our outcome to. But -0.81 is close to -1, which would be a simple inverse (or reciprocal) transformation. The `boxCox()` function helps us assess if -0.81 is close enough to -1.

```{r}
car::boxCox(m)
```

The outer dashed lines comfortably enclose -1, so we elect to use -1 as a our transformation parameter. We refit the model using this transformation, which can be expressed as `1/time`. This transformation corresponds to _rate of death_.

```{r}
m2 <- lm(1/time ~ treat * poison, data = rats)
```

The residuals versus fitted values plot looks much better.

```{r}
plot(m2, which = 1)
```

Before we attempt to interpret the model, lets assess whether we should keep the interaction using the `Anova()` function from the {car} package.

```{r}
car::Anova(m2)
```

It certainly doesn't seem to contribute much to the model. Let's drop the interaction and fit a new model.

```{r}
m3 <- lm(1/time ~ treat + poison, data = rats)
summary(m3)
```

The summary is a little easier to understand, but the coefficients are relative to the reference levels: Treatment A and Poison I. We can use the {emmeans} package [@emmeans] to estimate mean survival time conditional on the poison type. Notice we specify `type = "response"` and `tran = "inverse"` to back-transform the response back to the original scale.

```{r message=FALSE}
library(emmeans)
emmeans(m3, ~ treat | poison, type = "response", tran = "inverse")
```

We can pipe the output into the {emmeans} plot method to visualize these estimates. Treatments B and D seem to be most effective but also the most variable. No treatment appears to help rats subjected to poison type III.

```{r}
emmeans(m3, ~ treat | poison, type = "response", tran = "inverse") |>
  plot()
```

Another approach to modeling this data is using an inverse Gaussian generalized linear model. Instead of transforming the response, we use the `glm()` function with family set to `inverse.gaussian(link = "inverse")`. We see the coefficients in the model summary are not too different from our model above with a transformed response.

```{r}
m4 <- glm(time ~ treat + poison, data = rats,
          family = inverse.gaussian(link = "inverse"))
summary(m4)
```

The residuals versus fitted values plot hints at some heteroscedasticity, but it's being driven by only 3 points: 30, 24, and 15.

```{r}
plot(m4, which = 1)
```

As before, we can use the {emmeans} package to estimate mean survival times and create a plot with error bars. Notice we don't need to specify a transformation with the `tran` argument. The `emmeans()` function knows to automatically back-transform the response since the "inverse" link is part of the fitted model object.

```{r}
emmeans(m4, ~ treat | poison, type = "response") |>
  plot()
```

The results are very similar to the previous plot.

### References