-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathlab_9_model.Rmd
More file actions
290 lines (196 loc) · 11.8 KB
/
lab_9_model.Rmd
File metadata and controls
290 lines (196 loc) · 11.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
title: 'Lab #9 (model)'
author: "Jerid Francom"
date: "11/5/2021"
output:
pdf_document:
toc: yes
number_sections: yes
latex_engine: xelatex
html_document:
toc: yes
number_sections: yes
---
```{r setup, message=FALSE}
library(tidyverse) # data manipulation
library(knitr) # pretty formatted tables
library(skimr) # descriptive summaries
library(patchwork) # organize plots
library(effectsize) # calculate confidence and generate effect size
library(report) # create boilerplate statistical reporting
source("functions/functions.R") # print_pretty_table()
```
# Overview
The aim of this script will be use the transformed Switchboard Dialogue Act Corpus dataset to analyze a the following alternative hypothesis:
$H_1$: Women use more hedges than men.
Then, the null hypothesis is:
$H_0$: Women do not use more hedges than men.
Background information for the analysis:
Lakoff (1973) argues that women express themselves tentatively without warrant or justification more often than men. This suggestion would predict that women will use more hedges than men. What is a hedge? A hedge is used to diminish the confidence or certainty with which the speaker makes a statement or answers a question.
Examples of hedges:
(1) General example:
- *I don't know if I'm making any sense or not.*
(2) In context from the Switchboard Dialogue Act Corpus:
- You might try,
- *I don't know,*
- hold it down a little longer,
In the Switchboard Dialogue Act Corpus hedges are marked in the DAMSL tag annotation as `h` (hedge), `h^r` (repeated hedge), or `h^t` (hedge when talking about the task).
The findings from an Ordinary Least Squares Regression analysis using a log-transformed and normalized count of hedge use per speaker found that there are no difference between men and women's usage of hedges given the data in the Switchboard Dialogue Act Corpus, supporting Holmes' (1990) and arguing against Lakoff (1973. The age of the participants, used as a control variable, however, is found to be significant, although the effect size is small.
# Tasks
## Orientation
Read in the dataset and the data dictionary.
```{r sdac-read, message=FALSE}
sdac <- read_csv(file = "data/derived/sdac_transformed.csv") # read transformed sdac dataset
sdac_data_dictionary <- read_csv(file = "data/derived/sdac_transformed_data_dictionary.csv") # read transformed sdac data dictionary
```
```{r sdac-preview-dataset}
glimpse(sdac) # preview dataset structure
```
The `sdac` dataset has 223,606 observations and 5 variables.
```{r sdac-view-data-dictionary, echo=FALSE}
sdac_data_dictionary %>% # dataset
print_pretty_table(caption = "SDAC dataset data dictionary.")
```
The data dictionary shows that the DAMSL tag information is contained in the `damsl_tag` column. The variables `sex` and `age` are included in this dataset. Although `sex` is the primary variable of interest, `age` will be used as a control factor to account for potential variability which may be due to the age of the speaker participants.
## Preparation
Count the number of hedges. Include 'h', 'h^r', or 'h^t'. Use the `str_count()` function on the `damsl_tag` column and count the matches to the regular expression `"^h(\\^r|\\^t)?`. Create a new column with the match counts to `hedges`.
```{r sdac-count-hedges}
sdac_hedges <-
sdac %>% # dataset
mutate(hedges = str_count(damsl_tag, "^h(\\^r|\\^t)?")) # count hedges
```
Sum and normalize the number of hedges used by each speaker. Group the data by `speaker_id`, `sex`, and `age` and then use `sum()` to sum the `hedges` and divide by the number of utterances per speaker (`n()`). Multiple the result by $1000$ to get a number of hedges per 1000 utterances score. Remember to `ungroup()` the result to leave the data frame without grouping parameters.
```{r sdac-relative-counts-hedges, message=FALSE}
sdac_hedges <-
sdac_hedges %>% # dataset
group_by(speaker_id, sex, age) %>% # grouping parameters
summarize(hedges_per_utt = (sum(hedges)/ n()) * 1000) %>% # sum hedges and normalized per number of utterances per speaker
ungroup() # remove grouping parameters
```
Preview the result.
```{r}
sdac_hedges %>%
print_pretty_table(caption = "First 10 observations of prepared `sdac_hedges` data frame.")
```
There is one incomplete case, 155. Remove the one incomplete case using `filter()`.
```{r sdac-remove-incomplete-case}
sdac_hedges <-
sdac_hedges %>% # dataset
filter(speaker_id != 155) # remove speaker_id 155
```
The last step to prepare the dataset for analysis is to convert the categorical variables to factors. Neither need levels or new labels, so I will just use the `factor()` function and overwrite the current variable.
```{r sdac-convert-categorical-to-factor}
sdac_hedges <-
sdac_hedges %>% # dataset
mutate(speaker_id = factor(speaker_id)) %>% # convert to factor
mutate(sex = factor(sex)) # convert to factor
glimpse(sdac_hedges) # preview the data structure
```
## Descriptive assessment
Now we are ready to do the descriptive assessment for our analysis. Let's first look at each of the variables separately --that is, as a univariate description.
```{r sdac-uni-cat}
sdac_hedges %>%
select(-speaker_id) %>% # deselect speaker_id
skim() %>% # get data summary
yank("factor") # only show factor-oriented information
```
The factor `sex` has 234 males and 206 females.
```{r sdac-uni-num}
num_skim <- skim_with(numeric = sfl(iqr = IQR)) # add IQR to skim
sdac_hedges %>% # dataset
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
The mean `age` of the speakers is 37.6 which is close to the median. The `hedges_per_utt` has a mean hedge use per 1000 utterances of 5.72 which is larger than the median, suggesting that the variable is right-skewed.
Explore the dependent variable `hedges_per_utt`. We will create a histogram and density plot.
```{r sdac-visual-dep, message=FALSE}
p1 <-
sdac_hedges %>% # dataset
ggplot(aes(x = hedges_per_utt)) + # mappings
geom_histogram() + # histogram
labs(x = "Hedges", y = "Count") # labels
p2 <-
sdac_hedges %>% # dataset
ggplot(aes(x = hedges_per_utt)) + # mappings
geom_density() + # density plot
geom_rug() + # add rug for individual observations
labs(x = "Hedges", y = "Density") # labels
p1 + p2 + plot_annotation("Distribution of hedges per 1000 utterances") # organize plots
```
The `hedges_per_utt` variable is right-skewed, the mean is greater than the median.
Since this variable is not discrete (our values are not whole numbers and contain a large range of values) we can try to apply a log-transformation to see if we can bring the distribution more in line with the normal distribution. For this all we need to do is apply the `log()` function to the `hedges_per_utt` variable. We can do this right inside the plotting operation to see how the log transformation affects the distribution.
Log transform the continuous dependent variable and create a density and QQ-plot.
```{r sdac-visual-dep-log, message=FALSE}
p1 <-
sdac_hedges %>% # dataset
ggplot(aes(x = log(hedges_per_utt))) + # mappings
geom_density() + # density plot
geom_rug() + # add rug for individual observations
labs(x = "Hedges (log)", y = "Density") # labels
p2 <-
sdac_hedges %>%
ggplot(aes(sample = log(hedges_per_utt))) + # mapping
stat_qq() + # calculate expected quantile-quantile distribution
stat_qq_line() # plot the qq-line
p1 + p2 + plot_annotation("Log-transformed distriution of hedges per 1000 utterances")
```
Apply the log transformation to `hedges_per_utt` using the `log()` function with and adding `1` to all the counts to avoid undefined `log(0)` where a speaker did not use any hedges. Since we are adding 1 to all the counts, the distribution remains the same.
```{r sdac-log-transform}
sdac_hedges_log <-
sdac_hedges %>%
mutate(hedges_per_utt_log = log(hedges_per_utt + 1)) # create log-transformed hedges_per_utt (add 1 to avoid -Inf)
```
The plots suggest that the log transformation bring the distribution of `hedges_per_utt` much closer to the normal distribution. But let's perform the Shapiro-Wilk test to verify.
```{r sdac-normality-test}
s1 <- sdac_hedges_log$hedges_per_utt_log %>% shapiro.test() # Shapiro-Wilk test of normality
s1 # test results summary
s1$p.value < .05 # confirm p-value
```
The $p$-value is significant suggesting the distribution is non-normal. But as we see from the log-transformed density plot and the QQ-plot is does not wildly diverge from the normal distribution. Nonetheless, if we were performing certain tests this distribution would be treated as non-parametric.
However, as we will see we will not be working with one of these tests, but rather applying Ordinary Least Squares Regression with the `lm()` function. This test is robust for dependent variables whose values are numeric and span a large range of values (like our `hedges_per_utt` variable which is a ratio).
Let's now take a look at the relationship between the variables we are going to add to our statistical model. Let's group our summaries by the categorical variable `sex`.
```{r sdac-grouped-numeric-descriptives}
sdac_hedges_log %>%
select(-speaker_id) %>% # deselect speaker_id
group_by(sex) %>% # grouping parameter
num_skim() %>% # get custom data summary
yank("numeric") # only show numeric-oriented information
```
Focusing in on `hedges_per_utt_log`, we can see that there does not seem to be much difference between males (mean 1.29) and females (mean 1.39) in terms of their use of hedges. The distribution also seems quite comparable as the median scores are similar to the means for both levels of `sex`.
Let's visualize these numeric descriptives. We will use a scatterplot as we will be comparing `hedges_per_utt_log`, and `age`. Then we will use the levels of `sex` to color our scatter points and trend lines.
```{r sdac-grouped-visualization, message=FALSE}
p1 <-
sdac_hedges_log %>% # dataset
ggplot(aes(x = age, y = hedges_per_utt_log, color = sex)) + # mappings
geom_point(alpha = 1/2) + # points, add alpha for overplotting
geom_smooth(method = "lm") + # trend line
labs(x = "Age", y = "Hedges per 1000 utterances", color = "Sex") # labels
p1
```
Looks like hedges decrease as a function of age, regardless of sex. Mai upshot the confidence intervals surrounding the trend lines for men versus women overlaps significantly -therefore the visual inspection suggests that men and women use hedges as similar rates.
## Statistical interrogation
We will now conduct an Ordinary Least Squares Regression with the `lm()` function.
```{r sdac-statistical-test}
m1 <- lm(hedges_per_utt_log ~ age + sex, data = sdac_hedges_log) # fit the model
summary(m1) # preview model results
```
The independent variable `age` is the only significant predictor in the model.
## Evaluation
Let's evaluate the effect size and confidence intervals for this model. We will assess the control variable, only as a *post-hoc* (after the fact) finding.
```{r sdac-evaluation}
effects <- effectsize(m1) # evaluate effect size and generate a confidence interval
effects # preview effect size and confidence interval
interpret_r(effects$Std_Coefficient[2]) # interpret the effect size
```
The coefficient for `age` falls within the confidence interval but the interval size is quite large relative to the coefficient size. Furthermore, the interpretation of our coefficient suggests that the effect size is quite small.
## Reporting
To give us some boilerplate information to add to our write-up for this project, let's use the reporter package's `report_text()` on the `m1` model.
```{r sdac-report-text}
report_text(m1)
```
If we would like a summary table of all the results to include in the write-up we can use the `report_table()` function.
```{r sdac-report-table}
report_table(m1)
```
# Assessment
...