---
title: "Lab 10 (model)"
author: "Jerid Francom"
date: "11/13/2021"
output:
  pdf_document:
    toc: yes
    number_sections: yes
  html_document:
    toc: yes
    number_sections: yes
bibliography: bibliography.bib
---
```{r setup, include=FALSE}
library(tidyverse) # data manipulation
library(quanteda) # tokenization and document-frequency matrices
library(quanteda.textstats) # descriptive text statistics
library(quanteda.textmodels) # Naive Bayes classifier
source("functions/functions.R") # plot_indicative_features(), evaluate_training_model()
knitr::opts_chunk$set(echo = FALSE) # do not show code by default
```
# Overview
The aim of this script is to create a predictive model for author detection. Specifically, we will try to correctly predict the author of the once-disputed [Federalist papers](https://en.wikipedia.org/wiki/The_Federalist_Papers). The dispute concerned whether Alexander Hamilton or James Madison wrote some or all of these papers. Modern opinion holds that James Madison is the author of the disputed papers [@Mosteller1963]. We will use the same data available to earlier scholars to build a Naive Bayes Text Classifier aimed at predicting the author(s) of the disputed papers.
# Tasks
## Orientation
Let's read the curated Federalist Papers dataset and preview it.
```{r fed-df-read-dataset, message=FALSE}
fed_df <-
read_csv(file = "data/derived/federalist_papers_curated.csv") # read curated dataset
glimpse(fed_df) # preview dataset
```
There are 85 observations and 5 columns.
Now let's look at the data dictionary associated with the curated dataset.
```{r fed-df-read-data-dictionary, message=FALSE}
fed_data_dictionary <- read_csv(file = "data/derived/federalist_papers_curated_data_dictionary.csv") # read data dictionary
fed_data_dictionary %>%
print_pretty_table(caption = "Federalist papers data dictionary.")
```
Let's create a cross-tabulation of the `author` and `disputed` columns to see the relationship between the authors and the disputed status.
```{r fed-df-description}
fed_df %>%
janitor::tabyl(author, disputed)
```
We see that there are 15 disputed papers. According to modern opinion, these papers are attributed to James Madison.
## Preparation
Now let's create a Quanteda corpus object, `fed_corpus`, from the `fed_df` data frame and view a summary of the object.
```{r fed-corpus-create}
# Create corpus object
fed_corpus <-
fed_df %>% # data frame
select(-title) %>% # remove the title column
corpus() # create the corpus object
fed_corpus %>%
summary(n = 10)
```
All 85 documents appear in the corpus. The `author` and `disputed` columns are included as document variables.
Remove the papers authored by John Jay, as he is not considered one of the candidate authors of the disputed papers.
```{r fed-corpus-subset}
fed_corpus <-
fed_corpus %>% # corpus object
corpus_subset(author != "John Jay") # remove John Jay authored papers
fed_corpus %>%
summary(n = 10)
```
There are now five fewer documents after removing the John Jay papers.
## Feature engineering
Tokenize the `fed_corpus` object into words. We will also remove punctuation and numbers and lowercase the text.
```{r fed-tokens-create}
fed_tokens <-
tokens(fed_corpus, # corpus object
what = "word", # tokenize by word
remove_punct = TRUE, # remove punctuation
remove_numbers = TRUE) %>% # remove numbers
tokens_tolower() # lowercase the tokens
fed_tokens %>% # tokens object
head(n = 5) # view the first observations
```
Create a Document-Frequency Matrix (DFM) object where the tokens are features and the papers are documents.
```{r fed-dfm-create}
fed_dfm <-
dfm(fed_tokens) # create document-frequency matrix
fed_dfm %>% # dfm object
head(n = 5) # preview
```
There are `r nfeat(fed_dfm)` word features in the matrix, which has a sparsity of `r round(sparsity(fed_dfm), 2)` (the proportion of cells that are zero).
Let's look at the 20 most frequent features in the matrix.
```{r fed-dfm-topfeatures}
fed_dfm %>% # dfm object
topfeatures(n = 20) # view most frequent features
```
Now let's look at the 10 most frequent features, grouped by `author`.
```{r fed-dfm-topfeatures-author}
fed_dfm %>% # dfm object
textstat_frequency(n = 10, groups = author) # get top frequency grouped by author
```
Visualize the top features by `author`.
```{r fed-dfm-topfeatures-author-view}
fed_dfm %>% # dfm
textstat_frequency(n = 25, groups = author) %>% # get top frequency grouped by author
ggplot(aes(x = reorder(feature, frequency),
y = frequency,
fill = group)) + # mappings
geom_col(show.legend = FALSE) + # bar plot (with no legend)
coord_flip() + # flip x/y coordinates
facet_wrap(~group, scales = "free") + # create separate plots for each author
labs(x = "Words", y = "Raw frequency") # labels
```
Many of the same features appear frequently in the papers of each author.
Let's explore weighting the raw frequency counts to take into account how often terms appear across documents. I will apply Term Frequency-Inverse Document Frequency (TF-IDF) weighting to the raw counts and visualize the top features by `author`.
```{r fed-dfm-weighted-topfeatures-author-view}
fed_dfm %>% # dfm
dfm_tfidf() %>% # weigh frequency by tf-idf
textstat_frequency(n = 25, groups = author, force = TRUE) %>% # get top frequency grouped by author
ggplot(aes(x = reorder(feature, frequency),
y = frequency,
fill = group)) + # mappings
geom_col(show.legend = FALSE) + # bar plot (with no legend)
coord_flip() + # flip x/y coordinates
facet_wrap(~group, scales = "free") + # create separate plots for each author
labs(x = "Words", y = "TF-IDF") # labels
```
We see that the TF-IDF weights help distinguish the top features of the two authors. We will continue for the moment with the unweighted raw frequency counts, but keep this weighting in mind if the predictive model does not perform well.
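For reference, the default weighting scheme in `dfm_tfidf()` multiplies each raw count by the base-10 logarithm of the inverse document frequency. Below is a minimal by-hand check of that formula (shown but not evaluated; the feature "upon" is only an illustrative choice and is assumed to be present in the matrix):
```{r tfidf-by-hand, echo=TRUE, eval=FALSE}
# Default dfm_tfidf() weight: tf-idf(w, d) = count(w, d) * log10(ndoc / docfreq(w))
n_docs <- ndoc(fed_dfm)                    # number of documents
df_upon <- docfreq(fed_dfm)[["upon"]]      # number of documents containing "upon"
manual <- as.numeric(fed_dfm[1, "upon"]) * log10(n_docs / df_upon)
weighted <- as.numeric(dfm_tfidf(fed_dfm)[1, "upon"])
all.equal(manual, weighted)                # should be TRUE
```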
Now I will split the unweighted `fed_dfm` matrix into a training matrix `fed_dfm_train` and a testing matrix `fed_dfm_test`. In this case the 'training' data are the undisputed papers and the 'testing' data are the disputed papers.
```{r fed-dfm-train-test}
# Create training and testing dfms
fed_dfm_train <-
fed_dfm %>%
dfm_subset(disputed == FALSE)
fed_dfm_train %>%
docvars() %>%
janitor::tabyl(author)
fed_dfm_test <-
fed_dfm %>%
dfm_subset(disputed == TRUE)
fed_dfm_test %>%
docvars() %>%
janitor::tabyl(author)
```
Contrasts in the proportions between the training and testing sets are not relevant here as the test set is composed solely of papers purported to be written by James Madison.
## Model training
We will use the Naive Bayes Text Classifier to generate a model that attempts to learn the differences in language usage (unweighted word counts) between Alexander Hamilton and James Madison. We will look at the model summary and preview the feature and document probabilities.
```{r fed-fit-model}
# Train model
nb1 <-
textmodel_nb(x = fed_dfm_train, # document-feature matrix
y = fed_dfm_train$author) # class labels
summary(nb1) # model summary
```
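Before interpreting the summary, it may help to recall the decision rule a multinomial Naive Bayes classifier applies: each document is assigned the class with the highest sum of the log prior and the count-weighted log feature probabilities. A minimal sketch of that rule (shown but not evaluated), assuming the fitted object stores the class priors as `priors` and the class-conditional feature probabilities as `param` (classes in rows), as in current versions of quanteda.textmodels:
```{r nb-decision-rule, echo=TRUE, eval=FALSE}
# score(d, k) = log P(k) + sum over words w of count(w, d) * log P(w | k)
scores <- log(nb1$priors) +
  as.numeric(fed_dfm_train[1, ] %*% t(log(nb1$param))) # scores for document 1
names(scores) <- names(nb1$priors)
names(which.max(scores)) # highest-scoring class is the predicted author
```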
The fitted model assumes both authors are equally likely a priori (a prior of 0.5 for each). We also see a preview of the estimated probability scores for each feature. To get a better look at the top features, I will apply the `plot_indicative_features()` custom function.
```{r fed-fit-explore}
plot_indicative_features(nb_model = nb1, top_n = 25)
```
We see that there are distinct features, as expected. However, the top features for Alexander Hamilton are weighted more heavily. Many of these words (e.g., 'the', 'of') will likely also appear in the James Madison papers, so this may lead to lower accuracy scores on the test set.
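One rough way to quantify this concern is to count how many of the top raw-frequency features the two authors share; a sketch (shown but not evaluated):
```{r top-feature-overlap, echo=TRUE, eval=FALSE}
# Features appearing in both authors' top-25 raw-frequency lists
top_by_author <- textstat_frequency(fed_dfm_train, n = 25, groups = author)
with(top_by_author,
     intersect(feature[group == "Alexander Hamilton"],
               feature[group == "James Madison"]))
```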
Let's evaluate the `nb1` model's performance on the training dataset itself.
```{r nb1-evaluate}
evaluate_training_model(nb1, classes = c("Alexander Hamilton", "James Madison"))
```
The results appear to be perfect. High scores on the training dataset are expected, as the same data used to train the model are also used to evaluate it. The question is whether the model's features will generalize to the testing data.
## Model testing
We now use the trained model `nb1` to predict the author(s) of the disputed papers and view the predictions.
```{r fed-test-model-predictions}
# Test model predictions
dfm_matched <-
dfm_match(fed_dfm_test, # test dfm
features = featnames(nb1$x)) # (left) join with trained model features
predicted_class <- predict(nb1, newdata = dfm_matched) # apply model to test dataset
predicted_class %>%
as_tibble(rownames = "names") %>% # convert to tibble with rownames as a column
select(doc_id = names, prediction = value) %>% # rename columns
arrange(doc_id) # sort by doc_id
```
We can see that many documents are predicted to have been authored by Alexander Hamilton, which is not the desired result.
## Evaluation
Retrieve the previously suggested author(s) of the disputed papers (i.e., "James Madison") and compare them to the predictions made by our text classification model using a confusion matrix.
```{r fed-evaluate-predictions}
# Evaluate
actual_class <- dfm_matched$author # get actual class labels
tab_class <- table(actual_class, predicted_class) # cross-tabulate actual and predicted class labels
tab_class
```
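To reduce the table to a single number, here is a sketch of an accuracy calculation (shown but not evaluated), taking the modern attribution to James Madison as ground truth:
```{r nb1-test-accuracy, echo=TRUE, eval=FALSE}
# Proportion of disputed papers assigned to the (assumed) true author
mean(as.character(predicted_class) == actual_class)
```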
The confusion matrix for the results of the model on the testing dataset suggests that the model does not perform well.
## Model improvement
Let's return to the idea of weighting the raw frequency counts to highlight the most indicative features across all the documents by using TF-IDF weights. We will also create new training/testing splits.
```{r fed-dfm-weighted-tfidf}
fed_dfm <-
fed_dfm %>%
dfm_tfidf() # weight by term-frequency inverse-document frequency
fed_dfm_train <-
fed_dfm %>%
dfm_subset(disputed == FALSE)
fed_dfm_test <-
fed_dfm %>%
dfm_subset(disputed == TRUE)
```
Train a new model using the Naive Bayes Text Classifier. This new model, `nb2`, is based on weighted (TF-IDF) word usage.
```{r fed-weighted-fit-model}
nb2 <-
textmodel_nb(x = fed_dfm_train, # document-feature matrix
y = fed_dfm_train$author) # class labels
```
Evaluate the trained `nb2` model.
```{r nb2-evaluate}
evaluate_training_model(nb2, classes = c("Alexander Hamilton", "James Madison"))
```
Let's look at the most indicative features for each author.
```{r fed-weighted-fit-explore}
plot_indicative_features(nb_model = nb2, top_n = 25)
```
The features here are less common words, which bodes well for helping our predictive model distinguish between the two authors.
Let's apply our new `nb2` model to the testing dataset.
```{r fed-test-model-weighted-predictions}
# Test model predictions
dfm_matched <-
dfm_match(fed_dfm_test, # test dfm
features = featnames(nb2$x)) # (left) join with trained model features
predicted_class <- predict(nb2, newdata = dfm_matched)
predicted_class %>%
as_tibble(rownames = "names") %>% # convert to tibble with rownames as a column
select(doc_id = names, prediction = value) %>% # rename columns
arrange(doc_id) # sort by doc_id
```
We can see that all the documents except one were predicted to be authored by James Madison, which is very close to our desired result.
Evaluate the predictions and produce a confusion matrix.
```{r fed-evaluate-weighted-predictions}
# Evaluate
actual_class <- dfm_matched$author # get actual class labels
tab_class <- table(actual_class, predicted_class) # cross-tabulate actual and predicted class labels
tab_class
```
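The same accuracy sketch (shown but not evaluated) applies to the weighted model:
```{r nb2-test-accuracy, echo=TRUE, eval=FALSE}
# Proportion of disputed papers now assigned to James Madison
mean(as.character(predicted_class) == actual_class)
```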
The confusion matrix for the results of the model on the testing dataset suggests that the model performs much better, although not perfectly.
# Assessment
...