-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathTidyverse.Rmd
More file actions
561 lines (416 loc) · 18.3 KB
/
Tidyverse.Rmd
File metadata and controls
561 lines (416 loc) · 18.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
---
title: "Tidyverse"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction to Tidyverse
## Idealogy behind R for Data Science
Tidyverse is a collection of R packages that enables tools for data science, and is
especially useful for data wrangling, manipulation, visualization, and communication
of large data sets. Extensions of tidyverse
also enable direct connections and manipulation with SQL databases (e.g, dbplyr). Here
we briefly introduce some main concepts when this programming, all derived directly from
the open access book R for Data Science by Garrett Grolemund and Hadley Wickham (which
can be found [here](https://r4ds.had.co.nz/)).
As you can read [here](https://r4ds.had.co.nz/introduction.html), the main idea behind
using tidyverse is that exploratory data analysis in R is composed of a few main steps:
first is importing and tidying data, then iteratively transforming, visualising, and
modeling data to understand patterns held by them, and finally communicating results
effectively. Tidyverse was designed as a programming method and collection of functions
that are focused on easing these tasks into a simple uniform routine that can be applied
to any dataset. Standardizing the approach taken toward any data science project then
aids reproducability of any project as well as the ability to collaborate on a project.
## The foundation: tibbles, tidy data, and piping
### Tibbles
First we need to install and load Tidyverse. After that we can have a look at what at
the main form of data storage, called a tibble:
```{r tidyverse, eval = FALSE, echo = FALSE, results = FALSE, message = FALSE}
install.packages('tidyverse')
install.packages('nycflights13') # this is an example data package
install.packages('DT') # temporarily necessary to display tibbles correctly on web page
```
```{r tibble, echo = FALSE, results = FALSE, message = FALSE}
library(tidyverse)
```
```{r tibble2}
library(nycflights13)
```
```{r helper}
present_table <- function(x){
x %>%
slice(1:20) %>%
DT::datatable()
}
```
Note that a tibble is essentially the same as a data frame (for example made with data.frame) but with some useful information printed (e.g., dimensions and data types), as well as some restrictions placed on how it an be manipulated. These help prevent common errors. For example, recycling is possible as it is in data frames but less flexible:
```{r tibbleerror, eval=FALSE}
flights$year <- c(2013, 2014)
try(flights$year <- rep(2014, 7))
```
## Tidy data
In addition, the above data may look like a standard data set obtained from anywhere,
but it is not. It has already been formatted as 'tidy data'. Although data can be
represented a variety of ways in tables for visualization, but for data manipulation
and analysis, there is only one format that is much easier to use than others.
Therefore, all data should be transformed into this format before analyzing. This
format is called the 'tidy' dataset in tidyverse, and following three rules make a
dataset 'tidy':
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
If, for example, the flights dataset were organized such that each carrier, origin, or
dest had their own sets of columns, the data would no longer be tidy.
### Piping
In base R programming, functions wrap around the objects that they are applied to, which
are often indexed, and this manipulated object is saved as a new one. What is written is
arranged like an onion: in the following example, the first step of the command is in
the center of code (calling the object flights), followed by indexing the 15th and 16th
columns. As we move away from the center, a function is applied, and finally the output
of that function is assigned to a new object.
```{r base}
sub <- apply(flights[,c(15:16)], 2, mean, na.rm = T)
```
Piping, or using `%>%` to pass objects from one function to the next, introduces a
programming method that makes the process more intuitive by alligning the code with the
order of operation:
```{r pipe}
flights %>%
select(15, 16) %>% # or use select(air_time, distance)
apply(., 2, mean, na.rm = T) -> sub
```
In the above code, the flight tibble was piped into the 'select' function, which indexed
its 15th and 16th rows only. `Select` does not require an argument where flights is
referenced because it was built to accept a piped argument implicitly. Note that
selecting by column name (no quotations needed) is also possible and
more useful in most cases. After being piped to 'select', the result was then piped to
the function `apply`. `Apply` is an older function that is not built to implicitly
accept piped objects; therefore, it requires the placeholder `.` to be placed
where the input data frame is expected. Finally, this modified data frame
is assigned to 'sub' at the end, but alternatively it could have been assigned at the
beginning as in the non-piped version.
## The power of tidyverse: all you need in a handful of functions
As in the `select` function, there are a variety of functions that come with the
tidyverse package, but only a small set are needed to do almost any kind of data
wrangling that you ever wanted to do. These are the only functions we touch on in this
brief introduction. However, beyond tidyverse, there are also a variety of
packages that implement more advanced piping-compatible functions that speed the
manipulation of large data sets in particular (e.g., dbplyr, purrrlyr).
The most commonly used tidyverse commands, with a brief description, include:
* select() - select columns
* filter() - retain rows according to boolean criteria
* arrange() - sorts data
* rename() - renames existing columns
* mutate() - writes new columns
* group_by() / ungroup() - groups data according to column values (such as factors)
* summarise() - reduces dataset to an aggregated leve. Used after grouping (which
defines the aggregation level) and along with functions that define how to aggregate
(e.g., count(), n(), sum(), mean()).
* gather() / spread() - converts data between the `wide` and `long` (tidy) formats
* full_join(), left_join(), etc. - joins data contained in two data frames according to
certain criteria that define how rows are compatible (i.e., joining in relational
databases)
Below is an example of how the function `apply` in the previous example can be replaced
using tidyverse commands, as well as functions such as `aggregate` using `group_by` and `summarise`
```{r replace}
flights %>%
select(air_time, distance) %>%
summarise(mn_airtime = mean(air_time, na.rm = T),
mn_distance = mean(distance)) %>%
present_table() #this last line not necessary - helps with printing to web page
# or if the operation should occur by groupings:
flights %>%
select(dest, air_time, distance) %>%
group_by(dest) %>%
summarise(mn_airtime = mean(air_time, na.rm = T),
mn_distance = mean(distance)) %>%
present_table() #this last line not necessary - helps with printing to web page
```
Here is a smattering of demonstrations on how to use the other important functions and their equivalents in base R:
#### filter()
```{r smat filter}
#filter()
flights[flights$month==3 & flights$dest=="DEN",] %>%
present_table() #this last line not necessary - helps with printing to web page
flights %>%
filter(month == 3, dest == "DEN") %>%
present_table() #this last line not necessary - helps with printing to web page
```
#### arrange()
```{r smat arrange}
o <- order(flights$distance)
flights[o,c('year','month','day','distance')]%>%
present_table() #this last line not necessary - helps with printing to web page
flights[rev(o),c('year','month','day','distance')]%>%
present_table() #this last line not necessary - helps with printing to web page
flights %>%
select(year, month, day, distance) %>%
arrange(distance) %>%
present_table() #this last line not necessary - helps with printing to web page
flights %>%
select(year, month, day, distance) %>%
arrange(desc(distance)) %>%
present_table() #this last line not necessary - helps with printing to web page
```
#### rename()
```{r smat rename}
flights2 <- data.frame(flights, year2 = flights$year)
flights2 <- as_tibble(flights2[c(dim(flights2)[2],2:(dim(flights2)[2]-1))])
flights2 %>%
present_table() #this last line not necessary - helps with printing to web page
flights2 <-
flights %>%
rename(year2 = year)
flights2 %>%
present_table() #this last line not necessary - helps with printing to web page
```
#### mutate()
```{r smat mutate}
flights2 <- as_tibble(data.frame(flights, air_time_hr = flights$air_time/60, distance_1000m = flights$distance/1000)) %>%
present_table() #this last line not necessary - helps with printing to web page
flights2 <-
flights %>%
mutate(air_time_hr = air_time/60, distance_1000m = distance/1000)
flights2 %>%
present_table() #this last line not necessary - helps with printing to web page
```
#### group_by() with count()
```{r smat count}
#count() with group_by()
#this yields data in a `wide` format
flights2a <- t(table(flights[,c('origin', 'dest')]))
head(flights2a) #this last line not necessary - helps with printing to web page
#this yields 'tidy' 'long' data
flights2 <-
flights %>%
group_by(origin, dest) %>%
count()
flights2 %>%
present_table() #this last line not necessary - helps with printing to web page
```
#### spread() / gather() to convert between wide and long (tidy) formats:
```{r smat spread}
flights2b <-
flights2 %>%
as_tibble() %>%
spread(key = origin, value = n, fill = 0)
#result in the same formats
flights2b %>%
present_table() #this last line not necessary - helps with printing to web page
head(flights2a)
flights2b %>%
gather(key = origin, value = n, -dest) %>%
filter(n!=0)
head(flights2) %>%
present_table()
```
## Plotting in tidyverse
Tidyverse also uses ggplot2, which is intended to simplify the process of creating plots
so that data can be quickly and easily visualized as an iterative component of the
exploratory analysis process. Some advantages include an clear method for translating
data to visuals, having many preconfigured attributes available, and being able to build
and modify previously stored plot objects without needing to recreate them. A downside
is that to take advantage of the full power and flexibility of ggplot2 requires a wide
knowledge of what is available as options to include in graphics, and therefore involve
a long learning curve. However, the ultimate results are well worth the
learning investment. For a basic explanation and cheat sheet see
[here.](https://ggplot2.tidyverse.org/)
### ggplot: Key components
ggplot has __three__ key components:
1. __data__, this must be a `data.frame`
2. A set of aesthetic mappings (`aes`) between variables in the data and
visual properties, and
3. At least one `layer` which describes how to render each observation.
```{r}
sub <-
flights %>%
sample_n(100, replace = F) %>%
filter(!is.na(distance), !is.na(air_time), !is.na(origin), !is.na(month), !is.na(dest))
ggplot(data = sub, aes(x = distance, y = air_time)) + geom_point()
```
Different syntax, equivalent outcome:
```{r, eval = FALSE}
ggplot(sub, aes(distance, air_time)) + geom_point()
ggplot() + geom_point(data = sub, aes(distance, air_time))
ggplot(data = sub) + geom_point(aes(x = distance, y = air_time))
ggplot(sub) + geom_point(aes(distance, air_time))
```
Can be stored as an object for later use. This is a useful feature of Rgadget: because
default plots are created in ggplot, they can be stored and modified by the user at a
later point.
```{r}
p <- ggplot(sub, aes(distance, air_time)) + geom_point()
```
The class:
```{r}
class(p)
```
The structure (a bit of Latin - not run here):
```{r, eval = FALSE}
str(p)
```
### Aesthetic mapping
Adding more variables to a two dimensional scatterplot can be done by mapping the variables to aesthetics (colour, fill, size, shape, alpha)
#### colour
```{r, out.width = "50%", fig.show = "hold"}
p <- ggplot(sub, aes(distance, air_time))
p + geom_point(aes(colour = origin))
p + geom_point(aes(colour = dest))
```
Manual control of colours or other palette schemes (here brewer):
```{r, out.width = "50%", fig.show = "hold"}
p + geom_point(aes(colour = origin)) +
scale_colour_manual(values = c("orange","brown","green"))
p + geom_point(aes(colour = dest)) +
scale_colour_brewer(palette = "Set1")
```
Note, to view all the brewer palettes do:
```{r, eval = FALSE}
RColorBrewer::display.brewer.all()
```
#### shape
```{r}
p + geom_point(aes(distance, air_time, shape = origin))
```
#### size
```{r}
p + geom_point(aes(distance, air_time, size = month))
```
One can also "fix" the aesthetic manually, e.g.:
```{r}
ggplot(sub, aes(distance, air_time)) + geom_point(colour = "blue", shape = 8, size = 10)
```
Note here that the call to colour, shape, etc. is done outside the `aes`-call. One can also combine calls inside and outside the `aes`-function (here we showing overlay of adjacent datapoints):
```{r}
p + geom_point(aes(distance, air_time, size = month), alpha = 0.3, col = "red")
```
### Facetting
Splitting a graph into subsets based on a categorical variable.
```{r}
ggplot(sub) +
geom_point(aes(distance, air_time, colour = as.factor(year))) +
facet_wrap(~ origin)
```
One can also split the plot using two variables using the function `facet_grid`:
```{r}
ggplot(sub) +
geom_point(aes(distance, air_time)) +
facet_grid(as.factor(year) ~ origin)
```
### Adding layers
The power of ggplot comes into play when one adds layers on top of other layers. Let's for now look at only at two examples.
### Add a line to a scatterplot
```{r}
ggplot(sub, aes(distance, air_time)) +
geom_point() +
geom_line()
```
#### Add a smoother to a scatterplot
```{r, out.width = "33%", fig.show = "hold"}
p <- ggplot(sub, aes(distance, air_time))
p + geom_point() + geom_smooth()
p + geom_point() + geom_smooth(method = "lm")
```
```{r, echo = FALSE}
ggplot(sub, aes(distance, air_time, colour = origin)) +
geom_point() +
geom_smooth() +
facet_wrap(~ as.factor(year)) +
scale_colour_brewer(palette = "Set1")
```
## Statistical summary graphs
___
There are some useful *inbuilt* routines within the ggplot2-packages which allows one to create some simple summary plots of the raw data.
#### bar plot
One can create bar graph for discrete data using the `geom_bar`
```{r}
ggplot(sub, aes(dest)) + geom_bar()
```
The graph shows the number of observations we have of each destination. The original data is first transformed behind the scene into a table of counts, before being rendered.
#### histograms
For continuous data one uses the `geom_histogram`-function (left default bin-number, right bindwith specified as 50 mins):
```{r, out.width = "50%", fig.show = "hold"}
p <- ggplot(sub, aes(air_time))
p + geom_histogram()
p + geom_histogram(binwidth = 50)
```
One can add another variable (left) or better use facet (right):
```{r, out.width = "50%", fig.show = "hold"}
p + geom_histogram(aes(fill = origin))
p + geom_histogram() + facet_wrap(~ origin, ncol = 1)
```
#### Frequency polygons
Alternatives to histograms for continuous data are frequency polygons:
```{r, out.width = "50%", fig.show = "hold"}
p + geom_freqpoly(lwd = 1)
p + geom_freqpoly(aes(colour = origin), lwd = 1)
```
#### Box-plots
Boxplots, which are more condensed summaries of the data than histograms, are called using `geom_boxplot`. Here two versions of the same graph are used, the one on the left is the default, but on the right we have reordered the maturity variable on the x-axis such that the median value of length increases from left to right:
```{r, out.width = "50%", fig.show = "hold"}
ggplot(sub, aes(dest, air_time)) + geom_boxplot()
p <- ggplot(sub, aes(reorder(dest, air_time), air_time)) + geom_boxplot()
p
```
It is sometimes useful to plot the "raw" data over summary plots. Using `geom_point` as an overlay is sometimes not very useful when points overlap too much; `geom_jitter` can sometimes be more useful:
```{r, out.width = "50%", fig.show = "hold"}
p + geom_point(colour = "red", alpha = 0.5, size = 1)
p + geom_jitter(colour = "red", alpha = 0.5, size = 1)
```
Read the help on `geom_violin` and create a code that results in this plot:
```{r, echo = FALSE}
ggplot(sub, aes(reorder(dest, air_time), air_time)) +
geom_violin(scale = "width") +
geom_jitter(col = "red", alpha = 0.5, size = 1)
```
## Other statistical summaries
Using `stat_summary` one can call specific summary statistics. Here are examples of 4 plots, going from top-left to bottom right we have:
* Raw data with median length at age (red) superimposed
* A pointrange plot showing the mean and the range
* A pointrange plot showing the mean and the standard error
* A pointrange plot showing the bootstrap mean and standard error
```{r, fig.show = "asis"}
sub$distance <- round(sub$distance)
p <- ggplot(sub, aes(distance, air_time))
p + geom_point(alpha = 0.25) + stat_summary(fun.y = "median", geom = "point", colour = "red")
p + stat_summary(fun.y = "mean", fun.ymin = "min", fun.ymax = "max", geom = "pointrange")
p + stat_summary(fun.data = "mean_se")
p + stat_summary(fun.data = "mean_cl_boot")
```
## Some controls
### labels
```{r}
p <- ggplot(sub, aes(distance, air_time, colour = origin)) + geom_point()
p + labs(x = "Distance (miles)", y = "Air time (minutes)",
colour = "Origin",
title = "My flight plot",
subtitle = "My nice subtitle",
caption = "My caption")
```
### Legend position
```{r, fig.show = "asis"}
p + theme(legend.position = "none")
p <- p + theme(legend.position = c(0.8, 0.3))
p
```
### breaks
Controls which values appear as tick marks
```{r}
p +
scale_x_continuous(breaks = seq(5, 45, by = 10)*100) +
scale_y_continuous(breaks = seq(50, 400, by = 50))
```
### limits
```{r, fig.show = "asis"}
p + ylim(100, 500)
p + ylim(NA, 500) # setting only upper limit
```
## Further readings
___
* The ggplot2 site: http://ggplot2.tidyverse.org
* The ggplot2 book in the making: https://github.com/hadley/ggplot2-book
- A rendered version of the book: http://www.hafro.is/~einarhj/education/ggplot2
- needs to be updates
* [R4DS - Data visualisation](http://r4ds.had.co.nz/data-visualisation.html)
* R graphics cookbook: http://www.cookbook-r.com/Graphs
* [Data Visualization Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)