-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathExercise6.Rmd
More file actions
129 lines (93 loc) · 4.69 KB
/
Exercise6.Rmd
File metadata and controls
129 lines (93 loc) · 4.69 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: "Question 6"
author: "Sujay Chebbi"
date: "8/17/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Question 6 - Association Rule Mining
```{r, include = FALSE}
library(tidyverse)
library(arules)
library(arulesViz)
library(igraph)
library(data.table)
```
reads groceries text file as transaction dataset and inspects the first 6 baskets
```{r}
groceries = read.transactions(file = "https://raw.githubusercontent.com/jgscott/STA380/master/data/groceries.txt",
format = c("basket"), sep = ",")
arules::inspect(head(groceries))
```
gathers most frequent items in dataset, filtered by decreasing support and inspects the 15 most frequent items in any basket
```{r}
FrequentItems = eclat(groceries, parameter = list(support = 0.05))
FrequentItems <- sort(FrequentItems, by = "support", decreasing = TRUE)
arules::inspect(head(FrequentItems, 15))
```
plotting absolute frequency of top 15 items in groceries dataset
```{r, echo = FALSE, fig.height=4, fig.width=8, fig.align='center'}
itemFrequencyPlot(groceries, topN = 15, type = "absolute",
main = "Item Frequency")
```
plotting relative frequency of top 15 items in dataset
```{r, echo = FALSE, fig.height=4, fig.width=8, fig.align='center'}
itemFrequencyPlot(groceries, topN = 15, type = "relative",
main = "Item Frequency")
```
32791 rules created, and we inspect the first 6 association rules
```{r}
groceryRules <- apriori(groceries, parameter = list(support = 0.001, confidence = 0.1))
arules::inspect(head(groceryRules))
```
creating a new rules matrix that sorts grocery rules by decreasing support, and we inspect the first 15 association rules of this new rules matrix
```{r}
rulesSupport <- sort(groceryRules, by = "support", decreasing = TRUE)
arules::inspect(head(rulesSupport, 15))
```
creating a new rules matrix that sorts grocery rules by decreasing confidence, and we inspect the first 15 associations rules of this new rules matrix
```{r}
rulesConfidence <- sort(groceryRules, by = "confidence", decreasing = TRUE)
arules::inspect(head(rulesConfidence, 15))
```
creating a new rules matrix that sorts grocery rules by decreasing lift, and we inspect the first 15 association rules of this new rules matrix
```{r}
rulesLift <- sort(groceryRules, by = "lift", decreasing = TRUE)
arules::inspect(head(rulesLift, 15))
```
There are 28 different association rules where confidence = 1. Notice that the support remains relatively low, hovering around 0.001. Lift for these rules tends to be about 4 or 5.
```{r}
arules::inspect(subset(rulesConfidence, subset = confidence == 1))
```
There are 20 different association rules where lift > 15. Notice that support is also relatively low, hovering around 0.01. Confidence ranges between 0.12 and 0.65.
```{r}
arules::inspect(subset(rulesLift, subset = lift > 15))
```
As seen from the inspection of subsets above, the consequent(s) of 100% confidence rules are 'whole milk' and 'other vegetables'. This makes sense seeing that people, in general, consume milk and vegetables often.
The consequent(s) of high lift rules include liquor, instant food products, rice, hamburger meat, wine, and popcorn among others.
The plot confirms what was stated above. Individual data points are shaded by confidence, and the most confidence occurs at low support and low to medium levels of lift.
```{r, echo = FALSE, fig.height=4, fig.width=8, fig.align='center'}
plot(groceryRules, measure = c("support", "lift"), shading = "confidence")
```
a subset of rules with confidence > 0.1 and support > 0.01 was created, and we inspect the first 6 association rules of this new rules matrix which contains 435 rules
```{r}
sub1 = subset(groceryRules, subset = confidence > 0.1 & support > 0.01)
arules::inspect(head(sub1))
```
A graph of 50 rules, sorted by the highest lift based on the subset given above: confidence > 0.1 and support> 0.01.
```{r, echo = FALSE, fig.height=4, fig.width=8, fig.align='center'}
plot(head(sub1, 50, by = 'lift'), method = 'graph')
```
saved a graphml file containing 1084 nodes and 3907 edges of groceryRules.
The file will be uploaded onto github and a screenshot of the gephi-produced graph is seen underneath this text block.
Some characteristics of the network are listed below.
Network diameter = 14
Graph Density = 0.003
Modularity = 0.331
Avg Path Length = 4.537

```{r}
#saveAsGraph(head(groceryRules, n = 1000, by = "lift"), file = "groceryRules.graphml")
```