rstats-tips

Repository of what I end up googling for again and again...

Ouput

options(max.print=100)
options(tibble.print_max=100)

Install old versions of R on Docker

apt install software-properties-common
add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"

Check available R versions: apt policy r-base

GoogleCloudStorage

With service-account JSON

library(googleCloudStorageR)
library(googleAuthR)

scope <-c("https://www.googleapis.com/auth/cloud-platform")
gar_auth_service(Sys.getenv("GCS_AUTH_FILE"), scope = scope)

gcs_list_objects("MY-PROJECT")

On a GCP VM or container

library(googleCloudStorageR)
library(gargle)

scope <-c("https://www.googleapis.com/auth/cloud-platform")
token <- token_fetch(scopes = scope)
gcs_auth(token = token)

gcs_list_objects("MY-PROJECT")

JupyterLab

Install

Install `R` on Debian

sudo apt install r-base r-base-dev

On GCP Vertex AI needed the following

`tidyverse`

sudo apt install libssl-dev
sudo apt install libfontconfig1-dev
sudo apt install libharfbuzz-dev libfribidi-dev
sudo apt install libtiff-dev

Misc

To resize ggplot outputs in the notebook:

options(repr.plot.width=11, repr.plot.height=7)

%%R -w 15 -h 15 --units in

Tensorflow/Keras

To install with conda

install.packages("tensorflow")
install_tensorflow()

If issue with numpy version arises:

Then I had issues with numy version, so:

conda remove -n r-tensorflow numpy

Then instal numpy 1.14.5

conda install -n r-tensorflow numpy=1.14.5

Always set which python to use just after restarting R session. On a Mac:

library(reticulate)
reticulate::use_python('/Users/XXXXX/anaconda3/envs/r-tensorflow/bin/python', required = TRUE)
reticulate::py_config()

Hello world

library(tensorflow)
#use_condaenv('r-tensorflow')
h = tf$constant("Hello")
w = tf$constant(" World!")
hw = h + w

init <- tf$global_variables_initializer()
session <- tf$Session()
session$run(init)

ans <- session$run(hw)
ans

Summaries

dplyr::glimpse(df)
dplyr::tbl_sum(df)
skimr::skim(df)
Hmisc::describe(df)
summarytools::descr(spam)
summarytools::dfSummary(df)

Misc

By default, 1 is a float. To create an int, the syntax is

1L

Tidyverse

readr

Clean column names

rename_all(funs(str_replace(., '^[^.]*\\.', '')))

`dplyr`

Remove duplicates

distinct()

Vectors

Get a vector out of a tibble column

pull(data, col_name)

Timeseries `tsibble`

library(tsibble)
df %>%
  as_tsibble(index='dt_utc') %>%
  fill_gaps(km = 0, nb_cars = 0) %>% 
  mutate(dt_tzfr=if_else(is.na(dt_tzfr), with_tz(dt_utc, 'Europe/Paris'), dt_tzfr)) %>%
  as_tibble()

GGplot2

A good ref [[http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/]]

Labels

Remove labels

+ guides(color = "none")

Rotate labels

+ theme(axis.text.x = element_text(angle = 90)

or the more modern:

scale_x_datetime(breaks = 'week', date_labels = "%b %d", guide = guide_axis(angle = 90))

Legend

Remove legend

+ theme(legend.position = "none")

Theme

+ theme(plot.caption=element_text(size=8, family='sans', face='italic', color='gray')

Expand limits (e.g. to origin)

+ expand_limits(x = 0)

`xlim` but with clipping

+coord_cartesian(xlim=c(0, 10))

Scales

Log gradient scale

+ scale_fill_gradient(trans = "log") + scale_x_date(date_breaks = "months" , date_labels = "%b-%y")

Log Breaks

+ scale_y_log10(minor_breaks=seq(5e-2, 3e7, 1e-2) %>% round(., -floor(log10(.))) %>% unique())

Aesthetics

Variables, within aes(color = var_color), constant, outside: geom_point(aes(x, y), color = "red")

geofacet

Install issues

sudo apt-get install libudunits2-dev
sudo apt-get install gdal-bin
sudo apt-get install libgdal-dev

`data.table`

Chain/Pipe

Bracket style

dt[a > 1, .(b = a + 1)][b < 3]

`magrittr` style

dt[a > 1, .(b = a + 1)] %>% .[b < 3]

help

?`[.data.table`

For help on .N, .SD, .SDcols

?`special-symbols`

List all `data.table`s

tables()

Creator

dt <- 
  data.table(
    x = runif(100),
    y = rnorm(100),
    g = sample.int(100, n = 2, replace = TRUE)
  )

Read large CSV files faster than with readr

data.table::fread(
    file = 'data_raw/huge_csv_file.csv', 
    na.strings = c('', 'null', 'Null', 'NULL', '1980-01-01', '1980-01-01 00:00:00.0')
    ) %>% 
  as.tibble() %>% 
  rename_all(funs(str_replace(., '^[^.]*\\.', ''))) %>% 
  mutate(date_col = ymd(date_col))

general format

dt[rows, cols, by]
dt[,.(col1, col2)]
dt[col1 > 0, .(col1, prod = col1 * col2)]

Replace all `NA`s

dt[is.na(dt)] <- 0

Operations on rows

Order/arrange

To reorder inplace

setorder(dt, col_name)

To reorder without modifying the original data.table

dt[order(col_name)]

Intermediary computations

Works with =, but not with :=

dt[,
  {
    a <- x ^ 2
    b <- y ^ 2
    z <- a + b
    t <- a - b
    .(z = z, t = t)
  }
  ]

Can call function with side effects as well, e.g. hist()

res[, {hist(Result)}]

If invalid output/last value, then set the last value to NULL

res[,
  {
    hist(Result)
    NULL
  }, 
  Round
]

Operations on columns

Select (get as data.table)

dt[, .(y)]
dt[, "y"]
dt[, c("y")]
dt[, 2]

dt[, .(x, y)]
dt[, c("x", y")]
dt[, 1:2]

Pull (get one column as vector)

dt[, y]

Remove column

dt[, y := NULL]

Work on multiple columns `.SD`

.SD is by default a list with all columns except grouping columns

dt[, lapply(data.table(.SD), mean), g]

One can also customize .SD by setting .SDcols

dt[, lapply(data.table(.SD), mean), g, .SDcols = -c("y")]

Shift/lag

dt[, .(y = shift(x, type = "lag", fill = NA, n = 2L)), by = g]

Aggregation

Functions

tally `.N`

dt[, .(.N), g]

spread / pivot & index within group

dcast(
  flights[, as.list(lm(air_time ~ distance)[c('coefficients', 'df.residual')]), by = .(month)][,.(coefficients, df.residual, .I-.I[1]), by = .(month)], 
  month + df.residual ~ V3, 
  value.var="coefficients")

Joins

DT[X, on="x"]                         # right join
X[DT, on="x"]                         # left join
DT[X, on="x", nomatch=0]              # inner join
DT[!X, on="x"]                        # not join
DT[X, on=c(y="v")]                    # join using column "y" of DT with column "v" of X
DT[X, on="y==v"]                      # same as above (v1.9.8+)

DT[X, on=.(y<=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on="y<=foo"]                    # same as above
DT[X, on=c("y<=foo")]                 # same as above
DT[X, on=.(y>=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on=.(x, y<=foo)]                # NEW non-equi join (v1.9.8+)
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]  # Select x's join columns as well

DT[X, on="x", mult="first"]           # first row of each group
DT[X, on="x", mult="last"]            # last row of each group
DT[X, sum(v), by=.EACHI, on="x"]      # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"]  # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"]  # 'i,v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+)

Options

Limit number of printed rows

options(datatable.print.nrows = 20)

Interaction with othe environents

`reticulate` (Python)

Create Python dictionaries/tuples

key <- 'some_key'
d <- reticulate::dict()
d[key] <- reticulate::dict(key1 = 'value1', key2 = reticulate::tuple('foo', 2 ))
d

outputs: {'some_key': {'key1': 'value1', 'key2': ('foo', 2.0)}}

Docker

Launch a docker container from R

client <- docker$from_env()
data_folder_path = normalizePath('local_path', winslash = '/')
vols <- reticulate::dict()
vols[data_folder_path] <- reticulate::dict(bind = '/path_in_container/', mode = 'ro')
ports <- reticulate::dict('5432/tcp' = reticulate::tuple('127.0.0.1', as.integer(5432)))
pg <- client$containers$run(name = 'pg', image = "image:latest", remove = TRUE, environment = reiculate::dict('foo' = 'bar'), ports = ports, volumes = vols, detach = TRUE)

Databases / Big Data

Passwords

rstudioapi::askForPassword("")

aesc <- digest::AES(charToRaw(stringr::str_sub(digest::sha1(uuid::UUIDgenerate(use.time = TRUE)), 1, 32)), mode = 'ECB')
aesc$decrypt(aesc$encrypt((function(x, n) (x[1:n]=x[1:n]))(charToRaw(rstudioapi::askForPassword()), 256)))

RJDBC

Create driver for Hive

jdbc_jar_path <- "./hive/jdbc/hive-jdbc-xxxx-standalone.jar" 
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath=jdbc_jar_path)

Create connection

conn <- dbConnect(drv, glue('jdbc:hive2://xxx.xxx.xx:pppp/{default_hive_db};ssl=true;sslTrustStore={trust_store_path};trustStorePassword={trust_store_pwd};transportMode=http;httpPath=gateway/default/llap'), user, aesc$decrypt(skrt))

Query directly from R notebook

{sql, connection = conn, output.var="query_result"}

Close connection

dbDisconnect(conn)

sparklyr

Start

spark_install(version = "2.1.0")
sc <- spark_connect(master = "local", version = "2.1.0")
flights_df <- copy_to(sc, nycflights13::flights, "flights_sp", overwrite = TRUE)
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)

remove NAs

drop_na() isn't available. Instead, use

flights %>% na.omit()

ML

Data prep

Center/normalize matrix

scale(x, center = TRUE, scale = TRUE)

Transpose matrix

t(x)

matrix multiplication

x %*% y

Eigen vectors/values

eigen(x)

PCA

scale(decm)[,] %*% (eigen(t(scale(decm)) %*% scale(decm)))$vectors

Clustering

Viz

factoextra::fviz_cluster(kmeans(Xs, 5, algorithm = 'MacQueen'), data = Xs, labelsize = 0)

Shiny

JS logs to R console

shinyjs::showLog()

Assign to a global variable that will persist in RStudio after the Shiny app is closed (facilitates debugging)

global_var <<- 1

Office

Copy table to excel via clipboard

%>% write.table("clipboard", sep="\t")

Watch out!

`%>%` and `.`

tibble %>% full_join(. %>% filter(), . %>% filter()) doesn't work. But tibble %>% full_join(filter(.), filter(.)) does. Not clear why yet :(

Package(s) for MS office interaction

David Gohel's officer

TODO

`purrrlyr::by_row()`

Check it out

Related files

https://holtzy.github.io/Pimp-my-rmd/

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

rstats-tips

Ouput

Install old versions of R on Docker

GoogleCloudStorage

With service-account JSON

On a GCP VM or container

JupyterLab

Install

Install R on Debian

On GCP Vertex AI needed the following

tidyverse

Misc

Tensorflow/Keras

To install with conda

If issue with numpy version arises:

Always set which python to use just after restarting R session. On a Mac:

Hello world

Summaries

Misc

Tidyverse

readr

Clean column names

dplyr

Remove duplicates

Vectors

Get a vector out of a tibble column

Timeseries tsibble

GGplot2

Labels

Remove labels

Rotate labels

Legend

Remove legend

Theme

Expand limits (e.g. to origin)

xlim but with clipping

Scales

Log gradient scale

Log Breaks

Aesthetics

geofacet

Install issues

data.table

Chain/Pipe

Bracket style

magrittr style

help

List all data.tables

Creator

Read large CSV files faster than with readr

general format

Replace all NAs

Operations on rows

Order/arrange

Intermediary computations

Operations on columns

Select (get as data.table)

Pull (get one column as vector)

Remove column

Work on multiple columns .SD

Shift/lag

Aggregation

Functions

tally .N

spread / pivot & index within group

Joins

Options

Limit number of printed rows

Interaction with othe environents

reticulate (Python)

Create Python dictionaries/tuples

Docker

Launch a docker container from R

Databases / Big Data

Passwords

RJDBC

Install `R` on Debian

`tidyverse`

`dplyr`

Timeseries `tsibble`

`xlim` but with clipping

`data.table`

`magrittr` style

List all `data.table`s

Replace all `NA`s

Work on multiple columns `.SD`

tally `.N`

`reticulate` (Python)

`%>%` and `.`

`purrrlyr::by_row()`

Packages