---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# pangoling <a href="https://docs.ropensci.org/pangoling/"><img src="man/figures/logo.png" align="right" height="139" /></a>

<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/ropensci/pangoling/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ropensci/pangoling?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![R-CMD-check](https://github.com/ropensci/pangoling/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/pangoling/actions/workflows/R-CMD-check.yaml)
[![Project Status: active](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![DOI](https://zenodo.org/badge/497831295.svg)](https://zenodo.org/badge/latestdoi/497831295)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)
[![CRAN status](https://www.r-pkg.org/badges/version/pangoling)](https://CRAN.R-project.org/package=pangoling)
[![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/pangoling)](https://cran.r-project.org/package=pangoling)

<!-- badges: end -->


`pangoling`^[The logo of the package was created with [Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion) and the R package [hexSticker](https://github.com/GuangchuangYu/hexSticker).] is an R package for
estimating the predictability of words in a given context using transformer
models. The package provides an interface to pre-trained transformer
models (such as GPT-2 or BERT) for obtaining word probabilities. These word
probabilities are often used as predictors in psycholinguistic studies, so the
package can be useful for psycholinguistics researchers who want
to use transformer models in their work.

The package is mostly a wrapper around the Python package [`transformers`](https://pypi.org/project/transformers/) that lets users process data in a convenient format.


## Important! Limitations and bias

The training data of the most popular models (such as GPT-2) have not been released, so they cannot be inspected. It is clear that the data contain a lot of unfiltered content from the internet, which is far from neutral. See, for example, the out-of-scope use cases in the [OpenAI team's model card for GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases) (it should be the same for many other models) and the [limitations and bias section for GPT-2 on the Hugging Face website](https://huggingface.co/gpt2).

## Installation

To install the latest CRAN version of `pangoling` use:

```{r, eval = FALSE}
install.packages("pangoling")
```

To install the latest version from GitHub (via the rOpenSci R-universe) use:

```{r, eval = FALSE}
install.packages("pangoling", repos = "https://ropensci.r-universe.dev")
```

The `install_py_pangoling()` function installs the Python packages that `pangoling` needs from within R, using the `reticulate` package to manage Python environments. This only needs to be done once.

```{r, eval = FALSE}
install_py_pangoling()
```

## Example

This basic example shows how to get the log-probabilities of words in a dataset:

```{r, message = FALSE}
library(pangoling)
library(tidytable) # fast alternative to dplyr
```

Given a (toy) dataset where sentences are organized with one word or short phrase in each row:

```{r, cache = TRUE}
sentences <- c("The apple doesn't fall far from the tree.",
               "Don't judge a book by its cover.")
(df_sent <- strsplit(x = sentences, split = " ") |>
  map_dfr(.f =  ~ data.frame(word = .x), .id = "sent_n"))
```

One can get the log-transformed probability of each word based on GPT-2 as follows:

```{r, cache = TRUE}
df_sent <- df_sent |>
  mutate(lp = causal_words_pred(word, by = sent_n))
df_sent
```
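
Log-probabilities like those in the `lp` column are often converted to surprisal or summed per sentence before being used as predictors. The following is a minimal base-R sketch of both steps, not part of the package itself. It assumes the log-probabilities are natural logs (check the documentation of `causal_words_pred()` for the base actually in use), and the data frame `df_toy` and its `lp` values are invented for illustration so that the block runs without downloading a model.

```r
# Hypothetical stand-in for the output above: the `lp` values are made up
# for illustration and are assumed to be natural-log probabilities.
df_toy <- data.frame(
  sent_n = c("1", "1", "2", "2"),
  word   = c("The", "apple", "Don't", "judge"),
  lp     = c(-2.5, -10.0, -3.0, -6.5)
)

# Surprisal in bits: change the log base from e to 2 and flip the sign.
df_toy$surprisal_bits <- -df_toy$lp / log(2)

# Log-probability of each (partial) sentence: sum the word
# log-probabilities within each sentence id.
sent_lp <- tapply(df_toy$lp, df_toy$sent_n, sum)

df_toy
sent_lp
```

With real output, the same two lines can be applied to `df_sent$lp` and `df_sent$sent_n` directly.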


## How to cite

```{r, comment = NA }
citation("pangoling")
```

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md).


## Code of conduct

Please note that this package is released with a [Contributor
Code of Conduct](https://ropensci.org/code-of-conduct/).
By contributing to this project, you agree to abide by its terms.

## See also

Another R package that acts as a wrapper for [`transformers`](https://pypi.org/project/transformers/) is [`text`](https://r-text.org//). However, `text` is more general, and its focus
is on Natural Language Processing and Machine Learning.