openalexPro/openalexSnapshot

Lifecycle: experimental | License: GPL-2+

openalexSnapshot

openalexSnapshot converts the OpenAlex bulk snapshot from gzipped newline-delimited JSON to Parquet, builds fast ID-lookup indexes, and extracts individual records by OpenAlex ID. The heavy lifting is done by a compiled Rust library (statically linked via extendr), with no external binary dependency. For API-based access to OpenAlex, see openalexPro.
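To make the input format concrete, here is a minimal base-R sketch of gzipped newline-delimited JSON: one JSON object per line, compressed with gzip. The two records and the temporary file are made up for illustration, and the sketch does not show how openalexSnapshot itself reads the snapshot.

```r
# Two made-up records, one JSON object per line (NDJSON).
records <- c(
  '{"id":"https://openalex.org/W1","title":"First example"}',
  '{"id":"https://openalex.org/W2","title":"Second example"}'
)

# Write them gzip-compressed to a temporary file.
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wt")
writeLines(records, con)
close(con)

# Read the file back: gzfile() decompresses transparently,
# and each line is one complete record.
con <- gzfile(path, open = "rt")
lines <- readLines(con)
close(con)

n_records <- length(lines)
```

Because every line is a complete record, files in this format can be processed line by line without loading a whole file into memory.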

Installation

Install from r-universe (precompiled binaries are available for macOS and Linux, so no Rust toolchain is required):

install.packages(
  "openalexSnapshot",
  repos = c("https://rkrug.r-universe.dev", "https://cloud.r-project.org")
)

Install the development version from GitHub:

# install.packages("pak")
pak::pak("openalexPro/openalexSnapshot")

Hardware Requirements

Resource     Minimum   Recommended
Disk space   2.5 TB    3+ TB
RAM          16 GB     32+ GB
CPU          2 cores   4+ cores
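As a rough sketch, the figures above can be turned into values for the `workers` and `memory_limit` arguments. The helper `suggest_settings()` is hypothetical and not part of openalexSnapshot. Its rule of thumb (one worker per core, memory limit in MB equal to installed RAM minus 1 GB of headroom) is an assumption, not a documented recommendation.

```r
# Hypothetical helper, not part of openalexSnapshot.
# Assumed rule of thumb: one worker per core, and a memory limit
# (in MB) of installed RAM minus some headroom for the OS.
suggest_settings <- function(ram_gb,
                             cores = parallel::detectCores(),
                             headroom_gb = 1) {
  # detectCores() can return NA on some platforms; fall back to 1.
  if (is.na(cores)) cores <- 1L
  list(
    workers      = max(1L, as.integer(cores)),
    memory_limit = as.integer((ram_gb - headroom_gb) * 1000)
  )
}

# Example: a machine with 16 GB of RAM and 4 cores.
settings <- suggest_settings(ram_gb = 16, cores = 4)
```

RAM is passed in explicitly because base R has no portable way to query installed memory.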

Quick Start

library(openalexSnapshot)

root <- "/Volumes/openalex"

# 1. Convert the snapshot to Parquet
snapshot_to_parquet(
  root_dir     = root,
  workers      = 4,
  memory_limit = 15000   # MB
)

# 2. Build ID indexes
build_corpus_index(
  root_dir  = root,
  data_sets = "works",
  workers   = 4
)

# 3. Look up specific records by OpenAlex ID
out_dir <- file.path(root, "my_extract")
lookup_by_id(
  index_file = file.path(root, "parquet", "works_id_idx.parquet"),
  ids        = c("W2741809807", "W2100837269"),
  output_dir = out_dir
)

# 4. Read results
library(arrow)
works <- open_dataset(out_dir) |> collect()
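OpenAlex IDs often arrive in URL form (for example `https://openalex.org/W2741809807`) or with inconsistent case. The sketch below tidies a vector of work IDs into the short form used in the Quick Start before they are passed to `lookup_by_id()`. The helper `normalize_work_ids()` is hypothetical and not part of openalexSnapshot. Whether `lookup_by_id()` already accepts URL-form IDs is not stated here, so this step may be unnecessary.

```r
# Hypothetical helper, not part of openalexSnapshot.
# Strips the OpenAlex URL prefix, upper-cases the ID, checks the
# "W<digits>" pattern for works, and drops duplicates.
normalize_work_ids <- function(ids) {
  ids <- trimws(ids)
  ids <- sub("^https?://openalex\\.org/", "", ids, ignore.case = TRUE)
  ids <- toupper(ids)
  bad <- !grepl("^W[0-9]+$", ids)
  if (any(bad)) {
    stop("Not a valid work ID: ", paste(ids[bad], collapse = ", "))
  }
  unique(ids)
}

ids <- normalize_work_ids(c(
  "https://openalex.org/W2741809807",
  "w2100837269",
  "W2741809807"
))
```

The resulting vector can then be supplied as the `ids` argument in step 3.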

Documentation

Full documentation and articles are available at https://openalexpro.github.io/openalexSnapshot.

Related packages

  • openalexPro — API access, tidy data frames, and advanced OpenAlex workflows

About

R package for handling OpenAlex snapshots and for efficient ID-based searches within them.
