Standalone CLI for OpenAlex snapshot download, conversion, verification, schema inspection, and indexing.
This project was built with significant AI assistance:
- OpenAI Codex was used for the initial implementation, generating the core architecture, command structure, and the majority of the Rust source code.
- Claude Code (Anthropic) was used for ongoing development, refinement, bug fixes, and documentation.
All AI-generated code was directed and tested by the project author.
openalex-snapshot manages the full snapshot pipeline in one CLI:
- download + verify_download for snapshot sync and integrity checks
- convert + verify_convert + repair_convert for JSON.GZ -> parquet with validation/recovery
- index + verify_index for ID lookup indexes
- extract for targeted parquet extraction by OpenAlex IDs
- schema + verify_schema for schema inspection and parity checks
- report / progress for run-state visibility
- all for config-driven orchestration
Default root layout:
- snapshot: <root>/snapshot
- parquet: <root>/parquet
- metadata: <root>/openalex-snapshot_metadata
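The layout above can be sketched as a small shell helper that derives the three directories from a root path. This is only an illustration: `layout_paths` is a hypothetical name and not part of openalex-snapshot, and `/Volumes/openalex` is the example root used elsewhere in this README.

```shell
# Sketch: print the default directories for a given root.
# layout_paths is a hypothetical helper, not a command of the tool.
layout_paths() (
  root="${1%/}"   # drop one trailing slash, if present
  printf 'snapshot=%s/snapshot\n' "$root"
  printf 'parquet=%s/parquet\n' "$root"
  printf 'metadata=%s/openalex-snapshot_metadata\n' "$root"
)

layout_paths /Volumes/openalex
```

The function body runs in a subshell, so the `root` variable does not leak into the calling shell.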
Requirements:
- duckdb available in PATH (or pass --duckdb-bin where supported)
- aws CLI available in PATH for download / verify_download
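Before a long run, it can help to confirm by hand that both external tools resolve on PATH. The sketch below uses the standard `command -v` lookup; `require_tools` is a hypothetical helper name, and this manual check is independent of whatever the tool's own check command verifies.

```shell
# Sketch: report which of the given tools are found on PATH.
# require_tools is a hypothetical helper, not part of openalex-snapshot.
require_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Returns non-zero if any tool is missing, so it can gate a script.
require_tools duckdb aws || echo "install the missing tools before running the pipeline" >&2
```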
Option 1: build from source
git clone https://github.com/openalexPro/openalex-snapshot.git
cd openalex-snapshot
cargo build --release
./target/release/openalex-snapshot --help
Option 2: install with Cargo
git clone https://github.com/openalexPro/openalex-snapshot.git
cd openalex-snapshot
cargo install --path .
openalex-snapshot --help
Option 3: prebuilt binary
Download the archive for your platform from GitHub Releases:
- Linux: openalex-snapshot-<tag>-x86_64-unknown-linux-gnu.tar.gz
- macOS Intel: openalex-snapshot-<tag>-x86_64-apple-darwin.tar.gz
- macOS Apple Silicon: openalex-snapshot-<tag>-aarch64-apple-darwin.tar.gz
- Windows: openalex-snapshot-<tag>-x86_64-pc-windows-msvc.zip
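On Linux and macOS, the archive name can be worked out from `uname` output. The following is a rough sketch, assuming `uname -s` reports `Linux` or `Darwin` and `uname -m` reports `x86_64` or `arm64`; `release_archive` is a hypothetical helper, and the release tag must be supplied by the caller because it is not derivable from the machine.

```shell
# Sketch: map OS and architecture to the archive names listed above.
# release_archive is a hypothetical helper; arguments are <tag> <os> <arch>.
release_archive() {
  tag="$1"; os="$2"; arch="$3"
  case "$os/$arch" in
    Linux/x86_64)  echo "openalex-snapshot-$tag-x86_64-unknown-linux-gnu.tar.gz" ;;
    Darwin/x86_64) echo "openalex-snapshot-$tag-x86_64-apple-darwin.tar.gz" ;;
    Darwin/arm64)  echo "openalex-snapshot-$tag-aarch64-apple-darwin.tar.gz" ;;
    *) echo "no prebuilt archive listed for $os/$arch" >&2; return 1 ;;
  esac
}

# Usage (replace <tag> with the release tag you want):
#   release_archive "<tag>" "$(uname -s)" "$(uname -m)"
```

Platforms outside the list (for example Linux on ARM) fall through to the error branch; building from source remains the fallback there.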
Then unpack and place openalex-snapshot (or openalex-snapshot.exe) on your PATH.
macOS note: Because the binaries are not notarized with Apple, macOS Gatekeeper will block them on first run with a warning that the app "cannot be opened". To allow the binary, run this once after unpacking:
xattr -dr com.apple.quarantine openalex-snapshot
Alternatively, open System Settings → Privacy & Security and click "Open Anyway". This is a one-time step per binary. To avoid it entirely, install via Cargo (option 1 or 2 above), which compiles locally and bypasses Gatekeeper.
Version:
openalex-snapshot --version
Argument precedence (highest wins):
- CLI arguments
- Config subcommand section values
- Config defaults section values
- Built-in defaults
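The precedence order above amounts to "first value that is set wins". A minimal sketch of that rule follows; `resolve_setting` is a hypothetical helper and not the tool's actual implementation, and it assumes that an unset layer is represented by an empty string.

```shell
# Sketch: return the first non-empty value among the four layers.
# Argument order: CLI value, config subcommand section value,
# config defaults section value, built-in default.
# resolve_setting is hypothetical; empty string stands for "not set".
resolve_setting() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# No CLI value; the subcommand section says "works"; lower layers say "all".
resolve_setting "" "works" "all" "all"   # prints: works
```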
Commands:
- config
- all
- check
- download
- verify_download
- convert
- verify_convert
- schema
- verify_schema
- index
- extract
- verify_index
- repair_convert
- report
- prune-reports
- progress
- skills
# preflight
openalex-snapshot check --root-dir /Volumes/openalex --dataset all
# download + verify snapshot
openalex-snapshot download --root-dir /Volumes/openalex
openalex-snapshot verify_download --root-dir /Volumes/openalex
# convert one dataset (uses safe profile by default — works on any host)
openalex-snapshot convert \
--root-dir /Volumes/openalex \
--dataset works
# …or, on a 32+ GB host, use the empirically tuned stratified profile for a faster run:
openalex-snapshot convert \
--root-dir /Volumes/openalex \
--dataset works \
--profile stratified-36
# verify conversion
openalex-snapshot verify_convert \
--root-dir /Volumes/openalex \
--dataset works \
--scope dataset \
--metadata-level both
# extract by OpenAlex IDs (writes one parquet per resolved dataset)
openalex-snapshot extract \
--root-dir /Volumes/openalex \
--ids /Volumes/openalex/ids.csv \
--output /Volumes/openalex/extract.parquet
# repair from verify report
openalex-snapshot repair_convert \
--root-dir /Volumes/openalex \
--from-verify-report /Volumes/openalex/openalex-snapshot_metadata/reports/verify_convert-123456.json
# bootstrap AI skills folder
openalex-snapshot skills --root-dir /Volumes/openalex
Under <root>/openalex-snapshot_metadata:
- reports/: latest report per command (current run)
- archived/<timestamp>/: previous completed runs
- download/download.log
- <dataset>/schemata/: schema cache
- <dataset>/convert/: convert logs (created when convert runs)
- <dataset>/conversion-verify/: verify_convert logs + metrics
- <dataset>/index/: index logs
- <dataset>/index-verify/: verify_index logs
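Since repair_convert takes a verify report path, it may be convenient to locate the newest verify_convert report under reports/ rather than typing it by hand. The sketch below assumes report files follow the `verify_convert-*.json` pattern seen in the example command; that pattern is inferred from a single filename and `latest_verify_report` is a hypothetical helper.

```shell
# Sketch: print the most recently modified verify_convert report under <root>.
# latest_verify_report is hypothetical; the filename pattern is an assumption.
latest_verify_report() (
  reports_dir="${1%/}/openalex-snapshot_metadata/reports"
  # ls -t lists newest first; errors from a non-matching glob are silenced.
  ls -t "$reports_dir"/verify_convert-*.json 2>/dev/null | head -n 1
)

# Usage:
#   openalex-snapshot repair_convert \
#     --root-dir /Volumes/openalex \
#     --from-verify-report "$(latest_verify_report /Volumes/openalex)"
```

If no report exists the function prints nothing, so check for an empty result before passing it on.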
Documentation:
- docs/README.md
- docs/quickstart.md
- docs/commands/
- docs/operations/
- NEWS.md
- ARCHITECTURE_AND_DECISIONS.md
- AI_SKILLS_USAGE.md