This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
openalexVectorComp is an R package for text embedding generation and distance/score-based corpus comparison. It is R-first with file-based artifacts (Parquet/YAML/JSON) as first-class outputs — functions are designed to compose in plain R workflows without hidden services.
Current version: see DESCRIPTION (treat it as the source of truth, not docs).
Run from the package root in R:
# Install local source
devtools::install_local(".")
# Document (regenerate man/ and NAMESPACE from roxygen)
devtools::document()
# Run all tests
devtools::test()
# Run a single test file
devtools::test(filter = "name-without-test-prefix-or-.R")
# or: testthat::test_file("tests/testthat/test-<name>.R")
# Full R CMD check (mirrors CI)
devtools::check(args = c("--no-manual", "--no-multiarch", "--no-examples", "--ignore-vignettes"))
# Build pkgdown site (output: _site/)
pkgdown::build_site()
CI (.github/workflows/pr-checks.yml) runs R CMD check across Ubuntu (release/oldrel-1/devel), macOS, and Windows, skipping vignette execution and examples. Match this locally before committing.
R/backend_core.R exposes a provider-neutral interface (backend_config(), backend_info(), backend_embed_texts(), backend_read(), backend_save()). It dispatches on provider ∈ {hf, openai, tei} to internal .embedding_* functions in:
- R/backend_hf.R: HuggingFace Inference router
- R/backend_openai.R: OpenAI embeddings API
- R/backend_tei.R: local Text Embeddings Inference server
Auth for hosted backends uses the OVC_API_TOKEN env var (bearer token). Backend configs serialize to/from YAML (embed_model.yaml); backend_read() also accepts a legacy nested format — preserve that path when editing.
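A minimal sketch of the two behaviors described above, in base R only. It is not the package's implementation: `get_api_token()` and `normalize_backend_config()` are hypothetical helper names, and the flat field name `model_id` is an assumption. Only the env var `OVC_API_TOKEN` and the legacy path `meta$model$requested_id` come from this file.

```r
# Read the bearer token; Sys.getenv() returns "" when the variable is unset.
get_api_token <- function() {
  token <- Sys.getenv("OVC_API_TOKEN", unset = "")
  if (!nzchar(token)) {
    stop("OVC_API_TOKEN is not set", call. = FALSE)
  }
  token
}

# Normalize an already-parsed config list to one flat shape.
# Assumption: the flat format stores the model under `model_id`;
# the legacy nested format stores it under meta$model$requested_id.
normalize_backend_config <- function(cfg) {
  if (!is.null(cfg$meta)) {
    # Legacy nested format: lift the model id, carry the backend block whole.
    return(list(
      model_id = cfg$meta$model$requested_id,
      backend  = cfg$meta$backend
    ))
  }
  # Current flat format: everything except the model id is backend settings.
  list(
    model_id = cfg$model_id,
    backend  = cfg[setdiff(names(cfg), "model_id")]
  )
}
```

Keeping both branches in one normalizer means a reader change only has to touch one place to stay compatible with old files.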
- embed_corpus(): prepare/clean input texts (supports dry_run = TRUE).
- embed_texts() / backend_embed_texts(): generate embeddings.
- Distances:
  - distance_reference_cosine(): full pairwise cosine matrix with centroid axes, written as a single parquet under distance_reference_cosine/model_id=.../corpus_label=.../reference_label=.../pairwise-cosine.parquet. First column id includes a "centroid" row; one "centroid" column on the reference side.
  - distance_ridge(): reference-area distance.
- Scores: score_reference_cosine() (methods: "linear", "exponential"), score_ridge().
- Optional: calibrate_threshold().
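To make the artifact shape concrete, here is a base-R sketch of a pairwise cosine table with centroid axes and the Hive-style output path. It is an illustration under assumptions, not the code in distance_reference_cosine(): the helper names are hypothetical, and it assumes the stored value is cosine distance (1 minus similarity). Writing the parquet file itself is left out.

```r
# Cosine similarity between every row of A and every row of B.
cosine_sim <- function(A, B) {
  A_n <- A / sqrt(rowSums(A^2))  # row-normalize (vector recycles down columns)
  B_n <- B / sqrt(rowSums(B^2))
  A_n %*% t(B_n)
}

# Hypothetical helper: distance table with a "centroid" row and column.
pairwise_with_centroids <- function(corpus, reference) {
  corpus_aug    <- rbind(corpus, centroid = colMeans(corpus))
  reference_aug <- rbind(reference, centroid = colMeans(reference))
  d <- 1 - cosine_sim(corpus_aug, reference_aug)  # assumed: distance, not similarity
  data.frame(id = rownames(d), d, row.names = NULL, check.names = FALSE)
}

# Hive-style partition path, following the layout documented above.
hive_path <- function(model_id, corpus_label, reference_label) {
  file.path(
    "distance_reference_cosine",
    paste0("model_id=", model_id),
    paste0("corpus_label=", corpus_label),
    paste0("reference_label=", reference_label),
    "pairwise-cosine.parquet"
  )
}

corpus    <- matrix(c(1, 0, 0, 1), nrow = 2, byrow = TRUE,
                    dimnames = list(c("a", "b"), NULL))
reference <- matrix(c(1, 0), nrow = 1, dimnames = list("r1", NULL))
tab <- pairwise_with_centroids(corpus, reference)
```

Here `tab` has an `id` column whose last value is "centroid", plus one column per reference item and a trailing "centroid" column.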
The OpenAI batch workflow runs in three steps: batch_submit_openai() → batch_status_openai() → batch_collect_openai(). Pending state is an expected, non-fatal outcome. Jobs are auto-split by size/count; a single oversized request line is a hard error. Helpers live in R/batch_openai_helpers.R and R/batch_openai_http.R.
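The "pending is not an error" rule can be sketched as a small guard around the collect step. This is a hypothetical helper, not part of the package: `collect_if_ready()` and its `collect_fn` argument are invented for illustration, and it assumes the status is available as a plain string using the OpenAI Batch API status names.

```r
# Hypothetical guard: only terminal failure states raise an error.
# Assumption: `status` is one of the OpenAI Batch API status strings.
collect_if_ready <- function(status, collect_fn) {
  if (status %in% c("failed", "expired", "cancelled")) {
    stop("Batch ended in terminal state: ", status, call. = FALSE)
  }
  if (!identical(status, "completed")) {
    # Pending (e.g. validating, in_progress, finalizing) is a normal outcome.
    message("Batch still pending (status: ", status, "); try again later.")
    return(invisible(NULL))
  }
  collect_fn()
}

# Usage: a pending batch returns NULL instead of failing.
res <- collect_if_ready("in_progress", function() "embeddings")
```

Returning `NULL` invisibly lets callers in scripts and demos re-run the same call later without wrapping it in error handling.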
The demo wrapper demo_finalize_openai_batch() combines status, collect, and a direct-vs-batch comparison, and writes persistent outputs to project/openai_batch_comparison/label=<label>/.
R/demo_run.R provides run_demo_openalex() and run_demo_openai(). Default demo directories are demos/openalex and demos/openai. Quarto and YAML files live in demo_dir; pipeline artifacts go under demo_dir/project/. run_demo_openai(..., render = TRUE) may return before the batch completes. This is intentional; finalize separately.
Three docs have non-overlapping roles. Update the appropriate one(s) before every commit that changes behavior:
- DEVELOPMENT_CONTINUITY.md: design principles, decision log, pre-commit checklist. Read this first when resuming work. Add a Decision Log entry for non-trivial changes.
- IMPLEMENTATION_NOTES.md: what was shipped per release (engineering log).
- NEWS.md: user-facing release notes.
Keep README and vignette commands aligned with actual defaults (paths, async behavior, API names).
- Don't hard-fail on pending OpenAI batch jobs — async pending is a normal state in submit/status/collect and demo flows.
- Backend YAML compatibility: support both the current flat format and the legacy nested (meta$backend$..., meta$model$requested_id) format in any reader changes.
- File-based artifacts are the contract. Parquet layout includes Hive-style partition dirs (model_id=.../corpus_label=.../reference_label=...). Don't change these paths casually; downstream demos/tests assume them.
- Tiny numeric drift between direct and batch OpenAI embeddings is expected (serialization/floating-point). Cosine ≈ 1.0 is not a regression.
- pkgdown site builds to _site/ (configured in _pkgdown.yml).
- .Rbuildignore excludes demos/, _site/, continuity docs, and inst/qdrant functions/; don't expect those in the built package.
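For the numeric-drift point above, a tolerance-based comparison is usually safer than exact equality. The sketch below is one way to express that in base R. The function names and the default tolerance of 1e-4 are assumptions for illustration, not values taken from the package or its tests.

```r
# Row-wise cosine similarity between two equally shaped embedding matrices.
cosine_rowwise <- function(X, Y) {
  rowSums(X * Y) / (sqrt(rowSums(X^2)) * sqrt(rowSums(Y^2)))
}

# Hypothetical check: treat embeddings as matching when every row's
# cosine similarity is within `tol` of 1 (tol is an assumed default).
embeddings_match <- function(direct, batch, tol = 1e-4) {
  stopifnot(identical(dim(direct), dim(batch)))
  all(cosine_rowwise(direct, batch) >= 1 - tol)
}

direct <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 2, byrow = TRUE)
batch  <- direct + 1e-7  # simulated serialization drift
ok <- embeddings_match(direct, batch)
```

A check like this flags real mismatches (wrong row order, wrong model) while ignoring floating-point noise from the batch round trip.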