From 4d50513f49f82319061d6aef1e97a1e88ee2b038 Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Fri, 22 May 2026 13:13:33 +0200
Subject: [PATCH] release: v0.3.3 add SPECTER2 + TEI support

Add a documented path for using SPECTER2's proximity-adapter model with the existing TEI backend, plus a thin R helper for the boilerplate config.

- backend_specter2_tei(): convenience wrapper around backend_config() for a local TEI server serving the merged SPECTER2 proximity model.
- inst/scripts/prepare_specter2_merged.py: one-time Python script that fuses the proximity adapter into specter2_base and saves a HF-format model dir TEI can serve directly. Default output under R_user_dir() cache, overridable via OVC_SPECTER2_PATH.
- inst/scripts/start_tei_specter2.sh: launcher that resolves the same path convention and runs text-embeddings-router.
- vignettes/specter2-setup.qmd: end-to-end setup + serve + smoke test.
- NOTES.md (Rbuildignored): captures the option to publish the merged model to HuggingFace Hub so users can skip the merge step entirely.
- CLAUDE.md: repo guide for future Claude Code sessions.

Model preparation deliberately stays out of the R API surface: TEI cannot load adapter-transformers adapters, so a one-time merge is required, but forcing a Python dependency on this R-first package conflicts with the design principle that the package does not manage external services.

Co-Authored-By: Claude Opus 4.7
---
 .Rbuildignore                           |   1 +
 CLAUDE.md                               |  83 ++++++++++++++
 DESCRIPTION                             |   3 +-
 DEVELOPMENT_CONTINUITY.md               |  21 ++++
 NAMESPACE                               |   1 +
 NEWS.md                                 |  10 ++
 NOTES.md                                |  59 ++++++++++
 R/backend_core.R                        |  37 ++++++
 README.md                               |   1 +
 inst/scripts/prepare_specter2_merged.py |  90 +++++++++++++++
 inst/scripts/start_tei_specter2.sh      |  35 ++++++
 man/backend_specter2_tei.Rd             |  34 ++++++
 vignettes/specter2-setup.qmd            | 143 ++++++++++++++++++++++++
 13 files changed, 517 insertions(+), 1 deletion(-)
 create mode 100644 CLAUDE.md
 create mode 100644 NOTES.md
 create mode 100755 inst/scripts/prepare_specter2_merged.py
 create mode 100755 inst/scripts/start_tei_specter2.sh
 create mode 100644 man/backend_specter2_tei.Rd
 create mode 100644 vignettes/specter2-setup.qmd

diff --git a/.Rbuildignore b/.Rbuildignore
index b42d6d3..de5b702 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -2,6 +2,7 @@
 ^vignettes/.*_files$
 ^IMPLEMENTATION_NOTES\.md$
 ^DEVELOPMENT_CONTINUITY\.md$
+^NOTES\.md$
 ^doc$
 ^Meta$
 ^inst/ASR$
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..2ce5ffd
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,83 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+`openalexVectorComp` is an R package for text embedding generation and distance/score-based corpus comparison. It is **R-first** with file-based artifacts (Parquet/YAML/JSON) as first-class outputs — functions are designed to compose in plain R workflows without hidden services.
+
+Current version: see `DESCRIPTION` (treat it as the source of truth, not docs).
+
+## Common Commands
+
+Run from the package root in R:
+
+```r
+# Install local source
+devtools::install_local(".")
+
+# Document (regenerate man/ and NAMESPACE from roxygen)
+devtools::document()
+
+# Run all tests
+devtools::test()
+
+# Run a single test file
+devtools::test(filter = "name-without-test-prefix-or-.R")
+# or: testthat::test_file("tests/testthat/test-.R")
+
+# Full R CMD check (mirrors CI)
+devtools::check(args = c("--no-manual", "--no-multiarch", "--no-examples", "--ignore-vignettes"))
+
+# Build pkgdown site (output: _site/)
+pkgdown::build_site()
+```
+
+CI (`.github/workflows/pr-checks.yml`) runs `R CMD check` across Ubuntu (release/oldrel-1/devel), macOS, and Windows, **skipping vignette execution and examples**. Match this locally before committing.
+
+## Architecture
+
+### Backend abstraction
+`R/backend_core.R` exposes a provider-neutral interface (`backend_config()`, `backend_info()`, `backend_embed_texts()`, `backend_read()`, `backend_save()`). It dispatches on `provider` ∈ {`hf`, `openai`, `tei`} to internal `.embedding_*` functions in:
+
+- `R/backend_hf.R` — HuggingFace Inference router
+- `R/backend_openai.R` — OpenAI embeddings API
+- `R/backend_tei.R` — local Text Embeddings Inference server
+
+Auth for hosted backends uses the `OVC_API_TOKEN` env var (bearer token). Backend configs serialize to/from YAML (`embed_model.yaml`); `backend_read()` also accepts a legacy nested format — preserve that path when editing.
+
+### Pipeline (sync)
+1. `embed_corpus()` — prepare/clean input texts (supports `dry_run = TRUE`).
+2. `embed_texts()` / `backend_embed_texts()` — generate embeddings.
+3. Distances:
+   - `distance_reference_cosine()` — full pairwise cosine matrix with centroid axes, written as a single parquet under `distance_reference_cosine/model_id=.../corpus_label=.../reference_label=.../pairwise-cosine.parquet`. First column `id` includes a `"centroid"` row; one `"centroid"` column on the reference side.
+   - `distance_ridge()` — reference-area distance.
+4. Scores: `score_reference_cosine()` (methods: `"linear"`, `"exponential"`), `score_ridge()`.
+5. Optional: `calibrate_threshold()`.
+
+### OpenAI Batch (async, explicit three-phase)
+`batch_submit_openai()` → `batch_status_openai()` → `batch_collect_openai()`. **Pending state is an expected, non-fatal outcome.** Jobs are auto-split by size/count; a single oversized request line is a hard error. Helpers live in `R/batch_openai_helpers.R` and `R/batch_openai_http.R`.
+
+The demo wrapper `demo_finalize_openai_batch()` = status + collect + direct-vs-batch comparison, with persistent outputs at `project/openai_batch_comparison/label=