diff --git a/.Rbuildignore b/.Rbuildignore
index b42d6d3..de5b702 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -2,6 +2,7 @@
 ^vignettes/.*_files$
 ^IMPLEMENTATION_NOTES\.md$
 ^DEVELOPMENT_CONTINUITY\.md$
+^NOTES\.md$
 ^doc$
 ^Meta$
 ^inst/ASR$
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..2ce5ffd
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,83 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+`openalexVectorComp` is an R package for text embedding generation and distance/score-based corpus comparison. It is **R-first** with file-based artifacts (Parquet/YAML/JSON) as first-class outputs — functions are designed to compose in plain R workflows without hidden services.
+
+Current version: see `DESCRIPTION` (treat it as the source of truth, not docs).
+
+## Common Commands
+
+Run from the package root in R:
+
+```r
+# Install local source
+devtools::install_local(".")
+
+# Document (regenerate man/ and NAMESPACE from roxygen)
+devtools::document()
+
+# Run all tests
+devtools::test()
+
+# Run a single test file
+devtools::test(filter = "name-without-test-prefix-or-.R")
+# or: testthat::test_file("tests/testthat/test-.R")
+
+# Full R CMD check (mirrors CI)
+devtools::check(args = c("--no-manual", "--no-multiarch", "--no-examples", "--ignore-vignettes"))
+
+# Build pkgdown site (output: _site/)
+pkgdown::build_site()
+```
+
+CI (`.github/workflows/pr-checks.yml`) runs `R CMD check` across Ubuntu (release/oldrel-1/devel), macOS, and Windows, **skipping vignette execution and examples**. Match this locally before committing.
+
+## Architecture
+
+### Backend abstraction
+`R/backend_core.R` exposes a provider-neutral interface (`backend_config()`, `backend_info()`, `backend_embed_texts()`, `backend_read()`, `backend_save()`). It dispatches on `provider` ∈ {`hf`, `openai`, `tei`} to internal `.embedding_*` functions in:
+
+- `R/backend_hf.R` — HuggingFace Inference router
+- `R/backend_openai.R` — OpenAI embeddings API
+- `R/backend_tei.R` — local Text Embeddings Inference server
+
+Auth for hosted backends uses the `OVC_API_TOKEN` env var (bearer token). Backend configs serialize to/from YAML (`embed_model.yaml`); `backend_read()` also accepts a legacy nested format — preserve that path when editing.
+
+### Pipeline (sync)
+1. `embed_corpus()` — prepare/clean input texts (supports `dry_run = TRUE`).
+2. `embed_texts()` / `backend_embed_texts()` — generate embeddings.
+3. Distances:
+   - `distance_reference_cosine()` — full pairwise cosine matrix with centroid axes, written as a single parquet under `distance_reference_cosine/model_id=.../corpus_label=.../reference_label=.../pairwise-cosine.parquet`. First column `id` includes a `"centroid"` row; one `"centroid"` column on the reference side.
+   - `distance_ridge()` — reference-area distance.
+4. Scores: `score_reference_cosine()` (methods: `"linear"`, `"exponential"`), `score_ridge()`.
+5. Optional: `calibrate_threshold()`.
+
+### OpenAI Batch (async, explicit three-phase)
+`batch_submit_openai()` → `batch_status_openai()` → `batch_collect_openai()`. **Pending state is an expected, non-fatal outcome.** Jobs are auto-split by size/count; a single oversized request line is a hard error. Helpers live in `R/batch_openai_helpers.R` and `R/batch_openai_http.R`.
+
+The demo wrapper `demo_finalize_openai_batch()` = status + collect + direct-vs-batch comparison, with persistent outputs at `project/openai_batch_comparison/label=