From 3cc2f6c871e43deac215f45f78436b1f73970302 Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Wed, 1 Apr 2026 13:47:52 +0200
Subject: [PATCH 1/6] docs: align README with backend-neutral embedding support

---
 README.md | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 67f1112..a195ee9 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@
 # openalexVectorComp
 
-**Auto-tagging via TEI embeddings + Qdrant**, implemented in R.
+**Auto-tagging via embedding backends and reference scoring**, implemented in R.
 
 ## Version
 
-Current development version: **0.1.4**.
+Current development version: **0.2.0**.
 
 - Embeddings served by **TEI** (Text Embeddings Inference; Hugging Face).
-- Vector search by **Qdrant**.
+- Embeddings via a **backend-neutral interface** (`hf`, `openai`, `tei`).
 - Scoring: **prototype cosine-distance** + **reference-area ridge score**
   (`distance_ridge()` + `score_ridge()`) + threshold calibration.
 - Works great with DuckDB/Arrow pipelines.
 
-## 0.1.4 Highlights
+## 0.2.0 Highlights
 
 - Demo defaults now use a shared structure:
   - `demos/openalex`
@@ -41,14 +41,12 @@ Or build & install from the zip you downloaded.
 
 ## Runtime dependencies
 
-- TEI server running (CPU is fine):
+- For `provider = "tei"`:
 ```bash
 text-embeddings-router --model BAAI/bge-small-en-v1.5 --port 8080
 ```
-
-- Qdrant server (optional if you only use modeling; required for ANN search):
-  - Binary: `./qdrant`
-  - Docker: `docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant`
+- For hosted embedding backends (`provider = "hf"` or `"openai"`), set
+  `OVC_API_TOKEN` in your environment.
 
 ## Vignettes
 

From 59adf25ffb0cf5c332ff950a6dd1c9fec1766755 Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Wed, 1 Apr 2026 14:16:47 +0200
Subject: [PATCH 2/6] ci: skip examples in PR check matrix

---
 .github/workflows/pr-checks.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/pr-checks.yml b/.github/workflows/pr-checks.yml
index a270217..3e02008 100644
--- a/.github/workflows/pr-checks.yml
+++ b/.github/workflows/pr-checks.yml
@@ -45,4 +45,4 @@ jobs:
       - name: Run R CMD check
         uses: r-lib/actions/check-r-package@v2
         with:
-          args: 'c("--no-manual", "--no-build-vignettes", "--no-multiarch")'
+          args: 'c("--no-manual", "--no-build-vignettes", "--no-multiarch", "--no-examples")'

From ad9441abdd2cd958a904a8411d32a5309efe1b7c Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Wed, 1 Apr 2026 14:17:50 +0200
Subject: [PATCH 3/6] docs: align DESCRIPTION and README with embedding-distance package scope

---
 DESCRIPTION | 11 ++++++-----
 README.md   |  2 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index a4c62d6..614a2ca 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: openalexVectorComp
 Type: Package
-Title: Auto-tagging via TEI Embeddings and Qdrant (Prototype-Margin + Ridge Logistic)
+Title: Embedding Vectorization and Distance-Based Scoring Workflows
 Version: 0.2.0
 Authors@R: c(
     person(given = "Rainer", family = "Krug", role = c("aut", "cre"), email = "you@example.org"),
@@ -9,10 +9,11 @@ Authors@R: c(
 Author: Rainer Krug [aut, cre],
     ChatGPT Assistant [ctb]
 Maintainer: Rainer Krug
-Description: R-first orchestration for auto-tagging based on text embeddings served by
-    a TEI (Text Embeddings Inference) server and vector search in Qdrant.
-    Provides prototype-margin scoring, ridge logistic classification, simple ensembling,
-    calibration/threshold selection, and utilities to ingest/query Qdrant.
+Description: R-first orchestration for text vectorization (embeddings),
+    embedding distance computation, and distance-based scoring workflows.
+    Supports backend-neutral embedding providers (HF, OpenAI, TEI),
+    prototype cosine-distance scoring, reference-area distance scoring,
+    and threshold calibration utilities.
 License: MIT + file LICENSE
 Encoding: UTF-8
 LazyData: true
diff --git a/README.md b/README.md
index a195ee9..fac547a 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # openalexVectorComp
 
-**Auto-tagging via embedding backends and reference scoring**, implemented in R.
+**Embedding of Corpora**, implemented in R.
 
 ## Version
 

From 0d257f29f7fd627df6456782b88856598aed5a3f Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Wed, 1 Apr 2026 14:28:43 +0200
Subject: [PATCH 4/6] ci: disable vignette execution in PR check matrix

---
 .github/workflows/pr-checks.yml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/pr-checks.yml b/.github/workflows/pr-checks.yml
index 3e02008..3ac36b6 100644
--- a/.github/workflows/pr-checks.yml
+++ b/.github/workflows/pr-checks.yml
@@ -45,4 +45,5 @@ jobs:
       - name: Run R CMD check
         uses: r-lib/actions/check-r-package@v2
         with:
-          args: 'c("--no-manual", "--no-build-vignettes", "--no-multiarch", "--no-examples")'
+          build_args: 'c("--no-manual", "--no-build-vignettes")'
+          args: 'c("--no-manual", "--no-multiarch", "--no-examples", "--ignore-vignettes")'

From 171355ea435b538753dc5b659c383582fdfb841e Mon Sep 17 00:00:00 2001
From: Rainer M Krug
Date: Wed, 1 Apr 2026 14:43:20 +0200
Subject: [PATCH 5/6] docs: split release notes and move development continuity to root

---
 .Rbuildignore                                 |   1 +
 ...CONTINUITY.md => DEVELOPMENT_CONTINUITY.md |   0
 IMPLEMENTATION_NOTES.md                       | 533 ++----------------
 NEWS.md                                       |  69 +++
 README.md                                     |   2 +-
 5 files changed, 125 insertions(+), 480 deletions(-)
 rename inst/DEVELOPMENT_CONTINUITY.md => DEVELOPMENT_CONTINUITY.md (100%)
 create mode 100644 NEWS.md

diff --git a/.Rbuildignore b/.Rbuildignore
index 972d455..b42d6d3 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -1,6 +1,7 @@
 ^vignettes/\.quarto$
 ^vignettes/.*_files$
 ^IMPLEMENTATION_NOTES\.md$
+^DEVELOPMENT_CONTINUITY\.md$
 ^doc$
 ^Meta$
 ^inst/ASR$
diff --git a/inst/DEVELOPMENT_CONTINUITY.md b/DEVELOPMENT_CONTINUITY.md
similarity index 100%
rename from inst/DEVELOPMENT_CONTINUITY.md
rename to DEVELOPMENT_CONTINUITY.md
diff --git a/IMPLEMENTATION_NOTES.md b/IMPLEMENTATION_NOTES.md
index 8bfc955..c4e1de4 100644
--- a/IMPLEMENTATION_NOTES.md
+++ b/IMPLEMENTATION_NOTES.md
@@ -1,498 +1,73 @@
-# Implementation Notes (March 2026)
+# Implementation Notes
 
-## Release v0.1.3 (March 2026)
+Last updated: 2026-04-01
 
-- Documentation synchronized with current code and repository layout:
-  - `README.md` scoring description aligned to current implementation
-    (`distance_ridge()` + `score_ridge()` reference-area workflow).
-  - `vignettes/simplestart.qmd` paths updated from obsolete `inst/examples/*`
-    to existing `inst/ovc_demo/project/*` fixtures.
-  - `vignettes/package-overview.qmd` clarified `distances()` as non-exported
-    internal helper.
-  - OpenAI batch vignette and other technical vignettes spot-checked for API
-    naming consistency.
-- Package version bumped in `DESCRIPTION`:
-  - `Version: 0.1.2` -> `Version: 0.1.3`
-- Release commit scope includes current repository cleanup changes in this
-  branch, including removal of stale helper scripts under:
-  - `inst/qdrant functions/`
-- Release-check caveats:
-  - `R CMD check --no-manual --no-examples --no-tests .` may still emit
-    package-structure notes/warnings if local non-package artifacts are present
-    (for example temporary check directories).
+## Purpose and Scope
 
-## Scope
+This file is the implementation/release engineering log.
+It records what changed in code and operational behavior across releases.
+User-facing release highlights are tracked in `NEWS.md`.
 
-This note tracks all implementation/documentation changes made in this workstream
-for `openalexVectorComp` (not only backend refactors).
+It is intentionally different from:
+- `DEVELOPMENT_CONTINUITY.md` (living handover + design principles + decision continuity).
 
-## Change Log
+In short:
+- `DEVELOPMENT_CONTINUITY.md` = "how to continue development safely".
+- `IMPLEMENTATION_NOTES.md` = "what was implemented and shipped".
+- `NEWS.md` = "user-facing release highlights".
 
-### 1) Package/docs alignment after rename
+## Current Baseline (v0.2.0 branch state)
 
-- Updated references from `autotagr` to `openalexVectorComp` in key docs.
-- Removed obsolete vignette `vignettes/autotagr.Rmd`.
-- Removed obsolete TODO source `R/run_autotag.R.todo` that referenced
-  non-existent APIs.
-- Updated vignette code in `vignettes/simplestart.qmd` to match current
-  function signatures.
+- Package focus is embedding/vectorization plus distance/scoring workflows.
+- Backends are provider-pluggable (`hf`, `openai`, `tei`) through a shared backend config/dispatch interface.
+- OpenAI Batch workflow is explicit async:
+  - submit -> status -> collect
+  - pending state is expected and non-fatal.
+- Demo structure uses provider subfolders:
+  - `demos/openalex`
+  - `demos/openai`
+- pkgdown output/deploy target is `_site/`.
+- CI includes PR checks matrix and manual triggers.
 
-### 2) Documentation consistency fixes
+## Release-Focused Implementation Log
 
-- Corrected mismatches between function signatures and docs, including:
-  - `calibrate_threshold()` argument behavior/docs.
-  - `distance_reference_cosine()` output column naming/docs.
-  - `distance_ridge()` stale parameter docs.
-  - embedding orchestration behavior/docs.
-- Regenerated roxygen2 docs (`man/`, `NAMESPACE`) using source loading.
+### v0.1.3
 
-### 3) Pluggable backend refactor
+- Documentation sync to align with implemented API and current repository layout.
+- Clarified ridge/reference-area behavior in docs (`distance_ridge()` + `score_ridge()`).
+- Updated vignette paths and internal/exported function descriptions.
+- Version bump to `0.1.3`.
+- Included repository cleanup scope for stale `inst/qdrant functions/*` helpers.
 
-- Added backend adapter API split across provider-specific files:
-  - `R/embed_backend_core.R` (exported config/info/embed dispatch)
-  - `R/embed_backend_hf.R` (Hugging Face adapter)
-  - `R/embed_backend_openai.R` (OpenAI adapter)
-  - `R/embed_backend_tei.R` (TEI/local adapter)
-- Removed the previous single-file implementation (`R/embed_backend.R`).
-- Central dispatch now supports `provider = "hf"`, `"openai"`, or `"tei"`.
+### v0.1.4
 
-### 4) Authentication convention
+- Introduced stronger OpenAI demo/tutorial flow and async handling guidance.
+- Added two-phase demo workflow support:
+  - render does not hard-fail when batch is still pending,
+  - finalize step performs status/collect/compare later.
+- Added persistent direct-vs-batch comparison outputs under:
+  - `project/openai_batch_comparison/label=