diff --git a/.Rbuildignore b/.Rbuildignore
index 972d455..b42d6d3 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -1,6 +1,7 @@
 ^vignettes/\.quarto$
 ^vignettes/.*_files$
 ^IMPLEMENTATION_NOTES\.md$
+^DEVELOPMENT_CONTINUITY\.md$
 ^doc$
 ^Meta$
 ^inst/ASR$
diff --git a/.github/workflows/pr-checks.yml b/.github/workflows/pr-checks.yml
index a270217..3ac36b6 100644
--- a/.github/workflows/pr-checks.yml
+++ b/.github/workflows/pr-checks.yml
@@ -45,4 +45,5 @@ jobs:
       - name: Run R CMD check
         uses: r-lib/actions/check-r-package@v2
         with:
-          args: 'c("--no-manual", "--no-build-vignettes", "--no-multiarch")'
+          build_args: 'c("--no-manual", "--no-build-vignettes")'
+          args: 'c("--no-manual", "--no-multiarch", "--no-examples", "--ignore-vignettes")'
diff --git a/DESCRIPTION b/DESCRIPTION
index a4c62d6..30edc48 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: openalexVectorComp
 Type: Package
-Title: Auto-tagging via TEI Embeddings and Qdrant (Prototype-Margin + Ridge Logistic)
-Version: 0.2.0
+Title: Embedding Vectorization and Distance-Based Scoring Workflows
+Version: 0.3.0
 Authors@R: c(
     person(given = "Rainer", family = "Krug", role = c("aut", "cre"), email = "you@example.org"),
     person(given = "ChatGPT", family = "Assistant", role = "ctb")
@@ -9,10 +9,11 @@ Authors@R: c(
 Author: Rainer Krug [aut, cre], ChatGPT Assistant [ctb]
 Maintainer: Rainer Krug
-Description: R-first orchestration for auto-tagging based on text embeddings served by
-    a TEI (Text Embeddings Inference) server and vector search in Qdrant.
-    Provides prototype-margin scoring, ridge logistic classification, simple ensembling,
-    calibration/threshold selection, and utilities to ingest/query Qdrant.
+Description: R-first orchestration for text vectorization (embeddings),
+    embedding distance computation, and distance-based scoring workflows.
+    Supports backend-neutral embedding providers (HF, OpenAI, TEI),
+    prototype cosine-distance scoring, reference-area distance scoring,
+    and threshold calibration utilities.
 License: MIT + file LICENSE
 Encoding: UTF-8
 LazyData: true
diff --git a/inst/DEVELOPMENT_CONTINUITY.md b/DEVELOPMENT_CONTINUITY.md
similarity index 90%
rename from inst/DEVELOPMENT_CONTINUITY.md
rename to DEVELOPMENT_CONTINUITY.md
index bb448bf..f356ac1 100644
--- a/inst/DEVELOPMENT_CONTINUITY.md
+++ b/DEVELOPMENT_CONTINUITY.md
@@ -43,20 +43,20 @@ Core flow:
 5. Optional threshold calibration (`calibrate_threshold()`).
 
 OpenAI batch flow:
-1. Submit (`embed_corpus_submit_openai_batch()`).
-2. Refresh status (`embed_corpus_status_openai_batch()`).
-3. Collect completed jobs (`embed_corpus_collect_openai_batch()`).
+1. Submit (`batch_submit_openai()`).
+2. Refresh status (`batch_status_openai()`).
+3. Collect completed jobs (`batch_collect_openai()`).
 4. Demo convenience wrapper:
-- `finalize_demo_openai_batch()` = status + collect + direct-vs-batch compare.
+- `demo_finalize_openai_batch()` = status + collect + direct-vs-batch compare.
 
-## 3. Current Demo Conventions (0.1.4)
+## 3. Current Demo Conventions (0.3.0)
 
 Default demo locations:
 - `demos/openalex`
 - `demos/openai`
 
 OpenAI demo behavior:
-- `run_demo_openai_quarto(..., render = TRUE)` may complete before batch does.
+- `run_demo_openai(..., render = TRUE)` may complete before batch does.
 - User is given explicit follow-up commands for status/finalize.
 - Batch comparison outputs are written to:
   `project/openai_batch_comparison/label=corpus_batch/`.
@@ -79,7 +79,7 @@ Template:
 - Date: 2026-04-01
 - Scope: OpenAI demo and batch comparison robustness
 - Decision: Implement two-phase OpenAI batch demo flow with
-  `finalize_demo_openai_batch()`.
+  `demo_finalize_openai_batch()`.
 - Why: Batch completion is asynchronous; render should not fail on pending jobs.
 - Alternatives considered: long blocking poll in render; hard-fail on timeout.
 - Impact: Clearer async semantics; stable demo render; persisted comparison
diff --git a/IMPLEMENTATION_NOTES.md b/IMPLEMENTATION_NOTES.md
index 8bfc955..519417b 100644
--- a/IMPLEMENTATION_NOTES.md
+++ b/IMPLEMENTATION_NOTES.md
@@ -1,498 +1,73 @@
-# Implementation Notes (March 2026)
+# Implementation Notes
 
-## Release v0.1.3 (March 2026)
+Last updated: 2026-04-01
 
-- Documentation synchronized with current code and repository layout:
-  - `README.md` scoring description aligned to current implementation
-    (`distance_ridge()` + `score_ridge()` reference-area workflow).
-  - `vignettes/simplestart.qmd` paths updated from obsolete `inst/examples/*`
-    to existing `inst/ovc_demo/project/*` fixtures.
-  - `vignettes/package-overview.qmd` clarified `distances()` as non-exported
-    internal helper.
-  - OpenAI batch vignette and other technical vignettes spot-checked for API
-    naming consistency.
-- Package version bumped in `DESCRIPTION`:
-  - `Version: 0.1.2` -> `Version: 0.1.3`
-- Release commit scope includes current repository cleanup changes in this
-  branch, including removal of stale helper scripts under:
-  - `inst/qdrant functions/`
-- Release-check caveats:
-  - `R CMD check --no-manual --no-examples --no-tests .` may still emit
-    package-structure notes/warnings if local non-package artifacts are present
-    (for example temporary check directories).
+## Purpose and Scope
 
-## Scope
+This file is the implementation/release engineering log.
+It records what changed in code and operational behavior across releases.
+User-facing release highlights are tracked in `NEWS.md`.
 
-This note tracks all implementation/documentation changes made in this workstream
-for `openalexVectorComp` (not only backend refactors).
+It is intentionally different from:
+- `DEVELOPMENT_CONTINUITY.md` (living handover + design principles + decision continuity).
 
-## Change Log
+In short:
+- `DEVELOPMENT_CONTINUITY.md` = "how to continue development safely".
+- `IMPLEMENTATION_NOTES.md` = "what was implemented and shipped".
+- `NEWS.md` = "user-facing release highlights".
 
-### 1) Package/docs alignment after rename
+## Current Baseline (v0.3.0 branch state)
 
-- Updated references from `autotagr` to `openalexVectorComp` in key docs.
-- Removed obsolete vignette `vignettes/autotagr.Rmd`.
-- Removed obsolete TODO source `R/run_autotag.R.todo` that referenced
-  non-existent APIs.
-- Updated vignette code in `vignettes/simplestart.qmd` to match current
-  function signatures.
+- Package focus is embedding/vectorization plus distance/scoring workflows.
+- Backends are provider-pluggable (`hf`, `openai`, `tei`) through a shared backend config/dispatch interface.
+- OpenAI Batch workflow is explicit async:
+  - submit -> status -> collect
+  - pending state is expected and non-fatal.
+- Demo structure uses provider subfolders:
+  - `demos/openalex`
+  - `demos/openai`
+- pkgdown output/deploy target is `_site/`.
+- CI includes PR checks matrix and manual triggers.
 
-### 2) Documentation consistency fixes
+## Release-Focused Implementation Log
 
-- Corrected mismatches between function signatures and docs, including:
-  - `calibrate_threshold()` argument behavior/docs.
-  - `distance_reference_cosine()` output column naming/docs.
-  - `distance_ridge()` stale parameter docs.
-  - embedding orchestration behavior/docs.
-- Regenerated roxygen2 docs (`man/`, `NAMESPACE`) using source loading.
-### 3) Pluggable backend refactor
+### v0.1.3
+- Documentation sync to align with implemented API and current repository layout.
+- Clarified ridge/reference-area behavior in docs (`distance_ridge()` + `score_ridge()`).
+- Updated vignette paths and internal/exported function descriptions.
+- Version bump to `0.1.3`.
+- Included repository cleanup scope for stale `inst/qdrant functions/*` helpers.
-- Added backend adapter API split across provider-specific files:
-  - `R/embed_backend_core.R` (exported config/info/embed dispatch)
-  - `R/embed_backend_hf.R` (Hugging Face adapter)
-  - `R/embed_backend_openai.R` (OpenAI adapter)
-  - `R/embed_backend_tei.R` (TEI/local adapter)
-- Removed the previous single-file implementation (`R/embed_backend.R`).
-- Central dispatch now supports `provider = "hf"`, `"openai"`, or `"tei"`.
+### v0.1.4
 
-### 4) Authentication convention
+- Introduced stronger OpenAI demo/tutorial flow and async handling guidance.
+- Added two-phase demo workflow support:
+  - render does not hard-fail when batch is still pending,
+  - finalize step performs status/collect/compare later.
+- Added persistent direct-vs-batch comparison outputs under:
+  - `project/openai_batch_comparison/label=