Cross-tool efficiency: per-corpus reference artifact + docs #367

Merged: Mathews-Tom merged 4 commits into main from feat/cross-tool-artifact-docs on Jun 23, 2026

@Mathews-Tom (Owner)

Summary

Generates and checks in the per-corpus cross-tool efficiency reference artifact and documents the number. Builds on #366 (the tokens-at-fixed-recall metric).

What this adds

  • benchmarks/cross-tool-efficiency/cross-tool-comparison.json: the reference artifact, produced from a clean run of archex benchmark cross-tool over the benchmark task set. Metric values are not hand-edited.
  • docs/LOCAL_METRICS.md: a "Cross-tool efficiency (offline benchmark)" section citing the artifact and its per-corpus figures, stating the number is offline benchmark-only and never enters the in-process ledger or archex metrics summary.
  • A test asserting the checked-in artifact parses, grades the localization family as its own corpus, holds recall equal in every scored comparison, and renders.
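
The invariants that test asserts can be illustrated with a minimal sketch. The artifact's real schema is not shown in this PR, so every key below (`corpora`, `comparisons`, `task_id`, `archex_recall`, `naive_recall`) is a hypothetical stand-in, and the check runs against an inline sample rather than the checked-in file:

```python
import json

# Hypothetical artifact shape: the real cross-tool-comparison.json schema is
# not shown in this PR, so every key used here is an assumption.
SAMPLE = json.loads("""
{
  "corpora": {
    "external-comprehension": {
      "comparisons": [
        {"task_id": "comp-1", "naive_model": "full_file",
         "archex_recall": 1.0, "naive_recall": 1.0}
      ]
    },
    "external-localization": {
      "comparisons": [
        {"task_id": "loc-1", "naive_model": "full_file",
         "archex_recall": 1.0, "naive_recall": 1.0}
      ]
    }
  }
}
""")


def check_artifact(artifact: dict) -> None:
    """Raise AssertionError if any of the stated invariants is violated."""
    corpora = artifact["corpora"]

    # Localization is graded as its own corpus...
    assert "external-localization" in corpora

    # ...and shares no tasks with comprehension.
    loc = {c["task_id"] for c in corpora["external-localization"]["comparisons"]}
    comp = {c["task_id"] for c in corpora["external-comprehension"]["comparisons"]}
    assert loc.isdisjoint(comp)

    # Recall is equal on both sides of every scored comparison.
    for name, corpus in corpora.items():
        for c in corpus["comparisons"]:
            assert c["archex_recall"] == c["naive_recall"], (name, c["task_id"])


check_artifact(SAMPLE)
```

The actual test also checks that the artifact renders via format_cross_tool_comparison; that call is left out here because its signature is not given in this PR.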

Per-corpus reference values (tokens at 100% required-file recall)

| Corpus | Naive model | Comparable tasks | archex tokens | naive tokens | Reduction |
| --- | --- | --- | --- | --- | --- |
| self | full_file | 16/24 | 9,484 | 4,416,681 | 99.8% |
| self | grep_window | 16/24 | 9,484 | 2,626,845 | 99.6% |
| external-comprehension | full_file | 16/19 | 22,681 | 783,725 | 97.1% |
| external-comprehension | grep_window | 16/19 | 22,681 | 492,119 | 95.4% |
| external-localization | full_file | 20/21 | 13,247 | 469,836 | 97.2% |
| external-localization | grep_window | 20/21 | 13,247 | 408,410 | 96.8% |

"Comparable" counts only tasks where both paths reach 100% required-file recall, so no figure compares unequal recall. Localization is graded as its own corpus, never merged with comprehension.

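The "comparable" filter and the resulting reduction can be sketched as follows. This is an illustration under stated assumptions, not the archex implementation: the `TaskResult` record and its field names are hypothetical, and the reduction is assumed to be one minus the ratio of aggregate token counts over the comparable tasks (ratio of sums and ratio of means coincide here because both sides cover the same task set).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    archex_recall: float
    naive_recall: float
    archex_tokens: int
    naive_tokens: int


def comparable(results: Sequence[TaskResult]) -> list[TaskResult]:
    # Keep only tasks where BOTH paths reach 100% required-file recall,
    # so token counts are never compared at unequal recall.
    return [r for r in results
            if r.archex_recall == 1.0 and r.naive_recall == 1.0]


def reduction(results: Sequence[TaskResult]) -> Optional[float]:
    kept = comparable(results)
    if not kept:
        return None  # nothing comparable, so no figure is reported
    archex = sum(r.archex_tokens for r in kept)
    naive = sum(r.naive_tokens for r in kept)
    return 1 - archex / naive


results = [
    TaskResult("a", 1.0, 1.0, 100, 1_000),
    TaskResult("b", 0.5, 1.0, 50, 2_000),   # archex recall miss: excluded
    TaskResult("c", 1.0, 1.0, 300, 3_000),
]
print(len(comparable(results)), "of", len(results), "comparable")
print(f"reduction: {reduction(results):.1%}")
```

In this toy input, task "b" is dropped even though archex used fewer tokens on it, which is why the reduction is conditioned on archex fully localizing the task.
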
Artifact location

.archex/ is gitignored repo-wide (and by a user-global ignore), so the committed reference artifact lives under the tracked benchmarks/ tree rather than punching negation holes in the ignored workspace dir. The archex benchmark cross-tool default output stays .archex/cross-tool-efficiency for ephemeral ad-hoc runs.

Out of scope (unchanged)

  • No change to the in-process metrics ledger, metrics reporter, retrieval ranking, or defaults.

Stack

Stack-Id: cross-tool-efficiency-cfdfb5
Base: feat/cross-tool-token-model
Position: 2/2

  1. feat/cross-tool-token-model -> Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric #366
  2. feat/cross-tool-artifact-docs -> this PR

Depends on: #366

Validation

  • uv run ruff check / ruff format --check / uv run pyright on changed Python — pass
  • uv run pytest tests/benchmark/test_cross_tool.py tests/benchmark/test_reporter.py --no-cov — 80 passed
  • Doc figures verified byte-for-byte against the checked-in artifact aggregates
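
As a quick arithmetic sanity check (separate from the byte-for-byte comparison against the artifact), the reduction column can be recomputed from the token counts in the table above. The sketch assumes reduction is defined as 1 - archex / naive, rounded to one decimal place; the helper name `reduction_pct` is illustrative only.

```python
# Rows copied from the per-corpus reference table:
# (corpus, naive model, archex tokens, naive tokens, reported reduction)
ROWS = [
    ("self", "full_file", 9_484, 4_416_681, "99.8%"),
    ("self", "grep_window", 9_484, 2_626_845, "99.6%"),
    ("external-comprehension", "full_file", 22_681, 783_725, "97.1%"),
    ("external-comprehension", "grep_window", 22_681, 492_119, "95.4%"),
    ("external-localization", "full_file", 13_247, 469_836, "97.2%"),
    ("external-localization", "grep_window", 13_247, 408_410, "96.8%"),
]


def reduction_pct(archex_tokens: int, naive_tokens: int) -> str:
    # Assumed definition: share of naive tokens that archex avoids reading.
    return f"{(1 - archex_tokens / naive_tokens) * 100:.1f}%"


for corpus, model, archex, naive, reported in ROWS:
    assert reduction_pct(archex, naive) == reported, (corpus, model)
```

All six rows reproduce the reported percentage under that definition.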

@Mathews-Tom force-pushed the feat/cross-tool-artifact-docs branch from 7040125 to 5de743c on June 23, 2026 01:46
Generated by `archex benchmark cross-tool --tasks-dir benchmarks/tasks --output benchmarks/cross-tool-efficiency` over the benchmark task set. Aggregates tokens-at-100%-required-file-recall per corpus (self, external-comprehension, external-localization graded separately), archex_query vs naive full-file and grep-window reads, recall held equal. Tracked under benchmarks/ (the .archex/ default output stays ephemeral/ignored).

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2
Assert the reference artifact parses, grades the localization family as its own corpus (disjoint from comprehension), holds recall equal in every scored comparison, and renders via format_cross_tool_comparison.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2
Add a cross-tool efficiency section to docs/LOCAL_METRICS.md citing the checked-in artifact and its per-corpus reduction figures, stating the number is offline benchmark-only and never enters the in-process metrics ledger or summary.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2
The artifact shows every excluded (non-comparable) task is an archex recall miss, not a naive miss: the naive grep/read path reaches full recall on every comparable task. Rewrite the caveat to state this honestly and note the reduction is conditioned on archex fully localizing the task, so it never reads as 'archex always localizes cheaper'.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2
@Mathews-Tom force-pushed the feat/cross-tool-artifact-docs branch from 5de743c to f69874f on June 23, 2026 01:57
@Mathews-Tom changed the base branch from feat/cross-tool-token-model to main on June 23, 2026 01:57
@Mathews-Tom merged commit 85e86f0 into main on Jun 23, 2026
6 checks passed
@Mathews-Tom deleted the feat/cross-tool-artifact-docs branch on June 23, 2026 02:02