Cross-tool efficiency: per-corpus reference artifact + docs #367
Merged
Generated by `archex benchmark cross-tool --tasks-dir benchmarks/tasks --output benchmarks/cross-tool-efficiency` over the benchmark task set. Aggregates tokens-at-100%-required-file-recall per corpus (self, external-comprehension, external-localization graded separately), archex_query vs naive full-file and grep-window reads, recall held equal. Tracked under benchmarks/ (the .archex/ default output stays ephemeral/ignored).
Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2

Assert the reference artifact parses, grades the localization family as its own corpus (disjoint from comprehension), holds recall equal in every scored comparison, and renders via format_cross_tool_comparison.
Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2

Add a cross-tool efficiency section to docs/LOCAL_METRICS.md citing the checked-in artifact and its per-corpus reduction figures, stating the number is offline benchmark-only and never enters the in-process metrics ledger or summary.
Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2

The artifact shows every excluded (non-comparable) task is an archex recall miss, not a naive miss: the naive grep/read path reaches full recall on every comparable task. Rewrite the caveat to state this honestly and note the reduction is conditioned on archex fully localizing the task, so it never reads as 'archex always localizes cheaper'.
Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 2/2
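
The caveat above rests on knowing which path caused each exclusion. A minimal sketch of that tally follows, assuming hypothetical per-task records with `archex_recall` and `naive_recall` fields; these names and the sample values are illustrative only and are not the artifact's actual schema or data.

```python
from collections import Counter

# Hypothetical records; field names and values are made up for illustration.
sample_tasks = [
    {"task": "t1", "archex_recall": 1.0, "naive_recall": 1.0},
    {"task": "t2", "archex_recall": 0.5, "naive_recall": 1.0},
    {"task": "t3", "archex_recall": 0.0, "naive_recall": 1.0},
]


def exclusion_reasons(tasks):
    """Count why tasks fall outside the comparable set."""
    reasons = Counter()
    for t in tasks:
        archex_full = t["archex_recall"] >= 1.0
        naive_full = t["naive_recall"] >= 1.0
        if archex_full and naive_full:
            continue  # comparable: both paths reached full recall
        if naive_full:
            reasons["archex_miss"] += 1
        elif archex_full:
            reasons["naive_miss"] += 1
        else:
            reasons["both_miss"] += 1
    return reasons


reasons = exclusion_reasons(sample_tasks)
print(dict(reasons))
```

If a tally like this showed only `archex_miss` entries, the reduction figure would describe the tasks archex fully localizes and nothing beyond them.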
Summary
Generates and checks in the per-corpus cross-tool efficiency reference artifact and documents the number. Builds on #366 (the tokens-at-fixed-recall metric).
What this adds
- `benchmarks/cross-tool-efficiency/cross-tool-comparison.json`: the reference artifact, produced from a clean run of `archex benchmark cross-tool` over the benchmark task set. Metric values are not hand-edited.
- `docs/LOCAL_METRICS.md`: a "Cross-tool efficiency (offline benchmark)" section citing the artifact and its per-corpus figures, stating the number is offline benchmark-only and never enters the in-process ledger or `archex metrics summary`.

Per-corpus reference values (tokens at 100% required-file recall)
"Comparable" counts only tasks where both paths reach 100% required-file recall, so no figure compares unequal recall. Localization is graded as its own corpus, never merged with comprehension.
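
As a rough illustration of the "comparable" rule, the sketch below filters to tasks where both paths reach full recall and then computes a per-corpus token reduction. The record fields, the sample numbers, and the sum-of-tokens aggregation are all assumptions made for this example; the artifact's real schema and aggregation may differ.

```python
from collections import defaultdict

# Hypothetical records; field names and values are made up for illustration.
sample_tasks = [
    {"corpus": "self", "archex_recall": 1.0, "naive_recall": 1.0,
     "archex_tokens": 2000, "naive_tokens": 10000},
    {"corpus": "self", "archex_recall": 0.5, "naive_recall": 1.0,
     "archex_tokens": 1500, "naive_tokens": 8000},
    {"corpus": "external-localization", "archex_recall": 1.0, "naive_recall": 1.0,
     "archex_tokens": 3000, "naive_tokens": 6000},
]


def per_corpus_reduction(tasks):
    """Aggregate tokens per corpus over comparable tasks only."""
    totals = defaultdict(lambda: {"comparable": 0, "archex": 0, "naive": 0})
    for t in tasks:
        # Skip any task where either path misses required files,
        # so no figure compares unequal recall.
        if t["archex_recall"] < 1.0 or t["naive_recall"] < 1.0:
            continue
        bucket = totals[t["corpus"]]
        bucket["comparable"] += 1
        bucket["archex"] += t["archex_tokens"]
        bucket["naive"] += t["naive_tokens"]
    # Assumed aggregation: reduction = 1 - (archex tokens / naive tokens).
    return {
        corpus: {
            "comparable": b["comparable"],
            "reduction": 1 - b["archex"] / b["naive"],
        }
        for corpus, b in totals.items()
    }


result = per_corpus_reduction(sample_tasks)
for corpus, row in result.items():
    print(corpus, row["comparable"], f"{row['reduction']:.0%}")
```

Keying the totals by corpus keeps localization separate from comprehension, and the second `self` task drops out because archex did not reach full recall on it.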
Artifact location
`.archex/` is gitignored repo-wide (and by a user-global ignore), so the committed reference artifact lives under the tracked `benchmarks/` tree rather than punching negation holes in the ignored workspace dir. The `archex benchmark cross-tool` default output stays `.archex/cross-tool-efficiency` for ephemeral ad-hoc runs.

Out of scope (unchanged)
Stack
Stack-Id: cross-tool-efficiency-cfdfb5
Base: feat/cross-tool-token-model
Position: 2/2
- feat/cross-tool-token-model -> Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric #366
- feat/cross-tool-artifact-docs -> this PR

Depends on: #366
Validation
- `uv run ruff check` / `ruff format --check` / `uv run pyright` on changed Python: pass
- `uv run pytest tests/benchmark/test_cross_tool.py tests/benchmark/test_reporter.py --no-cov`: 80 passed