Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric#366
Merged
Conversation
Model the token cost of localizing a task's required files two ways and compare at a fixed required-file recall: archex's targeted returned regions versus a naive grep/read agent (full-file reads or grep-hit context windows, in grep-relevance order). The naive model is a pure, deterministic function of the gitignore-aware corpus, the task keywords, and the context window, tokenized with cl100k_base. Recall is held equal: a token delta is only computed for tasks where both paths reach the target recall. Offline benchmark-only; no query hot path, metrics ledger, retrieval, or default-path change. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2
Render the per-corpus aggregate and per-task tokens-at-fixed-recall tables (recall held equal, localization graded as its own corpus) and add the offline `archex benchmark cross-tool` subcommand that writes the comparison artifact and prints the report. Kept out of DEFAULT_STRATEGIES and any product path. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2
…ring Cover tokens-at-recall walking and stop conditions, naive-model determinism and ordering, recall held equal in compare_task, per-corpus aggregation that excludes unequal-recall tasks and grades localization separately, the run orchestration with injected regions, and the report rendering. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2
Isolate per-task retrieval errors (ArchexIndexError/NotImplementedError) in run_cross_tool so one bad repo no longer aborts the batch, mirroring the standard runner. Document corpus_of's three-bucket contract and self-repo precedence. Add a determinism test that exercises the git ls-files discovery path used by real benchmark repos. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2
The per-corpus aggregate's ratio column is the mean of per-task naive/archex ratios, rendered beside summed token columns; rename its header to 'Mean naive/archex' so it is not misread as the ratio of the displayed totals (the volume-weighted view is the adjacent token-reduction column). Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an offline, benchmark-only cross-tool token-efficiency comparison: how many tokens archex retrieval spends to localize a task's required files versus a naive grep/read agent, measured at a fixed required-file recall.
This is Stack B / Workstream M4 of the token-savings-metric-honesty plan. It produces a defensible "vs grep / full-file read" number without any in-process metric, reporter, retrieval, or default-path change.
What this adds
archex/benchmark/cross_tool.py: the naive token model and the tokens-at-fixed-recall metric.archex_query, in rank order; cost to localize = region tokens consumed until the required files are covered.full_file) or+/-Kcontext windows around the hits (grep_window), in grep-relevance order.cl100k_baseencoder. The naive model is a pure, deterministic function of the corpus snapshot, the keywords, andK.archex_returned_regionshelper inbenchmark/strategies.pyexposing the bundle's ranked regions (same pipeline asrun_archex_query).format_cross_tool_comparisonreporter: per-corpus aggregate plus per-task tables.archex benchmark cross-toolsubcommand to run the comparison and write the artifact.Recall held equal
A token delta is reported only for tasks where both paths reach the target required-file recall (default 100%). Tasks where the naive grep path never reaches a required file are marked unreached and excluded from the aggregate, so the efficiency number never compares unequal recall. Localization is aggregated as its own corpus and never merged with comprehension.
Out of scope (unchanged)
DEFAULT_STRATEGIES; no product code path.Stack
Stack-Id:
cross-tool-efficiency-cfdfb5Base:
mainPosition: 1/2
feat/cross-tool-token-model-> this PRfeat/cross-tool-artifact-docs-> per-corpus artifact + docsDepends on: (none - root)
Validation
uv run ruff check— pass (changed files)uv run ruff format --check— pass (changed files)uv run pyright— pass (changed files)uv run pytest tests/benchmark/test_cross_tool.py tests/benchmark/test_reporter.py tests/benchmark/test_runner.py --no-cov— 105 passed