Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric by Mathews-Tom · Pull Request #366 · Mathews-Tom/archex

Mathews-Tom · 2026-06-23T00:55:05Z

Summary

Adds an offline, benchmark-only cross-tool token-efficiency comparison: how many tokens archex retrieval spends to localize a task's required files versus a naive grep/read agent, measured at a fixed required-file recall.

This is Stack B / Workstream M4 of the token-savings-metric-honesty plan. It produces a defensible "vs grep / full-file read" number without any in-process metric, reporter, retrieval, or default-path change.

What this adds

archex/benchmark/cross_tool.py: the naive token model and the tokens-at-fixed-recall metric.
- archex path: the targeted returned regions of archex_query, in rank order; cost to localize = region tokens consumed until the required files are covered.
- naive path: grep the corpus for the task keywords (deterministic, in-process matching over archex's own gitignore-aware corpus), then read either whole grep-hit files (full_file) or +/-K context windows around the hits (grep_window), in grep-relevance order.
- Both tokenized with the existing cl100k_base encoder. The naive model is a pure, deterministic function of the corpus snapshot, the keywords, and K.
archex_returned_regions helper in benchmark/strategies.py exposing the bundle's ranked regions (same pipeline as run_archex_query).
format_cross_tool_comparison reporter: per-corpus aggregate plus per-task tables.
archex benchmark cross-tool subcommand to run the comparison and write the artifact.

Recall held equal

A token delta is reported only for tasks where both paths reach the target required-file recall (default 100%). Tasks where the naive grep path never reaches a required file are marked unreached and excluded from the aggregate, so the efficiency number never compares unequal recall. Localization is aggregated as its own corpus and never merged with comprehension.

Out of scope (unchanged)

Not in DEFAULT_STRATEGIES; no product code path.
No in-process metrics ledger, metrics reporter, retrieval ranking, or default change.

Stack

Stack-Id: cross-tool-efficiency-cfdfb5
Base: main
Position: 1/2

feat/cross-tool-token-model -> this PR
feat/cross-tool-artifact-docs -> per-corpus artifact + docs

Depends on: (none - root)

Validation

uv run ruff check — pass (changed files)
uv run ruff format --check — pass (changed files)
uv run pyright — pass (changed files)
uv run pytest tests/benchmark/test_cross_tool.py tests/benchmark/test_reporter.py tests/benchmark/test_runner.py --no-cov — 105 passed

Model the token cost of localizing a task's required files two ways and compare at a fixed required-file recall: archex's targeted returned regions versus a naive grep/read agent (full-file reads or grep-hit context windows, in grep-relevance order). The naive model is a pure, deterministic function of the gitignore-aware corpus, the task keywords, and the context window, tokenized with cl100k_base. Recall is held equal: a token delta is only computed for tasks where both paths reach the target recall. Offline benchmark-only; no query hot path, metrics ledger, retrieval, or default-path change. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2

Render the per-corpus aggregate and per-task tokens-at-fixed-recall tables (recall held equal, localization graded as its own corpus) and add the offline `archex benchmark cross-tool` subcommand that writes the comparison artifact and prints the report. Kept out of DEFAULT_STRATEGIES and any product path. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2

…ring Cover tokens-at-recall walking and stop conditions, naive-model determinism and ordering, recall held equal in compare_task, per-corpus aggregation that excludes unequal-recall tasks and grades localization separately, the run orchestration with injected regions, and the report rendering. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2

Isolate per-task retrieval errors (ArchexIndexError/NotImplementedError) in run_cross_tool so one bad repo no longer aborts the batch, mirroring the standard runner. Document corpus_of's three-bucket contract and self-repo precedence. Add a determinism test that exercises the git ls-files discovery path used by real benchmark repos. Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2

The per-corpus aggregate's ratio column is the mean of per-task naive/archex ratios, rendered beside summed token columns; rename its header to 'Mean naive/archex' so it is not misread as the ratio of the displayed totals (the volume-weighted view is the adjacent token-reduction column). Stack-Id: cross-tool-efficiency-cfdfb5 Stack-Position: 1/2

Mathews-Tom added 4 commits June 23, 2026 06:23

Mathews-Tom mentioned this pull request Jun 23, 2026

Cross-tool efficiency: per-corpus reference artifact + docs #367

Merged

Mathews-Tom merged commit 0c8500d into main Jun 23, 2026
12 checks passed

Mathews-Tom deleted the feat/cross-tool-token-model branch June 23, 2026 02:02

Mathews-Tom mentioned this pull request Jun 23, 2026

chore(release): v0.15.0 #368

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric#366

Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric#366
Mathews-Tom merged 5 commits into
mainfrom
feat/cross-tool-token-model

Mathews-Tom commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Mathews-Tom commented Jun 23, 2026

Summary

What this adds

Recall held equal

Out of scope (unchanged)

Stack

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant