Skip to content

Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric#366

Merged
Mathews-Tom merged 5 commits into
mainfrom
feat/cross-tool-token-model
Jun 23, 2026
Merged

Cross-tool token-efficiency: naive baseline + tokens-at-fixed-recall metric#366
Mathews-Tom merged 5 commits into
mainfrom
feat/cross-tool-token-model

Conversation

@Mathews-Tom

Copy link
Copy Markdown
Owner

Summary

Adds an offline, benchmark-only cross-tool token-efficiency comparison: how many tokens archex retrieval spends to localize a task's required files versus a naive grep/read agent, measured at a fixed required-file recall.

This is Stack B / Workstream M4 of the token-savings-metric-honesty plan. It produces a defensible "vs grep / full-file read" number without any in-process metric, reporter, retrieval, or default-path change.

What this adds

  • archex/benchmark/cross_tool.py: the naive token model and the tokens-at-fixed-recall metric.
    • archex path: the targeted returned regions of archex_query, in rank order; cost to localize = region tokens consumed until the required files are covered.
    • naive path: grep the corpus for the task keywords (deterministic, in-process matching over archex's own gitignore-aware corpus), then read either whole grep-hit files (full_file) or +/-K context windows around the hits (grep_window), in grep-relevance order.
    • Both tokenized with the existing cl100k_base encoder. The naive model is a pure, deterministic function of the corpus snapshot, the keywords, and K.
  • archex_returned_regions helper in benchmark/strategies.py exposing the bundle's ranked regions (same pipeline as run_archex_query).
  • format_cross_tool_comparison reporter: per-corpus aggregate plus per-task tables.
  • archex benchmark cross-tool subcommand to run the comparison and write the artifact.

Recall held equal

A token delta is reported only for tasks where both paths reach the target required-file recall (default 100%). Tasks where the naive grep path never reaches a required file are marked unreached and excluded from the aggregate, so the efficiency number never compares unequal recall. Localization is aggregated as its own corpus and never merged with comprehension.

Out of scope (unchanged)

  • Not in DEFAULT_STRATEGIES; no product code path.
  • No in-process metrics ledger, metrics reporter, retrieval ranking, or default change.

Stack

Stack-Id: cross-tool-efficiency-cfdfb5
Base: main
Position: 1/2

  1. feat/cross-tool-token-model -> this PR
  2. feat/cross-tool-artifact-docs -> per-corpus artifact + docs

Depends on: (none - root)

Validation

  • uv run ruff check — pass (changed files)
  • uv run ruff format --check — pass (changed files)
  • uv run pyright — pass (changed files)
  • uv run pytest tests/benchmark/test_cross_tool.py tests/benchmark/test_reporter.py tests/benchmark/test_runner.py --no-cov — 105 passed

Model the token cost of localizing a task's required files two ways and compare at a fixed required-file recall: archex's targeted returned regions versus a naive grep/read agent (full-file reads or grep-hit context windows, in grep-relevance order). The naive model is a pure, deterministic function of the gitignore-aware corpus, the task keywords, and the context window, tokenized with cl100k_base. Recall is held equal: a token delta is only computed for tasks where both paths reach the target recall. Offline benchmark-only; no query hot path, metrics ledger, retrieval, or default-path change.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 1/2
Render the per-corpus aggregate and per-task tokens-at-fixed-recall tables (recall held equal, localization graded as its own corpus) and add the offline `archex benchmark cross-tool` subcommand that writes the comparison artifact and prints the report. Kept out of DEFAULT_STRATEGIES and any product path.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 1/2
…ring

Cover tokens-at-recall walking and stop conditions, naive-model determinism and ordering, recall held equal in compare_task, per-corpus aggregation that excludes unequal-recall tasks and grades localization separately, the run orchestration with injected regions, and the report rendering.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 1/2
Isolate per-task retrieval errors (ArchexIndexError/NotImplementedError) in run_cross_tool so one bad repo no longer aborts the batch, mirroring the standard runner. Document corpus_of's three-bucket contract and self-repo precedence. Add a determinism test that exercises the git ls-files discovery path used by real benchmark repos.

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 1/2
The per-corpus aggregate's ratio column is the mean of per-task naive/archex ratios, rendered beside summed token columns; rename its header to 'Mean naive/archex' so it is not misread as the ratio of the displayed totals (the volume-weighted view is the adjacent token-reduction column).

Stack-Id: cross-tool-efficiency-cfdfb5
Stack-Position: 1/2
@Mathews-Tom Mathews-Tom merged commit 0c8500d into main Jun 23, 2026
12 checks passed
@Mathews-Tom Mathews-Tom deleted the feat/cross-tool-token-model branch June 23, 2026 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant