[Feature Request] Add codegraph gain for estimated token savings analytics #513

@zihu97

Summary

Add a codegraph gain command (or codegraph context --token-stats) that reports estimated token savings from CodeGraph-assisted exploration.

Problem

CodeGraph already advertises large reductions in tool calls and exploration time, but there is no local way for a user to answer:

  • How many tokens did this CodeGraph query emit?
  • How many tokens did it likely save compared with reading the relevant files directly?
  • How much value has CodeGraph provided over recent local sessions?

Tools such as token-optimizing command proxies can report a gain view because they compare raw output with filtered output and persist per-command estimates. CodeGraph could expose a similar analytics view for graph-assisted code exploration, with every number clearly labeled as an estimate.

Proposed solution

Add a token savings estimator with two parts:

  1. Per-query stats for codegraph context / MCP context tools
  2. Aggregated history via a new codegraph gain command

Example CLI output:

$ codegraph context "How does tool execution work?" --token-stats

CodeGraph token stats
Query: How does tool execution work?
Output tokens estimated: 8,420
Related files: 14
Full-file baseline estimated: 61,300
Estimated tokens saved: 52,880
Estimated savings: 86.3%
Method: ceil(chars / 4), baseline = full contents of related files

Example aggregate output:

$ codegraph gain

CodeGraph Gain
Queries tracked: 37
Output tokens estimated: 214,000
Baseline tokens estimated: 1,920,000
Estimated tokens saved: 1,706,000
Estimated savings: 88.9%

Suggested calculation model

A simple first version could mirror the pragmatic approach used by CLI output filters:

output_tokens_est = ceil(context_output_chars / 4)
baseline_tokens_est = ceil(sum(chars of relatedFiles full file contents) / 4)
saved_tokens_est = max(0, baseline_tokens_est - output_tokens_est)
savings_pct = saved_tokens_est / baseline_tokens_est * 100 if baseline_tokens_est > 0 else 0
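
To make the model above concrete, here is a minimal Python sketch of the per-query calculation. The names `TokenStats`, `estimate_tokens`, and `compute_token_stats` are hypothetical and are not part of CodeGraph. The sketch assumes the context output and the related files' contents are already available as strings.

```python
import math
from dataclasses import dataclass
from typing import List

CHARS_PER_TOKEN = 4  # heuristic from this proposal, not a real tokenizer


@dataclass(frozen=True)
class TokenStats:
    output_tokens_est: int
    baseline_tokens_est: int
    saved_tokens_est: int
    savings_pct: float


def estimate_tokens(char_count: int) -> int:
    """Estimate tokens as ceil(chars / 4)."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


def compute_token_stats(context_output: str, related_file_contents: List[str]) -> TokenStats:
    """Compare the emitted context with the full-related-files baseline."""
    output_tokens = estimate_tokens(len(context_output))
    # Baseline: total characters across the full contents of all related files.
    baseline_tokens = estimate_tokens(sum(len(text) for text in related_file_contents))
    # Clamp at zero so an unusually large context never reports negative savings.
    saved = max(0, baseline_tokens - output_tokens)
    # Guard against an empty baseline (no related files) to avoid dividing by zero.
    pct = saved / baseline_tokens * 100 if baseline_tokens > 0 else 0.0
    return TokenStats(output_tokens, baseline_tokens, saved, pct)
```

Plugging in the example figures from the CLI output (8,420 output tokens against a 61,300-token baseline) gives 52,880 saved tokens, or about 86.3%.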

For MCP calls, the same stats could be included in the response metadata or persisted locally.

Why this baseline is useful

This is not a perfect counterfactual for what an agent would have done without CodeGraph. The actual non-CodeGraph path might involve grep, glob, partial reads, repeated reads, or exploratory mistakes.

However, the full-related-files baseline is still useful because it is:

  • deterministic
  • local-only
  • cheap to compute
  • easy to explain
  • directionally aligned with CodeGraph's value proposition
  • comparable across projects and sessions

The command should label the result as an estimate, not exact model billing tokens.

Optional enhancements

  • --format json for dashboards and agent integrations
  • --since 7d, --project, --daily, --history for local analytics
  • Allow a configurable tokenizer later, while keeping chars / 4 as the default fallback
  • Track tool_calls_avoided_est using a simple related-files/read-count model
  • Separate stats for CLI and MCP usage
  • Include method and baseline fields so downstream tools do not mistake estimates for exact usage
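
As an illustration of the `--format json` and method/baseline ideas above, a per-query payload might look like the following Python sketch. The function name `build_stats_payload` and all field names are assumptions made for this example, not an existing CodeGraph schema.

```python
import json


def build_stats_payload(query, output_tokens_est, baseline_tokens_est,
                        related_files, source="cli"):
    """Build a JSON-serializable stats record (hypothetical field names)."""
    saved = max(0, baseline_tokens_est - output_tokens_est)
    pct = round(saved / baseline_tokens_est * 100, 1) if baseline_tokens_est > 0 else 0.0
    return {
        "query": query,
        "source": source,  # "cli" or "mcp", so the two can be reported separately
        "related_files": related_files,
        "output_tokens_est": output_tokens_est,
        "baseline_tokens_est": baseline_tokens_est,
        "saved_tokens_est": saved,
        "savings_pct": pct,
        # Explicit method and baseline fields, so consumers do not
        # mistake these estimates for exact billing tokens.
        "method": "ceil(chars / 4)",
        "baseline": "full contents of related files",
        "is_estimate": True,
    }


if __name__ == "__main__":
    payload = build_stats_payload("How does tool execution work?", 8420, 61300, 14)
    print(json.dumps(payload, indent=2))
```

Keeping `method` and `baseline` as plain strings would let a later configurable tokenizer report a different method without changing the shape of the payload.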

Acceptance criteria

  • codegraph context ... --token-stats can show estimated output tokens, baseline tokens, saved tokens, and savings percent
  • codegraph gain can show aggregate local savings history
  • JSON output is available for both per-query and aggregate stats
  • The docs clearly explain that the numbers are estimates and define the baseline
  • No source code leaves the machine; history is stored locally, similar to the existing local .codegraph data model
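
One possible way to meet the local-history and aggregate criteria is an append-only JSON Lines file, sketched below in Python. The file location and the names `append_record` and `aggregate` are hypothetical. Only token estimates and timestamps are written, so no source code is stored or sent anywhere.

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_record(history_path: Path, output_tokens_est: int,
                  baseline_tokens_est: int, timestamp: Optional[float] = None) -> None:
    """Append one query's estimates to a local JSON Lines history file."""
    record = {
        "ts": time.time() if timestamp is None else timestamp,
        "output_tokens_est": output_tokens_est,
        "baseline_tokens_est": baseline_tokens_est,
    }
    history_path.parent.mkdir(parents=True, exist_ok=True)
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def aggregate(history_path: Path, since_ts: float = 0.0) -> dict:
    """Sum tracked queries at or after since_ts (a Unix timestamp)."""
    queries = output = baseline = 0
    if history_path.exists():
        with history_path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                rec = json.loads(line)
                if rec["ts"] < since_ts:
                    continue
                queries += 1
                output += rec["output_tokens_est"]
                baseline += rec["baseline_tokens_est"]
    # Assumption: aggregate savings are computed from the totals,
    # not by summing per-query clamped savings.
    saved = max(0, baseline - output)
    pct = round(saved / baseline * 100, 1) if baseline > 0 else 0.0
    return {
        "queries_tracked": queries,
        "output_tokens_est": output,
        "baseline_tokens_est": baseline,
        "saved_tokens_est": saved,
        "savings_pct": pct,
    }
```

A `--since 7d` option could then be implemented by passing `since_ts=time.time() - 7 * 86400` to `aggregate`.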
