
Conversation


Abhiraj-GetGarak commented Dec 16, 2025

Summary

  • Adds --track_usage flag to enable token usage tracking without limits
  • Adds --token_limit and --cost_limit CLI options to cap scan resource usage
  • Tracks prompt/completion token breakdown during scans
  • Estimates costs based on model pricing from model_pricing.yaml
  • Displays token usage summary at end of scan

Related Issues

  • Closes #1532 - Feature Request: Track and display token usage during scans
  • Closes #1533 - Feature Request: Budget limits to cap token usage and costs
  • Closes #1534 - Feature Request: Estimate and report scan costs

Motivation

Users running garak scans against commercial LLM APIs have no visibility into token consumption and no way to control costs. This means:

  • Unexpected expenses from extended scans (potentially hundreds of dollars)
  • No budget enforcement mechanism
  • Limited ability to test configurations safely
  • A barrier to enterprise adoption

Community feedback (Discord, April/September 2025):

  • "Does GARAK have any way to track or estimate how many tokens are used in each scan?"
  • "I'm trying to reduce the cost of running GARAK scans against commercial LLM APIs"

Usage

# Track usage without limits (report only)
python -m garak --target_type openai --target_name gpt-3.5-turbo --probes encoding --track_usage

# Limit scan to 10,000 tokens (automatically enables tracking)
python -m garak --target_type openai --target_name gpt-3.5-turbo --probes encoding --token_limit 10000

# Limit scan to $5.00 USD
python -m garak --target_type openai --target_name gpt-4o --probes dan --cost_limit 5.00

# Both limits (stops at whichever is hit first)
python -m garak --target_type openai --target_name gpt-4o --probes dan --token_limit 50000 --cost_limit 10.00

Output Example

Token Usage Summary:
  Total tokens: 1,686 (prompt: 1,424, completion: 262)
  Estimated cost: $0.0011 USD
  API calls: 2
  Token limit: 1,000

Supported Generators

Token usage tracking is implemented for:

  • OpenAI (openai.py)
  • LiteLLM (litellm.py) - supports all LiteLLM-compatible providers
  • Mistral (mistral.py)
  • Ollama (ollama.py)
  • Bedrock (bedrock.py)

Trade-offs and Limitations

1. Token Counting Accuracy

Cost estimation depends on the API returning token counts. When an API doesn't provide them, garak falls back to an estimate of roughly 4 characters per token (marked as "estimated" in the summary).

Impact: Cost estimates may be less accurate for providers that don't return token counts.
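The fallback heuristic is simple enough to sketch. This is an illustrative version only; the function name and rounding behavior are assumptions, not the exact code in this PR:

```python
# Hypothetical sketch of the ~4 chars/token fallback used when an API
# omits usage counts; names and rounding are illustrative.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for providers that don't report usage."""
    return max(1, round(len(text) / chars_per_token))

# A 40-character prompt estimates to ~10 tokens; the summary would
# flag this figure as "estimated" rather than exact.
prompt_tokens = estimate_tokens("x" * 40)
```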

2. Parallel Execution Overshoot

With --parallel_attempts, the budget may slightly overshoot the limit since multiple workers dispatch simultaneously. We use batch processing to minimize this.

Impact: If you set --token_limit 1000, actual usage might be ~1100-1200 tokens due to in-flight requests completing.
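The overshoot follows from how a shared budget has to work across workers: the counter is checked before dispatch, but requests already in flight still complete and get recorded. A minimal sketch of that pattern, assuming a `multiprocessing.Value`-backed counter (the actual `BudgetManager` internals may differ):

```python
# Illustrative sketch of a process-shared token budget; not the actual
# garak BudgetManager implementation.
from multiprocessing import Value


class SharedTokenBudget:
    def __init__(self, token_limit: int):
        self.token_limit = token_limit
        self._used = Value("l", 0)  # shared across worker processes

    def record(self, tokens: int) -> None:
        # Called after a response arrives, so in-flight requests
        # can push usage past the limit before anyone notices.
        with self._used.get_lock():
            self._used.value += tokens

    def exceeded(self) -> bool:
        # Checked before dispatching the next batch of attempts.
        return self._used.value >= self.token_limit
```

With `--parallel_attempts`, several workers can pass the `exceeded()` check simultaneously, which is exactly the ~10-20% overshoot described above; batching dispatch narrows, but cannot eliminate, that window.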

3. Pricing Data Staleness

Model prices in model_pricing.yaml may become outdated. Users should verify current rates with providers for production budgeting.

Impact: Cost estimates are approximations. The file includes an update timestamp (2025-12) for reference.

Unknown models use conservative defaults ($5/$15 per 1M tokens input/output).
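The cost arithmetic itself can be sketched as below. The function name, the pricing-dict shape, and the per-model rates are assumptions for illustration (they mirror, but are not, the `model_pricing.yaml` schema); only the $5/$15 per 1M fallback comes from the PR:

```python
# Hedged sketch of cost estimation from per-1M-token rates; the real
# model_pricing.yaml schema and lookup logic may differ.
DEFAULT_PRICING = {"input": 5.00, "output": 15.00}  # USD per 1M tokens (PR default)


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int,
                  pricing: dict) -> float:
    """Return estimated USD cost; unknown models fall back to defaults."""
    rates = pricing.get(model, DEFAULT_PRICING)
    return (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000


# Illustrative rates; verify against current provider pricing.
pricing = {"gpt-3.5-turbo": {"input": 0.50, "output": 1.50}}
cost = estimate_cost("gpt-3.5-turbo", 1424, 262, pricing)  # ≈ $0.0011
```

With the prompt/completion split from the output example above (1,424 / 262), these illustrative rates land at roughly the $0.0011 figure shown in the summary.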

Test Plan

  • Unit tests for BudgetManager, TokenUsage, cost calculation
  • Integration tests for multiprocessing shared state
  • Manual testing with OpenAI gpt-3.5-turbo (verified 2 API calls, proper token tracking)
  • Manual testing with LiteLLM/Claude (verified budget exceeded at ~1013 tokens)
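The unit tests for limit enforcement presumably take roughly this shape; the `BudgetManager` API below is a hypothetical stand-in (only the class name and `BudgetExceededError` appear in the PR), included to show the behavior under test:

```python
# Illustrative test shape for budget enforcement; the actual BudgetManager
# API in this PR may differ.
class BudgetExceededError(Exception):
    pass


class BudgetManager:
    def __init__(self, token_limit=None, cost_limit=None):
        self.token_limit = token_limit
        self.cost_limit = cost_limit
        self.tokens_used = 0
        self.cost_used = 0.0

    def record(self, tokens: int, cost: float) -> None:
        self.tokens_used += tokens
        self.cost_used += cost

    def check(self) -> None:
        if self.token_limit is not None and self.tokens_used >= self.token_limit:
            raise BudgetExceededError(f"token limit {self.token_limit} reached")
        if self.cost_limit is not None and self.cost_used >= self.cost_limit:
            raise BudgetExceededError(f"cost limit ${self.cost_limit} reached")


def test_token_limit_enforced():
    # Mirrors the manual LiteLLM/Claude test: limit 1000, exceeded at ~1013.
    bm = BudgetManager(token_limit=1000)
    bm.record(1013, 0.0)
    try:
        bm.check()
        assert False, "expected BudgetExceededError"
    except BudgetExceededError:
        pass
```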

Implements comprehensive budget management for garak scans:
- Add --token_limit and --cost_limit CLI options
- Track prompt/completion token usage across all generators
- Estimate costs using model_pricing.yaml
- Support multiprocessing with shared state budget enforcement
- Display token usage summary at end of scan

Closes NVIDIA#1532, NVIDIA#1533, NVIDIA#1534
Remove token usage tracking from nvcf and watsonx generators as these
have not been tested and should be added in a separate PR once verified.
Abhiraj-GetGarak marked this pull request as ready for review December 16, 2025 21:15
- Remove DEFAULT_PARAMS from BudgetManager (service class, not plugin)
- Add _capture_oai_token_usage() and _capture_dict_token_usage() helpers
  to Generator base class for consistent token tracking across providers
- Refactor bedrock, litellm, mistral, ollama, openai generators to use
  new base class helpers
- Use self.budget_manager for limit values in harness instead of _config
- Return (budget_manager, exception) tuple from probewise_run/pxd_run
  instead of storing in _config.transient
- Auto-enable track_usage when cost/token limits are set in CLI
- Add :rtype: annotations to docstrings
- Move _last_usage from class variable to instance variable
- Add handle_budget_exceeded() function to command.py (was missing but called from cli.py)
- Remove redundant BudgetExceededError imports in harnesses/base.py (already imported at top)
- Remove dead code fallback to _config.transient.budget_manager in end_run()