Feature Request: TPM-Aware Rate Limiting
Problem Statement
The current implementation of rate limiting in lotus is purely RPM-based (Requests Per Minute). While this works for small data samples, it fails when processing "Large Context Tasks" such as long agent traces or document analysis.
When a user has a low-to-mid tier API account (e.g., OpenAI Tier 1 with a 200,000 TPM limit), firing off a batch of requests with large prompts immediately exhausts the TPM (Tokens Per Minute) budget, even if the RPM limit is respected; a batch of just five 50k-token rows already consumes roughly 250k prompt tokens, more than the entire per-minute budget in a single dispatch. This leads to 429 Rate Limit Reached errors and makes it impossible to reliably process datasets with large per-row context (rows with 50k+ tokens).
Proposed Solution
Implement a TPM rate limiter alongside the existing RPM logic in lm.py, so that request dispatch is throttled by estimated prompt tokens per minute in addition to request count.
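A minimal sketch of what such a limiter could look like: a sliding one-minute window over estimated prompt tokens that blocks dispatch when the budget would be exceeded. Class and method names here are illustrative, not existing lotus API:

```python
import time
import threading
from collections import deque


class TPMRateLimiter:
    """Illustrative sliding-window token budget (not the existing lotus API)."""

    def __init__(self, tokens_per_minute: int):
        self.tokens_per_minute = tokens_per_minute
        self.window = deque()  # (timestamp, token_count) pairs from the last 60s
        self.lock = threading.Lock()

    def _used_tokens(self, now: float) -> int:
        # Drop entries older than the 60-second window, then sum what remains.
        while self.window and now - self.window[0][0] >= 60:
            self.window.popleft()
        return sum(count for _, count in self.window)

    def acquire(self, token_count: int) -> None:
        """Block until `token_count` tokens fit in the current minute's budget."""
        while True:
            with self.lock:
                now = time.monotonic()
                used = self._used_tokens(now)
                if used + token_count <= self.tokens_per_minute or not self.window:
                    # Admit the request; a single over-budget request cannot be split.
                    self.window.append((now, token_count))
                    return
                # Otherwise wait until the oldest entry ages out of the window.
                wait = 60 - (now - self.window[0][0])
            time.sleep(max(wait, 0.1))
```

Before each batch dispatch in lm.py, the caller would estimate prompt tokens (via the model's tokenizer or a character-based heuristic) and call limiter.acquire(estimated_tokens); completion tokens could optionally be charged against the same window.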
Use Cases
- Agent Trace Analysis: Processing traces that are 300k+ characters long per row.
- Document Filtering/Joins: Running sem_filter or sem_join on long research papers, legal documents, or logs (see the example after this list).
- Stable Batch Processing: Ensuring that large DataFrames can be processed without manual restarts or "Retry-Heads" on standard API accounts (OpenAI Tier 1/2).
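For concreteness, a call roughly along these lines (assuming the usual lotus configure pattern; the data and model choice are illustrative) is enough to exhaust a 200k TPM budget on the first dispatch:

```python
import pandas as pd
import lotus
from lotus.models import LM

# Tier 1-style account: ~200k tokens per minute, regardless of RPM headroom.
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)

# Toy stand-in for rows that each carry tens of thousands of tokens of context.
long_doc = "clause 4.2: the supplier shall indemnify the buyer... " * 5_000
df = pd.DataFrame({"document": [long_doc] * 8})

# A handful of such rows dispatched together already exceeds the per-minute
# token budget, so today this tends to fail with 429s even at a low RPM.
relevant = df.sem_filter("{document} discusses a breach of contract")
```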
Alternative Solutions
- Lowering max_batch_size: Users can manually set a tiny batch size (e.g., 1), but this is "token-blind" and inefficient if row sizes fluctuate.
- Retry Logic: Relying on library-level retries (e.g., litellm retries), which leads to a "wave of failures" where entire batches crash simultaneously, creating massive overhead and latency.
Additional Context
A TPM-aware approach lets batches run close to the theoretical maximum throughput for a given account: the token stream stays near the limit of the user's specific API tier without overflowing it. I'm working on an implementation and am happy to contribute it as a PR.
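One possible shape for the user-facing side, kept close to the existing LM constructor; the tpm keyword is only a suggested name, not current lotus API:

```python
from lotus.models import LM

# Hypothetical keyword: cap the estimated prompt tokens dispatched per minute,
# alongside the existing request/batch controls. Name and placement are
# suggestions only, not the current lotus API.
lm = LM(model="gpt-4o-mini", tpm=200_000)
```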
Checklist