cost-aware request queuing #928
Conversation
Signed-off-by: Jagjeevan Kashid <jagjeevandev97@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing the approval command in a comment.
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal for token-cost-aware admission control in the kthena-router, shifting from request-count to token-budget-based scheduling to better handle mixed workloads. The review feedback identifies several critical areas for refinement: addressing budget over-admission in multi-router deployments, ensuring token estimation consistency by using rune counts instead of byte lengths, optimizing queue retry intervals to prevent high CPU overhead, and scaling the starvation mitigation factor to be effective against token-based priority scores. All provided review comments point to valid technical improvements or logic corrections within the proposal.
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Budget leak on error path | Permanent reduced capacity | Single-owner reservation lifecycle with defer-based release |
| Queue starvation of large requests | Fairness regression | Aging factor + max-wait timeout + fairness priority blending |
| Increased routing overhead | Router latency | O(1) accounting structures, bounded queue ops, targeted metrics |
| Multi-router inconsistency | Uneven enforcement | Documented limitation + future Redis-backed shared budget mode |
Multi-router inconsistency is listed as a risk with a mitigation of "Documented limitation". However, in a horizontally scaled environment with multiple router replicas, each replica tracks budgets locally, so two routers can simultaneously believe a pod has headroom and over-admit against the same budget.
Where:

1. `estimated_prompt_tokens`:
   - Primary: tokenizer estimate from prompt/messages.
   - Fallback: `len(prompt)/4` approximation (existing style).
The fallback estimation len(prompt)/4 is inconsistent with the existing implementation in pkg/kthena-router/filters/tokenizer/estimator.go, which uses utf8.RuneCountInString(prompt) / 4.0. Using byte length (len) can lead to significant overestimation for non-ASCII text. The proposal should be updated to reflect the use of rune count for consistency and accuracy.
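To illustrate the difference, a minimal sketch (hypothetical helper name) comparing the byte-based fallback against the rune-based one the existing estimator uses:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens is the rune-based fallback: roughly one token per 4
// characters, matching utf8.RuneCountInString(prompt) / 4.0 from the
// existing estimator.
func estimateTokens(prompt string) int {
	return int(float64(utf8.RuneCountInString(prompt)) / 4.0)
}

func main() {
	ascii := "hello world, this is a test!" // 28 bytes, 28 runes
	cjk := "你好世界你好世界"               // 24 bytes, 8 runes

	fmt.Println(estimateTokens(ascii))           // 7: byte- and rune-based agree for ASCII
	fmt.Println(len(cjk)/4, estimateTokens(cjk)) // 6 2: len() overestimates 3x for CJK
}
```

For ASCII the two fallbacks coincide, which is why the discrepancy only shows up once non-ASCII prompts arrive.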
| Config key | Env var | Default | Description |
| --- | --- | --- | --- |
| `fairness.tokenBudget.safetyFactor` | `FAIRNESS_COST_SAFETY_FACTOR` | `1.2` | Multiplicative safety factor |
| `fairness.tokenBudget.minReservationTokens` | `FAIRNESS_MIN_RESERVATION_TOKENS` | `64` | Minimum reservation |
| `fairness.tokenBudget.maxReservationTokens` | `FAIRNESS_MAX_RESERVATION_TOKENS` | `32768` | Max reservation clamp |
| `fairness.tokenBudget.queueRetryInterval` | `FAIRNESS_ADMISSION_RETRY_INTERVAL` | `10ms` | Retry tick for queued fit checks |
The queueRetryInterval of 10ms for re-attempting admission fit checks might cause high CPU overhead if the queue contains many requests that cannot fit. Since the system already has a "Wake on release" mechanism (as shown in the mermaid diagram on line 258), the periodic tick should ideally be a much larger fallback interval (e.g., 500ms or 1s) to handle edge cases, rather than a primary mechanism for admission retries.
| Config key | Env var | Default | Description |
| --- | --- | --- | --- |
| `fairness.tokenBudget.maxReservationTokens` | `FAIRNESS_MAX_RESERVATION_TOKENS` | `32768` | Max reservation clamp |
| `fairness.tokenBudget.queueRetryInterval` | `FAIRNESS_ADMISSION_RETRY_INTERVAL` | `10ms` | Retry tick for queued fit checks |
| `fairness.tokenBudget.maxQueueWait` | `FAIRNESS_QUEUE_TIMEOUT` | `60s` | Max wait before timeout |
| `fairness.tokenBudget.ageBoostPerSecond` | `FAIRNESS_QUEUE_AGE_BOOST` | `0.01` | Starvation mitigation term |
The default ageBoostPerSecond of 0.01 appears too small to effectively mitigate starvation when compared to the typical magnitude of fairness_priority (which is based on token counts in a sliding window, often in the thousands or millions) and normalized_estimated_cost. For example, if a user has a priority score of 100,000, an aging boost of 0.01 per second would take an impractical amount of time to significantly influence the request's position in the queue. Consider scaling this factor to be more aligned with the expected token-based priority values.
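A quick back-of-envelope check of the comment's arithmetic (the priority number is illustrative, the boost is the proposed default):

```go
package main

import "fmt"

// secondsToOffset returns how long a request must age before the boost
// offsets the given fraction of a token-scale priority score (hypothetical
// helper; boostPerSecond is the proposal's default ageBoostPerSecond).
func secondsToOffset(fraction, priority, boostPerSecond float64) float64 {
	return fraction * priority / boostPerSecond
}

func main() {
	// With a user priority of 100,000 (token sliding window), offsetting
	// even 1% of it at 0.01/s takes on the order of a day.
	fmt.Printf("%.0f\n", secondsToOffset(0.01, 100000, 0.01)) // ~100000 seconds (about 28 hours)
}
```

Scaling the boost to the same units as the priority score (tokens, not a unitless constant) would make the aging term meaningful at realistic queue waits.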
Hey @hzxuzhonghu, can you check the proposal for the linked issue before implementation? I think this is a good starting point.
Pull request overview
Adds a new RFC-style proposal document describing a cost-aware admission control and queuing approach for kthena-router, aiming to incorporate token-cost estimation and per-pod token budgets into routing/queue decisions for mixed workloads.
Changes:
- Introduces an RFC detailing token-budget admission control concepts (estimation, reservation lifecycle, reconciliation).
- Describes integration points with existing fairness scheduling and scheduler plugins (e.g., `least-request`).
- Proposes configuration knobs, observability/metrics, rollout phases, and a test plan.
### Summary

This proposal introduces a token-cost-aware admission control layer in `kthena-router` so routing decisions and queueing behavior reflect actual request cost, not only request count. Today, pod selection is dominated by request-count signals (for example, `running + 100 * waiting` in `least-request`), which is not robust for mixed workloads where one request may be 64x or more expensive than another. This mismatch can admit expensive requests into already constrained pods, trigger KV-cache pressure/OOM, and inflate tail latency.
Spelling is inconsistent: the title/heading uses “Queuing” while the body uses “queueing” (e.g., this paragraph). Consider standardizing on one spelling throughout the document for easier search/discovery.
Can you look into this one? CC: @hzxuzhonghu
- `token-budget-aware` plugin with clear separation.
- Can be composed in scheduler profile.

Initial recommendation: extend existing plugin behind explicit config for lower migration cost.
I think a new score plugin would be preferable.
#### 5. Token Estimation Strategy
Should this consider only input tokens, or both input and output tokens? The key point is that a short input can generate a long output, and a long input can generate a short output. Different scenarios need different resources, so how do we estimate?
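One possible shape for combining both sides, sketched with the proposal's config defaults (safety factor 1.2, min 64, max 32768); how `estimated_output_tokens` is actually derived (e.g. from `max_tokens` or a model-level default) is exactly the open question raised here:

```go
package main

import "fmt"

// reservationTokens is a hypothetical sizing function: sum the two
// estimates, pad by the safety factor, and clamp to the configured bounds.
func reservationTokens(promptTokens, outputTokens int64) int64 {
	const (
		safetyFactor = 1.2   // fairness.tokenBudget.safetyFactor
		minTokens    = 64    // fairness.tokenBudget.minReservationTokens
		maxTokens    = 32768 // fairness.tokenBudget.maxReservationTokens
	)
	r := int64(float64(promptTokens+outputTokens) * safetyFactor)
	if r < minTokens {
		return minTokens
	}
	if r > maxTokens {
		return maxTokens
	}
	return r
}

func main() {
	fmt.Println(reservationTokens(10, 10))      // 64: clamped up to the minimum
	fmt.Println(reservationTokens(1000, 500))   // 1800: (1000+500) * 1.2
	fmt.Println(reservationTokens(40000, 8000)) // 32768: clamped to the maximum
}
```

The reconciliation step would later replace this padded estimate with actual usage once the response completes.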
Thanks for writing this up — the motivation is solid and the problem is real. Mixed-cost workloads genuinely break request-count-only scheduling, and this direction makes sense. A few things that I think need either clarification or a design fix before this is ready to implement:

1. **Author-as-reviewer in the frontmatter.** You've listed the same handle under both the approvers and reviewers lists; a proposal should be reviewed by someone other than its author.

2. **Concurrency on `PodBudgetState` is unspecified.** The struct is:

```go
type PodBudgetState struct {
	PodKey           string
	BudgetTokens     int64
	ReservedInflight int64
	...
}
```

The router is a high-concurrency system — many goroutines will call the reserve/release paths on this state concurrently, and the proposal does not say how access is synchronized (mutex, atomics, or sharded state).

3. **Timer-tick retry is a thundering-herd waiting to happen.** The proposal uses a polling loop with a 10ms `queueRetryInterval`; every queued request re-checks fit on each tick, so the overhead grows with queue depth.

4. Section 7's config table maps …

5. **Streaming + usage reconciliation needs more detail.** Section 9 says "retain estimate until stream complete, then reconcile." In practice, OpenAI-compatible engines typically only include usage in the final streamed chunk, and only when the client asks for it, so the reservation must be held for the entire stream and reconciliation needs a fallback for when usage is never reported.

6. **Where does the 256K default come from?** The default 256K tokens per pod is the key knob operators will tune first. Where does this number come from? A 7B model on an A100 80GB and a 70B model on 8×A100 have dramatically different KV-budget ceilings. Without any guidance on how to derive the right value from GPU memory and model size, operators will either leave it at the default (wrong for them) or have no idea how to tune it. A brief sizing formula or reference table would make the proposal much more usable.

7. **Multi-router state is a bigger problem than the proposal acknowledges.** The proposal lists "local-only accounting" as a documented limitation. But if kthena-router is typically deployed with multiple replicas (which it should be for HA), then per-instance budgeting means two routers could both believe they have headroom for a large request on the same pod and both admit, leading to exactly the OOM scenario the proposal is trying to prevent. This isn't just a v2 polish — it's a correctness gap for any multi-replica deployment. At minimum the proposal should describe what degraded behavior looks like and whether there's a safe default (e.g., halve the per-pod budget to account for N replicas assuming even split) until a proper shared-state solution lands.

8. **Priority formula can still starve large requests.** The cost term here penalizes large requests (higher cost → lower priority in the queue). Combined with the budget check that keeps them queued longer anyway, there's a real risk that under sustained mixed-traffic load, a single heavy but legitimate request keeps getting deprioritized behind the next batch of light requests. The age boost will eventually win, but "eventually" on a busy system could be minutes. Worth quantifying an upper bound on how long a large request can wait before the age boost guarantees it gets in.

Overall the proposal is well-structured and the three-phase rollout (observe → soft → strict) is exactly the right way to land this safely. The issues above are mostly about missing detail and one actual design gap (concurrency). Happy to see a revision that addresses these — especially the concurrency model and the thundering-herd retry mechanism.
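On the sizing question in point 6, a back-of-envelope derivation of a per-pod token budget from KV-cache capacity can be sketched like this; every hardware and model number below is an illustrative assumption, not a value from the proposal:

```go
package main

import "fmt"

// kvBudgetTokens estimates how many tokens of KV cache fit in the GPU
// memory left over after model weights. Per-token KV bytes =
// 2 (K and V) * layers * kvHeads * headDim * bytesPerElem.
func kvBudgetTokens(gpuMemGiB, weightsGiB float64, layers, kvHeads, headDim int, bytesPerElem float64) int64 {
	kvBytesPerToken := 2.0 * float64(layers*kvHeads*headDim) * bytesPerElem
	freeBytes := (gpuMemGiB - weightsGiB) * 1024 * 1024 * 1024
	return int64(freeBytes / kvBytesPerToken)
}

func main() {
	// Illustrative: a 7B-class model in fp16 on an 80 GiB GPU, assuming
	// 32 layers, 8 KV heads (GQA), head_dim 128, ~14 GiB of weights.
	fmt.Println(kvBudgetTokens(80, 14, 32, 8, 128, 2)) // 540672, roughly 2x the 256K default
}
```

Even this crude formula shows why a single 256K default cannot fit both a 7B and a 70B deployment, which is the comment's point.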
  - "@JagjeevanAK"
  - "@hzxuzhonghu"
reviewers:
  - "@JagjeevanAK"
1. `estimated_prompt_tokens`:
   - Primary: tokenizer estimate from prompt/messages.
   - Fallback: `len(prompt)/4` approximation (existing style).
2. `estimated_output_tokens`:
This is the most critical part; I do not see how this is estimated.
Priority function (composite):

`effective_priority = fairness_priority + cost_weight * normalized_estimated_cost - age_boost`
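Read literally (assuming lower scores dequeue first, which the proposal does not state explicitly), the formula can be sketched as:

```go
package main

import "fmt"

// effectivePriority implements the proposed composite score. All weights
// below are illustrative; the proposal does not fix their magnitudes.
func effectivePriority(fairnessPriority, costWeight, normalizedCost, ageBoost float64) float64 {
	return fairnessPriority + costWeight*normalizedCost - ageBoost
}

func main() {
	// Two fresh requests from the same user (equal fairness_priority,
	// no age boost yet), differing only in estimated cost:
	small := effectivePriority(1000, 100, 0.1, 0)
	large := effectivePriority(1000, 100, 1.0, 0)
	fmt.Println(small, large) // 1010 1100: the large request sorts behind the small one
}
```

This makes the coupling concrete: the cost term alone reorders same-user requests, which is the behavior the reviewers want separated from fairness scoring.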
Do not mix this with fairness; I think we are implementing performant scheduling here, not addressing a fairness issue.
The workflow should be: after a request is dequeued from the fairness queue, we decide when, and to which backend, to send it so as to achieve the best performance.
@JagjeevanAK Any update?
Sorry, I was a bit busy with my university curriculum as exams are approaching. I will check tonight and respond to you. CC: @hzxuzhonghu
fix #907