
cost-aware request queuing #928

Open

JagjeevanAK wants to merge 1 commit into volcano-sh:main from JagjeevanAK:jagjeevan/issue-907

Conversation

@JagjeevanAK (Contributor) commented Apr 26, 2026

fix #907

Signed-off-by: Jagjeevan Kashid <jagjeevandev97@gmail.com>
Copilot AI review requested due to automatic review settings April 26, 2026 15:13
@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lizhencheng9527 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a design proposal for token-cost-aware admission control in the kthena-router, shifting from request-count to token-budget-based scheduling to better handle mixed workloads. The review feedback identifies several critical areas for refinement: addressing budget over-admission in multi-router deployments, ensuring token estimation consistency by using rune counts instead of byte lengths, optimizing queue retry intervals to prevent high CPU overhead, and scaling the starvation mitigation factor to be effective against token-based priority scores. All provided review comments point to valid technical improvements or logic corrections within the proposal.

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Budget leak on error path | Permanent reduced capacity | Single-owner reservation lifecycle with defer-based release |
| Queue starvation of large requests | Fairness regression | Aging factor + max-wait timeout + fairness priority blending |
| Increased routing overhead | Router latency | O(1) accounting structures, bounded queue ops, targeted metrics |
| Multi-router inconsistency | Uneven enforcement | Documented limitation + future Redis-backed shared budget mode |
Severity: medium

Multi-router inconsistency is listed as a risk with a mitigation of "Documented limitation". However, in a horizontally scaled environment with $N$ routers, the effective admission budget for a pod could be up to $N \times \text{PodBudget}$, which significantly undermines the goal of preventing OOM and KV-cache pressure. It is recommended to consider a simple local mitigation for Phase 1, such as allowing operators to configure a "Router Count" factor to statically shard the pod budget across router instances, or prioritizing the shared-state mode.
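For illustration, with a hypothetical `routerCount` knob (not in the proposal): `effectiveBudget = podBudgetTokens / routerCount`, so the proposal's 262144-token default shared by 4 router replicas would leave each replica 65536 tokens of admission headroom.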

Where:
1. `estimated_prompt_tokens`:
- Primary: tokenizer estimate from prompt/messages.
- Fallback: `len(prompt)/4` approximation (existing style).
Severity: medium

The fallback estimation len(prompt)/4 is inconsistent with the existing implementation in pkg/kthena-router/filters/tokenizer/estimator.go, which uses utf8.RuneCountInString(prompt) / 4.0. Using byte length (len) can lead to significant overestimation for non-ASCII text. The proposal should be updated to reflect the use of rune count for consistency and accuracy.
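A minimal sketch of the gap, under the `/4` heuristic; the function names are hypothetical, only `utf8.RuneCountInString` is from the existing estimator:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateByBytes is the fallback as currently written in the proposal:
// byte length / 4, which counts UTF-8 bytes, not characters.
func estimateByBytes(prompt string) int {
	return len(prompt) / 4
}

// estimateByRunes follows the existing estimator.go style: rune count / 4.
func estimateByRunes(prompt string) int {
	return utf8.RuneCountInString(prompt) / 4
}

func main() {
	// CJK text is 3 bytes per rune in UTF-8, so the byte-based
	// estimate runs roughly 3x high.
	prompt := "こんにちは、世界。これはテストです。" // 18 runes, 54 bytes
	fmt.Println(estimateByBytes(prompt)) // 13
	fmt.Println(estimateByRunes(prompt)) // 4
}
```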

| Config | Env var | Default | Description |
| --- | --- | --- | --- |
| `fairness.tokenBudget.safetyFactor` | `FAIRNESS_COST_SAFETY_FACTOR` | `1.2` | Multiplicative safety factor |
| `fairness.tokenBudget.minReservationTokens` | `FAIRNESS_MIN_RESERVATION_TOKENS` | `64` | Minimum reservation |
| `fairness.tokenBudget.maxReservationTokens` | `FAIRNESS_MAX_RESERVATION_TOKENS` | `32768` | Max reservation clamp |
| `fairness.tokenBudget.queueRetryInterval` | `FAIRNESS_ADMISSION_RETRY_INTERVAL` | `10ms` | Retry tick for queued fit checks |
Severity: medium

The queueRetryInterval of 10ms for re-attempting admission fit checks might cause high CPU overhead if the queue contains many requests that cannot fit. Since the system already has a "Wake on release" mechanism (as shown in the mermaid diagram on line 258), the periodic tick should ideally be a much larger fallback interval (e.g., 500ms or 1s) to handle edge cases, rather than a primary mechanism for admission retries.
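For a sense of scale: at a 10ms tick, a queue of $Q$ waiting requests performs on the order of $100 \times Q$ fit checks per second even when no capacity has been released.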

| Config | Env var | Default | Description |
| --- | --- | --- | --- |
| `fairness.tokenBudget.maxReservationTokens` | `FAIRNESS_MAX_RESERVATION_TOKENS` | `32768` | Max reservation clamp |
| `fairness.tokenBudget.queueRetryInterval` | `FAIRNESS_ADMISSION_RETRY_INTERVAL` | `10ms` | Retry tick for queued fit checks |
| `fairness.tokenBudget.maxQueueWait` | `FAIRNESS_QUEUE_TIMEOUT` | `60s` | Max wait before timeout |
| `fairness.tokenBudget.ageBoostPerSecond` | `FAIRNESS_QUEUE_AGE_BOOST` | `0.01` | Starvation mitigation term |
Severity: medium

The default ageBoostPerSecond of 0.01 appears too small to effectively mitigate starvation when compared to the typical magnitude of fairness_priority (which is based on token counts in a sliding window, often in the thousands or millions) and normalized_estimated_cost. For example, if a user has a priority score of 100,000, an aging boost of 0.01 per second would take an impractical amount of time to significantly influence the request's position in the queue. Consider scaling this factor to be more aligned with the expected token-based priority values.
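Concretely: offsetting that 100,000-point gap at 0.01 per second takes $100000 / 0.01 = 10^7$ seconds, roughly 116 days.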

@JagjeevanAK (Contributor, Author)

Hey @hzxuzhonghu, can you check the proposal for the linked issue? I think this is a good starting point before implementation.

Copilot AI left a comment

Pull request overview

Adds a new RFC-style proposal document describing a cost-aware admission control and queuing approach for kthena-router, aiming to incorporate token-cost estimation and per-pod token budgets into routing/queue decisions for mixed workloads.

Changes:

  • Introduces an RFC detailing token-budget admission control concepts (estimation, reservation lifecycle, reconciliation).
  • Describes integration points with existing fairness scheduling and scheduler plugins (e.g., least-request).
  • Proposes configuration knobs, observability/metrics, rollout phases, and a test plan.



### Summary

This proposal introduces a token-cost-aware admission control layer in `kthena-router` so routing decisions and queueing behavior reflect actual request cost, not only request count. Today, pod selection is dominated by request-count signals (for example, `running + 100 * waiting` in `least-request`), which is not robust for mixed workloads where one request may be 64x or more expensive than another. This mismatch can admit expensive requests into already constrained pods, trigger KV-cache pressure/OOM, and inflate tail latency.
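To make the mismatch concrete (a hypothetical illustration, not from the proposal): under `running + 100 * waiting`, a pod running five 100-token requests scores 5 while a pod running two 32K-token requests scores 2, so least-request steers new traffic to the pod that is actually far closer to its KV-cache ceiling.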
Copilot AI, Apr 26, 2026:
Spelling is inconsistent: the title/heading uses “Queuing” while the body uses “queueing” (e.g., this paragraph). Consider standardizing on one spelling throughout the document for easier search/discovery.

Comment thread: docs/proposal/cost-aware-request-queuing.md
@JagjeevanAK changed the title from "docs: add RFC proposal for cost-aware request queuing" to "cost-aware request queuing" on Apr 26, 2026
@JagjeevanAK (Contributor, Author)

Can you look into this one?

CC: @hzxuzhonghu

- `token-budget-aware` plugin with clear separation.
- Can be composed in scheduler profile.

Initial recommendation: extend existing plugin behind explicit config for lower migration cost.
Member commented:

I think a new score plugin would be preferable.

end
```

#### 5. Token Estimation Strategy
Member commented:

Should this consider only input tokens, or both input and output tokens?

The key point is that a short input can generate a long output, and a long input can generate a short output. Different scenarios need different resources, so how do we estimate?

@hzxuzhonghu (Member)

Thanks for writing this up — the motivation is solid and the problem is real. Mixed-cost workloads genuinely break request-count-only scheduling, and this direction makes sense. A few things that I think need either clarification or a design fix before this is ready to implement:

1. Author-as-reviewer in the frontmatter

You've listed @JagjeevanAK as both an author and a reviewer. That slot should be filled by someone else from the team who can actually review the proposal independently.

2. PodBudgetState has no concurrency story

The struct is:

```go
type PodBudgetState struct {
    PodKey           string
    BudgetTokens     int64
    ReservedInflight int64
    // ...
}
```

The router is a high-concurrency system — many goroutines will call TryReserve simultaneously. The proposal mentions sync.Once guarding double-release, but there's no mutex or atomic operation anywhere on ReservedInflight. This needs to be spelled out explicitly: is the tracker guarded by a single mutex per pod-key, a sharded lock map, or are the core operations atomic? This is not a nit — a race here means silent budget corruption.
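For concreteness, a minimal sketch of one possible answer, assuming a single mutex per `PodBudgetState` and a `sync.Once`-guarded release; this is an illustration, not the proposal's design:

```go
package budget

import "sync"

type PodBudgetState struct {
	mu               sync.Mutex
	PodKey           string
	BudgetTokens     int64
	ReservedInflight int64
}

// TryReserve atomically checks fit and reserves under the pod's mutex, so
// concurrent callers cannot both observe headroom and jointly overshoot.
func (p *PodBudgetState) TryReserve(tokens int64) (release func(), ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.ReservedInflight+tokens > p.BudgetTokens {
		return nil, false
	}
	p.ReservedInflight += tokens
	var once sync.Once
	// The returned release is single-shot: sync.Once makes a double release
	// (e.g. defer plus an explicit error path) harmless.
	return func() {
		once.Do(func() {
			p.mu.Lock()
			p.ReservedInflight -= tokens
			p.mu.Unlock()
		})
	}, true
}
```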

3. Timer-tick retry is a thundering-herd waiting to happen

The proposal uses a polling loop with queueRetryInterval: 10ms to re-check admission for queued requests. When a pod releases a large reservation, every waiting request fires its retry simultaneously. With 100 requests queued across 3 pods, that's up to 300 lock acquisitions per 10ms tick, and usually only 1–2 actually get admitted. A signal/channel-based wake (notify waiters on release) would be far more efficient and is the idiomatic Go approach here. Please reconsider the tick model.
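A sketch of the signal-based alternative, assuming a single admission goroutine per pod queue (names hypothetical):

```go
package budget

import "time"

// podQueue wakes one admission loop instead of every queued request.
type podQueue struct {
	// wake has capacity 1: any number of releases between loop iterations
	// coalesce into a single pending wakeup.
	wake chan struct{}
}

func (q *podQueue) notifyRelease() {
	select {
	case q.wake <- struct{}{}:
	default: // a wakeup is already pending; nothing to do
	}
}

// admitLoop is the only goroutine re-checking fit, so a release costs one
// wakeup rather than one retry per waiter. The slow ticker is a safety net
// for missed edges, not the primary admission path.
func (q *podQueue) admitLoop(tryAdmitHead func() bool) {
	fallback := time.NewTicker(500 * time.Millisecond)
	defer fallback.Stop()
	for {
		select {
		case <-q.wake:
		case <-fallback.C:
		}
		for tryAdmitHead() {
			// keep admitting from the head while reservations still fit
		}
	}
}
```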

4. FAIRNESS_QUEUE_TIMEOUT reuse is a conflict

Section 7's config table maps maxQueueWait to the existing FAIRNESS_QUEUE_TIMEOUT env var. That variable almost certainly already controls something in the current fairness queue implementation. If we start piggybacking the token-budget queue wait on the same knob, operators lose the ability to tune them independently (fairness queue timeout for normal backpressure vs. admission timeout for budget-constrained waiting can easily want different values). This should be a new, separate env var.

5. Streaming + usage reconciliation needs more detail

Section 9 says "retain estimate until stream complete, then reconcile." In practice, OpenAI-compatible engines typically only include usage in the final SSE chunk, and some engines don't include it in streaming mode at all (they only return it for non-streaming requests). Before the implementation phase, it's worth explicitly stating how this reconciliation handles: (a) engines that never return usage in streaming mode, (b) the case where the client disconnects mid-stream (no final chunk arrives). Right now the failure path just says "release by estimate / emit usage_missing metric," but that's important enough to deserve its own failure section.
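A sketch of how the settle step might branch on those two failure paths; types and names here are hypothetical:

```go
package budget

// reservation pairs the admission-time estimate with a single-shot release
// (assumed to be guarded by sync.Once upstream).
type reservation struct {
	estimated int64
	release   func(tokens int64)
}

// reconcile settles the reservation once the response ends, covering both
// cases above: the engine never reports usage in streaming mode, or the
// client disconnects before the final SSE chunk arrives.
func reconcile(r *reservation, actualTokens int64, usagePresent bool) {
	if !usagePresent {
		// Usage never arrived: release by the admission-time estimate and
		// bump a usage_missing counter (metric wiring omitted in this sketch).
		r.release(r.estimated)
		return
	}
	r.release(actualTokens)
}
```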

6. The default podBudgetTokens: 262144 needs justification

256K tokens per pod is the key knob operators will tune first. Where does this number come from? A 7B model on an A100 80GB and a 70B model on 8×A100 have dramatically different KV-budget ceilings. Without any guidance on how to derive the right value from GPU memory and model size, operators will either leave it at the default (wrong for them) or have no idea how to tune it. A brief sizing formula or reference table would make the proposal much more usable.
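For illustration only (the numbers below are my assumption, not from the proposal): an fp16 KV cache costs roughly $2 \times n_{layers} \times n_{kv\_heads} \times d_{head} \times 2$ bytes per token (keys and values, 2 bytes each). A Llama-2-7B-class model (32 layers, 32 KV heads, head dim 128) comes to 524288 bytes, about 0.5 MiB per token, so roughly 65 GB of free KV memory on an A100 80GB supports on the order of 130K tokens, half the proposed 262144 default.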

7. Multi-router state is a bigger problem than the proposal acknowledges

The proposal lists "local-only accounting" as a documented limitation. But if kthena-router is typically deployed with multiple replicas (which it should be for HA), then per-instance budgeting means two routers could both believe they have headroom for a large request on the same pod and both admit, leading to exactly the OOM scenario the proposal is trying to prevent. This isn't just a v2 polish — it's a correctness gap for any multi-replica deployment. At minimum the proposal should describe what degraded behavior looks like and whether there's a safe default (e.g., halve the per-pod budget to account for N replicas assuming even split) until a proper shared-state solution lands.
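To make the degraded mode concrete (illustrative numbers, not from the proposal): with the 262144-token default and 3 router replicas, worst-case combined admission against one pod is $3 \times 262144 = 786432$ tokens. Statically sharding to $262144 / 3 \approx 87381$ tokens per replica restores the ceiling, at the cost of per-replica underutilization when traffic is uneven.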

8. Priority formula can still starve large requests

`effective_priority = fairness_priority + cost_weight * normalized_estimated_cost - age_boost`

The cost term here penalizes large requests (higher cost → lower priority in the queue). Combined with the budget check that keeps them queued longer anyway, there's a real risk that under sustained mixed-traffic load, a single heavy but legitimate request keeps getting deprioritized behind the next batch of light requests. The age boost will eventually win, but "eventually" at a busy system could be minutes. Worth quantifying an upper bound on how long a large request can wait before the age boost guarantees it gets in.
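An upper bound falls out of the formula directly: a queued request is guaranteed to overtake a competitor once its age boost exceeds the standing priority gap, i.e. $t_{wait} \le \Delta / r$ where $\Delta$ is the gap and $r$ is `ageBoostPerSecond`. With the proposed $r = 0.01$ and a token-scale $\Delta$ of $10^4$, that is $10^6$ seconds, which supports the earlier comment that the factor should be rescaled to token magnitudes.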


Overall the proposal is well-structured and the three-phase rollout (observe → soft → strict) is exactly the right way to land this safely. The issues above are mostly about missing detail and one actual design gap (concurrency). Happy to see a revision that addresses these — especially the concurrency model and the thundering-herd retry mechanism.

- "@JagjeevanAK"
- "@hzxuzhonghu"
reviewers:
- "@JagjeevanAK"
Member commented:

You can list @YaoZengzeng and me here.

1. `estimated_prompt_tokens`:
- Primary: tokenizer estimate from prompt/messages.
- Fallback: `len(prompt)/4` approximation (existing style).
2. `estimated_output_tokens`:
Member commented:

This is the most critical part; I don't see how this is estimated.


Priority function (composite):

`effective_priority = fairness_priority + cost_weight * normalized_estimated_cost - age_boost`
Member commented:

Don't mix it with fairness; I think we are implementing performant scheduling here, not addressing a fairness issue.

Member commented:

The workflow should be: after a request is dequeued from the fairness queue, we decide when to send it and which backend to send it to, so as to achieve the best performance.

@hzxuzhonghu (Member)

@JagjeevanAK Any update?

@JagjeevanAK (Contributor, Author)

Sorry, I was a bit busy with university coursework as exams are approaching. I will check tonight and respond to you.

CC: @hzxuzhonghu


Development

Successfully merging this pull request may close these issues.

feat: Cost-aware request queuing in kthena-router to prevent backend LLM engine overloading
