Session affinity routing #178
Your PR is large. Please consider breaking it into multiple PRs. The |

@nirrozenbaum - Could you please review?

The PR is large because it includes the HTML visualization of how the flow works.
4c60beb to 2a88a36
ronenkat left a comment
Session affinity is an important feature and should be made compatible with IPP, which chooses models (of which there are fewer) rather than vLLM nodes.
Please also:
- add a readme for the plugin.
- document code at a function level.

Please document the requirements, for example, if storing the session ID beyond the request context is not allowed, etc.
2a88a36 to 2cfdbb8
2cfdbb8 to 89b7845
89b7845 to 32a8c55
ronenkat left a comment
Nice.
We should add a stress test for the session cache to validate that concurrent updates do not cause contention on the cache. This can be done in a follow-up.
32a8c55 to 21a749c
ronenkat left a comment
Added suggestions inline
    if c.nowFunc().Sub(entry.lastUsed) > c.ttl {
        return "", false
    }
Suggested change:
    - if c.nowFunc().Sub(entry.lastUsed) > c.ttl {
    -     return "", false
    - }

    // Pass 1: sweep all TTL-expired entries from the tail
    elem := c.order.Back()
    for elem != nil {
Suggested change:
    - for elem != nil {
    + for elem != nil && removed < c.minEvictQuantity * 10 {
Introduces a consistent hash routing plugin that maps session IDs to backends, ensuring the same session always hits the same pod for KV cache reuse.

The plugin implements Filter (for the model-selector pipeline) and ResponseProcessor (to echo X-Session-Id back to clients). When no X-Session-Id header is present, a UUID v4 is generated and returned in the response. Backends with weight <= 0 are excluded from the hash ring so unhealthy pods are skipped automatically.

Fixes: llm-d#177
Signed-off-by: szedan <szedan@redhat.com>
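
The routing idea in this commit message can be illustrated with a small consistent-hash ring. This is only a sketch under assumptions: the `Backend`, `Ring`, `NewRing`, and `Pick` names, the `replicas` parameter, and the choice of FNV-1a are hypothetical and are not taken from the plugin's code, which is not shown in this conversation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Backend is a hypothetical stand-in for a routable pod.
type Backend struct {
	Name   string
	Weight int
}

// Ring is a minimal consistent-hash ring built from virtual nodes.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> backend name
}

// hashKey maps a string to a position on the ring (FNV-1a is an assumption).
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places Weight*replicas virtual nodes per backend.
// Backends with Weight <= 0 get no points, so they can never be picked.
func NewRing(backends []Backend, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, b := range backends {
		if b.Weight <= 0 {
			continue
		}
		for i := 0; i < b.Weight*replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", b.Name, i))
			if _, taken := r.owner[p]; taken {
				continue // rare hash collision: the first owner keeps the point
			}
			r.owner[p] = b.Name
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the backend owning the first point at or after the session's
// hash, wrapping around to the start of the ring when needed.
func (r *Ring) Pick(sessionID string) (string, bool) {
	if len(r.points) == 0 {
		return "", false
	}
	h := hashKey(sessionID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]], true
}
```

With a ring like this, calling `Pick` repeatedly with the same session ID returns the same backend as long as the backend set is unchanged. When a backend's weight drops to zero and the ring is rebuilt, only the sessions that were mapped to that backend move; sessions on the other backends stay where they were.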
21a749c to 8400afa
@nirrozenbaum please take a look.
The problem: When a user has a multi-turn conversation, each request might land on a different server. That server then has to re-process all of the previous context from scratch, which wastes GPU time and adds latency.
The fix: We stick each conversation to one server.
How:
If a server goes down:
That's it. Same session, same server, faster responses.