feat: Enforce max_inflight_requests as a shared limit across ensemble requests#482

Merged
pskiran1 merged 6 commits into main from spolisetty/tri-732-maximum-inflight-requests-from-request-context-ensemble on Apr 2, 2026
Conversation

@pskiran1
Member

@pskiran1 pskiran1 commented Mar 18, 2026

This PR changes ensemble scheduling to enforce max_inflight_requests as a shared limit at each ensemble step across all concurrent ensemble requests. This helps control memory usage when upstream steps produce responses faster than downstream models can consume them.

Changes:

  • Move the per-step in-flight limiters to the ensemble model level (EnsembleInfo), where each limiter is owned once and shared across all EnsembleContexts.
  • Update the limiter API from “wait + increment/decrement” to an acquire/release slot model, and wire it into the scheduling and completion paths.
  • Update step dispatch to call Acquire() for a limiter slot before InferAsync() and to call Release() on final response completion (or on scheduling failure).
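
For readers unfamiliar with the acquire/release slot model mentioned above, here is a minimal sketch of the idea. It is not the limiter from this PR: the class name `SlotLimiter` and the `TryAcquire()` and `Inflight()` helpers are hypothetical, and the real implementation in ensemble_scheduler.cc may differ in how it waits and handles errors.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical slot limiter for illustration only; not the PR's class.
class SlotLimiter {
 public:
  explicit SlotLimiter(std::size_t max_inflight)
      : max_inflight_(max_inflight) {}

  // Blocks until a slot is free, then takes it.
  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return inflight_ < max_inflight_; });
    ++inflight_;
  }

  // Takes a slot only if one is free right now; never blocks.
  bool TryAcquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (inflight_ >= max_inflight_) {
      return false;
    }
    ++inflight_;
    return true;
  }

  // Returns a slot and wakes one waiter. A Release() without a matching
  // Acquire() is ignored so the counter cannot underflow.
  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (inflight_ == 0) {
        return;
      }
      --inflight_;
    }
    cv_.notify_one();
  }

  // Number of slots currently held.
  std::size_t Inflight() const {
    std::lock_guard<std::mutex> lock(mu_);
    return inflight_;
  }

 private:
  mutable std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t max_inflight_;
  std::size_t inflight_ = 0;
};
```

Compared with a separate "wait" followed by "increment", a single `Acquire()` checks and claims the slot under one lock, so two callers cannot both pass the check and overshoot the limit.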

CI and Doc: triton-inference-server/server#8707
Doc: triton-inference-server/common#152

Contributor

Copilot AI left a comment


Pull request overview

This PR changes ensemble scheduling so max_inflight_requests is enforced as a global per-step cap across all concurrent ensemble requests (instead of being enforced per ensemble request), helping bound memory growth when upstream steps outpace downstream consumption.

Changes:

  • Move per-step in-flight limiters to be owned globally by the ensemble model (EnsembleInfo) and shared across all EnsembleContexts.
  • Update the limiter API from “wait + increment/decrement” to an acquire/release slot model, and wire it into scheduling and completion paths.
  • Update configuration/comment semantics to reflect the new global behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/ensemble_scheduler/ensemble_scheduler.h | Updates docs for max_inflight_requests semantics and adds storage for per-step global limiters. |
| src/ensemble_scheduler/ensemble_scheduler.cc | Implements global limiter allocation and integrates acquire/release into step scheduling and completion. |

Comment thread src/ensemble_scheduler/ensemble_scheduler.h
Comment thread src/ensemble_scheduler/ensemble_scheduler.cc Outdated
Comment thread src/ensemble_scheduler/ensemble_scheduler.cc Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR changes ensemble scheduling so max_inflight_requests is enforced as a global per-step limit shared across all concurrent ensemble requests for a given ensemble model, instead of being enforced per-ensemble-request.

Changes:

  • Introduces a per-step global limiter (StepInflightRequestLimiter) stored on EnsembleInfo, allocated once per step when max_inflight_requests > 0.
  • Updates step dispatch to Acquire() a limiter slot before InferAsync() and Release() it on final response completion (or on scheduling failure).
  • Removes the per-EnsembleContext limiter instances and adjusts counter decrement logic to avoid underflow when a step is not actually scheduled.
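
The ownership change in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from this PR: `StepSlotCounter`, `ModelInfo`, `RequestContext`, `DispatchStep`, and `OnFinalResponse` are hypothetical stand-ins for StepInflightRequestLimiter, EnsembleInfo, EnsembleContext, and the scheduling and completion paths. The sketch uses a non-blocking try-acquire so it stays single-threaded; how the real limiter waits is not shown here.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical per-step counter (stand-in for the PR's limiter type).
class StepSlotCounter {
 public:
  explicit StepSlotCounter(std::size_t limit) : limit_(limit) {}
  bool TryAcquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (in_use_ >= limit_) return false;
    ++in_use_;
    return true;
  }
  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    if (in_use_ > 0) --in_use_;  // never underflow
  }
  std::size_t InUse() const {
    std::lock_guard<std::mutex> lock(mu_);
    return in_use_;
  }

 private:
  mutable std::mutex mu_;
  const std::size_t limit_;
  std::size_t in_use_ = 0;
};

// Hypothetical model-level state: one limiter per step, shared by every
// request context. Entries stay null when max_inflight is 0 (no limit).
struct ModelInfo {
  std::vector<std::unique_ptr<StepSlotCounter>> step_limiters;
};

std::shared_ptr<ModelInfo> MakeModelInfo(
    std::size_t step_count, std::size_t max_inflight) {
  auto info = std::make_shared<ModelInfo>();
  for (std::size_t i = 0; i < step_count; ++i) {
    info->step_limiters.push_back(
        max_inflight > 0 ? std::make_unique<StepSlotCounter>(max_inflight)
                         : nullptr);
  }
  return info;
}

// Hypothetical per-request context: it only points at the shared state.
struct RequestContext {
  std::shared_ptr<ModelInfo> info;
};

enum class Dispatch { kScheduled, kAtCapacity, kFailed };

// Acquire a slot, try to schedule, and give the slot back if scheduling
// fails. A step that was never scheduled never releases a slot.
Dispatch DispatchStep(
    RequestContext& ctx, std::size_t step,
    const std::function<bool()>& infer_async) {
  const auto& limiter = ctx.info->step_limiters[step];
  if (limiter && !limiter->TryAcquire()) return Dispatch::kAtCapacity;
  if (!infer_async()) {
    if (limiter) limiter->Release();
    return Dispatch::kFailed;
  }
  return Dispatch::kScheduled;
}

// Called once per scheduled step, when its final response completes.
void OnFinalResponse(RequestContext& ctx, std::size_t step) {
  const auto& limiter = ctx.info->step_limiters[step];
  if (limiter) limiter->Release();
}
```

Because both contexts point at the same `ModelInfo`, a slot taken by one ensemble request at a given step counts against every other request at that step, while other steps keep their own budget.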

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/ensemble_scheduler/ensemble_scheduler.h | Adds the limiter type and stores one limiter per ensemble step on EnsembleInfo to make the limit global across ensemble requests. |
| src/ensemble_scheduler/ensemble_scheduler.cc | Moves limiter implementation, wires Acquire/Release into scheduling and completion paths, and removes per-context limiter initialization/usage. |

Comment thread src/ensemble_scheduler/ensemble_scheduler.cc
Comment thread src/ensemble_scheduler/ensemble_scheduler.cc
@pskiran1 pskiran1 added the PR: feat A new feature label Mar 18, 2026
@pskiran1 pskiran1 marked this pull request as draft March 18, 2026 15:54
@pskiran1 pskiran1 marked this pull request as ready for review March 19, 2026 13:21
@pskiran1 pskiran1 added and then removed the documentation label Mar 21, 2026
@pskiran1 pskiran1 changed the title from "feat: Enforce max_inflight_requests as a global per-step limit across ensemble requests" to "feat: Enforce max_inflight_requests as a shared limit across ensemble requests" Mar 31, 2026
Comment thread src/ensemble_scheduler/ensemble_scheduler.h Outdated
@pskiran1 pskiran1 requested a review from yinggeh March 31, 2026 09:54
@pskiran1 pskiran1 merged commit a02b0c0 into main Apr 2, 2026
1 check passed
@pskiran1 pskiran1 deleted the spolisetty/tri-732-maximum-inflight-requests-from-request-context-ensemble branch April 2, 2026 14:36
Labels

PR: feat A new feature

4 participants