Skip to content

feat: Support for allocating GPU memory based on the selected profile#108

Merged
yinggeh merged 11 commits into
triton-inference-server:mainfrom
anmolgupt:anmolgupt/remove_profile_0_context
Apr 18, 2025
Merged

feat: Support for allocating GPU memory based on the selected profile#108
yinggeh merged 11 commits into
triton-inference-server:mainfrom
anmolgupt:anmolgupt/remove_profile_0_context

Conversation

@anmolgupt
Copy link
Copy Markdown
Contributor

The changes in the PR support 2 main items:

  1. the GPU memory is allocated based on the selected TensorRT profile and not based on the profile that consumes max memory even when it's not selected.
  2. Avoid the creation of profile 0 execution context if it's required.

Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com>
Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com>
@dongfengy
Copy link
Copy Markdown

For kUSER_MANAGED , the user (in this case the triton server) would need to actually allocate a piece of device memory and pass to execution context. Do you have support for this behavior? If not I would suggest you to only add kSTATIC and kON_PROFILE_CHANGE

removed user_managed option
@yinggeh yinggeh requested a review from rmccorm4 April 17, 2025 18:35
Comment thread src/instance_state.cc Outdated
Comment on lines 1697 to 1699
// the first context creation. As currently triton supports one
// context per engine, in order to set the specified profile_index,
// another context is created and the previous context is destroyed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As currently triton supports one context per engine, in order to set the specified profile_index, another context is created and the previous context is destroyed.

Is the comment still valid? From the code, each profile_index holds a context.

    if (profile_index == 0) {
      res.first->second.context_ = std::move(default_trt_context);
    } else {
      res.first->second.context_.reset(engine_->createExecutionContext());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested at my end, the changes work for me.

Comment thread src/model_state.cc Outdated
Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com>
@anmolgupt anmolgupt marked this pull request as ready for review April 17, 2025 21:19
@anmolgupt anmolgupt changed the title Support for allocating GPU memory based on the selected profile feat: Support for allocating GPU memory based on the selected profile Apr 17, 2025
Comment thread src/model_state.cc Outdated
Comment thread src/model_state.cc Outdated
anmolgupt and others added 2 commits April 17, 2025 15:26
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
@anmolgupt
Copy link
Copy Markdown
Contributor Author

@yinggeh: New changes look good to me; I got the expected results on the models with these updates.

@yinggeh
Copy link
Copy Markdown
Contributor

yinggeh commented Apr 18, 2025

Updated README.md

@yinggeh
Copy link
Copy Markdown
Contributor

yinggeh commented Apr 18, 2025

LGTM. Thanks for your contribution.

@yinggeh yinggeh merged commit 9d7ba1d into triton-inference-server:main Apr 18, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants