88 changes: 42 additions & 46 deletions README.md

Large diffs are not rendered by default.

5 changes: 1 addition & 4 deletions docs/build.md
@@ -19,12 +19,9 @@ bash scripts/build.sh

## Build the Docker Container

-> [!CAUTION]
-> [build.sh](../build.sh) is currently not working and will be fixed in the next weekly update.

#### Build via Docker

-You can build the container using the instructions in the [TensorRT-LLM Docker Build](../tensorrt_llm/docker/README.md)
+You can build the container using the instructions in the [TensorRT-LLM Docker Build](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/README.md)
with `tritonrelease` stage. Please make sure to add CUDA_ARCHS flag for your GPU, for example if compute capability of your GPU is 89:

```bash
6 changes: 3 additions & 3 deletions docs/encoder_decoder.md
@@ -1,7 +1,7 @@
# End to end workflow to run an Encoder-Decoder model

### Support Matrix
-For the specific models supported by encoder-decoder family, please visit [TensorRT-LLM encoder-decoder examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#encoder-decoder-model-support). The following two model types are supported:
+For the specific models supported by encoder-decoder family, please visit [TensorRT-LLM encoder-decoder examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec#encoder-decoder-model-support). The following two model types are supported:
* T5
* BART

@@ -28,7 +28,7 @@ If you're using [Triton TRT-LLM NGC container](https://catalog.ngc.nvidia.com/or
docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g `pwd`:/workspace -w /workspace nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash
```

-If [building your own TensorRT-LLM Backend container](https://github.com/triton-inference-server/tensorrtllm_backend#option-2-build-via-docker) then you can run the `tensorrtllm_backend` container:
+If [building your own TensorRT-LLM Backend container](https://github.com/triton-inference-server/tensorrtllm_backend) then you can run the `tensorrtllm_backend` container:

```
docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g `pwd`:/workspace -w /workspace triton_trt_llm bash
@@ -93,7 +93,7 @@ Build TensorRT-LLM engines.

> **NOTE**
>
-> If you want to build multi-GPU engine using Tensor Parallelism then you can set `--tp_size` in convert_checkpoint.py. For example, for TP=2 on 2-GPU you can set `--tp_size=2`. If you want to use beam search then set `--max_beam_width` to higher value than 1. The `--max_input_len` in encoder trtllm-build controls the model input length and should be same as `--max_encoder_input_len` in decoder trtllm-build. Additionally, to control the model output len you should set `--max_seq_len` in decoder trtllm-build to `desired output length + 1`. It is also advisable to tune [`--max_num_tokens`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md#max_num_tokens) as the default value of 8192 might be too large or too small depending on your input, output len and use-cases. For BART family models, make sure to remove `--context_fmha disable` from both encoder and decoder trtllm-build commands. Please refer to [TensorRT-LLM enc-dec example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines) for more details.
+> If you want to build multi-GPU engine using Tensor Parallelism then you can set `--tp_size` in convert_checkpoint.py. For example, for TP=2 on 2-GPU you can set `--tp_size=2`. If you want to use beam search then set `--max_beam_width` to higher value than 1. The `--max_input_len` in encoder trtllm-build controls the model input length and should be same as `--max_encoder_input_len` in decoder trtllm-build. Additionally, to control the model output len you should set `--max_seq_len` in decoder trtllm-build to `desired output length + 1`. It is also advisable to tune [`--max_num_tokens`](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md#max_num_tokens) as the default value of 8192 might be too large or too small depending on your input, output len and use-cases. For BART family models, make sure to remove `--context_fmha disable` from both encoder and decoder trtllm-build commands. Please refer to [TensorRT-LLM enc-dec example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec#build-tensorrt-engines) for more details.
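
The length constraints in the note above can be sketched as a small helper. This is only an illustration, not part of the change set: the function name `enc_dec_length_flags` and its parameters are hypothetical, and it covers only the three length flags the note names. All other `trtllm-build` arguments are left to the linked enc-dec example.

```python
from typing import List, Tuple


def enc_dec_length_flags(max_input_len: int, desired_output_len: int) -> Tuple[List[str], List[str]]:
    """Return (encoder_flags, decoder_flags) for the two trtllm-build calls.

    Hypothetical helper: it only encodes the relationships stated in the note.
    """
    if max_input_len < 1 or desired_output_len < 1:
        raise ValueError("lengths must be positive integers")

    # Encoder build: --max_input_len controls the model input length.
    encoder_flags = ["--max_input_len", str(max_input_len)]

    # Decoder build: --max_encoder_input_len must equal the encoder's
    # --max_input_len, and --max_seq_len is the desired output length + 1.
    decoder_flags = [
        "--max_encoder_input_len", str(max_input_len),
        "--max_seq_len", str(desired_output_len + 1),
    ]
    return encoder_flags, decoder_flags


encoder_flags, decoder_flags = enc_dec_length_flags(max_input_len=1024, desired_output_len=200)
print(" ".join(encoder_flags))
print(" ".join(decoder_flags))
```

Deriving both flag sets from one pair of inputs keeps the encoder and decoder builds from drifting apart when one of the lengths is changed later.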

#### 4. Prepare Tritonserver configs <a id="prepare-tritonserver-configs"></a>

2 changes: 1 addition & 1 deletion docs/guided_decoding.md
@@ -2,7 +2,7 @@

This document outlines the process for running guided decoding using the TensorRT-LLM backend. Guided decoding ensures that generated outputs adhere to specified formats, such as JSON. Currently, this feature is supported through the [XGrammar](https://github.com/mlc-ai/xgrammar) backend.

-For more information, refer to the [guided decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/executor.md#structured-output-with-guided-decoding) from TensorRT-LLM. Additionally, you can explore another example of [guided decoding + LLM API example](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_guided_decoding.html).
+For more information, refer to the [guided decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/executor.md#structured-output-with-guided-decoding) from TensorRT-LLM. Additionally, you can explore another example of [guided decoding + LLM API example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_guided_decoding.py).

## Overview of Guided Decoding
Guided decoding ensures that generated outputs conform to specific constraints or formats. Supported guide types include:
6 changes: 3 additions & 3 deletions docs/llama_multi_instance.md
@@ -1,5 +1,5 @@
<!--
-# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -317,10 +317,10 @@ the model across multiple nodes.

It is also possible to use orchestrator mode with MPI processes that have been
pre-spawned. In order to do that, you need to set `--disable-spawn-processes`
-when using the [launch_triton_server.py](../scripts/launch_triton_server.py)
+when using the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py)
script or `export TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES=0`. In this mode,
it is possible to run the server across different nodes in orchestrator mode.

In order to use the orchestrator mode itself, you need to set the `--multi-model`
-flag when using the [launch_triton_server.py](../scripts/launch_triton_server.py)
+flag when using the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py)
script or `export TRTLLM_ORCHESTRATOR=1`.
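
As a rough sketch of the two equivalent ways to select these modes, the hypothetical helper below returns both the `launch_triton_server.py` flags and the matching environment variables named in the text. The function name `orchestrator_settings` and its `spawn_processes` parameter are assumptions made for illustration. A caller would use one form or the other, not necessarily both.

```python
from typing import Dict, List, Tuple


def orchestrator_settings(spawn_processes: bool = True) -> Tuple[List[str], Dict[str, str]]:
    """Return (script_flags, env_vars) that enable orchestrator mode.

    Hypothetical helper: the two return values are alternative ways of
    expressing the same choice, as described in the text above.
    """
    # Orchestrator mode itself: --multi-model or TRTLLM_ORCHESTRATOR=1.
    flags = ["--multi-model"]
    env = {"TRTLLM_ORCHESTRATOR": "1"}

    if not spawn_processes:
        # Pre-spawned MPI processes: --disable-spawn-processes or
        # TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES=0.
        flags.append("--disable-spawn-processes")
        env["TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES"] = "0"

    return flags, env


flags, env = orchestrator_settings(spawn_processes=False)
print(flags)
print(env)
```
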
2 changes: 1 addition & 1 deletion docs/lora.md
@@ -1,7 +1,7 @@
# Running LoRA inference with inflight batching

Below is an example of how to run LoRA inference with inflight batching. See the
-[LoRA documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/lora.md)
+[LoRA documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/lora.md)
in the TensorRT-LLM repository for more information about running gpt-2b with
LoRA using inflight batching.

8 changes: 4 additions & 4 deletions docs/model_config.md
@@ -3,14 +3,14 @@
## Model Parameters

The following tables show the parameters in the `config.pbtxt` of the models in
-[all_models/inflight_batcher_llm](../tensorrt_llm/triton_backend/all_models/inflight_batcher_llm).
+[all_models/inflight_batcher_llm](https://github.com/NVIDIA/TensorRT-LLM/tree/main/triton_backend/all_models/inflight_batcher_llm).
that can be modified before deployment. For optimal performance or custom
parameters, please refer to
-[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).
+[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md).

The names of the parameters listed below are the values in the `config.pbtxt`
that can be modified using the
-[`fill_template.py`](../tensorrt_llm/triton_backend/tools/fill_template.py) script.
+[`fill_template.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/tools/fill_template.py) script.

**NOTE** For fields that have comma as the value (e.g. `gpu_device_ids`,
`participant_ids`), you need to escape the comma with
@@ -350,7 +350,7 @@ Note: the timing metrics oputputs are represented as the number of nanoseconds s
Below are some tips for configuring models for optimal performance. These
recommendations are based on our experiments and may not apply to all use cases.
For guidance on other parameters, please refer to the
-[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).
+[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md).

- **Setting the `instance_count` for models to better utilize inflight batching**

2 changes: 1 addition & 1 deletion docs/multimodal.md
@@ -9,7 +9,7 @@ The following multimodal model is supported in tensorrtllm_backend:
* MLLAMA
* Qwen2-VL

-For more multimodal models supported in TensorRT-LLM, please visit [TensorRT-LLM multimodal examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal).
+For more multimodal models supported in TensorRT-LLM, please visit [TensorRT-LLM multimodal examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal).

## Run Multimodal with single-GPU Tritonserver
### Tritonserver setup steps
2 changes: 1 addition & 1 deletion docs/whisper.md
@@ -74,7 +74,7 @@ The following multimodal model is supported in tensorrtllm_backend:

> **NOTE**:
>
-> TensorRT-LLM also supports using [distil-whisper's](https://github.com/huggingface/distil-whisper) different models by first converting their params and weights from huggingface's naming format to [openai whisper](https://github.com/openai/whisper) naming format. You can do so by running the script [distil_whisper/convert_from_distil_whisper.py](./convert_from_distil_whisper.py).
+> TensorRT-LLM also supports using [distil-whisper's](https://github.com/huggingface/distil-whisper) different models by first converting their params and weights from huggingface's naming format to [openai whisper](https://github.com/openai/whisper) naming format. You can do so by running the script [distil_whisper/convert_from_distil_whisper.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/whisper/distil_whisper/convert_from_distil_whisper.py).

3. Prepare Tritonserver configs
