triton-inference-server · yinggeh · Feb 6, 2026 · Feb 3, 2026
diff --git a/README.md b/README.md
diff --git a/docs/build.md b/docs/build.md
@@ -19,12 +19,9 @@ bash scripts/build.sh
 
 ## Build the Docker Container
 
-> [!CAUTION]
-> [build.sh](../build.sh) is currently not working and will be fixed in the next weekly update.
-
 #### Build via Docker
 
-You can build the container using the instructions in the [TensorRT-LLM Docker Build](../tensorrt_llm/docker/README.md)
+You can build the container using the instructions in the [TensorRT-LLM Docker Build](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/README.md)
 with `tritonrelease` stage. Please make sure to add CUDA_ARCHS flag for your GPU, for example if compute capability of your GPU is 89:
 
 ```bash

diff --git a/docs/encoder_decoder.md b/docs/encoder_decoder.md
@@ -1,7 +1,7 @@
 # End to end workflow to run an Encoder-Decoder model
 
 ### Support Matrix
-For the specific models supported by encoder-decoder family, please visit [TensorRT-LLM encoder-decoder examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#encoder-decoder-model-support). The following two model types are supported:
+For the specific models supported by encoder-decoder family, please visit [TensorRT-LLM encoder-decoder examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec#encoder-decoder-model-support). The following two model types are supported:
 * T5
 * BART
 
@@ -28,7 +28,7 @@ If you're using [Triton TRT-LLM NGC container](https://catalog.ngc.nvidia.com/or
     docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g `pwd`:/workspace -w /workspace nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash
 ```
 
-If [building your own TensorRT-LLM Backend container](https://github.com/triton-inference-server/tensorrtllm_backend#option-2-build-via-docker) then you can run the `tensorrtllm_backend` container:
+If [building your own TensorRT-LLM Backend container](https://github.com/triton-inference-server/tensorrtllm_backend) then you can run the `tensorrtllm_backend` container:
 
 ```
     docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g `pwd`:/workspace -w /workspace triton_trt_llm bash
@@ -93,7 +93,7 @@ Build TensorRT-LLM engines.
 
 > **NOTE**
 >
-> If you want to build multi-GPU engine using Tensor Parallelism then you can set `--tp_size` in convert_checkpoint.py. For example, for TP=2 on 2-GPU you can set `--tp_size=2`. If you want to use beam search then set `--max_beam_width` to higher value than 1. The `--max_input_len` in encoder trtllm-build controls the model input length and should be same as `--max_encoder_input_len` in decoder trtllm-build. Additionally, to control the model output len you should set `--max_seq_len` in decoder trtllm-build to `desired output length + 1`. It is also advisable to tune [`--max_num_tokens`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md#max_num_tokens) as the default value of 8192 might be too large or too small depending on your input, output len and use-cases. For BART family models, make sure to remove `--context_fmha disable` from both encoder and decoder trtllm-build commands. Please refer to [TensorRT-LLM enc-dec example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines) for more details.
+> If you want to build multi-GPU engine using Tensor Parallelism then you can set `--tp_size` in convert_checkpoint.py. For example, for TP=2 on 2-GPU you can set `--tp_size=2`. If you want to use beam search then set `--max_beam_width` to higher value than 1. The `--max_input_len` in encoder trtllm-build controls the model input length and should be same as `--max_encoder_input_len` in decoder trtllm-build. Additionally, to control the model output len you should set `--max_seq_len` in decoder trtllm-build to `desired output length + 1`. It is also advisable to tune [`--max_num_tokens`](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md#max_num_tokens) as the default value of 8192 might be too large or too small depending on your input, output len and use-cases. For BART family models, make sure to remove `--context_fmha disable` from both encoder and decoder trtllm-build commands. Please refer to [TensorRT-LLM enc-dec example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec#build-tensorrt-engines) for more details.
 
 #### 4. Prepare Tritonserver configs <a id="prepare-tritonserver-configs"></a>
 

diff --git a/docs/guided_decoding.md b/docs/guided_decoding.md
@@ -2,7 +2,7 @@
 
 This document outlines the process for running guided decoding using the TensorRT-LLM backend. Guided decoding ensures that generated outputs adhere to specified formats, such as JSON. Currently, this feature is supported through the [XGrammar](https://github.com/mlc-ai/xgrammar) backend.
 
-For more information, refer to the [guided decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/executor.md#structured-output-with-guided-decoding) from TensorRT-LLM. Additionally, you can explore another example of [guided decoding + LLM API example](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_guided_decoding.html).
+For more information, refer to the [guided decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/executor.md#structured-output-with-guided-decoding) from TensorRT-LLM. Additionally, you can explore another example of [guided decoding + LLM API example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_guided_decoding.py).
 
 ## Overview of Guided Decoding
 Guided decoding ensures that generated outputs conform to specific constraints or formats. Supported guide types include:

diff --git a/docs/llama_multi_instance.md b/docs/llama_multi_instance.md
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -317,10 +317,10 @@ the model across multiple nodes.
 
 It is also possible to use orchestrator mode with MPI processes that have been
 pre-spawned. In order to do that, you need to set `--disable-spawn-processes`
-when using the [launch_triton_server.py](../scripts/launch_triton_server.py)
+when using the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py)
 script or `export TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES=0`. In this mode,
 it is possible to run the server across different nodes in orchestrator mode.
 
 In order to use the orchestrator mode itself, you need to set the `--multi-model`
-flag when using the [launch_triton_server.py](../scripts/launch_triton_server.py)
+flag when using the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py)
 script or `export TRTLLM_ORCHESTRATOR=1`.
diff --git a/docs/lora.md b/docs/lora.md
@@ -1,7 +1,7 @@
 # Running LoRA inference with inflight batching
 
 Below is an example of how to run LoRA inference with inflight batching. See the
-[LoRA documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/lora.md)
+[LoRA documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/lora.md)
 in the TensorRT-LLM repository for more information about running gpt-2b with
 LoRA using inflight batching.
 

diff --git a/docs/model_config.md b/docs/model_config.md
@@ -3,14 +3,14 @@
 ## Model Parameters
 
 The following tables show the parameters in the `config.pbtxt` of the models in
-[all_models/inflight_batcher_llm](../tensorrt_llm/triton_backend/all_models/inflight_batcher_llm).
+[all_models/inflight_batcher_llm](https://github.com/NVIDIA/TensorRT-LLM/tree/main/triton_backend/all_models/inflight_batcher_llm)
 that can be modified before deployment. For optimal performance or custom
 parameters, please refer to
-[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).
+[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md).
 
 The names of the parameters listed below are the values in the `config.pbtxt`
 that can be modified using the
-[`fill_template.py`](../tensorrt_llm/triton_backend/tools/fill_template.py) script.
+[`fill_template.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/tools/fill_template.py) script.
 
 **NOTE** For fields that have comma as the value (e.g. `gpu_device_ids`,
 `participant_ids`), you need to escape the comma with
@@ -350,7 +350,7 @@ Note: the timing metrics oputputs are represented as the number of nanoseconds s
 Below are some tips for configuring models for optimal performance. These
 recommendations are based on our experiments and may not apply to all use cases.
 For guidance on other parameters, please refer to the
-[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).
+[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/docs/source/performance/perf-best-practices.md).
 
 - **Setting the `instance_count` for models to better utilize inflight batching**
 

diff --git a/docs/multimodal.md b/docs/multimodal.md
@@ -9,7 +9,7 @@ The following multimodal model is supported in tensorrtllm_backend:
 * MLLAMA
 * Qwen2-VL
 
-For more multimodal models supported in TensorRT-LLM, please visit [TensorRT-LLM multimodal examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal).
+For more multimodal models supported in TensorRT-LLM, please visit [TensorRT-LLM multimodal examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/multimodal).
 
 ## Run Multimodal with single-GPU Tritonserver
 ### Tritonserver setup steps

diff --git a/docs/whisper.md b/docs/whisper.md
@@ -74,7 +74,7 @@ The following multimodal model is supported in tensorrtllm_backend:
 
     > **NOTE**:
     >
-    > TensorRT-LLM also supports using [distil-whisper's](https://github.com/huggingface/distil-whisper) different models by first converting their params and weights from huggingface's naming format to [openai whisper](https://github.com/openai/whisper) naming format. You can do so by running the script [distil_whisper/convert_from_distil_whisper.py](./convert_from_distil_whisper.py).
+    > TensorRT-LLM also supports using [distil-whisper's](https://github.com/huggingface/distil-whisper) different models by first converting their params and weights from huggingface's naming format to [openai whisper](https://github.com/openai/whisper) naming format. You can do so by running the script [distil_whisper/convert_from_distil_whisper.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/whisper/distil_whisper/convert_from_distil_whisper.py).
 
 3. Prepare Tritonserver configs