NVIDIA · MaciejBalaNV · Jun 23, 2026
diff --git a/README.md b/README.md
@@ -209,8 +209,10 @@ Set `HF_HOME` if you want to use a shared cache or a disk with more space.
 Generator requires the Guardrail. Request access to the gated
 [nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail)
 HF repository. To disable the guardrail, set `enable_safety_checker=False` (Diffusers),
-`guardrails: false` (vLLM-Omni `extra_params`/`extra_args`), or
-`--no-guardrails` (Cosmos Framework).
+`TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1` in the server environment or `use_guardrails: false`
+in `extra_params` (TensorRT-LLM), `guardrails: false` (vLLM-Omni
+`extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).
+
 #### Generator with Diffusers
 
 <details>
@@ -745,6 +747,7 @@ We are building examples that show Cosmos 3 capabilities end to end, including w
 | Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) |
 | Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) |
 | Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
+| Generator (audiovisual) with TensorRT-LLM | Generator | Text-to-image, text-to-video, and image-to-video against an OpenAI-compatible TensorRT-LLM VisualGen server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_trt_llm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_trt_llm.ipynb) |
 | Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) |
 | Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) |
 | Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) |

diff --git a/cookbooks/cosmos3/README.md b/cookbooks/cosmos3/README.md
@@ -8,6 +8,7 @@ backend you want to run and follow that one section.
 | --- | --- | --- |
 | [Cosmos Framework](#cosmos-framework) | Native PyTorch inference, launched with `torchrun` | Reasoner, Generator (Audiovisual, Action, **Transfer**) |
 | [Diffusers](#diffusers) | Direct generation with `Cosmos3OmniPipeline` | Generator (Audiovisual) |
+| [TensorRT-LLM](#tensorrt-llm) | OpenAI-compatible VisualGen server (image/video generation) | Generator (Audiovisual) |
 | [Transformers](#transformers) | Hugging Face Transformers inference | Reasoner |
 | [vLLM](#vllm) | OpenAI-compatible reasoning server (image/video understanding) | Reasoner |
 | [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action) | Generator (Audiovisual, Action) |
@@ -28,9 +29,10 @@ backend you want to run and follow that one section.
   export HF_TOKEN=<your_token>
   ```
 
-  To disable the guardrail, set `enable_safety_checker=False` (Diffusers), `guardrails: false`
-  (vLLM-Omni `extra_params`/`extra_args`), or
-  `--no-guardrails` (Cosmos Framework).
+  To disable the guardrail, set `enable_safety_checker=False` (Diffusers),
+  `TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1` in the server environment or
+  `use_guardrails: false` in `extra_params` (TensorRT-LLM), `guardrails: false` (vLLM-Omni
+  `extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).
 - For the Cosmos Framework backend: access to `git@github.com:NVIDIA/cosmos-framework.git`.
 - For the NIM backend: an NGC API key (used as `NGC_API_KEY`), which you can generate on [build.nvidia.com](https://build.nvidia.com/nvidia/cosmos3-nano-reasoner) or [NGC](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner), plus a one-time `docker login nvcr.io` (username `$oauthtoken`, password = your key). The HF login above is not needed for NIM.
 - Enough local disk for the venv/image, the uv cache, and the model cache. Nano
@@ -161,6 +163,93 @@ uv pip install --torch-backend=cu130 \
   transformers
 ```
 
+## TensorRT-LLM
+
+OpenAI-compatible **VisualGen** server for Generator audiovisual text-to-image,
+text-to-video, and image-to-video examples. Cosmos3 support was added in TensorRT-LLM PR
+[#14824](https://github.com/NVIDIA/TensorRT-LLM/pull/14824); use a
+TensorRT-LLM checkout or package that includes that change.
+
+Install TensorRT-LLM by following its upstream documentation.
+
+To build TensorRT-LLM from source, follow NVIDIA's
+[Build from Source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source.html)
+guide. Build from source when you need a Cosmos3 VisualGen change that has not
+yet reached your installed package or release image.
+
+```bash
+apt-get update && apt-get -y install git git-lfs
+git lfs install
+
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
+cd TensorRT-LLM
+git submodule update --init --recursive
+git lfs pull
+
+# Pick a devel tag from the upstream build-from-source guide or NGC.
+docker pull nvcr.io/nvidia/tensorrt-llm/devel:<tag>
+docker run --rm -it \
+  --ipc=host \
+  --ulimit memlock=-1 --ulimit stack=67108864 \
+  --gpus=all \
+  --volume "$PWD":"$PWD" \
+  --workdir "$PWD" \
+  nvcr.io/nvidia/tensorrt-llm/devel:<tag>
+
+# Inside the container:
+python3 scripts/build_wheel.py --use_ccache --skip_building_wheel --linking_install_binary
+pip install -e .
+```
+
+For Python-only changes, the upstream guide also documents
+`TRTLLM_USE_PRECOMPILED=1 pip install -e .` to reuse precompiled binaries while
+installing the checkout in editable mode.
+
+Then install the Cosmos3 guardrail package in the same environment, unless you
+disable guardrails (for example with `TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1`)
+before starting the server:
+
+```bash
+pip install cosmos_guardrail==0.3.0
+# If needed by your OpenCV stack:
+# pip uninstall opencv-python
+```
+
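+Point `TRTLLM_ROOT` at the TensorRT-LLM source checkout, which provides the shared
+VisualGen config YAMLs (the default below assumes `./TensorRT-LLM` relative to the
+current directory), and choose the server port: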
+
+```bash
+export TRTLLM_ROOT="${TRTLLM_ROOT:-$PWD/TensorRT-LLM}"
+export COSMOS3_TRTLLM_PORT="${COSMOS3_TRTLLM_PORT:-8000}"
+```
+
+**Cosmos3-Nano** (single GPU):
+
+```bash
+trtllm-serve nvidia/Cosmos3-Nano \
+  --visual_gen_args "$TRTLLM_ROOT/examples/visual_gen/configs/cosmos3-nano-1gpu.yaml" \
+  --port "$COSMOS3_TRTLLM_PORT"
+```
+
+**Cosmos3-Super** (four GPUs; CFG parallelism with Ulysses, plus parallel VAE):
+
+```bash
+torchrun --nproc_per_node=4 -m tensorrt_llm.commands.serve \
+  nvidia/Cosmos3-Super \
+  --visual_gen_args "$TRTLLM_ROOT/examples/visual_gen/configs/cosmos3-super-4gpu.yaml" \
+  --port "$COSMOS3_TRTLLM_PORT"
+```
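+
+To drive the server from Python instead (for example, from a notebook cell), the
+two launch commands above can be expressed as argument lists. This is a minimal
+sketch, assuming the same `TRTLLM_ROOT` and `COSMOS3_TRTLLM_PORT` defaults as the
+shell snippets; `serve_command`, `server_env`, and `launch` are hypothetical
+helpers, not part of TensorRT-LLM or this cookbook. The guardrail switch goes
+into the server's environment because it has to be set before the server starts.
+
+```python
+import os
+import shlex
+import subprocess
+from pathlib import Path
+
+GUARDRAIL_ENV = "TRTLLM_DISABLE_COSMOS3_GUARDRAILS"
+
+
+def serve_command(model: str, trtllm_root: str, port: str) -> list[str]:
+    """Hypothetical helper: the Nano or Super launch command above as an argv list."""
+    configs = Path(trtllm_root) / "examples" / "visual_gen" / "configs"
+    if model == "nvidia/Cosmos3-Nano":
+        # Single GPU: the trtllm-serve console script.
+        prefix = ["trtllm-serve"]
+        config = configs / "cosmos3-nano-1gpu.yaml"
+    elif model == "nvidia/Cosmos3-Super":
+        # Four GPUs: torchrun starts four workers running the serve module.
+        prefix = ["torchrun", "--nproc_per_node=4", "-m", "tensorrt_llm.commands.serve"]
+        config = configs / "cosmos3-super-4gpu.yaml"
+    else:
+        raise ValueError(f"no launch recipe for {model!r}")
+    return [*prefix, model, "--visual_gen_args", str(config), "--port", port]
+
+
+def server_env(disable_guardrails: bool = False) -> dict[str, str]:
+    """Hypothetical helper: a copy of os.environ with the guardrail switch set or cleared."""
+    env = dict(os.environ)
+    if disable_guardrails:
+        env[GUARDRAIL_ENV] = "1"
+    else:
+        env.pop(GUARDRAIL_ENV, None)
+    return env
+
+
+def launch(model: str, disable_guardrails: bool = False) -> subprocess.Popen:
+    """Hypothetical helper: start the server in the background and return the process."""
+    # Same defaults as the shell snippet (empty values fall back, like ${VAR:-default}).
+    trtllm_root = os.environ.get("TRTLLM_ROOT") or str(Path.cwd() / "TensorRT-LLM")
+    port = os.environ.get("COSMOS3_TRTLLM_PORT") or "8000"
+    cmd = serve_command(model, trtllm_root, port)
+    print("Launching:", shlex.join(cmd))
+    return subprocess.Popen(cmd, env=server_env(disable_guardrails))
+```
+
+For example, `proc = launch("nvidia/Cosmos3-Nano")` starts Nano in the background
+and `proc.terminate()` stops it. Passing an argument list rather than a shell
+string sidesteps quoting problems when `TRTLLM_ROOT` contains spaces.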
+
+The server exposes `/health`, `/v1/videos/generations`, `/v1/videos`, and
+`/v1/images/generations`. The audiovisual notebook uses the validated video
+generation endpoint, `/v1/videos/generations`, for text-to-image, text-to-video,
+and image-to-video. The TensorRT-LLM Cosmos3 pipeline treats text-to-image as a
+one-frame video, so the notebook sends it as `num_frames=1`, `seconds=1`, and
+`fps=8`: `seconds` and `fps` satisfy the video request schema, while
+`num_frames=1` keeps the output to a single frame.
+
+Requests pass Cosmos3 controls through `extra_params`, so use a TensorRT-LLM
+build that includes the Cosmos3 VisualGen API schema. The notebook also sets the
+request-level `max_sequence_length=2048` to accommodate longer structured JSON
+prompts.
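+
+Because `/health` is exposed, a client can wait for the server to finish loading
+before it sends generation requests. A minimal standard-library sketch, assuming
+`/health` answers HTTP 200 once the server is ready; `wait_for_health` is a
+hypothetical helper, and its 600-second default timeout is an arbitrary choice.
+
+```python
+import http.client
+import time
+import urllib.request
+
+
+def wait_for_health(base_url: str, timeout_s: float = 600.0, poll_interval_s: float = 5.0) -> bool:
+    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
+    deadline = time.monotonic() + timeout_s
+    while True:
+        try:
+            with urllib.request.urlopen(f"{base_url}/health", timeout=poll_interval_s) as resp:
+                if resp.status == 200:
+                    return True
+        except (OSError, http.client.HTTPException):
+            # Connection refused, HTTP error statuses (URLError/HTTPError), socket
+            # timeouts, and malformed responses all mean "not ready yet".
+            pass
+        if time.monotonic() >= deadline:
+            return False
+        time.sleep(poll_interval_s)
+```
+
+Call it as `wait_for_health("http://localhost:8000")`, substituting
+`$COSMOS3_TRTLLM_PORT` if you changed the port.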
+
 ## Transformers
 
 Local Python inference for the Cosmos3 Reasoner. This backend uses the

diff --git a/cookbooks/cosmos3/generator/audiovisual/README.md b/cookbooks/cosmos3/generator/audiovisual/README.md
@@ -1,7 +1,7 @@
 # Cosmos3 Generator Audiovisual Examples
 
 Generate images and video (with optional audio) from text or image prompts with
-`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Sample
+`Cosmos3-Nano` and `Cosmos3-Super`, across four inference backends. Sample
 prompts live under [`assets/`](./assets).
 
 Environment setup for every backend is centralized in the shared
@@ -12,8 +12,10 @@ to get one generation running per backend — run them from this folder.
 Generator requires the Guardrail. Request access to the gated
 [nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail)
 HF repository before running these examples. To disable the guardrail, set
-`enable_safety_checker=False` (Diffusers), `guardrails: false` (vLLM-Omni
-`extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).
+`enable_safety_checker=False` (Diffusers), `TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1`
+in the server environment or `use_guardrails: false` in `extra_params` (TensorRT-LLM),
+`guardrails: false` (vLLM-Omni `extra_params`/`extra_args`), or
+`--no-guardrails` (Cosmos Framework).
 
 ## Run with Cosmos Framework
 
@@ -184,3 +186,70 @@ the vLLM-Omni backend: it walks through text-to-image, text-to-video, and
 image-to-video requests with audio on or off. Server launch options (Nano and
 Super, tensor parallelism, layerwise offload, and CFG-parallel variants) live in
 the [shared environment setup guide](../../README.md#vllm-omni).
+
+## Run with TensorRT-LLM
+
+### Quickstart
+
+Set up the environment and start the server:
+[TensorRT-LLM setup](../../README.md#tensorrt-llm). The notebook targets the
+OpenAI-compatible VisualGen API served by `trtllm-serve`.
+
+Send a text-to-video request with the synchronous video API:
+
+```python
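+import json
+import os
+from pathlib import Path
+
+import requests
+
+# Server started as in the shared setup guide; defaults to port 8000.
+port = os.environ.get("COSMOS3_TRTLLM_PORT", "8000")
+url = f"http://localhost:{port}/v1/videos/generations"
+
+# Structured JSON prompts are sent as compact JSON strings.
+prompt = json.loads(Path("assets/prompts/text2video/robot_kitchen.json").read_text())
+negative = json.loads(
+    Path("assets/negative_prompts/text2video/neg_prompt.json").read_text()
+)
+
+num_frames, fps = 189, 24
+response = requests.post(
+    url,
+    json={
+        "prompt": json.dumps(prompt, ensure_ascii=True, separators=(",", ":")),
+        "negative_prompt": json.dumps(negative, ensure_ascii=True, separators=(",", ":")),
+        "size": "1280x720",
+        "seconds": num_frames / fps,  # 7.875 s of video at 24 fps
+        "fps": fps,
+        "num_frames": num_frames,  # explicit, so it is not derived from seconds * fps
+        "num_inference_steps": 35,
+        "guidance_scale": 6.0,
+        "max_sequence_length": 2048,  # room for long structured JSON prompts
+        "seed": 0,
+        # Cosmos3-specific controls (requires the Cosmos3 VisualGen API schema).
+        "extra_params": {
+            "use_resolution_template": False,
+            "use_duration_template": False,
+            "use_system_prompt": False,
+            "use_guardrails": True,
+        },
+    },
+)
+response.raise_for_status()
+
+# The response body is the encoded video; pick the file extension from its
+# content type.
+suffix = ".avi" if "x-msvideo" in response.headers.get("content-type", "") else ".mp4"
+out_path = Path(f"/tmp/cosmos3_t2v_trtllm{suffix}")
+out_path.write_bytes(response.content)
+print(f"Saved {out_path}")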
+```
+
+For image-to-video, post multipart form data to the same endpoint with the
+reference image under `input_reference`. TensorRT-LLM Cosmos3 audio/action
+generation is not covered by this backend section.
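+
+Multipart form fields are flat strings, so the nested JSON body from the
+text-to-video example has to be flattened. A minimal sketch, assuming the form
+accepts the same field names as the JSON request and takes `extra_params` as a
+JSON-encoded string; the text above names only the `input_reference` field, so
+check the notebook for the exact encoding. `i2v_form_fields` and `post_i2v` are
+hypothetical helpers, and the sampling values are copied from the text-to-video
+example.
+
+```python
+import json
+from pathlib import Path
+
+
+def i2v_form_fields(prompt: dict, negative: dict, num_frames: int = 189, fps: int = 24) -> dict[str, str]:
+    """Hypothetical helper: text fields for a multipart image-to-video request."""
+    compact = {"ensure_ascii": True, "separators": (",", ":")}
+    return {
+        "prompt": json.dumps(prompt, **compact),
+        "negative_prompt": json.dumps(negative, **compact),
+        "size": "1280x720",
+        "seconds": str(num_frames / fps),
+        "fps": str(fps),
+        "num_frames": str(num_frames),
+        "num_inference_steps": "35",
+        "guidance_scale": "6.0",
+        "max_sequence_length": "2048",
+        "seed": "0",
+        # ASSUMPTION: the nested Cosmos3 controls travel as one JSON-encoded field.
+        "extra_params": json.dumps(
+            {
+                "use_resolution_template": False,
+                "use_duration_template": False,
+                "use_system_prompt": False,
+                "use_guardrails": True,
+            }
+        ),
+    }
+
+
+def post_i2v(url: str, image_path: str, fields: dict[str, str]) -> bytes:
+    """Hypothetical helper: POST the form with the image attached as `input_reference`."""
+    import requests  # third-party; imported lazily so i2v_form_fields needs only stdlib
+
+    with open(image_path, "rb") as image:
+        response = requests.post(
+            url,
+            data=fields,
+            files={"input_reference": (Path(image_path).name, image)},
+        )
+    response.raise_for_status()
+    return response.content
+```
+
+For example, `post_i2v("http://localhost:8000/v1/videos/generations", "ref.png",
+i2v_form_fields(prompt, negative))`, where `ref.png` is a placeholder image path,
+returns the video bytes, which can be saved as in the text-to-video example.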
+
+For text-to-image, use the same video generation endpoint with `num_frames=1`,
+`seconds=1`, and `fps=8`; TensorRT-LLM Cosmos3 returns a one-frame video
+response for this path. `num_frames` is passed explicitly so the server does not
+derive an eight-frame clip from `seconds * fps`.
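+
+The text-to-image request differs from the text-to-video one only in its
+frame-related fields. A minimal sketch of the JSON body, assuming the remaining
+fields (size, sampling settings, and `extra_params`) carry over from the
+text-to-video example; `t2i_payload` is a hypothetical helper.
+
+```python
+import json
+
+
+def t2i_payload(prompt: dict, negative: dict, seed: int = 0) -> dict:
+    """Hypothetical helper: JSON body for text-to-image through the video endpoint."""
+    compact = {"ensure_ascii": True, "separators": (",", ":")}
+    return {
+        "prompt": json.dumps(prompt, **compact),
+        "negative_prompt": json.dumps(negative, **compact),
+        "size": "1280x720",
+        # seconds and fps only satisfy the video schema; num_frames=1 stops the
+        # server from deriving seconds * fps = 8 frames.
+        "seconds": 1,
+        "fps": 8,
+        "num_frames": 1,
+        "num_inference_steps": 35,
+        "guidance_scale": 6.0,
+        "max_sequence_length": 2048,
+        "seed": seed,
+        "extra_params": {
+            "use_resolution_template": False,
+            "use_duration_template": False,
+            "use_system_prompt": False,
+            "use_guardrails": True,
+        },
+    }
+```
+
+Post it with `requests.post(url, json=t2i_payload(prompt, negative))` as in the
+text-to-video example; the response is a one-frame video from which the image
+can be decoded.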
+
+The TensorRT-LLM notebook always sends model-specific `extra_params`, so use a
+TensorRT-LLM build that includes the Cosmos3 VisualGen API schema. The notebook
+sets request-level `max_sequence_length=2048` to accommodate longer structured
+JSON prompts.
+
+### Notebook walkthrough
+
+[`run_with_trt_llm.ipynb`](./run_with_trt_llm.ipynb) is the full tutorial for the
+TensorRT-LLM backend: it walks through text-to-image, text-to-video, and
+image-to-video requests against an already-running VisualGen server. Server
+launch options (Nano and Super, FP8 dynamic quantization, CFG parallelism,
+Ulysses, and parallel VAE) live in the
+[shared environment setup guide](../../README.md#tensorrt-llm).