7 changes: 5 additions & 2 deletions README.md
@@ -209,8 +209,10 @@ Set `HF_HOME` if you want to use a shared cache or a disk with more space.
Generator requires the Guardrail. Request access to the gated
[nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail)
HF repository. To disable the guardrail, set `enable_safety_checker=False` (Diffusers),
`guardrails: false` (vLLM-Omni `extra_params`/`extra_args`), or
`--no-guardrails` (Cosmos Framework).
`TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1` or `use_guardrails: false` through
`extra_params` (TensorRT-LLM), `guardrails: false` (vLLM-Omni
`extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).

#### Generator with Diffusers

<details>
@@ -745,6 +747,7 @@ We are building examples that show Cosmos 3 capabilities end to end, including w
| Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) |
| Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) |
| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
| Generator (audiovisual) with TensorRT-LLM | Generator | Text-to-image, text-to-video, and image-to-video against an OpenAI-compatible TensorRT-LLM VisualGen server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_trt_llm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_trt_llm.ipynb) |
| Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) |
| Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) |
| Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) |
95 changes: 92 additions & 3 deletions cookbooks/cosmos3/README.md
@@ -8,6 +8,7 @@ backend you want to run and follow that one section.
| --- | --- | --- |
| [Cosmos Framework](#cosmos-framework) | Native PyTorch inference, launched with `torchrun` | Reasoner, Generator (Audiovisual, Action, **Transfer**) |
| [Diffusers](#diffusers) | Direct generation with `Cosmos3OmniPipeline` | Generator (Audiovisual) |
| [TensorRT-LLM](#tensorrt-llm) | OpenAI-compatible VisualGen server (image/video generation) | Generator (Audiovisual) |
| [Transformers](#transformers) | Hugging Face Transformers inference | Reasoner |
| [vLLM](#vllm) | OpenAI-compatible reasoning server (image/video understanding) | Reasoner |
| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action) | Generator (Audiovisual, Action) |
@@ -28,9 +29,10 @@ backend you want to run and follow that one section.
export HF_TOKEN=<your_token>
```

To disable the guardrail, set `enable_safety_checker=False` (Diffusers), `guardrails: false`
(vLLM-Omni `extra_params`/`extra_args`), or
`--no-guardrails` (Cosmos Framework).
To disable the guardrail, set `enable_safety_checker=False` (Diffusers),
`TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1` or `use_guardrails: false` through
`extra_params` (TensorRT-LLM), `guardrails: false` (vLLM-Omni
`extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).
- For the Cosmos Framework backend: access to `git@github.com:NVIDIA/cosmos-framework.git`.
- For the NIM backend: an NGC API key (used as `NGC_API_KEY`), which you can generate on [build.nvidia.com](https://build.nvidia.com/nvidia/cosmos3-nano-reasoner) or [NGC](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner), plus a one-time `docker login nvcr.io` (username `$oauthtoken`, password = your key). The HF login above is not needed for NIM.
- Enough local disk for the venv/image, the uv cache, and the model cache. Nano
@@ -161,6 +163,93 @@ uv pip install --torch-backend=cu130 \
transformers
```

## TensorRT-LLM

OpenAI-compatible **VisualGen** server for Generator audiovisual text-to-image,
text-to-video, and image-to-video examples. Cosmos3 support was added in TensorRT-LLM PR
[#14824](https://github.com/NVIDIA/TensorRT-LLM/pull/14824); use a
TensorRT-LLM checkout or package that includes that change.

Install TensorRT-LLM following its upstream documentation.

To build TensorRT-LLM from source, follow NVIDIA's
[Build from Source](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source.html)
guide. This is the right path when you need a checkout that contains a recent
Cosmos3 VisualGen change before it is available in your installed package or
release image.

```bash
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

# Pick a devel tag from the upstream build-from-source guide or NGC.
docker pull nvcr.io/nvidia/tensorrt-llm/devel:<tag>
docker run --rm -it \
  --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all \
  --volume "$PWD":"$PWD" \
  --workdir "$PWD" \
  nvcr.io/nvidia/tensorrt-llm/devel:<tag>

# Inside the container:
python3 scripts/build_wheel.py --use_ccache --skip_building_wheel --linking_install_binary
pip install -e .
```

For Python-only changes, the upstream guide also documents
`TRTLLM_USE_PRECOMPILED=1 pip install -e .` to reuse precompiled binaries while
installing the checkout in editable mode.

Then install the Cosmos3 guardrail package in the same environment unless you
explicitly disable guardrails before starting the server:

```bash
pip install cosmos_guardrail==0.3.0
# If needed by your OpenCV stack:
# pip uninstall opencv-python
```

Set the TensorRT-LLM source root, which is used to locate the shared VisualGen
config YAMLs, and the port the server listens on:

```bash
export TRTLLM_ROOT="${TRTLLM_ROOT:-$PWD/TensorRT-LLM}"
export COSMOS3_TRTLLM_PORT="${COSMOS3_TRTLLM_PORT:-8000}"
```

**Cosmos3-Nano** (single GPU):

```bash
trtllm-serve nvidia/Cosmos3-Nano \
  --visual_gen_args "$TRTLLM_ROOT/examples/visual_gen/configs/cosmos3-nano-1gpu.yaml" \
  --port "$COSMOS3_TRTLLM_PORT"
```

**Cosmos3-Super** (four GPUs; CFG parallelism with Ulysses, plus parallel VAE):

```bash
torchrun --nproc_per_node=4 -m tensorrt_llm.commands.serve \
  nvidia/Cosmos3-Super \
  --visual_gen_args "$TRTLLM_ROOT/examples/visual_gen/configs/cosmos3-super-4gpu.yaml" \
  --port "$COSMOS3_TRTLLM_PORT"
```

The server exposes `/health`, `/v1/videos/generations`, `/v1/videos`, and
`/v1/images/generations`. The audiovisual notebook uses the validated video
generation endpoint (`/v1/videos/generations`) for text-to-image, text-to-video,
and image-to-video. Cosmos3 text-to-image is sent as a one-frame video request,
matching the TensorRT-LLM Cosmos3 pipeline: the notebook sets `num_frames=1`,
`seconds=1`, and `fps=8`, which satisfies the video request schema while
producing a single frame. Requests pass Cosmos3 controls through `extra_params`,
so use a TensorRT-LLM build that includes the Cosmos3 VisualGen API schema.
The notebook also sets request-level `max_sequence_length=2048` to accommodate
longer structured JSON prompts.
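
Model loading can take a while, so a client may want to wait on `/health` before sending its first request. The following is a minimal sketch, not part of the notebook: the helper name `wait_for_health`, its defaults, and the assumption that a ready server answers `/health` with HTTP 200 are all illustrative.

```python
import time
import urllib.request


def wait_for_health(base_url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses.

    `probe` is an optional callable(url) -> bool, which makes the loop easy
    to exercise without a running server.
    """

    def default_probe(url):
        # Assumption: a ready server answers /health with HTTP 200.
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # connection refused, timeouts, HTTP errors
            return False

    probe = probe or default_probe
    health_url = f"{base_url.rstrip('/')}/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the environment variables above, a call could look like `wait_for_health(f"http://localhost:{os.environ.get('COSMOS3_TRTLLM_PORT', '8000')}")` after `import os`.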

## Transformers

Local Python inference for the Cosmos3 Reasoner. This backend uses the
75 changes: 72 additions & 3 deletions cookbooks/cosmos3/generator/audiovisual/README.md
@@ -1,7 +1,7 @@
# Cosmos3 Generator Audiovisual Examples

Generate images and video (with optional audio) from text or image prompts with
`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Sample
`Cosmos3-Nano` and `Cosmos3-Super`, across four inference backends. Sample
prompts live under [`assets/`](./assets).

Environment setup for every backend is centralized in the shared
@@ -12,8 +12,10 @@ to get one generation running per backend; run them from this folder.
Generator requires the Guardrail. Request access to the gated
[nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail)
HF repository before running these examples. To disable the guardrail, set
`enable_safety_checker=False` (Diffusers), `guardrails: false` (vLLM-Omni
`extra_params`/`extra_args`), or `--no-guardrails` (Cosmos Framework).
`enable_safety_checker=False` (Diffusers), `TRTLLM_DISABLE_COSMOS3_GUARDRAILS=1`
or `use_guardrails: false` through `extra_params` (TensorRT-LLM),
`guardrails: false` (vLLM-Omni `extra_params`/`extra_args`), or
`--no-guardrails` (Cosmos Framework).

## Run with Cosmos Framework

@@ -184,3 +186,70 @@ the vLLM-Omni backend: it walks through text-to-image, text-to-video, and
image-to-video requests with audio on or off. Server launch options (Nano and
Super, tensor parallelism, layerwise offload, and CFG-parallel variants) live in
the [shared environment setup guide](../../README.md#vllm-omni).

## Run with TensorRT-LLM

### Quickstart

Set up the environment and start the server:
[TensorRT-LLM setup](../../README.md#tensorrt-llm). The notebook targets the
OpenAI-compatible VisualGen API served by `trtllm-serve`.

Send a text-to-video request with the synchronous video API:

```python
import json
from pathlib import Path

import requests

# Structured JSON prompts ship with the cookbook; run this from the example folder.
prompt = json.loads(Path("assets/prompts/text2video/robot_kitchen.json").read_text())
negative = json.loads(Path("assets/negative_prompts/text2video/neg_prompt.json").read_text())

response = requests.post(
    "http://localhost:8000/v1/videos/generations",
    json={
        # Prompts are sent as compact, ASCII-only JSON strings.
        "prompt": json.dumps(prompt, ensure_ascii=True, separators=(",", ":")),
        "negative_prompt": json.dumps(negative, ensure_ascii=True, separators=(",", ":")),
        "size": "1280x720",
        "seconds": 189 / 24,
        "fps": 24,
        "num_frames": 189,
        "num_inference_steps": 35,
        "guidance_scale": 6.0,
        "max_sequence_length": 2048,
        "seed": 0,
        # Cosmos3-specific controls travel in `extra_params`.
        "extra_params": {
            "use_resolution_template": False,
            "use_duration_template": False,
            "use_system_prompt": False,
            "use_guardrails": True,
        },
    },
)
response.raise_for_status()
# Choose the file extension from the returned content type (AVI or MP4).
suffix = ".avi" if "x-msvideo" in response.headers.get("content-type", "") else ".mp4"
Path(f"/tmp/cosmos3_t2v_trtllm{suffix}").write_bytes(response.content)
```

For image-to-video, post multipart form data to the same endpoint with the
reference image under `input_reference`. TensorRT-LLM Cosmos3 audio/action
generation is not covered by this backend section.
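
A rough sketch of the multipart variant, assuming the form fields mirror the JSON field names from the text-to-video request and that scalar values are sent as strings. Only the `input_reference` field name comes from this guide; the helper names `build_i2v_form` and `post_i2v` are hypothetical, and the notebook remains the reference for the exact fields the server accepts.

```python
import mimetypes
from pathlib import Path


def build_i2v_form(prompt, image_path, **fields):
    """Return `(data, files)` for a multipart image-to-video request."""
    image = Path(image_path)
    mime = mimetypes.guess_type(image.name)[0] or "application/octet-stream"
    # Multipart form values travel as text, so scalars are stringified here.
    data = {"prompt": prompt, **{key: str(value) for key, value in fields.items()}}
    # The reference image goes under the `input_reference` form field.
    files = {"input_reference": (image.name, image.read_bytes(), mime)}
    return data, files


def post_i2v(base_url, data, files):
    """Send the form to the video generation endpoint and return the response."""
    import requests  # imported lazily so the builder above needs only the stdlib

    response = requests.post(f"{base_url}/v1/videos/generations", data=data, files=files)
    response.raise_for_status()
    return response
```

Usage would be along the lines of `post_i2v("http://localhost:8000", *build_i2v_form(prompt_text, "reference.png", num_frames=189, fps=24))`, where `prompt_text` and `reference.png` are placeholders for your own prompt string and image.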

For text-to-image, use the same video generation endpoint with `num_frames=1`,
`seconds=1`, and `fps=8`; TensorRT-LLM Cosmos3 returns a one-frame video
response for this path. `num_frames` is passed explicitly so the server does not
derive an eight-frame clip from `seconds * fps`.
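
The arithmetic behind that note can be sketched as follows. The helper names are hypothetical, the `size` default is borrowed from the text-to-video example, and `derived_frame_count` only illustrates the `seconds * fps` product; it does not claim to reproduce how TensorRT-LLM rounds internally.

```python
def derived_frame_count(seconds, fps):
    """Frame count implied by duration alone (seconds * fps)."""
    return round(seconds * fps)


def text_to_image_payload(prompt, size="1280x720", seed=0):
    """Build a one-frame video request that stands in for text-to-image."""
    return {
        "prompt": prompt,
        "size": size,  # assumed here; use a size your model supports
        "seconds": 1,
        "fps": 8,
        "num_frames": 1,  # explicit, so the clip is not 1 * 8 = 8 frames long
        "seed": seed,
    }


payload = text_to_image_payload("a red bicycle leaning on a brick wall")
```

Without the explicit `num_frames`, `derived_frame_count(payload["seconds"], payload["fps"])` gives 8, which is the eight-frame clip the note warns about.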

The TRT-LLM notebook always sends model-specific `extra_params`, so use a
TensorRT-LLM release with the Cosmos3 VisualGen API schema. The notebook sets
request-level `max_sequence_length=2048` for longer structured JSON prompts.
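
One way a client might manage these controls is to keep a default set and apply per-request overrides. This is a minimal sketch: the keys are the ones shown in the text-to-video request in this section, while `DEFAULT_EXTRA_PARAMS` and `with_extra_params` are hypothetical names that belong to neither TensorRT-LLM nor the notebook.

```python
# Defaults mirror the `extra_params` of the text-to-video request in this section.
DEFAULT_EXTRA_PARAMS = {
    "use_resolution_template": False,
    "use_duration_template": False,
    "use_system_prompt": False,
    "use_guardrails": True,
}


def with_extra_params(payload, **overrides):
    """Return a copy of `payload` whose `extra_params` include the overrides."""
    merged = dict(payload)
    merged["extra_params"] = {
        **DEFAULT_EXTRA_PARAMS,
        **payload.get("extra_params", {}),
        **overrides,
    }
    return merged


# Example: disable the guardrail for a single request.
request_body = with_extra_params({"prompt": "a robot in a kitchen"}, use_guardrails=False)
```

Returning a copy leaves the caller's payload untouched, so one base request can be reused across several variants.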

### Notebook walkthrough

[`run_with_trt_llm.ipynb`](./run_with_trt_llm.ipynb) is the full tutorial for the
TensorRT-LLM backend: it walks through text-to-image, text-to-video, and
image-to-video requests against an already-running VisualGen server. Server
launch options (Nano and Super, FP8 dynamic quantization, CFG parallelism,
Ulysses, and parallel VAE) live in the
[shared environment setup guide](../../README.md#tensorrt-llm).