The run.py CLI is the main entry point for running workflows against Tenstorrent inference servers. There are two independent ways to run the inference server:
- run.py --docker-server -- automates Docker setup, weight downloads, and container launch.
- Direct docker run -- run the container independently with --model and --tt-device. See the container interface documentation.
Client-side workflows (benchmarks, evals, reports) can run against any compatible model server, whether started by run.py or externally. LLMs are served with vLLM; other model types are served with tt-media-server.
flowchart TD
start{How do you want to run?}
start -->|"First time / automated"| runpyDocker["python3 run.py --model MODEL<br/>--workflow server --docker-server"]
start -->|"Direct Docker"| directDocker["docker run IMAGE<br/>--model MODEL --tt-device DEVICE"]
start -->|"Client workflows only<br/>(server already running)"| clientOnly["python3 run.py --model MODEL<br/>--workflow benchmarks"]
runpyDocker -->|"Handles weights, secrets,<br/>Docker volume, container launch"| server["Inference Server Running"]
directDocker -->|"Container self-resolves<br/>model spec, downloads weights"| server
clientOnly -->|"HTTP requests to<br/>localhost:8000"| server
--workflow options:
- benchmarks: Send random data prompts to the inference server, profile throughput and latency.
- evals: Send evaluation dataset prompts to the inference server, score output for accuracy.
- reports: Generate summary reports from benchmarks and evals output data.
- release: Run evals, benchmarks, spec_tests, tests, and reports in sequence for release certification.
- server: Start the inference server only (requires --docker-server).
- spec_tests (internal): Run server integration tests (device liveness, load tests) against the inference server.
- stress_tests (internal): Run sustained load tests to measure server stability and throughput.
- tests (internal): Run pytest-based vLLM API parameter tests against the inference server (model-dependent).
For example, start the vLLM server in a Docker container and run client-side benchmarks against it:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n150 --workflow benchmarks --docker-server
- Requirements
- run.py CLI Options
- Serving LLMs with vLLM
- Release Workflow
- Performance Benchmarks
- Accuracy Evaluations
- Reports
- Server Spec Tests
- API Parameter Tests
- Stress Tests
- Logs
- Additional Documentation
When you use run.py, the workflow scripts bootstrap the required Python virtual environments as needed using venv and uv (https://github.com/astral-sh/uv). With this design there are no typical Python install steps such as pip install.
Setting up the venvs takes some time, up to 15 minutes for runs that install many venvs over a low-bandwidth network connection. This only happens once. If you hit errors with a venv, file an issue. When applying a fix to a venv, remove that specific venv first to allow a clean installation.
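Removing one venv for a clean reinstall can be sketched as below. This is a minimal illustration, not part of run.py: the helper name remove_venv is hypothetical, and the directory layout (.workflow_venvs/.venv_benchmarks_vllm) is taken from the benchmark log line shown later in this document, so other workflows may use different names.

```python
import shutil
from pathlib import Path


def remove_venv(venv_root: Path, venv_name: str) -> bool:
    """Delete one workflow venv so the next run recreates it from scratch.

    Hypothetical helper. Returns True if a directory was removed and
    False if the venv did not exist.
    """
    venv_dir = venv_root / venv_name
    # Refuse names that would escape the venv root (for example "../x").
    if venv_dir.resolve().parent != venv_root.resolve():
        raise ValueError(f"{venv_name!r} is not a direct child of {venv_root}")
    if not venv_dir.is_dir():
        return False
    shutil.rmtree(venv_dir)
    return True
```

For example, remove_venv(Path(".workflow_venvs"), ".venv_benchmarks_vllm") deletes only the benchmarks venv and leaves the others in place.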
The system requirements for run.py and the Model Readiness Workflows are:
- Python 3.8+ (Python 3.8.10 is the default python3 on Ubuntu 20.04)
- python3-venv: likely already installed, if needed install via apt:
$ apt install python3-venv
- HF_TOKEN: for access to gated HF datasets (access your token from https://huggingface.co, go to Settings -> Access Tokens)
You will need to accept the terms for any specific gated datasets or model repositories required, e.g. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
| Option | Description |
|---|---|
| --model | Name of the model to run. Available choices are defined in MODEL_SPECS. |
| --workflow | Workflow to run: benchmarks, evals, release, reports, server, spec_tests (internal), stress_tests (internal), tests (internal). |
| Option | Default | Description |
|---|---|---|
| --tt-device | Auto-detected | Target device: n150, n300, p100, p150, t3k, galaxy. Auto-inferred from host hardware via tt-smi when omitted. The legacy alias --device is still accepted. |
| --impl | Model spec default | Implementation option (e.g. tt-transformers). Inferred from model and device when not specified. |
| --engine | Model spec default | Inference engine override: vllm, media, forge. |
| Option | Default | Description |
|---|---|---|
| --docker-server | false | Run inference server inside a Docker container. |
| --local-server | false | Run the vLLM inference server directly on the host. Requires --tt-metal-home and host-backed persistence for logs and TT caches. |
| -it, --interactive | false | Run Docker in interactive mode (sleep infinity). |
| --service-port | 8000 | Service port. Also reads from $SERVICE_PORT env var. |
| --no-auth | false | Disable vLLM API key authorization (skips JWT_SECRET requirement). |
| --print-docker-cmd | false | Print the generated Docker run command and exit without starting. |
When using --docker-server, these options control how model weights and caches are persisted. For --local-server, they select the weights source while TT caches and logs still use a host persistent volume root. Only one of --host-volume, --host-hf-cache, --host-weights-dir can be specified explicitly at a time.
| Option | Default | Description |
|---|---|---|
| --host-volume | None for Docker, repo persistent_volume/ for local when omitted | Host directory for persistent cache/log/tensor storage. |
| --host-hf-cache | None | Host HuggingFace cache directory to reuse for model weights. Bare --host-hf-cache defaults to HOST_HF_HOME, then HF_HOME, then ~/.cache/huggingface. |
| --host-weights-dir | None | Host directory with pre-downloaded model weights. |
| --image-user | 1000 | UID passed to docker run --user. Docker only; --local-server ignores this flag and runs as the invoking host user. Must match the UID the image was built with. Default release images use UID 1000. Only override when using a custom image built with a different UID. |
See Host Storage Options in the workflows reference for detailed descriptions of each strategy and file permission requirements.
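The two rules above (at most one explicit storage flag, and the fallback order for a bare --host-hf-cache) can be sketched as follows. This is an illustration under assumptions, not the run.py implementation: both function names are hypothetical, and treating an empty environment variable as unset is a choice made here.

```python
import os
from typing import List, Mapping, Optional


def resolve_hf_cache_default(env: Mapping[str, str]) -> str:
    """Fallback order for a bare --host-hf-cache, as described in the table."""
    for name in ("HOST_HF_HOME", "HF_HOME"):
        value = env.get(name)
        if value:  # assumption: an empty value counts as unset
            return value
    return os.path.expanduser("~/.cache/huggingface")


def check_storage_flags(
    host_volume: Optional[str],
    host_hf_cache: Optional[str],
    host_weights_dir: Optional[str],
) -> Optional[str]:
    """Return the single storage flag in use, or None if none was given."""
    candidates = (
        ("--host-volume", host_volume),
        ("--host-hf-cache", host_hf_cache),
        ("--host-weights-dir", host_weights_dir),
    )
    given: List[str] = [flag for flag, value in candidates if value is not None]
    if len(given) > 1:
        raise ValueError("only one of these may be set: " + ", ".join(given))
    return given[0] if given else None
```

Passing the environment in as a mapping (for example os.environ) keeps the fallback logic easy to check without touching the real environment.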
| Option | Description |
|---|---|
| --dev-mode | Enable developer mode: bind mounts source code into Docker container for live editing. |
| --override-docker-image | Override the Docker image used by --docker-server. |
| --device-id | Tenstorrent device IDs, comma-separated PCI indices (e.g. 0 or 0,1,2). |
| --override-tt-config | Override TT config as JSON string (e.g., '{"data_parallel": 16}'). |
| --vllm-override-args | Override vLLM arguments as JSON string (e.g., '{"max_model_len": 4096}'). |
| --disable-trace-capture | Skip trace capture requests for faster execution if traces are already captured. |
| --limit-samples-mode | Apply predefined reduced workload presets for evals and benchmarks: ci-nightly, ci-long, ci-commit, smoke-test. Use smoke-test for quick developer validation. |
| --workflow-args | Additional workflow arguments (e.g., 'param1=value1 param2=value2'). |
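The string formats accepted by --override-tt-config, --vllm-override-args, --device-id, and --workflow-args can be checked before a run with a sketch like the one below. The function names are hypothetical and the validation rules (a JSON object, integer indices, key=value tokens) are inferred from the examples in the table rather than taken from run.py.

```python
import json
from typing import Any, Dict, List


def parse_json_override(raw: str) -> Dict[str, Any]:
    """Parse a JSON-string override such as '{"max_model_len": 4096}'."""
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError("override must be a JSON object")
    return value


def parse_device_ids(raw: str) -> List[int]:
    """Parse comma-separated PCI indices such as '0' or '0,1,2'."""
    return [int(part) for part in raw.split(",")]


def parse_workflow_args(raw: str) -> Dict[str, str]:
    """Parse space-separated key=value pairs such as 'param1=value1 param2=value2'."""
    pairs: Dict[str, str] = {}
    for token in raw.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        pairs[key] = value
    return pairs
```

Validating these strings up front surfaces quoting mistakes on the shell command line before any server or workflow starts.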
You can serve a model with vLLM or another OpenAI API-compatible inference server however you like. The client-side workflows (evals, benchmarks, reports) only send HTTP requests to the inference server, so they work with any compatible server.
For example, if you run vLLM following the docs at https://github.com/tenstorrent/vllm/tree/dev/tt_metal during development, you can run the client-side workflows (evals, benchmarks, reports, or all of them with release) against that already running server.
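Because the client-side workflows only speak HTTP, a request to an already running server can be sketched with the standard library alone. This assumes the server exposes the OpenAI-compatible /v1/completions route; the helper name is hypothetical, and how an API key is derived from JWT_SECRET is not covered here, so api_key is simply passed through if you have one.

```python
import json
import urllib.request
from typing import Optional


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 16,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-compatible server."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the server expects a bearer token when auth is enabled.
        headers["Authorization"] = f"Bearer {api_key}"
    url = base_url.rstrip("/") + "/v1/completions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

With a server listening on the default port, the request can be sent with urllib.request.urlopen(build_completion_request("http://localhost:8000", "meta-llama/Llama-3.2-1B-Instruct", "Hello")).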
This section describes how to use run.py automation to also run the inference server.
Options:
- run.py --docker-server: automates Docker image pull, weight download, host setup, and container launch.
- run.py --local-server: launches the vLLM server directly from a host tt-metal checkout while reusing host filesystem storage.
- Direct docker run: use the container interface with --model and --tt-device (see Direct Docker Run).
- Custom: build tt-metal and vLLM from source.
Each combination of {model_name} and {tt-device} corresponds to a specific run configuration. Loading model weights to the device, starting the model, and compiling kernel binaries for all input sizes can take several minutes (e.g. ~5 minutes for 70B+ models).
The server workflow runs the vLLM inference server for the model as a detached Docker container and exits. Once the server is running, multiple client-side workloads (benchmarks, evals) can run against it without tearing the server down.
To run the inference server with Docker, use the --docker-server flag:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow server --docker-server
The --tt-device flag can be omitted -- run.py will auto-detect the device from host hardware via tt-smi.
Add --dev-mode to bind mount source code into the container for live editing:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow server --docker-server --dev-mode
Use --print-docker-cmd to inspect the generated Docker command without starting the server:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow server --docker-server --print-docker-cmd
On successful start, log output includes the container ID and log file path:
INFO: Created Docker container ID: 6b8c7038a44a
INFO: Access container logs via: docker logs -f 6b8c7038a44a
INFO: Docker logs are also streamed to log file: workflow_logs/docker_server/vllm_<timestamp>_<model>_<device>_server.log
INFO: Stop running container via: docker stop 6b8c7038a44a
The running container can be viewed with docker ps -a and stopped with docker stop <container-id>.
To run the vLLM server directly on the host, use --local-server and point --tt-metal-home at a built tt-metal checkout containing python_env/ and build/lib/:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow server \
--local-server --tt-metal-home /opt/tt-metal
If you omit all host storage flags, local server runs use REPO_ROOT/persistent_volume/ for logs, TT caches, and downloaded weights. --host-hf-cache reuses an existing Hugging Face cache for weights, and --host-weights-dir points at a pre-downloaded weights directory. In both of those modes, TT caches and logs still use the host volume path.
Because --local-server launches a host process, it uses the invoking host user's permissions. If that persistent_volume/ tree was previously created by Docker or another UID, fix ownership or permissions before retrying. --image-user does not apply here.
The inference server container can be used independently from run.py via a direct docker run command. The container entrypoint (run_vllm_api_server.py) accepts --model and --tt-device to resolve the model configuration from a bundled model spec catalog (model_spec.json).
docker run \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc host \
--publish 8000:8000 \
--device /dev/tenstorrent \
--mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
--volume volume_id_tt_transformers-Llama-3.2-1B-Instruct:/home/container_app_user/cache_root \
ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.9.0-84b4c53-222ee06 \
--model meta-llama/Llama-3.2-1B-Instruct \
--tt-device n300
See the full container interface documentation for all container CLI args, secrets, and persistent volume overrides.
Secrets can be provided via a .env file in the repository root or as environment variables:
# .env file (automatically loaded by run.py)
HF_TOKEN=hf_...
JWT_SECRET=my-secret-string
Or as environment variables:
export HF_TOKEN=hf_...
export JWT_SECRET=my-secret-string
- HF_TOKEN: Required for access to gated HF repositories (get your token from https://huggingface.co, go to Settings -> Access Tokens).
- JWT_SECRET: Your JWT token secret for vLLM server authorization. Use --no-auth to disable authorization.
If not set via .env or environment, run.py will prompt interactively on first run.
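To check ahead of time whether run.py would need to prompt, the lookup can be approximated as below. This is a sketch under assumptions: the parsing rules are a simplified reading of the .env example above (no quoting or export handling), both function names are hypothetical, and run.py's own loader may behave differently.

```python
from typing import Dict, List, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def missing_secrets(
    dotenv: Mapping[str, str],
    environ: Mapping[str, str],
    no_auth: bool = False,
) -> List[str]:
    """List secrets found in neither source; --no-auth drops JWT_SECRET."""
    required = ["HF_TOKEN"] if no_auth else ["HF_TOKEN", "JWT_SECRET"]
    return [
        name
        for name in required
        if not dotenv.get(name) and not environ.get(name)
    ]
```

An empty result from missing_secrets(parse_dotenv(open(".env").read()), os.environ) suggests a non-interactive run is possible.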
By default (Docker named volume mode), model weights are downloaded inside the container on first start via ensure_weights_available(). No host-side download is needed.
When using --host-volume or --host-hf-cache with --docker-server, weights are downloaded on the host by setup_host() before container launch. When using --host-weights-dir, weights are assumed to already exist at the specified path.
For --local-server, setup_host() resolves the host paths and creates the cache root, but the local vLLM process handles downloads itself unless MODEL_WEIGHTS_DIR is pointed at an existing --host-hf-cache snapshot or --host-weights-dir.
Permissions note for Docker modes: The container runs as a non-root user with no root-level entrypoint. The runtime UID is baked into the image (UID 1000 for default release images). When using --host-volume, the host directory must be writable by that UID (e.g. sudo chown 1000 <path>). When using --host-hf-cache or --host-weights-dir, the host path is mounted read-only and only needs read access; TT Metal caches are stored in a separate Docker named volume. The default Docker named volume strategy requires no host permission setup. For --local-server, the host process instead uses the current host user's permissions.
See Host Storage Options for details on each strategy.
Each model implementation is mapped to a pre-built release Docker Image that contains pre-built tt-metal and vLLM source builds. These Docker images are tested with the release workflow to ensure correctness for each model supported.
The Docker image for each model is listed in the per-model model support pages, starting here: LLM Models
For the same model-device combination, the release workflow runs in sequence:
- evals workflow
- benchmarks workflow
- spec_tests workflow
- tests workflow (only for models with entries in server_tests/test_config.py)
- reports workflow
This is a convenience: a single run on device executes all workflows required to certify that a model implementation on Tenstorrent hardware works correctly and is ready for release.
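The ordering above, including the conditional tests step, can be written down as a small function. This is an illustrative sketch only: release_sequence is a hypothetical name, and models_with_tests stands in for the model names that have entries in server_tests/test_config.py.

```python
from typing import Collection, List


def release_sequence(model_name: str, models_with_tests: Collection[str]) -> List[str]:
    """Return the workflows the release workflow runs, in order."""
    steps = ["evals", "benchmarks", "spec_tests"]
    # The tests workflow only applies to models that have test entries.
    if model_name in models_with_tests:
        steps.append("tests")
    steps.append("reports")
    return steps
```

The reports step always comes last because it summarizes the output of the earlier workflows.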
The benchmarks workflow sends random data prompts to the inference server and profiles throughput and latency.
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow benchmarks
For a quick development smoke test, add --limit-samples-mode smoke-test:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow benchmarks --limit-samples-mode smoke-test
In smoke-test mode, benchmarks selects a reduced single benchmark target and ignores --concurrency-sweeps.
The benchmarks workflow follows this sequence (visible in the runtime logs streamed to workflow_logs/run_logs/):
- Set up workflow virtual environments: run.py bootstraps dedicated venvs for benchmark scripts.
- Start workflow: run_benchmarks.py is launched with the runtime model spec JSON.
- Wait for inference server: Polls the /health endpoint until the vLLM server is ready.
- Trace capture: Sends initial requests at each configured input length to warm up the model and compile traces.
- Run benchmarks: Executes a sweep of configurations varying input/output sequence length and concurrency, saving results as JSON.
INFO: Running benchmark Llama-3.2-1B-Instruct: 1/18
INFO: Running command: .workflow_venvs/.venv_benchmarks_vllm/bin/serve --backend vllm ...
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████| 8/8 [00:17<00:00, 2.18s/it]
============ Serving Benchmark Result ============
...
==================================================
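The "wait for inference server" step in the sequence above amounts to polling /health until it answers or a deadline passes. The sketch below shows one way to do that; it is not the workflow's own code. Function names, the 5 second request timeout, and the default polling interval are assumptions, and the probe is injected so the loop can be exercised without a server.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(url: str) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_for_server(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be wait_for_server(lambda: http_health_probe("http://localhost:8000/health"), timeout_s=600). Large models can take several minutes to load, so a generous timeout is appropriate.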
Benchmark output files are saved to workflow_logs/benchmarks_output/, for example:
benchmark_Llama-3.2-1B-Instruct_n300_<timestamp>_isl-128_osl-128_maxcon-1_n-8.json
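The sweep parameters are encoded in the file name, so they can be recovered when post-processing results. The sketch below is a hypothetical helper, not part of the repository; reading isl, osl, maxcon, and n as input length, output length, maximum concurrency, and number of prompts is an interpretation based on the sweep description above.

```python
import re
from typing import Dict, Optional

# Matches the trailing "_isl-128_osl-128_maxcon-1_n-8.json" part of the name.
_PARAMS = re.compile(
    r"_isl-(?P<isl>\d+)_osl-(?P<osl>\d+)_maxcon-(?P<maxcon>\d+)_n-(?P<n>\d+)\.json$"
)


def parse_benchmark_params(filename: str) -> Optional[Dict[str, int]]:
    """Extract the sweep parameters from a benchmark output file name."""
    match = _PARAMS.search(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

Anchoring the pattern at the end of the name avoids having to parse the model name, which itself contains hyphens and underscores.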
See benchmarking docs for more detail on code.
The evals workflow follows the same pattern as the benchmarks workflow: it sets up its own venv, waits for the inference server to be ready, then sends HTTP requests to it. Each evaluation task uses a dedicated venv, which makes it possible to use multiple eval repos and different versions of, e.g., https://github.com/EleutherAI/lm-evaluation-harness.
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow evals
For a quick development smoke test, add --limit-samples-mode smoke-test:
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow evals --limit-samples-mode smoke-test
In smoke-test mode, evals runs only the first configured eval task and limits it to 3 samples.
Outputs are stored in: workflow_logs/evals_output/eval_Llama-3.2-1B-Instruct_n300/meta-llama__Llama-3.2-1B-Instruct
See evals docs for more detail on code.
The reports workflow generates summary log files from the raw data collected by benchmarks and evals workflows.
python3 run.py --model Llama-3.2-1B-Instruct --tt-device n300 --workflow reports
This report summarizes metrics and uses defined tolerance thresholds to determine if models pass or fail validation.
See Logs section below for example format of the report files generated.
Internal workflow. spec_tests is used for release validation and CI. It requires a running inference server.
The spec_tests workflow runs server integration tests against the inference server. Tests are defined in server_tests/server_tests_config.json and matched by model name and device. Test classes (e.g. DeviceLivenessTest, ImageGenerationLoadTest) are loaded dynamically and executed via server_tests/run_spec_tests.py.
python3 run.py --model Llama-3.1-8B-Instruct --tt-device n150 --workflow spec_tests
Each test case entry in server_tests_config.json specifies:
- name / module: the test class and its Python module path.
- enabled: set to false to skip a test case.
- test_config: execution settings: test_timeout, retry_attempts, retry_delay, break_on_failure, mock_mode.
- targets: test-specific numerical thresholds. Common keys include image_generation_time, audio_transcription_time, num_concurrent_requests (client-side concurrency for *LoadTest), and num_of_devices (physical chip count, used by DeviceLivenessTest / DeviceStabilityTest). num_of_devices is also accepted as a deprecated fallback for num_concurrent_requests inside load tests.
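The fields listed above, and the deprecated fallback for load-test concurrency, can be illustrated as follows. The entry is a made-up example: the module path and every numeric value are hypothetical, and only the key names come from the description above. The helper names are likewise hypothetical.

```python
from typing import Any, Dict, List, Mapping, Optional

# Hypothetical test case entry; values are placeholders, not real settings.
EXAMPLE_CASE: Dict[str, Any] = {
    "name": "ImageGenerationLoadTest",
    "module": "server_tests.example_module",  # hypothetical module path
    "enabled": True,
    "test_config": {"test_timeout": 600, "retry_attempts": 1, "mock_mode": False},
    "targets": {"num_of_devices": 4},
}


def enabled_cases(cases: List[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep cases unless 'enabled' is explicitly false (assumed default: on)."""
    return [case for case in cases if case.get("enabled", True)]


def resolve_concurrency(targets: Mapping[str, Any]) -> Optional[int]:
    """Prefer num_concurrent_requests, then the deprecated num_of_devices."""
    if "num_concurrent_requests" in targets:
        return int(targets["num_concurrent_requests"])
    if "num_of_devices" in targets:
        return int(targets["num_of_devices"])
    return None
```

New configuration entries should set num_concurrent_requests directly, since the num_of_devices fallback is deprecated for load tests.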
Output is written as JSON and Markdown reports to workflow_logs/spec_tests_output/.
Internal workflow. tests is used for release validation and CI. It requires a running inference server. Not all models have test entries defined.
The tests workflow runs pytest-based tests that exercise vLLM API sampling parameters (n, max_tokens, stop, seed, logprobs, temperature, top_k, top_p, and penalty parameters). Model support is defined in server_tests/test_config.py (TEST_CONFIGS); models not listed there will skip this workflow.
python3 run.py --model Llama-3.1-8B-Instruct --tt-device n150 --workflow tests
The run script (server_tests/run_tests.py) iterates over TestTask entries for the model, invoking pytest with -s -v on server_tests/test_cases/test_vllm_server_parameters.py.
Output is written to workflow_logs/tests_output/.
Internal workflow. stress_tests is used for release validation and CI. It requires a running inference server.
The stress_tests workflow runs sustained load tests against the inference server to measure server stability and throughput over time. The run script is stress_tests/run_stress_tests.py.
python3 run.py --model Llama-3.1-8B-Instruct --tt-device n150 --workflow stress_tests
Output is written to workflow_logs/stress_tests_output/.
Log types:
- run_logs: the stdout and stderr output from run.py, stored for debugging.
- runtime_model_specs: the serialized ModelSpec + RuntimeConfig JSON used for each run.
- docker_server: the logs from the Docker container running the vLLM inference server.
- benchmarks_output: the raw data output from the benchmarks workflow.
- evals_output: the raw data output from the evals workflow.
- reports_output: for each workflow, the markdown (.md) summary output and /data summary data. The release workflow output has a summary report of both benchmarks and evals results, used to determine if a model passes release validation. An example report: tenstorrent#164.
- spec_tests_output: JSON and Markdown test reports from the spec_tests workflow.
- tests_output: pytest result output from the tests workflow.
- stress_tests_output: result data from the stress_tests workflow.
In this example for:
- model_name := Llama-3.2-1B-Instruct
- tt-device := n300
The logs have the following structure:
./workflow_logs
├── benchmarks_output
│ ├── benchmark_Llama-3.2-1B-Instruct_n300_2025-03-25_04-23-40_isl-128_osl-128_maxcon-1_n-8.json
│ ├── ...
│ └── benchmark_Llama-3.2-1B-Instruct_n300_2025-03-25_04-48-11_isl-16000_osl-64_maxcon-32_n-256.json
├── docker_server
│ └── vllm_2025-03-25_20-58-29_Llama-3.2-1B-Instruct_n300_benchmarks.log
├── evals_output
│ └── eval_Llama-3.2-1B-Instruct_n300/meta-llama__Llama-3.2-1B-Instruct
│ ├── results_2025-03-25T04-57-53.064778.json
│ └── samples_meta_gpqa_2025-03-25T04-57-53.064778.jsonl
├── reports_output
│ ├── benchmarks
│ │ ├── data
│ │ │ └── benchmark_stats_Llama-3.2-1B-Instruct_n300.csv
│ │ └── benchmark_display_Llama-3.2-1B-Instruct_n300.md
│ ├── evals
│ │ ├── data
│ │ │ └── eval_data_Llama-3.2-1B-Instruct_n300.json
│ │ ├── summary_Llama-3.2-1B-Instruct_n300.md
│ └── release
│ ├── data
│ │ └── report_data_Llama-3.2-1B-Instruct_n300.json
│ └── report_Llama-3.2-1B-Instruct_n300.md
├── run_logs
│ └── run_2025-03-26_02-09-13_Llama-3.2-1B-Instruct_n300_evals.log
├── spec_tests_output
│ ├── spec_tests_report_Llama-3.2-1B-Instruct_n300.json
│ └── spec_tests_report_Llama-3.2-1B-Instruct_n300.md
├── stress_tests_output
│ └── stress_tests_Llama-3.2-1B-Instruct_n300_<timestamp>.json
└── tests_output
└── parameter_report_Llama-3.2-1B-Instruct_n300_<timestamp>.json
- Workflows Reference -- CLI reference, architecture diagrams, model config
- Container Interface -- Direct Docker run, container CLI args
- Development
- Benchmarking
- Evals
- Tests