# ⚡ Lightning-Fast Text-to-Speech Inference & Service Framework
LightTTS is a lightweight, high-performance text-to-speech (TTS) inference and service framework written in Python. Built on the CosyVoice architecture and the LightLLM framework, it supports the CosyVoice2 and CosyVoice3 models and is optimized for fast, scalable, and service-ready TTS deployment.
- 🚀 Optimized LLM Inference: The language model part of the TTS pipeline is accelerated using techniques from lightllm and supports high-throughput batch inference
- 🧩 Shared Memory Timbre Manager with LRU: Manages speaker/timbre embeddings in shared memory for fast access and minimal recomputation (a minimal sketch of the LRU idea follows this list)
- 🧱 Modular Architecture (Encode–LLM–Decode): Refactored from LightLLM into three decoupled modules—Encoder, LLM, and Decoder—each running as separate processes for efficient task parallelism and scalability.
- 🌐 Service Ready and Easy Integration: Comes with an HTTP API for fast deployment and simple APIs for integration into other Python or web projects
- 🔄 Bi-streaming Mode via WebSocket: Supports interactive bi-directional streaming using WebSocket for low-latency, real-time TTS communication
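The timbre manager is essentially a bounded LRU cache keyed by speaker. Below is a minimal sketch of that eviction idea using `collections.OrderedDict`; it is our illustration only and models neither LightTTS's actual shared-memory layout nor its API:

```python
from collections import OrderedDict


class TimbreCacheSketch:
    """Illustrative LRU cache for speaker/timbre embeddings.

    LightTTS keeps embeddings in shared memory so its worker processes
    can reuse them without recomputation; this sketch models only the
    LRU eviction policy, not the shared-memory storage.
    """

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self._cache: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, speaker_id: str):
        if speaker_id not in self._cache:
            return None  # cache miss: caller recomputes the embedding
        self._cache.move_to_end(speaker_id)  # mark as most recently used
        return self._cache[speaker_id]

    def put(self, speaker_id: str, embedding: bytes) -> None:
        self._cache[speaker_id] = embedding
        self._cache.move_to_end(speaker_id)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
```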
## (Option 1, Recommended) Run with Docker
```bash
# The easiest way to install LightTTS is by using the official image.
# You can directly pull and run the official image
docker pull lighttts/light-tts:latest

# Or you can manually build the image
docker build -t light-tts:latest .

# Run the image
docker run -it --gpus all -p 8080:8080 --shm-size 4g -v your_local_path:/data/ light-tts:latest /bin/bash
```
## (Option 2) Install from Source
```bash
# Clone the repo
git clone --recursive https://github.com/ModelTC/LightTTS.git
cd LightTTS

# If you failed to clone the submodule due to network failures, please run the following command until success
# cd LightTTS
# git submodule update --init --recursive

# (Recommended) Create a new conda environment
conda create -n light-tts python=3.10
conda activate light-tts

# Install dependencies (we use the latest torch==2.9.1, but other versions are also compatible)
pip install -r requirements.txt

# If you encounter sox compatibility issues
# Ubuntu
sudo apt-get install sox libsox-dev
# CentOS
sudo yum install sox sox-devel
```
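After installation, a quick sanity check (ours, not a LightTTS command) confirms that PyTorch is importable and can see your GPU:

```python
# Illustrative sanity check: verify the torch install and GPU visibility.
import torch

print(torch.__version__)          # e.g. 2.9.1
print(torch.cuda.is_available())  # should print True on a GPU machine
```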
We now support CosyVoice2 and CosyVoice3 models.
Download the pretrained models with the ModelScope SDK:

```python
# ModelScope SDK model download
from modelscope import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B-2512')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
For overseas users, download with the HuggingFace SDK:

```python
# HuggingFace SDK model download
from huggingface_hub import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B-2512')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

For better text normalization performance, you can optionally install the ttsfrd package and unzip its resources. This step is not required; if skipped, the system falls back to WeTextProcessing by default. (The ttsfrd package is already installed in the Docker image, so you can skip this step when using Docker.)
```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
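If you want to confirm which text-normalization backend will actually be used, a quick import check works, assuming (our assumption) the fallback is decided by whether `ttsfrd` is importable:

```python
# Illustrative check of which text-normalization backend is available.
try:
    import ttsfrd  # noqa: F401
    print("ttsfrd is installed and will be used for text normalization")
except ImportError:
    print("ttsfrd not found; LightTTS falls back to WeTextProcessing")
```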
Note: It is recommended to enable the `load_trt` parameter for acceleration. The default flow precision is fp16 for CosyVoice2 and fp32 for CosyVoice3.

For CosyVoice2:

```bash
python -m light_tts.server.api_server --model_dir ./pretrained_models/CosyVoice2-0.5B
```

For CosyVoice3:

```bash
python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
```

With a custom data type (float32, bfloat16, or float16; default: float16):

```bash
# Use float32 for better accuracy or float16 for faster speed
python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512 --data_type float32
```

Available Parameters:
The default values are usually the fastest and generally do not need to be adjusted. If you need to customize them, please refer to the following parameter descriptions:
- `load_trt`: Whether to load the flow_decoder in TensorRT mode (default: True).
- `data_type`: The data type for LLM inference (default: float16).
- `load_jit`: Whether to load the flow_encoder in JIT mode (default: False).
- `max_total_token_num`: LLM argument; the total token count the GPU and model can support, i.e. `max_batch * (input_len + output_len)` (default: 64 * 1024). For example, 16 concurrent requests averaging 4096 input plus output tokens each need `16 * 4096 = 65536` tokens, exactly the default.
- `max_req_total_len`: LLM argument; the maximum value of `req_input_len + req_output_len` (default: 32768, matching `max_position_embeddings`).
- `graph_max_len_in_batch`: Maximum sequence length for CUDA graph capture in the decoding stage (default: 32768).
- `graph_max_batch_size`: Maximum batch size for CUDA graph capture in the decoding stage (default: 16).
For more parameters, see `light_tts/server/api_cli.py`.
Wait for the service to initialize. The default address is http://localhost:8080.
Once the service is running, you can interact with it through the HTTP API. We support three modes: non-streaming, streaming, and bi-streaming; a minimal client sketch follows the list below.
- Non-streaming and streaming: see `test/test_zero_shot.py` for examples, which print metrics such as RTF (Real-Time Factor) and TTFT (Time To First Token).
- Bi-streaming: uses the WebSocket interface; see usage examples in `test/test_bistream.py`.
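For orientation, here is a minimal non-streaming client. The `/tts` endpoint path, the JSON field names, and the assumption that the server returns raw audio bytes are illustrative guesses on our part, not the confirmed LightTTS API; `test/test_zero_shot.py` shows the real request schema.

```python
# Minimal non-streaming client sketch. The endpoint path and field names
# below are hypothetical; see test/test_zero_shot.py for the real schema.
import requests

resp = requests.post(
    "http://localhost:8080/tts",  # hypothetical endpoint path
    json={
        "text": "Hello from LightTTS.",  # text to synthesize
        "speaker": "default",            # hypothetical speaker/timbre id
    },
    timeout=60,
)
resp.raise_for_status()

# Assumption: the response body contains the synthesized audio.
with open("output.wav", "wb") as f:
    f.write(resp.content)
```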
We have benchmarked LightTTS on different GPU configurations to demonstrate its throughput and latency characteristics in both non-streaming and streaming modes.
Model: Fun-CosyVoice3-0.5B-2512, data type: float16
non-stream (`test/test_zs_speed.py`):
| num_workers | cost time 50% | cost time 90% | cost time 99% | rtf 50% | rtf 90% | rtf 99% | avg rtf | total_cost_time | qps |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.61 | 1.09 | 1.51 | 0.13 | 0.16 | 0.22 | 0.13 | 33.95 | 1.47 |
| 2 | 0.8 | 1.24 | 1.71 | 0.15 | 0.22 | 0.25 | 0.16 | 21.46 | 2.33 |
| 4 | 1.02 | 1.88 | 2.27 | 0.22 | 0.29 | 0.38 | 0.23 | 15.31 | 3.27 |
| 8 | 1.76 | 2.36 | 3.48 | 0.33 | 0.49 | 0.62 | 0.36 | 12.18 | 4.1 |
stream (`test/test_zs_stream.py`):
| num_workers | cost time 50% | cost time 90% | cost time 99% | ttft 50% | ttft 90% | ttft 99% | rtf 50% | rtf 90% | rtf 99% | avg rtf | total_cost_time | qps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.01 | 2.15 | 2.82 | 0.33 | 0.34 | 0.9 | 0.21 | 0.25 | 0.34 | 0.22 | 60.13 | 0.83 |
| 2 | 1.83 | 3.56 | 5.16 | 0.93 | 1.53 | 2.3 | 0.34 | 0.63 | 0.81 | 0.4 | 52.47 | 0.95 |
| 4 | 3.43 | 5.76 | 7.31 | 2.62 | 4.37 | 5.8 | 0.7 | 1.28 | 2.16 | 0.81 | 48.74 | 1.03 |
| 8 | 7.27 | 10.01 | 10.45 | 6.4 | 8.55 | 9.03 | 1.28 | 2.67 | 3.66 | 1.57 | 47.37 | 1.06 |
non-stream
| num_workers | cost time 50% | cost time 90% | cost time 99% | rtf 50% | rtf 90% | rtf 99% | avg rtf | total_cost_time | qps |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.51 | 0.81 | 1.61 | 0.11 | 0.13 | 0.23 | 0.11 | 28.9 | 1.73 |
| 2 | 0.64 | 1.1 | 1.48 | 0.13 | 0.16 | 0.26 | 0.13 | 17.54 | 2.85 |
| 4 | 0.87 | 1.28 | 1.68 | 0.17 | 0.23 | 0.36 | 0.18 | 11.45 | 4.37 |
| 8 | 1.32 | 1.86 | 2.14 | 0.25 | 0.4 | 0.6 | 0.29 | 8.97 | 5.57 |
stream
| num_workers | cost time 50% | cost time 90% | cost time 99% | ttft 50% | ttft 90% | ttft 99% | rtf 50% | rtf 90% | rtf 99% | avg rtf | total_cost_time | qps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.76 | 1.41 | 2.27 | 0.28 | 0.3 | 0.31 | 0.16 | 0.18 | 0.22 | 0.16 | 44.06 | 1.13 |
| 2 | 1.45 | 2.34 | 3.46 | 0.74 | 1.28 | 1.75 | 0.27 | 0.45 | 0.7 | 0.3 | 38.82 | 1.29 |
| 4 | 2.9 | 4.04 | 4.7 | 2.16 | 3.03 | 3.4 | 0.5 | 1.04 | 1.51 | 0.61 | 37.75 | 1.32 |
| 8 | 5.78 | 7.74 | 8.49 | 5.01 | 6.73 | 7.35 | 1.03 | 2.09 | 2.85 | 1.22 | 37.67 | 1.33 |
Metrics Explanation (a short sketch of how RTF and QPS are computed follows this list):
- num_workers: Number of concurrent workers
- cost time: Total request processing time in seconds (50th/90th/99th percentile)
- ttft: Time to First Token in seconds (50th/90th/99th percentile)
- rtf: Real-Time Factor, i.e. synthesis time divided by the duration of the generated audio; lower is better (50th/90th/99th percentile)
- avg rtf: Average Real-Time Factor
- total_cost_time: Total benchmark duration in seconds
- qps: Queries Per Second
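The RTF and QPS columns follow the standard definitions. The small sketch below (ours, not the benchmark scripts' actual code; see `test/test_zs_speed.py` and `test/test_zs_stream.py`) makes them concrete:

```python
# Illustrative metric definitions; the benchmark scripts may differ in detail.

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: generation time divided by the duration of the
    audio produced. rtf < 1 means synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds


def qps(num_requests: int, total_cost_time: float) -> float:
    """Queries Per Second across the whole benchmark run."""
    return num_requests / total_cost_time


print(rtf(0.65, 5.0))   # 0.13: 0.65 s to synthesize 5 s of audio
print(qps(100, 50.0))   # 2.0: 100 requests completed in 50 s
```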
This repository is released under the Apache-2.0 license.
This project includes code from CosyVoice (Copyright Alibaba, Inc. and its affiliates), which is also licensed under Apache-2.0. The CosyVoice code is located in the cosyvoice/ directory and has been integrated and modified as part of LightTTS. See the NOTICE file for complete attribution details.
