Based on vLLM-0.10.0
This repository is a specialized adaptation tailored to the original FireRedASR-LLM model architecture and input parameters, and it contains many hard-coded elements. Significant work remains before it can be merged into the main vLLM branch:
- Modify the FireRedASR-LLM model files to match the standard model loading procedure in vLLM
- Modify the input format to support raw feature data
- Remove the separate fireredasr directory in `vllm/model_executor/models`
- Run `tools/merge_lora_weights.py` under the directory of `FireRedASR-LLM-L` to get the complete Qwen2-7B LLM model with the LoRA weights merged in.
- Run `tools/save_tokenizer.py` to get the specific tokenizer of the Qwen2-7B model.
- Set the soft link of `Qwen2-7B-Instruct` under the directory of `FireRedASR-LLM-L` to `Qwen2-7B-Instruct-LoRA`.
- Copy the file `tools/fireredasr_config_template.json` to the directory of `FireRedASR-LLM-L` as `FireRedASR-LLM-L/config.json`.
- Install vLLM from source. See the official vLLM documentation to learn more.
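The soft-link and config-copy steps above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `prepare_model_dir` is hypothetical, and it assumes that the merged model was written to `Qwen2-7B-Instruct-LoRA` inside the `FireRedASR-LLM-L` directory and that the link named `Qwen2-7B-Instruct` should point at it.

```python
import shutil
from pathlib import Path


def prepare_model_dir(model_dir: Path, config_template: Path) -> None:
    """Create the soft link and install config.json inside model_dir.

    Hypothetical helper. Assumes the merged weights live in
    model_dir / "Qwen2-7B-Instruct-LoRA".
    """
    link = model_dir / "Qwen2-7B-Instruct"
    target = model_dir / "Qwen2-7B-Instruct-LoRA"

    # Replace a stale link; a real directory with this name is left alone
    # and symlink_to() below will raise FileExistsError.
    if link.is_symlink():
        link.unlink()

    # Relative link, so the model directory can be moved as a whole.
    link.symlink_to(target.name, target_is_directory=True)

    # Copy the template to FireRedASR-LLM-L/config.json.
    shutil.copyfile(config_template, model_dir / "config.json")
```

A call would look like `prepare_model_dir(Path("FireRedASR-LLM-L"), Path("tools/fireredasr_config_template.json"))`, with both paths adjusted to the local checkout.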
Recommended environment:
- flash-attn==2.8.3
- torch==2.7.1
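To compare an existing environment against the recommended versions, a small check along these lines may help. It is an illustrative sketch that uses only the standard library; the function names are hypothetical, and it assumes the packages are installed under the distribution names `flash-attn` and `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Recommended versions, taken from the list above.
RECOMMENDED = {"flash-attn": "2.8.3", "torch": "2.7.1"}


def installed_version(package: str):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(recommended=RECOMMENDED):
    """Return {package: (recommended, installed)} for every mismatch."""
    mismatches = {}
    for package, wanted in recommended.items():
        found = installed_version(package)
        # Ignore local build tags, so "2.7.1+cu126" matches "2.7.1".
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in check_environment().items():
        print(f"{package}: recommended {wanted}, found {found}")
```

An empty result means both packages match the recommended versions.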
See `examples/fireredasr_vllm_example.py` for a usage example.
| Parameter | Default | Description |
|---|---|---|
| `max_tokens` | `min(2048, len(audio))` | Maximum number of tokens to generate (should be adjusted to the actual length of the audio file) |
| `min_tokens` | `0` | Minimum number of tokens to generate |
| `temperature` | `0.1` | Sampling temperature |
| `top_p` | `1.0` | Top-p (nucleus) sampling |
| `repetition_penalty` | `1.05` | Penalty for repeating tokens |
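The defaults in the table can be collected into keyword arguments as in the sketch below. The helper name `default_sampling_kwargs` is hypothetical, and `audio_len` stands for whatever length measure the example script uses for `len(audio)`, which this document does not specify.

```python
def default_sampling_kwargs(audio_len: int) -> dict:
    """Build sampling keyword arguments from the documented defaults.

    audio_len is an assumed stand-in for len(audio); its unit (samples,
    frames, ...) depends on how the audio is represented.
    """
    return {
        # Cap generation at 2048 tokens, or fewer for short audio.
        "max_tokens": min(2048, audio_len),
        "min_tokens": 0,
        "temperature": 0.1,
        "top_p": 1.0,
        "repetition_penalty": 1.05,
    }
```

The keys match field names of `vllm.SamplingParams`, so with vLLM installed the dictionary can be passed as `SamplingParams(**default_sampling_kwargs(n))`.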