WhisperX API Server is a FastAPI-based server designed to transcribe audio files using pluggable stage backends, with WhisperX (https://github.com/m-bain/WhisperX) as the default implementation. The API offers an OpenAI-like interface that allows users to upload audio files and receive transcription results in various formats. It supports customizable options such as different models, languages, temperature settings, and more.
## Features
- Audio Transcription: Transcribe audio files using the configured transcription backend.
- Model Caching: Load and cache models for reusability and faster performance.
- OpenAI-like API, based on https://platform.openai.com/docs/api-reference/audio/createTranscription and https://platform.openai.com/docs/api-reference/audio/createTranslation
- Pluggable pipeline stages: choose a backend per stage (`transcription`, `alignment`, `diarization`) and mix different backends.
## `POST /v1/audio/transcriptions`

Based on OpenAI's createTranscription endpoint: https://platform.openai.com/docs/api-reference/audio/createTranscription
Parameters:
- `file`: The audio file to transcribe.
- `model` (str): Model name for the configured transcription backend. If `whisper-1` is provided, it is replaced with the configured default transcription model.
- `language` (str | null): Language code for transcription. Default is `config.default_language`.
- `prompt` (str | null): Optional transcription prompt. Default is `null`.
- `response_format` (str): One of `text`, `json`, `verbose_json`, `vtt_json`, `srt`, `vtt`, `aud`. Default is `config.default_response_format`.
- `temperature` (float): Temperature setting for transcription. Default is `0.0`.
- `timestamp_granularities[]` (list[str]): Timestamp granularity values (`segment`, `word`). Default is `["segment"]`.
- `stream` (bool): OpenAI-compatible streaming flag. Currently accepted but not used by the server. Default is `False`.
- `hotwords` (str | null): Optional hotwords for transcription. Default is `null`.
- `suppress_numerals` (bool): Suppress numerals in transcription. Default is `True`.
- `highlight_words` (bool): Highlight words in subtitle-style outputs (`vtt`, `srt`). Default is `False`.
- `align` (bool): Enable transcription timing alignment. Default is `True`.
- `diarize` (bool): Enable speaker diarization. Default is `False`.
- `speaker_embeddings` (bool): Include speaker embeddings during the diarization flow. Default is `False`.
- `chunk_size` (int): Chunk size (seconds) for VAD segment merging. Default is `config.whisper.chunk_size`.
- `batch_size` (int): Batch size used during inference. Default is `config.whisper.batch_size`.
Returns: Transcription output in the requested `response_format`:

- `json`: JSON object with `text`.
- `verbose_json`: Full transcript JSON object.
- `vtt_json`: Full transcript JSON object plus `vtt_text`.
- `text`, `srt`, `vtt`, `aud`: Plain text response body.
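As a client-side sketch, the documented parameters can be packed into multipart form fields and posted to the endpoint. The host/port and the commented `requests` call below are illustrative assumptions; only the field names come from the parameter list above.

```python
def transcription_form(model="whisper-1", language=None,
                       response_format="verbose_json", temperature=0.0,
                       granularities=("segment",), align=True, diarize=False):
    """Build the non-file form fields as (key, value) pairs.

    Repeated keys are how multipart form data encodes list parameters
    such as timestamp_granularities[].
    """
    fields = [
        ("model", model),
        ("response_format", response_format),
        ("temperature", str(temperature)),
        ("align", str(align).lower()),
        ("diarize", str(diarize).lower()),
    ]
    fields += [("timestamp_granularities[]", g) for g in granularities]
    if language is not None:
        fields.append(("language", language))
    return fields

# Posting the request (requires the third-party `requests` package):
# import requests
# with open("audio.mp3", "rb") as f:
#     resp = requests.post("http://localhost:8000/v1/audio/transcriptions",
#                          files={"file": f},
#                          data=transcription_form(language="en"))
# print(resp.json())
```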
## `POST /v1/audio/translations`

Based on OpenAI's createTranslation endpoint: https://platform.openai.com/docs/api-reference/audio/createTranslation
Parameters:
- `file`: The audio file to translate.
- `model` (str): Model name for the configured transcription backend. If `whisper-1` is provided, it is replaced with the configured default transcription model.
- `prompt` (str): Optional translation prompt. Default is an empty string.
- `response_format` (str): One of `text`, `json`, `verbose_json`, `vtt_json`, `srt`, `vtt`, `aud`. Default is `config.default_response_format`.
- `temperature` (float): Temperature setting for translation. Default is `0.0`.
- `chunk_size` (int): Chunk size (seconds) for VAD segment merging. Default is `config.whisper.chunk_size`.
- `batch_size` (int): Batch size used during inference. Default is `config.whisper.batch_size`.
Returns: Translation output in the requested `response_format` (same response behavior as `/v1/audio/transcriptions`).
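Because `json`, `verbose_json`, and `vtt_json` return JSON bodies while the remaining formats return plain text, a client can branch on the requested format. A minimal sketch:

```python
import json

# Formats documented above as returning JSON bodies.
JSON_FORMATS = {"json", "verbose_json", "vtt_json"}

def parse_body(response_format: str, body: str):
    """Decode a transcription/translation response body.

    JSON-style formats are parsed into a dict; text, srt, vtt, and aud
    are returned as the raw text body.
    """
    if response_format in JSON_FORMATS:
        return json.loads(body)
    return body
```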
## Health check

Returns the current API health status as JSON: `{"status": "healthy"}`.
## Model management

Transcription models (`/models/*`):

- List loaded transcription models.
- Unload a transcription model from cache. Parameters: `model` (str): model name to unload.
- Load a transcription model into cache. Parameters: `model` (str): model name to load.

Alignment models (`/align_models/*`):

- List loaded alignment models.
- Unload an alignment model. Parameters: `language` (str): language code of the alignment model to unload.
- Load an alignment model. Parameters: `language` (str): language code of the alignment model to load.

Diarization models (`/diarize_models/*`):

- List loaded diarization models.
- Unload a diarization model. Parameters: `model` (str): diarization model name to unload.
- Load a diarization model. Parameters: `model` (str): diarization model name to load.
## Backend configuration

You can define the default backend for each pipeline stage through environment variables:

```
BACKENDS__TRANSCRIPTION=whisperx
BACKENDS__ALIGNMENT=whisperx
BACKENDS__DIARIZATION=whisperx
```

By default, only the `whisperx` backend is registered. Additional backends can be added and combined per stage.
Model management endpoints (/models/*, /align_models/*, /diarize_models/*) operate through the configured stage backends.
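Conceptually, a per-stage backend registry might look like the following sketch. The names and structure here are illustrative assumptions, not the server's actual internals:

```python
from typing import Callable, Dict

# Hypothetical registry: stage name -> {backend name -> factory}.
# In the real server, only the "whisperx" backend is registered by default.
_REGISTRY: Dict[str, Dict[str, Callable[[], object]]] = {
    "transcription": {},
    "alignment": {},
    "diarization": {},
}

def register_backend(stage: str, name: str, factory: Callable[[], object]) -> None:
    """Make a backend selectable for one pipeline stage."""
    _REGISTRY[stage][name] = factory

def resolve_backend(stage: str, name: str):
    """Instantiate the backend configured for a stage."""
    try:
        return _REGISTRY[stage][name]()
    except KeyError:
        raise ValueError(f"no backend {name!r} registered for stage {stage!r}")
```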
## Running with Docker

For CPU:

```shell
docker compose build whisperx-api-server-cpu
docker compose up whisperx-api-server-cpu
```

For CUDA (GPU):

```shell
docker compose build whisperx-api-server-cuda
docker compose up whisperx-api-server-cuda
```
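To change the stage backends when running under Docker, the `BACKENDS__*` environment variables can be set on the service, for example in a compose override. This is an illustrative sketch; the service name follows the compose targets above:

```yaml
services:
  whisperx-api-server-cpu:
    environment:
      - BACKENDS__TRANSCRIPTION=whisperx
      - BACKENDS__ALIGNMENT=whisperx
      - BACKENDS__DIARIZATION=whisperx
```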
## Contributing

Feel free to submit issues, fork the repository, and send pull requests to contribute to the project.
## License

This project is licensed under the GNU General Public License, Version 3. See the LICENSE file for details.