As a software developer with limited Python expertise, I directed the development of srt-translator by iteratively prompting AI language models (Claude and DeepSeek) to generate the majority of the code. Through systematic debugging, precise requirements, and continuous validation of the AI's output, I guided the project from concept to a fully functional application. This process demonstrates my ability to leverage AI tools effectively while retaining full ownership of problem‑solving and architectural decisions.
A powerful, OBS-compatible voice recognition and translation app built with Python and Gradio. Supports multiple translation backends including AI models, offline Argos Translate, Whisper API, and LibreTranslate.
- Real-time Voice Recognition using Vosk (offline, open‑source) or Whisper API (online, high accuracy)
- Multiple Translation Backends:
- Argos Translate – offline, open‑source translation (no internet required)
- AI‑powered translation – OpenAI‑compatible endpoints (Ollama, OpenAI, etc.)
- Whisper Translate – direct audio translation via Whisper API
- LibreTranslate – self‑hosted or cloud translation service
- Internal translation – uses the `translators` library (Google Translate, etc.)
- Moonshine – lightweight ONNX‑based local ASR. Auto‑downloads models from HuggingFace. Supports 8+ languages.
- Independent Session Management – each browser tab runs its own isolated session
- Pop‑out Display – separate window for OBS overlay, updates via polling
- Interim Results – show partial recognition as you speak
- Multi‑language Support – recognize and translate between many languages
- Font, Size, Color – fully customizable for both recognized and translated text
- Text Alignment – left, center, or right
- Translation Position – before or after the recognized text
- Background Color – set any color (use `#00FF00` for chroma key)
- Fade Timeout – automatically fade text after a configurable pause
- Microphone Selection – choose from available input devices
- Vosk Model Management – load models from the local `models/` directory
- Argos Model Management – download and install offline translation models with `download_argos_model.py`
- Whisper API Integration – use any OpenAI‑compatible Whisper server (e.g., `whisper.cpp`, `faster-whisper`)
- Docker Support – easy deployment with Docker/Docker Compose
- Comprehensive Logging – real‑time logs in UI and persistent file logs
- Session Cleanup – automatic cleanup of inactive sessions
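The session-cleanup feature can be pictured as a simple idle-timeout sweep. Everything below (the function name, the session-dict shape, and the 600-second default) is illustrative, not the app's actual API:

```python
import time

# Illustrative sketch only: the real app's session store and timeout differ.
def cleanup_sessions(sessions, max_idle=600, now=None):
    """Drop sessions whose last activity is older than max_idle seconds."""
    now = time.time() if now is None else now
    return {sid: s for sid, s in sessions.items()
            if now - s["last_active"] <= max_idle}

sessions = {
    "tab-1": {"last_active": 1000.0},  # idle 200 s at now=1200: kept
    "tab-2": {"last_active": 100.0},   # idle 1100 s: dropped
}
alive = cleanup_sessions(sessions, max_idle=600, now=1200.0)
print(sorted(alive))  # ['tab-1']
```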
- Python 3.11 or 3.12 (recommended – Python 3.14 may have package compatibility issues)
- PortAudio (for audio input)
- Vosk models (download separately)
- (Optional) Argos Translate models for offline translation
- (Optional) Whisper server for online transcription/translation
- gradio>=4.44.0
- vosk>=0.3.45
- sounddevice>=0.4.7
- numpy>=1.26.0,<2.0.0
- requests>=2.31.0
- translators>=5.9.1
- argostranslate
1. Clone or download this repository

2. Install system dependencies (if needed)

   Ubuntu/Debian:

   ```bash
   sudo apt-get update
   sudo apt-get install portaudio19-dev python3-pyaudio
   ```

   macOS:

   ```bash
   brew install portaudio
   ```
3. Create and activate a virtual environment

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
4. Install Python dependencies

   ```bash
   pip install -r requirements.txt
   ```
5. Download Vosk models

   Use the included `download_vosk_models.py` script:

   ```bash
   python download_vosk_models.py en-us-small  # light English model
   python download_vosk_models.py en-us        # full English model
   python download_vosk_models.py es fr de     # multiple languages
   ```

   Models are placed in the `models/` directory.
6. (Optional) Download Argos Translate models for offline translation

   ```bash
   python download_argos_model.py en es     # install English→Spanish
   python download_argos_model.py --common  # install a set of common pairs
   ```

   Models are stored in `argos_models/`.
7. Run the application

   ```bash
   python app.py
   ```

   Open your browser at http://localhost:7860.
1. Build the Docker image

   ```bash
   docker-compose build
   ```

2. Place Vosk models in `./models/` (created automatically if missing)

3. Start the container

   ```bash
   docker-compose up -d
   ```

4. Access the application at http://localhost:7860
- Building the image requires internet access to download `gcc` and Python headers inside the container. This is handled automatically by the Dockerfile.
- Audio access on Linux requires the `--device /dev/snd` flag (already in `docker-compose.yml`). On macOS/Windows, Docker Desktop may have limited audio support; use browser audio mode instead.
- Moonshine (local ONNX model) – if you want to use Moonshine, add `moonshine-voice` to `requirements.txt` (or install it inside the container). The default Dockerfile does not include it, to keep the image smaller.
- Volume mounts – the compose file mounts `./vosk_models`, `./argos_models`, `./fonts`, and `./logs`. Create these directories on your host before starting the container.
- Fix for the docker‑compose `command` – use the corrected `command` shown below, otherwise the container will not start.
```yaml
version: "3.8"
services:
  voice-translator:
    build: .
    container_name: voice-translator
    ports:
      - "7860:7860"
    volumes:
      - ./vosk_models:/app/vosk_models
      - ./argos_models:/app/argos_models
      - ./fonts:/app/fonts
      - ./logs:/app/logs
    devices:
      - /dev/snd:/dev/snd
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
    command: python app.py --host 0.0.0.0 --port 7860
```
1. Select Recognition Engine:
   - Vosk (offline, fast) – choose a model from the dropdown
   - Whisper (online, more accurate) – configure Whisper API host and model

2. Choose Audio Mode:
   - Hardware – uses the system microphone (select a device)
   - Browser – uses the browser's microphone (useful for remote access)

3. Configure Translation (optional):
   - Argos – offline translation using downloaded Argos models
   - AI – OpenAI‑compatible endpoint (e.g., Ollama, OpenAI)
   - Whisper Translate – direct audio translation via Whisper API
   - LibreTranslate – self‑hosted or cloud instance
   - Internal – uses Google Translate (internet required)

4. Set Source and Target Languages (format: `en-US` for Vosk/Whisper, `en` for translation)

5. Click "Start" – begin speaking. The recognized text and its translation appear in the display panel.

6. Use the Pop‑out URL for OBS – open the provided URL in a browser source in OBS.
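The two language-code formats differ only in the region suffix. A hypothetical helper (not part of the app) showing how a recognition locale maps to a translation code:

```python
# Hypothetical helper: recognition engines use a locale like "en-US",
# while translation backends expect a bare ISO 639-1 code like "en".
def to_translation_code(locale: str) -> str:
    """Map 'en-US' -> 'en'; bare codes pass through unchanged."""
    return locale.split("-", 1)[0].lower()

print(to_translation_code("en-US"))  # en
print(to_translation_code("es"))     # es
```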
**Argos (Offline)**

```
Translation Mode: argos
Source Language Code: en
Target Language Code: es
```

Requires the corresponding Argos models installed.

**AI (Ollama)**

```
Translation Mode: ai
AI Host: http://localhost:11434/v1
AI Model: llama3.2
```

(Leave the API key empty.)

**Whisper Translate**

```
Translation Mode: whisper_translate
Whisper API Host: http://localhost:9000
```

(Translates audio directly to English.)

**LibreTranslate (Self‑hosted)**

```
Translation Mode: libretranslate
LibreTranslate Host: http://localhost:5000
```

(Provide an API key if required.)
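The AI and LibreTranslate modes talk to well-known public APIs (OpenAI-style chat completions, and LibreTranslate's `POST /translate`). The sketch below builds plausible request bodies for the hosts configured above; the prompt wording and helper names are my own, not taken from the app:

```python
import json

def ai_payload(text, target_lang, model="llama3.2"):
    """Chat-completions body for an OpenAI-compatible endpoint (e.g. Ollama)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Translate the user's text into {target_lang}."},
            {"role": "user", "content": text},
        ],
    }

def libretranslate_payload(text, source, target, api_key=""):
    """Body for LibreTranslate's POST /translate endpoint."""
    body = {"q": text, "source": source, "target": target}
    if api_key:
        body["api_key"] = api_key
    return body

# POST these as JSON to http://localhost:11434/v1/chat/completions and
# http://localhost:5000/translate respectively.
print(json.dumps(libretranslate_payload("hello", "en", "es")))
```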
Adjust when the app detects speech:
- Threshold (dB) – sensitivity. Lower values (‑60) detect whispers; higher values (‑10) detect only loud speech.
- End‑of‑speech pause – how long the app waits through silence before finalizing a segment. Increase it (500–800 ms) if phrases are cut off.
- Noise filter – removes clicks, keyboard noise, and background hum. 0 = off, 1 = aggressive (may soften speech).
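To make the dB scale concrete, here is a rough sketch of how a threshold check could work on raw samples; the app's actual VAD is likely more sophisticated, and all names here are invented:

```python
import math

# Illustrative sketch: compute the RMS level of a sample chunk in dBFS and
# compare it against a configured threshold.
def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -120.0 if rms == 0 else 20 * math.log10(rms)

def is_speech(samples, threshold_db=-35.0):
    return rms_dbfs(samples) > threshold_db

whisper_chunk = [0.001] * 160  # about -60 dBFS, needs a low threshold
normal_chunk = [0.1] * 160     # about -20 dBFS
print(is_speech(whisper_chunk))  # False at the default -35 dB threshold
print(is_speech(normal_chunk))   # True
```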
- Instant – text appears immediately, fades after timeout.
- Buffered (paced) – sentences queue up and are shown in chunks. Each chunk stays on screen long enough to read at the chosen characters per second (CPS). Ideal for fast speakers or Whisper (which sends whole phrases).
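The buffered mode's pacing rule can be sketched in a few lines; the default rate and the minimum hold time here are illustrative, not the app's actual values:

```python
# Illustrative sketch: each chunk stays on screen long enough to read at the
# chosen characters-per-second rate, with a floor so short chunks don't flash by.
def chunk_duration(text: str, cps: float = 15.0, min_seconds: float = 1.5) -> float:
    return max(len(text) / cps, min_seconds)

print(chunk_duration("x" * 45))  # 45 chars at 15 cps -> 3.0 s
print(chunk_duration("Hi"))      # floored at 1.5 s
```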
Add a stroke around recognized and translated text. Useful for better readability on bright backgrounds. Set width (pixels) and color.
Place .ttf, .otf, .woff, or .woff2 files in the fonts/ directory (created automatically). They appear in the font dropdown as [Custom] fontname.
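A plausible sketch of that font discovery (assumed behavior; the app's actual scanning and labeling code may differ):

```python
from pathlib import Path

# Assumed extension set, taken from the supported formats listed above.
FONT_EXTS = {".ttf", ".otf", ".woff", ".woff2"}

def list_custom_fonts(font_dir="fonts"):
    """Return dropdown labels for font files found in font_dir."""
    p = Path(font_dir)
    p.mkdir(exist_ok=True)  # the directory is created automatically
    return sorted(f"[Custom] {f.stem}" for f in p.iterdir()
                  if f.suffix.lower() in FONT_EXTS)
```

Dropping `Roboto.ttf` into `fonts/` would then surface as `[Custom] Roboto` in the dropdown.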
Before starting recognition, click Test Mic (hardware mode) or Test Mic (browser mode) to see the level meter without transcribing. Useful to check if your mic works and to set the VAD threshold.
Each browser tab runs an independent session. Use the Manage Sessions dropdown to close other sessions and free resources.
By default, the pop‑out URL uses a random ID. You can enter a custom ID (letters, numbers, underscores, hyphens) to get a persistent URL, e.g., http://localhost:7860/popout/my_stream.
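The ID rule can be captured with a small validator; this is a sketch of the stated constraint, not the app's actual routing code:

```python
import re

# Letters, numbers, underscores, and hyphens, per the rule above.
ID_RE = re.compile(r"[A-Za-z0-9_-]+")

def popout_url(custom_id: str, base="http://localhost:7860") -> str:
    if not ID_RE.fullmatch(custom_id):
        raise ValueError("ID may contain only letters, numbers, _ and -")
    return f"{base}/popout/{custom_id}"

print(popout_url("my_stream"))  # http://localhost:7860/popout/my_stream
```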
- Linux: the hardware microphone works via `/dev/snd` passthrough (already configured in `docker-compose.yml`).
- macOS / Windows: Docker Desktop does not support `/dev/snd`. Use browser audio mode instead – it works without access to the host's sound device.
Expand the Advanced Whisper Parameters accordion to fine‑tune transcription (temperature, beam size, no‑speech threshold, etc.). See OpenAI Whisper API docs for details.
Open the Display Style accordion to adjust:
- Font family, sizes, colors
- Text alignment
- Translation position (before/after)
- Fade timeout
- Start the app.
- Copy the Popout URL from the UI (e.g., `http://localhost:7860/popout/abc123`).
- In OBS, add a Browser Source and paste the URL.
- Set desired width/height (e.g., 1920×200).
- Optionally add custom CSS to remove background.
```text
voice-translator/
├── app.py                    # Main application
├── translators.py            # Translation service (AI, LibreTranslate, internal)
├── logger.py                 # Logging module
├── requirements.txt          # Python dependencies
├── download_vosk_models.py   # Vosk model downloader
├── download_argos_model.py   # Argos Translate model downloader
├── Dockerfile                # Docker configuration
├── docker-compose.yml        # Docker Compose configuration
├── README.md                 # This file
├── QUICKSTART.md             # Quick start guide
├── TROUBLESHOOTING.md        # Troubleshooting guide
├── CONFIG_EXAMPLES.txt       # Example configurations
├── models/                   # Vosk models directory
├── argos_models/             # Argos Translate models directory
└── logs/                     # Application logs
```
| Argument | Description |
|---|---|
| `--host` | Host to bind to (default: `localhost`) |
| `--port` | Port to bind to (default: `7860`) |
| `--share` | Create a public share link (Gradio feature) |
- `GRADIO_SERVER_NAME` – set to `0.0.0.0` inside the container
- `GRADIO_SERVER_PORT` – default `7860`
- All activities are logged in real‑time in the UI (last 50 entries).
- Logs are also saved to logs/ with session identifiers.
- Live Streaming – real‑time translation overlay for multilingual streams
- Presentations – live translation for international audiences
- Meetings – real‑time transcription and translation
- Accessibility – speech‑to‑text with translation support
- Language Learning – see translations as you practice speaking
See the TROUBLESHOOTING.md file for common issues and solutions.
Contributions are welcome! Areas for improvement:
- Additional translation backends
- More display customization options
- Performance optimizations
- Additional language models
- UI/UX enhancements
This project uses several open‑source components:
- Vosk – Apache 2.0 License
- Gradio – Apache 2.0 License
- Argos Translate – MIT License
- translators – MIT License
- Vosk – speech recognition toolkit
- Argos Translate – offline translation library
- Gradio – web UI framework
- LibreTranslate – free and open‑source translation API
- translators – multi‑engine translation library
- Ollama – local AI model runner
Happy Translating! 🌍🎤✨
# 🚀 QUICK START GUIDE

## What's New

- **Argos Translate** – offline translation (no internet needed)
- **Whisper API** – use a Whisper server for transcription/translation
- **Multiple translation backends** – AI, LibreTranslate, internal, whisper_translate
- **Pop‑out display** – perfect for OBS overlays
- **Session isolation** – each browser tab is independent

## Fastest Way to Get Started

### 1. Run the setup script

**Linux/macOS:**

```bash
chmod +x setup.sh
./setup.sh
```

**Windows:**

```text
setup.bat
```

### 2. Download a Vosk model (for offline recognition)

```bash
# Activate the virtual environment (if not already active)
source venv/bin/activate  # or venv\Scripts\activate on Windows
python download_vosk_models.py en-us-small  # 40 MB English model
```

### 3. (Optional) Download Argos models for offline translation

```bash
python download_argos_model.py en es     # English → Spanish
python download_argos_model.py --common  # common language pairs
```

### 4. Start the app

```bash
python app.py
```

Open your browser to http://localhost:7860.

## First‑Time UI Setup

1. Choose Recognition Engine (Vosk recommended for offline)
2. Select a Vosk model from the dropdown
3. Pick your microphone (hardware mode)
4. Enable Translation and choose a mode:
   - Argos – offline, requires models
   - AI – for Ollama/OpenAI
   - Whisper Translate – if you have a Whisper server
   - LibreTranslate – self‑hosted or cloud
   - Internal – easiest (uses Google Translate)
5. Set languages (e.g., source: `en-US`, target: `es`)
6. Click Start

## OBS Integration

1. After starting, copy the Popout URL from the UI.
2. In OBS, add a Browser Source and paste the URL.
3. Set width/height (e.g., 1920×200).
4. (Optional) Add custom CSS to remove the background:

```css
body { background-color: rgba(0,0,0,0); overflow: hidden; }
```

## Next Steps

- Read the full README.md for detailed configuration.
- Check CONFIG_EXAMPLES.txt for ready‑to‑use setups.
- If something doesn't work, see TROUBLESHOOTING.md.











