As a software developer with limited Python expertise, I directed the development of srt-translator by iteratively prompting AI language models (Claude and DeepSeek) to generate the majority of the code. Through systematic debugging, precise requirements, and continuous validation of the AI's output, I guided the project from concept to a fully functional application. This process demonstrates my ability to leverage AI tools effectively while retaining full ownership of problem‑solving and architectural decisions.
A powerful, OBS-compatible voice recognition and translation app built with Python and Gradio. Supports multiple translation backends including AI models, offline Argos Translate, Whisper API, and LibreTranslate.
- Real-time Voice Recognition using Vosk (offline, open‑source) or Whisper API (online, high accuracy)
- Multiple Translation Backends:
- Argos Translate – offline, open‑source translation (no internet required)
- AI‑powered translation – OpenAI‑compatible endpoints (Ollama, OpenAI, etc.)
- Whisper Translate – direct audio translation via Whisper API
- LibreTranslate – self‑hosted or cloud translation service
- Internal translation – uses the `translators` library (Google Translate, etc.)
- Moonshine – lightweight ONNX‑based local ASR. Auto‑downloads models from HuggingFace. Supports 8+ languages.
- Independent Session Management – each browser tab runs its own isolated session
- Pop‑out Display – separate window for OBS overlay, updates via polling
- Interim Results – show partial recognition as you speak
- Multi‑language Support – recognize and translate between many languages
- Font, Size, Color – fully customizable for both recognized and translated text
- Text Alignment – left, center, or right
- Translation Position – before or after the recognized text
- Background Color – set any color (use `#00FF00` for chroma key)
- Fade Timeout – automatically fade text after a configurable pause
- Microphone Selection – choose from available input devices
- Vosk Model Management – load models from the local `models/` directory
- Argos Model Management – download and install offline translation models with `download_argos_model.py`
- Whisper API Integration – use any OpenAI‑compatible Whisper server (e.g., `whisper.cpp`, `faster-whisper`)
- Docker Support – easy deployment with Docker/Docker Compose
- Comprehensive Logging – real‑time logs in UI and persistent file logs
- Session Cleanup – automatic cleanup of inactive sessions
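The session-cleanup feature can be pictured as a simple idle-timeout sweep. Everything below (the function name, the session-dict shape, and the 600-second default) is illustrative, not the app's actual API:

```python
import time

# Illustrative sketch only: the real app's session store and timeout differ.
def cleanup_sessions(sessions, max_idle=600, now=None):
    """Drop sessions whose last activity is older than max_idle seconds."""
    now = time.time() if now is None else now
    return {sid: s for sid, s in sessions.items()
            if now - s["last_active"] <= max_idle}

sessions = {
    "tab-1": {"last_active": 1000.0},  # idle 200 s at now=1200: kept
    "tab-2": {"last_active": 100.0},   # idle 1100 s: dropped
}
alive = cleanup_sessions(sessions, max_idle=600, now=1200.0)
print(sorted(alive))  # ['tab-1']
```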
- Python 3.11 or 3.12 (recommended – Python 3.14 may have package compatibility issues)
- PortAudio (for audio input)
- Vosk models (download separately)
- (Optional) Argos Translate models for offline translation
- (Optional) Whisper server for online transcription/translation
- gradio>=4.44.0
- vosk>=0.3.45
- sounddevice>=0.4.7
- numpy>=1.26.0,<2.0.0
- requests>=2.31.0
- translators>=5.9.1
- argostranslate
1. Clone or download this repository

2. Install system dependencies (if needed)

   Ubuntu/Debian:

   ```bash
   sudo apt-get update
   sudo apt-get install portaudio19-dev python3-pyaudio
   ```

   macOS:

   ```bash
   brew install portaudio
   ```
3. Create and activate a virtual environment

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
4. Install Python dependencies

   ```bash
   pip install -r requirements.txt
   ```
5. Download Vosk models

   Use the included `download_vosk_models.py` script:

   ```bash
   python download_vosk_models.py en-us-small  # light English model
   python download_vosk_models.py en-us        # full English model
   python download_vosk_models.py es fr de     # multiple languages
   ```

   Models are placed in the `models/` directory.
6. (Optional) Download Argos Translate models for offline translation

   ```bash
   python download_argos_model.py en es     # install English→Spanish
   python download_argos_model.py --common  # install a set of common pairs
   ```

   Models are stored in `argos_models/`.
7. Run the application

   ```bash
   python app.py
   ```

   Open your browser at http://localhost:7860.
1. Build the Docker image

   ```bash
   docker-compose build
   ```

2. Place Vosk models in `./models/` (created automatically if missing)

3. Start the container

   ```bash
   docker-compose up -d
   ```

4. Access the application at http://localhost:7860
- Building the image requires internet access to download `gcc` and Python headers inside the container. This is handled automatically by the Dockerfile.
- Audio access on Linux requires the `--device /dev/snd` flag (already in `docker-compose.yml`). On macOS/Windows, Docker Desktop may have limited audio support; use browser audio mode instead.
- Moonshine (local ONNX model) – if you want to use Moonshine, add `moonshine-voice` to `requirements.txt` (or install it inside the container). The default Dockerfile does not include it, to keep the image smaller.
- Volume mounts – the compose file mounts `./vosk_models`, `./argos_models`, `./fonts`, and `./logs`. Create these directories on your host before starting the container.
- Fix for the docker‑compose `command` – use the corrected `command` shown below, otherwise the container will not start.
```yaml
version: "3.8"
services:
  voice-translator:
    build: .
    container_name: voice-translator
    ports:
      - "7860:7860"
    volumes:
      - ./vosk_models:/app/vosk_models
      - ./argos_models:/app/argos_models
      - ./fonts:/app/fonts
      - ./logs:/app/logs
    devices:
      - /dev/snd:/dev/snd
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
    command: python app.py --host 0.0.0.0 --port 7860
```
1. Select Recognition Engine:
   - Vosk (offline, fast) – choose a model from the dropdown
   - Whisper (online, more accurate) – configure Whisper API host and model

2. Choose Audio Mode:
   - Hardware – uses the system microphone (select a device)
   - Browser – uses the browser's microphone (useful for remote access)

3. Configure Translation (optional):
   - Argos – offline translation using downloaded Argos models
   - AI – OpenAI‑compatible endpoint (e.g., Ollama, OpenAI)
   - Whisper Translate – direct audio translation via Whisper API
   - LibreTranslate – self‑hosted or cloud instance
   - Internal – uses Google Translate (internet required)

4. Set Source and Target Languages (format: `en-US` for Vosk/Whisper, `en` for translation)

5. Click "Start" – begin speaking. The recognized text and its translation appear in the display panel.

6. Use the Pop‑out URL for OBS – open the provided URL in a browser source in OBS.
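The two language-code formats differ only in the region suffix. A hypothetical helper (not part of the app) showing how a recognition locale maps to a translation code:

```python
# Hypothetical helper: recognition engines use a locale like "en-US",
# while translation backends expect a bare ISO 639-1 code like "en".
def to_translation_code(locale: str) -> str:
    """Map 'en-US' -> 'en'; bare codes pass through unchanged."""
    return locale.split("-", 1)[0].lower()

print(to_translation_code("en-US"))  # en
print(to_translation_code("es"))     # es
```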
**Argos (Offline)**

```
Translation Mode: argos
Source Language Code: en
Target Language Code: es
```

Requires the corresponding Argos models installed.

**AI (Ollama)**

```
Translation Mode: ai
AI Host: http://localhost:11434/v1
AI Model: llama3.2
```

(Leave the API key empty.)

**Whisper Translate**

```
Translation Mode: whisper_translate
Whisper API Host: http://localhost:9000
```

(Translates audio directly to English.)

**LibreTranslate (Self‑hosted)**

```
Translation Mode: libretranslate
LibreTranslate Host: http://localhost:5000
```

(Provide an API key if required.)
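The AI and LibreTranslate modes talk to well-known public APIs (OpenAI-style chat completions, and LibreTranslate's `POST /translate`). The sketch below builds plausible request bodies for the hosts configured above; the prompt wording and helper names are my own, not taken from the app:

```python
import json

def ai_payload(text, target_lang, model="llama3.2"):
    """Chat-completions body for an OpenAI-compatible endpoint (e.g. Ollama)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Translate the user's text into {target_lang}."},
            {"role": "user", "content": text},
        ],
    }

def libretranslate_payload(text, source, target, api_key=""):
    """Body for LibreTranslate's POST /translate endpoint."""
    body = {"q": text, "source": source, "target": target}
    if api_key:
        body["api_key"] = api_key
    return body

# POST these as JSON to http://localhost:11434/v1/chat/completions and
# http://localhost:5000/translate respectively.
print(json.dumps(libretranslate_payload("hello", "en", "es")))
```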
Adjust when the app detects speech:
- Threshold (dB) – sensitivity. Lower values (‑60) detect whispers; higher values (‑10) detect only loud speech.
- End‑of‑speech pause – how long the app waits through silence before finalizing a segment. Increase it (500–800 ms) if phrases are cut off.
- Noise filter – removes clicks, keyboard noise, and background hum. 0 = off, 1 = aggressive (may soften speech).
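To make the dB scale concrete, here is a rough sketch of how a threshold check could work on raw samples; the app's actual VAD is likely more sophisticated, and all names here are invented:

```python
import math

# Illustrative sketch: compute the RMS level of a sample chunk in dBFS and
# compare it against a configured threshold.
def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -120.0 if rms == 0 else 20 * math.log10(rms)

def is_speech(samples, threshold_db=-35.0):
    return rms_dbfs(samples) > threshold_db

whisper_chunk = [0.001] * 160  # about -60 dBFS, needs a low threshold
normal_chunk = [0.1] * 160     # about -20 dBFS
print(is_speech(whisper_chunk))  # False at the default -35 dB threshold
print(is_speech(normal_chunk))   # True
```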
- Instant – text appears immediately, fades after timeout.
- Buffered (paced) – sentences queue up and are shown in chunks. Each chunk stays on screen long enough to read at the chosen characters per second (CPS). Ideal for fast speakers or Whisper (which sends whole phrases).
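The buffered mode's pacing rule can be sketched in a few lines; the default rate and the minimum hold time here are illustrative, not the app's actual values:

```python
# Illustrative sketch: each chunk stays on screen long enough to read at the
# chosen characters-per-second rate, with a floor so short chunks don't flash by.
def chunk_duration(text: str, cps: float = 15.0, min_seconds: float = 1.5) -> float:
    return max(len(text) / cps, min_seconds)

print(chunk_duration("x" * 45))  # 45 chars at 15 cps -> 3.0 s
print(chunk_duration("Hi"))      # floored at 1.5 s
```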
Add a stroke around recognized and translated text. Useful for better readability on bright backgrounds. Set width (pixels) and color.
Place .ttf, .otf, .woff, or .woff2 files in the fonts/ directory (created automatically). They appear in the font dropdown as [Custom] fontname.
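A plausible sketch of that font discovery (assumed behavior; the app's actual scanning and labeling code may differ):

```python
from pathlib import Path

# Assumed extension set, taken from the supported formats listed above.
FONT_EXTS = {".ttf", ".otf", ".woff", ".woff2"}

def list_custom_fonts(font_dir="fonts"):
    """Return dropdown labels for font files found in font_dir."""
    p = Path(font_dir)
    p.mkdir(exist_ok=True)  # the directory is created automatically
    return sorted(f"[Custom] {f.stem}" for f in p.iterdir()
                  if f.suffix.lower() in FONT_EXTS)
```

Dropping `Roboto.ttf` into `fonts/` would then surface as `[Custom] Roboto` in the dropdown.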
Before starting recognition, click Test Mic (hardware mode) or Test Mic (browser mode) to see the level meter without transcribing. Useful to check if your mic works and to set the VAD threshold.
Each browser tab runs an independent session. Use the Manage Sessions dropdown to close other sessions and free resources.
By default, the pop‑out URL uses a random ID. You can enter a custom ID (letters, numbers, underscores, hyphens) to get a persistent URL, e.g., http://localhost:7860/popout/my_stream.
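The ID rule can be captured with a small validator; this is a sketch of the stated constraint, not the app's actual routing code:

```python
import re

# Letters, numbers, underscores, and hyphens, per the rule above.
ID_RE = re.compile(r"[A-Za-z0-9_-]+")

def popout_url(custom_id: str, base="http://localhost:7860") -> str:
    if not ID_RE.fullmatch(custom_id):
        raise ValueError("ID may contain only letters, numbers, _ and -")
    return f"{base}/popout/{custom_id}"

print(popout_url("my_stream"))  # http://localhost:7860/popout/my_stream
```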
- Linux: the hardware microphone works via `/dev/snd` passthrough (already configured in `docker-compose.yml`).
- macOS / Windows: Docker Desktop does not support `/dev/snd`. Use browser audio mode instead – it works without access to the host's sound device.
Expand the Advanced Whisper Parameters accordion to fine‑tune transcription (temperature, beam size, no‑speech threshold, etc.). See OpenAI Whisper API docs for details.
Open the Display Style accordion to adjust:
- Font family, sizes, colors
- Text alignment
- Translation position (before/after)
- Fade timeout
- Start the app.
- Copy the Popout URL from the UI (e.g., `http://localhost:7860/popout/abc123`).
- In OBS, add a Browser Source and paste the URL.
- Set desired width/height (e.g., 1920×200).
- Optionally add custom CSS to remove background.
```text
voice-translator/
├── app.py                    # Main application
├── translators.py            # Translation service (AI, LibreTranslate, internal)
├── logger.py                 # Logging module
├── requirements.txt          # Python dependencies
├── download_vosk_models.py   # Vosk model downloader
├── download_argos_model.py   # Argos Translate model downloader
├── Dockerfile                # Docker configuration
├── docker-compose.yml        # Docker Compose configuration
├── README.md                 # This file
├── QUICKSTART.md             # Quick start guide
├── TROUBLESHOOTING.md        # Troubleshooting guide
├── CONFIG_EXAMPLES.txt       # Example configurations
├── models/                   # Vosk models directory
├── argos_models/             # Argos Translate models directory
└── logs/                     # Application logs
```
| Argument | Description |
|---|---|
| `--host` | Host to bind to (default: `localhost`) |
| `--port` | Port to bind to (default: `7860`) |
| `--share` | Create a public share link (Gradio feature) |
- `GRADIO_SERVER_NAME` – set to `0.0.0.0` inside the container
- `GRADIO_SERVER_PORT` – default `7860`
- All activities are logged in real‑time in the UI (last 50 entries).
- Logs are also saved to logs/ with session identifiers.
- Live Streaming – real‑time translation overlay for multilingual streams
- Presentations – live translation for international audiences
- Meetings – real‑time transcription and translation
- Accessibility – speech‑to‑text with translation support
- Language Learning – see translations as you practice speaking
See the TROUBLESHOOTING.md file for common issues and solutions.
Contributions are welcome! Areas for improvement:
- Additional translation backends
- More display customization options
- Performance optimizations
- Additional language models
- UI/UX enhancements
This project uses several open‑source components:
- Vosk – Apache 2.0 License
- Gradio – Apache 2.0 License
- Argos Translate – MIT License
- translators – MIT License
- Vosk – speech recognition toolkit
- Argos Translate – offline translation library
- Gradio – web UI framework
- LibreTranslate – free and open‑source translation API
- translators – multi‑engine translation library
- Ollama – local AI model runner
Happy Translating! 🌍🎤✨
# 🚀 QUICK START GUIDE

## What's New

- **Argos Translate** – offline translation (no internet needed)
- **Whisper API** – use a Whisper server for transcription/translation
- **Multiple translation backends** – AI, LibreTranslate, internal, whisper_translate
- **Pop‑out display** – perfect for OBS overlays
- **Session isolation** – each browser tab is independent

## Fastest Way to Get Started

### 1. Run the setup script

**Linux/macOS:**

```bash
chmod +x setup.sh
./setup.sh
```

**Windows:**

```text
setup.bat
```

### 2. Download a Vosk model (for offline recognition)

```bash
# Activate the virtual environment (if not already active)
source venv/bin/activate  # or venv\Scripts\activate on Windows
python download_vosk_models.py en-us-small  # 40 MB English model
```

### 3. (Optional) Download Argos models for offline translation

```bash
python download_argos_model.py en es     # English → Spanish
python download_argos_model.py --common  # common language pairs
```

### 4. Start the app

```bash
python app.py
```

Open your browser to http://localhost:7860.

## First‑Time UI Setup

1. Choose Recognition Engine (Vosk recommended for offline)
2. Select a Vosk model from the dropdown
3. Pick your microphone (hardware mode)
4. Enable Translation and choose a mode:
   - Argos – offline, requires models
   - AI – for Ollama/OpenAI
   - Whisper Translate – if you have a Whisper server
   - LibreTranslate – self‑hosted or cloud
   - Internal – easiest (uses Google Translate)
5. Set languages (e.g., source: `en-US`, target: `es`)
6. Click Start

## OBS Integration

1. After starting, copy the Popout URL from the UI.
2. In OBS, add a Browser Source and paste the URL.
3. Set width/height (e.g., 1920×200).
4. (Optional) Add custom CSS to remove the background:

```css
body { background-color: rgba(0,0,0,0); overflow: hidden; }
```

## Next Steps

- Read the full README.md for detailed configuration.
- Check CONFIG_EXAMPLES.txt for ready‑to‑use setups.
- If something doesn't work, see TROUBLESHOOTING.md.











