9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -9,6 +9,15 @@ repos:
      - id: check-merge-conflict
      - id: debug-statements

  - repo: https://github.com/PyCQA/autoflake
    rev: v2.3.1
    hooks:
      - id: autoflake
        args:
          - --in-place
          - --remove-unused-variables
          - --remove-all-unused-imports

  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
525 changes: 128 additions & 397 deletions README.md

Large diffs are not rendered by default.

27 changes: 27 additions & 0 deletions docs/benchmarks_and_performance.md
@@ -0,0 +1,27 @@
# Performance and Benchmarks

VoiceGenHub is designed for both local CPU-only systems and GPU-accelerated environments.

## Performance Comparison (Single Job)

| Provider | Quality (MOS) | Startup Time | Sequential (per req) | Async (3x parallel) | Model Size | Commercial |
|----------|---------------|--------------|---------------------|-------------------|------------|------------|
| **Edge TTS** | 3.8/5 | 4.9s | 3.2s | 2.5s | 0MB (cloud) | ✅ Free |
| **Kokoro** | 3.5/5 | 94s | 14.2s | 2.5s | 625MB | ✅ Apache 2.0 |
| **Bark** | 4.2/5 | 180s | 25-40s | 8-12s | 4GB | ✅ MIT |
| **Chatterbox** | 4.3/5 | 120s | 15-30s | 5-15s | 3.7GB | ✅ MIT |
| **ElevenLabs** | 4.5/5* | 2s | 3-5s | 2-3s | 0MB (cloud) | ⚠️ Paid API |

*ElevenLabs quality estimate based on reputation; not yet tested.*

## Concurrency Analysis (Chatterbox)

- **Memory Safety**: Chatterbox uses a **shared model instance** (3.6GB) across all threads — **no duplication**.
- **Performance**: ~2.8x speedup at 4 threads on CPU. Optimal thread count: **2-4 threads**.
- **Async Concurrency**: Safe to use 2-8 concurrent threads without OOM risk.
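
The shared-instance pattern described above can be sketched as follows. This is a minimal illustration, not Chatterbox's actual API: `DummyModel` and `synthesize_all` are hypothetical names, and the stand-in model only encodes text so the example runs without any TTS dependency.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DummyModel:
    """Hypothetical stand-in for a large TTS model (not Chatterbox's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def synthesize(self, text: str) -> bytes:
        # Protect the shared counter; a real model may or may not need locking.
        with self._lock:
            self.calls += 1
        return text.encode("utf-8")


def synthesize_all(model, texts, max_workers=4):
    """Run every request against the same model instance."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(model.synthesize, texts))


shared_model = DummyModel()  # created once, reused by all worker threads
outputs = synthesize_all(shared_model, ["one", "two", "three"], max_workers=4)
print(len(outputs), "clips from", shared_model.calls, "calls on one model")
```

Because only one model object exists, memory does not grow with the thread count. For reference, a ~2.8x speedup at 4 threads corresponds to roughly 70% parallel efficiency (2.8 / 4).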

## [View Concurrency Plot](assets/concurrency_plot.html)
Interactive performance analysis showing speedup curves, memory usage, and timing breakdowns.

---
*For more details on Kaggle GPU benchmarks, see the [Kaggle Remote GPU Generation](kaggle_gpu.md) guide.*
51 changes: 51 additions & 0 deletions docs/cloning_and_design.md
@@ -0,0 +1,51 @@
# Voice Cloning and Design

VoiceGenHub supports both zero-shot voice cloning (from audio samples) and voice design (from textual descriptions).

## 1. Voice Cloning with [Chatterbox](https://github.com/rsxdalv/chatterbox)

### Steps

1. **Generate a Reference Audio** (or use an existing sample):
```bash
voicegenhub synthesize "Sample text for cloning." \
--provider kokoro \
--voice kokoro-am_michael \
--output reference.wav
```

2. **Clone the Voice**:
```bash
voicegenhub synthesize "Your text to be synthesized in the cloned voice." \
--provider chatterbox \
--audio-prompt reference.wav \
--output cloned_voice.wav
```

3. **Adjust Emotion and Style**:
```bash
voicegenhub synthesize "Your text." \
--provider chatterbox \
--audio-prompt reference.wav \
--exaggeration 0.8 \
--cfg-weight 0.7
```
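
When the cloning steps above are driven from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown in the steps; `build_clone_command` is a hypothetical helper, not part of VoiceGenHub, and it only builds the argument list without running anything.

```python
import shlex


def build_clone_command(text, audio_prompt, output,
                        exaggeration=None, cfg_weight=None):
    """Assemble the argv list for a Chatterbox cloning call (hypothetical helper)."""
    cmd = [
        "voicegenhub", "synthesize", text,
        "--provider", "chatterbox",
        "--audio-prompt", audio_prompt,
        "--output", output,
    ]
    # Optional style controls are appended only when requested.
    if exaggeration is not None:
        cmd += ["--exaggeration", str(exaggeration)]
    if cfg_weight is not None:
        cmd += ["--cfg-weight", str(cfg_weight)]
    return cmd


command = build_clone_command(
    "Your text.", "reference.wav", "cloned_voice.wav",
    exaggeration=0.8, cfg_weight=0.7,
)
print(shlex.join(command))  # shell-quoted form, convenient for logging
```

Passing the list to `subprocess.run(command, check=True)` would execute it without shell quoting issues, assuming the `voicegenhub` CLI is installed and on the PATH.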

### Tips for Better Quality
- Use clear, noise-free reference audio; 5-10 seconds is recommended.
- Chatterbox supports **multilingual cloning**: the reference audio can be in one supported language and the synthesized speech in another.

## 2. Voice Design with [Qwen 3 TTS](https://github.com/QwenLM/Qwen3-TTS)

*Requires `Qwen3-TTS-VoiceDesign` model for full control, available via Python API or remote GPU.*

### Qwen 3 TTS Voice Design Features

- **Natural Language Instruction**: Design custom voices using descriptions.
- **Example Voice Designs**:
  - `"Female, 25 years old, cheerful and energetic, slightly high-pitched with playful intonation"`
  - `"Male, 17 years old, gaining confidence, deeper breath support, vowels tighten when nervous"`
  - `"Elderly male, 70 years old, wise and gentle, slightly raspy with warm timbre"`
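
The example descriptions follow a loose "gender, age, traits" pattern. A small helper can keep such descriptions consistent when many voices are designed; the sketch below is purely illustrative, and `VoiceSpec` is a hypothetical name that is not part of VoiceGenHub or Qwen 3 TTS.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSpec:
    """Hypothetical container for the parts of a voice description."""

    gender: str
    age: int
    traits: list = field(default_factory=list)

    def describe(self) -> str:
        # Join the parts into one comma-separated natural-language instruction.
        parts = [self.gender, f"{self.age} years old", *self.traits]
        return ", ".join(parts)


spec = VoiceSpec(
    "Female",
    25,
    ["cheerful and energetic", "slightly high-pitched with playful intonation"],
)
description = spec.describe()
print(description)
```

The resulting string reproduces the first example above and could be passed wherever the voice design model expects its text instruction.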

---
*For more details on Qwen 3 TTS design modes, see the [Qwen 3 TTS documentation](https://github.com/QwenLM/Qwen3-TTS).*
66 changes: 66 additions & 0 deletions docs/installation.md
@@ -0,0 +1,66 @@
# Installation and Requirements

Detailed installation guide for various TTS providers and optional features.

## Basic Installation

```bash
pip install voicegenhub
```

## Optional Provider Dependencies

To use certain providers, you need to install their respective dependencies:

```bash
# Kokoro TTS (Lightweight, self-hosted)
# Extras are quoted so that shells such as zsh do not expand the brackets.
pip install "voicegenhub[kokoro]"

# Bark TTS (High Quality, MIT)
pip install "voicegenhub[bark]"

# Chatterbox TTS (High Quality, MIT)
pip install chatterbox-tts

# Qwen 3 TTS (State-of-the-Art, Apache 2.0)
pip install "voicegenhub[qwen]"

# ElevenLabs TTS (Commercial)
pip install elevenlabs
```
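
After installing, it can be useful to check which optional providers are importable. The sketch below uses only the standard library; the import names in `PROVIDER_MODULES` are assumptions about how each package is imported and may need adjusting.

```python
import importlib.util

# Assumed import names for each optional provider package.
PROVIDER_MODULES = {
    "kokoro": "kokoro",
    "bark": "bark",
    "chatterbox": "chatterbox",
    "elevenlabs": "elevenlabs",
}


def installed_providers(modules=PROVIDER_MODULES):
    """Map each provider name to True if its module can be located."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


status = installed_providers()
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for providers that load large models on import.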

---

## Dependencies

### Voice Cloning Requirements (Chatterbox)

For voice cloning features with Chatterbox TTS:

```bash
pip install "voicegenhub[voice-cloning]"
```

**System Requirements:**
- **FFmpeg**: Required when `torchcodec` is installed for voice cloning.
- **PyTorch**: Required for local model execution.

**Windows Installations**: Download the "full-shared" FFmpeg build from [ffmpeg.org](https://ffmpeg.org/download.html#build-windows) and add the `bin` directory to your system PATH.
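
A quick way to confirm that FFmpeg is reachable after updating the PATH is to look it up from Python. This is a minimal sketch using the standard library; `find_ffmpeg` is a hypothetical helper, not a VoiceGenHub function.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(executable)


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("FFmpeg not found on PATH; voice cloning may fail to load audio.")
else:
    print(f"FFmpeg found at {ffmpeg_path}")
```

On Windows, open a new terminal after editing the PATH so the change is visible to the Python process.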

---

## Technical Note: CUDA and CPU Execution

- VoiceGenHub automatically detects whether a GPU is available.
- For **Chatterbox** and **Bark**, the library falls back to **CPU execution** if no GPU is found.
- For **Qwen 3 TTS**, **GPU acceleration** (remote or local) is recommended for the high-quality 1.7B models.
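
The detect-then-fall-back behaviour can be sketched with PyTorch's `torch.cuda.is_available()`. This is an illustration of the general pattern, not VoiceGenHub's internal code; `pick_device` is a hypothetical helper that also handles the case where PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    # Without PyTorch there is nothing to accelerate, so report CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string is the form PyTorch expects in calls such as `model.to(device)`.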

---

## Windows & Python 3.13+ (Kokoro)

On Windows with Python 3.13+, **Kokoro TTS** may require Microsoft Visual C++ Build Tools for compilation if pre-built wheels are not available.

1. Download [Microsoft Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/).
2. Select the "Desktop development with C++" workload.
3. Restart the terminal and retry the installation.
52 changes: 52 additions & 0 deletions docs/kaggle_gpu.md
@@ -0,0 +1,52 @@
# Kaggle Remote GPU Generation

Generate high-quality Qwen3-TTS audio using remote Kaggle GPUs (P100 or T4x2). This is useful for high-quality 1.7B models when you don't have a local GPU.

## Prerequisites

1. **Kaggle API Credentials**:
- Go to [Kaggle Settings](https://www.kaggle.com/settings) → API → Create New Token.
- Save the `kaggle.json` to `~/.kaggle/kaggle.json` (on Windows: `%USERPROFILE%\.kaggle\kaggle.json`).
2. **Kaggle CLI**:
```bash
pip install kaggle
```
3. **Kaggle Internet Access**:
- Complete phone verification on your Kaggle account; this enables internet access in kernels.
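
Before triggering a remote run, the credentials file from the first prerequisite can be checked from Python. The sketch below assumes the default location described above; both helper functions are hypothetical and not part of VoiceGenHub or the Kaggle CLI.

```python
from pathlib import Path


def kaggle_credentials_path(home=None) -> Path:
    """Return the expected kaggle.json location under the given home directory."""
    # Path.home() resolves to the user profile directory on Windows as well.
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_credentials(home=None) -> bool:
    """True if the credentials file exists at the expected location."""
    return kaggle_credentials_path(home).is_file()


if has_kaggle_credentials():
    print("Kaggle credentials found.")
else:
    print(f"Missing credentials: expected {kaggle_credentials_path()}")
```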

## Usage

Use the `--gpu` flag with the `synthesize` command to trigger remote generation.

### P100 GPU (default)

```bash
voicegenhub synthesize "Hello from the remote P100!" --gpu
```

### T4 x 2 GPU

```bash
voicegenhub synthesize "Hello from the remote T4!" --gpu --gpu-type t4
```

### Advanced Usage

```bash
voicegenhub synthesize "Chinese test." \
--gpu \
--gpu-type p100 \
--voice Serena \
--language zh \
--output ./remote_output/serena.wav
```

## How It Works

1. **Automation**: VoiceGenHub generates a Jupyter notebook cell-by-cell.
2. **Deployment**: It pushes the notebook to Kaggle using the specified accelerator (`nvidia-p100-1` or `nvidia-t4-2`).
3. **Execution**: On Kaggle, the notebook installs necessary dependencies (`transformers`, `qwen-tts`), loads the model onto the GPU, and generates the audio.
4. **Syncing**: The CLI polls for completion and automatically downloads the generated `.wav` file into a local timestamped directory (or your specified output path).
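
The polling in step 4 follows a common wait-until-done pattern, sketched below. This is not VoiceGenHub's implementation: `wait_for_completion`, the `get_status` callable, and the status strings `"complete"` and `"error"` are all hypothetical placeholders for whatever the real kernel status check returns.

```python
import time


def wait_for_completion(get_status, interval=10.0, timeout=600.0):
    """Poll get_status() until it reports a final state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("complete", "error"):  # hypothetical final states
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Simulated run: the fake status source finishes on the third poll.
statuses = iter(["queued", "running", "complete"])
result = wait_for_completion(lambda: next(statuses), interval=0.0)
print(result)
```

Given the 2-4 minute setup time noted below, a polling interval of several seconds keeps the number of status requests small without delaying the download noticeably.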

---
*Note: Remote generation takes approximately 2-4 minutes due to environment setup on Kaggle's side.*
19 changes: 19 additions & 0 deletions docs/licensing.md
@@ -0,0 +1,19 @@
# Licensing and Commercial Usage

VoiceGenHub integrates TTS providers that are distributed under a mix of open-source licenses and commercial terms.

## Commercially Safe Models (summary)
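- **Bark** (MIT License) - Commercial use allowed; copies must retain the copyright and license notice.
- **Chatterbox** (MIT License) - Commercial use allowed; copies must retain the copyright and license notice.
- **Qwen 3 TTS** (Apache 2.0) - Commercial use allowed; redistributions must retain the license text and attribution notices.
- **Kokoro** (Apache 2.0) - Commercial use allowed; redistributions must retain the license text and attribution notices.
- **Edge TTS** (Microsoft) - Commercial use is subject to the Microsoft Terms of Use linked below.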
- **ElevenLabs** (Paid API) - Commercial use with valid subscription.

### Provider Licenses (links)
- **Edge TTS (Microsoft)**: [Microsoft Terms of Use](https://www.microsoft.com/en-us/legal/terms-of-use)
- **Kokoro TTS**: [Apache License 2.0](https://github.com/hexgrad/kokoro/blob/main/LICENSE)
- **ElevenLabs TTS**: [ElevenLabs Terms of Service](https://elevenlabs.io/terms)
- **Bark TTS**: [MIT License](https://github.com/suno-ai/bark/blob/main/LICENSE)
- **Chatterbox TTS**: [MIT License](https://github.com/rsxdalv/chatterbox/blob/main/LICENSE)
- **Qwen 3 TTS**: [Apache License 2.0](https://github.com/QwenLM/Qwen3-TTS/blob/main/LICENSE)
51 changes: 51 additions & 0 deletions docs/providers.md
@@ -0,0 +1,51 @@
# TTS Providers Detail

VoiceGenHub supports multiple free and commercial TTS providers.

## [Chatterbox TTS](https://github.com/rsxdalv/chatterbox) (MIT)
Multilingual TTS with emotion control and voice cloning.

### Features
- **Model selection via voice**: Choose between standard, turbo, or multilingual models.
- Emotion/intensity control with the `exaggeration` parameter (range 0.0-1.0).
- Zero-shot voice cloning from audio samples.
- Built-in Perth watermarking for responsible AI.

### Supported Languages
ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh

---

## [Qwen 3 TTS](https://github.com/QwenLM/Qwen3-TTS) (Apache 2.0)
State-of-the-art multilingual TTS with voice design and cloning.

### Features
- **Three generation modes**: CustomVoice, VoiceDesign, VoiceClone.
- **10 languages**: Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish.
- **Native speakers**: Automatic selection of native speakers per language.
- **Ultra-low latency**: Streaming generation supported.

---

## [Bark TTS](https://github.com/suno-ai/bark) (MIT)
Self-hosted high-naturalness TTS with prosody control.

### Features
- Prosody markers: `[laughs]`, `[sighs]`, `[pause]`, `[whisper]`.
- 100+ speaker presets.
- Sound effects generation.
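
Since prosody markers are plain bracketed tokens inside the input text, a typo in a marker is easy to miss. The sketch below checks a text against the four markers listed above; `unknown_markers` is a hypothetical helper, and the marker set is taken from this document rather than from Bark's own documentation.

```python
import re

# Markers as listed in the feature list above.
KNOWN_MARKERS = {"laughs", "sighs", "pause", "whisper"}


def unknown_markers(text):
    """Return bracketed tokens in the text that are not in KNOWN_MARKERS."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [m for m in found if m.strip().lower() not in KNOWN_MARKERS]


sample = "Well [sighs] that was close [laughs] and then [cough]."
print(unknown_markers(sample))
```

An empty result means every bracketed token in the text matches one of the listed markers.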

---

## [Kokoro TTS](https://github.com/hexgrad/kokoro) (Apache 2.0)
Self-hosted, extremely lightweight and fast.

---

## [Microsoft Edge TTS](https://github.com/rany2/edge-tts) (Free Cloud)
Fast, high-quality cloud-based voices.

---

## [ElevenLabs TTS](https://elevenlabs.io) (Commercial)
Premium high-quality voices (requires API key).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "voicegenhub"
-version = "1.1.5"
+version = "2.0.0"
description = "Simple Text-to-Speech library supporting multiple providers"
authors = ["leweex95 <csibi.levente14@gmail.com>"]
readme = "README.md"