Python speaker diarization with OpenAI Whisper ASR & pyannote.audio for accurate multi‑speaker transcription & labeling.
Now includes a CLI tool for easy audio transcription!
- Free software: MIT license
System Requirements:
WhisperPy requires cmake and build tools to compile some dependencies. Install them first:
Manjaro/Arch Linux:
sudo pacman -S cmake base-devel
Ubuntu/Debian:
sudo apt update
sudo apt install cmake build-essential
CentOS/RHEL/Fedora:
# Fedora
sudo dnf install cmake gcc-c++ make

# CentOS/RHEL
sudo yum install cmake gcc-c++ make
macOS:
# Using Homebrew
brew install cmake

# Or install Xcode Command Line Tools
xcode-select --install
Windows:
# Install Visual Studio Build Tools and CMake
# Or use conda:
conda install cmake
Package Installation:
After installing system dependencies:
pip install -e .
Alternative: Using Conda (Recommended for complex environments):
# This handles cmake and other dependencies automatically
conda install -c conda-forge cmake
pip install -e .
Note: This package includes both audio transcription (OpenAI Whisper) AND speaker diarization (pyannote.audio) as core features.
After installation, you can use the whipy command to transcribe audio files with automatic speaker identification:
Basic Usage:
whipy transcribe audio_file.wav
This will:

1. Transcribe the audio using OpenAI Whisper
2. Identify the different speakers using pyannote.audio
3. Save the results with speaker labels as audio_file.json
HuggingFace Token Setup:
For speaker diarization, set your HuggingFace token:
whipy set-token YOUR_HUGGINGFACE_TOKEN
Get your token from: https://huggingface.co/settings/tokens
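This token is what allows pyannote.audio to download its gated diarization model. If you ever use pyannote.audio directly (outside of whipy), passing a token looks roughly like the sketch below; the model name here is only an example, and some pyannote.audio versions take token= instead of use_auth_token=:

import os
from pyannote.audio import Pipeline

# Assumptions: the model name is illustrative, and the token is read from
# the HUGGINGFACE_TOKEN environment variable mentioned under Runtime Issues.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ.get("HUGGINGFACE_TOKEN"),
)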
Command Options:
- -m, --model: Choose the Whisper model (tiny, base, small, medium, large, large-v2, large-v3). Default: base
- -o, --output: Specify the output file path. Default: same as the input with a .json extension
- -l, --language: Specify a language code for better accuracy. Auto-detected if not specified
- -f, --format: Output format: 'json' for structured data or 'txt' for readable text. Default: json
- --boost: Enable GPU acceleration for faster processing (requires a CUDA-compatible GPU)
- -v, --verbose: Enable verbose output to see processing details
GPU Acceleration:
WhisperPy supports GPU acceleration for significantly faster processing:
# Check if GPU acceleration is available
whipy gpu-status

# Use GPU acceleration (much faster)
whipy transcribe --boost my_audio.mp3

# GPU acceleration with a larger model and verbose output
whipy transcribe --boost -m large -v my_audio.wav
Requirements for GPU acceleration (you can verify them with the check below):
- CUDA-compatible NVIDIA GPU
- CUDA toolkit installed
- PyTorch with CUDA support
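If you are not sure whether your setup qualifies, PyTorch's standard CUDA API gives a quick answer (plain torch calls, independent of whipy):

import torch

# True only if PyTorch was built with CUDA support and a compatible GPU is visible
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))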
Examples:
# Basic transcription with speaker identification (default behavior)
whipy transcribe my_audio.mp3

# Use GPU acceleration for faster processing
whipy transcribe --boost my_audio.mp3

# Use a larger model with GPU acceleration
whipy transcribe --boost -m large my_audio.wav

# Save as text format instead of JSON
whipy transcribe -f txt my_audio.mp3

# Specify output file and language with GPU boost
whipy transcribe --boost -o transcript.json -l en my_audio.m4a

# Verbose output to see what's happening
whipy transcribe -v -m medium my_audio.flac
Output Formats:
- JSON format (default): Structured output perfect for LLMs with metadata, speaker segments, conversation timeline, and full transcript
- TXT format: Human-readable conversation with timestamps and speaker labels
Supported audio formats: WAV, MP3, M4A, FLAC, OGG, and many others supported by FFmpeg.
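Since the JSON output is plain JSON, it is straightforward to post-process. A minimal sketch follows; the key names "segments", "speaker", and "text" are assumptions about the schema, so inspect a real output file first:

import json

# Load a transcript produced by `whipy transcribe my_audio.wav`.
with open("my_audio.json") as f:
    data = json.load(f)

# Assumed key names, not a documented schema; adjust after
# inspecting your own output.
for seg in data.get("segments", []):
    print(f"{seg.get('speaker', '?')}: {seg.get('text', '').strip()}")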
Installation Issues:
If you get cmake-related errors during installation:
# Make sure cmake is installed (see System Requirements above)
cmake --version

# If sentencepiece fails to compile, try installing it via conda:
conda install -c conda-forge sentencepiece
pip install -e .
Runtime Issues:
- "HUGGINGFACE_TOKEN not found": Set your token using
whipy set-token YOUR_TOKEN - CUDA/GPU issues: WhisperPy works on CPU by default. For GPU acceleration, ensure PyTorch CUDA is properly installed
- Audio format issues: Convert your audio to a common format like WAV or MP3 if you encounter format-related errors
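For example, a quick re-encode with FFmpeg (which Whisper already relies on for decoding) can be scripted from Python; the input file name is a placeholder:

import subprocess

# Re-encode a problematic file to 16 kHz mono WAV before transcribing.
# Requires the ffmpeg binary on PATH.
subprocess.run(
    ["ffmpeg", "-i", "my_audio.webm", "-ar", "16000", "-ac", "1", "my_audio.wav"],
    check=True,
)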
For programmatic use:
import whisperpy_diarizer
# Use the CLI functions programmatically
from whisperpy_diarizer.cli import perform_diarization, match_segments_with_speakers
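The signatures of those helpers are not documented here, so the sketch below instead reproduces the equivalent pipeline with the underlying libraries' public APIs (openai-whisper and pyannote.audio). The segment-to-speaker matching is a naive midpoint heuristic, not necessarily whipy's exact logic, and the model name is an example:

import os

import whisper
from pyannote.audio import Pipeline

AUDIO = "my_audio.wav"  # placeholder path

# 1. Transcribe with OpenAI Whisper.
model = whisper.load_model("base")
result = model.transcribe(AUDIO)

# 2. Diarize with pyannote.audio (needs the HuggingFace token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ.get("HUGGINGFACE_TOKEN"),
)
diarization = pipeline(AUDIO)

# 3. Label each Whisper segment with the speaker whose turn contains
# the segment's midpoint (a simple heuristic, not whipy's internals).
for seg in result["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    speaker = "UNKNOWN"
    for turn, _, label in diarization.itertracks(yield_label=True):
        if turn.start <= midpoint <= turn.end:
            speaker = label
            break
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text'].strip()}")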
Development:

To run all the tests, run:

tox
Note: to combine the coverage data from all the tox environments, run:
| Platform | Command |
|---|---|
| Windows | set PYTEST_ADDOPTS=--cov-append, then tox |
| Other | PYTEST_ADDOPTS=--cov-append tox |