Modular voice-to-AI-to-voice bridge. Listens in a real-time voice channel, transcribes speech locally, sends text to an AI backend, and plays back the AI response as speech.
First implementation targets Discord, but the architecture cleanly separates platform-specific logic from core voice/AI logic for future transports.
┌─────────────────────────────────────────────────────────┐
│ voice-bridge │
│ │
│ ┌───────────┐ ┌─────┐ ┌────┐ ┌─────┐ │
│ │ Transport │───▶│ STT │───▶│ AI │───▶│ TTS │──┐ │
│ │ (Discord) │ └─────┘ └────┘ └─────┘ │ │
│ │ │◀─────────────────────────────────┘ │
│ └───────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ discord.js whisper.cpp OpenClaw Piper TTS │
│ @discordjs/ (HTTP) (HTTP) (HTTP) │
│ voice │
└─────────────────────────────────────────────────────────┘
- Node.js 22+
- pnpm
- Docker & Docker Compose (for whisper + piper servers)
- A Discord bot token
- Go to Discord Developer Portal
- Create a new application
- Go to Bot → click Reset Token → copy the token
- Enable these Privileged Gateway Intents:
- Message Content Intent
- Go to OAuth2 → URL Generator:
- Scopes:
bot,applications.commands - Bot Permissions:
Connect,Speak,Use Voice Activity
- Scopes:
- Use the generated URL to invite the bot to your server
- Copy your Application ID (Client ID) and Guild ID (Server ID)
Copy .env.example to .env and fill in your values:
cp .env.example .env| Variable | Required | Description |
|---|---|---|
DISCORD_TOKEN |
Yes | Bot token |
DISCORD_CLIENT_ID |
Yes | Application/Client ID |
DISCORD_GUILD_ID |
Yes | Server ID for slash commands |
DISCORD_LISTEN_USER_ID |
No | Only respond to this user |
WHISPER_URL |
No | Whisper server URL (default: http://whisper:8080) |
PIPER_URL |
No | Piper server URL (default: http://piper:5000) |
OPENCLAW_URL |
Yes | OpenClaw API base URL |
OPENCLAW_API_KEY |
Yes | OpenClaw API key |
OPENCLAW_MODEL |
No | Model identifier (default: default) |
OPENCLAW_SYSTEM_PROMPT |
No | Custom system prompt |
cp .env.example .env
# Edit .env with your credentials
docker compose up --buildThis starts:
- voice-bridge — the Node.js bot
- whisper — whisper.cpp HTTP server (speech-to-text)
- piper — Piper TTS server (text-to-speech)
pnpm install
pnpm build
# Register slash commands (one-time)
pnpm deploy-commands
# Run the bot
pnpm start
# Or watch mode for development
pnpm dev| Command | Description |
|---|---|
/join |
Join your current voice channel |
/leave |
Leave the voice channel |
/status |
Show connection status |
pnpm testThe project is built around four core interfaces that are completely platform-agnostic:
VoiceTransport— join/leave voice, receive audio streams, play audio backSpeechToText— audio buffers → textTextToSpeech— text → audio buffersAIBackend— text in → text out
The Pipeline orchestrator wires these together: transport → STT → AI → TTS → transport playback.
src/
├── core/
│ ├── interfaces.ts # Platform-agnostic interfaces
│ ├── pipeline.ts # Pipeline orchestrator
│ └── index.ts # Core exports
├── transports/
│ └── discord/
│ ├── transport.ts # Discord VoiceTransport implementation
│ ├── commands.ts # Slash command definitions + handlers
│ ├── deploy-commands.ts # One-time command registration
│ └── index.ts
├── stt/
│ └── whisper.ts # Whisper.cpp STT implementation
├── tts/
│ └── piper.ts # Piper TTS implementation
├── ai/
│ └── openclaw.ts # OpenClaw AI implementation
├── config.ts # Environment config loader
└── index.ts # Entry point
Implement the VoiceTransport interface from src/core/interfaces.ts:
import type { VoiceTransport, UserAudioStream } from "./core/interfaces.js";
class MyTransport implements VoiceTransport {
async join(channelId: string): Promise<void> { /* ... */ }
async leave(): Promise<void> { /* ... */ }
onUserAudio(handler: (userAudio: UserAudioStream) => void): void { /* ... */ }
async playAudio(audio: Buffer): Promise<void> { /* ... */ }
isConnected(): boolean { /* ... */ }
}Then wire it into a Pipeline instance — the STT, AI, and TTS modules work unchanged.
Implement the corresponding interface:
// STT
class MySTT implements SpeechToText {
async transcribe(audio: Buffer): Promise<string> { /* ... */ }
}
// TTS
class MyTTS implements TextToSpeech {
async synthesize(text: string): Promise<Buffer> { /* ... */ }
}
// AI
class MyAI implements AIBackend {
async chat(userMessage: string): Promise<string> { /* ... */ }
}ISC