A real-time, voice-to-voice AI pipeline demo featuring a sandwich shop order assistant. Built with LangChain/LangGraph agents and Speechmatics for both speech-to-text and text-to-speech.
The pipeline processes audio through a series of transform stages. Due to the call-and-response nature of the Speechmatics integration, an `AsyncQueue` is used to manage connection state between the components.
```mermaid
flowchart LR
    subgraph Client [Browser]
        Mic[🎤 Microphone] -->|PCM Audio| WS_Out[WebSocket]
        WS_In[WebSocket] -->|Audio + Events| Speaker[🔊 Speaker]
    end
    subgraph Server [Node.js / Python]
        WS_Receiver[WS Receiver] --> Pipeline
        subgraph Pipeline [Voice Agent Pipeline]
            direction LR
            STT[Speechmatics STT] -->|Transcripts| Agent[LangChain Agent]
            Agent -->|Text Chunks| TTS[Speechmatics TTS]
        end
        Pipeline -->|Events| WS_Sender[WS Sender]
    end
    WS_Out --> WS_Receiver
    WS_Sender --> WS_In
```
Each stage is an async generator that transforms a stream of events:

- **STT Stage** (`sttStream`): Streams audio to Speechmatics and yields transcription events (`stt_chunk`, `stt_output`).
- **Agent Stage** (`agentStream`): Passes upstream events through, invokes the LangChain agent on final transcripts, and yields agent responses (`agent_chunk`, `tool_call`, `tool_result`, `agent_end`).
- **TTS Stage** (`ttsStream`): Passes upstream events through, sends agent text to Speechmatics, and yields audio events (`tts_chunk`).
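The transform-stage pattern above can be sketched as an async generator that forwards upstream events and appends its own. The names and event shapes below are illustrative, not the project's actual API:

```typescript
// Minimal sketch of a pipeline transform stage. Event shapes are
// simplified; the real pipeline carries richer payloads.
type PipelineEvent =
  | { type: "stt_chunk"; text: string }
  | { type: "stt_output"; text: string }
  | { type: "agent_chunk"; text: string };

// A hypothetical agent stage: forwards every upstream event, and when a
// final transcript arrives, emits an agent response chunk. In the real
// pipeline this is where the LangChain agent would be invoked.
async function* agentStage(
  upstream: AsyncIterable<PipelineEvent>,
): AsyncGenerator<PipelineEvent> {
  for await (const event of upstream) {
    yield event; // pass upstream events through unchanged
    if (event.type === "stt_output") {
      yield { type: "agent_chunk", text: `You said: ${event.text}` };
    }
  }
}
```

Because each stage has the same shape (`AsyncIterable` in, `AsyncGenerator` out), stages compose by simple nesting, e.g. `ttsStream(agentStream(sttStream(audio)))`.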
This project has been ported to use Speechmatics for both the Speech-to-Text (STT) and Text-to-Speech (TTS) layers, replacing the previous multi-provider pipeline.
Status: Active Development 🛠️
While the core pipeline is functional, this demo serves primarily as a proof of concept, demonstrating how easily the Speechmatics Voice SDK can be integrated into applications like this one.
- **Unified Provider:** Replaced separate STT/TTS services with a single Speechmatics integration.
- **Async Queue:** To accommodate the call-and-response nature of the Speechmatics WebSocket connection, an `AsyncQueue` has been implemented to manage the flow of events and maintain stable connections.
- **Interruption Handling (Barge-in):** The current implementation does not yet support barge-in. If you interrupt the agent, it will continue speaking until it has finished processing all text in its output queue; it will not skip the remaining audio.
- **Speaker Awareness:** By utilizing the Speechmatics Voice SDK, the architecture is now primed to support speaker-aware conversations (diarization), allowing the agent to distinguish between different speakers in real time.
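The `AsyncQueue` mentioned above can be sketched as a small producer/consumer buffer. This is an illustrative implementation, not the project's actual one: WebSocket callbacks push events in, and the async-generator pipeline awaits them in order:

```typescript
// Minimal async queue sketch: decouples push-style producers (e.g.
// WebSocket message handlers) from pull-style consumers (async iteration).
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: ((value: T) => void)[] = [];

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item); // hand off directly to a consumer already waiting
    } else {
      this.items.push(item); // otherwise buffer until someone pops
    }
  }

  async pop(): Promise<T> {
    if (this.items.length > 0) {
      return this.items.shift() as T;
    }
    // Nothing buffered yet: park the consumer until the next push.
    return new Promise<T>((resolve) => this.waiters.push(resolve));
  }
}
```

The key property is that `pop()` never busy-waits: a consumer either drains the buffer immediately or suspends until the next `push()` resolves its promise.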
- Node.js (v18+) or Python (3.11+)
- pnpm or uv (Python package manager)
| Service | Environment Variable | Purpose |
|---|---|---|
| Speechmatics | `SPEECHMATICS_API_KEY` | Unified STT & TTS |
| Anthropic | `ANTHROPIC_API_KEY` | LangChain Agent (Claude) |
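For local development, these variables are typically exported in your shell or placed in a `.env` file (assuming your setup loads one; the values below are placeholders):

```shell
# Placeholder values — replace with your own keys
export SPEECHMATICS_API_KEY="your-speechmatics-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```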
Using Make (Recommended):

```shell
# Install all dependencies
make bootstrap

# Run TypeScript implementation (with hot reload)
make dev-ts

# Or run Python implementation (with hot reload)
make dev-py
```

The app will be available at http://localhost:8000.
TypeScript:

```shell
cd components/typescript
pnpm install
cd ../web
pnpm install && pnpm build
cd ../typescript
pnpm run server
```

Python:

```shell
cd components/python
uv sync --dev
cd ../web
pnpm install && pnpm build
cd ../python
uv run src/main.py
```

```
components/
├── web/          # Svelte frontend (shared by both backends)
│   └── src/
├── typescript/   # Node.js backend
│   └── src/
│       ├── index.ts       # Main server & pipeline
│       └── speechmatics/  # Speechmatics client (Unified STT/TTS)
└── python/       # Python backend
    └── src/
        ├── main.py                 # Main server & pipeline
        ├── speechmatics_client.py  # Speechmatics client
        └── events.py               # Event type definitions
```
The pipeline communicates via a unified event stream:
| Event | Direction | Description |
|---|---|---|
| `stt_chunk` | STT → Client | Partial transcription (real-time feedback) |
| `stt_output` | STT → Agent | Final transcription |
| `agent_chunk` | Agent → TTS | Text chunk from agent response |
| `tool_call` | Agent → Client | Tool invocation |
| `tool_result` | Agent → Client | Tool execution result |
| `agent_end` | Agent → TTS | Signals end of agent turn |
| `tts_chunk` | TTS → Client | Audio chunk for playback |
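The event stream above maps naturally onto a discriminated union, which lets consumers narrow on the `type` field. The field names below are illustrative; the project's actual definitions (e.g. in `events.py`) may differ:

```typescript
// Sketch of the unified event stream as a discriminated union.
type VoiceEvent =
  | { type: "stt_chunk"; text: string }        // partial transcription
  | { type: "stt_output"; text: string }       // final transcription
  | { type: "agent_chunk"; text: string }      // agent response text
  | { type: "tool_call"; name: string; args: unknown }
  | { type: "tool_result"; name: string; result: unknown }
  | { type: "agent_end" }                      // end of agent turn
  | { type: "tts_chunk"; audio: ArrayBuffer }; // audio for playback

// Example: a client-side dispatcher narrowing on the discriminant.
function describe(e: VoiceEvent): string {
  switch (e.type) {
    case "stt_chunk":
    case "stt_output":
    case "agent_chunk":
      return `${e.type}: ${e.text}`;
    case "tool_call":
      return `calling ${e.name}`;
    case "tool_result":
      return `result from ${e.name}`;
    case "agent_end":
      return "turn complete";
    case "tts_chunk":
      return `audio (${e.audio.byteLength} bytes)`;
  }
}
```

Because the switch covers every variant, the compiler will flag any handler that falls out of sync when a new event type is added to the union.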