A real-time, voice-to-voice AI pipeline demo featuring a sandwich shop order assistant. Built with LangChain/LangGraph agents and Speechmatics for both speech-to-text and text-to-speech.
The pipeline processes audio through a series of transform stages. Due to the call-and-response nature of the Speechmatics integration, an `AsyncQueue` is used to manage connection state between the components.
```mermaid
flowchart LR
    subgraph Client [Browser]
        Mic[🎤 Microphone] -->|PCM Audio| WS_Out[WebSocket]
        WS_In[WebSocket] -->|Audio + Events| Speaker[🔊 Speaker]
    end
    subgraph Server [Node.js / Python]
        WS_Receiver[WS Receiver] --> Pipeline
        subgraph Pipeline [Voice Agent Pipeline]
            direction LR
            STT[Speechmatics STT] -->|Transcripts| Agent[LangChain Agent]
            Agent -->|Text Chunks| TTS[Speechmatics TTS]
        end
        Pipeline -->|Events| WS_Sender[WS Sender]
    end
    WS_Out --> WS_Receiver
    WS_Sender --> WS_In
```
Each stage is an async generator that transforms a stream of events:

- **STT Stage** (`sttStream`): Streams audio to Speechmatics and yields transcription events (`stt_chunk`, `stt_output`).
- **Agent Stage** (`agentStream`): Passes upstream events through, invokes the LangChain agent on final transcripts, and yields agent responses (`agent_chunk`, `tool_call`, `tool_result`, `agent_end`).
- **TTS Stage** (`ttsStream`): Passes upstream events through, sends agent text to Speechmatics, and yields audio events (`tts_chunk`).
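The transform-stage pattern above can be sketched as an async generator that forwards upstream events and appends its own. The names and event shapes below are illustrative, not the project's actual API:

```typescript
// Minimal sketch of a pipeline transform stage. Event shapes are
// simplified; the real pipeline carries richer payloads.
type PipelineEvent =
  | { type: "stt_chunk"; text: string }
  | { type: "stt_output"; text: string }
  | { type: "agent_chunk"; text: string };

// A hypothetical agent stage: forwards every upstream event, and when a
// final transcript arrives, emits an agent response chunk. In the real
// pipeline this is where the LangChain agent would be invoked.
async function* agentStage(
  upstream: AsyncIterable<PipelineEvent>,
): AsyncGenerator<PipelineEvent> {
  for await (const event of upstream) {
    yield event; // pass upstream events through unchanged
    if (event.type === "stt_output") {
      yield { type: "agent_chunk", text: `You said: ${event.text}` };
    }
  }
}
```

Because each stage has the same shape (`AsyncIterable` in, `AsyncGenerator` out), stages compose by simple nesting, e.g. `ttsStream(agentStream(sttStream(audio)))`.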
This project has been ported to use Speechmatics for both the Speech-to-Text (STT) and Text-to-Speech (TTS) layers, replacing the previous multi-provider pipeline.
Status: Active Development 🛠️
While the core pipeline is functional, this demo serves primarily as a proof of concept, demonstrating how easily the Speechmatics Voice SDK can be integrated into applications like this one.
- **Unified Provider:** Replaced separate STT/TTS services with a single Speechmatics integration.
- **Async Queue:** To accommodate the call-and-response nature of the Speechmatics WebSocket connection, an `AsyncQueue` has been implemented to manage the flow of events and maintain stable connections.
- **Interruption Handling (Barge-in):** The current implementation does not yet support barge-in. If you interrupt the agent, it will continue speaking until it has finished processing all text in its output queue; it will not skip the remaining audio.
- **Speaker Awareness:** By utilizing the Speechmatics Voice SDK, the architecture is now primed to support speaker-aware conversations (diarization), allowing the agent to distinguish between different speakers in real time.
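The `AsyncQueue` mentioned above can be sketched as a small producer/consumer buffer. This is an illustrative implementation, not the project's actual one: WebSocket callbacks push events in, and the async-generator pipeline awaits them in order:

```typescript
// Minimal async queue sketch: decouples push-style producers (e.g.
// WebSocket message handlers) from pull-style consumers (async iteration).
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: ((value: T) => void)[] = [];

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item); // hand off directly to a consumer already waiting
    } else {
      this.items.push(item); // otherwise buffer until someone pops
    }
  }

  async pop(): Promise<T> {
    if (this.items.length > 0) {
      return this.items.shift() as T;
    }
    // Nothing buffered yet: park the consumer until the next push.
    return new Promise<T>((resolve) => this.waiters.push(resolve));
  }
}
```

The key property is that `pop()` never busy-waits: a consumer either drains the buffer immediately or suspends until the next `push()` resolves its promise.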
- Node.js (v18+) or Python (3.11+)
- pnpm or uv (Python package manager)
| Service | Environment Variable | Purpose |
|---|---|---|
| Speechmatics | `SPEECHMATICS_API_KEY` | Unified STT & TTS |
| Anthropic | `ANTHROPIC_API_KEY` | LangChain Agent (Claude) |
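For local development, these variables are typically exported in your shell or placed in a `.env` file (assuming your setup loads one; the values below are placeholders):

```shell
# Placeholder values — replace with your own keys
export SPEECHMATICS_API_KEY="your-speechmatics-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```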
Using Make (Recommended):

```shell
# Install all dependencies
make bootstrap

# Run TypeScript implementation (with hot reload)
make dev-ts

# Or run Python implementation (with hot reload)
make dev-py
```

The app will be available at http://localhost:8000.
TypeScript:

```shell
cd components/typescript
pnpm install
cd ../web
pnpm install && pnpm build
cd ../typescript
pnpm run server
```

Python:

```shell
cd components/python
uv sync --dev
cd ../web
pnpm install && pnpm build
cd ../python
uv run src/main.py
```

```
components/
├── web/          # Svelte frontend (shared by both backends)
│   └── src/
├── typescript/   # Node.js backend
│   └── src/
│       ├── index.ts       # Main server & pipeline
│       └── speechmatics/  # Speechmatics client (Unified STT/TTS)
└── python/       # Python backend
    └── src/
        ├── main.py                 # Main server & pipeline
        ├── speechmatics_client.py  # Speechmatics client
        └── events.py               # Event type definitions
```
The pipeline communicates via a unified event stream:
| Event | Direction | Description |
|---|---|---|
| `stt_chunk` | STT → Client | Partial transcription (real-time feedback) |
| `stt_output` | STT → Agent | Final transcription |
| `agent_chunk` | Agent → TTS | Text chunk from agent response |
| `tool_call` | Agent → Client | Tool invocation |
| `tool_result` | Agent → Client | Tool execution result |
| `agent_end` | Agent → TTS | Signals end of agent turn |
| `tts_chunk` | TTS → Client | Audio chunk for playback |
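The event stream above maps naturally onto a discriminated union, which lets consumers narrow on the `type` field. The field names below are illustrative; the project's actual definitions (e.g. in `events.py`) may differ:

```typescript
// Sketch of the unified event stream as a discriminated union.
type VoiceEvent =
  | { type: "stt_chunk"; text: string }        // partial transcription
  | { type: "stt_output"; text: string }       // final transcription
  | { type: "agent_chunk"; text: string }      // agent response text
  | { type: "tool_call"; name: string; args: unknown }
  | { type: "tool_result"; name: string; result: unknown }
  | { type: "agent_end" }                      // end of agent turn
  | { type: "tts_chunk"; audio: ArrayBuffer }; // audio for playback

// Example: a client-side dispatcher narrowing on the discriminant.
function describe(e: VoiceEvent): string {
  switch (e.type) {
    case "stt_chunk":
    case "stt_output":
    case "agent_chunk":
      return `${e.type}: ${e.text}`;
    case "tool_call":
      return `calling ${e.name}`;
    case "tool_result":
      return `result from ${e.name}`;
    case "agent_end":
      return "turn complete";
    case "tts_chunk":
      return `audio (${e.audio.byteLength} bytes)`;
  }
}
```

Because the switch covers every variant, the compiler will flag any handler that falls out of sync when a new event type is added to the union.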