Production voice AI agent built on LiveKit Agents with a modular STT-LLM-TTS pipeline. Achieves sub-800ms P95 end-to-end latency at roughly half the cost of managed alternatives such as the OpenAI Realtime API.
    +------------------+        WebRTC         +------------------+
    |    Client App    | <-------------------> |  LiveKit Cloud   |
    +------------------+                       +--------+---------+
                                                        |
                                               +--------+---------+
                                               |  Voice AI Agent  |
                                               +--------+---------+
                                                        |
                            +---------------------------+---------------------------+
                            |                           |                           |
                   +--------+--------+         +--------+--------+         +--------+--------+
                   |  Deepgram STT   |         |   Claude LLM    |         |  Cartesia TTS   |
                   |     Nova 3      |         |    Sonnet 4     |         |     Sonic 3     |
                   +-----------------+         +--------+--------+         +-----------------+
                                                        |
                                               +--------+--------+
                                               |   Tool Layer    |
                                               +-----------------+
- LiveKit Agents v1.5.0 with adaptive interruption handling
- Deepgram Nova 3 STT - streaming transcription, < 200ms P95 latency
- Claude Sonnet 4 LLM with native tool calling and model routing (Haiku for simple turns)
- Cartesia Sonic 3 TTS - natural voice synthesis, < 150ms TTFB
- Adaptive VAD (Silero v5) + semantic turn detection (MultilingualModel, Qwen2.5-0.5B backbone)
- Barge-in detection with ML-based backchannel classification (39% fewer false interruptions)
- False interruption recovery - agent resumes interrupted responses naturally
- Silence handling with configurable 4-level escalation ladder
- Context window management - sliding window + async summarization, capped at 2,000 tokens
- STT confidence filtering (threshold: 0.3) and noise rejection
- Domain boundary guardrails with graceful redirection
- Sentence-aware TTS dispatch for natural prosody and lower latency
- OpenTelemetry tracing with per-turn span hierarchy
- Prometheus metrics (15 metrics covering latency, sessions, errors, tokens, interruptions)
- Pre-built Grafana dashboard with 12 panels (auto-provisioned)
- LLM-as-judge evaluation framework with golden datasets
- 58 automated tests (pytest) with CI quality gates
- Docker + docker-compose for local development with hot reload
- AWS CDK for production deployment (ECS Fargate, ALB, auto-scaling 1-10 tasks)
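
The STT confidence filtering item in the list above can be pictured with a minimal sketch. The `Transcript` dataclass and `accept_transcript` helper are hypothetical names chosen for illustration, not part of the LiveKit or Deepgram SDKs; only the 0.3 threshold comes from the list. Treating the threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.3  # value quoted in the feature list


@dataclass(frozen=True)
class Transcript:
    """Hypothetical container for one final STT result."""
    text: str
    confidence: float


def accept_transcript(t: Transcript, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Return True when the transcript should be forwarded to the LLM."""
    # Empty or whitespace-only results are treated as noise.
    if not t.text.strip():
        return False
    # Assumption: a result exactly at the threshold is accepted.
    return t.confidence >= threshold
```

Low-confidence results are dropped before they reach the LLM, so background noise does not trigger a turn.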
# 1. Clone
git clone https://github.com/jothiswaran/voice-ai-agent.git
cd voice-ai-agent
# 2. Configure
cp .env.example .env
# Add your API keys: LIVEKIT, DEEPGRAM, ANTHROPIC, CARTESIA
# 3. Run
docker compose -f infra/docker-compose.yml up --build

The agent connects to your LiveKit Cloud project and handles incoming voice sessions. Metrics are available at http://localhost:9090/metrics, Prometheus at http://localhost:9091, and Grafana at http://localhost:3000 (admin / voiceagent).
The agent uses a modular pipeline where each component (STT, LLM, TTS) is independently swappable and runs as a streaming stage. The key optimization is pipeline parallelism - STT, LLM, and TTS overlap rather than running sequentially.
    User speaks:     |====== speech ======|--- silence ---|
    STT:             |== streaming transcription ==|
                                                   |
    Turn Detection:                          [end-of-turn]
                                                   |
    LLM:                                           |=== token streaming ===|
                                                           |           |
                                                       sentence1   sentence2
                                                           |
    TTS:                                                   |=== audio ===|=== audio ===|
                                                           |
    Audio out:                                     first audio chunk
                                                           |
                                                ~750ms from end of user speech
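
The overlap between LLM streaming and TTS in the timeline depends on handing each sentence to TTS as soon as it is complete. The sketch below illustrates that idea with plain `asyncio`; it does not use the LiveKit Agents API. The `sentences` and `fake_llm` names are hypothetical, and the regex-based sentence boundary (`.`, `!`, or `?` followed by whitespace) is a simplifying assumption.

```python
import asyncio
import re
from typing import AsyncIterator

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for done in parts[:-1]:
            if done:
                yield done
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


async def fake_llm() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response."""
    for tok in ["Sure", ".", " Your", " table", " is", " booked", "!",
                " Anything", " else", "?"]:
        yield tok


async def collect() -> list[str]:
    """Gather the sentences that would be sent to TTS one by one."""
    return [s async for s in sentences(fake_llm())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

In a real pipeline each yielded sentence would be sent to the TTS stage immediately, so synthesis of the first sentence starts while the LLM is still generating the second.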
| Component | P50 | P95 | Target |
|---|---|---|---|
| VAD + Turn Detection | 60ms | 100ms | 100ms |
| STT Final Transcript | 120ms | 185ms | 200ms |
| LLM TTFT | 250ms | 380ms | 400ms |
| TTS TTFB | 80ms | 135ms | 150ms |
| E2E (with overlap) | 450ms | 720ms | 800ms |
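
To see what the overlap buys, the per-stage numbers in the table can be added up and compared with the end-to-end row. This is a rough sketch: the values are copied from the table, the function names are hypothetical, and summing per-stage P95 values is only an approximation because percentiles are not strictly additive.

```python
# Per-stage latencies in milliseconds, copied from the table above.
STAGES = {
    "vad_turn_detection": {"p50": 60, "p95": 100},
    "stt_final": {"p50": 120, "p95": 185},
    "llm_ttft": {"p50": 250, "p95": 380},
    "tts_ttfb": {"p50": 80, "p95": 135},
}
E2E_WITH_OVERLAP = {"p50": 450, "p95": 720}


def sequential_sum(percentile: str) -> int:
    """Latency if every stage waited for the previous one to finish."""
    return sum(stage[percentile] for stage in STAGES.values())


def overlap_savings(percentile: str) -> int:
    """Milliseconds saved by overlapping stages instead of running them in sequence."""
    return sequential_sum(percentile) - E2E_WITH_OVERLAP[percentile]
```

With these figures, the sequential sums are 510ms (P50) and 800ms (P95), so the reported end-to-end numbers are 60ms and 80ms lower respectively.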
LiveKit's MultilingualModel (fine-tuned Qwen2.5-0.5B) analyzes transcript completeness and prosodic features to determine end-of-turn. This reduces false interruptions by 39% compared to VAD-only detection with fixed silence thresholds.
When the user speaks during agent playback, a classifier determines if it is a backchannel ("uh huh", "yeah") or a real interruption. Backchannels are ignored. Real interruptions stop playback immediately. False interruptions trigger recovery where the agent resumes its interrupted response.
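The three outcomes described above can be sketched as a small decision function. This is an illustration only: a keyword lookup stands in for the ML classifier, an empty transcript is assumed to signal a false interruption, and `Action`, `BACKCHANNELS`, and `handle_overlap` are hypothetical names.

```python
import string
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"                # backchannel: keep talking
    STOP_PLAYBACK = "stop_playback"  # real interruption: yield the floor
    RESUME = "resume"                # false interruption: pick up where we left off


# Assumption: a fixed phrase list replaces the ML-based classifier.
BACKCHANNELS = {"uh huh", "yeah", "mm hmm", "ok", "right"}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def handle_overlap(transcript: str) -> Action:
    """Decide what to do when the user makes sound during agent playback."""
    normalized = transcript.lower().translate(_STRIP_PUNCT).strip()
    if not normalized:
        # Assumption: audio with no intelligible words is a false interruption.
        return Action.RESUME
    if normalized in BACKCHANNELS:
        return Action.IGNORE
    return Action.STOP_PLAYBACK
```

For example, `handle_overlap("Uh huh.")` leaves playback untouched, while `handle_overlap("Wait, cancel that.")` stops it.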
Conversations stay fast regardless of length. A sliding window keeps the last 10 turns in full context. Older turns are asynchronously summarized by Haiku into a ~200 token digest. Total context stays under 2,000 tokens.
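A minimal sketch of the sliding window follows. It is simplified in two ways that should be read as assumptions: summarization runs synchronously rather than in the background, and the 2,000-token cap is not enforced. `ContextWindow` and `naive_summarizer` are hypothetical names; the placeholder summarizer concatenates text instead of calling Haiku.

```python
from collections import deque
from typing import Callable


def naive_summarizer(previous: str, evicted: list[str]) -> str:
    """Placeholder: concatenates and truncates instead of calling an LLM."""
    return (previous + " " + " ".join(evicted)).strip()[:800]


class ContextWindow:
    """Keeps the last `max_turns` turns verbatim and folds older ones into a summary."""

    def __init__(self, summarize: Callable[[str, list[str]], str] = naive_summarizer,
                 max_turns: int = 10) -> None:
        self.recent: deque[str] = deque()
        self.summary = ""
        self.max_turns = max_turns
        self._summarize = summarize

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        evicted: list[str] = []
        # Move turns that fall out of the window into the summary.
        while len(self.recent) > self.max_turns:
            evicted.append(self.recent.popleft())
        if evicted:
            self.summary = self._summarize(self.summary, evicted)

    def prompt_context(self) -> list[str]:
        """Summary first (if any), then the verbatim recent turns."""
        head = [f"Summary of earlier turns: {self.summary}"] if self.summary else []
        return head + list(self.recent)
```

Because the prompt is always one digest plus a fixed number of recent turns, its size stays bounded no matter how long the conversation runs.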
# Run all 58 tests
pytest tests/ -v
# Unit tests (38) - individual components with mocked APIs
pytest tests/unit/ -v
# Integration tests (12) - multi-component flows
pytest tests/integration/ -v
# LLM-as-judge evaluation (8) - quality scoring on golden dataset
pytest tests/test_evaluation.py -v
# Coverage report
pytest tests/ --cov=agent --cov-report=html

| Gate | Threshold |
|---|---|
| All tests passing | 58/58 |
| Code coverage | >= 80% |
| E2E latency (mocked) | P95 < 1000ms |
| LLM-as-judge score | >= 4.0 / 5.0 |
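
The gates in the table can be expressed as a single check. The sketch below is an assumption about how such a check might look, not the project's actual CI script; `CiReport` and `failed_gates` are hypothetical names, and only the thresholds come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiReport:
    """Hypothetical summary of one CI run."""
    tests_passed: int
    tests_total: int
    coverage_pct: float
    e2e_p95_ms: float
    judge_score: float


def failed_gates(report: CiReport) -> list[str]:
    """Return the names of the gates that failed; an empty list means the build may ship."""
    failures = []
    if report.tests_passed != report.tests_total:
        failures.append("tests")
    if report.coverage_pct < 80.0:
        failures.append("coverage")
    if report.e2e_p95_ms >= 1000.0:
        failures.append("latency")
    if report.judge_score < 4.0:
        failures.append("judge")
    return failures
```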
docker compose -f infra/docker-compose.yml up --build

Includes the agent, Prometheus (port 9091), and Grafana (port 3000) with auto-provisioned datasource and dashboard.
# First-time setup
cd infra/aws && cdk bootstrap && cdk deploy VoiceAgentStack
# Subsequent deploys
docker build -t voice-ai-agent -f infra/Dockerfile .
# Push to ECR, then:
aws ecs update-service --cluster voice-ai-agent --service VoiceAgentService --force-new-deployment

Infrastructure:
- ECS Fargate (2 vCPU, 4GB RAM per task)
- ALB with TLS termination and sticky sessions
- Auto-scaling: 1-10 tasks based on CPU and active sessions
- Secrets Manager for API keys
- CloudWatch Logs (30-day retention) + custom metrics
- SNS alerting for error rate > 5% and P95 latency > 1200ms
See infra/aws/architecture.md for the full architecture diagram and deployment details.
| Component | Per Minute |
|---|---|
| Deepgram STT | $0.0043 |
| Claude Sonnet LLM | $0.020-0.040 |
| Cartesia TTS | $0.008 |
| Infrastructure | $0.010 |
| Total | $0.045-0.065 |
Compared to the OpenAI Realtime API at $0.11/min, this pipeline runs at roughly 40-60% lower cost (about 50% at the midpoint of the range), with full control over each component.
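
The comparison can be checked with a few lines of arithmetic, using only the per-minute totals quoted above. The constant and function names are hypothetical.

```python
OPENAI_REALTIME_PER_MIN = 0.11    # USD per minute, figure quoted above
PIPELINE_PER_MIN = (0.045, 0.065)  # USD per minute, low and high totals from the table


def savings_pct(ours: float, baseline: float = OPENAI_REALTIME_PER_MIN) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal place."""
    return round(100 * (baseline - ours) / baseline, 1)


if __name__ == "__main__":
    for cost in PIPELINE_PER_MIN:
        print(f"${cost:.3f}/min -> {savings_pct(cost)}% cheaper")
```

At the low end of the range the saving is about 59%, and at the high end about 41%.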
| Volume | Monthly Cost |
|---|---|
| 20 calls/day (MVP) | ~$325 |
| 200 calls/day (Production) | ~$1,790 |
| 2,000 calls/day (Enterprise) | ~$13,260 |
See infra/aws/cost_model.md for the full breakdown including optimization strategies.
15 metrics covering latency (E2E, STT, LLM TTFT, TTS TTFB), sessions, errors, token usage, interruptions (legitimate/false/recovered), tool calls, silence events, and STT confidence distribution.
Per-turn traces with spans for STT, turn detection, LLM (including tool calls), TTS, and playback. Exported via OTLP to any compatible backend.
Pre-built dashboard with 12 panels auto-provisioned on startup. Covers active sessions, error rate, latency breakdown, interruption patterns, token usage, tool success rate, and silence escalation events.
| Layer | Technology | Version |
|---|---|---|
| Agent Framework | LiveKit Agents SDK | 1.5.0 |
| STT | Deepgram Nova 3 | latest |
| LLM | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| LLM (routing) | Claude Haiku | claude-haiku-4-20250514 |
| TTS | Cartesia Sonic 3 | latest |
| VAD | Silero VAD v5 | ONNX runtime |
| Turn Detection | MultilingualModel | Qwen2.5-0.5B backbone |
| Runtime | Python | 3.12 |
| Metrics | Prometheus + Grafana | 2.53 / 11.1 |
| Tracing | OpenTelemetry | 1.25 |
| Container | Docker (multi-stage) | 24.x |
| Infrastructure | AWS CDK (Python) | 2.x |
| Compute | ECS Fargate | - |
| Load Balancer | Application Load Balancer | - |
| Secrets | AWS Secrets Manager | - |
| Logs | CloudWatch Logs | - |
| Document | Description |
|---|---|
| docs/ARCHITECTURE.md | System architecture, component deep-dive, data flow, concurrency model |
| docs/LATENCY_BUDGET.md | Per-component latency targets and optimization strategies |
| docs/EDGE_CASES.md | 10 production edge cases with handling strategies and decision trees |
| docs/RUNBOOK.md | Operational guide - deployment, troubleshooting, alerting, investigation |
| infra/aws/architecture.md | AWS deployment architecture, network flow, security, DR |
| infra/aws/cost_model.md | Per-call cost breakdown, monthly projections, optimization strategies |
MIT