devjothish/production-voice-ai-agent-livekit


Voice AI Agent

A production voice AI agent built on LiveKit Agents with a modular STT-LLM-TTS pipeline. It targets sub-800ms end-to-end latency (720ms at P95 in the latency budget) at roughly half the per-minute cost of managed alternatives such as the OpenAI Realtime API.

+------------------+        WebRTC         +------------------+
|   Client App     | <-------------------> |   LiveKit Cloud  |
+------------------+                       +--------+---------+
                                                    |
                                           +--------+---------+
                                           |  Voice AI Agent  |
                                           +--------+---------+
                                                    |
                        +---------------------------+---------------------------+
                        |                           |                           |
               +--------+--------+         +--------+--------+        +--------+--------+
               |  Deepgram STT   |         |  Claude LLM     |        |  Cartesia TTS   |
               |  Nova 3         |         |  Sonnet 4       |        |  Sonic 3        |
               +-----------------+         +--------+--------+        +-----------------+
                                                    |
                                           +--------+--------+
                                           |   Tool Layer    |
                                           +-----------------+

Features

  • LiveKit Agents v1.5.0 with adaptive interruption handling
  • Deepgram Nova 3 STT - streaming transcription, < 200ms P95 latency
  • Claude Sonnet 4 LLM with native tool calling and model routing (Haiku for simple turns)
  • Cartesia Sonic 3 TTS - natural voice synthesis, < 150ms TTFB
  • Adaptive VAD (Silero v5) + semantic turn detection (MultilingualModel, Qwen2.5-0.5B backbone)
  • Barge-in detection with ML-based backchannel classification (39% fewer false interruptions)
  • False interruption recovery - agent resumes interrupted responses naturally
  • Silence handling with configurable 4-level escalation ladder
  • Context window management - sliding window + async summarization, capped at 2,000 tokens
  • STT confidence filtering (threshold: 0.3) and noise rejection
  • Domain boundary guardrails with graceful redirection
  • Sentence-aware TTS dispatch for natural prosody and lower latency
  • OpenTelemetry tracing with per-turn span hierarchy
  • Prometheus metrics (15 metrics covering latency, sessions, errors, tokens, interruptions)
  • Pre-built Grafana dashboard with 12 panels (auto-provisioned)
  • LLM-as-judge evaluation framework with golden datasets
  • 58 automated tests (pytest) with CI quality gates
  • Docker + docker-compose for local development with hot reload
  • AWS CDK for production deployment (ECS Fargate, ALB, auto-scaling 1-10 tasks)
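
Two of the features above, STT confidence filtering and the silence escalation ladder, are small enough to sketch directly. This is an illustrative sketch, not the repository's code: only the 0.3 threshold and the number of levels (four) come from the list above. The function names, the silence timings, and the action labels are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.3  # from the feature list above


@dataclass(frozen=True)
class Transcript:
    text: str
    confidence: float  # STT confidence in [0, 1]


def accept_transcript(t: Transcript, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Drop empty or low-confidence transcripts before they reach the LLM."""
    return bool(t.text.strip()) and t.confidence >= threshold


# Hypothetical 4-level ladder: (seconds of silence, action). Timings are assumed.
SILENCE_LADDER = [
    (5.0, "gentle_prompt"),
    (10.0, "rephrase_question"),
    (20.0, "offer_help"),
    (30.0, "end_session"),
]


def silence_action(silent_for_s: float) -> Optional[str]:
    """Return the highest escalation level reached, or None below level 1."""
    action = None
    for limit_s, name in SILENCE_LADDER:
        if silent_for_s >= limit_s:
            action = name
    return action
```

Keeping the ladder as data rather than branching logic is one way to make the levels configurable, as the feature list describes.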

Quick Start

# 1. Clone
git clone https://github.com/devjothish/production-voice-ai-agent-livekit.git
cd production-voice-ai-agent-livekit

# 2. Configure
cp .env.example .env
# Add your API keys: LIVEKIT, DEEPGRAM, ANTHROPIC, CARTESIA

# 3. Run
docker compose -f infra/docker-compose.yml up --build

The agent connects to your LiveKit Cloud project and handles incoming voice sessions. The agent's metrics endpoint is at http://localhost:9090/metrics, the Prometheus UI is at http://localhost:9091, and Grafana is at http://localhost:3000 (login: admin / voiceagent).

Architecture

The agent uses a modular pipeline where each component (STT, LLM, TTS) is independently swappable and runs as a streaming stage. The key optimization is pipeline parallelism - STT, LLM, and TTS overlap rather than running sequentially.

User speaks:     |====== speech ======|--- silence ---|
STT:             |== streaming transcription ==|
                                               |
Turn Detection:                           [end-of-turn]
                                               |
LLM:                                      |=== token streaming ===|
                                           |              |
                                        sentence1      sentence2
                                           |
TTS:                                  |=== audio ===|=== audio ===|
                                       |
Audio out:                        first audio chunk
                                       |
                                  ~750ms from end of user speech
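
The overlap in the timeline above can be sketched with asyncio: one task groups streamed LLM tokens into sentences while a second task synthesizes each sentence as soon as it arrives. This is a minimal sketch under stated assumptions, not the agent's implementation. `llm_tokens` and `synthesize` are stubs standing in for the real LLM and TTS clients, and the sentence splitter is deliberately naive (it would also split on a decimal point).

```python
import asyncio
import re
from typing import AsyncIterator, List

SENTENCE_END = re.compile(r"[.!?]\s*$")  # naive sentence boundary


async def llm_tokens(prompt: str) -> AsyncIterator[str]:
    """Stub LLM: yields tokens one at a time, as a streaming client would."""
    for token in ["Sure", ".", " Your", " order", " shipped", " today", "."]:
        await asyncio.sleep(0)  # yield control, like a network read
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so TTS can start early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment


async def synthesize(sentence: str) -> bytes:
    """Stub TTS: returns fake audio bytes for one sentence."""
    await asyncio.sleep(0)
    return sentence.encode("utf-8")


async def respond(prompt: str) -> List[bytes]:
    """Run LLM streaming and TTS concurrently, linked by a queue."""
    queue: asyncio.Queue = asyncio.Queue()

    async def produce() -> None:
        async for sentence in sentences(llm_tokens(prompt)):
            await queue.put(sentence)
        await queue.put(None)  # sentinel: LLM stream finished

    async def consume() -> List[bytes]:
        audio = []
        while (sentence := await queue.get()) is not None:
            audio.append(await synthesize(sentence))
        return audio

    _, audio = await asyncio.gather(produce(), consume())
    return audio


if __name__ == "__main__":
    print(asyncio.run(respond("Where is my order?")))
```

Because the consumer starts on the first sentence while the producer is still reading tokens, time to first audio depends on the first sentence rather than on the full response.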

Latency Budget

| Component | P50 | P95 | Target |
|---|---|---|---|
| VAD + Turn Detection | 60ms | 100ms | 100ms |
| STT Final Transcript | 120ms | 185ms | 200ms |
| LLM TTFT | 250ms | 380ms | 400ms |
| TTS TTFB | 80ms | 135ms | 150ms |
| E2E (with overlap) | 450ms | 720ms | 800ms |
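
To see how much the overlap buys, the per-stage figures can be summed and compared with the end-to-end row. The sketch below only does that arithmetic on the numbers from the table; the dictionary and function names are illustrative. Note that adding per-stage P95 values gives a pessimistic bound rather than the true P95 of a sequential pipeline.

```python
# Per-stage latencies in milliseconds, copied from the Latency Budget table.
STAGES = {
    "vad_turn_detection": {"p50": 60, "p95": 100},
    "stt_final": {"p50": 120, "p95": 185},
    "llm_ttft": {"p50": 250, "p95": 380},
    "tts_ttfb": {"p50": 80, "p95": 135},
}
E2E_WITH_OVERLAP = {"p50": 450, "p95": 720}
E2E_TARGET_MS = 800


def sequential_ms(percentile: str) -> int:
    """Latency if every stage waited for the previous one to finish."""
    return sum(stage[percentile] for stage in STAGES.values())


def overlap_saving_ms(percentile: str) -> int:
    """Difference between the sequential sum and the reported E2E figure."""
    return sequential_ms(percentile) - E2E_WITH_OVERLAP[percentile]


if __name__ == "__main__":
    for p in ("p50", "p95"):
        print(p, sequential_ms(p), overlap_saving_ms(p))
```

With these figures the sequential sums are 510ms (P50) and 800ms (P95), so the reported overlap saves 60ms and 80ms respectively and keeps P95 under the 800ms target.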

Turn Detection

LiveKit's MultilingualModel (a fine-tuned Qwen2.5-0.5B) analyzes the streaming transcript for semantic completeness to decide whether the user has finished their turn, rather than relying on silence alone. This reduces false interruptions by 39% compared to VAD-only detection with fixed silence thresholds.
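
One common way to combine a turn-detection model with VAD is to let the model's end-of-turn probability choose how long to wait in silence before committing the turn. The function below is a hedged sketch of that idea only. The function name, the 0.5 probability threshold, and both delay values are assumptions for illustration, not values taken from this repository or from LiveKit.

```python
def end_of_turn(
    silence_ms: float,
    eot_probability: float,
    eot_threshold: float = 0.5,    # assumed decision threshold
    min_delay_ms: float = 300.0,   # assumed wait when the utterance looks complete
    max_delay_ms: float = 3000.0,  # assumed wait when it looks unfinished
) -> bool:
    """Decide whether to commit the user's turn.

    A likely-complete utterance is committed after a short silence; a
    likely-incomplete one ("I'd like to book a...") gets a longer grace period.
    """
    if eot_probability >= eot_threshold:
        return silence_ms >= min_delay_ms
    return silence_ms >= max_delay_ms
```

A fixed-threshold VAD corresponds to the special case where both delays are equal, which is why it either cuts users off mid-thought or feels sluggish.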

Adaptive Interruption

When the user speaks during agent playback, a classifier determines if it is a backchannel ("uh huh", "yeah") or a real interruption. Backchannels are ignored. Real interruptions stop playback immediately. False interruptions trigger recovery where the agent resumes its interrupted response.
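
The three outcomes described above map naturally onto a small decision function. The sketch below uses a fixed phrase list as a stand-in for the ML classifier, and it models a false interruption as detected speech that produced no transcript. Both choices, along with every name in the block, are assumptions for illustration.

```python
from enum import Enum


class Overlap(Enum):
    BACKCHANNEL = "backchannel"
    INTERRUPTION = "interruption"
    FALSE_INTERRUPTION = "false_interruption"


# Stand-in for the ML classifier: a fixed list of acknowledgement phrases.
BACKCHANNELS = frozenset({"uh huh", "yeah", "mm hmm", "right", "okay", "ok"})


def classify_overlap(transcript: str) -> Overlap:
    """Classify user speech that overlaps agent playback."""
    normalized = transcript.lower().strip(" .,!?")
    if not normalized:
        # Speech was detected but nothing was transcribed (for example, noise).
        return Overlap.FALSE_INTERRUPTION
    if normalized in BACKCHANNELS:
        return Overlap.BACKCHANNEL
    return Overlap.INTERRUPTION


def playback_action(overlap: Overlap) -> str:
    """Map the classification to what playback should do."""
    return {
        Overlap.BACKCHANNEL: "continue",         # ignore and keep speaking
        Overlap.INTERRUPTION: "stop",            # stop playback immediately
        Overlap.FALSE_INTERRUPTION: "resume",    # pick the response back up
    }[overlap]
```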

Context Management

Conversations stay fast regardless of length. A sliding window keeps the last 10 turns in full context. Older turns are asynchronously summarized by Haiku into a ~200 token digest. Total context stays under 2,000 tokens.
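
A sliding window with a rolling summary can be sketched as follows. The 10-turn window, the ~200-token digest, and the 2,000-token cap come from the paragraph above; everything else is assumed. In particular, `summarize` is a synchronous stand-in that truncates text instead of calling Haiku asynchronously, the token estimate is a rough characters-per-token heuristic, and evicting extra turns when the budget is exceeded is one possible way to enforce the cap.

```python
from dataclasses import dataclass, field
from typing import List

MAX_FULL_TURNS = 10   # from the text above
TOKEN_BUDGET = 2000   # from the text above
SUMMARY_TOKENS = 200  # approximate digest size, from the text above


def estimate_tokens(text: str) -> int:
    """Rough heuristic of about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def summarize(previous: str, evicted_turn: str) -> str:
    """Stand-in for the Haiku call: concatenate, then truncate to digest size."""
    combined = f"{previous} {evicted_turn}".strip()
    return combined[-SUMMARY_TOKENS * 4:]


@dataclass
class ConversationContext:
    summary: str = ""
    turns: List[str] = field(default_factory=list)

    def token_count(self) -> int:
        return estimate_tokens(self.summary) + sum(
            estimate_tokens(t) for t in self.turns
        )

    def add_turn(self, turn: str) -> None:
        """Append a turn, folding the oldest turns into the summary as needed."""
        self.turns.append(turn)
        while len(self.turns) > MAX_FULL_TURNS or (
            self.token_count() > TOKEN_BUDGET and len(self.turns) > 1
        ):
            self.summary = summarize(self.summary, self.turns.pop(0))
```

Because the prompt size is bounded by the window plus the digest, LLM time to first token does not grow with conversation length.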

Testing

# Run all 58 tests
pytest tests/ -v

# Unit tests (38) - individual components with mocked APIs
pytest tests/unit/ -v

# Integration tests (12) - multi-component flows
pytest tests/integration/ -v

# LLM-as-judge evaluation (8) - quality scoring on golden dataset
pytest tests/test_evaluation.py -v

# Coverage report
pytest tests/ --cov=agent --cov-report=html

CI Quality Gates

| Gate | Threshold |
|---|---|
| All tests passing | 58/58 |
| Code coverage | >= 80% |
| E2E latency (mocked) | P95 < 1000ms |
| LLM-as-judge score | >= 4.0 / 5.0 |
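
The gates above are simple threshold checks, so a CI step could evaluate them with a few lines of Python. The sketch below encodes the four thresholds from the table. The `CiResults` type and the gate labels are hypothetical, and how the real pipeline collects these numbers is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CiResults:
    tests_passed: int
    tests_total: int
    coverage_pct: float
    e2e_p95_ms: float   # measured against mocked providers
    judge_score: float  # LLM-as-judge score out of 5.0


def failed_gates(r: CiResults) -> List[str]:
    """Return the names of the gates that fail; an empty list means pass."""
    failures = []
    if r.tests_passed != r.tests_total:
        failures.append("tests")
    if r.coverage_pct < 80.0:
        failures.append("coverage")
    if r.e2e_p95_ms >= 1000.0:
        failures.append("latency")
    if r.judge_score < 4.0:
        failures.append("judge")
    return failures
```

A CI job would typically exit non-zero when the returned list is not empty.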

Deployment

Local (docker-compose)

docker compose -f infra/docker-compose.yml up --build

Includes the agent, Prometheus (port 9091), and Grafana (port 3000) with auto-provisioned datasource and dashboard.

Production (AWS ECS Fargate)

# First-time setup
cd infra/aws && cdk bootstrap && cdk deploy VoiceAgentStack

# Subsequent deploys
docker build -t voice-ai-agent -f infra/Dockerfile .
# Push to ECR, then:
aws ecs update-service --cluster voice-ai-agent --service VoiceAgentService --force-new-deployment

Infrastructure:

  • ECS Fargate (2 vCPU, 4GB RAM per task)
  • ALB with TLS termination and sticky sessions
  • Auto-scaling: 1-10 tasks based on CPU and active sessions
  • Secrets Manager for API keys
  • CloudWatch Logs (30-day retention) + custom metrics
  • SNS alerting for error rate > 5% and P95 latency > 1200ms

See infra/aws/architecture.md for the full architecture diagram and deployment details.
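
The scaling and alerting rules listed above can be expressed as two small functions. This is a sketch of the logic only, not the CDK stack. The 1-10 task range and the alert thresholds come from the list above, while the 70% CPU target, the 20 sessions per task, and the function names are assumptions made for the example.

```python
import math

MIN_TASKS, MAX_TASKS = 1, 10        # from the infrastructure list above
ERROR_RATE_ALERT = 0.05             # alert when error rate exceeds 5%
P95_LATENCY_ALERT_MS = 1200.0       # alert when P95 latency exceeds 1200ms


def desired_tasks(
    current_tasks: int,
    avg_cpu_pct: float,
    active_sessions: int,
    cpu_target_pct: float = 70.0,   # assumed target
    sessions_per_task: int = 20,    # assumed capacity per task
) -> int:
    """Scale to whichever signal asks for more tasks, clamped to the range."""
    by_cpu = math.ceil(current_tasks * avg_cpu_pct / cpu_target_pct)
    by_sessions = math.ceil(active_sessions / sessions_per_task)
    return max(MIN_TASKS, min(MAX_TASKS, max(by_cpu, by_sessions)))


def should_alert(error_rate: float, p95_latency_ms: float) -> bool:
    """True when either alerting threshold is exceeded."""
    return error_rate > ERROR_RATE_ALERT or p95_latency_ms > P95_LATENCY_ALERT_MS
```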

Cost Model

| Component | Per Minute |
|---|---|
| Deepgram STT | $0.0043 |
| Claude Sonnet LLM | $0.020-0.040 |
| Cartesia TTS | $0.008 |
| Infrastructure | $0.010 |
| Total | $0.045-0.065 |

Compared to the OpenAI Realtime API at $0.11/min, this pipeline runs at roughly 40-60% lower cost (41% lower at $0.065/min, 59% lower at $0.045/min) with full control over each component.
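
The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures from the cost table and the $0.11/min baseline; the names are illustrative. The itemized rows sum to slightly less than the stated total, so the sketch keeps both figures rather than guessing at the difference.

```python
# (low, high) cost per minute in USD, copied from the Cost Model table.
COMPONENT_COST_PER_MIN = {
    "deepgram_stt": (0.0043, 0.0043),
    "claude_sonnet_llm": (0.020, 0.040),
    "cartesia_tts": (0.008, 0.008),
    "infrastructure": (0.010, 0.010),
}
STATED_TOTAL_PER_MIN = (0.045, 0.065)
REALTIME_API_PER_MIN = 0.11


def component_sum() -> tuple:
    """Sum the itemized rows into a (low, high) range."""
    low = sum(cost[0] for cost in COMPONENT_COST_PER_MIN.values())
    high = sum(cost[1] for cost in COMPONENT_COST_PER_MIN.values())
    return low, high


def savings_pct(cost_per_min: float, baseline: float = REALTIME_API_PER_MIN) -> float:
    """Percentage saved relative to the baseline per-minute price."""
    return (1.0 - cost_per_min / baseline) * 100.0


if __name__ == "__main__":
    print(component_sum())
    print([round(savings_pct(c), 1) for c in STATED_TOTAL_PER_MIN])
```

Using the stated total, the saving works out to about 59% at the low end of the cost range ($0.045/min) and about 41% at the high end ($0.065/min).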

| Volume | Monthly Cost |
|---|---|
| 20 calls/day (MVP) | ~$325 |
| 200 calls/day (Production) | ~$1,790 |
| 2,000 calls/day (Enterprise) | ~$13,260 |

See infra/aws/cost_model.md for the full breakdown including optimization strategies.

Observability

Metrics (Prometheus)

15 metrics covering latency (E2E, STT, LLM TTFT, TTS TTFB), sessions, errors, token usage, interruptions (legitimate/false/recovered), tool calls, silence events, and STT confidence distribution.

Tracing (OpenTelemetry)

Per-turn traces with spans for STT, turn detection, LLM (including tool calls), TTS, and playback. Exported via OTLP to any compatible backend.
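
The per-turn span hierarchy can be pictured with a small standard-library stand-in. This sketch does not use the OpenTelemetry SDK. `TurnTracer` is a hypothetical recorder that only mimics the nesting of spans, and the span names follow the list above, with the tool call nested under the LLM span.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    duration_ms: float = 0.0


class TurnTracer:
    """Records nested spans; a stand-in for an OpenTelemetry tracer."""

    def __init__(self) -> None:
        self.spans: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, parent=parent)
        self.spans.append(current)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000.0
            self._stack.pop()


def trace_turn(tracer: TurnTracer) -> None:
    """Open one span per stage of a conversational turn (stage work omitted)."""
    with tracer.span("turn"):
        for stage in ("stt", "turn_detection"):
            with tracer.span(stage):
                pass
        with tracer.span("llm"):
            with tracer.span("tool_call"):
                pass
        for stage in ("tts", "playback"):
            with tracer.span(stage):
                pass
```

With a real tracer, the same nesting is what lets a backend show which stage dominated a slow turn.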

Dashboard (Grafana)

Pre-built dashboard with 12 panels auto-provisioned on startup. Covers active sessions, error rate, latency breakdown, interruption patterns, token usage, tool success rate, and silence escalation events.

Tech Stack

| Layer | Technology | Version |
|---|---|---|
| Agent Framework | LiveKit Agents SDK | 1.5.0 |
| STT | Deepgram Nova 3 | latest |
| LLM | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| LLM (routing) | Claude Haiku | claude-haiku-4-20250514 |
| TTS | Cartesia Sonic 3 | latest |
| VAD | Silero VAD v5 | ONNX runtime |
| Turn Detection | MultilingualModel | Qwen2.5-0.5B backbone |
| Runtime | Python | 3.12 |
| Metrics | Prometheus + Grafana | 2.53 / 11.1 |
| Tracing | OpenTelemetry | 1.25 |
| Container | Docker (multi-stage) | 24.x |
| Infrastructure | AWS CDK (Python) | 2.x |
| Compute | ECS Fargate | - |
| Load Balancer | Application Load Balancer | - |
| Secrets | AWS Secrets Manager | - |
| Logs | CloudWatch Logs | - |

Documentation

| Document | Description |
|---|---|
| docs/ARCHITECTURE.md | System architecture, component deep-dive, data flow, concurrency model |
| docs/LATENCY_BUDGET.md | Per-component latency targets and optimization strategies |
| docs/EDGE_CASES.md | 10 production edge cases with handling strategies and decision trees |
| docs/RUNBOOK.md | Operational guide: deployment, troubleshooting, alerting, investigation |
| infra/aws/architecture.md | AWS deployment architecture, network flow, security, DR |
| infra/aws/cost_model.md | Per-call cost breakdown, monthly projections, optimization strategies |

License

MIT
