Voice AI Agent

A production voice AI agent built on LiveKit Agents with a modular STT-LLM-TTS pipeline. It achieves sub-800ms end-to-end latency (P95 of 720ms against an 800ms target) at roughly 40-60% lower per-minute cost than managed alternatives such as the OpenAI Realtime API.

+------------------+        WebRTC         +------------------+
|   Client App     | <-------------------> |   LiveKit Cloud  |
+------------------+                       +--------+---------+
                                                    |
                                           +--------+---------+
                                           |  Voice AI Agent  |
                                           +--------+---------+
                                                    |
                        +---------------------------+---------------------------+
                        |                           |                           |
               +--------+--------+         +--------+--------+        +--------+--------+
               |  Deepgram STT   |         |  Claude LLM     |        |  Cartesia TTS   |
               |  Nova 3         |         |  Sonnet 4       |        |  Sonic 3        |
               +-----------------+         +--------+--------+        +-----------------+
                                                    |
                                           +--------+--------+
                                           |   Tool Layer    |
                                           +-----------------+

Features

Quick Start

# 1. Clone
git clone https://github.com/jothiswaran/voice-ai-agent.git
cd voice-ai-agent

# 2. Configure
cp .env.example .env
# Add your API keys: LIVEKIT, DEEPGRAM, ANTHROPIC, CARTESIA

# 3. Run
docker compose -f infra/docker-compose.yml up --build

The agent connects to your LiveKit Cloud project and handles incoming voice sessions. Agent metrics are exposed at http://localhost:9090/metrics, Prometheus runs at http://localhost:9091, and Grafana runs at http://localhost:3000 (login: admin / voiceagent).

Architecture

The agent uses a modular pipeline in which each component (STT, LLM, TTS) is independently swappable and runs as a streaming stage. The key optimization is pipeline parallelism: the stages overlap rather than run sequentially, so TTS starts synthesizing the first complete sentence while the LLM is still streaming the rest of the response.

User speaks:     |====== speech ======|--- silence ---|
STT:             |== streaming transcription ==|
                                               |
Turn Detection:                           [end-of-turn]
                                               |
LLM:                                      |=== token streaming ===|
                                           |              |
                                        sentence1      sentence2
                                           |
TTS:                                  |=== audio ===|=== audio ===|
                                       |
Audio out:                        first audio chunk
                                       |
                                  ~750ms from end of user speech
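
The overlap in the timeline above can be sketched with two asyncio tasks joined by a queue. This is a minimal illustration, not the project's implementation: `fake_llm_tokens`, `fake_tts`, and `respond` are hypothetical stand-ins for the real streaming clients, and the token list is invented for the example.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm_tokens(transcript: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields one token at a time."""
    for token in ["Sure", ",", " I", " can", " help", ".", " What", " date", "?"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so TTS can start early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing partial sentence


async def fake_tts(sentence: str) -> bytes:
    """Hypothetical stand-in for TTS: returns 'audio' for one sentence."""
    await asyncio.sleep(0)
    return sentence.encode("utf-8")


async def respond(transcript: str) -> list:
    """Run the LLM and TTS stages concurrently, connected by a queue."""
    queue = asyncio.Queue()
    audio_chunks = []

    async def llm_stage() -> None:
        async for sentence in sentences(fake_llm_tokens(transcript)):
            await queue.put(sentence)
        await queue.put(None)  # sentinel: no more sentences

    async def tts_stage() -> None:
        while (sentence := await queue.get()) is not None:
            audio_chunks.append(await fake_tts(sentence))

    await asyncio.gather(llm_stage(), tts_stage())
    return audio_chunks
```

Because the TTS stage consumes sentences as soon as they are queued, the first audio chunk can be produced before the LLM has emitted its last token, which is the effect the timeline shows.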

Latency Budget

Component	P50	P95	Target
VAD + Turn Detection	60ms	100ms	100ms
STT Final Transcript	120ms	185ms	200ms
LLM TTFT	250ms	380ms	400ms
TTS TTFB	80ms	135ms	150ms
E2E (with overlap)	450ms	720ms	800ms
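
To see what the overlap buys, the per-stage figures in the table can be added up and compared with the measured end-to-end row. The sketch below only re-uses the numbers above; the helper names are made up for illustration. Percentiles of separate stages do not strictly add, so the P95 sum is a rough upper bound rather than an exact sequential latency.

```python
# Per-stage latencies in milliseconds, copied from the table above.
STAGES = {
    "VAD + Turn Detection": {"p50": 60, "p95": 100, "target": 100},
    "STT Final Transcript": {"p50": 120, "p95": 185, "target": 200},
    "LLM TTFT": {"p50": 250, "p95": 380, "target": 400},
    "TTS TTFB": {"p50": 80, "p95": 135, "target": 150},
}
E2E_WITH_OVERLAP = {"p50": 450, "p95": 720, "target": 800}


def sequential_sum(column: str) -> int:
    """Latency if every stage waited for the previous one to finish."""
    return sum(stage[column] for stage in STAGES.values())


def overlap_saving(column: str) -> int:
    """Milliseconds saved by overlapping stages, per the E2E row."""
    return sequential_sum(column) - E2E_WITH_OVERLAP[column]
```

With these figures the stages sum to 510ms at P50 and 800ms at P95, so the overlapped pipeline comes in 60ms and 80ms lower respectively.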

Turn Detection

LiveKit's MultilingualModel (fine-tuned Qwen2.5-0.5B) analyzes transcript completeness and prosodic features to determine end-of-turn. This reduces false interruptions by 39% compared to VAD-only detection with fixed silence thresholds.
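
The difference between the two approaches can be illustrated with a small decision function. This is a sketch under stated assumptions, not LiveKit's API: `TurnPolicy`, its thresholds, and the `eot_probability` input (a model's estimate that the user has finished) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnPolicy:
    # All thresholds are illustrative assumptions, not project values.
    confident_probability: float = 0.8
    short_silence_ms: int = 200
    long_silence_ms: int = 1500


def vad_only_end_of_turn(silence_ms: int, threshold_ms: int = 500) -> bool:
    """Fixed silence threshold: ends the turn on any long enough pause."""
    return silence_ms >= threshold_ms


def model_end_of_turn(
    silence_ms: int,
    eot_probability: float,
    policy: TurnPolicy = TurnPolicy(),
) -> bool:
    """Wait less when the model thinks the utterance is complete, more otherwise."""
    if eot_probability >= policy.confident_probability:
        return silence_ms >= policy.short_silence_ms
    return silence_ms >= policy.long_silence_ms
```

A 600ms mid-sentence pause ends the turn under the fixed threshold but not under the model-aware policy, which is the kind of false interruption the turn detector is meant to avoid.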

Adaptive Interruption

When the user speaks during agent playback, a classifier decides whether the speech is a backchannel ("uh huh", "yeah") or a real interruption. Backchannels are ignored, while real interruptions stop playback immediately. If an interruption turns out to be false, the agent recovers by resuming the response it was delivering.
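
The three outcomes can be sketched as a tiny state machine. Everything here is hypothetical: the keyword lookup in `classify` is a simplistic stand-in for the real classifier, and `Playback` only models which sentence is being spoken.

```python
from enum import Enum


class SpeechKind(Enum):
    BACKCHANNEL = "backchannel"
    INTERRUPTION = "interruption"


# Illustrative phrase list; a real classifier would not be a lookup table.
BACKCHANNELS = {"uh huh", "yeah", "mm hmm", "right", "ok", "okay"}


def classify(transcript: str) -> SpeechKind:
    """Stand-in classifier: short acknowledgements count as backchannels."""
    normalized = transcript.lower().strip(" .,!?")
    if normalized in BACKCHANNELS:
        return SpeechKind.BACKCHANNEL
    return SpeechKind.INTERRUPTION


class Playback:
    """Tracks which sentence of the agent response is being spoken."""

    def __init__(self, sentences: list) -> None:
        self.sentences = sentences
        self.position = 0  # index of the sentence currently playing
        self.playing = True

    def on_user_speech(self, transcript: str) -> SpeechKind:
        kind = classify(transcript)
        if kind is SpeechKind.INTERRUPTION:
            self.playing = False  # stop immediately, keep the position
        return kind

    def resume_after_false_interruption(self) -> list:
        """Recovery: continue from the sentence that was cut off."""
        self.playing = True
        return self.sentences[self.position:]
```

Keeping the playback position when stopping is what makes recovery cheap: the agent can pick up the interrupted sentence without regenerating the response.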

Context Management

Conversations stay fast regardless of length. A sliding window keeps the last 10 turns in full context, and older turns are asynchronously summarized by Claude Haiku into a ~200 token digest. Total context stays under 2,000 tokens.
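
The window-plus-digest idea can be sketched as follows. This is an assumption-laden illustration: `SlidingContext` and `concat_summarizer` are hypothetical names, the summarizer is a trivial stand-in for the Haiku call, and summarization runs synchronously here for brevity whereas the agent does it asynchronously.

```python
from collections import deque
from typing import Callable


def concat_summarizer(digest: str, evicted_turns: list) -> str:
    """Hypothetical stand-in for the LLM summarizer: just concatenates."""
    return (digest + " " + " | ".join(evicted_turns)).strip()


class SlidingContext:
    """Keeps the most recent turns verbatim and folds older ones into a digest."""

    def __init__(self, summarize: Callable[[str, list], str], window: int = 10) -> None:
        self.summarize = summarize
        self.window = window
        self.recent = deque()
        self.digest = ""

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        evicted = []
        while len(self.recent) > self.window:
            evicted.append(self.recent.popleft())
        if evicted:
            # Fold evicted turns into the running digest.
            self.digest = self.summarize(self.digest, evicted)

    def prompt_messages(self) -> list:
        """Digest first (if any), then the verbatim recent turns."""
        header = []
        if self.digest:
            header.append(f"Summary of earlier conversation: {self.digest}")
        return header + list(self.recent)
```

Because the prompt is always one digest plus at most `window` turns, its size is bounded no matter how long the conversation runs.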

Testing

# Run all 58 tests
pytest tests/ -v

# Unit tests (38) - individual components with mocked APIs
pytest tests/unit/ -v

# Integration tests (12) - multi-component flows
pytest tests/integration/ -v

# LLM-as-judge evaluation (8) - quality scoring on golden dataset
pytest tests/test_evaluation.py -v

# Coverage report
pytest tests/ --cov=agent --cov-report=html

CI Quality Gates

Gate	Threshold
All tests passing	58/58
Code coverage	>= 80%
E2E latency (mocked)	P95 < 1000ms
LLM-as-judge score	>= 4.0 / 5.0
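
As a rough illustration of how these gates could be evaluated in a CI step, the sketch below checks one set of results against the thresholds in the table. `CiResults`, `p95`, and `failed_gates` are hypothetical names, and the nearest-rank percentile is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiResults:
    tests_passed: int
    tests_total: int
    coverage_percent: float
    e2e_latencies_ms: list  # mocked end-to-end latency samples
    judge_score: float      # mean LLM-as-judge score out of 5.0


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def failed_gates(results: CiResults) -> list:
    """Return the names of the gates that did not pass."""
    failures = []
    if results.tests_passed != results.tests_total:
        failures.append("All tests passing")
    if results.coverage_percent < 80:
        failures.append("Code coverage")
    if p95(results.e2e_latencies_ms) >= 1000:
        failures.append("E2E latency (mocked)")
    if results.judge_score < 4.0:
        failures.append("LLM-as-judge score")
    return failures
```

An empty list means the build may proceed; any other result names the gates to investigate.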

Deployment

Local (docker-compose)

docker compose -f infra/docker-compose.yml up --build

Includes the agent, Prometheus (port 9091), and Grafana (port 3000) with auto-provisioned datasource and dashboard.

Production (AWS ECS Fargate)

# First-time setup
cd infra/aws && cdk bootstrap && cdk deploy VoiceAgentStack

# Subsequent deploys
docker build -t voice-ai-agent -f infra/Dockerfile .
# Push to ECR, then:
aws ecs update-service --cluster voice-ai-agent --service VoiceAgentService --force-new-deployment

Infrastructure:

ECS Fargate (2 vCPU, 4GB RAM per task)
ALB with TLS termination and sticky sessions
Auto-scaling: 1-10 tasks based on CPU and active sessions
Secrets Manager for API keys
CloudWatch Logs (30-day retention) + custom metrics
SNS alerting for error rate > 5% and P95 latency > 1200ms

See infra/aws/architecture.md for the full architecture diagram and deployment details.

Cost Model

Component	Per Minute
Deepgram STT	$0.0043
Claude Sonnet LLM	$0.020-0.040
Cartesia TTS	$0.008
Infrastructure	$0.010
Total	$0.045-0.065

Compared to the OpenAI Realtime API at $0.11/min, this pipeline runs at roughly 40-60% lower cost (41% at the high end of the range, 59% at the low end) with full control over each component.
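
The comparison can be checked with a few lines of arithmetic. The sketch uses only the per-minute figures quoted above; the constant and function names are made up for illustration.

```python
# Per-minute costs in USD, taken from the table and comparison above.
PIPELINE_PER_MIN = (0.045, 0.065)  # low and high end of the total
REALTIME_PER_MIN = 0.11            # OpenAI Realtime API figure


def saving_percent(pipeline_cost: float, baseline: float = REALTIME_PER_MIN) -> float:
    """Relative saving versus the baseline, as a percentage."""
    return round((baseline - pipeline_cost) / baseline * 100, 1)


best_case = saving_percent(PIPELINE_PER_MIN[0])   # cheapest pipeline minute
worst_case = saving_percent(PIPELINE_PER_MIN[1])  # most expensive pipeline minute
```

With these inputs the saving ranges from about 41% (at $0.065/min) to about 59% (at $0.045/min).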

Volume	Monthly Cost
20 calls/day (MVP)	~$325
200 calls/day (Production)	~$1,790
2,000 calls/day (Enterprise)	~$13,260

See infra/aws/cost_model.md for the full breakdown including optimization strategies.

Observability

Metrics (Prometheus)

15 metrics covering latency (E2E, STT, LLM TTFT, TTS TTFB), sessions, errors, token usage, interruptions (legitimate/false/recovered), tool calls, silence events, and STT confidence distribution.

Tracing (OpenTelemetry)

Per-turn traces with spans for STT, turn detection, LLM (including tool calls), TTS, and playback. Exported via OTLP to any compatible backend.

Dashboard (Grafana)

Pre-built dashboard with 12 panels auto-provisioned on startup. Covers active sessions, error rate, latency breakdown, interruption patterns, token usage, tool success rate, and silence escalation events.

Tech Stack

Layer	Technology	Version
Agent Framework	LiveKit Agents SDK	1.5.0
STT	Deepgram Nova 3	latest
LLM	Claude Sonnet 4	claude-sonnet-4-20250514
LLM (routing)	Claude Haiku	claude-haiku-4-20250514
TTS	Cartesia Sonic 3	latest
VAD	Silero VAD v5	ONNX runtime
Turn Detection	MultilingualModel	Qwen2.5-0.5B backbone
Runtime	Python	3.12
Metrics	Prometheus + Grafana	2.53 / 11.1
Tracing	OpenTelemetry	1.25
Container	Docker (multi-stage)	24.x
Infrastructure	AWS CDK (Python)	2.x
Compute	ECS Fargate	-
Load Balancer	Application Load Balancer	-
Secrets	AWS Secrets Manager	-
Logs	CloudWatch Logs	-

Documentation

Document	Description
docs/ARCHITECTURE.md	System architecture, component deep-dive, data flow, concurrency model
docs/LATENCY_BUDGET.md	Per-component latency targets and optimization strategies
docs/EDGE_CASES.md	10 production edge cases with handling strategies and decision trees
docs/RUNBOOK.md	Operational guide - deployment, troubleshooting, alerting, investigation
infra/aws/architecture.md	AWS deployment architecture, network flow, security, DR
infra/aws/cost_model.md	Per-call cost breakdown, monthly projections, optimization strategies

License

MIT
