
SAM 2 Web App Voice Agent

A real-time voice AI assistant for the SAM 2 No Code Finetuning web application. Users interact with the agent through natural conversation to learn about LoRA fine-tuning, manufacturing datasets, and the training configuration UI.

How It Works

A user opens the web app, clicks the voice agent widget (powered by the agent-starter-embed submodule), and a LiveKit room is created. The agent joins the room, authenticates the user against the shared PostgreSQL database, and begins a real-time voice conversation. During the session, the agent can search a RAG knowledge base or the web to answer questions. When the session ends — either by the user leaving or hitting the time limit — a conversation summary is generated and emailed to the user.

Tech Stack & How Each Service Fits Together

Voice Pipeline — LiveKit Agents Framework

The core of the agent is a LiveKit Agents pipeline that chains together speech-to-text, an LLM, and text-to-speech into a single real-time voice loop:

User Microphone → Deepgram STT → OpenAI GPT-4.1-nano → Cartesia TTS → User Speaker
  • LiveKit provides the WebRTC infrastructure — low-latency audio transport between the browser and the agent server, room management, and participant lifecycle events.
  • Silero VAD runs voice activity detection locally to determine when the user has started and stopped speaking, enabling natural turn-taking.
  • LiveKit Noise Cancellation (BVC/BVCTelephony) filters background noise from the user's audio before it reaches the STT model.
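Conceptually, each conversational turn is a chain of three stages. A minimal pure-Python sketch of that loop, with stand-in functions in place of the real Deepgram, OpenAI, and Cartesia streaming clients:

```python
# Stand-ins for the real streaming clients: each pipeline stage is a function.
def transcribe(audio: bytes) -> str:       # Deepgram STT stand-in
    return audio.decode("utf-8")

def generate_reply(text: str) -> str:      # OpenAI LLM stand-in
    return f"You said: {text}"

def synthesize(text: str) -> bytes:        # Cartesia TTS stand-in
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    """One conversational turn: mic audio in, agent speech out."""
    return synthesize(generate_reply(transcribe(audio_in)))
```

In the real pipeline each stage is streaming, so TTS can begin speaking before the LLM has finished generating; the sketch collapses that into a single synchronous call per turn.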

Speech-to-Text — Deepgram

Deepgram (flux-general-en model) transcribes the user's speech into text in real time. It provides streaming transcription with configurable end-of-turn detection thresholds, allowing the agent to respond promptly without cutting the user off mid-sentence.
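The end-of-turn idea reduces to watching the trailing run of silence. A simplified sketch (the threshold value and 20 ms frame size are illustrative, not Deepgram's actual defaults):

```python
def detect_end_of_turn(vad_frames, threshold_ms=700, frame_ms=20):
    """vad_frames: per-frame booleans from the VAD (True = speech detected).

    Declares the turn finished once the trailing run of silent frames
    exceeds the configured threshold. A higher threshold waits longer and
    interrupts the user less; a lower one makes the agent respond faster.
    """
    trailing_silence_ms = 0
    for is_speech in reversed(vad_frames):
        if is_speech:
            break
        trailing_silence_ms += frame_ms
    return trailing_silence_ms >= threshold_ms
```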

Language Model — OpenAI

OpenAI GPT-4.1-nano serves as the conversational brain. It receives the transcribed user speech, the system prompt (defining the agent's personality and scope), and any tool results, then generates a text response. It also powers the post-session conversation summary generation used in follow-up emails.

The LLM has access to two function tools:

  1. search_knowledge_base — queries the RAG system first
  2. web_search — falls back to internet search if RAG returns no relevant results
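The RAG-first routing between the two tools can be sketched as follows (the stubbed tools are illustrative; the real ones call MongoDB Atlas and Parallel):

```python
def answer_with_tools(question, search_knowledge_base, web_search):
    """RAG-first tool routing: fall back to the web only when RAG is empty."""
    hits = search_knowledge_base(question)
    if hits:
        return ("knowledge_base", hits)
    return ("web", web_search(question))

# Stubbed tools for illustration:
kb = lambda q: ["LoRA adds low-rank adapter matrices"] if "LoRA" in q else []
web = lambda q: [f"web results for: {q}"]
```

In practice the LLM decides when to invoke each tool via function calling, guided by a system prompt that describes this same fallback order.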

Text-to-Speech — Cartesia

Cartesia (sonic-3 model) converts the LLM's text responses into natural-sounding speech with configurable voice identity and emotion. A custom pronunciation dictionary ensures domain-specific terms (LoRA, SAM 2, manufacturing jargon) are spoken correctly.
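The effect of a pronunciation dictionary can be sketched as a pre-TTS text substitution; the entries below are hypothetical, and the real dictionary lives in the agent's Cartesia configuration:

```python
import re

# Hypothetical entries; the real mapping is part of the TTS configuration.
PRONUNCIATIONS = {
    "LoRA": "low rah",
    "SAM 2": "sam two",
}

def apply_pronunciations(text: str) -> str:
    """Substitute phonetic spellings before the text reaches TTS."""
    for term, spoken in PRONUNCIATIONS.items():
        text = re.sub(re.escape(term), spoken, text)
    return text
```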

Avatar — Lemonslice

Lemonslice renders a visual avatar in the browser that lip-syncs with the agent's speech, providing a more engaging conversational experience than audio alone.

RAG Knowledge Base — MongoDB Atlas + Voyage AI

The Retrieval-Augmented Generation (RAG) pipeline gives the agent access to domain-specific knowledge that the base LLM doesn't have:

  1. Document ingestion: PDF documents are parsed into markdown (via LandingAI ADE), chunked, and embedded using Voyage AI contextualized embeddings (1024 dimensions).
  2. Vector storage: Embeddings are stored in MongoDB Atlas with a vectorSearch index using dot-product similarity.
  3. Query flow: User questions are embedded with Voyage AI, matched against the vector store, and the top results are re-ranked using Voyage's rerank-2.5 model before being passed to the LLM as context.
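The query flow above can be sketched in miniature. Here the vector store is an in-memory list and the reranker is a plain scoring callable standing in for Voyage's rerank-2.5; only the search-then-rerank shape matches the real pipeline:

```python
def dot(a, b):
    """Dot-product similarity, matching the vectorSearch index configuration."""
    return sum(x * y for x, y in zip(a, b))

def rag_query(query_vec, store, rerank_score, k=4, top_n=2):
    """store: list of (chunk_text, embedding) pairs.
    rerank_score: relevance score per chunk (rerank-2.5 stand-in).

    1) vector search by dot product, 2) rerank the top k, 3) keep top_n.
    """
    candidates = sorted(store, key=lambda c: dot(query_vec, c[1]), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda c: rerank_score(c[0]), reverse=True)
    return [text for text, _ in reranked[:top_n]]
```

Note how the reranker can reorder the vector-search candidates: the chunk nearest in embedding space is not always the most relevant answer.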

This enables the agent to answer detailed questions about LoRA configurations, SAM 2 architecture, and manufacturing processes that aren't in the LLM's training data.

Web Search — Parallel

Parallel provides agentic web search as a fallback when the RAG knowledge base doesn't have relevant results. It returns excerpted content from the top web results, giving the agent access to current information beyond its training cutoff.

User & Session Management — Neon PostgreSQL + SQLModel

Neon hosts a serverless PostgreSQL database shared with the parent Next.js web application. The voice agent uses it to:

  • Authenticate users: Verify the user exists and has a verified email before allowing a session
  • Enforce usage limits: Track cumulative seconds used across sessions and reject connections once SESSION_TIME_LIMIT_SECONDS is reached
  • Record sessions: Write session duration after each conversation for billing/analytics
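The admission checks above reduce to a small pure function. This sketch mirrors the logic run against Postgres before a session starts (names are illustrative):

```python
def admit_session(email_verified: bool, seconds_used: int, limit_seconds: int):
    """Decide whether a new voice session may start.

    Returns (allowed, remaining_seconds). A user with an unverified email is
    rejected outright; otherwise admission depends on remaining quota.
    """
    if not email_verified:
        return (False, 0)
    remaining = limit_seconds - seconds_used
    return (remaining > 0, max(remaining, 0))
```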

The ORM layer uses SQLModel (Pydantic + SQLAlchemy) with async sessions via asyncpg, sharing the same user table schema as the Next.js app's Better-Auth system.

Email Summaries — Resend

Resend sends a post-session email to the user containing usage stats and an AI-generated conversation summary. This creates a persistent record of the interaction and encourages users to return to the application.

Observability — Logfire + OpenTelemetry

Logfire (by the Pydantic team) collects traces, metrics, and structured logs from both the agent code and LiveKit's internal telemetry. The agent configures a shared OpenTelemetry tracer provider so that LiveKit spans and application spans appear in the same trace, enabling end-to-end debugging of the full voice pipeline.
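A configuration sketch of that shared-provider wiring, assuming the opentelemetry-sdk package and LiveKit's telemetry hook (`livekit.agents.telemetry.set_tracer_provider`); treat both names as assumptions drawn from their respective docs rather than a verbatim excerpt of this repo:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# One provider for the whole process, so LiveKit spans and application spans
# land in the same traces.
provider = TracerProvider()
trace.set_tracer_provider(provider)  # application code traces via this provider

# Hand the same provider to LiveKit (assumed hook, per LiveKit Agents docs):
from livekit.agents.telemetry import set_tracer_provider
set_tracer_provider(provider)
```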

Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                    Browser (User)                        │
│  ┌─────────────────┐  ┌──────────────────────────────┐  │
│  │ agent-starter-   │  │  SAM 2 Finetuning Web App   │  │
│  │ embed (widget)   │  │  (Next.js frontend)          │  │
│  └────────┬─────────┘  └──────────────────────────────┘  │
└───────────┼──────────────────────────────────────────────┘
            │ WebRTC audio
            ▼
┌───────────────────────────────────────────────────────────┐
│                    LiveKit Server                          │
│         (Room management, audio transport)                 │
└───────────────────────┬───────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────┐
│                  Voice Agent (this repo)                   │
│                                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐              │
│  │ Deepgram │──▶│  OpenAI  │──▶│ Cartesia │              │
│  │  (STT)   │   │  (LLM)   │   │  (TTS)   │              │
│  └──────────┘   └─────┬────┘   └──────────┘              │
│                       │                                   │
│              ┌────────┴────────┐                          │
│              ▼                 ▼                          │
│     ┌──────────────┐  ┌──────────────┐                   │
│     │  RAG Search   │  │  Web Search  │                   │
│     │ MongoDB Atlas │  │  Parallel    │                   │
│     │ + Voyage AI   │  │              │                   │
│     └──────────────┘  └──────────────┘                   │
│                                                           │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────┐  │
│  │ Lemonslice   │  │   Resend   │  │     Logfire      │  │
│  │ (Avatar)     │  │  (Email)   │  │ (Observability)  │  │
│  └──────────────┘  └────────────┘  └──────────────────┘  │
└──────────────┬────────────────────────────────────────────┘
               │
               ▼
┌───────────────────────────────────────────────────────────┐
│              Neon PostgreSQL (Shared DB)                   │
│         Users, sessions, usage tracking                   │
│    (shared with Next.js app via Better-Auth)              │
└───────────────────────────────────────────────────────────┘

Quick Start

# Install dependencies
uv sync --locked

# Set up environment variables
cp .env.example .env.local  # Then fill in all required API keys

# Pre-download ML models
uv run python3 -m src.agent download-files

# Run the agent
uv run python3 -m src.agent start

Container (Podman)

make build    # Build the container image
make run      # Start the container with src/ mounted for live reload
make attach   # Shell into the running container
make remove   # Stop and remove the container

Environment Variables

All variables are loaded from .env.local. See CLAUDE.md for the full list of required environment variables and their associated services.

Related Repositories
