ree2raz/dental-desk

title Voice Agent - Deterministic Control
emoji 🦷
colorFrom gray
colorTo yellow
sdk docker
pinned false
app_port 7860

Voice Agent with Deterministic State Machine and Multi-Layer Guardrails

11-state FSM voice agent with six guardrail layers. ~1,500 lines of TypeScript, no SDK abstractions. Tool scoping makes hallucination structurally impossible.

Demo:

demo.mp4

▶ Try the Live Demo — Bring your own Deepgram API key. No key is stored or logged.


Why This Exists

Most LLM voice agents can hallucinate business hours, skip intake steps, or break on interruption. This one can't — by design.

Production voice agents built on LLMs share a predictable failure mode: the model controls the conversation, and prompt instructions degrade under pressure. The results are hallucinated facts, skipped required steps, and undefined behavior on interruption. I built this agent to prove a different architecture: the LLM is a constrained tool, not the controller. State flow is enforced in code. Facts come from tools that the LLM cannot access until the conversation reaches the right state. Output is validated before it becomes audio.


Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│  Layer 1: Input Classification                               │
│  User Speech → classifyInput() → [off-topic?] → Redirect    │
├──────────────────────────────────────────────────────────────┤
│  Layer 2: State Machine                                      │
│  11-state appointment flow (+ CONNECTING/DISCONNECTED),      │
│  hard transition gates, NLP-assisted progression             │
├──────────────────────────────────────────────────────────────┤
│  Layer 3: Tool Scoping                                       │
│  Client executor enforces per-state tool allowlist           │
├──────────────────────────────────────────────────────────────┤
│  Layer 4: Tool Execution                                     │
│  Real lookups only — LLM never has facts in context          │
├──────────────────────────────────────────────────────────────┤
│  Layer 5: Output Validation                                  │
│  Block + replace any response violating state rules          │
├──────────────────────────────────────────────────────────────┤
│  Layer 6: Interruption Handling                              │
│  Detect barge-in → clear buffers → rollback state → recover  │
└──────────────────────────────────────────────────────────────┘

Key Technical Decisions

1. Deepgram Voice Agent API

Built on Deepgram's Voice Agent WebSocket API (wss://agent.deepgram.com/v1/agent/converse) with:

  • Nova-3 for speech-to-text (fastest, most accurate)
  • Aura-2 for text-to-speech
  • OpenAI gpt-4o-mini for the think layer, routed via Deepgram's think.provider config
  • Short-lived JWT authentication via /v1/auth/grant — your API key never touches the WebSocket

Raw WebSocket integration — full control over the event lifecycle, function calling, and binary PCM16 audio pipeline.
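
As a rough illustration of the configuration above, the first JSON message on the socket could be built as follows. This is a sketch, not the repository's code: `buildSettings` and `AgentSettings` are hypothetical names, the field layout follows my reading of Deepgram's Voice Agent v1 Settings message, and the sample rates and the `aura-2-thalia-en` voice are assumptions to verify against the current docs.

```typescript
// Hypothetical shape of the initial "Settings" message (assumed field names).
interface AgentSettings {
  type: "Settings";
  audio: {
    input: { encoding: "linear16"; sample_rate: number };
    output: { encoding: "linear16"; sample_rate: number; container: "none" };
  };
  agent: {
    listen: { provider: { type: "deepgram"; model: string } };
    think: { provider: { type: "open_ai"; model: string }; prompt: string };
    speak: { provider: { type: "deepgram"; model: string } };
  };
}

// Builds the settings object for a given state-specific system prompt.
function buildSettings(prompt: string): AgentSettings {
  return {
    type: "Settings",
    audio: {
      // PCM16 in both directions; the sample rates are assumptions.
      input: { encoding: "linear16", sample_rate: 16000 },
      output: { encoding: "linear16", sample_rate: 24000, container: "none" },
    },
    agent: {
      listen: { provider: { type: "deepgram", model: "nova-3" } },
      think: { provider: { type: "open_ai", model: "gpt-4o-mini" }, prompt },
      speak: { provider: { type: "deepgram", model: "aura-2-thalia-en" } },
    },
  };
}

// The serialized form is what would be sent as a text frame after the socket opens.
const settingsJson = JSON.stringify(buildSettings("You are a dental front-desk agent."));
```

Keeping the message as a typed object means a typo in a provider or model field fails at compile time rather than at connection time.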

2. Deterministic State Machine + NLP Fallback

GREETING → COLLECT_NAME → COLLECT_REASON → SAVE_INTAKE → ASK_TIME →
VALIDATE_HOURS → CONFIRM_SLOT → BOOK_APPOINTMENT → SUMMARY → CLOSING

  • allowedTransitions[] prevents invalid state jumps at the code level
  • UpdatePrompt sends a state-specific system prompt on every transition
  • Client-side executor enforces allowedToolNames per state — out-of-scope calls return an error response, never execute
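
The transition gate and the per-state tool allowlist described above can be sketched roughly as follows. Only the state names, `allowedTransitions`, and `allowedToolNames` come from this README; the specific edges, the tool names (`save_intake`, `check_hours`, `book_appointment`), the 9-to-17 opening hours, and the helpers `canTransition` and `executeTool` are assumptions made for illustration.

```typescript
type State =
  | "GREETING" | "COLLECT_NAME" | "COLLECT_REASON" | "SAVE_INTAKE" | "ASK_TIME"
  | "VALIDATE_HOURS" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY" | "CLOSING";

interface StateConfig {
  allowedTransitions: State[];
  allowedToolNames: string[];
}

// Assumed edges and tool scopes; the real table lives in state-machine.ts.
const STATES: Record<State, StateConfig> = {
  GREETING: { allowedTransitions: ["COLLECT_NAME"], allowedToolNames: [] },
  COLLECT_NAME: { allowedTransitions: ["COLLECT_REASON"], allowedToolNames: [] },
  COLLECT_REASON: { allowedTransitions: ["SAVE_INTAKE"], allowedToolNames: [] },
  SAVE_INTAKE: { allowedTransitions: ["ASK_TIME"], allowedToolNames: ["save_intake"] },
  ASK_TIME: { allowedTransitions: ["VALIDATE_HOURS"], allowedToolNames: [] },
  VALIDATE_HOURS: { allowedTransitions: ["CONFIRM_SLOT", "ASK_TIME"], allowedToolNames: ["check_hours"] },
  CONFIRM_SLOT: { allowedTransitions: ["BOOK_APPOINTMENT", "ASK_TIME"], allowedToolNames: [] },
  BOOK_APPOINTMENT: { allowedTransitions: ["SUMMARY"], allowedToolNames: ["book_appointment"] },
  SUMMARY: { allowedTransitions: ["CLOSING"], allowedToolNames: [] },
  CLOSING: { allowedTransitions: [], allowedToolNames: [] },
};

// Hard gate: a jump is valid only if it is listed for the current state.
function canTransition(from: State, to: State): boolean {
  return STATES[from].allowedTransitions.indexOf(to) !== -1;
}

type ToolArgs = { [key: string]: unknown };
type ToolResult = { ok: true; result: unknown } | { ok: false; error: string };

// Hypothetical deterministic tools; facts come from code, not from the prompt.
const TOOLS: { [name: string]: (args: ToolArgs) => unknown } = {
  save_intake: (args) => ({ saved: true, name: args["name"] }),
  check_hours: (args) => {
    const hour = Number(args["hour"]);
    return { open: hour >= 9 && hour < 17 }; // assumed opening hours
  },
  book_appointment: (args) => ({ booked: true, hour: args["hour"] }),
};

// Client-side executor: out-of-scope calls get an error response and never run.
function executeTool(state: State, name: string, args: ToolArgs): ToolResult {
  if (STATES[state].allowedToolNames.indexOf(name) === -1) {
    return { ok: false, error: `Tool "${name}" is not available in state ${state}` };
  }
  const tool = TOOLS[name];
  if (!tool) {
    return { ok: false, error: `Unknown tool "${name}"` };
  }
  return { ok: true, result: tool(args) };
}
```

In this sketch, `executeTool("COLLECT_NAME", "check_hours", { hour: 10 })` returns an error object without touching the tool, while the same call in `VALIDATE_HOURS` executes.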

NLP-assisted inference: When the LLM handles things conversationally without calling tools, inferStateFromConversation() detects state transitions from natural language patterns (name detection, scheduling intent, slot confirmation, booking confirmation) and advances the state machine automatically. Tool calls remain authoritative; NLP is a reliable fallback.
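
A pattern-based fallback of this kind might look like the sketch below. The function name `inferStateFromConversation` is from this README, but its signature, the rule table, and every regular expression here are assumptions chosen to illustrate the four pattern families named above.

```typescript
type State =
  | "GREETING" | "COLLECT_NAME" | "COLLECT_REASON" | "SAVE_INTAKE" | "ASK_TIME"
  | "VALIDATE_HOURS" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY" | "CLOSING";

type Speaker = "user" | "agent";

interface InferenceRule {
  from: State;      // the rule only fires in this state
  to: State;        // the state to advance to
  speaker: Speaker; // whose utterance is inspected
  pattern: RegExp;  // no "g" flag, so test() keeps no state between calls
}

// Illustrative patterns only; real transcripts need broader coverage.
const RULES: InferenceRule[] = [
  // Name detection
  { from: "COLLECT_NAME", to: "COLLECT_REASON", speaker: "user",
    pattern: /\b(?:my name is|this is|i am|i'm)\s+[a-z]+/i },
  // Scheduling intent
  { from: "ASK_TIME", to: "VALIDATE_HOURS", speaker: "user",
    pattern: /\b(?:\d{1,2}(?::\d{2})?\s*(?:am|pm)|tomorrow|monday|tuesday|wednesday|thursday|friday)\b/i },
  // Slot confirmation
  { from: "CONFIRM_SLOT", to: "BOOK_APPOINTMENT", speaker: "user",
    pattern: /\b(?:yes|yeah|sounds good|that works)\b/i },
  // Booking confirmation
  { from: "BOOK_APPOINTMENT", to: "SUMMARY", speaker: "agent",
    pattern: /\b(?:is|has been) (?:booked|confirmed|scheduled)\b/i },
];

// Returns the inferred next state, or null when no rule matches.
function inferStateFromConversation(
  current: State,
  speaker: Speaker,
  text: string
): State | null {
  for (const rule of RULES) {
    if (rule.from === current && rule.speaker === speaker && rule.pattern.test(text)) {
      return rule.to;
    }
  }
  return null;
}
```

Because each rule is keyed on the current state, a matching phrase heard in the wrong state produces no transition, and any inferred state should still pass the `allowedTransitions` gate before it is applied.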

Result: A tool call outside the current state's allowlist never executes; the executor returns an error instead. The UI state machine stays in sync with the conversation even when the LLM skips function calls.

3. Multi-Layer Guardrails

Six defense layers, each independent:

| Layer | What It Does | Example |
| --- | --- | --- |
| Input Classification | Keyword-based intent routing, no LLM | "What's the weather?" → instant redirect |
| State Gate | Hard transition validation | COLLECT_NAME → ASK_TIME → blocked |
| Tool Scope | Only state-relevant tools in session | Can't check hours during intake |
| Tool Execution | External data lookup, never LLM memory | Business hours from deterministic logic, not context |
| Output Validation | Regex + rule check on agent text | "I've booked" in wrong state → blocked |
| Interruption Handler | Detect barge-in → rollback → recover | Mid-booking interrupt → rollback to CONFIRM_SLOT |
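
Two of these layers, input classification and output validation, are simple enough to sketch directly. `classifyInput` is named in the architecture diagram; its return type, the keyword list, `validateOutput`, the rule table, and the fallback text are all assumptions for illustration.

```typescript
// Layer 1: keyword-based routing with no LLM call. The keyword list is hypothetical.
const OFF_TOPIC_KEYWORDS = ["weather", "stock price", "recipe", "joke"];

function classifyInput(text: string): "on-topic" | "off-topic" {
  const lower = text.toLowerCase();
  const hit = OFF_TOPIC_KEYWORDS.some((keyword) => lower.indexOf(keyword) !== -1);
  return hit ? "off-topic" : "on-topic";
}

// Layer 5: block agent text that makes a claim the current state does not permit.
interface OutputRule {
  pattern: RegExp;         // phrase that is restricted
  allowedStates: string[]; // states in which the phrase is legitimate
  fallback: string;        // replacement text when the phrase is blocked
  reason: string;          // shown in the guardrail panel
}

const OUTPUT_RULES: OutputRule[] = [
  {
    pattern: /\b(?:i've|i have) booked\b|\bis (?:booked|confirmed)\b/i,
    allowedStates: ["BOOK_APPOINTMENT", "SUMMARY", "CLOSING"],
    fallback: "Let me confirm a few details before I book anything.",
    reason: "booking claim before BOOK_APPOINTMENT",
  },
];

interface ValidationResult {
  blocked: boolean;
  text: string;    // original text, or the fallback if blocked
  reason?: string;
}

function validateOutput(state: string, text: string): ValidationResult {
  for (const rule of OUTPUT_RULES) {
    const restricted = rule.allowedStates.indexOf(state) === -1;
    if (restricted && rule.pattern.test(text)) {
      return { blocked: true, text: rule.fallback, reason: rule.reason };
    }
  }
  return { blocked: false, text };
}
```

With these rules, `classifyInput("What's the weather?")` yields `"off-topic"`, and "I've booked your appointment" is replaced when the state is `COLLECT_NAME` but passes unchanged in `SUMMARY`.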

4. React UI with Real-Time Architecture Visualization

The UI is built with React + TypeScript (Vite), decoupled from the agent core via a typed EventBus. All visualization panels subscribe to bus events — the WebSocket client never touches the DOM.

  • State Machine Panel — 10-node flow diagram. Current state glows, completed states dim, animated on every transition.
  • Guardrail Indicators — 6 stacked layer badges. Flash with reason text when triggered; CSS animation resets via React key cycling.
  • Tool Call Feed — Live log showing tool name, arguments, result, and latency. Newest-first, capped at 20 entries.
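
The decoupling described in this section can be sketched with a small typed event bus plus a capped, newest-first feed. The event names, payload shapes, and the `EventBus` API below are assumptions; only the idea of a typed bus and the 20-entry cap come from this README.

```typescript
// Hypothetical event map: event name -> payload type.
type AgentEvents = {
  "state:changed": { from: string; to: string };
  "guardrail:triggered": { layer: number; reason: string };
  "tool:called": { name: string; args: unknown; result: unknown; latencyMs: number };
};

type Handler<T> = (payload: T) => void;

class EventBus<E> {
  // Stored loosely; the public on/emit signatures keep callers type-safe.
  private handlers: { [event: string]: Handler<any>[] } = {};

  // Subscribes and returns an unsubscribe function (handy for React effects).
  on<K extends keyof E & string>(event: K, handler: Handler<E[K]>): () => void {
    const list = this.handlers[event] || (this.handlers[event] = []);
    list.push(handler);
    return () => {
      const index = list.indexOf(handler);
      if (index !== -1) list.splice(index, 1);
    };
  }

  emit<K extends keyof E & string>(event: K, payload: E[K]): void {
    const list = this.handlers[event] || [];
    // Copy first so a handler may unsubscribe while the event is dispatched.
    list.slice().forEach((handler) => handler(payload));
  }
}

type ToolCall = AgentEvents["tool:called"];
const MAX_FEED_ENTRIES = 20;

// Pure reducer for the tool feed: newest first, capped at 20 entries.
function pushToolCall(feed: ToolCall[], entry: ToolCall): ToolCall[] {
  return [entry].concat(feed).slice(0, MAX_FEED_ENTRIES);
}
```

A panel would call `bus.on("tool:called", ...)` in an effect and feed each payload through `pushToolCall`, so the WebSocket client only ever calls `bus.emit` and never touches the DOM.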

5. Interruption Handling with State Rollback

Every state has a defined rollback target. On UserStartedSpeaking during agent speech:

  1. Clear local audio playback buffers
  2. Determine safe rollback state
  3. Transition + send UpdatePrompt with rollback state's system prompt
  4. Inject recovery message

No undefined behavior on interruption.
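
The four steps above can be sketched as a single handler. The rollback table is an assumption except for BOOK_APPOINTMENT → CONFIRM_SLOT, which matches the guardrail example in this README; `AgentTransport`, `CallSession`, and `handleUserStartedSpeaking` are hypothetical names that hide the actual wire messages.

```typescript
type State =
  | "GREETING" | "COLLECT_NAME" | "COLLECT_REASON" | "SAVE_INTAKE" | "ASK_TIME"
  | "VALIDATE_HOURS" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY" | "CLOSING";

// Assumed rollback targets: states with side effects fall back to the
// last state where the user was being asked a question.
const ROLLBACK_TARGET: Record<State, State> = {
  GREETING: "GREETING",
  COLLECT_NAME: "COLLECT_NAME",
  COLLECT_REASON: "COLLECT_REASON",
  SAVE_INTAKE: "COLLECT_REASON",
  ASK_TIME: "ASK_TIME",
  VALIDATE_HOURS: "ASK_TIME",
  CONFIRM_SLOT: "CONFIRM_SLOT",
  BOOK_APPOINTMENT: "CONFIRM_SLOT",
  SUMMARY: "SUMMARY",
  CLOSING: "CLOSING",
};

// Hypothetical wrapper around audio playback and the agent WebSocket.
interface AgentTransport {
  clearAudio(): void;                     // drop queued playback buffers
  updatePrompt(prompt: string): void;     // sends UpdatePrompt
  injectAgentMessage(text: string): void; // speaks a recovery line
}

interface CallSession {
  state: State;
  agentSpeaking: boolean;
}

// Called on UserStartedSpeaking. Returns the rollback state, or null when
// the agent was silent and the event is just a normal user turn.
function handleUserStartedSpeaking(
  session: CallSession,
  transport: AgentTransport,
  promptFor: (state: State) => string
): State | null {
  if (!session.agentSpeaking) return null;

  transport.clearAudio();                                          // step 1
  const target = ROLLBACK_TARGET[session.state];                   // step 2
  session.state = target;                                          // step 3
  transport.updatePrompt(promptFor(target));
  transport.injectAgentMessage("Sorry, go ahead. I'm listening."); // step 4
  session.agentSpeaking = false;
  return target;
}
```

Because `ROLLBACK_TARGET` is typed as `Record<State, State>`, adding a state without a rollback target is a compile error, which is one way to keep the "every state has a defined rollback target" guarantee.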


Running Locally

# Install dependencies
npm install

# Build the React SPA
npm run build

# Start Express server (serves dist/ + token endpoint)
npm start

# Open http://localhost:7860
# Paste your Deepgram API key (never stored)
# Allow microphone access → Click Connect

Development Mode (hot reload)

# Terminal 1 — Express token server
npm start

# Terminal 2 — Vite dev server (proxies /api to Express)
npm run dev
# Open http://localhost:5173

Production / Docker

docker build -t dental-voice-agent .
docker run -p 7860:7860 dental-voice-agent

BYOK (Bring Your Own Key) Security

Your Deepgram API key is:

  • Sent via POST to the backend token endpoint (/api/token)
  • Used once to mint a short-lived JWT via POST /v1/auth/grant
  • Never logged, stored, or persisted

The ephemeral JWT (not your key) authenticates the WebSocket connection via the bearer subprotocol, and it expires after 10 minutes.
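
The two halves of this exchange can be sketched as pure helpers. This is an illustration under assumptions, not the code in server.ts: `buildGrantRequest` and `webSocketProtocols` are hypothetical names, and the `Authorization: Token ...` header, the `ttl_seconds` body field, and the `api.deepgram.com` host reflect my understanding of Deepgram's token-based auth and should be checked against the official docs.

```typescript
// Description of the server-side request that trades the API key for a JWT.
interface GrantRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Server side: build the grant request. The key appears only in this header
// and is never written to logs or storage.
function buildGrantRequest(apiKey: string, ttlSeconds = 600): GrantRequest {
  return {
    url: "https://api.deepgram.com/v1/auth/grant",
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "application/json",
    },
    // 600 seconds matches the 10-minute expiry stated above (field name assumed).
    body: JSON.stringify({ ttl_seconds: ttlSeconds }),
  };
}

// Browser side: the JWT travels as a WebSocket subprotocol pair, so the
// connection URL itself carries no credentials.
function webSocketProtocols(jwt: string): string[] {
  return ["bearer", jwt];
}

// Usage sketch (not executed here):
//   const req = buildGrantRequest(keyFromPostBody);
//   const res = await fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
//   const { access_token } = await res.json();
//   new WebSocket("wss://agent.deepgram.com/v1/agent/converse", webSocketProtocols(access_token));
```

Separating request construction from the network call keeps the key-handling logic easy to audit and to test without a live Deepgram account.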

Tech Stack

| Component | Choice | Rationale |
| --- | --- | --- |
| Voice API | Deepgram Voice Agent API (WebSocket) | Nova-3 STT, Aura-2 TTS, built-in turn detection |
| Frontend | React + TypeScript (Vite) | Stable component model, clean state management |
| Styling | Vanilla CSS + CSS variables | Full control, no framework overhead |
| Server | Express + tsx | Serves SPA + BYOK token endpoint |
| Deployment | Docker → HF Spaces | Free hosting, direct URL access |
| Design | Dark theme, copper accents, glassmorphism | Premium look, architecture visibility |

Project Structure

├── server.ts                  # Express server (BYOK token endpoint + static serve)
├── Dockerfile                 # Multi-stage build: Vite build → lean runtime
├── index.html                 # React entry point
├── src/
│   ├── main.tsx               # ReactDOM.createRoot entry
│   ├── main_sdk.ts            # Voice agent core (WebSocket, audio I/O, guardrails)
│   ├── state-machine.ts       # State definitions, transitions, NLP inference, tool execution
│   ├── event-bus.ts           # Typed EventBus — decouples agent core from UI
│   ├── components/
│   │   ├── App.tsx            # Root layout
│   │   ├── ConnectionBar.tsx  # API key input, connect/disconnect, status
│   │   ├── StateMachinePanel.tsx  # Animated state flow visualization
│   │   ├── GuardrailPanel.tsx     # 6-layer guardrail badges with flash animation
│   │   └── ToolFeedPanel.tsx      # Live tool call log with latency
│   └── ui/
│       └── styles.css         # Design system (CSS variables, dark theme, animations)
├── ARCHITECTURE.md            # Detailed layer-by-layer architecture doc
└── vite.config.ts             # Vite + React plugin config

Production Roadmap

This demo does not persist appointment records; bookings are confirmed in-call only.

The items below are intentional scope boundaries for a prototype, not oversights:

  1. Test Suite — 25-30 scripted conversation scenarios covering edge cases (interruption during tool call, double-booking, off-topic chains)
  2. Production Infrastructure — Health checks, monitoring, WebSocket reconnection, token refresh
  3. Real Integrations — Database for patient records, calendar API for availability, Twilio for telephony
  4. Error Recovery — Rate limit backoff, graceful degradation, session recovery
  5. Load Testing — Concurrent session handling, connection pooling

The core architecture (state machine, guardrails, tool scoping, NLP fallback) is designed to be production-transferable.

About

FSM Voice agent prototype for clinic appointment scheduling.
