| title | Voice Agent - Deterministic Control |
|---|---|
| emoji | 🦷 |
| colorFrom | gray |
| colorTo | yellow |
| sdk | docker |
| pinned | false |
| app_port | 7860 |
11-state FSM voice agent with six guardrail layers. ~1,500 lines of TypeScript, no SDK abstractions. Tool scoping makes hallucination structurally impossible.
Demo:
demo.mp4
▶ Try the Live Demo — Bring your own Deepgram API key. No key is stored or logged.
Most LLM voice agents can hallucinate business hours, skip intake steps, or break on interruption. This one can't — by design.
Production voice agents built on LLMs share a predictable failure mode: the model controls the conversation, and prompt instructions degrade under pressure. The symptoms are hallucinated facts, skipped required steps, and undefined behavior on interruption. I built this agent to prove a different architecture: the LLM is a constrained tool, not the controller. State flow is enforced in code. Facts come from tools the LLM cannot access until the right moment. Output is validated before it becomes audio.
┌──────────────────────────────────────────────────────────────┐
│ Layer 1: Input Classification │
│ User Speech → classifyInput() → [off-topic?] → Redirect │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: State Machine │
│ 11-state appointment flow (+ CONNECTING/DISCONNECTED), hard transition gates, NLP-assisted progression │
├──────────────────────────────────────────────────────────────┤
│ Layer 3: Tool Scoping │
│ Client executor enforces per-state tool allowlist │
├──────────────────────────────────────────────────────────────┤
│ Layer 4: Tool Execution │
│ Real lookups only — LLM never has facts in context │
├──────────────────────────────────────────────────────────────┤
│ Layer 5: Output Validation │
│ Block + replace any response violating state rules │
├──────────────────────────────────────────────────────────────┤
│ Layer 6: Interruption Handling │
│ Detect barge-in → clear buffers → rollback state → recover │
└──────────────────────────────────────────────────────────────┘
Built on Deepgram's Voice Agent WebSocket API (wss://agent.deepgram.com/v1/agent/converse) with:
- Nova-3 for speech-to-text (fastest, most accurate)
- Aura-2 for text-to-speech
- OpenAI `gpt-4o-mini` for the `think` layer, routed via Deepgram's `think.provider` config
- Short-lived JWT authentication via `/v1/auth/grant`; your API key never touches the WebSocket
Raw WebSocket integration — full control over the event lifecycle, function calling, and binary PCM16 audio pipeline.
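To illustrate the PCM16 side of the audio pipeline, here is a minimal sketch that converts Web Audio style Float32 samples into 16-bit little-endian PCM, ready to send as a binary WebSocket frame. The helper name `floatToPcm16` is hypothetical and not taken from the project source.

```typescript
// Hypothetical helper: convert Float32 samples in [-1, 1] to little-endian PCM16.
function floatToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot overflow the int16 range.
    const clamped = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  });
  return buffer;
}
```

The resulting `ArrayBuffer` can be passed directly to `WebSocket.send()`, which transmits it as a binary frame.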
GREETING → COLLECT_NAME → COLLECT_REASON → SAVE_INTAKE → ASK_TIME →
VALIDATE_HOURS → CONFIRM_SLOT → BOOK_APPOINTMENT → SUMMARY → CLOSING
- `allowedTransitions[]` prevents invalid state jumps at the code level
- `UpdatePrompt` sends a state-specific system prompt on every transition
- Client-side executor enforces `allowedToolNames` per state; out-of-scope calls return an error response and never execute
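The transition gate and the per-state tool allowlist can be sketched roughly as follows. The state names come from the flow above; the tool names (`save_intake`, `check_hours`, `book_appointment`), the exact transition table, and the `StateMachine` and `executeTool` shapes are assumptions for illustration, not the project's actual code.

```typescript
type State =
  | "GREETING" | "COLLECT_NAME" | "COLLECT_REASON" | "SAVE_INTAKE" | "ASK_TIME"
  | "VALIDATE_HOURS" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY" | "CLOSING";

interface StateConfig {
  allowedTransitions: State[];
  allowedToolNames: string[]; // hypothetical tool names
}

const FLOW: Record<State, StateConfig> = {
  GREETING: { allowedTransitions: ["COLLECT_NAME"], allowedToolNames: [] },
  COLLECT_NAME: { allowedTransitions: ["COLLECT_REASON"], allowedToolNames: [] },
  COLLECT_REASON: { allowedTransitions: ["SAVE_INTAKE"], allowedToolNames: [] },
  SAVE_INTAKE: { allowedTransitions: ["ASK_TIME"], allowedToolNames: ["save_intake"] },
  ASK_TIME: { allowedTransitions: ["VALIDATE_HOURS"], allowedToolNames: [] },
  VALIDATE_HOURS: { allowedTransitions: ["CONFIRM_SLOT", "ASK_TIME"], allowedToolNames: ["check_hours"] },
  CONFIRM_SLOT: { allowedTransitions: ["BOOK_APPOINTMENT", "ASK_TIME"], allowedToolNames: [] },
  BOOK_APPOINTMENT: { allowedTransitions: ["SUMMARY"], allowedToolNames: ["book_appointment"] },
  SUMMARY: { allowedTransitions: ["CLOSING"], allowedToolNames: [] },
  CLOSING: { allowedTransitions: [], allowedToolNames: [] },
};

class StateMachine {
  current: State = "GREETING";

  // Hard gate: refuse any jump that is not listed for the current state.
  transition(next: State): boolean {
    if (!FLOW[this.current].allowedTransitions.includes(next)) return false;
    this.current = next;
    return true;
  }
}

type ToolResult = { ok: true; result: unknown } | { ok: false; error: string };

// Client-side executor: an out-of-scope call gets an error response and `run` is never invoked.
function executeTool(state: State, name: string, run: () => unknown): ToolResult {
  if (!FLOW[state].allowedToolNames.includes(name)) {
    return { ok: false, error: `Tool "${name}" is not available in state ${state}` };
  }
  return { ok: true, result: run() };
}
```

Keeping both checks in one lookup table means a new state cannot be added without explicitly declaring its exits and its tools.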
NLP-assisted inference: When the LLM handles things conversationally without calling tools, inferStateFromConversation() detects state transitions from natural language patterns (name detection, scheduling intent, slot confirmation, booking confirmation) and advances the state machine automatically. Tool calls remain authoritative; NLP is a reliable fallback.
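A pattern-based fallback of this kind might look like the sketch below. It is not the project's `inferStateFromConversation()`; the function name `inferNextState`, the regular expressions, and the state pairs each rule advances between are all assumptions chosen to mirror the four pattern families mentioned above.

```typescript
type State =
  | "COLLECT_NAME" | "COLLECT_REASON" | "ASK_TIME" | "VALIDATE_HOURS"
  | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY";

interface InferenceRule {
  from: State;     // rule only applies in this state
  to: State;       // state to advance to on a match
  pattern: RegExp; // hypothetical pattern, tuned for illustration only
}

const RULES: InferenceRule[] = [
  // Name detection
  { from: "COLLECT_NAME", to: "COLLECT_REASON", pattern: /\b(?:my name is|this is|i am|i'm)\s+[a-z]+/i },
  // Scheduling intent: a weekday, a relative day, or a clock time
  {
    from: "ASK_TIME",
    to: "VALIDATE_HOURS",
    pattern: /\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday|today|tomorrow)\b|\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b/i,
  },
  // Slot confirmation
  { from: "CONFIRM_SLOT", to: "BOOK_APPOINTMENT", pattern: /\b(?:yes|yeah|sure|that works|sounds good)\b/i },
  // Booking confirmation
  { from: "BOOK_APPOINTMENT", to: "SUMMARY", pattern: /\bappointment is (?:booked|confirmed)\b/i },
];

// Returns the inferred next state, or null when nothing matches in the current state.
function inferNextState(current: State, utterance: string): State | null {
  const rule = RULES.find((r) => r.from === current && r.pattern.test(utterance));
  return rule ? rule.to : null;
}
```

Because every rule is keyed on the current state, an affirmative like "yes" can only advance the flow from `CONFIRM_SLOT`, never from an earlier intake state.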
Result: The LLM cannot execute a tool outside the current state's allowlist, and the UI state machine stays in sync with the conversation even when the LLM skips function calls.
Six defense layers, each independent:
| Layer | What It Does | Example |
|---|---|---|
| Input Classification | Keyword-based intent routing, no LLM | "What's the weather?" → instant redirect |
| State Gate | Hard transition validation | COLLECT_NAME → ASK_TIME → blocked |
| Tool Scope | Only state-relevant tools in session | Can't check hours during intake |
| Tool Execution | External data lookup, never LLM memory | Business hours from deterministic logic, not context |
| Output Validation | Regex + rule check on agent text | "I've booked" in wrong state → blocked |
| Interruption Handler | Detect barge-in → rollback → recover | Mid-booking interrupt → rollback to CONFIRM_SLOT |
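The first and fifth rows of the table can be illustrated with a small sketch: keyword routing with no LLM call, and a regex check that blocks a booking claim made in the wrong state. The keyword list, the regex, the replacement text, and the names `classifyUtterance` and `validateAgentText` are hypothetical stand-ins, not the project's actual rules.

```typescript
type State = "COLLECT_NAME" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT" | "SUMMARY";

// Layer 1: keyword-based routing (hypothetical keyword list).
const OFF_TOPIC_KEYWORDS = ["weather", "stock", "recipe", "football"];

function classifyUtterance(text: string): "on_topic" | "off_topic" {
  const lower = text.toLowerCase();
  return OFF_TOPIC_KEYWORDS.some((k) => lower.includes(k)) ? "off_topic" : "on_topic";
}

// Layer 5: agent text claiming a booking is only legal once booking has happened.
const BOOKING_CLAIM = /\b(?:i've|i have) booked\b|\bappointment is (?:booked|confirmed)\b/i;
const STATES_ALLOWING_BOOKING_CLAIM: State[] = ["BOOK_APPOINTMENT", "SUMMARY"];

interface Validation {
  allowed: boolean;
  text: string;    // original text, or the safe replacement when blocked
  reason?: string; // set only when blocked
}

function validateAgentText(state: State, text: string): Validation {
  if (BOOKING_CLAIM.test(text) && !STATES_ALLOWING_BOOKING_CLAIM.includes(state)) {
    return {
      allowed: false,
      text: "Let me confirm a few details first.",
      reason: "booking claim outside booking states",
    };
  }
  return { allowed: true, text };
}
```

Both checks are pure functions of the text and the current state, so they add no model latency and behave the same way on every call.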
The UI is built with React + TypeScript (Vite), decoupled from the agent core via a typed EventBus. All visualization panels subscribe to bus events — the WebSocket client never touches the DOM.
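A typed event bus along these lines keeps the agent core and the panels from importing each other. This is a minimal sketch; the event names and payload shapes in `AgentEvents` are assumptions, and the real `event-bus.ts` may differ.

```typescript
// Hypothetical event map: event name -> payload type.
interface AgentEvents {
  stateChanged: { from: string; to: string };
  guardrailTriggered: { layer: number; reason: string };
  toolCalled: { name: string; latencyMs: number };
}

type Handler<T> = (payload: T) => void;

class EventBus<E extends object> {
  private handlers = new Map<keyof E, Set<Handler<unknown>>>();

  // Subscribe to one event; returns an unsubscribe function.
  on<K extends keyof E>(event: K, handler: Handler<E[K]>): () => void {
    const set = this.handlers.get(event) ?? new Set<Handler<unknown>>();
    this.handlers.set(event, set);
    set.add(handler as Handler<unknown>);
    return () => {
      set.delete(handler as Handler<unknown>);
    };
  }

  // Deliver a payload to every current subscriber of the event.
  emit<K extends keyof E>(event: K, payload: E[K]): void {
    this.handlers.get(event)?.forEach((h) => h(payload));
  }
}
```

In a React panel, the unsubscribe function returned by `on` is a natural fit for the cleanup callback of `useEffect`.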
- State Machine Panel — 10-node flow diagram. Current state glows, completed states dim, animated on every transition.
- Guardrail Indicators — 6 stacked layer badges. Flash with reason text when triggered; CSS animation resets via React key cycling.
- Tool Call Feed — Live log showing tool name, arguments, result, and latency. Newest-first, capped at 20 entries.
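The newest-first, capped behavior of the tool call feed can be expressed as a small pure update function. The `ToolCallEntry` shape and the name `pushToolCall` are assumptions for illustration.

```typescript
interface ToolCallEntry {
  name: string;
  args: Record<string, unknown>;
  result: unknown;
  latencyMs: number;
}

const MAX_FEED_ENTRIES = 20;

// Prepend the new entry and drop anything past the cap; the input array is not mutated.
function pushToolCall(feed: readonly ToolCallEntry[], entry: ToolCallEntry): ToolCallEntry[] {
  return [entry, ...feed].slice(0, MAX_FEED_ENTRIES);
}
```

Returning a new array each time fits React state updates, for example `setFeed((prev) => pushToolCall(prev, entry))`.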
Every state has a defined rollback target. On UserStartedSpeaking during agent speech:
- Clear local audio playback buffers
- Determine safe rollback state
- Transition + send `UpdatePrompt` with the rollback state's system prompt
- Inject recovery message
No undefined behavior on interruption.
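The four steps above can be sketched as a single handler. Only the `BOOK_APPOINTMENT` to `CONFIRM_SLOT` rollback is taken from this README; the other rollback targets, the `BargeInDeps` interface, the recovery wording, and the function name are assumptions. The outbound calls are injected as callbacks so the sketch makes no claim about the exact wire format.

```typescript
type State = "ASK_TIME" | "VALIDATE_HOURS" | "CONFIRM_SLOT" | "BOOK_APPOINTMENT";

// Every state has a rollback target. Only BOOK_APPOINTMENT -> CONFIRM_SLOT is documented; the rest are assumed.
const ROLLBACK: Record<State, State> = {
  ASK_TIME: "ASK_TIME",
  VALIDATE_HOURS: "ASK_TIME",
  CONFIRM_SLOT: "CONFIRM_SLOT",
  BOOK_APPOINTMENT: "CONFIRM_SLOT",
};

// Hypothetical dependencies supplied by the WebSocket client.
interface BargeInDeps {
  clearPlayback(): void;
  promptFor(state: State): string;
  sendUpdatePrompt(prompt: string): void;
  injectMessage(text: string): void;
}

// Call on UserStartedSpeaking; returns the state the machine should be in afterwards.
function handleUserStartedSpeaking(current: State, agentSpeaking: boolean, deps: BargeInDeps): State {
  if (!agentSpeaking) return current;                 // ordinary turn, not a barge-in
  deps.clearPlayback();                               // 1. clear local audio buffers
  const target = ROLLBACK[current];                   // 2. determine safe rollback state
  deps.sendUpdatePrompt(deps.promptFor(target));      // 3. transition + UpdatePrompt
  deps.injectMessage("Sorry, go ahead.");             // 4. inject recovery message
  return target;
}
```

Because `ROLLBACK` is typed as `Record<State, State>`, the compiler rejects any state that is added without a rollback target.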
# Install dependencies
npm install
# Build the React SPA
npm run build
# Start Express server (serves dist/ + token endpoint)
npm start
# Open http://localhost:7860
# Paste your Deepgram API key (never stored)
# Allow microphone access → Click Connect

# Terminal 1: Express token server
npm start
# Terminal 2 — Vite dev server (proxies /api to Express)
npm run dev
# Open http://localhost:5173

docker build -t dental-voice-agent .
docker run -p 7860:7860 dental-voice-agent

Your Deepgram API key is:
- Sent via POST to the backend token endpoint (`/api/token`)
- Used once to mint a short-lived JWT via `POST /v1/auth/grant`
- Never logged, stored, or persisted
- The ephemeral JWT (not your key) authenticates the WebSocket connection via `bearer` subprotocol
- JWT expires after 10 minutes
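The two halves of this flow can be sketched as pure helpers: one describing the server-side grant request, one producing the arguments for the browser WebSocket. The `/v1/auth/grant` path, the `bearer` subprotocol, the 10-minute lifetime, and the agent URL come from this README. The `api.deepgram.com` host, the `Token` authorization scheme, and the `ttl_seconds` body field reflect my understanding of Deepgram's token API and should be checked against the official documentation. The helper names are hypothetical.

```typescript
interface GrantRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Server side: describe the request that exchanges the API key for a short-lived JWT.
// Assumed (verify against Deepgram docs): host, "Token" scheme, and the ttl_seconds field.
function buildGrantRequest(apiKey: string, ttlSeconds = 600): GrantRequest {
  return {
    url: "https://api.deepgram.com/v1/auth/grant",
    init: {
      method: "POST",
      headers: { Authorization: `Token ${apiKey}`, "Content-Type": "application/json" },
      body: JSON.stringify({ ttl_seconds: ttlSeconds }),
    },
  };
}

// Client side: the JWT, not the API key, travels as the second WebSocket subprotocol value.
function agentSocketArgs(jwt: string): [string, string[]] {
  return ["wss://agent.deepgram.com/v1/agent/converse", ["bearer", jwt]];
}
```

On the server this would be used as `fetch(req.url, req.init)`, and in the browser as `new WebSocket(...agentSocketArgs(jwt))`. Neither helper stores or logs the key.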
| Component | Choice | Rationale |
|---|---|---|
| Voice API | Deepgram Voice Agent API (WebSocket) | Nova-3 STT, Aura-2 TTS, built-in turn detection |
| Frontend | React + TypeScript (Vite) | Stable component model, clean state management |
| Styling | Vanilla CSS + CSS variables | Full control, no framework overhead |
| Server | Express + tsx | Serves SPA + BYOK token endpoint |
| Deployment | Docker → HF Spaces | Free hosting, direct URL access |
| Design | Dark theme, copper accents, glassmorphism | Premium look, architecture visibility |
├── server.ts # Express server (BYOK token endpoint + static serve)
├── Dockerfile # Multi-stage build: Vite build → lean runtime
├── index.html # React entry point
├── src/
│ ├── main.tsx # ReactDOM.createRoot entry
│ ├── main_sdk.ts # Voice agent core (WebSocket, audio I/O, guardrails)
│ ├── state-machine.ts # State definitions, transitions, NLP inference, tool execution
│ ├── event-bus.ts # Typed EventBus — decouples agent core from UI
│ ├── components/
│ │ ├── App.tsx # Root layout
│ │ ├── ConnectionBar.tsx # API key input, connect/disconnect, status
│ │ ├── StateMachinePanel.tsx # Animated state flow visualization
│ │ ├── GuardrailPanel.tsx # 6-layer guardrail badges with flash animation
│ │ └── ToolFeedPanel.tsx # Live tool call log with latency
│ └── ui/
│ └── styles.css # Design system (CSS variables, dark theme, animations)
├── ARCHITECTURE.md # Detailed layer-by-layer architecture doc
└── vite.config.ts # Vite + React plugin config
These are intentional scope boundaries for a prototype — not oversights:
This demo does not persist appointment records; bookings are confirmed in-call only.
- Test Suite — 25-30 scripted conversation scenarios covering edge cases (interruption during tool call, double-booking, off-topic chains)
- Production Infrastructure — Health checks, monitoring, WebSocket reconnection, token refresh
- Real Integrations — Database for patient records, calendar API for availability, Twilio for telephony
- Error Recovery — Rate limit backoff, graceful degradation, session recovery
- Load Testing — Concurrent session handling, connection pooling
The core architecture (state machine, guardrails, tool scoping, NLP fallback) is designed to be production-transferable.