A production-oriented voice booking system that converts user speech into safe, deterministic service bookings. Built for rapid iteration without sacrificing correctness.
Voice is treated as just another input channel, never as an authority.
- Runtime: Bun + Node.js
- Backend: Express + TypeScript
- Database: PostgreSQL
- ORM: Prisma
- Cache / State: Redis
- Speech-to-Text (STT): Pluggable (Google STT / Bhashini)
- Intent Extraction: LLM (guard-railed)
- Text-to-Speech (TTS): External provider / device TTS
- LLMs do not execute business logic
- All AI outputs are validated, structured, and rejectable
- Conversation state is explicit and externalized
- Booking APIs are idempotent and shared
- Confirmation is mandatory before booking
If any of these are violated, it's a bug, not a feature.
User Voice
→ Audio Upload
→ Speech-to-Text
→ Intent Extraction (LLM)
→ Schema Validation
→ Conversation State (Redis)
→ Booking Engine
→ Confirmation
→ Text-to-Speech
src/
  app.ts                  # Express app
  server.ts               # Bootstraps server
  config/                 # Env & config loaders
  voice/
    voice.routes.ts       # Voice entrypoints
    voice.controller.ts
    stt/                  # Speech-to-text adapters
    intent/               # LLM intent extraction
    state/                # Redis conversation state
    responses/            # Voice/text responses
  booking/
    booking.service.ts    # Core booking logic
  prisma/
    schema.prisma
  infra/
    redis.ts
    db.ts
No "ai" folder. This is product code, not a demo.
bun install
bun run dev
Server starts on http://localhost:4006 (configurable via PORT).
curl -X POST http://localhost:4006/voice/audio \
-H "X-Conversation-Id: test-123" \
-F "audio=@/Users/pushkarmondal/100xdevs/voice_booking/sample.wav"
- Go to Google Cloud Console
- Select your project (or create one)
- Navigate to IAM & Admin → Service Accounts
- Click Create Service Account
- Give it a name (example: speech-to-text-service)
- Grant the Cloud Speech-to-Text Admin role (or at minimum Cloud Speech Client)
- Click Done
- Click on the created service account
- Go to the Keys tab
- Click Add Key → Create New Key
- Choose JSON
- Download the JSON file
Option A (recommended for local dev): set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"
Or in your .env file:
GOOGLE_APPLICATION_CREDENTIALS=/Users/pushkarmondal/100xdevs/voice_booking/google-credentials.json
- Go to Google Cloud Console
- Navigate to APIs & Services → Library
- Search for Cloud Speech-to-Text API
- Click Enable
bun run dev
# 1. Place your service account JSON file in your project
mv ~/Downloads/your-service-account-key.json ./google-credentials.json
# 2. Add to .env
echo "GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-credentials.json" >> .env
# 3. Add to .gitignore to avoid committing credentials
echo "google-credentials.json" >> .gitignore
# 4. Restart your server
bun run dev
DATABASE_URL=postgresql://...
REDIS_URL=redis://...
STT_API_KEY=...
LLM_API_KEY=...
VOICE_BOOKING_ENABLED=true
bun prisma generate
bun prisma migrate dev
bun run dev
Server starts on http://localhost:4006 (configurable via PORT).
POST /voice/audio
Headers:
X-Conversation-Id: <uuid>
Body:
audio/wav | audio/webm
Behavior
- Accepts max 15s audio
- Returns 202 Accepted
- Triggers async voice pipeline
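The upload rules above can be sketched as a pure check that runs before the async pipeline is triggered. This is a minimal illustration, not the project's handler: `checkUpload`, `UploadMeta`, and the rejection status codes (400, 413, 415) are assumptions, and the audio duration is assumed to be measured by the caller. The conversation id is only checked for presence, since the curl example uses a non-UUID value.

```typescript
// Illustrative sketch only: checkUpload, UploadMeta, and the rejection status
// codes are assumptions, not the project's actual handler.
const ALLOWED_AUDIO_TYPES = ["audio/wav", "audio/webm"];
const MAX_AUDIO_SECONDS = 15;

type UploadMeta = {
  conversationId?: string; // value of the X-Conversation-Id header
  mimeType?: string;       // MIME type of the uploaded audio part
  durationSeconds: number; // assumed to be measured by the caller
};

type UploadDecision =
  | { accepted: true; status: 202 }
  | { accepted: false; status: 400 | 413 | 415; error: string };

function checkUpload(meta: UploadMeta): UploadDecision {
  const conversationId = (meta.conversationId ?? "").trim();
  if (conversationId === "") {
    return { accepted: false, status: 400, error: "missing X-Conversation-Id" };
  }

  // Drop parameters such as ";codecs=opus" before comparing.
  const [rawType = ""] = (meta.mimeType ?? "").split(";");
  const baseType = rawType.trim().toLowerCase();
  if (!ALLOWED_AUDIO_TYPES.includes(baseType)) {
    return { accepted: false, status: 415, error: `unsupported audio type: ${baseType || "none"}` };
  }

  // Rejects NaN, zero, negative, and over-limit durations.
  if (!(meta.durationSeconds > 0) || meta.durationSeconds > MAX_AUDIO_SECONDS) {
    return { accepted: false, status: 413, error: `audio must be longer than 0s and at most ${MAX_AUDIO_SECONDS}s` };
  }

  return { accepted: true, status: 202 };
}
```

Keeping the check free of framework types means the same function could be called from an Express route and from unit tests without a running server.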
LLM must return ONLY JSON:
{
"intent": "BOOK_SERVICE",
"slots": {
"service": "facial",
"date": "2026-02-01",
"time": "evening",
"location": "near_me"
},
"confidence": 0.87
}
If:
- confidence is low
- fields are ambiguous
- schema validation fails
→ the system asks for clarification.
No guessing. Ever.
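One way to enforce the "validated, structured, and rejectable" rule is to parse the LLM output into a result type that is either a typed intent or a rejection reason. The sketch below is an assumption-laden illustration: `parseIntent`, the reason strings, and the `0.75` threshold are hypothetical, and `BOOK_SERVICE` is the only intent accepted because it is the only one shown here. A rejection is what would trigger a clarification prompt.

```typescript
// Hypothetical names throughout: parseIntent, MIN_CONFIDENCE, and the reason
// strings are illustrative assumptions, not part of the project.
type Slots = { service?: string; date?: string; time?: string; location?: string };
type Intent = { intent: "BOOK_SERVICE"; slots: Slots; confidence: number };
type ParseResult = { ok: true; value: Intent } | { ok: false; reason: string };

const MIN_CONFIDENCE = 0.75; // assumed threshold; tune per deployment
const SLOT_KEYS = ["service", "date", "time", "location"] as const;

function parseIntent(raw: string, minConfidence: number = MIN_CONFIDENCE): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw); // anything that is not pure JSON is rejected
  } catch {
    return { ok: false, reason: "not_json" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, reason: "not_object" };
  }
  const obj = data as Record<string, unknown>;
  if (obj.intent !== "BOOK_SERVICE") return { ok: false, reason: "unknown_intent" };

  const confidence = obj.confidence;
  if (typeof confidence !== "number" || !Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return { ok: false, reason: "bad_confidence" };
  }

  const rawSlots = obj.slots;
  if (typeof rawSlots !== "object" || rawSlots === null || Array.isArray(rawSlots)) {
    return { ok: false, reason: "bad_slots" };
  }
  const slots: Slots = {};
  for (const [key, value] of Object.entries(rawSlots as Record<string, unknown>)) {
    // Unknown keys and empty values are rejected rather than silently dropped.
    if (!(SLOT_KEYS as readonly string[]).includes(key)) return { ok: false, reason: "unexpected_slot" };
    if (typeof value !== "string" || value.trim() === "") return { ok: false, reason: "bad_slot_value" };
    slots[key as keyof Slots] = value;
  }
  if (slots.date !== undefined && !/^\d{4}-\d{2}-\d{2}$/.test(slots.date)) {
    return { ok: false, reason: "bad_date" };
  }

  if (confidence < minConfidence) return { ok: false, reason: "low_confidence" };
  return { ok: true, value: { intent: "BOOK_SERVICE", slots, confidence } };
}
```

Because the function never throws and never fills in missing values, the caller has only two paths: proceed with a typed intent, or ask the user to clarify.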
Each conversation is tracked explicitly:
{
state: "COLLECTING" | "CONFIRMING" | "BOOKED",
slots: { service?, date?, time?, location? },
expiresAt
}
TTL: 15 minutes
Stateless APIs. Stateful experience.
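The state record and its transitions can be sketched with an in-memory `Map` standing in for Redis. Everything named here is hypothetical (`loadConversation`, `mergeSlots`, `confirm`). The sketch also assumes that all four slots are required before moving to `CONFIRMING` and that each write refreshes the 15-minute TTL; neither rule is stated by the project.

```typescript
// Illustrative sketch: a Map stands in for Redis, and the function names,
// required-slot rule, and sliding TTL are assumptions.
type ConversationState = "COLLECTING" | "CONFIRMING" | "BOOKED";
type Slots = { service?: string; date?: string; time?: string; location?: string };
type Conversation = { state: ConversationState; slots: Slots; expiresAt: number };

const TTL_MS = 15 * 60 * 1000;
const REQUIRED_SLOTS: (keyof Slots)[] = ["service", "date", "time", "location"];
const store = new Map<string, Conversation>();

// Returns the live conversation, or a fresh one if missing or expired.
function loadConversation(id: string, now: number): Conversation {
  const existing = store.get(id);
  if (existing && existing.expiresAt > now) return existing;
  return { state: "COLLECTING", slots: {}, expiresAt: now + TTL_MS };
}

// Merges newly extracted slots; moves to CONFIRMING once nothing is missing.
function mergeSlots(id: string, incoming: Slots, now: number): Conversation {
  const current = loadConversation(id, now);
  if (current.state === "BOOKED") return current; // terminal state, no edits

  const slots: Slots = { ...current.slots };
  for (const key of REQUIRED_SLOTS) {
    const value = incoming[key];
    if (value !== undefined) slots[key] = value;
  }
  const complete = REQUIRED_SLOTS.every((key) => slots[key] !== undefined);
  const next: Conversation = {
    state: complete ? "CONFIRMING" : "COLLECTING",
    slots,
    expiresAt: now + TTL_MS,
  };
  store.set(id, next);
  return next;
}

// Booking is only reachable from CONFIRMING, so confirmation cannot be skipped.
function confirm(id: string, now: number): Conversation {
  const current = loadConversation(id, now);
  if (current.state !== "CONFIRMING") {
    throw new Error(`cannot confirm from state ${current.state}`);
  }
  const next: Conversation = { ...current, state: "BOOKED", expiresAt: now + TTL_MS };
  store.set(id, next);
  return next;
}
```

Passing `now` in explicitly keeps expiry deterministic in tests. With Redis, the key's own expiry could replace the `expiresAt` comparison.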
- Shared with UI bookings
- Idempotent via Idempotency-Key
- Reservation lock with TTL
- Voice cannot bypass confirmation
Voice calls the same APIs your app uses.
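Idempotency by key can be sketched as "look up before create, replay on a match". The example is a minimal in-memory illustration, assuming a hypothetical `createBooking` function and a rule that reusing a key with a different payload is an error. The reservation lock is not covered here.

```typescript
// Illustrative sketch: an in-memory Map stands in for durable storage, and
// createBooking plus the conflict rule are assumptions.
type BookingRequest = { service: string; date: string; time: string; location: string };
type Booking = BookingRequest & { id: string };
type BookingResult = { booking: Booking; replayed: boolean };

const bookingsByKey = new Map<string, Booking>();
let bookingCounter = 0;

function samePayload(a: BookingRequest, b: BookingRequest): boolean {
  return a.service === b.service && a.date === b.date && a.time === b.time && a.location === b.location;
}

function createBooking(idempotencyKey: string, request: BookingRequest): BookingResult {
  if (idempotencyKey.trim() === "") throw new Error("Idempotency-Key is required");

  const existing = bookingsByKey.get(idempotencyKey);
  if (existing) {
    // A retry with the same key must describe the same booking.
    if (!samePayload(existing, request)) {
      throw new Error("Idempotency-Key reused with a different request");
    }
    return { booking: existing, replayed: true };
  }

  bookingCounter += 1;
  const booking: Booking = { ...request, id: `booking-${bookingCounter}` };
  bookingsByKey.set(idempotencyKey, booking);
  return { booking, replayed: false };
}
```

In a multi-process deployment the check and the write would need to be atomic, for example through a unique constraint on the key column in PostgreSQL. The `Map` only shows the control flow.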
- Hindi ✅
- Assamese ❌ (not supported yet; may be added in the future)
- English ✅
Language is:
- Detected via STT
- Treated identically in intent pipeline
- Never inferred from location
- Feature flag: VOICE_BOOKING_ENABLED
- Confidence thresholds on STT + intent
- Mandatory confirmation step
- Full transcript + decision logging
Voice can be disabled instantly without redeploy.
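For the flag to take effect without a redeploy, it has to be read on every request rather than cached at startup. The sketch below assumes a hypothetical `voiceGate` function that takes a reader callback, so the flag could come from an environment variable, Redis, or a config service. The 503 status and the fail-closed default for a missing value are also assumptions.

```typescript
// Illustrative sketch: voiceGate, FlagReader, and the 503 response are
// assumptions. The reader is called per request so flag changes apply at once.
type FlagReader = () => string | undefined;
type GateResult = { allowed: true } | { allowed: false; status: 503; reason: string };

// Only the literal "true" (any case, surrounding spaces ignored) enables voice.
function isEnabled(raw: string | undefined): boolean {
  return (raw ?? "").trim().toLowerCase() === "true";
}

function voiceGate(readFlag: FlagReader): GateResult {
  if (isEnabled(readFlag())) return { allowed: true };
  // Missing or malformed values fail closed.
  return { allowed: false, status: 503, reason: "voice booking is disabled" };
}
```

A route handler could call `voiceGate(() => process.env.VOICE_BOOKING_ENABLED)` before accepting audio. Swapping the callback for a Redis lookup would let the flag change without restarting the process.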
✅ Production-ready architecture
✅ Designed for scale and failure
✅ Interview-grade system design
❌ Not a chatbot
❌ Not "AI decides" logic
❌ Not a voice toy
- WebSocket audio streaming
- Latency budgets & tracing
- Voice analytics (drop-offs per state)
- Staff-side voice booking