An AI-powered voice coaching application that turns any word list image into an interactive spelling practice session using NVIDIA's full-stack AI platform.
Students upload a photo of their spelling word list, and the assistant extracts the words using a vision-language model, then conducts a real-time voice-driven spelling quiz with pronunciation, definitions, example sentences, and encouragement, all guarded by NeMo Guardrails to keep the conversation child-safe and on-topic.
Architecture, from the browser down to the model services:

- User (browser)
  - Upload word list image (REST)
  - Voice spelling session (WebSocket audio)
- FastAPI backend (:8080)
  - `POST /upload-image`
    1. Decode image
    2. Extract words via VLM
    3. Store in Redis
    4. Return `session_id`
  - `WS /pipecat/ws` (Pipecat ACE pipeline): Audio In → ElevenLabs ASR (Scribe) → NeMo Guardrails → Nemotron-Nano (Spelling Coach) → ElevenLabs TTS (Cloud API) → Audio Out
  - Redis: session words, progress, chat history (24h TTL)
- NVIDIA AI Services
  - Nemotron-Nano-12B-VL-FP8 (vLLM, self-hosted): image → words, definitions / sentences, voice coach LLM
  - ElevenLabs STT (Scribe, cloud API): speech → text
  - ElevenLabs TTS (Cloud API): text → speech, 16kHz PCM
  - NeMo Guardrails: topic enforcement, intent filtering, child-safe content policy
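The session store with its 24h TTL can be sketched as follows. This is a minimal illustration, not the backend's actual code: the key layout (`spellingbee:session:<id>`), the JSON payload shape, and the helper names are all assumptions, and the real backend (which lists Redis + LangChain) may store chat history differently. The only client methods used are `setex` and `get`, which redis-py provides.

```python
import json

SESSION_TTL_SECONDS = 24 * 60 * 60  # 24h TTL, as stated in the architecture


def session_key(session_id: str) -> str:
    # Hypothetical key layout; the real backend may name keys differently.
    return f"spellingbee:session:{session_id}"


def save_session(store, session_id: str, words: list, current_index: int = 0) -> None:
    """Write the word list and progress with a 24h expiry.

    `store` is any object exposing redis-py style `setex(name, time, value)`.
    """
    payload = json.dumps({"words": words, "current_index": current_index})
    store.setex(session_key(session_id), SESSION_TTL_SECONDS, payload)


def load_session(store, session_id: str):
    """Return the stored session dict, or None if it is missing or expired."""
    raw = store.get(session_key(session_id))
    if raw is None:
        return None
    if isinstance(raw, bytes):  # redis-py returns bytes unless decode_responses=True
        raw = raw.decode("utf-8")
    return json.loads(raw)
```

With redis-py installed, a client built with `redis.Redis.from_url(...)` from the `REDIS_URL` value could be passed as `store`.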
| Component | Technology | Role |
|---|---|---|
| Voice Pipeline | NVIDIA Pipecat (ACE) | Orchestrates real-time audio I/O, ASR, LLM, and TTS |
| Vision-Language Model | Nemotron-Nano-12B-VL-FP8 via vLLM | Extracts spelling words from uploaded images |
| Speech Recognition | ElevenLabs STT (Scribe, cloud API) | Streaming speech-to-text via WebSocket |
| Text-to-Speech | ElevenLabs TTS (Cloud API) | Natural voice output at 16kHz |
| Conversational LLM | Nemotron-Nano-12B-VL-FP8 via vLLM | Powers the interactive spelling coach (same model as VLM) |
| Safety | NeMo Guardrails | Enforces spelling-only scope, filters off-topic intent |
| Session Store | Redis + LangChain | Persistent word lists, progress, and chat history |
| Fallback OCR | Tesseract (pytesseract) | Backup word extraction when VLM is unavailable |
| Web Framework | FastAPI + Uvicorn | REST API, WebSocket transport, static UI |
| Orchestration | Kubernetes (microk8s) | Multi-node GPU-aware deployment |
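The VLM-first, OCR-fallback behavior listed in the table could look roughly like the sketch below. The function names and the word-cleaning regex are hypothetical; the two extractors are passed in as callables so the sketch does not assume how the backend calls vLLM or Tesseract.

```python
import re
from typing import Callable, List


def clean_words(raw_text: str) -> List[str]:
    """Pull alphabetic tokens out of raw text, lowercased and de-duplicated in order."""
    seen = set()
    words = []
    for token in re.findall(r"[A-Za-z][A-Za-z'-]*", raw_text):
        word = token.lower()
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def extract_words(
    image_bytes: bytes,
    vlm_extract: Callable[[bytes], str],
    ocr_extract: Callable[[bytes], str],
) -> List[str]:
    """Try the VLM first; fall back to OCR only if the VLM call fails."""
    try:
        return clean_words(vlm_extract(image_bytes))
    except Exception:
        # Assumption: "VLM is unavailable" surfaces as an exception from the client call.
        return clean_words(ocr_extract(image_bytes))
```

For the OCR side, one option is a small wrapper around `pytesseract.image_to_string` applied to a PIL image opened from the uploaded bytes.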
Session flow:

1. Upload word list image.
2. Words are extracted via the VLM and stored.
3. Voice session begins (WebSocket).
4. Interactive spelling practice with the coach. Two requests branch off to the LLM:
   - "Use it in a sentence?": the LLM generates a child-friendly sentence.
   - "What does it mean?": the LLM generates an age-appropriate definition.
During a session, the student can:
- Hear the word pronounced
- Ask for it in a sentence
- Request a definition
- Spell the word aloud and receive feedback
- Skip to the next word
- Receive encouragement throughout
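As an illustration of the "spell the word aloud and receive feedback" step, here is a minimal sketch of comparing a transcribed attempt against the target word. In the real pipeline the coaching LLM may do this judgment itself; the helpers `normalize_spelling` and `check_spelling` are hypothetical, and the sketch assumes the ASR transcript contains the letters themselves (for example "C. A. T."), not letter names written out as words.

```python
import re


def normalize_spelling(text: str) -> str:
    """Collapse a spoken spelling such as 'C, A, T' into 'cat'."""
    return re.sub(r"[^a-z]", "", text.lower())


def check_spelling(target: str, transcript: str) -> dict:
    """Compare the attempt to the target and report the first differing position."""
    attempt = normalize_spelling(transcript)
    expected = normalize_spelling(target)
    correct = attempt == expected
    first_diff = None
    if not correct:
        # First index where the letters differ; if one string is a prefix of
        # the other, the difference starts at the end of the shorter string.
        first_diff = next(
            (i for i, (a, b) in enumerate(zip(attempt, expected)) if a != b),
            min(len(attempt), len(expected)),
        )
    return {"correct": correct, "attempt": attempt, "first_diff": first_diff}
```

The `first_diff` index gives the coach something concrete to hint at, such as "check the fourth letter", without revealing the whole word.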
Kubernetes cluster (microk8s), namespace `spellingbee`:

- Controller node
  - Backend pod `spelling-bee-agent`, NodePort :30088
  - Redis pod (session store)
- GPU node
  - vLLM pod serving Nemotron-Nano-12B-VL-FP8, NodePort :30566, GPU: GB10
- External (cloud APIs)
  - ElevenLabs ASR: api.elevenlabs.io (Scribe v1)
  - ElevenLabs TTS: api.elevenlabs.io (Cloud API)
- NVIDIA Pipecat (ACE): Real-time voice agent pipeline framework
- ElevenLabs STT (Scribe): Cloud-hosted streaming speech-to-text (WebSocket API)
- ElevenLabs TTS: Cloud-hosted text-to-speech (Cloud API)
- Nemotron-Nano-12B-VL-FP8: Vision-language model for image understanding and conversational coaching, served via vLLM
- NeMo Guardrails: Programmable safety rails for topic enforcement and content filtering
- NVIDIA Container Runtime: GPU-accelerated container execution
Requirements: Python 3.12+ (required by NVIDIA Pipecat)
# Install dependencies
pip install -r requirements.txt
# Set required environment variables
export ELEVENLABS_API_KEY=<your-key> # For ElevenLabs ASR + TTS
# Optional: enable guardrails
export ENABLE_NEMO_GUARDRAILS=true
export NEMO_GUARDRAILS_CONFIG_PATH=./guardrails
# Start the server
python spelling_bee_agent_backend.py

Open http://localhost:8080 to access the test UI.
Prerequisites: a microk8s cluster with the `spellingbee` namespace, a local
container registry at localhost:32000, and GPU nodes with the NVIDIA runtime.
1. Create secrets
# ElevenLabs API key (required for ASR + TTS)
kubectl -n spellingbee create secret generic elevenlabs-api-key \
  --from-literal=api-key=<YOUR_ELEVENLABS_KEY>
# HuggingFace token (required for vLLM model download)
kubectl -n spellingbee create secret generic hf-token \
  --from-literal=token=<YOUR_HF_TOKEN>

2. Deploy everything (model + Redis + backend)
./deploy/deploy_all.sh

Or deploy individually:
./deploy/deploy_model.sh # vLLM Nemotron-Nano-12B-VL-FP8 on GPU node
./deploy/deploy_redis.sh # Redis session store on controller node
./deploy/deploy_backend.sh # FastAPI backend on controller node

Note: ASR and TTS are cloud-hosted via ElevenLabs, so no GPU pod for speech services is needed. Only the vLLM model requires a GPU.
The backend script builds the Docker image, pushes it to the local registry, applies the K8s manifest, and waits for rollout.
3. Verify
kubectl -n spellingbee get pods -o wide
kubectl -n spellingbee get svc

4. Smoke test
./deploy/smoke_test.sh http://<controller-ip>:30088 ./path/to/words-image.png

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Test UI |
| `/healthz` | GET | Health check (reports Pipecat availability) |
| `/upload-image` | POST | Upload word list image, returns `session_id` |
| `/pipecat/ws` | WebSocket | Voice session: connect with `?session_id=<id>` |
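A client ties the two main endpoints together: it reads the `session_id` from the upload response, then opens the voice WebSocket with that ID as a query parameter. The sketch below covers only the URL and response handling. It assumes the upload response is a JSON object with a `session_id` field (the table says only that the ID is returned), and both helper names are hypothetical. The multipart field name for the image is not documented here, so the upload call itself is left out.

```python
import json
from urllib.parse import urlencode, urlsplit, urlunsplit


def parse_upload_response(body: str) -> str:
    """Extract the session ID, assuming a JSON body like {"session_id": "..."}."""
    return json.loads(body)["session_id"]


def voice_ws_url(base_url: str, session_id: str) -> str:
    """Build the /pipecat/ws URL for a backend base URL such as http://localhost:8080."""
    parts = urlsplit(base_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    query = urlencode({"session_id": session_id})
    # Any path on base_url is ignored; the endpoint path comes from the table above.
    return urlunsplit((scheme, parts.netloc, "/pipecat/ws", query, ""))
```

For a local run, `voice_ws_url("http://localhost:8080", session_id)` yields a `ws://` URL; an `https` base maps to `wss`.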
spelling-bee-assistant/
├── spelling_bee_agent_backend.py # FastAPI backend + Pipecat pipeline
├── ui/
│   └── index.html # Browser-based test UI
├── guardrails/
│   ├── config.yml # NeMo Guardrails model config
│   └── rails.co # Intent policies (spelling scope)
├── deploy/
│   ├── spelling-bee-agent-backend.k8s.yaml # K8s backend manifest
│   ├── vllm-nemotron-nano-vl-8b.yaml # K8s vLLM model manifest
│   ├── redis.k8s.yaml # K8s Redis manifest
│   ├── deploy_all.sh # Deploy model + Redis + backend
│   ├── deploy_backend.sh # Deploy backend only
│   ├── deploy_model.sh # Deploy vLLM model only
│   ├── deploy_redis.sh # Deploy Redis only
│   └── smoke_test.sh # End-to-end integration test
├── Dockerfile # Backend container image
└── requirements.txt # Python dependencies
| Variable | Default | Description |
|---|---|---|
| `ELEVENLABS_API_KEY` | (none) | ElevenLabs API key (required for ASR + TTS) |
| `ELEVENLABS_TTS_VOICE_ID` | `3vbrfmIQGJrswxh7ife4` | ElevenLabs TTS voice identifier |
| `ENABLE_NEMO_GUARDRAILS` | `false` | Enable NeMo Guardrails |
| `NEMO_GUARDRAILS_CONFIG_PATH` | `./guardrails` | Path to guardrails config |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection URL |
| `VLLM_VL_BASE` | `http://vllm-nemotron-nano-vl-8b:5566/v1` | vLLM endpoint (used for both image extraction and voice coaching) |
| `VLLM_VL_MODEL` | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8` | Vision-language model (one model, two roles) |
| `NVIDIA_LLM_URL` | Same as `VLLM_VL_BASE` | Override LLM endpoint for voice pipeline |
| `NVIDIA_LLM_MODEL` | Same as `VLLM_VL_MODEL` | Override LLM model for voice pipeline |
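The defaults above, including the two "same as" fallbacks, can be resolved with a small loader like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical and are not taken from the backend; in particular, treating `1`, `true`, and `yes` as truthy for `ENABLE_NEMO_GUARDRAILS` is an assumption about how the flag is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    elevenlabs_api_key: str
    tts_voice_id: str
    enable_guardrails: bool
    guardrails_config_path: str
    redis_url: str
    vllm_base: str
    vllm_model: str
    llm_url: str
    llm_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve configuration from environment variables using the documented defaults."""
    api_key = env.get("ELEVENLABS_API_KEY", "")
    if not api_key:
        raise RuntimeError("ELEVENLABS_API_KEY is required for ASR + TTS")
    vllm_base = env.get("VLLM_VL_BASE", "http://vllm-nemotron-nano-vl-8b:5566/v1")
    vllm_model = env.get("VLLM_VL_MODEL", "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8")
    # Assumed truthy spellings; the backend may parse this flag differently.
    guardrails_flag = env.get("ENABLE_NEMO_GUARDRAILS", "false").strip().lower()
    return Settings(
        elevenlabs_api_key=api_key,
        tts_voice_id=env.get("ELEVENLABS_TTS_VOICE_ID", "3vbrfmIQGJrswxh7ife4"),
        enable_guardrails=guardrails_flag in {"1", "true", "yes"},
        guardrails_config_path=env.get("NEMO_GUARDRAILS_CONFIG_PATH", "./guardrails"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        vllm_base=vllm_base,
        vllm_model=vllm_model,
        # The voice pipeline overrides fall back to the vLLM values when unset.
        llm_url=env.get("NVIDIA_LLM_URL", vllm_base),
        llm_model=env.get("NVIDIA_LLM_MODEL", vllm_model),
    )
```

Passing a plain dict instead of `os.environ` makes the fallback rules easy to check without touching the process environment.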