Add always-on wake word detection to the dollbody firmware using a custom-trained microWakeWord model ("Hey Cipher") running via TFLite Micro on the ESP32-S3. This replaces any polling or button-triggered activation with a fully offline, low-latency voice trigger that feeds into the existing audio streaming pipeline.
https://github.com/espressif/esp-sr
Tasks
Phase 1 — Model training (GPU server · node3)
Pull and run the microWakeWord trainer Docker image on node3 (RTX 4090):
docker run -d --gpus all -p 8888:8888 -v $(pwd):/data ghcr.io/tatertotterson/microwakeword:latest
Train wake word model for "Hey Cipher" using TTS-only pipeline (no recordings needed)
Open browser at http://localhost:8888, enter phrase, click Train. First run downloads background noise + speech corpora.
Optionally record personal voice samples to improve accuracy
Requires HTTPS reverse proxy (Traefik) or localhost access for browser mic permissions
Export trained hey_cipher.tflite and note the exact feature parameters used (sample rate, FFT size, n_mels, fmin, fmax)
Write Python script to generate mel filterbank matrix using librosa.filters.mel()
Parameters must exactly match training: sr=16000, n_fft=512, n_mels=40, fmin=125, fmax=7500
Export filterbank as mel_filterbank.h C header (float32 2D array, baked into firmware flash)
Verify filterbank output matches Python librosa reference on a test audio clip
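The filterbank export steps above could be scripted roughly as follows. This is a minimal sketch, not the project's actual script: the helper names `format_c_header` and `build_filterbank`, the header layout, and the `MEL_FILTERBANK_ROWS` / `MEL_FILTERBANK_COLS` macros are assumptions made for illustration. Only `librosa.filters.mel()` and the parameter values come from the task list.

```python
"""Sketch: export a librosa mel filterbank as a C header (helper names are assumptions)."""

# Feature parameters quoted in the task list; they must match training.
PARAMS = {"sr": 16000, "n_fft": 512, "n_mels": 40, "fmin": 125, "fmax": 7500}


def format_c_header(matrix, array_name="mel_filterbank"):
    """Render a 2D sequence of floats as a C header holding a const float array."""
    rows = [list(row) for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    macro = array_name.upper()
    guard = macro + "_H"
    lines = [
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
        f"#define {macro}_ROWS {n_rows}",
        f"#define {macro}_COLS {n_cols}",
        "",
        f"static const float {array_name}[{n_rows}][{n_cols}] = {{",
    ]
    for row in rows:
        # Exponent notation always yields a valid C float literal
        # (a bare "0f" would not compile) and keeps 9 significant digits.
        values = ", ".join(f"{float(v):.8e}f" for v in row)
        lines.append(f"    {{ {values} }},")
    lines += ["};", "", f"#endif /* {guard} */", ""]
    return "\n".join(lines)


def build_filterbank():
    """Compute the (n_mels, 1 + n_fft // 2) filterbank with librosa."""
    import librosa  # imported lazily so the formatter works without librosa

    return librosa.filters.mel(**PARAMS)


if __name__ == "__main__":
    with open("mel_filterbank.h", "w") as fh:
        fh.write(format_c_header(build_filterbank()))
```

With the parameters above, the matrix has 40 rows and 257 columns (one column per FFT bin of a 512-point real FFT), so the firmware side would multiply a 257-element power spectrum by it.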
Tune detection probability threshold (start at 0.85, adjust based on false accept / false reject rate)
Measure peak RAM usage — target <60KB tensor arena on S3 with PSRAM
Verify wake word does not trigger during doll TTS playback (AEC in AFE should handle this)
Test detection latency — target <300ms from end of spoken phrase to pipeline activation
Document wake word phrase and retrain instructions in firmware/README.md
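For the threshold tuning task, a small host-side sweep can make the false accept / false reject trade-off visible before a value is flashed. The sketch below assumes the model's output probabilities have already been collected for clips that contain the wake word and clips that do not. The function names and the sample scores are hypothetical and are not measured results.

```python
def error_rates(positive_scores, negative_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at one threshold."""
    # A wake-word clip scoring below the threshold is a false reject.
    false_rejects = sum(1 for s in positive_scores if s < threshold)
    # A non-wake-word clip scoring at or above the threshold is a false accept.
    false_accepts = sum(1 for s in negative_scores if s >= threshold)
    return (false_accepts / len(negative_scores),
            false_rejects / len(positive_scores))


def sweep_thresholds(positive_scores, negative_scores, thresholds):
    """Evaluate candidate thresholds; returns (threshold, far, frr) tuples."""
    return [(t, *error_rates(positive_scores, negative_scores, t))
            for t in thresholds]


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    wake_clips = [0.97, 0.91, 0.88, 0.80]
    other_clips = [0.05, 0.30, 0.86, 0.10]
    for t, far, frr in sweep_thresholds(wake_clips, other_clips, [0.80, 0.85, 0.90]):
        print(f"threshold={t:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

Raising the threshold lowers the false accept rate at the cost of more false rejects, so the starting value of 0.85 can be moved in whichever direction the measured rates call for.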
Notes
No Espressif involvement needed — entire pipeline is self-hosted. Training runs on node3 (RTX 4090), model ships in firmware flash.
Runtime target: ESP32-S3 at 240MHz with ESP-NN SIMD optimisations enabled. Do not use the plain ESP32: it lacks the S3's vector instructions that ESP-NN relies on for fast TFLite Micro inference.
Critical: the mel feature extraction parameters (sr, n_fft, n_mels, fmin, fmax) in the Python precompute script and the C firmware must exactly match those used during training. Any mismatch silently degrades accuracy.
microWakeWord vs WakeNet: this uses TFLite Micro directly rather than ESP-SR's proprietary WakeNet runtime, giving full control over the model and retraining without Espressif's customisation service.
Future: after wake detection, stream audio to Whisper on node3 for full sentence understanding → cipherdolls NestJS backend → LLM → TTS response.
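One possible guard against the silent parameter mismatch mentioned in the notes is to derive a short fingerprint from the feature parameters at training time and compare it whenever the firmware header is regenerated. This is only a sketch of the idea: `feature_fingerprint` and `assert_params_match` are hypothetical helpers, not part of microWakeWord, librosa, or ESP-IDF.

```python
import hashlib
import json

# Values quoted in the task list; treated here as the training-time reference.
TRAINING_PARAMS = {"sr": 16000, "n_fft": 512, "n_mels": 40, "fmin": 125, "fmax": 7500}


def feature_fingerprint(params):
    """Hash the parameters in a canonical form (sorted keys, values as floats)."""
    normalised = {key: float(value) for key, value in params.items()}
    canonical = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def assert_params_match(training, firmware):
    """Raise ValueError listing every parameter that differs between the two sets."""
    if feature_fingerprint(training) == feature_fingerprint(firmware):
        return
    differing = {
        key: (training.get(key), firmware.get(key))
        for key in sorted(set(training) | set(firmware))
        if training.get(key) != firmware.get(key)
    }
    raise ValueError(f"feature parameter mismatch: {differing}")


if __name__ == "__main__":
    print("training fingerprint:", feature_fingerprint(TRAINING_PARAMS))
    assert_params_match(TRAINING_PARAMS, dict(TRAINING_PARAMS))  # passes silently
```

The fingerprint string could also be written into the generated header as a comment, so a stale header is easy to spot during review.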
Phase 2 — Precompute mel filterbank (host machine)
Covers the filterbank tasks: generate the matrix with librosa.filters.mel() (parameters must exactly match training: sr=16000, n_fft=512, n_mels=40, fmin=125, fmax=7500) and export it as the mel_filterbank.h C header (float32 2D array, baked into firmware flash).
Phase 3 — Firmware components (ESP-IDF · dollbody)
Add the esp-tflite-micro dependency via the IDF component manager: idf.py add-dependency "espressif/esp-tflite-micro"
Embed the hey_cipher.tflite model binary into firmware (via EMBED_FILES in CMakeLists or an xxd -i header)
mel_feature_extractor component
Uses esp_dsp FFT (from Espressif's esp-dsp library, which is added as the espressif/esp-dsp managed component rather than bundled with ESP-IDF) + precomputed filterbank. ~1–2ms per 30ms frame on S3.
wake_word_task FreeRTOS task
Reads from AFE output queue → mel features → TFLite interpreter → probability threshold check
AFE handles AEC, NS, BSS, so the TFLite model only sees clean PCM
on_wake_word_detected() callback: triggers audio stream handoff to cipherdolls backend
Phase 4 — Integration & tuning (end-to-end)
Covers threshold tuning, the RAM and latency measurements, the TTS playback check, and the firmware/README.md documentation.
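The Phase 3 data flow (AFE output queue → mel features → TFLite interpreter → probability threshold check → callback) can be prototyped on the host before it is written as a FreeRTOS task. The sketch below is a Python stand-in for that loop, not firmware code: the feature extractor and model are passed in as plain callables, and the 480-sample frame (30 ms at 16 kHz) and the dummy functions in the demo are assumptions for illustration.

```python
def run_wake_word_loop(frames, extract_features, infer, threshold=0.85,
                       on_wake_word_detected=None):
    """Run frames through features -> model -> threshold check.

    Returns the index of the first frame whose probability reaches the
    threshold, or None if no frame does.
    """
    for index, frame in enumerate(frames):
        features = extract_features(frame)   # stands in for mel_feature_extractor
        probability = infer(features)        # stands in for the TFLite interpreter
        if probability >= threshold:
            if on_wake_word_detected is not None:
                on_wake_word_detected(index, probability)
            return index
    return None


if __name__ == "__main__":
    # Dummy stand-ins: the "feature" is the mean absolute amplitude and the
    # "model" simply clips it to [0, 1]. Real code would use mel features
    # and the trained hey_cipher.tflite model.
    fake_frames = [[0.0] * 480, [0.2] * 480, [0.95] * 480]

    def mean_abs(frame):
        return sum(abs(x) for x in frame) / len(frame)

    run_wake_word_loop(
        fake_frames,
        mean_abs,
        lambda feature: min(1.0, feature),
        on_wake_word_detected=lambda i, p: print(f"wake word at frame {i}, p={p:.2f}"),
    )
```

Keeping the extractor and the model behind simple callables mirrors the component split in the task list, so each piece can be swapped for its firmware counterpart and tested separately.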