add wake words #1

@ffaerber

Overview

Add always-on wake word detection to the dollbody firmware using a custom-trained microWakeWord model ("Hey Cipher") running on TFLite Micro on the ESP32-S3. This replaces polling or button-triggered activation with a fully offline, low-latency voice trigger that feeds into the existing audio streaming pipeline.
ESP-SR (used here only for its AFE front end): https://github.com/espressif/esp-sr

Tasks

Phase 1 — Model training (GPU server · node3)

  • Pull and run the microWakeWord trainer Docker image on node3 (RTX 4090)
    docker run -d --gpus all -p 8888:8888 -v "$(pwd)":/data ghcr.io/tatertotterson/microwakeword:latest
  • Train wake word model for "Hey Cipher" using TTS-only pipeline (no recordings needed)
    Open browser at http://localhost:8888, enter phrase, click Train. First run downloads background noise + speech corpora.
  • Optionally record personal voice samples to improve accuracy
    Requires HTTPS reverse proxy (Traefik) or localhost access for browser mic permissions
  • Export trained hey_cipher.tflite and note the exact feature parameters used (sample rate, FFT size, n_mels, fmin, fmax)

Phase 2 — Precompute mel filterbank (host machine)

  • Write Python script to generate mel filterbank matrix using librosa.filters.mel()
    Parameters must exactly match training: sr=16000, n_fft=512, n_mels=40, fmin=125, fmax=7500
  • Export filterbank as mel_filterbank.h C header (float32 2D array, baked into firmware flash)
  • Verify filterbank output matches Python librosa reference on a test audio clip
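
The Phase 2 export step could look roughly like the sketch below. The function names (build_filterbank, format_c_header, write_header) and the header layout are illustrative assumptions, not something this issue prescribes; the librosa parameters are the ones listed above. librosa's defaults (Slaney normalisation, non-HTK mel scale) are assumed here and should be checked against what the trainer actually uses.

```python
def build_filterbank():
    """Build the mel filterbank with the parameters from the task list."""
    # Assumes librosa (and numpy) are installed on the host machine.
    import librosa
    fb = librosa.filters.mel(sr=16000, n_fft=512, n_mels=40, fmin=125, fmax=7500)
    # Shape is (n_mels, 1 + n_fft // 2) = (40, 257).
    return fb.astype("float32").tolist()


def format_c_header(rows, name="mel_filterbank"):
    """Render a 2D list of floats as a const float array in a C header."""
    n_rows, n_cols = len(rows), len(rows[0])
    guard = name.upper() + "_H"
    lines = [
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
        f"#define {name.upper()}_ROWS {n_rows}",
        f"#define {name.upper()}_COLS {n_cols}",
        "",
        f"static const float {name}[{n_rows}][{n_cols}] = {{",
    ]
    for row in rows:
        # Exponent notation with 9 significant digits round-trips float32
        # and is always a valid C float literal once the 'f' suffix is added.
        values = ", ".join(f"{v:.8e}f" for v in row)
        lines.append(f"    {{ {values} }},")
    lines += ["};", "", f"#endif /* {guard} */", ""]
    return "\n".join(lines)


def write_header(path="mel_filterbank.h"):
    """Generate the filterbank and write it next to the firmware sources."""
    with open(path, "w") as f:
        f.write(format_c_header(build_filterbank()))
```

Calling write_header() on the host produces mel_filterbank.h. The verification task can then compare the firmware's per-frame output against the same matrix applied in Python to a test clip.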

Phase 3 — Firmware components (ESP-IDF · dollbody)

  • Add esp-tflite-micro dependency via IDF component manager
    idf.py add-dependency "espressif/esp-tflite-micro"
  • Embed hey_cipher.tflite model binary into firmware (via EMBED_FILES in CMakeLists or xxd -i header)
  • Implement mel_feature_extractor component
    Uses the esp-dsp FFT (a separate component, added via the IDF component manager as espressif/esp-dsp) + precomputed filterbank. ~1–2ms per 30ms frame on S3.
  • Implement wake_word_task FreeRTOS task
    Reads from AFE output queue → mel features → TFLite interpreter → probability threshold check
  • Wire ESP-SR AFE (AEC, noise suppression, VAD) before the feature extractor
    AFE handles AEC, NS, and BSS, so the TFLite model only sees clean PCM
  • Implement 16-frame sliding window buffer (160ms context, which implies a 10ms hop between frames) for inference smoothing
  • Define on_wake_word_detected() callback — triggers audio stream handoff to cipherdolls backend
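
As a host-side prototype of the smoothing and threshold logic in wake_word_task, something like the following could be used before porting to C. It reads "inference smoothing" as averaging the last 16 per-frame probabilities, which is one interpretation of the task above. The class name WakeWordSmoother and the re-arm behaviour are assumptions for illustration.

```python
from collections import deque


class WakeWordSmoother:
    """Average the last N per-frame probabilities and compare to a threshold."""

    def __init__(self, window=16, threshold=0.85, on_detect=None):
        self.probs = deque(maxlen=window)  # oldest value drops out automatically
        self.threshold = threshold
        self.on_detect = on_detect         # stand-in for on_wake_word_detected()
        self.armed = True                  # avoids firing repeatedly on one utterance

    def push(self, prob):
        """Feed one model output; return True if this frame triggers detection."""
        self.probs.append(prob)
        if len(self.probs) < self.probs.maxlen:
            return False                   # not enough context yet
        mean = sum(self.probs) / len(self.probs)
        if mean >= self.threshold and self.armed:
            self.armed = False
            if self.on_detect:
                self.on_detect()
            return True
        if mean < self.threshold:
            self.armed = True              # re-arm once the average falls back
        return False
```

With a 10ms hop, the default window of 16 covers the 160ms of context named above. The firmware version would likely use a fixed-size ring buffer and a running sum instead of a deque.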

Phase 4 — Integration & tuning (end-to-end)

  • Tune detection probability threshold (start at 0.85, adjust based on false accept / false reject rate)
  • Measure peak RAM usage — target <60KB tensor arena on S3 with PSRAM
  • Verify wake word does not trigger during doll TTS playback (AEC in AFE should handle this)
  • Test detection latency — target <300ms from end of spoken phrase to pipeline activation
  • Document wake word phrase and retrain instructions in firmware/README.md
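
For the threshold tuning task, a small offline sweep can make the false accept / false reject trade-off visible. The sketch below assumes each test clip has been reduced to a single score (for example its peak smoothed probability), with positive clips containing the wake word and negative clips not. The function names and that scoring convention are hypothetical.

```python
def error_rates(positive_scores, negative_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at one threshold."""
    # A positive clip scoring below the threshold is a missed wake word.
    false_rejects = sum(1 for s in positive_scores if s < threshold)
    # A negative clip scoring at or above the threshold is a false trigger.
    false_accepts = sum(1 for s in negative_scores if s >= threshold)
    return (false_rejects / len(positive_scores),
            false_accepts / len(negative_scores))


def sweep(positive_scores, negative_scores, thresholds):
    """Tabulate (threshold, false_reject_rate, false_accept_rate) rows."""
    return [(t, *error_rates(positive_scores, negative_scores, t))
            for t in thresholds]
```

Running sweep() over a range around 0.85 shows whether raising or lowering the starting threshold buys a better balance for the recorded test set.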

Notes

  • No Espressif involvement needed — entire pipeline is self-hosted. Training runs on node3 (RTX 4090), model ships in firmware flash.
  • Runtime target: ESP32-S3 at 240MHz with ESP-NN SIMD optimisations enabled. Do not use plain ESP32: it lacks the S3's vector instructions needed for fast TFLite inference.
  • Critical: mel feature extraction parameters in the Python precompute script and C firmware must exactly match those used during training. Any mismatch silently degrades accuracy.
  • microWakeWord vs WakeNet: this uses TFLite Micro directly rather than ESP-SR's proprietary WakeNet runtime, giving full control over the model and retraining without Espressif's customisation service.
  • Future: after wake detection, stream audio to Whisper on node3 for full sentence understanding → cipherdolls NestJS backend → LLM → TTS response.
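
One way to guard against the silent parameter mismatch called out in the "Critical" note is to keep the training parameters in a single place and diff every other copy against it. The sketch below is an assumption about how that check might be structured. TRAINING_PARAMS holds the values from Phase 2, and param_mismatches is a hypothetical helper.

```python
# Values taken from the Phase 2 task; treat this dict as the single reference.
TRAINING_PARAMS = {"sr": 16000, "n_fft": 512, "n_mels": 40, "fmin": 125, "fmax": 7500}


def param_mismatches(reference, candidate):
    """Return {name: (reference_value, candidate_value)} for every difference."""
    keys = sorted(set(reference) | set(candidate))
    return {
        k: (reference.get(k), candidate.get(k))
        for k in keys
        if reference.get(k) != candidate.get(k)  # missing keys show up as None
    }
```

A build script could run this against the values parsed from the firmware headers and fail when the result is non-empty.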
