add wake words

## Overview

Add always-on wake word detection to the dollbody firmware using a custom-trained microWakeWord model ("Hey Cipher") running via TFLite Micro on the ESP32-S3. This replaces any polling or button-triggered activation with a fully offline, low-latency voice trigger that feeds into the existing audio streaming pipeline.
Reference: https://github.com/espressif/esp-sr

---

## Tasks

### Phase 1 — Model training (GPU server · node3)

- [ ] Pull and run the microWakeWord trainer Docker image on node3 (RTX 4090)
  `docker run -d --gpus all -p 8888:8888 -v $(pwd):/data ghcr.io/tatertotterson/microwakeword:latest`
- [ ] Train wake word model for **"Hey Cipher"** using TTS-only pipeline (no recordings needed)
  Open browser at http://localhost:8888, enter phrase, click Train. First run downloads background noise + speech corpora.
- [ ] Optionally record personal voice samples to improve accuracy
  Requires HTTPS reverse proxy (Traefik) or localhost access for browser mic permissions
- [ ] Export trained `hey_cipher.tflite` and note the exact feature parameters used (sample rate, FFT size, n_mels, fmin, fmax)

### Phase 2 — Precompute mel filterbank (host machine)

- [ ] Write Python script to generate mel filterbank matrix using `librosa.filters.mel()`
  Parameters must exactly match training: sr=16000, n_fft=512, n_mels=40, fmin=125, fmax=7500
- [ ] Export filterbank as `mel_filterbank.h` C header (float32 2D array, baked into firmware flash)
- [ ] Verify filterbank output matches Python librosa reference on a test audio clip
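
The Phase 2 steps could be sketched roughly as follows. This is a minimal host-side sketch, not the final script: the helper names `build_filterbank`, `format_c_header`, and `write_header` are hypothetical, and the parameter values are copied from the task list above and still need to be confirmed against the trained model. `librosa.filters.mel()` returns an array of shape `(n_mels, n_fft // 2 + 1)`, so these settings should give 40 × 257.

```python
"""Host-side sketch: export a librosa mel filterbank as a C header."""

# Assumed values, copied from the task list; confirm against the trained model.
PARAMS = {"sr": 16000, "n_fft": 512, "n_mels": 40, "fmin": 125, "fmax": 7500}


def build_filterbank(params=PARAMS):
    """Return the filterbank as nested lists, shape (n_mels, n_fft // 2 + 1)."""
    import librosa  # imported lazily so the formatter below has no dependencies

    return librosa.filters.mel(**params).tolist()


def format_c_header(matrix, name="mel_filterbank"):
    """Render a 2D list of numbers as a C header holding a const float array."""
    rows, cols = len(matrix), len(matrix[0])
    guard = name.upper() + "_H"
    lines = [
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
        f"#define {name.upper()}_ROWS {rows}",
        f"#define {name.upper()}_COLS {cols}",
        "",
        f"static const float {name}[{rows}][{cols}] = {{",
    ]
    for row in matrix:
        # repr() keeps full precision; the trailing f makes a C float literal.
        values = ", ".join(f"{float(v)!r}f" for v in row)
        lines.append(f"    {{{values}}},")
    lines += ["};", "", f"#endif  // {guard}", ""]
    return "\n".join(lines)


def write_header(path="mel_filterbank.h"):
    """Build the filterbank with librosa and write it to `path`."""
    with open(path, "w") as fh:
        fh.write(format_c_header(build_filterbank()))
```

Calling `write_header()` on the host (with librosa installed) would produce the header. Keeping the formatter separate from the librosa call makes it easy to check the output format on a tiny matrix first.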

### Phase 3 — Firmware components (ESP-IDF · dollbody)

- [ ] Add `esp-tflite-micro` dependency via IDF component manager
  `idf.py add-dependency "espressif/esp-tflite-micro"`
- [ ] Embed `hey_cipher.tflite` model binary into firmware (via `EMBED_FILES` in CMakeLists or `xxd -i` header)
- [ ] Implement `mel_feature_extractor` component
  Uses the `esp-dsp` FFT (a separate component, added with `idf.py add-dependency "espressif/esp-dsp"`) plus the precomputed filterbank. Estimated cost: ~1–2 ms per 30 ms frame on the S3.
- [ ] Implement `wake_word_task` FreeRTOS task
  Reads from AFE output queue → mel features → TFLite interpreter → probability threshold check
- [ ] Wire ESP-SR AFE (noise suppression + VAD) before the feature extractor
  AFE handles AEC, NS, BSS — TFLite model only sees clean PCM
- [ ] Implement 16-frame sliding window buffer (160 ms of context, assuming a 10 ms frame hop) for inference smoothing
- [ ] Define `on_wake_word_detected()` callback — triggers audio stream handoff to cipherdolls backend
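
The sliding-window smoothing can be prototyped on the host before it is ported to the C task. The sketch below assumes that "smoothing" means averaging the last 16 per-frame probabilities and comparing the mean to the threshold; the issue does not specify the smoothing rule, and the class name `SlidingWindowDetector` is hypothetical.

```python
from collections import deque


class SlidingWindowDetector:
    """Average the last `window` frame probabilities and compare to a threshold."""

    def __init__(self, window=16, threshold=0.85):
        self.threshold = threshold
        self.probs = deque(maxlen=window)  # oldest value is dropped automatically

    def push(self, prob):
        """Add one frame probability; return True when the full-window mean fires."""
        self.probs.append(prob)
        if len(self.probs) < self.probs.maxlen:
            return False  # not enough context yet
        fired = sum(self.probs) / len(self.probs) >= self.threshold
        if fired:
            self.probs.clear()  # avoid re-triggering on the same utterance
        return fired


# Example: a 4-frame window fires on the 4th frame, then starts over.
detector = SlidingWindowDetector(window=4, threshold=0.5)
results = [detector.push(p) for p in [1.0, 1.0, 1.0, 1.0, 1.0]]
```

Clearing the buffer after a detection is one simple way to debounce; a refractory timer in `wake_word_task` would be an alternative.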

### Phase 4 — Integration & tuning (end-to-end)

- [ ] Tune detection probability threshold (start at 0.85, adjust based on false accept / false reject rate)
- [ ] Measure peak RAM usage — target <60KB tensor arena on S3 with PSRAM
- [ ] Verify wake word does not trigger during doll TTS playback (AEC in AFE should handle this)
- [ ] Test detection latency — target <300ms from end of spoken phrase to pipeline activation
- [ ] Document wake word phrase and retrain instructions in `firmware/README.md`
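
For the threshold tuning step, a small offline sweep over labelled test clips can show how false accepts trade against false rejects. This is a sketch under assumptions: the function names are hypothetical, each clip is reduced to a single detector score, and the rates are per clip (false accepts per hour of background audio would need clip durations as well).

```python
def error_rates(scores, labels, threshold):
    """Return (false_accept_rate, false_reject_rate) at one threshold.

    scores: one detector probability per clip.
    labels: True if the clip contains the wake word, else False.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    false_rejects = sum(s < threshold for s in positives)
    false_accepts = sum(s >= threshold for s in negatives)
    frr = false_rejects / len(positives) if positives else 0.0
    far = false_accepts / len(negatives) if negatives else 0.0
    return far, frr


def sweep(scores, labels, thresholds):
    """Return a list of (threshold, far, frr) tuples, one per candidate threshold."""
    return [(t, *error_rates(scores, labels, t)) for t in thresholds]


# Toy data: two wake-word clips and two background clips.
scores = [0.95, 0.80, 0.90, 0.20]
labels = [True, True, False, False]
table = sweep(scores, labels, [0.85, 0.92])
```

Starting from 0.85 as suggested above, the threshold would be raised if false accepts dominate and lowered if false rejects dominate.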

---

## Notes

- **No Espressif involvement needed** — entire pipeline is self-hosted. Training runs on node3 (RTX 4090), model ships in firmware flash.
- **Runtime target**: ESP32-S3 at 240 MHz with ESP-NN SIMD optimisations enabled. Do not use the original ESP32: it lacks the S3's vector (SIMD) instructions that ESP-NN relies on for fast TFLite inference.
- **Critical**: mel feature extraction parameters in the Python precompute script and the C firmware must exactly match those used during training. A mismatch raises no error; it silently degrades accuracy.
- **microWakeWord vs WakeNet**: this uses TFLite Micro directly rather than ESP-SR's proprietary WakeNet runtime, giving full control over the model and retraining without Espressif's customisation service.
- **Future**: after wake detection, stream audio to Whisper on node3 for full sentence understanding → cipherdolls NestJS backend → LLM → TTS response.
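
One way to reduce the risk of the parameter mismatch flagged under **Critical** is to keep the feature parameters in a single file and generate the C defines from it, so the Python script and the firmware read the same values. The sketch below is only an illustration: the file name `feature_params.json`, the `WW_` prefix, and the function names are all hypothetical.

```python
import json


def load_params(path="feature_params.json"):
    """Load the feature parameters recorded at training time."""
    with open(path) as fh:
        return json.load(fh)


def params_to_c_defines(params, prefix="WW_"):
    """Render a flat dict of numeric parameters as C #define lines."""
    lines = [f"#define {prefix}{key.upper()} {params[key]}" for key in sorted(params)]
    return "\n".join(lines) + "\n"


# Example with two of the parameters from the task list.
defines = params_to_c_defines({"sr": 16000, "n_mels": 40})
```

The same dict could be passed to the filterbank precompute script, so a retrain with different settings only requires updating one file.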


