A local, lightweight voice activation and conversational AI system. Kaizen uses a hybrid CNN-GRU-Attention neural network trained on normalized log-mel spectrogram features to detect the wakeword "Kaizen" in a 16 kHz audio stream.
Kaizen's detection pipeline processes audio in a sliding 2-second window:
- Feature Extraction: Converts raw audio to a normalized log-mel spectrogram of shape (1, 40, 200) using librosa.
- Convolutional Network (CNN): Extracts spatial-temporal acoustic features from the spectrogram.
- Recurrent Network (Bidirectional GRU): Models temporal context across the frame sequence in both directions.
- Attention Mechanism: Computes attention scores across the time frames to produce a context vector representing the entire 2-second clip.
- Classifier Head: A linear classification layer producing raw activation logits.
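Two parts of the pipeline above can be illustrated without any dependencies: the 2-second sliding window and the attention pooling step. The sketch below is not the code in model.py or realtime.py. `SlidingWindow`, `attention_pool`, and the 160-sample hop are assumptions made for illustration; the hop is simply what 32,000 samples divided into 200 frames implies, and depending on padding settings a feature library may emit one extra frame that has to be trimmed.

```python
import math
from collections import deque

SAMPLE_RATE = 16_000     # from the text: 16 kHz stream
WINDOW_SECONDS = 2       # from the text: 2-second window
N_FRAMES = 200           # from the text: time axis of the (1, 40, 200) input
# Assumption: a hop of 160 samples (10 ms) is what 200 frames per window implies.
HOP_LENGTH = SAMPLE_RATE * WINDOW_SECONDS // N_FRAMES


class SlidingWindow:
    """Hypothetical helper that keeps only the most recent 2 seconds of samples."""

    def __init__(self, size=SAMPLE_RATE * WINDOW_SECONDS):
        self.size = size
        self.buffer = deque(maxlen=size)  # old samples fall off the left side

    def push(self, chunk):
        self.buffer.extend(chunk)

    def ready(self):
        # True once a full window is available for feature extraction.
        return len(self.buffer) == self.size

    def window(self):
        return list(self.buffer)


def attention_pool(frames, scores):
    """Softmax the per-frame scores, then return the weighted sum of the frames.

    frames: list of T feature vectors (for example, GRU outputs per time step).
    scores: list of T raw scores; in a trained model these come from a learned layer.
    """
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]      # weights sum to 1 across time
    dim = len(frames[0])
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return context, weights
```

With equal scores the context vector is the plain average of the frames; a frame with a much higher score dominates the result, which is how the network can focus on the part of the clip that contains the wakeword.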
The current milestone focuses on integrating the detection engine with local LLM capabilities and offline transcription:
- LLM Integration: Connecting to a local Ollama instance running the Phi-3.5 model.
- Speech-to-Text (STT): Incorporating Whisper for local, offline transcription of the user's voice command immediately following a wakeword trigger.
- Interactive Control Loop:
- Voice Mode: Continuously listens for the "Kaizen" wakeword. When triggered, records the user's speech, transcribes it via Whisper, and sends the prompt to Phi-3.5.
- Manual Mode: Direct text interface to query the model without voice activation.
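The two modes above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `build_request`, `ask_llm`, `voice_turn`, and `manual_turn` are hypothetical names, the wakeword, recording, and transcription steps are passed in as callables so the flow stays independent of any particular audio or Whisper API, and the model tag `phi3.5` assumes that is the name the model was pulled under. The endpoint is Ollama's default local address for its generate API.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "phi3.5"  # assumption: the tag under which Phi-3.5 was pulled


def build_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_llm(prompt):
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


def voice_turn(wait_for_wakeword, record_command, transcribe, ask=ask_llm):
    """One Voice Mode turn: wakeword, record, transcribe, query the model."""
    wait_for_wakeword()              # blocks until the detector fires
    audio = record_command()         # the speech that follows the wakeword
    text = transcribe(audio).strip()
    if not text:                     # nothing intelligible was said
        return None
    return ask(text)


def manual_turn(text, ask=ask_llm):
    """One Manual Mode turn: the typed text goes straight to the model."""
    return ask(text)
```

Passing the steps in as callables also makes the loop easy to exercise without a microphone, for example `voice_turn(lambda: None, lambda: b"", lambda audio: "hello", ask=print)`.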
scripts/main/: Core training and inference engine.
- model.py: Defines the WakeCRNN and Attention classes.
- features.py: Extracts log-mel features.
- train.py: Training pipeline with early stopping and real-time metric plotting.
- realtime.py: Microphone listener with WebRTC Voice Activity Detection (VAD).
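For readers unfamiliar with the early stopping mentioned for train.py, the idea can be sketched in a few lines. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the use of validation loss as the monitored metric are all assumptions for illustration; the actual criteria in train.py may differ.

```python
class EarlyStopping:
    """Hypothetical helper: stop when validation loss stops improving."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0       # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A training loop would call `step` once per epoch and break out when it returns True, typically restoring the checkpoint saved at the best epoch.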
scripts/utils/: Utilities for data preparation and synthesis.
- gen_hard_neg.py: Synthesizes phonetically similar negative words via TTS.
- hard_noise_neg.py: Generates static, pink, and brown noise signals.
- augment_positives.py: Augments audio samples with pitch, speed, shifting, and noise.
- vad_segmenter.py: Cuts raw continuous speech into clean voice snippets.
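As an illustration of the kind of signal hard_noise_neg.py produces, the sketch below generates two of the three noise colors using only the standard library. It is not the script's code: `white_noise` and `brown_noise` are hypothetical names, "static" is assumed to mean white noise, and pink noise is left out because it needs a 1/f shaping filter that does not fit in a few lines.

```python
import random


def white_noise(n_samples, rng):
    """Uniform white noise in [-1, 1]; assumed here to be what "static" means."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def brown_noise(n_samples, rng):
    """Integrate white noise (a random walk), remove the mean, and peak-normalize."""
    level = 0.0
    walk = []
    for step in white_noise(n_samples, rng):
        level += step                 # running sum concentrates energy at low frequencies
        walk.append(level)
    mean = sum(walk) / n_samples      # assumes n_samples > 0
    centered = [x - mean for x in walk]
    peak = max(abs(x) for x in centered) or 1.0
    return [x / peak for x in centered]
```

For example, `brown_noise(32_000, random.Random(0))` yields a reproducible 2-second clip at 16 kHz that can be written out as a negative training sample.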