Kaizen Wake Word Engine

Kaizen is a local, lightweight voice-activation and conversational AI system. It uses a hybrid CNN-GRU-Attention neural network, trained on normalized log-mel spectrogram features, to detect the wake word "Kaizen" in a 16 kHz audio stream.

🧠 Architecture Overview

Kaizen's detection pipeline processes audio in a sliding 2-second window:

Feature Extraction: Converts raw audio to a normalized log-mel spectrogram of shape (1, 40, 200) using librosa.
Convolutional Network (CNN): Extracts local time-frequency acoustic features from the spectrogram.
Recurrent Network (Bidirectional GRU): Models temporal context across the frame sequence in both directions.
Attention Mechanism: Computes a focus score for each time frame and combines the frames into a single context vector representing the entire 2-second clip.
Classifier Head: A linear classification layer producing raw activation logits.
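
The attention step above can be sketched in a few lines of NumPy. This is a minimal illustration of attention pooling in general, not the code in model.py: the function name attention_pool, the scoring vector w, and the 50 x 128 input shape are all assumptions chosen for the example.

```python
import numpy as np

def attention_pool(hidden, w, b=0.0):
    """Collapse a (T, H) sequence of recurrent outputs into one (H,) context vector."""
    scores = hidden @ w + b            # (T,) one raw focus score per time step
    scores = scores - scores.max()     # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()  # (T,) non-negative weights that sum to 1
    context = weights @ hidden         # (H,) weighted average over time
    return context, weights

# Hypothetical sizes: 50 time steps, 128 features per step.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((50, 128))
w = rng.standard_normal(128)

context, weights = attention_pool(hidden, w)
print(context.shape, weights.shape)
```

A classifier head would then map the context vector to logits. In a trained model the scoring parameters are learned together with the rest of the network rather than drawn at random as they are here.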

🎯 Current Milestone: LLM & Speech-to-Text Integration

The current milestone focuses on integrating the detection engine with local LLM capabilities and offline transcription:

LLM Integration: Connecting to a local Ollama instance running the Phi-3.5 model.
Speech-to-Text (STT): Incorporating Whisper for local, offline transcription of the user's voice command immediately after a wake word trigger.
Interactive Control Loop:
- Voice Mode: Continuously listens for the "Kaizen" wake word. When triggered, it records the user's speech, transcribes it via Whisper, and sends the resulting prompt to Phi-3.5.
- Manual Mode: Direct text interface to query the model without voice activation.
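
The two modes above can be sketched as one loop that takes its dependencies as plain callables. This is a hedged outline, not the project's implementation: control_loop, detect, record, transcribe, and ask_llm are hypothetical names, and the lambdas below are stand-ins for the wake word model, the microphone recorder, Whisper, and the Ollama call.

```python
from typing import Callable, Iterable, Iterator, Tuple

def control_loop(
    events: Iterable[Tuple[str, object]],
    detect: Callable[[object], bool],
    record: Callable[[], object],
    transcribe: Callable[[object], str],
    ask_llm: Callable[[str], str],
) -> Iterator[str]:
    """Yield one model reply per handled event.

    events yields ("voice", audio_window) or ("text", prompt) pairs.
    """
    for mode, payload in events:
        if mode == "voice":
            if not detect(payload):        # no wake word in this window
                continue
            prompt = transcribe(record())  # capture the command, then transcribe it
        elif mode == "text":
            prompt = str(payload)          # manual mode skips voice activation
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        if prompt.strip():                 # ignore empty transcriptions
            yield ask_llm(prompt)

# Stand-in callables so the sketch runs without audio hardware or a model.
replies = list(control_loop(
    events=[("voice", "silence"), ("voice", "kaizen"), ("text", "hello")],
    detect=lambda window: window == "kaizen",
    record=lambda: "recorded-audio",
    transcribe=lambda audio: "what time is it",
    ask_llm=lambda prompt: f"echo: {prompt}",
))
print(replies)
```

Passing the components in as callables keeps the loop testable without a microphone, and lets voice and manual input share the same path to the model.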

📁 Repository Structure

scripts/main/: Core training and inference engine.
- model.py: Defines the WakeCRNN and Attention classes.
- features.py: Extracts log-mel features.
- train.py: Training pipeline with early stopping and real-time metric plotting.
- realtime.py: Microphone listener with WebRTC Voice Activity Detection (VAD).
scripts/utils/: Utilities for data preparation and synthesis.
- gen_hard_neg.py: Synthesizes phonetically similar negative words via TTS.
- hard_noise_neg.py: Generates static, pink, and brown noise signals.
- augment_positives.py: Augments audio samples with pitch, speed, shifting, and noise.
- vad_segmenter.py: Cuts raw continuous speech into clean voice snippets.
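
As an illustration of the kind of signals hard_noise_neg.py is described as producing, the sketch below generates three noise colors with NumPy. It is not the script's actual code: the function names are hypothetical, "static" is assumed to mean white noise, and the 2-second, 16 kHz clip length is assumed to match the detection window.

```python
import numpy as np

SAMPLE_RATE = 16_000          # assumed to match the 16 kHz stream
CLIP_SAMPLES = 2 * SAMPLE_RATE  # assumed 2-second negative clips

def white_noise(n, rng):
    """Flat spectrum: independent Gaussian samples."""
    return rng.standard_normal(n)

def pink_noise(n, rng):
    """Power falls roughly as 1/f: scale white noise amplitudes by 1/sqrt(f)."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]       # avoid dividing by zero at the DC bin
    spectrum /= np.sqrt(freqs)
    return np.fft.irfft(spectrum, n=n)

def brown_noise(n, rng):
    """Power falls roughly as 1/f^2: running sum of white noise."""
    return np.cumsum(rng.standard_normal(n))

def normalize_peak(x, peak=0.9):
    """Remove the DC offset and scale so the largest sample has magnitude peak."""
    x = x - x.mean()
    return peak * x / np.max(np.abs(x))

rng = np.random.default_rng(0)
clips = {
    name: normalize_peak(make(CLIP_SAMPLES, rng))
    for name, make in [("white", white_noise), ("pink", pink_noise), ("brown", brown_noise)]
}
for name, clip in clips.items():
    print(name, clip.shape, clip.dtype)
```

Each array could then be written to disk as a negative training sample with whatever audio I/O library the project already uses.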

