This project implements and evaluates CNN models for speech command recognition using the Speech Commands v0.02 dataset, with a focus on Post-Training Quantization (PTQ) for model compression and deployment.
- Dataset: Speech Commands v0.02 (36 classes)
- Model: CNN with 3 convolutional blocks (1.2M parameters)
- Framework: TensorFlow/Keras (Keras 3)
- Quantization: Post-Training Quantization (PTQ) to INT8
- Results: 93.49% accuracy with 11.73x model compression
pip install -r requirements.txt

If you don't have the dataset yet, download it:
cd Code
python -c "from utils import download_files; download_files()"

This will download and extract the Speech Commands v0.02 dataset (~2.3 GB) to Code/data/speech_commands_v0_extracted/.
Note: The dataset must be at Code/data/speech_commands_v0_extracted/ for the test scripts to work.
cd Code
python test_tf_keras_model.py

Expected output: ~93.46% test accuracy
cd Code
python test_ptq_model.py

Expected output: ~93.49% INT8 accuracy, 11.73x compression
cd Code
python test_qat_model.py

Expected output: ~90.12% INT8 accuracy, 11.79x compression
- test_tf_keras_model.py - Tests the FP32 TensorFlow/Keras model
- test_ptq_model.py - Tests and compares FP32 vs INT8 PTQ quantized models
- test_qat_model.py - Tests and compares FP32 vs INT8 QAT quantized models

- ptq_pipeline_tf_keras.py - Post-Training Quantization pipeline (generates quantized model)

- model_weights/tf_keras_weights.h5 - Trained FP32 model weights (14.08 MB)
- model_weights/model_ptq_int8.tflite - Quantized INT8 model (1.20 MB)
- model_weights/model_fp32.keras - Full FP32 model (saved format)

- preprocessing.py - Audio preprocessing (STFT → mel spectrogram → log scale)
- utils.py - Dataset loading and train/test split utilities
- conv_block_model.py - CNN model architecture definition

- tf_keras_files/tf_keras_model.ipynb - TensorFlow/Keras training notebook
- qat_pipeline_final.ipynb - Quantization Aware Training (QAT) notebook (PyTorch)
- temp_file_pytorch_backend.ipynb - PyTorch training notebook

Code/data/speech_commands_v0_extracted/ - Extracted Speech Commands dataset:
- 36 classes (yes, no, up, down, left, right, etc.)
- ~105,835 audio files
- 70/15/15 train/val/test split
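
A 70/15/15 split can be made reproducible by shuffling the file list with a fixed seed and slicing it. The sketch below is illustrative only: `split_files` and the seed value are hypothetical and are not taken from utils.py, which may split differently (for example per class or per speaker).

```python
import random


def split_files(paths, train=0.70, val=0.15, seed=42):
    """Return (train, val, test) lists; the test set gets whatever remains."""
    paths = sorted(paths)                 # fixed starting order
    random.Random(seed).shuffle(paths)    # seeded, so the split is repeatable
    n_train = round(len(paths) * train)
    n_val = round(len(paths) * val)
    return (
        paths[:n_train],
        paths[n_train:n_train + n_val],
        paths[n_train + n_val:],
    )
```

Sorting before the seeded shuffle keeps the split independent of the order in which the file system happens to list the audio files.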
Purpose: Verify the trained FP32 model works correctly.
cd Code
python test_tf_keras_model.py

What it does:
- Loads the model architecture
- Loads weights from tf_keras_files/tf_keras_weights.h5
- Loads and preprocesses test data
- Evaluates model accuracy
- Shows 10 sample predictions
Expected output:
Test accuracy = 0.9346
Sample 0: True = wow | Predicted = wow
...
Purpose: Compare FP32 vs INT8 quantized model performance.
cd Code
python test_ptq_model.py

What it does:
- Loads FP32 model and quantized TFLite model
- Evaluates both on the same test set
- Compares accuracies and model sizes
- Shows side-by-side sample predictions
Expected output:
FP32 Accuracy: 93.4551% (0.934551)
INT8 Accuracy: 93.4929% (0.934929)
Accuracy Drop: -0.0378% (-0.04%)
Model Size (FP32): 14.08 MB
Model Size (INT8): 1.20 MB
Compression Ratio: 11.73x
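
The accuracy drop and compression ratio reported above follow directly from the two accuracies and the two file sizes. A minimal sketch of that arithmetic, with a hypothetical helper name (`compare_models` is not part of the test scripts):

```python
def compare_models(fp32_acc, int8_acc, fp32_size_mb, int8_size_mb):
    """Summarize an FP32 vs INT8 comparison.

    A negative accuracy drop means the INT8 model scored higher than FP32.
    """
    return {
        "accuracy_drop_pct": (fp32_acc - int8_acc) * 100.0,
        "compression_ratio": fp32_size_mb / int8_size_mb,
    }


summary = compare_models(0.934551, 0.934929, 14.08, 1.20)
print(f"Accuracy Drop: {summary['accuracy_drop_pct']:.4f}%")
print(f"Compression Ratio: {summary['compression_ratio']:.2f}x")
```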
Purpose: Compare FP32 vs INT8 QAT model performance.
cd Code
python test_qat_model.py

What it does:
- Loads FP32 model and QAT TFLite model
- Evaluates both on the same test set
- Compares accuracies and model sizes
- Shows side-by-side sample predictions
Expected output:
FP32 Accuracy: 93.4299% (0.934299)
INT8 Accuracy: 90.1228% (0.901228)
Accuracy Drop: 3.3071% (+3.31%)
Model Size (FP32): 14.08 MB
Model Size (INT8): 1.19 MB
Compression Ratio: 11.79x
Purpose: Generate a new quantized model from the FP32 model.
cd Code
python ptq_pipeline_tf_keras.py

What it does:
- Loads the FP32 model
- Creates calibration dataset (200 samples)
- Applies INT8 quantization using TensorFlow Lite
- Saves quantized model to tf_keras_files/model_ptq_int8.tflite
- Evaluates both models and shows comparison
Note: This takes several minutes due to calibration and conversion.
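
Conceptually, INT8 PTQ uses the calibration samples to estimate value ranges and then maps floats to 8-bit integers through a scale and a zero point. The NumPy sketch below illustrates that affine mapping for a single tensor. It is not the code in ptq_pipeline_tf_keras.py and not TensorFlow Lite's converter, which handles calibration internally; all function names here are hypothetical.

```python
import numpy as np


def calibrate_range(samples):
    """Observed min/max over calibration tensors, widened to include 0."""
    lo = min(float(s.min()) for s in samples)
    hi = max(float(s.max()) for s in samples)
    return min(lo, 0.0), max(hi, 0.0)


def quant_params(lo, hi, qmin=-128, qmax=127):
    """Scale and zero point of the affine map real = scale * (q - zero_point)."""
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:          # constant tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)


def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```

For values inside the calibrated range, the round trip through `quantize` and `dequantize` loses at most about half a quantization step (`scale / 2`), which is one reason a small calibration set can be enough when it covers the typical input range.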
The CNN model consists of:
- 3 Convolutional Blocks:
- Block 1: 64 filters, stride 2
- Block 2: 128 filters
- Block 3: 256 filters
- Each block: Conv2D → BatchNorm → ReLU → MaxPooling → SpatialDropout
- Global Average Pooling
- Dense layers: 256 → 36 (softmax output)
Input: Mel spectrogram (128×64×1)
Output: 36 class probabilities
The preprocessing matches the training pipeline:
- Audio normalization (int16 → float32, /32768.0)
- STFT (frame_length=256, frame_step=128)
- Mel spectrogram (64 mel bins, 0-8000 Hz)
- Log scale (log(x + 1e-6))
- Resize to (128, 64, 1)
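
The steps above can be sketched in plain NumPy. This is an approximation for illustration, not the code in preprocessing.py: the 16 kHz sample rate, the Hann window, the FFT length equal to the frame length, the triangular mel filters, and linear interpolation for the resize are all assumptions, and the function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed; 0-8000 Hz is then the full band up to Nyquist


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_bins, f_min=0.0, f_max=8000.0):
    """Triangular filters, shape (n_bins, n_mels)."""
    bin_freqs = np.linspace(0.0, SAMPLE_RATE / 2.0, n_bins)[:, None]
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (bin_freqs - lower) / (center - lower)
    falling = (upper - bin_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def preprocess(audio_int16, frame_length=256, frame_step=128, n_mels=64,
               target_frames=128):
    """int16 waveform (at least frame_length samples) -> (128, 64, 1) log-mel."""
    x = audio_int16.astype(np.float32) / 32768.0           # normalize to [-1, 1)
    n_frames = 1 + (len(x) - frame_length) // frame_step
    idx = (np.arange(frame_length)[None, :]
           + frame_step * np.arange(n_frames)[:, None])
    frames = x[idx] * np.hanning(frame_length)              # windowed frames
    magnitude = np.abs(np.fft.rfft(frames, axis=1))         # STFT magnitude
    mel = magnitude @ mel_filterbank(n_mels, magnitude.shape[1])
    log_mel = np.log(mel + 1e-6)                            # log scale
    # Resize along time by linear interpolation, one mel bin at a time.
    src = np.linspace(0.0, n_frames - 1, target_frames)
    cols = [np.interp(src, np.arange(n_frames), log_mel[:, j])
            for j in range(n_mels)]
    return np.stack(cols, axis=1)[..., np.newaxis].astype(np.float32)
```

Under these assumptions a one-second clip (16,000 samples) yields 124 frames of 129 frequency bins, which the mel projection reduces to 64 bins and the resize stretches to 128 frames.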
| Metric | FP32 Model | INT8 Model |
|---|---|---|
| Accuracy | 93.46% | 93.49% |
| Model Size | 14.08 MB | 1.20 MB |
| Compression | 1x | 11.73x |
| Format | H5 weights | TFLite |
See "Testing the Quantized Model" section above for detailed comparison output.
- Ensure tf_keras_files/tf_keras_weights.h5 exists
- The script will try manual HDF5 loading if standard loading fails

- Ensure Code/data/speech_commands_v0_extracted/ exists
- If missing, download the dataset:
  cd Code
  python -c "from utils import download_files; download_files()"

- Run ptq_pipeline_tf_keras.py first to generate the TFLite model
- Check that tf_keras_files/model_ptq_int8.tflite exists

- Ensure all dependencies are installed: pip install -r requirements.txt
- Use Keras 3 with TensorFlow backend (set KERAS_BACKEND=tensorflow)
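
Most of these checks can be automated with a small preflight script run from the Code directory. The sketch below is a hypothetical helper, not part of the repository; it assumes the tf_keras_files/ paths named above are relative to Code, so adjust the list if your layout differs.

```python
import os
from pathlib import Path

# Keras 3 picks its backend when it is first imported, so set this beforehand.
os.environ.setdefault("KERAS_BACKEND", "tensorflow")


def missing_paths(root, relative_paths):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    required = [  # assumed to be relative to Code/
        "data/speech_commands_v0_extracted",
        "tf_keras_files/tf_keras_weights.h5",
        "tf_keras_files/model_ptq_int8.tflite",
    ]
    for path in missing_paths(".", required):
        print(f"Missing: {path}")
```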
See requirements.txt for full list. Core dependencies:
- tensorflow>=2.15.0
- keras>=3.0.0
- numpy>=1.24.0
- h5py>=3.8.0
If you use this code, please cite:
- Speech Commands Dataset: Warden, P. (2018). "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition"
- TensorFlow Lite: Google (2024). "TensorFlow Lite: Machine Learning for Mobile and Edge Devices"