Real-time DSP audio analysis system for ESP32 with advanced feature extraction and JSON data streaming.
- Real-time FFT: 2048-point, 21.5 Hz resolution
- Spectral Features: Centroid, spread, flatness, rolloff
- Temporal Features: Zero-crossing rate, RMS energy, peak detection
- Note Detection: 12-semitone chromatic scale (Goertzel algorithm)
- Frequency Bands: 16 logarithmic bands (60 Hz - 16 kHz)
- JSON Streaming: ~10 Hz updates over serial
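
As a rough illustration of the note-detection item above, the following Python sketch evaluates the Goertzel recurrence at the 12 equal-tempered semitone frequencies of one octave and reports the strongest one. It is not the firmware code: the function names, the A4 = 440 Hz tuning, the choice of octave 4, and the lack of windowing are all assumptions made for this example.

```python
import math

SAMPLE_RATE = 44_100  # Hz, the input rate stated in this README
FRAME_SIZE = 2048     # samples per frame, matching the FFT size

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(semitone, octave=4):
    """Equal-tempered frequency of a note, assuming A4 = 440 Hz."""
    midi = 12 * (octave + 1) + semitone  # C4 -> 60, A4 -> 69
    return 440.0 * 2.0 ** ((midi - 69) / 12)


def goertzel_power(samples, target_hz, sample_rate=SAMPLE_RATE):
    """Squared magnitude of `samples` at `target_hz` (Goertzel recurrence)."""
    omega = 2.0 * math.pi * target_hz / sample_rate
    coeff = 2.0 * math.cos(omega)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def detect_note(samples, octave=4):
    """Return the strongest of the 12 semitones in one octave, plus all powers."""
    powers = [goertzel_power(samples, note_frequency(i, octave)) for i in range(12)]
    best = max(range(12), key=powers.__getitem__)
    return NOTE_NAMES[best], powers
```

For a 2048-sample frame of a pure 440 Hz sine, `detect_note` picks `"A"`. Note that a 2048-sample frame at 44.1 kHz resolves about 21.5 Hz, while C4 and C#4 are only about 15.6 Hz apart, so adjacent low notes can leak into each other unless the frame is longer or the analysis is limited to higher octaves.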
- ESP32 DevKit V1
- INMP441 I2S digital microphone
- USB cable (power + serial)
INMP441 ESP32
WS → GPIO 32
SD → GPIO 35
SCK → GPIO 33
GND → GND
3V3 → 3V3
cd audio-signal-processor
pio run --target upload
pio device monitor --baud 115200

JSON with spectral, temporal, and frequency band features:
{
"timestamp_ms": 12345678,
"spectral": {"centroid_hz": 2450.5, ...},
"temporal": {"zcr": 0.15, "rms_energy": 0.35, ...},
"freq_bands": [...],
"peaks": [...],
"note_detection": [...]
}

Audio Input (44.1 kHz)
↓
Audio Frame Buffer (2048 samples)
↓
Preprocessing (DC removal, windowing)
↓
FFT Analysis (2048-point)
↓
Feature Extraction (48 features)
├─ Spectral: centroid, spread, flatness, rolloff
├─ Temporal: ZCR, RMS energy, peak amplitude
├─ Frequency: 16 logarithmic bands (60 Hz - 16 kHz)
├─ MFCC: 13 coefficients
└─ Chroma: 12 note bins
↓
ML Classification (TensorFlow Lite)
├─ Normalize features
└─ Genre prediction (10 genres)
↓
JSON Output (serial @ 115200 baud)
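
The first stages of the pipeline above can be sketched on the host in plain Python. This is an illustration under stated assumptions, not the ESP32 implementation: samples are taken to be floats in [-1, 1], the window is a periodic Hann window (the README only says "windowing"), `rms_energy` is computed as root-mean-square amplitude, and the helper names `fft` and `analyze_frame` are hypothetical. Only a few of the 48 features are computed, and the output reuses the field names from the JSON example.

```python
import cmath
import json
import math

SAMPLE_RATE = 44_100  # Hz
FRAME_SIZE = 2048     # samples per frame


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def analyze_frame(frame, timestamp_ms=0):
    n = len(frame)
    # Preprocessing: remove the DC offset.
    mean = sum(frame) / n
    centered = [s - mean for s in frame]

    # Temporal features, computed on the DC-free frame (assumed ordering).
    rms = math.sqrt(sum(s * s for s in centered) / n)
    crossings = sum(
        1 for a, b in zip(centered, centered[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (n - 1)

    # Preprocessing: periodic Hann window (assumed window type).
    windowed = [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        for i, s in enumerate(centered)
    ]

    # FFT analysis: keep magnitudes for bins 0 .. n/2.
    mags = [abs(c) for c in fft(windowed)[: n // 2 + 1]]
    bin_hz = SAMPLE_RATE / n  # about 21.5 Hz for n = 2048

    # Spectral centroid: magnitude-weighted mean frequency.
    total = sum(mags)
    centroid = sum(k * bin_hz * m for k, m in enumerate(mags)) / total if total else 0.0

    return {
        "timestamp_ms": timestamp_ms,
        "spectral": {"centroid_hz": round(centroid, 1)},
        "temporal": {"zcr": round(zcr, 4), "rms_energy": round(rms, 4)},
    }


# One serial line would then be a compact JSON object:
# line = json.dumps(analyze_frame(samples, timestamp_ms=12345678))
```

On the receiving side, each line can be decoded with `json.loads` and the nested fields read as ordinary dictionaries, for example `msg["temporal"]["zcr"]`.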
- Feature Extraction (ml/feature_extraction.py)
  - Extract 48 audio features from audio files
  - Output: CSV with feature vectors
- Model Training (ml/train_model.py)
  - Train neural network on extracted features
  - Split: 64% train, 16% validation, 20% test (subject to change)
  - Output: TensorFlow Lite model for ESP32
- Deployment
  - Quantized model runs on ESP32 in real time (might change)
  - Classifies audio into 10 music genres
- Model Accuracy
  - Overall test accuracy: 62.5% across 10 genres (200 test samples)
  - Classical music has the highest per-genre accuracy at 90% (18/20 correct), likely due to its distinct spectral profile: a low zero-crossing rate, narrow frequency spread, and strong harmonic structure make it easy to separate from other genres
  - Metal (75%) and pop (75%) also perform well, while genres with overlapping characteristics like disco (35%) and rock (40%) are harder to distinguish
  - The confusion matrix shows that most misclassifications occur between sonically similar genres (e.g., rock confused with metal/country, disco confused with reggae/rock)
- Training Results and Initial Data
  - Figures: correlation matrix and training history
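
The 64% / 16% / 20% split listed under Model Training is what two successive 80/20 splits produce: 20% is held out for testing, then 20% of the remaining 80% (16% of the total) is used for validation. The sketch below shows that arithmetic on sample indices. It is an assumption about how the split could be done, not a copy of `ml/train_model.py`; the function name, the seed, and the total of 1000 clips (inferred from 200 test samples being 20%) are illustrative.

```python
import random


def split_indices(n_samples, test_frac=0.20, val_frac_of_rest=0.20, seed=42):
    """Shuffle indices, hold out a test set, then carve validation from the rest."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = round(n_samples * test_frac)
    test_idx, rest = indices[:n_test], indices[n_test:]

    n_val = round(len(rest) * val_frac_of_rest)
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

This version shuffles without regard to genre. The per-genre figure of 18/20 for classical suggests 20 test clips per genre, so the actual script may use a stratified split instead.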
- ARCHITECTURE.md - Technical deep-dive
- FEATURES_REFERENCE.md - Feature explanations
- DSP Features API - API documentation
MIT