axryap27/esp32-audio-classifier

ESP32 Audio Classifier

Real-time DSP audio analysis system for ESP32 with advanced feature extraction and JSON data streaming.

Features

  • Real-time FFT: 2048-point, 21.5 Hz resolution at a 44.1 kHz sample rate
  • Spectral Features: Centroid, spread, flatness, rolloff
  • Temporal Features: Zero-crossing rate, RMS energy, peak detection
  • Note Detection: 12-semitone chromatic scale (Goertzel algorithm)
  • Frequency Bands: 16 logarithmic bands (60 Hz - 16 kHz)
  • JSON Streaming: ~10 Hz updates over serial
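
The 21.5 Hz figure follows from dividing the sample rate by the FFT size, and the Goertzel algorithm measures the energy at a single target frequency without computing a full FFT. The Python sketch below illustrates both under stated assumptions: a 44.1 kHz sample rate (from the architecture diagram), an A4 = 440 Hz reference, and note frequencies taken from the octave C4 to B4. The function names and the choice of octave are hypothetical and not taken from the firmware.

```python
import math

SAMPLE_RATE = 44100  # Hz, as listed in the architecture diagram
FFT_SIZE = 2048

# Width of one FFT bin: 44100 / 2048, roughly 21.5 Hz
BIN_HZ = SAMPLE_RATE / FFT_SIZE

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(semitone, a4=440.0):
    """Equal-tempered frequency for a note in C4..B4 (assumed octave)."""
    # A4 sits at index 9 in NOTE_NAMES
    return a4 * 2.0 ** ((semitone - 9) / 12.0)


def goertzel_power(samples, target_hz, sample_rate=SAMPLE_RATE):
    """Squared magnitude of the signal at target_hz via the Goertzel recurrence."""
    w = 2.0 * math.pi * target_hz / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_note(samples):
    """Name of the semitone whose Goertzel power is largest."""
    powers = [goertzel_power(samples, note_frequency(i)) for i in range(12)]
    return NOTE_NAMES[max(range(12), key=powers.__getitem__)]
```

Running twelve Goertzel filters costs one multiply and two adds per sample per note, which is one reason the approach is popular on microcontrollers when only a handful of frequencies matter.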

Hardware

  • ESP32 DevKit V1
  • INMP441 I2S digital microphone
  • USB cable (power + serial)

Quick Start

Pinout

INMP441    ESP32
WS    →    GPIO 32
SD    →    GPIO 35
SCK   →    GPIO 33
GND   →    GND
3V3   →    3V3

Build

cd audio-signal-processor
pio run --target upload

Monitor

pio device monitor --baud 115200

Output Format

JSON with spectral, temporal, and frequency band features:

{
  "timestamp_ms": 12345678,
  "spectral": {"centroid_hz": 2450.5, ...},
  "temporal": {"zcr": 0.15, "rms_energy": 0.35, ...},
  "freq_bands": [...],
  "peaks": [...],
  "note_detection": [...]
}
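
On the host side, each serial line can be decoded with any JSON parser. The following is a minimal Python sketch, assuming the firmware emits one complete JSON object per line. The `parse_frame` and `summarize` helpers and the sample line are hypothetical, and only keys shown in the example above are read.

```python
import json


def parse_frame(line):
    """Decode one serial line; return None for boot messages or partial lines."""
    line = line.strip()
    if not line.startswith("{"):
        return None
    try:
        frame = json.loads(line)
    except json.JSONDecodeError:
        return None
    return frame if isinstance(frame, dict) else None


def summarize(frame):
    """Pull out a few fields from the example format, tolerating missing keys."""
    return {
        "t_s": frame.get("timestamp_ms", 0) / 1000.0,
        "centroid_hz": frame.get("spectral", {}).get("centroid_hz"),
        "rms_energy": frame.get("temporal", {}).get("rms_energy"),
    }


# Hypothetical line for illustration; real frames carry more fields.
sample = ('{"timestamp_ms": 12345678, "spectral": {"centroid_hz": 2450.5}, '
          '"temporal": {"zcr": 0.15, "rms_energy": 0.35}}')
print(summarize(parse_frame(sample)))
```

With pyserial installed, lines could be read from `serial.Serial(port, 115200)` using `readline()` and decoded to text before being passed to `parse_frame`; the port name depends on the host machine.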

System Architecture

Audio Input (44.1 kHz)
    ↓
Audio Frame Buffer (2048 samples)
    ↓
Preprocessing (DC removal, windowing)
    ↓
FFT Analysis (2048-point)
    ↓
Feature Extraction (48 features)
    ├─ Spectral: centroid, spread, flatness, rolloff
    ├─ Temporal: ZCR, RMS energy, peak amplitude
    ├─ Frequency: 16 logarithmic bands (60 Hz - 16 kHz)
    ├─ MFCC: 13 coefficients
    └─ Chroma: 12 note bins
    ↓
ML Classification (TensorFlow Lite)
    ├─ Normalize features
    └─ Genre prediction (10 genres)
    ↓
JSON Output (serial @ 115200 baud)
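
The preprocessing, FFT, and feature stages above can be sketched in plain Python. This is an illustration under assumptions, not the firmware's code: it uses a Hann window, magnitude-weighted spectral moments, and an 85% rolloff threshold, none of which the README specifies. MFCC and chroma extraction are omitted, and all function names are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def analyze_frame(samples, sample_rate=44100, rolloff_fraction=0.85):
    n = len(samples)
    # Preprocessing: remove the DC offset, then apply a Hann window (assumed)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    windowed = [c * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, c in enumerate(centered)]

    # Temporal features on the DC-free frame
    rms = math.sqrt(sum(c * c for c in centered) / n)
    peak = max(abs(c) for c in centered)
    crossings = sum(1 for a, b in zip(centered, centered[1:])
                    if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1)

    # Magnitude spectrum of the non-negative frequencies
    mags = [abs(c) for c in fft(windowed)[: n // 2 + 1]]
    freqs = [k * sample_rate / n for k in range(n // 2 + 1)]
    total = sum(mags) or 1e-12

    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    spread = math.sqrt(
        sum((f - centroid) ** 2 * m for f, m in zip(freqs, mags)) / total)

    # Flatness: geometric mean over arithmetic mean of the power spectrum
    power = [m * m + 1e-12 for m in mags]
    flatness = (math.exp(sum(math.log(p) for p in power) / len(power))
                / (sum(power) / len(power)))

    # Rolloff: lowest frequency below which rolloff_fraction of the magnitude lies
    running, rolloff = 0.0, freqs[-1]
    for f, m in zip(freqs, mags):
        running += m
        if running >= rolloff_fraction * total:
            rolloff = f
            break

    # 16 logarithmically spaced bands between 60 Hz and 16 kHz
    edges = [60.0 * (16000.0 / 60.0) ** (i / 16) for i in range(17)]
    bands = [sum(m for f, m in zip(freqs, mags) if lo <= f < hi)
             for lo, hi in zip(edges, edges[1:])]

    return {"rms": rms, "peak": peak, "zcr": zcr, "centroid": centroid,
            "spread": spread, "flatness": flatness, "rolloff": rolloff,
            "bands": bands}
```

For a pure tone, the centroid and rolloff should both sit near the tone's frequency and the flatness should be close to zero, which makes a sine wave a convenient sanity check for any implementation of these features.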

ML Pipeline & Training

  1. Feature Extraction (ml/feature_extraction.py)

    • Extract 48 audio features from audio files
    • Output: CSV with feature vectors
  2. Model Training (ml/train_model.py)

    • Train neural network on extracted features
    • Split: 64% train, 16% validation, 20% test (subject to change)
    • Output: TensorFlow Lite model for ESP32
  3. Deployment

    • Quantized model runs on the ESP32 in real time (deployment approach may change)
    • Classifies audio into 10 music genres
  4. Model Accuracy

    • Overall test accuracy: 62.5% across 10 genres (200 test samples)
    • Classical music has the highest per-genre accuracy at 90% (18/20 correct), likely because of its distinct spectral profile: a low zero-crossing rate, narrow frequency spread, and strong harmonic structure make it easy to separate from other genres
    • Metal (75%) and pop (75%) also perform well, while genres with overlapping characteristics like disco (35%) and rock (40%) are harder to distinguish
    • The confusion matrix below shows that most misclassifications occur between sonically similar genres (e.g., rock confused with metal/country, disco confused with reggae/rock)
  5. Training Results and Initial Data

    • Correlation Matrix & Training History: [two images: correlation matrix, training history]
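
A 64/16/20 split is what results from holding out 20% for testing and then taking 20% of the remainder for validation (0.8 × 0.8 = 0.64, 0.8 × 0.2 = 0.16). The Python sketch below shows one way to do that per genre, followed by z-score normalization fitted on the training rows only. It assumes 100 clips per genre, which is inferred from the 200 test samples and 20 test clips per genre reported above. The function names are hypothetical, and `ml/train_model.py` may do this differently.

```python
import random


def stratified_split(labels, test_frac=0.20, val_frac_of_rest=0.20, seed=0):
    """Return (train, val, test) index lists, keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round((len(indices) - n_test) * val_frac_of_rest)
        test += indices[:n_test]
        val += indices[n_test:n_test + n_val]
        train += indices[n_test + n_val:]
    return train, val, test


def fit_zscore(rows):
    """Per-feature mean and standard deviation from the training rows only."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Floor the deviation so constant features do not divide by zero
    stds = [max((sum((v - m) ** 2 for v in col) / n) ** 0.5, 1e-8)
            for col, m in zip(columns, means)]
    return means, stds


def apply_zscore(row, means, stds):
    """Normalize one feature vector with previously fitted statistics."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


# Assumed dataset shape: 10 genres with 100 clips each
labels = [genre for genre in range(10) for _ in range(100)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

The same means and standard deviations would need to be stored alongside the model so that features computed on the device are scaled identically before inference.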

Documentation

License

MIT

About

Real-time music genre classification on ESP32 using DSP feature extraction and TensorFlow Lite for EdgeML capabilities
