A PyTorch-based project to classify music genres from audio clips using neural networks. Different models and feature extraction techniques are explored, starting with a simple Feedforward Neural Network (FNN) on Mel-Frequency Cepstral Coefficients (MFCCs) and progressing to an optimized Convolutional Neural Network (CNN) on Mel Spectrograms.
The model classifies audio into four distinct genres:
- Blues
- Classical
- Hiphop
- Rock/Metal/Hardrock
The first model attempts classification using MFCCs as input features.
- Data: Loads pre-processed files containing MFCC features. Each audio sample is represented by a 1D vector of 26 features.
- Model: A simple 3-layer Feedforward Neural Network built with Linear layers.
- Training: The model is trained using Stochastic Gradient Descent (SGD) for 30 epochs.
- Evaluation: Model performance is evaluated using Accuracy and F1-Score, and a confusion matrix is generated. The best-performing model from the training epochs is saved and evaluated on the test set.
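The FNN stage can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the hidden-layer sizes (128 and 64) and the learning rate are assumptions; only the 26 MFCC input features, 4 output classes, 3 Linear layers, and SGD optimizer come from the description above.

```python
import torch
import torch.nn as nn

class GenreFNN(nn.Module):
    """3-layer feedforward network over a 26-dim MFCC feature vector."""

    def __init__(self, n_features=26, n_classes=4):
        super().__init__()
        # Hidden sizes are illustrative, not the notebook's tuned values
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),  # raw logits; CrossEntropyLoss handles softmax
        )

    def forward(self, x):
        return self.net(x)

model = GenreFNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch of 8 MFCC vectors
x = torch.randn(8, 26)
y = torch.randint(0, 4, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

In the notebook this step runs inside the 30-epoch loop, with accuracy and F1 tracked per epoch to select the best checkpoint.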
The second model uses Mel Spectrograms ("melgrams"), which treat the audio data as a 2D image-like representation. This allows for the use of Convolutional Neural Networks.
- CNN Model: The architecture incorporates:
- Padding: Padding is added to the convolution layers to preserve feature map dimensions.
- Max Pooling: Max Pooling layers are added after each convolution to downsample the feature maps and reduce computational load.
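The effect of these two choices is easy to verify on a dummy tensor: with `kernel_size=3` and `padding=1` a convolution preserves the spatial dimensions, and a 2×2 max pool then halves them. The input shape below is illustrative, not the project's actual spectrogram size.

```python
import torch
import torch.nn as nn

# A stand-in for one mel spectrogram treated as a 1-channel image
x = torch.randn(1, 1, 128, 130)

conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

h = conv(x)
print(h.shape)  # torch.Size([1, 16, 128, 130]) -- padding preserved H and W
h = pool(h)
print(h.shape)  # torch.Size([1, 16, 64, 65])   -- pooling halved H and W (floor)
```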
- Optimizer & Activation Function Tuning:
- Optimizers: This model is used to benchmark various PyTorch optimizers, including `SGD`, `Adam`, `AdamW`, `Adadelta`, and others.
- Activation Functions: Experimentation with 11 different activation functions (e.g., `ReLU`, `LeakyReLU`, `GELU`, `SiLU`, `Mish`) to find the best one.
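A benchmarking loop of this kind might look like the sketch below. The structure, the tiny stand-in model, and the learning rates are assumptions for illustration; only the optimizer and activation names come from the list above.

```python
import torch
import torch.nn as nn

# Factories so each run gets a fresh optimizer bound to a fresh model
optimizers = {
    "SGD": lambda p: torch.optim.SGD(p, lr=0.01),
    "Adam": lambda p: torch.optim.Adam(p, lr=0.001),
    "AdamW": lambda p: torch.optim.AdamW(p, lr=0.001),
    "Adadelta": lambda p: torch.optim.Adadelta(p),
}
activations = {
    "ReLU": nn.ReLU,
    "LeakyReLU": nn.LeakyReLU,
    "GELU": nn.GELU,
    "SiLU": nn.SiLU,
    "Mish": nn.Mish,
}

results = {}
for opt_name, make_opt in optimizers.items():
    for act_name, act in activations.items():
        # Tiny stand-in model; the notebook benchmarks its CNN here instead
        model = nn.Sequential(nn.Linear(26, 32), act(), nn.Linear(32, 4))
        optimizer = make_opt(model.parameters())
        # ... train for a few epochs and record validation accuracy here ...
        results[(opt_name, act_name)] = None  # placeholder for the measured score
```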
- Model: A 3-layer CNN with Max Pooling, followed by 4 fully-connected (dense) layers for classification.
- Key Components:
- Activation: Sigmoid-weighted Linear Unit ("SiLU") is used as the primary activation function.
- Optimizer: Adam is chosen, using Weight Decay for regularization.
- Scheduler: A Cyclic Learning Rate scheduler is used to dynamically adjust the learning rate during training.
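The key components above can be wired together as in this sketch. The channel counts, dense-layer widths, learning-rate bounds, weight decay, and the assumed 128×128 spectrogram input are all illustrative; the SiLU activation, Adam with weight decay, and the cyclic scheduler are the components named above.

```python
import torch
import torch.nn as nn

# 3 conv blocks (padded conv -> SiLU -> max pool) then 4 dense layers.
# With a 128x128 input, three 2x2 pools leave a 16x16 map: 64*16*16 = 16384.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.SiLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.SiLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.SiLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 256), nn.SiLU(),
    nn.Linear(256, 128), nn.SiLU(),
    nn.Linear(128, 64), nn.SiLU(),
    nn.Linear(64, 4),  # 4 genre classes
)

# Adam with weight decay for regularization
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Cyclic learning rate oscillating between base_lr and max_lr.
# cycle_momentum=False is required because Adam has no `momentum` parameter.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=200, cycle_momentum=False
)

# During training, scheduler.step() is called after each batch to advance the cycle.
```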
- Download: Uses the `pytube` library to download a YouTube video's audio stream.
- Convert: Uses the `pydub` library to convert the downloaded audio into a `.wav` file.
- Extract Features: Uses `librosa` to load the `.wav` file, segment it into chunks, and generate Mel Spectrograms for each chunk.
- Classify: The best model is loaded and performs inference on these new spectrograms.
- Visualize: The results are plotted over time, showing the model's predicted genre for each segment of the song, along with a table showing the overall percentage breakdown.
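The segmentation step in this pipeline can be sketched without the download/convert machinery. The 3-second chunk length and the helper name below are assumptions, as is the use of `librosa`'s default 22050 Hz sample rate; the notebook's actual chunking parameters may differ.

```python
import numpy as np

def split_into_chunks(waveform, sample_rate, chunk_seconds=3.0):
    """Split a 1-D waveform into equal fixed-length chunks, dropping the remainder."""
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = len(waveform) // chunk_len
    return [waveform[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

sr = 22050                      # librosa's default sample rate
audio = np.zeros(10 * sr)       # stand-in for a 10-second decoded .wav clip
chunks = split_into_chunks(audio, sr)
print(len(chunks))              # 3 full 3-second chunks; the trailing second is dropped
# Each chunk would then be turned into a Mel Spectrogram, e.g. via
# librosa.feature.melspectrogram(y=chunk, sr=sr), and fed to the saved model.
```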
You must have Python installed, along with several libraries. You can install them via pip:
pip install torch torcheval-nightly
pip install pytube pydub
pip install scikit-learn matplotlib numpy librosa
- Environment: This notebook is designed to run in a Jupyter environment like Google Colab.
- Data: Ensure your pre-processed training, validation, and test datasets (MFCCs and Mel Spectrograms) are available and update the paths in the notebook to load them correctly.
- Execution: Open the notebook and execute the cells sequentially.
- The notebook will guide you through loading data, training the initial FNN model, and then building, training, and optimizing the final CNN model.
- The best-performing models will be saved as `.pt` files.
- Inference: The final sections of the notebook demonstrate how to use the saved models to classify new songs directly from a YouTube URL. You can change the provided URLs to test your own examples.