This project aimed to develop a deep learning model for classifying bird species from audio recordings of their vocalizations. The dataset was obtained from the Kaggle BirdCLEF competition and pre-processed to filter out low-quality audio samples and to ensure a sufficient number of samples per bird species. The librosa library was used to extract log mel-spectrogram image representations from the audio files. These 2D spectrograms, which encode the time-frequency patterns of the bird vocalizations, were then normalized. The normalized spectrogram images served as input to a convolutional neural network (CNN) built with the TensorFlow framework. After training for multiple epochs, the validation accuracy was about 74% and the validation F1 score was 73%; the trained CNN demonstrates the feasibility of using deep learning on audio spectrograms for acoustic bird species classification. Potential improvements include data augmentation, regularization, and ensemble methods to better generalize the model's performance across diverse recording conditions.
A Mel-frequency spectrogram is a representation of the spectrum of a signal as it varies over time. It is derived from the traditional spectrogram, which displays the frequency content of a signal over time. However, instead of linearly spaced frequency bins, the mel spectrogram uses frequency bins that are spaced according to the mel scale, which is a perceptual scale of pitches based on human hearing. This scaling is designed to better represent how humans perceive differences in pitch.
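The mel scale described above can be expressed with the commonly used HTK-style formula, m = 2595 · log10(1 + f / 700). A minimal sketch of the conversion (the function names here are illustrative, not from the project code):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse mapping from mel back to Hz
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to roughly 1000 mel, and the spacing between mel bins grows wider at higher frequencies, mirroring how human pitch perception compresses high frequencies.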

Steps to compute a Mel spectrogram:
- Compute the Short-Time Fourier Transform (STFT) of the signal to obtain a power spectrogram.
- Choose the number of mel bands and construct the corresponding mel filter banks.
- Apply the filter banks to the spectrogram, mapping the linearly spaced frequency bins onto the mel scale.
- Convert the resulting amplitudes to decibels (log scale).

CNNs, or Convolutional Neural Networks, are deep learning architectures particularly effective for image processing tasks. They consist of layers that apply convolution operations to capture features like edges and textures, pooling layers to reduce spatial dimensions, activation functions for non-linearity, and fully connected layers for classification or regression. CNNs excel at automatically learning hierarchical representations from raw data, making them invaluable for tasks such as image classification, object detection, and segmentation, where they have achieved state-of-the-art performance.

The dataset consists of 40 bird species. The goal is to extract mel spectrograms from the audio recordings and pass them to the CNN.
The convolutional neural network has the following structure:
- 4 blocks, each consisting of:
  - Convolutional layer
  - Batch normalization
  - Max pooling
- Followed by fully connected layers with a softmax output for classification.

After parameter tuning, we obtain a test accuracy of 83.33% and a test loss of 0.4696.
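The block structure above can be sketched in TensorFlow/Keras as follows. This is an illustrative reconstruction, not the tuned model from the project: the filter counts, dense layer width, input shape, and optimizer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_classes=40):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Four blocks: Conv -> BatchNorm -> MaxPool (filter counts assumed)
    for filters in (16, 32, 64, 128):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(2))
    # Classification head: flatten + dense + softmax over the 40 species
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Each max-pooling step halves the spatial dimensions, so the four blocks reduce a 128×128 spectrogram to 8×8 feature maps before the dense layers.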

The deep learning model developed in this project successfully classified bird species from their vocalizations. Using a Kaggle dataset containing audio recordings of 40 bird species, we processed the audio with the Python library librosa, converting the recordings into log mel-spectrogram images that capture the time-frequency characteristics of the bird calls. The final model achieved a test accuracy of approximately 83.33%, demonstrating the effectiveness of deep learning with audio spectrograms for bird species classification.
- Aryan N Herur
- Vaibhav Santhosh
