# Samuel-Muzac/MusicSample-to-Mood
## VERY ROUGH DRAFT
MFCC: Mel-frequency cepstral coefficients. MFCCs are a representation of the short-term power spectrum of a sound, designed to approximate the auditory system's response. They're used to capture the quality of the sound (timbral information).
Chroma: Chromagrams map the entire spectrum of an audio signal into 12 bins, each corresponding to a semitone of the musical pitch scale (A-G, including sharps and their enharmonic flats). This captures harmonic content.
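For reference, a minimal sketch of how both feature sets can be extracted with librosa (the repo's actual script may differ; `n_mfcc=13` and the mean/variance aggregation are assumptions):

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=13):
    # Load the clip at librosa's default 22050 Hz sample rate
    y, sr = librosa.load(path)

    # MFCCs: an (n_mfcc, n_frames) matrix capturing timbral information
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Chroma: a (12, n_frames) matrix, one row per pitch class (semitone)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Collapse the time axis with mean and variance so every clip
    # yields a fixed-length vector regardless of its duration
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.var(axis=1),
        chroma.mean(axis=1), chroma.var(axis=1),
    ])
```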
RFC (Random Forest Classifier) Results:
First I utilized only the MFCC data. When I trained the model using MFCC as the only baseline, I got a \_% accuracy.
When I used Chroma and MFCC together, I got a \_% accuracy.
For the arousal/valence thresholds I switched between using the mean and the median of the data, and found the following (a sketch of the labeling step is below):
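This is a minimal sketch of that labeling step, assuming per-song valence/arousal annotations in a pandas DataFrame and the usual quadrant mapping (high arousal + high valence = Happy, high + low = Angry, low + low = Sad, low + high = Calm):

```python
import pandas as pd

def label_moods(df: pd.DataFrame, stat: str = "median") -> pd.Series:
    # Threshold each dimension at its mean or median over the whole dataset
    a_thr = df["arousal"].median() if stat == "median" else df["arousal"].mean()
    v_thr = df["valence"].median() if stat == "median" else df["valence"].mean()

    def quadrant(row):
        high_arousal = row["arousal"] >= a_thr
        high_valence = row["valence"] >= v_thr
        if high_arousal:
            return "Happy" if high_valence else "Angry"
        return "Calm" if high_valence else "Sad"

    return df.apply(quadrant, axis=1)
```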
**MFCC baseline**
Sample size: 51 (Train: 40, Test: 11)

*Threshold determined by median*
Arousal threshold: 0.139
Valence threshold: 0.051

Class distribution:
- Angry: 5 (9.8%)
- Calm: 5 (9.8%)
- Happy: 21 (41.2%)
- Sad: 20 (39.2%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.75    0.60      0.67        5
Sad                0.57    1.00      0.73        4

accuracy                             0.64       11
macro avg          0.33    0.40      0.35       11
weighted avg       0.55    0.64      0.57       11
```

Accuracy: 0.636 = 63.6%
*Threshold determined by mean*
Arousal threshold: 0.114
Valence threshold: 0.055

Class distribution:
- Angry: 7 (13.7%)
- Calm: 3 (5.9%)
- Happy: 22 (43.1%)
- Sad: 19 (37.3%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.60    0.60      0.60        5
Sad                0.60    0.75      0.67        4

accuracy                             0.55       11
macro avg          0.30    0.34      0.32       11
weighted avg       0.49    0.55      0.52       11
```

Accuracy: 0.545 = 54.5%
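For reference, a minimal sketch of the train/evaluate loop that produces reports like the ones above, assuming `X` is the matrix of aggregated feature vectors and `y` the mood labels from the thresholding sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# X: (n_samples, n_features) aggregated feature vectors (extraction sketch above)
# y: mood labels from the arousal/valence thresholding sketch above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```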
The main issue with this is class imbalance, so I altered the extract_features script to randomize which files are extracted.
New code:

```python
import random
from pathlib import Path

# Randomly sample 50 tracks from the DEAM audio folder, seeded for reproducibility
audio_path = "../datasets/deam/MEMD_audio/"
all_files = list(Path(audio_path).glob("*.mp3"))
random.seed(42)
selected_files = random.sample(all_files, min(50, len(all_files)))
```
**MFCC baseline (randomized sample)**
Sample size: 50 (Train: 40, Test: 10)

*Threshold determined by median*
Arousal threshold: 0.076
Valence threshold: 0.042

Class distribution:
- Angry: 6 (12.0%)
- Calm: 6 (12.0%)
- Happy: 19 (38.0%)
- Sad: 19 (38.0%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.57    1.00      0.73        4
Sad                0.67    0.50      0.57        4

accuracy                             0.60       10
macro avg          0.31    0.38      0.32       10
weighted avg       0.50    0.60      0.52       10
```

Accuracy: 0.6 = 60%
*Threshold determined by mean*
Arousal threshold: 0.096
Valence threshold: 0.075

Class distribution:
- Angry: 7 (14.0%)
- Calm: 4 (8.0%)
- Happy: 18 (36.0%)
- Sad: 21 (42.0%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.67    1.00      0.80        4
Sad                0.75    0.75      0.75        4

accuracy                             0.70       10
macro avg          0.35    0.44      0.39       10
weighted avg       0.57    0.70      0.62       10
```

Accuracy: 0.7 = 70%
Now if I add Chroma:

**MFCC + Chroma**
Sample size: 50 (Train: 40, Test: 10)

*Threshold determined by median*
Arousal threshold: 0.076
Valence threshold: 0.042

Class distribution:
- Angry: 6 (12.0%)
- Calm: 6 (12.0%)
- Happy: 19 (38.0%)
- Sad: 19 (38.0%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.67    1.00      0.80        4
Sad                0.75    0.75      0.75        4

accuracy                             0.70       10
macro avg          0.35    0.44      0.39       10
weighted avg       0.57    0.70      0.62       10
```

Accuracy: 0.7 = 70%
*Threshold determined by mean*
Arousal threshold: 0.096
Valence threshold: 0.075

Class distribution:
- Angry: 7 (14.0%)
- Calm: 4 (8.0%)
- Happy: 18 (36.0%)
- Sad: 21 (42.0%)

Results:

```
              precision  recall  f1-score  support

Angry              0.00    0.00      0.00        1
Calm               0.00    0.00      0.00        1
Happy              0.80    1.00      0.89        4
Sad                0.60    0.75      0.67        4

accuracy                             0.70       10
macro avg          0.35    0.44      0.39       10
weighted avg       0.56    0.70      0.62       10
```

Accuracy: 0.7 = 70%
With a dataset of 250 songs (Training: 200, Test: 50):
| Features | Threshold | Accuracy | Notes |
| ------------- | --------- | -------- | ----------------------- |
| MFCC Only | Mean | 60% | Baseline |
| MFCC Only | Median | 56% | Balanced classes better |
| MFCC + Chroma | Mean | 56% | Added Harmonic Content |
| MFCC + Chroma | Median | 56% | Best Performer |
MFCC features performed the best, which suggests that timbral (sound quality) information carries more weight than harmonic content here. The Chroma features didn't improve performance, likely because harmonic content depends on temporal structure (the order of notes/chords in the song samples) that the RFC doesn't capture yet. When extracting the Chroma features and aggregating them with the mean/median, we remove the time component of the sample, losing the audio sample's harmonic progression (a short illustration follows). The model can't retain information about chord progressions and harmonic movement, which are critical for emotional interpretation. Good models to capture what the RFC can't would be CNNs over spectrograms or sequence models like RNNs and LSTMs.
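To make the aggregation point concrete, here is a tiny sketch (the file path is hypothetical) showing how averaging collapses the chroma matrix's time axis:

```python
import librosa

y, sr = librosa.load("sample.mp3")                # any audio clip (hypothetical path)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape: (12, n_frames)
chroma_agg = chroma.mean(axis=1)                  # shape: (12,)
# The same chords played in any order yield (nearly) the same
# 12-dimensional vector: the progression is gone after averaging.
```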
Next Steps:
I want to see what my model is doing wrong beyond the Chroma features, so I generated a confusion matrix (sketched below).
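A minimal sketch of that diagnostic, reusing the test split and predictions from the training sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["Angry", "Calm", "Happy", "Sad"]
cm = confusion_matrix(y_test, y_pred, labels=labels)  # rows: true, cols: predicted
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```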
Per-class precision/recall results:
- **Angry:** The model never mapped Angry correctly and mostly mapped it to Happy. Both Happy and Angry are characterized by high arousal scores, and the incorrect Angry -> Happy mapping occurs because MFCCs capture energy, not emotional intent.
- **Calm:** The model also doesn't map Calm correctly; it is an unclear class for the model, with Calm samples mapped to either Sad or Happy. The temporal context that would classify samples as Calm is lost without proper Chroma feature recognition.
- **Happy:** The model performs best when classifying Happy musical samples, producing a high recall score for this class. MFCCs favor both the high energy and the consistent temporal mood of these samples.
- **Sad:** The model does decently well with this class, though it confuses some samples with Happy.
Overall the model associates high energy with Happy and low energy with Sad, and doesn't learn to discern emotional direction and chord movement. Calm isn't classified correctly and is split between Happy and Sad, which suggests the static MFCC statistics fall short in capturing the subtle emotional cues in music. Feature aggregation loses emotional valence, but captures timbral energy fairly well.
I also ran a feature importance analysis to see which MFCC coefficients influenced the model's decisions the most. Plotting the feature importances shows that the MFCC mean features contributed more to the model than the MFCC variance features, which indicates that energy-related and timbral characteristics were much more influential than fine spectral details for mood classification. The model primarily captures arousal-related features rather than the nuanced emotional distinctions in the music samples.
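A sketch of that analysis, assuming the fitted `clf` from above and feature names ordered to match the assumed extraction layout (13 MFCC means followed by 13 MFCC variances):

```python
import matplotlib.pyplot as plt
import numpy as np

# Feature names matching the assumed extraction layout
names = [f"mfcc{i}_mean" for i in range(13)] + [f"mfcc{i}_var" for i in range(13)]

importances = clf.feature_importances_
order = np.argsort(importances)[::-1]  # most important first

plt.bar([names[i] for i in order], importances[order])
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```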
Next, I wanted to analyze the performance of an RFC trained on MFCC-based features including temporal derivatives. By changing the features, I hope to examine the overall accuracy, class-level behavior, and feature importance to understand which acoustic characteristics contribute the most to mood prediction.
Training:
Since using the mean as the threshold for the mood labels worked best previously, I did the same for training on the new MFCC feature vector (sketched below). The same 250 songs and the same train/test split were used for the RFC.
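A minimal sketch of that expanded extraction, assuming the deltas come from librosa and are aggregated the same way as before:

```python
import numpy as np
import librosa

def extract_mfcc_with_deltas(path, n_mfcc=13):
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first temporal derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second temporal derivative

    # Aggregate each stream over time, exactly as before
    streams = (mfcc, delta, delta2)
    return np.concatenate(
        [m.mean(axis=1) for m in streams] + [m.var(axis=1) for m in streams]
    )
```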
Results:

```
              precision  recall  f1-score  support

Angry              0.40    0.33      0.36        6
Calm               0.00    0.00      0.00        6
Happy              0.59    0.80      0.68       20
Sad                0.72    0.72      0.72       18

accuracy                             0.62       50
macro avg          0.43    0.46      0.44       50
weighted avg       0.55    0.62      0.58       50
```

Accuracy: 0.62 = 62%
This model performed slightly better than the previous best. From the confusion matrix we can see that some songs labeled Angry were correctly identified, though there was still some confusion between Angry and Happy; so this RFC was better at mapping moods outside Happy and Sad than the first four RFC models, but still not very accurately.
The feature importances also show a similar trend between the base MFCC features and their temporal derivatives (delta MFCC and delta2 MFCC, where delta2 means taking the derivative of the derivative). The model still captures arousal-related features rather than the nuanced emotional distinctions in the music samples, but it does a better job of attempting to understand those distinctions.
For me, these models confirmed a lot about the nuanced emotional profiles in music. To classify the mood of a sample well, the chord progressions, along with the rhythms as a whole, are key to understanding what the artist wishes to convey. The model often falls short not just because of the lack of variety in the data, but because of its one-dimensional understanding of the sample: it learns mostly from arousal-related information and much less from the temporal structure and the harmonies.
One thing these models wouldn't be able to handle is the mood of much longer song samples that change mood over time. Many songs and pieces of music have a tonality change, and a sample containing both moods would be difficult to label and could confuse the model's learning.