
ConvEmoSentNet: A Parameter-Efficient Framework for Multimodal Emotion and Sentiment Analysis in Social Media Conversations

A deep learning framework designed for emotion and sentiment recognition using text, audio, and video modalities. This project leverages the MELD (Multimodal EmotionLines Dataset) to train a robust and flexible model that reflects human communication more accurately than unimodal models.


📦 Dataset: MELD

Multimodal EmotionLines Dataset (MELD) is a large-scale, multi-party conversation dataset derived from the TV series Friends. It provides aligned and synchronized text, audio, and video data, annotated with both emotion and sentiment labels.

  • Modalities:
    • Text: Dialogues (utterances)
    • Audio: Speaker voice tone
    • Video: Speaker facial expressions and posture
  • Emotion Labels:
    • Anger
    • Disgust
    • Fear
    • Joy
    • Neutral
    • Sadness
    • Surprise
  • Sentiment Labels:
    • Positive
    • Negative
    • Neutral

🔗 MELD Dataset GitHub
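The MELD release ships utterance-level annotations as CSV files. Below is a minimal sketch of reading the labels with pandas, assuming the column layout used in the MELD repository (e.g. train_sent_emo.csv with Utterance, Emotion, and Sentiment columns) and an illustrative label-to-id mapping:

```python
# Minimal sketch: load MELD utterance-level annotations with pandas.
# The file path and column names are assumptions based on the MELD repository.
import pandas as pd

df = pd.read_csv("MELD/train_sent_emo.csv")

# Illustrative mapping from string labels to integer class ids for the two tasks.
emotions = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
sentiments = ["negative", "neutral", "positive"]
emo2id = {e: i for i, e in enumerate(emotions)}
sen2id = {s: i for i, s in enumerate(sentiments)}

df["emotion_id"] = df["Emotion"].str.lower().map(emo2id)
df["sentiment_id"] = df["Sentiment"].str.lower().map(sen2id)

print(df[["Utterance", "emotion_id", "sentiment_id"]].head())
```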


🧠 Model Architecture

The model is modular and allows training on individual or fused modalities: Text, Audio, and Video. It is designed to perform well when one or more modalities are missing or unavailable.

🔹 Individual Modality Encoders

| Modality | Model Used        | Preprocessing                   |
|----------|-------------------|---------------------------------|
| Text     | BERT              | Tokenization, Padding           |
| Audio    | CNN               | MFCC / Log-Mel Spectrogram      |
| Video    | ResNet18 / 3D-CNN | Face Extraction, Frame Sampling |
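A rough sketch of the three encoders in PyTorch, assuming torchvision and Hugging Face transformers; layer sizes, pooling choices, and the shared 256-dimensional output are illustrative rather than the repository's exact configuration:

```python
# Sketch of the per-modality encoders; each maps its input to a 256-d vector.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel

class TextEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] embedding as the utterance-level text feature.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])

class AudioEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        # 2D CNN over an MFCC / log-mel spectrogram "image" (1 x freq x time).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, spec):
        return self.proj(self.conv(spec).flatten(1))

class VideoEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        # ResNet18 applied to each sampled face frame, then averaged over time.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Linear(512, out_dim)

    def forward(self, frames):               # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        return self.proj(feats.mean(dim=1))
```

Each encoder emits a vector of the same size so the fusion step can combine them directly.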

🔹 Multimodal Fusion Strategy

  • Concatenation of latent vectors from each modality
  • Optional attention mechanism to weight the more informative modalities
  • Fully connected layers leading to a softmax classification head (a code sketch of this fusion head follows the diagram below)
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β”‚   Text     β”‚    β”‚   Audio    β”‚     β”‚   Video    β”‚
             β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    └── ┬─────── β”˜      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚                β”‚                   β”‚
                 BERT             CNN            3D CNN / ResNet
                  β”‚                β”‚                   β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚ Fusion β”‚
                               β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜
                            Fully Connected
                               Softmax
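A minimal sketch of this fusion head in PyTorch, following the bullets above; the embedding size, the scalar-score attention, and the hidden-layer width are assumptions rather than the repository's exact settings:

```python
# Sketch of late fusion: concatenate per-modality embeddings, optionally
# re-weight them with a small attention layer, then classify.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=7, dropout=0.4, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        # One scalar score per modality, turned into softmax weights.
        self.attn = nn.Linear(dim, 1)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),   # raw logits; softmax applied by the loss / at inference
        )

    def forward(self, text_feat, audio_feat, video_feat):
        feats = torch.stack([text_feat, audio_feat, video_feat], dim=1)  # (B, 3, dim)
        if self.use_attention:
            weights = torch.softmax(self.attn(feats), dim=1)             # (B, 3, 1)
            feats = feats * weights
        fused = feats.flatten(1)                                         # (B, 3*dim)
        return self.classifier(fused)
```

One plausible way to handle a missing modality with this design is to zero out its feature vector so the attention weights shift toward the remaining modalities, though the exact strategy used here is not specified.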

🧪 Training Details

  • Optimizer: Adam
  • Scheduler: ReduceLROnPlateau
  • Loss Function:
    • CrossEntropyLoss for multiclass emotion classification
    • Label Smoothing (0.1) to prevent overconfidence
  • Regularization:
    • Dropout in FC layers (0.3–0.5)
    • Early Stopping based on validation loss
  • Batch Size: 16–32
  • Epochs: 15–25
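A sketch of the training loop these bullets describe, assuming PyTorch; `model` is the fused multimodal network from the sketches above, the data loaders are assumed to yield `(inputs, labels)` batches, and the specific values (learning rate, scheduler factor, early-stopping patience) are illustrative picks from the listed ranges:

```python
# Sketch of the training setup: Adam, ReduceLROnPlateau, cross-entropy with
# label smoothing 0.1, and early stopping on validation loss.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-4, patience=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(*inputs), labels)
            loss.backward()
            optimizer.step()

        # Validation pass drives both the LR scheduler and early stopping.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(*inputs), labels).item()
                           for inputs, labels in val_loader) / len(val_loader)
        scheduler.step(val_loss)

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_val
```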

🧡 Hyperparameter Tuning

  • Performed manually as a grid search over:
    • Learning rate (1e-3 to 1e-5)
    • Hidden layer sizes
    • Dropout rates
    • Fusion strategies (early vs. late fusion)
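A minimal sketch of that manual grid search; `train_and_evaluate` is a hypothetical helper standing in for a full training run that returns validation accuracy:

```python
# Enumerate all hyperparameter combinations and keep the best-scoring one.
from itertools import product

grid = {
    "lr": [1e-3, 1e-4, 1e-5],
    "hidden": [128, 256, 512],
    "dropout": [0.3, 0.4, 0.5],
    "fusion": ["early", "late"],
}

best_cfg, best_acc = None, 0.0
for lr, hidden, dropout, fusion in product(*grid.values()):
    cfg = {"lr": lr, "hidden": hidden, "dropout": dropout, "fusion": fusion}
    acc = train_and_evaluate(cfg)   # hypothetical helper: trains once, returns val accuracy
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc

print("best configuration:", best_cfg, "val accuracy:", best_acc)
```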

📈 Performance Snapshot

| Configuration | Emotion Precision | Emotion Accuracy | Sentiment Precision | Sentiment Accuracy |
|---------------|-------------------|------------------|---------------------|--------------------|
| Fused Model   | 53.50%            | 54.90%           | 64.40%              | 64.60%             |

πŸ§‘β€πŸ’» Author

Akshay Sinha, Gauri Saksena, Yash Chandel
Deep Learning | Multimodal AI | Emotion Recognition
