ConvEmoSentNet: A Parameter-Efficient Framework for Multimodal Emotion and Sentiment Analysis in Social Media Conversations
A deep learning framework for emotion and sentiment recognition from text, audio, and video. The project trains on the Multimodal EmotionLines Dataset (MELD) to build a robust, flexible model that captures the multimodal nature of human communication more faithfully than unimodal approaches.
The Multimodal EmotionLines Dataset (MELD) is a large-scale, multi-party conversation dataset derived from the TV series Friends. It provides aligned and synchronized text, audio, and video, annotated with both emotion and sentiment labels.
Modalities:
- Text: dialogue utterances
- Audio: speaker voice tone
- Video: speaker facial expressions and posture
Emotion Labels:
- Anger
- Disgust
- Fear
- Joy
- Neutral
- Sadness
- Surprise
Sentiment Labels:
- Positive
- Negative
- Neutral
🔗 MELD Dataset on GitHub: https://github.com/declare-lab/MELD
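MELD distributes its annotations as per-split CSV files alongside the raw clips. A minimal sketch of loading the labels with pandas, assuming the `Emotion` and `Sentiment` column names and the `train_sent_emo.csv` file name from the public release:

```python
import pandas as pd

# One row per utterance; columns assumed to follow the public MELD release.
df = pd.read_csv("train_sent_emo.csv")

emotions = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
sentiments = ["negative", "neutral", "positive"]

# Map string labels to integer class ids (raises if a label is unexpected).
df["emotion_id"] = df["Emotion"].str.lower().map(emotions.index)
df["sentiment_id"] = df["Sentiment"].str.lower().map(sentiments.index)
```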
The model is modular: it can be trained on any single modality (Text, Audio, or Video) or on their fusion, and it is designed to degrade gracefully when one or more modalities are missing, as sketched below.
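The README does not specify how absent modalities are handled; one common approach is zero-filling the missing embedding before fusion. A minimal sketch under that assumption, with hypothetical names and dimensions:

```python
import torch
import torch.nn as nn

class ZeroFillFusion(nn.Module):
    """Hypothetical sketch: tolerate missing modalities by zero-filling them."""

    def __init__(self, dim: int = 256, num_modalities: int = 3):
        super().__init__()
        self.proj = nn.Linear(dim * num_modalities, dim)

    def forward(self, embeddings):
        # Use any present embedding as a shape template for the zeros.
        template = next(e for e in embeddings if e is not None)
        filled = [e if e is not None else torch.zeros_like(template)
                  for e in embeddings]
        return self.proj(torch.cat(filled, dim=-1))

# Example: audio is unavailable for this batch of 8 utterances.
fusion = ZeroFillFusion(dim=256)
fused = fusion([torch.randn(8, 256), None, torch.randn(8, 256)])  # (8, 256)
```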
| Modality | Model Used | Preprocessing |
|---|---|---|
| Text | BERT | Tokenization, Padding |
| Audio | CNN | MFCC / Log-Mel Spectrogram |
| Video | ResNet18 / 3D-CNN | Face Extraction, Frame Sampling |
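The preprocessing in the table can be sketched with standard tooling (HuggingFace `transformers` for text, `torchaudio` for MFCCs); file names and parameter values below are illustrative rather than the project's actual configuration, and face extraction is omitted:

```python
import torch
import torchaudio
from transformers import BertTokenizer

# Text: tokenize and pad one utterance for BERT.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_inputs = tokenizer("I can't believe you did that!",
                        padding="max_length", truncation=True,
                        max_length=64, return_tensors="pt")

# Audio: load a clip (hypothetical file) and compute 40-dim MFCCs for the CNN.
waveform, sr = torchaudio.load("utterance.wav")
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=40)(waveform)

# Video: uniformly sample 16 frames from a decoded (T, C, H, W) clip tensor.
clip = torch.randn(120, 3, 224, 224)   # stand-in for decoded video frames
idx = torch.linspace(0, clip.shape[0] - 1, steps=16).long()
frames = clip[idx]                     # (16, 3, 224, 224)
```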
- Concatenation of latent vectors from each modality
- Optional attention mechanism to weight the more informative modalities
- Final fully connected layers leading to a classification head (softmax); see the sketch after the diagram below
```
┌────────────┐    ┌────────────┐    ┌────────────┐
│    Text    │    │   Audio    │    │   Video    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
    BERT               CNN        3D CNN / ResNet
      │                 │                 │
      └─────────────────┼─────────────────┘
                        │
                  ┌─────┴─────┐
                  │  Fusion   │
                  └─────┬─────┘
                        │
                Fully Connected
                        │
                    Softmax
```
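A minimal sketch of this fusion head in PyTorch, assuming 256-dimensional embeddings from each encoder; the attention scorer and layer sizes are illustrative, and the final softmax is left to the loss function (PyTorch's `CrossEntropyLoss` applies it internally):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weight each modality embedding, concatenate, then classify."""

    def __init__(self, dim: int = 256, num_modalities: int = 3,
                 num_classes: int = 7, dropout: float = 0.4):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # shared scorer across modalities
        self.classifier = nn.Sequential(
            nn.Linear(dim * num_modalities, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, embs):
        stacked = torch.stack(embs, dim=1)                    # (B, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)   # (B, M, 1)
        weighted = (stacked * weights).flatten(start_dim=1)   # (B, M*dim)
        return self.classifier(weighted)                      # raw logits

logits = AttentionFusion()([torch.randn(8, 256) for _ in range(3)])  # (8, 7)
```

Plain concatenation (without the attention weights) is the same flatten-and-classify path with the scorer removed.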
- Optimizer: Adam
- Scheduler: ReduceLROnPlateau
- Loss Function:
  - CrossEntropyLoss for multi-class emotion classification
  - Label smoothing (0.1) to prevent overconfidence
- Regularization:
  - Dropout in FC layers (0.3–0.5)
  - Early stopping based on validation loss
- Batch Size: 16–32
- Epochs: 15–25
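A hedged sketch wiring these choices together in PyTorch; the stand-in model and the `validate` helper are hypothetical placeholders for the real training and validation passes:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 7)            # stand-in for the fused network
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # used in the elided train pass
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

def validate(model: nn.Module) -> float:
    """Hypothetical stand-in for a real validation pass."""
    return torch.rand(1).item()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(25):                  # upper end of the 15-25 range
    # ... training pass over the train loader goes here ...
    val_loss = validate(model)
    scheduler.step(val_loss)             # cut LR when validation plateaus
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # early stopping on validation loss
            break
```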
Hyperparameter tuning was performed manually (grid search) over the following, as sketched after this list:
- Learning rate (1e-3 to 1e-5)
- Hidden layer sizes
- Dropout rates
- Fusion strategies (early vs late fusion)
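A minimal sketch of such a manual grid search; `train_and_eval` is a hypothetical stand-in for a full training run that returns the validation loss:

```python
import itertools
import random

def train_and_eval(**cfg) -> float:
    """Hypothetical stand-in: train with `cfg` and return validation loss."""
    return random.random()

grid = {
    "lr": [1e-3, 1e-4, 1e-5],
    "hidden": [128, 256],
    "dropout": [0.3, 0.4, 0.5],
    "fusion": ["early", "late"],
}

best_loss, best_cfg = float("inf"), None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    loss = train_and_eval(**cfg)
    if loss < best_loss:
        best_loss, best_cfg = loss, cfg
print("best config:", best_cfg)
```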
| Configuration | Emotion Precision | Emotion Accuracy | Sentiment Precision | Sentiment Accuracy |
|---|---|---|---|---|
| Fused Model | 53.50% | 54.90% | 64.40% | 64.60% |
Akshay Sinha, Gauri Saksena, Yash Chandel
Deep Learning | Multimodal AI | Emotion Recognition