We used the code from https://github.com/SenticNet/CASCADE as the basis for a small project for a course at Utrecht University. See the commit history for what we adjusted and what the original authors of CASCADE produced.
The README of the original CASCADE repository follows.
Code for the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums (COLING 2018, New Mexico).
In this paper, we propose a ContextuAl SarCasm DEtector (CASCADE), which adopts a hybrid approach of both content and context-driven modeling for sarcasm detection in online social media discussions (Reddit).
- Clone this repo.
- Install Python (2.7 or 3.3-3.6).
- Install your preferred build of TensorFlow 1.4.0 (for CPU or GPU; from PyPI, compiled from source, etc.).
- Install the rest of the requirements:
pip install -r requirements.txt
- Download the FastText pre-trained embeddings and extract them somewhere.
- Download the comments.json dataset file [1] and place it in data/.
- If you want to run the Preprocessing steps (optional), install YAJL 2, download the train-balanced.csv file, save it under data/ and continue with the Preprocessing instructions. Otherwise, just download user_gcca_embeddings.npz, place it in users/user_embeddings/ and go directly to the Running CASCADE section.
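Before moving on, it can help to confirm that the downloaded files sit where the later steps expect them. The following is a minimal sketch and not part of the original repository: the helper name `missing_inputs` and its `preprocessing` flag are hypothetical, and only the relative paths are taken from the setup steps above. It assumes Python 3.4 or later, since it relies on pathlib.

```python
from pathlib import Path


def missing_inputs(repo_root, preprocessing=False):
    """Return the expected input files that are absent under repo_root.

    Hypothetical helper: only the relative paths come from the setup steps.
    """
    root = Path(repo_root)
    required = [Path("data") / "comments.json"]
    if preprocessing:
        # Needed only when running the optional Preprocessing steps.
        required.append(Path("data") / "train-balanced.csv")
    else:
        # Pre-computed user embeddings, used when Preprocessing is skipped.
        required.append(
            Path("users") / "user_embeddings" / "user_gcca_embeddings.npz"
        )
    return [p.as_posix() for p in required if not (root / p).is_file()]


# Report anything missing relative to the current directory.
for path in missing_inputs("."):
    print("missing:", path)
```

Call `missing_inputs(".", preprocessing=True)` from the repository root if you intend to run the Preprocessing steps yourself.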
- User Embeddings: Stylometric features
The file data/comments.json has Reddit users and their corresponding comments. Each user may have multiple comments. Hence, we concatenate all the comments corresponding to the same user with the <END> tag:
cd users
python create_per_user_paragraph.py
The ParagraphVector algorithm is used to generate the stylometric features. First, train the model:
python train_stylometric.py
Generate user_stylometric.csv (user stylometric features) using the trained model:
python generate_stylometric.py
- User Embeddings: Personality features
Pre-train a CNN-based model to detect personality features from text. The code uses two datasets for training. The second dataset [2] can be obtained by requesting it from its original authors.
python process_data.py [path/to/FastText_embedding]
python train_personality.py
Generate user_personality.csv (user personality features) using this model:
python generate_user_personality.py
To use the pre-trained model from our experiments, download the model weights and unzip them inside the folder user/.
- User Embeddings: Multi-view fusion
Merge the user_stylometric.csv and user_personality.csv files into a single merged user_view_vectors.csv file:
python merge_user_views.py
Multi-view fusion of the user views (stylometric and personality) is performed using GCCA (~ CCA for two views). Generate fused user embeddings user_gcca_embeddings.npz using the following command:
python user_wgcca.py --input user_embeddings/user_view_vectors.csv --output user_embeddings/user_gcca_embeddings.npz --k 100 --no_of_views 2
This implementation of GCCA has been adapted from the wgcca repo.
Finally:
cd ..
- Discourse Embeddings
Similar to user stylometric features, create the discourse features for each discussion forum (sub-reddit):
cd discourse
python create_per_discourse_paragraph.py
The ParagraphVector algorithm is used to generate the discourse features. First, train the model:
python train_discourse.py
Generate discourse.csv (discourse features) using the trained model:
python generate_discourse.py
Finally:
cd ..
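The user step above begins by concatenating all comments of one user into a single paragraph with the <END> tag, and the discourse step presumably does the same per sub-reddit. A minimal sketch of that grouping follows. It does not reproduce create_per_user_paragraph.py or create_per_discourse_paragraph.py: the function name, the record fields (`author`, `subreddit`, `text`) and the in-memory sample are assumptions made for illustration, and the real layout of data/comments.json may differ.

```python
from collections import OrderedDict

END_TAG = "<END>"


def build_paragraphs(comments, key):
    """Join all comment texts that share the same `key` value with END_TAG.

    `comments` is an iterable of dicts; the field names are assumptions.
    """
    grouped = OrderedDict()
    for comment in comments:
        grouped.setdefault(comment[key], []).append(comment["text"].strip())
    separator = " " + END_TAG + " "
    return OrderedDict(
        (k, separator.join(texts)) for k, texts in grouped.items()
    )


# Made-up sample records, not taken from the dataset.
sample = [
    {"author": "alice", "subreddit": "news", "text": "Great, another Monday."},
    {"author": "bob", "subreddit": "news", "text": "Totally unexpected."},
    {"author": "alice", "subreddit": "funny", "text": "What a surprise."},
]

per_user = build_paragraphs(sample, "author")
per_forum = build_paragraphs(sample, "subreddit")
print(per_user["alice"])  # Great, another Monday. <END> What a surprise.
```

Each resulting paragraph is the kind of document a ParagraphVector model would then be trained on, one per user or one per sub-reddit.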
Hybrid CNN combining user-embeddings and discourse-features with textual modeling.
cd src
python process_data.py [path/to/FastText_embedding]
python train_cascade.py
The CNN codebase has been adapted from the repo cnn-text-classification-tf from Denny Britz.
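The two commands above run from inside src/, in the order given. If you prefer to launch them from the repository root, a small wrapper along the following lines may help. This is a sketch only: `cascade_commands` and `run_cascade` are hypothetical names that do not exist in the repository, and the script names and working directory are simply those listed above.

```python
import subprocess
import sys


def cascade_commands(fasttext_path):
    """Return (working_directory, argv) pairs for the two steps, in order."""
    # Reuse the interpreter that runs this script, falling back to "python".
    python = sys.executable or "python"
    return [
        ("src", [python, "process_data.py", fasttext_path]),
        ("src", [python, "train_cascade.py"]),
    ]


def run_cascade(fasttext_path):
    """Run each step in turn; stop at the first failing command."""
    for cwd, argv in cascade_commands(fasttext_path):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(argv, cwd=cwd)
```

Calling `run_cascade("path/to/FastText_embedding")` from the repository root would then execute both steps. Separating the command list from the runner makes it easy to inspect what will be executed before starting a long training run.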
If you use this code in your work then please cite the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums with the following:
@InProceedings{C18-1156,
author = "Hazarika, Devamanyu
and Poria, Soujanya
and Gorantla, Sruthi
and Cambria, Erik
and Zimmermann, Roger
and Mihalcea, Rada",
title = "CASCADE: Contextual Sarcasm Detection in Online Discussion Forums",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1837--1848",
location = "Santa Fe, New Mexico, USA",
url = "http://aclweb.org/anthology/C18-1156"
}
[1]. Khodak, Mikhail, Nikunj Saunshi, and Kiran Vodrahalli. "A large self-annotated corpus for sarcasm." Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.
[2]. Celli, Fabio, et al. "Workshop on computational personality recognition (shared task)." Proceedings of the Workshop on Computational Personality Recognition. 2013.

