Refer to this Google Drive for data, models, embeddings, and results.
- Purpose: Detect hateful memes by combining CLIP image/text embeddings with a lightweight cross-modal attention classifier.
- Pipeline: generate LMM knowledge → build enriched CLIP embeddings → train classifier → run inference.
- Generate LMM knowledge: `python generate_knowledge.py`
  - Writes `knowledge/lmm_knowledge_{train,val,test}.json`, mapping meme id → `descriptions` (10) and `emotions` (10).
- Generate embeddings: `python generate_embeddings.py`
  - Uses dataset + knowledge to emit `hateful_memes_clip_embeddings_{train,val,test}.npz` containing `image_embeddings`, `text_embeddings`, `desc_embeddings`, `emotion_embeddings`, `text_concat_embeddings`, `meme_concat_embeddings`, `labels`, `ids`, `valid_indices`.
- Train: `python training.py`
  - Consumes `image_embeddings` + `text_concat_embeddings` for train/val and saves `best_model.pth` (includes `image_dim` and `text_dim` in `config`).
- Predict: `python predictions.py`
  - Loads `best_model.pth` and test `image_embeddings` + `text_concat_embeddings`, writes `predictions.npz`, and prints accuracy/AUC/confusion matrix.
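
Before generating embeddings, it can help to confirm that every meme in a knowledge file has the expected 10 descriptions and 10 emotions. The following is a minimal sketch, not part of the repository. It assumes each JSON file is an object keyed by meme id whose values are objects with `descriptions` and `emotions` lists; that layout is inferred from the description above and may differ from the real files. The helper names `check_knowledge` and `check_knowledge_file` are hypothetical.

```python
import json
from pathlib import Path

# The README states 10 descriptions and 10 emotions per meme.
EXPECTED_COUNT = 10


def check_knowledge(knowledge, expected=EXPECTED_COUNT):
    """Return the meme ids whose entry lacks `expected` descriptions or emotions.

    Assumes (hypothetically) a layout of
    {meme_id: {"descriptions": [...], "emotions": [...]}}.
    """
    bad_ids = []
    for meme_id, entry in knowledge.items():
        descriptions = entry.get("descriptions", [])
        emotions = entry.get("emotions", [])
        if len(descriptions) != expected or len(emotions) != expected:
            bad_ids.append(meme_id)
    return bad_ids


def check_knowledge_file(path):
    """Load one knowledge JSON file and validate it."""
    with Path(path).open(encoding="utf-8") as handle:
        return check_knowledge(json.load(handle))


if __name__ == "__main__":
    for split in ("train", "val", "test"):
        path = Path("knowledge") / f"lmm_knowledge_{split}.json"
        if path.exists():
            print(split, "incomplete ids:", check_knowledge_file(path))
        else:
            print(split, "file not found:", path)
```

Run it from the repository root after `python generate_knowledge.py`; an empty list for a split means every entry has the expected counts.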
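
For readers unfamiliar with the three metrics that the prediction step prints, here is a small pure-Python illustration of how accuracy, ROC AUC, and a binary confusion matrix can be computed from labels and predicted scores. It is a sketch for intuition only: `predictions.py` may compute these differently (for example with a library), and the function names, the 0.5 threshold, and the toy data are all assumptions made for this example.

```python
def confusion_matrix(labels, preds):
    """Return [[TN, FP], [FN, TP]] for binary (0/1) labels and predictions."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(labels, preds):
        matrix[y][p] += 1
    return matrix


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine
    for illustration but slow on large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for this example): 1 = hateful, 0 = not hateful.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print("accuracy:", accuracy(labels, preds))
print("auc:", roc_auc(labels, scores))
print("confusion matrix:", confusion_matrix(labels, preds))
```

Note that AUC is computed from the raw scores, so it does not depend on the threshold, while accuracy and the confusion matrix do.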