DualMind: A Novel Framework for Automated Image Contextualization and Meta-context Retrieval

This repository is an extension of the 5Pillars framework introduced by Tonglet et al. (2024). Our goal is to produce more explainable and reliable 5Pillars (5Pils) answers by incorporating essential contextual elements—such as Source, Date, Location, and Motivation—and enriching them with additional relevant information. Moreover, the framework introduces a confidence score to indicate the trustworthiness of each generated answer, supported by an explanation of any detected manipulations for improved transparency and interpretability.

Contact person: Zhenghua Bao (zheng.hua.b@gmail.com)

Abstract

The rapid proliferation of visual misinformation highlights the critical need for effective automated tools to assist human fact-checkers. Recent advances, such as the “5 Pillars” framework, have introduced structured approaches for contextualizing images by identifying Provenance, Source, Date, Location, and Motivation. However, current automated methods face notable limitations: manipulation detection approaches largely neglect semantic discrepancies, evidence retrieval often misses critical supporting materials, and generated explanations frequently rely on irrelevant or biased information. To address these gaps, we propose a novel pipeline—DualMind, extending the original 5 Pillars framework with significant improvements:

A semantically-informed forgery detection method integrating SAM-based semantic segmentation, CLIP-ViT image features, and LLM-based reasoning;
Expanded evidence retrieval leveraging multiple reverse image search engines with evidence re-ranking using media bias and semantic relevance scores;
Contextual-aware answer generation;
A fallback pipeline employing keyword-based textual search when image-based retrieval fails.

Experiments demonstrate that DualMind significantly improves evidence retrieval coverage, reduces misinformation in generated answers, and enhances the overall reliability of image contextualization. By systematically evaluating contextual reliability and forgery explanations, DualMind offers fact-checkers more robust and reliable references for detecting and debunking visual misinformation.

Demo output

Create environment

$ conda create --name DualMind python=3.10
$ conda activate DualMind
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg

APIs

Our baseline model relies on evidence retrieved with reverse image search using the Google Vision API and Tineye. All URLs for the text evidence that we used in our experiments are provided in this repo. However, should you want to collect your own evidence, you will need to create a Google Cloud account and create an API key, or use a different reverse image search service.
To gather evidence using a keyword search, you'll need to generate a Serper API Key. This key enables you to perform Google web searches and retrieve relevant textual and visual evidence based on the generated description of the provided image.
If you want to use GPT4(-Vision) for answer generation, you will need an an API key from OpenAI.

5Pils dataset

We used 5pillars dataset provided by Jonathan Tonglet.

You can collect the images using the following script:

$ python scripts/build_dataset_from_url.py

Collecting the evidence

Collect the text evidence based on their URLs:

$ python scripts/collect_RIS_evidence.py --collect_google 0 --evidence_urls dataset/retrieval_results/evidence_urls.json

Instead of using the provided evidence set, you can also collect your own by setting collect_google to 1. This will require to provide a Google Vision API key.

Collect the ebidence based on their keywords (Serper API required):

$ python scripts/collect_keyword_evidence.py --collect_keyword 1

Once the evidence has been collected, make sure to download the extracted images so they can be used later for similarity checks:

$ python scripts/download_keyowrd_search_images.py

If you want to validate downloaded images:

$ python scripts/validate_images.py

This will display a list of images that failed to download correctly.

Compute embeddings

Compute embeddings for evidence ranking based on images or context and similarity checks:

$ python scripts/get_embeddings.py

Pipeline

You can generate answers for a specific pillar. The plus pillar includes a confidence score and forgery explanation. For example, to generate answers for the Date pillar using multimodal LLaVA:

a) Original baseline:

$ python scripts/get_5pillars_answers.py --results_file output/results_date.json --task date --modality multimodal --model llava

a) Core Pipeline:

$ python scripts/generate_core_pipeline_answers.py --results_file output/core_results_date.json --task date --modality multimodal --model llava

a) Fallback Pipeline:

$ python scripts/generate_fallback_answers.py --results_file output/fallback_results_date.json --task date --modality multimodal --model llava

Answer Selection

After generating both core and fallback answers, you can choose the final answers by specifying the task, e.g. for the Date pillar, using the following command:

$ python scripts/answer_selection.py --task date

This will produce a JSON file containing the final output of the pipeline.

Evaluation

Assess the model's performance using the final results. For instance, when evaluating the Date pillar:

$ python scripts/evaluate_answer_generation.py --results_file output/final_results_date.json --task date

Citation

If you use this work or the underlying dataset, please cite the original 5Pillars paper:

@inproceedings{tonglet-etal-2024-image,
    title = "{``}Image, Tell me your story!{''} Predicting the original meta-context of visual misinformation",
    author = "Tonglet, Jonathan  and
      Moens, Marie-Francine  and
      Gurevych, Iryna",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.448",
    pages = "7845--7864",
}

Acknowledgement

This project was developed as part of the Project Lab: Multimodal Artificial Intelligence 2024 at TU Darmstadt. It builds heavily upon the foundational work "Image, Tell me your story!" Predicting the original meta-context of visual misinformation.

We sincerely thank the original authors for sharing their dataset and answering our questions. Special thanks to our mentor Mark for his insightful guidance and support throughout the course.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.github/workflows		.github/workflows
baseline		baseline
dataset		dataset
dataset_collection		dataset_collection
evaluation		evaluation
figs		figs
forgery_detection		forgery_detection
scripts		scripts
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DualMind: A Novel Framework for Automated Image Contextualization and Meta-context Retrieval

Abstract

Demo output

Create environment

APIs

5Pils dataset

Collecting the evidence

Compute embeddings

Pipeline

Answer Selection

Evaluation

Citation

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DualMind: A Novel Framework for Automated Image Contextualization and Meta-context Retrieval

Abstract

Demo output

Create environment

APIs

5Pils dataset

Collecting the evidence

Compute embeddings

Pipeline

Answer Selection

Evaluation

Citation

Acknowledgement

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages