cloneiq/CKRA-MedVQA

Official implementation of CKRA-MedVQA
Dynamic Context-Aware Knowledge Perception · Cross-Modal Contrastive Learning · Medical Visual Question Answering

Overview

CKRA-MedVQA is the official implementation of:

Beyond Static Knowledge: Dynamic Context-Aware Cross-Modal Contrastive Learning for Medical Visual Question Answering

This paper was published in IEEE Transactions on Medical Imaging (IEEE TMI).

Medical Visual Question Answering (Med-VQA) aims to analyze medical images and accurately respond to natural language queries, thereby optimizing clinical workflows and improving diagnostic and therapeutic outcomes. Although medical images contain rich visual information, the corresponding textual queries frequently lack sufficient descriptive content. This information imbalance, together with the differences between modalities, leads to significant semantic bias. Furthermore, although existing approaches integrate external medical knowledge to enhance model performance, they primarily rely on static knowledge that does not adapt dynamically to specific input samples, which introduces redundant information and noise.

To address these challenges, we propose the Contextual Knowledge-Aware Dynamic Perception for Cross-Modal Reasoning and Alignment (CKRA) model. To mitigate knowledge redundancy, CKRA employs a dynamic perception mechanism that leverages semantic cues from the query to select the medical knowledge relevant to the current sample's context. To alleviate cross-modal semantic bias, CKRA narrows the distance between visual and linguistic features through knowledge-image contrastive learning, optimizing knowledge feature representation and directing the model's attention to key image regions. Further, we design a dual-stream guided attention network that facilitates cross-modal interaction and alignment across multiple dimensions. Experimental results show that the proposed CKRA model outperforms state-of-the-art methods on the SLAKE and VQA-RAD datasets. In addition, ablation studies validate the effectiveness of each module, while Grad-CAM maps further demonstrate the feasibility of CKRA for medical visual question answering tasks. The overall architecture of the proposed method is depicted in the figure below.

Overall architecture of CKRA-MedVQA.
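
As a rough illustration of the dynamic perception idea described above, the following sketch ranks candidate knowledge entries by cosine similarity to a query embedding and keeps only the best matches. It is not the CKRA implementation: the function names (`cosine`, `select_knowledge`), the `top_k` and `min_score` parameters, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_knowledge(query_vec, knowledge, top_k=2, min_score=0.0):
    """Return up to top_k (name, score) pairs that best match the query.

    knowledge maps an entry name to its embedding vector. Entries whose
    score is not above min_score are treated as irrelevant and dropped.
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in knowledge.items()]
    kept = [(name, score) for name, score in scored if score > min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Toy embeddings, invented for this example only.
knowledge_bank = {
    "lung": [0.9, 0.1, 0.0],
    "liver": [0.0, 1.0, 0.0],
    "pneumonia": [0.8, 0.0, 0.6],
    "fracture": [-1.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.2]
selected = select_knowledge(query, knowledge_bank, top_k=2)
print([name for name, _ in selected])
```

A static knowledge base would pass all four entries to the model for every sample; filtering per query is what keeps unrelated entries such as "liver" and "fracture" out of this sample's context.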

The source code and weights of the model are available at:

https://github.com/cloneiq/CKRA-MedVQA

Key Features

  • Joint training paradigm for dynamic knowledge-aware Med-VQA: We propose a joint training framework that combines dynamic context-aware knowledge perception with cross-modal contrastive learning, enabling the model to select context-relevant medical knowledge guided by question semantics and visual cues.
  • Question- and image-guided knowledge reasoning: CKRA uses contextual knowledge as shared support while allowing question semantics and image features to guide the model toward key visual regions, improving evidence-aware cross-modal reasoning.
  • Dual-Stream Guided Attention mechanism: We design a dual-stream guided attention module in which questions and images collaboratively guide the inference process, facilitating multi-path reasoning across visual, textual, and knowledge modalities.
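
The cross-modal contrastive component mentioned in the features above can be pictured with a symmetric InfoNCE-style loss over paired image and knowledge features. This README does not state the exact loss used in CKRA, so the sketch below is an assumption based on the commonly used formulation; `contrastive_loss`, the `temperature` default, and the helper names are hypothetical, and the code uses plain Python lists rather than the tensors a real training loop would use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_feats, knowledge_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of each list is the positive pair."""
    images = [l2_normalize(v) for v in image_feats]
    knowledge = [l2_normalize(v) for v in knowledge_feats]
    n = len(images)
    # logits[r][c]: scaled cosine similarity between image r and knowledge c.
    logits = [
        [sum(a * b for a, b in zip(img, knw)) / temperature for knw in knowledge]
        for img in images
    ]
    # Image-to-knowledge direction: each row should peak on its diagonal entry.
    img_to_knw = -sum(log_softmax(logits[r])[r] for r in range(n)) / n
    # Knowledge-to-image direction: same check over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    knw_to_img = -sum(log_softmax(columns[c])[c] for c in range(n)) / n
    return 0.5 * (img_to_knw + knw_to_img)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each image feature is closest to its own knowledge feature and large when the pairs are mismatched, which is the pressure that pulls the two modalities together during joint training.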

Quick Start

Clone the Repository

git clone https://github.com/cloneiq/CKRA-MedVQA.git
cd CKRA-MedVQA

Install Requirements

conda env create -f environment.yaml

or

pip install -r requirements.txt

Prepare Datasets and Pretrained Files

Prepare the datasets, pretrained weights, roberta-base, BioBERT, and checkpoints according to the instructions in Preparation.

Train and Test

Run training and testing scripts as described in Train & Test.

Project Structure

CKRA-MedVQA/
├── checkpoints/
├── data/
│   ├── vqa_medvqa_2019_test.arrow
│   ├── ......
├── download/
│   ├── checkpoints/
│   ├── biobert_v1.1/
│   ├── pretrained/
│   │   ├── m3ae.ckpt
│   ├── roberta-base/
├── m3ae/
├── prepro/
└── run_scripts/
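
Before launching a run, it can help to confirm that the layout above is in place. The helper below is a hypothetical convenience script, not part of the repository; it checks only the paths that appear explicitly in the tree above, and the names `EXPECTED_PATHS` and `missing_paths` are assumptions.

```python
from pathlib import Path

# Paths taken from the project tree above; entries elided there are not checked.
EXPECTED_PATHS = [
    "checkpoints",
    "data",
    "download/checkpoints",
    "download/biobert_v1.1",
    "download/pretrained/m3ae.ckpt",
    "download/roberta-base",
    "m3ae",
    "prepro",
    "run_scripts",
]


def missing_paths(repo_root="."):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling `missing_paths(".")` from the repository root returns an empty list when everything is present, or the list of paths that still need to be downloaded or created.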

Requirements

Run either of the following commands to install the required packages:

conda env create -f environment.yaml # method 1
pip install -r requirements.txt # method 2

Preparation

Dataset

Follow the dataset preparation instructions linked from the repository, using only the SLAKE and VQA-RAD datasets.

Pretrained

Download the M3AE pretrained weights and place them in download/pretrained/.

roberta-base

Download roberta-base and place it in download/roberta-base/.

BioBERT

Download BioBERT and place it in download/biobert_v1.1/.

Checkpoints

Download our trained checkpoints and place them in download/checkpoints/.

Train & Test

# Train
bash run_scripts/ckra_train.sh
# Test
bash run_scripts/ckra_test.sh
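
For unattended runs, the two scripts can be chained so that testing starts only if training succeeds. The wrapper below is a hypothetical sketch, not part of the repository; it assumes the two script paths shown above and that `bash` is on the `PATH`, and the names `build_commands` and `run_pipeline` are illustrative.

```python
import subprocess
from pathlib import Path

TRAIN_SCRIPT = "run_scripts/ckra_train.sh"
TEST_SCRIPT = "run_scripts/ckra_test.sh"


def build_commands(include_train=True):
    """Return the shell commands to run, in order."""
    commands = []
    if include_train:
        commands.append(["bash", TRAIN_SCRIPT])
    commands.append(["bash", TEST_SCRIPT])
    return commands


def run_pipeline(repo_root=".", include_train=True):
    """Run each command from repo_root and stop at the first failure."""
    for command in build_commands(include_train):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=Path(repo_root), check=True)
```

Calling `run_pipeline(include_train=False)` evaluates the downloaded checkpoints without retraining.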

Citations

If this repository is useful for your research, please cite:

@article{Yang2025CKRA-MedVQA,
  title={Beyond Static Knowledge: Dynamic Context-Aware Cross-Modal Contrastive Learning for Medical Visual Question Answering},
  author={Rui Yang and Lijun Liu and Xupeng Feng and Wei Peng and Xiaobing Yang},
  journal={IEEE Transactions on Medical Imaging},
  year={2025},
  publisher={IEEE}
}
@inproceedings{chen2022m3ae,
  title={Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training},
  author={Chen, Zhihong and Du, Yuhao and Hu, Jinpeng and Liu, Yang and Li, Guanbin and Wan, Xiang and Chang, Tsung-Hui},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year={2022},
  organization={Springer}
}

Contact

First Author: Rui Yang, Kunming University of Science and Technology, Kunming, Yunnan, China, email: r2125381663@163.com

Corresponding Author: Lijun Liu, Associate Professor (Ph.D.), Kunming University of Science and Technology, Kunming, Yunnan, China, email: cloneiq@kust.edu.cn

Acknowledgements

We thank M3AE for its open-source implementation and dataset preparation reference, and we also thank the SLAKE and VQA-RAD datasets for supporting reproducible evaluation in medical visual question answering. We further acknowledge BioBERT and roberta-base for providing useful language representation backbones for medical vision-language modeling.

Maintained for dynamic knowledge-aware and cross-modal reasoning research in Medical Visual Question Answering.
