
📝 [ACL 2025] HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States

[📄 arXiv](https://arxiv.org/abs/2502.14744) [🤗 Hugging Face Daily Paper]


🔔 News

[2025.05.16] HiddenDetect has been accepted to ACL 2025 (Main)! 🎉


🚀 Overview

Large vision-language models (LVLMs) are more vulnerable to safety risks, such as jailbreak attacks, than language-only models. This work explores whether LVLMs encode safety-relevant signals in their internal activations during inference. We find that unsafe prompts elicit distinct activation patterns, which can be used to detect and mitigate adversarial inputs without extensive fine-tuning.

We propose HiddenDetect, a tuning-free framework leveraging internal model activations to enhance safety. Experimental results demonstrate that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs, providing an efficient and scalable solution for robustness against multimodal threats.
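
As a quick illustration of the idea, the sketch below (a minimal, hedged example, not the released implementation) runs one forward pass with hidden states enabled, applies a "logit lens" (final norm + LM head) to each layer's last-token state, and measures the probability mass assigned to refusal-style tokens. The model id, refusal-token list, and module paths assume a LLaMA-style language backbone such as LLaVA's; adjust them for your checkpoint.

```python
# Minimal sketch of layer-wise refusal probing (illustrative, not the
# authors' code). Assumes a LLaMA-style causal LM: `model.model.norm`
# is the final RMSNorm and `model.lm_head` is the unembedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # assumption: LLaVA-1.5's language backbone
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed refusal vocabulary; the paper constructs a richer refusal reference.
refusal_ids = [tok.encode(w, add_special_tokens=False)[0]
               for w in ["Sorry", "cannot", "unable"]]

@torch.no_grad()
def layerwise_refusal_scores(prompt: str) -> list[float]:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    scores = []
    for h in out.hidden_states[1:]:  # one tensor per transformer layer
        # "Logit lens": decode the last-token hidden state at this layer.
        logits = model.lm_head(model.model.norm(h[:, -1]))
        probs = logits.float().softmax(-1)
        scores.append(probs[0, refusal_ids].sum().item())  # refusal mass
    return scores
```

Unsafe prompts tend to place noticeably more mass on refusal tokens at intermediate layers; this kind of layer-wise signal is what HiddenDetect aggregates into a single detection score.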


📑 Contents

- ⚙️ Install
- 🏗️ Base Model
- 📂 Dataset
- 🎬 Demo
- 📜 Citation

⚙️ Install

1. Create a virtual environment for running LLaVA

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support
pip install -e .

2. Install HiddenDetect

git clone https://github.com/leigest519/HiddenDetect.git
cd HiddenDetect
pip install -r requirements.txt

🏗️ Base Model

Download and save the following models to the ./model directory:
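
As a hedged example, checkpoints can be fetched into ./model with huggingface_hub; the repo id below (the LLaVA-1.5 checkpoint matching the install step above) is illustrative, and any other required models would be downloaded the same way.

```python
# Hedged example: fetch a checkpoint into ./model with huggingface_hub.
# The repo id is an illustrative assumption matching the LLaVA install above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/llava-v1.5-7b",  # example model, substitute as needed
    local_dir="./model/llava-v1.5-7b",
)
```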


📂 Dataset

We evaluate the performance of HiddenDetect using various popular benchmark datasets, including:

| Dataset | Modality | Source | Usage Details |
| --- | --- | --- | --- |
| XSTest | Pure text | XSTest | Contains 250 safe prompts and 200 unsafe prompts. |
| FigTxt | Pure text | SafeBench CSV | Seven safety scenarios, comprising 350 shots used as unsafe samples under ./data/FigStep/safebench.csv, paired with 300 handcrafted safe samples in ./data/FigStep/benign_questions.csv. |
| FigImg | Bimodal | SafeBench Images | All ten safety scenarios used as visual queries, paired with the original FigStep text prompt, as unsafe samples under ./data/FigStep/FigImg. |
| MM-SafetyBench | Bimodal | MM-SafetyBench | Eight safety scenarios used as unsafe samples under ./data/MM-SafetyBench. |
| JailbreakV-28K | Bimodal | JailbreakV-28K | 300 randomly selected shots from five safety scenarios of the llm_transfer_attack subset, stored under ./data/JailbreakV-28K. |
| VAE | Bimodal | Visual Adversarial Examples | Four adversarial images paired with each prompt in the harmful corpus from the original repo, forming an unsafe dataset stored in ./data/VAE. |
| MM-Vet | Bimodal | MM-Vet | The entire dataset used as safe samples under ./data/MM-Vet; serves as the safe counterpart for all bimodal unsafe datasets. |

Since the bimodal datasets differ in size, we randomly subsample the safe and unsafe sets to form relatively balanced evaluation splits, which makes the performance evaluation more robust.

Further details can be found in ./code/load_datasets.py.
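
The balancing step can be pictured with the sketch below; this is illustrative, with the function name and fixed seed chosen here as assumptions, while the actual logic lives in ./code/load_datasets.py.

```python
# Illustrative balanced-sampling sketch (the real logic is in
# ./code/load_datasets.py); the name and fixed seed are assumptions.
import random

def balance(safe: list, unsafe: list, seed: int = 0) -> tuple[list, list]:
    """Randomly subsample both splits to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(safe), len(unsafe))
    return rng.sample(safe, n), rng.sample(unsafe, n)
```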


🎬 Demo

To evaluate HiddenDetect across all the datasets, execute:

python ./code/test.py

📜 Citation

If you find HiddenDetect or our paper helpful, please consider citing:

@misc{jiang2025hiddendetectdetectingjailbreakattacks,
  title={HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States},
  author={Yilei Jiang and Xinyan Gao and Tianshuo Peng and Yingshui Tan and Xiaoyong Zhu and Bo Zheng and Xiangyu Yue},
  year={2025},
  eprint={2502.14744},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.14744}
}

If you like this project, give it a star!
💬 Feel free to open issues 💡
