📝 [ACL 2025] HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
[📄 arXiv] • [🤗 Hugging Face Daily Paper]
[2025.05.16] HiddenDetect has been accepted to ACL 2025 (Main)! 🎉
Large vision-language models (LVLMs) are more vulnerable to safety risks, such as jailbreak attacks, than their language-only counterparts. This work explores whether LVLMs encode safety-relevant signals within their internal activations during inference. Our findings show distinct activation patterns for unsafe prompts, which can be used to detect and mitigate adversarial inputs without extensive fine-tuning.
We propose HiddenDetect, a tuning-free framework leveraging internal model activations to enhance safety. Experimental results demonstrate that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs, providing an efficient and scalable solution for robustness against multimodal threats.
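The core idea can be illustrated with a small probing sketch: read the hidden state at every layer, project it into vocabulary space with the model's own LM head (a logit-lens-style probe), and measure how much probability mass lands on refusal-related tokens. The snippet below is a minimal sketch under assumptions, not the released implementation: the model name, the refusal-token list, and the layer aggregation are all placeholders.

```python
# Minimal layer-wise refusal-signal probe (sketch only; HiddenDetect's actual
# scoring differs). Assumptions: a LLaMA-family causal LM, a hand-picked
# refusal vocabulary, and a simple mean over the last few layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # placeholder model, not necessarily the one evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # device_map needs `accelerate`
)
model.eval()

# Assumed refusal-related tokens; the paper constructs its own refusal token set.
refusal_ids = [tokenizer(w, add_special_tokens=False).input_ids[0]
               for w in ["Sorry", "sorry", "cannot", "unable"]]

@torch.no_grad()
def refusal_scores(prompt: str) -> torch.Tensor:
    """One score per layer: probability mass on refusal tokens after projecting
    the last-token hidden state of that layer through the LM head."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    scores = []
    for h in hidden_states:                     # embeddings + every decoder layer
        last = model.model.norm(h[:, -1, :])    # final RMSNorm before the LM head (LLaMA-family layout)
        probs = model.lm_head(last).float().softmax(dim=-1)
        scores.append(probs[0, refusal_ids].sum())
    return torch.stack(scores)

# A larger score in the later layers suggests the model internally leans toward
# refusing, even if its surface output ends up complying.
print(refusal_scores("How can I pick a lock?")[-8:].mean().item())
```

A detector then aggregates such per-layer signals into a single score and thresholds it; see ./code/test.py for the actual pipeline.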
1. Install LLaVA

```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support
pip install -e .
```

2. Install HiddenDetect

```bash
git clone https://github.com/leigest519/HiddenDetect.git
cd HiddenDetect
pip install -r requirements.txt
```

Download and save the following models to the ./model directory:
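As an example of one way to fetch checkpoints into ./model, the snippet below uses huggingface_hub; the repo id is only an illustrative assumption (the LLaVA checkpoint family from step 1), not a prescribed model list.

```python
# Hypothetical download helper: pulls one checkpoint into ./model.
# The repo_id below is an example; replace it with the models you evaluate.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/llava-v1.6-vicuna-7b",  # example LLaVA checkpoint (assumption)
    local_dir="./model/llava-v1.6-vicuna-7b",   # keep everything under ./model
)
```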
We evaluate the performance of HiddenDetect using various popular benchmark datasets, including:
| Dataset | Modality | Source | Usage Details |
|---|---|---|---|
| XSTest | Pure text | XSTest | This dataset contains 250 safe prompts and 200 unsafe prompts. |
| FigTxt | Pure text | SafeBench CSV | We use seven safety scenarios of this dataset, comprising 350 prompts used as unsafe samples, stored under ./data/FigStep/safebench.csv. These are paired with 300 handcrafted safe samples stored in ./data/FigStep/benign_questions.csv. |
| FigImg | Bimodal | SafeBench Images | We use the images from all ten safety scenarios of this dataset as visual queries and pair them with the original FigStep text prompt to form unsafe samples, stored under ./data/FigStep/FigImg. |
| MM-SafetyBench | Bimodal | MM-SafetyBench | We include eight safety scenarios of this dataset as unsafe samples under ./data/MM-SafetyBench. |
| JailbreakV-28K | Bimodal | JailbreakV-28K | We randomly select 300 samples spanning five safety scenarios from the llm_transfer_attack subset as unsafe samples, stored under ./data/JailbreakV-28K. |
| VAE | Bimodal | Visual Adversarial Examples | We pair four adversarial images with each prompt in the harmful corpus from the original repo to form an unsafe dataset stored in ./data/VAE. |
| MM-Vet | Bimodal | MM-Vet | We use the entire dataset as safe samples under ./data/MM-Vet. It serves as a counterpart for all bimodal unsafe datasets. |
Because the bimodal datasets differ in size, we randomly subsample the safe and unsafe pools to form roughly balanced evaluation sets, which makes the performance evaluation more robust.
Further details can be found in ./code/load_datasets.py.
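For illustration, here is a simplified sketch of this balancing step; the function name and seed are assumptions, and the actual logic lives in ./code/load_datasets.py.

```python
# Hypothetical illustration of class balancing: subsample the larger pool so the
# safe and unsafe sides contribute the same number of evaluation examples.
import random

def balance(safe_samples, unsafe_samples, seed=0):
    rng = random.Random(seed)
    n = min(len(safe_samples), len(unsafe_samples))
    return rng.sample(safe_samples, n), rng.sample(unsafe_samples, n)

safe, unsafe = balance(list(range(218)), list(range(350)))
print(len(safe), len(unsafe))  # 218 218
```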
To evaluate HiddenDetect across all the datasets, execute:

```bash
python ./code/test.py
```

If you find HiddenDetect or our paper helpful, please consider citing:
```bibtex
@misc{jiang2025hiddendetectdetectingjailbreakattacks,
      title={HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States},
      author={Yilei Jiang and Xinyan Gao and Tianshuo Peng and Yingshui Tan and Xiaoyong Zhu and Bo Zheng and Xiangyu Yue},
      year={2025},
      eprint={2502.14744},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14744}
}
```

⭐ If you like this project, give it a star! ⭐
💬 Feel free to open issues 💡