Run `pip install -r requirements.txt` to install the dependencies.
Set your OpenAI key in `openai_key.txt` and your DeepSeek key in `deepseek_api_key.txt`.
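For reference, a minimal sketch of how these key files can be read (the helper name here is ours, not a function from the repo):

```python
from pathlib import Path

def load_key(path: str) -> str:
    # Read an API key from a plain-text file, stripping trailing whitespace/newlines.
    return Path(path).read_text().strip()

# e.g. load_key("openai_key.txt") or load_key("deepseek_api_key.txt")
```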
All datasets live in the `datasets` folder.
Download the training split `train.tsv` from WildJailbreak at https://huggingface.co/datasets/allenai/wildjailbreak/tree/main/train, place it inside `datasets`, and rename it to `wjb.tsv`.
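The placement step, sketched in shell (the download itself must be done manually from the URL above since the dataset is gated; `touch` here only stands in for the downloaded file):

```shell
mkdir -p datasets
touch train.tsv             # stands in for the manually downloaded train.tsv
mv train.tsv datasets/wjb.tsv
```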
For BD-A, download `pure_bad_100_original_category.jsonl` from https://github.com/Jayfeather1024/Backdoor-Enhanced-Alignment/tree/main.
To generate the Intent-FT dataset, run `notebooks/create_intent_data.ipynb` with the argument `is_intent` set to `True`; set it to `False` to generate the Safety-FT dataset instead. Run this notebook before `notebooks/create_harmful_data.ipynb`, which creates the Harmful-FT data.
Run the following (we used 4x 80GB GPUs for this; LoRA is possible by uncommenting the relevant lines in the recipes):

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/deepspeed_zero3.yaml sft.py --config recipes/llama-8b.yaml --dataset_path intent
```

Select the model via `--config`. `--dataset_path` controls the dataset used; see `constants.py`.
The meaning of `dataset_path` is as follows:

- `None`: Vanilla
- `intent`: Intent-FT
- `intent_harmful`: Harmful-FT on Intent-FT
- `non_intent`: Safety-FT
- `backdoor`: BD-A
- `harmful_mix_safety`: the harmful dataset mixed with 10 safety examples
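The mapping can be pictured as a simple lookup (illustrative only; the authoritative values live in `constants.py`):

```python
# Descriptive stand-ins for the --dataset_path options; see constants.py
# for the real definitions used by sft.py.
DATASET_VARIANTS = {
    None: "Vanilla",
    "intent": "Intent-FT",
    "intent_harmful": "Harmful-FT on Intent-FT",
    "non_intent": "Safety-FT",
    "backdoor": "BD-A",
    "harmful_mix_safety": "harmful dataset mixed with safety examples",
}
```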
The checkpoint names are also in `constants.py`; see how they are constructed in `utils/model_utils.py`, and what each model name refers to. This work only uses llama-8b, qwen-7b, and gpt4.
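As a purely hypothetical orientation sketch (the real construction is in `utils/model_utils.py` and may differ), a checkpoint name could simply combine the two CLI arguments:

```python
def checkpoint_dir(model_name: str, checkpoint_name: str) -> str:
    # Hypothetical naming scheme; check utils/model_utils.py for the actual one.
    return f"{model_name}_{checkpoint_name}"
```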
Run either `notebooks/pair.ipynb` or:

```shell
python -m src.eval_pair --model_name llama-8b --checkpoint_name intent
```

For the prompt-based defenses (IA, SR, and ICD), run:

```shell
python -m src.eval_pair --model_name llama-8b --prompt_def all
```

The code is derived from https://github.com/patrickrchao/JailbreakingLLMs and has been optimized to run the entire dataset in parallel, cutting the runtime from hours to minutes.
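A minimal sketch of the parallelization idea (`query_model` is a stand-in for an async API call, not a repo function): rather than looping over goals sequentially, all attacker/target exchanges are launched concurrently, so wall time scales with one round trip instead of the dataset size.

```python
import asyncio

async def query_model(prompt: str) -> str:
    # Stand-in for one attacker/target API round trip.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def run_parallel(goals):
    # One concurrent task per goal instead of a sequential loop.
    return await asyncio.gather(*(query_model(g) for g in goals))

responses = asyncio.run(run_parallel(["goal-1", "goal-2", "goal-3"]))
```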
Note that the notebook also includes Harmful-FT, but make sure you load the model that has been fine-tuned with Harmful-FT. To run it separately:

```shell
python -m src.eval_ft_harmful --model_name llama-8b --checkpoint_name intent_harmful
```

Run either `notebooks/adaptive_attack.ipynb` or:

```shell
python -m src.eval_adaptive_attack --model_name llama-8b --checkpoint_name intent
```

The code is derived from https://github.com/tml-epfl/llm-adaptive-attacks and is similarly optimized to run the whole dataset in parallel.
Run either `notebooks/prefill.ipynb` or:

```shell
python -m src.eval_prefill --model_name llama-8b --checkpoint_name intent
```

Run `notebooks/whitebox.ipynb`, or for WMDP only:

```shell
python -m src.eval_wmdp --model_name llama-8b --checkpoint_name intent
```

Run either `notebooks/eval_capability.ipynb` or:

```shell
python -m src.eval_capability --model_name llama-8b --checkpoint_name intent
```

Run either `notebooks/xstest.ipynb` or:

```shell
python -m src.xstest --model_name llama-8b --checkpoint_name intent
```

- PAIR is extremely effective on vanilla models: it can jailbreak the majority of samples within the first iteration. We attribute this to using a stronger attacker model (GPT-4.1-mini) instead of Mixtral as in the original work (see images/pair_iteration.png).
- Similarly, the initial manual prompt template used in Adaptive Attack is also very effective.
- Intent-FT on AA takes some time because the objective cannot be optimized on a single token: the model begins its response with the intent rather than the answer, so you cannot optimize the immediate token and must instead optimize over the full output.
- Other attacks we tried but did not document here: AutoDAN (https://arxiv.org/abs/2310.04451), CipherChat (https://arxiv.org/abs/2308.06463), and few-shot attacks (https://arxiv.org/abs/2406.01288 and https://arxiv.org/abs/2310.06387). These attacks do not work well, often achieving <50% ASR on vanilla models, and a simple prompt-based defense like SR can reduce ASR to almost 0. These findings are on Llama-3.1-Instruct.