wj210/Intent_Jailbreak

Intent-FT

Install/Dependencies

Run pip install -r requirements.txt to install the dependencies.

Set your OpenAI key in openai_key.txt and your DeepSeek key in deepseek_api_key.txt.
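For reference, a minimal sketch of how such plain-text key files are typically read (the helper name load_key is hypothetical, not part of this repo):

```python
from pathlib import Path

def load_key(path: str) -> str:
    """Read an API key from a plain-text file, stripping whitespace/newlines."""
    return Path(path).read_text().strip()

# Example usage (assuming the key files exist in the working directory):
# openai_key = load_key("openai_key.txt")
# deepseek_key = load_key("deepseek_api_key.txt")
```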

Datasets

The main folder is datasets.

Download the training dataset train.tsv from WildJailBreak (https://huggingface.co/datasets/allenai/wildjailbreak/tree/main/train), rename it to wjb.tsv, and place it inside a folder named datasets.
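The rename-and-place step can be scripted; a minimal sketch, assuming train.tsv has already been downloaded (the function name place_wjb is an illustration, not part of the repo):

```python
import shutil
from pathlib import Path

def place_wjb(train_tsv: str, datasets_dir: str = "datasets") -> Path:
    """Copy a downloaded WildJailBreak train.tsv into datasets/wjb.tsv."""
    dst_dir = Path(datasets_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create datasets/ if missing
    dst = dst_dir / "wjb.tsv"
    shutil.copy(train_tsv, dst)
    return dst
```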

For BD-A, download the pure_bad_100_original_category.jsonl from https://github.com/Jayfeather1024/Backdoor-Enhanced-Alignment/tree/main.

To generate the Intent-FT dataset, run notebooks/create_intent_data.ipynb with the argument is_intent set to True; set it to False to generate the Safety-FT dataset. Run this notebook before notebooks/create_harmful_data.ipynb, which creates the Harmful-FT data.

SFT

Run the following (we use 4x 80GB VRAM for this; LoRA is possible by uncommenting the relevant lines in recipes):

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/deepspeed_zero3.yaml sft.py --config recipes/llama-8b.yaml --dataset_path intent

The model is selected via the config file; dataset_path controls the dataset used (see constants.py).

The meaning of the dataset_path is as follows:

None: Vanilla

intent: Intent-FT

intent_harmful: Harmful-FT on Intent-FT

non_intent: Safety-FT

backdoor: BD-A

harmful_mix_safety: Mix the harmful dataset with 10 examples.
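The mapping above can be summarized as a simple lookup (illustrative only; the authoritative definitions live in constants.py):

```python
# Illustrative mapping of the --dataset_path flag to training variants.
# This dict is a documentation aid, not code from the repo.
DATASET_VARIANTS = {
    None: "Vanilla",
    "intent": "Intent-FT",
    "intent_harmful": "Harmful-FT on Intent-FT",
    "non_intent": "Safety-FT",
    "backdoor": "BD-A",
    "harmful_mix_safety": "Harmful dataset mixed with 10 examples",
}
```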

The checkpoint names are also in constants.py; see utils/model_utils.py for how the checkpoint names are constructed and what the model names refer to. This work only uses llama-8b, qwen-7b and gpt4.

Attacks

PAIR/DeepInception

Run either notebooks/pair.ipynb or

python -m src.eval_pair --model_name llama-8b --checkpoint_name intent 

For prompt-based defenses: IA, SR and ICD, run

python -m src.eval_pair --model_name llama-8b --prompt_def all

The code is derived from https://github.com/patrickrchao/JailbreakingLLMs and has been optimized to run the entire dataset in parallel, reducing runtime from hours to minutes.

Note that the notebook also includes Harmful-FT, but make sure you load the model that has been fine-tuned with Harmful-FT. To run it separately:

python -m src.eval_ft_harmful --model_name llama-8b --checkpoint_name intent_harmful

Adaptive Attack (AA)

Run either notebooks/adaptive_attack.ipynb or

python -m src.eval_adaptive_attack --model_name llama-8b --checkpoint_name intent 

Code is derived from https://github.com/tml-epfl/llm-adaptive-attacks and similarly optimized to run the whole dataset in parallel.

Prefill

Run either notebooks/prefill.ipynb or

python -m src.eval_prefill --model_name llama-8b --checkpoint_name intent 

Extra Analysis

Whitebox

Run notebooks/whitebox.ipynb, or for WMDP only:

python -m src.eval_wmdp --model_name llama-8b --checkpoint_name intent 

Capability

Run either notebooks/eval_capability.ipynb or

python -m src.eval_capability --model_name llama-8b --checkpoint_name intent 

Over-refusal

Run either notebooks/xstest.ipynb or

python -m src.xstest --model_name llama-8b --checkpoint_name intent 

Remarks

  1. PAIR is extremely effective on vanilla models: it can jailbreak the majority of samples within the first iteration. We attribute this to using a stronger attacker model (GPT-4.1-mini) instead of Mixtral as in the original work (see images/pair_iteration.png).
  2. Similarly, the initial manual prompt template used in Adaptive Attack is also very effective.
  3. Intent-FT on AA takes some time because the attack cannot be optimized for a single token: the model begins its response with the intent rather than the answer, so you must optimize on the full output rather than the immediate token.
  4. Other attacks we tried but did not document here are AutoDAN (https://arxiv.org/abs/2310.04451), CipherChat (https://arxiv.org/abs/2308.06463), and few-shot attacks (https://arxiv.org/abs/2406.01288 and https://arxiv.org/abs/2310.06387). These attacks do not work well, often achieving <50% ASR on vanilla models, and a simple prompt-based defense like SR can reduce ASR to almost 0. These findings are on Llama-3.1-Instruct.
