Run `pip install -r requirements.txt` to install the dependencies.
Set your OpenAI key in `openai_key.txt` and your DeepSeek key in `deepseek_api_key.txt`.
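For reference, a minimal sketch of how these key files can be read (the helper name here is ours, not a function from the repo):

```python
from pathlib import Path

def load_key(path: str) -> str:
    # Read an API key from a plain-text file, stripping trailing whitespace/newlines.
    return Path(path).read_text().strip()

# e.g. load_key("openai_key.txt") or load_key("deepseek_api_key.txt")
```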
All datasets live in the `datasets` folder.
Download the training split `train.tsv` from WildJailbreak at https://huggingface.co/datasets/allenai/wildjailbreak/tree/main/train, place it inside `datasets`, and rename it to `wjb.tsv`.
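The placement step, sketched in shell (the download itself must be done manually from the URL above since the dataset is gated; `touch` here only stands in for the downloaded file):

```shell
mkdir -p datasets
touch train.tsv             # stands in for the manually downloaded train.tsv
mv train.tsv datasets/wjb.tsv
```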
For BD-A, download `pure_bad_100_original_category.jsonl` from https://github.com/Jayfeather1024/Backdoor-Enhanced-Alignment/tree/main.
To generate the Intent-FT dataset, run `notebooks/create_intent_data.ipynb` with the argument `is_intent` set to `True`; set it to `False` to generate the Safety-FT dataset instead. Run this notebook before `notebooks/create_harmful_data.ipynb`, which creates the Harmful-FT data.
Run the following (we used 4x 80GB GPUs for this; LoRA is possible by uncommenting the relevant lines in the recipes):

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/deepspeed_zero3.yaml sft.py --config recipes/llama-8b.yaml --dataset_path intent
```

Select the model via `--config`. `--dataset_path` controls the dataset used; see `constants.py`.
The meaning of `dataset_path` is as follows:

- `None`: Vanilla
- `intent`: Intent-FT
- `intent_harmful`: Harmful-FT on Intent-FT
- `non_intent`: Safety-FT
- `backdoor`: BD-A
- `harmful_mix_safety`: the harmful dataset mixed with 10 safety examples
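The mapping can be pictured as a simple lookup (illustrative only; the authoritative values live in `constants.py`):

```python
# Descriptive stand-ins for the --dataset_path options; see constants.py
# for the real definitions used by sft.py.
DATASET_VARIANTS = {
    None: "Vanilla",
    "intent": "Intent-FT",
    "intent_harmful": "Harmful-FT on Intent-FT",
    "non_intent": "Safety-FT",
    "backdoor": "BD-A",
    "harmful_mix_safety": "harmful dataset mixed with safety examples",
}
```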
The checkpoint names are also in `constants.py`; see how they are constructed in `utils/model_utils.py`, and what each model name refers to. This work only uses llama-8b, qwen-7b, and gpt4.
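As a purely hypothetical orientation sketch (the real construction is in `utils/model_utils.py` and may differ), a checkpoint name could simply combine the two CLI arguments:

```python
def checkpoint_dir(model_name: str, checkpoint_name: str) -> str:
    # Hypothetical naming scheme; check utils/model_utils.py for the actual one.
    return f"{model_name}_{checkpoint_name}"
```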
Run either `notebooks/pair.ipynb` or:

```shell
python -m src.eval_pair --model_name llama-8b --checkpoint_name intent
```

For the prompt-based defenses (IA, SR, and ICD), run:

```shell
python -m src.eval_pair --model_name llama-8b --prompt_def all
```

The code is derived from https://github.com/patrickrchao/JailbreakingLLMs and has been optimized to run the entire dataset in parallel, cutting the runtime from hours to minutes.
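A minimal sketch of the parallelization idea (`query_model` is a stand-in for an async API call, not a repo function): rather than looping over goals sequentially, all attacker/target exchanges are launched concurrently, so wall time scales with one round trip instead of the dataset size.

```python
import asyncio

async def query_model(prompt: str) -> str:
    # Stand-in for one attacker/target API round trip.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def run_parallel(goals):
    # One concurrent task per goal instead of a sequential loop.
    return await asyncio.gather(*(query_model(g) for g in goals))

responses = asyncio.run(run_parallel(["goal-1", "goal-2", "goal-3"]))
```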
Note that the notebook also includes Harmful-FT, but make sure you load the model that has been fine-tuned with Harmful-FT. To run it separately:

```shell
python -m src.eval_ft_harmful --model_name llama-8b --checkpoint_name intent_harmful
```

Run either `notebooks/adaptive_attack.ipynb` or:

```shell
python -m src.eval_adaptive_attack --model_name llama-8b --checkpoint_name intent
```

The code is derived from https://github.com/tml-epfl/llm-adaptive-attacks and is similarly optimized to run the whole dataset in parallel.
Run either `notebooks/prefill.ipynb` or:

```shell
python -m src.eval_prefill --model_name llama-8b --checkpoint_name intent
```

Run `notebooks/whitebox.ipynb`, or for WMDP only:

```shell
python -m src.eval_wmdp --model_name llama-8b --checkpoint_name intent
```

Run either `notebooks/eval_capability.ipynb` or:

```shell
python -m src.eval_capability --model_name llama-8b --checkpoint_name intent
```

Run either `notebooks/xstest.ipynb` or:

```shell
python -m src.xstest --model_name llama-8b --checkpoint_name intent
```

- PAIR is extremely effective on vanilla models: it can jailbreak the majority of samples within the first iteration. We attribute this to using a stronger attacker model (GPT-4.1-mini) instead of Mixtral as in the original work (see images/pair_iteration.png).
- Similarly, the initial manual prompt template used in Adaptive Attack is also very effective.
- Intent-FT on AA takes some time because the objective cannot be optimized on a single token: the model begins its response with the intent rather than the answer, so you cannot optimize the immediate token and must instead optimize over the full output.
- Other attacks we tried but did not document here: AutoDAN (https://arxiv.org/abs/2310.04451), CipherChat (https://arxiv.org/abs/2308.06463), and few-shot attacks (https://arxiv.org/abs/2406.01288 and https://arxiv.org/abs/2310.06387). These attacks do not work well, often achieving <50% ASR on vanilla models, and a simple prompt-based defense like SR can reduce ASR to almost 0. These findings are on Llama-3.1-Instruct.