Scaling Trends in Language Model Robustness accompanying code
If you just want to run the code and nothing else, you can do the following:
- clone the repository
- cd into it
- create a new Python 3.10 virtual environment called venv
- activate the virtual environment
- install the robust-llm project
git clone https://github.com/AlignmentResearch/robust-llm.git
cd robust-llm
python -m venv venv
source venv/bin/activate
pip install .
Note that this project has not been tested with different versions of Python.
If you want to install robust-llm with developer dependencies (which gives you development tools like linting and tests), do the following:
- Follow steps 1-4 from the Simple installation.
- Add pre-commit hooks for various linting tasks by installing pre-commit:
pre-commit install
- Install robust-llm in developer mode, with dev dependencies:
pip install -e '.[dev]'
We have several kinds of pipelines that you can run. Currently there is a training pipeline (used for supervised fine-tuning of models or adversarial training) and an evaluation pipeline (used to evaluate an adversarial attack on a fixed model). To choose the appropriate pipeline, set experiment.experiment_type in a Hydra config (see below). Note that there is also a defense pipeline, which is currently non-operational.
Experiments are configured with Hydra.
The config structure is defined as nested dataclasses in configs.py.
Many values are left intentionally MISSING so that experiments don't accidentally use defaults, and so that the final config is cleaner.
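As a rough illustration of this pattern, the sketch below uses stdlib dataclasses with a stand-in MISSING sentinel (in OmegaConf, MISSING is the string "???"). The class layout, the dataset_type default and the missing_paths helper are assumptions made for illustration; they are not the actual contents of configs.py.

```python
from dataclasses import dataclass, field, fields, is_dataclass

# Stand-in for omegaconf.MISSING, which is the string "???".
MISSING = "???"


@dataclass
class DatasetConfig:
    # Hypothetical fields and defaults, for illustration only.
    dataset_type: str = MISSING
    n_val: int = 0


@dataclass
class ExperimentConfig:
    experiment_name: str = MISSING
    run_name: str = MISSING
    dataset: DatasetConfig = field(default_factory=DatasetConfig)


def missing_paths(cfg, prefix=""):
    """Return dotted paths of every field still set to MISSING."""
    paths = []
    for f in fields(cfg):
        value = getattr(cfg, f.name)
        path = f"{prefix}{f.name}"
        if is_dataclass(value):
            # Recurse into nested config dataclasses.
            paths.extend(missing_paths(value, prefix=path + "."))
        elif value == MISSING:
            paths.append(path)
    return paths


cfg = ExperimentConfig(experiment_name="demo")
unset = missing_paths(cfg)
```

Leaving fields as MISSING means a run cannot silently proceed with a value nobody chose: anything listed in unset has to be supplied by a yaml file or an override.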
Full, runnable experiment configs are defined in yaml files in robust_llm/hydra_conf.
Experiments are run with python robust_llm +experiment=path/to/exp.yaml (where path/to/exp.yaml is relative to robust_llm/hydra_conf/experiment).
This setup takes inspiration from the Hydra docs experiment example.
The yaml files are set up in such a way as to be composable -- for example, we can write a yaml file for a model once (say pythia-14m) and then use that in multiple experiments and in multiple places in the config.
To do this we take advantage of Hydra package overriding (note that the linked documentation is for an older version, but is clearer than the current version).
A Hydra config consists of nested packages: collections of values under a given path, like training.adversarial.training_attack or evaluation.evaluation_attack.
This is closely related to the concept of a config group, which is the path of a collection of values that Hydra is aware of.
These paths can be defined either in a directory tree of yaml files or in a ConfigStore instance in Python.
Sometimes we want to change which config group is used in a given position in the config.
As an example, let's consider setting the training_attack and evaluation_attack in an adversarial training run.
We define our attacks to be in the attack config group.
This is done by using group="attack" in robust_llm/config/attack_configs.py and by putting yaml files in the robust_llm/hydra_conf/attack directory.
However, we want these configs to be placed at training.adversarial.training_attack and evaluation.evaluation_attack, not at attack.
This requires overriding packages in the Defaults List.
The syntax to take an attack (say, GCG) from the config group attack and place it at evaluation.evaluation_attack is:
attack@evaluation.evaluation_attack: GCG
However, there's an added complication: most of the time we want to write these overrides in a yaml script in hydra_conf/experiment, which is outside of the standard hierarchy.
To do this we have to use the # @package _global_ directive, which says that the paths we define should be relative to the hydra_conf root, rather than relative to the current file (which would be hydra_conf/experiment if we didn't change anything).
With that change, we now need to put a slash in front of attack to indicate that Hydra should look relative to the root hydra_conf rather than relative to hydra_conf/experiment:
/attack@evaluation.evaluation_attack: GCG
(TODO: work out why we don't need to write /evaluation.evaluation_attack)
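To make the override syntax more concrete, here is a simplified sketch of how an entry of the form group@package: option can be read. It is not Hydra's implementation, and parse_defaults_entry, place_at and the n_its field are hypothetical names used only for illustration.

```python
def parse_defaults_entry(key: str, option: str):
    """Split a 'group@package' key into its parts.

    Returns (group, package, option, absolute), where `absolute` records
    whether the group was written with a leading slash.
    """
    absolute = key.startswith("/")
    group, _, package = key.lstrip("/").partition("@")
    # Simplification: with no explicit @package, assume the package
    # mirrors the group path.
    if not package:
        package = group.replace("/", ".")
    return group, package, option, absolute


def place_at(config: dict, package: str, value) -> dict:
    """Store `value` at the dotted `package` path inside `config`."""
    node = config
    *parents, leaf = package.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Hypothetical contents of the GCG attack config, for illustration only.
gcg = {"n_its": 10}

group, package, option, absolute = parse_defaults_entry(
    "/attack@evaluation.evaluation_attack", "GCG"
)
config = place_at({}, package, gcg)
```

In this sketch the config is looked up in the attack group but ends up under evaluation.evaluation_attack, which is the relocation the override expresses.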
We'll put experiments in experiment/<ExperimentType> directories for organization; e.g. experiment/AdvTraining.
(Because we're using @package _global_, it's fine for our experiment configs to be nested since everything is relative to the top-level config group anyway.)
In yaml files, we'll use unquoted strings to refer to other configs, and "quoted" strings for literal string values.
Names of other yaml files will be lowercase-with-hyphens, while defaults from python will be UPPERCASE_WITH_UNDERSCORES (Python defaults are explained in more detail in the worked example below).
Let's look at the contents of robust_llm/hydra_conf/experiment/Eval/pm_gcg.yaml, which defines an evaluation experiment.
We'll go through some of the important lines.
# @package _global_
defaults:
  - /dataset: passwordmatch
  - /evaluation: DEFAULT
  - /attack@evaluation.evaluation_attack: GCG
  - /model: EleutherAI/pythia-14m
  - _self_
dataset:
  n_val: 10
experiment_name: ???
run_name: ???
experiment_type: "evaluation"
The @package _global_ directive is important to relocate the overrides in the file.
For example, without the @package _global_ directive, the dataset section of this file would be placed at args.experiment.Eval.dataset, which doesn't exist (since experiment is not part of the config structure), rather than at args.dataset.
Similarly, we have to use absolute paths in the Defaults List so that Hydra knows to look for these config files relative to the hydra_conf root: for example, at dataset rather than experiment/Eval/dataset.
The entry /dataset: passwordmatch means to look in the /dataset directory (which is hydra_conf/dataset, since hydra_conf is the root) and use passwordmatch.yaml.
The entry /evaluation: DEFAULT is a little tricky. Strictly speaking, /evaluation means to look in the /evaluation config group rather than the /evaluation directory (and analogously for /dataset).
Things can end up in the /evaluation config group either by being in hydra_conf/evaluation or by being defined in Python using Hydra's ConfigStore.
In this case, we have defined the default evaluation config in configs.py, using cs.store(name="DEFAULT", group="evaluation", node=EvaluationConfig). One reason for doing this is that we rarely change most of the attributes of EvaluationConfig (except evaluation_attack, which we will handle next). Another, maybe more compelling, reason is that we want the default value of evaluation in ExperimentConfig to be None, to avoid clogging up the config when we don't need it; with a None default, however, it's tricky to specify from the yaml alone that we want an instance of EvaluationConfig.
The entry /attack@evaluation.evaluation_attack: GCG says to take a config from the /attack config group and put it at evaluation.evaluation_attack. In particular, we want to take /attack/GCG (which we know is defined in Python because it's UPPERCASE; in particular it's in config/attack_configs.py) and set this as the evaluation_attack. This uses Hydra's package override syntax.
The entry /model: EleutherAI/pythia-14m is analogous to the /dataset entry: look in /model and use EleutherAI/pythia-14m.
The _self_ entry is important, as it tells Hydra that the values that come after the Defaults List should override the values set by the Defaults List.
Everything before it was part of the Defaults List.
The n_val: 10 line under dataset overrides whatever value dataset.n_val had before (which was 0, from configs.py).
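The effect of placing _self_ last can be pictured as a recursive merge in which later sources win. The sketch below is a simplified stand-in for Hydra's composition, not its actual code; deep_merge is a hypothetical helper, and the dataset_type value is assumed to come from passwordmatch.yaml.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Simplified stand-in for what the Defaults List entries contribute
# (the dataset_type value is an assumption about passwordmatch.yaml).
from_defaults = {
    "dataset": {"dataset_type": "AlignmentResearch/PasswordMatch", "n_val": 0},
}

# The body of the experiment file, which _self_ places last.
from_self = {"dataset": {"n_val": 10}, "experiment_type": "evaluation"}

final = deep_merge(from_defaults, from_self)
```

Only dataset.n_val changes; sibling values contributed by the Defaults List are kept.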
- See robust_llm/hydra_conf/Eval/_template.yaml for a template for evaluation experiments that explains some of what's going on.
  - See also _template.yaml under AdvTraining, Training, and DefendedEval.
- See random-token-n-its-1280 and gcg-standard in robust_llm/hydra_conf/attack for examples of extending/overriding the AttackConfig defaults.
- See robust_llm/hydra_conf/experiment/ian/20240429_pm_random-token-fted.yaml for an example of extending a generic experiment yaml for a specific experiment.
- See the scripts in experiments/_example for examples of how these configs could be used for real experiments.
- On the command line or in a Python experiment script:
  - When we want to override a default value with a config (like ModelConfig), not just a value (like model_family), if the default value comes from the dataclass and was not set using the Defaults List, then we have to prepend a + to the override string to add it to the Defaults List.
  - An example of this is given in experiments/_example/example_004_Eval_pm_random-token_and_gcg.py.
We use accelerate for multi-GPU runs with FSDP. This is because some models (e.g. 12B Pythia) are too big to fine-tune or attack using only 1 GPU. FSDP spreads model parameters across multiple GPUs.
To locally run any experiment with accelerate, instead of running
python robust_llm +experiment=my_exp
use
accelerate launch --config_file=accelerate_config.yaml --num_processes=<NUM_GPUS> robust_llm +experiment=my_exp
If you want to use FSDP with batch jobs, simply set gpu=<NUM_GPUS> to a number greater than 1 when calling run_multiple().
Note: not all our code has been adapted to be used with accelerate. Things that should currently work are: fine-tuning models, adversarial evals with GCG, adversarial evals with beam search. Please use with caution.
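The two local launch commands above can be assembled programmatically. The helper below is a hypothetical sketch and not part of the repository; it uses only the flags shown above and assumes that a single-GPU run should use the plain python command.

```python
from typing import List


def build_command(experiment: str, num_gpus: int = 1) -> List[str]:
    """Build the local launch command for an experiment (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    target = ["robust_llm", f"+experiment={experiment}"]
    if num_gpus == 1:
        # Assumption: single-GPU runs do not go through accelerate.
        return ["python", *target]
    return [
        "accelerate",
        "launch",
        "--config_file=accelerate_config.yaml",
        f"--num_processes={num_gpus}",
        *target,
    ]


single = build_command("my_exp")
multi = build_command("my_exp", num_gpus=4)
```

The resulting lists can be passed to subprocess.run, which avoids shell quoting issues with the +experiment=... argument.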
The training pipeline supports checkpointing. This is controlled by save_strategy, save_steps and save_total_limit in the TrainingConfig. If more than save_steps steps are completed during a run, then a checkpoint will be saved in a directory like trainer/run_name/checkpoint-0. This directory contains the following files, which are needed to record the full state of the trainer:
config.json
model.safetensors
optimizer.pt
rng_state.pth
scheduler.pt
trainer_state.json
training_args.json
adversarial_training_state
All of these are handled directly by HuggingFace's Trainer class except for the last, which is used to record project-specific state such as the adversarial training round and the attack RNG state.
If a new run is started with the same name, then we will try to find the last checkpoint in trainer/run_name to resume.
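A minimal sketch of how the last checkpoint could be located, assuming checkpoint directories are named checkpoint-<N> as in the example path above. The function name, the comparison by numeric suffix and the step numbers in the demo are assumptions; the project's actual resume logic may differ.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the checkpoint-<N> subdirectory with the largest N, if any."""
    if not run_dir.is_dir():
        return None
    candidates = []
    for child in run_dir.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically: as strings, "checkpoint-500" sorts after
    # "checkpoint-1000".
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory laid out like trainer/run_name.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "trainer" / "run_name"
    for step in (0, 500, 1000):
        (run_dir / f"checkpoint-{step}").mkdir(parents=True)
    last_name = find_last_checkpoint(run_dir).name
    none_found = find_last_checkpoint(Path(tmp) / "trainer" / "other_run")
```

Returning None when nothing is found lets the caller fall back to starting the run from scratch.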
We currently support the following model families:
You can find the relevant configs nested under robust_llm/hydra_conf/model; for example, robust_llm/hydra_conf/model/meta-llama/Llama-2-7b-chat-hf.yaml.
Note that the directory structure and names are generally chosen to mirror the names used on HuggingFace.
Currently, we use the following datasets in our experiments:
- IMDB, a task to classify whether a movie review is positive or negative. You can use it by setting dataset.dataset_type="AlignmentResearch/IMDB".
- Spam, a task to classify whether an email is spam or not. You can use it by setting dataset.dataset_type="AlignmentResearch/EnronSpam".
- WordLength, a synthetic binary classification task to predict which of two words is longer. You can use it by setting dataset.dataset_type="AlignmentResearch/WordLength".
- PasswordMatch, a synthetic binary classification task to predict whether two passwords are identical. You can use it by setting dataset.dataset_type="AlignmentResearch/PasswordMatch".
- StrongREJECT, a jailbreak/refusal dataset from https://arxiv.org/abs/2402.10260. You can use it by setting dataset.dataset_type="AlignmentResearch/StrongREJECT". This is currently our only dataset which is purely for generative models.
- Helpful, a human preference dataset for chatbot conversations comparing two responses of varying helpfulness. You can use it by setting dataset.dataset_type="AlignmentResearch/Helpful".
- Harmless, a human preference dataset for chatbot conversations comparing two responses of varying harmlessness. You can use it by setting dataset.dataset_type="AlignmentResearch/Harmless".
We store our fine-tuned models on HuggingFace. You can find up-to-date IDs of the models here.
@misc{howe2025scalingtrendslanguagemodel,
title={Scaling Trends in Language Model Robustness},
author={Nikolaus Howe and Ian McKenzie and Oskar Hollinsworth and Michał Zajac and Tom Tseng and Aaron Tucker and Pierre-Luc Bacon and Adam Gleave},
year={2025},
eprint={2407.18213},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.18213},
}