Most of the code and implementation for this project comes from FailSpy's Abliterator. However, I have modified the code and setup to better fit my use case.
Original Paper: Refusal in LLMs is mediated by a single direction
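To make the idea of a "single direction" concrete, here is a minimal, dependency-free sketch of a difference-of-means direction. It is an illustration under assumptions, not the code in abliterator.py: the helper names (`mean_vector`, `feature_direction`, `projection`) and the toy numbers are made up, and the real project works on cached torch activations per layer rather than plain lists.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def feature_direction(trait_acts, baseline_acts):
    """Unit vector pointing from the baseline mean to the trait mean."""
    trait_mean = mean_vector(trait_acts)
    base_mean = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(trait_mean, base_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("trait and baseline means are identical")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "activations" (hypothetical values, for illustration only)
baseline_acts = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
trait_acts = [[3.0, 2.0, 4.0], [3.0, 2.0, 4.0]]

direction = feature_direction(trait_acts, baseline_acts)
trait_score = projection(trait_acts[0], direction)
```

An activation with a larger projection onto the direction expresses more of the trait under this simplified picture; scaling the direction (the modifier used later in this walkthrough) changes how strongly it is applied.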
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Install the Hugging Face Transformers library and other dependencies.
NOTE: If you are using a GPU (which you should), you might need to check which torch version you need.
!pip install -q transformers einops transformer_lens scikit-learn torch

You can also pip install from a requirements file:
!pip install -r requirements.txt

Depending on the model you plan to use, you may need to log in to Hugging Face to download the model.
from huggingface_hub import notebook_login
notebook_login()

Load the model and configure it. Note that the model has to be supported by TransformerLens.
from abliterator import *
# You can check abliterator.py for more information about the prompt data. These are just basic baseline instructions.
model = ModelAbliterator(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [
        get_baseline_instructions(),
        get_baseline_instructions(),
    ],
    activation_layers=["resid_pre"],
)
# Blacklist the first and last layers. This is optional. You can blacklist any layers you want.
model.blacklist_layer([0, 1, 2, 3, 29, 30, 31])

Create a ChatTemplate. Modify this as you wish; it is just an example. Note: the template depends on which model you are using. The one used in this walkthrough is for the Llama 3 architecture. For something like Phi-3 it would look like this:
phi3_template = """<|user|>\n{instruction}<|end|>\n<|assistant|>"""

system_prompt = """You are highly optimistic. Your responses should reflect a positive and hopeful outlook on life. Emphasize the bright side of any situation, and express strong confidence that things will turn out well. Encourage others with uplifting and encouraging language."""
chat_template = ChatTemplate(
    model,
    "<|start_header_id|>system<|end_header_id|>\n"
    + system_prompt
    + "<|eot_id|><|start_header_id|>user<|end_header_id|>\n{instruction}<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Let's see how the model responds as a baseline.
model.test(
    N=32,
    test_set=model.baseline_inst_test[15:16],
    max_tokens_generated=100,
    drop_refusals=False,
)

Measure the effectiveness of our prompt.
with chat_template:
    model.test(N=4, test_set=model.baseline_inst_test[30:33], drop_refusals=False)

Set up paths for saving baseline and altered caches.
import os
from tqdm.notebook import tqdm
MODEL = "llama3"
# MODEL = "phi3"
baseline_cache_path = f"/baseline_cache_{MODEL}_compressed.pkl.gz"

Calculate the baseline cache if it doesn't exist.
# Define the prompt count up front: the altered-cache step also uses it,
# even when the baseline cache is loaded from disk.
prompt_count = 1500  # using more samples can better target the direction

if not os.path.exists(baseline_cache_path):
    print("Calculating baseline cache...")
    # Tokenize instructions for baseline
    baseline = model.tokenize_instructions_fn(
        model.baseline_inst_train[:prompt_count]
    )  # Use base system prompt
    # create_activation_cache returns a pair; keep only the cache itself so
    # baseline_cache has the same type as in the load branch below
    baseline_cache, _ = model.create_activation_cache(baseline, N=len(baseline))
    # Save baseline cache
    save_compressed_cache(baseline_cache, baseline_cache_path)
else:
    print("Baseline cache already exists.")
    # Load baseline cache
    baseline_cache = load_compressed_cache(baseline_cache_path, model)

Create an altered cache using the ChatTemplate.
with chat_template:
    # Tokenize instructions for altered tokens
    altered_toks = model.tokenize_instructions_fn(
        model.baseline_inst_train[:prompt_count]
    )
    altered_cache = model.create_activation_cache(altered_toks, N=len(altered_toks))

Set trait and baseline caches.
# Set trait and baseline caches
model.trait, _ = altered_cache
model.baseline = baseline_cache
# Get feature directions.
# Inverted because we're attempting to induce the feature; otherwise it would be a refusal direction.
feature_directions = model.refusal_dirs(invert=True)

Find the direction that best expresses the desired behavior. Adjust the modifier value if the model is not behaving as expected.
modifier = 1.3  # Lower is more stable. I've found 1.3 to 1.5 works well.
for block in feature_directions:
    # The context manager reverts any changes to the model's weights at the end of each loop iteration
    with model:
        model.apply_refusal_dirs([feature_directions[block] * modifier])
        print(block)
        model.test(
            N=32,
            test_set=model.baseline_inst_test[15:25],
            max_tokens_generated=64,
            drop_refusals=False,
        )
    print("=" * 100)

Clear memory before proceeding if necessary.
clear_mem()

Apply the identified direction to the model. I have found that blocks (layers) 17 and 18 tend to give the best results.
model.apply_refusal_dirs([feature_directions["blocks.18.hook_resid_pre"] * modifier])

Test the modified model to ensure it behaves as expected.
model.test(
    N=32,
    test_set=model.baseline_inst_test[15:25],
    max_tokens_generated=64,
    drop_refusals=False,
)

Save the model state for future use. This is kind of a wonky approach, but it works.
import einops
from transformers import AutoModelForCausalLM

cfg = model.model.cfg
state_dict = model.model.state_dict()
hf_model = AutoModelForCausalLM.from_pretrained(
    model.MODEL_PATH, torch_dtype=torch.bfloat16
)
lm_model = hf_model.model  # get the language model component
for l in range(cfg.n_layers):
    # Copy the modified attention output projection back into the Hugging Face layout
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=cfg.n_heads
        ).contiguous()
    )
    # Copy the modified MLP output projection (transposed to match nn.Linear)
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

Push the model to the Hugging Face Hub.
hf_model.push_to_hub("your-model-name")

Alternatively, save the model locally.
hf_model.save_pretrained("your-model-name")

There is also a FastAPI service that generates text from a given prompt and feature directions, and provides a health check endpoint. This is useful if you have saved your feature directions and want to generate text on the fly.
URL: /health
Method: GET
Description: Check the health of the service.
Response:
{
  "status": "Service is up and running",
  "pytorch_version": "<PyTorch version>",
  "cuda_available": "<True/False>",
  "gpu_name": "<GPU Name if available>"
}

URL: /generate
Method: POST
Description: Generate text based on the provided prompt and feature directions.
Request Body:
{
  "prompt": "Your prompt text",
  "feature_directions": [0.1, 0.2, 0.3, ...], // List of floats
  "modifier": 1.3, // Optional, default is 1.3
  "max_tokens": 100 // Optional, default is 100
}
Response:
{
  "response": "Generated text"
}

To run the FastAPI service, execute the following command:
uvicorn <your_script_name>:app --host 0.0.0.0 --port 8888

Replace <your_script_name> with the name of the Python script containing the FastAPI app, without the .py extension. I use api.py in this case, so the command starts with uvicorn api:app.
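For reference, the /generate request body documented above (required prompt and feature_directions, optional modifier and max_tokens) could be validated roughly like this. This is a standard-library sketch under assumptions: `GenerateRequest` and `parse_generate_request` are hypothetical names, and api.py may handle validation differently (for example with pydantic models).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenerateRequest:
    prompt: str
    feature_directions: List[float]
    modifier: float = 1.3   # documented default
    max_tokens: int = 100   # documented default


def parse_generate_request(body: dict) -> GenerateRequest:
    """Check required fields and fill in the documented defaults."""
    for key in ("prompt", "feature_directions"):
        if key not in body:
            raise ValueError(f"missing required field: {key}")
    directions = body["feature_directions"]
    is_number = lambda x: isinstance(x, (int, float)) and not isinstance(x, bool)
    if not isinstance(directions, list) or not all(is_number(x) for x in directions):
        raise ValueError("feature_directions must be a list of numbers")
    return GenerateRequest(
        prompt=str(body["prompt"]),
        feature_directions=[float(x) for x in directions],
        modifier=float(body.get("modifier", 1.3)),
        max_tokens=int(body.get("max_tokens", 100)),
    )


# Optional fields omitted, so the defaults apply
example = parse_generate_request(
    {"prompt": "Tell me a story about AI", "feature_directions": [0.5, -0.2, 0.1]}
)
```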
To check the health of the service, make a GET request to the /health endpoint:
curl -X GET "http://0.0.0.0:8888/health"

To generate text, make a POST request to the /generate endpoint with the appropriate JSON payload:
curl -X POST "http://0.0.0.0:8888/generate" -H "Content-Type: application/json" -d '{
  "prompt": "Tell me a story about AI",
  "feature_directions": [0.5, -0.2, 0.1],
  "modifier": 1.3,
  "max_tokens": 150
}'

This will return a JSON response with the generated text.
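The same call can be made from Python. Below is a small client sketch using only the standard library; the helper names (`build_generate_request`, `generate`) and the `http://localhost:8888` base URL are assumptions for illustration, and the response is assumed to have the `{"response": ...}` shape documented above.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, feature_directions,
                           modifier=1.3, max_tokens=100):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {
        "prompt": prompt,
        "feature_directions": list(feature_directions),
        "modifier": modifier,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, feature_directions, timeout=120, **options):
    """Send the request and return the generated text from the JSON response."""
    request = build_generate_request(base_url, prompt, feature_directions, **options)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Building the request needs no running server; calling generate(...) does.
request = build_generate_request(
    "http://localhost:8888", "Tell me a story about AI", [0.5, -0.2, 0.1], max_tokens=150
)
```

With the service running, `generate("http://localhost:8888", "Tell me a story about AI", [0.5, -0.2, 0.1])` would return the generated string.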
@misc{allbert2025identifyingmanipulatingpersonalitytraits,
title={Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering},
author={Rumi A. Allbert and James K. Wiles and Vlad Grankovsky},
year={2025},
eprint={2412.10427},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.10427},
}