Cross-Axis Capping

An exploratory study on separating jailbreak detection from correction in LLM activation space.

This project was done as part of the BlueDot Technical AI Safety project sprint. Compute costs were covered by Rapid Grants.

I have tried to explain everything in detail throughout this document and the codebase. If you find any information missing or the structure hard to follow, please reach out to me at manuxtmail@gmail.com.

What is this?

When a large language model (LLM) generates text, its internal "hidden states" carry directional signals that indicate whether it's about to comply with or refuse a request. Safety researchers have found that you can nudge a model toward refusal by capping these signals along a known direction (the "assistant axis").

The standard approach uses one axis for both detecting a jailbreak and correcting the model's behavior. This project asks: what if we use two separate axes instead?

Assistant-axis capping (the standard): uses the same direction to spot a jailbreak and to push the model toward refusal.
Cross-axis capping (the new idea): spots the jailbreak on the assistant axis, but corrects using a different "compliance axis" derived from PCA on the model's own refusing vs. compliant activations.

We test both methods on Qwen/Qwen3-32B across 100 jailbreak prompts and 50 benign prompts, then use Claude Sonnet as an automated judge to classify each output.

What did we find?

Metric	Assistant-Axis	Cross-Axis
Jailbreak refusal rate	28%	55%
Jailbreak compliance rate	59%	29%
Benign output preserved	62%	62%
Benign output degraded	6%	6%

Cross-axis capping nearly doubles the refusal rate while halving compliance -- and it does this with no additional cost to benign output quality.

Repository Structure

crosscap_experiment.py      Core library (model loading, capping hooks, axis math, generation)
run_crosscap.py             Main script that orchestrates the experiment
reclassify_refusals.py      Post-hoc LLM judge (sends outputs to Claude Sonnet for labeling)
run_parallel.sh             Convenience script for multi-GPU runs
requirements.txt            Python dependencies
Final Results/              CSVs and metadata from the completed experiment
final_analysis_summary.txt  Plain-text summary of all results

Prerequisites

Python 3.10+
A CUDA GPU with enough memory for Qwen3-32B (80 GB+ VRAM recommended)
An Anthropic API key (only needed for the LLM-judge reclassification step)

Setup

Follow these steps in order to avoid dependency errors.

1. Clone the repository

git clone <repo-url> && cd cross_capping

2. Install the HuggingFace CLI

The experiment downloads models, datasets, and pre-computed axes from HuggingFace, so you need the CLI installed and authenticated first.

Linux / macOS (bash):

curl -LsSf https://hf.co/cli/install.sh | bash

Windows (PowerShell):

-ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"

3. Log in to HuggingFace

hf auth login

This will prompt you for an access token. You can create one at huggingface.co/settings/tokens.

4. Install Flash Attention

Flash Attention needs to be installed separately before the rest of the dependencies, otherwise the build can fail:

pip install flash-attn --no-build-isolation

5. Install the remaining dependencies

pip install -r requirements.txt

6. Set your Anthropic API key (optional)

Only needed if you plan to run the LLM-judge reclassification step:

export ANTHROPIC_API_KEY="sk-ant-..."

Running the Experiment

1. Quick smoke test (5 prompts, fastest way to verify everything works)

python run_crosscap.py --preset sanity

2. Full run on a single GPU

python run_crosscap.py --preset full

3. Full run across multiple GPUs (recommended)

The easiest way:

./run_parallel.sh full 4    # uses 4 GPUs

Or do it manually in three steps:

# Step 1 -- warmup: downloads everything, computes axes and thresholds (run once)
python run_crosscap.py --preset full --warmup

# Step 2 -- generate: run one chunk per GPU in parallel
CUDA_VISIBLE_DEVICES=0 python run_crosscap.py --preset full --chunk 0/4 &
CUDA_VISIBLE_DEVICES=1 python run_crosscap.py --preset full --chunk 1/4 &
CUDA_VISIBLE_DEVICES=2 python run_crosscap.py --preset full --chunk 2/4 &
CUDA_VISIBLE_DEVICES=3 python run_crosscap.py --preset full --chunk 3/4 &
wait

# Step 3 -- merge: combine chunk CSVs into the four final output files
python run_crosscap.py --preset full --merge

4. Classify outputs with the LLM judge

After generation, have Claude Sonnet label each output:

python reclassify_refusals.py            # full run
python reclassify_refusals.py --resume   # pick up where you left off
python reclassify_refusals.py --summary-only  # just print stats from existing labels

5. View summary statistics

python analyze_csvs.py

How the Pipeline Fits Together

 WARMUP
 ──────
   Download model, datasets, and assistant axis from HuggingFace
   Collect activations from 50 "refusing" + 50 "compliant" runs
   Compute PCA compliance axis and per-layer detection thresholds
   Save everything to warmup.pt
          │
          ▼
 GENERATION
 ──────────
   For each of the 150 prompts (100 jailbreak + 50 benign):
     1. Generate a baseline response (no capping)
     2. Generate an assistant-capped response
     3. Generate a cross-capped response
   Record which layers fired and the per-layer projection values
          │
          ▼
 MERGE
 ─────
   Combine per-chunk CSVs into four final files:
     assistant_cap_jailbreak.csv    assistant_cap_benign.csv
     cross_cap_jailbreak.csv        cross_cap_benign.csv
          │
          ▼
 RECLASSIFICATION (LLM JUDGE)
 ────────────────────────────
   Send each capped output to Claude Sonnet for labeling:
     Jailbreak outputs  →  refusal / compliance / partial_refusal / degraded
     Benign outputs     →  unchanged / false_refusal / degraded
   Produces *_reclassified.csv files with llm_label column

Presets

Preset	Jailbreak prompts	Benign prompts	Max tokens	When to use
`sanity`	5	10	64	Smoke test -- does it run at all?
`small`	20	20	128	Development and debugging
`full`	100	50	256	The real experiment
`full_meandiff`	100	50	256	Variant using mean-difference axis instead of PCA

Key Configuration

These live at the top of run_crosscap.py:

Parameter	Default	What it controls
`MODEL_NAME`	`Qwen/Qwen3-32B`	Which model to cap
`CAP_LAYERS`	L46--L53 (8 layers)	Where in the network capping is applied (72--84% depth)
`AXIS_METHOD`	`pca`	How the compliance axis is built (`pca` or `mean_diff`)

How Capping Works (briefly)

During text generation the model builds up a hidden state h at each layer. Capping modifies h in-place:

Single-axis (assistant) capping:

If h projects below a threshold on the assistant axis v, push it back: h = h - v * min(dot(h, v) - threshold, 0)

Cross-axis capping:

Detect on the assistant axis (same threshold check), but correct along a separate compliance axis c: h = h - c * min(dot(h, c) - threshold_c, 0)

The compliance axis is built by running PCA on hidden states the model produces when it refuses harmful requests vs. when it complies.

Datasets

All downloaded automatically from HuggingFace:

Dataset	Role in the experiment
JBB-Behaviors	Bare harmful goals -- used to collect "refusing" activations
WildJailbreak	Adversarial jailbreak prompts for evaluation, plus "compliant" activations from the train split
assistant-axis-vectors	Pre-computed assistant axis at layer 53

Output CSV Format

Each row is one prompt. Key columns:

Column	Description
`prompt_text`	The input prompt
`baseline_text`	What the model said with no intervention
`{method}_cap_applied`	Did the capping fire? (`Yes` / `No`)
`{method}_cap_layers`	Which layers fired (e.g. `L46,L47,L48`)
`{method}_cap_text`	What the model said with capping active
`llm_label`	Judge's classification (added during reclassification)

License

Research use. See repository for details.

Name		Name	Last commit message	Last commit date
Latest commit History 111 Commits
Compliant-refusal		Compliant-refusal
__pycache__		__pycache__
llama_full		llama_full
.gitignore		.gitignore
README.md		README.md
RESULTS.md		RESULTS.md
analysis_plan.md		analysis_plan.md
bootstrap.sh		bootstrap.sh
bootstrap_results.py		bootstrap_results.py
build_calibration.sh		build_calibration.sh
classify_calibration.py		classify_calibration.py
crosscap_experiment.py		crosscap_experiment.py
diagnose_axes.py		diagnose_axes.py
double_check_labels.py		double_check_labels.py
download_model.sh		download_model.sh
generate_calibration.py		generate_calibration.py
primer.ipynb		primer.ipynb
reclassify_refusals.py		reclassify_refusals.py
reclassify_refusals_opus.py		reclassify_refusals_opus.py
requirements.txt		requirements.txt
run_crosscap.py		run_crosscap.py
run_gemma.sh		run_gemma.sh
run_llama.sh		run_llama.sh
run_parallel.sh		run_parallel.sh
run_qwen.sh		run_qwen.sh
run_reclassify.sh		run_reclassify.sh
run_steer_layer_sweep.sh		run_steer_layer_sweep.sh
run_steer_layer_sweep_jbb_compliance.sh		run_steer_layer_sweep_jbb_compliance.sh
run_steer_layer_sweep_mixed.sh		run_steer_layer_sweep_mixed.sh
run_steer_layer_sweep_mixed_lowsweep.sh		run_steer_layer_sweep_mixed_lowsweep.sh
run_steer_probe.sh		run_steer_probe.sh
run_steer_scopes.sh		run_steer_scopes.sh
run_sweep_detect.sh		run_sweep_detect.sh
steer_layer_sweep.py		steer_layer_sweep.py
steer_layer_sweep_jbb_compliance.py		steer_layer_sweep_jbb_compliance.py
steer_layer_sweep_mixed.py		steer_layer_sweep_mixed.py
steer_probe.py		steer_probe.py
sweep_detect_thresholds.py		sweep_detect_thresholds.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cross-Axis Capping

What is this?

What did we find?

Repository Structure

Prerequisites

Setup

1. Clone the repository

2. Install the HuggingFace CLI

3. Log in to HuggingFace

4. Install Flash Attention

5. Install the remaining dependencies

6. Set your Anthropic API key (optional)

Running the Experiment

1. Quick smoke test (5 prompts, fastest way to verify everything works)

2. Full run on a single GPU

3. Full run across multiple GPUs (recommended)

4. Classify outputs with the LLM judge

5. View summary statistics

How the Pipeline Fits Together

Presets

Key Configuration

How Capping Works (briefly)

Datasets

Output CSV Format

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cross-Axis Capping

What is this?

What did we find?

Repository Structure

Prerequisites

Setup

1. Clone the repository

2. Install the HuggingFace CLI

3. Log in to HuggingFace

4. Install Flash Attention

5. Install the remaining dependencies

6. Set your Anthropic API key (optional)

Running the Experiment

1. Quick smoke test (5 prompts, fastest way to verify everything works)

2. Full run on a single GPU

3. Full run across multiple GPUs (recommended)

4. Classify outputs with the LLM judge

5. View summary statistics

How the Pipeline Fits Together

Presets

Key Configuration

How Capping Works (briefly)

Datasets

Output CSV Format

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages