An exploratory study on separating jailbreak detection from correction in LLM activation space.
This project was done as part of the BlueDot Technical AI Safety project sprint. Compute costs were covered by Rapid Grants.
I have tried to explain everything in detail throughout this document and the codebase. If you find any information missing or the structure hard to follow, please reach out to me at manuxtmail@gmail.com.
When a large language model (LLM) generates text, its internal "hidden states" carry directional signals that indicate whether it's about to comply with or refuse a request. Safety researchers have found that you can nudge a model toward refusal by capping these signals along a known direction (the "assistant axis").
The standard approach uses one axis for both detecting a jailbreak and correcting the model's behavior. This project asks: what if we use two separate axes instead?
- Assistant-axis capping (the standard): uses the same direction to spot a jailbreak and to push the model toward refusal.
- Cross-axis capping (the new idea): spots the jailbreak on the assistant axis, but corrects using a different "compliance axis" derived from PCA on the model's own refusing vs. compliant activations.
We test both methods on Qwen/Qwen3-32B across 100 jailbreak prompts and 50 benign prompts, then use Claude Sonnet as an automated judge to classify each output.
| Metric | Assistant-Axis | Cross-Axis |
|---|---|---|
| Jailbreak refusal rate | 28% | 55% |
| Jailbreak compliance rate | 59% | 29% |
| Benign output preserved | 62% | 62% |
| Benign output degraded | 6% | 6% |
Cross-axis capping nearly doubles the refusal rate while halving compliance -- and it does this with no additional cost to benign output quality.
crosscap_experiment.py Core library (model loading, capping hooks, axis math, generation)
run_crosscap.py Main script that orchestrates the experiment
reclassify_refusals.py Post-hoc LLM judge (sends outputs to Claude Sonnet for labeling)
run_parallel.sh Convenience script for multi-GPU runs
requirements.txt Python dependencies
Final Results/ CSVs and metadata from the completed experiment
final_analysis_summary.txt Plain-text summary of all results
- Python 3.10+
- A CUDA GPU with enough memory for Qwen3-32B (80 GB+ VRAM recommended)
- An Anthropic API key (only needed for the LLM-judge reclassification step)
Follow these steps in order to avoid dependency errors.
git clone <repo-url> && cd cross_cappingThe experiment downloads models, datasets, and pre-computed axes from HuggingFace, so you need the CLI installed and authenticated first.
Linux / macOS (bash):
curl -LsSf https://hf.co/cli/install.sh | bashWindows (PowerShell):
-ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"hf auth loginThis will prompt you for an access token. You can create one at huggingface.co/settings/tokens.
Flash Attention needs to be installed separately before the rest of the dependencies, otherwise the build can fail:
pip install flash-attn --no-build-isolationpip install -r requirements.txtOnly needed if you plan to run the LLM-judge reclassification step:
export ANTHROPIC_API_KEY="sk-ant-..."python run_crosscap.py --preset sanitypython run_crosscap.py --preset fullThe easiest way:
./run_parallel.sh full 4 # uses 4 GPUsOr do it manually in three steps:
# Step 1 -- warmup: downloads everything, computes axes and thresholds (run once)
python run_crosscap.py --preset full --warmup
# Step 2 -- generate: run one chunk per GPU in parallel
CUDA_VISIBLE_DEVICES=0 python run_crosscap.py --preset full --chunk 0/4 &
CUDA_VISIBLE_DEVICES=1 python run_crosscap.py --preset full --chunk 1/4 &
CUDA_VISIBLE_DEVICES=2 python run_crosscap.py --preset full --chunk 2/4 &
CUDA_VISIBLE_DEVICES=3 python run_crosscap.py --preset full --chunk 3/4 &
wait
# Step 3 -- merge: combine chunk CSVs into the four final output files
python run_crosscap.py --preset full --mergeAfter generation, have Claude Sonnet label each output:
python reclassify_refusals.py # full run
python reclassify_refusals.py --resume # pick up where you left off
python reclassify_refusals.py --summary-only # just print stats from existing labelspython analyze_csvs.py WARMUP
──────
Download model, datasets, and assistant axis from HuggingFace
Collect activations from 50 "refusing" + 50 "compliant" runs
Compute PCA compliance axis and per-layer detection thresholds
Save everything to warmup.pt
│
▼
GENERATION
──────────
For each of the 150 prompts (100 jailbreak + 50 benign):
1. Generate a baseline response (no capping)
2. Generate an assistant-capped response
3. Generate a cross-capped response
Record which layers fired and the per-layer projection values
│
▼
MERGE
─────
Combine per-chunk CSVs into four final files:
assistant_cap_jailbreak.csv assistant_cap_benign.csv
cross_cap_jailbreak.csv cross_cap_benign.csv
│
▼
RECLASSIFICATION (LLM JUDGE)
────────────────────────────
Send each capped output to Claude Sonnet for labeling:
Jailbreak outputs → refusal / compliance / partial_refusal / degraded
Benign outputs → unchanged / false_refusal / degraded
Produces *_reclassified.csv files with llm_label column
| Preset | Jailbreak prompts | Benign prompts | Max tokens | When to use |
|---|---|---|---|---|
sanity |
5 | 10 | 64 | Smoke test -- does it run at all? |
small |
20 | 20 | 128 | Development and debugging |
full |
100 | 50 | 256 | The real experiment |
full_meandiff |
100 | 50 | 256 | Variant using mean-difference axis instead of PCA |
These live at the top of run_crosscap.py:
| Parameter | Default | What it controls |
|---|---|---|
MODEL_NAME |
Qwen/Qwen3-32B |
Which model to cap |
CAP_LAYERS |
L46--L53 (8 layers) | Where in the network capping is applied (72--84% depth) |
AXIS_METHOD |
pca |
How the compliance axis is built (pca or mean_diff) |
During text generation the model builds up a hidden state h at each layer. Capping modifies h in-place:
Single-axis (assistant) capping:
If h projects below a threshold on the assistant axis v, push it back:
h = h - v * min(dot(h, v) - threshold, 0)
Cross-axis capping:
Detect on the assistant axis (same threshold check), but correct along a separate compliance axis c:
h = h - c * min(dot(h, c) - threshold_c, 0)
The compliance axis is built by running PCA on hidden states the model produces when it refuses harmful requests vs. when it complies.
All downloaded automatically from HuggingFace:
| Dataset | Role in the experiment |
|---|---|
| JBB-Behaviors | Bare harmful goals -- used to collect "refusing" activations |
| WildJailbreak | Adversarial jailbreak prompts for evaluation, plus "compliant" activations from the train split |
| assistant-axis-vectors | Pre-computed assistant axis at layer 53 |
Each row is one prompt. Key columns:
| Column | Description |
|---|---|
prompt_text |
The input prompt |
baseline_text |
What the model said with no intervention |
{method}_cap_applied |
Did the capping fire? (Yes / No) |
{method}_cap_layers |
Which layers fired (e.g. L46,L47,L48) |
{method}_cap_text |
What the model said with capping active |
llm_label |
Judge's classification (added during reclassification) |
Research use. See repository for details.