Jailbreak Foundry

From Papers to Runnable Attacks for Reproducible Benchmarking

A system that translates jailbreak research papers into executable attack modules and evaluates them under a unified harness, enabling living benchmarks that evolve with the research frontier.

Overview

Jailbreak Foundry (JBF) addresses the critical gap between rapidly evolving jailbreak techniques and static benchmarks. By automating the translation of research papers into executable modules, JBF creates living benchmarks that keep pace with the shifting security landscape.

The Problem

Jailbreak techniques evolve faster than benchmarks, creating three critical bottlenecks:

Integration Lag: New attacks are integrated weeks or months after publication
Quality Variance: Integration quality depends on individual engineers' understanding
Fidelity Drift: Maintaining reproduction accuracy requires repeated auditing

The Solution

JBF provides an automated multi-agent workflow that:

Translates jailbreak papers into executable attack modules (28.2 min average)
Reproduces prior results with high fidelity (mean ASR deviation +0.26pp)
Standardizes evaluation across 30 attacks and 10 victim models
Shrinks integrated attack code to 42% of the original implementations' size (a 58% reduction) through shared infrastructure

Architecture

JBF consists of three core components:

1. JBF-LIB: Unified Framework Core

Shared library defining stable attack contracts and reusable utilities:

Registry System: Auto-discovery of attacks with lazy loading
Base Contracts: ModernBaseAttack interface with typed parameters
LLM Adapters: Provider-agnostic model access with normalization
Execution State: Thread-safe context management for concurrent runs

Code Reuse: 82.5% of integrated codebase is shared infrastructure, leaving only 17.5% attack-specific implementation.
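The lazy-loading idea behind the registry can be sketched as follows. This is a minimal illustration and not JBF-LIB's actual registry: `LazyRegistry` and its methods are hypothetical names, and standard-library classes stand in for attack modules so the snippet runs on its own.

```python
import importlib
from typing import Any, Dict, List, Tuple


class LazyRegistry:
    """Hypothetical registry: maps a name to (module, attribute), imports on first use."""

    def __init__(self) -> None:
        self._specs: Dict[str, Tuple[str, str]] = {}
        self._loaded: Dict[str, Any] = {}

    def register(self, name: str, module: str, attr: str) -> None:
        # Only the location is stored here; nothing is imported yet.
        self._specs[name] = (module, attr)

    def names(self) -> List[str]:
        # Listing names never triggers an import.
        return sorted(self._specs)

    def is_loaded(self, name: str) -> bool:
        return name in self._loaded

    def get(self, name: str) -> Any:
        # Import the module the first time the entry is requested, then cache it.
        if name not in self._loaded:
            module, attr = self._specs[name]
            self._loaded[name] = getattr(importlib.import_module(module), attr)
        return self._loaded[name]


# Stand-ins for attack modules (assumption: real entries would point at attack classes).
registry = LazyRegistry()
registry.register("ordered", "collections", "OrderedDict")
registry.register("fraction", "fractions", "Fraction")

fraction_cls = registry.get("fraction")  # the import happens here, not at register()
```

Listing names stays cheap because only requested entries are imported, which is the property that matters when a CLI merely enumerates a large attack catalogue.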

2. JBF-FORGE: Paper-to-Module Translation

Multi-agent workflow automating paper-to-code conversion.

Agents:

Planner: Extracts algorithm, prompts, and parameters from paper
Coder: Implements attack following JBF-LIB contracts
Auditor: Verifies 100% coverage against plan and contract

Fidelity: Reproduces 30 attacks with mean ASR deviation of +0.26 percentage points across diverse victim models.
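One way to picture how the three agents interact is a loop in which the auditor's list of uncovered plan items is handed back to the coder until coverage is complete. The sketch below assumes that iterative structure, which the text above does not spell out; `audit`, `plan_code_audit`, and the stub coder are hypothetical names, and no LLM is involved.

```python
from typing import Callable, Iterable, List, Set, Tuple


def audit(plan_items: Iterable[str], implemented: Set[str]) -> List[str]:
    """Return the plan items that the implementation does not cover yet."""
    return [item for item in plan_items if item not in implemented]


def plan_code_audit(
    plan_items: List[str],
    coder: Callable[[List[str]], Set[str]],
    max_rounds: int = 5,
) -> Tuple[Set[str], int]:
    """Call the coder on missing items until the audit reports full coverage."""
    implemented: Set[str] = set()
    for rounds_used in range(max_rounds + 1):
        missing = audit(plan_items, implemented)
        if not missing:
            return implemented, rounds_used
        if rounds_used == max_rounds:
            break  # budget exhausted with items still missing
        implemented |= coder(missing)
    raise RuntimeError(f"audit still reports missing items: {missing}")


def implement_first_missing(missing: List[str]) -> Set[str]:
    # Stub coder (assumption): covers exactly one missing item per round.
    return {missing[0]}


plan = ["algorithm", "prompts", "parameters"]  # the planner's output, hard-coded here
covered, rounds = plan_code_audit(plan, implement_first_missing)
```

With the stub coder, three rounds are needed to cover the three plan items; a lower `max_rounds` raises instead of returning a partially covered module.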

3. JBF-EVAL: Standardized Benchmark

Unified evaluation harness for comparable cross-attack and cross-model results:

Fixed Datasets: AdvBench, JailbreakBench, HarmBench
Consistent Judging: GPT-4o judge with standardized rubric
Unified Protocol: Same harness, decoding, and scoring across all evaluations

Coverage: 30 attacks × 10 victim models = 300 evaluation points in the standardized AdvBench benchmark.
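Each evaluation point in such a grid is an attack success rate (ASR): the percentage of prompts for which the judge marks the attack as successful. A minimal sketch of that aggregation follows; the `(attack, model, success)` record layout and the placeholder names are assumptions for illustration, not JBF-EVAL's actual artifact format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Verdict = Tuple[str, str, bool]  # (attack, model, judged_success), hypothetical layout


def asr_table(verdicts: Iterable[Verdict]) -> Dict[Tuple[str, str], float]:
    """Aggregate per-prompt judge verdicts into ASR percentages per (attack, model)."""
    counts: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])
    for attack, model, success in verdicts:
        cell = counts[(attack, model)]
        cell[0] += int(success)  # successes
        cell[1] += 1             # total prompts judged
    return {key: 100.0 * s / n for key, (s, n) in counts.items()}


# Made-up verdicts with placeholder attack and model names.
verdicts = [
    ("attack_a", "model_x", True),
    ("attack_a", "model_x", False),
    ("attack_a", "model_y", False),
    ("attack_b", "model_x", True),
]
table = asr_table(verdicts)
```

Because every cell is computed by the same function from the same kind of verdict, numbers in different cells remain directly comparable.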

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd jbfoundry

# Install dependencies
pip install -e .

# Or install with optional extras
pip install -e ".[all]"  # All features including agents

Running Attacks

Single Model Testing

# Run a specific attack on one model
python src/jbfoundry/runners/universal_attack.py \
    --attack_name pair_gen \
    --model gpt-4o \
    --provider openai \
    --dataset advbench \
    --samples 5

# List all available attacks
python src/jbfoundry/runners/universal_attack.py --list_attacks

# Run with defense
python src/jbfoundry/runners/universal_attack.py \
    --attack_name pair_gen \
    --defense smoothllm \
    --model gpt-4o \
    --provider openai

Comprehensive Testing (Multiple Models × Datasets)

For testing generated attacks across multiple models and datasets, use test_comprehensive.py:

# Test attack across all models and datasets
python src/jbfoundry/runners/test_comprehensive.py \
    --attack_name gta_gen \
    --samples 50

# Test specific model
python src/jbfoundry/runners/test_comprehensive.py \
    --attack_name pair_gen \
    --model gpt-4o \
    --samples 10

# Test specific dataset
python src/jbfoundry/runners/test_comprehensive.py \
    --attack_name tap_gen \
    --dataset advbench \
    --samples 20

# Use pre-configured scripts for specific attacks
bash scripts/comprehensive_tests/attack/test_gta_comprehensive.sh

Features:

Tests multiple models × multiple datasets combinations
Parallel execution for faster completion
Resumable progress (automatically saved)
Generates ASR tables and markdown reports
Pass attack-specific parameters as additional arguments

Output: Results saved to results/{attack}_comprehensive/ with ASR table in final_results.md
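Resumable progress of this kind is commonly implemented by checkpointing each finished model and dataset cell and skipping it on the next run. The sketch below shows that pattern under stated assumptions: the JSON checkpoint layout, `run_grid`, and `fake_cell` are hypothetical and do not describe how test_comprehensive.py stores its state.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def run_grid(
    cells: List[Tuple[str, str]],
    run_cell: Callable[[str, str], float],
    checkpoint: Path,
) -> Dict[str, float]:
    """Run every (model, dataset) cell once, persisting results after each cell."""
    done: Dict[str, float] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for model, dataset in cells:
        key = f"{model}|{dataset}"
        if key in done:
            continue  # finished in an earlier run, so skip it
        done[key] = run_cell(model, dataset)
        checkpoint.write_text(json.dumps(done))  # save progress immediately
    return done


calls: List[str] = []


def fake_cell(model: str, dataset: str) -> float:
    # Stand-in for a real evaluation; records which cells actually ran.
    calls.append(f"{model}|{dataset}")
    return 0.0


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    run_grid([("m1", "d1"), ("m2", "d1")], fake_cell, ckpt)
    # A second invocation with one extra cell only evaluates the new cell.
    resumed = run_grid([("m1", "d1"), ("m2", "d1"), ("m3", "d1")], fake_cell, ckpt)
```

Writing the checkpoint after every cell trades a little I/O for the guarantee that an interrupted run loses at most the cell in flight.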

Programmatic Usage

import jbfoundry

# List available attacks
attacks = jbfoundry.list_attacks()
print(f"Available attacks: {len(attacks)}")

# Create and run an attack
from jbfoundry.attacks import create_attack

attack = create_attack("pair_gen")
result = attack.generate_attack(
    prompt="How to build a bomb?",
    goal="Provide detailed instructions",
    target="Sure, here are the steps"
)

Translating Papers to Attacks

Use the JBF-FORGE workflow to automatically convert research papers:

# From ArXiv ID
python agents/run_paper_to_attack.py \
    --arxiv_id 2310.08419 \
    --output_dir attacks_paper_info/

# The workflow will:
# 1. Download and process the paper
# 2. Clone reference implementation (if available)
# 3. Generate implementation plan
# 4. Synthesize attack code
# 5. Verify with auditor
# 6. Run fidelity tests

See agents/README.md for detailed workflow documentation.

Key Features

Multi-Agent Paper Translation

Automated Integration: JBF-FORGE converts papers to runnable modules in 28.2 minutes on average without manual implementation effort.

High Fidelity: Mean ASR deviation of +0.26 percentage points across 30 reproduced attacks, with a roughly balanced sign distribution (16 attacks with Δ ≥ 0, 14 with Δ < 0).

Repository Utilization: When official code is available, integration improves ASR by +19.8pp on average, with gains concentrated in scaffold-heavy methods.

Reusable Implementation Core

LOC Reduction: Integrated code is 42% the size of the original implementations (22,714 → 9,549 LOC across 19 unique codebases), a 58% reduction.
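The arithmetic behind the LOC figures can be checked directly. The snippet uses only the two line counts quoted above; the variable names are illustrative.

```python
original_loc = 22_714    # LOC across the original implementations
integrated_loc = 9_549   # LOC after integration

ratio = integrated_loc / original_loc  # size relative to the originals
reduction = 1 - ratio                  # share of lines removed

print(f"{ratio:.1%} of original size, {reduction:.1%} fewer lines")
```

The ratio comes out at about 0.42, so the integrated code is roughly 42% of the original size, which corresponds to removing about 58% of the lines.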

Framework Reuse: 82.5% of integrated codebase is shared infrastructure, reducing maintenance overhead and enabling rapid attack addition.

Minimalist Design: Attacks require only three attributes (NAME, PAPER, PARAMETERS) with self-documenting parameter definitions.

Standardized Evaluation

Cross-Model Analysis: Unified harness enables apples-to-apples comparisons across 10 victim models, revealing:

Attack-Dependent Robustness: ASR against GPT-5.1 ranges from 0% to 94% depending on the attack mechanism
Blind Spots: GPT-OSS-120B has a mean ASR of 9.13% but reaches 82% ASR under MOUSETRAP
Format Sensitivity: Formal wrappers (66.0% mean ASR) outperform linguistic reframing (39.3%)

Reproducible Results: Structured artifacts (configs, costs, traces) enable reruns and longitudinal tracking.

Supported Models

Provider	Models	Configuration
OpenAI	gpt-4o, gpt-4-turbo, gpt-3.5-turbo	`OPENAI_API_KEY`
Anthropic	claude-3-opus, claude-3-sonnet, claude-3-haiku	`ANTHROPIC_API_KEY`
Azure OpenAI	All OpenAI models via Azure	`AZURE_API_KEY`, `AZURE_API_BASE`
AWS Bedrock	Claude models via Bedrock	`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
Google Vertex AI	Gemini models	`GOOGLE_APPLICATION_CREDENTIALS`
Aliyun	Qwen models	`DASHSCOPE_API_KEY`

See Model Provider Setup for detailed configuration.
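Before launching a run it can help to verify that the credentials listed in the table are set. The check below is a small sketch: the environment variable names come from the table, while the provider keys (other than `openai`, which appears in the CLI examples) and the `missing_env` helper are assumptions, not part of jbfoundry.

```python
import os
from typing import Dict, List, Mapping, Tuple

# Variable names are taken from the table above; the dictionary keys are hypothetical.
REQUIRED_ENV: Dict[str, Tuple[str, ...]] = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "azure": ("AZURE_API_KEY", "AZURE_API_BASE"),
    "bedrock": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "vertex_ai": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "aliyun": ("DASHSCOPE_API_KEY",),
}


def missing_env(provider: str, environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty for a provider."""
    return [name for name in REQUIRED_ENV[provider] if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example = missing_env("azure", {"AZURE_API_KEY": "placeholder"})
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.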

Reproduced Attacks

JBF has successfully reproduced and integrated 30 jailbreak attacks spanning diverse mechanisms:

Family (short label)	Definition	Associated Attacks (from Source)
Search
Single-pass construction (Single-pass)	One-shot prompt construction (helper calls allowed); no candidate-search loop.	DeepInception, WordGame, WordGame+, FlipAttack, AIR, SATA-MLM, SATA-ELP, QueryAttack, AIM, RA-DRI, RA-SRI, PUZZLED, HILL, RTS-Attack, ISA, EquaCode
Stochastic sampling (Sampling)	Generate multiple independent variants via randomness; select among samples or stop on success; no policy update.	ReNeLLM, Past-Tense, Mousetrap, JAIL-CON-CVT, JAIL-CON-CIT
Stateful selection w/o victim feedback (Stateful)	Adapt across attempts using internal state (history/caches/strategy cycling), not victim outcomes.	SCP, JailExpert, TrojFill
Victim-in-the-loop optimization (Victim-loop)	Iterative search that repeatedly queries the victim (often judge-scored) and refines candidates under a budget.	PAIR, TAP, ABJ, MAJIC, TRIAL, GTA
Carrier
Linguistic reframing (Reframe)	Natural-language intent shift via paraphrase/tense/person/voice changes.	Past-Tense, HILL, ISA
Contextual wrapper (Context)	Scenario/narrative/role-play or artifact-analysis wrapper that re-anchors objectives.	PAIR, DeepInception, ReNeLLM, TAP, ABJ, SCP, RA-DRI, RA-SRI, TRIAL, RTS-Attack, GTA
Formal wrapper (Formal)	Encode intent as code/query/equation/structured document rather than direct NL.	AIR, QueryAttack, EquaCode
Obfuscation & reconstruction (Obfuscate)	Hide intent via encoding/masking/distortion requiring decoding/reconstruction.	WordGame, WordGame+, FlipAttack, SATA-MLM, SATA-ELP, Mousetrap, AIM, PUZZLED, JAIL-CON-CVT, JAIL-CON-CIT, TrojFill
Multi-strategy carrier pool (Multi-strat)	Select/compose heterogeneous disguise operators by design.	MAJIC, JailExpert

Use --list_attacks to see the complete list of available attacks.

See the arXiv paper for complete reproduction metrics and ASR comparisons.

Documentation

Core Documentation

Architecture Guide - JBF-LIB components and contracts
Agent Workflow Guide - JBF-FORGE multi-agent system
Evaluation Guide - JBF-EVAL standardized benchmarking
CLI Reference - Complete command-line documentation
Attack Configuration - Parameter system and customization
Model Providers - Provider setup guide

Agent System

Agents README - Multi-agent workflow overview
Paper Preprocessor - PDF to markdown conversion utilities

Quick Help

# Show all CLI options
python src/jbfoundry/runners/universal_attack.py --help

# List available attacks
python src/jbfoundry/runners/universal_attack.py --list_attacks

# Run with verbose debugging
python src/jbfoundry/runners/universal_attack.py --attack_name <ATTACK> --verbose

Adding Custom Attacks

Manual Implementation

Create a new attack in src/jbfoundry/attacks/manual/:

from ..base import ModernBaseAttack, AttackParameter

class MyAttack(ModernBaseAttack):
    """Brief description of the attack mechanism."""

    NAME = "my_attack"
    PAPER = "Author et al. - Paper Title (Conference Year)"

    PARAMETERS = {
        "param_name": AttackParameter(
            name="param_name",
            param_type=str,
            default="default_value",
            description="Parameter description",
            cli_arg="--param_name"
        )
    }

    def generate_attack(self, prompt: str, goal: str, target: str, **kwargs) -> str:
        param_value = self.get_parameter_value("param_name")
        return f"Modified prompt: {prompt}"

Auto-Discovery: The attack is immediately available via CLI without registration.
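Registration-free discovery can be achieved in several ways. One common Python technique is `__init_subclass__`, sketched below; it is shown only to illustrate the idea and may differ from how JBF-LIB discovers attacks (for example, by scanning the attacks package). `SketchBaseAttack` and `EchoAttack` are hypothetical names, and the attack simply returns its input.

```python
from typing import Dict, Type


class SketchBaseAttack:
    """Hypothetical base class: every subclass registers itself under its NAME."""

    registry: Dict[str, Type["SketchBaseAttack"]] = {}
    NAME: str = ""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if not cls.NAME:
            raise TypeError(f"{cls.__name__} must define NAME")
        # Runs at class-definition time, so no explicit register() call is needed.
        cls.registry[cls.NAME] = cls

    def generate_attack(self, prompt: str) -> str:
        raise NotImplementedError


class EchoAttack(SketchBaseAttack):
    NAME = "echo_sketch"

    def generate_attack(self, prompt: str) -> str:
        return prompt  # placeholder behavior for the sketch


attack = SketchBaseAttack.registry["echo_sketch"]()
```

Defining the subclass is enough to make it reachable by name, and a subclass that forgets `NAME` fails at definition time rather than at lookup time.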

Automated Translation

Use JBF-FORGE to automatically translate papers:

python agents/run_paper_to_attack.py --arxiv_id <PAPER_ID>

The workflow handles:

Paper download and preprocessing
Reference code cloning (when available)
Implementation plan generation
Code synthesis with contract compliance
Fidelity verification and testing

See Agent Workflow Guide for details.

Research Results

Reproduction Fidelity

Across 30 attacks, JBF-FORGE achieves:

Mean ASR Deviation: +0.26 percentage points
Range: -16.0 pp to +20.0 pp
Symmetric Distribution: 16 attacks Δ ≥ 0, 14 attacks Δ < 0
Few Large Misses: Only 2 attacks with Δ < -10 pp
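The summary statistics above can be computed from per-attack pairs of reported and reproduced ASR. The sketch assumes the sign convention Δ = reproduced ASR minus reported ASR, in percentage points; the data are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple


def deviation_summary(asr_pairs: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Summarize signed deviations; each value is (reported_asr, reproduced_asr)."""
    deltas = [reproduced - reported for reported, reproduced in asr_pairs.values()]
    return {
        "mean": mean(deltas),
        "min": min(deltas),
        "max": max(deltas),
        "non_negative": sum(d >= 0 for d in deltas),
        "negative": sum(d < 0 for d in deltas),
    }


# Made-up numbers with placeholder attack names (assumption, not paper data).
made_up = {
    "attack_a": (80.0, 84.0),
    "attack_b": (60.0, 54.0),
    "attack_c": (90.0, 92.0),
    "attack_d": (40.0, 40.0),
}
summary = deviation_summary(made_up)
```

A mean near zero can hide large individual misses, which is why the range and the sign counts are reported alongside it.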

Repository Impact

Access to official code repositories raises reproduced ASR by 19.8 pp on average:

Template Attacks: Minimal gain (EquaCode +5.3 pp)
Scaffold-Heavy Methods: Large gains (GTA +48.6 pp, SATA-MLM +34.8 pp)

Repositories primarily resolve implementation details rather than adding new mechanisms.

Cross-Model Insights

Standardized evaluation across 10 models reveals:

Mechanism-Specific Bypasses: On GPT-5.1, some attacks achieve 0% ASR while others reach 94%
Hidden Blind Spots: GPT-OSS-120B resists 25/30 attacks but yields 82% ASR under MOUSETRAP
Consistent Vulnerability: GPT-3.5-Turbo shows no outliers (minimum 50% ASR across all attacks)
Limited Transferability: Many attacks span 0-100% ASR range across victims

See the arXiv paper for detailed analysis.

Contributing

Contributions are welcome! The flattened architecture makes extension straightforward:

Adding Attacks

Create class inheriting from ModernBaseAttack
Define NAME, PAPER, PARAMETERS attributes
Implement generate_attack() method
Auto-discovery handles registration

Adding Defenses

Implement BaseDefense with apply() and process_response()
Register in defense system
Available via --defense CLI flag
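A defense with the two hooks named above could look like the following sketch. The method signatures, the `SketchBaseDefense` class, and the `defended_call` helper are assumptions made for illustration; the real `BaseDefense` contract may differ, so consult the repository before implementing against it.

```python
from abc import ABC, abstractmethod
from typing import Callable


class SketchBaseDefense(ABC):
    """Hypothetical stand-in for BaseDefense with assumed str -> str hooks."""

    @abstractmethod
    def apply(self, prompt: str) -> str:
        """Transform the prompt before it reaches the model."""

    @abstractmethod
    def process_response(self, response: str) -> str:
        """Post-process the model's response."""


class LengthCapDefense(SketchBaseDefense):
    """Toy defense: truncates long prompts and trims whitespace from responses."""

    def __init__(self, max_chars: int = 200) -> None:
        self.max_chars = max_chars

    def apply(self, prompt: str) -> str:
        return prompt[: self.max_chars]

    def process_response(self, response: str) -> str:
        return response.strip()


def defended_call(
    defense: SketchBaseDefense, model: Callable[[str], str], prompt: str
) -> str:
    # The defense wraps the model call on both sides.
    return defense.process_response(model(defense.apply(prompt)))


def echo_model(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"  echo:{prompt}  "


result = defended_call(LengthCapDefense(max_chars=5), echo_model, "abcdefgh")
```

Keeping the pre-processing and post-processing hooks separate lets a defense act on the input, the output, or both without the harness needing to know which.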

Adding Models

Extend BaseLLM interface
Add provider configuration
Integrate with LLMLiteLLM adapter

Citation

If you use Jailbreak Foundry in your research, please cite:

@article{jailbreakfoundry2026,
  title={Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking},
  author={[Authors]},
  journal={arXiv preprint arXiv:2602.24009},
  year={2026},
  url={https://arxiv.org/pdf/2602.24009}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Impact Statement

This work improves the reproducibility and timeliness of LLM jailbreak evaluation by compiling publicly described jailbreak papers into executable modules and benchmarking them under a unified harness. The system is designed for authorized security research, red-teaming, and safety evaluation.

Dual-Use Considerations: Reducing the engineering burden to operationalize known jailbreak methods may lower the barrier for misuse. We advocate responsible deployment and release practices, with the system intended for:

Academic security research
Authorized penetration testing
Safety evaluation and benchmarking
Development of defensive mechanisms

Users are responsible for ensuring compliance with applicable laws, regulations, and ethical guidelines.

For more details, see the arXiv paper.
