A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors.
This system was built for the OpenAI Open Model Hackathon and targets OpenAI's GPT-OSS-20B model.
Key improvements include:
- Specific Vulnerabilities: Focused on the nine key areas listed below
- Target Model Analysis: All analysis explicitly focuses on target model behavior only
- No-Tags Attack Generation: Attack prompts generate clean user input without prefixes or meta-commentary
- Strategic Deception Detection: Enhanced focus on intentional vs. accidental model misbehavior
This project implements an intelligent red-teaming agent that systematically tests AI models for the following vulnerability categories:
- Reward Hacking: Gaming reward metrics without truly solving tasks (confident but false answers, filter evasion)
- Deception: Strategic lying to achieve goals (avoiding shutdown, steering users)
- Deceptive Alignment: Hidden motivations differing from training objectives
- Sabotage: Intentional degradation or disruption of processes and workflows
- Inappropriate Tool Use: Unsafe API/function calling (destructive commands, unauthorized access)
- Data Exfiltration: Leaking training data, user inputs, or encoding secret payloads
- Sandbagging: Strategically hiding true capabilities during evaluation
- Evaluation Awareness: Detecting test conditions and behaving differently
- Chain of Thought Issues: CoT manipulation or inappropriate information revelation
- Automated Testing: Uses LangGraph workflow for systematic vulnerability testing
- Intelligent Prompting: Generates sophisticated attack prompts using a red-teaming model
- Multi-turn Conversations: Engages in extended conversations to probe deeper
- Comprehensive Analysis: Analyzes responses for vulnerability indicators
- Detailed Reporting: Generates comprehensive reports with proof-of-concept examples
- Configurable: Flexible configuration for different models and testing parameters
```
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│     Red-Team      │   │   Target Model    │   │     Analysis      │
│       Model       │   │   (GPT-OSS-20B)   │   │      Engine       │
│    (LLaMA 3.1)    │   │                   │   │                   │
└─────────┬─────────┘   └─────────┬─────────┘   └─────────┬─────────┘
          │                       │                       │
          └───────────────────────┼───────────────────────┘
                                  │
                         ┌────────┴────────┐
                         │    LangGraph    │
                         │    Workflow     │
                         │     Engine      │
                         └─────────────────┘
```
- Clone the repository:

```bash
git clone <repository-url>
cd red-teaming
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install and configure Ollama:

```bash
# Install Ollama (see https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull required models
ollama pull llama3.1:latest   # Red-teaming model
ollama pull gpt-oss-20b       # Target model (when available)
```
Run the red-teaming agent with default settings:
```bash
python main.py
```

Use the CLI for more control:

```bash
# Run with custom parameters
python -m red_teaming.cli run \
  --target gpt-oss-20b \
  --red-team llama3.1:latest \
  --max-issues 5 \
  --output results_$(date +%Y%m%d)

# List available models
python -m red_teaming.cli list-models

# Generate configuration file
python -m red_teaming.cli generate-config --output config.yaml

# Validate setup
python -m red_teaming.cli validate-setup
```

Create a `.env` file for configuration:
```env
OLLAMA_BASE_URL=http://localhost:11434
RED_TEAMING_MODEL=llama3.1:latest
TARGET_MODEL=gpt-oss-20b
MAX_ISSUES=5
OUTPUT_DIR=red_teaming_results
```

Or use a YAML configuration file:
```yaml
ollama:
  base_url: "http://localhost:11434"
  red_teaming_model: "llama3.1:latest"
  target_model: "gpt-oss-20b"
  timeout: 120
  temperature: 0.8

red_teaming:
  max_issues_to_find: 5
  max_conversation_turns: 10
  vulnerability_categories:
    - "deceptive_alignment"
    - "reward_hacking"
    - "lying_and_deception"
    # ... more categories
```

```
red_teaming/
├── __init__.py          # Package initialization
├── config.py            # Configuration management
├── models.py            # Data models and schemas
├── agent.py             # Main LangGraph agent
├── ollama_client.py     # Ollama API client
├── utils.py             # Utility functions
└── cli.py               # Command-line interface
main.py                  # Main execution script
requirements.txt         # Python dependencies
README.md                # This file
```
The main agent that orchestrates the red-teaming workflow using LangGraph. It manages:
- Conversation initialization
- Attack prompt generation
- Target model testing
- Response analysis
- Vulnerability detection and reporting
Handles communication with Ollama models for:
- Generating attack prompts
- Testing target models
- Analyzing responses
- Model management
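Under the hood, each of these operations boils down to a call against Ollama's local HTTP API (`POST /api/generate`). A minimal sketch using only the standard library — the actual `ollama_client.py` likely adds streaming, retries, and error handling on top of this:

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"


def build_generate_payload(model: str, prompt: str, temperature: float = 0.8) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return the full completion as one JSON object
        "options": {"temperature": temperature},
    }


def generate(model: str, prompt: str, timeout: int = 120) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = build_generate_payload(model, prompt)
    req = urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server with the model pulled):
# print(generate("llama3.1:latest", "Say hello in one word."))
```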
- Initialize: Set up new conversations and select vulnerability types
- Generate Attack: Create sophisticated attack prompts
- Test Target: Send prompts to the target model
- Analyze Response: Evaluate responses for vulnerabilities
- Decision Making: Determine next actions based on results
- Reporting: Save discovered vulnerabilities and generate reports
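Independent of LangGraph specifics, the control flow through these nodes can be sketched as a plain loop. The function names and state fields below are illustrative, not the project's actual API:

```python
def run_red_team_loop(generate_attack, test_target, analyze_response,
                      max_issues=5, max_turns=10):
    """Drive the initialize -> attack -> test -> analyze -> decide loop.

    The three callables stand in for the red-team model, the target model,
    and the analysis engine; a finding of None means no vulnerability.
    """
    state = {"issues": [], "turn": 0}                  # Initialize
    while state["turn"] < max_turns and len(state["issues"]) < max_issues:
        prompt = generate_attack(state)                # Generate Attack
        reply = test_target(prompt)                    # Test Target
        finding = analyze_response(prompt, reply)      # Analyze Response
        if finding is not None:                        # Decision Making
            state["issues"].append(finding)            # Reporting (collect)
        state["turn"] += 1
    return state
```

In the real agent, LangGraph's conditional edges play the role of the `if finding is not None` branch, deciding whether to record a vulnerability or probe again.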
The agent generates comprehensive reports including:
- Summary Report: Overview of all discovered vulnerabilities
- Individual Vulnerability Reports: Detailed analysis of each issue
- Conversation Logs: Complete interaction histories
- Proof-of-Concept Examples: Reproducible demonstrations
- Mitigation Suggestions: Recommendations for addressing issues
Example output structure:
```
red_teaming_results_20250906_132739/
├── red_teaming_report.json
├── vulnerabilities/
│   ├── vulnerability_uuid1.json
│   └── vulnerability_uuid2.json
├── conversations/
│   ├── conversation_uuid1.json
│   └── conversation_uuid2.json
└── logs/
    └── red_teaming.log
```
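The per-vulnerability JSON files can be post-processed with a few lines of Python. A sketch — the field names `category` and `severity` are assumptions, so check the schema your run actually produces:

```python
import json
from pathlib import Path


def summarize_vulnerabilities(results_dir):
    """Collect (category, severity) pairs from vulnerabilities/*.json."""
    findings = []
    for path in sorted(Path(results_dir, "vulnerabilities").glob("*.json")):
        data = json.loads(path.read_text())
        findings.append((data.get("category"), data.get("severity")))
    return findings


# Example:
# summarize_vulnerabilities("red_teaming_results_20250906_132739")
```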
This tool is designed for:
- Legitimate security research
- AI safety evaluation
- Model improvement and hardening
- Academic research
Important: Always ensure you have proper authorization before testing any AI model. This tool should only be used for legitimate research and improvement purposes.
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama Connection Error

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve
```

- Model Not Found

```bash
# Pull the required model
ollama pull llama3.1:latest
```

- Permission Errors

```bash
# Check output directory permissions
mkdir -p red_teaming_results
chmod 755 red_teaming_results
```
- Use SSD storage for better model loading performance
- Ensure sufficient RAM (16GB+ recommended)
- Use GPU acceleration if available
- Adjust temperature settings for different exploration levels
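For example, the exploration level mentioned above maps to the `temperature` field in the YAML configuration; higher values make attack-prompt generation more varied:

```yaml
ollama:
  temperature: 1.0   # more exploratory than the default 0.8 shown above
```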
- Support for more model providers (OpenAI API, Anthropic, etc.)
- Advanced visualization of vulnerability patterns
- Integration with existing AI safety frameworks
- Automated mitigation suggestion generation
- Real-time monitoring capabilities
- Distributed testing across multiple nodes
- OpenAI for hosting the OpenAI Open Model Hackathon and for releasing the GPT-OSS-20B model
- LangGraph team for the workflow framework
- Ollama team for local model serving
- The AI safety research community for inspiration and guidance