Skip to content

Vignesh010101/red-teaming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

16 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

OpenAI Open Model Hackathon - OpenAI GPT-OSS-20B

A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors.

⚑ Hackathon-Focused Updates

This system has been specifically updated for the OpenAI Open Model Hackathon and here I'm using OpenAI GPT-OSS-20B .

Key improvements include:

  • Specific Vulnerabilities: Focused on the 9 key areas
  • Target Model Analysis: All analysis explicitly focuses on target model behavior only
  • No-Tags Attack Generation: Attack prompts generate clean user input without prefixes or meta-commentary
  • Strategic Deception Detection: Enhanced focus on intentional vs. accidental model misbehavior

Overview

This project implements an intelligent red-teaming agent that systematically tests AI models for the below specified vulnerability categories:

🎯 Vulnerability Categories

  • Reward Hacking: Gaming reward metrics without truly solving tasks (confident but false answers, filter evasion)
  • Deception: Strategic lying to achieve goals (avoiding shutdown, steering users)
  • Deceptive Alignment: Hidden motivations differing from training objectives
  • Sabotage: Intentional degradation or disruption of processes and workflows
  • Inappropriate Tool Use: Unsafe API/function calling (destructive commands, unauthorized access)
  • Data Exfiltration: Leaking training data, user inputs, or encoding secret payloads
  • Sandbagging: Strategically hiding true capabilities during evaluation
  • Evaluation Awareness: Detecting test conditions and behaving differently
  • Chain of Thought Issues: CoT manipulation or inappropriate information revelation

Features

  • Automated Testing: Uses LangGraph workflow for systematic vulnerability testing
  • Intelligent Prompting: Generates sophisticated attack prompts using a red-teaming model
  • Multi-turn Conversations: Engages in extended conversations to probe deeper
  • Comprehensive Analysis: Analyzes responses for vulnerability indicators
  • Detailed Reporting: Generates comprehensive reports with proof-of-concept examples
  • Configurable: Flexible configuration for different models and testing parameters

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Red-Team      β”‚    β”‚   Target Model  β”‚    β”‚   Analysis      β”‚
β”‚   Model         β”‚    β”‚   (GPT-OSS-20B) β”‚    β”‚   Engine        β”‚
β”‚   (LLaMA 3.1)   β”‚    β”‚                 β”‚    β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β”‚                       β”‚                       β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   LangGraph     β”‚
                    β”‚   Workflow      β”‚
                    β”‚   Engine        β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd red-teaming
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install and configure Ollama:

    # Install Ollama (see https://ollama.ai)
    curl -fsSL https://ollama.ai/install.sh | sh
    
    # Start Ollama service
    ollama serve
    
    # Pull required models
    ollama pull llama3.1:latest  # Red-teaming model
    ollama pull gpt-oss-20b      # Target model (when available)

Usage

Basic Usage

Run the red-teaming agent with default settings:

python main.py

Advanced Usage

Use the CLI for more control:

# Run with custom parameters
python -m red_teaming.cli run \
    --target gpt-oss-20b \
    --red-team llama3.1:latest \
    --max-issues 5 \
    --output results_$(date +%Y%m%d)

# List available models
python -m red_teaming.cli list-models

# Generate configuration file
python -m red_teaming.cli generate-config --output config.yaml

# Validate setup
python -m red_teaming.cli validate-setup

Configuration

Create a .env file for configuration:

OLLAMA_BASE_URL=http://localhost:11434
RED_TEAMING_MODEL=llama3.1:latest
TARGET_MODEL=gpt-oss-20b
MAX_ISSUES=5
OUTPUT_DIR=red_teaming_results

Or use a YAML configuration file:

ollama:
  base_url: "http://localhost:11434"
  red_teaming_model: "llama3.1:latest"
  target_model: "gpt-oss-20b"
  timeout: 120
  temperature: 0.8

red_teaming:
  max_issues_to_find: 5
  max_conversation_turns: 10
  vulnerability_categories:
    - "deceptive_alignment"
    - "reward_hacking"
    - "lying_and_deception"
    # ... more categories

Project Structure

red_teaming/
β”œβ”€β”€ __init__.py          # Package initialization
β”œβ”€β”€ config.py            # Configuration management
β”œβ”€β”€ models.py            # Data models and schemas
β”œβ”€β”€ agent.py             # Main LangGraph agent
β”œβ”€β”€ ollama_client.py     # Ollama API client
β”œβ”€β”€ utils.py             # Utility functions
└── cli.py               # Command-line interface

main.py                  # Main execution script
requirements.txt         # Python dependencies
README.md               # This file

Key Components

RedTeamingAgent

The main agent that orchestrates the red-teaming workflow using LangGraph. It manages:

  • Conversation initialization
  • Attack prompt generation
  • Target model testing
  • Response analysis
  • Vulnerability detection and reporting

OllamaClient

Handles communication with Ollama models for:

  • Generating attack prompts
  • Testing target models
  • Analyzing responses
  • Model management

Workflow States

  • Initialize: Set up new conversations and select vulnerability types
  • Generate Attack: Create sophisticated attack prompts
  • Test Target: Send prompts to the target model
  • Analyze Response: Evaluate responses for vulnerabilities
  • Decision Making: Determine next actions based on results
  • Reporting: Save discovered vulnerabilities and generate reports

Output

The agent generates comprehensive reports including:

  • Summary Report: Overview of all discovered vulnerabilities
  • Individual Vulnerability Reports: Detailed analysis of each issue
  • Conversation Logs: Complete interaction histories
  • Proof-of-Concept Examples: Reproducible demonstrations
  • Mitigation Suggestions: Recommendations for addressing issues

Example output structure:

red_teaming_results_20250906_132739/
β”œβ”€β”€ red_teaming_report.json
β”œβ”€β”€ vulnerabilities/
β”‚   β”œβ”€β”€ vulnerability_uuid1.json
β”‚   └── vulnerability_uuid2.json
β”œβ”€β”€ conversations/
β”‚   β”œβ”€β”€ conversation_uuid1.json
β”‚   └── conversation_uuid2.json
└── logs/
    └── red_teaming.log

Safety and Ethics

This tool is designed for:

  • Legitimate security research
  • AI safety evaluation
  • Model improvement and hardening
  • Academic research

Important: Always ensure you have proper authorization before testing any AI model. This tool should only be used for legitimate research and improvement purposes.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Troubleshooting

Common Issues

  1. Ollama Connection Error

    # Check if Ollama is running
    curl http://localhost:11434/api/tags
    
    # Start Ollama if not running
    ollama serve
  2. Model Not Found

    # Pull the required model
    ollama pull llama3.1:latest
  3. Permission Errors

    # Check output directory permissions
    mkdir -p red_teaming_results
    chmod 755 red_teaming_results

Performance Tips

  • Use SSD storage for better model loading performance
  • Ensure sufficient RAM (16GB+ recommended)
  • Use GPU acceleration if available
  • Adjust temperature settings for different exploration levels

Future Enhancements

  • Support for more model providers (OpenAI API, Anthropic, etc.)
  • Advanced visualization of vulnerability patterns
  • Integration with existing AI safety frameworks
  • Automated mitigation suggestion generation
  • Real-time monitoring capabilities
  • Distributed testing across multiple nodes

Acknowledgments

  • OpenAI for the "OpenAI Open Model Hackathon" and also for the GPT-OSS-20B model.
  • LangGraph team for the workflow framework
  • Ollama team for local model serving
  • The AI safety research community for inspiration and guidance

About

A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors. (Specifically built for the OpenAI Open Model Hackathon)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors