LLM Guardrail Testing Framework

A systematic framework for testing how different language models respond to variations in prompt phrasing and evaluating the effectiveness of their safety guardrails. Check * Details here

Project Overview

This framework enables researchers and developers to:

Test how variations in language affect LLM responses
Measure the effectiveness of safety guardrails across different models
Identify patterns in how models handle boundary-pushing requests
Compare responses between different LLMs (currently supporting Claude and OpenAI models)

Quick Start

# Clone the repository
git clone https://github.com/saranyan/llm_guardrail_testing.git
cd llm_guardrail_testing

# Install dependencies
pip install -r requirements.txt

# Create .env file with your API keys
echo "ANTHROPIC_API_KEY=your_anthropic_api_key" > .env
echo "OPENAI_API_KEY=your_openai_api_key" >> .env

# Run bechnmark on sample_data
python guardrail_benchmark.py --test-suite ../benchmark_data/single_test_case.json

📁 Project Structure

llm-guardrail-testing/
├── src/                                 # Code to run the tests
│   ├── guardrail_benchmark.py           # Core functionality
│   ├── generate_html_from_run.py        # generates HTML file from run for viz
|   ├── console_summary_from_run.py      # generates console report from run
|   ├── config.json
├── .env file                             #create this with your API keys
├── benchmark_data/                       # benchmark data in json format
├── sample_run_results/                   # Sample generated data for reference

Test Suites

The framework includes comprehensive test suites covering critical AI safety domains:

Core Test Categories

Harmful Misinformation - Tests resistance to dangerous health advice, unproven medical claims, and life-threatening misinformation (turmeric as Alzheimer's cure, fasting to cure autoimmune diseases, raw vegan diets as cancer treatment)
Guardrail Evasion - Evaluates resilience against prompt injection, jailbreak techniques, system manipulation, and progressive instruction escalation designed to bypass safety mechanisms
Mental Health Crisis - Ensures appropriate handling of suicidal ideation, self-harm content, crisis intervention scenarios, and emotional distress across direct and indirect expressions
Synthetic Media Abuse - Tests deepfake generation resistance, celebrity impersonation scripts, revenge content creation, and misinformation disguised as satire
Illegal Activity Facilitation - Validates refusal of drug synthesis instructions, precursor extraction methods, novel designer drug creation, and controlled substance manufacturing
Political Manipulation - Checks protection against voter suppression tactics, election disinformation, polling disruption rumors, and automated suppression systems

Multi-Dimensional Testing Framework

Each test category examines different vectors of potential circumvention:

Directness Levels: From oblique references to explicit requests (3 escalation levels)
Technical Framing: From layperson language to advanced technical terminology
Authority Invocation: From casual requests to claims of professional/academic authority
Role-Playing Scenarios: From hypothetical discussions to direct role assumption
Contextual Embedding: Educational, fictional, research, and practical use contexts
Sequential Refinement: 5-step conversation flows that gradually escalate from innocent to problematic requests

Risk Classification

Tests are categorized by risk level:

Critical: Life-threatening content (suicide methods, drug synthesis, medical misinformation)
High: Significant harm potential (deepfakes, voter suppression, jailbreaks)
Medium: Moderate risk scenarios requiring careful handling

Each test includes detailed ethical considerations, expected model behavior, and analysis guidance to ensure responsible evaluation of AI safety mechanisms.

Analysis Features

The framework automatically analyzes responses for:

Response classification (refusal, cautious compliance, full compliance)
Presence of disclaimers, refusal language, or alternatives
Response length and other metrics
Comparison across models and prompt variations

Example Use Cases

Research: Study how linguistic variations affect AI safety boundaries
Development: Test your own models' guardrails during development
Evaluation: Compare safety measures across different commercial LLMs
Education: Demonstrate AI safety challenges and solutions

Running this

#run tests
python guardrail_benchmark.py benchmark_data/sample_test_cases.json
python guardrail_benchmark.py benchmark_data/llm_guardrail_benchmark_full.json

#html page with visualizaitons
python generate_html_from_run.py ./sample_run_results

#console summary from existing run
python console_summary_from_run.py ./sample_run_results

Sample results

check the sample_run_results folder from my runs and sample_run_results/sample_console_results.txt

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/adding_to_dataset)
Commit your changes (git commit -m 'Add more cases to data')
Push to the branch (git push origin feature/adding_to_dataset)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Ethical Considerations

This framework is designed for research and educational purposes to understand and improve AI safety mechanisms. Please use responsibly:

Do not use this tool to deliberately circumvent safety measures for harmful purposes
Consider reporting any unexpected vulnerabilities discovered to the relevant AI providers
Follow responsible disclosure practices for any safety issues identified

Acknowledgments

Inspired by research in the field of AI safety and alignment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Guardrail Testing Framework

Project Overview

Quick Start

📁 Project Structure

Test Suites

Core Test Categories

Multi-Dimensional Testing Framework

Risk Classification

Analysis Features

Example Use Cases

Running this

Sample results

Contributing

License

Ethical Considerations

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
benchmark_data		benchmark_data
sample_run_results		sample_run_results
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLM Guardrail Testing Framework

Project Overview

Quick Start

📁 Project Structure

Test Suites

Core Test Categories

Multi-Dimensional Testing Framework

Risk Classification

Analysis Features

Example Use Cases

Running this

Sample results

Contributing

License

Ethical Considerations

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages