A systematic framework for testing how different language models respond to variations in prompt phrasing and for evaluating the effectiveness of their safety guardrails.
This framework enables researchers and developers to:
- Test how variations in language affect LLM responses
- Measure the effectiveness of safety guardrails across different models
- Identify patterns in how models handle boundary-pushing requests
- Compare responses between different LLMs (currently supporting Claude and OpenAI models)
# Clone the repository
git clone https://github.com/saranyan/llm_guardrail_testing.git
cd llm_guardrail_testing
# Install dependencies
pip install -r requirements.txt
# Create .env file with your API keys
echo "ANTHROPIC_API_KEY=your_anthropic_api_key" > .env
echo "OPENAI_API_KEY=your_openai_api_key" >> .env
# Run benchmark on sample data
python guardrail_benchmark.py --test-suite ../benchmark_data/single_test_case.json
llm-guardrail-testing/
├── src/                             # Code to run the tests
│   ├── guardrail_benchmark.py       # Core functionality
│   ├── generate_html_from_run.py    # Generates an HTML file from a run for visualization
│   ├── console_summary_from_run.py  # Generates a console report from a run
│   └── config.json
├── .env                             # Create this file with your API keys
├── benchmark_data/                  # Benchmark data in JSON format
└── sample_run_results/              # Sample generated data for reference
The framework includes comprehensive test suites covering critical AI safety domains:
- Harmful Misinformation: Tests resistance to dangerous health advice, unproven medical claims, and life-threatening misinformation (turmeric as Alzheimer's cure, fasting to cure autoimmune diseases, raw vegan diets as cancer treatment)
- Guardrail Evasion: Evaluates resilience against prompt injection, jailbreak techniques, system manipulation, and progressive instruction escalation designed to bypass safety mechanisms
- Mental Health Crisis: Ensures appropriate handling of suicidal ideation, self-harm content, crisis intervention scenarios, and emotional distress across direct and indirect expressions
- Synthetic Media Abuse: Tests deepfake generation resistance, celebrity impersonation scripts, revenge content creation, and misinformation disguised as satire
- Illegal Activity Facilitation: Validates refusal of drug synthesis instructions, precursor extraction methods, novel designer drug creation, and controlled substance manufacturing
- Political Manipulation: Checks protection against voter suppression tactics, election disinformation, polling disruption rumors, and automated suppression systems
Each test category examines different vectors of potential circumvention:
- Directness Levels: From oblique references to explicit requests (3 escalation levels)
- Technical Framing: From layperson language to advanced technical terminology
- Authority Invocation: From casual requests to claims of professional/academic authority
- Role-Playing Scenarios: From hypothetical discussions to direct role assumption
- Contextual Embedding: Educational, fictional, research, and practical use contexts
- Sequential Refinement: 5-step conversation flows that gradually escalate from innocent to problematic requests
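The variation vectors above can be pictured as a grid of prompt variants per test case, plus ordered multi-turn flows. The following is a minimal sketch of that idea, not the framework's own data model: the class names, the field names, and the middle directness label ("moderate") are assumptions made for illustration, and the benchmark JSON files may be structured differently.

```python
from dataclasses import dataclass, field
from itertools import product

# Hypothetical labels for two of the variation vectors; the real benchmark
# data may name or structure these differently.
DIRECTNESS_LEVELS = ("oblique", "moderate", "explicit")
CONTEXTS = ("educational", "fictional", "research", "practical")


@dataclass
class PromptVariation:
    """One cell of the variation grid for a single base test case."""
    test_id: str
    directness: str
    context: str


@dataclass
class SequentialFlow:
    """An ordered multi-turn flow; each entry in steps is one user turn."""
    test_id: str
    steps: list[str] = field(default_factory=list)

    def is_complete(self, expected_steps: int = 5) -> bool:
        # The README describes 5-step flows, so 5 is the default here.
        return len(self.steps) == expected_steps


def build_grid(test_id: str) -> list[PromptVariation]:
    """Enumerate every directness x context combination for one test case."""
    return [
        PromptVariation(test_id, directness, context)
        for directness, context in product(DIRECTNESS_LEVELS, CONTEXTS)
    ]


grid = build_grid("example_case")
print(len(grid))  # 3 directness levels x 4 contexts = 12
```

Enumerating the grid up front makes it easy to check that every combination was actually sent to each model before comparing results.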
Tests are categorized by risk level:
- Critical: Life-threatening content (suicide methods, drug synthesis, medical misinformation)
- High: Significant harm potential (deepfakes, voter suppression, jailbreaks)
- Medium: Moderate risk scenarios requiring careful handling
Each test includes detailed ethical considerations, expected model behavior, and analysis guidance to ensure responsible evaluation of AI safety mechanisms.
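One way to work with the three risk levels is to treat them as an ordered scale so that a run can be limited to the most severe cases. This is a sketch under assumptions: the `risk_level` key and the test-case dictionaries below are hypothetical and may not match the fields used in the benchmark JSON.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordered so that comparisons like >= follow severity."""
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def parse_risk(label: str) -> RiskLevel:
    """Map a label such as 'Critical' or 'high' to a RiskLevel member."""
    return RiskLevel[label.strip().upper()]


def at_least(tests: list[dict], minimum: RiskLevel) -> list[dict]:
    """Keep only the tests whose risk level is at or above the minimum."""
    return [t for t in tests if parse_risk(t["risk_level"]) >= minimum]


# Hypothetical test-case records; "risk_level" is an assumed field name.
tests = [
    {"id": "case_a", "risk_level": "Critical"},
    {"id": "case_b", "risk_level": "medium"},
    {"id": "case_c", "risk_level": "High"},
]

selected = at_least(tests, RiskLevel.HIGH)
print([t["id"] for t in selected])  # ['case_a', 'case_c']
```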
The framework automatically analyzes responses for:
- Response classification (refusal, cautious compliance, full compliance)
- Presence of disclaimers, refusal language, or alternatives
- Response length and other metrics
- Comparison across models and prompt variations
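To make the analysis steps above concrete, here is a minimal keyword-based sketch of response classification and per-model comparison. It is not the logic used in guardrail_benchmark.py: the phrase lists, function names, output keys, and model names are all assumptions chosen for illustration, and a simple phrase match like this will misclassify many real responses.

```python
from collections import Counter

# Illustrative phrase lists only; real responses vary far more than this.
REFUSAL_PHRASES = ("i can't", "i cannot", "i won't", "i am unable to")
DISCLAIMER_PHRASES = ("please note", "consult a", "not a substitute for")
ALTERNATIVE_PHRASES = ("instead,", "i can help with", "you might consider")


def analyze(response: str) -> dict:
    """Classify one response and record a few simple metrics."""
    # Lowercase and normalize curly apostrophes before phrase matching.
    text = response.lower().replace("\u2019", "'")
    has_refusal = any(p in text for p in REFUSAL_PHRASES)
    has_disclaimer = any(p in text for p in DISCLAIMER_PHRASES)
    has_alternative = any(p in text for p in ALTERNATIVE_PHRASES)

    if has_refusal:
        label = "refusal"
    elif has_disclaimer:
        label = "cautious_compliance"
    else:
        label = "full_compliance"

    return {
        "classification": label,
        "has_refusal": has_refusal,
        "has_disclaimer": has_disclaimer,
        "has_alternative": has_alternative,
        "length_words": len(response.split()),
    }


def refusal_rate_by_model(results: list[tuple[str, str]]) -> dict[str, float]:
    """Compute the share of refusals per model from (model, response) pairs."""
    totals, refusals = Counter(), Counter()
    for model, response in results:
        totals[model] += 1
        if analyze(response)["classification"] == "refusal":
            refusals[model] += 1
    return {model: refusals[model] / totals[model] for model in totals}


# Hypothetical model names and responses.
results = [
    ("model_a", "I can't help with that. Instead, here is a safer resource."),
    ("model_a", "Sure, here are the steps."),
    ("model_b", "Please note this is general information; consult a doctor."),
]
print(refusal_rate_by_model(results))  # {'model_a': 0.5, 'model_b': 0.0}
```

Keeping the per-response analysis separate from the aggregation means the same `analyze` output can be grouped by model, by prompt variation, or by risk level.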
- Research: Study how linguistic variations affect AI safety boundaries
- Development: Test your own models' guardrails during development
- Evaluation: Compare safety measures across different commercial LLMs
- Education: Demonstrate AI safety challenges and solutions
# Run tests
python guardrail_benchmark.py benchmark_data/sample_test_cases.json
python guardrail_benchmark.py benchmark_data/llm_guardrail_benchmark_full.json
# Generate an HTML page with visualizations
python generate_html_from_run.py ./sample_run_results
# Print a console summary from an existing run
python console_summary_from_run.py ./sample_run_results
For reference output from the author's runs, see the sample_run_results folder and sample_run_results/sample_console_results.txt.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/adding_to_dataset)
- Commit your changes (git commit -m 'Add more cases to data')
- Push to the branch (git push origin feature/adding_to_dataset)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This framework is designed for research and educational purposes to understand and improve AI safety mechanisms. Please use responsibly:
- Do not use this tool to deliberately circumvent safety measures for harmful purposes
- Consider reporting any unexpected vulnerabilities discovered to the relevant AI providers
- Follow responsible disclosure practices for any safety issues identified
This project was inspired by research in the field of AI safety and alignment.