OmniBAR

OmniBAR Logo

A Customizable, Multi-Objective AI Agent Benchmarking Framework for Agentic Reliability and Mediation (ARM)

Python 3.10+ License: Apache 2.0 PyPI version

🚀 Quick Start in 2 minutes: Clone → Install → Run → Get comprehensive agent evaluation results!

Agentic Reliability and Mediation (ARM) is a research and development area at BrainGnosis. We study how to measure and improve the reliability of AI agents and how they mediate conflicts during autonomous decision making. Our goal is to establish clear principles, metrics, and evaluation protocols that transfer across domains, so agents remain dependable, aligned, and resilient under varied operating conditions.

From this work we are releasing OmniBAR (Benchmarking Agentic Reliability), an open source, flexible, multi-objective benchmarking framework for evaluating AI agents across both standard suites and highly customized use cases. OmniBAR looks beyond output-only checks: it assesses decision quality, adaptability, conflict handling, and reliability in single-agent and multi-agent settings. Its modular design lets teams add scenarios, metrics, reward and constraint definitions, and integrations with tools and simulators. The result is domain-relevant testing with reproducible reports that reflect the demands of real-world applications.

⚠️ Development Version Notice
OmniBAR is currently in active development. While we strive for stability, you may encounter bugs, breaking changes, or incomplete features. We recommend thorough testing in your specific use case and welcome bug reports and feedback to help us improve the framework.


About Us: BrainGnosis

BrainGnosis

BrainGnosis is dedicated to making AI smarter for humans through structured intelligence and reliable AI systems. We are developing AgentOS, an enterprise operating system for intelligent AI agents that think, adapt, and collaborate to enhance organizational performance.

Our Mission: Build reliable, adaptable, and deeply human-aligned AI that transforms how businesses operate.

🔗 Learn more: www.braingnosis.com


Why OmniBAR?

Traditional benchmarking approaches evaluate AI systems through simple input-output comparisons, missing the complex decision-making processes that modern AI agents employ.

General Benchmarking Process

OmniBAR's Comprehensive Approach
OmniBAR captures the full spectrum of agentic behavior by evaluating multiple dimensions simultaneously, from reasoning chains to action sequences to system state changes.

OmniBAR Benchmarking Process

Why OmniBAR is Different

  • 📊 Multi-Dimensional Evaluation: Assess outputs, reasoning, actions, and states simultaneously with native support for output-based, path-based, state-based, and LLM-as-a-judge evaluations
  • 🔄 Agentic Loop Awareness: Understands the iterative thought-action-observation cycles that modern AI agents employ
  • 🎯 Objective-Specific Analysis: Different aspects evaluated by specialized objectives with comprehensive evaluation criteria
  • 🔗 Comprehensive Coverage: No blind spots in agent behavior assessment; captures the full decision-making process
  • ⚡ High-Performance Execution: Async support enables rapid concurrent evaluation for faster benchmarking cycles
  • 📊 Advanced Analytics: Built-in AI summarization and customizable evaluation metrics for actionable insights
  • 🔧 Extensible Architecture: Modular design allowing custom objectives, evaluation criteria, and result types
  • 🔄 Framework Agnostic: Works seamlessly with any Python-based agent framework (LangChain, Pydantic AI, custom agents)

How It Works

OmniBAR follows a clean, modular architecture that makes it easy to understand and extend:

omnibar/
├── core/                    # Core benchmarking engine
│   ├── benchmarker.py       # Main OmniBarmarker class
│   └── types.py             # Type definitions and result classes
├── objectives/              # Evaluation objectives
│   ├── base.py              # Base objective class
│   ├── llm_judge.py         # LLM-based evaluation
│   ├── output.py            # Output comparison objectives
│   ├── path.py              # Path/action sequence evaluation
│   ├── state.py             # State-based evaluation
│   └── combined.py          # Multi-objective evaluation
├── integrations/            # Framework-specific integrations
│   └── pydantic_ai/         # Pydantic AI integration
└── logging/                 # Logging and analytics
    ├── logger.py            # Comprehensive logging system
    └── evaluator.py         # Auto-evaluation and analysis

Evaluation Flow:

  1. Agent Execution: Your agent processes input and generates output
  2. Multi-Objective Assessment: Different objectives evaluate different aspects
  3. Comprehensive Logging: Results are logged with detailed analytics
  4. Performance Insights: Get actionable feedback on agent behavior
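The four steps above can be sketched framework-free. The names below (run_agent, objectives, logs) are illustrative stand-ins, not OmniBAR API:

```python
# Framework-free illustration of the evaluation flow; names are
# illustrative stand-ins, not OmniBAR's actual API.

def run_agent(query: str) -> dict:
    # 1. Agent Execution: the agent processes input and returns output
    return {"answer": "Paris"}

# 2. Multi-Objective Assessment: several objectives inspect the same output
objectives = {
    "exact_match": lambda out: out["answer"] == "Paris",
    "non_empty": lambda out: bool(out["answer"]),
}

# 3. Comprehensive Logging: every evaluation is recorded
logs = []
output = run_agent("What's the capital of France?")
for name, evaluate in objectives.items():
    logs.append({"objective": name, "passed": evaluate(output)})

# 4. Performance Insights: aggregate the logs into a summary
passed = sum(log["passed"] for log in logs)
print(f"{passed}/{len(logs)} objectives passed")
```

OmniBAR runs this loop for you across iterations and benchmarks; the sketch only shows how the four stages relate.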

Installation

Prerequisites

  • Python 3.10+ (Required)
  • API keys: OpenAI and/or Anthropic (required only for LLM Judge objectives)
  • 5 minutes for setup and first benchmark

Core Package

Recommended Installation (Most Reliable):

# Clone the repository
git clone https://github.com/BrainGnosis/OmniBAR.git
cd OmniBAR

# Install dependencies
pip install -r omnibar/requirements.txt

# Install in development mode
pip install -e .

Alternative: PyPI Installation (Beta)

⚠️ Beta Notice: PyPI installation is available but currently in beta testing. Cross-platform compatibility is being actively improved. For the most reliable experience, we recommend the git installation above.

# Install from PyPI (beta - may have platform-specific issues)
pip install omnibar

Environment Setup

Create a .env file in your project root with your API keys:

# .env
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

✅ That's it! OmniBAR automatically loads environment variables when you import it.
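If a key seems to be missing at runtime, a quick stdlib-only check of what the process actually sees can save debugging time (key names taken from the .env example above):

```python
import os

# Report whether the keys used by LLM Judge objectives are visible
# to this process (they may come from .env or the shell environment).
status = {
    key: "set" if os.getenv(key) else "missing"
    for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
}
for key, state in status.items():
    print(f"{key}: {state}")
```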

Requirements

Core dependencies:

python>=3.10
langchain==0.3.27
langchain_core==0.3.75
langchain_openai==0.3.32
pydantic==2.11.7
rich==14.1.0
numpy==2.3.2
tqdm==4.67.1

30-Second Demo

Want to see OmniBAR in action immediately? Here's the minimal example:

from omnibar import OmniBarmarker, Benchmark
from omnibar.objectives import StringEqualityObjective

# 1. Define a simple agent
class SimpleAgent:
    def invoke(self, query: str) -> dict:
        return {"answer": "Paris"}

def create_agent():
    return SimpleAgent()

# 2. Create benchmark
benchmark = Benchmark(
    name="Geography Test",
    input_kwargs={"query": "What's the capital of France?"},
    objective=StringEqualityObjective(name="exact_match", output_key="answer", goal="Paris"),
    iterations=1
)

# 3. Run evaluation
benchmarker = OmniBarmarker(
    executor_fn=create_agent,
    executor_kwargs={},
    initial_input=[benchmark]
)
results = benchmarker.benchmark()

# 4. View results
benchmarker.print_logger_summary()

Output:

✅ Geography Test: PASSED (100% accuracy)
📊 1/1 benchmarks passed | Runtime: 0.1s

Quick Start

Here's a complete example demonstrating OmniBAR's core capabilities:

import asyncio
from dotenv import load_dotenv
from omnibar import OmniBarmarker, Benchmark
from omnibar.objectives import LLMJudgeObjective, StringEqualityObjective, CombinedBenchmarkObjective
from omnibar.core.types import BoolEvalResult, FloatEvalResult

# Load environment variables
load_dotenv()

# Define your agent (works with any Python callable)
class SimpleAgent:
    def invoke(self, query: str) -> dict:
        if "capital" in query.lower() and "france" in query.lower():
            return {"response": "The capital of France is Paris."}
        return {"response": "I'm not sure about that."}

def create_agent():
    return SimpleAgent()

# Create evaluation objectives
accuracy_objective = StringEqualityObjective(
    name="exact_accuracy",
    output_key="response", 
    goal="The capital of France is Paris."
)

quality_objective = LLMJudgeObjective(
    name="response_quality",
    output_key="response",
    goal="The agent identified the capital of France correctly",
    valid_eval_result_type=FloatEvalResult  # 0.0-1.0 scoring
)

# Combine multiple objectives
combined_objective = CombinedBenchmarkObjective(
    name="comprehensive_evaluation",
    objectives=[accuracy_objective, quality_objective]
)

# Create and run benchmark
async def main():
    benchmark = Benchmark(
        name="Geography Knowledge Test",
        input_kwargs={"query": "What is the capital of France?"},
        objective=combined_objective,
        iterations=5
    )
    
    benchmarker = OmniBarmarker(
        executor_fn=create_agent,
        executor_kwargs={},
        initial_input=[benchmark]
    )
    
    # Execute with concurrency control
    results = await benchmarker.benchmark_async(max_concurrent=3)
    
    # View results
    benchmarker.print_logger_summary()
    return results

# Run the benchmark
if __name__ == "__main__":
    results = asyncio.run(main())

🎯 Next Steps

Got the basic example working? Here's your learning path:

  1. πŸ” Explore Examples: Check out examples/ directory for real-world use cases
  2. πŸŽ›οΈ Try Different Objectives: Experiment with LLM Judge and Combined objectives
  3. ⚑ Scale Up: Use async benchmarking with benchmark_async() for faster evaluation
  4. πŸ”§ Customize: Create your own evaluation objectives for domain-specific needs
  5. πŸ“Š Analyze: Dive deeper with print_logger_details() for comprehensive insights

Need help? Check our FAQ or join the community discussions!

Common Use Cases

Here are real-world scenarios where OmniBAR excels:

🏢 Enterprise AI Validation

Scenario: Validating customer service chatbots before deployment

  • Objectives: LLM Judge for helpfulness + StringEquality for policy compliance
  • Benefit: Ensure agents are both helpful AND follow company guidelines

🔬 Research & Development

Scenario: Comparing different agent architectures or prompting strategies

  • Objectives: Combined objectives measuring accuracy, reasoning quality, and efficiency
  • Benefit: Rigorous A/B testing with statistical significance

🚀 Production Monitoring

Scenario: Continuous evaluation of deployed agents

  • Objectives: State-based objectives tracking system changes + output quality
  • Benefit: Early detection of performance degradation

🎓 Educational AI Assessment

Scenario: Evaluating AI tutoring systems

  • Objectives: Path-based objectives tracking learning progression + content accuracy
  • Benefit: Comprehensive assessment of both teaching method and content quality

🤖 Multi-Agent System Testing

Scenario: Testing collaborative agent teams

  • Objectives: State-based objectives for system coordination + individual agent performance
  • Benefit: Holistic evaluation of complex agent interactions

💡 When to Choose Each Objective Type

| Objective Type | Best For | Example Use Case | Key Benefit |
| --- | --- | --- | --- |
| LLM Judge | Subjective qualities | "Is this explanation clear?" | Human-like evaluation |
| Output-Based | Exact requirements | "Does output match format?" | Precise validation |
| Path-Based | Process evaluation | "Did agent use tools correctly?" | Workflow assessment |
| State-Based | System changes | "Was database updated properly?" | State verification |
| Combined | Comprehensive testing | "All of the above" | Complete coverage |

Core Concepts

Evaluation Objectives

OmniBAR provides multiple evaluation objective types, each designed to address different evaluation challenges:

LLM Judge Objective

When to use: "How do I evaluate subjective qualities like helpfulness, creativity, or nuanced correctness that can't be captured by exact matching?"

Perfect for assessing complex, subjective criteria where human-like judgment is needed.

# Boolean evaluation (pass/fail)
binary_objective = LLMJudgeObjective(
    name="correctness_check",
    output_key="response",
    goal="Provide a factually correct answer"
)

# Numerical evaluation (0.0-1.0 scoring)
scoring_objective = LLMJudgeObjective(
    name="quality_score", 
    output_key="response",
    goal="Provide comprehensive and helpful information",
    valid_eval_result_type=FloatEvalResult
)

Output-Based Objectives

When to use: "How do I verify that my agent produces the exact output I expect, or matches specific patterns?"

Ideal for deterministic evaluations where you need precise output matching or format validation.

# Exact string matching
exact_objective = StringEqualityObjective(
    name="exact_match",
    output_key="answer",
    goal="Paris"
)

# Regex pattern matching
pattern_objective = RegexMatchObjective(
    name="pattern_match",
    output_key="response",
    goal=r"Paris|paris"
)

Path-Based and State-Based Objectives

When to use: "How do I evaluate not just what my agent outputs, but HOW it gets there and what changes it makes?"

Essential for evaluating agent reasoning processes, tool usage sequences, and system state modifications.

# Evaluate action sequences
path_objective = PathEqualityObjective(
    name="tool_usage",
    output_key="agent_path",
    goal=[[("search", SearchTool), ("summarize", None)]]
)

# Evaluate state changes
state_objective = StateEqualityObjective(
    name="final_state",
    output_key="agent_state",
    goal=ExpectedState
)

Examples

πŸ“ AI-Generated Content Notice
The examples and tests in this repository were developed with assistance from AI coding tools and IDEs. While we have reviewed and tested the code, please validate the examples thoroughly in your own environment and adapt them to your specific needs.

Complete Example Files

The examples/ directory contains comprehensive examples:

  • pydantic_ai_example.py - Model parity comparison (Claude 3.5 vs GPT-4)
  • document_extraction_evolution.py - Document extraction prompt evolution (4 iterative improvements)
  • langchain_embedding_example.py - LangChain embedding benchmarks
  • inventory_management_example.py - Complex inventory management agent evaluation

📋 Full Example List:

  • output_evaluation.py - Basic string/regex evaluation (no API keys needed)
  • custom_agent_example.py - Framework-agnostic agent patterns
  • bool_vs_float_results.py - Boolean vs scored result comparison
  • document_extraction_evolution.py - Document extraction prompt evolution

See examples/README.md for detailed descriptions and setup instructions.

Logging and Analytics

# Print summary with key metrics
benchmarker.print_logger_summary()

# Detailed results with full evaluation data
benchmarker.print_logger_details(detail_level="detailed")

# Access raw logs for custom processing
logs = benchmarker.logger.get_all_logs()

Framework Integrations

OmniBAR works seamlessly with popular AI agent frameworks:

LangChain Integration
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def create_langchain_agent():
    llm = ChatOpenAI(temperature=0, model="gpt-4")
    tools = []  # Add your tools here
    # create_openai_functions_agent requires a real prompt with an
    # agent_scratchpad placeholder (passing prompt=None raises an error)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])
    agent = create_openai_functions_agent(llm, tools, prompt)
    return AgentExecutor(agent=agent, tools=tools)

benchmarker = OmniBarmarker(
    executor_fn=create_langchain_agent,
    executor_kwargs={},
    agent_invoke_method_name="invoke",
    initial_input=[benchmark]
)
Pydantic AI Integration
from omnibar.integrations.pydantic_ai import PydanticAIOmniBarmarker
from pydantic_ai import Agent

def create_pydantic_agent():
    return Agent(model="openai:gpt-4", result_type=str)

benchmarker = PydanticAIOmniBarmarker(
    executor_fn=create_pydantic_agent,
    initial_input=[benchmark]
)
Custom Agent Integration
class MyCustomAgent:
    def run(self, input_data: dict) -> dict:
        # Your custom agent logic
        return {"response": "Custom agent response"}

def create_custom_agent():
    return MyCustomAgent()

benchmarker = OmniBarmarker(
    executor_fn=create_custom_agent,
    executor_kwargs={},
    agent_invoke_method_name="run",  # Specify your agent's method
    initial_input=[benchmark]
)

Advanced Usage

Custom LLM Judge Prompts

custom_objective = LLMJudgeObjective(
    name="factual_correctness",
    output_key="response",
    goal="Correctly identify the author",
    prompt="""
    Evaluate this response for factual correctness.
    
    Expected: {expected_output}
    Agent Response: {input}
    
    Return true if the information is factually correct.
    {format_instructions}
    """,
    valid_eval_result_type=BoolEvalResult
)

Required Placeholders: {input}, {expected_output}, {format_instructions}
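To see how these placeholders behave, the template can be rendered with plain str.format. The values below are made-up examples, and OmniBAR's internal rendering may differ:

```python
# The prompt template from above, with the three required placeholders.
template = (
    "Evaluate this response for factual correctness.\n\n"
    "Expected: {expected_output}\n"
    "Agent Response: {input}\n\n"
    "Return true if the information is factually correct.\n"
    "{format_instructions}"
)

# Hypothetical stand-ins for the values supplied at evaluation time.
rendered = template.format(
    input="Pride and Prejudice was written by Jane Austen.",
    expected_output="Jane Austen",
    format_instructions='Respond as JSON: {"result": true or false}',
)
print(rendered)
```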

Custom Evaluation Functions

def custom_evaluation_function(input_dict: dict) -> dict:
    agent_output = input_dict["input"]
    
    # Your custom logic here
    if "paris" in agent_output.lower():
        score = 0.9
        message = "Correctly identified Paris"
    else:
        score = 0.1
        message = "Failed to identify correct answer"
    
    return {"result": score, "message": message}

custom_objective = LLMJudgeObjective(
    name="custom_evaluation",
    output_key="response",
    invoke_method=custom_evaluation_function,
    valid_eval_result_type=FloatEvalResult
)

Custom Objectives and Result Types

from omnibar.core.types import ValidEvalResult
from omnibar.objectives.base import BaseBenchmarkObjective

class ScoreWithReason(ValidEvalResult):
    result: float
    reason: str

class CustomObjective(BaseBenchmarkObjective):
    valid_eval_result_type = ScoreWithReason
    
    def _eval_fn(self, goal, formatted_output, **kwargs):
        # Your evaluation logic
        score = 0.8
        reason = "Custom evaluation completed"
        return ScoreWithReason(result=score, reason=reason)
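The idea behind a richer result type can be exercised without the framework. This stand-alone sketch uses a plain dataclass in place of OmniBAR's ValidEvalResult to show an evaluation that returns both a score and its reasoning:

```python
from dataclasses import dataclass

# Stand-alone sketch: a plain dataclass stands in for a custom
# ValidEvalResult carrying both a score and its reasoning.
@dataclass
class ScoreWithReason:
    result: float
    reason: str

def eval_fn(goal: str, formatted_output: str) -> ScoreWithReason:
    # Toy scoring logic: full credit if the goal text appears in the output.
    if goal.lower() in formatted_output.lower():
        return ScoreWithReason(result=1.0, reason="goal text found in output")
    return ScoreWithReason(result=0.0, reason="goal text missing from output")

verdict = eval_fn("Paris", "The capital of France is Paris.")
print(verdict.result, "-", verdict.reason)
```

Inside OmniBAR the same shape would be returned from _eval_fn, with the reason available in the logs alongside the score.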

Development

Development Setup

git clone https://github.com/BrainGnosis/OmniBAR.git
cd OmniBAR

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Testing

cd tests/

# Quick development tests
python run_tests.py fast        # ~4s, fast tests only
python run_tests.py imports     # ~1s, smoke test

# Run by category
python run_tests.py logging     # Test logging components
python run_tests.py core        # Core benchmarker tests
python run_tests.py objectives  # Evaluation objectives

# Comprehensive testing with rich output
python test_all.py --fast       # Skip slow tests
python test_all.py              # Everything (~5min)
python test_all.py --verbose    # Detailed failure info

See tests/README.md for detailed information about the test suite structure and available options.

Contributing

We welcome contributions to OmniBAR! Here's how you can help:

Ways to Contribute

  • πŸ› Bug Reports: Found an issue? Open an issue
  • πŸ’‘ Feature Requests: Have an idea? Start a discussion
  • πŸ”§ Code Contributions: Submit pull requests for bug fixes and new features
  • πŸ“š Documentation: Help improve our docs and examples
  • πŸ§ͺ Testing: Add test cases and improve test coverage

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and add tests
  4. Run tests and ensure they pass (pytest)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to your branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Code Style

  • Follow PEP 8 for Python code
  • Use type hints where appropriate
  • Add docstrings for public functions and classes
  • Run pre-commit install to enable automatic formatting

FAQ

General Questions

Q: What makes OmniBAR different from other benchmarking tools?
A: OmniBAR evaluates the full agentic loop (reasoning, actions, state changes) rather than just input-output comparisons. It supports multi-objective evaluation and works with any Python-based agent framework.

Q: Can I use OmniBAR with my existing agent framework?
A: Yes! OmniBAR is framework-agnostic and works with LangChain, Pydantic AI, AutoGen, or custom agents. Just provide a callable that takes input and returns output.

Q: How do I create custom evaluation objectives?
A: Extend BaseBenchmarkObjective and implement the _eval_fn method. See the Custom Objectives examples for details.

Technical Questions

Q: Does OmniBAR support async execution?
A: Yes! Use benchmarker.benchmark_async() with concurrency control via the max_concurrent parameter.

Q: How do I integrate with different LLM providers?
A: OmniBAR uses your agent's LLM configuration. For LLM Judge objectives, set your API keys in the .env file and they'll be loaded automatically.

Q: Can I benchmark multi-agent systems?
A: Absolutely! Create benchmarks for each agent or use Combined objectives to evaluate multi-agent interactions.

Troubleshooting

Q: I'm getting import errors when using OmniBAR.
A: Ensure you've installed all dependencies (pip install -r omnibar/requirements.txt) and that your Python version is 3.10+.

Q: My custom evaluation isn't working.
A: Verify that your _eval_fn returns the correct result type (BoolEvalResult, FloatEvalResult, etc.) and that required placeholders are included in custom prompts.

Q: How do I debug failed benchmarks?
A: Use benchmarker.print_logger_details(detail_level="detailed") to see full evaluation traces and error messages.

License

Licensed under the Apache License 2.0. See LICENSE for details.

Support


Built with ❤️ by BrainGnosis

Making AI Smarter for Humans
