A comprehensive LLM benchmarking toolkit that systematically compares multiple language models across accuracy, latency, cost, and safety metrics. Run standardized benchmarks, generate detailed comparison reports, and make data-driven decisions about which model to deploy for your use case.
- Multi-Model Comparison -- Evaluate GPT-4o, Gemini, Claude, and custom models side by side
- 22 Benchmark Cases -- Predefined tests across summarization, QA, classification, reasoning, and code generation
- Comprehensive Metrics -- Accuracy, F1 score, latency percentiles (P50/P95/P99), cost analysis, hallucination rate, and throughput
- Interactive Dashboard -- Streamlit-based UI with charts, tables, and exportable reports
- Simulation Mode -- Full evaluation pipeline works without API keys for testing and development
- YAML Configuration -- Easily add or modify models through configuration files; a sketch of the expected shape follows this list
- Markdown/HTML Reports -- Auto-generated evaluation reports with rankings and recommendations
- Cost-Quality Analysis -- Find the optimal cost-performance balance for your needs
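The YAML schema isn't spelled out in this section, so the following is only a rough sketch of what a `configs/models.yaml` entry might look like, assuming the loader mirrors the `ModelConfig` fields used in the Quick Start examples below:

```yaml
# Hypothetical shape of configs/models.yaml -- field names assumed to
# mirror ModelConfig; check the bundled file for the real schema.
models:
  - name: Gemini Flash
    provider: google
    model_id: gemini-2.5-flash-lite
    cost_per_1k_input: 0.0001
    cost_per_1k_output: 0.0004
  - name: GPT-4o Mini
    provider: openai
    model_id: gpt-4o-mini
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
```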
The evaluation pipeline is organized as follows:

```
+------------------+
| models.yaml | Configuration
+--------+---------+
|
+--------v---------+
| ModelEvaluator | Orchestration
+--------+---------+
|
+--------------+--------------+
| |
+---------v----------+ +-----------v-----------+
| BenchmarkSuite | | Simulated / Live |
| (22 test cases) | | API Responses |
+---------+----------+ +-----------+-----------+
| |
+--------------+--------------+
|
+--------v---------+
| Metrics | Analysis
| Engine (P50, |
| F1, Cost/QoS) |
+--------+---------+
|
+--------------+--------------+
| |
+---------v----------+ +-----------v-----------+
| ReportGenerator | | Streamlit Dashboard |
| (Markdown / HTML) | | (Interactive UI) |
+--------------------+    +-----------------------+
```
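The metrics engine in the diagram aggregates per-request measurements into the headline numbers. As a purely illustrative sketch of that kind of aggregation (the function and field names below are hypothetical, not the actual `src/metrics.py` API):

```python
# Illustrative only -- not the project's actual metrics code.
import numpy as np

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Collapse per-request latencies into P50/P95/P99 summaries."""
    arr = np.asarray(latencies_ms, dtype=float)
    return {f"p{q}": float(np.percentile(arr, q)) for q in (50, 95, 99)}

def request_cost(input_tokens: int, output_tokens: int,
                 cost_per_1k_input: float, cost_per_1k_output: float) -> float:
    """Dollar cost of a single request given per-1k-token prices."""
    return (input_tokens / 1000) * cost_per_1k_input \
         + (output_tokens / 1000) * cost_per_1k_output

print(latency_percentiles([120.0, 180.0, 95.0, 240.0, 310.0]))  # p50 = 180.0
print(request_cost(800, 200, 0.00015, 0.0006))                  # 0.00024
```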
To run the toolkit locally you'll need:

- Python 3.11+
- uv package manager
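uv is published on PyPI, so if it isn't already installed, one option is:

```bash
pip install uv
```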
```bash
# Clone the repository
git clone https://github.com/onurcandonmezer/ai-model-evaluator.git
cd ai-model-evaluator
# Install dependencies
uv venv && uv pip install -e ".[dev]"
```

Once installed, a full evaluation can be run from Python:

```python
from src.evaluator import ModelEvaluator
# Load from YAML config and run in simulation mode (no API keys needed)
evaluator = ModelEvaluator.from_yaml("configs/models.yaml", simulate=True)
results = evaluator.evaluate_all()
# Generate comparison report
report = evaluator.get_comparison_report()
print(f"Best Overall: {report.best_overall}")
print(f"Best Accuracy: {report.best_accuracy}")
print(f"Best Latency: {report.best_latency}")
```

Launch the interactive dashboard:

```bash
uv run streamlit run src/app.py
```

Models can also be configured programmatically instead of through YAML:

```python
from src.evaluator import ModelEvaluator, ModelConfig, EvaluationConfig

config = EvaluationConfig(
    models=[
        ModelConfig(
            name="Gemini Flash",
            provider="google",
            model_id="gemini-2.5-flash-lite",
            cost_per_1k_input=0.0001,
            cost_per_1k_output=0.0004,
        ),
        ModelConfig(
            name="GPT-4o Mini",
            provider="openai",
            model_id="gpt-4o-mini",
            cost_per_1k_input=0.00015,
            cost_per_1k_output=0.0006,
        ),
    ]
)

evaluator = ModelEvaluator(config=config, simulate=True)
results = evaluator.evaluate_all()

for result in results:
    print(f"{result.model_name}: {result.overall_score:.1%} accuracy, "
          f"{result.latency_ms:.0f}ms avg latency")
```

Reports can be exported to Markdown or HTML:

```python
from src.report_generator import ReportGenerator

evaluator = ModelEvaluator.from_yaml("configs/models.yaml", simulate=True)
evaluator.evaluate_all()
report = evaluator.get_comparison_report()
generator = ReportGenerator(output_dir="reports")
generator.save_markdown(report, "evaluation_report.md")
generator.save_html(report, "evaluation_report.html")
```

Custom benchmark cases can be defined and passed to the evaluator:

```python
from src.benchmarks import BenchmarkSuite, BenchmarkCase, BenchmarkCategory, Difficulty

custom_suite = BenchmarkSuite(cases=[
    BenchmarkCase(
        name="domain_specific_qa",
        category=BenchmarkCategory.QA,
        input_text="What are the key principles of MLOps?",
        expected_output_contains=["monitoring", "deployment", "automation", "reproducibility"],
        difficulty=Difficulty.MEDIUM,
    ),
])

evaluator = ModelEvaluator(config=config, benchmark_suite=custom_suite, simulate=True)
```

The toolkit builds on the following stack:

| Category | Technology |
|---|---|
| Language | Python 3.11+ |
| AI APIs | Google Gemini, OpenAI, Anthropic |
| Data Validation | Pydantic |
| Configuration | PyYAML |
| Dashboard | Streamlit |
| Data Processing | Pandas, NumPy |
| Visualization | Plotly |
| CLI Output | Rich |
| Testing | pytest, pytest-cov |
| Linting | Ruff |
| CI/CD | GitHub Actions |
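Given that Pydantic handles data validation, the `ModelConfig` used in the examples above is plausibly a Pydantic model. A minimal sketch under that assumption (the real definition lives in `src/evaluator.py` and may differ):

```python
# Hypothetical sketch -- see src/evaluator.py for the actual ModelConfig.
from pydantic import BaseModel, Field

class ModelConfig(BaseModel):
    name: str
    provider: str                              # e.g. "google", "openai", "anthropic"
    model_id: str
    cost_per_1k_input: float = Field(ge=0.0)   # USD per 1k input tokens
    cost_per_1k_output: float = Field(ge=0.0)  # USD per 1k output tokens

cfg = ModelConfig(
    name="GPT-4o Mini",
    provider="openai",
    model_id="gpt-4o-mini",
    cost_per_1k_input=0.00015,
    cost_per_1k_output=0.0006,
)
print(cfg.model_dump())
```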
Repository layout:

```
ai-model-evaluator/
├── src/
│ ├── __init__.py # Package initialization
│ ├── evaluator.py # Core evaluation engine
│ ├── benchmarks.py # Benchmark test suites (22 cases)
│ ├── metrics.py # Metric calculations & comparison
│ ├── report_generator.py # Markdown/HTML report generation
│ └── app.py # Streamlit dashboard
├── tests/
│ ├── test_evaluator.py # Evaluator tests (12 tests)
│ ├── test_benchmarks.py # Benchmark tests (12 tests)
│ └── test_metrics.py # Metrics tests (19 tests)
├── configs/
│ └── models.yaml # Model configurations
├── .github/workflows/
│ └── ci.yml # CI pipeline
├── pyproject.toml # Project configuration
├── Makefile # Development commands
├── LICENSE # MIT License
└── README.md
```
Common development tasks are wrapped in the Makefile:

```bash
# Run tests
make test
# Lint code
make lint
# Format code
make format
# Launch dashboard
make run
```

This project is licensed under the MIT License; see the LICENSE file for details.
Built by Onurcan Donmezer