Skip to content

eval-hub/eval-hub-contrib

Repository files navigation

eval-hub-contrib

Community-contributed evaluation framework adapters for eval-hub.

Overview

This repository contains adapters that integrate various evaluation frameworks with the eval-hub service. Each adapter implements the FrameworkAdapter pattern from the evalhub-sdk, enabling seamless integration with the eval-hub evaluation service.

Supported Frameworks

Framework Container Image Kubernetes Notes
LightEval quay.io/evalhub/community-lighteval:latest Lightweight evaluation framework for language models
GuideLLM quay.io/evalhub/community-guidellm:latest Performance benchmarking for LLM inference servers
MTEB quay.io/evalhub/community-mteb:latest Massive Text Embedding Benchmark for embedding models
IBM CLEAR quay.io/evalhub/community-ibm-clear:latest Agentic trace analysis (LLM-as-judge error reporting)
Inspect AI quay.io/evalhub/community-inspect:latest UK AISI framework — Petri/Bloom alignment auditing and 75 inspect-evals benchmarks

Inspect AI Adapter

The Inspect AI adapter exposes alignment auditing and safety evaluation through the Petri and Bloom tools from Meridian Labs, as well as 35 curated benchmarks from the inspect-evals community library.

75 benchmarks across three categories:

  • 36 Petri alignment audits — covers all 40 built-in seed tag categories including sycophancy, deception, alignment faking, jailbreak, harmful cooperation, self-preservation, power seeking, oversight subversion, and more.
  • 2 Bloom behavioral suites — automated scenario generation from high-level behavior descriptions.
  • 36 inspect-evals — safety (AgentHarm, WMDP, StrongREJECT, MASK), scheming (agentic misalignment, GDM self-proliferation, GDM stealth), cybersecurity (Cybench, CyberSecEval), coding (HumanEval, SWE-bench), math (GSM8K, MATH, AIME), knowledge (MMLU, GPQA), and agent capabilities (GAIA, TheAgentCompany).

Model configuration — no provider prefixes required in job specs. The adapter detects the correct API from environment variables:

Environment variable API used
OPENAI_BASE_URL + OPENAI_API_KEY OpenAI-compatible (vLLM, OpenRouter)
OLLAMA_BASE_URL or port 11434 Ollama native
ANTHROPIC_API_KEY Anthropic Messages API

See adapters/inspect/README.md for full documentation, deployment examples, and benchmark catalog.

Framework Container Image Local Kubernetes Notes
LightEval quay.io/evalhub/community-lighteval:latest Lightweight evaluation framework for language models
GuideLLM quay.io/evalhub/community-guidellm:latest Performance benchmarking platform for LLM inference servers
MTEB quay.io/evalhub/community-mteb:latest Massive Text Embedding Benchmark for embedding models
IBM CLEAR quay.io/evalhub/community-ibm-clear:latest Agentic trace analysis (LLM-as-judge error reporting)
RAGAS quay.io/evalhub/community-ragas:latest RAG pipeline quality evaluation (faithfulness, relevancy, context precision/recall, and more)
SWE-bench quay.io/evalhub/community-swebench:latest Software engineering benchmark for code patch evaluation

Building Adapters

# Build specific adapter
make image-lighteval
make image-guidellm
make image-inspect

# Build all adapters
make images

# Run adapter tests
make test-inspect
make tests

# Push to registry
make push-inspect REGISTRY=quay.io/your-org VERSION=v1.0.0
make push-lighteval REGISTRY=quay.io/your-org VERSION=v1.0.0

Contributing

See CONTRIBUTING.md for guidelines on adding adapters.

License

See the LICENSE file for details.

About

Community-contributed evaluation framework adapters for eval-hub

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors