Community-contributed evaluation framework adapters for eval-hub.
This repository contains adapters that integrate evaluation frameworks with the eval-hub service. Each adapter implements the FrameworkAdapter pattern from the evalhub-sdk.
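
To give a feel for what an adapter pattern of this kind typically looks like, here is a minimal sketch. It does not use the real evalhub-sdk API: `AdapterBase`, `JobSpec`, `JobResult`, and `EchoAdapter` are all hypothetical stand-ins defined in the block itself, and the actual `FrameworkAdapter` interface may differ in names, methods, and arguments.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class JobSpec:
    """Hypothetical job description (not the SDK's real type)."""
    benchmark: str
    model: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class JobResult:
    """Hypothetical result container (not the SDK's real type)."""
    benchmark: str
    metrics: Dict[str, float]


class AdapterBase(ABC):
    """Hypothetical stand-in for the SDK's FrameworkAdapter base class."""

    @abstractmethod
    def run(self, spec: JobSpec) -> JobResult:
        """Run one evaluation job and return its metrics."""


class EchoAdapter(AdapterBase):
    """Toy adapter; a real one would invoke LightEval, GuideLLM, MTEB, etc."""

    def run(self, spec: JobSpec) -> JobResult:
        # Placeholder metric so the sketch runs without any framework installed.
        return JobResult(spec.benchmark, {"placeholder_score": 0.0})


if __name__ == "__main__":
    print(EchoAdapter().run(JobSpec(benchmark="demo", model="my-model")))
```

The point of the pattern is that the service only depends on the shared interface, so each framework can be wrapped independently and shipped as its own container image.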

| Framework | Container Image | Kubernetes | Notes |
|---|---|---|---|
| LightEval | quay.io/evalhub/community-lighteval:latest | ✓ | Lightweight evaluation framework for language models |
| GuideLLM | quay.io/evalhub/community-guidellm:latest | ✓ | Performance benchmarking for LLM inference servers |
| MTEB | quay.io/evalhub/community-mteb:latest | ✓ | Massive Text Embedding Benchmark for embedding models |
| IBM CLEAR | quay.io/evalhub/community-ibm-clear:latest | ✓ | Agentic trace analysis (LLM-as-judge error reporting) |
| Inspect AI | quay.io/evalhub/community-inspect:latest | ✓ | UK AISI framework: Petri/Bloom alignment auditing and 75 inspect-evals benchmarks |

The Inspect AI adapter exposes alignment auditing and safety evaluation through the Petri and Bloom tools from Meridian Labs, as well as 35 curated benchmarks from the inspect-evals community library.
75 benchmarks across three categories:
- 36 Petri alignment audits, covering all 40 built-in seed tag categories, including sycophancy, deception, alignment faking, jailbreak, harmful cooperation, self-preservation, power seeking, and oversight subversion.
- 2 Bloom behavioral suites: automated scenario generation from high-level behavior descriptions.
- 36 inspect-evals — safety (AgentHarm, WMDP, StrongREJECT, MASK), scheming (agentic misalignment, GDM self-proliferation, GDM stealth), cybersecurity (Cybench, CyberSecEval), coding (HumanEval, SWE-bench), math (GSM8K, MATH, AIME), knowledge (MMLU, GPQA), and agent capabilities (GAIA, TheAgentCompany).
Model configuration: job specs need no provider prefixes. The adapter detects the correct API from environment variables:

| Environment variable | API used |
|---|---|
| OPENAI_BASE_URL + OPENAI_API_KEY | OpenAI-compatible (vLLM, OpenRouter) |
| OLLAMA_BASE_URL or port 11434 | Ollama native |
| ANTHROPIC_API_KEY | Anthropic Messages API |

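
The detection rules in the table can be sketched roughly as follows. This is an illustration, not the adapter's actual code: the function name `detect_api`, the returned labels, and the order in which the rules are checked are assumptions. "Port 11434" is interpreted here as the configured `OPENAI_BASE_URL` pointing at Ollama's default port, which is also an assumption.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit


def _port_of(url: str) -> Optional[int]:
    """Return the explicit port in a URL, or None if absent or malformed."""
    try:
        return urlsplit(url).port
    except ValueError:
        return None


def detect_api(env: Mapping[str, str]) -> Optional[str]:
    """Map environment variables to an API label (hypothetical helper)."""
    openai_url = env.get("OPENAI_BASE_URL", "")

    # ASSUMPTION: Ollama is checked first, so a base URL on port 11434
    # is treated as Ollama even when OpenAI-style variables are set.
    if env.get("OLLAMA_BASE_URL") or _port_of(openai_url) == 11434:
        return "ollama"

    # OpenAI-compatible servers (vLLM, OpenRouter) need both URL and key.
    if openai_url and env.get("OPENAI_API_KEY"):
        return "openai-compatible"

    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"

    # Nothing recognizable is configured.
    return None


if __name__ == "__main__":
    print(detect_api(os.environ))
```

Taking the environment as a plain mapping rather than reading `os.environ` inside the function keeps the rules easy to exercise with different configurations.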
See adapters/inspect/README.md for full documentation, deployment examples, and benchmark catalog.

| Framework | Container Image | Local | Kubernetes | Notes |
|---|---|---|---|---|
| LightEval | quay.io/evalhub/community-lighteval:latest | ✗ | ✓ | Lightweight evaluation framework for language models |
| GuideLLM | quay.io/evalhub/community-guidellm:latest | ✗ | ✓ | Performance benchmarking platform for LLM inference servers |
| MTEB | quay.io/evalhub/community-mteb:latest | ✗ | ✓ | Massive Text Embedding Benchmark for embedding models |
| IBM CLEAR | quay.io/evalhub/community-ibm-clear:latest | ✓ | ✓ | Agentic trace analysis (LLM-as-judge error reporting) |
| RAGAS | quay.io/evalhub/community-ragas:latest | ✗ | ✓ | RAG pipeline quality evaluation (faithfulness, relevancy, context precision/recall, and more) |
| SWE-bench | quay.io/evalhub/community-swebench:latest | ✗ | ✓ | Software engineering benchmark for code patch evaluation |


    # Build specific adapter
    make image-lighteval
    make image-guidellm
    make image-inspect

    # Build all adapters
    make images

    # Run adapter tests
    make test-inspect
    make tests

    # Push to registry
    make push-inspect REGISTRY=quay.io/your-org VERSION=v1.0.0
    make push-lighteval REGISTRY=quay.io/your-org VERSION=v1.0.0

See CONTRIBUTING.md for guidelines on adding adapters.
See the LICENSE file for details.