Agent Skills for EvalHub, following the agentskills.io open format.
These skills enable AI coding agents (Claude Code, Copilot, Codex, etc.) to discover and execute EvalHub model evaluations during development sessions.
| Skill | Description |
|---|---|
| `evalhub` | Full skill: discovery, evaluation, job lifecycle, and EDD workflows |
| `evalhub-discovery` | Discover providers, benchmarks, and collections; read agent metadata |
| `evalhub-eval` | Submit evaluation jobs against benchmarks or collections |
| `evalhub-jobs` | Monitor, wait on, cancel, and fetch logs for evaluation jobs |
- Python 3.11+
- uv (scripts use PEP 723 inline metadata for auto-dependency resolution)
- Network access to an EvalHub service
- `EVALHUB_BASE_URL`, `EVALHUB_TOKEN`, and `EVALHUB_TENANT` environment variables
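
As a rough illustration of the prerequisites above, a script with PEP 723 inline metadata declares its Python version and dependencies in a leading comment block that `uv run` reads before executing it. The sketch below is hypothetical: the file name `check_env.py`, the function `missing_env_vars`, and the empty dependency list are assumptions for illustration and are not part of the shipped skills. It only checks that the three required environment variables are set.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
"""Hypothetical pre-flight check; not one of the shipped skill scripts."""
import os

# The three variables named in the prerequisites list.
REQUIRED_VARS = ("EVALHUB_BASE_URL", "EVALHUB_TOKEN", "EVALHUB_TENANT")


def missing_env_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("EvalHub environment looks complete.")
```

Saved under the assumed name, it could be run with `uv run check_env.py`; with an empty dependency list, uv has nothing to install and simply runs the script.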
/plugin marketplace add eval-hub/eval-hub-skills
/plugin install evalhub@evalhub

The skill is then available as `/evalhub:evalhub` in any Claude Code session.
Clone the repo and symlink the skills into ~/.claude/skills/:
git clone https://github.com/eval-hub/eval-hub-skills
cd eval-hub-skills
make install-all # installs all four skills

To install only the primary skill:

make install

Because the skills are symlinked, changes to the skill source are reflected immediately without reinstalling.
If EvalHub exposes an MCP server on your cluster, you can register it directly with Claude Code using the claude mcp add CLI command. EvalHub's MCP requires a bearer token and an x-tenant header (the namespace):
claude mcp add evalhub "$EVALHUB_BASE_URL/mcp" \
--transport http \
--header "Authorization: Bearer $EVALHUB_TOKEN" \
--header "x-tenant: $EVALHUB_TENANT"

This writes the server into your local Claude Code config (.claude/settings.json). Use `--scope user` to register it globally across all projects instead.
Note: OpenShift tokens expire. If you get 401 errors, refresh the token with `export EVALHUB_TOKEN="$(oc whoami -t)"` and re-run the command.
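
For readers calling the service from their own scripts, the two headers described above can be sketched in Python as follows. The function names `build_headers` and `needs_token_refresh` are hypothetical helpers, not part of EvalHub or the skills; only the header names and the 401 behaviour come from the text above.

```python
def build_headers(token: str, tenant: str) -> dict[str, str]:
    """Build the bearer-token and x-tenant headers described above."""
    return {
        "Authorization": f"Bearer {token}",
        "x-tenant": tenant,  # the tenant is the namespace
    }


def needs_token_refresh(status_code: int) -> bool:
    """A 401 response suggests the OpenShift token has expired."""
    return status_code == 401
```

When `needs_token_refresh` returns true, the token would be refreshed as in the note above before the request is retried.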
make uninstall-all # remove all skills
make uninstall # remove primary skill only

make update-all

make check

Set these environment variables before using the skill:
export EVALHUB_BASE_URL="https://evalhub.apps.cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"

Optionally, for clusters with self-signed certificates:

export EVALHUB_INSECURE=true

Once installed, the skill is automatically discovered by Claude Code. Ask questions in natural language in any session.
You: What evaluation providers are available?
The skill fetches live metadata from the API and summarises each provider:
Available providers:
- AgentDojo: Test agent resilience to prompt injection across realistic tool-use scenarios. Targets: agent. Evaluates: agent-safety, prompt-injection, tool-use-security.
- LM Evaluation Harness: Comprehensive evaluation framework with 180 benchmarks covering math, science, coding, safety, reasoning, and more.
- Garak: LLM vulnerability scanner and red-teaming framework.
- GuideLLM: Performance benchmarking for LLM inference servers.
- SWE-bench: Software engineering benchmark using real-world GitHub issues.
- Lighteval: Lightweight LLM evaluation from Hugging Face.

You: Which providers can evaluate my agent for prompt injection?
[
{
"id": "agentdojo",
"name": "AgentDojo",
"summary": "Test agent resilience to prompt injection across realistic tool-use scenarios",
"target_type": "agent",
"evaluates": ["agent-safety", "prompt-injection", "tool-use-security"]
}
]

Claude: AgentDojo is the right fit here. It runs your agent through realistic tool-use scenarios (workspace, Slack, banking, travel) and measures both utility (task completion) and security (injection resistance). The model must support tool/function calling via an OpenAI-compatible API.
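
The selection in this exchange amounts to filtering provider metadata by its `evaluates` tags and `target_type`. A minimal sketch, assuming the metadata has the shape shown in the JSON above: the helper `providers_for` is hypothetical, and the GuideLLM entry's `target_type` and `evaluates` values are placeholders invented for the example.

```python
# First entry mirrors the JSON above; the second entry's field values are
# made-up placeholders so the filter has something to exclude.
PROVIDERS = [
    {
        "id": "agentdojo",
        "name": "AgentDojo",
        "target_type": "agent",
        "evaluates": ["agent-safety", "prompt-injection", "tool-use-security"],
    },
    {
        "id": "guidellm",
        "name": "GuideLLM",
        "target_type": "model",
        "evaluates": ["performance"],
    },
]


def providers_for(providers, tag, target_type=None):
    """Return providers whose 'evaluates' list contains tag.

    If target_type is given, also require a matching 'target_type'.
    """
    return [
        p
        for p in providers
        if tag in p.get("evaluates", [])
        and (target_type is None or p.get("target_type") == target_type)
    ]


matches = providers_for(PROVIDERS, "prompt-injection", target_type="agent")
print([p["id"] for p in matches])
```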
You: What evaluation suites are available?
[
{
"id": "standard-llm-evals-v1",
"name": "Standard LLM Evaluation Suite v1",
"category": "general"
},
{
"id": "reasoning-v1",
"name": "Reasoning v1",
"category": "reasoning"
},
{
"id": "coding-v1",
"name": "Coding v1",
"category": "code"
},
{
"id": "instruction-following-v1",
"name": "Instruction Following v1",
"category": "instruction_following"
},
{
"id": "long-context-v1",
"name": "Long Context v1",
"category": "long_context"
},
{
"id": "toxicity-and-ethical-principles",
"name": "Toxicity and Ethical Principles",
"category": "safety"
}
]

You: Run the standard LLM eval suite against my model at `http://vllm-svc:8000/v1` (model name: `mistral-7b-instruct`).
{
"resource": {
"id": "eval-mistral-7b-instruct-a3f2c1",
"created_at": "2026-06-07T14:23:11Z"
},
"name": "eval-mistral-7b-instruct",
"state": "pending"
}

Claude: Job submitted. ID: `eval-mistral-7b-instruct-a3f2c1`. The standard suite covers instruction following, reasoning, math, coding, and long-context tasks. It runs multiple benchmarks with 3 seeds each, so expect it to take 30–60 minutes. Say "check the status" or "wait for it to finish" when you're ready.
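
"Wait for it to finish" is, in essence, a polling loop over the job state. The sketch below shows one way such a loop might look; it is not the skill's implementation. The `get_state` callable is a hypothetical stand-in for whatever fetches the job's `state` field, and the terminal state names are assumptions, since only `pending` and `running` appear in the examples here.

```python
import time

# Assumption: these terminal state names are guesses for illustration;
# the examples in this README only show "pending" and "running".
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(get_state, job_id, interval_s=30.0, timeout_s=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_state(job_id) until a terminal state or the timeout.

    get_state is a caller-supplied function returning the state string.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing the state lookup in as a function keeps the loop independent of how the job is actually queried.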
You: What's the status of my eval jobs?
[
{
"id": "eval-mistral-7b-instruct-a3f2c1",
"name": "eval-mistral-7b-instruct",
"state": "running",
"created_at": "2026-06-07T14:23:11Z"
}
]

You: I'm about to refactor my model's safety guardrails. What should I run before and after to measure the impact?
Claude: I'd suggest this before/after plan using the live providers:
Before: Run `toxicity-and-ethical-principles` (covers toxic content, truthfulness, and HHH alignment) and `garak` (vulnerability scan for known attack patterns) to establish a baseline.

After: Run the same two suites. Diff the scores: a regression in `toxigen` or `truthfulqa_mc1` means the refactor weakened a safety property.

Want me to kick off the baseline run now?
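
The diff step in that plan can be sketched as a comparison of two score dictionaries. This is an illustration only: `find_regressions` is a hypothetical helper, the scores are made-up numbers rather than real results, and the code assumes that a higher score is better for every metric, which may not hold for all benchmarks.

```python
def find_regressions(baseline, after, tolerance=0.01):
    """Return {metric: (before, after)} for scores that dropped by more than tolerance.

    Assumes higher is better; metrics missing from `after` are skipped.
    """
    regressions = {}
    for metric, before in baseline.items():
        if metric in after and before - after[metric] > tolerance:
            regressions[metric] = (before, after[metric])
    return regressions


# Made-up scores for illustration, not measured results.
baseline_scores = {"toxigen": 0.62, "truthfulqa_mc1": 0.41}
after_scores = {"toxigen": 0.55, "truthfulqa_mc1": 0.41}

for metric, (before, now) in find_regressions(baseline_scores, after_scores).items():
    print(f"{metric}: {before} -> {now}")
```

The `tolerance` parameter is there so that small run-to-run noise is not reported as a regression; a suitable value depends on the benchmark and the number of seeds.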
These skills are for use during the development of eval-hub components.
| Skill | Description |
|---|---|
| `golang-cve-update` | Scans Go repositories for vulnerabilities. |
golang-cve-update (dev-skills/golang-cve-update/) scans Go repositories for vulnerabilities with govulncheck, determines the minimum fixed Go version, and updates version pins in go.mod, container files, and CI only when a matching tag exists on registry.access.redhat.com/ubi9/go-toolset. If Red Hat has not published the required image yet, the skill reports the gap and leaves version files unchanged.
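
The decision rule described above can be sketched as follows. This is not the skill's code: `parse_version` and `plan_go_update` are hypothetical names, the inputs are assumed to be "fixed in" versions already extracted from the scan plus a set of tags already listed from the registry, and a "matching tag" is taken to mean an exact version match.

```python
def parse_version(text):
    """Turn '1.22.7' or 'go1.22.7' into a comparable tuple such as (1, 22, 7)."""
    return tuple(int(part) for part in text.removeprefix("go").split("."))


def plan_go_update(fixed_versions, available_tags):
    """Decide whether the version pins can be updated.

    fixed_versions: "fixed in" Go versions, one per finding (assumed input).
    available_tags: image tags known to exist (assumed input).
    The minimum version that fixes every finding is the highest "fixed in" version.
    """
    if not fixed_versions:
        return {"action": "none", "version": None}
    required = max(fixed_versions, key=parse_version).removeprefix("go")
    if required in available_tags:
        return {"action": "update", "version": required}
    # No matching image yet: report the gap and leave version files unchanged.
    return {"action": "report-gap", "version": required}


print(plan_go_update(["go1.22.5", "go1.22.7"], {"1.22.5", "1.22.7"}))
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating `1.22.10` as older than `1.22.9`.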
/plugin install golang-cve-update@evalhub

License: Apache-2.0