eval-hub-skills

Agent Skills for EvalHub, following the agentskills.io open format.

These skills enable AI coding agents (Claude Code, Copilot, Codex, etc.) to discover and execute EvalHub model evaluations during development sessions.

Skills

Skill	Description
`evalhub`	Full skill — discovery, evaluation, job lifecycle, and EDD workflows
`evalhub-discovery`	Discover providers, benchmarks, and collections; read agent metadata
`evalhub-eval`	Submit evaluation jobs against benchmarks or collections
`evalhub-jobs`	Monitor, wait on, cancel, and fetch logs for evaluation jobs

Installation

Prerequisites

Python 3.11+
uv (scripts use PEP 723 inline metadata for auto-dependency resolution)
Network access to an EvalHub service
EVALHUB_BASE_URL, EVALHUB_TOKEN, EVALHUB_TENANT environment variables
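
For readers new to PEP 723, here is a minimal sketch of what an inline metadata block looks like and how a tool could read it. The sample script, its `httpx` dependency, and the `extract_script_block` helper are hypothetical illustrations and are not taken from the EvalHub scripts.

```python
# Sketch only: the EXAMPLE script and its "httpx" dependency are hypothetical.
EXAMPLE = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
# ]
# ///
print("hello from a self-describing script")
'''


def extract_script_block(source: str) -> list[str]:
    """Return the TOML lines inside the first `# /// script` block, or [] if absent."""
    body: list[str] = []
    inside = False
    for line in source.splitlines():
        if not inside:
            # Keep scanning until the opening marker is found.
            inside = line == "# /// script"
            continue
        if line == "# ///":
            return body  # closing marker: the block is complete
        if line.startswith("# "):
            body.append(line[2:])  # strip the "# " comment prefix
        elif line == "#":
            body.append("")  # a bare "#" is an empty metadata line
        else:
            return []  # non-comment line before the closing marker: malformed
    return []  # no closing marker found


if __name__ == "__main__":
    for toml_line in extract_script_block(EXAMPLE):
        print(toml_line)
```

When a script carries a block like this, `uv run script.py` reads it and prepares an environment that satisfies the declared requirements before running the script.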

Install via Claude Code plugin (recommended)

/plugin marketplace add eval-hub/eval-hub-skills
/plugin install evalhub@evalhub

The skill is then available as /evalhub:evalhub in any Claude Code session.

Install locally (development)

Clone the repo and symlink the skills into ~/.claude/skills/:

git clone https://github.com/eval-hub/eval-hub-skills
cd eval-hub-skills
make install-all   # installs all four skills

To install only the primary skill:

make install

Because the skills are symlinked, changes to the skill source take effect immediately, with no reinstall.

Connect to an MCP server on a cluster

If EvalHub exposes an MCP server on your cluster, you can register it directly with Claude Code using the claude mcp add CLI command. EvalHub's MCP requires a bearer token and an x-tenant header (the namespace):

claude mcp add evalhub "$EVALHUB_BASE_URL/mcp" \
  --transport http \
  --header "Authorization: Bearer $EVALHUB_TOKEN" \
  --header "x-tenant: $EVALHUB_TENANT"

This registers the server at Claude Code's default local scope, so it applies only to the current project. Use --scope user to register it across all your projects instead.

Note: OpenShift tokens expire. If you get 401 errors, refresh with export EVALHUB_TOKEN="$(oc whoami -t)" and re-run the command.
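
The two headers described above can also be assembled in Python, for example when checking connectivity outside Claude Code. This is a minimal sketch: the helper names `mcp_headers` and `build_request` are hypothetical, the `/mcp` path is taken from the command above, and the request is prepared but never sent.

```python
import os
import urllib.request


def mcp_headers(token: str, tenant: str) -> dict[str, str]:
    """Headers the section describes: a bearer token plus the tenant namespace."""
    return {
        "Authorization": f"Bearer {token}",
        "x-tenant": tenant,
    }


def build_request(base_url: str, token: str, tenant: str) -> urllib.request.Request:
    """Prepare (but do not send) a request to the MCP endpoint."""
    url = base_url.rstrip("/") + "/mcp"
    return urllib.request.Request(url, headers=mcp_headers(token, tenant))


if __name__ == "__main__":
    # Fallback values are placeholders so the sketch runs without a cluster.
    req = build_request(
        os.environ.get("EVALHUB_BASE_URL", "https://evalhub.apps.cluster.example.com"),
        os.environ.get("EVALHUB_TOKEN", "example-token"),
        os.environ.get("EVALHUB_TENANT", "eval-test"),
    )
    print(req.full_url)
```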

Uninstall

make uninstall-all   # remove all skills
make uninstall       # remove primary skill only

Update

make update-all

Validate

make check

Configuration

Set these environment variables before using the skill:

export EVALHUB_BASE_URL="https://evalhub.apps.cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"

Optionally, for clusters with self-signed certificates:

export EVALHUB_INSECURE=true
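
A script that consumes these variables might validate them up front. The following is a minimal sketch, not the skill's actual loader: the `EvalHubConfig` class is hypothetical, and treating only the value `true` as enabling insecure mode is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("EVALHUB_BASE_URL", "EVALHUB_TOKEN", "EVALHUB_TENANT")


@dataclass(frozen=True)
class EvalHubConfig:
    base_url: str
    token: str
    tenant: str
    insecure: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "EvalHubConfig":
        """Build a config from environment variables, failing fast if any are missing."""
        missing = [name for name in REQUIRED if not env.get(name)]
        if missing:
            raise ValueError("missing environment variables: " + ", ".join(missing))
        # Assumption: only the value "true" (any case) turns on insecure mode.
        insecure = env.get("EVALHUB_INSECURE", "").strip().lower() == "true"
        return cls(
            base_url=env["EVALHUB_BASE_URL"].rstrip("/"),
            token=env["EVALHUB_TOKEN"],
            tenant=env["EVALHUB_TENANT"],
            insecure=insecure,
        )
```

Failing before the first request gives a clearer error than a 401 or a connection failure later on.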

Usage

Once installed, Claude Code discovers the skill automatically. Ask in plain language in any session.

Example: discover what providers are available

You: What evaluation providers are available?

The skill fetches live metadata from the API and summarises each provider:

Available providers:

- AgentDojo — Test agent resilience to prompt injection across realistic tool-use
  scenarios. Targets: agent. Evaluates: agent-safety, prompt-injection,
  tool-use-security.

- LM Evaluation Harness — Comprehensive evaluation framework with 180 benchmarks
  covering math, science, coding, safety, reasoning, and more.

- Garak — LLM vulnerability scanner and red-teaming framework.

- GuideLLM — Performance benchmarking for LLM inference servers.

- SWE-bench — Software engineering benchmark using real-world GitHub issues.

- Lighteval — Lightweight LLM evaluation from Hugging Face.

Example: find providers for a specific use case

You: Which providers can evaluate my agent for prompt injection?

[
  {
    "id": "agentdojo",
    "name": "AgentDojo",
    "summary": "Test agent resilience to prompt injection across realistic tool-use scenarios",
    "target_type": "agent",
    "evaluates": ["agent-safety", "prompt-injection", "tool-use-security"]
  }
]

Claude: AgentDojo is the right fit here. It runs your agent through realistic tool-use scenarios (workspace, Slack, banking, travel) and measures both utility (task completion) and security (injection resistance). The model must support tool/function calling via an OpenAI-compatible API.
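
The capability match in this example can be reproduced client-side. This is a minimal sketch, assuming provider records have the shape shown above; the `providers_for` helper and the second sample record are hypothetical.

```python
PROVIDERS = [
    {
        "id": "agentdojo",
        "name": "AgentDojo",
        "target_type": "agent",
        "evaluates": ["agent-safety", "prompt-injection", "tool-use-security"],
    },
    # Hypothetical record, included only so the filter has something to exclude.
    {
        "id": "example-perf",
        "name": "Example Performance Provider",
        "target_type": "model",
        "evaluates": ["throughput"],
    },
]


def providers_for(providers, capability, target_type=None):
    """Return providers whose `evaluates` list contains `capability`.

    If `target_type` is given, the provider's `target_type` must match too.
    """
    matches = []
    for provider in providers:
        if capability not in provider.get("evaluates", []):
            continue
        if target_type is not None and provider.get("target_type") != target_type:
            continue
        matches.append(provider)
    return matches


if __name__ == "__main__":
    for provider in providers_for(PROVIDERS, "prompt-injection", target_type="agent"):
        print(provider["id"], "-", provider["name"])
```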

Example: list available evaluation collections

You: What evaluation suites are available?

[
  {
    "id": "standard-llm-evals-v1",
    "name": "Standard LLM Evaluation Suite v1",
    "category": "general"
  },
  {
    "id": "reasoning-v1",
    "name": "Reasoning v1",
    "category": "reasoning"
  },
  {
    "id": "coding-v1",
    "name": "Coding v1",
    "category": "code"
  },
  {
    "id": "instruction-following-v1",
    "name": "Instruction Following v1",
    "category": "instruction_following"
  },
  {
    "id": "long-context-v1",
    "name": "Long Context v1",
    "category": "long_context"
  },
  {
    "id": "toxicity-and-ethical-principles",
    "name": "Toxicity and Ethical Principles",
    "category": "safety"
  }
]

Example: submit an evaluation job

You: Run the standard LLM eval suite against my model at http://vllm-svc:8000/v1 (model name: mistral-7b-instruct).

{
  "resource": {
    "id": "eval-mistral-7b-instruct-a3f2c1",
    "created_at": "2026-06-07T14:23:11Z"
  },
  "name": "eval-mistral-7b-instruct",
  "state": "pending"
}

Claude: Job submitted. ID: eval-mistral-7b-instruct-a3f2c1. The standard suite covers instruction following, reasoning, math, coding, and long-context tasks. It runs multiple benchmarks with 3 seeds each — expect it to take 30–60 minutes. Say "check the status" or "wait for it to finish" when you're ready.

Example: check job status

You: What's the status of my eval jobs?

[
  {
    "id": "eval-mistral-7b-instruct-a3f2c1",
    "name": "eval-mistral-7b-instruct",
    "state": "running",
    "created_at": "2026-06-07T14:23:11Z"
  }
]
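
Waiting on a job amounts to polling its state until it stops changing. The sketch below illustrates that loop under stated assumptions: `fetch_state` is a hypothetical callable standing in for whatever call returns the job's `state`, and the terminal state names are guesses, since only `pending` and `running` appear in the examples above.

```python
import time
from typing import Callable

# Assumption: these terminal state names are illustrative, not confirmed by EvalHub.
TERMINAL_STATES = frozenset({"completed", "failed", "cancelled"})


def wait_for_job(
    fetch_state: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `fetch_state` until a terminal state is seen or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_s} seconds")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated job that moves through three states without real waiting.
    states = iter(["pending", "running", "completed"])
    print(wait_for_job(lambda: next(states), sleep=lambda _seconds: None))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.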

Example: EDD — measure impact before and after a change

You: I'm about to refactor my model's safety guardrails. What should I run before and after to measure the impact?

Claude: I'd suggest this before/after plan using the live providers:

Before: Run toxicity-and-ethical-principles (covers toxic content, truthfulness, and HHH alignment) and garak (vulnerability scan for known attack patterns) to establish a baseline.

After: Run the same two suites. Diff the scores — a regression in toxigen or truthfulqa_mc1 means the refactor weakened a safety property.

Want me to kick off the baseline run now?
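
The "diff the scores" step can be sketched as a small comparison function. This is an illustration only: the `find_regressions` helper is hypothetical, the scores are invented, and it assumes that a higher score is better for every metric compared.

```python
def find_regressions(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return {metric: delta} for metrics that dropped by more than `tolerance`.

    Assumption: higher scores are better for every metric passed in.
    Metrics missing from `after` are skipped rather than treated as regressions.
    """
    regressions: dict[str, float] = {}
    for metric, baseline in before.items():
        if metric not in after:
            continue
        delta = after[metric] - baseline
        if delta < -tolerance:
            regressions[metric] = round(delta, 4)
    return regressions


if __name__ == "__main__":
    # Invented scores, for illustration only.
    baseline = {"toxigen": 0.82, "truthfulqa_mc1": 0.45}
    candidate = {"toxigen": 0.74, "truthfulqa_mc1": 0.455}
    print(find_regressions(baseline, candidate))
```

The tolerance exists because scores vary between runs; a sensible value depends on how noisy each benchmark is.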

Development skills

These skills are for use during the development of eval-hub components.

Skill	Description
`golang-cve-update`	Scans Go repositories for vulnerabilities.

golang-cve-update

golang-cve-update (dev-skills/golang-cve-update/) scans Go repositories for vulnerabilities with govulncheck, determines the minimum fixed Go version, and updates version pins in go.mod, container files, and CI only when a matching tag exists on registry.access.redhat.com/ubi9/go-toolset. If Red Hat has not published the required image yet, the skill reports the gap and leaves version files unchanged.
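
The decision rule described above can be sketched as follows. This is not the skill's implementation: the helper names are hypothetical, versions are assumed to be plain dotted numbers such as `1.22.9`, and a tag is assumed to match when it equals the version or extends it with a `-<build>` suffix.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn "1.22.9" into (1, 22, 9) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def minimum_fixed_version(fixed_in: list[str]) -> str:
    """The lowest Go version that fixes every finding is the highest per-finding fix."""
    return max(fixed_in, key=parse_version)


def plan_update(required: str, image_tags: list[str]) -> dict:
    """Decide whether version pins can move to `required`.

    Assumption: a tag matches if it equals the version or starts with "<version>-".
    """
    matching = sorted(
        tag for tag in image_tags
        if tag == required or tag.startswith(required + "-")
    )
    if matching:
        return {"action": "update", "go_version": required, "image_tag": matching[-1]}
    # No published image yet: report the gap and leave version files untouched.
    return {"action": "report-gap", "go_version": required, "image_tag": None}


if __name__ == "__main__":
    # Hypothetical findings and tags, for illustration only.
    required = minimum_fixed_version(["1.22.7", "1.22.10", "1.22.9"])
    print(plan_update(required, ["1.22.7", "1.22.9"]))
```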

Installation

/plugin install golang-cve-update@evalhub

License

Apache-2.0
