1 change: 1 addition & 0 deletions docs.json
{
  "group": "Advanced features",
  "pages": [
    "features/sdk-tracing",
    "features/mcp",
    "features/synthetic-data-generation",
    "features/multi-turn-simulation",
314 changes: 153 additions & 161 deletions features/mcp.mdx

## Overview

Scorecard's MCP (Model Context Protocol) server lets you manage projects, create testsets, configure metrics, run evaluations, and analyze results through natural language in any MCP-compatible client.

## Available Tools

The MCP server exposes tools across ten categories, giving full programmatic access to Scorecard.

<Frame caption="Scorecard MCP server tools listed in Claude Code.">
<img src="/images/mcp-tools-overview.png" alt="Scorecard MCP server tool listing showing ~45 available tools across Metrics, Scores, Systems, Annotations, and Docs." />
</Frame>

<AccordionGroup>

<Accordion title="Projects">

| Tool | Description |
|------|-------------|
| `list_projects` | List all Projects, ordered by creation date |
| `create_projects` | Create a new Project |

</Accordion>

<Accordion title="Testsets">

| Tool | Description |
|------|-------------|
| `list_testsets` | List Testsets in a Project |
| `get_testsets` | Get a specific Testset by ID |
| `create_testsets` | Create a new Testset with a JSON schema and field mappings |
| `update_testsets` | Update a Testset's name, description, schema, or field mappings |
| `delete_testsets` | Delete a Testset |
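
For example, a `create_testsets` call might pass arguments shaped like the sketch below. The field names here are illustrative, not authoritative; the tool's own input schema is the source of truth:

```json
{
  "projectId": "123",
  "name": "Support Scenarios",
  "description": "Customer support prompts with ideal responses",
  "jsonSchema": {
    "type": "object",
    "properties": {
      "customerMessage": { "type": "string" },
      "idealResponse": { "type": "string" }
    },
    "required": ["customerMessage", "idealResponse"]
  },
  "fieldMapping": {
    "inputs": ["customerMessage"],
    "expected": ["idealResponse"],
    "metadata": []
  }
}
```

The field mappings tell Scorecard which schema fields are system inputs and which are expected outputs.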

</Accordion>

<Accordion title="Testcases">

| Tool | Description |
|------|-------------|
| `list_testcases` | List Testcases in a Testset |
| `get_testcases` | Get a specific Testcase by ID |
| `create_testcases` | Create up to 100 Testcases in a Testset |
| `update_testcases` | Replace the data of an existing Testcase |
| `delete_testcases` | Delete multiple Testcases by ID |
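
A `create_testcases` call takes a Testset ID plus a batch of items whose data must conform to that Testset's JSON schema. A hypothetical payload (the `items` and `jsonData` names are illustrative; check the tool schema):

```json
{
  "testsetId": "456",
  "items": [
    { "jsonData": { "customerMessage": "Where is my refund?", "idealResponse": "Explain the refund timeline and link the policy." } },
    { "jsonData": { "customerMessage": "I can't reset my password.", "idealResponse": "Walk through the password reset flow step by step." } }
  ]
}
```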

</Accordion>

<Accordion title="Metrics">

| Tool | Description |
|------|-------------|
| `list_metrics` | List Metrics configured for a Project |
| `get_metrics` | Get a specific Metric by ID |
| `create_metrics` | Create a Metric — supports `ai`, `human`, and `heuristic` eval types with `int`, `float`, or `boolean` output |
| `update_metrics` | Update an existing Metric |
| `delete_metrics` | Delete a Metric by ID |
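
As a rough sketch, an AI metric created via `create_metrics` pairs an output type with a judge prompt. The field names below are illustrative only; consult the tool's input schema for the exact shape:

```json
{
  "projectId": "123",
  "name": "Response Accuracy",
  "evalType": "ai",
  "outputType": "int",
  "promptTemplate": "Rate from 1 to 5 how well the response answers the customer's question."
}
```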

</Accordion>

<Accordion title="Runs">

| Tool | Description |
|------|-------------|
| `list_runs` | List Runs for a Project, most recent first |
| `get_runs` | Get a specific Run by ID |
| `create_runs` | Create a new Run against a Testset and System Version |
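
Conceptually, `create_runs` ties together a Testset, the Metrics to score with, and the System Version under test. An illustrative payload (field names are assumptions, not the definitive schema):

```json
{
  "projectId": "123",
  "testsetId": "456",
  "metricIds": ["789"],
  "systemVersionId": "abc-123"
}
```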

</Accordion>

<Accordion title="Records">

| Tool | Description |
|------|-------------|
| `list_records` | List Records for a Run, including all scores |
| `create_records` | Create a new Record (system execution result) in a Run |
| `delete_records` | Delete a specific Record by ID |

</Accordion>

<Accordion title="Scores">

| Tool | Description |
|------|-------------|
| `upsert_scores` | Create or update a Score for a Record and Metric — updates if one already exists |

</Accordion>

<Accordion title="Systems">

| Tool | Description |
|------|-------------|
| `list_systems` | List all Systems in a Project |
| `get_systems` | Get a specific System by ID |
| `upsert_systems` | Create a System, or update it if one with the same name exists |
| `update_systems` | Update an existing System's name, description, or production version |
| `delete_systems` | Delete a System by ID |
| `get_systems_versions` | Get a specific System Version by ID |
| `upsert_systems_versions` | Create a System Version, or update its name if the config already exists |

</Accordion>

<Accordion title="Annotations">

| Tool | Description |
|------|-------------|
| `list_annotations` | List annotations (ratings and comments) for a specific Record |

</Accordion>

<Accordion title="Docs">

| Tool | Description |
|------|-------------|
| `search_docs` | Search SDK/API documentation — supports Python, TypeScript, Go, and more |

</Accordion>

</AccordionGroup>

## Setting Up the MCP Server

### Claude Code

Add the Scorecard remote MCP server with a single command:

```bash
claude mcp add --transport http scorecard https://mcp.scorecard.io/mcp
```

Complete the OAuth authentication flow in your browser when prompted. Verify the connection:

```bash
claude mcp list
```

You should see `scorecard: https://mcp.scorecard.io/mcp (HTTP) - ✓ Connected`.

### Claude Desktop

Go to Claude Desktop settings and open the "Connectors" tab. Click "Add custom connector" and paste the URL `https://mcp.scorecard.io/mcp`. Click "Add", then "Connect" to log in to Scorecard.

<DarkLightImage
lightSrc="/images/claude-desktop-mcp-light.png"
caption="Adding the Scorecard MCP connector in Claude Desktop."
/>

### Local configuration

You can run the MCP server locally via npx:

```sh
export SCORECARD_API_KEY="your_api_key"
npx -y scorecard-ai-mcp@latest
```

For clients with a configuration JSON:

```json
{
"mcpServers": {
"scorecard_ai": {
"command": "npx",
"args": ["-y", "scorecard-ai-mcp", "--client=claude", "--tools=dynamic"],
"env": {
"SCORECARD_API_KEY": "ak_MyAPIKey"
}
}
}
}
```
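
For example, Cursor reads the same `mcpServers` shape from `.cursor/mcp.json` in your project root (or `~/.cursor/mcp.json` globally), and for the hosted server you can point it at the remote URL instead of running npx. This is a sketch; consult Cursor's MCP documentation for the current format:

```json
{
  "mcpServers": {
    "scorecard": {
      "url": "https://mcp.scorecard.io/mcp"
    }
  }
}
```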

## Examples

### Create a project and testset

```
Create a new Scorecard project called "Support Bot Eval". Then create a testset
called "Support Scenarios" with 10 testcases. Each testcase should have:
- inputs: "customerMessage" and "category" (billing, technical, or product)
- expected: "idealResponse"
```

### Create metrics

```
Create two metrics in the "Support Bot Eval" project:
1. "Response Accuracy" (integer 1-5) - How well does the response answer the question?
2. "Tone" (boolean) - Is the response professional and empathetic?
```

### Analyze results

```
Show me the latest run results for the "Support Bot Eval" project.
Which testcases scored lowest on Response Accuracy?
```

### Generate testcases from a codebase

In Claude Code, you can combine file access with the MCP server:

```
Read the API routes in src/api/ and generate 20 testcases covering
the edge cases for each endpoint. Add them to the "API Tests" testset
in project 1234.
```

### Iterate on metrics

```
The "Response Accuracy" metric is too lenient — update the prompt template
to penalize responses that miss key details from the ideal response.
```

## Technical Details

- Built on the [Model Context Protocol](https://modelcontextprotocol.io/) standard
- Compatible with any MCP client (Claude Code, Claude Desktop, Cursor, and more)
- Secured with OAuth authentication
- Open source: [github.com/scorecard-ai/scorecard-mcp](https://github.com/scorecard-ai/scorecard-mcp)