
EE-Bench Evaluation Guide

How to export datasets, evaluate datapoints, and interpret results for EE-Bench.

Overview

EE-Bench evaluates how well AI-generated code solves real software engineering tasks. Each task (called a datapoint) captures a real pull request: the issue description, the repository snapshot, and the gold-standard patch that solved it.

The evaluation pipeline has three stages:

  1. Export — extract validated datapoints from the dataset repository into a portable format (folders or JSONL)
  2. Evaluate — build a Docker environment from the datapoint, apply a patch (gold or candidate), and run the self-evaluating run.sh script
  3. Interpret — parse the structured JSON result to determine pass/fail status

Currently only the codegen (code generation) evaluation type is supported. Datapoints live in the dpaia/dataset repository, organized as <eval_type>/<source_repo_name>/<instance_id>/.

Exporting via GitHub Actions

The "Export Dataset (v2)" workflow exports datapoints from dpaia/dataset as a downloadable artifact.

Step-by-Step

  1. Navigate to the infrastructure repository's Actions tab
  2. Select "Export Dataset (v2)" from the workflow list
  3. Click "Run workflow"
  4. Fill in the inputs (see table below)
  5. Wait for the workflow to complete
  6. Download the artifact from the workflow run's Artifacts section

Workflow Inputs

| Input | Type | Default | Description |
| --- | --- | --- | --- |
| eval_type | string | codegen | Evaluation type (currently only codegen is supported) |
| search_query | string | (empty) | GitHub search query to filter merged PRs (empty = all merged) |
| format | choice | folders | Output format: folders (directory per instance) or jsonl (one JSON object per line) |
| output_name | string | dataset | Name of the output artifact |
| organization | string | dpaia | GitHub organization |
| dataset_repo | string | dataset | Dataset repository name |

Filtering with search_query

The search_query input uses GitHub's pull request search syntax to filter which merged PRs to include. When empty, all datapoints in the dataset repository are exported.

| Query | Effect |
| --- | --- |
| (empty) | Export all datapoints |
| created:>2025-01-01 | Datapoints from PRs created after January 1, 2025 |
| author:username | Datapoints from PRs by a specific author |
| label:priority | Datapoints from PRs with a specific label |
| created:2025-01-01..2025-06-30 | Datapoints from PRs created in the first half of 2025 |
| label:"Language: C#" | Datapoints from PRs labeled with a specific language |

The workflow searches merged PRs in dpaia/dataset, then locates the corresponding datapoint directory on the filesystem. Instance IDs not found in the repository are skipped with a warning.
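Qualifiers can be combined; GitHub treats space-separated qualifiers as an AND. For example, a hypothetical query restricting the export to C# datapoints created this year:

label:"Language: C#" created:>2025-01-01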

What the Export Produces

Choosing a Format

| Format | Best for | Trade-off |
| --- | --- | --- |
| folders | Local validation, debugging, inspecting individual files | One directory per instance; many small files |
| jsonl | Programmatic consumption, bulk processing, storage | Single file; all contents inlined as strings |

Manifest

Every export includes a manifest.json at the root with export metadata:

{
  "eval_type": "codegen",
  "format": "folders",
  "search_query": "",
  "dataset_repo_ref": "main",
  "dataset_repo_commit": "abc123def456...",
  "exported_at": "2025-06-15T14:30:00Z",
  "datapoint_count": 42,
  "instance_ids": ["devlooped__moq-1259", "spectreconsole__spectre.console-1708", "..."]
}
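
For example, to read the manifest with jq (assuming the artifact was unpacked into dataset/):

# How many datapoints were exported, and which ones
jq '.datapoint_count' dataset/manifest.json
jq -r '.instance_ids[]' dataset/manifest.json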

Folder Format

When format=folders, each instance is a directory:

dataset/
├── manifest.json
├── devlooped__moq-1259/
│   ├── datapoint.json
│   ├── environment/
│   │   └── Dockerfile
│   ├── eval/
│   │   ├── run.sh
│   │   └── scripts/
│   └── verify/
│       └── patch.diff
└── spectreconsole__spectre.console-1708/
    └── ...

JSONL Format

When format=jsonl, all instances are in a single file with one JSON object per line:

dataset/
├── manifest.json
└── dataset.jsonl

Each line in the JSONL file is a self-contained JSON object with all file contents inlined under environment.files, eval.files, and verify.files.
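To pull a single record out of the file, or to write one of its inlined files to disk, jq can filter by instance_id (a sketch, using field paths from the record structure documented below):

# Extract one record by instance_id
jq -c 'select(.instance_id == "devlooped__moq-1259")' dataset/dataset.jsonl > datapoint.json

# Write that record's inlined Dockerfile to disk
jq -r 'select(.instance_id == "devlooped__moq-1259") | .environment.files.Dockerfile' dataset/dataset.jsonl > Dockerfile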

Datapoint Record Structure

Each datapoint — whether a datapoint.json in a folder or a line in JSONL — contains everything needed for evaluation:

{
  "instance_id": "dpaia__moq-1259",
  "pr_number": 1259,
  "repo": "dpaia/moq",
  "base_commit": "abc123...",
  "problem_statement": "Description of the issue...",
  "hints_text": "Optional hints...",
  "version": "1.0",
  "created_at": "2026-01-15T12:00:00Z",
  "build_system": "dotnet",
  "project_root": "/repo",
  "expected": {
    "fail_to_pass": ["Moq.Tests.Regressions.IssueReportsFixture.Issue1259"],
    "pass_to_pass": ["Moq.Tests.MatcherAttributeFixture.TypedMatcherDoesNotMismatch"]
  },
  "environment": {
    "files": {
      "Dockerfile": "FROM mcr.microsoft.com/dotnet/sdk:8.0\n..."
    },
    "docker": { "run_params": "--network=host" }
  },
  "eval": {
    "files": {
      "run.sh": "#!/bin/bash\n...",
      "test_patch.diff": "diff --git a/...\n..."
    }
  },
  "verify": {
    "files": {
      "patch.diff": "diff --git a/...\n..."
    }
  }
}

Any custom fields from metadata.json are passed through as top-level fields.

instance_id derivation: by default, the ID is computed as {owner}__{repo_name}-{pr_number}, with hyphens in the repo name replaced by __ (e.g., dpaia__spectre__console-2).
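
A minimal sketch of that derivation in bash (illustrative only; the exporter's own logic is authoritative):

# Hypothetical helper mirroring the default derivation
derive_instance_id() {
  local owner="$1" repo_name="$2" pr_number="$3"
  # hyphens in the repo name become double underscores
  echo "${owner}__${repo_name//-/__}-${pr_number}"
}

derive_instance_id dpaia spectre-console 2   # prints: dpaia__spectre__console-2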

Dataset Repository Layout

Datapoints in dpaia/dataset are organized by evaluation type and source repository:

{eval_type}/{source_repo_name}/{instance_id}/

For example: codegen/moq/dpaia__moq-1259/
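
In a local clone of dpaia/dataset, the instance directories for an eval type can be listed with a simple loop (a sketch, assuming the layout above):

# List all codegen instance IDs (directories two levels below the eval type)
for dir in codegen/*/*/; do
  basename "$dir"
done | sort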

Evaluating a Datapoint

Evaluation builds a Docker image from the datapoint's Dockerfile, mounts the evaluation scripts and a patch (gold or candidate), and runs run.sh inside the container.

Requirements

  • jq — JSON processor
  • docker — with support for linux/amd64 platform

Quick Start

Folder mode — validate a single instance directory:

bash .github/scripts/validate.sh path/to/instance_id/

JSONL mode — validate a specific instance from a JSONL file:

bash .github/scripts/validate.sh dataset.jsonl instance_id

What the Validation Script Does

  1. Reads datapoint.json from the instance (or extracts from JSONL)
  2. Stages evaluation and submission files to a temp directory
  3. Builds the Docker image: docker build --platform linux/amd64 -t <instance_id>:<commit_short> -f environment/Dockerfile environment/
  4. Runs the container with mounted volumes:
    • /ee-bench/eval/ — evaluation scripts (read-only)
    • /ee-bench/submission/ — gold patch (read-only)
    • Additional docker run params from datapoint.json (environment.docker.run_params)
  5. Executes: bash /ee-bench/eval/run.sh
  6. Parses JSON output (looks for a line containing "schema_version"; see the parsing sketch after this list)
  7. Checks that all 6 criteria pass (run.sh self-evaluates fail_to_pass and pass_to_pass expectations internally)
  8. Exits 0 on success (all criteria pass), 1 on failure
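
A minimal sketch of steps 6 and 7, assuming the container's output was captured in a file named output.log (a hypothetical name):

# Grab the JSON result line (the one containing "schema_version")
RESULT=$(grep '"schema_version"' output.log | tail -n 1)

# Overall status and per-criterion status
echo "$RESULT" | jq -r '.status'
echo "$RESULT" | jq -r '.criteria[] | "\(.criterion): \(.status)"'

# Count passing criteria (the script expects 6 of 6)
echo "$RESULT" | jq '[.criteria[] | select(.status == "pass")] | length'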

Manual Validation

If you need finer control or want to debug issues, follow this step-by-step Docker walkthrough.

Step 1: Read Metadata

INSTANCE_DIR="path/to/instance_id"
cat "$INSTANCE_DIR/datapoint.json" | jq .

# Extract key values
INSTANCE_ID=$(jq -r '.instance_id' "$INSTANCE_DIR/datapoint.json")
BASE_COMMIT=$(jq -r '.base_commit' "$INSTANCE_DIR/datapoint.json")
COMMIT_SHORT="${BASE_COMMIT:0:12}"

Step 2: Build Docker Image

docker build --platform linux/amd64 \
  -t "${INSTANCE_ID}:${COMMIT_SHORT}" \
  -f "$INSTANCE_DIR/environment/Dockerfile" \
  "$INSTANCE_DIR/environment/"

Step 3: Prepare Staging Directory

STAGE_DIR="/tmp/ee-bench-validate-${INSTANCE_ID}"
rm -rf "$STAGE_DIR"
mkdir -p "$STAGE_DIR/eval" "$STAGE_DIR/submission"

cp -r "$INSTANCE_DIR/eval/"* "$STAGE_DIR/eval/"
cp -r "$INSTANCE_DIR/verify/"* "$STAGE_DIR/submission/"

Step 4: Run Container

# Read optional docker run params
DOCKER_RUN_PARAMS=$(jq -r '.environment.docker.run_params // empty' "$INSTANCE_DIR/datapoint.json")

# Run with gold patch mounted
docker run --rm --platform linux/amd64 \
  -v "$STAGE_DIR/eval":/ee-bench/eval:ro \
  -v "$STAGE_DIR/submission":/ee-bench/submission:ro \
  $DOCKER_RUN_PARAMS \
  "${INSTANCE_ID}:${COMMIT_SHORT}" \
  bash /ee-bench/eval/run.sh

Step 5: Debug Failures

If the container fails, run it interactively:

docker run --rm -it --platform linux/amd64 \
  -v "$STAGE_DIR/eval":/ee-bench/eval:ro \
  -v "$STAGE_DIR/submission":/ee-bench/submission:ro \
  $DOCKER_RUN_PARAMS \
  "${INSTANCE_ID}:${COMMIT_SHORT}" \
  bash

Inside the container:

  • Check that the patch applies: cd /repo && git apply /ee-bench/submission/patch.diff
  • Run the build manually
  • Run individual tests to isolate failures (see the sketch after this list)
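
For the .NET datapoint used as the example earlier, those checks might look like this (a sketch; the actual build and test commands depend on the datapoint's build_system and its run.sh):

cd /repo

# Verify the gold patch applies cleanly (--check leaves the working tree untouched)
git apply --check /ee-bench/submission/patch.diff
git apply /ee-bench/submission/patch.diff

# Build, then run one of the expected fail_to_pass tests in isolation
dotnet build
dotnet test --filter "FullyQualifiedName~Issue1259"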

Step 6: Clean Up

rm -rf "$STAGE_DIR"
docker rmi "${INSTANCE_ID}:${COMMIT_SHORT}"

Bulk Validation

To validate all instances in a folder-format export:

#!/usr/bin/env bash
set -euo pipefail

EXPORT_DIR="${1:?Usage: $0 <export_dir>}"
PASSED=0
FAILED=0
ERRORS=()

for INSTANCE_DIR in "$EXPORT_DIR"/*/; do
  [ -f "$INSTANCE_DIR/datapoint.json" ] || continue

  INSTANCE_ID=$(jq -r '.instance_id' "$INSTANCE_DIR/datapoint.json")
  echo "=== Validating: $INSTANCE_ID ==="

  if bash .github/scripts/validate.sh "$INSTANCE_DIR"; then
    PASSED=$((PASSED + 1))
  else
    FAILED=$((FAILED + 1))
    ERRORS+=("$INSTANCE_ID")
  fi

  echo ""
done

echo "=== Summary ==="
echo "Passed: $PASSED"
echo "Failed: $FAILED"
if [ ${#ERRORS[@]} -gt 0 ]; then
  echo "Failed instances:"
  printf '  - %s\n' "${ERRORS[@]}"
fi

To validate instances from a JSONL export:

#!/usr/bin/env bash
set -euo pipefail

JSONL_FILE="${1:?Usage: $0 <dataset.jsonl>}"
PASSED=0
FAILED=0
ERRORS=()

while IFS= read -r line; do
  INSTANCE_ID=$(echo "$line" | jq -r '.instance_id // empty')
  [ -z "$INSTANCE_ID" ] && continue

  echo "=== Validating: $INSTANCE_ID ==="

  # </dev/null keeps the validation run from consuming the rest of the JSONL stream
  if bash .github/scripts/validate.sh "$JSONL_FILE" "$INSTANCE_ID" </dev/null; then
    PASSED=$((PASSED + 1))
  else
    FAILED=$((FAILED + 1))
    ERRORS+=("$INSTANCE_ID")
  fi

  echo ""
done < "$JSONL_FILE"

echo "=== Summary ==="
echo "Passed: $PASSED"
echo "Failed: $FAILED"
if [ ${#ERRORS[@]} -gt 0 ]; then
  echo "Failed instances:"
  printf '  - %s\n' "${ERRORS[@]}"
fi

Evaluation Results and Schema

Output Example

Building image devlooped__moq-1259:eef6e1b8f968 ...
Running validation ...
6/6 criteria passed

JSON output:
{
  "schema_version": "2.0",
  "status": "success",
  "criteria": [
    { "criterion": "compilation", "status": "pass" },
    { "criterion": "baseline_tests", "status": "pass" },
    { "criterion": "patch_applied", "status": "pass" },
    { "criterion": "tests", "status": "pass" },
    { "criterion": "fail_to_pass", "status": "pass" },
    { "criterion": "pass_to_pass", "status": "pass" }
  ],
  ...
}

Top-Level Fields

| Field | Values | Meaning |
| --- | --- | --- |
| status | "success" | run.sh completed without errors (individual criteria may still fail) |
| status | "error" | run.sh encountered an error; check the error field |
| duration_seconds | number | Total wall-clock time for the run |

Criteria Array

The criteria array contains 6 criterion objects, evaluated in order. run.sh is self-evaluating — it performs all criteria checks internally, including fail_to_pass and pass_to_pass matching, with no external harness needed.

The 6 criteria (in order):

| Criterion | Description | Status values |
| --- | --- | --- |
| compilation | Build via install.sh | pass, fail |
| baseline_tests | Test run before the submission (with test_patch applied, no submission) | pass, fail, skipped |
| patch_applied | Apply the submission patch | pass, fail, skipped |
| tests | Test run after the submission | pass, fail, skipped |
| fail_to_pass | Expected-failing tests fail in the baseline run and pass after the submission | pass, fail, skipped |
| pass_to_pass | Expected-passing tests pass in the baseline run and still pass after the submission | pass, fail, skipped |

When criteria are skipped:

  • baseline_tests: compilation failed
  • patch_applied: no submission patch provided
  • tests: compilation or patch application failed
  • fail_to_pass: expected list empty or upstream criteria failed
  • pass_to_pass: expected list empty or upstream criteria failed

Two-phase test execution:

  1. Apply test_patch, build, run baseline tests (verify the bug exists before the fix)
  2. Apply submission patch, rebuild, run eval tests (verify the fix works)
  3. Compare both runs against expected test lists (fail_to_pass and pass_to_pass)

Tests criterion example:

{
  "criterion": "tests",
  "status": "pass",
  "summary": {
    "total": 10,
    "passed": 10,
    "failed": 0,
    "skipped": 0
  },
  "passed_tests": [
    { "name": "com.example.FooTest#testBar" }
  ],
  "failed_tests": []
}
  • summary.total / summary.passed / summary.failed — test counts
  • passed_tests[].name — fully qualified names of passing tests
  • failed_tests[].name — names of failing tests, with optional message, stacktrace, and type fields (see the jq sketch below)
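
To pull these fields out of a saved result with jq (a sketch, assuming the JSON was written to result.json):

# Status of the tests criterion, then any failing test names
jq -r '.criteria[] | select(.criterion == "tests") | .status' result.json
jq -r '.criteria[] | select(.criterion == "tests") | .failed_tests[].name' result.json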

Other criteria key fields:

| Criterion | Key fields | Description |
| --- | --- | --- |
| compilation | exit_code, error_message, duration_seconds | Whether the project compiled |
| baseline_tests | summary, passed_tests, failed_tests | Test results before the submission patch |
| patch_applied | files_modified, hunks_applied, hunks_failed | Whether the submission patch applied cleanly |
| fail_to_pass | expected, matched, unmatched | Comparison of expected-failing tests against actual results |
| pass_to_pass | expected, matched, unmatched | Comparison of expected-passing tests against actual results |

Note: additionalProperties: true at all levels — custom fields are allowed.

Full Result Schema v2.0

The result contains 6 criteria evaluated in order. run.sh is self-evaluating — it bakes expected test lists via template variables ({{ instance.expected.fail_to_pass | tojson }} and {{ instance.expected.pass_to_pass | tojson }}) and performs all criteria matching internally.
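
As an illustration, the rendered fragment inside run.sh might look like this after substitution (a hypothetical sketch, using the expected lists from the example datapoint; the real script is generated per datapoint):

# Expected test lists baked in at export time
FAIL_TO_PASS='["Moq.Tests.Regressions.IssueReportsFixture.Issue1259"]'
PASS_TO_PASS='["Moq.Tests.MatcherAttributeFixture.TypedMatcherDoesNotMismatch"]'

# The script iterates these names when comparing the baseline and post-patch test runs
echo "$FAIL_TO_PASS" | jq -r '.[]'
echo "$PASS_TO_PASS" | jq -r '.[]'

The full structure of the result it emits is shown below.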

{
  "schema_version": "2.0",
  "status": "success | error",
  "timestamp": "ISO-8601 (optional)",
  "duration_seconds": 45.2,
  "criteria": [
    {
      "criterion": "compilation",
      "status": "pass | fail",
      "exit_code": 0,
      "error_message": "string (optional)",
      "duration_seconds": 12.1
    },
    {
      "criterion": "baseline_tests",
      "status": "pass | fail | skipped",
      "summary": {
        "total": 10,
        "passed": 9,
        "failed": 1,
        "skipped": 0
      },
      "passed_tests": [
        { "name": "fully.qualified.TestName" }
      ],
      "failed_tests": [
        { "name": "fully.qualified.FailingTest" }
      ],
      "duration_seconds": 8.0
    },
    {
      "criterion": "patch_applied",
      "status": "pass | fail | skipped",
      "files_modified": ["path/to/file.java"],
      "hunks_applied": 3,
      "hunks_failed": 0
    },
    {
      "criterion": "tests",
      "status": "pass | fail | skipped",
      "summary": {
        "total": 10,
        "passed": 10,
        "failed": 0,
        "skipped": 0,
        "errors": 0
      },
      "passed_tests": [
        {
          "name": "fully.qualified.TestName",
          "duration_seconds": 0.5
        }
      ],
      "failed_tests": [],
      "duration_seconds": 8.3
    },
    {
      "criterion": "fail_to_pass",
      "status": "pass | fail | skipped",
      "expected": ["fully.qualified.FailingTest"],
      "matched": ["fully.qualified.FailingTest"],
      "unmatched": []
    },
    {
      "criterion": "pass_to_pass",
      "status": "pass | fail | skipped",
      "expected": ["fully.qualified.PassingTest"],
      "matched": ["fully.qualified.PassingTest"],
      "unmatched": []
    }
  ],
  "stdout": "captured output (optional)",
  "stderr": "captured errors (optional)",
  "error": "error message when status is error (optional)"
}

Exporting via API

Triggering the Workflow

Via REST API:

curl -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  https://api.github.com/repos/dpaia/infrastructure/actions/workflows/export-dataset-v2.yml/dispatches \
  -d '{
    "ref": "main",
    "inputs": {
      "eval_type": "codegen",
      "format": "folders",
      "search_query": "",
      "output_name": "dataset",
      "organization": "dpaia",
      "dataset_repo": "dataset"
    }
  }'

Via gh CLI:

gh workflow run "Export Dataset (v2)" \
  --repo dpaia/infrastructure \
  --field eval_type=codegen \
  --field format=jsonl \
  --field output_name=my-export \
  --field search_query="created:>2026-01-01"

Downloading Artifacts

After a workflow run completes, download its artifacts:

# List recent runs
gh run list --repo dpaia/infrastructure --workflow "Export Dataset (v2)" --limit 5

# Download artifacts from a specific run
gh run download <run_id> --repo dpaia/infrastructure --name dataset

# Or via REST API: list artifacts for a run
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/dpaia/infrastructure/actions/runs/<run_id>/artifacts \
  | jq '.artifacts[] | {name, id, size_in_bytes}'

# Download a specific artifact (returns a zip)
curl -L -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/dpaia/infrastructure/actions/artifacts/<artifact_id>/zip \
  -o dataset.zip

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Docker build fails | Missing dependencies, wrong base image, or template rendering errors | Check the Dockerfile for invalid template variables. Build locally with docker build --platform linux/amd64 |
| No JSON output from run.sh | Script doesn't print a line containing "schema_version" to stdout | Ensure run.sh outputs exactly one JSON object with "schema_version": "2.0". Redirect build/test output to stderr or a file. |
| fail_to_pass mismatch | Test names in expected.fail_to_pass don't match passed_tests[].name in the result | Use fully qualified class names (e.g. com.example.FooTest) or method names (e.g. com.example.FooTest.shouldBar). Class-level names match all methods in that class. |
| Patch application failure | Gold patch doesn't apply to the codebase at base_commit | Check that base_commit is correct and that patch.diff in verify/ was generated from that base |
| Compilation failure | Build tools or dependencies missing in the Docker image | Enter the container interactively (docker run --rm -it ... bash) and debug the build |
| Validation passes locally but fails in CI | Environment differences (network, platform, caching) | Ensure --platform linux/amd64 is set. Check whether docker.run_params includes network flags. |