How to review datapoints and manage the Code Generation project board.
As a code reviewer, you manage source PRs on the Code Generation project board. Your role is to evaluate whether a contributed PR is suitable for the benchmark, trigger automated verification by setting the correct status, and handle any failures. Most of the pipeline is automated — your main actions are moving PRs between statuses and reviewing verification results.
| Status | Set By | What It Triggers |
|---|---|---|
| Todo | Manual | Nothing — PR is queued for review |
| In progress | Manual or Bot (on new commits) | Nothing — PR is being worked on |
| Review | Reviewer | Bot dispatches verification workflow, creates "Datapoint Verification" check, sets Verification="Validating..." |
| Verified | Reviewer (after verification passes) | Bot dispatches generation workflow, creates "Datapoint Generation" check. Requires Verification="Valid" — blocked otherwise |
| Rejected | Reviewer | Nothing — PR is rejected with a reason |
| Done | Bot (after dataset PR merges) | Nothing — pipeline complete. Requires Verification="Valid" — blocked otherwise |
The project board has a Verification single-select field that tracks the automated verification status:
| Value | Set By | Meaning |
|---|---|---|
| Validating... | Bot (on dispatch or new commits) | Verification is running or hasn't started |
| Valid | Bot (on verification success) | Verification passed — PR can move to Verified/Done |
| Invalid | Bot (on verification failure) | Verification failed — PR cannot move to Verified/Done |
The bot enforces that Verified and Done statuses require Verification = "Valid". If you move a PR to either status without passing verification, the bot reverts the status and posts a comment explaining why.
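This gate can be sketched as a small predicate. The sketch below is an illustrative model, not the bot's actual implementation; the status and field values come from the tables above.

```python
# Illustrative model of the board's status gate (not the bot's real code).
GATED_STATUSES = {"Verified", "Done"}  # both require Verification == "Valid"

def is_move_allowed(new_status: str, verification: str) -> bool:
    """Return True if the bot would accept moving a PR to new_status."""
    if new_status in GATED_STATUSES:
        return verification == "Valid"
    return True
```

For example, with the Verification field still at "Validating...", a move to "Verified" would be reverted, while a move to "Review" goes through.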
Before moving a PR to "Review", verify:
- `.ee-bench/codegen/` directory exists in the repository with the required files: `metadata.json` (with `instance_id`, `base_commit`, and `expected.fail_to_pass`), `environment/Dockerfile`, and `eval/run.sh`
- PR body includes a problem statement explaining the issue being solved
- The code change (gold patch) is a reasonable solution to the described problem
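The file-layout part of this checklist can be sanity-checked locally before moving the PR. The helper below is a hypothetical sketch: the paths and metadata keys are the ones listed above, but the function itself is not part of the pipeline.

```python
import json
from pathlib import Path

def check_datapoint_layout(repo_root: str) -> list[str]:
    """Return a list of problems with the .ee-bench/codegen/ layout (empty means OK)."""
    problems = []
    root = Path(repo_root) / ".ee-bench" / "codegen"
    meta_path = root / "metadata.json"
    if not meta_path.is_file():
        problems.append("missing metadata.json")
    else:
        meta = json.loads(meta_path.read_text())
        for key in ("instance_id", "base_commit", "expected"):
            if key not in meta:
                problems.append(f"metadata.json missing key: {key}")
        if "fail_to_pass" not in meta.get("expected", {}):
            problems.append("metadata.json missing expected.fail_to_pass")
    # The two non-metadata files the checklist requires:
    for rel in ("environment/Dockerfile", "eval/run.sh"):
        if not (root / rel).is_file():
            problems.append(f"missing {rel}")
    return problems
```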
When you move a PR to "Review", the bot dispatches the verification workflow and creates a "Datapoint Verification" check run on the PR.
After verification completes, the bot posts a comment on the PR:
On success:
✅ Datapoint verification **passed**.
**Instance:** `devlooped__moq-1259`
**Duration:** 45s
**Tests:** Total: 5, Passed: 5, Failed: 0, Skipped: 0
**fail_to_pass:** Expected: 1, Matched: 1
**pass_to_pass:** Expected: 1, Matched: 1
**Criteria:** 6/6 passed
**Details:** [Workflow run](https://github.com/...)
On failure:
❌ Datapoint verification **failed**.
**Instance:** `devlooped__moq-1259`
**Duration:** 120s
**Tests:** Total: 5, Passed: 3, Failed: 2, Skipped: 0
**fail_to_pass:** Expected: 1, Matched: 0
**pass_to_pass:** Expected: 1, Matched: 1
**Criteria:** 4/6 passed, 1 failed, 1 skipped
**Failed criteria:** tests: fail, fail_to_pass: fail
<details><summary>Failed tests (up to 20)</summary>
- com.example.FooTest#testBar: fail
- com.example.FooTest#testBaz: error
</details>
**Details:** [Workflow run](https://github.com/...)
| Field | What to Look For |
|---|---|
| Tests | All tests pass (Failed: 0) and total count is reasonable |
| fail_to_pass | Matched equals Expected — all expected-to-fail tests failed in baseline and pass after submission |
| pass_to_pass | Matched equals Expected — all expected-to-pass tests still pass after submission |
| Criteria | All 6 criteria pass. Criteria may be skipped when prerequisites are not met (e.g., empty expected lists) |
| Failed criteria | Only appears on failure — indicates which of the 6 criteria didn't pass |
| Failed tests | Only appears on failure — lists up to 20 failed test names with status |
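If you ever need to read these comments programmatically (for example, to audit a batch of PRs), the fields can be extracted with a few regexes. This is a hedged sketch against the comment format shown above, not an official parser.

```python
import re

def parse_verification_comment(body: str) -> dict:
    """Pull the headline result and key counters out of a verification comment.
    The labels match the comment format shown above; the parser is illustrative."""
    info = {"passed": "verification **passed**" in body}
    patterns = {
        "tests": r"\*\*Tests:\*\* Total: (\d+), Passed: (\d+), Failed: (\d+), Skipped: (\d+)",
        "fail_to_pass": r"\*\*fail_to_pass:\*\* Expected: (\d+), Matched: (\d+)",
        "pass_to_pass": r"\*\*pass_to_pass:\*\* Expected: (\d+), Matched: (\d+)",
    }
    for name, pat in patterns.items():
        m = re.search(pat, body)
        if m:
            info[name] = tuple(int(g) for g in m.groups())
    return info
```

Applied to the success example above, this yields `passed=True`, `tests=(5, 5, 0, 0)`, and `fail_to_pass=(1, 1)`.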
In addition to the comment, check the "Datapoint Verification" check run on the PR's Checks tab. This shows the overall pass/fail status and links directly to the workflow run.
Move a PR to "Verified" only after verification passes:
- Confirm the verification comment shows a pass result
- Confirm the "Datapoint Verification" check run is green
- Confirm the Verification field shows "Valid"
- Move the PR to "Verified" on the project board
The bot enforces two gates:
- Verification field gate: If the Verification field is not "Valid", the bot reverts the status to its previous value and posts a comment.
- Check run gate: If there is no passing "Datapoint Verification" check on the current head SHA, the bot blocks dispatch and posts a comment.
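The check run gate amounts to scanning the check runs on the PR's current head SHA for a successful "Datapoint Verification" run. A minimal sketch, with entries that loosely mimic the GitHub check-runs API shape (the dictionary keys here are assumptions for illustration):

```python
def has_passing_verification_check(check_runs: list, head_sha: str) -> bool:
    """Check-run gate: True if a successful 'Datapoint Verification' run
    exists on the given head SHA (illustrative model of the bot's check)."""
    return any(
        run["name"] == "Datapoint Verification"
        and run["conclusion"] == "success"
        and run["head_sha"] == head_sha
        for run in check_runs
    )
```

Note that a passing check on an older SHA does not satisfy the gate: the success must be on the current head.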
What happens next: The bot dispatches the generation workflow, which creates a PR in dpaia/dataset containing the exported datapoint. A comment is posted on the source PR linking to the dataset PR.
If the verification comment shows a failure:
- Fixable by contributor: Leave the PR at "Review" or move to "In progress" and comment with what needs to change. The contributor pushes fixes, which resets status to "In progress". Move back to "Review" to re-verify.
- Unfixable / out of scope: Move to "Rejected" with a comment explaining the reason.
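The triage decision above reduces to one question: can the contributor fix it? As a tiny illustrative helper (not part of the pipeline):

```python
def triage_failed_verification(fixable_by_contributor: bool) -> str:
    """Map a failed verification to the next board status per the guidance above."""
    return "In progress" if fixable_by_contributor else "Rejected"
```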
If the dataset PR is not created after moving to "Verified":
- Check the workflow run logs in the infrastructure repository's Actions tab
- Look for the "Datapoint Generation" check run on the source PR
- Common causes: export script errors, dataset repo access issues
- To retry: toggle the status (move to "In progress", then back to "Verified")
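The retry toggle is just two status updates in sequence. Sketched below with a hypothetical `set_status` callable standing in for however you update the board (UI or API):

```python
def retry_generation(set_status) -> None:
    """Re-trigger the generation workflow by toggling the board status.
    set_status is a hypothetical callable that updates the project board."""
    set_status("In progress")
    set_status("Verified")  # moving back to Verified re-dispatches generation
```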
If the dataset PR's validation fails (the bot sets status to "Failed" on the Dataset Metadata project):
- Check the validation comment on the dataset PR in `dpaia/dataset`
- Common causes: datapoint files don't match the expected structure, or the Docker build fails in a clean environment
- To retry: the contributor fixes the source PR, which resets status to "In progress". Start the review cycle again.
After you move a PR to "Verified", no further reviewer action is needed. The pipeline proceeds automatically:
- Generation — Bot creates a dataset PR in `dpaia/dataset` with the exported datapoint
- Validation — Bot dispatches validation on the dataset PR
- Auto-merge — If validation passes, the dataset PR is automatically merged
- Finalization — Both projects (Code Generation and Dataset Metadata) are set to "Done", and the source PR is closed with a comment
If any automated step fails, the bot updates the relevant project status. You can monitor progress through the check runs on the source PR and comments on both the source and dataset PRs.
When new commits are pushed to a source PR that is in Review, Verified, or Rejected status:
- The bot automatically resets all project statuses to "In progress"
- The bot resets the Verification field to "Validating..."
- A comment is posted explaining the reset and showing the new head SHA
- Previous verification results are invalidated
- The review process must start over from "Review"
This prevents stale verification results from being used after code changes.
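The reset behavior can be modeled as a small handler. This is an illustrative sketch of the rule described above, with invented field names, not the bot's actual code:

```python
def on_new_commits(item: dict, new_head_sha: str) -> dict:
    """Sketch of the bot's reset when commits land on a PR that is in
    Review, Verified, or Rejected; field names are illustrative."""
    if item["status"] in {"Review", "Verified", "Rejected"}:
        return dict(item, status="In progress",
                    verification="Validating...", head_sha=new_head_sha)
    return item  # Todo / In progress / Done items are left alone
```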