This directory contains tools for collecting real PR data from GitHub for evaluation purposes.
## Setup

### GitHub Personal Access Token

- Go to https://github.com/settings/tokens
- Click "Generate new token" → "Generate new token (classic)"
- Required scopes: `public_repo`
- Copy the token

### Set Environment Variable

```bash
export GITHUB_TOKEN=ghp_your_token_here
```
## Usage

```bash
# Collect from 5 repos, ~5 PRs per repo, balanced across categories
poetry run python eval/dataset/collect_dataset.py \
  --repos 5 \
  --prs-per-repo 5 \
  --balanced

# Collect from 10 repos, ~10 PRs per repo
poetry run python eval/dataset/collect_dataset.py \
  --repos 10 \
  --prs-per-repo 10

# List the curated repositories
poetry run python eval/dataset/collect_dataset.py list-repos
```

## Repository Selection

- Curated list of high-quality Python projects
- Popular frameworks (Flask, Django, FastAPI)
- Data science tools (pandas, scikit-learn)
- Well-maintained with active reviews
## PR Filters

Must have:
- ✅ Merged (not just closed)
- ✅ At least one Python file changed
- ✅ At least one review comment
- ✅ 50-500 lines changed (configurable)
- ✅ 1-15 files changed
Excluded:
- ❌ Draft PRs
- ❌ Too large (>500 lines)
- ❌ Too small (<50 lines)
- ❌ Only test files
- ❌ No reviews
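These criteria amount to a simple predicate over each candidate PR. A minimal sketch (the field names here are illustrative, not the collector's actual schema):

```python
def passes_filters(pr, min_lines=50, max_lines=500):
    """Return True if a PR record meets the dataset criteria above."""
    lines_changed = pr["lines_added"] + pr["lines_deleted"]
    py_files = [f for f in pr["files"] if f.endswith(".py")]
    # "Only test files" PRs are excluded: require at least one non-test .py file
    non_test = [f for f in py_files if "test" not in f.rsplit("/", 1)[-1]]
    return (
        pr["merged"]
        and not pr["draft"]
        and pr["review_comments"] >= 1
        and min_lines <= lines_changed <= max_lines
        and 1 <= len(pr["files"]) <= 15
        and len(py_files) >= 1
        and len(non_test) >= 1
    )

example = {"merged": True, "draft": False, "review_comments": 2,
           "lines_added": 45, "lines_deleted": 12,
           "files": ["src/app.py", "tests/test_app.py", "README.md"]}
print(passes_filters(example))  # → True
```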
## Categorization

The collector categorizes PRs automatically:
- Security: Fixes vulnerabilities, security issues
- Bugfix: Bug fixes, error corrections
- Feature: New features, enhancements
- Refactor: Code restructuring, cleanup
- Test: Test additions/modifications
- Other: Everything else
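One plausible implementation is keyword matching on the PR title and description, checked in priority order. This is only a sketch — the keyword lists are illustrative and the collector's actual heuristics may differ:

```python
# Checked in priority order; first match wins, "other" is the fallback
CATEGORY_KEYWORDS = [
    ("security", ("security", "vulnerability", "xss", "cve", "injection")),
    ("bugfix", ("fix", "bug", "error", "crash", "regression")),
    ("feature", ("feature", "add", "implement", "support")),
    ("refactor", ("refactor", "cleanup", "restructure", "simplify")),
    ("test", ("test", "coverage")),
]

def categorize(title, description=""):
    text = f"{title} {description}".lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(kw in text for kw in keywords):
            return category
    return "other"

print(categorize("Fix XSS vulnerability in template rendering"))  # → security
print(categorize("Bump version number"))  # → other
```

Substring matching is crude (e.g. "add" also matches "address"); word-boundary regexes would be a natural refinement.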
## Output Files

### `pr_list.json`

Complete PR metadata in our format:

```json
[
  {
    "pr_id": "12345",
    "repository": "pallets/flask",
    "branch_source": "fix-security",
    "branch_target": "main",
    "title": "Fix XSS vulnerability in template rendering",
    "description": "...",
    "author": "contributor",
    "commit_messages": ["Fix XSS issue", "Add test"],
    "files_changed": 3,
    "lines_added": 45,
    "lines_deleted": 12,
    "language": "python"
  }
]
```

### `ground_truth.json`

Extracted important issues from reviews:

```json
[
  {
    "pr_id": "12345",
    "important_issues": [
      "XSS vulnerability in render_template @ templates.py:142",
      "Missing input validation @ utils.py:67"
    ],
    "false_positive_tolerance": 3,
    "labeler_id": "github_reviewers",
    "notes": "Extracted from 9 review comments"
  }
]
```

### `collection_summary.json`

Collection statistics:

```json
{
  "total_prs": 25,
  "total_ground_truth": 23,
  "repositories": ["pallets/flask", "django/django", ...],
  "filters": {
    "min_lines": 50,
    "max_lines": 500,
    "balanced": true
  }
}
```

## Manual Refinement

After collection, manually review and refine:

```bash
# Review collected PRs
cat eval/dataset/pr_list.json | jq '.[] | {pr: .pr_id, title: .title, lines: (.lines_added + .lines_deleted)}'
```

Edit `ground_truth.json` to:
- ✅ Clarify important issues
- ✅ Remove false positives
- ✅ Add missing critical issues
- ✅ Update severity expectations
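After hand-editing, a quick sanity check helps catch malformed entries. A minimal sketch, assuming the `ground_truth.json` shape shown earlier:

```python
import json

def check_ground_truth(path="eval/dataset/ground_truth.json"):
    """Validate hand-edited ground truth entries; return the entry count."""
    with open(path) as f:
        entries = json.load(f)
    for e in entries:
        assert e["pr_id"], "every entry needs a pr_id"
        assert e["important_issues"], f"PR {e['pr_id']} lists no issues"
        assert e["false_positive_tolerance"] >= 0
    return len(entries)
```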
## Expert Labeling

For thesis validity, get 2+ experts to:
- Review each PR independently
- Mark important issues
- Calculate Cohen's κ for inter-rater reliability
```python
from eval.metrics import cohens_kappa

# Example severity labels (0-4) from two independent raters for the same PRs
rater_a = [0, 2, 2, 4, 1, 3]
rater_b = [0, 2, 1, 4, 1, 3]

kappa = cohens_kappa(rater_a, rater_b, num_categories=5)
print(f"Cohen's κ: {kappa:.3f}")  # Should be >0.6
```

## Rate Limits

GitHub API rate limits:
- Authenticated: 5,000 requests/hour
- Per endpoint: Various limits
The collector:
- ✅ Checks rate limit before starting
- ✅ Respects delays between requests
- ✅ Shows remaining calls
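A pre-flight check like the collector's can be sketched with only the standard library. The `/rate_limit` endpoint and its response shape are GitHub's documented API; `GITHUB_TOKEN` must be set for the live call:

```python
import json
import os
import urllib.request

API_URL = "https://api.github.com/rate_limit"

def parse_rate_limit(body):
    """Extract (remaining, reset_epoch) from a /rate_limit response body."""
    core = json.loads(body)["resources"]["core"]
    return core["remaining"], core["reset"]

def remaining_calls(token=None):
    """Query GitHub for the current core rate-limit status."""
    token = token or os.environ["GITHUB_TOKEN"]
    req = urllib.request.Request(
        API_URL, headers={"Authorization": f"token {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_rate_limit(resp.read())
```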
If you hit the limit:

```bash
# Check when limit resets
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
```

## Customization

### Custom Repositories

Edit `collect_dataset.py` and modify `CURATED_REPOS`:
```python
CURATED_REPOS = [
    ("your-org", "your-repo"),
    # Add more repos...
]
```

### Custom Filters

```bash
# Allow larger PRs (100-1000 lines) and don't balance categories
poetry run python eval/dataset/collect_dataset.py \
  --min-lines 100 \
  --max-lines 1000 \
  --no-balanced
```

## Troubleshooting

### Rate limit exceeded

Wait for the limit to reset (shown in the error message) or:
```bash
# Check current limit
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
```

### Not enough PRs found

Try:

- Increase `max_lines` (some repos have larger PRs)
- Decrease `min_lines` (some repos have smaller PRs)
- Check if the repo is active (recent PRs)
### Authentication errors

Make sure the token is set:

```bash
# Check token is set
echo $GITHUB_TOKEN
# Should output: ghp_...
```

## Best Practices

- Start small: Collect 20-30 PRs first
- Manual review: Check quality before evaluation
- Expert labeling: Get 2+ independent reviewers
- Calculate κ: Ensure inter-rater reliability >0.6
- Document: Keep notes on selection criteria
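If `eval.metrics` is not available, Cohen's κ for two raters over nominal labels can be computed in a few lines. A self-contained sketch:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(round(cohen_kappa([0, 1, 2, 2, 0, 1], [0, 1, 2, 1, 0, 1]), 3))  # → 0.75
```

κ corrects raw agreement for agreement expected by chance, which is why it is preferred over simple percent agreement for labeling studies.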
## Quick Reference

```bash
# 1. Set token
export GITHUB_TOKEN=ghp_your_token

# 2. List available repos
poetry run python eval/dataset/collect_dataset.py list-repos

# 3. Collect dataset
poetry run python eval/dataset/collect_dataset.py --repos 5 --prs-per-repo 5

# 4. Review results
cat eval/dataset/collection_summary.json

# 5. Manual refinement: edit ground_truth.json as needed

# 6. Run evaluation
poetry run code-review evaluate --system multi_agent /path/to/repo
```

## Citation

If using this dataset collection methodology in research:

```bibtex
@misc{dataset-collection,
  title={Automated Dataset Collection for Code Review Evaluation},
  author={Your Name},
  year={2025},
  note={Part of Multi-Agent Code Review Framework}
}
```