
Dataset Collection Guide

Overview

This directory contains tools for collecting real PR data from GitHub for evaluation purposes.

Prerequisites

  1. Create a GitHub Personal Access Token (GitHub → Settings → Developer settings → Personal access tokens)

  2. Set the environment variable:

    export GITHUB_TOKEN=ghp_your_token_here

Quick Start

Collect Balanced Dataset (Recommended)

# Collect from 5 repos, ~5 PRs per repo, balanced across categories
poetry run python eval/dataset/collect_dataset.py \
  --repos 5 \
  --prs-per-repo 5 \
  --balanced

Collect More PRs

# Collect from 10 repos, ~10 PRs per repo
poetry run python eval/dataset/collect_dataset.py \
  --repos 10 \
  --prs-per-repo 10

List Available Repositories

poetry run python eval/dataset/collect_dataset.py list-repos

Collection Criteria

Repository Selection

  • Curated list of high-quality Python projects
  • Popular frameworks (Flask, Django, FastAPI)
  • Data science tools (pandas, scikit-learn)
  • Well-maintained with active reviews

PR Filtering

Must have:

  • ✅ Merged (not just closed)
  • ✅ At least one Python file changed
  • ✅ At least one review comment
  • ✅ 50-500 lines changed (configurable)
  • ✅ 1-15 files changed

Excluded:

  • ❌ Draft PRs
  • ❌ Too large (>500 lines)
  • ❌ Too small (<50 lines)
  • ❌ Only test files
  • ❌ No reviews
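
The criteria above can be expressed as a single predicate. This is a minimal sketch only, assuming a plain-dict PR record with hypothetical field names; the collector's actual implementation in collect_dataset.py may differ:

```python
# Sketch of the PR filter described above. `pr` is a plain dict with
# hypothetical field names, not the collector's real data model.
def passes_filters(pr, min_lines=50, max_lines=500, max_files=15):
    # Merged only, never drafts
    if pr.get("draft") or not pr.get("merged"):
        return False
    # Size bounds (configurable)
    changed = pr["lines_added"] + pr["lines_deleted"]
    if not (min_lines <= changed <= max_lines):
        return False
    if not (1 <= pr["files_changed"] <= max_files):
        return False
    # At least one Python file, and not test files only
    python_files = [f for f in pr["files"] if f.endswith(".py")]
    if not python_files:
        return False
    if all("test" in f for f in python_files):
        return False
    # At least one review comment
    return pr["review_comments"] >= 1
```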

PR Categories

The collector categorizes PRs automatically:

  1. Security: Fixes vulnerabilities, security issues
  2. Bugfix: Bug fixes, error corrections
  3. Feature: New features, enhancements
  4. Refactor: Code restructuring, cleanup
  5. Test: Test additions/modifications
  6. Other: Everything else
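
One simple way to implement this categorization is first-match keyword lookup over the title and description. The keyword lists below are assumptions for illustration; the collector's real heuristics may differ:

```python
# Keyword-based categorization sketch; checked in priority order, so a
# security fix wins over the generic "fix" bugfix keyword.
CATEGORY_KEYWORDS = [
    ("security", ("security", "vulnerability", "xss", "cve", "injection")),
    ("bugfix", ("fix", "bug", "error", "crash")),
    ("feature", ("add", "feature", "support", "implement")),
    ("refactor", ("refactor", "cleanup", "restructure", "simplify")),
    ("test", ("test", "coverage")),
]

def categorize(title, description=""):
    text = f"{title} {description}".lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(kw in text for kw in keywords):
            return category
    return "other"
```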

Output Files

pr_list.json

Complete PR metadata in our format:

[
  {
    "pr_id": "12345",
    "repository": "pallets/flask",
    "branch_source": "fix-security",
    "branch_target": "main",
    "title": "Fix XSS vulnerability in template rendering",
    "description": "...",
    "author": "contributor",
    "commit_messages": ["Fix XSS issue", "Add test"],
    "files_changed": 3,
    "lines_added": 45,
    "lines_deleted": 12,
    "language": "python"
  }
]
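
Besides jq, the file can be sanity-checked from Python. A small sketch (the path is the default output location; field names match the record above):

```python
import json

def summarize(prs):
    """One human-readable summary line per record from pr_list.json."""
    return [
        f'{pr["repository"]}#{pr["pr_id"]}: '
        f'{pr["lines_added"] + pr["lines_deleted"]} lines changed across '
        f'{pr["files_changed"]} files'
        for pr in prs
    ]

# Typical usage against the default output path:
#   with open("eval/dataset/pr_list.json") as f:
#       print("\n".join(summarize(json.load(f))))
```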

ground_truth.json

Extracted important issues from reviews:

[
  {
    "pr_id": "12345",
    "important_issues": [
      "XSS vulnerability in render_template @ templates.py:142",
      "Missing input validation @ utils.py:67"
    ],
    "false_positive_tolerance": 3,
    "labeler_id": "github_reviewers",
    "notes": "Extracted from 9 review comments"
  }
]
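
The false_positive_tolerance field caps how many spurious findings are acceptable per PR. A hedged sketch of scoring a system's findings against this ground truth; the naive substring matching here is purely illustrative, and the real evaluation pipeline likely matches more robustly:

```python
def score_pr(important_issues, findings, false_positive_tolerance):
    """Return (recall over important issues, within-tolerance flag).

    Matching strips the "@ file:line" suffix and uses case-insensitive
    substring overlap -- for illustration only.
    """
    def matches(issue, finding):
        return issue.split(" @ ")[0].lower() in finding.lower()

    hits = sum(
        1 for issue in important_issues
        if any(matches(issue, f) for f in findings)
    )
    recall = hits / len(important_issues) if important_issues else 1.0
    false_positives = sum(
        1 for f in findings
        if not any(matches(issue, f) for issue in important_issues)
    )
    return recall, false_positives <= false_positive_tolerance
```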

collection_summary.json

Collection statistics:

{
  "total_prs": 25,
  "total_ground_truth": 23,
  "repositories": ["pallets/flask", "django/django", ...],
  "filters": {
    "min_lines": 50,
    "max_lines": 500,
    "balanced": true
  }
}

Manual Review & Refinement

After collection, manually review and refine:

1. Check PR Quality

# Review collected PRs
cat eval/dataset/pr_list.json | jq '.[] | {pr: .pr_id, title: .title, lines: (.lines_added + .lines_deleted)}'

2. Refine Ground Truth

Edit ground_truth.json to:

  • ✅ Clarify important issues
  • ✅ Remove false positives
  • ✅ Add missing critical issues
  • ✅ Update severity expectations

3. Add Expert Labels

For thesis validity, get 2+ experts to:

  1. Review each PR independently
  2. Mark important issues
  3. Calculate Cohen's κ for inter-rater reliability
from eval.metrics import cohens_kappa

# Example ratings from two raters (hypothetical data; one category label per PR)
rater_a = [0, 1, 2, 2, 4]
rater_b = [0, 1, 2, 3, 4]

kappa = cohens_kappa(rater_a, rater_b, num_categories=5)
print(f"Cohen's κ: {kappa:.3f}")  # Should be >0.6
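
If eval.metrics is not available, Cohen's κ for two raters is straightforward to compute directly from its standard definition (observed vs. chance agreement); a minimal self-contained sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in range(num_categories)
    )
    return (observed - expected) / (1 - expected)
```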

Rate Limits

GitHub API rate limits:

  • Authenticated: 5,000 requests/hour
  • Per endpoint: Various limits

The collector:

  • ✅ Checks rate limit before starting
  • ✅ Respects delays between requests
  • ✅ Shows remaining calls
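
"Respects delays between requests" can be as simple as a fixed minimum interval between calls. A sketch with an injectable clock so the logic is testable; the class name and default interval are illustrative, not the collector's actual code:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive API requests."""

    def __init__(self, min_interval=0.75, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self.clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self.sleep(delay)
        self._last = self.clock()
        return delay
```

Call throttle.wait() immediately before each GitHub API request.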

If you hit the limit:

# Check when limit resets
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit

Advanced Usage

Custom Repository List

Edit collect_dataset.py and modify CURATED_REPOS:

CURATED_REPOS = [
    ("your-org", "your-repo"),
    # Add more repos...
]

Adjust Filters

# Larger PRs, allow bigger changes, don't balance categories
poetry run python eval/dataset/collect_dataset.py \
  --min-lines 100 \
  --max-lines 1000 \
  --no-balanced
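
For reference, round-robin selection across categories is one way the balanced mode could work; a sketch under that assumption (not necessarily the collector's actual strategy):

```python
from collections import defaultdict
from itertools import chain, zip_longest

def balanced_sample(prs, limit, key="category"):
    """Pick up to `limit` PRs, cycling one per category at a time."""
    buckets = defaultdict(list)
    for pr in prs:
        buckets[pr[key]].append(pr)
    # Take the first PR from each category, then the second, and so on
    interleaved = chain.from_iterable(zip_longest(*buckets.values()))
    return [pr for pr in interleaved if pr is not None][:limit]
```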

Troubleshooting

"Rate limit exceeded"

Wait for limit to reset (shown in error message) or:

# Check current limit
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit

"No PRs found"

Try:

  • Increase max_lines (some repos have larger PRs)
  • Decrease min_lines (some repos have smaller PRs)
  • Check if repo is active (recent PRs)

"Token not found"

Make sure the token is exported in the current shell:

# Check token is set
echo $GITHUB_TOKEN

# Should output: ghp_...

Best Practices

  1. Start small: Collect 20-30 PRs first
  2. Manual review: Check quality before evaluation
  3. Expert labeling: Get 2+ independent reviewers
  4. Calculate κ: Ensure inter-rater reliability >0.6
  5. Document: Keep notes on selection criteria

Example Workflow

# 1. Set token
export GITHUB_TOKEN=ghp_your_token

# 2. List available repos
poetry run python eval/dataset/collect_dataset.py list-repos

# 3. Collect dataset
poetry run python eval/dataset/collect_dataset.py --repos 5 --prs-per-repo 5

# 4. Review results
cat eval/dataset/collection_summary.json

# 5. Manual refinement
# Edit ground_truth.json as needed

# 6. Run evaluation
poetry run code-review evaluate --system multi_agent /path/to/repo

Citation

If using this dataset collection methodology in research:

@misc{dataset-collection,
  title={Automated Dataset Collection for Code Review Evaluation},
  author={Your Name},
  year={2025},
  note={Part of Multi-Agent Code Review Framework}
}