Sync a GitHub repository's issues, pull requests, releases, and metadata into a
local github-data/ folder. The result is a plain-text archive that an agent
can read, grep, and reason about without API calls.

github-data/
  repo.yml          # repository metadata + sync state
  labels.yml        # all repository labels
  milestones.yml    # all milestones
  issues/
    0001.md         # issue or PR, one file per number
    0002.md
    0042.md
    0043.md
  projects/         # Projects v2 boards linked to this repo
    0001.md         # project file, one per open project
  discussions/      # GitHub Discussions
    0007.md         # one file per discussion
  releases/
    v1.0.0.md
    v1.2.0.md
  events/           # event files exported since last sync (for agents to pick up)
    20240916-140000-000-issue_closed-42.md

Issues and pull requests share a single number space (as on GitHub). The filename is the zero-padded number. A file is self-contained: open it and you see the full thread.
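
As a small illustration of the naming rule, the sketch below maps an issue or PR number to its file path. The helper name `issue_path` and the default root are assumptions made for this example; the four-digit minimum width follows the filenames shown in the tree.

```python
from pathlib import Path


def issue_path(number: int, root: str = "github-data") -> Path:
    """Return the archive path for an issue or PR number (hypothetical helper)."""
    if number < 1:
        raise ValueError("issue numbers start at 1")
    # Pad to at least four digits; larger numbers simply use more digits.
    return Path(root) / "issues" / f"{number:04d}.md"


print(issue_path(42).as_posix())     # github-data/issues/0042.md
print(issue_path(12345).as_posix())  # github-data/issues/12345.md
```

Because issues and PRs share one number space, the same helper resolves either kind of item.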
repo.yml holds repository-level metadata and the sync cursor. The feature
flags (the has_* fields) and archived are always written, even when false, so
agents can grep for them; other fields are omitted when empty.
owner: acme
repo: widgets
default_branch: main
description: A widget catalog
homepage: https://widgets.example
visibility: public
language: Go
license: MIT License
topics:
- cli
- golang
archived: false
has_issues: true
has_projects: true
has_wiki: true
has_pages: false
has_discussions: false
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
pushed_at: 2024-09-17T07:55:00Z
synced_at: 2024-09-17T08:00:00Z

labels.yml lists every repository label:

- name: bug
  color: d73a4a
  description: Something isn't working
- name: enhancement
  color: a2eeef
  description: New feature or request
- name: priority/high
  color: b60205

milestones.yml lists every milestone:

- title: v2.1
  state: closed
  description: Stability release
  due_on: 2024-10-01
  closed_at: 2024-09-28
- title: v3.0
  state: open
  description: Major redesign
  due_on: 2025-03-01

In an issue file, YAML frontmatter holds structured metadata and the markdown
body is the issue text. Comments, reviews, and events follow as additional YAML
documents separated by ---.
---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
- hubot
labels:
- bug
- priority/high
milestone: v2.1
---
When passing an empty string to `parse()`, the application crashes with a null
pointer exception.
## Steps to reproduce
1. Call `parse("")`
2. Observe crash

Pull requests use the same format. The type: pull_request field and PR-specific
frontmatter fields distinguish a PR from an issue.
---
number: 43
title: Handle empty input in parser
type: pull_request
state: closed
created_at: 2024-09-15T12:00:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
- octocat
labels:
- bugfix
milestone: v2.1
source_branch: fix/empty-input
target_branch: main
merge:
  merged: true
  merged_at: 2024-09-16T14:00:00Z
  merged_by: hubot
  commit_sha: abc123f
reviewers:
- hubot
requested_reviewers:
- monalisa
---
Fixes #42. Adds a guard clause to `parse()` to return early on empty input.

After the first document, each --- starts a new document. The document field
declares its type. All documents appear in chronological order.
---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---
I can reproduce this. The guard clause was removed in the last refactor.

---
document: review
id: 200
author: hubot
state: approved
commit_sha: abc123f
submitted_at: 2024-09-16T10:00:00Z
---
Looks good. The early return is clean.

A review comment is an inline code comment tied to a file, line, and review.
---
document: review_comment
id: 201
review_id: 200
author: hubot
created_at: 2024-09-16T10:00:00Z
path: src/parser.js
line: 12
side: RIGHT
commit_sha: abc123f
---
Nit: could use `=== undefined` instead of `== null` for clarity.

Events record state changes and usually have no body.
---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---

Common event types: labeled, unlabeled, assigned, unassigned, closed,
reopened, merged, renamed, milestoned, demilestoned, referenced,
cross-referenced, review_requested, review_request_removed,
review_dismissed, head_ref_force_pushed, head_ref_deleted,
base_ref_changed, converted_to_draft, ready_for_review, locked,
unlocked, pinned, unpinned, transferred, connected, disconnected,
marked_as_duplicate, unmarked_as_duplicate.
Event-specific fields are added flat in frontmatter:

| Event type | Extra fields |
|---|---|
| labeled/unlabeled | label |
| assigned/unassigned | assignee |
| milestoned/demilestoned | milestone |
| renamed | from, to |
| closed/merged/referenced | commit_sha |
| cross-referenced | source_number, source_repo |
| review_requested/review_request_removed | reviewer |
| locked | lock_reason |
| review_dismissed | dismissal_message |

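
For a consumer that wants only the event-specific part of an event document, the table above can be written down as a lookup. This is a minimal sketch: the names `EVENT_EXTRA_FIELDS` and `event_details` are made up for the example, and the input is assumed to be an event's frontmatter already parsed into a dict.

```python
# Extra frontmatter keys per event type, transcribed from the table above.
EVENT_EXTRA_FIELDS = {
    "labeled": ("label",),
    "unlabeled": ("label",),
    "assigned": ("assignee",),
    "unassigned": ("assignee",),
    "milestoned": ("milestone",),
    "demilestoned": ("milestone",),
    "renamed": ("from", "to"),
    "closed": ("commit_sha",),
    "merged": ("commit_sha",),
    "referenced": ("commit_sha",),
    "cross-referenced": ("source_number", "source_repo"),
    "review_requested": ("reviewer",),
    "review_request_removed": ("reviewer",),
    "locked": ("lock_reason",),
    "review_dismissed": ("dismissal_message",),
}


def event_details(event: dict) -> dict:
    """Return the event-specific keys that are present in a parsed event."""
    wanted = EVENT_EXTRA_FIELDS.get(event.get("event"), ())
    return {key: event[key] for key in wanted if key in event}


labeled = {
    "document": "event",
    "event": "labeled",
    "actor": "octocat",
    "created_at": "2024-09-15T10:31:00Z",
    "label": "bug",
}
print(event_details(labeled))  # {'label': 'bug'}
```

Event types without an entry, such as pinned, yield an empty dict, as does a closed event that carries no commit_sha.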
One file per open Projects v2 board linked to the repository, named by the
project number (projects/0001.md). The frontmatter holds the project header
and field definitions; the body is the project's readme; each linked item
follows as an item sub-document with its current field values.
Closed projects are not written — when a project transitions from open to
closed, its file is deleted and a project_closed event is emitted. Draft
issues (project-only items without an issue number) are skipped.
---
number: 1
title: Q1 Roadmap
state: open
public: true
url: https://github.com/orgs/acme/projects/1
owner: acme
description: Quarterly planning
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
fields:
  - name: Status
    type: SINGLE_SELECT
    options:
      - Todo
      - In Progress
      - Done
  - name: Priority
    type: SINGLE_SELECT
    options:
      - P0
      - P1
  - name: Iteration
    type: ITERATION
---
Long-form project description / readme.
---
document: item
type: issue
number: 42
title: Fix crash on empty input
repo: acme/widgets
fields:
  Priority: P0
  Status: In Progress
---
---
document: item
type: pull_request
number: 43
title: Handle empty input in parser
repo: acme/widgets
fields:
  Status: Done
---

One file per GitHub Discussion at discussions/<number>.md. Discussions share
the repository's number space with issues and PRs (so a repo can have an issue
#42 or a discussion #42, never both). YAML frontmatter holds the metadata;
top-level replies are emitted as document: comment and nested replies as
document: reply with a parent_id pointing at the comment they reply to.
Discussions are GraphQL-only and use the same since cutoff as the rest of
the exporter — the list is fetched newest-first and pagination stops as soon
as items are older than the cutoff.
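
The early stop described above can be sketched as follows. This illustrates the idea and is not the exporter's code: `fetch_page` is a hypothetical callable standing in for the GraphQL request, `PAGES` is made-up data, and the sketch assumes the listing is ordered by updated_at descending, which the text implies but does not state outright.

```python
from datetime import datetime
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (items, cursor of the next page or None)


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp such as 2024-09-16T14:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def updated_since(fetch_page: Callable[[Optional[str]], Page], since: str) -> Iterator[dict]:
    """Yield items from a newest-first listing until one is older than `since`."""
    cutoff = parse_ts(since)
    cursor = None
    while True:
        items, cursor = fetch_page(cursor)
        for item in items:
            if parse_ts(item["updated_at"]) < cutoff:
                return  # everything after this item is older still
            yield item
        if cursor is None:
            return


# Stand-in for the remote listing: two pages, newest first.
PAGES = {
    None: ([{"number": 9, "updated_at": "2024-09-17T07:00:00Z"},
            {"number": 7, "updated_at": "2024-09-16T14:00:00Z"}], "p2"),
    "p2": ([{"number": 3, "updated_at": "2024-09-01T00:00:00Z"}], None),
}
fresh = list(updated_since(lambda cursor: PAGES[cursor], "2024-09-10T00:00:00Z"))
print([d["number"] for d in fresh])  # [9, 7]
```

Stopping at the first stale item keeps the number of pages fetched proportional to the amount of recent activity, not to the total number of discussions.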
---
number: 7
title: How do I export the wiki?
type: discussion
state: open
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
author: octocat
category: Q&A
labels:
- question
answer_id: 17024104
answer_chosen_at: 2024-09-16T14:00:00Z
answer_chosen_by: hubot
---
Discussion body markdown.
---
document: comment
id: 17024104
author: hubot
created_at: 2024-09-15T11:00:00Z
is_answer: true
---
Top-level reply that was marked as the chosen answer.
---
document: reply
id: 17024105
parent_id: 17024104
author: octocat
created_at: 2024-09-15T11:30:00Z
---
Nested reply under the top-level comment.

Discussion frontmatter fields:

| Field | Type | Notes |
|---|---|---|
| number | integer | Required. Unique within the repo (shared with issues/PRs) |
| title | string | Required |
| type | string | Always discussion |
| state | string | open or closed |
| state_reason | string | outdated, duplicate, resolved, reopened (lowercased) |
| locked | boolean | Omit if false |
| created_at | ISO-8601 | Required |
| updated_at | ISO-8601 | Required |
| closed_at | ISO-8601 | Present when closed |
| author | string | GitHub username |
| category | string | Discussion category name (e.g. Q&A, General, Ideas) |
| labels | string list | Label names |
| answer_id | integer | Q&A only: databaseId of the comment marked as answer |
| answer_chosen_at | ISO-8601 | Q&A only |
| answer_chosen_by | string | Q&A only: username who marked the answer |

Thread documents in a discussion file:

| document | Fields |
|---|---|
| comment | id, author, created_at, optional is_answer: true |
| reply | id, parent_id, author, created_at |

If a discussion has more than 100 top-level comments, or any comment has more
than 50 replies, the export keeps only the first N entries and logs a warning
(Warning: discussion #N has more than 100 top-level comments — only first 100 exported). This is a deliberate trade-off to keep GraphQL node cost bounded.
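
A reader of a discussion file can rebuild the two-level thread by following parent_id from each reply to its comment. The sketch below assumes the thread documents have already been parsed into dicts; `thread_replies` is a name chosen for this example.

```python
from collections import defaultdict


def thread_replies(documents: list) -> list:
    """Group reply documents under the comment whose id matches their parent_id."""
    replies = defaultdict(list)
    for doc in documents:
        if doc.get("document") == "reply":
            replies[doc["parent_id"]].append(doc)
    # Keep comments in file order, which is already chronological.
    return [
        {"comment": doc, "replies": replies.get(doc["id"], [])}
        for doc in documents
        if doc.get("document") == "comment"
    ]


docs = [
    {"document": "comment", "id": 17024104, "author": "hubot", "is_answer": True},
    {"document": "reply", "id": 17024105, "parent_id": 17024104, "author": "octocat"},
]
threads = thread_replies(docs)
print(threads[0]["comment"]["author"], len(threads[0]["replies"]))  # hubot 1
```

If an export was truncated by the limits above, replies beyond the cap are simply absent; the grouping itself needs no special handling.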
When an issue or PR is on one or more projects, the issue file's frontmatter also lists them:
projects:
- Q1 Roadmap
- Bugs

This is populated on the next sync that re-fetches the issue (an issue gets
re-fetched when its updated_at advances, which happens whenever it is added
to or removed from a project).
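
The re-fetch condition can be expressed as a timestamp comparison. The text does not spell out how the exporter performs this check, so the sketch below is only one plausible form: `needs_refetch` is a hypothetical helper that compares the updated_at stored in the local file against the value reported by GitHub.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp such as 2024-09-16T14:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def needs_refetch(local_updated_at: Optional[str], remote_updated_at: str) -> bool:
    """True when there is no local file yet or the remote copy is newer."""
    if local_updated_at is None:
        return True
    return parse_ts(remote_updated_at) > parse_ts(local_updated_at)


print(needs_refetch("2024-09-16T14:00:00Z", "2024-09-17T07:30:00Z"))  # True
print(needs_refetch("2024-09-16T14:00:00Z", "2024-09-16T14:00:00Z"))  # False
```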
---
tag: v1.0.0
name: Version 1.0.0
draft: false
prerelease: false
author: octocat
created_at: 2024-06-01T12:00:00Z
published_at: 2024-06-01T12:00:00Z
target_commitish: main
assets:
  - name: app-v1.0.0-linux-amd64.tar.gz
    content_type: application/gzip
    size_bytes: 12345678
    download_count: 542
---
## What's New
- Initial stable release
- Full parser support
- CLI interface

A full issue file showing the chronological thread.
---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
- hubot
labels:
- bug
- priority/high
milestone: v2.1
---
When passing an empty string to `parse()`, the application crashes with a null
pointer exception.
---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---
---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: priority/high
---
---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---
I can reproduce this. The guard clause was removed in the last refactor.
---
document: event
event: assigned
actor: octocat
created_at: 2024-09-15T11:05:00Z
assignee: hubot
---
---
document: comment
id: 101
author: octocat
created_at: 2024-09-15T14:00:00Z
---
Fixed in PR #43.
---
document: event
event: closed
actor: hubot
created_at: 2024-09-16T14:00:00Z
commit_sha: abc123f
---

Frontmatter fields for issues and pull requests:

| Field | Type | Notes |
|---|---|---|
| number | integer | Required. Unique within repo |
| title | string | Required |
| type | string | pull_request if PR, omit for issues |
| state | string | open or closed |
| state_reason | string | completed, not_planned, reopened |
| locked | boolean | Omit if false |
| created_at | ISO-8601 | Required |
| updated_at | ISO-8601 | Required |
| closed_at | ISO-8601 | Present when closed |
| author | string | GitHub username |
| assignees | string list | Usernames |
| labels | string list | Label names |
| milestone | string | Milestone title |
| projects | string list | Projects v2 boards the item is on |
| reactions | map | {"+1": 2, "heart": 1}, omit if none |

Additional frontmatter fields for pull requests:

| Field | Type | Notes |
|---|---|---|
| draft | boolean | Omit if false |
| source_branch | string | |
| target_branch | string | |
| source_repo | string | Only for cross-repo PRs |
| merge.merged | boolean | |
| merge.merged_at | ISO-8601 | |
| merge.merged_by | string | Username |
| merge.commit_sha | string | |
| reviewers | string list | Completed reviewers |
| requested_reviewers | string list | Pending reviewers |

Fields shared by thread documents (comments, reviews, review comments, events):

| Field | Type | Notes |
|---|---|---|
| document | string | Required. comment, review, review_comment, event |
| id | integer | Required for comments and reviews |
| author | string | For comments/reviews |
| actor | string | For events |
| created_at | ISO-8601 | Required |

Type-specific fields are added flat — see examples above.
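
To show how the pieces fit together, here is a minimal sketch that splits a thread file into its documents. It rests on stated assumptions: `parse_thread` and `parse_frontmatter` are names invented for this example, the frontmatter reader is a deliberately tiny stand-in that handles only top-level scalars and simple lists (a real consumer would likely use a YAML library), all values stay strings, and a body that itself contains a bare --- line would confuse the splitter.

```python
def parse_frontmatter(block: str) -> dict:
    """Tiny stand-in for a YAML parser: top-level scalars and simple lists only."""
    data, current_key = {}, None
    for line in block.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(data.get(current_key), list):
            data[current_key].append(stripped[2:])
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            data[current_key] = value.strip() or []  # empty value starts a list
    return data


def parse_thread(text: str) -> list:
    """Split a thread file into [(frontmatter, body), ...] in file order."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current))
    # chunks[0] precedes the first delimiter (normally empty); after that,
    # frontmatter and body alternate, and an event's body is an empty chunk.
    docs = []
    for i in range(1, len(chunks), 2):
        body = chunks[i + 1].strip() if i + 1 < len(chunks) else ""
        docs.append((parse_frontmatter(chunks[i]), body))
    return docs


SAMPLE = """---
number: 42
title: Fix crash on empty input
state: closed
labels:
- bug
- priority/high
---
When passing an empty string to `parse()`, the application crashes.
---
document: event
event: labeled
actor: octocat
label: bug
---
---
document: comment
id: 100
author: hubot
---
I can reproduce this.
"""

docs = parse_thread(SAMPLE)
print([meta.get("document", "root") for meta, _ in docs])  # ['root', 'event', 'comment']
```

The first tuple is the issue or PR itself; every later tuple carries a document key naming its type, so a consumer can dispatch on that value.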
This format is designed so an agent with standard file tools (read, glob, grep) can work with GitHub data without API access.
Find open bugs:
grep -l "^state: open" github-data/issues/*.md | xargs grep -lE '^[[:space:]]*- bug$'
Read a specific issue thread:
cat github-data/issues/0042.md
Find issues mentioning a file:
grep -rl "parser.js" github-data/issues/
Find all PRs merged to main:
grep -l "target_branch: main" github-data/issues/*.md | xargs grep -l "merged: true"
List releases:
ls github-data/releases/
Check sync freshness:
cat github-data/repo.yml
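
When an agent can run a script instead of shell tools, the same kind of query can be restricted to the first frontmatter block, which avoids matching words in comment bodies. The following is a sketch under assumptions: `read_head` and `open_issues_with_label` are hypothetical helpers, and the reader understands only the state and labels keys as laid out in the examples above.

```python
from pathlib import Path


def read_head(path: Path) -> dict:
    """Read state and labels from the first frontmatter block of an issue file."""
    head = {"state": None, "labels": []}
    key = None
    lines = path.read_text(encoding="utf-8").splitlines()
    for line in lines[1:]:           # skip the opening ---
        text = line.strip()
        if text == "---":            # end of the first frontmatter block
            break
        if text.startswith("- ") and key == "labels":
            head["labels"].append(text[2:])
        elif ":" in text:
            key, _, value = text.partition(":")
            key = key.strip()
            if key == "state":
                head["state"] = value.strip()
    return head


def open_issues_with_label(root: str, label: str) -> list:
    """Return paths of open items under root/issues that carry exactly this label."""
    matches = []
    for path in sorted(Path(root, "issues").glob("*.md")):
        head = read_head(path)
        if head["state"] == "open" and label in head["labels"]:
            matches.append(path)
    return matches


for path in open_issues_with_label("github-data", "bug"):
    print(path)
```

Unlike a plain substring search, this does not report an item whose only match is a label such as bugfix or a mention of the word in the discussion.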
- Full sync: Uses bulk API endpoints (repo-wide comments, events, PRs, review comments) to fetch all data in a few paginated requests instead of per-issue calls. Only PR reviews require per-PR fetches (no bulk endpoint).
- Incremental sync: Uses synced_at from repo.yml. Fetches only items updated since last sync via the since parameter. Uses per-issue timeline endpoint for changed issues (gives complete history in one call) plus bulk PR list.
- Deleted items: GitHub doesn't hard-delete issues. Transferred or spam-deleted issues are left as-is (the state and timeline tell the story).
- File naming: Zero-padded to 4 digits (0042.md). Repos with >9999 issues use 5+ digits.
- Idempotent: Running sync twice produces the same files. Safe to re-run.
Why github-data/ inside the repo? The agent already has the repo checked
out. Colocating the data means no extra paths to configure. Add github-data/
to .gitignore if you don't want it committed.
Why one file per issue? An agent can read a single file to get the full picture. Grep works across all issues. No database, no joins, no query language.
Why multi-document markdown? The thread reads top-to-bottom like a conversation. Frontmatter is parseable; the body is readable. Standard YAML parsers handle multi-document streams.
Why usernames instead of user objects? Keeps files readable and greppable. A username is enough to identify who did what. Full user profiles (email, avatar) are rarely needed for reasoning.
Why flat event fields? label: bug is simpler than
label: { name: bug, color: d73a4a }. The label details live in labels.yml if
you need them.
Why chronological order? Events and comments interleaved in time order tell the story of what happened. An agent can read top-to-bottom without sorting.