GitHub Data Model — Markdown Format

Sync a GitHub repository's issues, pull requests, releases, and metadata into a local github-data/ folder. The result is a plain-text archive that an agent can read, grep, and reason about without API calls.

Directory Structure

github-data/
  repo.yml                   # repository metadata + sync state
  labels.yml                 # all repository labels
  milestones.yml             # all milestones
  issues/
    0001.md                  # issue or PR — one file per number
    0002.md
    0042.md
    0043.md
  projects/                  # Projects v2 boards linked to this repo
    0001.md                  # project file — one per open project
  discussions/               # GitHub Discussions
    0007.md                  # one file per discussion
  releases/
    v1.0.0.md
    v1.2.0.md
  events/                    # event files exported since last sync (for agents to pick up)
    20240916-140000-000-issue_closed-42.md

Issues and pull requests share a single number space (as on GitHub). The filename is the zero-padded number. A file is self-contained: open it and you see the full thread.
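
As a small illustration of the naming rule, the path for a given number could be computed as below. The helper name `issue_path` is hypothetical, and the sketch assumes that numbers wider than four digits are simply written with as many digits as they need.

```python
from pathlib import PurePosixPath

def issue_path(number: int, root: str = "github-data") -> str:
    """Return the archive path for an issue or PR number (hypothetical helper)."""
    # Pad to at least 4 digits; wider numbers keep all of their digits.
    return str(PurePosixPath(root) / "issues" / f"{number:04d}.md")

print(issue_path(42))     # github-data/issues/0042.md
print(issue_path(12345))  # github-data/issues/12345.md
```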

repo.yml

Repository-level metadata and the sync cursor (synced_at). The feature flags (has_issues, has_projects, has_wiki, has_pages, has_discussions) and archived are always written, even when false, so agents can grep for them; other fields are omitted when empty.

owner: acme
repo: widgets
default_branch: main
description: A widget catalog
homepage: https://widgets.example
visibility: public
language: Go
license: MIT License
topics:
  - cli
  - golang
archived: false
has_issues: true
has_projects: true
has_wiki: true
has_pages: false
has_discussions: false
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
pushed_at: 2024-09-17T07:55:00Z
synced_at: 2024-09-17T08:00:00Z
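
Because repo.yml is mostly flat `key: value` pairs, the sync cursor can be read without a YAML library. The following is a minimal sketch, not the exporter's own code: `read_scalars` and `sync_age_hours` are hypothetical names, and the reader deliberately ignores nested lists such as topics.

```python
from datetime import datetime, timezone

def read_scalars(yaml_text: str) -> dict:
    """Collect top-level `key: value` pairs; nested lists such as topics are skipped."""
    out = {}
    for line in yaml_text.splitlines():
        if line and not line[0].isspace() and ": " in line:
            key, value = line.split(": ", 1)
            out[key] = value.strip()
    return out

def sync_age_hours(yaml_text: str, now: datetime) -> float:
    """Hours elapsed between synced_at and `now`."""
    stamp = read_scalars(yaml_text)["synced_at"]
    # Swap the trailing Z for an offset so fromisoformat also works before Python 3.11.
    synced = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return (now - synced).total_seconds() / 3600

sample = "owner: acme\nrepo: widgets\nhas_wiki: true\nsynced_at: 2024-09-17T08:00:00Z\n"
now = datetime(2024, 9, 17, 14, 0, tzinfo=timezone.utc)
print(sync_age_hours(sample, now))  # 6.0
```

Values come back as strings (`"true"`, not `True`), which is usually enough for a freshness check or a quick flag lookup.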

labels.yml

- name: bug
  color: d73a4a
  description: Something isn't working

- name: enhancement
  color: a2eeef
  description: New feature or request

- name: priority/high
  color: b60205

milestones.yml

- title: v2.1
  state: closed
  description: Stability release
  due_on: 2024-10-01
  closed_at: 2024-09-28

- title: v3.0
  state: open
  description: Major redesign
  due_on: 2025-03-01

Issue File

YAML frontmatter holds structured metadata. The markdown body is the issue text. Comments, reviews, and events follow as additional documents, each with its own YAML header delimited by --- and an optional markdown body.

---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - hubot
labels:
  - bug
  - priority/high
milestone: v2.1
---

When passing an empty string to `parse()`, the application crashes with a null
pointer exception.

## Steps to reproduce

1. Call `parse("")`
2. Observe crash

Pull Request File

Same format. The type: pull_request field and PR-specific frontmatter fields distinguish it from an issue.

---
number: 43
title: Handle empty input in parser
type: pull_request
state: closed
created_at: 2024-09-15T12:00:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - octocat
labels:
  - bugfix
milestone: v2.1
source_branch: fix/empty-input
target_branch: main
merge:
  merged: true
  merged_at: 2024-09-16T14:00:00Z
  merged_by: hubot
  commit_sha: abc123f
reviewers:
  - hubot
requested_reviewers:
  - monalisa
---

Fixes #42. Adds a guard clause to `parse()` to return early on empty input.
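
Since a merged PR and an abandoned PR both have state: closed, the merge block is what tells them apart. A minimal sketch of that check on an already-parsed frontmatter mapping follows; `pr_status` is a hypothetical helper, PR number 44 is made up for the demo, and the code assumes an unmerged PR either omits merge or carries merged: false.

```python
def pr_status(meta: dict) -> str:
    """Classify a parsed PR frontmatter mapping as open, merged, or closed."""
    if meta.get("type") != "pull_request":
        raise ValueError("not a pull request")
    if meta.get("merge", {}).get("merged"):
        return "merged"
    return meta["state"]  # "open", or "closed" without a merge

merged_pr = {"number": 43, "type": "pull_request", "state": "closed",
             "merge": {"merged": True, "merged_by": "hubot"}}
abandoned_pr = {"number": 44, "type": "pull_request", "state": "closed"}  # hypothetical PR
print(pr_status(merged_pr), pr_status(abandoned_pr))  # merged closed
```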

Subsequent Documents

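After the first document, each sub-document opens with a --- line, carries its own YAML header closed by a second ---, and may be followed by a markdown body. The document field declares its type. All documents appear in chronological order.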
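
The layout described here can be split with a few lines of code. This is a sketch under stated assumptions rather than a reference parser: `split_documents` and `parse_header` are hypothetical names, header values are kept as plain strings, and a --- line only counts as the start of a new sub-document when the next line begins with `document:`, so that a markdown horizontal rule inside a body is left alone.

```python
from typing import Dict, List, Tuple

def split_documents(text: str) -> List[Tuple[Dict[str, str], str]]:
    """Split an issue file into (header, body) pairs, first document included."""
    lines = text.splitlines()
    docs, i, n = [], 0, len(lines)
    while i < n:
        if lines[i].strip() != "---":
            i += 1  # skip anything before an opening delimiter
            continue
        # Header: everything up to the closing delimiter.
        j = i + 1
        while j < n and lines[j].strip() != "---":
            j += 1
        # Body: runs until a delimiter that is directly followed by "document:".
        k = j + 1
        while k < n and not (
            lines[k].strip() == "---"
            and k + 1 < n
            and lines[k + 1].startswith("document:")
        ):
            k += 1
        docs.append((parse_header(lines[i + 1:j]), "\n".join(lines[j + 1:k]).strip()))
        i = k
    return docs

def parse_header(header_lines: List[str]) -> Dict[str, str]:
    """Read top-level `key: value` scalars; lists and nested maps are skipped."""
    out = {}
    for line in header_lines:
        if line and not line[0].isspace() and ": " in line:
            key, value = line.split(": ", 1)
            out[key] = value
    return out

sample = """---
number: 42
title: Fix crash on empty input
---

Crashes on empty input.

---
document: comment
id: 100
author: hubot
---

I can reproduce this.
"""
for header, body in split_documents(sample):
    print(header.get("document", "issue"), "|", body)
# issue | Crashes on empty input.
# comment | I can reproduce this.
```

For list-valued or nested fields (labels, merge, fields), each header block can instead be handed to a full YAML parser once the file has been split.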

comment

---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---

I can reproduce this. The guard clause was removed in the last refactor.

review

---
document: review
id: 200
author: hubot
state: approved
commit_sha: abc123f
submitted_at: 2024-09-16T10:00:00Z
---

Looks good. The early return is clean.

review_comment

Inline code comment tied to a file, line, and review.

---
document: review_comment
id: 201
review_id: 200
author: hubot
created_at: 2024-09-16T10:00:00Z
path: src/parser.js
line: 12
side: RIGHT
commit_sha: abc123f
---

Nit: could use `=== undefined` instead of `== null` for clarity.

event

State changes. Usually no body.

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---

Common event types: labeled, unlabeled, assigned, unassigned, closed, reopened, merged, renamed, milestoned, demilestoned, referenced, cross-referenced, review_requested, review_request_removed, review_dismissed, head_ref_force_pushed, head_ref_deleted, base_ref_changed, converted_to_draft, ready_for_review, locked, unlocked, pinned, unpinned, transferred, connected, disconnected, marked_as_duplicate, unmarked_as_duplicate.

Event-specific fields are added flat in frontmatter:

| Event type | Extra fields |
| --- | --- |
| `labeled/unlabeled` | `label` |
| `assigned/unassigned` | `assignee` |
| `milestoned/demilestoned` | `milestone` |
| `renamed` | `from`, `to` |
| `closed/merged/referenced` | `commit_sha` |
| `cross-referenced` | `source_number`, `source_repo` |
| `review_requested/review_request_removed` | `reviewer` |
| `locked` | `lock_reason` |
| `review_dismissed` | `dismissal_message` |
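
Because event fields are flat and documents are chronological, state can be reconstructed by replaying events in file order. The sketch below does this for labels, assuming the sub-documents have already been parsed into dicts; `labels_from_events` is a hypothetical helper and the unlabeled event in the demo data is invented for illustration.

```python
def labels_from_events(documents: list) -> set:
    """Replay labeled/unlabeled events in order and return the resulting label set."""
    current = set()
    for doc in documents:
        if doc.get("document") != "event":
            continue  # comments and reviews do not change labels
        if doc["event"] == "labeled":
            current.add(doc["label"])
        elif doc["event"] == "unlabeled":
            current.discard(doc["label"])
    return current

thread = [
    {"document": "event", "event": "labeled", "label": "bug"},
    {"document": "event", "event": "labeled", "label": "priority/high"},
    {"document": "comment", "id": 100, "author": "hubot"},
    {"document": "event", "event": "unlabeled", "label": "priority/high"},  # invented
]
print(sorted(labels_from_events(thread)))  # ['bug']
```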

Project File

One file per open Projects v2 board linked to the repository, named by the project number (projects/0001.md). The frontmatter holds the project header and field definitions; the body is the project's readme; each linked item follows as an item sub-document with its current field values.

Closed projects are not written — when a project transitions from open to closed, its file is deleted and a project_closed event is emitted. Draft issues (project-only items without an issue number) are skipped.

---
number: 1
title: Q1 Roadmap
state: open
public: true
url: https://github.com/orgs/acme/projects/1
owner: acme
description: Quarterly planning
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
fields:
  - name: Status
    type: SINGLE_SELECT
    options:
      - Todo
      - In Progress
      - Done
  - name: Priority
    type: SINGLE_SELECT
    options:
      - P0
      - P1
  - name: Iteration
    type: ITERATION
---

Long-form project description / readme.

---
document: item
type: issue
number: 42
title: Fix crash on empty input
repo: acme/widgets
fields:
  Priority: P0
  Status: In Progress
---

---
document: item
type: pull_request
number: 43
title: Handle empty input in parser
repo: acme/widgets
fields:
  Status: Done
---
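
Once the item sub-documents are parsed, a board view is a simple grouping on one of the project fields. A minimal sketch, assuming each item is a dict with a nested fields mapping as in the example above (`group_by_field` is a hypothetical name):

```python
from collections import defaultdict

def group_by_field(items: list, field: str = "Status") -> dict:
    """Map each value of a project field to the item numbers carrying it."""
    board = defaultdict(list)
    for item in items:
        # Items without a value for the field are filed under None.
        board[item.get("fields", {}).get(field)].append(item["number"])
    return dict(board)

items = [
    {"document": "item", "type": "issue", "number": 42,
     "fields": {"Priority": "P0", "Status": "In Progress"}},
    {"document": "item", "type": "pull_request", "number": 43,
     "fields": {"Status": "Done"}},
]
print(group_by_field(items))              # {'In Progress': [42], 'Done': [43]}
print(group_by_field(items, "Priority"))  # {'P0': [42], None: [43]}
```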

Discussion File

One file per GitHub Discussion at discussions/<number>.md. Discussions share the repository's number space with issues and PRs (so a repo can have an issue #42 or a discussion #42, never both). YAML frontmatter holds the metadata; top-level replies are emitted as document: comment and nested replies as document: reply with a parent_id pointing at the comment they reply to.

Discussions are GraphQL-only and use the same since cutoff as the rest of the exporter — the list is fetched newest-first and pagination stops as soon as items are older than the cutoff.

---
number: 7
title: How do I export the wiki?
type: discussion
state: open
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
author: octocat
category: Q&A
labels:
  - question
answer_id: 17024104
answer_chosen_at: 2024-09-16T14:00:00Z
answer_chosen_by: hubot
---

Discussion body markdown.

---
document: comment
id: 17024104
author: hubot
created_at: 2024-09-15T11:00:00Z
is_answer: true
---

Top-level reply that was marked as the chosen answer.

---
document: reply
id: 17024105
parent_id: 17024104
author: octocat
created_at: 2024-09-15T11:30:00Z
---

Nested reply under the top-level comment.

Discussion frontmatter

| Field | Type | Notes |
| --- | --- | --- |
| `number` | integer | Required. Unique within the repo (shared with issues/PRs) |
| `title` | string | Required |
| `type` | string | Always `discussion` |
| `state` | string | `open` or `closed` |
| `state_reason` | string | `outdated`, `duplicate`, `resolved`, `reopened` (lowercased) |
| `locked` | boolean | Omit if false |
| `created_at` | ISO-8601 | Required |
| `updated_at` | ISO-8601 | Required |
| `closed_at` | ISO-8601 | Present when closed |
| `author` | string | GitHub username |
| `category` | string | Discussion category name (e.g. `Q&A`, `General`, `Ideas`) |
| `labels` | string list | Label names |
| `answer_id` | integer | Q&A only: `databaseId` of the comment marked as answer |
| `answer_chosen_at` | ISO-8601 | Q&A only |
| `answer_chosen_by` | string | Q&A only: username who marked the answer |

Discussion sub-documents

| `document` | Fields |
| --- | --- |
| `comment` | `id`, `author`, `created_at`, optional `is_answer: true` |
| `reply` | `id`, `parent_id`, `author`, `created_at` |

If a discussion has more than 100 top-level comments, or any comment has more than 50 replies, the export keeps only the first 100 comments (or the first 50 replies under that comment) and logs a warning, for example: Warning: discussion #N has more than 100 top-level comments — only first 100 exported. This is a deliberate trade-off to keep GraphQL node cost bounded.
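
The comment and reply sub-documents can be regrouped into threads by following parent_id. The sketch below assumes the sub-documents are already parsed into dicts; `build_threads` is a hypothetical helper, and a reply whose parent is missing (for example because of the truncation described above) is silently dropped.

```python
def build_threads(documents: list) -> list:
    """Attach each reply to its parent comment via parent_id."""
    threads = {}
    for doc in documents:
        if doc["document"] == "comment":
            threads[doc["id"]] = {"comment": doc, "replies": []}
    for doc in documents:
        if doc["document"] == "reply" and doc["parent_id"] in threads:
            threads[doc["parent_id"]]["replies"].append(doc)
    # Dicts keep insertion order, so threads stay in file (chronological) order.
    return list(threads.values())

docs = [
    {"document": "comment", "id": 17024104, "author": "hubot", "is_answer": True},
    {"document": "reply", "id": 17024105, "parent_id": 17024104, "author": "octocat"},
]
threads = build_threads(docs)
print(len(threads), [r["id"] for r in threads[0]["replies"]])  # 1 [17024105]
```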

Cross-references

When an issue or PR is on one or more projects, the issue file's frontmatter also lists them:

projects:
  - Q1 Roadmap
  - Bugs

This is populated on the next sync that re-fetches the issue (an issue gets re-fetched when its updated_at advances, which happens whenever it is added to or removed from a project).

Release File

---
tag: v1.0.0
name: Version 1.0.0
draft: false
prerelease: false
author: octocat
created_at: 2024-06-01T12:00:00Z
published_at: 2024-06-01T12:00:00Z
target_commitish: main
assets:
  - name: app-v1.0.0-linux-amd64.tar.gz
    content_type: application/gzip
    size_bytes: 12345678
    download_count: 542
---

## What's New

- Initial stable release
- Full parser support
- CLI interface

Complete Example: issues/0042.md

A full issue file showing the chronological thread.

---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - hubot
labels:
  - bug
  - priority/high
milestone: v2.1
---

When passing an empty string to `parse()`, the application crashes with a null
pointer exception.

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: priority/high
---

---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---

I can reproduce this. The guard clause was removed in the last refactor.

---
document: event
event: assigned
actor: octocat
created_at: 2024-09-15T11:05:00Z
assignee: hubot
---

---
document: comment
id: 101
author: octocat
created_at: 2024-09-15T14:00:00Z
---

Fixed in PR #43.

---
document: event
event: closed
actor: hubot
created_at: 2024-09-16T14:00:00Z
commit_sha: abc123f
---

Frontmatter Reference

Issue / Pull Request (first document)

| Field | Type | Notes |
| --- | --- | --- |
| `number` | integer | Required. Unique within repo |
| `title` | string | Required |
| `type` | string | `pull_request` if PR, omit for issues |
| `state` | string | `open` or `closed` |
| `state_reason` | string | `completed`, `not_planned`, `reopened` |
| `locked` | boolean | Omit if false |
| `created_at` | ISO-8601 | Required |
| `updated_at` | ISO-8601 | Required |
| `closed_at` | ISO-8601 | Present when closed |
| `author` | string | GitHub username |
| `assignees` | string list | Usernames |
| `labels` | string list | Label names |
| `milestone` | string | Milestone title |
| `projects` | string list | Projects v2 boards the item is on |
| `reactions` | map | `{"+1": 2, "heart": 1}`, omit if none |

PR-only fields (when `type: pull_request`)

| Field | Type | Notes |
| --- | --- | --- |
| `draft` | boolean | Omit if false |
| `source_branch` | string | |
| `target_branch` | string | |
| `source_repo` | string | Only for cross-repo PRs |
| `merge.merged` | boolean | |
| `merge.merged_at` | ISO-8601 | |
| `merge.merged_by` | string | Username |
| `merge.commit_sha` | string | |
| `reviewers` | string list | Completed reviewers |
| `requested_reviewers` | string list | Pending reviewers |

Subsequent documents

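| Field | Type | Notes |
| --- | --- | --- |
| `document` | string | Required. `comment`, `review`, `review_comment`, `event` |
| `id` | integer | Required for comments and reviews |
| `author` | string | For comments/reviews |
| `actor` | string | For events |
| `created_at` | ISO-8601 | Required (the review example above uses `submitted_at` instead) |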

Type-specific fields are added flat — see examples above.
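
The required fields in the first-document table lend themselves to a quick sanity check on a parsed frontmatter mapping. This is only a sketch: `REQUIRED`, `missing_required`, and `is_pull_request` are hypothetical names, and the list covers just the fields marked Required above.

```python
REQUIRED = ("number", "title", "created_at", "updated_at")

def missing_required(meta: dict) -> list:
    """Return the required first-document fields absent from a frontmatter mapping."""
    return [name for name in REQUIRED if name not in meta]

def is_pull_request(meta: dict) -> bool:
    """Issues omit `type`; PRs carry type: pull_request."""
    return meta.get("type") == "pull_request"

issue = {"number": 42, "title": "Fix crash on empty input",
         "created_at": "2024-09-15T10:30:00Z", "updated_at": "2024-09-16T14:00:00Z"}
print(missing_required(issue), is_pull_request(issue))  # [] False
print(missing_required({"number": 43, "type": "pull_request"}))
# ['title', 'created_at', 'updated_at']
```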

Agent Usage

This format is designed so an agent with standard file tools (read, glob, grep) can work with GitHub data without API access.

Find open bugs:

grep -l "^state: open" github-data/issues/*.md | xargs grep -l "^  - bug$"

Read a specific issue thread:

cat github-data/issues/0042.md

Find issues mentioning a file:

grep -rl "parser.js" github-data/issues/

Find all PRs merged to main:

grep -l "^target_branch: main$" github-data/issues/*.md | xargs grep -l "^  merged: true$"

List releases:

ls github-data/releases/

Check sync freshness:

cat github-data/repo.yml
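
The grep recipes above work on raw text, so a word in a comment body can produce a false match. Where that matters, a stricter check confined to the first frontmatter block might look like the sketch below. The function names are hypothetical, the issue numbers in the demo are invented, and the two-space list indentation is taken from the examples in this document.

```python
def frontmatter_lines(text: str) -> list:
    """Return the lines between the first pair of --- delimiters."""
    lines = text.splitlines()
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return []
    return lines[1:lines.index("---", 1)]

def has_label(text: str, label: str) -> bool:
    """True when `label` is an entry of the labels list in the first document."""
    in_labels = False
    for line in frontmatter_lines(text):
        if line == "labels:":
            in_labels = True
        elif in_labels and line.startswith("  - "):
            if line[4:] == label:
                return True
        else:
            in_labels = False  # any other key ends the labels list
    return False

def is_open_bug(text: str) -> bool:
    """Open issue or PR carrying the bug label, judged from frontmatter only."""
    return "state: open" in frontmatter_lines(text) and has_label(text, "bug")

open_bug = "---\nnumber: 7\nstate: open\nlabels:\n  - bug\n---\n\nCrash on start.\n"
open_other = "---\nnumber: 8\nstate: open\nlabels:\n  - bugfix\n---\n\nMentions a bug.\n"
print(is_open_bug(open_bug), is_open_bug(open_other))  # True False
```

To scan the archive, the same function can be applied to the text of each file under github-data/issues/.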

Sync Behavior

Full sync: Uses bulk API endpoints (repo-wide comments, events, PRs, review comments) to fetch all data in a few paginated requests instead of per-issue calls. Only PR reviews require per-PR fetches, because there is no bulk endpoint for them.
Incremental sync: Reads synced_at from repo.yml and fetches only items updated since the last sync via the since parameter. Changed issues are fetched through the per-issue timeline endpoint (complete history in one call), alongside the bulk PR list.
Deleted items: GitHub rarely hard-deletes issues. Transferred or spam-deleted issues are left as-is (the state and timeline tell the story).
File naming: Zero-padded to 4 digits (0042.md). Repos with more than 9999 issues use 5 or more digits.
Idempotent: Running sync twice produces the same files, so it is safe to re-run.
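
The incremental rule boils down to comparing each item's updated_at with the stored cursor. The following sketch shows only that comparison and is not the exporter's code: `changed_since` is a hypothetical helper, the updated_at for #43 is invented for the demo, and the inclusive comparison is an assumption (re-fetching an item at the boundary is harmless because sync is idempotent).

```python
from datetime import datetime

def parse_ts(stamp: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing Z (UTC)."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))

def changed_since(items: list, synced_at: str) -> list:
    """Return the numbers of items whose updated_at is at or after the cursor."""
    cutoff = parse_ts(synced_at)
    return [item["number"] for item in items if parse_ts(item["updated_at"]) >= cutoff]

items = [
    {"number": 42, "updated_at": "2024-09-16T14:00:00Z"},
    {"number": 43, "updated_at": "2024-09-17T09:30:00Z"},  # invented timestamp
]
print(changed_since(items, "2024-09-17T08:00:00Z"))  # [43]
```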

Design Decisions

Why github-data/ inside the repo? The agent already has the repo checked out. Colocating the data means no extra paths to configure. Add github-data/ to .gitignore if you don't want it committed.

Why one file per issue? An agent can read a single file to get the full picture. Grep works across all issues. No database, no joins, no query language.

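Why multi-document markdown? The thread reads top-to-bottom like a conversation. Frontmatter is parseable; the body is readable. The --- delimiter is the same one YAML uses for multi-document streams, so each header block can be handed to a standard YAML parser once the file is split.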

Why usernames instead of user objects? Keeps files readable and greppable. A username is enough to identify who did what. Full user profiles (email, avatar) are rarely needed for reasoning.

Why flat event fields? label: bug is simpler than label: { name: bug, color: d73a4a }. The label details live in labels.yml if you need them.

Why chronological order? Events and comments interleaved in time order tell the story of what happened. An agent can read top-to-bottom without sorting.
