Hard block private repo surfacing from public github-stars runs #74

Description

@primeinc

Goal

Add a hard privacy policy blocker:

NEVER EVER EVER EVER EVER allow private repositories to be surfaced if github-stars is running in a public repository.

This is a non-negotiable app invariant. If the output surface is public, private repo data must not appear in generated files, workflow summaries, logs, artifacts, issues, PR comments, Pages output, classification batches, provenance records, or diagnostic reports.

Parent: #69
Related: #42, #54, #71, #73

Why this exists

github-stars is intended to publish or expose a star catalog from a repo that may itself be public. The app may eventually support authenticated/private star fetch paths, broad GitHub App permissions, setup diagnostics, artifact provenance, and agent routing.

That creates a specific leak hazard:

authenticated/private input source
  -> public repository run/output context
  -> generated catalog/log/artifact/issue leak

This must hard-fail, not merely warn.

Policy Invariant

If output_repository.visibility == public:
  private_repository_records MUST NOT be surfaced.

Here, "surfaced" means written, printed, summarized, classified, routed, cached, uploaded, published, or committed anywhere visible from the public repo context.
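
The invariant could be enforced as a guard that runs before any surface is touched. This is a minimal sketch, not the project's implementation: `StarRecord`, `PrivateSurfaceError`, and `assert_surface_allowed` are hypothetical names, and visibility is assumed to arrive as a plain string.

```python
from dataclasses import dataclass


class PrivateSurfaceError(RuntimeError):
    """Raised when a private record is about to reach a non-private surface."""


@dataclass(frozen=True)
class StarRecord:
    # Hypothetical record shape, for illustration only.
    slug: str
    is_private: bool


def assert_surface_allowed(output_visibility: str, records: list[StarRecord]) -> None:
    """Hard-fail (not warn) if any private record would be surfaced."""
    # Anything other than an explicit "private" context is treated like
    # public, so an unknown visibility fails closed.
    if output_visibility == "private":
        return
    private_count = sum(1 for r in records if r.is_private)
    if private_count:
        # The message carries only an aggregate count, never a slug.
        raise PrivateSurfaceError(
            f"{private_count} private record(s) blocked from surfacing"
        )
```

Raising an exception rather than logging a warning matches the "hard-fail, not merely warn" requirement, and the error text itself stays public-safe.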

Required Behavior

1. Public repo mode must quarantine private inputs

When github-stars runs in a public repository:

private repo slug -> do not print
private repo metadata -> do not print
private repo count -> allowed only as aggregate count if no identifiers leak
private repo classification -> disabled
private repo generated output -> prohibited
private repo artifact upload -> prohibited
private repo issue/PR/comment output -> prohibited
private repo Pages output -> prohibited

Allowed public-safe aggregate example:

private_repos_omitted: 12

Forbidden examples:

private_repos_omitted:
  - owner/private-repo

Skipped private repo owner/private-repo

Classified private repo owner/private-repo as security/dev-tools
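
The quarantine rule and the aggregate-only example above can be sketched as a single filtering step. This assumes a hypothetical `StarRecord` with an `is_private` flag; the function name `quarantine_private` is also illustrative and not taken from the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StarRecord:
    # Hypothetical record shape, for illustration only.
    slug: str
    is_private: bool


def quarantine_private(
    records: list[StarRecord],
) -> tuple[list[StarRecord], dict[str, int]]:
    """Drop private records and return only a public-safe aggregate."""
    public_records = [r for r in records if not r.is_private]
    omitted = len(records) - len(public_records)
    # Only the count leaves this function; private slugs are dropped here
    # and are never attached to the summary.
    return public_records, {"private_repos_omitted": omitted}
```

Because the summary is built from a count rather than from the dropped records, there is no code path that could place a private slug in it.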

2. Auth resolver must enforce visibility boundary

The auth resolver from #69 must distinguish:

star_fetch_auth = public | pat | github_app_user | github_token | disabled
output_visibility = public | private | internal/unknown
private_repo_surface_policy = block | allow_private_context_only

Public output context requires:

private_repo_surface_policy=block

If authenticated/private data is fetched while output visibility is public, the run must either:

filter private repos before any surface
or fail closed before writing/logging/uploading/publishing anything private-derived
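
One way to express the resolver's visibility boundary, as a sketch under stated assumptions: the enum and function names below are hypothetical, and only the `public`, `private`, and `unknown` visibility values are modeled because this issue does not say how `internal` should be handled.

```python
from enum import Enum


class Visibility(str, Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    UNKNOWN = "unknown"


class SurfacePolicy(str, Enum):
    BLOCK = "block"
    ALLOW_PRIVATE_CONTEXT_ONLY = "allow_private_context_only"


def resolve_surface_policy(
    output_visibility: Visibility, requested: SurfacePolicy
) -> SurfacePolicy:
    """Return the effective policy, overriding the request where required."""
    if output_visibility is Visibility.UNKNOWN:
        # Fail closed before anything is written, logged, or uploaded.
        raise RuntimeError("output visibility unknown: failing closed")
    if output_visibility is Visibility.PUBLIC:
        # Public output always blocks; the requested policy is ignored.
        return SurfacePolicy.BLOCK
    return requested
```

With this shape, a misconfigured request for `allow_private_context_only` in a public repo is silently downgraded to `block` rather than honored; an implementation could equally choose to raise in that case.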

3. Setup doctor must report policy state safely

primeinc-stars-yoshi-doctor must report:

output_repository_visibility=public|private|unknown
private_repo_surface_policy=block|allow_private_context_only
private_repo_identifiers_printed=false
private_repo_count=<aggregate only, optional>

No private identifiers may appear in setup doctor output when output repo is public.
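
A sketch of a report renderer that cannot leak identifiers because it never receives any. The function name `render_doctor_report` and its signature are assumptions for illustration; only the four output keys come from this issue.

```python
from typing import Optional


def render_doctor_report(
    output_visibility: str,
    policy: str,
    private_repo_count: Optional[int] = None,
) -> str:
    """Render the policy state using aggregate values only."""
    # Refuse to report a non-blocking policy outside an explicit private context.
    if output_visibility != "private" and policy != "block":
        raise ValueError("non-private output requires private_repo_surface_policy=block")
    lines = [
        f"output_repository_visibility={output_visibility}",
        f"private_repo_surface_policy={policy}",
        # Always false: this function takes no slugs or metadata as input.
        "private_repo_identifiers_printed=false",
    ]
    if private_repo_count is not None:
        lines.append(f"private_repo_count={private_repo_count}")
    return "\n".join(lines)
```

Keeping record objects out of the signature entirely is the design choice here: the doctor sees a count, so there is nothing to sanitize afterwards.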

4. Generated artifact registry must mark public/private safety

Any generated artifact registry from #69 must include a public-safety classification:

artifact
visibility_surface
may_contain_private_identifiers
private_safe_for_public_repo
producer
validation_gate

Public repo runs may only publish artifacts where:

private_safe_for_public_repo=true
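
The registry fields above map naturally onto a record type with a consistency check. This is a hedged sketch: `ArtifactEntry` and `publishable_in_public_repo` are hypothetical names, and all fields other than the two booleans are assumed to be free-form strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactEntry:
    artifact: str
    visibility_surface: str
    may_contain_private_identifiers: bool
    private_safe_for_public_repo: bool
    producer: str
    validation_gate: str

    def __post_init__(self) -> None:
        # An artifact that may hold private identifiers can never be
        # marked public-safe, regardless of what the producer claims.
        if self.may_contain_private_identifiers and self.private_safe_for_public_repo:
            raise ValueError(
                "artifact cannot be public-safe while it may contain private identifiers"
            )


def publishable_in_public_repo(registry: list[ArtifactEntry]) -> list[ArtifactEntry]:
    """Return only the entries a public repo run may publish."""
    return [e for e in registry if e.private_safe_for_public_repo]
```

Rejecting the contradictory combination at construction time means an invalid entry cannot exist in the registry in the first place.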

5. Classifier must refuse private repo candidates in public mode

The #71 classifier path must not receive private repo identifiers or metadata when output context is public.

Hard rule:

private repo records cannot enter model prompt/input if output repo is public

This avoids both output leaks and model-prompt leakage. Letting an LLM see private repo names and then asking it politely not to leak them is not a policy. It is a haunted pinky promise.
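
The hard rule amounts to filtering before the prompt payload is built, not after. A minimal sketch, assuming a hypothetical `StarRecord` shape and a JSON batch format; neither is taken from the #71 classifier.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StarRecord:
    # Hypothetical record shape, for illustration only.
    slug: str
    description: str
    is_private: bool


def build_classifier_batch(
    output_visibility: str,
    records: list[StarRecord],
    allow_private: bool = False,
) -> str:
    """Serialize classifier input, dropping private records up front."""
    # Private records pass only in an explicit private context with an
    # explicit opt-in; public and unknown contexts always exclude them.
    if output_visibility == "private" and allow_private:
        eligible = list(records)
    else:
        eligible = [r for r in records if not r.is_private]
    payload = [{"slug": r.slug, "description": r.description} for r in eligible]
    return json.dumps(payload)
```

Since the private records are removed before serialization, the model never sees them and no prompt instruction is needed to keep them out of the output.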

Required Tests

Add tests for:

public output repo + private input repo -> private identifier omitted
public output repo + private input repo -> artifact excludes private slug
public output repo + private input repo -> workflow summary excludes private slug
public output repo + private input repo -> classifier batch excludes private slug
public output repo + private input repo -> logs do not contain private slug
public output repo + private input repo -> issue/PR/router payload excludes private slug
public output repo + aggregate count -> allowed
private output repo + private input repo -> allowed only if policy explicitly permits
unknown output visibility -> fail closed

Add at least one forbidden sentinel fixture:

owner/private-sentinel-repo-do-not-leak

The test must fail if that string appears in any public-mode output, summary, artifact, classifier input, or router payload.
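
A sentinel check along these lines could back that test. It is a sketch only: `find_sentinel_leaks` and the surface names in the example are hypothetical, and the captured surfaces are assumed to be available as strings.

```python
SENTINEL = "owner/private-sentinel-repo-do-not-leak"


def find_sentinel_leaks(surfaces: dict[str, str], sentinel: str = SENTINEL) -> list[str]:
    """Return the names of all surfaces whose text contains the sentinel."""
    # Compare case-insensitively so a re-cased slug still counts as a leak.
    needle = sentinel.lower()
    return sorted(name for name, text in surfaces.items() if needle in text.lower())


def test_sentinel_does_not_surface() -> None:
    # In a real test these strings would be captured from a public-mode run.
    surfaces = {
        "workflow_summary": "private_repos_omitted: 1",
        "classifier_input": '[{"slug": "owner/public-repo"}]',
        "router_payload": "{}",
    }
    assert find_sentinel_leaks(surfaces) == []
```

Returning the list of leaking surface names, rather than a boolean, makes a failure point at the offending output without echoing any private content beyond the fixture itself.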

Acceptance Criteria

  • A formal policy exists in code or config for private repo surfacing.
  • Public output repo mode defaults to hard block.
  • Unknown output repo visibility fails closed.
  • Private repo identifiers are filtered/quarantined before prompt/model/classifier input.
  • Workflow summaries may show aggregate private omission counts but never private repo identifiers.
  • Generated artifacts are marked public-safe before upload/commit/publish.
  • Issue/PR/agent routing payloads are sanitized in public mode.
  • Tests include sentinel private repo leak checks.
  • AGENTS.md or control-plane docs state this as a hard blocker.
  • Completion proof includes a test showing the sentinel private repo name does not surface.

Proof Required

Completion comment must include:

  • PR URL or commit SHA.
  • Test output for private leak sentinel fixture.
  • Workflow/setup-doctor summary excerpt showing public mode and private_repo_surface_policy=block.
  • Example sanitized aggregate output.
  • Confirmation that classifier/model input excludes private repo identifiers in public mode.
  • Confirmation that generated artifacts cannot be marked public-safe if they contain private identifiers.

Evidence Labels for Implementer

Use these labels in the completion report:

  • Direct evidence: source diff, test fixture, test output, workflow summary, artifact validation output.
  • Weak inference: a field may contain private-derived information but no identifiers; justify with sanitizer evidence.
  • Unsupported: claiming private data is protected without a sentinel leak test.
  • Contradicted: any private repo slug appears in public-mode logs, summaries, artifacts, prompts, generated files, issues, PR comments, or Pages output.
  • Blocked: output repo visibility cannot be determined and fail-closed behavior triggers.

Non-Goals

  • Do not remove support for private/authenticated star fetching in private output contexts.
  • Do not expose private repo names just to explain that they were skipped.
  • Do not rely on LLM instructions to suppress private identifiers after they enter prompts.
  • Do not allow a warning-only mode for public output contexts.

Definition of Done

When github-stars runs in a public repository, private repositories are never surfaced. The pipeline either filters private records before any public surface or fails closed before output, and the sentinel leak test proves it.
