fix(badges): make GitHub badges reliable + surface provider failures in Sentry #122

Merged
jal-co merged 4 commits into main from claude/github-badge-reliability-fexzpg
Jun 9, 2026

@jal-co (Owner) commented Jun 9, 2026

Problem

Badges that depend on a third-party API intermittently rendered a red "not found" and stayed broken — the single worst outcome for user trust. Root causes:

  1. Errors were cached like successes → a momentary blip got pinned at the CDN for an hour.
  2. No last-known-good fallback → a failed live fetch collapsed straight to "not found".
  3. Only GitHub got hardened initially → every other API provider (npm, pypi, docker, crates, discord, vscode, …) still broke on a transient blip.
  4. No observability and no distinction between a transient failure and a genuinely invalid resource.

Changes

Resilience — now across all ~44 API-backed providers

A single clear fetcher contract in cachedFetchStale:

  • throws → TRANSIENT (network / 429 / 5xx / parse) → serve last-known-good, back off, alert if nothing cached.
  • returns null → DEFINITIVE negative (upstream answered; resource absent) → short-lived "not found", never serve stale, never alert (a typo'd package can't alert-storm or show someone else's value).
  • returns value → success (cached fresh + 7-day last-known-good).
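
The contract above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the name cachedFetchStale and the throw / null / value semantics come from the description. The in-memory Map stores, the `Outcome` shape, and the signature are assumptions, and TTLs, backoff, and alerting are left out.

```typescript
// Hypothetical in-memory stand-ins for the real cache layer (assumption).
const fresh = new Map<string, string>();
const lastKnownGood = new Map<string, string>();

// Illustrative result shape; the real return type is not shown in the PR text.
type Outcome =
  | { kind: "value"; value: string; stale: boolean }
  | { kind: "not_found" }
  | { kind: "unavailable" };

async function cachedFetchStale(
  key: string,
  fetcher: () => Promise<string | null>,
): Promise<Outcome> {
  const hit = fresh.get(key);
  if (hit !== undefined) return { kind: "value", value: hit, stale: false };

  let result: string | null;
  try {
    result = await fetcher();
  } catch {
    // throws => TRANSIENT: fall back to last-known-good if there is one.
    const stale = lastKnownGood.get(key);
    if (stale !== undefined) return { kind: "value", value: stale, stale: true };
    return { kind: "unavailable" }; // the real code would also alert here
  }

  // null => DEFINITIVE negative: never serve stale, never alert.
  if (result === null) return { kind: "not_found" };

  // value => success: refresh both the fresh copy and the last-known-good copy.
  fresh.set(key, result);
  lastKnownGood.set(key, result);
  return { kind: "value", value: result, stale: false };
}
```

Keeping "absent" (null) separate from "failed" (throw) is what lets a typo'd package resolve to a short-lived "not found" without ever borrowing a stale value.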

Wired through the shared path so it covers everything at once:

  • provider-fetch (providerFetch/providerFetchText) — throw on 429/5xx, null on other non-2xx, 7-day stale window.
  • vscode (custom POST) moved onto the same contract.
  • github: githubFetch throws on transient / returns null only on definitive negatives, so its stale fallback actually triggers.
  • Short-lived ERROR_CACHE_HEADERS on every error/"not found" response; group responses use them when any segment is unresolved.
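
As a sketch of the status mapping described for provider-fetch: 429 and 5xx throw, any other non-2xx returns null, and 2xx returns parsed data. The helper name `interpretResponse` and the error class are hypothetical; the real providerFetch performs the HTTP request itself, which is omitted here so the example stays self-contained.

```typescript
// Hypothetical error type used to signal a transient upstream failure.
class TransientUpstreamError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`transient upstream failure (HTTP ${status})`);
    this.status = status;
  }
}

// Maps an upstream status onto the throw / null / value contract.
// `parse` stands in for decoding the response body (assumption).
function interpretResponse<T>(status: number, parse: () => T): T | null {
  // 429 and any 5xx are transient: throw so the caller can serve stale.
  if (status === 429 || status >= 500) throw new TransientUpstreamError(status);
  // Any other non-2xx means upstream answered and the resource is absent.
  if (status < 200 || status >= 300) return null;
  // 2xx: return the data. A parse error propagates as a throw, i.e. transient.
  return parse();
}
```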

Observability (Sentry)

  • Provider-alert channel in core (setProviderAlertCallback / reportProviderAlert), wired in web + engine to Sentry.captureMessage.
  • Rate limits / outages for every provider — centralized in recordBackoff (the chokepoint all providers funnel through), fired once per backoff cycle.
  • Badge actually breaks: badge_unavailable alert only when a transient failure has no last-known-good to serve.
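
A minimal sketch of what such a callback channel might look like. The function names setProviderAlertCallback and reportProviderAlert and the three reason strings appear in this PR; the payload shape and the swallow-errors behavior are assumptions.

```typescript
type ProviderAlertReason = "rate_limit" | "unavailable" | "badge_unavailable";

// Assumed payload shape; the real one is not shown in the PR text.
interface ProviderAlert {
  provider: string;
  reason: ProviderAlertReason;
  status?: number;
}

type ProviderAlertCallback = (alert: ProviderAlert) => void;

let providerAlertCallback: ProviderAlertCallback | null = null;

// The host app (web / engine) registers a sink; core never imports Sentry.
function setProviderAlertCallback(cb: ProviderAlertCallback | null): void {
  providerAlertCallback = cb;
}

function reportProviderAlert(alert: ProviderAlert): void {
  if (providerAlertCallback === null) return; // nothing wired: no-op
  try {
    providerAlertCallback(alert);
  } catch {
    // Assumption: a failing alert sink must never break badge rendering.
  }
}
```

In the host app, the registered callback would forward each alert to Sentry.captureMessage, which keeps the core package free of a Sentry dependency.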

Correctness / hardening

  • GitHub 403 rate limits: GitHub signals primary limits as 403 with x-ratelimit-remaining: 0 (not only 429). Detect both so a rate limit is treated as transient (backoff + stale) instead of masquerading as "not found". handleUpstreamStatus now backs off on any 5xx.
  • A genuine GitHub 404 renders invalid repository (terminal error: cached briefly, never persisted as last-known-good, short cache headers — self-heals in ~60s).
  • Removed dead dependents-repo / dependents-pkg GitHub topics (registered but never implemented → would have alert-stormed).
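
The 403 rate-limit detection could look something like the sketch below. The function names and the `HeaderReader` interface are hypothetical (a standard `Headers` object satisfies it); only the rule itself, 429 or 403 with x-ratelimit-remaining: 0, is taken from the description.

```typescript
// Minimal header accessor; the built-in Headers class satisfies this shape.
interface HeaderReader {
  get(name: string): string | null;
}

// True when GitHub is signalling a rate limit, via 429 or via 403 with
// x-ratelimit-remaining: 0.
function isGitHubRateLimited(status: number, headers: HeaderReader): boolean {
  if (status === 429) return true;
  return status === 403 && headers.get("x-ratelimit-remaining") === "0";
}

type Verdict = "success" | "transient" | "definitive";

// Hypothetical classifier: rate limits and 5xx are transient (backoff + stale),
// 2xx is success, everything else is a definitive negative.
function classifyGitHubStatus(status: number, headers: HeaderReader): Verdict {
  if (isGitHubRateLimited(status, headers) || status >= 500) return "transient";
  if (status >= 200 && status < 300) return "success";
  return "definitive";
}
```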

Tests

  • tsc --noEmit clean on core + engine.
  • Core suite: 55 passing, including a new provider-fetch suite (5xx→stale, 404→null, 2xx→data) and cache-layer tests for the throw=transient / null=definitive contract, terminal-error non-persistence, and once-per-cycle rate-limit alerts.

Notes

  • degraded_stale (served stale on failure) stays a metric counter, not a Sentry issue — the badge still renders a value.
  • Possible follow-up: token-pool health (an empty/invalid pool is what triggers the unauthenticated rate-limit cascade), and bundling the resvg WASM so PNG rendering never depends on a runtime CDN fetch.

Commits

  1. GitHub badges resilient to transient failures (error headers + last-known-good).
  2. Sentry alerts on GitHub rate limits + "invalid repository" for 404s.
  3. Extend provider alerts to all providers (centralized in recordBackoff).
  4. Address review — invalid-repo caching as terminal error + remove dead topics.
  5. Make all API-backed badges resilient (throw=transient / null=definitive contract) + GitHub 403 rate-limit detection.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw

claude added 2 commits June 9, 2026 15:49
GitHub badges intermittently rendered a red "github not found" badge and
stayed broken for up to an hour. Two compounding causes:

1. Error/"not found" responses were served with the same long cache headers
   as successes (s-maxage=3600, stale-while-revalidate=86400), so a momentary
   GitHub blip (rate limit, backoff, network, empty token pool) got pinned at
   the CDN/browser for an hour even after upstream recovered.

2. The GitHub provider bypassed the cache layer, so a failed live fetch had no
   last-known-good value to fall back to and collapsed straight to "not found".

Changes:
- Add a short-lived ERROR_CACHE_HEADERS (max-age=60, no long SWR) and apply it
  to all error / "not found" responses (single badge, group, invalid url,
  internal error). Errors now self-heal on the next request.
- Add cachedFetchStale(): keeps a fresh copy plus a long-lived last-known-good
  copy, and serves the last-known-good value on fetch failure or while the
  provider is backed off.
- Route GitHub badge resolution through cachedFetchStale so a transient failure
  serves the previous good value instead of "not found".
- Group responses use the short error headers when any segment is unresolved.
- Add tests for the stale-on-error behavior.
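
For illustration, the header selection might be sketched like this. The max-age / s-maxage / stale-while-revalidate numbers are the ones quoted above; the exact header strings (including `public`), the `SUCCESS_CACHE_HEADERS` name, and the `cacheHeadersFor` helper are assumptions.

```typescript
type CacheHeaders = { "Cache-Control": string };

// Long-lived headers for successful badges (values quoted in this commit message).
const SUCCESS_CACHE_HEADERS: CacheHeaders = {
  "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
};

// Short-lived headers for errors / "not found": no long stale-while-revalidate.
const ERROR_CACHE_HEADERS: CacheHeaders = {
  "Cache-Control": "public, max-age=60",
};

// Hypothetical helper: a single badge is a one-element list; a group response
// falls back to the short headers as soon as any segment is unresolved (null).
function cacheHeadersFor(segments: Array<string | null>): CacheHeaders {
  return segments.some((s) => s === null) ? ERROR_CACHE_HEADERS : SUCCESS_CACHE_HEADERS;
}
```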

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw

Builds on the badge-reliability fixes with observability and a clearer 404 state.

Observability (Sentry):
- Add a provider-alert channel to core (setProviderAlertCallback /
  reportProviderAlert), mirroring the existing cache-metrics callback so
  @shieldcn/core stays dependency-free.
- GitHub provider fires a "rate_limit" (429) / "unavailable" (503) alert the
  moment the limit is tripped. It only fires on the request that trips it —
  once backed off, githubFetch short-circuits — so it stays ~one alert per
  backoff cycle instead of one per blocked request.
- cachedFetchStale fires a "badge_unavailable" alert when the upstream fails
  AND there is no last-known-good value to serve (i.e. the badge actually
  breaks), but stays silent when it can serve a stale value.
- Both web and engine wire these to Sentry.captureMessage (warning for rate
  limits, error otherwise). Stable per-reason messages group recurring alerts
  into a single issue; path/url ride along in tags/extra.
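
A rough sketch of the mapping from an alert to Sentry arguments, under stated assumptions: the `toSentryMessage` helper, the message wording, and the field names are hypothetical. Only the level rule (warning for rate limits, error otherwise), the stable per-reason message, and path/url travelling in tags/extra come from the text above.

```typescript
interface AlertInput {
  provider: string;
  reason: "rate_limit" | "unavailable" | "badge_unavailable";
  path?: string;
  url?: string;
}

interface SentryMessage {
  message: string;
  level: "warning" | "error";
  tags: Record<string, string>;
  extra: Record<string, string>;
}

function toSentryMessage(alert: AlertInput): SentryMessage {
  // Variable details go in extra so they do not split the issue grouping.
  const extra: Record<string, string> = {};
  if (alert.path !== undefined) extra["path"] = alert.path;
  if (alert.url !== undefined) extra["url"] = alert.url;
  return {
    // Stable per-reason message: no path or URL interpolated (wording assumed).
    message: `provider ${alert.reason}`,
    level: alert.reason === "rate_limit" ? "warning" : "error",
    tags: { provider: alert.provider, reason: alert.reason },
    extra,
  };
}
```

The host app would then pass the result along, for example as `Sentry.captureMessage(m.message, { level: m.level, tags: m.tags, extra: m.extra })`.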

UX:
- A genuine GitHub 404 now renders "invalid repository" instead of the
  generic "not found". On the failure path, resolveGitHubBadge does a cheap
  HEAD existence check (githubRepoExists) to distinguish a real bad/typo'd
  repo from a transient blip — the check only runs when a badge would
  otherwise have failed, and the result is cached so it doesn't re-probe.

Tests: cover badge_unavailable alerting and the no-alert-when-stale path.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
vercel bot commented Jun 9, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| shieldcn | Ready | Preview, Comment | Jun 9, 2026 4:19pm |

Comment thread packages/core/src/route-handler.ts
Comment thread packages/core/src/route-handler.ts Outdated
Centralize rate-limit / outage alerting in recordBackoff, the single
chokepoint every provider funnels 429/503 through (githubFetch,
handleUpstreamStatus, provider-fetch, and the cachedFetch* catch blocks).
Any provider now surfaces a Sentry "rate_limit" / "unavailable" alert when an
upstream starts failing — not just GitHub.

- recordBackoff(provider, status?) fires reportProviderAlert on the request
  that *starts* a backoff cycle (no active window), so it stays ~one alert
  per outage instead of one per blocked request.
- Thread the upstream status through handleUpstreamStatus and the cachedFetch
  / cachedFetchStale catch blocks so the alert carries 429 vs 503.
- Drop the now-redundant GitHub-local alert helper; recordBackoff covers it.
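
The once-per-cycle behavior can be sketched as below. This is an assumption-laden illustration: the real recordBackoff(provider, status?) reads the clock and calls reportProviderAlert itself, whereas here `now` and `alert` are passed in so the example is deterministic. The 60-second window length is also made up.

```typescript
type BackoffReason = "rate_limit" | "unavailable";

interface BackoffAlert {
  provider: string;
  reason: BackoffReason;
  status: number | undefined;
}

const BACKOFF_MS = 60000; // assumed window length
const backoffUntil = new Map<string, number>();

function recordBackoff(
  provider: string,
  status: number | undefined,
  now: number,
  alert: (a: BackoffAlert) => void,
): void {
  const until = backoffUntil.get(provider) ?? 0;
  // A window is already active: this request is merely blocked, stay silent.
  if (now < until) return;
  // No active window: this request starts a new backoff cycle, so alert once.
  backoffUntil.set(provider, now + BACKOFF_MS);
  alert({ provider, reason: status === 429 ? "rate_limit" : "unavailable", status });
}
```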

Test: a rate limit alerts once per cycle and not on subsequent blocked calls.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
@jal-co changed the title from "fix(badges): make GitHub badges resilient to transient failures" to "fix(badges): make GitHub badges reliable + surface provider failures in Sentry" on Jun 9, 2026
Two issues flagged in review:

1. A temporary 404 caused "invalid repository" to be cached as if it were a good value
   (fresh + 7-day stale), so it could be served as last-known-good and was
   pinned at the CDN with success headers. Fix: treat the "invalid
   repository" verdict as a terminal error — add BadgeData.error, mark the
   verdict with it, and teach cachedFetchStale (via opts.isError/errorTtl) to
   cache such results only briefly, never write them to the long-lived stale
   store, and never overwrite an existing good value. The route also serves
   error-flagged badges with the short ERROR_CACHE_HEADERS. Net: a repo that
   later appears self-heals in ~60s instead of being stuck for up to a week.

2. The GitHub topics dependents-repo / dependents-pkg were registered but
   never implemented, so every request fell through to null — which, after
   the new alerting, would fire a badge_unavailable alert (~every 5 min per
   badge) and run an extra HEAD probe. They have no provider function and no
   docs. Removed them from both knownTopics sets, the registry, and a stale
   provider doc comment.
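
The terminal-error handling from item 1 might be sketched as follows. BadgeData.error, opts.isError, and opts.errorTtl are named in this commit message; the `storeResult` helper, the Map-based stores, and the fresh TTL are assumptions made for the sake of a runnable example.

```typescript
interface BadgeData {
  label: string;
  value: string;
  error?: boolean; // marks a terminal-error verdict such as "invalid repository"
}

interface CacheEntry {
  data: BadgeData;
  expiresAt: number;
}

const freshStore = new Map<string, CacheEntry>();
const staleStore = new Map<string, CacheEntry>(); // long-lived last-known-good

const FRESH_TTL_MS = 5 * 60 * 1000; // assumption
const STALE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day last-known-good window
const DEFAULT_ERROR_TTL_MS = 60 * 1000; // "self-heals in ~60s"

interface StoreOptions {
  isError?: (data: BadgeData) => boolean;
  errorTtl?: number;
}

function storeResult(key: string, data: BadgeData, now: number, opts: StoreOptions = {}): void {
  const isError = opts.isError ?? ((d: BadgeData) => d.error === true);
  if (isError(data)) {
    // Terminal error: cache briefly and leave the last-known-good copy untouched.
    freshStore.set(key, { data, expiresAt: now + (opts.errorTtl ?? DEFAULT_ERROR_TTL_MS) });
    return;
  }
  freshStore.set(key, { data, expiresAt: now + FRESH_TTL_MS });
  staleStore.set(key, { data, expiresAt: now + STALE_TTL_MS });
}
```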

Test: a terminal-error result is returned but does not overwrite the
last-known-good store.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
@jal-co added the sentry label (Auto-created from a Sentry error) Jun 9, 2026
@jal-co merged commit 0df42fe into main Jun 9, 2026
2 of 3 checks passed
@jal-co deleted the claude/github-badge-reliability-fexzpg branch June 9, 2026 17:26
Labels

core, engine, sentry (Auto-created from a Sentry error)