fix(badges): make GitHub badges reliable + surface provider failures in Sentry by jal-co · Pull Request #122 · jal-co/shieldcn

jal-co · 2026-06-09T16:00:56Z

Problem

Badges that depend on a third-party API intermittently rendered a red "not found" and stayed broken — the single worst outcome for user trust. Root causes:

Errors were cached like successes → a momentary blip got pinned at the CDN for an hour.
No last-known-good fallback → a failed live fetch collapsed straight to "not found".
Only GitHub got hardened initially → every other API provider (npm, pypi, docker, crates, discord, vscode, …) still broke on a transient blip.
No observability and no distinction between a transient failure and a genuinely invalid resource.

Changes

Resilience — now across all ~44 API-backed providers

A single clear fetcher contract in cachedFetchStale:

throws → TRANSIENT (network / 429 / 5xx / parse) → serve last-known-good, back off, alert if nothing cached.
returns null → DEFINITIVE negative (upstream answered; resource absent) → short-lived "not found", never serve stale, never alert (a typo'd package can't alert-storm or show someone else's value).
returns value → success (cached fresh + 7-day last-known-good).
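The contract above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolveWithStale`, `Outcome`, and the in-memory `Map` are hypothetical stand-ins for `cachedFetchStale` and its stores, and TTLs and backoff are left out.

```typescript
// Hypothetical sketch of the throw / null / value contract.
// None of these names come from the PR; they only illustrate the three outcomes.
type Outcome<T> =
  | { kind: "fresh"; value: T }
  | { kind: "stale"; value: T }
  | { kind: "not_found" }
  | { kind: "unavailable" };

// Stand-in for the long-lived last-known-good store (no expiry in this sketch).
const lastKnownGood = new Map<string, unknown>();

async function resolveWithStale<T>(
  key: string,
  fetcher: () => Promise<T | null>,
  onUnavailable: (key: string) => void = () => {},
): Promise<Outcome<T>> {
  try {
    const value = await fetcher();
    if (value === null) {
      // Definitive negative: upstream answered and the resource is absent.
      // Do not serve stale and do not alert.
      return { kind: "not_found" };
    }
    // Success: remember the value as last-known-good.
    lastKnownGood.set(key, value);
    return { kind: "fresh", value };
  } catch {
    // Transient failure (network / 429 / 5xx / parse): prefer last-known-good.
    if (lastKnownGood.has(key)) {
      return { kind: "stale", value: lastKnownGood.get(key) as T };
    }
    // Nothing cached: the badge actually breaks, so this is the alert case.
    onUnavailable(key);
    return { kind: "unavailable" };
  }
}
```

The point of the split is that only the `catch` branch may touch the stale store or raise an alert, so a typo'd package name (a `null` result) can never surface a cached value or trigger alerting.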

Wired through the shared path so it covers everything at once:

provider-fetch (providerFetch/providerFetchText) — throw on 429/5xx, null on other non-2xx, 7-day stale window.
vscode (custom POST) moved onto the same contract.
github — githubFetch throws on transient / returns null only on definitive negatives, so its stale fallback actually triggers.
Short-lived ERROR_CACHE_HEADERS on every error/"not found" response; group responses use them when any segment is unresolved.
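As a rough sketch of the status handling described for provider-fetch (throw on 429/5xx, null on other non-2xx), assuming the HTTP call has already been made. `interpretStatus` and `TransientUpstreamError` are hypothetical names, not identifiers from the PR.

```typescript
// Hypothetical error type used to signal a transient upstream failure.
class TransientUpstreamError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`transient upstream failure (HTTP ${status})`);
    this.name = "TransientUpstreamError";
    this.status = status;
  }
}

// Maps an upstream HTTP status onto the throw / null / value contract.
// parseBody is only invoked for 2xx responses; if it throws (bad JSON, etc.),
// the error propagates and is therefore treated as transient by the caller.
function interpretStatus<T>(status: number, parseBody: () => T): T | null {
  if (status === 429 || status >= 500) {
    throw new TransientUpstreamError(status); // transient: back off, serve stale
  }
  if (status < 200 || status >= 300) {
    return null; // definitive negative, e.g. 404
  }
  return parseBody(); // success
}
```

Keeping the classification in one pure function makes the three cases in the new provider-fetch suite (5xx→stale, 404→null, 2xx→data) straightforward to test.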

Observability (Sentry)

Provider-alert channel in core (setProviderAlertCallback / reportProviderAlert), wired in web + engine to Sentry.captureMessage.
Rate limits / outages for every provider — centralized in recordBackoff (the chokepoint all providers funnel through), fired once per backoff cycle.
Badge actually breaks → badge_unavailable alert only when a transient failure has no last-known-good to serve.
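A minimal sketch of what a dependency-free alert channel could look like. The function names `setProviderAlertCallback` and `reportProviderAlert` and the three reason strings come from this PR; the payload shape, the `alertLevel` helper, and the swallow-errors behavior are assumptions made for illustration.

```typescript
type ProviderAlertReason = "rate_limit" | "unavailable" | "badge_unavailable";

// Assumed payload shape; the PR does not spell it out.
interface ProviderAlert {
  provider: string;
  reason: ProviderAlertReason;
  status?: number;
}

type ProviderAlertCallback = (alert: ProviderAlert) => void;

let providerAlertCallback: ProviderAlertCallback | null = null;

// The host (web / engine) registers a sink; core never imports Sentry itself.
function setProviderAlertCallback(cb: ProviderAlertCallback | null): void {
  providerAlertCallback = cb;
}

function reportProviderAlert(alert: ProviderAlert): void {
  if (!providerAlertCallback) return;
  try {
    providerAlertCallback(alert);
  } catch {
    // Assumption: a failing alert sink must never break badge rendering.
  }
}

// Hypothetical helper: warning for rate limits, error otherwise.
function alertLevel(reason: ProviderAlertReason): "warning" | "error" {
  return reason === "rate_limit" ? "warning" : "error";
}
```

In the host, the registered callback would presumably forward to `Sentry.captureMessage` with a stable per-reason message and the level from a helper like `alertLevel`, so recurring alerts group into one issue.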

Correctness / hardening

GitHub 403 rate limits: GitHub signals primary limits as 403 with x-ratelimit-remaining: 0 (not only 429). Detect both so a rate limit is treated as transient (backoff + stale) instead of masquerading as "not found". handleUpstreamStatus now backs off on any 5xx.
A genuine GitHub 404 renders invalid repository (terminal error: cached briefly, never persisted as last-known-good, short cache headers — self-heals in ~60s).
Removed dead dependents-repo / dependents-pkg GitHub topics (registered but never implemented → would have alert-stormed).
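The 403 rate-limit detection could look roughly like this. It is a sketch under assumptions: `isGitHubRateLimited`, `isTransientGitHubStatus`, and `HeaderReader` are hypothetical names, and a 403 without `x-ratelimit-remaining: 0` is simply treated as non-transient here.

```typescript
// Structural type so the sketch works with fetch's Headers or a test double.
interface HeaderReader {
  get(name: string): string | null;
}

// GitHub reports an exhausted primary rate limit as 429, or as 403 with
// x-ratelimit-remaining: 0. Both should count as a rate limit.
function isGitHubRateLimited(status: number, headers: HeaderReader): boolean {
  if (status === 429) return true;
  return status === 403 && headers.get("x-ratelimit-remaining") === "0";
}

// Transient means: back off and serve last-known-good instead of "not found".
function isTransientGitHubStatus(status: number, headers: HeaderReader): boolean {
  return isGitHubRateLimited(status, headers) || status >= 500;
}
```

With this in place, only responses that are neither rate-limited nor 5xx (for example a plain 404) fall through to the definitive-negative path.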

Tests

tsc --noEmit clean on core + engine.
Core suite: 55 passing, including a new provider-fetch suite (5xx→stale, 404→null, 2xx→data) and cache-layer tests for the throw=transient / null=definitive contract, terminal-error non-persistence, and once-per-cycle rate-limit alerts.

Notes

degraded_stale (served stale on failure) stays a metric counter, not a Sentry issue — the badge still renders a value.
Possible follow-up: token-pool health (an empty/invalid pool is what triggers the unauthenticated rate-limit cascade), and bundling the resvg WASM so PNG rendering never depends on a runtime CDN fetch.

Commits

GitHub badges resilient to transient failures (error headers + last-known-good).
Sentry alerts on GitHub rate limits + "invalid repository" for 404s.
Extend provider alerts to all providers (centralized in recordBackoff).
Address review — invalid-repo caching as terminal error + remove dead topics.
Make all API-backed badges resilient (throw=transient / null=definitive contract) + GitHub 403 rate-limit detection.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw

GitHub badges intermittently rendered a red "github not found" badge and stayed broken for up to an hour. Two compounding causes:

1. Error/"not found" responses were served with the same long cache headers as successes (s-maxage=3600, stale-while-revalidate=86400), so a momentary GitHub blip (rate limit, backoff, network, empty token pool) got pinned at the CDN/browser for an hour even after upstream recovered.
2. The GitHub provider bypassed the cache layer, so a failed live fetch had no last-known-good value to fall back to and collapsed straight to "not found".

Changes:

- Add a short-lived ERROR_CACHE_HEADERS (max-age=60, no long SWR) and apply it to all error / "not found" responses (single badge, group, invalid url, internal error). Errors now self-heal on the next request.
- Add cachedFetchStale(): keeps a fresh copy plus a long-lived last-known-good copy, and serves the last-known-good value on fetch failure or while the provider is backed off.
- Route GitHub badge resolution through cachedFetchStale so a transient failure serves the previous good value instead of "not found".
- Group responses use the short error headers when any segment is unresolved.
- Add tests for the stale-on-error behavior.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
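A sketch of the header selection this commit describes. The directive values are taken from the commit message; the exact header strings and the names `SUCCESS_CACHE_HEADERS`, `Segment`, and `cacheHeadersFor` are assumptions.

```typescript
// Long-lived headers for successful badges (values from the commit message).
const SUCCESS_CACHE_HEADERS: Record<string, string> = {
  "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
};

// Short-lived headers for error / "not found" responses: no long SWR window,
// so a blip cannot stay pinned at the CDN once upstream recovers.
const ERROR_CACHE_HEADERS: Record<string, string> = {
  "Cache-Control": "public, max-age=60",
};

// Hypothetical shape: one entry per badge segment in a group response.
interface Segment {
  resolved: boolean;
}

// A single badge is a group of one. Any unresolved segment (or an empty group,
// by assumption) downgrades the whole response to the short error headers.
function cacheHeadersFor(segments: Segment[]): Record<string, string> {
  const allResolved = segments.length > 0 && segments.every((s) => s.resolved);
  return allResolved ? SUCCESS_CACHE_HEADERS : ERROR_CACHE_HEADERS;
}
```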

Builds on the badge-reliability fixes with observability and a clearer 404 state.

Observability (Sentry):

- Add a provider-alert channel to core (setProviderAlertCallback / reportProviderAlert), mirroring the existing cache-metrics callback so @shieldcn/core stays dependency-free.
- GitHub provider fires a "rate_limit" (429) / "unavailable" (503) alert the moment the limit is tripped. It only fires on the request that trips it (once backed off, githubFetch short-circuits), so it stays ~one alert per backoff cycle instead of one per blocked request.
- cachedFetchStale fires a "badge_unavailable" alert when the upstream fails AND there is no last-known-good value to serve (i.e. the badge actually breaks), but stays silent when it can serve a stale value.
- Both web and engine wire these to Sentry.captureMessage (warning for rate limits, error otherwise). Stable per-reason messages group recurring alerts into a single issue; path/url ride along in tags/extra.

UX:

- A genuine GitHub 404 now renders "invalid repository" instead of the generic "not found". On the failure path, resolveGitHubBadge does a cheap HEAD existence check (githubRepoExists) to distinguish a real bad/typo'd repo from a transient blip. The check only runs when a badge would otherwise have failed, and the result is cached so it doesn't re-probe.

Tests: cover badge_unavailable alerting and the no-alert-when-stale path.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw

vercel · 2026-06-09T16:01:03Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
shieldcn	Ready	Preview, Comment	Jun 9, 2026 4:19pm

Centralize rate-limit / outage alerting in recordBackoff, the single chokepoint every provider funnels 429/503 through (githubFetch, handleUpstreamStatus, provider-fetch, and the cachedFetch* catch blocks). Any provider now surfaces a Sentry "rate_limit" / "unavailable" alert when an upstream starts failing, not just GitHub.

- recordBackoff(provider, status?) fires reportProviderAlert on the request that *starts* a backoff cycle (no active window), so it stays ~one alert per outage instead of one per blocked request.
- Thread the upstream status through handleUpstreamStatus and the cachedFetch / cachedFetchStale catch blocks so the alert carries 429 vs 503.
- Drop the now-redundant GitHub-local alert helper; recordBackoff covers it.

Test: a rate limit alerts once per cycle and not on subsequent blocked calls.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
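The once-per-cycle behavior can be sketched as below. Only the name `recordBackoff` and its `(provider, status?)` parameters come from the commit message; the factory, the injected clock, the fixed window, and the choice not to extend an active window are assumptions for illustration.

```typescript
type BackoffAlertReason = "rate_limit" | "unavailable";
type BackoffAlertFn = (provider: string, reason: BackoffAlertReason, status?: number) => void;

// Hypothetical factory so the window length and alert sink can be injected.
function createBackoffTracker(windowMs: number, alert: BackoffAlertFn) {
  const backoffUntil = new Map<string, number>();

  function isBackedOff(provider: string, now: number = Date.now()): boolean {
    return (backoffUntil.get(provider) ?? 0) > now;
  }

  function recordBackoff(provider: string, status?: number, now: number = Date.now()): void {
    // An active window means this cycle has already alerted: stay silent.
    if (isBackedOff(provider, now)) return;
    // This request starts a new cycle: open the window and alert once.
    backoffUntil.set(provider, now + windowMs);
    alert(provider, status === 429 ? "rate_limit" : "unavailable", status);
  }

  return { isBackedOff, recordBackoff };
}
```

Because every provider path calls the same function, the alert volume is bounded by the number of backoff cycles rather than by request traffic.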

Two issues flagged in review:

1. A temporary 404 cached "invalid repository" as if it were a good value (fresh + 7-day stale), so it could be served as last-known-good and was pinned at the CDN with success headers. Fix: treat the "invalid repository" verdict as a terminal error: add BadgeData.error, mark the verdict with it, and teach cachedFetchStale (via opts.isError/errorTtl) to cache such results only briefly, never write them to the long-lived stale store, and never overwrite an existing good value. The route also serves error-flagged badges with the short ERROR_CACHE_HEADERS. Net: a repo that later appears self-heals in ~60s instead of being stuck for up to a week.
2. The GitHub topics dependents-repo / dependents-pkg were registered but never implemented, so every request fell through to null, which, after the new alerting, would fire a badge_unavailable alert (~every 5 min per badge) and run an extra HEAD probe. They have no provider function and no docs. Removed them from both knownTopics sets, the registry, and a stale provider doc comment.

Test: a terminal-error result is returned but does not overwrite the last-known-good store.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
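A sketch of the terminal-error rule from item 1: error-flagged results are cached briefly and never reach the last-known-good tier. `BadgeData.error`, `isError`, and `errorTtl` are named in the commit message, but their types, the two `Map` stores, and `storeResult` are assumptions.

```typescript
interface BadgeData {
  label: string;
  value: string;
  error?: boolean; // assumed to be a boolean flag; the PR only names the field
}

interface CacheEntry {
  data: BadgeData;
  expiresAt: number;
}

// Assumed option names, loosely following opts.isError / errorTtl.
interface StoreOptions {
  ttlMs: number;
  staleTtlMs: number;
  errorTtlMs: number;
  isError: (data: BadgeData) => boolean;
}

const freshStore = new Map<string, CacheEntry>();
const staleStore = new Map<string, CacheEntry>(); // last-known-good tier

function storeResult(key: string, data: BadgeData, opts: StoreOptions, now: number): void {
  if (opts.isError(data)) {
    // Terminal error: cache briefly so repeated requests do not re-probe,
    // but leave any existing last-known-good value untouched.
    freshStore.set(key, { data, expiresAt: now + opts.errorTtlMs });
    return;
  }
  // Good value: refresh both tiers.
  freshStore.set(key, { data, expiresAt: now + opts.ttlMs });
  staleStore.set(key, { data, expiresAt: now + opts.staleTtlMs });
}
```

With a 60-second error TTL, a repository that starts existing is picked up on the first fetch after the error entry expires, which matches the ~60s self-heal described above.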

claude added 2 commits June 9, 2026 15:49

github-actions Bot added engine core labels Jun 9, 2026

sentry Bot reviewed Jun 9, 2026


Comment thread packages/core/src/route-handler.ts

Comment thread packages/core/src/route-handler.ts Outdated

jal-co changed the title ~~fix(badges): make GitHub badges resilient to transient failures~~ fix(badges): make GitHub badges reliable + surface provider failures in Sentry Jun 9, 2026

vercel Bot deployed to Preview June 9, 2026 16:05 View deployment

vercel Bot deployed to Preview June 9, 2026 16:19 View deployment

jal-co added the sentry label (Auto-created from a Sentry error) Jun 9, 2026

jal-co merged commit 0df42fe into main Jun 9, 2026
2 of 3 checks passed

jal-co deleted the claude/github-badge-reliability-fexzpg branch June 9, 2026 17:26

jal-co merged 4 commits into main from claude/github-badge-reliability-fexzpg
