fix(badges): make GitHub badges reliable + surface provider failures in Sentry #122
Merged
GitHub badges intermittently rendered a red "github not found" badge and stayed broken for up to an hour. Two compounding causes:

1. Error/"not found" responses were served with the same long cache headers as successes (s-maxage=3600, stale-while-revalidate=86400), so a momentary GitHub blip (rate limit, backoff, network, empty token pool) got pinned at the CDN/browser for an hour even after upstream recovered.
2. The GitHub provider bypassed the cache layer, so a failed live fetch had no last-known-good value to fall back to and collapsed straight to "not found".

Changes:

- Add a short-lived ERROR_CACHE_HEADERS (max-age=60, no long SWR) and apply it to all error / "not found" responses (single badge, group, invalid url, internal error). Errors now self-heal on the next request.
- Add cachedFetchStale(): keeps a fresh copy plus a long-lived last-known-good copy, and serves the last-known-good value on fetch failure or while the provider is backed off.
- Route GitHub badge resolution through cachedFetchStale so a transient failure serves the previous good value instead of "not found".
- Group responses use the short error headers when any segment is unresolved.
- Add tests for the stale-on-error behavior.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
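The stale-on-error behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `cachedFetchStaleSketch`, the in-memory `Map` stores, the injectable clock, and the 5-minute fresh window are all assumptions. Only the idea of a fresh copy plus a long-lived last-known-good copy comes from the commit message.

```typescript
// Hypothetical sketch: store layout, names, and TTL values are assumptions.
type Entry = { value: unknown; expiresAt: number };

const fresh = new Map<string, Entry>();
const lastKnownGood = new Map<string, Entry>();

const FRESH_TTL_MS = 5 * 60 * 1000; // assumed fresh window
const STALE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // assumed last-known-good window

async function cachedFetchStaleSketch<T>(
  key: string,
  fetcher: () => Promise<T | null>,
  now: () => number = Date.now,
): Promise<T | null> {
  // Serve the fresh copy while it is still valid.
  const hit = fresh.get(key);
  if (hit && hit.expiresAt > now()) return hit.value as T;

  try {
    const value = await fetcher();
    if (value !== null) {
      // A good value refreshes both the short-lived and the long-lived copy.
      fresh.set(key, { value, expiresAt: now() + FRESH_TTL_MS });
      lastKnownGood.set(key, { value, expiresAt: now() + STALE_TTL_MS });
    }
    return value;
  } catch {
    // The fetch failed: fall back to the last-known-good value if one exists.
    const stale = lastKnownGood.get(key);
    return stale && stale.expiresAt > now() ? (stale.value as T) : null;
  }
}
```

With this shape, a badge only collapses to "not found" when the fetch fails and no earlier good value was ever stored for that key.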
Builds on the badge-reliability fixes with observability and a clearer 404 state.

Observability (Sentry):

- Add a provider-alert channel to core (setProviderAlertCallback / reportProviderAlert), mirroring the existing cache-metrics callback so @shieldcn/core stays dependency-free.
- GitHub provider fires a "rate_limit" (429) / "unavailable" (503) alert the moment the limit is tripped. It only fires on the request that trips it (once backed off, githubFetch short-circuits), so it stays ~one alert per backoff cycle instead of one per blocked request.
- cachedFetchStale fires a "badge_unavailable" alert when the upstream fails AND there is no last-known-good value to serve (i.e. the badge actually breaks), but stays silent when it can serve a stale value.
- Both web and engine wire these to Sentry.captureMessage (warning for rate limits, error otherwise). Stable per-reason messages group recurring alerts into a single issue; path/url ride along in tags/extra.

UX:

- A genuine GitHub 404 now renders "invalid repository" instead of the generic "not found". On the failure path, resolveGitHubBadge does a cheap HEAD existence check (githubRepoExists) to distinguish a real bad/typo'd repo from a transient blip. The check only runs when a badge would otherwise have failed, and the result is cached so it doesn't re-probe.

Tests: cover badge_unavailable alerting and the no-alert-when-stale path.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
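One way the callback-based alert channel could look is sketched below. The function names `setProviderAlertCallback` and `reportProviderAlert` and the three reason strings come from the commit message; the payload shape, the `wireAlerts` helper, the message format, and the `CaptureFn` stand-in for `Sentry.captureMessage` are assumptions made for illustration.

```typescript
// Hypothetical shapes: the real signatures in @shieldcn/core are not shown here.
type ProviderAlertReason = "rate_limit" | "unavailable" | "badge_unavailable";

interface ProviderAlert {
  provider: string;
  reason: ProviderAlertReason;
  status?: number;
  path?: string;
}

type ProviderAlertCallback = (alert: ProviderAlert) => void;

let alertCallback: ProviderAlertCallback | null = null;

// Core exposes only a callback slot, so it needs no Sentry dependency.
function setProviderAlertCallback(cb: ProviderAlertCallback | null): void {
  alertCallback = cb;
}

function reportProviderAlert(alert: ProviderAlert): void {
  if (!alertCallback) return;
  try {
    alertCallback(alert);
  } catch {
    // Alerting must never break badge rendering.
  }
}

// Assumed stand-in for the host's Sentry.captureMessage call.
type CaptureFn = (
  message: string,
  level: "warning" | "error",
  tags: Record<string, string>,
) => void;

// Host-side wiring (web or engine): warning for rate limits, error otherwise.
function wireAlerts(capture: CaptureFn): void {
  setProviderAlertCallback((alert) => {
    const level = alert.reason === "rate_limit" ? "warning" : "error";
    const tags: Record<string, string> = { provider: alert.provider };
    if (alert.path) tags.path = alert.path;
    // One stable message per reason, so recurring alerts group into one issue.
    capture(`provider alert: ${alert.reason}`, level, tags);
  });
}
```

Keeping the variable parts (path, provider) in tags rather than in the message is what lets recurring alerts collapse into a single issue.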
Centralize rate-limit / outage alerting in recordBackoff, the single chokepoint every provider funnels 429/503 through (githubFetch, handleUpstreamStatus, provider-fetch, and the cachedFetch* catch blocks). Any provider now surfaces a Sentry "rate_limit" / "unavailable" alert when an upstream starts failing, not just GitHub.

- recordBackoff(provider, status?) fires reportProviderAlert on the request that *starts* a backoff cycle (no active window), so it stays ~one alert per outage instead of one per blocked request.
- Thread the upstream status through handleUpstreamStatus and the cachedFetch / cachedFetchStale catch blocks so the alert carries 429 vs 503.
- Drop the now-redundant GitHub-local alert helper; recordBackoff covers it.

Test: a rate limit alerts once per cycle and not on subsequent blocked calls.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
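The once-per-cycle rule can be illustrated with a small sketch. It assumes a fixed 60-second backoff window, an explicit `now` argument, and an in-memory `alerts` array standing in for reportProviderAlert. None of these details are confirmed by the commit; only the rule "alert on the request that starts a backoff cycle" is.

```typescript
// Hypothetical sketch: window length, state shape, and alert sink are assumptions.
type BackoffAlertReason = "rate_limit" | "unavailable";

interface BackoffAlert {
  provider: string;
  reason: BackoffAlertReason;
  status: number | undefined;
}

const BACKOFF_MS = 60 * 1000; // assumed backoff window
const backoffUntil = new Map<string, number>();
const alerts: BackoffAlert[] = []; // stand-in for reportProviderAlert

function isBackedOff(provider: string, now: number): boolean {
  return (backoffUntil.get(provider) ?? 0) > now;
}

function recordBackoffSketch(
  provider: string,
  status: number | undefined,
  now: number,
): void {
  // Already inside a window: this request did not start the cycle, so stay quiet.
  if (isBackedOff(provider, now)) return;

  // No active window: start one and alert exactly once for this cycle.
  backoffUntil.set(provider, now + BACKOFF_MS);
  alerts.push({
    provider,
    reason: status === 429 ? "rate_limit" : "unavailable",
    status,
  });
}
```

Whether a repeated failure inside an active window should extend that window is a separate design choice; this sketch leaves the window untouched.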
Two issues flagged in review:

1. A temporary 404 cached "invalid repository" as if it were a good value (fresh + 7-day stale), so it could be served as last-known-good and was pinned at the CDN with success headers. Fix: treat the "invalid repository" verdict as a terminal error. Add BadgeData.error, mark the verdict with it, and teach cachedFetchStale (via opts.isError/errorTtl) to cache such results only briefly, never write them to the long-lived stale store, and never overwrite an existing good value. The route also serves error-flagged badges with the short ERROR_CACHE_HEADERS. Net: a repo that later appears self-heals in ~60s instead of being stuck for up to a week.
2. The GitHub topics dependents-repo / dependents-pkg were registered but never implemented, so every request fell through to null, which, after the new alerting, would fire a badge_unavailable alert (~every 5 min per badge) and run an extra HEAD probe. They have no provider function and no docs. Removed them from both knownTopics sets, the registry, and a stale provider doc comment.

Test: a terminal-error result is returned but does not overwrite the last-known-good store.

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
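The terminal-error handling in item 1 might be sketched like this. `BadgeData.error` and the name `ERROR_CACHE_HEADERS` come from the commit message; the store layout, the `storeResult` and `cacheHeadersFor` helpers, the TTL values, and the exact header strings (including the `public` directive) are assumptions for illustration.

```typescript
// Hypothetical sketch: only BadgeData.error and ERROR_CACHE_HEADERS are named in the PR.
interface BadgeData {
  label: string;
  value: string;
  error?: boolean;
}

type StoredBadge = { badge: BadgeData; expiresAt: number };

const freshStore = new Map<string, StoredBadge>();
const staleStore = new Map<string, StoredBadge>();

const GOOD_TTL_MS = 5 * 60 * 1000; // assumed
const ERROR_TTL_MS = 60 * 1000; // assumed, matching the ~60s self-heal
const LAST_GOOD_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day stale window

function storeResult(
  key: string,
  badge: BadgeData,
  now: number,
  isError: (b: BadgeData) => boolean = (b) => b.error === true,
): void {
  if (isError(badge)) {
    // Terminal error: cache briefly and leave the last-known-good copy alone.
    freshStore.set(key, { badge, expiresAt: now + ERROR_TTL_MS });
    return;
  }
  freshStore.set(key, { badge, expiresAt: now + GOOD_TTL_MS });
  staleStore.set(key, { badge, expiresAt: now + LAST_GOOD_TTL_MS });
}

// Header values are illustrative; the PR only states the max-age/SWR figures.
const SUCCESS_CACHE_HEADERS = {
  "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
};
const ERROR_CACHE_HEADERS = { "Cache-Control": "public, max-age=60" };

function cacheHeadersFor(badge: BadgeData | null): { "Cache-Control": string } {
  // Unresolved or error-flagged badges get the short-lived headers.
  return badge === null || badge.error === true
    ? ERROR_CACHE_HEADERS
    : SUCCESS_CACHE_HEADERS;
}
```

The point of the early return is that an "invalid repository" verdict never reaches the 7-day store, so it cannot later be served as a last-known-good value.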
Problem
Badges that depend on a third-party API intermittently rendered a red "not found" and stayed broken — the single worst outcome for user trust. Root causes:
Changes
Resilience (now across all ~44 API-backed providers)

A single clear fetcher contract in cachedFetchStale:

- throw → TRANSIENT failure → serve the last-known-good value if there is one; alert only when there is none.
- null → DEFINITIVE negative (upstream answered; resource absent) → short-lived "not found", never serve stale, never alert (a typo'd package can't alert-storm or show someone else's value).

Wired through the shared path so it covers everything at once:

- provider-fetch (providerFetch / providerFetchText): throw on 429/5xx, null on other non-2xx, 7-day stale window.
- vscode (custom POST) moved onto the same contract.
- github: githubFetch throws on transient failures and returns null only on definitive negatives, so its stale fallback actually triggers.
- ERROR_CACHE_HEADERS on every error/"not found" response; group responses use them when any segment is unresolved.

Observability (Sentry)

- Provider-alert channel in core (setProviderAlertCallback / reportProviderAlert), wired in web + engine to Sentry.captureMessage.
- Rate-limit / outage alerts centralized in recordBackoff (the chokepoint all providers funnel through), fired once per backoff cycle.
- badge_unavailable alert only when a transient failure has no last-known-good to serve.

Correctness / hardening

- GitHub can signal a rate limit as 403 with x-ratelimit-remaining: 0 (not only 429). Detect both so a rate limit is treated as transient (backoff + stale) instead of masquerading as "not found".
- handleUpstreamStatus now backs off on any 5xx.
- A genuine GitHub 404 renders invalid repository (terminal error: cached briefly, never persisted as last-known-good, short cache headers; self-heals in ~60s).
- Removed the dependents-repo / dependents-pkg GitHub topics (registered but never implemented → would have alert-stormed).

Tests

- tsc --noEmit clean on core + engine.
- provider-fetch suite (5xx→stale, 404→null, 2xx→data) and cache-layer tests for the throw=transient / null=definitive contract, terminal-error non-persistence, and once-per-cycle rate-limit alerts.

Notes

- degraded_stale (served stale on failure) stays a metric counter, not a Sentry issue; the badge still renders a value.

Commits

recordBackoff).

https://claude.ai/code/session_01B5fH9ZzdStYYtWcF6QNpgw
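The throw = transient / null = definitive contract, together with the 403 rate-limit detection, can be sketched as a small response classifier. The `TransientUpstreamError` class, the `UpstreamResponse` shape, and the function names are hypothetical; the status rules (429, 403 with x-ratelimit-remaining: 0, any 5xx, other non-2xx) follow the description above.

```typescript
// Hypothetical helper: class, interface, and function names are illustrative.
class TransientUpstreamError extends Error {
  status: number;
  constructor(status: number) {
    super(`transient upstream failure (${status})`);
    this.name = "TransientUpstreamError";
    this.status = status;
  }
}

interface UpstreamResponse<T> {
  status: number;
  headers: Record<string, string>; // assumed to use lower-cased header names
  body: T;
}

function isRateLimited<T>(res: UpstreamResponse<T>): boolean {
  if (res.status === 429) return true;
  // A 403 counts as a rate limit only when the remaining quota is zero.
  return res.status === 403 && res.headers["x-ratelimit-remaining"] === "0";
}

function interpretResponse<T>(res: UpstreamResponse<T>): T | null {
  // Transient: throw, so the cache layer can back off and serve stale.
  if (isRateLimited(res) || res.status >= 500) {
    throw new TransientUpstreamError(res.status);
  }
  // Definitive negative: the upstream answered and the resource is absent.
  if (res.status < 200 || res.status >= 300) return null;
  return res.body;
}
```

Under this sketch a plain 403 or 404 yields null (a short-lived "not found"), while a 403 with an exhausted quota throws, so the stale fallback has a chance to trigger.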