Fix goroutine and heap leak in Worker.Process#405
Merged
Conversation
Process fetches the file in a goroutine that sends the payload over a channel. The channels were unbuffered, so the goroutine blocked on the send until the caller received. On the png path, Process can return early (invalid token, or a fetchAnnotations failure) before reaching the select that drains the channel. When that happened the fetch goroutine blocked forever on the send and never observed ctx cancellation, leaking the goroutine and the full PDF payload it held in memory. In ec1 prod this showed up as live goroutines and heap-live-size ramping linearly between deploys and resetting on each restart. fetchAnnotations hits Redis on every png request, so intermittent Redis errors (or a malformed stored annotation) steadily accumulated leaked goroutines. Buffer both channels with capacity 1 so the goroutine can always complete its single send and exit even when the caller abandons it. Add a regression test that drives the fetchAnnotations-error path and asserts goroutines return to baseline; it leaks ~50 goroutines without the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diegobernardes
approved these changes
Jun 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Worker.Processfetches the document in a goroutine and passes the payload back over a channel:The channels were unbuffered, so the goroutine blocks on the send until someone receives. On the
pngpath,Processhas two early returns before theselectthat drains the channel:extractTokenfails (signed URL without atokenquery param), orfetchAnnotationsfails — this calls Redis (FetchAnnotation) on every png request, so any Redis hiccup/timeout or a malformed stored annotation triggers it.When either fires, the fetch goroutine finishes
fetchFileand then blocks forever on the send — nobody will ever receive, and the goroutine doesn't select onctx.Done(), so request cancellation can't free it. Each leak strands one goroutine and the full PDF payload it holds, until the pod restarts.Impact
In ec1 prod this is the sawtooth in the runtime metrics: live goroutines climbing ~200→600 and heap-live-size ~160→256 MiB linearly across the month, both snapping back to baseline on each deploy. The
htmlpath is unaffected (no early return before itsselect).Fix
Buffer both channels with capacity 1 so the goroutine can always complete its single send and exit, even when the caller returns early and abandons it. On an early return the buffered channel and its payload simply go out of scope and are GC'd. One-line change on each channel.
Test
Added
TestWorkerProcessNoGoroutineLeak, which drives thefetchAnnotations-error path on the png format 50× and asserts the goroutine count returns to baseline. Verified it fails against the unbuffered version (baseline 2 → 52 stuck goroutines) and passes with the fix. Full suite passes under-race.Follow-up (not in this PR)
This leak is also the reason the eu-prod pods look heavily overprovisioned — worth right-sizing
lazyraster-deployresource requests once this has baked, since the memory ramp was partly this leak.🤖 Generated with Claude Code