Fix goroutine and heap leak in Worker.Process by PaNaVTEC · Pull Request #405 · Nitro/lazyraster

PaNaVTEC · 2026-06-15T15:17:48Z

Problem

Worker.Process fetches the document in a goroutine and passes the payload back over a channel:

chanPayload := make(chan []byte)   // unbuffered
chanError := make(chan error)      // unbuffered
go func() {
    payload, err := w.fetchFile(ctx, path)
    ...
    chanPayload <- payload          // blocks until the caller receives
}()

The channels were unbuffered, so the goroutine blocks on the send until someone receives. On the png path, Process has two early returns before the select that drains the channel:

extractToken fails (signed URL without a token query param), or
fetchAnnotations fails — this calls Redis (FetchAnnotation) on every png request, so any Redis hiccup/timeout or a malformed stored annotation triggers it.

When either fires, the fetch goroutine finishes fetchFile and then blocks forever on the send — nobody will ever receive, and the goroutine doesn't select on ctx.Done(), so request cancellation can't free it. Each leak strands one goroutine and the full PDF payload it holds, until the pod restarts.

Impact

In ec1 prod this is the sawtooth in the runtime metrics: live goroutines climbing ~200→600 and heap-live-size ~160→256 MiB linearly across the month, both snapping back to baseline on each deploy. The html path is unaffected (no early return before its select).

Fix

Buffer both channels with capacity 1 so the goroutine can always complete its single send and exit, even when the caller returns early and abandons it. On an early return the buffered channel and its payload simply go out of scope and are GC'd. One-line change on each channel.

Test

Added TestWorkerProcessNoGoroutineLeak, which drives the fetchAnnotations-error path on the png format 50× and asserts the goroutine count returns to baseline. Verified it fails against the unbuffered version (baseline 2 → 52 stuck goroutines) and passes with the fix. Full suite passes under -race.

Follow-up (not in this PR)

This leak is also the reason the eu-prod pods look heavily overprovisioned — worth right-sizing lazyraster-deploy resource requests once this has baked, since the memory ramp was partly this leak.

🤖 Generated with Claude Code

Process fetches the file in a goroutine that sends the payload over a channel. The channels were unbuffered, so the goroutine blocked on the send until the caller received. On the png path, Process can return early (invalid token, or a fetchAnnotations failure) before reaching the select that drains the channel. When that happened the fetch goroutine blocked forever on the send and never observed ctx cancellation, leaking the goroutine and the full PDF payload it held in memory. In ec1 prod this showed up as live goroutines and heap-live-size ramping linearly between deploys and resetting on each restart. fetchAnnotations hits Redis on every png request, so intermittent Redis errors (or a malformed stored annotation) steadily accumulated leaked goroutines. Buffer both channels with capacity 1 so the goroutine can always complete its single send and exit even when the caller abandons it. Add a regression test that drives the fetchAnnotations-error path and asserts goroutines return to baseline; it leaks ~50 goroutines without the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

diegobernardes approved these changes Jun 15, 2026

View reviewed changes

PaNaVTEC merged commit 95a6e96 into main Jun 15, 2026
2 checks passed

PaNaVTEC deleted the fix/process-goroutine-leak branch June 15, 2026 15:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix goroutine and heap leak in Worker.Process#405

Fix goroutine and heap leak in Worker.Process#405
PaNaVTEC merged 1 commit into
mainfrom
fix/process-goroutine-leak

PaNaVTEC commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

PaNaVTEC commented Jun 15, 2026

Problem

Impact

Fix

Test

Follow-up (not in this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants