
bug: StackbiltCloudExporter silently drops spans in low-volume Workers (double-buffering + ephemeral isolates) #7

@stackbilt-admin

Description

Summary

StackbiltCloudExporter buffers spans/metrics/logs internally and only POSTs to the ingest endpoint once the buffer reaches 100 items or 50KB. A typical low-volume Worker rarely hits that threshold before the isolate is evicted, so the buffered signals are silently dropped when the isolate dies. Callers invoking tracer.flush() / metrics.flush() in ctx.waitUntil() believe they're forcing a real flush, but those methods only drain their own buffers into the exporter; they never trigger the POST.

Discovered while dogfooding on Stackbilt-dev/tarotscript (issue Stackbilt-dev/tarotscript#163).

Repro / evidence

  • tarotscript-worker was instrumented per the README pattern: root span per request in a middleware, obs.tracer.flush() + obs.metrics.flush() in waitUntil.
  • Over ~14 hours of live traffic, the dashboard showed only 5 traces total, all from a single early burst. The dashboard and the underlying D1 agreed perfectly — the data just wasn't arriving.
  • After patching the worker to call the underlying exporter.flush() directly, traces immediately began flowing on every request (32 traces, last-seen 3m ago, real p50/p95/p99 populating within minutes).

Root cause

Two layers of buffering, and tracer.flush() only drains the first:

Layer 1: Tracer.buffer in src/tracing.ts. Tracer.flush() snapshots the buffer and calls this.options.export.export(spans):

```ts
// src/tracing.ts:233
async flush(): Promise<void> {
  if (this.buffer.length === 0) return;
  const spans = [...this.buffer];
  this.buffer = [];
  if (this.options.export) {
    await this.options.export.export(spans); // <-- hands off to Layer 2
  }
}
```

Layer 2: StackbiltCloudExporter.spans in src/stackbilt-exporter.ts. export() pushes into its own buffer and calls maybeFlush(), which gates on a batch threshold:

```ts
// src/stackbilt-exporter.ts:100
async export(items: MetricPoint[] | TraceSpan[]): Promise<void> {
  if (items.length === 0) return;
  if ('traceId' in items[0] && 'spanId' in items[0]) {
    this.spans.push(...(items as TraceSpan[])); // <-- buffered
  } else {
    this.metrics.push(...(items as MetricPoint[]));
  }
  await this.maybeFlush(); // <-- gated, not forced
}

// src/stackbilt-exporter.ts:138
private async maybeFlush(): Promise<void> {
  if (Date.now() < this.backoffUntil) return;
  const totalItems = this.metrics.length + this.spans.length + this.logs.length + this.alerts.length;
  if (totalItems === 0) return;
  if (totalItems < this.maxBatchSize) { // default 100
    const bytes = this.estimateBytes();
    if (bytes < this.maxBatchBytes) return; // default 50KB
  }
  await this.flush();
}
```

A typical tarotscript scaffold-cast request produces ~5 spans + ~3 metric points = ~8 items per request. Reaching 100 items requires ~12 concurrent-ish requests in the same isolate. Between bursts, Cloudflare evicts the idle isolate and the exporter buffer dies with it.
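Made concrete (the ~8 items per request and the 100-item default come from the numbers above; the helper name is mine, not the package's):

```typescript
// Hypothetical helper: how many requests must land in one isolate before
// maybeFlush()'s item-count gate opens, ignoring the byte threshold.
function requestsUntilFlush(itemsPerRequest: number, maxBatchSize: number): number {
  return Math.ceil(maxBatchSize / itemsPerRequest);
}

// ~8 items per scaffold-cast request against the default 100-item gate:
// 12 requests buffer only 96 items, so nothing can POST before the 13th.
console.log(requestsUntilFlush(8, 100)); // 13
```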

The public exporter.flush() at stackbilt-exporter.ts:129 is the method that actually POSTs, but it's not reachable from the return value of createMonitoring() — the exporter is only referenced internally by the tracer and metrics collector.

Why batching is the wrong default for Workers

Traditional exporters batch because network round-trips are expensive and long-running processes have time to fill a buffer between flushes. Workers invert both of those assumptions:

  • Isolates are ephemeral. There is no "next request" guarantee — you get one shot to flush before eviction.
  • setInterval-based auto-flush doesn't work reliably (timers don't fire while the isolate is idle).
  • Workers already amortize HTTP round-trips via subrequest budgets; one POST-per-request is fine for the volume this package targets.
  • Cost is already bounded by the dashboard's per-worker cap (the 403 backoff path) — batching isn't needed for cost protection.

Proposed fixes

Option A (preferred) — remove exporter-level buffering for the Workers case. Have StackbiltCloudExporter.export() POST immediately. The Tracer and MetricsCollector already have their own buffers that batch within a single request, which is the right granularity for Workers. This makes the exporter stateless across requests, which also fixes the "buffer dies with the isolate" failure mode.
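A minimal sketch of Option A, not the real implementation: the exporter keeps no cross-request state and POSTs on every export(). The class name, type shapes, and injected fetch are illustrative.

```typescript
interface TraceSpan { traceId: string; spanId: string; name?: string }
interface MetricPoint { name: string; value: number }

class UnbufferedExporter {
  constructor(
    private endpoint: string,
    private fetchFn: typeof fetch = fetch,
  ) {}

  // One POST per export() call; nothing survives past the request,
  // so isolate eviction can no longer eat buffered signals.
  async export(items: TraceSpan[] | MetricPoint[]): Promise<void> {
    if (items.length === 0) return;
    const isSpans = 'traceId' in items[0] && 'spanId' in items[0];
    await this.fetchFn(this.endpoint, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(isSpans ? { spans: items } : { metrics: items }),
    });
  }
}
```

Because the Tracer/MetricsCollector buffers still coalesce everything from a single request, this typically costs one or two subrequests per request, well inside Workers budgets.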

Option B — propagate flush() through the tracer. Have Tracer.flush() check if the exporter has a flush() method and call it after export(spans) returns. Same for MetricsCollector.flush(). This preserves the batching semantics for anyone who relies on them but makes "I called flush, my data is on the wire" actually true.
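A sketch of Option B, with illustrative interfaces: after handing spans to the exporter, flush() also calls the exporter's own flush() when one exists.

```typescript
interface Span { traceId: string; spanId: string }

interface SpanExporter {
  export(spans: Span[]): Promise<void>;
  flush?(): Promise<void>; // present on the cloud exporter, absent on others
}

class Tracer {
  private buffer: Span[] = [];
  constructor(private options: { export?: SpanExporter } = {}) {}

  record(span: Span): void {
    this.buffer.push(span);
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const spans = this.buffer.splice(0);
    if (this.options.export) {
      await this.options.export.export(spans);
      // The new line: make "I called flush" mean "the POST happened".
      await this.options.export.flush?.();
    }
  }
}
```

The optional-call syntax keeps plain exporters (no flush() method) working unchanged, so batching stays available as an opt-in.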

Option C — expose the exporter on the monitoring bundle. Add exporter: StackbiltCloudExporter | null to the createMonitoring() return value so callers can do await obs.exporter?.flush() in waitUntil. Lowest-risk change but pushes the workaround onto every consumer.
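A sketch of Option C's shape; all classes here are minimal stand-ins for the package's real ones.

```typescript
class StubExporter {
  async flush(): Promise<void> { /* POSTs buffered signals */ }
}
class StubTracer {
  constructor(public options: { export?: StubExporter } = {}) {}
  async flush(): Promise<void> { /* drains into options.export */ }
}
class StubMetrics {
  constructor(public options: { export?: StubExporter } = {}) {}
  async flush(): Promise<void> { /* drains into options.export */ }
}

interface MonitoringBundle {
  tracer: StubTracer;
  metrics: StubMetrics;
  exporter: StubExporter | null; // the newly exposed field
}

function createMonitoring(opts: { endpoint?: string }): MonitoringBundle {
  // One exporter shared by tracer and metrics, now also surfaced to callers.
  const exporter = opts.endpoint ? new StubExporter() : null;
  return {
    tracer: new StubTracer({ export: exporter ?? undefined }),
    metrics: new StubMetrics({ export: exporter ?? undefined }),
    exporter,
  };
}
```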

My vote is A — the Workers-native assumption is that isolates are short-lived and you flush per-request. B is the next-best if you want to keep batching as an opt-in for long-lived use cases.

Worker-side workaround (shipped in tarotscript today)

Until this is fixed upstream, consumers can reach into the tracer's private options field:

```ts
// worker/src/observability.ts
const base = createMonitoring({ ... });
const exporter =
  ((base.tracer as unknown as { options?: { export?: unknown } } | null)
    ?.options?.export as StackbiltCloudExporter | undefined) ?? null;
return { ...base, exporter };

// worker/src/index.ts (middleware)
c.executionCtx.waitUntil((async () => {
  await Promise.allSettled([obs.tracer.flush(), obs.metrics.flush()]);
  await obs.exporter?.flush();
})());
```

This is the pattern currently running in tarotscript-worker deployed version 9d53b2c3-08a2-415a-8d95-b252e6e6f610. It works but relies on a private field and should be considered a stopgap.

Impact on other dogfooders

Per Stackbilt-dev/tarotscript#163, stackbilt-web and edge-auth were instrumented before tarotscript. Their dashboards are also worth auditing — if they're medium-to-high traffic they may have been masking the problem by naturally hitting the 100-item threshold, but any low-volume Worker adopting this package will silently lose telemetry.

Acceptance

  • tracer.flush() in waitUntil results in data landing at the ingest endpoint within the request lifetime
  • Low-volume workers (1-10 requests per isolate) emit every span
  • No private-field access required in consumer code
  • Existing batching consumers, if any, have a migration path
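The first two bullets could be pinned down with a unit test along these lines (fake exporter and tracer standing in for the package's real API):

```typescript
interface Span { traceId: string; spanId: string }

class CountingExporter {
  posts = 0;
  async export(spans: Span[]): Promise<void> {
    if (spans.length > 0) this.posts += 1; // stands in for fetch() to ingest
  }
}

class Tracer {
  private buffer: Span[] = [];
  constructor(private exporter: CountingExporter) {}
  record(span: Span): void { this.buffer.push(span); }
  async flush(): Promise<void> {
    const spans = this.buffer.splice(0);
    if (spans.length > 0) await this.exporter.export(spans);
  }
}

// One request's worth of spans, one flush, one POST: no 100-item gate.
const exporter = new CountingExporter();
const tracer = new Tracer(exporter);
tracer.record({ traceId: 'a', spanId: 'b' });
tracer.flush().then(() => console.assert(exporter.posts === 1));
```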
