
bug: StackbiltCloudExporter silently drops spans in low-volume Workers (double-buffering + ephemeral isolates) #7

@stackbilt-admin

Description

Summary

StackbiltCloudExporter buffers spans/metrics/logs internally and only POSTs to the ingest endpoint once the buffer reaches 100 items or 50KB. A typical low-volume Worker rarely hits that threshold before the isolate is evicted, so the buffered signals are silently dropped when the isolate dies. Callers invoking tracer.flush() / metrics.flush() in ctx.waitUntil() believe they're forcing a real flush, but those methods only drain their own buffers into the exporter; they never trigger the POST.

Discovered while dogfooding on Stackbilt-dev/tarotscript (issue Stackbilt-dev/tarotscript#163).

Repro / evidence

  • tarotscript-worker was instrumented per the README pattern: root span per request in a middleware, obs.tracer.flush() + obs.metrics.flush() in waitUntil.
  • Over ~14 hours of live traffic, the dashboard showed only 5 traces total, all from a single early burst. The dashboard and the underlying D1 agreed perfectly — the data just wasn't arriving.
  • After patching the worker to call the underlying exporter.flush() directly, traces immediately began flowing on every request (32 traces, last-seen 3m ago, real p50/p95/p99 populating within minutes).

Root cause

Two layers of buffering, and tracer.flush() only drains the first:

Layer 1: Tracer.buffer in src/tracing.ts. Tracer.flush() snapshots the buffer and calls this.options.export.export(spans):

```ts
// src/tracing.ts:233
async flush(): Promise<void> {
  if (this.buffer.length === 0) return;
  const spans = [...this.buffer];
  this.buffer = [];
  if (this.options.export) {
    await this.options.export.export(spans); // <-- hands off to Layer 2
  }
}
```

Layer 2: StackbiltCloudExporter.spans in src/stackbilt-exporter.ts. export() pushes into its own buffer and calls maybeFlush(), which gates on a batch threshold:

```ts
// src/stackbilt-exporter.ts:100
async export(items: MetricPoint[] | TraceSpan[]): Promise<void> {
  if (items.length === 0) return;
  if ('traceId' in items[0] && 'spanId' in items[0]) {
    this.spans.push(...(items as TraceSpan[])); // <-- buffered
  } else {
    this.metrics.push(...(items as MetricPoint[]));
  }
  await this.maybeFlush(); // <-- gated, not forced
}

// src/stackbilt-exporter.ts:138
private async maybeFlush(): Promise<void> {
  if (Date.now() < this.backoffUntil) return;
  const totalItems = this.metrics.length + this.spans.length + this.logs.length + this.alerts.length;
  if (totalItems === 0) return;
  if (totalItems < this.maxBatchSize) { // default 100
    const bytes = this.estimateBytes();
    if (bytes < this.maxBatchBytes) return; // default 50KB
  }
  await this.flush();
}
```

A typical tarotscript scaffold-cast request produces ~5 spans + ~3 metric points = ~8 items per request. Reaching 100 items requires ~12 concurrent-ish requests in the same isolate. Between bursts, Cloudflare evicts the idle isolate and the exporter buffer dies with it.
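Made concrete (the ~8 items per request and the 100-item default come from the numbers above; the helper name is mine, not the package's):

```typescript
// Hypothetical helper: how many requests must land in one isolate before
// maybeFlush()'s item-count gate opens, ignoring the byte threshold.
function requestsUntilFlush(itemsPerRequest: number, maxBatchSize: number): number {
  return Math.ceil(maxBatchSize / itemsPerRequest);
}

// ~8 items per scaffold-cast request against the default 100-item gate:
// 12 requests buffer only 96 items, so nothing can POST before the 13th.
console.log(requestsUntilFlush(8, 100)); // 13
```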

The public exporter.flush() at stackbilt-exporter.ts:129 is the method that actually POSTs, but it's not reachable from the return value of createMonitoring() — the exporter is only referenced internally by the tracer and metrics collector.

Why batching is the wrong default for Workers

Traditional exporters batch because network round-trips are expensive and long-running processes have time to fill a buffer between flushes. Workers invert both of those assumptions:

  • Isolates are ephemeral. There is no "next request" guarantee — you get one shot to flush before eviction.
  • setInterval-based auto-flush doesn't work reliably (timers don't fire while the isolate is idle).
  • Workers already amortize HTTP round-trips via subrequest budgets; one POST-per-request is fine for the volume this package targets.
  • Cost is already bounded by the dashboard's per-worker cap (the 403 backoff path) — batching isn't needed for cost protection.

Proposed fixes

Option A (preferred) — remove exporter-level buffering for the Workers case. Have StackbiltCloudExporter.export() POST immediately. The Tracer and MetricsCollector already have their own buffers that batch within a single request, which is the right granularity for Workers. This makes the exporter stateless across requests, which also fixes the "buffer dies with the isolate" failure mode.
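A minimal sketch of Option A, not the real implementation: the exporter keeps no cross-request state and POSTs on every export(). The class name, type shapes, and injected fetch are illustrative.

```typescript
interface TraceSpan { traceId: string; spanId: string; name?: string }
interface MetricPoint { name: string; value: number }

class UnbufferedExporter {
  constructor(
    private endpoint: string,
    private fetchFn: typeof fetch = fetch,
  ) {}

  // One POST per export() call; nothing survives past the request,
  // so isolate eviction can no longer eat buffered signals.
  async export(items: TraceSpan[] | MetricPoint[]): Promise<void> {
    if (items.length === 0) return;
    const isSpans = 'traceId' in items[0] && 'spanId' in items[0];
    await this.fetchFn(this.endpoint, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(isSpans ? { spans: items } : { metrics: items }),
    });
  }
}
```

Because the Tracer/MetricsCollector buffers still coalesce everything from a single request, this typically costs one or two subrequests per request, well inside Workers budgets.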

Option B — propagate flush() through the tracer. Have Tracer.flush() check if the exporter has a flush() method and call it after export(spans) returns. Same for MetricsCollector.flush(). This preserves the batching semantics for anyone who relies on them but makes "I called flush, my data is on the wire" actually true.
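A sketch of Option B, with illustrative interfaces: after handing spans to the exporter, flush() also calls the exporter's own flush() when one exists.

```typescript
interface Span { traceId: string; spanId: string }

interface SpanExporter {
  export(spans: Span[]): Promise<void>;
  flush?(): Promise<void>; // present on the cloud exporter, absent on others
}

class Tracer {
  private buffer: Span[] = [];
  constructor(private options: { export?: SpanExporter } = {}) {}

  record(span: Span): void {
    this.buffer.push(span);
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const spans = this.buffer.splice(0);
    if (this.options.export) {
      await this.options.export.export(spans);
      // The new line: make "I called flush" mean "the POST happened".
      await this.options.export.flush?.();
    }
  }
}
```

The optional-call syntax keeps plain exporters (no flush() method) working unchanged, so batching stays available as an opt-in.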

Option C — expose the exporter on the monitoring bundle. Add exporter: StackbiltCloudExporter | null to the createMonitoring() return value so callers can do await obs.exporter?.flush() in waitUntil. Lowest-risk change but pushes the workaround onto every consumer.
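A sketch of Option C's shape; all classes here are minimal stand-ins for the package's real ones.

```typescript
class StubExporter {
  async flush(): Promise<void> { /* POSTs buffered signals */ }
}
class StubTracer {
  constructor(public options: { export?: StubExporter } = {}) {}
  async flush(): Promise<void> { /* drains into options.export */ }
}
class StubMetrics {
  constructor(public options: { export?: StubExporter } = {}) {}
  async flush(): Promise<void> { /* drains into options.export */ }
}

interface MonitoringBundle {
  tracer: StubTracer;
  metrics: StubMetrics;
  exporter: StubExporter | null; // the newly exposed field
}

function createMonitoring(opts: { endpoint?: string }): MonitoringBundle {
  // One exporter shared by tracer and metrics, now also surfaced to callers.
  const exporter = opts.endpoint ? new StubExporter() : null;
  return {
    tracer: new StubTracer({ export: exporter ?? undefined }),
    metrics: new StubMetrics({ export: exporter ?? undefined }),
    exporter,
  };
}
```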

My vote is A — the Workers-native assumption is that isolates are short-lived and you flush per-request. B is the next-best if you want to keep batching as an opt-in for long-lived use cases.

Worker-side workaround (shipped in tarotscript today)

Until this is fixed upstream, consumers can reach into the tracer's private options field:

```ts
// worker/src/observability.ts
const base = createMonitoring({ ... });
const exporter =
  ((base.tracer as unknown as { options?: { export?: unknown } } | null)
    ?.options?.export as StackbiltCloudExporter | undefined) ?? null;
return { ...base, exporter };

// worker/src/index.ts (middleware)
c.executionCtx.waitUntil((async () => {
  await Promise.allSettled([obs.tracer.flush(), obs.metrics.flush()]);
  await obs.exporter?.flush();
})());
```

This is the pattern currently running in tarotscript-worker deployed version 9d53b2c3-08a2-415a-8d95-b252e6e6f610. It works but relies on a private field and should be considered a stopgap.

Impact on other dogfooders

Per Stackbilt-dev/tarotscript#163, stackbilt-web and edge-auth were instrumented before tarotscript. Their dashboards are also worth auditing — if they're medium-to-high traffic they may have been masking the problem by naturally hitting the 100-item threshold, but any low-volume Worker adopting this package will silently lose telemetry.

Acceptance

  • tracer.flush() in waitUntil results in data landing at the ingest endpoint within the request lifetime
  • Low-volume workers (1-10 requests per isolate) emit every span
  • No private-field access required in consumer code
  • Existing batching consumers, if any, have a migration path
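The first two bullets could be pinned down with a unit test along these lines (fake exporter and tracer standing in for the package's real API):

```typescript
interface Span { traceId: string; spanId: string }

class CountingExporter {
  posts = 0;
  async export(spans: Span[]): Promise<void> {
    if (spans.length > 0) this.posts += 1; // stands in for fetch() to ingest
  }
}

class Tracer {
  private buffer: Span[] = [];
  constructor(private exporter: CountingExporter) {}
  record(span: Span): void { this.buffer.push(span); }
  async flush(): Promise<void> {
    const spans = this.buffer.splice(0);
    if (spans.length > 0) await this.exporter.export(spans);
  }
}

// One request's worth of spans, one flush, one POST: no 100-item gate.
const exporter = new CountingExporter();
const tracer = new Tracer(exporter);
tracer.record({ traceId: 'a', spanId: 'b' });
tracer.flush().then(() => console.assert(exporter.posts === 1));
```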
