Distribution metrics sent as quantized bucket boundaries break DDSketch accuracy with Datadog formatter #75

@moleskin-smile

First, hello, and thanks for maintaining this great project. It solved our metric submission throughput problem!

Problem

When using peep with formatter: :datadog, distribution metrics are sent as bucket boundary values with a sample rate encoding:

metric.name:100|d|@0.02|#tag:val

This tells the Datadog Agent "value 100 was observed 50 times" (1/0.02 = 50). But the actual observed values were spread across the bucket range (e.g., 88-100). The Agent and Datadog backend then build DDSketch structures from these quantized values rather than the real ones.
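To make the encoding concrete, here is a minimal sketch of how such a line decodes into a value and an implied observation count. `DogStatsdLineSketch` and `implied_weight/1` are hypothetical names used only for illustration; they are not part of peep or the Datadog Agent, and the parser handles only the simple `name:value|d|@rate|#tags` shape shown above.

```elixir
defmodule DogStatsdLineSketch do
  # Hypothetical helper: returns {value, weight} for a "name:value|d|@rate|#tags" line.
  def implied_weight(line) do
    [name_and_value, "d" | rest] = String.split(line, "|")
    [_name, raw_value] = String.split(name_and_value, ":")
    {value, ""} = Float.parse(raw_value)

    # A missing "@rate" segment means every observation was sent (rate 1.0).
    rate =
      Enum.find_value(rest, 1.0, fn
        "@" <> raw_rate -> raw_rate |> Float.parse() |> elem(0)
        _other -> nil
      end)

    # Each sent value stands in for 1 / rate real observations.
    {value, round(1 / rate)}
  end
end

IO.inspect(DogStatsdLineSketch.implied_weight("metric.name:100|d|@0.02|#tag:val"))
```

For the line from the issue this yields `{100.0, 50}`: the single value 100 carries the weight of 50 observations, whatever the real values inside the bucket were.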

This produces three concrete issues:

1. Sub-1 values are completely broken

Peep.Buckets.Exponential.bucket_for/2 maps all values < 1 to bucket 0 (upper bound = 1.0):

def bucket_for(value, _) when value < 1 do
  0
end

Any metric with values in the 0-1 range (probabilities, ratios, normalized scores) gets reported as 1.0 to Datadog, making all percentiles identical and incorrect.
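A small sketch of the collapse, assuming only what is stated above: values below 1 map to bucket 0, and bucket 0 is reported by its upper bound of 1.0. `SubOneSketch` and its functions are hypothetical stand-ins, not peep's actual module.

```elixir
defmodule SubOneSketch do
  # Mirrors the clause quoted above: every value below 1 lands in bucket 0.
  def bucket_for(value) when value < 1, do: 0

  # Assumption from the issue text: bucket 0 is reported as its upper bound, 1.0.
  def reported_value(value), do: upper_bound(bucket_for(value))

  defp upper_bound(0), do: 1.0
end

observed = [0.02, 0.35, 0.5, 0.97]
reported = Enum.map(observed, &SubOneSketch.reported_value/1)

# Every observation is reported as the same value, so p50, p95 and p99 all equal 1.0.
IO.inspect(reported)
```

Four quite different ratios all reach Datadog as `1.0`, which is why every percentile of such a metric comes out identical.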

2. Percentiles are stepped/chunky instead of smooth

With default settings (bucket_variability: 0.10), bucket boundaries are ~22% apart. A real value of 92ms landing in an 88-100ms bucket gets reported as 100|d. The backend's DDSketch sees 100 repeated, not 92. This produces visibly stepped percentile graphs instead of the smooth curves seen with raw-value reporters like telemetry_metrics_statsd.

Narrowing bucket_variability improves resolution but increases the number of StatsD lines sent per flush (one per non-empty bucket), negating peep's batching advantage.
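The trade-off can be estimated with a rough sketch. It assumes a DDSketch-style growth factor of `(1 + v) / (1 - v)` between consecutive bucket boundaries, which is consistent with the ~22% figure for `bucket_variability: 0.10` but is an assumption about peep's internals rather than something taken from its source. `BucketSpacingSketch` is a hypothetical module name.

```elixir
defmodule BucketSpacingSketch do
  # Assumed DDSketch-style growth factor between consecutive bucket boundaries.
  def gamma(variability), do: (1 + variability) / (1 - variability)

  # Number of buckets needed to cover values from 1 up to `max_value`,
  # i.e. the worst-case number of StatsD lines per flush for one metric+tags context.
  def buckets_to_cover(max_value, variability) do
    ceil(:math.log(max_value) / :math.log(gamma(variability)))
  end
end

for variability <- [0.10, 0.05, 0.01] do
  gamma = Float.round(BucketSpacingSketch.gamma(variability), 4)
  buckets = BucketSpacingSketch.buckets_to_cover(10_000, variability)
  IO.inspect({variability, gamma, buckets})
end
```

With a variability of 0.10 the growth factor is about 1.22. Shrinking the variability brings the boundaries closer together, but the number of buckets needed to cover the same range, and with it the potential number of lines per flush, grows roughly in inverse proportion.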

3. Double aggregation

Datadog's distribution metric type (d) is designed to receive raw values. The Datadog Agent forwards them to the backend, which builds globally accurate DDSketch percentiles across all hosts/containers. Pre-aggregating into histogram buckets before sending defeats this design: the backend builds a DDSketch of bucket boundaries, not actual observations. This is two levels of lossy aggregation (peep histogram -> DDSketch) where there should be one (DDSketch only).

How official Datadog clients handle this

Both datadog-go and datadogpy send raw individual values for distribution metrics. They never quantize values into buckets client-side.

datadog-go (with extended client-side aggregation)

Buffers raw float64 values per metric+tags context in a []float64 slice. On flush, all values are batch-serialized into a single multi-value DogStatsD message:

metric.name:21:43.2:1657|d|#tag1:val1,tag2:val2

Under high throughput, reservoir sampling (WithMaxSamplesPerContext) caps memory/bandwidth while maintaining statistical representativeness via a @sample_rate adjustment.

datadogpy

Same approach - sends each value as metric.name:value|d. With max_samples_per_context, uses reservoir sampling to limit throughput.

Both clients achieve packet reduction through batching (multiple values per packet), not through lossy quantization.
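For reference, reservoir sampling itself is small. The following is a minimal sketch of the classic "Algorithm R" in Elixir, not a port of either client's code; `ReservoirSketch` and its functions are hypothetical names, and deriving the sample rate as kept / seen is an assumption about how the adjustment would be computed.

```elixir
defmodule ReservoirSketch do
  # Algorithm R: keeps a uniform random sample of at most `max` values.
  def new(max), do: %{max: max, seen: 0, samples: %{}}

  # Until the reservoir is full, every value is kept.
  def add(%{max: max, seen: seen, samples: samples} = reservoir, value) when seen < max do
    %{reservoir | seen: seen + 1, samples: Map.put(samples, seen, value)}
  end

  # Afterwards, the new value replaces a random slot with probability max / (seen + 1).
  def add(%{max: max, seen: seen, samples: samples} = reservoir, value) do
    slot = :rand.uniform(seen + 1) - 1
    samples = if slot < max, do: Map.put(samples, slot, value), else: samples
    %{reservoir | seen: seen + 1, samples: samples}
  end

  # Fraction of observations kept; 1.0 means nothing was dropped.
  def sample_rate(%{seen: 0}), do: 1.0
  def sample_rate(%{max: max, seen: seen}), do: min(1.0, max / seen)

  def values(%{samples: samples}), do: Map.values(samples)
end

reservoir = Enum.reduce(1..1000, ReservoirSketch.new(100), &ReservoirSketch.add(&2, &1))
IO.inspect({length(ReservoirSketch.values(reservoir)), ReservoirSketch.sample_rate(reservoir)})
```

After 1000 observations with a cap of 100, the reservoir holds 100 real observed values and a sample rate of 0.1, so the receiver can scale counts back up without any value having been rounded to a bucket boundary.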

Proposed solution

When the :datadog formatter is configured, use raw-value buffering instead of histogram bucketing for distribution metrics:

  1. Storage: Buffer raw observed values per metric+tags context (e.g., in an ETS table or agent), instead of incrementing histogram bucket counters. Optionally cap via reservoir sampling with a configurable max_samples_per_context.

  2. Flush: Serialize all buffered values into multi-value DogStatsD lines:

    metric.name:val1:val2:val3:...|d|#tags
    

    Split across multiple lines/packets as needed to respect MTU. If reservoir sampling was applied, include the @sample_rate so the backend compensates.

  3. Keep histogram mode for non-Datadog formatters: The current histogram/bucket approach is correct for Prometheus and standard StatsD, where the receiver expects pre-aggregated bucket counts. This change should only affect the :datadog code path.

This preserves peep's core advantage (fewer UDP packets than one-per-event) while maintaining full DDSketch accuracy on the Datadog side.
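The flush step could look roughly like the sketch below. It is only an illustration of the proposal, not peep code: `MultiValueSketch`, its options, and the 1432-byte default payload limit are assumptions, and a real implementation would also need to handle a single value that does not fit the budget.

```elixir
defmodule MultiValueSketch do
  # Packs values into "name:v1:v2:...|d[|@rate]|#tags" lines of at most `max_bytes` each.
  def lines(name, values, tags, opts \\ []) do
    max_bytes = Keyword.get(opts, :max_bytes, 1432)
    rate = Keyword.get(opts, :sample_rate, 1.0)
    suffix = "|d" <> rate_part(rate) <> tags_part(tags)
    # Bytes left for the ":value" segments once name and suffix are accounted for.
    budget = max_bytes - byte_size(name) - byte_size(suffix)

    values
    |> Enum.map(&to_string/1)
    |> Enum.chunk_while({[], 0}, &chunk(&1, &2, budget), &finish/1)
    |> Enum.map(fn chunk -> name <> ":" <> Enum.join(chunk, ":") <> suffix end)
  end

  # Starts a new line when adding the next ":value" would exceed the budget.
  defp chunk(value, {acc, used}, budget) do
    cost = byte_size(value) + 1

    if acc != [] and used + cost > budget do
      {:cont, Enum.reverse(acc), {[value], cost}}
    else
      {:cont, {[value | acc], used + cost}}
    end
  end

  defp finish({[], _used}), do: {:cont, {[], 0}}
  defp finish({acc, _used}), do: {:cont, Enum.reverse(acc), {[], 0}}

  # The rate segment is only emitted when sampling actually dropped values.
  defp rate_part(rate) when rate >= 1.0, do: ""
  defp rate_part(rate), do: "|@#{rate}"

  defp tags_part([]), do: ""
  defp tags_part(tags), do: "|#" <> Enum.join(tags, ",")
end

IO.inspect(MultiValueSketch.lines("metric.name", [21, 43.2, 1657], ["tag1:val1", "tag2:val2"]))
```

With the default limit, the three values fit on one line and reproduce the datadog-go example above, `metric.name:21:43.2:1657|d|#tag1:val1,tag2:val2`. A smaller `max_bytes` splits the same values across several lines that each repeat the name, rate and tags.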

Environment

  • peep version: 4.4.0
  • Formatter: :datadog
  • Observed with: Datadog distribution metrics showing incorrect percentiles after migrating from telemetry_metrics_statsd