Prometheus Native Histograms & Quantiles

Created on 2025-09-14 10:11

Published on 2025-09-26 10:30

Accuracy vs. Cost vs. Complexity (DDSketch / HDR / NH)—When to Migrate and How to Sell It Upstairs

If you’ve ever tried to explain p95 to an executive at 3 a.m., you know the real outage isn’t the backend—it’s our collective faith in bucket math. Classic Prometheus histograms force you to pick bucket boundaries up front and then hope your latency distribution politely stays put. Spoiler: it won’t. Enter Prometheus Native Histograms (NH), DDSketch, and HDR Histogram: three different answers to the same question—how do we estimate quantiles fast, cheaply, and correctly enough that our SLOs aren’t fiction?

This piece breaks down what each approach really buys you, the trade-offs you’ll hit in production, the migration playbook that won’t nuke your dashboards, and the plain-English “why now” story your execs actually care about.

Why quantiles are hard (and why your buckets hate you)

Quantiles tease out tail behavior—p95, p99, p99.9—that drives user pain and error budgets. The trouble is you never store every single sample; you summarize. Classic Prometheus histograms summarize by counting into fixed buckets. That seems fine until traffic changes shape. Those buckets you lovingly tuned during Black Friday? Now they’re as helpful as a pager on airplane mode. Worse, misaligned buckets across services make cross-service quantiles wobbly, and changing bucket boundaries mid-flight breaks aggregation history. Summaries avoid bucket drama but compute quantiles client-side, which makes aggregating across instances a non-starter. Everyone has scars here.
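
To see why client-side quantiles don’t aggregate, here is a small sketch with made-up traffic: one busy, fast instance and one quiet, slow one. The instance data and the nearest_rank_quantile helper are illustrative assumptions, not anything a Prometheus client exposes.

```python
def nearest_rank_quantile(samples, q):
    """Exact quantile by sorting; fine for a demo, too costly for a fleet."""
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]


# Hypothetical traffic, in seconds: a busy, fast instance and a quiet, slow one.
fast_instance = [0.1] * 1000
slow_instance = [2.0] * 10

p99_fast = nearest_rank_quantile(fast_instance, 0.99)
p99_slow = nearest_rank_quantile(slow_instance, 0.99)

# What you get if you average two summary-style p99 values.
average_of_p99s = (p99_fast + p99_slow) / 2

# What users actually experienced across the whole fleet.
true_fleet_p99 = nearest_rank_quantile(fast_instance + slow_instance, 0.99)

print(f"average of per-instance p99s: {average_of_p99s:.2f}s")
print(f"p99 over all requests:        {true_fleet_p99:.2f}s")
```

The averaged figure lands far from the fleet-wide one because the average ignores how many requests each instance served. Mergeable structures (bucket counts, sketches) avoid this by combining counts first and computing the quantile once.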

Native Histograms, DDSketch, and HDR each attack the same pain with different math and operational stories.

The contestants

Native Histograms (NH): Prometheus grows up about distributions

A native histogram packs a whole distribution into a single sample: a compact, structured histogram with dynamic, exponentially spaced buckets. Instead of N separate _bucket series plus _sum and _count, you store one series with all the distribution detail embedded. You can combine them in queries, compute quantiles server-side, and avoid the old “did we pick the right buckets?” ritual.

Operationally, this tames cardinality explosion and makes histograms far easier to reason about when you aggregate across instances, zones, or services. In the ecosystem, Prometheus 3.x supports them in core (depending on your release, you may still need to switch them on explicitly); Mimir/Cortex can ingest and query them; Grafana knows how to visualize them. There are still footnotes: some PromQL operators and functions are newer and a few corners remain marked experimental. But the fundamentals are production-ready in 2025.

What it means for SRE life: better tail accuracy without per-team bucket herding, and fewer “our p99 changed because Alice edited a YAML” incidents. What it means for the CFO: fewer time series than the classic histogram fan-out, which can cut storage and query costs depending on tooling and vendor pricing, while increasing resolution.

DDSketch: relative-error quantiles, battle-tested at Datadog

DDSketch is a mergeable quantile sketch with a relative-error guarantee. Instead of relying on hand-picked bucket boundaries, it counts values in exponentially spaced bins sized so that the error on any reported quantile is bounded as a percentage of the true value. Because it’s fully mergeable, it’s fantastic for distributed rollups: shard everywhere, merge centrally, still know your p99 within a predictable factor.
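
A toy version makes the idea concrete. The sketch below follows the bin mapping described in the DDSketch paper (bin index is the ceiling of log-gamma of the value, with gamma derived from the target relative error), but it is a teaching aid under simplifying assumptions: positive values only, unbounded bin count, and a made-up class name. It is not the Datadog library.

```python
import math
from collections import Counter


class TinyDDSketch:
    """Toy DDSketch for positive values; not a production implementation."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha                      # target relative error
        self.gamma = (1 + alpha) / (1 - alpha)  # growth factor between bins
        self.bins = Counter()                   # bin index -> count
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this toy sketch only handles positive values")
        # Bin i covers the interval (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value, self.gamma))
        self.bins[index] += 1
        self.count += 1

    def merge(self, other):
        if other.gamma != self.gamma:
            raise ValueError("sketches must share the same relative error")
        self.bins.update(other.bins)  # Counter.update adds counts per bin
        self.count += other.count

    def quantile(self, q):
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.bins):
            seen += self.bins[index]
            if seen > rank:
                # Representative value keeps the error within alpha on both sides.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("empty sketch")


# Two shards record disjoint halves of a synthetic latency range (ms).
shard_a, shard_b = TinyDDSketch(alpha=0.01), TinyDDSketch(alpha=0.01)
for ms in range(1, 501):
    shard_a.add(float(ms))
for ms in range(501, 1001):
    shard_b.add(float(ms))

shard_a.merge(shard_b)
p99 = shard_a.quantile(0.99)
print(f"merged p99 estimate: {p99:.1f} ms")
```

The merged estimate stays within the configured 1% of the exact answer even though neither shard ever saw the other’s data, which is the property that makes sketches attractive for rollups.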

Operationally, it’s blazingly fast, memory-bounded, and language-portable. If you run a polyglot fleet or rely on OTel collectors and vendor backends, you may already be using DDSketch without realizing it. The flip side is that DDSketch lives slightly to the side of “pure Prometheus”—you’ll often compute quantiles in the collector or vendor backend, and you will tune the relative error to balance accuracy and footprint.

HDR Histogram: fixed cost, high dynamic range, ridiculous speed

HDR Histogram is the hot rod of latency histograms. You predefine a value range and number of significant digits, and in return you get constant-time recording, a fixed memory footprint, and extremely fast percentile queries. In-process it shines: load-test tools, RPC libraries, and JVM services love HDR. The gotchas: you must choose min/max ranges up front; merging histograms requires compatible configs; and you’ll need a pipeline to ship HDR snapshots if you want cross-service SLOs. It’s phenomenal at local truth, a bit more opinionated at fleet truth.

The trade-off triangle: accuracy, cost, and complexity

Let’s put the three on the same axes SREs and execs both care about.

Accuracy

At the tail, classic histograms lose fidelity unless buckets are tuned and kept stable. NH and DDSketch both close that gap with exponential bucketing and consistent math across merges. HDR can be extremely accurate if configured correctly and if your upstream merges are disciplined. DDSketch’s relative-error guarantee is compelling for SLOs that watch “how far off could the p99 be?” NH provides high resolution with predictable behavior when aggregating; it also avoids the “two teams, two bucket sets” problem that quietly ruins many p95s.

Cost

Cost arrives as time-series count, storage footprint, query CPU, and human time. Classic histograms are deceptively expensive: every histogram spawns a fleet of _bucket series (plus _sum and _count), and multiplying by label combinations turns a handful of metrics into thousands of series. NH collapses that into a single series per label combination, with sparse buckets that only exist when populated. In managed platforms that price by active series, NH often wins while giving you better precision. DDSketch and HDR keep memory tight in-process; end-to-end cost then depends on how you export and where you compute quantiles. If you already pay a vendor for sketch-friendly storage and queries, DDSketch may be the cheapest way to get useful tail metrics today.
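
The series arithmetic is worth doing on a napkin before any vendor call. This is a rough sketch with hypothetical numbers (200 label combinations, 12 finite buckets); plug in your own. It counts series only, and a native histogram sample is larger than a plain float sample, so read the ratio as a ceiling on savings rather than a bill.

```python
def classic_series(label_combinations, finite_buckets):
    """Series exposed by one classic histogram metric."""
    # One _bucket series per finite boundary, one for le="+Inf",
    # plus the _sum and _count series, for every label combination.
    return label_combinations * (finite_buckets + 1 + 2)


def native_series(label_combinations):
    """A native histogram is a single series per label combination."""
    return label_combinations


# Hypothetical service: 200 label combinations, 12 finite bucket boundaries.
classic = classic_series(label_combinations=200, finite_buckets=12)
native = native_series(label_combinations=200)

print(f"classic: {classic} series, native: {native} series")
print(f"series ratio: {classic / native:.0f}x")
```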

Complexity

Complexity is two things: “what my engineers must know” and “what my stack must support.” NH cleans up cognitive load because you stop arguing about bucket edges and can aggregate across services with fewer booby traps. But you do need to confirm backend support and get comfortable with the newer PromQL functions and operators for histograms. DDSketch is conceptually simple once you accept relative error; wiring it into a Prom-first world can be trickier unless your collector/backends natively understand it. HDR is dead simple inside a process; distributing it well is an engineering exercise you’ll own.

A 3 a.m. anecdote (as required by observability law)

We had a checkout SLO where the dashboard flashed red every Saturday night. The p99 looked above the SLO by a hair, so the on-call did the usual dance: retry rate checks, cache stats, scale a replica or two, and pray. We eventually discovered the “breach” was a quantile illusion: most latency sat comfortably in the 180–220 ms range, but the classic histogram buckets had a 200–300 ms chasm. Interpolation pushed the p99 just over our target. Switching that service to NH gave us more resolution around the 200–250 ms zone and revealed the truth: no SLO breach, no need to page, and no “incident” postmortem with pie charts of guesswork.
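
The illusion is easy to reproduce. The sketch below mimics the linear interpolation that histogram_quantile applies inside a classic bucket, which assumes observations are spread evenly between the bucket’s bounds. The bucket layout and counts are invented for illustration; they are not the data from that incident.

```python
import math


def classic_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    Expects 0 < q <= 1, bounds sorted ascending and ending with +Inf.
    """
    total = buckets[-1][1]
    rank = q * total
    lower, below = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                return lower  # fall back to the highest finite bound
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / (cumulative - below)
        lower, below = upper, cumulative
    raise ValueError("empty histogram")


# Hypothetical checkout traffic: every request finished within 220 ms,
# but the bucket layout jumps straight from 200 ms to 300 ms.
buckets = [(0.1, 0), (0.2, 700), (0.3, 1000), (0.5, 1000), (math.inf, 1000)]

p99 = classic_quantile(0.99, buckets)
print(f"estimated p99: {p99 * 1000:.0f} ms")  # about 297 ms
```

With a 250 ms target, the estimate reports a breach even though no request in this made-up data set took longer than 220 ms. The estimator can only say “somewhere between 200 and 300,” and linear interpolation places the p99 near the top of that range.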

Two opposing viewpoints (both held by sincere, sleep-deprived adults)

There’s a lively, good-faith debate here. One camp says, “Go all-in on Native Histograms now.” Their argument: NH is part of core Prometheus 3.x, it aggregates correctly across labels without bucket alignment drama, you get better tail fidelity, fewer series, and easier dashboards. The ecosystem support is real: ingestion, query, and viz are there across Prometheus, Mimir/Cortex, and Grafana. For teams living in PromQL, NH is aligned with how you work.

The other camp says, “Hold on—keep DDSketch/HDR where they fit and adopt NH selectively.” They’ll remind you not every PromQL operator loves NH yet, text exposition formats and some middleboxes have sharp edges, and your remote-write or downsampling layer might lag on advanced features. Meanwhile, DDSketch already ships relative-error guarantees at web scale, and HDR remains the fastest, safest in-process choice you can make. If your stack leans on OTel collectors or vendor backends with strong sketch support, moving to NH everywhere could be work for marginal gain.

Who’s right? Both—depending on your topology, vendors, and tolerance for migrations. Which is why the smart answer is “adopt NH where it pays off first, keep DDSketch/HDR where they’re the best local tool, and revisit in a quarter.”

Migration playbook: how to move without breaking dashboards (or hearts)

First, choose one service with the SLO that keeps you up at night—usually the customer-facing API with a tight latency objective. Don’t boil the ocean. Enable Native Histograms in that service’s client library and exporter, pick an initial resolution scale that gives you sub-10% relative error around the SLO threshold, and ship both NH and your existing classic histogram or sketch side-by-side. This dual-write window is your safety net and your argument to the business.
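
Picking that initial resolution is mostly arithmetic. In the standard exponential layout, each native histogram schema sets the growth factor between adjacent bucket boundaries to 2^(2^-schema), with schemas running from -4 to 8. The helper below, whose name and search loop are illustrative, finds the coarsest schema whose buckets grow by no more than a given fraction. Client libraries tend to expose this as a bucket-factor style setting (client_golang calls it NativeHistogramBucketFactor), so check your library’s docs for the exact knob.

```python
def bucket_factor(schema):
    """Growth factor between adjacent bucket boundaries for a schema."""
    return 2 ** (2 ** -schema)


def coarsest_schema(max_bucket_growth):
    """Lowest-resolution schema whose buckets grow by at most the given fraction."""
    for schema in range(-4, 9):  # standard exponential schemas: -4 through 8
        if bucket_factor(schema) - 1 <= max_bucket_growth:
            return schema
    raise ValueError("no standard schema is that fine-grained")


for target in (0.10, 0.05):
    schema = coarsest_schema(target)
    growth = (bucket_factor(schema) - 1) * 100
    print(f"target {target:.0%}: schema {schema}, buckets grow by {growth:.1f}%")
```

Bucket width is a worst-case bound on how far an estimate can sit from the true value inside one bucket, so a sub-10% target corresponds to schema 3 under this reading.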

Second, build a dashboard that shows classic p95/p99 next to NH p95/p99 over the same window, annotated with deploys. Add a panel for request rate and error rate so people don’t forget context. Over a week of real traffic, measure three things honestly: how close the quantiles track during steady state, how they diverge during bursts, and how long your queries take in each model. If you use a managed metrics backend, compare active-series counts and query costs before and after.
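
For the side-by-side panels, the two queries differ in small but important ways: the classic form reads the _bucket series and must keep the le label when aggregating, while the native form reads the histogram series directly. The helper below just builds the PromQL strings; the metric name checkout_request_duration_seconds is a placeholder, and it assumes the same base name is dual-written in both formats.

```python
def classic_quantile_query(metric, q, window="5m"):
    """PromQL for a quantile over classic bucket series."""
    # Classic histograms need the `le` label preserved through aggregation.
    return f"histogram_quantile({q}, sum by (le) (rate({metric}_bucket[{window}])))"


def native_quantile_query(metric, q, window="5m"):
    """PromQL for a quantile over a native histogram series."""
    # Native histograms carry their buckets inside the sample, so no `le`.
    return f"histogram_quantile({q}, sum(rate({metric}[{window}])))"


metric = "checkout_request_duration_seconds"  # placeholder metric name
for q in (0.95, 0.99):
    print(classic_quantile_query(metric, q))
    print(native_quantile_query(metric, q))
```

Put each pair on one panel with matching time ranges so any divergence is visible at a glance.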

Third, cut your alerts and SLO burn calculations over to NH-based queries while keeping the old panels for a sprint. Silence the old alerts only once the NH alerts behave at least as well in on-call drills. If you find oddities, like a PromQL function that isn’t quite what you expect with NH, address them in code or dashboards rather than backing out the change. This is where small, high-signal dashboards pay dividends.
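
One way to express a latency SLO burn on native histograms is to ask what fraction of requests finished under the threshold, then compare the shortfall with the error budget. The sketch below builds a histogram_fraction query and does the burn arithmetic; the metric name, the 250 ms threshold, and the 99% target are assumptions for illustration, and histogram_fraction may still be flagged experimental depending on your Prometheus version.

```python
def good_fraction_query(metric, threshold_seconds, window="5m"):
    """PromQL for the share of observations at or below the threshold."""
    return (
        f"histogram_fraction(0, {threshold_seconds}, "
        f"sum(rate({metric}[{window}])))"
    )


def burn_rate(good_fraction, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1 - slo_target
    return (1 - good_fraction) / error_budget


# Placeholder metric; SLO assumed to be "99% of requests under 250 ms".
print(good_fraction_query("checkout_request_duration_seconds", 0.25))

# If the query reports 98.5% of requests under the threshold:
print(f"burn rate: {burn_rate(good_fraction=0.985, slo_target=0.99):.2f}")
```

A burn rate above 1 means the budget will run out before the SLO window ends if the current rate holds.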

Fourth, expand to two more services with different traffic shapes, ideally one high-QPS internal service and one batchy background worker. Repeat the comparison. Document what scale settings and label sets gave you the sharpest signal without surprising costs. Only then make “NH by default” a platform stance.

Fifth, decide what you’ll do with DDSketch and HDR. In most shops, HDR stays in process where it shines—profiling, load testing, tight loops—while DDSketch remains a great choice in collectors or vendor pipelines that expect sketches. The point isn’t to pick a single hammer; it’s to make each hammer hit the right nail.

Selling it upstairs: the executive-level story

Executives don’t wake up craving exponential buckets. They care about reliability, customer experience, and money. So use that language.

Lead with risk: “Our current quantiles can mislead us near SLO thresholds. That causes false pages and wasted engineering cycles, or worse, we miss real burn until it’s expensive.” Then quantify efficiency: “The new histogram type consolidates many per-bucket series into a single series with better tail fidelity. On our biggest service, that’s a double win: fewer series to store and query, and better accuracy when aggregating across regions.” Finally, pitch speed to insight: “With native histograms, new services don’t need custom bucket tuning to join SLO dashboards. We can onboard faster, and we reduce the chance of human error.”

If cost is a sensitive topic, bring the A/B chart from your dual-write period showing active-series and query timings. If you’re on a managed platform that prices by active series, do the napkin math with your real data. If you self-host, talk in terms of TSDB footprint, chunk compression, and fewer scan-heavy queries. The right sentence is, “We expect lower storage cost or higher precision for the same cost; either outcome improves margin on observability.”

Close with a bounded ask: “We’ll migrate three services in Q4, keep the old metrics for a month, and report impact on alert fidelity, SLO burn accuracy, and spend. If the results match the pilot, we standardize in Q1.”

Practical approaches that won’t make your alerts compete with Netflix

Start by instrumenting native histograms at the edge of your SLOs. If your SLO is 250 ms p99 for checkout, tune your histogram scale so your relative error around 200–300 ms is tight. You don’t need nanosecond precision at 5 seconds; you need honesty where the SLO line lives. This gives your on-call playbooks cleaner signals and less alert flapping when traffic shifts.
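
To check how much resolution a given scale buys you around the SLO line, list the bucket boundaries that fall inside the range you care about. This sketch assumes the standard exponential layout, where boundaries are powers of 2^(2^-schema); the function name is made up for illustration.

```python
import math


def bounds_between(schema, low, high):
    """Native histogram bucket boundaries that fall within [low, high]."""
    per_octave = 2 ** schema  # buckets per power of two (for schema >= 0)
    first = math.ceil(math.log2(low) * per_octave)
    last = math.floor(math.log2(high) * per_octave)
    # Boundary i sits at 2^(i / per_octave).
    return [2 ** (i / per_octave) for i in range(first, last + 1)]


# Boundaries between 200 ms and 300 ms (in seconds) at two resolutions.
for schema in (0, 3):
    bounds = bounds_between(schema, 0.2, 0.3)
    pretty = ", ".join(f"{b * 1000:.0f} ms" for b in bounds)
    print(f"schema {schema}: {pretty}")
```

At schema 0 only one boundary (250 ms) lands in that window, so everything from 125 ms to 500 ms is split into just two buckets. At schema 3 there are five boundaries between 200 and 300 ms, which is the kind of honesty near the SLO line this section is asking for.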

Keep classic histograms or DDSketch running in parallel for at least one release cycle. This isn’t cowardice; it’s science. Use the overlap window to measure drift, query latency, and cardinality. When a surge hits, compare which model recovered faster and which one helped you pick the right remediation. Put those screenshots in a short write-up so you never have to debate the value in a vacuum.

Be deliberate about where you compute quantiles. If your world is PromQL-first with remote write to Mimir/Cortex/Thanos, let PromQL compute quantiles from NH at query time. If your world leans OTel collectors or a vendor with great DDSketch support, keep the quantile math where it’s cheapest and fastest. The wrong answer is “do it twice in two places.” Pick a single source of truth for SLOs.

Finally, treat your histogram configuration as code. Version the scale settings and label sets. When a team changes them, run a quick A/B in staging and update a living “Histogram Cookbook” in your platform docs. This keeps the institutional knowledge alive and stops your future self from wondering why p95 got “better” after a deploy.

Open questions worth arguing about in the comments

Are relative-error guarantees more valuable than higher raw resolution when your tail is where the money lives? If so, does that tilt you toward DDSketch for SLOs and NH for everything else?

How much precision is enough around SLO thresholds before diminishing returns set in, and does that answer change with user growth or seasonality?

If your metrics backend charges per active series, do native histograms reduce your bill enough to matter, or will you spend those savings on richer labels and more services anyway?

Should we normalize on one histogram type across the org for cognitive simplicity, or keep a “horses for courses” policy and accept some platform heterogeneity?

Closing reflection

We instrument because reality is messy and users don’t care about our feelings. Native histograms pull Prometheus closer to the way we actually operate: fewer foot-guns, better tails, less bucket therapy. DDSketch and HDR remain exceptional tools in the right places. If SRE is the art of making trade-offs explicit, this is one of the better ones—less midnight math, more honest SLOs, and a cleaner story for the people who fund the page-you-at-3-a.m. lifestyle.

