The observability cost war (and the hidden bill it sends to your MTTR)

Published on 2025-09-15 10:30

“Our incident runbook says: Step 1 — panic. Step 2 — Google. Step 3 — realize your logs were ‘cost-optimized’ last quarter.”

There’s a new game in town, and Finance is winning it: who can cut the most telemetry without getting paged for it tomorrow. You’ve heard the slogans. “Ditch logs.” “Sample aggressively.” “Minimize data.” On paper, these are prudent moves. In practice, when something strange flaps its wings in production at 02:37, every byte you shaved yesterday has a nasty habit of reappearing on today’s bill — in the currency of Mean Time to Restore (MTTR).

This isn’t an anti-cost message. Most SREs have stared at a usage graph and muttered something unprintable about a chatty microservice. But the pendulum has swung hard toward austerity, and if you cut context along with cost, your MTTR creeps up in ways dashboards won’t confess. Let’s unpack the tradeoffs, the tech, the policies, and the human factors behind observability austerity — and how to reduce spend without leaving your on-call staring at an empty trace UI and a log search that returns exactly nothing.


Why everyone reached for the scissors

Telemetry is not free. Cloud logging still charges real money to ingest and retain data. At large scale, even “reasonable” per-GB pricing turns into “which team owns this line item?” That real pressure birthed a wave of designs and vendor features that decouple ingestion from indexing, that rehydrate cold data on demand, that store logs cheaply in object storage with minimal indexing, and that use sampling to throttle traces before they melt your collectors.

Two big levers emerged.

The first lever is architecture. Systems like Loki index only labels and cram the log text into compressed chunks in object storage. You get cheap, durable storage and just enough indexing to make queries tractable — if you’re disciplined about labels. Meanwhile, commercial stacks popularized “ingest everything, index selectively,” plus the ability to rehydrate a slice of archived logs back to hot search when you’re conducting an incident autopsy. These ideas mean you can keep detail somewhere, just not always in the expensive tier.

The second lever is sampling. OpenTelemetry made “head-based” sampling easy: decide at the start and drop most traces. Later came “tail-based” sampling: wait for the whole trace, then keep the interesting stuff — errors, long latency, odd attributes — and discard the rest. Paired with exemplars that tie a scary metric spike to a specific trace, sampling promised a kind of telemetry judo: keep the needles, toss most hay.
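
To make the head-based half concrete, here's a minimal sketch using the OpenTelemetry Python SDK; the 5% ratio and the service name are illustrative, not recommendations, and tail-based rules would live downstream in a collector rather than in this code.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio

# Head-based sampling: the keep/drop decision happens when the root span
# starts, so it can't know which traces will turn out to be errors or slow.
# Child spans inherit the parent's decision (the "ParentBased" part).
provider = TracerProvider(sampler=ParentBasedTraceIdRatio(0.05))  # keep ~5%
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("handle_request"):
    ...  # roughly 1 in 20 of these traces gets recorded and exported
```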

There’s even a philosophical shift. The “ditch logs” camp argues that wide, structured events and traces carry the causal story far better than line-oriented logs, so why pay to index piles of text? They’re not entirely wrong. The story of a broken request is often clearer in spans and attributes than in 12,000 INFO lines.

And yet…

…MTTR is where the bodies are buried

DORA didn’t put Time to Restore on the poster for fun. It’s the number that most directly reflects how quickly you can get users back on the happy path. Now picture an actual 3 a.m. incident. Your alert fires. Dashboards spike. Slack ignites. You click into traces… and your head sampler at one percent decided this request wasn’t “representative.” You pop open logs… and your data minimization policy stripped the only tenant identifier that would correlate the symptom to a customer. You’re left reconstructing the crime from footprints, not fingerprints.

That’s the dark side of austerity: silent cost cuts that break the feedback loops your incident response relies on. They don’t yell at you. They just slow you down. The team seems “sloppy.” MTTR creeps. Postmortems blame “human error.” Hidden in the timeline is a surgical removal of the clue you needed to be fast.

Two loud camps, both certain the other will break prod

“Ditch logs. Use wide events and traces.”

This camp points out that spans encode causality, concurrency, and timing, and that tail-based sampling lets you conserve budget without losing the spicy traces. Wide events (structured, arbitrarily-rich payloads) give you the context you wanted from logs, but make it queryable like data. On top of that, exemplars let you click from a p99 latency spike straight to a trace, skipping a lot of log spelunking. The implied promise: fewer bytes, same (or better) debuggability.

“Logs are ground truth. Turn them off and you’ll find out why you needed them.”

The rebuttal: traces show the flow; logs show the detail. Exact error messages. Odd user inputs. The nasty edge case string. Logs are also the paper trail for security and compliance. And with tiered storage models, rehydration, and label-only indexing, logs don’t have to be ruinously expensive. Many teams still start in logs to validate hunches, then pivot to traces for causality. Kill logs, and you will, sooner or later, wish you hadn’t.

Both sides have a point. The truth is less tribal: multi-signal debugging wins. Metrics show the symptoms; traces show the path; logs show the specifics. Remove any one without compensating design, and you lengthen the guess-and-check loop that dominates MTTR.

The real villain: “invisible” cuts that only surface during incidents

Budget moves turn into incident problems when they are implemented without guardrails and without a way to measure impact.

Head-only sampling at tiny rates is a classic footgun. On a quiet Tuesday it looks fine. On Black Friday with one customer hitting a weird code path, it’s a blindfold. Tail sampling exists because “decide up front” can’t know what becomes interesting later. It’s not magic — collectors need to buffer, timeouts matter, and you have to route all spans for a trace to the same decision point — but it’s how you ensure that errors and slow traces are kept, always.

Log minimization is another place where good intent goes bad. Privacy and compliance absolutely matter. But “data minimization” in law means “no more personal data than necessary for the purpose.” In operations, the purpose is “restore service safely.” Redact or hash personal data at the source, but don’t strip operational identifiers and correlation keys from logs in storage. If you can’t join a log to a trace or filter a dashboard by tenant during an incident, you’ve created a compliance policy that quietly raises MTTR.

And then there’s metric cardinality. A few unbounded labels can melt a time-series backend, so teams reactively drop labels. If the labels you drop are the same ones SREs use to find which region or customer is in pain, you saved storage at the cost of minutes of triage.

All three failures have the same shape: a cost change applied globally and silently. No alarm went off when you made yourself slower. You found out during the outage.

A middle path: minimize cost, not context

This is the part where we keep the CFO and the on-call at the same table.

Start with sampling that’s smarter than “flip a coin.” Use head-based sampling to tame the boring flood. Add tail-based rules so you never drop traces with errors, you keep the p99 and above tails during spikes, and you preferentially retain traces that traverse newly deployed services or uncommon paths. Hybrid sampling keeps volume predictable and preserves the exact cases you care about during incidents.
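
The tail-based half of that policy lives in a collector tier (for example, the OpenTelemetry Collector's tail-sampling processor), but the decision logic is simple enough to sketch in plain Python. The threshold, the attribute names, and the set of "newly deployed" services below are assumptions for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class TraceSummary:
    """Illustrative stand-in for a completed trace as a tail sampler sees it."""
    has_error: bool
    duration_ms: float
    services: set[str]

P99_LATENCY_MS = 1200.0            # assumed threshold; recompute it from live data
NEWLY_DEPLOYED = {"payments-v2"}   # hypothetical services in their first post-release hour
BASELINE_KEEP_RATE = 0.01          # the "boring flood" rate

def keep_trace(t: TraceSummary) -> bool:
    """Never drop failures or long tails; keep new-code paths; sample the rest."""
    if t.has_error:
        return True
    if t.duration_ms >= P99_LATENCY_MS:
        return True
    if t.services & NEWLY_DEPLOYED:
        return True
    return random.random() < BASELINE_KEEP_RATE
```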

Keep logs, but tier them. Promise engineers a hot window of full-fidelity, searchable logs for the most recent 24–72 hours — the time when most incident response and early postmortems happen. Beyond that, archive to cheap object storage. When you need last week’s data, rehydrate a narrow slice. This pattern aligns real investigative behavior with budget realities. If you run Loki, embrace its design: pick stable labels, avoid unbounded values, and always narrow by labels before scanning chunks.
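
Here's a quick sketch of what that label discipline looks like from the application side, assuming a Loki-style store where only labels are indexed: keep a small, bounded label set and push high-cardinality values into the log body. The actual label/body split is configured in your log agent; the fields below just show which side each value belongs on, and all names are illustrative.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

# Bounded, stable values that are safe to index as labels.
LABELS = {"service": "checkout", "env": "prod", "region": "eu-west-1"}

def log_event(message: str, **fields) -> None:
    # Unbounded values (tenant, request ID, trace ID) go in the body: still
    # searchable after you've narrowed by labels, but never indexed.
    log.info(json.dumps({**LABELS, **fields, "msg": message}))

log_event("payment declined",
          tenant_id="t-4821",
          trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
```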

Link your signals. Add trace_id and span_id to logs so traces pivot to the exact lines that matter. Use exemplars so a histogram bucket with ugly latency carries a link to representative traces. When the path from alert → metric → exemplar trace → correlated logs is one or two clicks, you don’t waste precious minutes doing copy-paste archaeology.
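
OpenTelemetry's logging instrumentation can inject trace context for you, but the idea fits in a few lines of stdlib logging if you want to see the moving parts; this is a minimal sketch, not a drop-in replacement for that instrumentation.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Stamp each record with the active trace/span IDs so log search can
    pivot straight to the trace, and the trace UI can pull related logs."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
```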

Treat privacy as a scalpel, not a machete. Work with legal to identify which identifiers are operationally necessary for restoring service and how to treat them (hashing, tokenization, shorter retention). Document your purpose, your retention, and your redaction — and then instrument it. You can be compliant and fast.

Finally, budget and SLO your telemetry the way you do compute. Give teams a telemetry budget tied to outcomes. If error budgets are being consumed and MTTR is trending up, allow the team to “spend more visibility” in the short term, then tighten again once you’ve fixed the root cause. Observability isn’t overhead; it’s the instrument panel a pilot uses to land the plane in fog. Don’t brag about how little you spent if you also missed the runway.

Real-world 3 a.m. moments where austerity backfires

Picture a European region where latency spikes for exactly one enterprise tenant on a rare code path. Your head sampler misses every anomalous trace for twenty minutes because the traffic is low and scattered. Logs exist, but you filtered out the tenant key “because GDPR.” Metrics show the pain, but you can’t filter by customer anymore after last quarter’s “cardinality cleanup.” The incident commander is suddenly investigating via vibes, not evidence.

Now flip it. You keep error traces and long-tail latencies via tail sampling. You include correlation IDs in logs and trace context. Your dashboards show exemplars and let you jump straight to the offending trace. Your logs are hot for 48 hours and archived afterward, so you can pull exactly what you need if today’s incident started yesterday. You didn’t store everything forever. You stored enough, on purpose, to be fast when it counts.

Six approaches that reduce spend without stealing your superpowers

Establish a hot-window contract with surgical rehydration. Make it explicit: engineers get full-fidelity logs, traces, and span events for the most recent couple of days. After that, data moves to cheap storage. If a weird issue surfaces days later, rehydrate a narrow time slice for the affected services. Engineers operate with confidence in the present; Finance sleeps at night about the past.

Adopt hybrid sampling that never drops a failure. Configure the collector with tail-based policies to always keep traces containing errors or exceeding dynamic latency thresholds. Dial up retention for traces that pass through newly deployed services during the first hour after release. Let head sampling keep the flood predictable everywhere else. Yes, tail sampling requires buffering and careful routing so all spans for a trace meet at the same decision point — but when it’s 3 a.m., you will be grateful for every guaranteed error trace.
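
The “same decision point” requirement is the part teams underestimate. Here's a toy sketch of the buffering mechanic, with a hypothetical decision window and a pluggable keep/drop policy; real collectors handle this for you, this just shows why the routing matters.

```python
import time
from collections import defaultdict
from typing import Callable

DECISION_WINDOW_S = 30.0  # assumed buffer window; real collectors make this configurable

# Spans are buffered per trace ID until the window closes, then judged once as
# a group. In production this runs in a collector tier with routing that sends
# every span for a given trace ID to the same instance.
_buffers: dict[str, list[dict]] = defaultdict(list)
_first_seen: dict[str, float] = {}

def on_span(span: dict) -> None:
    _buffers[span["trace_id"]].append(span)
    _first_seen.setdefault(span["trace_id"], time.monotonic())

def flush(decide: Callable[[list[dict]], bool],
          export: Callable[[list[dict]], None]) -> None:
    now = time.monotonic()
    ready = [t for t, ts in _first_seen.items() if now - ts >= DECISION_WINDOW_S]
    for trace_id in ready:
        spans = _buffers.pop(trace_id)
        del _first_seen[trace_id]
        if decide(spans):
            export(spans)  # whatever ships kept traces to your backend
```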

Wire in exemplars and correlation everywhere. Add exemplars from your histograms to your tracing backend so an SRE can leap from a spike to a specific request path. Include trace and span IDs in logs and ensure your tools surface “view related logs” or “view related traces” buttons instead of making people copy IDs across tabs. It’s amazing how much MTTR shrinks when you cut out the busywork.
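
Here's what attaching an exemplar can look like with the Python Prometheus client, assuming a client version with exemplar support and the OpenMetrics exposition format enabled on the scrape endpoint; metric and label names are illustrative.

```python
import time
from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency by route",
    ["route"],  # bounded label
)

def handle(route: str) -> None:
    start = time.perf_counter()
    ...  # the actual work, ideally inside an active span
    ctx = trace.get_current_span().get_span_context()
    # The exemplar carries a trace ID alongside the observation, so a latency
    # spike on the dashboard links straight to a representative trace.
    REQUEST_LATENCY.labels(route=route).observe(
        time.perf_counter() - start,
        exemplar={"trace_id": format(ctx.trace_id, "032x")},
    )
```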

Minimize safely: redact personal data at the source, not context in storage. Decide with your DPO what “adequate, relevant, limited to what’s necessary” means for reliability and security. Redact PII before it leaves the app boundary. Keep operational identifiers that are necessary to restore service quickly. Apply retention limits in line with investigative reality. That’s how you satisfy the law and your users.
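
“Redact at the source, keep the operational keys” can be as small as a keyed hash applied before a field ever reaches a log line. A minimal sketch follows; the environment variable name and fields are illustrative, and your DPO may prefer tokenization or stricter truncation instead.

```python
import hashlib
import hmac
import os

# A per-environment secret keeps the pseudonyms stable (so you can still
# correlate a user's requests during an incident) without being trivially
# reversible from the logs alone.
PSEUDONYM_KEY = os.environ.get("TELEMETRY_PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymize(value: str) -> str:
    """Replace a direct identifier (email, username) with a stable token
    before it leaves the application boundary: joinable, not readable."""
    return hmac.new(PSEUDONYM_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# The operational identifier stays; the personal one is tokenized.
log_fields = {
    "tenant_id": "t-4821",
    "user": pseudonymize("alice@example.com"),
}
```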

Shape intake to avoid cardinality explosions, don’t slash labels blindly. Track which metrics and labels drive explosions and either reshape them (aggregate, bucket, or tag differently) or move them to logs with filters. Loki and Prometheus both have clear guidance: labels must be bounded and stable. Make this a design review topic, not a panic reaction after a big spike.
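
Reshaping usually means bucketing: collapse an unbounded value into a handful of classes before it becomes a label, and leave the raw value to logs or trace attributes. A small sketch with the Python Prometheus client; names are illustrative.

```python
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests by route template and status class",
    ["route", "status_class"],  # both bounded by design
)

def status_class(code: int) -> str:
    return f"{code // 100}xx"  # 200, 204, 206... all collapse into "2xx"

def record(route_template: str, code: int) -> None:
    # Use the route template ("/orders/{id}"), never the raw path with IDs;
    # raw paths and customer IDs belong in logs or trace attributes.
    HTTP_REQUESTS.labels(route=route_template, status_class=status_class(code)).inc()

record("/orders/{id}", 502)
```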

Exploit pricing levers and product features as they land. Cloud providers change pricing; vendors add new tiers; OSS evolves. Keep a simple rhythm where platform and FinOps review telemetry cost levers quarterly. Revisit whether a managed tier, an OSS backend, or a hybrid yields the right price-to-control tradeoff for your current scale. “We can’t afford logs” often translates to “we never re-evaluated after the pricing changed.”

Open questions worth arguing about in the comments

Are wide events a true superset of logs and traces, or do logs retain unique value even when events are “arbitrarily wide”? If you’ve moved heavily toward events and spans, where did you still miss plain-English log text?

Can we make sampling explainable? Would you trust a “sampling audit log” that tells you what was dropped and why, so you can tune policies with evidence instead of guesswork?

What’s the best way to measure the impact of cost cuts on MTTR? Do you run controlled experiments — increase sampling in one service for a sprint and compare incident timelines — or do you rely on team anecdotes and postmortems?

How should privacy and reliability co-author telemetry policy? If the purpose is “restore service safely,” what identifiers truly are “necessary,” and how do you handle them in a way that satisfies both legal and on-call?

The human bit (with a wink)

If MTTR had a voice, it wouldn’t care how clever your ingestion pipeline is or how beautifully you shaved a few thousand dollars off the bill. It would ask one blunt question: How fast can you get users back on the happy path? Every observability decision ladders up to that. Spend wisely, yes. But spend where it shortens the distance between the alert and the root cause. Keep the traces that hurt, keep enough logs to read the story, link your signals so you can click through the mystery like a well-labeled whodunnit. Optimize costs with a scalpel. The cheapest data is never the data that saves your weekend.

Thought-starters (dare you to disagree)

If you kept only one percent of traces but always retained errors and p99+ latency tails, would your incident timelines stay tight? What would be the first incident where that wasn’t enough?

Which one label in your metrics would hurt the most to lose during an outage, and what’s your plan to keep it without causing a cardinality explosion?

If your hot window for full-text logs were 48 hours, would you still investigate 90% of incidents without rehydration? If not, what’s the right window for your team, and how will you prove it?

What operational identifiers are truly necessary to restore service quickly, and how will you hash, mask, or rotate them to keep your DPO smiling?

References

Google Cloud — “Cloud Logging pricing” — https://cloud.google.com/stackdriver/pricing

Amazon Web Services — “Amazon CloudWatch pricing” — https://aws.amazon.com/cloudwatch/pricing/

AWS Compute Blog — “AWS Lambda introduces tiered pricing for Amazon CloudWatch Logs and additional logging destinations” — https://aws.amazon.com/blogs/compute/aws-lambda-introduces-tiered-pricing-for-amazon-cloudwatch-logs-and-additional-logging-destinations/

Duckbill Group — “Lambda Logs Just Got a Whole Lot Cheaper” — https://www.duckbillgroup.com/blog/lambda-logs-just-got-cheaper/

Datadog Docs — “Rehydrating from Archives” — https://docs.datadoghq.com/logs/log_configuration/rehydrating/

Datadog Blog — “Efficiently retrieve old logs with Datadog’s Log Rehydration” — https://www.datadoghq.com/blog/efficient-log-rehydration-with-datadog/

Grafana Loki Docs — “Loki architecture” — https://grafana.com/docs/loki/latest/get-started/architecture/

Grafana Loki Docs — “Label best practices; cardinality guidance” — https://grafana.com/docs/loki/latest/get-started/labels/ and https://grafana.com/docs/loki/latest/get-started/labels/bp-labels/

Prometheus Docs — “Metric and label naming; avoid high cardinality” — https://prometheus.io/docs/practices/naming/

OpenTelemetry — “Sampling (head vs. tail)” — https://opentelemetry.io/docs/concepts/sampling/

New Relic — “Tail sampling with OpenTelemetry” — https://newrelic.com/blog/best-practices/open-telemetry-tail-sampling

AWS Distro for OpenTelemetry — “Advanced sampling: group-by-trace and tail sampling” — https://aws-otel.github.io/docs/getting-started/advanced-sampling

OpenTelemetry — “Logs are a stable signal” — https://opentelemetry.io/docs/concepts/signals/logs/

Datadog — “Correlate OpenTelemetry traces and logs/metrics” — https://docs.datadoghq.com/opentelemetry/correlate/logs_and_traces/ and https://docs.datadoghq.com/opentelemetry/correlate/metrics_and_traces/

Grafana Docs — “Introduction to exemplars” — https://grafana.com/docs/grafana/latest/fundamentals/exemplars/

OpenMetrics Spec (Prometheus) — “Exemplars” — https://prometheus.io/docs/specs/om/open_metrics_spec/

Google Cloud — “Correlate metrics and traces by using exemplars” — https://cloud.google.com/stackdriver/docs/instrumentation/advanced-topics/exemplars

Grafana Tempo Docs — “Metrics from traces (exemplars)” — https://grafana.com/docs/tempo/latest/getting-started/metrics-from-traces/

UK Information Commissioner’s Office — “Data minimisation principle (UK GDPR Article 5(1)(c))” — https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-protection-principles/a-guide-to-the-data-protection-principles/data-minimisation/

DORA — “DORA’s Four Keys (including Time to Restore)” — https://dora.dev/guides/dora-metrics-four-keys/

CNCF — “2024 Ecosystem Gaps Survey (observability challenges)” — https://www.cncf.io/wp-content/uploads/2024/11/CNCF_2024_Ecosystem-Gaps-Survey-Report_v2.pdf

CNCF — “Cloud Native 2024: Annual Survey” — https://www.cncf.io/wp-content/uploads/2025/04/cncf_annual_survey24_031225a.pdf

#SRE #SiteReliability #Observability #OpenTelemetry #Tracing #Logging #FinOps #PlatformEngineering #Cloud #DevOps