Our runbook says reliability is a feature, but somehow the dashboard keeps interpreting that as “creativity is a feature” too.
There is a special kind of silence in SRE teams when someone says, “Technically, we met the SLO.” It is not the silence of confidence. It is the silence of people mentally checking whether “technically” is doing a suspicious amount of heavy lifting.
That is where the wonderfully shady bit of internal jargon, SLO Bleed, lives. It is not a formal industry term you will find in the canonical Google SRE books, but it describes a very real behavior: an SLO appears healthy because the measurement window or eligible traffic has been trimmed just enough to hide the ugly bits. Overnight batch runs are excluded. Low-volume failure periods are ignored. Maintenance windows are carved out so generously they start to look like a second operating model. The chart says green. The users, internal customers, or downstream teams may remember things differently.
This term feels sticky because it captures a tension at the heart of modern reliability work. SLOs are supposed to be a shared, user-focused mechanism for making tradeoffs between feature velocity and reliability. Google’s SRE guidance repeatedly frames SLOs and error budgets as decision tools grounded in user experience, not vanity metrics. Google also emphasizes that SLOs should be tied to user journeys and business-relevant service behavior, not just whatever is easiest to instrument.
And yet, humans are involved. Humans with deadlines. Humans with leadership reviews. Humans with release trains, quarterly OKRs, executive dashboards, and that one meeting where somebody asks why reliability is “down three points” as if production systems were weather forecasts and not distributed acts of organized optimism.
The most charitable explanation is that teams are trying to make noisy systems measurable. That part is fair. Real systems are messy. Traffic is bursty. Background jobs behave nothing like customer-facing APIs. Sparse metrics can make alerting weird. Grafana’s guidance notes that low-traffic services can generate noisy or misleading behavior, and Datadog documents that sparse metrics can produce unexpected results if monitor settings are not tuned properly. In other words, not every odd-looking graph is evidence of moral collapse. Sometimes it is just statistics being annoying again.
But SLO Bleed is not really about statistics. It is about incentives.
The broader pattern is old enough to have its own law. Goodhart’s Law is usually summarized as: when a measure becomes a target, it stops being a good measure. The OECD still uses that framing in recent work on measurement and incentives, because the problem has not exactly aged out. Once a reliability number becomes a performance target, people begin shaping the number instead of the reality it was meant to represent. In SRE, that can mean redefining eligible events, excluding embarrassing windows, or choosing indicators that are clean to report but weakly connected to customer pain.
This is why SLO Bleed feels unethical even when it is technically allowed. The dashboard becomes less of a thermometer and more of a photo taken from a flattering angle.
Now, to be fair, not every exclusion is metric fraud in a trench coat.
One legitimate school of thought says that scheduled downtime, maintenance windows, and non-user-critical workloads should not automatically count against a service objective. Google Cloud’s own guidance on error budgets and maintenance windows makes exactly this point. If the business has explicitly accepted scheduled downtime, and if there is no plan to eliminate it because the tradeoff is intentional, then counting that downtime against the error budget may distort rather than clarify priorities. The important caveat is that this acceptance of risk has strong business implications and must be explicit, constrained, and transparent.
This is the strongest argument against moral panic around SLO Bleed. Not every overnight batch failure belongs in the same reliability contract as user login latency. A batch reconciliation job that runs at 02:00 may be operationally important, but if it does not affect a user journey in the same way as checkout or authentication, folding it into the same SLO can create a target that is both noisy and strategically useless. Google’s materials on SLO design repeatedly push teams toward user journeys and end-user needs, not a giant bucket of “all observable system behavior.”
From this viewpoint, exclusions are not cheating. They are scoping. They are the grown-up version of saying, “Please stop using the same ruler to measure a website, a Kafka pipeline, and Steve’s nightly CSV monster.”
That argument has merit.
The opposing view is the one you hear from battle-scarred SREs, platform engineers, and internal customers who have watched teams slowly edit reality until the reliability report looks excellent and the incident channel remains strangely busy.
This camp argues that the real failure mode is not an imperfect SLO. It is an untrusted SLO.
Once people suspect the metric is curated to look good, the social contract collapses. Product stops taking error budgets seriously. Engineering managers stop believing alerts reflect customer harm. Leadership gets a falsely calm picture of risk. The SLO remains numerically correct within its rules, but operationally useless. Nobl9’s recent SLO guidance warns against targets and practices that teams ignore during systemic decisions, because then the SLO framework becomes ceremonial instead of actionable. Google’s SRE material makes the same point more elegantly: SLOs matter because they drive decisions. If violations do not trigger change, or success does not reflect reality, you are running theater, not reliability engineering.
This is where SLO Bleed becomes especially dangerous in DevOps organizations. DevOps was supposed to narrow the gap between builders and operators. SRE gave that relationship a reliability language: SLIs, SLOs, and error budgets. But when teams quietly shape the metric to preserve release velocity or protect a quarterly narrative, the old conflict returns wearing modern vocabulary. It is still “throw it over the wall,” except the wall now has a Grafana dashboard on it.
And human nature absolutely loves this move. Nobody wakes up saying, “Today I shall undermine the epistemic integrity of reliability measurement.” They say things like, “Let’s just exclude that maintenance period because it is expected,” and then, “That overnight processing is internal only,” and then, “That region was in failover,” and then suddenly the SLO represents the service only during the hours when it behaves like a well-supervised child.
Recent observability data makes this discussion even more relevant. Grafana’s 2025 and 2026 observability survey findings show that alert fatigue remains one of the most common obstacles to faster incident response. That matters because teams often justify exclusions as a way to reduce noise, especially around spiky traffic or low-signal workloads. The motivation is understandable. Nobody wants alerts competing with sleep, dinner, or basic dignity.
At the same time, current platform guidance points toward better modeling, not just less measurement. Grafana recommends minimum-failure thresholds and supplemental synthetics for low-traffic services. Datadog emphasizes burn-rate and error-budget alerting rather than raw error rates alone. Google’s SRE Workbook has long argued for burn-rate-based alerting precisely because it connects operational urgency to budget consumption across time windows, not random blips.
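To make the burn-rate idea concrete, here is a minimal arithmetic sketch in Python. The numbers are illustrative, assuming a 99.9% availability SLO over a 30-day window; the 14.4x fast-burn figure is the commonly cited example from the SRE Workbook, not a default any platform enforces for you.

```python
# Minimal burn-rate arithmetic for an assumed 30-day, 99.9% availability SLO.
SLO_TARGET = 0.999                 # 99.9% of requests must succeed
WINDOW_HOURS = 30 * 24             # 30-day rolling window = 720 hours
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'allowed' the budget is being spent."""
    return observed_error_ratio / ERROR_BUDGET

def budget_consumed(rate: float, hours: float) -> float:
    """Fraction of the whole 30-day budget consumed at this burn rate."""
    return rate * (hours / WINDOW_HOURS)

# A service failing 1.44% of requests burns budget 14.4x too fast:
rate = burn_rate(0.0144)                                   # -> 14.4
print(f"burn rate: {rate:.1f}x")
# Sustained for one hour, that eats 2% of the month's entire budget:
print(f"budget consumed in 1 hour: {budget_consumed(rate, 1):.1%}")
```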
That is an important distinction. There is a huge difference between refining your SLO model so it reflects customer risk and carving away evidence until the report stops ruining your Monday.
The first practical move is to separate user-facing objectives from operational workload objectives. If batch jobs matter, give them their own SLO or operational health target instead of pretending they do not exist. A user-facing checkout SLO should represent checkout. A batch settlement SLO should represent batch settlement. This sounds obvious until you see how many teams keep one umbrella number because leadership prefers a single green circle. The cleanest way to reduce SLO Bleed is to stop forcing unrelated behaviors into one metric marriage. Google’s recent product-focused reliability guidance reinforces this idea by centering support models and objectives on end-user needs and critical product paths.
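Here is a minimal sketch of what that separation might look like, assuming nothing about any particular SLO platform; the names, indicators, and targets are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    user_facing: bool      # does this represent a real user journey?
    indicator: str         # what the SLI actually measures
    target: float          # objective over the compliance window
    window_days: int

# Hypothetical objectives: the point is that checkout and the nightly
# settlement job get their own contracts instead of one umbrella number.
CHECKOUT_SLO = SLO(
    name="checkout-availability",
    user_facing=True,
    indicator="successful checkout requests / total checkout requests",
    target=0.999,
    window_days=30,
)

SETTLEMENT_SLO = SLO(
    name="batch-settlement-freshness",
    user_facing=False,
    indicator="settlement runs completed before 06:00 / scheduled runs",
    target=0.98,
    window_days=30,
)
```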
The second move is to make exclusions explicit, narrow, and reviewable. If you exclude maintenance windows, document why. Define the exact time boundaries. State who approved the tradeoff. Review whether the exclusion is temporary or permanent. Google Cloud’s maintenance-window guidance is useful precisely because it does not say, “Exclude whatever is inconvenient.” It says that if the business consciously accepts that downtime, the implications must be understood and the windows should be kept as short as possible. That is governance, not vibes.
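One hedged sketch of what "explicit, narrow, and reviewable" could mean in practice. Every field, name, and date below is hypothetical; the point is that an exclusion is a documented object with an owner and an expiry, not a silent filter buried in a query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ExclusionWindow:
    """A maintenance exclusion that is documented, bounded, and reviewable."""
    start: datetime
    end: datetime
    reason: str
    approved_by: str
    expires: datetime      # exclusions get re-reviewed, they are not eternal

# Hypothetical example: one explicitly approved maintenance window.
EXCLUSIONS = [
    ExclusionWindow(
        start=datetime(2026, 4, 12, 2, 0, tzinfo=timezone.utc),
        end=datetime(2026, 4, 12, 2, 30, tzinfo=timezone.utc),
        reason="DB primary failover drill, tradeoff accepted by product",
        approved_by="jane.doe (engineering director)",
        expires=datetime(2026, 7, 1, tzinfo=timezone.utc),
    ),
]

def is_excluded(ts: datetime) -> bool:
    """An event only leaves the SLI if it falls inside a documented window."""
    return any(w.start <= ts < w.end for w in EXCLUSIONS)
```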
The third move is to pair SLO compliance with error-budget detail and narrative context. GitLab’s public documentation on error-budget detail dashboards captures the spirit well: teams need to explore when budget was spent, not just whether a high-level target was met. A green monthly SLO with one catastrophic overnight collapse tells a very different story than a smooth, healthy month. This is how you reduce metric theater. You do not just report the score. You show the plot twists.
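A small illustrative sketch of that idea: the daily numbers below are invented, but they show how one catastrophic night can hide inside a technically green month unless you plot when the budget was actually spent.

```python
# Sketch: report *when* the error budget was spent, not just the monthly score.
TOTAL_BUDGET = 0.001                 # 99.9% SLO -> 0.1% of requests may fail
daily_requests = [1_000_000] * 30
daily_errors = [200] * 30            # a quiet month...
daily_errors[17] = 18_000            # ...except one catastrophic overnight window

total_requests = sum(daily_requests)
failed_so_far = 0
for day, errors in enumerate(daily_errors, start=1):
    failed_so_far += errors
    consumed = (failed_so_far / total_requests) / TOTAL_BUDGET
    print(f"day {day:02d}: {consumed:6.1%} of the monthly budget consumed")
# The month still finishes "green" (~79% of budget), but the detail view
# shows most of it vanished in a single night.
```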
A fourth move, because production always laughs at round numbers, is to use burn-rate alerting and user-journey corroboration. Burn-rate alerts catch whether you are consuming budget too fast over short and long windows. User-journey telemetry and synthetics help verify whether the reliability picture matches lived experience. That combination is much harder to game than a monthly aggregate with selective blindness.
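For completeness, a minimal multiwindow burn-rate check in the spirit of the SRE Workbook approach. The thresholds and function names here are mine, not any vendor's API; treat it as a sketch, not a drop-in alerting rule.

```python
def should_page(short_ratio: float, long_ratio: float,
                budget: float = 0.001) -> bool:
    """
    Fast-burn page in the spirit of multiwindow, multi-burn-rate alerting:
    both a short and a long lookback must exceed the threshold, so a brief
    blip (short window only) or an already-recovered incident (long window
    only) does not wake anyone up.
    """
    threshold = 14.4  # burn rate that spends ~2% of a 30-day budget per hour
    return (short_ratio / budget >= threshold and
            long_ratio / budget >= threshold)

# Hypothetical readings: 5-minute and 1-hour error ratios from your SLI.
print(should_page(short_ratio=0.02, long_ratio=0.016))   # True: page
print(should_page(short_ratio=0.02, long_ratio=0.0005))  # False: transient blip
```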
SLO Bleed is not ultimately a tooling problem. It is a culture problem with very nice charts.
In healthy organizations, SLOs are trusted because they are allowed to be uncomfortable. Teams can miss them. Leaders can hear bad news without demanding decorative measurement changes. Product can accept that some reliability work is not optional just because the launch deck is pretty. In unhealthy organizations, the metric becomes diplomatic. It exists to avoid conflict. It reassures upward and confuses sideways.
That is why SRE and DevOps conversations always drift back to human behavior. Not because engineers are uniquely devious, but because systems of accountability always shape the data they consume. Reliability engineering is supposed to help us confront reality sooner, not negotiate with it until the quarter closes.
And there is a real emotional cost when teams stop trusting the scoreboards. On-call gets more cynical. Incident reviews get more political. Everyone learns that green can mean “safe,” “excluded,” or “please do not ask follow-up questions.” At that point, you do not just have observability debt. You have organizational trust debt.
SLO Bleed is such a good piece of internal jargon because it captures the thing nobody wants to say out loud: reliability metrics can be technically valid and spiritually dishonest at the same time.
The cure is not purity. It is clarity.
Scope your SLOs around real user journeys. Create separate objectives for different workloads. Use burn rates, detail dashboards, and explicit governance for exclusions. Most of all, build a culture where the point of measurement is learning, not reputation management with better typography.
Because once your SLO becomes a costume for reliability instead of a mirror, production will eventually do what production always does.
It will introduce itself, loudly, at 3:07 a.m., and ask whether you still feel great about that exclusion window.
Google Cloud Blog, “SRE error budgets and maintenance windows” — https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows
Google SRE Workbook, “Chapter 5 - Alerting on SLOs” — https://sre.google/workbook/alerting-on-slos/
Google SRE Resources, “Product SRE, improving reliability of services” — https://sre.google/resources/practices-and-processes/product-focused-reliability-for-sre/
Grafana Labs, “Observability Survey Report 2025 - key findings” — https://grafana.com/observability-survey/2025/
Datadog Blog, “Burn rate is a better error rate” — https://www.datadoghq.com/blog/burn-rate-is-better-error-rate/
#SRE #SiteReliability #DevOps #SLO #SLI #ErrorBudget #Observability #OnCall #ReliabilityEngineering #PlatformEngineering #IncidentManagement #Metrics #DevOpsCulture