Created on 2026-01-24 13:09
Published on 2026-01-26 11:30
If you’ve ever argued in a retro about whether a 502 is “really” an error if the user just refreshes, this one’s for you. SLOs and error budgets often travel as a pair in SRE conversations, so people treat them like synonyms. They aren’t. They’re more like a seatbelt and a speedometer: related, complementary, and very different in what they’re for. Let’s unpack the differences, where they come from, and how to use both without turning your roadmap into a reliability hostage situation.
SLOs grew out of a simple but powerful idea: your service should meet a level of reliability that users actually notice and value. Not perfect. Not aspirationally perfect. Just the right “good enough” to keep people happy and your business moving. In SRE practice, the SLO is that target, and it’s measured by a user-centered SLI, not by whatever infrastructure metric happens to be loudest this week. The error budget came later as a policy instrument to resolve the classic conflict: product wants to ship; ops wants to sleep. By defining error budget as the complement of the SLO and deciding what to do when you spend it, you align incentives so shipping and reliability aren’t permanent enemies.
An SLO is a promise you make to yourself about how your users should experience your service. It’s the target reliability measured the way a user would feel it—think successful requests, good minutes, or fast-enough responses. The SLO is not a legal contract; that’s the SLA’s job. The SLO lives inside your engineering org and product org. It guides trade-offs and prioritization. It’s the number you point to when the CEO asks, “Are we reliable enough to launch the new feature this quarter?”
An error budget is the math that makes that promise actionable. It’s the allowed amount of unreliability over a time window: one minus the SLO. If your monthly availability SLO is 99.9%, your monthly budget for “everything that can go wrong and still be okay” is 0.1%. You can translate that into events, requests, minutes, or whatever your SLI tracks. The point is not philosophical; it’s operational. It’s the finite allowance you spend on deploys, experiments, and controlled risk before your policy kicks in and says, “Cool it.”
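The arithmetic is simple enough to sketch in a few lines. This is a minimal illustration of the "one minus the SLO" translation described above; the function names are my own, not from any particular library:

```python
# Error-budget arithmetic: budget = 1 - SLO, translated into minutes or requests.
# The 99.9% SLO and 30-day window below are illustrative defaults.

def error_budget_fraction(slo: float) -> float:
    """The error budget is simply one minus the SLO target."""
    return 1.0 - slo

def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the budget fraction into minutes of allowed unavailability."""
    return error_budget_fraction(slo) * window_days * 24 * 60

def budget_requests(slo: float, total_requests: int) -> int:
    """Or into a count of requests that are allowed to fail."""
    return int(error_budget_fraction(slo) * total_requests)

print(round(budget_minutes(0.999), 1))     # 43.2 minutes over 30 days
print(budget_requests(0.999, 10_000_000))  # 10000 failed requests allowed
```

The same budget can be denominated in whatever unit your SLI tracks: minutes for a time-based SLI, requests for a count-based one.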
Setting up SLOs starts with user journeys, not dashboards. You identify a few golden paths, define SLIs that reflect those experiences, and choose targets that balance business expectations with technical reality. The first version will be wrong. That’s fine. SLOs are designed to be iterated—what matters is that they are clear, measurable, and accepted by product and engineering together. You’ll decide on windows (rolling 28–30 days are common), and you’ll decide how strict a miss is in practice. That last bit—what you do when you miss—is where the error budget policy lives.
Configuring the error budget means clarifying two things: the measurement window and the governance. The window determines how forgiveness or pain compounds. A calendar month gives you a clean reporting cadence; a rolling window smooths the edges of “we broke prod on the 31st so we’re magically fine again on the 1st.” Governance defines thresholds for action—slow down merges, add extra reviews, prioritize reliability work, or even stop feature deploys until burn stabilizes. These are social contracts with teeth, not just charts. Google’s own guidance is blunt: the budget is one minus your SLO, and the policy is how you react to spending it, up to and including release freezes.
SLOs keep your on-call meaningful. They determine what’s page-worthy versus what can wait for business hours. They reframe “we saw 2% CPU jitter” into “users experienced 0.04% failures over the last hour, which is within our tolerance.” When you alert, you alert on the SLO via burn rate, not on raw metrics. Burn rate tells you how fast you’re consuming your budget—and therefore how quickly you need to act. Short, spiky burns tell you to grab a coffee and the runbook; slow, steady burns tell you to grab a whiteboard and the backlog. The SRE Workbook popularized multi-window, multi-burn-rate alerting, which catches both sudden fires and creeping degradations without waking you up for noise. Recent industry practice continues to push burn-rate thinking as superior to naive error-rate thresholds.
Error budgets, meanwhile, are your risk currency. You “spend” budget with deploys, config changes, experiments, and controlled chaos. You “earn” budget back by being boring for a while. When the budget gets tight, you throttle change and invest in reliability work: retries, backpressure, cache warming, graceful degradation, or paying down toil debt. If you fully breach the budget, your policy directs the next sprint’s priorities, not the loudest executive. That’s the magic: the budget turns reliability into a product decision, not an all-hands argument.
Confusion happens because SLOs and error budgets show up together in the same dashboards and meetings. They’re numerically linked, so people start using the words interchangeably. Think of it this way: the SLO is your destination; the error budget is the amount of gas in the tank. You set the destination with product; you manage the gas with engineering process. Changing your SLO changes the size of the tank. Changing your policy changes how hard you press the accelerator.
Another reason for confusion is legacy monitoring culture. If your org still pages on low-level metrics, SLOs will feel like extra ceremony, and error budgets will look like accounting tricks. Flip it: make SLOs the entry point to understanding system health, and dashboards become supporting actors instead of the main plot. Many modern observability leaders argue that SLOs should be the front door to production, not an afterthought hung on top of metric sprawl.
One camp treats error budgets as hard gates. If you’re burning too fast, you stop shipping, no debate. This is clean, fair, and effective at stopping death-by-a-thousand-cuts. It also signals to executives that reliability is a first-class constraint, not a suggestion. The policy literature is full of examples where this clarity forced a meaningful pause and unlocked quarters’ worth of reliability investment that would otherwise never have been prioritized.
The other camp treats error budgets as guardrails. You don’t slam the brakes for every skid; you steer, slow down, and continue learning. Advocates argue that fast, small, reversible changes are safer than big-bang releases, and that tying deploys to a single monthly number can calcify teams into fear. They’d rather use the budget to choose how to ship rather than whether to ship, and to improve feedback loops so Friday deploys aren’t scary legends.
There’s another perennial debate about alerting: should you page on short windows to catch issues faster, or long windows to avoid flapping? The modern consensus lands on multi-window burn-rate alerts to get the best of both worlds: a fast page when you’re shredding budget, and a measured ticket when slow burn will eat the month. Tooling from open source and vendors alike now bakes this in, making it far easier to adopt the practice than it was a few years ago.
How far should we decouple SLO setting from quarterly targets? If every missed SLO turns into a missed OKR, you’ll inflate targets. If you never connect them, you’ll ignore reliability until the invoice arrives, paid in pager fatigue and churn.
How many SLOs are too many? There’s a strong argument for a handful of well-chosen SLOs, because a dozen “priorities” is just a to-do list in disguise. But complex platforms and multi-tenant products often need multiple user journeys to be represented. The art is in keeping them coherent.
Should budgets be per-service or end-to-end? A per-service SLO helps teams own their destiny; an end-to-end SLO protects the user experience across boundaries. The best orgs do both and make sure incentives are aligned so nobody wins by breaking their neighbor.
First, make SLOs the language of your on-call. Start small: pick one top user journey, define a crisp SLI that a product manager would recognize as user happiness, and choose a starter target that reflects reality. Resist the fantasy of 100%. Run a calibration month. Talk about the SLO in stand-up, in post-mortems, and in planning. When someone says, “We had a great week, CPU was down five percent,” reply with, “Users had 99.92% good requests; we spent 20% of the monthly budget.” Watch priorities realign without a single “please” slide.
Second, implement burn-rate alerting that respects human sleep schedules. Tune two paths: a fast high-burn page that triggers within minutes when a big chunk of budget is on fire, and a slow low-burn ticket that catches chronic issues before they silently bankrupt the month. Tie your runbooks to these paths. The fast page points to mitigation; the slow ticket points to investigation. Your alerts will stop competing with Netflix for your attention, and your mean time to reasonable decision will improve.
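Here is one way the two paths could be sketched, loosely following the multi-window, multi-burn-rate pattern from the Google SRE Workbook. The thresholds (14.4 for the fast page, 1.0 for the slow ticket) and the idea of requiring both a short and a long window to burn hot are from that pattern; the function shape itself is illustrative:

```python
# Two alerting paths on one SLO: a fast high-burn page and a slow low-burn ticket.
# Thresholds and window semantics are illustrative defaults, not a prescription.

def burn_rate(error_fraction: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' we are failing.
    A burn rate of 1.0 spends the whole budget in exactly one window."""
    budget = 1.0 - slo
    return error_fraction / budget

def classify(slo: float, short_err: float, long_err: float,
             page_threshold: float = 14.4, ticket_threshold: float = 1.0) -> str:
    """The fast page requires BOTH a short and a long window burning hot,
    which filters out brief spikes that self-resolve before anyone could act."""
    short_burn = burn_rate(short_err, slo)
    long_burn = burn_rate(long_err, slo)
    if short_burn > page_threshold and long_burn > page_threshold:
        return "page"    # a big chunk of budget is on fire right now
    if long_burn > ticket_threshold:
        return "ticket"  # chronic slow burn; investigate in business hours
    return "ok"

# With a 99.9% SLO (0.1% budget), a sustained 2% error rate burns at 20x:
print(classify(0.999, short_err=0.02, long_err=0.02))     # page
print(classify(0.999, short_err=0.0005, long_err=0.002))  # ticket
```

In practice you’d evaluate this from your metrics backend (Prometheus-style recording rules are a common implementation), but the decision logic is exactly this small.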
Third, write and socialize a real error budget policy. This isn’t a wiki footnote. It should say what happens at 25%, 50%, 75%, and 100% of budget spent, who can make exceptions, and how you get out of “budget jail.” Include the positive: how to earn budget back with boring weeks, test hardening, or feature flags that reduce blast radius. If you’re a risk-loving org, keep shipping but change how you ship—smaller changes, progressive delivery, and automatic rollback on budget burn spikes. If you’re in a regulated or high-stakes domain, use stricter gates. The key is to pick one style on purpose, not by accident.
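One possible shape for that escalating policy, expressed as code so it can live next to the dashboards. The thresholds and actions below are examples only; a real policy also names owners and exception paths:

```python
# A stepped error-budget policy: the strictest crossed threshold wins.
# Thresholds and wording are illustrative, not a recommended policy.

POLICY = [  # ordered strictest-first
    (1.00, "freeze feature deploys; reliability work only until burn stabilizes"),
    (0.75, "reliability items jump the backlog; extra review on risky changes"),
    (0.50, "slow merges; progressive delivery and feature flags required"),
    (0.25, "heads-up in standup; watch the burn-rate dashboards"),
]

def policy_action(budget_spent: float) -> str:
    """Return the action for the highest threshold already crossed."""
    for threshold, action in POLICY:
        if budget_spent >= threshold:
            return action
    return "ship normally"

print(policy_action(0.6))  # slow merges; progressive delivery and feature flags required
```

Writing the policy down this literally has a side benefit: nobody can claim they didn’t know what 75% spent was supposed to trigger.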
Fourth, pair SLOs with engineering economics. Reliability isn’t free. When the budget bleeds, track the real costs: extra on-call, lost developer cycles, incident toil, user churn. When the budget is healthy, celebrate and invest the surplus in risk-reduction experiments, capacity tuning, or product bets that were too spicy last quarter. Framing the budget as a portfolio of risk allows leadership to see reliability as a lever, not a tax.
Fifth, keep SLOs and error budgets alive in incident analysis. Don’t just ask, “What failed?” Ask, “How much budget did we burn, and why didn’t our alerts catch the burn earlier?” If a single retry policy saved you from paging, document it as a reliability multiplier and treat it like an asset. If a maintenance window choice protected the budget, keep doing that; if it made things worse, move it. Being explicit about budget impact turns retros from war stories into investment memos.
Imagine a payments API with a 99.95% availability SLO over a rolling 28-day window. That gives roughly 20 minutes of allowed unavailability per month. Monday morning, a dependency spikes latency and your success-rate SLI dips. The fast burn-rate alert pages you because, at this rate, you’d spend half the month’s budget by lunch. You flip a feature flag to degrade gracefully—less data per response, more success per second—and the burn subsides. You’ve spent 10% of your monthly budget by noon but avoided a full outage. Your policy says you keep shipping, but with smaller PRs and a senior reviewer for changes touching that dependency. On Friday, a slow burn alert files a ticket: that small degradation is still nibbling the budget. You shift one team’s sprint to address it and enter the weekend with 70% of your budget intact, your PM happy, and your pager quiet.
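The numbers in that scenario check out with a few lines of arithmetic, which is worth doing for your own SLOs before you commit to them:

```python
# Sanity-checking the scenario above: 99.95% availability over a rolling
# 28-day window, a morning incident spending 10%, and 70% left by Friday.

WINDOW_MINUTES = 28 * 24 * 60            # 40,320 minutes in the window
budget_minutes = (1 - 0.9995) * WINDOW_MINUTES
print(round(budget_minutes, 1))          # 20.2 minutes of allowed unavailability

incident_spend = 0.10 * budget_minutes
print(round(incident_spend, 1))          # 2.0 minutes' worth burned by noon

remaining = 0.70 * budget_minutes
print(round(remaining, 1))               # 14.1 minutes left entering the weekend
```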
SLOs and error budgets exist because people do. Without them, we argue forever about performance graphs and feelings. With them, we argue for fifteen minutes about a number we agreed on last quarter. They create a shared reality between PMs who want acceleration and SREs who want seatbelts. They stop blame from being the default and make trade-offs explicit. They also make it easier to say “no” or “not yet” without sounding like the Department of No. That’s a superpower for an engineering org.
The difference between SLOs and error budgets is the difference between strategy and tactics. The SLO tells you where “reliable enough” lives. The error budget tells you how much road you’ve got left before you skid past it. Set them deliberately, measure them honestly, and agree on what you’ll do when they disagree with your calendar. Then, when someone asks if they can deploy on Friday, you can shrug and say, “What does the budget say?” and get back to the important work of hoping nobody finds Step 1 in your runbook.
References
“Implementing SLOs,” Google SRE Workbook — https://sre.google/workbook/implementing-slos/
“Error Budget Policy for Service Reliability,” Google SRE Workbook — https://sre.google/workbook/error-budget-policy/
“Alerting on SLOs,” Google SRE Workbook — https://sre.google/workbook/alerting-on-slos/
“Burn rate is a better error rate,” Datadog — https://www.datadoghq.com/blog/burn-rate-is-better-error-rate/
“Observability: the present and future, with Charity Majors,” The Pragmatic Engineer — https://newsletter.pragmaticengineer.com/p/observability-the-present-and-future
#SRE #SiteReliability #DevOps #SLO #ErrorBudget #ReliabilityEngineering #Observability #OnCall #DevOpsCulture #SLI