If you’ve been around Site Reliability Engineering long enough, you’ve seen the SLO pendulum swing. First we worshipped per-service SLOs: crisp, measurable, neatly owned. Then we learned that a user doesn’t care if Service-B-North had 99.95% uptime when checkout still failed. Cue the rise of user-journey SLOs (often built from Real User Monitoring, or RUM): “Can a real person sign in, search, add to cart, and pay in a reasonable time, from a phone on a train, with notifications lighting up like a pinball machine?”
Both perspectives are right—and wrong—depending on where you sit. The trick isn’t choosing “one SLO to rule them all”; it’s composing the right stack of SLOs and giving your org a living error-budget policy that adapts as reality changes.
Let’s map the battlefield, then get practical about building SLOs that actually move the needle for users and teams.
Per-service SLOs are the easy sell inside engineering. They pin a reliability goal to something you can actually change. They align to code ownership, team charters, and incident response. When a database replica face-plants at 03:17, a service-level SLO points to the on-call who can fix the thing without convening a constitutional convention.
Well-designed per-service SLOs also protect you from local surprises that haven’t yet bubbled up to users. They make your alerts actionable. They let platform teams, API teams, and backend services prove they’re meeting their end of the bargain. And when things go sideways, they give you a target for “what good looks like” that isn’t colored by the chaos of the moment.
The risk, of course, is local optimization. You can have a forest of green per-service SLOs and a charcoal-gray user experience. We’ve all lived the “our cache is fine, must be those front-end folks” blame carousel.
User-journey SLOs start with Critical User Journeys (CUJs)—the flows that define success for your product. “Create new workspace,” “Find a flight and pay,” “Submit a claim,” “Stream the first 10 seconds smoothly.” These SLOs track whether real users are succeeding at the tasks that matter, measured with latency, success rate, and responsiveness as users experience them.
The web ecosystem has even formalized some user-centric measurements. One useful example: Interaction to Next Paint (INP) became a Core Web Vital in March 2024, replacing First Input Delay (FID) as the primary interactivity metric. That wasn’t a vanity swap; it was an acknowledgment that what people feel when an app stutters matters more than what a single API thinks about its tail latencies. CUJ-driven SLOs embrace that mindset. They turn “is the service healthy?” into “did the person get what they came for?”
The risk? Attribution and noise. RUM sees the world from the browser or device, which means radio-hopping networks, battery-starved CPUs, ad blockers, and occasionally a pet ferret walking across the keyboard. You’ll need good sampling, segmentation, and privacy-aware instrumentation. And when the “Checkout-Complete within 3 seconds” SLO turns red, you still have to find which subsystem got grumpy.
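To make that concrete, here’s a minimal sketch of computing a CUJ SLI from RUM events. The event shape and field names are illustrative assumptions, not any particular vendor’s schema:

```python
from dataclasses import dataclass

# Illustrative RUM event shape; real pipelines carry richer schemas.
@dataclass
class JourneyEvent:
    journey: str        # e.g. "checkout"
    duration_ms: float  # end-to-end time as the user experienced it
    succeeded: bool     # did the user complete the task?
    segment: str        # e.g. "android/mid-tier/high-latency"

def cuj_sli(events: list[JourneyEvent], journey: str,
            latency_budget_ms: float) -> float:
    """Fraction of journey attempts that succeeded within the latency budget."""
    attempts = [e for e in events if e.journey == journey]
    if not attempts:
        return 1.0  # no traffic, no evidence of pain (a policy choice)
    good = sum(1 for e in attempts
               if e.succeeded and e.duration_ms <= latency_budget_ms)
    return good / len(attempts)

# “Checkout completes within 3 seconds”:
# sli = cuj_sli(events, "checkout", latency_budget_ms=3000)
```

Segmentation lives upstream of this function: filter events down to one device or network slice before calling it, and the same code yields per-segment SLIs.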
Picture two teams at a release review.
Team Per-Service: “Our 99.95% SLO is green. The error budget is practically untouched. We shipped four features. It’s not us.”
Team User-Journey: “Our Add-to-Cart completion rate dipped below 99.2% when the INP got spicy on mobile. Users were rage-clicking the button like they were playing whack-a-mole. It is definitely you.”
Both teams are telling the truth—from different vantage points. If you optimize only per-service, you can meet all your numbers and still miss the moment where the product loses users. If you optimize only CUJs, you can burn weeks arguing about root cause, while the platform team quietly points at a flawless SLO.
The grown-up move is to admit both are necessary and make them work together.
In healthy SRE programs, the SLO conversation looks like a pyramid. At the top, CUJ/RUM SLOs answer “are users succeeding?” In the middle, product-oriented service SLOs reflect the pieces users touch directly (APIs, rendering services, search, payments). At the base, infrastructure SLOs keep shared dependencies honest (databases, queues, edge, identity).
Top-level SLOs set the bar for what “reliable enough” means in human terms. Middle- and lower-level SLOs make the result operational—identifiable owners, clear runbooks, and controllable levers. When a CUJ burns budget, your dependency graph and tracing connect the top-level symptom to the service-level cause. When a service burns budget without a CUJ signal, you’ve likely prevented tomorrow’s user pain.
The glue is a living error-budget policy that understands the stack. It can temporarily freeze deploys for the service actually burning the shared CUJ budget, not the three teams upstream that are green. It can escalate when a chronic, low-grade slowness eats half a month’s budget without a “boom” moment. And it can relax during planned, business-approved experiments that intentionally spend budget for learning.
A living policy is not a dusty Confluence page. It evolves with your product, your seasonality, and your incident learnings. A few hallmarks:
It ties specific actions to burn-rate signals. Paging on “we’ll blow the 28-day budget in three days” is very different from paging on a single spike. The modern practice is multi-window, multi-burn-rate alerting: fast windows to catch acute failures, slow windows to catch smoldering ones. Your on-call shouldn’t learn about a slow-motion budget disaster from your quarterly business review.
It operates across levels. If the “Search-and-Filter” CUJ is on track to miss this week, the policy identifies the services most responsible and moves their levers first: pausing rollouts, rolling back canaries, dialing down feature flags, tightening circuit breakers, adjusting cache TTLs. Meanwhile, other teams can keep shipping. The person on the hook is the one burning the shared budget.
It adjusts for seasonality. Retail during the holidays, tax software in April, education platforms in September—same math, different stakes. A living policy defines “peak calendars” with stricter CUJ targets, tighter alert thresholds, and alternative mitigations. During a critical period, you might pre-approve a change freeze if a CUJ’s slow-burn alert fires, while off-peak you accept more risk in exchange for delivery velocity.
It bakes in governance and learning. At a regular cadence—a quarter is common—you review how each SLO behaved, whether alerts fired when they should have, and whether the policy drove the right actions. Update thresholds, rewrite noisy alerts, prune zombie SLOs, and retire “aspirational” targets that nobody can influence.
Finally, it’s mercifully specific. “If CUJ-Checkout success falls below X for Y minutes, we A/B switch to the legacy payment flow; if the burn-rate indicates budget exhaustion within 24 hours, we halt new card-brand rollouts and revert to the last known-good canary.” Ambiguity is cute in poetry, not in 03:00 incident channels.
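One way to get that specificity is to write the policy as data rather than prose. Here’s a sketch with hypothetical rule names and actions; the point is that an incident bot can quote the exact clause it acted on:

```python
# A living policy expressed as data instead of prose. Every name and
# threshold here is illustrative; wire the actions to your own tooling.
POLICY_RULES = [
    {
        "slo": "cuj-checkout-success",
        "condition": "sli below target for 15 consecutive minutes",
        "action": "switch_to_legacy_payment_flow",
        "owner": "payments-gateway",
    },
    {
        "slo": "cuj-checkout-success",
        "condition": "burn rate predicts budget exhaustion within 24h",
        "action": "halt_card_brand_rollouts, revert_to_last_good_canary",
        "owner": "payments-gateway",
    },
]
```

Because the rules are data, reviews can diff them, and the 03:00 incident channel gets a quoted clause instead of an argument.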
Gather PMs, SREs, support, and the person who reads every angry tweet. Write down the three to five CUJs that define success. No dashboards yet—just stories. “From cold-start to first play under three seconds,” “From photo capture to post under two seconds,” “Claim filed and reference number shown in under ten seconds.” Then translate those into SLIs users can feel: end-to-end latency, success rate, and responsiveness across devices and geos. Treat them like product requirements, because they are.
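For instance, here is a minimal sketch of those workshop stories as structured requirements (the targets and windows are illustrative):

```python
from dataclasses import dataclass

@dataclass
class CujSlo:
    journey: str       # the story from the workshop
    threshold_ms: int  # "fast enough" in human terms
    target: float      # fraction of attempts that must be good
    window_days: int   # rolling evaluation window

# Illustrative targets; negotiate the real numbers with product.
CUJ_SLOS = [
    CujSlo("cold-start-to-first-play", 3_000, 0.99, 28),
    CujSlo("photo-capture-to-post", 2_000, 0.99, 28),
    CujSlo("claim-filed-to-reference-shown", 10_000, 0.995, 28),
]
```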
The magic here is political as much as technical. CUJ SLOs get product and engineering arguing about the right things. They don’t replace service SLOs; they aim them.
RUM tells you what users actually experienced; synthetic tells you what your system should have done under a controlled scenario. Use both. Let RUM drive CUJ SLOs and segmentation—e.g., “Android devices on mid-tier CPUs in high-latency networks”—while synthetic runs give you clean baselines for regression hunting and CI/CD gates. When CUJ fails, you’ll want the triangulation: was it truly a backend issue, a front-end regression bloating the DOM, or an LTE thunderstorm?
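A synthetic baseline can be as humble as a CI gate that times one journey step against a clean budget. Here’s a sketch using only the standard library; the URL, budget, and run count are placeholders, and a real synthetic journey would script the full flow with a browser driver:

```python
import time
import urllib.request

JOURNEY_URL = "https://staging.example.com/api/search?q=flights"  # placeholder
BUDGET_MS = 800.0
RUNS = 5

def worst_latency_ms() -> float:
    """Worst of N runs: a conservative, low-noise regression signal."""
    samples = []
    for _ in range(RUNS):
        start = time.monotonic()
        with urllib.request.urlopen(JOURNEY_URL, timeout=10) as resp:
            resp.read()
        samples.append((time.monotonic() - start) * 1000)
    return max(samples)

if __name__ == "__main__":
    latency = worst_latency_ms()
    assert latency <= BUDGET_MS, f"synthetic baseline regressed: {latency:.0f}ms"
```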
Be deliberate with privacy. Sample wisely, anonymize aggressively, and segment without creeping into user identification. If you’re in regulated regions, partner early with legal and security. “SLOs as surveillance” is a fast way to lose friends.
CUJs cross teams. That’s kind of the point. Use tracing and your service catalog to draw the “journey graph.” When the Checkout CUJ burns budget, your runbook should calculate who burned it and how much. You don’t need a PhD model—start with weights based on historical contribution and hop counts. The output feeds your policy: the team burning 60% of the Checkout budget pauses risky rollouts first; the team at 5% keeps going.
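A first cut can be embarrassingly simple. The weights below are hypothetical; derive yours from tracing data and historical incident contribution:

```python
# First-cut budget apportionment for one journey. Weights are illustrative.
CHECKOUT_WEIGHTS = {
    "payments-gateway": 0.60,
    "frontend-web": 0.25,
    "search-api": 0.10,
    "identity": 0.05,
}

def apportion(budget_burned_min: float,
              weights: dict[str, float]) -> dict[str, float]:
    """Split a journey's burned budget (in minutes) across its services."""
    return {svc: budget_burned_min * w for svc, w in weights.items()}

def who_pauses(burned: dict[str, float], threshold_min: float) -> list[str]:
    """Services over the threshold pause risky rollouts first."""
    return [svc for svc, minutes in burned.items() if minutes >= threshold_min]

# burned = apportion(120.0, CHECKOUT_WEIGHTS)  # 120 budget-minutes burned
# who_pauses(burned, threshold_min=40.0)       # -> ["payments-gateway"]
```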
This “budget apportionment” won’t be perfect. Good. You’ll refine it each post-incident review. The psychological shift matters most: we defend user outcomes together, we tune service SLOs to align with those outcomes, and we hold the levers where they do the most good.
Page when users are on track to be hurt, not when a single pod is moody. Fast-window burn-rate alerts catch sharp outages. Slow-window burn-rate alerts catch chronic degradation (that 200-ms “we’ll fix it later” that eats half the month). In practice, teams often pair a “we’ll exhaust budget within hours” page with a “we’ll exhaust budget within days” ticket that prompts a daylight fix. The result is fewer 03:00 wake-ups and more issues resolved before customers notice.
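The arithmetic behind those pages and tickets is small. Burn rate is your observed bad-event rate divided by the rate the budget allows, and the window and threshold pairs below are the ones the SRE Workbook’s “Alerting on SLOs” chapter suggests for a 30-day budget:

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Budget spend speed as a multiple of the sustainable rate.
    A 99.9% target allows a 0.001 bad fraction; burning at exactly
    that rate for the whole window spends exactly 100% of budget."""
    return bad_fraction / (1.0 - slo_target)

# Multi-window, multi-burn-rate pairs (30-day budget), per the SRE Workbook:
ALERTS = [
    # (long window, short window, threshold, response)
    ("1h", "5m", 14.4, "page"),   # ~2% of budget gone in one hour
    ("6h", "30m", 6.0, "page"),   # ~5% gone in six hours
    ("3d", "6h", 1.0, "ticket"),  # on pace to exhaust the whole budget
]

def evaluate(rates: dict[str, float]) -> list[str]:
    """Fire only when BOTH windows exceed the threshold; the short
    window confirms the problem is still live, so alerts reset fast."""
    return [response for long_w, short_w, threshold, response in ALERTS
            if rates.get(long_w, 0.0) >= threshold
            and rates.get(short_w, 0.0) >= threshold]
```

The short window is what lets an alert stop firing soon after the fix lands, instead of paging for the remainder of the long window.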
A living policy is code. Treat its actions like product features. That means feature flags for mitigations, canary/blue-green rollouts you can reverse, and deploy tooling that reacts to SLO status. If you have progressive delivery, let a CUJ slow-burn alert throttle exposure automatically. If you have an API gateway, teach it to enforce “fairness” when one client threatens to torpedo a journey for everyone.
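As one illustration, the hook between SLO status and progressive delivery can be a single function; the exposure steps here are assumptions to tune against your own risk appetite:

```python
# Hypothetical bridge between a CUJ's slow-burn rate and rollout exposure.
def max_canary_exposure(slow_burn_rate: float) -> float:
    """Cap the traffic fraction a risky rollout may receive."""
    if slow_burn_rate >= 2.0:
        return 0.0   # budget on fire: freeze and roll back
    if slow_burn_rate >= 1.0:
        return 0.01  # on pace to exhaust budget: crawl
    if slow_burn_rate >= 0.5:
        return 0.10  # elevated: slow down
    return 0.50      # healthy: normal canary ceiling
```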
SLOs should age. When a CUJ stabilizes and nobody uses the “Search-Facet-Drilldown < 400ms 95th” SLO to make decisions, retire or roll it up into a composite. If a per-service SLO never correlates with any user pain, rewrite its SLI or downgrade its importance. SLO sprawl is how good programs go stale.
It was a Tuesday, which is SRE for “Friday.” The on-call rotation pinged for “API latency above 400ms,” then “read replica lag,” then “front-end long tasks.” The dashboards looked like a Jackson Pollock painting. Old us would have spent the next hour huddled in Zoom archaeology, each team presenting alibis while the VP refreshed the revenue chart.
With CUJ SLOs, the picture was different. The “Search-to-Checkout under three minutes” SLO triggered a slow-burn alert earlier that afternoon; the policy opened a ticket and flagged the responsible teams. By evening, the “Checkout responsiveness” SLI crossed the fast-window threshold, paging the payments gateway and front-end teams. The dependency map pointed at a recent font-loading change that expanded render-blocking work on mid-tier Android browsers. The gateway team paused a new 3-D Secure flow behind a feature flag, the front-end team reverted the font loading strategy, and the CUJ burn rate flattened. We still fixed the replica lag, but it was a supporting actor, not the villain.
We slept. Users bought things. The revenue chart was boring again. Best kind of boring.
Viewpoint A: “Per-service SLOs are more actionable.”
This camp isn’t wrong. RUM can be noisy, and CUJ blamestorming is real without clear attribution. Service SLOs give teams direct levers, clear ownership, and faster repair loops. Many organizations learned SLOs through backend availability because that’s where they had reliable telemetry and ops muscle.
Viewpoint B: “CUJ/RUM SLOs are the only ones that matter to the business.”
Also true. If users can’t complete the journey, nothing else matters. User-centric SLOs expose the real bottlenecks and align teams on outcomes. They make product and engineering speak one language. They also catch the front-end and network reality that infra metrics miss—especially with modern interactivity metrics like INP changing what “fast enough” means.
The synthesis: start with CUJs, power them with RUM, and support them with per-service SLOs that you can actually fix. Then bind the whole thing with a policy that adjusts to burn rate, seasonality, and learning.
Teams often overspecify CUJs into oblivion. “Login in 1.234 seconds at the 99.99th percentile from Antarctica on Edge 90” is not a product requirement. Start generous. Tighten later. Conversely, teams nail a per-service SLO that nobody cares about. A queue depth SLO that never maps to user pain is just a very committed graph.
Another trap is treating the policy like a punishment. Freezes and rollbacks should feel like what they are: permission to pay down reliability debt when the data says users need it. If your policy only ever says “stop shipping,” your product leaders will rightfully fight it. Include “budget-funded experiments” in your policy—a clear, time-boxed way to spend budget on learning and delivery speed when you’re well within target.
Finally, don’t let perfect be the enemy of “ship it.” You can get a long way with three CUJs, basic RUM, and a couple of burn-rate alerts. Your future self will make it fancier.
Are your current SLOs discovering outages before your customers tweet about them—or are they just explaining outages after the fact?
If your top CUJ hit red tomorrow, which team would pause rollouts first—and who would be allowed to keep shipping? Is that written down anywhere humans can find at 03:00?
What percentage of your alerts are burn-rate based versus static thresholds? Be honest: do your alerts predict budget exhaustion, or just yell when a graph looks spicy?
If you had to delete half your SLOs next week, which ones would stay because they actually changed decisions in the last quarter?
Does your policy include a legitimate path to spend error budget on experiments—or is it just a velvet rope in front of production?
SLOs are not about perfection; they’re about acceptable risk in service of a better user experience. Per-service SLOs keep the engine humming. CUJ/RUM SLOs make sure the car is actually going somewhere people want to go. The living error-budget policy is your steering wheel. Use all three, and you’ll ship faster, sleep better, and spend less time renaming dashboards to “legacy.”
See you in the comments. I’ll be the one defending the dignity of burn-rate alerts while muting my “DiskAlmostFull” page for the third time today.
Google SRE Workbook — “Implementing SLOs”
https://sre.google/workbook/implementing-slos/
Google SRE Workbook — “Alerting on SLOs” (Multi-window, Multi-burn-rate)
https://sre.google/workbook/alerting-on-slos/
Google SRE Workbook — “Example Error Budget Policy”
https://sre.google/workbook/error-budget-policy/
Google Cloud Blog — “A practical guide to setting SLOs”
https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos
Web.dev — “Interaction to Next Paint becomes a Core Web Vital on March 12”
https://web.dev/blog/inp-cwv-march-12
Google Search Central Blog — “Introducing INP to Core Web Vitals”
https://developers.google.com/search/blog/2023/05/introducing-inp
Grafana Blog — “How to implement multi-window, multi-burn-rate alerts with Grafana Cloud” (2025)
GitLab Handbook — “Engineering Error Budgets”
https://handbook.gitlab.com/handbook/engineering/error-budgets/
Datadog Documentation — Real User Monitoring
https://docs.datadoghq.com/real_user_monitoring/
Datadog Documentation — Service Level Objectives
https://docs.datadoghq.com/service_management/service_level_objectives/
Honeycomb — “SLOs” (product page and writings on paging/alerting)
https://www.honeycomb.io/platform/slos
https://charity.wtf/tag/paging-alerting/
Nobl9 — “SLO Best Practices: A Practical Guide”
https://www.nobl9.com/service-level-objectives/slo-best-practices
New Relic — “Error budget and service levels best practices” (2024)
https://newrelic.com/blog/best-practices/alerts-service-levels-error-budgets
#SRE #SiteReliability #DevOps #SLO #ErrorBudgets #Observability #RUM #CoreWebVitals #DevOpsCulture