Created on 2025-12-07 13:20
Published on 2025-12-08 11:15
Oh, the dream of simplicity. If reliability were just uptime, Windows 95 plug-n-play would be the gold standard and SREs could spend their days alphabetizing postmortems. In reality, reliability isn’t a single green LED—it’s a bundle of user outcomes: requests that return quickly enough, data that’s correct and durable, features that behave predictably, and yes, services that are actually reachable. Uptime is the lobby door. Reliability is everything behind it.
When SREs say “we measure user happiness,” we’re not kidding. We’re staring at latency percentiles, error rates, saturation, freshness, and sometimes our own burn rate—both the error budget kind and the “how many 3 a.m. pages before someone starts naming alerts after Greek tragedies” kind. Google’s SRE literature has spent years establishing that reliability is best represented by service level indicators (SLIs) and service level objectives (SLOs), not a binary “up/down” light. Error budgets translate those objectives into decision-making fuel: deploy more when you have budget, slow down when you don’t.
Let’s separate two near-twins that often get conflated. Availability is the percentage of time a workload is usable—i.e., it performs its agreed function successfully when required. Reliability is broader: can the system deliver its intended function correctly and consistently across its lifecycle? Cloud architecture guidance is explicit about this distinction because a service can be “available” yet still be unreliable if it’s slow, corrupts data, or returns stale results.
Here’s the human version. Your API is “up” but responds in seven seconds during peak. The checkout page doesn’t error, it just spins, and half your customers bail. Or your analytics dashboard is reachable but renders yesterday’s numbers because a dependency has a quiet little backlog. The status page smiles; your conversion graph does not. Reliability is whether users got what they needed, when they needed it, and believed the system did the right thing.
User-perceived reliability resists single-number reduction. That’s why SRE practice emphasizes a handful of user-centric signals—latency, traffic, errors, saturation—as a baseline for what to watch. These “golden signals” steer you toward the actual experience rather than the box’s temperature. Once you define SLIs from these signals, you set SLOs that correspond to outcomes your customers would notice. That might be “99.9% of requests under 200 ms” for a search endpoint or “99.99% of writes with durability guarantees” for a ledger.
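If you want to see what that looks like in practice, here's a minimal Python sketch of turning raw request data into a latency SLI and checking it against an SLO target. The record shape and the 200 ms / 99.9% figures are illustrative assumptions, not a prescription.

```python
# A minimal sketch: compute a latency SLI from request records and compare it
# to an SLO target. The Request shape and thresholds are illustrative only.

from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of requests answered successfully within the latency threshold."""
    if not requests:
        return 1.0  # no traffic, no observed violations
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms and r.status < 500)
    return good / len(requests)

def meets_slo(sli: float, objective: float = 0.999) -> bool:
    """True if the measured SLI satisfies the objective (e.g. 99.9%)."""
    return sli >= objective

# Example usage over one evaluation window
window = [Request(120, 200), Request(250, 200), Request(90, 200), Request(40, 500)]
sli = latency_sli(window)
print(f"SLI={sli:.3f}, meets SLO: {meets_slo(sli)}")
```

The point isn't the code; it's that "what users would notice" becomes something you can compute, graph, and argue about with numbers.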
There’s even research exploring better availability metrics that match how humans experience downtime. One notable approach is “windowed user-uptime,” which evaluates availability over rolling windows aligned to users’ expectations, rather than an average that hides pain in the corners. It’s a nudge to optimize for the streaks that matter—like not going wobbly every payday at 17:55.
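To make the windowed idea concrete, here's a rough Python sketch that reports the worst rolling window instead of the monthly average. It's a simplification of the paper's windowed user-uptime, with per-minute success ratios standing in for per-user data.

```python
# A loose sketch of the "windowed" idea: instead of one monthly average,
# evaluate availability over every rolling window of a given size and report
# the worst one. This simplifies the paper's windowed user-uptime; per-minute
# success ratios stand in for per-user measurements.

def worst_window_availability(per_minute_success: list[float], window_minutes: int) -> float:
    """Minimum availability across all rolling windows of the given size."""
    if len(per_minute_success) < window_minutes:
        return sum(per_minute_success) / max(len(per_minute_success), 1)
    worst = 1.0
    for start in range(len(per_minute_success) - window_minutes + 1):
        window = per_minute_success[start:start + window_minutes]
        worst = min(worst, sum(window) / window_minutes)
    return worst

# A month whose average looks fine can still hide a brutal hour:
month = [1.0] * 43_200                      # 30 days of per-minute success ratios
month[1000:1030] = [0.2] * 30               # one ugly 30-minute brownout
print(worst_window_availability(month, 60)) # the worst hour sits far below the average
```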
At 02:11, the pager suggests “all clear,” but the product team’s chat is on fire. A promo code field is accepting entries but applying a discount of zero. The app is “up,” the endpoint returns 200 OKs, the dashboards are mostly green. The SLI we forgot? Correctness. Users perceive reliability as “did I get the price I expected?” not “did the POST succeed?” We added a correctness SLI the next day—ratio of validated discounts to attempts—and an SLO that pairs it with latency so our rollback triggers on “fast but wrong” just as surely as “slow and right.”
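For the curious, here's a hedged sketch of what that paired SLI can look like. The DiscountAttempt shape, thresholds, and validation check are all illustrative assumptions; the hard part in real life is the validation logic itself.

```python
# A sketch of a correctness SLI paired with latency, so a rollout is judged
# bad when either signal degrades. All names and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class DiscountAttempt:
    applied_amount: float
    expected_amount: float
    latency_ms: float

def correctness_sli(attempts: list[DiscountAttempt]) -> float:
    """Ratio of discounts applied at the expected value to total attempts."""
    if not attempts:
        return 1.0
    valid = sum(1 for a in attempts if abs(a.applied_amount - a.expected_amount) < 0.01)
    return valid / len(attempts)

def rollout_is_healthy(attempts: list[DiscountAttempt], correctness_target: float = 0.999,
                       latency_target: float = 0.99, threshold_ms: float = 300) -> bool:
    """Fail the rollout on 'fast but wrong' as well as 'slow and right'."""
    if not attempts:
        return True
    fast_enough = sum(1 for a in attempts if a.latency_ms <= threshold_ms) / len(attempts)
    return correctness_sli(attempts) >= correctness_target and fast_enough >= latency_target
```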
The uptime-only camp prizes simplicity and communication clarity. “Five nines” is easy to sell to execs, easy to benchmark, and ties neatly to SLAs. The argument goes: optimize for availability and the rest follows, and people across the company can rally around one target. Beware, though, the spreadsheet that says “99.99% = good enough; stop fiddling.”
The counter-pressure fueling this view is not irrational. One metric reduces local optimizations, avoids scorecard sprawl, and focuses investment. If you’ve ever watched a team tie itself in knots debating 95th vs. 99th percentile, you know why single-number dreams persist.
The SLO school acknowledges that different user journeys need different promises. A search suggestion can tolerate occasional blips; a payment capture cannot. They instrument SLIs per journey, set SLOs to match user sensitivity, and govern change velocity with error budgets. Alerting is driven by burn rate—the speed at which you consume the budget—rather than raw thresholds, and it uses multiple windows so you neither page too slowly nor keep alerting long after the problem is fixed. In practice, this approach gives you fewer false positives, faster resolution, and a way to have adult conversations about risk. It’s messy to explain, but tractable to operate.
DevOps research has long argued you don’t have to pick between throughput and stability. But the 2024 DORA cycle introduced nuance: some cohorts showed weaker coupling between the “go fast” and “don’t break things” metrics, a reminder that context matters and platform effects can shift trade-offs. In other words, the industry’s scoreboard is evolving, and so should the way we measure reliability.
There’s also the human variable. Several recent surveys show incident volume and alert fatigue rising—and with them, burnout. Burnout is not a vibes metric; it’s a leading indicator of worse incident outcomes later. Treating “team sanity” as a first-class reliability concern isn’t soft—it’s preventative maintenance for your mean time to everything.
So can reliability actually be measured? Yes. You can measure the parts of it that matter to your users and build a governance loop that nudges the system toward those outcomes. Here’s how SRE and DevOps teams do it without burying themselves in dashboards that compete with Netflix for your attention.
Start by mapping the journeys that pay the bills and the journeys that keep the lights on. For each, choose SLIs that reflect user pain: latency percentiles that match human patience curves, error rates for the stages where wrong is worse than slow, freshness where data staleness equals reputational harm, and durability where loss is unacceptable. When a retailer aligned its SLOs to “search response under 250 ms 99.5% of the time” and “checkout completion success at 99.97%,” incidents that previously hid under “we’re up” became visible. The effect wasn’t more work; it was better triage.
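One way to keep that journey-to-SLO mapping honest is to write it down as data that alerting, reporting, and release gating all read from. The structure and numbers below mirror the retailer example above and are purely illustrative.

```python
# A sketch of recording per-journey SLOs as data so every tool reads the same
# definitions. Journey names, SLI types, and targets are illustrative.

SLOS = {
    "search": {
        "sli": "latency",            # fraction of requests under the threshold
        "threshold_ms": 250,
        "objective": 0.995,          # 99.5% of requests under 250 ms
        "window_days": 30,
    },
    "checkout": {
        "sli": "success_rate",       # completed checkouts / attempted checkouts
        "objective": 0.9997,         # 99.97% completion success
        "window_days": 30,
    },
    "reporting": {
        "sli": "freshness",          # fraction of reads served data newer than the limit
        "max_staleness_min": 15,
        "objective": 0.99,
        "window_days": 30,
    },
}
```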
SLOs also need numbers. If your monthly SLO is 99.9% availability, your error budget is 0.1% of that period. Over 30 days, that’s 43.2 minutes of budget. You can spend that allowance in a burst or nibble it away in brownouts, but the budget is real and it frames release decisions. When leadership asks, “Can we ship this Friday?” the answer stops being a shrug; it becomes a math problem tied to user experience.
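The arithmetic is simple enough to put in a helper, which is exactly how “can we ship this Friday?” becomes a number instead of a shrug. A minimal sketch, assuming your availability SLI already tells you how many minutes were “bad”:

```python
# The arithmetic behind the 43.2-minute figure, generalized. What counts as a
# "bad" minute is assumed to come from your availability SLI.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, bad_minutes_so_far: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes_so_far) / budget

print(error_budget_minutes(0.999))    # 43.2 minutes per 30 days
print(budget_remaining(0.999, 10.0))  # roughly 0.77 of the budget left
```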
Traditional threshold alerts are needy toddlers—everything is an emergency. Burn-rate alerting watches how quickly you’re consuming the error budget and looks across both short and long windows. This catches fast-moving disasters (hello, bad deploy) and slow-rolling misery (hi, noisy neighbor) without paging you into oblivion. Multi-window designs also resolve quickly once you fix the issue, which is a mercy at 03:00. If your platform supports it, use recommended window/threshold pairings from the SRE workbook; if not, vendors and open source ecosystems have production-tested defaults to start from, and you can tune from there.
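Here's a hedged sketch of what multi-window, multi-burn-rate paging can look like. The burn_rate function is assumed to be whatever your monitoring stack exposes, and the window/threshold pairings follow the spirit of the SRE Workbook's examples rather than reproducing them verbatim; tune them for your own SLO and window.

```python
# A sketch of multi-window, multi-burn-rate alerting. burn_rate(window) is
# assumed to return budget consumption speed over that window (1.0 = consuming
# exactly your budget, 14.4 = 14.4x too fast). Numbers are starting points.

from typing import Callable

def should_page(burn_rate: Callable[[str], float]) -> bool:
    """Page only when both the long and short windows agree the budget is burning fast."""
    fast = burn_rate("1h") >= 14.4 and burn_rate("5m") >= 14.4   # ~2% of a 30-day budget per hour
    slow = burn_rate("6h") >= 6.0 and burn_rate("30m") >= 6.0    # slower but still serious burn
    return fast or slow

def should_ticket(burn_rate: Callable[[str], float]) -> bool:
    """Slow burns become tickets, not 03:00 pages."""
    return burn_rate("3d") >= 1.0 and burn_rate("6h") >= 1.0
```

The short companion windows are what make these alerts resolve quickly once the underlying issue is fixed, which is the mercy mentioned above.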
Availability matters. It’s just not the whole story. Keep a clear availability SLI because it’s legible and pairs well with SLAs, finance, and capacity planning. But resist letting it crowd out latency and correctness. Think of availability as your brake pedal—vital, measurable, communicable—while the engine is user-perceived performance and the steering is correctness. The car analogy breaks quickly in prod, but you get the point.
SLOs aren’t garden gnomes; they’re levers. Tie them to change management so error budget status actually influences deploy cadence. Full budget? Experiment. Budget burned? Slow rollouts, invest in reliability work, and prioritize toil reduction. The cultural part is the hardest: product, platform, and SRE must agree that occasionally saying “not now” to features is how you say “yes” to user trust. If the roadmap never flexes when the budget is gone, your SLOs are wall art.
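Here's a sketch of what that agreement can look like once it's code instead of a recurring meeting. The thresholds and policy levels are examples of the kind of deal product, platform, and SRE negotiate together, not a standard.

```python
# A sketch of wiring error budget status into release decisions. The cutoffs
# and policy wording below are illustrative assumptions.

def release_policy(budget_remaining: float) -> str:
    """Map remaining error budget (fraction of the window's budget) to a change policy."""
    if budget_remaining > 0.5:
        return "normal: ship features, run experiments"
    if budget_remaining > 0.1:
        return "caution: slow rollouts, canary everything, prioritize reliability fixes"
    return "freeze: reliability work only until the budget recovers"

print(release_policy(0.77))  # plenty of budget left
print(release_policy(0.05))  # time to say 'not now'
```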
If engineers are drowning in pages, the system is unreliable even if the graphs are green. Alert fatigue predicts mistakes, attrition, and slower recovery in the next incident. Build humane on-call rotations, prune noisy alerts, and automate low-judgment runbook steps. Consider service ownership models that spread expertise so the same three people aren’t summoned every weekend. Above all, measure it: track pages per shift, sleep interruptions, and mean time to “ugh.” Treat those as first-class reliability metrics because they are.
As your product and user base change, your SLOs should too. Revisit whether your “golden signals” still mirror pain, and experiment with user-centric availability metrics that don’t hide short, sharp outages in monthly averages. If payday Fridays are sacred, align your windows and objectives so you defend them accordingly. You don’t need perfect metrics; you need metrics that make better decisions obvious.
A good SRE is part skeptic, part scientist, and we haven’t solved everything. A few open questions are worth sitting with.
First, how should AI-driven alerting and triage play with SLOs? There’s promising work on correlation and prioritization that could reduce human alert fatigue, but “let the model decide” is not a silver bullet. Guardrails and good data matter more than ever.
Second, are the classic delivery-performance metrics sufficiently predictive of reliability in 2025? DORA’s latest nuance suggests the relationships evolve with platform maturity, team structure, and domain. That should make us curious, not cynical.
Third, can we bake “team health” right into reliability dashboards, not as a sidecar but as a primary panel? If incidents rise when humans are exhausted, it’s irresponsible to treat that as an HR-only concern. SREs have the telemetry chops to measure and act on it.
“You can’t measure reliability—it’s just uptime” is the tech equivalent of “you can’t measure health—it’s just a pulse.” Uptime is vital; it’s also table stakes. Reliability is a relationship with your users, and like any relationship, it takes multiple signals to know how it’s going. Choose SLIs that match real pain, set SLOs that express real promises, govern with error budgets, and keep your humans whole. Then, the next time someone triumphantly announces “we’re 100% up,” you can smile and say, “Great start.”
Implementing SLOs (SRE Workbook) — https://sre.google/workbook/implementing-slos/
Reliability Pillar — AWS Well-Architected Framework — https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
Meaningful Availability (USENIX NSDI ’20) — https://www.usenix.org/system/files/nsdi20-paper-hauer.pdf
DORA Report 2024 – A Look at Throughput and Stability (RedMonk) — https://redmonk.com/rstephens/2024/11/26/dora2024/
Alerting on SLOs (SRE Workbook) — https://sre.google/workbook/alerting-on-slos/
#SRE #SiteReliability #DevOps #SLO #SLI #ErrorBudgets #Observability #GoldenSignals #ReliabilityEngineering #OnCall #DORA #BurnRate