There is always someone who wants 100% reliability, 100% uptime, 100% certainty, 100% confidence, and ideally by next quarter without reducing feature velocity. That person is usually enthusiastic, persuasive, and blissfully untouched by the infrastructure bill.
And this is where SRE enters the room like the least fun but most necessary adult at the party.
One of the most useful ideas in Site Reliability Engineering is that reliability is not a moral virtue. It is not holiness for distributed systems. It is not a personality test for engineering leadership. It is a choice. More specifically, it is a business and engineering choice made under constraints: money, people, time, architecture, user expectations, compliance demands, and the awkward reality that production is a deeply creative place for failure.
Google’s SRE guidance has been remarkably blunt about this for years: the point is not “zero outages,” but using error budgets to manage the tension between innovation and stability. In that model, some degree of failure is not a scandal. It is anticipated, measured, and governed so product and operations can make sane trade-offs together.
That framing still feels radical in organizations where uptime gets discussed like a sacred vow. It should not. The moment you try to force absolute reliability onto a non-absolute system, you stop doing engineering and start performing expensive wishful thinking.
The most uncomfortable truth in reliability work is also the most practical one: every extra nine costs something.
A 99.9% uptime target allows roughly 43 minutes of downtime per month. Push that to 99.95% and you are down to about 21 minutes. Push to 99.99% and the room gets very quiet, because now you are working inside a budget of roughly four minutes a month, where small incidents suddenly become executive events. Microsoft’s reliability guidance explicitly warns against overengineering beyond what business requirements justify, which is about as close as cloud documentation gets to saying “please stop trying to buy yourself a personality with more redundancy.”
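The arithmetic behind those numbers fits in a few lines. A throwaway sketch, assuming a 30-day month:

```python
# Downtime allowed per 30-day month for a given availability target.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for target in (0.999, 0.9995, 0.9999, 0.99999):
    budget_minutes = MINUTES_PER_MONTH * (1 - target)
    print(f"{target:.3%} uptime -> {budget_minutes:6.1f} min of downtime/month")
```

Five nines leaves roughly 26 seconds a month, which is why that phrase tends to vanish from the conversation once it has been priced.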
That extra reliability is never free. It usually arrives dragging friends behind it: multi-region complexity, failover choreography, stricter release gates, deeper observability, more expensive data strategies, tougher testing requirements, more operational burden, and an on-call experience that starts to resemble a hostage negotiation with entropy. AWS frames architecture decisions as trade-offs among reliability, cost, efficiency, and other concerns for exactly this reason.
In other words, when someone says “just make it five nines,” they are not asking for a percentage. They are asking for architecture, staffing, process, tooling, operational maturity, and recurring spend. Often without realizing it.
That is why mature SRE teams do not begin with “How do we maximize uptime?” They begin with “What reliability do users actually need, what failures matter most, and what is the cheapest responsible way to meet that need?” It sounds less heroic, which is precisely why it works.
The number 100% is emotionally satisfying because it feels simple. Users like certainty. Executives like certainty. Boards definitely like certainty. Unfortunately, complex systems do not.
Cloud providers themselves are careful about this. Microsoft notes that an SLA is not the same thing as actual user-experienced reliability, and that architects should not simply copy provider uptime percentages into their own targets. SLAs are conditional commitments with exclusions and definitions; they are signals, not magic shields.
That matters because real production reliability is composite. It includes your code, your dependencies, your deployment habits, your queues, your DNS, your cloud region choices, your authentication path, your monitoring blind spots, your humans, your change management, and that one service account no one wants to touch because “it’s been fine for years.” The system is never just the thing on the architecture slide. The system is also all the glue, assumptions, and sleepy decisions surrounding it.
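A back-of-the-envelope way to see the composite effect: when a request path crosses several components in series, their availabilities multiply. The components and numbers below are illustrative, and the independence assumption is generous, since real failures love to correlate:

```python
# Serial composition: the request succeeds only if every hop succeeds.
# Assumes independent failures, which production rarely grants you.
dependencies = {
    "app code": 0.9995,
    "database": 0.9999,
    "auth path": 0.9995,
    "DNS + load balancer": 0.9999,
}

composite = 1.0
for name, availability in dependencies.items():
    composite *= availability

print(f"Composite availability: {composite:.4%}")  # ~99.88%, worse than any single hop
```

Four individually respectable components, and the whole is already below 99.9% before a single human or deploy pipeline enters the picture.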
And the closer you push toward perfection, the more fragile your organization can become in weird ways. Teams delay needed changes because they fear burning budget. Releases become ceremonious. Local optimizations multiply. Complex failover paths are added “for safety” and then quietly become the most dangerous code in the company because nobody exercises them enough. Google’s SRE workbook emphasizes simplicity as an end-to-end goal precisely because complexity expands risk, not just capability.
There is a particular irony here that every on-call engineer eventually learns the hard way: trying to eliminate all failure modes often creates brand-new failure modes. Some of them have better documentation, which is lovely, but they still wake people up at 3:17 a.m.
This debate usually shows up in one of two costumes.
The first camp says reliability must dominate. Their argument is not stupid. Customers remember outages more vividly than roadmap slides. Revenue systems, trust-heavy platforms, financial workflows, and anything customer-facing at scale can suffer real damage from instability. This view becomes even more persuasive when legal, contractual, or safety concerns are involved. In genuinely safety-critical environments, some forms of near-total assurance are not overkill at all; they are the job. NASA guidance for safety-critical software, for example, includes expectations of complete test coverage, or an explicit risk assessment where full coverage cannot be achieved.
The second camp says overprotecting reliability kills progress. Also not stupid. If every release requires a religious ceremony, twelve approvals, a maintenance window, and emotional support from three departments, you do not have a software delivery process. You have a museum exhibit. Google’s SRE model explicitly positions error budgets as a way to resolve the conflict between stability and feature velocity, and its release guidance argues that the goal is to ship as fast as possible while still meeting the reliability users expect.
The funny part is that both camps usually believe the other one is about to break production.
The useful part is that both camps are partly right.
The reliability absolutists are right that some outages are existential. The speed absolutists are right that too much friction calcifies delivery, morale, and competitiveness. The mistake is assuming one principle should permanently defeat the other. SRE exists because software organizations need a mechanism for negotiated reality, not a winner in a philosophical cage match.
A lot of reliability arguments are not really about technology. They are about incentives, fear, and memory.
Dev teams remember the quarter when they got blocked from shipping and watched competitors move faster. Ops teams remember the weekend they lost to a change that “should have been harmless.” Leadership remembers whichever one turned into an incident review with finance on the call.
This is why SRE and DevOps work best when they treat reliability as a shared behavioral system, not an infrastructure afterthought. Google’s framing of error budgets is powerful because it creates a common language. You stop arguing in vibes. You start arguing in budgets, risk, and consequences.
DORA’s long-running research is useful here too, because it undermines the lazy assumption that teams must choose between speed and stability. Google Cloud’s DORA research has consistently examined both delivery and operational performance, and related material from Google notes that high-performing teams can achieve strong throughput alongside strong stability outcomes.
That does not mean trade-offs disappear. It means mature organizations manage them better.
Human nature still gets its say. Teams overreact to the last incident. Executives reach for perfect numbers because imperfect ones are politically awkward. Engineers build safety layers they trust more than they understand. Monitoring expands until alerts begin reproducing faster than rabbits. And somewhere in the middle, a poor incident commander is trying to explain that “high availability” is not the same as “no outage ever, thanks.”
A rational approach starts with SLOs tied to user experience, not vanity percentages copied from a vendor datasheet. Microsoft’s guidance is explicit: do not just adopt a provider’s SLA as your target, because user needs and system realities may be stricter or looser than that number suggests.
From there, the healthy move is to use error budgets as an actual decision tool, not a decorative phrase in a slide deck. When the service is healthy and the budget is intact, teams should be able to release with confidence. When the budget is burning, the conversation changes. Maybe you slow releases, improve rollbacks, invest in resilience work, or fix the specific class of changes that keeps setting the building on fire. That is not bureaucracy. That is engineering with memory. Google’s SRE guidance repeatedly frames the budget as the mechanism that aligns incentives between product and reliability work.
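As a sketch of what “budget as a decision tool” can look like in practice, here is a hypothetical gating check. The SLO, window, and release rule are illustrative, not lifted from the cited guidance:

```python
# Hypothetical error-budget check over a rolling request window.
SLO = 0.999  # target: 99.9% of requests succeed

def error_budget_status(total_requests: int, failed_requests: int) -> dict:
    allowed_failures = total_requests * (1 - SLO)  # the budget for this window
    consumed = failed_requests / allowed_failures if allowed_failures else 1.0
    return {
        "budget_consumed": consumed,         # 1.0 means fully burned
        "releases_allowed": consumed < 1.0,  # deliberately blunt gating rule
    }

print(error_budget_status(total_requests=10_000_000, failed_requests=7_500))
# {'budget_consumed': 0.75, 'releases_allowed': True} -> ship, but watch the burn
```

The point is not the arithmetic; it is that the release conversation now references a number both product and operations agreed to in advance.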
The next sensible move is boring, and therefore underrated: reduce complexity. Not because simplicity is elegant, though it is, but because every additional dependency, failover rule, feature flag interaction, and bespoke deployment ritual expands the surface area for surprise. Google’s workbook on simplicity treats this as a reliability concern across code, architecture, and lifecycle processes, which is exactly right.
After that comes release engineering discipline that does not become release theology. Canarying, phased rollouts, feature isolation, fast rollback, and observable changes are the kinds of practices that let you move quickly without pretending change is harmless. Google’s canary guidance says the quiet part out loud: the goal is to ship software as fast as possible while staying within the reliability target users expect.
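A minimal sketch of the canary decision itself, with made-up thresholds; real canary analysis compares many signals over time, not a single error rate:

```python
# Hypothetical canary gate: compare canary error rate against baseline.
def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  max_ratio: float = 2.0) -> bool:
    """Fail the canary if its error rate exceeds max_ratio x baseline."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # Floor the baseline so a spotless baseline doesn't set the threshold to zero.
    return canary_rate <= max_ratio * max(baseline_rate, 1e-6)

# Baseline at 0.1% errors, canary at 0.5%: roll back.
print(canary_passes(1_000, 1_000_000, 50, 10_000))  # False
```

Rollback speed matters more than gate cleverness here: a canary that fails fast and reverts automatically burns minutes of budget, not hours.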
And finally, there is the most human intervention of all: say no to fake precision. Not every system deserves five nines. Not every dashboard metric is a crisis. Not every stakeholder request for “certainty” should survive first contact with an architecture review. Sometimes the most responsible answer in an SRE conversation is, “We can make it more reliable, but here is what it will cost, what it will slow down, and why that may not be worth it.”
That answer rarely gets applause. It does, however, prevent a lot of expensive nonsense.
Chasing 100% without context is not ambition. It is a very expensive way to misunderstand systems.
SRE at its best is not anti-reliability. It is anti-fantasy. It says users deserve dependable systems, engineers deserve sane trade-offs, and businesses deserve honest conversations about cost, risk, and value. It also says that the healthiest teams stop treating uptime as a moral scoreboard and start treating it as one dimension of a larger operating model.
Because perfection is not the same thing as excellence.
Excellence is knowing when another nine is necessary, when it is wasteful, and when it is just a beautifully formatted panic response to uncertainty.
And if someone still insists on 100%, 100%, 100%, and by next quarter, at least make sure they are paying the cloud bill and carrying the pager.
Google Site Reliability Engineering: Embracing Risk
Google Site Reliability Engineering: Introduction to SRE and Error Budgets
Google SRE Workbook: Canarying Releases
Microsoft Learn: How to Read a Service-Level Agreement (SLA)
Microsoft Learn: What are business continuity, high availability, and disaster recovery?
#SRE #SiteReliability #DevOps #ReliabilityEngineering #ErrorBudgets #SLO #Uptime #IncidentManagement #PlatformEngineering #CloudArchitecture #DevOpsCulture