The Myth of 100% Reliability


“Five nines.” It’s the gold standard. 99.999% uptime. Roughly five minutes of downtime per year. It sounds impressive, and it is.

But in the world of Site Reliability Engineering, some have taken it a step further: chasing 100% reliability. Zero downtime. Zero errors. Zero risk. It makes for a great sales pitch. But here’s the truth: 100% reliability is a myth. Worse, chasing it can hurt more than it helps. Let’s unpack both sides of this seductive but dangerous idea.

The Case for Maximum Reliability

From a business perspective, the argument seems airtight: downtime costs revenue, erodes trust, and damages the brand, so more reliability is always better.

So the logic follows: if 99.9% is good, 100% must be great. Why not aim higher? In high-stakes industries—air traffic control, medical devices, financial exchanges—perfection isn’t aspirational. It’s mandatory. And thanks to cloud scaling, global failover, chaos engineering, and auto-healing systems, we’re closer than ever. So what’s the problem?

The Hidden Cost of Chasing 100%

SREs know the truth: the last 0.001% of uptime is disproportionately expensive—and often counterproductive.

  1. Diminishing Returns: Going from 99.9% to 99.99% might cost 2x more. Going to 100%? It may cost 10x more and still fail.

  2. Innovation Suffers: Teams that fear breaking the system stop deploying. Change freezes. Learning slows. Velocity dies.

  3. User Impact ≠ Metric Impact: Some downtime is invisible to users, while “100%” uptime metrics can hide degraded performance. Chasing numbers misses nuance (see the sketch after this list).

  4. Complexity Increases: To stay “always up,” systems get distributed, replicated, cached, and layered. This adds latency, fragility, and observability pain.

  5. Human Burnout: Zero-error cultures punish mistakes. Engineers become anxious. Incidents turn into blame games.
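
To make point 3 concrete, here is a minimal sketch of the gap between a naive up/down metric and a latency-aware SLI. The request log and the 300 ms threshold are invented for illustration; the point is that a service can look “up” while failing a meaningful share of its users.

```python
# Hypothetical request log: (timestamp, succeeded, latency_ms).
requests = [
    (0, True, 120), (1, True, 95), (2, True, 4800),
    (3, True, 5100), (4, False, 30000), (5, True, 110),
]

# Naive availability: the service answered, so it counts as "up".
uptime_sli = sum(ok for _, ok, _ in requests) / len(requests)

# Latency-aware SLI: a request only counts as "good" if it succeeded
# AND returned within a threshold users would actually tolerate.
LATENCY_THRESHOLD_MS = 300  # assumed tolerance, not from the post
good = sum(ok and ms <= LATENCY_THRESHOLD_MS for _, ok, ms in requests)
experience_sli = good / len(requests)

print(f"uptime SLI:     {uptime_sli:.1%}")      # 83.3% -- looks healthy
print(f"experience SLI: {experience_sli:.1%}")  # 50.0% -- users disagree
```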

Perfect Reliability ≠ Perfect Experience

Ironically, users don’t want perfection. They want predictability, responsiveness, and honesty. Would you rather have: 100% uptime with slow response, confusing errors, and no transparency? Or 99.9% uptime with fast service, graceful degradation, and great support? Reliability must serve user experience—not replace it.
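
That second option leans on graceful degradation: when the fresh path fails, serve something useful instead of an error page. Here is a minimal sketch, assuming a caller-supplied fetch function and an in-process cache of last-known-good results; all names are illustrative.

```python
import time

_last_good = {}  # hypothetical cache of the last successful response per user

def get_recommendations(user_id, fetch, timeout_s=0.3):
    """Serve fresh data when possible; degrade to stale data when not."""
    try:
        result = fetch(user_id, timeout=timeout_s)   # primary path
        _last_good[user_id] = (time.time(), result)  # remember the good answer
        return result, "fresh"
    except Exception:
        if user_id in _last_good:
            _, stale = _last_good[user_id]
            return stale, "stale"  # degraded but still useful
        return [], "empty"         # honest fallback, not a 500 page
```

The backend blip still happens, but the user never sees it. That is reliability in service of experience, not in place of it.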

The Google SRE View

Even Google, the poster child for uptime, doesn’t chase 100%. In the original SRE book, the authors explain: “100% is the wrong reliability target for basically everything.” Why? If nothing can fail, nothing can change. And some failures are healthy: they test systems, improve processes, and uncover blind spots. Error budgets allow teams to balance reliability with innovation. Instead of maximizing uptime, Google maximizes value. That means tolerating small failures to enable fast progress.

When You Should Aim Higher

Still, some systems do require near-perfection: air traffic control, medical devices, financial exchanges.

In these domains, the cost of failure is existential. Every extra “nine” is worth it. But most systems aren’t that. Your chat app? Your food delivery backend? Your SaaS dashboard? They need enough reliability to keep users happy—not 100%.

How 100% Thinking Spreads

The myth isn’t just technical. It’s cultural.

When leadership treats every failure as unacceptable, the result is toxic pressure. Teams hide incidents. Metrics get gamed. Real issues get buried under process. In the worst cases, fear replaces curiosity.

A Better Goal: Sufficient Reliability

The antidote is to ask: “How reliable do we need to be?” The answer depends on what your users actually expect, what downtime costs the business, and what each additional nine costs to deliver.

Then you define SLOs (Service Level Objectives) that reflect that reality. Maybe it’s 99.5% for internal dashboards. Maybe it’s 99.95% for customer login. Maybe it’s 99.999% for payment processing. Each system gets the reliability it deserves: no more, no less.
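
Those targets feel abstract until you translate them into time. A quick back-of-the-envelope conversion (using only the three example targets above) makes the trade-off concrete:

```python
# Translate an SLO target into the downtime it actually permits.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_budget_minutes(slo_percent):
    """Minutes of downtime per year allowed by a given SLO."""
    return (1 - slo_percent / 100) * MINUTES_PER_YEAR

for name, slo in [("internal dashboard", 99.5),
                  ("customer login", 99.95),
                  ("payment processing", 99.999)]:
    print(f"{name:19s} {slo:7.3f}% -> {downtime_budget_minutes(slo):8.1f} min/year")

# internal dashboard   99.500% ->   2629.8 min/year  (~1.8 days)
# customer login       99.950% ->    263.0 min/year  (~4.4 hours)
# payment processing   99.999% ->      5.3 min/year
```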

Error Budgets: Innovation’s Safety Net

Error budgets give teams permission to move fast within limits. If you’re within budget, ship away. If you’ve burned through it, slow down. It’s a simple idea that prevents over-engineering and under-innovation. 100% uptime means zero error budget. No change. No risk. That’s not safe; it’s stagnant.
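
The mechanics are almost trivially simple. In this sketch, the SLO and the request counts are invented for illustration; the budget is just 1 minus the target, and a release gate compares burn against it:

```python
SLO = 0.999             # assumed target: 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO  # 0.1% of requests may fail in the window

def deploys_allowed(total_requests, failed_requests):
    """Gate releases on the error budget left in the current window."""
    burn = failed_requests / total_requests
    remaining = ERROR_BUDGET - burn
    return remaining > 0, remaining

# Hypothetical 30-day window: 10M requests, 4,200 failures.
ok, remaining = deploys_allowed(10_000_000, 4_200)
print("ship" if ok else "freeze", f"(budget left: {remaining:.4%})")
# -> ship (budget left: 0.0580%)
```

Set SLO to 1.0 and ERROR_BUDGET drops to zero: the gate never opens. That is the stagnation described above, in code form.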

Real-World Example: The Frozen Pipeline

A major retailer insisted on 100% uptime for their e-commerce platform. As a result, no deploys were allowed during peak hours, weekends, holidays, or nights. Eventually, deploys only happened twice a month, after long review cycles. Features slowed. Bugs lingered. Engineers got frustrated. After the fix for a costly bug sat unshipped for weeks, leadership relented. They moved to a 99.95% model with error budgets and automated rollback. Deploy frequency rose 5x. Customer satisfaction improved. Reliability also improved, because teams learned faster.

Final Thought

100% reliability is seductive. It sounds like success. It feels like safety. But in the real world, it’s an illusion—and a trap. The systems that win aren’t the ones that never fail. They’re the ones that fail gracefully, recover quickly, and keep evolving. So don’t chase perfection. Chase understanding. Because a system that works well enough, that your team understands deeply, and that your users trust? That’s more reliable than any five nines you can buy. And a whole lot more human.