Published on 2025-06-13 10:34
“Five nines.” It’s the gold standard. 99.999% uptime. About five minutes of downtime per year. It sounds impressive—and it is.
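The arithmetic behind each “nine” is easy to check yourself; here’s a quick sketch in plain Python (no SRE tooling assumed):

```python
# Downtime allowed per (non-leap) year at each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime_minutes(target):7.2f} min/year")
# Three nines allow about 8.8 hours a year; five nines, about 5.26 minutes.
```

Note that five nines works out to roughly 5.26 minutes per year—barely enough time for a human to notice an alert, let alone respond to it.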
But in the world of Site Reliability Engineering, some have taken it a step further: chasing 100% reliability. Zero downtime. Zero errors. Zero risk. It makes for a great sales pitch. But here’s the truth: 100% reliability is a myth. Worse, chasing it can hurt more than it helps. Let’s unpack both sides of this seductive but dangerous idea.
The Case for Maximum Reliability
From a business perspective, the argument goes, more reliability is always better.
Customer Trust: Downtime erodes confidence, especially in critical systems like banking, healthcare, or communications.
Brand Differentiation: “We’re always up” becomes a selling point.
User Expectations: Consumers don’t tolerate flakiness. They compare your app to Google and Netflix.
Incident Cost: Outages are expensive. Lost revenue. SLA violations. Regulatory penalties. Bad press.
So the logic follows: if 99.9% is good, 100% must be great. Why not aim higher? In high-stakes industries—air traffic control, medical devices, financial exchanges—perfection isn’t aspirational. It’s mandatory. And thanks to cloud scaling, global failover, chaos engineering, and auto-healing systems, we’re closer than ever. So what’s the problem?
The Hidden Cost of Chasing 100%
SREs know the truth: the last 0.001% of uptime is disproportionately expensive—and often counterproductive.
Diminishing Returns: Going from 99.9% to 99.99% might cost 2x more. Going to 100%? It may cost 10x more—and still fail.
Innovation Suffers: Teams that fear breaking the system stop deploying. Change freezes. Learning slows. Velocity dies.
User Impact ≠ Metric Impact: Some downtime is invisible to users. Other “100%” uptime metrics hide degraded performance. Chasing numbers misses nuance.
Complexity Increases: To stay “always up,” systems get distributed, replicated, cached, layered. This adds latency, fragility, and observability pain.
Human Burnout: Zero-error cultures punish mistakes. Engineers become anxious. Incidents turn into blame games.
Perfect Reliability ≠ Perfect Experience
Ironically, users don’t want perfection. They want predictability, responsiveness, and honesty. Would you rather have: 100% uptime with slow response, confusing errors, and no transparency? Or 99.9% uptime with fast service, graceful degradation, and great support? Reliability must serve user experience—not replace it.
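One way that trade-off shows up in code is graceful degradation: serve a slightly stale answer rather than an error page. A minimal sketch of the stale-on-failure pattern—the cache and fetcher here are hypothetical stand-ins, not any particular library:

```python
import time

# Last known good values: key -> (timestamp, value).
_cache: dict[str, tuple[float, object]] = {}

def fetch_with_fallback(key: str, fetch, max_stale_s: float = 300.0):
    """Try the live fetcher; on failure, serve a recent cached value instead."""
    try:
        value = fetch()
        _cache[key] = (time.monotonic(), value)
        return value, "fresh"
    except Exception:
        if key in _cache:
            stored_at, value = _cache[key]
            if time.monotonic() - stored_at <= max_stale_s:
                return value, "stale"  # degraded, but the user still gets an answer
        raise  # nothing usable cached: surface the error honestly
```

With a pattern like this, a flaky upstream becomes a brief staleness blip instead of an outage—the 99.9%-with-graceful-degradation experience described above.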
The Google SRE View
Even Google—poster child for uptime—doesn’t chase 100%. In the original SRE book, the authors explain: “100% is the wrong reliability target for basically everything.” Why? If nothing can fail, nothing can change. Some failures are healthy: they test systems, improve processes, and uncover blind spots. Error budgets allow teams to balance reliability with innovation. Instead of maximizing uptime, Google maximizes value. That means tolerating small failures to enable fast progress.
When You Should Aim Higher
Still, some systems do require near-perfection.
Safety-critical systems (airbags, pacemakers, nuclear plants).
Financial core systems (clearinghouses, trading engines).
Infrastructure providers (DNS, routing, identity).
In these domains, the cost of failure is existential. Every extra “nine” is worth it. But most systems aren’t that. Your chat app? Your food delivery backend? Your SaaS dashboard? They need enough reliability to keep users happy—not 100%.
How 100% Thinking Spreads
The myth isn’t just technical. It’s cultural.
Marketing wants to advertise 100% uptime.
Executives fear outages and demand zero risk.
Engineers aim for perfection out of pride.
Security teams ban failure “in production.”
This creates toxic pressure. Teams hide incidents. Metrics get gamed. Real issues get buried under process. In the worst cases, fear replaces curiosity.
A Better Goal: Sufficient Reliability
The antidote is to ask: “How reliable do we need to be?” This depends on:
User expectations
Business impact
Recovery time
Support capabilities
Engineering resources
Then you define SLOs (Service Level Objectives) that reflect that reality. Maybe it’s 99.5% for internal dashboards. Maybe it’s 99.95% for customer login. Maybe it’s 99.999% for payment processing. Each system gets the reliability it deserves—no more, no less.
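Those tiered targets translate directly into concrete error budgets. A sketch of the arithmetic over a 30-day window—the service names and targets below are illustrative, borrowed from the examples above:

```python
# Monthly error budget implied by each SLO, over a 30-day window.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

slos = {
    "internal-dashboard": 0.995,    # ~216 minutes of budget per month
    "customer-login":     0.9995,   # ~21.6 minutes
    "payment-processing": 0.99999,  # well under a minute
}

for service, target in slos.items():
    budget_min = MINUTES_PER_MONTH * (1 - target)
    print(f"{service:20s} SLO {target:.3%} -> {budget_min:7.2f} min/month budget")
```

Seeing the budgets side by side makes the tiering tangible: the dashboard team can absorb a bad deploy; the payments team cannot.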
Error Budgets: Innovation’s Safety Net
Error budgets give teams permission to move fast within limits. If you’re below your budget, ship away. If you’re above it, slow down. It’s a simple idea that prevents over-engineering and under-innovation. 100% uptime means zero error budget. No change. No risk. That’s not safe—it’s stagnant.
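The “ship while below budget, slow down when above” rule is simple enough to encode. A hypothetical deploy gate, measuring the budget in failed requests (the measurement plumbing is assumed to exist elsewhere):

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:  # an SLO of 100% leaves no budget at all
        return float("-inf") if failed_requests else 0.0
    return 1.0 - failed_requests / allowed_failures

def may_deploy(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Ship freely while budget remains; freeze changes once it's spent."""
    return budget_remaining(slo, total_requests, failed_requests) > 0.0
```

At a 99.9% SLO over a million requests, 1,000 failures exhaust the budget: at 400 failures the gate stays open, at 1,200 it closes. And note the degenerate case the article warns about: at an SLO of 100%, `may_deploy` never returns true—zero budget means zero change.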
Real-World Example: The Frozen Pipeline
A major retailer insisted on 100% uptime for their e-commerce platform. As a result, no deploys were allowed during peak hours. Weekends. Holidays. Nights. Eventually, deploys only happened twice a month—after long review cycles. Features slowed. Bugs lingered. Engineers got frustrated. After a costly bug went unshipped for weeks, leadership relented. They moved to a 99.95% model with error budgets and automated rollback. Deploy frequency rose 5x. Customer satisfaction improved. Reliability also improved—because teams learned faster.
Final Thought
100% reliability is seductive. It sounds like success. It feels like safety. But in the real world, it’s an illusion—and a trap. The systems that win aren’t the ones that never fail. They’re the ones that fail gracefully, recover quickly, and keep evolving. So don’t chase perfection. Chase understanding. Because a system that works well enough, that your team understands deeply, and that your users trust? That’s more reliable than any five nines you can buy. And a whole lot more human.