Metrics are the lifeblood of Site Reliability Engineering. Uptime, latency, throughput, error rate—these numbers define how we measure system health, team performance, and overall reliability. They fuel dashboards, power alerts, and drive postmortems.
But increasingly, SREs are asking an uncomfortable question: are these metrics telling us the truth?
On the surface, they seem objective—clear, numerical signals that cut through subjectivity. But in practice, SRE metrics can mislead. They can obscure real issues, incentivize bad behavior, and create a false sense of control.
So are SRE metrics broken? Or are we just using them wrong?
Let’s examine both sides.
The Case for SRE Metrics
Let’s start with why metrics matter.
1. They Create Accountability
Metrics make it possible to track progress, benchmark goals, and show improvement over time. They turn reliability into something tangible.
2. They Enable Automation
Without metrics, we can’t build alerts, autoscalers, or SLO-driven remediation. Metrics are the data that fuels modern operations.
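For instance, a minimal sketch of SLO-driven remediation might look like the Python below. The 99.9% target, the freeze threshold, and the request counts are illustrative assumptions, not recommendations.

```python
# A minimal sketch of SLO-driven remediation, assuming a 99.9% success SLO
# over a rolling window. All numbers here are illustrative.

SLO_TARGET = 0.999  # 99.9% of requests should succeed

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

def should_freeze_deploys(total: int, failed: int, threshold: float = 0.1) -> bool:
    """Gate risky changes when less than `threshold` of the budget remains."""
    return error_budget_remaining(total, failed) < threshold

# 10M requests with 9,500 failures burns 95% of the budget: freeze.
print(should_freeze_deploys(10_000_000, 9_500))  # True
```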
3. They Provide Early Warnings
Done right, metrics help catch problems before users notice. Rising latency, a spike in errors, a sudden drop in traffic: all early signs of trouble.
4. They Enable Communication
Metrics give engineers a common language. “P95 latency jumped to 700ms” is more actionable than “the app felt slow.”
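To make that concrete, here’s a rough sketch of how a p95 figure falls out of raw samples. The latency values are invented for illustration.

```python
# A nearest-rank percentile over raw latency samples. The samples are made up.
def percentile(samples: list[float], pct: float) -> float:
    """Return the value below which roughly `pct`% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [120, 140, 95, 700, 130, 150, 110, 125, 135, 145,
                105, 115, 160, 650, 170, 180, 100, 90, 155, 165]

print(f"p95 latency: {percentile(latencies_ms, 95)}ms")  # p95 latency: 650ms
```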
In theory, metrics are objective and invaluable. But in practice, their value depends on how they’re used—and that’s where things get tricky.
The Case Against Over-Reliance
Metrics don’t lie—but they don’t tell the whole truth either.
1. Metrics Don’t Capture User Experience
A service can be “up” by availability metrics but still broken. Maybe it returns 200 OK with garbage data. Maybe the page loads, but third-party content fails. Maybe everything’s green—but the customer is furious.
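A correctness-aware health check is one way to close that gap. Below is a minimal sketch; the endpoint and the expected fields are hypothetical.

```python
import json
import urllib.error
import urllib.request

# A minimal sketch of a correctness-aware health check. The URL and the
# expected fields are hypothetical; the point is that a 200 OK with garbage
# data should still count as a failure.
def check_health(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.loads(resp.read())
    except (urllib.error.URLError, ValueError):
        return False  # unreachable, non-2xx, or unparseable: plainly down
    if not isinstance(body, dict):
        return False
    # Correctness checks beyond "it answered":
    items = body.get("items")
    return isinstance(items, list) and len(items) > 0 \
        and all("id" in item for item in items)

print(check_health("https://api.example.com/v1/catalog"))  # hypothetical endpoint
```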
2. Metrics Are Gameable
When SLOs are tied to incentives, teams game the numbers. They tweak thresholds, exclude error types, or segment traffic in creative ways. Metrics become performance theater.
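Here’s a toy illustration of how one “creative” exclusion flatters the numbers. The response counts are invented.

```python
from collections import Counter

# Compare an honest error rate with a gamed one that excludes 404s.
responses = Counter({200: 9_000, 404: 800, 500: 200})

total = sum(responses.values())
all_errors = responses[404] + responses[500]
gamed_errors = responses[500]  # 404s conveniently excluded

print(f"honest error rate: {all_errors / total:.1%}")    # 10.0%
print(f"gamed error rate:  {gamed_errors / total:.1%}")  # 2.0%
```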
3. Metrics Are Lagging Indicators
Many metrics only tell you what already happened. If you focus too much on numbers, you miss early qualitative signals—support tickets, dev complaints, customer churn.
4. Metrics Ignore Context
A 3% error rate might be fine at midnight and catastrophic during a product launch. The number doesn’t know that. Humans do.
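One sketch of what context-aware alerting could look like, with illustrative thresholds and a hypothetical launch flag:

```python
from datetime import datetime, timezone

# The same 3% error rate clears the bar overnight but pages during a launch.
# The thresholds and the launch flag are illustrative assumptions.
def should_page(error_rate: float, now: datetime, launch_in_progress: bool) -> bool:
    if launch_in_progress:
        return error_rate > 0.005   # launches get a much tighter bar
    if 0 <= now.hour < 6:           # low-traffic overnight window
        return error_rate > 0.05
    return error_rate > 0.02

now = datetime.now(timezone.utc)
print(should_page(0.03, now.replace(hour=2), launch_in_progress=False))  # False
print(should_page(0.03, now.replace(hour=2), launch_in_progress=True))   # True
```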
5. They Create Alert Fatigue
Poorly tuned metrics generate noisy alerts. Teams learn to ignore them. Important issues get buried in meaningless noise.
When Metrics Mislead
Real-world examples of misleading metrics abound.
- A team monitors median latency and celebrates improvement, until it realizes the slowness users actually feel lives out at p95.
- An error budget looks healthy because it excludes 404s—except most of the product is now returning 404s.
- A service shows perfect uptime—but it’s returning stale cache data due to a backend failure.
Metrics are only as good as the questions you ask and the behavior they encourage.
Metrics vs. Reality
There’s also the issue of abstraction.
As systems grow, we rely more on high-level aggregations. But aggregation hides detail. A p95 figure hides the experience of the slowest 5% of requests. Service-wide uptime hides regional failures. Averages hide outliers.
This creates a dangerous illusion: systems look healthy while parts are crumbling.
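A toy example makes the illusion concrete. The per-region counts below are invented.

```python
# A service-wide aggregate can look healthy while one region is on fire.
region_stats = {
    "us-east":  {"ok": 980_000, "total": 1_000_000},
    "eu-west":  {"ok": 495_000, "total": 500_000},
    "ap-south": {"ok": 20_000,  "total": 40_000},  # half of requests failing
}

ok = sum(r["ok"] for r in region_stats.values())
total = sum(r["total"] for r in region_stats.values())
print(f"global availability: {ok / total:.2%}")   # looks fine: 97.08%

for name, r in region_stats.items():
    print(f"{name}: {r['ok'] / r['total']:.2%}")  # ap-south: 50.00%
```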
So What’s the Fix?
Metrics aren’t bad. But we need to approach them with humility and context.
Here’s how to do it better:
1. Use Multiple Lenses
Don’t rely on one golden metric. Use a combination: latency, error rate, saturation, availability, user sentiment, and business KPIs.
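As a rough sketch, a single snapshot viewed through several lenses might look like this. The fields and thresholds are illustrative, not a standard.

```python
from dataclasses import dataclass

# One service, several lenses. Values and thresholds are illustrative.
@dataclass
class ServiceSnapshot:
    p95_latency_ms: float
    error_rate: float
    cpu_saturation: float   # 0.0-1.0
    availability: float     # 0.0-1.0
    nps_delta: float        # change in user sentiment

    def concerns(self) -> list[str]:
        checks = [
            (self.p95_latency_ms > 500, "latency"),
            (self.error_rate > 0.01, "errors"),
            (self.cpu_saturation > 0.8, "saturation"),
            (self.availability < 0.999, "availability"),
            (self.nps_delta < 0, "user sentiment"),
        ]
        return [name for failed, name in checks if failed]

snap = ServiceSnapshot(420.0, 0.002, 0.65, 0.9995, -4.0)
print(snap.concerns())  # ['user sentiment']: green dashboards, unhappy users
```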
2. Focus on User-Centric Metrics
SLOs should reflect what users care about. Measure what matters to them, not just what’s easy to collect.
3. Validate with Qualitative Data
Pair metrics with customer feedback, support tickets, and usability tests. Don’t let dashboards be your only source of truth.
4. Treat Metrics as Signals, Not Truths
A red dashboard doesn’t mean failure. A green dashboard doesn’t mean success. Use metrics to guide investigation—not end it.
5. Refine Constantly
Metrics should evolve. Review them regularly. Ask: Are they helping us make better decisions? Are they encouraging the right behavior?
6. Promote Metric Literacy
Ensure teams understand what metrics mean—and what they don’t. Train people to interpret data critically.
A Real-World Evolution
At a large cloud company, the SRE team noticed something odd. Their uptime dashboards showed 99.99% reliability. But customers were unhappy. Churn was rising. NPS scores were dropping.
They dug deeper.
It turned out their metrics only measured request-level availability. When a service failed to authenticate, it returned a friendly error page—technically a 200 OK.
They rewrote their SLOs to include business-level success: could users complete a transaction? Could they get a response from all dependencies?
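A business-level probe along those lines might look something like the sketch below. Every URL, parameter, and field name is hypothetical.

```python
import json
import urllib.error
import urllib.request

# A synthetic probe that walks a real user journey instead of counting
# status codes. All names below are hypothetical.
def transaction_succeeds(base_url: str) -> bool:
    try:
        # Step 1: authenticate, and confirm we actually received a token.
        # A friendly error page with a 200 OK fails this check.
        with urllib.request.urlopen(f"{base_url}/login?user=probe", timeout=5) as r:
            token = json.loads(r.read()).get("token")
        if not token:
            return False
        # Step 2: complete a transaction end to end with that token.
        req = urllib.request.Request(
            f"{base_url}/checkout",
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req, timeout=5) as r:
            return json.loads(r.read()).get("status") == "confirmed"
    except (urllib.error.URLError, ValueError):
        return False

print(transaction_succeeds("https://shop.example.com"))  # hypothetical service
```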
The numbers got worse. But the customer experience got better. And over time, so did the metrics—this time, meaningfully.
Final Thought
Metrics are powerful. But they are not reality—they are shadows of it.
They help us see patterns, track progress, and build automation. But they can also mislead, distract, and lull us into a false sense of security.
The best SREs don’t just watch dashboards. They ask hard questions. They investigate anomalies. They talk to users. They use metrics as tools—not as truths.
So are SRE metrics misleading?
Only if you stop thinking.
Because real reliability isn’t in the numbers. It’s in the understanding behind them.