The Four Golden Signals—latency, traffic, errors, and saturation—are a great starting point, but modern SRE work needs explainability, not just dashboards. Enhanced observability means instrumenting systems so questions can be answered quickly and confidently during incidents: What changed? Where’s the contention? Which users are impacted? The shift is from “collect everything” to “collect what answers real questions,” mapped to SLOs and error budgets.
What does this look like in practice? First, standardize telemetry so it’s analysis-ready on day one. OpenTelemetry (OTel) semantic conventions turn ad-hoc attributes into consistent, queryable data across services, languages, and teams. Second, embrace high-cardinality attributes where they help (e.g., customer_id, build_sha) and manage them with guardrails (sampling, attribute allowlists). Third, make tracing your primary narrative during incidents: use tail-based or rules-based sampling to retain the “interesting” traces (errors, long-tail latency, new deploys). Link traces to exemplars so metric spikes click through to concrete requests.
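A minimal sketch of the first two ideas, using the OpenTelemetry Python SDK (assumes `opentelemetry-api` and `opentelemetry-sdk` are installed; the service name, attribute values, and the allowlist contents are illustrative, not prescribed by OTel):

```python
# Sketch: semconv-style attribute keys plus allowlisted high-cardinality fields.
# Exporter is console-only to keep the example self-contained.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")

# Illustrative guardrail: only these high-cardinality keys ever reach a span.
HIGH_CARDINALITY_ALLOWLIST = {"app.customer_id", "app.build_sha"}

def handle_checkout(customer_id: str, build_sha: str) -> None:
    with tracer.start_as_current_span("POST /checkout") as span:
        # Semantic-convention keys: queryable the same way on every service.
        span.set_attribute("http.request.method", "POST")
        span.set_attribute("http.route", "/checkout")
        span.set_attribute("http.response.status_code", 200)

        # High-cardinality context, gated by the allowlist rather than attached ad hoc.
        extras = {"app.customer_id": customer_id, "app.build_sha": build_sha}
        for key, value in extras.items():
            if key in HIGH_CARDINALITY_ALLOWLIST:
                span.set_attribute(key, value)

if __name__ == "__main__":
    handle_checkout(customer_id="cust-8471", build_sha="9f3c2ab")
```

The point of the allowlist is that high-cardinality context stays deliberate: new attributes are added by changing one set, not by sprinkling set_attribute calls across the codebase.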
Next, go beyond the three pillars (metrics, logs, traces) with continuous profiling (CPU, memory, locks) to pinpoint why a spike happened—often in code paths your metrics can’t reveal. eBPF-powered profilers now make always-on production profiling feasible with low overhead. Tie all of this to SLO-driven alerting (page only when user experience is at risk), and pipe your telemetry through collectors so you can enrich, filter, and route without redeploying apps. Finally, close the loop: encode topologies and service graphs, ship deploy metadata, and turn post-incident findings into permanent telemetry (new spans, attributes, profiles), so the next outage is shorter—and ideally avoided.
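To make “page only when user experience is at risk” concrete, here is a small, self-contained sketch of multiwindow burn-rate math; the 14.4x threshold is the commonly cited value for a 99.9% SLO over a 30-day window, while the function names and sample counts are hypothetical:

```python
# Illustrative burn-rate check for SLO-driven alerting (names and numbers are made up).
# Burn rate = observed error ratio / error budget; a rate of 1.0 exhausts the budget
# exactly at the end of the SLO window.

SLO_TARGET = 0.999               # 99.9% availability SLO
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.1% of requests may fail over the SLO window

def burn_rate(bad: int, total: int) -> float:
    """Error-budget burn rate for one lookback window."""
    if total == 0:
        return 0.0
    return (bad / total) / ERROR_BUDGET

def should_page(bad_1h: int, total_1h: int, bad_5m: int, total_5m: int) -> bool:
    """Page only if both the long and short windows burn fast (>= 14.4x), i.e. roughly
    2% of a 30-day budget consumed in one hour and the burn is still ongoing."""
    return burn_rate(bad_1h, total_1h) >= 14.4 and burn_rate(bad_5m, total_5m) >= 14.4

# 600 failures out of 200,000 requests in the last hour is a 0.3% error ratio:
# burn rate 3.0 -- over budget, but not fast enough to wake anyone.
print(burn_rate(600, 200_000))                    # 3.0
print(should_page(600, 200_000, 50, 16_000))      # False
print(should_page(4_000, 200_000, 350, 16_000))   # True: ~20x and ~21.9x burn
```

The two-window condition is what keeps this user-centric: a brief blip trips the short window but not the long one, and a slow leak trips the long window but not the short one, so neither pages on its own.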
In short: enhanced observability isn’t “more data”; it’s better questions answered faster—built on standards, enriched context, smart sampling, and profiles, all aligned with your SLOs.
License to Observe: Why Observability Solutions Need Agents — USENIX ;login: (Feb 24, 2025). Clear, vendor-neutral reasoning for using collectors/agents to enrich, route, and decouple telemetry pipelines.
Consequences of Compliance: The CrowdStrike Outage of 19 July 2024 — USENIX ;login: (Jul 29, 2024). A sobering analysis with lessons on rollout safety and the limits of observability when context isn’t instrumented.
OpenTelemetry Adopts Continuous Profiling; Elastic Donates Their Agent — InfoQ (Aug 12, 2024). News coverage of profiling becoming a first-class OTel signal and what that unlocks for incident response.
OpenTelemetry Is Expanding into CI/CD Observability — CNCF Blog (Nov 4, 2024). How OTel semantic conventions now cover CI/CD, letting teams observe delivery pipelines with shared, vendor-neutral schemas.
Visualizing Distributed Traces in Aggregate — arXiv (Dec 9, 2024). Research on grouping and visualizing large trace sets to surface system-level patterns beyond single-trace views.
#Observability #SRE #SiteReliabilityEngineering #OpenTelemetry #DistributedTracing #SLOs #ErrorBudgets #IncidentManagement #ContinuousProfiling #DevOps