Published on 2025-10-03 10:30
If you’ve ever followed a red error dot across six dashboards only to discover the real problem was a single broken button for users on Safari 17.3, you’ve felt the gap between backend-first observability and user reality. Traditional flows begin at the infrastructure: CPU climbs, error rates flare, traces bloom like fireworks, and we reverse-engineer our way toward a human who just wanted to check out. Frontend-first flips that journey. It starts with the user’s actual experience — real page loads, interactions, Core Web Vitals, rage-clicks — and then walks the trace back through services, queues, databases, and that one “temporary” cache that’s now old enough for school.
The angle that triggers the spiciest hallway debates is dashboards vs. traces. Dashboards are quick to love: they’re neat, tidy, and photogenic in postmortems. Traces, on the other hand, feel like the messy truth — a narrative of one request through the guts of your system, warts and N+1 queries included. So which worldview wins when you go headless or frontend-first? Let’s explore both, and decide how SREs and DevOps teams can design observability that matches human reality instead of forcing humans to match our charts.
Backend-first is the classic: you watch latency, traffic, errors, and saturation; you slice by service; you add a lovingly curated wall of panels. When the pager barks, you start at a timeseries plot, pivot to logs, then fan out to traces. This shines when the failure mode is systemic: hot code paths, exhausted connection pools, retry storms, or a dependency going sideways. It's great for keeping the lights on and for long-term trends that feed capacity planning.
But it struggles with the subtle stuff users experience first: a flaky third-party script blocking interaction on mobile; a SPA route that mounts a slow component only for customers in a specific geography; a CDN config that mislabels a MIME type so the browser refuses to execute a script. In those moments, the backend looks fine. The user does not.
A headless or “frontend-first” approach starts at real user monitoring (RUM) and client-side traces: the user clicked “Buy,” the button took 1.8s to become interactive, an image timed out, then a fetch retried three times before succeeding. From that point of truth, the request trace becomes your rope back through the labyrinth. You correlate session, user, and request IDs across the browser, edge, and services to see exactly which backend hop cost the user their patience. Instead of asking “what’s our 99th percentile latency this hour?” you ask “why did Ada Lovelace in Berlin have a 4.2s TTI on checkout, and who else is trending that way?”
The payoff is not just empathy. It’s precision: when you anchor on the human, you only chase the parts of the backend that actually changed the experience. That’s less noise at 3 a.m., fewer “could not reproduce” tickets, and postmortems that don’t read like fanfiction.
The pro-dashboard crowd argues that reliable systems need shared, consistent health views. The "golden signals" (latency, traffic, errors, saturation) are excellent leading indicators of system distress. Dashboards enable fast, low-cost scanning when you're triaging and provide essential common ground for on-call handoffs, exec briefings, and audits. They're also the shortest path to detecting regressions in known hotspots. When your storage cluster goes red, a picture's worth a thousand runbooks.
The trace-first camp counters that dashboards ossify yesterday’s questions. They’re filtered, aggregated, and pre-digested — great for “is the patient breathing?” but bad for “why this patient, right now?” When the unexpected happens, you need to ask ad-hoc questions of raw-ish, high-cardinality telemetry and follow a single request. Traces preserve causality: you can see the exact service hop that added 900ms, the feature flag branch that doubled payload size, or the flaky DNS lookup that struck only users on a certain network. They argue that modern incidents rarely fit on a fixed panel — you need narrative, not just averages.
Dashboards scale breadth; traces scale depth. If you only do dashboards, you risk comforting illusions. If you only do traces, you risk spelunking forever. Frontend-first observability reframes the debate: use dashboards to detect and traces to explain, but start where the pain begins — in the browser or app. That keeps detection honest and explanation surgical.
“Headless” in this context isn’t about decapitated CMSes (hold your jokes). It’s a mindset and workflow: rather than mandating prebuilt dashboards as the entry point, you let the user’s journey (via RUM and client traces) drive the investigative flow automatically. Think of it as an experience-to-root-cause pipeline:
1. A real user interaction degrades: LCP spikes, input delay increases, or a route silently fails.
2. The session is correlated to a backend trace through standard context propagation: trace IDs, baggage, and user/session identifiers handled responsibly (a minimal wiring sketch follows this list).
3. The tool pivots you into the exact service span that caused the regression, already filtered by the attributes that matter (geo, device, feature flag, experiment cohort).
4. If it's not the backend, you stay in the browser: third-party script, render path, layout shift, or resource blocking gets the spotlight.
5. After a fix, you validate that the user-level SLO recovered, not just that "p99" looks prettier.
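Step 2 is where most correlation efforts fall over, so here is a minimal wiring sketch, assuming the OpenTelemetry JS web packages; the collector URL and the API origin in the CORS allow-list are placeholders for your own endpoints.

```typescript
// Minimal browser-side wiring, assuming the OpenTelemetry JS web packages.
// The collector URL and the allow-listed API origin are placeholders.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

// Export browser spans to a collector you control.
const provider = new WebTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({ url: 'https://otel-collector.example.com/v1/traces' })
    ),
  ],
});
provider.register();

// Auto-instrument fetch and inject W3C `traceparent` headers on calls to the
// API origin, so backend spans join the browser span under one trace ID.
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      propagateTraceHeaderCorsUrls: [/^https:\/\/api\.example\.com/],
    }),
  ],
});
```

With `traceparent` flowing on every fetch, the backend's own OpenTelemetry SDK attaches its spans to the same trace, which is exactly the rope back through the labyrinth described above.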
Headless here means your entry point is a question, not a canvas. It’s less “which dashboard?” and more “which user and which interaction?”
SRE culture is allergic to toil, but we’re also deeply human. We fall in love with dashboards we’ve curated; we confuse familiarity with signal. During incidents, humans pattern-match to past pain. A frontend-first approach counteracts that bias by forcing us to honor what users actually see. It also aligns perfectly with SLO thinking. If your SLOs measure user-perceived performance and correctness instead of service internals, you’ll find the backend optimizations that actually buy reliability (and skip the ones that just buy nicer charts).
There’s a human nature benefit during on-call, too. Nothing soothes a 3 a.m. brain like a crisp narrative: “Users in EU on Safari saw checkout timeouts due to a slow TLS handshake at the edge after a cert rotation. Here’s the span; here’s the fix.” That beats, “Panel 12 looks angry; let’s page networking and hope.”
Teams are converging on correlating browser telemetry (RUM, Core Web Vitals, JS errors) with distributed traces via shared context. Open standards like OpenTelemetry’s semantic conventions make that tractable across toolchains, and vendors increasingly support front-to-back stitching so a single click path becomes a single investigative flow. There’s also movement toward “explorable” views over static dashboards: instead of twenty fixed charts, you get a start-anywhere explorer that pivots from user session to service to log to profile, then back to the session to verify the fix. Frontend products keep adding session replay, error grouping, and milestone tracking tied to releases and feature flags. The direction of travel is clear: less canvas worship, more question-driven workflows, with user-level SLOs visible alongside service-level signals.
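As a concrete flavor of that stitching, here is a hedged sketch using Google's web-vitals library; the reporting endpoint and the two ID helpers are hypothetical stand-ins for whatever your RUM pipeline and tracer actually expose.

```typescript
// Report Core Web Vitals alongside the IDs that tie a session to its traces.
// Assumes Google's `web-vitals` package; endpoint and ID helpers are placeholders.
import { onLCP, onINP, onCLS } from 'web-vitals';

declare function currentTraceId(): string | undefined; // hypothetical: from your tracer
declare function currentSessionId(): string;           // hypothetical: pseudonymous, no PII

function report(metric: { name: string; value: number; rating: string }): void {
  // sendBeacon survives page unloads, which is when vitals often finalize.
  navigator.sendBeacon(
    '/rum/vitals',
    JSON.stringify({
      metric: metric.name,   // e.g. "LCP", "INP", "CLS"
      value: metric.value,
      rating: metric.rating, // "good" | "needs-improvement" | "poor"
      traceId: currentTraceId(),
      sessionId: currentSessionId(),
    })
  );
}

onLCP(report);
onINP(report);
onCLS(report);
```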
A user-first stance can still fail if you don’t align identifiers. If the browser session ID can’t be tied to the backend trace, you’re rebuilding the haystack. Privacy deserves first-class consideration: be intentional about what you collect, minimize PII, and be transparent with users. Client-side telemetry can also be noisy; browsers are diverse and networks messy. Without good sampling, bucketing, and filtering, you’ll drown in “everything hurts and I’m dying” dashboards — the very trap we’re trying to escape. Finally, don’t throw out backend-first just to be hip. If Kafka is melting, you don’t need RUM to tell you that; you need a runbook and a throttle.
Define SLOs in terms of user-perceived outcomes: successful checkout within X seconds, interactive home page within Y, error-free save action within Z. Then map those to golden signals and backend budgets. Alert on SLO burn rates, but make the links take you from the SLO straight to the impacted sessions, not a generic service dashboard. Yes, monitoring everything is great… until your alerts start competing with Netflix for your attention. Keep alerts for what violates promises to users; everything else is a breadcrumb for humans to follow, not a siren.
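To make "alert on SLO burn rates" concrete, here is an illustrative sketch of the multiwindow burn-rate pattern from the Google SRE Workbook (linked below); the SLO target, windows, and thresholds are example numbers, not recommendations for your system.

```typescript
// Illustrative burn-rate math for a user-level SLO: 99.5% of checkouts
// complete within 3 seconds, measured over a 30-day window.
const SLO_TARGET = 0.995;
const ERROR_BUDGET = 1 - SLO_TARGET; // 0.5% of checkouts may miss the bar

// Burn rate = observed bad-event rate relative to the budgeted rate.
// 1.0 means the budget is exhausted exactly at the end of the window.
function burnRate(badEvents: number, totalEvents: number): number {
  if (totalEvents === 0) return 0;
  return badEvents / totalEvents / ERROR_BUDGET;
}

// Multiwindow alerting per the SRE Workbook pattern: page only when both a
// long and a short window burn hot, so blips don't wake anyone but real
// regressions still page within minutes. A burn rate of 14.4 sustained for
// one hour consumes roughly 2% of a 30-day error budget.
function shouldPage(oneHourBurn: number, fiveMinuteBurn: number): boolean {
  return oneHourBurn > 14.4 && fiveMinuteBurn > 14.4;
}
```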
Instrument the browser with standards-based context propagation and attach the trace ID to the first hop. Pass cohort, feature flag, and release identifiers as trace attributes — responsibly scrubbed — so you can slice by “only users in experiment TARDIS-B who had service worker disabled.” This is the difference between “we think the DB was slow” and “this specific query plan regressed only for iOS Safari because payloads crossed a compression threshold after a feature rollout.”
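In code, that can be as small as starting the interaction span with the attributes you will want to slice by later. The attribute keys, tracer name, and session helper below are illustrative, not a fixed convention; OpenTelemetry's semantic conventions (linked below) are the place to look for standard names.

```typescript
// Hedged sketch: tag the user-interaction span with slicing attributes.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-ui'); // instrumentation name: illustrative

// Hypothetical helper: returns a pseudonymous session ID with no raw PII.
declare function scrubbedSessionId(): string;

async function onBuyClick(): Promise<void> {
  // startActiveSpan makes this span the parent of the instrumented fetch below
  // (in browsers, async context needs a context manager such as ZoneContextManager).
  await tracer.startActiveSpan(
    'checkout.buy_click',
    {
      attributes: {
        'app.release': '2025.10.03-web',          // example release identifier
        'app.experiment.cohort': 'TARDIS-B',      // cohort from the example above
        'app.feature_flag.service_worker': false, // example flag state
        'session.id': scrubbedSessionId(),
      },
    },
    async (span) => {
      try {
        await fetch('/api/checkout', { method: 'POST' });
      } catch (err) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```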
Keep a tiny set of “are we on fire?” panels for execs and on-call nerves, but bias day-to-day engineering toward exploratory views and saved investigations. Think notebook-style “runnable postmortems” that re-execute the exact trace queries, log filters, and RUM facets that proved a hypothesis. Build a culture of sharing investigations instead of screenshotting charts out of context. The surprise benefit: onboarding speeds up because new engineers learn how to ask questions, not just where the charts are.
During incident triage, assign a dedicated "UX first responder" who owns the user timeline: they watch RUM signals, top errors, and session outliers, then coordinate with backend owners on the spans that matter. This prevents the eternal backend drift of "we fixed a thing and p99 improved, so the incident must be over" while the user is still staring at a spinner. Treat the browser like another service with SLIs, budgets, and postmortem representation.
Viewpoint A: Dashboards are the lingua franca of ops. The argument: consistency beats curiosity when seconds matter. A veteran on-call can glance at five panels and know where to dig, and dashboards are the only way to communicate health broadly up and out. You don’t need a 9,000-span trace to decide whether to roll back.
Viewpoint B: Dashboards are a comfort blanket. The counter: dashboards fossilize bad questions. Real incidents require new questions, and traces plus high-cardinality queries let you ask them. The fastest path to root cause is a single session traced to a single slow span, not a thousand-foot view of averages.
The truth in practice: Keep dashboards, but make them the lobby, not the destination. The conference room is a user session, a trace explorer, and a shared, searchable investigation. Get in, test a hypothesis, and get out with a fix — then update your SLO narrative, not your wallpaper.
A team I worked with (you know the type: twelve microservices, one monorepo, and an existential relationship with feature flags) had spotless backend graphs during a revenue dip. The culprit was a new image optimizer that over-compressed hero images only for Android devices on slow networks, making buttons barely visible. RUM lit up, session traces showed longer input delays only on that path, and a single span revealed a conditional that skipped a cache for “lossless” images. No infrastructure metrics moved. Dashboards looked angelic. The user reality did not. Anchoring on the session found the issue in minutes. The rollback took longer than the detective work.
Another incident: a checkout flow went intermittently blank for EU customers. Backend metrics showed a mild increase in 5xx at the edge. Session replay and traces tied it to a certificate rotation at a specific POP causing a slow TLS handshake only for Safari; a timeout in the SPA routed users to an empty state. The fix was at the edge; the validation was in the browser; the story stitched end-to-end made the postmortem actually useful.
Would you rather page on a user-centered SLO that sometimes hides infra pain, or on an infra metric that sometimes lies to users?
How many dashboards do you genuinely consult during an incident before you pivot to traces or logs? Be honest — your screenshot folder won’t judge you.
If you had to delete 80% of your panels tomorrow, which workflows would you keep so engineers can still answer new questions fast?
What’s your team’s plan for privacy-safe context propagation from browser to backend — and if you don’t have one, why are you still collecting user IDs at all?
When was the last time a dashboard directly produced a root cause, not just a direction of travel?
Headless, frontend-first observability is not a rejection of backend-first discipline. It’s a pragmatic admission that users don’t file bugs titled “p99 latency rose by 11%.” They say “checkout froze on my phone.” If you start where they start, your traces tell a tighter story, your SLOs protect real experiences, and your dashboards earn a humbler, more useful role. In SRE and DevOps, our job is to keep promises, not pictures. Design your observability so the first click in an investigation is the one your customer just made. Everything after that should feel like following a thread, not wandering a museum.
Close your ten favorite dashboards if you have to. Keep the three that keep you safe. Then wire your browser to your backend with semantic conventions, context propagation, and a ruthless respect for privacy. The next time the pager goes off, you’ll start with a human, end with a cause, and maybe — just maybe — get back to sleep before your coffee turns into a runbook step. And if anyone asks why your wall looks empty, tell them you’re saving pixels for real stories.
“Gain actionable insights with real user monitoring: the latest features in Grafana Cloud Frontend Observability.” https://grafana.com/blog/2024/08/26/gain-actionable-insights-with-real-user-monitoring-the-latest-features-in-grafana-cloud-frontend-observability/
“Frontend Observability With Emily Nakashima and Charity Majors.” Honeycomb blog. https://www.honeycomb.io/blog/frontend-observability-emily-nakashima-charity-majors
“Observability: the present and future, with Charity Majors.” The Pragmatic Engineer. https://newsletter.pragmaticengineer.com/p/observability-the-present-and-future
“Monitoring Distributed Systems.” Google SRE Book. https://sre.google/sre-book/monitoring-distributed-systems/
“Monitoring.” Google SRE Workbook. https://sre.google/workbook/monitoring/
“Semantic Conventions.” OpenTelemetry Docs. https://opentelemetry.io/docs/concepts/semantic-conventions/
“Trace semantic conventions.” OpenTelemetry Spec. https://opentelemetry.io/docs/specs/semconv/general/trace/
“The Missing Guide to OpenTelemetry Semantic Conventions.” Better Stack. https://betterstack.com/community/guides/observability/opentelemetry-semantic-conventions/
“What is Real User Monitoring (RUM)?” New Relic. https://newrelic.com/blog/best-practices/what-is-real-user-monitoring
“Real User Monitoring With a Splash of OpenTelemetry.” Honeycomb blog. https://www.honeycomb.io/blog/real-user-monitoring-and-opentelemetry
“Real User Monitoring (RUM) and Frontend Performance.” OpenObserve. https://openobserve.ai/blog/real-user-monitoring-and-frontend-performance/
“Frontend Observability for Real User Monitoring (RUM).” groundcover. https://www.groundcover.com/blog/real-user-monitoring
“Notes on the Perfidy of Dashboards.” charity.wtf. https://charity.wtf/2021/08/09/notes-on-the-perfidy-of-dashboards/
“There Is Only One Key Difference Between Observability 1.0 and 2.0.” charity.wtf. https://charity.wtf/2024/11/19/there-is-only-one-key-difference-between-observability-1-0-and-2-0/
“4 SRE Golden Signals (What they are and why they matter).” FireHydrant. https://firehydrant.com/blog/4-sre-golden-signals-what-they-are-and-why-they-matter/
“SRE Metrics: The Four Golden Signals of Monitoring.” Splunk. https://www.splunk.com/en_us/blog/learn/sre-metrics-four-golden-signals-of-monitoring.html
#SRE #SiteReliability #DevOps #Observability #RUM #OpenTelemetry #Tracing #SLI #SLO #FrontendPerformance #DevOpsCulture #IncidentResponse