Created on 2025-09-14 10:58
Published on 2025-09-29 10:30
We ask “SRE, platform, or product?” as if there’s a single, eternal answer. There isn’t. Observability is a capability, not a team. It cuts across how you build, run, and evolve software. If you frame it with Team Topologies, you quickly see why this becomes a topology, interaction, and outcomes question rather than a headcount debate. Stream-aligned (product) teams own customer outcomes; platform teams reduce their cognitive load by offering paved roads; enabling teams uplift skills and accelerate adoption; complicated-subsystem teams hold specialist areas too thorny to generalize. Those four team types—and the interaction modes between them—are the most reliable compass I’ve found for deciding who does what, when, and why.
Central observability groups exist because toolchains are messy, costs creep, and consistency matters. The platform lens is useful here. Platforms exist to give stream-aligned teams powerful building blocks with low friction. In most orgs, the observability stack is undeniably platform-shaped: collectors and agents, data pipelines, storage and retention, query engines, dashboards, alerting, budgets and quotas, and governance around data privacy. Treating this as a product—with roadmaps, SLIs/SLOs, and customer research—is work that belongs squarely in a platform team’s backlog. The CNCF’s platform guidance even lists observability as a core capability your platform should provide, which reinforces that this is part of the “paved road,” not a random side street.
Centralization also makes scaling patterns easier. Whether you’re aggregating traces across clusters or unifying dashboards so teams can share and reuse what works, some aspects truly benefit from a “single spine” approach. But centralization is a power tool: helpful in skilled hands, capable of serious damage when used indiscriminately.
I’ve seen central observability groups turn into ticket queues with a side hustle in dashboard archaeology. They start with noble intent—“we’ll help teams instrument and standardize”—and end up as a bottleneck for everyone’s alerts. The deeper problem is misaligned ownership. Observability is only valuable when it is about someone’s service, someone’s users, someone’s SLOs. If the team that builds the thing doesn’t also own the signals and the on-call outcomes, observability degenerates into charts for their own sake.
The Google SRE material is blunt about this: you balance reliability and velocity via error budgets and SLOs that are jointly owned by product, development, and SRE. It’s a social contract, not a dashboard setting. When SLOs live with the people shipping features, alerts finally encode intent rather than guesswork.
If you map the capability onto the four team types, the lines get crisper:
Stream-aligned teams own the signals that express customer value. That means the service’s golden signals, SLOs, alerts, and runbooks. They decide what “good” looks like and wire the alarms that wake them at 3 a.m.

Platform teams own the paved road: telemetry ingestion, storage, query and viz, golden templates, and self-service APIs. They operate it as a product with reliability guarantees, docs, and a clear UX.

Enabling teams parachute in to raise the bar—migrations to OpenTelemetry, SLO/SLI literacy, alert-design workshops—and then leave.

If you hit a truly gnarly subsystem—say, highly specialized real-time routing or a shared data plane—the complicated-subsystem team may own bespoke instrumentation there, but even then, the dependent stream-aligned teams still own their SLOs on top.
Let’s stage the debate you’ve probably had in the hallway, on a whiteboard, or with too much coffee.
On one side, the Centralization Camp: “A central observability team should own the stack and the standards, otherwise we get tool sprawl, inconsistent metrics, and bloated bills.” There’s truth here. Shared pipelines and governance reduce cost; unified schemas and dashboards lift discoverability; economies of scale matter in high-cardinality worlds. Even practitioners who champion engineer-owned telemetry acknowledge cost discipline and platform product thinking are essential, especially as organizations learn where spend should land.
On the other side, the Decentralization Camp: “Product teams must own observability end to end because they own customer outcomes; anything else is outsourcing reliability.” Also true. SRE and DORA research consistently position reliability as a product decision, with SLOs negotiated across product, development, and SRE. If you separate ownership of alerts and SLOs from the people shipping changes, you create perverse incentives and stale telemetry. The healthiest pattern is that platform reduces friction and SRE sets the bar, but stream-aligned teams write and live with the signals.
And then there’s the Pragmatist’s View: “Own it together, but differently.” Platform owns the platform; product owns the product signals; SRE owns the reliability contract and the guardrails. You get central standards without robbing teams of agency. You also get a place to put the work nobody else will pick up: migrations, schema evolution, and cross-cutting analytics. The CNCF TAG Observability and the OpenTelemetry community both embody this federated model at ecosystem scale: strong common APIs and governance, decentralized implementation by many vendors and teams. That’s the vibe you want internally.
Here’s the human part. Engineers behave the way their incentives and feedback loops tell them to. If your SLOs are abstract, nobody learns. If your alerts are noisy, everyone mutes them. If your platform is hard to use, teams will invent Homebrew Observatorium in a sprint and never look back. Conversely, when the platform is a joy to use and the SLOs are linked to decisions—freeze the deploys, pay down tech debt, trade velocity for reliability—observability stops being a compliance task and becomes a competitive advantage. DORA’s research over the past decade keeps landing on a similar truth: culture, clarity, and good platform experiences matter as much as tools.
At one company I worked with, a central “obs team” owned every alert rule in the org. They had dashboards for days and an inbox full of tickets requesting “add CPU alert.” When the main checkout path flaked, product engineers didn’t get paged; the observability team did. They dutifully investigated and re-routed pages to the service team, which made them a glorified switchboard. After one too many 2 a.m. escalations, they pivoted to platform-as-a-product: they built a self-service rule builder, declared SLOs as the first-class way to alert, and ran a series of enabling engagements to help teams instrument with OpenTelemetry. The moment the checkout team felt their own pager buzz, they rewrote alerting to align with customer pain, not CPU. Miraculously, incidents got shorter, and the “obs team” stopped being a helpdesk and became a platform. The org didn’t hire new heroes; it changed ownership.
Here’s the crisp split that works in most mid-to-large organizations.
Product (stream-aligned) teams own the service’s SLIs/SLOs, the alerts tied to those SLOs, the runbooks, and the day-to-day telemetry quality of their code paths. They choose what to measure, instrument at source, and live with the consequences during incidents. Google’s SRE guidance reinforces that SLOs and error-budget policies must be jointly set by product, dev, and SRE—ownership you can see on the page and in the pager rotation.
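To make that ownership concrete, here is a minimal sketch of the kind of SLI a stream-aligned team might define: an availability ratio of good events to total events, compared against the team’s SLO target. The function name, the 99.9% target, and the event counts are all illustrative assumptions, not anything prescribed by the SRE workbook.

```python
# A request-based availability SLI: good events over total events,
# compared against the team's SLO target. All numbers are invented.

def availability_sli(good: int, total: int) -> float:
    """Fraction of events that were 'good'; vacuously 1.0 with no traffic."""
    return good / total if total else 1.0

slo_target = 0.999  # hypothetical SLO owned by the checkout team

sli = availability_sli(good=998_650, total=999_800)
print(f"SLI: {sli:.4%}, target: {slo_target:.1%}")
print("meeting SLO" if sli >= slo_target else "burning budget")
```

The point is less the arithmetic than where it lives: the team shipping the service picks what counts as a “good” event, because they are the ones paged when the ratio drops.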
Platform teams own the paved road for observability. They supply opinionated, documented, reliable building blocks: ingestion/collection (often OpenTelemetry-based), storage and retention policies, query engines, a standard dashboarding and alerting experience, role-based access, and cost controls. They operate it as a product, measure their own SLOs, and gather feedback like any external SaaS would. The CNCF’s platform whitepaper and Team Topologies’ “platform as a product” framing aren’t just catchy phrases—they’re the operating model that keeps the central team from becoming a ticket queue.
SRE owns the reliability contract and enablement. They define what “good” looks like in practice—alerting principles, SLO hygiene, incident response—and either embed with teams or act as an enabling function that raises the bar and then steps away. They are the stewards of error budgets and the adults in the room when trade-offs get spicy. And when SRE is also the platform owner for observability, they wear both hats—just don’t let the platform product work swallow the reliability coaching.
Nothing focuses the mind like a bill. Observability spend grows with cardinality and enthusiasm. Industry voices have long argued that you shouldn’t pretend observability is “free,” and that a healthy benchmark is to think of it as a percentage of infra costs, with wide error bars based on context and stage. Whether you agree with the exact bands or not, the point stands: treat cost as a first-class signal and give teams showback so they can make grown-up trade-offs. It’s hard to own what you can’t see.
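A toy sketch of what showback could look like: split the monthly bill across teams in proportion to ingested telemetry volume, so each team sees its own share. The team names, volumes, and bill are invented for illustration; real cost models are usually more nuanced (cardinality, retention tiers, query load).

```python
# Showback sketch: allocate an observability bill across stream-aligned
# teams in proportion to what each one ingests. Hypothetical numbers.

monthly_bill = 30_000.0  # total observability spend, in dollars (assumed)

ingested_gb = {          # telemetry volume per team (assumed)
    "checkout": 1_200,
    "search":     600,
    "payments":   200,
}

total_gb = sum(ingested_gb.values())
showback = {team: monthly_bill * gb / total_gb for team, gb in ingested_gb.items()}

for team, cost in sorted(showback.items(), key=lambda kv: -kv[1]):
    share = ingested_gb[team] / total_gb
    print(f"{team:>10}: ${cost:,.2f} ({share:.0%} of volume)")
```

Even a crude proportional split like this changes behavior: once the checkout team sees it owns more than half the bill, the conversation about cardinality and retention becomes theirs, not the platform team’s.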
First, make SLOs the front door. If a team wants to add an alert, the form begins with “Which SLO does this protect?” It’s amazing how many alerts evaporate when they have to justify their existence in customer terms. Teach people to alert on symptoms, not guesses. Provide SLO templates and a humane error-budget policy with pre-agreed actions when budgets burn down. The SRE workbook provides a pragmatic walkthrough here; your job is to make it muscle memory for every team.
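The error-budget arithmetic behind that policy is simple enough to sketch. This is a hypothetical example, assuming a 99.9% target over a 30-day window and made-up policy thresholds; the real numbers and actions should come from your own pre-agreed error-budget policy.

```python
# Error-budget arithmetic: how much unreliability an SLO permits,
# how much has been spent, and what a pre-agreed policy says to do.
# The target, window, and thresholds are illustrative assumptions.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes for the window, e.g. 99.9% over 30 days."""
    return window_days * 24 * 60 * (1 - slo_target)

def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    total = error_budget_minutes(slo_target, window_days)
    return (total - bad_minutes) / total

def policy_action(remaining: float) -> str:
    """A hypothetical pre-agreed error-budget policy."""
    if remaining < 0:
        return "freeze deploys; reliability work only"
    if remaining < 0.25:
        return "slow down: reliability items jump the queue"
    return "ship normally"

budget = error_budget_minutes(0.999)  # roughly 43 minutes over 30 days
print(f"budget: {budget:.1f} min")
print(policy_action(budget_remaining(0.999, bad_minutes=35)))
```

Notice that the policy is a function of the budget, not of any particular metric: that is what lets the “which SLO does this alert protect?” question filter out alerts with no customer-facing justification.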
Second, operate the observability stack as a product. Publish a roadmap. Set SLOs for your own platform (availability, query latency, ingestion lag). Offer golden paths: “If you instrument with these libraries and export via this collector, your data will show up in these dashboards with these exemplars.” Partner closely with security on data governance. Keep your platform central, but your ownership distributed: product teams own signals; platform owns the substrate. The CNCF platform guidance and Team Topologies’ platform-as-product talks are your playbooks.
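As a sketch of what “SLOs for your own platform” might mean in practice, here is a toy check of p99 ingestion lag against an assumed 60-second target, using a nearest-rank percentile over sampled lag measurements. In reality the lags would come from ingested-at minus emitted-at timestamps in the pipeline; the target, sample values, and function names are all assumptions.

```python
# Platform self-measurement sketch: compute p99 ingestion lag from
# sampled lag measurements and check it against the platform's own SLO.
# The 60-second target and the sample data are invented.
import math

SLO_P99_LAG_SECONDS = 60.0  # hypothetical platform SLO for ingestion lag

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) over a non-empty sample."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Sampled ingestion lags in seconds (ingested_at - emitted_at), assumed data.
lags = [2.1, 3.4, 1.8, 55.0, 4.2, 2.9, 70.5, 3.1, 2.2, 2.6]

p99_lag = percentile(lags, 99)
print(f"p99 ingestion lag: {p99_lag:.1f}s")
print("within SLO" if p99_lag <= SLO_P99_LAG_SECONDS else "SLO breach: page the platform team")
```

The detail that matters is who gets paged by this check: the platform team, on its own rotation, exactly as an external SaaS vendor would be for its product.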
Third, invest in enabling bursts, not forever-teams. Run time-boxed adoption programs: “six weeks to SLOs” or “OpenTelemetry migration sprints.” Pair experienced SREs with product teams, pay down instrumentation debt, and leave behind patterns, not dependency. Rinse and repeat for the next area. Sustain it with an internal community of practice and a library of example dashboards that teams can copy, not tickets they must file. The result looks suspiciously like what the OpenTelemetry community and CNCF TAG Observability do in the wild: align on standards and let teams build.
The first failure mode is “platform owns everything.” The central team collects all the data, writes all the dashboards, and answers all the pages. It scales until it doesn’t, then collapses under the weight of other people’s services. Engineers stop learning because the feedback loop is somewhere else. Platform burnout follows.
The second failure mode is “every team for themselves.” Tool sprawl, inconsistent schemas, and unpayable bills. War rooms where nobody’s dashboards agree. Tribes invent their own definitions of “latency.” Your SREs become archaeologists, digging through sedimentary layers of metrics.
The cure is the same in both cases: explicit boundaries, great defaults, and relentless enablement. Give teams enough freedom to own outcomes and enough guardrails to not crash into each other. Centralize the things that compound, decentralize the signals that guide hands on keyboards.
None of this works if leaders won’t protect the time. Observability is toil when it’s squeezed between feature deadlines and budget cuts. It becomes leverage when it’s given real priority: quarterly goals tied to SLO adoption, platform OKRs tied to developer satisfaction, incident review actions that actually change telemetry. DORA’s work links platform quality and developer experience with performance outcomes; don’t make it a side quest. Put it in the plan.
Before your next meeting turns into a tool debate, try these on for size. Are your SLOs written down, blessed by product, and tied to a budget policy—or are you still paging on CPU because that’s what the default template did? If your primary observability cluster went for a long lunch, would your incident response meaningfully degrade—or do you have a platform SLO and a fallback plan? When a brand-new service spins up, how many minutes until its golden signals land in a dashboard someone actually uses? If your observability bill doubled tomorrow, who would change what—and how fast could they see the savings? And, most uncomfortable of all, when customers complain, do your engineers see the same pain in their graphs within a minute?
Who owns observability? Product teams own the signals and SLOs for their services. Platform teams own the platform that makes great signals cheap and easy. SRE owns the reliability contract, the patterns, and the enablement that keeps everyone honest. If you insist on a single owner, pick “the people who get paged when it breaks,” and then make sure your topology, platform, and policies line up with that truth. You’ll ship faster, sleep better, and you might even delete a few dashboards you secretly hate.
In the end, observability is just engineering empathy with a UI. You empathize with your users by defining what good feels like, with your future self by leaving breadcrumbs in telemetry, and with your colleagues by making the platform a paved road, not a maze. Do that, and your midnight self might finally forgive your sprint-planning self. Or at least send fewer angry Slack messages.
Team Topologies — Key Concepts. https://teamtopologies.com/key-concepts
IT Revolution: “The Four Team Types from Team Topologies.” https://itrevolution.com/articles/four-team-types/
CNCF TAG App Delivery — Platforms White Paper. https://tag-app-delivery.cncf.io/whitepapers/platforms/
Google SRE Workbook — “Implementing SLOs.” https://sre.google/workbook/implementing-slos/
Google SRE Workbook — “Error Budget Policy.” https://sre.google/workbook/error-budget-policy/
Google Cloud Blog — “Announcing the 2024 DORA report.” https://cloud.google.com/blog/products/devops-sre/announcing-the-2024-dora-report
DORA: Accelerate State of DevOps Report 2024. https://dora.dev/research/2024/dora-report/
CNCF TAG Observability (GitHub). https://github.com/cncf/tag-observability
OpenTelemetry Blog — “Behind the scenes of the OpenTelemetry Governance Committee.” https://opentelemetry.io/blog/2024/otel-governance/
Honeycomb — “How Much Should I Be Spending On Observability?” (Apr 16, 2025). https://www.honeycomb.io/blog/how-much-should-i-spend-on-observability-pt1
Honeycomb — “Frontend Observability With Emily Nakashima and Charity Majors.” https://www.honeycomb.io/blog/frontend-observability-emily-nakashima-charity-majors
Grafana Blog — “Managing access in Grafana: a single-stack journey with teams, roles, and real-world patterns” (Sept 8, 2025). https://grafana.com/blog/2025/09/08/managing-access-in-grafana-a-single-stack-journey-with-teams-roles-and-real-world-patterns/
Grafana Docs — “Cross-cluster query federation (Enterprise Traces).” https://grafana.com/docs/enterprise-traces/latest/configure/federation/
Team Topologies — “What is platform as a product?” (talk). https://teamtopologies.com/videos-slides/what-is-platform-as-a-product-clues-from-team-topologies
#SRE #SiteReliability #DevOps #Observability #TeamTopologies #PlatformEngineering #OpenTelemetry #DORA #SLOs #OnCall