Created on 2025-09-27 12:58
Published on 2025-10-08 10:15
If you’ve been anywhere near an on-call rotation lately, you’ve seen the same tug-of-war: one camp wants to instrument everything with OpenTelemetry (OTel) to keep costs sane and vendors interchangeable; the other camp wants a vendor’s agent because it “just works,” ships rich features on day one, and doesn’t make Friday deploys feel like a science project. Both sides claim reliability wins. Both sides also claim the other is about to break prod.
Here’s the truth SREs quietly acknowledge at 3 a.m.: you can succeed with either strategy — if you measure the right things, design for failure, and keep human reality in mind. This article breaks down the latest state of OTel and vendor agents, gives you a crisp set of evaluation criteria, and offers pragmatic ways to roll out observability without lighting your budget (or your team) on fire.
The OTel-first argument starts with standards. The OpenTelemetry Protocol (OTLP) is now stable for traces, metrics, and logs, which means the wire format between your app, the collector, and your backends isn’t supposed to change out from under you. That stability translates into leverage: you can point the same telemetry at multiple destinations, split routing by signal, and swap or combine backends when procurement or performance demands it.
OTel also rides on the W3C Trace Context standard for cross-service correlation. That little traceparent header you’ve squinted at? It lets traces propagate across services, platforms, and even vendors, so your distributed system doesn’t lose the plot halfway through a user journey.
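To make that header less squint-worthy, here is a minimal sketch of reading and forwarding a version-00 traceparent value using only the Python standard library. The helper names (`parse_traceparent`, `child_traceparent`) are hypothetical, and in practice an OTel SDK propagator does this for you; the sketch only illustrates the `version-traceid-parentid-flags` layout that the W3C spec defines.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # The spec treats all-zero trace-id and parent-id values as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace-id, fresh span id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    parsed = parse_traceparent(incoming)
    if parsed is not None:
        print(child_traceparent(parsed))
```

The trace-id stays constant across every hop while the parent-id changes per span, which is exactly what lets a backend stitch the journey back together.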
Where OTel really earns SRE affection is the Collector. Instead of running a zoo of one-off agents, you operate a single vendor-neutral data plane that receives, processes, and exports telemetry. It’s the switchboard where you can downsample traces, drop noisy logs, remap attributes, and fan-out to multiple backends — all without touching application code. The Collector team even published a roadmap to v1 to tighten guarantees around stability and packaging, which matters when you’re betting your company’s telemetry on it.
There’s nuance on logs. OTel has a stable OTLP logs signal and data model, but pieces of the logs experience are still maturing in certain language SDKs; for example, the Go SIG explicitly called out logs API stabilization as a 2025 priority. Translation: production-ready paths exist today (collect from files, forward via Collector, correlate with traces), but depending on your language, you may still use bridges to your favorite logging library for the authoring side.
Finally, OTel-first shines for cost control. Tail-based sampling is a first-class pattern: make sampling decisions after you’ve seen the whole trace, so you can keep “spicy” traffic (errors, slow requests, VIP tenants) and shed the rest. The result is fewer war rooms where you have “all the boring traces” and none of the ones that matter.
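As a rough illustration of the decision logic, here is a sketch of a tail-sampling policy in plain Python. It is not the Collector's implementation; the `Span` fields, the 500 ms threshold, the VIP tenant set, and the 5% baseline are all assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False
    tenant: str = ""


def keep_trace(
    spans: List[Span],
    latency_threshold_ms: float = 500.0,        # hypothetical "slow" cutoff
    vip_tenants: FrozenSet[str] = frozenset(),  # hypothetical VIP tenant names
    baseline_percent: int = 5,                  # share of boring traces to keep
) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    if not spans:
        return False
    # Keep anything "spicy": errors, slow requests, VIP tenants.
    if any(s.is_error for s in spans):
        return True
    if any(s.duration_ms >= latency_threshold_ms for s in spans):
        return True
    if any(s.tenant in vip_tenants for s in spans):
        return True
    # Otherwise keep a deterministic baseline: hashing the trace id means
    # every collector instance reaches the same decision for the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < baseline_percent
```

The key property is that the decision runs on the complete trace, so a single failing span deep in the call graph is enough to keep the whole thing.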
The agent crowd’s case is also strong: day-one visibility with minimal fuss. Modern agents autodiscover services, inject tracing automatically, and ship opinionated dashboards, alerts, and topology maps. Some even lean on eBPF to observe network and process activity without code changes, lighting up service maps and golden signals across polyglot fleets in minutes — really handy when you inherit the world’s messiest monorepo plus five microservices called “api.”
Case in point: Datadog’s Universal Service Monitoring uses eBPF to detect and monitor services without touching application code, and can be enabled at the agent/Helm level in Kubernetes. New Relic’s Pixie (open source) brings eBPF-powered, “no-instrumentation” visibility for Kubernetes clusters, with guides to auto-attach APM where needed. These are big quality-of-life wins for busy SREs who’d like to spend their Friday evenings not typing kubectl with shaking hands.
Then there’s the “platform effect.” Vendor ecosystems bundle CI visibility, RUM, profiler, incident response, cost lenses, and AIOps features that do correlation and anomaly detection for you. You may not agree with every alert they generate, but “batteries included” can drastically shrink integration time, especially for teams without a telemetry platform squad.
Three trends matter in 2025.
First, cloud providers now speak OTLP natively or officially bless OTel pathways. Google Cloud announced native OTLP trace ingestion via telemetry.googleapis.com on September 12, 2025; Azure expanded OTel across Monitor, Functions, and Logic Apps; AWS continues to push its AWS Distro for OpenTelemetry (ADOT) for X-Ray, CloudWatch, and partners. That means “OTel in, cloud backends out” is a mainstream path, not a science fair project.
Second, OTel’s semantic conventions (the “names and shapes” of attributes) have stabilized and keep converging across domains like HTTP, databases, and messaging. This isn’t just spec-nerd trivia; consistent conventions are how backends auto-build service maps, error dashboards, and SLO insights without a human spending weekends normalizing field names.
Third, the OTel Collector is on a clear path to v1 with tighter stability promises — exactly the kind of boring reliability SREs adore. When your data plane has predictable lifecycle and module stability, change management stops feeling like defusing a bomb.
The OTel-first engineer says: “I want my telemetry like my container images — portable. If finance says we’re paying too much for logs, we route them to cheaper storage tomorrow and keep traces flowing to our favorite UI. And if the APM contract drama returns, we don’t have to rip and replace agents.” The agent-first engineer replies: “Cool story, but our incidents don’t care about your exporter purity. The agent lit up service discovery in ten minutes, the anomaly detection pointed at a bad deploy, and the mobile team’s RUM dashboards look respectable without us writing glue.”
They’re both right — and both a little wrong. OTel does reduce switching costs and gives you awesome control, but you’ll spend that flexibility running a telemetry platform. Vendor agents slash time-to-first-insight and bring powerful features, but you’ll pay in subscription costs and may accept less portability. The sweet spot for many orgs? A hybrid: OTel-first in the app and collector layer for leverage and cost control, and one “primary” vendor for UX, correlation, and AI-assisted workflows.
Let’s make this concrete. Your evaluation should hinge on three pillars: coverage, integration time, and total cost of ownership (TCO).
In practice, “coverage” means traces for every user-facing hop, metrics for every critical resource, and logs where they add unique value. On traces, start by instrumenting at ingress and egress boundaries and ensure W3C Trace Context flows through services and third-party calls. Aim to capture exemplars or span metrics for SLO golden signals so you aren’t blind when sampling drops traces. Keep an eye on logs maturity for your language: if your SDK’s logs API is still stabilizing, rely on your existing logging framework and feed its output into the Collector so you don’t lose correlation.
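To show why span metrics protect you from sampling blind spots, here is a small sketch that derives rate, error, and duration numbers from spans before any sampling happens. The `SpanRecord` and `RedStats` types are hypothetical stand-ins, not OTel SDK classes, and a real pipeline would typically use histograms rather than a running mean.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SpanRecord:
    service: str
    duration_ms: float
    is_error: bool


@dataclass
class RedStats:
    requests: int = 0
    errors: int = 0
    total_duration_ms: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def mean_duration_ms(self) -> float:
        return self.total_duration_ms / self.requests if self.requests else 0.0


def red_from_spans(spans: Iterable[SpanRecord]) -> Dict[str, RedStats]:
    """Aggregate per-service request, error, and duration totals from raw spans."""
    stats: Dict[str, RedStats] = defaultdict(RedStats)
    for span in spans:
        entry = stats[span.service]
        entry.requests += 1
        entry.errors += int(span.is_error)
        entry.total_duration_ms += span.duration_ms
    return dict(stats)
```

Because the counters are updated for every span, the SLO numbers stay accurate even if the sampler later throws away most of the underlying traces.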
Anecdote: we once had 99% of services emitting traces, except the one doing retries. Guess where the latency spike hid? Yes, the untraced retry loop. Coverage is not just “how many services” — it’s “which edges matter for answering pager questions.”
Clock it. For each approach, track time-to-first-trace, time-to-correlated-logs, time-to-SLO dashboard, and time-to-one real incident investigated end-to-end. Vendor agents tend to win the “first week” with auto-instrumentation and out-of-the-box visuals; eBPF-powered features like Universal Service Monitoring or Pixie can map services with zero code changes. OTel-first may take longer up front — standing up the Collector, standardizing semantic conventions, building pipelines — but pays back with fewer ongoing detours.
Anecdote: during a Friday incident, the agent’s AIOps flagged a suspicious deployment and narrowed error suspects in minutes. The next quarter, finance flagged our log bill, and the OTel pipeline let us drop low-value logs and keep alerts stable over a single afternoon. Both were wins. Different timescales, different heroes.
TCO = vendor subscriptions + storage/egress + platform engineering time + incident/time-to-resolve externalities. Don’t underestimate cardinality: one eager developer adds user_id and session_id to every metric tag, and suddenly your time series explode — and so does your bill. Vendors have levers (e.g., tag allowlists and decoupled ingestion/indexing), and OTel pipelines have levers (processors to drop, remap, and sample before data hits expensive backends). Use them.
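The cardinality blow-up is easy to see with back-of-the-envelope arithmetic. The sketch below computes the worst-case number of time series for one metric as the product of each tag's distinct values; the tag names and counts are made-up numbers for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Mapping


def max_series_count(tag_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical tag cardinalities for a single request-latency metric.
    sane = {"service": 20, "endpoint": 50, "status_code": 5}
    print(max_series_count(sane))  # 20 * 50 * 5 = 5000

    # One eager developer adds user_id with 100,000 distinct values.
    eager = {**sane, "user_id": 100_000}
    print(max_series_count(eager))  # 5000 * 100000 = 500000000
```

One extra tag turned a five-thousand-series metric into a potential half-billion-series metric, which is the kind of multiplication your bill notices.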
When you model costs, simulate real traffic. Turn on tail-based sampling policies that keep erroring or long-latency traces and shed the rest; then confirm your SLO dashboards and debugging playbooks still work. Treat logs as a precision instrument: route verbose logs to cheap storage with short retention, index only the subset you actually query during incidents. Your wallet (and your future self) will thank you.
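Here is one way the log-routing idea could look as a policy function. The destination names (`"indexed"`, `"cold"`), the severity table, and the `hot_services` allowlist are assumptions for the sketch; in a real deployment this logic would live in Collector processors or your vendor's pipeline rules rather than in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Simplified severity ordering; higher numbers are more severe.
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40, "FATAL": 50}


@dataclass(frozen=True)
class LogRecord:
    service: str
    severity: str
    body: str


def route_log(record: LogRecord, hot_services: FrozenSet[str] = frozenset()) -> str:
    """Return "indexed" for logs worth paying to query, "cold" for everything else."""
    # Unknown severities are treated as INFO rather than silently promoted.
    level = SEVERITY.get(record.severity.upper(), SEVERITY["INFO"])
    # Warnings and above are always indexed: they are what incidents get debugged with.
    if level >= SEVERITY["WARN"]:
        return "indexed"
    # INFO logs are indexed only for services the on-call team actually queries.
    if record.service in hot_services and level >= SEVERITY["INFO"]:
        return "indexed"
    # Everything else goes to cheap storage with short retention.
    return "cold"
```

The allowlist is the part worth reviewing regularly: it should reflect what people searched for during recent incidents, not what someone guessed would be useful.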
Make OTel your lingua franca in code and cluster: use OTel SDKs and auto-instrumentation where stable, propagate W3C trace context, and standardize on semantic conventions. Ship everything to the Collector, then export to your “primary” vendor for most teams’ daily workflows while mirroring a slice to a secondary backend for leverage. This gets you the vendor UX and AIOps today, with an escape hatch if costs or needs change tomorrow. It also avoids agent sprawl on hosts and containers. And because OTLP is stable across signals, you aren’t betting on quicksand at the protocol layer.
Yes, “monitor everything” sounds noble — until your alerts compete with Netflix for your attention and your CFO starts learning about high-cardinality time series the hard way. Build your pipeline so economics are features, not afterthoughts. Use tail-based sampling to keep the traces that violate SLOs or hit top endpoints; derive span metrics for RED/USE; drop noisy attributes; and route debug-level logs to cold storage with cheap retention. Run monthly “telemetry audits” to review the top 20 high-cardinality dimensions and fix tagging drift in code. A small iteration here saves big money there.
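A telemetry audit can start as a very small script. The sketch below takes a sample of tag sets (one mapping per time series, exported however your backend allows) and ranks tag keys by how many distinct values they carry; the function name and input shape are assumptions, not any vendor's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set, Tuple


def top_cardinality_tags(
    series_tags: Iterable[Mapping[str, str]],
    top_n: int = 20,
) -> List[Tuple[str, int]]:
    """Rank tag keys by distinct value count, highest first."""
    values: Dict[str, Set[str]] = defaultdict(set)
    for tags in series_tags:
        for key, value in tags.items():
            values[key].add(value)
    # Sort by descending cardinality, then by key name for stable output.
    ranked = sorted(
        ((key, len(vals)) for key, vals in values.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    sample = [
        {"service": "cart", "user_id": "u1"},
        {"service": "cart", "user_id": "u2"},
        {"service": "checkout", "user_id": "u3"},
    ]
    print(top_cardinality_tags(sample, top_n=2))  # [('user_id', 3), ('service', 2)]
```

Whatever lands at the top of that list each month is the first candidate for an allowlist, a remap, or a conversation with the team that added it.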
If your estate is a heterogeneous carnival, use eBPF-powered discovery to bootstrap visibility. Light up service maps and golden signals with a vendor agent or Pixie in hours, then incrementally add OTel code-level instrumentation to critical services for rich spans, business attributes, and durable portability. This “ladder” lets SREs reduce incident risk immediately while platform engineers build the standards base.
The OTel-first purist says the agent path is lock-in wearing a friendly hoodie. “One day you’ll want to route logs to cheaper storage, or keep traces raw in your data lake for ML, and that magical agent will say ‘cool story, here’s the enterprise add-on.’” The agent-first pragmatist counters: “I’d love to hand-craft semantic conventions, but users are down. The agent shipped anomaly detection and root-cause hints before we finished arguing about attribute names.” For balance: the standards side increasingly enjoys rich vendor integrations — major clouds and APMs ingest OTLP and embrace OTel — while vendor platforms increasingly speak OTel fluently. The lines are blurring, by design.
Tools don’t fix on-call culture. An SRE team with well-understood SLOs, tidy rollback habits, and a bias for “boring and observable” will do fine with either approach. A team that treats observability as an afterthought will drown in dashboards no matter how shiny the UI. The most reliable systems I’ve seen treat telemetry like product: someone owns the schema, reviews changes, and watches cost/coverage SLOs the way product managers watch conversions.
And remember: you’re not choosing technology so much as you’re choosing where to spend your effort. OTel-first spends it building leverage and cost control. Agent-first spends it buying speed and out-of-the-box intelligence. Most orgs do a bit of both.
Are you brave enough to answer these in public? I double dog dare you.
First, if your vendor doubled prices tomorrow, how many weeks would it take to reroute your telemetry — and who’s on the hook for that work? Second, what percentage of your current log volume has never been queried — and why is it still being indexed? Third, when your last P1 hit, did you have the trace you needed, or ten thousand traces you didn’t? Fourth, if your security team asked for an audit trail of trace context across services, would you hand them evidence or vibes? Fifth, which costs more for you right now: observability storage or engineers’ time spent staring at dashboards?
Pick the path that fits your maturity curve today, but design it so Future You can change their mind. Use OTel to make switching possible; use a vendor to make Monday mornings survivable. And keep your incident runbook handy — for everything except Step 1.
OTLP Specification (Status: Stable for traces, metrics, logs) — https://opentelemetry.io/docs/specs/otlp/
The roadmap to v1 for the OpenTelemetry Collector — https://opentelemetry.io/blog/2024/collector-roadmap/
W3C Trace Context (traceparent/tracestate) — https://www.w3.org/TR/trace-context/
Datadog Universal Service Monitoring (eBPF, no-code service detection) — https://www.datadoghq.com/product/universal-service-monitoring/
OpenTelemetry now in Google Cloud Observability (OTLP trace ingestion) — https://cloud.google.com/blog/products/management-tools/opentelemetry-now-in-google-cloud-observability