If that felt a little too real, welcome. Today we’re unpacking one of the spiciest debates in modern observability: go eBPF-first, stick with agents and sidecars, or bet on the new “ambient” meshes that promise security and telemetry without gluing a proxy to every pod. We’ll talk portability, kernel drift, and the vendor ecosystems shaping your choices, and we’ll keep it grounded in what actually happens on-call at 3 a.m.
The pitch is seductive. Instead of sprinkling SDKs and sidecars across your fleet, you load small programs into the Linux kernel, attach to well-defined hooks, and harvest rich network and runtime signals with close to zero app changes. Thanks to BPF CO-RE and BTF metadata, those programs increasingly behave like portable binaries rather than bespoke per-kernel snowflakes. In short, you stop arguing with service owners about “one more library,” and you start seeing traffic, requests, errors, and durations, regardless of whether the service is Go, Java, Rust, or “that thing Steve wrote in 2014.”
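To make that concrete, here is a minimal sketch of what a CO-RE-style probe can look like. It is illustrative rather than lifted from any of the tools above: it assumes a vmlinux.h generated from kernel BTF, the libbpf headers, and a kprobe on tcp_connect purely as an example hook.

```c
// Minimal CO-RE probe sketch (illustrative): a kprobe on tcp_connect that reads
// a socket field through BTF-driven relocations instead of hard-coded offsets.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/tcp_connect")
int BPF_KPROBE(trace_tcp_connect, struct sock *sk)
{
    // BPF_CORE_READ records a relocation, so the field offset is resolved
    // against the *running* kernel's BTF at load time, not at compile time.
    __u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    bpf_printk("tcp_connect to port %u", bpf_ntohs(dport));
    return 0;
}
```

The point is not this particular probe; it is that the same compiled object can load on kernels it was never built against, as long as BTF is available to resolve those relocations.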
It’s not just a science project. Production-grade tools have made eBPF’s value painfully obvious. Grafana’s Beyla built a reputation for zero-code HTTP/gRPC metrics and traces and has since been donated into the OpenTelemetry eBPF effort, signaling an industry push to standardize kernel-powered auto-instrumentation. Datadog’s Universal Service Monitoring leans on eBPF to discover services from the traffic itself and emit the RED basics without a code change. Pixie’s instant Kubernetes telemetry, now under New Relic, showed platform teams how quickly you can go from “no idea” to “root cause” on a fresh cluster. Continuous profiling with Parca demonstrated that near-constant sampling doesn’t have to mean “say goodbye to your CPU credits.”
The operational narrative is familiar to SREs: fewer moving parts near the app, less developer friction, and a single path to catch signals across polyglot stacks. When your biggest problem is not “how do I instrument this?” but “how do I keep up with all the new services?”, eBPF-first feels like the only approach that scales with reality.
Now for the catch. “Compile once, run everywhere” runs headfirst into “it works on my kernel.” CO-RE and BTF do increase portability dramatically, but they don’t change the fact that you are loading code into different kernels maintained by different vendors with different backports and slightly different verifier behavior. On Tuesday, your program passes the verifier on Ubuntu LTS with backported BTF; on Wednesday, the same bytecode grumbles on an older managed node image where BTF isn’t present, or a verifier quirk rejects an otherwise harmless program path.
The good news is that the ecosystem is no longer shrugging at this. Tooling like BTFHub and generators such as BTFGen reduce the “no BTF on target” headache by letting you ship or fetch just enough type information for CO-RE to work with. Security and kernel folks have also built verifier harnesses that exercise the kernel’s verifier outside a running kernel, so you can test your eBPF programs across a matrix of versions before your node pool does. The bottom line: eBPF portability is dramatically better than in the pre-CO-RE era, but it’s still engineering, not magic. Treat kernel compatibility like a first-class SLO: document the minimum kernel versions you support, exercise your loaders in CI against the same kernels your clusters run, and have a graceful “no eBPF here” fallback.
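On the user-space side, that fallback can be a handful of lines in the loader. The sketch below assumes libbpf and uses hypothetical file paths; a real loader would also match any external BTF blob to the exact kernel release it is loading on.

```c
// Sketch of a CO-RE-first loader with an external-BTF fallback (hypothetical
// paths and object name; error handling trimmed to the essentials).
#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
    LIBBPF_OPTS(bpf_object_open_opts, opts);

    // Prefer the kernel's embedded BTF; otherwise point libbpf at a blob we
    // shipped or fetched (BTFHub) or generated ahead of time (BTFGen).
    if (access("/sys/kernel/btf/vmlinux", R_OK) != 0)
        opts.btf_custom_path = "/etc/my-agent/btf/vmlinux.btf";

    struct bpf_object *obj = bpf_object__open_file("probe.bpf.o", &opts);
    if (!obj) {
        fprintf(stderr, "open failed: no usable BTF on this node\n");
        return 1; /* supervisor falls back to agent-mode collection */
    }
    if (bpf_object__load(obj)) {
        // Verifier rejection or missing feature on this kernel: log a clearly
        // labeled event and let agent-mode collection take over on this host.
        fprintf(stderr, "eBPF load rejected on this kernel; falling back\n");
        bpf_object__close(obj);
        return 1;
    }

    /* ... attach programs, read maps and ring buffers, export to the collector ... */
    bpf_object__close(obj);
    return 0;
}
```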
If you’re running multi-cloud Kubernetes, brace for mixed fleets. AKS, EKS, and GKE don’t ship identical kernels or enable identical features. Some nodes will have embedded BTF and be delightful; others will leave you rummaging for external BTF or rethinking probes that reach into kernel modules. When SREs say “test in prod,” this is what they mean: build a compatibility matrix and keep it alive as clusters evolve.
It’s trendy to dunk on agents and sidecars, but let’s be honest: they work, and they’re rich in the ways SREs depend on. The OpenTelemetry Collector gives you pipelines with receivers, processors, and exporters that you can shape to your needs. Want tail-based sampling because the interesting traces only reveal themselves at the end? That’s a processor away. Need batching, retry, backpressure, PII scrubbing, or cost controls? You wire those in and monitor the pipeline as if it were a critical service, because it is.
Sidecars bring context that kernel taps don’t always capture. Envoy next to your app can see negotiated protocols, method names, gRPC status codes, and custom headers without guesswork. That fidelity has paid for itself in enough incident reviews that many teams keep the sidecar even when they also deploy eBPF sensors. The tradeoff is resource cost and cognitive load. One proxy per pod means thousands of tiny furnaces warming your cluster, and you can absolutely turn a metrics pipeline into a denial-of-service if you let cardinality get feral. Every platform team has a story about an over-eager label that set a data fire.
Still, when you need predictable packaging, explicit upgrade windows, and repeatable config at app boundaries, agents and sidecars remain the boring, dependable tools that keep an SRE employed.
Then ambient meshes showed up and asked a different question: what if the data plane wasn’t glued to each workload? In Istio’s ambient mode, a node-level “ztunnel” handles Layer 4 security and routing, and optional per-service “waypoint” proxies add Layer 7 smarts where you actually need them. Instead of thousands of sidecars, you have a handful of shared proxies and a policy model that’s less about individual pods and more about namespaces and services.
For observability, that shift is profound. The default capture points move from per-pod to per-node for L4, and to per-service waypoints for L7. Access logs that used to come from every Envoy sidecar may no longer exist unless you enable them on waypoints. The knobs you reach for also change. Rather than “roll a sidecar with these extra filters,” you think “turn on L7 at this waypoint and ship those logs to the collector.” Telemetry becomes more centralized by design, which can simplify your pipelines and cost model, but it can also conceal the micro-variations you used to get for free when each pod had its own verbose proxy.
Ambient flips a few familiar SRE instincts. Want all the context you used to enjoy? Opt into a waypoint for the specific services where that context matters. Want broad, cheap, L4-only visibility and mTLS everywhere? Let ztunnel do the heavy lifting and keep L7 out of the hot path. The surprise many teams hit on day one is that the “sidecar default” assumptions don’t hold; you need to explicitly enable the logs, metrics, and traces you care about at the waypoint boundaries you’ve defined. Once you internalize that model, the ergonomics can feel cleaner than hunting through thousands of sidecar configs for “the one setting that drifted.”
Let’s put the two threads together. eBPF-first promises portability across languages and platforms, yet you’re still marrying your fate to kernels that drift and vendor tools that wrap the kernel with convenience. Sidecars and agents promise neutrality via open protocols and schemas, yet you often consume them through commercial distributions and managed backends. Ambient meshes aim to cut costs and simplify ops, but you’ll likely involve a vendor to navigate multicluster, gateway integration, and “what’s the safe default?”
The healthiest trend in all three camps is the gravitational pull toward OpenTelemetry. Beyla’s donation into OTel’s eBPF initiative is a strong signal that kernel-native auto-instrumentation doesn’t have to be vendor-locked. Many mesh vendors and platform teams are leaning into OTel pipelines for logs and traces, whether those signals originate at a node-level ztunnel, a waypoint, or a sidecar. Even continuous profiling is coalescing around open formats like pprof. The practical play is to embrace the vendor ecosystem for what it accelerates—hardening, packaging, fleet management—while insisting the data formats and control points stay portable.
Picture two incidents. In the first, you’ve gone all-in on eBPF for HTTP telemetry. A critical path starts timing out. You pull up live maps of inter-service calls that discovered themselves from the wire, and in minutes you see that one Go service just doubled its error rate for POST /checkout. You didn’t instrument it; you didn’t need to. That’s a good eBPF day.
In the second, you’re missing one bit of context: a custom header introduced last sprint that turns out to correlate with every failing call. Your kernel taps can’t see it reliably because the decoding logic for that protocol lives above the OS and your app isn’t using a common library. You pivot to the waypoint proxy for that service, enable L7 logging just for that namespace, and ship the logs to your OTel collector. Ten minutes later, you can filter on the header and map the blast radius. That’s a great ambient day.
And yes, there’s the third day, when a new node pool with a frugal kernel causes one eBPF program to refuse to load in exactly one availability zone. Your on-call runbook pays off because you built a CO-RE-first loader, shipped external BTF blobs for the common cases, and set a sane fallback to agent-mode collection for those nodes. That’s a great SRE day, because you got to go back to sleep.
The eBPF-first crowd says sidecars are yesterday’s compromise: heavy, chatty, hard to wrangle, with too many opportunities for config drift and bill shock. They argue that the kernel is the ultimate choke point, the place where you can see everything without asking developers for anything. They’re not wrong about the ergonomics or the speed of onboarding.
The sidecar-and-agent crowd counters that you can’t beat the reliability of explicit L7 context and the safety of mature processing pipelines. They warn that “portability” is a moving target when kernels diverge, and that verifier quirks are not the sort of roulette you want to play during an incident. They’re not wrong about predictable control and rich semantics.
Ambient partisans grin and say everyone else is fighting yesterday’s war. Why bolt a proxy to every pod when you can centralize L4 and selectively sprinkle L7 only where needed? Why not slash per-pod overhead and make multitenant clusters less toasty? They’re not wrong either—so long as you also accept that observability shifts from “everywhere by default” to “explicit where it counts,” and your team adjusts habits accordingly.
This is the fun part of SRE: all three positions can be correct, depending on the failure modes you fear most and the culture you run.
The first approach is to layer, not replace. Run eBPF-based auto-instrumentation for broad coverage and rapid discovery, but keep an OpenTelemetry Collector as your programmable choke point. Use tail-based sampling for traces so you catch the weird, high-latency ones without drowning in volume. When you need full L7 detail, light up a waypoint for that service in ambient mode or keep a small set of strategic sidecars. Yes, monitoring everything is great—until your alerts start competing with Netflix for your attention. Layering lets you pick your battles.
The second approach is to treat kernel compatibility as an SLO. Build a “kernel matrix” alongside your service catalog, wire CI to load your eBPF programs against the kernels you actually run, and ship external BTF to nodes that lack it. Make failure to load visible, actionable, and quiet—quiet in the sense that you fail back to an agent or collector on that host automatically, with a clearly labeled event so the platform team can fix it in daylight hours. The cost of writing this once is much lower than the cost of debugging verifier errors under adrenaline.
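One way to keep that matrix honest is a small preflight binary that CI (and node bootstrap) runs against every kernel image you care about. The sketch below is illustrative, assumes libbpf on the test host, and needs root or CAP_BPF to probe program types; the real test is still loading your actual objects, since verifier behavior differs even between kernels that nominally support the same program types.

```c
// Hypothetical CI/bootstrap preflight: exit non-zero if this kernel image
// cannot run our eBPF sensors, so the pipeline flags it before production does.
#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
    int ok = 1;

    // 1. Is BTF available, either embedded or staged externally by the image build?
    if (access("/sys/kernel/btf/vmlinux", R_OK) != 0 &&
        access("/etc/my-agent/btf/vmlinux.btf", R_OK) != 0) {
        fprintf(stderr, "preflight: no BTF found on this node image\n");
        ok = 0;
    }

    // 2. Does this kernel accept the program types our sensors depend on?
    if (libbpf_probe_bpf_prog_type(BPF_PROG_TYPE_KPROBE, NULL) != 1 ||
        libbpf_probe_bpf_prog_type(BPF_PROG_TYPE_TRACEPOINT, NULL) != 1) {
        fprintf(stderr, "preflight: required BPF program types unsupported\n");
        ok = 0;
    }

    return ok ? 0 : 1;
}
```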
The third approach is to be intentional about where L7 lives in an ambient world. Decide which namespaces or services deserve waypoint proxies because their semantics matter for debugging and SLOs. Keep L4-only ztunnel for the rest so your data plane stays lean. Align telemetry with the places you enforce policy. If you put rate limits or auth decisions at a waypoint, make that the source of truth for access logs and export them through your OTel gateway. If you keep an API edge with Envoy or another gateway, unify that with the same pipelines so you can trace a request from ingress through waypoints with consistent attributes.
There’s also a fourth approach that saves careers: design for vendor exit, even if you don’t plan to exit. Favor open schemas like OTLP, stick to open agents where you can, and when you need vendor-specific eBPF magic, isolate it behind the same collector pipelines you’d use if you switched providers. The day you have to change contracts is not the day you want to rewrite a thousand dashboards.
Before you commit to any of these, a few questions are worth debating as a team. Are we comfortable making kernel support part of our platform SLOs, and do our security teams agree with that tradeoff?
Where should we anchor L7 observability in an ambient mesh: at ingress, at a small set of waypoints, or back at the app with a sidecar for a few hotspots?
How much developer context do we actually need in traces to move MTTR, and can eBPF-first give us enough without heroic protocol decoding?
What’s our cost guardrail for per-pod proxies versus waypoint plus ztunnel, and do we know how that changes under a traffic spike or a new tenancy?
If we adopt vendor eBPF features today, what’s our exit plan that keeps data portable and dashboards useful a year from now?
We like to pretend these decisions are purely technical, but they’re mostly human. The team that loves eBPF often has strong kernel and networking instincts and a mandate to make things just work for developers. The team that loves sidecars loves explicitness and control, because they’ve been burned by ambiguity in the past. The team leaning into ambient is probably carrying a gnarly bill from the last mesh, or a mandate to simplify onboarding for dozens of app teams. None of them are wrong; all of them need to coexist in most real organizations.
If you’re leading SRE or platform, your job isn’t to pick a single hammer. It’s to give your engineers a toolbox that lets them choose eBPF when that’s the fastest daylight fix, a waypoint when you need L7 truth at a boundary, and a collector pipeline that keeps the whole thing predictable, debuggable, and fiscally sane. Tie it all back to SLOs, keep the blast radius small, and write the runbooks while you’re calm.
The future isn’t eBPF versus sidecars versus ambient. It’s a boringly reliable blend of the three, wrapped in open standards, with enough guardrails that your on-call brain can function at 03:04. Pick your choke points. Prove your portability. And remember: if you can’t explain your telemetry plan in a single page, you don’t have an observability strategy—you have a scavenger hunt.
BPF CO-RE (Compile Once – Run Everywhere) overview — Andrii Nakryiko — https://nakryiko.com/posts/bpf-portability-and-co-re/
BPF CO-RE reference guide — Andrii Nakryiko — https://nakryiko.com/posts/bpf-core-reference-guide/
BPF CO-RE concept — ebpf.io docs — https://docs.ebpf.io/concepts/core/
BTFHub project for external BTF types — Aqua Security — https://github.com/aquasecurity/btfhub
BTFGen: One Step Closer to Truly Portable eBPF Programs — Inspektor Gadget — https://www.inspektor-gadget.io/blog/2022/03/btfgen-one-step-closer-to-truly-portable-ebpf-programs/
Harnessing the eBPF Verifier — Trail of Bits — https://blog.trailofbits.com/2023/01/19/ebpf-verifier-harness/
Use our suite of eBPF libraries (verifier harness) — Trail of Bits — https://blog.trailofbits.com/2023/08/09/use-our-suite-of-ebpf-libraries/
The Challenge with Deploying eBPF Into the Wild — Pixie Labs blog — https://blog.px.dev/ebpf-portability/
Grafana Beyla OSS eBPF auto-instrumentation — https://grafana.com/oss/beyla-ebpf/
Introducing OpenTelemetry eBPF Instrumentation (Beyla donation) — Grafana — https://grafana.com/blog/2025/05/07/opentelemetry-ebpf-instrumentation-beyla-donation/
Datadog Universal Service Monitoring product page — https://www.datadoghq.com/product/universal-service-monitoring/
Automatically discover, map, and monitor all your services (USM blog) — Datadog — https://www.datadoghq.com/blog/universal-service-monitoring-datadog/
Pixie acquisition press release — New Relic — https://newrelic.com/press-release/20201210
Parca continuous profiling (project) — https://www.parca.dev/
Introduction to Parca Agent — Polar Signals — https://www.polarsignals.com/blog/posts/2023/01/19/introduction-to-parca-agent
OpenTelemetry Collector architecture — https://opentelemetry.io/docs/collector/architecture/
Collector deployment: Gateway pattern — https://opentelemetry.io/docs/collector/deployment/gateway/
Tail sampling concepts — OpenTelemetry — https://opentelemetry.io/docs/concepts/sampling/
Tail Sampling Processor — OpenTelemetry Collector Contrib — https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
Istio data plane modes: sidecar vs. ambient — https://istio.io/latest/docs/overview/dataplane-modes/
Istio Ambient mode overview and docs — https://istio.io/latest/docs/ambient/
Istio Ambient GA announcement (1.24) — https://istio.io/latest/blog/2024/ambient-reaches-ga/
Announcing Istio 1.24.0 — Ambient GA — https://istio.io/latest/news/releases/1.24.x/announcing-1.24/
Ambient waypoint configuration — Istio docs — https://istio.io/latest/docs/ambient/usage/waypoint/
Istio performance and scalability: resource usage numbers — https://istio.io/latest/docs/ops/deployment/performance-and-scalability/
Ambient logs default and enabling via Telemetry API — ambientmesh.io — https://ambientmesh.io/docs/observability/logs/
Traffic in ambient mesh: ztunnel, eBPF redirection, waypoints — Solo.io — https://www.solo.io/blog/traffic-ambient-mesh-ztunnel-ebpf-waypoint
Choosing the Right Istio Architecture: data-driven ambient vs. sidecar — Tetrate — https://tetrate.io/blog/choosing-the-right-istio-architecture-a-data-driven-guide-to-ambient-sidecar-and-hybrid-deployment-models
Cilium Service Mesh and Hubble (eBPF-based) — https://cilium.io/use-cases/service-mesh/
Hubble for network observability — Cilium blog — https://cilium.io/blog/2024/08/14/hubble-for-network-security-and-observability-part-1/
#SRE #SiteReliability #DEVOPS #eBPF #OpenTelemetry #Observability #Kubernetes #Istio #ServiceMesh #Envoy #CORE #Tracing #Profiling