Published on 2026-02-03 11:30
If your on-call has recently involved debugging “why our tokens-per-minute hit a brick wall at lunch” rather than “why service X returned 500s,” welcome to a new normal. Reliability for AI systems is no longer a quirky sideline; it’s deep in the SRE critical path. The modern stack ships models, not just microservices, so we’re coping with failure modes like output drift, long-context weirdness, token-throughput throttles, GPU saturation, and—because physics always wins—energy ceilings.
What’s striking is how explicitly this is showing up in mainstream SRE programming: talks, runbooks, and roadmaps now include LLM telemetry, evaluation loops, and the signals that matter beyond latency and error rate. Open standards are catching up too; the telemetry world is agreeing on vocabulary for prompts, tokens, models, tools, and agents, so we can finally trace AI the way we trace RPCs.
Output drift is the sneaky one. Your model passed evals on Tuesday, then by Thursday the same prompt produces a different tone, a different structure, or flat-out fabrications. This isn’t a simple “error rate”; it’s reliability as users experience it, which is to say trust. Teams are building drift detection, human-in-the-loop gates, and application-level guardrails precisely because “green dashboards, wrong answers” is real. Industry write-ups increasingly treat hallucination and drift as first-class operational risks that need monitoring, versioning, and auditability, not just model tuning.
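A minimal sketch of a drift gate, assuming you keep a golden set of prompts with references and some scoring function; call_model, score_output, and the threshold here are hypothetical placeholders for your client, your eval metric, and your own tolerance:

    import statistics

    DRIFT_THRESHOLD = 0.05  # illustrative: alert if mean quality drops more than 0.05 on a 0-1 scale

    def call_model(prompt: str) -> str:
        """Placeholder for your provider SDK or internal gateway call."""
        raise NotImplementedError

    def score_output(output: str, reference: str) -> float:
        """Placeholder eval metric in [0, 1]: structural check, rubric scorer, or LLM-as-judge."""
        raise NotImplementedError

    def drift_check(golden_set: list[tuple[str, str]], baseline_mean: float) -> bool:
        """Re-run the golden set and flag drift when the mean score falls below the baseline."""
        scores = [score_output(call_model(prompt), ref) for prompt, ref in golden_set]
        current = statistics.mean(scores)
        return (baseline_mean - current) > DRIFT_THRESHOLD

Run it on a schedule and on every model-version or prompt-template change, and file the result next to the deploy that caused it.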
Long context is another trap. We all cheered for million-token windows, then rediscovered cognition is hard: models tend to privilege the beginning and end of the sequence, struggling with information jammed in the middle. If your “single-prompt everything” RAG pipeline regressed at 128K tokens, you likely met the “lost in the middle” effect. That’s not a theoretical paper cut; it drives user-visible variance and expensive retries.
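One low-tech mitigation is to stop trusting the middle. A sketch, assuming you already have retrieved chunks ranked best-first, that places the strongest evidence at the edges of the context where models attend most reliably; the interleaving heuristic is an assumption for illustration, not a prescription:

    def edge_weighted_order(chunks_by_relevance: list[str]) -> list[str]:
        """Put the most relevant chunks at the start and end of the context,
        pushing the weakest material into the middle where it is most likely
        to be ignored anyway."""
        head, tail = [], []
        for i, chunk in enumerate(chunks_by_relevance):
            (head if i % 2 == 0 else tail).append(chunk)
        return head + tail[::-1]

    # chunks ranked best-first: c1 and c2 end up at the edges, c5 lands in the middle
    print(edge_weighted_order(["c1", "c2", "c3", "c4", "c5"]))
    # -> ['c1', 'c3', 'c5', 'c4', 'c2']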
Throughput bottlenecks are where SRE reality collides with LLMOps marketing. APIs and self-hosted stacks cap tokens per minute and tokens per second, often differently for input and output; uneven bursts get 429s even when you’re “under” your minute-level quota. In practice, you need concurrency shaping, traffic smoothing, and token budget accounting—not unlike QPS engineering, just with a new unit. Provisioned throughput SKUs, gateways, and batching engines are the levers SREs now tune.
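A minimal sketch of token-budget shaping, assuming a per-minute token quota and an async client; the continuous refill and the back-off interval are illustrative choices, not tuned values:

    import asyncio
    import time

    class TokenBudget:
        """Tokens-per-minute budget: callers reserve tokens before sending a request."""

        def __init__(self, tokens_per_minute: int):
            self.capacity = float(tokens_per_minute)
            self.available = float(tokens_per_minute)
            self.last_refill = time.monotonic()
            self.lock = asyncio.Lock()

        async def reserve(self, tokens: int) -> None:
            while True:
                async with self.lock:
                    now = time.monotonic()
                    # Refill continuously instead of in minute-sized steps, which smooths bursts.
                    self.available = min(self.capacity,
                                         self.available + (now - self.last_refill) * self.capacity / 60.0)
                    self.last_refill = now
                    if self.available >= tokens:
                        self.available -= tokens
                        return
                # Back off briefly rather than firing the request and eating a 429 retry.
                await asyncio.sleep(0.25)

    async def send(budget: TokenBudget, prompt_tokens: int, expected_output_tokens: int) -> None:
        await budget.reserve(prompt_tokens + expected_output_tokens)
        ...  # issue the actual API call here

Reserving input plus expected output tokens up front is what keeps a burst of long generations from blowing through the window after the requests are already in flight.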
Then there’s the metal. AI reliability includes “GPU reliability,” which is not just “servers, but hotter.” Operating large accelerator fleets means wrestling firmware quirks, ECC error bursts, straggler devices, and heterogeneity that torpedoes scheduling efficiency. At hyperscale, the dominant cause of training downtime is often hardware and configuration faults; the remedy is detection, diagnostics, and fast job restarts with healthy nodes—SRE muscle memory, just transplanted to accelerators.
Finally: energy. As AI pulls megawatts into inference and training, power becomes a gating SLO. Reliability for the business includes “we can keep serving under the facility power cap without melting our latency.” SREs are starting to track energy-normalized performance, apply power capping, and even treat watts as a first-class signal alongside p95s. Global outlooks show why this matters: data-centre electricity demand is rising steeply with AI, so efficiency isn’t a nice-to-have—it’s how you stay online.
Classic golden signals won’t cut it alone. We still need latency, saturation, and error rate, but SRE dashboards now include prompt IDs, model versions, tool-call traces, token counts, cache hit ratios, retrieval corpus versions, and cost. The good news: standards exist. The OpenTelemetry semantic conventions for GenAI define spans and attributes for LLM calls, vector DB operations, and agents, giving us portable telemetry and vendor-neutral autoinstrumentation. That unlocks the usual SRE discipline—SLOs, burn-rate alerts, and incident analysis—except the units are prompts and tokens rather than RPCs and bytes.
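A sketch of what that instrumentation looks like with the OpenTelemetry Python API; the gen_ai.* attribute names follow the published conventions, while the tracer name, model name, and fake_chat_call client are stand-ins for whatever you actually run:

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-assistant")  # illustrative service name

    def fake_chat_call(prompt: str) -> dict:
        """Stand-in for your provider SDK or inference gateway."""
        return {"text": "...", "model": "example-model-v2", "input_tokens": 812, "output_tokens": 96}

    def traced_chat(prompt: str) -> str:
        # Span named "{operation} {model}", attributes per the GenAI semantic conventions.
        with tracer.start_as_current_span("chat example-model") as span:
            span.set_attribute("gen_ai.operation.name", "chat")
            span.set_attribute("gen_ai.request.model", "example-model")
            response = fake_chat_call(prompt)
            span.set_attribute("gen_ai.response.model", response["model"])
            span.set_attribute("gen_ai.usage.input_tokens", response["input_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])
            return response["text"]

Because the attribute names are standardized, the same dashboards and burn-rate alerts work whether the span came from your code, a vendor SDK, or an auto-instrumentation library.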
This standardization is spreading to vendor tooling: the observability platforms you already use can ingest GenAI spans and tie them to your service traces, cost lines, and error budgets. That means when someone says “quality dipped after 14:00,” you can actually correlate it with a vector index deploy, a model version bump, or a cache invalidation (the three most dangerous words in AI: “just retrained it”).
If you’ve ever babysat a flaky storage cluster, you’re emotionally prepared. Accelerator fleets simply make the blast radius bigger and the telemetry weirder. The meta-patterns look familiar: fail-in-place designs, aggressive health checks, automated node triage, rapid checkpoint/restore, and heterogeneity management so schedulers don’t get trapped by “mixed-bag” SKUs. What’s new is the failure texture—HBM errors, firmware regressions, thermal derates—and the cost of getting it wrong. In high-synchrony training environments, a single bad node can pause a job spanning thousands of accelerators, so your MTTR playbook is half “replace the part,” half “replace the participant.” Meta’s public write-ups frame it plainly: lower hardware/config-fault time, categorize failures quickly, and prioritize fast restarts with healthy nodes.
For inference, saturation feels like the worst day of a shopping holiday: bursts slam output-token throughput while long prompts hog KV-cache memory. Techniques like paged attention and smarter batching were invented precisely to keep GPUs busy without fragmenting memory. If you haven’t looked at the serving stack beneath your API gateway, you’re leaving throughput on the table.
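If you do own that layer, the serving engine exposes the relevant knobs directly. A sketch using vLLM's offline API, where the model name, memory fraction, and sampling settings are illustrative rather than a tuned production config:

    from vllm import LLM, SamplingParams

    # PagedAttention and continuous batching happen inside the engine;
    # gpu_memory_utilization controls how much VRAM is handed to the KV cache.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
    params = SamplingParams(max_tokens=256, temperature=0.2)

    # Submitting many prompts at once lets the engine batch them across KV-cache pages
    # instead of serving one request per forward pass.
    outputs = llm.generate(["Summarize our incident review.", "Draft a status update."], params)
    for out in outputs:
        print(out.outputs[0].text)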
Strip the buzzwords and what’s left is SRE muscle: performance engineering, cost control, routing and caching, and end-to-end telemetry. Prompt caching moved from “blog tip” to production primitive; it’s the CDN of your context. Meanwhile, serving stacks like vLLM borrowed a page from OS memory management to squeeze KV-cache usage and unlock batching at concurrency levels that actually match real traffic. When your ops review includes tokens, cache hit rate, and eval scores, you’re doing SRE—you’ve just swapped HTTP 500s for “quality dips” and quota-429s.
One camp argues that the right place for AI reliability is squarely inside SRE. The reasoning is practical: we already own availability, latency, and cost; with GenAI semantic conventions in place, the telemetry plugs into existing tooling, processes, and error budgets. In this view, AI is another workload; we adapt our SLOs and add new signals, but the discipline doesn’t split.
The other camp says: proceed carefully—ML systems carry unique failure surfaces and hidden coupling that don’t look like microservices at all. The classic paper on “hidden technical debt in ML systems” warned that ML has extra entanglements—undeclared consumers, feedback loops, and configuration creep—that resist naïve adoption of software-only practices. This view suggests either a dedicated AI reliability function or a hybrid model (SRE + ML/Safety) because success demands deep model and data intuition, not just service SLOs.
Both sides think the other is about to break prod. Both are a little right. The pragmatic move in 2026 is “SRE-plus”: keep the SRE ownership of production outcomes, embed ML/LLMOps expertise for model-specific risks, and share the dashboards.
First, redefine SLOs for AI as multi-dimensional. Latency and availability still matter, but add quality and cost. Quality can be measured via continuous evals on golden sets, live shadow traffic scoring, or human review loops. Cost needs hard budgets per request and per user so you can enforce token ceilings and avoid “chatty agents” going rogue. The trick is to attach these SLOs to the same error budget ritual you already run, so teams trade latency, accuracy, and cost consciously instead of playing whack-a-mole.
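A toy sketch of what a multi-dimensional SLO check might look like once quality and cost sit next to latency; the targets and field names are assumptions for illustration, not recommended values:

    from dataclasses import dataclass

    @dataclass
    class AiSloTargets:
        p95_latency_ms: float = 2500          # latency still matters
        availability: float = 0.999
        min_eval_score: float = 0.85          # continuous evals on a golden set
        max_cost_per_request_usd: float = 0.04

    def slo_violations(window: dict, targets: AiSloTargets) -> list[str]:
        """Compare one measurement window of observed values against the targets."""
        checks = {
            "latency": window["p95_latency_ms"] <= targets.p95_latency_ms,
            "availability": window["availability"] >= targets.availability,
            "quality": window["eval_score"] >= targets.min_eval_score,
            "cost": window["cost_per_request_usd"] <= targets.max_cost_per_request_usd,
        }
        return [name for name, ok in checks.items() if not ok]

    print(slo_violations(
        {"p95_latency_ms": 2100, "availability": 0.9995, "eval_score": 0.81, "cost_per_request_usd": 0.03},
        AiSloTargets(),
    ))  # -> ['quality']

The point of making all four dimensions visible in one check is that the trade-offs become an explicit error-budget conversation instead of a surprise in the cost report.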
Second, upgrade your telemetry pipeline. Instrument your apps with GenAI semantic conventions so every request carries model version, prompt hash, input/output token counts, tool-call traces, vector-index revisions, and cache flags. Fold those spans into your existing traces so a user click can be followed through service calls, model invocations, and RAG lookups. That’s how you investigate “RAG got slow after deploy”: you’ll see that the prompt grew 8×, the retrieval index changed, and cache hit rate cratered.
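That investigation only works if the application-level facts ride on the same trace as the model call. A sketch that attaches them to whatever span is currently active; the attribute names beyond gen_ai.* are made up for illustration, so pick your own and keep them stable:

    import hashlib
    from opentelemetry import trace

    def annotate_rag_request(prompt: str, index_revision: str, cache_hit: bool) -> None:
        """Attach RAG-specific facts to the active span so the user request,
        the retrieval, and the LLM call all share one trace."""
        span = trace.get_current_span()
        span.set_attribute("app.prompt.sha256", hashlib.sha256(prompt.encode()).hexdigest())
        span.set_attribute("app.prompt.chars", len(prompt))
        span.set_attribute("app.retrieval.index_revision", index_revision)
        span.set_attribute("app.cache.hit", cache_hit)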
Third, build a throughput and cost playbook. Treat quotas and tokens like QPS and bytes. Shape traffic to avoid spiking short windows that trigger 429s; steady the flow, parallelize safely, and pre-warm caches. Use prompt caching for the heavy prefix you keep sending (system prompts, policies, doc boilerplate) and measure the difference—hit rates in high double digits are common when you standardize templates. If you self-host, pick a serving stack that manages KV-cache smartly and supports continuous batching; you’re chasing tokens/sec per dollar, not just raw speed.
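For the caching half, here is a sketch with the Anthropic SDK's prompt caching, marking the stable prefix as cacheable; the model name and policy text are placeholders, and the savings you see depend entirely on your traffic shape:

    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
    LONG_POLICY_TEXT = "..."         # your multi-kilobyte system prompt / policy boilerplate

    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative model name
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_POLICY_TEXT,
                # Mark the heavy, stable prefix as cacheable so repeat requests
                # only pay full price for the new user turn.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarize this ticket against the policy."}],
    )
    # usage reports cache_creation_input_tokens / cache_read_input_tokens: your hit-rate signal
    print(response.usage)

Standardized templates matter here because the cache is prefix-based: one stray character early in the prompt and every request becomes a cache miss.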
Fourth, get serious about GPU ops hygiene. Cohort firmware versions so you can canary upgrades safely. Add pre-flight health checks and burn-in for flaky parts. Automate triage: blackhole nodes with recurrent ECC errors, and bias schedulers toward “known-good” pools during peak. The incident loop should optimize for rapid checkpoint/restore with a clean set of accelerators; humans dig into root cause after the job is healthy. This is the same SRE DNA—reduce MTTR—applied to accelerators.
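A sketch of automated ECC triage, assuming nvidia-smi is available on the node and that "quarantine" in your environment means a scheduler cordon or taint; the threshold is an arbitrary example:

    import subprocess

    ECC_THRESHOLD = 8  # illustrative limit on volatile uncorrected ECC errors

    def ecc_uncorrected_counts() -> dict[int, int]:
        """Read per-GPU volatile uncorrected ECC counts via nvidia-smi's CSV query."""
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        counts = {}
        for line in out.strip().splitlines():
            idx, errs = [field.strip() for field in line.split(",")]
            counts[int(idx)] = 0 if errs in ("[N/A]", "N/A") else int(errs)
        return counts

    def quarantine(device: str) -> None:
        # Placeholder: in Kubernetes this would be a cordon plus taint; here we just log.
        print(f"quarantining {device}: recurring uncorrected ECC errors")

    for gpu, errors in ecc_uncorrected_counts().items():
        if errors > ECC_THRESHOLD:
            quarantine(f"gpu{gpu}")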
Fifth, make energy a first-class signal. Add power caps, experiment with GPU power profiles, and track energy-normalized throughput. In shared facilities, you’ll inevitably hit a power ceiling; the team that can shed 10% power for ≤3% perf drop will keep SLAs while others flail. Treat watts as part of the SLO: your AI meets p95s and stays under the cap.
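A sketch of treating watts as both a knob and a signal; -pl is nvidia-smi's power-limit flag (root and a supported GPU required), and the numbers are purely illustrative:

    import subprocess

    def set_power_cap(gpu_index: int, watts: int) -> None:
        """Set a GPU power limit via nvidia-smi (requires root privileges)."""
        subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

    def tokens_per_joule(output_tokens: int, avg_power_watts: float, duration_s: float) -> float:
        """Energy-normalized throughput: watts times seconds is joules, so this is tokens per joule."""
        return output_tokens / (avg_power_watts * duration_s)

    # Example: cap a hot GPU at 300 W, then compare efficiency before and after the change.
    set_power_cap(0, 300)
    print(tokens_per_joule(output_tokens=120_000, avg_power_watts=300, duration_s=60))

Tracking tokens per joule next to p95 latency is what lets you prove the "shed 10% power for ≤3% perf drop" trade before the facility forces it on you.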
You will chase a “latency regression” that turns out to be a lost-in-the-middle case: someone stuffed 90 pages of context into a single prompt, the answer was fine but took forever, and retries blew the budget. You will discover your “mysterious quality drop” maps perfectly to a vector index refresh with slightly different chunking. You will watch GPU #7 flake out every third hour and bless the day you automated node quarantine. And you will file the incident under “we need better dashboards” until you standardize the spans and attributes that matter.
None of this is alien to SRE. It’s the same craft with new materials.
If you had to pick one, would you burn error budget on speed or quality for your AI feature this quarter, and why do you think users would agree?
What’s the most useful GenAI metric you added to a dashboard that you didn’t expect to matter—and what did it reveal?
Should GPU fleet health live with the platform team, the SRE team, or a new “accelerator ops” group? No hedging—pick one.
If you had to enforce a single energy rule for AI services—power cap, energy-per-token target, or carbon budget—what would you choose and how would you wire it into deployment gates?
SRE has always been about shipping reliability through uncertainty. AI just turned up the uncertainty dial and swapped “retry storms” for “prompt storms.” The playbook still works—instrument, measure, set SLOs, automate the boring parts, and leave humans for judgment calls. The only real change is what shows up on the y-axis. And maybe the jokes. Those definitely got weirder.
OpenTelemetry: “Semantic conventions for generative AI systems” — https://opentelemetry.io/docs/specs/semconv/gen-ai/
Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)” — https://arxiv.org/abs/2309.06180
Anthropic Docs: “Prompt caching — Claude API” — https://platform.claude.com/docs/en/build-with-claude/prompt-caching
Meta Engineering: “How Meta keeps its AI hardware reliable” — https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/
International Energy Agency: “Energy and AI” — https://www.iea.org/reports/energy-and-ai
#SRE #SiteReliability #DevOps #LLMOps #Observability #AIOps #MLOps #GPU #EnergyEfficiency #ReliabilityEngineering