SREs never sleep.

Published on 2025-12-08 11:30

This is technically false. We just sleep in 15-minute increments between alerts. Our circadian rhythm is aligned to incident frequency, not daylight. Also, my smartwatch thinks “acknowledge” is a kind of yoga.

The myth of the sleepless SRE (and why it persists)

Every ops veteran has a story that starts at 03:14, features a status page nobody bookmarked, and ends with a half-typed kubectl command rescued by muscle memory and sheer terror. Those nights become lore. They also become culture. Somewhere along the way, “reliability” gets conflated with “availability of humans,” and heroics turn into a metric.

But let’s be blunt: no one does their best systems thinking after three partial naps and a Red Bull. The data backs up what our bleary eyes already know. Sleep loss degrades cognition in ways comparable to alcohol; after prolonged wakefulness, reaction time and accuracy fall off a cliff. Paging people repeatedly at night isn’t just unpleasant—it’s a reliability risk in itself. When we normalize sleeplessness as a badge of honor, we trade stability for mythology.

At the same time, reality intrudes. Incidents are up across many enterprises, driven by complicated stacks, faster change velocity, and the operational drag of integrating AI into production. More moving parts mean more failure modes; more deployments mean more opportunities to stumble. No amount of caffeine converts that complexity into calm without structural changes to how we run on-call.

So the question isn’t, “Do SREs sleep?” It’s, “What would it take to make restful sleep compatible with 24/7 reliability?” Spoiler: it isn’t more heroics; it’s better systems—both technical and human.

What the data says (and what SRE already teaches)

When you sift through the research and the SRE canon, a few themes repeat louder than a pager in a quiet room.

First, pages must be rare, urgent, and actionable. Not “informative.” Not “interesting.” Actionable. Google’s SRE guidance is unambiguous: paging a human is expensive and should be reserved for user-visible, urgent conditions that require intelligent intervention. Everything else belongs in a dashboard, a ticket, or (ideally) an automation. And when incidents do hit, treating two per 12-hour shift as an upper bound gives humans the time to resolve them properly, write the postmortem, and follow through on fixes rather than stack up IOUs for “Future Us.”

Second, sustainable rotations are a design problem, not a stoicism contest. If you want 24/7 coverage with both primary and secondary, and you want each engineer to spend no more than a quarter of their time on-call, that implies a minimum team size and a rotation cadence. And because night shifts are hard on human bodies, multi-site “follow-the-sun” coverage isn’t a luxury for big tech—it’s a health intervention disguised as an org chart.
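
To make that sizing arithmetic concrete, here is a back-of-the-envelope calculation in Python. The 25% cap and the primary/secondary model come from the SRE canon cited below; the coverage fractions and team shapes are assumptions you would swap for your own.

    import math

    def min_oncall_team(roles: int, coverage_fraction: float, oncall_cap: float) -> int:
        """Minimum engineers needed so nobody exceeds the on-call cap.

        roles             -- people simultaneously on-call (e.g. 2 for primary + secondary)
        coverage_fraction -- share of the week this site covers (1.0 = 24/7, 0.5 = half in follow-the-sun)
        oncall_cap        -- max fraction of an engineer's time spent on-call (e.g. 0.25)
        """
        fte_of_coverage = roles * coverage_fraction   # pager duty to staff, expressed as FTEs
        return math.ceil(fte_of_coverage / oncall_cap)

    # Single site, 24/7, primary + secondary, 25% cap -> 8 engineers
    print(min_oncall_team(roles=2, coverage_fraction=1.0, oncall_cap=0.25))   # 8

    # Two-site follow-the-sun: each site covers half the week -> 4 engineers per site
    print(min_oncall_team(roles=2, coverage_fraction=0.5, oncall_cap=0.25))   # 4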

Third, change is the main source of instability. DORA’s research keeps showing that teams with solid delivery practices—fast flow, small batches, good CI/CD hygiene, and clear SLOs—ship faster and operate more stably. Reliability is not the enemy of speed; incoherence is. Tie paging to SLOs and error budgets, and the system begins to steer itself: when reliability is trending down, you slow change and invest in stabilizers; when it’s healthy, you safely push velocity.

Finally, the industry’s alert volume is not trending toward “peaceful.” With more microservices, more multi-cloud, and more AI-augmented features, the surface area grows. If you don’t deliberately prune alerts and automate remediation, entropy will draft you into a 3 a.m. hobby you didn’t want.

Two worldviews walk into an incident channel…

If you want a lively debate at your next postmortem review, try these two:

View A: “Own the pager. Be an adult.”

In this corner: the craftsmanship crowd. You build it, you own it, you carry the pager. On-call is a responsibility, not a punishment. Deep system ownership shortens recovery, improves design decisions, and spreads reliability thinking back into everyday work. With strong runbooks and good tooling, this view says, you can carry on-call without turning into a nocturnal cryptid. Also, centralized ops teams and NOCs can dull the feedback loop; if you never feel the pain, you might never fix the cause.

View B: “Follow the sun, automate the rest.”

Here: the systems-thinking pragmatists. Humans are diurnal mammals. Prolonged night work leads to real health impacts and impaired decision-making; therefore, avoid it structurally. Organize teams across time zones, use an incident command model to share cognitive load, automate detection and remediation aggressively, and send pages only when a human brain is truly required. Embrace AIOps and classification to cut noise; treat “actionable, urgent, user-visible” as sacred words. Make night pages rare enough that they’re surprising again.

They both have a point. The trick is combining them without breaking prod—or your people.

The human nature of incident response

SRE and DevOps are engineering disciplines, but the failure modes are gloriously human. Under stress, we lean on heuristics and habits. That last bash one-liner that fixed it once? Your hands will try it again before your brain has fully signed off. We also respond to incentives. If the only time reliability work is rewarded is in the adrenaline spike of a major incident, congratulations: you’ve accidentally gamified outages.

Conversely, when teams get a full night’s sleep, they notice the small two-line diff that would have avoided last night’s 90-minute detour. They write the missing runbook step. They stick to the error budget policy instead of “just this once” overriding it. A rested team invests in the system rather than in late-night heroics.

I once worked with a team whose unofficial runbook began with “1) Panic. 2) Google.” It was meant as a joke, but it was closer to reality than anyone liked. Six months later, after we flattened alert noise, enforced SLO-first paging, and gave the day-after a recovery window, the new joke was, “1) Coffee. 2) Runbook.” Progress.

Practical approaches that let people sleep and systems hum

Let’s get painfully specific. Because “just automate more” is the reliability equivalent of “just eat healthy.”

1) Make paging SLO-first and actionable by construction

Start with the user story: what symptoms would a real user notice, right now, that justify interrupting a human? Trigger pages on those symptoms. Everything else routes to dashboards, tickets, or time-boxed “look soon” alerts during business hours. Tie paging strictly to SLOs and error budgets. When you burn budget too fast, your policy should automatically bias toward stability: freeze risky changes, reduce blast radius, and prioritize reliability action items. When budget is healthy, you intentionally allow change to flow. The pager becomes a governance tool, not a fire drill machine.
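
As a sketch of what SLO-first paging can look like, here is a tiny burn-rate check in Python. The 99.9% target, the two windows, and the 14.4x fast-burn threshold are illustrative assumptions (a common starting point in multiwindow burn-rate alerting), not a recommendation for your service.

    def burn_rate(error_ratio: float, slo_target: float) -> float:
        """How fast the error budget is being consumed (1.0 = exactly on budget)."""
        budget = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
        return error_ratio / budget

    def should_page(err_1h: float, err_5m: float, slo_target: float = 0.999) -> bool:
        """Page only on a fast, sustained burn: both the long and short window must be hot."""
        fast_burn = 14.4                     # roughly 2% of a 30-day budget gone in one hour
        return (burn_rate(err_1h, slo_target) >= fast_burn and
                burn_rate(err_5m, slo_target) >= fast_burn)

    # 2% errors over the last hour and 3% over the last 5 minutes on a 99.9% SLO: page.
    print(should_page(err_1h=0.02, err_5m=0.03))    # True
    # A brief blip that already recovered: no page; it goes to a dashboard or ticket.
    print(should_page(err_1h=0.0005, err_5m=0.0))   # False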

Write alerts like contracts. Each must answer: Why is this urgent? What should the on-call do first? What’s the rollback or mitigation? Where’s the runbook? The moment you see an “FYI” page, delete it with prejudice. Alert fatigue isn’t a vibe; it’s a measurable design flaw.
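
One lightweight way to enforce that contract is to make the fields mandatory in code. The dataclass below is a hypothetical sketch; the field names are mine, not from any particular alerting tool, and the runbook URL is a placeholder.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PageContract:
        """Every paging alert must ship with answers to the four questions."""
        name: str
        why_urgent: str        # the user-visible symptom that justifies waking someone
        first_action: str      # the very first thing the on-call should do
        mitigation: str        # rollback / mitigation path if the first action fails
        runbook_url: str       # where the longer procedure lives

        def __post_init__(self):
            missing = [f for f, v in vars(self).items() if not str(v).strip()]
            if missing:
                raise ValueError(f"Refusing to register page '{self.name}': missing {missing}")

    checkout_errors = PageContract(
        name="checkout-5xx-burn",
        why_urgent="Users cannot complete checkout; error-budget burn is sustained.",
        first_action="Check the last deploy in region-b and the feature-flag dashboard.",
        mitigation="Disable the new-payment-path flag or roll back the canary.",
        runbook_url="https://runbooks.example.internal/checkout-5xx",   # placeholder URL
    )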

2) Design the rotation like a system, not a chore chart

Cap the share of time individuals spend on-call. Keep a secondary on-call for fall-through and non-paging operational work. Size the team so no one exceeds that cap, and consider multi-site coverage to eliminate routine night shifts. If your product demands 24/7 human response, don’t “solve” it by asserting that Europeans don’t need REM sleep. Use follow-the-sun handoffs with crisp, standardized context packets: current incident state, hypotheses tested, mitigation timeline, and clear decision ownership.
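
A handoff packet can be as simple as a structured note the outgoing on-call fills in before signing off. The fields below mirror the list above; the structure and names are an illustrative assumption, not a prescribed format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class HandoffPacket:
        """What the incoming region needs to continue an incident without re-deriving it."""
        incident_id: str
        current_state: str                      # e.g. "mitigated, root cause unknown"
        hypotheses_tested: List[str] = field(default_factory=list)
        mitigation_timeline: List[str] = field(default_factory=list)   # timestamped actions taken
        decision_owner: str = ""                # who can approve the next risky step

        def summary(self) -> str:
            return (f"[{self.incident_id}] {self.current_state} | "
                    f"{len(self.hypotheses_tested)} hypotheses tested | "
                    f"owner: {self.decision_owner or 'UNASSIGNED'}")

    packet = HandoffPacket(
        incident_id="INC-2042",
        current_state="mitigated via flag rollback, root cause unknown",
        hypotheses_tested=["bad deploy (ruled out)", "cache stampede (likely)"],
        mitigation_timeline=["02:07 paged", "02:31 flag rolled back", "02:40 error rate normal"],
        decision_owner="EU primary on-call",
    )
    print(packet.summary())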

Establish an explicit incident limit per shift. If a 12-hour shift with two incidents is the upper bound, then exceeding it triggers relief and a retrospective on alert thresholds, runbooks, and automations. Normalize the day-after recovery window: late page? No 9 a.m. stand-up, no code reviews until after coffee and a walk. If you compensate on-call with time-off-in-lieu or cash, publish the policy, cap it, and track it. Unbounded incentives create heroes today and attrition tomorrow.

3) Automate remediation for the 80% you can predict

Most midnight pages rhyme. The class of incidents that end with “restart X” or “scale Y” should not be stealing REM cycles in 2025. Wire your alerts to automation that attempts safe mitigation with guardrails: canary before full rollout, feature-flag off switch with automatic revert, auto-scaling rules with sane ceilings. Bake rollback commands and runbook snippets directly into the alert payload so the on-call isn’t spelunking through outdated wiki pages under time pressure.
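
Here is roughly what “automation attempts safe mitigation with guardrails” can look like. restart_service, is_healthy, and page_human are stand-ins for whatever your platform actually provides, so treat this as a shape rather than an implementation.

    import time

    MAX_AUTO_RESTARTS = 2          # guardrail: after this, a human decides
    _restart_log: dict[str, list[float]] = {}

    def within_rate_limit(service: str, window_s: int = 3600) -> bool:
        """Only allow a bounded number of automated restarts per service per hour."""
        now = time.time()
        recent = [t for t in _restart_log.get(service, []) if now - t < window_s]
        _restart_log[service] = recent
        return len(recent) < MAX_AUTO_RESTARTS

    def handle_alert(service: str, restart_service, is_healthy, page_human) -> str:
        """Try the boring fix first; page only if it is unsafe or it didn't work."""
        if not within_rate_limit(service):
            page_human(f"{service}: auto-restart limit reached, needs judgment")
            return "paged"
        _restart_log.setdefault(service, []).append(time.time())
        restart_service(service)
        time.sleep(30)                         # soak briefly before declaring victory
        if is_healthy(service):
            return "auto-mitigated"            # becomes a morning ticket, not a page
        page_human(f"{service}: restart did not clear the alert")
        return "paged"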

Lean into AI where it helps, but keep judgment with humans. Using models to classify incidents, cluster duplicates, suggest likely playbooks, and draft initial status updates can shave minutes when minutes matter. Better yet, let the AI tell you what to delete: noisy rules, duplicate alerts, flapping thresholds. The goal isn’t “AI runs prod”; it’s “humans are interrupted only when human judgment is the best tool.”
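
Even before you reach for models, a deterministic fingerprint gets you most of “cluster duplicates”: collapse repeats of the same symptom into one incident and page at most once per group. The grouping key and field names below are assumptions, not a standard schema.

    import hashlib
    from collections import defaultdict

    def fingerprint(alert: dict) -> str:
        """Group alerts by what they are about, not by when they fired."""
        key = f"{alert['service']}|{alert['symptom']}|{alert.get('region', '')}"
        return hashlib.sha256(key.encode()).hexdigest()[:12]

    def cluster(alerts: list[dict]) -> dict[str, list[dict]]:
        groups: dict[str, list[dict]] = defaultdict(list)
        for a in alerts:
            groups[fingerprint(a)].append(a)
        return groups

    alerts = [
        {"service": "checkout", "symptom": "elevated 500s", "region": "region-b", "t": "02:07"},
        {"service": "checkout", "symptom": "elevated 500s", "region": "region-b", "t": "02:09"},
        {"service": "search", "symptom": "latency p99 breach", "region": "region-a", "t": "02:15"},
    ]
    for fp, group in cluster(alerts).items():
        # One page (at most) per fingerprint; the rest become context on the incident.
        print(fp, f"{len(group)} alert(s) -> {group[0]['symptom']} in {group[0]['region']}")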

4) Put change under control without strangling flow

Reliability dies from uncontrolled change as often as from component failure. Treat progressive delivery as table stakes: feature flags, canaries, blue-green, and automatic halts when SLOs twitch. A release should carry its own kill switch and its own exit plan. When error budgets suffer, you don’t shame developers; you follow your policy and rebalance toward reliability work. This is where SRE’s social contract shines: it gives product engineering permission to slow down because the data said so.
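
A minimal sketch of a release that carries its own kill switch: advance the canary step by step, compare its error ratio against an SLO-derived ceiling, and revert the moment it twitches. error_ratio, set_traffic_split, and rollback are placeholders for your own telemetry and rollout tooling.

    CANARY_STEPS = [0.01, 0.05, 0.25, 1.00]      # fraction of traffic on the new version
    MAX_ERROR_RATIO = 0.001                      # derived from the SLO, not from optimism

    def progressive_rollout(version: str, set_traffic_split, error_ratio, rollback) -> bool:
        """Advance the canary step by step; any SLO twitch halts and reverts the release."""
        for step in CANARY_STEPS:
            set_traffic_split(version, step)
            observed = error_ratio(version)      # measured over a soak period at this step
            if observed > MAX_ERROR_RATIO:
                rollback(version)                # the release's built-in kill switch
                return False
        return True                              # fully rolled out, error budget intact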

5) Institutionalize learning and recovery

Incidents are tuition. Make the class worth it. Postmortems should be blameless, time-boxed, and action-oriented. If your alerts fired three times for the same underlying cause, the next step is not a new emoji in the incident channel; it’s a design change. Celebrate “boring” wins like deleting 30% of alerts or reducing MTTR by removing one manual step. And don’t let the calendar erase the humans: after a gnarly night, formalize recovery time. A tired brain can carry a pager, but it shouldn’t design the next critical migration.

A tale from the pager side

At 02:07 the alert read, “Elevated 500s in region-b.” No runbook link. No context. I acknowledged, opened three dashboards, and did that thing where you stare at graphs and try to remember the difference between the two blue lines. Pager pinged again: the same alert. We were in a loop—duplicate signals masking the signal. The eventual fix? A feature flag rollback that took 30 seconds once we found it.

The cleanup took two weeks. We rewired the alert to fire once per incident, attached the mitigation steps, and added a canary guardrail. Two months later, the same class of issue reappeared—this time as a business-hours ticket with an auto-mitigation already applied. No one woke up. No one performed grep with one eye closed. We didn’t get a trophy. We did get sleep. That’s what winning looks like.

Closing: Reliability isn’t a hero story—it’s a design story

Let’s retire the myth that SREs are sleepless wizards sustaining uptime by force of will. The best SRE teams are boring on purpose. They send almost no pages at night. They tie paging to user pain and SLOs. They automate the obvious. They rotate fairly. They make change safe. And they sleep—because the system, not the superhero, carries the load.

If your on-call life reads like a thriller, it might be time to rewrite it as a cozy mystery. The twist ending? Everybody wakes up rested, and prod is still up.

References (top 5)

  1. Being On-Call — Site Reliability Engineering (Google SRE Book). https://sre.google/sre-book/being-on-call/

  2. Monitoring Distributed Systems — Site Reliability Engineering (Google SRE Book). https://sre.google/sre-book/monitoring-distributed-systems/

  3. Accelerate State of DevOps Report 2024 (DORA). https://services.google.com/fh/files/misc/2024_final_dora_report.pdf

  4. PagerDuty: 2024 State of Digital Operations — Study Highlights. https://www.pagerduty.com/newsroom/2024-state-of-digital-operations-study/

  5. NIOSH/CDC: Impairments due to sleep deprivation are similar to alcohol intoxication. https://www.cdc.gov/niosh/work-hour-training-for-nurses/longhours/mod3/08.html

#SRE #SiteReliability #DevOps #OnCall #ErrorBudgets #SLO #Observability #AIOps #DevOpsCulture #IncidentResponse