Created on 2026-02-21 09:11
Published on 2026-02-23 11:15
We’ve spent years getting serious about Service Level Objectives. We argue about 99.9 vs 99.95 like it’s a moral philosophy. We build burn-rate alerts that can detect a reliability cliff before a user even notices the breeze. We fight over error budgets as if they’re rare Pokémon cards.
And then we quietly accept a “people availability target” of… whatever happens after the fifth night page in a row.
That’s the irony: modern reliability engineering is brilliant at quantifying system risk and weirdly casual about human risk. But humans aren’t an infinite resource. Brains don’t autoscale. Nervous systems don’t have multi-region failover. The people behind the pager are part of the system, whether your architecture diagram admits it or not.
That’s where human-sustainability SLOs come in: measurable reliability targets for the on-call experience and operational load, designed to keep teams effective, healthy, and capable of shipping improvements instead of just surviving the next incident.
This isn’t “be nicer to engineers” (though yes, please). This is systems thinking. If the service depends on humans responding well under stress, then the conditions that enable humans to respond well are operational requirements. Which means they deserve objectives, signals, and error budgets too.
SRE is built on a simple trade: we accept some risk (an error budget) to move fast, but we manage that risk explicitly. That trade collapses if the hidden cost is burned-out responders, brittle tribal knowledge, and a culture where the best engineers quietly transfer out of the team that “always gets paged.”
The on-call experience is not just a feelings problem. It’s a reliability input. If pages are noisy, response quality degrades. If recoveries are slow, interruptions multiply. If engineers stop trusting alerts, incidents last longer. If toil is high, automation doesn’t happen. If every week is a firefight, resilience work becomes a PowerPoint hobby.
So the question becomes: what if we set SLOs not only for the user experience, but also for the responder experience?
Not as a replacement for product SLOs, but as a paired constraint: “We will deliver reliable services, and we will do it in a way that doesn’t torch the humans.”
A human-sustainability SLO is a team-agreed, measurable target about operational load and the on-call experience, paired with signals (human SLIs) and consequences when the target is missed.
It is not a vague promise like “we care about well-being.” It is not a poster about psychological safety in the break room next to the printer that jams for emotional reasons. It is also not a performance management trap that ranks engineers by “resilience.”
It’s closer to how we treat latency. We don’t shame the server for being slow; we treat it as a system symptom. Human-sustainability SLOs treat burnout risk and overload as system symptoms too.
The trick is picking metrics that are hard to game, ethically sound, and genuinely tied to outcomes.
Here’s the uncomfortable truth: you already have human SLIs. You just call them “that vibe,” “the rotation is cursed,” or “why is Sam always online at 2 a.m.?”
Most teams can measure operational load without turning humans into spreadsheets with legs. A few common signals show up again and again in mature SRE orgs and in incident-management research:
You can measure paging load as pages per shift, pages per engineer per month, pages outside business hours, and the distribution of pages (because a rotation where one person gets all the weirdness is not “shared ownership,” it’s “shared denial”).
You can measure alert quality through the percentage of pages that lead to meaningful action, the volume of auto-resolved noise, and the rate of repeat alerts for the same underlying issue. If you’ve ever acknowledged an alert and immediately thought, “this again,” you’ve discovered a high-signal metric called “we’re wasting human attention.”
You can measure toil as time spent on manual, repetitive operational work. Toil is sneaky: it often feels productive because you’re busy, but it blocks the very engineering work that would remove the toil. This is how teams become heroic and miserable at the same time.
You can measure recovery burden as incident duration, after-hours incident time, number of escalations, and how often the same service causes major incidents. Repeated incidents are not just technical debt; they’re human debt with interest.
You can also measure context switching and interruption rate: how often on-call interrupts planned work, how many “small incidents” disrupt focus, and how frequently the team is pulled into other teams’ operational problems. In practice, this often correlates with a feeling every SRE knows: “I didn’t do any real work today, but I’m exhausted.”
None of these require invasive monitoring of individuals. They’re about the system and the rotation, not about grading someone’s emotional state.
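The signals above can be computed from nothing more than a page log. Here's a minimal sketch, assuming a made-up record shape (the field names are illustrative, not any vendor's schema), that produces rotation-level SLIs without grading individuals:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    engineer: str
    fired_at: datetime
    actionable: bool  # did the responder take meaningful action?

def rotation_slis(pages: list[Page]) -> dict:
    """Team-level human SLIs: total load, after-hours share,
    signal quality, and how evenly the pain is distributed."""
    per_engineer: dict[str, int] = {}
    after_hours = 0
    actionable = 0
    for p in pages:
        per_engineer[p.engineer] = per_engineer.get(p.engineer, 0) + 1
        if p.fired_at.hour < 8 or p.fired_at.hour >= 18:  # assumed business hours
            after_hours += 1
        if p.actionable:
            actionable += 1
    total = len(pages)
    return {
        "pages_total": total,
        "after_hours_share": after_hours / total if total else 0.0,
        "actionable_share": actionable / total if total else 0.0,
        # 1.0 = load perfectly shared; higher = one person gets all the weirdness
        "load_skew": (max(per_engineer.values()) * len(per_engineer) / total)
                     if total else 0.0,
    }
```

A `load_skew` well above 1.0 is the "why is Sam always online at 2 a.m." metric, made visible.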
The moment you say “human SLO,” you’ll get two strong reactions. Both are rational. Both have scars.
The first camp argues that what gets measured gets managed, and what doesn’t get measured becomes invisible. If you don’t track pager load, leadership will fund features, not operational improvement. If you don’t quantify toil, you’ll keep hiring “more on-call coverage” instead of eliminating the causes. If you can’t show the trend, you can’t justify the investment.
They’ll point out that SRE already treats human time as a core constraint. Google’s SRE guidance has long emphasized paging only on actionable issues, reducing toil, and using SLOs to produce meaningful alerts. The spirit is clear: reliability should be engineered, not extracted from human suffering.
In other words: if we can have a latency SLO, we can have a “don’t wake people up for nonsense” SLO.
The second camp fears that human-sustainability SLOs will be used like a scoreboard. They worry a “page budget” becomes an excuse to ignore incidents or to pressure teams into silence. They worry “toil percentage” becomes a justification for layoffs (“Looks like toil went down; do we still need you?”). They worry the metrics will be gamed: engineers will route work into untracked channels, reclassify pages as “tickets,” or stop reporting issues.
They also worry that a well-being metric can become intrusive, reductive, or unfair. People have different lives, tolerances, and responsibilities. The goal is sustainability, not standardization of human experience.
This camp isn’t anti-measurement. It’s anti-using measurement without trust, context, and ethical guardrails.
And honestly? They’re right to worry. The history of management metrics gives them receipts.
So the answer isn’t choosing a camp. The answer is designing human-sustainability SLOs with the same maturity we apply to product SLOs: with clear intent, calibration, and a bias toward reducing harm.
In classic SRE, an error budget is the allowed unreliability over a window. When you burn it too fast, you slow feature releases and focus on reliability work.
A human-sustainability SLO becomes powerful when it also has an error budget. Not “human error” like mistakes, but “human capacity error” like overload.
Imagine a team agrees on a sustainability objective: on-call should be interrupt-driven only when necessary, and the rotation should remain humane.
Now the team can define a budget like: “We can tolerate up to X after-hours pages per engineer per month,” or “We can tolerate Y hours of after-hours incident time per month,” or “We can tolerate Z% of toil time in a sprint.”
When that budget is exceeded, the response should mirror what we do with reliability: you pause and invest. You treat it as an engineering signal that the system is asking too much of humans.
This flips the usual dynamic. Instead of celebrating heroics, you treat heroics as a leading indicator of systemic failure. The hero isn’t the person who stayed up all night. The hero is the team that made sure nobody had to.
The goal here isn’t to invent a new dashboard that everyone ignores. The goal is to change operational behavior. That means connecting measurement to decisions.
One of the fastest ways to improve human sustainability is to enforce a strict philosophy: pages must be urgent, important, actionable, and real. If an alert doesn’t meet that bar, it doesn’t get a 2 a.m. phone call. It gets a ticket, a Slack message, or a morning review.
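That four-question bar is simple enough to encode directly into alert routing. A minimal sketch, assuming the four flags come from alert annotations (the names here are illustrative):

```python
from enum import Enum

class Route(Enum):
    PAGE = "page"      # wake a human now
    TICKET = "ticket"  # morning review, Slack, or a queue

def route_alert(urgent: bool, important: bool,
                actionable: bool, real: bool) -> Route:
    """Only an alert that passes all four tests earns a 2 a.m. phone call."""
    if urgent and important and actionable and real:
        return Route.PAGE
    return Route.TICKET
```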
This isn’t about being lazy. It’s about treating attention as a scarce resource. Every noisy alert trains your brain to distrust the pager. Every false alarm teaches your muscle memory that the next one might also be nothing. And then, when something is truly on fire, the response starts five minutes later because your nervous system has learned that alarms are usually lies.
When you pair SLO-based alerting with burn-rate logic, you stop alerting on symptoms and start alerting on user impact. The system becomes calmer, and humans stop living in a constant state of anticipatory dread.
Toil is the silent killer because it feels necessary. “Someone has to do it.” True. But when toil becomes normal, it becomes a tax that grows every time the system scales.
Teams that get serious about sustainability treat toil as something you deliberately pay down. They make it visible, allocate time to reduce it, and refuse to let it quietly become “the job.”
There’s a practical on-call anecdote here that most SREs recognize. The team has a recurring incident. Everyone knows it. Everyone has a workaround. The runbook is basically: “Do the thing we always do. Don’t ask why. It’s haunted.” A human-sustainability SLO turns that haunting into a priority. If that recurring incident causes two after-hours pages a week, it is consuming real human capacity. Fixing it isn’t “nice to have.” It’s reliability work.
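The haunting becomes a backlog item the moment you count it. A sketch that surfaces repeat offenders from an incident log (fingerprinting by service and symptom is an assumption; real grouping is messier):

```python
from collections import Counter

def repeat_offenders(incidents: list[tuple[str, str]],
                     threshold: int = 2) -> list[tuple[str, int]]:
    """incidents: (service, symptom) pairs. Returns fingerprints seen at
    least `threshold` times, worst first -- each one is a candidate for a
    permanent fix, not another runbook edit."""
    counts = Counter(incidents)
    hot = [(f"{svc}/{sym}", n)
           for (svc, sym), n in counts.items() if n >= threshold]
    return sorted(hot, key=lambda pair: -pair[1])
```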
Most orgs have a recovery plan for infrastructure and none for people. But if you want sustainable reliability, you need explicit recovery mechanics: compensatory time, load shedding after bad nights, backup rotations that actually work, and escalation paths that don’t rely on waking the same two experts every time.
This is where human nature shows up. Engineers will often push through exhaustion because they care, because they’re proud, because they don’t want to be the one who “drops the ball.” That’s admirable in the moment and dangerous over time. A sustainability SLO gives permission to act differently. It makes rest a reliability control, not a personal indulgence.
A team that acknowledges, “We exceeded our after-hours incident budget this week, so we pause feature work and invest in stabilization,” is doing the same kind of disciplined tradeoff SRE was built to enable.
Let’s be honest: many sustainability problems are incentive problems wearing a monitoring hat.
If the org rewards shipping features, teams will ship. If the org rewards uptime but doesn’t fund resilience, teams will “buy” uptime with human pain. If leadership treats incidents as moral failings instead of learning opportunities, people will hide risk until it explodes at 2 a.m. on a holiday weekend.
Human-sustainability SLOs force a conversation about tradeoffs that often stays unspoken. They create a shared language to say, “We can keep pushing like this, but it will cost us people, and then it will cost us reliability anyway.”
This is one of those moments where DevOps isn’t about tools. It’s about aligning incentives with reality.
So, if we accept that human sustainability is part of reliability, the fun begins: what do we actually standardize?
Should a team have a universal “page budget,” or does it vary by service criticality and maturity?
If you hit your human-sustainability limit, do you slow shipping the same way you would after burning a reliability error budget, or do you treat it as an HR issue, which is another way of saying “we’ll do nothing but feel bad”?
How do you prevent sustainability metrics from becoming a weapon while still making them strong enough to influence prioritization?
If AI-assisted operations reduce toil but increase cognitive load through context switching, how do we measure the trade without fooling ourselves?
And the spiciest one: if a service consistently requires heroics, is it the service that’s unreliable… or the organization?
Human-sustainability SLOs aren’t about making on-call cozy. They’re about making on-call sustainable enough that people can do excellent work repeatedly, not occasionally.
The dirty secret of reliability is that the system is never just code. It’s code, process, incentives, and the squishy biological beings who get paged when assumptions meet reality. If we can define objectives for packets and processes, we can define objectives for the conditions that keep the humans effective.
Because the best kind of reliability improvement is the one that also lets everyone sleep. Preferably at night. All of it.
DORA — Capabilities: Well-being — https://dora.dev/capabilities/well-being (dora.dev)
Google SRE Workbook — Alerting on SLOs — https://sre.google/workbook/alerting-on-slos/ (sre.google)
Google SRE Workbook — Eliminating Toil — https://sre.google/workbook/eliminating-toil/ (sre.google)
Google Cloud Blog — Announcing the 2025 DORA Report: State of AI-Assisted Software Development — https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report (Google Cloud)
PagerDuty — Automation Survey Whitepaper (PDF) — https://cdn.pagerduty.com/wp-content/uploads/2024/06/Whitepaper_Automation-survey.pdf (cdn.pagerduty.com)
#SRE #SiteReliability #DevOps #OnCall #IncidentManagement #ReliabilityEngineering #SLO #ErrorBudgets #Toil #Burnout #EngineeringLeadership #Observability #PlatformEngineering #OpsCulture