Published on 2025-09-12 10:30
“My on-call strategy is simple: automate everything I do twice, and never admit to the third time.”
If you run a production system long enough, the universe hands you chores: tickets that multiply after every deploy, “quick” manual fixes that never seem to retire, and dashboards that require a human to squint just so. That’s toil—the repetitive, manual, interrupt-driven work that scales linearly with your service and steals energy from engineering. SRE treats toil not as a badge of heroism but as a solvable bug in the system’s design. The goal isn’t zero toil; it’s to cap it, contain it, and convert it into project work that permanently reduces future toil.
There’s a reason SREs try to quantify toil in hours and percentages: time is the scarcest reliability resource. If half your week disappears into hand-cranked deploys and ticket triage, the queue of improvements that would prevent that work never gets done. Human nature compounds the problem. When the pager barks, you do what’s quickest right now—copy-paste the shell incantation, click through the runbook steps—and promise to automate “next sprint.” Unless you budget time for anti-toil projects and enforce it with the same rigor you apply to capacity plans, next sprint never arrives.
Automation is SRE’s favorite tool precisely because it’s boring. A script runs the same way every time; a bot doesn’t forget the second step; a platform guardrail doesn’t get sleepy at 3 a.m. But automation is a means, not an end. It shines when it eliminates toil, absorbs variance, and shortens the blast radius of change. It backfires when it entombs bad processes in code, hides dangerous complexity, or auto-executes your way into a larger outage.
In practice, SREs treat automation like any other production dependency: they design it, review it, version it, observe it, and sunset it when it stops paying rent. The most durable automation lives close to user value—golden paths that make the right thing also the easy thing: safe deploys by default, SLO-first alerting by default, and runbooks as code that execute themselves.
The most controversial thing SRE ever shipped is arguably the most useful: the error budget. By converting an SLO into a quantified allowance for failure, you finally get a single number that both product and engineering can talk about without metaphor. A green budget invites experimentation; a depleted budget constrains risk. It’s operational governance without a committee.
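If you like seeing the arithmetic, here is a minimal sketch; the 99.9% target, 30-day window, and request counts are illustrative assumptions, not numbers from this post:

```python
# Minimal error-budget arithmetic for an availability SLO.
# Assumptions: a 30-day window and a request-based SLI; all numbers are illustrative.

SLO_TARGET = 0.999        # 99.9% of requests succeed over the window
WINDOW_DAYS = 30

error_budget_fraction = 1 - SLO_TARGET                         # 0.1% of requests may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget_fraction

def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent (negative means overdrawn)."""
    allowed_failures = total_requests * error_budget_fraction
    return 1 - (failed_requests / allowed_failures) if allowed_failures else 1.0

print(f"{budget_minutes:.1f} minutes of full downtime allowed per {WINDOW_DAYS} days")
print(f"Budget remaining: {budget_remaining(10_000_000, 4_200):.0%}")   # -> 58%
```

That single “58% left” number is the one product and engineering can argue about productively.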
The social trick of error budgets is that they depersonalize uncomfortable decisions. You no longer “freeze because ops is grumpy”; you pause risky changes because customers have already seen too much pain for the period. The mechanism is as flexible as your culture. You might slow rollouts, require canaries for specific classes of change, or spend two weeks paying reliability debt. The point isn’t punishment; it’s to ensure the work you do next is the work the system needs most.
Picture two SREs in a hallway (or a war room). One says, “If we can do it twice, we can automate it today.” The other says, “If we automate what we don’t understand, we’ll just fail faster.” They’re both right—and both wrong.
When toil is obviously mechanical—regenerating certificates, rotating keys, draining nodes—automate early. Humans are bad at careful repetition under stress. But when the work masks a systemic gap—an odd deployment ordering, a flaky dependency, an ill-defined SLO—resist the urge to codify the workaround. First, stabilize and instrument until you can explain causality. Then automate the understanding, not just the keystrokes. Otherwise you’ll build a shiny button that quietly cements a design flaw.
Error budget policies also split the room. One camp says: when the budget is empty, you stop changes that risk reliability—full stop. The other says: the business sometimes needs a launch, and teams need discretion. In the first camp, the on-call sleeps better. In the second, product velocity doesn’t stall when the outage was external or the SLO definition was off.
The SRE answer is to make exceptions explicit and scarce. Write the policy first, including who can approve an exception and what compensating actions follow. If you burn budget to meet a critical deadline, commit to a postmortem and a reliability sprint next. If a vendor outage killed your quarter’s budget, adjust the SLO scope or provider strategy—don’t use it as a free pass. Good policies encode judgment without requiring heroics.
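What does that look like when the policy is literally written down as code rather than folklore? A minimal sketch; the roles, thresholds, and compensating actions below are illustrative assumptions, not a canonical policy:

```python
# A sketch of an error-budget policy encoded as code; all specifics are illustrative.

from dataclasses import dataclass, field

@dataclass
class ApprovedException:
    approved_by: str                     # e.g. the owning director, per your written policy
    reason: str
    compensating_actions: list[str] = field(default_factory=list)  # e.g. ["postmortem", "reliability sprint"]

def change_allowed(budget_remaining: float, risky: bool,
                   exception: ApprovedException | None = None) -> tuple[bool, str]:
    """Decide whether a proposed change may ship, and say why."""
    if not risky:
        return True, "low-risk change: ship under normal guardrails"
    if budget_remaining > 0:
        return True, "budget available: experiment away"
    if exception and exception.compensating_actions:
        actions = ", ".join(exception.compensating_actions)
        return True, f"exception approved by {exception.approved_by}; debt repaid via {actions}"
    return False, "budget exhausted and no approved exception: pause risky changes"
```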
A serious toil reduction program starts with a census. For one month, ask every engineer to tag their work as project, interrupt, or overhead. Then slice by source: tickets, pages, deploys, manual data fixes, capacity changes. Patterns appear quickly. Maybe 30% of your interrupts are “please restart pods” because a health check is too strict. Maybe weekly hotfixes are fighting the same schema drift. Maybe on-call spends a surprising amount of time shepherding releases through an approval queue that never catches anything important.
Then you prioritize by aggregate human time. Killing a two-minute action that happens 2,000 times a week beats automating a twenty-minute edge case once a month. You’ll discover that a small number of “paper cuts” dominate your human SLO.
From there, SREs build an engineering backlog with the same ceremony as feature work. Each item has an expected toil delta: minutes or hours saved per week. That expected delta becomes the measure of success, not the dopamine hit of shipping a script. The best teams review the numbers every quarter and prune what didn’t land. It’s strangely satisfying to delete an automation job that “felt” useful but didn’t move the needle. That’s reliability science, not vibes.
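A minimal sketch of that prioritization, with made-up census numbers standing in for what your ticketing and paging systems would report:

```python
# Ranking toil by aggregate human time; the census entries are illustrative.

toil_census = [
    # (task, minutes per occurrence, occurrences per week)
    ("restart pods after over-strict health check",  2, 2000),
    ("shepherd release through approval queue",     15,   40),
    ("hand-patch schema drift",                      45,    2),
    ("manual TLS certificate rotation",              20,    4),
]

# Expected toil delta if a task were fully automated, in human-minutes per week.
ranked = sorted(toil_census, key=lambda t: t[1] * t[2], reverse=True)

for task, minutes, per_week in ranked:
    print(f"{minutes * per_week:>6} min/week  {task}")
```

The two-minute restart wins by an order of magnitude, which is exactly the kind of paper cut that dominates the human SLO.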
Three traps keep toil on the books. First, confusing “faster” with “better.” A command that runs quicker than a runbook is still toil if a human has to shepherd it. Design for unattended operation and idempotence. If your playbook can’t be safely retried, it isn’t done.
Second, multiplying toil with unbounded variety. Ten teams each with bespoke deploy scripts will keep the pager fed forever. Standardize golden paths and make deviations expensive. Platform engineering shines here: product teams keep speed; the platform absorbs complexity once.
Third, mistaking noise for validation. If a process only works when a human “keeps an eye on it,” you haven’t validated the process—you’ve trained a babysitter. Bake verification into the path: health probes that reflect user experience, pre-flight checks that actually block, and post-deploy analysis that fails forward automatically.
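Here is a minimal sketch of a step designed to dodge all three traps: idempotent, pre-checked, and self-verifying, so it can run unattended. The node-drain and probe helpers are hypothetical stubs, not a real client library, and the retry budget is illustrative:

```python
import time

def node_is_drained(node: str) -> bool:
    return False                       # hypothetical stub: would query the orchestrator

def cordon_and_drain(node: str) -> None:
    pass                               # hypothetical stub: would issue an idempotent drain request

def user_facing_probes_healthy() -> bool:
    return True                        # hypothetical stub: checks user-facing SLIs, not CPU

def drain_node(node: str, max_attempts: int = 3) -> bool:
    if node_is_drained(node):          # idempotent: safe to re-run the whole playbook
        return True
    if not user_facing_probes_healthy():        # pre-flight check that actually blocks
        raise RuntimeError("refusing to drain: users are already hurting")
    for attempt in range(max_attempts):         # bounded retries, no babysitter required
        cordon_and_drain(node)
        time.sleep(30 * (attempt + 1))
        if node_is_drained(node) and user_facing_probes_healthy():
            return True                          # verification is baked into the path
    return False                                 # only now does a human get involved
```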
Moving alerts from resource metrics to SLOs is the fastest way to reclaim sleep. Burn-rate alerting pages you when you’re consuming budget at a rate that threatens the period, whether it’s a one-hour fire or a slow bleed over a day. In the war room, this changes the vibe. You stop arguing about the perfect CPU threshold and start discussing the concrete impact on users and the runout time on the budget. That language aligns product, SRE, and executives instantly.
The byproduct is fewer tickets and sharper diagnostics. When the page says “consuming 5% of budget in the last six hours,” the playbook doesn’t begin with guesswork. You jump straight to exemplars and recent changes. Most importantly, you satisfy the human test: if we all ignore this alert to go to lunch, do users suffer? With SLO-first paging, the answer is finally consistent.
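A minimal sketch of the multi-window check, in the spirit of the Workbook’s “Alerting on SLOs” chapter; the metric query is a hypothetical stub, and the 14.4x and 6x thresholds assume a 30-day window:

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET        # allowed error fraction over the whole period

def error_rate(window: str) -> float:
    return 0.0                 # hypothetical stub: fraction of failed requests in `window`

def burn_rate(window: str) -> float:
    """How many times faster than 'spend the budget exactly by period end' we are burning."""
    return error_rate(window) / BUDGET

def should_page() -> bool:
    # Pair a long and a short window so an incident that has already recovered stops paging.
    fast_fire = burn_rate("1h") > 14.4 and burn_rate("5m") > 14.4   # ~2% of a 30-day budget in 1h
    slow_burn = burn_rate("6h") > 6 and burn_rate("30m") > 6        # ~5% of the budget in 6h
    return fast_fire or slow_burn
```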
None of this works if your on-call rotation is a meat grinder. Humans who live in a permanent state of interruption don’t write elegant automation; they write brittle band-aids and hope for a quiet night. The most effective SRE managers do three things relentlessly. They protect an explicit cap on operational load. They defend project time for anti-toil engineering like a hawk. And they reward deletion—of pages, of steps, of “tribal knowledge”—as much as they reward creation.
When teams feel supported, they’ll take smart risks with their error budget. They’ll run canaries with conviction, try a new rollback tactic, and refactor a service boundary with eyes open. When they don’t, they hoard changes, bunch them up, and spend the budget in one risky release. Ironically, a healthy budget culture results in fewer surprises because people aren’t scared to change things responsibly.
One midnight, our checkout service started nibbling the budget—just enough to page, not enough to scandalize a dashboard. Burn-rate alerts told us we’d blow the week if the trend held. No obvious resource saturation. No smoking-gun error spike. We followed the playbook: pivoted to “what changed?” and pulled traces tied to the SLO’s latency histogram. The exemplars led us to a new feature flag that changed retry behavior for one upstream call. The fix was a one-line rollback. The learning was bigger: our deploy checklist had “validate retries,” but our automated verifications didn’t simulate the specific failure mode. The next sprint added a synthetic that does, plus a guardrail to auto-disable the flag on anomalous burn. Two hours of runbook and code; hundreds of future pages we’ll never see.
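For the curious, a sketch of what that guardrail might look like; the flag name, threshold, and client calls are hypothetical stand-ins rather than the actual incident tooling:

```python
import logging

RETRY_FLAG = "checkout.upstream-retry-v2"   # hypothetical flag name
DISABLE_ABOVE_BURN = 6.0                    # reuse the pager's slow-burn threshold

def current_burn_rate(service: str, window: str) -> float:
    return 0.0                              # hypothetical stub: query the metrics backend

def disable_flag(flag: str) -> None:
    logging.warning("auto-disabled %s", flag)   # hypothetical stub: call the flag service

def retry_flag_guardrail() -> None:
    burn = current_burn_rate("checkout", "30m")
    if burn > DISABLE_ABOVE_BURN:
        disable_flag(RETRY_FLAG)
        logging.warning("burn %.1fx exceeded %.1fx; flag disabled, context attached to the incident",
                        burn, DISABLE_ABOVE_BURN)
```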
Start with a monthly toil budget. Decide what percentage of team time you’ll allow for interrupts and manual ops. If you’re above it, pause new features for a week and invest exclusively in automation that drops the number this month. Saying “no” to everything else is the only way to get your future back.
Build runbooks that execute themselves. Treat runbooks as code that can be invoked by your alerting system with guardrails: pre-conditions checked, retries bounded, logging rich enough to debug when the humans arrive. This transforms pages from “wake someone up to press a button” into “wake someone up only when the button didn’t work.”
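A minimal sketch of that flow; the remediation, verification, and paging hooks below are hypothetical placeholders for whatever your alerter and tooling provide:

```python
import logging

def preconditions_ok(alert: dict) -> bool:
    return True        # hypothetical stub: is this really the failure mode the runbook covers?

def remediate(alert: dict) -> bool:
    return False       # hypothetical stub: the bounded, idempotent fix

def recovery_verified(alert: dict) -> bool:
    return False       # hypothetical stub: did the user-facing SLI actually recover?

def page_human(alert: dict, context: str) -> None:
    logging.error("paging on-call for %s: %s", alert.get("name"), context)

def handle_alert(alert: dict, max_attempts: int = 2) -> None:
    if not preconditions_ok(alert):
        page_human(alert, "pre-conditions failed; automation refused to act")
        return
    for attempt in range(max_attempts):
        logging.info("remediation attempt %d", attempt + 1)
        if remediate(alert) and recovery_verified(alert):
            logging.info("resolved without waking anyone")
            return
    page_human(alert, "automation exhausted its retries; full logs attached")
```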
Wire burn-rate alerts into the release train. If the short-window burn rate spikes during a rollout, automatically halt the pipeline, roll back the last step, and open an incident with all the context attached—recent commits, flags toggled, environment diffs. It’s much easier to be brave with change when the handrails are this solid.
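A sketch of such a gate; the pipeline and incident calls are hypothetical placeholders, and the threshold simply reuses the pager’s fast-burn number:

```python
HALT_ABOVE_BURN = 14.4                    # same fast-burn threshold the pager uses

def short_window_burn_rate() -> float:
    return 0.0                            # hypothetical stub: query the 5m/1h burn rate

def halt_pipeline() -> None:
    print("pipeline halted")              # hypothetical stub

def rollback_last_step() -> None:
    print("last step rolled back")        # hypothetical stub

def open_incident(context: dict) -> None:
    print("incident opened:", context)    # hypothetical stub

def post_deploy_gate(release: dict) -> bool:
    """Return True if the rollout may continue to the next stage."""
    burn = short_window_burn_rate()
    if burn <= HALT_ABOVE_BURN:
        return True
    halt_pipeline()
    rollback_last_step()
    open_incident({
        "release": release.get("id"),
        "burn_rate": burn,
        "recent_commits": release.get("commits", []),
        "flags_toggled": release.get("flags", []),
    })
    return False
```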
Turn postmortems into a reliability backlog. Every action item should tie to either reducing toil (minutes saved), expanding early detection (fewer duplicate incidents), or shrinking blast radius (faster rollback or auto-mitigation). Then track those items with the same visibility as feature work. If it isn’t visible, it won’t ship.
Create one paved path per pain point. If tickets pile up for TLS rotations, make the paved path rotate certs safely by default: proper lifetimes, staged rollout, health-checked drains, and a dry run. If on-call gets whiplash approving low-risk changes, create auto-approval for the patterns that statistically never bite, and require extra checks for the ones that do. Platform guardrails should be opinionated but escapable—with receipts.
Finally, make automation observable. Give your bots and jobs names, dashboards, and SLOs. If the certificate rotator misses an expiry, you should learn it from its own burn-rate alert, not from a customer outage. Your automation is part of the production system; treat it that way.
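A minimal sketch, assuming the rotator can report its own inventory; the metric names and the 14-day policy are illustrative:

```python
from datetime import datetime, timezone

ROTATOR_SLO_MIN_DAYS = 14       # policy: no managed cert gets within 14 days of expiry

def managed_cert_expiries() -> list[datetime]:
    return []                   # hypothetical stub: ask the rotator for its inventory

def export_gauge(name: str, value: float) -> None:
    print(f"{name}={value}")    # hypothetical stub: push to your metrics backend

def report_rotator_slis() -> None:
    now = datetime.now(timezone.utc)
    expiries = managed_cert_expiries()
    at_risk = sum(1 for e in expiries if (e - now).days < ROTATOR_SLO_MIN_DAYS)
    export_gauge("cert_rotator_managed_certs", len(expiries))
    export_gauge("cert_rotator_certs_at_risk", at_risk)
    # An alert on the at-risk gauge pages the rotator's owners long before a customer sees a TLS error.
```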
Toil is the background radiation of operating software, but it’s not destiny. When SREs treat toil like any other defect—measured, prioritized, and engineered away—the work gets lighter and the systems get safer. Error budgets add the steering wheel: they tell us when to press the gas and when to ease off. Pair them with SLO-first alerting, runbooks that run themselves, and platforms that make the right path the easy path, and you’ll find something rare in production: enough calm to build the future while the present keeps working. The best part? Your future self will never know how many pages you quietly erased.
Eliminating Toil (Site Reliability Engineering Book, Chapter 5) — https://sre.google/sre-book/eliminating-toil/
Error Budget Policy (The Site Reliability Workbook) — https://sre.google/workbook/error-budget-policy/
Alerting on SLOs (The Site Reliability Workbook, Chapter 5) — https://sre.google/workbook/alerting-on-slos/
Implementing SLOs (The Site Reliability Workbook, Chapter 2) — https://sre.google/workbook/implementing-slos/
Accelerate: 2024 State of DevOps Report (DORA) — https://dora.dev/research/2024/dora-report/
Accelerate: 2024 State of DevOps Report (PDF) — https://services.google.com/fh/files/misc/2024_final_dora_report.pdf
NIST SP 800-61r2: Computer Security Incident Handling Guide (PDF) — https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-61r2.pdf
NIST SP 800-61r2 (Catalog page) — https://csrc.nist.gov/pubs/sp/800/61/r2/final
Wired: “Here’s How Google Makes Sure It (Almost) Never Goes Down” — https://www.wired.com/2016/04/google-ensures-services-almost-never-go
#SRE #SiteReliability #DevOps #Toil #Automation #SLO #ErrorBudgets #RiskManagement #BurnRate #ReliabilityEngineering