Toil vs. Valuable Work

Created on 2025-04-14 07:14

Published on 2025-04-21 10:00

It’s 2:00 AM, and I’m staring at a terminal window that’s begun to blur into itself. The room is dark except for the faint glow of a monitor and the blinking cursor quietly mocking my exhaustion. There’s a sticky note next to the keyboard that says, “Automate this later.” I chuckle. It’s not wrong.

The thing is, I’ve been here before—many times. Doing the same set of actions over and over: logging into servers, running cleanup scripts, restarting services that shouldn’t be failing in the first place, emailing postmortem notes that could’ve been templated. This isn’t the glorious part of Site Reliability Engineering. This is what we call toil.

Now, if you’ve spent any time in the SRE world, especially within organizations that follow the Google SRE model, you’ve probably heard this word a lot. “Toil.” It’s almost a slur in certain circles. Utter it, and people wince, like you just said something profane in a church of automation. Toil, according to the textbook, is work that’s manual, repetitive, automatable, and devoid of long-term value. The golden rule of SRE is to minimize it. You’re supposed to automate your way out of it, design systems that don’t require it, and build platforms that allow others to bypass it. That’s the ideal.

But here’s the thing about ideals—they rarely survive contact with reality.

Toil isn’t always evil. In fact, in many teams I’ve worked with, toil has been the crucible where real operational understanding was forged. You see, there’s something about doing the dirty work—handling flaky alerts, fixing misconfigured services, performing manual rollbacks—that builds a kind of intuition no book or dashboard can give you. It's the difference between reading a map and walking the terrain.

Take onboarding, for example. When new SREs join a team, we don't throw them into Terraform modules or deep Grafana dashboards right away. We let them shadow incident response, perform routine runbooks, and yes, do a bit of toil. Because in doing the repetitive work, they begin to see patterns. They begin to ask questions. Why does this fail so often? Why is this check even necessary? That questioning is the seed of automation.

So while the doctrine says eliminate toil, the wiser approach might be to understand toil first. What kind of toil are we dealing with? Is it temporary? Is it revealing system complexity? Is it creating on-ramps for junior engineers? Not all toil is equal.

Of course, there’s a limit to how much toil is healthy. If you’re spending half your week firefighting, responding to alerts that shouldn’t exist, or patching the same bug every Friday, that’s not learning anymore—that’s stagnation. And worse, it’s a recipe for burnout. I’ve seen amazing engineers reduced to husks of their former selves because they were buried under toil and couldn’t claw their way out.

That’s where SRE needs to draw the line. We should treat toil like technical debt: something to track, something to manage, and something to pay off before the interest kills us. In fact, some of the smartest SRE teams I’ve seen treat their toil like a product backlog. They log it, categorize it, and review it during sprint planning. Some even assign toil budgets—X hours a week on toil, Y hours on reduction efforts. That’s mature.
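To make the backlog idea concrete, here is a minimal sketch in Python of what logging and categorizing toil against a weekly budget could look like. Everything in it is an assumption for illustration: the `ToilEntry` fields, the category names, the `summarize_toil` helper, and the 8-hour budget are all hypothetical, not a description of any particular team's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToilEntry:
    """One logged piece of toil (all field names are hypothetical)."""
    description: str
    category: str  # e.g. "restarts", "cleanup", "postmortem-notes"
    hours: float


def summarize_toil(entries, weekly_budget_hours):
    """Total the logged hours per category and compare them to the budget."""
    hours_by_category = defaultdict(float)
    for entry in entries:
        hours_by_category[entry.category] += entry.hours
    total_hours = sum(hours_by_category.values())
    return {
        "hours_by_category": dict(hours_by_category),
        "total_hours": total_hours,
        "over_budget": total_hours > weekly_budget_hours,
    }


# Example week; the entries and the 8-hour budget are made up for illustration.
week = [
    ToilEntry("Restart flaky service", "restarts", 1.5),
    ToilEntry("Run disk cleanup script", "cleanup", 2.0),
    ToilEntry("Restart flaky service again", "restarts", 1.0),
    ToilEntry("Write postmortem email by hand", "postmortem-notes", 4.5),
]
summary = summarize_toil(week, weekly_budget_hours=8.0)
print(summary)
```

A per-category total like this is the sort of thing you could bring to sprint planning: the category eating the most hours is usually the first candidate for a reduction effort.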

And then there’s the other side of this conversation: what exactly is valuable work?

You might think the answer is obvious—“the stuff that moves the business forward.” But defining value isn’t that simple in reliability engineering. Is writing a high-availability failover tool valuable? Probably. But so is documenting the quirks of your deployment pipeline if that saves someone two hours every week. Value can be found in surprising places.

The reality is that valuable work, in SRE, often looks boring to outsiders. It’s not about heroics or huge features. It’s about predictability. Stability. Consistency. It’s building guardrails so others don’t fall. It’s designing systems that fail gracefully rather than catastrophically. And yes, it’s automating away the toil—after you’ve learned from it.

Let me tell you a story. A few years ago, I was part of a team that had an obscure microservice that would crash every Tuesday morning. No one knew why. We built alerts for it. We restarted it manually. We had runbooks and even pagers for it. The “toil” became so normalized that nobody questioned it. Until one junior SRE asked, “Why Tuesday?” That led to a deeper investigation—and guess what? A malformed cron job in a dependency was flooding the service with bad data on Monday nights. One fix. Toil gone. That’s the kind of valuable work we often overlook.
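The “Why Tuesday?” question is one you can put to your own incident data. The sketch below is not what that team ran; it is a small Python illustration, assuming you have crash timestamps as ISO 8601 strings, of how grouping them by weekday makes that kind of pattern visible. The `crashes_by_weekday` helper and the sample `crash_log` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def crashes_by_weekday(timestamps):
    """Count crashes per weekday from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        counts[WEEKDAYS[datetime.fromisoformat(ts).weekday()]] += 1
    return counts


# Hypothetical crash log; the timestamps are made up for illustration.
crash_log = [
    "2024-03-05T06:12:00",
    "2024-03-12T06:09:00",
    "2024-03-14T15:40:00",
    "2024-03-19T06:15:00",
]
counts = crashes_by_weekday(crash_log)
print(counts.most_common())
```

If one weekday dominates the count, that is a cue to look at whatever runs on a weekly schedule just before it, such as a cron job in an upstream dependency.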

So here’s the balance: Toil helps you discover the system. Valuable work helps you shape it. You need both. Toil teaches. Valuable work transforms.

In the end, being an SRE isn’t about living in one extreme or another. It’s about knowing when to endure the toil, and when to step back and elevate the experience for others. It's about being humble enough to learn from repetition, but bold enough to automate yourself out of a job.

And if you ever find yourself working late at night, again, typing in the same command for the third time—just remember to stick a note on your desk: Automate this later. Then follow through when the sun comes up.

That’s the real job.