Created on 2025-11-16 05:36
Published on 2025-11-24 11:00
Every SRE org is an ensemble sitcom that somehow ships software. You don’t have to write the laugh track; the dashboards, tickets, and Slack threads will do that for you. The beauty—and the chaos—is that the same team contains wildly different instincts that pull reliability in different directions. When you embrace that, you stop trying to make everyone the same and start composing a system that’s resilient precisely because people and perspectives are different.
Let’s meet the crew, because they live in every engineering org I’ve ever worked with or observed.
The Tooling Engineer sleeps inside Grafana, wakes up in Prometheus, and duct-tapes three CLIs into something that passes for an orchestrator. They’re allergic to “manual,” and their PRs contain equal parts code and bash sorcery. The Integrator doesn’t reinvent wheels; they make twelve vendor tools exchange JSON like holiday gifts and quietly make your stack feel designed on purpose. The Manager SRE’s pager is now a calendar where risk, budget, and headcount collide; they still say “we” and will still appear in a war room “just to listen,” which is code for “I miss it.”
The Process Engineer dreams in flowcharts, and yes, there’s a runbook for creating runbooks. The Rookie asks a “dumb” question that detonates two assumptions and a cornerstone of your architecture. The Waiter lives in the ticket queue and can juggle five “urgent” items without dropping their coffee. The By-the-Book refuses to bless a change that skipped three gates—and, mysteriously, your services don’t fall over anymore. The “I Can Do That Better and Faster” dev starts a three-month side project on a Tuesday and occasionally invents your next platform. The Automator writes a script at the first sign of repetition, and then automates themselves out of an afternoon’s work. The Pleaser says “sure” before asking about capacity, learns the sacred SRE word—“No… but”—and finally breathes. The Talker narrates incidents like a sports commentator, keeping nerves steady and decisions crisp. The Firefighter lives for blinking red; they’re bored by green. The Historian can quote the SEV-1 from three years ago and the Slack thread that proved your “new” idea already failed. The SLO Zealot measures everything including your excuses, and after the grumbling, people realize “user happiness” finally has a non-hand-wavy definition. The Platform Gardener tends clusters and pipelines like a backyard vineyard; when they’re great, everything “just grows.”
This isn’t a quirky taxonomy. It’s a design pattern for reliability.
High-performing teams don’t delete differences; they channel them. Consider incidents. The Talker and the Firefighter are your on-call accelerators, but they need the Process Engineer’s checklists to keep humans from forgetting steps when cortisol spikes. Pilots and surgeons use checklists for a reason: in high-stress moments, the brain drops packets. In tech, we like to think we’re exceptions; then a Rookie asks, “So where is that failover documented?” and the channel goes quiet. The lesson lands: checklists are not bureaucracy; they’re memory prosthetics for tired humans.
Even the Manager SRE—the one speaking budgets and risk—plays the long game for reliability. When they guard the ratio between toil and engineering work, they’re echoing SRE orthodoxy: if everything is a ticket and nothing is a roadmap, you accrue what I call “resilience debt.” The Platform Gardener prunes that debt by deprecating dead services, watering neglected dashboards, and building the “golden paths” developers can follow without paging a friend. The Integrator stitches together tools so the stack feels like a platform rather than a museum of logos. And the SLO Zealot keeps everyone honest about impact by asking, “What does this do to our error budget?”
Notice what that does culturally. The Pleaser starts saying “No… but here’s a safe, paved way.” The Waiter becomes a product manager for self-service instead of an order-taker. The Tooling Engineer gets to channel their 2 a.m. creativity into reusable modules rather than heroic one-offs. The “I Can Do That Better” dev still prototypes, but inside guardrails that stop “weekend brilliance” from becoming institutional dependency.
We all know the fight. On one side, the Tooling Engineer and the “I Can Do That Better” voice argue that bespoke beats bloat, that the sharpest tools are the ones you sharpen yourself, and that the only thing slower than vendor procurement is vendor support. On the other, the Integrator and the Manager SRE point out that gluing together mature products yields value faster and spreads risk. The Platform Gardener nods: the garden needs both heirloom seeds and store-bought soil.
There’s a real trade-off here. Build-heavy strategies optimize for fit and control but can produce “reliability monoculture”: that one wizardly library only two people understand. Buy-heavy strategies optimize for time to value and shared responsibility but can drift into “checkbox platforms” that nobody loves. The best orgs turn this into a portfolio question. Is the thing core to our differentiation or table stakes? If it’s core, invest; if it’s plumbing, pave a road and document the exit.
Another perennial sparring match: the By-the-Book versus the Firefighter. The By-the-Book says, “If it’s not in the runbook, it does not exist.” The Firefighter says, “Copy that; prod is on fire.” You need both. Gates and reviews prevent change explosions, but we’ve all seen “process theater” that slows safe changes while doing nothing to stop the dangerous ones. The trick is to let the SLO Zealot set the metronome: when the error budget is healthy, accept more risk; when it’s tight, slow down and pay reliability tax. It’s not dogma; it’s governance that moves with reality.
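To make the metronome idea concrete, here is a minimal sketch of turning remaining error budget into a release posture. Everything specific in it is an assumption for illustration: the function names `error_budget_remaining` and `release_posture`, the 50% and 10% thresholds, and the three posture labels are hypothetical, not taken from any particular tool or policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative means overspent)."""
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all; any failure overspends it.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def release_posture(remaining: float) -> str:
    """Map remaining budget to a posture. Thresholds are illustrative assumptions."""
    if remaining > 0.5:
        return "normal"    # healthy budget: accept more risk
    if remaining > 0.1:
        return "cautious"  # tightening budget: smaller batches, more review
    return "freeze"        # budget nearly gone: pause features, pay reliability tax


if __name__ == "__main__":
    remaining = error_budget_remaining(slo_target=0.999, total_requests=1_000_000, failed_requests=700)
    print(release_posture(remaining))
```

The point of the sketch is that the gate moves with measured reality rather than with a fixed calendar; a real policy would pair this output with human judgment instead of enforcing it blindly.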
There’s also a human layer. The Historian tempers the moment with, “We tried skipping that approval once… remember the cascading retries?” The Talker translates the risk in language the Manager SRE and executives can act on. The Rookie’s questions expose the hidden tribal knowledge that made the last “emergency” worse. And the Automator quietly takes a manual approval step and replaces it with a policy check you can audit.
If you want to see real transformation, watch the Waiter become a product owner for a self-service platform. Tickets like “spin me a sandbox” move from a queue to a portal. Suddenly the By-the-Book can encode policy into “golden paths,” the Tooling Engineer can ship paved templates instead of snippets, and the Integrator glues identity, audit, and observability together so that “how do I deploy?” becomes muscle memory. Self-service reduces cognitive load for product teams, and—when done well—reduces bureaucracy rather than adding it. Platform engineering is not a committee; it’s the productization of the best ways to build and run software safely.
When this works, the Waiter finally throws away the laminated menu and replaces it with a buffet you can’t mess up. And when it doesn’t, you get an overly abstracted maze that moves tickets from Jira to a glossy UI without improving flow. The difference is whether you treat the platform as a product with customers, SLOs, and feedback loops—or as a bundle of tools you announced on a slide.
One camp says SLOs and error budgets are the only sane way to arbitrate speed versus safety. They’re not just metrics; they’re a social contract between “ship it” and “keep it up.” Teams use them to align business impact with engineering reality, and to decide when to pause features and pay down reliability risk. Another camp warns that SLOs can become ritualized and rigid, turning teams into metric accountants who game thresholds and fear change. If you’ve ever watched a team aim deployments at a time window to dodge a burn calculation, you’ve seen the pathology. The antidote is simple but hard: set meaningful SLOs, revisit them with customers, and pair error budgets with human judgment rather than auto-locking the doors.
A second debate pits platform engineering against “move fast” intuition. Advocates argue that golden paths and internal platforms reduce cognitive load and raise the floor for security, compliance, and operability. Skeptics worry that platforms become gatekeepers with pretty portals and long waits. The resolution looks suspiciously like SRE itself: treat the platform like a product with its own SLOs (latency of common workflows, success rates of templates, time-to-first-deployment), and let the SLO Zealot ask the same annoying, useful questions of the platform team as they do of the product teams.
Picture this: It’s 02:13. Alerts are lighting up like a holiday tree. The Firefighter jumps in with SSH incantations only a select brotherhood knows. The Talker takes command and narrates the situation, assigns roles, and keeps panic off the mic. The Process Engineer quietly posts the incident checklist, and suddenly people remember to record timestamps and update status. The Rookie asks why the canary didn’t trigger, and the channel pauses as the Historian says, “We disabled it last quarter for load tests.” The Tooling Engineer pastes a grim little script to extract the last 10 minutes of logs across shards. The Integrator points out a webhook failure in an upstream vendor service that explains the sudden cascade. The Automator files two “never again” tasks before the service even stabilizes. The Manager SRE joins, asks if there’s user-visible impact, and aligns comms. Thirty minutes later, green returns to Grafana, and the SLO Zealot tallies burn against the error budget. In the post-incident, the By-the-Book insists we update the deployment gate to catch the exact class of misconfiguration. The Platform Gardener takes an action to add a paved rollback path to the portal.
What looks like chaos is actually choreography. The roles aren’t redundant; they’re complementary redundancies—like diverse instances in a fault-tolerant cluster.
One, write an error-budget policy that people can follow without needing a lawyer. Define how you’ll change release cadence when you’re burning budget too fast, when you’ll pause features, and how you’ll unwind those decisions. Pair that with SLO-based alerting so you page on user impact rather than every wobbly metric. You’ll notice your alerts stop competing with Netflix for your attention, and on-call stops feeling like a haunted house.
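As a rough illustration of paging on user impact, the sketch below computes a burn rate (observed error ratio divided by the error ratio the SLO allows) and pages only when both a long and a short window are burning fast. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from the commonly cited example of spending 2% of a 30-day budget in one hour; tune it to your own SLO window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so a recovered blip does not wake anyone up.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO is roughly a 20x burn.
    print(should_page(0.02, 0.02, slo_target=0.999))
```

A wobbly metric that never moves the error ratio produces a burn rate near zero here, which is exactly why it stops paging anyone.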
Two, product-manage your platform. Give the Waiter and Platform Gardener a real backlog with outcomes like “time to first service creation” and “mean time to successful rollback.” Golden paths should be opinionated but not punitive; build escape hatches with observability, so the “I Can Do That Better” folks can experiment without making their prototype everyone else’s dependency. Treat platform docs as a first-class artifact; the Manager SRE will thank you when onboarding stops requiring a Sherpa.
Three, codify incident roles and practice. Train the Talkers—incident commanders—to keep decisions moving and voices calm. Train Scribes to keep crisp timelines so postmortems aren’t archaeological digs through chat logs. Give Firefighters a place to be heroic without being the only plan. This is a muscle: schedule “game days,” run drills, and give the Rookie mic time so they grow a year of experience without waiting for a SEV-1.
Four, invest in checklists and runbooks like they’re code. Version them, test them, prune them. The Process Engineer already has a runbook for creating runbooks; let the Tooling Engineer embed those into chatops so the checklist shows up when someone types “/incident start.” The payoff is not in calm days; it’s in the worst hour of the worst day, when a good checklist turns panic into procedure.
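Here is one way "checklists as code" might look: the checklist is versioned data, a validator acts as its test, and a tiny command handler returns it when someone types “/incident start.” This is a sketch under assumptions: the `Checklist` type, the step wording, the version string, and `handle_command` are all hypothetical, and no real chat platform API is used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Checklist:
    name: str
    version: str
    steps: Tuple[str, ...]


# Illustrative content only; a real checklist would live in version control.
INCIDENT_START = Checklist(
    name="incident-start",
    version="1.2.0",
    steps=(
        "Assign an incident commander and a scribe",
        "Record the start timestamp",
        "Post a status update",
        "Confirm whether there is user-visible impact",
    ),
)


def validate(checklist: Checklist) -> List[str]:
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    if not checklist.steps:
        problems.append("checklist has no steps")
    seen = set()
    for i, step in enumerate(checklist.steps, start=1):
        if not step.strip():
            problems.append(f"step {i} is empty")
        elif step in seen:
            problems.append(f"step {i} duplicates an earlier step")
        seen.add(step)
    return problems


COMMANDS = {"/incident start": INCIDENT_START}


def handle_command(text: str) -> Optional[str]:
    """Render the matching checklist as chat text, or None for unknown input."""
    checklist = COMMANDS.get(text.strip().lower())
    if checklist is None:
        return None
    lines = [f"{checklist.name} (v{checklist.version})"]
    lines += [f"[ ] {i}. {step}" for i, step in enumerate(checklist.steps, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(handle_command("/incident start"))
```

Running `validate` in CI is the "test them" part; bumping `version` on every change gives the postmortem a way to say which checklist was actually in use.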
Five, shift from tickets to self-service with clear guardrails. Start by moving the top three repeatable requests (new service, database provisioning, canary release) into your portal. Wrap each action with policy checks and observability. The Waiter becomes a platform concierge, the Pleaser finds boundaries inside which they can say “Yes, inside here,” and the Integrator gets to glue identity and audit in ways that make auditors and developers equally happy. You’ll still have tickets, because there’s no universe where you don’t, but they’ll be for “unknowns,” not “known repeats.”
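A guardrail of this kind can be as small as a list of named policy checks that every self-service request passes through, producing a record an auditor can read later. The sketch below is hypothetical throughout: the request fields (`action`, `owner`, `environment`, `rollback_plan`), the policy names, and the `evaluate` function are assumptions made for illustration, not any portal's actual schema.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, object]
Policy = Tuple[str, Callable[[Request], bool]]

# Each policy is a name plus a check; names end up in the audit record.
POLICIES: List[Policy] = [
    ("has_owner", lambda r: bool(r.get("owner"))),
    ("known_environment",
     lambda r: r.get("environment") in {"sandbox", "staging", "production"}),
    ("production_needs_rollback_plan",
     lambda r: r.get("environment") != "production" or bool(r.get("rollback_plan"))),
]


def evaluate(request: Request) -> Dict[str, object]:
    """Run every policy and return an auditable decision record."""
    failed = [name for name, check in POLICIES if not check(request)]
    return {
        "action": request.get("action"),
        "allowed": not failed,
        "failed_policies": failed,
    }


if __name__ == "__main__":
    print(evaluate({"action": "canary-release", "owner": "team-a", "environment": "production"}))
```

Because the decision record names the policies that failed, the Pleaser’s “No… but” comes with a specific, fixable reason instead of a closed ticket.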
Six, rotate the hats. Let the By-the-Book spend a sprint pairing with the Firefighter on risk-based gating. Let the “I Can Do That Better” dev own a golden path for a quarter and maintain it like a product. Let the Rookie shadow the Historian through old postmortems to learn cultural lore without absorbing bad habits. The goal isn’t empathy theater; it’s real cross-pollination that removes single points of human failure.
Seven, measure developer experience and platform flow the way you measure uptime. Track lead time for changes along your paved roads, success rates of templates, rework caused by platform friction, and “time to recovery” for botched deploys. Reliability is not just SLOs on the public endpoint; it’s also the reliability of your internal delivery engine. When the platform is reliable, the product is calmer, and the pager is kinder.
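Two of these measures are simple enough to sketch directly: lead time for changes (commit to deploy) and template success rate. The function names and the input shapes below are assumptions; where the timestamps and outcomes come from depends on your own pipeline.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Tuple


def median_lead_time(changes: List[Tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to deploy for (committed_at, deployed_at) pairs."""
    if not changes:
        raise ValueError("need at least one change to compute a lead time")
    seconds = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of template runs that succeeded (True means success)."""
    if not outcomes:
        raise ValueError("need at least one outcome to compute a success rate")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    start = datetime(2025, 1, 6, 9, 0)
    changes = [
        (start, start + timedelta(hours=1)),
        (start, start + timedelta(hours=3)),
        (start, start + timedelta(hours=5)),
    ]
    print(median_lead_time(changes))
    print(success_rate([True, True, False, True]))
```

The median is used rather than the mean so that one change stuck in review for a week does not hide how the paved road feels on a normal day.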
Everything above is technology flavored, but it’s people to the core. The Rookie keeps us honest about complexity we pretend isn’t there. The Historian stops us from relearning old lessons the expensive way. The Pleaser teaches us that “No, but…” can be both kind and firm. The By-the-Book reminds us that discipline beats bravado. The Firefighter reminds us that sometimes a human with a hot keyboard still saves the day. The SLO Zealot translates user pain into math we can act on. The Manager SRE keeps the lights funded. The Platform Gardener makes tomorrow calmer than today. And the Tooling Engineer and Integrator—our builder and our diplomat—keep the machine both sharp and civil.
If you’re lucky, your org doesn’t try to flatten these personas into one “ideal SRE.” Instead, it treats them like a balanced portfolio. On calm days, the Gardener prunes and the Automator deletes toil. On rough nights, the Talker commands, the Firefighter fixes, the Process Engineer steadies, and the Historian remembers. And somewhere, the Rookie asks the question everyone else was too tired—or too proud—to ask. That’s not dysfunction. That’s resilience.
I used to think a great SRE team was one where everyone could do everything. Now I think great SRE teams are those where everyone can do something essential, and the system makes those somethings add up to safety. Your platform is a product, your processes are code, your metrics are promises, and your people are the redundancy. If you can get those four to stop arguing and start harmonizing, you’ll have fewer 2 a.m. scripts, fewer “quick favors” that never end, and more mornings where nothing interesting happened—and that’s the highest compliment we have.
DORA | Accelerate State of DevOps Report 2024 — https://dora.dev/research/2024/dora-report/
Chapter 2: Implementing SLOs (Google SRE Workbook) — https://sre.google/workbook/implementing-slos/
Incident Commander (PagerDuty Incident Response) — https://response.pagerduty.com/training/incident_commander/
How to run a blameless postmortem (Atlassian) — https://www.atlassian.com/incident-management/postmortem/blameless
CNCF Annual Survey 2024 (PDF) — https://www.cncf.io/wp-content/uploads/2025/04/cncf_annual_survey24_031225a.pdf