Published on 2025-06-16 08:57
It’s the day after an outage. The system is back online. The alerts have stopped. Customers are recovering. And now, it’s time for the incident review.
Also known as the postmortem, the RCA, or the “blameless retrospective,” this ritual is meant to be a safe space for learning: a chance to explore what happened, why it happened, and how to prevent it in the future.

But here’s the uncomfortable truth: in many companies, incident reviews don’t feel blameless. They feel like blame in disguise. Engineers come in tense. Managers ask loaded questions. Docs are sanitized. Learnings are shallow. And the quiet message is clear: don’t be the one holding the pager next time.
So what happened? How did a tool meant to foster safety end up becoming a source of fear?

Let’s explore both sides of this critical and controversial practice.
**The Ideal: Blameless Postmortems**
The original idea, popularized by Google’s SRE team, is simple: incidents are systemic failures, not individual ones.

- Humans are fallible.
- Systems should account for human error.
- Blame discourages honesty.
- Only learning prevents repeat failures.
In this model:

- Anyone can trigger a review.
- The timeline is reconstructed collaboratively.
- The focus is on systems, processes, and signals, not people.
- Action items aim to fix conditions, not assign punishment.
It’s a powerful concept. When done right, it builds trust, surfaces real root causes, and creates a culture of continuous improvement.
**The Reality: Fear in the Room**
But in practice, many incident reviews deviate from the ideal.
**Implicit Judgment.** Even without pointing fingers, participants know who was on-call, who deployed, who made the call. The judgment hangs in the air.

**Power Dynamics.** When senior leaders attend, engineers may sugarcoat details, avoid admitting uncertainty, or play defense.

**Language Slippage.** Blameless reviews often contain blameful language: “should have caught,” “failed to notice,” “neglected to validate.”

**Retrospective Theater.** Reviews become performative. Engineers write what’s expected. Real issues get lost in polished slides and action-item spreadsheets.

**Punitive Follow-Up.** A supposedly blameless review leads to performance reviews, reprimands, or tighter controls. The trust is broken.
**Why It Happens**
Blamelessness is easy to say, hard to practice.
Leaders want accountability but lack tools to separate it from blame.
Teams want to improve but fear consequences.
Cultures say “we’re blameless” but their actions tell another story.
And when the stakes are high (lost revenue, customer churn, exec escalation), emotions run hot. It’s hard not to look for someone to hold responsible.
**The Case for Honest Reviews**
Still, incident reviews matter.

- They surface latent risks.
- They connect system behavior to human behavior.
- They improve resilience over time.

Without reviews, incidents repeat. Knowledge stays tribal. Systems rot. But the reviews must be safe. Because when engineers fear the review, they hide. They withhold. They cover up. And then, real reliability suffers.
**Signs Your Reviews Aren’t Blameless**
- People skip or avoid them.
- Action items are vague or redundant.
- Reviews focus only on the technical root cause.
- Engineers say “we should have” more than “we learned.”
- Reviews never question process or culture—just fixes.
**A Better Way Forward**
Making incident reviews safe and useful requires intention.
**Set Psychological Safety First.** Begin every review with a reminder: no one will be punished. Learning is the goal.
**De-Identify the Timeline.** Focus on “the system did X,” not “the engineer did X.”
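One lightweight way to apply this is to de-identify timeline entries before the doc is shared. A minimal sketch in Python; the names and the role mapping are invented for illustration:

```python
import re

# Hypothetical mapping from people to the roles they held during the incident.
ROLE_MAP = {
    "alice": "the on-call engineer",
    "bob": "the deploying engineer",
}

def deidentify(entry: str) -> str:
    """Rewrite a timeline entry so it names roles, not people."""
    for name, role in ROLE_MAP.items():
        # Whole-word, case-insensitive match so "Alice" and "alice" both map.
        entry = re.sub(rf"\b{re.escape(name)}\b", role, entry, flags=re.IGNORECASE)
    return entry
```

The point isn’t secrecy (the team usually knows who was involved); it’s that the written record discusses roles and systems, so readers reason about conditions rather than individuals.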
**Use Structured Templates.** Include environment, signals, detection, mitigation, communication, impact, and lessons learned.
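As a sketch, those sections can be captured as a simple data structure so a review can’t quietly skip one; the field set mirrors the list above, while the class itself is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    """Structured incident review template (illustrative field set)."""
    environment: str                # where it happened: region, cluster, version
    signals: list[str]              # metrics, logs, and alerts observed
    detection: str                  # how and when the incident was noticed
    mitigation: str                 # steps taken to stop the bleeding
    communication: str              # who was informed, when, via which channels
    impact: str                     # customer- and business-facing consequences
    lessons_learned: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A review isn't done until every section has content.
        return all([self.environment, self.signals, self.detection,
                    self.mitigation, self.communication, self.impact,
                    self.lessons_learned])
```

A shared skeleton like this also makes reviews comparable over time, which is where patterns across incidents start to show.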
**Include Non-Technical Contributors.** Customer support, product, and legal often bring key insights. Incidents aren’t just code.
**Focus on Multiple Root Causes.** Rarely is there just one. Look for stacked failures: systems, tools, norms.
**Review the Review Process.** Meta-retrospectives are powerful. Ask: how did this review go? Did we learn?
**Real-World Story: Trust in Action**
At a fintech company, an engineer accidentally deployed a broken config that took down production. The incident cost real money. During the review, leadership focused on:
- Why was the system able to deploy without validation?
- Why didn’t alerts fire earlier?
- Why wasn’t there a fast rollback?
The engineer wasn’t blamed. Instead, the team invested in safer deploys, faster detection, and better communication tooling.

Months later, when another engineer made a mistake, they owned it immediately. Why? Because they trusted the process.
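The first of those questions, how a broken config could deploy at all, is typically answered with a pre-deploy validation gate. A minimal sketch, assuming a JSON config; the required keys and checks are invented for illustration:

```python
import json

# Illustrative schema: the keys this hypothetical service requires.
REQUIRED_KEYS = {"service_name", "timeout_ms", "replicas"}

def validate_config(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the config may ship."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    problems = []
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if cfg.get("replicas", 0) < 1:
        problems.append("replicas must be >= 1")
    return problems
```

Wired into CI so that a non-empty problem list blocks the deploy, a gate like this removes the need for any individual to “catch” the mistake, which is exactly the systemic fix the review was after.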
That’s the power of truly blameless reviews.
**The Case for Accountability**
Some argue that we’ve swung too far, and that “blamelessness” is used to dodge responsibility. If someone repeatedly causes issues, shouldn’t we intervene? Yes, but in a different forum.

- Performance issues belong in 1:1s.
- Reviews are for systems, not individuals.
- Accountability is important, but distinct from incident learning.

Blame creates fear. Accountability creates growth. They’re not the same.
**Final Thought**
Incidents are stressful. Reviews don’t have to be. They can be moments of insight, alignment, and clarity. But only if we design them to be.
So ask your team:
Are we learning from failure or performing it?
Do our reviews build trust or erode it?
Are we fixing systems or finding scapegoats?

Because real reliability starts not with uptime, but with honesty.
And that’s only possible when engineers know they’ll be heard, not hunted.
Build that culture, and your systems will follow.