
Post-Mortem Discipline

Post-Mortem Discipline is the organizational practice of running structured, blameless retrospectives after every significant incident, project, or change, and systematically converting the findings into permanent process changes. Google's SRE handbook codified the modern blameless post-mortem: the goal is not to assign blame but to identify systemic causes (the conditions that allowed an individual error to cause harm) and ship fixes. Etsy's debrief practice goes further, treating outages as learning opportunities and publishing internal post-mortems widely so the organization compounds lessons. The discipline has three layers: (1) blameless investigation, (2) action item ownership with deadlines, and (3) closed-loop verification that action items shipped. Without all three, post-mortems become organizational scar tissue: meetings that catalog what already broke without changing what comes next.

Also known as: Blameless Post-Mortem, Incident Retrospective, After-Action Review, Project Debrief

The Trap

The trap is the post-mortem theater cycle: a meeting happens, action items get logged, no one owns them with a deadline, and 70% never ship. Six months later, the same incident recurs and a new post-mortem is held. The action item list grows; the actual change rate is near zero. KnowMBA POV: post-mortems can only catalog what already broke. They are necessary but insufficient; pre-mortems uncover what post-mortems can only document after the damage is done. The second trap is blame creeping back in via euphemism ('the engineer who pushed the change' is blame disguised as fact). The moment blame enters, candor leaves and you stop hearing about real causes.

What to Do

Run post-mortems with: (1) Strict blamelessness: discuss roles and decisions, never name individuals as causes. Frame: 'given the information available at the time, why was this a reasonable decision?' (2) A time box: hold the post-mortem within 5 business days of incident close, because memory degrades fast. (3) A written narrative document, not slides; narrative captures causal chains that slides flatten. (4) 3-7 SMART action items, each with a named owner, a deadline, and a success criterion. (5) Action item review in a standing leadership forum at 30/60/90 days. (6) Internal publishing so other teams compound the learning. Track 'action item ship rate' as a meta-metric: if it's below 70%, the post-mortem process is broken regardless of the meeting quality.
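The action item rules in the checklist above (3-7 items, each with an owner, a deadline, and a success criterion, reviewed at 30/60/90 days) can be checked mechanically before a post-mortem is published. The following is a minimal sketch, not a reference implementation: the `ActionItem` fields and both function names are hypothetical, and counting the 30/60/90 days from the post-mortem date is an assumption, since the text does not say where the clock starts.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    # Hypothetical record shape; adapt the fields to your tracker.
    title: str
    owner: str                 # the person accountable for shipping the fix
    deadline: Optional[date]   # None means no deadline was committed
    success_criterion: str     # how we will know the fix worked


def validate_action_items(items: List[ActionItem]) -> List[str]:
    """Return a list of problems; an empty list means the items pass."""
    problems = []
    if not 3 <= len(items) <= 7:
        problems.append(f"expected 3-7 action items, found {len(items)}")
    for item in items:
        if not item.owner.strip():
            problems.append(f"'{item.title}' has no named owner")
        if item.deadline is None:
            problems.append(f"'{item.title}' has no deadline")
        if not item.success_criterion.strip():
            problems.append(f"'{item.title}' has no success criterion")
    return problems


def review_dates(post_mortem_date: date) -> List[date]:
    """Leadership review checkpoints at 30, 60, and 90 days (assumed start: post-mortem date)."""
    return [post_mortem_date + timedelta(days=d) for d in (30, 60, 90)]
```

A check like this only enforces the form of the action items. Whether an item is genuinely SMART, and whether the document stays blameless, still needs a human reviewer.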

Formula

Post-Mortem Effectiveness = Blamelessness × Action Item Ship Rate × Cross-Team Learning Reach (multiplicative: if any one factor is weak, the practice fails)
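To see why the formula is multiplicative rather than additive, it helps to plug in numbers. The sketch below assumes each factor is scored on a 0-to-1 scale; that scale, the function name, and the example scores are illustrative assumptions, since the source gives no units for the three factors.

```python
def post_mortem_effectiveness(blamelessness: float,
                              ship_rate: float,
                              learning_reach: float) -> float:
    """Multiply the three factors, each assumed to be scored from 0.0 to 1.0."""
    factors = {
        "blamelessness": blamelessness,
        "ship_rate": ship_rate,
        "learning_reach": learning_reach,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # A product, not a sum: one weak factor drags the whole score down.
    return blamelessness * ship_rate * learning_reach
```

With hypothetical scores of 0.9 for blamelessness, 0.3 for ship rate, and 0.8 for learning reach, the result is roughly 0.22, even though two of the three factors are strong. A ship rate of zero yields zero regardless of the other two.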

In Practice

Google's SRE organization codified the blameless post-mortem in the public SRE Book (2016). After every significant incident, an SRE writes a post-mortem document with a timeline, root causes, contributing factors, action items with owners and deadlines, and lessons learned. Critically, post-mortems are reviewed publicly by other SRE teams, which makes the company smarter as a system rather than team by team. The discipline is enforced by leadership: if action items don't ship, the post-mortem is reopened. Google estimates the practice has prevented many recurrences of incidents that, without action item follow-through, would have happened repeatedly. Etsy's similar practice (documented by John Allspaw) explicitly treats engineers as the second victim of incidents, not the cause, which preserves the candor needed to find systemic causes.

Pro Tips

  • 01

    Track action item ship rate as a leading indicator. If your post-mortems generate 10 action items per incident and only 3 ship within 90 days, your post-mortem process is producing scar tissue, not change. Target: 70%+ ship rate within committed deadline.

  • 02

    Separate the post-mortem document (forensic, blameless, published) from the leadership accountability discussion (private, performance-related, with HR). Conflating the two destroys the candor of the post-mortem because participants self-censor to protect colleagues.

  • 03

    Publish post-mortems internally, even painful ones. The compounding value of post-mortems comes from cross-team learning. A post-mortem read only by the team that lived the incident extracts maybe 20% of its value; publishing widely extracts 80%+.
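The ship rate from the first tip can be computed directly from tracker data. This is a minimal sketch assuming each action item records a committed deadline and an optional ship date; the `TrackedItem` shape and the function name are hypothetical, not taken from any particular tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class TrackedItem:
    # Hypothetical shape for an action item exported from a tracker.
    deadline: date
    shipped_on: Optional[date]  # None while the item is still open


def action_item_ship_rate(items: List[TrackedItem]) -> float:
    """Fraction of action items shipped on or before their committed deadline."""
    if not items:
        raise ValueError("no action items to measure")
    on_time = sum(
        1 for item in items
        if item.shipped_on is not None and item.shipped_on <= item.deadline
    )
    return on_time / len(items)
```

For the example in the first tip, 10 action items with 3 shipped on time gives a rate of 0.3, well under the 70% target. Items that shipped late count as misses here, which matches the "within committed deadline" wording.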

Myth vs Reality

Myth

“Blameless post-mortems mean no one is held accountable”

Reality

Blameless investigations and accountability are separate processes. The post-mortem identifies what systemic conditions allowed an error to cause harm. Performance management (separate, private, with HR) addresses individual accountability. Conflating them breaks both: investigation candor collapses and accountability becomes capricious.

Myth

“Action items from post-mortems should be assigned to the team that caused the incident”

Reality

Often the right action is in another team (e.g., a platform fix that prevents the class of error entirely). Constraining action items to the team of origin guarantees you only treat symptoms. Cross-team action items require leadership to enforce, but they're where the highest-leverage fixes live.

Myth

“Quick incidents (resolved in <1 hour) don't merit a post-mortem”

Reality

Quick incidents that were resolved by luck or heroic effort are exactly the ones that will recur. How severe the impact happened to be this time is a poor filter. Better filters: Was the cause novel? Did we have to use heroics to recover? Did we get lucky on impact? If the answer to any of these is yes, run the post-mortem.
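The three triage questions for quick incidents reduce to a simple "any of" rule. A minimal sketch, with a hypothetical function name and parameter names chosen for illustration:

```python
def quick_incident_needs_post_mortem(novel_cause: bool,
                                     required_heroics: bool,
                                     lucky_on_impact: bool) -> bool:
    """Triage a quickly resolved incident: any single 'yes' triggers a post-mortem."""
    return novel_cause or required_heroics or lucky_on_impact
```

The design point is that resolution time and impact do not appear as inputs at all. An incident fixed in ten minutes with no customer impact still qualifies if the recovery depended on one person happening to be online.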

Knowledge Check

Your engineering team runs post-mortems after every Sev1 incident. The meetings are well-attended, the documents are detailed, and action items get logged in Jira. But 6 months in, three of the original incidents have recurred. What's the most likely root cause?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Action Item Ship Rate (Within Committed Deadline)
Scope: engineering and operations post-mortem programs

Elite (Google SRE, Etsy): 75-85%
Healthy: 60-75%
At-risk: 40-60%
Theater: <40%

Source: Google SRE Book (2016); Etsy Code as Craft engineering blog
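A measured ship rate can be mapped onto the tiers above with a small lookup. The published ranges share their endpoints (75% and 60% each appear in two tiers) and stop at 85%, so this sketch makes two assumptions that are not in the source: a boundary value belongs to the higher tier, and anything above 85% is also treated as Elite. The function name is hypothetical.

```python
def ship_rate_tier(rate_percent: float) -> str:
    """Map an action item ship rate (0-100) to a benchmark tier."""
    if not 0 <= rate_percent <= 100:
        raise ValueError(f"rate must be between 0 and 100, got {rate_percent}")
    if rate_percent >= 75:   # assumption: values above 85 still count as Elite
        return "Elite"
    if rate_percent >= 60:   # assumption: boundary values go to the higher tier
        return "Healthy"
    if rate_percent >= 40:
        return "At-risk"
    return "Theater"
```

A team shipping 3 of 10 action items on time (30%) lands in the Theater tier; moving to 6 of 10 crosses into Healthy.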

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

Google SRE (Site Reliability Engineering)

2003-present

success

Google's SRE organization codified the modern blameless post-mortem and made it public via the SRE Book in 2016. The practice: every Sev1 and Sev2 incident generates a written post-mortem within 5 business days, structured as timeline, root causes, contributing factors, action items, and lessons. The defining feature is blamelessness: investigators discuss decisions in the context of the information available at the time, never identifying individuals as causes. Post-mortems are published widely across the SRE org, making cross-team learning the default rather than the exception. Action items are tracked with deadlines and reviewed in standing leadership forums. Google credits the discipline with materially reducing recurrence of similar incident classes; the compounding value comes from organization-wide pattern recognition, not per-incident fixes.

Post-mortem deadline: 5 business days from incident close
Blamelessness: Mandatory, no individual naming
Publishing scope: Org-wide by default
Action item review cadence: Standing leadership forum

The Google SRE post-mortem is the modern reference implementation. The three non-negotiable elements: blameless investigation, action items with deadlines and named owners, and broad internal publishing. Skip any one and the practice degrades into documentation theater.


Etsy (Debrief Culture)

2010-present

success

Etsy's engineering organization, under former CTO John Allspaw, developed and publicly documented one of the most influential blameless post-mortem cultures outside Google. Allspaw's 2012 essay 'Blameless PostMortems and a Just Culture' framed engineers as the second victim of incidents, not the cause, a deliberate framing to preserve the candor needed for systemic investigation. Etsy's debriefs explicitly separated 'understanding the system' from 'evaluating individuals.' The practice became foundational to Etsy's continuous deployment culture (60+ deploys per day at peak), because high deploy velocity is only safe with high-quality learning from incidents.

Deploys per day (peak era): 60+
Debrief framing: Engineer as second victim
Industry influence: Foundation for DevOps/SRE post-mortem practice
Allspaw essay (2012): Widely cited reference

Etsy's contribution is the framing: when engineers are the second victim of incidents (not the cause), candor becomes possible and systemic causes become visible. The framing precedes the process: get the framing wrong and no amount of post-mortem template polish saves the practice.

