Incident Review and Postmortem Best Practices

This resource first appeared in issue #97 on 22 Oct 2021 and has tags Technical Leadership: Systems: Incident Handling, Technical Leadership: Systems: Other

Incident Review and Postmortem Best Practices - Gergely Orosz

If your team is thinking of starting incident reviews & postmortems - which I recommend if relevant to your work - this is a good place to start. Orosz reports on a survey of, and discussions with, 60+ teams doing incident response, and finds that most follow a common pattern:

An outage is detected
An outage is declared
The incident is being mitigated
The incident has been mitigated
Decompression period (often comparatively short)
Incident analysis / postmortem / root cause analysis - often aiming for within 36-48 hours of the incident
Incident review
Action items tracked
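
For teams that want to track this in tooling, the pattern above can be sketched as an ordered set of stages. This is only an illustration of the sequence, not something from Orosz's article: the names `Stage`, `Incident`, and `advance` are hypothetical, and real incidents may well skip or revisit stages.

```python
from enum import Enum


class Stage(Enum):
    """Stages of the common incident pattern, in order (names are illustrative)."""
    DETECTED = 1
    DECLARED = 2
    MITIGATING = 3
    MITIGATED = 4
    DECOMPRESSION = 5
    ANALYSIS = 6
    REVIEW = 7
    ACTION_ITEMS_TRACKED = 8


def next_stage(stage):
    """Return the stage that follows `stage`, or None if it is the last one."""
    members = list(Stage)  # Enum iteration preserves definition order
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None


class Incident:
    """Hypothetical record that moves through the stages one at a time."""

    def __init__(self, title):
        self.title = title
        self.stage = Stage.DETECTED
        self.history = [Stage.DETECTED]

    def advance(self):
        """Move to the following stage and record it in the history."""
        nxt = next_stage(self.stage)
        if nxt is None:
            raise ValueError("incident is already at the final stage")
        self.stage = nxt
        self.history.append(nxt)
        return nxt
```

Keeping a `history` list makes it easy to see afterwards which stages an incident actually passed through, which is the kind of record an incident review would draw on.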

Current best practices seem to be:

Encourage raising incidents, even when in doubt
Be clear on roles during incidents
Define severity levels ahead of time
Have playbooks ready
Make time for staff to work on the review
Dig deep when looking into causes
Share analysis fairly broadly
Find or build tools to support incident handling
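
As one way to picture "define severity levels ahead of time", here is a minimal sketch of severity definitions kept in code. Everything in it is an assumption made for illustration: the level names, descriptions, paging rules, and the `classify` helper are hypothetical and do not come from the article or from any particular team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    """A severity level agreed on before any incident happens."""
    name: str
    description: str
    page_on_call: bool


# Hypothetical levels; a real team would write its own definitions.
SEVERITIES = {
    "SEV1": Severity("SEV1", "Customer-facing outage or data loss", True),
    "SEV2": Severity("SEV2", "Degraded service, workaround exists", True),
    "SEV3": Severity("SEV3", "Minor or internal-only impact", False),
}


def classify(customer_facing_outage: bool, degraded: bool) -> Severity:
    """Pick a severity from two illustrative yes/no questions."""
    if customer_facing_outage:
        return SEVERITIES["SEV1"]
    if degraded:
        return SEVERITIES["SEV2"]
    # When in doubt, still raise an incident, at the lowest level.
    return SEVERITIES["SEV3"]
```

The fallback to the lowest level rather than "no incident" mirrors the first practice in the list: raise incidents even when in doubt.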

He then goes into some detail on conversations with teams that are going beyond these best practices. One example is Honeycomb, which provides tracing for other teams' stacks and so has very high uptime requirements (they publicly released an outage report for a 5 min outage!).

A long article but worth a read.