Incident Review and Postmortem Best Practices - Gergely Orosz

This resource first appeared in issue #97 on 22 Oct 2021 and has tags Technical Leadership: Systems: Incident Handling, Technical Leadership: Systems: Other

If your team is thinking of starting incident reviews and postmortems, which I recommend if relevant to your work, this is a good place to start. Orosz reports on a survey of, and discussions with, 60+ teams doing incident response, and finds that most follow a fairly common pattern:

  • An outage is detected
  • An outage is declared
  • The incident is being mitigated
  • The incident has been mitigated
  • Decompression period (often comparatively short)
  • Incident analysis / post mortem / root cause analysis - often aiming for within 36-48 hours of the incident
  • Incident review
  • Action items tracked.

Current best practices seem to be:

  • Encourage raising incidents, even when in doubt
  • Be clear on roles during incidents
  • Define severity levels ahead of time
  • Have playbooks ready
  • Make time for staff to work on the review
  • Dig deep when looking into causes
  • Share analysis fairly broadly
  • Find or build tools to support incident handling

He then goes into some detail on conversations with teams that are going beyond these best practices, including companies like Honeycomb who, providing tracing for other teams' stacks, have very high uptime requirements (they publicly released an outage report for a 5-minute outage!).

A long article but worth a read.
