Good, Less-painful, Postmortems

This resource first appeared in issue #37 on 14 Aug 2020 and has tags Technical Leadership: Systems: Incident Handling

Improving Postmortems from Chores to Masterclass with Paul Osman - Blameless
Theory vs. Practice: Learnings from a recent Hadoop incident - Sandhya Ramu and Vasanth Rajamani, LinkedIn

Stuff happens, and when it does happen it’s a lot of work and stressful. We should at least take the opportunity to make the most of these “unplanned investments”, learn from them, and make use of those lessons to prevent related stuff from happening in the future.

The talk and transcript by Paul Osman is a good one about allowing for “blameless postmortems” - finding out what actually happened in the complex system of your team and the computing that lead to the incident. People feel defensive, other people will tend to feel accusatory, and unless you’ve got a good system in place for getting to the root of things without placing blame on people, it’s going to be hard to learn anything meaningful.

The post from LinkedIn about what happened in an incident is nice example of what can come out of a postmortem, a clear and well-written description of an incident and what will happen next:

Here’s what happened: roughly 2% of the machines across a handful of racks were inadvertently reimaged. This was caused by procedural gaps in our Hadoop infrastructure’s host life cycle management. Compounding our woes, the incident happened on our business-critical production cluster.

Small but non-trivial amounts of data were lost as a result of this. What’s really nice about this post and the results of their incident process is that the changes weren’t small scale tweaks to prevent exactly this error from happening again, but much larger scale process improvements that will reduce entire classes of somewhat related problems from occurring.