This resource first appeared in issue #120 on 30 Apr 2022 and has tags Technical Leadership: Systems: Incident Handling
Incident management best practices: before the incident - Robert Ross
Incident Analysis 101: Techniques for Sharing Incident Findings - Vanessa Huerta Granda
You’ll know, gentle reader, that I’m a big proponent of learning from incidents, and sharing them with researchers who after all deserve to know why they couldn’t do their work for some period of time. Here’s a pair] of good articles about preparing for an incident, and putting together and sharing the incident report afterwards.
In the first article, Ross talks about clarifying ownership for various services and systems, and clarifying the roles that need to be taken up during an incident. In the US, FEMA has roles for Commander, Mitigator, Planning, and Communication - deciding, doing, figuring out next steps, and communicating internally and externally. All of those tasks need to be done, but for smaller teams managing smaller incidents, one might want to combine those tasks into a smaller number of roles. Then the process - when an incident is called (what’s the threshold?) and what the process is once called needs to be clarified.
With the clarification done, you check your knowledge and for incidents the way you prepare for anything - practicing on dry runs, including with communication.
In the second article, Granda talks about the why’s and how’s of writing up the findings of an after-incident review. The benefits include transparency to users, which helps build trust. But benefits also include clarifying the lessons learned by the team and documenting things so that knowledge can be shared, revisited, and built upon. We’re in the science business - when new things are learned we should write them up and publish them! Granda walks us through the process and even gives us a template.