This resource first appeared in issue #41 on 11 Sep 2020 and has tags Technical Leadership: Systems: Incident Handling, Technical Leadership: Systems: Other
Findings From the Field - Two Years of Following Incidents Closely - John Allspaw
Incident handling is an area where research computing falls well behind best practices in technology or IT, partly because the implicitly lower SLAs haven’t pushed us to have the discipline around incidents that other sectors have had.
And that’s a shame. There’s nothing wrong with having lower (say) uptime requirements if that’s the tradeoff appropriate for researcher use cases. But that doesn’t justify having no incident response protocol, no playbooks, and no procedures, and going through the stressful and error-prone process of making it up as we go along every time something happens. And I’ve seen many research computing centres where that is precisely what’s done.
This is a short presentation slide deck on what Allspaw has learned from following incident handling closely at multiple organizations.
Some common failure modes he’s seen in leadership are thinking that incidents are themselves a bad sign, wanting to get inappropriately involved, and insisting on largely irrelevant metrics. Common failure modes among front-line incident support are an exclusive focus on fixing over learning, and treating post-incident processes as bureaucracy and busywork.
In Allspaw’s estimation, both groups need to build culture and process around learning from incidents, creating meaningful actions to follow up on what was learned, and making the most of these unplanned investments in people’s time by making the reviews useful, having them re-read, and letting them inform future work.