jonathan@researchcomputingteams.org

Category: Technical Leadership: Systems: Incident Handling

5 Best Practices on Nailing Postmortems - Hannah Culver, Blameless

We’ve started doing incident reports - sort of baby postmortems - in our project, which has been an extremely useful practice for building more discipline around how issues with availability or security are reported, distributed, and dealt with. It also gives us a library of materials that we can look through to identify any patterns that appear. This article talks about some best practices for running postmortem processes: Use visuals; Be a historian -...

How to create an incident response playbook - Blake Thorne, Atlassian

Other tags: | Technical Leadership: Other |

This is a really good starting point for putting together an incident response playbook, and includes links to Atlassian’s own playbooks and a workshop on incident communication. This is something we’re working on in our own team. We’re not there yet, but we’re getting there. On the other hand, colleagues-of-colleagues of mine were involved in a major incident recently in an organization where there were lots of security policies in place about keys and...

Good, Less-painful, Postmortems

Improving Postmortems from Chores to Masterclass with Paul Osman - Blameless
Theory vs. Practice: Learnings from a recent Hadoop incident - Sandhya Ramu and Vasanth Rajamani, LinkedIn

Stuff happens, and when it does, it’s stressful and a lot of work. We should at least take the opportunity to make the most of these “unplanned investments”, learn from them, and use those lessons to prevent related stuff from happening in the future. The talk and transcript by Paul Osman is a good one...

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw

Incident handling is an area where research computing falls well behind best practices in technology or IT, partly because the implicitly lower SLAs haven’t pushed us to have the discipline around incidents that other sectors have had. And that’s a shame. There’s nothing wrong with having lower (say) uptime requirements if that’s the tradeoff appropriate for researcher use cases, but that doesn’t mean having no incident response protocol, no playbooks, no...

Learning from Postmortems in Hazardous Contexts

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - Thai Wood, Resilience Roundup
Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - John S. Carroll, Industrial & Environmental Crisis Quarterly (1995)

This is a recent blog post about a less recent paper, reviewing how incident reviews work in high-hazard industries like nuclear power. Whether the environment is life-critical or just inconvenient like a research cluster going down, a common incident review failure mechanism is...

A List of Post-Mortems - Dan Luu

In research computing, when it comes to running systems we could be a lot closer to industry best practices than we are. We’ve talked about post-mortems more than once; here’s a list of postmortems from many companies collected by Luu. It’s nice to see that they don’t necessarily have to be long or complicated or intricate; like risk management, just simple documents for ongoing clarity can be a huge step forward.

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target

When AWS has an outage, it’s in the news and they publish public retrospectives (and here’s a great blog post of the retrospective of the Kinesis incident this week). Our downtimes and failures don’t make the news, but we owe at least that same level of transparency and communication to our researchers. The technical details will differ from case to case. But what’s also needed is an apology, and some...

The Zero-prep Postmortem: How to run your first incident postmortem with no preparation - Jonathan Hall

It’s never too late to start running postmortems on your systems when something goes wrong. It doesn’t have to be an advanced practice, or super complicated. Hall provides a script for your first couple. I’d suggest that once you have the basic approach down, you move away from “root causes” and “mitigations” and towards “lessons learned”. Those lessons learned can be about the postmortem process itself, too; you can...

Counterfactuals are not Causality - Michael Nygard

When you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’t happen can’t necessarily be the cause of something. To steal an example from the post, “The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, such as “we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that...

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID

Increasingly, private-sector systems provide their users detailed explanations of the reasons for unexpected outages, what was done to get things back up, and what they’re changing to prevent it from happening again. As part of incident response, we should be routinely writing up similar reports for internal use, so that we can learn from what happened. With that done, it makes no sense to then keep our users in the dark! Most users won’t care...

Incident Review and Postmortem Best Practices - Gergely Orosz

If your team is thinking of starting incident reviews & postmortems - which I recommend if relevant to your work - this is a good place to start. Orosz reports on a survey and discussions with 60+ teams doing incident responses, and finds that most have a pretty common pattern:
An outage is detected
An outage is declared
The incident is being mitigated
The incident has been mitigated
Decompression period (often comparatively short)
Incident analysis /...

Five-P factors for root cause analysis - Lydia Leong

Rather than “root cause analysis” or “five whys”, both of which have long since fallen out of favour in areas that take incident analysis seriously like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine:
Presenting problem
Precipitating factors - what combination of things triggered the incident?
Perpetuating factors - what things kept the incident going, made it worse, or harder to handle?
Predisposing factors - what long-standing things...

OOPS writeups - Lorin Hochstein

Hochstein gives the outline of, and an explanation for, how his team at Netflix writes up “OOPS” reports, essentially incidents that didn’t rise to the level of Incident Response, as a way of learning and sharing knowledge about things that can go wrong in their systems. It’s a nice article and provides a lightweight model to potentially use. His outline, blasted verbatim from the article, is below. I particularly like the sections on Contributors/Enablers and Mitigators as things that didn’t...

How to learn after an incident

Howie: The Post-Incident Guide - Jeli
How to Write Meaningful Retrospectives - Emily Arnott, Blameless

The key to getting better, individually or as a team, is to pay attention to how things go, and continue doing the things that lead to good results, while changing things that lead to bad results. Pretty simple, right? And yet we really don’t like to do this. Whether your teams run systems, develop software, curate data resources, or combinations of the three, sometimes things are going to go really...

Before and after an incident

Incident management best practices: before the incident - Robert Ross
Incident Analysis 101: Techniques for Sharing Incident Findings - Vanessa Huerta Granda

You’ll know, gentle reader, that I’m a big proponent of learning from incidents, and sharing them with researchers, who after all deserve to know why they couldn’t do their work for some period of time. Here’s a pair of good articles about preparing for an incident, and putting together and sharing the incident report afterwards. In the first article, Ross talks about clarifying...
