
Category: Technical Leadership: Systems

Parent categories: Technical Leadership

Subcategories: Incident Handling, Other

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott DevOps and SRE are two sides of a similar coin - bridging the gap between systems and developer teams to do better work faster. DevOps topics usually involve speeding release cycles, and SRE topics usually focus on improving automation, resiliency, and handling incidents, but there’s a significant degree of overlap. Even if you aren’t explicitly building an SRE or DevOps team, you can start hiring for these skills and approaches in...

Continue...

Learning from SRE Teams About Identifying and Reducing Repetitive Work

Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit “Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they...

Continue...

Code-wise, cloud-foolish: avoiding bad technology choices - Forrest Brazeal

Code-wise, cloud-foolish: avoiding bad technology choices - Forrest Brazeal This article is from the start of the year, but it’s been circulating around, and it is good advice in a short article. Everywhere in computing - but maybe worse in research computing - there’s a tendency towards NIH Syndrome, “Not Invented Here”: a tendency to roll our own rather than lean on existing tooling, which is frankly madness, since our specialty is using computing technology to solve research problems, not to invent computing...

Continue...

5 Best Practices on Nailing Postmortems - Hannah Culver, Blameless

5 Best Practices on Nailing Postmortems - Hannah Culver, Blameless We’ve started doing incident reports - sort of baby postmortems - in our project, which has been an extremely useful practice in growing more discipline about how issues with availability or security are reported, distributed, and dealt with. It also gives us a library of materials that we can look through to identify any patterns that appear. This article talks about some best practices for running postmortem processes: use visuals; be a historian -...

Continue...

How to create an incident response playbook - Blake Thorne, Atlassian

How to create an incident response playbook - Blake Thorne, Atlassian This is a really good starting point for putting together an incident response playbook, and includes links to Atlassian’s own playbooks and a workshop on incident communication. This is something we’re working on in our own team. We’re not there yet, but we’re getting there. On the other hand, colleagues-of-colleagues of mine were involved in a major incident recently in an organization where there were lots of security policies in place about keys and...

Continue...

The Runbooks Project - Ian Miell

The Runbooks Project - Ian Mieli In an effort to help get people started with runbooks for operations, Ian Miele of Container Soltuions has started an opensource set of runbooks, the Open Runbooks Project, starting with their own.  Worth checking out as a set of templates, and keeping an eye on as more get added.

Continue...

Good, Less-painful, Postmortems

Improving Postmortems from Chores to Masterclass with Paul Osman - Blameless Theory vs. Practice: Learnings from a recent Hadoop incident - Sandhya Ramu and Vasanth Rajamani, LinkedIn Stuff happens, and when it does happen it’s a lot of work and stressful. We should at least take the opportunity to make the most of these “unplanned investments”, learn from them, and make use of those lessons to prevent related stuff from happening in the future. The talk and transcript by Paul Osman is a good one...

Continue...

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw Incident handling is an area where research computing falls well behind best practices in technology or IT, partly because the implicitly lower SLAs haven’t pushed us to have the discipline around incidents that other sectors have had. And that’s a shame. There’s nothing wrong with having lower (say) uptime requirements if that’s the tradeoff appropriate for researcher use cases, but that doesn’t mean having no incident response protocol, no playbooks, no...

Continue...

SRE Classroom: exercises for non-abstract large systems design - Google Cloud

SRE Classroom: exercises for non-abstract large systems design - Google Cloud Google, which is notoriously close-lipped about technology development in the company, is getting more and more open with their training materials. This is terrific, because Google takes training materials very seriously, and they’re quite good. In Google’s systems reliability practice, they emphasize large systems design and “back of the envelope” estimation approaches which will seem quite familiar to those of us who were trained in the physical sciences. They teach this approach with quite...

Continue...
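
For a flavour of what “back of the envelope” estimation looks like in practice, here is a hypothetical example in Python; every number in it is invented for illustration and none of it comes from Google’s materials - the point is just how far a handful of assumed rates and sizes can take a capacity conversation.

# A hypothetical back-of-the-envelope capacity estimate (all numbers
# below are invented for illustration, not taken from the SRE Classroom
# materials): a made-up photo-upload service.

USERS              = 1_000_000     # assumed active users
UPLOADS_PER_DAY    = 2             # assumed uploads per user per day
BYTES_PER_UPLOAD   = 2 * 1024**2   # assume roughly 2 MiB per object
SECONDS_PER_DAY    = 86_400
PEAK_TO_MEAN_RATIO = 3             # assume peak traffic is about 3x the daily average

mean_qps  = USERS * UPLOADS_PER_DAY / SECONDS_PER_DAY
peak_qps  = mean_qps * PEAK_TO_MEAN_RATIO
bytes_day = USERS * UPLOADS_PER_DAY * BYTES_PER_UPLOAD

print(f"mean write rate : {mean_qps:7.1f} requests/s")     # ~23 requests/s
print(f"peak write rate : {peak_qps:7.1f} requests/s")     # ~69 requests/s
print(f"new data per day: {bytes_day / 1024**4:5.2f} TiB")  # ~3.8 TiB/day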

Alerting on SLOs - Mads Hartmann

Alerting on SLOs - Mads Hartmann Another recurring theme in this newsletter is that while research software development takes a lot of guff, it is often much closer to industry best practices than research computing systems management is. While there’s a lot of research software out there with version control, continuous integration testing, documentation, and disciplined release management, it’s much rarer to find research computing systems with crisply defined service level objectives (SLOs). And without SLOs it’s not possible to answer even...

Continue...
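
To make the SLO and error-budget vocabulary concrete, here is a minimal sketch in Python of the kind of bookkeeping involved; the 99.9% target, the request counts, and the interpretation of the numbers below are illustrative assumptions of mine, not anything taken from Hartmann’s post.

# Minimal sketch of SLO error-budget accounting (illustrative assumptions only).

SLO_TARGET = 0.999   # assumed target: 99.9% of requests succeed over the SLO window

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent for this much traffic (negative means blown)."""
    allowed_failures = (1.0 - SLO_TARGET) * total_requests
    if allowed_failures == 0:
        return 1.0
    return 1.0 - failed_requests / allowed_failures

def burn_rate(total_requests: int, failed_requests: int) -> float:
    """How fast the budget is being spent; 1.0 means this pace would use
    exactly the whole budget if sustained over the full SLO window."""
    if total_requests == 0:
        return 0.0
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1.0 - SLO_TARGET)

# Hypothetical numbers: 120,000 requests in the last hour, 360 of them failed.
print(burn_rate(120_000, 360))               # 3.0 -> spending budget 3x too fast; worth alerting on
print(error_budget_remaining(120_000, 360))  # -2.0 -> 3x the allowed failures for this much traffic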

Learning from Postmortems in Hazardous Contexts

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - Thai Wood, Resilience Roundup Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - John S. Carroll, Industrial & Environmental Crisis Quarterly (1995) This is a recent blog post about a less recent paper, reviewing how incident reviews work in high-hazard industries like nuclear power. Whether the environment is life-critical or just inconvenient like a research cluster going down, a common incident review failure mechanism is...

Continue...

A List of Post-Mortems - Dan Luu

A List of Post-Mortems - Dan Luu In research computing, when it comes to running systems we could be a lot closer to industry best practices than we are. We’ve talked about post-mortems more than once; here’s a list of postmortems from many companies collected by Luu. It’s nice to see that they don’t necessarily have to be long or complicated or intricate; as with risk management, simple documents written for ongoing clarity can be a huge step forward.

Continue...

From Sysadmin to SRE - Josh Duffney, Octopus Deploy

From Sysadmin to SRE - Josh Duffney, Octopus Deploy As research computing becomes more complex, our systems teams are going to have more and more demands on them, moving them from sysadmin roles to systems reliability responsibilities, and to working more closely with software development teams. It’s an easier transition for sysadmins in research computing than in most fields, as our teams generally have pretty deep experience on the software side of research computing too. Duffney’s article lays out how to start thinking about these changes to...

Continue...

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target When AWS has an outage, it’s in the news and they publish public retrospectives (and here’s a great blog post of the retrospective of the Kinesis incident this week). Our downtimes and failures don’t make the news, but we owe at least that same level of transparency and communication to our researchers. The technical details will differ from case to case. But what’s also needed is an apology, and some...

Continue...

The Case for ‘Center Class’ HPC: Think Tank Calls for $10B Fed Funding over Five Years

The Case for ‘Center Class’ HPC: Think Tank Calls for $10B Fed Funding over Five Years For those who haven’t seen the Center for Data Innovation’s report advocating tripling NSF’s funding for university HPC centres, the report and the arguments therein may be useful for your own internal advocacy efforts.

Continue...

Open Source Update: School of Site Reliability Engineering (SRE) - LinkedIn Engineering

Open Source Update: School of Site Reliability Engineering (SRE) - LinkedIn Engineering LinkedIn has updated its School of SRE materials for new hires or those looking to move into SRE. Even if your systems team isn’t thinking about moving to FAANG-style SRE operations, the material covers a nice range of dev-ops style development, deployment, design, monitoring, and securing of web applications.

Continue...

The Zero-prep Postmortem: How to run your first incident postmortem with no preparation - Jonathan Hall

The Zero-prep Postmortem: How to run your first incident postmortem with no preparation - Jonathan Hall It’s never too late to start running postmortems on your systems when something goes wrong. It doesn’t have to be an advanced practice, or super complicated. Hall provides a script for your first couple. I’d suggest that once you have the basic approach down, move away from “root causes” and “mitigations” and more towards “lessons learned”. Those lessons learned can be about the postmortem process itself, too; you can...

Continue...

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale A lot of information out there about running on-call, or more advanced practices like SRE, assumes that you’re a large organization with 24/7 uptime targets. These can apply to research computing, but more often don’t. Teams sometimes respond to the inability to have 5-nines uptime support and 24/7 on-call with a shrug and just keep things vague: “will respond promptly during working hours, with best-effort responses outside of those times”. But that...

Continue...

Counterfactuals are not Causality - Michael Nygard

Counterfactuals are not Causality - Michael Nygard When you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’t happen can’t be the cause of something that did. To steal an example from the post, “The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, such as “we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that...

Continue...

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID Increasingly, private-sector systems provide their users detailed explanations of the reasons for unexpected outages, what was done to get things back up, and what they’re changing to prevent it from happening again. As part of incident response, we should be routinely writing up similar reports for internal use, so that we can learn from what happened. With that done, it makes no sense to then keep our users in the dark! Most users won’t care...

Continue...

Incident Review and Postmortem Best Practices - Gergely Orosz

Incident Review and Postmortem Best Practices - Gergely Orosz If your team is thinking of starting incident reviews & postmortems - which I recommend if relevant to your work - this is a good place to start. Orosz reports on a survey and discussions with 60+ teams doing incident responses, and finds that most have a pretty common pattern: an outage is detected; an outage is declared; the incident is being mitigated; the incident has been mitigated; a decompression period (often comparatively short); incident analysis /...

Continue...

Five-P factors for root cause analysis - Lydia Leong

Five-P factors for root cause analysis - Lydia Leong Rather than “root cause analysis” or “five whys”, both of which have long since fallen out of favour in areas that take incident analysis seriously, like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine: the Presenting problem; Precipitating factors (what combination of things triggered the incident?); Perpetuating factors (what things kept the incident going, made it worse, or harder to handle?); Predisposing factors - what long-standing things...

Continue...

DevOps in academic research - by Matthew Segal

DevOps in academic research - by Matthew Segal Here Segal, who worked for 18 months as a “Research DevOps Specialist”, talks about his work on a 20kloc MCMC Python modelling package for infectious disease models, in a development and systems environment that wasn’t prepared for the sudden urgency and rapid release cycles that were needed when COVID broke out. There were no tests, making development slow. A lot of manual toil was involved in calibrating updated models, which was fine when they were for...

Continue...

OOPS writeups - Lorin Hochstein

OOPS writeups - Lorin Hochstein Hochstein gives the outline of, and an explanation for, how his team at Netflix writes up “OOPS” reports - essentially incidents that didn’t rise to the level of Incident Response - as a way of learning and sharing knowledge about things that can go wrong in their systems. It’s a nice article and provides a light-weight model to potentially use. His outline, blasted verbatim from the article, is below. I particularly like the sections on Contributors/Enablers and Mitigators as things that didn’t...

Continue...

How to learn after an incident

Howie: The Post-Incident Guide - Jeli How to Write Meaningful Retrospectives - Emily Arnott, Blameless The key to getting better, individually or as a team, is to pay attention to how things go, and continue doing the things that lead to good results, while changing things that lead to bad results. Pretty simple, right? And yet we really don’t like to do this. Whether your teams run systems, develop software, curate data resources, or combinations of the three, sometimes things are going to go really...

Continue...

Understanding wait time versus utilization - from reading Phoenix Project - Zhiqiang Qiao

Understanding wait time versus utilization - from reading Phoenix Project - Zhiqiang Qiao Every so often I see technologists rediscover a very widely known result in operations research - introductory textbook stuff, really. Wait times (or other bad behaviour) start rocketing upwards once we get to high (somewhere between 80% and 90%) utilization. You see this in equipment, and in teams, of course, too. Teams, whether they’re staffing cash registers or writing software, start getting into trouble at sustained high “utilization rates”, i.e. overwork. And yet, a...

Continue...
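
As a rough illustration of that textbook result, here is a small Python sketch using the standard M/M/1 queueing formula - my choice of model for illustration, since the article itself doesn’t commit to one - showing how the expected wait stays modest up to about 80% utilization and then explodes.

# Mean time a job spends waiting in queue, in units of the mean service time,
# for an M/M/1 queue: rho / (1 - rho), where rho is utilization.

def mm1_wait_factor(utilization: float) -> float:
    """Expected queueing delay as a multiple of the mean service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    # 50% -> 1x, 80% -> 4x, 90% -> 9x, 99% -> 99x the service time
    print(f"utilization {rho:4.0%}: wait ~ {mm1_wait_factor(rho):5.1f}x the service time")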

Building an SRE Career Progression Framework - Ethan Motion

Building an SRE Career Progression Framework - Ethan Motion Whether it’s for research software, systems, data management, or data science, a lot of groups are trying to figure out formal or informal career progression pathways for individual contributors. As a manager, you can work with individuals in their one-on-ones to find out where they are both interested and ready to grow, and give them opportunities at that intersection. But how do you start thinking about career progression at the whole-team or multi-team level? Motion describes...

Continue...

Before and after an incident

Incident management best practices: before the incident - Robert Ross Incident Analysis 101: Techniques for Sharing Incident Findings - Vanessa Huerta Granda You’ll know, gentle reader, that I’m a big proponent of learning from incidents, and sharing them with researchers who after all deserve to know why they couldn’t do their work for some period of time. Here’s a pair of good articles about preparing for an incident, and putting together and sharing the incident report afterwards. In the first article, Ross talks about clarifying...

Continue...

Making operational work more visible - Lorin Hochstein

Making operational work more visible - Lorin Hochstein In the f-string failure article in software development, I pointed out that log and error handling code was under-reviewed and under-tested. There’s probably a bigger lesson one can take from that on the undervaluing of supporting, glue, or infrastructure work compared to “core” work. And sure enough, one of the huge downsides of operations work is that when everything goes well, it’s invisible. Above, Granda walks us through writing up an incident report and sharing it...

Continue...