jonathan@researchcomputingteams.org

Category: Technical Leadership: Systems: Other

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott

Other tags: | Hiring: Other |

DevOps and SRE are two sides of a similar coin - bridging the gap between systems and developer teams to do better work faster. DevOps topics usually involve speeding release cycles, and SRE topics usually focus on improving automation, resiliency, and handling incidents, but there’s a significant degree of overlap. Even if you aren’t explicitly building an SRE or DevOps team, you can start hiring for these skills and approaches in...

Learning from SRE Teams About Identifying and Reducing Repetitive Work

Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit

“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they...

Code-wise, cloud-foolish: avoiding bad technology choices - Forrest Brazeal

Other tags: | Technical Leadership: Cloud |

This article is from the start of the year, but it’s been circulating, and it is good advice in a short article. Everywhere in computing, but maybe worse in research computing, there is a tendency towards NIH Syndrome - “Not Invented Here”. There’s a tendency to roll our own rather than lean on existing tooling, which is frankly madness, since our specialty is using computing technology to solve research problems, not to invent computing...

The Runbooks Project - Ian Miell

In an effort to help get people started with runbooks for operations, Ian Miell of Container Solutions has started an open source set of runbooks, the Open Runbooks Project, starting with their own. Worth checking out as a set of templates, and keeping an eye on as more get added.

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw

Incident handling is an area where research computing falls well behind best practices in technology or IT, partly because the implicitly lower SLAs haven’t pushed us to have the discipline around incidents that other sectors have had. And that’s a shame. There’s nothing wrong with having lower (say) uptime requirements if that’s the tradeoff appropriate for researcher use cases, but that doesn’t mean having no incident response protocol, no playbooks, no...

SRE Classroom: exercises for non-abstract large systems design - Google Cloud

Google, which is notoriously close-lipped about technology development in the company, is getting more and more open with their training materials. This is terrific, because Google takes training materials very seriously, and they’re quite good. In Google’s systems reliability practice, they emphasize large systems design and “back of the envelope” estimation approaches which will seem quite familiar to those of us who were trained in the physical sciences. They teach this approach with quite...

Alerting on SLOs - Mads Hartmann

Another recurring theme in this newsletter is that while research software development takes a lot of guff, it is often much closer to industry best practices than research computing systems management is. While there’s a lot of research software out there with version control, continuous integration testing, documentation, and disciplined release management, it’s much rarer to find research computing systems with crisply defined service level objectives (SLOs). And without SLOs it’s not possible to answer even...

From Sysadmin to SRE - Josh Duffney, Octopus Deploy

As research computing becomes more complex, our systems teams are going to have more and more demands on them, moving them from sysadmins to systems reliability responsibilities, and working more closely with software development teams. It’s an easier transition for sysadmins in research computing than in most fields, as our teams generally have pretty deep experience on the software side of research computing too. Duffney’s article lays out how to start thinking about these changes to...

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target

When AWS has an outage, it’s in the news and they publish public retrospectives (and here’s a great blog post of the retrospective of the Kinesis incident this week). Our downtimes and failures don’t make the news, but we owe at least that same level of transparency and communication to our researchers. The technical details will differ from case to case. But what’s also needed is an apology, and some...

The Case for ‘Center Class’ HPC: Think Tank Calls for $10B Fed Funding over Five Years

For those who haven’t seen the Center for Data Innovation’s report advocating tripling NSF’s funding for university HPC centres, the report and the arguments therein may be useful for your own internal advocacy efforts.

Open Source Update: School of Software Reliability Engineering (SRE) - LinkedIn Engineering

LinkedIn has updated its School of SRE materials for new hires or those looking to move into SRE. Even if your systems team isn’t thinking about moving to FAANG-style SRE operations, the material covers a nice range of the basics: DevOps-style development, deployment, design, monitoring, and securing of web applications.

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale

A lot of the information out there about running on-call, or more advanced practices like SRE, assumes that you’re a large organization with 24/7 uptime targets. Those assumptions can apply to research computing, but more often don’t. Teams sometimes respond to the inability to have five-nines uptime support and 24/7 on-call with a shrug and just keep things vague: “will respond promptly during working hours, with best-effort responses outside of those times”. But that...
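
To put numbers on targets like the five-nines figure mentioned above, here is a minimal sketch that converts an availability target into the downtime it permits over a period. The function name and the example targets are illustrative assumptions, not taken from Shah's article.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Downtime (in minutes) permitted by an availability target over a period.

    `availability` is a fraction such as 0.999; the name and defaults are
    hypothetical, chosen only for this illustration.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    # Illustrative targets only: from a modest 99% up to "five nines".
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} availability allows about {minutes:,.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes a year, which is why it generally implies round-the-clock staffing; a 99% target leaves several days, which a working-hours team can plausibly commit to in writing.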

Counterfactuals are not Causality - Michael Nygard

When you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’t happen can’t, necessarily, be the cause of something. To steal an example from the post, “The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, such as “we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that...

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID

Increasingly, private-sector systems provide their users detailed explanations of the reasons for unexpected outages, what was done to get things back up, and what they’re changing to prevent it from happening again. As part of incident response, we should be routinely writing up similar reports for internal use, so that we can learn from what happened. With that done, it makes no sense to then keep our users in the dark! Most users won’t care...

Incident Review and Postmortem Best Practices - Gergely Orosz

If your team is thinking of starting incident reviews & postmortems - which I recommend if relevant to your work - this is a good place to start. Orosz reports on a survey and discussions with 60+ teams doing incident responses, and finds that most have a pretty common pattern:
An outage is detected
An outage is declared
The incident is being mitigated
The incident has been mitigated
Decompression period (often comparatively short)
Incident analysis /...

Five-P factors for root cause analysis - Lydia Leong

Rather than “root cause analysis” or “five whys”, both of which have long since fallen out of favour in areas that take incident analysis seriously like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine:
Presenting problem
Precipitating factors - what combination of things triggered the incident?
Perpetuating factors - what things kept the incident going, made it worse, or harder to handle?
Predisposing factors - what long-standing things...

DevOps in academic research - by Matthew Segal

Here Segal, who worked for 18 months as a “Research DevOps Specialist”, talks about his work in moving a 20kloc MCMC Python modelling package for infectious disease models, in a development and systems environment that wasn’t prepared for the sudden urgency and rapid release cycles that were needed when COVID broke out. There were no tests, making development slow. A lot of manual toil was involved in calibrating updated models, which was fine when they were for...

OOPS writeups - Lorin Hochstein

Hochstein gives the outline of, and an explanation for, how his team at Netflix writes up “OOPS” reports, essentially incidents that didn’t rise to the level of Incident Response, as a way of learning and sharing knowledge about things that can go wrong in their systems. It’s a nice article and provides a light-weight model to potentially use. His outline, blasted verbatim from the article, is below. I particularly like the sections on contributors/enablers and mitigators as things that didn’t...

Understanding wait time versus utilization - from reading Phoenix Project - Zhiqiang Qiao

Every so often I see technologists rediscover a very widely known result in operations research - introductory textbook stuff, really. Wait times (or other bad behaviour) start rocketing upwards once we get to high (somewhere between 80% and 90%) utilization. You see this in equipment and, of course, in teams too. Teams, whether they’re cashiers or software developers, start getting into trouble at sustained high “utilization rates”, i.e. overwork. And yet, a...
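
The utilization result described above can be sketched numerically. This is a minimal illustration that assumes the simplest textbook model (a single-server queue with random arrivals and service times), where the average wait, measured in units of the average service time, is the busy fraction divided by the idle fraction. The function name is hypothetical, and real teams and systems will deviate from this idealized model.

```python
def relative_wait(utilization: float) -> float:
    """Average wait in units of mean service time, under an assumed
    single-server (M/M/1-style) queue: busy fraction / idle fraction."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    # Waits grow slowly at first, then blow up as utilization nears 100%.
    for pct in (50, 80, 90, 95, 99):
        wait = relative_wait(pct / 100)
        print(f"{pct}% busy -> wait of about {wait:.1f}x the average service time")
```

Under this assumption, going from 80% to 90% utilization more than doubles the wait, and going from 90% to 95% roughly doubles it again, which is the "rocketing upwards" behaviour described in the text.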

Building an SRE Career Progression Framework - Ethan Motion

Whether it’s for research software, systems, data management, or data science, a lot of groups are trying to figure out formal or informal career progression pathways for individual contributors. As a manager, you can work with individuals in their one-on-ones to find out where they are interested in and ready to grow, and give them opportunities at that intersection. But how do you start thinking about career progression at the whole-team or multi-team level? Motion describes...

Making operational work more visible - Lorin Hochstein

In the f-string failure article in software development, I pointed out that logging and error-handling code was under-reviewed and under-tested. There’s probably a bigger lesson one can take from that on the undervaluing of supporting or glue or infrastructure work compared to “core” work. And sure enough, one of the huge downsides of operations work is that when everything goes well, it’s invisible. Above, Granda walks us through writing up an incident report and sharing it...
