Manageable On-Call for Companies without Money Printers

This resource first appeared in issue #73 on 07 May 2021 and has tags Technical Leadership: Systems: Other

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale

A lot of information out there about running on-call, or more advanced practices like SRE, assume that you’re a large organization with 24/7 uptime targets. These can apply to research computing, but more often don’t.

Teams sometimes respond to the inability to have 5-nines uptime support and 24/7 oncall with a shrug and just keep things vague; “will respond promptly during working hours, with best-effort responses outside of those times”. But that lack of clarity isn’t great for either staff, who don’t know what’s expected of them, or researchers, who don’t know what they can and can’t count on. It’s also a missed opportunity to lay out priorities and define successes. How can you tell if you’re doing well if there’s no goal to reach?

Shaw suggests laying out managable service level objectives (SLOs - internal measures) and slightly less stringent commitments (SLAs - agreements, external measures), and designing on-call processes and policies about that, iterating on the process as needed. For some kinds of research computing, especially availability of compute, SLAs can be quite low and still keep the researchers happy; data availability is likely somewhat higher. Choosing some target services, a manageable SLA, and fixing a manageable on-call process consistent with that will help researchers and staff know what to expect and make priorities explicit.