Alerting on SLOs - Mads Hartmann

This resource first appeared in issue #44 on 02 Oct 2020 and has tags Technical Leadership: Systems: Other

Another recurring theme in this newsletter is that while research software development takes a lot of guff, research software development in research is often much closer to industry best practices than research computing systems management. While there’s a lot of research software out there with version control, continuous integration testing, documentation, and disciplined release management, it’s much rarer to find research computing systems with crisply defined service level objectives (SLOs). And without SLOs it’s not possible to answer even the most basic questions - “How are we doing?”

SLOs could be measures of node uptime, but they could just as easily be job time in queue, filesystem bandwidth, data availability - anything that matters to the researcher. This is a nice article on the topic, with lots of discussion and resources on service level indicators (SLIs), SLOs, SLO windows, error budgets, and burn rates, and how to alert on SLOs (not by “the SLO has been exceeded” - you alert when the burn rate is high enough that you’d likely exceed the SLO in the SLO window). If you’re thinking about implementing SLOs - and maybe especially if you haven’t before - this is a good read.