Learning from SRE Teams About Identifying and Reducing Repetitive Work

This resource first appeared in issue #5 on 07 Feb 2020 and has tags Technical Leadership: Systems: Other, Technical Leadership: Runbooks, Technical Leadership: Automation, Managing A Team: Documentation/Writing

Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit

“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they have better awareness of how much stuff that gets done is repetitive across teams, and they put the resources in to ensuring those things get done smoothly.

Technical teams tend to hate documenting, but identifying these repetitive “toil” tasks and documenting what’s involved in getting them done in a “runbook” has three really important benefits:

  • It makes sure the tasks are done reliably in reproducible ways
  • It makes it much easier to onboard new people to the team or transfer responsibility for these tasks to someone else
  • It is the first step in automating the task to make them easier to do, or ideally to have them done completely automatically.

Are there things that have to routinely get done in your team that aren’t documented clearly in (kept-up-to-date) runbooks? Would things be easier if there were fewer such tasks?

And related to this; here’s a list of “flight rules” (e.g., very short runbooks) for common nontrivial tasks in git.

<<<<<<< HEAD
======= >>>>>>> c1d069a... First pass at category pages