This resource first appeared in issue #5 on 07 Feb 2020 and has tags Technical Leadership: Systems: Other, Technical Leadership: Runbooks, Technical Leadership: Automation, Managing A Team: Documentation/Writing
Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit
“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they have better awareness of how much stuff that gets done is repetitive across teams, and they put the resources in to ensuring those things get done smoothly.
Technical teams tend to hate documenting, but identifying these repetitive “toil” tasks and documenting what’s involved in getting them done in a “runbook” has three really important benefits:
Are there things that have to routinely get done in your team that aren’t documented clearly in (kept-up-to-date) runbooks? Would things be easier if there were fewer such tasks?
And related to this; here’s a list of “flight rules” (e.g., very short runbooks) for common nontrivial tasks in git.