It’s Time to Rethink Outage Reports - Geoff Huston, CircleID

This resource first appeared in issue #86 on 06 Aug 2021 and has tags Technical Leadership: Systems: Incident Handling, Technical Leadership: Systems: Other

Increasingly, private-sector systems provide their users detailed explanations of the reasons for unexpected outages, what was done to get things back up, and what they’re changing to prevent it from happening again.

As part of incident response, we should be routinely writing up similar reports for internal use, so that we can learn from what happened. With that done, it makes no sense to then keep our users in the dark! Most users won’t care about the details, but they will care that it matters to you to keep them informed and that you are visibly trying to improve the service you provide them in the future.

I guess we’ve become used to reading evasive and vague outage reports that talk about “operational anomalies” causing “service incidents” that are “being rectified by our stalwart team of engineers as we speak.” When we see a report detailing the issues and the remedial measures, it sticks out as a welcome deviation from the mean. It’s as if any admission of the details of a fault in the service exposes the provider to some form of ill-defined liability or reputation damage. To minimize this exposure, the reports of faults, root causes, and mediation actions are phrased in vague and meaningless generalities.