No One Should Be Sacked for Innovation: Blameless Postmortems
Learning from how companies like Cloudflare and Google handle their mistakes. A small dive into blameless postmortems.
In the last few days, both Facebook and Cloudflare suffered major outages. The impact was felt globally by millions of users who could not access websites or view photos on their favorite social media platform. While I cannot speak to how Facebook handled its employee, it was comforting to see that the Cloudflare employee was spared through the grace of the blameless postmortem, per the following Twitter thread.
Cloudflare attributes the failure to a broken process rather than to the software written by an unnamed employee. This perspective allows the company to continue to innovate quickly. From here they can improve their process, and as a result employees do not have to worry that their code will break the system, because tests will be in place to ensure it does not introduce breaking changes.
Blameless Postmortems
Cloudflare's perspective on handling these matters is not new; it is embraced by some of the largest tech companies, such as Google and Etsy. Blameless postmortems are key to Google's Site Reliability Engineering, as they are a chance to reflect on what went wrong and to build a process that can catch similar events automatically in the future. Google has also published its full stance on site reliability engineering to help the tech industry with the steps needed to maintain great site reliability.
While the name "blameless postmortem" may seem to imply that the person responsible for a breaking change gets off the hook, according to John Allspaw that is not really the case. Allspaw, a former Etsy CTO, recognizes that when evaluating an event in a postmortem, there is no better witness to the timeline than the person who initiated the change. In Etsy’s process, the contributing engineer gives a detailed account of the what, when, where, and how of an event. Etsy’s full stance can be found in the following article.
Site reliability requires attention to detail, and when things fail, companies must evaluate the cause and automate checks to ensure that the failure does not happen again. Blameless postmortems help companies do this in a constructive way that does not inhibit employees' innovation.