When a postmortem is not enough
When the Challenger space shuttle accident happened, Ronald Reagan formed a commission to investigate it. A spoiler: an O-ring in a solid rocket booster lost its resilience in the cold weather, hot gases leaked past the seal, and the shuttle exploded.
When Richard Feynman was invited to join the commission, he wanted to say no. One of his reasons was that the work would not stop at finding the direct mechanical cause of the accident: he would have to find out what was wrong with NASA as an organization, whether the whole shuttle program should continue or be replaced by rockets, and then what the future goals in space should be. Yet people close to him convinced him to join, and fortunately he did.
While everyone went looking for the mechanical cause of the explosion, and although Feynman played an important role in publicly demonstrating the O-ring issue and putting focus on it, he went the extra mile (or miles) of questioning the organization that produced this disaster. He knew from the first moment that an incident at this scale results from a questionable managerial and technical culture; it does not stop at mechanics.
For example, Feynman noticed that NASA's managers claimed the chance of losing a shuttle was 1 in 100,000, which would mean NASA could launch a shuttle every day for about 274 years and expect only a single accident. He described this as fantasy, but he did not stop there. He asked engineers to individually estimate the probability of failure and compared their results, roughly 1 in 100 in the best estimate, to what management claimed, exposing a clear communication gap between management and engineers.

Then he followed the claim to its consequences. If the shuttle's failure probability really is 1 in 100,000, the whole idea of measuring failure rates through repeated launches or repeated component tests makes no sense, because no one can test a component a hundred thousand times to observe a single failure. So managers had to shift to other ways of proving that things were fine. The O-rings had eroded in many previous launches, and since the erosion had never caused an accident, managers deduced that it must be safe. They measured the erosion, found it reached a third of the O-ring, and concluded they had a 3x safety margin, although they had no guarantee that erosion on the next launch would stop at that ratio, and they did not know what caused the erosion in the first place. There were no proper quantitative measures of the rate of failure, its limits, or what could be deemed safe. The O-ring was not behaving as designed, no one knew why, and yet, since the statistical measures had been abandoned and no accident had happened in previous launches, it must be safe. Feynman compared this to playing Russian roulette: you pull the trigger, nothing happens, and you conclude it must be safe to pull the trigger again.
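To make the gap concrete, here is a quick sketch of the arithmetic behind the two estimates (using only the numbers quoted above, not NASA's actual risk model):

```python
# Management's claimed per-launch failure probability vs. the engineers'
# rough best estimate, as quoted in the commission's report.
management_estimate = 1 / 100_000
engineer_estimate = 1 / 100

# At 1/100,000, launching daily would average one failure every ~274 years.
years_per_failure = 100_000 / 365
print(f"{years_per_failure:.0f} years of daily launches per expected failure")

# The two estimates disagree by three orders of magnitude.
ratio = engineer_estimate / management_estimate
print(f"estimates differ by a factor of {ratio:.0f}")
```

A disagreement of a factor of a thousand between management and its own engineers is itself a finding, independent of which number is right.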
Feynman even inspected things that did not explode, like the shuttle's main engines and the software used to run the shuttle. We are software engineers; we know software is the least reliable thing in the world, yet ironically the software and computers were the watertight part. Feynman showed that the engine's components had failed multiple times in the past, and that they suffered from the same pattern of management behavior towards failures. He even examined the whole design process, comparing it to how rocket and jet engines are designed in the aviation industry and the air force: where engineers usually design small components, test them, then assemble them into the full engine, NASA designed, assembled, and only then tested. Of course, when failures happened it was difficult to find which component was responsible, and changing a component's design was difficult or impossible.
I cannot give Feynman's appendix to the commission's report enough credit, and trying to show just parts of it here feels like a crime. It should be read in full; it is a piece of art. But I do not have the space.
Postmortems are part of site reliability engineering. You have an incident, you solve it, you write a postmortem, you fix the process or design issues that led to the incident, and you share it with the rest of the organization as a learning experience. While there is broad agreement on what should trigger a postmortem, such as consuming a defined percentage of your error budget, and on unifying that threshold across the company (see Google's first Site Reliability Engineering book), little attention is given to scaling the scope of the postmortem to the scale of the incident. In my experience, postmortems usually focus on technical or process errors and rectify them within a team, a group of teams, or a single practice across the organization, but they rarely extend to questioning the whole organization and the engineering culture that runs it.
When an incident results in a disaster (what counts as a disaster varies with the size of your organization, the nature of the service you run, and so on), the scale and scope of the postmortem should be proportional to the incident, questioning more than just the immediate operational or technical practice that led to it. Ownership, corrective actions, and follow-up on those actions should come from high up the organizational ladder and should reflect deep changes.
The same applies when you have more than a single incident: none of them alone is a disaster, but their collective rate and size qualify them as one.
Do not hide behind SLAs
SLAs can be a blanket that hides the need for organizational questioning. Say you have a web service with an SLA of three nines on request success rate, and assume requests are evenly distributed over time. Whether your service fails for an hour, a full day, or a couple of days, you break your SLA in every case, and in every case you can simply report it as "breaking the SLA".
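To see how much the single label hides, here is a quick sketch (my own illustrative numbers, assuming a 30-day window and evenly distributed traffic, so outage minutes map directly to failed requests) of how differently those outages consume a three-nines error budget:

```python
# Three-nines success-rate SLA over a 30-day window.
slo = 0.999
window_min = 30 * 24 * 60            # 43,200 minutes in the window
budget_min = window_min * (1 - slo)  # 43.2 minutes of allowed failure

# Three outages that all "break the SLA", in multiples of the whole budget.
outages = [("1 hour", 60), ("1 day", 24 * 60), ("2 days", 2 * 24 * 60)]
for label, minutes in outages:
    consumed = minutes / budget_min
    print(f"{label:>7}: {consumed:.1f}x the monthly error budget")
```

A one-hour outage barely exceeds the budget, while a two-day outage burns it dozens of times over, yet all three report identically as "SLA broken".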
SLAs must be simple, partly so they can be communicated easily to customers, yet this simplicity should not be used to treat all incidents equally and hide the need for organizational change. You should not write the same level of postmortem in all cases: the major incidents qualify for larger postmortems, owned by people higher up the management chain, with corrective actions on a scope that exceeds the smaller ones.