Components of a good postmortem

Muhammad Soliman
7 min read · Aug 1, 2022


A postmortem is a written record of an incident. Writing one after an incident is how you ensure that the same incident will not happen again and that things end up better than they were before it happened. Ideally a postmortem is discussed, reviewed, and then published for engineers to learn from.

To be useful, a postmortem should answer six questions.

What was the impact?

The impact of an incident should be the deciding factor in whether to start a postmortem in the first place. Incidents happen frequently and it is impossible to do a postmortem for every single one. Engineering teams should have predefined criteria that specify whether a given incident warrants a postmortem.

Examples of impact that should trigger a postmortem:

  • Consuming a defined share of the error budget: An error budget is the other face of the SLA/SLO coin. For example, if your SLA/SLO is to serve 99.9% of requests successfully, then your error budget is 1 in 1,000 requests: that is the fraction of requests you are allowed to fail to serve. An incident that eats more than a defined share of the monthly or quarterly error budget (say 20% or 30%) should justify starting a postmortem (see the sketch after this list).
    Ideally, breaking SLOs or SLAs should have consequences, otherwise they would be meaningless. Breaking an SLA usually results in compensation for external users of your service; breaking an SLO should result in a company-internal consequence, such as stopping work on new features until the end of the month or quarter. This is why engineers should react before they consume their monthly or quarterly error budget.
  • Exceeding a certain severity threshold: Some companies classify the severity of each incident. For example, AWS support cases have the following severity levels: low (general guidance), normal (system impaired), high (production system impaired), urgent (production system down), and critical (business-critical system down).
    If an engineering team or a company has such a severity classification, then incidents exceeding a certain severity level should get a postmortem.
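To make the error-budget arithmetic concrete, here is a minimal sketch in Python; the request counts and the 20% threshold are made-up, illustrative numbers, not a prescription.

```python
# Minimal sketch: decide whether an incident consumed enough of the monthly
# error budget to justify a postmortem. All numbers are illustrative.

SLO = 0.999                       # 99.9% of requests served successfully
BUDGET_RATIO = 1 - SLO            # error budget: 0.1% of requests may fail

monthly_requests = 100_000_000    # total requests expected this month
failed_during_incident = 45_000   # requests that failed because of the incident

monthly_error_budget = monthly_requests * BUDGET_RATIO           # 100,000 requests
budget_consumed = failed_during_incident / monthly_error_budget  # 0.45 -> 45%

POSTMORTEM_THRESHOLD = 0.20       # e.g. a "more than 20% of the budget" rule

if budget_consumed > POSTMORTEM_THRESHOLD:
    print(f"Incident consumed {budget_consumed:.0%} of the error budget: write a postmortem")
```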

If possible, the stated impact of an incident should reflect the final user and business impact, and ideally it would be mapped to a financial impact. This can be difficult in many cases, for example when the service or component that had the incident sits deep in the service hierarchy and is consumed in different ways by components that have different business or user impact.

What happened?

This is usually the easiest part. It should explain what happened, ideally as a timeline that lays out the sequence of events from the start of the incident until it was resolved. It helps to attach the communication logs between the engineers who handled the incident (such as chat history) to capture the discussions, the series of solutions tried, and the result of each one.

How long did it take engineers to know about the incident and to react?

This is a measure of how good the alerting and the operational response to alerts are. If engineers were notified of the incident quickly, then the detection criteria were correct and alerting was well set up. If they reacted quickly, then the operational procedures around alerting are working well (on-call rules are followed, escalation happens according to the defined escalation rules, and so on).
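As a rough illustration of what this question measures, here is a small sketch deriving the two durations from a postmortem timeline; the timestamps are invented for the example.

```python
from datetime import datetime

# Minimal sketch: compute time-to-detect and time-to-react from a postmortem
# timeline. Timestamps are made up for illustration.
incident_start = datetime.fromisoformat("2022-08-01T10:05:00")
alert_fired    = datetime.fromisoformat("2022-08-01T10:27:00")
first_response = datetime.fromisoformat("2022-08-01T10:31:00")

time_to_detect = alert_fired - incident_start   # how good is the alerting?
time_to_react  = first_response - alert_fired   # how good is the on-call response?

print(f"Time to detect: {time_to_detect}, time to react: {time_to_react}")
```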

In my system design interview at Amazon, after I finished designing some large-scale system, the interviewer asked me what I would monitor. I started listing the various metrics that should be collected for such a system, but he stopped me and asked me to pick a single metric, the most important one. More than five years later, and long after leaving Amazon, I still use this same question in the system design interviews where I am the interviewer.

Ideally you should have three or four metrics that define whether a given system is working well, and those are the ones you should build your alerting on. These metrics should be directly related to the business or user impact of your services. If I open a website, I don’t care that the CPU usage of the servers running it is high, but I do care if the page takes a long time to load. The Google SRE book defines the four golden signals of monitoring (latency, traffic, errors, and saturation), and while they are not always the most important measures of every system’s health, since this varies from one system to another, they are another way of saying that for any system a few metrics matter more than the rest, and ideally these are the ones you alert on.
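A minimal sketch of what alerting on a handful of user-facing metrics might look like; the metric names, values, and thresholds here are hypothetical and not taken from any particular monitoring system.

```python
# Minimal sketch: alert on a few user-facing metrics rather than on low-level
# machine stats. In practice these values would come from your monitoring system.

key_metrics = {
    "p99_page_load_seconds": 2.8,    # latency as the user experiences it
    "error_rate":            0.004,  # share of failed requests
    "requests_per_second":   1250,   # traffic
}

alert_thresholds = {
    "p99_page_load_seconds": 3.0,    # alert if pages take longer than 3 seconds
    "error_rate":            0.001,  # alert if more than 0.1% of requests fail
}

for name, limit in alert_thresholds.items():
    if key_metrics[name] > limit:
        print(f"ALERT: {name}={key_metrics[name]} exceeds {limit}")
```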

The key metrics of a given system are not always easy to define (this is why I ask about them in system design interviews, and many people don’t get them right), and failing to define them can mean discovering incidents late, which in turn increases their impact. Asking about the time between the start of an incident and engineers knowing about it and reacting to it can reveal such failures.

If engineers only learned about the incident when users started calling or opening support tickets, that is a sign that something in the alerting and/or operations is wrong and needs fixing: the alerting pipeline is not well set up, alerting criteria are not correctly defined, the team is not building its alerts on the system’s key metrics, the team’s on-call procedures and rules are not well defined or not well followed, and so on.

Could we have discovered it before it became an incident?

While alerting kicks in when something goes wrong, ideally engineers should also be watching for bad trends and addressing them before they become an incident. Trends develop over long periods of time: think of a gradual decrease in free disk space or memory, or a gradual increase in unprocessed items in queues.

A slight increase in the number of unprocessed queue items over a relatively short period is difficult to notice, but if such slight increases keep building up for weeks and months, we end up with a very long queue that will eventually break.
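As a rough illustration of why a slow trend only becomes visible over a long window, here is the kind of back-of-the-envelope projection an engineer might do when spotting such a trend on a dashboard; the numbers are made up.

```python
# Rough sketch: project when a disk will run out of space, given two dashboard
# readings two weeks apart. Values are made up for illustration.

free_two_weeks_ago = 0.42   # 42% free disk space two weeks ago
free_today         = 0.38   # 38% free today
days_between       = 14

daily_drop = (free_two_weeks_ago - free_today) / days_between   # ~0.29% per day
days_until_full = free_today / daily_drop                       # ~133 days

print(f"At this rate the disk is full in about {days_until_full:.0f} days")
```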

It is difficult to discover trends automatically; usually engineers are the ones who find them, not automation. To discover trends, engineering teams need to develop the habit of periodically looking at monitoring dashboards for bad trends over relatively long time windows, like two weeks or a month. In some teams I worked in, this was done daily by the on-call engineer; in other teams it was done in a weekly operational review meeting.

A relatively common problem when searching dashboards for trends is having many dashboards because there are many deployments and a dashboard per deployment: for example, running the system in multiple availability zones (AZs) or regions with a dashboard per region or AZ, or having a separate deployment for each customer with a dashboard per deployment. I used to be one of the engineers who developed and operated the automation controlling one of the largest Cassandra fleets in the world: seventy-something clusters, each serving a different service or subsystem, holding petabytes of data across thousands of nodes, with a dedicated dashboard per cluster. How can we look for trends in each of the seventy-something dashboards? That could take forever. Yet we did it every day as part of the on-call engineer’s daily routine.

The trick is to have a single dashboard that displays only the worst few values across all clusters for each type of metric we monitor. For example, to discover trends in free disk space, we had a graph that plotted only the five lowest free-disk-space ratios across all clusters for the past two weeks. If the worst ratios are good enough and show no bad trends, then by definition the rest are even better, and we don’t need to look at them. The dashboard had graphs applying the same idea to CPU utilization, the number of unavailable nodes, and so on.
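A minimal sketch of the “worst few across all clusters” idea, using free disk space as the example; the cluster names and ratios are hypothetical.

```python
# Minimal sketch: given the latest free-disk ratio per cluster, keep only the
# lowest five for the trend-detection dashboard. Data is hypothetical.

free_disk_ratio_by_cluster = {
    "cluster-billing":  0.61,
    "cluster-search":   0.18,
    "cluster-metrics":  0.44,
    "cluster-logs":     0.23,
    "cluster-profiles": 0.57,
    # ... the rest of the fleet
}

worst_five = sorted(free_disk_ratio_by_cluster.items(), key=lambda kv: kv[1])[:5]

for cluster, ratio in worst_five:
    print(f"{cluster}: {ratio:.0%} free")
```

In a real setup this selection would of course be done by the monitoring or dashboarding system itself rather than by a script; the point is the bottom-N aggregation.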

The trend-detection dashboard should be minimal; it exists only to answer a single boolean question quickly and easily: is everything OK or not? It is not intended for troubleshooting or for telling you exactly what is going wrong. If we find that something is wrong with one particular cluster, we go to that cluster’s dedicated dashboard to debug and troubleshoot.

So in a postmortem, asking why the engineers did not catch the problem before it became an incident is questioning the team’s monitoring and trend detection. It could surface a signal the team did not look at, or reveal things that need improvement in the team’s operational procedures.

How can you prevent this from happening again in the future?

This is the list of action items that should prevent this incident, or this category of incidents, from happening again. It should contain solutions that address all of the issues uncovered by the earlier questions: fixes to reliability bugs in code or changes to the system design, but also fixes to alerting and monitoring, changes to operational procedures, and so on.

Ideally this list of action items should be finalized after review and discussion.

Who owns the follow-ups?

A postmortem is not closed until all of its action items are completed. Ideally, the number of open postmortems and their age should be tracked across teams, departments, and the whole company.

But tracking is meaningless unless someone is accountable for each postmortem: the person responsible for ensuring that the agreed-upon action items are completed. Preferably this is someone who can influence the team’s priorities, like a team lead, a manager, or a product owner.


Written by Muhammad Soliman

Principal site reliability engineer (SRE) at elastic.co
