Fault injection

8 min read2 days ago

Fault injection is the practice of injecting artificial faults or failures in the system and watching how the system behaves. The core premise behind building a reliable distributed system is that it has a level of fault tolerance. The promised fault tolerance level varies from one system to the other, and varies with types of failures the system promises to tolerate.

The types of failures to inject are infinite and can not be all covered in one article, so I am going to talk about my preferred type of fault injection: network partitioning, and give examples on how to achieve it.

Why network partitioning

First reason has to do with a core limitation of distributed systems defined by the CAP theorem, which states that in any distributed system you can only achieve at most only two of the three: strong consistency, Availability and partition tolerance.

Network portioning happens when the network becomes two or more disconnected parts, where nodes in a given part are able to communicate with each other, but not with nodes in other parts. Network partitioning is inevitable, it happens a lot.

Most systems are CP or AP systems: consistent and partition tolerant or Available and partition tolerant. Partition intolerance, or a CA system, means that when a network partitioning happens the whole system stops working. Because implementing CA must include distributed synchronous transactions which are almost impossible to implement to the full theoretical correctness, which in turn means that in a CA system chances for a split brain exist, many CA systems do not continue functioning after the portioning ends by themselves, once a partitioning happens the system stops working forever until an external intervention happens: human intervention or an automated mechanism for resolving conflicts.

This is why it is rare to find CA systems, and this is why my post is focusing on CP or AP systems.

The second reason has to do with the second most important theorem in distributed systems after CAP theorem: the FLP impossibility, which states that in any distributed system where a node can crash and we cannot put a cap on response time, it is impossible to have consensus.

The idea is simple: if I send a request to a node on the network, how long should I wait before I assume the node is dead? what If I decide that after waiting x seconds that the node is dead but turns out that the node is not dead but just busy?

And If I send a request and don’t get a response, should I assume that my request didn’t reach the destination or should I assume that it reached the destination, was executed, and I just didn’t get the response? In other words did the network outage happen before or after I sent my request?

And what happens if I assume that a node in the system is dead, because it didn’t respond for a long time, and then it turns to be a live and joins the system again after the network disruption that caused it to be non responsive ended? will the system be in a sort of split brain situation? will this corrupt data?

What FLP impossibility tells us is that in reality, all distributed systems work on an assumptions. We hope that our assumptions are true and they are true most of the time.

Network portioning tests those assumptions.

In a network portioning the system becomes split, each node in a partition makes assumptions about nodes in the other partition. Those assumptions could be that they crashed while in reality they are not, they are alive, or that the write request I sent didn’t reach its destination while in fact it did and the sender just missed the confirmation response, etc.

This is why network partitioning is a more serious test than shutting down some nodes for example. Shutting down a node does not leave any space for wrong assumptions. In theoretical distributed systems all failures are classified into two major categories: crash failures and byzantine failures. Crash failures, like shutting down a node, are the ones that are easy to reason about and tolerate, byzantine failures are the nasty ones. Timeouts are byzantine failures.

Fault injection and engineering teams dynamics

Everybody knows that you need more prevention than treatment, but few reward acts of prevention. Nassim Nicholas Taleb — The black swan.

Rewarding treatment more than prevention is a human nature. In his book The Black swan, Nassim Nicholas Taleb gives various examples from different aspect of life, like rewarding the central banker who avoids a recession less than the one who “corrects” what his predecessor has done.

Site reliability engineers, and those who work on distributed systems reliability whatever title they hold, are good examples of this human nature. Usually, the more their job is well done the more difficult it is to show. If a system doesn’t fail much it is difficult to credit this to the SRE’s design, implementation and operations vs luck that caused no problems to happen or any other similar factor. However, if the system fails and the engeineer intervenes to save it and end a costly outage they get celebrated more easily.

When Booking.com changed technical leadership, the new leaders came with many new ideas and mandates, one of those mandates that resonated with me and still does is “No more heroes”. Instead of celebrating those who jump into outages across the whole engineering org, systems should be reliable, have clear ownership and clear expectations on interventions when incidents happen, and measurable reliability should be more celebrated than heroic saving of failed systems.

However from my experience I would say that this is attitude is the exception not the norm.

Fault injection lies in the golden middle ground between prevention and treatment. When you do a fault injection exercise and systems react to it with some of them failing and some not, failures become things that need treatment with all of what this word entails: they are objectively clear to everyone in terms of severity, impact and in turn priority of fixing.

Back in the days when I worked in booking.com I had a discussion with a colleague about a change I suggested that should increase the overall reliability of the system our team owned, but my colleague thought it was too complicated. One thing that helped us resolve the disagreement about such subjective matter was that the system we owned fired alerts in the last company wide fault injection exercise, so we have to implement a fix and report this fix to management. My suggested change would fix this issue, it was a treatment to a problem that happened in real world not one that we theorize could happen or not.
In the next quarterly review my colleague was super fair to me, mentioning that my change resulted in decreasing the system’s alerts overall by about 97% percent — I didn’t bother to measure its effect with that level of precision when I was writing my own quarterly self reflections. While the change had a positive effect on the system overall, fault injection was the thing that helped give objectivity to a subjective matter like design complexity vs gained reliability.

In the same time, and assuming the fault injection exercise is performed in a non production environment, the fixes for failures that happen during a fault injection exercise are preventive actions, they prevent the occurrence of a problem that didn’t happen yet in a customer facing systems.

How to perform a network partitioning fault injection

Any distributed system has failure domains: groups of components of the infrastructure that could all fail together.

If we are talking about a cloud service, the largest failure domain should be a zone. This is the largest amount of components that could fail together- assuming we are talking about data plane not control plane component.

Ideally, any distributed system should have at least 3 failure domains of the largest size because when one of them fails the system can still do consensus. This is why any cloud region has 3 zones or more.

So you should perform network portioning by disconnecting 1 of your largest failure domains from the rest, So for example if your system is deployed in 3 zones or 3 data centers you should perform the exercise by disconnecting all your infra structure components (servers, virtual machines, database replicas, etc) in one zone or one data center from the rest of the infrastructure.

Network portioning fault injection in AWS

AWS has a powerful tool called fault injection service (FIS). It is a tool that allows users to perform many kinds of fault injections including network portioning. It uses templates to describe fault injection experiments, where each template include one or more actions and each action has one or more targets. In case of network portioning the action is aws:network:disrupt-connectivity.

The targets for this action should be all the subnets in the zone you want to disconnect from the rest. Subnets are a zonal resource in AWS.

Behind the scenes, FIS works by adding access control list (ACL) rules that prevent traffic to and from the targeted subnets.

Each fault injection experiment template can be used to run the fault injection multiple times, with settings for duration of the exercise, cloud watch alerts to monitor to terminate the exercise if they fire and many more powerful features.

Network portioning fault injection in GCP

GCP does not have a powerful fault injection tool like AWS but you can perform fault injection manually.

Subnets are regional resources in GCP, so you cannot target full subnets to isolate resources in a given zone. So when creating resources, you should add a network tag to all resources in each zone with value equal to the zone they lie in.

Then to disconnect all resources in a given zone, in each subnet add firewall egress and ingress rules that prevent all inbound and outbound traffic. Firewall rules define targets by network tags, so the network tags of those two rules should be set to the zone you want to disconnect from the rest.

Removing or disabling those two firewall rules should stop the network partitioning.

Network portioning fault injection in Azure

I don’t know a way to perform network portioning fault injections in azure.

Azure’s network security group (NSG) rules, which are Azure’s alternative to access control lists at AWS or firewall rules in GCP, only affect new connections but do not affect already existing ones.

So if you have a virtual machine that has connections opened to other systems in your network or the internet or even your ssh client and you add a NSG rule that stops all traffic to and from this virtual machine, those connections will remain open. Opening new connections will not be possible but old connection will keep functioning.

So far I don’t know a way to overcome this limitation.