What does it mean to have a fail-safe distributed system?
"If something is fail-safe, it has been designed so that if one part of it does not work, the whole thing does not become dangerous." (Cambridge Dictionary)
Fail-safe is a common concept shared among many engineering disciplines. It means that if a component or a machine fails, it leaves the system in a state that presents no danger to its users.
Let's take train brakes as a famous example to clarify the concept. Train brakes operate on compressed air. An intuitive design would apply compressed air pressure to brake shoes on each wheel whenever we want to stop the train. But what happens when the pipes carrying that compressed air are cut or develop a large leak? We are left with a huge mass full of people in motion and nothing to stop it. A disaster.
An improvement to this initial idea is to have a spring apply pressure to all brake shoes by default, and to use air pressure to release the brakes rather than apply them. If the air system fails for any reason, such as a large leak, the brakes are applied and the train stops. When the brake system fails, it fails to a state where the train stops.
Another famous example of fail-safe design can be seen in nuclear power reactors. To sustain a nuclear chain reaction, neutrons have to hit atoms within a defined speed range; if neutrons are too fast or too slow, fission will not happen. Neutrons need to be moderated, a term used for reducing their speed into the range needed to cause fission. One fail-safe reactor design relies on the reactor water, or another fluid, as the moderator, with little moderation from other sources. In such designs, if the reactor overheats for any reason, such as human error or the failure of some component, more water turns into steam, which nuclear engineers call void. A void moderates nothing: with less water to slow them down, neutrons stay too fast, less fission occurs, and the reaction slows or stops with no human or machine intervention. These reactors are described as having a negative void coefficient: the more void forms due to overheating, the more the nuclear reaction slows down; void has a negative effect on the reaction. The failure to control the reactor leaves it in a state where the nuclear reaction stops or slows down significantly.
There are numerous applications of the fail-safe design concept across all engineering disciplines.
So what does it mean to have a fail-safe software system, or in particular a fail-safe distributed system? Knowing that software coordinates or controls much of our daily lives, what does safety mean in this context? For example, if a large critical system like the one used for air traffic control stops working, as happened in the UK in August 2023 and the USA in January 2023, cancelling or delaying thousands of flights and costing millions of dollars, can that be considered a safe failure? If it is a safe failure, what does an unsafe failure look like? And if it is not, is there a way to make it one? What does safety mean in a distributed software systems context?
Safety vs Liveness
Theoretical distributed systems literature classifies all the properties of a distributed system, all the things a distributed system promises to do, under two major categories: safety properties and liveness properties. Safety properties simply state that the algorithm should not do anything wrong ¹. Does this mean that a system that does nothing is safe? Yes, it does! By this definition, an air traffic control system that stops working has a safe failure. An unsafe failure is one where the system performs wrong operations, like returning wrong output that causes planes to crash into each other.
And here comes the other category of properties: liveness. Liveness means that something happens. When the system receives a request, it sends a response; that is a liveness property.
Many engineers are familiar with measuring and setting targets for how often a system fails, expressed as numerical targets like service level indicators (SLIs) and service level objectives (SLOs), for example saying that at least 99.99% of requests should return a response. That is another way of saying the system may fail to return a response at most once for every 10,000 requests; the error budget for the system is 1/10,000. Such measurements are measurements of liveness properties.
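As a rough sketch of that arithmetic (a hypothetical helper, not tied to any particular SLO tooling), the error budget falls straight out of the SLO:

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to fail while still meeting the SLO.

    slo is expressed as a fraction, e.g. 0.9999 for 99.99%.
    """
    return int(total_requests * (1 - slo))

# With a 99.99% SLO, 1 request in every 10,000 may go unanswered.
print(error_budget(0.9999, 10_000))     # 1
print(error_budget(0.9999, 1_000_000))  # 100
```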
But besides analyzing how often the system fails, it is important to understand how the system fails, which is what activities like failure mode analysis are for: analyzing what can lead to a system failure, what that failure looks like, and its possible consequences. Unfortunately, it is difficult to put a numeric measure or promise on safety failures if a system is prone to them. From a theoretical distributed systems perspective, safety properties are properties that, once violated at some time t, can never be satisfied again ¹. If your system is prone to doing an unsafe operation, it will forever be described as an unsafe system; it cannot be unsafe at one point in time and become safe later.
Usually, safety and liveness are two competing sets of properties, and the larger the system (the more data it holds, the more requests it handles), the more difficult it is to achieve both, and compromises have to be made.
Let's take the Dynamo database as an example. Dynamo, and later databases built on the same concepts such as Cassandra, relied heavily on the idea of eventual consistency, where the chance of reading the latest version of a data record increases as the time between the write and the read increases; eventually, the client will get the latest version of the data. Eventual consistency is a tradeoff between availability and response latency on one side (two liveness properties) and correctness of data (a safety property) on the other. If you ask for the latest version of a record with strong guarantees that it really is the latest version, you will get a slower response and a lower chance of getting a response at all. If you ask for any version of the record, optimistically hoping it is the latest, you get a quick response and a higher chance of getting a response.
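To make the tradeoff concrete, here is a minimal sketch using the DataStax Python driver for Cassandra; the cluster address, keyspace, and table are made up for illustration:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# Safety-leaning read: wait until a majority of replicas agree.
# Slower, and it fails entirely if a majority is unreachable.
strong_read = SimpleStatement(
    "SELECT * FROM carts WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

# Liveness-leaning read: accept whatever the first replica returns.
# Fast and available, but it may return a stale version of the cart.
fast_read = SimpleStatement(
    "SELECT * FROM carts WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

rows = session.execute(fast_read, ["user-42"])
```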
Dynamo was originally built to be used inside Amazon retail, and one main component that relied on it was the shopping cart; this was before there was something called AWS. One deciding criterion for approving the idea of eventual consistency was measuring the rate of stale data records returned to the shopping cart service, estimating from that the rate of orders that would need a fix by support personnel, like reshipping an item, and comparing it with the normal rate of human-made errors in handling orders in fulfillment centers and shipping. The rate of errors introduced by eventual consistency turned out to be tiny compared with the already existing rate of errors caused by humans, so the idea was approved.
From the CAP theorem perspective, consistency is a safety property while availability and partition tolerance are liveness properties.
Control planes and safety
Control planes and data planes are two concepts that came from networking and found their way into distributed systems. In networking, the control plane decides how data flows, while the data plane is where the actual data flow happens. In distributed systems, the control plane is used to create and configure resources (databases, virtual machines, data buckets, etc.) and the data plane is the resources themselves and the operations performed on them (querying a database, deleting a record, running a compute workload, etc.).
While Dynamo traded away some safety to gain liveness, other types of applications prefer safety over liveness: they would rather guarantee the most correct action than be quick, or do anything at all. Control planes are a famous category of applications that favor safety over liveness. Let's take AWS control planes as an example.
In AWS, control plane operations are not covered by SLAs! S3's SLA covers reading and writing data from and to an S3 bucket but does not cover creating the bucket itself. Control planes tend to be more complex, with more dependencies, so they are more error-prone; on top of that, they favor being consistent and partition-tolerant (CP) systems: they prefer not responding, or responding slowly, over corrupting data or taking wrong actions. Some AWS services' control planes even exist in a single region ². Data planes, on the other hand, tend to be simpler and faster, target being available and partition-tolerant (AP) systems ³, and exist in multiple regions.
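To make the distinction concrete, here is a small sketch using boto3 against S3 (the bucket name is made up, and region-specific bucket configuration is omitted): the first call is a control plane operation, the rest are data plane operations.

```python
import boto3

s3 = boto3.client("s3")

# Control plane: create and configure the resource itself.
# Typically slower, with more moving parts behind the scenes,
# and not what the published SLA is about.
s3.create_bucket(Bucket="example-orders-bucket")

# Data plane: operate on the resource.
# Simpler, faster, and covered by the SLA.
s3.put_object(Bucket="example-orders-bucket", Key="order-1.json", Body=b'{"id": 1}')
obj = s3.get_object(Bucket="example-orders-bucket", Key="order-1.json")
```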
This is why it is strongly recommended to have your system's control plane operations rely on other systems' or cloud services' control planes, and your data plane operations rely on their data planes, and to maintain a sharp separation between the two types of operations. Making your system's data plane operations depend on another system's control plane operations decreases your system's availability and increases its response time.
One important consequence of control planes favoring safety over liveness can be seen in auto-scaling systems. Auto-scaling adds compute resources (pods, virtual machines, load balancers, etc.) to the system when the load increases and removes them when the load decreases. Since adding resources is a control plane operation, it is slow, with no promised bound on that slowness, so auto-scaling fails to handle sharp spikes, where traffic unpredictably increases significantly in a short period. People treat auto-scaling as a silver bullet, and if I had a dollar for each time I had to explain that auto-scaling cannot handle sharp spikes, I would be a millionaire.
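A toy simulation (all numbers invented, not modeled on any particular cloud) shows why: if new capacity takes minutes to become ready and the spike arrives in seconds, requests are dropped long before the autoscaler's new instances can help.

```python
CAPACITY_PER_INSTANCE = 100   # requests/second one instance can serve
PROVISIONING_DELAY = 180      # seconds until a newly requested instance serves traffic

instances = 2                 # instances already running
pending = []                  # (ready_at, count) for instances still being provisioned
dropped = 0

for t in range(300):
    load = 150 if t < 60 else 1000                 # sharp, unpredicted spike at t=60
    ready = instances + sum(n for ready_at, n in pending if ready_at <= t)
    dropped += max(0, load - ready * CAPACITY_PER_INSTANCE)

    # Naive autoscaler: request enough instances to cover the current load.
    desired = -(-load // CAPACITY_PER_INSTANCE)    # ceiling division
    requested = instances + sum(n for _, n in pending)
    if desired > requested:
        pending.append((t + PROVISIONING_DELAY, desired - requested))

print(f"requests dropped during the spike: {dropped}")
```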
Another example of the compromise between safety and liveness in control planes can be seen in Kubernetes' scheduler. The scheduler is the component that assigns a pod to a node. If the cluster has many nodes and pods set complex scheduling constraints (like spreading all pods from a deployment evenly across all nodes), the scheduler has many pods and nodes to consider for every scheduling decision, which results in slow scheduling. Given that the Kubernetes scheduler is sequential, handling one pod at a time in queued order, the slowness builds up and affects all scheduling operations.
So one tradeoff the Kubernetes scheduler offers is a configuration parameter that limits the number or percentage of nodes considered in scheduling decisions. This is a tradeoff between making the most correct scheduling decision slowly and making a correct-enough scheduling decision quickly: a tradeoff between safety and liveness that is configurable.
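In Kubernetes this is the percentageOfNodesToScore setting in the scheduler configuration. A toy sketch of the idea follows; real scheduler filtering and scoring are far more involved, and the fitness function here is invented purely to illustrate the sampling tradeoff.

```python
import random

# Toy model: "scoring" a node is reduced to a single made-up fitness number.
def score(node):
    return node["free_cpu"] - node["pods"]

nodes = [{"free_cpu": random.randint(0, 100), "pods": random.randint(0, 50)}
         for _ in range(5000)]

# Safety-leaning: consider every node and pick the best placement, slowly.
best_overall = max(nodes, key=score)

# Liveness-leaning: consider only a sample of nodes (the spirit of
# percentageOfNodesToScore) and accept a good-enough placement, quickly.
sample = random.sample(nodes, k=len(nodes) // 10)
best_of_sample = max(sample, key=score)
```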
Fail-safe vs fail-secure
One key aspect of evaluating how a system fails is whether it fails open or fails closed.
A basic example of the tension between fail-open and fail-secure is what happens when you have an automatic door lock and you experience a power failure. Opening the door guarantees that everyone in the building can leave safely, particularly if the failure was caused by a disaster, but it can also let thieves in. Fail-open is a kind of fail-safe; keeping the door locked is the fail-secure choice.
In distributed systems, in an ideal world, when a system fails you want all hands on deck, or at least all hands able to be on deck: every engineer can contribute to debugging the problem and restoring the system as fast as possible; the system's failure is open to everyone who can help fix it. An open failure is a safe failure. However, giving access to everyone goes against the principle of least privilege and contradicts a number of security and sometimes even regulatory rules (depending on the nature of the application).
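The same tension appears inside services themselves. Here is a hedged sketch of an authorization check deciding what to do when its policy dependency is unreachable; the policy service URL, request shape, and function are all made up for illustration.

```python
import requests

POLICY_URL = "https://policy.internal.example/check"  # hypothetical service

def is_allowed(user: str, action: str, fail_open: bool) -> bool:
    """Ask the policy service whether user may perform action."""
    try:
        resp = requests.post(
            POLICY_URL, json={"user": user, "action": action}, timeout=0.5
        )
        resp.raise_for_status()
        return resp.json()["allowed"]
    except requests.RequestException:
        # The tradeoff lives in this one line:
        # fail-open keeps the system usable during an outage (liveness),
        # fail-closed refuses everything it cannot verify (security).
        return fail_open
```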
Both Google ⁴ and Facebook ⁵ had incidents where security restrictions on access to hardware and networking delayed resolution; in Facebook's case this eventually cost the company 100 million dollars and a 5% drop in its stock price.
Conclusion
It is important to understand how a system fails and what tradeoffs are made around its different failure modes. While fail-safe sounds like an unqualified good, safety comes with a lot of tradeoffs, including but not limited to latency and availability.
References:
1. Introduction to Reliable and Secure Distributed Programming. Christian Cachin, Rachid Guerraoui, and Luís Rodrigues.
2. https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/global-services.html
5. https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/