Most of us have experienced moments when we could not access an application because of an outage. Recently, YouTube faced a global outage that stopped users from streaming videos for about an hour. You may wonder what causes such outages and how one can prevent them from happening. To understand this better, let's explore the idea of availability.
The availability of a distributed system is the percentage of time in a given period that a system is available to perform its task and function under normal conditions. Let's understand it from another perspective: Various components in a distributed system are spread across multiple nodes or locations. So, availability is the ability of a system to remain operational despite failures within its components.
One way to look at it is how resistant a system is to failures. The level of availability that a system requires depends on how the system is used. For example, air traffic control systems require very high availability because a single error in directing aeroplanes can lead to catastrophic results. On the other hand, systems where occasional downtime has little impact can work well with lower availability requirements. The idea is simple: high availability comes with a cost, so we have to optimize according to our needs.
We can measure the availability of a distributed system as the percentage of the system's uptime in a given time period, i.e., by dividing the total uptime by the sum of the total uptime and downtime in that period.
Availability = Uptime / (Uptime + Downtime).
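The formula above can be sketched in a few lines of Python. The function name `availability` and the sample figures (a 30-day month with one hour of downtime) are illustrative assumptions, not measurements from a real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_hours / total


# Hypothetical example: a 30-day month (720 hours) with a 1-hour outage.
fraction = availability(uptime_hours=719, downtime_hours=1)
print(f"{fraction * 100:.3f}%")  # prints 99.861%
```

Note that a single one-hour outage in a month already leaves the system just short of 3 nines (99.9%).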
We usually express availability in terms of “nines” rather than raw percentages. If availability is 99.00 percent, the system is said to have “2 nines” of availability, and if it is 99.9 percent, it has “3 nines,” and so on. A system with 5 nines (i.e., 99.999%) is considered the gold standard of availability. To put the nines in perspective: over a year, 2 nines allows about 3.65 days of downtime, 3 nines about 8.76 hours, 4 nines about 52.6 minutes, and 5 nines about 5.26 minutes.
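The downtime budget behind each level of nines can be worked out with simple arithmetic. Here is a minimal sketch, assuming a 365-day year; the helper name `downtime_minutes_per_year` is only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return unavailability * MINUTES_PER_YEAR


for n in range(2, 6):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is one reason why every additional nine is so much more expensive to achieve.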
To increase availability, we can use redundancy by duplicating or adding additional components (servers or storage). For example, a system with two identical web servers behind a load balancer can continue operating even if one of the servers goes down because the load balancer can redirect traffic to the remaining server. So by adding redundancy, we can make the system more resilient to failure.
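To see how much redundancy helps, here is a rough estimate of the combined availability of identical servers behind a load balancer. This is a simplified model that assumes the servers fail independently and that the load balancer itself never fails; neither assumption holds perfectly in practice. The function name `parallel_availability` is hypothetical.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical components where any one suffices.

    Assumes independent failures: the system is down only when every
    replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    all_down = (1 - component_availability) ** replicas
    return 1 - all_down


# Two web servers, each assumed to be 99% available on its own.
combined = parallel_availability(0.99, replicas=2)
print(f"{combined * 100:.2f}%")  # prints 99.99%
```

Under these assumptions, adding a second 2-nines server gives roughly 4 nines for the pair, because both servers must fail at once for the system to go down.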
Redundancy alone is not enough to guarantee high availability. Failure detection and alerting mechanisms must also be in place to identify failures. For this, we should continuously monitor system health and regularly perform high-availability testing, so that we can take corrective action whenever one of the components in the system becomes unavailable.
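One common way to detect failures is a heartbeat check: every node reports in periodically, and a monitor flags any node that has been silent for too long. The sketch below is a minimal, single-process illustration of that idea. The class name `HeartbeatMonitor`, the node names, and the 5-second timeout are all assumptions; a real system would also need alerting and protection against false positives.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat of each node and reports silent nodes."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the monitor can be tested
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        """Called whenever a node reports that it is alive."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("web-1")
monitor.record_heartbeat("web-2")
fake_now[0] = 4.0
monitor.record_heartbeat("web-1")  # web-2 stays silent
fake_now[0] = 6.0
print(monitor.failed_nodes())  # prints ['web-2']
```

Once a node is flagged, the corrective action could be as simple as removing it from the load balancer's pool until it recovers.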
Here are some other strategies to ensure high availability:
There is a trade-off between the availability of a system and its performance. To achieve high availability, we often implement redundancy or disaster recovery strategies, which can degrade system performance (higher latency or lower throughput). For example, implementing redundancy requires replicating data or tasks across multiple resources, which can increase latency.
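The latency cost of replication can be illustrated with a deliberately simplified model. Assume that a synchronous write is acknowledged only after the primary and every replica have confirmed it in parallel, while an asynchronous write is acknowledged as soon as the primary confirms it. The function names and the latency figures below are made up for illustration.

```python
from typing import Sequence


def sync_write_latency_ms(primary_ms: float, replica_ms: Sequence[float]) -> float:
    """Latency when the write waits for the primary and all replicas.

    Simplified model: acknowledgements are gathered in parallel, so the
    slowest participant determines the total latency.
    """
    return max([primary_ms, *replica_ms])


def async_write_latency_ms(primary_ms: float, replica_ms: Sequence[float]) -> float:
    """Latency when replicas are updated in the background."""
    return primary_ms


# Hypothetical latencies: a local primary and two replicas, one in a remote region.
replicas = [5.0, 40.0]
print(sync_write_latency_ms(2.0, replicas))   # prints 40.0
print(async_write_latency_ms(2.0, replicas))  # prints 2.0
```

In this model, synchronous replication keeps every copy up to date at the price of waiting for the slowest replica, while asynchronous replication keeps writes fast but risks losing recent data if the primary fails before the replicas catch up.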
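Both high availability and fault tolerance are strategies used to achieve high uptime, but they approach the problem differently. High availability is about the system's ability to remain operational and accessible with minimal downtime. Fault tolerance, on the other hand, is about the system's ability to continue functioning normally even in the event of a failure. In other words, a highly available system may suffer a brief interruption while it recovers or fails over, whereas a fault-tolerant system aims for no visible interruption at all, usually at a higher cost in redundant resources.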
Thanks to Chiranjeev and Navtosh for their contribution in creating the first version of this content. If you have any queries or feedback, please write to us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy system design, Enjoy algorithms!