Availability: System Design Concept

Many of us have experienced moments when we could not access an application because of an outage. Recently, YouTube faced a global outage that stopped users from streaming videos for about an hour. You may wonder what causes such outages and how one can prevent them. To understand this better, let's explore the idea of availability.

What is Availability?

The availability of a distributed system is the percentage of time in a given period that a system is available to perform its task and function under normal conditions. Let's understand it from another perspective: Various components in a distributed system are spread across multiple nodes or locations. So, availability is the ability of a system to remain operational despite failures within its components.

One way to look at it is how resistant a system is to failures. The level of availability that a system requires depends on how the system is used. For example, Air Traffic Control systems require high availability because a single error in directing aeroplanes can lead to catastrophic results. On the other hand, systems where occasional downtime has little impact can work well with lower availability requirements. The idea is simple: High availability comes with a cost, so we have to optimize according to our needs.

How is Availability Measured?

We can measure the availability of a distributed system as the percentage of the system's uptime in a given time period, i.e., by dividing the total uptime by the sum of the uptime and downtime in that period.

Availability = Uptime / (Uptime + Downtime).
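As a quick illustration of the formula, here is a minimal Python sketch. The function name availability and the example figures (a 30-day month with one hour of downtime) are assumptions chosen for illustration, not measurements from a real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the measurement period must be longer than zero")
    return uptime_hours / total


# Hypothetical example: a 30-day month (720 hours) with 1 hour of downtime.
monthly = availability(uptime_hours=719, downtime_hours=1)
print(f"{monthly:.4%}")  # prints 99.8611%
```

So a single hour of downtime in a month already puts the system below 3 nines for that month.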

The Nines of Availability

We usually describe availability in terms of nines rather than raw percentages. If availability is 99.00 percent, the system is said to have "2 nines" of availability; if it is 99.9 percent, it has "3 nines," and so on. A system with 5 nines (i.e., 99.999%) is often considered the gold standard of availability. Let's take a look at different nines of availability.

Figure: The nines of availability in system design
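The downtime that each level of nines permits follows directly from the availability percentage. The sketch below computes the allowed downtime per year; the function name allowed_downtime_minutes is hypothetical, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Maximum downtime per year, in minutes, for a given number of nines."""
    unavailability = 10 ** (-nines)  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 6):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:,.2f} minutes of downtime per year")
```

For instance, 3 nines allows roughly 8.76 hours of downtime per year, while 5 nines allows only about 5.26 minutes.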

How do we achieve High Availability?

To increase availability, we can use redundancy by duplicating or adding additional components (servers or storage). For example, a system with two identical web servers behind a load balancer can continue operating even if one of the servers goes down because the load balancer can redirect traffic to the remaining server. So by adding redundancy, we can make the system more resilient to failure.
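To see why redundancy helps, we can estimate the combined availability of the two-server example. This is a rough sketch that assumes the servers fail independently of each other, which real systems only approximate; the function names and the availability figures are made up for illustration.

```python
import math


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies when at least one must be up.

    Assumes independent failures: the group is down only if every copy is down.
    """
    return 1 - (1 - component_availability) ** copies


def series_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Hypothetical figures: two web servers at 99% each behind a 99.99% load balancer.
servers = parallel_availability(0.99, copies=2)
whole_system = series_availability([0.9999, servers])
print(f"servers: {servers:.4%}, with load balancer: {whole_system:.4%}")
```

Under these assumptions, two 99% servers give about 99.99% combined. The load balancer sits in series with them, so it caps the availability of the whole system unless it is made redundant too.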

  • Passive Redundancy: Only some of the components are active at any given time and backup components are available in case of a failure. If some component fails, the backup component will take over and become active.
  • Active Redundancy: Multiple active components work simultaneously to perform the task. In the event of a failure of one of the active components, the other active components can take over.
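The two styles of redundancy can be sketched in a few lines of Python. The classes Node, ActivePassive, and ActiveActive are hypothetical names for this illustration; a real system would detect failures through health checks rather than a simple flag.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True  # stands in for a real health check

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassive:
    """Passive redundancy: the standby serves traffic only after a failure."""

    def __init__(self, primary: Node, standby: Node):
        self.active = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        if not self.active.healthy:
            # Promote the standby and demote the failed node.
            self.active, self.standby = self.standby, self.active
        return self.active.handle(request)


class ActiveActive:
    """Active redundancy: all nodes serve traffic; failed nodes are skipped."""

    def __init__(self, nodes: list[Node]):
        self.nodes = nodes
        self._next = 0

    def handle(self, request: str) -> str:
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return node.handle(request)
        raise RuntimeError("no healthy nodes available")
```

In the passive setup the standby sits idle until it is promoted, while in the active setup every node shares the load and the remaining nodes absorb the traffic of a failed one.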

Redundancy alone is not enough to guarantee high availability. Failure detection and alerting mechanisms must also be in place to identify failures. For this, we should continuously monitor system health and regularly perform high-availability testing, so that we can take corrective action whenever one of the components in the system becomes unavailable.
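One common way to detect failures is a heartbeat: each component reports in periodically, and a monitor flags any component that has been silent for too long. The sketch below is a minimal in-memory version; the class name HeartbeatMonitor, the timeout value, and the component names are assumptions for illustration.

```python
import time


class HeartbeatMonitor:
    """Flags components that have not sent a heartbeat within the timeout."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the monitor can be tested
        self._last_seen: dict[str, float] = {}

    def record_heartbeat(self, component: str) -> None:
        self._last_seen[component] = self._clock()

    def failed_components(self) -> list[str]:
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("web-1")  # hypothetical component name
print(monitor.failed_components())  # empty right after a heartbeat
```

An alerting or failover mechanism would then act on the list returned by failed_components.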

Here are some other strategies to ensure high availability:

  • Use load balancing to prevent server overloading. It can also help us to monitor server health.
  • If possible, implement mechanisms for automatic failover. If one component fails, another takes over its function automatically without manual intervention.
  • Replicate data across multiple locations to avoid outages and make the system resilient against disasters. Replication can be synchronous or asynchronous, depending on the requirements.
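The difference between synchronous and asynchronous replication can be illustrated with a toy in-memory store. The class ReplicatedStore is a hypothetical sketch: real replication involves networks, failures, and ordering concerns that are left out here.

```python
from collections import deque


class ReplicatedStore:
    """Toy key-value store with synchronous or asynchronous replication."""

    def __init__(self, replica_count: int, synchronous: bool):
        self.primary: dict[str, str] = {}
        self.replicas: list[dict[str, str]] = [{} for _ in range(replica_count)]
        self.synchronous = synchronous
        self._pending: deque[tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> str:
        self.primary[key] = value
        if self.synchronous:
            # Acknowledge only after every replica has the write.
            for replica in self.replicas:
                replica[key] = value
        else:
            # Acknowledge immediately; replicas catch up later.
            self._pending.append((key, value))
        return "ack"

    def flush(self) -> None:
        """Apply queued writes to the replicas (the asynchronous catch-up)."""
        while self._pending:
            key, value = self._pending.popleft()
            for replica in self.replicas:
                replica[key] = value
```

Synchronous replication keeps replicas up to date at the cost of slower writes, while asynchronous replication acknowledges faster but can lose queued writes if the primary fails before they are applied.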

There is a trade-off between the availability of a system and its performance. To achieve high availability, we often implement redundancy or disaster recovery strategies, which can degrade system performance (higher latency or lower throughput). For example, implementing redundancy requires replicating data or tasks across multiple resources, which can increase latency.

Difference between high availability and fault tolerance

Both high availability and fault tolerance are strategies used to achieve high uptime, but they approach the problem differently. High availability is about the system's ability to remain operational and accessible with minimal downtime. Fault tolerance, on the other hand, is about the system's ability to continue functioning normally even in the event of a failure.

  • Fault tolerance requires multiple systems that run in parallel. In the event of a failure, another system can take over without any loss of uptime. This requires advanced hardware that can detect component faults and enable the systems to operate in coordination. However, it may take longer for complex networks and devices to respond to malfunctions, and technical issues that result in a system crash may also cause the failure of redundant systems running in parallel.
  • High availability, on the other hand, can use a software-based approach to minimize server downtime rather than relying on hardware redundancy. This can be more flexible and easier to implement, but it may not provide the same level of protection against system failures.

Thanks to Chiranjeev and Navtosh for their contribution in creating the first version of this content. If you have any queries or feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy system design, Enjoy algorithms!
