The goal of leader election is to give a specific entity (such as a process, host, thread, object, or person) special powers within a distributed system. These powers may include the ability to delegate tasks, the ability to modify data, or the responsibility for handling all system requests. Leader election can be a useful tool for improving efficiency, minimizing coordination, simplifying architectures, and reducing overhead, but it can also introduce additional failure modes and scaling challenges and make it more difficult to assess the validity of the system.
To achieve this goal, a leader election algorithm selects a processor to coordinate the actions of a distributed system. The leader is typically chosen based on a criterion, such as selecting the processor with the highest identifier. The other processors enter a terminated state when the leader is chosen. In a leader election algorithm, the terminated states are divided into elected and non-elected states. Once a processor enters an elected or non-elected state, it remains in that state at all times.
The safety and liveness conditions must be met for a leader election algorithm to be effective. The liveness condition states that every processor will eventually enter either an elected or non-elected state. The safety condition states that only one processor is allowed to enter the elected state and become the leader of the distributed network.
There are three scenarios to consider when determining if a leader election is appropriate:
Leader election is a common pattern in distributed systems because it has several benefits:
There are also some significant drawbacks to leader election:
A leader election algorithm guides a cluster to jointly agree on one node to serve as a leader with as few back-and-forth interactions as possible. Generally, the algorithm assigns each node one of three states: Leader, Follower, or Candidate. The leader is also required to pass a "health check" or "heartbeat" regularly so that follower nodes can determine if the leader has become unavailable or failed, and a new leader can be elected.
The type of leader election mechanism used depends on whether the cluster is synchronous or asynchronous. In a synchronous cluster, nodes are synchronized to the same clock and send messages in a predictable order. In an asynchronous cluster, messages are not reliably sent within a specific timeframe or in any particular order.
Asynchronous algorithms cannot guarantee both safety (ensuring that only one leader is elected) and liveness (ensuring that every node completes the election) because any number of nodes in an asynchronous cluster can lag indefinitely. In practice, implementations prioritize safety because it has more serious consequences for the service.
Synchronous algorithms are easier to understand and may be preferable because they can guarantee both safety and liveness. However, synchronizing a cluster requires imposing additional constraints on how it operates, which may not be possible or scalable in practice.
The Bully Algorithm is a simple synchronous leader election technique that requires each node to have a unique numeric id and for all nodes in the cluster to know each other's ids. When a node starts up or the current leader fails the health check, the election process begins.
There are two possible outcomes:
Paxos is a general consensus protocol that can be used for asynchronous leader elections. The idea behind the Paxos algorithm is to reach consensus among the nodes in the network about which node should be the leader. In this algorithm, a node proposes itself as the leader and other nodes in the network vote for or against it. If majority of nodes vote for the node, it becomes the leader. If not, the node that failed to become the leader tries again. The process continues until a node receives a majority vote and becomes the leader.
Raft is a popular alternative to Paxos because it is easier to understand, implement, and use. It is a non-blocking algorithm that involves each node in the Raft consensus keeping track of the current "election term." When the leader election begins, each node increments its copy of the term number and listens for messages from other nodes. If the node does not receive any messages after a random interval, it becomes a candidate leader and asks other nodes for votes.
If the candidate wins a majority of votes, it becomes the leader. If another candidate with a higher term number sends a message, the original candidate concedes. If the election is split or the timer runs out without a consensus, the algorithm restarts. Restarts are rare due to random timeouts, which prevent nodes from colliding.
Apache Zookeeper is a centralized coordination service that is "self-distributed and highly dependable." It is designed to help distributed systems handle coordination tasks. Apache Zookeeper philosophy is that coordination is difficult, so it is better to have a shared, open-source implementation with all the necessary components so that services do not have to reinvent everything from scratch. This can be especially useful in large distributed systems.
Apache Zookeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol to manage leader election, replication order guarantees, and node recovery. ZAB is an asynchronous algorithm that ensures that writes are consistent and propagated to all nodes by "broadcasting" state changes from the leader to followers.
Leader election is a strong technique that may be employed in systems to help make them more fault-tolerant and easier to use. However, when we employ leader election, we pay close attention to the guarantees that each protocol gives and, more critically, does not give.
https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/
Thanks to Navtosh for his contribution in creating the first version of this content. If you have any queries/doubts/feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy system design, Enjoy algorithms!