Raft in a nutshell: what you need to know about the consensus protocol

Raft is a consensus algorithm for managing a replicated log. It is used by MongoDB, CockroachDB, Consul, and many other well-known distributed systems. In this post, I will give a simple, high-level introduction to the Raft consensus protocol. It is meant to give you an idea of how the protocol works.

Some areas in this post are intentionally not explained in detail for the sake of simplicity. To understand Raft better, I strongly recommend reading In Search of an Understandable Consensus Algorithm by Diego Ongaro and John Ousterhout, the paper that introduced the protocol and my main source of information.

Raft’s main purpose is to enable many servers to act as one. Each server is modeled as a state machine: if you describe a machine’s current status as a state, then another server in the same state will behave exactly the same way when given the same command.

State transitions are triggered by commands, such as “assign 5 to variable x”, which are recorded as log entries. Thus, what the protocol actually needs to do is keep the logs on different servers in sync.
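
To make this concrete, here is a minimal Go sketch (the types and names are mine, not from the paper): two state machines fed the same log of commands always end up in the same state.

```go
package main

import "fmt"

// Command represents a state transition such as "assign 5 to variable x".
type Command struct {
	Key   string
	Value int
}

// StateMachine is deterministic: applying the same commands in the same
// order always produces the same state.
type StateMachine struct {
	vars map[string]int
}

func (sm *StateMachine) Apply(c Command) {
	sm.vars[c.Key] = c.Value
}

func main() {
	a := &StateMachine{vars: map[string]int{}}
	b := &StateMachine{vars: map[string]int{}}

	log := []Command{{"x", 5}, {"y", 7}, {"x", 9}}
	for _, c := range log {
		a.Apply(c) // same log, applied in the same order...
		b.Apply(c)
	}
	fmt.Println(a.vars["x"], b.vars["x"]) // ...yields the same state: 9 9
}
```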

Raft uses RPCs (Remote Procedure Calls) for communication between the servers in the cluster. Making an RPC call means one server sending a request to another server in the cluster and waiting for its response.
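
The two RPCs the protocol relies on are AppendEntries and RequestVote, both of which we will meet below. Here is a sketch of their request types; the field names follow Figure 2 of the paper, though how you represent them is an implementation choice.

```go
package raft

// LogEntry is a command tagged with the term in which it was created.
type LogEntry struct {
	Term    int
	Command []byte
}

// AppendEntriesArgs: sent by the leader to replicate entries; also used
// as a heartbeat when Entries is empty.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commit index
}

// RequestVoteArgs: sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}
```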

Executing client requests

In a Raft cluster, there are two types of servers: the leader and the followers (I lied. There is one more, but we’ll get there, don’t worry). Raft ensures there is at most one leader in a given term.

Clients are agents outside the cluster. Client requests are always handled by the leader; if a client issues a command to a follower, the follower forwards it to the leader.

When a client issues a write request, the leader first appends the command to its log as a new entry. Then, it issues AppendEntries RPC calls to the followers to replicate the entry.

Once a majority of the servers have stored the entry in their logs, the entry is considered committed, and the leader applies it to its state machine. Applying means executing the command in the entry: in simple terms, changing the value of a variable in memory or on disk.
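
A simplified sketch of this leader-side write path might look like the following; a real implementation tracks per-follower progress and retries failed calls (more on that below), but the majority rule is the core idea.

```go
package raft

type LogEntry struct {
	Term    int
	Command []byte
}

// Peer abstracts the AppendEntries call to one follower; the interface
// is hypothetical and hides the actual RPC plumbing.
type Peer interface {
	AppendEntries(entries []LogEntry) bool // true if the follower stored them
}

type Leader struct {
	term        int
	log         []LogEntry
	commitIndex int
	peers       []Peer // every server except the leader itself
}

func (l *Leader) HandleWrite(cmd []byte) bool {
	// 1. Append the command to the leader's own log as a new entry.
	l.log = append(l.log, LogEntry{Term: l.term, Command: cmd})

	// 2. Replicate the entry to the followers.
	acks := 1 // the leader already has it
	for _, p := range l.peers {
		if p.AppendEntries(l.log[len(l.log)-1:]) {
			acks++
		}
	}

	// 3. Commit once a majority of the cluster stores the entry,
	//    then apply it to the state machine and reply to the client.
	if acks > (len(l.peers)+1)/2 {
		l.commitIndex = len(l.log) - 1
		return true
	}
	return false // not yet committed; keep retrying in the background
}
```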

However, if some followers fail to replicate the entry (due to crashes, network issues, etc.), the leader retries the AppendEntries calls indefinitely (even after it has responded to the client) until all the followers eventually store all the log entries.

After committing an entry, the leader does not send a dedicated RPC call to let the followers know. Instead, it includes the index of the latest committed entry in subsequent AppendEntries calls. The followers eventually find out the entry was committed and apply it to their state machines as well.

Since the leader only provides the highest committed index, the followers treat all log entries up to and including that index as committed as well.
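
On the follower side this can be sketched as below (hypothetical names again): every AppendEntries call carries the leader’s commit index, and the follower applies everything up to it that it has not applied yet.

```go
package raft

type Follower struct {
	log         [][]byte // the stored commands
	commitIndex int      // highest index known to be committed
	lastApplied int      // highest index applied to the state machine
}

// advanceCommit runs as part of handling every AppendEntries call.
func (f *Follower) advanceCommit(leaderCommit int) {
	if leaderCommit > f.commitIndex {
		// Everything up to leaderCommit is committed, capped at what
		// this follower actually has in its log.
		f.commitIndex = min(leaderCommit, len(f.log)-1)
	}
	for f.lastApplied < f.commitIndex {
		f.lastApplied++
		apply(f.log[f.lastApplied]) // execute the command on the state machine
	}
}

func apply(cmd []byte) { /* change the variable in memory or on disk */ }
```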

Electing a leader

Raft divides time into periods of variable length, called terms. There is at most one leader in each term.

Normally, the leader sends periodic RPC calls (heartbeats) to the followers to maintain its authority. If a follower does not hear from the leader for a period of time (the election timeout), it assumes there is no viable leader and begins an election to choose a new one.
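
The timer logic can be sketched like this; the channel-based shape is mine, but the randomized timeout range (150–300 ms) comes from the paper, and it is what keeps many followers from timing out in lockstep.

```go
package raft

import (
	"math/rand"
	"time"
)

// runElectionTimer is a sketch of a follower's election timeout loop.
// heartbeats receives a signal for every AppendEntries call from the
// leader; if none arrives in time, the follower starts an election.
func runElectionTimer(heartbeats <-chan struct{}, startElection func()) {
	for {
		// Randomized timeouts (the paper suggests 150-300 ms) make it
		// unlikely that many followers give up on the leader at once.
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeats:
			// Heard from the leader: loop and arm a fresh timer.
		case <-time.After(timeout):
			startElection()
			return
		}
	}
}
```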

It starts a new term (increases the term number by one), transitions to the candidate state (that third type of server I mentioned earlier), votes for itself, and issues RequestVote RPC calls to the other servers.

Each server votes for at most one candidate in a given term, on a first-come-first-served basis. If the candidate receives votes from a majority of the servers, it becomes the leader and sends heartbeat messages to all the other servers to prevent new elections.
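
Putting the steps together, the candidate side of an election might look like this sketch (hypothetical names again; a real implementation sends the RequestVote calls concurrently):

```go
package raft

// VotePeer hides the RequestVote RPC to one server.
type VotePeer interface {
	RequestVote(term, candidateID int) bool // true if the vote is granted
}

type Node struct {
	id       int
	term     int
	state    string // "follower", "candidate", or "leader"
	votedFor int
	peers    []VotePeer // every server except this one
}

func (n *Node) startElection() {
	n.term++              // start a new term
	n.state = "candidate" // transition to the candidate state
	n.votedFor = n.id     // vote for itself

	votes := 1 // its own vote
	for _, p := range n.peers {
		if p.RequestVote(n.term, n.id) {
			votes++
		}
	}

	// A majority of the whole cluster (peers plus itself) wins.
	if votes > (len(n.peers)+1)/2 {
		n.state = "leader"
		// begin sending heartbeats to assert authority
	}
}
```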

While waiting for votes, a candidate may receive an AppendEntries call from another server claiming to be the leader. If the caller’s term is at least as large as the candidate’s current term, the candidate recognizes the caller as the legitimate leader and returns to the follower state. Otherwise, it rejects the call and continues its election.
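
Continuing the Node sketch from above, that rule is a small term comparison:

```go
// onAppendEntries handles a call from a server claiming to be the leader.
func (n *Node) onAppendEntries(callerTerm int) bool {
	if callerTerm >= n.term {
		n.term = callerTerm
		n.state = "follower" // recognize the caller as the legitimate leader
		return true
	}
	return false // stale term: reject the call
}
```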

If no candidate wins a majority (a split vote), the candidates time out and a new term starts with a new election. Because election timeouts are randomized, split votes are rare and resolved quickly.

A term continues until the leader becomes unavailable.

Joining and leaving the cluster

Each server keeps cluster configuration entries in its log and uses the latest one for its decisions. A configuration change happens in two phases. First, the cluster switches to a transitional configuration called the joint consensus. After the joint consensus is committed, the cluster transitions to the target configuration.

The advantage of this approach is that the cluster remains available even while it is transitioning to a new configuration.

While in the transition phase (when the joint consensus is in charge), log entries are replicated to all servers in both configurations, any server from either configuration may serve as the leader, and agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.

When the leader receives a request to change the configuration, it appends a joint configuration entry to its log and replicates it to the other servers. The leader uses the joint configuration’s majority rules to decide when that entry is committed. Once it is committed, neither the old nor the new configuration can make decisions on its own, and only servers with the joint configuration can be elected as the leader.

After the joint consensus is successfully committed, the cluster transitions to the new configuration in the same way.
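
The distinguishing rule of the joint consensus is the double majority. Here is a sketch of the agreement check, under an assumed configuration type:

```go
package raft

// Config holds server IDs; New is empty outside a transition.
type Config struct {
	Old []int
	New []int
}

func hasMajority(acks map[int]bool, servers []int) bool {
	count := 0
	for _, id := range servers {
		if acks[id] {
			count++
		}
	}
	return count > len(servers)/2
}

// Agreed reports whether a set of acknowledgements (or votes) is enough
// to commit an entry or win an election under this configuration.
func (c Config) Agreed(acks map[int]bool) bool {
	if len(c.New) == 0 {
		return hasMajority(acks, c.Old) // the normal, single-configuration case
	}
	// Joint consensus: the old and the new configuration must each
	// reach a majority independently.
	return hasMajority(acks, c.Old) && hasMajority(acks, c.New)
}
```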

One subtle yet important detail is that new servers join the cluster as non-voting members. They act like followers, but their votes do not count. This gives them time to replicate the log and catch up with the others before the configuration change completes.

Log compaction

Logs grow without bound, and over time they take up too much space and slow down processing (for example, replaying the log after a restart). So, Raft includes a mechanism to reduce the log size.

Each server takes snapshots of its state machine’s state from time to time. It writes the snapshot to stable storage and deletes the log entries the snapshot covers, i.e. all entries up to the snapshot. When a follower falls too far behind, the leader may also send it a snapshot (via the InstallSnapshot RPC) so it can catch up quickly.
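
A compaction sketch: as in the paper, the snapshot records the state machine’s state together with the index and term of the last entry it replaces, so that consistency checks on later AppendEntries calls still work.

```go
package raft

type LogEntry struct {
	Term    int
	Command []byte
}

// Snapshot replaces a committed prefix of the log.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot covers
	LastIncludedTerm  int    // term of that entry
	State             []byte // the serialized state machine
}

type Server struct {
	log      []LogEntry
	snapshot Snapshot
}

// compact snapshots the state machine and drops the covered entries.
// (A real implementation re-bases log indexes after the truncation.)
func (s *Server) compact(upTo int, state []byte) {
	s.snapshot = Snapshot{
		LastIncludedIndex: upTo,
		LastIncludedTerm:  s.log[upTo].Term,
		State:             state,
	}
	s.log = append([]LogEntry(nil), s.log[upTo+1:]...)
}
```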

Properties guaranteed by Raft

There are several properties guaranteed by Raft. I strongly suggest reading the details about them in the paper.

Election safety: at most one leader can be elected in a given term. The majority rule ensures this property.

Leader append-only: a leader never overwrites or deletes entries in its log; it only appends new entries.

Log matching: if two logs contain an entry with the same index and term, they store the same command. The logs are also identical in all entries up through the given index.

Leader completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all later terms. This holds because a server only votes for a candidate whose log is at least as up-to-date as its own, so only candidates that contain all committed entries can be elected leader.
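
The “at least as up-to-date” comparison that voters apply is short: compare the terms of the last log entries, and break ties by log length.

```go
package raft

// A voter grants its vote only if the candidate's log passes this check
// (section 5.4.1 of the paper).
func atLeastAsUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm // higher last term wins
	}
	return candLastIndex >= voterLastIndex // equal terms: longer log wins
}
```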

State machine safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Conclusion

Raft is a simple yet powerful protocol. It is used by modern software solutions, either as is or in modified form. The main flow of the protocol is quite understandable. I have tried to give an introduction and a glimpse of how it works. If you find it interesting, the next step should be reading the paper.