Can't we all just agree?

Some of the most often-cited papers in the theory of highly-available distributed systems deal with one topic: distributed consensus. Let's explore distributed consensus and talk about the core algorithms behind these papers, Paxos and Raft.

I've recently started studying the art of building highly-available distributed systems. One of the first papers that I was advised to read was the Paxos paper by Leslie Lamport. The paper originates from 1989 and is one of the earlier papers on building highly-available distributed systems. Let's explore what distributed consensus is and why you should bother understanding it. The algorithms we go over are perfect in their simplicity, but complex in their subtlety.

If you look up the word 'Consensus' in the Cambridge Dictionary you'll find:

"A generally accepted opinion or decision among a group of people."

The original Paxos paper discusses the legislative parliament of the island of Paxos. Paxos' parliament holds its meetings in a chamber with very bad acoustics, so the legislators communicate using messengers. Both the legislators and the messengers are merchants, so they might need to leave the parliament chamber at any point in time. Each legislator can propose a decree to be passed, and if it is passed it is added to the notebook each legislator keeps. A passed decree has an index, e.g., "137: Olive tax is 5% per ton", which states that decree #137 is "Olive tax is 5% per ton".

As you might know, most things in computer science originate from problems in the real world. Think of a system that serves millions of requests per second: all of this data will certainly not come from one database node, we'll have hundreds or thousands of database nodes. But how would we go about keeping them properly in sync?

There are two types of client-server systems that we'll discuss in this article. A single-node architecture, where a client communicates with one server and no additional logic is required to share the machine's state.
And an N-node architecture, where the client interacts with a system that has N nodes. Within this architecture the nodes can either be symmetric, meaning they can all respond to the client and some kind of replication keeps the nodes' state in sync; or asymmetric, meaning the client only interacts with one node and that node is responsible for replicating the state to the other nodes. This node is typically referred to as the 'leader' node. In an N-node architecture the nodes have to share 'updates' about changes to data (state) with all other nodes in the cluster; let's refer to such a system as a replicated state machine.

Distributed consensus ensures that, no matter what, data is propagated to all the nodes (syncing the state). There will never be nodes that think some index contains a different value than the others. Suppose we have a collection of nodes and we want them all to agree on a specific value: we want to achieve consensus. This is exactly what we're trying to do when architecting a highly-available, fault-tolerant distributed system. The collection of nodes works together to achieve a unified goal: to appear as a single node to the end user(s).

This 'specific value' they can agree upon is, for example, a database commit that persists a single value 'v' to disk.

To make this work we need to 'somehow' coordinate the way we achieve consensus, so that every node in a collection of nodes can safely agree on whatever value it's about to agree upon. Woah, that is a mouthful. But fear not, this 'somehow' is what we are going to fill in today by studying papers from Leslie Lamport, Diego Ongaro and John Ousterhout.

What do these algorithms offer us?

A distributed consensus algorithm, in short, makes sure that all nodes in a cluster agree on accepting the same value after a proposal of that value has been made. Any process can make its own proposal, and depending on the accepting nodes it's either propagated to all accepting nodes or rejected. The accepting nodes will only accept a proposal if a majority (> 50%) of the accepting nodes agrees on accepting the proposed value (a consensus).

A replicated state machine can be built by maintaining a distributed command log where the command at each position in the log is decided by the distributed consensus algorithm.
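
To make the replicated-state-machine idea concrete, here's a minimal Python sketch (the command names and key-value state are made up for illustration): as long as every node applies the same command log in the same order, every node ends up in the same state.

```python
class KeyValueStateMachine:
    """A toy state machine: applying the same command log in the same order
    on every node yields the same state on every node."""

    def __init__(self):
        self.state = {}
        self.last_applied = -1  # index of the last log entry we applied

    def apply(self, log):
        """Apply any log entries we haven't applied yet, in log order."""
        for index in range(self.last_applied + 1, len(log)):
            op, key, value = log[index]
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.last_applied = index


# The consensus algorithm's only job is to make every node see this exact log.
log = [("set", "olive_tax", "5%"), ("set", "wine_tax", "2%"), ("delete", "wine_tax", None)]
node_a, node_b = KeyValueStateMachine(), KeyValueStateMachine()
node_a.apply(log)
node_b.apply(log)
assert node_a.state == node_b.state == {"olive_tax": "5%"}
```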

It's a protocol

Leslie Lamport provided us with a set of protocols for achieving consensus in his paper The Part-Time Parliament. It's important to note that a distributed consensus protocol such as Paxos is a set of 'rules' that help you when implementing a distributed consensus algorithm, rather than a prescription for implementing it in one specific way. There's not one 'right' way to implement distributed consensus; there are actually many distributed consensus algorithms out there. Think of Raft, Multi-Paxos, EPaxos, etc.

Proposing and agreeing

For proposing and agreeing I suggest we take Paxos as an example. In a collection of nodes we'll have a set of nodes that concurrently propose values. Let's refer to these as the 'proposers'.

We'll also have processes called 'acceptors'. These are responsible for accepting a proposed value. Depending on the number of nodes that accept a proposed value, consensus can be established.

In the simplest form of a consensus protocol there are N proposers and a single acceptor. Multiple proposers propose values to the single acceptor, and the acceptor accepts values in 'some' kind of order.
We face one obvious problem with this architecture: the single acceptor is a single point of failure. If the single acceptor node goes down after accepting a value, we don't know which value it last accepted and can only continue accepting proposed values once the single acceptor is back up. Let's discuss a more fault-tolerant approach where we can keep running our consensus protocol even if multiple nodes crash.

Fault tolerance in Paxos

When architecting for fault tolerance, high availability immediately comes to my mind. In a distributed system a node is just a 'unit' and nothing more, but we need a few of them to make the bigger picture work. You probably know that two nodes aren't enough¹, so you need at least three nodes to make most distributed algorithms function.

Time to add additional acceptors. If a proposer makes a proposal it will be sent to a majority of the acceptors. Now, besides 'accepting' and 'proposing' values, a value can also be 'chosen'. We consider a value to be 'chosen' once a quorum of acceptors has accepted the proposed value.
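
As a small illustration (a sketch of the definition, not code from the paper), 'chosen' just means that the number of acceptors that accepted a given value reaches a quorum, i.e. a strict majority of all acceptors:

```python
def quorum(num_acceptors: int) -> int:
    """The smallest number of acceptors that forms a strict majority."""
    return num_acceptors // 2 + 1


def chosen_value(accepted: dict, num_acceptors: int):
    """Return the chosen value, or None if no value has a quorum yet.
    `accepted` maps acceptor id -> the value that acceptor accepted."""
    counts = {}
    for value in accepted.values():
        counts[value] = counts.get(value, 0) + 1
    for value, count in counts.items():
        if count >= quorum(num_acceptors):
            return value
    return None


# 5 acceptors, 3 of them accepted "v1": "v1" is chosen.
print(chosen_value({"a1": "v1", "a2": "v1", "a3": "v1", "a4": "v2"}, 5))  # v1
```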

"Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value."

Above is a citation from the Paxos Made Simple paper published by Leslie Lamport. We can distill a few Paxos protocol requirements from this citation:

  • Only a value that has actually been proposed may be chosen.
  • Only a single value is chosen.
  • A process never learns that a value has been chosen unless it actually has been.

But what if the accepted values end up evenly split among the acceptors? If three proposers each propose a value and every acceptor accepts a different value, we end up with no majority of accepted values. Because this problem could easily arise, we should rethink the acceptance flow.

We could fix this by splitting Paxos into a two-phase protocol. Before proposing a value we contact the acceptors and ask them whether they've already accepted a value. If that's the case, the proposer must propose that already accepted value instead of its own. The protocol is now split up into two phases: first we perform a check, and then we propose a value.

What if proposer P1 passes the checking phase and another proposer P2 meanwhile proposes a different value, which is accepted by the majority of the acceptors? This could easily happen, as we work with a lot of concurrent bidirectional communication, and it would result in acceptors accepting two different values. The consensus protocol should not allow this; it should abort any competing proposals by ordering newer proposals over old ones. If P1 then tried to get its value accepted, its proposal would be aborted, as P2 had already established consensus with the acceptors.

It's important that acceptors persist the accepted value to a command log on disk, as they need to keep track of the information they received from proposers so it can be retrieved after a reboot.

What do we know so far about Paxos

We've covered quite a lot already. In case it's still too abstract, let's briefly recap what we know so far about the two-phase Paxos protocol:

Phase 1
  • A proposer asks the acceptors whether they have already accepted a value.
  • If an acceptor has already accepted a value, the proposer must propose that value instead of its own.

Phase 2
  • The proposer asks the acceptors to accept the proposed value.
  • If a majority of acceptors accept, then we achieved a consensus.

How Paxos ensures proposals are made properly

In Paxos, proposals should come with a unique identifier; no two proposers may propose a value with the same identifier. The unique identifier must also be bigger than any previously used unique identifier in the cluster. I can think of one quickly: a PID + Unix epoch timestamp, e.g. '431-1631401257'.
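
Here's a rough sketch of such an identifier scheme in Python, mirroring the PID + epoch example above (this exact scheme is mine, not the paper's): comparing by timestamp first and using the PID as a tie-breaker gives a total order in which different proposers never produce equal identifiers.

```python
import os
import time
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ProposalID:
    # Ordered by timestamp first, PID second, so identifiers form a total
    # order and two proposers (different PIDs) never generate equal ones.
    timestamp: int
    pid: int

    def __str__(self) -> str:
        return f"{self.pid}-{self.timestamp}"


def new_proposal_id() -> ProposalID:
    # Many real implementations use a per-node monotonic counter instead of
    # a wall-clock timestamp; this simply mirrors the example above.
    return ProposalID(int(time.time()), os.getpid())


print(new_proposal_id())  # e.g. 431-1631401257

older = ProposalID(1631401257, 431)
newer = ProposalID(1631401300, 212)
print(older, "<", newer, "->", older < newer)  # True: newer proposals win
```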

In the 1st phase the proposer sends a Prepare(431-1631401257) message to at least a majority of acceptors. Each acceptor that receives the Prepare(431-1631401257) message compares the unique identifier in the message against the highest unique identifier it has received so far. If 431-1631401257 is bigger, the acceptor overwrites its stored 'previous_unique_identifier' with 431-1631401257 and responds with a Promise() message to the proposer; if it has already accepted another proposal, it instead responds with Promise(unique_identifier, accepted_unique_identifier, accepted_value) so the proposer learns about that accepted value. If 431-1631401257 is not bigger than the previous unique identifier it received, it responds with a failure message.

After the proposer has proposed a unique_identifier through the Prepare() message, it needs a majority of responses for its proposal to continue. The proposer has to look through each response to see if any acceptor replied with a Promise(unique_identifier, accepted_unique_identifier, accepted_value) message. If that's the case, the proposer is forced to adopt the value belonging to the highest accepted_unique_identifier from the acceptor responses. If not, the proposer can continue and propose its own value by proceeding with phase 2.
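
Putting that phase 1 exchange into code, here's a minimal sketch of the acceptor side (the class, field and message names are my own; proposal ids are plain integers here, but they could just as well be the PID + timestamp pairs from earlier):

```python
class Acceptor:
    """A single Paxos acceptor. Field and message names are illustrative."""

    def __init__(self):
        self.promised_id = None    # highest proposal id we promised to honour
        self.accepted_id = None    # id of the proposal we last accepted, if any
        self.accepted_value = None

    def handle_prepare(self, proposal_id):
        """Phase 1: Prepare(proposal_id) -> Promise or failure."""
        if self.promised_id is None or proposal_id > self.promised_id:
            # Promise to ignore anything older, and report what (if anything)
            # we already accepted so the proposer is forced to re-propose it.
            self.promised_id = proposal_id
            return {"promise": True,
                    "accepted_id": self.accepted_id,
                    "accepted_value": self.accepted_value}
        # We already promised a higher-numbered proposal: reject.
        return {"promise": False, "promised_id": self.promised_id}


acceptor = Acceptor()
print(acceptor.handle_prepare(5))  # promise granted, nothing accepted yet
print(acceptor.handle_prepare(3))  # rejected, we already promised id 5
```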

As just mentioned, at the start of phase 2 the proposer checks whether any of the acceptor responses contained an accepted_value. If that's the case, the proposer picks the accepted_value belonging to the highest accepted_unique_identifier from the acceptor responses, wraps it in a Propose(unique_identifier, accepted_value) message and sends it to the acceptors. If not, it's free to propose a value v.

Suppose a proposer proposes its own value. The proposer notifies² the acceptors to accept the proposal along with the value: Propose(unique_identifier, value). Each acceptor checks whether the unique identifier from the Propose() message is still the highest unique identifier it has seen so far. As long as no higher-numbered proposals have arrived in the meantime, the acceptor replies with an Accepted() message and sends Accepted(unique_identifier, value) to all of the learners. When the proposer receives a majority of Accepted() responses it knows that consensus has been achieved on the proposal.
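
And a matching sketch of phase 2, continuing the Acceptor sketch above (again with made-up names): the proposer's rule for which value to send, the acceptor's check before accepting, and the majority check that establishes consensus.

```python
def value_to_propose(promises, my_value):
    """Proposer, start of phase 2: if any Promise reported an already accepted
    value, re-propose the value of the highest-numbered accepted proposal;
    otherwise the proposer is free to propose its own value."""
    accepted = [(p["accepted_id"], p["accepted_value"])
                for p in promises if p.get("accepted_id") is not None]
    if accepted:
        return max(accepted)[1]
    return my_value


def handle_propose(acceptor, proposal_id, value):
    """Acceptor, phase 2: accept unless a higher-numbered Prepare arrived
    in the meantime (uses the Acceptor from the phase 1 sketch)."""
    if acceptor.promised_id is None or proposal_id >= acceptor.promised_id:
        acceptor.accepted_id = proposal_id
        acceptor.accepted_value = value
        return ("accepted", proposal_id, value)
    return ("rejected", acceptor.promised_id)


def consensus_reached(responses, num_acceptors):
    """Proposer: the proposal is chosen once a majority replied 'accepted'."""
    return sum(1 for r in responses if r[0] == "accepted") > num_acceptors // 2
```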

Leader election

When it comes to leader election I'd like to take Raft as an example. Raft is a distributed consensus protocol that was designed for better understandability of how consensus can be established. The Raft protocol was mainly developed because the Paxos protocol is very difficult to understand and implement. The paper that introduces Raft is called In Search of an Understandable Consensus Algorithm and is actually a great read.

Leader election in a distributed system is very interesting. It's also the heart of many algorithms that try to achieve consensus (e.g. Raft). Like most leaders, we make them feel special and give them additional powers so that they can coordinate whatever they lead. Coordinating can refer to assigning a process to perform some work, assigning a certain responsibility to another process, etc.

Reducing coordination between individual nodes and simplifying the architecture are some of the benefits that leader election comes with. Leader election is not free in terms of fault tolerance, though: it introduces new points of failure³.

Raft states that any node in a cluster can be in one of the following three states: leader, candidate or follower. Clients communicate with the leader (if a client calls a follower, the follower redirects it to the leader). Candidates can propose to become leaders through elections, and followers only respond to candidates or leaders.

Raft divides time into terms of arbitrary length, each beginning with an election. If a candidate node wins the election, it remains the leader node for the rest of the term. If the vote is split, then that term ends without a leader. Every node in the cluster that runs Raft stores the term identifier locally. The term identifier is exchanged in all communication between nodes: if one node's current term identifier is smaller than the other's, it updates its current term identifier to the larger value. If a candidate node or leader node discovers that its term identifier is outdated, it immediately turns into a follower node. If a node receives a request with a stale term identifier, it rejects the request.
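
A small sketch of that term bookkeeping on a single node (the state and method names are mine; the rules themselves come from the Raft paper):

```python
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"


class RaftNode:
    def __init__(self):
        self.current_term = 0
        self.state = FOLLOWER

    def observe_term(self, term_in_message):
        """Apply Raft's term rules to the term carried by any incoming message."""
        if term_in_message > self.current_term:
            # Our term is outdated: adopt the newer term and step down.
            self.current_term = term_in_message
            self.state = FOLLOWER
            return "accepted"
        if term_in_message < self.current_term:
            # The sender's term is stale: reject its request.
            return "rejected"
        return "accepted"


node = RaftNode()
node.state, node.current_term = LEADER, 4
print(node.observe_term(7), node.state)  # accepted follower -- stepped down
print(node.observe_term(3))              # rejected -- stale term
```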

Raft's candidates use RequestVote() during elections and leaders use AppendEntries() for replicating log entries and performing heartbeats. A leader sends heartbeats to its followers every N⁴ ms to maintain authority. Followers maintain a timeout interval to keep track of the leader's heartbeat calls. If a follower times out while expecting a heartbeat from the leader, it transitions into the candidate state, increases its term identifier, starts an election, votes for itself and sends RequestVote() to all other nodes.
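
Here's a rough sketch of the follower side of that: a randomized election timeout in the commonly quoted 150-300 ms range, a heartbeat handler, and a tick that starts an election when the leader goes quiet. The `request_votes` callback stands in for the RequestVote() RPCs and is stubbed out here.

```python
import random
import time


class Follower:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.current_term = 0
        self.election_timeout = random.uniform(0.150, 0.300)  # seconds
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, term):
        """Leaders send (possibly empty) AppendEntries calls as heartbeats."""
        if term >= self.current_term:
            self.current_term = term
            self.last_heartbeat = time.monotonic()

    def tick(self, request_votes):
        """Called periodically. If the leader went quiet for longer than the
        election timeout, become a candidate and start an election."""
        if time.monotonic() - self.last_heartbeat < self.election_timeout:
            return "follower"
        self.current_term += 1                        # new term, new election
        votes = 1 + request_votes(self.current_term)  # vote for ourselves first
        return "leader" if votes > self.cluster_size // 2 else "candidate"


node = Follower(cluster_size=5)
time.sleep(0.35)                   # no heartbeats arrive...
print(node.tick(lambda term: 2))   # 3 of 5 votes -> "leader"
```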

Log replication

A log is the simplest possible storage abstraction in a distributed system. It's an append-only, totally-ordered sequence of records ordered by time. You can't understand databases, replication, version control or any distributed software without understanding logs. Records are appended to the end of the log, and reads proceed left-to-right. As you can imagine, a log is used to capture state changes, and for distributed systems this is, in many ways, the very problem they face. Most databases maintain a log in order to capture any change that's made to the state of the data (whether it's a modification, insertion or deletion).

Over time the usage of the log grew from an implementation detail of ACID into a method for replicating data between databases. I'm sure you can already think of a few reasons why a log might be useful in a database cluster.

"If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state."

That's still a bit too abstract, right? The distributed log can be seen as the data structure which models the problem of consensus.

In Raft the leader maintains the log and clients interact only with the leader. When the leader receives a client request, it creates a command containing the term identifier and an index. The leader tries to replicate the command to a majority of the followers. If the replication is successful, then the command is committed to the cluster and the response is sent back to the client.
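
Sketched in code, the leader's commit path could look roughly like this (the `replicate` callback stands in for the AppendEntries() RPCs to the followers and is stubbed out):

```python
class Leader:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.current_term = 1
        self.log = []           # list of (term, command) entries
        self.commit_index = -1  # highest log index known to be committed

    def handle_client_request(self, command, replicate):
        """Append the command, replicate it, and commit once a majority has it.
        `replicate(index, entry)` returns how many followers stored the entry."""
        index = len(self.log)
        entry = (self.current_term, command)
        self.log.append(entry)
        acks = 1 + replicate(index, entry)  # the leader itself counts too
        if acks > self.cluster_size // 2:
            self.commit_index = index
            return f"committed at index {index}"
        return "not committed yet"


leader = Leader(cluster_size=3)
print(leader.handle_client_request("set x=1", lambda i, e: 1))  # 2 of 3 -> committed
```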

Let's go over what actually happens with the log when a leader receives a client request.

Suppose one of the follower nodes had a fail-over and is rebooting. There's a high chance that the log of that particular follower node does not match the log of the current leader. It's up to the current leader to take action and resolve that inconsistency. When the follower node is back up and receives a leader event (AppendEntries()) containing the term identifier and log index of the previous log entry, it returns a failure response because of the index mismatch.

The leader keeps track of a property called 'nextIndex' for each follower. If follower F1 has a log that's inconsistent with the leader's, the leader decrements the nextIndex value and retries the AppendEntries() RPC for that command. This procedure continues until the follower's log becomes consistent with the leader's log.
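
The nextIndex backtracking can be sketched like this (RPCs are replaced by plain function calls, and the follower simply overwrites its log from the matching point onwards, which is a simplification of what the paper describes):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side of AppendEntries(): the consistency check. Reject if our
    log has no entry at prev_index whose term matches prev_term."""
    if prev_index >= 0:
        if prev_index >= len(follower_log) or follower_log[prev_index][0] != prev_term:
            return False
    # Check passed: everything after prev_index is replaced with the leader's
    # entries, resolving any divergence in one go (simplified).
    del follower_log[prev_index + 1:]
    follower_log.extend(entries)
    return True


def repair_follower(leader_log, follower_log):
    """Leader side: decrement nextIndex until AppendEntries() succeeds."""
    next_index = len(leader_log)
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index][0] if prev_index >= 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[next_index:]):
            return follower_log
        next_index -= 1  # follower rejected: retry one entry earlier


# Log entries are (term, command). The follower diverged after index 0.
leader_log = [(1, "set x=1"), (2, "set y=2"), (3, "set x=3")]
stale_follower = [(1, "set x=1"), (1, "set z=9")]
print(repair_follower(leader_log, stale_follower) == leader_log)  # True
```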

Safety

All of the above covers enough to build a happy-flow implementation with a bit of fault tolerance. Obviously there's a bit more to it when trying to achieve consensus. Marc Brooker actually wrote a good article on why Consensus is Harder Than It Looks. As he mentions at the beginning, there are three categories of challenges that people tend to completely overlook. I recommend you go over them and keep them in the back of your head when learning about or implementing a consensus protocol.

Paxos, Raft and other consensus algorithms each have their own ways of being fault tolerant and guaranteeing 'safety'. I recommend reading up on these yourself, as it's not the intention of this article to explain them.

Conclusion

It has been a long journey for distributed computing to get this far. We concluded that Paxos is one of the fundamental building blocks of distributed computing and that many consensus algorithms are based upon it. Building distributed systems is not something you learn in a day or two. Learning about distributed consensus properly will take quite some time and is not as easy as importing a Raft library from GitHub. If you want to dive deeper into consensus algorithms I recommend reading the Paxos and Raft papers, but also the variants Multi-Paxos and EPaxos. I'd like to end this article with a quote.

"Despite these challenges, consensus is an important building block in building highly-available systems. Distribution makes building HA systems easier. It's a tool, not a solution." - Marc Brooker


Footnotes

  • 1. Here I refer to the term split-brain. Three nodes can monitor each other and tell whether a peer has crashed. Unfortunately, any number below that can't distinguish a crash of a peer from broken communication with a peer.

  • 2. By notifying I refer to the usage of RPC (remote procedure calls).

  • 3. Leader election comes at a cost and it does not suit every architecture. There's an article on the Amazon Builders' Library called 'Leader Election in Distributed Systems' that captures the considerable downsides of leader election. Great read.

  • 4. Raft's followers expect a heartbeat to be sent every 150 to 300 ms before timing out and promoting themselves to candidates. I'm unsure about the exact interval of a leader's heartbeat (this varies per implementation).


Did you enjoy this read? Feel free to buy me a coffee! :)

Contact me? You can do that through blog@bschaatsbergen.com or LinkedIn.

If you're looking for other articles I recommend you to look in the library.