8.5 KiB
layout | page_title | sidebar_current |
---|---|---|
docs | Consensus Protocol | docs-internals-consensus |
Consensus Protocol
Consul uses a consensus protocol to provide Consistency and Availability as defined by CAP. This page documents the details of this internal protocol. The consensus protocol is based on "Raft: In search of an Understandable Consensus Algorithm".
Raft Protocol Overview
Raft is a relatively new consensus algorithm that is based on Paxos, but is designed to have fewer states and a simpler, more understandable algorithm. There are a few key terms to know when discussing Raft:
-
Log - The primary unit of work in a Raft system is a log entry. The problem of consistency can be decomposed into a replicated log. A log is an ordered sequence of entries. We consider the log consistent if all members agree on the entries and their order.
-
FSM - Finite State Machine. An FSM is a collection of finite states with transitions between them. As new logs are applied, the FSM is allowed to transition between states. Application of the same sequence of logs must result in the same state, meaning non-deterministic behavior is not permitted.
-
Peer set - The peer set is the set of all members participating in log replication. For Consul's purposes, all server nodes are in the peer set of the local datacenter.
-
Quorum - A quorum is a majority of members from a peer set, or (n/2)+1. For example, if there are 5 members in the peer set, we would need 3 nodes to form a quorum. If a quorum of nodes is unavailable for any reason, then the cluster becomes unavailable, and no new logs can be committed.
-
Committed Entry - An entry is considered committed when it is durably stored on a quorum of nodes. Once an entry is committed it can be applied.
-
Leader - At any given time, the peer set elects a single node to be the leader. The leader is responsible for ingesting new log entries, replicating to followers, and managing when an entry is considered committed.
Raft is a complex protocol, and will not be covered here in detail. For the full specification, we recommend reading the paper. We will attempt to provide a high level description, which may be useful for building a mental picture.
Raft nodes are always in one of three states: follower, candidate or leader. All nodes initially start out as a follower. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to a leader. The leader must accept new log entries and replicate to all the other followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.
Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry, which is an opaque binary blob to Raft. The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific, and in Consul's case, we use LMDB to maintain cluster state.
An obvious question relates to the unbounded nature of a replicated log. Raft provides a mechanism by which the current state is snapshotted, and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time, and then remove all the logs that were used to reach that state. This is performed automatically without user intervention, and prevents unbounded disk usage as well as minimizing time spent replaying logs. One of the advantages of using LMDB is that it allows Consul to continue accepting new transactions even while old state is being snapshotted, preventing any availability issues.
Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add, or remove a node, or commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B, and to restart the remaining node in bootstrap mode.
A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to either run 3 or 5 Consul servers per datacenter. This maximizes availability without greatly sacrificing performance. See below for a deployment table.
In terms of performance, Raft is comprable to Paxos. Assuming stable leadership, a committing a log entry requires a single round trip to half of the cluster. Thus performance is bound by disk I/O and network latency. Although Consul is not designed to be a high-throughput write system, it should handle on the order of hundreds to thousands of transactions per second depending on network and hardware configuration.
Raft in Consul
Only Consul server nodes participate in Raft, and are part of the peer set. All client nodes forward requests to servers. Part of the reason for this design is that as more members are added to the peer set, the size of the quorum also increases. This introduces performance problems as you may be waiting for hundreds of machines to agree on an entry instead of a handful.
When getting started, a single Consul server is put into "bootstrap" mode. This mode allows it to self-elect as a leader. Once a leader is elected, other servers can be added to the peer set in a way that preserves consistency and safety. Eventually, bootstrap mode can be disabled, once the first few servers are added.
Since all servers participate as part of the peer set, they all know the current leader. When an RPC request arrives at a non-leader server, the request is forwarded to the leader. If the RPC is a query type, meaning it is read-only, then the leader generates the result based on the current state of the FSM. If the RPC is a transaction type, meaning it modifies state, then the leader generates a new log entry and applies it using Raft. Once the log entry is committed and applied to the FSM, the transaction is complete.
Because of the nature of Raft's replication, performance is sensitive to network latency. For this reason, each datacenter elects an independent leader, and maintains a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible only for data in their datacenter. When a request is received for a remote datacenter, the request is forwarded to the correct leader. This design allows for lower latency transactions and higher availability without sacrificing consistency.
Deployment Table
Below is a table that shows for the number of servers how large the quorum is, as well as how many node failures can be tolerated. The recommended deployment is either 3 or 5 servers.
Servers | Quorum Size | Failure Tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |