website: documenting the internals

2014-02-20 12:26:50 -08:00 · 2014-02-20 12:26:50 -08:00 · c975f5b968
parent 30bc19f002
commit c975f5b968
4 changed files with 202 additions and 242 deletions
--- a/website/source/docs/internals/architecture.html.markdown
+++ b/website/source/docs/internals/architecture.html.markdown
@@ -82,9 +82,9 @@ slower as more machines are added. However, there is no limit to the number of c
 and they can easily scale into the thousands or tens of thousands.

 All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
-This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
+This means there is a gossip pool that contains all the nodes for a given datacenter. This serves
 a few purposes: first, there is no need to configure clients with the addresses of servers,
-that discovery is done automatically using Serf. Second, the work of detecting node failures
+as discovery is done automatically. Second, the work of detecting node failures
 is not placed on the servers but is distributed. This makes the failure detection much more
 scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
 when important events such as leader election take place.
--- a/website/source/docs/internals/consensus.html.markdown
+++ b/website/source/docs/internals/consensus.html.markdown
@@ -6,98 +6,177 @@ sidebar_current: "docs-internals-consensus"

 # Consensus Protocol

-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
-["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
+to provide [Consistency and Availability](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
+This page documents the details of this internal protocol. The consensus protocol is based on
+["Raft: In search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).

 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>

-## SWIM Protocol Overview
+## Raft Protocol Overview

-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
+Raft is a relatively new consensus algorithm that is based on Paxos,
+but is designed to have fewer states and a simpler, more understandable
+algorithm. There are a few key terms to know when discussing Raft:

-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
+* Log - The primary unit of work in a Raft system is a log entry. The problem
+of consistency can be decomposed into a *replicated log*. A log is an ordered
+sequence of entries. We consider the log consistent if all members agree on
+the entries and their order.

-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
+* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
+An FSM is a collection of finite states with transitions between them. As new logs
+are applied, the FSM is allowed to transition between states. Application of the
+same sequence of logs must result in the same state, meaning non-deterministic
+behavior is not permitted.

-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
+* Peer set - The peer set is the set of all members participating in log replication.
+For Consul's purposes, all server nodes are in the peer set of the local datacenter.

-## SWIM Modifications
+* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1 using integer division.
+For example, if there are 5 members in the peer set, we would need 3 nodes
+to form a quorum. If a quorum of nodes is unavailable for any reason, then the
+cluster becomes *unavailable*, and no new logs can be committed.

-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
+* Committed Entry - An entry is considered *committed* when it is durably stored
+on a quorum of nodes. Once an entry is committed it can be applied.

-The changes from SWIM are noted here:
+* Leader - At any given time, the peer set elects a single node to be the leader.
+The leader is responsible for ingesting new log entries, replicating to followers,
+and managing when an entry is considered committed.

-* Serf does a full state sync over TCP periodically. SWIM only propagates
-  changes over gossip. While both are eventually consistent, Serf is able to
-  more quickly reach convergence, as well as gracefully recover from network
-  partitions.
+Raft is a complex protocol, and will not be covered here in detail. For the full
+specification, we recommend reading the paper. We will attempt to provide a high
+level description, which may be useful for building a mental picture.

-* Serf has a dedicated gossip layer separate from the failure detection
-  protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-  Serf uses piggybacking along with dedicated gossip messages. This
-  feature lets you have a higher gossip rate (for example once per 200ms)
-  and a slower failure detection rate (such as once per second), resulting
-  in overall faster convergence rates and data propagation speeds.
+Raft nodes are always in one of three states: follower, candidate or leader. All
+nodes initially start out as a follower. In this state, nodes can accept log entries
+from a leader and cast votes. If no entries are received for some time, nodes
+self-promote to the candidate state. In the candidate state nodes request votes from
+their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
+The leader must accept new log entries and replicate to all the other followers.
+In addition, if stale reads are not acceptable, all queries must also be performed on
+the leader.
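The state transitions described above can be sketched as a small function. This is an illustrative model only, not Consul's implementation: the names `State`, `quorum`, and `next_state` are hypothetical, and only the transitions mentioned in the text are modeled (election timeouts, terms, and leaders stepping down are left out).

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def quorum(peers: int) -> int:
    """Majority of the peer set: (n/2)+1 with integer division."""
    return peers // 2 + 1


def next_state(state: State, *, heard_from_leader: bool, votes: int, peers: int) -> State:
    """Return the next state for a node (hypothetical helper).

    Only the transitions described in the text are modeled:
    follower -> candidate when no entries arrive for some time,
    candidate -> leader when a quorum of votes is received.
    """
    if state is State.FOLLOWER:
        # A follower that stops hearing from a leader self-promotes.
        return State.FOLLOWER if heard_from_leader else State.CANDIDATE
    if state is State.CANDIDATE:
        # A candidate needs votes from a quorum of the peer set.
        return State.LEADER if votes >= quorum(peers) else State.CANDIDATE
    # Leader behavior after election is outside the scope of this sketch.
    return state
```

For example, in a peer set of 3, a candidate holding 2 votes (its own plus one peer) has reached quorum and becomes the leader.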

-* Serf keeps the state of dead nodes around for a set amount of time,
-  so that when full syncs are requested, the requester also receives information
-  about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-  state immediately upon learning that the node is dead. This change again helps
-  the cluster converge more quickly.
+Once a cluster has a leader, it is able to accept new log entries. A client can
+request that a leader append a new log entry, which is an opaque binary blob to
+Raft. The leader then writes the entry to durable storage and attempts to replicate
+to a quorum of followers. Once the log entry is considered *committed*, it can be
+*applied* to a finite state machine. The finite state machine is application specific,
+and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.

-## Serf-Specific Messages
+An obvious question relates to the unbounded nature of a replicated log. Raft provides
+a mechanism by which the current state is snapshotted, and the log is compacted. Because
+of the FSM abstraction, restoring the state of the FSM must result in the same state
+as a replay of old logs. This allows Raft to capture the FSM state at a point in time,
+and then remove all the logs that were used to reach that state. This is performed automatically
+without user intervention, and prevents unbounded disk usage as well as minimizing
+time spent replaying logs. One of the advantages of using LMDB is that it allows Consul
+to continue accepting new transactions even while old state is being snapshotted,
+preventing any availability issues.
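The snapshot and compaction property described above can be illustrated with a toy key/value state machine. This is a minimal sketch under stated assumptions: `KVStateMachine` and its entry format are hypothetical and stand in for the real FSM, which the text says is backed by LMDB. The point it shows is that restoring a snapshot and replaying only the remaining entries yields the same state as replaying the full log.

```python
class KVStateMachine:
    """Toy deterministic FSM: the same entries in the same order give the same state."""

    def __init__(self):
        self.data = {}

    def apply(self, entry):
        # Each entry is a hypothetical (operation, key, value) tuple.
        op, key, value = entry
        if op == "set":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)

    def snapshot(self):
        # Capture the state at a point in time as an independent copy.
        return dict(self.data)

    @classmethod
    def restore(cls, snap):
        fsm = cls()
        fsm.data = dict(snap)
        return fsm


log = [("set", "a", "1"), ("set", "b", "2"), ("delete", "a", None), ("set", "c", "3")]

# Baseline: replay the entire log from scratch.
replayed = KVStateMachine()
for entry in log:
    replayed.apply(entry)

# Snapshot after the first three entries, then discard them (compaction).
partial = KVStateMachine()
for entry in log[:3]:
    partial.apply(entry)
snap = partial.snapshot()
remaining = log[3:]

# Restore from the snapshot and replay only what is left of the log.
restored = KVStateMachine.restore(snap)
for entry in remaining:
    restored.apply(entry)

assert restored.data == replayed.data
```

Because the FSM is deterministic, the three compacted entries are no longer needed once the snapshot exists.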

-On top of the SWIM-based gossip layer, Serf sends some custom message types.
+Lastly, there is the issue of updating the peer set when new servers are joining
+or existing servers are leaving. As long as a quorum of nodes are available, this
+is not an issue as Raft provides mechanisms to dynamically update the peer set.
+If a quorum of nodes is unavailable, then this becomes a very challenging issue.
+For example, suppose there are only 2 peers, A and B. The quorum size is also
+2, meaning both nodes must agree to commit a log entry. If either A or B fails,
+it is now impossible to reach quorum. This means the cluster is unable to add
+or remove a node, or commit any additional log entries. This results in *unavailability*.
+At this point, manual intervention would be required to remove either A or B,
+and to restart the remaining node in bootstrap mode.

-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
+A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
+of 5 can tolerate 2 node failures. The recommended configuration is to run
+either 3 or 5 Consul servers per datacenter. This maximizes availability without
+greatly sacrificing performance. See below for a deployment table.

-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
+In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
+committing a log entry requires a single round trip to half of the cluster.
+Thus performance is bound by disk I/O and network latency. Although Consul is
+not designed to be a high-throughput write system, it should handle on the order
+of hundreds to thousands of transactions per second depending on network and
+hardware configuration.

-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
+## Raft in Consul
+
+Only Consul server nodes participate in Raft, and are part of the peer set. All
+client nodes forward requests to servers. Part of the reason for this design is
+that as more members are added to the peer set, the size of the quorum also increases.
+This introduces performance problems as you may be waiting for hundreds of machines
+to agree on an entry instead of a handful.
+
+When getting started, a single Consul server is put into "bootstrap" mode. This mode
+allows it to self-elect as a leader. Once a leader is elected, other servers can be
+added to the peer set in a way that preserves consistency and safety. Eventually,
+bootstrap mode can be disabled, once the first few servers are added.
+
+Since all servers participate as part of the peer set, they all know the current
+leader. When an RPC request arrives at a non-leader server, the request is
+forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
+then the leader generates the result based on the current state of the FSM. If
+the RPC is a *transaction* type, meaning it modifies state, then the leader
+generates a new log entry and applies it using Raft. Once the log entry is committed
+and applied to the FSM, the transaction is complete.
+
+Because of the nature of Raft's replication, performance is sensitive to network
+latency. For this reason, each datacenter elects an independent leader, and maintains
+a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
+only for data in their datacenter. When a request is received for a remote datacenter,
+the request is forwarded to the correct leader. This design allows for lower latency
+transactions and higher availability without sacrificing consistency.
+
+## Deployment Table
+
+The table below shows, for each number of servers, how large the
+quorum is and how many node failures can be tolerated. The
+recommended deployment is either 3 or 5 servers.
+
+<table class="table table-bordered table-striped">
+<tr>
+<th>Servers</th>
+<th>Quorum Size</th>
+<th>Failure Tolerance</th>
+</tr>
+<tr>
+<td>1</td>
+<td>1</td>
+<td>0</td>
+</tr>
+<tr>
+<td>2</td>
+<td>2</td>
+<td>0</td>
+</tr>
+<tr>
+<td><b>3</b></td>
+<td>2</td>
+<td>1</td>
+</tr>
+<tr>
+<td>4</td>
+<td>3</td>
+<td>1</td>
+</tr>
+<tr>
+<td><b>5</b></td>
+<td>3</td>
+<td>2</td>
+</tr>
+<tr>
+<td>6</td>
+<td>4</td>
+<td>2</td>
+</tr>
+<tr>
+<td>7</td>
+<td>4</td>
+<td>3</td>
+</tr>
+</table>

-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
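The deployment table added to the consensus page follows directly from the quorum formula (n/2)+1. As a rough sketch, the rows can be recomputed as shown here; the function names are hypothetical and chosen only for illustration.

```python
def quorum_size(servers: int) -> int:
    """Majority of the peer set: (n/2)+1 with integer division."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a quorum is still reachable."""
    return servers - quorum_size(servers)


def deployment_table(max_servers: int = 7):
    """Rows of (servers, quorum size, failure tolerance)."""
    return [(n, quorum_size(n), failure_tolerance(n)) for n in range(1, max_servers + 1)]


if __name__ == "__main__":
    for row in deployment_table():
        print(row)
```

Note that going from 3 to 4 servers (or from 5 to 6) raises the quorum size without raising the failure tolerance, which is consistent with the recommendation of 3 or 5 servers.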
--- a/website/source/docs/internals/gossip.html.markdown
+++ b/website/source/docs/internals/gossip.html.markdown
@@ -6,98 +6,39 @@ sidebar_current: "docs-internals-gossip"

 # Gossip Protocol

-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
+Consul uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
+to manage membership and broadcast messages to the cluster. All of this is provided
+through the use of the [Serf library](http://www.serfdom.io/). The gossip protocol
+used by Serf is based on
 ["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+with a few minor adaptations. There are more details about [Serf's protocol here](http://www.serfdom.io/docs/internals/gossip.html).

 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>

-## SWIM Protocol Overview
+## Gossip in Consul

-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
+Consul makes use of two different gossip pools. We refer to each pool as the
+LAN or WAN pool respectively. Each datacenter Consul operates in has a LAN gossip pool
+containing all members of the datacenter, both clients and servers. The LAN pool is
+used for a few purposes. Membership information allows clients to automatically discover
+servers, reducing the amount of configuration needed. The distributed failure detection
+allows the work of failure detection to be shared by the entire cluster instead of
+concentrated on a few servers. Lastly, the gossip pool allows for reliable and fast
+event broadcasts for events like leader election.

-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
+The WAN pool is globally unique, as all servers should participate in the WAN pool
+regardless of datacenter. Membership information provided by the WAN pool allows
+servers to perform cross datacenter requests. The integrated failure detection
+allows Consul to gracefully handle an entire datacenter losing connectivity, or just
+a single server in a remote datacenter.

-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
+All of these features are provided by leveraging [Serf](http://www.serfdom.io/). It
+is used as an embedded library to provide these features. From a user perspective,
+this is not important, since the abstraction should be masked by Consul. It can be useful,
+however, for a developer to understand how this library is leveraged.

-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
-
-## SWIM Modifications
-
-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
-
-The changes from SWIM are noted here:
-
-* Serf does a full state sync over TCP periodically. SWIM only propagates
-  changes over gossip. While both are eventually consistent, Serf is able to
-  more quickly reach convergence, as well as gracefully recover from network
-  partitions.
-
-* Serf has a dedicated gossip layer separate from the failure detection
-  protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-  Serf uses piggybacking along with dedicated gossip messages. This
-  feature lets you have a higher gossip rate (for example once per 200ms)
-  and a slower failure detection rate (such as once per second), resulting
-  in overall faster convergence rates and data propagation speeds.
-
-* Serf keeps the state of dead nodes around for a set amount of time,
-  so that when full syncs are requested, the requester also receives information
-  about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-  state immediately upon learning that the node is dead. This change again helps
-  the cluster converge more quickly.
-
-## Serf-Specific Messages
-
-On top of the SWIM-based gossip layer, Serf sends some custom message types.
-
-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
-
-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
-
-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
-
-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
--- a/website/source/docs/internals/security.html.markdown
+++ b/website/source/docs/internals/security.html.markdown
@@ -6,93 +6,43 @@ sidebar_current: "docs-internals-security"

 # Security Model

-Serf uses a symmetric key, or shared secret, cryptosystem to provide
-[confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
+Consul relies on both a lightweight gossip mechanism and an RPC system
+to provide various features. Both of the systems have different security
+mechanisms that stem from their independent designs. However, the goals
+of Consul's security are to provide [confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).

-This means Serf communication is protected against eavesdropping, tampering,
-or attempts to generate fake events. This makes it possible to run Serf over
-untrusted networks such as EC2 and other shared hosting providers.
+The [gossip protocol](/docs/internals/gossip.html) is powered by Serf,
+which uses a symmetric key, or shared secret, cryptosystem. There are more
+details on the security of [Serf here](http://www.serfdom.io/docs/internals/security.html).
+
+The RPC system supports using end-to-end TLS, with optional client authentication.
+[TLS](http://en.wikipedia.org/wiki/Transport_Layer_Security) is a widely deployed asymmetric
+cryptosystem, and is the foundation of security on the Internet.
+
+This means Consul communication is protected against eavesdropping, tampering,
+or spoofing. This makes it possible to run Consul over untrusted networks such
+as EC2 and other shared hosting providers.

 <div class="alert alert-block alert-warning">
 <strong>Advanced Topic!</strong> This page covers the technical details of
-the security model of Serf. You don't need to know these details to
-operate and use Serf. These details are documented here for those who wish
+the security model of Consul. You don't need to know these details to
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>

-## Security Primitives
-
-The Serf security model is built on around a symmetric key, or shared secret system.
-All members of the Serf cluster must be provided the shared secret ahead of time.
-This places the burden of key distribution on the user.
-
-To support confidentiality, all messages are encrypted using the
-[AES-128 standard](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard). The
-AES standard is considered one of the most secure and modern encryption standards.
-Additionally, it is a fast algorithm, and modern CPUs provide hardware instructions to
-make encryption and decryption very lightweight.
-
-AES is used with the [Galois Counter Mode (GCM)](http://en.wikipedia.org/wiki/Galois/Counter_Mode),
-using a randomly generated nonce. The use of GCM provides message integrity,
-as the ciphertext is suffixed with a 'tag' that is used to verify integrity.
-
-## Message Format
-
-In the previous section we described the crypto primitives that are used. In this
-section we cover how messages are framed on the wire and interpretted.
-
-### UDP Message Format
-
-UDP messages do not require any framing since they are packet oriented. This
-allows the message to be simple and saves space. The format is as follows:
-
-    -------------------------------------------------------------------
-    | Version (byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes) |
-    -------------------------------------------------------------------
-
-The UDP message has an overhead of 29 bytes per message.
-Tampering or bit corruption will cause the GCM tag verification to fail.
-
-Once we receive a packet, we first verify the GCM tag, and only on verification,
-decrypt the payload. The version byte is provided to allow future versions to
-change the algorithm they use. It is currently always set to 0.
-
-### TCP Message Format
-
-TCP provides a stream abstraction and therefor we must provide our own framing.
-This intoduces a potential attack vector since we cannot verify the tag
-until the entire message is received, and the message length must be in plaintext.
-Our current strategy is to limit the maximum size of a framed message to prevent
-an malicious attacker from being able to send enough data to cause a Denial of Service.
-
-The TCP format is similar to the UDP format, but prepends the message with
-a message type byte (similar to other Serf messages). It also adds a 4 byte length
-field, encoded in Big Endian format. This increases its maximum overhead to 33 bytes.
-
-When we first receive a TCP encrypted message, we check the message type. If any
-party has encryption enabled, the other party must as well. Otherwise we are vulnerable
-to a downgrade attack where one side can force the other into a non-encrypted mode of
-operation.
-
-Once this is verified, we determine the message length and if it is less than our limit,.
-After the entire message is received, the tag is used to verify the entire message.
-
 ## Threat Model

 The following are the various parts of our threat model:

-* Non-members getting access to events
+* Non-members getting access to data
 * Cluster state manipulation due to malicious messages
-* Fake event generation due to malicious messages
-* Tampering of messages causing state corruption
+* Fake data generation due to malicious messages
+* Tampering causing state corruption
 * Denial of Service against a node

-We are specifically not concerned about replay attacks, as the gossip
-protocol is designed to handle that due to the nature of its broadcast mechanism.
-
 Additionally, we recognize that an attacker that can observe network
 traffic for an extended period of time may infer the cluster members.
-The gossip mechanism used by Serf relies on sending messages to random
+The gossip mechanism used by Consul relies on sending messages to random
 members, so an attacker can record all destinations and determine all
 members of the cluster.

@@ -101,13 +51,3 @@ Our goal is not to protect top secret data but to provide a "reasonable"
 level of security that would require an attacker to commit a considerable
 amount of resources to defeat.

-## Future Roadmap
-
-Eventually, Serf will be able to use the versioning byte to support
-different encryption algorithms. These could be configured at the
-start time of the agent.
-
-Additionally, we need to support key rotation so that it is possible
-for network administrators to periodically change keys to ensure
-perfect forward security.
-