website: documenting the internals

This commit is contained in:
Armon Dadgar 2014-02-20 12:26:50 -08:00
parent 30bc19f002
commit c975f5b968
4 changed files with 202 additions and 242 deletions

View File

@ -82,9 +82,9 @@ slower as more machines are added. However, there is no limit to the number of c
and they can easily scale into the thousands or tens of thousands.
All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
This means is there is a gossip pool that contains all the nodes for a given datacenter. This serves
a few purposes: first, there is no need to configure clients with the addresses of servers,
that discovery is done automatically using Serf. Second, the work of detecting node failures
discovery is done automatically. Second, the work of detecting node failures
is not placed on the servers but is distributed. This makes the failure detection much more
scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
when important events such as leader election take place.

View File

@ -6,98 +6,177 @@ sidebar_current: "docs-internals-consensus"
# Consensus Protocol
Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
to broadcast messages to the cluster. This page documents the details of
this internal protocol. The gossip protocol is based on
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
with a few minor adaptations, mostly to increase propagation speed
and convergence rate.
Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
to provide [Consistency and Availability](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
This page documents the details of this internal protocol. The consensus protocol is based on
["Raft: In search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
<div class="alert alert-block alert-warning">
<strong>Advanced Topic!</strong> This page covers the technical details of
the internals of Serf. You don't need to know these details to effectively
operate and use Serf. These details are documented here for those who wish
<strong>Advanced Topic!</strong> This page covers technical details of
the internals of Consul. You don't need to know these details to effectively
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.
</div>
## SWIM Protocol Overview
## Raft Protocol Overview
Serf begins by joining an existing cluster or starting a new
cluster. If starting a new cluster, additional nodes are expected to join
it. New nodes in an existing cluster must be given the address of at
least one existing member in order to join the cluster. The new member
does a full state sync with the existing member over TCP and begins gossiping its
existence to the cluster.
Raft is a relatively new consensus algorithm that is based on Paxos,
but is designed to have fewer states and a simpler more understandable
algorithm. There are a few key terms to know when discussing Raft:
Gossip is done over UDP with a configurable but fixed fanout and interval.
This ensures that network usage is constant with regards to number of nodes.
Complete state exchanges with a random node are done periodically over
TCP, but much less often than gossip messages. This increases the likelihood
that the membership list converges properly since the full state is exchanged
and merged. The interval between full state exchanges is configurable or can
be disabled entirely.
* Log - The primary unit of work in a Raft system is a log entry. The problem
of consistency can be decomposed into a *replicated log*. A log is a an ordered
seequence of entries. We consider the log consistent if all members agree on
the entries and their order.
Failure detection is done by periodic random probing using a configurable interval.
If the node fails to ack within a reasonable time (typically some multiple
of RTT), then an indirect probe is attempted. An indirect probe asks a
configurable number of random nodes to probe the same node, in case there
are network issues causing our own node to fail the probe. If both our
probe and the indirect probes fail within a reasonable time, then the
node is marked "suspicious" and this knowledge is gossiped to the cluster.
A suspicious node is still considered a member of cluster. If the suspect member
of the cluster does not dispute the suspicion within a configurable period of
time, the node is finally considered dead, and this state is then gossiped
to the cluster.
* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
An FSM is a collection of finite states with transitions between them. As new logs
are applied, the FSM is allowed to transition between states. Application of the
same sequence of logs must result in the same state, meaning non-deterministic
behavior is not permitted.
This is a brief and incomplete description of the protocol. For a better idea,
please read the
[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
in its entirety, along with the Serf source code.
* Peer set - The peer set is the set of all members participating in log replication.
For Consul's purposes, all server nodes are in the peer set of the local datacenter.
## SWIM Modifications
* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1.
For example, if there are 5 members in the peer set, we would need 3 nodes
to form a quorum. If a quorum of nodes is unavailable for any reason, then the
cluster becomes *unavailable*, and no new logs can be committed.
As mentioned earlier, the gossip protocol is based on SWIM but includes
minor changes, mostly to increase propogation speed and convergence rates.
* Committed Entry - An entry is considered *committed* when it is durably stored
on a quorum of nodes. Once an entry is committed it can be applied.
The changes from SWIM are noted here:
* Leader - At any given time, the peer set elects a single node to be the leader.
The leader is responsible for ingesting new log entries, replicating to followers,
and managing when an entry is considered committed.
* Serf does a full state sync over TCP periodically. SWIM only propagates
changes over gossip. While both are eventually consistent, Serf is able to
more quickly reach convergence, as well as gracefully recover from network
partitions.
Raft is a complex protocol, and will not be covered here in detail. For the full
specification, we recommend reading the paper. We will attempt to provide a high
level description, which may be useful for building a mental picture.
* Serf has a dedicated gossip layer separate from the failure detection
protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
Serf uses piggybacking along with dedicated gossip messages. This
feature lets you have a higher gossip rate (for example once per 200ms)
and a slower failure detection rate (such as once per second), resulting
in overall faster convergence rates and data propagation speeds.
Raft nodes are always in one of three states: follower, candidate or leader. All
nodes initially start out as a follower. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
The leader must accept new log entries and replicate to all the other followers.
In addition, if stale reads are not acceptable, all queries must also be performed on
the leader.
* Serf keeps the state of dead nodes around for a set amount of time,
so that when full syncs are requested, the requester also receives information
about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
state immediately upon learning that the node is dead. This change again helps
the cluster converge more quickly.
Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry, which is an opaque binary blob to
Raft. The leader then writes the entry to durable storage and attempts to replicate
to a quorum of followers. Once the log entry is considered *committed*, it can be
*applied* to a finite state machine. The finite state machine is application specific,
and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
## Serf-Specific Messages
An obvious question relates to the unbounded nature of a replicated log. Raft provides
a mechanism by which the current state is snapshotted, and the log is compacted. Because
of the FSM abstraction, restoring the state of the FSM must result in the same state
as a reply of old logs. This allows Raft to capture the FSM state at a point in time,
and then remove all the logs that were used to reach that state. This is performed automatically
without user intervention, and prevents unbounded disk usage as well as minimizing
time spent replaying logs. One of the advantages of using LMDB is that it allows Consul
to continue accepting new transactions even while old state is being snapshotted,
preventing any availability issues.
On top of the SWIM-based gossip layer, Serf sends some custom message types.
Lastly, there is the issue of updating the peer set when new servers are joining
or existing servers are leaving. As long as a quorum of nodes are available, this
is not an issue as Raft provides mechanisms to dynamically update the peer set.
If a quorum of nodes is unavailable, then this becomes a very challenging issue.
For example, suppose there are only 2 peers, A and B. The quorum size is also
2, meaning both nodes must agree to commit a log entry. If either A or B fails,
it is now impossible to reach quorum. This means the cluster is unable to add,
or remove a node, or commit any additional log entries. This results in *unavailability*.
At this point, manual intervention would be required to remove either A or B,
and to restart the remaining node in bootstrap mode.
Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
to maintain some notion of message ordering despite being eventually
consistent. Every message sent by Serf contains a lamport clock time.
A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to either
run 3 or 5 Consul servers per datacenter. This maximizes availability without
greatly sacrificing performance. See below for a deployment table.
When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
the gossip layer. Because the underlying gossip layer makes no differentiation
between a node leaving the cluster and a node being detected as failed, this
allows the higher level Serf layer to detect a failure versus a graceful
leave.
In terms of performance, Raft is comprable to Paxos. Assuming stable leadership,
a committing a log entry requires a single round trip to half of the cluster.
Thus performance is bound by disk I/O and network latency. Although Consul is
not designed to be a high-throughput write system, it should handle on the order
of hundreds to thousands of transactions per second depending on network and
hardware configuration.
When a node joins the cluster, Serf sends a _join intent_. The purpose
of this intent is solely to attach a lamport clock time to a join so that
it can be ordered properly in case a leave comes out of order.
## Raft in Consul
Only Consul server nodes participate in Raft, and are part of the peer set. All
client nodes forward requests to servers. Part of the reason for this design is
that as more members are added to the peer set, the size of the quorum also increases.
This introduces performance problems as you may be waiting for hundreds of machines
to agree on an entry instead of a handful.
When getting started, a single Consul server is put into "bootstrap" mode. This mode
allows it to self-elect as a leader. Once a leader is elected, other servers can be
added to the peer set in a way that preserves consistency and safety. Eventually,
bootstrap mode can be disabled, once the first few servers are added.
Since all servers participate as part of the peer set, they all know the current
leader. When an RPC request arrives at a non-leader server, the request is
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
then the leader generates the result based on the current state of the FSM. If
the RPC is a *transaction* type, meaning it modifies state, then the leader
generates a new log entry and applies it using Raft. Once the log entry is committed
and applied to the FSM, the transaction is complete.
Because of the nature of Raft's replication, performance is sensitive to network
latency. For this reason, each datacenter elects an independent leader, and maintains
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
only for data in their datacenter. When a request is received for a remote datacenter,
the request is forwarded to the correct leader. This design allows for lower latency
transactions and higher availability without sacrificing consistency.
## Deployment Table
Below is a table that shows for the number of servers how large the
quorum is, as well as how many node failures can be tolerated. The
recommended deployment is either 3 or 5 servers.
<table class="table table-bordered table-striped">
<tr>
<th>Servers</th>
<th>Quorum Size</th>
<th>Failure Tolerance</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td><b>3</b></td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>4</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td><b>5</b></td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>6</td>
<td>4</td>
<td>2</td>
</tr>
<tr>
<td>7</td>
<td>4</td>
<td>3</td>
</tr>
</table>
For custom events, Serf sends a _user event_ message. This message contains
a lamport time, event name, and event payload. Because user events are sent
along the gossip layer, which uses UDP, the payload and entire message framing
must fit within a single UDP packet.

View File

@ -6,98 +6,39 @@ sidebar_current: "docs-internals-gossip"
# Gossip Protocol
Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
to broadcast messages to the cluster. This page documents the details of
this internal protocol. The gossip protocol is based on
Consul uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
to manage membership and broadcast messages to the cluster. All of this is provided
through the use of the [Serf library](http://www.serfdom.io/). The gossip protocol
used by Serf is based on
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
with a few minor adaptations, mostly to increase propagation speed
and convergence rate.
with a few minor adaptations. There are more details about [Serf's protocol here](http://www.serfdom.io/docs/internals/gossip.html).
<div class="alert alert-block alert-warning">
<strong>Advanced Topic!</strong> This page covers the technical details of
the internals of Serf. You don't need to know these details to effectively
operate and use Serf. These details are documented here for those who wish
<strong>Advanced Topic!</strong> This page covers technical details of
the internals of Consul. You don't need to know these details to effectively
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.
</div>
## SWIM Protocol Overview
## Gossip in Consul
Serf begins by joining an existing cluster or starting a new
cluster. If starting a new cluster, additional nodes are expected to join
it. New nodes in an existing cluster must be given the address of at
least one existing member in order to join the cluster. The new member
does a full state sync with the existing member over TCP and begins gossiping its
existence to the cluster.
Consul makes use of two different gossip pools. We refer to each pool as the
LAN or WAN pool respectively. Each datacenter Consul operates in has a LAN gossip pool
containing all members of the datacenter, both clients and servers. The LAN pool is
used for a few purposes. Membership information allows clients to automatically discover
servers, reducing the amount of configuration needed. The distributed failure detection
allows the work of failure detection to be shared by the entire cluster instead of
concentrated on a few servers. Lastly, the gossip pool allows for reliable and fast
event broadcasts for events like leader election.
Gossip is done over UDP with a configurable but fixed fanout and interval.
This ensures that network usage is constant with regards to number of nodes.
Complete state exchanges with a random node are done periodically over
TCP, but much less often than gossip messages. This increases the likelihood
that the membership list converges properly since the full state is exchanged
and merged. The interval between full state exchanges is configurable or can
be disabled entirely.
The WAN pool is globally unique, as all servers should participate in the WAN pool
regardless of datacenter. Membership information provided by the WAN pool allows
servers to perform cross datacenter requests. THe integrated failure detection
allows Consul to gracefully handle an entire datacenter losing connectivity, or just
a single server in a remote datacenter.
Failure detection is done by periodic random probing using a configurable interval.
If the node fails to ack within a reasonable time (typically some multiple
of RTT), then an indirect probe is attempted. An indirect probe asks a
configurable number of random nodes to probe the same node, in case there
are network issues causing our own node to fail the probe. If both our
probe and the indirect probes fail within a reasonable time, then the
node is marked "suspicious" and this knowledge is gossiped to the cluster.
A suspicious node is still considered a member of cluster. If the suspect member
of the cluster does not dispute the suspicion within a configurable period of
time, the node is finally considered dead, and this state is then gossiped
to the cluster.
All of these features are provided by leveraging [Serf](http://www.serfdom.io/). It
is used as an embedded library to provide these features. From a user perspective,
this is not important, since the abstraction should be masked by Consul. It can be useful
however as a developer to understand how this library is leveraged.
This is a brief and incomplete description of the protocol. For a better idea,
please read the
[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
in its entirety, along with the Serf source code.
## SWIM Modifications
As mentioned earlier, the gossip protocol is based on SWIM but includes
minor changes, mostly to increase propogation speed and convergence rates.
The changes from SWIM are noted here:
* Serf does a full state sync over TCP periodically. SWIM only propagates
changes over gossip. While both are eventually consistent, Serf is able to
more quickly reach convergence, as well as gracefully recover from network
partitions.
* Serf has a dedicated gossip layer separate from the failure detection
protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
Serf uses piggybacking along with dedicated gossip messages. This
feature lets you have a higher gossip rate (for example once per 200ms)
and a slower failure detection rate (such as once per second), resulting
in overall faster convergence rates and data propagation speeds.
* Serf keeps the state of dead nodes around for a set amount of time,
so that when full syncs are requested, the requester also receives information
about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
state immediately upon learning that the node is dead. This change again helps
the cluster converge more quickly.
## Serf-Specific Messages
On top of the SWIM-based gossip layer, Serf sends some custom message types.
Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
to maintain some notion of message ordering despite being eventually
consistent. Every message sent by Serf contains a lamport clock time.
When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
the gossip layer. Because the underlying gossip layer makes no differentiation
between a node leaving the cluster and a node being detected as failed, this
allows the higher level Serf layer to detect a failure versus a graceful
leave.
When a node joins the cluster, Serf sends a _join intent_. The purpose
of this intent is solely to attach a lamport clock time to a join so that
it can be ordered properly in case a leave comes out of order.
For custom events, Serf sends a _user event_ message. This message contains
a lamport time, event name, and event payload. Because user events are sent
along the gossip layer, which uses UDP, the payload and entire message framing
must fit within a single UDP packet.

View File

@ -6,93 +6,43 @@ sidebar_current: "docs-internals-security"
# Security Model
Serf uses a symmetric key, or shared secret, cryptosystem to provide
[confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
Consul relies on both a lightweight gossip mechanism and an RPC system
to provide various features. Both of the systems have different security
mechanisms that stem from their independent designs. However, the goals
of Consuls security are to provide [confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
This means Serf communication is protected against eavesdropping, tampering,
or attempts to generate fake events. This makes it possible to run Serf over
untrusted networks such as EC2 and other shared hosting providers.
The [gossip protocol](/docs/internals/gossip.html) is powered by Serf,
which uses a symmetric key, or shared secret, cryptosystem. There are more
details on the security of [Serf here](http://www.serfdom.io/docs/internals/security.html).
The RPC system supports using end-to-end TLS, with optional client authentication.
[TLS](http://en.wikipedia.org/wiki/Transport_Layer_Security) is a widely deployed asymmetric
cryptosystem, and is the foundation of security on the Internet.
This means Consul communication is protected against eavesdropping, tampering,
or spoofing. This makes it possible to run Consul over untrusted networks such
as EC2 and other shared hosting providers.
<div class="alert alert-block alert-warning">
<strong>Advanced Topic!</strong> This page covers the technical details of
the security model of Serf. You don't need to know these details to
operate and use Serf. These details are documented here for those who wish
the security model of Consul. You don't need to know these details to
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.
</div>
## Security Primitives
The Serf security model is built on around a symmetric key, or shared secret system.
All members of the Serf cluster must be provided the shared secret ahead of time.
This places the burden of key distribution on the user.
To support confidentiality, all messages are encrypted using the
[AES-128 standard](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard). The
AES standard is considered one of the most secure and modern encryption standards.
Additionally, it is a fast algorithm, and modern CPUs provide hardware instructions to
make encryption and decryption very lightweight.
AES is used with the [Galois Counter Mode (GCM)](http://en.wikipedia.org/wiki/Galois/Counter_Mode),
using a randomly generated nonce. The use of GCM provides message integrity,
as the ciphertext is suffixed with a 'tag' that is used to verify integrity.
## Message Format
In the previous section we described the crypto primitives that are used. In this
section we cover how messages are framed on the wire and interpretted.
### UDP Message Format
UDP messages do not require any framing since they are packet oriented. This
allows the message to be simple and saves space. The format is as follows:
-------------------------------------------------------------------
| Version (byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes) |
-------------------------------------------------------------------
The UDP message has an overhead of 29 bytes per message.
Tampering or bit corruption will cause the GCM tag verification to fail.
Once we receive a packet, we first verify the GCM tag, and only on verification,
decrypt the payload. The version byte is provided to allow future versions to
change the algorithm they use. It is currently always set to 0.
### TCP Message Format
TCP provides a stream abstraction and therefor we must provide our own framing.
This intoduces a potential attack vector since we cannot verify the tag
until the entire message is received, and the message length must be in plaintext.
Our current strategy is to limit the maximum size of a framed message to prevent
an malicious attacker from being able to send enough data to cause a Denial of Service.
The TCP format is similar to the UDP format, but prepends the message with
a message type byte (similar to other Serf messages). It also adds a 4 byte length
field, encoded in Big Endian format. This increases its maximum overhead to 33 bytes.
When we first receive a TCP encrypted message, we check the message type. If any
party has encryption enabled, the other party must as well. Otherwise we are vulnerable
to a downgrade attack where one side can force the other into a non-encrypted mode of
operation.
Once this is verified, we determine the message length and if it is less than our limit,.
After the entire message is received, the tag is used to verify the entire message.
## Threat Model
The following are the various parts of our threat model:
* Non-members getting access to events
* Non-members getting access to data
* Cluster state manipulation due to malicious messages
* Fake event generation due to malicious messages
* Tampering of messages causing state corruption
* Fake data generation due to malicious messages
* Tampering causing state corruption
* Denial of Service against a node
We are specifically not concerned about replay attacks, as the gossip
protocol is designed to handle that due to the nature of its broadcast mechanism.
Additionally, we recognize that an attacker that can observe network
traffic for an extended period of time may infer the cluster members.
The gossip mechanism used by Serf relies on sending messages to random
The gossip mechanism used by Consul relies on sending messages to random
members, so an attacker can record all destinations and determine all
members of the cluster.
@ -101,13 +51,3 @@ Our goal is not to protect top secret data but to provide a "reasonable"
level of security that would require an attacker to commit a considerable
amount of resources to defeat.
## Future Roadmap
Eventually, Serf will be able to use the versioning byte to support
different encryption algorithms. These could be configured at the
start time of the agent.
Additionally, we need to support key rotation so that it is possible
for network administrators to periodically change keys to ensure
perfect forward security.