website: documenting the internals
This commit is contained in:
parent
30bc19f002
commit
c975f5b968
|
@ -82,9 +82,9 @@ slower as more machines are added. However, there is no limit to the number of c
|
|||
and they can easily scale into the thousands or tens of thousands.
|
||||
|
||||
All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
|
||||
This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
|
||||
This means is there is a gossip pool that contains all the nodes for a given datacenter. This serves
|
||||
a few purposes: first, there is no need to configure clients with the addresses of servers,
|
||||
that discovery is done automatically using Serf. Second, the work of detecting node failures
|
||||
discovery is done automatically. Second, the work of detecting node failures
|
||||
is not placed on the servers but is distributed. This makes the failure detection much more
|
||||
scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
|
||||
when important events such as leader election take place.
|
||||
|
|
|
@ -6,98 +6,177 @@ sidebar_current: "docs-internals-consensus"
|
|||
|
||||
# Consensus Protocol
|
||||
|
||||
Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
|
||||
to broadcast messages to the cluster. This page documents the details of
|
||||
this internal protocol. The gossip protocol is based on
|
||||
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
|
||||
with a few minor adaptations, mostly to increase propagation speed
|
||||
and convergence rate.
|
||||
Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
|
||||
to provide [Consistency and Availability](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
|
||||
This page documents the details of this internal protocol. The consensus protocol is based on
|
||||
["Raft: In search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
|
||||
|
||||
<div class="alert alert-block alert-warning">
|
||||
<strong>Advanced Topic!</strong> This page covers the technical details of
|
||||
the internals of Serf. You don't need to know these details to effectively
|
||||
operate and use Serf. These details are documented here for those who wish
|
||||
<strong>Advanced Topic!</strong> This page covers technical details of
|
||||
the internals of Consul. You don't need to know these details to effectively
|
||||
operate and use Consul. These details are documented here for those who wish
|
||||
to learn about them without having to go spelunking through the source code.
|
||||
</div>
|
||||
|
||||
## SWIM Protocol Overview
|
||||
## Raft Protocol Overview
|
||||
|
||||
Serf begins by joining an existing cluster or starting a new
|
||||
cluster. If starting a new cluster, additional nodes are expected to join
|
||||
it. New nodes in an existing cluster must be given the address of at
|
||||
least one existing member in order to join the cluster. The new member
|
||||
does a full state sync with the existing member over TCP and begins gossiping its
|
||||
existence to the cluster.
|
||||
Raft is a relatively new consensus algorithm that is based on Paxos,
|
||||
but is designed to have fewer states and a simpler more understandable
|
||||
algorithm. There are a few key terms to know when discussing Raft:
|
||||
|
||||
Gossip is done over UDP with a configurable but fixed fanout and interval.
|
||||
This ensures that network usage is constant with regards to number of nodes.
|
||||
Complete state exchanges with a random node are done periodically over
|
||||
TCP, but much less often than gossip messages. This increases the likelihood
|
||||
that the membership list converges properly since the full state is exchanged
|
||||
and merged. The interval between full state exchanges is configurable or can
|
||||
be disabled entirely.
|
||||
* Log - The primary unit of work in a Raft system is a log entry. The problem
|
||||
of consistency can be decomposed into a *replicated log*. A log is a an ordered
|
||||
seequence of entries. We consider the log consistent if all members agree on
|
||||
the entries and their order.
|
||||
|
||||
Failure detection is done by periodic random probing using a configurable interval.
|
||||
If the node fails to ack within a reasonable time (typically some multiple
|
||||
of RTT), then an indirect probe is attempted. An indirect probe asks a
|
||||
configurable number of random nodes to probe the same node, in case there
|
||||
are network issues causing our own node to fail the probe. If both our
|
||||
probe and the indirect probes fail within a reasonable time, then the
|
||||
node is marked "suspicious" and this knowledge is gossiped to the cluster.
|
||||
A suspicious node is still considered a member of cluster. If the suspect member
|
||||
of the cluster does not dispute the suspicion within a configurable period of
|
||||
time, the node is finally considered dead, and this state is then gossiped
|
||||
to the cluster.
|
||||
* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
|
||||
An FSM is a collection of finite states with transitions between them. As new logs
|
||||
are applied, the FSM is allowed to transition between states. Application of the
|
||||
same sequence of logs must result in the same state, meaning non-deterministic
|
||||
behavior is not permitted.
|
||||
|
||||
This is a brief and incomplete description of the protocol. For a better idea,
|
||||
please read the
|
||||
[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
|
||||
in its entirety, along with the Serf source code.
|
||||
* Peer set - The peer set is the set of all members participating in log replication.
|
||||
For Consul's purposes, all server nodes are in the peer set of the local datacenter.
|
||||
|
||||
## SWIM Modifications
|
||||
* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1.
|
||||
For example, if there are 5 members in the peer set, we would need 3 nodes
|
||||
to form a quorum. If a quorum of nodes is unavailable for any reason, then the
|
||||
cluster becomes *unavailable*, and no new logs can be committed.
|
||||
|
||||
As mentioned earlier, the gossip protocol is based on SWIM but includes
|
||||
minor changes, mostly to increase propogation speed and convergence rates.
|
||||
* Committed Entry - An entry is considered *committed* when it is durably stored
|
||||
on a quorum of nodes. Once an entry is committed it can be applied.
|
||||
|
||||
The changes from SWIM are noted here:
|
||||
* Leader - At any given time, the peer set elects a single node to be the leader.
|
||||
The leader is responsible for ingesting new log entries, replicating to followers,
|
||||
and managing when an entry is considered committed.
|
||||
|
||||
* Serf does a full state sync over TCP periodically. SWIM only propagates
|
||||
changes over gossip. While both are eventually consistent, Serf is able to
|
||||
more quickly reach convergence, as well as gracefully recover from network
|
||||
partitions.
|
||||
Raft is a complex protocol, and will not be covered here in detail. For the full
|
||||
specification, we recommend reading the paper. We will attempt to provide a high
|
||||
level description, which may be useful for building a mental picture.
|
||||
|
||||
* Serf has a dedicated gossip layer separate from the failure detection
|
||||
protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
|
||||
Serf uses piggybacking along with dedicated gossip messages. This
|
||||
feature lets you have a higher gossip rate (for example once per 200ms)
|
||||
and a slower failure detection rate (such as once per second), resulting
|
||||
in overall faster convergence rates and data propagation speeds.
|
||||
Raft nodes are always in one of three states: follower, candidate or leader. All
|
||||
nodes initially start out as a follower. In this state, nodes can accept log entries
|
||||
from a leader and cast votes. If no entries are received for some time, nodes
|
||||
self-promote to the candidate state. In the candidate state nodes request votes from
|
||||
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
|
||||
The leader must accept new log entries and replicate to all the other followers.
|
||||
In addition, if stale reads are not acceptable, all queries must also be performed on
|
||||
the leader.
|
||||
|
||||
* Serf keeps the state of dead nodes around for a set amount of time,
|
||||
so that when full syncs are requested, the requester also receives information
|
||||
about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
|
||||
state immediately upon learning that the node is dead. This change again helps
|
||||
the cluster converge more quickly.
|
||||
Once a cluster has a leader, it is able to accept new log entries. A client can
|
||||
request that a leader append a new log entry, which is an opaque binary blob to
|
||||
Raft. The leader then writes the entry to durable storage and attempts to replicate
|
||||
to a quorum of followers. Once the log entry is considered *committed*, it can be
|
||||
*applied* to a finite state machine. The finite state machine is application specific,
|
||||
and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
|
||||
|
||||
## Serf-Specific Messages
|
||||
An obvious question relates to the unbounded nature of a replicated log. Raft provides
|
||||
a mechanism by which the current state is snapshotted, and the log is compacted. Because
|
||||
of the FSM abstraction, restoring the state of the FSM must result in the same state
|
||||
as a reply of old logs. This allows Raft to capture the FSM state at a point in time,
|
||||
and then remove all the logs that were used to reach that state. This is performed automatically
|
||||
without user intervention, and prevents unbounded disk usage as well as minimizing
|
||||
time spent replaying logs. One of the advantages of using LMDB is that it allows Consul
|
||||
to continue accepting new transactions even while old state is being snapshotted,
|
||||
preventing any availability issues.
|
||||
|
||||
On top of the SWIM-based gossip layer, Serf sends some custom message types.
|
||||
Lastly, there is the issue of updating the peer set when new servers are joining
|
||||
or existing servers are leaving. As long as a quorum of nodes are available, this
|
||||
is not an issue as Raft provides mechanisms to dynamically update the peer set.
|
||||
If a quorum of nodes is unavailable, then this becomes a very challenging issue.
|
||||
For example, suppose there are only 2 peers, A and B. The quorum size is also
|
||||
2, meaning both nodes must agree to commit a log entry. If either A or B fails,
|
||||
it is now impossible to reach quorum. This means the cluster is unable to add,
|
||||
or remove a node, or commit any additional log entries. This results in *unavailability*.
|
||||
At this point, manual intervention would be required to remove either A or B,
|
||||
and to restart the remaining node in bootstrap mode.
|
||||
|
||||
Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
|
||||
to maintain some notion of message ordering despite being eventually
|
||||
consistent. Every message sent by Serf contains a lamport clock time.
|
||||
A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
|
||||
of 5 can tolerate 2 node failures. The recommended configuration is to either
|
||||
run 3 or 5 Consul servers per datacenter. This maximizes availability without
|
||||
greatly sacrificing performance. See below for a deployment table.
|
||||
|
||||
When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
|
||||
the gossip layer. Because the underlying gossip layer makes no differentiation
|
||||
between a node leaving the cluster and a node being detected as failed, this
|
||||
allows the higher level Serf layer to detect a failure versus a graceful
|
||||
leave.
|
||||
In terms of performance, Raft is comprable to Paxos. Assuming stable leadership,
|
||||
a committing a log entry requires a single round trip to half of the cluster.
|
||||
Thus performance is bound by disk I/O and network latency. Although Consul is
|
||||
not designed to be a high-throughput write system, it should handle on the order
|
||||
of hundreds to thousands of transactions per second depending on network and
|
||||
hardware configuration.
|
||||
|
||||
When a node joins the cluster, Serf sends a _join intent_. The purpose
|
||||
of this intent is solely to attach a lamport clock time to a join so that
|
||||
it can be ordered properly in case a leave comes out of order.
|
||||
## Raft in Consul
|
||||
|
||||
Only Consul server nodes participate in Raft, and are part of the peer set. All
|
||||
client nodes forward requests to servers. Part of the reason for this design is
|
||||
that as more members are added to the peer set, the size of the quorum also increases.
|
||||
This introduces performance problems as you may be waiting for hundreds of machines
|
||||
to agree on an entry instead of a handful.
|
||||
|
||||
When getting started, a single Consul server is put into "bootstrap" mode. This mode
|
||||
allows it to self-elect as a leader. Once a leader is elected, other servers can be
|
||||
added to the peer set in a way that preserves consistency and safety. Eventually,
|
||||
bootstrap mode can be disabled, once the first few servers are added.
|
||||
|
||||
Since all servers participate as part of the peer set, they all know the current
|
||||
leader. When an RPC request arrives at a non-leader server, the request is
|
||||
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
|
||||
then the leader generates the result based on the current state of the FSM. If
|
||||
the RPC is a *transaction* type, meaning it modifies state, then the leader
|
||||
generates a new log entry and applies it using Raft. Once the log entry is committed
|
||||
and applied to the FSM, the transaction is complete.
|
||||
|
||||
Because of the nature of Raft's replication, performance is sensitive to network
|
||||
latency. For this reason, each datacenter elects an independent leader, and maintains
|
||||
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
|
||||
only for data in their datacenter. When a request is received for a remote datacenter,
|
||||
the request is forwarded to the correct leader. This design allows for lower latency
|
||||
transactions and higher availability without sacrificing consistency.
|
||||
|
||||
## Deployment Table
|
||||
|
||||
Below is a table that shows for the number of servers how large the
|
||||
quorum is, as well as how many node failures can be tolerated. The
|
||||
recommended deployment is either 3 or 5 servers.
|
||||
|
||||
<table class="table table-bordered table-striped">
|
||||
<tr>
|
||||
<th>Servers</th>
|
||||
<th>Quorum Size</th>
|
||||
<th>Failure Tolerance</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>1</td>
|
||||
<td>1</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>2</td>
|
||||
<td>2</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><b>3</b></td>
|
||||
<td>2</td>
|
||||
<td>1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>3</td>
|
||||
<td>1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><b>5</b></td>
|
||||
<td>3</td>
|
||||
<td>2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>6</td>
|
||||
<td>4</td>
|
||||
<td>2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>7</td>
|
||||
<td>4</td>
|
||||
<td>3</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
For custom events, Serf sends a _user event_ message. This message contains
|
||||
a lamport time, event name, and event payload. Because user events are sent
|
||||
along the gossip layer, which uses UDP, the payload and entire message framing
|
||||
must fit within a single UDP packet.
|
||||
|
|
|
@ -6,98 +6,39 @@ sidebar_current: "docs-internals-gossip"
|
|||
|
||||
# Gossip Protocol
|
||||
|
||||
Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
|
||||
to broadcast messages to the cluster. This page documents the details of
|
||||
this internal protocol. The gossip protocol is based on
|
||||
Consul uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
|
||||
to manage membership and broadcast messages to the cluster. All of this is provided
|
||||
through the use of the [Serf library](http://www.serfdom.io/). The gossip protocol
|
||||
used by Serf is based on
|
||||
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
|
||||
with a few minor adaptations, mostly to increase propagation speed
|
||||
and convergence rate.
|
||||
with a few minor adaptations. There are more details about [Serf's protocol here](http://www.serfdom.io/docs/internals/gossip.html).
|
||||
|
||||
<div class="alert alert-block alert-warning">
|
||||
<strong>Advanced Topic!</strong> This page covers the technical details of
|
||||
the internals of Serf. You don't need to know these details to effectively
|
||||
operate and use Serf. These details are documented here for those who wish
|
||||
<strong>Advanced Topic!</strong> This page covers technical details of
|
||||
the internals of Consul. You don't need to know these details to effectively
|
||||
operate and use Consul. These details are documented here for those who wish
|
||||
to learn about them without having to go spelunking through the source code.
|
||||
</div>
|
||||
|
||||
## SWIM Protocol Overview
|
||||
## Gossip in Consul
|
||||
|
||||
Serf begins by joining an existing cluster or starting a new
|
||||
cluster. If starting a new cluster, additional nodes are expected to join
|
||||
it. New nodes in an existing cluster must be given the address of at
|
||||
least one existing member in order to join the cluster. The new member
|
||||
does a full state sync with the existing member over TCP and begins gossiping its
|
||||
existence to the cluster.
|
||||
Consul makes use of two different gossip pools. We refer to each pool as the
|
||||
LAN or WAN pool respectively. Each datacenter Consul operates in has a LAN gossip pool
|
||||
containing all members of the datacenter, both clients and servers. The LAN pool is
|
||||
used for a few purposes. Membership information allows clients to automatically discover
|
||||
servers, reducing the amount of configuration needed. The distributed failure detection
|
||||
allows the work of failure detection to be shared by the entire cluster instead of
|
||||
concentrated on a few servers. Lastly, the gossip pool allows for reliable and fast
|
||||
event broadcasts for events like leader election.
|
||||
|
||||
Gossip is done over UDP with a configurable but fixed fanout and interval.
|
||||
This ensures that network usage is constant with regards to number of nodes.
|
||||
Complete state exchanges with a random node are done periodically over
|
||||
TCP, but much less often than gossip messages. This increases the likelihood
|
||||
that the membership list converges properly since the full state is exchanged
|
||||
and merged. The interval between full state exchanges is configurable or can
|
||||
be disabled entirely.
|
||||
The WAN pool is globally unique, as all servers should participate in the WAN pool
|
||||
regardless of datacenter. Membership information provided by the WAN pool allows
|
||||
servers to perform cross datacenter requests. THe integrated failure detection
|
||||
allows Consul to gracefully handle an entire datacenter losing connectivity, or just
|
||||
a single server in a remote datacenter.
|
||||
|
||||
Failure detection is done by periodic random probing using a configurable interval.
|
||||
If the node fails to ack within a reasonable time (typically some multiple
|
||||
of RTT), then an indirect probe is attempted. An indirect probe asks a
|
||||
configurable number of random nodes to probe the same node, in case there
|
||||
are network issues causing our own node to fail the probe. If both our
|
||||
probe and the indirect probes fail within a reasonable time, then the
|
||||
node is marked "suspicious" and this knowledge is gossiped to the cluster.
|
||||
A suspicious node is still considered a member of cluster. If the suspect member
|
||||
of the cluster does not dispute the suspicion within a configurable period of
|
||||
time, the node is finally considered dead, and this state is then gossiped
|
||||
to the cluster.
|
||||
All of these features are provided by leveraging [Serf](http://www.serfdom.io/). It
|
||||
is used as an embedded library to provide these features. From a user perspective,
|
||||
this is not important, since the abstraction should be masked by Consul. It can be useful
|
||||
however as a developer to understand how this library is leveraged.
|
||||
|
||||
This is a brief and incomplete description of the protocol. For a better idea,
|
||||
please read the
|
||||
[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
|
||||
in its entirety, along with the Serf source code.
|
||||
|
||||
## SWIM Modifications
|
||||
|
||||
As mentioned earlier, the gossip protocol is based on SWIM but includes
|
||||
minor changes, mostly to increase propogation speed and convergence rates.
|
||||
|
||||
The changes from SWIM are noted here:
|
||||
|
||||
* Serf does a full state sync over TCP periodically. SWIM only propagates
|
||||
changes over gossip. While both are eventually consistent, Serf is able to
|
||||
more quickly reach convergence, as well as gracefully recover from network
|
||||
partitions.
|
||||
|
||||
* Serf has a dedicated gossip layer separate from the failure detection
|
||||
protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
|
||||
Serf uses piggybacking along with dedicated gossip messages. This
|
||||
feature lets you have a higher gossip rate (for example once per 200ms)
|
||||
and a slower failure detection rate (such as once per second), resulting
|
||||
in overall faster convergence rates and data propagation speeds.
|
||||
|
||||
* Serf keeps the state of dead nodes around for a set amount of time,
|
||||
so that when full syncs are requested, the requester also receives information
|
||||
about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
|
||||
state immediately upon learning that the node is dead. This change again helps
|
||||
the cluster converge more quickly.
|
||||
|
||||
## Serf-Specific Messages
|
||||
|
||||
On top of the SWIM-based gossip layer, Serf sends some custom message types.
|
||||
|
||||
Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
|
||||
to maintain some notion of message ordering despite being eventually
|
||||
consistent. Every message sent by Serf contains a lamport clock time.
|
||||
|
||||
When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
|
||||
the gossip layer. Because the underlying gossip layer makes no differentiation
|
||||
between a node leaving the cluster and a node being detected as failed, this
|
||||
allows the higher level Serf layer to detect a failure versus a graceful
|
||||
leave.
|
||||
|
||||
When a node joins the cluster, Serf sends a _join intent_. The purpose
|
||||
of this intent is solely to attach a lamport clock time to a join so that
|
||||
it can be ordered properly in case a leave comes out of order.
|
||||
|
||||
For custom events, Serf sends a _user event_ message. This message contains
|
||||
a lamport time, event name, and event payload. Because user events are sent
|
||||
along the gossip layer, which uses UDP, the payload and entire message framing
|
||||
must fit within a single UDP packet.
|
||||
|
|
|
@ -6,93 +6,43 @@ sidebar_current: "docs-internals-security"
|
|||
|
||||
# Security Model
|
||||
|
||||
Serf uses a symmetric key, or shared secret, cryptosystem to provide
|
||||
[confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
|
||||
Consul relies on both a lightweight gossip mechanism and an RPC system
|
||||
to provide various features. Both of the systems have different security
|
||||
mechanisms that stem from their independent designs. However, the goals
|
||||
of Consuls security are to provide [confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
|
||||
|
||||
This means Serf communication is protected against eavesdropping, tampering,
|
||||
or attempts to generate fake events. This makes it possible to run Serf over
|
||||
untrusted networks such as EC2 and other shared hosting providers.
|
||||
The [gossip protocol](/docs/internals/gossip.html) is powered by Serf,
|
||||
which uses a symmetric key, or shared secret, cryptosystem. There are more
|
||||
details on the security of [Serf here](http://www.serfdom.io/docs/internals/security.html).
|
||||
|
||||
The RPC system supports using end-to-end TLS, with optional client authentication.
|
||||
[TLS](http://en.wikipedia.org/wiki/Transport_Layer_Security) is a widely deployed asymmetric
|
||||
cryptosystem, and is the foundation of security on the Internet.
|
||||
|
||||
This means Consul communication is protected against eavesdropping, tampering,
|
||||
or spoofing. This makes it possible to run Consul over untrusted networks such
|
||||
as EC2 and other shared hosting providers.
|
||||
|
||||
<div class="alert alert-block alert-warning">
|
||||
<strong>Advanced Topic!</strong> This page covers the technical details of
|
||||
the security model of Serf. You don't need to know these details to
|
||||
operate and use Serf. These details are documented here for those who wish
|
||||
the security model of Consul. You don't need to know these details to
|
||||
operate and use Consul. These details are documented here for those who wish
|
||||
to learn about them without having to go spelunking through the source code.
|
||||
</div>
|
||||
|
||||
## Security Primitives
|
||||
|
||||
The Serf security model is built on around a symmetric key, or shared secret system.
|
||||
All members of the Serf cluster must be provided the shared secret ahead of time.
|
||||
This places the burden of key distribution on the user.
|
||||
|
||||
To support confidentiality, all messages are encrypted using the
|
||||
[AES-128 standard](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard). The
|
||||
AES standard is considered one of the most secure and modern encryption standards.
|
||||
Additionally, it is a fast algorithm, and modern CPUs provide hardware instructions to
|
||||
make encryption and decryption very lightweight.
|
||||
|
||||
AES is used with the [Galois Counter Mode (GCM)](http://en.wikipedia.org/wiki/Galois/Counter_Mode),
|
||||
using a randomly generated nonce. The use of GCM provides message integrity,
|
||||
as the ciphertext is suffixed with a 'tag' that is used to verify integrity.
|
||||
|
||||
## Message Format
|
||||
|
||||
In the previous section we described the crypto primitives that are used. In this
|
||||
section we cover how messages are framed on the wire and interpretted.
|
||||
|
||||
### UDP Message Format
|
||||
|
||||
UDP messages do not require any framing since they are packet oriented. This
|
||||
allows the message to be simple and saves space. The format is as follows:
|
||||
|
||||
-------------------------------------------------------------------
|
||||
| Version (byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes) |
|
||||
-------------------------------------------------------------------
|
||||
|
||||
The UDP message has an overhead of 29 bytes per message.
|
||||
Tampering or bit corruption will cause the GCM tag verification to fail.
|
||||
|
||||
Once we receive a packet, we first verify the GCM tag, and only on verification,
|
||||
decrypt the payload. The version byte is provided to allow future versions to
|
||||
change the algorithm they use. It is currently always set to 0.
|
||||
|
||||
### TCP Message Format
|
||||
|
||||
TCP provides a stream abstraction and therefor we must provide our own framing.
|
||||
This intoduces a potential attack vector since we cannot verify the tag
|
||||
until the entire message is received, and the message length must be in plaintext.
|
||||
Our current strategy is to limit the maximum size of a framed message to prevent
|
||||
an malicious attacker from being able to send enough data to cause a Denial of Service.
|
||||
|
||||
The TCP format is similar to the UDP format, but prepends the message with
|
||||
a message type byte (similar to other Serf messages). It also adds a 4 byte length
|
||||
field, encoded in Big Endian format. This increases its maximum overhead to 33 bytes.
|
||||
|
||||
When we first receive a TCP encrypted message, we check the message type. If any
|
||||
party has encryption enabled, the other party must as well. Otherwise we are vulnerable
|
||||
to a downgrade attack where one side can force the other into a non-encrypted mode of
|
||||
operation.
|
||||
|
||||
Once this is verified, we determine the message length and if it is less than our limit,.
|
||||
After the entire message is received, the tag is used to verify the entire message.
|
||||
|
||||
## Threat Model
|
||||
|
||||
The following are the various parts of our threat model:
|
||||
|
||||
* Non-members getting access to events
|
||||
* Non-members getting access to data
|
||||
* Cluster state manipulation due to malicious messages
|
||||
* Fake event generation due to malicious messages
|
||||
* Tampering of messages causing state corruption
|
||||
* Fake data generation due to malicious messages
|
||||
* Tampering causing state corruption
|
||||
* Denial of Service against a node
|
||||
|
||||
We are specifically not concerned about replay attacks, as the gossip
|
||||
protocol is designed to handle that due to the nature of its broadcast mechanism.
|
||||
|
||||
Additionally, we recognize that an attacker that can observe network
|
||||
traffic for an extended period of time may infer the cluster members.
|
||||
The gossip mechanism used by Serf relies on sending messages to random
|
||||
The gossip mechanism used by Consul relies on sending messages to random
|
||||
members, so an attacker can record all destinations and determine all
|
||||
members of the cluster.
|
||||
|
||||
|
@ -101,13 +51,3 @@ Our goal is not to protect top secret data but to provide a "reasonable"
|
|||
level of security that would require an attacker to commit a considerable
|
||||
amount of resources to defeat.
|
||||
|
||||
## Future Roadmap
|
||||
|
||||
Eventually, Serf will be able to use the versioning byte to support
|
||||
different encryption algorithms. These could be configured at the
|
||||
start time of the agent.
|
||||
|
||||
Additionally, we need to support key rotation so that it is possible
|
||||
for network administrators to periodically change keys to ensure
|
||||
perfect forward security.
|
||||
|
||||
|
|
Loading…
Reference in New Issue