---
layout: docs
page_title: Integrated Storage
description: Learn about the integrated raft storage in Vault.
---

# Integrated Storage

Vault supports a number of storage options for the durable storage of Vault's
information. Each backend has its own advantages and trade-offs. For example,
some backends support high availability while others provide a more robust
backup and restoration process.

As of Vault 1.4, an integrated storage option is offered. This storage backend
does not rely on any third-party systems; it implements high availability,
supports Enterprise Replication features, and provides backup/restore workflows.

## Consensus Protocol

Vault's integrated storage uses a [consensus
protocol](<https://en.wikipedia.org/wiki/Consensus_(computer_science)>) to provide
[Consistency (as defined by CAP)](https://en.wikipedia.org/wiki/CAP_theorem).
The consensus protocol is based on ["Raft: In Search of an Understandable
Consensus Algorithm"](https://raft.github.io/raft.pdf). For a visual explanation
of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).

### Raft Protocol Overview

Raft is a consensus algorithm that is based on
[Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
to Paxos, Raft is designed to have fewer states and a simpler, more
understandable algorithm.

There are a few key terms to know when discussing Raft:

- Log - The primary unit of work in a Raft system is a log entry. The problem
  of consistency can be decomposed into a _replicated log_. A log is an ordered
  sequence of entries. Entries include any cluster change: adding nodes, adding
  services, new key-value pairs, etc. We consider the log consistent if all
  members agree on the entries and their order.

- FSM - [Finite State Machine](https://en.wikipedia.org/wiki/Finite-state_machine).
  An FSM is a collection of finite states with transitions between them. As new logs
  are applied, the FSM is allowed to transition between states. Application of the
  same sequence of logs must result in the same state, meaning behavior must be deterministic.

- Peer set - The peer set is the set of all members participating in log replication.
  For Vault's purposes, all server nodes are in the peer set of the local cluster.

- Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
  quorum requires at least `(n+1)/2` members. For example, if there are 5 members
  in the peer set, we would need 3 nodes to form a quorum (see the sketch after
  this list). If a quorum of nodes is unavailable for any reason, the cluster
  becomes _unavailable_ and no new logs can be committed.

- Committed Entry - An entry is considered _committed_ when it is durably stored
  on a quorum of nodes. Once an entry is committed it can be applied.

- Leader - At any given time, the peer set elects a single node to be the leader.
  The leader is responsible for ingesting new log entries, replicating to followers,
  and managing when an entry is considered committed. For Vault's purposes, the
  leader node is also the active Vault node and followers are standby nodes. See
  the [High Availability docs](/docs/internals/high-availability#design-overview)
  for more information.
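
To make the quorum arithmetic concrete, here is a minimal, purely illustrative Go
sketch (the helper names are our own, not part of Vault or any raft library) that
computes the quorum size and failure tolerance for a given peer-set size:

```go
package main

import "fmt"

// quorumSize returns the number of nodes that must agree for a log
// entry to be committed: a strict majority of the peer set.
func quorumSize(peers int) int {
	return peers/2 + 1
}

// failureTolerance returns how many nodes can be lost while the
// remaining nodes still form a quorum.
func failureTolerance(peers int) int {
	return peers - quorumSize(peers)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5, 6, 7} {
		fmt.Printf("peers=%d quorum=%d tolerance=%d\n",
			n, quorumSize(n), failureTolerance(n))
	}
}
```

The output mirrors the [deployment table](#deployment-table) below: growing a
cluster from an odd size to the next even size raises the quorum size without
improving failure tolerance.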

Raft is a complex protocol and will not be covered here in detail (for those who
desire a more comprehensive treatment, the full specification is available in this
[paper](https://raft.github.io/raft.pdf)). We will, however, attempt to provide
a high-level description which may be useful for building a mental model.

Raft nodes are always in one of three states: follower, candidate, or leader. All
nodes initially start out as a follower. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
The leader must accept new log entries and replicate to all the other followers.
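
These role transitions can be summarized in a few lines of Go. This is only an
illustrative sketch of the rules described above; the type and function names are
invented for this example, and real Raft additionally tracks terms, heartbeats,
and stepping down:

```go
package raftsketch

// role models the three states a Raft node can be in.
type role int

const (
	follower  role = iota // accepts entries from a leader and casts votes
	candidate             // requests votes after an election timeout
	leader                // ingests new entries and replicates them
)

// nextRole applies the simplified transition rules described above.
func nextRole(current role, electionTimedOut bool, votes, peers int) role {
	switch current {
	case follower:
		if electionTimedOut { // heard nothing from a leader for some time
			return candidate
		}
	case candidate:
		if votes >= peers/2+1 { // a quorum of votes promotes the candidate
			return leader
		}
	}
	return current
}
```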

Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry (from Raft's perspective, a log entry
is an opaque binary blob). The leader then writes the entry to durable storage and
attempts to replicate it to a quorum of followers. Once the log entry is considered
_committed_, it can be _applied_ to a finite state machine. The finite state machine
is application specific; in Vault's case, we use
[BoltDB](https://github.com/etcd-io/bbolt) to maintain cluster state. Vault's writes
block until the entry is both _committed_ and _applied_.
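
Vault's integrated storage builds on the [hashicorp/raft](https://github.com/hashicorp/raft)
library, in which the application supplies the finite state machine by implementing
an `FSM` interface. The shape of that interface is sketched below in simplified,
self-contained form (the stand-in types are noted in comments; consult the library
for the authoritative definitions). Vault's implementation applies each committed
entry to BoltDB:

```go
package sketch

import "io"

// Stand-ins for types the hashicorp/raft library defines; reduced here
// so the sketch is self-contained.
type Log struct{ Data []byte }

type FSMSnapshot interface {
	Persist(sink io.Writer) error // the library uses its own SnapshotSink type
	Release()
}

// FSM mirrors the shape of the library's FSM interface, which the
// application (here, Vault's raft backend) implements.
type FSM interface {
	// Apply is invoked for each committed log entry; Vault's
	// implementation writes the entry into BoltDB.
	Apply(log *Log) interface{}

	// Snapshot captures a point-in-time view of the state so that
	// older log entries can be compacted away.
	Snapshot() (FSMSnapshot, error)

	// Restore replaces the current state from a previous snapshot.
	Restore(snapshot io.ReadCloser) error
}
```

The `Snapshot` and `Restore` methods underpin the log compaction described next.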

Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
fashion. Raft provides a mechanism by which the current state is snapshotted and the
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
result in the same state as a replay of old logs. This allows Raft to capture the FSM
state at a point in time and then remove all the logs that were used to reach that
state. This is performed automatically without user intervention and prevents unbounded
disk usage while also minimizing the time spent replaying logs. One of the advantages of
using BoltDB is that it allows Vault's snapshots to be very lightweight. Since
Vault's data is already persisted to disk in BoltDB, the snapshot process just
needs to truncate the Raft logs.

Consensus is fault-tolerant while a cluster has quorum.
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
fails, it is now impossible to reach quorum. This means the cluster is unable to add
or remove a node or to commit any additional log entries. This results in
_unavailability_. At this point, manual intervention would be required to remove
either A or B and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to run
either 3 or 5 Vault servers per cluster. This maximizes availability without
greatly sacrificing performance. The [deployment table](#deployment-table) below
summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to half of the cluster.
Thus, performance is bound by disk I/O and network latency.

### Raft in Vault

When getting started, a single Vault server is
[initialized](/docs/commands/operator/init/#operator-init). At this point the
cluster is of size 1, which allows the node to self-elect as a leader. Once a
leader is elected, other servers can be added to the peer set in a way that
preserves consistency and safety.

The join process is how new nodes are added to the Vault cluster; it uses an
encrypted challenge/answer workflow. To accomplish this, all nodes in a single
Raft cluster must share the same seal configuration. If using Auto Unseal, the
join process can use the configured seal to automatically decrypt the challenge
and respond with the answer. If using a Shamir seal, the unseal keys must be
provided to the node attempting to join the cluster before it can decrypt the
challenge and respond with the decrypted answer.
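
Joins are typically issued with `vault operator raft join <leader-api-addr>`, which
drives the `sys/storage/raft/join` endpoint on the joining node. As a rough
illustration only (the addresses are placeholders and a real deployment would use
TLS), the same request can be made directly against that endpoint from Go:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Tell the joining node where an active node of the existing
	// cluster can be reached over its API address.
	body, _ := json.Marshal(map[string]interface{}{
		"leader_api_addr": "https://active-node.example.com:8200",
	})

	// POST to the joining node's own API; the node then runs the
	// challenge/answer workflow against the leader using its seal.
	resp, err := http.Post(
		"http://127.0.0.1:8200/v1/sys/storage/raft/join",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("join response status:", resp.Status)
}
```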

Since all servers participate as part of the peer set, they all know the current
leader. When an API request arrives at a non-leader server, the request is
forwarded to the leader.

Similar to other storage backends, data that is written to the Raft log and FSM
will be encrypted by Vault's barrier.

Vault does not currently offer automated dead server cleanup. If you wish to
decommission a node, or if a node dies and must be replaced, the node must be
removed from the cluster manually with the `remove-peer`
[command](/docs/commands/operator/raft#remove-peer).
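
For illustration, the same operation can be performed against the
`sys/storage/raft/remove-peer` endpoint, which takes the `server_id` of the node to
remove. This is a minimal sketch: the address, server ID, and token handling are
placeholders, and the request must be authenticated:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// server_id must match the node_id the departing server was
	// configured with; "node3" is just a placeholder.
	body, _ := json.Marshal(map[string]string{"server_id": "node3"})

	req, err := http.NewRequest(
		http.MethodPost,
		"http://127.0.0.1:8200/v1/sys/storage/raft/remove-peer",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	// Removing a peer is a privileged operation and needs a token.
	req.Header.Set("X-Vault-Token", os.Getenv("VAULT_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("remove-peer response status:", resp.Status)
}
```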

### Deployment Table

Below is a table that shows quorum size and failure tolerance for various
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
server deployment is _**highly**_ discouraged as data loss is inevitable in a
failure scenario.

<table class="table table-bordered table-striped">
  <thead>
    <tr>
      <th>Servers</th>
      <th>Quorum Size</th>
      <th>Failure Tolerance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <td>2</td>
      <td>2</td>
      <td>0</td>
    </tr>
    <tr class="warning">
      <td>3</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3</td>
      <td>1</td>
    </tr>
    <tr class="warning">
      <td>5</td>
      <td>3</td>
      <td>2</td>
    </tr>
    <tr>
      <td>6</td>
      <td>4</td>
      <td>2</td>
    </tr>
    <tr>
      <td>7</td>
      <td>4</td>
      <td>3</td>
    </tr>
  </tbody>
</table>