---
layout: docs
page_title: Integrated Storage
description: Learn about the integrated raft storage in Vault.
---
# Integrated Storage
Vault supports a number of Storage options for the durable storage of Vault's
information. Each backend has pros, cons, advantages, and trade-offs. For
example, some backends support high availability while others provide a more
robust backup and restoration process.

As of Vault 1.4, an integrated storage option is offered. This storage backend
does not rely on any third-party systems; it implements high availability,
supports Enterprise Replication features, and provides backup/restore workflows.
## Consensus Protocol
Vault's integrated storage uses a [consensus
protocol](<https://en.wikipedia.org/wiki/Consensus_(computer_science)>) to provide
[Consistency (as defined by CAP)](https://en.wikipedia.org/wiki/CAP_theorem).
The consensus protocol is based on ["Raft: In Search of an Understandable
Consensus Algorithm"](https://raft.github.io/raft.pdf). For a visual explanation
of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).
### Raft Protocol Overview
Raft is a consensus algorithm that is based on
[Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
to Paxos, Raft is designed to have fewer states and a simpler, more
understandable algorithm.
There are a few key terms to know when discussing Raft:
- Log - The primary unit of work in a Raft system is a log entry. The problem
of consistency can be decomposed into a _replicated log_. A log is an ordered
sequence of entries. Entries include any cluster change: adding nodes, adding
services, new key-value pairs, etc. We consider the log consistent if all
members agree on the entries and their order.
- FSM - [Finite State Machine](https://en.wikipedia.org/wiki/Finite-state_machine).
An FSM is a collection of finite states with transitions between them. As new logs
are applied, the FSM is allowed to transition between states. Application of the
same sequence of logs must result in the same state, meaning behavior must be deterministic.
- Peer set - The peer set is the set of all members participating in log replication.
For Vault's purposes, all server nodes are in the peer set of the local cluster.
- Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
quorum requires at least `(n/2)+1` members (using integer division). For example,
if there are 5 members in the peer set, we would need 3 nodes to form a quorum. If
a quorum of nodes is unavailable for any reason, the cluster becomes _unavailable_
and no new logs can be committed. (A small quorum calculation sketch follows this
list.)
- Committed Entry - An entry is considered _committed_ when it is durably stored
on a quorum of nodes. Once an entry is committed it can be applied.
- Leader - At any given time, the peer set elects a single node to be the leader.
The leader is responsible for ingesting new log entries, replicating to followers,
and managing when an entry is considered committed. For Vault's purposes, the
leader node is also the active Vault node and followers are standby nodes. See
the [High Availability docs](/docs/internals/high-availability#design-overview)
for more information.
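
To make the quorum arithmetic concrete, the following minimal Go sketch (not part
of Vault; written here purely for illustration) computes the quorum size and
failure tolerance for a given peer-set size. Its output matches the
[deployment table](#deployment-table) at the end of this page.

```go
package main

import "fmt"

// quorumSize returns the minimum number of members that must agree to
// commit an entry: a strict majority of the peer set.
func quorumSize(peers int) int {
	return peers/2 + 1
}

// failureTolerance returns how many members can fail while the remaining
// members still form a quorum.
func failureTolerance(peers int) int {
	return peers - quorumSize(peers)
}

func main() {
	for peers := 1; peers <= 7; peers++ {
		fmt.Printf("servers=%d quorum=%d tolerance=%d\n",
			peers, quorumSize(peers), failureTolerance(peers))
	}
}
```
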
Raft is a complex protocol and will not be covered here in detail (for those who
desire a more comprehensive treatment, the full specification is available in this
[paper](https://raft.github.io/raft.pdf)). We will, however, attempt to provide
a high-level description which may be useful for building a mental model.

Raft nodes are always in one of three states: follower, candidate, or leader. All
nodes initially start as followers. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
The leader must accept new log entries and replicate to all the other followers.
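
The sketch below (illustrative only; it is not Vault's or any library's actual
implementation) models these three states and the two promotions just described:
an election timeout turns a follower into a candidate, and a quorum of votes turns
a candidate into a leader.

```go
package raftsketch

// State is one of the three roles a Raft node can hold.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node is a toy model that keeps only what the election logic needs.
type Node struct {
	state        State
	peers        int // total number of members in the peer set
	votesGranted int
}

// OnElectionTimeout fires when a follower has heard nothing from a leader
// for too long: it self-promotes to candidate and votes for itself.
func (n *Node) OnElectionTimeout() {
	if n.state == Follower {
		n.state = Candidate
		n.votesGranted = 1 // a candidate always votes for itself
	}
}

// OnVoteGranted records a vote from a peer; once a quorum of the peer set
// has voted for this candidate, it becomes the leader.
func (n *Node) OnVoteGranted() {
	if n.state != Candidate {
		return
	}
	n.votesGranted++
	if n.votesGranted >= n.peers/2+1 {
		n.state = Leader
	}
}

// OnLeaderHeartbeat is called when a valid leader is observed; a candidate
// (or stale leader) steps back down to follower.
func (n *Node) OnLeaderHeartbeat() {
	n.state = Follower
	n.votesGranted = 0
}
```
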
Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry (from Raft's perspective, a log entry
is an opaque binary blob). The leader then writes the entry to durable storage and
attempts to replicate to a quorum of followers. Once the log entry is considered
_committed_, it can be _applied_ to a finite state machine. The finite state machine
is application specific; in Vault's case, we use
[BoltDB](https://github.com/etcd-io/bbolt) to maintain cluster state. Vault's writes
block until the log entry is both _committed_ and _applied_.
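
To illustrate the commit/apply boundary, here is a minimal Go sketch of the FSM
contract described above. It is loosely modeled on HashiCorp's open-source Raft
library, but the type and method names here are illustrative rather than the
library's exact API.

```go
package raftsketch

// Entry is a committed log entry as handed to the application. From Raft's
// perspective the payload is an opaque binary blob; in Vault's case it
// carries barrier-encrypted data.
type Entry struct {
	Index uint64 // position of the entry in the replicated log
	Data  []byte // opaque payload
}

// FSM is the application-specific state machine. For Vault, the state behind
// this interface is the BoltDB-backed cluster state; a write is only
// acknowledged once its entry has been committed by Raft and applied here.
type FSM interface {
	// Apply is invoked for every committed entry, in log order. It must be
	// deterministic so that every node converges on the same state.
	Apply(e Entry) error
}
```
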
Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
fashion. Raft provides a mechanism by which the current state is snapshotted and the
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
result in the same state as a replay of old logs. This allows Raft to capture the FSM
state at a point in time and then remove all the logs that were used to reach that
state. This is performed automatically without user intervention and prevents unbounded
disk usage while also minimizing the time spent replaying logs. One of the advantages of
using BoltDB is that it allows Vault's snapshots to be very lightweight. Since
Vault's data is already persisted to disk in BoltDB, the snapshot process just
needs to truncate the Raft logs.
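
The snapshot/restore half of the FSM contract can be sketched the same way (again,
the names are illustrative, not an exact API):

```go
package raftsketch

import "io"

// Snapshotter is the compaction half of the FSM contract: Raft captures the
// FSM state at a point in time and can later restore from that capture.
type Snapshotter interface {
	// Snapshot captures the FSM state at the current log index. Once the
	// snapshot is durably stored, all logs up to that index can be
	// discarded. Because Vault's state already lives on disk in BoltDB,
	// this step is lightweight.
	Snapshot() (io.ReadCloser, error)

	// Restore replaces the FSM state with a previously captured snapshot.
	// By the FSM property, the result must equal replaying the old logs.
	Restore(snapshot io.ReadCloser) error
}
```
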
Consensus is fault-tolerant while a cluster has quorum.
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
fails, it is now impossible to reach quorum. This means the cluster is unable to add
or remove a node or to commit any additional log entries. This results in
_unavailability_. At this point, manual intervention would be required to remove
either A or B and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to either
run 3 or 5 Vault servers per cluster. This maximizes availability without
greatly sacrificing performance. The [deployment table](#deployment-table) below
summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to half of the cluster.
Thus, performance is bound by disk I/O and network latency.
### Raft in Vault
When getting started, a single Vault server is
[initialized](/docs/commands/operator/init/#operator-init). At this point the
cluster is of size 1 which allows the node to self-elect as a leader. Once a
leader is elected, other servers can be added to the peer set in a way that
preserves consistency and safety.

The join process is how new nodes are added to the Vault cluster; it uses an
encrypted challenge/answer workflow. To accomplish this, all nodes in a single
Raft cluster must share the same seal configuration. If using Auto Unseal, the
join process can use the configured seal to automatically decrypt the challenge
and respond with the answer. If using a Shamir seal, the unseal keys must be
provided to the node attempting to join the cluster before it can decrypt the
challenge and respond with the decrypted answer.
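
Purely as a conceptual sketch of that challenge/answer idea (the names below are
hypothetical and this is not Vault's actual join implementation), the joining
node's side can be pictured as follows:

```go
package joinsketch

import "context"

// Seal abstracts whatever can decrypt data encrypted under the cluster's
// shared seal configuration: an auto-unseal mechanism such as a cloud KMS,
// or a Shamir seal reconstructed from the unseal keys. (Hypothetical
// interface, for illustration only.)
type Seal interface {
	Decrypt(ctx context.Context, ciphertext []byte) ([]byte, error)
}

// answerChallenge models the joining node's half of the workflow: the
// cluster issues a challenge encrypted under the shared seal, and the node
// proves it holds the same seal by returning the decrypted answer. With
// auto-unseal this can run unattended; with a Shamir seal the unseal keys
// must be supplied first so the seal can decrypt at all.
func answerChallenge(ctx context.Context, seal Seal, challenge []byte) ([]byte, error) {
	return seal.Decrypt(ctx, challenge)
}
```
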
Since all servers participate as part of the peer set, they all know the current
leader. When an API request arrives at a non-leader server, the request is
forwarded to the leader.

Similar to other storage backends, data that is written to the Raft log and FSM
will be encrypted by Vault's barrier.

Vault does not currently offer automated dead server cleanup. If you wish to
decommission a node, or if a node dies and must be replaced, the node must be
manually removed from the cluster with the `remove peer`
[command](/docs/commands/operator/raft#remove-peer).
### Deployment Table
Below is a table that shows quorum size and failure tolerance for various
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
server deployment is _**highly**_ discouraged as data loss is inevitable in a
failure scenario.

<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Servers</th>
<th>Quorum Size</th>
<th>Failure Tolerance</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>0</td>
</tr>
<tr class="warning">
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>4</td>
<td>3</td>
<td>1</td>
</tr>
<tr class="warning">
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>6</td>
<td>4</td>
<td>2</td>
</tr>
<tr>
<td>7</td>
<td>4</td>
<td>3</td>
</tr>
</tbody>
</table>