---
layout: docs
page_title: Replication
description: Learn about the details of multi-datacenter replication within Vault.
---

# Replication (Vault Enterprise)

Vault Enterprise 0.7 adds support for multi-datacenter replication. Before
using this feature, it is useful to understand the intended use cases, design
goals, and high-level architecture.

Replication is based on a primary/secondary (1:N) model with asynchronous
replication, focusing on high availability for global deployments. The
trade-offs made in the design and implementation of replication reflect these
high-level goals.

# Use Cases

Vault replication is based on a number of common use cases:

- **Multi-Datacenter Deployments**: A common challenge is providing Vault to
  applications across many datacenters in a highly-available manner. Running a
  single Vault cluster imposes high latency of access for remote clients,
  availability loss or outages during connectivity failures, and limits
  scalability.

- **Backup Sites**: Implementing a robust business continuity plan around the
  loss of a primary datacenter requires the ability to quickly and easily fail
  over to a hot backup site.

- **Scaling Throughput**: Applications that use Vault for
  Encryption-as-a-Service or cryptographic offload may generate a very high
  volume of requests for Vault. Replicating keys between multiple clusters
  allows load to be distributed across additional servers to scale request
  throughput.

# Design Goals

Based on the use cases for Vault Replication, we had a number of design goals
for the implementation:

- **Availability**: Global deployments of Vault require high levels of
  availability, and can tolerate reduced consistency. During full connectivity,
  replication is nearly real-time between the primary and secondary clusters.
  Degraded connectivity between a primary and secondary does not impact the
  primary's ability to service requests, and the secondary will continue to
  service reads on last-known data.

- **Conflict Free**: Certain replication techniques allow potential write
  conflicts to take place. In particular, any active/active configuration where
  writes are allowed at multiple sites requires a conflict resolution strategy.
  These strategies range from techniques that allow data loss, like
  last-write-wins, to techniques that require manual operator resolution, like
  keeping multiple values per key. We avoid the possibility of conflicts
  entirely, ensuring there is no data loss and no manual intervention required.

- **Transparent to Clients**: Vault replication should be transparent to
  clients of Vault, so that existing thin clients work unmodified. The Vault
  servers handle the logic of request forwarding to the primary when necessary,
  and multi-hop routing is performed internally to ensure requests are
  processed.

- **Simple to Operate**: Operating a replicated cluster should be simple, to
  avoid administrative overhead and the risk of introducing security gaps.
  Setup of replication is very simple, and secondaries can handle being
  arbitrarily behind the primary, avoiding the need for operator intervention
  to copy data or snapshot the primary.

# Architecture

The architecture of Vault replication is based on the design goals, focusing on
the intended use cases. When replication is enabled, a cluster is set as either
a _primary_ or _secondary_. The primary cluster is authoritative, and is the
only cluster allowed to perform actions that write to the underlying data
storage, such as modifying policies or secrets. Secondary clusters can service
all other operations, such as reading secrets or sending data through
`transit`, and forward any writes to the primary cluster. Disallowing multiple
primaries ensures the cluster is conflict free and has an authoritative state.

The primary cluster uses log shipping to replicate changes to all of the
secondaries. This ensures writes are visible globally in near real-time when
there is full network connectivity. If a secondary is down or unable to
communicate with the primary, writes are not blocked on the primary and reads
are still serviced on the secondary. This ensures the availability of Vault.
When the secondary is initialized or recovers from degraded connectivity, it
will automatically reconcile with the primary.

Lastly, clients can speak to any Vault server without a thick client. If a
client is communicating with a standby instance, the request is automatically
forwarded to an active instance. Secondary clusters will service reads locally
and forward any write requests to the primary cluster. The primary cluster is
able to service all request types.
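
From a client's point of view this forwarding is invisible. The sketch below is
a minimal example using the official Go API client
(`github.com/hashicorp/vault/api`); it behaves the same whether its address
points at the primary or a secondary, and the address, token, and secret paths
shown are placeholders for illustration only.

```go
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Point the client at any cluster member; standbys and secondaries
	// forward the request internally when they cannot service it.
	cfg := vault.DefaultConfig()
	cfg.Address = "https://vault-secondary.example.com:8200" // placeholder address
	client, err := vault.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken("s.example-token") // placeholder token

	// Reads are served locally by the secondary.
	secret, err := client.Logical().Read("secret/data/app/config") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	if secret != nil {
		fmt.Println(secret.Data)
	}

	// Writes are transparently forwarded to the primary cluster.
	_, err = client.Logical().Write("secret/data/app/config", map[string]interface{}{
		"data": map[string]interface{}{"api_key": "example"},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```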

An important optimization Vault makes is to avoid replication of tokens or
leases between clusters. Policies and secrets are the minority of data managed
by Vault and tend to be relatively stable. Tokens and leases are much more
dynamic, as they are created and expire rapidly. Keeping tokens and leases
locally reduces the amount of data that needs to be replicated, and distributes
the work of TTL management between the clusters. The caveat is that clients
will need to re-authenticate if they switch the Vault cluster they are
communicating with.
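
Because tokens are local to a cluster, a workload that moves from one cluster
to another must log in again against the new cluster. A minimal sketch of that
pattern with the Go API client, assuming an AppRole auth mount purely for
illustration:

```go
package example

import (
	"fmt"

	vault "github.com/hashicorp/vault/api"
)

// loginTo authenticates against one specific cluster and returns a client
// holding a token that is only valid on that cluster. Call it once per
// cluster a workload talks to; the AppRole mount and credentials are
// placeholders for whatever auth method is actually in use.
func loginTo(addr, roleID, secretID string) (*vault.Client, error) {
	cfg := vault.DefaultConfig()
	cfg.Address = addr
	client, err := vault.NewClient(cfg)
	if err != nil {
		return nil, err
	}
	secret, err := client.Logical().Write("auth/approle/login",
		map[string]interface{}{"role_id": roleID, "secret_id": secretID})
	if err != nil {
		return nil, err
	}
	if secret == nil || secret.Auth == nil {
		return nil, fmt.Errorf("no auth data returned from %s", addr)
	}
	// Tokens are not replicated, so this token only works against addr.
	client.SetToken(secret.Auth.ClientToken)
	return client, nil
}
```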

# Implementation Details

It is important to understand the high-level architecture of replication to
ensure the trade-offs are appropriate for your use case. The implementation
details may be useful for those who are curious or want to understand more
about the performance characteristics or failure scenarios.

Using replication requires a storage backend that supports transactional
updates, such as Consul. This allows multiple key/value updates to be
performed atomically. Replication uses this to maintain a
[Write-Ahead-Log][wal] (WAL) of all updates, so that the key update happens
atomically with the WAL entry creation. The WALs are then used to perform log
shipping between the Vault clusters. When a secondary is closely synchronized
with a primary, Vault directly streams new WALs to be applied, providing near
real-time replication. A bounded set of WALs is maintained for the
secondaries, and older WALs are garbage collected automatically.
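
The pairing of a key update with its WAL entry can be pictured as a single
transaction against the storage backend. The sketch below is illustrative only
(it is not Vault's internal code): it assumes a hypothetical `Txn` handle that
commits all of its writes atomically, in the way a Consul transaction does, and
a placeholder `wal/` key prefix.

```go
package example

import "fmt"

// WALEntry records one logical update so it can later be shipped to
// secondaries. The field set is illustrative only.
type WALEntry struct {
	Index int64  // monotonically increasing log index
	Key   string // storage key that changed
	Value []byte // new (encrypted) value; nil for a delete
}

// Txn is a hypothetical transactional storage handle: either every Put in the
// transaction commits, or none do (as with Consul's transaction API).
type Txn interface {
	Put(key string, value []byte)
	Commit() error
}

// applyUpdate writes the key and its WAL entry in a single transaction, so the
// log can never disagree with the data it describes.
func applyUpdate(txn Txn, entry WALEntry, encode func(WALEntry) []byte) error {
	txn.Put(entry.Key, entry.Value)
	txn.Put(fmt.Sprintf("wal/%d", entry.Index), encode(entry)) // "wal/" prefix is a placeholder
	return txn.Commit()
}
```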

When a secondary is initialized or is too far behind the primary, there may not
be enough WALs to synchronize. To handle this scenario, Vault maintains a
[merkle index][merkle] of the encrypted keys. Any time a key is updated or
deleted, the merkle index is updated to reflect the change. When a secondary
needs to reconcile with a primary, the two clusters compare their merkle
indexes to determine which keys are out of sync. The structure of the index
allows this to be done very efficiently, usually requiring only two round trips
and a small amount of data. The secondary uses this information to reconcile
and then switches back into WAL streaming mode.
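
The reason so little data changes hands is that matching hashes rule out whole
regions of the keyspace at once, so only the divergent branches have to be
walked. A toy illustration of that idea, flattened to a single level of hashed
pages rather than Vault's actual index structure:

```go
package example

// pageHashes maps a page (a fixed slice of the keyspace) to the hash of its
// contents.
type pageHashes map[string][32]byte

// divergentPages returns the pages whose hashes differ between primary and
// secondary. Pages with matching hashes are skipped entirely, which is why
// only the changed portion of the keyspace is exchanged during reconciliation.
func divergentPages(primary, secondary pageHashes) []string {
	var out []string
	for page, ph := range primary {
		if sh, ok := secondary[page]; !ok || sh != ph {
			out = append(out, page)
		}
	}
	for page := range secondary {
		if _, ok := primary[page]; !ok {
			out = append(out, page)
		}
	}
	return out
}
```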

Performance is an important concern for Vault, so WAL entries are batched and
the merkle index is not flushed to disk with every operation. Instead, the
index is updated in memory for every operation and asynchronously flushed to
disk. As a result, a crash or power loss may cause the merkle index to become
out of sync with the underlying keys. Vault uses the [ARIES][aries] recovery
algorithm to ensure the consistency of the index under those failure
conditions.

Log shipping traditionally requires the WAL stream to be synchronized, which
can introduce additional complexity when a new primary cluster is promoted.
Vault uses the merkle index as the source of truth, allowing the WAL streams to
be completely distinct and unsynchronized. This simplifies administration of
Vault Replication for operators.

## Addressing

### Cluster addresses on the primary

When a cluster is enabled as a replication primary, it persists a cluster
definition to storage, under `core/cluster/replicated/info` or
`core/cluster/replicated-dr/info`. An optional field of the cluster definition
is `primary_cluster_addr`, which may be provided in the enable request.
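
For example, with the Go API client the field can be supplied when the primary
is enabled. This sketch assumes performance replication; DR replication uses
the analogous `sys/replication/dr/primary/enable` path, and the cluster address
is a placeholder for whatever load balancer or DNS name reaches the primary's
cluster port.

```go
package example

import vault "github.com/hashicorp/vault/api"

// enablePerformancePrimary turns the cluster behind client into a performance
// replication primary, pinning the cluster address that secondaries should
// dial. clusterAddr is a placeholder, e.g. a load balancer in front of the
// primary's cluster port (8201 by default).
func enablePerformancePrimary(client *vault.Client, clusterAddr string) error {
	_, err := client.Logical().Write("sys/replication/performance/primary/enable",
		map[string]interface{}{"primary_cluster_addr": clusterAddr})
	return err
}
```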

Performance standbys regularly issue heartbeat RPC requests to the active node, and
one of the arguments to the RPC is the local node's `cluster_addr`. The
primary active node retains these cluster addresses received from its peers in
an in-memory cache named `clusterPeerClusterAddrsCache` with a 15s expiry time.

### Cluster addresses on the secondary

When a secondary is enabled, its replication activation token (obtained from
the primary) includes a `primary_cluster_addr` field. This is taken from the
persisted cluster definition created when the primary was enabled, or if no
`primary_cluster_addr` was provided at that time, the token contains the
`cluster_addr` of the current active node in the primary at the time the
activation token is created.
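
A sketch of that exchange with the Go API client, again assuming performance
replication: the primary mints an activation token for a named secondary, and
the secondary is enabled with it. The `id` value and the pre-configured clients
are placeholders.

```go
package example

import (
	"errors"

	vault "github.com/hashicorp/vault/api"
)

// mintAndEnableSecondary mints an activation token on the primary and uses it
// to enable the secondary. Both clients are assumed to be authenticated and
// pointed at their respective clusters; "secondary-east" is a placeholder id.
func mintAndEnableSecondary(primaryClient, secondaryClient *vault.Client) error {
	// The returned token embeds the primary_cluster_addr described above.
	resp, err := primaryClient.Logical().Write(
		"sys/replication/performance/primary/secondary-token",
		map[string]interface{}{"id": "secondary-east"},
	)
	if err != nil {
		return err
	}
	if resp == nil || resp.WrapInfo == nil {
		return errors.New("no activation token returned")
	}
	// The activation token is delivered response-wrapped.
	_, err = secondaryClient.Logical().Write(
		"sys/replication/performance/secondary/enable",
		map[string]interface{}{"token": resp.WrapInfo.Token},
	)
	return err
}
```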

The secondary persists its own version of the cluster definition to storage,
again under `core/cluster/replicated/info` or
`core/cluster/replicated-dr/info`. Here the `primary_cluster_addr` field is
the one obtained from the activation token.

The secondary active node regularly issues heartbeat RPC requests to the
primary active node. In response to these, the primary returns a response
which includes a `ClusterAddrs` field, comprising the contents of its
`clusterPeerClusterAddrsCache` plus the current active node's `cluster_addr`.
The secondary uses the response both to update its in-memory record of
`known_primary_cluster_addrs` and to persist the addresses to storage under
`core/primary-addrs/dr` or `core/primary-addrs/perf`. When this happens it
logs the line `"replication: successful heartbeat"`, which includes the
`ClusterAddrs` value obtained in the response.

### Secondary RPC address resolution

gRPC is given a list of target addresses for it to use in performing RPC
requests. gRPC will discover that performance standbys can't service most RPCs,
and will quickly weed out all but the active node's cluster address. If the
primary active node changes, gRPC will learn that its address is no longer
viable and will automatically fail over to the new active node, assuming it's
one of the known target addresses.

The secondary runs a background resolver goroutine that, every few seconds,
builds the gRPC target list of addresses for the primary. Its output is logged
at Trace level as `"loaded addresses"`.

To build the primary cluster address list, the resolver goroutine normally
simply concatenates `known_primary_cluster_addrs` with the
`primary_cluster_addr` in the cluster definition in storage.
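
Conceptually, the normal case amounts to a concatenation with empty and
duplicate entries removed. The sketch below illustrates the idea and is not
Vault's actual resolver code.

```go
package example

// primaryTargets builds the list of addresses handed to gRPC: the addresses
// learned from heartbeats plus the persisted primary_cluster_addr, with
// duplicates and empty entries removed.
func primaryTargets(knownPrimaryClusterAddrs []string, primaryClusterAddr string) []string {
	seen := make(map[string]bool)
	var targets []string
	for _, addr := range append(knownPrimaryClusterAddrs, primaryClusterAddr) {
		if addr != "" && !seen[addr] {
			seen[addr] = true
			targets = append(targets, addr)
		}
	}
	return targets
}
```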

There are two exceptions to that normal behavior: the first time the goroutine
goes through its loop, and when gRPC asks for a forced ResolveNow, which
happens when it's unable to perform RPCs on any of its target addresses. In
both of these cases, the resolver goroutine issues a special RemoteResolve RPC
to the primary. This RPC is special because, unlike all the other replication
RPCs, it can be serviced by performance standbys as well as the active node. In
either case the node will return the `primary_cluster_addr` stored in the
primary's cluster definition, if any, or failing that the current active node's
cluster address. The result of the RemoteResolve call gets included in the
list of target addresses the resolver gives to gRPC for regular RPCs to the
primary.

# Caveats

~> **Mismatched Cluster Versions**: It is not safe to replicate from a newer
version of Vault to an older version. When upgrading replicated clusters,
ensure that upstream clusters are always on an older version of Vault than
downstream clusters. See
[Upgrading Vault](/docs/upgrading#replication-installations) for an example.

- **Read-After-Write Consistency**: All write requests are forwarded from
  secondaries to the primary cluster in order to avoid potential conflicts.
  While replication is near real-time, it is not instantaneous, meaning there
  is a potential for a client to write to a secondary and a subsequent read to
  return an old value. Secondaries attempt to mask this from an individual
  client making subsequent requests by stalling write requests until the write
  is replicated or a timeout is reached (2 seconds). If the timeout is reached,
  the client will receive a warning. Clients can also take steps to protect
  against this; see [Consistency](/docs/enterprise/consistency#mitigations).

- **Stale Reads**: Secondary clusters service reads based on their
  locally-replicated data. During normal operation updates from a primary are
  received in near real-time by secondaries. However, during an outage or
  network service disruption, replication may stall and secondaries may have
  stale data. The cluster will automatically recover and reconcile any stale
  data once the outage has been resolved, but reads in the intervening period
  may receive stale data.

[wal]: https://en.wikipedia.org/wiki/Write-ahead_logging
[merkle]: https://en.wikipedia.org/wiki/Merkle_tree
[aries]: https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics