open-vault/website/source/docs/internals/replication.html.md

---
layout: "docs"
page_title: "Replication"
sidebar_title: "Replication"
sidebar_current: "docs-internals-replication"
description: |-
  Learn about the details of multi-datacenter replication within Vault.
---

# Replication (Vault Enterprise)

Vault Enterprise 0.7 adds support for multi-datacenter replication. Before
using this feature, it is useful to understand the intended use cases, design
goals, and high level architecture.

Replication is based on a primary/secondary (1:N) model with asynchronous
replication, focusing on high availability for global deployments. The
trade-offs made in the design and implementation of replication reflect these
high level goals.

# Use Cases

Vault replication is based on a number of common use cases:

* **Multi-Datacenter Deployments**: A common challenge is providing Vault to
  applications across many datacenters in a highly-available manner. Running a
  single Vault cluster imposes high latency of access for remote clients,
  availability loss or outages during connectivity failures, and limits
  scalability.

* **Backup Sites**: Implementing a robust business continuity plan around the
  loss of a primary datacenter requires the ability to quickly and easily fail
  to a hot backup site.

* **Scaling Throughput**: Applications that use Vault for
  Encryption-as-a-Service or cryptographic offload may generate a very high
  volume of requests for Vault. Replicating keys between multiple clusters
  allows load to be distributed across additional servers to scale request
  throughput.

# Design Goals

Based on the use cases for Vault Replication, we had a number of design goals
for the implementation:

* **Availability**: Global deployments of Vault require high levels of
  availability, and can tolerate reduced consistency. During full connectivity,
  replication is nearly real-time between the primary and secondary clusters.
  Degraded connectivity between a primary and secondary does not impact the
  primary's ability to service requests, and the secondary will continue to
  service reads on last-known data.

* **Conflict Free**: Certain replication techniques allow for potential write
  conflicts to take place. Particularly, any active/active configuration where
  writes are allowed to multiple sites require a conflict resolution strategy.
  This varies from techniques that allow for data loss like last-write-wins, or
  techniques that require manual operator resolution like allowing multiple
  values per key. We avoid the possibility of conflicts to ensure there is no
  data loss or manual intervention required.

* **Transparent to Clients**: Vault replication should be transparent to
  clients of Vault, so that existing thin clients work unmodified. The Vault
  servers handle the logic of request forwarding to the primary when necessary,
  and multi-hop routing is performed internally to ensure requests are
  processed.

* **Simple to Operate**: Operating a replicated cluster should be simple to
  avoid administrative overhead and potentially introducing security gaps.
  Setup of replication is very simple, and secondaries can handle being
  arbitrarily behind the primary, avoiding the need for operator intervention
  to copy data or snapshot the primary.

# Architecture

The architecture of Vault replication is based on the design goals, focusing on
the intended use cases. When replication is enabled, a cluster is set as either
a _primary_ or _secondary_. The primary cluster is authoritative, and is the
only cluster allowed to perform actions that write to the underlying data
storage, such as modifying policies or secrets. Secondary clusters can service
all other operations, such as reading secrets or sending data through
`transit`, and forward any writes to the primary cluster. Disallowing multiple
primaries ensures the cluster is conflict free and has an authoritative state.

The primary cluster uses log shipping to replicate changes to all of the
secondaries.  This ensures writes are visible globally in near real-time when
there is full network connectivity. If a secondary is down or unable to
communicate with the primary, writes are not blocked on the primary and reads
are still serviced on the secondary. This ensures the availability of Vault.
When the secondary is initialized or recovers from degraded connectivity it
will automatically reconcile with the primary.

Lastly, clients can speak to any Vault server without a thick client. If a
client is communicating with a standby instance, the request is automatically
forwarded to an active instance. Secondary clusters will service reads locally
and forward any write requests to the primary cluster. The primary cluster is
able to service all request types.

An important optimization Vault makes is to avoid replication of tokens or
leases between clusters. Policies and secrets are the minority of data managed
by Vault and tend to be relatively stable. Tokens and leases are much more
dynamic, as they are created and expire rapidly. Keeping tokens and leases
locally reduces the amount of data that needs to be replicated, and distributes
the work of TTL management between the clusters. The caveat is that clients
will need to re-authenticate if they switch the Vault cluster they are
communicating with.

# Implementation Details

It is important to understand the high-level architecture of replication to
ensure the trade-offs are appropriate for your use case. The implementation
details may be useful for those who are curious or want to understand more
about the performance characteristics or failure scenarios.

Using replication requires a storage backend that supports transactional
updates, such as Consul.  This allows multiple key/value updates to be
performed atomically. Replication uses this to maintain a
[Write-Ahead-Log][wal] (WAL) of all updates, so that the key update happens
atomically with the WAL entry creation.  The WALs are then used to perform log
shipping between the Vault clusters. When a secondary is closely synchronized
with a primary, Vault directly streams new WALs to be applied, providing near
real-time replication. A bounded set of WALs are maintained for the
secondaries, and older WALs are garbage collected automatically.

When a secondary is initialized or is too far behind the primary there may not
be enough WALs to synchronize. To handle this scenario, Vault maintains a
[merkle index][merkle] of the encrypted keys. Any time a key is updated or
deleted, the merkle index is updated to reflect the change.  When a secondary
needs to reconcile with a primary, they compare their merkle indexes to
determine which keys are out of sync. The structure of the index allows this to
be done very efficiently, usually requiring only two round trips and a small
amount of data. The secondary uses this information to reconcile and then
switches back into WAL streaming mode.

Performance is an important concern for Vault, so WAL entries are batched and
the merkle index is not flushed to disk with every operation. Instead, the
index is updated in memory for every operation and asynchronously flushed to
disk. As a result, a crash or power loss may cause the merkle index to become
out of sync with the underlying keys. Vault uses the [ARIES][aries] recovery
algorithm to ensure the consistency of the index under those failure
conditions.

Log shipping traditionally requires the WAL stream to be synchronized, which
can introduce additional complexity when a new primary cluster is promoted.
Vault uses the merkle index as the source of truth, allowing the WAL streams to
be completely distinct and unsynchronized.  This simplifies administration of
Vault Replication for operators.

# Caveats

* **Read-After-Write Consistency**: All write requests are forwarded from
  secondaries to the primary cluster in order to avoid potential conflicts.
  While replication is near real-time, it is not instantaneous, meaning there
  is a potential for a client to write to a secondary and a subsequent read to
  return an old value. Secondaries attempt to mask this from an individual
  client making subsequent requests by stalling write requests until the write
  is replicated or a timeout is reached (2 seconds). If the timeout is reached,
  the client will receive a warning.

* **Stale Reads**: Secondary clusters service reads based on their
  locally-replicated data. During normal operation updates from a primary are
  received in near real-time by secondaries. However, during an outage or
  network service disruption, replication may stall and secondaries may have
  stale data. The cluster will automatically recover and reconcile any stale
  data once the outage has recovered, but reads in the intervening period may
  receive stale data.

[wal]: https://en.wikipedia.org/wiki/Write-ahead_logging
[merkle]: https://en.wikipedia.org/wiki/Merkle_tree
[aries]: https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics
Port some replication bits to OSS (#2386) 2017-02-16 20:15:02 +00:00			`---`
			`layout: "docs"`
			`page_title: "Replication"`
New Docs Website (#5535) * conversion stage 1 * correct image paths * add sidebar title to frontmatter * docs/concepts and docs/internals * configuration docs and multi-level nav corrections * commands docs, index file corrections, small item nav correction * secrets converted * auth * add enterprise and agent docs * add extra dividers * secret section, wip * correct sidebar nav title in front matter for apu section, start working on api items * auth and backend, a couple directory structure fixes * remove old docs * intro side nav converted * reset sidebar styles, add hashi-global-styles * basic styling for nav sidebar * folder collapse functionality * patch up border length on last list item * wip restructure for content component * taking middleman hacking to the extreme, but its working * small css fix * add new mega nav * fix a small mistake from the rebase * fix a content resolution issue with middleman * title a couple missing docs pages * update deps, remove temporary markup * community page * footer to layout, community page css adjustments * wip downloads page * deps updated, downloads page ready * fix community page * homepage progress * add components, adjust spacing * docs and api landing pages * a bunch of fixes, add docs and api landing pages * update deps, add deploy scripts * add readme note * update deploy command * overview page, index title * Update doc fields Note this still requires the link fields to be populated -- this is solely related to copy on the description fields * Update api_basic_categories.yml Updated API category descriptions. Like the document descriptions you'll still need to update the link headers to the proper target pages. * Add bottom hero, adjust CSS, responsive friendly * Add mega nav title * homepage adjustments, asset boosts * small fixes * docs page styling fixes * meganav title * some category link corrections * Update API categories page updated to reflect the second level headings for api categories * Update docs_detailed_categories.yml Updated to represent the existing docs structure * Update docs_detailed_categories.yml * docs page data fix, extra operator page remove * api data fix * fix makefile * update deps, add product subnav to docs and api landing pages * Rearrange non-hands-on guides to _docs_ Since there is no place for these on learn.hashicorp, we'll put them under _docs_. * WIP Redirects for guides to docs * content and component updates * font weight hotfix, redirects * fix guides and intro sidenavs * fix some redirects * small style tweaks * Redirects to learn and internally to docs * Remove redirect to `/vault` * Remove `.html` from destination on redirects * fix incorrect index redirect * final touchups * address feedback from michell for makefile and product downloads 2018-10-19 15:40:11 +00:00			`sidebar_title: "Replication"`
Port some replication bits to OSS (#2386) 2017-02-16 20:15:02 +00:00			`sidebar_current: "docs-internals-replication"`
			`description: \|-`
			`Learn about the details of multi-datacenter replication within Vault.`
			`---`

			`# Replication (Vault Enterprise)`

			`Vault Enterprise 0.7 adds support for multi-datacenter replication. Before`
			`using this feature, it is useful to understand the intended use cases, design`
			`goals, and high level architecture.`

			`Replication is based on a primary/secondary (1:N) model with asynchronous`
			`replication, focusing on high availability for global deployments. The`
			`trade-offs made in the design and implementation of replication reflect these`
			`high level goals.`

			`# Use Cases`

			`Vault replication is based on a number of common use cases:`

			`* Multi-Datacenter Deployments: A common challenge is providing Vault to`
			`applications across many datacenters in a highly-available manner. Running a`
			`single Vault cluster imposes high latency of access for remote clients,`
			`availability loss or outages during connectivity failures, and limits`
			`scalability.`

			`* Backup Sites: Implementing a robust business continuity plan around the`
			`loss of a primary datacenter requires the ability to quickly and easily fail`
			`to a hot backup site.`

			`* Scaling Throughput: Applications that use Vault for`
			`Encryption-as-a-Service or cryptographic offload may generate a very high`
			`volume of requests for Vault. Replicating keys between multiple clusters`
			`allows load to be distributed across additional servers to scale request`
			`throughput.`

			`# Design Goals`

			`Based on the use cases for Vault Replication, we had a number of design goals`
			`for the implementation:`

			`* Availability: Global deployments of Vault require high levels of`
			`availability, and can tolerate reduced consistency. During full connectivity,`
			`replication is nearly real-time between the primary and secondary clusters.`
			`Degraded connectivity between a primary and secondary does not impact the`
			`primary's ability to service requests, and the secondary will continue to`
			`service reads on last-known data.`

			`* Conflict Free: Certain replication techniques allow for potential write`
			`conflicts to take place. Particularly, any active/active configuration where`
			`writes are allowed to multiple sites require a conflict resolution strategy.`
			`This varies from techniques that allow for data loss like last-write-wins, or`
			`techniques that require manual operator resolution like allowing multiple`
			`values per key. We avoid the possibility of conflicts to ensure there is no`
			`data loss or manual intervention required.`

			`* Transparent to Clients: Vault replication should be transparent to`
			`clients of Vault, so that existing thin clients work unmodified. The Vault`
			`servers handle the logic of request forwarding to the primary when necessary,`
			`and multi-hop routing is performed internally to ensure requests are`
			`processed.`

			`* Simple to Operate: Operating a replicated cluster should be simple to`
			`avoid administrative overhead and potentially introducing security gaps.`
			`Setup of replication is very simple, and secondaries can handle being`
			`arbitrarily behind the primary, avoiding the need for operator intervention`
			`to copy data or snapshot the primary.`

			`# Architecture`

			`The architecture of Vault replication is based on the design goals, focusing on`
			`the intended use cases. When replication is enabled, a cluster is set as either`
			`a _primary_ or _secondary_. The primary cluster is authoritative, and is the`
			`only cluster allowed to perform actions that write to the underlying data`
			`storage, such as modifying policies or secrets. Secondary clusters can service`
			`all other operations, such as reading secrets or sending data through`
			`transit`, and forward any writes to the primary cluster. Disallowing multiple
			`primaries ensures the cluster is conflict free and has an authoritative state.`

			`The primary cluster uses log shipping to replicate changes to all of the`
			`secondaries. This ensures writes are visible globally in near real-time when`
			`there is full network connectivity. If a secondary is down or unable to`
			`communicate with the primary, writes are not blocked on the primary and reads`
			`are still serviced on the secondary. This ensures the availability of Vault.`
			`When the secondary is initialized or recovers from degraded connectivity it`
			`will automatically reconcile with the primary.`

			`Lastly, clients can speak to any Vault server without a thick client. If a`
			`client is communicating with a standby instance, the request is automatically`
docs: fix typo (change 'a' to 'an' - indefinite article) (#3741) 2018-01-03 15:47:15 +00:00			`forwarded to an active instance. Secondary clusters will service reads locally`
Port some replication bits to OSS (#2386) 2017-02-16 20:15:02 +00:00			`and forward any write requests to the primary cluster. The primary cluster is`
			`able to service all request types.`

			`An important optimization Vault makes is to avoid replication of tokens or`
			`leases between clusters. Policies and secrets are the minority of data managed`
			`by Vault and tend to be relatively stable. Tokens and leases are much more`
			`dynamic, as they are created and expire rapidly. Keeping tokens and leases`
			`locally reduces the amount of data that needs to be replicated, and distributes`
			`the work of TTL management between the clusters. The caveat is that clients`
			`will need to re-authenticate if they switch the Vault cluster they are`
			`communicating with.`

			`# Implementation Details`

			`It is important to understand the high-level architecture of replication to`
			`ensure the trade-offs are appropriate for your use case. The implementation`
			`details may be useful for those who are curious or want to understand more`
			`about the performance characteristics or failure scenarios.`

			`Using replication requires a storage backend that supports transactional`
			`updates, such as Consul. This allows multiple key/value updates to be`
			`performed atomically. Replication uses this to maintain a`
			`[Write-Ahead-Log][wal] (WAL) of all updates, so that the key update happens`
			`atomically with the WAL entry creation. The WALs are then used to perform log`
			`shipping between the Vault clusters. When a secondary is closely synchronized`
			`with a primary, Vault directly streams new WALs to be applied, providing near`
			`real-time replication. A bounded set of WALs are maintained for the`
			`secondaries, and older WALs are garbage collected automatically.`

			`When a secondary is initialized or is too far behind the primary there may not`
			`be enough WALs to synchronize. To handle this scenario, Vault maintains a`
			`[merkle index][merkle] of the encrypted keys. Any time a key is updated or`
			`deleted, the merkle index is updated to reflect the change. When a secondary`
			`needs to reconcile with a primary, they compare their merkle indexes to`
			`determine which keys are out of sync. The structure of the index allows this to`
			`be done very efficiently, usually requiring only two round trips and a small`
			`amount of data. The secondary uses this information to reconcile and then`
			`switches back into WAL streaming mode.`

			`Performance is an important concern for Vault, so WAL entries are batched and`
			`the merkle index is not flushed to disk with every operation. Instead, the`
			`index is updated in memory for every operation and asynchronously flushed to`
			`disk. As a result, a crash or power loss may cause the merkle index to become`
			`out of sync with the underlying keys. Vault uses the [ARIES][aries] recovery`
			`algorithm to ensure the consistency of the index under those failure`
			`conditions.`

			`Log shipping traditionally requires the WAL stream to be synchronized, which`
			`can introduce additional complexity when a new primary cluster is promoted.`
			`Vault uses the merkle index as the source of truth, allowing the WAL streams to`
			`be completely distinct and unsynchronized. This simplifies administration of`
			`Vault Replication for operators.`

Minor doc updates 2017-03-14 14:11:47 +00:00			`# Caveats`

			`* Read-After-Write Consistency: All write requests are forwarded from`
			`secondaries to the primary cluster in order to avoid potential conflicts.`
			`While replication is near real-time, it is not instantaneous, meaning there`
			`is a potential for a client to write to a secondary and a subsequent read to`
			`return an old value. Secondaries attempt to mask this from an individual`
			`client making subsequent requests by stalling write requests until the write`
			`is replicated or a timeout is reached (2 seconds). If the timeout is reached,`
			`the client will receive a warning.`

			`* Stale Reads: Secondary clusters service reads based on their`
			`locally-replicated data. During normal operation updates from a primary are`
			`received in near real-time by secondaries. However, during an outage or`
			`network service disruption, replication may stall and secondaries may have`
			`stale data. The cluster will automatically recover and reconcile any stale`
			`data once the outage has recovered, but reads in the intervening period may`
			`receive stale data.`

Port some replication bits to OSS (#2386) 2017-02-16 20:15:02 +00:00			`[wal]: https://en.wikipedia.org/wiki/Write-ahead_logging`
			`[merkle]: https://en.wikipedia.org/wiki/Merkle_tree`
			`[aries]: https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics`