open-vault/website/content/docs/enterprise/consistency.mdx
2022-03-23 09:12:52 -04:00


---
layout: docs
page_title: Vault Enterprise Eventual Consistency
description: Vault Enterprise Consistency Model
---
# Vault Eventual Consistency
When running in a cluster, Vault has an eventual consistency model.
Only one node (the leader) can write to Vault's storage.
Users generally expect read-after-write consistency: in other
words, after writing foo=1, a subsequent read of foo should return 1. Depending
on the Vault configuration this isn't always the case. When using performance
standbys with Integrated Storage, or when using performance replication,
there are some sequences of operations that don't always yield read-after-write
consistency.
## Performance Standby Nodes
When using the Integrated Storage backend without performance standbys, only
a single Vault node (the active node) handles requests. Requests sent to
regular standbys are handled by forwarding them to the active node. This Vault configuration
gives Vault the same behavior as the default Consul consistency model.

When using the Integrated Storage backend with performance standbys, both the
active node and performance standbys can handle requests. If a performance standby
handles a login request, or a request that generates a dynamic secret, the
performance standby will issue a remote procedure call (RPC) to the active node to store the token
and/or lease. If the performance standby handles any other request that
results in a storage write, it will forward that request to the active node
in the same way a regular standby forwards all requests.

With Integrated Storage, all writes occur on the active node, which then issues
RPCs to update the local storage on every other node. Between when the active
node writes the data to its local disk, and when those RPCs are handled on the
other nodes to write the data to their local disks, those nodes present a stale
view of the data.

As a result, even if you're always talking to the same performance standby,
you may not get read-after-write semantics. The write gets sent to the active
node, and if the subsequent read request occurs before the new data gets sent
to the node handling the read request, the read request won't be able to take
the write into account because the new data isn't present on that node yet.
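
The sequence above can be illustrated with a toy model. This is a minimal sketch, not Vault's implementation: the `node`, `cluster`, and `entry` types and their methods are hypothetical names invented for the example, and replication is triggered by hand so the stale window is easy to see.

```go
package main

import "fmt"

// node is a toy replica: a key/value map plus the index of the last
// write it has applied. Illustrative only; not a Vault type.
type node struct {
	applied uint64
	data    map[string]string
}

// entry is one replicated write.
type entry struct {
	index      uint64
	key, value string
}

// cluster models one active node and one performance standby.
type cluster struct {
	active, standby *node
	pending         []entry // writes not yet shipped to the standby
}

func newCluster() *cluster {
	return &cluster{
		active:  &node{data: map[string]string{}},
		standby: &node{data: map[string]string{}},
	}
}

// write is always applied on the active node, even when the client sent
// the request to the standby (the standby forwards it).
func (c *cluster) write(key, value string) uint64 {
	c.active.applied++
	c.active.data[key] = value
	c.pending = append(c.pending, entry{c.active.applied, key, value})
	return c.active.applied
}

// replicate ships all pending writes to the standby.
func (c *cluster) replicate() {
	for _, e := range c.pending {
		c.standby.data[e.key] = e.value
		c.standby.applied = e.index
	}
	c.pending = nil
}

func main() {
	c := newCluster()
	c.write("foo", "1")
	// A read served by the standby before replication misses the write.
	fmt.Printf("before replication: %q\n", c.standby.data["foo"])
	c.replicate()
	fmt.Printf("after replication: %q\n", c.standby.data["foo"])
}
```

In this model the first read returns an empty value and the second returns `"1"`, which is the stale-read window the section describes.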
## Performance replication
A similar phenomenon occurs when using performance replication. One example
of how this manifests is when using shared mounts. If a KV secrets engine
is mounted on the primary with `local=false`, it will exist on the secondary
cluster as well. The secondary cluster can handle requests to that mount,
though as with performance standbys, write requests must be forwarded - in
this case to the primary active node. Once data is written to the primary cluster,
it won't be visible on the secondary cluster until the data has been replicated
from the primary. Therefore, on the secondary cluster, it initially appears as if
the data write hasn't happened.

If the secondary cluster is using Integrated Storage, and the read request is
being handled on one of its performance standbys, the problem is exacerbated because the
data has to be sent first from the primary active node to the secondary active node,
and then from there to the secondary performance standby. Each of those hops can
introduce its own lag.

Even without shared secret engines, stale reads can still happen with performance
replication. The Identity subsystem aims to provide a view of entities and
groups that spans clusters. As such, when logging in to a secondary cluster
using a shared mount, Vault tries to generate an entity and alias if they don't
already exist, and these must be stored on the primary using an RPC. Something
similar happens with groups.
## Mitigations
There has long been a partial mitigation for the above problems. When writing
data via RPC, e.g. when a performance standby registers tokens and leases on the
active node after a login or generating a dynamic secret, part of the response
includes a number known as the "WAL index" (Write-Ahead Log index).

A full explanation of this is outside the scope of this document, but the short
version is that both performance replication and performance standbys use log
shipping to stay in sync with the upstream source of writes. The mitigation
historically used by nodes doing writes via RPC is to look at the WAL index in
the response and wait up to 2 seconds to see if that WAL index appears in the
logs being shipped from upstream. Once the WAL index is seen, the Vault node
handling the request that resulted in RPCs can return its own response to the
client: it knows that any subsequent reads will be able to see the value that
was just written. If the WAL index isn't seen within those 2 seconds, the Vault
node completes the request anyway, returning a warning in the response.

This mitigation option still exists in Vault 1.7, though now there is a
configuration option to adjust the wait time:
[best_effort_wal_wait_duration](/docs/configuration/replication).
## Vault 1.7 Mitigations
There are now a variety of other mitigations available:

- per-request option to always forward the request to the active node
- per-request option to conditionally forward the request to the active node
  if it would otherwise result in a stale read
- per-request option to fail requests if they might result in a stale read
- Vault Agent configuration to do the above for proxied requests

The remainder of this document describes the tradeoffs of these mitigations and
how to use them.

Note that any headers requesting forwarding are disabled by default, and must
be enabled using [allow_forwarding_via_header](/docs/configuration/replication).
### Unconditional Forwarding (Performance standbys only)
The simplest solution to never experience stale reads from a performance standby
is to provide the following HTTP header in the request:
```
X-Vault-Forward: active-node
```
The drawback here is that if all your requests are forwarded to the active node,
you might as well not be using performance standbys. So this mitigation only
makes sense to use selectively.

This mitigation will not help with stale reads relating to performance replication.
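
As a rough illustration, a client using Go's standard `net/http` package could set the header on the specific requests that need it. The helper name `newForwardedRequest`, the address, the path, and the token below are all placeholders chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// newForwardedRequest builds a GET request that asks a performance
// standby to forward it to the active node.
func newForwardedRequest(addr, path, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, addr+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Vault-Token", token)
	req.Header.Set("X-Vault-Forward", "active-node")
	return req, nil
}

func main() {
	// Placeholder address, path, and token.
	req, err := newForwardedRequest("https://vault.example.com:8200", "/v1/secret/data/foo", "example-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("X-Vault-Forward"))
}
```

Keeping the header in a dedicated helper makes it easier to apply forwarding only to the reads that must not be stale.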
### Conditional Forwarding (Performance standbys only)
As of Vault Enterprise 1.7, all requests that modify storage now return a new
HTTP response header:
```
X-Vault-Index: <base64 value>
```
To ensure that the state resulting from that write request is visible to a
subsequent request, add these headers to that second request:
```
X-Vault-Index: <base64 value taken from previous response>
X-Vault-Inconsistent: forward-active-node
```
The effect will be that the node handling the request will look at the state
it has locally, and if it doesn't contain the state described by the `X-Vault-Index`
header, the node will forward the request to the active node.

The drawback here is that when requests are forwarded to the active node,
performance standbys provide less value. If this happens often enough
the active node can become a bottleneck, limiting the horizontal read scalability
performance standbys are intended to provide.
### Retry stale requests
As of Vault Enterprise 1.7, all requests that modify storage now return a new
HTTP response header:
```
X-Vault-Index: <base64 value>
```
To ensure that the state resulting from that write request is visible to a
subsequent request, add this header to that second request:
```
X-Vault-Index: <base64 value taken from previous response>
```
When the desired state isn't present, Vault will return a failure response with
HTTP status code 412. This tells the client that it should retry the request.

The advantage over the Conditional Forwarding solution above is twofold:
first, there's no additional load on the active node. Second, this solution
is applicable to performance replication as well as performance standbys.

The Vault Go API will now automatically retry 412s, and provides convenience
methods for propagating the `X-Vault-Index` response header into the request
header of subsequent requests. Those not using the Vault Go API will want
to build equivalent functionality into their client library.
### Vault Agent and consistency headers
Vault Agent Caching will proxy incoming requests to Vault. New Agent
configuration is available in the `cache` stanza that allows making use
of some of the above mitigations without modifying clients.

By setting `enforce_consistency="always"`, Agent will always provide
the `X-Vault-Index` consistency header. The value it uses for the header
will be based on the responses that have passed through the Agent previously.

The option `when_inconsistent` controls how stale reads are prevented:

- `"fail"` means that when a `412` response is seen, it is returned to the client
- `"retry"` means that `412` responses will be retried automatically by Agent,
  so the client doesn't have to deal with them
- `"forward"` makes Agent provide the
  `X-Vault-Inconsistent: forward-active-node` header as described above under
  Conditional Forwarding
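
Put together, the two options described above might appear in an Agent configuration file as follows. This is a fragment only; the rest of the Agent configuration (listeners, auto-auth, and so on) is omitted, and `"retry"` is just one of the three values listed.

```hcl
cache {
  # Always send the X-Vault-Index header on proxied requests.
  enforce_consistency = "always"

  # Retry 412 responses inside Agent so clients never see them.
  when_inconsistent = "retry"
}
```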
## Vault 1.10 Mitigations
In Vault 1.10, the token format changed: service tokens now employ server-side consistency.
By default, a request made to a node that cannot provide read-after-write
consistency, because it does not yet have the WAL index needed to check the
Vault token locally, returns a 412 status code. The Vault Go API automatically
retries when receiving 412s, so unless there is a considerable replication delay,
users will experience read-after-write consistency.

The replication option [allow_forwarding_via_token](/docs/configuration/replication)
can be used to forward requests that would otherwise have returned a 412 to the
active node instead.

Refer to the [Server Side Consistent Token FAQ](/docs/faq/ssct) for details.
## Client API helpers
There are some new helpers in the `api` package to work with the new headers.
`WithRequestCallbacks` and `WithResponseCallbacks` create a shallow clone of
the client and populate it with the given callbacks. `RecordState` and
`RequireState` are used to store the response header from one request and
provide it in a subsequent request. For example:
```go
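// Assumes path and data are defined, and "log" and the Vault "api" package are imported.
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
	log.Fatal(err)
}
var state string
_, err = client.WithResponseCallbacks(api.RecordState(&state)).Logical().Write(path, data)
secret, err := client.WithRequestCallbacks(api.RequireState(state)).Logical().Read(path)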
```
This will retry the `Read` until the data stored by the `Write` is present.
There are also callbacks to use forwarding: `ForwardInconsistent` and
`ForwardAlways`.