open-vault/website/content/guides/operations/reference-architecture.mdx
Bryce Kalow b76a56d40c
feat(website): migrates nav data format and updates docs pages (#11242)
* migrates nav data format and updates docs pages

* removes sidebar_title from content files
2021-04-06 13:49:04 -04:00

---
layout: guides
page_title: Vault Reference Architecture - Guides
description: |-
  This guide provides guidance in the best practices of Vault
  implementations through use of a reference architecture.
ea_version: 1
---
# Vault Reference Architecture
The goal of this document is to recommend _HashiCorp Vault_ deployment
practices. This reference architecture conveys a general architecture
that should be adapted to accommodate the specific needs of each implementation.
The following topics are addressed in this guide:
- [Deployment Topology within One Datacenter](#one-dc)
- [Network Connectivity](#network-connectivity-details)
- [Deployment System Requirements](#deployment-system-requirements)
- [Hardware Considerations](#hardware-considerations)
- [Load Balancing](#load-balancing)
- [High Availability](#high-availability)
- [Deployment Topology for Multiple Datacenters](#multi-dc)
- [Vault Replication](#vault-replication)
- [Additional References](#additional-references)
-> This document assumes Vault uses Consul as the [storage
backend](/docs/internals/architecture) since that is the recommended
storage backend for production deployments.
## Deployment Topology within One Datacenter ((#one-dc))
This section explains how to deploy a Vault open source cluster in one datacenter.
Support for [multiple datacenters](#multi-dc) is included in Vault Enterprise through
cluster replication.
### Reference Diagram
Eight Nodes with [Consul Storage Backend](/docs/configuration/storage/consul)
![Reference diagram](/img/vault-ref-arch-2.png)
#### Design Summary
This design is the recommended architecture for production environments, as it
provides flexibility and resilience. Consul servers are separate
from the Vault servers so that software upgrades are easier to perform. Additionally,
separate Consul and Vault servers allow each tier to be sized independently.
Vault-to-Consul backend connectivity is over HTTP and should be secured with TLS
to encrypt all traffic, and with a Consul token to authorize Vault's access.
-> Refer to the online documentation to learn more about running [Consul in encrypted mode](https://www.consul.io/docs/agent/options.html#encrypt).
#### Failure Tolerance
Typical distribution in a cloud environment is to spread Consul/Vault nodes into
separate Availability Zones (AZs) within a high bandwidth, low latency network,
such as an AWS Region. The diagram below shows Vault and Consul spread between
AZs, with Consul servers in Redundancy Zone configurations, promoting a single
voting member per AZ, providing both Zone and Node level failure protection.
-> Refer to the online documentation to learn more about the [Consul leader election process](https://www.consul.io/docs/guides/leader-election).
![Failure tolerance|40%](/img/vault-ref-arch-3.png)
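Consul servers reach consensus through Raft, which requires a majority of voting members to stay available. The following minimal sketch shows the arithmetic behind the failure tolerance described above. The voter counts in the loop are illustrative assumptions (for example, one voter in each of three AZs), not values taken from the diagram.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft voter set."""
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """Number of voters that can be lost while a majority remains."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    # Illustrative voter counts; adjust to match your own topology.
    for voters in (3, 5):
        print(
            f"voters={voters} "
            f"quorum={quorum_size(voters)} "
            f"tolerates={failure_tolerance(voters)} failure(s)"
        )
```

With one voter per AZ across three AZs, the cluster keeps quorum after losing a single AZ, which is why the non-voting servers in each zone matter: one of them can be promoted if its zone's voter fails.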
### Network Connectivity Details
![Network Connectivity Details](/img/vault-ref-arch.png)
### Deployment System Requirements
The following table provides guidelines for server sizing. Of particular note is
the strong recommendation to avoid non-fixed performance CPUs, or "Burstable
CPU" in AWS terms, such as T-series instances.
#### Sizing for Vault Servers
| Size | CPU | Memory | Disk | Typical Cloud Instance Types |
| ----- | -------- | ------------ | ----- | ----------------------------------------- |
| Small | 2 core | 4-8 GB RAM | 25 GB | **AWS:** m5.large |
| | | | | **Azure:** Standard_D2_v3 |
| | | | | **GCE:** n1-standard-2, n1-standard-4 |
| Large | 4-8 core | 16-32 GB RAM | 50 GB | **AWS:** m5.xlarge, m5.2xlarge |
| | | | | **Azure:** Standard_D4_v3, Standard_D8_v3 |
| | | | | **GCE:** n1-standard-8, n1-standard-16 |
#### Sizing for Consul Servers
| Size | CPU | Memory | Disk | Typical Cloud Instance Types |
| ----- | -------- | ------------- | ------ | ----------------------------------------- |
| Small | 2 core | 8-16 GB RAM | 50 GB | **AWS:** m5.large, m5.xlarge |
| | | | | **Azure:** Standard_D2_v3, Standard_D4_v3 |
| | | | | **GCE:** n1-standard-4, n1-standard-8 |
| Large | 4-8 core | 32-64+ GB RAM | 100 GB | **AWS:** m5.2xlarge, m5.4xlarge |
| | | | | **Azure:** Standard_D4_v3, Standard_D8_v3 |
| | | | | **GCE:** n1-standard-16, n1-standard-32 |
### Hardware Considerations
The small size category would be appropriate for most initial production
deployments, or for development/testing environments.
The large size is for production environments where there is a consistent high
workload. That might be a large number of transactions, a large number of
secrets, or a combination of the two.
In general, processing requirements will be dependent on encryption workload and
messaging workload (operations per second, and types of operations). Memory
requirements will be dependent on the total size of secrets/keys stored in
memory and should be sized according to that data (as should the hard drive
storage). Vault itself has minimal storage requirements, but the underlying
storage backend should have a relatively high-performance hard disk subsystem.
If many secrets are being generated/rotated frequently, this information will
need to flush to disk often and can impact performance if slower hard drives are
used.
The function of the Consul servers in this deployment is to serve as the storage
backend for Vault. This means that all content stored for persistence in Vault is
encrypted by Vault, and written to the storage backend at rest. This data is
written to the key-value store section of Consul's Service Catalog, which is
required to be stored in its entirety in-memory on each Consul server. This
means that memory can be a constraint in scaling as more clients authenticate to
Vault, more secrets are persistently stored in Vault, and more temporary secrets
are leased from Vault. It also means that the Consul servers' memory must be
scaled vertically if additional space is required.
Furthermore, network throughput is a common consideration for Vault and Consul
servers. As both systems are HTTPS API driven, all incoming requests,
communications between Vault and Consul, underlying gossip communication between
Consul cluster members, communications with external systems (per auth or secret
engine configuration, and some audit logging configurations) and responses
consume network bandwidth.
Due to network performance considerations in Consul cluster operations,
replication of Vault datasets across network boundaries should be achieved
through Performance or DR Replication, rather than spreading the Consul cluster
across network and physical boundaries. If a single Consul cluster is spread
across network segments that are distant or inter-regional, this can cause
synchronization issues within the cluster or additional data transfer charges
in some cloud providers.
### Other Considerations
[Vault Production Hardening Recommendations](/guides/operations/production)
provides guidance on best practices for a production hardened deployment of
Vault.
## Load Balancing
### Load Balancing Using Consul Interface ((#consul-lb))
Consul can provide load balancing capabilities, but it requires that any Vault
clients are Consul aware. This means that a client can either utilize Consul DNS
or API interfaces to resolve the active Vault node. A client might access Vault
via a URL like the following: `http://active.vault.service.consul:8200`
This relies upon the operating system DNS resolution system, and
the request could be forwarded to Consul for the actual IP address response.
The operation can be completely transparent to legacy applications and would
operate just as a typical DNS resolution operation.
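As a minimal sketch of the DNS-based approach, the following function resolves the active node's name through the operating system resolver. It assumes the host is configured to forward `.consul` queries to Consul. The function name and the injectable `resolver` parameter are hypothetical and exist only to make the example testable.

```python
import socket
from typing import Callable, List


def resolve_active_vault(
    name: str = "active.vault.service.consul",
    port: int = 8200,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the IP addresses the name resolves to, de-duplicated and sorted."""
    results = resolver(name, port, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    try:
        print(resolve_active_vault())
    except socket.gaierror as err:
        # Expected on hosts that do not forward .consul queries to Consul.
        print(f"resolution failed: {err}")
```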
### Load Balancing Using External Load Balancer ((#external-lb))
![Vault Behind a Load Balancer](/img/vault-ref-arch-9.png)
External load balancers are also supported. A load balancer is placed in front of
the Vault cluster and polls a specific Vault URL to detect the active node and
route traffic accordingly. The active node responds to an HTTP request to the
following URL with a 200 status: `http://<Vault Node URL>:8200/v1/sys/health`
The following is a sample configuration block from HAProxy to illustrate:
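```plaintext
listen vault
    # Accept client connections on port 80
    bind 0.0.0.0:80
    balance roundrobin
    # Poll each server's health endpoint; only the active node returns 200
    option httpchk GET /v1/sys/health
    server vault1 192.168.33.10:8200 check
    server vault2 192.168.33.11:8200 check
    server vault3 192.168.33.12:8200 check
```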
Note that the above block could be generated by Consul (with consul-template)
when the load balancer is software such as Nginx, HAProxy, or Apache.
**Example Consul Template for the above HAProxy block:**
```plaintext
listen vault
    bind 0.0.0.0:8200
    balance roundrobin
    option httpchk GET /v1/sys/health{{range service "vault"}}
    server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
```
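The health check works because Vault's health endpoint answers with a different status code depending on the node's state. The sketch below maps the default codes, as listed in the Vault `sys/health` API documentation, to a routing decision. The function names are hypothetical, and the table assumes the endpoint's default behavior with no query parameters overriding the codes.

```python
# Default status codes of /v1/sys/health, per the Vault API documentation.
HEALTH_STATUS = {
    200: "initialized, unsealed, and active",
    429: "unsealed and standby",
    472: "disaster recovery secondary",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code: int) -> str:
    """Translate a health-check status code into a node state."""
    return HEALTH_STATUS.get(status_code, "unknown")


def should_route(status_code: int) -> bool:
    """Send client traffic only to the node that reports itself active."""
    return status_code == 200


if __name__ == "__main__":
    for code in (200, 429, 503):
        print(code, describe_health(code), should_route(code))
```

A load balancer that treats anything other than 200 as unhealthy therefore routes all traffic to the active node and fails over when a standby takes the active role.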
#### Client IP Address Handling
There are two supported methods for handling client IP addressing behind a proxy
or load balancer:
[X-Forwarded-For Headers](/docs/configuration/listener/tcp#x_forwarded_for_authorized_addrs)
and [PROXY v1](/docs/configuration/listener/tcp#proxy_protocol_authorized_addrs).
Both require a trusted load balancer and an allowlist of its IP addresses to
adhere to security best practices.
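The reason for restricting which addresses may supply client IP information is that any client can send an `X-Forwarded-For` header. The following is a conceptual sketch of that trust check, not Vault's implementation. The function name is hypothetical, and taking the right-most header entry (the one appended by the nearest proxy) is an assumption made for this example.

```python
import ipaddress
from typing import Iterable, Optional


def client_address(
    peer_ip: str,
    forwarded_for: Optional[str],
    authorized_cidrs: Iterable[str],
) -> str:
    """Return the address to treat as the client.

    The X-Forwarded-For value is honored only when the connection comes
    from an authorized (trusted) load balancer address.
    """
    peer = ipaddress.ip_address(peer_ip)
    trusted = any(peer in ipaddress.ip_network(cidr) for cidr in authorized_cidrs)
    if trusted and forwarded_for:
        # Right-most entry: the address appended by the nearest proxy.
        return forwarded_for.split(",")[-1].strip()
    # Untrusted peer or no header: fall back to the connecting address.
    return peer_ip


if __name__ == "__main__":
    lb_cidrs = ["10.0.0.0/24"]  # illustrative load balancer subnet
    print(client_address("10.0.0.2", "203.0.113.7", lb_cidrs))
    print(client_address("198.51.100.9", "203.0.113.7", lb_cidrs))
```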
### High Availability
A Vault cluster is the highly-available unit of deployment within one
datacenter. A recommended approach is three Vault servers with a Consul storage
backend. With this configuration, during a Vault server outage, failover is
handled immediately without human intervention.
To learn more about setting up your Vault servers in HA mode, read the [_Vault HA
with Consul_](/guides/operations/vault-ha-consul) guide.
> High-availability with [Performance Standby
> Nodes](/guides/operations/performance-nodes) and data-locality across
> datacenters requires Vault Enterprise.
## Deployment Topology for Multiple Datacenters ((#multi-dc))
<img src="/img/vault-ref-arch-6.png" />
### Vault Replication
~> **Enterprise Only:** The Vault replication feature is part of _Vault Enterprise_.
HashiCorp Vault Enterprise provides two modes of replication, **performance**
and **disaster recovery**. The [Vault documentation](/docs/enterprise/replication) provides more detailed information on the replication capabilities within Vault Enterprise.
![Replication Pattern](/img/vault-ref-arch-8.png)
#### Performance Replication
Vault performance replication allows for secrets management across many sites.
Secrets, authentication methods, authorization policies and other details are
replicated to be active and available in multiple locations.
-> Refer to the [Vault Mount Filter](/guides/operations/mount-filter) guide
about filtering out secret engines from being replicated across regions.
#### Disaster Recovery Replication
Vault disaster recovery replication ensures that a standby Vault cluster is kept
synchronized with an active Vault cluster. This mode of replication includes
data such as ephemeral authentication tokens, time-based token information, and
token usage data. This provides for an aggressive recovery point objective
in environments where preventing loss of ephemeral operational data is of the
utmost concern.
#### Cross-Region Disaster Recovery
If your disaster recovery strategy is to plan for the loss of an entire
datacenter, the following diagram illustrates a possible replication scenario.
![Replication Pattern](/img/vault-ref-arch-4.png)
In this scenario, if the Vault cluster in Region A fails and you promote the DR
cluster in Region B to be the new primary, your applications will need to read
and write secrets from the Vault cluster in Region B. This may or may not raise
an issue for your applications, but you need to take it into consideration
during planning.
#### In-Region Disaster Recovery
If your disaster recovery strategy is to plan for the loss of a cluster but not the
entire datacenter, the following diagram illustrates a possible replication
scenario.
![Replication Pattern](/img/vault-ref-arch-7.png)
-> Refer to the [Vault Disaster Recovery Setup](/guides/operations/disaster-recovery) guide for additional information.
#### Corruption or Sabotage Disaster Recovery
Another common scenario to protect against, more prevalent in cloud environments
that provide very high levels of intrinsic resiliency, is the purposeful
or accidental corruption of data and configuration, and/or a loss of cloud account
control. Vault's DR Replication is designed to replicate live data, which would
propagate intentional or accidental data corruption or deletion. To protect against
these possibilities, you should back up Vault's storage backend. This is supported
through the Consul Snapshot feature, which can be automated for regular archival
backups. A cold site or new infrastructure could be re-hydrated from a Consul
snapshot.
-> Refer to the online documentation to learn more about [Consul snapshots](https://www.consul.io/docs/commands/snapshot.html).
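As a minimal sketch of automating such backups, the script below builds and runs a `consul snapshot save` command with a timestamped file name. The directory, file naming scheme, and function names are assumptions for illustration. The script also assumes the `consul` binary is on the `PATH` and that any required address and ACL token are supplied through the environment.

```python
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_command(directory: str, now: Optional[datetime] = None) -> List[str]:
    """Build the argv for `consul snapshot save` with a UTC-timestamped file."""
    now = now or datetime.now(timezone.utc)
    filename = f"consul-{now:%Y%m%dT%H%M%SZ}.snap"
    return ["consul", "snapshot", "save", f"{directory.rstrip('/')}/{filename}"]


def take_snapshot(directory: str) -> str:
    """Run the snapshot command and return the path of the saved file."""
    cmd = snapshot_command(directory)
    # check=True raises CalledProcessError if the consul CLI reports failure.
    subprocess.run(cmd, check=True)
    return cmd[-1]


if __name__ == "__main__":
    # Hypothetical backup directory; print the command instead of running it.
    print(" ".join(snapshot_command("/var/backups/consul")))
```

Scheduling `take_snapshot` from cron or a similar scheduler, and copying the resulting files off the cluster, gives a restore point that is independent of replication.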
#### Replication Notes
- There is no set limit on the number of clusters within a replication set. The
  largest deployments today are in the 30+ cluster range.
- Any cluster within a Performance replication set can act as a Disaster
Recovery primary cluster.
- A cluster within a Performance replication set can also replicate to multiple
Disaster Recovery secondary clusters.
- While a Vault cluster can possess a replication role (or roles), there are no
  special considerations required in terms of infrastructure, and clusters can
  assume (or be promoted to) another role. Special circumstances related to mount
  filters and HSM usage may limit swapping of roles, but those are based on
  specific organization configurations.
#### Considerations Related to Unseal Behavior
Using replication with Vault clusters integrated with HSM devices for automated
unseal operations has some details that should be understood during the planning
phase.
- If a **performance** primary cluster utilizes an HSM, all other clusters
within that replication set must use an HSM as well.
- If a **performance** primary cluster does NOT utilize an HSM (uses Shamir
secret sharing method), the clusters within that replication set can be mixed,
such that some may use an HSM, others may use Shamir.
For the sake of this discussion, the cloud auto-unseal feature is treated as an
HSM.
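The two rules above can be expressed as a small planning check. This is a sketch only: the function name and the cluster names in the usage example are hypothetical, and a cluster using cloud auto-unseal is counted as using an HSM, as stated above.

```python
from typing import Dict, List


def invalid_secondaries(
    primary_uses_hsm: bool, secondaries: Dict[str, bool]
) -> List[str]:
    """Return the secondaries whose unseal method conflicts with the primary.

    `secondaries` maps a cluster name to True if it uses an HSM (or cloud
    auto-unseal) and False if it uses Shamir secret sharing.
    """
    if not primary_uses_hsm:
        # A Shamir primary allows a mix of HSM and Shamir clusters.
        return []
    # An HSM primary requires every other cluster to use an HSM as well.
    return sorted(name for name, uses_hsm in secondaries.items() if not uses_hsm)


if __name__ == "__main__":
    planned = {"perf-west": False, "perf-east": True}  # hypothetical clusters
    print(invalid_secondaries(True, planned))
    print(invalid_secondaries(False, planned))
```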
## Additional References
- Vault [architecture](/docs/internals/architecture) documentation explains
each Vault component
- To integrate Vault with an existing LDAP server, refer to the
  [LDAP Auth Method](/docs/auth/ldap) documentation
- Refer to the [AppRole Pull
Authentication](/guides/identity/authentication) guide to programmatically
generate a token for a machine or app
- Consul is an integral part of running a resilient Vault cluster, regardless of
location. Refer to the online [Consul documentation](https://www.consul.io/intro/getting-started/install.html) to
learn more.
## Next steps
- Read [Production Hardening](/guides/operations/production) to learn best
  practices for a production-hardened deployment of Vault.
- Read the [Deployment Guide](/guides/operations/deployment-guide) to learn
  the steps required to install and configure a single HashiCorp Vault cluster.