# Operating Consul at Scale
This page describes how Consul's architecture impacts its performance with large scale deployments and shares recommendations for operating Consul in production at scale.
## Overview
### Data plane resiliency
To make service-to-service communication resilient against outages and failures, we recommend spreading multiple instances of each service across fault domains. Resilient deployments spread services across multiple fault domains, including the following:
- Infrastructure-level availability zones
- Runtime platform instances, such as Kubernetes clusters
To ensure resiliency, we recommend limiting deployments to a maximum of 5,000 Consul client agents per Consul datacenter. There are two reasons for this recommendation:
1. **Blast radius reduction**: When Consul suffers a server outage in a datacenter or region, _blast radius_ refers to the number of Consul clients or dataplanes attached to that datacenter that can no longer communicate as a result. We recommend limiting the total number of clients attached to a single Consul datacenter in order to reduce the size of its blast radius. Even though Consul is able to run clusters with 10,000 or more nodes, it takes longer to bring larger deployments back online after an outage, which impacts time to recovery.
1. **Agent gossip management**: Consul agents use the [gossip protocol](/consul/docs/architecture/gossip) to share membership information in a gossip pool. By default, all client agents in a single Consul datacenter are in a single gossip pool. Whenever an agent joins or leaves the gossip pool, the other agents propagate that event throughout the pool. If a Consul datacenter experiences _agent churn_, or a consistently high rate of agents joining and leaving a single pool, cluster performance may be affected by gossip messages being generated faster than they can be transmitted. The result is an ever-growing message queue.
To mitigate these risks, we recommend a maximum of 5,000 Consul client agents in a single gossip pool. There are several strategies for making gossip pools smaller:
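One common way to keep a gossip pool small is to split client agents across multiple, smaller Consul datacenters. The following is a minimal client agent sketch for that approach; the datacenter name, data directory, and join addresses are placeholders rather than values from this page:

```hcl
# Hypothetical client agent configuration for a second, smaller datacenter.
# Each datacenter has its own LAN gossip pool, so keeping a datacenter under
# roughly 5,000 client agents keeps its gossip pool within the recommendation.
datacenter = "dc2"          # example datacenter name
server     = false          # this agent is a client, not a server
data_dir   = "/opt/consul"  # example data directory
retry_join = ["10.0.2.10", "10.0.2.11", "10.0.2.12"]  # example dc2 server addresses
```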
Consul server agents are an important part of Consul's architecture.
Consul servers can be deployed on a few different runtimes:
- **HashiCorp Cloud Platform (HCP) Consul (Managed)**. These Consul servers are deployed in a hosted environment managed by HCP. To get started with HCP Consul servers in Kubernetes or VM deployments, refer to the [Deploy HCP Consul tutorial](/consul/tutorials/get-started-hcp/hcp-gs-deploy).
- **VMs or bare metal servers (Self-managed)**. To get started with Consul on VMs or bare metal servers, refer to the [Deploy Consul server tutorial](/consul/tutorials/get-started-vms/virtual-machine-gs-deploy). For a full list of configuration options, refer to [Agents Overview](/consul/docs/agent).
- **Kubernetes (Self-managed)**. To get started with Consul on Kubernetes, refer to the [Deploy Consul on Kubernetes tutorial](/consul/tutorials/get-started-kubernetes/kubernetes-gs-deploy).
- **Other container environments, including Docker, Rancher, and Mesos (Self-managed)**.
When operating Consul at scale, self-managed VM or bare metal server deployments offer the most flexibility. Some Consul Enterprise features that can enhance fault tolerance and read scalability, such as [redundancy zones](/consul/docs/enterprise/redundancy) and [read replicas](/consul/docs/enterprise/read-scale), are not available to server agents on Kubernetes runtimes. To learn more, refer to [Consul Enterprise feature availability by runtime](/consul/docs/enterprise#feature-availability-by-runtime).
Determining the number of Consul servers to deploy on your network has two key considerations:
1. **Fault tolerance**: The number of server outages your deployment can tolerate while maintaining quorum. Additional servers increase a network's fault tolerance.
1. **Performance scalability**: Additional servers deployed to handle more requests also add latency and slow the quorum process. Having too many servers impedes your network instead of helping it.
Fault tolerance should determine your initial decision for how many Consul server agents to deploy. Our recommendation for the number of servers to deploy depends on whether you have access to Consul Enterprise redundancy zones:
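If you use redundancy zones, they are configured through autopilot and node metadata. The following is a minimal sketch, assuming a node metadata key named `zone` and an example availability zone value; neither comes from this page:

```hcl
# Example Consul Enterprise server configuration sketch for redundancy zones.
# Non-voting servers in a zone can be promoted automatically if a voter in the
# same zone fails, improving fault tolerance without enlarging the quorum.
node_meta {
  zone = "us-east-1a"   # example availability zone label for this server
}

autopilot {
  redundancy_zone_tag = "zone"   # the node_meta key autopilot uses to group servers by zone
}
```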
#### NUMA architecture awareness
Some cloud providers offer extremely large instance sizes with Non-Uniform Memory Access (NUMA) architectures. Because the Go runtime is not NUMA aware, Consul is not NUMA aware. Even though you can run Consul on NUMA architecture, it will not take advantage of the multiprocessing capabilities.
### Consistency modes
The highest CPU load usually belongs to the current leader.
- `consul.rpc.*` - Traditional RPC metrics. The most relevant metrics for understanding server CPU load in read-heavy workloads are `consul.rpc.query` and `consul.rpc.queries_blocking`.
- `consul.grpc.server.*` - Metrics for the number of streams being processed by the server.
- `consul.xds.server.*` - Metrics for the Envoy xDS resources being processed by the server. In Consul v1.14 and higher, these metrics have the potential to become a significant source of read load. Refer to [Consul dataplanes](/consul/docs/connect/dataplane) for more information.
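To observe these metrics, you can enable one of Consul's telemetry sinks. A minimal sketch, assuming you scrape the agent's Prometheus-format metrics endpoint; the retention value is an example only:

```hcl
# Example agent telemetry configuration sketch.
# Exposes metrics such as consul.rpc.* and consul.xds.server.* in Prometheus
# format through the agent metrics endpoint.
telemetry {
  prometheus_retention_time = "60s"   # example retention window for the Prometheus endpoint
  disable_hostname          = true    # drop the hostname prefix from metric names
}
```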
Depending on your needs, choose one of the following strategies to mitigate server CPU load:
In general, setting [`raftboltdb.NoFreelistSync`](/consul/docs/agent/config/config-files#NoFreelistSync) to `true` does the following:
- Reduce the amount of data written to disk
- Increase the amount of time it takes to load the raft.db file on startup
We recommend operators optimize networks according to their individual concerns. For example, if your server runs into disk performance issues but Consul servers do not restart often, setting [`raftboltdb.NoFreelistSync`](/consul/docs/agent/config/config-files#NoFreelistSync) to `true` may solve your problems. However, the same action causes issues for deployments with large database files and frequent server restarts.
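The following is a minimal sketch of the corresponding server configuration; whether `true` is appropriate depends on the trade-offs described above:

```hcl
# Example server configuration sketch for the Raft BoltDB backend.
# NoFreelistSync = true reduces the amount of data written to disk, but
# increases the time it takes to load the raft.db file on startup.
raft_boltdb {
  NoFreelistSync = true
}
```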
#### Raft snapshots
When you add a new Consul server, it must catch up to the current state. It receives the latest snapshot from the leader, followed by the sequence of logs between that snapshot and the leader's current state. Each Raft log has a sequence number, and each snapshot contains the last sequence number included in the snapshot. A combination of write-heavy workloads, a large state, congested networks, or busy servers can cause a new server to struggle to catch up to the current state before the next log it needs from the leader has already been truncated. The result is a _snapshot install loop_.
For example, if snapshot A on the leader has an index of 99 and the current index is 150, then when a new server comes online the leader streams snapshot A to the new server for it to restore. However, this snapshot only enables the new server to catch up to index 99. Not only does the new server still need to catch up to index 150, but the leader continued to commit Raft logs in the meantime.
When the leader takes snapshot B at index 199, it truncates the logs that accumulated between snapshot A and snapshot B, which means it truncates Raft logs with indexes between 100 and 199.
Because the new server restored snapshot A, the new server has a current index of 99. It requests logs 100 to 150 because index 150 was the current index when it started the replication restore process. At this point, the leader recognizes that it only has logs 200 and higher, and does not have logs for indexes 100 to 150. The leader determines that the new server's state is stale and starts the process over by sending the new server the latest snapshot, snapshot B.
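One way to reduce the likelihood of a snapshot install loop is to let the leader retain more logs after each snapshot. The sketch below adjusts Raft tuning options on the servers; the values shown are illustrative, not recommendations from this page:

```hcl
# Example server configuration sketch for Raft snapshot tuning.
raft_snapshot_interval  = "30s"    # how often servers check whether to take a snapshot
raft_snapshot_threshold = 16384    # minimum number of new logs before taking a snapshot
raft_trailing_logs      = 50000    # logs retained after a snapshot so followers can catch up
```

Raising `raft_trailing_logs` keeps more logs available between snapshots at the cost of additional disk usage on the servers.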
To optimize performance for service discovery, we recommend deploying multiple small clusters with consistent numbers of service instances and watches.
Several factors influence Consul performance at scale when used primarily for its service discovery and health check features. The factors you have control over include:
- The overall number of registered service instances
- The use of [stale reads](/consul/api-docs/features/consistency#consul-dns-queries) for DNS queries (a configuration sketch follows this list)
- The number of entities, such as Consul client agents or dataplane components, that are monitoring Consul for changes in a service's instances, including registration and health status. When any service change occurs, all of those entities incur a computational cost because they must process the state change and reconcile it with previously known data for the service. In addition, the Consul server agents also incur a computational cost when sending these updates.
- Number of [watches](/consul/docs/dynamic-app-config/watches) monitoring for changes to a service.
- Rate of catalog updates, which is affected by the following events:
  - A service instance's health check status changes
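For the stale-read factor noted above, DNS staleness is controlled in the agent's `dns_config` block. A minimal sketch; the `max_stale` value is an example, not a recommendation from this page:

```hcl
# Example agent configuration sketch for DNS stale reads.
# Allowing stale reads lets any server, not just the leader, answer DNS
# queries, which spreads read load across the cluster.
dns_config {
  allow_stale = true       # permit non-leader servers to answer DNS queries
  max_stale   = "87600h"   # example bound on how stale an answer may be
}
```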
Because Consul's service mesh uses service discovery subsystems, service mesh performance is also optimized by deploying multiple small clusters with consistent numbers of service instances and watches. Service mesh performance is influenced by the following additional factors:
- The [transparent proxy](/consul/docs/connect/transparent-proxy) feature causes client agents to listen for service instance updates across all services instead of a subset. To prevent performance issues, we recommend that you do not use the permissive intention, `default: allow`, with the transparent proxy feature. When combined, every service instance update propagates to every proxy, which causes additional server load.
- When you use the [built-in CA provider](/consul/docs/connect/ca/consul#built-in-ca), Consul leaders are responsible for signing certificates used for mTLS across the service mesh. The impact on CPU utilization depends on the total number of service instances and configured certificate TTLs. You can use the [CA provider configuration options](/consul/docs/agent/config/config-files#common-ca-config-options) to control the number of requests a server processes. We recommend adjusting [`csr_max_concurrent`](/consul/docs/agent/config/config-files#ca_csr_max_concurrent) and [`csr_max_per_second`](/consul/docs/agent/config/config-files#ca_csr_max_per_second) to suit your environment.
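The following sketch shows one way those CSR limits might be set with the built-in provider; the values are placeholders that illustrate the options rather than tuned recommendations:

```hcl
# Example server configuration sketch for the built-in Connect CA.
# Limits how many certificate signing requests the servers process.
connect {
  enabled     = true
  ca_provider = "consul"
  ca_config {
    csr_max_concurrent = 2    # example: sign at most two CSRs concurrently
    csr_max_per_second = 50   # example: rate-limit CSR processing per second
  }
}
```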
### K/V store