---
layout: docs
page_title: Server Performance
description: >-
  Consul requires different amounts of compute resources, depending on cluster
  size and expected workload. This guide provides recommendations for choosing
  compute resources.
---

# Server Performance

Since Consul servers run a [consensus protocol](/docs/internals/consensus) to
process all write operations and are contacted on nearly all read operations, server
performance is critical for overall throughput and health of a Consul cluster. Servers
are generally I/O bound for writes because the underlying Raft log store performs a sync
to disk every time an entry is appended. Servers are generally CPU bound for reads since
reads work from a fully in-memory data store that is optimized for concurrent access.

## Minimum Server Requirements ((#minimum))

In Consul 0.7, the default server [performance parameters](/docs/agent/options#performance)
were tuned to allow Consul to run reliably (but relatively slowly) on a server cluster of three
[AWS t2.micro](https://aws.amazon.com/ec2/instance-types/) instances. These thresholds
were determined empirically using a leader instance that was under sufficient read, write,
and network load to cause it to permanently be at zero CPU credits, forcing it to the baseline
performance mode for that instance type. Real-world workloads typically have more bursts of
activity, so this is a conservative and pessimistic tuning strategy.

This default was chosen based on feedback from users, many of whom wanted a low-cost way
to run small production or development clusters on inexpensive compute resources, at the
expense of some performance in leader failure detection and leader election times.

The default performance configuration is equivalent to this:

```javascript
{
  "performance": {
    "raft_multiplier": 5
  }
}
```
## Production Server Requirements ((#production))

When running Consul 0.7 and later in production, it is recommended to configure the server
[performance parameters](/docs/agent/options#performance) back to Consul's original
high-performance settings. This will let Consul servers detect a failed leader and complete
leader elections much more quickly than the default configuration, which extends key Raft
timeouts by a factor of 5 and can therefore be quite slow during these events.

The high performance configuration is simple and looks like this:

```javascript
{
  "performance": {
    "raft_multiplier": 1
  }
}
```
This value must take into account the network latency between the servers and the read/write load on the servers.
The value of `raft_multiplier` is a scaling factor and directly affects the following parameters:

| Param | Value | |
| ------------------ | -----: | ------: |
| HeartbeatTimeout | 1000ms | default |
| ElectionTimeout | 1000ms | default |
| LeaderLeaseTimeout | 500ms | default |

By default, Consul uses a scaling factor of `5` (i.e. `raft_multiplier: 5`), which results in the following values:

| Param | Value | Calculation |
| ------------------ | -----: | ----------: |
| HeartbeatTimeout | 5000ms | 5 x 1000ms |
| ElectionTimeout | 5000ms | 5 x 1000ms |
| LeaderLeaseTimeout | 2500ms | 5 x 500ms |

~> **NOTE** Wide networks with more latency will perform better with larger values of `raft_multiplier`.

The trade-off is between leader stability and the time to recover from an actual
leader failure. A short multiplier minimizes failure detection and election time
but may be triggered frequently in high latency situations. This can cause
constant leadership churn and associated unavailability. A high multiplier
reduces the chances that spurious failures will cause leadership churn but it
does this at the expense of taking longer to detect real failures and thus takes
longer to restore cluster availability.

Leadership instability can also be caused by under-provisioned CPU resources and
is more likely in environments where CPU cycles are shared with other workloads.
In order for a server to remain the leader, it must send frequent heartbeat
messages to all other servers every few hundred milliseconds. If some number of
these are missing or late due to the leader not having sufficient CPU to send
them on time, the other servers will detect it as failed and hold a new
election.

It's best to benchmark with a realistic workload when choosing a production server for Consul.
Here are some general recommendations:

- Consul will make use of multiple cores, and at least 2 cores are recommended.

- Spurious leader elections can be caused by networking issues between the servers or
  insufficient CPU resources. Users in cloud environments often bump their servers up to the
  next instance class with improved networking and CPU until leader elections stabilize, and
  in Consul 0.7 or later the [performance parameters](/docs/agent/options#performance)
  configuration now gives you tools to trade off performance instead of upsizing servers.
  You can use the [`consul.raft.leader.lastContact` telemetry](/docs/agent/telemetry#leadership-changes)
  to observe how the Raft timing is performing and guide the decision to de-tune Raft
  performance or add more powerful servers.

- For DNS-heavy workloads, configuring all Consul agents in a cluster with the
  [`allow_stale`](/docs/agent/options#allow_stale) configuration option will allow reads to
  scale across all Consul servers, not just the leader. Consul 0.7 and later enables stale reads
  for DNS by default. See [Stale Reads](https://learn.hashicorp.com/tutorials/consul/dns-caching#stale-reads)
  in the [DNS Caching](https://learn.hashicorp.com/tutorials/consul/dns-caching) guide for more
  details. It's also good to set reasonable, non-zero
  [DNS TTL values](https://learn.hashicorp.com/tutorials/consul/dns-caching#ttl-values) if your
  clients will respect them.

- In other applications that perform high volumes of reads against Consul, consider using the
  [stale consistency mode](/api/features/consistency#stale) to allow reads to scale across
  all the servers instead of always being forwarded to the leader.

- In Consul 0.9.3 and later, a new [`limits`](/docs/agent/options#limits) configuration is
  available on Consul clients to limit the RPC request rate they are allowed to make against
  the Consul servers. After hitting the limit, requests will start to return rate limit errors
  until time has passed and more requests are allowed. Configuring this across the cluster can
  help enforce a maximum desired application load level on the servers, and can help mitigate
  abusive applications, as sketched in the example below.
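
As a sketch of that last recommendation, a client agent's RPC rate can be capped in its
[`limits`](/docs/agent/options#limits) block. The values below are illustrative assumptions,
not recommendations; tune them against your measured request rates:

```javascript
{
  "limits": {
    "rpc_rate": 100,
    "rpc_max_burst": 1000
  }
}
```
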
## Memory Requirements
Consul server agents operate on a working set of data comprised of key/value
entries, the service catalog, prepared queries, access control lists, and
sessions in memory. These data are persisted through Raft to disk in the form
of a snapshot and log of changes since the previous snapshot for durability.

When planning for memory requirements, you should typically allocate enough RAM for your
server agents to contain 2 to 4 times the working set size. You can determine the working
set size by noting the value of `consul.runtime.alloc_bytes` in the
[Telemetry data](/docs/agent/telemetry). For example, if `consul.runtime.alloc_bytes`
reports roughly 1 GB, plan for servers with 2 to 4 GB of RAM.

> NOTE: Consul is not designed to serve as a general purpose database, and you
> should keep this in mind when choosing what data are populated to the
> key/value store.
## Read/Write Tuning
Consul is write limited by disk I/O and read limited by CPU. Memory requirements depend on the total size of the KV pairs stored and should be sized according to that data (as should the hard drive storage). The limit on a key's value size is `512KB`.

-> Consul is write limited by disk I/O and read limited by CPU.

For **write-heavy** workloads, the total RAM available for overhead should be approximately:

```
RAM NEEDED = number of keys * average key size * 2-3x
```
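
For example, under this rule of thumb, a workload of 100,000 keys with an average entry size of 1 KB would call for roughly:

```
100,000 keys * 1 KB * 2-3x = ~200-300 MB of RAM overhead
```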

Since writes must be synced to disk (persistent storage) on a quorum of servers before they are committed, deploying a disk with high write throughput (or an SSD) will enhance performance on the write side. ([Documentation](/docs/agent/options#_data_dir))

For a **read-heavy** workload, configure all Consul server agents with the `allow_stale` DNS option, or query the API with the `stale` [consistency mode](/api/features/consistency). By default, all queries made to the server are RPC-forwarded to and serviced by the leader. By enabling stale reads, any server can respond to any query, thereby reducing overhead on the leader. Typically, a stale response is `100ms` or less behind the fully consistent result, and it drastically improves performance and reduces latency under high load.

If the leader server is out of memory or its disk is full, the server eventually stops responding, loses leadership, and cannot move past its last commit time. However, by configuring `max_stale` and setting it to a large value, Consul will continue to respond to queries during such outage scenarios. ([max_stale documentation](/docs/agent/options#max_stale))
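
As a minimal sketch, a read-heavy deployment might combine stale DNS reads with a large
`max_stale` window and non-zero TTLs. The durations below are illustrative assumptions, not
recommendations:

```javascript
{
  "dns_config": {
    "allow_stale": true,
    "max_stale": "87600h",
    "node_ttl": "30s",
    "service_ttl": {
      "*": "10s"
    }
  }
}
```
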
Note that `stale` is not appropriate for coordination where strong consistency is important (e.g. locking or application leader election). For those cases, the optional `consistent` API query mode is required for true linearizability; the trade-off is that the leader must verify its leadership with a quorum on every read, so it requires more resources and takes longer.

**Read-heavy** clusters may take advantage of the [enhanced reading](/docs/enterprise/read-scale) feature (Enterprise) for better scalability. This feature allows additional servers to be introduced as non-voters. Being a non-voter, the server will still participate in data replication, but it will not block the leader from committing log entries.

Consul agents use network sockets to communicate with the other nodes (gossip) and with the server agents. In addition, file descriptors are also opened for watch handlers, health checks, and log files. For a **write-heavy** cluster, the `ulimit` size must be increased from the default value (`1024`) to prevent the leader from running out of file descriptors.

To prevent CPU spikes from a misconfigured client, RPC requests to the server should be [rate limited](/docs/agent/options#limits).

~> **NOTE** Rate limiting is configured on the client agent only.

In addition, two [performance indicators](/docs/agent/telemetry) — `consul.runtime.alloc_bytes` and `consul.runtime.heap_objects` — can help diagnose if the current sizing is not adequately meeting the load.

## Connect Certificate Signing CPU Limits

If you enable [Connect](/docs/connect), the leader server will need to perform public key
signing operations for every service instance in the cluster. Typically these operations are
fast on modern hardware; however, when the CA is changed or its key is rotated, the leader
will face an influx of requests for new certificates, one for every service instance running.

While the client agents distribute these randomly over 30 seconds to avoid an immediate
thundering herd, they don't have enough information to tune that period based on the number
of certificates in use in the cluster, so picking a longer smearing period would result in
artificially slow rotations for small clusters.

Smearing requests over 30s is sufficient to bring RPC load to a reasonable level in all but
the very largest clusters, but the extra CPU load from cryptographic operations could impact
the server's normal work. To limit that, Consul 1.4.1 and later exposes two ways to limit the
impact certificate signing has on the leader:
[`csr_max_per_second`](/docs/agent/options#ca_csr_max_per_second) and
[`csr_max_concurrent`](/docs/agent/options#ca_csr_max_concurrent).

By default, Consul sets a limit of 50 CSRs per second, which is reasonable on modest
hardware but may be too low and impact rotation times if more than 1500 service
instances are using Connect in the cluster. `csr_max_per_second` is likely the best option
if you have fewer than four cores available, since a whole core consumed by signing would be
a large portion of the available CPU and would likely impact server stability. The downside
is that you need to capacity plan: how many service instances will need Connect certificates?
What CSR rate can your server tolerate without impacting stability? How fast do you want CA
rotations to process?
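
As an illustrative sketch (assuming the built-in Consul CA provider; the value shown is an
assumption, not a recommendation), the rate limit is set through the CA configuration:

```javascript
{
  "connect": {
    "enabled": true,
    "ca_config": {
      "csr_max_per_second": 100
    }
  }
}
```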
For larger production deployments, we generally recommend multiple CPU cores for
servers to handle the normal workload. With four or more cores available, it's
simpler to limit signing CPU impact with `csr_max_concurrent` rather than tune
the rate limit. This effectively sets how many CPU cores can be monopolized by
certificate signing work (although it doesn't pin that work to specific cores).
In this case `csr_max_per_second` should be disabled (set to `0`).

For example, if you have an 8-core server, setting `csr_max_concurrent` to `1` would allow
you to process CSRs as fast as a single core can (which is likely sufficient even for very
large clusters), without consuming all available CPU cores and impacting normal server work
or stability.
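
A minimal sketch of that setup, again assuming the built-in CA provider, might look like this:

```javascript
{
  "connect": {
    "enabled": true,
    "ca_config": {
      "csr_max_per_second": 0,
      "csr_max_concurrent": 1
    }
  }
}
```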