Docs/rate limiting 1.15 (#16345)

* Added rate limit section to agent overview, updated headings per style guide
* added GTRL section and overview
* added usage docs for rate limiting 1.15
* added file for initializing rate limits
* added steps for initializing rate limits
* updated descriptions for rate_limits in agent conf
* updated rate limiter-related metrics
* tweaks to agent index
* Apply suggestions from code review

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
Co-authored-by: Krastin Krastev <krastin@hashicorp.com>

* Apply suggestions from code review

Co-authored-by: Krastin Krastev <krastin@hashicorp.com>

* Apply suggestions from code review
* Apply suggestions from code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
Co-authored-by: Krastin Krastev <krastin@hashicorp.com>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2023-02-22 13:02:51 -08:00 · 2023-02-22 13:02:51 -08:00 · 2d75b88eb3
parent a90d789a13
commit 2d75b88eb3
8 changed files with 222 additions and 23 deletions
--- a/website/content/docs/agent/config/config-files.mdx
+++ b/website/content/docs/agent/config/config-files.mdx
@ -534,17 +534,17 @@ Valid time units are 'ns', 'us' (or 'µs'), 'ms', 's', 'm', 'h'."

 - `license_path` <EnterpriseAlert inline /> This specifies the path to a file that contains the Consul Enterprise license. Alternatively the license may also be specified in either the `CONSUL_LICENSE` or `CONSUL_LICENSE_PATH` environment variables. See the [licensing documentation](/consul/docs/enterprise/license/overview) for more information about Consul Enterprise license management. Added in versions 1.10.0, 1.9.7 and 1.8.13. Prior to version 1.10.0 the value may be set for all agents to facilitate forwards compatibility with 1.10 but will only actually be used by client agents.

- `limits` Available in Consul 0.9.3 and later, this is a nested
-  object that configures limits that are enforced by the agent. Prior to Consul 1.5.2,
-  this only applied to agents in client mode, not Consul servers. The following parameters
-  are available:
+- `limits`: This block specifies various types of limits that the Consul server agent enforces.

  - `http_max_conns_per_client` - Configures a limit of how many concurrent TCP connections a single client IP address is allowed to open to the agent's HTTP(S) server. This affects the HTTP(S) servers in both client and server agents. Default value is `200`.
  - `https_handshake_timeout` - Configures the limit for how long the HTTPS server in both client and server agents will wait for a client to complete a TLS handshake. This should be kept conservative as it limits how many connections an unauthenticated attacker can open if `verify_incoming` is being using to authenticate clients (strongly recommended in production). Default value is `5s`.
-  - `request_limits` - This object povides configuration for rate limiting RPC and gRPC requests on the consul server.  As a result of rate limiting gRPC and RPC request, HTTP requests to the Consul server are rate limited.
-    - `mode` - Configures whether rate limiting is enabled or not as well as how it behaves through the use of 3 possible modes.  The default value of "disabled" will prevent any rate limiting from occuring.  A value of "permissive" will cause the system to track requests against the `read_rate` and `write_rate` but will only log violations and will not block and will allow the request to continue processing.  A value of "enforcing" also tracks requests against the `read_rate` and `write_rate` but in addition to logging violations, the system will block the request from processings by returning an error.
-    - `read_rate` - Configures how frequently RPC, gRPC, and HTTP queries are allowed to happen. The rate limiter limits the rate to tokens per second equal to this value. See https://en.wikipedia.org/wiki/Token_bucket for more about token buckets.
-    - `write_rate` - Configures how frequently RPC, gRPC, and HTTP write are allowed to happen. The rate limiter limits the rate to tokens per second equal to this value. See https://en.wikipedia.org/wiki/Token_bucket for more about token buckets.
+  - `request_limits` - This object specifies configurations that limit the rate of RPC and gRPC requests on the Consul server. Limiting the rate of gRPC and RPC requests also limits HTTP requests to the Consul server.
+    - `mode` - String value that specifies an action to take if the rate of requests exceeds the limit. You can specify the following values:
+      - `permissive`: The server continues to allow requests that exceed the limit and records an error in the logs.
+      - `enforcing`: The server denies requests that exceed the limit and records an error in the logs.
+      - `disabled`: Limits are not enforced or tracked. This is the default value for `mode`.
+    - `read_rate` - Integer value that specifies the number of read requests per second. Default is `100`.
+    - `write_rate` - Integer value that specifies the number of write requests per second. Default is `100`.
  - `rpc_handshake_timeout` - Configures the limit for how long servers will wait after a client TCP connection is established before they complete the connection handshake. When TLS is used, the same timeout applies to the TLS handshake separately from the initial protocol negotiation. All Consul clients should perform this immediately on establishing a new connection. This should be kept conservative as it limits how many connections an unauthenticated attacker can open if `verify_incoming` is being using to authenticate clients (strongly recommended in production). When `verify_incoming` is true on servers, this limits how long the connection socket and associated goroutines will be held open before the client successfully authenticates. Default value is `5s`.
  - `rpc_client_timeout` - Configures the limit for how long a client is allowed to read from an RPC connection. This is used to set an upper bound for calls to eventually terminate so that RPC connections are not held indefinitely. Blocking queries can override this timeout. Default is `60s`.
  - `rpc_max_conns_per_client` - Configures a limit of how many concurrent TCP connections a single source IP address is allowed to open to a single server. It affects both clients connections and other server connections. In general Consul clients multiplex many RPC calls over a single TCP connection so this can typically be kept low. It needs to be more than one though since servers open at least one additional connection for raft RPC, possibly more for WAN federation when using network areas, and snapshot requests from clients run over a separate TCP conn. A reasonably low limit significantly reduces the ability of an unauthenticated attacker to consume unbounded resources by holding open many connections. You may need to increase this if WAN federated servers connect via proxies or NAT gateways or similar causing many legitimate connections from a single source IP. Default value is `100` which is designed to be extremely conservative to limit issues with certain deployment patterns. Most deployments can probably reduce this safely. 100 connections on modern server hardware should not cause a significant impact on resource usage from an unauthenticated attacker though.
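
To illustrate how the three `mode` values described in this hunk differ, the following Python sketch models a per-second limit as a simple token bucket. The token bucket mechanism, the `RequestLimiter` class, and its `allow` method are assumptions made for illustration only; they are not Consul's implementation or API.

```python
import time


class RequestLimiter:
    """Illustrative limiter with the three modes from the docs (hypothetical class)."""

    MODES = ("disabled", "permissive", "enforcing")

    def __init__(self, mode, rate, clock=time.monotonic):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.rate = float(rate)      # requests allowed per second
        self.tokens = float(rate)    # start with a full bucket
        self.clock = clock
        self.last = clock()
        self.exceeded = 0            # number of requests that went over the limit

    def allow(self):
        """Return True if the request may proceed."""
        if self.mode == "disabled":
            # Limits are neither tracked nor enforced.
            return True
        now = self.clock()
        # Refill the bucket in proportion to the elapsed time, capped at `rate`.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Over the limit: both remaining modes record the event,
        # but only "enforcing" denies the request.
        self.exceeded += 1
        return self.mode == "permissive"
```

With `rate=2` and no time passing, an `enforcing` limiter in this sketch allows two calls and denies the third, while a `permissive` limiter allows all three and only counts the third as exceeded.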
--- a/website/content/docs/agent/config/index.mdx
+++ b/website/content/docs/agent/config/index.mdx
@ -72,7 +72,7 @@ The following agent configuration options are reloadable at runtime:
  - These can be important in certain outage situations so being able to control
    them without a restart provides a recovery path that doesn't involve
    downtime. They generally shouldn't be changed otherwise.
- [RPC rate limiting](/consul/docs/agent/config/config-files#limits)
+- [RPC rate limits](/consul/docs/agent/config/config-files#limits)
 - [HTTP Maximum Connections per Client](/consul/docs/agent/config/config-files#http_max_conns_per_client)
 - Services
 - TLS Configuration
--- a/website/content/docs/agent/index.mdx
+++ b/website/content/docs/agent/index.mdx
@ -33,7 +33,7 @@ The following process describes the agent lifecycle within the context of an exi
   As a result, all nodes will eventually become aware of each other.
 1. **Existing servers will begin replicating to the new node** if the agent is a server.

-### Failures and Crashes
+### Failures and crashes

 In the event of a network failure, some nodes may be unable to reach other nodes.
 Unreachable nodes will be marked as _failed_.
@ -48,7 +48,7 @@ catalog.
 Once the network recovers or a crashed agent restarts, the cluster will repair itself and unmark a node as failed.
 The health check in the catalog will also be updated to reflect the current state.

-### Exiting Nodes
+### Exiting nodes

 When a node leaves a cluster, it communicates its intent and the cluster marks the node as having _left_.
 In contrast to changes related to failures, all of the services provided by a node are immediately deregistered.
@ -61,6 +61,9 @@ interval of 72 hours (changing the reap interval is _not_ recommended due to
 its consequences during outage situations). Reaping is similar to leaving,
 causing all associated services to be deregistered.

+## Limit traffic rates
+You can define a set of rate limiting configurations that help operators protect Consul servers from excessive or peak usage. The configurations enable you to gracefully degrade Consul servers to avoid a global interruption of service. Consul supports global server rate limiting, which lets you configure Consul servers to deny requests that exceed the read or write limits. Refer to [Traffic Rate Limits Overview](/consul/docs/agent/limits/limit-traffic-rates).
+
 ## Requirements

 You should run one Consul agent per server or host.
@ -73,7 +76,7 @@ Refer to the following sections for information about host, port, memory, and ot

 The [Datacenter Deploy tutorial](/consul/tutorials/production-deploy/reference-architecture#deployment-system-requirements) contains additional information, including licensing configuration, environment variables, and other details.

-### Maximum Latency Network requirements
+### Maximum latency network requirements

 Consul uses the gossip protocol to share information across agents. To function properly, you cannot exceed the protocol's maximum latency threshold. The latency threshold is calculated according to the total round trip time (RTT) for communication between all agents.  Other network usages outside of Gossip are not bound by these latency requirements (i.e. client to server RPCs, HTTP API requests, xDS proxy configuration, DNS).

@ -82,7 +85,7 @@ For data sent between all Consul agents the following latency requirements must
 - Average RTT for all traffic cannot exceed 50ms.
 - RTT for 99 percent of traffic cannot exceed 100ms.

-## Starting the Consul Agent
+## Starting the Consul agent

 Start a Consul agent with the `consul` command and `agent` subcommand using the following syntax:

@ -111,7 +114,7 @@ $ consul agent -data-dir=tmp/consul -dev

 Agents are highly configurable, which enables you to deploy Consul to any infrastructure. Many of the default options for the `agent` command are suitable for becoming familiar with a local instance of Consul. In practice, however, several additional configuration options must be specified for Consul to function as expected. Refer to [Agent Configuration](/consul/docs/agent/config) topic for a complete list of configuration options.

-### Understanding the Agent Startup Output
+### Understanding the agent startup output

 Consul prints several important messages on startup.
 The following example shows output from the [`consul agent`](/consul/commands/agent) command:
@ -162,7 +165,7 @@ When running under `systemd` on Linux, Consul notifies systemd by sending
 this either the `join` or `retry_join` option has to be set and the
 service definition file has to have `Type=notify` set.

-## Configuring Consul Agents
+## Configuring Consul agents

 You can specify many options to configure how Consul operates when issuing the `consul agent` command.
 You can also create one or more configuration files and provide them to Consul at startup using either the `-config-file` or `-config-dir` option.
@ -180,7 +183,7 @@ $ consul agent -config-file=server.json
 The configuration options necessary to successfully use Consul depend on several factors, including the type of agent you are configuring (client or server), the type of environment you are deploying to (e.g., on-premise, multi-cloud, etc.), and the security options you want to implement (ACLs, gRPC encryption).
 The following examples are intended to help you understand some of the combinations you can implement to configure Consul.

-### Common Configuration Settings
+### Common configuration settings

 The following settings are commonly used in the configuration file (also called a service definition file when registering services with Consul) to configure Consul agents:

@ -195,7 +198,7 @@ The following settings are commonly used in the configuration file (also called
 | `addresses`  | Block of nested objects that define addresses bound to the agent for internal cluster communication.                                                                                                                                                                       | `"http": "0.0.0.0"` See the Agent Configuration page for [default address values](/consul/docs/agent/config/config-files#addresses) |
 | `ports`      | Block of nested objects that define ports bound to agent addresses. <br/>See (link to addresses option) for details.                                                                                                                                                       | See the Agent Configuration page for [default port values](/consul/docs/agent/config/config-files#ports)                            |

-### Server Node in a Service Mesh
+### Server node in a service mesh

 The following example configuration is for a server agent named "`consul-server`". The server is [bootstrapped](/consul/docs/agent/config/cli-flags#_bootstrap) and the Consul GUI is enabled.
 The reason this server agent is configured for a service mesh is that the `connect` configuration is enabled. Connect is Consul's service mesh component that provides service-to-service connection authorization and encryption using mutual Transport Layer Security (TLS). Applications can use sidecar proxies in a service mesh configuration to establish TLS connections for inbound and outbound connections without being aware of Connect at all. See [Connect](/consul/docs/connect) for details.
@ -243,7 +246,7 @@ connect {

 </CodeTabs>

-### Server Node with Encryption Enabled
+### Server node with encryption enabled

 The following example shows a server node configured with encryption enabled.
 Refer to the [Security](/consul/docs/security) chapter for additional information about how to configure security options for Consul.
@ -313,7 +316,7 @@ tls {

 </CodeTabs>

-### Client Node Registering a Service
+### Client node registering a service

 Using Consul as a central service registry is a common use case.
 The following example configuration includes common settings to register a service with a Consul agent and enable health checks (see [Checks](/consul/docs/discovery/checks) to learn more about health checks):
@ -371,7 +374,7 @@ service {

 </CodeTabs>

-## Client Node with Multiple Interfaces or IP addresses
+## Client node with multiple interfaces or IP addresses

 The following example shows how to configure Consul to listen on multiple interfaces or IP addresses using a [go-sockaddr template].

@ -422,7 +425,7 @@ advertise_addr = "{{ GetInterfaceIP \"en0\" }}"

 </CodeTabs>

-## Stopping an Agent
+## Stopping an agent

 An agent can be stopped in two ways: gracefully or forcefully. Servers and
 Clients both behave differently depending on the leave that is performed. There
--- a/website/content/docs/agent/limits/index.mdx
+++ b/website/content/docs/agent/limits/index.mdx
@ -0,0 +1,33 @@
+---
+layout: docs
+page_title: Limit Traffic Rates Overview
+description: Rate limiting is a set of Consul server agent configurations that you can use to mitigate the risks to Consul servers when clients send excessive requests to Consul resources. 
+
+---
+
+# Traffic rate limiting overview
+
+This topic provides an overview of the rate limits you can configure for Consul servers.
+
+## Introduction
+You can configure global RPC rate limits to mitigate the risks to Consul servers when clients send excessive read or write requests to Consul resources. A _read request_ is defined as any request that does not modify Consul internal state. A _write request_ is defined as any request that modifies Consul internal state. Rate limits for read and write requests are configured separately.
+
+## Rate limit modes
+
+You can set one of the following modes to determine how Consul servers react when exceeding request limits.
+ 
+- **Enforcing mode**: The rate limiter denies requests to a server once they exceed the configured rate. In this mode, Consul generates metrics and logs to help you understand your network's load and configure limits accordingly. 
+- **Permissive mode**: The rate limiter allows requests to a server once they exceed the configured rate. In this mode, Consul generates metrics and logs to help you understand your Consul load and configure limits accordingly. Use this mode to help you debug specific issues as you configure limits.
+- **Disabled mode**: Disables the rate limiter. This mode allows all requests, and Consul does not generate logs or metrics. This is the default mode.
+
+Refer to [`request_limits`](/consul/docs/agent/config/config-files#request_limits) for additional configuration information.
+
+## Request denials
+
+When an HTTP request is denied for rate limiting reasons, Consul returns one of the following errors:
+
+- **429 Resource Exhausted**: Indicates that a server is not able to perform the request but that another server could potentially fulfill it. This error is most common on stale reads because any server may fulfill stale read requests. To resolve this type of error, we recommend immediately retrying the request to another server. If the request came from a Consul client agent, the agent automatically retries the request up to the limit set in the [`rpc_hold_timeout`](/consul/docs/agent/config/config-files#rpc_hold_timeout) configuration.
+
+- **503 Service Unavailable**: Indicates that the server is unable to perform the request and that no other server can fulfill it either. This usually occurs on consistent reads or for writes. In this case, we recommend retrying according to an exponential backoff schedule. If the request came from a Consul client agent, the agent automatically retries the request according to the [`rpc_hold_timeout`](/consul/docs/agent/config/config-files#rpc_hold_timeout) configuration.
+
+Refer to [Rate limit reached on the server](/consul/docs/troubleshoot/common-errors#rate-limit-reached-on-the-server) for additional information.
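
The retry guidance in the "Request denials" section can be sketched from the point of view of a client that talks to the HTTP API directly. The following Python example is a minimal sketch under stated assumptions: `request_with_retries`, the `send` callable (which takes a server address and returns an HTTP status code), and the default delays are all hypothetical and are not part of Consul.

```python
import time
from typing import Callable, Sequence, Tuple


def request_with_retries(
    send: Callable[[str], int],
    servers: Sequence[str],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Return the (status, server) pair of the last attempt."""
    if not servers:
        raise ValueError("at least one server address is required")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    server_index = 0
    backoff_step = 0
    status, server = 0, servers[0]
    for _ in range(max_attempts):
        server = servers[server_index % len(servers)]
        status = send(server)
        if status == 429:
            # Another server may be able to fulfill the request:
            # retry immediately against the next server in the list.
            server_index += 1
        elif status == 503:
            # No server can fulfill the request right now:
            # wait before retrying, doubling the delay each time.
            sleep(base_delay * (2 ** backoff_step))
            backoff_step += 1
        else:
            break
    return status, server
```

Passing `sleep` and `send` as parameters keeps the sketch independent of any HTTP library and lets you exercise the retry logic without a running server.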
--- a/website/content/docs/agent/limits/init-rate-limits.mdx
+++ b/website/content/docs/agent/limits/init-rate-limits.mdx
@ -0,0 +1,32 @@
+---
+layout: docs
+page_title: Initialize Rate Limit Settings
+description: Learn how to determine regular and peak loads in your network so that you can set the initial global rate limit configurations.
+---
+
+# Initialize rate limit settings
+
+Because each network has different needs and applications, you need to find out what the regular and peak loads in your network are before you set traffic limits. We recommend completing the following steps to benchmark request rates in your environment so that you can implement limits appropriate for your applications.
+
+1. In the agent configuration file, specify a global rate limit with arbitrary values based on the following conditions:
+
+    - Environment where Consul servers are running 
+    - Number of servers and the projected load
+    - Existing metrics expressing requests per second
+
+1. Set the `mode` to `permissive`. In the following example, the configuration allows up to 1000 reads and 500 writes per second for each Consul agent:
+
+    ```hcl
+    request_limits {
+        mode = "permissive"
+        read_rate = 1000.0
+        write_rate = 500.0
+    }
+    ```
+
+1. Observe the logs and metrics for your application's typical cycle, such as a 24-hour period. Refer to [`log_file`](/consul/docs/agent/config/config-files#log_file) for information about where to retrieve logs. Call the [`/agent/metrics`](/consul/api-docs/agent#view-metrics) HTTP API endpoint and check the data for the following metrics:
+
+    - `rpc.rate_limit.exceeded.read`
+    - `rpc.rate_limit.exceeded.write`
+
+1. If the limits are not reached, set the `mode` configuration to `enforcing`. Otherwise, continue to adjust and iterate until you find your network's unique limits.
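
The final step of the procedure above can be sketched as a small check over metrics snapshots collected during the observation period. This Python example assumes each snapshot is the decoded JSON body of an `/agent/metrics` response with a `Counters` list of objects that carry `Name` and `Count` fields. The helper names `exceeded_total` and `suggest_mode` are hypothetical, and the substring match on `rpc.rate_limit.exceeded` is an assumption chosen so that prefixed or suffixed metric names are also counted.

```python
from typing import Iterable, Mapping


def exceeded_total(snapshot: Mapping, needle: str = "rpc.rate_limit.exceeded") -> int:
    """Sum the Count of every counter whose name contains `needle`."""
    return sum(
        int(counter.get("Count", 0))
        for counter in snapshot.get("Counters", [])
        if needle in counter.get("Name", "")
    )


def suggest_mode(snapshots: Iterable[Mapping]) -> str:
    """Suggest the next `mode` value based on the observed snapshots."""
    if any(exceeded_total(snapshot) > 0 for snapshot in snapshots):
        # The limits were reached at least once: keep observing and adjust the rates.
        return "permissive"
    # The limits were never reached during the observation period.
    return "enforcing"
```

How the snapshots are collected, for example by polling the endpoint on a schedule, is left out of the sketch.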
--- a/website/content/docs/agent/limits/set-global-traffic-rate-limits.mdx
+++ b/website/content/docs/agent/limits/set-global-traffic-rate-limits.mdx
@ -0,0 +1,114 @@
+---
+layout: docs
+page_title: Set a Global Limit on Traffic Rates
+description: Use global rate limits to prevent excessive rates of requests to Consul servers.
+---
+
+# Set a global limit on traffic rates
+
+This topic describes how to configure rate limits for RPC and gRPC traffic to the Consul server. 
+
+## Introduction 
+
+Rate limits apply to each Consul server separately and limit the number of read requests or write requests to the server on the RPC and internal gRPC endpoints.
+
+Because all requests coming to a Consul server eventually perform an RPC or an internal gRPC request, global rate limits apply to Consul's user interfaces, such as the HTTP API interface, the CLI, and the external gRPC endpoint for services in the service mesh.
+
+Refer to [Initialize Rate Limit Settings](/consul/docs/agent/limits/init-rate-limits) for additional information about right-sizing your gRPC request configurations.
+
+## Set a global rate limit for a Consul server   
+
+Configure the following settings in your Consul server configuration to limit the RPC and gRPC traffic rates.
+
+- Set the rate limiter [`mode`](/consul/docs/agent/config/config-files#mode-1)
+- Set the [`read_rate`](/consul/docs/agent/config/config-files#read_rate) 
+- Set the [`write_rate`](/consul/docs/agent/config/config-files#write_rate)
+
+In the following example, the Consul server is configured to deny more than `500` read and `200` write RPC calls per second:
+
+<CodeTabs heading="Consul server agent">
+
+```hcl
+limits = {
+    request_limits = {
+        mode = "enforcing"
+        read_rate = 500
+        write_rate = 200
+    }
+}
+```
+
+```json
+{
+    "limits" : {
+        "request_limits" : {
+            "mode" : "enforcing",
+            "read_rate" : 500,
+            "write_rate" : 200
+        }
+    }
+}
+```
+
+</CodeTabs>
+
+## Access rate limit logs 
+
+Consul prints a log line for each request that exceeds a rate limit. The log includes information to identify the source of the request and the server's configured limit. Consul prints these lines at the `DEBUG` log level and can be configured to drop log lines to avoid affecting server health. Dropping a log line increments the `rpc.rate_limit.log_dropped` metric.
+
+The following example log shows that an RPC request from `127.0.0.1:53562` to `KVS.Apply` exceeded the rate limit:
+
+<CodeBlockConfig hideClipboard>
+
+```shell-session
+2023-02-17T10:01:15.565-0500 [DEBUG] agent.server.rpc-rate-limit: RPC 
+exceeded allowed rate limit: rpc=KVS.Apply source_addr=127.0.0.1:53562 
+limit_type=global/write limit_enforced=false
+```
+</CodeBlockConfig>
+
+## Review rate limit metrics
+
+Consul captures the following metrics associated with rate limits:
+
+- Type of limit
+- Operation
+- Rate limit mode
+
+Call the `agent/metrics` API endpoint to view the metrics associated with rate limits. Refer to [View Metrics](/consul/api-docs/agent#view-metrics) for API usage information. 
+
+In the following example, Consul dropped a call to the `consul` service because it exceeded the limit by one call:
+
+```shell-session
+$ curl http://127.0.0.1:8500/v1/agent/metrics
+{
+  . . .  
+  "Counters": [
+    {
+      "Name": "consul.rpc.rate_limit.exceeded",
+      "Count": 1,
+      "Sum": 1,
+      "Min": 1,
+      "Max": 1,
+      "Mean": 1,
+      "Stddev": 0,
+      "Labels": {
+        "service": "consul"
+      }
+    },
+    {
+      "Name": "consul.rpc.rate_limit.log_dropped",
+      "Count": 1,
+      "Sum": 1,
+      "Min": 1,
+      "Max": 1,
+      "Mean": 1,
+      "Stddev": 0,
+      "Labels": {}
+    }
+  ],
+  . . .
+```
+
+Refer to [Telemetry](/consul/docs/agent/telemetry) for additional information.
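
The log line shown under "Access rate limit logs" carries its details as `key=value` pairs, so it can be broken into fields for filtering or aggregation. The following Python sketch assumes the format of that single example line; the function name `parse_rate_limit_log` is hypothetical, and other Consul versions may emit different fields.

```python
import re

# Matches key=value pairs such as rpc=KVS.Apply or limit_enforced=false.
_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_rate_limit_log(line: str) -> dict:
    """Extract the key=value fields from a rate limit log line."""
    fields = dict(_PAIR.findall(line))
    if "limit_enforced" in fields:
        # Convert the textual flag into a boolean for easier filtering.
        fields["limit_enforced"] = fields["limit_enforced"] == "true"
    return fields
```

For the example line in this file, the parsed fields include `rpc`, `source_addr`, `limit_type`, and `limit_enforced`. The `limit_enforced=false` value suggests the request was logged but not denied.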
--- a/website/content/docs/agent/telemetry.mdx
+++ b/website/content/docs/agent/telemetry.mdx
@ -477,8 +477,8 @@ These metrics are used to monitor the health of the Consul servers.
 | `consul.raft.transition.heartbeat_timeout`          | The number of times an agent has transitioned to the Candidate state, after receive no heartbeat messages from the last known leader.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | timeouts / interval               | counter |
 | `consul.raft.verify_leader`                         | This metric doesn't have a direct correlation to the leader change.  It just counts the number of times an agent checks if it is still the leader or not.  For example, during every consistent read, the check is done.  Depending on the load in the system, this metric count can be high as it is incremented each time a consistent read is completed.                                                                                                                                                                                                                                                                                                                                                                                        | checks / interval                 | Counter |
 | `consul.rpc.accept_conn`                            | Increments when a server accepts an RPC connection.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | connections                       | counter |
-| `consul.rpc.rate_limit.exceeded`                    | Increments whenever an RPC is over a configured rate limit. In permissive mode, the RPC is still allowed to proceed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | RPCs                              | counter |
-| `consul.rpc.rate_limit.log_dropped`                 | Increments whenever a log that is emitted because an RPC exceeded a rate limit gets dropped because the output buffer is full.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | log messages dropped              | counter |
+| `consul.rpc.rate_limit.exceeded`    | Number of rate limited requests. Only increments when `limits.request_limits.mode` is set to `permissive` or `enforcing`. |  requests | counter |
+| `consul.rpc.rate_limit.log_dropped`    | Number of logs for rate limited requests dropped for performance reasons. Only increments when `limits.request_limits.mode` is set to `permissive` or `enforcing` and the log is unable to print the number of excessive requests. |  log lines | counter |
 | `consul.catalog.register`                           | Measures the time it takes to complete a catalog register operation.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | ms                                | timer   |
 | `consul.catalog.deregister`                         | Measures the time it takes to complete a catalog deregister operation.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | ms                                | timer   |
 | `consul.server.isLeader`                            | Track if a server is a leader(1) or not(0)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 1 or 0                            | gauge   |
--- a/website/data/docs-nav-data.json
+++ b/website/data/docs-nav-data.json
@ -734,6 +734,23 @@
          }
        ]
      },
+      {
+        "title": "Limit Traffic Rates",
+        "routes": [
+          {
+            "title": "Overview",
+            "path": "agent/limits"
+          },
+          {
+            "title": "Initialize Rate Limit Settings",
+            "path": "agent/limits/init-rate-limits"
+          },
+          {
+            "title": "Set Global Traffic Rate Limits",
+            "path": "agent/limits/set-global-traffic-rate-limits"
+          }
+        ]
+      },
      {
        "title": "Configuration Entries",
        "path": "agent/config-entries"