Updating Stopping Agent Section (#8016)

Fixes #6935 to clarify agent behavior.
2020-06-03 17:08:49 -04:00 · ebdc3a6b9e
parent e8be0533df
commit ebdc3a6b9e
1 changed file with 73 additions and 45 deletions
--- a/website/pages/docs/agent/index.mdx
+++ b/website/pages/docs/agent/index.mdx
@@ -15,23 +15,28 @@ information, registers services, runs checks, responds to queries,
 and more. The agent must run on every node that is part of a Consul cluster.

 Any agent may run in one of two modes: client or server. A server
-node takes on the additional responsibility of being part of the [consensus quorum](/docs/internals/consensus).
+node takes on the additional responsibility of being part of the 
+[consensus quorum](/docs/internals/consensus).
 These nodes take part in Raft and provide strong consistency and availability in
-the case of failure. The higher burden on the server nodes means that usually they
-should be run on dedicated instances -- they are more resource intensive than a client
-node. Client nodes make up the majority of the cluster, and they are very lightweight
-as they interface with the server nodes for most operations and maintain very little state
-of their own.
+the case of failure. The higher burden on the server nodes means that usually
+they should be run on dedicated instances -- they are more resource intensive
+than a client node. Client nodes make up the majority of the cluster, and they
+are very lightweight as they interface with the server nodes for most
+operations and maintain very little state of their own.

 ## Running an Agent

-The agent is started with the [`consul agent`](/docs/commands/agent) command. This
-command blocks, running forever or until told to quit. You can test a local agent by following the [Getting Started guides](https://learn.hashicorp.com/consul/getting-started/install?utm_source=consul.io&utm_medium=docs).
+The agent is started with the [`consul agent`](/docs/commands/agent) command.
+This command blocks, running forever or until told to quit. You can test a 
+local agent by following the 
+[Getting Started guides](https://learn.hashicorp.com/consul/getting-started/install?utm_source=consul.io&utm_medium=docs).

-The agent command takes a variety
-of [`configuration options`](/docs/agent/options#command-line-options), but most have sane defaults.
+The agent command takes a variety of 
+[`configuration options`](/docs/agent/options#command-line-options), but most
+have sane defaults.

-When running [`consul agent`](/docs/commands/agent), you should see output similar to this:
+When running [`consul agent`](/docs/commands/agent), you should see output
+similar to this:

 ```shell-session
 $ consul agent -data-dir=/tmp/consul
@@ -49,33 +54,38 @@ $ consul agent -data-dir=/tmp/consul
 ...
 ```

-There are several important messages that [`consul agent`](/docs/commands/agent) outputs:
+There are several important messages that 
+[`consul agent`](/docs/commands/agent) outputs:

 - **Node name**: This is a unique name for the agent. By default, this
  is the hostname of the machine, but you may customize it using the
  [`-node`](/docs/agent/options#_node) flag.

- **Datacenter**: This is the datacenter in which the agent is configured to run.
-  Consul has first-class support for multiple datacenters; however, to work efficiently,
-  each node must be configured to report its datacenter. The [`-datacenter`](/docs/agent/options#_datacenter)
-  flag can be used to set the datacenter. For single-DC configurations, the agent
-  will default to "dc1".
+- **Datacenter**: This is the datacenter in which the agent is configured to
+  run.
+  Consul has first-class support for multiple datacenters; however, to work
+  efficiently, each node must be configured to report its datacenter. The 
+  [`-datacenter`](/docs/agent/options#_datacenter) flag can be used to set the 
+  datacenter. For single-DC configurations, the agent will default to "dc1".

- **Server**: This indicates whether the agent is running in server or client mode.
+- **Server**: This indicates whether the agent is running in server or client
+  mode.
  Server nodes have the extra burden of participating in the consensus quorum,
  storing cluster state, and handling queries. Additionally, a server may be
  in ["bootstrap"](/docs/agent/options#_bootstrap_expect) mode. Multiple servers
-  cannot be in bootstrap mode as that would put the cluster in an inconsistent state.
+  cannot be in bootstrap mode as that would put the cluster in an inconsistent
+  state.

 - **Client Addr**: This is the address used for client interfaces to the agent.
-  This includes the ports for the HTTP and DNS interfaces. By default, this binds only
-  to localhost. If you change this address or port, you'll have to specify a `-http-addr`
-  whenever you run commands such as [`consul members`](/docs/commands/members) to
-  indicate how to reach the agent. Other applications can also use the HTTP address and port
+  This includes the ports for the HTTP and DNS interfaces. By default, this
+  binds only to localhost. If you change this address or port, you'll have to
+  specify a `-http-addr` whenever you run commands such as
+  [`consul members`](/docs/commands/members) to indicate how to reach the
+  agent. Other applications can also use the HTTP address and port
  [to control Consul](/api).

- **Cluster Addr**: This is the address and set of ports used for communication between
-  Consul agents in a cluster. Not all Consul agents in a cluster have to
+- **Cluster Addr**: This is the address and set of ports used for communication
+  between Consul agents in a cluster. Not all Consul agents in a cluster have to
  use the same port, but this address **MUST** be reachable by all other nodes.

 When running under `systemd` on Linux, Consul notifies systemd by sending
@@ -85,44 +95,62 @@ service definition file has to have `Type=notify` set.

 ## Stopping an Agent

-An agent can be stopped in two ways: gracefully or forcefully. To gracefully
-halt an agent, send the process an interrupt signal (usually
-`Ctrl-C` from a terminal or running `kill -INT consul_pid` ). When gracefully exiting, the agent first notifies
-the cluster it intends to leave the cluster. This way, other cluster members
-notify the cluster that the node has _left_.
+An agent can be stopped in two ways: gracefully or forcefully. Servers and
+Clients behave differently depending on how they are stopped. After the signal
+is sent, the rest of the cluster records the node in one of two states:
+_left_ or _failed_.

-Alternatively, you can force kill the agent by sending it a kill signal.
-When force killed, the agent ends immediately. The rest of the cluster will
-eventually (usually within seconds) detect that the node has died and
-notify the cluster that the node has _failed_.
+To gracefully halt an agent, send the process an _interrupt signal_ (usually
+`Ctrl-C` from a terminal, or by running `kill -INT consul_pid`). For more
+information on the signals that the `kill` command can send, see
+[this overview of kill signals](https://www.linux.org/threads/kill-signals-and-commands-revised.11625/).

-It is especially important that a server node be allowed to leave gracefully
-so that there will be a minimal impact on availability as the server leaves
-the consensus quorum.
+When a Client exits gracefully, the agent first notifies the cluster that it
+intends to leave. This way, other cluster members notify the cluster that the
+node has _left_.
+
+When a Server exits gracefully, it is not marked as _left_. This minimizes the
+impact on the consensus quorum. Instead, the Server is marked as _failed_. To
+remove a Server from the cluster, use the
+[`force-leave`](/docs/commands/force-leave) command. Using `force-leave` puts
+the Server instance in a _left_ state, provided that the Server agent is no
+longer alive.
+
+Alternatively, you can forcibly stop an agent by sending it a kill signal, for
+example with `kill -KILL consul_pid`. This stops any agent immediately. The rest
+of the cluster will eventually (usually within seconds) detect that the node has
+died and notify the cluster that the node has _failed_.

 For client agents, the difference between a node _failing_ and a node _leaving_
 may not be important for your use case. For example, for a web server and load
 balancer setup, both result in the same outcome: the web node is removed
 from the load balancer pool.

+The [`skip_leave_on_interrupt`](/docs/agent/options#skip_leave_on_interrupt) and
+[`leave_on_terminate`](/docs/agent/options#leave_on_terminate) configuration
+options allow you to adjust whether an agent leaves gracefully when signaled.
+
 ## Lifecycle

 Every agent in the Consul cluster goes through a lifecycle. Understanding
 this lifecycle is useful for building a mental model of an agent's interactions
 with a cluster and how the cluster treats a node.

-When an agent is first started, it does not know about any other node in the cluster.
+When an agent is first started, it does not know about any other node in the
+cluster.
 To discover its peers, it must _join_ the cluster. This is done with the
 [`join`](/docs/commands/join)
-command or by providing the proper configuration to auto-join on start. Once a node
-joins, this information is gossiped to the entire cluster, meaning all nodes will
-eventually be aware of each other. If the agent is a server, existing servers will
-begin replicating to the new node.
+command or by providing the proper configuration to auto-join on start. Once a
+node joins, this information is gossiped to the entire cluster, meaning all
+nodes will eventually be aware of each other. If the agent is a server,
+existing servers will begin replicating to the new node.

 In the case of a network failure, some nodes may be unreachable by other nodes.
-In this case, unreachable nodes are marked as _failed_. It is impossible to distinguish
-between a network failure and an agent crash, so both cases are handled the same.
-Once a node is marked as failed, this information is updated in the service catalog.
+In this case, unreachable nodes are marked as _failed_. It is impossible to
+distinguish between a network failure and an agent crash, so both cases are
+handled the same.
+Once a node is marked as failed, this information is updated in the service
+catalog.

 -> **Note:** There is some nuance here since this update is only possible if the servers can still [form a quorum](/docs/internals/consensus). Once the network recovers or a crashed agent restarts the cluster will repair itself and unmark a node as failed. The health check in the catalog will also be updated to reflect this.
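
As a reading aid for the updated "Stopping an Agent" text in this change, the outcomes it describes can be sketched as a small Python model. This is only an illustration of the prose, not Consul code: the names `Role`, `Stop`, `state_after_stop`, and `state_after_force_leave` are hypothetical, and the model assumes the behavior exactly as worded in the diff, ignoring any adjustment made through `skip_leave_on_interrupt` or `leave_on_terminate`.

```python
from enum import Enum


class Role(Enum):
    CLIENT = "client"
    SERVER = "server"


class Stop(Enum):
    INTERRUPT = "interrupt"  # graceful: Ctrl-C or `kill -INT consul_pid`
    KILL = "kill"            # forced: `kill -KILL consul_pid`


def state_after_stop(role: Role, stop: Stop) -> str:
    """Return the state the cluster records for the node, per the text above.

    Hypothetical helper; it mirrors the documented wording only.
    """
    if stop is Stop.KILL:
        # A force-killed agent of either role is detected as failed.
        return "failed"
    if role is Role.CLIENT:
        # A Client that exits gracefully announces its departure.
        return "left"
    # A Server that exits gracefully is still marked failed, not left.
    return "failed"


def state_after_force_leave(state: str, agent_alive: bool) -> str:
    """Model `force-leave`: a failed node moves to left only if its agent is not alive."""
    if state == "failed" and not agent_alive:
        return "left"
    return state


if __name__ == "__main__":
    # Print the state for every combination of role and stop method.
    for role in Role:
        for stop in Stop:
            print(f"{role.value:6} {stop.value:9} -> {state_after_stop(role, stop)}")
```

Read this way, the only path from a stopped Server to _left_ in the model is a `force-leave` issued after the Server agent is no longer alive, which matches the paragraph added in this commit.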