open-consul/website/content/docs/agent/index.mdx
Zachary Shilton 5b53b5aef5
website: implement mktg 032 (#9953)
* website: migrate to new nav-data format

* website: clean up unused intro content

* website: remove deprecated sidebar_title from frontmatter

* website: add react-content to fix global style import issue
2021-04-07 15:50:38 -04:00

167 lines
7.9 KiB
Plaintext

---
layout: docs
page_title: Agent
description: >-
The Consul agent is the core process of Consul. The agent maintains membership
information, registers services, runs checks, responds to queries, and more.
The agent must run on every node that is part of a Consul cluster.
---
# Consul Agent
The Consul agent is the core process of Consul. The agent maintains membership
information, registers services, runs checks, responds to queries,
and more. The agent must run on every node that is part of a Consul cluster.
Any agent may run in one of two modes: client or server. A server
node takes on the additional responsibility of being part of the
[consensus quorum](/docs/internals/consensus).
These nodes take part in Raft and provide strong consistency and availability in
the case of failure. The higher burden on the server nodes means that usually
they should be run on dedicated instances -- they are more resource intensive
than a client node. Client nodes make up the majority of the cluster, and they
are very lightweight as they interface with the server nodes for most
operations and maintain very little state of their own.
## Running an Agent
The agent is started with the [`consul agent`](/commands/agent) command.
This command blocks, running forever or until told to quit. You can test a
local agent by following the
[Getting Started tutorials](https://learn.hashicorp.com/tutorials/consul/get-started-install?utm_source=consul.io&utm_medium=docs).
The agent command takes a variety of
[`configuration options`](/docs/agent/options#command-line-options), but most
have sane defaults.
When running [`consul agent`](/commands/agent), you should see output
similar to this:
```shell-session
$ consul agent -data-dir=/tmp/consul
==> Starting Consul agent...
==> Consul agent running!
Node name: 'Armons-MacBook-Air'
Datacenter: 'dc1'
Server: false (bootstrap: false)
Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600)
Cluster Addr: 192.168.1.43 (LAN: 8301, WAN: 8302)
==> Log data will now stream in as it occurs:
[INFO] serf: EventMemberJoin: Armons-MacBook-Air.local 192.168.1.43
...
```
There are several important messages that
[`consul agent`](/commands/agent) outputs:
- **Node name**: This is a unique name for the agent. By default, this
is the hostname of the machine, but you may customize it using the
[`-node`](/docs/agent/options#_node) flag.
- **Datacenter**: This is the datacenter in which the agent is configured to
run.
Consul has first-class support for multiple datacenters; however, to work
efficiently, each node must be configured to report its datacenter. The
[`-datacenter`](/docs/agent/options#_datacenter) flag can be used to set the
datacenter. For single-DC configurations, the agent will default to "dc1".
- **Server**: This indicates whether the agent is running in server or client
mode.
Server nodes have the extra burden of participating in the consensus quorum,
storing cluster state, and handling queries. Additionally, a server may be
in ["bootstrap"](/docs/agent/options#_bootstrap_expect) mode. Multiple servers
cannot be in bootstrap mode as that would put the cluster in an inconsistent
state.
- **Client Addr**: This is the address used for client interfaces to the agent.
This includes the ports for the HTTP and DNS interfaces. By default, this
binds only to localhost. If you change this address or port, you'll have to
specify a `-http-addr` whenever you run commands such as
[`consul members`](/commands/members) to indicate how to reach the
agent. Other applications can also use the HTTP address and port
[to control Consul](/api).
- **Cluster Addr**: This is the address and set of ports used for communication
between Consul agents in a cluster. Not all Consul agents in a cluster have to
use the same port, but this address **MUST** be reachable by all other nodes.
When running under `systemd` on Linux, Consul notifies systemd by sending
`READY=1` to the `$NOTIFY_SOCKET` when a LAN join has completed. For
this either the `join` or `retry_join` option has to be set and the
service definition file has to have `Type=notify` set.
## Stopping an Agent
An agent can be stopped in two ways: gracefully or forcefully. Servers and
Clients both behave differently depending on the leave that is performed. There
are two potential states a process can be in after a system signal is sent:
_left_ and _failed_.
To gracefully halt an agent, send the process an _interrupt signal_ (usually
`Ctrl-C` from a terminal, or running `kill -INT consul_pid` ). For more
information on different signals sent by the `kill` command, see
[here](https://www.linux.org/threads/kill-signals-and-commands-revised.11625/)
When a Client is gracefully exited, the agent first notifies the cluster it
intends to leave the cluster. This way, other cluster members notify the
cluster that the node has _left_.
When a Server is gracefully exited, the server will not be marked as _left_.
This is to minimally impact the consensus quorum. Instead, the Server will be
marked as _failed_. To remove a server from the cluster, the
[`force-leave`](/commands/force-leave) command is used. Using
`force-leave` will put the server instance in a _left_ state so long as the
Server agent is not alive.
Alternatively, you can forcibly stop an agent by sending it a
`kill -KILL consul_pid` signal. This will stop any agent immediately. The rest
of the cluster will eventually (usually within seconds) detect that the node has
died and notify the cluster that the node has _failed_.
For client agents, the difference between a node _failing_ and a node _leaving_
may not be important for your use case. For example, for a web server and load
balancer setup, both result in the same outcome: the web node is removed
from the load balancer pool.
The [`skip_leave_on_interrupt`](/docs/agent/options#skip_leave_on_interrupt) and
[`leave_on_terminate`](/docs/agent/options#leave_on_terminate) configuration
options allow you to adjust this behavior.
## Lifecycle
Every agent in the Consul cluster goes through a lifecycle. Understanding
this lifecycle is useful for building a mental model of an agent's interactions
with a cluster and how the cluster treats a node.
When an agent is first started, it does not know about any other node in the
cluster.
To discover its peers, it must _join_ the cluster. This is done with the
[`join`](/commands/join)
command or by providing the proper configuration to auto-join on start. Once a
node joins, this information is gossiped to the entire cluster, meaning all
nodes will eventually be aware of each other. If the agent is a server,
existing servers will begin replicating to the new node.
In the case of a network failure, some nodes may be unreachable by other nodes.
In this case, unreachable nodes are marked as _failed_. It is impossible to
distinguish between a network failure and an agent crash, so both cases are
handled the same.
Once a node is marked as failed, this information is updated in the service
catalog.
-> **Note:** There is some nuance here since this update is only possible if the servers can still [form a quorum](/docs/internals/consensus). Once the network recovers or a crashed agent restarts the cluster will repair itself and unmark a node as failed. The health check in the catalog will also be updated to reflect this.
When a node _leaves_, it specifies its intent to do so, and the cluster
marks that node as having _left_. Unlike the _failed_ case, all of the
services provided by a node are immediately deregistered. If the agent was
a server, replication to it will stop.
To prevent an accumulation of dead nodes (nodes in either _failed_ or _left_
states), Consul will automatically remove dead nodes out of the catalog. This
process is called _reaping_. This is currently done on a configurable
interval of 72 hours (changing the reap interval is _not_ recommended due to
its consequences during outage situations). Reaping is similar to leaving,
causing all associated services to be deregistered.