---
layout: docs
page_title: Nomad Agent
description: |-
  The Nomad agent is a long running process which can be used either in
  a client or server mode.
---

# Operating Nomad Agents

The Nomad agent is a long running process which runs on every machine that
is part of the Nomad cluster. The behavior of the agent depends on whether it
is running in client or server mode. Clients are responsible for running
tasks, while servers are responsible for managing the cluster.

Client mode agents are relatively simple. They make use of fingerprinting
to determine the capabilities and resources of the host machine, as well as
determining what [drivers](/nomad/docs/drivers) are available. Clients
register with servers to provide the node information, heartbeat to provide
liveness, and run any tasks assigned to them.

Servers take on the responsibility of being part of the
[consensus protocol](/nomad/docs/concepts/consensus) and [gossip protocol](/nomad/docs/concepts/gossip).
The consensus protocol, powered by Raft, allows the servers to perform
leader election and state replication. The gossip protocol allows for simple
clustering of servers and multi-region federation. The higher burden on the
server nodes means that usually they should be run on dedicated instances --
they are more resource intensive than a client node.

Client nodes make up the majority of the cluster, and are very lightweight as
they interface with the server nodes and maintain very little state of their
own. Each cluster usually has 3 or 5 server mode agents and potentially
thousands of clients.

## Running an Agent

The agent is started with the [`nomad agent` command](/nomad/docs/commands/agent). This
command blocks, running forever or until told to quit. The agent command takes a variety
of configuration options, but most have sane defaults.

-> **Note:** If you are running Nomad on Linux, you'll need to run client agents
as root (or with `sudo`) so that cpuset accounting and network namespaces work
correctly.

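Configuration can also be supplied in a file passed to the agent with the
`-config` flag. As a minimal sketch of a combined client-and-server agent
(the path and values below are illustrative assumptions, not required
defaults):

```hcl
# agent.hcl -- minimal example configuration; adjust for your environment.
data_dir   = "/opt/nomad/data" # hypothetical path; must exist and be writable
datacenter = "dc1"
region     = "global"

server {
  enabled          = true
  bootstrap_expect = 1 # single-server setup, suitable only for testing
}

client {
  enabled = true
}
```

You would then start the agent with `nomad agent -config agent.hcl`.
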
When running `nomad agent`, you should see output similar to this:

```shell-session
$ sudo nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:

            Client: true
         Log Level: INFO
            Region: global (DC: dc1)
            Server: true

==> Nomad agent started! Log data will stream in below:

    [INFO] serf: EventMemberJoin: server-1.node.global 127.0.0.1
    [INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
    ...
```

There are several important messages that `nomad agent` outputs:

- **Client**: This indicates whether the agent has enabled client mode.
  Client nodes fingerprint their host environment, register with servers,
  and run tasks.

- **Log Level**: This indicates the configured log level. Only messages with
  an equal or higher severity will be logged. This can be tuned to increase
  verbosity for debugging, or reduced to avoid noisy logging.

- **Region**: This is the region and datacenter in which the agent is configured
  to run. Nomad has first-class support for multi-datacenter and multi-region
  configurations. The `-region` and `-dc` flags can be used to set the region
  and datacenter. The default is the `global` region in `dc1`.

- **Server**: This indicates whether the agent has enabled server mode.
  Server nodes have the extra burden of participating in the consensus protocol,
  storing cluster state, and making scheduling decisions.

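For example, to start a development agent in a non-default region and
datacenter (the names here are illustrative):

```shell-session
$ sudo nomad agent -dev -region europe -dc paris
```
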
## Stopping an Agent

By default, any stop signal to an agent (interrupt or terminate) will cause the
agent to exit after ensuring its internal state is committed to disk as
needed. You can configure additional shutdown behaviors by setting
[`leave_on_interrupt`][] or [`leave_on_terminate`][] to respond to the
respective signals.

For servers, when `leave_on_interrupt` or `leave_on_terminate` is set, the
server will notify the other servers of its intention to leave the cluster,
which allows it to leave the [consensus][] peer set. It is especially important
that a server node be allowed to leave gracefully so that there will be a
minimal impact on availability as the server leaves the consensus peer set. If
a server does not gracefully leave, and will not return into service, the
[`server force-leave` command][] should be used to eject it from the consensus
peer set.

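For example, to eject a permanently failed server by name (the name below is a
placeholder in the `<name>.<region>` format shown by `nomad server members`):

```shell-session
$ nomad server force-leave nomad-server-3.global
```
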
For clients, when `leave_on_interrupt` or `leave_on_terminate` is set and the
client is configured with [`drain_on_shutdown`][], the client will drain its
workloads before shutting down.

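As a sketch, a client that drains its workloads on `SIGTERM` might be
configured like this (the values are illustrative, not recommendations):

```hcl
# Gracefully leave the cluster on SIGTERM and drain workloads first.
leave_on_terminate = true

client {
  enabled = true

  drain_on_shutdown {
    deadline           = "2m"   # illustrative; how long the drain may take
    force              = false  # do not stop all allocations immediately
    ignore_system_jobs = false  # drain system jobs as well
  }
}
```
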
## Signal Handling

In addition to the optional handling of interrupt (`SIGINT`) and terminate
signals (`SIGTERM`) described in [Stopping an Agent](#stopping-an-agent), Nomad
supports special behavior for several other signals useful for debugging.

* `SIGHUP` will cause Nomad to [reload its configuration][].
* `SIGUSR1` will cause Nomad to print its [metrics][] without stopping the
  agent.
* `SIGQUIT`, `SIGILL`, `SIGTRAP`, `SIGABRT`, `SIGSTKFLT`, `SIGEMT`, or `SIGSYS`
  signals are handled by the Go runtime and will cause the Nomad agent to exit
  and print its stack trace.

When using the official HashiCorp packages on Linux, you can send these signals
via `systemctl`. For example, to print the Nomad agent's metrics:

```shell-session
$ sudo systemctl kill nomad -s SIGUSR1
```

You can then read those metrics in the service logs:

```shell-session
$ journalctl -u nomad
```

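Similarly, if the service's unit file defines an `ExecReload` that sends
`SIGHUP` (as the official packages typically do), you can trigger a
configuration reload with:

```shell-session
$ sudo systemctl reload nomad
```
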
## Lifecycle

Every agent in the Nomad cluster goes through a lifecycle. Understanding
this lifecycle is useful for building a mental model of an agent's interactions
with a cluster and how the cluster treats a node.

When a client agent is first started, it fingerprints the host machine to
identify its attributes, capabilities, and [task drivers](/nomad/docs/drivers).
These are reported to the servers during an initial registration. The addresses
of known servers are provided to the agent via configuration, potentially using
DNS for resolution. Using [Consul](https://www.consul.io/) provides a way to
avoid hard-coding server addresses, resolving them on demand instead.

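For example, a client might be pointed at its servers like this (the addresses
are placeholders; with Consul integration this list can be discovered
automatically):

```hcl
client {
  enabled = true

  # Placeholder addresses; port 4647 is the default server RPC port.
  servers = ["10.0.0.10:4647", "10.0.0.11:4647", "10.0.0.12:4647"]
}
```
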
While a client is running, it is performing heartbeating with servers to
maintain liveness. If the heartbeats fail, the servers assume the client node
has failed, and stop assigning new tasks while migrating existing tasks.
It is impossible to distinguish between a network failure and an agent crash,
so both cases are handled the same. Once the network recovers or a crashed
agent restarts, the node status will be updated and normal operation resumed.

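You can observe each node's state from any machine with access to the servers:

```shell-session
$ nomad node status
```
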
To prevent an accumulation of nodes in a terminal state, Nomad does periodic
garbage collection of nodes. By default, if a node is in a failed or 'down'
state for over 24 hours it will be garbage collected from the system.

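This window is tunable on the servers. A sketch using the `node_gc_threshold`
option (the value shown matches the default described above):

```hcl
server {
  enabled = true

  # How long a node can remain in a terminal state before it is
  # garbage collected.
  node_gc_threshold = "24h"
}
```
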
Servers are slightly more complex as they perform additional functions. They
participate in a [gossip protocol](/nomad/docs/concepts/gossip) both to cluster
within a region and to support multi-region configurations. When a server is
first started, it does not know the address of other servers in the cluster.
To discover its peers, it must _join_ the cluster. This is done with the
[`server join` command](/nomad/docs/commands/server/join) or by providing the
proper configuration on start. Once a node joins, this information is gossiped
to the entire cluster, meaning all nodes will eventually be aware of each other.

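As a sketch of the configuration approach, servers can be given their peers'
addresses in a `server_join` block (the addresses are placeholders):

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = ["10.0.0.10", "10.0.0.11", "10.0.0.12"] # placeholders
    retry_max      = 0     # retry indefinitely
    retry_interval = "15s"
  }
}
```
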
When a server _leaves_, it specifies its intent to do so, and the cluster marks that
node as having _left_. If the server has _left_, replication to it will stop and it
is removed from the consensus peer set. If the server has _failed_, replication
will attempt to make progress to recover from a software or network failure.

## Permissions

Nomad servers and Nomad clients have different requirements for permissions.

Nomad servers should be run with the lowest possible permissions. They need
access to their own data directory and the ability to bind to their ports. You
should create a `nomad` user with the minimal set of required privileges.

Nomad clients should be run as `root` due to the OS isolation mechanisms that
require root privileges. While it is possible to run Nomad as an unprivileged
user, careful testing must be done to ensure the task drivers and features
you use function as expected. The Nomad client's data directory should be
owned by `root` with filesystem permissions set to `0700`.

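For example, on a Linux host you might create the server user and lock down a
client's data directory like this (the paths are illustrative):

```shell-session
$ sudo useradd --system --home /etc/nomad.d --shell /bin/false nomad
$ sudo chown -R root:root /opt/nomad/data
$ sudo chmod 0700 /opt/nomad/data
```
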
[`leave_on_interrupt`]: /nomad/docs/configuration#leave_on_interrupt
[`leave_on_terminate`]: /nomad/docs/configuration#leave_on_terminate
[`server force-leave` command]: /nomad/docs/commands/server/force-leave
[consensus]: /nomad/docs/concepts/consensus
[`drain_on_shutdown`]: /nomad/docs/configuration/client#drain_on_shutdown
[reload its configuration]: /nomad/docs/configuration#configuration-reload
[metrics]: /nomad/docs/operations/metrics-reference