---
layout: docs
page_title: Nomad Agent
sidebar_title: Operating Nomad Agents
description: |-
  The Nomad agent is a long running process which can be used either in
  a client or server mode.
---

# Operating Nomad Agents

The Nomad agent is a long-running process which runs on every machine that
is part of the Nomad cluster. The behavior of the agent depends on whether
it is running in client or server mode. Clients are responsible for running
tasks, while servers are responsible for managing the cluster.

Client mode agents are relatively simple. They use fingerprinting to
determine the capabilities and resources of the host machine, as well as
which [drivers](/docs/drivers) are available. Clients register with servers
to provide node information, heartbeat to provide liveness, and run any
tasks assigned to them.

Servers take on the responsibility of participating in the
[consensus protocol](/docs/internals/consensus) and [gossip protocol](/docs/internals/gossip).
The consensus protocol, powered by Raft, allows the servers to perform
leader election and state replication. The gossip protocol allows for simple
clustering of servers and multi-region federation. Because of this higher
burden, server nodes should usually run on dedicated instances; they are
more resource intensive than client nodes.

Client nodes make up the majority of the cluster and are very lightweight,
as they interface with the server nodes and maintain very little state of
their own. Each cluster usually has three or five server mode agents and
potentially thousands of clients.

## Running an Agent

The agent is started with the [`nomad agent` command](/docs/commands/agent).
This command blocks, running forever or until told to quit. The agent command
takes a variety of configuration options, but most have sane defaults.

When running `nomad agent`, you should see output similar to this:

```shell-session
$ nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:

       Client: true
    Log Level: INFO
       Region: global (DC: dc1)
       Server: true

==> Nomad agent started! Log data will stream in below:

    [INFO] serf: EventMemberJoin: server-1.node.global 127.0.0.1
    [INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
...
```

There are several important messages that `nomad agent` outputs:

- **Client**: This indicates whether the agent has enabled client mode.
  Client nodes fingerprint their host environment, register with servers,
  and run tasks.

- **Log Level**: This indicates the configured log level. Only messages with
  an equal or higher severity will be logged. This can be tuned to increase
  verbosity for debugging, or reduced to avoid noisy logging.

- **Region**: This is the region and datacenter in which the agent is
  configured to run. Nomad has first-class support for multi-datacenter and
  multi-region configurations. The `-region` and `-dc` flags can be used to
  set the region and datacenter. The default is the `global` region in `dc1`.

- **Server**: This indicates whether the agent has enabled server mode.
  Server nodes have the extra burden of participating in the consensus
  protocol, storing cluster state, and making scheduling decisions.
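
Each of these fields maps to an option in the agent's configuration file. As
a rough sketch, a configuration producing the values above might look like
the following (the values are illustrative, not required):

```hcl
# Sketch of an agent configuration matching the output above.
region     = "global"
datacenter = "dc1"
log_level  = "INFO"

# -dev enables both modes on one agent; outside of development,
# an agent usually runs in only one of the two.
server {
  enabled = true
}

client {
  enabled = true
}
```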

## Stopping an Agent

An agent can be stopped in two ways: gracefully or forcefully. By default,
any signal to an agent (interrupt, terminate, kill) will cause the agent to
forcefully stop. Graceful termination can be configured by setting
`leave_on_interrupt` or `leave_on_terminate` to respond to the respective
signals.
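
For example, a minimal sketch of a configuration that opts in to graceful
termination for both signals:

```hcl
# Have the agent gracefully leave the cluster on SIGINT or
# SIGTERM instead of stopping forcefully.
leave_on_interrupt = true
leave_on_terminate = true
```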

When gracefully exiting, clients will update their status to terminal on
the servers so that tasks can be migrated to healthy agents. Servers will
notify the cluster of their intention to leave, which allows them to leave
the [consensus](/docs/internals/consensus) peer set.

It is especially important that a server node be allowed to leave gracefully
so that there is minimal impact on availability as the server leaves the
consensus peer set. If a server does not gracefully leave, and will not
return into service, the [`server force-leave` command](/docs/commands/server/force-leave)
should be used to eject it from the consensus peer set.

## Lifecycle

Every agent in the Nomad cluster goes through a lifecycle. Understanding
this lifecycle is useful for building a mental model of an agent's
interactions with a cluster and how the cluster treats a node.

When a client agent is first started, it fingerprints the host machine to
identify its attributes, capabilities, and [task drivers](/docs/drivers).
These are reported to the servers during an initial registration. The
addresses of known servers are provided to the agent via configuration,
potentially using DNS for resolution. Using [Consul](https://www.consul.io)
provides a way to avoid hardcoding addresses and instead resolve them on
demand.
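
As a sketch, a client agent can be pointed at known servers through its
configuration; the hostnames below are placeholders, and 4647 is Nomad's
default RPC port:

```hcl
# Illustrative client configuration pointing at known servers.
# The addresses are placeholders; DNS names resolved on demand
# (or Consul-based discovery) work equally well.
client {
  enabled = true
  servers = ["nomad-server-1.example.com:4647", "nomad-server-2.example.com:4647"]
}
```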

While a client is running, it heartbeats with the servers to maintain
liveness. If the heartbeats fail, the servers assume the client node has
failed, stop assigning it new tasks, and migrate its existing tasks. It is
impossible to distinguish between a network failure and an agent crash, so
both cases are handled the same. Once the network recovers or a crashed
agent restarts, the node status will be updated and normal operation
resumed.

To prevent an accumulation of nodes in a terminal state, Nomad performs
periodic garbage collection of nodes. By default, if a node is in a failed
or 'down' state for over 24 hours, it will be garbage collected from the
system.
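
This window is tunable on the servers. A sketch, assuming the
`node_gc_threshold` server option, shown here at its default:

```hcl
server {
  enabled = true

  # How long a node may remain in a terminal state before it is
  # garbage collected. "24h" is the default.
  node_gc_threshold = "24h"
}
```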

Servers are slightly more complex, as they perform additional functions. They
participate in a [gossip protocol](/docs/internals/gossip) both to cluster
within a region and to support multi-region configurations. When a server is
first started, it does not know the addresses of the other servers in the
cluster. To discover its peers, it must _join_ the cluster. This is done with
the [`server join` command](/docs/commands/server/join) or by providing the
proper configuration on start. Once a node joins, this information is
gossiped to the entire cluster, meaning all nodes will eventually be aware
of each other.
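
A sketch of join configuration on a server, assuming the `server_join`
stanza with `retry_join`; the peer addresses and expected server count are
illustrative:

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  # Attempt to join these peers on start, retrying until successful.
  # The addresses are placeholders.
  server_join {
    retry_join = ["10.0.2.10", "10.0.2.11"]
  }
}
```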

When a server _leaves_, it specifies its intent to do so, and the cluster
marks that node as having _left_. If the server has _left_, replication to
it will stop and it is removed from the consensus peer set. If the server
has _failed_, replication will attempt to make progress to recover from a
software or network failure.

## Permissions

Nomad servers and Nomad clients have different requirements for permissions.

Nomad servers should be run with the lowest possible permissions. They need
access to their own data directory and the ability to bind to their ports.
You should create a `nomad` user with the minimal set of required privileges.

Nomad clients should be run as `root`, because the OS isolation mechanisms
they use require root privileges. While it is possible to run Nomad as an
unprivileged user, careful testing must be done to ensure the task drivers
and features you use function as expected. The Nomad client's data directory
should be owned by `root` with filesystem permissions set to `0700`.