open-nomad/website/content/docs/job-specification/group.mdx
Shantanu Gadgil 43d8baace0
heartbeat_grace is a server parameter (#13288)
`heartbeat_grace` is a `server` parameter, not a `client` parameter.
2022-06-08 10:49:23 -04:00

360 lines
13 KiB
Plaintext

---
layout: docs
page_title: group Stanza - Job Specification
description: |-
The "group" stanza defines a series of tasks that should be co-located on the
same Nomad client. Any task within a group will be placed on the same client.
---
# `group` Stanza
<Placement groups={['job', 'group']} />
The `group` stanza defines a series of tasks that should be co-located on the
same Nomad client. Any [task][] within a group will be placed on the same
client.
```hcl
job "docs" {
group "example" {
# ...
}
}
```
## `group` Parameters
- `constraint` <code>([Constraint][]: nil)</code> -
This can be provided multiple times to define additional constraints.
- `affinity` <code>([Affinity][]: nil)</code> - This can be provided
multiple times to define preferred placement criteria.
- `spread` <code>([Spread][spread]: nil)</code> - This can be provided
multiple times to define criteria for spreading allocations across a
node attribute or metadata. See the
[Nomad spread reference](/docs/job-specification/spread) for more details.
- `count` `(int)` - Specifies the number of instances that should be running
under for this group. This value must be non-negative. This defaults to the
`min` value specified in the [`scaling`](/docs/job-specification/scaling)
block, if present; otherwise, this defaults to `1`.
- `consul` <code>([Consul][consul]: nil)</code> - Specifies Consul configuration
options specific to the group.
- `ephemeral_disk` <code>([EphemeralDisk][]: nil)</code> - Specifies the
ephemeral disk requirements of the group. Ephemeral disks can be marked as
sticky and support live data migrations.
- `meta` <code>([Meta][]: nil)</code> - Specifies a key-value map that annotates
with user-defined metadata.
- `migrate` <code>([Migrate][]: nil)</code> - Specifies the group strategy for
migrating off of draining nodes. Only service jobs with a count greater than
1 support migrate stanzas.
- `network` <code>([Network][]: &lt;optional&gt;)</code> - Specifies the network
requirements and configuration, including static and dynamic port allocations,
for the group.
- `reschedule` <code>([Reschedule][]: nil)</code> - Allows to specify a
rescheduling strategy. Nomad will then attempt to schedule the task on another
node if any of the group allocation statuses become "failed".
- `restart` <code>([Restart][]: nil)</code> - Specifies the restart policy for
all tasks in this group. If omitted, a default policy exists for each job
type, which can be found in the [restart stanza documentation][restart].
- `service` <code>([Service][]: nil)</code> - Specifies integrations with
[Consul](/docs/configuration/consul) for service discovery.
Nomad automatically registers each service when an allocation
is started and de-registers them when the allocation is destroyed.
- `shutdown_delay` `(string: "0s")` - Specifies the duration to wait when
stopping a group's tasks. The delay occurs between Consul deregistration
and sending each task a shutdown signal. Ideally, services would fail
healthchecks once they receive a shutdown signal. Alternatively
`shutdown_delay` may be set to give in-flight requests time to complete
before shutting down. A group level `shutdown_delay` will run regardless
if there are any defined group services. In addition, tasks may have their
own [`shutdown_delay`](/docs/job-specification/task#shutdown_delay)
which waits between deregistering task services and stopping the task.
- `stop_after_client_disconnect` `(string: "")` - Specifies a duration after
which a Nomad client will stop allocations, if it cannot communicate with the
servers. By default, a client will not stop an allocation until explicitly
told to by a server. A client that fails to heartbeat to a server within the
[`heartbeat_grace`] window and any allocations running on it will be marked
"lost" and Nomad will schedule replacement allocations. The replaced
allocations will normally continue to run on the non-responsive client. But
you may want them to stop instead — for example, allocations requiring
exclusive access to an external resource. When specified, the Nomad client
will stop them after this duration.
The Nomad client process must be running for this to occur. This setting
cannot be used with [`max_client_disconnect`].
- `max_client_disconnect` `(string: "")` - Specifies a duration during which a
Nomad client will attempt to reconnect allocations after it fails to heartbeat
in the [`heartbeat_grace`] window. See [the example code
below][max-client-disconnect] for more details. This setting cannot be used
with [`stop_after_client_disconnect`].
- `task` <code>([Task][]: &lt;required&gt;)</code> - Specifies one or more tasks to run
within this group. This can be specified multiple times, to add a task as part
of the group.
- `update` <code>([Update][update]: nil)</code> - Specifies the task's update
strategy. When omitted, a default update strategy is applied.
- `vault` <code>([Vault][]: nil)</code> - Specifies the set of Vault policies
required by all tasks in this group. Overrides a `vault` block set at the
`job` level.
- `volume` <code>([Volume][]: nil)</code> - Specifies the volumes that are
required by tasks within the group.
### `consul` Parameters
- `namespace` `(string: "")` <EnterpriseAlert inline/> - The Consul namespace in which
group and task-level services within the group will be registered. Use of
`template` to access Consul KV will read from the specified Consul namespace.
Specifying `namespace` takes precedence over the [`-consul-namespace`][consul_namespace]
command line argument in `job run`.
## `group` Examples
The following examples only show the `group` stanzas. Remember that the
`group` stanza is only valid in the placements listed above.
### Specifying Count
This example specifies that 5 instances of the tasks within this group should be
running:
```hcl
group "example" {
count = 5
}
```
### Tasks with Constraint
This example shows two abbreviated tasks with a constraint on the group. This
will restrict the tasks to 64-bit operating systems.
```hcl
group "example" {
constraint {
attribute = "${attr.cpu.arch}"
value = "amd64"
}
task "cache" {
# ...
}
task "server" {
# ...
}
}
```
### Metadata
This example show arbitrary user-defined metadata on the group:
```hcl
group "example" {
meta {
my-key = "my-value"
}
}
```
### Network
This example shows network constraints as specified in the [network][] stanza
which uses the `bridge` networking mode, dynamically allocates two ports, and
statically allocates one port:
```hcl
group "example" {
network {
mode = "bridge"
port "http" {}
port "https" {}
port "lb" {
static = "8889"
}
}
}
```
### Service Discovery
This example creates a service in Consul. To read more about service discovery
in Nomad, please see the [Nomad service discovery documentation][service_discovery].
```hcl
group "example" {
network {
port "api" {}
}
service {
name = "example"
port = "api"
tags = ["default"]
check {
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
task "api" { ... }
}
```
### Stop After Client Disconnect
This example shows how `stop_after_client_disconnect` interacts with
other stanzas. For the `first` group, after the default 10 second
[`heartbeat_grace`] window expires and 90 more seconds passes, the
server will reschedule the allocation. The client will wait 90 seconds
before sending a stop signal (`SIGTERM`) to the `first-task`
task. After 15 more seconds because of the task's `kill_timeout`, the
client will send `SIGKILL`. The `second` group does not have
`stop_after_client_disconnect`, so the server will reschedule the
allocation after the 10 second [`heartbeat_grace`] expires. It will
not be stopped on the client, regardless of how long the client is out
of touch.
Note that if the server's clocks are not closely synchronized with
each other, the server may reschedule the group before the client has
stopped the allocation. Operators should ensure that clock drift
between servers is as small as possible.
Note also that a group using this feature will be stopped on the
client if the Nomad server cluster fails, since the client will be
unable to contact any server in that case. Groups opting in to this
feature are therefore exposed to an additional runtime dependency and
potential point of failure.
```hcl
group "first" {
stop_after_client_disconnect = "90s"
task "first-task" {
kill_timeout = "15s"
}
}
group "second" {
task "second-task" {
kill_timeout = "5s"
}
}
```
### Max Client Disconnect
`max_client_disconnect` specifies a duration during which a Nomad client will
attempt to reconnect allocations after it fails to heartbeat in the
[`heartbeat_grace`] window.
By default, allocations running on a client that fails to heartbeat will be
marked "lost". When a client reconnects, its allocations, which may still be
healthy, will restart because they have been marked "lost". This can cause
issues with stateful tasks or tasks with long restart times.
Instead, an operator may desire that these allocations reconnect without a
restart. When `max_client_disconnect` is specified, the Nomad server will mark
clients that fail to heartbeat as "disconnected" rather than "down", and will
mark allocations on a disconnected client as "unknown" rather than "lost". These
allocations may continue to run on the disconnected client. Replacement
allocations will be scheduled according to the allocations' reschedule policy
until the disconnected client reconnects. Once a disconnected client reconnects,
Nomad will compare the "unknown" allocations with their replacements and keep
the one with the best node score. If the `max_client_disconnect` duration
expires before the client reconnects, the allocations will be marked "lost".
Clients that contain "unknown" allocations will transition to "disconnected"
rather than "down" until the last `max_client_disconnect` duration has expired.
In the example code below, if both of these task groups were placed on the same
client and that client experienced a network outage, both of the group's
allocations would be marked as "disconnected" at two minutes because of the
client's `heartbeat_grace` value of "2m". If the network outage continued for
eight hours, and the client continued to fail to heartbeat, the client would
remain in a "disconnected" state, as the first group's `max_client_disconnect`
is twelve hours. Once all groups' `max_client_disconnect` durations are
exceeded, in this case in twelve hours, the client node will be marked as "down"
and the allocation will be marked as "lost". If the client had reconnected
before twelve hours had passed, the allocations would gracefully reconnect
without a restart.
Max Client Disconnect is useful for edge deployments, or scenarios when
operators want zero on-client downtime due to node connectivity issues. This
setting cannot be used with [`stop_after_client_disconnect`].
```hcl
# server_config.hcl
server {
enabled = true
heartbeat_grace = "2m"
}
```
```hcl
# jobspec.nomad
group "first" {
max_client_disconnect = "12h"
task "first-task" {
...
}
}
group "second" {
max_client_disconnect = "6h"
task "second-task" {
...
}
}
```
~> **Note:** The `max_client_disconnect` feature is only supported on Nomad
version 1.3.0 and above. If you run a job with `max_client_disconnect` on servers
where some servers are not upgraded to 1.3.0, the `max_client_disconnect`
flag will be _ignored_. Deploying a job with `max_client_disconnect` to a
`datacenter` of Nomad clients where all clients are not 1.3.0 or above is unsupported.
[task]: /docs/job-specification/task 'Nomad task Job Specification'
[job]: /docs/job-specification/job 'Nomad job Job Specification'
[constraint]: /docs/job-specification/constraint 'Nomad constraint Job Specification'
[consul]: /docs/job-specification/group#consul-parameters
[consul_namespace]: /docs/commands/job/run#consul-namespace
[spread]: /docs/job-specification/spread 'Nomad spread Job Specification'
[affinity]: /docs/job-specification/affinity 'Nomad affinity Job Specification'
[ephemeraldisk]: /docs/job-specification/ephemeral_disk 'Nomad ephemeral_disk Job Specification'
[`heartbeat_grace`]: /docs/configuration/server#heartbeat_grace
[`max_client_disconnect`]: /docs/job-specification/group#max_client_disconnect
[max-client-disconnect]: /docs/job-specification/group#max-client-disconnect 'the example code below'
[`stop_after_client_disconnect`]: /docs/job-specification/group#stop_after_client_disconnect
[meta]: /docs/job-specification/meta 'Nomad meta Job Specification'
[migrate]: /docs/job-specification/migrate 'Nomad migrate Job Specification'
[network]: /docs/job-specification/network 'Nomad network Job Specification'
[reschedule]: /docs/job-specification/reschedule 'Nomad reschedule Job Specification'
[restart]: /docs/job-specification/restart 'Nomad restart Job Specification'
[service]: /docs/job-specification/service 'Nomad service Job Specification'
[service_discovery]: /docs/integrations/consul-integration#service-discovery 'Nomad Service Discovery'
[update]: /docs/job-specification/update 'Nomad update Job Specification'
[vault]: /docs/job-specification/vault 'Nomad vault Job Specification'
[volume]: /docs/job-specification/volume 'Nomad volume Job Specification'