Backport of docs: expand documentation on node pools into release/1.6.x (#18222)

This pull request was automerged via backport-assistant
hc-github-team-nomad-core 2023-08-16 10:22:41 -05:00 committed by GitHub
parent 0a19fe3b60
commit dafef5b777
10 changed files with 554 additions and 8 deletions


@ -559,7 +559,7 @@ $ curl \
- `Type`: The type of job in terms of scheduling. It can have one of the following values:
- `service`: Allocations are intended to remain alive.
- `batch`: Allocations are intended to exit.
- `system`: Each client in the datacenter and node pool gets an allocation.
## Read Job Submission


@ -93,6 +93,18 @@ Nomad models infrastructure as regions and datacenters. A region will contain
one or more datacenters. A set of servers joined together will represent a
single region. Servers federate across regions to make Nomad globally aware.
In federated clusters one of the regions must be defined as the [authoritative
region](#authoritative-and-non-authoritative-regions).
#### Authoritative and Non-Authoritative Regions
The authoritative region is the region in a federated multi-region cluster that
holds the source of truth for entities replicated across regions, such as ACL
tokens, policies, roles, namespaces, and node pools.
All other regions are considered non-authoritative regions and replicate these
entities by pulling them from the authoritative region.
#### Datacenters
Nomad models a datacenter as an abstract grouping of clients within a
@ -101,6 +113,20 @@ servers they are joined with, but do need to be in the same
region. Datacenters provide a way to express fault tolerance among jobs as
well as isolation of infrastructure.
#### Node
A more generic term used to refer to machines running Nomad agents in client
mode. Despite being different concepts, you may find "node" being used
interchangeably with "client" in some materials and informal content.
#### Node Pool
Node pools group [nodes](#node) and can be used to restrict which
[jobs](#job) are able to place [allocations](#allocation) on a given set of
nodes. Example use cases for node pools include segmenting nodes by environment
(development, staging, production), by department (engineering, finance,
support), or by functionality (databases, ingress proxy, applications).
#### Bin Packing
Bin Packing is the process of filling bins with items in a way that maximizes
@ -169,7 +195,9 @@ are more details available for each of the sub-systems. The [consensus protocol]
[gossip protocol](/nomad/docs/concepts/gossip), and [scheduler design](/nomad/docs/concepts/scheduling/scheduling)
are all documented in more detail.
For other details, either consult the code, [open an issue on
GitHub][gh_issue], or ask a question in the [community forum][forum].
[`update`]: /nomad/docs/job-specification/update
[gh_issue]: https://github.com/hashicorp/nomad/issues/new/choose
[forum]: https://discuss.hashicorp.com/c/nomad


@ -0,0 +1,379 @@
---
layout: docs
page_title: Node Pools
description: Learn how node pools group clients and segment infrastructure into logical units that can be targeted by jobs.
---
# Node Pools
Node pools are a way to group clients and segment infrastructure into logical
units that can be targeted by jobs for stronger control over where allocations
are placed.
Without node pools, allocations for a job can be placed in any eligible client
in the cluster. Affinities and constraints can help express preferences for
certain nodes, but they do not easily prevent other jobs from placing
allocations in a set of nodes.
A node pool can be created using the [`nomad node pool apply`][cli_np_apply]
command and passing a node pool [specification file][np_spec].
```hcl
# dev-pool.nomad.hcl
node_pool "dev" {
description = "Nodes for the development environment."
meta {
environment = "dev"
owner = "sre"
}
}
```
```shell-session
$ nomad node pool apply dev-pool.nomad.hcl
Successfully applied node pool "dev"!
```
Clients can then be added to this node pool by setting the
[`node_pool`][client_np] attribute in their configuration file, or using the
equivalent [`-node-pool`][cli_agent_np] command line flag.
```hcl
client {
  # ...
  node_pool = "dev"
  # ...
}
```
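The same assignment can also be made when starting the agent directly. A
minimal sketch, assuming a configuration file named `client.hcl` that enables
client mode:

```shell-session
$ nomad agent -config=client.hcl -node-pool=dev
```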
To help streamline this process, nodes can create node pools on demand. If a
client configuration references a node pool that does not exist yet, Nomad
creates the node pool automatically on client registration.
<Note>
This behavior does not apply to clients in non-authoritative regions. Refer
to <a href="#multi-region-clusters">Multi-region Clusters</a> for more
information.
</Note>
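Once a client registers, the new pool appears in the cluster's list of node
pools. A quick check with the `nomad node pool list` command; the output below
is illustrative and also shows Nomad's built-in pools, which are described
later on this page:

```shell-session
$ nomad node pool list
Name     Description
all      Node pool with all nodes in the cluster.
default  Default node pool.
dev      Nodes for the development environment.
```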
Jobs can then reference node pools using the [`node_pool`][job_np] attribute.
```hcl
job "app-dev" {
# ...
node_pool = "dev"
# ...
}
```
Similarly to the `namespace` attribute, the node pool must exist beforehand,
otherwise the job registration results in an error. Only nodes in the given
node pool are considered for placement. If none are available the deployment
is kept as pending until a client is added to the node pool.
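To verify which clients belong to a given pool, list its nodes with the
`nomad node pool nodes` command. A sketch, using the `dev` pool created above:

```shell-session
$ nomad node pool nodes dev
```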
## Multi-region Clusters
In federated multi-region clusters, node pools are automatically replicated
from the authoritative region to all non-authoritative regions, and requests to
create or modify a node pool are forwarded from non-authoritative regions to
the authoritative region.
Since the replication data only flows in one direction, clients in
non-authoritative regions are not able to create node pools on demand.
A client in a non-authoritative region that references a node pool that does
not exist yet is kept in the `initializing` status until the node pool is
created and replicated to all regions.
## Built-in Node Pools
In addition to user-generated node pools, Nomad automatically creates two
built-in node pools that cannot be deleted or modified.
- `default`: Node pools are an optional feature of Nomad. The `node_pool`
attribute in both the client configuration and job files is optional. When
not specified, the `default` built-in node pool is used, as illustrated in the
sketch after this list.
- `all`: In some situations, it is useful to be able to run a job across all
clients in a cluster, regardless of their node pool configuration. For these
scenarios the job may use the built-in `all` node pool which always includes
all clients registered in the cluster. Unlike other node pools, the `all`
node pool can only be used in jobs and not in client configuration.
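A minimal sketch of both cases, with hypothetical job names:

```hcl
# No node_pool is set, so this job uses the built-in "default" pool.
job "app-default" {
  # ...
}
```

```hcl
# The built-in "all" pool makes every client in the cluster eligible.
job "cluster-wide" {
  node_pool = "all"

  # ...
}
```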
## Nomad Enterprise <EnterpriseAlert inline />
Nomad Enterprise provides additional features that make node pools more
powerful and easier to manage.
### Scheduler Configuration
Node pools in Nomad Enterprise are able to customize some aspects of the Nomad
scheduler and override certain global configuration per node pool.
This allows experimenting with functionalities such as memory
oversubscription in isolation, or adjusting the scheduler algorithm between
`spread` and `binpack` depending on the types of workload being deployed in a
given set of clients.
When using the built-in `all` node pool the global scheduler configuration is
applied.
Refer to the [`scheduler_config`][np_spec_scheduler_config] parameter in the
node pool specification for more information.
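As a sketch of what this might look like in a node pool specification file,
with a hypothetical pool name and both parameters requiring Nomad Enterprise:

```hcl
node_pool "dev-experiments" {
  description = "Nodes used to test memory oversubscription."

  # Overrides the global scheduler configuration for this pool only.
  scheduler_config {
    scheduler_algorithm             = "spread"
    memory_oversubscription_enabled = true
  }
}
```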
### Node Pool Governance
Node pools and namespaces share some similarities, with both providing a way to
group resources in isolated logical units. Jobs are grouped into namespaces and
clients into node pools.
Node Pool Governance allows assigning a default node pool to a namespace that
is automatically used by every job registered to the namespace. This feature
simplifies job management as the node pool is inferred from the namespace
configuration instead of having to be specified in every job.
This connection is done using the [`default`][ns_spec_np_default] attribute in
the namespace `node_pool_config` block.
```hcl
namespace "dev" {
description = "Jobs for the development environment."
node_pool_config {
default = "dev"
}
}
```
Now any job in the `dev` namespace only places allocations in nodes in the
`dev` node pool, and so the `node_pool` attribute may be omitted from the job
specification.
```hcl
job "app-dev" {
# The "dev" node pool will be used because it is the
# namespace's default node pool.
namespace = "dev"
# ...
}
```
Jobs are able to override the namespace default node pool by specifying a
different `node_pool` value.
The namespace can control whether this is allowed, or limit which node pools
can and cannot be used, with the [`allowed`][ns_spec_np_allowed] and
[`denied`][ns_spec_np_denied] parameters.
```hcl
namespace "dev" {
description = "Jobs for the development environment."
node_pool_config {
default = "dev"
denied = ["prod", "qa"]
}
}
```
```hcl
job "app-dev" {
namespace = "dev"
# Jobs in the "dev" namespace are not allowed to use the
# "prod" node pool and so this job will fail to register.
node_pool = "prod"
# ...
}
```
### Multi-region Jobs
Multi-region jobs can specify different node pools to be used in each region by
overriding the top-level `node_pool` job value, or the namespace `default` node
pool, in each `region` block.
```hcl
job "multiregion" {
node_pool = "dev"
multiregion {
# This region will use the top-level "dev" node pool.
region "north" {}
# While the regions bellow will use their own specific node pool.
region "east" {
node_pool = "dev-east"
}
region "west" {
node_pool = "dev-west"
}
}
# ...
}
```
## Node Pool Patterns
The sections below describe some node pool patterns that can be used to achieve
specific goals.
### Infrastructure and System Jobs
This pattern illustrates an example where node pools are used to reserve nodes
for a specific set of jobs while also allowing system jobs to cross node pool
boundaries.
It is common for Nomad clusters to have certain jobs that are focused on
providing the underlying infrastructure for more business-focused applications.
Some examples include reverse proxies for traffic ingress, CSI plugins, and
periodic maintenance jobs.
These jobs can be isolated in their own namespace but they may have different
scheduling requirements.
Reverse proxies, and only reverse proxies, may need to run in clients that are
exposed to public traffic, and CSI controller plugins may require clients to
have high-privileged access to cloud resources and APIs.
Other jobs, like CSI node plugins and periodic maintenance jobs, may need to
run as `system` jobs in all clients of the cluster.
Node pools can be used to achieve the isolation required by the first set of
jobs, and the built-in `all` node pool can be used for the jobs that must run
in every client. To keep them organized, all jobs are registered in the same
`infra` namespace.
```hcl
job "ingress-proxy" {
namespace = "infra"
node_pool = "ingress"
# ...
}
```
```hcl
job "csi-controller" {
namespace = "infra"
node_pool = "csi-controllers"
# ...
}
```
```hcl
job "csi-nodes" {
namespace = "infra"
node_pool = "all"
# ...
}
```
```hcl
job "maintenance" {
type = "batch"
namespace = "infra"
node_pool = "all"
periodic { /* ... */ }
# ...
}
```
Use positive and negative constraints to fine-tune placements when targeting
the built-in `all` node pool.
```hcl
job "maintenance-linux" {
type = "batch"
namespace = "infra"
node_pool = "all"
constraint {
attribute = "${attr.kernel.name}"
value = "linux"
}
constraint {
attribute = "${node.pool}"
operator = "!="
value = "ingress"
}
periodic { /* ... */ }
# ...
}
```
With Nomad Enterprise and Node Pool Governance, the `infra` namespace can be
configured to use a specific node pool by default and to allow only the node
pools required.
```hcl
namespace "infra" {
description = "Infrastructure jobs."
node_pool_config {
default = "infra"
allowed = ["ingress", "csi-controllers", "all"]
}
}
```
### Mixed Scheduling Algorithms
This pattern illustrates an example where different scheduling algorithms are
used per node pool.
Each of the scheduling algorithms provided by Nomad is best suited for
different types of environments and workloads.
The `binpack` algorithm aims to maximize resource usage and pack as much
workload as possible in the given set of clients. This makes it ideal for
cloud environments where infrastructure is billed by the hour and can be
quickly scaled in and out. By maximizing workload density a cluster running in
cloud instances can reduce the number of clients needed to run everything that
is necessary.
The `spread` algorithm behaves in the opposite direction, making use of every
client available to reduce density and potential noisy neighbors and resource
contention. This makes it ideal for environments where clients are
pre-provisioned and scale more slowly, such as on-premises deployments.
Clusters in a mixed environment can use node pools to adjust the scheduler
algorithm per node type. Cloud instances may be placed in a node pool that uses
the `binpack` algorithm while bare-metal nodes are placed in a node pool
configured to use `spread`.
```hcl
node_pool "cloud" {
# ...
scheduler_config {
scheduler_algorithm = "binpack"
}
}
```
```hcl
node_pool "on-prem" {
# ...
scheduler_config {
scheduler_algorithm = "spread"
}
}
```
Another scenario where mixing algorithms may be useful is to separate workloads
that are more sensitive to noisy neighbors (and thus use the `spread`
algorithm), from those that are able to be packed more tightly (`binpack`).
[cli_np_apply]: /nomad/docs/commands/node-pool/apply
[cli_agent_np]: /nomad/docs/commands/agent#node-pool
[client_np]: /nomad/docs/configuration/client#node_pool
[job_np]: /nomad/docs/job-specification/job#node_pool
[np_spec]: /nomad/docs/other-specifications/node-pool
[np_spec_scheduler_config]: /nomad/docs/other-specifications/node-pool#scheduler_config-parameters
[ns_spec_np_allowed]: /nomad/docs/other-specifications/namespace#allowed
[ns_spec_np_default]: /nomad/docs/other-specifications/namespace#default
[ns_spec_np_denied]: /nomad/docs/other-specifications/namespace#denied


@ -13,6 +13,7 @@ both [Omega: flexible, scalable schedulers for large compute clusters][omega] an
for implementation details on scheduling in Nomad.
- [Scheduling Internals](/nomad/docs/concepts/scheduling/scheduling) - An overview of how the scheduler works.
- [Placement](/nomad/docs/concepts/scheduling/placement) - Explains how placements are computed and how they can be adjusted.
- [Preemption](/nomad/docs/concepts/scheduling/preemption) - Details of preemption, an advanced scheduler feature introduced in Nomad 0.9.
[omega]: https://research.google.com/pubs/pub41684.html


@ -0,0 +1,122 @@
---
layout: docs
page_title: Placement
description: Learn about how placements are computed in Nomad.
---
# Placement
When the Nomad scheduler receives a job registration request, it needs to
determine which clients will run allocations for the job.
This process is called allocation placement, and understanding it helps you
achieve important goals for your applications, such as high availability and
resilience.
By default, all nodes are considered for placement, but this process can be
adjusted via agent and job configuration.
There are several options that can be used depending on the desired outcome.
### Affinities and Constraints
Affinities and constraints allow users to define soft or hard requirements for
their jobs. The [`affinity`][job_affinity] block specifies a soft requirement
on certain node properties: allocations for the job prefer some nodes, but may
be placed elsewhere if the rules cannot be matched. The
[`constraint`][job_constraint] block creates hard requirements, and
allocations can only be placed on nodes that match these rules. Job placement
fails if a constraint cannot be satisfied.
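For example, a job might require Linux nodes while only preferring a
particular AWS instance type. A sketch, with the job name and values being
illustrative:

```hcl
job "docs-placement" {
  # Hard requirement: only Linux nodes are feasible.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # Soft preference: favor a given instance type when available.
  affinity {
    attribute = "${attr.platform.aws.instance-type}"
    value     = "m5.xlarge"
    weight    = 50
  }

  # ...
}
```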
These rules can reference intrinsic node characteristics, which are called
[node attributes][] and are automatically detected by Nomad, static values
defined in the agent configuration file by cluster administrators, or dynamic
values defined after the agent starts.
One restriction of using affinities and constraints is that they only express
relationships from jobs to nodes, so it is not possible to use them to restrict
a node to only receive allocations for specific jobs.
Use affinities and constraints when some jobs have certain node preferences or
requirements but it is acceptable to have other jobs sharing the same nodes.
The sections below describe the node values that can be configured and used in
job affinity and constraint rules.
#### Node Class
Node class is an arbitrary value that can be used to group nodes based on some
characteristics, like the instance size or the presence of fast hard drives,
and is specified in the client configuration file using the
[`node_class`][config_client_node_class] parameter.
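A brief sketch, with `high-memory` being an arbitrary class name, showing the
client configuration and a job constraint that targets it:

```hcl
client {
  enabled    = true
  node_class = "high-memory"
}
```

```hcl
constraint {
  attribute = "${node.class}"
  value     = "high-memory"
}
```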
#### Dynamic and Static Node Metadata
Node metadata are arbitrary key-value mappings specified either in the client
configuration file using the [`meta`][config_client_meta] parameter or
dynamically via the [`nomad node meta`][cli_node_meta] command and the
[`/v1/client/metadata`][api_client_metadata] API endpoint.
There are no preconceived use cases for metadata values, and each team may
choose to use them in different ways. Some examples of static metadata include
resource ownership, such as `owner = "team-qa"`, or fine-grained locality,
`rack = "3"`. Dynamic metadata may be used to track runtime information, such
as jobs running in a given client.
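As a sketch, static metadata is set in the client configuration, while dynamic
metadata can be updated at runtime; the node ID and keys below are
illustrative:

```hcl
client {
  meta {
    owner = "team-qa"
    rack  = "3"
  }
}
```

```shell-session
$ nomad node meta apply -node-id 31c5347f example_key=example_value
```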
### Datacenter
A datacenter represents a geographical location in a region that can be used
for fault tolerance and infrastructure isolation.
A node's datacenter is defined in the agent configuration file using the
[`datacenter`][config_datacenter] parameter. Unlike affinities and
constraints, datacenters are opt-in at the job level: a job only places
allocations in the datacenters it targets, and, more importantly, only jobs
targeting a given datacenter are allowed to place allocations on its nodes.
Given the strong connotation of a geographical location, use datacenters to
represent where a node resides rather than its intended use. The
[`spread`][job_spread] block can help achieve fault tolerance across
datacenters.
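For instance, a job can opt in to multiple datacenters and spread its
allocations across them. A sketch with hypothetical datacenter names:

```hcl
job "docs-spread" {
  datacenters = ["us-east-1a", "us-east-1b"]

  # Spread allocations evenly across the targeted datacenters.
  spread {
    attribute = "${node.datacenter}"
  }

  # ...
}
```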
### Node Pool
Node pools allow grouping nodes that can be targeted by jobs to achieve
workload isolation.
Similarly to datacenters, node pools are configured in an agent configuration
file using the [`node_pool`][config_client_node_pool] attribute, and are opt-in
on jobs, allowing restricted use of certain nodes by specific jobs without
extra configuration.
But unlike datacenters, node pools don't carry a preconceived meaning and can
be used for several purposes, such as segmenting infrastructure per environment
(development, staging, production), by department (engineering, finance,
support), or by functionality (databases, ingress proxy, applications).
Node pools are also a first-class concept and can hold additional [metadata and
configuration][spec_node_pool].
Use node pools when there is a need to restrict and reserve certain nodes for
specific workloads, or when you need to adjust specific [scheduler
configuration][spec_node_pool_sched_config] values.
Nomad Enterprise also allows associating a node pool to a namespace to
facilitate managing the relationships between jobs, namespaces, and node pools.
Refer to the [Node Pools][concept_np] concept page for more information.
[api_client_metadata]: /nomad/api-docs/client#update-node-metadata
[cli_node_meta]: /nomad/docs/commands/node/meta
[concept_np]: /nomad/docs/concepts/node-pools
[config_client_meta]: /nomad/docs/configuration/client#meta
[config_client_node_class]: /nomad/docs/configuration/client#node_class
[config_client_node_pool]: /nomad/docs/configuration/client#node_pool
[config_datacenter]: /nomad/docs/configuration#datacenter
[job_affinity]: /nomad/docs/job-specification/affinity
[job_constraint]: /nomad/docs/job-specification/constraint
[job_spread]: /nomad/docs/job-specification/spread
[node attributes]: /nomad/docs/runtime/interpolation#node-attributes
[spec_node_pool]: /nomad/docs/other-specifications/node-pool
[spec_node_pool_sched_config]: /nomad/docs/other-specifications/node-pool#scheduler_config-parameters


@ -52,8 +52,9 @@ and existing allocations may need to be updated, migrated, or stopped.
Placing allocations is split into two distinct phases, feasibility checking and
ranking. In the first phase the scheduler finds nodes that are feasible by
filtering nodes in datacenters and node pools not used by the job, unhealthy
nodes, those missing necessary drivers, and those failing the specified
constraints.
The second phase is ranking, where the scheduler scores feasible nodes to find
the best fit. Scoring is primarily based on bin packing, which is used to


@ -147,6 +147,9 @@ The name of a region must match the name of one of the [federated regions].
datacenters in the region which are eligible for task placement. If not
provided, the `datacenters` field of the job will be used.
- `node_pool` `(string: <optional>)` - The node pool to be used in this region.
It overrides the job-level `node_pool` and the namespace default node pool.
- `meta` - `Meta: nil` - The meta block allows for user-defined arbitrary
key-value pairs. The meta specified for each region will be merged with the
meta block at the job level.


@ -56,8 +56,8 @@ Successfully applied node pool "example"!
the node pool.
- `meta` `(map[string]string: <optional>)` - Sets optional metadata on the node
pool, defined as key-value pairs. The scheduler does not use node pool
metadata as part of scheduling.
- `scheduler_config` <code>([SchedulerConfig][sched-config]: nil)</code> <EnterpriseAlert inline /> -
Sets scheduler configuration options specific to the node pool. If not


@ -60,6 +60,10 @@ System jobs are intended to run until explicitly stopped either by an operator
or [preemption]. If a system task exits it is considered a failure and handled
according to the job's [restart] block; system jobs do not have rescheduling.
When used with node pools, system jobs run on all nodes of the pool used by the
job. The built-in node pool `all` allows placing allocations on all clients in
the cluster.
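A minimal sketch of a system job that targets the built-in `all` pool, with a
hypothetical job name:

```hcl
job "monitoring-agent" {
  type      = "system"
  node_pool = "all"

  # ...
}
```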
## System Batch
The `sysbatch` scheduler is used to register jobs that should be run to completion
@ -80,7 +84,7 @@ Sysbatch jobs are intended to run until successful completion, explicitly stopped
by an operator, or evicted through [preemption]. Sysbatch tasks that exit with an
error are handled according to the job's [restart] block.
Like the `batch` scheduler, task groups in system batch jobs may have a `count`
greater than 1 to control how many instances are run. Instances that cannot be
immediately placed will be scheduled when resources become available,
potentially on a node that has already run another instance of the same job.


@ -129,6 +129,10 @@
"title": "Concepts",
"path": "concepts/scheduling/scheduling"
},
{
"title": "Placement",
"path": "concepts/scheduling/placement"
},
{
"title": "Preemption",
"path": "concepts/scheduling/preemption"
@ -147,6 +151,10 @@
"title": "Gossip Protocol",
"path": "concepts/gossip"
},
{
"title": "Node Pools",
"path": "concepts/node-pools"
},
{
"title": "Security Model",
"path": "concepts/security"