Commit graph

498 commits

Author SHA1 Message Date
Piotr Kazmierczak 7077d1f9aa
template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00
Piotr Kazmierczak 077b6e7098
docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212)
This PR documents a change made in the enterprise version of nomad that addresses the following issue:

When a user tries to filter audit logs, they do so with a stanza that looks like the following:

audit {
  enabled = true

  filter "remove deletes" {
    type = "HTTPEvent"
    endpoints  = ["*"]
    stages = ["OperationComplete"]
    operations = ["DELETE"]
  }
}

When specifying both an "endpoint" and a "stage", the events with both matching a "endpoint" AND a matching "stage" will be filtered.

When specifying both an "endpoint" and an "operation" the events with both matching a "endpoint" AND a matching "operation" will be filtered.

When specifying both a "stage" and an "operation" the events with a matching a "stage" OR a matching "operation" will be filtered.

The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".
2022-08-24 16:31:49 +02:00
Tim Gross afb9fe6a4e
docs: fix an anchor link in secure vars docs (#14231) 2022-08-23 10:46:24 -04:00
Seth Hoenig b5427a9f3b
Merge pull request #14215 from hashicorp/docs-update-checks-for-nsd
docs: update check documentation with NSD specifics
2022-08-23 09:23:53 -05:00
Seth Hoenig fb82f11e70
docs: fix checks doc typo
Co-authored-by: Piotr Kazmierczak <phk@mm.st>
2022-08-23 09:23:36 -05:00
Tim Gross bf57d76ec7
allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
Luiz Aoqui dbffdca92e
template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
Seth Hoenig ea6d010790 docs: update check documentation with NSD specifics
This PR updates the checks documentation to mention support for checks
when using the Nomad service provider. There are limitations of NSD
compared to Consul, and those configuration options are now noted as
being Consul-only.
2022-08-22 10:50:26 -05:00
Tim Gross a4e89d72a8
secure vars: filter by path in List RPCs (#14036)
The List RPCs only checked the ACL for the Prefix argument of the request. Add
an ACL filter to the paginator for the List RPC.

Extend test coverage of ACLs in the List RPC and in the `acl` package, and add a
"deny" capability so that operators can deny specific paths or prefixes below an
allowed path.
2022-08-15 11:38:20 -04:00
Seth Hoenig 394aebfbd9
Merge pull request #14088 from hashicorp/b-plan-vault-token
cli: support vault token in plan command
2022-08-12 09:05:34 -05:00
Seth Hoenig 1224fdf60d
Merge pull request #14089 from hashicorp/f-docker-disable-healthchecks
docker: configuration for disable docker healthcheck
2022-08-12 09:00:31 -05:00
James Rasell f6a5961a20
docs: correctly state RPC port is used by servers and clients. (#14091) 2022-08-12 10:14:14 +02:00
Seth Hoenig dc761aa7ec docker: create a docker task config setting for disable built-in healthcheck
This PR adds a docker driver task configuration setting for turning off
built-in HEALTHCHECK of a container.

References)
https://docs.docker.com/engine/reference/builder/#healthcheck
https://github.com/docker/engine-api/blob/master/types/container/config.go#L16

Closes #5310
Closes #14068
2022-08-11 10:33:48 -05:00
Seth Hoenig ba5c45ab93 cli: respect vault token in plan command
This PR fixes a regression where the 'job plan' command would not respect
a Vault token if set via --vault-token or $VAULT_TOKEN.

Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062

Fixes https://github.com/hashicorp/nomad/issues/13939
2022-08-11 08:54:08 -05:00
dgotlieb 7fbc8baaeb
doc typo fix
docker and podman don't suck 🤣
2022-08-10 15:04:07 +03:00
Charlie Voiselle 9a19279f59
Sweep of docs for repeated words; minor edits (#14032) 2022-08-05 16:45:30 -04:00
Luiz Aoqui 9affe31a0f
qemu: reduce monitor socket path (#13971)
The QEMU driver can take an optional `graceful_shutdown` configuration
which will create a Unix socket to send ACPI shutdown signal to the VM.

Unix sockets have a hard length limit and the driver implementation
assumed that QEMU versions 2.10.1 were able to handle longer paths. This
is not correct, the linked QEMU fix only changed the behaviour from
silently truncating longer socket paths to throwing an error.

By validating the socket path before starting the QEMU machine we can
provide users a more actionable and meaningful error message, and by
using a shorter socket file name we leave a bit more room for
user-defined values in the path, such as the task name.

The maximum length allowed is also platform-dependant, so validation
needs to be different for each OS.
2022-08-04 12:10:35 -04:00
Luiz Aoqui e3d78c343c
template: set default UID/GID to -1 (#13998)
UID/GID 0 is usually reserved for the root user/group. While Nomad
clients are expected to run as root it may not always be the case.

Setting these values as -1 if not defined will fallback to the pervious
behaviour of not attempting to set file ownership and use whatever
UID/GID the Nomad agent is running as. It will also keep backwards
compatibility, which is specially important for platforms where this
feature is not supported, like Windows.
2022-08-04 11:26:08 -04:00
Luiz Aoqui 8f05a55def
docs: remove link to HCL2 timestamp function (#13999)
The `timestamp` HCL2 function was never part of the set of supported
functions.
2022-08-04 10:07:51 -04:00
Derek Strickland 77df9c133b
Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
Piotr Kazmierczak 530280505f
client: enable specifying user/group permissions in the template stanza (#13755)
* Adds Uid/Gid parameters to template.

* Updated diff_test

* fixed order

* update jobspec and api

* removed obsolete code

* helper functions for jobspec parse test

* updated documentation

* adjusted API jobs test.

* propagate uid/gid setting to job_endpoint

* adjusted job_endpoint tests

* making uid/gid into pointers

* refactor

* updated documentation

* updated documentation

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update website/content/api-docs/json-jobs.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* propagating documentation change from Luiz

* formatting

* changelog entry

* changed changelog entry

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-08-02 22:15:38 +02:00
Tim Gross e025afdf87
docs: concepts for secure variables and workload identity (#13764)
Includes concept docs for secure variables, concept docs for workload
identity, and an operations docs for keyring management.
2022-08-02 10:06:26 -04:00
Eric Weber cbce13c1ac
Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919)
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
2022-08-02 09:42:44 -04:00
asymmetric b89718d70e
Update filesystem.mdx (#13738)
fix alloc working directory path
2022-07-25 10:25:48 -04:00
Scott Holodak 12ef89a61a
docs: fix placement for scaling and csi_plugin (#13892) 2022-07-25 10:06:59 -04:00
Michael Schurter 0d1c9a53a4
docs: clarify submit-job allows stopping (#13871) 2022-07-21 10:18:57 -07:00
Tim Gross 96aea74b4b
docs: keyring commands (#13690)
Document the secure variables keyring commands, document the aliased
gossip keyring commands, and note that the old gossip keyring commands
are deprecated.
2022-07-20 14:14:10 -04:00
Tim Gross 49ad3dc3ba
docs: document secure variables server config options (#13695) 2022-07-20 14:13:39 -04:00
Will Jordan 5354409b1a
Return 429 response on HTTP max connection limit (#13621)
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.

Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.

Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections
exceeding concurrency limit.
2022-07-20 14:12:21 -04:00
Niklas Hambüchen 422c83e97a
docs: job-specification: Explain that priority has no effect on run order (#13835)
Makes the issues from #9845 and #12792 less surprising to the user.
2022-07-19 08:55:29 -04:00
Andy Assareh e49c021792
word typo digestible (#13772) 2022-07-19 09:00:52 +02:00
Seth Hoenig 4dea14267d
Merge pull request #13813 from hashicorp/docs-move-checks
docs: move checks into own page
2022-07-18 12:27:43 -05:00
Seth Hoenig 4459312541 docs: move checks into own page
This PR creates a top-level 'check' page for job-specification docs.

The content for checks is about half the content of the service page, and
is about to increase in size when we add docs about Nomad service checks.
Seemed like a good idea to just split the checks section out into its own
thing (e.g. check_restart is already a topic).

Doing the move first lets us backport this change without adding Nomad service
check stuff yet.

Mostly just a lift-and-shift but with some tweaked examples to de-emphasize
the use of script checks.
2022-07-18 09:34:55 -05:00
Tim Gross 1e8978ca04
docs: ACL policy spec reference (#13787)
The "Secure Nomad with Access Control" guide provides a tutorial for
bootstrapping Nomad ACLs, writing policies, and creating tokens. Add a reference
guide just for the ACL policy specification.
2022-07-18 09:35:28 -04:00
Michael Schurter e97548b5f8
Improve metrics reference documentation (#13769)
* docs: tighten up parameterized job metrics docs

* docs: improve alloc status descriptions

Remove `nomad.client.allocations.start` as it doesn't exist.
2022-07-15 14:22:39 -07:00
Michael Schurter 5414f49821
docs: clarify blocked_evals metrics (#13751)
Related to #13740

- blocked_evals.total_blocked is the number of evals blocked for *any*
  reason
- blocked_evals.total_quota_limit is the number of evals blocked by
  quota limits, but critically: their resources are *not* counted in the
  cpu/memory
2022-07-14 11:32:33 -07:00
Seth Hoenig 3a32220b3b
Merge pull request #13716 from hashicorp/docs-update-consul-warning
docs: remove consul 1.12.0 warning
2022-07-14 08:45:56 -05:00
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Michael Schurter 3e50f72fad
core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Seth Hoenig a9fa48f3db docs: remove consul 1.12.0 warning 2022-07-12 09:53:17 -05:00
Tim Gross fc4cd53cfb
docs: rename Internals to Concepts (#13696) 2022-07-11 16:55:33 -04:00
Tim Gross d49ff0175c
docs: move operator subcommands under their own trees (#13677)
The sidebar navigation tree for the `operator` sub-sub commands is
getting cluttered and we have a new set of commands coming to support
secure variables keyring as well. Move these all under their own
subtrees.
2022-07-11 14:00:24 -04:00
Seth Hoenig ed2f2b1a75
docs: move upgrade docs for max_client_timeout
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-07 16:46:26 -05:00
Seth Hoenig 905e673553 docs: upgrade guide for client max_kill_timeout 2022-07-07 15:27:40 -05:00
Luiz Aoqui 03433dd8af
cli: improve output of eval commands (#13581)
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.

Include the eval namespace in all output formats and always include the
job ID in `eval status` since, even `node-update` evals are related to a
job.

Add Node ID to the evals table output to help differentiate
`node-update` evals.

Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 13:13:34 -04:00
Ted Behling 6a032a54d2
driver/docker: Don't pull InfraImage if it exists (#13265)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 17:44:06 +02:00
Seth Hoenig b9fe6c8d2c docs: fixup from cr comments 2022-07-07 08:37:10 -05:00
Seth Hoenig 1c31ef285e docs: add docs for simple load balancing nomad services
This PR adds a section to template docs for simple load balancing with nomad servicse.
2022-07-06 17:34:30 -05:00
James Rasell 0c0b028a59
core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell 181b247384
core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00