Commit Graph

429 Commits

Author SHA1 Message Date
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Michael Schurter 3e50f72fad
core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Phil Renaud 59c12fc758
Remove namespace cache (#13679) 2022-07-11 18:06:18 -04:00
Phil Renaud e9219a1ae0
Allow wildcard for Evaluations API (#13530)
* Failing test and TODO for wildcard

* Alias the namespace query parameter for Evals

* eval: fix list when using ACLs and * namespace

Apply the same verification process as in job, allocs and scaling
policy list endpoints to handle the eval list when using an ACL token
with limited namespace support but querying using the `*` wildcard
namespace.

* changelog: add entry for #13530

* ui: set namespace when querying eval

Evals have a unique UUID as ID, but when querying them the Nomad API
still expects a namespace query param, otherwise it assumes `default`.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-11 16:42:17 -04:00
Luiz Aoqui 674c0ae08b
changelog: add entry for #13659 (#13691) 2022-07-11 16:07:33 -04:00
Tim Gross b6dd1191b2
snapshot restore-from-archive streaming and filtering (#13658)
Stream snapshot to FSM when restoring from archive
The `RestoreFromArchive` helper decompresses the snapshot archive to a
temporary file before reading it into the FSM. For large snapshots
this performs a lot of disk IO. Stream decompress the snapshot as we
read it, without first writing to a temporary file.

Add bexpr filters to the `RestoreFromArchive` helper.
The operator can pass these as `-filter` arguments to `nomad operator
snapshot state` (and other commands in the future) to include only
desired data when reading the snapshot.
2022-07-11 10:48:00 -04:00
James Rasell 9eb63c9e03
cli: ensure node status and drain use correct cmd name. (#13656) 2022-07-11 09:50:42 +02:00
Seth Hoenig 239eaf9a29
Merge pull request #13626 from hashicorp/b-client-max-kill-timeout
client: enforce max_kill_timeout client configuration
2022-07-07 13:44:39 -05:00
Luiz Aoqui 85908415f9
state: fix eval list by prefix with * namespace (#13551) 2022-07-07 14:21:51 -04:00
Luiz Aoqui 03433dd8af
cli: improve output of eval commands (#13581)
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.

Include the eval namespace in all output formats and always include the
job ID in `eval status` since, even `node-update` evals are related to a
job.

Add Node ID to the evals table output to help differentiate
`node-update` evals.

Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 13:13:34 -04:00
Ted Behling 6a032a54d2
driver/docker: Don't pull InfraImage if it exists (#13265)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 17:44:06 +02:00
Michael Schurter f21272065d
core: emit node evals only for sys jobs in dc (#12955)
Whenever a node joins the cluster, either for the first time or after
being `down`, we emit a evaluation for every system job to ensure all
applicable system jobs are running on the node.

This patch adds an optimization to skip creating evaluations for system
jobs not in the current node's DC. While the scheduler performs the same
feasability check, skipping the creation of the evaluation altogether
saves disk, network, and memory.
2022-07-06 14:35:18 -07:00
Seth Hoenig 5dd8aa3e27 client: enforce max_kill_timeout client configuration
This PR fixes a bug where client configuration max_kill_timeout was
not being enforced. The feature was introduced in 9f44780 but seems
to have been removed during the major drivers refactoring.

We can make sure the value is enforced by pluming it through the DriverHandler,
which now uses the lesser of the task.killTimeout or client.maxKillTimeout.
Also updates Event.SetKillTimeout to require both the task.killTimeout and
client.maxKillTimeout so that we don't make the mistake of using the wrong
value - as it was being given only the task.killTimeout before.
2022-07-06 15:29:38 -05:00
Luiz Aoqui a9a66ad018
api: apply new ACL check for wildcard namespace (#13608)
api: apply new ACL check for wildcard namespace

In #13606 the ACL check was refactored to better support the all
namespaces wildcard (`*`). This commit applies the changes to the jobs
and alloc list endpoints.
2022-07-06 16:17:16 -04:00
Tim Gross 1fc8995590
query for leader in `operator debug` command (#13472)
The `operator debug` command doesn't output the leader anywhere in the
output, which adds extra burden to offline debugging (away from an
ongoing incident where you can simply check manually). Query the
`/v1/status/leader` API but degrade gracefully.
2022-07-06 10:57:44 -04:00
James Rasell 0c0b028a59
core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell 181b247384
core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Phil Renaud 84a59ff059
[ui] Fix a bug where redirects after planning/editing a job didn't include namespace (#13588)
* Job editing and planning handles namespace as part of ID instead of queryParam

* Changelog added

* Tests updated to reflect new namespace redirects
2022-07-05 15:58:56 -04:00
Seth Hoenig 97726c2fd8
Merge pull request #12862 from hashicorp/f-choose-services
api: enable selecting subset of services using rendezvous hashing
2022-06-30 15:17:40 -05:00
Seth Hoenig 0048c59f1a
cl: fixup changelog comment
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-06-30 15:10:47 -05:00
James Rasell 3ecffaf36b
deps: update `github.com/hashicorp/go-discover` to latest. (#13491) 2022-06-28 10:28:32 +02:00
James Rasell d080eed9ae
client: fixed a problem calculating a service namespace. (#13493)
When calculating a services namespace for registration, the code
assumed the first task within the task array would include a
service block. This is incorrect as it is possible only a latter
task within the array contains a service definition.

This change fixes the logic, so we correctly search for a service
definition before identifying the namespace.
2022-06-28 09:47:28 +02:00
Seth Hoenig 9467bc9eb3 api: enable selecting subset of services using rendezvous hashing
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.

The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).

Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.

Example,

curl 'localhost:4646/v1/service/redis?choose=2|abc123'

Note: consul-templte v0.29.1 includes the necessary nomadServices functionality.
2022-06-25 10:37:37 -05:00
Phil Renaud 2e6e95e78c
[ui] Reinstate Meta and Payload sections to Parameterized Child Jobs (#13473)
* Shift meta off job.definition and decodedPayload alias to passed arg

* Changelog
2022-06-24 15:03:08 -04:00
Seth Hoenig b7a8318eac
Merge pull request #13467 from hashicorp/f-purge-raft-v2
core: remove support for raft protocol version 2
2022-06-24 10:10:26 -05:00
Tim Gross 4368dcc02f
fix deadlock in plan_apply (#13407)
The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes the a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.

Increase `snapshotMinIndex` timeout to 10s.
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
2022-06-23 12:06:27 -04:00
Seth Hoenig 91e08d5e23 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
Derek Strickland 7d6a3df197
csi_hook: valid if any driver supports csi (#13446)
* csi_hook: valid if any driver supports csi volumes
2022-06-22 10:43:43 -04:00
Derek Strickland 9de4d7367c
cli: fix detach handling (#13405)
Fix detach handling for:

- `deployment fail`
- `deployment promote`
- `deployment resume`
- `deployment unblock`
- `job promote`
2022-06-21 06:01:23 -04:00
Jeffrey Clark a97699221c
cni: add loopback to linux bridge (#13428)
CNI changed how to bring up the interface in v0.2.0.
Support was moved to a new loopback plugin.

https://github.com/containernetworking/cni/pull/121

Fixes #10014
2022-06-20 11:22:53 -04:00
James Rasell f1f7c5040b
api: added sysbatch job type constant to match other schedulers. (#13359) 2022-06-16 11:53:04 +02:00
Joseph Martin 4aa96d5bfc
Return evalID if `-detach` flag is passed to job revert (#13364)
* Return evalID if `-detach` flag is passed to job revert
2022-06-15 14:20:29 -04:00
Tim Gross 12d87c040c
fixup changelog entry for backported regression fix (#13370)
The changelog entry for #13340 indicated it was an improvement. But on
discussion, it was determined that this was a workaround for a
regression. Update the changelog to make this clear.
2022-06-14 14:33:39 -04:00
Grant Griffiths 99896da443
CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340)
Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>
2022-06-14 10:04:16 -04:00
Daniel Rossbach 8c52c03c8c
qemu driver: Add option to configure drive_interface (#11864) 2022-06-10 10:03:51 -04:00
Luiz Aoqui e8b788b372
changelog: add entry for #12961 (#13318) 2022-06-10 09:04:00 -04:00
Tim Gross 9d5523a72d
CSI: skip node unpublish on GC'd or down nodes (#13301)
If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.

The `csi_hook` on the client will unpublish if the allocation has
stopped and if the host is terminated there's no mount for the volume
anyways. So we'll now assume that the node has unpublished at its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.
2022-06-09 11:33:22 -04:00
phreakocious 94a78597d2
Add `guest_agent` config option for QEMU driver (#12800)
Add boolean 'guest_agent' config option for QEMU driver, which will
create the socket file for the QEMU Guest Agent in the task dir when
enabled.
2022-06-09 09:21:38 -04:00
Derek Strickland 13ea5ae87a
consul-template: Add fault tolerant defaults (#13041)
consul-template: Add fault tolerant defaults

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-08 14:08:25 -04:00
Luiz Aoqui 2e0bffba90
changelog: add entry for #12925 (#13250) 2022-06-08 10:14:33 -04:00
Tim Gross 8ff5ea1bee
CSI: no early return when feasibility check fails on eligible nodes (#13274)
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
2022-06-07 13:31:10 -04:00
Derek Strickland 12f3ee46ea
alloc_runner: stop sidecar tasks last (#13055)
alloc_runner: stop sidecar tasks last
2022-06-07 11:35:19 -04:00
Tim Gross 81c70f4973
changelog entry for #12534 (#13260) 2022-06-06 16:19:17 -04:00
Conor Evans 86116a7607
add filebase64 function (#11791)
Signed-off-by: Conor Evans <coevans@tcd.ie>
2022-06-06 11:58:17 -04:00
Lance Haig 4bf27d743d
Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
Huan Wang 7d15157635
adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
Seth Hoenig 45e8748658
Merge pull request #13205 from hashicorp/b-batch-preempt2
core: reschedule evicted batch job when resources become available
2022-06-02 16:32:01 -05:00
Shantanu Gadgil 6cb8c95534
fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00
Seth Hoenig 0692190e12 core: reschedule evicted batch job when resources become available
This PR fixes a bug where an evicted batch job would not be rescheduled
once resources become available.

Closes #9890
2022-06-02 14:04:13 -05:00
Seth Hoenig 54efec5dfe docs: add docs and tests for tagged_addresses 2022-05-31 13:02:48 -05:00