In Consul 1.15.0, the Delete Token API was changed to return an error when
deleting a non-existent ACL token. This means that if Nomad successfully deletes
the token but fails to persist that fact, it will get stuck trying to delete a
non-existent token forever.
Update the token deletion function to ignore "not found" errors and treat them
as successful deletions.
Fixes: #17833
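A minimal sketch of the idea, using the standard Consul API client; the wrapper name and the exact "not found" error string are assumptions, not Nomad's actual code:

```go
package consul

import (
	"strings"

	"github.com/hashicorp/consul/api"
)

// deleteToken removes a Consul ACL token, treating "not found" errors as a
// successful deletion so that a token Nomad already deleted (but failed to
// record) is not retried forever.
func deleteToken(client *api.Client, accessorID string) error {
	_, err := client.ACL().TokenDelete(accessorID, nil)
	if err != nil && strings.Contains(err.Error(), "ACL not found") {
		return nil // token is already gone; treat as success
	}
	return err
}
```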
* cni: ensure CNI addresses are set up in deterministic order
As noted in the code comments, the go-cni library returns an unordered map of
interfaces. When multiple CNI interfaces are created, this causes a problem for
service registration and health checking because the first address in the map
is used.
The use case where this is an issue for us: we run CNI with the macvlan plugin
to isolate workloads, but they still need to reach the host on a static address
to perform local resolving and hit host services like the Consul agent API. To
make this work there are two options: either add a macvlan interface on the
host with an assigned address for each VLAN you have, or create an additional
veth bridged interface in the container namespace.
We chose the latter option through a custom CNI plugin but the ordering issue
leaves us with incorrect service registration.
* Updates after feedback
* First check for the CNIResult interfaces length; if it's zero, we don't need to proceed
at all.
* Use sorted interfaces list for the address fallback scenario as well.
* Remove "found" log message logic, when an address isn't found an error is returned stating
the allocation could not be configured as an address was missing from the CNIResult. If we
still need a Warn message then we can add it to the condition that returns the error if no
address could be found instead of using the "found" bool logic.
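A sketch of the deterministic ordering described above; the interface type here is a simplified stand-in for the go-cni result:

```go
package cni

import (
	"net"
	"sort"
)

// iface is a simplified stand-in for a CNI interface result.
type iface struct {
	IPs []net.IP
}

// sortedInterfaceNames returns the interface names from the CNI result in a
// deterministic (lexicographic) order, so the same address is always chosen
// for service registration and health checking.
func sortedInterfaceNames(interfaces map[string]*iface) []string {
	names := make([]string, 0, len(interfaces))
	for name := range interfaces {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```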
Service discovery or mesh network systems consuming the Nomad event stream or API need to know the CNI assigned IP for the allocation. This data is returned by the underlying Nomad API but isn't mapped in the response struct.
* Text and code wrapping as a localStorage var
* task-log uses wrapping and kb shortcut
* Word wrap keyboard labels
* Wrapper as a toggle not a button
* Changelog and fixed an extra space trailing log lines
* Moves toggle to inside
* Acceptance tests for ww and toggle click
* drivers/docker: refactor use of clients in docker driver
This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients
- one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short-lived / async calls to docker. The other
has no timeout, and it is the caller's responsibility to set a context
that will ensure the call eventually terminates.
The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.
This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.
Fixes #17023
* cr: followup items
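A rough sketch of the shape of the refactor; the accessor names are hypothetical and `*http.Client` stands in for the actual Docker client type:

```go
package docker

import (
	"net/http"
	"sync"
	"time"
)

// clients holds the two Docker clients used by the driver.
type clients struct {
	once sync.Once

	// shortLived has a hard-coded timeout applied to every call, no matter
	// what, and is intended for short-lived / async calls to Docker.
	shortLived *http.Client
	// longLived has no timeout; callers must supply a context that ensures
	// the call eventually terminates.
	longLived *http.Client
}

func (c *clients) init() {
	c.shortLived = &http.Client{Timeout: 2 * time.Minute}
	c.longLived = &http.Client{} // no timeout
}

// getClient returns the client for short, bounded calls to the Docker daemon.
func (c *clients) getClient() *http.Client {
	c.once.Do(c.init)
	return c.shortLived
}

// getStreamingClient returns the client for long-running calls such as logs,
// stats, and waiting on containers.
func (c *clients) getStreamingClient() *http.Client {
	c.once.Do(c.init)
	return c.longLived
}
```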
Given a deployment that has a `progress_deadline`, if a task group runs
out of reschedule attempts, allow it to fail at this time instead of
waiting until the `progress_deadline` is reached.
Fixes: #17260
* CSS alignment and spacing for job status panel
* Only fade the count, not the legend icon, when count is 0
* Unrounded version corners
* changelog
* css has to only remove border radius when count is present
* Seed stabilization for services test
* Try consolidating the testfixes from before
* Total test isolation and bonus logs
* Drop the isolation but keep the logs
* Remove bonus logging
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using
`image` filesystem isolation. As a result, more powerful tokens can be used
in a job definition, allowing it to use template stanzas to issue all kinds of
secrets (database secrets, Vault tokens with very specific policies, etc.),
without sharing that issuing power with the task itself.
This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.
If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
* client: do not disable memory swappiness if kernel does not support it
This PR adds a workaround for very old Linux kernels which do not support
the memory swappiness interface file. Normally we write a "0" to the file
to explicitly disable swap. In the case the kernel does not support it,
give libcontainer a nil value so it does not write anything.
Fixes #17448
* client: detect swappiness by writing to the file
* fixup changelog
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
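A sketch of the probe described above (detect support by writing to the file); the path constant is illustrative and libcontainer's `MemorySwappiness *uint64` field is the assumed destination:

```go
package cgroups

import "os"

// memorySwappinessPath is an illustrative cgroup v1 interface path; the real
// path is resolved per task cgroup.
const memorySwappinessPath = "/sys/fs/cgroup/memory/memory.swappiness"

// swappiness returns a pointer to 0 to explicitly disable swap, or nil when
// the kernel does not expose the swappiness interface, so libcontainer
// (whose MemorySwappiness field is a *uint64) writes nothing at all.
func swappiness() *uint64 {
	f, err := os.OpenFile(memorySwappinessPath, os.O_WRONLY, 0)
	if err != nil {
		return nil // kernel does not support memory swappiness
	}
	defer f.Close()

	// Probe by writing "0", which is the value we want anyway; if the write
	// fails the interface is unusable.
	if _, err := f.WriteString("0"); err != nil {
		return nil
	}
	zero := uint64(0)
	return &zero
}
```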
If the dynamic port range for a node is set so that the min is equal to the max,
there's only one port available and this passes config validation. But the
scheduler panics when it tries to pick a random port. Only add the randomness
when there's more than one to pick from.
Adds a test for the behavior but also adjusts the commentary on a couple of the
existing tests that made it seem like this case was already covered if you
didn't look too closely.
Fixes: #17585
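A minimal sketch of the guard, not the scheduler's actual bitmap-based selection:

```go
package scheduler

import "math/rand"

// pickDynamicPort selects a port from the inclusive range [min, max],
// only adding randomness when there is more than one port to pick from.
func pickDynamicPort(min, max int) int {
	if min == max {
		return min
	}
	return min + rand.Intn(max-min+1)
}
```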
* Fix DevicesSets being removed when cpusets are reloaded with cgroup v2
This meant that if any allocation was created or removed, all
active DevicesSets were removed from all cgroups of all tasks.
This was most noticeable with "exec" and "raw_exec", as it meant
they no longer had access to /dev files.
* e2e: add test for verifying cgroups do not interfere with access to devices
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
If the authoritative region has been upgraded to a version of Nomad that has new
replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the
non-authoritative regions will start replicating those objects as soon as their
leader is upgraded. If a server in the non-authoritative region is upgraded and
then becomes the leader before all the other servers in the region have been
upgraded, then it will attempt to write a Raft log entry that the followers
don't understand. The followers will then panic.
Add the same minimum version checks that we do for RPC writes to the leader's
replication loop.
This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, and the node comes back up. See the issue below for
full repro conditions.
Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning the existing containers and looking for the associated
allocID.
Fixes #17299
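A sketch of the scan; the container naming convention is an assumption, and the real lookup lists containers through the Docker API:

```go
package docker

import "strings"

// findPauseContainer scans container names for the pause container that
// belongs to an allocation, so it can be removed even after the client has
// lost its in-memory NetworkIsolationSpec (for example, after a reboot).
// The "nomad_init_" naming convention shown here is illustrative.
func findPauseContainer(containerNames []string, allocID string) (string, bool) {
	for _, name := range containerNames {
		if strings.Contains(name, "nomad_init_") && strings.Contains(name, allocID) {
			return name, true
		}
	}
	return "", false
}
```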
During shutdown of a client with drain_on_shutdown there is a race between
the Client ending the cgroup and the task's cpuset manager cleaning up
the cgroup. During the path traversal, skip anything we cannot read, which
avoids the nil DirEntry dereference that we currently hit.
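A sketch of the tolerant traversal using `filepath.WalkDir`; function names are illustrative:

```go
package cgroups

import (
	"io/fs"
	"path/filepath"
)

// walk traverses the cpuset cgroup tree, skipping anything that cannot be
// read. During client shutdown the cgroup may be removed concurrently, so an
// unreadable entry (or a nil DirEntry) is not treated as fatal.
func walk(root string, visit func(path string)) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d == nil {
			return nil // skip entries that disappeared or cannot be read
		}
		visit(path)
		return nil
	})
}
```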
The allocrunner sends several updates to the server during the early lifecycle
of an allocation and its tasks. Clients batch up allocation updates every 200ms,
but experiments like the C2M challenge have shown that even with this batching,
servers can be overwhelmed with client updates during high volume
deployments. Benchmarking done in #9451 has shown that client updates can easily
represent ~70% of all Nomad Raft traffic.
Each allocation sends many updates during its lifetime, but only those that
change the `ClientStatus` field are critical for progressing a deployment or
kicking off a reschedule to recover from failures.
Add a priority to the client allocation sync and update the `syncTicker`
receiver so that we only send an update if there's a high priority update
waiting, or on every 5th tick. This means when there are no high priority
updates, the client will send updates at most every 1s instead of
200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as
well as reduce client-to-server RPC traffic.
This changeset also switches from a channel-based collection of updates to a
shared buffer, so as to split batching from sending and prevent backpressure
onto the allocrunner when the RPC is slow. This doesn't have a major performance
benefit in the benchmarks but makes the implementation of the prioritized update
simpler.
Fixes: #9451
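A sketch of the prioritized sync loop; the function parameters stand in for the shared update buffer and the batching details are omitted:

```go
package client

import "time"

// syncInterval is how often the client wakes up to consider sending updates.
const syncInterval = 200 * time.Millisecond

// runAllocSync flushes batched allocation updates on a tick only when a
// high-priority update (a ClientStatus change) is waiting, or on every 5th
// tick (roughly once per second) otherwise.
func runAllocSync(stop <-chan struct{}, hasUrgent func() bool, flush func()) {
	ticker := time.NewTicker(syncInterval)
	defer ticker.Stop()

	ticks := 0
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			ticks++
			if hasUrgent() || ticks%5 == 0 {
				flush()
			}
		}
	}
}
```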
Consul v1.13.8 was released with a breaking change in the /v1/agent/self
endpoint, where the version was returned with a trailing line break.
This caused the Nomad fingerprint to fail because `NewVersion` errors on
parse.
This commit removes any extra space from the Consul version returned by
the API.
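A sketch of the fix, assuming `NewVersion` refers to hashicorp/go-version's parser:

```go
package fingerprint

import (
	"fmt"
	"strings"

	version "github.com/hashicorp/go-version"
)

// parseConsulVersion strips surrounding whitespace (including the line break
// returned by Consul v1.13.8's /v1/agent/self endpoint) before parsing.
func parseConsulVersion(raw string) (*version.Version, error) {
	v, err := version.NewVersion(strings.TrimSpace(raw))
	if err != nil {
		return nil, fmt.Errorf("failed to parse Consul version %q: %w", raw, err)
	}
	return v, nil
}
```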
* Add UnexpectedResultError to nomad/api
This allows users to perform additional status-based behavior by rehydrating the error using `errors.As` inside of consumers.
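An illustrative consumer-side sketch of the pattern this enables; the error type's name and fields here are stand-ins rather than the exact nomad/api definition:

```go
package api

import (
	"errors"
	"fmt"
	"net/http"
)

// unexpectedResultError is an illustrative stand-in for the error type added
// to nomad/api; it carries the HTTP status so callers can branch on it.
type unexpectedResultError struct {
	StatusCode int
	Body       string
}

func (e *unexpectedResultError) Error() string {
	return fmt.Sprintf("Unexpected response code: %d (%s)", e.StatusCode, e.Body)
}

// handle shows how a consumer can rehydrate the error with errors.As and
// perform status-based behavior.
func handle(err error) error {
	var ure *unexpectedResultError
	if errors.As(err, &ure) && ure.StatusCode == http.StatusNotFound {
		return nil // e.g. treat 404 as "already gone" rather than a failure
	}
	return err
}
```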
The `nomad tls cert` command did not create certificates with the correct SANs for
them to work with non-default domain and region names. This changeset updates the
code to support non-default domains and regions in the certificates.
When resolving ACL policies, we were not using the parent ID for the policy
lookup for dispatch/periodic jobs, even though the claims were signed for that
parent ID. This caused all calls to the Task API (and other WI-authenticated
API calls) from a periodically-dispatched job to fail with 403.
Fix this by using the parent job ID whenever it's available.
On Windows the executor returns an error when trying to open the `NUL` device
when we pass it `os.DevNull` for the stdout/stderr paths. Instead of opening the
device, use the discard pipe so that we have platform-specific behavior from the
executor itself.
Fixes: #17148
When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.
This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.
Fixes: #11823
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.
Most scoring iterators are bracketed to -1/+1, but spread iterators are not, so
that they can handle greatly unbalanced scoring. That means we can't simply
return a -1 score without generating a score that might be greater than the
negative scores set by other spread targets. Instead, track the lowest-seen
spread boost and use that as the spread boost for any case where we'd otherwise
divide by zero.
Fixes: #8863
* client: ignore restart issued to terminal allocations
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with the group_service_hook, which would then register services but
never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.
* e2e: add e2e test for alloc restart zombies
* cl: tweak text
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.
This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.
Fixes: #14636 #15686
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:
* The `allocSync` method that updates the servers isn't gated on first contact
with the servers. This means that if a server updates the desired state while
the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
RPC. This will typically result in the client updating the server with "running"
and then immediately thereafter "complete".
* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
if it's possible to assert that the server has definitely seen the client
state. The allocrunner may queue up updates even if we gate sending them. So
then we end up with a race between the allocrunner updating its internal state
to overwrite the previous update and `allocSync` sending the bogus or duplicate
update.
This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.
The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.
Fixes #16381
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.
now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.
removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting
their shutdown delay, and retrying as needed
* also includes task preKill hooks
* clearDriverHandle() to destroy the task
and associated resources
* task exited hooks
* connect: use heuristic to detect sidecar task driver
This PR adds a heuristic to detect whether to use the podman task driver
for the connect sidecar proxy. The podman driver will be selected if there
is at least one task in the task group configured to use podman, and there
are zero tasks in the group configured to use docker. In all other cases
the task driver defaults to docker.
After this change, we should be able to run typical Connect jobspecs
(e.g. nomad job init [-short] -connect) on Clusters configured with the
podman task driver, without modification to the job files.
Closes #17042
* golf: cleanup driver detection logic
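A minimal sketch of the heuristic over the group's task drivers:

```go
package connect

// sidecarDriver picks the task driver for the Connect sidecar proxy: podman
// is selected only when at least one task in the group uses podman and no
// task uses docker; in all other cases the default remains docker.
func sidecarDriver(taskDrivers []string) string {
	podman, docker := 0, 0
	for _, d := range taskDrivers {
		switch d {
		case "podman":
			podman++
		case "docker":
			docker++
		}
	}
	if podman > 0 && docker == 0 {
		return "podman"
	}
	return "docker"
}
```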
The job scale RPC endpoint hard-coded the eval creation to use the
type of service. This meant scaling events triggered on jobs of
type batch would create evaluations with the wrong type, which
does not seem to cause any problems, just confusion when
correlating the two.
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.
This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.
This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.
Fixes #17076
This PR does some cleanup of an old code path for versions of Consul that
did not support reporting the supported versions of Envoy in its API. Those
versions have not been supported for years at this point, and the fallback
version of Envoy hasn't been supported by any version of Consul for almost
as long. Remove this code path that is no longer useful.
This PR modifies references to the envoyproxy/envoy docker image to
explicitly include the docker.io prefix. This does not affect existing
users, but makes things easier for Podman users, who otherwise need to
specify the full name because Podman does not default to docker.io.
This PR updates the envoy_bootstrap_hook to no longer disable itself if
the task driver in use is not docker. In other words, make it work for
podman and other image based task drivers. The hook now only checks that
1. the task is a connect sidecar
2. the task.config block contains an "image" field
* Sets up a CSS grid for Evaluations sidebar
* Flex seems more sensible for this actually
* Tighten up the header margin
* Percy found a diff; the expand button wasn't showing for view logs sidebar
* [ui] Service job status panel (#16134)
* it begins
* Hacky demo enabled
* Still very hacky but seems deece
* Floor of at least 3 must be shown
* Width from on-high
* Other statuses considered
* More sensible allocTypes listing
* Beginnings of a legend
* Total number of allocs running now maps over job.groups
* Lintfix
* base the number of slots to hold open on actual tallies, which should never exceed totalAllocs
* Versions get yer versions here
* Versions lookin like versions
* Mirage fixup
* Adds Remaining as an alloc chart status and adds historical status option
* Get tests passing again by making job status static for a sec
* Historical status panel click actions moved into their own component class
* job detail tests plz chill
* Testing if percy is fickle
* Hyper-specific on summary distribution bar identifier
* Perhaps the 2nd allocSummary item no longer exists with the more accurate afterCreate data
* UI Test eschewing the page pattern
* Bones of a new acceptance test
* Track width changes explicitly with window-resize
* testlintfix
* Alloc counting tests
* Alloc grouping test
* Alloc grouping with complex resizing
* Refined the list of showable statuses
* PR feedback addressed
* renamed allocation-row to allocation-status-row
* [ui, job status] Make panel status mode a queryParam (#16345)
* queryParam changing
* Test for QP in panel
* Adding @tracked to legacy controller
* Move the job of switching to Historical out to larger context
* integration test mock passed func
* [ui] Service job deployment status panel (#16383)
* A very fast and loose deployment panel
* Removing Unknown status from the panel
* Set up oldAllocs list in constructor, rather than as a getter/tracked var
* Small amount of template cleanup
* Refactored latest-deployment new logic back into panel.js
* Revert now-unused latest-deployment component
* margin bottom when ungrouped also
* Basic integration tests for job deployment status panel
* Updates complete alloc colour to green for new visualizations only (#16618)
* Updates complete alloc colour to green for new visualizations only
* Pale green instead of dark green for viz in general
* [ui] Job Deployment Status: History and Update Props (#16518)
* Deployment history wooooooo
* Styled deployment history
* Update Params
* lintfix
* Types and groups for updateParams
* Live-updating history
* Harden with types, error states, and pending states
* Refactor updateParams to use trigger component
* [ui] Deployment History search (#16608)
* Functioning searchbox
* Some nice animations for history items
* History search test
* Fixing up some old mirage conventions
* some a11y rule override to account for scss keyframes
* Split panel into deploying and steady components
* HandleError passed from job index
* gridified panel elements
* TotalAllocs added to deploying.js
* Width perc to px
* [ui] Splitting deployment allocs by status, health, and canary status (#16766)
* Initial attempt with lots of scratchpad work
* Style mods per UI discussion
* Fix canary overflow bug
* Dont show canary or health for steady/prev-alloc blocks
* Steady state
* Thanks Julie
* Fixes steady-state versions
* Legen, wait for it...
* Test fixes now that we have a minimum block size
* PR prep
* Shimmer effect on pending and unplaced allocs (#16801)
* Shimmer effect on pending and unplaced
* Dont show animation in the legend
* [ui, deployments] Linking allocblocks and legends to allocation / allocations index routes (#16821)
* Conditional link-to component and basic linking to allocations and allocation routes
* Job versions filter added to allocations index page
* Steady state legends link
* Legend links
* Badge count links for versions
* Fix: faded class on steady-state legend items
* version link now wont show completed ones
* Fix a11y violations with link labels
* Combining some template conditional logic
* [ui, deployments] Conversions on long nanosecond update params (#16882)
* Conversions on long nanosecond nums
* Early return in updateParamGroups comp prop
* [ui, deployments] Mirage Actively Deploying Job and Deployment Integration Tests (#16888)
* Start of deployment alloc test scaffolding
* Bit of test cleanup and canary for ungrouped allocs
* Flakey but more robust integrations for deployment panel
* De-flake acceptance tests and add an actively deploying job to mirage
* Jitter-less alloc status distribution removes my bad math
* bugfix caused by summary.desiredTotal non-null
* More interesting mirage active deployment alloc breakdown
* Further tests for previous-allocs row
* Previous alloc legend tests
* Percy snapshots added to integration test
* changelog
* services: un-mark group services as deregistered if restart hook runs
This PR may fix a bug where group services will never be deregistered if the
group undergoes a task restart.
* e2e: add test case for restart and deregister group service
* cl: add cl
* e2e: add wait for service list call
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.
This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.
---
Fixes: #11175
Co-authored-by: Thomas Weber <towe75@googlemail.com>
PR #14492 introduced a new check to return 0 when the `nomad job plan`
command returns a diff of type `None`.
But the `-diff` CLI flag was also being used to control whether the plan
request should return the diff or not, instead of just controlling whether the
diff was printed.
This means that when `-diff=false` is set the response does not include
any diff information, and so the new check panics.
This commit fixes the problem by always requesting a diff and using the
`-diff` flag only for controlling output, as currently documented.
* Honor value for distinct_hosts constraint
* Add test for feasibility checking for `false`
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.
This changeset updates the behavior to wait until the client status of all
drained allocs is terminal before marking the node as done draining.
* [no ci] deps: update docker to 23.0.3
This PR brings our docker/docker dependency (which is hosted at github.com/moby/moby)
up to 23.0.3 (forward about 2 years). Refactored our use of docker/libnetwork to
reference the package in its new home, which is docker/docker/libnetwork (it is
no longer an independent repository). Some minor nearby test case cleanup as well.
* add cl
* func: add namespace support for list deployment
* func: add wildcard to namespace filter for deployments
* Update deployment_endpoint.go
* style: use must instead of require or assert
* style: rename paginator to avoid clash with import
* style: add changelog entry
* fix: add missing parameter for upsert jobs
The first start of a Consul Connect proxy sidecar triggers a run
of the envoy_version hook which modifies the task config image
entry. The modification takes into account a number of factors to
correctly populate this. Importantly, once the hook has run, it
marks itself as done so the taskrunner will not execute it again.
When the client receives a non-destructive update for the
allocation which the proxy sidecar is a member of, it will update
and overwrite the task definition within the taskrunner. In doing
so it overwrites the modification performed by the hook. If the
allocation is restarted, the envoy_version hook will be skipped as
it previously marked itself as done, and therefore the sidecar
config image is incorrect and causes a driver error.
The fix stops the hook from marking itself as done in the taskrunner's view,
so the hook runs again when the allocation is restarted.
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur the overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
new WaitForPlugin() called during csiHook.Prerun,
so that on startup, clients can recover running
tasks that use CSI volumes, instead of them being
terminated and rescheduled because they need a
node plugin that is "not found" *yet*, only because
the plugin task has not yet been recovered.
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.
The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!
Update so that `migrate=true` implies `sticky=true` as follows:
* The failure mode when client ACLs are enabled comes from the server not passing
along a migration token. Update the server so that the server provides a
migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
of the test more clear.
Requests without an ACL token that pass through the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.
Fix the bypass by only setting a client ID on the identity if we have a valid node secret.
Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.
Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
`Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
can enforce ACLs for all endpoints in a uniform way that's less error-prone.
This update changes the behaviour when following logs from an
allocation, so that both stdout and stderr files are streamed when the
operator supplies the follow flag. The previous behaviour is retained
for all other flags and situations.
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.
Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process. In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.
Per CAPABILITIES (7):
> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged. The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
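A sketch of populating libcontainer's capability sets from the task's requested capabilities; including the Effective set is an assumption beyond the three sets named above:

```go
package executor

import "github.com/opencontainers/runc/libcontainer/configs"

// taskCapabilities builds the libcontainer capability configuration so that
// the requested capabilities survive libcontainer's execve: Bounding alone is
// not enough, so Permitted, Inheritable, and Ambient are populated as well
// (Effective is included here too; treat that as an assumption).
func taskCapabilities(caps []string) *configs.Capabilities {
	return &configs.Capabilities{
		Bounding:    caps,
		Permitted:   caps,
		Inheritable: caps,
		Ambient:     caps,
		Effective:   caps,
	}
}
```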
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips
This PR adds detection of asymmetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.
Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.
For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.
Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.
node attributes before
cpu.arch = arm64
cpu.modelname = Apple M2 Pro
cpu.numcores = 12
cpu.reservablecores = 0
cpu.totalcompute = 1000
node attributes after
cpu.arch = arm64
cpu.frequency.efficiency = 2424
cpu.frequency.power = 3504
cpu.modelname = Apple M2 Pro
cpu.numcores.efficiency = 4
cpu.numcores.power = 8
cpu.reservablecores = 0
cpu.totalcompute = 37728
* fingerprint/cpu: follow up cr items
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* style: refactor force run function
* fix: remove defer and inline unlock for speed optimization
* Update nomad/leader.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* style: refactor tests to use must
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* fix: move back from defer to calling unlock before returning.
createEval can't be called with the lock held
* style: refactor test to use must
* added new entry to changelog and update comments
---------
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
When a disconnect client reconnects the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.
This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.
This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.
This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.
It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.
The `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
Implement the new `nomad job restart` command that allows operators to
restart the tasks of a job's allocations or reschedule entire allocations.
Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.
This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.
Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
Fixes #16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well, hopefully you get the
picture: an infinite loop occurs.
* Throw your mouse into traffic
* Add node metadata with a shortcut
* Re-labelled
* Adds a toast notification to job start/stop on keyboard shortcut
* Typo fix
* Added and flag to command
* cli[style]: small refactor to avoid confusion with tmpl variable
* Update inspect.mdx
* cli: add changelog entry
* Update .changelog/16478.txt
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update command/quota_inspect.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: add json and t flags to quota status command
* cli: add entry to changelog
* Update command/quota_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: Add and flags to server members
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update the server members tests to use must
* cli: add flags addition to changelog
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
The `nomad login` command does not need to know the ACL auth method's type, since all
method names are unique.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* services: always set deregister flag after deregistration of group
This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).
In the tests ... we used to assert on the wrong behavior (remove twice) which
has now been corrected to assert we remove only once.
This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.
* services: cleanup group service hook tests
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.
Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
* landlock: git needs more files for private repositories
This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs
- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts
Add these rules to the landlock rules for the artifact sandbox.
* cr: use nonexistent instead of devnull
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cr: use go-homedir for looking up home directory
* pr: pull go-homedir into explicit require
* cr: fixup homedir tests in homeless root cases
* cl: fix root test for real
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cli: Add and flag to namespace status command
* Update command/namespace_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update tests for namespace status command to use must
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Our auth token parsing code trims space around the `Authorization` header but
not around `X-Nomad-Token`. When using the UI, it's easy to accidentally
introduce a leading or trailing space, which results in spurious authentication
errors. Trim the space at the HTTP server.
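A sketch of the header handling; the `Authorization` scheme parsing is simplified:

```go
package agent

import (
	"net/http"
	"strings"
)

// parseToken extracts the ACL token from a request, trimming accidental
// leading or trailing whitespace from either header.
func parseToken(req *http.Request) string {
	if auth := req.Header.Get("Authorization"); auth != "" {
		token := strings.TrimPrefix(strings.TrimSpace(auth), "Bearer ")
		return strings.TrimSpace(token)
	}
	return strings.TrimSpace(req.Header.Get("X-Nomad-Token"))
}
```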
* cgv1: do not disable cpuset manager if reserved interface already exists
This PR fixes a bug where restarting a Nomad Client on a machine using cgroups
v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself.
This is being caused by incorrectly interpreting a "file exists" error as
problematic when ensuring the reserved cpuset exists. If we get a "file exists"
error, that just means the Client was likely restarted.
Note that a machine reboot would fix the issue - the cgroups interfaces are
ephemeral.
* cl: add cl
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle,
and the behavior around wildcard paths in these policies differs from the rest
of our ACL policies, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
* cli: add -json flag to alloc checks for completion
* CLI: Expand test to include testing the json flag for allocation checks
* Documentation: Add the checks command
* Documentation: Add example for alloc check command
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* CLI: Add template flag to alloc checks command
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* CLI: Extend test to include -t flag for alloc checks
* func: add changelog for added flags to alloc checks
* cli[doc]: Make usage section on alloc checks clearer
* Update website/content/docs/commands/alloc/checks.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Delete modd.conf
* cli[doc]: add -t flag to command description for alloc checks
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Co-authored-by: Juanita De La Cuesta Morales <juanita.delacuestamorales@juanita.delacuestamorales-LHQ7X0QG9X>
Most job subcommands allow for job ID prefix match as a convenience
functionality so users don't have to type the full job ID.
But this introduces a hard ACL requirement that the token used to run
these commands have the `list-jobs` permission, even if the token has
enough permission to execute the basic command action and the user
passed an exact job ID.
This change softens this requirement by not failing the prefix match in
case the request results in a permission denied error and instead using
the information passed by the user directly.
When the scheduler tries to find a placement for a new allocation, it iterates
over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with
the ports of all existing allocations and any other allocations already proposed
as part of this same evaluation via its `SetAllocs` method. Then we make an
"ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive
an "offer" in return. The offer will include both static ports and any dynamic
port assignments.
The `AssignPorts` method was written to support group networks, and it shares
code that selects dynamic ports with the original `AssignTaskNetwork`
code. `AssignTaskNetwork` can request multiple ports from the bitmap at a
time. But `AssignPorts` requests them one at a time and does not account for
possible collisions, and doesn't return an error in that case.
What happens next varies:
1. If the scheduler doesn't place the allocation on that node, the port
conflict is thrown away and there's no problem.
2. If the node is picked and this is the only allocation (or last allocation),
the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect.
3. If the node is picked and there are additional allocations in the same eval
that iterate over the same node, their call to `SetAllocs` will detect the
impossible state and the node will be rejected. This can have the puzzling
behavior where a second task group for the job without any networking at all
can hit a port collision error!
It looks like this bug has existed since we implemented group networks, but
there are several factors that add up to making the issue rare for many users
yet frustratingly frequent for others:
* You're more likely to hit this bug the more tightly packed your range for
dynamic ports is. With 12000 ports in the range by default, many clusters can
avoid this for a long time.
* You're more likely to hit case (3) for jobs with lots of allocations or if a
scheduler has to iterate over a large number of nodes, such as with system jobs,
jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints.
For unlucky combinations of these factors, it's possible that case (3) happens
repeatedly, preventing scheduling of a given job until a client state
change (ex. restarting the agent so all its allocations are rescheduled
elsewhere) re-opens the range of dynamic ports available.
This changeset:
* Fixes the bug by accounting for collisions in dynamic port selection in
`AssignPorts`.
* Adds test coverage for `AssignPorts`, expands coverage of this case for the
deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a
scheduler test for spread scheduling to more easily detect this kind of problem
in the future.
* Adds a `String()` method to `Bitmap` so that any future "screaming" log lines
have a human-readable list of used ports.
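A simplified sketch of collision-aware selection; the real implementation asks the `Bitmap` for ports rather than scanning a map:

```go
package network

import "fmt"

// pickDynamicPort returns a free port that is neither used by existing
// allocations nor already offered earlier in the same call, so one-at-a-time
// selection cannot collide with itself.
func pickDynamicPort(usedByAllocs, offered map[int]bool, min, max int) (int, error) {
	for port := min; port <= max; port++ {
		if usedByAllocs[port] || offered[port] {
			continue
		}
		offered[port] = true
		return port, nil
	}
	return 0, fmt.Errorf("no free dynamic ports in range %d-%d", min, max)
}
```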
* client: disable running artifact downloader as nobody
This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.
However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which, when set up with the recommended directory
permissions structure, prevents artifact downloads from working as intended.
Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.
* docs: update upgrade guide for 1.5.1 sandboxing
* docs: add cl
* docs: add title to upgrade guide fix
Wildcard datacenters introduced a bug where a job with any wildcard datacenters
will always be treated as a destructive update when we check whether a
datacenter has been removed from the jobspec.
Includes updating the helper so that callers don't have to loop over the job's
datacenters.
* Fix for wildcard DC sys/sysbatch jobs
* A few extra modules for wildcard DC in systemish jobs
* doesMatchPattern moved to its own util as match-glob
* DC glob lookup using matchGlob
* PR feedback
Some of the methods in `Allocations()` incorrectly use the `putQuery`
in API calls where `put` is more appropriate since they are not reading
information back. These methods are also not returning request metadata
such as `LastIndex` back to callers, which can be useful to have in some
scenarios.
They also provide poor developer experience as they take an
`*api.Allocation` struct when only the allocation ID is necessary. This
can lead consumers to make unnecessary API calls to fetch the full
allocation.
Fixing these problems requires updating the methods' signatures so they
take `*WriteOptions` instead of `*QueryOptions` and return `*WriteMeta`,
but this is a breaking change that requires advanced notice to consumers.
This commit adds a future breaking change notice and also fixes the
`Stop` method so it properly returns request metadata in a backwards
compatible way.
In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the
new eval was written in the same raft entry. Backwards-compatibility checks were
supposed to have been removed in Nomad 1.1.0, but we missed that. It has long
been safe to remove.
Several `nomad job` subcommands had duplicate or slightly similar logic
for resolving a job ID from a CLI argument prefix, while others did not
have this functionality at all.
This commit pulls the shared logic to the command Meta and updates all
`nomad job` subcommands to use it.
When native service discovery was added, we used the node secret as the auth
token. Once Workload Identity was added in Nomad 1.4.x we needed to use the
claim token for `template` blocks, and so we allowed valid claims to bypass the
ACL policy check to preserve the existing behavior. (Invalid claims are still
rejected, so this didn't widen any security boundary.)
In reworking authentication for 1.5.0, we unintentionally removed this
bypass. For WIs without a policy attached to their job, everything works as
expected because the resulting `acl.ACL` is nil. But once a policy is attached
to the job the `acl.ACL` is no longer nil and this causes permissions errors.
Fix the regression by adding back the bypass for valid claims. In future work,
we should strongly consider turning the implicit policies into real
`ACLPolicy` objects (even if not stored in state) so that we don't have these
kind of brittle exceptions to the auth code.
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.
Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:
* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
discovery or CSI plugin states
Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are a few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
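A sketch of the unwrapping, with the Raft submission abstracted behind a function parameter:

```go
package nomad

import "fmt"

// raftApply applies a message to Raft and unwraps the FSM.Apply response, so
// callers have a single place to check for errors. The apply parameter stands
// in for the actual Raft submission.
func raftApply(apply func() (interface{}, uint64, error)) (interface{}, uint64, error) {
	resp, index, err := apply()
	if err != nil {
		return nil, index, err
	}
	// FSM.Apply may return an error as its response value; surface it as the
	// error return instead of making every caller unwrap it.
	if respErr, ok := resp.(error); ok && respErr != nil {
		return nil, index, fmt.Errorf("raft apply failed: %w", respErr)
	}
	return resp, index, nil
}
```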
Nomad servers can advertise independent IP addresses for `serf` and
`rpc`. Somewhat unexpectedly, the `serf` address is also used for both Serf and
server-to-server RPC communication (including Raft RPC). The address advertised
for `rpc` is only used for client-to-server RPC. This split was introduced
intentionally in Nomad 0.8.
When clients are using Consul discovery for connecting to servers, they get an
initial discovery set from Consul and use the correct `rpc` tag in Consul to get
a list of addresses for servers. The client then makes a `Status.Peers` RPC to
get the list of those servers that are raft peers. But this endpoint is shared
between servers and clients, and provides the address used for Raft.
Most of the time this is harmless because servers will bind on 0.0.0.0 anyway.
But in topologies where servers are on a private network and clients are on
separate subnets (or even public subnets), clients will make initial contact
with the server to get the list of peers but then populate their local server
set with unreachable addresses.
Cluster administrators can work around this problem by using `server_join` with
specific IP addresses (or DNS names), because the `Node.UpdateStatus` endpoint
returns the correct set of RPC addresses when updating the node. So once a
client has registered, it will get the correct set of RPC addresses.
This changeset updates the client logic to query `Status.Members` instead of
`Status.Peers`, and then extract the correctly advertised address and port from
the response body.
* build: add BuildDate to version info
will be used in enterprise to compare to license expiration time
* cli: multi-line version output, add BuildDate
before:
$ nomad version
Nomad v1.4.3 (coolfakecommithashomgoshsuchacoolonewoww)
after:
$ nomad version
Nomad v1.5.0-dev
BuildDate 2023-02-17T19:29:26Z
Revision coolfakecommithashomgoshsuchacoolonewoww
compare consul:
$ consul version
Consul v1.14.4
Revision dae670fe
Build Date 2023-01-26T15:47:10Z
Protocol 2 spoken by default, blah blah blah...
and vault:
$ vault version
Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z
* docs: update version command output
The `TaskUpdateRequest` struct we send to task runner update hooks was not
populating the Nomad token that we get from the task runner (which we do for the
Vault token). This results in task runner hooks like the template hook
overwriting the Nomad token with the zero value for the token. This causes
in-place updates of a task to break templates (but not other uses that rely on
identity but don't currently bother to update it, like the identity hook).