Commit Graph

835 Commits

Author SHA1 Message Date
VishnuJin 67efb19e94
fingerprint: added windows os.build attribute to host fingerprint (#17576) 2023-06-21 10:53:50 -04:00
Tim Gross ff9ba8ff73
scheduler: tolerate having only one dynamic port available (#17619)
If the dynamic port range for a node is set so that the min is equal to the max,
there's only one port available and this passes config validation. But the
scheduler panics when it tries to pick a random port. Only add the randomness
when there's more than one to pick from.

Adds a test for the behavior but also adjusts the commentary on a couple of the
existing tests that made it seem like this case was already covered if you
didn't look too closely.

Fixes: #17585
2023-06-20 13:29:25 -04:00
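As a rough illustration of the fix described in the commit above (a hedged sketch, not Nomad's actual scheduler code; `pickDynamicPort` is a hypothetical name): only reach for the random offset when the range has more than one port.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickDynamicPort picks a port from the node's dynamic range [min, max].
// When min == max there is only one candidate, so return it directly rather
// than asking rand for an offset; rand.Intn panics if its argument is <= 0,
// which is the failure mode the commit describes.
func pickDynamicPort(min, max int) int {
	if min >= max {
		return min
	}
	return min + rand.Intn(max-min+1)
}

func main() {
	fmt.Println(pickDynamicPort(20000, 20000)) // single-port range: no randomness
	fmt.Println(pickDynamicPort(20000, 32000)) // normal range: random pick
}
```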
Patric Stout 4767d44b94
Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 (#17535)
* Fix DevicesSets being removed when cpusets are reloaded with cgroup v2

This meant that if any allocation was created or removed, all
active DevicesSets were removed from all cgroups of all tasks.

This was most noticeable with "exec" and "raw_exec", as it meant
they no longer had access to /dev files.

* e2e: add test for verifying cgroups do not interfere with access to devices

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-06-15 09:39:36 -05:00
Tim Gross 5f509b8ce0
cli: fix missing `-quiet` flag for `var init` (#17526)
The `var init` command was intended to have support for a `-quiet` flag but it
was not documented and never parsed.
2023-06-14 14:52:46 -04:00
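For illustration only, a minimal stdlib sketch of declaring and actually parsing a boolean `-quiet` flag; Nomad's CLI uses its own flag plumbing, so none of the names below are its real implementation.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// A flag that is documented must also be registered and parsed.
	fs := flag.NewFlagSet("var init", flag.ContinueOnError)
	quiet := fs.Bool("quiet", false, "Do not print success messages")
	if err := fs.Parse(os.Args[1:]); err != nil {
		os.Exit(1)
	}
	if !*quiet {
		fmt.Println("Created variable specification file")
	}
}
```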
Tim Gross e3a37c0b97
replication: fix potential panic during upgrades (#17476)
If the authoritative region has been upgraded to a version of Nomad that has new
replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the
non-authoritative regions will start replicating those objects as soon as their
leader is upgraded. If a server in the non-authoritative region is upgraded and
then becomes the leader before all the other servers in the region have been
upgraded, then it will attempt to write a Raft log entry that the followers
don't understand. The followers will then panic.

Add the same minimum version checks that we do for RPC writes to the leader's
replication loop.
2023-06-12 08:53:56 -04:00
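A hedged sketch of that version gate, using `github.com/hashicorp/go-version`; the helper name, the minimum version value, and the string slice standing in for server members are assumptions, not Nomad's exact identifiers.

```go
package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// minVersionAuthMethods is a stand-in for the minimum Nomad version that
// understands the newly replicated object's Raft log entry.
var minVersionAuthMethods = version.Must(version.NewVersion("1.4.0"))

// serversMeetMinimumVersion reports whether every server in the region is at
// or above the given version, mirroring the guard the commit describes.
func serversMeetMinimumVersion(serverVersions []string, min *version.Version) bool {
	for _, raw := range serverVersions {
		v, err := version.NewVersion(raw)
		if err != nil || v.LessThan(min) {
			return false
		}
	}
	return true
}

func main() {
	region := []string{"1.4.2", "1.3.8", "1.4.1"} // mixed-version region mid-upgrade
	if !serversMeetMinimumVersion(region, minVersionAuthMethods) {
		fmt.Println("skip replicating ACL auth methods until all servers are upgraded")
	}
}
```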
Phil Renaud 6a9df6e3ab
[ui] Don't show a service as healthy when its parent alloc is not running (#17465)
* Fix: don't show a service as healthy when its parent alloc is not running

* Test for Health Unknown
2023-06-09 15:43:11 -04:00
Seth Hoenig 557a6b4a5e
docker: stop network pause container of lost alloc after node restart (#17455)
This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, and the node comes back up. See the issue below for
full repro conditions.

Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning the node's containers and looking for the
associated allocID.

Fixes #17299
2023-06-09 08:46:29 -05:00
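A simplified sketch of that scan; the `nomad_init_` name prefix for pause containers is an assumption here, and the real fix queries the Docker API rather than a plain slice of names.

```go
package main

import (
	"fmt"
	"strings"
)

// findPauseContainer looks through container names for the network pause
// container that belongs to the given allocation ID.
func findPauseContainer(containerNames []string, allocID string) (string, bool) {
	for _, name := range containerNames {
		trimmed := strings.TrimPrefix(name, "/")
		if strings.HasPrefix(trimmed, "nomad_init_") && strings.HasSuffix(trimmed, allocID) {
			return trimmed, true
		}
	}
	return "", false
}

func main() {
	names := []string{"/web-task-abc123", "/nomad_init_abc123"}
	if name, ok := findPauseContainer(names, "abc123"); ok {
		fmt.Println("removing orphaned pause container:", name)
	}
}
```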
Seth Hoenig 134e70cbab
client: fix client panic during drain caused by shutdown (#17450)
During shutdown of a client with drain_on_shutdown there is a race between
the Client ending the cgroup and the task's cpuset manager cleaning up
the cgroup. During the path traversal, skip anything we cannot read, which
avoids dereferencing the nil DirEntry that caused the panic.
2023-06-07 15:12:44 -05:00
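The skip-on-error pattern the commit describes looks roughly like this `filepath.WalkDir` sketch (illustrative only, not the cpuset manager's actual code):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	root := "/sys/fs/cgroup/nomad" // illustrative root; any partially torn-down tree works
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		// Skip anything we cannot read: when err is non-nil, d may be nil,
		// so bail out of this entry before dereferencing it.
		if err != nil {
			return fs.SkipDir
		}
		if d.IsDir() {
			fmt.Println("visiting", path)
		}
		return nil
	})
	if err != nil {
		fmt.Println("walk error:", err)
	}
}
```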
Jerome Eteve c26f01eefd
client checks kernel module in /sys/module for WSL2 bridge networking (#17306) 2023-06-06 10:26:50 -04:00
Dao Thanh Tung 7c7f2d00bb
Add check for missing `path` in client `host_volume` config (#17393) 2023-06-05 19:31:19 -04:00
dependabot[bot] 2f4fe019db
build(deps): bump go.etcd.io/bbolt from 1.3.6 to 1.3.7 (#16228)
* build(deps): bump go.etcd.io/bbolt from 1.3.6 to 1.3.7

Bumps [go.etcd.io/bbolt](https://github.com/etcd-io/bbolt) from 1.3.6 to 1.3.7.
- [Release notes](https://github.com/etcd-io/bbolt/releases)
- [Commits](https://github.com/etcd-io/bbolt/compare/v1.3.6...v1.3.7)

---
updated-dependencies:
- dependency-name: go.etcd.io/bbolt
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* cl: update cl for bbolt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-06-05 10:19:14 -05:00
dependabot[bot] c585cc68db
build(deps): bump github.com/hashicorp/raft from 1.3.11 to 1.5.0 (#17421)
* build(deps): bump github.com/hashicorp/raft from 1.3.11 to 1.5.0

Bumps [github.com/hashicorp/raft](https://github.com/hashicorp/raft) from 1.3.11 to 1.5.0.
- [Release notes](https://github.com/hashicorp/raft/releases)
- [Changelog](https://github.com/hashicorp/raft/blob/main/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/raft/compare/v1.3.11...v1.5.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/raft
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* cl: add cl for raft 1.5.0

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-06-05 09:03:02 -05:00
KamilCuk cc64281445
Add group_add docker option (#17313) 2023-06-02 20:26:01 -04:00
Samantha b92a782b6e
check: Add support for Consul field tls_server_name (#17334) 2023-06-02 10:19:12 -04:00
Luiz Aoqui 4be8d7c049
core: fix kill_timeout validation when progress_deadline is 0 (#17342) 2023-06-01 19:01:32 -04:00
Tim Gross 06972fae0c
prioritized client updates (#17354)
The allocrunner sends several updates to the server during the early lifecycle
of an allocation and its tasks. Clients batch up allocation updates every 200ms,
but experiments like the C2M challenge have shown that even with this batching,
servers can be overwhelmed with client updates during high-volume
deployments. Benchmarking done in #9451 has shown that client updates can easily
represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that
change the `ClientStatus` field are critical for progressing a deployment or
kicking off a reschedule to recover from failures.

Add a priority to the client allocation sync and update the `syncTicker`
receiver so that we only send an update if there's a high priority update
waiting, or on every 5th tick. This means when there are no high priority
updates, the client will send updates at most every 1s instead of
200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as
well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a
shared buffer, so as to split batching from sending and prevent backpressure
onto the allocrunner when the RPC is slow. This doesn't have a major performance
benefit in the benchmarks but makes the implementation of the prioritized update
simpler.

Fixes: #9451
2023-05-31 15:34:16 -04:00
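A minimal sketch of the "high priority or every 5th tick" gating; only the 200ms interval and the 5-tick cadence come from the commit message, everything else below (names, the simulated status change) is illustrative.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var highPriorityPending atomic.Bool // set when ClientStatus changes

	syncTicker := time.NewTicker(200 * time.Millisecond)
	defer syncTicker.Stop()

	// Simulate a ClientStatus change arriving a little after startup.
	go func() {
		time.Sleep(450 * time.Millisecond)
		highPriorityPending.Store(true)
	}()

	ticks := 0
	for i := 0; i < 10; i++ {
		<-syncTicker.C
		ticks++
		// Flush when a high-priority update is waiting, or on every 5th
		// tick, so low-priority updates go out at most once per second.
		if highPriorityPending.Swap(false) || ticks%5 == 0 {
			fmt.Println("flushing batched allocation updates at tick", ticks)
		}
	}
}
```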
Phil Renaud 52772ab0c0
Text type to password type input on profile sign-in page (#17345) 2023-05-30 16:58:34 -04:00
Phil Renaud 038d53c58f
Observe newlines when displaying variables (#17343) 2023-05-30 16:58:16 -04:00
Luiz Aoqui 6236cb8f82
cli: output errors when monitoring deployment (#17348) 2023-05-30 11:12:12 -04:00
Luiz Aoqui e236d6dedd
cli: fix panic on job restart (#17346)
When monitoring the replacement allocation, if the
`Allocations().Info()` request fails, the `alloc` variable is `nil`, so
it should not be read.
2023-05-30 11:08:49 -04:00
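The pattern the fix enforces, shown against the public `github.com/hashicorp/nomad/api` package (the allocation ID is a placeholder): when `Info()` returns an error, bail out before reading the allocation.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		fmt.Println("error creating client:", err)
		return
	}

	alloc, _, err := client.Allocations().Info("example-alloc-id", nil)
	if err != nil {
		fmt.Println("error reading replacement allocation:", err)
		return // alloc is nil here and must not be read
	}
	fmt.Println("replacement allocation status:", alloc.ClientStatus)
}
```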
Luiz Aoqui bb2395031b
client: fix Consul version fingerprint (#17349)
Consul v1.13.8 was released with a breaking change in the /v1/agent/self
endpoint version where a line break was being returned.

This caused the Nomad fingerprint to fail because `NewVersion` errors on
parse.

This commit removes any extra space from the Consul version returned by
the API.
2023-05-30 11:07:57 -04:00
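A small sketch of the described fix using `strings.TrimSpace` with `github.com/hashicorp/go-version`; the raw value is a stand-in for what the Consul agent API returns.

```go
package main

import (
	"fmt"
	"strings"

	version "github.com/hashicorp/go-version"
)

func main() {
	// Consul 1.13.8 can return the version with a trailing line break.
	raw := "1.13.8\n"

	// Trim surrounding whitespace before parsing; otherwise NewVersion
	// errors and the fingerprint fails.
	v, err := version.NewVersion(strings.TrimSpace(raw))
	if err != nil {
		fmt.Println("failed to parse Consul version:", err)
		return
	}
	fmt.Println("fingerprinted Consul version:", v)
}
```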
Phil Renaud 294aa4bfe7
[ui] A few variables-ui-related bugfixes (#17319)
* A few variable-adding bugfixes

* Disable Delete button if only one KV is left, and remove entity warnings on Add More
2023-05-25 17:11:13 -04:00
Charlie Voiselle 86e04a4c6c
[core] nil check and error handling for client status in heartbeat responses (#17316)
Add a nil check to constructNodeServerInfoResponse to manage an apparent race between deregister and client heartbeats. Fixes #17310
2023-05-25 16:04:54 -04:00
Tim Gross b85a28b851
changelog entry for Vault SDK update (#17281) 2023-05-23 09:21:29 -04:00
Charlie Voiselle fc313b7f8f
[api] Return a shapely error for unexpected response (#16743)
* Add UnexpectedResultError to nomad/api

This allows users to perform additional status-based behavior by rehydrating the error using `errors.As` inside of consumers.
2023-05-22 11:45:31 -04:00
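A hedged sketch of the `errors.As` pattern the commit enables; the error type and its fields below are stand-ins, not the exact type exported by `nomad/api`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// UnexpectedResultError stands in for the shapely error the commit describes.
type UnexpectedResultError struct {
	StatusCode int
	Body       string
}

func (e UnexpectedResultError) Error() string {
	return fmt.Sprintf("unexpected response code %d: %s", e.StatusCode, e.Body)
}

// doRequest simulates an API call that hit an unexpected status code.
func doRequest() error {
	return UnexpectedResultError{StatusCode: http.StatusConflict, Body: "variable changed"}
}

func main() {
	err := doRequest()

	// Consumers rehydrate the error with errors.As and branch on the status.
	var unexpected UnexpectedResultError
	if errors.As(err, &unexpected) && unexpected.StatusCode == http.StatusConflict {
		fmt.Println("conflict detected, refetch and retry")
	}
}
```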
Lance Haig 568da5918b
cli: tls certs not created with correct SANs (#16959)
The `nomad tls cert` command did not create certificates with the correct SANs for
them to work with non-default domain and region names. This changeset updates the
code to support non-default domains and regions in the certificates.
2023-05-22 09:31:56 -04:00
Roberto Hidalgo 2f702a9f11
allow periodic jobs to use workload identity ACL policies (#17018)
When resolving ACL policies, we were not using the parent ID for the policy
lookup for dispatch/periodic jobs, even though the claims were signed for that
parent ID. As a result, all calls to the Task API (and other WI-authenticated
API calls) from a periodically-dispatched job failed with 403.

Fix this by using the parent job ID whenever it's available.
2023-05-22 09:19:16 -04:00
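A minimal sketch of the lookup rule described above, with hypothetical claim field names:

```go
package main

import "fmt"

// workloadClaims is a stand-in for the identity claims attached to a task.
type workloadClaims struct {
	JobID       string
	ParentJobID string
}

// policyLookupJobID uses the parent job ID whenever it is set, since dispatch
// and periodic child jobs carry claims signed for their parent.
func policyLookupJobID(c workloadClaims) string {
	if c.ParentJobID != "" {
		return c.ParentJobID
	}
	return c.JobID
}

func main() {
	child := workloadClaims{JobID: "report/periodic-1687363200", ParentJobID: "report"}
	fmt.Println(policyLookupJobID(child)) // "report", matching the signed claims
}
```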
Yethal 4073987de3
cli: show leader status in json output of server members (#17138) 2023-05-18 16:43:57 -04:00
Bram Vogelaar 3b40f778e5
agent: display node id on start up for servers (#17084)
Signed-off-by: Bram Vogelaar <bram@attachmentgenie.com>
2023-05-18 11:23:12 -04:00
Tim Gross fe29cf8b7b
logs: fix `logs.disabled` on Windows (#17199)
On Windows the executor returns an error when trying to open the `NUL` device
when we pass it `os.DevNull` for the stdout/stderr paths. Instead of opening the
device, use the discard pipe so that we have platform-specific behavior from the
executor itself.

Fixes: #17148
2023-05-18 09:14:39 -04:00
Phil Renaud 50a35143c9
[ui, deployments] Fix a bug where watchers on a parent (periodic) job would continue on a child route (#17214)
* Treated same-route as sub-route and didn't cancel watchers

* Changelog
2023-05-17 16:36:15 -04:00
Tim Gross 5fc63ace0b
scheduler: count implicit spread targets as a single target (#17195)
When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.

This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.

Fixes: #11823
2023-05-17 10:25:00 -04:00
Tim Gross 2426aae832
scheduler: prevent -Inf in spread scoring (#17198)
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators are not, so
that they can handle greatly unbalanced scoring. That means we can't simply
return a -1 score without potentially producing a score greater than the negative
scores set by other spread targets. Instead, track the lowest-seen spread boost
and use that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
2023-05-16 16:01:32 -04:00
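A simplified sketch of that divide-by-zero guard; the formula shape and names are illustrative, not the scheduler's literal code.

```go
package main

import "fmt"

// spreadBoost scores how far a target is from its desired percentage. A
// zero-percent target would divide by zero and produce -Inf, which then leaks
// into the API as the string "-Inf"; reuse the lowest boost seen across the
// other spread targets instead.
func spreadBoost(current, desired, minBoostSeen float64) float64 {
	if desired == 0 {
		return minBoostSeen
	}
	return (desired - current) / desired
}

func main() {
	fmt.Println(spreadBoost(3, 0, -2.5)) // zero-percent target: lowest seen boost
	fmt.Println(spreadBoost(3, 5, -2.5)) // normal case: 0.4
}
```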
Seth Hoenig e04ff0d935
client: ignore restart issued to terminal allocations (#17175)
* client: ignore restart issued to terminal allocations

This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with the group_service_hook, which would then register services but
never deregister them, as the allocation would effectively be in a "zombie"
state where it is prepped to run tasks but never will.

* e2e: add e2e test for alloc restart zombies

* cl: tweak text

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-05-16 10:19:41 -05:00
Tim Gross 6814e8e6d9
drivers: make internal `DisableLogCollection` capability public (#17196)
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.

This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.

Fixes: #14636 #15686
2023-05-16 09:16:03 -04:00
Phil Renaud 9a5d67d475
[ui] Keyboard shortcuts to switch regions (#17169)
* Regions keynav

* Don't show if you only have a single region (global by default)
2023-05-12 11:46:00 -04:00
Tim Gross 9ed75e1f72
client: de-duplicate alloc updates and gate during restore (#17074)
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact
  with the servers. This means that if a server updates the desired state while
  the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
  RPC. This will typically result in the client updating the server with "running"
  and then immediately thereafter "complete".

* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
  if it's possible to assert that the server has definitely seen the client
  state. The allocrunner may queue up updates even if we gate sending them. So
  then we end up with a race between the allocrunner updating its internal state
  to overwrite the previous update and `allocSync` sending the bogus or duplicate
  update.

This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
2023-05-11 09:05:24 -04:00
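A hedged sketch of the acknowledgement-tracking idea; the real allocrunner compares and persists full allocation state in the client state DB, whereas the names and types here are stand-ins.

```go
package main

import "fmt"

// allocState is a stand-in for the client's view of an allocation.
type allocState struct {
	ClientStatus string
	TaskStates   string // stand-in for the per-task state map
}

type allocRunner struct {
	lastAcknowledged allocState
}

// shouldSync reports whether the pending update differs from what the server
// has already acknowledged; identical updates are dropped from the batch.
func (ar *allocRunner) shouldSync(pending allocState) bool {
	return pending != ar.lastAcknowledged
}

// acknowledge records the state once Node.UpdateAlloc returns successfully.
func (ar *allocRunner) acknowledge(sent allocState) {
	ar.lastAcknowledged = sent
}

func main() {
	ar := &allocRunner{}
	update := allocState{ClientStatus: "running", TaskStates: "web:running"}

	if ar.shouldSync(update) {
		fmt.Println("adding update to batch")
		ar.acknowledge(update) // after a successful Node.UpdateAlloc RPC
	}
	if !ar.shouldSync(update) {
		fmt.Println("duplicate update skipped")
	}
}
```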
Daniel Bennett a7ed6f5c53
full task cleanup when alloc prerun hook fails (#17104)
To avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.

Now, if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
 * set task state as dead
 * save task runner local state
 * task stop hooks

also done in tr.Run() now that it's not skipped:
 * handleKill() to kill tasks while respecting
   their shutdown delay, and retrying as needed
   * also includes task preKill hooks
 * clearDriverHandle() to destroy the task
   and associated resources
 * task exited hooks
2023-05-08 13:17:10 -05:00
stswidwinski 9c1c2cb5d2
Correct the status description and modify time of canceled evals. (#17071)
Fix for #17070. Corrected the status description and modify time of evals which are canceled due to another eval having completed in the meantime.
2023-05-08 08:50:36 -04:00
Seth Hoenig fff2eec625
connect: use heuristic to detect sidecar task driver (#17065)
* connect: use heuristic to detect sidecar task driver

This PR adds a heuristic to detect whether to use the podman task driver
for the connect sidecar proxy. The podman driver will be selected if there
is at least one task in the task group configured to use podman, and there
are zero tasks in the group configured to use docker. In all other cases
the task driver defaults to docker.

After this change, we should be able to run typical Connect jobspecs
(e.g. nomad job init [-short] -connect) on Clusters configured with the
podman task driver, without modification to the job files.

Closes #17042

* golf: cleanup driver detection logic
2023-05-05 10:19:30 -05:00
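The heuristic reduces to a small counting rule; this sketch uses plain strings and assumed names rather than Nomad's task structs.

```go
package main

import "fmt"

// sidecarTaskDriver picks podman for the Connect sidecar only when at least
// one task in the group uses podman and none use docker; in every other case
// it defaults to docker.
func sidecarTaskDriver(groupTaskDrivers []string) string {
	podman, docker := 0, 0
	for _, d := range groupTaskDrivers {
		switch d {
		case "podman":
			podman++
		case "docker":
			docker++
		}
	}
	if podman > 0 && docker == 0 {
		return "podman"
	}
	return "docker"
}

func main() {
	fmt.Println(sidecarTaskDriver([]string{"podman"}))           // podman
	fmt.Println(sidecarTaskDriver([]string{"podman", "docker"})) // docker
	fmt.Println(sidecarTaskDriver([]string{"exec"}))             // docker
}
```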
James Rasell 6ec4a69f47
scale: fixed a bug where evals could be created with the wrong type. (#17092)
The job scale RPC endpoint hard-coded the eval creation to use the
type of service. This meant scaling events triggered on jobs of
type batch would create evaluations with the wrong type, which
does not seem to cause any problems, just confusion when
correlating the two.
2023-05-05 14:46:10 +01:00
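A minimal sketch of the shape of the fix, with stand-in types; the real code builds Nomad's `structs.Evaluation` and uses its trigger constants.

```go
package main

import "fmt"

type job struct {
	ID   string
	Type string // "service", "batch", ...
}

type evaluation struct {
	JobID       string
	Type        string
	TriggeredBy string
}

// newScaleEval derives the eval type from the job being scaled instead of
// hard-coding "service".
func newScaleEval(j job) evaluation {
	return evaluation{
		JobID:       j.ID,
		Type:        j.Type,
		TriggeredBy: "job-scaling",
	}
}

func main() {
	fmt.Printf("%+v\n", newScaleEval(job{ID: "import", Type: "batch"}))
}
```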
Tim Gross 17bd930ca9
logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00
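A small sketch of why flipping the field's polarity works: an old Raft snapshot has no value for the new field, so it decodes to the zero value, and with `Disabled` that zero value means logs stay enabled. Struct and field names here are illustrative.

```go
package main

import "fmt"

type logConfigOld struct {
	Enabled bool // zero value false: logs silently turned off after upgrade
}

type logConfigNew struct {
	Disabled bool // zero value false: logs stay enabled
}

func main() {
	var old logConfigOld // as decoded from a pre-upgrade snapshot
	var cur logConfigNew

	fmt.Println("old field, logs collected:", old.Enabled)   // false (the bug)
	fmt.Println("new field, logs collected:", !cur.Disabled) // true (the fix)
}
```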
Charlie Voiselle 8f6fa14e9e
[deps] Update consul-template to v0.31.0 (#16908)
* Update consul-template to v0.31.0
* Add changelog
2023-05-03 09:15:17 -04:00
Michael Schurter f8f9e91b8a
build: upgrade from go 1.20.3 to 1.20.4 (#17056)
Includes CVE fixes that do *not* impact Nomad:

https://groups.google.com/g/golang-announce/c/MEb0UyuSMsU
2023-05-02 13:09:11 -07:00
Seth Hoenig e9fec4ebc8
connect: remove unusable path for fallback envoy image names (#17044)
This PR does some cleanup of an old code path for versions of Consul that
did not support reporting the supported versions of Envoy in its API. Those
versions have been out of support for years at this point, and the fallback
version of envoy hasn't been supported by any version of Consul for almost
as long. Remove this code path that is no longer useful.
2023-05-02 09:48:44 -05:00
Seth Hoenig e8d53ea30b
connect: use explicit docker.io prefix in default envoy image names (#17045)
This PR modifies references to the envoyproxy/envoy docker image to
explicitly include the docker.io prefix. This does not affect existing
users, but makes things easier for Podman users, who otherwise need to
specify the full name because Podman does not default to docker.io
2023-05-02 09:27:48 -05:00
Seth Hoenig 86f6a38867
connect: do not restrict auto envoy version to docker task driver (#17041)
This PR updates the envoy_bootstrap_hook to no longer disable itself if
the task driver in use is not docker. In other words, make it work for
podman and other image-based task drivers. The hook now only checks that

1. the task is a connect sidecar
2. the task.config block contains an "image" field
2023-05-01 15:07:35 -05:00
Phil Renaud 5ca59aef56
Move the token JWT console log out of an iterator (#17010) 2023-04-28 13:46:10 -04:00
Seth Hoenig 5744b2cd4f
docs: add more notes about artifact breaking changes in 1.5.0 (#17005)
* changelog: note artifact breaking changes for 1.5.0

* docs: add note about environment variables to artifact job spec docs

* Update website/content/docs/job-specification/artifact.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-04-27 11:41:18 -05:00
Michael Schurter d3b0bbc088
deps: update go-bexpr from 0.1.11 to 0.1.12 (#16991)
Pulls in https://github.com/hashicorp/go-bexpr/pull/38

Fixes #16758
2023-04-27 09:01:42 -07:00