open-nomad

Author	SHA1	Message	Date
Luiz Aoqui	e4c8b59919	Update alloc after reconnect and enforece client heartbeat order (#15068 ) * scheduler: allow updates after alloc reconnects When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was create and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state. The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting. This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits: - Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped. - Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event. - Disconnect events are already stored as `AllocState` and so storing reconnects there as well makes it the only source of information required. With the new logic, the reconnection logic is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. After the reconnection is handled, the new `ClientStatus` is store in `AllocState` allowing future evals to skip the reconnection logic. * scheduler: prevent spurious placement on reconnect When a client reconnects it makes two independent RPC calls: - `Node.UpdateStatus` to heartbeat and set its status as `ready`. - `Node.UpdateAlloc` to update the status of its allocations. These two calls can happen in any order, and in case the allocations are updated before a heartbeat it causes the state to be the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation was being disconnected. This extra allocation would then be quickly stopped by the heartbeat eval. This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running` the scheduler will check if the allocation is actually reconnecting using its `AllocState` events. * rpc: only allow alloc updates from `ready` nodes Clients interact with servers using three main RPC methods: - `Node.GetAllocs` reads allocation data from the server and writes it to the client. - `Node.UpdateAlloc` reads allocation from from the client and writes them to the server. - `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism. These three methods are called periodically by the clients and are done so independently from each other, meaning that there can't be any assumptions in their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`. When connectivity is restored the to rest of the cluster, the natural mental model is to think that the client will heartbeat first and then update its allocations status into the servers. But since there's no inherit order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`. This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: they can continue sending these requests independently and alloc update will succeed after the heartbeat is done. * chagelog: add entry for #15068 * code review * client: skip terminal allocations on reconnect When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation. * apply more code review changes * scheduler: persist changes to reconnected allocs Reconnected allocs have a new AllocState entry that must be persisted by the plan applier. * rpc: read node ID from allocs in UpdateAlloc The AllocUpdateRequest struct is used in three disjoint use cases: 1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs, and WriteRequest fields 2. Raft log message using the Allocs, Evals, and WriteRequest fields 3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields Adding a new field that would only be used in one these cases (1) made things more confusing and error prone. While in theory an AllocUpdateRequest could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations. * scheduler: remove logic to handle exceptional case This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it. * more code review * remove another unused test	2022-11-04 16:25:11 -04:00
Luiz Aoqui	1b87d292a3	client: retry RPC call when no server is available (#15140 ) When a Nomad service starts it tries to establish a connection with servers, but it also runs alloc runners to manage whatever allocations it needs to run. The alloc runner will invoke several hooks to perform actions, with some of them requiring access to the Nomad servers, such as Native Service Discovery Registration. If the alloc runner starts before a connection is established the alloc runner will fail, causing the allocation to be shutdown. This is particularly problematic for disconnected allocations that are reconnecting, as they may fail as soon as the client reconnects. This commit changes the RPC request logic to retry it, using the existing retry mechanism, if there are no servers available.	2022-11-04 14:09:39 -04:00
Charlie Voiselle	79c4478f5b	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Seth Hoenig	d7aa37a5c9	e2e: explicitly wait on task status in chroot download exec test (#15145 ) Also add some debug log lines for this test, because it doesn't make sense for the allocation to be complete yet a task in the allocation to be not started yet, which is what the test failures are implying.	2022-11-04 09:50:11 -05:00
Charlie Voiselle	83e43e01c1	Add missing timer reset (#15134 )	2022-11-03 18:57:57 -04:00
Ethan	654ae1d591	fix: batchFirstFingerprints does not update device on node after v1.3.5 (#15125 ) * fix: update device in batch first footprint * cl: add cl note Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-11-03 16:31:39 -05:00
Phil Renaud	147df77e62	Ember patched for security release (#15126 )	2022-11-03 16:29:23 -04:00
Tim Gross	672fb46d12	WI: set identity to client secret if missing (#15121 ) Allocations created before 1.4.0 will not have a workload identity token. When the client running these allocs is upgraded to 1.4.x, the identity hook will run and replace the node secret ID token used previously with an empty string. This causes service discovery queries to fail. Fallback to the node's secret ID when the allocation doesn't have a signed identity. Note that pre-1.4.0 allocations won't have templates that read Variables, so there's no threat that this new node ID secret will be able to read data that the allocation shouldn't have access to.	2022-11-03 11:10:11 -04:00
Phil Renaud	ab5bfa8149	Accidentally trailed off on a docs paragraph (#15118 )	2022-11-02 23:33:41 -04:00
Phil Renaud	ffb4c63af7	[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833 ) * Adds meta to job list stub and displays a pack logo on the jobs index * Changelog * Modifying struct for optional meta param * Explicitly ask for meta anytime I look up a job from index or job page * Test case for the endpoint * adding meta field to API struct and ommitting from response if empty * passthru method added to api/jobs.list * Meta param listed in docs for jobs list * Update api/jobs.go Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-11-02 16:58:24 -04:00
Phil Renaud	6d5fe56fa1	Job spec upload (#14747 ) * Job spec upload by click or drag * pseudo-restrict formats * Changelog * Tweak to job spec upload to be above editor layer * Within the job-editor again tho * Beginning testcase cleanup * Test progression * refact: update codemirror fillin logic Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>	2022-11-02 10:34:10 -04:00
Seth Hoenig	a0bdc67d6a	build: update to go1.19.3 (#15099 )	2022-11-01 15:54:49 -05:00
Tim Gross	4d7a4171cd	volumewatcher: prevent panic on nil volume (#15101 ) If a GC claim is written and then volume is deleted before the `volumewatcher` enters its run loop, we panic on the nil-pointer access. Simply doing a nil-check at the top of the loop reveals a race condition around shutting down the loop just as a new update is coming in. Have the parent `volumeswatcher` send an initial update on the channel before returning, so that we're still holding the lock. Update the watcher's `Stop` method to set the running state, which lets us avoid having a second context and makes stopping synchronous. This reduces the cases we have to handle in the run loop. Updated the tests now that we'll safely return from the goroutine and stop the runner in a larger set of cases. Ran the tests with the `-race` detection flag and fixed up any problems found here as well.	2022-11-01 16:53:10 -04:00
Tim Gross	38542f256e	variables: limit rekey eval to half the nack timeout (#15102 ) In order to limit how much the rekey job can monopolize a scheduler worker, we limit how long it can run to 1min before stopping work and emitting a new eval. But this exactly matches the default nack timeout, so it'll fail the eval rather than getting a chance to emit a new one. Set the timeout for the rekey eval to half the configured nack timeout.	2022-11-01 16:50:50 -04:00
Tim Gross	903b5baaa4	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialize. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
dependabot[bot]	acc94d523f	build(deps): bump github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible (#15078 ) * build(deps): bump github.com/docker/cli Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.18+incompatible to 20.10.21+incompatible. - [Release notes](https://github.com/docker/cli/releases) - [Commits](https://github.com/docker/cli/compare/v20.10.18...v20.10.21) --- updated-dependencies: - dependency-name: github.com/docker/cli dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps: updated github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-31 08:50:33 -05:00
dependabot[bot]	c0fc174e0d	build(deps): bump github.com/hashicorp/serf from 0.10.0 to 0.10.1 (#15077 ) Bumps [github.com/hashicorp/serf](https://github.com/hashicorp/serf) from 0.10.0 to 0.10.1. - [Release notes](https://github.com/hashicorp/serf/releases) - [Changelog](https://github.com/hashicorp/serf/blob/master/CHANGELOG.md) - [Commits](https://github.com/hashicorp/serf/compare/v0.10.0...v0.10.1) --- updated-dependencies: - dependency-name: github.com/hashicorp/serf dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-31 08:48:34 -05:00
dependabot[bot]	369e4da4ad	build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 (#15081 ) * build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.44.84 to 1.44.126. - [Release notes](https://github.com/aws/aws-sdk-go/releases) - [Commits](https://github.com/aws/aws-sdk-go/compare/v1.44.84...v1.44.126) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps: update github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-31 08:47:48 -05:00
dependabot[bot]	0ae4c4a241	build(deps): bump github.com/stretchr/testify in /api (#15082 ) Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.8.0 to 1.8.1. - [Release notes](https://github.com/stretchr/testify/releases) - [Commits](https://github.com/stretchr/testify/compare/v1.8.0...v1.8.1) --- updated-dependencies: - dependency-name: github.com/stretchr/testify dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-31 08:45:04 -05:00
dependabot[bot]	fc9a731813	build(deps): bump github.com/hashicorp/memberlist from 0.4.0 to 0.5.0 (#15080 ) Bumps [github.com/hashicorp/memberlist](https://github.com/hashicorp/memberlist) from 0.4.0 to 0.5.0. - [Release notes](https://github.com/hashicorp/memberlist/releases) - [Commits](https://github.com/hashicorp/memberlist/compare/v0.4.0...v0.5.0) --- updated-dependencies: - dependency-name: github.com/hashicorp/memberlist dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-31 08:44:27 -05:00
dependabot[bot]	6d4fef46fe	build(deps): bump go.uber.org/goleak from 1.1.12 to 1.2.0 (#15079 ) Bumps [go.uber.org/goleak](https://github.com/uber-go/goleak) from 1.1.12 to 1.2.0. - [Release notes](https://github.com/uber-go/goleak/releases) - [Changelog](https://github.com/uber-go/goleak/blob/master/CHANGELOG.md) - [Commits](https://github.com/uber-go/goleak/compare/v1.1.12...v1.2.0) --- updated-dependencies: - dependency-name: go.uber.org/goleak dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-31 08:43:29 -05:00
Tim Gross	dab9388c75	refactor eval delete safety check (#15070 ) The `Eval.Delete` endpoint has a helper that takes a list of jobs and allocs and determines whether the eval associated with those is safe to delete (based on their state). Filtering improvements to the `Eval.Delete` endpoint are going to need this check to run in the state store itself for consistency. Refactor to push this check down into the state store to keep the eventual diff for that work reasonable.	2022-10-28 09:10:33 -04:00
Seth Hoenig	ec915b2436	build: update linters (#15063 ) Remove dead linters and add some interesting new ones.	2022-10-27 15:02:30 -05:00
Tim Gross	9c37a234e7	test: refactor EvalEndpoint_Delete (#15065 ) While working on filtering improvements to the `Eval.Delete` endpoint I noticed that this test was going to need to expand significantly and needed some refactoring to make that work nicely. In order to reduce the size of the eventual diff, I've pulled this refactoring out into its own changeset.	2022-10-27 15:29:22 -04:00
Charlie Voiselle	d57e333534	Update architecture-state-store.md (#15049 )	2022-10-27 14:03:43 -04:00
Tim Gross	8ac41c167f	Merge pull request #15062 from hashicorp/post-1.4.2-release Post 1.4.2 release	2022-10-27 13:38:36 -04:00
Tim Gross	2ce1728fa6	Merge release 1.4.2 files Changelog updates for 1.4.2 and backports.	2022-10-27 13:31:29 -04:00
hc-github-team-nomad-core	38b1c8a22a	Prepare for next release	2022-10-27 13:08:05 -04:00
hc-github-team-nomad-core	fbef8881cd	Generate files for 1.4.2 release	2022-10-27 13:08:05 -04:00
Tim Gross	9d906d4632	variables: fix filter on List RPC The List RPC correctly authorized against the prefix argument. But when filtering results underneath the prefix, it only checked authorization for standard ACL tokens and not Workload Identity. This results in WI tokens being able to read List results (metadata only: variable paths and timestamps) for variables under the `nomad/` prefix that belong to other jobs in the same namespace. Fixes the filtering and split the `handleMixedAuthEndpoint` function into separate authentication and authorization steps so that we don't need to re-verify the claim token on each filtered object. Also includes: * update semgrep rule for mixed auth endpoints * variables: List returns empty set when all results are filtered	2022-10-27 13:08:05 -04:00
James Rasell	da5069bded	event stream: ensure token expiry is correctly checked for subs. This change ensures that a token's expiry is checked before every event is sent to the caller. Previously, a token could still be used to listen for events after it had expired, as long as the subscription was made while it was unexpired. This would last until the token was garbage collected from state. The check occurs within the RPC as there is currently no state update when a token expires.	2022-10-27 13:08:05 -04:00
dependabot[bot]	81ac5d93f1	build(deps): bump github.com/kr/pretty from 0.3.0 to 0.3.1 in /api (#14859 ) * build(deps): bump github.com/kr/pretty from 0.3.0 to 0.3.1 in /api Bumps [github.com/kr/pretty](https://github.com/kr/pretty) from 0.3.0 to 0.3.1. - [Release notes](https://github.com/kr/pretty/releases) - [Commits](https://github.com/kr/pretty/compare/v0.3.0...v0.3.1) --- updated-dependencies: - dependency-name: github.com/kr/pretty dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * deps: update in root as well Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-27 11:58:00 -05:00
dependabot[bot]	7324d90ba7	build(deps): bump github.com/ryanuber/columnize (#14858 ) Bumps [github.com/ryanuber/columnize](https://github.com/ryanuber/columnize) from 2.1.1-0.20170703205827-abc90934186a+incompatible to 2.1.2+incompatible. - [Release notes](https://github.com/ryanuber/columnize/releases) - [Commits](https://github.com/ryanuber/columnize/commits/v2.1.2) --- updated-dependencies: - dependency-name: github.com/ryanuber/columnize dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-27 11:34:27 -05:00
dependabot[bot]	2f86f92d87	build(deps): bump github.com/shirou/gopsutil/v3 from 3.22.8 to 3.22.9 (#14857 ) Bumps [github.com/shirou/gopsutil/v3](https://github.com/shirou/gopsutil) from 3.22.8 to 3.22.9. - [Release notes](https://github.com/shirou/gopsutil/releases) - [Commits](https://github.com/shirou/gopsutil/compare/v3.22.8...v3.22.9) --- updated-dependencies: - dependency-name: github.com/shirou/gopsutil/v3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-27 11:33:50 -05:00
dependabot[bot]	07796965b1	build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1 (#14897 ) * build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1 Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.48.0 to 1.50.1. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](https://github.com/grpc/grpc-go/compare/v1.48.0...v1.50.1) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * cl: add changelog entry for grpc Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-27 11:32:48 -05:00
dependabot[bot]	eb210f2af7	build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0 (#14898 ) * build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0 Bumps [github.com/fsouza/go-dockerclient](https://github.com/fsouza/go-dockerclient) from 1.8.2 to 1.9.0. - [Release notes](https://github.com/fsouza/go-dockerclient/releases) - [Changelog](https://github.com/fsouza/go-dockerclient/blob/main/container_changes_test.go) - [Commits](https://github.com/fsouza/go-dockerclient/compare/v1.8.2...v1.9.0) --- updated-dependencies: - dependency-name: github.com/fsouza/go-dockerclient dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * cl: add changelog entry Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-27 11:05:45 -05:00
Seth Hoenig	4f3a1e6f7d	ci: use groups of tests in gha (#15018 ) * [no ci] use json for grouping packages for testing * [no ci] able to get packages in group * [no ci] able to run groups of tests * [no ci] more * [no ci] try disable circle unit tests * ci: use actions/checkout@v3 * ci: rename to quick * ci: need make dev in mods cache step * ci: make compile step depend on checks step * ci: bump consul and vault versions * ci: need make dev for group tests * ci: update ci unit testing docs * docs: spell plumbing correctly Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-27 09:02:58 -05:00
Charlie Voiselle	28cd831085	Update consul-template dep (#15045 )	2022-10-26 11:51:45 -04:00
Tim Gross	f29c781fa7	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lots of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	aca95c0bc6	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
Seth Hoenig	d69556fb35	client: ensure minimal cgroup controllers enabled (#15027 ) * client: ensure minimal cgroup controllers enabled This PR fixes a bug where Nomad could not operate properly on operating systems that set the root cgroup.subtree_control to a set of controllers that do not include the minimal set of controllers needed by Nomad. Nomad needs these controllers enabled to operate: - cpuset - cpu - io - memory - pids Now, Nomad will ensure these controllers are enabled during Client initialization, adding them to cgroup.subtree_control as necessary. This should be particularly helpful on the RHEL/CentOS/Fedora family of system. Ubuntu systems should be unaffected as they enable all controllers by default. Fixes: https://github.com/hashicorp/nomad/issues/14494 * docs: cleanup doc string * client: cleanup controller writes, enhance log messages	2022-10-24 16:08:54 -05:00
Tim Gross	c45d9a9ea8	keyring: refactor to hold locks for less time (#15026 ) Follow-up from https://github.com/hashicorp/nomad/pull/14987/files#r1003611644 We don't need to hold the lock when querying the state store, so move the read-lock to the interior of the `activeKeySet` function.	2022-10-24 16:23:44 -04:00
Zach Shilton	4dd0bd916b	docs: add details to redirects file (#15020 )	2022-10-24 13:16:07 -04:00
Seth Hoenig	32744a3548	deps: update hashicorp/raft to v1.3.11 (#15021 ) * deps: update hashicorp/raft to v1.3.11 Includes part of the fix for https://github.com/hashicorp/raft/issues/524 * cl: add changelog entry	2022-10-24 12:10:24 -05:00
Seth Hoenig	dd2999d6af	ci: add -core suffix to mods action (#15015 ) Forgot to add this line to the new mods action; without it, it creates a cache different from the one used by the other jobs.	2022-10-24 08:49:01 -05:00
Jai	f4138a88e0	refact: preserve promise.then behavior for acceptance tests (#15003 )	2022-10-24 09:04:39 -04:00
Tim Gross	b9922631bd	keyring: fix missing GC config, don't rotate on manual GC (#15009 ) The configuration knobs for root keyring garbage collection are present in the consumer and present in the user-facing config, but we missed the spot where we copy from one to the other. Fix this so that users can set their own thresholds. The root key is automatically rotated every ~30d, but the function that does both rotation and key GC was wired up such that `nomad system gc` caused an unexpected key rotation. Split this into two functions so that `nomad system gc` cleans up old keys without forcing a rotation, which will be done periodially or by the `nomad operator root keyring rotate` command.	2022-10-24 08:43:42 -04:00
Seth Hoenig	91d29e6449	ci: use the same go mod cache across test-core jobs (#15006 ) * ci: use the same go mod cache for test-core jobs * ci: precache go modules * ci: add a mods precache job	2022-10-21 17:38:45 -05:00
Tim Gross	3a811ac5e7	keyring: fixes for keyring replication on cluster join (#14987 ) * keyring: don't unblock early if rate limit burst exceeded The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests. * keyring: set MinQueryIndex on stale queries When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID. Ensure that we're setting the correct reply index in the blocking query. Note that the "not found" case does not return an error, just an empty key. So as a belt-and-suspenders, update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. * test for adding new servers to keyring * leader: initialize keyring after we have consistent reads Wait until we're sure the FSM is current before we try to initialize the keyring. Also, if a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans	2022-10-21 12:33:16 -04:00
Michael Schurter	9cac60dbed	test: use port collision instead of cpu exhaustion (#14994 ) Originally this test relied on Job 1 blocking Job 2 until Job 1 had a terminal ClientStatus. Job 2 ensured it would get blocked using 2 mechanisms: 1. A constraint requiring it is placed on the same node as Job 1. 2. Job 2 would require all unreserved CPU on the node to ensure it would be blocked until Job 1's resources were free. That 2nd assertion breaks if any previous job is still running on the target node! That seems very likely to happen in the flaky world of our e2e tests. In fact there may be some jobs we intentionally want running throughout; in hindsight it was never safe to assume my test would be the only thing scheduled when it ran. Ports to the rescue! Reserving a static port means that both Job 2 will now block on Job 1 being terminal. It will only conflict with other tests if those tests use that port on every node. I ensured no existing tests were using the port I chose. Other changes: - Gave job a bit more breathing room resource-wise. - Tightened timings a bit since previous failure ran into the `go test` time limit. - Cleaned up the DumpEvals output. It's quite nice and handy now!	2022-10-21 07:53:26 -07:00

1 2 3 4 5 ...

23987 commits