open-nomad

Author	SHA1	Message	Date
Giovani Avelar	a625de2062	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Tim Gross	80ec5e1346	fix panic from keyring raft entries being written during upgrade (#14821 ) During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader before one of the 1.3.x servers, the old server will crash because the keyring is initialized and writes a raft entry. Wait until all members are on a version that supports the keyring before initializing it.	2022-10-06 12:47:02 -04:00
Michael Schurter	0df5c7d5ae	test: fix flaky test (#14713 ) Need to wait for Stop evals to be processed before you can expect subsequent RPCs to see the alloc's DesiredStatus=stop.	2022-09-27 10:36:16 -07:00
Tim Gross	87681fca68	CSI: ensure initial unpublish state is checkpointed (#14675 ) A test flake revealed a bug in the CSI unpublish workflow, where an unpublish that comes from a client that's successfully done the node-unpublish step will not have the claim checkpointed if the controller-unpublish step fails. This will result in a delay in releasing the volume claim until the next GC. This changeset also ensures we're using a new snapshot after each write to raft, and fixes two timing issues in test where either the volume watcher can unpublish before the unpublish RPC is sent or we don't wait long enough in resource-restricted environments like GHA.	2022-09-27 08:43:45 -04:00
Seth Hoenig	87ec5fdee5	deps: update set and test (#14680 ) This PR updates go-set and shoenig/test, which introduced some breaking API changes.	2022-09-26 08:28:03 -05:00
Seth Hoenig	ae5b800085	cleanup: rearrange mocks package (#14660 ) This PR splits up the nomad/mock package into more files. Specific features that have a lot of mocks get their own file (e.g. acl, variables, csi, connect, etc.). Above that, functions that return jobs/allocs/nodes are in the job/alloc/node file. And lastly other mocks/helpers are in mock.go	2022-09-22 13:49:58 -05:00
Derek Strickland	6874997f91	scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments (#14659 ) * scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-22 14:31:27 -04:00
Florian Apolloner	f66d61e17f	consul: Removed unused ConsulUsage.Kinds. (#11303 )	2022-09-22 10:07:14 -05:00
Jorge Marey	584ddfe859	Add Namespace, Job and Group to envoy stats (#14311 )	2022-09-22 10:38:21 -04:00
Seth Hoenig	2088ca3345	cleanup more helper updates (#14638 ) * cleanup: refactor MapStringStringSliceValueSet to be cleaner * cleanup: replace SliceStringToSet with actual set * cleanup: replace SliceStringSubset with real set * cleanup: replace SliceStringContains with slices.Contains * cleanup: remove unused function SliceStringHasPrefix * cleanup: fixup StringHasPrefixInSlice doc string * cleanup: refactor SliceSetDisjoint to use real set * cleanup: replace CompareSliceSetString with SliceSetEq * cleanup: replace CompareMapStringString with maps.Equal * cleanup: replace CopyMapStringString with CopyMap * cleanup: replace CopyMapStringInterface with CopyMap * cleanup: fixup more CopyMapStringString and CopyMapStringInt * cleanup: replace CopySliceString with slices.Clone * cleanup: remove unused CopySliceInt * cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice * cleanup: replace CopyMap with maps.Clone * cleanup: run go mod tidy	2022-09-21 14:53:25 -05:00
Michael Schurter	bd4b4b8f66	Data race fixes in tests and a new semgrep rule (#14594 ) * test: don't use loop vars in goroutines fixes a data race in the test * test: copy objects in statestore before mutating fixes data race in test * test: @lgfa29's semgrep rule for loops/goroutines Found 2 places where we were improperly using loop variables inside goroutines.	2022-09-15 10:35:08 -07:00
Tim Gross	89dfdef95d	variables: handler should catch errors before conflicts (#14591 )	2022-09-14 13:14:17 -04:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save a few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Seth Hoenig	bf4dd30919	Merge pull request #14553 from hashicorp/f-nsd-check-watcher servicedisco: implement check_restart support for nomad service checks	2022-09-13 09:55:51 -05:00
Charlie Voiselle	6ab59d2aa6	var: Correct 0-index CAS Deletes (#14555 ) * Add missing 0 case for VarDeleteCAS, more comments * Add tests for VarDeleteCAS	2022-09-13 10:12:08 -04:00
Seth Hoenig	9a943107c7	servicedisco: implement check_restart for nomad service checks This PR implements support for check_restart for checks registered in the Nomad service provider. Unlike Consul, Nomad service checks never report a "warning" status, and so the check_restart.ignore_warnings configuration is not valid for Nomad service checks.	2022-09-13 08:59:23 -05:00
Seth Hoenig	b960925939	Merge pull request #14546 from hashicorp/f-refactor-check-watcher client: refactor check watcher to be reusable	2022-09-13 07:32:32 -05:00
Tim Gross	03312f3227	variables: restrict allowed paths for variables (#14547 ) Restrict variable paths to RFC3986 URL-safe characters that don't conflict with the use of characters "@" and "." in `template` blocks. This prevents users from writing variables that will require tricky templating syntax or that they simply won't be able to use. Also restrict the length so that a user can't make queries in the state store unusually expensive (as they are O(k) on the key length).	2022-09-12 16:37:33 -04:00
Seth Hoenig	feff36f3f7	client: refactor check watcher to be reusable This PR refactors agent/consul/check_watcher into client/serviceregistration, and abstracts away the Consul-specific check lookups. In doing so we should be able to reuse the existing check watcher logic for also watching NSD checks in a followup PR. A chunk of consul/unit_test.go is removed - we'll cover that in e2e tests in a followup PR if needed. In the long run I'd like to remove this whole file.	2022-09-12 10:13:31 -05:00
Derek Strickland	5ca934015b	job_endpoint: check spec for all regions (#14519 ) * job_endpoint: check spec for all regions	2022-09-12 09:24:26 -04:00
Charlie Voiselle	b55112714f	Vars: CLI commands for `var get`, `var put`, `var purge` (#14400 ) * Includes updates to `var init`	2022-09-09 17:55:20 -04:00
Charlie Voiselle	e58998e218	Add client scheduling eligibility to heartbeat (#14483 )	2022-09-08 14:31:36 -04:00
Tim Gross	3fc7482ecd	CSI: failed allocation should not block its own controller unpublish (#14484 ) A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC. The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server. For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path. Also includes two minor extra bugfixes: * Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps. * The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.	2022-09-08 13:30:05 -04:00
James Rasell	3fa8b0b270	client: fix RPC forwarding when querying checks for alloc. (#14498 ) When querying the checks for an allocation, the request must be forwarded to the agent that is running the allocation. If the initial request is made to a server agent, the request can be made directly to the client agent running the allocation. If the request is made to a client agent not running the alloc, the request needs to be forwarded to a server and then the correct client.	2022-09-08 16:55:23 +02:00
Tim Gross	2a961af44c	test: fix concurrent map access in `TestStatsFetcher` (#14496 ) The map of in-flight RPCs gets cleared by a goroutine in the test without first locking it to make sure that it's not being accessed concurrently by the stats fetcher itself. This can cause a panic in tests.	2022-09-08 10:41:15 -04:00
Tim Gross	5c57a84e99	autopilot: deflake tests (#14475 ) Includes: * Remove leader upgrade raft version test, as older versions of raft are now incompatible with our autopilot library. * Remove attempt to assert initial non-voter status on the `PromoteNonVoter` test, as this happens too quickly to reliably detect. * Unskip some previously-skipped tests which we should make stable. * Remove the `consul/sdk` retry helper for these tests; this uses panic recovery in a kind of a clever/gross way to reduce LoC but it seems to introduce some timing issues in the process. * Add more test step logging and reduce logging noise from the scheduler goroutines to make it easier to debug failing tests. * Be more consistent about using the `waitForStableLeadership` helper so that we can assert the cluster is fully stable and not just that we've added peers.	2022-09-07 09:35:01 -04:00
James Rasell	962b1f78e8	core: clarify ACL token expiry GC messages to show global param. (#14466 )	2022-09-06 15:42:45 +02:00
Kellen Fox	5086368a1e	Add a log line to help track node eligibility (#14125 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-09-06 14:03:33 +02:00
Yan	6e927fa125	warn destructive update only when count > 1 (#13103 )	2022-09-02 15:30:06 -04:00
Tim Gross	7921f044e5	migrate autopilot implementation to raft-autopilot (#14441 ) Nomad's original autopilot was importing from a private package in Consul. It has been moved out to a shared library. Switch Nomad to use this library so that we can eliminate the import of Consul, which is necessary to build Nomad ENT with the current version of the Consul SDK. This also will let us pick up autopilot improvements shared with Consul more easily.	2022-09-01 14:27:10 -04:00
Luiz Aoqui	19de803503	cli: ignore VaultToken when generating job diff (#14424 )	2022-09-01 10:01:53 -04:00
James Rasell	4b9bcf94da	chore: remove use of "err" as a log line context key for errors. (#14433 ) Log lines which include an error should use the full term "error" as the context key. This provides consistency across the codebase and avoids a Go style which operators might not be aware of.	2022-09-01 15:06:10 +02:00
Luiz Aoqui	dc6525336b	ci: fix TestNomad_BootstrapExpect_NonVoter test (#14407 ) PR #12130 refactored the test to use the `wantPeers` helper, but this function only returns the number of voting peers, which in this test should be equal to 2. I think the tests were passing back then because of a bug in Raft (https://github.com/hashicorp/raft/pull/483) where a non-voting server was able to transition to candidate state. One possible piece of evidence for this is that a successful test run would have the following log line: ``` raft@v1.3.5/raft.go:1058: nomad.raft: updating configuration: command=AddVoter server-id=127.0.0.1:9101 server-addr=127.0.0.1:9101 servers="[{Suffrage:Voter ID:127.0.0.1:9107 Address:127.0.0.1:9107} {Suffrage:Voter ID:127.0.0.1:9105 Address:127.0.0.1:9105} {Suffrage:Voter ID:127.0.0.1:9103 Address:127.0.0.1:9103} {Suffrage:Voter ID:127.0.0.1:9101 Address:127.0.0.1:9101}]" ``` This commit reverts the test logic to check for peer count, regardless of voting status.	2022-08-30 16:32:54 -04:00
Tim Gross	5784fb8c58	search: enforce correct ACL for search over variables (#14397 )	2022-08-30 13:27:31 -04:00
Tim Gross	c9d678a91a	keyring: wrap root key in key encryption key (#14388 ) Update the on-disk format for the root key so that it's wrapped with a unique per-key/per-server key encryption key. This is a bit of security theatre for the current implementation, but it uses `go-kms-wrapping` as the interface for wrapping the key. This provides a shim for future support of external KMS such as cloud provider APIs or Vault transit encryption. * Removes the JSON serialization extension we had on the `RootKey` struct; this struct is now only used for key replication and not for disk serialization, so we don't need this helper. * Creates a helper for generating cryptographically random slices of bytes that properly accounts for short reads from the source. * No observable functional changes outside of the on-disk format, so there are no test updates.	2022-08-30 10:59:25 -04:00
Seth Hoenig	52de2dc09d	Merge pull request #14290 from hashicorp/cleanup-more-helper-cleanup cleanup: tidy up helper package some more	2022-08-30 08:19:48 -05:00
James Rasell	755b4745ed	Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main	2022-08-30 08:59:13 +01:00
Seth Hoenig	3e1e2001b9	Merge pull request #14143 from hashicorp/cleanup-slice-sets-3 cleanup: more cleanup of slices that are really sets	2022-08-29 13:52:59 -05:00
Tim Gross	7d1eb2efd5	keyring: split structs to its own file (#14378 )	2022-08-29 14:18:35 -04:00
Seth Hoenig	9d0e274f27	cleanup: cleanup more slice-set comparisons	2022-08-29 12:04:21 -05:00
Tim Gross	62a968f443	Merge pull request #14351 from hashicorp/variables-rename Variables rename	2022-08-29 11:36:50 -04:00
Piotr Kazmierczak	5f353503e5	bugfix: fixed template validation panic in case of incorrect ChangeScript configuration (#14374 ) Fixes #14367	2022-08-29 17:11:15 +02:00
Tim Gross	1dc053b917	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
Tim Gross	dcfd31296b	file rename	2022-08-26 16:06:24 -04:00
Seth Hoenig	b87689d2d1	Merge pull request #14318 from hashicorp/cleanup-create-pointer-compare cleanup: create pointer.Compare helper function	2022-08-26 09:15:41 -05:00
Seth Hoenig	6b2655ad86	cleanup: create pointer.Compare helper function This PR creates a pointer.Compare helper for comparing equality of two pointers. Strictly only works with primitive types we know are safe to dereference and compare using '=='.	2022-08-26 08:55:59 -05:00
James Rasell	601588df6b	Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main	2022-08-25 12:14:29 +01:00
James Rasell	7a0798663d	acl: fix a bug where roles could be duplicated by name. An ACL role's name must be unique, however, a bug meant multiple roles of the same name could be created. This fixes that problem with checks in the RPC handler and state store.	2022-08-25 09:20:43 +01:00
Luiz Aoqui	e012d9411e	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own.
* taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle` otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Tim Gross	c732b215f0	vault: detect namespace change in config reload (#14298 ) The `namespace` field was not included in the equality check between old and new Vault configurations, which meant that a Vault config change that only changed the namespace would not be detected as a change and the clients would not be reloaded. Also, the comparison for boolean fields such as `enabled` and `allow_unauthenticated` was on the pointer and not the value of that pointer, which results in spurious reloads in real config reload that is easily missed in typical test scenarios. Includes a minor refactor of the order of fields for `Copy` and `Merge` to match the struct fields in hopes it makes it harder to make this mistake in the future, as well as additional test coverage.	2022-08-24 17:03:29 -04:00
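
The pointer-versus-value comparison issue described in c732b215f0 can be illustrated with a minimal Go sketch. This is not Nomad's actual code: `vaultConfig`, `ptrEqual`, `equalByPointer`, `equalByValue`, and `boolPtr` are hypothetical names chosen for illustration, and the struct is trimmed to two fields. It assumes Go 1.18 or later for generics.

```go
package main

import "fmt"

// vaultConfig is a hypothetical, trimmed-down stand-in for a config struct
// with an optional boolean field. It is not Nomad's real Vault config type.
type vaultConfig struct {
	Enabled   *bool
	Namespace string
}

// ptrEqual reports whether two pointers are both nil or point to equal values.
func ptrEqual[T comparable](a, b *T) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// equalByPointer shows the mistake: it compares the addresses held in the
// Enabled fields, so two configs parsed separately never look equal.
func equalByPointer(a, b vaultConfig) bool {
	return a.Enabled == b.Enabled && a.Namespace == b.Namespace
}

// equalByValue compares what the pointers refer to, and includes Namespace
// so that a namespace-only change is detected.
func equalByValue(a, b vaultConfig) bool {
	return ptrEqual(a.Enabled, b.Enabled) && a.Namespace == b.Namespace
}

// boolPtr returns a pointer to a fresh copy of v.
func boolPtr(v bool) *bool { return &v }

func main() {
	old := vaultConfig{Enabled: boolPtr(true), Namespace: "team-a"}
	same := vaultConfig{Enabled: boolPtr(true), Namespace: "team-a"}
	moved := vaultConfig{Enabled: boolPtr(true), Namespace: "team-b"}

	fmt.Println(equalByPointer(old, same)) // false: would trigger a spurious reload
	fmt.Println(equalByValue(old, same))   // true: nothing changed
	fmt.Println(equalByValue(old, moved))  // false: namespace change detected
}
```

In a unit test that reuses one pointer for both the old and the new config, the pointer comparison happens to pass, which is one plausible way such a bug stays hidden until two independently parsed configs are compared.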

4143 commits