open-nomad

Author	SHA1	Message	Date
Charlie Voiselle	79c4478f5b	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Tim Gross	672fb46d12	WI: set identity to client secret if missing (#15121 ) Allocations created before 1.4.0 will not have a workload identity token. When the client running these allocs is upgraded to 1.4.x, the identity hook will run and replace the node secret ID token used previously with an empty string. This causes service discovery queries to fail. Fallback to the node's secret ID when the allocation doesn't have a signed identity. Note that pre-1.4.0 allocations won't have templates that read Variables, so there's no threat that this new node ID secret will be able to read data that the allocation shouldn't have access to.	2022-11-03 11:10:11 -04:00
Seth Hoenig	e66c9ede24	e2e: convert flaky exec download in chroot unit test into e2e test (#14949 ) Similar to https://github.com/hashicorp/nomad/pull/14710, convert flaky test into e2e test.	2022-10-19 08:22:32 -05:00
Hemanth Krishna	e516fc266f	enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775 ) Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>	2022-10-06 16:33:28 -04:00
Luiz Aoqui	b924802958	template: apply splay value on change_mode script (#14749 ) Previously, the splay timeout was only applied if a template re-render caused a restart or a signal action. The `change_mode = "script"` was running after the `if restart \|\| len(signals) != 0` check, so it was invoked at all times. This change refactors the logic so it's easier to notice that new `change_mode` options should start only after `splay` is applied.	2022-09-30 12:04:22 -04:00
Seth Hoenig	c68ed3b4c8	client: protect user lookups with global lock (#14742 ) * client: protect user lookups with global lock This PR updates Nomad client to always do user lookups while holding a global process lock. This is to prevent concurrency unsafe implementations of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo). * cl: add cl	2022-09-29 09:30:13 -05:00
Michael Schurter	0e95fb03c0	test: skip chown test if nonroot (#14738 ) CI always runs this as root, so it worked there and always scared me when I ran it locally.	2022-09-28 14:45:38 -07:00
Seth Hoenig	7235d9988b	e2e: convert chroot env unit tests into e2e tests (#14710 ) This PR translates two of our most flakey unit tests into e2e tests where they are fit much more naturally.	2022-09-26 15:40:29 -05:00
Luiz Aoqui	5c100c0d3d	client: recover from getter panics (#14696 ) The artifact getter uses the go-getter library to fetch files from different sources. Any bug in this library that results in a panic can cause the entire Nomad client to crash due to a single file download attempt. This change aims to guard against this types of crashes by recovering from panics when the getter attempts to download an artifact. The resulting panic is converted to an error that is stored as a task event for operator visibility and the panic stack trace is logged to the client's log.	2022-09-26 15:16:26 -04:00
Jorge Marey	92158a1c62	connect: add nomad env to envoy bootstrap (#12959 ) * Add nomad env to envoy bootstrap * Add changelog file	2022-09-22 13:18:18 -05:00
Jorge Marey	584ddfe859	Add Namespace, Job and Group to envoy stats (#14311 )	2022-09-22 10:38:21 -04:00
Seth Hoenig	2088ca3345	cleanup more helper updates (#14638 ) * cleanup: refactor MapStringStringSliceValueSet to be cleaner * cleanup: replace SliceStringToSet with actual set * cleanup: replace SliceStringSubset with real set * cleanup: replace SliceStringContains with slices.Contains * cleanup: remove unused function SliceStringHasPrefix * cleanup: fixup StringHasPrefixInSlice doc string * cleanup: refactor SliceSetDisjoint to use real set * cleanup: replace CompareSliceSetString with SliceSetEq * cleanup: replace CompareMapStringString with maps.Equal * cleanup: replace CopyMapStringString with CopyMap * cleanup: replace CopyMapStringInterface with CopyMap * cleanup: fixup more CopyMapStringString and CopyMapStringInt * cleanup: replace CopySliceString with slices.Clone * cleanup: remove unused CopySliceInt * cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice * cleanup: replace CopyMap with maps.Clone * cleanup: run go mod tidy	2022-09-21 14:53:25 -05:00
Luiz Aoqui	25a63195da	test: remove flaky Gate test (#14575 ) The concurrent gate access test is flaky since it depends on the order of operations of two concurrent goroutines. Despite the heavy bias towards one of the results, it's still possible to end the execution with a closed gate. I believe this case was created to test an earlier implementation where the gate state was stored and mutated internally, so the access had to be protected by a lock. However, the final implementation changed this approach to be only channel-based, so there is no need for this flaky test anymore.	2022-09-19 11:31:03 -04:00
Seth Hoenig	9a943107c7	servicedisco: implement check_restart for nomad service checks This PR implements support for check_restart for checks registered in the Nomad service provider. Unlike Consul, Nomad service checks never report a "warning" status, and so the check_restart.ignore_warnings configuration is not valid for Nomad service checks.	2022-09-13 08:59:23 -05:00
Seth Hoenig	31234d6a62	cleanup: consolidate interfaces for workload restarting This PR combines two of the same interface definitions around workload restarting	2022-09-09 08:59:04 -05:00
James Rasell	4b9bcf94da	chore: remove use of "err" a log line context key for errors. (#14433 ) Log lines which include an error should use the full term "error" as the context key. This provides consistency across the codebase and avoids a Go style which operators might not be aware of.	2022-09-01 15:06:10 +02:00
Charlie Voiselle	5c0e34dd33	Vars: Update CT dependency to support variables. (#14399 ) * Update Consul Template dep to support Nomad vars * Remove `Peering` config for Consul Testservers Upgrading to the 1.14 Consul SDK introduces and additional default configuration—`Peering`—that is not compatible with versions of Consul before v1.13.0. because Nomad tests against Consul v1.11.1, this configuration has to be nil'ed out before passing it to the Consul binary.	2022-08-30 15:26:01 -04:00
Tim Gross	cc9b480996	testing: setting env var incompatible with parallel tests (#14405 ) Neither the `os.Setenv` nor `t.Setenv` helper are safe to use in parallel tests because environment variables are process-global. The stdlib panics if you try to do this. Remove the `ci.Parallel()` call from all tests where we're setting environment variables.	2022-08-30 14:49:03 -04:00
Seth Hoenig	52de2dc09d	Merge pull request #14290 from hashicorp/cleanup-more-helper-cleanup cleanup: tidy up helper package some more	2022-08-30 08:19:48 -05:00
Luiz Aoqui	e012d9411e	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpar, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be clear cleanly using `clearDriverHandle` otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loops starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting an taskRunner into two methods: - `Restart` will retain the current behaviour and only will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Seth Hoenig	062c817450	cleanup: move fs helpers into escapingfs	2022-08-24 14:45:34 -05:00
Piotr Kazmierczak	7077d1f9aa	template: custom change_mode scripts (#13972 ) This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707	2022-08-24 17:43:01 +02:00
Luiz Aoqui	d3be0abf61	ci: fix gofmt on tasklifecycle (#14232 )	2022-08-23 15:47:15 -04:00
Luiz Aoqui	7a8cacc9ec	allocrunner: refactor task coordinator (#14009 ) The current implementation for the task coordinator unblocks tasks by performing destructive operations over its internal state (like closing channels and deleting maps from keys). This presents a problem in situations where we would like to revert the state of a task, such as when restarting an allocation with tasks that have already exited. With this new implementation the task coordinator behaves more like a finite state machine where task may be blocked/unblocked multiple times by performing a state transition. This initial part of the work only refactors the task coordinator and is functionally equivalent to the previous implementation. Future work will build upon this to provide bug fixes and enhancements.	2022-08-22 18:38:49 -04:00
Luiz Aoqui	dbffdca92e	template: use pointer values for gid and uid (#14203 ) When a Nomad agent starts and loads jobs that already existed in the cluster, the default template uid and gid was being set to 0, since this is the zero value for int. This caused these jobs to fail in environments where it was not possible to use 0, such as in Windows clients. In order to differentiate between an explicit 0 and a template where these properties were not set we need to use a pointer.	2022-08-22 16:25:49 -04:00
Piotr Kazmierczak	b63944b5c1	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
Seth Hoenig	b3ea68948b	build: run gofmt on all go source files Go 1.19 will forecefully format all your doc strings. To get this out of the way, here is one big commit with all the changes gofmt wants to make.	2022-08-16 11:14:11 -05:00
Derek Strickland	77df9c133b	Add Nomad RetryConfig to agent template config (#13907 ) * add Nomad RetryConfig to agent template config	2022-08-03 16:56:30 -04:00
Piotr Kazmierczak	530280505f	client: enable specifying user/group permissions in the template stanza (#13755 ) * Adds Uid/Gid parameters to template. * Updated diff_test * fixed order * update jobspec and api * removed obsolete code * helper functions for jobspec parse test * updated documentation * adjusted API jobs test. * propagate uid/gid setting to job_endpoint * adjusted job_endpoint tests * making uid/gid into pointers * refactor * updated documentation * updated documentation * Update client/allocrunner/taskrunner/template/template_test.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update website/content/api-docs/json-jobs.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * propagating documentation change from Luiz * formatting * changelog entry * changed changelog entry Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-08-02 22:15:38 +02:00
Eric Weber	cbce13c1ac	Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919 ) * Allow specification of CSI staging and publishing directory path * Add website documentation for stage_publish_dir * Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir * Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir	2022-08-02 09:42:44 -04:00
Seth Hoenig	606e3ebdd4	client: updates from pr feedback	2022-07-21 09:54:27 -05:00
Seth Hoenig	297d386bdc	client: add support for checks in nomad services This PR adds support for specifying checks in services registered to the built-in nomad service provider. Currently only HTTP and TCP checks are supported, though more types could be added later.	2022-07-12 17:09:50 -05:00
Tim Gross	bfcbc00f4e	workload identity (#13223 ) In order to support implicit ACL policies for tasks to get their own secrets, each task would need to have its own ACL token. This would add extra raft overhead as well as new garbage collection jobs for cleaning up task-specific ACL tokens. Instead, Nomad will create a workload Identity Claim for each task. An Identity Claim is a JSON Web Token (JWT) signed by the server’s private key and attached to an Allocation at the time a plan is applied. The encoded JWT can be submitted as the X-Nomad-Token header to replace ACL token secret IDs for the RPCs that support identity claims. Whenever a key is is added to a server’s keyring, it will use the key as the seed for a Ed25519 public-private private keypair. That keypair will be used for signing the JWT and for verifying the JWT. This implementation is a ruthlessly minimal approach to support the secure variables feature. When a JWT is verified, the allocation ID will be checked against the Nomad state store, and non-existent or terminal allocation IDs will cause the validation to be rejected. This is sufficient to support the secure variables feature at launch without requiring implementation of a background process to renew soon-to-expire tokens.	2022-07-11 13:34:05 -04:00
Seth Hoenig	5dd8aa3e27	client: enforce max_kill_timeout client configuration This PR fixes a bug where client configuration max_kill_timeout was not being enforced. The feature was introduced in 9f44780 but seems to have been removed during the major drivers refactoring. We can make sure the value is enforced by pluming it through the DriverHandler, which now uses the lesser of the task.killTimeout or client.maxKillTimeout. Also updates Event.SetKillTimeout to require both the task.killTimeout and client.maxKillTimeout so that we don't make the mistake of using the wrong value - as it was being given only the task.killTimeout before.	2022-07-06 15:29:38 -05:00
Derek Strickland	7d6a3df197	csi_hook: valid if any driver supports csi (#13446 ) * csi_hook: valid if any driver supports csi volumes	2022-06-22 10:43:43 -04:00
Jeffrey Clark	a97699221c	cni: add loopback to linux bridge (#13428 ) CNI changed how to bring up the interface in v0.2.0. Support was moved to a new loopback plugin. https://github.com/containernetworking/cni/pull/121 Fixes #10014	2022-06-20 11:22:53 -04:00
Grant Griffiths	99896da443	CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340 ) Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2022-06-14 10:04:16 -04:00
Derek Strickland	12f3ee46ea	alloc_runner: stop sidecar tasks last (#13055 ) alloc_runner: stop sidecar tasks last	2022-06-07 11:35:19 -04:00
Radek Simko	9cc71d6665	client/allochealth: add healthy_deadline as context to error messages (#13214 )	2022-06-06 10:11:08 -04:00
Michael Schurter	2965dc6a1a	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00
Seth Hoenig	65f7abf2f4	cli: update default redis and use nomad service discovery Closes #12927 Closes #12958 This PR updates the version of redis used in our examples from 3.2 to 7. The old version is very not supported anymore, and we should be setting a good example by using a supported version. The long-form example job is now fixed so that the service stanza uses nomad as the service discovery provider, and so now the job runs without a requirement of having Consul running and configured.	2022-05-17 10:24:19 -05:00
Seth Hoenig	26b5c01431	Merge pull request #12817 from twunderlich-grapl/fix-network-interpolation Fix network.dns interpolation	2022-05-17 09:31:32 -05:00
Eng Zer Jun	97d1bc735c	test: use `T.TempDir` to create temporary test directory (#12853 ) * test: use `T.TempDir` to create temporary test directory This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The directory created by `t.TempDir` is automatically removed when the test and all its subtests complete. Prior to this commit, temporary directory created using `ioutil.TempDir` needs to be removed manually by calling `os.RemoveAll`, which is omitted in some tests. The error handling boilerplate e.g. defer func() { if err := os.RemoveAll(dir); err != nil { t.Fatal(err) } } is also tedious, but `t.TempDir` handles this for us nicely. Reference: https://pkg.go.dev/testing#T.TempDir Signed-off-by: Eng Zer Jun <engzerjun@gmail.com> * test: fix TestLogmon_Start_restart on Windows Signed-off-by: Eng Zer Jun <engzerjun@gmail.com> * test: fix failing TestConsul_Integration t.TempDir fails to perform the cleanup properly because the folder is still in use testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2022-05-12 11:42:40 -04:00
Seth Hoenig	96ec19788d	cgroups: make sure cgroup still exists after task restart This PR modifies raw_exec and exec to ensure the cgroup for a task they are driving still exists during a task restart. These drivers have the same bug but with different root cause. For raw_exec, we were removing the cgroup in 2 places - the cpuset manager, and in the unix containment implementation (the thing that uses freezer cgroup to clean house). During a task restart, the containment would remove the cgroup, and when the task runner hooks went to start again would block on waiting for the cgroup to exist, which will never happen, because it gets created by the cpuset manager which only runs as an alloc pre-start hook. The fix here is to simply not delete the cgroup in the containment implementation; killing the PIDs is enough. The removal happens in the cpuset manager later anyway. For exec, it's the same idea, except DestroyTask is called on task failure, which in turn calls into libcontainer, which in turn deletes the cgroup. In this case we do not have control over the deletion of the cgroup, so instead we hack the cgroup back into life after the call to DestroyTask. All of this only applies to cgroups v2.	2022-05-05 09:51:03 -05:00
Thomas Wunderlich	245d2a463b	Fix formatting	2022-04-29 10:02:20 -04:00
Thomas Wunderlich	c86e287de9	Remove debug log lines	2022-04-28 19:14:31 -04:00
Thomas Wunderlich	960e192359	Quick and dirty hack to get interpolated dns values working	2022-04-28 17:09:53 -04:00
Tim Gross	3d630a3629	CSI: enforce one plugin supervisor loop via `sync.Once` (#12785 ) We enforce exactly one plugin supervisor loop by checking whether `running` is set and returning early. This works but is fairly subtle. It can briefly result in two goroutines where one quickly exits before doing any work. Clarify the intent by using `sync.Once`. The goroutine we've spawned only exits when the entire task runner is being torn down, and not when the task driver restarts the workload, so it should never be re-run.	2022-04-26 10:38:50 -04:00
Tim Gross	766025cde7	CSI: plugin supervisor prestart should not mark itself done (#12752 ) The task runner hook `Prestart` response object includes a `Done` field that's intended to tell the client not to run the hook again. The plugin supervisor creates mount points for the task during prestart and saves these mounts in the hook resources. But if a client restarts the hook resources will not be populated. If the plugin task restarts at any time after the client restarts, it will fail to have the correct mounts and crash loop until restart attempts run out. Fix this by not returning `Done` in the response, just as we do for the `volume_mount_hook`.	2022-04-22 13:07:47 -04:00
Gowtham	1ff8b5f759	Add Concurrent Download Support for artifacts (#11531 ) * add concurrent download support - resolves #11244 * format imports * mark `wg.Done()` via `defer` * added tests for successful and failure cases and resolved some goleak * docs: add changelog for #11531 * test typo fixes and improvements Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-04-20 10:15:56 -07:00

1 2 3 4 5 ...

671 commits