open-nomad

Author	SHA1	Message	Date
James Rasell	222592a07e	client: track service deregister call so it's only called once. In certain task lifecycles the taskrunner service deregister call could be called three times for a task that is exiting. Whilst each hook caller of deregister has its own purpose, we should try and ensure it is only called once during the shutdown lifecycle of a task. This change therefore tracks when deregister has been called, so that subsequent calls are noop. In the event the task is restarting, the deregister value is reset to ensure proper operation.	2022-02-11 09:29:38 +01:00
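The once-only tracking described in commit 222592a07e can be sketched roughly as below. This is a minimal illustration, not Nomad's actual hook code: the `serviceHook` type, its fields, and the `restart` method are hypothetical names, and the `calls` counter stands in for the real deregistration request.

```go
package main

import (
	"fmt"
	"sync"
)

// serviceHook is a hypothetical stand-in for the task runner's service hook.
type serviceHook struct {
	mu           sync.Mutex
	deregistered bool // true once deregister has run in this task lifecycle
	calls        int  // counts how often the (simulated) backend call happened
}

// deregister performs the backend call at most once until restart is called.
func (h *serviceHook) deregister() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.deregistered {
		return // subsequent calls are no-ops
	}
	h.calls++ // placeholder for the real deregistration request
	h.deregistered = true
}

// restart resets the flag so the next shutdown deregisters again.
func (h *serviceHook) restart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.deregistered = false
}

func main() {
	h := &serviceHook{}
	// Several hooks may call deregister while the task is exiting.
	h.deregister()
	h.deregister()
	h.deregister()
	fmt.Println("calls after shutdown:", h.calls)

	h.restart()
	h.deregister()
	fmt.Println("calls after restart and second shutdown:", h.calls)
}
```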
Kevin Schoonover	1dcfff2f70	fingerprint: remove metadata from digitalocean (#12032)	2022-02-09 07:31:45 -05:00
Tim Gross	21bd4835bd	fingerprint: digitalocean fingerprint test requires metadata header (#12028)	2022-02-08 16:35:13 -05:00
Seth Hoenig	5cb365b36b	env: update aws cpu configs By running the tools/ec2info tool	2022-02-08 12:44:00 -06:00
Kevin Schoonover	b13573d4ab	address comments Co-authored-by: Seth Hoenig <seth.a.hoenig@gmail.com>	2022-02-07 09:03:48 -08:00
Kevin Schoonover	68eeaa7a18	small fixes	2022-02-05 22:23:43 -08:00
Kevin Schoonover	5523275e95	add digitalocean fingerprinter	2022-02-05 22:17:36 -08:00
Karthick Ramachandran	0600bc32e2	improve error message on service length (#12012)	2022-02-04 19:39:34 -05:00
Seth Hoenig	5f48e18189	Merge pull request #11983 from hashicorp/b-select-after cleanup: prevent leaks from time.After	2022-02-03 09:38:06 -06:00
Samantha	54f8c04c91	Fix health checking for ephemeral poststart tasks (#11945) Update the logic in the Nomad client's alloc health tracker which erroneously marks existing healthy allocations with dead poststart ephemeral tasks as unhealthy even if they were already successful during a previous deployment.	2022-02-02 16:29:49 -05:00
Seth Hoenig	db2347a86c	cleanup: prevent leaks from time.After This PR replaces use of time.After with a safe helper function that creates a time.Timer to use instead. The new function returns both a time.Timer and a Stop function that the caller must handle. Unlike time.NewTimer, the helper function does not panic if the duration set is <= 0.	2022-02-02 14:32:26 -06:00
Seth Hoenig	04f84bcdfe	deps: import libtime the normal way Previously we copied this library by hand to avoid vendor-ing a bunch of files related to minimock. Now that we no longer vendor, just import the library normally. Also we might use more of the library for handling `time.After` uses, for which this library provides a Context-based solution.	2022-01-31 14:49:05 -06:00
Tim Gross	66b4b28b1a	CSI: node unmount from the client before unpublish RPC (#11892) When an allocation stops, the `csi_hook` makes an unpublish RPC to the servers to unpublish via the CSI RPCs: first to the node plugins and then the controller plugins. The controller RPCs must happen after the node RPCs so that the node has had a chance to unmount the volume before the controller tries to detach the associated device. But the client has local access to the node plugins and can independently determine if it's safe to send unpublish RPC to those plugins. This will allow the server to treat the node plugin as abandoned if a client is disconnected and `stop_on_client_disconnect` is set. This will let the server try to send unpublish RPCs to the controller plugins, under the assumption that the client will be trying to unmount the volume on its end first. Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can return ignorable errors in the case where the volume has already been unmounted from the node. Handle all other errors by retrying until we get success so as to give operators the opportunity to reschedule a failed node plugin (ex. in the case where they accidentally drained a node without `-ignore-system`). Fan-out the work for each volume into its own goroutine so that we can release a subset of volumes if only one is stuck.	2022-01-28 08:30:31 -05:00
Seth Hoenig	cade04d3f6	client: change test to not poke cgroupv2 edge case This PR tweaks the TestCpusetManager_AddAlloc unit test to not break when being run on a machine using cgroupsv2. The behavior of writing an empty cpuset.cpu changes in cgroupv2, where such a group now inherits the value of its parent group, rather than remaining empty. The test in question was written such that a task would consume all available cores shared on an alloc, causing the empty set to be written to the shared group, which works fine on cgroupsv1 but breaks on cgroupsv2. By adjusting the test to consume only 1 core instead of all cores, it no longer triggers that edge case. The actual fix for the new cgroupsv2 behavior will be in #11933	2022-01-27 08:27:40 -06:00
Derek Strickland	b3c8ab9be7	Update IsEmpty to check for pre-1.2.4 fields (#11930)	2022-01-26 11:31:37 -05:00
Seth Hoenig	4650e97d29	deps: upgrade docker and runc This PR upgrades - docker dependency to the latest tagged release (v20.10.12) - runc dependency to the latest tagged release (v1.0.3) Docker does not abide by [semver](https://github.com/moby/moby/issues/39302), so it is marked +incompatible, and transitive dependencies are upgraded manually. Runc made three relevant breaking changes * cgroup manager .Set changed to accept Resources instead of Cgroup `3f65946756` * config.Device moved to devices.Device https://github.com/opencontainers/runc/pull/2679 * mountinfo.Mounted now returns an error if the specified path does not exist https://github.com/moby/sys/blob/mountinfo/v0.5.0/mountinfo/mountinfo.go#L16	2022-01-18 08:35:26 -06:00
James Rasell	7205b3f08e	Merge pull request #11402 from hashicorp/document-client-initial-vault-renew taskrunner: add clarifying initial vault token renew comment.	2022-01-13 16:21:58 +01:00
Alessandro De Blasis	e647549ecf	metrics: added `mapped_file` metric (#11500) Signed-off-by: Alessandro De Blasis <alex@deblasis.net> Co-authored-by: Nate <37554478+servusdei2018@users.noreply.github.com>	2022-01-10 15:35:19 -05:00
grembo	edd3b8a20c	Un-break templates when using vault stanza change_mode noop (#11783) Templates in nomad jobs make use of the vault token defined in the vault stanza when issuing credentials like client certificates. When using change_mode "noop" in the vault stanza, consul-template is not informed in case a vault token is re-issued (which can happen from time to time for various reasons, as described in https://www.nomadproject.io/docs/job-specification/vault). As a result, consul-template will keep using the old vault token to renew credentials and - once the token expired - stop renewing credentials. The symptom of this problem is a vault_token file that is newer than the issued credential (e.g., TLS certificate) in a job's /secrets directory. This change corrects this, so that h.updater.updatedVaultToken(token) is called, which will inform stakeholders about the new token and make sure the new token is used by consul-template. Example job template fragment: vault { policies = ["nomad-job-policy"] change_mode = "noop" } template { data = <<-EOH {{ with secret "pki_int/issue/nomad-job" "common_name=myjob.service.consul" "ttl=90m" "alt_names=localhost" "ip_sans=127.0.0.1"}} {{ .Data.certificate }} {{ .Data.private_key }} {{ .Data.issuing_ca }} {{ end }} EOH destination = "${NOMAD_SECRETS_DIR}/myjob.crt" change_mode = "noop" } This fix does not alter the meaning of the three change modes of vault - "noop" - Take no action - "restart" - Restart the job - "signal" - send a signal to the task as the switch statement following line 232 contains the necessary logic. It is assumed that "take no action" was never meant to mean "don't tell consul-template about the new vault token". Successfully tested in a staging cluster consisting of multiple nomad client nodes.	2022-01-10 14:41:38 -05:00
Conor Evans	8d622797af	replace 'a alloc' with 'an alloc' where appropriate (#11792)	2022-01-10 11:59:46 -05:00
Derek Strickland	0a8e03f0f7	Expose Consul template configuration parameters (#11606) This PR exposes the following existing `consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza. - `wait` It also exposes the following `consul-template` configuration to Nomad operators in the `{client.template}` stanza. - `max_stale` - `block_query_wait` - `consul_retry` - `vault_retry` - `wait` Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure. - `wait_bounds` Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-01-10 10:19:07 -05:00
Tim Gross	5eda9be7b0	CSI: tests to exercise csi_hook (#11788) Small refactoring of the allocrunner hook for CSI to make it more testable, and a unit test that covers most of its logic.	2022-01-07 15:23:47 -05:00
Arkadiusz	ffb174b596	Fix log streaming missing frames (#11721) Perform one more read after receiving cancel when streaming file from the allocation API	2022-01-04 14:07:16 -05:00
Tim Gross	265e488ab4	task runner: fix goroutine leak in prestart hook (#11741) The task runner prestart hooks take a `joincontext` so they have the option to exit early if either of two contexts are canceled: from killing the task or client shutdown. Some tasks exit without being shutdown from the server, so neither of the joined contexts ever gets canceled and we leak the `joincontext` (48 bytes) and its internal goroutine. This primarily impacts batch jobs and any task that fails or completes early such as non-sidecar prestart lifecycle tasks. Cancel the `joincontext` after the prestart call exits to fix the leak.	2021-12-23 11:50:51 -05:00
Luiz Aoqui	4bdd2c84e3	fix host network reserved port fingerprint (#11728)	2021-12-22 15:29:54 -05:00
James Rasell	45f4689f9c	chore: fixup inconsistent method receiver names. (#11704)	2021-12-20 11:44:21 +01:00
Tim Gross	a0cf5db797	provide `-no-shutdown-delay` flag for job/alloc stop (#11596) Some operators use very long group/task `shutdown_delay` settings to safely drain network connections to their workloads after service deregistration. But during incident response, they may want to cause that drain to be skipped so they can quickly shed load. Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and `nomad job stop` commands that bypasses the delay. This sets a new desired transition state on the affected allocations that the allocation/task runner will identify during pre-kill on the client. Note (as documented here) that using this flag will almost always result in failed inbound network connections for workloads as the tasks will exit before clients receive updated service discovery information and won't be gracefully drained.	2021-12-13 14:54:53 -05:00
Tim Gross	6e1311a265	client: respect `client_auto_join` after connection loss (#11585) The `consul.client_auto_join` configuration block tells the Nomad client whether to use Consul service discovery to find Nomad servers. By default it is set to `true`, but contrary to the documentation it was only respected during the initial client registration. If a client missed a heartbeat, failed a `Node.UpdateStatus` RPC, or if there was no Nomad leader, the client would fall back to Consul even if `client_auto_join` was set to `false`. This changeset returns early from the client's trigger for Consul discovery if the `client_auto_join` field is set to `false`.	2021-11-30 13:20:42 -05:00
pavel	06349676de	docs: fix typo in the comment comment in the source code for Logger: thhe -> the	2021-11-25 00:35:45 +01:00
Luiz Aoqui	0cf1964651	Merge remote-tracking branch 'origin/release-1.2.2' into merge-release-1.2.2-branch	2021-11-24 14:40:45 -05:00
Nomad Release Bot	2e4ef67c2d	remove generated files	2021-11-24 18:54:50 +00:00
Luiz Aoqui	d3c1a03edd	Merge tag 'v1.2.1' into merge-release-1.2.1-branch Version 1.2.1	2021-11-22 10:47:04 -05:00
Danish Prakash	1e2c9b3aa0	client: emit max_memory metric (#11490)	2021-11-17 08:34:22 -05:00
Nomad Release bot	c4463682e7	Generate files for 1.2.0 release	2021-11-15 23:00:30 +00:00
Dave May	3c04d7927b	cli: refactor operator debug capture (#11466) * debug: refactor Consul API collection * debug: refactor Vault API collection * debug: cleanup test timing * debug: extend test to multiregion * debug: save cmdline flags in bundle * debug: add cli version to output * Add changelog entry	2021-11-05 19:43:10 -04:00
Alessandro De Blasis	07c670fdc0	cli: show `host_network` in `nomad status` (#11432) Enhance the CLI in order to return the host network in two flavors (default, verbose) of the `node status` command. Fixes: #11223. Signed-off-by: Alessandro De Blasis <alex@deblasis.net>	2021-11-05 09:02:46 -04:00
James Rasell	e3537a06bb	taskrunner: add clarifying initial vault token renew comment.	2021-10-28 17:09:22 +02:00
Michael Schurter	fd68bbc342	test: update tests to properly use AllocDir Also use t.TempDir when possible.	2021-10-19 10:49:07 -07:00
Michael Schurter	10c3bad652	client: never embed alloc_dir in chroot Fixes #2522 Skip embedding client.alloc_dir when building chroot. If a user configures a Nomad client agent so that the chroot_env will embed the client.alloc_dir, Nomad will happily infinitely recurse while building the chroot until something horrible happens. The best case scenario is the filesystem's path length limit is hit. The worst case scenario is disk space is exhausted. A bad agent configuration will look something like this: ```hcl data_dir = "/tmp/nomad-badagent" client { enabled = true chroot_env { # Note that the source matches the data_dir "/tmp/nomad-badagent" = "/ohno" # ... } } ``` Note that `/ohno/client` (the state_dir) will still be created but not `/ohno/alloc` (the alloc_dir). While I cannot think of a good reason why someone would want to embed Nomad's client (and possibly server) directories in chroots, there should be no cause for harm. chroots are only built when Nomad runs as root, and Nomad disables running exec jobs as root by default. Therefore even if client state is copied into chroots, it will be inaccessible to tasks. Skipping the `data_dir` and `{client,server}.state_dir` is possible, but this PR attempts to implement the minimum viable solution to reduce risk of unintended side effects or bugs. When running tests as root in a vm without the fix, the following error occurs: ``` === RUN TestAllocDir_SkipAllocDir alloc_dir_test.go:520: Error Trace: alloc_dir_test.go:520 Error: Received unexpected error: Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long Test: TestAllocDir_SkipAllocDir --- FAIL: TestAllocDir_SkipAllocDir (22.76s) ``` Also removed unused Copy methods on AllocDir and TaskDir structs. 
Thanks to @eveld for not letting me forget about this!	2021-10-18 09:22:01 -07:00
James Rasell	444d25db07	Merge pull request #11280 from benbuzbee/log-err Log error if there are no event handlers registered	2021-10-14 14:49:22 +02:00
Michael Schurter	59fda1894e	Merge pull request #11167 from a-zagaevskiy/master Support configurable dynamic port range	2021-10-13 16:47:38 -07:00
Ben Buzbee	573fb840fa	Log error if there are no event handlers registered We see this error all the time ``` no handler registered for event event.Message=, event.Annotations=, event.Timestamp=0001-01-01T00:00:00Z, event.TaskName=, event.AllocID=, event.TaskID=, ``` So we're handling an event with all default fields. I noted that this can happen if only err is set as in ``` func (d driverPluginClient) handleTaskEvents(reqCtx context.Context, ch chan TaskEvent, stream proto.Driver_TaskEventsClient) { defer close(ch) for { ev, err := stream.Recv() if err != nil { if err != io.EOF { ch <- &TaskEvent{ Err: grpcutils.HandleReqCtxGrpcErr(err, reqCtx, d.doneCtx), } } ``` In this case Err fails to be serialized by the logger, see this test ``` ev := &drivers.TaskEvent{ Err: fmt.Errorf("errz"), } i.logger.Warn("ben test", "event", ev) i.logger.Warn("ben test2", "event err str", ev.Err.Error()) i.logger.Warn("ben test3", "event err", ev.Err) ev.Err = nil i.logger.Warn("ben test4", "nil error", ev.Err) 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.643900Z","driver":"mock_driver","event":{"TaskID":"","TaskName":"","AllocID":"","Timestamp":"0001-01-01T00:00:00Z","Message":"","Annotations":null,"Err":{}}} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test2","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644226Z","driver":"mock_driver","event err str":"errz"} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test3","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644240Z","driver":"mock_driver","event err":"errz"} 2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test4","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644252Z","driver":"mock_driver","nil error":null} ``` Note in the first example err is set to an empty object and the error is lost. 
What we want is the last two examples which call out the err field explicitly so we can see what it is in this case	2021-10-11 19:44:52 +00:00
Florian Apolloner	709c1a2947	Fixed creation of ControllerCreateVolumeRequest. (#11238)	2021-10-06 10:17:39 -04:00
Mahmood Ali	c86cff02f9	logmon: Fix a memory leak on task restart Fix a logmon leak causing high goroutine and memory usage when a task restarts. Logmon `FileRotator` buffers the task stdout/stderr streams and periodically flushes them to log files. Logmon creates a new FileRotator for each stream for each task run. However, the `flushPeriodically` goroutine is leaked when a task restarts, holding a reference to a no-longer-needed `FileRotator` instance along with its 64kb buffer. The cause is that the code assumed `time.Ticker.Stop()` closes the ticker channel, thereby terminating the goroutine, but the documentation says otherwise: > Stop turns off a ticker. After Stop, no more ticks will be sent. Stop does not close the channel, to prevent a concurrent goroutine reading from the channel from seeing an erroneous "tick". https://pkg.go.dev/time#Ticker.Stop	2021-10-05 12:11:53 -04:00
Mahmood Ali	9668245c4c	logmon: add a test for leaked goroutines	2021-10-05 12:11:42 -04:00
Mahmood Ali	614ade1bb6	logmon: refactor Logging tests Mostly to use testify assertions and close open resources	2021-10-05 12:10:58 -04:00
Michael Schurter	7071425af3	client: defensively log reserved ports - Fix test broken due to being improperly setup. - Include min/max ports in default client config.	2021-10-04 15:43:35 -07:00
Mahmood Ali	4d90afb425	gofmt all the files mostly to handle build directives in 1.17.	2021-10-01 10:14:28 -04:00
Michael Schurter	c6e72b6818	client: output reserved ports with min/max ports Also add a little more min/max port testing and add the consts back that had been removed: but unexported and as defaults.	2021-09-30 17:05:46 -07:00
Luiz Aoqui	a7698dedba	Disable PowerShell profile and simplify fingerprinting link speed on Windows (#11183)	2021-09-22 11:17:47 -04:00


4469 commits