open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	27bb2da5ee	CSI: make gRPC client creation more robust (#12057) Nomad communicates with CSI plugin tasks via gRPC. The plugin supervisor hook uses this to ping the plugin for health checks which it emits as task events. After the first successful health check the plugin supervisor registers the plugin in the client's dynamic plugin registry, which in turn creates a CSI plugin manager instance that has its own gRPC client for fingerprinting the plugin and sending mount requests. If the plugin manager instance fails to connect to the plugin on its first attempt, it exits. The plugin supervisor hook is unaware that connection failed so long as its own pings continue to work. A transient failure during plugin startup may mislead the plugin supervisor hook into thinking the plugin is up (so there's no need to restart the allocation) but no fingerprinter is started. * Refactors the gRPC client to connect on first use. This provides the plugin manager instance the ability to retry the gRPC client connection until success. * Add a 30s timeout to the plugin supervisor so that we don't poll forever waiting for a plugin that will never come back up. Minor improvements: * The plugin supervisor hook creates a new gRPC client for every probe and then throws it away. Instead, reuse the client as we do for the plugin manager. * The gRPC client constructor has a 1 second timeout. Clarify that this timeout applies to the connection and not the rest of the client lifetime.	2022-02-15 16:57:29 -05:00
Seth Hoenig	ac3cd73d00	Merge pull request #12054 from hashicorp/b-creation-indexes api: return sorted results in certain list endpoints	2022-02-15 15:08:38 -06:00
Seth Hoenig	40c714a681	api: return sorted results in certain list endpoints These API endpoints now return results in chronological order. They can return results in reverse chronological order by setting the query parameter ascending=true. - Eval.List - Deployment.List	2022-02-15 13:48:28 -06:00
Seth Hoenig	f8f0d92469	Merge pull request #11955 from hashicorp/f-update-gopsutil Update gopsutil to 3.21.12	2022-02-15 08:31:57 -06:00
Seth Hoenig	d1ce07cbf7	cl: shorten changelog entry	2022-02-15 08:31:25 -06:00
Tim Gross	bed9b3c248	changelog entry (#12072)	2022-02-15 09:00:30 -05:00
Seth Hoenig	3646bdd738	Merge pull request #12066 from hashicorp/f-make-golint-faster build: allow golangci-lint to use more than 1 core	2022-02-15 08:00:07 -06:00
Alex Holyoake	3071c7d91b	config: merge ReservableCores in clientConfig (#12044)	2022-02-15 08:36:37 -05:00
Seth Hoenig	5e919ae95b	Merge pull request #12069 from alrs/scheduler-test-err scheduler: fix dropped test error	2022-02-15 07:29:50 -06:00
Lars Lehtonen	a07795b4a2	scheduler: fix dropped test error	2022-02-14 22:11:45 -08:00
Seth Hoenig	84abc16ec6	build: allow golangci-lint to use more than 1 core Since switching to `golangci-lint` we have set the `-j 1` flag, which restricts the tool to using 1 CPU thread. This PR removes the flag so `make check` takes less time on good computers.	2022-02-14 16:56:58 -06:00
James Rasell	15205b5408	Merge pull request #12052 from hashicorp/b-taskrunner-track-deregistered-call client: track service deregister call so it's only called once.	2022-02-14 09:01:26 +01:00
Tim Gross	2f79a260fe	csi: volume cli prefix matching should accept exact match (#12051) The `volume detach`, `volume deregister`, and `volume status` commands accept a prefix argument for the volume ID. Update the behavior on exact matches so that if there is more than one volume that matches the prefix, we should only return an error if one of the volume IDs is not an exact match. Otherwise we won't be able to use these commands at all on those volumes. This also makes the behavior of these commands consistent with `job stop`.	2022-02-11 08:53:03 -05:00
Tim Gross	8ffe7aa76f	csi: provide `CSI_ENDPOINT` env var to plugins (#12050) The CSI specification says: > The CO SHALL provide the listen-address for the Plugin by way of the `CSI_ENDPOINT` environment variable. Note that plugins without filesystem isolation won't have the plugin dir bind-mounted to their alloc dir, but we can provide a path to the socket anyways. Refactor to use opts struct for plugin supervisor hook config. The parameter list for configuring the plugin supervisor hook has grown enough that it makes sense to use an options struct similar to many of the other task runner hooks (ex. template).	2022-02-11 08:46:21 -05:00
James Rasell	926458c5b2	Merge pull request #12053 from marcaurele/fix-typo doc(typo): technical typo in advertised example	2022-02-11 14:27:12 +01:00
James Rasell	d7f90a4fbb	Merge pull request #12041 from hashicorp/b-gh-12040 changelog: add entry for #12040	2022-02-11 10:15:09 +01:00
James Rasell	222592a07e	client: track service deregister call so it's only called once. In certain task lifecycles the taskrunner service deregister call could be called three times for a task that is exiting. Whilst each hook caller of deregister has its own purpose, we should try and ensure it is only called once during the shutdown lifecycle of a task. This change therefore tracks when deregister has been called, so that subsequent calls are noop. In the event the task is restarting, the deregister value is reset to ensure proper operation.	2022-02-11 09:29:38 +01:00
Derek Strickland	e1f9c442e1	reconciler: refactor `computeGroup` (#12033) The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include. Renamed functions - handleGroupCanaries -> cancelUnneededCanaries - handleDelayedLost -> createLostLaterEvals - handeDelayedReschedules -> createRescheduleLaterEvals New functions - filterAndStopAll - initializeDeploymentState - requiresCanaries - computeCanaries - computeUnderProvisionedBy - computeReplacements - computeDestructiveUpdates - computeMigrations - createDeployment - isDeploymentComplete	2022-02-10 16:24:51 -05:00
Luiz Aoqui	d976e4a19b	docs: add upgrade note and ACL requirements for the job submit endpoint (#12046)	2022-02-10 15:35:16 -05:00
Luiz Aoqui	1d5b96bdf7	update download to Nomad v1.2.6 (#12042)	2022-02-10 15:33:28 -05:00
Luiz Aoqui	77c718d227	Merge pull request #12045 from hashicorp/merge-release-1.2.6-branch Merge release 1.2.6 branch	2022-02-10 15:12:40 -05:00
Luiz Aoqui	e83ef0a008	prepare for next release	2022-02-10 14:56:11 -05:00
Luiz Aoqui	3bf6036487	Merge tag 'v1.2.6' into merge-release-1.2.6-branch Version 1.2.6	2022-02-10 14:55:34 -05:00
Marc-Aurèle Brothier	fb80dc57a1	small typo in advertised example	2022-02-10 13:53:05 +01:00
James Rasell	d96f9684be	changelog: add entry for #12040	2022-02-10 08:36:32 +01:00
Nomad Release Bot	45b04fb53e	Release v1.2.6	2022-02-10 03:26:34 +00:00
Nomad Release bot	3b0e6ae029	Generate files for 1.2.6 release	2022-02-10 02:47:03 +00:00
Luiz Aoqui	9476c75c57	docs: add 1.2.6 to changelog	2022-02-09 19:59:37 -05:00
Tim Gross	74486d86fb	scheduler: prevent panic in spread iterator during alloc stop The spread iterator can panic when processing an evaluation, resulting in an unrecoverable state in the cluster. Whenever a panicked server restarts and quorum is restored, the next server to dequeue the evaluation will panic. To trigger this state: * The job must have `max_parallel = 0` and a `canary >= 1`. * The job must not have a `spread` block. * The job must have a previous version. * The previous version must have a `spread` block and at least one failed allocation. In this scenario, the desired changes include `(place 1+) (stop 1+), (ignore n) (canary 1)`. Before the scheduler can place the canary allocation, it tries to find out which allocations can be stopped. This passes back through the stack so that we can determine previous-node penalties, etc. We call `SetJob` on the stack with the previous version of the job, which will include assessing the `spread` block (even though the results are unused). The task group spread info state from that pass through the spread iterator is not reset when we call `SetJob` again. When the new job version iterates over the `groupPropertySets`, it will get an empty `spreadAttributeMap`, resulting in an unexpected nil pointer dereference. This changeset resets the spread iterator internal state when setting the job, logging with a bypass around the bug in case we hit similar cases, and a test that panics the scheduler without the patch.	2022-02-09 19:53:06 -05:00
Luiz Aoqui	15f9d54dea	api: prevent excessive CPU load on job parse Add new namespace ACL requirement for the /v1/jobs/parse endpoint and return early if HCLv2 parsing fails. The endpoint now requires the new `parse-job` ACL capability or `submit-job`.	2022-02-09 19:51:47 -05:00
Seth Hoenig	437bb4b86d	client: check escaping of alloc dir using symlinks This PR adds symlink resolution when doing validation of paths to ensure they do not escape client allocation directories.	2022-02-09 19:50:13 -05:00
Seth Hoenig	de078e7ac6	client: fix race condition in use of go-getter go-getter creates a circular dependency between a Client and Getter, which means each is inherently thread-unsafe if you try to re-use one or the other. This PR fixes Nomad to no longer make use of the default Getter objects provided by the go-getter package. Nomad must create a new Client object on every artifact download, as the Client object controls the Src and Dst among other things. When calling Client.Get, the Getter modifies its own Client reference, creating the circular reference and race condition. We can still achieve most of the desired connection caching behavior by re-using a shared HTTP client with transport pooling enabled.	2022-02-09 19:48:28 -05:00
Charlie Voiselle	6750235152	Add changelog	2022-02-09 19:31:42 -05:00
Tim Gross	6bd33d3fb9	CSI: use job status not alloc status for plugin updates from summary (#12027) When an allocation is updated, the job summary for the associated job is also updated. CSI uses the job summary to set the expected count for controller and node plugins. We incorrectly used the allocation's server status instead of the job status when deciding whether to update or remove the job from the plugins. This caused a node drain or other terminal state for an allocation to clear the expected count for the entire plugin. Use the job status to guide whether to update or remove the expected count. The existing CSI tests for the state store incorrectly modeled the updates we received from servers vs those we received from clients, leading to test assertions that passed when they should not. Rework the tests to clarify each step in the lifecycle and rename CSI state store functions for clarity	2022-02-09 11:51:49 -05:00
Tim Gross	59c8558969	docs and changelog for `nomad config validate` (#12031)	2022-02-09 10:20:45 -05:00
Kevin Schoonover	1dcfff2f70	fingerprint: remove metadata from digitalocean (#12032)	2022-02-09 07:31:45 -05:00
Thomas Lefebvre	3b57f3af9d	Add config command and config validate subcommand to nomad CLI (#9198)	2022-02-08 16:52:35 -05:00
Tim Gross	21bd4835bd	fingerprint: digitalocean fingerprint test requires metadata header (#12028)	2022-02-08 16:35:13 -05:00
Seth Hoenig	5dad1cbb98	Merge pull request #12026 from hashicorp/f-update-aws env: update aws cpu configs	2022-02-08 13:56:50 -06:00
Seth Hoenig	5cb365b36b	env: update aws cpu configs By running the tools/ec2info tool	2022-02-08 12:44:00 -06:00
Tim Gross	d9d4da1e9f	scheduler: seed random shuffle nodes with eval ID (#12008) Processing an evaluation is nearly a pure function over the state snapshot, but we randomly shuffle the nodes. This means that developers can't take a given state snapshot and pass an evaluation through it and be guaranteed the same plan results. But the evaluation ID is already random, so if we use this as the seed for shuffling the nodes we can greatly reduce the sources of non-determinism. Unfortunately golang map iteration uses a global source of randomness and not a goroutine-local one, but arguably if the scheduler behavior is impacted by this, that's a bug in the iteration.	2022-02-08 12:16:33 -05:00
Seth Hoenig	aece0ddda8	Merge pull request #12024 from hashicorp/docs-update-cl changelog: update changelog for DO	2022-02-08 10:29:09 -06:00
Seth Hoenig	a06ae106f0	cl: fix DO name Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-02-08 10:28:57 -06:00
Seth Hoenig	a21776a82d	changelog: update changelog for DO	2022-02-08 08:43:49 -06:00
Seth Hoenig	451d9b0dd2	Merge pull request #12015 from kevinschoonover/main client/fingerprint: add digitalocean fingerprinter	2022-02-08 08:41:03 -06:00
Dylan Staley	fdf67e6bb5	Merge pull request #11936 from hashicorp/ds.ie11-warning website: display warning in IE 11	2022-02-07 13:59:41 -08:00
Kevin Schoonover	b13573d4ab	address comments Co-authored-by: Seth Hoenig <seth.a.hoenig@gmail.com>	2022-02-07 09:03:48 -08:00
Tim Gross	464026c87b	scheduler: recover from panic (#12009) If processing a specific evaluation causes the scheduler (and therefore the entire server) to panic, that evaluation will never get a chance to be nack'd and cleared from the state store. It will get dequeued by another scheduler, causing that server to panic, and so forth until all servers are in a panic loop. This prevents the operator from intervening to remove the evaluation or update the state. Recover the goroutine from the top-level `Process` methods for each scheduler so that this condition can be detected without panicking the server process. This will lead to a loop of recovering the scheduler goroutine until the eval can be removed or nack'd, but that's much better than taking a downtime.	2022-02-07 11:47:53 -05:00
Kevin Schoonover	68eeaa7a18	small fixes	2022-02-05 22:23:43 -08:00
Kevin Schoonover	5523275e95	add digitalocean fingerprinter	2022-02-05 22:17:36 -08:00

22493 Commits
