open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	027277a0d9	csi: make volume GC in job deregister safely async The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.	2020-04-06 10:15:55 -04:00
Tim Gross	8bc5641438	csi: volume claim garbage collection (#7125) When an alloc is marked terminal (and after node unstage/unpublish have been called), the client syncs the terminal alloc state with the server via `Node.UpdateAlloc` RPC. For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC handler at the server will emit an eval for a new core job to garbage collect CSI volume claims. When this eval is handled on the core scheduler, it will call a `volumeReap` method to release the claims for all terminal allocs on the job. The volume reap will issue a `ControllerUnpublishVolume` RPC for any node that has no alloc claiming the volume. Once this returns (or is skipped), the volume reap will send a new `CSIVolume.Claim` RPC that releases the volume claim for that allocation in the state store, making it available for scheduling again. This same `volumeReap` method will be called from the core job GC, which gives us a second chance to reclaim volumes during GC if there were controller RPC failures.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	9d4307a3ef	csi_endpoint: Provide AllocID in req, and return Volume Currently, the client has to ship an entire allocation to the server as part of performing a VolumeClaim, this has a few problems: Firstly, it means the client is sending significantly more data than is required (an allocation contains the entire contents of a Nomad job, alongside other irrelevant state) which has a non-zero (de)serialization cost. Secondly, because the allocation was never re-fetched from the state store, it means that we were potentially open to issues caused by stale state on a misbehaving or malicious client. The change removes both of those issues at the cost of a couple of more state store lookups, but they should be relatively cheap. We also now provide the CSIVolume in the response for a claim, so the client can perform a Claim without first going ahead and fetching all of the volumes.	2020-03-23 13:58:30 -04:00
Tim Gross	fb1aad66ee	csi: implement releasing volume claims for terminal allocs (#7076) When an alloc is marked terminal, and after node unstage/unpublish have been called, the client will sync the terminal alloc state with the server via `Node.UpdateAlloc` RPC. This changeset implements releasing the volume claim for each volume associated with the terminal alloc. It doesn't yet implement the RPC call we need to make to the `ControllerUnpublishVolume` CSI RPC.	2020-03-23 13:58:29 -04:00
Mahmood Ali	0da7130a1a	Protect against args being modified	2020-03-18 08:11:16 -04:00
Mahmood Ali	52fd31af80	server: node connections must not be forwarded This fixes a bug where a forwarded node update request may be assumed to be the actual direct client connection if the server just lost leadership. When a nomad non-leader server receives a Node.UpdateStatus request, it forwards the RPC request to the leader, and holds on to the request's Yamux connection in a cache to allow for server<->client forwarding. When the leader handles the request, it must differentiate between a forwarded connection vs the actual connection. This is done in https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L412 Now, consider if the non-leader server forwards the connection to a recently deposed nomad leader, which in turn forwards the RPC request to the new leader. Without this change, the deposed leader will mistake the forwarded connection for the actual client connection and cache it mapped to the client ID. If the server attempts to connect to that client, it will attempt to start a connection/session to the other server instead and the call will hang forever. This change ensures that we only add node connection mapping if the request is not a forwarded request, regardless of circumstances.	2020-03-17 16:39:01 -04:00
Seth Hoenig	587a5d4a8d	nomad: make TaskGroup.UsesConnect helper a public helper	2020-01-31 19:05:11 -06:00
Seth Hoenig	78a7d1e426	comments: cleanup some leftover debug comments and such	2020-01-31 19:04:35 -06:00
Seth Hoenig	8219c78667	nomad: handle SI token revocations concurrently Be able to revoke SI token accessors concurrently, and also ratelimit the requests being made to Consul for the various ACL API uses.	2020-01-31 19:04:14 -06:00
Seth Hoenig	2c7ac9a80d	nomad: fixup token policy validation	2020-01-31 19:04:08 -06:00
Seth Hoenig	9df33f622f	nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to receive RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task.	2020-01-31 19:03:53 -06:00
Luiz Aoqui	e862b61daa	api: use the same initial time for all drain properties	2019-11-14 16:06:09 -05:00
Luiz Aoqui	5bd7cdd5c3	api: add `StartedAt` in `Node.DrainStrategy`	2019-11-13 17:54:40 -05:00
Jasmine Dahilig	8d980edd2e	add create and modify timestamps to evaluations (#5881)	2019-08-07 09:50:35 -07:00
Lang Martin	0b97175a16	node_endpoint preserve both messages as rpcs and in raft	2019-07-10 13:56:20 -04:00
Lang Martin	a95225d754	NodeDeregisterBatch -> NodeBatchDeregister match JobBatch pattern	2019-07-10 13:56:20 -04:00
Lang Martin	fa5649998e	node endpoint support new NodeDeregisterBatchRequest	2019-07-10 13:56:19 -04:00
Lang Martin	82349aba5d	node_endpoint argument setup	2019-07-10 13:56:19 -04:00
Lang Martin	09fd05bd8f	node_endpoint raft store then shutdown, test deprecation	2019-07-10 13:56:19 -04:00
Lang Martin	3e2d1f0338	node_endpoint improve error messages	2019-07-10 13:56:19 -04:00
Lang Martin	b176066d42	node_endpoint deregister the batch of nodes	2019-07-10 13:56:19 -04:00
Mahmood Ali	6bdbeed319	set node.StatusUpdatedAt in raft Fix a case where `node.StatusUpdatedAt` was manipulated directly in memory. This ensures that StatusUpdatedAt is set in raft layer, and ensures that the field is updated when node drain/eligibility is updated too.	2019-05-21 16:13:32 -04:00
Alex Dadgar	4bdccab550	goimports	2019-01-22 15:44:31 -08:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Nick Ethier	d35bf6d184	nomad: handle edge case where node drain event shouldn't be emitted	2018-06-06 14:02:10 -04:00
Preetha Appan	647ccc2dc3	fix bug where disabling a node drain when there is no drain strategy set causes scheduling eligibility to stay ineligible	2018-05-30 12:28:46 -05:00
Alex Dadgar	21c5ed850d	Register events	2018-05-22 14:06:33 -07:00
Alex Dadgar	1fe9cb4f00	update error message	2018-05-22 14:04:59 -07:00
Alex Dadgar	5f2080bc26	Emit events based on eligibility	2018-05-22 14:04:59 -07:00
Alex Dadgar	b6ecb75af9	update error message	2018-05-22 14:01:43 -07:00
Alex Dadgar	0cb31feb1f	Add node event when draining is set/removed/updated	2018-05-10 16:54:43 -07:00
Alex Dadgar	a35248d1d8	Plumb event via FSM	2018-05-10 16:30:54 -07:00
Alex Dadgar	8d50955054	Fix typos	2018-05-07 14:50:01 -05:00
Preetha Appan	a569d34f25	Add custom status description for rescheduling follow up evals, and make unit test robust	2018-04-10 15:30:15 -05:00
Preetha Appan	b3402efd0b	Adds a new custom description for update alloc triggered evals to make it easier to unit test.	2018-04-10 14:00:07 -05:00
Preetha Appan	d1cb5df477	Batch evals for rescheduling failed allocs correctly and group them by job ID	2018-04-09 14:05:31 -05:00
Alex Dadgar	58a3ec3fb2	Improve Vault error handling	2018-04-03 14:29:22 -07:00
Alex Dadgar	de4b3772f1	Create evals for system jobs when drain is unset This PR creates evals for system jobs when: * Drain is unset and mark eligible is true * Eligibility is restored to the node	2018-03-27 15:53:24 -07:00
Alex Dadgar	5dacb057b7	Only track nodes if the conn is from the node Fixes a bug in which a connection to a Nomad server was treated as a connection to a node because the server forwarded a node specific RPC.	2018-03-27 09:59:31 -07:00
Alex Dadgar	e63bcb474d	Drainer	2018-03-21 16:51:44 -07:00
Alex Dadgar	a37329189a	Improve DeadlineTime helper	2018-03-21 16:51:44 -07:00
Alex Dadgar	8289cc3c6f	HTTP and API	2018-03-21 16:51:44 -07:00
Alex Dadgar	0fba0101b6	RPC/FSM/State Store for Eligibility	2018-03-21 16:51:44 -07:00
Alex Dadgar	b3d2346419	Upgrade path	2018-03-21 16:51:43 -07:00
Alex Dadgar	2f5309d82a	Remove update time	2018-03-21 16:51:43 -07:00
Alex Dadgar	e459a666ed	Node.Drain takes strategy	2018-03-21 16:49:48 -07:00
Alex Dadgar	db4a634072	RPC, FSM, State Store for marking DesiredTransition fix build tag	2018-03-21 16:49:48 -07:00
Michael Schurter	c0542474db	drain: initial drainv2 structs and impl	2018-03-21 16:49:48 -07:00
Alex Dadgar	b8607ad6d6	Heartbeat uses client rpc advertise and server defaults server rpc advertise addr	2018-03-16 16:47:08 -07:00
Alex Dadgar	85be2d99b3	Drop ACL todo	2018-03-14 16:41:46 -07:00

162 Commits