open-nomad

Commit Graph

Author	SHA1	Message	Date
Luiz Aoqui	e4c8b59919	Update alloc after reconnect and enforece client heartbeat order (#15068 ) * scheduler: allow updates after alloc reconnects When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was create and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state. The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting. This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits: - Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped. - Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event. - Disconnect events are already stored as `AllocState` and so storing reconnects there as well makes it the only source of information required. With the new logic, the reconnection logic is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. After the reconnection is handled, the new `ClientStatus` is store in `AllocState` allowing future evals to skip the reconnection logic. * scheduler: prevent spurious placement on reconnect When a client reconnects it makes two independent RPC calls: - `Node.UpdateStatus` to heartbeat and set its status as `ready`. - `Node.UpdateAlloc` to update the status of its allocations. These two calls can happen in any order, and in case the allocations are updated before a heartbeat it causes the state to be the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation was being disconnected. This extra allocation would then be quickly stopped by the heartbeat eval. This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running` the scheduler will check if the allocation is actually reconnecting using its `AllocState` events. * rpc: only allow alloc updates from `ready` nodes Clients interact with servers using three main RPC methods: - `Node.GetAllocs` reads allocation data from the server and writes it to the client. - `Node.UpdateAlloc` reads allocation from from the client and writes them to the server. - `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism. These three methods are called periodically by the clients and are done so independently from each other, meaning that there can't be any assumptions in their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`. When connectivity is restored the to rest of the cluster, the natural mental model is to think that the client will heartbeat first and then update its allocations status into the servers. But since there's no inherit order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`. This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: they can continue sending these requests independently and alloc update will succeed after the heartbeat is done. * chagelog: add entry for #15068 * code review * client: skip terminal allocations on reconnect When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation. * apply more code review changes * scheduler: persist changes to reconnected allocs Reconnected allocs have a new AllocState entry that must be persisted by the plan applier. * rpc: read node ID from allocs in UpdateAlloc The AllocUpdateRequest struct is used in three disjoint use cases: 1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs, and WriteRequest fields 2. Raft log message using the Allocs, Evals, and WriteRequest fields 3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields Adding a new field that would only be used in one these cases (1) made things more confusing and error prone. While in theory an AllocUpdateRequest could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations. * scheduler: remove logic to handle exceptional case This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it. * more code review * remove another unused test	2022-11-04 16:25:11 -04:00
Seth Hoenig	0692190e12	core: reschedule evicted batch job when resources become available This PR fixes a bug where an evicted batch job would not be rescheduled once resources become available. Closes #9890	2022-06-02 14:04:13 -05:00
Derek Strickland	5e309f3f33	reconciler: Handle canaries when client disconnects (#12539 ) * plan_apply: Allow node updates in disconnected node plans * plan: Keep the job when persisting unknown allocs * reconciler: stop unknown allocs when stopping all * reconcile_util: reorder filtering to handle canaries; skip rescheduling unknown * heartbeat: Fix bug in node heartbeating	2022-04-21 10:05:58 -04:00
Derek Strickland	0891218ee9	system_scheduler: support disconnected clients (#12555 ) * structs: Add helper method for checking if alloc is configured to disconnect * system_scheduler: Add support for disconnected clients	2022-04-15 09:31:32 -04:00
Tim Gross	57b3a0028f	allocs without max_client_disconnect should be lost on disconnect (#12529 ) In the reconciler's filtering for tainted nodes, we use whether the server supports disconnected clients as a gate to a bunch of our logic, but this doesn't account for cases where the job doesn't have `max_client_disconnect`. The only real consequence of this appears to be that allocs on disconnected nodes are marked "complete" instead of "lost".	2022-04-11 11:24:49 -04:00
Derek Strickland	d1d6009e2c	disconnected clients: Support operator manual interventions (#12436 ) * allocrunner: Remove Shutdown call in Reconnect * Node.UpdateAlloc: Stop orphaned allocs. * reconciler: Stop failed reconnects. * Apply feedback from code review. Handle rebase conflict. * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-04-06 09:33:32 -04:00
Derek Strickland	bd719bc7b8	reconciler: 2 phase reconnects and tests (#12333 ) * structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by. * node_endpoint: Emit new eval for reconnecting unknown allocs. * filterByTainted: handle 2 phase commit filtering rules. * reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects. * allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.	2022-04-05 17:13:10 -04:00
Derek Strickland	fd04a24ac7	disconnected clients: ensure servers meet minimum required version (#12202 ) * planner: expose ServerMeetsMinimumVersion via Planner interface * filterByTainted: add flag indicating disconnect support * allocReconciler: accept and pass disconnect support flag * tests: update dependent tests	2022-04-05 17:12:23 -04:00
Derek Strickland	b128769e19	reconciler: support disconnected clients (#12058 ) * Add merge helper for string maps * structs: add statuses, MaxClientDisconnect, and helper funcs * taintedNodes: Include disconnected nodes * upsertAllocsImpl: don't use existing ClientStatus when upserting unknown * allocSet: update filterByTainted and add delayByMaxClientDisconnect * allocReconciler: support disconnecting and reconnecting allocs * GenericScheduler: upsert unknown and queue reconnecting Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-04-05 17:10:37 -04:00
Derek Strickland	e1f9c442e1	reconciler: refactor `computeGroup` (#12033 ) The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include. Renamed functions - handleGroupCanaries -> cancelUnneededCanaries - handleDelayedLost -> createLostLaterEvals - handeDelayedReschedules -> createRescheduleLaterEvals New functions - filterAndStopAll - initializeDeploymentState - requiresCanaries - computeCanaries - computeUnderProvisionedBy - computeReplacements - computeDestructiveUpdates - computeMigrations - createDeployment - isDeploymentComplete	2022-02-10 16:24:51 -05:00
Derek Strickland	7a63a249ca	reconciler: improve variable names and extract methods from inline logic (#12010 ) * reconciler: improved variable names and extract methods from inline logic Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-02-05 04:54:19 -05:00
James Rasell	45f4689f9c	chore: fixup inconsistent method receiver names. (#11704 )	2021-12-20 11:44:21 +01:00
Michael Schurter	e62795798d	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Mahmood Ali	339617a836	Only ignore rescheduled allocations if they got stopped	2020-09-14 21:11:52 -04:00
Mahmood Ali	8a342926b7	Respect alloc job version for lost/failed allocs This change fixes a bug where lost/failed allocations are replaced by allocations with the latest versions, even if the version hasn't been promoted yet. Now, when generating a plan for lost/failed allocations, the scheduler first checks if the current deployment is in Canary stage, and if so, it ensures that any lost/failed allocations is replaced one with the latest promoted version instead.	2020-08-19 09:52:48 -04:00
Lang Martin	069840bef8	scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105 ) (#8138 ) * scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect * scheduler/reconcile: thread follupEvalIDs through to results.stop * scheduler/reconcile: comment typo * nomad/_test: correct arguments for plan.AppendStoppedAlloc * scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost\|Reschedules)	2020-06-09 17:13:53 -04:00
Lang Martin	d3c4700cd3	server: stop after client disconnect (#7939 ) * jobspec, api: add stop_after_client_disconnect * nomad/state/state_store: error message typo * structs: alloc methods to support stop_after_client_disconnect 1. a global AllocStates to track status changes with timestamps. We need this to track the time at which the alloc became lost originally. 2. ShouldClientStop() and WaitClientStop() to actually do the math * scheduler/reconcile_util: delayByStopAfterClientDisconnect * scheduler/reconcile: use delayByStopAfterClientDisconnect * scheduler/util: updateNonTerminalAllocsToLost comments This was setup to only update allocs to lost if the DesiredStatus had already been set by the scheduler. It seems like the intention was to update the status from any non-terminal state, and not all lost allocs have been marked stop or evict by now * scheduler/testing: AssertEvalStatus just use require * scheduler/generic_sched: don't create a blocked eval if delayed * scheduler/generic_sched_test: several scheduling cases	2020-05-13 16:39:04 -04:00
Mahmood Ali	fd8fb8c22b	Stop allocs to be rescheduled Currently, when an alloc fails and is rescheduled, the alloc desired state remains as "run" and the nomad client may not free the resources. Here, we ensure that an alloc is marked as stopped when it's rescheduled. Notice the Desired Status and Description before and after this change: Before: ``` mars-2:nomad notnoop$ nomad alloc status 02aba49e ID = 02aba49e Eval ID = bb9ed1d2 Name = example-reschedule.nodes[0] Node ID = 5853d547 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = run Desired Description = <none> Created = 10s ago Modified = 5s ago Replacement Alloc ID = d6bf872b Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 0/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:12:45Z Finished At = 2019-06-06T21:12:50Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:12:50-04:00 Not Restarting Policy allows no restarts 2019-06-06T17:12:50-04:00 Terminated Exit Code: 1 2019-06-06T17:12:45-04:00 Started Task started by client 2019-06-06T17:12:45-04:00 Task Setup Building Task Directory 2019-06-06T17:12:45-04:00 Received Task received by client ``` After: ``` ID = 5001ccd1 Eval ID = 53507a02 Name = example-reschedule.nodes[0] Node ID = a3b04364 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = stop Desired Description = alloc was rescheduled because it failed Created = 13s ago Modified = 3s ago Replacement Alloc ID = 7ba7ac20 Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 21/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:22:50Z Finished At = 2019-06-06T21:22:55Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:22:55-04:00 Not Restarting Policy allows no restarts 2019-06-06T17:22:55-04:00 Terminated Exit Code: 1 2019-06-06T17:22:50-04:00 Started Task started by client 2019-06-06T17:22:50-04:00 Task Setup Building Task Directory 2019-06-06T17:22:50-04:00 Received Task received by client ```	2019-06-06 17:27:12 -04:00
Preetha Appan	c1b92c284e	Work in progress - force rescheduling of failed allocs	2018-05-08 17:26:57 -05:00
Alex Dadgar	c6478d9469	clarify comment	2018-05-07 14:55:01 -05:00
Alex Dadgar	8626c1b94a	Reschedule when we have canaries properly	2018-05-07 14:55:01 -05:00
Alex Dadgar	550f5e31f8	Allow canary count greater than desired	2018-05-07 14:50:01 -05:00
Preetha Appan	5329900f6d	Only use DesiredTransition.Reschedule in reconciler when its an active deployment	2018-05-07 14:50:01 -05:00
Alex Dadgar	5547974f35	Only reschedule allowed deployment allocs	2018-05-07 14:50:01 -05:00
Preetha Appan	7e17bc231f	remove unnecessary check and other fixes from code review	2018-04-04 07:35:20 -05:00
Preetha Appan	00537c739b	Fixes edge cases around timing and task finish time being set more than once	2018-04-03 16:34:59 -05:00
Preetha Appan	d87e528059	rename skip->ignore and improve comment formatting	2018-03-29 15:11:10 -05:00
Preetha Appan	38a7614776	Refactored for readability, pair programmed with @dadgar	2018-03-29 13:28:37 -05:00
Preetha Appan	5090fefe96	Filter out allocs with DesiredState = stop, and unit tests	2018-03-29 09:28:52 -05:00
Preetha Appan	cbfd69ce7a	Fix edge case in reconciler where service jobs with ClientstatusComplete were not replaced	2018-03-23 18:41:00 -05:00
Michael Schurter	9263cc2ed7	scheduler: migrate non-terminal migrating allocs filterByTainted node should always migrate non-terminal migrating allocs	2018-03-21 16:49:48 -07:00
Michael Schurter	d1ec65d765	switch to new raft DesiredTransition message	2018-03-21 16:49:48 -07:00
Alex Dadgar	db4a634072	RPC, FSM, State Store for marking DesiredTransistion fix build tag	2018-03-21 16:49:48 -07:00
Preetha Appan	9fed0d2103	Get reschedule policy from the alloc directly	2018-03-14 16:10:32 -05:00
Preetha Appan	e2656ef546	Cleaner handling of batched evals	2018-03-14 16:10:32 -05:00
Preetha Appan	2ba976dec8	Remove unnecessary check against 5 second window for determining immediate scheduling eligibility	2018-03-14 16:10:32 -05:00
Preetha Appan	5373ade731	Scheduler and Reconciler changes to support delayed rescheduling	2018-03-14 16:10:32 -05:00
Josh Soref	f8eb766fb5	spelling: reschedulable	2018-03-11 18:48:12 +00:00
Preetha Appan	7b6ba7a1f4	Fixes bug in reconciler where previously rescheduled allocs are rescheduled again. Simplified logic and added test case to catch this.	2018-02-20 12:07:56 -06:00
Preetha Appan	ea4a889e28	Address more code review feedback	2018-01-31 09:56:53 -06:00
Preetha Appan	bd89d2b39e	Make sure that reschedule trackers are not added for node drain replacements	2018-01-31 09:56:53 -06:00
Preetha Appan	d0f9d59abb	Reconile with changes to structs for reschedule tracking	2018-01-31 09:56:53 -06:00
Preetha Appan	fbb1936dee	Fix some comments and lint warnings, remove unused method	2018-01-31 09:56:53 -06:00
Preetha Appan	031c566ada	Reschedule previous allocs and track their reschedule attempts	2018-01-31 09:56:53 -06:00
Preetha Appan	3b4d7ac2a3	Fix some typos	2017-12-14 13:29:27 -06:00
Alex Dadgar	4173834231	Enable more linters	2017-09-26 15:26:33 -07:00
Alex Dadgar	2453f13fc5	fixes	2017-08-15 12:27:05 -07:00
Alex Dadgar	0570e09feb	Fix panic occuring from improper bitmap size This PR fixes an allignment calculation when determining the bitmap size. Fixes https://github.com/hashicorp/nomad/issues/3008	2017-08-12 15:37:02 -07:00
Luke Farnell	f0ced87b95	fixed all spelling mistakes for goreport	2017-08-07 17:13:05 -04:00
Alex Dadgar	66a90326e1	Treat destructive updates atomically	2017-07-16 10:35:38 -07:00

1 2

59 Commits