open-nomad

24,230 Commits 1 Branch 0 Tags 311 MiB

Author	SHA1	Message	Date
Luiz Aoqui	909435a0e7	scheduler: log stack in case of panic (#15303 )	2022-11-17 18:59:33 -05:00
Luiz Aoqui	e4c8b59919	Update alloc after reconnect and enforece client heartbeat order (#15068 ) * scheduler: allow updates after alloc reconnects When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was create and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state. The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting. This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits: - Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped. - Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event. - Disconnect events are already stored as `AllocState` and so storing reconnects there as well makes it the only source of information required. With the new logic, the reconnection logic is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. After the reconnection is handled, the new `ClientStatus` is store in `AllocState` allowing future evals to skip the reconnection logic. * scheduler: prevent spurious placement on reconnect When a client reconnects it makes two independent RPC calls: - `Node.UpdateStatus` to heartbeat and set its status as `ready`. - `Node.UpdateAlloc` to update the status of its allocations. These two calls can happen in any order, and in case the allocations are updated before a heartbeat it causes the state to be the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation was being disconnected. This extra allocation would then be quickly stopped by the heartbeat eval. This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running` the scheduler will check if the allocation is actually reconnecting using its `AllocState` events. * rpc: only allow alloc updates from `ready` nodes Clients interact with servers using three main RPC methods: - `Node.GetAllocs` reads allocation data from the server and writes it to the client. - `Node.UpdateAlloc` reads allocation from from the client and writes them to the server. - `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism. These three methods are called periodically by the clients and are done so independently from each other, meaning that there can't be any assumptions in their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`. When connectivity is restored the to rest of the cluster, the natural mental model is to think that the client will heartbeat first and then update its allocations status into the servers. But since there's no inherit order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`. This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: they can continue sending these requests independently and alloc update will succeed after the heartbeat is done. * chagelog: add entry for #15068 * code review * client: skip terminal allocations on reconnect When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation. * apply more code review changes * scheduler: persist changes to reconnected allocs Reconnected allocs have a new AllocState entry that must be persisted by the plan applier. * rpc: read node ID from allocs in UpdateAlloc The AllocUpdateRequest struct is used in three disjoint use cases: 1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs, and WriteRequest fields 2. Raft log message using the Allocs, Evals, and WriteRequest fields 3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields Adding a new field that would only be used in one these cases (1) made things more confusing and error prone. While in theory an AllocUpdateRequest could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations. * scheduler: remove logic to handle exceptional case This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it. * more code review * remove another unused test	2022-11-04 16:25:11 -04:00
Derek Strickland	0891218ee9	system_scheduler: support disconnected clients (#12555 ) * structs: Add helper method for checking if alloc is configured to disconnect * system_scheduler: Add support for disconnected clients	2022-04-15 09:31:32 -04:00
Tim Gross	464026c87b	scheduler: recover from panic (#12009 ) If processing a specific evaluation causes the scheduler (and therefore the entire server) to panic, that evaluation will never get a chance to be nack'd and cleared from the state store. It will get dequeued by another scheduler, causing that server to panic, and so forth until all servers are in a panic loop. This prevents the operator from intervening to remove the evaluation or update the state. Recover the goroutine from the top-level `Process` methods for each scheduler so that this condition can be detected without panicking the server process. This will lead to a loop of recovering the scheduler goroutine until the eval can be removed or nack'd, but that's much better than taking a downtime.	2022-02-07 11:47:53 -05:00
Luiz Aoqui	b1753d0568	scheduler: detect and log unexpected scheduling collisions (#11793 )	2022-01-14 20:09:14 -05:00
Tim Gross	ba38008596	scheduler: fix panic in system jobs when nodes filtered by class (#11565 ) In the system scheduler, if a subset of clients are filtered by class, we hit a code path where the `AllocMetric` has been copied, but the `Copy` method does not instantiate the various maps. This leads to an assignment to a nil map. This changeset ensures that the maps are non-nil before continuing. The `Copy` method relies on functions in the `helper` package that all return nil slices or maps when passed zero-length inputs. This changeset to fix the panic bug intentionally defers updating those functions because it'll have potential impact on memory usage. See https://github.com/hashicorp/nomad/issues/11564 for more details.	2021-11-24 12:28:47 -05:00
Mahmood Ali	e06ff1d613	scheduler: stop allocs in unrelated nodes (#11391 ) The system scheduler should leave allocs on draining nodes as-is, but stop node stop allocs on nodes that are no longer part of the job datacenters. Previously, the scheduler did not make the distinction and left system job allocs intact if they are already running. I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 . Fixes https://github.com/hashicorp/nomad/issues/11373	2021-10-27 07:04:13 -07:00
Michael Schurter	9ff246a394	scheduler: deep copy AllocMetric Defensively deep copy AllocMetric to avoid side effects from shared map references.	2021-09-10 16:41:31 -07:00
Mahmood Ali	9d0378cfcc	scheduler: warn when system jobs cannot place an alloc When a system or sysbatch job specify constraints that none of the current nodes meet, report a warning to the user. Also, for sysbatch job, mark the job as dead as a result. A sample run would look like: ``` $ nomad job run ./example.nomad ==> 2021-08-31T16:57:35-04:00: Monitoring evaluation "b48e8882" 2021-08-31T16:57:35-04:00: Evaluation triggered by job "example" ==> 2021-08-31T16:57:36-04:00: Monitoring evaluation "b48e8882" 2021-08-31T16:57:36-04:00: Evaluation status changed: "pending" -> "complete" ==> 2021-08-31T16:57:36-04:00: Evaluation "b48e8882" finished with status "complete" but failed to place all allocations: 2021-08-31T16:57:36-04:00: Task Group "cache" (failed to place 1 allocation): * Constraint "${meta.tag} = bar": 2 nodes excluded by filter * Constraint "${attr.kernel.name} = linux": 1 nodes excluded by filter $ nomad job status example ID = example Name = example Submit Date = 2021-08-31T16:57:35-04:00 Type = sysbatch Priority = 50 Datacenters = dc1 Namespace = default Status = dead Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost cache 0 0 0 0 0 0 Allocations No allocations placed ```	2021-08-31 16:58:09 -04:00
Seth Hoenig	3371214431	core: implement system batch scheduler This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. Like the name implies, sysbatch can be thought of as a hybrid between system and batch jobs - it is for running short lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is sill limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527	2021-08-03 10:30:47 -04:00

Renamed from scheduler/system_sched.go (Browse further)

10 Commits