open-nomad

History

Michael Schurter e2544dd089 client: fix waiting on preempted alloc (#12779 ) Fixes #10200 The bug A user reported receiving the following error when an alloc was placed that needed to preempt existing allocs: ``` [ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup failed: index error: UUID must be 36 characters" ``` The previous alloc (8e) was already complete on the client. This is possible if an alloc stops after the scheduling decision was made to preempt it, but before the node running both allocations was able to pull and start the preemptor. While that is hopefully a narrow window of time, you can expect it to occur in high throughput batch scheduling heavy systems. However the RPC error made no sense! `previous_alloc` in the logs was a valid 36 character UUID! The fix The fix is: ``` - prevAllocID: c.Alloc.PreviousAllocation, + prevAllocID: watchedAllocID, ``` The alloc watcher new func used for preemption improperly referenced Alloc.PreviousAllocation instead of the passed in watchedAllocID. When multiple allocs are preempted, a watcher is created for each with watchedAllocID set properly by the caller. In this case Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters` error was coming from! Sadly we were properly referencing watchedAllocID in the log, so it made the error make no sense! The repro I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro. First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that evicts 2 low priority jobs. Everything worked as expected. However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references PreviousAllocation instead of watchedAllocID.		2022-04-26 13:14:43 -07:00
..
allocdir	…
allochealth	…
allocrunner	…
allocwatcher	…
config	…
consul	…
devicemanager	…
dynamicplugins	…
fingerprint	…
interfaces	…
lib	…
logmon	…
pluginmanager	…
servers	…
serviceregistration	…
state	…
stats	…
structs	…
taskenv	…
testutil	…
vaultclient	…
acl.go	…
acl_test.go	…
agent_endpoint.go	…
agent_endpoint_test.go	…
alloc_endpoint.go	…
alloc_endpoint_test.go	…
alloc_watcher_e2e_test.go	…
client.go	…
client_stats_endpoint.go	…
client_stats_endpoint_test.go	…
client_test.go	…
csi_endpoint.go	…
csi_endpoint_test.go	…
driver_manager_test.go	…
enterprise_client_oss.go	…
fingerprint_manager.go	…
fingerprint_manager_test.go	…
fs_endpoint.go	…
fs_endpoint_test.go	…
gc.go	…
gc_test.go	…
heartbeatstop.go	…
heartbeatstop_test.go	…
node_updater.go	…
rpc.go	…
rpc_test.go	…
testing.go	…
util.go	…