open-nomad/client
Michael Schurter e2544dd089
client: fix waiting on preempted alloc (#12779)
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high throughput batch scheduling
heavy systems.

However the RPC error made no sense! `previous_alloc` in the logs was a
valid 36 character UUID!

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.

First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
2022-04-26 13:14:43 -07:00
..
allocdir raw_exec: make raw exec driver work with cgroups v2 2022-04-04 16:11:38 -05:00
allochealth Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
allocrunner CSI: enforce one plugin supervisor loop via `sync.Once` (#12785) 2022-04-26 10:38:50 -04:00
allocwatcher client: fix waiting on preempted alloc (#12779) 2022-04-26 13:14:43 -07:00
config ci: fixup task runner chroot test 2022-04-19 10:37:46 -05:00
consul Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
devicemanager ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
dynamicplugins fix data race in dynamic plugin registry tests (#12554) 2022-04-14 14:55:56 -04:00
fingerprint build: update ec2 instance profiles 2022-04-21 11:47:40 -05:00
interfaces disconnected clients: Add reconnect task event (#12133) 2022-04-05 17:12:23 -04:00
lib raw_exec: fixup review comments 2022-04-05 15:21:28 -05:00
logmon ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
pluginmanager CSI: fix data race in plugin manager (#12553) 2022-04-12 12:18:04 -04:00
servers feat: remove dependency to consul/lib 2022-04-09 13:22:44 +02:00
serviceregistration services: cr followup 2022-04-22 09:14:29 -05:00
state Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
stats ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
structs ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
taskenv services: cr followup 2022-04-22 09:14:29 -05:00
testutil client: cgroups v2 code review followup 2022-03-24 13:40:42 -05:00
vaultclient ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
acl.go Audit config, seams for enterprise audit features 2020-03-23 13:47:42 -04:00
acl_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
agent_endpoint.go json handles were moved to a new package in #10202 2021-04-02 13:31:10 +00:00
agent_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
alloc_endpoint.go client: fix multiple imports (#10537) 2021-05-13 14:30:31 -04:00
alloc_endpoint_test.go client: enable support for cgroups v2 2022-03-23 11:35:27 -05:00
alloc_watcher_e2e_test.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
client.go feat: remove dependency to consul/lib 2022-04-09 13:22:44 +02:00
client_stats_endpoint.go Server side impl + touch ups 2018-02-15 13:59:02 -08:00
client_stats_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_test.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
csi_endpoint.go CSI: allow updates to volumes on re-registration (#12167) 2022-03-07 11:06:59 -05:00
csi_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
driver_manager_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
enterprise_client_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
fingerprint_manager.go chore: fixup inconsistent method receiver names. (#11704) 2021-12-20 11:44:21 +01:00
fingerprint_manager_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
fs_endpoint.go Fix log streaming missing frames (#11721) 2022-01-04 14:07:16 -05:00
fs_endpoint_test.go raw_exec: make raw exec driver work with cgroups v2 2022-04-04 16:11:38 -05:00
gc.go chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
gc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
heartbeatstop.go Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099) 2020-06-03 09:48:38 -04:00
heartbeatstop_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
node_updater.go client: use NewNodeEvent builder for consistency (#7559) 2020-03-31 10:02:16 -04:00
rpc.go fix: use NewSafeTimer 2022-04-11 19:37:14 +02:00
rpc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
testing.go client: refactor common service registration objects from Consul. 2022-03-15 09:38:30 +01:00
util.go Revert "client: defensive against getting stale alloc updates" 2020-06-19 15:39:44 -04:00