open-nomad/client/allocrunner
Mahmood Ali c132623ffc Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set. Its alloc keeps getting persisted periodically until
the alloc is GCed by the server. During that time, the client DB contains the
alloc but neither its individual task statuses nor its completed state. On
client restart, the client assumes the alloc is in the pending state and
re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
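
As a rough illustration of the idea (not the actual change in alloc_runner.go),
the guard can be sketched as a destroyed flag that is checked under a lock
before every write. All names here (stateDB, allocRunner, Persist, Destroy,
Has) are hypothetical stand-ins and do not reflect Nomad's real types or
signatures.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a hypothetical in-memory stand-in for the client state DB.
type stateDB struct {
	mu     sync.Mutex
	allocs map[string]string // alloc ID -> last persisted client status
}

func newStateDB() *stateDB {
	return &stateDB{allocs: make(map[string]string)}
}

func (db *stateDB) PutAllocation(id, status string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.allocs[id] = status
}

func (db *stateDB) DeleteAllocation(id string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.allocs, id)
}

// Has reports whether the DB currently holds an entry for the alloc.
func (db *stateDB) Has(id string) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	_, ok := db.allocs[id]
	return ok
}

// allocRunner is a hypothetical, heavily simplified alloc runner.
type allocRunner struct {
	id string
	db *stateDB

	// mu guards destroyed and serializes Persist against Destroy, so a
	// periodic persist cannot write the alloc back after it was removed.
	mu        sync.Mutex
	destroyed bool
}

// Persist writes the alloc to the state DB unless the runner has been
// destroyed. It reports whether a write happened.
func (ar *allocRunner) Persist(status string) bool {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	if ar.destroyed {
		return false // destroyed runners must not resurrect their alloc
	}
	ar.db.PutAllocation(ar.id, status)
	return true
}

// Destroy marks the runner as destroyed and removes its alloc from the DB.
func (ar *allocRunner) Destroy() {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	ar.destroyed = true
	ar.db.DeleteAllocation(ar.id)
}

func main() {
	db := newStateDB()
	ar := &allocRunner{id: "alloc-1", db: db}

	fmt.Println(ar.Persist("running"), db.Has("alloc-1")) // true true
	ar.Destroy()
	// A late periodic persist is now a no-op, so nothing is left to re-run.
	fmt.Println(ar.Persist("running"), db.Has("alloc-1")) // false false
}
```

Holding the same lock in both Persist and Destroy is what closes the window in
this sketch: without it, a persist that passed the check could still write
after the delete.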

This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, concurrently with an alloc runner that is running and
potentially changing state, is a recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
interfaces client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
state stale allocation data leads to incorrect (and even negative) metrics (#5637) 2019-05-07 15:54:36 -04:00
taskrunner taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
alloc_runner.go Don't persist allocs of destroyed alloc runners 2019-08-25 11:21:28 -04:00
alloc_runner_hooks.go connect: register group services with Consul 2019-08-20 12:25:10 -07:00
alloc_runner_test.go Don't persist allocs of destroyed alloc runners 2019-08-25 11:21:28 -04:00
alloc_runner_unix_test.go connect: register group services with Consul 2019-08-20 12:25:10 -07:00
allocdir_hook.go client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
config.go client: do not restart dead tasks until server is contacted (try 2) 2019-05-14 10:53:27 -07:00
groupservice_hook.go connect: register group services with Consul 2019-08-20 12:25:10 -07:00
groupservice_hook_test.go connect: register group services with Consul 2019-08-20 12:25:10 -07:00
health_hook.go client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
health_hook_test.go client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
migrate_hook.go client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
network_hook.go ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
network_hook_test.go ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
network_manager_linux.go fix failing tests 2019-07-31 01:04:07 -04:00
network_manager_linux_test.go ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
network_manager_nonlinux.go ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
networking.go ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
networking_bridge_linux.go ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
testing.go client: do not restart dead tasks until server is contacted (try 2) 2019-05-14 10:53:27 -07:00
upstream_allocs_hook.go client: cleanup and document context uses 2019-03-12 15:03:54 -07:00
util.go allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00