c132623ffc
This fixes a bug where allocs that have been GCed are re-run after the client is restarted. A heavily used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that is destroyed due to GC remains in the client's alloc runner set. Such runners are periodically persisted until the alloc is GCed by the server. During that time, the client DB contains the alloc but neither its individual task statuses nor its completed state. On restart, the client assumes that the alloc is in the pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information non-transactionally and non-atomically, while the alloc runner is running concurrently and potentially changing state, is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890 |
||
---|---|---|
interfaces | ||
state | ||
taskrunner | ||
alloc_runner.go | ||
alloc_runner_hooks.go | ||
alloc_runner_test.go | ||
alloc_runner_unix_test.go | ||
allocdir_hook.go | ||
config.go | ||
groupservice_hook.go | ||
groupservice_hook_test.go | ||
health_hook.go | ||
health_hook_test.go | ||
migrate_hook.go | ||
network_hook.go | ||
network_hook_test.go | ||
network_manager_linux.go | ||
network_manager_linux_test.go | ||
network_manager_nonlinux.go | ||
networking.go | ||
networking_bridge_linux.go | ||
testing.go | ||
upstream_allocs_hook.go | ||
util.go |
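
The fix described in the commit message above can be illustrated with a minimal sketch. This is not the actual Nomad code: `stateDB`, `allocRunner`, `persist`, and `destroy` are hypothetical names invented for illustration, and the real change lives in `alloc_runner.go`. The sketch only shows the idea that a destroyed runner refuses to write its alloc back to the state DB, with the check and the write done under one lock so a concurrent destroy cannot interleave between them.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a hypothetical stand-in for the client state database.
type stateDB struct {
	mu     sync.Mutex
	allocs map[string]string // alloc ID -> last persisted status
}

func newStateDB() *stateDB {
	return &stateDB{allocs: make(map[string]string)}
}

func (db *stateDB) putAlloc(id, status string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.allocs[id] = status
}

func (db *stateDB) deleteAlloc(id string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.allocs, id)
}

func (db *stateDB) has(id string) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	_, ok := db.allocs[id]
	return ok
}

// allocRunner is a hypothetical, heavily simplified alloc runner.
type allocRunner struct {
	mu        sync.Mutex
	id        string
	db        *stateDB
	destroyed bool
}

// persist writes the alloc to the state DB unless the runner has been
// destroyed. It reports whether a write happened.
func (ar *allocRunner) persist(status string) bool {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	if ar.destroyed {
		// A GCed runner must not resurrect its alloc in the DB.
		return false
	}
	ar.db.putAlloc(ar.id, status)
	return true
}

// destroy marks the runner as destroyed and removes its alloc from the DB.
func (ar *allocRunner) destroy() {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	ar.destroyed = true
	ar.db.deleteAlloc(ar.id)
}

func main() {
	db := newStateDB()
	ar := &allocRunner{id: "alloc-1", db: db}

	fmt.Println(ar.persist("running"), db.has("alloc-1")) // true true
	ar.destroy()
	// A later periodic persist is now a no-op, so a restarted client
	// would not find a bare "pending" alloc to re-run.
	fmt.Println(ar.persist("pending"), db.has("alloc-1")) // false false
}
```

Without the `destroyed` check, the second `persist` call would write the alloc back without its task state, which is the situation the commit message describes.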