c5f5a1fcb9
When fetching node alloc assignments, be defensive against a stale read before killing local node allocs.

The bug occurs when both the client and the servers are restarting: the client requests the allocations assigned to its node and may get stale data because the server hasn't finished applying all of the restored Raft transactions to its state store. Consequently, the client kills and destroys the allocs locally, only to fetch them again moments later once the server store is up to date.

The bug can be reproduced quite reliably with a single-node setup (configured with persistence). I suspect it's too edge-casey to occur in a production cluster with multiple servers, but we may need to examine leader failover scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more recent than the alloc index. This is a cheap resiliency fix using a check we already apply when detecting alloc updates. A more proper fix would be to ensure that a Nomad server only serves RPC calls once its state store is fully restored, or up to date in leadership transition cases.
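A minimal sketch of the index comparison described above, under stated assumptions: the struct, field, and function names (Alloc, AllocModifyIndex, filterRemovals, respIndex) are illustrative and not the actual identifiers in the Nomad client. The idea is to trust a removal only when the server's response index is newer than the index at which the local alloc was written; otherwise treat the read as stale and keep the alloc.

```go
package main

import "fmt"

// Alloc stands in for a locally tracked allocation with its Raft write index.
type Alloc struct {
	ID               string
	AllocModifyIndex uint64
}

// filterRemovals returns the IDs of local allocs that are safe to remove:
// those no longer assigned to this node AND whose removal is backed by a
// server response index newer than the alloc's own index. A stale response
// (respIndex <= alloc index) does not trigger removal.
func filterRemovals(local map[string]*Alloc, serverIDs map[string]struct{}, respIndex uint64) []string {
	var remove []string
	for id, alloc := range local {
		if _, ok := serverIDs[id]; ok {
			continue // still assigned to this node
		}
		if respIndex > alloc.AllocModifyIndex {
			remove = append(remove, id)
		}
	}
	return remove
}

func main() {
	local := map[string]*Alloc{
		"a1": {ID: "a1", AllocModifyIndex: 100},
	}
	// Server restarted and hasn't replayed Raft yet: stale response index, alloc kept.
	fmt.Println(filterRemovals(local, map[string]struct{}{}, 50))
	// Server store is up to date and no longer lists the alloc: alloc removed.
	fmt.Println(filterRemovals(local, map[string]struct{}{}, 150))
}
```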