Consul registers Connect services automatically, however Nomad thinks it
owns them due to the _nomad prefix. Since the services are managed by
Consul, Nomad needs to explicitly ignore them or otherwies they will be
removed.
When checking driver feasability for an alloc with multiple drivers, we
must check that all drivers are detected and healthy.
Nomad 0.9 and 0.8 have a bug where we may check a single driver only,
but which driver is dependent on map traversal order, which is
unspecified in golang spec.
The ClientState being pending isn't a good criteria; as an alloc may
have been updated in-place before it was completed.
Also, updated the logic so we only check for task states. If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
* taskenv: add connect upstream env vars + test
* set taskenv upstreams instead of appending
* Update client/taskenv/env.go
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
This commit aims to help users running with clients suseptible to the
destroyed alloc being restrarted bug upgrade to latest. Without this,
such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after subsequent restart.
If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side that it's an corrupt persisted
state of an alloc instead of the client happening to be killed right
when alloc is assigned to client.
Few reasons motivate this behavior:
Statistically speaking, corruption being the cause is more likely. A
long running client will have higher chance of having allocs persisted
incorrectly with pending state. Being killed right when an alloc is
about to start is relatively unlikely.
Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.
More importantly, this helps customers upgrade their clients without
risking taking their clients down and destablizing their cluster. We
don't want existing users to force triggering the bug while they upgrade
and restart cluster.
Protect against a race where destroying and persist state goroutines
race.
The downside is that the database io operation will run while holding
the lock and may run indefinitely. The risk of lock being long held is
slow destruction, but slow io has bigger problems.
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set. Periodically, they get persisted until alloc is
gced by server. During that time, the client db will contain the alloc
but not its individual tasks status nor completed state. On client restart,
client assumes that alloc is pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.
Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890