Protect against a race where destroying and persist state goroutines
race.
The downside is that the database io operation will run while holding
the lock and may run indefinitely. The risk of lock being long held is
slow destruction, but slow io has bigger problems.
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set. Periodically, they get persisted until alloc is
gced by server. During that time, the client db will contain the alloc
but not its individual tasks status nor completed state. On client restart,
client assumes that alloc is pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.
Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
* adds meta object to service in job spec, sends it to consul
* adds tests for service meta
* fix tests
* adds docs
* better hashing for service meta, use helper for copying meta when registering service
* tried to be DRY, but looks like it would be more work to use the
helper function
Fixes#6041
Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.
When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
When rendering a task consul template, ensure that only task environment
variables are used.
Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1]. Thus, we add
an empty entry for each host process env-var not found in task env-vars.
[1] bfa5d0e133/template/funcs.go (L61-L75)
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
Previously, if a channel is closed, we retry the Stats call. But, if that call
fails, we go in a backoff loop without calling Stats ever again.
Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.
I aimed to preserve the logging formats but made small improvements as I saw fit.
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.
This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
alloc expected start time!
Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.
While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope. I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.
This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.
Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.
Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.
Fixes https://github.com/hashicorp/nomad/issues/5883
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.
This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.
fixes#5770
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.
Fixes https://github.com/hashicorp/nomad/issues/5810