Previously, if a channel was closed, we retried the Stats call. But if that call
failed, we entered a backoff loop without ever calling Stats again.
Here, we use a utility function to call driverHandle.Stats that retries as one
would expect.
I aimed to preserve the logging formats but made small improvements as I saw fit.
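As a rough illustration (not Nomad's actual code; `statsFn` and `callWithRetry` are invented for this sketch), the retry utility behaves like this:
```go
package sketch

import (
	"context"
	"time"
)

// statsFn stands in for a driverHandle.Stats-style call that can fail
// transiently (e.g. while the driver plugin is restarting).
type statsFn func(ctx context.Context) (interface{}, error)

// callWithRetry keeps invoking fn, backing off between attempts, until it
// succeeds or ctx is cancelled, rather than backing off once and never
// calling fn again.
func callWithRetry(ctx context.Context, fn statsFn, backoff time.Duration) (interface{}, error) {
	for {
		res, err := fn(ctx)
		if err == nil {
			return res, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
			// Loop around and retry the call after the backoff.
		}
	}
}
```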
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.
This leads to serious problems, some of which are:
* Lockup in the GC process, as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
* The alloc not being restarted/rescheduled to another node (as it's still in
a pending state)
* Unexpected restart of the alloc on a client restart, potentially days or weeks
after the alloc's expected start time!
Here, we mark all tasks as failed if the alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.
While it's desirable to retry the alloc runner on such failures, I opted to
treat that as out of scope: I'm wary of some subtleties around alloc and task
runners and their idempotency that are better handled in a follow-up PR.
This might be one of the root causes of
https://github.com/hashicorp/nomad/issues/5840.
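A minimal sketch of the idea, using hypothetical types and method names rather than the actual task runner API:
```go
package sketch

import "fmt"

// failer is a hypothetical slice of the task runner API: a way to record a
// failed, terminal state for a task without ever running it.
type failer interface {
	MarkFailed(reason string)
}

// handlePrestartError marks every task as failed when the alloc runner's
// prestart hooks return an error, so the alloc becomes terminal and can be
// rescheduled, instead of its task runners hanging in pending forever.
func handlePrestartError(tasks map[string]failer, prestartErr error) {
	reason := fmt.Sprintf("failed to setup alloc: %v", prestartErr)
	for _, tr := range tasks {
		tr.MarkFailed(reason)
	}
}
```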
This change fixes a bug where Nomad would avoid running alloc tasks if
the alloc is client-terminal but the server copy on the client isn't
marked as running.
Here, we fix the case by having the task runner use
allocRunner.shouldRun() instead of only checking the server-updated
alloc.
Here, we preserve the key invariants, such as `tr.Run()` always being
run, and don't change the overall alloc runner and task runner
lifecycles.
Fixes https://github.com/hashicorp/nomad/issues/5883
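A minimal sketch of the shape of the fix, with invented types (the real shouldRun logic lives on Nomad's alloc runner):
```go
package sketch

// allocRunner is a stand-in for the component that knows both the latest
// server-desired state and the local client state of the alloc.
type allocRunner interface {
	shouldRun() bool
}

type taskRunner struct {
	ar allocRunner
}

// run starts the task only when the alloc runner agrees it should run,
// rather than re-deriving the decision from the server copy alone.
func (tr *taskRunner) run(start func()) {
	if !tr.ar.shouldRun() {
		return
	}
	start()
}
```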
Currently, if killTask results in the termination of a process before
WaitTask is called, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
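One way to express the fix, sketched with invented helpers (not the driver's real API): after a deliberate kill, a "task not found" result from the wait path is treated as "already stopped" rather than an error:
```go
package sketch

import "errors"

// errTaskNotFound stands in for the driver's task-not-found error.
var errTaskNotFound = errors.New("task not found")

// waitAfterKill tolerates the race where the killed process exits before
// the wait call looks it up (as can happen with raw_exec on Windows).
func waitAfterKill(wait func() error) error {
	err := wait()
	if errors.Is(err, errTaskNotFound) {
		// The process already exited after the kill; not a failure.
		return nil
	}
	return err
}
```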
We currently only run cleanup service hooks when a task is either
Killed or Exited. However, due to the task runner's implementation,
tasks are only Exited if they ever correctly started running, which is
not the case when an error occurs early in the task start flow, such as
not being able to pull secrets from Vault.
This updates the service hook to also call Consul deregistration
routines during a task Stop lifecycle event, ensuring that any
registered checks and services are cleared in such cases.
Fixes #5770
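A sketch of the Stop-side cleanup, assuming hypothetical hook and Consul-client types:
```go
package sketch

// deregistrar is a stand-in for the Consul client surface the hook needs:
// removing every service and check registered for a task.
type deregistrar interface {
	RemoveTask(taskID string)
}

type serviceHook struct {
	taskID string
	consul deregistrar
}

// Stop runs on every task stop, including tasks that failed before ever
// starting (e.g. a Vault secret fetch error), so stale Consul
// registrations are cleaned up in those cases too.
func (h *serviceHook) Stop() error {
	h.consul.RemoveTask(h.taskID)
	return nil
}
```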
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.
Fixes https://github.com/hashicorp/nomad/issues/5810
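The fallback looks roughly like this (simplified, invented struct shapes rather than Nomad's real ones):
```go
package sketch

type resources struct{ MemoryMB int }

type allocatedTask struct{ Memory resources }

type allocatedResources struct {
	Tasks map[string]*allocatedTask
}

type allocation struct {
	AllocatedResources *allocatedResources   // set by modern servers
	TaskResources      map[string]*resources // deprecated, set by old servers
}

// taskMemoryMB prefers the new AllocatedResources field but falls back to
// the deprecated TaskResources, which is all an old (e.g. 0.8) server sets.
func taskMemoryMB(a *allocation, task string) int {
	if a.AllocatedResources != nil {
		if t := a.AllocatedResources.Tasks[task]; t != nil {
			return t.Memory.MemoryMB
		}
		return 0
	}
	if r := a.TaskResources[task]; r != nil {
		return r.MemoryMB
	}
	return 0
}
```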
The alloc runner already tracks the tasks associated with an alloc. Here,
we get defensive by relying on the alloc runner's tracked tasks, rather
than depending on the server never updating the job unexpectedly.
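In sketch form (hypothetical names), the defensive iteration looks like:
```go
package sketch

type taskRunner struct{ name string }

// killTasks operates on the task runners the alloc runner actually created
// and tracks, rather than re-deriving the task list from the job definition,
// so an unexpected server-side job update can't cause a runner to be missed.
func killTasks(tracked map[string]*taskRunner, kill func(*taskRunner)) {
	for _, tr := range tracked {
		kill(tr)
	}
}
```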
This fixes an issue where batch and service workloads would never be
restarted due to blocking indefinitely on a nil channel.
It also raises the restoration logging message to `Info` to simplify log
analysis.
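The failure mode is a Go footgun worth spelling out: a receive from a nil channel blocks forever. A tiny self-contained illustration:
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var ch chan struct{} // nil: never initialized on this code path

	select {
	case <-ch:
		// Never reached: receiving from a nil channel blocks forever,
		// which is exactly how the restart path hung.
		fmt.Println("woke up")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("timed out; the nil channel would have blocked forever")
	}
}
```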
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).
Since restoring is synchronous, start the registration goroutine first.
For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
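The ordering reduces to something like this (hypothetical function names):
```go
package sketch

// start registers with servers concurrently before running the synchronous
// restore. Registration doesn't depend on restored state, so doing it first
// keeps heartbeats flowing even when restoring many allocs is slow.
func start(register func(), restore func() error) error {
	go register()    // heartbeating starts immediately
	return restore() // may take a while with lots of allocs
}
```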
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162
Replace the MarkLive method with a chan that is closed by the client.
Thanks to @notnoop for the idea!
The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work, and allows removing the gate
abstraction, which did much more than was needed.
Fixes #1795
Running restored allocations and pulling the set of allocations to run from
the server happen concurrently. This means that if a client is rebooted
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
Related to #4280
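A sketch of the mechanism shared by the two changes above, with invented names: the client closes a channel exactly once after first server contact, and restored task runners that failed to reattach wait on it before restarting anything:
```go
package sketch

import "sync"

// serversContacted is closed exactly once after the client first pulls
// allocations from a server.
type serversContacted struct {
	once sync.Once
	ch   chan struct{}
}

func newServersContacted() *serversContacted {
	return &serversContacted{ch: make(chan struct{})}
}

// Done is called from runAllocs on every pull; sync.Once makes the
// repeated calls cheap and the close idempotent.
func (s *serversContacted) Done() {
	s.once.Do(func() { close(s.ch) })
}

// Wait blocks until a server has been contacted. A restored task runner
// that failed to reattach calls this before restarting its task, so the
// server has had a chance to mark the alloc terminal first.
func (s *serversContacted) Wait() {
	<-s.ch
}
```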
This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
(in bytes) to metrics, to make it easier to calculate how close a task is to OOMing.
```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
* client: was not using up-to-date client state when determining which allocs count towards allocated resources
* Update client/client.go
Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
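The guard is conceptually simple (sketched with made-up field names):
```go
package sketch

type deployment struct {
	status    string // e.g. "running", "successful", "failed"
	healthy   int
	unhealthy int
}

// applyAllocHealth ignores health updates once the deployment has reached
// a terminal successful status, so later alloc failures can't rewrite the
// healthy/unhealthy counts of an already-successful deployment.
func applyAllocHealth(d *deployment, healthy bool) {
	if d.status == "successful" {
		return
	}
	if healthy {
		d.healthy++
	} else {
		d.unhealthy++
	}
}
```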
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.
Signal validation is currently handled by the driver itself; Nomad does not
attempt to restrict or validate signals.
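Illustrative invocations (placeholders for the allocation and task; flags may vary by version):
```
# Send SIGUSR1 to a single task in the allocation
nomad alloc signal -s SIGUSR1 <alloc-id> <task-name>

# Omit the task name to signal every task in the allocation
nomad alloc signal -s SIGHUP <alloc-id>
```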
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.
Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.
Here we revert to explicitly detecting reattachment separately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.
This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.
Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.
If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.
If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retrying until restarts are exhausted) as the current
failure mode.
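The recovery flow boils down to roughly this (invented function shapes, not logmon's real hook API):
```go
package sketch

// reattachOrLaunch tries to reattach to a previously started logmon plugin
// and, if reattachment fails (most commonly because the pid is gone), falls
// back to launching a fresh plugin instead of retrying a dead pid until the
// task's restarts are exhausted.
func reattachOrLaunch(reattach func() error, launch func() error) error {
	if reattach == nil {
		// No prior plugin recorded: this is a first launch.
		return launch()
	}
	if err := reattach(); err != nil {
		// Recover by starting a new plugin; the worst case matches the
		// old failure mode, and the common case avoids it entirely.
		return launch()
	}
	return nil
}
```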
This adds a `nomad alloc restart` command and API that allows a job operator
with the alloc-lifecycle ACL capability to perform an in-place restart of a
Nomad allocation, or a given task within it.
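Illustrative usage (placeholders for the allocation and task):
```
# Restart every task in the allocation in place
nomad alloc restart <alloc-id>

# Restart a single task within the allocation
nomad alloc restart <alloc-id> <task-name>
```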