- updated region in job metadata that gets persisted to nomad datastore
- fixed many unrelated unit tests that used an invalid region value
(they previously passed because hcl wasn't getting picked up and
the job would default to global region)
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.
This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.
fixes#5770
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.
Fixes https://github.com/hashicorp/nomad/issues/5810
Alloc runner already tracks tasks associated with alloc. Here, we
become defensive by relying on the alloc runner tracked tasks, rather
than depend on server never updating the job unexpectedly.
This exposes a client flag to disable nomad remote exec support in
environments where access to tasks ought to be restricted.
I used `disable_remote_exec` client flag that defaults to allowing
remote exec. Opted for a client config that can be used to disable
remote exec globally, or to a subset of the cluster if necessary.
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.
It also raises the restoration logging message to `Info` to simplify log
analysis.
Without this change, alloc_endpoint cancel the context passed to handler
when we detect EOF. This races driver in setting exit code; and we run
into a case where the exec process terminates cleanly yet we attempt to
mark it as failed with context error.
Here, we rely on the driver to handle errors returned from Stream and
without racing to set an error.
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).
Since restoring is synchronous, start the registration goroutine first.
For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162
Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!
The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
Fixes#1795
Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
Related to #4280
This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.
```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
* client: was not using up-to-date client state in determining which alloc count towards allocated resources
* Update client/client.go
Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed upon retries when a single server
fails. Also made a drive by fix to call the periodic server shuffler goroutine.
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.