When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.
Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
The `docker` driver cannot expand capabilities beyond the default set when the
task is a non-root user. Clarify this in the documentation of `allow_caps` and
update the `cap_add` and `cap_drop` to match the `exec` driver, which has more
clear language overall.
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process. In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.
Per CAPABILITIES (7):
> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged. The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips
This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.
Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.
For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.
Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.
node attributes before
cpu.arch = arm64
cpu.modelname = Apple M2 Pro
cpu.numcores = 12
cpu.reservablecores = 0
cpu.totalcompute = 1000
node attributes after
cpu.arch = arm64
cpu.frequency.efficiency = 2424
cpu.frequency.power = 3504
cpu.modelname = Apple M2 Pro
cpu.numcores.efficiency = 4
cpu.numcores.power = 8
cpu.reservablecores = 0
cpu.totalcompute = 37728
* fingerprint/cpu: follow up cr items
When cluster administrators restore from Raft snapshot, they also need to ensure the
keyring is in place. For on-prem users doing in-place upgrades this is less of a
concern but for typical cloud workflows where the whole host is replaced, it's
an important warning (at least until #14852 has been implemented).
The configuration docs for `client.template.vault_retry`, `consul_retry`, and
`nomad_retry` incorrectly document the default number of attempts to be
unlimited (0). When we added these config blocks, we defaulted the fields to
`nil` for backwards compatibility, which causes them to fall back to the default
consul-template configuration values.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* style: refactor force run function
* fix: remove defer and inline unlock for speed optimization
* Update nomad/leader.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* style: refactor tests to use must
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* fix: move back from defer to calling unlock before returning.
createEval cant be called with the lock on
* style: refactor test to use must
* added new entry to changelog and update comments
---------
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
When a disconnect client reconnects the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.
This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.
This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.
This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.
It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.
The system `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
Implement the new `nomad job restart` command that allows operators to
restart allocations tasks or reschedule then entire allocation.
Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.
This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.
Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
This changeset refactors the tests of the draining node watcher so that we don't
mock the node watcher's `Remove` and `Update` methods for its own tests. Instead
we'll mock the node watcher's dependencies (the job watcher and deadline
notifier) and now unit tests can cover the real code. This allows us to remove a
bunch of TODOs in `watch_nodes.go` around testing and clarify some important
behaviors:
* Nodes that are down or disconnected will still be watched until the scheduler
decides what to do with their allocations. This will drive the job watcher but
not the node watcher, and that lets the node watcher gracefully handle cases
where a heartbeat fails but the node heartbeats again before its allocs can be
evicted.
* Stop watching nodes that have been deleted. The blocking query for nodes set
the maximum index to the highest index of a node it found, rather than the
index of the nodes table. This misses updates to the index from deleting
nodes. This was done as an performance optimization to avoid excessive
unblocking, but because the query is over all nodes anyways there's no
optimization to be had here. Remove the optimization so we can detect deleted
nodes without having to wait for an update to an unrelated node.
* Generate files for 1.5.2 release
* Prepare for next release
* add 1.4.7 and 1.3.12 to the changelog
---------
Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
In #12916 we updated the events test as part of a larger set of changes around
mapstructure serialization fixes. But the changes to the jobs we're deploying in
the tests had invalid task configs so they never result in good deployments and
the test will always fail. Make the before/after jobs identical (except for the
version bump) and make them valid. Also wait for allocations for the 2nd job run
to appear before checking the deployment list, so that we don't race with the
scheduler.
The drainer component is fairly complex. As part of upcoming work to fix some of
the drainer's rough edges, document the drainer's architecture from a Nomad
developer perspective.
Fixes#16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
* Throw your mouse into traffic
* Add node metadata with a shortcut
* Re-labelled
* Adds a toast notification to job start/stop on keyboard shortcut
* Typo fix
* Added and flag to command
* cli[style]: small refactor to avoid confussion with tmpl variable
* Update inspect.mdx
* cli: add changelog entry
* Update .changelog/16478.txt
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update command/quota_inspect.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: add json and t flags to quota status command
* cli: add entry to changelog
* Update command/quota_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: Add and flags to server members
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update the server memebers tests to use must
* cli: add flags addition to changelog
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
nomad login command does not need to know ACL Auth Method's type, since all
method names are unique.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* services: always set deregister flag after deregistration of group
This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).
In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.
This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.
* services: cleanup group service hook tests
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.
Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
* landlock: git needs more files for private repositories
This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs
- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts
Add these rules to the landlock rules for the artifact sandbox.
* cr: use nonexistent instead of devnull
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cr: use go-homdir for looking up home directory
* pr: pull go-homedir into explicit require
* cr: fixup homedir tests in homeless root cases
* cl: fix root test for real
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>