In #12916 we updated the events test as part of a larger set of changes around
mapstructure serialization fixes. But the changes to the jobs we're deploying in
the tests had invalid task configs so they never result in good deployments and
the test will always fail. Make the before/after jobs identical (except for the
version bump) and make them valid. Also wait for allocations for the 2nd job run
to appear before checking the deployment list, so that we don't race with the
scheduler.
The drainer component is fairly complex. As part of upcoming work to fix some of
the drainer's rough edges, document the drainer's architecture from a Nomad
developer perspective.
Fixes#16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
* Throw your mouse into traffic
* Add node metadata with a shortcut
* Re-labelled
* Adds a toast notification to job start/stop on keyboard shortcut
* Typo fix
* Added and flag to command
* cli[style]: small refactor to avoid confussion with tmpl variable
* Update inspect.mdx
* cli: add changelog entry
* Update .changelog/16478.txt
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update command/quota_inspect.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: add json and t flags to quota status command
* cli: add entry to changelog
* Update command/quota_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: Add and flags to server members
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update website/content/docs/commands/server/members.mdx
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update the server memebers tests to use must
* cli: add flags addition to changelog
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
nomad login command does not need to know ACL Auth Method's type, since all
method names are unique.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* services: always set deregister flag after deregistration of group
This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).
In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.
This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.
* services: cleanup group service hook tests
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.
Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
* landlock: git needs more files for private repositories
This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs
- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts
Add these rules to the landlock rules for the artifact sandbox.
* cr: use nonexistent instead of devnull
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* cr: use go-homdir for looking up home directory
* pr: pull go-homedir into explicit require
* cr: fixup homedir tests in homeless root cases
* cl: fix root test for real
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* docs: dispatch_payload docs had some weirdness
Docs said "Examples" when there was only 1 example. Not sure what the
floating "to" in the description was for.
* docs: missing a heading level on jobs api docs
This PR fixes the non-root macOS use case where artifact downloads
stopped working. It seems setting a Credential on a SysProcAttr
used by the exec package will always cause fork/exec to fail -
even if the credential contains our own UID/GID or nil UID/GID.
Technically we do not need to set this as the child process will
inherit the parent UID/GID anyway... and not setting it makes
things work again ... /shrug
Fix the e2e case where we download the go-getter bomb.zip test file, which
is being removed on main. We can still get it from the version tag - yay git!
* cli: Add and flag to namespace status command
* Update command/namespace_status.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cli: update tests for namespace status command to use must
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Our auth token parsing code trims space around the `Authorization` header but
not around `X-Nomad-Token`. When using the UI, it's easy to accidentally
introduce a leading or trailing space, which results in spurious authentication
errors. Trim the space at the HTTP server.
* cgv1: do not disable cpuset manager if reserved interface already exists
This PR fixes a bug where restarting a Nomad Client on a machine using cgroups
v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself.
This is being caused by incorrectly interpreting a "file exists" error as
problematic when ensuring the reserved cpuset exists. If we get a "file exists"
error, that just means the Client was likely restarted.
Note that a machine reboot would fix the issue - the groups interfaces are
ephemoral.
* cl: add cl
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
The tests for the system allocs reconciling code path (`diffSystemAllocs`)
include many impossible test environments, such as passing allocs for the wrong
node into the function. This makes the test assertions nonsensible for use in
walking yourself through the correct behavior.
I've pulled this changeset out of PR #16097 so that we can merge these
improvements and revisit the right approach to fix the problem in #16097 with
less urgency now that the PFNR bug fix has been merged. This changeset breaks up
a couple of tests, expands test coverage, and makes test assertions more
clear. It also corrects one bit of production code that behaves fine in
production because of canonicalization, but forces us to remember to set values
in tests to compensate.
In preperation for some refactoring to tasksUpdated, add a benchmark to the
old code so it's easy to compare with the changes, making sure nothing goes
off the rails for performance.
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).