New RPC endpoints introduced during OIDC and JWT auth perform unnecessary many RPC calls when they upsert generated ACL tokens, as pointed out by @tgross.
This PR moves the common logic from acl.UpsertTokens method into a helper method that contains common logic, and sidesteps authentication, metrics, etc.
Implementation of the base work for the new node pools feature. It includes a new `NodePool` struct and its corresponding state store table.
Upon start the state store is populated with two built-in node pools that cannot be modified nor deleted:
* `all` is a node pool that always includes all nodes in the cluster.
* `default` is the node pool where nodes that don't specify a node pool in their configuration are placed.
* core: eliminate second index on job_submissions table
This PR refactors the job_submissions state store code to eliminate the
use of a second index formerly used for purging all versions of a given
job. In practice we ended up with duplicate entries on the table. Instead,
use index prefix scanning on the primary index and tidy up any potential
for creating (or removing) duplicates.
* core: pr comments followup
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:
* The `allocSync` method that updates the servers isn't gated on first contact
with the servers. This means that if a server updates the desired state while
the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
RPC. This will typically result in the client updating the server with "running"
and then immediately thereafter "complete".
* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
if it's possible to assert that the server has definitely seen the client
state. The allocrunner may queue-up updates even if we gate sending them. So
then we end up with a race between the allocrunner updating its internal state
to overwrite the previous update and `allocSync` sending the bogus or duplicate
update.
This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.
The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.
Fixes#16381
* api: set the job submission during job reversion
This PR fixes a bug where the job submission would always be nil when
a job goes through a reversion to a previous version. Basically we need
to detect when this happens, lookup the submission of the job version
being reverted to, and set that as the submission of the new job being
created.
* e2e: add e2e test for job submissions during reversion
This e2e test ensures a reverted job inherits the job submission
associated with the version of the job being reverted to.
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.
now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.
removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting
their shutdown delay, and retrying as needed
* also includes task preKill hooks
* clearDriverHandle() to destroy the task
and associated resources
* task exited hooks
* connect: use heuristic to detect sidecar task driver
This PR adds a heuristic to detect whether to use the podman task driver
for the connect sidecar proxy. The podman driver will be selected if there
is at least one task in the task group configured to use podman, and there
are zero tasks in the group configured to use docker. In all other cases
the task driver defaults to docker.
After this change, we should be able to run typical Connect jobspecs
(e.g. nomad job init [-short] -connect) on Clusters configured with the
podman task driver, without modification to the job files.
Closes#17042
* golf: cleanup driver detection logic
The job scale RPC endpoint hard-coded the eval creation to use the
type of service. This meant scaling events triggered on jobs of
type batch would create evaluations with the wrong type, which
does not seem to cause any problems, just confusion when
correlating the two.
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.
This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.
This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.
Fixes#17076
Their release notes are here: https://github.com/golang-jwt/jwt/releases
Seemed wise to upgrade before we do even more with JWTs. For example
this upgrade *would* have mattered if we already implemented common JWT
claims such as expiration. Since we didn't rely on any claim
verification this upgrade is a noop...
...except for 1 test that called `Claims.Valid()`! Removing that
assertion *seems* scary, but it didn't actually do anything because we
didn't implement any of the standard claims it validated:
https://github.com/golang-jwt/jwt/blob/v4.5.0/map_claims.go#L120-L151
So functionally this major upgrade is a noop.
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.
This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.
---
Fixes: #11175
Co-authored-by: Thomas Weber <towe75@googlemail.com>
* Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0
Fixes#16808
* Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack
* deps: use go-msgpack v2.0.0
go-msgpack v2.1.0 includes some code changes that we will need to
investigate furthere to assess its impact on Nomad, so keeping this
dependency on v2.0.0 for now since it's no-op.
---------
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.
This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
* func: add namespace support for list deployment
* func: add wildcard to namespace filter for deployments
* Update deployment_endpoint.go
* style: use must instead of require or asseert
* style: rename paginator to avoid clash with import
* style: add changelog entry
* fix: add missing parameter for upsert jobs
msgpackrpc codec handles are specific to a connection and cannot be shared
between goroutines; this can cause corrupted decoding. Fix the drainer
integration test so that we create separate codecs for the goroutines that the
test helper spins up to simulate client updates.
This changeset also refactors the drainer integration test to bring it up to
current idioms and library usages, make assertions more clear, and reduce
duplication.
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.
The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!
Update so that `migrate=true` implies `sticky=true` as follows:
* The failure mode when client ACLs are enabled comes from the server not passing
along a migration token. Update the server so that the server provides a
migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
of the test more clear.
Requests without an ACL token that pass thru the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.
Fix the bypass by only setting a client ID on the identity if we have a valid node secret.
Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.
Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
`Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
can enforce ACLs for all endpoints in a uniform way that's less error-prone.~
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* style: refactor force run function
* fix: remove defer and inline unlock for speed optimization
* Update nomad/leader.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* style: refactor tests to use must
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* fix: move back from defer to calling unlock before returning.
createEval cant be called with the lock on
* style: refactor test to use must
* added new entry to changelog and update comments
---------
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
This changeset refactors the tests of the draining node watcher so that we don't
mock the node watcher's `Remove` and `Update` methods for its own tests. Instead
we'll mock the node watcher's dependencies (the job watcher and deadline
notifier) and now unit tests can cover the real code. This allows us to remove a
bunch of TODOs in `watch_nodes.go` around testing and clarify some important
behaviors:
* Nodes that are down or disconnected will still be watched until the scheduler
decides what to do with their allocations. This will drive the job watcher but
not the node watcher, and that lets the node watcher gracefully handle cases
where a heartbeat fails but the node heartbeats again before its allocs can be
evicted.
* Stop watching nodes that have been deleted. The blocking query for nodes set
the maximum index to the highest index of a node it found, rather than the
index of the nodes table. This misses updates to the index from deleting
nodes. This was done as an performance optimization to avoid excessive
unblocking, but because the query is over all nodes anyways there's no
optimization to be had here. Remove the optimization so we can detect deleted
nodes without having to wait for an update to an unrelated node.
Fixes#16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.