* jobspec: rename node pool scheduler_configuration
In HCL specifications we usually call configuration blocks `config`
instead of `configuration`.
* np: add memory oversubscription config
* np: make scheduler config ENT
Add structs and fields to support the Nomad Pools Governance Enterprise
feature of controlling node pool access via namespaces.
Nomad Enterprise allows users to specify a default node pool to be used
by jobs that don't specify one. In order to accomplish this, it's
necessary to distinguish between a job that explicitly uses the
`default` node pool and one that did not specify any.
If the `default` node pool is set during job canonicalization it's
impossible to do this, so this commit allows a job to have an empty node
pool value during registration but sets to `default` at the admission
controller mutator.
In order to guarantee state consistency the state store validates that
the job node pool is set and exists before inserting it.
Upserts and deletes of node pools are forwarded to the authoritative region,
just like we do for namespaces, quotas, ACL policies, etc. Replicate node pools
from the authoritative region.
Implementation of the base work for the new node pools feature. It includes a new `NodePool` struct and its corresponding state store table.
Upon start the state store is populated with two built-in node pools that cannot be modified nor deleted:
* `all` is a node pool that always includes all nodes in the cluster.
* `default` is the node pool where nodes that don't specify a node pool in their configuration are placed.
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.
now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.
removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting
their shutdown delay, and retrying as needed
* also includes task preKill hooks
* clearDriverHandle() to destroy the task
and associated resources
* task exited hooks
Their release notes are here: https://github.com/golang-jwt/jwt/releases
Seemed wise to upgrade before we do even more with JWTs. For example
this upgrade *would* have mattered if we already implemented common JWT
claims such as expiration. Since we didn't rely on any claim
verification this upgrade is a noop...
...except for 1 test that called `Claims.Valid()`! Removing that
assertion *seems* scary, but it didn't actually do anything because we
didn't implement any of the standard claims it validated:
https://github.com/golang-jwt/jwt/blob/v4.5.0/map_claims.go#L120-L151
So functionally this major upgrade is a noop.
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
In preperation for some refactoring to tasksUpdated, add a benchmark to the
old code so it's easy to compare with the changes, making sure nothing goes
off the rails for performance.
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.
Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
This adds new OIDC endpoints on the RPC endpoint. These two RPCs
handle generating the OIDC provider URL and then completing the
login by exchanging the provider token with an internal Nomad
token.
The RPC endpoints both do double forwarding. The initial forward
is to ensure we are talking to the regional leader; the second
then takes into account whether the auth method generates local or
global tokens. If it creates global tokens, we must then forward
onto the federated regional leader.
This change adds a new table that will store ACL binding rule
objects. The two indexes allow fast lookups by their ID, or by
which auth method they are linked to. Snapshot persist and
restore functionality ensures this table can be saved and
restored from snapshots.
In order to write and delete the object to state, new Raft messages
have been added.
All RPC request and response structs, along with object functions
such as diff and canonicalize have been included within this work
as it is nicely separated from the other areas of work.
This PR implements ACLAuthMethod type, acl_auth_methods table schema and crud state store methods. It also updates nomadSnapshot.Persist and nomadSnapshot.Restore methods in order for them to work with the new table, and adds two new Raft messages: ACLAuthMethodsUpsertRequestType and ACLAuthMethodsDeleteRequestType
This PR is part of the SSO work captured under ☂️ ticket #13120.
This PR splits up the nomad/mock package into more files. Specific features
that have a lot of mocks get their own file (e.g. acl, variables, csi, connect, etc.).
Above that, functions that return jobs/allocs/nodes are in the job/alloc/node file. And
lastly other mocks/helpers are in mock.go
This PR implements support for check_restart for checks registered
in the Nomad service provider.
Unlike Consul, Nomad service checks never report a "warning" status,
and so the check_restart.ignore_warnings configuration is not valid
for Nomad service checks.
* allocrunner: handle lifecycle when all tasks die
When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.
* allocrunner: implement different alloc restarts
Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.
* allocrunner: allow tasks to run again
Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.
The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.
* api/cli: add support for all tasks alloc restart
Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.
* test: fix tasklifecycle Coordinator test
* allocrunner: kill taskrunners if all tasks are dead
When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.
* taskrunner: fix tests that waited on WaitCh
Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.
* tests: add tests for all tasks alloc restart
* changelog: add entry for #14127
* taskrunner: fix restore logic.
The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.
It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.
The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.
This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.
* address code review requests
* apply more code review changes
* taskrunner: add different Restart modes
Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.
So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
restart if its `Run()` method is still active. Callers will need to
restart the allocRunner taskCoordinator to make sure it will allow the
task to run again.
* minor fixes
The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).
This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.
With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.
This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
This commit includes the new state schema for ACL roles along with
state interaction functions for CRUD actions.
The change also includes snapshot persist and restore
functionality and the addition of FSM messages for Raft updates
which will come via RPC endpoints.
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.
Currently only HTTP and TCP checks are supported, though more types
could be added later.
* SV: CAS
* Implement Check and Set for Delete and Upsert
* Reading the conflict from the state store
* Update endpoint for new error text
* Updated HTTP api tests
* Conflicts to the HTTP api
* SV: structs: Update SV time to UnixNanos
* update mock to UnixNano; refactor
* SV: encrypter: quote KeyID in error
* SV: mock: add mock for namespace w/ SV
We need to track per-namespace storage usage for secure variables even
in Nomad OSS so that a cluster can be seamlessly upgraded from OSS to
ENT without having to re-calculate quota usage.
Provide a hook in the upsert RPC for enforcement of quotas in
ENT. This will be a no-op in Nomad OSS.
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.
* Make Encrypt function return KeyID
* Split SecureVariable
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The volumewatcher test incorrectly represents the change in attachment
and access modes introduced in Nomad 1.1.0 to support volume
creation. This leads to a test that happens to pass but only
accidentally.
Update the test to correctly represent the volume modes set by the
existing claims on the test volumes.
When an allocation is updated, the job summary for the associated job
is also updated. CSI uses the job summary to set the expected count
for controller and node plugins. We incorrectly used the allocation's
server status instead of the job status when deciding whether to
update or remove the job from the plugins. This caused a node drain or
other terminal state for an allocation to clear the expected count for
the entire plugin.
Use the job status to guide whether to update or remove the expected
count.
The existing CSI tests for the state store incorrectly modeled the
updates we received from servers vs those we received from clients,
leading to test assertions that passed when they should not.
Rework the tests to clarify each step in the lifecycle and rename CSI state
store functions for clarity
Update the logic in the Nomad client's alloc health tracker which
erroneously marks existing healthy allocations with dead poststart ephemeral
tasks as unhealthy even if they were already successful during a previous
deployment.
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527