* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.
The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!
Update so that `migrate=true` implies `sticky=true` as follows:
* The failure mode when client ACLs are enabled comes from the server not passing
along a migration token. Update the server so that the server provides a
migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
of the test more clear.
Requests without an ACL token that pass thru the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.
Fix the bypass by only setting a client ID on the identity if we have a valid node secret.
Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.
Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
`Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
can enforce ACLs for all endpoints in a uniform way that's less error-prone.~
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.
* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes#11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.
* style: refactor force run function
* fix: remove defer and inline unlock for speed optimization
* Update nomad/leader.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* style: refactor tests to use must
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* Update nomad/leader_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* fix: move back from defer to calling unlock before returning.
createEval cant be called with the lock on
* style: refactor test to use must
* added new entry to changelog and update comments
---------
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
This changeset refactors the tests of the draining node watcher so that we don't
mock the node watcher's `Remove` and `Update` methods for its own tests. Instead
we'll mock the node watcher's dependencies (the job watcher and deadline
notifier) and now unit tests can cover the real code. This allows us to remove a
bunch of TODOs in `watch_nodes.go` around testing and clarify some important
behaviors:
* Nodes that are down or disconnected will still be watched until the scheduler
decides what to do with their allocations. This will drive the job watcher but
not the node watcher, and that lets the node watcher gracefully handle cases
where a heartbeat fails but the node heartbeats again before its allocs can be
evicted.
* Stop watching nodes that have been deleted. The blocking query for nodes set
the maximum index to the highest index of a node it found, rather than the
index of the nodes table. This misses updates to the index from deleting
nodes. This was done as an performance optimization to avoid excessive
unblocking, but because the query is over all nodes anyways there's no
optimization to be had here. Remove the optimization so we can detect deleted
nodes without having to wait for an update to an unrelated node.
Fixes#16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
The tests for the system allocs reconciling code path (`diffSystemAllocs`)
include many impossible test environments, such as passing allocs for the wrong
node into the function. This makes the test assertions nonsensible for use in
walking yourself through the correct behavior.
I've pulled this changeset out of PR #16097 so that we can merge these
improvements and revisit the right approach to fix the problem in #16097 with
less urgency now that the PFNR bug fix has been merged. This changeset breaks up
a couple of tests, expands test coverage, and makes test assertions more
clear. It also corrects one bit of production code that behaves fine in
production because of canonicalization, but forces us to remember to set values
in tests to compensate.
In preperation for some refactoring to tasksUpdated, add a benchmark to the
old code so it's easy to compare with the changes, making sure nothing goes
off the rails for performance.
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
When the scheduler tries to find a placement for a new allocation, it iterates
over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with
the ports of all existing allocations and any other allocations already proposed
as part of this same evaluation via its `SetAllocs` method. Then we make an
"ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive
an "offer" in return. The offer will include both static ports and any dynamic
port assignments.
The `AssignPorts` method was written to support group networks, and it shares
code that selects dynamic ports with the original `AssignTaskNetwork`
code. `AssignTaskNetwork` can request multiple ports from the bitmap at a
time. But `AssignPorts` requests them one at a time and does not account for
possible collisions, and doesn't return an error in that case.
What happens next varies:
1. If the scheduler doesn't place the allocation on that node, the port
conflict is thrown away and there's no problem.
2. If the node is picked and this is the only allocation (or last allocation),
the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect.
3. If the node is picked and there are additional allocations in the same eval
that iterate over the same node, their call to `SetAllocs` will detect the
impossible state and the node will be rejected. This can have the puzzling
behavior where a second task group for the job without any networking at all
can hit a port collision error!
It looks like this bug has existed since we implemented group networks, but
there are several factors that add up to making the issue rare for many users
yet frustratingly frequent for others:
* You're more likely to hit this bug the more tightly packed your range for
dynamic ports is. With 12000 ports in the range by default, many clusters can
avoid this for a long time.
* You're more likely to hit case (3) for jobs with lots of allocations or if a
scheduler has to iterate over a large number of nodes, such as with system jobs,
jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints.
For unlucky combinations of these factors, it's possible that case (3) happens
repeatedly, preventing scheduling of a given job until a client state
change (ex. restarting the agent so all its allocations are rescheduled
elsewhere) re-opens the range of dynamic ports available.
This changeset:
* Fixes the bug by accounting for collisions in dynamic port selection in
`AssignPorts`.
* Adds test coverage for `AssignPorts`, expands coverage of this case for the
deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a
scheduler test for spread scheduling to more easily detect this kind of problem
in the future.
* Adds a `String()` method to `Bitmap` so that any future "screaming" log lines
have a human-readable list of used ports.
Wildcard datacenters introduced a bug where a job with any wildcard datacenters
will always be treated as a destructive update when we check whether a
datacenter has been removed from the jobspec.
Includes updating the helper so that callers don't have to loop over the job's
datacenters.
In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the
new eval was written in the same raft entry. Backwards-compatibility checks were
supposed to have been removed in Nomad 1.1.0, but we missed that. This is long
safe to remove.
When native service discovery was added, we used the node secret as the auth
token. Once Workload Identity was added in Nomad 1.4.x we needed to use the
claim token for `template` blocks, and so we allowed valid claims to bypass the
ACL policy check to preserve the existing behavior. (Invalid claims are still
rejected, so this didn't widen any security boundary.)
In reworking authentication for 1.5.0, we unintentionally removed this
bypass. For WIs without a policy attached to their job, everything works as
expected because the resulting `acl.ACL` is nil. But once a policy is attached
to the job the `acl.ACL` is no longer nil and this causes permissions errors.
Fix the regression by adding back the bypass for valid claims. In future work,
we should strongly consider getting turning the implicit policies into real
`ACLPolicy` objects (even if not stored in state) so that we don't have these
kind of brittle exceptions to the auth code.
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.
Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:
* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
discovery or CSI plugin states
Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.
Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* build: add BuildDate to version info
will be used in enterprise to compare to license expiration time
* cli: multi-line version output, add BuildDate
before:
$ nomad version
Nomad v1.4.3 (coolfakecommithashomgoshsuchacoolonewoww)
after:
$ nomad version
Nomad v1.5.0-dev
BuildDate 2023-02-17T19:29:26Z
Revision coolfakecommithashomgoshsuchacoolonewoww
compare consul:
$ consul version
Consul v1.14.4
Revision dae670fe
Build Date 2023-01-26T15:47:10Z
Protocol 2 spoken by default, blah blah blah...
and vault:
$ vault version
Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z
* docs: update version command output
The `CSIVolume` struct has references to allocations that are "denormalized"; we
don't store them on the `CSIVolume` struct but hydrate them on read. Tests
detecting potential state store corruptions found two locations where we're not
copying the volume before denormalizing:
* When garbage collecting CSI volume claims.
* When checking if it's safe to force-deregister the volume.
There are no known user-visible problems associated with these bugs but both
have the potential of mutating volume claims outside of a FSM transaction. This
changeset also cleans up state mutations in some CSI tests so as to avoid having
working tests cover up potential future bugs.
The RPC TLS enforcement test was frequently failing with broken connections. The
most likely cause was that the tests started to run before the server had
started its RPC server. Wait until it self-elects to ensure that the RPC server
is up. This seems to have corrected the error; I ran this 3 times without a
failure (even accounting for `gotestsum` retries).
Also, fix a minor test bug that didn't impact the test but showed an incorrect
usage for `Status.Ping.`
* artifact: protect against unbounded artifact decompression
Starting with 1.5.0, set defaut values for artifact decompression limits.
artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation
artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.
* artifact: assert limits cannot be nil in validation
* Warn when Items key isn't directly accessible
Go template requires that map keys are alphanumeric for direct access
using the dotted reference syntax. This warns users when they create
keys that run afoul of this requirement.
- cli: use regex to detect invalid indentifiers in var keys
- test: fix slash in escape test case
- api: share warning formatting function between API and CLI
- ui: warn if var key has characters other than _, letter, or number
---------
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
The eval broker's `Cancelable` method used by the cancelable eval reaper mutates
the slice of cancelable evals by removing a batch at a time from the slice. But
this method unsafely uses a read lock despite this mutation. Under normal
workloads this is likely to be safe but when the eval broker is under the heavy
load this feature is intended to fix, we're likely to have a race
condition. Switch this to a write lock, like the other locks that mutate the
eval broker state.
This changeset also adjusts the timeout to allow poorly-sized Actions runners
more time to schedule the appropriate goroutines. The test has also been updated
to use `shoenig/test/wait` so we can have sensible reporting of the results
rather than just a timeout error when things go wrong.
Some of the core scheduler tests need the maximum batch size for writes to be
smaller than the usual `structs.MaxUUIDsPerWriteRequest`. But they do so by
unsafely modifying the global struct, which creates test flakes in other tests.
Modify the functions under test to take a batch size parameter. Production code
will pass the global while the tests can inject smaller values. Turn the
`structs.MaxUUIDsPerWriteRequest` into a constant, and add a semgrep rule for
avoiding this kind of thing in the future.
In #15901 we introduced pre-forwarding authentication for RPCs so that we can
grab the identity for rate metrics. The `ACL.Bootstrap` RPC is an
unauthenticated endpoint, so any error message from authentication is not
particularly useful. This would be harmless, but if you try to bootstrap with
your `NOMAD_TOKEN` already set (perhaps because you were talking to another
cluster previously from the same shell session), you'll get an authentication
error instead of just having the token be ignored. This is a regression from the
existing behavior, so have this endpoint ignore auth errors the same way we do
for every other unauthenticated endpoint (ex `Status.Peers`)
Fixes#14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.
Includes new UI for editing metadata as well.
---------
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* main: remove deprecated uses of rand.Seed
go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.
* cl: update cl
* mod: update go mod
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.
This PR contains the core feature and tests but followup work is required for the following TODO items:
- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* Demoable state
* Demo mirage color
* Label as a block with foreground and background colours
* Test mock updates
* Go test updated
* Documentation update for label support
Service jobs should have unique allocation Names, derived from the
Job.ID. System jobs do not have unique allocation Names because the index is
intended to indicated the instance out of a desired count size. Because system
jobs do not have an explicit count but the results are based on the targeted
nodes, the index is less informative and this was intentionally omitted from the
original design.
Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for
system/sysbatch jobs
Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs.
System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So
interpolation via `per_alloc` will not work as soon as there's more than one
allocation placed. Validate against this on job submission.
* largely a doc-ification of this commit message:
d47678074bf8ae9ff2da3c91d0729bf03aee8446
this doesn't spell out all the possible failure modes,
but should be a good starting point for folks.
* connect: add doc link to envoy bootstrap error
* add Unwrap() to RecoverableError
mainly for easier testing
Add `identity` jobspec block to expose workload identity tokens to tasks.
---------
Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Extend variables under the nomad path prefix to allow for job-templates (#15570)
* Extend variables under the nomad path prefix to allow for job-templates
* Add job-templates to error message hinting
* RadioCard component for Job Templates (#15582)
* chore: add
* test: component API
* ui: component template
* refact: remove bc naming collission
* styles: remove SASS var causing conflicts
* Disallow specific variable at nomad/job-templates (#15681)
* Disallows variables at exactly nomad/job-templates
* idiomatic refactor
* Expanding nomad job init to accept a template flag (#15571)
* Adding a string flag for templates on job init
* data-down actions-up version of a custom template editor within variable
* Dont force grid on job template editor
* list-templates flag started
* Correctly slice from end of path name
* Pre-review cleanup
* Variable form acceptance test for job template editing
* Some review cleanup
* List Job templates test
* Example from template test
* Using must.assertions instead of require etc
* ui: add choose template button (#15596)
* ui: add new routes
* chore: update file directory
* ui: add choose template button
* test: button and page navigation
* refact: update var name
* ui: use `Button` component from `HDS` (#15607)
* ui: integrate buttons
* refact: remove helper
* ui: remove icons on non-tertiary buttons
* refact: update normalize method for key/value pairs (#15612)
* `revert`: `onCancel` for `JobDefinition`
The `onCancel` method isn't included in the component API for `JobEditor` and the primary cancel behavior exists outside of the component. With the exception of the `JobDefinition` page where we include this button in the top right of the component instead of next to the `Plan` button.
* style: increase button size
* style: keep lime green
* ui: select template (#15613)
* ui: deprecate unused component
* ui: deprecate tests
* ui: jobs.run.templates.index
* ui: update logic to handle templates
* refact: revert key/value changes
* style: padding for cards + buttons
* temp: fixtures for mirage testing
* Revert "refact: revert key/value changes"
This reverts commit 124e95d12140be38fc921f7e15243034092c4063.
* ui: guard template for unsaved job
* ui: handle reading template variable
* Revert "refact: update normalize method for key/value pairs (#15612)"
This reverts commit 6f5ffc9b610702aee7c47fbff742cc81f819ab74.
* revert: remove test fixtures
* revert: prettier problems
* refact: test doesnt need filter expression
* styling: button sizes and responsive cards
* refact: remove route guarding
* ui: update variable adapter
* refact: remove model editing behavior
* refact: model should query variables to populate editor
* ui: clear qp on exit
* refact: cleanup deprecated API
* refact: query all namespaces
* refact: deprecate action
* ui: rely on collection
* refact: patch deprecate transition API
* refact: patch test to expect namespace qp
* styling: padding, conditionals
* ui: flashMessage on 404
* test: update for o(n+1) query
* ui: create new job template (#15744)
* refact: remove unused code
* refact: add type safety
* test: select template flow
* test: add data-test attrs
* chore: remove dead code
* test: create new job flow
* ui: add create button
* ui: create job template
* refact: no need for wildcard
* refact: record instead of delete
* styling: spacing
* ui: add error handling and form validation to job create template (#15767)
* ui: handle server side errors
* ui: show error to prevent duplicate
* refact: conditional namespace
* ui: save as template flow (#15787)
* bug: patches failing tests associated with `pretender` (#15812)
* refact: update assertion
* refact: test set-up
* ui: job templates manager view (#15815)
* ui: manager list view
* test: edit flow
* refact: deprecate column-helper
* ui: template edit and delete flow (#15823)
* ui: manager list view
* refact: update title
* refact: update permissions
* ui: template edit page
* bug: typo
* refact: update toast messages
* bug: clear selections on exit (#15827)
* bug: clear controllers on exit
* test: mirage config changes (#15828)
* refact: deprecate column-helper
* style: update z-index for HDS
* Revert "style: update z-index for HDS"
This reverts commit d3d87ceab6d083f7164941587448607838944fc1.
* refact: update delete button
* refact: edit redirect
* refact: patch reactivity issues
* styling: fixed width
* refact: override defaults
* styling: edit text causing overflow
* styling: add inline text
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* bug: edit `text` to `template`
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
* test: delete flow job templates (#15896)
* refact: edit names
* bug: set correct ref to store
* chore: trim whitespace:
* test: delete flow
* bug: reactively update view (#15904)
* Initialized default jobs (#15856)
* Initialized default jobs
* More jobs scaffolded
* Better commenting on a couple example job specs
* Adapter doing the work
* fall back to epic config
* Label format helper and custom serialization logic
* Test updates to account for a never-empty state
* Test suite uses settled and maintain RecordArray in adapter return
* Updates to hello-world and variables example jobspecs
* Parameterized job gets optional payload output
* Formatting changes for param and service discovery job templates
* Multi-group service discovery job
* Basic test for default templates (#15965)
* Basic test for default templates
* Percy snapshot for manage page
* Some late-breaking design changes
* Some copy edits to the header paragraphs for job templates (#15967)
* Added some init options for job templates (#15994)
* Async method for populating default job templates from the variable adapter
---------
Co-authored-by: Jai <41024828+ChaiWithJai@users.noreply.github.com>
The ACL token decoding was not correctly handling time duration
syntax such as "1h" which forced people to use the nanosecond
representation via the HTTP API.
The change adds an unmarshal function which allows this syntax to
be used, along with other styles correctly.
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.
Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.
Fixes#15902
Also tightens up authentication for these endpoints by enforcing the server
certificate name is valid. We protect these endpoints currently by mTLS and
can't use an auth token because these endpoints are (uniquely) called by the
leader and followers for a given node won't have the leader's ephemeral ACL
token. Add a certificate name check that requests come from a server and not a
client, because no client should ever send these RPCs directly.
Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder.
When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across
restarts we need to make sure allocation placement is consistent. With CSI volumes we can
use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes.
Ensure host volumes understand the concept of per_alloc
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.
Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat the Nomad server will
update its status to `disconnected`.
Upon reconnecting, the client will make three main RPC calls:
- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
including its status.
These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run the scheduler may end up
with incomplete data when reconciling allocations.
For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.
When this client comes back the order of events may be:
1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
the client. The client is now assigned two allocations: the original
alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
`running`.
4. Scheduler notices too many allocs and stops the replacement.
This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.
To avoid problems like this clients must update _all_ of its relevant
information before they can be considered `ready` and available for
scheduling.
To achieve this goal the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:
- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
status until it successfully calls `Node.UpdateAlloc`.
These changes are done server-side to avoid the need of additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.
The verification of whether allocations have been updates is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.
* Extends the work done for pre-forwarding authentication to all RPCs that
support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
endpoints that are currently used to support both tokens and client secrets.
Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
policies. Ideally we'll figure out an efficient way to resolve those into real
policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
support HTTP endpoints) have not been updated with the new pre-forwarding auth
We'll be doing this under a separate PR to support RPC rate metrics.
If a consumer of the new `Authenticate` method gets passed a bogus token that's
a correctly-shaped UUID, it will correctly get an identity without a ACL
token. But most consumers will then panic when they consume this nil `ACLToken`
for authorization.
Because no API client should ever send a bogus auth token, update the
`Authenticate` method to create the identity with remote IP (for metrics
tracking) but also return an `ErrPermissionDenied`.
Implement a metric for RPC requests with labels on the identity, so that
administrators can monitor the source of requests within the cluster. This
changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire
up the remaining RPCs once we've threaded the new pre-forwarding authentication
through the all.
Note that metrics are measured after we forward but before we return any
authentication error. This ensures that we only emit metrics on the server that
actually serves the request. We'll perform rate limiting at the same place.
Includes telemetry configuration to omit identity labels.
In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
Upon dequeuing an evaluation workers snapshot their state store at the
eval's wait index or later. This ensures we process an eval at a point
in time after it was created or updated. Processing an eval on an old
snapshot could cause any number of problems such as:
1. Since job registration atomically updates an eval and job in a single
raft entry, scheduling against indexes before that may not have the
eval's job or may have an older version.
2. The older the scheduler's snapshot, the higher the likelihood
something has changed in the cluster state which will cause the plan
applier to reject the scheduler's plan. This could waste work or
even cause eval's to be failed needlessly.
However, the workers run in parallel with a new server pulling the
cluster state from a peer. During this time, which may be many minutes
long, the state store is likely far behind the minimum index required
to process evaluations.
This PR addresses this by adding an additional long backoff period after
an eval is nacked. If the scheduler's indexes catches up within the
additional backoff, it will unblock early to dequeue the next eval.
When the server shuts down we'll get a `context.Canceled` error from the state
store method. We need to bubble this error up so that other callers can detect
it. Handle this case separately when waiting after dequeue so that we can warn
on shutdown instead of throwing an ambiguous error message with just the text
"canceled."
While there may be more precise ways to block scheduling until the
server catches up, this approach adds little risk and covers additional
cases where a server may be temporarily behind due to a spike in load or
a saturated network.
For testing, we make the `raftSyncLimit` into a parameter on the worker's `run` method
so that we can run backoff tests without waiting 30+ seconds. We haven't followed thru
and made all the worker globals into worker parameters, because there isn't much
use outside of testing, but we can consider that in the future.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This changeset fixes a long-standing point of confusion in metrics emitted by
the eval broker. The eval broker has a queue of "blocked" evals that are waiting
for an in-flight ("unacked") eval of the same job to be completed. But this
"blocked" state is not the same as the `blocked` status that we write to raft
and expose in the Nomad API to end users. There's a second metric
`nomad.blocked_eval.total_blocked` that refers to evaluations in that
state. This has caused ongoing confusion in major customer incidents and even in
our own documentation! (Fixed in this PR.)
There's little functional change in this PR aside from the name of the metric
emitted, but there's a bit refactoring to clean up the names in `eval_broker.go`
so that there aren't name collisions and multiple names for the same
state. Changes included are:
* Everything that was previously called "pending" referred to entities that were
associated witht he "ready" metric. These are all now called "ready" to match
the metric.
* Everything named "blocked" in `eval_broker.go` is now named "pending", except
for a couple of comments that actually refer to blocked RPCs.
* Added a note to the upgrade guide docs for 1.5.0.
* Fixed the scheduling performance metrics docs because the description for
`nomad.broker.total_blocked` was actually the description for
`nomad.blocked_eval.total_blocked`.
* Add config elements
* Wire in snapshot configuration to raft
* Add hot reload of raft config
* Add documentation for new raft settings
* Add changelog
* consul: correctly understand missing consul checks as unhealthy
This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.
In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.
Fixes#15536
* consul: minor CR refactor using maps not sets
* consul: observe transition from healthy to unhealthy checks
* consul: spell healthy correctly
This adds new OIDC endpoints on the RPC endpoint. These two RPCs
handle generating the OIDC provider URL and then completing the
login by exchanging the provider token with an internal Nomad
token.
The RPC endpoints both do double forwarding. The initial forward
is to ensure we are talking to the regional leader; the second
then takes into account whether the auth method generates local or
global tokens. If it creates global tokens, we must then forward
onto the federated regional leader.
This PR adds support for configuring `proxy.upstreams[].config` for
Consul Connect upstreams. This is an opaque config value to Nomad -
the data is passed directly to Consul and is unknown to Nomad.
* connect: fix non-"tcp" ingress gateway validation
changes apply to http, http2, and grpc:
* if "hosts" is excluded, consul will use its default domain
e.g. <service-name>.ingress.dc1.consul
* can't set hosts with "*" service name
* test http2 and grpc too
* [no ci] first pass at plumbing grpc_ca_file
* consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+
This PR adds client config to Nomad for specifying consul.grpc_ca_file
These changes combined with https://github.com/hashicorp/consul/pull/15913 should
finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections.
* consul: add cl entgry for grpc_ca_file
* docs: mention grpc_tls changes due to Consul 1.14
* vault: configure user agent on Nomad vault clients
This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.
* vault: fixup find-and-replace gone awry
This changeset covers a sidebar discussion that @schmichael and I had around the
design for pre-forwarding auth. This includes some changes extracted out of
#15513 to make it easier to review both and leave a clean history.
* Remove fast path for NodeID. Previously-connected clients will have a NodeID
set on the context, and because this is a large portion of the RPCs sent we
fast-pathed it at the top of the `Authenticate` method. But the context is
shared for all yamux streams over the same yamux session (and TCP
connection). This lets an authenticated HTTP request to a client use the
NodeID for authentication, which is a privilege escalation. Remove the fast
path and annotate it so that we don't break it again.
* Add context to decisions around AuthenticatedIdentity. The `Authenticate`
method taken on its own looks like it wants to return an `acl.ACL` that folds
over all the various identity types (creating an ephemeral ACL on the fly if
neccessary). But keeping these fields idependent allows RPC handlers to
differentiate between internal and external origins so we most likely want to
avoid this. Leave some docstrings as a warning as to why this is built the way
it is.
* Mutate the request rather than returning. When reviewing #15513 we decided
that forcing the request handler to call `SetIdentity` was repetitive and
error prone. Instead, the `Authenticate` method mutates the request by setting
its `AuthenticatedIdentity`.