In the recent auth refactor, we accidentally fixed a bug where poststop tasks
would not get access to Variables. Fix this same bug for backports by ensuring
that we use client-terminal status and not server-terminal status to enforce
access.
When a user performs a client API call, the Nomad client will
perform an RPC which looks up the ACL policies which the callers
ACL token is assigned. If the ACL token includes dangling (deleted)
policies, the call would previously fail with a permission denied
error.
This change ensures this error is not returned and that the lookup
will succeed in the event of dangling policies.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placements calculation notion that names of allocations being
replaced are blindly copied to their replacement. This is not
correct in all cases, particularly when dealing with canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicate are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
this can save a bit of cpu when
running plans for tasks that already exist,
and prevents Nomad tokens from changing,
which can cause task template{}s to restart
unnecessarily.
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.
So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.
This change applies the same logic as #18269 and ensures only fresh
responses are used.
Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
* Rename pages to include roles
* Models and adapters
* [ui] Any policy checks in the UI now check for roles' policies as well as token policies (#18346)
* combinedPolicies as a concept
* Classic decorator on role adapter
* We added a new request for roles, so the test based on a specific order of requests got fickle fast
* Mirage roles cluster scaffolded
* Acceptance test for roles and policies on the login page
* Update mirage mock for nodes fetch to account for role policies / empty token.policies
* Roles-derived policies checks
* [ui] Access Control with Roles and Tokens (#18413)
* top level policies routes moved into access control
* A few more routes and name cleanup
* Delog and test fixes to account for new url prefix and document titles
* Overview page
* Tokens and Roles routes
* Tokens helios table
* Add a role
* Hacky role page and deletion
* New policy keyboard shortcut and roles breadcrumb nav
* If you leave New Role but havent made any changes, remove the newly-created record from store
* Roles index list and general role route crud
* Roles index actually links to roles now
* Helios button styles for new roles and policies
* Handle when you try to create a new role without having any policies
* Token editing generally
* Create Token functionality
* Cant delete self-token but management token editing and deleting is fine
* Upgrading helios caused codemirror to explode, shimmed
* Policies table fix
* without bang-element condition, modifier would refire over and over
* Token TTL or Time setting
* time will take you on
* Mirage hooks for create and list roles
* Ensure policy names only use allow characters in mirage mocks
* Mirage mocked roles and policies in the default cluster
* log and lintfix
* chromedriver to 2.1.2
* unused unit tests removed
* Nice profile dropdown
* With the HDS accordion, rename our internal component scss ref
* design revisions after discussion
* Tooltip on deleted-policy tokens
* Two-step button peripheral isDeleting gcode removed
* Never to null on token save
* copywrite headers added and empty routefiles removed
* acceptance test fixes for policies endpoint
* Route for updating a token
* Policies testfixes
* Ember on-click-outside modifier upgraded with general ember-modifier upgrade
* Test adjustments to account for new profile header dropdown
* Test adjustments for tokens via policy pages
* Removed an unused route
* Access Control index page tests
* a11y tests
* Tokens index acceptance tests generally
* Lintfix
* Token edit page tests
* Token editing tests
* New token expiration tests
* Roles Index tests
* Role editing policies tests
* A complete set of Access Control Roles tests
* Policies test
* Be more specific about which row to check for expiration time
* Nil check on expirationTime equality
* Management tokens shouldnt show No Roles/Policies, give them their own designation
* Route guard on selftoken, conditional columns, and afterModel at parent to prevent orphaned policies on tokens/roles from stopping a new save
* Policy unloading on delete and other todos plus autofocus conditionally re-enabled
* Invalid policies non-links now a concept for Roles index
* HDS style links to make job.variables.alert links look like links again
* Mirage finding looks weird so making model async in hash even though redundant
* Drop rsvp
* RSVP wasnt the problem, cached lookups were
* remove old todo comments
* de-log
When a CSI volume is deleted while its plugin is not running, the
function `volAndPluginLookup` returns a `nil` plugin value resulting in a
panic in the request handler.
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.
Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21
The `TestDrainer_AllTypes_NoDeadline` test has been flaky. It looks like this
might be because the final update of batch allocations to complete is improperly
updating the state store directly rather than by RPC. If the service jobs have
restarted in the meantime, the `allocClientStateSimulator` will have updated the
index on the allocations table and that will prevent the drainer from
unblocking (and being marked complete) when the batch jobs are written with an
earlier index.
This changeset attempts to fix that by making the update via RPC (as it normally
would be in real code).
We've seen test flakiness in the `TestJobEndpoint_Register_NonOverlapping` test,
which asserts that we don't try to placed allocations for blocked evals until
resources have been actually freed by setting the client status of the previous
alloc to complete.
The flaky assertion includes sorting the two allocations by CreateIndex and this
appears to be a non-stable sort in the context of the test run, which results in
failures that shouldn't exist. There's no reason to sort the allocations instead
of just examining them by ID. This changeset does so.
There are some refactorings that have to be made in the getter and state
where the api changed in `slices`
* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function