The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).
This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.
With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.
This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.
This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.
In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
In #14139 this code was changed to use the original copy of the config,
but Config.AllocDir is updated in the `Client.init()` method for dev
agents.
This uses the latest version of the alloc dir (which cannot change
further at runtime without a client restart which would reinitialize
the stats collector as well).
This PR updates the checks documentation to mention support for checks
when using the Nomad service provider. There are limitations of NSD
compared to Consul, and those configuration options are now noted as
being Consul-only.
Making the ACL Role listing return object a stub future-proofs the
endpoint. In the event the role object grows, we are not bound by
having to return all fields within the list endpoint or change the
signature of the endpoint to reduce the list return size.
ACL Roles along with policies and global token will be replicated
from the authoritative region to all federated regions. This
involves a new replication loop running on the federated leader.
Policies and roles may be replicated at different times, meaning
the policies and role references may not be present within the
local state upon replication upsert. In order to bypass the RPC
and state check, a new RPC request parameter has been added. This
is used by the replication process; all other callers will trigger
the ACL role policy validation check.
There is a new ACL RPC endpoint to allow the reading of a set of
ACL Roles which is required by the replication process and matches
ACL Policies and Tokens. A bug within the ACL Role listing RPC has
also been fixed which returned incorrect data during blocking
queries where a deletion had occurred.
Since the state store returns a pointer to the shared job structs in
memdb we must always copy it before mutating it and applying the new
version via raft. Otherwise if the rpc fails before the mutated job is
committed to raft (either due to validation, bug, crash, or other exit
condition), the leader server will have an updated copy of the job that
other servers will not have.
This PR adds some NSD check status output to the CLI.
1. The 'nomad alloc status' command produces nsd check summary output (if present)
2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.
This change takes the following approach to safely handling configs in the client:
1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
This PR fixes the circle workflow step on windows where we print
the go version. Like the other commands that use Go, we must
inject the install path into PATH first.
This PR is the first of several for cleaning up warnings, and refactoring
bits of code in the command package. First pass is over acl_ files and
gets some helpers in place.
* Initialized keyboard service
Neat but funky: dynamic subnav traversal
👻
generalized traverseSubnav method
Shift as special modifier key
Nice little demo panel
Keyboard shortcuts keycard
Some animation styles on keyboard shortcuts
Handle situations where a link is deeply nested from its parent menu item
Keyboard service cleanup
helper-based initializer and teardown for new contextual commands
Keyboard shortcuts modal component added and demo-ghost removed
Removed j and k from subnav traversal
Register and unregister methods for subnav plus new subnavs for volumes and volume
register main nav method
Generalizing the register nav method
12762 table keynav (#12975)
* Experimental feature: shortcut visual hints
* Long way around to a custom modifier for keyboard shortcuts
* dynamic table and list iterative shortcuts
* Progress with regular old tether
* Delogging
* Table Keynav tether fix, server and client navs, and fix to shiftless on modified arrow keys
Go to Optimize keyboard link and storage key changed to g r
parameterized jobs keyboard nav
Dynamic numeric keynav for multiple tables (#13482)
* Multiple tables init
* URL-bind enumerable keyboard commands and add to more taskRow and allocationRows
* Type safety and lint fixes
* Consolidated push to keyCommands
* Default value when removing keyCommands
* Remove the URL-based removal method and perform a recompute on any add
Get tests passing in Keynav: remove math helpers and a few other defensive moves (#13761)
* Remove ember math helpers
* Test fixes for jobparts/body
* Kill an unneeded integration helper test
* delog
* Trying if disabling percy lets this finish
* Okay so its not percy; try parallelism in circle
* Percyless yet again
* Trying a different angle to not have percy
* Upgrade percy to 1.6.1
[ui] Keyboard nav: "u" key to go up a level (#13754)
* U to go up a level
* Mislabelled my conditional
* Custom lint ignore rule
* Custom lint ignore rule, this time with commas
* Since we're getting rid of ember math helpers elsewhere, do the math ourselves here
Replace ArrowLeft etc. with an ascii arrow (#13776)
* Replace ArrowLeft etc. with an ascii arrow
* non-mutative helper cleanup
Keyboard Nav: let users rebind their shortcuts (#13781)
* click-outside and shortcuts enabled/disabled toggle
* Trap focus when modal open
* Enabled/disabled saved to localStorage
* Autofocus edit button on variable index
* Modal overflow styles
* Functional rebind
* Saving rebinds to localStorage for all majors
* Started on defaultCommandBindings
* Modal header style and cancel rebind on escape
* keyboardable keybindings w buttons instead of spans
* recording and defaultvalues
* Enter short-circuits rebind
* Only some commands are rebindable, and dont show dupes
* No unused get import
* More visually distinct header on modal
* Disallowed keys for rebind, showing buffer as you type, and moving dedupe to modal logic
willDestroy hook to prevent tests from doubling/tripling up addEventListener on kb events
remove unused tests
Keyboard Navigation acceptance tests (#13893)
* Acceptance tests for keyboard modal
* a11y audit fix and localStorage clear
* Bind/rebind/localStorage tests
* Keyboard tests for dynamic nav and tables
* Rebinder and assert expectation
* Second percy snapshot showing hints no longer relevant
Weird issue where linktos with query props specifically from the task-groups page would fail to route / hit undefined.shouldSuperCede errors
Adds the concept of exclusivity to a keycommand, removing peers that also share its label
Lintfix
Changelog and PR feedback
Changelog and PR feedback
Fix to rebinding in firefox by blurring the now-disabled button on rebind (#14053)
* Secure Variables shortcuts removed
* Variable index route autofocus removed
* Updated changelog entry
* Updated changelog entry
* Keynav docs (#14148)
* Section added to the API Docs UI page
* Added a note about disabling
* Prev and Next order
* Remove dev log and unneeded comments
ACL tokens can now utilize ACL roles in order to provide API
authorization. Each ACL token can be created and linked to an
array of policies as well as an array of ACL role links. The link
can be provided via the role name or ID, but internally, is always
resolved to the ID as this is immutable whereas the name can be
changed by operators.
When resolving an ACL token, the policies linked from an ACL role
are unpacked and combined with the policy array to form the
complete auth set for the token.
The ACL token creation endpoint handles deduplicating ACL role
links as well as ensuring they exist within state.
When reading a token, Nomad will also ensure the ACL role link is
current. This handles ACL roles being deleted from under a token
from a UX standpoint.
ACL Policies aren't required to have any `namespace` blocks, and this is
particularly common with the anonymous policy. If a user visits the web UI
without a token already in their local storage and the anonymous policy has no
`namespace` blocks, the UI will hit unhandled exceptions when rendering the
sidebar or jobs page.
Filter for the case where there's no `namespace` block.
Similar to the deployment watcher fix in #14121 - the server code loves these mutable structs so we need to guard access to the struct fields with locks.
Capturing ch := b.capacityChangeCh is sufficient to satisfy the data race detector, but I noticed it was also possible to leak goroutines:
Since the watchCapacity loop is in charge of receiving from capacityChangeCh and exits when stopCh is closed, senders to capacityChangeCh also must exit when stopCh is closed. Otherwise they may block forever if capacityChangeCh is full because it will never be received on again. I did not find evidence of this occurring in my meager smattering of prod goroutine dumps I have laying around, but this isn't surprising as the chan has a buffer of 8096! I would imagine that is sufficient to handle "late" sends and then just get GC'd away when the last reference to the old chan is dropped. This is just additional safety/correctness.