Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.
Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
* docker: set force=true on remove image to handle images referenced by multiple tags
This PR changes our call of docker client RemoveImage() to RemoveImageExtended with
the Force=true option set. This fixes a bug where an image referenced by more than
one tag could never be garbage collected by Nomad. The Force option only applies to
stopped containers; it does not affect running workloads.
* docker: add note about image_delay and multiple tags
While you can use any string value for a variable Item's key name
using characters that are outside of the set [unicode.Letter,
unicode.Number,`_`] will require the `index` function for direct
access.
* Ensure infra_image gets proper label used for reconciliation
Currently infra containers are not cleaned up as part of the dangling container
cleanup routine. The reason is that Nomad checks if a container is a Nomad owned
container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label.
Ensure we set this label on the infra container as well.
* fix unit test
* changelog: add entry
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.
Fixes#15902
The getting started Tutorial has a post-installation steps section that includes
installing CNI plugins. Many users will want to use `bridge` networking right
out of the gate, so adding these same post-install instructions to the main docs
will be a better Day 0 experience for them.
* client: run alloc pre-kill hooks on last pass despite no live tasks
This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.
Fixes#15477
* client: do not run ar cleanup hooks if client is shutting down
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.
This caused the new template manager did not have a driver handler
because this was only being set on the Poststart hook, which is not
called for inplace updates.
Also tightens up authentication for these endpoints by enforcing the server
certificate name is valid. We protect these endpoints currently by mTLS and
can't use an auth token because these endpoints are (uniquely) called by the
leader and followers for a given node won't have the leader's ephemeral ACL
token. Add a certificate name check that requests come from a server and not a
client, because no client should ever send these RPCs directly.
Scope the `footer` tag SCSS rule for the New Variable form to prevent it
from affecting other `<footer>` elements, such as the gutter menu Nomad
version section.
This PR adjusts the artifact sandbox on Linux to enable reading from known
system-wide git or mercurial configuration, if they exist.
Folks doing something odd like specifying custom paths for global config will
need to use the standard locations, or disable artifact filesystem isolation.
* nsd: block on removal of services
This PR uses a WaitGroup to ensure workload removals are complete
before returning from ServiceRegistrationHandler.RemoveWorkload of
the nomad service provider. The de-registration of individual services
still occurs asynchrously, but we must block on the parent removal
call so that we do not race with further operations on the same set
of services - e.g. in the case of a task restart where we de-register
and then re-register the services in quick succession.
Fixes#15032
* nsd: add e2e test for initial failing check and restart
Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder.
When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across
restarts we need to make sure allocation placement is consistent. With CSI volumes we can
use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes.
Ensure host volumes understand the concept of per_alloc
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.
Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat the Nomad server will
update its status to `disconnected`.
Upon reconnecting, the client will make three main RPC calls:
- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
including its status.
These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run the scheduler may end up
with incomplete data when reconciling allocations.
For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.
When this client comes back the order of events may be:
1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
the client. The client is now assigned two allocations: the original
alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
`running`.
4. Scheduler notices too many allocs and stops the replacement.
This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.
To avoid problems like this clients must update _all_ of its relevant
information before they can be considered `ready` and available for
scheduling.
To achieve this goal the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:
- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
status until it successfully calls `Node.UpdateAlloc`.
These changes are done server-side to avoid the need of additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.
The verification of whether allocations have been updates is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.
* Extends the work done for pre-forwarding authentication to all RPCs that
support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
endpoints that are currently used to support both tokens and client secrets.
Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
policies. Ideally we'll figure out an efficient way to resolve those into real
policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
support HTTP endpoints) have not been updated with the new pre-forwarding auth
We'll be doing this under a separate PR to support RPC rate metrics.
If a consumer of the new `Authenticate` method gets passed a bogus token that's
a correctly-shaped UUID, it will correctly get an identity without a ACL
token. But most consumers will then panic when they consume this nil `ACLToken`
for authorization.
Because no API client should ever send a bogus auth token, update the
`Authenticate` method to create the identity with remote IP (for metrics
tracking) but also return an `ErrPermissionDenied`.