* scheduler: create placements for non-register MRD
For multiregion jobs, the scheduler does not create placements on
registration because the deployment must wait for the other regions.
Once of these regions will then trigger the deployment to run.
Currently, this is done in the scheduler by considering any eval for a
multiregion job as "paused" since it's expected that another region will
eventually unpause it.
This becomes a problem where evals not triggered by a job registration
happen, such as on a node update. These types of regional changes do not
have other regions waiting to progress the deployment, and so they were
never resulting in placements.
The fix is to create a deployment at job registration time. This
additional piece of state allows the scheduler to differentiate between
a multiregion change, where there are other regions engaged in the
deployment so no placements are required, from a regional change, where
the scheduler does need to create placements.
This deployment starts in the new "initializing" status to signal to the
scheduler that it needs to compute the initial deployment state. The
multiregion deployment will wait until this deployment state is
persisted and its starts is set to "pending". Without this state
transition it's possible to hit a race condition where the plan applier
and the deployment watcher may step of each other and overwrite their
changes.
* changelog: add entry for #15325
When the scheduler checks feasibility of each node, it creates a "stack" which
carries attributes of the job and task group it needs to check feasibility
for. The `system` and `sysbatch` scheduler use a different stack than `service`
and `batch` jobs. This stack was missing the call to set the job ID and
namespace for the CSI check. This prevents CSI volumes from being scheduled for
system jobs whenever the volume is in a non-default namespace.
Set the job ID and namespace to match the generic scheduler.
This PR modifies the disconnect helper job to run as root, which is necesary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the sheduler more
time to do its thing).
Skips the other 3 tests, which fail and I cannot yet figure out what is going on.
Adds @hashicorp/nomad-eng to the codeowners list for the build and release
workflow files, so that we can fix problems that arise without being
bottlenecked on another team.
The `ubuntu-latest` runner has been migrated to Ubuntu 22.04, which doesn't have
all the same multilib packages as 20.04. Although we'll probably want to migrate
eventually, we should ship Nomad 1.4.3 with the same toolchain as we did
previously so that we're not introducing new issues.
* e2e: fixup oversubscription test case for jammy
jammy uses cgroups v2, need to lookup the max memory limit from the
unified heirarchy format
* e2e: set constraint to require cgroups v2 on oversub docker test
* client: accommodate Consul 1.14.0 gRPC and agent self changes.
Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.
The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.
The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.
The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.
* docs: modify Consul Connect entries to detail 1.14.0 changes.
* changelog: add entry for #15309
* fixup: tidy tests and clean version match from review feedback.
* fixup: use strings tolower func.
This PR adds trace logging around the differential done between a Nomad service
registration and its corresponding Consul service registration, in an effort
to shed light on why a service registration request is being made.
* Add mount propagation to protobuf definition of mounts
* Fix formatting
* Add mount propagation to the simple roundtrip test.
* changelog: add entry for #15096
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
In #14621 we added an eval canelation reaper goroutine with a channel that
allowed us to wake it up. But we forgot to actually send on this channel from
`Eval.Ack` and are still committing the cancelations synchronously. Fix this by
sending on the buffered channel to wake up the reaper instead.
When we migrated to the updated autopilot library in Nomad 1.4.0, the interface
for finding servers changed. Previously autopilot would get the serf members and
call `IsServer` on each of them, leaving it up to the implementor to filter out
clients (and in Nomad's case, other regions). But in the "new" autopilot
library, the equivalent interface is `KnownServers` for which we did not filter
by region. This causes spurious attempts for the cross-region stats fetching,
which results in TLS errors and a lot of log noise.
Filter the member set by region to fix the regression.
After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft
log entry was no longer in use. Mark this as deprecated, remove the associated
dead code, and remove references to the metrics it emits from the docs. We'll
leave the entry itself just in case we encounter old raft logs that we need to
be able to safely load.
This PR changes test-core to make use of
https://github.com/hashicorp/setup-golang
to consolidate the setting up of the Go compiler and the Go modules cache
used for the CI job.
Fixes: #14905
* keyring: update handle to state inside replication loop
When keyring replication starts, we take a handle to the state store. But
whenever a snapshot is restored, this handle is invalidated and no longer points
to a state store that is receiving new keys. This leaks a bunch of memory too!
In addition to operator-initiated restores, when fresh servers are added to
existing clusters with large-enough state, the keyring replication can get
started quickly enough that it's running before the snapshot from the existing
clusters have been restored.
Fix this by updating the handle to the state store on each pass.
When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.
Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
signal a reap of a batch of cancelable evals to write to raft. This paces the
cancelations limited by how frequently the schedulers are acknowledging evals;
this should reduce the risk of cancelations from overwhelming raft relative to
scheduler progress. In order to avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.