The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources agains the quota, and as a result we count nodes which
will be later filtered out by constraints. Therefore for jobs with
constraints, nodes that are feasibility checked but fail have been counted
against quotas. This failure mode is order dependent; if all the unfiltered
nodes happen to be quota checked first, everything works as expected.
This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
The test fails reliably locally on my machine. The test uses non-dev mode
where Raft actions get committed to disk, causing operations to exceed
the 50ms tight Raft deadlines.
So, here we ensure that non-dev servers use default Raft config
files with longer timeouts.
Also, noticed that the test queries a server, that may a follower with a
stale state.
I've updated the test to ensure we query the leader for its state. The
Barrier call ensures that the leader is a "stable" leader with committed
entries. Protects against a window where a new leader reports the
previous term before it commits a raft log entry.
Ensure that all servers are joined to each other before test proceed,
instead of just joining them to the first server and relying on
background serf propagation.
Relying on backgorund serf propagation is a cause of flakiness,
specially for tests with multiple regions. The server receiving the RPC
may not be aware of the region and fail to forward RPC accordingly.
For example, consider `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045 you can observe:
* `nomad-117` is joined to `nomad-118` and `nomad-119`
* `nomad-119` is the foreign region
* `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader
* search logs for `nomad: adding server` and notice that `nomad-118`
only added `nomad-118` and `nomad-118`, but not `nomad-119`!
* so the query to the non-leader in the test fails to be forwarded to
the appopriate region.
This updates `client.Ready()` so it returns once the client node got
registered at the servers. Previously, it returns when the
fingerprinters first batch completes, wtihout ensuring that the node is
stored in the Raft data. The tests may fail later when it with unknown
node errors later.
`client.Reedy()` seem to be only called in CSI and some client stats
now.
This class of bug, assuming client is registered without checking, is a
source of flakiness elsewhere. Other tests use other mechanisms for
checking node readiness, though not consistently.
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
Adding '-verbose' will print out the allocation information for the
deployment. This also changes the job run command so that it now blocks
until deployment is complete and adds timestamps to the output so that
it's more in line with the output of node drain.
This uses glint to print in place in running in a tty. Because glint
doesn't yet support cmd/powershell, Windows workflows use a different
library to print in place, which results in slightly different
formatting: 1) different margins, and 2) no spinner indicating
deployment in progress.
(cherry-pick ent back to oss)
This PR moves a lot of Consul ACL token validation tests into ent files,
so that we can verify correct behavior difference between OSS and ENT
Nomad versions.
Old description of `{plan,worker}.wait_for_index` described the metric
in terms of waiting for a snapshot which has two problems:
1. "Snapshot" is an overloaded term in Nomad and operators can't be
expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on*
before taking a snapshot: the raft index of the object to be
processed (plan or eval).
The new description tries to cram all of that context into the tiny
space provided.
See #5791 for details about the `wait_for_index` mechanism in general.
This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship
check when using Consul OSS (or Consul ENT without namespace support).
Nomad v1.1.0 introduced a regression where Nomad would fail the validation
when submitting Connect jobs and allow_unauthenticated set to true, with
Consul OSS - because it would do the namespace check against the Consul ACL
token assuming the "default" namespace, which does not work because Consul OSS
does not have namespaces.
Instead of making the bad assumption, expand the namespace check to handle
each special case explicitly.
Fixes#10718