The E2E test suite for rescheduling had a few bugs:
* Using the command line to stop a job with a failing deployment returns a non-zero exit
code, which would cause an otherwise passing test to fail.
* Two of the input jobs were actually invalid but were only correctly detected
as such because of #17342
This changeset also updates the whole test suite to move it off the v1
"framework". A few test assertions are also de-flaked.
Fixes: #19076
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The version we have of `hc-install` doesn't allow installing Enterprise
binaries. Upgrade so that this is available to the development team and to our
E2E tests in the Enterprise repo.
(Cherry-pick also backports install of `go-modtool`)
When a user performs a client API call, the Nomad client will
perform an RPC which looks up the ACL policies which the callers
ACL token is assigned. If the ACL token includes dangling (deleted)
policies, the call would previously fail with a permission denied
error.
This change ensures this error is not returned and that the lookup
will succeed in the event of dangling policies.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placements calculation notion that names of allocations being
replaced are blindly copied to their replacement. This is not
correct in all cases, particularly when dealing with canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicate are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.
In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.
Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.
Backport-of: #18835
this can save a bit of cpu when
running plans for tasks that already exist,
and prevents Nomad tokens from changing,
which can cause task template{}s to restart
unnecessarily.