Originally this test relied on Job 1 blocking Job 2 until Job 1 had a
terminal *ClientStatus.* Job 2 ensured it would get blocked using 2
mechanisms:
1. A constraint requiring it is placed on the same node as Job 1.
2. Job 2 would require all unreserved CPU on the node to ensure it would
be blocked until Job 1's resources were free.
That 2nd assertion breaks if *any previous job is still running on the
target node!* That seems very likely to happen in the flaky world of our
e2e tests. In fact there may be some jobs we intentionally want running
throughout; in hindsight it was never safe to assume my test would be
the only thing scheduled when it ran.
*Ports to the rescue!* Reserving a static port means that both Job 2
will now block on Job 1 being terminal. It will only conflict with other
tests if those tests use that port *on every node.* I ensured no
existing tests were using the port I chose.
Other changes:
- Gave job a bit more breathing room resource-wise.
- Tightened timings a bit since previous failure ran into the `go test`
time limit.
- Cleaned up the DumpEvals output. It's quite nice and handy now!
Keeps failing in the nightly e2e test with unhelpful output like:
```
Failed
=== RUN TestOverlap
overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s)
overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b)
overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b
--- FAIL: TestOverlap (38.96s)
```
I have not been able to replicate it in my own e2e cluster, so I added
the EvalDump helper to add detailed eval information like:
```
=== RUN TestOverlap
1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a
Type: service
TriggeredBy: job-register
Deployment:
Status: pending ()
NextEval:
PrevEval:
BlockedEval:
-- No placement failures --
QueuedAllocs:
SnapshotIdx: 0
CreateIndex: 96
ModifyIndex: 96
...
```
Hopefully helpful when debugging other tests as well!
* test: simplify overlap job placement logic
Trying to fix#14806
Both the previous approach as well as this one worked on e2e clusters I
spun up.
* simplify code flow
* test: add e2e for non-overlapping placements
Followup to #10446
Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).
Passes against 1.4.0-beta.1 (as expected).
* Update e2e/overlap/overlap_test.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
* cleanup: refactor MapStringStringSliceValueSet to be cleaner
* cleanup: replace SliceStringToSet with actual set
* cleanup: replace SliceStringSubset with real set
* cleanup: replace SliceStringContains with slices.Contains
* cleanup: remove unused function SliceStringHasPrefix
* cleanup: fixup StringHasPrefixInSlice doc string
* cleanup: refactor SliceSetDisjoint to use real set
* cleanup: replace CompareSliceSetString with SliceSetEq
* cleanup: replace CompareMapStringString with maps.Equal
* cleanup: replace CopyMapStringString with CopyMap
* cleanup: replace CopyMapStringInterface with CopyMap
* cleanup: fixup more CopyMapStringString and CopyMapStringInt
* cleanup: replace CopySliceString with slices.Clone
* cleanup: remove unused CopySliceInt
* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice
* cleanup: replace CopyMap with maps.Clone
* cleanup: run go mod tidy
In the event a single test fails to clear up properly after
itself, all other tests will fail as they attempt to create ACL
policies with the same names. This change ensures they use unique
ACL names, so when a single test fails, it is easy to identify it
is a problem with the test rather than the suite.
The rewrite refactors the suite to use the new style along with
other recent testing improvements. In order to ensure the spread
tests do not impact each other, there is new cleanup functionality
to ensure both the job and allocations are removed from state
before the test exits completely.
The NSD checks tests were racey, whereby the check may not have
been triggered by the time it was queried. This change wraps the
check so it can account for this.
This removes the current ACL expiration GC section in order to get
the tests passing and allow more time to investigate the test. I
have full confidence the feature is working as expected and have
tested extensively locally.
In order to add an E2E test to cover token expiration, the server
config has been updated to include a low minimum allowed TTL
value. For ease of reading, the max value is also set.
This adds a new ACL test suite to the e2e framework which includes
an initial test for ACL roles. The ACL test includes a helper to
track and clean created Nomad resources which keeps the test
cluster clean no matter if the test fails early or not.
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
The nightly playwright tests are currently failing because of a
mismatch between the expected version of Chromium and what's in the
container image. Unfortunately the previous specific tag we were using
for the container image is no longer tagged on the registry. With some
testing, I was able to find an image tag that results in a good run.
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
We moved off the old provisioning process for nightly E2E to one driven
entirely by Terraform quite a while back now. We're in the slow
process of removing the framework code for this test-by-test, but this
chunk of code no longer has any callers.
We don't need the absolute path for any of the commands in this script
so long as we `cd` into the source directory path. Doing this removes
the need for weird platform-specific tricks we have to do with
realpath vs GNU realpath.
The E2E test runner is running from the root of the Nomad
repository. Make this run independent of the working directory for
convenience of developers and the test runner.
The CSI plugin allocations take a while to be marked healthy,
sometimes causing E2E test flakes during the setup phase of the
tests. There's nothing CSI specific about marking plugin allocs
healthy, as the plugin supervisor hook does all the fingerprinting in
the postrun hook (the prestart hook just makes a couple of empty
directories). The timeouts we're seeing may be because of where we're
pulling the images from; most our jobs pull from a CDN-backed public
registry whereas these are pulling from ECR. Set a 1min timeout for
these to make sure we have enough time to pull the image and start the
task.
Scripts for running playwright tests in a Docker container that has
chromium and webkit preinstalled. Includes a basic smoke test for
authentication so that we can be sure the test rig is working
end-to-end. Wiring this up in CI will be in an upcoming PR.
Our E2E test environment is deployed with mTLS, but it's impractical
for us to use mTLS in headless browsers for automated testing (or even
in manual testing). Provide certificates for proxying the web UI via
Nginx. This proxy uses client certs for proxying to the HTTP endpoint
and a self-signed cert for the browser-facing endpoint. We can accept
certificate errors in the automated tests we'll be adding in the next
step of this work.
While working on infrastructure for testing the UI in E2E, we needed
to upgrade the certificate provider. Performing a provider upgrade via
the TF `init -upgrade` brought in updates for the file and AWS
providers as well. These updates include deprecating the use of
`sensitive_content` fields, removing CA algorithm parameters that can
be inferred from keys, and removing the requirement to manually
specify AWS assume role parameters in the provider config if they're
available in the calling environment's AWS config file (as they are
via doormat or our E2E environment).
This test has a failure that's happening only occassionally and not
very reproducibly. Print out the allocation status on test failure so
that we can do some post-mortum debugging of the test on nightly.
Many of our scripts have a non-portable interpreter line for bash and
use bash-specific variables like `BASH_SOURCE`. Update the interpreter
line to be portable between various Linuxes and macOS without
complaint from posix shell users.
This changeset fixes two sources of flakiness in the event stream test.
First, the stream request gets the event *closest* to the index, not
the exact match. Although events are written before raft entries
they're written asynchronously, so it's possible to race and get a
raft index from this query higher than the current head of the event
buffer. Ensure the job is running before we try to get the index, so
that we've given the event enough time to land in the buffer.
Second, the assertion that the found index is greater than the start
index is only true if the `PlanResult` event manages to land before we
do the second registration. Although it should now with the first fix
above, it's not a correct assertion for what we're testing.
The oversubscription test expects an output that requires the client
has polled the task for stats at least once. Wait long enough to
ensure that we've polled the stats before failing the test.
Some tests may chose to deregister jobs to check Nomad cleanup
logic, however, it is still possible for the test to fail and exit
before this is hit. This therefore adds a cancellable cleanup func
which can be deferred, using context to control whether it gets
run or not.