Also add some debug log lines for this test, because it doesn't make sense
for the allocation to be complete yet a task in the allocation to be not
started yet, which is what the test failures are implying.
Originally this test relied on Job 1 blocking Job 2 until Job 1 had a
terminal *ClientStatus.* Job 2 ensured it would get blocked using 2
mechanisms:
1. A constraint requiring it is placed on the same node as Job 1.
2. Job 2 would require all unreserved CPU on the node to ensure it would
be blocked until Job 1's resources were free.
That 2nd assertion breaks if *any previous job is still running on the
target node!* That seems very likely to happen in the flaky world of our
e2e tests. In fact there may be some jobs we intentionally want running
throughout; in hindsight it was never safe to assume my test would be
the only thing scheduled when it ran.
*Ports to the rescue!* Reserving a static port means that both Job 2
will now block on Job 1 being terminal. It will only conflict with other
tests if those tests use that port *on every node.* I ensured no
existing tests were using the port I chose.
Other changes:
- Gave job a bit more breathing room resource-wise.
- Tightened timings a bit since previous failure ran into the `go test`
time limit.
- Cleaned up the DumpEvals output. It's quite nice and handy now!
Keeps failing in the nightly e2e test with unhelpful output like:
```
Failed
=== RUN TestOverlap
overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s)
overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b)
overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b
--- FAIL: TestOverlap (38.96s)
```
I have not been able to replicate it in my own e2e cluster, so I added
the EvalDump helper to add detailed eval information like:
```
=== RUN TestOverlap
1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a
Type: service
TriggeredBy: job-register
Deployment:
Status: pending ()
NextEval:
PrevEval:
BlockedEval:
-- No placement failures --
QueuedAllocs:
SnapshotIdx: 0
CreateIndex: 96
ModifyIndex: 96
...
```
Hopefully helpful when debugging other tests as well!
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
This PR adds a set of tests to the Consul test suite for testing
Nomad OSS's behavior around setting Consul Namespace on groups,
which is to ignore the setting (as Consul Namespaces are currently
an Enterprise feature).
Tests are generally a reduced facsimile of existing tests, modified
to check behavior of when group.consul.namespace is set and not set.
Verification is oriented around what happens in Consul; the in-depth
functional correctness of these features is left to the original tests.
Nomad ENT will get its own version of these tests in `namespaces_ent.go`.
Prefer testutil.WaitForResultRetries that emits more descriptive errors on
failures. `require.Evatually` fails with opaque "Condition never satisfied"
error message.
We directly parse job files in e2eutil, but currently using jobspec
package. Instead, use the Parse method from the jobspec2 package so
we can parse job files with new features.
Adds 2 tests around Connect Native. Both make use of the example connect native
services in https://github.com/hashicorp/nomad-connect-examples
One of them runs without Consul ACLs enabled, the other with.
This changeset:
* adds eval status to the error messages emitted when we have
placement failure in tests. The implementation here isn't quite
perfect but it's a lot better than "condition not met".
* enforces the ordering of teardown of the CSI test
* doesn't pass the purge flag to one of the two CSI tests, so that we
exercise both code paths.
- In script checks, ensure we're running `Exec` against the new running
allocation and not the earlier stopped one.
- In script checks, allow `Exec` calls to error due to lack of pty when
we use the exec to kill the task.
- In `utils.go/RegisterAllocs`, force query for allocations to wait on
wait index returned by registration call.
The e2e test code is absolutely hideous and leaks processes and files
on disk. NomadAgent seems useful, but the clientstate e2e tests are very
messy and slow. The last test "Corrupt" is probably the most useful as
it explicitly corrupts the state file whereas the other tests attempt to
reproduce steps thought to cause corruption in earlier releases of
Nomad.