* vault: configure user agent on Nomad vault clients
This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure out a way to set the User-Agent on the
Vault client created internally by consul-template.
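As a rough sketch (not the exact wiring in Nomad), the header can be attached to the `github.com/hashicorp/vault/api` client via its `SetHeaders` method; the agent string below is a placeholder:
```go
package vaultclient

import (
	"net/http"

	vaultapi "github.com/hashicorp/vault/api"
)

// newVaultClient builds a Vault API client that sends a custom User-Agent
// header on every request it makes.
func newVaultClient(addr, userAgent string) (*vaultapi.Client, error) {
	conf := vaultapi.DefaultConfig()
	conf.Address = addr

	client, err := vaultapi.NewClient(conf)
	if err != nil {
		return nil, err
	}

	headers := client.Headers()
	if headers == nil {
		headers = http.Header{}
	}
	headers.Set("User-Agent", userAgent) // e.g. "Nomad/<version>"
	client.SetHeaders(headers)
	return client, nil
}
```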
* vault: fixup find-and-replace gone awry
This PR modifies the disconnect helper job to run as root, which is necessary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the scheduler more
time to do its thing).
Skips the other 3 tests, which fail for reasons I cannot yet figure out.
Also adds some debug log lines for this test, because it doesn't make sense
for the allocation to be complete while a task in the allocation has not yet
started, which is what the test failures imply.
Originally this test relied on Job 1 blocking Job 2 until Job 1 had a
terminal *ClientStatus.* Job 2 ensured it would get blocked using 2
mechanisms:
1. A constraint requiring it is placed on the same node as Job 1.
2. Job 2 would require all unreserved CPU on the node to ensure it would
be blocked until Job 1's resources were free.
That 2nd assertion breaks if *any previous job is still running on the
target node!* That seems very likely to happen in the flaky world of our
e2e tests. In fact there may be some jobs we intentionally want running
throughout; in hindsight it was never safe to assume my test would be
the only thing scheduled when it ran.
*Ports to the rescue!* Reserving a static port means that Job 2
will now block on Job 1 being terminal. It will only conflict with other
tests if those tests use that port *on every node.* I ensured no
existing tests were using the port I chose.
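For illustration only, a hedged sketch of how the blocking job could reserve a static port using the `github.com/hashicorp/nomad/api` package; the job name, port number, and task are made up:
```go
package overlap

import "github.com/hashicorp/nomad/api"

// blockingJob reserves a static port so a follow-up job that asks for the
// same port cannot be placed on the node until this job is terminal.
func blockingJob() *api.Job {
	job := api.NewServiceJob("overlap-block", "overlap-block", "global", 50)

	tg := api.NewTaskGroup("group", 1)
	tg.Networks = []*api.NetworkResource{{
		ReservedPorts: []api.Port{{Label: "block", Value: 23456}},
	}}

	task := api.NewTask("sleep", "raw_exec")
	task.SetConfig("command", "sleep")
	task.SetConfig("args", []string{"300"})
	tg.AddTask(task)

	job.AddTaskGroup(tg)
	return job
}
```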
Other changes:
- Gave job a bit more breathing room resource-wise.
- Tightened timings a bit since previous failure ran into the `go test`
time limit.
- Cleaned up the DumpEvals output. It's quite nice and handy now!
Keeps failing in the nightly e2e test with unhelpful output like:
```
Failed
=== RUN TestOverlap
overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s)
overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b)
overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b
--- FAIL: TestOverlap (38.96s)
```
I have not been able to replicate it in my own e2e cluster, so I added
the EvalDump helper to add detailed eval information like:
```
=== RUN TestOverlap
1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a
Type: service
TriggeredBy: job-register
Deployment:
Status: pending ()
NextEval:
PrevEval:
BlockedEval:
-- No placement failures --
QueuedAllocs:
SnapshotIdx: 0
CreateIndex: 96
ModifyIndex: 96
...
```
Hopefully helpful when debugging other tests as well!
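A hedged sketch of what such a dump helper might look like, assuming the `github.com/hashicorp/nomad/api` client; the exact fields printed here are illustrative:
```go
package e2eutil

import (
	"testing"

	"github.com/hashicorp/nomad/api"
)

// dumpEvals logs every evaluation for a job so a failed CI run leaves behind
// enough detail to reason about scheduling decisions after the fact.
func dumpEvals(t *testing.T, client *api.Client, jobID string) {
	t.Helper()

	evals, _, err := client.Jobs().Evaluations(jobID, nil)
	if err != nil {
		t.Logf("error listing evals for %s: %v", jobID, err)
		return
	}

	for i, e := range evals {
		t.Logf("%d/%d Job %s Eval %s", i+1, len(evals), jobID, e.ID)
		t.Logf("  Type: %s", e.Type)
		t.Logf("  TriggeredBy: %s", e.TriggeredBy)
		t.Logf("  Status: %s (%s)", e.Status, e.StatusDescription)
		t.Logf("  BlockedEval: %s", e.BlockedEval)
		t.Logf("  CreateIndex: %d  ModifyIndex: %d", e.CreateIndex, e.ModifyIndex)
	}
}
```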
Some tests may choose to deregister jobs to check Nomad cleanup
logic; however, it is still possible for the test to fail and exit
before this is hit. This therefore adds a cancellable cleanup func
which can be deferred, using context to control whether it gets
run or not.
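A minimal sketch of the pattern, assuming made-up helper names; the real helper may differ:
```go
package e2eutil

import (
	"context"
	"testing"
)

// CancellableCleanup wraps a cleanup func so it can be deferred up front but
// skipped later: once the supplied context is cancelled, the returned func
// becomes a no-op.
func CancellableCleanup(ctx context.Context, cleanup func(*testing.T)) func(*testing.T) {
	return func(t *testing.T) {
		select {
		case <-ctx.Done():
			return // test decided the cleanup is no longer wanted
		default:
			cleanup(t)
		}
	}
}
```
A test might defer the wrapped func right after registering its jobs and call the context's cancel func once it has deregistered the job itself, so the deferred cleanup does not run a second time.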
* Wait longer for node to go down in disconnected clients test.
The existing helper only waits 10s, but there's jitter on heartbeats
that we need to account for. Wait 30s for the node to go down to give
us plenty of room.
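A hedged sketch of the longer wait, using the `github.com/hashicorp/nomad/api` client directly (the real helper likely goes through the e2e utilities):
```go
package e2eutil

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForNodeDown polls the node until it reports "down" or 30s elapse,
// leaving headroom for heartbeat jitter that a 10s wait does not allow.
func waitForNodeDown(client *api.Client, nodeID string) error {
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		node, _, err := client.Nodes().Info(nodeID, nil)
		if err == nil && node.Status == "down" {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node %s did not go down within 30s", nodeID)
}
```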
* Port disconnected clients to stdlib-style test
Our E2E "framework" has a bunch of features around test discovery and
standing up infra that were never completed or fully used, and we
ended up building out a large test suite that ignored all that in favor
of Terraform-provided infrastructure for the last couple of years.
This changeset is a proposal (and demonstration) for gradually
migrating our E2E tests off the framework code so that developers can
write fairly ordinary golang stdlib testing tests.
This test exercises the behavior of clients that become disconnected
and have their allocations replaced. Future test cases will exercise
the `max_client_disconnect` field on the job spec.
If any E2E test hangs, it'll eventually time out and panic, causing
all the remaining tests to fail. External commands should use a short
context whenever possible so we can fail the test quickly and move on
to the next test.
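For example, a sketch of wrapping external commands in a short-lived context (the 30s timeout is an illustrative choice):
```go
package e2eutil

import (
	"context"
	"os/exec"
	"time"
)

// runWithTimeout executes an external command with a short deadline so a hung
// command fails the current test quickly instead of hitting the suite-wide
// `go test` timeout and panicking.
func runWithTimeout(name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	return string(out), err
}
```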
The `TestRescheduleProgressDeadlineFail` E2E test failed during test
cleanup because the error message "progress deadline expired" that it
emits when we stop the job does not match the one expected from
monitoring the `job stop` command. Update the `StopJob` helper to
tolerate this use case as well.
PR #11550 changed the job stop exit behaviour when monitoring the
deployment. When stopping a job, the deployment becomes cancelled
and therefore the CLI now exits with status code 1 as it sees this
as an error.
This change adds a new utility e2e function that accounts for this
behaviour.
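A hedged sketch of what such a utility could look like; the exact status strings it tolerates are illustrative:
```go
package e2eutil

import (
	"fmt"
	"os/exec"
	"strings"
)

// StopJob runs `nomad job stop` and tolerates the non-zero exit the CLI
// returns when monitoring reports the deployment as cancelled or failed
// because the job was stopped.
func StopJob(jobID string, args ...string) error {
	cli := append([]string{"job", "stop"}, append(args, jobID)...)
	out, err := exec.Command("nomad", cli...).CombinedOutput()
	if err == nil {
		return nil
	}

	got := string(out)
	if strings.Contains(got, "Cancelled because job is stopped") ||
		strings.Contains(got, "progress deadline expired") {
		return nil // expected when stopping a job mid-deployment
	}
	return fmt.Errorf("job stop failed: %v\n%s", err, got)
}
```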
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and override the use of the job priority to set
the eval priority.
In order to ensure all evaluations created as a result of the request
use the same eval priority, the priority is passed to the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.
The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.
Any user supplied value is validated within the agent HTTP handler
removing the need to pass invalid requests to the server.
The register and deregister opts functions now allow for setting
the eval priority on requests.
The change includes a small fix to the DeregisterOpts function so it
handles nil opts. This brings the function in line with
RegisterOpts.
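Assuming the option structs gained an `EvalPriority` field, usage from the Go API might look roughly like this:
```go
package example

import "github.com/hashicorp/nomad/api"

// registerWithEvalPriority submits a job and asks for the resulting
// evaluations to be created at a specific priority, independent of the
// job's own priority.
func registerWithEvalPriority(client *api.Client, job *api.Job) error {
	opts := &api.RegisterOptions{EvalPriority: 90}
	_, _, err := client.Jobs().RegisterOpts(job, opts, nil)
	return err
}

// deregisterWithEvalPriority stops a job, again pinning the eval priority.
func deregisterWithEvalPriority(client *api.Client, jobID string) error {
	opts := &api.DeregisterOptions{EvalPriority: 90}
	_, _, err := client.Jobs().DeregisterOpts(jobID, opts, nil)
	return err
}
```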
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes #2527
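For illustration, a hedged sketch of a sysbatch job built with the Go API; the task and image are made up:
```go
package example

import "github.com/hashicorp/nomad/api"

func strPtr(s string) *string { return &s }

// exampleSysbatchJob builds a job the sysbatch scheduler will run once on
// every compatible node, batch-style.
func exampleSysbatchJob() *api.Job {
	job := api.NewBatchJob("node-report", "node-report", "global", 50)
	job.Type = strPtr("sysbatch")

	tg := api.NewTaskGroup("report", 1)
	task := api.NewTask("uname", "docker")
	task.SetConfig("image", "busybox:1")
	task.SetConfig("command", "uname")
	task.SetConfig("args", []string{"-a"})
	tg.AddTask(task)

	job.AddTaskGroup(tg)
	return job
}
```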
Pick up 15d39f0dee but for RegisterFromJobspec:
> This PR changes the e2e helper thingy to set -detach option
> when registering a job with the CLI instead of the API. This is
> necessary for jobs which never become healthy, as the deployment
> never finishes for failing jobs and the command never returns,
> causing the test to timeout after 10 minutes.
This case occurs in TestVaultSecrets
This PR changes the e2e helper thingy to set -detach option
when registering a job with the CLI instead of the API. This is
necessary for jobs which never become healthy, as the deployment
never finishes for failing jobs and the command never returns,
causing the test to timeout after 10 minutes.
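A hedged sketch of what registering with `-detach` through the CLI can look like from a test helper:
```go
package e2eutil

import (
	"fmt"
	"os/exec"
)

// registerDetached registers a job via the CLI rather than the API, passing
// -detach so the command returns once the evaluation is created instead of
// monitoring a deployment that may never finish for an unhealthy job.
func registerDetached(jobFile string) error {
	out, err := exec.Command("nomad", "job", "run", "-detach", jobFile).CombinedOutput()
	if err != nil {
		return fmt.Errorf("could not register %s: %v\n%s", jobFile, err, out)
	}
	return nil
}
```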
This PR adds e2e tests for Consul Namespaces for Nomad Enterprise
with Consul ACLs enabled.
Needed to add support for Consul ACL tokens with `namespace` and
`namespace_prefix` blocks, which Nomad parses and validates before
tossing the token. These bits will need to be picked back to OSS.
This PR adds a set of tests to the Consul test suite for testing
Nomad OSS's behavior around setting Consul Namespace on groups,
which is to ignore the setting (as Consul Namespaces are currently
an Enterprise feature).
Tests are generally a reduced facsimile of existing tests, modified
to check behavior of when group.consul.namespace is set and not set.
Verification is oriented around what happens in Consul; the in-depth
functional correctness of these features is left to the original tests.
Nomad ENT will get its own version of these tests in `namespaces_ent.go`.
* fix periodic
* update periodic to not use template
`nomad job inspect` no longer returns an API list stub, so the required fields to query the job summary are no longer there; parse the CLI output instead
* rm tmp makefile entry
* fix typo
* revert makefile change
Prefer testutil.WaitForResultRetries, which emits more descriptive errors on
failures. `require.Eventually` fails with an opaque "Condition never satisfied"
error message.
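A sketch of the preferred pattern, assuming the usual `func() (bool, error)` / `func(error)` callback shape of the testutil helpers:
```go
package example

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/api"
	"github.com/hashicorp/nomad/testutil"
)

// waitForRunningAlloc waits for an allocation of the job to be running and,
// unlike require.Eventually, reports the last underlying error on timeout.
func waitForRunningAlloc(t *testing.T, client *api.Client, jobID string) {
	testutil.WaitForResult(func() (bool, error) {
		allocs, _, err := client.Jobs().Allocations(jobID, false, nil)
		if err != nil {
			return false, err
		}
		for _, a := range allocs {
			if a.ClientStatus == "running" {
				return true, nil
			}
		}
		return false, fmt.Errorf("no running allocation yet for %s", jobID)
	}, func(err error) {
		t.Fatalf("timed out waiting for %s: %v", jobID, err)
	})
}
```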
* Prevent Job Statuses from being calculated twice
https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion with job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job registration eval insertion is
atomic with the registration, this check is no longer necessary to set
the job statuses correctly.
* test to ensure only single job event for job register
* periodic e2e
* separate job update summary step
* fix updatejobstability to use copy instead of modified reference of job
* update envoygatewaybindaddresses copy to prevent job diff on null vs empty
* set ConsulGatewayBindAddress to empty map instead of nil
fix nil assertions for empty map
rm unnecessary guard
Deflake namespace e2e test by only asserting on jobs related to the
namespace tests. During our e2e tests, some leftover jobs (e.g.
prometheus) are still running while being shut down and cause the test to
fail.
We directly parse job files in e2eutil, but currently use the jobspec
package. Instead, use the Parse method from the jobspec2 package so
we can parse job files with new features.
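A minimal sketch of the swap, assuming `jobspec2.Parse` takes a path and a reader:
```go
package e2eutil

import (
	"os"

	"github.com/hashicorp/nomad/api"
	"github.com/hashicorp/nomad/jobspec2"
)

// parseJobFile reads a job file with the HCL2-aware parser so job files that
// use newer features still load in the e2e helpers.
func parseJobFile(path string) (*api.Job, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	return jobspec2.Parse(path, f)
}
```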
Assert that deregistering a volume works without errors following a volume
reap. Use CLI helpers where feasible to exercise CSI command line. Dump plugin
allocation logs on deregistration failures for debugging purposes.
Exercises host volume and Docker volume functionality for the `exec` and `docker`
task drivers, particularly around mounting locations within the container and
how this can be used with `template`.
The CLI helpers in the rescheduling test were intended for shared use, but
until some other tests were written we didn't want to waste time making them
generic. This changeset refactors them and adds some new helpers associated
with the node drain tests (under separate PR).
The E2E suite exercises the API, but not the CLI. This changeset adds a helper
function to send commands via a locally-built Nomad binary (which we'll need
to add to the E2E setup), and some helpers to parse the resulting structured
outputs in a way that tests can consume.
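Roughly, the helpers might look like this (names and the key/value parsing are illustrative):
```go
package e2eutil

import (
	"os/exec"
	"strings"
)

// Command runs the locally built nomad binary and returns its combined
// output so tests can exercise the CLI rather than only the API.
func Command(args ...string) (string, error) {
	out, err := exec.Command("nomad", args...).CombinedOutput()
	return string(out), err
}

// ParseFields turns a "Key = Value" style CLI block into a map that tests
// can assert against.
func ParseFields(output string) map[string]string {
	fields := map[string]string{}
	for _, line := range strings.Split(output, "\n") {
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		fields[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return fields
}
```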
Adds 2 tests around Connect Native. Both make use of the example connect native
services in https://github.com/hashicorp/nomad-connect-examples
One of them runs without Consul ACLs enabled, the other with.