This is a follow-up to having tests run in serial in CI.
The e2e package isn't in CI, but let's use the helper anyway
so we can set up semgrep rules covering the entire repository.
The RPC for listing volume snapshots requires a plugin ID. Update the
`volume snapshot list` command to find the specific plugin from the
provided prefix.
If any E2E test hangs, it'll eventually time out and panic, causing
all the remaining tests to fail. External commands should use a short
context whenever possible so we can fail the test quickly and move on
to the next test.
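A minimal sketch of the short-context idea, assuming a hypothetical
`runWithTimeout` helper (not the one in the e2e package) built on
`exec.CommandContext`. The `sleep` command stands in for a hung
external command.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout is a hypothetical helper. It runs an external command and
// kills it if it outlives the timeout, so a hung command fails one test
// quickly instead of stalling the whole run.
func runWithTimeout(timeout time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return string(out), fmt.Errorf("%s timed out after %s: %w", name, timeout, ctx.Err())
	}
	return string(out), err
}

func main() {
	// "sleep 5" stands in for a hung command; the call gives up after a second.
	_, err := runWithTimeout(time.Second, "sleep", "5")
	fmt.Println(err)
}
```

Returning a distinct timeout error makes it clear in the test log
that the command hung rather than failed on its own.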
The `TestRescheduleProgressDeadlineFail` E2E test failed during test
cleanup because the error message "progress deadline expired" that it
emits when we stop the job does not match the one expected from
monitoring the `job stop` command. Update the `StopJob` helper to
tolerate this use case as well.
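One way to picture the tolerance, as a sketch only: `stopCommand`
below is a hypothetical stand-in for a `StopJob`-style helper, and it
assumes the stop command exits non-zero while printing the message we
want to accept. A shell one-liner stands in for the CLI.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// stopCommand is a hypothetical helper. It runs a command and treats a
// non-zero exit as success when the output contains one of the tolerated
// messages; any other failure is returned to the caller.
func stopCommand(name string, args []string, tolerated ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err == nil {
		return nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		for _, msg := range tolerated {
			if strings.Contains(string(out), msg) {
				return nil
			}
		}
	}
	return fmt.Errorf("stop failed: %w: %s", err, out)
}

func main() {
	// The script stands in for the CLI so the sketch runs without a cluster.
	script := `echo "progress deadline expired"; exit 1`
	err := stopCommand("sh", []string{"-c", script}, "progress deadline expired")
	fmt.Println("tolerated:", err == nil)
}
```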
The `Metrics` suite uses prometheus to scrape Nomad metrics so that
we're testing the full user experience of extracting metrics from
Nomad. With the addition of mTLS, we need to make sure prometheus also
has mTLS configuration because the metrics endpoint is protected.
Update the Nomad client configuration and prometheus job to bind-mount
the client's certs into the task so that the job can use these certs
to scrape the server. This is a temporary solution that gets the job
passing; we should give the job its own certificates (issued by
Vault?) when we've done some of the infrastructure rework we'd like.
The AWS EBS plugin appears to use the name field of the volume as an
idempotency token that persists across the entire AWS account, not
just the plugin lifespan.
Also fix the regex for the volume ID, which was originally taken from
the job ID regex but isn't actually the same. This hasn't failed tests
for us because we've always passed in the same volume ID.
With mTLS enabled, using `curl` in a bash script for validation
involves having to configure arguments to `curl` based on whether or
not the test infrastructure is using mTLS, whether ACLs are enabled,
etc. Use the new `operator api` command instead to pick up the client
configuration from the test environment automatically.
PR #11550 changed the job stop exit behaviour when monitoring the
deployment. When stopping a job, the deployment becomes cancelled
and therefore the CLI now exits with status code 1 as it sees this
as an error.
This change adds a new utility e2e function that accounts for this
behaviour.
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and overrides the use of the job priority to set
the eval priority.
In order to ensure all evaluations as a result of the request use
the same eval priority, the priority is shared with the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.
The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.
Any user-supplied value is validated within the agent HTTP handler,
so invalid requests are never passed to the server.
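The override and validation described above can be sketched roughly
as follows. The function names are hypothetical, the 1-100 bounds are
assumed to mirror the usual job priority range, and treating 0 as
"not set" is an assumption of this sketch.

```go
package main

import "fmt"

// Assumed bounds, mirroring the usual 1-100 job priority range.
const (
	minPriority = 1
	maxPriority = 100
)

// validateEvalPriority rejects out-of-range values before a request would be
// forwarded. A value of 0 is treated as "not set" (an assumption here).
func validateEvalPriority(p int) error {
	if p != 0 && (p < minPriority || p > maxPriority) {
		return fmt.Errorf("eval priority must be between %d and %d, got %d",
			minPriority, maxPriority, p)
	}
	return nil
}

// effectiveEvalPriority returns the requested eval priority when one was
// supplied and falls back to the job priority otherwise.
func effectiveEvalPriority(jobPriority, requested int) int {
	if requested > 0 {
		return requested
	}
	return jobPriority
}

func main() {
	fmt.Println(validateEvalPriority(150))
	fmt.Println(effectiveEvalPriority(50, 0), effectiveEvalPriority(50, 80))
}
```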
The register and deregister opts functions now allow for setting
the eval priority on requests.
The change includes a small change to the DeregisterOpts function
which handles nil opts. This brings the function in line with the
RegisterOpts.
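As an illustration of the nil handling, here is a sketch with a
hypothetical `deregisterOptions` type and `deregisterPath` function
(not the API package's real ones). The `eval_priority` and `purge`
query parameter names are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// deregisterOptions is a hypothetical option set for a job delete request.
type deregisterOptions struct {
	Purge        bool
	EvalPriority int
}

// deregisterPath builds the request path for a job delete. Nil opts are
// replaced with an empty option set so callers need not allocate one.
func deregisterPath(jobID string, opts *deregisterOptions) string {
	if opts == nil {
		opts = &deregisterOptions{}
	}
	q := url.Values{}
	if opts.Purge {
		q.Set("purge", "true")
	}
	if opts.EvalPriority > 0 {
		q.Set("eval_priority", strconv.Itoa(opts.EvalPriority))
	}
	path := "/v1/job/" + url.PathEscape(jobID)
	if len(q) > 0 {
		path += "?" + q.Encode()
	}
	return path
}

func main() {
	fmt.Println(deregisterPath("example", nil))
	fmt.Println(deregisterPath("example", &deregisterOptions{Purge: true, EvalPriority: 80}))
}
```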
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.
In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.
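The shape of that signature change might look like the sketch below.
All type and field names here are illustrative and not necessarily
the ones used in the tree; `fakeManager` exists only to make the
example runnable.

```go
package main

import "fmt"

// networkCreateRequest is a hypothetical request struct. New settings can be
// added as fields later without changing the interface signature again.
type networkCreateRequest struct {
	Hostname string
}

// networkSpec is a simplified, hypothetical result type.
type networkSpec struct {
	Path     string
	Hostname string
}

// networkManager mirrors the idea of an interface whose CreateNetwork takes
// the allocation ID plus a request struct.
type networkManager interface {
	CreateNetwork(allocID string, req *networkCreateRequest) (*networkSpec, error)
}

// fakeManager is a toy implementation used only for this sketch.
type fakeManager struct{}

func (fakeManager) CreateNetwork(allocID string, req *networkCreateRequest) (*networkSpec, error) {
	spec := &networkSpec{Path: "/var/run/netns/" + allocID}
	if req != nil {
		spec.Hostname = req.Hostname
	}
	return spec, nil
}

func main() {
	var m networkManager = fakeManager{}
	spec, err := m.CreateNetwork("alloc1", &networkCreateRequest{Hostname: "web"})
	fmt.Println(spec.Path, spec.Hostname, err)
}
```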
The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
As part of this change, hostname support for network namespaces has
also been added to the Docker driver; the default Linux manager does
not currently implement it.
This allows us to spin up e2e clusters with mTLS configured for all HashiCorp services, i.e. Nomad, Consul, and Vault. Used it for testing #11089.
mTLS is disabled by default. I have not updated Windows provisioning scripts yet - Windows also lacks ACL support from before. I intend to follow up for them in another round.
Target all e2e datacenters for system and sysbatch e2e tests. They
require that the system jobs run on all Linux clients.
However, the jobs currently target only the `dc1` datacenter, while the
nightly e2e cluster has 4 clients spread across the `dc1` and `dc2`
datacenters, causing the tests to fail.
I missed this problem in the e2e dev cluster because it only used a
single `dc1` datacenter.
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
As the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs: it is for running short-lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success, or
failure after retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes #2527
As we moved to using `-detach` for registering jobs, we should wait
until allocs and deployments are created before asserting their
properties.
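A rough sketch of that wait, assuming a hypothetical `waitFor`
polling helper rather than the e2e package's own utilities. The
counter in `main` stands in for "allocs exist yet?".

```go
package main

import (
	"fmt"
	"time"
)

// waitFor is a hypothetical helper. It polls check until it reports true,
// returns an error, or the timeout elapses.
func waitFor(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// The counter stands in for polling until allocations have been created.
	polls := 0
	err := waitFor(time.Second, 10*time.Millisecond, func() (bool, error) {
		polls++
		return polls >= 3, nil
	})
	fmt.Println(polls, err)
}
```

Only once the wait succeeds would the test go on to assert
properties of the allocs or deployments.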
Fixing `TestNodeDrainIgnoreSystem` and `TestRescheduleProgressDeadlineFail` tests as they seem particularly flaky, failing 9 and 7 times (respectively) in the last two weeks.
Pick up 15d39f0dee but for RegisterFromJobspec:
> This PR changes the e2e helper thingy to set -detach option
> when registering a job with the CLI instead of the API. This is
> necessary for jobs which never become healthy, as the deployment
> never finishes for failing jobs and the command never returns,
> causing the test to timeout after 10 minutes.
This case occurs in TestVaultSecrets.
This PR changes the e2e helper thingy to set -detach option
when registering a job with the CLI instead of the API. This is
necessary for jobs which never become healthy, as the deployment
never finishes for failing jobs and the command never returns,
causing the test to time out after 10 minutes.
Make it easier to spin up a cluster whose binaries are fetched from
arbitrary URLs. These could be CircleCI `build-binaries` job artifacts
or presigned S3 URLs.
Co-authored-by: Tim Gross <tgross@hashicorp.com>