The CSI plugin allocations take a while to be marked healthy,
sometimes causing E2E test flakes during the setup phase of the
tests. There's nothing CSI specific about marking plugin allocs
healthy, as the plugin supervisor hook does all the fingerprinting in
the postrun hook (the prestart hook just makes a couple of empty
directories). The timeouts we're seeing may be because of where we're
pulling the images from; most our jobs pull from a CDN-backed public
registry whereas these are pulling from ECR. Set a 1min timeout for
these to make sure we have enough time to pull the image and start the
task.
If any E2E test hangs, it'll eventually timeout and panic, causing the
all the remaining tests to fail. External commands should use a short
context whenever possible so we can fail the test quickly and move on
to the next test.
The AWS EBS plugin appears to use the name field of the volume as an
idempotency token that persists across the entire AWS account, not
just the plugin lifespan.
Also fix the regex for the volume ID, which was originally taken from
the job ID regex but isn't actually the same. This hasn't failed tests
for us because we've always passed in the same volume ID.
Split the EBS and EFS tests out into their own test cases:
* EBS exercises the Controller RPCs, including the create/snapshot workflow.
* EFS exercises only the Node RPCs, and assumes we have an existing volume
that gets registered, rather than created.
* e2e/csi: wait longer for plugins to become healthy
Plugins are Docker containers, and as such sometimes we get delays in startup
due to pulling from the registry and this is a source of test flakiness. Give
the plugins a little longer to start up.
* e2e/csi: version bump for AWS EBS plugins
Assert that deregistering a volume works without errors following a volume
reap. Use CLI helpers where feasible to exercise CSI command line. Dump plugin
allocation logs on deregistration failures for debugging purposes.
For everyday developer use, we don't need volumes for testing CSI. Providing a
flag to opt-in speeds up deploying dev clusters and slightly reduces infra costs.
Skip CSI test if missing volume specs.
The `nomad volume deregister` command currently returns an error if the volume
has any claims, but in cases where the claims can't be dropped because of
plugin errors, providing a `-force` flag gives the operator an escape hatch.
If the volume has no allocations or if they are all terminal, this flag
deletes the volume from the state store, immediately and implicitly dropping
all claims without further CSI RPCs. Note that this will not also
unmount/detach the volume, which we'll make the responsibility of a separate
`nomad volume detach` command.
This changeset:
* adds eval status to the error messages emitted when we have
placement failure in tests. The implementation here isn't quite
perfect but it's a lot better than "condition not met".
* enforces the ordering of teardown of the CSI test
* doesn't pass the purge flag to one of the two CSI tests, so that we
exercise both code paths.
This changeset provides two basic e2e tests for CSI plugins targeting
common AWS use cases.
The EBS test launches the EBS plugin (controller + nodes) and registers
an EBS volume as a Nomad CSI volume. We deploy a job that writes to
the volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The EFS test launches the EFS plugin (nodes-only) and registers an EFS
volume as a Nomad CSI volume. We deploy a job that writes to the
volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The writer jobs mount the CSI volume at a location within the alloc
dir.