This changeset adds volumes but does not mount them to instances so
that we can test the mounting ("staging") via CSI plugins. The CSI
plugins themselves will be installed as Nomad jobs.
In order to ensure we can always mount the EFS volume, this changeset
pins the deployment of the cluster to a specific subnet. In future
work we should spread the cluster out among several AZs and test that
behavior explicitly.
Golang 1.13 introduced a change in test flag parsing:
> testing
> ...
> Testing flags are now registered in the new Init function, which is invoked by the generated main function for the test. As a result, testing flags are now only registered when running a test binary, and packages that call flag.Parse during package initialization may cause tests to fail.
https://golang.org/doc/go1.13#testing
Here, we ensure that e2e framework flag parsing occurs in TestMain, by only
initializing Framework at Run invocation.
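
A minimal sketch of the "initialize only at Run" idea. The `Framework` type, `newFramework`, and the `-env` flag here are hypothetical stand-ins, and the sketch uses a private `flag.FlagSet` rather than the global flag set so it can run outside a test binary:

```go
package main

import (
	"flag"
	"fmt"
)

// Framework is a hypothetical stand-in for the e2e framework type.
type Framework struct{ env string }

// pkgFramework stays nil during package initialization, so no flags
// are parsed before the test binary has registered its own.
var pkgFramework *Framework

// newFramework parses flags and builds the framework. The -env flag
// is an illustrative assumption, not a real e2e flag.
func newFramework(args []string) (*Framework, error) {
	fs := flag.NewFlagSet("e2e", flag.ContinueOnError)
	env := fs.String("env", "dev", "hypothetical environment name")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return &Framework{env: *env}, nil
}

// Run initializes the framework on first use, then reports which
// environment it would run against.
func Run(args []string) (string, error) {
	if pkgFramework == nil {
		f, err := newFramework(args)
		if err != nil {
			return "", err
		}
		pkgFramework = f
	}
	return "env=" + pkgFramework.env, nil
}

func main() {
	msg, err := Run([]string{"-env=ci"})
	fmt.Println(msg, err)
}
```

In a real test binary the call to `Run` would sit in `TestMain` or a test function, after the testing package has registered its flags.
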
go vet would have caught the bug fixed in
6362e32161295fa959ebe46b93cea0ea1a9bdd72, but our use of errgroup
kept vet from seeing it.
Rip out errgroup to take advantage of vet, and remove download limiting
now that we're downloading far fewer binaries overall.
Pretty sure Consul / Nomad clients are often not ready yet by the
time the next test starts running after the ConsulACLs test disables
ACLs.
Running locally things tend to work, but in TeamCity this seems to
be a recurring problem. However, even when running locally I sometimes
see in the "show status" step after disabling ACLs that some nodes are
still initializing, suggesting we're right on the border of not waiting
long enough:
    nomad node status
    ID        DC   Name              Class   Drain  Eligibility  Status
    0e4dfce2  dc1  EC2AMAZ-JB3NF9P   <none>  false  eligible     ready
    6b90aa06  dc2  ip-172-31-16-225  <none>  false  eligible     ready
    7068558a  dc2  ip-172-31-20-143  <none>  false  eligible     ready
    e0ae3c5c  dc1  ip-172-31-25-165  <none>  false  eligible     ready
    15b59ed6  dc1  ip-172-31-23-199  <none>  false  eligible     initializing
Going to try waiting a full 2 minutes after disabling ACLs, hopefully that
will help things Just Work. In the future, we should probably be parsing the
output of the status checks and actually confirming all nodes are ready.
Even better, maybe that's something shipyard will have built-in.
Go implicitly treats files ending with `_linux.go` as build tagged for
Linux only. This broke the e2e provisioning framework on macOS once we
tried importing it into the `e2e/consulacls` module.
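
The file-name rule can be checked with the standard `go/build` package, which applies the same implicit constraints. This is only an illustration: `provisioning_linux.go` and `includedOn` are hypothetical names, and the file content is served from memory so nothing is read from disk.

```go
package main

import (
	"fmt"
	"go/build"
	"io"
	"strings"
)

// includedOn reports whether a Go file with the given name would be
// part of a build for goos, judging by its name alone (the in-memory
// source carries no explicit build constraints).
func includedOn(goos, name string) (bool, error) {
	ctx := build.Default
	ctx.GOOS = goos
	ctx.GOARCH = "amd64"
	// Serve a trivial source file instead of opening a real one.
	ctx.OpenFile = func(string) (io.ReadCloser, error) {
		return io.NopCloser(strings.NewReader("package provisioning\n")), nil
	}
	return ctx.MatchFile(".", name)
}

func main() {
	for _, goos := range []string{"linux", "darwin"} {
		ok, err := includedOn(goos, "provisioning_linux.go")
		fmt.Println(goos, ok, err)
	}
}
```
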
This changeset improves the ergonomics of running the Nomad e2e test
provisioning process by defaulting to a blank `nomad_sha` in the
Terraform configuration. By default, a user will now need to pass in
one of the Nomad version flags. But they won't have to manually edit
the `provisioning.json` file for the common case of deploying a
released version of Nomad, and won't need to put dummy values for
`nomad_sha`.
Includes general documentation improvements.
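This test is causing panics. Unlike the other similar tests, this
one is using require.Eventually, which runs the condition function in
its own goroutine. As the trace below shows, a Fatalf in that goroutine
can fire after the test has completed. This change replaces it with a
for-loop like the other tests.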
Failure:
=== RUN TestE2E/Connect
=== RUN TestE2E/Connect/*connect.ConnectE2ETest
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed
goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL github.com/hashicorp/nomad/e2e 21.427s
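
A minimal sketch of the for-loop style, assuming a hypothetical `waitFor` helper (the real tests may be structured differently). The condition runs in the caller's goroutine, so a failure is always reported while the test is still running:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls cond in the calling goroutine until it returns true,
// returns an error, or the attempts run out.
func waitFor(attempts int, interval time.Duration, cond func() (bool, error)) error {
	for i := 0; i < attempts; i++ {
		ok, err := cond()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("condition not met before attempts ran out")
}

func main() {
	calls := 0
	err := waitFor(5, 10*time.Millisecond, func() (bool, error) {
		calls++
		return calls == 3, nil // succeeds on the third poll
	})
	fmt.Println(calls, err)
}
```

In a test, the caller would check the returned error and call `t.Fatalf` itself, from the test's own goroutine.
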
Provide script for managing Consul ACLs on a TF provisioned cluster for
e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs,
and automatically takes care of the bootstrapping process if necessary.
The bootstrapping process takes a long time, so we may need to
extend the overall e2e timeout (20 minutes seems fine).
Introduces basic tests for Consul Connect with ACLs.
This change allows for providing the -suite=<Name> flag when
running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component
will be run, and all other suites will be skipped.
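
The filtering can be sketched as follows. The `TestSuite` type here carries only the `Component` field named above, `selectSuites` is a hypothetical helper, and exact string matching is an assumption about how the flag is compared:

```go
package main

import (
	"flag"
	"fmt"
)

// TestSuite mirrors only the Component field; the rest of the real
// type is omitted.
type TestSuite struct {
	Component string
}

// selectSuites returns every suite when want is empty, otherwise only
// the suites whose Component equals want.
func selectSuites(all []TestSuite, want string) []TestSuite {
	if want == "" {
		return all
	}
	var out []TestSuite
	for _, s := range all {
		if s.Component == want {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fs := flag.NewFlagSet("e2e", flag.ContinueOnError)
	want := fs.String("suite", "", "run only the suite with this Component")
	if err := fs.Parse([]string{"-suite=Connect"}); err != nil {
		fmt.Println(err)
		return
	}
	all := []TestSuite{{Component: "Connect"}, {Component: "Metrics"}}
	for _, s := range selectSuites(all, *want) {
		fmt.Println("running", s.Component)
	}
}
```
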
Fixes a bug introduced in 0aa58b9 where we're writing a test file to
a taskdir-interpolated location, which works when we `alloc exec` but
not in the jobspec for a group script check.
This changeset also makes the test safe to run multiple times by
namespacing the file with the alloc ID, which has the added bonus of
exercising our alloc interpolation code for group script checks.
The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.
This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.
Terraform work includes:
* provides Terraform output that is written to JSON and used by the
framework to configure provisioning, via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
multiple times w/ different versions against the same host.
Group service checks cannot interpolate task fields, because the task
fields are not available at the time the script check hook is created
for the group service. When f31482a was merged this e2e test began
failing because we are now correctly matching the script check ID to
the service ID, which revealed this jobspec was invalid.
This changeset is part of the work to improve our E2E provisioning
process to support our upgrade tests:
* Move more of the setup into the AMI image creation so it's a little
more obvious to provisioning config authors which bits are essential
to deploying a specific version of Nomad.
* Make the service file update do a systemd daemon-reload so that we
can update an already-running cluster with the same script we use to
deploy it initially.
Modernize Vault integration/e2e test a bit:
- Download from releases.hashicorp.com instead of using a hardcoded list
- Remove old unused make target e2e-test
- Use NOMAD_E2E env var instead of -integration flag
- Add a README
On my machine with ~250 Mbps internet it takes ~400s to download all
Vault binaries.
When multiple Connect-enabled task groups start on the same client
node, a race condition in the CNI plugins for creating iptables chains
causes one of the tasks to fail. We upstreamed a patch to CNI plugins
to make iptables chain creation idempotent.
This changeset updates end-to-end testing, development tooling, and
documentation to use CNI plugins 0.8.4, which includes our patch.
Increase the shortened timeout after the first loop so that metrics
that take longer to come in aren't failing the test unnecessarily.
Move the check for empty alloc metrics into the loop so that if the
first values we get are empty we don't fail the test too early.
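
A sketch of keeping the empty check inside the retry loop, so an empty first result is retried rather than treated as a failure. `firstNonEmpty` and the `query` callback are hypothetical, and returning immediately on a query error is an assumption:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// firstNonEmpty calls query up to attempts times and returns the first
// non-empty result. An empty result means "not reported yet" and is
// retried; a query error ends the loop immediately.
func firstNonEmpty(attempts int, interval time.Duration, query func() ([]float64, error)) ([]float64, error) {
	for i := 0; i < attempts; i++ {
		vals, err := query()
		if err != nil {
			return nil, err
		}
		if len(vals) > 0 {
			return vals, nil
		}
		time.Sleep(interval)
	}
	return nil, errors.New("no metrics reported before attempts ran out")
}

func main() {
	calls := 0
	vals, err := firstNonEmpty(5, 10*time.Millisecond, func() ([]float64, error) {
		calls++
		if calls < 3 {
			return nil, nil // nothing scraped yet
		}
		return []float64{42}, nil
	})
	fmt.Println(vals, calls, err)
}
```
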
Adds Windows targets to the client/allocs metrics tests. Removes the
`allocstats` test, which covers less than these tests and is now
redundant.
Adds a firewall rule to our Windows instances so that the prometheus
server can scrape the Nomad HTTP API for metrics.
Refactor the metrics end-to-end tests so they can be run with our e2e
test framework. Runs fabio/prometheus and a collection of jobs that
will cause metrics to be measured. We then query Prometheus to ensure
we're publishing those allocation metrics and some metrics from the
clients as well.
Includes adding a placeholder for running the same tests on Windows.