This changeset adds volumes but does not mount them to instances so
that we can test the mounting ("staging") via CSI plugins. The CSI
plugins themselves will be installed as Nomad jobs.
In order to ensure we can always mount the EFS volume, this changeset
pins the deployment of the cluster to a specific subnet. In future
work we should spread the cluster out among several AZs and test that
behavior explicitly.
The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place, which is a prerequisite for upgrade testing.
This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.
Terraform work includes:
* provides Terraform output, via `terraform output provisioning`, that
is written to JSON and used by the framework to configure provisioning.
* provides Terraform output that can be used by test operators to
configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
multiple times with different versions against the same host.
* adds a constraint to prevent tests from landing on Windows
* improves Terraform output for mixed Windows/Linux clients
* makes some Windows client config fixes from 0.10.2 testing
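The re-runnable deployment scripts mentioned in the list above could follow a pattern like the one below. This is a minimal sketch, not the actual script: the function name `install_nomad`, the `.nomad_version` marker file, and the directory layout are all hypothetical. It records the installed version so that a second run with the same version is a no-op and a run with a different version replaces the binary.

```shell
#!/bin/sh
# Sketch of a re-runnable install step. The marker file name and the
# install directory layout are hypothetical, not the real script's paths.

# install_nomad VERSION BINARY_PATH INSTALL_DIR
# Installs the binary at BINARY_PATH unless VERSION is already recorded.
install_nomad() {
    version="$1"
    binary="$2"
    install_dir="$3"
    marker="$install_dir/.nomad_version"

    # Skip the work if this exact version is already in place.
    if [ -f "$marker" ] && [ "$(cat "$marker")" = "$version" ]; then
        echo "unchanged"
        return 0
    fi

    mkdir -p "$install_dir"
    # Copy to a temporary name first, then rename, so an interrupted run
    # never leaves a half-written binary behind.
    cp "$binary" "$install_dir/nomad.tmp"
    mv "$install_dir/nomad.tmp" "$install_dir/nomad"
    chmod +x "$install_dir/nomad"
    printf '%s' "$version" > "$marker"
    echo "installed $version"
}
```

Running `install_nomad 0.10.2 ./nomad /tmp/nomad-test` twice would print `installed 0.10.2` and then `unchanged`.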
Includes:
* baseline Windows AMI
* initial pass at Terraform configurations
* OpenSSH for Windows
Using OpenSSH is a lot nicer for Nomad developers than winrm would be,
plus it lets us avoid passing around the Windows password in the
clear.
Note that now we're copying up all the provisioning scripts and
configs as a zipped bundle because TF's file provisioner dies in the
middle of pushing up multiple files (whereas `scp -r` works fine).
We're also running all the provisioning scripts inside the userdata by
polling for the zip file to show up (gross!). This is because
`remote-exec` provisioners are failing on Windows with the same symptoms as:
https://github.com/hashicorp/terraform/issues/17728
If we can't fix this, it'll prevent us from having multiple Windows
clients running until TF supports count interpolation in the
`template_file`, which is planned for a later 0.12 release.
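The "poll for the zip file" approach described above amounts to a wait loop with a timeout. The sketch below shows the idea in POSIX shell for illustration only; the real Windows userdata would be PowerShell, and the function name `wait_for_file` and its arguments are assumptions.

```shell
#!/bin/sh
# wait_for_file PATH TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 once PATH exists, or 1 if TIMEOUT_SECONDS elapse first.
wait_for_file() {
    path="$1"
    timeout="$2"
    interval="${3:-1}"
    waited=0

    while [ ! -f "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A loop like this only detects that the file exists, not that the upload has finished, so a real script would likely need an extra completeness check before unpacking.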
Ensure that we're reusing the base configuration between client and
servers without the possibility of drift. Reduce the amount of `sed`
mangling of the configuration file, and make recommended changes from
`shellcheck` for this section of the provisioning script.
Fixes some rebase errors on the Nomad config as well.
Share base configuration for telemetry and consul. Have the server
configurations respect the `var.server_count` config. Make changes
recommended by `shellcheck` in the provisioning scripts for this section.
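One way to share a base configuration without `sed` edits is to keep the common settings in one file and the role-specific settings in another, and copy both into the directory the agent loads. The sketch below assumes that layout; the file names `base.hcl`, `server.hcl`, and `client.hcl` and the function `install_config` are hypothetical. It relies on the Nomad agent merging every file in a directory passed via `-config`.

```shell
#!/bin/sh
# install_config ROLE SRC_DIR DEST_DIR
# Copies the shared base.hcl plus the ROLE-specific file into DEST_DIR.
# All file names here are hypothetical.
install_config() {
    role="$1"
    src_dir="$2"
    dest_dir="$3"

    case "$role" in
        server|client) ;;
        *) echo "unknown role: $role" >&2; return 1 ;;
    esac

    mkdir -p "$dest_dir"
    # Remove stale role files so switching roles cannot leave both behind.
    rm -f "$dest_dir/server.hcl" "$dest_dir/client.hcl"
    # The base file is copied verbatim, so client and server cannot drift.
    cp "$src_dir/base.hcl" "$dest_dir/base.hcl"
    cp "$src_dir/$role.hcl" "$dest_dir/$role.hcl"
}
```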
Switch to OS/arch-tagged release bundles on S3 for compatibility with
adding Windows builds in the near future.
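To illustrate what OS/arch tagging might look like, the sketch below maps `uname`-style values to Go-style OS and architecture names and builds a bundle file name from them. The naming scheme `nomad_<os>_<arch>_<sha>.zip` and all function names are assumptions made for the example; the actual S3 layout is not specified here.

```shell
#!/bin/sh
# normalize_os UNAME_S: map `uname -s` output to a Go-style OS name.
normalize_os() {
    case "$1" in
        Linux) echo linux ;;
        Darwin) echo darwin ;;
        MINGW*|MSYS*|CYGWIN*) echo windows ;;
        *) echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# normalize_arch UNAME_M: map `uname -m` output to a Go-style arch name.
normalize_arch() {
    case "$1" in
        x86_64|amd64) echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        *) echo "unsupported arch: $1" >&2; return 1 ;;
    esac
}

# bundle_name SHA UNAME_S UNAME_M
# Prints a hypothetical bundle file name tagged with OS and arch.
bundle_name() {
    sha="$1"
    os="$(normalize_os "$2")" || return 1
    arch="$(normalize_arch "$3")" || return 1
    echo "nomad_${os}_${arch}_${sha}.zip"
}
```

On a host, this could be called as `bundle_name "$(git rev-parse HEAD)" "$(uname -s)" "$(uname -m)"`.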
Make a clear split between Packer and Terraform provisioning steps:
the scripts in the `packer/linux` directory are run when we build the
AMI, whereas the scripts in the `shared` directory are run at Terraform
provisioning time.
Merging all runtime provisioning scripts into a single script for each
of server/client solves the following:
* Userdata scripts can't take arguments; they can only be templated,
which means we have to do TF escaping in bash/PowerShell scripts.
* TF provisioning scripts race with userdata scripts.
A failing script in a `remote-exec` provisioner's `inline` stanza
won't fail the provisioning step. As a result we continue on to execute
tests against potentially broken deployments, instead of learning that
the provisioning itself failed.
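The behavior is easy to reproduce locally if we assume the `inline` commands are joined into a single script and run without `set -e`, so only the last command's exit status is reported. The helper names `run_inline` and `run_inline_strict` below are hypothetical and only simulate that assumption; they do not call Terraform.

```shell
#!/bin/sh
# run_inline COMMAND...
# Joins the commands into one script and runs it without `set -e`,
# so the exit status is that of the last command only.
run_inline() {
    printf '%s\n' "$@" | sh
}

# run_inline_strict COMMAND...
# Same as run_inline, but prepends `set -e` so the first failing
# command aborts the script with a non-zero status.
run_inline_strict() {
    { echo 'set -e'; printf '%s\n' "$@"; } | sh
}
```

With these helpers, `run_inline 'false' 'echo configured'` prints `configured` and exits 0 even though the first command failed, while `run_inline_strict 'false' 'echo configured'` stops at `false` and exits non-zero.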
When multiple developers are working on e2e testing, it helps to be
able to identify which infrastructure belongs to which Nomad SHA and
which developer. This changeset adds tags to the EC2 instances for
that purpose.