The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.
This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.
Terraform work includes:
* provides Terraform output that written to JSON used by the framework
to configure provisioning via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
multiple times w/ different versions against the same host.
This changeset is part of the work to improve our E2E provisioning
process to allow our upgrade tests:
* Move more of the setup into the AMI image creation so it's a little
more obvious to provisioning config authors which bits are essential
to deploying a specific version of Nomad.
* Make the service file update do a systemd daemon-reload so that we
can update an already-running cluster with the same script we use to
deploy it initially.
When multiple Connect-enabled task groups start on the same client
node, a race condition in the CNI plugins for creating iptables chains
causes one of the tasks to fail. We upstreamed a patch to CNI plugins
to make iptables chain creation idempotent.
This changeset updates end-to-end testing, development tooling, and
documentation to use 0.8.4 which includes our patch.
Adds Windows targets to the client/allocs metrics tests. Removes the
`allocstats` test, which covers less than these tests and is now
redundant.
Adds a firewall rule to our Windows instances so that the prometheus
server can scrape the Nomad HTTP API for metrics.
* Adds a constraint to prevent tests from landing on Windows
* Improve Terraform output for mixed windows/linux clients
* Makes some Windows client config fixes from 0.10.2 testing
Includes:
* baseline Windows AMI
* initial pass at Terraform configurations
* OpenSSH for Windows
Using OpenSSH is a lot nicer for Nomad developers than winrm would be,
plus it lets us avoid passing around the Windows password in the
clear.
Note that now we're copying up all the provisioning scripts and
configs as a zipped bundle because TF's file provisioner dies in the
middle of pushing up multiple files (whereas `scp -r` works fine).
We're also running all the provisioning scripts inside the userdata by
polling for the zip file to show up (gross!). This is because
`remote-exec` provisioners are failing on Windows with the same symptoms as:
https://github.com/hashicorp/terraform/issues/17728
If we can't fix this, it'll prevent us from having multiple Windows
clients running until TF supports count interpolation in the
`template_file`, which is planned for a later 0.12 release.
Ensure that we're reusing the base configuration between client and
servers without the possibility of drift. Reduce the amount of `sed`
mangling of the configuration file, and make recommended changes from
`shellcheck` for this section of the provisioning script.
Fixes some rebase errors on the Nomad config as well.
Share base configuration for telemetry and consul. Have the server
configurations respect the `var.server_count` config. Make changes
recommended by `shellcheck` in the provisioning scripts for this section.
Switch to OS/arch-tagged release bundles on S3 for compatibility with
adding Windows builds in the near future.
Match the configuration directory layout we're using for Consul and
other services. Make recommended changes from `shellcheck` for this
section of the provisioning script.
Update the Consul and Vault configs to take advantage of their
included `go-sockaddr` library for getting the IP addresses we need in
a portable way. This particularly avoids problems with "predictable"
interface names provided by systemd.
Also adds the `sockaddr` binary to the Packer build so we can use it
in our provisioning scripts.
Make a clear split between Packer and Terraform provisioning steps:
the scripts in the `packer/linux` directory are run when we build the
AMI whereas the stuff in shared are run at Terraform provisioning time.
Merging all runtime provisioning scripts into a single script for each
of server/client solves the following:
* Userdata scripts can't take arguments, they can only be templated
and that means we have to do TF escaping in bash/powershell scripts.
* TF provisioning scripts race with userdata scripts.
A failing script in a `remote-exec` provisioner's `inline` stanza
won't fail the provisioning step. This lets us continue on to execute
tests against potentially broken deployments, rather than letting us
know the provisioning itself failed.
Nomad on the packer image will be overwritten by the sha specified in
the TF var, but including a base version on the packer image makes the
image valid for independent use.
The systemd configs spread across our repo were fairly out of sync. This
should get them on our best practices.
The deployment guide also had some strange things like running Nomad as
a non-root user. It would be fine for servers but completely breaks
clients. For simplicity I simply removed the non-root user references.