The E2E provisioning used local-exec to call ssh in a for loop in a hacky
workaround https://github.com/hashicorp/terraform/issues/25634, which
prevented remote-exec from working on Windows. Move to a newer version of
Terraform that fixes the remote-exec bug to make provisioning more reliable
and observable.
Note that Windows remote-exec needs to include the `powershell` call itself,
unlike Unix-alike remote-exec.
Split the EBS and EFS tests out into their own test cases:
* EBS exercises the Controller RPCs, including the create/snapshot workflow.
* EFS exercises only the Node RPCs, and assumes we have an existing volume
that gets registered, rather than created.
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
* fix periodic
* update periodic to not use template
nomad job inspect no longer returns an apiliststub so the required fields to query job summary are no longer there, parse cli output instead
* rm tmp makefile entry
* fix typo
* revert makefile change
This PR enables jobs configured with a custom sidecar_task to make
use of the `service.expose` feature for creating checks on services
in the service mesh. Before we would check that sidecar_task had not
been set (indicating that something other than envoy may be in use,
which would not support envoy's expose feature). However Consul has
not added support for anything other than envoy and probably never
will, so having the restriction in place seems like an unnecessary
hindrance. If Consul ever does support something other than Envoy,
they will likely find a way to provide the expose feature anyway.
Fixes#9854
Ensure that the e2e clusters are isolated and never attempt to autojoin
with another e2e cluster.
This ensures that each cluster servers have a unique `ConsulAutoJoin`,
to be used for discovery.
This PR makes two ergonomics changes, meant to get e2e builds more reproducible and ease changes.
### AMI Management
First, we pin the server AMIs to the commits associated with the build. No more using the latest AMI a developer build in a test branch, or accidentally using a stale AMI because we forgot to build one! Packer is to tag the AMI images with the commit sha used to generate the image, and then Terraform would look up only the AMIs associated with that sha. To minimize churn, we use the SHA associated with the latest Packer configurations, rather than SHA of all.
This has few benefits: reproducibility and avoiding accidental AMI changes and contamination of changes across branches. Also, the change is a stepping stone to an e2e pipeline that builds new AMIs automatically if Packer files changed.
The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, commits), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'll be open for iterating on that later if it proves to be an issue.
### Config Files and Packer
Second, this PR moves e2e config hcl management to Terraform instead of Packer. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer and changes are ignored. This current behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied. Having Terraform manage them would ease engineer's iteration. Also, make Packer management more consistent (Packer only works `e2e/terraform/packer`), and easing the logic for AMI change detection.
The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
* Prevent Job Statuses from being calculated twice
https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion iwth job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job regisration eval insetion is
atomic with the registration this check is no longer necessary to set
the job statuses correctly.
* test to ensure only single job event for job register
* periodic e2e
* separate job update summary step
* fix updatejobstability to use copy instead of modified reference of job
* update envoygatewaybindaddresses copy to prevent job diff on null vs empty
* set ConsulGatewayBindAddress to empty map instead of nil
fix nil assertions for empty map
rm unnecessary guard
Terraform v0.14 is producing a lockfile after running `terraform init`.
The docs suggest we should include this file in the git repository:
> You should include this file in your version control repository so
> that you can discuss potential changes to your external dependencies
> via code review, just as you would discuss potential changes to your
> configuration itself.
Sounds similar to go.sum
https://www.terraform.io/docs/configuration/dependency-lock.html#lock-file-location
Our dnsmasq configuration needs host-specific data that we can't configure in
the AMI build. But configuring this in userdata leads to a race between
userdata execution, docker.service startup, and dnsmasq.service startup. So
rather than letting dnsmasq come up with incorrect configuration and then
modifying it after the fact, do the configuration in the service's prestart,
and have it kick off a Docker restart when we're done.
The cloud-init configuration runs on boot, which can result in a race
condition between that and service startup. This has caused provisioning
failures because Nomad expects the userdata to have configured a host volume
directory. Diagnosing this was also compounded by a warning being fired by
systemd for the Nomad unit file.
* Update the location of the `StartLimitIntervalSec` field to it's
post-systemd-230 location.
* Ensure that the weekly AMI build is up-to-date to reduce the risk of
unexpected system software changes.
* Move the host volume to a directory we can set up at AMI build time rather
than in userdata.
Small changes to the Windows 2016 Packer build for debuggability of
provisioning:
* improve verbosity of powershell error handling
* remove unused "tools" installation
* use ssh communicator for Packer to improve Packer build times and eliminate
deprecated winrm remote access (unavailable from current macOS)
The `nomad_sha`, `nomad_version`, and `nomad_local_binary` variables for the
Nomad provisioning module assumed that only one would be set. By having the
override each other with an explicit precedence, it makes it easier to avoid
problems with Terraform's implicit variables behavior.
Set the expected default values in the `terraform.full.tfvars` to avoid
shadowing by any future changes to the `terraform.tfvars` file.
Update the Makefile to put the `-var` and `-var-file` in the correct order.
The base Ubuntu AMI modifies apt sources during cloud-init. But the Packer
build can potentially start the setup script before that work is done,
resulting in errors trying to install base system dependencies like
`dnsmasq`. Delay the setup long enough to lose the race with cloud-init.
We intend to expand the nightly E2E test to cover multiple distros and
platforms. Change the naming structure for "Linux client" to the more precise
"Ubuntu Bionic", and "Windows" to "Windows 2016" to make it easier to add new
targets without additional refactoring.
Most of the time that a human is running the TF provisioning, they want the
"dev cluster" which is going to deploy an OSS sha, with fewer targets and
configuration alternatives. But the default `terraform.tfvars` is the nightly
E2E run. Because the nightly run is automated, there's no reason we can't have
it pick a non-default `terraform.full.tfvars` file and have the default be the
dev cluster.
When uploading a local binary for provisioning, the location that we pass into
the provisioning script needs to be where we uploaded it to, not the source on
our laptop. Also, the null_resource for uploading needs to read in the private
key, not its path.
Instead of hard-coding the base AMI for our Packer image for Ubuntu, use the
latest from Canonical so that we always have their current kernel patches.
For everyday developer use, we don't need volumes for testing CSI. Providing a
flag to opt-in speeds up deploying dev clusters and slightly reduces infra costs.
Skip CSI test if missing volume specs.
Provisions vault with the policies described in the Nomad Vault integration
guide, and drops a configuration file for Nomad vault server configuration
with its token. The vault root token is exposed to the E2E runner so that
tests can write additional policies to vault.
The `-var-file` flag for loading variables into Terraform overlays the default
variables file if present. This means that variables that are set in the
default variables file will take precedence if the overlay file does not have
them set.
Set `nomad_acls` and `nomad_enteprise` to `false` in the dev cluster.
Adds a `nomad_acls` flag to our Terraform stack that bootstraps Nomad ACLs via
a `local-exec` provider. There's no way to set the `NOMAD_TOKEN` in the Nomad
TF provider if we're bootstrapping in the same Terraform stack, so instead of
using `resource.nomad_acl_token`, we also bootstrap a wide-open anonymous
policy. The resulting management token is exported as an environment var with
`$(terraform output environment)` and tests that want stricter ACLs will be
able to write them using that token.
This should also provide a basis to do similar work with Consul ACLs in the
future.
Newer EC2 instances are both cheaper and have generally better
performance.
The dnsmasq configuration had a hard-coded interface name, so in order to
accomodate instances with more recent networking that result in so-called
predictable interface names, the dnsmasq configuration needs to be replaced at
runtime with userdata to select the default interface.