Our dnsmasq configuration needs host-specific data that we can't configure in
the AMI build. But configuring this in userdata leads to a race between
userdata execution, docker.service startup, and dnsmasq.service startup. So
rather than letting dnsmasq come up with incorrect configuration and then
modifying it after the fact, do the configuration in the service's prestart,
and have it kick off a Docker restart when we're done.
The cloud-init configuration runs on boot, which can result in a race
condition between that and service startup. This has caused provisioning
failures because Nomad expects the userdata to have configured a host volume
directory. Diagnosing this was also compounded by a warning being fired by
systemd for the Nomad unit file.
* Update the location of the `StartLimitIntervalSec` field to it's
post-systemd-230 location.
* Ensure that the weekly AMI build is up-to-date to reduce the risk of
unexpected system software changes.
* Move the host volume to a directory we can set up at AMI build time rather
than in userdata.
Small changes to the Windows 2016 Packer build for debuggability of
provisioning:
* improve verbosity of powershell error handling
* remove unused "tools" installation
* use ssh communicator for Packer to improve Packer build times and eliminate
deprecated winrm remote access (unavailable from current macOS)
The `nomad_sha`, `nomad_version`, and `nomad_local_binary` variables for the
Nomad provisioning module assumed that only one would be set. By having the
override each other with an explicit precedence, it makes it easier to avoid
problems with Terraform's implicit variables behavior.
Set the expected default values in the `terraform.full.tfvars` to avoid
shadowing by any future changes to the `terraform.tfvars` file.
Update the Makefile to put the `-var` and `-var-file` in the correct order.
The base Ubuntu AMI modifies apt sources during cloud-init. But the Packer
build can potentially start the setup script before that work is done,
resulting in errors trying to install base system dependencies like
`dnsmasq`. Delay the setup long enough to lose the race with cloud-init.
We intend to expand the nightly E2E test to cover multiple distros and
platforms. Change the naming structure for "Linux client" to the more precise
"Ubuntu Bionic", and "Windows" to "Windows 2016" to make it easier to add new
targets without additional refactoring.
Most of the time that a human is running the TF provisioning, they want the
"dev cluster" which is going to deploy an OSS sha, with fewer targets and
configuration alternatives. But the default `terraform.tfvars` is the nightly
E2E run. Because the nightly run is automated, there's no reason we can't have
it pick a non-default `terraform.full.tfvars` file and have the default be the
dev cluster.
When uploading a local binary for provisioning, the location that we pass into
the provisioning script needs to be where we uploaded it to, not the source on
our laptop. Also, the null_resource for uploading needs to read in the private
key, not its path.
Instead of hard-coding the base AMI for our Packer image for Ubuntu, use the
latest from Canonical so that we always have their current kernel patches.
For everyday developer use, we don't need volumes for testing CSI. Providing a
flag to opt-in speeds up deploying dev clusters and slightly reduces infra costs.
Skip CSI test if missing volume specs.
Provisions vault with the policies described in the Nomad Vault integration
guide, and drops a configuration file for Nomad vault server configuration
with its token. The vault root token is exposed to the E2E runner so that
tests can write additional policies to vault.
The `-var-file` flag for loading variables into Terraform overlays the default
variables file if present. This means that variables that are set in the
default variables file will take precedence if the overlay file does not have
them set.
Set `nomad_acls` and `nomad_enteprise` to `false` in the dev cluster.
Adds a `nomad_acls` flag to our Terraform stack that bootstraps Nomad ACLs via
a `local-exec` provider. There's no way to set the `NOMAD_TOKEN` in the Nomad
TF provider if we're bootstrapping in the same Terraform stack, so instead of
using `resource.nomad_acl_token`, we also bootstrap a wide-open anonymous
policy. The resulting management token is exported as an environment var with
`$(terraform output environment)` and tests that want stricter ACLs will be
able to write them using that token.
This should also provide a basis to do similar work with Consul ACLs in the
future.
Newer EC2 instances are both cheaper and have generally better
performance.
The dnsmasq configuration had a hard-coded interface name, so in order to
accomodate instances with more recent networking that result in so-called
predictable interface names, the dnsmasq configuration needs to be replaced at
runtime with userdata to select the default interface.
Have Terraform run the target-specific `provision.sh`/`provision.ps1` script
rather than the test runner code which needs to be customized for each
distro. Use Terraform's detection of variable value changes so that we can
re-run the provisioning without having to re-install Nomad on those specific
hosts that need it changed.
Allow the configuration "profile" (well-known directory) to be set by a
Terraform variable. The default configurations are installed during Packer
build time, and symlinked into the live configuration directory by the
provision script. Detect changes in the file contents so that we only upload
custom configuration files that have changed between Terraform runs
* remove outdated references to envchain in documentation
* add new host volume locations in userdata
* don't exit the entire script during provisioning, just return
This changeset stages upcoming E2E provisioning improvements work. It splits
the existing shared configuration directory into 3 profiles:
* "full-cluster": the set of configurations currently in use
* "dev-cluster": a simplified set of mostly existing configurations that
weren't in use.
* "custom": an empty profile for developers to keep non-standard
configurations during complex feature development.
The tooling to switch between profiles will be in a later changeset.
Also drops some unused configuration knobs from the provisioning scripts to
make the next stage of work easier.
Our provisioning process for E2E doesn't require the `depends_on` fields to be
set for client instances, so dropping that field allows all instances to be
started in parallel.
We don't use the extra EBS volumes (they aren't even mounted), so remove them
to reduce costs.
The `-recursor` flag in the Consul service unit files is specific to a given
cloud, but we already have cloud-specific configuration files. Consolidate all
the cloud-specific items into the config.
As we add new Linux targets for E2E, the existing setup.sh script will be used
only for Ubuntu. Rather than have the service and config files echo'd from the
script, move them into files we upload so they can be reused.
Includes some general noise reduction in the setup.sh script and removal of
unused bits.
This changeset moves the installation of Nomad binaries out of the
provisioning framework and into scripts that are installed on the remote host
during AMI builds.
This provides a few advantages:
* The provisioning framework can be reduced in scope (with the goal of moving
most of it into the Terraform stack entirely).
* The scripts can be arbitrarily complex if we don't have to stuff them into
ssh commands, so it's easier to make them idempotent. In this changeset, the
scripts check the version of the existing binary and don't re-download when
using the `--nomad_sha` or `--nomad_version` flags.
* The scripts can be OS/distro specific, which helps in building new test
targets.
By default, Docker containers get /etc/resolv.conf bound into the container
with the localhost entry stripped out. In order to resolve using the host's
dnsmasq, we need to make sure the container uses the docker0 IP as its
nameserver and that dnsmasq is listening on that port and forwarding to either
the AWS VPC DNS (so that we can query private resources like EFS) or to the
Consul DNS.
* initial setup for terrform to install podman task driver
podman
* Update e2e provisioning to support root podman
Excludes setup for rootless podman. updates source ami to ubuntu 18.04
Installs podman and configures podman varlink
base podman test
ensure client status running
revert terraform directory changes
* back out random go-discover go mod change
* include podman varlink docs
* address comments
There have been a number of bug fixes and features particularly around
Connect that will help us in Nomad's e2e tests. Upgrade Consul in our
packer builder so e2e can make use of the new version.
This changeset provides two basic e2e tests for CSI plugins targeting
common AWS use cases.
The EBS test launches the EBS plugin (controller + nodes) and registers
an EBS volume as a Nomad CSI volume. We deploy a job that writes to
the volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The EFS test launches the EFS plugin (nodes-only) and registers an EFS
volume as a Nomad CSI volume. We deploy a job that writes to the
volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The writer jobs mount the CSI volume at a location within the alloc
dir.
This changeset adds volumes but does not mount them to instances so
that we can test the mounting ("staging") via CSI plugins. The CSI
plugins themselves will be installed as Nomad jobs.
In order to ensure we can always mount the EFS volume, this changeset
pins the deployment of the cluster to a specific subnet. In future
work we should spread the cluster out among several AZs and test that
behavior explicitly.
This changeset improves the ergonomics of running the Nomad e2e test
provisioning process by defaulting to a blank `nomad_sha` in the
Terraform configuration. By default, a user will now need to pass in
one of the Nomad version flags. But they won't have to manually edit
the `provisioning.json` file for the common case of deploying a
released version of Nomad, and won't need to put dummy values for
`nomad_sha`.
Includes general documentation improvements.
The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.
This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.
Terraform work includes:
* provides Terraform output that written to JSON used by the framework
to configure provisioning via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
multiple times w/ different versions against the same host.
This changeset is part of the work to improve our E2E provisioning
process to allow our upgrade tests:
* Move more of the setup into the AMI image creation so it's a little
more obvious to provisioning config authors which bits are essential
to deploying a specific version of Nomad.
* Make the service file update do a systemd daemon-reload so that we
can update an already-running cluster with the same script we use to
deploy it initially.
When multiple Connect-enabled task groups start on the same client
node, a race condition in the CNI plugins for creating iptables chains
causes one of the tasks to fail. We upstreamed a patch to CNI plugins
to make iptables chain creation idempotent.
This changeset updates end-to-end testing, development tooling, and
documentation to use 0.8.4 which includes our patch.
Adds Windows targets to the client/allocs metrics tests. Removes the
`allocstats` test, which covers less than these tests and is now
redundant.
Adds a firewall rule to our Windows instances so that the prometheus
server can scrape the Nomad HTTP API for metrics.
* Adds a constraint to prevent tests from landing on Windows
* Improve Terraform output for mixed windows/linux clients
* Makes some Windows client config fixes from 0.10.2 testing
Includes:
* baseline Windows AMI
* initial pass at Terraform configurations
* OpenSSH for Windows
Using OpenSSH is a lot nicer for Nomad developers than winrm would be,
plus it lets us avoid passing around the Windows password in the
clear.
Note that now we're copying up all the provisioning scripts and
configs as a zipped bundle because TF's file provisioner dies in the
middle of pushing up multiple files (whereas `scp -r` works fine).
We're also running all the provisioning scripts inside the userdata by
polling for the zip file to show up (gross!). This is because
`remote-exec` provisioners are failing on Windows with the same symptoms as:
https://github.com/hashicorp/terraform/issues/17728
If we can't fix this, it'll prevent us from having multiple Windows
clients running until TF supports count interpolation in the
`template_file`, which is planned for a later 0.12 release.
Ensure that we're reusing the base configuration between client and
servers without the possibility of drift. Reduce the amount of `sed`
mangling of the configuration file, and make recommended changes from
`shellcheck` for this section of the provisioning script.
Fixes some rebase errors on the Nomad config as well.
Share base configuration for telemetry and consul. Have the server
configurations respect the `var.server_count` config. Make changes
recommended by `shellcheck` in the provisioning scripts for this section.
Switch to OS/arch-tagged release bundles on S3 for compatibility with
adding Windows builds in the near future.
Match the configuration directory layout we're using for Consul and
other services. Make recommended changes from `shellcheck` for this
section of the provisioning script.
Update the Consul and Vault configs to take advantage of their
included `go-sockaddr` library for getting the IP addresses we need in
a portable way. This particularly avoids problems with "predictable"
interface names provided by systemd.
Also adds the `sockaddr` binary to the Packer build so we can use it
in our provisioning scripts.
Make a clear split between Packer and Terraform provisioning steps:
the scripts in the `packer/linux` directory are run when we build the
AMI whereas the stuff in shared are run at Terraform provisioning time.
Merging all runtime provisioning scripts into a single script for each
of server/client solves the following:
* Userdata scripts can't take arguments, they can only be templated
and that means we have to do TF escaping in bash/powershell scripts.
* TF provisioning scripts race with userdata scripts.
A failing script in a `remote-exec` provisioner's `inline` stanza
won't fail the provisioning step. This lets us continue on to execute
tests against potentially broken deployments, rather than letting us
know the provisioning itself failed.
When multiple developers are working on e2e testing, it helps to be
able to identify which infrastructure belongs to which Nomad SHA and
which developer. This adds tags to the EC2 instances.
Nomad on the packer image will be overwritten by the sha specified in
the TF var, but including a base version on the packer image makes the
image valid for independent use.
The systemd configs spread across our repo were fairly out of sync. This
should get them on our best practices.
The deployment guide also had some strange things like running Nomad as
a non-root user. It would be fine for servers but completely breaks
clients. For simplicity I simply removed the non-root user references.
This adds a message that provides environment setup instructions for
running e2e tests after running terraform apply.
This allows copy/pasting exports, rather than manually constructing
them.