This PR makes two ergonomics changes, meant to get e2e builds more reproducible and ease changes.
### AMI Management
First, we pin the server AMIs to the commits associated with the build. No more using the latest AMI a developer build in a test branch, or accidentally using a stale AMI because we forgot to build one! Packer is to tag the AMI images with the commit sha used to generate the image, and then Terraform would look up only the AMIs associated with that sha. To minimize churn, we use the SHA associated with the latest Packer configurations, rather than SHA of all.
This has few benefits: reproducibility and avoiding accidental AMI changes and contamination of changes across branches. Also, the change is a stepping stone to an e2e pipeline that builds new AMIs automatically if Packer files changed.
The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, commits), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'll be open for iterating on that later if it proves to be an issue.
### Config Files and Packer
Second, this PR moves e2e config hcl management to Terraform instead of Packer. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer and changes are ignored. This current behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied. Having Terraform manage them would ease engineer's iteration. Also, make Packer management more consistent (Packer only works `e2e/terraform/packer`), and easing the logic for AMI change detection.
The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.
https://www.consul.io/docs/connect/gateways/terminating-gateway
These gateways are declared as part of a task group level service
definition within the connect stanza.
service {
connect {
gateway {
proxy {
// envoy proxy configuration
}
terminating {
// terminating-gateway configuration entry
}
}
}
}
Currently Envoy is the only supported gateway implementation in
Consul. The gateay task can be customized by configuring the
connect.sidecar_task block.
When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.
Gateways require Consul 1.8.0+, checked by a node constraint.
Closes#9445
* Prevent Job Statuses from being calculated twice
https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion iwth job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job regisration eval insetion is
atomic with the registration this check is no longer necessary to set
the job statuses correctly.
* test to ensure only single job event for job register
* periodic e2e
* separate job update summary step
* fix updatejobstability to use copy instead of modified reference of job
* update envoygatewaybindaddresses copy to prevent job diff on null vs empty
* set ConsulGatewayBindAddress to empty map instead of nil
fix nil assertions for empty map
rm unnecessary guard
After submitting an update, the test ought to wait until the new
allocations are placed. Previously, we'd use the original to-be-stopped
allocations and the test fails when attempting to exec.
Deflake namespace e2e test by only asserting on jobs related to the
namespace tests. During our e2e tests, some left over jobs (e.g.
prometheus) are left running while being shutdown and cause the test to
fail.
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.
Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.
Fixes#9834
We directly parse job files in e2eutil, but currently using jobspec
package. Instead, use the Parse method from the jobspec2 package so
we can parse job files with new features.
Terraform v0.14 is producing a lockfile after running `terraform init`.
The docs suggest we should include this file in the git repository:
> You should include this file in your version control repository so
> that you can discuss potential changes to your external dependencies
> via code review, just as you would discuss potential changes to your
> configuration itself.
Sounds similar to go.sum
https://www.terraform.io/docs/configuration/dependency-lock.html#lock-file-location
Our dnsmasq configuration needs host-specific data that we can't configure in
the AMI build. But configuring this in userdata leads to a race between
userdata execution, docker.service startup, and dnsmasq.service startup. So
rather than letting dnsmasq come up with incorrect configuration and then
modifying it after the fact, do the configuration in the service's prestart,
and have it kick off a Docker restart when we're done.
* use full name for events
use evaluation and allocation instead of short name
* update api event stream package and shortnames
* update docs
* make sync; fix typo
* backwards compat not from 1.0.0-beta event stream api changes
* use api types instead of string
* rm backwards compat note that only changed between prereleases
* remove backwards incompat that only existed in prereleases
Add the ingress gateway example from the noamd connect examples
to the e2e Connect suite. Includes the ACLs enabled version,
which means the nomad server consul acl policy will require
operator=write permission.
The clean up in #8908 inadvertently caused the output from the scripts
involved in the Consul ACL bootstrap process to be printed as a big blob
of bytes, which is slightly less useful than the text version.
* e2e/csi: wait longer for plugins to become healthy
Plugins are Docker containers, and as such sometimes we get delays in startup
due to pulling from the registry and this is a source of test flakiness. Give
the plugins a little longer to start up.
* e2e/csi: version bump for AWS EBS plugins
The cloud-init configuration runs on boot, which can result in a race
condition between that and service startup. This has caused provisioning
failures because Nomad expects the userdata to have configured a host volume
directory. Diagnosing this was also compounded by a warning being fired by
systemd for the Nomad unit file.
* Update the location of the `StartLimitIntervalSec` field to it's
post-systemd-230 location.
* Ensure that the weekly AMI build is up-to-date to reduce the risk of
unexpected system software changes.
* Move the host volume to a directory we can set up at AMI build time rather
than in userdata.
Assert that deregistering a volume works without errors following a volume
reap. Use CLI helpers where feasible to exercise CSI command line. Dump plugin
allocation logs on deregistration failures for debugging purposes.
Small changes to the Windows 2016 Packer build for debuggability of
provisioning:
* improve verbosity of powershell error handling
* remove unused "tools" installation
* use ssh communicator for Packer to improve Packer build times and eliminate
deprecated winrm remote access (unavailable from current macOS)
The `nomad_sha`, `nomad_version`, and `nomad_local_binary` variables for the
Nomad provisioning module assumed that only one would be set. By having the
override each other with an explicit precedence, it makes it easier to avoid
problems with Terraform's implicit variables behavior.
Set the expected default values in the `terraform.full.tfvars` to avoid
shadowing by any future changes to the `terraform.tfvars` file.
Update the Makefile to put the `-var` and `-var-file` in the correct order.
The base Ubuntu AMI modifies apt sources during cloud-init. But the Packer
build can potentially start the setup script before that work is done,
resulting in errors trying to install base system dependencies like
`dnsmasq`. Delay the setup long enough to lose the race with cloud-init.
We intend to expand the nightly E2E test to cover multiple distros and
platforms. Change the naming structure for "Linux client" to the more precise
"Ubuntu Bionic", and "Windows" to "Windows 2016" to make it easier to add new
targets without additional refactoring.
Most of the time that a human is running the TF provisioning, they want the
"dev cluster" which is going to deploy an OSS sha, with fewer targets and
configuration alternatives. But the default `terraform.tfvars` is the nightly
E2E run. Because the nightly run is automated, there's no reason we can't have
it pick a non-default `terraform.full.tfvars` file and have the default be the
dev cluster.
Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as
the `artifact.destination` and `template.destination` because we would always
append the destination to the task working directory. In the recent security
patch we treated the `destination` absolute path as valid if it didn't escape
the working directory, but this breaks backwards compatibility and
interpolation of `destination` fields.
This changeset partially reverts the behavior so that we always append the
destination, but we also perform the escape check on that new destination
after interpolation so the security hole is closed.
Also, ConsulTemplate test should exercise interpolation
The `$NOMAD_SECRETS_DIR` environment variable is rendered as `/secrets`, which
prior to the recent security patch would unintentionally escape the file
sandbox and get dropped in a directory named `/secrets` where the Nomad client
binary was running. The `VaultSecrets` test was accidentally relying on this
behavior and that causes the test to fail.
When uploading a local binary for provisioning, the location that we pass into
the provisioning script needs to be where we uploaded it to, not the source on
our laptop. Also, the null_resource for uploading needs to read in the private
key, not its path.
* adds two base event stream e2e tests
test evaluation filter keys are included
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* gc aftereach
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The spread test is infrequently flaky and it's hard to extract what's actually
happening. If the test fails, dump all the allocation metrics so that we can
debug the behavior.
Assert that we get at least N task events, rather than exactly N. When a
task within an allocation dies, a sibling task can get an Allocation Unhealthy
event after it's also killed, even though it's not the origin of the event.
The `e2ejob` utility asserts that a job is running for 5s, but with a sleep
time of 5s, the networking job can race with that check. Sleeping for a longer
period should guarantee that we're running long enough to pass the assert.
Also constrains the job to Linux because our Windows test targets don't yet
support Docker (LCOW), and expand the set of DCs we can safely land on.
The autorevert test checks for reverted allocations to be placed and running
before checking the deployment status, but the deployment can be completed and
marked "successful" before we check it for "running" status. Instead, just
wait for it to be marked "successful" and assert we have the expected count of
deployment statuses.
Instead of hard-coding the base AMI for our Packer image for Ubuntu, use the
latest from Canonical so that we always have their current kernel patches.
For everyday developer use, we don't need volumes for testing CSI. Providing a
flag to opt-in speeds up deploying dev clusters and slightly reduces infra costs.
Skip CSI test if missing volume specs.
Add service discovery integration to the existing consul-template E2E test,
and verify both service and key updates force re-rendering. Fixes flakiness by
using the longer default wait config we use elsewhere.
Removes our last direct dependency on gomega.
Provisions vault with the policies described in the Nomad Vault integration
guide, and drops a configuration file for Nomad vault server configuration
with its token. The vault root token is exposed to the E2E runner so that
tests can write additional policies to vault.
The `-var-file` flag for loading variables into Terraform overlays the default
variables file if present. This means that variables that are set in the
default variables file will take precedence if the overlay file does not have
them set.
Set `nomad_acls` and `nomad_enteprise` to `false` in the dev cluster.
Until we have LCOW support in the E2E environment (which requires a Windows
2019 test target), we need to constrain E2E tests to the appropriate kernel
The E2E framework wraps testify's `require` so that by default we can stop
tests on errors, but the cleanup functions should use `assert` so that we
continue to try to cleanup the test environment even if there's a failure.
Exercises host volume and Docker volume functionality for the `exec` and `docker`
task driver, particularly around mounting locations within the container and
how this can be used with `template`.
Adds a `nomad_acls` flag to our Terraform stack that bootstraps Nomad ACLs via
a `local-exec` provider. There's no way to set the `NOMAD_TOKEN` in the Nomad
TF provider if we're bootstrapping in the same Terraform stack, so instead of
using `resource.nomad_acl_token`, we also bootstrap a wide-open anonymous
policy. The resulting management token is exported as an environment var with
`$(terraform output environment)` and tests that want stricter ACLs will be
able to write them using that token.
This should also provide a basis to do similar work with Consul ACLs in the
future.
Newer EC2 instances are both cheaper and have generally better
performance.
The dnsmasq configuration had a hard-coded interface name, so in order to
accomodate instances with more recent networking that result in so-called
predictable interface names, the dnsmasq configuration needs to be replaced at
runtime with userdata to select the default interface.
The conditional around some of the rescheduling tests was backwards, where we
were waiting for allocations to be rescheduled but testing for a count of
0. The test was passing but flaky because if the check happened quickly enough
before the scheduler rescheduled the allocations, it would pass.
Have Terraform run the target-specific `provision.sh`/`provision.ps1` script
rather than the test runner code which needs to be customized for each
distro. Use Terraform's detection of variable value changes so that we can
re-run the provisioning without having to re-install Nomad on those specific
hosts that need it changed.
Allow the configuration "profile" (well-known directory) to be set by a
Terraform variable. The default configurations are installed during Packer
build time, and symlinked into the live configuration directory by the
provision script. Detect changes in the file contents so that we only upload
custom configuration files that have changed between Terraform runs
* remove outdated references to envchain in documentation
* add new host volume locations in userdata
* don't exit the entire script during provisioning, just return
The CLI helpers in the rescheduling test were intended for shared use, but
until some other tests were written we didn't want to waste time making them
generic. This changeset refactors them and adds some new helpers associated
with the node drain tests (under separate PR).
The rescheduling test workloads were created before we had Windows targets in
the E2E nightly run. When these were recently ported to the e2e framework they
were missing the constraint to Linux machines.
Also added a little extra time to polling to avoid some flakiness on the first
run, and a minor readability adjustment to the job names.
Ports the rescheduling tests (which aren't running in CI) into the current
test framework so that they're run on nightly, and exercises the new CLI
helpers.
The E2E suite exercises the API, but not the CLI. This changeset adds a helper
function to send commands via a locally-built Nomad binary (which we'll need
to add to the E2E setup), and some helpers to parse the resulting structured
outputs in a way that tests can consume.
When running the Fabio and Prometheus jobs for the metrics suite
it seems the outer directory is required in the call when
registering the job.
error: "e2e/input/fabio.nomad: no such file or directory"
This changeset stages upcoming E2E provisioning improvements work. It splits
the existing shared configuration directory into 3 profiles:
* "full-cluster": the set of configurations currently in use
* "dev-cluster": a simplified set of mostly existing configurations that
weren't in use.
* "custom": an empty profile for developers to keep non-standard
configurations during complex feature development.
The tooling to switch between profiles will be in a later changeset.
Also drops some unused configuration knobs from the provisioning scripts to
make the next stage of work easier.
Our provisioning process for E2E doesn't require the `depends_on` fields to be
set for client instances, so dropping that field allows all instances to be
started in parallel.
We don't use the extra EBS volumes (they aren't even mounted), so remove them
to reduce costs.
The `-recursor` flag in the Consul service unit files is specific to a given
cloud, but we already have cloud-specific configuration files. Consolidate all
the cloud-specific items into the config.
As we add new Linux targets for E2E, the existing setup.sh script will be used
only for Ubuntu. Rather than have the service and config files echo'd from the
script, move them into files we upload so they can be reused.
Includes some general noise reduction in the setup.sh script and removal of
unused bits.
This changeset moves the installation of Nomad binaries out of the
provisioning framework and into scripts that are installed on the remote host
during AMI builds.
This provides a few advantages:
* The provisioning framework can be reduced in scope (with the goal of moving
most of it into the Terraform stack entirely).
* The scripts can be arbitrarily complex if we don't have to stuff them into
ssh commands, so it's easier to make them idempotent. In this changeset, the
scripts check the version of the existing binary and don't re-download when
using the `--nomad_sha` or `--nomad_version` flags.
* The scripts can be OS/distro specific, which helps in building new test
targets.
Just a smattering of attempted improvements as I read through this
again. Some of my goals:
- Tried to add more high level info to the intro to set the context
- Clarify the difference between *test* dev and *agent* dev workflows
- Add -timeout to provisioning step because cable Internet is lol
Controller plugins that land on the same node will collide over their CSI
`mount_dir`, so give them enough room in our tests that they don't land on the
same host.
Also, version bump the EBS node plugins to match the controllers.
By default, Docker containers get /etc/resolv.conf bound into the container
with the localhost entry stripped out. In order to resolve using the host's
dnsmasq, we need to make sure the container uses the docker0 IP as its
nameserver and that dnsmasq is listening on that port and forwarding to either
the AWS VPC DNS (so that we can query private resources like EFS) or to the
Consul DNS.
Adds 2 tests around Connect Native. Both make use of the example connect native
services in https://github.com/hashicorp/nomad-connect-examples
One of them runs without Consul ACLs enabled, the other with.
The `nomad volume deregister` command currently returns an error if the volume
has any claims, but in cases where the claims can't be dropped because of
plugin errors, providing a `-force` flag gives the operator an escape hatch.
If the volume has no allocations or if they are all terminal, this flag
deletes the volume from the state store, immediately and implicitly dropping
all claims without further CSI RPCs. Note that this will not also
unmount/detach the volume, which we'll make the responsibility of a separate
`nomad volume detach` command.
* initial setup for terrform to install podman task driver
podman
* Update e2e provisioning to support root podman
Excludes setup for rootless podman. updates source ami to ubuntu 18.04
Installs podman and configures podman varlink
base podman test
ensure client status running
revert terraform directory changes
* back out random go-discover go mod change
* include podman varlink docs
* address comments
We have been using fatih/hclfmt which is long abandoned. Instead, switch
to HashiCorp's own hclfmt implementation. There are some trivial changes in
behavior around whitespace.
There have been a number of bug fixes and features particularly around
Connect that will help us in Nomad's e2e tests. Upgrade Consul in our
packer builder so e2e can make use of the new version.
This changeset:
* adds eval status to the error messages emitted when we have
placement failure in tests. The implementation here isn't quite
perfect but it's a lot better than "condition not met".
* enforces the ordering of teardown of the CSI test
* doesn't pass the purge flag to one of the two CSI tests, so that we
exercise both code paths.
Issue #7523 documents the Consul ACLs used in each Consul interface
used by Nomad. Minimize the policies used in e2e tests so that we
are setting a good example.
This changeset provides two basic e2e tests for CSI plugins targeting
common AWS use cases.
The EBS test launches the EBS plugin (controller + nodes) and registers
an EBS volume as a Nomad CSI volume. We deploy a job that writes to
the volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The EFS test launches the EFS plugin (nodes-only) and registers an EFS
volume as a Nomad CSI volume. We deploy a job that writes to the
volume, stop that job, and reuse the volume for another job which
should be able to read the data written by the first job.
The writer jobs mount the CSI volume at a location within the alloc
dir.
This changeset adds volumes but does not mount them to instances so
that we can test the mounting ("staging") via CSI plugins. The CSI
plugins themselves will be installed as Nomad jobs.
In order to ensure we can always mount the EFS volume, this changeset
pins the deployment of the cluster to a specific subnet. In future
work we should spread the cluster out among several AZs and test that
behavior explicitly.
Golang 1.13 introduced a change in test flag parsing:
> testing
> ...
> Testing flags are now registered in the new Init function, which is invoked by the generated main function for the test. As a result, testing flags are now only registered when running a test binary, and packages that call flag.Parse during package initialization may cause tests to fail.
https://golang.org/doc/go1.13#testing
Here, we ensure that e2e framework parsing occur in TestMain, by only
initializing Framework at Run invocation.
go vet would have prevented the bug fixed in
6362e32161295fa959ebe46b93cea0ea1a9bdd72 but our use of errgroup
prevented that.
Rip out errgroup to take advantage of vet, and remove download limiting
now that we're downloading far fewer binaries overall.
Pretty sure Consul / Nomad clients are often not ready yet after
the ConsulACLs test disables ACLs, by the time the next test starts
running.
Running locally things tend to work, but in TeamCity this seems to
be a recurring problem. However, when running locally sometimes I do
see that the "show status" step after disabling ACLs, some nodes are
still initializing, suggesting we're right on the border of not waiting
long enough
nomad node status
ID DC Name Class Drain Eligibility Status
0e4dfce2 dc1 EC2AMAZ-JB3NF9P <none> false eligible ready
6b90aa06 dc2 ip-172-31-16-225 <none> false eligible ready
7068558a dc2 ip-172-31-20-143 <none> false eligible ready
e0ae3c5c dc1 ip-172-31-25-165 <none> false eligible ready
15b59ed6 dc1 ip-172-31-23-199 <none> false eligible initializing
Going to try waiting a full 2 minutes after disabling ACLs, hopefully that
will help things Just Work. In the future, we should probably be parsing the
output of the status checks and actually confirming all nodes are ready.
Even better, maybe that's something shipyard will have built-in.
Go implicitly treats files ending with `_linux.go` as build tagged for
Linux only. This broke the e2e provisioning framework on macOS once we
tried importing it into the `e2e/consulacls` module.
This changeset improves the ergonomics of running the Nomad e2e test
provisioning process by defaulting to a blank `nomad_sha` in the
Terraform configuration. By default, a user will now need to pass in
one of the Nomad version flags. But they won't have to manually edit
the `provisioning.json` file for the common case of deploying a
released version of Nomad, and won't need to put dummy values for
`nomad_sha`.
Includes general documentation improvements.
This test is causing panics. Unlike the other similar tests, this
one is using require.Eventually which is doing something bad, and
this change replaces it with a for-loop like the other tests.
Failure:
=== RUN TestE2E/Connect
=== RUN TestE2E/Connect/*connect.ConnectE2ETest
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed
goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL github.com/hashicorp/nomad/e2e 21.427s
Provide script for managing Consul ACLs on a TF provisioned cluster for
e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs,
and automatically takes care of the bootstrapping process if necessary.
The bootstrapping process takes a long time, so we may need to
extend the overall e2e timeout (20 minutes seems fine).
Introduces basic tests for Consul Connect with ACLs.
This change allows for providing the -suite=<Name> flag when
running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component
will be run, and all ther suites will be skipped.
Fixes a bug introduced in 0aa58b9 where we're writing a test file to
a taskdir-interpolated location, which works when we `alloc exec` but
not in the jobspec for a group script check.
This changeset also makes the test safe to run multiple times by
namespacing the file with the alloc ID, which has the added bonus of
exercising our alloc interpolation code for group script checks.
The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.
This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.
Terraform work includes:
* provides Terraform output that written to JSON used by the framework
to configure provisioning via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
multiple times w/ different versions against the same host.