The connect tests are very disruptive: restart consul/nomad agents with new
tokens. The test seems particularly flaky, failing 32 times out of 73 in my
sample.
The tests are particularly problematic because they are disruptive and affect
other tests. On failure, the nomad or consul agent on the client can get into a
wedged state, so health/deployment info in subsequent tests may be wrong. In
some cases, the node will be deemed as fail, and then the subsequent tests may
fail when the node is deemed lost and the test allocations get migrated unexpectedly.
The nodedrain deadline test asserts that all allocations are migrated by the
deadline. However, when the deadline is short (e.g. 10s), the test may fail
because of scheduler/client-propagation delays.
In one failing test, it took ~15s from the RPC call to the moment to the moment
the scheduler issued migration update, and then 3 seconds for the alloc to be
stopped.
Here, I increase the timeouts to avoid such false positives.
Increase the timeout for vaultsecrets. As the default interval is 0.1s, 10
retries mean it only retries for one second, a very short time for some waiting
scenarios in the test (e.g. starting allocs, etc).
Prefer testutil.WaitForResultRetries that emits more descriptive errors on
failures. `require.Evatually` fails with opaque "Condition never satisfied"
error message.
This is an attempt at deflaking the e2e exec tests, and a way to improve
messages.
e2e occasionally fail with "unexpected EOF" even though the exec output matches
expectations. I suspect there is a race in handling EOF in server/http handling.
Here, we special case this error and ensures we get all failures,
to help debug the case better.
This PR makes two ergonomics changes, meant to get e2e builds more reproducible and ease changes.
### AMI Management
First, we pin the server AMIs to the commits associated with the build. No more using the latest AMI a developer build in a test branch, or accidentally using a stale AMI because we forgot to build one! Packer is to tag the AMI images with the commit sha used to generate the image, and then Terraform would look up only the AMIs associated with that sha. To minimize churn, we use the SHA associated with the latest Packer configurations, rather than SHA of all.
This has few benefits: reproducibility and avoiding accidental AMI changes and contamination of changes across branches. Also, the change is a stepping stone to an e2e pipeline that builds new AMIs automatically if Packer files changed.
The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, commits), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'll be open for iterating on that later if it proves to be an issue.
### Config Files and Packer
Second, this PR moves e2e config hcl management to Terraform instead of Packer. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer and changes are ignored. This current behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied. Having Terraform manage them would ease engineer's iteration. Also, make Packer management more consistent (Packer only works `e2e/terraform/packer`), and easing the logic for AMI change detection.
The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.
https://www.consul.io/docs/connect/gateways/terminating-gateway
These gateways are declared as part of a task group level service
definition within the connect stanza.
service {
connect {
gateway {
proxy {
// envoy proxy configuration
}
terminating {
// terminating-gateway configuration entry
}
}
}
}
Currently Envoy is the only supported gateway implementation in
Consul. The gateay task can be customized by configuring the
connect.sidecar_task block.
When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.
Gateways require Consul 1.8.0+, checked by a node constraint.
Closes#9445
* Prevent Job Statuses from being calculated twice
https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion iwth job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job regisration eval insetion is
atomic with the registration this check is no longer necessary to set
the job statuses correctly.
* test to ensure only single job event for job register
* periodic e2e
* separate job update summary step
* fix updatejobstability to use copy instead of modified reference of job
* update envoygatewaybindaddresses copy to prevent job diff on null vs empty
* set ConsulGatewayBindAddress to empty map instead of nil
fix nil assertions for empty map
rm unnecessary guard
After submitting an update, the test ought to wait until the new
allocations are placed. Previously, we'd use the original to-be-stopped
allocations and the test fails when attempting to exec.
Deflake namespace e2e test by only asserting on jobs related to the
namespace tests. During our e2e tests, some left over jobs (e.g.
prometheus) are left running while being shutdown and cause the test to
fail.
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.
Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.
Fixes#9834
We directly parse job files in e2eutil, but currently using jobspec
package. Instead, use the Parse method from the jobspec2 package so
we can parse job files with new features.
Terraform v0.14 is producing a lockfile after running `terraform init`.
The docs suggest we should include this file in the git repository:
> You should include this file in your version control repository so
> that you can discuss potential changes to your external dependencies
> via code review, just as you would discuss potential changes to your
> configuration itself.
Sounds similar to go.sum
https://www.terraform.io/docs/configuration/dependency-lock.html#lock-file-location
Our dnsmasq configuration needs host-specific data that we can't configure in
the AMI build. But configuring this in userdata leads to a race between
userdata execution, docker.service startup, and dnsmasq.service startup. So
rather than letting dnsmasq come up with incorrect configuration and then
modifying it after the fact, do the configuration in the service's prestart,
and have it kick off a Docker restart when we're done.
* use full name for events
use evaluation and allocation instead of short name
* update api event stream package and shortnames
* update docs
* make sync; fix typo
* backwards compat not from 1.0.0-beta event stream api changes
* use api types instead of string
* rm backwards compat note that only changed between prereleases
* remove backwards incompat that only existed in prereleases
Add the ingress gateway example from the noamd connect examples
to the e2e Connect suite. Includes the ACLs enabled version,
which means the nomad server consul acl policy will require
operator=write permission.
The clean up in #8908 inadvertently caused the output from the scripts
involved in the Consul ACL bootstrap process to be printed as a big blob
of bytes, which is slightly less useful than the text version.
* e2e/csi: wait longer for plugins to become healthy
Plugins are Docker containers, and as such sometimes we get delays in startup
due to pulling from the registry and this is a source of test flakiness. Give
the plugins a little longer to start up.
* e2e/csi: version bump for AWS EBS plugins
The cloud-init configuration runs on boot, which can result in a race
condition between that and service startup. This has caused provisioning
failures because Nomad expects the userdata to have configured a host volume
directory. Diagnosing this was also compounded by a warning being fired by
systemd for the Nomad unit file.
* Update the location of the `StartLimitIntervalSec` field to it's
post-systemd-230 location.
* Ensure that the weekly AMI build is up-to-date to reduce the risk of
unexpected system software changes.
* Move the host volume to a directory we can set up at AMI build time rather
than in userdata.
Assert that deregistering a volume works without errors following a volume
reap. Use CLI helpers where feasible to exercise CSI command line. Dump plugin
allocation logs on deregistration failures for debugging purposes.
Small changes to the Windows 2016 Packer build for debuggability of
provisioning:
* improve verbosity of powershell error handling
* remove unused "tools" installation
* use ssh communicator for Packer to improve Packer build times and eliminate
deprecated winrm remote access (unavailable from current macOS)
The `nomad_sha`, `nomad_version`, and `nomad_local_binary` variables for the
Nomad provisioning module assumed that only one would be set. By having the
override each other with an explicit precedence, it makes it easier to avoid
problems with Terraform's implicit variables behavior.
Set the expected default values in the `terraform.full.tfvars` to avoid
shadowing by any future changes to the `terraform.tfvars` file.
Update the Makefile to put the `-var` and `-var-file` in the correct order.
The base Ubuntu AMI modifies apt sources during cloud-init. But the Packer
build can potentially start the setup script before that work is done,
resulting in errors trying to install base system dependencies like
`dnsmasq`. Delay the setup long enough to lose the race with cloud-init.
We intend to expand the nightly E2E test to cover multiple distros and
platforms. Change the naming structure for "Linux client" to the more precise
"Ubuntu Bionic", and "Windows" to "Windows 2016" to make it easier to add new
targets without additional refactoring.
Most of the time that a human is running the TF provisioning, they want the
"dev cluster" which is going to deploy an OSS sha, with fewer targets and
configuration alternatives. But the default `terraform.tfvars` is the nightly
E2E run. Because the nightly run is automated, there's no reason we can't have
it pick a non-default `terraform.full.tfvars` file and have the default be the
dev cluster.
Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as
the `artifact.destination` and `template.destination` because we would always
append the destination to the task working directory. In the recent security
patch we treated the `destination` absolute path as valid if it didn't escape
the working directory, but this breaks backwards compatibility and
interpolation of `destination` fields.
This changeset partially reverts the behavior so that we always append the
destination, but we also perform the escape check on that new destination
after interpolation so the security hole is closed.
Also, ConsulTemplate test should exercise interpolation
The `$NOMAD_SECRETS_DIR` environment variable is rendered as `/secrets`, which
prior to the recent security patch would unintentionally escape the file
sandbox and get dropped in a directory named `/secrets` where the Nomad client
binary was running. The `VaultSecrets` test was accidentally relying on this
behavior and that causes the test to fail.
When uploading a local binary for provisioning, the location that we pass into
the provisioning script needs to be where we uploaded it to, not the source on
our laptop. Also, the null_resource for uploading needs to read in the private
key, not its path.
* adds two base event stream e2e tests
test evaluation filter keys are included
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* gc aftereach
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The spread test is infrequently flaky and it's hard to extract what's actually
happening. If the test fails, dump all the allocation metrics so that we can
debug the behavior.
Assert that we get at least N task events, rather than exactly N. When a
task within an allocation dies, a sibling task can get an Allocation Unhealthy
event after it's also killed, even though it's not the origin of the event.
The `e2ejob` utility asserts that a job is running for 5s, but with a sleep
time of 5s, the networking job can race with that check. Sleeping for a longer
period should guarantee that we're running long enough to pass the assert.
Also constrains the job to Linux because our Windows test targets don't yet
support Docker (LCOW), and expand the set of DCs we can safely land on.
The autorevert test checks for reverted allocations to be placed and running
before checking the deployment status, but the deployment can be completed and
marked "successful" before we check it for "running" status. Instead, just
wait for it to be marked "successful" and assert we have the expected count of
deployment statuses.
Instead of hard-coding the base AMI for our Packer image for Ubuntu, use the
latest from Canonical so that we always have their current kernel patches.
For everyday developer use, we don't need volumes for testing CSI. Providing a
flag to opt-in speeds up deploying dev clusters and slightly reduces infra costs.
Skip CSI test if missing volume specs.
Add service discovery integration to the existing consul-template E2E test,
and verify both service and key updates force re-rendering. Fixes flakiness by
using the longer default wait config we use elsewhere.
Removes our last direct dependency on gomega.
Provisions vault with the policies described in the Nomad Vault integration
guide, and drops a configuration file for Nomad vault server configuration
with its token. The vault root token is exposed to the E2E runner so that
tests can write additional policies to vault.
The `-var-file` flag for loading variables into Terraform overlays the default
variables file if present. This means that variables that are set in the
default variables file will take precedence if the overlay file does not have
them set.
Set `nomad_acls` and `nomad_enteprise` to `false` in the dev cluster.
Until we have LCOW support in the E2E environment (which requires a Windows
2019 test target), we need to constrain E2E tests to the appropriate kernel
The E2E framework wraps testify's `require` so that by default we can stop
tests on errors, but the cleanup functions should use `assert` so that we
continue to try to cleanup the test environment even if there's a failure.
Exercises host volume and Docker volume functionality for the `exec` and `docker`
task driver, particularly around mounting locations within the container and
how this can be used with `template`.