Our E2E test environment is deployed with mTLS, but it's impractical
for us to use mTLS in headless browsers for automated testing (or even
in manual testing). Provide certificates for proxying the web UI via
Nginx. This proxy uses client certs for proxying to the HTTP endpoint
and a self-signed cert for the browser-facing endpoint. We can accept
certificate errors in the automated tests we'll be adding in the next
step of this work.
While working on infrastructure for testing the UI in E2E, we needed
to upgrade the certificate provider. Performing a provider upgrade via
`terraform init -upgrade` brought in updates for the file and AWS
providers as well. These updates include deprecating the use of
`sensitive_content` fields, removing CA algorithm parameters that can
be inferred from keys, and removing the requirement to manually
specify AWS assume role parameters in the provider config if they're
available in the calling environment's AWS config file (as they are
via doormat or our E2E environment).
Many of our scripts have a non-portable interpreter line for bash and
use bash-specific variables like `BASH_SOURCE`. Update the interpreter
line to be portable between various Linuxes and macOS without
complaint from POSIX shell users.
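
As a rough illustration of the interpreter-line change, here is a minimal Go sketch. It assumes the non-portable line is exactly `#!/bin/bash` and that the portable form is `#!/usr/bin/env bash`; neither is spelled out above, and `portableShebang` is a hypothetical helper, not part of this change.

```go
package main

import (
	"fmt"
	"strings"
)

// portableShebang rewrites a first line of exactly "#!/bin/bash" (an assumed
// non-portable form) to "#!/usr/bin/env bash". Any other first line, including
// shebangs that pass arguments to bash, is returned unchanged, because env
// does not split extra arguments consistently across platforms.
func portableShebang(script string) string {
	first, rest, found := strings.Cut(script, "\n")
	if strings.TrimRight(first, " \t\r") != "#!/bin/bash" {
		return script
	}
	if !found {
		return "#!/usr/bin/env bash"
	}
	return "#!/usr/bin/env bash\n" + rest
}

func main() {
	fmt.Print(portableShebang("#!/bin/bash\necho \"${BASH_SOURCE[0]}\"\n"))
}
```

The body of the script is left untouched, so bash-specific variables such as `BASH_SOURCE` keep working because the script still runs under bash.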
Concurrent E2E runs can collide when provisioning policies on HCP
Consul and HCP Vault. Namespace these by the test run name, as we do
for most everything else.
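
The namespacing idea can be sketched in Go as below. The `policyName` helper, the `-` separator, and the example names are assumptions for illustration only; the actual naming is done in the E2E provisioning configuration.

```go
package main

import "fmt"

// policyName prefixes a base policy name with the test run name so that two
// concurrent runs never try to create a policy with the same name.
// The "<run>-<base>" layout is a hypothetical convention.
func policyName(runName, base string) string {
	return fmt.Sprintf("%s-%s", runName, base)
}

func main() {
	// Two runs provisioning the "same" policy end up with distinct names.
	fmt.Println(policyName("run-a", "nomad-server"))
	fmt.Println(policyName("run-b", "nomad-server"))
}
```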
Use HCP Consul and HCP Vault for the Consul and Vault clusters used in E2E testing. This has the following benefits:
* Without the need to support mTLS bootstrapping for Consul and Vault, we can simplify the mTLS configuration by leaning on Terraform instead of janky bash shell scripting.
* Vault bootstrapping is no longer required, so we can eliminate even more janky shell scripting.
* Our E2E exercises HCP, which is important to us as an organization
* With the reduction in configurability, we can simplify the Terraform configuration and drop the complicated `provision.sh`/`provision.ps1` scripts we were using previously. We can template Nomad configuration files and upload them with the `file` provisioner.
* Packer builds for Linux and Windows become much simpler.
tl;dr way less janky shell scripting!
The `Metrics` suite uses Prometheus to scrape Nomad metrics so that
we're testing the full user experience of extracting metrics from
Nomad. With the addition of mTLS, we need to make sure Prometheus also
has mTLS configuration because the metrics endpoint is protected.
Update the Nomad client configuration and Prometheus job to bind-mount
the client's certs into the task so that the job can use these certs
to scrape the server. This is a temporary solution that gets the job
passing; we should give the job its own certificates (issued by
Vault?) when we've done some of the infrastructure rework we'd like.
This allows us to spin up e2e clusters with mTLS configured for all HashiCorp services, i.e. Nomad, Consul, and Vault. I used it for testing #11089.
mTLS is disabled by default. I have not updated the Windows provisioning scripts yet; Windows already lacked ACL support before this change. I intend to follow up on both in another round.
Make it easier to spin up a cluster whose binaries are fetched from
arbitrary URLs. These could be CircleCI `build-binaries` job artifacts or
presigned S3 URLs.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Add a new driver capability: RemoteTasks.
When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.
This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.
See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
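
The propagation described above can be sketched roughly as follows. The types here are simplified, hypothetical stand-ins defined only for this example; they are not Nomad's real structs, and the helper names `recordHandle` and `replacementFor` are assumptions.

```go
package main

import "fmt"

// Capabilities, TaskHandle, TaskState and Allocation are simplified
// stand-ins for illustration only.
type Capabilities struct{ RemoteTasks bool }

type TaskHandle struct{ DriverState []byte }

type TaskState struct{ TaskHandle *TaskHandle }

type Allocation struct {
	ID         string
	TaskStates map[string]*TaskState
}

// recordHandle stores the handle in the allocation's task state, but only
// when the driver advertises the RemoteTasks capability.
func recordHandle(caps Capabilities, alloc *Allocation, task string, h *TaskHandle) {
	if !caps.RemoteTasks {
		return
	}
	if alloc.TaskStates == nil {
		alloc.TaskStates = map[string]*TaskState{}
	}
	alloc.TaskStates[task] = &TaskState{TaskHandle: h}
}

// replacementFor builds a replacement allocation (for example after a node
// goes down or drains) that inherits any recorded task handles.
func replacementFor(prev *Allocation, newID string) *Allocation {
	next := &Allocation{ID: newID, TaskStates: map[string]*TaskState{}}
	for name, ts := range prev.TaskStates {
		if ts != nil && ts.TaskHandle != nil {
			next.TaskStates[name] = &TaskState{TaskHandle: ts.TaskHandle}
		}
	}
	return next
}

func main() {
	alloc := &Allocation{ID: "alloc-1"}
	handle := &TaskHandle{DriverState: []byte("remote-task-id")}
	recordHandle(Capabilities{RemoteTasks: true}, alloc, "web", handle)

	next := replacementFor(alloc, "alloc-2")
	fmt.Println(string(next.TaskStates["web"].TaskHandle.DriverState))
}
```

Because the replacement allocation carries the same handle, a driver could reattach to the remote task instead of starting a new one.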
The E2E provisioning used local-exec to call ssh in a for loop as a hacky
workaround for https://github.com/hashicorp/terraform/issues/25634, which
prevented remote-exec from working on Windows. Move to a newer version of
Terraform that fixes the remote-exec bug to make provisioning more reliable
and observable.
Note that Windows remote-exec needs to include the `powershell` call itself,
unlike Unix-alike remote-exec.
Split the EBS and EFS tests out into their own test cases:
* EBS exercises the Controller RPCs, including the create/snapshot workflow.
* EFS exercises only the Node RPCs, and assumes we have an existing volume
that gets registered, rather than created.
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
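
A minimal sketch of how the suffix might be applied is shown below. `VolumeRequest` is a simplified stand-in type and `volumeSourceID` is a hypothetical helper; only the `PerAlloc` field name and the `[0]` suffix format come from the description above.

```go
package main

import "fmt"

// VolumeRequest is a simplified stand-in for a job's volume request.
type VolumeRequest struct {
	Source   string
	PerAlloc bool
}

// volumeSourceID returns the volume ID to look up for a given allocation
// index: the exact source ID, or the source ID plus an "[index]" suffix
// when PerAlloc is set.
func volumeSourceID(req VolumeRequest, allocIndex int) string {
	if req.PerAlloc {
		return fmt.Sprintf("%s[%d]", req.Source, allocIndex)
	}
	return req.Source
}

func main() {
	req := VolumeRequest{Source: "data", PerAlloc: true}
	for i := 0; i < 3; i++ {
		fmt.Println(volumeSourceID(req, i))
	}
}
```

Under these assumptions, both the scheduler's feasibility check and the client's volume claim would derive the same ID from the request and the allocation index.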
* fix periodic
* update periodic to not use template
`nomad job inspect` no longer returns an API list stub, so the fields required to query the job summary are no longer there; parse the CLI output instead
* rm tmp makefile entry
* fix typo
* revert makefile change
This PR enables jobs configured with a custom sidecar_task to make
use of the `service.expose` feature for creating checks on services
in the service mesh. Before we would check that sidecar_task had not
been set (indicating that something other than envoy may be in use,
which would not support envoy's expose feature). However Consul has
not added support for anything other than envoy and probably never
will, so having the restriction in place seems like an unnecessary
hindrance. If Consul ever does support something other than Envoy,
they will likely find a way to provide the expose feature anyway.
Fixes #9854
Ensure that the e2e clusters are isolated and never attempt to autojoin
with another e2e cluster.
This ensures that each cluster's servers have a unique `ConsulAutoJoin`
value to be used for discovery.
This PR makes two ergonomics changes, meant to make e2e builds more reproducible and changes easier.
### AMI Management
First, we pin the server AMIs to the commits associated with the build. No more using the latest AMI a developer built in a test branch, or accidentally using a stale AMI because we forgot to build one! Packer tags the AMIs with the commit SHA used to generate the image, and Terraform then looks up only the AMIs associated with that SHA. To minimize churn, we use the SHA associated with the latest Packer configurations, rather than the SHA of the latest commit overall.
This has a few benefits: reproducibility, and avoiding accidental AMI changes and contamination of changes across branches. Also, the change is a stepping stone to an e2e pipeline that builds new AMIs automatically if Packer files changed.
The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, commits), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'm open to iterating on that later if it proves to be an issue.
### Config Files and Packer
Second, this PR moves e2e config HCL management to Terraform instead of Packer. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer and changes are ignored. This current behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied. Having Terraform manage them makes iteration easier for engineers. It also makes Packer management more consistent (Packer only works in `e2e/terraform/packer`) and simplifies the logic for AMI change detection.
The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
* Prevent Job Statuses from being calculated twice
https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion with job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job registration eval insertion is
atomic with the registration, this check is no longer necessary to set
the job statuses correctly.
* test to ensure only single job event for job register
* periodic e2e
* separate job update summary step
* fix updatejobstability to use copy instead of modified reference of job
* update envoygatewaybindaddresses copy to prevent job diff on null vs empty
* set ConsulGatewayBindAddress to empty map instead of nil
fix nil assertions for empty map
rm unnecessary guard
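
The nil-versus-empty map issue behind the last few items can be illustrated with a small Go sketch. The `Gateway` type, its `BindAddresses` field, and the helper names are simplified assumptions for this example, not the actual Nomad structs.

```go
package main

import (
	"fmt"
	"reflect"
)

// Gateway is a simplified stand-in for a gateway config with a bind-address map.
type Gateway struct {
	BindAddresses map[string]string
}

// Copy returns a deep copy and always allocates the map, so a copy of a
// gateway with an empty map never comes back with a nil map.
func (g *Gateway) Copy() *Gateway {
	out := &Gateway{BindAddresses: make(map[string]string, len(g.BindAddresses))}
	for k, v := range g.BindAddresses {
		out.BindAddresses[k] = v
	}
	return out
}

// sameAddresses compares two gateways the way a naive diff would.
func sameAddresses(a, b *Gateway) bool {
	return reflect.DeepEqual(a.BindAddresses, b.BindAddresses)
}

func main() {
	empty := &Gateway{BindAddresses: map[string]string{}}
	withNil := &Gateway{} // BindAddresses is nil

	// A nil map and an empty map are not deeply equal, which is what shows
	// up as a spurious diff.
	fmt.Println(sameAddresses(empty, withNil))

	// Copying through Copy keeps the map empty rather than nil.
	fmt.Println(sameAddresses(empty, empty.Copy()))
}
```

Working on the copy also leaves the original untouched, which is the same concern as using a copy of the job rather than a modified reference.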