open-nomad/e2e/terraform
Tim Gross 9f05d62338
E2E with HCP Consul/Vault (#12267)
Use HCP Consul and HCP Vault for the Consul and Vault clusters used in E2E testing. This has the following benefits:

* Without the need to support mTLS bootstrapping for Consul and Vault, we can simplify the mTLS configuration by leaning on Terraform instead of janky bash shell scripting.
* Vault bootstrapping is no longer required, so we can eliminate even more janky shell scripting
* Our E2E exercises HCP, which is important to us as an organization
* With the reduction in configurability, we can simplify the Terraform configuration and drop the complicated `provision.sh`/`provision.ps1` scripts we were using previously. We can template Nomad configuration files and upload them with the `file` provisioner.
* Packer builds for Linux and Windows become much simpler.

tl;dr way less janky shell scripting!
2022-03-18 09:27:28 -04:00

Terraform infrastructure

This folder contains Terraform resources for provisioning a Nomad cluster on EC2 instances on AWS to use as the target of end-to-end tests.

Terraform provisions the AWS infrastructure, assuming that the EC2 AMIs have already been built via Packer and that the HCP Consul and HCP Vault clusters are already running. It deploys a build of Nomad from your local machine along with configuration files.

Setup

You'll need a recent version of Terraform (1.1+ recommended), as well as AWS credentials to create the Nomad cluster and credentials for HCP. This Terraform stack assumes that an appropriate instance role has been configured elsewhere and that you have the ability to AssumeRole into the AWS account.

Configure the following environment variables. For HashiCorp Nomad developers, this configuration can be found in 1Password in the Nomad team's vault under nomad-e2e.

export HCP_CLIENT_ID=
export HCP_CLIENT_SECRET=
export CONSUL_HTTP_TOKEN=
export CONSUL_HTTP_ADDR=
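
Because a missing variable tends to surface only partway through a Terraform run, it can help to check the environment first. The following is a minimal sketch, not part of this repository: the function name `check_e2e_env` is hypothetical, and it only checks the four variables listed above.

```shell
# Hypothetical helper: report which of the required variables are unset
# or empty in the environment that child processes (like terraform) see.
check_e2e_env() {
    local missing=0 name
    for name in HCP_CLIENT_ID HCP_CLIENT_SECRET CONSUL_HTTP_TOKEN CONSUL_HTTP_ADDR; do
        # printenv only sees exported variables, which is what Terraform needs
        if [ -z "$(printenv "$name")" ]; then
            echo "missing or empty: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_e2e_env && terraform apply` would then stop before Terraform starts if anything is missing.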

The Vault admin token will expire after 6 hours. If you haven't created one already, use the separate Terraform configuration found in the hcp-vault-auth directory. The following commands set the correct values for VAULT_TOKEN, VAULT_ADDR, and VAULT_NAMESPACE:

cd ./hcp-vault-auth
terraform apply --auto-approve
$(terraform output --raw environment)

Optionally, edit the terraform.tfvars file to change the number of Linux clients or Windows clients.

region                           = "us-east-1"
instance_type                    = "t2.medium"
server_count                     = "3"
client_count_ubuntu_bionic_amd64 = "4"
client_count_windows_2016_amd64  = "1"

Optionally, edit the nomad_local_binary variable in the terraform.tfvars file to change the path to the local binary of Nomad you'd like to upload.
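
If you would rather not edit terraform.tfvars directly, one alternative is an override file. This is a sketch under assumptions: Terraform automatically loads files ending in `.auto.tfvars` after terraform.tfvars, but the file name `local.auto.tfvars` and the helper `write_client_overrides` are hypothetical and not part of this repository.

```shell
# Hypothetical helper: write client-count overrides into DIR/local.auto.tfvars.
# Terraform loads *.auto.tfvars after terraform.tfvars, so these values win.
write_client_overrides() {
    local dir="$1" linux_count="$2" windows_count="$3"
    cat > "$dir/local.auto.tfvars" <<EOF
client_count_ubuntu_bionic_amd64 = "$linux_count"
client_count_windows_2016_amd64  = "$windows_count"
EOF
}
```

For example, `write_client_overrides e2e/terraform 2 0` would request two Linux clients and no Windows clients on the next apply.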

Run terraform apply to deploy the infrastructure:

cd e2e/terraform/
terraform apply

Note: You will likely see "Connection refused" or "Permission denied" errors in the logs when the provisioning step run by Terraform reaches an instance where the ssh service isn't ready yet. This is expected, and the connections will be retried. Windows instances in particular can take a few minutes before ssh is ready.

Also note: When ACLs are being bootstrapped, you may see "No cluster leader" in the output several times while the ACL bootstrap script polls the cluster, waiting for it to start and elect a leader.
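
Both notes describe the same pattern: keep retrying a command until the remote side is ready. The sketch below illustrates that pattern in general terms. It is not the script this repository uses; the `retry` function and its arguments are hypothetical.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

With this helper, something like `retry 30 10 nomad server members` would poll a freshly started cluster every 10 seconds for up to 30 attempts, and the intermediate failures would be as harmless as the errors described above.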

Configuration

The files in etc are template configuration files for Nomad and the Consul agent. Terraform will render these files to the uploads folder and upload them to the cluster during provisioning.

  • etc/nomad.d contains the Nomad configuration files.
    • base.hcl, tls.hcl, consul.hcl, and vault.hcl are shared.
    • server-linux.hcl, client-linux.hcl, and client-windows.hcl are role and platform specific.
    • client-linux-0.hcl, etc. are specific to individual instances.
  • etc/consul.d contains the Consul agent configuration files.
  • etc/acls contains the ACL policy files for Consul and Vault.

Outputs

After deploying the infrastructure, you can get connection information about the cluster:

  • $(terraform output --raw environment) will set your current shell's NOMAD_ADDR and CONSUL_HTTP_ADDR to point to one of the cluster's server nodes, and set the NOMAD_E2E variable.
  • terraform output servers will output the list of server node IPs.
  • terraform output linux_clients will output the list of Linux client node IPs.
  • terraform output windows_clients will output the list of Windows client node IPs.
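
The `$(...)` form in the first item runs the command's output as a shell command, so it relies on that output being `export` statements. The sketch below shows the mechanism with a stand-in function. The function `fake_environment_output` and every value in it are placeholders, and the exact format of the real output is an assumption.

```shell
# Stand-in for `terraform output --raw environment`.
# The export-per-line format and all values are assumed placeholders.
fake_environment_output() {
    cat <<'EOF'
export NOMAD_ADDR=https://203.0.113.10:4646
export CONSUL_HTTP_ADDR=https://consul.example.invalid
export NOMAD_E2E=1
EOF
}

# The unquoted substitution is word-split into a single `export` command
# with several NAME=value arguments, which the current shell then runs.
$(fake_environment_output)
echo "NOMAD_ADDR is now $NOMAD_ADDR"
```

Because the substitution is unquoted, values containing spaces or glob characters would not survive this form. Using `eval "$(...)"` instead is a common, more robust variant.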

SSH

You can use the Terraform outputs above to access nodes via ssh:

ssh -i keys/nomad-e2e-*.pem ubuntu@${EC2_IP_ADDR}

The Windows client runs OpenSSH for convenience, but it has a different user and will drop you into a PowerShell session instead of bash:

ssh -i keys/nomad-e2e-*.pem Administrator@${EC2_IP_ADDR}
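
Since the two commands differ only in the user name, a small wrapper can pick the right one. This is a convenience sketch only: the function name `e2e_ssh_command` and the `linux`/`windows` labels are hypothetical, while the user names and key path come from the commands above.

```shell
# Hypothetical helper: print the ssh command for a node.
# Usage: e2e_ssh_command linux|windows IP_ADDRESS
e2e_ssh_command() {
    local platform="$1" ip="$2" user
    case "$platform" in
        linux)   user="ubuntu" ;;
        windows) user="Administrator" ;;
        *) echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    # The key glob is printed literally; it expands when the command is run
    printf 'ssh -i keys/nomad-e2e-*.pem %s@%s\n' "$user" "$ip"
}
```

The function only prints the command, so you can inspect it before running it from e2e/terraform/.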

Teardown

The Terraform state file tracks every resource created by this stack, so a single terraform destroy tears all of it down:

cd e2e/terraform/
terraform destroy

FAQ

E2E Provisioning Goals

  1. The provisioning process should be able to run a nightly build against a variety of OS targets.
  2. The provisioning process should be able to support update-in-place tests. (See #7063)
  3. A developer should be able to quickly stand up a small E2E cluster and provision it with a version of Nomad they've built on their laptop. The developer should be able to send updated builds to that cluster with a short iteration time, rather than having to rebuild the cluster.

Why not just drop all the provisioning into the AMI?

While that's the "correct" production approach for cloud infrastructure, it creates a few pain points for testing:

  • Creating a Linux AMI takes >10min, and creating a Windows AMI can take 15-20min. This interferes with goal (3) above.
  • We won't be able to do in-place upgrade testing without having an in-place provisioning process anyway. This interferes with goal (2) above.

Why not just drop all the provisioning into the user data?

  • Userdata is executed on boot, which prevents using it for in-place upgrade testing.
  • Userdata scripts are not very observable and it's painful to determine whether they've failed or simply haven't finished yet before trying to run tests.