# Terraform infrastructure

This folder contains Terraform resources for provisioning a Nomad
cluster on EC2 instances on AWS to use as the target of end-to-end
tests.

Terraform provisions the AWS infrastructure assuming that EC2 AMIs
have already been built via Packer and that HCP Consul and HCP Vault
clusters are already running. It deploys a build of Nomad from your
local machine along with configuration files.

## Setup

You'll need a recent version of Terraform (1.1+ recommended), as well
as AWS credentials to create the Nomad cluster and credentials for
HCP. This Terraform stack assumes that an appropriate instance role
has been configured elsewhere and that you have the ability to
`AssumeRole` into the AWS account.
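
For example, if you normally authenticate with static credentials, you can
assume a role in the target account with the AWS CLI before running
Terraform. This is only a sketch; the role ARN and session name below are
placeholders rather than values defined by this repository, and many people
instead configure an `AWS_PROFILE` with `role_arn` set in `~/.aws/config`:

```sh
# Assume a role in the target AWS account and export temporary credentials.
# Replace the ARN below with the role for your account.
CREDS=$(aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/nomad-e2e" \
  --role-session-name "nomad-e2e" \
  --output json)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')
```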

Configure the following environment variables. For HashiCorp Nomad
developers, this configuration can be found in 1Pass in the Nomad
team's vault under `nomad-e2e`.

```
export HCP_CLIENT_ID=
export HCP_CLIENT_SECRET=
export CONSUL_HTTP_TOKEN=
export CONSUL_HTTP_ADDR=
```
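
Once these are set, you can sanity-check the Consul values against the HCP
Consul cluster. This is an optional verification step; the Consul CLI reads
`CONSUL_HTTP_ADDR` and `CONSUL_HTTP_TOKEN` from the environment:

```sh
# Lists the HCP Consul servers if the address and token are valid.
consul members
```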

The Vault admin token will expire after 6 hours. If you haven't
created one already, use the separate Terraform configuration found in
the `hcp-vault-auth` directory. The following will set the correct
values for `VAULT_TOKEN`, `VAULT_ADDR`, and `VAULT_NAMESPACE`:

```
cd ./hcp-vault-auth
terraform init
terraform apply --auto-approve
$(terraform output --raw environment)
```
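
You can confirm the resulting token works before moving on. This is an
optional check; the Vault CLI reads `VAULT_ADDR`, `VAULT_TOKEN`, and
`VAULT_NAMESPACE` from the environment:

```sh
# Shows the admin token's metadata, including its remaining TTL.
vault token lookup
```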

Optionally, edit the `terraform.tfvars` file to change the number of
Linux clients or Windows clients.

```hcl
region = "us-east-1"
instance_type = "t2.medium"
server_count = "3"
client_count_ubuntu_jammy_amd64 = "4"
client_count_windows_2016_amd64 = "1"
```

Optionally, edit the `nomad_local_binary` variable in the
`terraform.tfvars` file to change the path to the local binary of
Nomad you'd like to upload.
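
For example (the path here is only illustrative; point it at wherever your
local Linux build of Nomad lives):

```hcl
# terraform.tfvars
nomad_local_binary = "../../pkg/linux_amd64/nomad"
```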

Run `terraform apply` to deploy the infrastructure:

```sh
cd e2e/terraform/
terraform init
terraform apply
```

> Note: You will likely see "Connection refused" or "Permission denied" errors
> in the logs as the provisioning script run by Terraform hits an instance
> where the ssh service isn't yet ready. That's ok and expected; they'll get
> retried. In particular, Windows instances can take a few minutes before ssh
> is ready.
>
> Also note: When ACLs are being bootstrapped, you may see "No cluster
> leader" in the output several times while the ACL bootstrap script
> polls the cluster to start and elect a leader.

## Configuration

The files in `etc` are template configuration files for Nomad and the
Consul agent. Terraform will render these files to the `uploads`
folder and upload them to the cluster during provisioning.

* `etc/nomad.d` are the Nomad configuration files.
  * `base.hcl`, `tls.hcl`, `consul.hcl`, and `vault.hcl` are shared.
  * `server-linux.hcl`, `client-linux.hcl`, and `client-windows.hcl` are role and platform specific.
  * `client-linux-0.hcl`, etc. are specific to individual instances.
* `etc/consul.d` are the Consul agent configuration files.
* `etc/acls` are ACL policy files for Consul and Vault.
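
After an apply, you can inspect what Terraform rendered before it was copied
to the instances (assuming you are still in `e2e/terraform/` and the apply has
already created the `uploads` folder):

```sh
# List the per-role and per-instance configuration files rendered from `etc`.
ls -R uploads/
```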

## Web UI

To access the web UI, deploy a reverse proxy to the cluster. All
clients have a TLS proxy certificate at `/etc/nomad.d/tls_proxy.crt`
and a self-signed cert at `/etc/nomad.d/self_signed.crt`. See
`../ui/input/proxy.nomad` for an example of using this. Deploy it as follows:

```sh
nomad namespace apply proxy
nomad job run ../ui/input/proxy.nomad
```
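
You can check that the proxy allocation is running before looking up its
address (this assumes the job in `proxy.nomad` is named `nomad-proxy`,
matching the query below):

```sh
nomad job status -namespace proxy nomad-proxy
```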

You can get the public IP for the proxy allocation from the following
nested query:

```sh
nomad node status -json -verbose \
  $(nomad operator api '/v1/allocations?namespace=proxy' | jq -r '.[] | select(.JobID == "nomad-proxy") | .NodeID') \
  | jq '.Attributes."unique.platform.aws.public-ipv4"'
```

## Outputs

After deploying the infrastructure, you can get connection information
about the cluster:

- `$(terraform output --raw environment)` will set your current shell's
  `NOMAD_ADDR` and `CONSUL_HTTP_ADDR` to point to one of the cluster's server
  nodes, and set the `NOMAD_E2E` variable, as shown in the example below.
- `terraform output servers` will output the list of server node IPs.
- `terraform output linux_clients` will output the list of Linux
  client node IPs.
- `terraform output windows_clients` will output the list of Windows
  client node IPs.
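
For example, after loading the `environment` output you can talk to the
cluster directly (a quick connectivity check, assuming the cluster has
finished bootstrapping and the environment output includes everything your
shell needs to authenticate):

```sh
# Point NOMAD_ADDR and CONSUL_HTTP_ADDR at the cluster, then verify it's up.
$(terraform output --raw environment)
nomad server members
nomad node status
```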

## SSH

You can use the Terraform outputs above to access nodes via ssh:

```sh
ssh -i keys/nomad-e2e-*.pem ubuntu@${EC2_IP_ADDR}
```
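
For example, to reach the first Linux client without copying the IP by hand (a
sketch that assumes `jq` is installed and that `linux_clients` is a plain list
of public IP addresses):

```sh
ssh -i keys/nomad-e2e-*.pem \
  ubuntu@$(terraform output -json linux_clients | jq -r '.[0]')
```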

The Windows client runs OpenSSH for convenience, but has a different
user and will drop you into a PowerShell session instead of bash:

```sh
ssh -i keys/nomad-e2e-*.pem Administrator@${EC2_IP_ADDR}
```

## Teardown

The Terraform state file stores all the information needed to tear down the
cluster:

```sh
cd e2e/terraform/
terraform destroy
```

## FAQ

#### E2E Provisioning Goals

1. The provisioning process should be able to run a nightly build against a
   variety of OS targets.
2. The provisioning process should be able to support update-in-place
   tests. (See [#7063](https://github.com/hashicorp/nomad/issues/7063))
3. A developer should be able to quickly stand up a small E2E cluster and
   provision it with a version of Nomad they've built on their laptop. The
   developer should be able to send updated builds to that cluster with a short
   iteration time, rather than having to rebuild the cluster.

#### Why not just drop all the provisioning into the AMI?

While that's the "correct" production approach for cloud infrastructure, it
creates a few pain points for testing:

* Creating a Linux AMI takes >10min, and creating a Windows AMI can take
  15-20min. This interferes with goal (3) above.
* We won't be able to do in-place upgrade testing without having an in-place
  provisioning process anyway. This interferes with goal (2) above.

#### Why not just drop all the provisioning into the user data?

* Userdata scripts are executed on boot, which prevents using them for
  in-place upgrade testing.
* Userdata scripts are not very observable, and it's painful to determine
  whether they've failed or simply haven't finished yet before trying to run
  tests.