make test-nomad sets 15 minute time out for build. Increase the ci
timeout to 20m, so we can get meaningful output and goroutine stack
traces rather than have test be simply killed by CircleCI.
The extra 5 minutes is a buffer for generating-structs and some
unnecessary padding.
TestClient_UpdateNodeFromFingerprintKeepsConfig checks a test node
network interface, which is hardcoded to `eth0` and is updated
asynchronously. This causes flakiness when eth0 isn't available.
Here, we hardcode the value to an arbitrary network interface.
When spinning a second client, ensure that it uses new driver
instances, rather than reuse the already shutdown unhealthy drivers from
first instance.
This speeds up tests significantly, but cutting ~50 seconds or so, the
timeout in NewClient until drivers fingerprints. They never do because
drivers were shutdown already.
TestClient_RestoreError is very slow, taking ~81 seconds.
It has few problematic patterns. It's unclear what it tests, it
simulates a failure condition where all state db lookup fails and
asserts that alloc fails. Though starting from
https://github.com/hashicorp/nomad/pull/6216 , we don't fail allocs in
that condition but rather restart them.
Also, the drivers used in second client `c2` are the same singleton
instances used in `c1` and already shutdown. We ought to start healthy
new driver instances.
This adopts pattern used by Vault, where we split CircleCI yaml config
into multiple files that get packed and translated to 2.0.
This has two motivations: First, to ease translating config to CircleCI
2.0 so it can run on Enterprise private repository. Second and most
importantly, it also adding Enterprise specific jobs in separate files
with reduced config file merging conflict resolution.
This is a remenant of the time we used a custom hashicorp docker image for CI.
Currently, we use the official golang image, so no longer need the job
or manage the dockerhub credentials.
`stable-website` branch is only meant for updating the nomadproject.io
website, and the backend tests are irrelevant. Also, the ci workflow
uses up the plans containers and may delay website deployments by 20
minutes or more while we are cutting a release.
Noticed that ACL endpoints return 500 status code for user errors. This
is confusing and can lead to false monitoring alerts.
Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message. Codes for now match
HTTP codes to ease reasoning.
```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```