This PR upgrades our CI images and fixes some affected tests.
- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)
- eliminate go-machine-recent-image (no longer necessary)
- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)
- fix tcp dial error check (message seems to be OS specific)
- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)
- use safe MkdirTemp for generating tmpfiles
NOT applied: (too flakey)
- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)
- upgrade resource type for all imanges to large (2C -> 4C)
* debug: refactor Consul API collection
* debug: refactor Vault API collection
* debug: cleanup test timing
* debug: extend test to multiregion
* debug: save cmdline flags in bundle
* debug: add cli version to output
* Add changelog entry
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
* Fixup uses of `sanity`
* Remove unnecessary comments.
These checks are better explained by earlier comments about
the context of the test. Per @tgross, moved the tests together
to better reinforce the overall shared context.
* Update nomad/fsm_test.go
* operator debug - add client node filtering arguments
* add WaitForClient helper function
* use RPC in WaitForClient to avoid unnecessary imports
* guard against nil values
* move initialization up and shorten test duration
* cleanup nodeLookupFailCount logic
* only display max node notice if we actually tried to capture nodes
* add goroutine text profiles to nomad operator debug
* add server-id=all to nomad operator debug
* fix bug from changing metrics from string to []byte
* Add function to return MetricsSummary struct, metrics gotemplate support
* fix bug resolving 'server-id=all' when no servers are available
* add url to operator_debug tests
* removed test section which is used for future operator_debug.go changes
* separate metrics from operator, use only structs from go-metrics
* ensure parent directories are created as needed
* add suggested comments for text debug pprof
* move check down to where it is used
* add WaitForFiles helper function to wait for multiple files to exist
* compact metrics check
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
* fix github's silly apply suggestion
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Reverts d5c7d6e491e36a11159211f5236c19a41bed4d8e .
We actually need to forward the request to ensure that the leader is
properly configured and that establishedLeadership completes.
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:
* `{https,rpc}_handshake_timeout`
* `{http,rpc}_max_conns_per_client`
The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.
The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.
All limits are configurable and may be disabled by setting them to `0`.
This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.
Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release. As such, we maintain current behavior that doesn't require
passing the proper namespace in request. A future major release may
start enforcing checking declared namespace.
This is a workaround for the windows process model.
Go os/exec does not pass the parent process handle to the child
processes STARTUPINFO struct, this means that unless we wait in
the _same_ execution context as Starting the process, the
handle will be lost, and we cannot kill it without regaining
a handle.
A better long term solution would be a higher level process
abstraction that uses windows.CreateProcess on windows.
When vault is installed through chocolatey, it also installs a shim that
will not pass kill signals to the child. This means the process will
never actually terminate, and we lose the process handle.
Here, rather than waiting forever, we timeout fast.
WaitForRunning risks a race condition where the allocation succeeds and
completes before WaitForRunning is called (or while it is running).
Here, I made the behavior match the function documentation.
I considered making it stricter, but callers need to account for
allocation terminating immediately after WaitForRunning terminates
anyway.
Although the really exciting change is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
httptest.ResponseRecorder exposes a bytes.Buffer which we were reading
and writing concurrently to test streaming log APIs. This is a race, so
I wrapped the struct in a lock with some helpers.
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.
Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
Adds a Consul field to the TestServerConfig that allows passing in non-default values for e.g. consul address.
This will allow the TestServer to integrate with Consul's testutil/TestServer.
This drops the testings stdlib pkg from our dependencies. Saves a
whopping 46kb on our binary (was really hoping for more of a win there),
but also avoids potential ugliness with how testing sets flags.
* alloc_runner
* Random tests
* parallel task_runner and no exec compatible check
* Parallel client
* Fail fast and use random ports
* Fix docker port mapping
* Make concurrent pull less timing dependant
* up parallel
* Fixes
* don't build chroots in parallel on travis
* Reduce parallelism on travis with lxc/rkt
* make java test app not run forever
* drop parallelism a little
* use docker ports that are out of the os's ephemeral port range
* Limit even more on travis
* rkt deadline