open-nomad

Commit Graph

Author	SHA1	Message	Date
Luiz Aoqui	3479e2231f	core: enforce strict steps for clients reconnect (#15808 ) When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat the Nomad server will update its status to `disconnected`. Upon reconnecting, the client will make three main RPC calls: - `Node.UpdateStatus` is used to set the client status to `ready`. - `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states etc. - `Node.Register` is used to upsert the entire node information, including its status. These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations. For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources. When this client comes back the order of events may be: 1. Client calls `Node.UpdateStatus` and is now `ready`. 2. Scheduler reconciles allocations and places the replacement alloc to the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`. 3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`. 4. Scheduler notices too many allocs and stops the replacement. This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled. To avoid problems like this clients must update _all_ of its relevant information before they can be considered `ready` and available for scheduling. To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting: - `Node.Register` does not set the client status anymore. 
- `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`. These changes are done server-side to avoid the need for additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would. The verification of whether allocations have been updated is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.	2023-01-25 15:53:59 -05:00
Seth Hoenig	83450c8762	vault: configure user agent on Nomad vault clients (#15745 ) * vault: configure user agent on Nomad vault clients This PR attempts to set the User-Agent header on each Vault API client created by Nomad. Still need to figure a way to set User-Agent on the Vault client created internally by consul-template. * vault: fixup find-and-replace gone awry	2023-01-10 10:39:45 -06:00
Seth Hoenig	7214e21402	ci: swap freeport for portal in packages (#15661 )	2023-01-03 11:25:20 -06:00
Lance Haig	0263e7af34	Add command "nomad tls" (#14296 )	2022-11-22 14:12:07 -05:00
Michael Schurter	ed3218c3dd	Fixing flaky TestOverlap test (#14780 ) * test: ensure feasible node selected in overlap test * test: warn when getting close to retry limit	2022-10-03 14:35:02 -07:00
Tim Gross	17aee4d69c	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Seth Hoenig	fc58f4972c	cli: correctly use and validate job with vault token set This PR fixes `job validate` to respect '-vault-token', '$VAULT_TOKEN', '-vault-namespace' if set.	2022-05-19 12:13:34 -05:00
Eng Zer Jun	97d1bc735c	test: use `T.TempDir` to create temporary test directory (#12853 ) * test: use `T.TempDir` to create temporary test directory This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The directory created by `t.TempDir` is automatically removed when the test and all its subtests complete. Prior to this commit, temporary directory created using `ioutil.TempDir` needs to be removed manually by calling `os.RemoveAll`, which is omitted in some tests. The error handling boilerplate e.g. defer func() { if err := os.RemoveAll(dir); err != nil { t.Fatal(err) } } is also tedious, but `t.TempDir` handles this for us nicely. Reference: https://pkg.go.dev/testing#T.TempDir Signed-off-by: Eng Zer Jun <engzerjun@gmail.com> * test: fix TestLogmon_Start_restart on Windows Signed-off-by: Eng Zer Jun <engzerjun@gmail.com> * test: fix failing TestConsul_Integration t.TempDir fails to perform the cleanup properly because the folder is still in use testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2022-05-12 11:42:40 -04:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Seth Hoenig	2f0cfb5740	build: upgrade and speedup circleci configuration This PR upgrades our CI images and fixes some affected tests. - upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02) - eliminate go-machine-recent-image (no longer necessary) - manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174) - fix tcp dial error check (message seems to be OS specific) - spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2) - use safe MkdirTemp for generating tmpfiles NOT applied: (too flakey) - eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting) - upgrade resource type for all images to large (2C -> 4C)	2022-01-24 08:28:14 -06:00
Dave May	3c04d7927b	cli: refactor operator debug capture (#11466 ) * debug: refactor Consul API collection * debug: refactor Vault API collection * debug: cleanup test timing * debug: extend test to multiregion * debug: save cmdline flags in bundle * debug: add cli version to output * Add changelog entry	2021-11-05 19:43:10 -04:00
Michael Schurter	fd68bbc342	test: update tests to properly use AllocDir Also use t.TempDir when possible.	2021-10-19 10:49:07 -07:00
Dave May	c37a6ed583	cli: rename paths in debug bundle for clarity (#11307 ) * Rename folders to reflect purpose * Improve captured files test coverage * Rename CSI plugins output file * Add changelog entry * fix test and make changelog message more explicit Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2021-10-13 18:00:55 -04:00
Dave May	2d14c54fa0	debug: Improve namespace and region support (#11269 ) * Include region and namespace in CLI output * Add region and prefix matching for server members * Add namespace and region API outputs to cluster metadata folder * Add region awareness to WaitForClient helper function * Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice * Refactor test client agent generation * Add tests for region * Add changelog	2021-10-12 16:58:41 -04:00
Dave May	1e51d00d98	Add remaining pprof profiles to nomad operator debug (#10748 ) * Add remaining pprof profiles to debug dump * Refactor pprof profile capture * Add WaitForFilesUntil and WaitForResultUntil utility functions * Add CHANGELOG entry	2021-06-21 14:22:49 -04:00
Mahmood Ali	aa77c2731b	tests: use standard library testing.TB Glint pulled in an updated version of mitchellh/go-testing-interface which broke some existing tests because the update added a Parallel() method to testing.T. This switches to the standard library testing.TB which doesn't have a Parallel() method.	2021-06-09 16:18:45 -07:00
Charlie Voiselle	0473f35003	Fixup uses of `sanity` (#10187 ) * Fixup uses of `sanity` * Remove unnecessary comments. These checks are better explained by earlier comments about the context of the test. Per @tgross, moved the tests together to better reinforce the overall shared context. * Update nomad/fsm_test.go	2021-03-16 18:05:08 -04:00
Dennis Schön	a9c97d9257	use os.ErrDeadlineExceeded in tests	2020-12-07 10:40:28 -05:00
Dave May	e89302aa4b	nomad operator debug - add client node filtering arguments (#9331 ) * operator debug - add client node filtering arguments * add WaitForClient helper function * use RPC in WaitForClient to avoid unnecessary imports * guard against nil values * move initialization up and shorten test duration * cleanup nodeLookupFailCount logic * only display max node notice if we actually tried to capture nodes	2020-11-12 11:25:28 -05:00
Dave May	f37e90be18	Metrics gotemplate support, debug bundle features (#9067 ) * add goroutine text profiles to nomad operator debug * add server-id=all to nomad operator debug * fix bug from changing metrics from string to []byte * Add function to return MetricsSummary struct, metrics gotemplate support * fix bug resolving 'server-id=all' when no servers are available * add url to operator_debug tests * removed test section which is used for future operator_debug.go changes * separate metrics from operator, use only structs from go-metrics * ensure parent directories are created as needed * add suggested comments for text debug pprof * move check down to where it is used * add WaitForFiles helper function to wait for multiple files to exist * compact metrics check Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com> * fix github's silly apply suggestion Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-10-14 15:16:10 -04:00
Mahmood Ali	2ab54fe060	gracefully shutdown test server	2020-05-27 08:59:06 -04:00
Mahmood Ali	0a539c629f	tests: wait until leadership loop finishes Reverts d5c7d6e491e36a11159211f5236c19a41bed4d8e . We actually need to forward the request to ensure that the leader is properly configured and that establishedLeadership completes.	2020-03-06 14:41:59 -05:00
Mahmood Ali	6daac02548	avoid forwarding leadership checks in tests The tests only care if a test server recognizes the leader.	2020-03-02 13:47:43 -05:00
Michael Schurter	c82b14b0c4	core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to first byte being read so RPC connections with TLS enabled have `rpc_handshake_time * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion.	2020-01-30 10:38:25 -08:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Mahmood Ali	4b2ba62e35	acl: check ACL against object namespace Fix a bug where a malicious user can access or manipulate an alloc in a namespace they don't have access to. The allocation endpoints perform ACL checks against the request namespace, not the allocation namespace, and perform the allocation lookup independently from namespaces. Here, we check that the requester can access the alloc namespace regardless of the declared request namespace. Ideally, we'd enforce that the declared request namespace matches the actual allocation namespace. Unfortunately, we haven't documented alloc endpoints as namespaced functions; we suspect starting to enforce this will be very disruptive and inappropriate for a nomad point release. As such, we maintain current behavior that doesn't require passing the proper namespace in request. A future major release may start enforcing checking declared namespace.	2019-10-08 12:59:22 -04:00
Mahmood Ali	6cefd8f97e	tests: attempt to fix TestAutopilot_CleanupStaleRaftServer Also add a utility function for waiting for stable leadership	2019-09-04 08:49:33 -04:00
Mahmood Ali	fc72fff0ed	test helper for registering jobs with acl Test helper that allows registration of jobs when ACL is activated.	2019-04-30 10:23:56 -04:00
Mahmood Ali	8c82c19831	tests: IsTravis() -> IsCI() Replace IsTravis() references that are intended for CI environments in general rather than for the Travis environment specifically.	2019-02-20 08:21:03 -05:00
Mahmood Ali	33ff8c3e8d	tests: expect Docker on AppVeyor Prepare to run docker on AppVeyor Windows environment	2019-02-20 07:41:47 -05:00
Alex Dadgar	4bdccab550	goimports	2019-01-22 15:44:31 -08:00
Danielle Tomlinson	7fca934509	chore: General Cleanup	2019-01-17 18:43:14 +01:00
Michael Schurter	9692271926	Update testutil/vault.go Co-Authored-By: dantoml <dani@tomlinson.io>	2019-01-17 18:43:14 +01:00
Danielle Tomlinson	160c8d80e8	testutil: Start vault in the same routine as waiting This is a workaround for the windows process model. Go os/exec does not pass the parent process handle to the child processes STARTUPINFO struct, this means that unless we wait in the _same_ execution context as Starting the process, the handle will be lost, and we cannot kill it without regaining a handle. A better long term solution would be a higher level process abstraction that uses windows.CreateProcess on windows.	2019-01-17 18:43:13 +01:00
Danielle Tomlinson	5d54a0408f	fingerprint: Limit vault shutdown waiting When vault is installed through chocolatey, it also installs a shim that will not pass kill signals to the child. This means the process will never actually terminate, and we lose the process handle. Here, rather than waiting forever, we timeout fast.	2019-01-17 18:43:13 +01:00
Mahmood Ali	b7d421a149	tests: WaitForRunning checks for pending only WaitForRunning risks a race condition where the allocation succeeds and completes before WaitForRunning is called (or while it is running). Here, I made the behavior match the function documentation. I considered making it stricter, but callers need to account for allocation terminating immediately after WaitForRunning terminates anyway.	2019-01-10 15:36:57 -05:00
Mahmood Ali	c3eaa0f4c8	tests: enable and fix tests requiring mock driver	2019-01-10 10:10:11 -05:00
Michael Schurter	f279b1d1b1	tests: test logs endpoint against pending task Although the really exciting change is making WaitForRunning return the allocations that it started. This should cut down test boilerplate significantly.	2018-10-16 16:56:55 -07:00
Michael Schurter	1c9ccdeab5	tests: fix races caused by sharing a buffer httptest.ResponseRecorder exposes a bytes.Buffer which we were reading and writing concurrently to test streaming log APIs. This is a race, so I wrapped the struct in a lock with some helpers.	2018-10-16 16:56:55 -07:00
Alex Dadgar	e546215046	add a vault test matrix	2018-09-19 10:18:10 -07:00
Michael Schurter	6def5bc4f9	client: set host name when migrating over tls Not setting the host name led the Go HTTP client to expect a certificate with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad` names ephemeral dir migrations were broken when TLS was enabled. Added an e2e test to ensure this doesn't break again as it's very difficult to test and the TLS configuration is very easy to get wrong.	2018-09-05 17:24:17 -07:00
Michael Schurter	d3650fb2cd	test: build with mock_driver by default `make release` and `make prerelease` set a `release` tag to disable enabling the `mock_driver`	2018-04-18 14:45:33 -07:00
Alex Dadgar	345169a3cc	Merge pull request #4105 from hashicorp/b-flaky-deadline-tests Fix flaky deadline tests	2018-04-03 17:21:38 -07:00
Alex Dadgar	af1b185ce4	Fix flaky deadline tests	2018-04-03 16:51:57 -07:00
Oz Katz	1b3eaf70ae	Support custom Consul config for TestServer Adds a Consul field to the TestServerConfig that allows passing in non-default values for e.g. consul address. This will allow the TestServer to integrate with Consul's testutil/TestServer.	2018-04-04 01:40:11 +03:00
Alex Dadgar	b18f789020	Unmark drain when nodes hit their deadline and only batch/system left and add all job type integration test	2018-03-28 17:25:58 -07:00
Alex Dadgar	d498fa950a	Remove fake advertise address and fix TestAPI_OperatorAutopilotServerHealth	2018-03-19 15:49:12 -07:00
Michael Schurter	0ac43a7622	Skip QEMU graceful shutdown test except on Travis Hopefully we can reuse the SkipSlow helper elsewhere.	2018-01-31 15:47:26 -08:00
Kyle Havlovitz	1c07066064	Add autopilot functionality based on Consul's autopilot	2017-12-18 14:29:41 -08:00
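
The reconnect check described in commit 3479e2231f (comparing the Raft index of the last missed heartbeat with the Raft index of the last allocation update) can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: the `node` type, its field names, the status constants, and the `nextStatus` helper are all hypothetical simplifications.

```go
package main

import "fmt"

// Status values named in the commit message.
const (
	statusReady        = "ready"
	statusInitializing = "initializing"
	statusDisconnected = "disconnected"
)

// node is a hypothetical, simplified stand-in for the server-side node record.
type node struct {
	Status                   string
	LastMissedHeartbeatIndex uint64 // Raft index when the client last missed a heartbeat
	LastAllocUpdateIndex     uint64 // Raft index when the client last updated its allocs
}

// allocsUpdatedSinceDisconnect reports whether the client has updated its
// allocations after the most recent missed heartbeat.
func allocsUpdatedSinceDisconnect(n node) bool {
	return n.LastAllocUpdateIndex > n.LastMissedHeartbeatIndex
}

// nextStatus returns the status the server would record when a client asks
// for the requested status. A client that has not yet reported its
// allocations since disconnecting is held in "initializing" instead of
// becoming "ready".
func nextStatus(n node, requested string) string {
	if requested == statusReady && !allocsUpdatedSinceDisconnect(n) {
		return statusInitializing
	}
	return requested
}

func main() {
	n := node{Status: statusDisconnected, LastMissedHeartbeatIndex: 10, LastAllocUpdateIndex: 7}
	fmt.Println(nextStatus(n, statusReady)) // allocs not yet updated

	n.LastAllocUpdateIndex = 12 // the client has since reported its allocs
	fmt.Println(nextStatus(n, statusReady))
}
```

Keeping the decision on the server, as the commit message explains, means clients can keep issuing the same RPC calls in any order; only the recorded status changes.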


96 Commits
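
The per-client connection limit arithmetic from commit c82b14b0c4 (default 100, minimum 26, 20 connections reserved for Raft and non-streaming RPCs, `0` to disable) can be illustrated with a small sketch. The `streamingLimit` function and the constant names are hypothetical, and the assumption that the streaming budget equals the limit minus the reserved 20 is one reading of the commit message, not a statement of how Nomad implements it.

```go
package main

import (
	"errors"
	"fmt"
)

// Values taken from the commit message; the names are illustrative only.
const (
	defaultMaxConnsPerClient = 100
	minMaxConnsPerClient     = 26
	reservedNonStreaming     = 20 // reserved for Raft and non-streaming RPCs
)

var errLimitTooLow = errors.New("limit must be 0 (disabled) or at least 26")

// streamingLimit returns how many connections remain for streaming RPCs
// under a given per-client limit. A limit of 0 disables limiting, which
// this sketch reports as 0 with no error.
func streamingLimit(maxConns int) (int, error) {
	if maxConns == 0 {
		return 0, nil
	}
	if maxConns < minMaxConnsPerClient {
		return 0, errLimitTooLow
	}
	return maxConns - reservedNonStreaming, nil
}

func main() {
	for _, limit := range []int{0, minMaxConnsPerClient, defaultMaxConnsPerClient, 10} {
		n, err := streamingLimit(limit)
		fmt.Println(limit, n, err)
	}
}
```

Under these assumptions the minimum of 26 leaves 6 connections for streaming RPCs, and the default of 100 leaves 80.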