open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	09b5e8d388	Fix flaky `operator debug` test (#12501) We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes: * Make first pprof collection synchronous to preserve the existing behavior for the common case where the pprof interval matches the duration. * Clamp `operator debug` pprof timing to that of the command. The `pprof-duration` should be no more than `duration` and the `pprof-interval` should be no more than `pprof-duration`. Clamp the values rather than throwing errors, which could change the commands that existing users might already have in debugging scripts * Testing: remove test parallelism The `operator debug` tests that stand up servers can't be run in parallel, because we don't have a way of canceling the API calls for pprof. The agent will still be running the last pprof when we exit, and that breaks the next test that talks to that same agent. (Because you can only run one pprof at a time on any process!) We could split off each subtest into its own server, but this test suite is already very slow. In future work we should fix this "for real" by making the API call cancelable. * Testing: assert against unexpected errors in `operator debug` tests. If we assert there are no unexpected error outputs, it's easier for the developer to debug when something is going wrong with the tests because the error output will be presented as a failing test, rather than just a failing exit code check. Or worse, no failing exit code check! This also forces us to be explicit about which tests will return 0 exit codes but still emit (presumably ignorable) error outputs. Additional minor bug fixes (mostly in tests) and test refactorings: * Fix text alignment on pprof Duration in `operator debug` output * Remove "done" channel from `operator debug` event stream test. The goroutine we're blocking for here already tells us it's done by sending a value, so block on that instead of an extraneous channel * Event stream test timer should start at current time, not zero * Remove noise from `operator debug` test log output. The `t.Logf` calls already are picked out from the rest of the test output by being prefixed with the filename. * Remove explicit pprof args so we use the defaults clamped from duration/interval	2022-04-07 15:00:07 -04:00
Danish Prakash	e7e8ce212e	command/operator_debug: add pprof interval (#11938)	2022-04-04 15:24:12 -04:00
Seth Hoenig	fa0dc05b7a	tests: wait on client in a couple of tests These tend to fail on GHA, where I believe the client is not starting up fast enough before making requests. So wait on the client agent first. ``` === RUN TestDebug_CapturedFiles operator_debug_test.go:422: serverName: TestDebug_CapturedFiles.global, clientID, 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4 operator_debug_test.go:492: Error Trace: operator_debug_test.go:492 Error: Should be empty, but was No node(s) with prefix "1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4" found Failed to retrieve clients, 0 nodes found in list: 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4 Test: TestDebug_CapturedFiles --- FAIL: TestDebug_CapturedFiles (0.08s) ```	2022-03-30 08:48:23 -05:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Dave May	330d24a873	cli: Add event stream capture to nomad operator debug (#11865)	2022-01-17 21:35:51 -05:00
Michael Schurter	99c863f909	cli: improve debug error messages (#11507) Improves `nomad debug` error messages when contacting agents that do not have /v1/agent/host endpoints (the endpoint was added in v0.12.0) Part of #9568 and manually tested against Nomad v0.8.7. Hopefully isRedirectError can be reused for more cases listed in #9568	2022-01-17 11:15:17 -05:00
Tim Gross	f8a133a810	cli: ensure `-stale` flag is respected by `nomad operator debug` (#11678) When a cluster doesn't have a leader, the `nomad operator debug` command can safely use stale queries to gracefully degrade the consistency of almost all its queries. The query parameter for these API calls was not being set by the command. Some `api` package queries do not include `QueryOptions` because they target a specific agent, but they can potentially be forwarded to other agents. If there is no leader, these forwarded queries will fail. Provide methods to call these APIs with `QueryOptions`.	2021-12-15 10:44:03 -05:00
Dave May	3c04d7927b	cli: refactor operator debug capture (#11466) * debug: refactor Consul API collection * debug: refactor Vault API collection * debug: cleanup test timing * debug: extend test to multiregion * debug: save cmdline flags in bundle * debug: add cli version to output * Add changelog entry	2021-11-05 19:43:10 -04:00
Dave May	c37a6ed583	cli: rename paths in debug bundle for clarity (#11307) * Rename folders to reflect purpose * Improve captured files test coverage * Rename CSI plugins output file * Add changelog entry * fix test and make changelog message more explicit Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2021-10-13 18:00:55 -04:00
Dave May	2d14c54fa0	debug: Improve namespace and region support (#11269) * Include region and namespace in CLI output * Add region and prefix matching for server members * Add namespace and region API outputs to cluster metadata folder * Add region awareness to WaitForClient helper function * Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice * Refactor test client agent generation * Add tests for region * Add changelog	2021-10-12 16:58:41 -04:00
Dave May	1e51d00d98	Add remaining pprof profiles to nomad operator debug (#10748) * Add remaining pprof profiles to debug dump * Refactor pprof profile capture * Add WaitForFilesUntil and WaitForResultUntil utility functions * Add CHANGELOG entry	2021-06-21 14:22:49 -04:00
Dave May	e93b49a119	debug: update defaults to commonly used values	2021-03-09 08:31:38 -05:00
Dave May	cd506cb887	Handle Consul API URL protocol mismatch (#10082)	2021-02-25 08:22:44 -05:00
Dave May	0dd2d8944f	Debug test refactor (#9637) * debug: refactor test cases * debug: remove unnecessary syncbuffer resets * debug: cleaned up test code per suggestions * debug: clarify note on parallel testing	2020-12-15 13:51:41 -05:00
Dave May	5f50c1d0c1	debug: Fix node count bug from GH-9566 (#9625) * debug: update test to identify bug in GH-9566 * debug: range tests need fresh cmd each iteration * debug: fix node count bug in GH-9566	2020-12-14 15:02:48 -05:00
Dave May	be0a14d70b	fix AgentHostRequest panic found in GH-9546 (#9554) * debug: refactor nodeclass test * debug: add case to track down SIGSEGV on client to server Agent.Host RPC * verify server to avoid panic on AgentHostRequest RPC call, fixes GH-9546 * simplify Agent.Host RPC lookup logic	2020-12-07 17:34:40 -05:00
Dave May	e045bd3a5e	nomad operator debug - add pprof duration / csi details (#9346) * debug: add pprof duration CLI argument * debug: add CSI plugin details * update help text with ACL requirements * debug: provide ACL hints upon permission failures * debug: only write file when pprof retrieve is successful * debug: add helper function to clean bad characters from dynamic filenames * debug: ensure files are unable to escape the capture directory	2020-12-01 12:36:05 -05:00
Dave May	e89302aa4b	nomad operator debug - add client node filtering arguments (#9331) * operator debug - add client node filtering arguments * add WaitForClient helper function * use RPC in WaitForClient to avoid unnecessary imports * guard against nil values * move initialization up and shorten test duration * cleanup nodeLookupFailCount logic * only display max node notice if we actually tried to capture nodes	2020-11-12 11:25:28 -05:00
Dave May	f37e90be18	Metrics gotemplate support, debug bundle features (#9067) * add goroutine text profiles to nomad operator debug * add server-id=all to nomad operator debug * fix bug from changing metrics from string to []byte * Add function to return MetricsSummary struct, metrics gotemplate support * fix bug resolving 'server-id=all' when no servers are available * add url to operator_debug tests * removed test section which is used for future operator_debug.go changes * separate metrics from operator, use only structs from go-metrics * ensure parent directories are created as needed * add suggested comments for text debug pprof * move check down to where it is used * add WaitForFiles helper function to wait for multiple files to exist * compact metrics check Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com> * fix github's silly apply suggestion Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-10-14 15:16:10 -04:00
davemay99	19a075cf47	update deprecated syntax per GH-9027	2020-10-06 09:47:16 -04:00
davemay99	603cc1776c	Add metrics command / output to debug bundle	2020-10-05 22:30:01 -04:00
Lang Martin	07ea822c6a	nomad debug renamed to nomad operator debug (#8602) * renamed: command/debug.go -> command/operator_debug.go * website: rename debug -> operator debug * website/pages/api-docs/agent: name in api docs	2020-08-11 15:39:44 -04:00

22 Commits