open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	1fc8995590	query for leader in `operator debug` command (#13472) The `operator debug` command doesn't output the leader anywhere in the output, which adds extra burden to offline debugging (away from an ongoing incident where you can simply check manually). Query the `/v1/status/leader` API but degrade gracefully.	2022-07-06 10:57:44 -04:00
Dave May	97cf204c00	debug: add version constraint to avoid pprof panic (#12807)	2022-04-28 13:18:55 -04:00
Tim Gross	09b5e8d388	Fix flaky `operator debug` test (#12501) We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes: * Make first pprof collection synchronous to preserve the existing behavior for the common case where the pprof interval matches the duration. * Clamp `operator debug` pprof timing to that of the command. The `pprof-duration` should be no more than `duration` and the `pprof-interval` should be no more than `pprof-duration`. Clamp the values rather than throwing errors, which could change the commands that existing users might already have in debugging scripts * Testing: remove test parallelism The `operator debug` tests that stand up servers can't be run in parallel, because we don't have a way of canceling the API calls for pprof. The agent will still be running the last pprof when we exit, and that breaks the next test that talks to that same agent. (Because you can only run one pprof at a time on any process!) We could split off each subtest into its own server, but this test suite is already very slow. In future work we should fix this "for real" by making the API call cancelable. * Testing: assert against unexpected errors in `operator debug` tests. If we assert there are no unexpected error outputs, it's easier for the developer to debug when something is going wrong with the tests because the error output will be presented as a failing test, rather than just a failing exit code check. Or worse, no failing exit code check! This also forces us to be explicit about which tests will return 0 exit codes but still emit (presumably ignorable) error outputs. Additional minor bug fixes (mostly in tests) and test refactorings: * Fix text alignment on pprof Duration in `operator debug` output * Remove "done" channel from `operator debug` event stream test. The goroutine we're blocking for here already tells us it's done by sending a value, so block on that instead of an extraneous channel * Event stream test timer should start at current time, not zero * Remove noise from `operator debug` test log output. The `t.Logf` calls already are picked out from the rest of the test output by being prefixed with the filename. * Remove explicit pprof args so we use the defaults clamped from duration/interval	2022-04-07 15:00:07 -04:00
Danish Prakash	e7e8ce212e	command/operator_debug: add pprof interval (#11938)	2022-04-04 15:24:12 -04:00
Dave May	330d24a873	cli: Add event stream capture to nomad operator debug (#11865)	2022-01-17 21:35:51 -05:00
Michael Schurter	99c863f909	cli: improve debug error messages (#11507) Improves `nomad debug` error messages when contacting agents that do not have /v1/agent/host endpoints (the endpoint was added in v0.12.0) Part of #9568 and manually tested against Nomad v0.8.7. Hopefully isRedirectError can be reused for more cases listed in #9568	2022-01-17 11:15:17 -05:00
Tim Gross	f8a133a810	cli: ensure `-stale` flag is respected by `nomad operator debug` (#11678) When a cluster doesn't have a leader, the `nomad operator debug` command can safely use stale queries to gracefully degrade the consistency of almost all its queries. The query parameter for these API calls was not being set by the command. Some `api` package queries do not include `QueryOptions` because they target a specific agent, but they can potentially be forwarded to other agents. If there is no leader, these forwarded queries will fail. Provide methods to call these APIs with `QueryOptions`.	2021-12-15 10:44:03 -05:00
Dave May	3c04d7927b	cli: refactor operator debug capture (#11466) * debug: refactor Consul API collection * debug: refactor Vault API collection * debug: cleanup test timing * debug: extend test to multiregion * debug: save cmdline flags in bundle * debug: add cli version to output * Add changelog entry	2021-11-05 19:43:10 -04:00
Dave May	509c74ce19	debug: update default node-id and docs (#11398) * debug: default node-id to all * debug: align cli help and website documentation	2021-10-27 13:43:56 -04:00
Dave May	c37a6ed583	cli: rename paths in debug bundle for clarity (#11307) * Rename folders to reflect purpose * Improve captured files test coverage * Rename CSI plugins output file * Add changelog entry * fix test and make changelog message more explicit Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2021-10-13 18:00:55 -04:00
Dave May	305e8e98bf	cli: Improved autocomplete support for job dispatch and operator debug (#11270) * Add autocomplete to nomad job dispatch * Add autocomplete to nomad operator debug * Update incorrect comment * Update test to verify autocomplete * Add changelog * Apply lint suggestions * Create dynamic slices instead of specific length * Align style across predictors	2021-10-12 20:01:54 -04:00
Dave May	2d14c54fa0	debug: Improve namespace and region support (#11269) * Include region and namespace in CLI output * Add region and prefix matching for server members * Add namespace and region API outputs to cluster metadata folder * Add region awareness to WaitForClient helper function * Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice * Refactor test client agent generation * Add tests for region * Add changelog	2021-10-12 16:58:41 -04:00
James Rasell	b6813f1221	chore: fix incorrect docstring formatting.	2021-08-30 11:08:12 +02:00
Dave May	1e51d00d98	Add remaining pprof profiles to nomad operator debug (#10748) * Add remaining pprof profiles to debug dump * Refactor pprof profile capture * Add WaitForFilesUntil and WaitForResultUntil utility functions * Add CHANGELOG entry	2021-06-21 14:22:49 -04:00
Yoan Blanc	ac0d5d8bd3	chore: bump golangci-lint from v1.24 to v1.39 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2021-04-03 09:50:23 +02:00
Dave May	ba4da7efca	debug: Remove extra linefeed in monitor.log (#10252)	2021-03-29 09:22:27 -04:00
Dave May	e93b49a119	debug: update defaults to commonly used values	2021-03-09 08:31:38 -05:00
Dave May	cd506cb887	Handle Consul API URL protocol mismatch (#10082)	2021-02-25 08:22:44 -05:00
Dave May	5f50c1d0c1	debug: Fix node count bug from GH-9566 (#9625) * debug: update test to identify bug in GH-9566 * debug: range tests need fresh cmd each iteration * debug: fix node count bug in GH-9566	2020-12-14 15:02:48 -05:00
Kris Hicks	0a3a748053	Add gosimple linter (#9590)	2020-12-09 11:05:18 -08:00
Kris Hicks	93155ba3da	Add gocritic to golangci-lint config (#9556)	2020-12-08 12:47:04 -08:00
Dave May	e045bd3a5e	nomad operator debug - add pprof duration / csi details (#9346) * debug: add pprof duration CLI argument * debug: add CSI plugin details * update help text with ACL requirements * debug: provide ACL hints upon permission failures * debug: only write file when pprof retrieve is successful * debug: add helper function to clean bad characters from dynamic filenames * debug: ensure files are unable to escape the capture directory	2020-12-01 12:36:05 -05:00
Tim Gross	f1ad512986	docs: describe required ACLs for all commands	2020-11-20 13:38:29 -05:00
Tim Gross	de6b023af2	command: remove -namespace from help options when not applicable	2020-11-19 16:28:39 -05:00
Dave May	e89302aa4b	nomad operator debug - add client node filtering arguments (#9331) * operator debug - add client node filtering arguments * add WaitForClient helper function * use RPC in WaitForClient to avoid unnecessary imports * guard against nil values * move initialization up and shorten test duration * cleanup nodeLookupFailCount logic * only display max node notice if we actually tried to capture nodes	2020-11-12 11:25:28 -05:00
Dave May	f37e90be18	Metrics gotemplate support, debug bundle features (#9067) * add goroutine text profiles to nomad operator debug * add server-id=all to nomad operator debug * fix bug from changing metrics from string to []byte * Add function to return MetricsSummary struct, metrics gotemplate support * fix bug resolving 'server-id=all' when no servers are available * add url to operator_debug tests * removed test section which is used for future operator_debug.go changes * separate metrics from operator, use only structs from go-metrics * ensure parent directories are created as needed * add suggested comments for text debug pprof * move check down to where it is used * add WaitForFiles helper function to wait for multiple files to exist * compact metrics check Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com> * fix github's silly apply suggestion Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-10-14 15:16:10 -04:00
davemay99	603cc1776c	Add metrics command / output to debug bundle	2020-10-05 22:30:01 -04:00
Drew Bailey	6d7a6ebb38	run commands for duration and interval without needing to specify servers or nodes	2020-08-31 14:13:03 -04:00
Drew Bailey	1f7ea53876	add license info to operator debug command	2020-08-31 13:22:23 -04:00
Lang Martin	97c7f2acea	command/operator_debug: mkdir before storing agent-host (#8707) The api calls were reordered, the new order omits the `agent-host.json` result by fetching it before the directory is created.	2020-08-28 11:58:06 -04:00
Lang Martin	07ea822c6a	nomad debug renamed to nomad operator debug (#8602) * renamed: command/debug.go -> command/operator_debug.go * website: rename debug -> operator debug * website/pages/api-docs/agent: name in api docs	2020-08-11 15:39:44 -04:00

31 Commits