Use gotest.tools/v3/fs to make better assertions about the files
Remove the TestAgent from TestDebugCommand_Prepare_ValidateTiming, since we can test that validation
without making any API calls.
* deps: upgrade gogo-protobuf to v1.3.2
* go mod tidy using go 1.16
* proto: regen protobufs after upgrading gogo/protobuf
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* debug: remove the CLI check for debug_enabled
The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.
Also:
- fix the API client to return errors on non-200 status codes for debug
endpoints
- improve the failure messages when pprof data can not be collected
Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>
* remove parallel test runs
parallel runs create a race condition that fail the debug tests
* snapshot the timestamp at the beginning of the capture
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
* Revert "snapshot the timestamp at the beginning of the capture"
This reverts commit c2d03346
* Refactor captureDynamic to extract capture logic for each item in a different func
* snapshot the timestamp at the beginning of the capture
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
* Revert "snapshot the timestamp at the beginning of the capture"
This reverts commit c2d03346
* Refactor captureDynamic to extract capture logic for each item in a different func
* extract wait group outside the go routine to avoid a race condition
* capture pprof in a separate go routine
* perform a single capture for pprof data for the whole duration
* add missing vendor dependency
* add a change log and fix documentation to reflect the change
* create function for timestamp dir creation and simplify error handling
* use error groups and ticker to simplify interval capture loop
* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval
* refactor Logs capture routine and add log capture specific test
* improve error reporting when log test fail
* change test duration to 1s
* make time parsing in log line more robust
* refactor log time format in a const
* test on log line empty the earliest possible and return
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* rename function to captureShortLived
* more specific changelog
Co-authored-by: Paul Banks <banks@banksco.de>
* update documentation to reflect current implementation
* add test for behavior when invalid param is passed to the command
* fix argument line in test
* a more detailed description of the new behaviour
Co-authored-by: Paul Banks <banks@banksco.de>
* print success right after the capture is done
* remove an unnecessary error check
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Co-authored-by: Paul Banks <banks@banksco.de>
* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully setup as primary key.
Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:
```json
[
{
"WAN": true,
"Datacenter": "dc2",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
},
"NumNodes": 6
},
{
"WAN": false,
"Datacenter": "dc2",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"NumNodes": 8
},
{
"WAN": false,
"Datacenter": "dc1",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"NumNodes": 8
}
]
```
I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output
Fixes#3393.
Fixes#7527
I want to highlight this and explain what I think the implications are and make sure we are aware:
* `HTTPConnStateFunc` closes the connection when it is beyond the limit. `Close` does not block.
* `HTTPConnStateFuncWithDefault429Handler(10 * time.Millisecond)` blocks until the following is done (worst case):
1) `conn.SetDeadline(10*time.Millisecond)` so that
2) `conn.Write(429error)` is guaranteed to timeout after 10ms, so that the http 429 can be written and
3) `conn.Close` can happen
The implication of this change is that accepting any new connection is worst case delayed by 10ms. But only after a client reached the limit already.