* remove flush for each write to http response in the agent monitor endpoint
* fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover.
* start log reading goroutine before adding the sink to avoid filling the log channel before getting a chance of reading from it
* flush every 500ms to optimize log writing in the http server side.
* add changelog file
* add issue url to changelog
* fix changelog url
* Update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use ticker to flush and avoid race condition when flushing in a different goroutine
* stop the ticker when done
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Revert "fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover."
This reverts commit 1eeddf7a
* wait for log consumer loop to start before registering the sink
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Updates to a cluster will clear the associated endpoints, and updates to
a listener will clear the associated routes. Update the incremental xDS
logic to account for this implicit cleanup so that we can finish warming
the clusters and listeners.
Fixes#10379
Previously we would return an error if duplicate paths were specified.
This could lead to problems in cases where a user has the same path,
say /healthz, on two different ports.
This validation was added to signal a potential misconfiguration.
Instead we will only check for duplicate listener ports, since that is
what would lead to ambiguity issues when generating xDS config.
In the future we could look into using a single listener and creating
distinct filter chains for each path/port.
* debug: remove the CLI check for debug_enabled
The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.
Also:
- fix the API client to return errors on non-200 status codes for debug
endpoints
- improve the failure messages when pprof data can not be collected
Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>
* remove parallel test runs
parallel runs create a race condition that fail the debug tests
* snapshot the timestamp at the beginning of the capture
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
* Revert "snapshot the timestamp at the beginning of the capture"
This reverts commit c2d03346
* Refactor captureDynamic to extract capture logic for each item in a different func
* snapshot the timestamp at the beginning of the capture
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
* Revert "snapshot the timestamp at the beginning of the capture"
This reverts commit c2d03346
* Refactor captureDynamic to extract capture logic for each item in a different func
* extract wait group outside the go routine to avoid a race condition
* capture pprof in a separate go routine
* perform a single capture for pprof data for the whole duration
* add missing vendor dependency
* add a change log and fix documentation to reflect the change
* create function for timestamp dir creation and simplify error handling
* use error groups and ticker to simplify interval capture loop
* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval
* refactor Logs capture routine and add log capture specific test
* improve error reporting when log test fail
* change test duration to 1s
* make time parsing in log line more robust
* refactor log time format in a const
* test on log line empty the earliest possible and return
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* rename function to captureShortLived
* more specific changelog
Co-authored-by: Paul Banks <banks@banksco.de>
* update documentation to reflect current implementation
* add test for behavior when invalid param is passed to the command
* fix argument line in test
* a more detailed description of the new behaviour
Co-authored-by: Paul Banks <banks@banksco.de>
* print success right after the capture is done
* remove an unnecessary error check
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Co-authored-by: Paul Banks <banks@banksco.de>
This PR adds cluster members to the metrics API. The number of members per
segment are reported as well as the total number of members.
Tested by running a multi-node cluster locally and ensuring the numbers were
correct. Also added unit test coverage to add the new expected gauges to
existing test cases.
Normally the named pipe would buffer up to 64k, but in some cases when a
soft limit is reached, they will start only buffering up to 4k.
In either case, we should not deadlock.
This commit changes the pipe-bootstrap command to first buffer all of
stdin into the process, before trying to write it to the named pipe.
This allows the process memory to act as the buffer, instead of the
named pipe.
Also changed the order of operations in `makeBootstrapPipe`. The new
test added in this PR showed that simply buffering in the process memory
was not enough to fix the issue. We also need to ensure that the
`pipe-bootstrap` process is started before we try to write to its
stdin. Otherwise the write will still block.
Also set stdout/stderr on the subprocess, so that any errors are visible
to the user.
* debug: remove the CLI check for debug_enabled
The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.
Also:
- fix the API client to return errors on non-200 status codes for debug
endpoints
- improve the failure messages when pprof data can not be collected
Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>
* remove parallel test runs
parallel runs create a race condition that fail the debug tests
* Add changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Create and use collapsible notices
* Refactor collapsible-notices
* Split up the topology acceptance tests
* Add acceptance tests for tproxy notices
* Add component file
* Adds additional TProxy notices tests
* Adds conditional to only show collapsable if more than 2 notices are present
* Adds changelog
* Refactorting the conditonal for collapsing the notices
* Renaming undefinedIntention to be notDefinedIntention
* Refactor tests
The bulk of this commit is moving the LeaderRoutineManager from the agent/consul package into its own package: lib/gort. It also got a renaming and its Start method now requires a context. Requiring that context required updating a whole bunch of other places in the code.
The prior solution to call reply.Reset() aged poorly since newer fields
were added to the reply, but not added to Reset() leading serial
blocking query loops on the server to blend replies.
This could manifest as a service-defaults protocol change from
default=>http not reverting back to default after the config entry
reponsible was deleted.
When the Consul serf health check is failing, this means that the health checks registered with the agent may no longer be correct. Therefore we show a notice to the user when we detect that the serf health check is failing both for the health check listing for nodes and for service instances.
There were a few little things we fixed up whilst we were here:
- We use our @replace decorator to replace an empty Type with serf in the model.
- We noticed that ServiceTags can be null, so we replace that with an empty array.
- We added docs for both our Notice component and the Consul::HealthCheck::List component. Notice now defaults to @type=info.
* Save exposed HTTP or GRPC ports to the agent's store
* Add those the health checks API so we can retrieve them from the API
* Change redirect-traffic command to also exclude those ports from inbound traffic redirection when expose.checks is set to true.
* Add conditionals to Lock Session list items
* Add changelog
* Show ID in details if there is a name to go in title
* Add copy-button if ID is in the title
* Update TTL conditional
* Update .changelog/10121.txt
Co-authored-by: John Cowen <johncowen@users.noreply.github.com>
Co-authored-by: John Cowen <johncowen@users.noreply.github.com>
This fixes the spacing bug in nspaces only by only showing Description if the namespace has one, and removing the extra 2 pixel margin of dds for when dts aren't rendered/don't exist.
* ui: Add support for showing partial lists in ListCollection
* Add CSS for partial 'View more' button, and move all CSS to /components
* Enable partial view for intention permissions
* ui: Loader amends/improvements
1. Create a JS compatible template only 'glimmer' component so we can
use it with or without glimmer.
2. Add a set of `rose` colors.
3. Animate the brand loader to keep it centered when the side
navigation appears.
4. Tweak the color of Consul::Loader to use a 'rose' color.
5. Move everything loader related to the `app/components/` folder and
add docs.
A recent change in 1.9.x inverted the order of these two lines, which caused the
X-Consul-Effective-Consistency header to be missing for the servie health endpoints
* ui: Fix text search for upstream instances
* Clean up predicates for other model types
* Add some docs around DataCollection and searching
* Enable UI Engineering Docs for our preview sites
* Use debug CSS in dev and staging
* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
* Give descriptive error if auth method not found
Previously during a `consul login -method=blah`, if the auth method was not found, the
error returned would be "ACL not found". This is potentially confusing
because there may be many different ACLs involved in a login: the ACL of
the Consul client, perhaps the binding rule or the auth method.
Now the error will be "auth method blah not found", which is much easier
to debug.
Initially we were loading every potential upstream address into Envoy
and then routing traffic to the logical upstream service. The downside
of this behavior is that traffic meant to go to a specific instance
would be load balanced across ALL instances.
Traffic to specific instance IPs should be forwarded to the original
destination and if it's a destination in the mesh then we should ensure
the appropriate certificates are used.
This PR makes transparent proxying a Kubernetes-only feature for now
since support for other environments requires generating virtual IPs,
and Consul does not do that at the moment.
* Update header logo and inline icon
* Update full logos + layout on loading screen
* Update favicon assets and strategy
- Switches to serve an ico file alongside an SVG file
- Introduces an apple-touch-icon
* Removes unused favicon/meta assets
* Changelog item for ui
* Create component for logo
* Simplify logo component, set brand color
* Fix docs loading state CSS issue
The only thing that needed fixing up pertained to this section of the 1.18.x release notes:
> grpc_stats: the default value for stats_for_all_methods is switched from true to false, in order to avoid possible memory exhaustion due to an untrusted downstream sending a large number of unique method names. The previous default value was deprecated in version 1.14.0. This only changes the behavior when the value is not set. The previous behavior can be used by setting the value to true. This behavior change by be overridden by setting runtime feature envoy.deprecated_features.grpc_stats_filter_enable_stats_for_all_methods_by_default.
For now to maintain status-quo I'm explicitly setting `stats_for_all_methods=true` in all versions to avoid relying upon the default.
Additionally the naming of the emitted metrics for these gRPC requests changed slightly so the integration test assertions for `case-grpc` needed adjusting.
This ensures that if someone does include some extension Consul does not currently make use of, that extension is actually usable. Without linking these envoy protobufs into the main binary it can't round trip the escape hatches to send them down to envoy.
Whenenver the go-control-plane library is upgraded next we just have to re-run 'make envoy-library'.
This adds support for the Incremental xDS protocol when using xDS v3. This is best reviewed commit-by-commit and will not be squashed when merged.
Union of all commit messages follows to give an overarching summary:
xds: exclusively support incremental xDS when using xDS v3
Attempts to use SoTW via v3 will fail, much like attempts to use incremental via v2 will fail.
Work around a strange older envoy behavior involving empty CDS responses over incremental xDS.
xds: various cleanups and refactors that don't strictly concern the addition of incremental xDS support
Dissolve the connectionInfo struct in favor of per-connection ResourceGenerators instead.
Do a better job of ensuring the xds code uses a well configured logger that accurately describes the connected client.
xds: pull out checkStreamACLs method in advance of a later commit
xds: rewrite SoTW xDS protocol tests to use protobufs rather than hand-rolled json strings
In the test we very lightly reuse some of the more boring protobuf construction helper code that is also technically under test. The important thing of the protocol tests is testing the protocol. The actual inputs and outputs are largely already handled by the xds golden output tests now so these protocol tests don't have to do double-duty.
This also updates the SoTW protocol test to exclusively use xDS v2 which is the only variant of SoTW that will be supported in Consul 1.10.
xds: default xds.Server.AuthCheckFrequency at use-time instead of construction-time
* Use proxy outbound port from TransparentProxyConfig if provided
* If -proxy-id is provided to the redirect-traffic command, exclude any listener ports
from inbound traffic redirection. This includes envoy_prometheus_bind_addr,
envoy_stats_bind_addr, and the ListenerPort from the Expose configuration.
* Allow users to provide additional inbound and outbound ports, outbound CIDRs
and additional user IDs to be excluded from traffic redirection.
This affects both the traffic-redirect command and the iptables SDK package.
This config entry is being renamed primarily because in k8s the name
cluster could be confusing given that the config entry applies across
federated datacenters.
Additionally, this config entry will only apply to Consul as a service
mesh, so the more generic "cluster" name is not needed.
* CLI: Add support for reading internal raft snapshots to snapshot inspect
* Add snapshot inspect test for raw state files
* Add changelog entry
* Update .changelog/10089.txt
The extra argument meant that the blocking query configuration wasn't
being read properly, and therefore the correct ?index wasn't being sent
with the request.
Previously only a single auth method would be saved to the snapshot. This commit fixes the typo
and adds to the test, to show that all auth methods are now saved.
* add http2 ping checks
* fix test issue
* add h2ping check to config resources
* add new test and docs for h2ping
* fix grammatical inconsistency in H2PING documentation
* resolve rebase conflicts, add test for h2ping tls verification failure
* api documentation for h2ping
* update test config data with H2PING
* add H2PING to protocol buffers and update changelog
* fix typo in changelog entry
* Add new consul connect redirect-traffic command for applying traffic redirection rules when Transparent Proxy is enabled.
* Add new iptables package for applying traffic redirection rules with iptables.
* Fix bug in cache where TTLs are effectively ignored
This mostly affects streaming since streaming will immediately return from Fetch calls when the state is Closed on eviction which causes the race condition every time.
However this also affects all other cache types if the fetch call happens to return between the eviction and then next time around the Get loop by any client.
There is a separate bug that allows cache items to be evicted even when there are active clients which is the trigger here.
* Add changelog entry
* Update .changelog/9978.txt
The streaming cache type for service health has no way to handle v1/health/ingress/:service queries as there is no equivalent topic that would return the appropriate data.
Ensure that attempts to use this endpoint will use the old cache-type for now so that they return appropriate data when streaming is enabled.
* Allow passing ALPN next protocols down to connect services. Fixes#4466.
* Update connect/proxy/proxy_test.go
Co-authored-by: Paul Banks <banks@banksco.de>
Co-authored-by: Paul Banks <banks@banksco.de>
This PR adds support for setting QueryOptions on a few agent API
endpoints. Nomad needs to be able to set the Namespace field on
these endpoints to:
- query for services / checks in a namespace
- deregister services / checks in a namespace
- update TTL status on checks in a namespace
* Configure ember-auto-import so we can use a stricter CSP
* Create a fake filesystem using JSON to avoid inline scripts in index
We used to have inline scripts in index.html in order to support embers
filepath fingerprinting and our configurable rootURL.
Instead of using inline scripts we use application/json plus a JSON blob
to create a fake filesystem JSON blob/hash/map to hold all of the
rootURL'ed fingerprinted file paths which we can then retrive later in
non-inline scripts.
We move our inlined polyfills script into the init.js external script,
and we move the CodeMirror syntax highlighting configuration inline
script into the main app itself - into the already existing CodeMirror
initializer (this has been moved so we can lookup a service located
document using ember's DI container)
* Set a strict-ish CSP policy during development
AutopilotServerHealthy now handles the 429 status code
Previously we would error out and not parse the response. Now either a 200 or 429 status code are considered expected statuses and will result in the method returning the reply allowing API consumers to not only see if the system is healthy or not but which server is unhealthy.
This PR uses the excellent a11y-dialog to implement our modal functionality across the UI.
This package covers all our a11y needs - overlay click and ESC to close, controlling aria-* attributes, focus trap and restore. It's also very small (1.6kb) and has good DOM and JS APIs and also seems to be widely used and well tested.
There is one downside to using this, and that is:
We made use of a very handy characteristic of the relationship between HTML labels and inputs in order to implement our modals previously. Adding a for="id" attribute to a label meant you can control an <input id="id" /> from anywhere else in the page without having to pass javascript objects around. It's just based on using the same string for the for attribute and the id attribute. This allowed us to easily open our login dialog with CSS from anywhere within the UI without having to manage passing around a javascript object/function/method in order to open the dialog.
We've PRed #9813 which includes an approach which would make passing around JS modal object easier to do. But in the meantime we've added a little 'hack' here using an additional <input /> element and a change listener which allows us to keep this label/input characteristic of our old modals. I'd originally thought this would be a temporary amend in order to wait on #9813 but the more I think about it, the more I think its quite a nice thing to keep - so longer term we may/may not keep this.
Allows setting -prometheus-backend-port to configure the cluster
envoy_prometheus_bind_addr points to.
Allows setting -prometheus-scrape-path to configure which path
envoy_prometheus_bind_addr exposes metrics on.
-prometheus-backend-port is used by the consul-k8s metrics merging feature, to
configure envoy_prometheus_bind_addr to point to the merged metrics
endpoint that combines Envoy and service metrics so that one set of
annotations on a Pod can scrape metrics from the service and it's Envoy
sidecar.
-prometheus-scrape-path is used to allow configurability of the path
where prometheus metrics are exposed on envoy_prometheus_bind_addr.
Previous to this commit, the API response would include Gateway
Addresses in the form `domain.name.:8080`, which due to the addition of
the port is probably not the expected response.
This commit rightTrims any `.` characters from the end of the domain
before formatting the address to include the port resulting in
`domain.name:8080`
Note that this does NOT upgrade to xDS v3. That will come in a future PR.
Additionally:
- Ignored staticcheck warnings about how github.com/golang/protobuf is deprecated.
- Shuffled some agent/xds imports in advance of a later xDS v3 upgrade.
- Remove support for envoy 1.13.x but don't add in 1.17.x yet. We have to wait until the xDS v3 support is added in a follow-up PR.
Fixes#8425
When de-registering in anti-entropy sync, when there is no service or
check token.
The agent token will fall back to the default (aka user) token if no agent
token is set, so the existing behaviour still works, but it will prefer
the agent token over the user token if both are set.
ref: https://www.consul.io/docs/agent/options#acl_tokens
The agent token seems more approrpiate in this case, since this is an
"internal operation", not something initiated by the user.
This commit use the internal authorize endpoint along wiht ember-can to further restrict user access to certain UI features and navigational elements depending on the users ACL token
* A GET of the /acl/auth-method/:name endpoint returns the fields
MaxTokenTTL and TokenLocality, while a LIST (/acl/auth-methods) does
not.
The list command returns a filtered subset of the full set. This is
somewhat deliberate, so that secrets aren't shown, but the TTL and
Locality fields aren't (IMO) security critical, and it is useful for
the front end to be able to show them.
For consistency these changes mirror the 'omit empty' and string
representation choices made for the GET call.
This includes changes to the gRPC and API code in the client.
The new output looks similar to this
curl 'http://localhost:8500/v1/acl/auth-methods' | jq '.'
{
"MaxTokenTTL": "8m20s",
"Name": "minikube-ttl-local2",
"Type": "kubernetes",
"Description": "minikube auth method",
"TokenLocality": "local",
"CreateIndex": 530,
"ModifyIndex": 530,
"Namespace": "default"
}
]
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Previously a snapshot created as part of a resumse-stream request could have incorrectly
cached the newSnapshotToFollow event. This would cause clients to error because they
received an unexpected framing event.
This fixes an issue where leaf certificates issued in primary
datacenters using Vault as a Connect CA would be reissued very
frequently (every ~20 seconds) because the logic meant to detect root
rotation was errantly triggering.
The hash of the rootCA was being compared against a hash of the
intermediateCA and always failing. This doesn't apply to the Consul
built-in CA provider because there is no intermediate in use in the
primary DC.
This is reminiscent of #6513