This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This will avoid adding format=prometheus in request and to parse
more easily metrics using Prometheus.
This commit follows https://github.com/hashicorp/consul/pull/6514 as
the PR has been closed and extends it by accepting old Prometheus
mime-type.
This adds acl enforcement to the two endpoints that were missing it.
Note that in the case of getting a services health by its id, we still
must first lookup the service so we still "leak" information about a
service with that ID existing. There isn't really a way around it though
as ACLs are meant to check service names.
* Renamed structs.IntentionWildcard to structs.WildcardSpecifier
* Refactor ACL Config
Get rid of remnants of enterprise only renaming.
Add a WildcardName field for specifying what string should be used to indicate a wildcard.
* Add wildcard support in the ACL package
For read operations they can call anyAllowed to determine if any read access to the given resource would be granted.
For write operations they can call allAllowed to ensure that write access is granted to everything.
* Make v1/agent/connect/authorize namespace aware
* Update intention ACL enforcement
This also changes how intention:read is granted. Before the Intention.List RPC would allow viewing an intention if the token had intention:read on the destination. However Intention.Match allowed viewing if access was allowed for either the source or dest side. Now Intention.List and Intention.Get fall in line with Intention.Matches previous behavior.
Due to this being done a few different places ACL enforcement for a singular intention is now done with the CanRead and CanWrite methods on the intention itself.
* Refactor Intention.Apply to make things easier to follow.
* ACL Authorizer overhaul
To account for upcoming features every Authorization function can now take an extra *acl.EnterpriseAuthorizerContext. These are unused in OSS and will always be nil.
Additionally the acl package has received some thorough refactoring to enable all of the extra Consul Enterprise specific authorizations including moving sentinel enforcement into the stubbed structs. The Authorizer funcs now return an acl.EnforcementDecision instead of a boolean. This improves the overall interface as it makes multiple Authorizers easily chainable as they now indicate whether they had an authoritative decision or should use some other defaults. A ChainedAuthorizer was added to handle this Authorizer enforcement chain and will never itself return a non-authoritative decision.
* Include stub for extra enterprise rules in the global management policy
* Allow for an upgrade of the global-management policy
Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
Also:
* Finished threading replaceExistingChecks setting (from GH-4905)
through service manager.
* Respected the original configSource value that was used to register a
service or a check when restoring persisted data.
* Run several existing tests with and without central config enabled
(not exhaustive yet).
* Switch to ioutil.ReadFile for all types of agent persistence.
All these changes should have no side-effects or change behavior:
- Use bytes.Buffer's String() instead of a conversion
- Use time.Since and time.Until where fitting
- Drop unnecessary returns and assignment
* Support for maximum size for Output of checks
This PR allows users to limit the size of output produced by checks at the agent
and check level.
When set at the agent level, it will limit the output for all checks monitored
by the agent.
When set at the check level, it can override the agent max for a specific check but
only if it is lower than the agent max.
Default value is 4k, and input must be at least 1.
This allows addresses to be tagged at the service level similar to what we allow for nodes already. The address translation that can be enabled with the `translate_wan_addrs` config was updated to take these new addresses into account as well.
Also update some snapshot agent docs
* Enforce correct permissions when registering a check
Previously we had attempted to enforce service:write for a check associated with a service instead of node:write on the agent but due to how we decoded the health check from the request it would never do it properly. This commit fixes that.
* Update website/source/docs/commands/snapshot/agent.html.markdown.erb
Co-Authored-By: mkeeler <mkeeler@users.noreply.github.com>
Fixes: #4222
# Data Filtering
This PR will implement filtering for the following endpoints:
## Supported HTTP Endpoints
- `/agent/checks`
- `/agent/services`
- `/catalog/nodes`
- `/catalog/service/:service`
- `/catalog/connect/:service`
- `/catalog/node/:node`
- `/health/node/:node`
- `/health/checks/:service`
- `/health/service/:service`
- `/health/connect/:service`
- `/health/state/:state`
- `/internal/ui/nodes`
- `/internal/ui/services`
More can be added going forward and any endpoint which is used to list some data is a good candidate.
## Usage
When using the HTTP API a `filter` query parameter can be used to pass a filter expression to Consul. Filter Expressions take the general form of:
```
<selector> == <value>
<selector> != <value>
<value> in <selector>
<value> not in <selector>
<selector> contains <value>
<selector> not contains <value>
<selector> is empty
<selector> is not empty
not <other expression>
<expression 1> and <expression 2>
<expression 1> or <expression 2>
```
Normal boolean logic and precedence is supported. All of the actual filtering and evaluation logic is coming from the [go-bexpr](https://github.com/hashicorp/go-bexpr) library
## Other changes
Adding the `Internal.ServiceDump` RPC endpoint. This will allow the UI to filter services better.
This PR adds two features which will be useful for operators when ACLs are in use.
1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)
Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.
Some other secondary changes:
* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid.
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
This adds a MaxQueryTime field to the connect ca leaf cache request type and populates it via the wait query param. The cache will then do the right thing and timeout the operation as expected if no new leaf cert is available within that time.
Fixes#4462
The reproduction scenario in the original issue now times out appropriately.
This endpoint aggregates all checks related to <service id> on the agent
and return an appropriate http code + the string describing the worst
check.
This allows to cleanly expose service status to other component, hiding
complexity of multiple checks.
This is especially useful to use consul to feed a load balancer which
would delegate health checking to consul agent.
Exposing this endpoint on the agent is necessary to avoid a hit on
consul servers and avoid decreasing resiliency (this endpoint will work
even if there is no consul leader in the cluster).
* website: add multi-dc enterprise landing page
* website: switch all 1.4.0 alerts/RC warnings
* website: connect product wording
Co-Authored-By: pearkes <jackpearkes@gmail.com>
* website: remove RC notification
* commmand/acl: fix usage docs for ACL tokens
* agent: remove comment, OperatorRead
* website: improve multi-dc docs
Still not happy with this but tried to make it slightly more informative.
* website: put back acl guide warning for 1.4.0
* website: simplify multi-dc page and respond to feedback
* Fix Multi-DC typos on connect index page.
* Improve Multi-DC overview.
A full guide is a WIP and will be added post-release.
* Fixes typo avaiable > available
* agent/debug: add package for debugging, host info
* api: add v1/agent/host endpoint
* agent: add v1/agent/host endpoint
* command/debug: implementation of static capture
* command/debug: tests and only configured targets
* agent/debug: add basic test for host metrics
* command/debug: add methods for dynamic data capture
* api: add debug/pprof endpoints
* command/debug: add pprof
* command/debug: timing, wg, logs to disk
* vendor: add gopsutil/disk
* command/debug: add a usage section
* website: add docs for consul debug
* agent/host: require operator:read
* api/host: improve docs and no retry timing
* command/debug: fail on extra arguments
* command/debug: fixup file permissions to 0644
* command/debug: remove server flags
* command/debug: improve clarity of usage section
* api/debug: add Trace for profiling, fix profile
* command/debug: capture profile and trace at the same time
* command/debug: add index document
* command/debug: use "clusters" in place of members
* command/debug: remove address in output
* command/debug: improve comment on metrics sleep
* command/debug: clarify usage
* agent: always register pprof handlers and protect
This will allow us to avoid a restart of a target agent
for profiling by always registering the pprof handlers.
Given this is a potentially sensitive path, it is protected
with an operator:read ACL and enable debug being
set to true on the target agent. enable_debug still requires
a restart.
If ACLs are disabled, enable_debug is sufficient.
* command/debug: use trace.out instead of .prof
More in line with golang docs.
* agent: fix comment wording
* agent: wrap table driven tests in t.run()
* Add -enable-local-script-checks options
These options allow for a finer control over when script checks are enabled by
giving the option to only allow them when they are declared from the local
file system.
* Add documentation for the new option
* Nitpick doc wording
* Plumb xDS server and proxyxfg into the agent startup
* Add `consul connect envoy` command to allow running Envoy as a connect sidecar.
* Add test for help tabs; typos and style fixups from review
- A new endpoint `/v1/agent/service/:service_id` which is a generic way to look up the service for a single instance. The primary value here is that it:
- **supports hash-based blocking** and so;
- **replaces `/agent/connect/proxy/:proxy_id`** as the mechanism the built-in proxy uses to read its config.
- It's not proxy specific and so works for any service.
- It has a temporary shim to call through to the existing endpoint to preserve current managed proxy config defaulting behaviour until that is removed entirely (tested).
- The built-in proxy now uses the new endpoint exclusively for it's config
- The built-in proxy now has a `-sidecar-for` flag that allows the service ID of the _target_ service to be specified, on the condition that there is exactly one "sidecar" proxy (that is one that has `Proxy.DestinationServiceID` set) for the service registered.
- Several fixes for edge cases for SidecarService
- A fix for `Alias` checks - when running locally they didn't update their state until some external thing updated the target. If the target service has no checks registered as below, then the alias never made it past critical.