This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1
Fixes#7231. Before an agent would always emit a warning when there is
an encrypt key in the configuration and an existing keyring stored,
which is happening on restart.
Now it only emits that warning when the encrypt key from the
configuration is not part of the keyring.
* Add CreateCSRWithSAN
* Use CreateCSRWithSAN in auto_encrypt and cache
* Copy DNSNames and IPAddresses to cert
* Verify auto_encrypt.sign returns cert with SAN
* provide configuration options for auto_encrypt dnssan and ipsan
* rename CreateCSRWithSAN to CreateCSR
* Use consts for well known tagged adress keys
* Add ipv4 and ipv6 tagged addresses for node lan and wan
* Add ipv4 and ipv6 tagged addresses for service lan and wan
* Use IPv4 and IPv6 address in DNS
* Increase number to test ignore. Consul Enterprise has more flags and since we are trying to reduce the differences between both code bases, we are increasing the number in oss. The semantics don't change, it is just a cosmetic thing.
* Introduce agent.initEnterprise for enterprise related hooks.
* Sync test with ent version.
* Fix import order.
* revert error wording.
A check may be set to become passing/critical only if a specified number of successive
checks return passing/critical in a row. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
Also:
* Finished threading replaceExistingChecks setting (from GH-4905)
through service manager.
* Respected the original configSource value that was used to register a
service or a check when restoring persisted data.
* Run several existing tests with and without central config enabled
(not exhaustive yet).
* Switch to ioutil.ReadFile for all types of agent persistence.
The embedded `Server` field on a `DNSServer` is only set inside of the
`ListenAndServe` method. If that method fails for reasons like the
address being in use and is not bindable, then the `Server` field will
not be set and the overall `Agent.Start()` will fail.
This will trigger the inner loop of `TestAgent.Start()` to invoke
`ShutdownEndpoints` which will attempt to pretty print the DNS servers
using fields on that inner `Server` field. Because it was never set,
this causes a nil pointer dereference and crashes the test.
This fixes pathological cases where the write throughput and snapshot size are both so large that more than 10k log entries are written in the time it takes to restore the snapshot from disk. In this case followers that restart can never catch up with leader replication again and enter a loop of constantly downloading a full snapshot and restoring it only to find that snapshot is already out of date and the leader has truncated its logs so a new snapshot is sent etc.
In general if you need to adjust this, you are probably abusing Consul for purposes outside its design envelope and should reconsider your usage to reduce data size and/or write volume.
* Support for maximum size for Output of checks
This PR allows users to limit the size of output produced by checks at the agent
and check level.
When set at the agent level, it will limit the output for all checks monitored
by the agent.
When set at the check level, it can override the agent max for a specific check but
only if it is lower than the agent max.
Default value is 4k, and input must be at least 1.
* Improve startup message to avoid confusing users when no error occurs
Several times, some users not very familiar with Consul get confused
by error message at startup:
`[INFO] agent: (LAN) joined: 1 Err: <nil>`
Having `Err: <nil>` seems weird to many users, I propose to have the
following instead:
* Success: `[INFO] agent: (LAN) joined: 1`
* Error: `[WARN] agent: (LAN) couldn't join: %d Err: ERROR`
Roles are named and can express the same bundle of permissions that can
currently be assigned to a Token (lists of Policies and Service
Identities). The difference with a Role is that it not itself a bearer
token, but just another entity that can be tied to a Token.
This lets an operator potentially curate a set of smaller reusable
Policies and compose them together into reusable Roles, rather than
always exploding that same list of Policies on any Token that needs
similar permissions.
This also refactors the acl replication code to be semi-generic to avoid
3x copypasta.
* Move the watch package into the api module
It was already just a thin wrapper around the API anyways. The biggest change was to the testing. Instead of using a test agent directly from the agent package it now uses the binary on the PATH just like the other API tests.
The other big changes were to fix up the connect based watch tests so that we didn’t need to pull in the connect package (and therefore all of Consul)