This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1
When node name contains vertical bar symbol some commands output is
garbled because `|` is used as a delimiter in `columnize.SimpleFormat`.
This commit changes format string to use `\x1f` - ASCII unit
separator[1] as a delimiter and also adds test to cover this case.
Affected commands:
* `consul catalog nodes`
* `consul members`
* `consul operator raft list-peers`
* `consul intention get`
Fixes#3951.
[1]: https://en.wikipedia.org/wiki/Delimiter#Solutions
For URL maintenance reasons we store the last visited DC in
localStorage incase you come back to a page (for example settings) that
doesn't have a dc in the URL.
A problem arises here if the last DC you tried to visit is unreachable.
The first fix here clears out the last visited DC from localStorage if
the API has errored out.
Secondly, our `href-mut` helper which mutates the current current and
replaces 'parts' in the URL rather than the whole thing functioned by
detecting the current route/URL you are on an 'mutating' that. A problem
arose here as even though you might be on the `/ui/dc-1/services` URL the
actual route is the 'error' route which does not have a URL that can be
changed properly.
The second fix here uses route.currentRoute.name over route.currentRouteName.
The latter is equal to error when an error occurs whereas the former gives you the name of the route before the error happened, which is actually what we want/the intent here.
ie. when `router.currentRouteName === 'error'` then
`router.currentRoute.name === Name Of Route Before It Errored` it seems
The Session API in Consul 1.7.0 and 1.7.1 is incompatible with prior versions of Consul.
This PR adds a note to our version-specific upgrade guide to guard against users upgrading before the fix in 1.7.2 is released.
* ui: Move tomography length check inside of the partial
Previously we checked the length of tomography.distances to decide
whether to show the RTT tab or not. Previous to our ember upgrade this
would not cause a DOM reload of so many elements (i.e. all of the tab
content). Since our ember upgrade, any change to tomography (so not
necessarily the length of distances) seems to fire a change to the length (even if
the length remains the same). The knock on effect of this is that the
array of tab panels seems to be recalculated (but remain the same) and
all of the tab panels are completely re-rendered, causing the scroll of
the page to be reset.
This commit moves the check for tomography.distance.length to the lower
down with the loop, which means the array of tab panels always remains
the same, which consequently means that the entire array of tab panels
is never re-rendered entirely, and therefore fixes the issue.
This will avoid adding format=prometheus in request and to parse
more easily metrics using Prometheus.
This commit follows https://github.com/hashicorp/consul/pull/6514 as
the PR has been closed and extends it by accepting old Prometheus
mime-type.
The go-discover library supports Linode. This adds support for
discovering other Consul agents running on Linode. Consul has supported
this since [66b8c20][1] was merged, so this commit just updates the
documentation to match current features.
[1]: 66b8c20990