This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully setup as primary key.
Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:
```json
[
{
"WAN": true,
"Datacenter": "dc2",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
},
"NumNodes": 6
},
{
"WAN": false,
"Datacenter": "dc2",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"NumNodes": 8
},
{
"WAN": false,
"Datacenter": "dc1",
"Segment": "",
"Keys": {
"0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"PrimaryKeys": {
"SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
},
"NumNodes": 8
}
]
```
I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output
Fixes#3393.
Fixes#7527
I want to highlight this and explain what I think the implications are and make sure we are aware:
* `HTTPConnStateFunc` closes the connection when it is beyond the limit. `Close` does not block.
* `HTTPConnStateFuncWithDefault429Handler(10 * time.Millisecond)` blocks until the following is done (worst case):
1) `conn.SetDeadline(10*time.Millisecond)` so that
2) `conn.Write(429error)` is guaranteed to timeout after 10ms, so that the http 429 can be written and
3) `conn.Close` can happen
The implication of this change is that accepting any new connection is worst case delayed by 10ms. But only after a client reached the limit already.
This is in its own separate package so that it will be a separate test binary that runs thus isolating the go runtime from other tests and allowing accurate go routine leak checking.
This test would ideally use goleak.VerifyTestMain but that will fail 100% of the time due to some architectural things (blocking queries and net/rpc uncancellability).
This test is not comprehensive. We should enable/exercise more features and more cluster configurations. However its a start.
Partially extracted from #7547
Updates protobuf to the most recent in the 1.3.x series, and updates
golang.org/x/sys to a7d97aace0b0 because of https://github.com/shirou/gopsutil/issues/853
prevents updating to a more recent version.
This breaking change in x/sys also prevents us from getting a newer
version of x/net. In the future, if gopsutil is not patched, we may want to run a fork version of
gopsutil so that we can update both x/net and x/sys.
There was an RSA private key used for testing included in the old
version. This commit updates it to a version that does not include the
key so that the key is not detected by tools which scan the Consul
binary for private keys.
Commands run:
go get github.com/joyent/triton-go@6801d15b779f042cfd821c8a41ef80fc33af9d47
make update-vendor
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This PR adds the option to set in-memory certificates to the API client instead of requiring the certificate to be stored on disk in a file.
This allows us to define API client TLS options per Consul secret backend in Vault.
Related issue hashicorp/vault#4800