* feat: calculate retry wait time with exponential back-off
* test: add test for getWaitTime method
* feat: enforce random jitter between min value from previous iteration and current
* extract randomStagger to simplify tests and use Milliseconds to avoid float math.
* rename variables
* add test and rename comment
---------
Co-authored-by: Poonam Jadhav <poonam.jadhav@hashicorp.com>
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.
Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
Co-authored-by: kisunji <ckim@hashicorp.com>
The only function passed to SnapshotRPC today always returns a nil error, so there's no
way to exercise this bug in practice. This change is being made for correctness so that
it doesn't become a problem in the future, if we ever pass a different function to
SnapshotRPC.
As part of removing the legacy ACL system ACL upgrading and the flag for
legacy ACLs is removed from Clients.
This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read
from server nodes.
This commit also introduces a constant for the acl serf tag, to make it easier to track where
it is used.
Remove the error return, so that not handling is not reported as an
error by errcheck. It was returning the error passed as an arg
unmodified so there is no reason to return the same value that was
passed in.
Remove the term upstreams to remove any confusion with the term used in
service mesh.
Remove the AutoDisable field, and replace it with the TTL value, using 0
to indicate the setting is turned off.
Replace "not Before" with "After".
Add some test coverage to show the behaviour is still correct.
Also accept an RPCInfo instead of interface{}. Accepting an interface
lead to a bug where the caller was expecting the arg to be the response
when in fact it was always passed the request. By accepting RPCInfo
it should indicate that this is actually the request value.
One caller of canRetry already passed an RPCInfo, the second handles
the type assertion before calling canRetry.
I believe this commit also fixes a bug. Previously RPCMaxConnsPerClient was not being re-read from the RuntimeConfig, so passing it to Server.ReloadConfig was never changing the value.
Also improve the test runtime by not doing a lot of unnecessary work.
In an upcoming change we will need to pass a grpc.ClientConnPool from
BaseDeps into Server. While looking at that change I noticed all of the
existing consulOption fields are already on BaseDeps.
Instead of duplicating the fields, we can create a struct used by
agent/consul, and use that struct in BaseDeps. This allows us to pass
along dependencies without translating them into different
representations.
I also looked at moving all of BaseDeps in agent/consul, however that
created some circular imports. Resolving those cycles wouldn't be too
bad (it was only an error in agent/consul being imported from
cache-types), however this change seems a little better by starting to
introduce some structure to BaseDeps.
This change is also a small step in reducing the scope of Agent.
Also remove some constants that were only used by tests, and move the
relevant comment to where the live configuration is set.
Removed some validation from NewServer and NewClient, as these are not
really runtime errors. They would be code errors, which will cause a
panic anyway, so no reason to handle them specially here.
In all cases (oss/ent, client/server) this method was returning a value from config. Since the
value is consistent, it doesn't need to be part of the delegate interface.
In the past TLS usage was enforced with these variables, but these days
this decision is made by TLSConfigurator and there is no reason to keep
using the variables.
The version field has been used to decide which multiplexing to use. It
was introduced in 2457293dceec95ecd12ef4f01442e13710ea131a. But this is
6y ago and there is no need for this differentiation anymore.
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
* Renamed structs.IntentionWildcard to structs.WildcardSpecifier
* Refactor ACL Config
Get rid of remnants of enterprise only renaming.
Add a WildcardName field for specifying what string should be used to indicate a wildcard.
* Add wildcard support in the ACL package
For read operations they can call anyAllowed to determine if any read access to the given resource would be granted.
For write operations they can call allAllowed to ensure that write access is granted to everything.
* Make v1/agent/connect/authorize namespace aware
* Update intention ACL enforcement
This also changes how intention:read is granted. Before the Intention.List RPC would allow viewing an intention if the token had intention:read on the destination. However Intention.Match allowed viewing if access was allowed for either the source or dest side. Now Intention.List and Intention.Get fall in line with Intention.Matches previous behavior.
Due to this being done a few different places ACL enforcement for a singular intention is now done with the CanRead and CanWrite methods on the intention itself.
* Refactor Intention.Apply to make things easier to follow.
Ensure we close the Sentinel Evaluator so as not to leak go routines
Fix a bunch of test logging so that various warnings when starting a test agent go to the ltest logger and not straight to stdout.
Various canned ent meta types always return a valid pointer (no more nils). This allows us to blindly deref + assign in various places.
Update ACL index tracking to ensure oss -> ent upgrades will work as expected.
Update ent meta parsing to include function to disallow wildcarding.
* ACL Authorizer overhaul
To account for upcoming features every Authorization function can now take an extra *acl.EnterpriseAuthorizerContext. These are unused in OSS and will always be nil.
Additionally the acl package has received some thorough refactoring to enable all of the extra Consul Enterprise specific authorizations including moving sentinel enforcement into the stubbed structs. The Authorizer funcs now return an acl.EnforcementDecision instead of a boolean. This improves the overall interface as it makes multiple Authorizers easily chainable as they now indicate whether they had an authoritative decision or should use some other defaults. A ChainedAuthorizer was added to handle this Authorizer enforcement chain and will never itself return a non-authoritative decision.
* Include stub for extra enterprise rules in the global management policy
* Allow for an upgrade of the global-management policy
This PR introduces reloading tls configuration. Consul will now be able to reload the TLS configuration which previously required a restart. It is not yet possible to turn TLS ON or OFF with these changes. Only when TLS is already turned on, the configuration can be reloaded. Most importantly the certificates and CAs.