* Display a warning when rpc.enable_streaming = true is set on a client
This option has no effect when running as an agent
* Added warning when server starts with use_streaming_backend but without rpc.enable_streaming
* Added unit test
Add a skip condition to all tests slower than 100ms.
This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests
With this change:
```
$ time go test -count=1 -short ./agent
ok github.com/hashicorp/consul/agent 0.743s
real 0m4.791s
$ time go test -count=1 -short ./agent/consul
ok github.com/hashicorp/consul/agent/consul 4.229s
real 0m8.769s
```
And remove the devMode field from builder.
This change helps make the Builder state more explicit by moving inputs to the BuilderOps struct,
leaving only fields that can change during Builder.Build on the Builder struct.
Using the LiteralSource makes it much easier to find default values, because an IDE reports
the location of a default. With an HCL string they are harder to discover.
Also removes unnecessary mapstructure.Decodes of constant values.
Added a new option `ui_config.metrics_proxy.path_allowlist`. This defaults to `["/api/v1/query", "/api/v1/query_range"]` when the metrics provider is set to `prometheus`.
Requests that do not use one of the allow-listed paths (via exact match) get a 403 Forbidden response instead.
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.
- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.
- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.
- Add a new serf feature flag indicating support for
intentions-as-config-entries.
- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.
- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.
- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition it met
the old intentions replicator ceases.
- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
it's old state store table it waits until all config entries up to that
point are replicated and then zeroes its own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
This might be better handled by allowing configuration for the InMemSink interval and retail, and disabling
the global. For now this is a smaller change to remove the goroutine leak caused by tests because go-metrics
does not provide any way of shutting down the global goroutine.
Previsouly it was done in Agent.Start, which is much later then it needs to be.
The new 'dns' package was required, because otherwise there would be an
import cycle. In the future we should move more of the dns server into
the dns package.
- unexport testing shims, and document their purpose
- resolve a TODO by moving validation to NewBuilder and storing the one
field that is used instead of all of Options
- create a slice with the correct size to avoid extra allocations
AutoConfig will generate local tokens for clients and the ability to use local tokens is gated off of token replication being enabled and being configured with a replication token. Therefore we already have a hard requirement on having token replication enabled, this commit just makes sure to surface that to the operator instead of having to discern what the issue is from RPC errors.
Ensure that enabling AutoConfig sets the tls configurator properly
This also refactors the TLS configurator a bit so the naming doesn’t imply only AutoEncrypt as the source of the automatically setup TLS cert info.
Most of the groundwork was laid in previous PRs between adding the cert-monitor package to extracting the logic of signing certificates out of the connect_ca_endpoint.go code and into a method on the server.
This also refactors the auto-config package a bit to split things out into multiple files.
This implements a solution for #7863
It does:
Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf
Add cache.entry_fetch_max_burst size of rate limit (default value = 2)
The new configuration now supports the following syntax for instance to allow 1 query every 3s:
command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
in JSON
{
"cache": {
"entry_fetch_rate": 0.333
}
}
On the servers they must have a certificate.
On the clients they just have to set verify_outgoing to true to attempt TLS connections for RPCs.
Eventually we may relax these restrictions but right now all of the settings we push down (acl tokens, acl related settings, certificates, gossip key) are sensitive and shouldn’t be transmitted over an unencrypted connection. Our guides and docs should recoommend verify_server_hostname on the clients as well.
Another reason to do this is weird things happen when making an insecure RPC when TLS is not enabled. Basically it tries TLS anyways. We should probably fix that to make it clearer what is going on.
The envisioned changes would allow extra settings to enable dynamically defined auth methods to be used instead of or in addition to the statically defined one in the configuration.
All commands which read config (agent, services, and validate) will now
print warnings when one of the config files is skipped because it did
not match an expected format.
Also ensures that config validate prints all warnings.
Previously the logic for reading ConfigFiles and produces Sources was split
between NewBuilder and Build. This commit moves all of the logic into NewBuilder
so that Build() can operate entirely on Sources.
This change is in preparation for logging warnings when files have an
unsupported extension.
It also reduces the scope of BuilderOpts, and gets us very close to removing
Builder.options.
Flags is an overloaded term in this context. It generally is used to
refer to command line flags. This struct, however, is a data object
used as input to the construction.
It happens to be partially populated by command line flags, but
otherwise has very little to do with them.
Renaming this struct should make the actual responsibility of this struct
more obvious, and remove the possibility that it is confused with
command line flags.
This change is in preparation for adding additional fields to
BuilderOpts.
This field was populated for one reason, to test that it was empty.
Of all the callers, only a single one used this functionality. The rest
constructed a `Flags{}` struct which did not set Args.
I think this shows that the logic was in the wrong place. Only the agent
command needs to care about validating the args.
This commit removes the field, and moves the logic to the one caller
that cares.
Also fix some comments.
This allows the operator to disable agent caching for the http endpoint.
It is on by default for backwards compatibility and if disabled will
ignore the url parameter `cached`.
* testing: replace most goe/verify.Values with require.Equal
One difference between these two comparisons is that go/verify considers
nil slices/maps to be equal to empty slices/maps, where as testify/require
does not, and does not appear to provide any way to enable that behaviour.
Because of this difference some expected values were changed from empty
slices to nil slices, and some calls to verify.Values were left.
* Remove github.com/pascaldekloe/goe/verify
Reduce the number of assertion packages we use from 2 to 1
Based on work done in https://github.com/hashicorp/memberlist/pull/196
this allows to restrict the IP ranges that can join a given Serf cluster
and be a member of the cluster.
Restrictions on IPs can be done separatly using 2 new differents flags
and config options to restrict IPs for LAN and WAN Serf.
This will emit warnings about the configs not doing anything but still allow them to be parsed.
This also added the warnings for enterprise fields that we already had in OSS but didn’t change their enforcement behavior. For example, attempting to use a network segment will cause a hard error in OSS.
On recent Mac OS versions, the ulimit defaults to 256 by default, but many
systems (eg: some Linux distributions) often limit this value to 1024.
On validation of configuration, Consul now validates that the number of
allowed files descriptors is bigger than http_max_conns_per_client.
This make some unit tests failing on Mac OS.
Use a less important value in unit test, so tests runs well by default
on Mac OS without need for tuning the OS.
When run in with `-dev` in DevMode, it is not possible to replace
the embeded UI with another one because `-dev` implies `-ui`.
This commit allows this an slightly change the error message
about Consul 0.7.0 which is very old and does not apply to
current version anyway.
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
Fixes#7231. Before an agent would always emit a warning when there is
an encrypt key in the configuration and an existing keyring stored,
which is happening on restart.
Now it only emits that warning when the encrypt key from the
configuration is not part of the keyring.
Something similar already happens inside of the server
(agent/consul/server.go) but by doing it in the general config parsing
for the agent we can have agent-level code rely on the PrimaryDatacenter
field, too.
Currently when using the built-in CA provider for Connect, root certificates are valid for 10 years, however secondary DCs get intermediates that are valid for only 1 year. There is no mechanism currently short of rotating the root in the primary that will cause the secondary DCs to renew their intermediates.
This PR adds a check that renews the cert if it is half way through its validity period.
In order to be able to test these changes, a new configuration option was added: IntermediateCertTTL which is set extremely low in the tests.
* Add CreateCSRWithSAN
* Use CreateCSRWithSAN in auto_encrypt and cache
* Copy DNSNames and IPAddresses to cert
* Verify auto_encrypt.sign returns cert with SAN
* provide configuration options for auto_encrypt dnssan and ipsan
* rename CreateCSRWithSAN to CreateCSR
* Use consts for well known tagged adress keys
* Add ipv4 and ipv6 tagged addresses for node lan and wan
* Add ipv4 and ipv6 tagged addresses for service lan and wan
* Use IPv4 and IPv6 address in DNS
* relax requirements for auto_encrypt on server
* better error message when auto_encrypt and verify_incoming on
* docs: explain verify_incoming on Consul clients.
* Update AWS SDK to use PCA features.
* Add AWS PCA provider
* Add plumbing for config, config validation tests, add test for inheriting existing CA resources created by user
* Unparallel the tests so we don't exhaust PCA limits
* Merge updates
* More aggressive polling; rate limit pass through on sign; Timeout on Sign and CA create
* Add AWS PCA docs
* Fix Vault doc typo too
* Doc typo
* Apply suggestions from code review
Co-Authored-By: R.B. Boyer <rb@hashicorp.com>
Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>
* Doc fixes; tests for erroring if State is modified via API
* More review cleanup
* Uncomment tests!
* Minor suggested clean ups
A check may be set to become passing/critical only if a specified number of successive
checks return passing/critical in a row. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
Previously `verify_incoming` was required when turning on `auto_encrypt.allow_tls`, but that doesn't work together with HTTPS UI in some scenarios. Adding `verify_incoming_rpc` to the allowed configurations.