If multiple instances of a service are co-located on the same node then
their proxies will all share a cache entry for their resolved service
configuration. This is because the cache key contains the name of the
watched service but does not take into account the ID of the watching
proxies.
This means that there will be multiple agent service manager watches
that can wake up on the same cache update. These watchers then
concurrently modify the value in the cache when merging the resolved
config into the local proxy definitions.
To avoid this concurrent map write we will only delete the key from
opaque config in the local proxy definition after the merge, rather
than from the cached value before the merge.
This PR replaces the original boolean used to configure transparent
proxy mode. It was replaced with a string mode that can be set to:
- "": Empty string is the default for when the setting should be
defaulted from other configuration like config entries.
- "direct": Direct mode is how applications originally opted into the
mesh. Proxy listeners need to be dialed directly.
- "transparent": Transparent mode enables configuring Envoy as a
transparent proxy. Traffic must be captured and redirected to the
inbound and outbound listeners.
This PR also adds a struct for transparent proxy specific configuration.
Initially this is not stored as a pointer. Will revisit that decision
before GA.
Synthetic upstreams from service-defaults config are stored locally in
the Upstreams list. Since these come from service-defaults they should
be cleaned up locally when no longer present in the service config
response.
This is needed in case the client proxy is in TransparentProxy mode.
Typically they won't have explicit configuration for every upstream, so
this ensures the settings can be applied to all of them when generating
xDS config.
Previously the ServiceManager had to run a separate goroutine so that it could block on a channel
send/receive instead of a lock. Using this mutex with TryLock allows us to cancel the lock when
the serviceConfigWatch is stopped.
Without this change removing the ServiceManager.Start goroutine would not be possible because
when AddService is called it acquires the stateLock. While that lock is held, if there are
existing watches for the service, the old watch will be stopped, and the goroutine holding the
lock will attempt to wait for that watcher goroutine to exit.
If the goroutine is handling an update (serviceConfigWatch.handleUpdate) then it can block on
acquiring the stateLock and deadlock the agent. With this change the context is cancelled as
and the goroutine will exit instead of waiting on the stateLock.
The ServiceManager.Start goroutine was used to serialize calls to
agent.addServiceInternal.
All the goroutines which sent events to the channel would block waiting
for a response from that same goroutine, which is effectively the same
as a synchronous call without any channels.
This commit removes the goroutine and channels, and instead calls
addServiceInternal directly. Since all of these goroutines will need to
take the agent.stateLock, the mutex handles the serializing of calls.
Move the field into the struct for addServiceLocked. Also don't require
setting a default value, so that the callers can leave it as nil if they
don't already have a snapshot.
Replace with the existing AddServiceRequest struct. These structs are
almost identical. Additionally, the only reason the serviceRegistration
struct existed was to recreate an AddServiceRequest.
By storing and re-using the AddServiceRequest we remove the need to
translate into one type and back to the original type.
We also remove the extra parameters to a function, because those values
are already available from the AddServiceRequest field.
Also a minor optimization to only call tokens.AgentToken() when
necessary. Previous it was being called every time, but the value was
being ignored if the AddServiceRequest had a token.
Handle the decision to use ServiceManager in a single place. Instead of
calling ServiceManager.AddService, then calling back into
addServiceInternal, only call ServiceManager.AddService if we are going
to use it.
This change removes some small duplication and removes a branch from the
AddService flow.
The temprorary variables make it much harder to trace where and how struct
fields are used. If a field is only used a small number of times than
refer to the field directly.
The method is only used in tests, and only exists for legacy calls.
There was one other package which used this method in tests. Export
the AddServiceRequest and a couple of its fields so the new function can
be used in those tests.
Passing the channel to the function which uses it significantly
reduces the scope of the variable, and makes its usage more explicit. It
also moves the initialization of the channel closer to where it is used.
Also includes a couple very small cleanups to remove a local var and
read the error from `ctx.Err()` directly instead of creating a channel
to check for an error.
This field was always read by the same function that populated the field,
so it does not need to be a field. Passing the value as an argument to
functions makes it more obvious where the value comes from, and also reduces
the scope of the variable significantly.
[The documentation for context](https://golang.org/pkg/context/)
recommends not storing context in a struct field:
> Do not store Contexts inside a struct type; instead, pass a Context
> explicitly to each function that needs it. The Context should be the
> first parameter, typically named ctx...
Sometimes there are good reasons to not follow this recommendation, but
in this case it seems easy enough to follow.
Also moved the ctx argument to be the first in one of the function calls
to follow the same recommendation.
Blocking queries issues will still be uncancellable (that cannot be helped until we get rid of net/rpc). However this makes it so that if calling getWithIndex (like during a cache Notify go routine) we can cancell the outer routine. Previously it would keep issuing more blocking queries until the result state actually changed.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1
Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
Also:
* Finished threading replaceExistingChecks setting (from GH-4905)
through service manager.
* Respected the original configSource value that was used to register a
service or a check when restoring persisted data.
* Run several existing tests with and without central config enabled
(not exhaustive yet).
* Switch to ioutil.ReadFile for all types of agent persistence.
* Add integration test for central config; fix central config WIP
* Add integration test for central config; fix central config WIP
* Set proxy protocol correctly and begin adding upstream support
* Add upstreams to service config cache key and start new notify watcher if they change.
This doesn't update the tests to pass though.
* Fix some merging logic get things working manually with a hack (TODO fix properly)
* Simplification to not allow enabling sidecars centrally - it makes no sense without upstreams anyway
* Test compile again and obvious ones pass. Lots of failures locally not debugged yet but may be flakes. Pushing up to see what CI does
* Fix up service manageer and API test failures
* Remove the enable command since it no longer makes much sense without being able to turn on sidecar proxies centrally
* Remove version.go hack - will make integration test fail until release
* Remove unused code from commands and upstream merge
* Re-bump version to 1.5.0