Commit Graph

65 Commits

Author SHA1 Message Date
Aestek 97bb907b69 ae: use stale requests when performing full sync (#5873)
Read requests performed during anti antropy full sync currently target
the leader only. This generates a non-negligible load on the leader when
the DC is large enough and can be offloaded to the followers following
the "eventually consistent" policy for the agent state.
We switch the AE read calls to use stale requests with a small (2s)
MaxStaleDuration value and make sure we do not read too fast after a
write.
2019-06-17 18:05:47 +02:00
R.B. Boyer 9b41199585
agent: fix several data races and bugs related to node-local alias checks (#5876)
The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.

Over the course of investigation several bugs and unfortunate
interactions were corrected:

(1)

local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.

Changes:

 a) update the existing Clone method to include the Check field
 b) copy-on-write when those fields need to change rather than
    incrementally updating them in place.

This made the observed behavior occur slightly less often.

(2)

If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.

The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).

This made the observed behavior eventually go away when it did occur.

(3)

Originally I thought there were two main actions involved in the data race:

A. The act of adding the original check (from disk recovery) and its
   first health evaluation.

B. The act of the HTTP API requests coming in and resetting the local
   state when re-registering the same services and checks.

It took awhile for me to realize that there's a third action at work:

C. The goroutines associated with the original check and the later
   checks.

The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.

In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.

This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.

If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.

The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
2019-05-24 13:36:56 -05:00
Freddy 1538a738f2
Update alias checks on local add and remove 2019-04-24 12:17:06 -06:00
R.B. Boyer 91e78e00c7
fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
Aestek 2ce7240abc Register and deregisters services and their checks atomically in the local state (#5012)
Prevent race between register and deregister requests by saving them
together in the local state on registration.
Also adds more cleaning in case of failure when registering services
/ checks.
2019-03-04 09:34:05 -05:00
Aestek 5647ca2bbb [Fix] Services sometimes not being synced with acl_enforce_version_8 = false (#4771)
Fixes: https://github.com/hashicorp/consul/issues/3676

This fixes a bug were registering an agent with a non-existent ACL token can prevent other 
services registered with a good token from being synced to the server when using 
`acl_enforce_version_8 = false`.

## Background

When `acl_enforce_version_8` is off the agent does not check the ACL token validity before 
storing the service in its state.
When syncing a service registered with a missing ACL token we fall into the default error 
handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true like in the permission denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate on a map. So eventually
all services with good ACL tokens will be synced, this can however take some time and is influenced 
by the cluster size, the bigger the slower because retries are less frequent.
Having a service in this state also prevent all further sync of checks as they are done after
the services.

## Changes 

This change modify the sync process to continue even if there is an error. 
This fixes the issue described above as well as making the sync more error tolerant: if the server repeatedly refuses
a service (the ACL token could have been deleted by the time the service is synced, the servers 
were upgraded to a newer version that has more strict checks on the service definition...). 
Then all services and check that can be synced will, and those that don't will be marked as errors in 
the logs instead of blocking the whole process.
2019-01-04 10:01:50 -05:00
Paul Banks 10af44006a Proxy Config Manager (#4729)
* Proxy Config Manager

This component watches for local state changes on the agent and ensures that each service registered locally with Kind == connect-proxy has it's state being actively populated in the cache.

This serves two purposes:
 1. For the built-in proxy, it ensures that the state needed to accept connections is available in RAM shortly after registration and likely before the proxy actually starts accepting traffic.
 2. For (future - next PR) xDS server and other possible future proxies that require _push_ based config discovery, this provides a mechanism to subscribe and be notified about updates to a proxy instance's config including upstream service discovery results.

* Address review comments

* Better comments; Better delivery of latest snapshot for slow watchers; Embed Config

* Comment typos

* Add upstream Stringer for funsies
2018-10-10 16:55:34 +01:00
Paul Banks 979e1c9c94 Add -sidecar-for and new /agent/service/:service_id endpoint (#4691)
- A new endpoint `/v1/agent/service/:service_id` which is a generic way to look up the service for a single instance. The primary value here is that it:
   - **supports hash-based blocking** and so;
   - **replaces `/agent/connect/proxy/:proxy_id`** as the mechanism the built-in proxy uses to read its config.
   - It's not proxy specific and so works for any service.
   - It has a temporary shim to call through to the existing endpoint to preserve current managed proxy config defaulting behaviour until that is removed entirely (tested).
 - The built-in proxy now uses the new endpoint exclusively for it's config
 - The built-in proxy now has a `-sidecar-for` flag that allows the service ID of the _target_ service to be specified, on the condition that there is exactly one "sidecar" proxy (that is one that has `Proxy.DestinationServiceID` set) for the service registered.
 - Several fixes for edge cases for SidecarService
 - A fix for `Alias` checks - when running locally they didn't update their state until some external thing updated the target. If the target service has no checks registered as below, then the alias never made it past critical.
2018-10-10 16:55:34 +01:00
Paul Banks 92fe8c8e89 Add Proxy Upstreams to Service Definition (#4639)
* Refactor Service Definition ProxyDestination.

This includes:
 - Refactoring all internal structs used
 - Updated tests for both deprecated and new input for:
   - Agent Services endpoint response
   - Agent Service endpoint response
   - Agent Register endpoint
     - Unmanaged deprecated field
     - Unmanaged new fields
     - Managed deprecated upstreams
     - Managed new
   - Catalog Register
     - Unmanaged deprecated field
     - Unmanaged new fields
     - Managed deprecated upstreams
     - Managed new
   - Catalog Services endpoint response
   - Catalog Node endpoint response
   - Catalog Service endpoint response
 - Updated API tests for all of the above too (both deprecated and new forms of register)

TODO:
 - config package changes for on-disk service definitions
 - proxy config endpoint
 - built-in proxy support for new fields

* Agent proxy config endpoint updated with upstreams

* Config file changes for upstreams.

* Add upstream opaque config and update all tests to ensure it works everywhere.

* Built in proxy working with new Upstreams config

* Command fixes and deprecations

* Fix key translation, upstream type defaults and a spate of other subtele bugs found with ned to end test scripts...

TODO: tests still failing on one case that needs a fix. I think it's key translation for upstreams nested in Managed proxy struct.

* Fix translated keys in API registration.
≈

* Fixes from docs
 - omit some empty undocumented fields in API
 - Bring back ServiceProxyDestination in Catalog responses to not break backwards compat - this was removed assuming it was only used internally.

* Documentation updates for Upstreams in service definition

* Fixes for tests broken by many refactors.

* Enable travis on f-connect branch in this branch too.

* Add consistent Deprecation comments to ProxyDestination uses

* Update version number on deprecation notices, and correct upstream datacenter field with explanation in docs
2018-10-10 16:55:34 +01:00
Martin 6af4501a68 Use target service name instead of ID as connect proxy service name (#4620) 2018-09-05 20:33:17 +01:00
Mitchell Hashimoto 5889a3b6ff
agent: address some basic feedback 2018-07-12 09:36:11 -07:00
Mitchell Hashimoto 3177d1719d
agent/local: support local alias checks 2018-07-12 09:36:10 -07:00
Pierre Souchay 9128de5b11 Merge remote-tracking branch 'origin/master' into ACL_additional_info 2018-07-07 14:09:18 +02:00
Paul Banks 3a00574a13 Persist proxy state through agent restart 2018-06-25 12:24:08 -07:00
Mitchell Hashimoto 657c09133a
agent/local: clarify the non-risk of a full buffer 2018-06-14 09:42:10 -07:00
Mitchell Hashimoto 31b09c0674
agent/local: remove outdated comment 2018-06-14 09:42:10 -07:00
Mitchell Hashimoto fae8dc8951
agent/local: add Notify mechanism for proxy changes 2018-06-14 09:42:08 -07:00
Mitchell Hashimoto f64a002f68
agent: start/stop proxies 2018-06-14 09:42:08 -07:00
Mitchell Hashimoto 76c6849ffe
agent/local: store proxy on local state, wip, not working yet 2018-06-14 09:42:08 -07:00
Paul Banks 02ab461dae
TLS watching integrated into Service with some basic tests.
There are also a lot of small bug fixes found when testing lots of things end-to-end for the first time and some cleanup now it's integrated with real CA code.
2018-06-14 09:42:07 -07:00
Paul Banks 44afb5c699
Agent Connect Proxy config endpoint with hash-based blocking 2018-06-14 09:41:57 -07:00
Paul Banks 78e48fd547
Added connect proxy config and local agent state setup on boot. 2018-06-14 09:41:57 -07:00
Pierre Souchay 6c7f01ae73 Removed labels from new ACL denied metrics 2018-06-08 11:56:46 +02:00
Pierre Souchay 2113071ae7 Removed consul prefix from metrics as requested by @kyhavlov 2018-06-08 11:51:50 +02:00
Pierre Souchay bebf03e26e Fixed import 2018-04-18 17:09:25 +02:00
Pierre Souchay 4739b05d12 Added labels to improve new metric 2018-04-18 16:51:22 +02:00
Pierre Souchay 12f81c60ac Track calls blocked by ACLs using metrics 2018-04-17 10:17:16 +02:00
Guido Iaquinti 244fc72b05 Add package name to log output 2018-03-21 15:56:14 +00:00
Josh Soref 1dd8c378b9 Spelling (#3958)
* spelling: another

* spelling: autopilot

* spelling: beginning

* spelling: circonus

* spelling: default

* spelling: definition

* spelling: distance

* spelling: encountered

* spelling: enterprise

* spelling: expands

* spelling: exits

* spelling: formatting

* spelling: health

* spelling: hierarchy

* spelling: imposed

* spelling: independence

* spelling: inspect

* spelling: last

* spelling: latest

* spelling: client

* spelling: message

* spelling: minimum

* spelling: notify

* spelling: nonexistent

* spelling: operator

* spelling: payload

* spelling: preceded

* spelling: prepared

* spelling: programmatically

* spelling: required

* spelling: reconcile

* spelling: responses

* spelling: request

* spelling: response

* spelling: results

* spelling: retrieve

* spelling: service

* spelling: significantly

* spelling: specifies

* spelling: supported

* spelling: synchronization

* spelling: synchronous

* spelling: themselves

* spelling: unexpected

* spelling: validations

* spelling: value
2018-03-19 16:56:00 +00:00
Frank Schroeder f5a3d73b27
local state: clone check to avoid side effect 2017-10-23 10:56:05 +02:00
Frank Schroeder b36613e7ff
local state: use synchronized access to internal maps 2017-10-23 10:56:05 +02:00
Frank Schroeder f187c37c27
local state: rename Add{Check,Service}State to Set{Check,Service}State 2017-10-23 10:56:04 +02:00
Frank Schroeder 209e67b2f9
local state: move Metadata methods together 2017-10-23 10:56:04 +02:00
Frank Schroeder 9513a042be
local state: update documentation of updateSyncState 2017-10-23 10:56:04 +02:00
Frank Schroeder 2e3b72d2c3
local state: update comments 2017-10-23 10:56:04 +02:00
Frank Schroeder da604495a0
local state: address review comments
* move non-blocking notification mechanism into ae.Trigger
* move Pause/Resume into separate type
2017-10-23 10:56:04 +02:00
Frank Schroeder 32c2d1b217
local state: fix anti-entropy state tests
The anti-entropy tests relied on the side-effect of the StartSync()
method to perform a full sync instead of a partial sync. This lead to
multiple anti-entropy go routines being started unnecessary retry loops.

This change changes the behavior to perform synchronous full syncs when
necessary removing the need for all of the time.Sleep and most of the
retry loops.
2017-10-23 10:56:04 +02:00
Frank Schroeder 6b966e48ce
local state: fix test with updated error message 2017-10-23 10:56:04 +02:00
Frank Schroeder ea92ee308a
local state: tests compile 2017-10-23 10:56:03 +02:00
Frank Schroeder 7289576988
local state: replace multi-map state with structs
The state of the service and health check records was spread out over
multiple maps guarded by a single lock. Access to the maps has to happen
in a coordinated effort and the tests often violated this which made
them brittle and racy.

This patch replaces the multiple maps with a single one for both checks
and services to make the code less fragile.

This is also necessary since moving the local state into its own package
creates circular dependencies for the tests. To avoid this the tests can
no longer access internal data structures which they should not be doing
in the first place.

The tests still don't compile but this is a ncessary step in that
direction.
2017-10-23 10:56:03 +02:00
Frank Schroeder bc7571cccf
local state: move to separate package
This patch moves the local state to a separate package to further
decouple it from the agent code.

The code compiles but the tests do not yet.
2017-10-23 10:56:03 +02:00
Frank Schroeder 443fe8e4db
Revert "local state: move to separate package"
This reverts commit d447e823c63720c74bb02459a985724f035f023e.
2017-10-23 10:08:34 +02:00
Frank Schroeder 435b442c8b
Revert "local state: replace multi-map state with structs"
This reverts commit ccbae7da5bceeb2328ab7993a8badbf2e72a4597.
2017-10-23 10:08:34 +02:00
Frank Schroeder 138aa25280
Revert "local state: tests compile"
This reverts commit 1af52bf7be02d952e16e14209899a9715451f7ba.
2017-10-23 10:08:34 +02:00
Frank Schroeder 80d9df69e4
Revert "local state: fix test with updated error message"
This reverts commit e9149f64d9afb38246f9432edd806321c1eefb83.
2017-10-23 10:08:34 +02:00
Frank Schroeder ded6f79b6a
Revert "local state: fix anti-entropy state tests"
This reverts commit f8e20cd9960e19bbe61e258c445250723870816f.
2017-10-23 10:08:34 +02:00
Frank Schroeder c72d21813b
Revert "local state: address review comments"
This reverts commit 1d315075b15647db7fcd42986c9c5673cbb77a77.
2017-10-23 10:08:33 +02:00
Frank Schroeder 4177bad4f3
Revert "local state: update comments"
This reverts commit 42188164f885188e3bc8cff70ea5aeb47d633b83.
2017-10-23 10:08:33 +02:00
Frank Schroeder d1e514cedc
Revert "local state: update documentation of updateSyncState"
This reverts commit e86521e637d742bce1e460b6b960037cef3578ed.
2017-10-23 10:08:33 +02:00
Frank Schroeder 133b23fb77
Revert "local state: move Metadata methods together"
This reverts commit 9bc8127728a62beb94b28849070b6ac35c181404.
2017-10-23 10:08:33 +02:00