open-consul/agent
R.B. Boyer 9b41199585
agent: fix several data races and bugs related to node-local alias checks (#5876)
The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.

Over the course of investigation several bugs and unfortunate
interactions were corrected:

(1)

local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.

Changes:

 a) update the existing Clone method to include the Check field
 b) copy-on-write when those fields need to change rather than
    incrementally updating them in place.

This made the observed behavior occur slightly less often.

(2)

If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.

The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).

This made the observed behavior eventually go away when it did occur.

(3)

Originally I thought there were two main actions involved in the data race:

A. The act of adding the original check (from disk recovery) and its
   first health evaluation.

B. The act of the HTTP API requests coming in and resetting the local
   state when re-registering the same services and checks.

It took awhile for me to realize that there's a third action at work:

C. The goroutines associated with the original check and the later
   checks.

The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.

In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.

This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.

If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.

The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
2019-05-24 13:36:56 -05:00
..
ae Add -sidecar-for and new /agent/service/:service_id endpoint (#4691) 2018-10-10 16:55:34 +01:00
cache Fixes race condition in Agent Cache (#5796) 2019-05-07 11:15:49 +01:00
cache-types Add integration test for central config; fix central config WIP (#5752) 2019-05-01 16:39:31 -07:00
checks agent: fix several data races and bugs related to node-local alias checks (#5876) 2019-05-24 13:36:56 -05:00
config Update to use a consulent build tag instead of just ent (#5759) 2019-05-01 11:11:27 -04:00
connect fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
consul Increase reliability of TestResetSessionTimerLocked_Renew 2019-05-24 13:54:51 -04:00
debug fix comment typos (#4890) 2018-11-02 12:00:39 -05:00
exec fix go vet issue 2017-10-25 19:30:35 +02:00
local agent: fix several data races and bugs related to node-local alias checks (#5876) 2019-05-24 13:36:56 -05:00
metadata New ACLs (#4791) 2018-10-19 12:04:07 -04:00
mock agent: replace docker check 2017-07-18 20:24:38 +02:00
pool Makes RPC handling more robust when rolling servers. (#3561) 2017-10-10 15:19:50 -07:00
proxycfg Connect: allow configuring Envoy for L7 Observability (#5558) 2019-04-29 17:27:57 +01:00
proxyprocess Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
router Call RemoveServer for reap events (#5317) 2019-03-04 09:19:35 -05:00
structs agent: fix several data races and bugs related to node-local alias checks (#5876) 2019-05-24 13:36:56 -05:00
systemd agent: notify systemd after JoinLAN (#2121) 2017-06-21 06:43:55 +02:00
token ACL Token Persistence and Reloading (#5328) 2019-02-27 14:28:31 -05:00
xds Connect: allow configuring Envoy for L7 Observability (#5558) 2019-04-29 17:27:57 +01:00
acl.go New ACLs (#4791) 2018-10-19 12:04:07 -04:00
acl_endpoint.go ACL Token ID Initialization (#5307) 2019-04-30 11:45:36 -04:00
acl_endpoint_legacy.go New ACLs (#4791) 2018-10-19 12:04:07 -04:00
acl_endpoint_legacy_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
acl_endpoint_test.go ACL Token ID Initialization (#5307) 2019-04-30 11:45:36 -04:00
acl_test.go ACL Token Persistence and Reloading (#5328) 2019-02-27 14:28:31 -05:00
agent.go agent: Improve startup message to avoid confusing users when no error occurs (#5896) 2019-05-24 16:50:18 +02:00
agent_endpoint.go Ensure ServiceName is populated correctly for agent service checks 2019-04-30 19:00:57 -04:00
agent_endpoint_test.go Change log line used for verification 2019-05-21 17:07:06 -06:00
agent_test.go agent: fix several data races and bugs related to node-local alias checks (#5876) 2019-05-24 13:36:56 -05:00
bindata_assetfs.go Release v1.5.1 2019-05-22 20:19:12 +00:00
blacklist.go Adds the ability to blacklist specific HTTP endpoints. (#3252) 2017-07-10 13:51:25 -07:00
blacklist_test.go Adds the ability to blacklist specific HTTP endpoints. (#3252) 2017-07-10 13:51:25 -07:00
catalog_endpoint.go Support multiple tags for health and catalog http api endpoints (#4717) 2018-10-11 12:50:05 +01:00
catalog_endpoint_test.go Implement data filtering of some endpoints (#5579) 2019-04-16 12:00:15 -04:00
check.go Decouple the code that executes checks from the agent 2017-10-25 11:18:07 +02:00
config.go Make a few config entry endpoints return 404s and allow for snake_case and lowercase key names. (#5748) 2019-04-30 18:19:19 -04:00
config_endpoint.go Centralized Config CLI (#5731) 2019-04-30 16:27:16 -07:00
config_endpoint_test.go Centralized Config CLI (#5731) 2019-04-30 16:27:16 -07:00
connect_auth.go fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
connect_ca_endpoint.go Fix CA pruning when CA config uses string durations. (#4669) 2018-09-13 15:43:00 +01:00
connect_ca_endpoint_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
coordinate_endpoint.go Merge pull request #3885 from eddsteel/support-options-requests 2018-03-16 09:20:16 -05:00
coordinate_endpoint_test.go Add retries around `obj` 2019-05-21 13:36:52 -05:00
dns.go Add support for DNS config hot-reload (#4875) 2019-04-24 14:11:54 -04:00
dns_test.go Add fmt and vet (#5671) 2019-04-25 12:26:33 -04:00
enterprise_delegate_oss.go Update to use a consulent build tag instead of just ent (#5759) 2019-05-01 11:11:27 -04:00
event_endpoint.go Fixes memory leak when blocking on /event/list (#4482) 2018-08-02 14:54:48 +01:00
event_endpoint_test.go Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
health_endpoint.go Filter non-passing nodes without modifying cache 2019-04-16 10:29:34 -06:00
health_endpoint_test.go Implement data filtering of some endpoints (#5579) 2019-04-16 12:00:15 -04:00
http.go Make a few config entry endpoints return 404s and allow for snake_case and lowercase key names. (#5748) 2019-04-30 18:19:19 -04:00
http_oss.go Add HTTP endpoints for config entry management (#5718) 2019-04-29 18:08:09 -04:00
http_oss_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
http_test.go Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
intentions_endpoint.go Deferred updating response meta with consul headers (#5355) 2019-02-19 11:45:36 +00:00
intentions_endpoint_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
keyring.go agent: move agent/consul/structs to agent/structs 2017-08-09 14:32:12 +02:00
keyring_test.go Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
kvs_endpoint.go Support OPTIONS requests 2018-02-12 10:15:31 -08:00
kvs_endpoint_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
notify.go Fixes memory leak when blocking on /event/list (#4482) 2018-08-02 14:54:48 +01:00
notify_test.go Fixes memory leak when blocking on /event/list (#4482) 2018-08-02 14:54:48 +01:00
operator_endpoint.go Support OPTIONS requests 2018-02-12 10:15:31 -08:00
operator_endpoint_test.go Merge pull request #5376 from hashicorp/fix-tests 2019-04-04 17:09:32 -04:00
prepared_query_endpoint.go Support Agent Caching for Service Discovery Results (#4541) 2018-10-10 16:55:34 +01:00
prepared_query_endpoint_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
remote_exec.go Decouple the code that executes checks from the agent 2017-10-25 11:18:07 +02:00
remote_exec_test.go Add fmt and vet (#5671) 2019-04-25 12:26:33 -04:00
retry_join.go agent: configure k8s go-discover 2018-09-05 13:38:13 -07:00
retry_join_test.go fix remaining CI failures after Go 1.12.1 Upgrade (#5576) 2019-03-29 16:29:27 +01:00
service_manager.go Add integration test for central config; fix central config WIP (#5752) 2019-05-01 16:39:31 -07:00
service_manager_test.go Add integration test for central config; fix central config WIP (#5752) 2019-05-01 16:39:31 -07:00
session_endpoint.go Support OPTIONS requests 2018-02-12 10:15:31 -08:00
session_endpoint_test.go tests: actually have TestSessionTTLRenew sleep during execution (#5669) 2019-04-17 15:52:23 -05:00
sidecar_service.go fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
sidecar_service_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
signal_unix.go cli: forward SIGTERM to child process of 'lock' and 'watch' subcommands (#4737) 2018-10-02 15:57:21 -05:00
signal_windows.go cli: forward SIGTERM to child process of 'lock' and 'watch' subcommands (#4737) 2018-10-02 15:57:21 -05:00
snapshot_endpoint.go agent: consolidate handling of 405 Method Not Allowed (#3405) 2017-09-25 23:11:19 -07:00
snapshot_endpoint_test.go add wait to TestSnapshot 2019-02-22 17:34:45 -05:00
status_endpoint.go Support OPTIONS requests 2018-02-12 10:15:31 -08:00
status_endpoint_test.go Pass a testing.T into NewTestAgent and TestAgent.Start (#5342) 2019-02-14 10:59:14 -05:00
testagent.go agent: fix several data races and bugs related to node-local alias checks (#5876) 2019-05-24 13:36:56 -05:00
testagent_test.go New config parser, HCL support, multiple bind addrs (#3480) 2017-09-25 11:40:42 -07:00
translate_addr.go New config parser, HCL support, multiple bind addrs (#3480) 2017-09-25 11:40:42 -07:00
txn_endpoint.go Add fmt and vet (#5671) 2019-04-25 12:26:33 -04:00
txn_endpoint_test.go Add fmt and vet (#5671) 2019-04-25 12:26:33 -04:00
ui_endpoint.go Implement data filtering of some endpoints (#5579) 2019-04-16 12:00:15 -04:00
ui_endpoint_test.go Implement data filtering of some endpoints (#5579) 2019-04-16 12:00:15 -04:00
user_event.go Spelling (#3958) 2018-03-19 16:56:00 +00:00
user_event_test.go Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
util.go cli: forward SIGTERM to child process of 'lock' and 'watch' subcommands (#4737) 2018-10-02 15:57:21 -05:00
util_test.go Move internal/ to sdk/ (#5568) 2019-03-27 08:54:56 -04:00
watch_handler.go Move the watch package into the api module (#5664) 2019-04-26 12:33:01 -04:00
watch_handler_test.go Move the watch package into the api module (#5664) 2019-04-26 12:33:01 -04:00