[Fix] Services sometimes not being synced with acl_enforce_version_8 = false (#4771)

Fixes: https://github.com/hashicorp/consul/issues/3676 This fixes a bug were registering an agent with a non-existent ACL token can prevent other services registered with a good token from being synced to the server when using `acl_enforce_version_8 = false`. ## Background When `acl_enforce_version_8` is off the agent does not check the ACL token validity before storing the service in its state. When syncing a service registered with a missing ACL token we fall into the default error handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255) and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082) without setting its Synced property to true like in the permission denied case. This means that the sync will always stop at the faulty service(s). The order in which the services are synced is random since we iterate on a map. So eventually all services with good ACL tokens will be synced, this can however take some time and is influenced by the cluster size, the bigger the slower because retries are less frequent. Having a service in this state also prevent all further sync of checks as they are done after the services. ## Changes This change modify the sync process to continue even if there is an error. This fixes the issue described above as well as making the sync more error tolerant: if the server repeatedly refuses a service (the ACL token could have been deleted by the time the service is synced, the servers were upgraded to a newer version that has more strict checks on the service definition...). Then all services and check that can be synced will, and those that don't will be marked as errors in the logs instead of blocking the whole process.
2019-01-04 16:01:50 +01:00 · 2019-01-04 16:01:50 +01:00 · 5647ca2bbb
parent 2254c4000e
commit 5647ca2bbb
1 changed files with 5 additions and 5 deletions
--- a/agent/local/state.go
+++ b/agent/local/state.go
@ -1210,7 +1210,7 @@ func (l *State) deleteService(id string) error {
 		l.logger.Printf("[INFO] agent: Deregistered service %q", id)
 		return nil

-	case acl.IsErrPermissionDenied(err):
+	case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
 		// todo(fs): mark the service to be in sync to prevent excessive retrying before next full sync
 		// todo(fs): some backoff strategy might be a better solution
 		l.services[id].InSync = true
@ -1248,7 +1248,7 @@ func (l *State) deleteCheck(id types.CheckID) error {
 		l.logger.Printf("[INFO] agent: Deregistered check %q", id)
 		return nil

-	case acl.IsErrPermissionDenied(err):
+	case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
 		// todo(fs): mark the check to be in sync to prevent excessive retrying before next full sync
 		// todo(fs): some backoff strategy might be a better solution
 		l.checks[id].InSync = true
@ -1316,7 +1316,7 @@ func (l *State) syncService(id string) error {
 		l.logger.Printf("[INFO] agent: Synced service %q", id)
 		return nil

-	case acl.IsErrPermissionDenied(err):
+	case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
 		// todo(fs): mark the service and the checks to be in sync to prevent excessive retrying before next full sync
 		// todo(fs): some backoff strategy might be a better solution
 		l.services[id].InSync = true
@ -1365,7 +1365,7 @@ func (l *State) syncCheck(id types.CheckID) error {
 		l.logger.Printf("[INFO] agent: Synced check %q", id)
 		return nil

-	case acl.IsErrPermissionDenied(err):
+	case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
 		// todo(fs): mark the check to be in sync to prevent excessive retrying before next full sync
 		// todo(fs): some backoff strategy might be a better solution
 		l.checks[id].InSync = true
@ -1397,7 +1397,7 @@ func (l *State) syncNodeInfo() error {
 		l.logger.Printf("[INFO] agent: Synced node info")
 		return nil

-	case acl.IsErrPermissionDenied(err):
+	case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
 		// todo(fs): mark the node info to be in sync to prevent excessive retrying before next full sync
 		// todo(fs): some backoff strategy might be a better solution
 		l.nodeInfoInSync = true