Commit Graph

1506 Commits

Author SHA1 Message Date
R.B. Boyer 4c05f1f519
agent: avoid reverting any check updates that occur while a service is being added or the config is reloaded (#6144) 2019-07-17 14:06:50 -05:00
Freddy f59e6db9b1
Reduce number of servers in TestServer_Expect_NonVoters (#6155) 2019-07-17 11:35:33 -06:00
Freddy 476a4b95a5
More flaky test fixes (#6151)
* Add retry to TestAPI_ClientTxn

* Add retry to TestLeader_RegisterMember

* Account for empty watch result in ConnectRootsWatch
2019-07-17 09:33:38 -06:00
Freddy 99601aa3a7
Update retries that weren't using retry.R (#6146) 2019-07-16 14:47:45 -06:00
Sarah Adams 4afa034d6a
fix flaky test TestACLEndpoint_SecureIntroEndpoints_OnlyCreateLocalData (#6116)
* fix test to write only to dc2 (typo)
* fix retry behavior in existing test (was being used incorrectly)
2019-07-12 14:14:42 -07:00
Freddy a295d9e5db
Flaky test overhaul (#6100) 2019-07-12 09:52:26 -06:00
Freddy b6b6dbadb0
Remove dummy config (#6121) 2019-07-12 09:50:14 -06:00
Freddy 74b7bcb612
Update TestServer creation in sdk/testutil (#6084)
* Retry the creation of the test server three times.
* Reduce the retry timeout for the API wait to 2 seconds, opting to fail faster and start over.
* Remove wait for leader from server creation. This wait can be added on a test by test basis now that the function is being exported.
* Remove wait for anti-entropy sync. This is built into the existing WaitForSerfCheck func, so that can be used if the anti-entropy wait is needed
2019-07-12 09:37:29 -06:00
Freddy f5634a24e8
Clean up StatsFetcher work when context is exceeded (#6086) 2019-07-12 08:23:28 -06:00
Matt Keeler 6cc936d64b
Move ctx and cancel func setup into the Replicator.Start (#6115)
Previously a sequence of events like:

Start
Stop
Start
Stop

would segfault on the second stop because the original ctx and cancel func were only initialized during the constructor and not during Start.
2019-07-12 10:10:48 -04:00
Pierre Souchay 2e9370ba11 Bump timeout in TestManager_BasicLifecycle (#6030) 2019-07-01 17:02:00 -06:00
Sarah Christoff 8a930f7d3a
Remove failed nodes from serfWAN (#6028)
* Prune Servers from WAN and LAN

* cleaned up and fixed LAN to WAN

* moving things around

* force-leave remove from serfWAN, create pruneSerfWAN

* removed serfWAN remove, reduced complexity, fixed comments

* add another place to remove from serfWAN

* add nil check

* Update agent/consul/server.go

Co-Authored-By: Paul Banks <banks@banksco.de>
2019-06-28 12:40:07 -05:00
Hans Hasselberg 4aad3e2fb2
Release v1.5.2 2019-06-27 22:59:46 +00:00
Hans Hasselberg 73c4e9f07c
tls: auto_encrypt enables automatic RPC cert provisioning for consul clients (#5597) 2019-06-27 22:22:07 +02:00
Todd Radel 8ece11a24a connect: store signingKeyId instead of authorityKeyId (#6005) 2019-06-27 16:47:22 +02:00
Aestek 04a52a967b acl: allow service deregistration with node write permission (#5217)
With ACLs enabled if an agent is wiped and restarted without a leave
it can no longer deregister the services it had previously registered
because it no longer has the tokens the services were registered with.
To remedy that we allow service deregistration from tokens with node
write permission.
2019-06-27 14:24:34 +02:00
Akshay Ganeshen 93b8a4e8d8 dns: support alt domains for dns resolution (#5940)
this adds an option for an alt domain to be used with dns while migrating to a new consul domain.
2019-06-27 12:00:37 +02:00
Pierre Souchay ca7c7faac8 agent: added metadata information about servers into consul service description (#5455)
This allows have information about servers from HTTP APIs without
using the command line.
2019-06-26 23:46:47 +02:00
Sarah Christoff e946ed9427
ui: modify content path (#5950)
* Add ui-content-path flag

* tests complete, regex validator on string, index.html updated

* cleaning up debugging stuff

* ui: Enable ember environment configuration to be set via the go binary at runtime (#5934)

* ui: Only inject {{.ContentPath}} if we are makeing a prod build...

...otherwise we just use the current rootURL

This gets injected into a <base /> node which solves the assets path
problem but not the ember problem

* ui: Pull out the <base href=""> value and inject it into ember env

See previous commit:

The <base href=""> value is 'sometimes' injected from go at index
serve time. We pass this value down to ember by overwriting the ember
config that is injected via a <meta> tag. This has to be done before
ember bootup.

Sometimes (during testing and development, basically not production)
this is injected with the already existing value, in which case this
essentially changes nothing.

The code here is slightly abstracted away from our specific usage to
make it easier for anyone else to use, and also make sure we can cope
with using this same method to pass variables down from the CLI through
to ember in the future.

* ui: We can't use <base /> move everything to javascript (#5941)

Unfortuantely we can't seem to be able to use <base> and rootURL
together as URL paths will get doubled up (`ui/ui/`).

This moves all the things that we need to interpolate with .ContentPath
to the `startup` javascript so we can conditionally print out
`{{.ContentPath}}` in lots of places (now we can't use base)

* fixed when we serve index.html

* ui: For writing a ContentPath, we also need to cope with testing... (#5945)

...and potentially more environments

Testing has more additional things in a separate index.html in `tests/`

This make the entire thing a little saner and uses just javascriopt
template literals instead of a pseudo handbrake synatx for our
templating of these files.

Intead of just templating the entire file this way, we still only
template `{{content-for 'head'}}` and `{{content-for 'body'}}`
in this way to ensure we support other plugins/addons

* build: Loosen up the regex for retrieving the CONSUL_VERSION (#5946)

* build: Loosen up the regex for retrieving the CONSUL_VERSION

1. Previously the `sed` replacement was searching for the CONSUL_VERSION
comment at the start of a line, it no longer does this to allow for
indentation.
2. Both `grep` and `sed` where looking for the omment at the end of the
line. We've removed this restriction here. We don't need to remove it
right now, but if we ever put the comment followed by something here the
searching would break.
3. Added `xargs` for trimming the resulting version string. We aren't
using this already in the rest of the scripts, but we are pretty sure
this is available on most systems.

* ui: Fix erroneous variable, and also force an ember cache clean on build

1. We referenced a variable incorrectly here, this fixes that.
2. We also made sure that every `make` target clears ember's `tmp` cache
to ensure that its not using any caches that have since been edited
everytime we call a `make` target.

* added docs, fixed encoding

* fixed go fmt

* Update agent/config/config.go

Co-Authored-By: R.B. Boyer <public@richardboyer.net>

* Completed Suggestions

* run gofmt on http.go

* fix testsanitize

* fix fullconfig/hcl by setting correct 'want'

* ran gofmt on agent/config/runtime_test.go

* Update website/source/docs/agent/options.html.md

Co-Authored-By: Hans Hasselberg <me@hans.io>

* Update website/source/docs/agent/options.html.md

Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>

* remove contentpath from redirectFS struct
2019-06-26 11:43:30 -05:00
Pierre Souchay e394a9469b Support for maximum size for Output of checks (#5233)
* Support for maximum size for Output of checks

This PR allows users to limit the size of output produced by checks at the agent 
and check level.

When set at the agent level, it will limit the output for all checks monitored
by the agent.

When set at the check level, it can override the agent max for a specific check but
only if it is lower than the agent max.

Default value is 4k, and input must be at least 1.
2019-06-26 09:43:25 -06:00
Hans Hasselberg 0d8d7ae052
agent: transfer leadership when establishLeadership fails (#5247) 2019-06-19 14:50:48 +02:00
Aestek 97bb907b69 ae: use stale requests when performing full sync (#5873)
Read requests performed during anti antropy full sync currently target
the leader only. This generates a non-negligible load on the leader when
the DC is large enough and can be offloaded to the followers following
the "eventually consistent" policy for the agent state.
We switch the AE read calls to use stale requests with a small (2s)
MaxStaleDuration value and make sure we do not read too fast after a
write.
2019-06-17 18:05:47 +02:00
Matt Keeler 4c03f99a85
Fix CAS operations on Services (#5971)
* Fix CAS operations on services

* Update agent/consul/state/catalog_test.go

Co-Authored-By: R.B. Boyer <public@richardboyer.net>
2019-06-17 10:41:04 -04:00
Paul Banks e90fab0aec
Add rate limiting to RPCs sent within a server instance too (#5927) 2019-06-13 04:26:27 -05:00
Paul Banks 737be347eb
Upgrade xDS (go-control-plane) API to support Envoy 1.10. (#5872)
* Upgrade xDS (go-control-plane) API to support Envoy 1.10.

This includes backwards compatibility shim to work around the ext_authz package rename in 1.10.

It also adds integration test support in CI for 1.10.0.

* Fix go vet complaints

* go mod vendor

* Update Envoy version info in docs

* Update website/source/docs/connect/proxies/envoy.md
2019-06-07 07:10:43 -05:00
Pierre Souchay 1da1825056 Ensure Consul is IPv6 compliant (#5468) 2019-06-04 10:02:38 -04:00
Matt Keeler 923448f00e
Update links to envoy docs on xDS protocol (#5871) 2019-06-03 11:03:05 -05:00
R.B. Boyer 9b41199585
agent: fix several data races and bugs related to node-local alias checks (#5876)
The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.

Over the course of investigation several bugs and unfortunate
interactions were corrected:

(1)

local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.

Changes:

 a) update the existing Clone method to include the Check field
 b) copy-on-write when those fields need to change rather than
    incrementally updating them in place.

This made the observed behavior occur slightly less often.

(2)

If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.

The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).

This made the observed behavior eventually go away when it did occur.

(3)

Originally I thought there were two main actions involved in the data race:

A. The act of adding the original check (from disk recovery) and its
   first health evaluation.

B. The act of the HTTP API requests coming in and resetting the local
   state when re-registering the same services and checks.

It took awhile for me to realize that there's a third action at work:

C. The goroutines associated with the original check and the later
   checks.

The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.

In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.

This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.

If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.

The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
2019-05-24 13:36:56 -05:00
Freddy 8f5fe058ea
Increase reliability of TestResetSessionTimerLocked_Renew 2019-05-24 13:54:51 -04:00
Pierre Souchay 27207fdaed agent: Improve startup message to avoid confusing users when no error occurs (#5896)
* Improve startup message to avoid confusing users when no error occurs

Several times, some users not very familiar with Consul get confused
by error message at startup:

  `[INFO] agent: (LAN) joined: 1 Err: <nil>`

Having `Err: <nil>` seems weird to many users, I propose to have the
following instead:

* Success: `[INFO] agent: (LAN) joined: 1`
* Error:   `[WARN] agent: (LAN) couldn't join: %d Err: ERROR`
2019-05-24 16:50:18 +02:00
Freddy f7f0207f78
Run TestServer_Expect on its own (#5890) 2019-05-23 19:52:33 -04:00
Freddy e9bdb3a4f9
Flaky test: ACLReplication_Tokens (#5891)
* Exclude non-go workflows while testing

* Wait for s2 global-management policy

* Revert "Exclude non-go workflows while testing"

This reverts commit 47a83cbe9f19d0e1e475eabaa223d61fb4c56019.
2019-05-23 19:52:02 -04:00
Freddy c9e6640337
Add retries to StatsFetcherTest (#5892) 2019-05-23 19:51:31 -04:00
Jack Pearkes d9285f4b7f
Release v1.5.1 2019-05-22 20:19:12 +00:00
freddygv d133d565a5 Wait for s2 global-management policy 2019-05-21 17:58:37 -06:00
Freddy 988aedce0a
Change log line used for verification 2019-05-21 17:07:06 -06:00
Freddy 7ce28bbfee
Stop running TestLeader_ChangeServerID in parallel 2019-05-21 15:28:08 -06:00
Sarah Christoff 508759eb76 Add retries around `obj` 2019-05-21 13:36:52 -05:00
Sarah Christoff 1a03220a1a Add retries to all `obj` 2019-05-21 13:31:37 -05:00
Sarah Christoff 843fb3f374
Update agent/coordinate_endpoint_test.go
Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
2019-05-17 14:32:50 -05:00
Sarah Christoff 1ee6dd253b Update type assertion logic
Logic updated to evaluate with a boolean after the type assertion.
This allows us to check if the type assertion succeeded and be
more clear with errors.
2019-05-17 13:32:36 -05:00
Kyle Havlovitz ad24456f49
Set the dead node reclaim timer at 30s 2019-05-15 11:59:33 -07:00
Kyle Havlovitz dcbffdb956
Merge branch 'master' into change-node-id 2019-05-15 10:51:04 -07:00
Jack Pearkes ba0258e725
Release v1.5.0 2019-05-08 18:34:08 +00:00
Matt Keeler 07f2854683 Fixes race condition in Agent Cache (#5796)
* Fix race condition during a cache get

Check the entry we pulled out of the cache while holding the lock had Fetching set.
If it did then we should use the existing Waiter instead of calling fetch. The reason
this is better than just calling fetch is that fetch re-gets the entry out of the
entries map and the previous fetch may have finished. Therefore this prevents
erroneously starting a new fetch because we just missed the last update.

* Fix race condition fully

The first commit still allowed for the following scenario:

• No entry existing when checked in getWithIndex while holding the read lock
• Then by time we had reached fetch it had been created and finished.

* always use ok when returning

* comment mentioning the reading from entries.

* use cacheHit consistently
2019-05-07 11:15:49 +01:00
Matt Keeler 46956ed769
Copy the proxy config instead of direct assignment (#5786)
This prevents modifying the data in the state store which is supposed to be immutable.
2019-05-06 12:09:59 -04:00
R.B. Boyer 372bb06c83
acl: a role binding rule for a role that does not exist should be ignored (#5778)
I wrote the docs under this assumption but completely forgot to actually
enforce it.
2019-05-03 14:22:44 -05:00
R.B. Boyer 7d0f729f77
acl: enforce that you cannot persist tokens and roles with missing links except during replication (#5779) 2019-05-02 15:02:21 -05:00
Matt Keeler 26708570c5
Fix ConfigEntryResponse binary marshaller and ensure we watch the chan in ConfigEntry.Get even when no entry exists. (#5773) 2019-05-02 15:25:29 -04:00
Paul Banks df0c61fd31
Fix previous accidental master push 🤦 (#5771)
* Fix previous accidental master push 🤦

* Fix ACL test
2019-05-02 15:49:37 +01:00