201 Commits

Author SHA1 Message Date
Wei Wei 04531ff0fb fix globalRPC goroutine leak
Signed-off-by: Wei Wei <weiwei.inf@gmail.com>
2017-12-05 11:53:30 +08:00
James Phillips c4bc89a187
Creates a registration mechanism for snapshot and restore. 2017-11-29 18:36:53 -08:00
James Phillips 8571555703
Begins split out of snapshots from the main FSM class. 2017-11-29 18:36:53 -08:00
James Phillips 4eaee8e0ba
Creates a registration mechanism for FSM commands. 2017-11-29 18:36:53 -08:00
James Phillips 3e7ea1931c
Moves the FSM into its own package.
This will help make it clearer what happens when we add some registration
plumbing for the different operations and snapshots.
2017-11-29 18:36:53 -08:00
James Phillips 7f3783f4be
Resolves an FSM snapshot TODO.
This adds checks for sink write calls before we continue the refactor, which
will resolve the other TODO comment we deleted as part of this change.
2017-11-29 18:36:53 -08:00
James Phillips 5a24d37ac0
Creates a registration mechanism for schemas.
This also splits out the registration into the table-specific source
files.
2017-11-29 18:36:52 -08:00
James Phillips 36bb30e67a
Creates a registration mechanism for RPC endpoints. 2017-11-29 18:36:52 -08:00
James Phillips ba56669ea8
Renames stubs to be more consistent. 2017-11-29 18:36:52 -08:00
James Phillips 56552095c9
Sheds monotonic time info so tombstone GC bins work properly. 2017-11-29 10:34:24 -08:00
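As a rough illustration of the monotonic-time issue named in commit 56552095c9: a Go time.Time returned by time.Now() carries a monotonic clock reading, and two values for the same instant compare unequal with == when only one of them has that reading. The documented way to drop it is t.Round(0). The sketch below assumes a hypothetical expireBin helper that rounds up to a granularity boundary; it is not taken from the Consul source.

```go
package main

import (
	"fmt"
	"time"
)

// expireBin rounds t up to the next multiple of gran and drops the
// monotonic clock reading so the result can be used as a map key.
// The name and signature are hypothetical.
func expireBin(t time.Time, gran time.Duration) time.Time {
	expires := t.Round(0) // Round(0) strips the monotonic reading
	if rem := expires.UnixNano() % int64(gran); rem > 0 {
		expires = expires.Add(gran - time.Duration(rem))
	}
	return expires
}

// binsMatch reports whether the same instant, taken with and without
// monotonic data, lands on an identical (==) bin key.
func binsMatch() bool {
	now := time.Now()
	return expireBin(now, time.Second) == expireBin(now.Round(0), time.Second)
}

func main() {
	fmt.Println(binsMatch())
}
```

Without the Round(0) call, the two keys could differ on platforms that provide a monotonic clock, so entries hinted for the same bin could end up under separate map keys.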
James Phillips 8656b7a3e9
Gives back the lock before writing to the expire channel.
The lock isn't needed after we clean up the expire bin, and as seen
in #3700 we can get into a deadlock waiting to place the expire index
into the channel while holding this lock.

Fixes #3700
2017-11-19 16:24:16 -08:00
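The locking pattern described in commit 8656b7a3e9 can be sketched as follows. This is a minimal illustration under assumed names (gc, hint, expire, expireCh are hypothetical), not the actual tombstone GC code: the mutex is released once the bin has been cleaned up, and only then is the index sent on the channel, which may block.

```go
package main

import (
	"fmt"
	"sync"
)

// gc is a hypothetical stand-in for a tombstone garbage collector.
type gc struct {
	mu       sync.Mutex
	bins     map[int]uint64 // bin id -> highest index seen
	expireCh chan uint64
}

// hint records an index for a bin; it needs the lock.
func (g *gc) hint(bin int, index uint64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if index > g.bins[bin] {
		g.bins[bin] = index
	}
}

// expire removes a bin under the lock, then releases the lock
// before the potentially blocking channel send.
func (g *gc) expire(bin int) {
	g.mu.Lock()
	index, ok := g.bins[bin]
	delete(g.bins, bin)
	g.mu.Unlock()

	if ok {
		g.expireCh <- index
	}
}

func run() uint64 {
	g := &gc{bins: make(map[int]uint64), expireCh: make(chan uint64)}
	g.hint(1, 42)
	go g.expire(1)
	// This call needs the lock. If expire held the lock while waiting to
	// send, this goroutine could block here and never reach the receive.
	g.hint(2, 7)
	return <-g.expireCh
}

func main() {
	fmt.Println(run()) // prints 42
}
```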
James Phillips 8210523b1b
Moves the LAN event handler after the router is created.
Fixes #3680
2017-11-10 12:26:48 -08:00
James Phillips bfbbfb62ca
Revert "Adds a small sleep to make sure we are in the next GC bucket." 2017-11-08 22:18:37 -08:00
James Phillips d6328a5bf8
Adds a sleep to make sure we are in the next GC bucket, ups time.
Fixes #3670
2017-11-08 22:02:40 -08:00
James Phillips 91824375be
Skips the tombstone GC test in Travis for now.
Related to #3670
2017-11-08 20:14:20 -08:00
James Phillips b94ba8aeb4
Removes bogus getPort() in favor of freeport. 2017-11-08 19:55:50 -08:00
James Phillips 444a345a3a
Tightens timing up and reorders GC test to be less flaky. 2017-11-08 15:09:29 -08:00
James Phillips e00624425b
Doubles the GC timing. 2017-11-08 15:01:11 -08:00
James Phillips 8eb91777d9
Opens up test timing a little more. 2017-11-08 14:01:19 -08:00
James Phillips d45c2a01f1
Shifts off a gran boundary to help make test less flaky. 2017-11-08 13:57:17 -08:00
James Phillips 757e353334
Opens up the tombstone GC test timing. 2017-11-08 13:43:39 -08:00
Kyle Havlovitz 068ca11eb8
Move check definition to a sub-struct 2017-11-01 14:54:46 -07:00
Kyle Havlovitz bc3ba5f873
Merge branch 'master' into esm-changes 2017-11-01 11:37:48 -07:00
Kyle Havlovitz 83524f44c4
Merge pull request #3622 from hashicorp/coordinate-node-endpoint
agent: add /v1/coordinate/node/:node endpoint
2017-11-01 11:35:50 -07:00
Kyle Havlovitz 9909b661ac
Fill out the tests around coordinate/node functionality 2017-10-31 15:36:44 -07:00
Kyle Havlovitz fd4d9f1c16
Factor out registerNodes function 2017-10-31 13:34:49 -07:00
James Phillips c6e0366c02
Relaxes Autopilot promotion logic. (#3623)
* Relaxes Autopilot promotion logic.

When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611

* Gets rid of unnecessary extra not-a-voter check.
2017-10-31 15:16:56 -05:00
Kyle Havlovitz 496dd7ab5b
Merge branch 'coordinate-node-endpoint' of github.com:hashicorp/consul into esm-changes 2017-10-26 19:20:24 -07:00
Kyle Havlovitz f80e70271d
Added Coordinate.Node rpc endpoint and client api method 2017-10-26 19:16:40 -07:00
Kyle Havlovitz 84a07ea113
Expose SkipNodeUpdate field and some health check info in the http api 2017-10-25 19:37:30 +02:00
Frank Schroeder 74859ff3c0 test: replace porter tool with freeport lib
This patch removes the porter tool which hands out free ports from a
given range with a library which does the same thing. The challenge for
acquiring free ports in concurrent go test runs is that go packages are
tested concurrently and run in separate processes. There has to be some
inter-process synchronization in preventing processes allocating the
same ports.

freeport allocates blocks of ports from a range expected to be not in
heavy use and implements a system-wide mutex by binding to the first
port of that block for the lifetime of the application. Ports are then
provided sequentially from that block and are tested on localhost before
being returned as available.
2017-10-21 22:01:09 +02:00
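The allocation scheme described in commit 74859ff3c0 can be approximated with the sketch below. It is not the freeport library itself: the block, reserve, and take names, and the port range used in main, are assumptions made for illustration. A listener bound to the first port of a block serves as the cross-process lock, and later ports are probed on localhost before being handed out. The sketch is not safe for concurrent use within a single process.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// block is a reserved range of ports. The listener on the first port
// stays open and acts as the lock for the whole range.
type block struct {
	lock net.Listener
	next int // next candidate port
	end  int // one past the last port in the block
}

// reserve tries each block in turn and claims the first one whose
// first port can be bound.
func reserve(lowest, blockSize, numBlocks int) (*block, error) {
	for i := 0; i < numBlocks; i++ {
		first := lowest + i*blockSize
		ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", first))
		if err != nil {
			continue // block is held by someone else
		}
		return &block{lock: ln, next: first + 1, end: first + blockSize}, nil
	}
	return nil, errors.New("no free port block")
}

// take returns the next port in the block that can be bound on localhost.
func (b *block) take() (int, error) {
	for b.next < b.end {
		p := b.next
		b.next++
		ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", p))
		if err != nil {
			continue // in use, try the next one
		}
		ln.Close()
		return p, nil
	}
	return 0, errors.New("port block exhausted")
}

func main() {
	b, err := reserve(21000, 50, 20) // arbitrary example range
	if err != nil {
		fmt.Println(err)
		return
	}
	defer b.lock.Close()
	port, err := b.take()
	fmt.Println(port, err)
}
```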
Ryan Slade 6f05ea91a3 Replace time.Now().Sub(x) with time.Since(x) 2017-10-17 20:38:24 +02:00
James Phillips e9670761f9
Cleans up some drift between the OSS and Enterprise trees. 2017-10-11 15:53:07 -07:00
James Phillips d1ad538345 Makes RPC handling more robust when rolling servers. (#3561)
* Adds client-side retry for no leader errors.

This paves over the case where the client was connected to the leader
when it loses leadership.

* Adds a configurable server RPC drain time and a fail-fast path for RPCs.

When a server leaves it gets removed from the Raft configuration, so it will
never know who the new leader server ends up being. Without this we'd be
doomed to wait out the RPC hold timeout and then fail. This makes things fail
a little quicker while a server is draining, and since we added a client retry
AND since the server doing this has already shut down and left the Serf LAN,
clients should retry against some other server.

* Makes the RPC hold timeout configurable.

* Reorders struct members.

* Sets the RPC hold timeout default for test servers.

* Bumps the leave drain time up to 5 seconds.

* Robustifies retries with a simpler client-side RPC hold.

* Reverts unintended delete.
2017-10-10 15:19:50 -07:00
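The client-side retry described in commit d1ad538345 might look roughly like the sketch below. It assumes a hypothetical errNoLeader sentinel and a callWithRetry helper with a fixed backoff; the real change may select retryable errors and space out attempts differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoLeader is a hypothetical sentinel for "no leader" RPC failures.
var errNoLeader = errors.New("no cluster leader")

// callWithRetry repeats call while it fails with errNoLeader and the
// hold timeout has not been used up.
func callWithRetry(hold, backoff time.Duration, call func() error) error {
	deadline := time.Now().Add(hold)
	for {
		err := call()
		if !errors.Is(err, errNoLeader) {
			return err // success, or an error that is not retryable
		}
		if time.Now().Add(backoff).After(deadline) {
			return err // out of time, give up
		}
		time.Sleep(backoff)
	}
}

// demo simulates a server that has no leader for the first two calls.
func demo() (int, error) {
	attempts := 0
	err := callWithRetry(time.Second, time.Millisecond, func() error {
		attempts++
		if attempts < 3 {
			return errNoLeader
		}
		return nil
	})
	return attempts, err
}

func main() {
	fmt.Println(demo()) // prints 3 <nil>
}
```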
James Phillips a1db119d02 Fixes handling of stop channel and failed barrier attempts. (#3546)
* Fixes handling of stop channel and failed barrier attempts.

There were two issues here. First, we needed to not exit when there
was a timeout trying to write the barrier, because Raft might not
step down, so we'd be left as the leader but having run all the step
down actions.

Second, we didn't close over the stopCh correctly, so it was possible
to nil that out and have the leaderLoop never exit. We close over it
properly AND sequence the nil-ing of it AFTER the leaderLoop exits for
good measure, so the code is more robust.

Fixes #3545

* Cleans up based on code review feedback.

* Tweaks comments.

* Renames variables and removes comments.
2017-10-06 07:54:49 -07:00
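The second issue in commit a1db119d02 comes down to a Go detail: a receive from a nil channel blocks forever, so a loop that re-reads a struct field after it has been set to nil can never see the stop signal. A minimal sketch of the safer pattern, using hypothetical names (leader, start, stop), passes the channel into the goroutine and clears the field only after the loop has exited:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leader is a hypothetical stand-in for the server's leader state.
type leader struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

func (l *leader) start() {
	ch := make(chan struct{})
	l.stopCh = ch
	l.wg.Add(1)
	// The goroutine receives the channel as an argument, so it never
	// reads l.stopCh and is unaffected when that field is set to nil.
	go func(stop <-chan struct{}) {
		defer l.wg.Done()
		for {
			select {
			case <-stop:
				return
			case <-time.After(time.Millisecond):
				// periodic leader work would go here
			}
		}
	}(ch)
}

func (l *leader) stop() {
	close(l.stopCh)
	l.wg.Wait()    // wait for the loop to exit...
	l.stopCh = nil // ...and only then clear the field
}

func main() {
	l := &leader{}
	l.start()
	l.stop()
	fmt.Println(l.stopCh == nil) // prints true
}
```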
Kyle Havlovitz 0063516e5e
Update metric names and add a legacy config flag 2017-10-04 16:43:27 -07:00
Preetha Appan f38d20eb40 Remove extra newline 2017-10-03 15:19:31 -05:00
Preetha Appan 3c81e2db7c Only allow 'list' policies within 'key' policy definitions. Consolidated two similar tests into one and fixed alignment. 2017-10-03 15:15:56 -05:00
Preetha Appan d5acfc3982 Introduces new 'list' permission that applies to KV store recursive reads, and enforced only when opted in. 2017-10-02 17:10:21 -05:00
James Phillips 330ce87851
Gets rid of flaky clause in stats fetcher unit test.
Given how the routine is coded we can still get data so this wasn't
a reliable thing to check.
2017-09-26 20:53:06 -07:00
preetapan 783e24be64 Issue 3452 (#3500)
* Make sure that id and address are set in member created during reaping of catalog nodes that have been removed from serf

* Get address from node table in the state store rather than from service address

* Fix incorrect lookup by checkname instead of node name

* Make sure that serverlookup is called with the right address format, added unit test.

* Address code review comments

* Tweaks style stuff.
2017-09-26 20:49:41 -07:00
James Phillips 4b17c9618f
Cleans up some edge cases in TestSnapshot_Forward_Leader.
These could cause the tests to hang.
2017-09-26 14:07:28 -07:00
Preetha Appan 318d0232f7 Move Raft protocol version for list peers end point to server side, fix unit tests. This fixes #3449 2017-09-26 09:35:39 -05:00
James Phillips fcaa889116 Bumps default Raft protocol to version 3. (#3477)
* Changes default Raft protocol to 3.

* Changes numPeers() to report only voters.

This should have been there before, but it's more obvious that this
is incorrect now that we default the Raft protocol to 3, which puts
new servers in a read-only state while Autopilot waits for them to
become healthy.

* Fixes TestLeader_RollRaftServer.

* Fixes TestOperator_RaftRemovePeerByAddress.

* Fixes TestServer_*.

Relaxed the check for a given number of voter peers and instead do
a thorough check that all servers see each other in their Raft
configurations.

* Fixes TestACL_*.

These now just check for Raft replication to be set up, and don't
care about the number of voter peers.

* Fixes TestOperator_Raft_ListPeers.

* Fixes TestAutopilot_CleanupDeadServerPeriodic.

* Fixes TestCatalog_ListNodes_ConsistentRead_Fail.

* Fixes TestLeader_ChangeServerID and adjusts the conn pool to throw away
sockets when it sees io.EOF.

* Changes version to 1.0.0 in the options doc.

* Makes metrics test more deterministic with autopilot metrics possible.
2017-09-25 15:27:04 -07:00
Preetha Appan 8394ad08db Introduce Code Policy validation via sentinel, with a noop implementation 2017-09-25 13:44:55 -05:00
Frank Schröder 69a088ca85 New config parser, HCL support, multiple bind addrs (#3480)
* new config parser for agent

This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:

 * add HCL support
 * all configuration fragments in tests and for default config are
   expressed as HCL fragments
 * HCL fragments can be provided on the command line so that they
   can eventually replace the command line flags.
 * HCL/JSON fragments are parsed into a temporary Config structure
   which can be merged using reflection (all values are pointers).
   The existing merge logic of overwrite for values and append
   for slices has been preserved.
 * A single builder process generates a typed runtime configuration
   for the agent.

The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.

The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.

* Upgrade github.com/hashicorp/hcl to support int64

* improve error messages

* fix directory permission test

* Fix rtt test

* Fix ForceLeave test

* Skip performance test for now until we know what to do

* Update github.com/hashicorp/memberlist to update log prefix

* Make memberlist use the default logger

* improve config error handling

* do not fail on non-existing data-dir

* experiment with non-uniform timeouts to get a handle on stalled leader elections

* Run tests for packages separately to eliminate the spurious port conflicts

* refactor private address detection and unify approach for ipv4 and ipv6.

Fixes #2825

* do not allow unix sockets for DNS

* improve bind and advertise addr error handling

* go through builder using test coverage

* minimal update to the docs

* more coverage tests fixed

* more tests

* fix makefile

* cleanup

* fix port conflicts with external port server 'porter'

* stop test server on error

* do not run api test that change global ENV concurrently with the other tests

* Run remaining api tests concurrently

* no need for retry with the port number service

* monkey patch race condition in go-sockaddr until we understand why that fails

* monkey patch hcl decoder race condition until we understand why that fails

* monkey patch spurious errors in strings.EqualFold from here

* add test for hcl decoder race condition. Run with go test -parallel 128

* Increase timeout again

* cleanup

* don't log port allocations by default

* use base command arg parsing to format help output properly

* handle -dc deprecation case in Build

* switch autopilot.max_trailing_logs to int

* remove duplicate test case

* remove unused methods

* remove comments about flag/config value inconsistencies

* switch got and want around since the error message was misleading.

* Removes a stray debug log.

* Removes a stray newline in imports.

* Fixes TestACL_Version8.

* Runs go fmt.

* Adds a default case for unknown address types.

* Reorders and reformats some imports.

* Adds some comments and fixes typos.

* Reorders imports.

* add unix socket support for dns later

* drop all deprecated flags and arguments

* fix wrong field name

* remove stray node-id file

* drop unnecessary patch section in test

* drop duplicate test

* add test for LeaveOnTerm and SkipLeaveOnInt in client mode

* drop "bla" and add clarifying comment for the test

* split up tests to support enterprise/non-enterprise tests

* drop raft multiplier and derive values during build phase

* sanitize runtime config reflectively and add test

* detect invalid config fields

* fix tests with invalid config fields

* use different values for wan sanitization test

* drop recursor in favor of recursors

* allow dns_config.udp_answer_limit to be zero

* make sure tests run on machines with multiple ips

* Fix failing tests in a few more places by providing a bind address in the test

* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.

* Add porter to server_test.go to make tests there less flaky

* go fmt
2017-09-25 11:40:42 -07:00
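The merge rule mentioned in commit 69a088ca85 (pointer values overwrite, slices append, merged via reflection) can be sketched as below. The Config fields and the merge, strPtr, and boolPtr helpers are hypothetical and much smaller than the real configuration structure.

```go
package main

import (
	"fmt"
	"reflect"
)

// Config is a hypothetical fragment: every scalar is a pointer so that
// "not set" (nil) can be told apart from a zero value.
type Config struct {
	NodeName  *string
	Server    *bool
	RetryJoin []string
}

// merge returns a combined with b: non-nil pointers in b overwrite,
// slices in b are appended.
func merge(a, b Config) Config {
	out := a
	ov := reflect.ValueOf(&out).Elem()
	bv := reflect.ValueOf(b)
	for i := 0; i < ov.NumField(); i++ {
		of, bf := ov.Field(i), bv.Field(i)
		switch bf.Kind() {
		case reflect.Ptr:
			if !bf.IsNil() {
				of.Set(bf)
			}
		case reflect.Slice:
			// Build a fresh slice so the inputs are never aliased.
			s := reflect.MakeSlice(of.Type(), 0, of.Len()+bf.Len())
			s = reflect.AppendSlice(s, of)
			s = reflect.AppendSlice(s, bf)
			of.Set(s)
		}
	}
	return out
}

func strPtr(s string) *string { return &s }
func boolPtr(b bool) *bool    { return &b }

func main() {
	a := Config{NodeName: strPtr("a"), Server: boolPtr(true), RetryJoin: []string{"x"}}
	b := Config{NodeName: strPtr("b"), RetryJoin: []string{"y"}}
	m := merge(a, b)
	fmt.Println(*m.NodeName, *m.Server, m.RetryJoin) // prints b true [x y]
}
```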
James Phillips 268018c558
Robustifies check in TestCatalog_ListNodes_ConsistentRead_Fail test.
Fixes #3469
2017-09-13 21:22:53 -07:00
James Phillips 8be4ee766a
Revert "Manages segments list via a pointer."
This reverts commit c277a4250461443cbd63de0259e5e32766f651ea.
2017-09-07 16:37:11 -07:00
James Phillips 5008aabb62
Manages segments list via a pointer. 2017-09-07 16:21:07 -07:00
James Phillips 908f7be97f
Cleans up formatting. 2017-09-07 12:26:58 -07:00
James Phillips 02a3f3f27b
Shows the segment name in the keyring API and command output. 2017-09-07 12:17:39 -07:00
James Phillips 7c616e3768
Moves reconcile loop into segment stub. 2017-09-06 18:01:53 -07:00
James Phillips 4e34c2af06
Takes the skip out of the client check.
Without this the merge delegate won't check the segment for non-servers
a little below here.
2017-09-06 17:05:40 -07:00
James Phillips 78ac144fff Merge pull request #3447 from hashicorp/issue-3070
Skips unique node ID check for old versions of Consul.
2017-09-06 13:24:15 -07:00
James Phillips 62d9299646
Fixes incorrect comment. 2017-09-06 13:23:19 -07:00
James Phillips 031f1874d0
Pulls down some code for the check loop. 2017-09-06 13:07:42 -07:00
James Phillips 2fd9328b21
Uses the Raft configuration for the self-add skip check. 2017-09-06 13:05:51 -07:00
Preetha Appan 1eae9f1e2f Change member join reconcile step to process joining itself, to handle node IP address changes correctly when number of servers < 3 2017-09-06 13:53:01 -05:00
James Phillips 353e037c9b
Skips unique node ID check for old versions of Consul.
Fixes #3070.
2017-09-05 22:57:29 -07:00
James Phillips c629773b40
Makes the all segments query explicit, and the default for `consul members`.
James Phillips bc9780baad Adds simple rate limiting for client agent RPC calls to Consul servers. (#3440)
* Added rate limiting for agent RPC calls.
* Initializes the rate limiter based on the config.
* Adds the rate limiter into the snapshot RPC path.
* Adds unit tests for the RPC rate limiter.
* Groups the RPC limit parameters under "limits" in the config.
* Adds some documentation about the RPC limiter.
* Sends a 429 response when the rate limiter kicks in.
* Adds docs for new telemetry.
* Makes snapshot telemetry look like RPC telemetry and cleans up comments.
2017-09-01 15:02:50 -07:00
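To illustrate the behavior listed in commit bc9780baad, here is a standard-library token bucket that maps a rejected call to HTTP 429. It is a stand-in written for this example (bucket, allow, status, and the rate and burst values are all assumptions), not the limiter Consul uses.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// bucket is a simple token bucket: tokens refill at rate per second
// up to burst, and each allowed call spends one token.
type bucket struct {
	mu     sync.Mutex
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

func newBucket(rate float64, burst int, now time.Time) *bucket {
	return &bucket{rate: rate, burst: float64(burst), tokens: float64(burst), last: now}
}

// allow reports whether a call at time now may proceed.
func (b *bucket) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// status returns the HTTP status an agent endpoint would send.
func status(b *bucket, now time.Time) int {
	if !b.allow(now) {
		return http.StatusTooManyRequests // 429
	}
	return http.StatusOK
}

// demo makes three calls at the same instant and one a second later.
func demo() []int {
	t0 := time.Unix(0, 0)
	b := newBucket(1, 2, t0)
	return []int{
		status(b, t0), status(b, t0), status(b, t0),
		status(b, t0.Add(time.Second)),
	}
}

func main() {
	fmt.Println(demo()) // prints [200 200 429 200]
}
```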
Kyle Havlovitz 334e082848 Merge pull request #3431 from hashicorp/network-segments-oss 2017-09-01 10:24:58 -07:00
Kyle Havlovitz ff994e9ade
Pass listeners into setupSegments 2017-08-31 17:56:43 -07:00
Kyle Havlovitz 5cc4b32a5d
Organize segments for a cleaner split between enterprise and OSS 2017-08-31 17:39:46 -07:00
Kyle Havlovitz b77a0aa932
Fix some inconsistencies with segment logic and comments 2017-08-30 17:43:46 -07:00
Preetha Appan 0728a04dbb Wire server provider for raft layer only on protocol version 3 and above, and update changelog 2017-08-30 14:36:47 -05:00
Kyle Havlovitz 6ded43131a
Add segment addr field to tags for LAN flood joiner 2017-08-30 11:58:29 -07:00
Kyle Havlovitz 1c04f1537a
Add agent.segment interpolation to prepared queries 2017-08-30 11:58:29 -07:00
Kyle Havlovitz 107d7f6c5a
Add rpc_listener option to segment config 2017-08-30 11:58:29 -07:00
James Phillips 6a6eadd8c7
Adds open source side of network segments (feature is Enterprise-only). 2017-08-30 11:58:29 -07:00
Preetha Appan e944370cde More cleanup from code review 2017-08-30 12:31:36 -05:00
Preetha Appan a215c764cd Remove copy pasted duplicate line, update documentation. 2017-08-30 10:02:10 -05:00
Preetha Appan 5a29eb7486 Consolidate server lookup into one place and replace usages of localConsuls. 2017-08-30 09:30:33 -05:00
Preetha Appan d8fe01db4c Remove stray commented line 2017-08-30 09:30:33 -05:00
Preetha Appan ca48e7e4c2 Remove server address tracking logic from manager/router and maintain it as part of lan event listener instead. Used sync.Map to track this, and added unit tests 2017-08-30 09:30:33 -05:00
Preetha Appan b4a9d77d49 ServerAddressProvider interface also returns an error now 2017-08-30 09:30:33 -05:00
Preetha Appan edb408bc22 Use config struct to create NetworkTransport layer when setting up raft 2017-08-30 09:30:33 -05:00
Preetha Appan 01f8e469aa Implement AddressProvider and wire that up to raft transport layer to support server nodes changing their IP addresses in containerized environments 2017-08-30 09:30:33 -05:00
Frank Schroeder 62c77d70f0 build: make tests independent of build tags
When the metadata server is scanning the agents for potential servers
it is parsing the version number which the agent provided when it
joined. This version number has to conform to a certain format, i.e.
'n.n.n'. Without this version number properly set some tests fail with
error messages that disguise the root cause.

The default version number is currently set to 'unknown' in
version/version.go which does not parse and triggers the tests to fail.
The work around is to use a build tag 'consul' which will use the
version number set in version_base.go instead which has the correct
format and is set to the current release version.

In addition, some parts of the code also require the version number to
be of a certain value. Setting it to '0.0.0' for example makes some
tests pass and others fail since they don't pass the semantic check.

When using go build/install/test one has to remember to use '-tags
consul' or tests will fail with non-obvious error messages.

Using build tags makes the build process more complex and error prone
since it prevents the use of the plain go toolchain and - at least in
its current form - introduces subtle build and test issues. We should
try to eliminate build tags for anything else but platform specific
code.

This patch removes all references to specific version numbers in the
code and tests and sets the default version to '9.9.9' which is
syntactically correct and passes the semantic check. This solves the
issue of running go build/install/test without tags for the OSS build.
2017-08-30 13:40:18 +02:00
Frank Schröder 44e6b8122d acl: consolidate error handling (#3401)
The error handling of the ACL code relies on the presence of certain
magic error messages. Since the error values are sent via RPC between
older and newer consul agents we cannot just replace the magic values
with typed errors and switch to type checks since this would break
compatibility with older clients.

Therefore, this patch moves all magic ACL error messages into the acl
package and provides default error values and helper functions which
determine the type of error.
2017-08-23 16:52:48 +02:00
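The approach described in commit 44e6b8122d, keeping the message strings stable and classifying errors with helper functions, can be sketched as follows. The constant values and helper names here are assumptions for illustration, and overRPC is a hypothetical stand-in for an error that arrives from another agent as plain text.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Message strings that older and newer agents are assumed to share.
const (
	errNotFound         = "ACL not found"
	errPermissionDenied = "Permission denied"
)

// Default error values built from the shared strings.
var (
	ErrNotFound         = errors.New(errNotFound)
	ErrPermissionDenied = errors.New(errPermissionDenied)
)

// The helpers match on the message text, because an error that has
// crossed an RPC boundary is no longer the same Go value or type.
func IsErrNotFound(err error) bool {
	return err != nil && strings.Contains(err.Error(), errNotFound)
}

func IsErrPermissionDenied(err error) bool {
	return err != nil && strings.Contains(err.Error(), errPermissionDenied)
}

// overRPC simulates an error being flattened to a string and rebuilt
// on the receiving side.
func overRPC(err error) error {
	return errors.New("rpc error: " + err.Error())
}

func main() {
	remote := overRPC(ErrPermissionDenied)
	fmt.Println(remote == ErrPermissionDenied)  // prints false
	fmt.Println(IsErrPermissionDenied(remote)) // prints true
}
```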
Frank Schroeder d9e2a51887 agent: drop unused code
This code from http://github.com/hashicorp/consul/pull/3353 is no longer
required.
2017-08-22 00:02:46 +02:00
James Phillips 3518e27a76 Revert "Return 403 rather than a 404 when acls cause all results to be filter…" 2017-08-09 15:06:57 -07:00
James Phillips 91205b2cd6 Revert "Ensure that we return a permission denied only if the list of keys/en…" 2017-08-09 15:06:20 -07:00
Preetha Appan 121326161e Added unit test case to kvs_endpointtest 2017-08-09 15:50:22 -05:00
Preetha Appan d06002dc62 Ensure that we return a permission denied only if the list of keys/entries prior to filtering by ACL is non empty 2017-08-09 15:32:18 -05:00
Frank Schroeder c38dcf2d17
agent: move agent/consul/agent to agent/metadata 2017-08-09 14:36:52 +02:00
Frank Schroeder 85bdb77d90
agent: move agent/consul/servers to agent/router 2017-08-09 14:36:37 +02:00
Frank Schroeder 1d0bbfed9c
agent: move agent/consul/structs to agent/structs 2017-08-09 14:32:12 +02:00
Kyle Havlovitz 8c2e422074 Merge pull request #3369 from hashicorp/metrics-enhancements
Add support for labels/filters from go-metrics
2017-08-08 13:55:30 -07:00
Kyle Havlovitz 975ded2714
Add support for labels/filters from go-metrics 2017-08-08 01:45:10 -07:00
Preetha Appan 6bac9355fd
Use sanitized version of node name of server in NS record, and start with "server" rather than "ns" 2017-08-07 11:11:55 +02:00
Preetha Appan 7e9d683ab1
Removed a copy pasted irrelevant comment, and other code review feedback 2017-08-07 11:11:54 +02:00
Preetha Appan c38906daad
Add NS records and A records for each server. Constructs ns host names using the advertise address of the server. 2017-08-07 11:11:54 +02:00
James Phillips 803ed9a245 Adds secure introduction for the ACL replication token. (#3357)
Adds secure introduction for the ACL replication token, as well as a separate enable config for ACL replication.
2017-08-03 15:39:31 -07:00
James Phillips c31b56a03e Adds a new /v1/acl/bootstrap API (#3349) 2017-08-02 17:05:18 -07:00
Preetha Appan 307049e17f Return nil instead of empty list when returning a PermissionDenied error, updated unit test 2017-07-31 17:23:20 -05:00
Preetha Appan da29b74d03 Return 403 rather than a 404 when acls cause all results to be filtered out. This fixes #2637 2017-07-31 13:50:29 -05:00
James Phillips 8f1f762ddd Adds missing autopilot snapshot test and avoids snapshotting nil. (#3333) 2017-07-28 15:48:42 -07:00
James Phillips 6b51744ddf Adds option to prepared queries to remove empty tags. (#3330) 2017-07-26 22:46:43 -07:00
James Phillips 6e794ea1b3 Adds support for agent-side ACL token management via API instead of config files. (#3324)
* Adds token store and removes all runtime use of config for ACL tokens.
* Adds a new API for changing agent tokens on the fly.
2017-07-26 11:03:43 -07:00