- Improve resilience of testrpc.WaitForLeader()
- Add additionall retry to CI
- Increase "go test" timeout to 8m
- Add wait for cluster leader to several tests in the agent package
- Add retry to some tests in the api and command packages
- Dev mode assumed no persistence of services although proxy state is persisted which caused proxies to be killed on startup as their services were no longer registered. Fixed.
- Didn't snapshot the ProxyID which meant that proxies were adopted OK from snapshot but failed to restart if they died since there was no proxyID in the ENV on restart
- Dev mode with no persistence just kills all proxies on shutdown since it can't recover them later
- Naming things
This turns out to have a lot more subtelty than we accounted for. The test suite is especially prone to races now we can only poll the child and many extra levels of indirectoin are needed to correctly run daemon process without it becoming a Zombie.
I ran this test suite in a loop with parallel enabled to verify for races (-race doesn't find any as they are logical inter-process ones not actual data races). I made it through ~50 runs before hitting an error due to timing which is much better than before. I want to go back and see if we can do better though. Just getting this up.
- Includes some bug fixes for previous `api` work and `agent` that weren't tested
- Needed somewhat pervasive changes to support hash based blocking - some TODOs left in our watch toolchain that will explicitly fail on hash-based watches.
- Integration into `connect` is partially done here but still WIP
This has an explcit unit test already which somehow passes at least some of the time. I suspect it passes because under some conditions the actual KV delete fails and returns non-zero as well as printing the warning which is what is being checked for in the test.
For some reason despite working for quite some time like this, I now have a branch in which this test fails consistently. It may be a timing/env issue where another process running an agent causes the delete to be successful so the command returns a 0 by chance. Either way this is clearly wrong and fixing it stops the test being flaky in my branch.
Calling twice appears to have no adverse effects, however serves to
confuse as to what the semantics of such code may be! This seems like it
was probably introduced while resolving conflicts during the merge of
the fix for #2404.
The root cause is actually that the agent's streaming HTTP API didn't flush until the first log line was found which commonly was pretty soon since the default level is INFO. In cases where there were no logs immediately due to level for instance, the client gets stuck in the HTTP code waiting on a response packet from the server before we enter the loop that checks the shutdown channel from the signal handler.
This fix flushes the initial status immediately on the streaming endpoint which lets the client code get into it's expected state where it's listening for shutdown or log lines.
The `consul agent` command was ignoring extra command line arguments
which can lead to confusion when the user has for example forgotten to
add a dash in front of an argument or is not using an `=` when setting
boolean flags to `true`. `-bootstrap true` is not the same as
`-bootstrap=true`, for example.
Since all command line flags are known and we don't expect unparsed
arguments we can return an error. However, this may make it slightly
more difficult in the future if we ever wanted to have these kinds of
arguments.
Fixes#3397
This patch refactors the commands that use the mitchellh/cli library to
populate the command line flag set in both the Run() and the Help()
method. Earlier versions of the mitchellh/cli library relied on the
Run() method to populuate the flagset for generating the usage screen.
This has changed in later versions and was previously solved with a
small monkey patch to the library to restore the old behavior.
However, this makes upgrading the library difficult since the patch has
to be restored every time.
This patch addresses this by moving the command line flags into an
initFlags() method where appropriate and also moving all variables for
the flags from the Run() method into the command itself.
Fixes#3536
* Clean up handling of subprocesses and make using a shell optional
* Update docs for subprocess changes
* Fix tests for new subprocess behavior
* More cleanup of subprocesses
* Minor adjustments and cleanup for subprocess logic
* Makes the watch handler reload test use the new path.
* Adds check tests for new args path, and updates existing tests to use new path.
* Adds support for script args in Docker checks.
* Fixes the sanitize unit test.
* Adds panic for unknown watch type, and reverts back to Run().
* Adds shell option back to consul lock command.
* Adds shell option back to consul exec command.
* Adds shell back into consul watch command.
* Refactors signal forwarding and makes Windows-friendly.
* Adds a clarifying comment.
* Changes error wording to a warning.
* Scopes signals to interrupt and kill.
This avoids us trying to send SIGCHILD to the dead process.
* Adds an error for shell=false for consul exec.
* Adds notes about the deprecated script and handler fields.
* De-nests an if statement.
* metrics: replace statsite_prefix with service_prefix
The metrics prefix isn't statsite specific and is in fact used
for all metrics providers. Since we are deprecating fields
anyway we should fix this one as well.
Fixes#3293
* Updates docs and sorts telemetry section.
* Renames to "metrics_prefix" to disambiguate with Consul services.
* Updates the change log.
* Changes default Raft protocol to 3.
* Changes numPeers() to report only voters.
This should have been there before, but it's more obvious that this
is incorrect now that we default the Raft protocol to 3, which puts
new servers in a read-only state while Autopilot waits for them to
become healthy.
* Fixes TestLeader_RollRaftServer.
* Fixes TestOperator_RaftRemovePeerByAddress.
* Fixes TestServer_*.
Relaxed the check for a given number of voter peers and instead do
a thorough check that all servers see each other in their Raft
configurations.
* Fixes TestACL_*.
These now just check for Raft replication to be set up, and don't
care about the number of voter peers.
* Fixes TestOperator_Raft_ListPeers.
* Fixes TestAutopilot_CleanupDeadServerPeriodic.
* Fixes TestCatalog_ListNodes_ConsistentRead_Fail.
* Fixes TestLeader_ChangeServerID and adjusts the conn pool to throw away
sockets when it sees io.EOF.
* Changes version to 1.0.0 in the options doc.
* Makes metrics test more deterministic with autopilot metrics possible.
* new config parser for agent
This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:
* add HCL support
* all configuration fragments in tests and for default config are
expressed as HCL fragments
* HCL fragments can be provided on the command line so that they
can eventually replace the command line flags.
* HCL/JSON fragments are parsed into a temporary Config structure
which can be merged using reflection (all values are pointers).
The existing merge logic of overwrite for values and append
for slices has been preserved.
* A single builder process generates a typed runtime configuration
for the agent.
The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.
The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.
* Upgrade github.com/hashicorp/hcl to support int64
* improve error messages
* fix directory permission test
* Fix rtt test
* Fix ForceLeave test
* Skip performance test for now until we know what to do
* Update github.com/hashicorp/memberlist to update log prefix
* Make memberlist use the default logger
* improve config error handling
* do not fail on non-existing data-dir
* experiment with non-uniform timeouts to get a handle on stalled leader elections
* Run tests for packages separately to eliminate the spurious port conflicts
* refactor private address detection and unify approach for ipv4 and ipv6.
Fixes#2825
* do not allow unix sockets for DNS
* improve bind and advertise addr error handling
* go through builder using test coverage
* minimal update to the docs
* more coverage tests fixed
* more tests
* fix makefile
* cleanup
* fix port conflicts with external port server 'porter'
* stop test server on error
* do not run api test that change global ENV concurrently with the other tests
* Run remaining api tests concurrently
* no need for retry with the port number service
* monkey patch race condition in go-sockaddr until we understand why that fails
* monkey patch hcl decoder race condidtion until we understand why that fails
* monkey patch spurious errors in strings.EqualFold from here
* add test for hcl decoder race condition. Run with go test -parallel 128
* Increase timeout again
* cleanup
* don't log port allocations by default
* use base command arg parsing to format help output properly
* handle -dc deprecation case in Build
* switch autopilot.max_trailing_logs to int
* remove duplicate test case
* remove unused methods
* remove comments about flag/config value inconsistencies
* switch got and want around since the error message was misleading.
* Removes a stray debug log.
* Removes a stray newline in imports.
* Fixes TestACL_Version8.
* Runs go fmt.
* Adds a default case for unknown address types.
* Reoders and reformats some imports.
* Adds some comments and fixes typos.
* Reorders imports.
* add unix socket support for dns later
* drop all deprecated flags and arguments
* fix wrong field name
* remove stray node-id file
* drop unnecessary patch section in test
* drop duplicate test
* add test for LeaveOnTerm and SkipLeaveOnInt in client mode
* drop "bla" and add clarifying comment for the test
* split up tests to support enterprise/non-enterprise tests
* drop raft multiplier and derive values during build phase
* sanitize runtime config reflectively and add test
* detect invalid config fields
* fix tests with invalid config fields
* use different values for wan sanitiziation test
* drop recursor in favor of recursors
* allow dns_config.udp_answer_limit to be zero
* make sure tests run on machines with multiple ips
* Fix failing tests in a few more places by providing a bind address in the test
* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.
* Add porter to server_test.go to make tests there less flaky
* go fmt
When the metadata server is scanning the agents for potential servers
it is parsing the version number which the agent provided when it
joined. This version number has to conform to a certain format, i.e.
'n.n.n'. Without this version number properly set some tests fail with
error messages that disguise the root cause.
The default version number is currently set to 'unknown' in
version/version.go which does not parse and triggers the tests to fail.
The work around is to use a build tag 'consul' which will use the
version number set in version_base.go instead which has the correct
format and is set to the current release version.
In addition, some parts of the code also require the version number to
be of a certain value. Setting it to '0.0.0' for example makes some
tests pass and others fail since they don't pass the semantic check.
When using go build/install/test one has to remember to use '-tags
consul' or tests will fail with non-obvious error messages.
Using build tags makes the build process more complex and error prone
since it prevents the use of the plain go toolchain and - at least in
its current form - introduces subtle build and test issues. We should
try to eliminate build tags for anything else but platform specific
code.
This patch removes all references to specific version numbers in the
code and tests and sets the default version to '9.9.9' which is
syntactically correct and passes the semantic check. This solves the
issue of running go build/install/test without tags for the OSS build.
* Exit 2 if -child-exit-code and the child returned with an error.
* There is no platform independent way to check the exact return code of
* the child, so on error always return 2.
* Closes#947
* Closes#1503
This patch fixes watch registration through the config file and a broken log line when the watch registration fails. It also plumbs all the watch loading through a common function and tweaks the
unit test to create the watch before the reload.
When the agent is triggered to shutdown via an external 'consul leave'
command delivered via the HTTP API then the client expects to receive a
response when the agent is down. This creates a race on when to shutdown
the agent itself like the RPC server, the checks and the state and the
external endpoints like DNS and HTTP.
This patch splits the shutdown process into two parts:
* shutdown the agent
* shutdown the endpoints (http and dns)
They can be executed multiple times, concurrently and in any order but
should be executed first agent, then endpoints to provide consistent
behavior across all use cases. Both calls have to be executed for a
proper shutdown.
This could be partially hidden in a single function but would introduce
some magic that happens behind the scenes which one has to know of but
isn't obvious.
Fixes#2880
This PR fixes GH-2212 in the most backwards-compatible way I can think
of. If the user does not pass a value for `?passing`, it's assumed to be
true, which mirrors the current behavior. However, if the user passes
any value for passing, that value is parsed as a bool using strconv.
It's important to note that this is technically a breaking change.
Previously using `?passing=false` would return only passing nodes. While
this behavior is obviously incorrect, it was the previous behavior. We
should call this out very clearly in the CHANGELOG.
This patch logs the signals, events, errors and the exit
code to the log file instead of printing it on the console.
This should provide a more complete picture for debugging.
When triggering a leave through an INT/TERM signal the hard-coded
timeout of 5 seconds is too short to complete the leave successfully.
Therefore, the agent always times out.
This value should probably configurable.
Pick the random ports only once and try starting with them
a number of times so that the configuration can be re-used.
This is because the ports are written into the data files
and a subsequent agent reading the files needs to have the
same ports.
For the same reason we do not remove the data directory on
every attempt since this makes it impossible to re-read the
data files.
* refactor DNS server to be ready for multiple bind addresses
* drop tcpKeepAliveListener since it is default for the HTTP servers
* add startup timeout watcher for HTTP servers identical to DNS server
* don't use retry to try restarting the agent
this caused some issues when the startup would fail in
a separate go routine
* clear out the data directory on every retry since the ports
are stored in the raft data files
* set a unique id for every agent to allow for tracking of
concurrent output
This brings down the test run from 108 sec to 15 sec.
There is an occasional port conflict because of the nature
the next port is chosen. So far it seems rare enough to live
with it.
TestAgent will replace the following mechanisms to
start test agents in subsequent requests:
* makeAgentXXX
* makeDNSServerXXX
* makeHTTPServerXXX
* testServer
* httpTest
Move the HTTP and DNS endpoints into the agent and control
their lifespan via the agent.
This removes the requirement to manage HTTP and DNS servers
indpendent of the agent since the agent is mostly useless
without an endpoint and the endpoints without the agent.
This patch adds support for a custom check id and name when
registering a service.
This is achieved by adding a CheckID and a Name field to the
CheckType structure which is used to register checks with a
service and when returning health check definitions.
CheckDefinition is a superset of CheckType which duplicates
some of the fields of CheckType. This patch decouples these
two structures by removing the embedding of CheckType in
CheckDefinition.
Fixes#3047
This patch adds a new internal interface clientServer
which defines the common methods of consul.Client and
consul.Server. This allows to replace the following
code
if a.server != nil {
a.server.do()
} else {
a.client.do()
}
with
a.delegate.do()
In case a specific type is required a type check can
be performed:
if srv, ok := a.delegate.(*consul.Server); ok {
srv.doSrv()
}
This creates a simplified helper for temporary directories and files.
All path names are prefixed with the name of the current test.
All files and directories are stored either in /tmp/consul-test
or /tmp if the former could not be created.
Using the system temp dir breaks some tests on macOS where the unix
socket path becomes too long.
macOS displays a firewall warning dialog when an unsigned
application is trying to bind to a non-loopback address.
This patch updates some test configurations to ensure binding
to a loopback address where possible to suppress these warnings.
Use the bind address as source address for outgoing
RPC connections unless it is INADDR_ANY.
The current code uses the advertise address which will
not work in certain environments where the advertise
address is not routable in the network of the agent,
e.g. NAT environment, container... After all, that is
the purpose of the advertise address.
See #2822
Since this was doing registration to a foreign DC, it needs extra time
for the route to the ACL datacenter to be set up. ACLs aren't part of
this test, so by disabling them we make this more reliable and converge
faster than if we had added a retry.
Refactor tests that use testutil.WaitForResult to use retry.
Since this requires refactoring the test functions in general this patch
also shows the use of the github.com/pascaldekloe/goe/verify library
which provides a good mechanism for comparing nested data structures.
Instead of just converting the tests from testutil.WaitForResult to
retry the tests that performing a nested comparison of data structures
are converted to the verify library at the same time.
This patch removes duplicate internal copies of constants in the structs
package which are also defined in the api package. The api.KVOp type
with all its values for the TXN endpoint and the api.HealthXXX constants
are now used throughout the codebase.
This resulted in some circular dependencies in the testutil package
which have been resolved by copying code and constants and moving the
WaitForLeader function into a separate testrpc package.
This PR takes the host ID and runs it through a hash so that it is well
distributed. This makes it so that machines that report similar host IDs
are easily distinguished.
Instances of similar IDs occur on EC2 where the ID is prefixed and on
motherboards created in the same batch.
When consul-template is communicating with consul and the job is done, consul thread receives SIGPIPE.
This cause the logs to be filled "Caught signal: broken pipe" and they does not bring any usefull info with them.
Skipping those.
Ended up removing the leader_test.go server address change test as part
of this. The join was failing becase we were using a new node name with
the new logic here, but realized this was hitting some of the memberlist
conflict logic and not working as we expected. We need some additional
work to fully support address changes, so removed the test for now.
We fixed a few related issues while we were in here. We now only let
services register checks with a matching token, and we also close out
service and check delete operations if the catalog deregister claims
it doesn't know about the ID of the service or check being deleted.
This makes the upgrade path a bit nicer, since people will likely have
older configurations. This prints out a warning instead of just failing
if the old rpc addr or ports definition is in the config.
This has the next wave of RTT integration with the router and also
factors some common RTT-related helpers out to lib. While we were
in here we also got rid of the coordinate disable config so we don't
need to deal with the complexity in the router (there was never a
user-visible way to disable coordinates).
This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.