* Added rate limiting for agent RPC calls.
* Initializes the rate limiter based on the config.
* Adds the rate limiter into the snapshot RPC path.
* Adds unit tests for the RPC rate limiter.
* Groups the RPC limit parameters under "limits" in the config.
* Adds some documentation about the RPC limiter.
* Sends a 429 response when the rate limiter kicks in.
* Adds docs for new telemetry.
* Makes snapshot telemetry look like RPC telemetry and cleans up comments.
When the metadata server is scanning the agents for potential servers
it is parsing the version number which the agent provided when it
joined. This version number has to conform to a certain format, i.e.
'n.n.n'. Without this version number properly set some tests fail with
error messages that disguise the root cause.
The default version number is currently set to 'unknown' in
version/version.go which does not parse and triggers the tests to fail.
The work around is to use a build tag 'consul' which will use the
version number set in version_base.go instead which has the correct
format and is set to the current release version.
In addition, some parts of the code also require the version number to
be of a certain value. Setting it to '0.0.0' for example makes some
tests pass and others fail since they don't pass the semantic check.
When using go build/install/test one has to remember to use '-tags
consul' or tests will fail with non-obvious error messages.
Using build tags makes the build process more complex and error prone
since it prevents the use of the plain go toolchain and - at least in
its current form - introduces subtle build and test issues. We should
try to eliminate build tags for anything else but platform specific
code.
This patch removes all references to specific version numbers in the
code and tests and sets the default version to '9.9.9' which is
syntactically correct and passes the semantic check. This solves the
issue of running go build/install/test without tags for the OSS build.
The error handling of the ACL code relies on the presence of certain
magic error messages. Since the error values are sent via RPC between
older and newer consul agents we cannot just replace the magic values
with typed errors and switch to type checks since this would break
compatibility with older clients.
Therefore, this patch moves all magic ACL error messages into the acl
package and provides default error values and helper functions which
determine the type of error.
This patch replaces the code which determines the list of servers in the
current cluster with an RPC call to get the list of active consul
service instances which only run on servers.
This replaces the previous implementation which was more complex and
relied on serf messages which can provide a different view than the
consistent response from the raft log.
As a side effect it makes the implementation independent of the server
and the agent which means it works consistently across both. Different
behavior for server and agent was the root cause for the bug in
http://github.com/hashicorp/consul/issue/3047.
Fixes#3407
This patch changes the behavior of the DNS server as follows:
* The SOA response contains the SOA record in the Answer section instead
of the Authority section. It also contains NS records in the Authority
and the corresponding A glue records in the Extra section.
In addition, CNAMEs are added to the Extra section to make the
MNAME of the SOA record resolvable.
AAAA glue records are not yet supported.
* The NS response returns up to three random servers from the
consul cluster in the Answer section and the glue A
records in the Extra section.
AAAA glue records are not yet supported.
Note that there is no test since the correct way to solve (and test)
this is to replace the different maps with a single one or to hide
that functionality behind a separate data structure. This will be
addressed in #3294.
Fixes#3265
This patch replaces the Docker client which is used
for health checks with a simplified version tailored
for that purpose.
See #3254
See #3257Fixes#3270
* Changes remote exec KV read to call GetTokenForAgent(), which can use
the acl_agent_token instead of the acl_token.
Fixes#3160.
* Fixes remote exec unit test with ACLs.
* Adds unhappy ACL path to unit tests for remote exec.
* Obfuscates ACL tokens appearing in /v1/acl APIs.
* Makes test positively identify the desired strings.
* Adds an example and explanation of the regular expression.
* Moves magic check and service constants into shared structs package.
* Removes the "consul" service from local state.
Since this service is added by the leader, it doesn't really make sense to
also keep it in local state (which requires special ACLs to configure), and
requires a bunch of special cases in the local state logic. This requires
fewer special cases and makes ACL bootstrapping cleaner.
* Makes coordinate update ACL log message a warning, similar to other AE warnings.
* Adds much more detailed examples for bootstrapping ACLs.
This can hopefully replace https://gist.github.com/slackpad/d89ce0e1cc0802c3c4f2d84932fa3234.
* Removes unnecessary set for model component which will be null.
* Returns a 404 for a missing node, not a 200 with an empty response.
* Updates built-in web assets.
* Removes GitHub reference.
* Doesn't display ACL token on the unauthorized page.
* Removes useless fetch for nodes and cleans up comments.
* Provides a path to reset the ACL token when it's invalid.
This included making the settings page global so it's reachable, and adding
some more information about an error on the error page.
* Updates built-in web assets.
This isn't racy, it's just a little dirty. The listen will happen and a port
will be selected and injected into the config once the Serf instance is
created, so we don't need the retry loop here.
This fixes TestServer_JoinSeparateLanAndWanAddresses which sets bogus
advertise addresses as part of the test. Port numbers uniquely identify
members since everything is running on localhost.
This patch creates a local config structure for the local state
which is independent from the agent but populated from its
configuration. This avoids data races between the agent configuration
which can change during tests and concurrent go routines using the
configuraiton at the same time.
The agent configuration for the consul server is a partial configuration
which needs to be cloned to avoid data races.
This is a stop-gap measure before moving the configuration into
a separate package.
The tests that use the localState of the agent access the internal
variables and call methods which are not guarded by locks creating
data races in tests. While the use of internal variables is somewhat
easy to spot the fact that not all methods are thread-safe is a
surprise.
A proper fix requires the localState struct to be moved into its own
package so that tests in the agent can only access the external
interface.
However, the localState is currently dependent on the agent.Config
which would create a circular dependency. Therefore, the Config
struct needs to be moved first for this to happen.
This patch literally monkey patches the use of the lock around the
cases which have data races and marks them with a
// todo(fs): data race comment.
The sessionTimers map was secured by a lock which wasn't used
properly in the tests. This lead to data races and failing tests
when accessing the length or the members of the map.
This patch adds a separate SessionTimers struct which is safe
for concurrent use and which ecapsulates the behavior of the
sessionTimers map.
When using dynamic ports for the serf clusters then
the actual bind port of the serf WAN cluster needs to
be discovered before the serf LAN cluster is started
since the serf LAN cluster announces the port of the WAN
cluster.
Host header must be set explicitely on http requests
Change-Id: I91a32f0fb1ec3fbc713adf0e10869797e91172c7
Signed-off-by: Grégoire Seux <g.seux@criteo.com>
The makeRecursor function was using an unreliable mechanism
to start a server with a random port. This patch changes this
so that the server starts on port 0 to let the kernel pick
a free port.
In addition, to similar functions for starting a test DNS
server were folded into one.
This patch fixes watch registration through the config file and a broken log line when the watch registration fails. It also plumbs all the watch loading through a common function and tweaks the
unit test to create the watch before the reload.
This patch adds an "http_config" object to the config file
and moves the "http_api_response_headers" option there.
"http_api_response_headers" is now deprecated in favor of
"http_config.response_headers"
When the agent is triggered to shutdown via an external 'consul leave'
command delivered via the HTTP API then the client expects to receive a
response when the agent is down. This creates a race on when to shutdown
the agent itself like the RPC server, the checks and the state and the
external endpoints like DNS and HTTP.
This patch splits the shutdown process into two parts:
* shutdown the agent
* shutdown the endpoints (http and dns)
They can be executed multiple times, concurrently and in any order but
should be executed first agent, then endpoints to provide consistent
behavior across all use cases. Both calls have to be executed for a
proper shutdown.
This could be partially hidden in a single function but would introduce
some magic that happens behind the scenes which one has to know of but
isn't obvious.
Fixes#2880
This patch hides the RPC handler overwrite mechanism from the
rest of the code so that it works in all cases and that there
is no cooperation required from the tested code, i.e. we can
drop a.getEndpoint().
Fix stale reads on server startup. Consistent reads will now wait for up to config.RPCHoldTimeout for the server to get past its raft log, before returning an error. Servers that are starting up will eventually catch up.
This fixes issue #2644
When the agent is triggered to shutdown via an external 'consul leave'
command delivered via the HTTP API then the client expects to receive a
response when the agent is down. This creates a race on when to shutdown
the agent itself like the RPC server, the checks and the state and the
external endpoints like DNS and HTTP. Ideally, the external endpoints
should be shutdown before the internal state but if the goal is to
respond reliably that the agent is down then this is not possible.
This patch splits the agent shutdown into two parts implemented in a
single method to keep it simple and unambiguos for the caller. The first
stage shuts down the internal state, checks, RPC server, ...
synchronously and then triggers the shutdown of the external endpoints
asychronously. This way the caller is guaranteed that the internal state
services are down when Shutdown returns and there remains enough time to
send a response.
Fixes#2880