* Adds client-side retry for no leader errors.
This paves over the case where the client was connected to the leader
when it loses leadership.
* Adds a configurable server RPC drain time and a fail-fast path for RPCs.
When a server leaves it gets removed from the Raft configuration, so it will
never know who the new leader server ends up being. Without this we'd be
doomed to wait out the RPC hold timeout and then fail. This makes things fail
a little quicker while a sever is draining, and since we added a client retry
AND since the server doing this has already shut down and left the Serf LAN,
clients should retry against some other server.
* Makes the RPC hold timeout configurable.
* Reorders struct members.
* Sets the RPC hold timeout default for test servers.
* Bumps the leave drain time up to 5 seconds.
* Robustifies retries with a simpler client-side RPC hold.
* Reverts untended delete.
* Clean up handling of subprocesses and make using a shell optional
* Update docs for subprocess changes
* Fix tests for new subprocess behavior
* More cleanup of subprocesses
* Minor adjustments and cleanup for subprocess logic
* Makes the watch handler reload test use the new path.
* Adds check tests for new args path, and updates existing tests to use new path.
* Adds support for script args in Docker checks.
* Fixes the sanitize unit test.
* Adds panic for unknown watch type, and reverts back to Run().
* Adds shell option back to consul lock command.
* Adds shell option back to consul exec command.
* Adds shell back into consul watch command.
* Refactors signal forwarding and makes Windows-friendly.
* Adds a clarifying comment.
* Changes error wording to a warning.
* Scopes signals to interrupt and kill.
This avoids us trying to send SIGCHILD to the dead process.
* Adds an error for shell=false for consul exec.
* Adds notes about the deprecated script and handler fields.
* De-nests an if statement.
* Cleans up information about file extensions, now that they are required.
* Removes references to deprecated configuration options.
* Adds docs for multiple bind address support.
* metrics: replace statsite_prefix with service_prefix
The metrics prefix isn't statsite specific and is in fact used
for all metrics providers. Since we are deprecating fields
anyway we should fix this one as well.
Fixes#3293
* Updates docs and sorts telemetry section.
* Renames to "metrics_prefix" to disambiguate with Consul services.
* Updates the change log.
* Changes default Raft protocol to 3.
* Changes numPeers() to report only voters.
This should have been there before, but it's more obvious that this
is incorrect now that we default the Raft protocol to 3, which puts
new servers in a read-only state while Autopilot waits for them to
become healthy.
* Fixes TestLeader_RollRaftServer.
* Fixes TestOperator_RaftRemovePeerByAddress.
* Fixes TestServer_*.
Relaxed the check for a given number of voter peers and instead do
a thorough check that all servers see each other in their Raft
configurations.
* Fixes TestACL_*.
These now just check for Raft replication to be set up, and don't
care about the number of voter peers.
* Fixes TestOperator_Raft_ListPeers.
* Fixes TestAutopilot_CleanupDeadServerPeriodic.
* Fixes TestCatalog_ListNodes_ConsistentRead_Fail.
* Fixes TestLeader_ChangeServerID and adjusts the conn pool to throw away
sockets when it sees io.EOF.
* Changes version to 1.0.0 in the options doc.
* Makes metrics test more deterministic with autopilot metrics possible.
* new config parser for agent
This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:
* add HCL support
* all configuration fragments in tests and for default config are
expressed as HCL fragments
* HCL fragments can be provided on the command line so that they
can eventually replace the command line flags.
* HCL/JSON fragments are parsed into a temporary Config structure
which can be merged using reflection (all values are pointers).
The existing merge logic of overwrite for values and append
for slices has been preserved.
* A single builder process generates a typed runtime configuration
for the agent.
The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.
The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.
* Upgrade github.com/hashicorp/hcl to support int64
* improve error messages
* fix directory permission test
* Fix rtt test
* Fix ForceLeave test
* Skip performance test for now until we know what to do
* Update github.com/hashicorp/memberlist to update log prefix
* Make memberlist use the default logger
* improve config error handling
* do not fail on non-existing data-dir
* experiment with non-uniform timeouts to get a handle on stalled leader elections
* Run tests for packages separately to eliminate the spurious port conflicts
* refactor private address detection and unify approach for ipv4 and ipv6.
Fixes#2825
* do not allow unix sockets for DNS
* improve bind and advertise addr error handling
* go through builder using test coverage
* minimal update to the docs
* more coverage tests fixed
* more tests
* fix makefile
* cleanup
* fix port conflicts with external port server 'porter'
* stop test server on error
* do not run api test that change global ENV concurrently with the other tests
* Run remaining api tests concurrently
* no need for retry with the port number service
* monkey patch race condition in go-sockaddr until we understand why that fails
* monkey patch hcl decoder race condidtion until we understand why that fails
* monkey patch spurious errors in strings.EqualFold from here
* add test for hcl decoder race condition. Run with go test -parallel 128
* Increase timeout again
* cleanup
* don't log port allocations by default
* use base command arg parsing to format help output properly
* handle -dc deprecation case in Build
* switch autopilot.max_trailing_logs to int
* remove duplicate test case
* remove unused methods
* remove comments about flag/config value inconsistencies
* switch got and want around since the error message was misleading.
* Removes a stray debug log.
* Removes a stray newline in imports.
* Fixes TestACL_Version8.
* Runs go fmt.
* Adds a default case for unknown address types.
* Reoders and reformats some imports.
* Adds some comments and fixes typos.
* Reorders imports.
* add unix socket support for dns later
* drop all deprecated flags and arguments
* fix wrong field name
* remove stray node-id file
* drop unnecessary patch section in test
* drop duplicate test
* add test for LeaveOnTerm and SkipLeaveOnInt in client mode
* drop "bla" and add clarifying comment for the test
* split up tests to support enterprise/non-enterprise tests
* drop raft multiplier and derive values during build phase
* sanitize runtime config reflectively and add test
* detect invalid config fields
* fix tests with invalid config fields
* use different values for wan sanitiziation test
* drop recursor in favor of recursors
* allow dns_config.udp_answer_limit to be zero
* make sure tests run on machines with multiple ips
* Fix failing tests in a few more places by providing a bind address in the test
* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.
* Add porter to server_test.go to make tests there less flaky
* go fmt
* Added rate limiting for agent RPC calls.
* Initializes the rate limiter based on the config.
* Adds the rate limiter into the snapshot RPC path.
* Adds unit tests for the RPC rate limiter.
* Groups the RPC limit parameters under "limits" in the config.
* Adds some documentation about the RPC limiter.
* Sends a 429 response when the rate limiter kicks in.
* Adds docs for new telemetry.
* Makes snapshot telemetry look like RPC telemetry and cleans up comments.
* Changes remote exec KV read to call GetTokenForAgent(), which can use
the acl_agent_token instead of the acl_token.
Fixes#3160.
* Fixes remote exec unit test with ACLs.
* Adds unhappy ACL path to unit tests for remote exec.
This patch adds an "http_config" object to the config file
and moves the "http_api_response_headers" option there.
"http_api_response_headers" is now deprecated in favor of
"http_config.response_headers"
- Add various join options to bootstrapping guide
- Add note about Atlas deprecation to bootstrapping guide
- Add notes about -retry-join and retry_join to -join option
- Add notes about -retry-join and retry_join to start_join option
This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.
This commit adds several command-line and config options that facilitate
host discovery through Google Compute Engine (GCE), much like the
recently added EC2 host discovery options. This should assist with
bootstrapping and joining servers within GCE when non-static addresses
are used, such as when using managed instance groups.
Documentation has also been added. It should be noted that if running
from within a GCE instance, the only option that should be necessary is
-retry-join-gce-tag-value.
I was reviewing some docs and found a few issues.
1. Fixed some spelling mistakes.
2. Re-formatted some paragraphs.
3. Changed some potentially loaded language.
4. Fixed some grammar issues.
5. Tried to consistently use syntax-highlighting.
6. Fixed post-period spacing.
7. Fixed some formatting issues and inconsistency.
8. All "notes" are either proper notes or re-written.
Default MaxStale to 10 years and add a counter at `consul.dns.stale_queries` that tracks when an agent serves a query that's stale by at least 5 seconds. Previously, MaxStale defaulted to 5 seconds and DNS would become unavailable after a short period of time with no leader. This new default allows DNS requests to still be served in the event of a long outage.
Fixes#2460.
* * adding cli config and config file support for specifying the serf wan and lan bind addresses
* updating documentation for serf wan and lan options
Fixes#2007
* Cleans up some small things from #2380.
* Uses the bind default for the agent test for Serf WAN and LAN.
* Updates Raft library to get new snapshot/restore API.
* Basic backup and restore working, but need some cleanup.
* Breaks out a snapshot module and adds a SHA256 integrity check.
* Adds snapshot ACL and fills in some missing comments.
* Require a consistent read for snapshots.
* Make sure snapshot works if ACLs aren't enabled.
* Adds a bit of package documentation.
* Returns an empty response from restore to avoid EOF errors.
* Adds API client support for snapshots.
* Makes internal file names match on-disk file snapshots.
* Adds DC and token coverage for snapshot API test.
* Adds missing documentation.
* Adds a unit test for the snapshot client endpoint.
* Moves the connection pool out of the client for easier testing.
* Fixes an incidental issue in the prepared query unit test.
I realized I had two servers in bootstrap mode so this wasn't a good setup.
* Adds a half close to the TCP stream and fixes panic on error.
* Adds client and endpoint tests for snapshots.
* Moves the pool back into the snapshot RPC client.
* Adds a TLS test and fixes half-closes for TLS connections.
* Tweaks some comments.
* Adds a low-level snapshot test.
This is independent of Consul so we can pull this out into a library
later if we want to.
* Cleans up snapshot and archive and completes archive tests.
* Sends a clear error for snapshot operations in dev mode.
Snapshots require the Raft snapshots to be readable, which isn't supported
in dev mode. Send a clear error instead of a deep-down Raft one.
* Adds docs for the snapshot endpoint.
* Adds a stale mode and index feedback for snapshot saves.
This gives folks a way to extract data even if the cluster has no
leader.
* Changes the internal format of a snapshot from zip to tgz.
* Pulls in Raft fix to cancel inflight before a restore.
* Pulls in new Raft restore interface.
* Adds metadata to snapshot saves and a verify function.
* Adds basic save and restore snapshot CLI commands.
* Gets rid of tarball extensions and adds restore message.
* Fixes an incidental bad link in the KV docs.
* Adds documentation for the snapshot CLI commands.
* Scuttle any request body when a snapshot is saved.
* Fixes archive unit test error message check.
* Allows for nil output writers in snapshot RPC handlers.
* Renames hash list Decode to DecodeAndVerify.
* Closes the client connection for snapshot ops.
* Lowers timeout for restore ops.
* Updates Raft vendor to get new Restore signature and integrates with Consul.
* Bounces the leader's internal state when we do a restore.
Earlier on this page, under `addresses`, we say "For TCP addresses, these should simply be an IP address without the port. For example: 10.0.0.1, not 10.0.0.1:8500." Since we expect the port to be included for `_address` for telemetry, call it out specifically.
The `-dc` flag from the agent CLI command has been deprecated in favor of
`-datacenter`. This is done this way because:
- Other CLI commands used `-datacenter`. See: event, exec and watch.
- The agent configuration file uses `datacenter`.
Signed-off-by: Miquel Sabaté Solà <msabate@suse.com>
Prior to this change, prepared queries had the following behavior for
ACLs, which will need to change to support templates:
1. A management token, or a token with read access to the service being
queried needed to be provided in order to create a prepared query.
2. The token used to create the prepared query was stored with the query
in the state store and used to execute the query.
3. A management token, or the token used to create the query needed to be
supplied to perform and CRUD operations on an existing prepared query.
This was pretty subtle and complicated behavior, and won't work for
templates since the service name is computed at execution time. To solve
this, we introduce a new "prepared-query" ACL type, where the prefix
applies to the query name for static prepared query types and to the
prefix for template prepared query types.
With this change, the new behavior is:
1. A management token, or a token with "prepared-query" write access to
the query name or (soon) the given template prefix is required to do
any CRUD operations on a prepared query, or to list prepared queries
(the list is filtered by this ACL).
2. You will no longer need a management token to list prepared queries,
but you will only be able to see prepared queries that you have access
to (you get an empty list instead of permission denied).
3. When listing or getting a query, because it was easy to capture
management tokens given the past behavior, this will always blank out
the "Token" field (replacing the contents as <hidden>) for all tokens
unless a management token is supplied. Going forward, we should
discourage people from binding tokens for execution unless strictly
necessary.
4. No token will be captured by default when a prepared query is created.
If the user wishes to supply an execution token then can pass it in via
the "Token" field in the prepared query definition. Otherwise, this
field will default to empty.
5. At execution time, we will use the captured token if it exists with the
prepared query definition, otherwise we will use the token that's passed
in with the request, just like we do for other RPCs (or you can use the
agent's configured token for DNS).
6. Prepared queries with no name (accessible only by ID) will not require
ACLs to create or modify (execution time will depend on the service ACL
configuration). Our argument here is that these are designed to be
ephemeral and the IDs are as good as an ACL. Management tokens will be
able to list all of these.
These changes enable templates, but also enable delegation of authority to
manage the prepared query namespace.
The example configuration file omits TLS support in the HTTP API. This is fine, but a second example demonstrating how to enable TLS over the HTTP API is harmless and, in fact, should be default practice.
Using the format `ip:port` in the "addresses" block will cause Consul to crash on reload/start. See issue (#1727)[https://github.com/hashicorp/consul/issues/1727#issuecomment-184980751]
Add the `Service`/`Address` field to the example output for the `/v1/health/service/\<service\>` endpoint.
Even though it's an optional value, this is probably the one consumers are looking for (rather than the `Node` address)