open-consul

Commit Graph

Author	SHA1	Message	Date
Pierre Souchay	947d8eb039	Added ratelimit to handle throttling cache (#8226) This implements a solution for #7863. It does: Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf Add cache.entry_fetch_max_burst size of rate limit (default value = 2) The new configuration now supports the following syntax, for instance to allow 1 query every 3s: command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}' in JSON { "cache": { "entry_fetch_rate": 0.333 } }	2020-07-27 23:11:11 +02:00
Matt Keeler	fa6a2b38d9	Add an AutoEncrypt “integration” test Also fix a bug where Consul could segfault if TLS was enabled but no client certificate was provided. How no one has reported this as a problem I am not sure.	2020-06-30 15:23:29 -04:00
Daniel Nephin	0285956fac	Update TestAgent_GetCoordinate The old test case was a very specific regression test for a case that is no longer possible. Replaced with a new test that checks the default coordinate is returned.	2020-06-24 13:00:15 -04:00
Daniel Nephin	07c1081d39	Fix a bunch of unparam lint issues	2020-06-24 13:00:14 -04:00
Matt Keeler	7086a50353	Change auto config authorizer to allow for future extension The envisioned changes would allow extra settings to enable dynamically defined auth methods to be used instead of or in addition to the statically defined one in the configuration.	2020-06-18 15:22:24 -04:00
Matt Keeler	2c7844d220	Implement Client Agent Auto Config There are a couple of things in here. First, just like auto encrypt, any Cluster.AutoConfig RPC will implicitly use the less secure RPC mechanism. This drastically modifies how the Consul Agent starts up and moves most of the responsibilities (other than signal handling) from the cli command and into the Agent.	2020-06-17 16:49:46 -04:00
Daniel Nephin	89d95561df	Enable gofmt simplify Code changes done automatically with 'gofmt -s -w'	2020-06-16 13:21:11 -04:00
Hans Hasselberg	7f14d3ac8a	tests: use constructor instead init (#8024 )	2020-06-04 22:59:06 +02:00
Pierre Souchay	7cd5477c3c	checks: when a service does not exist in an alias, consider it failing (#7384) In the current implementation of Consul, an alias check cannot determine whether a service exists. Because a service without any check is semantically considered passing, the check was considered passing when no healthchecks were found for an agent. But this makes little sense, as the current implementation does not differentiate between: * a non-existing service (passing) * a service without any check (passing as well) To make it work, we have to ensure that when a check did not find any healthcheck, the service does indeed exist. If it does not, let's consider the check as failing.	2020-06-04 14:50:52 +02:00
Daniel Nephin	e8a883e829	Replace goe/verify.Values with testify/require.Equal (#7993) * testing: replace most goe/verify.Values with require.Equal One difference between these two comparisons is that goe/verify considers nil slices/maps to be equal to empty slices/maps, whereas testify/require does not, and does not appear to provide any way to enable that behaviour. Because of this difference some expected values were changed from empty slices to nil slices, and some calls to verify.Values were left. * Remove github.com/pascaldekloe/goe/verify Reduce the number of assertion packages we use from 2 to 1	2020-06-02 12:41:25 -04:00
Daniel Nephin	2e0f750f1a	Add unconvert linter To find unnecessary type conversions	2020-05-12 13:47:25 -04:00
Pierre Souchay	2b8da952a8	agent: show warning when enable_script_checks is enabled without safety net (#7437) In order to enforce a bit of security on Consul agents, add a new method in agent to highlight possible security issues. This does not return an error for now, but might in the future. For now, it detects issues such as: https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/ This would display this kind of message: ``` 2020-03-11T18:27:49.873+0100 [ERROR] agent: [SECURITY] issue: error="using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead see https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/" ```	2020-04-02 09:59:23 +02:00
Daniel Nephin	09c6ac8b92	Rename NewTestAgentWithFields to StartTestAgent This function now only starts the agent. Using: git grep -l 'StartTestAgent(t, true,' | xargs sed -i -e 's/StartTestAgent(t, true,/StartTestAgent(t,/g'	2020-03-31 17:14:55 -04:00
Daniel Nephin	d623dcbd01	Convert the remaining calls to NewTestAgentWithFields After removing the t.Name() parameter with sed, convert the last few tests which use a custom name to call NewTestAgentWithFields instead.	2020-03-31 17:14:55 -04:00
Daniel Nephin	8b6877febd	Remove name from NewTestAgent Using: git grep -l 'NewTestAgent(t, t.Name(),' | xargs sed -i -e 's/NewTestAgent(t, t.Name(),/NewTestAgent(t,/g'	2020-03-31 16:13:44 -04:00
Daniel Nephin	96c4a35de7	Remove t.Name() from TestAgent.Name And re-add the name to the logger so that log messages from different agents in a single test can be identified.	2020-03-30 16:47:24 -04:00
Daniel Nephin	823295fe2a	testing: reduce verbosity of output log Previously the log output included the test name twice and a long date format. The test output is already grouped by test, so adding the test name did not add any new information. The date and time are only useful to understand elapsed time, so using a short format should provide sufficient detail. Also fixed a bug in NewTestAgentWithFields where nil was returned instead of the test agent.	2020-03-30 13:23:13 -04:00
Chris Piraino	0c5c97205f	Fix flakey health check reload test (#7490 ) This test would occasionally fail because we checked for a status of "critical" initially. This races with the actual healthcheck being run and declared passing. We instead use a ttl health check so that we don't rely on timing at all.	2020-03-25 09:09:13 -05:00
R.B. Boyer	a7fb26f50f	wan federation via mesh gateways (#6884 ) This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch: There are several distinct chunks of code that are affected: * new flags and config options for the server * retry join WAN is slightly different * retry join code is shared to discover primary mesh gateways from secondary datacenters * because retry join logic runs in the agent and the results of that operation for primary mesh gateways are needed in the server there are some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur at multiple layers of abstraction just to pass the data down to the right layer. * new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers * the function signature for RPC dialing picked up a new required field (the node name of the destination) * several new RPCs for manipulating a FederationState object: `FederationState:{Apply,Get,List,ListMeshGateways}` * 3 read-only internal APIs for debugging use to invoke those RPCs from curl * raft and fsm changes to persist these FederationStates * replication for FederationStates as they are canonically stored in the Primary and replicated to the Secondaries. * a special derivative of anti-entropy that runs in secondaries to snapshot their local mesh gateway `CheckServiceNodes` and sync them into their upstream FederationState in the primary (this works in conjunction with the replication to distribute addresses for all mesh gateways in all DCs to all other DCs) * a "gateway locator" convenience object to make use of this data to choose the addresses of gateways to use for any given RPC or gossip operation to a remote DC. 
This gets data from the "retry join" logic in the agent and also directly calls into the FSM. * RPC (`:8300`) on the server sniffs the first byte of a new connection to determine if it's actually doing native TLS. If so it checks the ALPN header for protocol determination (just like how the existing system uses the type-byte marker). * 2 new kinds of protocols are exclusively decoded via this native TLS mechanism: one for ferrying "packet" operations (udp-like) from the gossip layer and one for "stream" operations (tcp-like). The packet operations re-use sockets (using length-prefixing) to cut down on TLS re-negotiation overhead. * the server instances specially wrap the `memberlist.NetTransport` when running with gateway federation enabled (in a `wanfed.Transport`). The general gist is that if it tries to dial a node in the SAME datacenter (deduced by looking at the suffix of the node name) there is no change. If dialing a DIFFERENT datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh gateways to eventually end up in a server's :8300 port. * a new flag when launching a mesh gateway via `consul connect envoy` to indicate that the servers are to be exposed. This sets a special service meta when registering the gateway into the catalog. * `proxycfg/xds` notice this metadata blob to activate additional watches for the FederationState objects as well as the location of all of the consul servers in that datacenter. * `xds:` if the extra metadata is in place additional clusters are defined in a DC to bulk sink all traffic to another DC's gateways. For the current datacenter we listen on a wildcard name (`server.<dc>.consul`) that load balances all servers as well as one mini-cluster per node (`<node>.server.<dc>.consul`) * the `consul tls cert create` command got a new flag (`-node`) to help create an additional SAN in certs that can be used with this flavor of federation.	2020-03-09 15:59:02 -05:00
Pierre Souchay	49dc891737	agent: configuration reload preserves check's statuses for services (#7345) This fixes issue #7318. Between versions 1.5.2 and 1.5.3, a regression was introduced regarding health of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks of services. What happened during a reload was: 1. save all healthcheck statuses 2. cleanup everything 3. add new services with healthchecks In step 3, the state of healthchecks was taken into account locally, but since we cleaned up at step 2, the state was lost. This PR introduces the snap parameter, so step 3 can use information from step 1	2020-03-09 12:59:41 +01:00
Chris Piraino	3dd0b59793	Allow users to configure either unstructured or JSON logging (#7130 ) * hclog Allow users to choose between unstructured and JSON logging	2020-01-28 17:50:41 -06:00
Aestek	9329cbac0a	Add support for dual stack IPv4/IPv6 network (#6640) * Use consts for well known tagged address keys * Add ipv4 and ipv6 tagged addresses for node lan and wan * Add ipv4 and ipv6 tagged addresses for service lan and wan * Use IPv4 and IPv6 address in DNS	2020-01-17 09:54:17 -05:00
Matt Keeler	442924c35a	Sync of OSS changes to support namespaces (#6909 )	2019-12-09 21:26:41 -05:00
Matt Keeler	609c9dab02	Miscellaneous Fixes (#6896 ) Ensure we close the Sentinel Evaluator so as not to leak go routines Fix a bunch of test logging so that various warnings when starting a test agent go to the ltest logger and not straight to stdout. Various canned ent meta types always return a valid pointer (no more nils). This allows us to blindly deref + assign in various places. Update ACL index tracking to ensure oss -> ent upgrades will work as expected. Update ent meta parsing to include function to disallow wildcarding.	2019-12-06 14:01:34 -05:00
Freddy	caf658d0d3	Store check type in catalog (#6561 )	2019-10-17 20:33:11 +02:00
R.B. Boyer	55fdae203f	agent: cache notifications work after error if the underlying RPC returns index=1 (#6547 ) Fixes #6521 Ensure that initial failures to fetch an agent cache entry using the notify API where the underlying RPC returns a synthetic index of 1 correctly recovers when those RPCs resume working. The bug in the Cache.notifyBlockingQuery used to incorrectly "fix" the index for the next query from 0 to 1 for all queries, when it should have not done so for queries that errored. Also fixed some things that made debugging difficult: - config entry read/list endpoints send back QueryMeta headers - xds event loops don't swallow the cache notification errors	2019-09-26 10:42:17 -05:00
Freddy	5eace88ce2	Expose HTTP-based paths through Connect proxy (#6446) Fixes: #5396 This PR adds a proxy configuration stanza called expose. These flags register listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only listening on the loopback interface, while still accepting traffic from non Connect-enabled services. Under expose there is a boolean checks flag that would automatically expose all registered HTTP and gRPC check paths. This stanza also accepts a paths list to expose individual paths. The primary use case for this functionality would be to expose paths for third parties like Prometheus or the kubelet. Listeners for requests to exposed paths are configured dynamically at run time. Any time a proxy, or check can be registered, a listener can also be created. In this initial implementation requests to these paths are not authenticated/encrypted.	2019-09-25 20:55:52 -06:00
R.B. Boyer	682b5370c9	agent: tolerate more failure scenarios during service registration with central config enabled (#6472 ) Also: * Finished threading replaceExistingChecks setting (from GH-4905) through service manager. * Respected the original configSource value that was used to register a service or a check when restoring persisted data. * Run several existing tests with and without central config enabled (not exhaustive yet). * Switch to ioutil.ReadFile for all types of agent persistence.	2019-09-24 10:04:48 -05:00
R.B. Boyer	5c5f21088c	sdk: add freelist tracking and ephemeral port range skipping to freeport This should cut down on test flakiness. Problems handled: - If you had enough parallel test cases running, the former circular approach to handling the port block could hand out the same port to multiple cases before they each had a chance to bind them, leading to one of the two tests to fail. - The freeport library would allocate out of the ephemeral port range. This has been corrected for Linux (which should cover CI). - The library now waits until a formerly-in-use port is verified to be free before putting it back into circulation.	2019-09-17 14:30:43 -05:00
Sarah Adams	8e673371df	test: ensure all TestAgent constructions use a constructor (#6443 ) ensure all TestAgent constructions use a constructor to get start retries + test logs going to the right place Fixes #6435	2019-09-05 10:24:36 -07:00
Sarah Adams	c6c5f9c494	remove funky panic/recover in agent tests (#6442 )	2019-09-04 13:59:11 -07:00
Sarah Adams	f8fa10fecb	refactor & add better retry logic to NewTestAgent (#6363 ) Fixes #6361	2019-09-03 15:05:51 -07:00
Mike Morris	88df658243	connect: remove managed proxies (#6220 ) * connect: remove managed proxies implementation and all supporting config options and structs * connect: remove deprecated ProxyDestination * command: remove CONNECT_PROXY_TOKEN env var * agent: remove entire proxyprocess proxy manager * test: remove all managed proxy tests * test: remove irrelevant managed proxy note from TestService_ServerTLSConfig * test: update ContentHash to reflect managed proxy removal * test: remove deprecated ProxyDestination test * telemetry: remove managed proxy note * http: remove /v1/agent/connect/proxy endpoint * ci: remove deprecated test exclusion * website: update managed proxies deprecation page to note removal * website: remove managed proxy configuration API docs * website: remove managed proxy note from built-in proxy config * website: add note on removing proxy subdirectory of data_dir	2019-08-09 15:19:30 -04:00
Paul Banks	42296292a4	Allow raft TrailingLogs to be configured. (#6186 ) This fixes pathological cases where the write throughput and snapshot size are both so large that more than 10k log entries are written in the time it takes to restore the snapshot from disk. In this case followers that restart can never catch up with leader replication again and enter a loop of constantly downloading a full snapshot and restoring it only to find that snapshot is already out of date and the leader has truncated its logs so a new snapshot is sent etc. In general if you need to adjust this, you are probably abusing Consul for purposes outside its design envelope and should reconsider your usage to reduce data size and/or write volume.	2019-07-23 15:19:57 +01:00
R.B. Boyer	4c05f1f519	agent: avoid reverting any check updates that occur while a service is being added or the config is reloaded (#6144 )	2019-07-17 14:06:50 -05:00
Hans Hasselberg	73c4e9f07c	tls: auto_encrypt enables automatic RPC cert provisioning for consul clients (#5597 )	2019-06-27 22:22:07 +02:00
Pierre Souchay	e394a9469b	Support for maximum size for Output of checks (#5233 ) * Support for maximum size for Output of checks This PR allows users to limit the size of output produced by checks at the agent and check level. When set at the agent level, it will limit the output for all checks monitored by the agent. When set at the check level, it can override the agent max for a specific check but only if it is lower than the agent max. Default value is 4k, and input must be at least 1.	2019-06-26 09:43:25 -06:00
R.B. Boyer	9b41199585	agent: fix several data races and bugs related to node-local alias checks (#5876 ) The observed bug was that a full restart of a consul datacenter (servers and clients) in conjunction with a restart of a connect-flavored application with bring-your-own-service-registration logic would very frequently cause the envoy sidecar service check to never reflect the aliased service. Over the course of investigation several bugs and unfortunate interactions were corrected: (1) local.CheckState objects were only shallow copied, but the key piece of data that gets read and updated is one of the things not copied (the underlying Check with a Status field). When the stock code was run with the race detector enabled this highly-relevant-to-the-test-scenario field was found to be racy. Changes: a) update the existing Clone method to include the Check field b) copy-on-write when those fields need to change rather than incrementally updating them in place. This made the observed behavior occur slightly less often. (2) If anything about how the runLocal method for node-local alias check logic was ever flawed, there was no fallback option. Those checks are purely edge-triggered and failure to properly notice a single edge transition would leave the alias check incorrect until the next flap of the aliased check. The change was to introduce a fallback timer to act as a control loop to double check the alias check matches the aliased check every minute (borrowing the duration from the non-local alias check logic body). This made the observed behavior eventually go away when it did occur. (3) Originally I thought there were two main actions involved in the data race: A. The act of adding the original check (from disk recovery) and its first health evaluation. B. The act of the HTTP API requests coming in and resetting the local state when re-registering the same services and checks. It took awhile for me to realize that there's a third action at work: C. 
The goroutines associated with the original check and the later checks. The actual sequence of actions that was causing the bad behavior was that the API actions result in the original check to be removed and re-added _without waiting for the original goroutine to terminate_. This means for brief windows of time during check definition edits there are two goroutines that can be sending updates for the alias check status. In extremely unlikely scenarios the original goroutine sees the aliased check start up in `critical` before being removed but does not get the notification about the nearly immediate update of that check to `passing`. This is interlaced with the new goroutine coming up, initializing its base case to `passing` from the current state and then listening for new notifications of edge triggers. If the original goroutine "finishes" its update, it then commits one more write into the local state of `critical` and exits leaving the alias check no longer reflecting the underlying check. The correction here is to enforce that the old goroutines must terminate before spawning the new one for alias checks.	2019-05-24 13:36:56 -05:00
Alvin Huang	aacb81a566	Merge pull request #5376 from hashicorp/fix-tests Fix tests in prep for CircleCI Migration	2019-04-04 17:09:32 -04:00
Jeff Mitchell	d3c7d57209	Move internal/ to sdk/ (#5568 ) * Move internal/ to sdk/ * Add a readme to the SDK folder	2019-03-27 08:54:56 -04:00
Jeff Mitchell	a41c865059	Convert to Go Modules (#5517 ) * First conversion * Use serf 0.8.2 tag and associated updated deps * * Move freeport and testutil into internal/ * Make internal/ its own module * Update imports * Add replace statements so API and normal Consul code are self-referencing for ease of development * Adapt to newer goe/values * Bump to new cleanhttp * Fix ban nonprintable chars test * Update lock bad args test The error message when the duration cannot be parsed changed in Go 1.12 (ae0c435877d3aacb9af5e706c40f9dddde5d3e67). This updates that test. * Update another test as well * Bump travis * Bump circleci * Bump go-discover and godo to get rid of launchpad dep * Bump dockerfile go version * fix tar command * Bump go-cleanhttp	2019-03-26 17:04:58 -04:00
Hans Hasselberg	4eaffe4c41	agent: only use TestAgent when appropriate (#5502 )	2019-03-18 17:06:16 +01:00
Valentin Fritz	0fde90b172	Fix checks removal when removing service (#5457 ) Fix my recently discovered issue described here: #5456	2019-03-14 11:02:49 -04:00
Hans Hasselberg	d511e86491	agent: enable reloading of tls config (#5419 ) This PR introduces reloading tls configuration. Consul will now be able to reload the TLS configuration which previously required a restart. It is not yet possible to turn TLS ON or OFF with these changes. Only when TLS is already turned on, the configuration can be reloaded. Most importantly the certificates and CAs.	2019-03-13 10:29:06 +01:00
Aestek	2ce7240abc	Register and deregisters services and their checks atomically in the local state (#5012 ) Prevent race between register and deregister requests by saving them together in the local state on registration. Also adds more cleaning in case of failure when registering services / checks.	2019-03-04 09:34:05 -05:00
Matt Keeler	0c76a4389f	ACL Token Persistence and Reloading (#5328) This PR adds two features which will be useful for operators when ACLs are in use. 1. Tokens set in configuration files are now reloadable. 2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload) Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change. Some other secondary changes: * Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly. * Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain valid. * Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, it's no longer used only for ACL token/policy replication. The new functions will detect 404s and fall back to using the older token paths when talking to pre-1.4.3 agents. * Docs updated to reflect the API additions and to show using the new endpoints. * Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.	2019-02-27 14:28:31 -05:00
Alvin Huang	c8847c4213	add wait to TestAgent_RPCPing	2019-02-22 17:34:45 -05:00
Matt Keeler	a34f8c751e	Pass a testing.T into NewTestAgent and TestAgent.Start (#5342 ) This way we can avoid unnecessary panics which cause other tests not to run. This doesn't remove all the possibilities for panics causing other tests not to run, it just fixes the TestAgent	2019-02-14 10:59:14 -05:00
Matt Keeler	210c3a56b0	Improve Connect with Prepared Queries (#5291) Given a query like: ``` { "Name": "tagged-connect-query", "Service": { "Service": "foo", "Tags": ["tag"], "Connect": true } } ``` And a Consul configuration like: ``` { "services": [ { "name": "foo", "port": 8080, "connect": { "sidecar_service": {} }, "tags": ["tag"] } ] } ``` If you executed the query it would always turn up with 0 results. This was because the sidecar service was being created without any tags. You could instead make your config look like: ``` { "services": [ { "name": "foo", "port": 8080, "connect": { "sidecar_service": { "tags": ["tag"] } }, "tags": ["tag"] } ] } ``` However that is a bit redundant for most cases. This PR ensures that the tags and service meta of the parent service get copied to the sidecar service. If there are any tags or service meta set in the sidecar service definition then this copying does not take place. After the changes, the query will now return the expected results. A second change was made to prepared queries in this PR which is to allow filtering on ServiceMeta just like we allow for filtering on NodeMeta.	2019-02-04 09:36:51 -05:00
Kyle Havlovitz	3934df9472	Fix failing TestAgent_PurgeCheckOnDuplicate after merge	2019-01-28 13:19:38 -08:00

118 Commits