Commit Graph

168 Commits

Author SHA1 Message Date
Daniel Nephin 979749d86e state: do not delete from inside an iteration
Deleting from memdb inside an interation can cause a panic from Iterator.Next. This
case is technically safe (for now) because the iterator is using the root radix tree
not a modified one.

However this could break at any time if someone adds an insert or delete to the coordinates table
before this place in the function.

It also sets a bad example, because generally deletes in an interator are not safe. So this
commit uses the pattern we have in other places to move the deletes out of the iteration.
2021-01-19 17:00:07 -05:00
Daniel Nephin aa21c1ea04 state: reduce interface for Enterprise schema
Using withEnterpriseSchema() we can apply any enterprise schema changes
with a single shim, removing the need to duplicate all of the table
definitions.

Also move all the catalog schemas to a new file to shrink catalog.go a bit.
2021-01-15 18:49:55 -05:00
Matt Keeler 2badb01d30
Add a paramter in state store methods to indicate whether a resource insertion is from a snapshot restoration (#9156)
The Catalog, Config Entry, KV and Session resources potentially re-validate the input as its coming in. We need to prevent snapshot restoration failures due to missing namespaces or namespaces that are being deleted in enterprise.
2020-11-11 11:21:42 -05:00
Daniel Nephin f9b2834171 state: convert the remaining functions to ReadTxn
Required also converting some of the transaction functions to WriteTxn
because TxnRO() called the same helper as TxnRW.

This change allows us to return a memdb.Txn for read-only txn instead of
wrapping them with state.txn.
2020-10-23 14:29:22 -04:00
Freddy 89d52f41c4
Add protocol to the topology endpoint response (#8868) 2020-10-08 17:31:54 -06:00
Freddy de4af766f3
Support ingress gateways in mesh viz endpoint (#8864)
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2020-10-08 09:47:09 -06:00
Freddy 7d1f50d2e6
Return intention info in svc topology endpoint (#8853) 2020-10-07 18:35:34 -06:00
freddygv 82a17ccee6 Do not evaluate discovery chain for topology upstreams 2020-10-05 10:24:50 -06:00
freddygv 63c50e15bc Single DB txn for ServiceTopology and other PR comments 2020-10-05 10:24:50 -06:00
freddygv 7c11580e93 Add topology RPC endpoint 2020-10-05 10:24:50 -06:00
freddygv ac54bf99b3 Add func to combine up+downstream queries 2020-10-05 10:24:50 -06:00
freddygv 160a6539d1 factor in discovery chain when querying up/downstreams 2020-10-05 10:24:50 -06:00
freddygv 214b25919f support querying upstreams/downstreams from registrations 2020-10-05 10:24:50 -06:00
freddygv 43efb4809c Merge master 2020-09-14 16:17:43 -06:00
Daniel Nephin 629c34085d state: remove unused Store method receiver
And use ReadTxn interface where appropriate.
2020-08-13 11:25:22 -04:00
Kyle Havlovitz 8118e3db40 Fix a state store comment about version 2020-08-11 13:46:12 -07:00
Kyle Havlovitz 2601585017 fsm: Fix snapshot bug with restoring node/service/check indexes 2020-08-11 11:49:52 -07:00
freddygv 6dcfa11c21 Update error handling 2020-08-10 17:48:22 -06:00
freddygv 83f4e32376 PR comments and addtl tests 2020-08-05 16:07:11 -06:00
freddygv c87af29506 collect GatewayServices from iter in a function 2020-07-31 13:30:40 -06:00
freddygv 94d1f0a310 end to end changes to pass gatewayservices to /ui/services/ 2020-07-30 10:21:11 -06:00
Daniel Nephin 3fcb2e16f4 state: un-method funcs that don't use their receiver
This change was mostly automated with the following

First generate a list of functions with:

  git grep -o 'Store) \([^(]\+\)(tx \*txn' ./agent/consul/state | awk '{print $2}' | grep -o '^[^(]\+'

Then the list was curated a bit with trial/error to remove and add funcs
as necessary.

Finally the replacement was done with:

  dir=agent/consul/state
  file=${1-funcnames}

  while read fn; do
    echo "$fn"
    sed -i -e "s/(s \*Store) $fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.$fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.store\.$fn(/$fn(/" $dir/*.go
  done < $file
2020-07-16 15:30:39 -04:00
Daniel Nephin 07c1081d39 Fix a bunch of unparam lint issues 2020-06-24 13:00:14 -04:00
Daniel Nephin 97342de262
Merge pull request #8070 from hashicorp/dnephin/add-gofmt-simplify
ci: Enable gofmt simplify
2020-06-16 17:18:38 -04:00
Daniel Nephin 89d95561df Enable gofmt simplify
Code changes done automatically with 'gofmt -s -w'
2020-06-16 13:21:11 -04:00
Daniel Nephin 71e6534061 Rename txnWrapper to txn 2020-06-16 13:06:02 -04:00
Daniel Nephin 78c76f0773 Handle return value from txn.Commit 2020-06-16 13:04:31 -04:00
Paul Banks f9a6386c4a state: track changes so that they may be used to produce change events 2020-06-16 13:04:29 -04:00
freddygv cc4ff3ae02 Fixup stray sid references 2020-06-12 13:47:43 -06:00
freddygv 1e7e716742 Move compound service names to use ServiceName type 2020-06-12 13:47:43 -06:00
Daniel Nephin f9a89db86e
Update agent/consul/state/catalog.go
Co-authored-by: Hans Hasselberg <me@hans.io>
2020-05-20 16:34:14 -04:00
Daniel Nephin e1e1c13b35 state: use an error to indicate compare failed
Errors are values. We can use the error value to identify the 'comparison failed' case which makes the function easier to use and should make it harder to miss handle the error case
2020-05-20 12:43:33 -04:00
Chris Piraino 1173b31949
Return early from updateGatewayServices if nothing to update (#7838)
* Return early from updateGatewayServices if nothing to update

Previously, we returned an empty slice of gatewayServices, which caused
us to accidentally delete everything in the memdb table

* PR comment and better formatting
2020-05-11 14:46:48 -05:00
Chris Piraino 107c7a9ca7 PR comment and better formatting 2020-05-11 14:04:59 -05:00
Chris Piraino 9f924400e0 Return early from updateGatewayServices if nothing to update
Previously, we returned an empty slice of gatewayServices, which caused
us to accidentally delete everything in the memdb table
2020-05-11 12:38:04 -05:00
Freddy ebbb234ecb
Gateway Services Nodes UI Endpoint (#7685)
The endpoint supports queries for both Ingress Gateways and Terminating Gateways. Used to display a gateway's linked services in the UI.
2020-05-11 11:35:17 -06:00
Freddy a37d7a42c9
Fix up enterprise compatibility for gateways (#7813) 2020-05-08 09:44:34 -06:00
Jono Sosulska 44011c81f2
Fix spelling of deregister (#7804) 2020-05-08 10:03:45 -04:00
Kyle Havlovitz 04b6bd637a Filter wildcard gateway services to match listener protocol
This now requires some type of protocol setting in ingress gateway tests
to ensure the services are not filtered out.

- small refactor to add a max(x, y) function
- Use internal configEntryTxn function and add MaxUint64 to lib
2020-05-06 15:06:13 -05:00
Chris Piraino 210dda5682 Allow Hosts field to be set on an ingress config entry
- Validate that this cannot be set on a 'tcp' listener nor on a wildcard
service.
- Add Hosts field to api and test in consul config write CLI
- xds: Configure envoy with user-provided hosts from ingress gateways
2020-05-06 15:06:13 -05:00
Kyle Havlovitz e4268c8b7f Support multiple listeners referencing the same service in gateway definitions 2020-05-06 15:06:13 -05:00
Kyle Havlovitz b21cd112e5 Allow ingress gateways to route traffic based on Host header
This commit adds the necessary changes to allow an ingress gateway to
route traffic from a single defined port to multiple different upstream
services in the Consul mesh.

To do this, we now require all HTTP requests coming into the ingress
gateway to specify a Host header that matches "<service-name>.*" in
order to correctly route traffic to the correct service.

- Differentiate multiple listener's route names by port
- Adds a case in xds for allowing default discovery chains to create a
  route configuration when on an ingress gateway. This allows default
  services to easily use host header routing
- ingress-gateways have a single route config for each listener
  that utilizes domain matching to route to different services.
2020-05-06 15:06:13 -05:00
Freddy c34ee5d339
Watch fallback channel for gateways that do not exist (#7715)
Also ensure that WatchSets in tests are reset between calls to watchFired. 
Any time a watch fires, subsequent calls to watchFired on the same WatchSet
will also return true even if there were no changes.
2020-04-29 16:52:27 -06:00
Freddy f5c1e5268b
TLS Origination for Terminating Gateways (#7671) 2020-04-27 16:25:37 -06:00
freddygv e751b83a3f Clean up dead code, issue addressed by passing ws to serviceGatewayNodes 2020-04-27 11:08:41 -06:00
freddygv 7667567688 Avoid deleting mappings for services linked to other gateways on dereg 2020-04-27 11:08:41 -06:00
freddygv 28fe6920fe Re-fix bug in CheckConnectServiceNodes 2020-04-27 11:08:41 -06:00
freddygv bab101107c Fix ConnectQueryBlocking test 2020-04-27 11:08:40 -06:00
freddygv 65e60d02f1 Fix bug in CheckConnectServiceNodes
Previously, if a blocking query called CheckConnectServiceNodes
before the gateway-services memdb table had any entries,
a nil watchCh would be returned when calling serviceTerminatingGatewayNodes.
This means that the blocking query would not fire if a gateway config entry
was added after the watch started.

In cases where the blocking query started on proxy registration,
the proxy could potentially never become aware of an upstream endpoint
if that upstream was going to be represented by a gateway.
2020-04-27 11:08:40 -06:00
Chris Piraino c5ab43ebbc
Fix bug where non-typical services are associated with gateways (#7662)
On every service registration, we check to see if a service should be
assassociated to a wildcard gateway-service. This fixes an issue where
we did not correctly check to see if the service being registered was a
"typical" service or not.
2020-04-17 11:24:34 -05:00
Kyle Havlovitz 6a5eba63ab
Ingress Gateways for TCP services (#7509)
* Implements a simple, tcp ingress gateway workflow

This adds a new type of gateway for allowing Ingress traffic into Connect from external services.

Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
2020-04-16 14:00:48 -07:00
Freddy c1f79c6b3c
Terminating gateway discovery (#7571)
* Enable discovering terminating gateways

* Add TerminatingGatewayServices to state store

* Use GatewayServices RPC endpoint for ingress/terminating
2020-04-08 12:37:24 -06:00
Emre Savcı 7a99f29adc
agent: add len, cap while initializing arrays 2020-04-01 10:54:51 +02:00
R.B. Boyer a7fb26f50f
wan federation via mesh gateways (#6884)
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:

There are several distinct chunks of code that are affected:

* new flags and config options for the server

* retry join WAN is slightly different

* retry join code is shared to discover primary mesh gateways from secondary datacenters

* because retry join logic runs in the *agent* and the results of that
  operation for primary mesh gateways are needed in the *server* there are
  some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
  at multiple layers of abstraction just to pass the data down to the right
  layer.

* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers

* the function signature for RPC dialing picked up a new required field (the
  node name of the destination)

* several new RPCs for manipulating a FederationState object:
  `FederationState:{Apply,Get,List,ListMeshGateways}`

* 3 read-only internal APIs for debugging use to invoke those RPCs from curl

* raft and fsm changes to persist these FederationStates

* replication for FederationStates as they are canonically stored in the
  Primary and replicated to the Secondaries.

* a special derivative of anti-entropy that runs in secondaries to snapshot
  their local mesh gateway `CheckServiceNodes` and sync them into their upstream
  FederationState in the primary (this works in conjunction with the
  replication to distribute addresses for all mesh gateways in all DCs to all
  other DCs)

* a "gateway locator" convenience object to make use of this data to choose
  the addresses of gateways to use for any given RPC or gossip operation to a
  remote DC. This gets data from the "retry join" logic in the agent and also
  directly calls into the FSM.

* RPC (`:8300`) on the server sniffs the first byte of a new connection to
  determine if it's actually doing native TLS. If so it checks the ALPN header
  for protocol determination (just like how the existing system uses the
  type-byte marker).

* 2 new kinds of protocols are exclusively decoded via this native TLS
  mechanism: one for ferrying "packet" operations (udp-like) from the gossip
  layer and one for "stream" operations (tcp-like). The packet operations
  re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
  overhead.

* the server instances specially wrap the `memberlist.NetTransport` when running
  with gateway federation enabled (in a `wanfed.Transport`). The general gist is
  that if it tries to dial a node in the SAME datacenter (deduced by looking
  at the suffix of the node name) there is no change. If dialing a DIFFERENT
  datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
  gateways to eventually end up in a server's :8300 port.

* a new flag when launching a mesh gateway via `consul connect envoy` to
  indicate that the servers are to be exposed. This sets a special service
  meta when registering the gateway into the catalog.

* `proxycfg/xds` notice this metadata blob to activate additional watches for
  the FederationState objects as well as the location of all of the consul
  servers in that datacenter.

* `xds:` if the extra metadata is in place additional clusters are defined in a
  DC to bulk sink all traffic to another DC's gateways. For the current
  datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
  balances all servers as well as one mini-cluster per node
  (`<node>.server.<dc>.consul`)

* the `consul tls cert create` command got a new flag (`-node`) to help create
  an additional SAN in certs that can be used with this flavor of federation.
2020-03-09 15:59:02 -05:00
Matt Keeler 966d085066
Catalog + Namespace OSS changes. (#7219)
* Various Prepared Query + Namespace things

* Last round of OSS changes for a namespaced catalog
2020-02-10 10:40:44 -05:00
Matt Keeler 485a0a65ea
Updates to Config Entries and Connect for Namespaces (#7116) 2020-01-24 10:04:58 -05:00
R.B. Boyer 42f80367be
Restore a few more service-kind index updates so blocking in ServiceDump works in more cases (#6948)
Restore a few more service-kind index updates so blocking in ServiceDump works in more cases

Namely one omission was that check updates for dumped services were not
unblocking.

Also adds a ServiceDump state store test and also fix a watch bug with the
normal dump.

Follow-on from #6916
2019-12-19 10:15:37 -06:00
Matt Keeler 442924c35a
Sync of OSS changes to support namespaces (#6909) 2019-12-09 21:26:41 -05:00
Matt Keeler 90ae4a1f1e
OSS KV Modifications to Support Namespaces 2019-11-25 12:57:35 -05:00
Matt Keeler 68d79142c4
OSS Modifications necessary for sessions namespacing 2019-11-25 12:07:04 -05:00
Pierre Souchay 35d90fc899 Display IPs of machines when node names conflict to ease troubleshooting
When there is an node name conflicts, such messages are displayed within Consul:

`consul.fsm: EnsureRegistration failed: failed inserting node: Error while renaming Node ID: "e1d456bc-f72d-98e5-ebb3-26ae80d785cf": Node name node001 is reserved by node 05f10209-1b9c-b90c-e3e2-059e64556d4a with name node001`

While it is easy to find the node that has reserved the name, it is hard to find
the node trying to aquire the name since it is not registered, because it
is not part of `consul members` output

This PR will display the IP of the offender and solve far more easily those issues.
2019-08-28 15:57:05 -04:00
Matt Keeler 3914ec5c62
Various Gateway Fixes (#6093)
* Ensure the mesh gateway configuration comes back in the api within each upstream

* Add a test for the MeshGatewayConfig in the ToAPI functions

* Ensure we don’t use gateways for dc local connections

* Update the svc kind index for deletions

* Replace the proxycfg.state cache with an interface for testing

Also start implementing proxycfg state testing.

* Update the state tests to verify some gateway watches for upstream-targets of a discovery chain.
2019-07-12 17:19:37 -04:00
Matt Keeler 24749bc7e5 Implement Kind based ServiceDump and caching of the ServiceDump RPC 2019-07-01 16:28:30 -04:00
hashicorp-ci d237e86d83 Merge Consul OSS branch 'master' at commit 88b15d84f9fdb58ceed3dc971eb0390be85e3c15
skip-checks: true
2019-06-25 02:00:26 +00:00
Matt Keeler 93debd2610
Ensure that looking for services by addreses works with Tagged Addresses (#5984) 2019-06-21 13:16:17 -04:00
Matt Keeler 4c03f99a85
Fix CAS operations on Services (#5971)
* Fix CAS operations on services

* Update agent/consul/state/catalog_test.go

Co-Authored-By: R.B. Boyer <public@richardboyer.net>
2019-06-17 10:41:04 -04:00
Kyle Havlovitz dcbffdb956
Merge branch 'master' into change-node-id 2019-05-15 10:51:04 -07:00
Matt Keeler ac78c23021
Implement data filtering of some endpoints (#5579)
Fixes: #4222 

# Data Filtering

This PR will implement filtering for the following endpoints:

## Supported HTTP Endpoints

- `/agent/checks`
- `/agent/services`
- `/catalog/nodes`
- `/catalog/service/:service`
- `/catalog/connect/:service`
- `/catalog/node/:node`
- `/health/node/:node`
- `/health/checks/:service`
- `/health/service/:service`
- `/health/connect/:service`
- `/health/state/:state`
- `/internal/ui/nodes`
- `/internal/ui/services`

More can be added going forward and any endpoint which is used to list some data is a good candidate.

## Usage

When using the HTTP API a `filter` query parameter can be used to pass a filter expression to Consul. Filter Expressions take the general form of:

```
<selector> == <value>
<selector> != <value>
<value> in <selector>
<value> not in <selector>
<selector> contains <value>
<selector> not contains <value>
<selector> is empty
<selector> is not empty
not <other expression>
<expression 1> and <expression 2>
<expression 1> or <expression 2>
```

Normal boolean logic and precedence is supported. All of the actual filtering and evaluation logic is coming from the [go-bexpr](https://github.com/hashicorp/go-bexpr) library

## Other changes

Adding the `Internal.ServiceDump` RPC endpoint. This will allow the UI to filter services better.
2019-04-16 12:00:15 -04:00
Paul Banks 68e8933ba5
Connect: Make Connect health queries unblock correctly (#5508)
* Make Connect health queryies unblock correctly in all cases and use optimal number of watch chans. Fixes #5506.

* Node check test cases and clearer bug test doc

* Comment update
2019-03-21 16:01:56 +00:00
Kyle Havlovitz bb0839ea5b Condense some test logic and add a comment about renaming 2019-03-18 16:15:36 -07:00
Paul Banks dd08426b04
Optimize health watching to single chan/goroutine. (#5449)
Refs #4984.

Watching chans for every node we touch in a health query is wasteful. In #4984 it shows that if there are more than 682 service instances we always fallback to watching all services which kills performance.

We already have a record in MemDB that is reliably update whenever the service health result should change thanks to per-service watch indexes.

So in general, provided there is at least one service instances and we actually have a service index for it (we always do now) we only ever need to watch a single channel.

This saves us from ever falling back to the general index and causing the performance cliff in #4984, but it also means fewer goroutines and work done for every blocking health query.

It also saves some allocations made during the query because we no longer have to populate a WatchSet with 3 chans per service instance which saves the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away and in general seems safe.
2019-03-15 20:18:48 +00:00
Kyle Havlovitz 3aec844fd2 Update state store test for changing node ID 2019-03-13 17:05:31 -07:00
Aestek 071fcb28ba [catalog] Update the node's services indexes on update (#5458)
Node updates were not updating the service indexes, which are used for
service related queries. This caused the X-Consul-Index to stay the same
after a node update as seen from a service query even though the node
data is returned in heath queries. If that happened in between queries
the client would miss this change.
We now update the indexes of the services on the node when it is
updated.

Fixes: #5450
2019-03-11 14:48:19 +00:00
Kyle Havlovitz bf09061e86 Add logic to allow changing a failed node's ID 2019-03-07 22:42:54 -08:00
R.B. Boyer 91e78e00c7
fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
Kyle Havlovitz b0f07d9b5e
Merge pull request #4869 from hashicorp/txn-checks
Add node/service/check operations to transaction api
2019-01-22 11:16:09 -08:00
Aestek ff13518961 Improve blocking queries on services that do not exist (#4810)
## Background

When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.

## Implementation

While reducing the number of unnecessary updates we still want :
* Clients to be notified as soon as when the last instance of a service disapears.
* Clients to be notified whenever there's there is an update for the service.
* Clients to be notified as soon as the first instance of the requested service is added.

To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :

1. Client `client1` makes a query for service `foo`, gets back a node and X-Consul-Index 42
2. `foo` is unregistered 
3. `client1`  makes a query for `foo` with `index=42` -> `foo` does not exist, the query blocks and `client1` is not notified of the change on `foo` 

We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly. 
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist. 
When a service `srv` is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients while `srv` was up, allowing us to immediatly notify them.

1. Client `client1` makes a query for service `foo`, gets back a node and `X-Consul-Index: 42`
2. `foo` is unregistered, we set the "missing service index" to 43 
3. `client1` makes a blocking query for `foo` with `index=42` -> `foo` does not exist, we check against the "missing service index" and return immediatly with `X-Consul-Index: 43`
4. `client1` makes a blocking query for `foo` with `index=43` -> we block
5. Other changes happen in the cluster, but foo still doesn't exist and "missing service index" hasn't changed, the query is still blocked
6. `foo` is registered again on index 62 -> `foo` exists and its index is greater than 43, we unblock the query
2019-01-11 09:26:14 -05:00
Kyle Havlovitz c266277a49 txn: clean up some state store/acl code 2019-01-09 11:59:23 -08:00
Kyle Havlovitz efcdc85e1a api: add support for new txn operations 2018-12-12 10:54:09 -08:00
Kyle Havlovitz 41e8120d3d state: add tests for new txn ops 2018-12-12 10:04:10 -08:00
Kyle Havlovitz a40a346be8 txn: add service operations 2018-12-12 10:04:10 -08:00
Kyle Havlovitz b1aeb3b943 txn: add node operations 2018-12-12 10:04:10 -08:00
Kyle Havlovitz bd6b7ad162 txn: add pre-check operations to txn endpoint 2018-12-12 10:04:10 -08:00
Kyle Havlovitz 8a0d7b65d6 Add check operations to transaction api 2018-12-12 10:04:10 -08:00
Rebecca Zanzig 0ec6d880f5 Support multiple tags for health and catalog http api endpoints (#4717)
* Support multiple tags for health and catalog api endpoints

Fixes #1781.

Adds a `ServiceTags` field to the ServiceSpecificRequest to support
multiple tags, updates the filter logic in the catalog store, and
propagates these change through to the health and catalog endpoints.

Note: Leaves `ServiceTag` in the struct, since it is being used as
part of the DNS lookup, which in turn uses the health check.

* Update the api package to support multiple tags

Includes additional tests.

* Update new tests to use the `require` library

* Update HealthConnect check after a bad merge
2018-10-11 12:50:05 +01:00
Pierre Souchay b0fc91a1d2 [Performance On Large clusters] Reduce updates on large services (#4720)
* [Performance On Large clusters] Checks do update services/nodes only when really modified to avoid too many updates on very large clusters

In a large cluster, when having a few thousands of nodes, the anti-entropy
mechanism performs lots of changes (several per seconds) while
there is no real change. This patch wants to improve this in order
to increase Consul scalability when using many blocking requests on
health for instance.

* [Performance for large clusters] Only updates index of service if service is really modified

* [Performance for large clusters] Only updates index of nodes if node is really modified

* Added comments / ensure IsSame() has clear semantics

* Avoid having modified boolean, return nil directly if stutures are Same

* Fixed unstable unit tests TestLeader_ChangeServerID

* Rewrite TestNode_IsSame() for better readability as suggested by @banks

* Rename ServiceNode.IsSame() into IsSameService() + added unit tests

* Do not duplicate TestStructs_ServiceNode_Conversions() and increase test coverage of IsSameService

* Clearer documentation in IsSameService

* Take into account ServiceProxy into ServiceNode.IsSameService()

* Fixed IsSameService() with all new structures
2018-10-11 12:42:39 +01:00
Pierre Souchay a16f34058b Display more information about check being not properly added when it fails (#4405)
* Display more information about check being not properly added when it fails

It follows an incident where we add lots of error messages:

  [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration

That seems related to Consul failing to restart on respective agents.

Having Node information as well as service information would help diagnose the issue.

* Renamed ensureCheckIfNodeMatches() as requested by @banks
2018-08-14 17:45:33 +01:00
Pierre Souchay 821a91ca31 Allow to rename nodes with IDs, will fix #3974 and #4413 (#4415)
* Allow to rename nodes with IDs, will fix #3974 and #4413

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See #3983 and #4399 for Context about this change.

Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.

* Removed the case-insensitive search when adding a node within the else
block since it breaks the test TestAgentAntiEntropy_Services

While the else case is probably legit, it will be fixed with #4414 in
a later release.

* Added again the test in the else to avoid duplicated names, but
enforce this test only for nodes having IDs.

Thus most tests without any ID will work, and allows us fixing

* Added more tests regarding request with/without IDs.

`TestStateStore_EnsureNode` now test registration and renaming with IDs

`TestStateStore_EnsureNodeDeprecated` tests registration without IDs
and tests removing an ID from a node as well as updated a node
without its ID (deprecated behaviour kept for backwards compatibility)

* Do not allow renaming in case of conflict, including when other node has no ID

* Fixed function GetNodeID that was not working due to wrong type when searching node from its ID

Thus, all tests about renaming were not working properly.

Added the full test cas that allowed me to detect it.

* Better error messages, more tests when nodeID is not a valid UUID in GetNodeID()

* Added separate TestStateStore_GetNodeID to test GetNodeID.

More complete test coverage for GetNodeID

* Added new unit test `TestStateStore_ensureNoNodeWithSimilarNameTxn`

Also fixed comments to be clearer after remarks from @banks

* Fixed error message in unit test to match test case

* Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler
2018-08-10 11:30:45 -04:00
Matt Keeler 965fc9cf62
Revert "Allow changing Node names since Node now have IDs" 2018-07-12 11:19:21 -04:00
Matt Keeler 42729d5aff
Merge pull request #3983 from pierresouchay/node_renaming
Allow changing Node names since Node now have IDs
2018-07-11 16:03:02 -04:00
Pierre Souchay 3d0a960470 When renaming a node, ensure the name is not taken by another node.
Since DNS is case insensitive and DB as issues when similar names with different
cases are added, check for unicity based on case insensitivity.

Following another big incident we had in our cluster, we also validate
that adding/renaming a not does not conflicts with case insensitive
matches.

We had the following error once:

 - one node called: mymachine.MYDC.mydomain was shut off
 - another node (different ID) was added with name: mymachine.mydc.mydomain before
   72 hours

When restarting the consul server of domain, the consul server restarted failed
to start since it detected an issue in RAFT database because
mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.

Checking at registration time with case insensitivity should definitly fix
those issues and avoid Consul DB corruption.
2018-07-11 14:42:54 +02:00
Mitchell Hashimoto a3e0ac1ee3 agent/consul/state: support querying by Connect native 2018-06-25 12:24:08 -07:00
Mitchell Hashimoto daaa6e2403
agent: clean up connect/non-connect duplication by using shared methods 2018-06-14 09:41:48 -07:00
Mitchell Hashimoto 119ffe3ed9
agent/consul: implement Health.ServiceNodes for Connect, DNS works 2018-06-14 09:41:47 -07:00
Mitchell Hashimoto 06957f6d7f
agent/consul/state: ConnectServiceNodes 2018-06-14 09:41:47 -07:00
Wim 88514d6a82 Add support for reverse lookup of services 2018-05-19 19:39:02 +02:00
Paul Banks c55885efd8
Merge pull request #3970 from pierresouchay/node_health_should_change_service_index
[BUGFIX] When a node level check is removed, ensure all services of node are notified
2018-05-08 16:44:50 +01:00
Pierre Souchay 1b55e3559b Allow renaming nodes when ID is unchanged 2018-04-18 15:39:38 +02:00
Preetha Appan d9d9944179
Renames agent API layer for service metadata to "meta" for consistency 2018-03-28 09:04:50 -05:00
Preetha 8dacb12c79
Merge pull request #3881 from pierresouchay/service_metadata
Feature Request: Support key-value attributes for services
2018-03-27 16:33:57 -05:00