Commit Graph

1823 Commits

Author SHA1 Message Date
Daniel Nephin afdbf2a8ef state: rename table name constants to new pattern
Using Apps Hungarian Notation for these constants makes the memdb queries more readable.
2021-02-05 12:12:18 -05:00
Pierre Souchay c466b08481 Streaming filter tags + case insensitive lookups for Service Names
Will fix:
 * https://github.com/hashicorp/consul/issues/9695
 * https://github.com/hashicorp/consul/issues/9702
2021-02-04 11:00:51 +01:00
Daniel Nephin f929a7117e state: Remove unnecessary entMeta arg to EnsureConfigEntry 2021-02-03 18:10:38 -05:00
Kyle Havlovitz 1dee4173c1 connect/ca: Allow ForceWithoutCrossSigning for all providers
This allows setting ForceWithoutCrossSigning when reconfiguring the CA
for any provider, in order to forcibly move to a new root in cases where
the old provider isn't reachable or able to cross-sign for whatever
reason.
2021-01-29 13:38:11 -08:00
Daniel Nephin 09425b22a1 state: rename config-entries table const to match new pattern 2021-01-28 20:34:34 -05:00
Daniel Nephin 7d17e20270 state: move config-entries table to new pattern 2021-01-28 20:34:15 -05:00
Daniel Nephin 825b8ade39 state: use indexID
this change was already made to enterprise, so backporting it.
2021-01-28 20:30:08 -05:00
Daniel Nephin 2a262f07fc state: Move ACL schema indexes to match Ent
and use constants for table and index names.
2021-01-28 20:05:09 -05:00
Matt Keeler 1379b5f7d6
Upgrade raft-autopilot and wait for autopilot it to stop when revoking leadership (#9644)
Fixes: 9626
2021-01-27 11:14:52 -05:00
Hans Hasselberg 623aab5880
Add flags to support CA generation for Connect (#9585) 2021-01-27 08:52:15 +01:00
R.B. Boyer 5777fa1f59
server: initialize mgw-wanfed to use local gateways more on startup (#9528)
Fixes #9342
2021-01-25 17:30:38 -06:00
Daniel Nephin d7d081f402
Merge pull request #9420 from hashicorp/dnephin/reduce-duplicate-in-catalog-schema
state: reduce interface for Enterprise schema
2021-01-25 17:04:25 -05:00
R.B. Boyer 6622185d64
server: use the presense of stored federation state data as a sign that we already activated the federation state feature flag (#9519)
This way we only have to wait for the serf barrier to pass once before
we can make use of federation state APIs Without this patch every
restart needs to re-compute the change.
2021-01-25 13:24:32 -06:00
R.B. Boyer 0247f409a0
server: when wan federating via mesh gateways only do heuristic primary DC bypass on the leader (#9366)
Fixes #9341
2021-01-22 10:03:24 -06:00
Freddy 5519051c84
Update topology mapping Refs on all proxy instance deletions (#9589)
* Insert new upstream/downstream mapping to persist new Refs

* Avoid upserting mapping copy if it's a no-op

* Add test with panic repro

* Avoid deleting up/downstreams from inside memdb iterator

* Avoid deleting gateway mappings from inside memdb iterator

* Add CHANGELOG entry

* Tweak changelog entry

Co-authored-by: Paul Banks <banks@banksco.de>
2021-01-20 15:17:26 +00:00
Daniel Nephin 979749d86e state: do not delete from inside an iteration
Deleting from memdb inside an interation can cause a panic from Iterator.Next. This
case is technically safe (for now) because the iterator is using the root radix tree
not a modified one.

However this could break at any time if someone adds an insert or delete to the coordinates table
before this place in the function.

It also sets a bad example, because generally deletes in an interator are not safe. So this
commit uses the pattern we have in other places to move the deletes out of the iteration.
2021-01-19 17:00:07 -05:00
Matt Keeler 2d2ce1fb0c
Ensure that CA initialization does not block leader election.
After fixing that bug I uncovered a couple more:

Fix an issue where we might try to cross sign a cert when we never had a valid root.
Fix a potential issue where reconfiguring the CA could cause either the Vault or AWS PCA CA providers to delete resources that are still required by the new incarnation of the CA.
2021-01-19 15:27:48 -05:00
Daniel Nephin 52a1d78e39 state: add a regression test for state store schema
To allow the index to be refactored without accidental changes.

To update the expected value run: 'go test ./agent/consul/state -update'
2021-01-15 18:49:55 -05:00
Daniel Nephin aa21c1ea04 state: reduce interface for Enterprise schema
Using withEnterpriseSchema() we can apply any enterprise schema changes
with a single shim, removing the need to duplicate all of the table
definitions.

Also move all the catalog schemas to a new file to shrink catalog.go a bit.
2021-01-15 18:49:55 -05:00
Daniel Nephin e8427a48ab agent/consuk: Rename RPCRate -> RPCRateLimit
so that the field name is consistent across config structs.
2021-01-14 17:26:00 -05:00
Daniel Nephin e5320c2db6 agent/consul: make Client/Server config reloading more obvious
I believe this commit also fixes a bug. Previously RPCMaxConnsPerClient was not being re-read from the RuntimeConfig, so passing it to Server.ReloadConfig was never changing the value.

Also improve the test runtime by not doing a lot of unnecessary work.
2021-01-14 17:21:10 -05:00
Daniel Nephin f2b504873a
Merge pull request #9460 from hashicorp/dnephin/fix-data-races
Fix a couple data races in tests
2021-01-14 17:07:01 -05:00
Chris Piraino baad708929
Fix bug in usage metrics when multiple service instances are changed in a single transaction (#9440)
* Fix bug in usage metrics that caused a negative count to occur

There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.

We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.

* Add changelog
2021-01-12 15:31:47 -06:00
Chris Piraino 2eac571276
Log replication warnings when no error suppression is defined (#9320)
* Log replication warnings when no error suppression is defined

* Add changelog file
2021-01-08 14:03:06 -06:00
Daniel Nephin 45f0afcbf4 structs: Fix printing of IDs
These types are used as values (not pointers) in other structs. Using a pointer receiver causes
problems when the value is printed. fmt will not call the String method if it is passed a value
and the String method has a pointer receiver. By using a value receiver the correct string is printed.

Also remove some unused methods.
2021-01-07 18:47:38 -05:00
Daniel Nephin 27c38bfebb
Merge pull request #9213 from hashicorp/dnephin/resolve-tokens-take-2
acl: Remove some unused things and document delegate method
2021-01-06 18:51:51 -05:00
R.B. Boyer db62541676
acl: use the presence of a management policy in the state store as a sign that we already migrated to v2 acls (#9505)
This way we only have to wait for the serf barrier to pass once before
we can upgrade to v2 acls. Without this patch every restart needs to
re-compute the change, and potentially if a stray older node joins after
a migration it might regress back to v1 mode which would be problematic.
2021-01-05 17:04:27 -06:00
Matt Keeler 3a79b559f9
Special case the error returned when we have a Raft leader but are not tracking it in the ServerLookup (#9487)
This can happen when one other node in the cluster such as a client is unable to communicate with the leader server and sees it as failed. When that happens its failing status eventually gets propagated to the other servers in the cluster and eventually this can result in RPCs returning “No cluster leader” error.

That error is misleading and unhelpful for determing the root cause of the issue as its not raft stability but rather and client -> server networking issue. Therefore this commit will add a new error that will be returned in that case to differentiate between the two cases.
2021-01-04 14:05:23 -05:00
R.B. Boyer 42dea6f01e
server: deletions of intentions by name using the intention API is now idempotent (#9278)
Restoring a behavior inadvertently changed while fixing #9254
2021-01-04 11:27:00 -06:00
Daniel Nephin 088831c91e Maybe fix another data race in a test 2020-12-22 18:53:54 -05:00
Daniel Nephin d0f2eca8de Fix one race caused by t.Parallel 2020-12-22 18:27:18 -05:00
Daniel Nephin c66a63275f
Merge pull request #9340 from hashicorp/dnephin/skip-slow-tests-with-short
testing: skip slow tests with -short
2020-12-11 13:33:44 -05:00
R.B. Boyer f9dcaf7f6b
acl: global tokens created by auth methods now correctly replicate to secondary datacenters (#9351)
Previously the tokens would fail to insert into the secondary's state
store because the AuthMethod field of the ACLToken did not point to a
known auth method from the primary.
2020-12-09 15:22:29 -06:00
Daniel Nephin ef0999547a testing: skip slow tests with -short
Add a skip condition to all tests slower than 100ms.

This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests

With this change:

```
$ time go test -count=1 -short ./agent
ok      github.com/hashicorp/consul/agent       0.743s

real    0m4.791s

$ time go test -count=1 -short ./agent/consul
ok      github.com/hashicorp/consul/agent/consul        4.229s

real    0m8.769s
```
2020-12-07 13:42:55 -05:00
Kyle Havlovitz 57210a59c3 connect: Fix a case where the active root would get unset even when there wasn't a new one 2020-12-02 11:42:23 -08:00
Kyle Havlovitz 91d5d6c586
Merge pull request #9009 from hashicorp/update-secondary-ca
connect: Fix an issue with updating CA config in a secondary datacenter
2020-11-30 14:49:28 -08:00
Kyle Havlovitz c5167cf9c4 Use a buffered channel for CA intermediate renew func 2020-11-30 14:37:24 -08:00
R.B. Boyer 6d6b6c15c6
server: fix panic when deleting a non existent intention (#9254)
* server: fix panic when deleting a non existent intention

* add changelog

* Always return an error when deleting non-existent ixn

Co-authored-by: freddygv <gh@freddygv.xyz>
2020-11-24 13:44:20 -05:00
Hans Hasselberg 25f9e232af add missing descriptions for metrics 2020-11-23 22:06:30 +01:00
Kit Patella 7a8844ccce add entries for missing fsm operations and mark duplicated metrics prefixes as deprecated 2020-11-23 12:42:51 -08:00
Kyle Havlovitz a01f853aa5 Clean up the logic in persistNewRootAndConfig 2020-11-20 15:54:44 -08:00
Kyle Havlovitz 26a9c985c5 Add CA server delegate interface for testing 2020-11-19 20:08:06 -08:00
Kit Patella 4ad076207e add telemetry and definition help entries for missing catalog and acl metrics 2020-11-19 13:29:44 -08:00
Kit Patella 46205bbf27 remove stale entries and rename/define acl.resolveToken 2020-11-19 13:06:28 -08:00
Freddy e4e306210a
Require operator:write to get Connect CA config (#9240)
A vulnerability was identified in Consul and Consul Enterprise (“Consul”) such that operators with `operator:read` ACL permissions are able to read the Consul Connect CA configuration when explicitly configured with the `/v1/connect/ca/configuration` endpoint, including the private key. This allows the user to effectively privilege escalate by enabling the ability to mint certificates for any Consul Connect services. This would potentially allow them to masquerade (receive/send traffic) as any service in the mesh.

--

This PR increases the permissions required to read the Connect CA's private key when it was configured via the `/connect/ca/configuration` endpoint. They are now `operator:write`.
2020-11-19 10:14:48 -07:00
Kyle Havlovitz c8d4a40a87 connect: update some function comments in CA manager 2020-11-17 16:00:19 -08:00
Daniel Nephin b9306d8827 acl: remove a test-only method 2020-11-17 18:16:34 -05:00
Daniel Nephin 9e7c8dd19d Remove two unused delegate methods 2020-11-17 18:16:26 -05:00
Matt Keeler 4bca029be9
Refactor to call non-voting servers read replicas (#9191)
Co-authored-by: Kit Patella <kit@jepsen.io>
2020-11-17 10:53:57 -05:00
Kit Patella 4dfcdbab26
Merge pull request #9198 from hashicorp/mkcp/telemetry/add-all-metric-definitions
Add metric definitions for all metrics known at Consul start
2020-11-16 15:54:50 -08:00
Matt Keeler 197a37a860
Prevent panic if autopilot health is requested prior to leader establishment finishing. (#9204) 2020-11-16 17:08:17 -05:00
Daniel Nephin de88ceed1c
Merge pull request #9114 from hashicorp/dnephin/filtering-in-stream
stream: improve naming of Payload methods
2020-11-16 14:20:07 -05:00
Kit Patella 0b18f5612e trim help strings to save a few bytes 2020-11-16 11:02:11 -08:00
Kit Patella 374748dafc merge master 2020-11-16 10:46:53 -08:00
Kit Patella af719981f3 finish adding static server metrics 2020-11-13 16:26:08 -08:00
Kyle Havlovitz 0a86533e20 Reorganize some CA manager code for correctness/readability 2020-11-13 14:46:01 -08:00
Kyle Havlovitz 5de81c1375 connect: Add CAManager for synchronizing CA operations 2020-11-13 14:33:44 -08:00
Kyle Havlovitz 0b4876f906 connect: Add logic for updating secondary DC intermediate on config set 2020-11-13 14:33:44 -08:00
R.B. Boyer db1184c094
server: intentions CRUD requires connect to be enabled (#9194)
Fixes #9123
2020-11-13 16:19:12 -06:00
Kit Patella b486c1bce8 add the service name in the agent rather than in the definitions themselves 2020-11-13 13:18:04 -08:00
R.B. Boyer e323014faf
server: remove config entry CAS in legacy intention API bridge code (#9151)
Change so line-item intention edits via the API are handled via the state store instead of via CAS operations.

Fixes #9143
2020-11-13 14:42:21 -06:00
R.B. Boyer 6300abed18
server: skip deleted and deleting namespaces when migrating intentions to config entries (#9186) 2020-11-13 13:56:41 -06:00
Mike Morris a343365da7
ci: update to Go 1.15.4 and alpine:3.12 (#9036)
* ci: stop building darwin/386 binaries

Go 1.15 drops support for 32-bit binaries on Darwin https://golang.org/doc/go1.15#darwin

* tls: ConnectionState::NegotiatedProtocolIsMutual is deprecated in Go 1.15, this value is always true

* correct error messages that changed slightly

* Completely regenerate some TLS test data

Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2020-11-13 13:02:59 -05:00
R.B. Boyer 758384893d
server: break up Intention.Apply monolithic method (#9007)
The Intention.Apply RPC is quite large, so this PR attempts to break it down into smaller functions and dissolves the pre-config-entry approach to the breakdown as it only confused things.
2020-11-13 09:15:39 -06:00
Kit Patella 9533372ded first pass on agent-configured prometheusDefs and adding defs for every consul metric 2020-11-12 18:12:12 -08:00
R.B. Boyer a5bd1ba323
agent: return the default ACL policy to callers as a header (#9101)
Header is: X-Consul-Default-ACL-Policy=<allow|deny>

This is of particular utility when fetching matching intentions, as the
fallthrough for a request that doesn't match any intentions is to
enforce using the default acl policy.
2020-11-12 10:38:32 -06:00
Matt Keeler 2badb01d30
Add a paramter in state store methods to indicate whether a resource insertion is from a snapshot restoration (#9156)
The Catalog, Config Entry, KV and Session resources potentially re-validate the input as its coming in. We need to prevent snapshot restoration failures due to missing namespaces or namespaces that are being deleted in enterprise.
2020-11-11 11:21:42 -05:00
Matt Keeler 1f40f51a58
Fix a bunch of linter warnings 2020-11-09 09:22:12 -05:00
Matt Keeler 755fb72994
Switch to using the external autopilot module 2020-11-09 09:22:11 -05:00
Daniel Nephin e4a78c977d stream: document that Payload must be immutable
If they are sent to EventPublisher.Publish.

Also document that PayloadEvents is expected to come from a subscription and that it is
not immutable.
2020-11-06 13:00:33 -05:00
Daniel Nephin 4fc073b1f4 stream: rename FilterByKey 2020-11-05 19:21:16 -05:00
Daniel Nephin d4cd2fa6a8 stream: Add HasReadPermission to Payload
Required now that filter is a method on PayloadEvents instead of Event
2020-11-05 19:17:18 -05:00
Daniel Nephin 8a26bca020 stream: move event filtering to PayloadEvents
Removes the weirdness around PayloadEvents.FilterByKey
2020-11-05 17:50:17 -05:00
Daniel Nephin dcacfd3548 stream: Remove unused method 2020-11-05 16:49:59 -05:00
Daniel Nephin 621f1db766
Merge pull request #9073 from hashicorp/dnephin/backport-streaming-namespaces
streaming: backport namespace changes
2020-11-05 14:19:10 -05:00
Daniel Nephin cd220e5d6c
Merge pull request #9061 from hashicorp/dnephin/event-fields
stream: support filtering by namespace
2020-11-05 14:18:35 -05:00
Daniel Nephin f6b629852f state: test EventPayloadCheckServiceNode.FilterByKey
Also fix a bug in that function when only one of key or namespace were the empty string.
2020-10-30 14:35:57 -04:00
Daniel Nephin 60df44df4f stream: Add tests for filterByKey with namespace
And fix a bug where a request with a Namespace but no Key would not be properly filtered
2020-10-30 14:35:42 -04:00
Daniel Nephin 318dfbe6e4 stream: Move FilterByKey events to a table
In preparation for adding new tests.
2020-10-30 14:35:28 -04:00
Daniel Nephin 2d0030da39 state: use enterprise meta for creating events 2020-10-30 14:34:04 -04:00
Daniel Nephin b57c7afcbb stream: include the namespace in the snap cache key
Otherwise the wrong snapshot could be returned when the same key is used in different namespaces
2020-10-30 14:34:04 -04:00
Daniel Nephin 8da30fcb9a subscribe: set the request namespace 2020-10-30 14:34:04 -04:00
R.B. Boyer 67a0d0c426
state: ensure we unblock intentions queries upon the upgrade to config entries (#9062)
1. do a state store query to list intentions as the agent would do over in `agent/proxycfg` backing `agent/xds`
2. upgrade the database and do a fresh `service-intentions` config entry write
3. the blocking query inside of the agent cache in (1) doesn't notice (2)
2020-10-29 15:28:31 -05:00
R.B. Boyer 78014653b3 restore prior signature of test helper so enterprise compiles 2020-10-29 13:52:15 -05:00
Daniel Nephin 61ce0964a4 stream: remove Event.Key
Makes Payload a type with FilterByKey so that Payloads can implement
filtering by key. With this approach we don't need to expose a Namespace
field on Event, and we don't need to invest micro formats or require a
bunch of code to be aware of exactly how the key field is encoded.
2020-10-28 16:48:04 -04:00
Daniel Nephin 8ef4c0fcc5 state: use go-cmp for comparison
The output of the previous assertions made it impossible to debug the tests without code changes.

With go-cmp comparing the entire slice we can see the full diffs making it easier to debug failures.
2020-10-28 16:33:00 -04:00
Daniel Nephin 44da869ed4 stream: Use a no-op event publisher if streaming is disabled 2020-10-28 13:54:19 -04:00
Daniel Nephin eea87e1acf store: use a ReadDB for snapshots
to remove the cyclic dependency between the snapshot handlers and the state.Store
2020-10-28 13:07:42 -04:00
Daniel Nephin cfe0ffde15
Merge pull request #9026 from hashicorp/dnephin/streaming-without-cache-query-param
streaming: rename config and remove requirement for cache=1
2020-10-28 12:33:25 -04:00
Daniel Nephin 03d2be03e7
Merge pull request #8618 from hashicorp/dnephin/remove-txn-readtxn
state: Use ReadTxn everywhere
2020-10-28 12:32:47 -04:00
Daniel Nephin abd8cfcfe9 state: disable streaming connect topic 2020-10-26 11:49:47 -04:00
R.B. Boyer 0a80e82f21
server: config entry replication now correctly uses namespaces in comparisons (#9024)
Previously config entries sharing a kind & name but in different
namespaces could occasionally cause "stuck states" in replication
because the namespace fields were ignored during the differential
comparison phase.

Example:

Two config entries written to the primary:

    kind=A,name=web,namespace=bar
    kind=A,name=web,namespace=foo

Under the covers these both get saved to memdb, so they are sorted by
all 3 components (kind,name,namespace) during natural iteration. This
means that before the replication code does it's own incomplete sort,
the underlying data IS sorted by namespace ascending (bar comes before
foo).

After one pass of replication the primary and secondary datacenters have
the same set of config entries present. If
"kind=A,name=web,namespace=bar" were to be deleted, then things get
weird. Before replication the two sides look like:

primary: [
    kind=A,name=web,namespace=foo
]
secondary: [
    kind=A,name=web,namespace=bar
    kind=A,name=web,namespace=foo
]

The differential comparison phase walks these two lists in sorted order
and first compares "kind=A,name=web,namespace=foo" vs
"kind=A,name=web,namespace=bar" and falsely determines they are the SAME
and are thus cause an update of "kind=A,name=web,namespace=foo". Then it
compares "<nothing>" with "kind=A,name=web,namespace=foo" and falsely
determines that the latter should be DELETED.

During reconciliation the deletes are processed before updates, and so
for a brief moment in the secondary "kind=A,name=web,namespace=foo" is
erroneously deleted and then immediately restored.

Unfortunately after this replication phase the final state is identical
to the initial state, so when it loops around again (rate limited) it
repeats the same set of operations indefinitely.
2020-10-23 13:41:54 -05:00
Daniel Nephin f9b2834171 state: convert the remaining functions to ReadTxn
Required also converting some of the transaction functions to WriteTxn
because TxnRO() called the same helper as TxnRW.

This change allows us to return a memdb.Txn for read-only txn instead of
wrapping them with state.txn.
2020-10-23 14:29:22 -04:00
Daniel Nephin 26387cdc0e
Merge pull request #8975 from hashicorp/dnephin/stream-close-on-unsub
stream: close the subscription on Unsubscribe
2020-10-23 12:58:12 -04:00
Freddy d23038f94f
Add HasExact to topology endpoint (#9010) 2020-10-23 10:45:41 -06:00
Daniel Nephin fb8b68a6ec stream: close the subscription on Unsubscribe 2020-10-22 13:39:27 -04:00
Pierre Souchay 54f9f247f8
Consul Service meta wrongly computes and exposes non_voter meta (#8731)
* Consul Service meta wrongly computes and exposes non_voter meta

In Serf Tags, entreprise members being non-voters use the tag
`nonvoter=1`, not `non_voter = false`, so non-voters in members
were wrongly displayed as voter.

Demonstration:

```
consul members -detailed|grep voter
consul20-hk5 10.200.100.110:8301   alive   acls=1,build=1.8.4+ent,dc=hk5,expect=3,ft_fs=1,ft_ns=1,id=xxxxxxxx-5629-08f2-3a79-10a1ab3849d5,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
```

* Added changelog

* Added changelog entry
2020-10-09 17:18:24 -04:00
s-christoff a62705101f
Enhance the output of consul snapshot inspect (#8787) 2020-10-09 14:57:29 -05:00
Kyle Havlovitz 707f4a8d26 Stop intermediate renew routine on leader stop 2020-10-09 12:30:57 -07:00
Kyle Havlovitz 926a393a5c
Merge pull request #8784 from hashicorp/renew-intermediate-primary
connect: Enable renewing the intermediate cert in the primary DC
2020-10-09 12:18:59 -07:00
Daniel Nephin dd0e8d42c4
Merge pull request #8825 from hashicorp/streaming/add-config
streaming: add config and docs
2020-10-09 14:33:58 -04:00
Chris Piraino 4f77f87065
Emit service usage metrics with correct labeling strategy (#8856)
Previously, we would emit service usage metrics both with and without a
namespace label attached. This is problematic in the case when you want
to aggregate metrics together, i.e. "sum(consul.state.services)". This
would cause services to be counted twice in that aggregate, once via the
metric emitted with a namespace label, and once in the metric emited
without any namespace label.
2020-10-09 11:01:45 -05:00
Kyle Havlovitz 50543d678e Fix intermediate refresh test comments 2020-10-09 08:53:33 -07:00
R.B. Boyer d2f09ca306
upstream some differences from enterprise (#8902) 2020-10-09 09:42:53 -05:00
Kyle Havlovitz 968fd8660d Update CI for leader renew CA test using Vault 2020-10-09 05:48:15 -07:00
Kyle Havlovitz 62270c3f9a
Merge branch 'master' into renew-intermediate-primary 2020-10-09 04:40:34 -07:00
Kyle Havlovitz b78f618beb connect: Check for expired root cert when cross-signing 2020-10-09 04:35:56 -07:00
Freddy 89d52f41c4
Add protocol to the topology endpoint response (#8868) 2020-10-08 17:31:54 -06:00
Matt Keeler 141eb60f06
Add per-agent reconnect timeouts (#8781)
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.

This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
2020-10-08 15:02:19 -04:00
Daniel Nephin 05df7b18a9 config: add field for enabling streaming RPC endpoint 2020-10-08 12:11:20 -04:00
Freddy de4af766f3
Support ingress gateways in mesh viz endpoint (#8864)
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2020-10-08 09:47:09 -06:00
Daniel Nephin a94fe054f0
Merge pull request #8809 from hashicorp/streaming/materialize-view
Add StreamingHealthServices cache-type
2020-10-07 21:26:38 -04:00
Daniel Nephin e0236b5a9f
Merge pull request #8818 from hashicorp/streaming/add-subscribe-service-batch-events
stream: handle batch events as a special case of Event
2020-10-07 21:25:32 -04:00
Daniel Nephin 783627aeef
Merge pull request #8768 from hashicorp/streaming/add-subscribe-service
subscribe: add subscribe service for streaming change events
2020-10-07 21:24:03 -04:00
Freddy 7d1f50d2e6
Return intention info in svc topology endpoint (#8853) 2020-10-07 18:35:34 -06:00
R.B. Boyer 35c4efd220
connect: support defining intentions using layer 7 criteria (#8839)
Extend Consul’s intentions model to allow for request-based access control enforcement for HTTP-like protocols in addition to the existing connection-based enforcement for unspecified protocols (e.g. tcp).
2020-10-06 17:09:13 -05:00
R.B. Boyer d6dce2332a
connect: intentions are now managed as a new config entry kind "service-intentions" (#8834)
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.

- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.

- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.

- Add a new serf feature flag indicating support for
intentions-as-config-entries.

- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.

- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.

- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition it met
the old intentions replicator ceases.

- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
it's old state store table it waits until all config entries up to that
point are replicated and then zeroes its own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
2020-10-06 13:24:05 -05:00
Daniel Nephin 83401194ab streaming: improve godoc for cache-type
And fix a bug where any error that implemented the temporary interface was considered
a temporary error, even when the method would return false.
2020-10-06 13:52:02 -04:00
Daniel Nephin f857aef4a8 submatview: add a test for handling of NewSnapshotToFollow
Also add some godoc
Rename some vars and functions
Fix a data race in the new cache test for entry closing.
2020-10-06 13:22:02 -04:00
Daniel Nephin ad29cf4f94 stream: Return a single event from a subscription.Next
Handle batch events as a single event
2020-10-06 13:18:20 -04:00
Daniel Nephin fa115c6249 Move agent/subscribe -> agent/rpc/subscribe 2020-10-06 12:49:35 -04:00
Daniel Nephin 011109a6f6 subscirbe: extract streamID and logging from Subscribe
By extracting all of the tracing logic the core logic of the Subscribe
endpoint is much easier to read.
2020-10-06 12:49:35 -04:00
Daniel Nephin 4c4441997a subscribe: add integration test for acl token updates 2020-10-06 12:49:35 -04:00
Daniel Nephin 371ec2d70a subscribe: add a stateless subscribe service for the gRPC server
With a Backend that provides access to the necessary dependencies.
2020-10-06 12:49:35 -04:00
Daniel Nephin ae433947a4
Merge pull request #8799 from hashicorp/streaming/rename-framing-events
stream: remove EndOfEmptySnapshot, add NewSnapshotToFollow
2020-10-06 12:42:58 -04:00
R.B. Boyer a77b518542
server: create new memdb table for storing system metadata (#8703)
This adds a new very tiny memdb table and corresponding raft operation
for updating a very small effective map[string]string collection of
"system metadata". This can persistently record a fact about the Consul
state machine itself.

The first use of this feature will come in a later PR.
2020-10-06 10:08:37 -05:00
Daniel Nephin 2706cf9b2a
Merge pull request #8802 from hashicorp/dnephin/extract-lib-retry
lib/retry - extract a new package from lib/retry.go
2020-10-05 14:22:37 -04:00
freddygv 82a17ccee6 Do not evaluate discovery chain for topology upstreams 2020-10-05 10:24:50 -06:00
freddygv 63c50e15bc Single DB txn for ServiceTopology and other PR comments 2020-10-05 10:24:50 -06:00
freddygv 263bd9dd92 Add topology HTTP endpoint 2020-10-05 10:24:50 -06:00
freddygv 7c11580e93 Add topology RPC endpoint 2020-10-05 10:24:50 -06:00
freddygv 21c4708fe9 Add topology ACL filter 2020-10-05 10:24:50 -06:00
freddygv ac54bf99b3 Add func to combine up+downstream queries 2020-10-05 10:24:50 -06:00
freddygv 160a6539d1 factor in discovery chain when querying up/downstreams 2020-10-05 10:24:50 -06:00
freddygv 214b25919f support querying upstreams/downstreams from registrations 2020-10-05 10:24:50 -06:00
freddygv 3653045cb0 Add method for downstreams from disco chain 2020-10-05 10:24:50 -06:00
Daniel Nephin 40aac46cf4 lib/retry: Refactor to reduce the interface surface
Reduce Jitter to one function

Rename NewRetryWaiter

Fix a bug in calculateWait where maxWait was applied before jitter, which would make it
possible to wait longer than maxWait.
2020-10-04 18:12:42 -04:00
Daniel Nephin 0c7f9c72d7 lib/retry: extract a new package from lib 2020-10-04 17:43:01 -04:00
Daniel Nephin 9c5181c897 stream: full test coverage for EventPublisher.Subscribe 2020-10-02 13:46:24 -04:00
Daniel Nephin 0769f54fe1 stream: refactor to support change in framing events
Removing EndOfEmptySnapshot, add NewSnapshotToFollow
2020-10-02 13:41:31 -04:00
Daniel Nephin 5ef630f664
Merge pull request #8769 from hashicorp/streaming/prep-for-subscribe-service
state: use protobuf Topic and and export payload type
2020-10-02 13:30:06 -04:00
R.B. Boyer e84d52ba3a
ensure these tests work fine with namespaces in enterprise (#8794) 2020-10-01 09:54:46 -05:00
R.B. Boyer ccd0200bd9
server: ensure that we also shutdown network segment serf instances on server shutdown (#8786)
This really only matters for unit tests, since typically if an agent shuts down its server, it follows that up by exiting the process, which would also clean up all of the networking anyway.
2020-09-30 16:23:43 -05:00
Kyle Havlovitz 2956313f2d connect: Enable renewing the intermediate cert in the primary DC 2020-09-30 12:31:21 -07:00
freddygv ec6e8021c0 Resolve conflicts 2020-09-29 08:59:18 -06:00
Daniel Nephin d192b0a080 stream: move goroutine out of New
This change will make it easier to manage goroutine lifecycle from the caller.

Also expose EventPublisher from state.Store
2020-09-28 18:40:10 -04:00
Daniel Nephin e345c8d8a6 state: use pbsubscribe.Topic for topic values 2020-09-28 18:40:10 -04:00
Daniel Nephin 6e592ec485 state: rename and export EventPayload
The subscribe endpoint needs to be able to inspect the payload to filter
events, and convert them into the protobuf types.

Use the protobuf CatalogOp type for the operation field, for now. In the
future if we end up with multiple interfaces we should be able to remove
the protobuf dependency by changing this to an int32 and adding a test
for the mapping between the values.

Make the value of the payload a concrete type instead of interface{}. We
can create other payloads for other event types.
2020-09-28 18:34:30 -04:00
R.B. Boyer 45609fccdf
server: make sure that the various replication loggers use consistent logging (#8745) 2020-09-24 15:49:38 -05:00
Daniel Nephin 4b041a018d grpc: redeuce dependencies, unexport, and add godoc
Rename GRPCClient to ClientConnPool. This type appears to be more of a
conn pool than a client. The clients receive the connections from this
pool.

Reduce some dependencies by adjusting the interface baoundaries.

Remove the need to create a second slice of Servers, just to pick one and throw the rest away.

Unexport serverResolver, it is not used outside the package.

Use a RWMutex for ServerResolverBuilder, some locking is read-only.

Add more godoc.
2020-09-24 12:53:10 -04:00
Daniel Nephin 4b24470887 grpc: move client conn pool to grpc package 2020-09-24 12:48:12 -04:00
Daniel Nephin fad15171ec grpc: client conn pool and resolver
Extracted from 936522a13c07e8b732b6fde61bba23d05f7b9a70

Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-24 12:46:22 -04:00
Daniel Nephin e0119a6e92
Merge pull request #8680 from hashicorp/dnephin/replace-consul-opts-with-base-deps
agent: Repalce ConsulOptions with a new struct from agent.BaseDeps
2020-09-24 12:45:54 -04:00
Paul Banks 0594667c3a
Fix bad int -> string conversions caught by go vet changes in 1.15 (#8739) 2020-09-24 11:14:07 +01:00
Hans Hasselberg c6fa758d6f
fix TestLeader_SecondaryCA_IntermediateRenew (#8702)
* fix lessThanHalfTime
* get lock for CAProvider()
* make a var to relate both vars
* rename to getCAProviderWithLock
* move CertificateTimeDriftBuffer to agent/connect/ca
2020-09-18 10:13:29 +02:00
Mike Morris fe984b3ee3
test: update tags for database service registrations and queries (#8693) 2020-09-16 14:05:01 -04:00
Kyle Havlovitz c8fd61abc7 Merge branch 'master' into vault-ca-renew-token 2020-09-15 14:39:04 -07:00
Daniel Nephin c621b4a420 agent/consul: pass dependencies directly from agent
In an upcoming change we will need to pass a grpc.ClientConnPool from
BaseDeps into Server. While looking at that change I noticed all of the
existing consulOption fields are already on BaseDeps.

Instead of duplicating the fields, we can create a struct used by
agent/consul, and use that struct in BaseDeps. This allows us to pass
along dependencies without translating them into different
representations.

I also looked at moving all of BaseDeps in agent/consul, however that
created some circular imports. Resolving those cycles wouldn't be too
bad (it was only an error in agent/consul being imported from
cache-types), however this change seems a little better by starting to
introduce some structure to BaseDeps.

This change is also a small step in reducing the scope of Agent.

Also remove some constants that were only used by tests, and move the
relevant comment to where the live configuration is set.

Removed some validation from NewServer and NewClient, as these are not
really runtime errors. They would be code errors, which will cause a
panic anyway, so no reason to handle them specially here.
2020-09-15 17:29:32 -04:00
Daniel Nephin 0536b2047e agent/consul: make router required 2020-09-15 17:26:26 -04:00
Kyle Havlovitz 63d3a5fc1f Clean up CA shutdown logic and error 2020-09-15 12:28:58 -07:00
freddygv 43efb4809c Merge master 2020-09-14 16:17:43 -06:00
Daniel Nephin 75515f3431
Merge pull request #8587 from hashicorp/streaming/add-grpc-server
streaming: add gRPC server for handling connections
2020-09-14 15:24:54 -04:00
freddygv 33af8dab9a Resolve conflicts against master 2020-09-11 18:41:58 -06:00
Kyle Havlovitz 1595add842 Clean up Vault renew tests and shutdown 2020-09-11 08:41:05 -07:00
freddygv 5871b667a5 Revert EnvoyConfig nesting 2020-09-11 09:21:43 -06:00
Kyle Havlovitz 7588e22739 Add a stop function to make sure the renewer is shut down on leader change 2020-09-10 06:12:48 -07:00
Kyle Havlovitz 1c57b72a9f Add a test for token renewal 2020-09-09 16:36:37 -07:00
Daniel Nephin 863a9df951 server: add gRPC server for streaming events
Includes a stats handler and stream interceptor for grpc metrics.

Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-08 12:10:41 -04:00
Hans Hasselberg 51f079dcdd
secondaryIntermediateCertRenewalWatch abort on success (#8588)
secondaryIntermediateCertRenewalWatch was using `retryLoopBackoff` to
renew the intermediate certificate. Once it entered the inner loop and
started `retryLoopBackoff` it would never leave that.
`retryLoopBackoffAbortOnSuccess` will return when renewing is
successful, like it was intended originally.
2020-09-04 11:47:16 +02:00
Daniel Nephin c17a5b0628 state: handle terminating gateways in service health events 2020-09-03 16:58:05 -04:00
Daniel Nephin b241debee7 state: improve comments in catalog_events.go
Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-03 16:58:05 -04:00
Daniel Nephin 870823e8ed state: use changeType in serviceChanges
To be a little more explicit, instead of nil implying an indirect change
2020-09-03 16:58:05 -04:00
Daniel Nephin 68682e7e83 don't over allocate slice 2020-09-03 16:58:04 -04:00
Daniel Nephin 5f52220f53 state: fix a bug in building service health events
The nodeCheck slice was being used as the first arg in append, which in some cases will modify the array backing the slice. This would lead to service checks for other services in the wrong event.

Also refactor some things to reduce the arguments to functions.
2020-09-03 16:58:04 -04:00
Daniel Nephin c61313b78a state: Remove unused args and return values
Also rename some functions to identify them as constructors for events
2020-09-03 16:58:04 -04:00
Daniel Nephin 668b98bcce state: use an enum for tracking node changes 2020-09-03 16:58:04 -04:00
Daniel Nephin 7c3c627028 state: serviceHealthSnapshot
refactored to remove unused return value and remove duplication
2020-09-03 16:58:04 -04:00
Daniel Nephin fdfe176deb state: Add Change processor and snapshotter for service health
Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-03 16:58:04 -04:00
Daniel Nephin 6a1a43721d state: fix bug in changeTrackerDB.publish
Creating a new readTxn does not work because it will not see the newly created objects that are about to be committed. Instead use the active write Txn.
2020-09-03 16:58:01 -04:00
Daniel Nephin 81cc3daf69 stream: have SnapshotFunc accept a non-pointer SubscribeRequest
The value is not expected to be modified. Passing a value makes that explicit.
2020-09-03 16:54:02 -04:00
freddygv 56fdae9ace Update resolver defaulting 2020-09-03 13:08:44 -06:00
freddygv 02d6acd8fc Ensure resolver node with LB isn't considered default 2020-09-03 08:55:57 -06:00
freddygv daad3b9210 Remove LB infix and move injection to xds 2020-09-02 15:13:50 -06:00
Chris Piraino df1381f77f
Merge pull request #8603 from hashicorp/feature/usage-metrics
Track node and service counts in the state store and emit them periodically as metrics
2020-09-02 13:23:39 -05:00
R.B. Boyer 4197bed23b
connect: fix bug in preventing some namespaced config entry modifications (#8601)
Whenever an upsert/deletion of a config entry happens, within the open
state store transaction we speculatively test compile all discovery
chains that may be affected by the pending modification to verify that
the write would not create an erroneous scenario (such as splitting
traffic to a subset that did not exist).

If a single discovery chain evaluation references two config entries
with the same kind and name in different namespaces then sometimes the
upsert/deletion would be falsely rejected. It does not appear as though
this bug would've let invalid writes through to the state store so the
correction does not require a cleanup phase.
2020-09-02 10:47:19 -05:00
Chris Piraino b245d60200 Set metrics reporting interval to 9 seconds
This is below the 10 second interval that lib/telemetry.go implements as
its aggregation interval, ensuring that we always report these metrics.
2020-09-02 10:24:23 -05:00
Chris Piraino e9b397005c Update godoc string for memdb wrapper functions/structs 2020-09-02 10:24:22 -05:00
Chris Piraino 80f923a47a Refactor state store usage to track unique service names
This commit refactors the state store usage code to track unique service
name changes on transaction commit. This means we only need to lookup
usage entries when reading the information, as opposed to iterating over
a large number of service indices.

- Take into account a service instance's name being changed
- Do not iterate through entire list of service instances, we only care
about whether there is 0, 1, or more than 1.
2020-09-02 10:24:21 -05:00
Chris Piraino 79e6534345 Use ReadTxn interface in state store helper functions 2020-09-02 10:24:20 -05:00
Chris Piraino d90d95421d Add WriteTxn interface and convert more functions to ReadTxn
We add a WriteTxn interface for use in updating the usage memdb table,
with the forward-looking prospect of incrementally converting other
functions to accept interfaces.

As well, we use the ReadTxn in new usage code, and as a side effect
convert a couple of existing functions to use that interface as well.
2020-09-02 10:24:19 -05:00
Chris Piraino 45a4057f60 Report node/service usage metrics from every server
Using the newly provided state store methods, we periodically emit usage
metrics from the servers.

We decided to emit these metrics from all servers, not just the leader,
because that means we do not have to care about leader election flapping
causing metrics turbulence, and it seems reasonable for each server to
emit its own view of the state, even if they should always converge
rapidly.
2020-09-02 10:24:17 -05:00
Chris Piraino 3af96930eb Add new usage memdb table that tracks usage counts of various elements
We update the usage table on Commit() by using the TrackedChanges() API
of memdb.

Track memdb changes on restore so that usage data can be compiled
2020-09-02 10:24:16 -05:00
freddygv d7bda050e0 Restructure structs and other PR comments 2020-09-02 09:10:50 -06:00
Matt Keeler 335c604ced
Merge of auto-config and auto-encrypt code (#8523)
auto-encrypt is now handled as a special case of auto-config.

This also is moving all the cert-monitor code into the auto-config package.
2020-08-31 13:12:17 -04:00
freddygv afb14b6705 Compile down LB policy to disco chain nodes 2020-08-28 13:11:04 -06:00
Daniel Nephin 845661c8af
Merge pull request #8548 from edevil/fix_flake
Fix flaky TestACLResolver_Client/Concurrent-Token-Resolve
2020-08-28 15:10:55 -04:00
R.B. Boyer f2b8bf109c
xds: use envoy's rbac filter to handle intentions entirely within envoy (#8569) 2020-08-27 12:20:58 -05:00
Matt Keeler 106e1d50bd
Move RPC router from Client/Server and into BaseDeps (#8559)
This will allow it to be a shared component which is needed for AutoConfig
2020-08-27 11:23:52 -04:00
André Cruz 673bd69f36
Decrease test flakiness
Fix flaky TestACLResolver_Client/Concurrent-Token-Resolve and TestCacheNotifyPolling
2020-08-24 20:30:02 +01:00
André Cruz a64686fab6
testing: Fix govet errors 2020-08-21 18:01:55 +01:00
Hans Hasselberg 02de4c8b76
add primary keys to list keyring (#8522)
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully setup as primary key.

Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:

```json
[
  {
    "WAN": true,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "NumNodes": 6
  },
  {
    "WAN": false,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  },
  {
    "WAN": false,
    "Datacenter": "dc1",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  }
]
```

I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output

Fixes #3393.
2020-08-18 09:50:24 +02:00
Daniel Nephin 8d35e37b3c testing: Remove all the defer os.Removeall
Now that testutil uses t.Cleanup to remove the directory the caller no longer has to manage
the removal
2020-08-14 19:58:53 -04:00
Daniel Nephin 629c34085d state: remove unused Store method receiver
And use ReadTxn interface where appropriate.
2020-08-13 11:25:22 -04:00
Daniel Nephin fc797a279a
Merge pull request #8461 from hashicorp/dnephin/remove-notify-shutdown
agent/consul: Remove NotifyShutdown
2020-08-13 11:16:48 -04:00
Daniel Nephin d8ffcd5686
Merge pull request #8365 from hashicorp/dnephin/fix-service-by-node-meta-flake
state: speed up tests that use watchLimit
2020-08-13 11:16:12 -04:00
R.B. Boyer 63422ca9c5
connect: use stronger validation that ingress gateways have compatible protocols defined for their upstreams (#8470)
Fixes #8466

Since Consul 1.8.0 there was a bug in how ingress gateway protocol
compatibility was enforced. At the point in time that an ingress-gateway
config entry was modified the discovery chain for each upstream was
checked to ensure the ingress gateway protocol matched. Unfortunately
future modifications of other config entries were not validated against
existing ingress-gateway definitions, such as:

1. create tcp ingress-gateway pointing to 'api' (ok)
2. create service-defaults for 'api' setting protocol=http (worked, but not ok)
3. create service-splitter or service-router for 'api' (worked, but caused an agent panic)

If you were to do these in a different order, it would fail without a
crash:

1. create service-defaults for 'api' setting protocol=http (ok)
2. create service-splitter or service-router for 'api' (ok)
3. create tcp ingress-gateway pointing to 'api' (fail with message about
   protocol mismatch)

This PR introduces the missing validation. The two new behaviors are:

1. create tcp ingress-gateway pointing to 'api' (ok)
2. (NEW) create service-defaults for 'api' setting protocol=http ("ok" for back compat)
3. (NEW) create service-splitter or service-router for 'api' (fail with
   message about protocol mismatch)

In consideration for any existing users that may be inadvertently be
falling into item (2) above, that is now officiall a valid configuration
to be in. For anyone falling into item (3) above while you cannot use
the API to manufacture that scenario anymore, anyone that has old (now
bad) data will still be able to have the agent use them just enough to
generate a new agent/proxycfg error message rather than a panic.
Unfortunately we just don't have enough information to properly fix the
config entries.
2020-08-12 11:19:20 -05:00
Hans Hasselberg 7a6d916ddc
Merge pull request #8471 from hashicorp/local_only
thread local-only through the layers
2020-08-12 08:54:51 +02:00
Freddy 50fee12d62
Internal endpoint to query intentions associated with a gateway (#8400) 2020-08-11 17:20:41 -06:00
Kyle Havlovitz 8118e3db40 Fix a state store comment about version 2020-08-11 13:46:12 -07:00
Kyle Havlovitz 2601585017 fsm: Fix snapshot bug with restoring node/service/check indexes 2020-08-11 11:49:52 -07:00
Hans Hasselberg e0297b6e99 Refactor keyring ops:
* changes some functions to return data instead of modifying pointer
  arguments
* renames globalRPC() to keyringRPCs() to make its purpose more clear
* restructures KeyringOperation() to make it more understandable
2020-08-11 13:42:03 +02:00
freddygv 6dcfa11c21 Update error handling 2020-08-10 17:48:22 -06:00
Daniel Nephin bef9348ca8 testing: remove unnecessary defers in tests
The data directory is now removed by the test helper that created it.
2020-08-07 17:28:16 -04:00
Daniel Nephin f3b63514d5 testing: Remove NotifyShutdown
NotifyShutdown was only used for testing. Now that t.Cleanup exists, we
can use that instead of attaching cleanup to the Server shutdown.

The Autopilot test which used NotifyShutdown doesn't need this
notification because Shutdown is synchronous. Waiting for the function
to return is equivalent.
2020-08-07 17:14:44 -04:00
Hans Hasselberg fdceb24323
auto_config implies connect (#8433) 2020-08-07 12:02:02 +02:00
freddygv 83f4e32376 PR comments and addtl tests 2020-08-05 16:07:11 -06:00
Daniel Nephin 061ae94c63 Rename NewClient/NewServer
Now that duplicate constructors have been removed we can use the shorter names for the single constructor.
2020-08-05 14:00:55 -04:00
Daniel Nephin e6c94c1411 Remove LogOutput from Server 2020-08-05 14:00:44 -04:00
Daniel Nephin fdf966896f Remove LogOutput from Client 2020-08-05 14:00:42 -04:00
Daniel Nephin 73493ca01b Pass a logger to ConnPool and yamux, instead of an io.Writer
Allowing us to remove the LogOutput field from config.
2020-08-05 13:25:08 -04:00
Daniel Nephin c7c941811d config: Remove unused field 2020-08-05 13:25:08 -04:00
freddygv c87af29506 collect GatewayServices from iter in a function 2020-07-31 13:30:40 -06:00
Freddy 7c2c8815d7
Avoid panics during shutdown routine (#8412) 2020-07-30 11:11:10 -06:00
freddygv 94d1f0a310 end to end changes to pass gatewayservices to /ui/services/ 2020-07-30 10:21:11 -06:00
Matt Keeler 76add4f24c
Allow setting verify_incoming* when using auto_encrypt or auto_config (#8394)
Ensure that enabling AutoConfig sets the tls configurator properly

This also refactors the TLS configurator a bit so the naming doesn’t imply only AutoEncrypt as the source of the automatically setup TLS cert info.
2020-07-30 10:15:12 -04:00
Matt Keeler dad0f189a2
Agent Auto Config: Implement Certificate Generation (#8360)
Most of the groundwork was laid in previous PRs between adding the cert-monitor package to extracting the logic of signing certificates out of the connect_ca_endpoint.go code and into a method on the server.

This also refactors the auto-config package a bit to split things out into multiple files.
2020-07-28 15:31:48 -04:00
Matt Keeler 3a1058a06b
Move connect root retrieval and cert signing logic out of the RPC endpoints (#8364)
The code now lives on the Server type itself. This was done so that all of this could be shared with auto config certificate signing.
2020-07-24 10:00:51 -04:00
Matt Keeler e7d8a02ae8
Move generation of the CA Configuration from the agent code into a method on the RuntimeConfig (#8363)
This allows this to be reused elsewhere.
2020-07-23 16:05:28 -04:00
Daniel Nephin 597dcf2bfb
Merge pull request #8323 from hashicorp/dnephin/add-event-publisher-2
stream: close subscriptions on shutdown
2020-07-23 13:12:50 -04:00
Matt Keeler c3e7d689b7
Refactor the agentpb package (#8362)
First move the whole thing to the top-level proto package name.

Secondly change some things around internally to have sub-packages.
2020-07-23 11:24:20 -04:00
Daniel Nephin decba06b7d stream: close all subs when EventProcessor is shutdown. 2020-07-22 19:04:10 -04:00
Daniel Nephin e802689bbe stream: fix overallocation in filter
And add tests
2020-07-22 19:04:10 -04:00
Daniel Nephin f64725f7aa state: speed up TestStateStore_ServicesByNodeMeta
Make watchLimit a var so that we can patch it in tests and reduce the time spent creating state.
2020-07-22 16:57:06 -04:00
Daniel Nephin a44ddea9ba state: Use subtests in TestStateStore_ServicesByNodeMeta
These subtests make it much easier to identify the slow part of the test, but they also help enumerate all the different cases which are being tested.
2020-07-22 16:39:09 -04:00
Daniel Nephin 6d3b042872
Merge pull request #7948 from hashicorp/dnephin/buffer-test-logs
testutil: NewLogBuffer - buffer logs until a test fails
2020-07-21 15:21:52 -04:00
Matt Keeler 8ea8a939f0
Merge pull request #8311 from hashicorp/bugfix/auto-encrypt-token-update 2020-07-21 13:15:27 -04:00
Daniel Nephin 80ff174880 testutil: NewLogBuffer - buffer logs until a test fails
Replaces #7559

Running tests in parallel, with background goroutines, results in test output not being associated with the correct test. `go test` does not make any guarantees about output from goroutines being attributed to the correct test case.

Attaching log output from background goroutines also cause data races.  If the goroutine outlives the test, it will race with the test being marked done. Previously this was noticed as a panic when logging, but with the race detector enabled it is shown as a data race.

The previous solution did not address the problem of correct test attribution because test output could still be hidden when it was associated with a test that did not fail. You would have to look at all of the log output to find the relevant lines. It also made debugging test failures more difficult because each log line was very long.

This commit attempts a new approach. Instead of printing all the logs, only print when a test fails. This should work well when there are a small number of failures, but may not work well when there are many test failures at the same time. In those cases the failures are unlikely a result of a specific test, and the log output is likely less useful.

All of the logs are printed from the test goroutine, so they should be associated with the correct test.

Also removes some test helpers that were not used, or only had a single caller. Packages which expose many functions with similar names can be difficult to use correctly.

Related:
https://github.com/golang/go/issues/38458 (may be fixed in go1.15)
https://github.com/golang/go/issues/38382#issuecomment-612940030
2020-07-21 12:50:40 -04:00
Matt Keeler 133a6d99f2
Fix issue with changing the agent token causing failure to renew the auto-encrypt certificate
The fallback method would still work but it would get into a state where it would let the certificate expire for 10s before getting a new one. And the new one used the less secure RPC endpoint.

This is also a pretty large refactoring of the auto encrypt code. I was going to write some tests around the certificate monitoring but it was going to be impossible to get a TestAgent configured in such a way that I could write a test that ran in less than an hour or two to exercise the functionality.

Moving the certificate monitoring into its own package will allow for dependency injection and in particular mocking the cache types to control how it hands back certificates and how long those certificates should live. This will allow for exercising the main loop more than would be possible with it coupled so tightly with the Agent.
2020-07-21 12:19:25 -04:00
Daniel Nephin 7599e280de stream: handle empty event in TestEventSnapshot
When the race detector is enabled we see this test fail occasionally. The reordering of execution seems to make it possible for the snapshot splice to happen before any events are published to the topicBuffers.

We can handle this case in the test the same way it is handled by a subscription, by proceeding to the next event.
2020-07-20 18:20:02 -04:00
Daniel Nephin 75f10fb191 state: update calls that are no longer state methods
In a previous commit these methods were changed to functions, so remove the Store paramter.
2020-07-16 15:46:10 -04:00
Daniel Nephin 3fcb2e16f4 state: un-method funcs that don't use their receiver
This change was mostly automated with the following

First generate a list of functions with:

  git grep -o 'Store) \([^(]\+\)(tx \*txn' ./agent/consul/state | awk '{print $2}' | grep -o '^[^(]\+'

Then the list was curated a bit with trial/error to remove and add funcs
as necessary.

Finally the replacement was done with:

  dir=agent/consul/state
  file=${1-funcnames}

  while read fn; do
    echo "$fn"
    sed -i -e "s/(s \*Store) $fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.$fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.store\.$fn(/$fn(/" $dir/*.go
  done < $file
2020-07-16 15:30:39 -04:00
Daniel Nephin edb0a4f1f8 store: convert methods that don't use their receiver to functions
Making these functions allows them to be used without introducing
an artificial dependency on the struct. Many of these will be called
from streaming Event processors, which do not have a store.

This change is being made ahead of the streaming work to get to reduce
the size of the streaming diff.
2020-07-16 15:30:10 -04:00
Daniel Nephin a2f8605c66 stream: Add forceClose and refactor subscription filtering
Move the subscription context to Next. context.Context should generally
never be stored in a struct because it makes that struct only valid
while the context is valid. This is rarely obvious from the caller.
Adds a forceClosed channel in place of the old context, and uses the new
context as a way for the caller to stop the Subscription blocking.

Remove some recursion out of bufferImte.Next. The caller is already looping so we can continue
in that loop instead of recursing. This ensures currentItem is updated immediately (which probably
does not matter in practice), and also removes the chance that we overflow the stack.

NextNoBlock and FollowAfter do not need to handle bufferItem.Err, the caller already
handles it.

Moves filter to a method to simplify Next, and more explicitly separate filtering from looping.

Also improve some godoc

Only unwrap itemBuffer.Err when necessary
2020-07-14 15:57:47 -04:00
Daniel Nephin 2595436f62 stream: Improve docstrings
Also rename ResumeStrema to EndOfEmptySnapshot to be more consistent with other framing events

Co-authored-by: Paul Banks <banks@banksco.de>
2020-07-14 15:57:47 -04:00
Daniel Nephin 16a2b3fafc stream: change Topic to an interface
Consumers of the package can decide on which type to use for the Topic. In the future we may
use a gRPC type for the topic.
2020-07-14 15:57:47 -04:00
Daniel Nephin aa571bd0ce state: Move change processing out of EventPublisher
EventPublisher was receiving TopicHandlers, which had a couple of
problems:

- ChangeProcessors were being grouped by Topic, but they completely
  ignored the topic and were performed on every change
- ChangeProcessors required EventPublisher to be aware of database
  changes

By moving ChangeProcesors out of EventPublisher, and having Publish
accept events instead of changes, EventPublisher no longer needs to
be aware of these things.

Handlers is now only SnapshotHandlers, which are still mapped by Topic.

Also allows us to remove the small 'db' package that had only two types.
They can now be unexported types in state.
2020-07-14 15:57:47 -04:00
Daniel Nephin 23a940daad server: Abandom state store to shutdown EventPublisher
So that we don't leak goroutines
2020-07-14 15:57:47 -04:00
Daniel Nephin e1305fe80c stream: unexport identifiers
Now that EventPublisher is part of stream a lot of the internals can be hidden
2020-07-14 15:57:47 -04:00
Daniel Nephin 9e37894778 stream: Move EventPublisher to stream package
The EventPublisher is the central hub of the PubSub system. It is toughly coupled with much of
stream. Some stream internals were exported exclusively for EventPublisher.

The two Subscribe cases (with or without index) were also awkwardly split between two packages. By
moving EventPublisher into stream they are now both in the same package (although still in different files).
2020-07-14 15:57:47 -04:00
Daniel Nephin 6e87e83d77 state: Make handleACLUpdate async once again
So that we keep as much as possible out of the FSM commit hot path.
2020-07-14 15:57:47 -04:00
Daniel Nephin a92dab724d state: Use interface for Txn
Also store the index in Changes instead of the Txn.

This change is in preparation for movinng EventPublisher to the stream package, and
making handleACLUpdates async once again.
2020-07-14 15:57:46 -04:00
Daniel Nephin c778d61b6a stream.Subscription unexport fields and additiona docstrings 2020-07-14 15:57:46 -04:00
Daniel Nephin 37a38629d7 Add a context for stopping EventPublisher goroutine 2020-07-14 15:57:46 -04:00
Daniel Nephin 02bc5a26e4 EventPublisher: Make Unsubscribe a function on Subscription
It is critical that Unsubscribe be called with the same pointer to a
SubscriptionRequest that was used to create the Subscription. The
docstring made that clear, but it sill allowed a caler to get it wrong by
creating a new SubscriptionRequest.

By hiding this detail from the caller, and only exposing an Unsubscribe
method, it should be impossible to fail to Unsubscribe.

Also update some godoc strings.
2020-07-14 15:57:46 -04:00
Daniel Nephin 86976cf23c EventPublisher: handleACL changes synchronously
Use a separate lock for subscriptions.ByToken to allow it to happen synchronously
in the commit flow.
This removes the need to create a new txn for the goroutine, and removes
the need for EventPublisher to contain a reference to DB.
2020-07-14 15:57:46 -04:00
Daniel Nephin 606121fae6 stream.EventSnapshot: reduce the fields on the struct
Many of the fields are only needed in one place, and by using a closure
they can be removed from the struct. This reduces the scope of the variables
making it esier to see how they are used.
2020-07-14 15:57:45 -04:00
Daniel Nephin 7196917051 stream.EventBuffer: Seed the fuzz test with time.Now()
Otherwise the test will run with exactly the same values each time.

By printing the seed we can attempt to reproduce the test by adding an env var to override the seed
2020-07-14 15:57:45 -04:00
Daniel Nephin 525b275a52 state: memdb_wrapper.go -> memdb.go
Renaming in a separate commit so that git can merge changes to the file.
2020-07-14 15:57:45 -04:00
Daniel Nephin b5d2bea770 state: publish changes from Commit
Make topicRegistry use functions instead of unbound methods
Use a regular memDB in EventPublisher to remove a reference cycle
Removes the need for EventPublisher to use a store
2020-07-14 15:57:45 -04:00
Daniel Nephin f626c3d6c5 EventPublisher: docstrings and getTopicBuffer
also rename commitCh -> publishCh
2020-07-14 15:57:45 -04:00
Daniel Nephin 2020e9c7c7 ProcessChanges: use stream.Event
Also remove secretHash, which was used to hash tokens. We don't expose
these tokens anywhere, so we can use the string itself instead of a
Hash.

Fix acl_events_test.go for storing a structs type.
2020-07-14 15:57:45 -04:00
Daniel Nephin 2e45bbbb3e stream: Use local types for Event Topic SubscriptionRequest 2020-07-14 15:57:45 -04:00
Daniel Nephin 3d62013062 Rename stream_publisher.go -> event_publisher.go 2020-07-14 15:57:44 -04:00
Daniel Nephin 526fb53f85 Add streaming package with Subscription and Snapshot components.
The remaining files from 7965767de0bd62ab07669b85d6879bd5f815d157

Co-authored-by: Paul Banks <banks@banksco.de>
2020-07-14 15:57:44 -04:00
Chris Piraino b80cbb499f
Set enterprise metadata after resolving the token (#8302)
The token can encode enterprise metadata information, and we must make
sure we set that on the reply so that we can correct filter ACLs.
2020-07-13 13:39:57 -05:00
Daniel Nephin 13e0d258b5
Merge pull request #8237 from hashicorp/dnephin/remove-acls-enabled-from-delegate
Remove ACLsEnabled from delegate interface
2020-07-09 16:35:43 -04:00
Matt Keeler 39d9babab3
Pass the Config and TLS Configurator into the AutoConfig constructor
This is instead of having the AutoConfigBackend interface provide functions for retrieving them.

NOTE: the config is not reloadable. For now this is fine as we don’t look at any reloadable fields. If that changes then we should provide a way to make it reloadable.
2020-07-08 12:36:11 -04:00
Matt Keeler a77ed471c8
Rename (*Server).forward to (*Server).ForwardRPC
Also get rid of the preexisting shim in server.go that existed before to have this name just call the unexported one.
2020-07-08 11:05:44 -04:00
Matt Keeler 386ec3a2a2
Refactor AutoConfig RPC to not have a direct dependency on the Server type
Instead it has an interface which can be mocked for better unit testing that is deterministic and not prone to flakiness.
2020-07-08 11:05:44 -04:00
Daniel Nephin 8b6036c077 Remove ACLsEnabled from delegate interface
In all cases (oss/ent, client/server) this method was returning a value from config. Since the
value is consistent, it doesn't need to be part of the delegate interface.
2020-07-03 17:00:20 -04:00
Daniel Nephin dfa8856e5f agent/consul: Add support for NotModified to two endpoints
A query made with AllowNotModifiedResponse and a MinIndex, where the
result has the same Index as MinIndex, will return an empty response
with QueryMeta.NotModified set to true.

Co-authored-by: Pierre Souchay <pierresouchay@users.noreply.github.com>
2020-07-02 17:05:46 -04:00
Matt Keeler 87764e5bfb
Merge pull request #8211 from hashicorp/bugfix/auto-encrypt-various 2020-07-02 09:49:49 -04:00
Yury Evtikhov dbf3c05fa5 DNS: add IsErrQueryNotFound function for easier error evaluation 2020-07-01 03:41:44 +01:00
Matt Keeler a97f9ff386
Overwrite agent leaf cert trust domain on the servers 2020-06-30 09:59:08 -04:00
Matt Keeler 5600069d69
Store the Connect CA rate limiter on the server
This fixes a bug where auto_encrypt was operating without utilizing a common rate limiter.
2020-06-30 09:59:07 -04:00
Matt Keeler fa42d9b34f
Fix auto_encrypt IP/DNS SANs
The initial auto encrypt CSR wasn’t containing the user supplied IP and DNS SANs. This fixes that. Also We were configuring a default :: IP SAN. This should be ::1 instead and was fixed.
2020-06-30 09:59:07 -04:00
R.B. Boyer 72a515f5ec
connect: various changes to make namespaces for intentions work more like for other subsystems (#8194)
Highlights:

- add new endpoint to query for intentions by exact match

- using this endpoint from the CLI instead of the dump+filter approach

- enforcing that OSS can only read/write intentions with a SourceNS or
  DestinationNS field of "default".

- preexisting OSS intentions with now-invalid namespace fields will
  delete those intentions on initial election or for wildcard namespaces
  an attempt will be made to downgrade them to "default" unless one
  exists.

- also allow the '-namespace' CLI arg on all of the intention subcommands

- update lots of docs
2020-06-26 16:59:15 -05:00
Daniel Nephin 7d5f1ba6bd
Merge pull request #8176 from hashicorp/dnephin/add-linter-unparam-1
lint: add unparam linter and fix some of the issues
2020-06-25 15:34:48 -04:00
Matt Keeler d471977f62
Fix go routine leak in auto encrypt ca roots tracking 2020-06-24 17:09:50 -04:00
Matt Keeler 90e741c6d2
Allow cancelling blocking queries in response to shutting down. 2020-06-24 17:09:50 -04:00
Daniel Nephin 07c1081d39 Fix a bunch of unparam lint issues 2020-06-24 13:00:14 -04:00
Matt Keeler 341aedbce9
Ensure that retryLoopBackoff can be cancelled
We needed to pass a cancellable context into the limiter.Wait instead of context.Background. So I made the func take a context instead of a chan as most places were just passing through a Done chan from a context anyways.

Fix go routine leak in the gateway locator
2020-06-24 12:41:08 -04:00
Matt Keeler 9dc9f7df15
Allow cancelling startup when performing auto-config (#8157)
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
2020-06-19 15:16:00 -04:00
Daniel Nephin b5ef9b7ea9 Remove bytesToUint64 from agent/consul 2020-06-18 12:45:43 -04:00
Daniel Nephin 81bc082b63 Remove unused private IP code from agent/consul 2020-06-18 12:40:38 -04:00
Matt Keeler 2c7844d220
Implement Client Agent Auto Config
There are a couple of things in here.

First, just like auto encrypt, any Cluster.AutoConfig RPC will implicitly use the less secure RPC mechanism.

This drastically modifies how the Consul Agent starts up and moves most of the responsibilities (other than signal handling) from the cli command and into the Agent.
2020-06-17 16:49:46 -04:00
Matt Keeler f5d57ccd48
Allow the Agent its its child Client/Server to share a connection pool
This is needed so that we can make an AutoConfig RPC at the Agent level prior to creating the Client/Server.
2020-06-17 16:19:33 -04:00
Matt Keeler 8c601ad8db
Merge pull request #8035 from hashicorp/feature/auto-config/server-rpc 2020-06-17 16:07:25 -04:00
Chris Piraino 79d003d395
Remove ACLEnforceVersion8 from tests (#8138)
The field had been deprecated for a while and was recently removed,
however a PR which added these tests prior to removal was merged.
2020-06-17 14:58:01 -05:00
Daniel Nephin 1ef8279ac9
Merge pull request #8034 from hashicorp/dnephin/add-linter-staticcheck-4
ci: enable SA4006 staticcheck check and add ineffassign
2020-06-17 12:16:02 -04:00
Matt Keeler eda8cb39fd
Implement the insecure version of the Cluster.AutoConfig RPC endpoint
Right now this is only hooked into the insecure RPC server and requires JWT authorization. If no JWT authorizer is setup in the configuration then we inject a disabled “authorizer” to always report that JWT authorization is disabled.
2020-06-17 11:25:29 -04:00
Pierre Souchay f7a1189dba
gossip: Ensure that metadata of Consul Service is updated (#7903)
While upgrading servers to a new version, I saw that metadata of
existing servers are not upgraded, so the version and raft meta
is not up to date in catalog.

The only way to do it was to:
 * update Consul server
 * make it leave the cluster, then metadata is accurate

That's because the optimization to avoid updating catalog does
not take into account metadata, so no update on catalog is performed.
2020-06-17 12:16:13 +02:00
Daniel Nephin 8753d1f1ba ci: Add ineffsign linter
And fix an additional ineffective assignment that was not caught by staticcheck
2020-06-16 17:32:50 -04:00
Daniel Nephin 97342de262
Merge pull request #8070 from hashicorp/dnephin/add-gofmt-simplify
ci: Enable gofmt simplify
2020-06-16 17:18:38 -04:00
Matt Keeler d994dc7b35
Agent Auto Configuration: Configuration Syntax Updates (#8003) 2020-06-16 15:03:22 -04:00
Daniel Nephin 89d95561df Enable gofmt simplify
Code changes done automatically with 'gofmt -s -w'
2020-06-16 13:21:11 -04:00
Daniel Nephin 5f24171f13 ci: enable SA4006 staticcheck check
And fix the 'value not used' issues.

Many of these are not bugs, but a few are tests not checking errors, and
one appears to be a missed error in non-test code.
2020-06-16 13:10:11 -04:00
Daniel Nephin 71e6534061 Rename txnWrapper to txn 2020-06-16 13:06:02 -04:00
Daniel Nephin 537ae1fd46 Rename db 2020-06-16 13:04:31 -04:00
Daniel Nephin 78c76f0773 Handle return value from txn.Commit 2020-06-16 13:04:31 -04:00
Daniel Nephin 50db8f409a state: Update docstrings for changeTrackerDB and txn
And un-embed memdb.DB to prevent accidental access to underlying
methods.
2020-06-16 13:04:31 -04:00
Paul Banks f9a6386c4a state: track changes so that they may be used to produce change events 2020-06-16 13:04:29 -04:00
Matt Keeler cdc4b20afa
ACL Node Identities (#7970)
A Node Identity is very similar to a service identity. Its main targeted use is to allow creating tokens for use by Consul agents that will grant the necessary permissions for all the typical agent operations (node registration, coordinate updates, anti-entropy).

Half of this commit is for golden file based tests of the acl token and role cli output. Another big updates was to refactor many of the tests in agent/consul/acl_endpoint_test.go to use the same style of tests and the same helpers. Besides being less boiler plate in the tests it also uses a common way of starting a test server with ACLs that should operate without any warnings regarding deprecated non-uuid master tokens etc.
2020-06-16 12:54:27 -04:00
freddygv cc4ff3ae02 Fixup stray sid references 2020-06-12 13:47:43 -06:00
freddygv 1e7e716742 Move compound service names to use ServiceName type 2020-06-12 13:47:43 -06:00
freddygv 806b1fb608 Move GatewayServices out of Internal 2020-06-12 13:46:47 -06:00
Daniel Nephin 6719f1a6fa
Merge pull request #7900 from hashicorp/dnephin/add-linter-staticcheck-2
intentions: fix a bug in Intention.SetHash
2020-06-09 15:40:20 -04:00
Daniel Nephin 5f14eb124c
Merge pull request #8037 from hashicorp/dnephin/add-linter-staticcheck-5
ci: Enabled SA2002 staticcheck check
2020-06-09 15:31:24 -04:00
Hans Hasselberg 7404712854
acl: do not resolve local tokens from remote dcs (#8068) 2020-06-09 21:13:09 +02:00
Hans Hasselberg bec21c849d
Tokens converted from legacy ACLs get their Hash computed (#8047)
* Fixes #5606: Tokens converted from legacy ACLs get their Hash computed

This allows new style token replication to work for legacy tokens as well when they change.

* tests: fix timestamp comparison

Co-authored-by: Matt Keeler <mjkeeler7@gmail.com>
2020-06-08 21:44:06 +02:00
Daniel Nephin 1cdfc4f290 ci: Enabled SA2002 staticcheck check
And handle errors in the main test goroutine
2020-06-05 17:50:11 -04:00
Daniel Nephin b9e4544ec3 intentions: fix a bug in Intention.SetHash
Found using staticcheck.

binary.Write does not accept int types without a size. The error from binary.Write was ignored, so we never saw this error. Casting the data to uint64 produces a correct hash.

Also deprecate the Default{Addr,Port} fields, and prevent them from being encoded. These fields will always be empty and are not used.
Removing these would break backwards compatibility, so they are left in place for now.

Co-authored-by: Hans Hasselberg <me@hans.io>
2020-06-05 14:51:43 -04:00
R.B. Boyer 3ad570ba99
server: don't activate federation state replication or anti-entropy until all servers are running 1.8.0+ (#8014) 2020-06-04 16:05:27 -05:00
Hans Hasselberg dd8cd9bc24
Merge pull request #7966 from hashicorp/pool_improvements
Agent connection pool cleanup
2020-06-04 08:56:26 +02:00
Matt Keeler 2c615807af
Fix legacy management tokens in unupgraded secondary dcs (#7908)
The ACL.GetPolicy RPC endpoint was supposed to return the “parent” policy and not always the default policy. In the case of legacy management tokens the parent policy was supposed to be “manage”. The result of us not sending this properly was that operations that required specifically a management token such as saving a snapshot would not work in secondary DCs until they were upgraded.
2020-06-03 11:22:22 -04:00
Matt Keeler 9fa9ec4ba0
Fix segfault due to race condition for checking server versions (#7957)
The ACL monitoring routine uses c.routers to check for server version updates. Therefore it needs to be started after initializing the routers.
2020-06-03 10:36:32 -04:00
Daniel Nephin e8a883e829
Replace goe/verify.Values with testify/require.Equal (#7993)
* testing: replace most goe/verify.Values with require.Equal

One difference between these two comparisons is that go/verify considers
nil slices/maps to be equal to empty slices/maps, where as testify/require
does not, and does not appear to provide any way to enable that behaviour.

Because of this difference some expected values were changed from empty
slices to nil slices, and some calls to verify.Values were left.

* Remove github.com/pascaldekloe/goe/verify

Reduce the number of assertion packages we use from 2 to 1
2020-06-02 12:41:25 -04:00
R.B. Boyer 7bd7895047
acl: allow auth methods created in the primary datacenter to optionally create global tokens (#7899) 2020-06-01 11:44:47 -05:00
R.B. Boyer 16db20b1f3
acl: remove the deprecated `acl_enforce_version_8` option (#7991)
Fixes #7292
2020-05-29 16:16:03 -05:00
Jono Sosulska 7a13c96a2a
Replace whitelist/blacklist terminology with allowlist/denylist (#7971)
* Replace whitelist/blacklist terminology with allowlist/denylist
2020-05-29 14:19:16 -04:00
Hans Hasselberg 1ed91cbdf6 pool: remove timeout parameter
Timeout was never used in a meaningful way by callers, which is why it
is now entirely internal to the pool.
2020-05-29 08:21:28 +02:00
Hans Hasselberg 5cda505495 pool: remove useTLS and ForceTLS
In the past TLS usage was enforced with these variables, but these days
this decision is made by TLSConfigurator and there is no reason to keep
using the variables.
2020-05-29 08:21:24 +02:00
Hans Hasselberg 9ef44ec3da pool: remove version
The version field has been used to decide which multiplexing to use. It
was introduced in 2457293dceec95ecd12ef4f01442e13710ea131a. But this is
6y ago and there is no need for this differentiation anymore.
2020-05-28 23:06:01 +02:00
Daniel Nephin ea6c2b2adc ci: Add staticcheck and fix most errors
Three of the checks are temporarily disabled to limit the size of the
diff, and allow us to enable all the other checks in CI.

In a follow up we can fix the issues reported by the other checks one
at a time, and enable them.
2020-05-28 11:59:58 -04:00
R.B. Boyer 54c7f825d6
create lib/stringslice package (#7934) 2020-05-27 11:47:32 -05:00
R.B. Boyer 813d69622e
agent: handle re-bootstrapping in a secondary datacenter when WAN federation via mesh gateways is configured (#7931)
The main fix here is to always union the `primary-gateways` list with
the list of mesh gateways in the primary returned from the replicated
federation states list. This will allow any replicated (incorrect) state
to be supplemented with user-configured (correct) state in the config
file. Eventually the game of random selection whack-a-mole will pick a
winning entry and re-replicate the latest federation states from the
primary. If the user-configured state is actually the incorrect one,
then the same eventual correct selection process will work in that case,
too.

The secondary fix is actually to finish making wanfed-via-mgws actually
work as originally designed. Once a secondary datacenter has replicated
federation states for the primary AND managed to stand up its own local
mesh gateways then all of the RPCs from a secondary to the primary
SHOULD go through two sets of mesh gateways to arrive in the consul
servers in the primary (one hop for the secondary datacenter's mesh
gateway, and one hop through the primary datacenter's mesh gateway).
This was neglected in the initial implementation. While everything
works, ideally we should treat communications that go around the mesh
gateways as just provided for bootstrapping purposes.

Now we heuristically use the success/failure history of the federation
state replicator goroutine loop to determine if our current mesh gateway
route is working as intended. If it is, we try using the local gateways,
and if those don't work we fall back on trying the primary via the union
of the replicated state and the go-discover configuration flags.

This can be improved slightly in the future by possibly initializing the
gateway choice to local on startup if we already have replicated state.
This PR does not address that improvement.

Fixes #7339
2020-05-27 11:31:10 -05:00
R.B. Boyer 7e42819a71
connect: ensure proxy-defaults protocol is used for upstreams (#7938) 2020-05-21 16:08:39 -05:00
Daniel Nephin f9a89db86e
Update agent/consul/state/catalog.go
Co-authored-by: Hans Hasselberg <me@hans.io>
2020-05-20 16:34:14 -04:00
Daniel Nephin e1e1c13b35 state: use an error to indicate compare failed
Errors are values. We can use the error value to identify the 'comparison failed' case which makes the function easier to use and should make it harder to miss handle the error case
2020-05-20 12:43:33 -04:00
Pierre Souchay 3b548f0d77
Allow to restrict servers that can join a given Serf Consul cluster. (#7628)
Based on work done in https://github.com/hashicorp/memberlist/pull/196
this allows to restrict the IP ranges that can join a given Serf cluster
and be a member of the cluster.

Restrictions on IPs can be done separatly using 2 new differents flags
and config options to restrict IPs for LAN and WAN Serf.
2020-05-20 11:31:19 +02:00
Daniel Nephin 6e3a7b0aa8 consul/state: refactor tnxService to avoid missed cases
Handling errors at the end of a log switch/case block is somewhat
brittle. This block included a couple cases where errors were ignored,
but it was not obvious the way it was written.

This change moves all error handling into each case block. There is
still potentially one case where err is ignored, which will be handled
in a follow up.
2020-05-19 16:50:14 -04:00
Daniel Nephin 545bd766e7 Fix a number of problems found by staticcheck
Some of these problems are minor (unused vars), but others are real bugs (ignored errors).

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
2020-05-19 16:50:14 -04:00
Daniel Nephin aa8009ee45 Remove unused var
The usage of this var was removed in b92f895c233bf8a99ae35361117a90416fea29b5.

Found by using staticcheck
2020-05-19 16:50:14 -04:00
Chris Piraino ce099c9aca Do not return an error if requested service is not a gateway
This commit converts the previous error into just a Warn-level log
message. By returning an error when the requested service was not a
gateway, we did not appropriately update envoy because the cache Fetch
returned an error and thus did not propagate the update through proxycfg
and xds packages.
2020-05-18 09:08:04 -05:00
Aleksandr Zagaevskiy 75f0607d3b
Preserve ModifyIndex for unchanged entry in KVS TXN (#7832) 2020-05-14 13:25:04 -06:00
Matt Keeler 849eedd142
Fix identity resolution on clients and in secondary dcs (#7862)
Previously this happened to be using the method on the Server/Client that was meant to allow the ACLResolver to locally resolve tokens. On Servers that had tokens (primary or secondary dc + token replication) this function would lookup the token from raft and return the ACLIdentity. On clients this was always a noop. We inadvertently used this function instead of creating a new one when we added logging accessor ids for permission denied RPC requests. 

With this commit, a new method is used for resolving the identity properly via the ACLResolver which may still resolve locally in the case of being on a server with tokens but also supports remote token resolution.
2020-05-13 13:00:08 -04:00
Chris Piraino 3eb4e4012a
Make new gateway tests compatible with enterprise (#7856) 2020-05-12 13:48:20 -05:00
Daniel Nephin 2e0f750f1a Add unconvert linter
To find unnecessary type convertions
2020-05-12 13:47:25 -04:00
Drew Bailey 7e1734b1c6 Value is already an int, remove type cast 2020-05-12 13:13:09 -04:00
R.B. Boyer 940e5ad160
acl: add auth method for JWTs (#7846) 2020-05-11 20:59:29 -05:00
Chris Piraino 1173b31949
Return early from updateGatewayServices if nothing to update (#7838)
* Return early from updateGatewayServices if nothing to update

Previously, we returned an empty slice of gatewayServices, which caused
us to accidentally delete everything in the memdb table

* PR comment and better formatting
2020-05-11 14:46:48 -05:00
Chris Piraino 1a7c99cf31
Fix TestInternal_GatewayServiceDump_Ingress (#7840)
Protocol was added as a field on GatewayServices after
GatewayServiceDump PR branch was created.
2020-05-11 14:46:31 -05:00
R.B. Boyer c54211ad52
cli: ensure 'acl auth-method update' doesn't deep merge the Config field (#7839) 2020-05-11 14:21:17 -05:00
Chris Piraino 107c7a9ca7 PR comment and better formatting 2020-05-11 14:04:59 -05:00
Chris Piraino 9f924400e0 Return early from updateGatewayServices if nothing to update
Previously, we returned an empty slice of gatewayServices, which caused
us to accidentally delete everything in the memdb table
2020-05-11 12:38:04 -05:00
Freddy ebbb234ecb
Gateway Services Nodes UI Endpoint (#7685)
The endpoint supports queries for both Ingress Gateways and Terminating Gateways. Used to display a gateway's linked services in the UI.
2020-05-11 11:35:17 -06:00
Chris Piraino a635e23f86
Restoring config entries updates the gateway-services table (#7811)
- Adds a new validateConfigEntryEnterprise function
- Also fixes some state store tests that were failing in enterprise
2020-05-08 13:24:33 -05:00
Freddy a37d7a42c9
Fix up enterprise compatibility for gateways (#7813) 2020-05-08 09:44:34 -06:00
Jono Sosulska 44011c81f2
Fix spelling of deregister (#7804) 2020-05-08 10:03:45 -04:00
Chris Piraino ad8a0544f2
Require individual services in ingress entry to match protocols (#7774)
We require any non-wildcard services to match the protocol defined in
the listener on write, so that we can maintain a consistent experience
through ingress gateways. This also helps guard against accidental
misconfiguration by a user.

- Update tests that require an updated protocol for ingress gateways
2020-05-06 16:09:24 -05:00
Chris Piraino ac115e39b2 A proxy-default config entry only exists in the default namespace 2020-05-06 15:06:14 -05:00
Chris Piraino 9a130f2ccc Remove outdated comment 2020-05-06 15:06:14 -05:00
Kyle Havlovitz 04b6bd637a Filter wildcard gateway services to match listener protocol
This now requires some type of protocol setting in ingress gateway tests
to ensure the services are not filtered out.

- small refactor to add a max(x, y) function
- Use internal configEntryTxn function and add MaxUint64 to lib
2020-05-06 15:06:13 -05:00
Chris Piraino 210dda5682 Allow Hosts field to be set on an ingress config entry
- Validate that this cannot be set on a 'tcp' listener nor on a wildcard
service.
- Add Hosts field to api and test in consul config write CLI
- xds: Configure envoy with user-provided hosts from ingress gateways
2020-05-06 15:06:13 -05:00
Kyle Havlovitz e4268c8b7f Support multiple listeners referencing the same service in gateway definitions 2020-05-06 15:06:13 -05:00
Kyle Havlovitz b21cd112e5 Allow ingress gateways to route traffic based on Host header
This commit adds the necessary changes to allow an ingress gateway to
route traffic from a single defined port to multiple different upstream
services in the Consul mesh.

To do this, we now require all HTTP requests coming into the ingress
gateway to specify a Host header that matches "<service-name>.*" in
order to correctly route traffic to the correct service.

- Differentiate multiple listener's route names by port
- Adds a case in xds for allowing default discovery chains to create a
  route configuration when on an ingress gateway. This allows default
  services to easily use host header routing
- ingress-gateways have a single route config for each listener
  that utilizes domain matching to route to different services.
2020-05-06 15:06:13 -05:00
R.B. Boyer 1187d7288e
acl: oss plumbing to support auth method namespace rules in enterprise (#7794)
This includes website docs updates.
2020-05-06 13:48:04 -05:00
R.B. Boyer b6cc92020d
test: make the kube auth method test helper use freeport (#7788) 2020-05-05 16:55:21 -05:00
Hans Hasselberg e3e2b82a00 network_segments: stop advertising segment tags 2020-05-05 21:32:05 +02:00
Hans Hasselberg 854aac510f agent: refactor to use a single addrFn 2020-05-05 21:08:10 +02:00
Hans Hasselberg 0f2e189012 agent: rename local/global to src/dst 2020-05-05 21:07:34 +02:00
Chris Piraino 837bd6f558
Construct a default destination if one does not exist for service-router (#7783) 2020-05-05 10:49:50 -05:00
R.B. Boyer c9c557477b
acl: add MaxTokenTTL field to auth methods (#7779)
When set to a non zero value it will limit the ExpirationTime of all
tokens created via the auth method.
2020-05-04 17:02:57 -05:00
R.B. Boyer 265d2ea9e1
acl: add DisplayName field to auth methods (#7769)
Also add a few missing acl fields in the api.
2020-05-04 15:18:25 -05:00
Hans Hasselberg 1be90e0fa1
agent: don't let left nodes hold onto their node-id (#7747) 2020-05-04 18:39:08 +02:00
Matt Keeler 669d22933e
Merge pull request #7714 from hashicorp/oss-sync/msp-agent-token 2020-05-04 11:33:50 -04:00
R.B. Boyer 3ac5a841ec
acl: refactor the authmethod.Validator interface (#7760)
This is a collection of refactors that make upcoming PRs easier to digest.

The main change is the introduction of the authmethod.Identity struct.
In the one and only current auth method (type=kubernetes) all of the
trusted identity attributes are both selectable and projectable, so they
were just passed around as a map[string]string.

When namespaces were added, this was slightly changed so that the
enterprise metadata can also come back from the login operation, so
login now returned two fields.

Now with some upcoming auth methods it won't be true that all identity
attributes will be both selectable and projectable, so rather than
update the login function to return 3 pieces of data it seemed worth it
to wrap those fields up and give them a proper name.
2020-05-01 17:35:28 -05:00
R.B. Boyer 4cd1d62e40
acl: change authmethod.Validator to take a logger (#7758) 2020-05-01 15:55:26 -05:00
R.B. Boyer 9faf8c42d1
sdk: extracting testutil.RequireErrorContains from various places it was duplicated (#7753) 2020-05-01 11:56:34 -05:00
Hans Hasselberg 6626cb69d6
rpc: oss changes for network area connection pooling (#7735) 2020-04-30 22:12:17 +02:00
Freddy c34ee5d339
Watch fallback channel for gateways that do not exist (#7715)
Also ensure that WatchSets in tests are reset between calls to watchFired. 
Any time a watch fires, subsequent calls to watchFired on the same WatchSet
will also return true even if there were no changes.
2020-04-29 16:52:27 -06:00
Matt Keeler 901d6739ad
Some boilerplate to allow for ACL Bootstrap disabling configurability 2020-04-28 09:42:46 -04:00
Freddy f5c1e5268b
TLS Origination for Terminating Gateways (#7671) 2020-04-27 16:25:37 -06:00
freddygv e751b83a3f Clean up dead code, issue addressed by passing ws to serviceGatewayNodes 2020-04-27 11:08:41 -06:00
freddygv 75e737b0f2 Fix internal endpoint test 2020-04-27 11:08:41 -06:00
freddygv 7667567688 Avoid deleting mappings for services linked to other gateways on dereg 2020-04-27 11:08:41 -06:00
freddygv 28fe6920fe Re-fix bug in CheckConnectServiceNodes 2020-04-27 11:08:41 -06:00
freddygv bab101107c Fix ConnectQueryBlocking test 2020-04-27 11:08:40 -06:00
freddygv 65e60d02f1 Fix bug in CheckConnectServiceNodes
Previously, if a blocking query called CheckConnectServiceNodes
before the gateway-services memdb table had any entries,
a nil watchCh would be returned when calling serviceTerminatingGatewayNodes.
This means that the blocking query would not fire if a gateway config entry
was added after the watch started.

In cases where the blocking query started on proxy registration,
the proxy could potentially never become aware of an upstream endpoint
if that upstream was going to be represented by a gateway.
2020-04-27 11:08:40 -06:00
Matt Keeler 4b1b42cef5
A couple testing helper updates (#7694) 2020-04-27 12:17:38 -04:00
Kit Patella 2b95bd7ca9
Merge pull request #7656 from hashicorp/feature/audit/oss-merge
agent: stub out auditing functionality in OSS
2020-04-17 13:33:06 -07:00
Chris Piraino c5ab43ebbc
Fix bug where non-typical services are associated with gateways (#7662)
On every service registration, we check to see if a service should be
assassociated to a wildcard gateway-service. This fixes an issue where
we did not correctly check to see if the service being registered was a
"typical" service or not.
2020-04-17 11:24:34 -05:00
Kit Patella c3d24d7c3e agent: stub out auditing functionality in OSS 2020-04-16 15:07:52 -07:00
Kyle Havlovitz 6a5eba63ab
Ingress Gateways for TCP services (#7509)
* Implements a simple, tcp ingress gateway workflow

This adds a new type of gateway for allowing Ingress traffic into Connect from external services.

Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
2020-04-16 14:00:48 -07:00
Matt Keeler 1e70ffee76
Update the Client code to use the common version checking infra… (#7558)
Also reduce the log level of some version checking messages on the server as they can be pretty noisy during upgrades and really are more for debugging purposes.
2020-04-14 11:54:27 -04:00
Matt Keeler 1332628b67
Allow the bootstrap endpoint to be disabled in enterprise. (#7614) 2020-04-14 11:45:39 -04:00
Pierre Souchay 2e6cd9e11a
fix flaky TestReplication_FederationStates test due to race conditions (#7612)
The test had two racy bugs related to memdb references.

The first was when we initially populated data and retained the FederationState objects in a slice. Due to how the `inmemCodec` works these were actually the identical objects passed into memdb.

The second was that the `checkSame` assertion function was reading from memdb and setting the RaftIndexes to zeros to aid in equality checks. This was mutating the contents of memdb which is a no-no.

With this fix, the command:
```
i=0; while /usr/local/bin/go test -count=1 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```
That used to break on my machine in less than 20 runs is now running 150+ times without any issue.

Might also fix #7575
2020-04-09 15:42:41 -05:00
Freddy c1f79c6b3c
Terminating gateway discovery (#7571)
* Enable discovering terminating gateways

* Add TerminatingGatewayServices to state store

* Use GatewayServices RPC endpoint for ingress/terminating
2020-04-08 12:37:24 -06:00
Matt Keeler 42f02e80c3
Enable filtering language support for the v1/connect/intentions… (#7593)
* Enable filtering language support for the v1/connect/intentions listing API

* Update website for filtering of Intentions

* Update website/source/api/connect/intentions.html.md
2020-04-07 11:48:44 -04:00
Matt Keeler 5d0e661203
Ensure that token clone copies the roles (#7577) 2020-04-02 12:09:35 -04:00
Emre Savcı 7a99f29adc
agent: add len, cap while initializing arrays 2020-04-01 10:54:51 +02:00
Freddy 8a1e53754e
Add config entry for terminating gateways (#7545)
This config entry will be used to configure terminating gateways.

It accepts the name of the gateway and a list of services the gateway will represent.

For each service users will be able to specify: its name, namespace, and additional options for TLS origination.

Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
2020-03-31 13:27:32 -06:00
Kyle Havlovitz 01a23b8eb4
Add config entry/state for Ingress Gateways (#7483)
* Add Ingress gateway config entry and other relevant structs

* Add api package tests for ingress gateways

* Embed EnterpriseMeta into ingress service struct

* Add namespace fields to api module and test consul config write decoding

* Don't require a port for ingress gateways

* Add snakeJSON and camelJSON cases in command test

* Run Normalize on service's ent metadata

Sadly cannot think of a way to test this in OSS.

* Every protocol requires at least 1 service

* Validate ingress protocols

* Update agent/structs/config_entry_gateways.go

Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
2020-03-31 11:59:10 -05:00
Matt Keeler 35c8e996c3
Ensure server requirements checks are done against ALL known se… (#7491)
Co-authored-by: Paul Banks <banks@banksco.de>
2020-03-27 12:31:43 -04:00
Daniel Nephin a2eb66963c
Merge pull request #7516 from hashicorp/dnephin/remove-unused-method
agent: Remove unused method Encrypted from delegate interface
2020-03-26 14:17:58 -04:00
Daniel Nephin ebb851f32d agent: Remove unused Encrypted from interface
It appears to be unused. It looks like it has been around a while,
I geuss at some point we stopped using this method.
2020-03-26 12:34:31 -04:00
Freddy cb55fa3742
Enable CLI to register terminating gateways (#7500)
* Enable CLI to register terminating gateways

* Centralize gateway proxy configuration
2020-03-26 10:20:56 -06:00
Alejandro Baez 7d68d7eaa6
Add PolicyReadByName for API (#6615) 2020-03-25 10:34:24 -04:00
Matt Keeler 58e2969fc1
Fix ACL mode advertisement and detection (#7451)
These changes are necessary to ensure advertisement happens correctly even when datacenters are connected via network areas in Consul enterprise.

This also changes how we check if ACLs can be upgraded within the local datacenter. Previously we would iterate through all LAN members. Now we just use the ServerLookup type to iterate through all known servers in the DC.
2020-03-16 12:54:45 -04:00
Freddy 8a7ff69b19
Update MSP token and filtering (#7431) 2020-03-11 12:08:49 -06:00
R.B. Boyer 10d3ff9a4f
server: strip local ACL tokens from RPCs during forwarding if crossing datacenters (#7419)
Fixes #7414
2020-03-10 11:15:22 -05:00
Kyle Havlovitz 520d464c85
Merge pull request #7373 from hashicorp/acl-segments-fix
Add stub methods for ACL/segment bug fix from enterprise
2020-03-09 14:25:49 -07:00
R.B. Boyer a7fb26f50f
wan federation via mesh gateways (#6884)
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:

There are several distinct chunks of code that are affected:

* new flags and config options for the server

* retry join WAN is slightly different

* retry join code is shared to discover primary mesh gateways from secondary datacenters

* because retry join logic runs in the *agent* and the results of that
  operation for primary mesh gateways are needed in the *server* there are
  some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
  at multiple layers of abstraction just to pass the data down to the right
  layer.

* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers

* the function signature for RPC dialing picked up a new required field (the
  node name of the destination)

* several new RPCs for manipulating a FederationState object:
  `FederationState:{Apply,Get,List,ListMeshGateways}`

* 3 read-only internal APIs for debugging use to invoke those RPCs from curl

* raft and fsm changes to persist these FederationStates

* replication for FederationStates as they are canonically stored in the
  Primary and replicated to the Secondaries.

* a special derivative of anti-entropy that runs in secondaries to snapshot
  their local mesh gateway `CheckServiceNodes` and sync them into their upstream
  FederationState in the primary (this works in conjunction with the
  replication to distribute addresses for all mesh gateways in all DCs to all
  other DCs)

* a "gateway locator" convenience object to make use of this data to choose
  the addresses of gateways to use for any given RPC or gossip operation to a
  remote DC. This gets data from the "retry join" logic in the agent and also
  directly calls into the FSM.

* RPC (`:8300`) on the server sniffs the first byte of a new connection to
  determine if it's actually doing native TLS. If so it checks the ALPN header
  for protocol determination (just like how the existing system uses the
  type-byte marker).

* 2 new kinds of protocols are exclusively decoded via this native TLS
  mechanism: one for ferrying "packet" operations (udp-like) from the gossip
  layer and one for "stream" operations (tcp-like). The packet operations
  re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
  overhead.

* the server instances specially wrap the `memberlist.NetTransport` when running
  with gateway federation enabled (in a `wanfed.Transport`). The general gist is
  that if it tries to dial a node in the SAME datacenter (deduced by looking
  at the suffix of the node name) there is no change. If dialing a DIFFERENT
  datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
  gateways to eventually end up in a server's :8300 port.

* a new flag when launching a mesh gateway via `consul connect envoy` to
  indicate that the servers are to be exposed. This sets a special service
  meta when registering the gateway into the catalog.

* `proxycfg/xds` notice this metadata blob to activate additional watches for
  the FederationState objects as well as the location of all of the consul
  servers in that datacenter.

* `xds:` if the extra metadata is in place additional clusters are defined in a
  DC to bulk sink all traffic to another DC's gateways. For the current
  datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
  balances all servers as well as one mini-cluster per node
  (`<node>.server.<dc>.consul`)

* the `consul tls cert create` command got a new flag (`-node`) to help create
  an additional SAN in certs that can be used with this flavor of federation.
2020-03-09 15:59:02 -05:00
Matt Keeler b684138882 Fix session backwards incompatibility with 1.6.x and earlier. 2020-03-05 15:34:55 -05:00
Kyle Havlovitz b05ebe2507 Add stub methods for ACL/segment bug fix from enterprise 2020-03-02 10:30:23 -08:00
rerorero b366a25179
fix: Destroying a session that doesn't exist returns status cod… (#6905)
fix #6840
2020-02-18 11:13:15 -05:00
Matt Keeler be0d6efac9
Allow the PolicyResolve and RoleResolve endpoints to process na… (#7296) 2020-02-13 14:55:27 -05:00
R.B. Boyer 919741838d
fix use of hclog logger (#7264) 2020-02-12 09:37:16 -06:00
ShimmerGlass a27ccc7248
agent: add server raft.{last,applied}_index gauges (#6694)
These metrics are useful for :
* Tracking the rate of update to the db
* Allow to have a rough idea of when an index originated
2020-02-11 10:50:18 +01:00
Hans Hasselberg 71ce832990
connect: add validations around intermediate cert ttl (#7213) 2020-02-11 00:05:49 +01:00
R.B. Boyer c37d00791c
make the TestRPC_RPCMaxConnsPerClient test less flaky (#7255) 2020-02-10 15:13:53 -06:00
Sarah Christoff 85d2714c76
Fix flaky TestAutopilot_BootstrapExpect (#7242) 2020-02-10 14:52:58 -06:00
Kit Patella d28bc1acbe
rpc: measure blocking queries (#7224)
* agent: measure blocking queries

* agent.rpc: update docs to mention we only record blocking queries

* agent.rpc: make go fmt happy

* agent.rpc: fix non-atomic read and decrement with bitwise xor of uint64 0

* agent.rpc: clarify review question

* agent.rpc: today I learned that one must declare all variables before interacting with goto labels

* Update agent/consul/server.go

agent.rpc: more precise comment on `Server.queriesBlocking`

Co-Authored-By: Paul Banks <banks@banksco.de>

* Update website/source/docs/agent/telemetry.html.md

agent.rpc: improve queries_blocking description

Co-Authored-By: Paul Banks <banks@banksco.de>

* agent.rpc: fix some bugs found in review

* add a note about the updated counter behavior to telemetry.md

* docs: add upgrade-specific note on consul.rpc.quer{y,ies_blocking} behavior

Co-authored-by: Paul Banks <banks@banksco.de>
2020-02-10 10:01:15 -08:00
Matt Keeler 966d085066
Catalog + Namespace OSS changes. (#7219)
* Various Prepared Query + Namespace things

* Last round of OSS changes for a namespaced catalog
2020-02-10 10:40:44 -05:00
R.B. Boyer b4325dfbce
agent: ensure that we always use the same settings for msgpack (#7245)
We set RawToString=true so that []uint8 => string when decoding an interface{}.
We set the MapType so that map[interface{}]interface{} decodes to map[string]interface{}.

Add tests to ensure that this doesn't break existing usages.

Fixes #7223
2020-02-07 15:50:24 -06:00
Freddy aca8b85440
Remove outdated TODO (#7244) 2020-02-07 13:14:48 -07:00
Matt Keeler f610d1d791
Fix a bug with ACL enforcement of reads on namespaced config entries. (#7239) 2020-02-07 08:30:40 -05:00
Kit Patella aa9db3f903
agent/consul server: fix LeaderTest_ChangeNodeID (#7236)
* fix LeaderTest_ChangeNodeID to use StatusLeft and add waitForAnyLANLeave

* unextract the waitFor... fn, simplify, and provide a more descriptive error
2020-02-06 16:37:53 -08:00
Matt Keeler 2524a028ea
OSS Changes for various config entry namespacing bugs (#7226) 2020-02-06 10:52:25 -05:00
R.B. Boyer a67001aa22
agent: differentiate wan vs lan loggers in memberlist and serf (#7205)
This should be a helpful change until memberlist and serf can be
properly switched to native hclog.
2020-02-05 09:52:43 -06:00
Matt Keeler 119168203b
Fix disco chain graph validation for namespaces (#7217)
Previously this happened to be validating only the chains in the default namespace. Now it will validate all chains in all namespaces when the global proxy-defaults is changed.
2020-02-05 10:06:27 -05:00
Matt Keeler 3621f7090b
Minor Non-Functional Updates (#7215)
* Cleanup the discovery chain compilation route handling

Nothing functionally should be different here. The real difference is that when creating new targets or handling route destinations we use the router config entries name and namespace instead of that of the top level request. Today they SHOULD always be the same but that may not always be the case. This hopefully also makes it easier to understand how the router entries are handled.

* Refactor a small bit of the service manager tests in oss

We used to use the stringHash function to compute part of the filename where things would get persisted to. This has been changed in the core code to calling the StringHash method on the ServiceID type. It just so happens that the new method will output the same value for anything in the default namespace (by design actually). However, logically this filename computation in the test should do the same thing as the core code itself so I updated it here.

Also of note is that newer enterprise-only tests for the service manager cannot use the old stringHash function at all because it will produce incorrect results for non-default namespaces.
2020-02-05 10:06:11 -05:00
Freddy 67e02a0752
Add managed service provider token (#7218)
Stubs for enterprise-only ACL token to be used by managed service providers.
2020-02-04 13:58:56 -07:00
Hans Hasselberg a9f9ed83cb
agent: increase watchLimit to 8192. (#7200)
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192 which reportedly fixed most of the issues with that.

Related: #4984, #4986, #5050.
2020-02-04 13:11:30 +01:00
Davor Kapsa c280dd8549
auto_encrypt: check previously ignored error (#6604) 2020-02-03 10:35:11 +01:00
Hans Hasselberg 50281032e0
Security fixes (#7182)
* Mitigate HTTP/RPC Services Allow Unbounded Resource Usage

Fixes #7159.

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Paul Banks <banks@banksco.de>
2020-01-31 11:19:37 -05:00
Matt Keeler 26bb1584c1
Updates to the Txn API for namespaces (#7172)
* Updates to the Txn API for namespaces

* Update agent/consul/txn_endpoint.go

Co-Authored-By: R.B. Boyer <rb@hashicorp.com>

Co-authored-by: R.B. Boyer <public@richardboyer.net>
2020-01-30 13:12:26 -05:00
Matt Keeler 3f253080a2
Sync some feature flag support from enterprise (#7167) 2020-01-29 13:21:38 -05:00
R.B. Boyer 01ebdff2a9
various tweaks on top of the hclog work (#7165) 2020-01-29 11:16:08 -06:00
Chris Piraino 3dd0b59793
Allow users to configure either unstructured or JSON logging (#7130)
* hclog Allow users to choose between unstructured and JSON logging
2020-01-28 17:50:41 -06:00
Kit Patella 49e9bbbdf9
Add accessorID of token when ops are denied by ACL system (#7117)
* agent: add and edit doc comments

* agent: add ACL token accessorID to debugging traces

* agent: polish acl debugging

* agent: minor fix + string fmt over value interp

* agent: undo export & fix logging field names

* agent: remove note and migrate up to code review

* Update agent/consul/acl.go

Co-Authored-By: Matt Keeler <mkeeler@users.noreply.github.com>

* agent: incorporate review feedback

* Update agent/acl.go

Co-Authored-By: R.B. Boyer <public@richardboyer.net>

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: R.B. Boyer <public@richardboyer.net>
2020-01-27 11:54:32 -08:00
Matt Keeler 485a0a65ea
Updates to Config Entries and Connect for Namespaces (#7116) 2020-01-24 10:04:58 -05:00
Matt Keeler 90b9f87160
Add the v1/catalog/node-services/:node endpoint (#7115)
The backing RPC already existed but the endpoint will be useful for other service syncing processes such as consul-k8s as this endpoint can return all services registered with a node regardless of namespacing.
2020-01-24 09:27:25 -05:00
Hans Hasselberg 5379cf7c67
raft: increase raft notify buffer. (#6863)
* Increase raft notify buffer.

Fixes https://github.com/hashicorp/consul/issues/6852.

Increasing the buffer helps recovering from leader flapping. It lowers
the chances of the flapping leader to get into a deadlock situation like
described in #6852.
2020-01-22 16:15:59 +01:00
Hans Hasselberg d52a4e3b82
tests: fix autopilot test (#7092) 2020-01-21 14:09:51 +01:00
Hans Hasselberg 43392d5db3
raft: update raft to v1.1.2 (#7079)
* update raft
* use hclogger for raft.
2020-01-20 13:58:02 +01:00
Hans Hasselberg 315ba7d6ad
connect: check if intermediate cert needs to be renewed. (#6835)
Currently when using the built-in CA provider for Connect, root certificates are valid for 10 years, however secondary DCs get intermediates that are valid for only 1 year. There is no mechanism currently short of rotating the root in the primary that will cause the secondary DCs to renew their intermediates.
This PR adds a check that renews the cert if it is half way through its validity period.

In order to be able to test these changes, a new configuration option was added: IntermediateCertTTL which is set extremely low in the tests.
2020-01-17 23:27:13 +01:00
Hans Hasselberg b6c83e06d5
auto_encrypt: set dns and ip san for k8s and provide configuration (#6944)
* Add CreateCSRWithSAN
* Use CreateCSRWithSAN in auto_encrypt and cache
* Copy DNSNames and IPAddresses to cert
* Verify auto_encrypt.sign returns cert with SAN
* provide configuration options for auto_encrypt dnssan and ipsan
* rename CreateCSRWithSAN to CreateCSR
2020-01-17 23:25:26 +01:00
Matej Urbas d877e091d6 agent: configurable MaxQueryTime and DefaultQueryTime. (#3777) 2020-01-17 14:20:57 +01:00
Matt Keeler c8294b8595
AuthMethod updates to support alternate namespace logins (#7029) 2020-01-14 10:09:29 -05:00
Matt Keeler baa89c7c65
Intentions ACL enforcement updates (#7028)
* Renamed structs.IntentionWildcard to structs.WildcardSpecifier

* Refactor ACL Config

Get rid of remnants of enterprise only renaming.

Add a WildcardName field for specifying what string should be used to indicate a wildcard.

* Add wildcard support in the ACL package

For read operations they can call anyAllowed to determine if any read access to the given resource would be granted.

For write operations they can call allAllowed to ensure that write access is granted to everything.

* Make v1/agent/connect/authorize namespace aware

* Update intention ACL enforcement

This also changes how intention:read is granted. Before the Intention.List RPC would allow viewing an intention if the token had intention:read on the destination. However Intention.Match allowed viewing if access was allowed for either the source or dest side. Now Intention.List and Intention.Get fall in line with Intention.Matches previous behavior.

Due to this being done a few different places ACL enforcement for a singular intention is now done with the CanRead and CanWrite methods on the intention itself.

* Refactor Intention.Apply to make things easier to follow.
2020-01-13 15:51:40 -05:00
Pierre Souchay 61fc4f8253 rpc: log method when a server/server RPC call fails (#4548)
Sometimes, we have lots of errors in cross calls between DCs (several hundreds / sec)
Enrich the log in order to help diagnose the root cause of issue.
2020-01-13 19:55:29 +01:00
R.B. Boyer 20f51f9181 connect: derive connect certificate serial numbers from a memdb index instead of the provider table max index (#7011) 2020-01-09 16:32:19 +01:00
R.B. Boyer 446f0533cd connect: ensure that updates to the secondary root CA configuration use the correct signing key ID values for comparison (#7012)
Fixes #6886
2020-01-09 16:28:16 +01:00
R.B. Boyer 42f80367be
Restore a few more service-kind index updates so blocking in ServiceDump works in more cases (#6948)
Restore a few more service-kind index updates so blocking in ServiceDump works in more cases

Namely one omission was that check updates for dumped services were not
unblocking.

Also adds a ServiceDump state store test and also fix a watch bug with the
normal dump.

Follow-on from #6916
2019-12-19 10:15:37 -06:00
Matt Keeler 6de4eb8569
OSS changes for implementing token based namespace inferencing
remove debug log
2019-12-18 14:07:08 -05:00
Matt Keeler 185654b075
Unflake the TestACLEndpoint_TokenList test
In order to do this I added a waitForLeaderEstablishment helper which does the right thing to ensure that leader establishment has finished.

fixup
2019-12-18 14:07:07 -05:00
Matt Keeler 8af12bf4f4
Miscellaneous acl package cleanup
• Renamed EnterpriseACLConfig to just Config
• Removed chained_authorizer_oss.go as it was empty
• Renamed acl.go to errors.go to more closely describe its contents
2019-12-18 13:44:32 -05:00
Matt Keeler bdf025a758
Rename EnterpriseAuthorizerContext -> AuthorizerContext 2019-12-18 13:43:24 -05:00
Preetha f607a00138 autopilot: fix dead server removal condition to use correct failure tolerance (#4017)
* Make dead server removal condition in autopilot use correct failure tolerance rules
* Introduce func with explanation
2019-12-16 23:35:13 +01:00
Matt Keeler 9812b32155
Fix blocking for ServiceDumping by kind (#6919) 2019-12-10 13:58:30 -05:00
Matt Keeler 442924c35a
Sync of OSS changes to support namespaces (#6909) 2019-12-09 21:26:41 -05:00
Hans Hasselberg a36e58c964
agent: fewer file local differences between enterprise and oss (#6820) (#6898)
* Increase number to test ignore. Consul Enterprise has more flags and since we are trying to reduce the differences between both code bases, we are increasing the number in oss. The semantics don't change, it is just a cosmetic thing.
* Introduce agent.initEnterprise for enterprise related hooks.
* Sync test with ent version.
* Fix import order.
* revert error wording.
2019-12-06 21:35:58 +01:00
Matt Keeler 609c9dab02
Miscellaneous Fixes (#6896)
Ensure we close the Sentinel Evaluator so as not to leak go routines

Fix a bunch of test logging so that various warnings when starting a test agent go to the ltest logger and not straight to stdout.

Various canned ent meta types always return a valid pointer (no more nils). This allows us to blindly deref + assign in various places.

Update ACL index tracking to ensure oss -> ent upgrades will work as expected.

Update ent meta parsing to include function to disallow wildcarding.
2019-12-06 14:01:34 -05:00
Matt Keeler c15c81a7ed
[Feature] API: Add a internal endpoint to query for ACL authori… (#6888)
* Implement endpoint to query whether the given token is authorized for a set of operations

* Updates to allow for remote ACL authorization via RPC

This is only used when making an authorization request to a different datacenter.
2019-12-06 09:25:26 -05:00
Matt Keeler f30af37d11
Fix the TestLeader_SecondaryCA_IntermediateRefresh test flakiness 2019-12-04 19:19:55 -05:00
Matt Keeler 90ae4a1f1e
OSS KV Modifications to Support Namespaces 2019-11-25 12:57:35 -05:00
Matt Keeler 68d79142c4
OSS Modifications necessary for sessions namespacing 2019-11-25 12:07:04 -05:00
Paul Banks a84b82b3df
connect: Add AWS PCA provider (#6795)
* Update AWS SDK to use PCA features.

* Add AWS PCA provider

* Add plumbing for config, config validation tests, add test for inheriting existing CA resources created by user

* Unparallel the tests so we don't exhaust PCA limits

* Merge updates

* More aggressive polling; rate limit pass through on sign; Timeout on Sign and CA create

* Add AWS PCA docs

* Fix Vault doc typo too

* Doc typo

* Apply suggestions from code review

Co-Authored-By: R.B. Boyer <rb@hashicorp.com>
Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>

* Doc fixes; tests for erroring if State is modified via API

* More review cleanup

* Uncomment tests!

* Minor suggested clean ups
2019-11-21 17:40:29 +00:00
Paul Banks 9e17aa3b41
Change CA Configure struct to pass Datacenter through (#6775)
* Change CA Configure struct to pass Datacenter through

* Remove connect/ca/plugin as we don't have immediate plans to use it.

We still intend to one day but there are likely to be several changes to the CA provider interface before we do so it's better to rebuild from history when we do that work properly.

* Rename PrimaryDC; fix endpoint in secondary DCs
2019-11-18 14:22:19 +00:00
Paul Banks 1197b43c7b
Support Connect CAs that can't cross sign (#6726)
* Support Connect CAs that can't cross sign

* revert spurios mod changes from make tools

* Add log warning when forcing CA rotation

* Fixup SupportsCrossSigning to report errors and work with Plugin interface (fixes tests)

* Fix failing snake_case test

* Remove misleading comment

* Revert "Remove misleading comment"

This reverts commit bc4db9cabed8ad5d0e39b30e1fe79196d248349c.

* Remove misleading comment

* Regen proto files messed up by rebase
2019-11-11 21:36:22 +00:00
Paul Banks ca96d5fa72
connect: Allow CA Providers to store small amount of state (#6751)
* pass logger through to provider

* test for proper operation of NeedsLogger

* remove public testServer function

* Ooops actually set the logger in all the places we need it - CA config set wasn't and causing segfault

* Fix all the other places in tests where we set the logger

* Allow CA Providers to persist some state

* Update CA provider plugin interface

* Fix plugin stubs to match provider changes

* Update agent/connect/ca/provider.go

Co-Authored-By: R.B. Boyer <rb@hashicorp.com>

* Cleanup review comments
2019-11-11 20:57:16 +00:00
Todd Radel 19a3892f71 connect: Implement NeedsLogger interface for CA providers (#6556)
* add NeedsLogger to Provider interface

* implements NeedsLogger in default provider

* pass logger through to provider

* test for proper operation of NeedsLogger

* remove public testServer function

* Switch test to actually assert on logging output rather than reflection.

--amend

* Ooops actually set the logger in all the places we need it - CA config set wasn't and causing segfault

* Fix all the other places in tests where we set the logger

* Add TODO comment
2019-11-11 20:30:01 +00:00
Todd Radel e100fda218 Make all Connect Cert Common Names valid FQDNs (#6423) 2019-11-11 17:11:54 +00:00
Matt Keeler 7081643191
Fill the Authz Context with a Sentinel Scope (#6729) 2019-11-01 17:05:22 -04:00
Matt Keeler c71ea7056f
Miscellaneous fixes (#6727) 2019-11-01 16:11:44 -04:00
Paul Banks 5f405c3277
Fix support for RSA CA keys in Connect. (#6638)
* Allow RSA CA certs for consul and vault providers to correctly sign EC leaf certs.

* Ensure key type ad bits are populated from CA cert and clean up tests

* Add integration test and fix error when initializing secondary CA with RSA key.

* Add more tests, fix review feedback

* Update docs with key type config and output

* Apply suggestions from code review

Co-Authored-By: R.B. Boyer <rb@hashicorp.com>
2019-11-01 13:20:26 +00:00
Matt Keeler 21f98f426e
Add hook for validating the enterprise meta attached to a reque… (#6695) 2019-10-30 12:42:39 -04:00
Matt Keeler c2d9041c0f
PreVerify acl:read access for listing endpoints (#6696)
We still will need to filter results based on the authorizer too but this helps to give an early 403.
2019-10-30 09:10:11 -04:00
Sarah Christoff 86b30bbfbe
Set MinQuorum variable in Autopilot (#6654)
* Add MinQuorum to Autopilot
2019-10-29 09:04:41 -05:00
Matt Keeler 0fc2c95255
More Replication Abstractions (#6689)
Also updated ACL replication to use a function to fill in the desired enterprise meta for all remote listing RPCs.
2019-10-28 13:49:57 -04:00
Matt Keeler 87c44a3b8d
Ensure that cache entries for tokens are prefixed “token-secret… (#6688)
This will be necessary once we store other types of identities in here.
2019-10-25 13:05:43 -04:00
Matt Keeler a688ea952d
Update the ACL Resolver to allow for Consul Enterprise specific hooks. (#6687) 2019-10-25 11:06:16 -04:00
Matt Keeler 1270a93274
Updates to allow for Namespacing ACL resources in Consul Enterp… (#6675)
Main Changes:

• method signature updates everywhere to account for passing around enterprise meta.
• populate the EnterpriseAuthorizerContext for all ACL related authorizations.
• ACL resource listings now operate like the catalog or kv listings in that the returned entries are filtered down to what the token is allowed to see. With Namespaces its no longer all or nothing.
• Modified the acl.Policy parsing to abstract away basic decoding so that enterprise can do it slightly differently. Also updated method signatures so that when parsing a policy it can take extra ent metadata to use during rules validation and policy creation.

Secondary Changes:

• Moved protobuf encoding functions out of the agentpb package to eliminate circular dependencies.
• Added custom JSON unmarshalers for a few ACL resource types (to support snake case and to get rid of mapstructure)
• AuthMethod validator cache is now an interface as these will be cached per-namespace for Consul Enterprise.
• Added checks for policy/role link existence at the RPC API so we don’t push the request through raft to have it fail internally.
• Forward ACL token delete request to the primary datacenter when the secondary DC doesn’t have the token.
• Added a bunch of ACL test helpers for inserting ACL resource test data.
2019-10-24 14:38:09 -04:00
Freddy caf658d0d3
Store check type in catalog (#6561) 2019-10-17 20:33:11 +02:00
R.B. Boyer e74a6c44f1
server: ensure the primary dc and ACL dc match (#6634)
This is mostly a sanity check for server tests that skip the normal
config builder equivalent fixup.
2019-10-17 10:57:17 -05:00
R.B. Boyer bc22eb8090
unflake TestLeader_SecondaryCA_Initialize (#6631) 2019-10-16 16:49:01 -05:00
R.B. Boyer 3ae748c7a4
fix flaky multidc acl tests that failed to wait for token replication (#6628)
If acls have not yet replicated to the secondary then authz requests
will be remotely resolved by the primary. Now these tests explicitly
wait until replication has caught up first.
2019-10-16 12:24:29 -05:00
R.B. Boyer a4c5b8e85c
appease the retry linter (#6629) 2019-10-16 11:39:22 -05:00
Paul Banks 979ad7fecb
Allow time for secondary CA to initialize (#6627) 2019-10-16 17:03:31 +01:00
Matt Keeler f9a43a1e2d
ACL Authorizer overhaul (#6620)
* ACL Authorizer overhaul

To account for upcoming features every Authorization function can now take an extra *acl.EnterpriseAuthorizerContext. These are unused in OSS and will always be nil.

Additionally the acl package has received some thorough refactoring to enable all of the extra Consul Enterprise specific authorizations including moving sentinel enforcement into the stubbed structs. The Authorizer funcs now return an acl.EnforcementDecision instead of a boolean. This improves the overall interface as it makes multiple Authorizers easily chainable as they now indicate whether they had an authoritative decision or should use some other defaults. A ChainedAuthorizer was added to handle this Authorizer enforcement chain and will never itself return a non-authoritative decision.

* Include stub for extra enterprise rules in the global management policy

* Allow for an upgrade of the global-management policy
2019-10-15 16:58:50 -04:00
R.B. Boyer 9a51ecc98b
agent: clients should only attempt to remove pruned nodes once per call (#6591) 2019-10-07 16:15:23 -05:00
Sarah Christoff 9b93dd93c9
Prune Unhealthy Agents (#6571)
* Add -prune flag to ForceLeave
2019-10-04 16:10:02 -05:00
Matt Keeler b0b57588d1
Implement Leader Routine Management (#6580)
* Implement leader routine manager

Switch over the following to use it for go routine management:

• Config entry Replication
• ACL replication - tokens, policies, roles and legacy tokens
• ACL legacy token upgrade
• ACL token reaping
• Intention Replication
• Secondary CA Roots Watching
• CA Root Pruning

Also added the StopAll call into the Server Shutdown method to ensure all leader routines get killed off when shutting down.

This should be mostly unnecessary as `revokeLeadership` should manually stop each one but just in case we really want these to go away (eventually).
2019-10-04 13:08:45 -04:00
Matt Keeler 9bd378a95c
Add EnterpriseConfig stubs (#6566) 2019-10-01 14:34:55 -04:00
R.B. Boyer 8433ef02a8
connect: connect CA Roots in secondary datacenters should use a SigningKeyID derived from their local intermediate (#6513)
This fixes an issue where leaf certificates issued in secondary
datacenters would be reissued very frequently (every ~20 seconds)
because the logic meant to detect root rotation was errantly triggering
because a hash of the ultimate root (in the primary) was being compared
against a hash of the local intermediate root (in the secondary) and
always failing.
2019-09-26 11:54:14 -05:00
Matt Keeler 5b83f589da
Expand the QueryOptions and QueryMeta interfaces (#6545)
In a previous PR I made it so that we had interfaces that would work enough to allow blockingQueries to work. However to complete this we need all fields to be settable and gettable.

Notes:
   • If Go ever gets contracts/generics then we could get rid of all the Getters/Setters
   • protoc / protoc-gen-gogo are going to generate all the getters for us.
   • I copied all the getters/setters from the protobuf funcs into agent/structs/protobuf_compat.go
   • Also added JSON marshaling funcs that use jsonpb for protobuf types.
2019-09-26 09:55:02 -04:00
Freddy 5eace88ce2
Expose HTTP-based paths through Connect proxy (#6446)
Fixes: #5396

This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.

Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.

This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.

Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.

In this initial implementation requests to these paths are not
authenticated/encrypted.
2019-09-25 20:55:52 -06:00
Matt Keeler 8885c8d318
Allow for enterprise only leader routines (#6533)
Eventually I am thinking we may need a way to register these at different priority levels but for now sticking this here is fine
2019-09-23 20:09:56 -04:00
R.B. Boyer cc889443a5
connect: don't colon-hex-encode the AuthorityKeyId and SubjectKeyId fields in connect certs (#6492)
The fields in the certs are meant to hold the original binary
representation of this data, not some ascii-encoded version.

The only time we should be colon-hex-encoding fields is for display
purposes or marshaling through non-TLS mediums (like RPC).
2019-09-23 12:52:35 -05:00
Matt Keeler 8431c5f533
Add support for implementing new requests with protobufs instea… (#6502)
* Add build system support for protobuf generation

This is done generically so that we don’t have to keep updating the makefile to add another proto generation.

Note: anything not in the vendor directory and with a .proto extension will be run through protoc if the corresponding namespace.pb.go file is not up to date.

If you want to rebuild just a single proto file you can do so with: make proto-rebuild PROTOFILES=<list of proto files to rebuild>

Providing the PROTOFILES var will override the default behavior of finding all the .proto files.

* Start adding types to the agent/proto package

These will be needed for some other work and are by no means comprehensive.

* Add ability to resolve/fixup the agentpb.ACLLinks structure in the state store.

* Use protobuf marshalling of raft requests instead of msgpack for protoc generated types.

This does not change any encoding of existing types.

* Removed structs package automatically encoding with protobuf marshalling

Instead the caller of raftApply that wants to opt-in to protobuf encoding will have to call `raftApplyProtobuf`

* Run update-vendor to fixup modules.txt

Nothing changed as far as dependencies go but the ordering of modules in that file depends on the time they are first seen and its not alphabetical.

* Rename some things and implement the structs.RPCInfo interface bits

agentpb.QueryOptions and agentpb.WriteRequest implement 3 of the 4 RPCInfo funcs and the new TargetDatacenter message type implements the fourth.

* Use the right encoding function.

* Renamed agent/proto package to agent/agentpb to prevent package name conflicts

* Update modules.txt to fix ordering

* Change blockingQuery to take in interfaces for the query options and meta

* Add %T to error output.

* Add/Update some comments
2019-09-20 14:37:22 -04:00
R.B. Boyer 5c5f21088c sdk: add freelist tracking and ephemeral port range skipping to freeport
This should cut down on test flakiness.

Problems handled:

- If you had enough parallel test cases running, the former circular
approach to handling the port block could hand out the same port to
multiple cases before they each had a chance to bind them, leading to
one of the two tests to fail.

- The freeport library would allocate out of the ephemeral port range.
This has been corrected for Linux (which should cover CI).

- The library now waits until a formerly-in-use port is verified to be
free before putting it back into circulation.
2019-09-17 14:30:43 -05:00
R.B. Boyer edf5347d3c fix typo of 'unknown' in log messages 2019-09-13 15:59:49 -05:00
Hans Hasselberg f025a7440d
agent: handleEnterpriseLeave (#6453) 2019-09-11 11:01:37 +02:00
Pierre Souchay 6d13efa828 Distinguish between DC not existing and not being available (#6399) 2019-09-03 09:46:24 -06:00
Matt Keeler 31d9d2e557
Store primaries root in secondary after intermediate signature (#6333)
* Store primaries root in secondary after intermediate signature

This ensures that the intermediate exists within the CA root stored in raft and not just in the CA provider state. This has the very nice benefit of actually outputting the intermediate cert within the ca roots HTTP/RPC endpoints.

This change means that if signing the intermediate fails it will not set the root within raft. So far I have not come up with a reason why that is bad. The secondary CA roots watch will pull the root again and go through all the motions. So as soon as getting an intermediate CA works the root will get set.

* Make TestAgentAntiEntropy_Check_DeferSync less flaky

I am not sure this is the full fix but it seems to help for me.
2019-08-30 11:38:46 -04:00
Pierre Souchay 35d90fc899 Display IPs of machines when node names conflict to ease troubleshooting
When there is an node name conflicts, such messages are displayed within Consul:

`consul.fsm: EnsureRegistration failed: failed inserting node: Error while renaming Node ID: "e1d456bc-f72d-98e5-ebb3-26ae80d785cf": Node name node001 is reserved by node 05f10209-1b9c-b90c-e3e2-059e64556d4a with name node001`

While it is easy to find the node that has reserved the name, it is hard to find
the node trying to aquire the name since it is not registered, because it
is not part of `consul members` output

This PR will display the IP of the offender and solve far more easily those issues.
2019-08-28 15:57:05 -04:00
Alvin Huang e4e9381851
revert commits on master (#6413) 2019-08-27 17:45:58 -04:00
tradel 2838a1550a update tests to match new method signatures 2019-08-27 14:16:39 -07:00
tradel 93c839b76c confi\gure providers with DC and domain 2019-08-27 14:16:25 -07:00
tradel 1acde6e30a create a common name for autoTLS agent certs 2019-08-27 14:15:53 -07:00
Alvin Huang 9662b7c01a
add nil pointer check for pointer to ACLToken struct (#6407) 2019-08-27 11:23:28 -04:00
Hans Hasselberg 4f7a3e8fa8 make sure auto_encrypt has private key type and bits 2019-08-26 13:09:50 +02:00
R.B. Boyer 2d4a3b51d0
Merge pull request #6388 from hashicorp/release/1-6
merging release/1-6 into master
2019-08-23 13:44:46 -05:00
Matt Keeler 89ac998e8b
Secondary CA `establishLeadership` fix (#6383)
This prevents ACL issues (or other issues) during intermediate CA cert signing from failing leader establishment.
2019-08-23 11:32:37 -04:00
Hans Hasselberg aada537d87
auto_encrypt: use server-port (#6287)
AutoEncrypt needs the server-port because it wants to talk via RPC. Information from gossip might not be available at that point and thats why the server-port is being used.
2019-08-23 10:18:46 +02:00
Matt Keeler 8cb0560f52
Ensure that config entry writes are forwarded to the primary DC (#6339) 2019-08-20 12:01:13 -04:00
R.B. Boyer 0675e0606e
connect: generate the full SNI names for discovery targets in the compiler rather than in the xds package (#6340) 2019-08-19 13:03:03 -05:00
R.B. Boyer d6456fddeb
connect: introduce ExternalSNI field on service-defaults (#6324)
Compiling this will set an optional SNI field on each DiscoveryTarget.
When set this value should be used for TLS connections to the instances
of the target. If not set the default should be used.

Setting ExternalSNI will disable mesh gateway use for that target. It also 
disables several service-resolver features that do not make sense for an 
external service.
2019-08-19 12:19:44 -05:00
R.B. Boyer f84f509ce4
connect: updating a service-defaults config entry should leave an unset protocol alone (#6342)
If the entry is updated for reasons other than protocol it is surprising
that the value is explicitly persisted as 'tcp' rather than leaving it
empty and letting it fall back dynamically on the proxy-defaults value.
2019-08-19 10:44:06 -05:00
Matt Keeler 73888eed36
Filter out left/leaving serf members when determining if new AC… (#6332) 2019-08-16 10:34:18 -04:00
R.B. Boyer 22ee60d1ba
agent: blocking central config RPCs iterations should not interfere with each other (#6316) 2019-08-14 09:08:46 -05:00
hashicorp-ci 29767157ed Merge Consul OSS branch 'master' at commit 8f7586b339dbb518eff3a2eec27d7b8eae7a3fbb 2019-08-13 02:00:43 +00:00
Sarah Adams 2f7a90bc52
add flag to allow /operator/keyring requests to only hit local servers (#6279)
Add parameter local-only to operator keyring list requests to force queries to only hit local servers (no WAN traffic).

HTTP API: GET /operator/keyring?local-only=true
CLI: consul keyring -list --local-only

Sending the local-only flag with any non-GET/list request will result in an error.
2019-08-12 11:11:11 -07:00
Mike Morris 88df658243
connect: remove managed proxies (#6220)
* connect: remove managed proxies implementation and all supporting config options and structs

* connect: remove deprecated ProxyDestination

* command: remove CONNECT_PROXY_TOKEN env var

* agent: remove entire proxyprocess proxy manager

* test: remove all managed proxy tests

* test: remove irrelevant managed proxy note from TestService_ServerTLSConfig

* test: update ContentHash to reflect managed proxy removal

* test: remove deprecated ProxyDestination test

* telemetry: remove managed proxy note

* http: remove /v1/agent/connect/proxy endpoint

* ci: remove deprecated test exclusion

* website: update managed proxies deprecation page to note removal

* website: remove managed proxy configuration API docs

* website: remove managed proxy note from built-in proxy config

* website: add note on removing proxy subdirectory of data_dir
2019-08-09 15:19:30 -04:00
R.B. Boyer 357ca39868
connect: ensure intention replication continues to work when the replication ACL token changes (#6288) 2019-08-07 11:34:09 -05:00
hashicorp-ci 3ac803da5e Merge Consul OSS branch 'master' at commit d84863799deca45ccf4bec5ab9f645ccae6b3aeb 2019-08-06 02:00:30 +00:00
Sarah Adams 9ed3e64510
fallback to proxy config global protocol when upstream services' protocol is unset (#6277)
fallback to proxy config global protocol when upstream services' protocol is unset

Fixes #5857
2019-08-05 12:52:35 -07:00
R.B. Boyer 64fc002e03
connect: fix failover through a mesh gateway to a remote datacenter (#6259)
Failover is pushed entirely down to the data plane by creating envoy
clusters and putting each successive destination in a different load
assignment priority band. For example this shows that normally requests
go to 1.2.3.4:8080 but when that fails they go to 6.7.8.9:8080:

- name: foo
  load_assignment:
    cluster_name: foo
    policy:
      overprovisioning_factor: 100000
    endpoints:
    - priority: 0
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 1.2.3.4
              port_value: 8080
    - priority: 1
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 6.7.8.9
              port_value: 8080

Mesh gateways route requests based solely on the SNI header tacked onto
the TLS layer. Envoy currently only lets you configure the outbound SNI
header at the cluster layer.

If you try to failover through a mesh gateway you ideally would
configure the SNI value per endpoint, but that's not possible in envoy
today.

This PR introduces a simpler way around the problem for now:

1. We identify any target of failover that will use mesh gateway mode local or
   remote and then further isolate any resolver node in the compiled discovery
   chain that has a failover destination set to one of those targets.

2. For each of these resolvers we will perform a small measurement of
   comparative healths of the endpoints that come back from the health API for the
   set of primary target and serial failover targets. We walk the list of targets
   in order and if any endpoint is healthy we return that target, otherwise we
   move on to the next target.

3. The CDS and EDS endpoints both perform the measurements in (2) for the
   affected resolver nodes.

4. For CDS this measurement selects which TLS SNI field to use for the cluster
   (note the cluster is always going to be named for the primary target)

5. For EDS this measurement selects which set of endpoints will populate the
   cluster. Priority tiered failover is ignored.

One of the big downsides to this approach to failover is that the failover
detection and correction is going to be controlled by consul rather than
deferring that entirely to the data plane as with the prior version. This also
means that we are bound to only failover using official health signals and
cannot make use of data plane signals like outlier detection to affect
failover.

In this specific scenario the lack of data plane signals is ok because the
effectiveness is already muted by the fact that the ultimate destination
endpoints will have their data plane signals scrambled when they pass through
the mesh gateway wrapper anyway so we're not losing much.

Another related fix is that we now use the endpoint health from the
underlying service, not the health of the gateway (regardless of
failover mode).
2019-08-05 13:30:35 -05:00
R.B. Boyer 0165e93517
connect: expose an API endpoint to compile the discovery chain (#6248)
In addition to exposing compilation over the API cleaned up the structures that would be exchanged to be cleaner and easier to support and understand.

Also removed ability to configure the envoy OverprovisioningFactor.
2019-08-02 15:34:54 -05:00
Todd Radel 295abd82c3
connect: generate intermediate at same time as root (#6272)
Generate intermediate at same time as root
Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
2019-08-02 15:36:03 -04:00
R.B. Boyer 4e2fb5730c
connect: detect and prevent circular discovery chain references (#6246) 2019-08-02 09:18:45 -05:00
R.B. Boyer 6c9edb17c2
server: if inserting bootstrap config entries fails don't silence the errors (#6256) 2019-08-01 23:07:11 -05:00
R.B. Boyer 782c647bf4
connect: simplify the compiled discovery chain data structures (#6242)
This should make them better for sending over RPC or the API.

Instead of a chain implemented explicitly like a linked list (nodes
holding pointers to other nodes) instead switch to a flat map of named
nodes with nodes linking other other nodes by name. The shipped
structure is just a map and a string to indicate which key to start
from.

Other changes:

* inline the compiler option InferDefaults as true

* introduce compiled target config to avoid needing to send back
  additional maps of Resolvers; future target-specific compiled state
  can go here

* move compiled MeshGateway out of the Resolver and into the
  TargetConfig where it makes more sense.
2019-08-01 22:44:05 -05:00
R.B. Boyer 4666599e18
connect: reconcile how upstream configuration works with discovery chains (#6225)
* connect: reconcile how upstream configuration works with discovery chains

The following upstream config fields for connect sidecars sanely
integrate into discovery chain resolution:

- Destination Namespace/Datacenter: Compilation occurs locally but using
different default values for namespaces and datacenters. The xDS
clusters that are created are named as they normally would be.

- Mesh Gateway Mode (single upstream): If set this value overrides any
value computed for any resolver for the entire discovery chain. The xDS
clusters that are created may be named differently (see below).

- Mesh Gateway Mode (whole sidecar): If set this value overrides any
value computed for any resolver for the entire discovery chain. If this
is specifically overridden for a single upstream this value is ignored
in that case. The xDS clusters that are created may be named differently
(see below).

- Protocol (in opaque config): If set this value overrides the value
computed when evaluating the entire discovery chain. If the normal chain
would be TCP or if this override is set to TCP then the result is that
we explicitly disable L7 Routing and Splitting. The xDS clusters that
are created may be named differently (see below).

- Connect Timeout (in opaque config): If set this value overrides the
value for any resolver in the entire discovery chain. The xDS clusters
that are created may be named differently (see below).

If any of the above overrides affect the actual result of compiling the
discovery chain (i.e. "tcp" becomes "grpc" instead of being a no-op
override to "tcp") then the relevant parameters are hashed and provided
to the xDS layer as a prefix for use in naming the Clusters. This is to
ensure that if one Upstream discovery chain has no overrides and
tangentially needs a cluster named "api.default.XXX", and another
Upstream does have overrides for "api.default.XXX" that they won't
cross-pollinate against the operator's wishes.

Fixes #6159
2019-08-01 22:03:34 -05:00
Paul Banks a5c70d79d0 Revert "connect: support AWS PCA as a CA provider" (#6251)
This reverts commit 3497b7c00d49c4acbbf951d84f2bba93f3da7510.
2019-07-31 09:08:10 -04:00
Todd Radel d3b7fd83fe
connect: support AWS PCA as a CA provider (#6189)
Port AWS PCA provider from consul-ent
2019-07-30 22:57:51 -04:00
Todd Radel 1b14d6595e
connect: Support RSA keys in addition to ECDSA (#6055)
Support RSA keys in addition to ECDSA
2019-07-30 17:47:39 -04:00
Matt Keeler a7c4b7af7c
Fix CA Replication when ACLs are enabled (#6201)
Secondary CA initialization steps are:

• Wait until the primary will be capable of signing intermediate certs. We use serf metadata to check the versions of servers in the primary which avoids needing a token like the previous implementation that used RPCs. We require at least one alive server in the primary and the all alive servers meet the version requirement.
• Initialize the secondary CA by getting the primary to sign an intermediate

When a primary dc is configured, if no existing CA is initialized and for whatever reason we cannot initialize a secondary CA the secondary DC will remain without a CA. As soon as it can it will initialize the secondary CA by pulling the primaries roots and getting the primary to sign an intermediate.

This also fixes a segfault that can happen during leadership revocation. There was a spot in the secondaryCARootsWatch that was getting the CA Provider and executing methods on it without nil checking. Under normal circumstances it wont be nil but during leadership revocation it gets nil'ed out. Therefore there is a period of time between closing the stop chan and when the go routine is actually stopped where it could read a nil provider and cause a segfault.
2019-07-26 15:57:57 -04:00
R.B. Boyer 1b95d2e5e3 Merge Consul OSS branch master at commit b3541c4f34d43ab92fe52256420759f17ea0ed73 2019-07-26 10:34:24 -05:00
Matt Keeler c4a34602b6
Allow forwarding of some status RPCs (#6198)
* Allow forwarding of some status RPCs

* Update docs

* add comments about not using the regular forward
2019-07-25 14:26:22 -04:00
Jeff Mitchell e266b038cc Make the chunking test multidimensional (#6212)
This ensures that it's not just a single operation we restores
successfully, but many. It's the same foundation, just with multiple
going on at once.
2019-07-25 11:40:09 +01:00
Freddy 7dbbe7e55a
auto-encrypt: Fix port resolution and fallback to default port (#6205)
Auto-encrypt meant to fallback to the default port when it wasn't provided, but it hadn't been because of an issue with the error handling. We were checking against an incomplete error value:
"missing port in address" vs "address $HOST: missing port in address"

Additionally, all RPCs to AutoEncrypt.Sign were using a.config.ServerPort, so those were updated to use ports resolved by resolveAddrs, if they are available.
2019-07-24 16:49:37 -07:00
Jeff Mitchell e0068431f5 Chunking support (#6172)
* Initial chunk support

This uses the go-raft-middleware library to allow for chunked commits to the KV
2019-07-24 17:06:39 -04:00
Freddy 1b97d65873
Make new config when retrying testServer creation (#6204) 2019-07-24 08:41:00 -06:00
Alvin Huang 5b6fa58453 resolve circleci config conflicts 2019-07-23 20:18:36 -04:00
Freddy c19f46639b
Restore NotifyListen to avoid panic in newServer retry (#6200) 2019-07-23 14:33:00 -06:00
Christian Muehlhaeuser 2602f6907e Simplified code in various places (#6176)
All these changes should have no side-effects or change behavior:

- Use bytes.Buffer's String() instead of a conversion
- Use time.Since and time.Until where fitting
- Drop unnecessary returns and assignment
2019-07-20 09:37:19 -04:00
hashicorp-ci 8b109e5f9f Merge Consul OSS branch 'master' at commit ef257b084d2e2a474889518440515e360d0cd990 2019-07-20 02:00:29 +00:00
Christian Muehlhaeuser 26f9368567 Fixed typos in comments (#6175)
Just a few nitpicky typo fixes.
2019-07-19 07:54:53 -04:00
Christian Muehlhaeuser 877bfd280b Fixed a few tautological condition mistakes (#6177)
None of these changes should have any side-effects. They're merely
fixing tautological mistakes.
2019-07-19 07:53:42 -04:00
Christian Muehlhaeuser d1426767f6 Fixed nil check for token (#6179)
I can only assume we want to check for the retrieved `updatedToken` to not be
nil, before accessing it below.

`token` can't possibly be nil at this point, as we accessed `token.AccessorID`
just before.
2019-07-19 07:48:11 -04:00
Alvin Huang 17654c6292 Merge branch 'master' into release/1-6 2019-07-17 15:43:30 -04:00
Freddy f59e6db9b1
Reduce number of servers in TestServer_Expect_NonVoters (#6155) 2019-07-17 11:35:33 -06:00
Freddy 476a4b95a5
More flaky test fixes (#6151)
* Add retry to TestAPI_ClientTxn

* Add retry to TestLeader_RegisterMember

* Account for empty watch result in ConnectRootsWatch
2019-07-17 09:33:38 -06:00
hashicorp-ci 022483aff0 Merge Consul OSS branch 'master' at commit 95dbb7f2f1b9fc3528a16335201e2324f1b388bd 2019-07-17 02:00:21 +00:00
Freddy 99601aa3a7
Update retries that weren't using retry.R (#6146) 2019-07-16 14:47:45 -06:00
R.B. Boyer 1cc6d07d0f
add test for discovery chain agent cache-type (#6130) 2019-07-15 10:09:52 -05:00
Jack Pearkes fa15914813 Merge branch 'master' into release/1-6 2019-07-12 14:51:25 -07:00
Matt Keeler 3914ec5c62
Various Gateway Fixes (#6093)
* Ensure the mesh gateway configuration comes back in the api within each upstream

* Add a test for the MeshGatewayConfig in the ToAPI functions

* Ensure we don’t use gateways for dc local connections

* Update the svc kind index for deletions

* Replace the proxycfg.state cache with an interface for testing

Also start implementing proxycfg state testing.

* Update the state tests to verify some gateway watches for upstream-targets of a discovery chain.
2019-07-12 17:19:37 -04:00
Sarah Adams 4afa034d6a
fix flaky test TestACLEndpoint_SecureIntroEndpoints_OnlyCreateLocalData (#6116)
* fix test to write only to dc2 (typo)
* fix retry behavior in existing test (was being used incorrectly)
2019-07-12 14:14:42 -07:00
R.B. Boyer 72a8195839
implement some missing service-router features and add more xDS testing (#6065)
- also implement OnlyPassing filters for non-gateway clusters
2019-07-12 14:16:21 -05:00
R.B. Boyer 9e1e9aad2e
Fix bug in service-resolver redirects if the destination uses a default resolver. (#6122)
Also:
- add back an internal http endpoint to dump a compiled discovery chain for debugging purposes

Before the CompiledDiscoveryChain.IsDefault() method would test:

- is this chain just one resolver step?
- is that resolver step just the default?

But what I forgot to test:

- is that resolver step for the same service that the chain represents?

This last point is important because if you configured just one config
entry:

    kind = "service-resolver"
    name = "web"
    redirect {
      service = "other"
    }

and requested the chain for "web" you'd get back a **default** resolver
for "other".  In the xDS code the IsDefault() method is used to
determine if this chain is "empty". If it is then we use the
pre-discovery-chain logic that just uses data embedded in the Upstream
object (and still lets the escape hatches function).

In the example above that means certain parts of the xDS code were going
to try referencing a cluster named "web..." despite the other parts of
the xDS code maintaining clusters named "other...".
2019-07-12 12:21:25 -05:00
Freddy a295d9e5db
Flaky test overhaul (#6100) 2019-07-12 09:52:26 -06:00
Freddy b6b6dbadb0
Remove dummy config (#6121) 2019-07-12 09:50:14 -06:00
Freddy 74b7bcb612
Update TestServer creation in sdk/testutil (#6084)
* Retry the creation of the test server three times.
* Reduce the retry timeout for the API wait to 2 seconds, opting to fail faster and start over.
* Remove wait for leader from server creation. This wait can be added on a test by test basis now that the function is being exported.
* Remove wait for anti-entropy sync. This is built into the existing WaitForSerfCheck func, so that can be used if the anti-entropy wait is needed
2019-07-12 09:37:29 -06:00
Freddy f5634a24e8
Clean up StatsFetcher work when context is exceeded (#6086) 2019-07-12 08:23:28 -06:00
Matt Keeler 6cc936d64b
Move ctx and cancel func setup into the Replicator.Start (#6115)
Previously a sequence of events like:

Start
Stop
Start
Stop

would segfault on the second stop because the original ctx and cancel func were only initialized during the constructor and not during Start.
2019-07-12 10:10:48 -04:00
Jack Pearkes 2b1761bab3 Make cluster names SNI always (#6081)
* Make cluster names SNI always

* Update some tests

* Ensure we check for prepared query types

* Use sni for route cluster names

* Proper mesh gateway mode defaulting when the discovery chain is used

* Ignore service splits from PatchSliceOfMaps

* Update some xds golden files for proper test output

* Allow for grpc/http listeners/cluster configs with the disco chain

* Update stats expectation
2019-07-08 12:48:48 +01:00
Matt Keeler 35a839952b Fix Internal.ServiceDump blocking (#6076)
maxIndexWatchTxn was only watching the IndexEntry of the max index of all the entries. It needed to watch all of them regardless of which was the max.

Also plumbed the query source through in the proxy config to help better track requests.
2019-07-04 16:17:49 +01:00
R.B. Boyer a1900754db
digest the proxy-defaults protocol into the graph (#6050) 2019-07-02 11:01:17 -05:00
R.B. Boyer bccbb2b4ae
activate most discovery chain features in xDS for envoy (#6024) 2019-07-01 22:10:51 -05:00
Matt Keeler 39bb0e3e77 Implement Mesh Gateways
This includes both ingress and egress functionality.
2019-07-01 16:28:30 -04:00
Matt Keeler 03ccc7c5ae Fix secondary dc connect CA roots watch issue
The general problem was that a the CA config which contained the trust domain was happening outside of the blocking mechanism so if the client started the blocking query before the primary dcs roots had been set then a state trust domain was being pushed down.

This was fixed here but in the future we should probably fixup the CA initialization code to not initialize the CA config twice when it doesn’t need to.
2019-07-01 16:28:30 -04:00
Matt Keeler 44dea31d1f Include a content hash of the intention for use during replication 2019-07-01 16:28:30 -04:00
Matt Keeler 0fc4da6861 Implement intention replication and secondary CA initialization 2019-07-01 16:28:30 -04:00
Matt Keeler 24749bc7e5 Implement Kind based ServiceDump and caching of the ServiceDump RPC 2019-07-01 16:28:30 -04:00
R.B. Boyer 686e4606c6
do some initial config entry graph validation during writes (#6047) 2019-07-01 15:23:36 -05:00
hashicorp-ci e36792395e Merge Consul OSS branch 'master' at commit e91f73f59249f5756896b10890e9298e7c1fbacc 2019-06-30 02:00:31 +00:00
Sarah Christoff 8a930f7d3a
Remove failed nodes from serfWAN (#6028)
* Prune Servers from WAN and LAN

* cleaned up and fixed LAN to WAN

* moving things around

* force-leave remove from serfWAN, create pruneSerfWAN

* removed serfWAN remove, reduced complexity, fixed comments

* add another place to remove from serfWAN

* add nil check

* Update agent/consul/server.go

Co-Authored-By: Paul Banks <banks@banksco.de>
2019-06-28 12:40:07 -05:00
Hans Hasselberg 73c4e9f07c
tls: auto_encrypt enables automatic RPC cert provisioning for consul clients (#5597) 2019-06-27 22:22:07 +02:00
R.B. Boyer 3eb1f00371
initial version of L7 config entry compiler (#5994)
With this you should be able to fetch all of the relevant discovery
chain config entries from the state store in one query and then feed
them into the compiler outside of a transaction.

There are a lot of TODOs scattered through here, but they're mostly
around handling fun edge cases and can be deferred until more of the
plumbing works completely.
2019-06-27 13:38:21 -05:00
R.B. Boyer 8850656580
adding new config entries for L7 discovery chain (unused) (#5987) 2019-06-27 12:37:43 -05:00
Todd Radel 8ece11a24a connect: store signingKeyId instead of authorityKeyId (#6005) 2019-06-27 16:47:22 +02:00
Aestek 04a52a967b acl: allow service deregistration with node write permission (#5217)
With ACLs enabled if an agent is wiped and restarted without a leave
it can no longer deregister the services it had previously registered
because it no longer has the tokens the services were registered with.
To remedy that we allow service deregistration from tokens with node
write permission.
2019-06-27 14:24:34 +02:00
hashicorp-ci 3224bea082 Merge Consul OSS branch 'master' at commit 4eb73973b6e53336fd505dc727ac84c1f7e78872 2019-06-27 02:00:41 +00:00
Pierre Souchay ca7c7faac8 agent: added metadata information about servers into consul service description (#5455)
This allows have information about servers from HTTP APIs without
using the command line.
2019-06-26 23:46:47 +02:00
Pierre Souchay e394a9469b Support for maximum size for Output of checks (#5233)
* Support for maximum size for Output of checks

This PR allows users to limit the size of output produced by checks at the agent 
and check level.

When set at the agent level, it will limit the output for all checks monitored
by the agent.

When set at the check level, it can override the agent max for a specific check but
only if it is lower than the agent max.

Default value is 4k, and input must be at least 1.
2019-06-26 09:43:25 -06:00
hashicorp-ci d237e86d83 Merge Consul OSS branch 'master' at commit 88b15d84f9fdb58ceed3dc971eb0390be85e3c15
skip-checks: true
2019-06-25 02:00:26 +00:00
Matt Keeler f0f28707bc
New Cache Types (#5995)
* Add a cache type for the Catalog.ListServices endpoint

* Add a cache type for the Catalog.ListDatacenters endpoint
2019-06-24 14:11:34 -04:00
Matt Keeler 93debd2610
Ensure that looking for services by addreses works with Tagged Addresses (#5984) 2019-06-21 13:16:17 -04:00
Hans Hasselberg 0d8d7ae052
agent: transfer leadership when establishLeadership fails (#5247) 2019-06-19 14:50:48 +02:00
Aestek 24c29e195b kv: do not trigger watches when setting the same value (#5885)
If a KVSet is performed but does not update the entry, do not trigger
watches for this key.
This avoids releasing blocking queries for KV values that did not
actually changed.
2019-06-18 15:06:29 +02:00
Matt Keeler 4c03f99a85
Fix CAS operations on Services (#5971)
* Fix CAS operations on services

* Update agent/consul/state/catalog_test.go

Co-Authored-By: R.B. Boyer <public@richardboyer.net>
2019-06-17 10:41:04 -04:00
Paul Banks e90fab0aec
Add rate limiting to RPCs sent within a server instance too (#5927) 2019-06-13 04:26:27 -05:00
Freddy 8f5fe058ea
Increase reliability of TestResetSessionTimerLocked_Renew 2019-05-24 13:54:51 -04:00
Freddy f7f0207f78
Run TestServer_Expect on its own (#5890) 2019-05-23 19:52:33 -04:00
Freddy e9bdb3a4f9
Flaky test: ACLReplication_Tokens (#5891)
* Exclude non-go workflows while testing

* Wait for s2 global-management policy

* Revert "Exclude non-go workflows while testing"

This reverts commit 47a83cbe9f19d0e1e475eabaa223d61fb4c56019.
2019-05-23 19:52:02 -04:00
Freddy c9e6640337
Add retries to StatsFetcherTest (#5892) 2019-05-23 19:51:31 -04:00
freddygv d133d565a5 Wait for s2 global-management policy 2019-05-21 17:58:37 -06:00
Freddy 7ce28bbfee
Stop running TestLeader_ChangeServerID in parallel 2019-05-21 15:28:08 -06:00
Kyle Havlovitz ad24456f49
Set the dead node reclaim timer at 30s 2019-05-15 11:59:33 -07:00
Kyle Havlovitz dcbffdb956
Merge branch 'master' into change-node-id 2019-05-15 10:51:04 -07:00
Matt Keeler 46956ed769
Copy the proxy config instead of direct assignment (#5786)
This prevents modifying the data in the state store which is supposed to be immutable.
2019-05-06 12:09:59 -04:00
R.B. Boyer 372bb06c83
acl: a role binding rule for a role that does not exist should be ignored (#5778)
I wrote the docs under this assumption but completely forgot to actually
enforce it.
2019-05-03 14:22:44 -05:00
R.B. Boyer 7d0f729f77
acl: enforce that you cannot persist tokens and roles with missing links except during replication (#5779) 2019-05-02 15:02:21 -05:00
Matt Keeler 26708570c5
Fix ConfigEntryResponse binary marshaller and ensure we watch the chan in ConfigEntry.Get even when no entry exists. (#5773) 2019-05-02 15:25:29 -04:00
Paul Banks df0c61fd31
Fix previous accidental master push 🤦 (#5771)
* Fix previous accidental master push 🤦

* Fix ACL test
2019-05-02 15:49:37 +01:00
Paul Banks 95bb1e368f
Fix panic in Resolving service config when proxy-defaults isn't defined yet (#5769) 2019-05-02 14:12:21 +01:00
Paul Banks cf24e7d1ed
Fix uint8 conversion issues for service config response maps. 2019-05-02 14:11:33 +01:00
Paul Banks 078f4cf5bb Add integration test for central config; fix central config WIP (#5752)
* Add integration test for central config; fix central config WIP

* Add integration test for central config; fix central config WIP

* Set proxy protocol correctly and begin adding upstream support

* Add upstreams to service config cache key and start new notify watcher if they change.

This doesn't update the tests to pass though.

* Fix some merging logic get things working manually with a hack (TODO fix properly)

* Simplification to not allow enabling sidecars centrally - it makes no sense without upstreams anyway

* Test compile again and obvious ones pass. Lots of failures locally not debugged yet but may be flakes. Pushing up to see what CI does

* Fix up service manageer and API test failures

* Remove the enable command since it no longer makes much sense without being able to turn on sidecar proxies centrally

* Remove version.go hack - will make integration test fail until release

* Remove unused code from commands and upstream merge

* Re-bump version to 1.5.0
2019-05-01 16:39:31 -07:00
Matt Keeler 9c77f2c52a
Update to use a consulent build tag instead of just ent (#5759) 2019-05-01 11:11:27 -04:00
Matt Keeler ea6cbf01a5 Centralized Config CLI (#5731)
* Add HTTP endpoints for config entry management

* Finish implementing decoding in the HTTP Config entry apply endpoint

* Add CAS operation to the config entry apply endpoint

Also use this for the bootstrapping and move the config entry decoding function into the structs package.

* First pass at the API client for the config entries

* Fixup some of the ConfigEntry APIs

Return a singular response object instead of a list for the ConfigEntry.Get RPC. This gets plumbed through the HTTP API as well.

Dont return QueryMeta in the JSON response for the config entry listing HTTP API. Instead just return a list of config entries.

* Minor API client fixes

* Attempt at some ConfigEntry api client tests

These don’t currently work due to weak typing in JSON

* Get some of the api client tests passing

* Implement reflectwalk magic to correct JSON encoding a ProxyConfigEntry

Also added a test for the HTTP endpoint that exposes the problem. However, since the test doesn’t actually do the JSON encode/decode its still failing.

* Move MapWalk magic into a binary marshaller instead of JSON.

* Add a MapWalk test

* Get rid of unused func

* Get rid of unused imports

* Fixup some tests now that the decoding from msgpack coerces things into json compat types

* Stub out most of the central config cli

Fully implement the config read command.

* Basic config delete command implementation

* Implement config write command

* Implement config list subcommand

Not entirely sure about the output here. Its basically the read output indented with a line specifying the kind/name of each type which is also duplicated in the indented output.

* Update command usage

* Update some help usage formatting

* Add the connect enable helper cli command

* Update list command output

* Rename the config entry API client methods.

* Use renamed apis

* Implement config write tests

Stub the others with the noTabs tests.

* Change list output format

Now just simply output 1 line per named config

* Add config read tests

* Add invalid args write test.

* Add config delete tests

* Add config list tests

* Add connect enable tests

* Update some CLI commands to use CAS ops

This also modifies the HTTP API for a write op to return a boolean indicating whether the value was written or not.

* Fix up the HTTP API CAS tests as I realized they weren’t testing what they should.

* Update config entry rpc tests to properly test CAS

* Fix up a few more tests

* Fix some tests that using ConfigEntries.Apply

* Update config_write_test.go

* Get rid of unused import
2019-04-30 16:27:16 -07:00
Matt Keeler 697efb588c
Make a few config entry endpoints return 404s and allow for snake_case and lowercase key names. (#5748) 2019-04-30 18:19:19 -04:00
Matt Keeler 8beb5c6082
ACL Token ID Initialization (#5307) 2019-04-30 11:45:36 -04:00
Kyle Havlovitz 64174f13d6 Add HTTP endpoints for config entry management (#5718) 2019-04-29 18:08:09 -04:00
R.B. Boyer ea2740fd32
Merge pull request #5617 from hashicorp/f-acl-ux
Secure ACL Introduction for Kubernetes
2019-04-26 15:34:26 -05:00
Aestek 9813abcb09 Fix: fail messages after a node rename replace the new node definition (#5520)
When receiving a serf faild message for a node which is not in the
catalog, do not perform a register request to set is serf heath to
critical as it could overwrite the node information and services if it
was renamed.

Fixes : #5518
2019-04-26 21:33:41 +01:00
R.B. Boyer 5a505c5b3a acl: adding support for kubernetes auth provider login (#5600)
* auth providers
* binding rules
* auth provider for kubernetes
* login/logout
2019-04-26 14:49:25 -05:00
R.B. Boyer 9542fdc9bc acl: adding Roles to Tokens (#5514)
Roles are named and can express the same bundle of permissions that can
currently be assigned to a Token (lists of Policies and Service
Identities). The difference with a Role is that it not itself a bearer
token, but just another entity that can be tied to a Token.

This lets an operator potentially curate a set of smaller reusable
Policies and compose them together into reusable Roles, rather than
always exploding that same list of Policies on any Token that needs
similar permissions.

This also refactors the acl replication code to be semi-generic to avoid
3x copypasta.
2019-04-26 14:49:12 -05:00
R.B. Boyer f43bc981e9 making ACLToken.ExpirationTime a *time.Time value instead of time.Time (#5663)
This is mainly to avoid having the API return "0001-01-01T00:00:00Z" as
a value for the ExpirationTime field when it is not set. Unfortunately
time.Time doesn't respect the json marshalling "omitempty" directive.
2019-04-26 14:48:16 -05:00
R.B. Boyer b3956e511c acl: ACL Tokens can now be assigned an optional set of service identities (#5390)
These act like a special cased version of a Policy Template for granting
a token the privileges necessary to register a service and its connect
proxy, and read upstreams from the catalog.
2019-04-26 14:48:04 -05:00
R.B. Boyer 76321aa952 acl: tokens can be created with an optional expiration time (#5353) 2019-04-26 14:47:51 -05:00
Matt Keeler 3ea9fe3bff
Implement bootstrapping proxy defaults from the config file (#5714) 2019-04-26 14:25:03 -04:00
Matt Keeler 3b5d38fb49
Implement config entry replication (#5706) 2019-04-26 13:38:39 -04:00
Alvin Huang 96c2c79908
Add fmt and vet (#5671)
* add go fmt and vet

* go fmt fixes
2019-04-25 12:26:33 -04:00
Kyle Havlovitz 6faa8ba451 Fill out the service manager functionality and fix tests 2019-04-23 00:17:28 -07:00
Kyle Havlovitz d51fd740bf
Merge pull request #5615 from hashicorp/config-entry-rpc
Add RPC endpoints for config entry operations
2019-04-23 00:16:54 -07:00
Kyle Havlovitz e64d1b8016 Rename config entry ACL methods 2019-04-22 23:55:11 -07:00
kaitlincarter-hc 7859d8c409
[docs] Server Performance (#5627)
* Moving server performance guide to docs.

* fixing broken links

* updating broken link

* fixing broken links
2019-04-17 13:17:12 -05:00
Matt Keeler ac78c23021
Implement data filtering of some endpoints (#5579)
Fixes: #4222 

# Data Filtering

This PR will implement filtering for the following endpoints:

## Supported HTTP Endpoints

- `/agent/checks`
- `/agent/services`
- `/catalog/nodes`
- `/catalog/service/:service`
- `/catalog/connect/:service`
- `/catalog/node/:node`
- `/health/node/:node`
- `/health/checks/:service`
- `/health/service/:service`
- `/health/connect/:service`
- `/health/state/:state`
- `/internal/ui/nodes`
- `/internal/ui/services`

More can be added going forward and any endpoint which is used to list some data is a good candidate.

## Usage

When using the HTTP API a `filter` query parameter can be used to pass a filter expression to Consul. Filter Expressions take the general form of:

```
<selector> == <value>
<selector> != <value>
<value> in <selector>
<value> not in <selector>
<selector> contains <value>
<selector> not contains <value>
<selector> is empty
<selector> is not empty
not <other expression>
<expression 1> and <expression 2>
<expression 1> or <expression 2>
```

Normal boolean logic and precedence is supported. All of the actual filtering and evaluation logic is coming from the [go-bexpr](https://github.com/hashicorp/go-bexpr) library

## Other changes

Adding the `Internal.ServiceDump` RPC endpoint. This will allow the UI to filter services better.
2019-04-16 12:00:15 -04:00
Kyle Havlovitz 2cffe4894f Move the ACL logic into the ConfigEntry interface 2019-04-10 14:27:28 -07:00
Kyle Havlovitz 81254deb59 Add RPC endpoints for config entry operations 2019-04-06 23:38:08 -07:00
Alvin Huang aacb81a566
Merge pull request #5376 from hashicorp/fix-tests
Fix tests in prep for CircleCI Migration
2019-04-04 17:09:32 -04:00
Kyle Havlovitz d6c25a13a5
Merge pull request #5539 from hashicorp/service-config
Service config state model
2019-04-02 16:34:58 -07:00
Kyle Havlovitz 63c9434779 Cleaned up some error handling/comments around config entries 2019-04-02 15:42:12 -07:00
Kyle Havlovitz ace5c7a1cb Encode config entry FSM messages in a generic type 2019-03-28 00:06:56 -07:00
Kyle Havlovitz 96a460c0cf Clean up service config state store methods 2019-03-27 16:52:38 -07:00
R.B. Boyer ab57b02ff8
acl: memdb filter of tokens-by-policy was inverted (#5575)
The inversion wasn't noticed because the parallel execution of TokenList
tests was operating incorrectly due to variable shadowing.
2019-03-27 15:24:44 -05:00
Jeff Mitchell d3c7d57209
Move internal/ to sdk/ (#5568)
* Move internal/ to sdk/

* Add a readme to the SDK folder
2019-03-27 08:54:56 -04:00
Jeff Mitchell a41c865059
Convert to Go Modules (#5517)
* First conversion

* Use serf 0.8.2 tag and associated updated deps

* * Move freeport and testutil into internal/

* Make internal/ its own module

* Update imports

* Add replace statements so API and normal Consul code are
self-referencing for ease of development

* Adapt to newer goe/values

* Bump to new cleanhttp

* Fix ban nonprintable chars test

* Update lock bad args test

The error message when the duration cannot be parsed changed in Go 1.12
(ae0c435877d3aacb9af5e706c40f9dddde5d3e67). This updates that test.

* Update another test as well

* Bump travis

* Bump circleci

* Bump go-discover and godo to get rid of launchpad dep

* Bump dockerfile go version

* fix tar command

* Bump go-cleanhttp
2019-03-26 17:04:58 -04:00
Paul Banks 68e8933ba5
Connect: Make Connect health queries unblock correctly (#5508)
* Make Connect health queryies unblock correctly in all cases and use optimal number of watch chans. Fixes #5506.

* Node check test cases and clearer bug test doc

* Comment update
2019-03-21 16:01:56 +00:00
Kyle Havlovitz c2cba68042 Fix fsm serialization and add snapshot/restore 2019-03-20 16:13:13 -07:00
Kyle Havlovitz 9df597b257 Fill out state store/FSM functions and add tests 2019-03-19 15:56:17 -07:00
Kyle Havlovitz 53913461db Add config types and state store table 2019-03-19 10:06:46 -07:00
Kyle Havlovitz bb0839ea5b Condense some test logic and add a comment about renaming 2019-03-18 16:15:36 -07:00
Paul Banks dd08426b04
Optimize health watching to single chan/goroutine. (#5449)
Refs #4984.

Watching chans for every node we touch in a health query is wasteful. In #4984 it shows that if there are more than 682 service instances we always fallback to watching all services which kills performance.

We already have a record in MemDB that is reliably update whenever the service health result should change thanks to per-service watch indexes.

So in general, provided there is at least one service instances and we actually have a service index for it (we always do now) we only ever need to watch a single channel.

This saves us from ever falling back to the general index and causing the performance cliff in #4984, but it also means fewer goroutines and work done for every blocking health query.

It also saves some allocations made during the query because we no longer have to populate a WatchSet with 3 chans per service instance which saves the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away and in general seems safe.
2019-03-15 20:18:48 +00:00
R.B. Boyer d65008700a
acl: reduce complexity of token resolution process with alternative singleflighting (#5480)
acl: reduce complexity of token resolution process with alternative singleflighting

Switches acl resolution to use golang.org/x/sync/singleflight. For the
identity/legacy lookups this is a drop-in replacement with the same
overall approach to request coalescing.

For policies this is technically a change in behavior, but when
considered holistically is approximately performance neutral (with the
benefit of less code).

There are two goals with this blob of code (speaking specifically of
policy resolution here):

  1) Minimize cross-DC requests.
  2) Minimize client-to-server LAN requests.

The previous iteration of this code was optimizing for the case of many
possibly different tokens being resolved concurrently that have a
significant overlap in linked policies such that deduplication would be
worth the complexity. While this is laudable there are some things to
consider that can help to adjust expectations:

  1) For v1.4+ policies are always replicated, and once a single policy
  shows up in a secondary DC the replicated data is considered
  authoritative for requests made in that DC. This means that our
  earlier concerns about minimizing cross-DC requests are irrelevant
  because there will be no cross-DC policy reads that occur.

  2) For Server nodes the in-memory ACL policy cache is capped at zero,
  meaning it has no caching. Only Client nodes run with a cache. This
  means that instead of having an entire DC's worth of tokens (what a
  Server might see) that can have policy resolutions coalesced these
  nodes will only ever be seeing node-local token resolutions. In a
  reasonable worst-case scenario where a scheduler like Kubernetes has
  "filled" a node with Connect services, even that will only schedule
  ~100 connect services per node. If every service has a unique token
  there will only be 100 tokens to coalesce and even then those requests
  have to occur concurrently AND be hitting an empty consul cache.

Instead of seeing a great coalescing opportunity for cutting down on
redundant Policy resolutions, in practice it's far more likely given
node densities that you'd see requests for the same token concurrently
than you would for two tokens sharing a policy concurrently (to a degree
that would warrant the overhead of the current variation of
singleflighting.

Given that, this patch switches the Policy resolution process to only
singleflight by requesting token (but keeps the cache as by-policy).
2019-03-14 09:35:34 -05:00
Kyle Havlovitz 3aec844fd2 Update state store test for changing node ID 2019-03-13 17:05:31 -07:00
Kyle Havlovitz df4ec913f0 Add a test for changing a failed node's ID 2019-03-13 15:39:07 -07:00
Hans Hasselberg d511e86491
agent: enable reloading of tls config (#5419)
This PR introduces reloading tls configuration. Consul will now be able to reload the TLS configuration which previously required a restart. It is not yet possible to turn TLS ON or OFF with these changes. Only when TLS is already turned on, the configuration can be reloaded. Most importantly the certificates and CAs.
2019-03-13 10:29:06 +01:00
R.B. Boyer e9614ee92f
acl: correctly extend the cache for acl identities during resolution (#5475) 2019-03-12 10:23:43 -05:00
Aestek 071fcb28ba [catalog] Update the node's services indexes on update (#5458)
Node updates were not updating the service indexes, which are used for
service related queries. This caused the X-Consul-Index to stay the same
after a node update as seen from a service query even though the node
data is returned in heath queries. If that happened in between queries
the client would miss this change.
We now update the indexes of the services on the node when it is
updated.

Fixes: #5450
2019-03-11 14:48:19 +00:00
Kyle Havlovitz bf09061e86 Add logic to allow changing a failed node's ID 2019-03-07 22:42:54 -08:00
Alvin Huang ece3b5907d fix typos 2019-03-06 14:47:33 -05:00
R.B. Boyer 91e78e00c7
fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
R.B. Boyer c24e3584be improve flaky LANReap tests by expliciting configuring the tombstone timeout
In TestServer_LANReap autopilot is running, so the alternate flow
through the serf reaping function is possible. In that situation the
ReconnectTimeout is not relevant so for parity also override the
TombstoneTimeout value as well.

For additional parity update the TestServer_WANReap and
TestClient_LANReap versions of this test in the same way even though
autopilot is irrelevant here .
2019-03-05 14:34:03 -06:00
Matt Keeler 87f9365eee Fixes for CVE-2019-8336
Fix error in detecting raft replication errors.

Detect redacted token secrets and prevent attempting to insert.

Add a Redacted field to the TokenBatchRead and TokenRead RPC endpoints

This will indicate whether token secrets have been redacted.

Ensure any token with a redacted secret in secondary datacenters is removed.

Test that redacted tokens cannot be replicated.
2019-03-04 19:13:24 +00:00
Matt Keeler 612aba7ced
Dont modify memdb owned token data for get/list requests of tokens (#5412)
Previously we were fixing up the token links directly on the *ACLToken returned by memdb. This invalidated some assumptions that a snapshot is immutable as well as potentially being able to cause a crash.

The fix here is to give the policy link fixing function copy on write semantics. When no fixes are necessary we can return the memdb object directly, otherwise we copy it and create a new list of links.

Eventually we might find a better way to keep those policy links in sync but for now this fixes the issue.
2019-03-04 09:28:46 -05:00
Matt Keeler 416a6543a6
Call RemoveServer for reap events (#5317)
This ensures that servers are removed from RPC routing when they are reaped.
2019-03-04 09:19:35 -05:00
R.B. Boyer d3be5c1d3a fix ignored errors in state store internals as reported by errcheck 2019-03-01 14:18:00 -06:00
Matt Keeler 0c76a4389f
ACL Token Persistence and Reloading (#5328)
This PR adds two features which will be useful for operators when ACLs are in use.

1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)

Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.

Some other secondary changes:

* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid. 
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
2019-02-27 14:28:31 -05:00
Hans Hasselberg 75ababb54f
Centralise tls configuration part 1 (#5366)
In order to be able to reload the TLS configuration, we need one way to generate the different configurations.

This PR introduces a `tlsutil.Configurator` which holds a `tlsutil.Config`. Afterwards it is responsible for rendering every `tls.Config`. In this particular PR I moved `IncomingHTTPSConfig`, `IncomingTLSConfig`, and `OutgoingTLSWrapper` into `tlsutil.Configurator`.

This PR is a pure refactoring - not a single feature added. And not a single test added. I only slightly modified existing tests as necessary.
2019-02-26 16:52:07 +01:00
Alvin Huang c4168e6dfc add wait to TestClient_JoinLAN 2019-02-22 17:34:45 -05:00
Alvin Huang 2e961d6539 add retry to TestResetSessionTimerLocked 2019-02-22 17:34:45 -05:00
R.B. Boyer ae1cb27126 fix incorrect body of TestACLEndpoint_PolicyBatchRead
Lifted from PR #5307 as it was an unrelated drive-by fix on that PR anyway.

s/token/policy/
2019-02-22 09:32:51 -06:00
R.B. Boyer 8e344c0218 test: switch test file from assert -> require for consistency
Also in acl_endpoint_test.go:

* convert logical blocks in some token tests to subtests
* remove use of require.New

This removes a lot of noise in a later PR.
2019-02-14 14:21:19 -06:00
R.B. Boyer 57be6ca215 correct some typos 2019-02-13 13:02:12 -06:00
R.B. Boyer a3e0fb8370 ensure that we plumb our configured logger into all parts of the raft library 2019-02-13 13:02:09 -06:00
R.B. Boyer 3b60891bf8 reduce the local scope of variable 2019-02-13 11:54:28 -06:00
R.B. Boyer 77d28fe9ce
clarify the ACL.PolicyDelete endpoint (#5337)
There was an errant early-return in PolicyDelete() that bypassed the
rest of the function.  This was ok because the only caller of this
function ignores the results.

This removes the early-return making it structurally behave like
TokenDelete() and for both PolicyDelete and TokenDelete clarify the lone
callers to indicate that the return values are ignored.

We may wish to avoid the entire return value as well, but this patch
doesn't go that far.
2019-02-13 09:16:30 -06:00
R.B. Boyer 106d87a4a8
update TestStateStore_ACLBootstrap to not rely upon request mutation (#5335) 2019-02-12 16:09:26 -06:00
Matt Keeler fa2c7059a2
Move autopilot initialization to prevent race (#5322)
`establishLeadership` invoked during leadership monitoring may use autopilot to do promotions etc. There was a race with doing that and having autopilot initialized and this fixes it.
2019-02-11 11:12:24 -05:00
Matt Keeler 210c3a56b0
Improve Connect with Prepared Queries (#5291)
Given a query like:

```
{
   "Name": "tagged-connect-query",
   "Service": {
      "Service": "foo",
      "Tags": ["tag"],
      "Connect": true
   }
}
```

And a Consul configuration like:

```
{
   "services": [
      "name": "foo",
      "port": 8080,
      "connect": { "sidecar_service": {} },
      "tags": ["tag"]
   ]
}
```

If you executed the query it would always turn up with 0 results. This was because the sidecar service was being created without any tags. You could instead make your config look like:

```
{
   "services": [
      "name": "foo",
      "port": 8080,
      "connect": { "sidecar_service": {
         "tags": ["tag"]
      } },
      "tags": ["tag"]
   ]
}
```

However that is a bit redundant for most cases. This PR ensures that the tags and service meta of the parent service get copied to the sidecar service. If there are any tags or service meta set in the sidecar service definition then this copying does not take place. After the changes, the query will now return the expected results.

A second change was made to prepared queries in this PR which is to allow filtering on ServiceMeta just like we allow for filtering on NodeMeta.
2019-02-04 09:36:51 -05:00
R.B. Boyer b5d71ea779
testutil: redirect some test agent logs to testing.T.Logf (#5304)
When tests fail, only the logs for the failing run are dumped to the
console which helps in diagnosis. This is easily added to other test
scenarios as they come up.
2019-02-01 09:21:54 -06:00
Kyle Havlovitz b30b541007
connect: Forward intention RPCs if this isn't the primary 2019-01-22 11:29:21 -08:00
Kyle Havlovitz a731173661
Merge pull request #5249 from hashicorp/ca-fixes-oss
Minor CA fixes
2019-01-22 11:25:09 -08:00
Kyle Havlovitz b0f07d9b5e
Merge pull request #4869 from hashicorp/txn-checks
Add node/service/check operations to transaction api
2019-01-22 11:16:09 -08:00
Matt Keeler cc2cd75f5c
Fix several ACL token/policy resolution issues. (#5246)
* Fix 2 remote ACL policy resolution issues

1 - Use the right method to fire async not found errors when the ACL.PolicyResolve RPC returns that error. This was previously accidentally firing a token result instead of a policy result which would have effectively done nothing (unless there happened to be a token with a secret id == the policy id being resolved.

2. When concurrent policy resolution is being done we single flight the requests. The bug before was that for the policy resolution that was going to piggy back on anothers RPC results it wasn’t waiting long enough for the results to come back due to looping with the wrong variable.

* Fix a handful of other edge case ACL scenarios

The main issue was that token specific issues (not able to access a particular policy or the token being deleted after initial fetching) were poisoning the policy cache.

A second issue was that for concurrent token resolutions, the first resolution to get started would go fetch all the policies. If before the policies were retrieved a second resolution request came in, the new request would register watchers for those policies but then never block waiting for them to complete. This resulted in using the default policy when it shouldn't have.
2019-01-22 13:14:43 -05:00
Paul Banks 1c4dfbcd2e
connect: tame thundering herd of CSRs on CA rotation (#5228)
* Support rate limiting and concurrency limiting CSR requests on servers; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client

* Add CSR rate limiting docs

* Fix config naming and add tests for new CA configs
2019-01-22 17:19:36 +00:00
Kyle Havlovitz 4f53fe897a
oss: add the enterprise server stub for intention replication check 2019-01-18 17:32:10 -08:00
Matt Keeler 2f6a9edfac
Store leaf cert indexes in raft and use for the ModifyIndex on the returned certs (#5211)
* Store leaf cert indexes in raft and use for the ModifyIndex on the returned certs

This ensures that future certificate signings will have a strictly greater ModifyIndex than any previous certs signed.
2019-01-11 16:04:57 -05:00
Aestek ff13518961 Improve blocking queries on services that do not exist (#4810)
## Background

When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.

## Implementation

While reducing the number of unnecessary updates we still want :
* Clients to be notified as soon as when the last instance of a service disapears.
* Clients to be notified whenever there's there is an update for the service.
* Clients to be notified as soon as the first instance of the requested service is added.

To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :

1. Client `client1` makes a query for service `foo`, gets back a node and X-Consul-Index 42
2. `foo` is unregistered 
3. `client1`  makes a query for `foo` with `index=42` -> `foo` does not exist, the query blocks and `client1` is not notified of the change on `foo` 

We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly. 
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist. 
When a service `srv` is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients while `srv` was up, allowing us to immediatly notify them.

1. Client `client1` makes a query for service `foo`, gets back a node and `X-Consul-Index: 42`
2. `foo` is unregistered, we set the "missing service index" to 43 
3. `client1` makes a blocking query for `foo` with `index=42` -> `foo` does not exist, we check against the "missing service index" and return immediatly with `X-Consul-Index: 43`
4. `client1` makes a blocking query for `foo` with `index=43` -> we block
5. Other changes happen in the cluster, but foo still doesn't exist and "missing service index" hasn't changed, the query is still blocked
6. `foo` is registered again on index 62 -> `foo` exists and its index is greater than 43, we unblock the query
2019-01-11 09:26:14 -05:00
Matt Keeler 29b4512120
acl: Prevent tokens from deleting themselves (#5210)
Fixes #4897 

Also apparently token deletion could segfault in secondary DCs when attempting to delete non-existant tokens. For that reason both checks are wrapped within the non-nil check.
2019-01-10 09:22:51 -05:00
Kyle Havlovitz c266277a49 txn: clean up some state store/acl code 2019-01-09 11:59:23 -08:00
Pierre Souchay 5b8a7d7127 Avoid to have infinite recursion in DNS lookups when resolving CNAMEs (#4918)
* Avoid to have infinite recursion in DNS lookups when resolving CNAMEs

This will avoid killing Consul when a Service.Address is using CNAME
to a Consul CNAME that creates an infinite recursion.

This will fix https://github.com/hashicorp/consul/issues/4907

* Use maxRecursionLevel = 3 to allow several recursions
2019-01-07 16:53:54 -05:00
Paul Banks 0962e95e85
bugfix: use ServiceTags to generate cache key hash (#4987)
* bugfix: use ServiceTags to generate cahce key hash

* update unit test

* update

* remote print log

* Update .gitignore

* Completely deprecate ServiceTag field internally for clarity

* Add explicit test for CacheInfo cases
2019-01-07 21:30:47 +00:00
Kyle Havlovitz 8b1dc6a22c txn: fix an issue with querying nodes by name instead of ID 2018-12-12 12:46:33 -08:00
Pierre Souchay 61870be137 [Travis][UnstableTests] Fixed unstable tests in travis (#5013)
* [Travis][UnstableTests] Fixed unstable tests in travis as seen in https://travis-ci.org/hashicorp/consul/jobs/460824602

* Fixed unstable tests in https://travis-ci.org/hashicorp/consul/jobs/460857687
2018-12-12 12:09:42 -08:00
Kyle Havlovitz efcdc85e1a api: add support for new txn operations 2018-12-12 10:54:09 -08:00
Kyle Havlovitz 2408f99cca txn: add tests for RPC endpoint 2018-12-12 10:04:10 -08:00
Kyle Havlovitz 9f4f673c4d txn: add ACL enforcement/validation to new txn ops 2018-12-12 10:04:10 -08:00
Kyle Havlovitz 41e8120d3d state: add tests for new txn ops 2018-12-12 10:04:10 -08:00
Kyle Havlovitz a40a346be8 txn: add service operations 2018-12-12 10:04:10 -08:00
Kyle Havlovitz b1aeb3b943 txn: add node operations 2018-12-12 10:04:10 -08:00
Kyle Havlovitz bd6b7ad162 txn: add pre-check operations to txn endpoint 2018-12-12 10:04:10 -08:00
Kyle Havlovitz 8a0d7b65d6 Add check operations to transaction api 2018-12-12 10:04:10 -08:00
Kyle Havlovitz e7946197b8 connect/ca: prevent blank CA config in snapshot
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.

Fixes #4954.
2018-12-06 17:40:53 -08:00
R.B. Boyer c86eff8859
agent: remove some stray fmt.Print* calls (#5015) 2018-11-29 09:45:51 -06:00
Pierre Souchay d0ca1bade9 Fixed another list of unstable unit tests in travis (#4915)
* Fixed another list of unstable unit tests in travis

Fixed failing tests in https://travis-ci.org/hashicorp/consul/jobs/451357061

* Fixed another list of unstable unit tests in travis.

Fixed failing tests in https://travis-ci.org/hashicorp/consul/jobs/451357061
2018-11-20 11:27:26 +00:00
Kyle Havlovitz 3cc7d6ebb5
Merge pull request #4952 from hashicorp/test-version
tests: Bump test server version to 1.4.0
2018-11-13 13:37:10 -08:00
R.B. Boyer 8662a6d260
acl: add stub hooks to support some plumbing in enterprise (#4951) 2018-11-13 15:35:54 -06:00
Kyle Havlovitz 19f9cad3fe
oss: bump test server version to 1.4.0 2018-11-13 13:13:26 -08:00
Aestek 4fb564abbc Fix catalog tag filter backward compat (#4944)
Fix catalog service node filtering (ex /v1/catalog/service/srv?tag=tag1)
between agent version <=v1.2.3 and server >=v1.3.0.
New server version did not account for the old field when filtering
hence request made from old agent were not tag-filtered.
2018-11-13 14:44:36 +00:00
Kyle Havlovitz b0dcf54e50
Merge pull request #4917 from hashicorp/replication-token-cleanup
Use acl replication_token for connect
2018-11-12 09:12:54 -08:00
Kyle Havlovitz 038aefa0bc update non-voting server test to fix enterprise diff 2018-11-09 12:50:24 -08:00
Kyle Havlovitz 70accbb2e0 oss: do a proper check-and-set on the CA roots/config fsm operation 2018-11-09 12:36:23 -08:00
R.B. Boyer 2e29f234b1
acl: fixes ACL replication for legacy tokens without AccessorIDs (#4885) 2018-11-07 07:59:44 -08:00
Kyle Havlovitz 1a4204f363
agent: fix formatting 2018-11-07 02:16:03 -08:00
R.B. Boyer a5d57f5326
fix comment typos (#4890) 2018-11-02 12:00:39 -05:00
Kyle Havlovitz 5b7b8bf842
Merge pull request #4872 from hashicorp/node-snapshot-fix
Node ID/datacenter snapshot fix
2018-10-31 15:51:07 -07:00
Matt Keeler 26b1873b3b Adds documentation for the new ACL APIs (#4851)
* Update the ACL API docs

* Add a CreateTime to the anon token

Also require acl:read permissions at least to perform rule translation. Don’t want someone DoSing the system with an open endpoint that actually does a bit of work.

* Fix one place where I was referring to id instead of AccessorID

* Add godocs for the API package additions.

* Minor updates: removed some extra commas and updated the acl intro paragraph

* minor tweaks

* Updated the language to be clearer

* Updated the language to be clearer for policy page

* I was also confused by that! Your updates are much clearer.

Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>

* Sounds much better.

Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>

* Updated sidebar layout and deprecated warning
2018-10-31 15:11:51 -07:00
Matt Keeler ec9934b6f8 Remaining ACL Unit Tests (#4852)
* Add leader token upgrade test and fix various ACL enablement bugs

* Update the leader ACL initialization tests.

* Add a StateStore ACL tests for ACLTokenSet and ACLTokenGetBy* functions

* Advertise the agents acl support status with the agent/self endpoint.

* Make batch token upsert CAS’able to prevent consistency issues with token auto-upgrade

* Finish up the ACL state store token tests

* Finish the ACL state store unit tests

Also rename some things to make them more consistent.

* Do as much ACL replication testing as I can.
2018-10-31 13:00:46 -07:00
Kyle Havlovitz cf2210b5c5 fsm: update snapshot/restore test to include ID and datacenter 2018-10-30 15:53:14 -07:00
Kyle Havlovitz 58ff5e46cb fsm: add missing ID/datacenter to persistNodes 2018-10-30 15:52:54 -07:00
Matt Keeler 0dd537e506
Fix the NonVoter Bootstrap test (#4786) 2018-10-24 10:23:50 -04:00
Kyle Havlovitz 6f40708aca fsm: add Intention operations to transactions for internal use 2018-10-19 10:02:28 -07:00
Matt Keeler df507a4a55 A few misc fixes found by go vet 2018-10-19 12:28:36 -04:00
Matt Keeler 99e0a124cb
New ACLs (#4791)
This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here but most of it is related docs, testing and finishing the last few commands in the CLI. I will update the PR description and check off the todos as I finish them over the next few days/week.
Description

At a high level this PR is mainly to split ACL tokens from Policies and to split the concepts of Authorization from Identities. A lot of this PR is mostly just to support CRUD operations on ACLTokens and ACLPolicies. These in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled and the separation of policy from identity which could lead the way to allowing for alternative identity providers.

On the surface and with a new cluster the ACL system will look very similar to that of Nomads. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap resetting procedure. There are a few key differences though.

    Nomad requires token and policy replication where Consul only requires policy replication with token replication being opt-in. In Consul local tokens only work with token replication being enabled though.
    All policies in Nomad are globally applicable. In Consul all policies are stored and replicated globally but can be scoped to a subset of the datacenters. This allows for more granular access management.
    Unlike Nomad, Consul has legacy baggage in the form of the original ACL system. The ramifications of this are:
        A server running the new system must still support other clients using the legacy system.
        A client running the new system must be able to use the legacy RPCs when the servers in its datacenter are running the legacy system.
        The primary ACL DC's servers running in legacy mode needs to be a gate that keeps everything else in the entire multi-DC cluster running in legacy mode.

So not only does this PR implement the new ACL system but has a legacy mode built in for when the cluster isn't ready for new ACLs. Also detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed more in the "Transitioning from Legacy to New ACL Mode" section below.
2018-10-19 12:04:07 -04:00
Pierre Souchay a72f92cac6 dns: implements prefix lookups for DNS TTL (#4605)
This will fix https://github.com/hashicorp/consul/issues/4509 and allow forinstance lb-* to match services lb-001 or lb-service-007.
2018-10-19 08:41:04 -07:00
Kyle Havlovitz 96a35f8abc re-add Connect multi-dc config changes
This reverts commit 8bcfbaffb6588b024cd1a3cf0952e6bfa7d9e900.
2018-10-19 08:41:03 -07:00
Jack Pearkes 847a0a5266 Revert "Connect multi-dc config" (#4784) 2018-10-11 17:32:45 +01:00
Aestek 260a9880ae [Security] Add finer control over script checks (#4715)
* Add -enable-local-script-checks options

These options allow for a finer control over when script checks are enabled by
giving the option to only allow them when they are declared from the local
file system.

* Add documentation for the new option

* Nitpick doc wording
2018-10-11 13:22:11 +01:00
Rebecca Zanzig 0ec6d880f5 Support multiple tags for health and catalog http api endpoints (#4717)
* Support multiple tags for health and catalog api endpoints

Fixes #1781.

Adds a `ServiceTags` field to the ServiceSpecificRequest to support
multiple tags, updates the filter logic in the catalog store, and
propagates these change through to the health and catalog endpoints.

Note: Leaves `ServiceTag` in the struct, since it is being used as
part of the DNS lookup, which in turn uses the health check.

* Update the api package to support multiple tags

Includes additional tests.

* Update new tests to use the `require` library

* Update HealthConnect check after a bad merge
2018-10-11 12:50:05 +01:00
Pierre Souchay b0fc91a1d2 [Performance On Large clusters] Reduce updates on large services (#4720)
* [Performance On Large clusters] Checks do update services/nodes only when really modified to avoid too many updates on very large clusters

In a large cluster, when having a few thousands of nodes, the anti-entropy
mechanism performs lots of changes (several per seconds) while
there is no real change. This patch wants to improve this in order
to increase Consul scalability when using many blocking requests on
health for instance.

* [Performance for large clusters] Only updates index of service if service is really modified

* [Performance for large clusters] Only updates index of nodes if node is really modified

* Added comments / ensure IsSame() has clear semantics

* Avoid having modified boolean, return nil directly if stutures are Same

* Fixed unstable unit tests TestLeader_ChangeServerID

* Rewrite TestNode_IsSame() for better readability as suggested by @banks

* Rename ServiceNode.IsSame() into IsSameService() + added unit tests

* Do not duplicate TestStructs_ServiceNode_Conversions() and increase test coverage of IsSameService

* Clearer documentation in IsSameService

* Take into account ServiceProxy into ServiceNode.IsSameService()

* Fixed IsSameService() with all new structures
2018-10-11 12:42:39 +01:00
Pierre Souchay 42f250fa53 Added SOA configuration for DNS settings. (#4714)
This will allow to fine TUNE SOA settings sent by Consul in DNS responses,
for instance to be able to control negative ttl.

Will fix: https://github.com/hashicorp/consul/issues/4713

# Example

Override all settings:

* min_ttl: 0 => 60s
* retry: 600 (10m) => 300s (5 minutes),
* expire: 86400 (24h) => 43200 (12h)
* refresh: 3600 (1h) => 1800 (30 minutes)

```
consul agent -dev -hcl 'dns_config={soa={min_ttl=60,retry=300,expire=43200,refresh=1800}}'
```

Result:
```
dig +multiline @localhost -p 8600 service.consul

; <<>> DiG 9.12.1 <<>> +multiline @localhost -p 8600 service.consul
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36557
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;service.consul.		IN A

;; AUTHORITY SECTION:
consul.			0 IN SOA ns.consul. hostmaster.consul. (
				1537959133 ; serial
				1800       ; refresh (30 minutes)
				300        ; retry (5 minutes)
				43200      ; expire (12 hours)
				60         ; minimum (1 minute)
				)

;; Query time: 4 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Sep 26 12:52:13 CEST 2018
;; MSG SIZE  rcvd: 93
```
2018-10-10 15:50:56 -04:00
Kyle Havlovitz 0cbd176a48 connect/ca: more OSS split for multi-dc 2018-10-10 12:17:59 -07:00
Kyle Havlovitz 6d5160c139 connect/ca: split CA initialization logic between oss/enterprise 2018-10-10 12:17:59 -07:00
Kyle Havlovitz 5b98a602af agent: add primary_datacenter and connect replication config options 2018-10-10 12:17:59 -07:00
Kyle Havlovitz 304595f7a6 connect: add ExternalTrustDomain to CARoot fields 2018-10-10 12:16:47 -07:00
Kyle Havlovitz 475afd0300 docs: deprecate acl_datacenter and replace it with primary_datacenter 2018-10-10 12:16:47 -07:00
Paul Banks 92fe8c8e89 Add Proxy Upstreams to Service Definition (#4639)
* Refactor Service Definition ProxyDestination.

This includes:
 - Refactoring all internal structs used
 - Updated tests for both deprecated and new input for:
   - Agent Services endpoint response
   - Agent Service endpoint response
   - Agent Register endpoint
     - Unmanaged deprecated field
     - Unmanaged new fields
     - Managed deprecated upstreams
     - Managed new
   - Catalog Register
     - Unmanaged deprecated field
     - Unmanaged new fields
     - Managed deprecated upstreams
     - Managed new
   - Catalog Services endpoint response
   - Catalog Node endpoint response
   - Catalog Service endpoint response
 - Updated API tests for all of the above too (both deprecated and new forms of register)

TODO:
 - config package changes for on-disk service definitions
 - proxy config endpoint
 - built-in proxy support for new fields

* Agent proxy config endpoint updated with upstreams

* Config file changes for upstreams.

* Add upstream opaque config and update all tests to ensure it works everywhere.

* Built in proxy working with new Upstreams config

* Command fixes and deprecations

* Fix key translation, upstream type defaults and a spate of other subtele bugs found with ned to end test scripts...

TODO: tests still failing on one case that needs a fix. I think it's key translation for upstreams nested in Managed proxy struct.

* Fix translated keys in API registration.
≈

* Fixes from docs
 - omit some empty undocumented fields in API
 - Bring back ServiceProxyDestination in Catalog responses to not break backwards compat - this was removed assuming it was only used internally.

* Documentation updates for Upstreams in service definition

* Fixes for tests broken by many refactors.

* Enable travis on f-connect branch in this branch too.

* Add consistent Deprecation comments to ProxyDestination uses

* Update version number on deprecation notices, and correct upstream datacenter field with explanation in docs
2018-10-10 16:55:34 +01:00
Alex Dadgar 90ed72fd70 do not bootstrap with non voters 2018-09-19 17:41:36 -07:00
Kyle Havlovitz 9b8f8975c6
Merge pull request #4644 from hashicorp/ca-refactor
connect/ca: rework initialization/root generation in providers
2018-09-13 13:08:34 -07:00
Paul Banks 09e4c2995b
Fix CA pruning when CA config uses string durations. (#4669)
* Fix CA pruning when CA config uses string durations.

The tl;dr here is:

 - Configuring LeafCertTTL with a string like "72h" is how we do it by default and should be supported
 - Most of our tests managed to escape this by defining them as time.Duration directly
 - Out actual default value is a string
 - Since this is stored in a map[string]interface{} config, when it is written to Raft it goes through a msgpack encode/decode cycle (even though it's written from server not over RPC).
 - msgpack decode leaves the string as a `[]uint8`
 - Some of our parsers required string and failed
 - So after 1 hour, a default configured server would throw an error about pruning old CAs
 - If a new CA was configured that set LeafCertTTL as a time.Duration, things might be OK after that, but if a new CA was just configured from config file, intialization would cause same issue but always fail still so would never prune the old CA.
 - Mostly this is just a janky error that got passed tests due to many levels of complicated encoding/decoding.

tl;dr of the tl;dr: Yay for type safety. Map[string]interface{} combined with msgpack always goes wrong but we somehow get bitten every time in a new way :D

We already fixed this once! The main CA config had the same problem so @kyhavlov already wrote the mapstructure DecodeHook that fixes it. It wasn't used in several places it needed to be and one of those is notw in `structs` which caused a dependency cycle so I've moved them.

This adds a whole new test thta explicitly tests the case that broke here. It also adds tests that would have failed in other places before (Consul and Vaul provider parsing functions). I'm not sure if they would ever be affected as it is now as we've not seen things broken with them but it seems better to explicitly test that and support it to not be bitten a third time!

* Typo fix

* Fix bad Uint8 usage
2018-09-13 15:43:00 +01:00
Pierre Souchay 5ecf9823d2 Fix more unstable tests in agent and command 2018-09-12 14:49:27 +01:00
Kyle Havlovitz 8fc2c77fdf
connect/ca: some cleanup and reorganizing of the new methods 2018-09-11 16:43:04 -07:00
Pierre Souchay 7a42c31330 Fix unstable tests in agent, api, and command/watch 2018-09-10 16:58:53 +01:00
Pierre Souchay 473e589d86 Implementation of Weights Data structures (#4468)
* Implementation of Weights Data structures

Adding this datastructure will allow us to resolve the
issues #1088 and #4198

This new structure defaults to values:
```
   { Passing: 1, Warning: 0 }
```

Which means, use weight of 0 for a Service in Warning State
while use Weight 1 for a Healthy Service.
Thus it remains compatible with previous Consul versions.

* Implemented weights for DNS SRV Records

* DNS properly support agents with weight support while server does not (backwards compatibility)

* Use Warning value of Weights of 1 by default

When using DNS interface with only_passing = false, all nodes
with non-Critical healthcheck used to have a weight value of 1.
While having weight.Warning = 0 as default value, this is probably
a bad idea as it breaks ascending compatibility.

Thus, we put a default value of 1 to be consistent with existing behaviour.

* Added documentation for new weight field in service description

* Better documentation about weights as suggested by @banks

* Return weight = 1 for unknown Check states as suggested by @banks

* Fixed typo (of -> or) in error message as requested by @mkeeler

* Fixed unstable unit test TestRetryJoin

* Fixed unstable tests

* Fixed wrong Fatalf format in `testrpc/wait.go`

* Added notes regarding DNS SRV lookup limitations regarding number of instances

* Documentation fixes and clarification regarding SRV records with weights as requested by @banks

* Rephrase docs
2018-09-07 15:30:47 +01:00
Kyle Havlovitz e184a18e4b
connect/ca: add Configure/GenerateRoot to provider interface 2018-09-06 19:18:59 -07:00
Pierre Souchay 54d8157ee1 Fixed more flaky tests in ./agent/consul (#4617) 2018-09-04 14:02:47 +01:00
Freddy 10d3048bd6
Bugfix: Use "%#v" when formatting structs (#4600) 2018-08-28 12:37:34 -04:00
Pierre Souchay 9b5cf0c1d0 [BUGFIX] Avoid returning empty data on startup of a non-leader server (#4554)
Ensure that DB is properly initialized when performing stale queries

Addresses:
- https://github.com/hashicorp/consul-replicate/issues/82
- https://github.com/hashicorp/consul/issues/3975
- https://github.com/hashicorp/consul-template/issues/1131
2018-08-23 12:06:39 -04:00
Kyle Havlovitz 26a21df014
Merge branch 'master' into ca-snapshot-fix 2018-08-16 13:00:54 -07:00
Kyle Havlovitz af4b037c52
fsm: add connect service config to snapshot/restore test 2018-08-16 12:58:54 -07:00
nickmy9729 43a68822e3 Added code to allow snapshot inclusion of NodeMeta (#4527) 2018-08-16 15:33:35 -04:00
Kyle Havlovitz 880eccb502
fsm: add missing CA config to snapshot/restore logic 2018-08-16 11:58:50 -07:00
Kyle Havlovitz fd83063686
autopilot: don't follow the normal server removal rules for nonvoters 2018-08-14 14:24:51 -07:00
Kyle Havlovitz aa19559cc7
Fix stats fetcher healthcheck RPCs not being independent 2018-08-14 14:23:52 -07:00
Pierre Souchay a16f34058b Display more information about check being not properly added when it fails (#4405)
* Display more information about check being not properly added when it fails

It follows an incident where we add lots of error messages:

  [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration

That seems related to Consul failing to restart on respective agents.

Having Node information as well as service information would help diagnose the issue.

* Renamed ensureCheckIfNodeMatches() as requested by @banks
2018-08-14 17:45:33 +01:00
Pierre Souchay 821a91ca31 Allow to rename nodes with IDs, will fix #3974 and #4413 (#4415)
* Allow to rename nodes with IDs, will fix #3974 and #4413

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See #3983 and #4399 for Context about this change.

Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.

* Removed the case-insensitive search when adding a node within the else
block since it breaks the test TestAgentAntiEntropy_Services

While the else case is probably legit, it will be fixed with #4414 in
a later release.

* Added again the test in the else to avoid duplicated names, but
enforce this test only for nodes having IDs.

Thus most tests without any ID will work, and allows us fixing

* Added more tests regarding request with/without IDs.

`TestStateStore_EnsureNode` now test registration and renaming with IDs

`TestStateStore_EnsureNodeDeprecated` tests registration without IDs
and tests removing an ID from a node as well as updated a node
without its ID (deprecated behaviour kept for backwards compatibility)

* Do not allow renaming in case of conflict, including when other node has no ID

* Fixed function GetNodeID that was not working due to wrong type when searching node from its ID

Thus, all tests about renaming were not working properly.

Added the full test cas that allowed me to detect it.

* Better error messages, more tests when nodeID is not a valid UUID in GetNodeID()

* Added separate TestStateStore_GetNodeID to test GetNodeID.

More complete test coverage for GetNodeID

* Added new unit test `TestStateStore_ensureNoNodeWithSimilarNameTxn`

Also fixed comments to be clearer after remarks from @banks

* Fixed error message in unit test to match test case

* Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler
2018-08-10 11:30:45 -04:00
Siva Prasad d98d02777f
PR to fix TestAgent_IndexChurn and TestPreparedQuery_Wrapper. (#4512)
* Fixes TestAgent_IndexChurn

* Fixes TestPreparedQuery_Wrapper

* Increased sleep in agent_test for IndexChurn to 500ms

* Made the comment about joinWAN operation much less of a cliffhanger
2018-08-09 12:40:07 -04:00
Armon Dadgar a343392f63 consul: Update buffer sizes 2018-08-08 10:26:58 -07:00
Siva Prasad cfa436dc16
Revert "CA initialization while boostrapping and TestLeader_ChangeServerID fix." (#4497)
* Revert "BUGFIX: Unit test relying on WaitForLeader() did not work due to wrong test (#4472)"

This reverts commit cec5d7239621e0732b3f70158addb1899442acb3.

* Revert "CA initialization while boostrapping and TestLeader_ChangeServerID fix. (#4493)"

This reverts commit 589b589b53e56af38de25db9b56967bdf1f2c069.
2018-08-07 08:29:48 -04:00
Pierre Souchay fd927ea110 BUGFIX: Unit test relying on WaitForLeader() did not work due to wrong test (#4472)
- Improve resilience of testrpc.WaitForLeader()

- Add additionall retry to CI

- Increase "go test" timeout to 8m

- Add wait for cluster leader to several tests in the agent package

- Add retry to some tests in the api and command packages
2018-08-06 19:46:09 -04:00
Siva Prasad 29c181f5fa
CA initialization while boostrapping and TestLeader_ChangeServerID fix. (#4493)
* connect: fix an issue with Consul CA bootstrapping being interrupted

* streamline change server id test
2018-08-06 16:15:24 -04:00
Kyle Havlovitz 42ab07b398
fix inconsistency in TestConnectCAConfig_GetSet 2018-07-26 07:46:47 -07:00
Kyle Havlovitz ecc02c6aee
Merge pull request #4400 from hashicorp/leaf-cert-ttl
Add configurable leaf cert TTL to Connect CA
2018-07-25 17:53:25 -07:00
Paul Banks 217137b775
Fixes #4421: General solution to stop blocking queries with index 0 (#4437)
* Fix theoretical cache collision bug if/when we use more cache types with same result type

* Generalized fix for blocking query handling when state store methods return zero index

* Refactor test retry to only affect CI

* Undo make file merge

* Add hint to error message returned to end-user requests if Connect is not enabled when they try to request cert

* Explicit error for Roots endpoint if connect is disabled

* Fix tests that were asserting old behaviour
2018-07-25 20:26:27 +01:00
Kyle Havlovitz a125735d76
connect/ca: check LeafCertTTL when rotating expired roots 2018-07-20 16:04:04 -07:00