Commit Graph

1302 Commits

Author SHA1 Message Date
Daniel Nephin f2b504873a
Merge pull request #9460 from hashicorp/dnephin/fix-data-races
Fix a couple data races in tests
2021-01-14 17:07:01 -05:00
Chris Piraino baad708929
Fix bug in usage metrics when multiple service instances are changed in a single transaction (#9440)
* Fix bug in usage metrics that caused a negative count to occur

There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.

We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.

* Add changelog
2021-01-12 15:31:47 -06:00
Chris Piraino 2eac571276
Log replication warnings when no error suppression is defined (#9320)
* Log replication warnings when no error suppression is defined

* Add changelog file
2021-01-08 14:03:06 -06:00
Daniel Nephin 45f0afcbf4 structs: Fix printing of IDs
These types are used as values (not pointers) in other structs. Using a pointer receiver causes
problems when the value is printed. fmt will not call the String method if it is passed a value
and the String method has a pointer receiver. By using a value receiver the correct string is printed.

Also remove some unused methods.
2021-01-07 18:47:38 -05:00
Daniel Nephin 27c38bfebb
Merge pull request #9213 from hashicorp/dnephin/resolve-tokens-take-2
acl: Remove some unused things and document delegate method
2021-01-06 18:51:51 -05:00
R.B. Boyer db62541676
acl: use the presence of a management policy in the state store as a sign that we already migrated to v2 acls (#9505)
This way we only have to wait for the serf barrier to pass once before
we can upgrade to v2 acls. Without this patch every restart needs to
re-compute the change, and potentially if a stray older node joins after
a migration it might regress back to v1 mode which would be problematic.
2021-01-05 17:04:27 -06:00
Matt Keeler 3a79b559f9
Special case the error returned when we have a Raft leader but are not tracking it in the ServerLookup (#9487)
This can happen when one other node in the cluster such as a client is unable to communicate with the leader server and sees it as failed. When that happens its failing status eventually gets propagated to the other servers in the cluster and eventually this can result in RPCs returning “No cluster leader” error.

That error is misleading and unhelpful for determing the root cause of the issue as its not raft stability but rather and client -> server networking issue. Therefore this commit will add a new error that will be returned in that case to differentiate between the two cases.
2021-01-04 14:05:23 -05:00
R.B. Boyer 42dea6f01e
server: deletions of intentions by name using the intention API is now idempotent (#9278)
Restoring a behavior inadvertently changed while fixing #9254
2021-01-04 11:27:00 -06:00
Daniel Nephin 088831c91e Maybe fix another data race in a test 2020-12-22 18:53:54 -05:00
Daniel Nephin d0f2eca8de Fix one race caused by t.Parallel 2020-12-22 18:27:18 -05:00
Daniel Nephin c66a63275f
Merge pull request #9340 from hashicorp/dnephin/skip-slow-tests-with-short
testing: skip slow tests with -short
2020-12-11 13:33:44 -05:00
R.B. Boyer f9dcaf7f6b
acl: global tokens created by auth methods now correctly replicate to secondary datacenters (#9351)
Previously the tokens would fail to insert into the secondary's state
store because the AuthMethod field of the ACLToken did not point to a
known auth method from the primary.
2020-12-09 15:22:29 -06:00
Daniel Nephin ef0999547a testing: skip slow tests with -short
Add a skip condition to all tests slower than 100ms.

This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests

With this change:

```
$ time go test -count=1 -short ./agent
ok      github.com/hashicorp/consul/agent       0.743s

real    0m4.791s

$ time go test -count=1 -short ./agent/consul
ok      github.com/hashicorp/consul/agent/consul        4.229s

real    0m8.769s
```
2020-12-07 13:42:55 -05:00
Kyle Havlovitz 57210a59c3 connect: Fix a case where the active root would get unset even when there wasn't a new one 2020-12-02 11:42:23 -08:00
Kyle Havlovitz 91d5d6c586
Merge pull request #9009 from hashicorp/update-secondary-ca
connect: Fix an issue with updating CA config in a secondary datacenter
2020-11-30 14:49:28 -08:00
Kyle Havlovitz c5167cf9c4 Use a buffered channel for CA intermediate renew func 2020-11-30 14:37:24 -08:00
R.B. Boyer 6d6b6c15c6
server: fix panic when deleting a non existent intention (#9254)
* server: fix panic when deleting a non existent intention

* add changelog

* Always return an error when deleting non-existent ixn

Co-authored-by: freddygv <gh@freddygv.xyz>
2020-11-24 13:44:20 -05:00
Hans Hasselberg 25f9e232af add missing descriptions for metrics 2020-11-23 22:06:30 +01:00
Kit Patella 7a8844ccce add entries for missing fsm operations and mark duplicated metrics prefixes as deprecated 2020-11-23 12:42:51 -08:00
Kyle Havlovitz a01f853aa5 Clean up the logic in persistNewRootAndConfig 2020-11-20 15:54:44 -08:00
Kyle Havlovitz 26a9c985c5 Add CA server delegate interface for testing 2020-11-19 20:08:06 -08:00
Kit Patella 4ad076207e add telemetry and definition help entries for missing catalog and acl metrics 2020-11-19 13:29:44 -08:00
Kit Patella 46205bbf27 remove stale entries and rename/define acl.resolveToken 2020-11-19 13:06:28 -08:00
Freddy e4e306210a
Require operator:write to get Connect CA config (#9240)
A vulnerability was identified in Consul and Consul Enterprise (“Consul”) such that operators with `operator:read` ACL permissions are able to read the Consul Connect CA configuration when explicitly configured with the `/v1/connect/ca/configuration` endpoint, including the private key. This allows the user to effectively privilege escalate by enabling the ability to mint certificates for any Consul Connect services. This would potentially allow them to masquerade (receive/send traffic) as any service in the mesh.

--

This PR increases the permissions required to read the Connect CA's private key when it was configured via the `/connect/ca/configuration` endpoint. They are now `operator:write`.
2020-11-19 10:14:48 -07:00
Kyle Havlovitz c8d4a40a87 connect: update some function comments in CA manager 2020-11-17 16:00:19 -08:00
Daniel Nephin b9306d8827 acl: remove a test-only method 2020-11-17 18:16:34 -05:00
Daniel Nephin 9e7c8dd19d Remove two unused delegate methods 2020-11-17 18:16:26 -05:00
Matt Keeler 4bca029be9
Refactor to call non-voting servers read replicas (#9191)
Co-authored-by: Kit Patella <kit@jepsen.io>
2020-11-17 10:53:57 -05:00
Kit Patella 4dfcdbab26
Merge pull request #9198 from hashicorp/mkcp/telemetry/add-all-metric-definitions
Add metric definitions for all metrics known at Consul start
2020-11-16 15:54:50 -08:00
Matt Keeler 197a37a860
Prevent panic if autopilot health is requested prior to leader establishment finishing. (#9204) 2020-11-16 17:08:17 -05:00
Daniel Nephin de88ceed1c
Merge pull request #9114 from hashicorp/dnephin/filtering-in-stream
stream: improve naming of Payload methods
2020-11-16 14:20:07 -05:00
Kit Patella 0b18f5612e trim help strings to save a few bytes 2020-11-16 11:02:11 -08:00
Kit Patella 374748dafc merge master 2020-11-16 10:46:53 -08:00
Kit Patella af719981f3 finish adding static server metrics 2020-11-13 16:26:08 -08:00
Kyle Havlovitz 0a86533e20 Reorganize some CA manager code for correctness/readability 2020-11-13 14:46:01 -08:00
Kyle Havlovitz 5de81c1375 connect: Add CAManager for synchronizing CA operations 2020-11-13 14:33:44 -08:00
Kyle Havlovitz 0b4876f906 connect: Add logic for updating secondary DC intermediate on config set 2020-11-13 14:33:44 -08:00
R.B. Boyer db1184c094
server: intentions CRUD requires connect to be enabled (#9194)
Fixes #9123
2020-11-13 16:19:12 -06:00
Kit Patella b486c1bce8 add the service name in the agent rather than in the definitions themselves 2020-11-13 13:18:04 -08:00
R.B. Boyer e323014faf
server: remove config entry CAS in legacy intention API bridge code (#9151)
Change so line-item intention edits via the API are handled via the state store instead of via CAS operations.

Fixes #9143
2020-11-13 14:42:21 -06:00
R.B. Boyer 6300abed18
server: skip deleted and deleting namespaces when migrating intentions to config entries (#9186) 2020-11-13 13:56:41 -06:00
Mike Morris a343365da7
ci: update to Go 1.15.4 and alpine:3.12 (#9036)
* ci: stop building darwin/386 binaries

Go 1.15 drops support for 32-bit binaries on Darwin https://golang.org/doc/go1.15#darwin

* tls: ConnectionState::NegotiatedProtocolIsMutual is deprecated in Go 1.15, this value is always true

* correct error messages that changed slightly

* Completely regenerate some TLS test data

Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2020-11-13 13:02:59 -05:00
R.B. Boyer 758384893d
server: break up Intention.Apply monolithic method (#9007)
The Intention.Apply RPC is quite large, so this PR attempts to break it down into smaller functions and dissolves the pre-config-entry approach to the breakdown as it only confused things.
2020-11-13 09:15:39 -06:00
Kit Patella 9533372ded first pass on agent-configured prometheusDefs and adding defs for every consul metric 2020-11-12 18:12:12 -08:00
R.B. Boyer a5bd1ba323
agent: return the default ACL policy to callers as a header (#9101)
Header is: X-Consul-Default-ACL-Policy=<allow|deny>

This is of particular utility when fetching matching intentions, as the
fallthrough for a request that doesn't match any intentions is to
enforce using the default acl policy.
2020-11-12 10:38:32 -06:00
Matt Keeler 2badb01d30
Add a paramter in state store methods to indicate whether a resource insertion is from a snapshot restoration (#9156)
The Catalog, Config Entry, KV and Session resources potentially re-validate the input as its coming in. We need to prevent snapshot restoration failures due to missing namespaces or namespaces that are being deleted in enterprise.
2020-11-11 11:21:42 -05:00
Matt Keeler 1f40f51a58
Fix a bunch of linter warnings 2020-11-09 09:22:12 -05:00
Matt Keeler 755fb72994
Switch to using the external autopilot module 2020-11-09 09:22:11 -05:00
Daniel Nephin e4a78c977d stream: document that Payload must be immutable
If they are sent to EventPublisher.Publish.

Also document that PayloadEvents is expected to come from a subscription and that it is
not immutable.
2020-11-06 13:00:33 -05:00
Daniel Nephin 4fc073b1f4 stream: rename FilterByKey 2020-11-05 19:21:16 -05:00
Daniel Nephin d4cd2fa6a8 stream: Add HasReadPermission to Payload
Required now that filter is a method on PayloadEvents instead of Event
2020-11-05 19:17:18 -05:00
Daniel Nephin 8a26bca020 stream: move event filtering to PayloadEvents
Removes the weirdness around PayloadEvents.FilterByKey
2020-11-05 17:50:17 -05:00
Daniel Nephin dcacfd3548 stream: Remove unused method 2020-11-05 16:49:59 -05:00
Daniel Nephin 621f1db766
Merge pull request #9073 from hashicorp/dnephin/backport-streaming-namespaces
streaming: backport namespace changes
2020-11-05 14:19:10 -05:00
Daniel Nephin cd220e5d6c
Merge pull request #9061 from hashicorp/dnephin/event-fields
stream: support filtering by namespace
2020-11-05 14:18:35 -05:00
Daniel Nephin f6b629852f state: test EventPayloadCheckServiceNode.FilterByKey
Also fix a bug in that function when only one of key or namespace were the empty string.
2020-10-30 14:35:57 -04:00
Daniel Nephin 60df44df4f stream: Add tests for filterByKey with namespace
And fix a bug where a request with a Namespace but no Key would not be properly filtered
2020-10-30 14:35:42 -04:00
Daniel Nephin 318dfbe6e4 stream: Move FilterByKey events to a table
In preparation for adding new tests.
2020-10-30 14:35:28 -04:00
Daniel Nephin 2d0030da39 state: use enterprise meta for creating events 2020-10-30 14:34:04 -04:00
Daniel Nephin b57c7afcbb stream: include the namespace in the snap cache key
Otherwise the wrong snapshot could be returned when the same key is used in different namespaces
2020-10-30 14:34:04 -04:00
Daniel Nephin 8da30fcb9a subscribe: set the request namespace 2020-10-30 14:34:04 -04:00
R.B. Boyer 67a0d0c426
state: ensure we unblock intentions queries upon the upgrade to config entries (#9062)
1. do a state store query to list intentions as the agent would do over in `agent/proxycfg` backing `agent/xds`
2. upgrade the database and do a fresh `service-intentions` config entry write
3. the blocking query inside of the agent cache in (1) doesn't notice (2)
2020-10-29 15:28:31 -05:00
R.B. Boyer 78014653b3 restore prior signature of test helper so enterprise compiles 2020-10-29 13:52:15 -05:00
Daniel Nephin 61ce0964a4 stream: remove Event.Key
Makes Payload a type with FilterByKey so that Payloads can implement
filtering by key. With this approach we don't need to expose a Namespace
field on Event, and we don't need to invest micro formats or require a
bunch of code to be aware of exactly how the key field is encoded.
2020-10-28 16:48:04 -04:00
Daniel Nephin 8ef4c0fcc5 state: use go-cmp for comparison
The output of the previous assertions made it impossible to debug the tests without code changes.

With go-cmp comparing the entire slice we can see the full diffs making it easier to debug failures.
2020-10-28 16:33:00 -04:00
Daniel Nephin 44da869ed4 stream: Use a no-op event publisher if streaming is disabled 2020-10-28 13:54:19 -04:00
Daniel Nephin eea87e1acf store: use a ReadDB for snapshots
to remove the cyclic dependency between the snapshot handlers and the state.Store
2020-10-28 13:07:42 -04:00
Daniel Nephin cfe0ffde15
Merge pull request #9026 from hashicorp/dnephin/streaming-without-cache-query-param
streaming: rename config and remove requirement for cache=1
2020-10-28 12:33:25 -04:00
Daniel Nephin 03d2be03e7
Merge pull request #8618 from hashicorp/dnephin/remove-txn-readtxn
state: Use ReadTxn everywhere
2020-10-28 12:32:47 -04:00
Daniel Nephin abd8cfcfe9 state: disable streaming connect topic 2020-10-26 11:49:47 -04:00
R.B. Boyer 0a80e82f21
server: config entry replication now correctly uses namespaces in comparisons (#9024)
Previously config entries sharing a kind & name but in different
namespaces could occasionally cause "stuck states" in replication
because the namespace fields were ignored during the differential
comparison phase.

Example:

Two config entries written to the primary:

    kind=A,name=web,namespace=bar
    kind=A,name=web,namespace=foo

Under the covers these both get saved to memdb, so they are sorted by
all 3 components (kind,name,namespace) during natural iteration. This
means that before the replication code does it's own incomplete sort,
the underlying data IS sorted by namespace ascending (bar comes before
foo).

After one pass of replication the primary and secondary datacenters have
the same set of config entries present. If
"kind=A,name=web,namespace=bar" were to be deleted, then things get
weird. Before replication the two sides look like:

primary: [
    kind=A,name=web,namespace=foo
]
secondary: [
    kind=A,name=web,namespace=bar
    kind=A,name=web,namespace=foo
]

The differential comparison phase walks these two lists in sorted order
and first compares "kind=A,name=web,namespace=foo" vs
"kind=A,name=web,namespace=bar" and falsely determines they are the SAME
and are thus cause an update of "kind=A,name=web,namespace=foo". Then it
compares "<nothing>" with "kind=A,name=web,namespace=foo" and falsely
determines that the latter should be DELETED.

During reconciliation the deletes are processed before updates, and so
for a brief moment in the secondary "kind=A,name=web,namespace=foo" is
erroneously deleted and then immediately restored.

Unfortunately after this replication phase the final state is identical
to the initial state, so when it loops around again (rate limited) it
repeats the same set of operations indefinitely.
2020-10-23 13:41:54 -05:00
Daniel Nephin f9b2834171 state: convert the remaining functions to ReadTxn
Required also converting some of the transaction functions to WriteTxn
because TxnRO() called the same helper as TxnRW.

This change allows us to return a memdb.Txn for read-only txn instead of
wrapping them with state.txn.
2020-10-23 14:29:22 -04:00
Daniel Nephin 26387cdc0e
Merge pull request #8975 from hashicorp/dnephin/stream-close-on-unsub
stream: close the subscription on Unsubscribe
2020-10-23 12:58:12 -04:00
Freddy d23038f94f
Add HasExact to topology endpoint (#9010) 2020-10-23 10:45:41 -06:00
Daniel Nephin fb8b68a6ec stream: close the subscription on Unsubscribe 2020-10-22 13:39:27 -04:00
Pierre Souchay 54f9f247f8
Consul Service meta wrongly computes and exposes non_voter meta (#8731)
* Consul Service meta wrongly computes and exposes non_voter meta

In Serf Tags, entreprise members being non-voters use the tag
`nonvoter=1`, not `non_voter = false`, so non-voters in members
were wrongly displayed as voter.

Demonstration:

```
consul members -detailed|grep voter
consul20-hk5 10.200.100.110:8301   alive   acls=1,build=1.8.4+ent,dc=hk5,expect=3,ft_fs=1,ft_ns=1,id=xxxxxxxx-5629-08f2-3a79-10a1ab3849d5,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
```

* Added changelog

* Added changelog entry
2020-10-09 17:18:24 -04:00
s-christoff a62705101f
Enhance the output of consul snapshot inspect (#8787) 2020-10-09 14:57:29 -05:00
Kyle Havlovitz 707f4a8d26 Stop intermediate renew routine on leader stop 2020-10-09 12:30:57 -07:00
Kyle Havlovitz 926a393a5c
Merge pull request #8784 from hashicorp/renew-intermediate-primary
connect: Enable renewing the intermediate cert in the primary DC
2020-10-09 12:18:59 -07:00
Daniel Nephin dd0e8d42c4
Merge pull request #8825 from hashicorp/streaming/add-config
streaming: add config and docs
2020-10-09 14:33:58 -04:00
Chris Piraino 4f77f87065
Emit service usage metrics with correct labeling strategy (#8856)
Previously, we would emit service usage metrics both with and without a
namespace label attached. This is problematic in the case when you want
to aggregate metrics together, i.e. "sum(consul.state.services)". This
would cause services to be counted twice in that aggregate, once via the
metric emitted with a namespace label, and once in the metric emited
without any namespace label.
2020-10-09 11:01:45 -05:00
Kyle Havlovitz 50543d678e Fix intermediate refresh test comments 2020-10-09 08:53:33 -07:00
R.B. Boyer d2f09ca306
upstream some differences from enterprise (#8902) 2020-10-09 09:42:53 -05:00
Kyle Havlovitz 968fd8660d Update CI for leader renew CA test using Vault 2020-10-09 05:48:15 -07:00
Kyle Havlovitz 62270c3f9a
Merge branch 'master' into renew-intermediate-primary 2020-10-09 04:40:34 -07:00
Kyle Havlovitz b78f618beb connect: Check for expired root cert when cross-signing 2020-10-09 04:35:56 -07:00
Freddy 89d52f41c4
Add protocol to the topology endpoint response (#8868) 2020-10-08 17:31:54 -06:00
Matt Keeler 141eb60f06
Add per-agent reconnect timeouts (#8781)
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.

This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
2020-10-08 15:02:19 -04:00
Daniel Nephin 05df7b18a9 config: add field for enabling streaming RPC endpoint 2020-10-08 12:11:20 -04:00
Freddy de4af766f3
Support ingress gateways in mesh viz endpoint (#8864)
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2020-10-08 09:47:09 -06:00
Daniel Nephin a94fe054f0
Merge pull request #8809 from hashicorp/streaming/materialize-view
Add StreamingHealthServices cache-type
2020-10-07 21:26:38 -04:00
Daniel Nephin e0236b5a9f
Merge pull request #8818 from hashicorp/streaming/add-subscribe-service-batch-events
stream: handle batch events as a special case of Event
2020-10-07 21:25:32 -04:00
Daniel Nephin 783627aeef
Merge pull request #8768 from hashicorp/streaming/add-subscribe-service
subscribe: add subscribe service for streaming change events
2020-10-07 21:24:03 -04:00
Freddy 7d1f50d2e6
Return intention info in svc topology endpoint (#8853) 2020-10-07 18:35:34 -06:00
R.B. Boyer 35c4efd220
connect: support defining intentions using layer 7 criteria (#8839)
Extend Consul’s intentions model to allow for request-based access control enforcement for HTTP-like protocols in addition to the existing connection-based enforcement for unspecified protocols (e.g. tcp).
2020-10-06 17:09:13 -05:00
R.B. Boyer d6dce2332a
connect: intentions are now managed as a new config entry kind "service-intentions" (#8834)
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.

- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.

- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.

- Add a new serf feature flag indicating support for
intentions-as-config-entries.

- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.

- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.

- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition it met
the old intentions replicator ceases.

- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
it's old state store table it waits until all config entries up to that
point are replicated and then zeroes its own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
2020-10-06 13:24:05 -05:00
Daniel Nephin 83401194ab streaming: improve godoc for cache-type
And fix a bug where any error that implemented the temporary interface was considered
a temporary error, even when the method would return false.
2020-10-06 13:52:02 -04:00
Daniel Nephin f857aef4a8 submatview: add a test for handling of NewSnapshotToFollow
Also add some godoc
Rename some vars and functions
Fix a data race in the new cache test for entry closing.
2020-10-06 13:22:02 -04:00
Daniel Nephin ad29cf4f94 stream: Return a single event from a subscription.Next
Handle batch events as a single event
2020-10-06 13:18:20 -04:00
Daniel Nephin fa115c6249 Move agent/subscribe -> agent/rpc/subscribe 2020-10-06 12:49:35 -04:00
Daniel Nephin 011109a6f6 subscirbe: extract streamID and logging from Subscribe
By extracting all of the tracing logic the core logic of the Subscribe
endpoint is much easier to read.
2020-10-06 12:49:35 -04:00
Daniel Nephin 4c4441997a subscribe: add integration test for acl token updates 2020-10-06 12:49:35 -04:00
Daniel Nephin 371ec2d70a subscribe: add a stateless subscribe service for the gRPC server
With a Backend that provides access to the necessary dependencies.
2020-10-06 12:49:35 -04:00
Daniel Nephin ae433947a4
Merge pull request #8799 from hashicorp/streaming/rename-framing-events
stream: remove EndOfEmptySnapshot, add NewSnapshotToFollow
2020-10-06 12:42:58 -04:00
R.B. Boyer a77b518542
server: create new memdb table for storing system metadata (#8703)
This adds a new very tiny memdb table and corresponding raft operation
for updating a very small effective map[string]string collection of
"system metadata". This can persistently record a fact about the Consul
state machine itself.

The first use of this feature will come in a later PR.
2020-10-06 10:08:37 -05:00
Daniel Nephin 2706cf9b2a
Merge pull request #8802 from hashicorp/dnephin/extract-lib-retry
lib/retry - extract a new package from lib/retry.go
2020-10-05 14:22:37 -04:00
freddygv 82a17ccee6 Do not evaluate discovery chain for topology upstreams 2020-10-05 10:24:50 -06:00
freddygv 63c50e15bc Single DB txn for ServiceTopology and other PR comments 2020-10-05 10:24:50 -06:00
freddygv 263bd9dd92 Add topology HTTP endpoint 2020-10-05 10:24:50 -06:00
freddygv 7c11580e93 Add topology RPC endpoint 2020-10-05 10:24:50 -06:00
freddygv 21c4708fe9 Add topology ACL filter 2020-10-05 10:24:50 -06:00
freddygv ac54bf99b3 Add func to combine up+downstream queries 2020-10-05 10:24:50 -06:00
freddygv 160a6539d1 factor in discovery chain when querying up/downstreams 2020-10-05 10:24:50 -06:00
freddygv 214b25919f support querying upstreams/downstreams from registrations 2020-10-05 10:24:50 -06:00
freddygv 3653045cb0 Add method for downstreams from disco chain 2020-10-05 10:24:50 -06:00
Daniel Nephin 40aac46cf4 lib/retry: Refactor to reduce the interface surface
Reduce Jitter to one function

Rename NewRetryWaiter

Fix a bug in calculateWait where maxWait was applied before jitter, which would make it
possible to wait longer than maxWait.
2020-10-04 18:12:42 -04:00
Daniel Nephin 0c7f9c72d7 lib/retry: extract a new package from lib 2020-10-04 17:43:01 -04:00
Daniel Nephin 9c5181c897 stream: full test coverage for EventPublisher.Subscribe 2020-10-02 13:46:24 -04:00
Daniel Nephin 0769f54fe1 stream: refactor to support change in framing events
Removing EndOfEmptySnapshot, add NewSnapshotToFollow
2020-10-02 13:41:31 -04:00
Daniel Nephin 5ef630f664
Merge pull request #8769 from hashicorp/streaming/prep-for-subscribe-service
state: use protobuf Topic and and export payload type
2020-10-02 13:30:06 -04:00
R.B. Boyer e84d52ba3a
ensure these tests work fine with namespaces in enterprise (#8794) 2020-10-01 09:54:46 -05:00
R.B. Boyer ccd0200bd9
server: ensure that we also shutdown network segment serf instances on server shutdown (#8786)
This really only matters for unit tests, since typically if an agent shuts down its server, it follows that up by exiting the process, which would also clean up all of the networking anyway.
2020-09-30 16:23:43 -05:00
Kyle Havlovitz 2956313f2d connect: Enable renewing the intermediate cert in the primary DC 2020-09-30 12:31:21 -07:00
freddygv ec6e8021c0 Resolve conflicts 2020-09-29 08:59:18 -06:00
Daniel Nephin d192b0a080 stream: move goroutine out of New
This change will make it easier to manage goroutine lifecycle from the caller.

Also expose EventPublisher from state.Store
2020-09-28 18:40:10 -04:00
Daniel Nephin e345c8d8a6 state: use pbsubscribe.Topic for topic values 2020-09-28 18:40:10 -04:00
Daniel Nephin 6e592ec485 state: rename and export EventPayload
The subscribe endpoint needs to be able to inspect the payload to filter
events, and convert them into the protobuf types.

Use the protobuf CatalogOp type for the operation field, for now. In the
future if we end up with multiple interfaces we should be able to remove
the protobuf dependency by changing this to an int32 and adding a test
for the mapping between the values.

Make the value of the payload a concrete type instead of interface{}. We
can create other payloads for other event types.
2020-09-28 18:34:30 -04:00
R.B. Boyer 45609fccdf
server: make sure that the various replication loggers use consistent logging (#8745) 2020-09-24 15:49:38 -05:00
Daniel Nephin 4b041a018d grpc: redeuce dependencies, unexport, and add godoc
Rename GRPCClient to ClientConnPool. This type appears to be more of a
conn pool than a client. The clients receive the connections from this
pool.

Reduce some dependencies by adjusting the interface baoundaries.

Remove the need to create a second slice of Servers, just to pick one and throw the rest away.

Unexport serverResolver, it is not used outside the package.

Use a RWMutex for ServerResolverBuilder, some locking is read-only.

Add more godoc.
2020-09-24 12:53:10 -04:00
Daniel Nephin 4b24470887 grpc: move client conn pool to grpc package 2020-09-24 12:48:12 -04:00
Daniel Nephin fad15171ec grpc: client conn pool and resolver
Extracted from 936522a13c07e8b732b6fde61bba23d05f7b9a70

Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-24 12:46:22 -04:00
Daniel Nephin e0119a6e92
Merge pull request #8680 from hashicorp/dnephin/replace-consul-opts-with-base-deps
agent: Repalce ConsulOptions with a new struct from agent.BaseDeps
2020-09-24 12:45:54 -04:00
Paul Banks 0594667c3a
Fix bad int -> string conversions caught by go vet changes in 1.15 (#8739) 2020-09-24 11:14:07 +01:00
Hans Hasselberg c6fa758d6f
fix TestLeader_SecondaryCA_IntermediateRenew (#8702)
* fix lessThanHalfTime
* get lock for CAProvider()
* make a var to relate both vars
* rename to getCAProviderWithLock
* move CertificateTimeDriftBuffer to agent/connect/ca
2020-09-18 10:13:29 +02:00
Mike Morris fe984b3ee3
test: update tags for database service registrations and queries (#8693) 2020-09-16 14:05:01 -04:00
Kyle Havlovitz c8fd61abc7 Merge branch 'master' into vault-ca-renew-token 2020-09-15 14:39:04 -07:00
Daniel Nephin c621b4a420 agent/consul: pass dependencies directly from agent
In an upcoming change we will need to pass a grpc.ClientConnPool from
BaseDeps into Server. While looking at that change I noticed all of the
existing consulOption fields are already on BaseDeps.

Instead of duplicating the fields, we can create a struct used by
agent/consul, and use that struct in BaseDeps. This allows us to pass
along dependencies without translating them into different
representations.

I also looked at moving all of BaseDeps in agent/consul, however that
created some circular imports. Resolving those cycles wouldn't be too
bad (it was only an error in agent/consul being imported from
cache-types), however this change seems a little better by starting to
introduce some structure to BaseDeps.

This change is also a small step in reducing the scope of Agent.

Also remove some constants that were only used by tests, and move the
relevant comment to where the live configuration is set.

Removed some validation from NewServer and NewClient, as these are not
really runtime errors. They would be code errors, which will cause a
panic anyway, so no reason to handle them specially here.
2020-09-15 17:29:32 -04:00
Daniel Nephin 0536b2047e agent/consul: make router required 2020-09-15 17:26:26 -04:00
Kyle Havlovitz 63d3a5fc1f Clean up CA shutdown logic and error 2020-09-15 12:28:58 -07:00
freddygv 43efb4809c Merge master 2020-09-14 16:17:43 -06:00
Daniel Nephin 75515f3431
Merge pull request #8587 from hashicorp/streaming/add-grpc-server
streaming: add gRPC server for handling connections
2020-09-14 15:24:54 -04:00
freddygv 33af8dab9a Resolve conflicts against master 2020-09-11 18:41:58 -06:00
Kyle Havlovitz 1595add842 Clean up Vault renew tests and shutdown 2020-09-11 08:41:05 -07:00
freddygv 5871b667a5 Revert EnvoyConfig nesting 2020-09-11 09:21:43 -06:00
Kyle Havlovitz 7588e22739 Add a stop function to make sure the renewer is shut down on leader change 2020-09-10 06:12:48 -07:00
Kyle Havlovitz 1c57b72a9f Add a test for token renewal 2020-09-09 16:36:37 -07:00
Daniel Nephin 863a9df951 server: add gRPC server for streaming events
Includes a stats handler and stream interceptor for grpc metrics.

Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-08 12:10:41 -04:00
Hans Hasselberg 51f079dcdd
secondaryIntermediateCertRenewalWatch abort on success (#8588)
secondaryIntermediateCertRenewalWatch was using `retryLoopBackoff` to
renew the intermediate certificate. Once it entered the inner loop and
started `retryLoopBackoff` it would never leave that.
`retryLoopBackoffAbortOnSuccess` will return when renewing is
successful, like it was intended originally.
2020-09-04 11:47:16 +02:00
Daniel Nephin c17a5b0628 state: handle terminating gateways in service health events 2020-09-03 16:58:05 -04:00
Daniel Nephin b241debee7 state: improve comments in catalog_events.go
Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-03 16:58:05 -04:00
Daniel Nephin 870823e8ed state: use changeType in serviceChanges
To be a little more explicit, instead of nil implying an indirect change
2020-09-03 16:58:05 -04:00
Daniel Nephin 68682e7e83 don't over allocate slice 2020-09-03 16:58:04 -04:00
Daniel Nephin 5f52220f53 state: fix a bug in building service health events
The nodeCheck slice was being used as the first arg in append, which in some cases will modify the array backing the slice. This would lead to service checks for other services in the wrong event.

Also refactor some things to reduce the arguments to functions.
2020-09-03 16:58:04 -04:00
Daniel Nephin c61313b78a state: Remove unused args and return values
Also rename some functions to identify them as constructors for events
2020-09-03 16:58:04 -04:00
Daniel Nephin 668b98bcce state: use an enum for tracking node changes 2020-09-03 16:58:04 -04:00
Daniel Nephin 7c3c627028 state: serviceHealthSnapshot
refactored to remove unused return value and remove duplication
2020-09-03 16:58:04 -04:00
Daniel Nephin fdfe176deb state: Add Change processor and snapshotter for service health
Co-authored-by: Paul Banks <banks@banksco.de>
2020-09-03 16:58:04 -04:00
Daniel Nephin 6a1a43721d state: fix bug in changeTrackerDB.publish
Creating a new readTxn does not work because it will not see the newly created objects that are about to be committed. Instead use the active write Txn.
2020-09-03 16:58:01 -04:00
Daniel Nephin 81cc3daf69 stream: have SnapshotFunc accept a non-pointer SubscribeRequest
The value is not expected to be modified. Passing a value makes that explicit.
2020-09-03 16:54:02 -04:00
freddygv 56fdae9ace Update resolver defaulting 2020-09-03 13:08:44 -06:00
freddygv 02d6acd8fc Ensure resolver node with LB isn't considered default 2020-09-03 08:55:57 -06:00
freddygv daad3b9210 Remove LB infix and move injection to xds 2020-09-02 15:13:50 -06:00
Chris Piraino df1381f77f
Merge pull request #8603 from hashicorp/feature/usage-metrics
Track node and service counts in the state store and emit them periodically as metrics
2020-09-02 13:23:39 -05:00
R.B. Boyer 4197bed23b
connect: fix bug in preventing some namespaced config entry modifications (#8601)
Whenever an upsert/deletion of a config entry happens, within the open
state store transaction we speculatively test compile all discovery
chains that may be affected by the pending modification to verify that
the write would not create an erroneous scenario (such as splitting
traffic to a subset that did not exist).

If a single discovery chain evaluation references two config entries
with the same kind and name in different namespaces then sometimes the
upsert/deletion would be falsely rejected. It does not appear as though
this bug would've let invalid writes through to the state store so the
correction does not require a cleanup phase.
2020-09-02 10:47:19 -05:00
Chris Piraino b245d60200 Set metrics reporting interval to 9 seconds
This is below the 10 second interval that lib/telemetry.go implements as
its aggregation interval, ensuring that we always report these metrics.
2020-09-02 10:24:23 -05:00
Chris Piraino e9b397005c Update godoc string for memdb wrapper functions/structs 2020-09-02 10:24:22 -05:00
Chris Piraino 80f923a47a Refactor state store usage to track unique service names
This commit refactors the state store usage code to track unique service
name changes on transaction commit. This means we only need to lookup
usage entries when reading the information, as opposed to iterating over
a large number of service indices.

- Take into account a service instance's name being changed
- Do not iterate through entire list of service instances, we only care
about whether there is 0, 1, or more than 1.
2020-09-02 10:24:21 -05:00
Chris Piraino 79e6534345 Use ReadTxn interface in state store helper functions 2020-09-02 10:24:20 -05:00
Chris Piraino d90d95421d Add WriteTxn interface and convert more functions to ReadTxn
We add a WriteTxn interface for use in updating the usage memdb table,
with the forward-looking prospect of incrementally converting other
functions to accept interfaces.

As well, we use the ReadTxn in new usage code, and as a side effect
convert a couple of existing functions to use that interface as well.
2020-09-02 10:24:19 -05:00
Chris Piraino 45a4057f60 Report node/service usage metrics from every server
Using the newly provided state store methods, we periodically emit usage
metrics from the servers.

We decided to emit these metrics from all servers, not just the leader,
because that means we do not have to care about leader election flapping
causing metrics turbulence, and it seems reasonable for each server to
emit its own view of the state, even if they should always converge
rapidly.
2020-09-02 10:24:17 -05:00
Chris Piraino 3af96930eb Add new usage memdb table that tracks usage counts of various elements
We update the usage table on Commit() by using the TrackedChanges() API
of memdb.

Track memdb changes on restore so that usage data can be compiled
2020-09-02 10:24:16 -05:00
freddygv d7bda050e0 Restructure structs and other PR comments 2020-09-02 09:10:50 -06:00
Matt Keeler 335c604ced
Merge of auto-config and auto-encrypt code (#8523)
auto-encrypt is now handled as a special case of auto-config.

This also is moving all the cert-monitor code into the auto-config package.
2020-08-31 13:12:17 -04:00
freddygv afb14b6705 Compile down LB policy to disco chain nodes 2020-08-28 13:11:04 -06:00
Daniel Nephin 845661c8af
Merge pull request #8548 from edevil/fix_flake
Fix flaky TestACLResolver_Client/Concurrent-Token-Resolve
2020-08-28 15:10:55 -04:00
R.B. Boyer f2b8bf109c
xds: use envoy's rbac filter to handle intentions entirely within envoy (#8569) 2020-08-27 12:20:58 -05:00
Matt Keeler 106e1d50bd
Move RPC router from Client/Server and into BaseDeps (#8559)
This will allow it to be a shared component which is needed for AutoConfig
2020-08-27 11:23:52 -04:00
André Cruz 673bd69f36
Decrease test flakiness
Fix flaky TestACLResolver_Client/Concurrent-Token-Resolve and TestCacheNotifyPolling
2020-08-24 20:30:02 +01:00
André Cruz a64686fab6
testing: Fix govet errors 2020-08-21 18:01:55 +01:00
Hans Hasselberg 02de4c8b76
add primary keys to list keyring (#8522)
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully setup as primary key.

Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:

```json
[
  {
    "WAN": true,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "NumNodes": 6
  },
  {
    "WAN": false,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  },
  {
    "WAN": false,
    "Datacenter": "dc1",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  }
]
```

I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output

Fixes #3393.
2020-08-18 09:50:24 +02:00
Daniel Nephin 8d35e37b3c testing: Remove all the defer os.Removeall
Now that testutil uses t.Cleanup to remove the directory the caller no longer has to manage
the removal
2020-08-14 19:58:53 -04:00
Daniel Nephin 629c34085d state: remove unused Store method receiver
And use ReadTxn interface where appropriate.
2020-08-13 11:25:22 -04:00
Daniel Nephin fc797a279a
Merge pull request #8461 from hashicorp/dnephin/remove-notify-shutdown
agent/consul: Remove NotifyShutdown
2020-08-13 11:16:48 -04:00
Daniel Nephin d8ffcd5686
Merge pull request #8365 from hashicorp/dnephin/fix-service-by-node-meta-flake
state: speed up tests that use watchLimit
2020-08-13 11:16:12 -04:00
R.B. Boyer 63422ca9c5
connect: use stronger validation that ingress gateways have compatible protocols defined for their upstreams (#8470)
Fixes #8466

Since Consul 1.8.0 there was a bug in how ingress gateway protocol
compatibility was enforced. At the point in time that an ingress-gateway
config entry was modified the discovery chain for each upstream was
checked to ensure the ingress gateway protocol matched. Unfortunately
future modifications of other config entries were not validated against
existing ingress-gateway definitions, such as:

1. create tcp ingress-gateway pointing to 'api' (ok)
2. create service-defaults for 'api' setting protocol=http (worked, but not ok)
3. create service-splitter or service-router for 'api' (worked, but caused an agent panic)

If you were to do these in a different order, it would fail without a
crash:

1. create service-defaults for 'api' setting protocol=http (ok)
2. create service-splitter or service-router for 'api' (ok)
3. create tcp ingress-gateway pointing to 'api' (fail with message about
   protocol mismatch)

This PR introduces the missing validation. The two new behaviors are:

1. create tcp ingress-gateway pointing to 'api' (ok)
2. (NEW) create service-defaults for 'api' setting protocol=http ("ok" for back compat)
3. (NEW) create service-splitter or service-router for 'api' (fail with
   message about protocol mismatch)

In consideration for any existing users that may be inadvertently be
falling into item (2) above, that is now officiall a valid configuration
to be in. For anyone falling into item (3) above while you cannot use
the API to manufacture that scenario anymore, anyone that has old (now
bad) data will still be able to have the agent use them just enough to
generate a new agent/proxycfg error message rather than a panic.
Unfortunately we just don't have enough information to properly fix the
config entries.
2020-08-12 11:19:20 -05:00
Hans Hasselberg 7a6d916ddc
Merge pull request #8471 from hashicorp/local_only
thread local-only through the layers
2020-08-12 08:54:51 +02:00
Freddy 50fee12d62
Internal endpoint to query intentions associated with a gateway (#8400) 2020-08-11 17:20:41 -06:00
Kyle Havlovitz 8118e3db40 Fix a state store comment about version 2020-08-11 13:46:12 -07:00
Kyle Havlovitz 2601585017 fsm: Fix snapshot bug with restoring node/service/check indexes 2020-08-11 11:49:52 -07:00
Hans Hasselberg e0297b6e99 Refactor keyring ops:
* changes some functions to return data instead of modifying pointer
  arguments
* renames globalRPC() to keyringRPCs() to make its purpose more clear
* restructures KeyringOperation() to make it more understandable
2020-08-11 13:42:03 +02:00
freddygv 6dcfa11c21 Update error handling 2020-08-10 17:48:22 -06:00
Daniel Nephin bef9348ca8 testing: remove unnecessary defers in tests
The data directory is now removed by the test helper that created it.
2020-08-07 17:28:16 -04:00
Daniel Nephin f3b63514d5 testing: Remove NotifyShutdown
NotifyShutdown was only used for testing. Now that t.Cleanup exists, we
can use that instead of attaching cleanup to the Server shutdown.

The Autopilot test which used NotifyShutdown doesn't need this
notification because Shutdown is synchronous. Waiting for the function
to return is equivalent.
2020-08-07 17:14:44 -04:00
Hans Hasselberg fdceb24323
auto_config implies connect (#8433) 2020-08-07 12:02:02 +02:00
freddygv 83f4e32376 PR comments and addtl tests 2020-08-05 16:07:11 -06:00
Daniel Nephin 061ae94c63 Rename NewClient/NewServer
Now that duplicate constructors have been removed we can use the shorter names for the single constructor.
2020-08-05 14:00:55 -04:00
Daniel Nephin e6c94c1411 Remove LogOutput from Server 2020-08-05 14:00:44 -04:00
Daniel Nephin fdf966896f Remove LogOutput from Client 2020-08-05 14:00:42 -04:00
Daniel Nephin 73493ca01b Pass a logger to ConnPool and yamux, instead of an io.Writer
Allowing us to remove the LogOutput field from config.
2020-08-05 13:25:08 -04:00
Daniel Nephin c7c941811d config: Remove unused field 2020-08-05 13:25:08 -04:00
freddygv c87af29506 collect GatewayServices from iter in a function 2020-07-31 13:30:40 -06:00
Freddy 7c2c8815d7
Avoid panics during shutdown routine (#8412) 2020-07-30 11:11:10 -06:00
freddygv 94d1f0a310 end to end changes to pass gatewayservices to /ui/services/ 2020-07-30 10:21:11 -06:00
Matt Keeler 76add4f24c
Allow setting verify_incoming* when using auto_encrypt or auto_config (#8394)
Ensure that enabling AutoConfig sets the tls configurator properly

This also refactors the TLS configurator a bit so the naming doesn’t imply only AutoEncrypt as the source of the automatically setup TLS cert info.
2020-07-30 10:15:12 -04:00
Matt Keeler dad0f189a2
Agent Auto Config: Implement Certificate Generation (#8360)
Most of the groundwork was laid in previous PRs between adding the cert-monitor package to extracting the logic of signing certificates out of the connect_ca_endpoint.go code and into a method on the server.

This also refactors the auto-config package a bit to split things out into multiple files.
2020-07-28 15:31:48 -04:00
Matt Keeler 3a1058a06b
Move connect root retrieval and cert signing logic out of the RPC endpoints (#8364)
The code now lives on the Server type itself. This was done so that all of this could be shared with auto config certificate signing.
2020-07-24 10:00:51 -04:00
Matt Keeler e7d8a02ae8
Move generation of the CA Configuration from the agent code into a method on the RuntimeConfig (#8363)
This allows this to be reused elsewhere.
2020-07-23 16:05:28 -04:00
Daniel Nephin 597dcf2bfb
Merge pull request #8323 from hashicorp/dnephin/add-event-publisher-2
stream: close subscriptions on shutdown
2020-07-23 13:12:50 -04:00
Matt Keeler c3e7d689b7
Refactor the agentpb package (#8362)
First move the whole thing to the top-level proto package name.

Secondly change some things around internally to have sub-packages.
2020-07-23 11:24:20 -04:00
Daniel Nephin decba06b7d stream: close all subs when EventProcessor is shutdown. 2020-07-22 19:04:10 -04:00
Daniel Nephin e802689bbe stream: fix overallocation in filter
And add tests
2020-07-22 19:04:10 -04:00
Daniel Nephin f64725f7aa state: speed up TestStateStore_ServicesByNodeMeta
Make watchLimit a var so that we can patch it in tests and reduce the time spent creating state.
2020-07-22 16:57:06 -04:00
Daniel Nephin a44ddea9ba state: Use subtests in TestStateStore_ServicesByNodeMeta
These subtests make it much easier to identify the slow part of the test, but they also help enumerate all the different cases which are being tested.
2020-07-22 16:39:09 -04:00
Daniel Nephin 6d3b042872
Merge pull request #7948 from hashicorp/dnephin/buffer-test-logs
testutil: NewLogBuffer - buffer logs until a test fails
2020-07-21 15:21:52 -04:00
Matt Keeler 8ea8a939f0
Merge pull request #8311 from hashicorp/bugfix/auto-encrypt-token-update 2020-07-21 13:15:27 -04:00
Daniel Nephin 80ff174880 testutil: NewLogBuffer - buffer logs until a test fails
Replaces #7559

Running tests in parallel, with background goroutines, results in test output not being associated with the correct test. `go test` does not make any guarantees about output from goroutines being attributed to the correct test case.

Attaching log output from background goroutines also cause data races.  If the goroutine outlives the test, it will race with the test being marked done. Previously this was noticed as a panic when logging, but with the race detector enabled it is shown as a data race.

The previous solution did not address the problem of correct test attribution because test output could still be hidden when it was associated with a test that did not fail. You would have to look at all of the log output to find the relevant lines. It also made debugging test failures more difficult because each log line was very long.

This commit attempts a new approach. Instead of printing all the logs, only print when a test fails. This should work well when there are a small number of failures, but may not work well when there are many test failures at the same time. In those cases the failures are unlikely a result of a specific test, and the log output is likely less useful.

All of the logs are printed from the test goroutine, so they should be associated with the correct test.

Also removes some test helpers that were not used, or only had a single caller. Packages which expose many functions with similar names can be difficult to use correctly.

Related:
https://github.com/golang/go/issues/38458 (may be fixed in go1.15)
https://github.com/golang/go/issues/38382#issuecomment-612940030
2020-07-21 12:50:40 -04:00
Matt Keeler 133a6d99f2
Fix issue with changing the agent token causing failure to renew the auto-encrypt certificate
The fallback method would still work but it would get into a state where it would let the certificate expire for 10s before getting a new one. And the new one used the less secure RPC endpoint.

This is also a pretty large refactoring of the auto encrypt code. I was going to write some tests around the certificate monitoring but it was going to be impossible to get a TestAgent configured in such a way that I could write a test that ran in less than an hour or two to exercise the functionality.

Moving the certificate monitoring into its own package will allow for dependency injection and in particular mocking the cache types to control how it hands back certificates and how long those certificates should live. This will allow for exercising the main loop more than would be possible with it coupled so tightly with the Agent.
2020-07-21 12:19:25 -04:00
Daniel Nephin 7599e280de stream: handle empty event in TestEventSnapshot
When the race detector is enabled we see this test fail occasionally. The reordering of execution seems to make it possible for the snapshot splice to happen before any events are published to the topicBuffers.

We can handle this case in the test the same way it is handled by a subscription, by proceeding to the next event.
2020-07-20 18:20:02 -04:00
Daniel Nephin 75f10fb191 state: update calls that are no longer state methods
In a previous commit these methods were changed to functions, so remove the Store paramter.
2020-07-16 15:46:10 -04:00
Daniel Nephin 3fcb2e16f4 state: un-method funcs that don't use their receiver
This change was mostly automated with the following

First generate a list of functions with:

  git grep -o 'Store) \([^(]\+\)(tx \*txn' ./agent/consul/state | awk '{print $2}' | grep -o '^[^(]\+'

Then the list was curated a bit with trial/error to remove and add funcs
as necessary.

Finally the replacement was done with:

  dir=agent/consul/state
  file=${1-funcnames}

  while read fn; do
    echo "$fn"
    sed -i -e "s/(s \*Store) $fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.$fn(/$fn(/" $dir/*.go
    sed -i -e "s/s\.store\.$fn(/$fn(/" $dir/*.go
  done < $file
2020-07-16 15:30:39 -04:00
Daniel Nephin edb0a4f1f8 store: convert methods that don't use their receiver to functions
Making these functions allows them to be used without introducing
an artificial dependency on the struct. Many of these will be called
from streaming Event processors, which do not have a store.

This change is being made ahead of the streaming work to get to reduce
the size of the streaming diff.
2020-07-16 15:30:10 -04:00
Daniel Nephin a2f8605c66 stream: Add forceClose and refactor subscription filtering
Move the subscription context to Next. context.Context should generally
never be stored in a struct because it makes that struct only valid
while the context is valid. This is rarely obvious from the caller.
Adds a forceClosed channel in place of the old context, and uses the new
context as a way for the caller to stop the Subscription blocking.

Remove some recursion out of bufferImte.Next. The caller is already looping so we can continue
in that loop instead of recursing. This ensures currentItem is updated immediately (which probably
does not matter in practice), and also removes the chance that we overflow the stack.

NextNoBlock and FollowAfter do not need to handle bufferItem.Err, the caller already
handles it.

Moves filter to a method to simplify Next, and more explicitly separate filtering from looping.

Also improve some godoc

Only unwrap itemBuffer.Err when necessary
2020-07-14 15:57:47 -04:00
Daniel Nephin 2595436f62 stream: Improve docstrings
Also rename ResumeStrema to EndOfEmptySnapshot to be more consistent with other framing events

Co-authored-by: Paul Banks <banks@banksco.de>
2020-07-14 15:57:47 -04:00
Daniel Nephin 16a2b3fafc stream: change Topic to an interface
Consumers of the package can decide on which type to use for the Topic. In the future we may
use a gRPC type for the topic.
2020-07-14 15:57:47 -04:00
Daniel Nephin aa571bd0ce state: Move change processing out of EventPublisher
EventPublisher was receiving TopicHandlers, which had a couple of
problems:

- ChangeProcessors were being grouped by Topic, but they completely
  ignored the topic and were performed on every change
- ChangeProcessors required EventPublisher to be aware of database
  changes

By moving ChangeProcesors out of EventPublisher, and having Publish
accept events instead of changes, EventPublisher no longer needs to
be aware of these things.

Handlers is now only SnapshotHandlers, which are still mapped by Topic.

Also allows us to remove the small 'db' package that had only two types.
They can now be unexported types in state.
2020-07-14 15:57:47 -04:00
Daniel Nephin 23a940daad server: Abandom state store to shutdown EventPublisher
So that we don't leak goroutines
2020-07-14 15:57:47 -04:00
Daniel Nephin e1305fe80c stream: unexport identifiers
Now that EventPublisher is part of stream a lot of the internals can be hidden
2020-07-14 15:57:47 -04:00
Daniel Nephin 9e37894778 stream: Move EventPublisher to stream package
The EventPublisher is the central hub of the PubSub system. It is toughly coupled with much of
stream. Some stream internals were exported exclusively for EventPublisher.

The two Subscribe cases (with or without index) were also awkwardly split between two packages. By
moving EventPublisher into stream they are now both in the same package (although still in different files).
2020-07-14 15:57:47 -04:00
Daniel Nephin 6e87e83d77 state: Make handleACLUpdate async once again
So that we keep as much as possible out of the FSM commit hot path.
2020-07-14 15:57:47 -04:00
Daniel Nephin a92dab724d state: Use interface for Txn
Also store the index in Changes instead of the Txn.

This change is in preparation for movinng EventPublisher to the stream package, and
making handleACLUpdates async once again.
2020-07-14 15:57:46 -04:00
Daniel Nephin c778d61b6a stream.Subscription unexport fields and additiona docstrings 2020-07-14 15:57:46 -04:00
Daniel Nephin 37a38629d7 Add a context for stopping EventPublisher goroutine 2020-07-14 15:57:46 -04:00
Daniel Nephin 02bc5a26e4 EventPublisher: Make Unsubscribe a function on Subscription
It is critical that Unsubscribe be called with the same pointer to a
SubscriptionRequest that was used to create the Subscription. The
docstring made that clear, but it sill allowed a caler to get it wrong by
creating a new SubscriptionRequest.

By hiding this detail from the caller, and only exposing an Unsubscribe
method, it should be impossible to fail to Unsubscribe.

Also update some godoc strings.
2020-07-14 15:57:46 -04:00
Daniel Nephin 86976cf23c EventPublisher: handleACL changes synchronously
Use a separate lock for subscriptions.ByToken to allow it to happen synchronously
in the commit flow.
This removes the need to create a new txn for the goroutine, and removes
the need for EventPublisher to contain a reference to DB.
2020-07-14 15:57:46 -04:00
Daniel Nephin 606121fae6 stream.EventSnapshot: reduce the fields on the struct
Many of the fields are only needed in one place, and by using a closure
they can be removed from the struct. This reduces the scope of the variables
making it esier to see how they are used.
2020-07-14 15:57:45 -04:00
Daniel Nephin 7196917051 stream.EventBuffer: Seed the fuzz test with time.Now()
Otherwise the test will run with exactly the same values each time.

By printing the seed we can attempt to reproduce the test by adding an env var to override the seed
2020-07-14 15:57:45 -04:00
Daniel Nephin 525b275a52 state: memdb_wrapper.go -> memdb.go
Renaming in a separate commit so that git can merge changes to the file.
2020-07-14 15:57:45 -04:00
Daniel Nephin b5d2bea770 state: publish changes from Commit
Make topicRegistry use functions instead of unbound methods
Use a regular memDB in EventPublisher to remove a reference cycle
Removes the need for EventPublisher to use a store
2020-07-14 15:57:45 -04:00
Daniel Nephin f626c3d6c5 EventPublisher: docstrings and getTopicBuffer
also rename commitCh -> publishCh
2020-07-14 15:57:45 -04:00
Daniel Nephin 2020e9c7c7 ProcessChanges: use stream.Event
Also remove secretHash, which was used to hash tokens. We don't expose
these tokens anywhere, so we can use the string itself instead of a
Hash.

Fix acl_events_test.go for storing a structs type.
2020-07-14 15:57:45 -04:00
Daniel Nephin 2e45bbbb3e stream: Use local types for Event Topic SubscriptionRequest 2020-07-14 15:57:45 -04:00
Daniel Nephin 3d62013062 Rename stream_publisher.go -> event_publisher.go 2020-07-14 15:57:44 -04:00
Daniel Nephin 526fb53f85 Add streaming package with Subscription and Snapshot components.
The remaining files from 7965767de0bd62ab07669b85d6879bd5f815d157

Co-authored-by: Paul Banks <banks@banksco.de>
2020-07-14 15:57:44 -04:00
Chris Piraino b80cbb499f
Set enterprise metadata after resolving the token (#8302)
The token can encode enterprise metadata information, and we must make
sure we set that on the reply so that we can correct filter ACLs.
2020-07-13 13:39:57 -05:00
Daniel Nephin 13e0d258b5
Merge pull request #8237 from hashicorp/dnephin/remove-acls-enabled-from-delegate
Remove ACLsEnabled from delegate interface
2020-07-09 16:35:43 -04:00
Matt Keeler 39d9babab3
Pass the Config and TLS Configurator into the AutoConfig constructor
This is instead of having the AutoConfigBackend interface provide functions for retrieving them.

NOTE: the config is not reloadable. For now this is fine as we don’t look at any reloadable fields. If that changes then we should provide a way to make it reloadable.
2020-07-08 12:36:11 -04:00
Matt Keeler a77ed471c8
Rename (*Server).forward to (*Server).ForwardRPC
Also get rid of the preexisting shim in server.go that existed before to have this name just call the unexported one.
2020-07-08 11:05:44 -04:00
Matt Keeler 386ec3a2a2
Refactor AutoConfig RPC to not have a direct dependency on the Server type
Instead it has an interface which can be mocked for better unit testing that is deterministic and not prone to flakiness.
2020-07-08 11:05:44 -04:00
Daniel Nephin 8b6036c077 Remove ACLsEnabled from delegate interface
In all cases (oss/ent, client/server) this method was returning a value from config. Since the
value is consistent, it doesn't need to be part of the delegate interface.
2020-07-03 17:00:20 -04:00
Daniel Nephin dfa8856e5f agent/consul: Add support for NotModified to two endpoints
A query made with AllowNotModifiedResponse and a MinIndex, where the
result has the same Index as MinIndex, will return an empty response
with QueryMeta.NotModified set to true.

Co-authored-by: Pierre Souchay <pierresouchay@users.noreply.github.com>
2020-07-02 17:05:46 -04:00