* Fix bug in usage metrics that caused a negative count to occur
There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.
We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.
* Add changelog
This change allows us to re-use these functions in other places without the Builder, and makes it
more explicit about which functions can warn/error and which can not.
A golden file makes the expected value easier to work with. This change also
removes a number of shims for enterprise and replaces them with a single one
for the golden filename.
* Display a warning when rpc.enable_streaming = true is set on a client
This option has no effect when running as an agent
* Added warning when server starts with use_streaming_backend but without rpc.enable_streaming
* Added unit test
It is no safe to assumes that the mapstructure keys will contain all the keys because some config can be specified
with command line flags or literals.
This change allows us to remove the json marshal/unmarshal cycle for command line flags, which will allow
us to remove all of the hcl/json struct tags on config fields.
These types are used as values (not pointers) in other structs. Using a pointer receiver causes
problems when the value is printed. fmt will not call the String method if it is passed a value
and the String method has a pointer receiver. By using a value receiver the correct string is printed.
Also remove some unused methods.
TestEnvoy.Close used e.stream.recvCh == nil to indicate the channel had already
been closed, so that TestEnvoy.Close can be called multiple times. The recvCh
was not protected by a lock, so setting it to nil caused a data race with any
goroutine trying to read from the channel.
Instead set the stream to nil. The stream is guarded by a lock, so it does not race.
This change allows us to test the agent/xds package using -race.
This way we only have to wait for the serf barrier to pass once before
we can upgrade to v2 acls. Without this patch every restart needs to
re-compute the change, and potentially if a stray older node joins after
a migration it might regress back to v1 mode which would be problematic.
This can happen when one other node in the cluster such as a client is unable to communicate with the leader server and sees it as failed. When that happens its failing status eventually gets propagated to the other servers in the cluster and eventually this can result in RPCs returning “No cluster leader” error.
That error is misleading and unhelpful for determing the root cause of the issue as its not raft stability but rather and client -> server networking issue. Therefore this commit will add a new error that will be returned in that case to differentiate between the two cases.
In some circumstances this endpoint will have no results in it (dues to
ACLs, Namespaces or filtering).
This ensures that the response is at least an empty array (`[]`) rather
than `null`
HTTPUseCache is only used is a gate for allowing QueryOptions.UseCache to be enabled. By
moving it to the place where the query options are set, this behaviour is more obvious.
Also remove parseInternal which was an alias for parse.
Previously the tokens would fail to insert into the secondary's state
store because the AuthMethod field of the ACLToken did not point to a
known auth method from the primary.
* create consul version metric with version label
* agent/agent.go: add pre-release Version as well as label
Co-Authored-By: Radha13 <kumari.radha3@gmail.com>
* verion and pre-release version labels.
* hyphen/- breaks prometheus
* Add Prometheus gauge defintion for version metric
* Add new metric to telemetry docs
Co-authored-by: Radha Kumari <kumari.radha3@gmail.com>
Co-authored-by: Aestek <thib.gilles@gmail.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Add a skip condition to all tests slower than 100ms.
This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests
With this change:
```
$ time go test -count=1 -short ./agent
ok github.com/hashicorp/consul/agent 0.743s
real 0m4.791s
$ time go test -count=1 -short ./agent/consul
ok github.com/hashicorp/consul/agent/consul 4.229s
real 0m8.769s
```
* server: fix panic when deleting a non existent intention
* add changelog
* Always return an error when deleting non-existent ixn
Co-authored-by: freddygv <gh@freddygv.xyz>
And remove the devMode field from builder.
This change helps make the Builder state more explicit by moving inputs to the BuilderOps struct,
leaving only fields that can change during Builder.Build on the Builder struct.
Using the LiteralSource makes it much easier to find default values, because an IDE reports
the location of a default. With an HCL string they are harder to discover.
Also removes unnecessary mapstructure.Decodes of constant values.
A vulnerability was identified in Consul and Consul Enterprise (“Consul”) such that operators with `operator:read` ACL permissions are able to read the Consul Connect CA configuration when explicitly configured with the `/v1/connect/ca/configuration` endpoint, including the private key. This allows the user to effectively privilege escalate by enabling the ability to mint certificates for any Consul Connect services. This would potentially allow them to masquerade (receive/send traffic) as any service in the mesh.
--
This PR increases the permissions required to read the Connect CA's private key when it was configured via the `/connect/ca/configuration` endpoint. They are now `operator:write`.
Previously the listener was being passed to a closure in a loop without
capturing the loop variable. The result is only the last listener is
used, so the http/https servers only listen on one address.
This problem is fixed by capturing the variable by passing it into a
function.
This PR updates the tags that we generate for Envoy stats.
Several of these come with breaking changes, since we can't keep two stats prefixes for a filter.
* ci: stop building darwin/386 binaries
Go 1.15 drops support for 32-bit binaries on Darwin https://golang.org/doc/go1.15#darwin
* tls: ConnectionState::NegotiatedProtocolIsMutual is deprecated in Go 1.15, this value is always true
* correct error messages that changed slightly
* Completely regenerate some TLS test data
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
The Intention.Apply RPC is quite large, so this PR attempts to break it down into smaller functions and dissolves the pre-config-entry approach to the breakdown as it only confused things.
Header is: X-Consul-Default-ACL-Policy=<allow|deny>
This is of particular utility when fetching matching intentions, as the
fallthrough for a request that doesn't match any intentions is to
enforce using the default acl policy.
Most packages should pass the race detector. An exclude list ensures
that new packages are automatically tested with -race.
Also fix a couple small test races to allow more packages to be tested.
Returning readyCh requires a lock because it can be set to nil, and
setting it to nil will race without the lock.
Move the TestServer.Listening calls around so that they properly guard
setting TestServer.l. Otherwise it races.
Remove t.Parallel in a small package. The entire package tests run in a
few seconds, so t.Parallel does very little.
In auto-config, wait for the AutoConfig.run goroutine to stop before
calling readPersistedAutoConfig. Without this change there was a data
race on reading ac.config.
defaultMetrics was being set at package import time, which meant that it received an instance of
the original default. But lib/telemetry.InitTelemetry sets a new global when it is called.
This resulted in the metrics being sent nowhere.
This commit changes defaultMetrics to be a function, so it will return the global instance when
called. Since it is called after InitTelemetry it will return the correct metrics instance.
The Catalog, Config Entry, KV and Session resources potentially re-validate the input as its coming in. We need to prevent snapshot restoration failures due to missing namespaces or namespaces that are being deleted in enterprise.
This ensures the metrics proxy endpoint is ACL protected behind a
wildcard `service:read` and `node:read` set of rules. For Consul
Enterprise these will need to span all namespaces:
```
service_prefix "" { policy = "read" }
node_prefix "" { policy = "read" }
namespace_prefix "" {
service_prefix "" { policy = "read" }
node_prefix "" { policy = "read" }
}
```
This PR contains just the backend changes. The frontend changes to
actually pass the consul token header to the proxy through the JS plugin
will come in another PR.
Added a new option `ui_config.metrics_proxy.path_allowlist`. This defaults to `["/api/v1/query", "/api/v1/query_range"]` when the metrics provider is set to `prometheus`.
Requests that do not use one of the allow-listed paths (via exact match) get a 403 Forbidden response instead.
1. do a state store query to list intentions as the agent would do over in `agent/proxycfg` backing `agent/xds`
2. upgrade the database and do a fresh `service-intentions` config entry write
3. the blocking query inside of the agent cache in (1) doesn't notice (2)
Makes Payload a type with FilterByKey so that Payloads can implement
filtering by key. With this approach we don't need to expose a Namespace
field on Event, and we don't need to invest micro formats or require a
bunch of code to be aware of exactly how the key field is encoded.
The output of the previous assertions made it impossible to debug the tests without code changes.
With go-cmp comparing the entire slice we can see the full diffs making it easier to debug failures.
Gauge metrics are great for understanding the current state, but can somtimes hide problems
if there are many disconnect/reconnects.
This commit adds counter metrics for connections and streams to make it easier to see the
count of newly created connections and streams.
Instead of using retry.Run, which appears to have problems in some cases where it does not
emit an error message, use a for loop.
Increase the number of attempts and remove any sleep, since this operation is not that expensive to do
in a tight loop
Closing l.conns can lead to a race and a 'panic: send on closed chan' when a
connection is in the middle of being handled when the server is shutting down.
Found using '-race -count=800'
* Plumb Datacenter and Namespace to metrics provider in preparation for them being usable.
* Move metrics loader/status to a new component and show reason for being disabled.
* Remove stray console.log
* Rebuild AssetFS to resolve conflicts
* Yarn upgrade
* mend
Previously config entries sharing a kind & name but in different
namespaces could occasionally cause "stuck states" in replication
because the namespace fields were ignored during the differential
comparison phase.
Example:
Two config entries written to the primary:
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
Under the covers these both get saved to memdb, so they are sorted by
all 3 components (kind,name,namespace) during natural iteration. This
means that before the replication code does it's own incomplete sort,
the underlying data IS sorted by namespace ascending (bar comes before
foo).
After one pass of replication the primary and secondary datacenters have
the same set of config entries present. If
"kind=A,name=web,namespace=bar" were to be deleted, then things get
weird. Before replication the two sides look like:
primary: [
kind=A,name=web,namespace=foo
]
secondary: [
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
]
The differential comparison phase walks these two lists in sorted order
and first compares "kind=A,name=web,namespace=foo" vs
"kind=A,name=web,namespace=bar" and falsely determines they are the SAME
and are thus cause an update of "kind=A,name=web,namespace=foo". Then it
compares "<nothing>" with "kind=A,name=web,namespace=foo" and falsely
determines that the latter should be DELETED.
During reconciliation the deletes are processed before updates, and so
for a brief moment in the secondary "kind=A,name=web,namespace=foo" is
erroneously deleted and then immediately restored.
Unfortunately after this replication phase the final state is identical
to the initial state, so when it loops around again (rate limited) it
repeats the same set of operations indefinitely.
Required also converting some of the transaction functions to WriteTxn
because TxnRO() called the same helper as TxnRW.
This change allows us to return a memdb.Txn for read-only txn instead of
wrapping them with state.txn.
Prepopulate was setting entry.Expiry.HeapIndex to 0. Previously this would result in a call to heap.Fix(0)
which wasn't correct, but was also not really a problem because at worse it would re-notify.
With the recent change to extract cachettl it was changed to call Update(idx), which would have updated
the wrong entry.
A previous commit removed the setting of entry.Expiry so that the HeapIndex would be reported
as -1, and this commit adds a test and handles the -1 heap index.
And remove duplicate notifications.
Instead of performing the check in the heap implementation, check the
index in the higher level interface (Add,Remove,Update) and notify if one
of the relevant indexes is 0.
Instead of the whole map. This should save a lot of time performing reflecting on a large map.
The filter does not change, so there is no reason to re-apply it to older entries.
This change attempts to make the delay logic more obvious by:
* remove indirection, inline a bunch of function calls
* move all the code and constants next to each other
* replace the two constant values with a single value
* reword the comments.
When a service is deregistered, we check whever matching services were
registered as sidecar along with it and deregister them as well.
To determine if a service is indeed a sidecar we check the
structs.ServiceNode.LocallyRegisteredAsSidecar property. However, to
avoid interal API leakage, it is excluded from JSON serialization,
meaning it is not persisted to disk either.
When the agent is restarted, this property lost and sidecars are no
longer deregistered along with their parent service.
To fix this, we now specifically save this property in the persisted
service file.
* Consul Service meta wrongly computes and exposes non_voter meta
In Serf Tags, entreprise members being non-voters use the tag
`nonvoter=1`, not `non_voter = false`, so non-voters in members
were wrongly displayed as voter.
Demonstration:
```
consul members -detailed|grep voter
consul20-hk5 10.200.100.110:8301 alive acls=1,build=1.8.4+ent,dc=hk5,expect=3,ft_fs=1,ft_ns=1,id=xxxxxxxx-5629-08f2-3a79-10a1ab3849d5,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
```
* Added changelog
* Added changelog entry
* Remove unused StatsCard component
* Create Card and Stats contextual components with styling
* Send endpoint, item, and protocol to Stats as props
* WIP basic plumbing for metrics in Ember
* WIP metrics data source now works for different protocols and produces reasonable mock responses
* WIP sparkline component
* Mostly working metrics and graphs in topology
* Fix date in tooltip to actually be correct
* Clean up console.log
* Add loading frame and create a style sheet for Stats
* Various polish fixes:
- Loading state for graph
- Added fake latency cookie value to test loading
- If metrics provider has no series/stats for the service show something that doesn't look broken
- Graph hover works right to the edge now
- Stats boxes now wrap so they are either shown or not as will fit not cut off
- Graph resizes when browser window size changes
- Some tweaks to number formats and stat metrics to make them more compact/useful
* Thread Protocol through topology model correctly
* Rebuild assetfs
* Fix failing tests and remove stats-card now it's changed and become different
* Fix merge conflict
* Update api doublt
* more merge fixes
* Add data-permission and id attr to Card
* Run JS linter
* Move things around so the tests run with everything available
* Get tests passing:
1. Remove fakeLatency setTimeout (will be replaced with CONSUL_LATENCY
in mocks)
2. Make sure any event handlers are removed
* Make sure the Consul/scripts are available before the app
* Make sure interval gets set if there is no cookie value
* Upgrade mocks so we can use CONSUL_LATENCY
* Fix handling of no series values from Prometheus
* Update assetfs and fix a comment
* Rebase and rebuild assetfs; fix tcp metric series units to be bits not bytes
* Rebuild assetfs
* Hide stats when provider is not configured
Co-authored-by: kenia <keniavalladarez@gmail.com>
Co-authored-by: John Cowen <jcowen@hashicorp.com>
Previously, we would emit service usage metrics both with and without a
namespace label attached. This is problematic in the case when you want
to aggregate metrics together, i.e. "sum(consul.state.services)". This
would cause services to be counted twice in that aggregate, once via the
metric emitted with a namespace label, and once in the metric emited
without any namespace label.
This is the recommended proxy integration API for listing intentions
which should not require an active connection to the servers to resolve
after the initial cache filling.
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
Extend Consul’s intentions model to allow for request-based access control enforcement for HTTP-like protocols in addition to the existing connection-based enforcement for unspecified protocols (e.g. tcp).
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.
- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.
- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.
- Add a new serf feature flag indicating support for
intentions-as-config-entries.
- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.
- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.
- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition it met
the old intentions replicator ceases.
- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
it's old state store table it waits until all config entries up to that
point are replicated and then zeroes its own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
This call appears to only be necessary because reset() was called from
NewMaterializer.
This commit has the constructor set a default value for updateCh, and
removes both the call to reset() from New(), and the call to
notifyUpdateLocked() from reset().
This should ensure that we do not notify the Fetch() call before we have new
values to report.
Refactor of Materializer.Run
Use handlers to manage state in Materializer
Rename Materializer receiver
rename m.l to m.lock, and flip some conditionals to remove the negative.
Improve godoc, rename Deps, move resetErr, and pass err into notifyUpdate
Update for NewSnapshotToFollow events
Refactor to move context cancel out of Materializer
Replace InitFilter with Reset.
Removes the need to store a fatalErr and the cache-type, and removes the need to recreate the filter
each time.
Pass dependencies into MaterializedView.
Remove context from MaterializedView.
Rename state to view.
Rename MaterialziedView to Materialzier.
Rename to NewMaterializer
Pass in retry.Waiter
This adds a new very tiny memdb table and corresponding raft operation
for updating a very small effective map[string]string collection of
"system metadata". This can persistently record a fact about the Consul
state machine itself.
The first use of this feature will come in a later PR.
This new package provides a client agent implementation of an interface
for fetching the health of services.
This approach has a number of benefits:
1. It provides a much more explicit interface. Instead of everything
dependency on `RPC()` and `Cache.Get()` for many unrelated things
they can depend on a type that are named according to the behaviour
it provides.
2. It gives us a single place to vary the behaviour and migrate to
a new form of RPC (gRPC). The current implementation has two options
(cache, or direct RPC), and in the future we will have more.
It is also a great opporunity to start adding `context.Context` args
to these operations, which in the future will allow us to cancel
the operations.
3. As a concequence of the first, in the Server agent where we make
these calls we can replace the current in-memory RPC calls with
a thin adapter for the real method. This removes the `net/rpc`
machinery from the call in places where it is not needed.
This new package is quite small right now, but I think we can expect it
to grow to a more reasonable size as other RPC calls are replaced.
This change also happens to replace two very similar implementations with
a single implementation.
Reduce Jitter to one function
Rename NewRetryWaiter
Fix a bug in calculateWait where maxWait was applied before jitter, which would make it
possible to wait longer than maxWait.
This really only matters for unit tests, since typically if an agent shuts down its server, it follows that up by exiting the process, which would also clean up all of the networking anyway.