This way we only have to wait for the serf barrier to pass once before
we can upgrade to v2 acls. Without this patch every restart needs to
re-compute the change, and potentially if a stray older node joins after
a migration it might regress back to v1 mode which would be problematic.
This can happen when one other node in the cluster such as a client is unable to communicate with the leader server and sees it as failed. When that happens its failing status eventually gets propagated to the other servers in the cluster and eventually this can result in RPCs returning “No cluster leader” error.
That error is misleading and unhelpful for determing the root cause of the issue as its not raft stability but rather and client -> server networking issue. Therefore this commit will add a new error that will be returned in that case to differentiate between the two cases.
In some circumstances this endpoint will have no results in it (dues to
ACLs, Namespaces or filtering).
This ensures that the response is at least an empty array (`[]`) rather
than `null`
HTTPUseCache is only used is a gate for allowing QueryOptions.UseCache to be enabled. By
moving it to the place where the query options are set, this behaviour is more obvious.
Also remove parseInternal which was an alias for parse.
Previously the tokens would fail to insert into the secondary's state
store because the AuthMethod field of the ACLToken did not point to a
known auth method from the primary.
* create consul version metric with version label
* agent/agent.go: add pre-release Version as well as label
Co-Authored-By: Radha13 <kumari.radha3@gmail.com>
* verion and pre-release version labels.
* hyphen/- breaks prometheus
* Add Prometheus gauge defintion for version metric
* Add new metric to telemetry docs
Co-authored-by: Radha Kumari <kumari.radha3@gmail.com>
Co-authored-by: Aestek <thib.gilles@gmail.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Add a skip condition to all tests slower than 100ms.
This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests
With this change:
```
$ time go test -count=1 -short ./agent
ok github.com/hashicorp/consul/agent 0.743s
real 0m4.791s
$ time go test -count=1 -short ./agent/consul
ok github.com/hashicorp/consul/agent/consul 4.229s
real 0m8.769s
```
* server: fix panic when deleting a non existent intention
* add changelog
* Always return an error when deleting non-existent ixn
Co-authored-by: freddygv <gh@freddygv.xyz>
And remove the devMode field from builder.
This change helps make the Builder state more explicit by moving inputs to the BuilderOps struct,
leaving only fields that can change during Builder.Build on the Builder struct.
Using the LiteralSource makes it much easier to find default values, because an IDE reports
the location of a default. With an HCL string they are harder to discover.
Also removes unnecessary mapstructure.Decodes of constant values.
A vulnerability was identified in Consul and Consul Enterprise (“Consul”) such that operators with `operator:read` ACL permissions are able to read the Consul Connect CA configuration when explicitly configured with the `/v1/connect/ca/configuration` endpoint, including the private key. This allows the user to effectively privilege escalate by enabling the ability to mint certificates for any Consul Connect services. This would potentially allow them to masquerade (receive/send traffic) as any service in the mesh.
--
This PR increases the permissions required to read the Connect CA's private key when it was configured via the `/connect/ca/configuration` endpoint. They are now `operator:write`.
Previously the listener was being passed to a closure in a loop without
capturing the loop variable. The result is only the last listener is
used, so the http/https servers only listen on one address.
This problem is fixed by capturing the variable by passing it into a
function.
This PR updates the tags that we generate for Envoy stats.
Several of these come with breaking changes, since we can't keep two stats prefixes for a filter.
* ci: stop building darwin/386 binaries
Go 1.15 drops support for 32-bit binaries on Darwin https://golang.org/doc/go1.15#darwin
* tls: ConnectionState::NegotiatedProtocolIsMutual is deprecated in Go 1.15, this value is always true
* correct error messages that changed slightly
* Completely regenerate some TLS test data
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
The Intention.Apply RPC is quite large, so this PR attempts to break it down into smaller functions and dissolves the pre-config-entry approach to the breakdown as it only confused things.
Header is: X-Consul-Default-ACL-Policy=<allow|deny>
This is of particular utility when fetching matching intentions, as the
fallthrough for a request that doesn't match any intentions is to
enforce using the default acl policy.
Most packages should pass the race detector. An exclude list ensures
that new packages are automatically tested with -race.
Also fix a couple small test races to allow more packages to be tested.
Returning readyCh requires a lock because it can be set to nil, and
setting it to nil will race without the lock.
Move the TestServer.Listening calls around so that they properly guard
setting TestServer.l. Otherwise it races.
Remove t.Parallel in a small package. The entire package tests run in a
few seconds, so t.Parallel does very little.
In auto-config, wait for the AutoConfig.run goroutine to stop before
calling readPersistedAutoConfig. Without this change there was a data
race on reading ac.config.
defaultMetrics was being set at package import time, which meant that it received an instance of
the original default. But lib/telemetry.InitTelemetry sets a new global when it is called.
This resulted in the metrics being sent nowhere.
This commit changes defaultMetrics to be a function, so it will return the global instance when
called. Since it is called after InitTelemetry it will return the correct metrics instance.
The Catalog, Config Entry, KV and Session resources potentially re-validate the input as its coming in. We need to prevent snapshot restoration failures due to missing namespaces or namespaces that are being deleted in enterprise.
This ensures the metrics proxy endpoint is ACL protected behind a
wildcard `service:read` and `node:read` set of rules. For Consul
Enterprise these will need to span all namespaces:
```
service_prefix "" { policy = "read" }
node_prefix "" { policy = "read" }
namespace_prefix "" {
service_prefix "" { policy = "read" }
node_prefix "" { policy = "read" }
}
```
This PR contains just the backend changes. The frontend changes to
actually pass the consul token header to the proxy through the JS plugin
will come in another PR.
Added a new option `ui_config.metrics_proxy.path_allowlist`. This defaults to `["/api/v1/query", "/api/v1/query_range"]` when the metrics provider is set to `prometheus`.
Requests that do not use one of the allow-listed paths (via exact match) get a 403 Forbidden response instead.
1. do a state store query to list intentions as the agent would do over in `agent/proxycfg` backing `agent/xds`
2. upgrade the database and do a fresh `service-intentions` config entry write
3. the blocking query inside of the agent cache in (1) doesn't notice (2)
Makes Payload a type with FilterByKey so that Payloads can implement
filtering by key. With this approach we don't need to expose a Namespace
field on Event, and we don't need to invest micro formats or require a
bunch of code to be aware of exactly how the key field is encoded.
The output of the previous assertions made it impossible to debug the tests without code changes.
With go-cmp comparing the entire slice we can see the full diffs making it easier to debug failures.
Gauge metrics are great for understanding the current state, but can somtimes hide problems
if there are many disconnect/reconnects.
This commit adds counter metrics for connections and streams to make it easier to see the
count of newly created connections and streams.
Instead of using retry.Run, which appears to have problems in some cases where it does not
emit an error message, use a for loop.
Increase the number of attempts and remove any sleep, since this operation is not that expensive to do
in a tight loop
Closing l.conns can lead to a race and a 'panic: send on closed chan' when a
connection is in the middle of being handled when the server is shutting down.
Found using '-race -count=800'
* Plumb Datacenter and Namespace to metrics provider in preparation for them being usable.
* Move metrics loader/status to a new component and show reason for being disabled.
* Remove stray console.log
* Rebuild AssetFS to resolve conflicts
* Yarn upgrade
* mend
Previously config entries sharing a kind & name but in different
namespaces could occasionally cause "stuck states" in replication
because the namespace fields were ignored during the differential
comparison phase.
Example:
Two config entries written to the primary:
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
Under the covers these both get saved to memdb, so they are sorted by
all 3 components (kind,name,namespace) during natural iteration. This
means that before the replication code does it's own incomplete sort,
the underlying data IS sorted by namespace ascending (bar comes before
foo).
After one pass of replication the primary and secondary datacenters have
the same set of config entries present. If
"kind=A,name=web,namespace=bar" were to be deleted, then things get
weird. Before replication the two sides look like:
primary: [
kind=A,name=web,namespace=foo
]
secondary: [
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
]
The differential comparison phase walks these two lists in sorted order
and first compares "kind=A,name=web,namespace=foo" vs
"kind=A,name=web,namespace=bar" and falsely determines they are the SAME
and are thus cause an update of "kind=A,name=web,namespace=foo". Then it
compares "<nothing>" with "kind=A,name=web,namespace=foo" and falsely
determines that the latter should be DELETED.
During reconciliation the deletes are processed before updates, and so
for a brief moment in the secondary "kind=A,name=web,namespace=foo" is
erroneously deleted and then immediately restored.
Unfortunately after this replication phase the final state is identical
to the initial state, so when it loops around again (rate limited) it
repeats the same set of operations indefinitely.
Required also converting some of the transaction functions to WriteTxn
because TxnRO() called the same helper as TxnRW.
This change allows us to return a memdb.Txn for read-only txn instead of
wrapping them with state.txn.
Prepopulate was setting entry.Expiry.HeapIndex to 0. Previously this would result in a call to heap.Fix(0)
which wasn't correct, but was also not really a problem because at worse it would re-notify.
With the recent change to extract cachettl it was changed to call Update(idx), which would have updated
the wrong entry.
A previous commit removed the setting of entry.Expiry so that the HeapIndex would be reported
as -1, and this commit adds a test and handles the -1 heap index.
And remove duplicate notifications.
Instead of performing the check in the heap implementation, check the
index in the higher level interface (Add,Remove,Update) and notify if one
of the relevant indexes is 0.
Instead of the whole map. This should save a lot of time performing reflecting on a large map.
The filter does not change, so there is no reason to re-apply it to older entries.
This change attempts to make the delay logic more obvious by:
* remove indirection, inline a bunch of function calls
* move all the code and constants next to each other
* replace the two constant values with a single value
* reword the comments.
When a service is deregistered, we check whever matching services were
registered as sidecar along with it and deregister them as well.
To determine if a service is indeed a sidecar we check the
structs.ServiceNode.LocallyRegisteredAsSidecar property. However, to
avoid interal API leakage, it is excluded from JSON serialization,
meaning it is not persisted to disk either.
When the agent is restarted, this property lost and sidecars are no
longer deregistered along with their parent service.
To fix this, we now specifically save this property in the persisted
service file.