Instead of using retry.Run, which appears to have problems in some cases where it does not
emit an error message, use a for loop.
Increase the number of attempts and remove any sleep, since this operation is not that expensive to do
in a tight loop
Closing l.conns can lead to a race and a 'panic: send on closed chan' when a
connection is in the middle of being handled when the server is shutting down.
Found using '-race -count=800'
* Plumb Datacenter and Namespace to metrics provider in preparation for them being usable.
* Move metrics loader/status to a new component and show reason for being disabled.
* Remove stray console.log
* Rebuild AssetFS to resolve conflicts
* Yarn upgrade
* mend
Previously config entries sharing a kind & name but in different
namespaces could occasionally cause "stuck states" in replication
because the namespace fields were ignored during the differential
comparison phase.
Example:
Two config entries written to the primary:
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
Under the covers these both get saved to memdb, so they are sorted by
all 3 components (kind,name,namespace) during natural iteration. This
means that before the replication code does its own incomplete sort,
the underlying data IS sorted by namespace ascending (bar comes before
foo).
After one pass of replication the primary and secondary datacenters have
the same set of config entries present. If
"kind=A,name=web,namespace=bar" were to be deleted, then things get
weird. Before replication the two sides look like:
primary: [
kind=A,name=web,namespace=foo
]
secondary: [
kind=A,name=web,namespace=bar
kind=A,name=web,namespace=foo
]
The differential comparison phase walks these two lists in sorted order
and first compares "kind=A,name=web,namespace=foo" vs
"kind=A,name=web,namespace=bar" and falsely determines they are the SAME
and thus causes an update of "kind=A,name=web,namespace=foo". Then it
compares "<nothing>" with "kind=A,name=web,namespace=foo" and falsely
determines that the latter should be DELETED.
During reconciliation the deletes are processed before updates, and so
for a brief moment in the secondary "kind=A,name=web,namespace=foo" is
erroneously deleted and then immediately restored.
Unfortunately after this replication phase the final state is identical
to the initial state, so when it loops around again (rate limited) it
repeats the same set of operations indefinitely.
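To make the fix concrete, here is a minimal sketch (with a hypothetical tuple
type, not the real structs.ConfigEntry plumbing) of a comparison that treats
the namespace as part of the identity, so the sort order and the diff agree:
```go
package main

import "fmt"

// configEntryKey is a hypothetical identity tuple. The essential point of the
// fix is that the namespace participates in both sorting and comparison, so
// the diff walks the two lists in the same order it compares them.
type configEntryKey struct {
	Kind      string
	Name      string
	Namespace string
}

// less orders entries by kind, then name, then namespace, matching the memdb
// iteration order described above.
func (a configEntryKey) less(b configEntryKey) bool {
	if a.Kind != b.Kind {
		return a.Kind < b.Kind
	}
	if a.Name != b.Name {
		return a.Name < b.Name
	}
	return a.Namespace < b.Namespace
}

// same is the equality check used by the differential comparison; before the
// fix the namespace field was effectively ignored here.
func (a configEntryKey) same(b configEntryKey) bool {
	return a == b
}

func main() {
	foo := configEntryKey{Kind: "A", Name: "web", Namespace: "foo"}
	bar := configEntryKey{Kind: "A", Name: "web", Namespace: "bar"}
	fmt.Println(bar.less(foo)) // true: bar sorts before foo
	fmt.Println(foo.same(bar)) // false: different namespaces are different entries
}
```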
This also required converting some of the transaction functions to WriteTxn,
because TxnRO() called the same helper as TxnRW.
This change allows us to return a memdb.Txn for read-only txns instead of
wrapping them with state.txn.
Prepopulate was setting entry.Expiry.HeapIndex to 0. Previously this would result in a call to heap.Fix(0)
which wasn't correct, but was also not really a problem because at worst it would re-notify.
With the recent change to extract cachettl it was changed to call Update(idx), which would have updated
the wrong entry.
A previous commit removed the setting of entry.Expiry so that the HeapIndex would be reported
as -1, and this commit adds a test and handles the -1 heap index.
And remove duplicate notifications.
Instead of performing the check in the heap implementation, check the
index in the higher level interface (Add,Remove,Update) and notify if one
of the relevant indexes is 0.
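Roughly, and with illustrative names, the shape of this looks like the
following: the container/heap implementation only maintains indexes, and the
wrapper methods decide when the next expiry (index 0) changed and a waiter
needs to be woken:
```go
package cachettl

import "container/heap"

// entry is a hypothetical cache entry tracked by expiry.
type entry struct {
	Key       string
	ExpiresAt int64
	HeapIndex int
}

// entryHeap implements heap.Interface and only maintains HeapIndex; all
// notification decisions happen in the wrapper methods below.
type entryHeap struct {
	entries  []*entry
	NotifyCh chan struct{}
}

func (h *entryHeap) Len() int           { return len(h.entries) }
func (h *entryHeap) Less(i, j int) bool { return h.entries[i].ExpiresAt < h.entries[j].ExpiresAt }
func (h *entryHeap) Swap(i, j int) {
	h.entries[i], h.entries[j] = h.entries[j], h.entries[i]
	h.entries[i].HeapIndex = i
	h.entries[j].HeapIndex = j
}
func (h *entryHeap) Push(x interface{}) {
	e := x.(*entry)
	e.HeapIndex = len(h.entries)
	h.entries = append(h.entries, e)
}
func (h *entryHeap) Pop() interface{} {
	old := h.entries
	n := len(old)
	e := old[n-1]
	h.entries = old[:n-1]
	e.HeapIndex = -1 // -1 marks "not in the heap", matching the sentinel above
	return e
}

// Add pushes a new entry and notifies only if it became the next to expire.
func (h *entryHeap) Add(e *entry) {
	heap.Push(h, e)
	if e.HeapIndex == 0 {
		h.notify()
	}
}

// Remove deletes the entry at idx and notifies if it was the next to expire.
func (h *entryHeap) Remove(idx int) {
	heap.Remove(h, idx)
	if idx == 0 {
		h.notify()
	}
}

// Update changes an entry's expiry and notifies if either its old or new
// position is the head of the heap.
func (h *entryHeap) Update(idx int, expiresAt int64) {
	e := h.entries[idx]
	e.ExpiresAt = expiresAt
	heap.Fix(h, idx)
	if idx == 0 || e.HeapIndex == 0 {
		h.notify()
	}
}

// notify wakes the expiry loop without blocking if a wakeup is already pending.
func (h *entryHeap) notify() {
	select {
	case h.NotifyCh <- struct{}{}:
	default:
	}
}
```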
Instead of the whole map. This should save a lot of time otherwise spent reflecting on a large map.
The filter does not change, so there is no reason to re-apply it to older entries.
This change attempts to make the delay logic more obvious by:
* remove indirection, inline a bunch of function calls
* move all the code and constants next to each other
* replace the two constant values with a single value
* reword the comments.
When a service is deregistered, we check whether matching services were
registered as sidecars along with it and deregister them as well.
To determine if a service is indeed a sidecar we check the
structs.ServiceNode.LocallyRegisteredAsSidecar property. However, to
avoid internal API leakage, it is excluded from JSON serialization,
meaning it is not persisted to disk either.
When the agent is restarted, this property is lost and sidecars are no
longer deregistered along with their parent service.
To fix this, we now specifically save this property in the persisted
service file.
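A sketch of the shape of the fix, using hypothetical type names: the in-memory
struct keeps the field out of its JSON form, so the on-disk wrapper carries it
explicitly and restores it on load:
```go
package persist

import (
	"encoding/json"
	"errors"
)

// ServiceNode mirrors the relevant part of structs.ServiceNode: the sidecar
// flag is deliberately hidden from JSON so it never leaks through the API.
type ServiceNode struct {
	Service                    string
	Port                       int
	LocallyRegisteredAsSidecar bool `json:"-"`
}

// persistedService is the on-disk representation; it carries the sidecar
// flag explicitly so the information survives an agent restart.
type persistedService struct {
	Service                    *ServiceNode
	LocallyRegisteredAsSidecar bool
}

// Encode writes the service plus the sidecar flag to the persisted file.
func Encode(s *ServiceNode) ([]byte, error) {
	return json.Marshal(persistedService{
		Service:                    s,
		LocallyRegisteredAsSidecar: s.LocallyRegisteredAsSidecar,
	})
}

// Decode restores the service and re-applies the sidecar flag, which would
// otherwise be lost because of the json:"-" tag above.
func Decode(buf []byte) (*ServiceNode, error) {
	var p persistedService
	if err := json.Unmarshal(buf, &p); err != nil {
		return nil, err
	}
	if p.Service == nil {
		return nil, errors.New("persisted service file is missing the service definition")
	}
	p.Service.LocallyRegisteredAsSidecar = p.LocallyRegisteredAsSidecar
	return p.Service, nil
}
```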
* Consul Service meta wrongly computes and exposes non_voter meta
In Serf Tags, enterprise members that are non-voters use the tag
`nonvoter=1`, not `non_voter = false`, so non-voters in members
were wrongly displayed as voters.
Demonstration:
```
consul members -detailed|grep voter
consul20-hk5 10.200.100.110:8301 alive acls=1,build=1.8.4+ent,dc=hk5,expect=3,ft_fs=1,ft_ns=1,id=xxxxxxxx-5629-08f2-3a79-10a1ab3849d5,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
```
* Added changelog
* Added changelog entry
* Remove unused StatsCard component
* Create Card and Stats contextual components with styling
* Send endpoint, item, and protocol to Stats as props
* WIP basic plumbing for metrics in Ember
* WIP metrics data source now works for different protocols and produces reasonable mock responses
* WIP sparkline component
* Mostly working metrics and graphs in topology
* Fix date in tooltip to actually be correct
* Clean up console.log
* Add loading frame and create a style sheet for Stats
* Various polish fixes:
- Loading state for graph
- Added fake latency cookie value to test loading
- If metrics provider has no series/stats for the service show something that doesn't look broken
- Graph hover works right to the edge now
- Stats boxes now wrap, so each box is either fully shown or omitted as space allows, rather than being cut off
- Graph resizes when browser window size changes
- Some tweaks to number formats and stat metrics to make them more compact/useful
* Thread Protocol through topology model correctly
* Rebuild assetfs
* Fix failing tests and remove stats-card now that it has changed and become different
* Fix merge conflict
* Update api double
* more merge fixes
* Add data-permission and id attr to Card
* Run JS linter
* Move things around so the tests run with everything available
* Get tests passing:
1. Remove fakeLatency setTimeout (will be replaced with CONSUL_LATENCY
in mocks)
2. Make sure any event handlers are removed
* Make sure the Consul/scripts are available before the app
* Make sure interval gets set if there is no cookie value
* Upgrade mocks so we can use CONSUL_LATENCY
* Fix handling of no series values from Prometheus
* Update assetfs and fix a comment
* Rebase and rebuild assetfs; fix tcp metric series units to be bits not bytes
* Rebuild assetfs
* Hide stats when provider is not configured
Co-authored-by: kenia <keniavalladarez@gmail.com>
Co-authored-by: John Cowen <jcowen@hashicorp.com>
Previously, we would emit service usage metrics both with and without a
namespace label attached. This is problematic in the case when you want
to aggregate metrics together, i.e. "sum(consul.state.services)". This
would cause services to be counted twice in that aggregate, once via the
metric emitted with a namespace label, and once in the metric emitted
without any namespace label.
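For illustration, a minimal sketch using the go-metrics API (function and
label names here are an assumption, not the exact Consul code): emit a single
labeled gauge per namespace so that an aggregate counts each service exactly
once:
```go
package usage

import "github.com/armon/go-metrics"

// emitServiceUsage emits one gauge per namespace, each carrying a namespace
// label, so that an aggregate like sum(consul.state.services) counts every
// service exactly once.
func emitServiceUsage(countByNamespace map[string]int) {
	for ns, count := range countByNamespace {
		metrics.SetGaugeWithLabels(
			[]string{"state", "services"},
			float32(count),
			[]metrics.Label{{Name: "namespace", Value: ns}},
		)
	}
}
```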
This is the recommended proxy integration API for listing intentions
which should not require an active connection to the servers to resolve
after the initial cache filling.
This allows client agents to be run in a more stateless manner, where they may be abruptly terminated and not expected to come back. If an agent advertises a per-agent reconnect timeout using the advertise_reconnect_timeout configuration, then when that agent leaves, other agents will wait only that amount of time for it to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
Extend Consul’s intentions model to allow for request-based access control enforcement for HTTP-like protocols in addition to the existing connection-based enforcement for unspecified protocols (e.g. tcp).
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.
- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.
- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.
- Add a new serf feature flag indicating support for
intentions-as-config-entries.
- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.
- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.
- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition is met
the old intentions replicator ceases.
- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
its old state store table, they wait until all config entries up to that
point are replicated and then zero their own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
This call appears to only be necessary because reset() was called from
NewMaterializer.
This commit has the constructor set a default value for updateCh, and
removes both the call to reset() from New(), and the call to
notifyUpdateLocked() from reset().
This should ensure that we do not notify the Fetch() call before we have new
values to report.
Refactor of Materializer.Run
Use handlers to manage state in Materializer
Rename Materializer receiver
rename m.l to m.lock, and flip some conditionals to remove the negative.
Improve godoc, rename Deps, move resetErr, and pass err into notifyUpdate
Update for NewSnapshotToFollow events
Refactor to move context cancel out of Materializer
Replace InitFilter with Reset.
Removes the need to store a fatalErr and the cache-type, and removes the need to recreate the filter
each time.
Pass dependencies into MaterializedView.
Remove context from MaterializedView.
Rename state to view.
Rename MaterializedView to Materializer.
Rename to NewMaterializer
Pass in retry.Waiter
This adds a new very tiny memdb table and corresponding raft operation
for updating what is effectively a very small map[string]string collection of
"system metadata". This can persistently record a fact about the Consul
state machine itself.
The first use of this feature will come in a later PR.
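Conceptually (names illustrative, not the real API), a row in the table is
nothing more than:
```go
package state

// SystemMetadataEntry is a sketch of a single row in the new memdb table: one
// tiny key/value fact about the state machine itself, written through Raft
// like any other state change.
type SystemMetadataEntry struct {
	Key   string
	Value string
}

// An example of the kind of fact this is meant to record (name illustrative):
// a marker that a one-time migration has completed, so the leader can skip
// that code path on later startups.
const systemMetadataMigrationDoneKey = "example-migration-complete"
```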
This new package provides a client agent implementation of an interface
for fetching the health of services.
This approach has a number of benefits:
1. It provides a much more explicit interface. Instead of everything
depending on `RPC()` and `Cache.Get()` for many unrelated things,
callers can depend on a type that is named according to the behaviour
it provides.
2. It gives us a single place to vary the behaviour and migrate to
a new form of RPC (gRPC). The current implementation has two options
(cache, or direct RPC), and in the future we will have more.
It is also a great opportunity to start adding `context.Context` args
to these operations, which in the future will allow us to cancel
the operations.
3. As a consequence of the first, in the Server agent where we make
these calls we can replace the current in-memory RPC calls with
a thin adapter for the real method. This removes the `net/rpc`
machinery from the call in places where it is not needed.
This new package is quite small right now, but I think we can expect it
to grow to a more reasonable size as other RPC calls are replaced.
This change also happens to replace two very similar implementations with
a single implementation.
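As a rough sketch of the idea, with placeholder request/response types rather
than the real package API:
```go
package health

import "context"

// ServiceRequest and ServiceNodes are placeholders for the real request and
// response structs so this sketch is self-contained.
type ServiceRequest struct {
	Datacenter  string
	ServiceName string
}

type ServiceNodes struct {
	Index uint64
	Nodes []string
}

// Client is the narrow, behaviour-named dependency callers use instead of
// reaching for RPC() or Cache.Get() directly. One implementation can wrap the
// agent cache, another can make direct (or, later, gRPC) calls, and on the
// server a thin adapter can invoke the handler in-process.
type Client interface {
	ServiceNodes(ctx context.Context, req ServiceRequest) (ServiceNodes, error)
}
```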
Reduce Jitter to one function
Rename NewRetryWaiter
Fix a bug in calculateWait where maxWait was applied before jitter, which would make it
possible to wait longer than maxWait.
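A small sketch of the corrected ordering (names and backoff details are
illustrative): jitter is added first and the cap is applied last, so the
result can never exceed maxWait:
```go
package retry

import (
	"math/rand"
	"time"
)

// calculateWait sketches the corrected order of operations: exponential
// backoff, then jitter, and only then the maxWait cap, so jitter can no
// longer push the final wait past maxWait.
func calculateWait(failures uint, base, maxWait time.Duration) time.Duration {
	shift := failures
	if shift > 16 {
		shift = 16 // keep the shift from overflowing on long failure streaks
	}
	wait := base * (1 << shift)
	if half := int64(wait) / 2; half > 0 {
		wait += time.Duration(rand.Int63n(half)) // up to 50% jitter, added before the cap
	}
	if wait > maxWait {
		wait = maxWait
	}
	return wait
}
```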
This really only matters for unit tests, since typically if an agent shuts down its server, it follows that up by exiting the process, which would also clean up all of the networking anyway.
The subscribe endpoint needs to be able to inspect the payload to filter
events, and convert them into the protobuf types.
Use the protobuf CatalogOp type for the operation field, for now. In the
future if we end up with multiple interfaces we should be able to remove
the protobuf dependency by changing this to an int32 and adding a test
for the mapping between the values.
Make the value of the payload a concrete type instead of interface{}. We
can create other payloads for other event types.
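A hedged sketch of what such a concrete payload could look like (field names
illustrative, with a stand-in for the protobuf enum):
```go
package stream

// CatalogOp stands in for the protobuf CatalogOp enum. Keeping the field an
// integer-backed type means a later change could drop the protobuf dependency
// by switching to int32 plus a test for the mapping between values.
type CatalogOp int32

const (
	OpRegister CatalogOp = iota
	OpDeregister
)

// ServiceHealthPayload is one concrete event payload (field names are
// illustrative). The subscribe endpoint can inspect Op and Key to filter
// events and convert them to protobuf messages; other event types get their
// own payload structs instead of all sharing interface{}.
type ServiceHealthPayload struct {
	Op      CatalogOp
	Key     string // service name, used for event filtering
	Node    string
	Service string
}
```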
The router.Manager is already rebalancing servers for other connection pools, so it can call into our resolver to do the same.
This change allows us to remove the serf dependency from resolverBuilder, and remove Datacenter from the config.
Also revert the change to refreshServerRebalanceTimer
It only needs to be referenced from the Router, because there is only 1 instance, and the
Router can call AddServer/RemoveServer like it does on the Manager.
Rename GRPCClient to ClientConnPool. This type appears to be more of a
conn pool than a client. The clients receive the connections from this
pool.
Reduce some dependencies by adjusting the interface boundaries.
Remove the need to create a second slice of Servers, just to pick one and throw the rest away.
Unexport serverResolver, it is not used outside the package.
Use a RWMutex for ServerResolverBuilder, some locking is read-only.
Add more godoc.
* fix lessThanHalfTime
* get lock for CAProvider()
* make a var to relate both vars
* rename to getCAProviderWithLock
* move CertificateTimeDriftBuffer to agent/connect/ca
In an upcoming change we will need to pass a grpc.ClientConnPool from
BaseDeps into Server. While looking at that change I noticed all of the
existing consulOption fields are already on BaseDeps.
Instead of duplicating the fields, we can create a struct used by
agent/consul, and use that struct in BaseDeps. This allows us to pass
along dependencies without translating them into different
representations.
I also looked at moving all of BaseDeps in agent/consul, however that
created some circular imports. Resolving those cycles wouldn't be too
bad (it was only an error in agent/consul being imported from
cache-types), however this change seems a little better by starting to
introduce some structure to BaseDeps.
This change is also a small step in reducing the scope of Agent.
Also remove some constants that were only used by tests, and move the
relevant comment to where the live configuration is set.
Removed some validation from NewServer and NewClient, as these are not
really runtime errors. They would be code errors, which will cause a
panic anyway, so no reason to handle them specially here.
When calling `GetDatacentersByDistance()` or `GetDatacentersMap()`, an
incorrect condition was used to display a log message, thus flooding
Consul's logs.
Example of message:
```
[WARN] agent.router: Non-server in server-only area: non_server=myClientNode area=lan
```
This message is only valid for WAN areas, so filter it to avoid creating
hundreds of log lines per second on our clusters each time someone calls
this method.
Our logs were flooded by such messages when migrating our Consul servers
from 1.7.7 to 1.8.4.
This fixes issue #8663.
Occasionally this test would flake. The flakes were fixed by:
1. Stopping the service and retrying to check on metrics. This way we
also include the active_streams going to 0 in the metric calls.
2. Using a reference to the global Metrics. This way when other tests
have background goroutines that are still shutting down, they won't
emit metrics to the metric instance with the fake Sink. The stats
test can patch the local reference to the global, so the existing
statHandlers will continue to emit to the global, but the stats
test will send all metrics to the replacement.
In #8234 I changed a few tests to use TestAgent.HTTPAddr() to find the
addr used in the test. Due to the way HTTPAddr() was implemented these
tests were passing, but I think the pass was incidental. HTTPAddr() was
not matching any servers, and was instead returning the last server,
which happened to be the one these tests wanted.
This commit fixes the implementation of HTTPAddr to panic if no match
was found. The tests which require an HTTPS server are changed to use
a new firstAddr() to look up the correct address.
secondaryIntermediateCertRenewalWatch was using `retryLoopBackoff` to
renew the intermediate certificate. Once it entered the inner loop and
started `retryLoopBackoff`, it would never leave that loop.
`retryLoopBackoffAbortOnSuccess` will return when renewing is
successful, like it was intended originally.
The nodeCheck slice was being used as the first arg in append, which in some cases will modify the array backing the slice. This would lead to service checks for other services in the wrong event.
Also refactor some things to reduce the arguments to functions.
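For anyone unfamiliar with the underlying Go gotcha, a minimal standalone
illustration of the bug and the fix (copy before appending):
```go
package main

import "fmt"

func main() {
	backing := []string{"check-a", "check-b", "check-c"}
	nodeChecks := backing[:2] // len 2, cap 3: shares its backing array with "check-c"

	// Buggy: append reuses the spare capacity and overwrites "check-c",
	// which in the real bug attached checks to the wrong service's event.
	bad := append(nodeChecks, "svc1-check")
	fmt.Println(bad, backing) // backing[2] has been overwritten with "svc1-check"

	// Fix: copy into a slice we own before appending.
	good := make([]string, len(nodeChecks), len(nodeChecks)+1)
	copy(good, nodeChecks)
	good = append(good, "svc2-check")
	fmt.Println(good, backing) // backing is untouched by this append
}
```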
Creating a new readTxn does not work because it will not see the newly created objects that are about to be committed. Instead use the active write Txn.
Whenever an upsert/deletion of a config entry happens, within the open
state store transaction we speculatively test compile all discovery
chains that may be affected by the pending modification to verify that
the write would not create an erroneous scenario (such as splitting
traffic to a subset that did not exist).
If a single discovery chain evaluation references two config entries
with the same kind and name in different namespaces then sometimes the
upsert/deletion would be falsely rejected. It does not appear as though
this bug would've let invalid writes through to the state store so the
correction does not require a cleanup phase.
This commit refactors the state store usage code to track unique service
name changes on transaction commit. This means we only need to look up
usage entries when reading the information, as opposed to iterating over
a large number of service indices.
- Take into account a service instance's name being changed
- Do not iterate through the entire list of service instances; we only care
about whether there are 0, 1, or more than 1.
We add a WriteTxn interface for use in updating the usage memdb table,
with the forward-looking prospect of incrementally converting other
functions to accept interfaces.
As well, we use the ReadTxn in new usage code, and as a side effect
convert a couple of existing functions to use that interface as well.
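A sketch of those interfaces with the method sets trimmed to the essentials;
*memdb.Txn satisfies both, and each function declares only the capability it
needs:
```go
package state

import "github.com/hashicorp/go-memdb"

// ReadTxn is the read-only subset of *memdb.Txn needed by query helpers.
type ReadTxn interface {
	Get(table, index string, args ...interface{}) (memdb.ResultIterator, error)
	First(table, index string, args ...interface{}) (interface{}, error)
}

// WriteTxn adds the mutating operations used when updating the usage table;
// *memdb.Txn satisfies both interfaces.
type WriteTxn interface {
	ReadTxn
	Insert(table string, obj interface{}) error
	Delete(table string, obj interface{}) error
}
```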
Using the newly provided state store methods, we periodically emit usage
metrics from the servers.
We decided to emit these metrics from all servers, not just the leader,
because that means we do not have to care about leader election flapping
causing metrics turbulence, and it seems reasonable for each server to
emit its own view of the state, even if they should always converge
rapidly.
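A minimal sketch of the emission loop (interface and metric names are
illustrative):
```go
package usagemetrics

import (
	"context"
	"time"

	"github.com/armon/go-metrics"
)

// stateProvider stands in for the new state store usage methods.
type stateProvider interface {
	ServiceUsage() (services int, err error)
}

// Run emits usage gauges on a fixed interval until ctx is cancelled. Every
// server runs this, not just the leader, so leader flapping never interrupts
// the series; the per-server views should converge quickly anyway.
func Run(ctx context.Context, state stateProvider, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if services, err := state.ServiceUsage(); err == nil {
				metrics.SetGauge([]string{"state", "services"}, float32(services))
			}
		}
	}
}
```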
And into token.Store. This change isolates any awareness of token
persistence in a single place.
It is a small step in allowing Agent.New to accept its dependencies.
This test was only passing because t.Parallel was causing every subtest to run with the last value in the iteration,
which sets a value for all tokens. The test started to fail once t.Parallel was removed, but the same failure could
have been produced by adding 'tt := tt' to the t.Run() func.
These tests run in under 10ms, so there is no reason to use t.Parallel.
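For reference, the classic shape of the problem in Go versions before 1.22:
with t.Parallel the subtest closures run after the loop has finished, so
without re-binding the loop variable every subtest sees the last case:
```go
package example

import "testing"

func TestTokens(t *testing.T) {
	cases := []struct{ name, token string }{
		{"no token", ""},
		{"with token", "abc"},
	}
	for _, tt := range cases {
		tt := tt // re-bind so the parallel closure does not capture the shared loop variable
		t.Run(tt.name, func(t *testing.T) {
			t.Parallel()
			// Without the re-binding above, every parallel subtest would run
			// with the values of the last case in the iteration.
			_ = tt.token
		})
	}
}
```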
This ensures cache throttling parameters are properly applied:
* cache.EntryFetchMaxBurst
* cache.EntryFetchRate
When values are updated, a message is logged at INFO level (see the sketch below).
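For intuition, a sketch of applying updated values to a golang.org/x/time/rate
limiter, which is the kind of limiter these options control (structure and
logging here are illustrative):
```go
package cache

import (
	"log"

	"golang.org/x/time/rate"
)

// Options mirrors the two throttling knobs mentioned above.
type Options struct {
	EntryFetchRate     rate.Limit
	EntryFetchMaxBurst int
}

// updateRateLimit applies new values to an existing limiter without
// recreating the cache, and logs at INFO when something actually changed.
func updateRateLimit(limiter *rate.Limiter, opts Options) {
	if limiter.Limit() == opts.EntryFetchRate && limiter.Burst() == opts.EntryFetchMaxBurst {
		return
	}
	limiter.SetLimit(opts.EntryFetchRate)
	limiter.SetBurst(opts.EntryFetchMaxBurst)
	log.Printf("[INFO] cache: updated fetch rate limit: rate=%v burst=%d",
		opts.EntryFetchRate, opts.EntryFetchMaxBurst)
}
```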
This might be better handled by allowing configuration for the InMemSink interval and retention, and disabling
the global. For now this is a smaller change to remove the goroutine leak caused by tests because go-metrics
does not provide any way of shutting down the global goroutine.
With this change, Agent.New() accepts many of the dependencies instead
of creating them in New. Accepting fully constructed dependencies from
a constructor makes the type easier to test, and easier to change.
There are still a number of dependencies created in Start() which can
be addressed in a follow up.
During gossip encryption key rotation it would be nice to be able to see if all nodes are using the same key. This PR adds another field to the json response from `GET v1/operator/keyring` which lists the primary keys in use per dc. That way an operator can tell when a key was successfully set up as the primary key.
Based on https://github.com/hashicorp/serf/pull/611 to add primary key to list keyring output:
```json
[
  {
    "WAN": true,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 6,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 6
    },
    "NumNodes": 6
  },
  {
    "WAN": false,
    "Datacenter": "dc2",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 8,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  },
  {
    "WAN": false,
    "Datacenter": "dc1",
    "Segment": "",
    "Keys": {
      "0OuM4oC3Os18OblWiBbZUaHA7Hk+tNs/6nhNYtaNduM=": 3,
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "PrimaryKeys": {
      "SINm887hKTzmMWeBNKTJReaTLX3mBEJKriDyt88Ad+g=": 8
    },
    "NumNodes": 8
  }
]
```
I intentionally did not change the CLI output because I didn't find a good way of displaying this information. There are a couple of options that we could implement later:
* add a flag to show the primary keys
* add a flag to show json output
Fixes #3393.
Previously this was done in Agent.Start, which is much later than it needs to be.
The new 'dns' package was required, because otherwise there would be an
import cycle. In the future we should move more of the dns server into
the dns package.
- unexport testing shims, and document their purpose
- resolve a TODO by moving validation to NewBuilder and storing the one
field that is used instead of all of Options
- create a slice with the correct size to avoid extra allocations
While working on another change I caused a bunch of these tests to fail.
Unfortunately the failure messages were not super helpful at first.
One problem was that the request and response were created outside of
the retry. This meant that when the second attempt happened, the request
body was empty (because the buffer had been consumed), and so the
request was not actually being retried. This was fixed by moving more of
the request creation into the retry block.
Another problem was that these functions can return errors in two ways, and
are not consistent about which way they use. Some errors are returned to
the response writer, but the tests were not checking those errors, which
was causing a panic later on. This was fixed by adding a check for the
response code.
Also adds some missing t.Helper(), and has assertIndex use checkIndex so
that it is clear these are the same implementation.
I saw this test flake locally, and it was easy to reproduce with -count=10.
The failure was: 'TestAgent.dns: rpc error: error=No known Consul servers'.
Waiting for the agent seems to fix it.
This commit fixes a test that I saw flake locally while running tests. The test output from the monitor
started immediately after the line the test was looking for.
To fix the problem a channel is closed when the goroutine starts. Shutdown is not called until this channel
is closed, which seems to greatly reduce the chance of a flake.
TestAgent.Key was only used by 3 tests. Extracting it from the common helper that is used in hundreds of
tests helps keep the shared part small and more focused.
This required a second change (which I was planning on making anyway), which was to change the behaviour of
DataDir. Now in all cases the TestAgent will use the DataDir, and clean it up once the test is complete.
There are a couple reasons for this change:
1. agent.go is way too big. Smaller files make code easier to read
because tools that show usage also include the filename, which can give
a lot more context to someone trying to understand which functions
call other functions.
2. these two functions call into a large number of functions already in
keyring.go.
This is a small step to allowing Agent to accept its dependencies
instead of creating them in New.
There were two fields in autoconfig.Config that were used exclusively
to load config. These were replaced with a single function, allowing us
to move LoadConfig back to the config package.
Also removed the WithX functions for building a Config. Since these were
simple assignments, it appeared we were not getting much value from them.
Making these functions allows us to clean up how an agent is initialized. They only make use of a config and a logger, so they do not need to be agent methods.
Also cleanup the testing to use t.Run and require.
Fixes#8466
Since Consul 1.8.0 there was a bug in how ingress gateway protocol
compatibility was enforced. At the point in time that an ingress-gateway
config entry was modified the discovery chain for each upstream was
checked to ensure the ingress gateway protocol matched. Unfortunately
future modifications of other config entries were not validated against
existing ingress-gateway definitions, such as:
1. create tcp ingress-gateway pointing to 'api' (ok)
2. create service-defaults for 'api' setting protocol=http (worked, but not ok)
3. create service-splitter or service-router for 'api' (worked, but caused an agent panic)
If you were to do these in a different order, it would fail without a
crash:
1. create service-defaults for 'api' setting protocol=http (ok)
2. create service-splitter or service-router for 'api' (ok)
3. create tcp ingress-gateway pointing to 'api' (fail with message about
protocol mismatch)
This PR introduces the missing validation. The two new behaviors are:
1. create tcp ingress-gateway pointing to 'api' (ok)
2. (NEW) create service-defaults for 'api' setting protocol=http ("ok" for back compat)
3. (NEW) create service-splitter or service-router for 'api' (fail with
message about protocol mismatch)
In consideration of any existing users that may inadvertently be falling
into item (2) above, that is now officially a valid configuration to be in.
For anyone falling into item (3) above: while you can no longer use the API
to manufacture that scenario, anyone that has old (now bad) data will still
be able to have the agent use it just enough to generate a new agent/proxycfg
error message rather than a panic.
Unfortunately we just don't have enough information to properly fix the
config entries.
* changes some functions to return data instead of modifying pointer
arguments
* renames globalRPC() to keyringRPCs() to make its purpose more clear
* restructures KeyringOperation() to make it more understandable
* update bindata_assetfs.go
* Release v1.8.2
* Putting source back into Dev Mode
* changelog: add entries for 1.7.6, 1.7.5 and 1.6.7
Co-authored-by: hashicorp-ci <hashicorp-ci@users.noreply.github.com>
NotifyShutdown was only used for testing. Now that t.Cleanup exists, we
can use that instead of attaching cleanup to the Server shutdown.
The Autopilot test which used NotifyShutdown doesn't need this
notification because Shutdown is synchronous. Waiting for the function
to return is equivalent.