This function now only starts the agent.
Using:
git grep -l 'StartTestAgent(t, true,' | \
xargs sed -i -e 's/StartTestAgent(t, true,/StartTestAgent(t,/g'
When run in with `-dev` in DevMode, it is not possible to replace
the embeded UI with another one because `-dev` implies `-ui`.
This commit allows this an slightly change the error message
about Consul 0.7.0 which is very old and does not apply to
current version anyway.
This config entry will be used to configure terminating gateways.
It accepts the name of the gateway and a list of services the gateway will represent.
For each service users will be able to specify: its name, namespace, and additional options for TLS origination.
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
* Add Ingress gateway config entry and other relevant structs
* Add api package tests for ingress gateways
* Embed EnterpriseMeta into ingress service struct
* Add namespace fields to api module and test consul config write decoding
* Don't require a port for ingress gateways
* Add snakeJSON and camelJSON cases in command test
* Run Normalize on service's ent metadata
Sadly cannot think of a way to test this in OSS.
* Every protocol requires at least 1 service
* Validate ingress protocols
* Update agent/structs/config_entry_gateways.go
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Previously the log output included the test name twice and a long date
format. The test output is already grouped by test, so adding the test
name did not add any new information. The date and time are only useful
to understand elapsed time, so using a short format should provide
succident detail.
Also fixed a bug in NewTestAgentWithFields where nil was returned
instead of the test agent.
This test would occasionally fail because we checked for a status of
"critical" initially. This races with the actual healthcheck being run
and declared passing.
We instead use a ttl health check so that we don't rely on timing at all.
To reduce the chance of some tests not being run because it does not
match the regex passed to '-run'.
Also document why some tests are allowed to be skipped on CI.
If the CI environment is not correct for running tests the tests
should fail, so that we don't accidentally stop running some tests
because of a change to our CI environment.
Also removed a duplicate delcaration from init. I believe one was
overriding the other as they are both in the same package.
These changes are necessary to ensure advertisement happens correctly even when datacenters are connected via network areas in Consul enterprise.
This also changes how we check if ACLs can be upgraded within the local datacenter. Previously we would iterate through all LAN members. Now we just use the ServerLookup type to iterate through all known servers in the DC.
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1
This will avoid adding format=prometheus in request and to parse
more easily metrics using Prometheus.
This commit follows https://github.com/hashicorp/consul/pull/6514 as
the PR has been closed and extends it by accepting old Prometheus
mime-type.
* xDS Mesh Gateway Resolver Subset Fixes
The first fix was that clusters were being generated for every service resolver subset regardless of there being any service instances of the associated service in that dc. The previous logic didn’t care at all but now it will omit generating those clusters unless we also have service instances that should be proxied.
The second fix was to respect the DefaultSubset of a service resolver so that mesh-gateways would configure the endpoints of the unnamed subset cluster to only those endpoints matched by the default subsets filters.
* Refactor the gateway endpoint generation to be a little easier to read
Fixes#7231. Before an agent would always emit a warning when there is
an encrypt key in the configuration and an existing keyring stored,
which is happening on restart.
Now it only emits that warning when the encrypt key from the
configuration is not part of the keyring.
* agent: measure blocking queries
* agent.rpc: update docs to mention we only record blocking queries
* agent.rpc: make go fmt happy
* agent.rpc: fix non-atomic read and decrement with bitwise xor of uint64 0
* agent.rpc: clarify review question
* agent.rpc: today I learned that one must declare all variables before interacting with goto labels
* Update agent/consul/server.go
agent.rpc: more precise comment on `Server.queriesBlocking`
Co-Authored-By: Paul Banks <banks@banksco.de>
* Update website/source/docs/agent/telemetry.html.md
agent.rpc: improve queries_blocking description
Co-Authored-By: Paul Banks <banks@banksco.de>
* agent.rpc: fix some bugs found in review
* add a note about the updated counter behavior to telemetry.md
* docs: add upgrade-specific note on consul.rpc.quer{y,ies_blocking} behavior
Co-authored-by: Paul Banks <banks@banksco.de>
We set RawToString=true so that []uint8 => string when decoding an interface{}.
We set the MapType so that map[interface{}]interface{} decodes to map[string]interface{}.
Add tests to ensure that this doesn't break existing usages.
Fixes#7223
* fix LeaderTest_ChangeNodeID to use StatusLeft and add waitForAnyLANLeave
* unextract the waitFor... fn, simplify, and provide a more descriptive error
This fixes#7020.
There are two problems this PR solves:
* if the node info changes it is highly likely to get service and check registration permission errors unless those service tokens have node:write. Hopefully services you register don’t have this permission.
* the timer for a full sync gets reset for every partial sync which means that many partial syncs are preventing a full sync from happening
Instead of syncing node info last, after services and checks, and possibly saving one RPC because it is included in every service sync, I am syncing node info first. It is only ever going to be a single RPC that we are only doing when node info has changed. This way we are guaranteed to sync node info even when something goes wrong with services or checks which is more likely because there are more syncs happening for them.
Previously this happened to be validating only the chains in the default namespace. Now it will validate all chains in all namespaces when the global proxy-defaults is changed.
* Cleanup the discovery chain compilation route handling
Nothing functionally should be different here. The real difference is that when creating new targets or handling route destinations we use the router config entries name and namespace instead of that of the top level request. Today they SHOULD always be the same but that may not always be the case. This hopefully also makes it easier to understand how the router entries are handled.
* Refactor a small bit of the service manager tests in oss
We used to use the stringHash function to compute part of the filename where things would get persisted to. This has been changed in the core code to calling the StringHash method on the ServiceID type. It just so happens that the new method will output the same value for anything in the default namespace (by design actually). However, logically this filename computation in the test should do the same thing as the core code itself so I updated it here.
Also of note is that newer enterprise-only tests for the service manager cannot use the old stringHash function at all because it will produce incorrect results for non-default namespaces.
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192 which reportedly fixed most of the issues with that.
Related: #4984, #4986, #5050.
This adds acl enforcement to the two endpoints that were missing it.
Note that in the case of getting a services health by its id, we still
must first lookup the service so we still "leak" information about a
service with that ID existing. There isn't really a way around it though
as ACLs are meant to check service names.
* Updates to the Txn API for namespaces
* Update agent/consul/txn_endpoint.go
Co-Authored-By: R.B. Boyer <rb@hashicorp.com>
Co-authored-by: R.B. Boyer <public@richardboyer.net>
The backing RPC already existed but the endpoint will be useful for other service syncing processes such as consul-k8s as this endpoint can return all services registered with a node regardless of namespacing.
* Fix segfault when removing both a service and associated check
updateSyncState creates entries in the services and checks maps for
remote services/checks that are not found locally, so that we can then
make sure to delete them in our reconciliation process. However, the
values added to the map are missing key fields that the rest of the code
expects to not be nil.
* Add comment stating Check field can be nil
Something similar already happens inside of the server
(agent/consul/server.go) but by doing it in the general config parsing
for the agent we can have agent-level code rely on the PrimaryDatacenter
field, too.
* Increase raft notify buffer.
Fixes https://github.com/hashicorp/consul/issues/6852.
Increasing the buffer helps recovering from leader flapping. It lowers
the chances of the flapping leader to get into a deadlock situation like
described in #6852.