- return errors in TestAgent.Start so that the retry works correctly
- remove duplicate logging, the error is returned already
- add a missing t.Helper() to retry.Run
- properly set a.Agent to nil so that subsequent retry attempts will actually try to start
* Use proxy outbound port from TransparentProxyConfig if provided
* If -proxy-id is provided to the redirect-traffic command, exclude any listener ports
from inbound traffic redirection. This includes envoy_prometheus_bind_addr,
envoy_stats_bind_addr, and the ListenerPort from the Expose configuration.
* Allow users to provide additional inbound and outbound ports, outbound CIDRs
and additional user IDs to be excluded from traffic redirection.
This affects both the traffic-redirect command and the iptables SDK package.
On a few occasions I've had to read timeout stack traces for tests and
noticed that retry.Run runs the function in a goroutine. This makes
debuging a timeout more difficult because the gourinte of the retryable
function is disconnected from the stack of the actual test. It requires
searching through the entire stack trace to find the other goroutine.
By using panic instead of runtime.Goexit() we remove the need for a
separate goroutine.
Also a few other small improvements:
* add `R.Helper` so that an assertion function can be used with both
testing.T and retry.R.
* Pass t to `Retryer.NextOr`, and call `t.Helper` in a number of places
so that the line number reported by `t.Log` is the line in the test
where `retry.Run` was called, instead of some line in `retry.go` that
is not relevant to the failure.
* improve the implementation of `dedup` by removing the need to iterate
twice. Instad track the lines and skip any duplicate when writing to
the buffer.
* Add new consul connect redirect-traffic command for applying traffic redirection rules when Transparent Proxy is enabled.
* Add new iptables package for applying traffic redirection rules with iptables.
- Upgrade the ConfigEntry.ListAll RPC to be kind-aware so that older
copies of consul will not see new config entries it doesn't understand
replicate down.
- Add shim conversion code so that the old API/CLI method of interacting
with intentions will continue to work so long as none of these are
edited via config entry endpoints. Almost all of the read-only APIs will
continue to function indefinitely.
- Add new APIs that operate on individual intentions without IDs so that
the UI doesn't need to implement CAS operations.
- Add a new serf feature flag indicating support for
intentions-as-config-entries.
- The old line-item intentions way of interacting with the state store
will transparently flip between the legacy memdb table and the config
entry representations so that readers will never see a hiccup during
migration where the results are incomplete. It uses a piece of system
metadata to control the flip.
- The primary datacenter will begin migrating intentions into config
entries on startup once all servers in the datacenter are on a version
of Consul with the intentions-as-config-entries feature flag. When it is
complete the old state store representations will be cleared. We also
record a piece of system metadata indicating this has occurred. We use
this metadata to skip ALL of this code the next time the leader starts
up.
- The secondary datacenters continue to run the old intentions
replicator until all servers in the secondary DC and primary DC support
intentions-as-config-entries (via serf flag). Once this condition it met
the old intentions replicator ceases.
- The secondary datacenters replicate the new config entries as they are
migrated in the primary. When they detect that the primary has zeroed
it's old state store table it waits until all config entries up to that
point are replicated and then zeroes its own copy of the old state store
table. We also record a piece of system metadata indicating this has
occurred. We use this metadata to skip ALL of this code the next time
the leader starts up.
Occasionally we are seeing the go-test-api job timeout at 10 minutes.
Looking at the stack trace I saw the following:
1. Lots of tests blocked on server.Stop in NewTestServerConfigT. This
suggests that SIGINT is being sent to the server, but the server is
not properly shutting down.
2. Over 20k goroutines that look like this:
goroutine 16355 [select, 8 minutes]:
net/http.(*persistConn).readLoop(0xc004270240)
/usr/local/go/src/net/http/transport.go:2099 +0x99e
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1647 +0xc56
Issue 1 seems to be the main problem, but debugging that directly is not
possible because our buffered logs do not get sent when the tests
timeout. To mitigate this problem I've added a timeout to the cmd.Wait()
to force kill the process and return an error.
Unfortunately because we retry this operation, we still may not see the
cause because the next attempt will likely pass. I'm tempted to remove
the retry around NewTestServerConfigT.
Issue 2 seems to be caused by not closing the response body. Since the
request is performed many times in a loop, many goroutines are created
and are not closed until the response body is closed.
Replaces #7559
Running tests in parallel, with background goroutines, results in test output not being associated with the correct test. `go test` does not make any guarantees about output from goroutines being attributed to the correct test case.
Attaching log output from background goroutines also cause data races. If the goroutine outlives the test, it will race with the test being marked done. Previously this was noticed as a panic when logging, but with the race detector enabled it is shown as a data race.
The previous solution did not address the problem of correct test attribution because test output could still be hidden when it was associated with a test that did not fail. You would have to look at all of the log output to find the relevant lines. It also made debugging test failures more difficult because each log line was very long.
This commit attempts a new approach. Instead of printing all the logs, only print when a test fails. This should work well when there are a small number of failures, but may not work well when there are many test failures at the same time. In those cases the failures are unlikely a result of a specific test, and the log output is likely less useful.
All of the logs are printed from the test goroutine, so they should be associated with the correct test.
Also removes some test helpers that were not used, or only had a single caller. Packages which expose many functions with similar names can be difficult to use correctly.
Related:
https://github.com/golang/go/issues/38458 (may be fixed in go1.15)
https://github.com/golang/go/issues/38382#issuecomment-612940030
Switch from /v1/agent/self to /v1/status/leader when checking if the test server has come up successfully in the waitForAPI function.
Previously, the test server was relying (probably not intentionally) on the default value of the acl_enforce_version_8 in the TestConfig, which was false. So if you create a test server and enabled ACLs, they would not be enforced and the server would be able to come up pretty quickly because /v1/agent/self would return a 200 status pretty much as soon as the agent is running and most likely before leader election is finished.
Now that we have removed acl_enforce_version_8 property (equivalent to being true by default) if you've created a test server with ACLs enabled, it will need to wait for leader election and for ACLs to be initialized before it'll get a successful response from the /v1/agent/self.
Note: With this change, waitForAPI function no longer requires a 200 response status from the v1/status/leader endpoint. This is because in some tests, namely TestAPI_AgentLeave, we are only running clients, and this endpoint returns a 500 status.
This removes a race condition in reset since pendingPorts can be set to nil in reset()
If ticker is hit at wrong time, it would crash the unit test.
We ensure in reset to avoid this race condition by cancelling the goroutine using
killTicker chan.
We also properly clean up eveything, so garbage collector can work as expected.
To reproduce existing bug:
`while go test -timeout 30s github.com/hashicorp/consul/sdk/freeport -run '^(Test.*)$'; do go clean -testcache; done`
Will crash after a few 10s runs on my machine.
Error could be seen in unit tests sometimes:
[INFO] freeport: resetting the freeport package state
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1125536]
goroutine 25 [running]:
container/list.(*List).Len(...)
/usr/local/Cellar/go/1.14/libexec/src/container/list/list.go:66
github.com/hashicorp/consul/sdk/freeport.checkFreedPortsOnce()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:157 +0x86
github.com/hashicorp/consul/sdk/freeport.checkFreedPorts()
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:147 +0x71
created by github.com/hashicorp/consul/sdk/freeport.initialize
/Users/p.souchay/go/src/github.com/hashicorp/consul/sdk/freeport/freeport.go:113 +0x2cf
FAIL github.com/hashicorp/consul/sdk/freeport 1.607s
Use systemctl to properly detect ephemeral ports on Mac OS (aka darwin) by fetching
systemctl values:
* net.inet.ip.portrange.first
* net.inet.ip.portrange.last
This will avoid the message:
`[INFO] freeport: ephemeral port range detection not configured for GOOS="darwin"`
and properly detect the correct port range
Using golangci-lint has a number of advantages:
- adding new linters becomes much easier, its a couple lines of yaml config
instead of more bash scripting
- it enables whitelisting of issues using inline comments or regex
- when running multiple linters less work is done. The parsed source can be reused
by multiple linters
- linters are run in parallel to reduce CI runtime.
* Unflake the TestAPI_AgentConnectCALeaf test
* Modify the WaitForActiveCARoot to actually verify that at least one root exists
Also verify that the active root id field is set
* Fix zombie consul process in Windows
Windows doesn't support Interrupt signal, thus while stop it on Windows platform
it would fail and left zombie consul process
Currently when using the built-in CA provider for Connect, root certificates are valid for 10 years, however secondary DCs get intermediates that are valid for only 1 year. There is no mechanism currently short of rotating the root in the primary that will cause the secondary DCs to renew their intermediates.
This PR adds a check that renews the cert if it is half way through its validity period.
In order to be able to test these changes, a new configuration option was added: IntermediateCertTTL which is set extremely low in the tests.
If there is imperfect goroutine lifespan tracking if we pipe our logs
through testing.T.Logf there is a chance of a stray goroutine attempting
to log after the test that spawned it completes.
This results in a panic of:
panic: Log in goroutine after TestLeader_SecondaryCA_Initialize has completed...
This isn't great and should be fixed, but quickly runs into situations
around externally cancelling blocking queries which isn't terribly
possible at the moment. The concession here is to ignore these specific
panics for now.
This can be triggered easily when running some tests with a high
`-count=YYY` value.
Also needed to update some funcs that were taking a *testing.T to use a testing.TB. This prevents passing a nil pointer as a non-nil interface value
and thus making it impossible to detect nil before using the interfaces functions.
This should cut down on test flakiness.
Problems handled:
- If you had enough parallel test cases running, the former circular
approach to handling the port block could hand out the same port to
multiple cases before they each had a chance to bind them, leading to
one of the two tests to fail.
- The freeport library would allocate out of the ephemeral port range.
This has been corrected for Linux (which should cover CI).
- The library now waits until a formerly-in-use port is verified to be
free before putting it back into circulation.
All these changes should have no side-effects or change behavior:
- Use bytes.Buffer's String() instead of a conversion
- Use time.Since and time.Until where fitting
- Drop unnecessary returns and assignment
* Ensure the mesh gateway configuration comes back in the api within each upstream
* Add a test for the MeshGatewayConfig in the ToAPI functions
* Ensure we don’t use gateways for dc local connections
* Update the svc kind index for deletions
* Replace the proxycfg.state cache with an interface for testing
Also start implementing proxycfg state testing.
* Update the state tests to verify some gateway watches for upstream-targets of a discovery chain.