Fixes a bug whereby servers present in multiple network areas would be
properly segmented in the Router, but not in the gRPC mirror. This would
lead servers in the current datacenter leaving from a network area
(possibly during the network area's removal) from deleting their own
records that still exist in the standard WAN area.
The gRPC client stack uses the gRPC server tracker to execute all RPCs,
even those targeting members of the current datacenter (which is unlike
the net/rpc stack which has a bypass mechanism).
This would manifest as a gRPC method call never opening a socket because
it would block forever waiting for the current datacenter's pool of
servers to be non-empty.
Previously we believe it was necessary for all code that required ports
to use freeport to prevent conflicts.
https://github.com/dnephin/freeport-test shows that it is actually save
to use port 0 (`127.0.0.1:0`) as long as it is passed directly to
`net.Listen`, and the listener holds the port for as long as it is
needed.
This works because freeport explicitly avoids the ephemeral port range,
and port 0 always uses that range. As you can see from the test output
of https://github.com/dnephin/freeport-test, the two systems never use
overlapping ports.
This commit converts all uses of freeport that were being passed
directly to a net.Listen to use port 0 instead. This allows us to remove
a bit of wrapping we had around httptest, in a couple places.
1) xds and grpc servers:
1.1) to use recovery middleware with callback that prints stack trace to log
1.2) callback turn the panic into a core.Internal error
2) added unit test for grpc server
* deps: upgrade gogo-protobuf to v1.3.2
* go mod tidy using go 1.16
* proto: regen protobufs after upgrading gogo/protobuf
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
sync/atomic must be used with 64-bit aligned fields, and that alignment is difficult to
ensure unless the field is the first one in the struct.
https://golang.org/pkg/sync/atomic/#pkg-note-BUG.
We have seen test flakes caused by 'concurrent map read and map write', and the race detector
reports the problem as well (prevent us from running some tests with -race).
The root of the problem is the grpc expects resolvers to be registered at init time
before any requests are made, but we were using a separate resolver for each test.
This commit introduces a resolver registry. The registry is registered as the single
resolver for the consul scheme. Each test uses the Authority section of the target
(instead of the scheme) to identify the resolver that should be used for the test.
The scheme is used for lookup, which is why it can no longer be used as the unique
key.
This allows us to use a lock around the map of resolvers, preventing the data race.
defaultMetrics was being set at package import time, which meant that it received an instance of
the original default. But lib/telemetry.InitTelemetry sets a new global when it is called.
This resulted in the metrics being sent nowhere.
This commit changes defaultMetrics to be a function, so it will return the global instance when
called. Since it is called after InitTelemetry it will return the correct metrics instance.
Gauge metrics are great for understanding the current state, but can somtimes hide problems
if there are many disconnect/reconnects.
This commit adds counter metrics for connections and streams to make it easier to see the
count of newly created connections and streams.
Instead of using retry.Run, which appears to have problems in some cases where it does not
emit an error message, use a for loop.
Increase the number of attempts and remove any sleep, since this operation is not that expensive to do
in a tight loop
Closing l.conns can lead to a race and a 'panic: send on closed chan' when a
connection is in the middle of being handled when the server is shutting down.
Found using '-race -count=800'
The router.Manager is already rebalancing servers for other connection pools, so it can call into our resolver to do the same.
This change allows us to remove the serf dependency from resolverBuilder, and remove Datacenter from the config.
Also revert the change to refreshServerRebalanceTimer
Rename GRPCClient to ClientConnPool. This type appears to be more of a
conn pool than a client. The clients receive the connections from this
pool.
Reduce some dependencies by adjusting the interface baoundaries.
Remove the need to create a second slice of Servers, just to pick one and throw the rest away.
Unexport serverResolver, it is not used outside the package.
Use a RWMutex for ServerResolverBuilder, some locking is read-only.
Add more godoc.
Occasionally this test would flake. The flakes were fixed by:
1. Stopping the service and retrying to check on metrics. This way we
also include the active_streams going to 0 in the metric calls.
2. Using a reference to the global Metrics. This way when other tests
have background goroutines that are still shutting down, they won't
emit metrics to the metric instance with the fake Sink. The stats
test can patch the local reference to the global, so the existing
statHandlers will continue to emit to the global, but the stats
test will send all metrics to the replacement.