client: use RPC address and not serf after initial Consul discovery (#16217)

Nomad servers can advertise independent IP addresses for `serf` and
`rpc`. Somewhat unexpectedly, the `serf` address is used both for Serf gossip
and for server-to-server RPC (including Raft RPC), while the address advertised
for `rpc` is used only for client-to-server RPC. This split was introduced
intentionally in Nomad 0.8.
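
For illustration, here is a minimal agent `advertise` block pointing each
subsystem at a different address (the addresses are placeholders):

```hcl
advertise {
  # Used only for client-to-server RPC.
  rpc = "203.0.113.10:4647"

  # Used for Serf gossip and, perhaps surprisingly, for
  # server-to-server RPC, including Raft.
  serf = "10.0.0.10:4648"
}
```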

When clients use Consul discovery to connect to servers, they query Consul
with the `rpc` tag to get an initial set of server addresses. The client then
makes a `Status.Peers` RPC to get the list of those servers that are Raft
peers. But this endpoint is shared between servers and clients, and it returns
the addresses used for Raft rather than those advertised for client RPC.
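
As a rough sketch of that first discovery step (illustrative only, not the
client's actual code; it assumes the default `nomad` service name that servers
register in Consul):

```go
package main

import (
	"fmt"
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Filtering on the "rpc" tag returns each server's client-facing
	// RPC address and port, not its serf/gossip address.
	services, _, err := client.Catalog().Service("nomad", "rpc", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, svc := range services {
		fmt.Printf("server RPC endpoint: %s:%d\n", svc.ServiceAddress, svc.ServicePort)
	}
}
```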

Most of the time this is harmless because servers bind on 0.0.0.0 anyway.
But in topologies where servers are on a private network and clients are on
separate subnets (or even public subnets), clients will make initial contact
with the server to get the list of peers but then populate their local server
set with unreachable addresses.

Cluster administrators can work around this problem by using `server_join` with
specific IP addresses (or DNS names), because the `Node.UpdateStatus` endpoint
returns the correct set of RPC addresses when updating the node. So once a
client has registered, it gets the right addresses from then on.
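
For example, a client agent configuration along these lines (addresses are
placeholders) sidesteps the discovery bug entirely:

```hcl
client {
  enabled = true

  server_join {
    # Known-reachable server RPC addresses. After registration, the
    # client gets the advertised RPC addresses back via Node.UpdateStatus.
    retry_join = ["203.0.113.10:4647", "203.0.113.11:4647"]
  }
}
```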

This changeset updates the client logic to query `Status.Members` instead of
`Status.Peers`, and then extract the advertised RPC address and port from each
member's tags in the response.

.changelog/16217.txt (new file)

@@ -0,0 +1,3 @@
```release-note:bug
client: Fixed a bug where clients used the serf advertise address to connect to servers when using Consul auto-discovery
```


```diff
@@ -2944,23 +2944,29 @@ DISCOLOOP:
 			mErr.Errors = append(mErr.Errors, err)
 			continue
 		}
-		var peers []string
-		if err := c.connPool.RPC(region, addr, "Status.Peers", rpcargs, &peers); err != nil {
+		// Query the members from the region that Consul gave us, and
+		// extract the client-advertise RPC address from each member
+		var membersResp structs.ServerMembersResponse
+		if err := c.connPool.RPC(region, addr, "Status.Members", rpcargs, &membersResp); err != nil {
 			mErr.Errors = append(mErr.Errors, err)
 			continue
 		}
-		// Successfully received the Server peers list of the correct
-		// region
-		for _, p := range peers {
-			addr, err := net.ResolveTCPAddr("tcp", p)
-			if err != nil {
-				mErr.Errors = append(mErr.Errors, err)
-				continue
+
+		for _, member := range membersResp.Members {
+			if addrTag, ok := member.Tags["rpc_addr"]; ok {
+				if portTag, ok := member.Tags["port"]; ok {
+					addr, err := net.ResolveTCPAddr("tcp",
+						fmt.Sprintf("%s:%s", addrTag, portTag))
+					if err != nil {
+						mErr.Errors = append(mErr.Errors, err)
+						continue
+					}
+					srv := &servers.Server{Addr: addr}
+					nomadServers = append(nomadServers, srv)
+				}
 			}
-			srv := &servers.Server{Addr: addr}
-			nomadServers = append(nomadServers, srv)
 		}
 
 		if len(nomadServers) > 0 {
 			break DISCOLOOP
 		}
```