Set MaxStale default to 10 years and add a stale counter (#2481)

Default MaxStale to 10 years and add a counter at `consul.dns.stale_queries` that tracks when an agent serves a query that's stale by at least 5 seconds. Previously, MaxStale defaulted to 5 seconds and DNS would become unavailable after a short period of time with no leader. This new default allows DNS requests to still be served in the event of a long outage.

Fixes #2460.
This commit is contained in:
Kyle Havlovitz 2016-11-08 14:45:12 -05:00 committed by GitHub
parent a629920e25
commit 1ffdf04bd7
4 changed files with 48 additions and 24 deletions

View File

@ -677,7 +677,7 @@ func DefaultConfig() *Config {
DNSConfig: DNSConfig{
AllowStale: Bool(true),
UDPAnswerLimit: 3,
MaxStale: 5 * time.Second,
MaxStale: 10 * 365 * 24 * time.Hour,
RecursorTimeout: 2 * time.Second,
},
Telemetry: Telemetry{

View File

@ -24,6 +24,9 @@ const (
// times.
maxUDPAnswerLimit = 8
maxRecurseRecords = 5
// Increment a counter when requests staler than this are served
staleCounterThreshold = 5 * time.Second
)
// DNSServer is used to wrap an Agent and expose various
@ -437,10 +440,14 @@ RPC:
}
// Verify that request is not too stale, redo the request
if args.AllowStale && out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
if args.AllowStale {
if out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
} else if out.LastContact > staleCounterThreshold {
metrics.IncrCounter([]string{"consul", "dns", "stale_queries"}, 1)
}
}
// If we have no address, return not found!
@ -637,10 +644,14 @@ RPC:
}
// Verify that request is not too stale, redo the request
if args.AllowStale && out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
if args.AllowStale {
if out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
} else if out.LastContact > staleCounterThreshold {
metrics.IncrCounter([]string{"consul", "dns", "stale_queries"}, 1)
}
}
// Determine the TTL
@ -739,10 +750,14 @@ RPC:
}
// Verify that request is not too stale, redo the request.
if args.AllowStale && out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
if args.AllowStale {
if out.LastContact > d.config.MaxStale {
args.AllowStale = false
d.logger.Printf("[WARN] dns: Query results too stale, re-requesting")
goto RPC
} else if out.LastContact > staleCounterThreshold {
metrics.IncrCounter([]string{"consul", "dns", "stale_queries"}, 1)
}
}
// Determine the TTL. The parse should never fail since we vet it when

View File

@ -536,39 +536,42 @@ Consul will not enable TLS for the HTTP API unless the `https` port has been ass
by the leader, providing stronger consistency but less throughput and higher latency. In Consul
0.7 and later, this defaults to true for better utilization of available servers.
* <a name="max_stale"></a><a href="#max_stale">`max_stale`</a> When [`allow_stale`](#allow_stale)
is specified, this is used to limit how
stale results are allowed to be. By default, this is set to "5s":
if a Consul server is more than 5 seconds behind the leader, the query will be
re-evaluated on the leader to get more up-to-date results.
* <a name="max_stale"></a><a href="#max_stale">`max_stale`</a> - When [`allow_stale`](#allow_stale)
is specified, this is used to limit how stale results are allowed to be. If a Consul server is
behind the leader by more than `max_stale`, the query will be re-evaluated on the leader to get
more up-to-date results. Prior to Consul 0.7.1 this defaulted to 5 seconds; in Consul 0.7.1
and later this defaults to 10 years ("87600h") which effectively allows DNS queries to be answered
by any server, no matter how stale. In practice, servers are usually only milliseconds behind the
leader, so this lets Consul continue serving requests in long outage scenarios where no leader can
be elected.
* <a name="node_ttl"></a><a href="#node_ttl">`node_ttl`</a> By default, this is "0s", so all
* <a name="node_ttl"></a><a href="#node_ttl">`node_ttl`</a> - By default, this is "0s", so all
node lookups are served with a 0 TTL value. DNS caching for node lookups can be enabled by
setting this value. This should be specified with the "s" suffix for second or "m" for minute.
* <a name="service_ttl"></a><a href="#service_ttl">`service_ttl`</a> This is a sub-object
* <a name="service_ttl"></a><a href="#service_ttl">`service_ttl`</a> - This is a sub-object
which allows for setting a TTL on service lookups with a per-service policy. The "*" wildcard
service can be used when there is no specific policy available for a service. By default, all
services are served with a 0 TTL value. DNS caching for service lookups can be enabled by
setting this value.
* <a name="enable_truncate"></a><a href="#enable_truncate">`enable_truncate`</a> If set to
* <a name="enable_truncate"></a><a href="#enable_truncate">`enable_truncate`</a> - If set to
true, a UDP DNS query that would return more than 3 records, or more than would fit into a valid
UDP response, will set the truncated flag, indicating to clients that they should re-query
using TCP to get the full set of records.
* <a name="only_passing"></a><a href="#only_passing">`only_passing`</a> If set to true, any
* <a name="only_passing"></a><a href="#only_passing">`only_passing`</a> - If set to true, any
nodes whose health checks are warning or critical will be excluded from DNS results. If false,
the default, only nodes whose healthchecks are failing as critical will be excluded. For
service lookups, the health checks of the node itself, as well as the service-specific checks
are considered. For example, if a node has a health check that is critical then all services on
that node will be excluded because they are also considered critical.
* <a name="recursor_timeout"></a><a href="#recursor_timeout">`recursor_timeout`</a> Timeout used
* <a name="recursor_timeout"></a><a href="#recursor_timeout">`recursor_timeout`</a> - Timeout used
by Consul when recursively querying an upstream DNS server. See <a href="#recursors">`recursors`</a>
for more details. Default is 2s. This is available in Consul 0.7 and later.
* <a name="disable_compression"></a><a href="#disable_compression">`disable_compression`</a> If
* <a name="disable_compression"></a><a href="#disable_compression">`disable_compression`</a> - If
set to true, DNS responses will not be compressed. Compression was added and enabled by default
in Consul 0.7.

View File

@ -177,6 +177,12 @@ These metrics give insight into the health of the cluster as a whole.
<td>ms</td>
<td>timer</td>
</tr>
<tr>
<td>`consul.dns.stale_queries`</td>
<td>Available in Consul 0.7.1 and later, this increments when an agent serves a DNS query based on information from a server that is more than 5 seconds out of date.</td>
<td>queries</td>
<td>counter</td>
</tr>
<tr>
<td>`consul.http.<verb>.<path>`</td>
<td>This tracks how long it takes to service the given HTTP request for the given verb and path. Note that paths do not include details like service or key names, for these an underscore will be present as a placeholder (eg. `consul.http.GET.v1.kv._`)</td>