Merge pull request #2283 from hashicorp/f-outage-docs

Updates docs for new peers.json behavior.
This commit is contained in:
James Phillips 2016-08-16 15:13:50 -07:00 committed by GitHub
commit a19e0f6cfa
3 changed files with 105 additions and 13 deletions

View File

@ -56,6 +56,10 @@ BACKWARDS INCOMPATIBILITIES:
header was added to allow clients to detect if translation is enabled for HTTP
responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
the local address regardless of translation. [GH-2280]
* The behavior of the `peers.json` file is different in this version of Consul:
this file won't normally be present and is used only during outage recovery. Be
sure to read [Outage Recovery Guide](http://localhost:4567/docs/guides/outage.html)
for details.
IMPROVEMENTS:

View File

@ -46,16 +46,35 @@ this server is online, the cluster will return to a fully healthy state.
Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server from the `raft/peers.json`
file on all remaining servers.
IP isn't an option, you need to remove the failed server using the `raft/peers.json`
recovery file on all remaining servers.
To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
~> Note that in Consul 0.7 and later, the peers.json file is no longer present
by default and is only used when performing recovery. This file will be deleted
after Consul starts and ingests this file. Consul 0.7 also uses a new, automatically-
created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
operation.
<br>
<br>
Note that using `raft/peers.json` for recovery can cause uncommitted Raft log
entries to be committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers in place on a periodic basis,
for example.
<br>
<br>
When the final version of Consul 0.7 ships, it should include a command to
remove a dead peer without having to stop servers and edit the `raft/peers.json`
recovery file.
The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to edit the `raft/peers.json` file. It should look
sub-directory. We need to create a `raft/peers.json` file. It should look
something like:
```javascript
@ -66,13 +85,27 @@ something like:
]
```
Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
Simply create entries for all remaining servers. You must confirm
that servers you do not include here have indeed failed and will not later
rejoin the cluster. Ensure that this file is the same across all remaining
server nodes.
At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the [`join`](/docs/commands/join.html) command:
At this point, you can restart all the remaining servers. In Consul 0.7 and
later you will see them ingest recovery file:
```text
...
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
...
```
If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`join`](/docs/commands/join.html) command:
```text
$ consul join <Node Address>
@ -113,3 +146,32 @@ You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
## Failure of Multiple Servers in a Multi-Server Cluster
In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.
The procedure is the same as for the single-server case above; you simply include
just the remaining servers in the `raft/peers.json` recovery file. The cluster
should be able to elect a leader once the remaining servers are all restarted with
an identical `raft/peers.json` configuration.
Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command.
In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.
Note that prior to Consul 0.7 it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this was ingested before any Raft
log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

View File

@ -16,6 +16,15 @@ standard upgrade flow.
## Consul 0.7
Consul version 0.7 is a very large release with many important changes. Changes
to be aware of during an upgrade are categorized below.
#### Default Configuration Changes
The default behavior of [`skip_leave_on_interrupt`](/docs/agent/options.html#skip_leave_on_interrupt)
is now dependent on whether or not the agent is acting as a server or client. When Consul is started as a
server the default is `true` and `false` when a client.
#### Dropped Support for Protocol Version 1
Consul version 0.7 dropped support for protocol version 1, which means it
@ -31,20 +40,37 @@ itself. This feature enables using the distance sorting features of prepared
queries without explicitly providing the node to sort near in requests, but
requires the agent servicing a request to send additional information about
itself to the Consul servers when executing the prepared query. Agents prior
to 0.7.0 do not send this information, which means they are unable to properly
to 0.7 do not send this information, which means they are unable to properly
execute prepared queries configured with a `Near` parameter. Similarly, any
server nodes prior to version 0.7.0 are unable to store the `Near` parameter,
server nodes prior to version 0.7 are unable to store the `Near` parameter,
making them unable to properly serve requests for prepared queries using the
feature. It is recommended that all agents be running version 0.7.0 prior to
feature. It is recommended that all agents be running version 0.7 prior to
using this feature.
#### WAN Address Translation in HTTP Endpoints
Consul version 0.7 added support for translating WAN addresses in certain
[HTTP endpoints](/docs/agent/options.html#translate_wan_addrs). The servers
and the agents need to be running version 0.7.0 or later in order to use this
and the agents need to be running version 0.7 or later in order to use this
feature.
These translated addresses could break clients that are expecting local
addresses. A new [`X-Consul-Translate-Addresses`](/docs/agent/http.html#translate_header)
header was added to allow clients to detect if translation is enabled for HTTP
responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
the local address regardless of translation.
#### Changes to Outage Recovery and `peers.json`
The `peers.json` file is no longer present by default and is only used when
performing recovery. This file will be deleted after Consul starts and ingests
this file. Consul 0.7 also uses a new, automatically-created raft/peers.info file
to avoid ingesting the `peers.json` file on the first start after upgrading (it
is simply deleted on the first start after upgrading).
Please be sure to review the [Outage Recovery Guide](/docs/guides/outage.html)
before upgrading for more details.
## Consul 0.6.4
Consul 0.6.4 made some substantial changes to how ACLs work with prepared