Merge pull request #2283 from hashicorp/f-outage-docs
Updates docs for new peers.json behavior.
commit a19e0f6cfa
@@ -56,6 +56,10 @@ BACKWARDS INCOMPATIBILITIES:
 header was added to allow clients to detect if translation is enabled for HTTP
 responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
 the local address regardless of translation. [GH-2280]
+* The behavior of the `peers.json` file is different in this version of Consul:
+this file won't normally be present and is used only during outage recovery. Be
+sure to read the [Outage Recovery Guide](https://www.consul.io/docs/guides/outage.html)
+for details.

 IMPROVEMENTS:

@@ -46,16 +46,35 @@ this server is online, the cluster will return to a fully healthy state.

 Both of these strategies involve a potentially lengthy time to reboot or rebuild
 a failed server. If this is impractical or if building a new server with the same
-IP isn't an option, you need to remove the failed server from the `raft/peers.json`
-file on all remaining servers.
+IP isn't an option, you need to remove the failed server using the `raft/peers.json`
+recovery file on all remaining servers.

 To begin, stop all remaining servers. You can attempt a graceful leave,
 but it will not work in most cases. Do not worry if the leave exits with an
 error. The cluster is in an unhealthy state, so this is expected.

+~> Note that in Consul 0.7 and later, the `peers.json` file is no longer present
+by default and is only used when performing recovery. This file will be deleted
+after Consul starts and ingests it. Consul 0.7 also uses a new, automatically
+created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
+first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
+operation.
+<br>
+<br>
+Note that using `raft/peers.json` for recovery can cause uncommitted Raft log
+entries to be committed, so this should only be used after an outage where no
+other option is available to recover a lost server. Make sure you don't have
+any automated processes that will put the peers file in place on a periodic basis,
+for example.
+<br>
+<br>
+When the final version of Consul 0.7 ships, it should include a command to
+remove a dead peer without having to stop servers and edit the `raft/peers.json`
+recovery file.
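As a rough illustration of the file layout described in the note above, the following Python sketch reports which recovery-related files are present under a server's data directory. It is not part of Consul: the `recovery_file_status` helper is hypothetical, and only the `raft/peers.json` and `raft/peers.info` names come from the text above.

```python
from pathlib import Path


def recovery_file_status(data_dir):
    """Report which recovery-related files exist under <data_dir>/raft.

    Hypothetical helper for illustration only; it inspects the file
    system and does not talk to Consul.
    """
    raft_dir = Path(data_dir) / "raft"
    return {
        # Present only when an operator has staged a recovery; per the note
        # above, Consul 0.7 and later ingests and then deletes this file.
        "peers_json_present": (raft_dir / "peers.json").is_file(),
        # Created automatically by Consul 0.7 and later; leave it in place.
        "peers_info_present": (raft_dir / "peers.info").is_file(),
    }
```

A check like this could be run before restarting a server, for instance to catch an automated process that has put a stale `raft/peers.json` back in place.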
+
 The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
 of each Consul server. Inside that directory, there will be a `raft/`
-sub-directory. We need to edit the `raft/peers.json` file. It should look
+sub-directory. We need to create a `raft/peers.json` file. It should look
 something like:

 ```javascript
@@ -66,13 +85,27 @@ something like:
 ]
 ```

-Simply delete the entries for all the failed servers. You must confirm
-those servers have indeed failed and will not later rejoin the cluster.
-Ensure that this file is the same across all remaining server nodes.
+Simply create entries for all remaining servers. You must confirm
+that servers you do not include here have indeed failed and will not later
+rejoin the cluster. Ensure that this file is the same across all remaining
+server nodes.
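The step above can be sketched in Python. This is a minimal sketch, not an official tool. It assumes `raft/peers.json` is a JSON array of `"ip:port"` strings; the sample file is elided from this diff, so that format is an assumption, and the `write_peers_json` helper is hypothetical.

```python
import json
from pathlib import Path


def write_peers_json(data_dir, remaining_servers):
    """Write <data_dir>/raft/peers.json listing only the surviving servers.

    Assumption: the file is a JSON array of "ip:port" strings. Run this
    while the Consul servers are stopped, with the same list on each one.
    """
    if not remaining_servers:
        raise ValueError("at least one remaining server is required")
    # Sort and de-duplicate so every server ends up with an identical file.
    peers = sorted(set(remaining_servers))
    # The raft/ sub-directory is expected to exist already; if it does not,
    # write_text raises FileNotFoundError, which hints at a wrong data_dir.
    path = Path(data_dir) / "raft" / "peers.json"
    path.write_text(json.dumps(peers, indent=2) + "\n")
    return path
```

For example, `write_peers_json("/var/consul", ["10.0.1.6:8300", "10.0.1.8:8300"])` would be run with the same list on each remaining server before restarting; the path and addresses here are placeholders.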

-At this point, you can restart all the remaining servers. If any servers
-managed to perform a graceful leave, you may need to have them rejoin
-the cluster using the [`join`](/docs/commands/join.html) command:
+At this point, you can restart all the remaining servers. In Consul 0.7 and
+later you will see them ingest the recovery file:
+
+```text
+...
+2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
+2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
+2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
+2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
+2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
+2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
+...
+```
+
+If any servers managed to perform a graceful leave, you may need to have them
+rejoin the cluster using the [`join`](/docs/commands/join.html) command:

 ```text
 $ consul join <Node Address>
@@ -113,3 +146,32 @@ You should verify that one server claims to be the `Leader` and all the
 others should be in the `Follower` state. All the nodes should agree on the
 peer count as well. This count is (N-1), since a server does not count itself
 as a peer.
+
+## Failure of Multiple Servers in a Multi-Server Cluster
+
+In the event that multiple servers are lost, causing a loss of quorum and a
+complete outage, partial recovery is possible using data on the remaining
+servers in the cluster. There may be data loss in this situation because multiple
+servers were lost, so information about what's committed could be incomplete.
+The recovery process implicitly commits all outstanding Raft log entries, so
+it's also possible to commit data that was uncommitted before the failure.
+
+The procedure is the same as for the single-server case above; you include
+only the remaining servers in the `raft/peers.json` recovery file. The cluster
+should be able to elect a leader once the remaining servers are all restarted with
+an identical `raft/peers.json` configuration.
+
+Any new servers you introduce later can be fresh with totally clean data directories
+and joined using Consul's `join` command.
+
+In extreme cases, it should be possible to recover with just a single remaining
+server by starting that single server with itself as the only peer in the
+`raft/peers.json` recovery file.
+
+Note that prior to Consul 0.7 it wasn't always possible to recover from certain
+types of outages with `raft/peers.json` because the file was ingested before any Raft
+log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
+recovery file is final, and a snapshot is taken after it is ingested, so you are
+guaranteed to start with your recovered configuration. This does implicitly commit
+all Raft log entries, so it should only be used to recover from an outage, but it
+should allow recovery from any situation where there's some cluster data available.

@@ -16,6 +16,15 @@ standard upgrade flow.

 ## Consul 0.7

+Consul version 0.7 is a very large release with many important changes. Changes
+to be aware of during an upgrade are categorized below.
+
+#### Default Configuration Changes
+
+The default behavior of [`skip_leave_on_interrupt`](/docs/agent/options.html#skip_leave_on_interrupt)
+is now dependent on whether the agent is acting as a server or a client. When Consul is started as a
+server the default is `true`; when started as a client it is `false`.
+
 #### Dropped Support for Protocol Version 1

 Consul version 0.7 dropped support for protocol version 1, which means it

@@ -31,20 +40,37 @@ itself. This feature enables using the distance sorting features of prepared
 queries without explicitly providing the node to sort near in requests, but
 requires the agent servicing a request to send additional information about
 itself to the Consul servers when executing the prepared query. Agents prior
-to 0.7.0 do not send this information, which means they are unable to properly
+to 0.7 do not send this information, which means they are unable to properly
 execute prepared queries configured with a `Near` parameter. Similarly, any
-server nodes prior to version 0.7.0 are unable to store the `Near` parameter,
+server nodes prior to version 0.7 are unable to store the `Near` parameter,
 making them unable to properly serve requests for prepared queries using the
-feature. It is recommended that all agents be running version 0.7.0 prior to
+feature. It is recommended that all agents be running version 0.7 prior to
 using this feature.

 #### WAN Address Translation in HTTP Endpoints

 Consul version 0.7 added support for translating WAN addresses in certain
 [HTTP endpoints](/docs/agent/options.html#translate_wan_addrs). The servers
-and the agents need to be running version 0.7.0 or later in order to use this
+and the agents need to be running version 0.7 or later in order to use this
 feature.

+These translated addresses could break clients that are expecting local
+addresses. A new [`X-Consul-Translate-Addresses`](/docs/agent/http.html#translate_header)
+header was added to allow clients to detect if translation is enabled for HTTP
+responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
+the local address regardless of translation.
+
+#### Changes to Outage Recovery and `peers.json`
+
+The `peers.json` file is no longer present by default and is only used when
+performing recovery. This file will be deleted after Consul starts and ingests
+it. Consul 0.7 also uses a new, automatically created `raft/peers.info` file
+to avoid ingesting the `peers.json` file on the first start after upgrading (it
+is simply deleted on that first start).
+
+Please be sure to review the [Outage Recovery Guide](/docs/guides/outage.html)
+before upgrading for more details.
+
 ## Consul 0.6.4

 Consul 0.6.4 made some substantial changes to how ACLs work with prepared