website: Adding guide on outage recovery

Armon Dadgar 2014-04-11 12:55:02 -07:00
parent caf48c8722
commit 37bd90f09d
3 changed files with 88 additions and 0 deletions


@@ -27,4 +27,6 @@ The following guides are available:
* [Multiple Datacenters](/docs/guides/datacenters.html) - Configuring Consul to support multiple
datacenters.
* [Outage Recovery](/docs/guides/outage.html) - This guide covers recovering a cluster
that has become unavailable due to server failures.


@@ -0,0 +1,82 @@
---
layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
---
# Outage Recovery
Do not panic! This is a critical first step. Depending on your
[deployment configuration](/docs/internals/consensus.html#toc_3), a single
server failure may be enough to make the cluster unavailable. Recovery
requires an operator to intervene, but it is straightforward.
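
To see why one failure can be enough, recall that the consensus protocol needs a majority (quorum) of servers to make progress. The following is a minimal sketch of that arithmetic, not part of Consul itself; the function names `quorum_size` and `failure_tolerance` are hypothetical helpers introduced here for illustration.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of `servers` voting members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a majority still remains."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    # Print the quorum size and tolerated failures for a few cluster sizes.
    for n in (1, 2, 3, 5):
        print(f"servers={n} quorum={quorum_size(n)} tolerance={failure_tolerance(n)}")
```

With one or two servers the tolerance is zero, which is why losing a single server in those configurations makes the cluster unavailable.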
If you had only a single server and it has failed, simply restart it.
Note that a single server configuration requires the `-bootstrap` flag.
If that server cannot be recovered, you need to bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html). Data loss
is inevitable, since data was not replicated to any other servers. This
is why a single-server deployment is never recommended. Any services registered
with agents will be re-populated when the new server comes online, as
agents perform anti-entropy.
In a multi-server deploy, some number N of servers remain after the failure.
The first step is to simply stop all the remaining servers. You can attempt a
graceful leave, but it will not work in most cases. Do not worry if the leave
exits with an error; that is expected while the cluster is in an unhealthy state.
The next step is to go to the `-data-dir` of each Consul server. Inside
that directory, there will be a `raft/` sub-directory. We need to edit
the `raft/peers.json` file. It should look something like:
```
["10.0.1.8:8300","10.0.1.6:8300","10.0.1.7:8300"]
```
Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed, and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
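
If you prefer to script the edit rather than do it by hand, the sketch below shows one possible approach. It assumes `peers.json` is a flat JSON array of `"ip:port"` strings, as in the example above; `prune_peers` and `peers_agree` are hypothetical helper names, and the code works on file contents as strings so you can decide separately how to read and write the files on each server.

```python
import json
from typing import Iterable, List


def prune_peers(peers_json: str, failed: Iterable[str]) -> str:
    """Return peers.json content with the failed servers removed."""
    failed_set = set(failed)
    peers: List[str] = json.loads(peers_json)
    remaining = [p for p in peers if p not in failed_set]
    # Compact separators keep the same one-line layout as the original file.
    return json.dumps(remaining, separators=(",", ":"))


def peers_agree(contents: Iterable[str]) -> bool:
    """Check that every server's peers.json parses to the same list."""
    parsed = [json.loads(c) for c in contents]
    return all(p == parsed[0] for p in parsed)


if __name__ == "__main__":
    original = '["10.0.1.8:8300","10.0.1.6:8300","10.0.1.7:8300"]'
    print(prune_peers(original, ["10.0.1.7:8300"]))
```

Run the same pruning on every remaining server while Consul is stopped, then compare the results before restarting anything.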
At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the `join` command:
```
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```
Any existing member can be used to rejoin the cluster, as the gossip
protocol will take care of discovering the server nodes.
At this point the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:
```
[INFO] consul: cluster leadership acquired
```
Additionally, the `info` command can be a useful debugging tool:
```
$ consul info
...
raft:
applied_index = 47244
commit_index = 47244
fsm_pending = 0
last_log_index = 47244
last_log_term = 21
last_snapshot_index = 40966
last_snapshot_term = 20
num_peers = 2
state = Leader
term = 21
...
```
You should verify that one server claims to be the `Leader` and that all the
others are in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
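
These checks can also be automated once you have collected the `consul info` output from each server. The sketch below assumes the output has the `key = value` layout shown above under a `raft:` heading; the parser and the names `parse_raft_section` and `cluster_looks_healthy` are illustrative assumptions, not part of Consul.

```python
from typing import Dict, List


def parse_raft_section(info_output: str) -> Dict[str, str]:
    """Extract the key/value pairs listed under the `raft:` heading."""
    raft: Dict[str, str] = {}
    in_raft = False
    for raw in info_output.splitlines():
        line = raw.strip()
        if line.endswith(":") and "=" not in line:
            # A section heading; only the raft section is of interest here.
            in_raft = line == "raft:"
            continue
        if in_raft and "=" in line:
            key, _, value = line.partition("=")
            raft[key.strip()] = value.strip()
    return raft


def cluster_looks_healthy(outputs: List[str]) -> bool:
    """One Leader, the rest Followers, and every peer count equal to N-1."""
    rafts = [parse_raft_section(o) for o in outputs]
    states = [r.get("state") for r in rafts]
    expected_peers = str(len(rafts) - 1)
    return (
        states.count("Leader") == 1
        and states.count("Follower") == len(rafts) - 1
        and all(r.get("num_peers") == expected_peers for r in rafts)
    )
```

Pass one captured output per remaining server; a `False` result only tells you that something disagrees, so inspect the individual outputs to find out what.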


@@ -139,6 +139,10 @@
<li<%= sidebar_current("docs-guides-datacenters") %>>
<a href="/docs/guides/datacenters.html">Multiple Datacenters</a>
</li>
<li<%= sidebar_current("docs-guides-outage") %>>
<a href="/docs/guides/outage.html">Outage Recovery</a>
</li>
</ul>
</ul>