website: Adding guide on outage recovery

2014-04-11 12:55:02 -07:00 · 2014-04-11 12:55:02 -07:00 · 37bd90f09d
parent caf48c8722
commit 37bd90f09d
3 changed files with 88 additions and 0 deletions
--- a/website/source/docs/guides/index.html.markdown
+++ b/website/source/docs/guides/index.html.markdown
@ -27,4 +27,6 @@ The following guides are available:
 * [Multiple Datacenters](/docs/guides/datacenters.html) - Configuring Consul to support multiple
 datacenters.

+ * [Outage Recovery](/docs/guides/outage.html) - Recovering a cluster
+ that has become unavailable due to server failures.

--- a/website/source/docs/guides/outage.html.markdown
+++ b/website/source/docs/guides/outage.html.markdown
@ -0,0 +1,82 @@
+---
+layout: "docs"
+page_title: "Outage Recovery"
+sidebar_current: "docs-guides-outage"
+---
+
+# Outage Recovery
+
+Do not panic! This is a critical first step. Depending on your
+[deployment configuration](/docs/internals/consensus.html#toc_3), a single
+server failure may be enough to make the cluster unavailable. Recovery
+requires an operator to intervene, but is straightforward.
+
+If you had only a single server and it has failed, simply restart it.
+Note that a single server configuration requires the `-bootstrap` flag.
+If that server cannot be recovered, you need to bring up a new server.
+See the [bootstrapping guide](/docs/guides/bootstrapping.html). In that case,
+data loss is inevitable, since the data was not replicated to any other server.
+This is why a single server deploy is never recommended. Any services registered
+with agents will be re-populated when the new server comes online, as
+agents perform anti-entropy.
+
+In a multi-server deploy, some number N of servers remain after the failure.
+The first step is to simply stop all the remaining servers. You can attempt a
+graceful leave, but it will not work in most cases. Do not worry if the leave
+exits with an error; this is expected while the cluster is in an unhealthy state.
+
+The next step is to go to the `-data-dir` of each Consul server. Inside
+that directory, there will be a `raft/` sub-directory. We need to edit
+the `raft/peers.json` file. It should be something like:
+
+```
+["10.0.1.8:8300","10.0.1.6:8300","10.0.1.7:8300"]
+```
+
+Simply delete the entries for all the failed servers. You must confirm
+those servers have indeed failed, and will not later rejoin the cluster.
+Ensure that this file is the same across all remaining server nodes.
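+
+For example, assuming `10.0.1.8` is the server that failed (a hypothetical
+choice made purely for illustration), the `raft/peers.json` file on both
+remaining servers would be edited to read:
+
+```
+["10.0.1.6:8300","10.0.1.7:8300"]
+```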
+
+At this point, you can restart all the remaining servers. If any servers
+managed to perform a graceful leave, you may need to have them rejoin
+the cluster using the `join` command:
+
+```
+$ consul join <Node Address>
+Successfully joined cluster by contacting 1 nodes.
+```
+
+Any existing member can be used to rejoin the cluster,
+as the gossip protocol will take care of discovering the server nodes.
+
+At this point the cluster should be in an operable state again. One of the
+nodes should claim leadership and emit a log like:
+
+```
+[INFO] consul: cluster leadership acquired
+```
+
+Additionally, the `info` command can be a useful debugging tool:
+
+```
+$ consul info
+...
+raft:
+	applied_index = 47244
+	commit_index = 47244
+	fsm_pending = 0
+	last_log_index = 47244
+	last_log_term = 21
+	last_snapshot_index = 40966
+	last_snapshot_term = 20
+	num_peers = 2
+	state = Leader
+	term = 21
+...
+```
+
+You should verify that one server claims to be the `Leader`, and that all the
+others are in the `Follower` state. All the nodes should agree on the
+peer count as well. This count is (N-1), since a server does not count itself
+as a peer.
+
--- a/website/source/layouts/docs.erb
+++ b/website/source/layouts/docs.erb
@ -139,6 +139,10 @@
                    <li<%= sidebar_current("docs-guides-datacenters") %>>
 					<a href="/docs/guides/datacenters.html">Multiple Datacenters</a>
                    </li>
+
+                    <li<%= sidebar_current("docs-guides-outage") %>>
+					<a href="/docs/guides/outage.html">Outage Recovery</a>
+                    </li>
                </ul>

 			</ul>