Updates documentation with details on the Consul operator actions.

2016-08-30 13:15:37 -07:00 · 2016-08-30 13:15:37 -07:00 · c063a1a8d0
parent 6be1e07fec
commit c063a1a8d0
4 changed files with 213 additions and 97 deletions
--- a/website/source/docs/agent/http/operator.html.markdown
+++ b/website/source/docs/agent/http/operator.html.markdown
@@ -8,7 +8,7 @@ description: >
 # Operator HTTP Endpoint
-The Operator endpoints provide cluster-level tools for Consul operators, such
+The Operator endpoint provides cluster-level tools for Consul operators, such
 as interacting with the Raft subsystem. This was added in Consul 0.7.
 ~> Use this interface with extreme caution, as improper use could lead to a Consul
@@ -40,9 +40,93 @@ The Raft configuration endpoint supports the `GET` method.
 #### GET Method
 When using the `GET` method, the request will be forwarded to the cluster
 leader to retrieve its latest Raft peer configuration.
 If the cluster doesn't currently have a leader, an error will be returned. You
 can use the "?stale" query parameter to read the Raft configuration from any
 of the Consul servers.
 By default, the datacenter of the agent is queried; however, the `dc` can be
 provided using the "?dc=" query parameter.
 If ACLs are enabled, the client will need to supply an ACL Token with
 [`operator`](/docs/internals/acl.html#operator) read privileges.
 A JSON body is returned that looks like this:
 ```javascript
 {
  "Servers": [
    {
      "ID": "127.0.0.1:8300",
      "Node": "alice",
      "Address": "127.0.0.1:8300",
      "Leader": true,
      "Voter": true
    },
    {
      "ID": "127.0.0.2:8300",
      "Node": "bob",
      "Address": "127.0.0.2:8300",
      "Leader": false,
      "Voter": true
    },
    {
      "ID": "127.0.0.3:8300",
      "Node": "carol",
      "Address": "127.0.0.3:8300",
      "Leader": false,
      "Voter": true
    }
  ],
  "Index": 22
 }
 ```
 The `Servers` array has information about the servers in the Raft peer
 configuration:
 `ID` is the ID of the server. This is the same as the `Address` in Consul 0.7
 but may be upgraded to a GUID in a future version of Consul.
 `Node` is the node name of the server, as known to Consul, or "(unknown)" if
 the node is stale and not known.
 `Address` is the IP:port for the server.
 `Leader` is either "true" or "false" depending on the server's role in the
 Raft configuration.
 `Voter` is "true" or "false", indicating if the server has a vote in the Raft
 configuration. Future versions of Consul may add support for non-voting servers.
 The `Index` value is the Raft index corresponding to this configuration. Note that
 the latest configuration may not yet be committed if changes are in flight.
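As a rough illustration of how a client might consume this response, the following Python sketch parses a body like the one shown above and reports the leader, the number of voters, and the majority needed for quorum. The function name `summarize_raft_config` is hypothetical and not part of Consul; the quorum figure uses the standard Raft majority rule (more than half of the voters).

```python
import json


def summarize_raft_config(body: str) -> dict:
    """Summarize a Raft configuration response (hypothetical helper)."""
    config = json.loads(body)
    servers = config["Servers"]
    leaders = [s["Node"] for s in servers if s["Leader"]]
    voters = [s for s in servers if s["Voter"]]
    return {
        # None when the configuration was read without a leader (e.g. ?stale).
        "leader": leaders[0] if leaders else None,
        "voters": len(voters),
        # A Raft majority is more than half of the voting servers.
        "quorum": len(voters) // 2 + 1,
        # Servers whose node name is not known to Consul.
        "unknown": [s["Address"] for s in servers if s["Node"] == "(unknown)"],
        "index": config["Index"],
    }


sample = """
{
  "Servers": [
    {"ID": "127.0.0.1:8300", "Node": "alice", "Address": "127.0.0.1:8300",
     "Leader": true, "Voter": true},
    {"ID": "127.0.0.2:8300", "Node": "bob", "Address": "127.0.0.2:8300",
     "Leader": false, "Voter": true},
    {"ID": "127.0.0.3:8300", "Node": "carol", "Address": "127.0.0.3:8300",
     "Leader": false, "Voter": true}
  ],
  "Index": 22
}
"""

summary = summarize_raft_config(sample)
print(summary)
```

An entry in `unknown` is a candidate for the stale-peer situation that the peer removal endpoint addresses, though that should be confirmed against the cluster before removing anything.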
 ### <a name="raft-peer"></a> /v1/operator/raft/peer
 The Raft peer endpoint supports the `DELETE` method.
 #### DELETE Method
 Using the `DELETE` method, this endpoint will remove the Consul server with
 the given address from the Raft configuration.
 There are rare cases where a peer may be left behind in the Raft configuration
 even though the server is no longer present and known to the cluster. This
 endpoint can be used to remove the failed server so that it no longer
 affects the Raft quorum.
 An "?address=" query parameter is required and should be set to the
 "IP:port" for the server to remove. The port number is usually 8300, unless
 configured otherwise. Nothing is required in the body of the request.
 By default, the datacenter of the agent is targeted; however, the `dc` can be
 provided using the "?dc=" query parameter.
 If ACLs are enabled, the client will need to supply an ACL Token with
 [`operator`](/docs/internals/acl.html#operator) write privileges.
 The return code will indicate success or failure.
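A minimal sketch of how the `DELETE` request described above could be constructed with Python's standard library. The agent address `http://127.0.0.1:8500` and the helper name `build_remove_peer_request` are assumptions for illustration. Passing the ACL token through a `token` query parameter is also an assumption; check how your deployment supplies tokens. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_remove_peer_request(address, dc=None, token=None,
                              agent="http://127.0.0.1:8500"):
    """Build (but do not send) a DELETE request for /v1/operator/raft/peer.

    `agent` is an assumed local agent address; adjust for your environment.
    """
    params = {"address": address}  # required: "IP:port" of the server to remove
    if dc:
        params["dc"] = dc          # optional: target another datacenter
    if token:
        params["token"] = token    # assumption: token passed as a query parameter
    url = f"{agent}/v1/operator/raft/peer?{urlencode(params)}"
    # No request body is needed for this endpoint.
    return Request(url, method="DELETE")


req = build_remove_peer_request("10.0.1.8:8300", dc="dc1")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on an error status, which lines up with the endpoint signalling success or failure through its return code.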
--- a/website/source/docs/commands/operator.html.markdown
+++ b/website/source/docs/commands/operator.html.markdown
@@ -49,13 +49,14 @@ The `raft` subcommand is used to view and modify Consul's Raft configuration.
 Two actions are available, as detailed in this section.
 <a name="raft-list-peers"></a>
-`raft -list-peers -stale=[true|false]`
+#### Display Peer Configuration
 This action displays the current Raft peer configuration.
-The `-stale` argument defaults to "false" which means the leader provides the
+Usage: `raft -list-peers -stale=[true|false]`
-result. If the cluster is in an outage state without a leader, you may need
+
-to set `-stale` to "true" to get the configuration from a non-leader server.
+* `-stale` - Optional and defaults to "false" which means the leader provides
 the result. If the cluster is in an outage state without a leader, you may need
 to set this to "true" to get the configuration from a non-leader server.
 The output looks like this:
@@ -66,35 +67,36 @@ bob      127.0.0.2:8300  127.0.0.2:8300  leader    true
 carol    127.0.0.3:8300  127.0.0.3:8300  follower  true
 ```
-* `Node` is the node name of the server, as known to Consul, or "(unknown)" if
+`Node` is the node name of the server, as known to Consul, or "(unknown)" if
-  the node is stale at not known.
+the node is stale and not known.
-* `ID` is the ID of the server. This is the same as the `Address` in Consul 0.7
+`ID` is the ID of the server. This is the same as the `Address` in Consul 0.7
-  but may  be upgraded to a GUID in a future version of Consul.
+but may be upgraded to a GUID in a future version of Consul.
-* `Address` is the IP:port for the server.
+`Address` is the IP:port for the server.
-* `State` is either "follower" or "leader" depending on the server's role in the
+`State` is either "follower" or "leader" depending on the server's role in the
-   Raft configuration.
+Raft configuration.
-* `Voter` is "true" or "false", indicating if the server has a vote in the Raft
+`Voter` is "true" or "false", indicating if the server has a vote in the Raft
-   configuration. Future versions of Consul may add support for non-voting
+configuration. Future versions of Consul may add support for non-voting servers.
   servers.
 <a name="raft-remove-peer"></a>
-`raft -remove-peer -address="IP:port"`
+#### Remove a Peer
 This command removes the Consul server with the given address from the Raft configuration.
-This command removes Consul server with given -address from the Raft
+There are rare cases where a peer may be left behind in the Raft configuration
-configuration.
+even though the server is no longer present and known to the cluster. This command
 The `-address` argument is required and is the "IP:port" for the server to
 remove. The port number is usually 8300, unless configured otherwise.
 There are rare cases where a peer may be left behind in the Raft quorum even
 though the server is no longer present and known to the cluster. This command
 can be used to remove the failed server so that it no longer affects the
 Raft quorum. If the server still shows in the output of the
 [`consul members`](/docs/commands/members.html) command, it is preferable to
 clean up by simply running
 [`consul force-leave`](/docs/commands/force-leave.html)
 instead of this command.
 Usage: `raft -remove-peer -address="IP:port"`
 * `-address` - "IP:port" for the server to remove. The port number is usually
 8300, unless configured otherwise.
 The return code will indicate success or failure.
--- a/website/source/docs/guides/outage.html.markdown
+++ b/website/source/docs/guides/outage.html.markdown
@@ -38,20 +38,72 @@ comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html)
 ## Failure of a Server in a Multi-Server Cluster
 If you think the failed server is recoverable, the easiest option is to bring
-it back online and have it rejoin the cluster, returning the cluster to a fully
+it back online and have it rejoin the cluster with the same IP address, returning
-healthy state. Similarly, even if you need to rebuild a new Consul server to
+the cluster to a fully healthy state. Similarly, even if you need to rebuild a
-replace the failed node, you may wish to do that immediately. Keep in mind that
+new Consul server to replace the failed node, you may wish to do that immediately.
-the rebuilt server needs to have the same IP as the failed server. Again, once
+Keep in mind that the rebuilt server needs to have the same IP address as the failed
-this server is online, the cluster will return to a fully healthy state.
+server. Again, once this server is online and has rejoined, the cluster will return
 to a fully healthy state.
 Both of these strategies involve a potentially lengthy time to reboot or rebuild
 a failed server. If this is impractical or if building a new server with the same
 IP isn't an option, you need to remove the failed server. Usually, you can issue
-a [`force-leave`](/docs/commands/force-leave.html) command to remove the failed
+a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed
 server if it's still a member of the cluster.
-If the `force-leave` isn't able to remove the server, you can remove it manually
+If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the
-using the `raft/peers.json` recovery file on all remaining servers.
+server, you have two methods available to remove it, depending on your version of Consul:
 * In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-remove-peer)
 command to remove the stale peer server on the fly with no downtime.
 * In versions of Consul prior to 0.7, you can manually remove the stale peer
 server using the `raft/peers.json` recovery file on all remaining servers. See
 the [section below](#peers.json) for details on this procedure. This process
 requires Consul downtime to complete.
 In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
 command to inspect the Raft configuration:
 ```
 $ consul operator raft -list-peers
 Node     ID              Address         State     Voter
 alice    10.0.1.8:8300   10.0.1.8:8300   follower  true
 bob      10.0.1.6:8300   10.0.1.6:8300   leader    true
 carol    10.0.1.7:8300   10.0.1.7:8300   follower  true
 ```
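If this output needs to be checked from a script, a small parser is one option. The sketch below assumes the whitespace-separated column layout shown above, which is not a documented stable format; the helper name `parse_list_peers` is hypothetical. For anything long-lived, the JSON returned by the Operator HTTP endpoint is likely a better source.

```python
def parse_list_peers(output: str) -> list:
    """Parse `consul operator raft -list-peers` text into a list of dicts.

    Assumes one header row followed by one row per server, with no
    whitespace inside any column value.
    """
    lines = [line for line in output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample_output = """
Node     ID              Address         State     Voter
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true
"""

peers = parse_list_peers(sample_output)
leaders = [p["Node"] for p in peers if p["State"] == "leader"]
print(leaders)
```

A healthy cluster should yield exactly one entry in `leaders`; an empty list suggests the cluster has no leader.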
 ## Failure of Multiple Servers in a Multi-Server Cluster
 In the event that multiple servers are lost, causing a loss of quorum and a
 complete outage, partial recovery is possible using data on the remaining
 servers in the cluster. There may be data loss in this situation because multiple
 servers were lost, so information about what's committed could be incomplete.
 The recovery process implicitly commits all outstanding Raft log entries, so
 it's also possible to commit data that was uncommitted before the failure.
 See the [section below](#peers.json) for details of the recovery procedure. You
 simply include just the remaining servers in the `raft/peers.json` recovery file.
 The cluster should be able to elect a leader once the remaining servers are all
 restarted with an identical `raft/peers.json` configuration.
 Any new servers you introduce later can be fresh with totally clean data directories
 and joined using Consul's `join` command.
 In extreme cases, it should be possible to recover with just a single remaining
 server by starting that single server with itself as the only peer in the
 `raft/peers.json` recovery file.
 Note that prior to Consul 0.7 it wasn't always possible to recover from certain
 types of outages with `raft/peers.json` because this was ingested before any Raft
 log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
 recovery file is final, and a snapshot is taken after it is ingested, so you are
 guaranteed to start with your recovered configuration. This does implicitly commit
 all Raft log entries, so it should only be used to recover from an outage, but it
 should allow recovery from any situation where there's some cluster data available.
 <a name="peers.json"></a>
 ## Manual Recovery Using peers.json
 To begin, stop all remaining servers. You can attempt a graceful leave,
 but it will not work in most cases. Do not worry if the leave exits with an
@@ -70,11 +122,6 @@ implicitly committed, so this should only be used after an outage where no
 other option is available to recover a lost server. Make sure you don't have
 any automated processes that will put the peers file in place on a periodic basis,
 for example.
 <br>
 <br>
 When the final version of Consul 0.7 ships, it should include a command to
 remove a dead peer without having to stop servers and edit the `raft/peers.json`
 recovery file.
 The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
 of each Consul server. Inside that directory, there will be a `raft/`
@@ -83,9 +130,9 @@ something like:
 ```javascript
 [
-  "10.0.1.8:8300",
+"10.0.1.8:8300",
-  "10.0.1.6:8300",
+"10.0.1.6:8300",
-  "10.0.1.7:8300"
+"10.0.1.7:8300"
 ]
 ```
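For the recovery cases described in this guide, the file contents can be generated rather than typed by hand. The sketch below drops the failed servers from a known server list and renders the remaining "IP:port" entries as a JSON array in the format shown above. The helper name `render_peers_json` is hypothetical, and writing the result to `raft/peers.json` on each server is deliberately left out.

```python
import json


def render_peers_json(all_servers, failed):
    """Return peers.json contents listing only the remaining servers."""
    failed_set = set(failed)
    remaining = [addr for addr in all_servers if addr not in failed_set]
    if not remaining:
        # Recovery needs at least one surviving server to act as a peer.
        raise ValueError("at least one remaining server is required")
    return json.dumps(remaining, indent=2)


text = render_peers_json(
    ["10.0.1.8:8300", "10.0.1.6:8300", "10.0.1.7:8300"],
    failed=["10.0.1.7:8300"],
)
print(text)
```

The same rendered text should be placed on every remaining server, since the servers need an identical `raft/peers.json` configuration to elect a leader.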
@@ -126,56 +173,13 @@ nodes should claim leadership and emit a log like:
 [INFO] consul: cluster leadership acquired
 ```
-Additionally, the [`info`](/docs/commands/info.html) command can be a useful
+In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
-debugging tool:
+command to inspect the Raft configuration:
 ```text
 $ consul info
 ...
 raft:
 	applied_index = 47244
 	commit_index = 47244
 	fsm_pending = 0
 	last_log_index = 47244
 	last_log_term = 21
 	last_snapshot_index = 40966
 	last_snapshot_term = 20
 	num_peers = 2
 	state = Leader
 	term = 21
 ...
 ```
-
+$ consul operator raft -list-peers
-You should verify that one server claims to be the `Leader` and all the
+Node     ID              Address         State     Voter
-others should be in the `Follower` state. All the nodes should agree on the
+alice    10.0.1.8:8300   10.0.1.8:8300   follower  true
-peer count as well. This count is (N-1), since a server does not count itself
+bob      10.0.1.6:8300   10.0.1.6:8300   leader    true
-as a peer.
+carol    10.0.1.7:8300   10.0.1.7:8300   follower  true
-
+```
 ## Failure of Multiple Servers in a Multi-Server Cluster
 In the event that multiple servers are lost, causing a loss of quorum and a
 complete outage, partial recovery is possible using data on the remaining
 servers in the cluster. There may be data loss in this situation because multiple
 servers were lost, so information about what's committed could be incomplete.
 The recovery process implicitly commits all outstanding Raft log entries, so
 it's also possible to commit data that was uncommitted before the failure.
 The procedure is the same as for the single-server case above; you simply include
 just the remaining servers in the `raft/peers.json` recovery file. The cluster
 should be able to elect a leader once the remaining servers are all restarted with
 an identical `raft/peers.json` configuration.
 Any new servers you introduce later can be fresh with totally clean data directories
 and joined using Consul's `join` command.
 In extreme cases, it should be possible to recover with just a single remaining
 server by starting that single server with itself as the only peer in the
 `raft/peers.json` recovery file.
 Note that prior to Consul 0.7 it wasn't always possible to recover from certain
 types of outages with `raft/peers.json` because this was ingested before any Raft
 log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
 recovery file is final, and a snapshot is taken after it is ingested, so you are
 guaranteed to start with your recovered configuration. This does implicitly commit
 all Raft log entries, so should only be used to recover from an outage, but it
 should allow recovery from any situation where there's some cluster data available.
--- a/website/source/docs/internals/acl.html.markdown
+++ b/website/source/docs/internals/acl.html.markdown
@@ -210,6 +210,9 @@ query "" {
 # Read-only mode for the encryption keyring by default (list only)
 keyring = "read"
 # Read-only mode for Consul operator interfaces (list only)
 operator = "read"
 ```
 This is equivalent to the following JSON input:
@@ -248,13 +251,14 @@ This is equivalent to the following JSON input:
      "policy": "read"
    }
  },
-  "keyring": "read"
+  "keyring": "read",
  "operator": "read"
 }
 ```
 ## Building ACL Policies
-#### Blacklist mode and `consul exec`
+#### Blacklist Mode and `consul exec`
 If you set [`acl_default_policy`](/docs/agent/options.html#acl_default_policy)
 to `deny`, the `anonymous` token won't have permission to read the default
@@ -279,7 +283,7 @@ Alternatively, you can, of course, add an explicit
 [`acl_token`](/docs/agent/options.html#acl_token) to each agent, giving it access
 to that prefix.
-#### Blacklist mode and Service Discovery
+#### Blacklist Mode and Service Discovery
 If your [`acl_default_policy`](/docs/agent/options.html#acl_default_policy) is
 set to `deny`, the `anonymous` token will be unable to read any service
@@ -327,12 +331,12 @@ event "" {
 As always, the more secure way to handle user events is to explicitly grant
 access to each API token based on the events they should be able to fire.
-#### Blacklist mode and Prepared Queries
+#### Blacklist Mode and Prepared Queries
 After Consul 0.6.3, significant changes were made to ACLs for prepared queries,
 including a new `query` ACL policy. See [Prepared Query ACLs](#prepared_query_acls) below for more details.
-#### Blacklist mode and Keyring Operations
+#### Blacklist Mode and Keyring Operations
 Consul 0.6 and later supports securing the encryption keyring operations using
 ACLs. Encryption is an optional component of the gossip layer. More information
@@ -353,6 +357,28 @@ Encryption keyring operations are sensitive and should be properly secured. It
 is recommended that instead of configuring a wide-open policy like above, a
 per-token policy is applied to maximize security.
 <a name="operator"></a>
 #### Blacklist Mode and Consul Operator Actions
 Consul 0.7 added special Consul operator actions which are protected by a new
 `operator` ACL policy. The operator actions cover:
 * [Operator HTTP endpoint](/docs/agent/http/operator.html)
 * [Operator CLI command](/docs/commands/operator.html)
 If your [`acl_default_policy`](/docs/agent/options.html#acl_default_policy) is
 set to `deny`, then the `anonymous` token will not have access to Consul operator
 actions. Granting `read` access allows reading information for diagnostic purposes
 without making any changes to state. Granting `write` access allows reading
 information and changing state. Here's an example policy:
 ```
 operator = "write"
 ```
 ~> Grant `write` access to operator actions with extreme caution, as improper use
   could lead to a Consul outage and even loss of data.
 #### Services and Checks with ACLs
 Consul allows configuring ACL policies which may control access to service and