Updates to the Adding/Removing Servers Guide (#5004)
* added a new section for adding servers, updated section titles, and added code snippets.
* Fixing typos
* fixing typos
* Addressing some of Paul's feedback.
* Updated the outage recovery recommendation
parent
21c69d7304
commit
0b4ed6ea6e
|
@ -1,12 +1,12 @@
|
||||||
---
|
---
|
||||||
layout: "docs"
|
layout: "docs"
|
||||||
page_title: "Adding/Removing Servers"
|
page_title: "Adding & Removing Servers"
|
||||||
sidebar_current: "docs-guides-servers"
|
sidebar_current: "docs-guides-servers"
|
||||||
description: |-
|
description: |-
|
||||||
Consul is designed to require minimal operator involvement, however any changes to the set of Consul servers must be handled carefully. To better understand why, reading about the consensus protocol will be useful. In short, the Consul servers perform leader election and replication. For changes to be processed, a minimum quorum of servers (N/2)+1 must be available. That means if there are 3 server nodes, at least 2 must be available.
|
Consul is designed to require minimal operator involvement, however any changes to the set of Consul servers must be handled carefully. To better understand why, reading about the consensus protocol will be useful. In short, the Consul servers perform leader election and replication. For changes to be processed, a minimum quorum of servers (N/2)+1 must be available. That means if there are 3 server nodes, at least 2 must be available.
|
||||||
---
|
---
|
||||||
|
|
||||||
# Adding/Removing Servers
|
# Adding & Removing Servers
|
||||||
|
|
||||||
Consul is designed to require minimal operator involvement, however any changes
|
Consul is designed to require minimal operator involvement, however any changes
|
||||||
to the set of Consul servers must be handled carefully. To better understand
|
to the set of Consul servers must be handled carefully. To better understand
|
||||||
|
@ -18,13 +18,16 @@ That means if there are 3 server nodes, at least 2 must be available.
|
||||||
In general, if you are ever adding and removing nodes simultaneously, it is better
|
In general, if you are ever adding and removing nodes simultaneously, it is better
|
||||||
to first add the new nodes and then remove the old nodes.
|
to first add the new nodes and then remove the old nodes.
|
||||||
|
|
||||||
## Adding New Servers
|
In this guide, we will cover the different methods for adding and removing servers.
|
||||||
|
|
||||||
Adding new servers is generally straightforward. Simply start the new
|
## Manually Add a New Server
|
||||||
agent with the `-server` flag. At this point, the server will not be a member of
|
|
||||||
|
Manually adding new servers is generally straightforward: start the new
|
||||||
|
agent with the `-server` flag. At this point the server will not be a member of
|
||||||
any cluster, and should emit something like:
|
any cluster, and should emit something like:
|
||||||
|
|
||||||
```text
|
```sh
|
||||||
|
consul agent -server
|
||||||
[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
|
[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -32,14 +35,50 @@ This means that it does not know about any peers and is not configured to elect
|
||||||
This is expected, and we can now add this node to the existing cluster using `join`.
|
This is expected, and we can now add this node to the existing cluster using `join`.
|
||||||
From the new server, we can join any member of the existing cluster:
|
From the new server, we can join any member of the existing cluster:
|
||||||
|
|
||||||
```text
|
```sh
|
||||||
$ consul join <Node Address>
|
$ consul join <Existing Node Address>
|
||||||
Successfully joined cluster by contacting 1 nodes.
|
Successfully joined cluster by contacting 1 nodes.
|
||||||
```
|
```
|
||||||
|
|
||||||
It is important to note that any node, including a non-server may be specified for
|
It is important to note that any node, including a non-server may be specified for
|
||||||
join. The gossip protocol is used to properly discover all the nodes in the cluster.
|
join. Generally, this method is good for testing purposes but not recommended for production
|
||||||
Once the node has joined, the existing cluster leader should log something like:
|
deployments. For production clusters, you will likely want to use the agent configuration
|
||||||
|
option to add additional servers.
|
||||||
|
|
||||||
|
## Add a Server with Agent Configuration
|
||||||
|
|
||||||
|
In production environments, you should use the [agent configuration](https://www.consul.io/docs/agent/options.html) option, `retry_join`. `retry_join` can be used as a command line flag or in the agent configuration file.
|
||||||
|
|
||||||
|
With the Consul CLI:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
$ consul agent -retry-join=52.10.110.11 -retry-join=52.10.110.12 -retry-join=52.10.100.13
|
||||||
|
```
|
||||||
|
|
||||||
|
In the agent configuration file:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"bootstrap": false,
|
||||||
|
"bootstrap_expect": 3,
|
||||||
|
"server": true,
|
||||||
|
"retry_join": ["52.10.110.11", "52.10.110.12", "52.10.100.13"]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
[`retry_join`](https://www.consul.io/docs/agent/options.html#retry-join)
|
||||||
|
will ensure that if any server loses connection
|
||||||
|
with the cluster for any reason, including the node restarting, it can
|
||||||
|
rejoin when it comes back. In addition to working with static IPs, it
|
||||||
|
can also be useful for other discovery mechanisms, such as auto joining
|
||||||
|
based on cloud metadata and discovery. Both servers and clients can use this method.
|
||||||
|
|
||||||
|
### Server Coordination
|
||||||
|
|
||||||
|
To ensure Consul servers are joining the cluster properly, you should monitor
|
||||||
|
the server coordination. The gossip protocol is used to properly discover all
|
||||||
|
the nodes in the cluster. Once the node has joined, the existing cluster
|
||||||
|
leader should log something like:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
[INFO] raft: Added peer 127.0.0.2:8300, starting replication
|
[INFO] raft: Added peer 127.0.0.2:8300, starting replication
|
||||||
|
@ -60,7 +99,7 @@ raft:
|
||||||
last_log_term = 21
|
last_log_term = 21
|
||||||
last_snapshot_index = 40966
|
last_snapshot_index = 40966
|
||||||
last_snapshot_term = 20
|
last_snapshot_term = 20
|
||||||
num_peers = 2
|
num_peers = 4
|
||||||
state = Leader
|
state = Leader
|
||||||
term = 21
|
term = 21
|
||||||
...
|
...
|
||||||
|
@ -75,12 +114,12 @@ It is best to add servers one at a time, allowing them to catch up. This avoids
|
||||||
the possibility of data loss in case the existing servers fail while bringing
|
the possibility of data loss in case the existing servers fail while bringing
|
||||||
the new servers up-to-date.
|
the new servers up-to-date.
|
||||||
|
|
||||||
## Removing Servers
|
## Manually Remove a Server
|
||||||
|
|
||||||
Removing servers must be done carefully to avoid causing an availability outage.
|
Removing servers must be done carefully to avoid causing an availability outage.
|
||||||
For a cluster of N servers, at least (N/2)+1 must be available for the cluster
|
For a cluster of N servers, at least (N/2)+1 must be available for the cluster
|
||||||
to function. See this [deployment table](/docs/internals/consensus.html#toc_4).
|
to function. See this [deployment table](/docs/internals/consensus.html#toc_4).
|
||||||
If you have 3 servers, and 1 of them is currently failed, removing any servers
|
If you have 3 servers and 1 of them is currently failing, removing any other servers
|
||||||
will cause the cluster to become unavailable.
|
will cause the cluster to become unavailable.
|
||||||
|
|
||||||
To avoid this, it may be necessary to first add new servers to the cluster,
|
To avoid this, it may be necessary to first add new servers to the cluster,
|
||||||
|
@ -92,6 +131,10 @@ Once you have verified the existing servers are healthy, and that the cluster
|
||||||
can handle a node leaving, the actual process is simple. You simply issue a
|
can handle a node leaving, the actual process is simple. You simply issue a
|
||||||
`leave` command to the server.
|
`leave` command to the server.
|
||||||
|
|
||||||
|
```sh
|
||||||
|
consul leave
|
||||||
|
```
|
||||||
|
|
||||||
The server leaving should contain logs like:
|
The server leaving should contain logs like:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
|
@ -114,13 +157,15 @@ The leader should also emit various logs including:
|
||||||
At this point the node has been gracefully removed from the cluster, and
|
At this point the node has been gracefully removed from the cluster, and
|
||||||
will shut down.
|
will shut down.
|
||||||
|
|
||||||
If the leader is affected by an outage, then [manual recovery](/docs/guides/outage.html#manual-recovery-using-peers-json) needs to be done.
|
|
||||||
|
|
||||||
To remove all agents that accidentally joined the wrong set of servers, clear out the contents of the data directory (`-data-dir`) on both client and server nodes.
|
To remove all agents that accidentally joined the wrong set of servers, clear out the contents of the data directory (`-data-dir`) on both client and server nodes.
|
||||||
|
|
||||||
|
These graceful methods of removing servers assume you have a healthy cluster.
|
||||||
|
If the cluster has no leader due to loss of quorum or data corruption, you should
|
||||||
|
plan for [outage recovery](/docs/guides/outage.html#manual-recovery-using-peers-json).
|
||||||
|
|
||||||
!> **WARNING** Removing data on server nodes will destroy all state in the cluster
|
!> **WARNING** Removing data on server nodes will destroy all state in the cluster
|
||||||
|
|
||||||
## Forced Removal
|
## Manual Forced Removal
|
||||||
|
|
||||||
In some cases, it may not be possible to gracefully remove a server. For example,
|
In some cases, it may not be possible to gracefully remove a server. For example,
|
||||||
if the server simply fails, then there is no ability to issue a leave. Instead,
|
if the server simply fails, then there is no ability to issue a leave. Instead,
|
||||||
|
@ -130,5 +175,17 @@ If the server can be recovered, it is best to bring it back online and then grac
|
||||||
leave the cluster. However, if this is not a possibility, then the `force-leave` command
|
leave the cluster. However, if this is not a possibility, then the `force-leave` command
|
||||||
can be used to force removal of a server.
|
can be used to force removal of a server.
|
||||||
|
|
||||||
|
```sh
|
||||||
|
consul force-leave <node>
|
||||||
|
```
|
||||||
|
|
||||||
This is done by invoking that command with the name of the failed node. At this point,
|
This is done by invoking that command with the name of the failed node. At this point,
|
||||||
the cluster leader will mark the node as having left the cluster and it will stop attempting to replicate.
|
the cluster leader will mark the node as having left the cluster and it will stop attempting to replicate.
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
In this guide, we learned the straightforward process of adding and removing servers, including:
|
||||||
|
manually adding servers, adding servers through the agent configuration, gracefully removing
|
||||||
|
servers, and forcing removal of servers. Finally, we should restate that manually adding servers
|
||||||
|
is good for testing purposes; for production, however, it is recommended to add servers with
|
||||||
|
the agent configuration.
|
||||||
|
|
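The guide states that a cluster of N servers needs at least (N/2)+1 of them available. As a minimal sketch of that arithmetic, the helpers below compute the quorum size and the number of server failures a cluster can tolerate. The function names `quorum` and `failure_tolerance` are hypothetical and are not part of the Consul CLI; the division is integer division, as in the guide's example of 3 servers needing 2.

```sh
#!/bin/sh
# Hypothetical helpers (not Consul commands) illustrating the (N/2)+1 rule.

# quorum N: minimum number of servers that must be available.
# Shell arithmetic uses integer division, so 3 / 2 evaluates to 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: servers that can fail while quorum is still met.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the values for a few common cluster sizes.
for n in 1 3 5 7; do
  echo "servers=$n quorum=$(quorum "$n") tolerated_failures=$(failure_tolerance "$n")"
done
```

Running a check like this before issuing `consul leave` can help confirm that the remaining servers would still form a quorum.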