raft: default to protocol v3 (#11572)

Many of Nomad's Autopilot features require raft protocol version
3. Set the default raft protocol to 3, and improve the upgrade
documentation.
Tim Gross 2022-02-03 15:03:12 -05:00 committed by GitHub
parent 5f48e18189
commit 7ad15b2b42
5 changed files with 105 additions and 36 deletions

.changelog/11572.txt Normal file

@@ -0,0 +1,7 @@
```release-note:improvement
raft: The default raft protocol version is now 3.
```
```release-note:deprecation
Raft protocol version 2 is deprecated and will be removed in Nomad 1.4.0.
```


@@ -953,6 +953,7 @@ func DefaultConfig() *Config {
Enabled: false,
EnableEventBroker: helper.BoolToPtr(true),
EventBufferSize: helper.IntToPtr(100),
RaftProtocol: 3,
StartJoin: []string{},
ServerJoin: &ServerJoin{
RetryJoin: []string{},


@@ -161,7 +161,7 @@ server {
required as the agent internally knows the latest version, but may be useful
in some upgrade scenarios.
- `raft_protocol` `(int: 2)` - Specifies the Raft protocol version to use when
- `raft_protocol` `(int: 3)` - Specifies the Raft protocol version to use when
communicating with other Nomad servers. This affects available Autopilot
features and is typically not required as the agent internally knows the
latest version, but may be useful in some upgrade scenarios.


@@ -153,3 +153,83 @@ differences may require specific steps.
[node-status]: /docs/commands/node/status
[server-members]: /docs/commands/server/members
[upgrade-specific]: /docs/upgrade/upgrade-specific
## Upgrading to Raft Protocol 3
This section provides details on upgrading to Raft Protocol 3. Raft
protocol version 3 requires Nomad running 0.8.0 or newer on all
servers in order to work. Raft protocol version 2 will be removed in
Nomad 1.4.0.
To see the version of the Raft protocol in use on each server, use the
`nomad operator raft list-peers` command.
Note that the format of `peers.json` used for outage recovery is
different when running with the latest Raft protocol. See [Manual
Recovery Using
peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
for a description of the required format.
When using Raft protocol version 3, servers are identified by their
`node-id` instead of their IP address when Nomad makes changes to its
internal Raft quorum configuration. This means that once a cluster has
been upgraded with servers all running Raft protocol version 3, it
will no longer allow servers running any older Raft protocol versions
to be added.
### Upgrading a Production Cluster to Raft Version 3
For production Raft clusters with 3 or more members, the easiest way
to upgrade servers is to have each server leave the cluster, upgrade
its [`raft_protocol`] version in the `server` stanza, and then add it
back. Make sure the new server joins successfully and that the cluster
is stable before rolling the upgrade forward to the next server. It's
also possible to stand up a new set of servers, and then slowly stand
down each of the older servers in a similar fashion.
For in-place raft protocol upgrades, perform the following for each
server, leaving the leader until last to reduce the chance of leader
elections that will slow down the process:
* Stop the server
* Run `nomad server force-leave $server_name`
* Update the `raft_protocol` in the server's configuration file to 3.
* Restart the server
* Run `nomad operator raft list-peers` to verify that the `raft_vsn`
for the server is now 3.
* On the server, run `nomad agent-info` and check that the
`last_log_index` is of a similar value to the other servers. This
step ensures that raft is healthy and changes are replicating to the
new server.
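The verification step in the list above can be sketched as a small filter over the `nomad operator raft list-peers` output. This is a minimal sketch, not part of the commit: `peers_not_on_v3` is a hypothetical helper name, and it assumes the command prints one header row followed by one row per server, with the node name in the first column and the Raft protocol version in the last column.

```shell
#!/usr/bin/env bash
# Sketch only: list peers that are not yet on Raft protocol version 3.
# Assumption: input is the table printed by `nomad operator raft list-peers`,
# with a single header row, the node name in the first column, and the
# Raft protocol version in the last column.

# peers_not_on_v3 (hypothetical helper) reads the table on stdin and prints
# the node name of every peer whose last column is not "3".
peers_not_on_v3() {
  awk 'NR > 1 && NF > 0 && $NF != "3" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   nomad operator raft list-peers | peers_not_on_v3
```

An empty result suggests every listed peer reports version 3. Any names printed are servers that still need the in-place steps.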
### Upgrading a Single Server Cluster to Raft Version 3
If you are running a single Nomad server, restarting it in-place will
result in that server not being able to elect itself as a leader. To
avoid this, create a new [`peers.json`][peers-json] file before
restarting the server with the new configuration. If you have `jq`
installed you can run the following script on the server's host to
write the correct `peers.json` file:
```
#!/usr/bin/env bash
NOMAD_DATA_DIR=$(nomad agent-info -json | jq -r '.config.DataDir')
NOMAD_ADDR=$(nomad agent-info -json | jq -r '.stats.nomad.leader_addr')
NODE_ID=$(cat "$NOMAD_DATA_DIR/server/node-id")
cat <<EOF > "$NOMAD_DATA_DIR/server/raft/peers.json"
[
{
"id": "$NODE_ID",
"address": "$NOMAD_ADDR",
"non_voter": false
}
]
EOF
```
After running this script, update the `raft_protocol` in the server's
configuration to 3 and restart the server.
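One thing to watch in a script like the one above: `jq -r` prints the literal string `null` when a field is missing, and that string would be written into `peers.json` unnoticed. The following is a minimal sketch, not part of the commit, that separates rendering from lookup and refuses empty or `null` inputs. `render_peers_json` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: render a single-voter peers.json, refusing bad inputs.

# render_peers_json (hypothetical helper) takes a node ID and a server
# address, and prints the peers.json body on stdout. It returns non-zero
# if either value is empty or the literal string "null".
render_peers_json() {
  local node_id="$1" addr="$2"
  if [ -z "$node_id" ] || [ "$node_id" = "null" ] ||
     [ -z "$addr" ] || [ "$addr" = "null" ]; then
    echo "refusing to render peers.json: missing node ID or address" >&2
    return 1
  fi
  cat <<EOF
[
  {
    "id": "$node_id",
    "address": "$addr",
    "non_voter": false
  }
]
EOF
}

# Usage with the variables from the script above (not executed here):
#   render_peers_json "$NODE_ID" "$NOMAD_ADDR" > "$NOMAD_DATA_DIR/server/raft/peers.json"
```

Printing to stdout first also makes it easy to inspect the result before it is written into the data directory.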
[peers-json]: https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson


@@ -13,6 +13,18 @@ upgrade. However, specific versions of Nomad may have more details provided for
their upgrades as a result of new features or changed behavior. This page is
used to document those details separately from the standard upgrade flow.
## Nomad 1.3.0
#### Raft Protocol Version 2 Deprecation
Raft protocol version 2 will be removed in the next major release of
Nomad, 1.4.0.
In Nomad 1.3.0, the default raft protocol version has been updated
to 3. If [`raft_protocol`] is not explicitly set in the `server`
stanza, upgrading a server will automatically upgrade that server's
raft protocol. See the [Upgrading to Raft Protocol 3] guide.
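Because an unset `raft_protocol` now means version 3, it can help to check whether a server's configuration sets the option explicitly before upgrading the binary. This is a minimal sketch, not part of the commit: `has_explicit_raft_protocol` is a hypothetical helper name, and the config path in the usage comment is an assumption about where your agent configuration lives.

```shell
#!/usr/bin/env bash
# Sketch only: detect an explicit raft_protocol setting in agent config files.

# has_explicit_raft_protocol (hypothetical helper) succeeds if any of the
# given files contains an uncommented raft_protocol assignment, in either
# HCL form (raft_protocol = 2) or JSON form ("raft_protocol": 2).
has_explicit_raft_protocol() {
  grep -Eq '^[[:space:]]*"?raft_protocol"?[[:space:]]*[=:]' "$@"
}

# Usage (the path is an assumption, not executed here):
#   if has_explicit_raft_protocol /etc/nomad.d/*.hcl; then
#     echo "raft_protocol is set explicitly; the upgrade will not change it"
#   else
#     echo "raft_protocol is unset; upgrading will move this server to version 3"
#   fi
```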
## Nomad 1.2.4
#### `nomad eval status -json` deprecated
@@ -959,7 +971,7 @@ will be interpolated properly. Please see the
### Raft Protocol Version Compatibility
When upgrading to Nomad 0.8.0 from a version lower than 0.7.0, users will need
to set the [`raft_protocol`](/docs/configuration/server#raft_protocol) option in
to set the [`raft_protocol`] option in
their `server` stanza to 1 in order to maintain backwards compatibility with the
old servers during the upgrade. After the servers have been migrated to version
0.8.0, `raft_protocol` can be moved up to 2 and the servers restarted to match
@@ -997,40 +1009,6 @@ In order to enable all
servers in a Nomad cluster must be running with Raft protocol version 3 or
later.
#### Upgrading to Raft Protocol 3
This section provides details on upgrading to Raft Protocol 3 in Nomad 0.8 and
higher. Raft protocol version 3 requires Nomad running 0.8.0 or newer on all
servers in order to work. See [Raft Protocol Version
Compatibility](/docs/upgrade/upgrade-specific#raft-protocol-version-compatibility)
for more details. Also the format of `peers.json` used for outage recovery is
different when running with the latest Raft protocol. See [Manual Recovery Using
peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
for a description of the required format.
Please note that the Raft protocol is different from Nomad's internal protocol
as shown in commands like `nomad server members`. To see the version of the Raft
protocol in use on each server, use the `nomad operator raft list-peers`
command.
The easiest way to upgrade servers is to have each server leave the cluster,
upgrade its `raft_protocol` version in the `server` stanza, and then add it
back. Make sure the new server joins successfully and that the cluster is stable
before rolling the upgrade forward to the next server. It's also possible to
stand up a new set of servers, and then slowly stand down each of the older
servers in a similar fashion.
When using Raft protocol version 3, servers are identified by their `node-id`
instead of their IP address when Nomad makes changes to its internal Raft quorum
configuration. This means that once a cluster has been upgraded with servers all
running Raft protocol version 3, it will no longer allow servers running any
older Raft protocol versions to be added. If running a single Nomad server,
restarting it in-place will result in that server not being able to elect itself
as a leader. To avoid this, either set the Raft protocol back to 2, or use
[Manual Recovery Using
peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
to map the server to its node ID in the Raft quorum configuration.
### Node Draining Improvements
Node draining via the [`node drain`][drain-cli] command or the [drain
@@ -1224,6 +1202,8 @@ deleted and then Nomad 0.3.0 can be launched.
[preemption]: /docs/internals/scheduling/preemption
[proxy_concurrency]: /docs/job-specification/sidecar_task#proxy_concurrency
[`sidecar_task.config`]: /docs/job-specification/sidecar_task#config
[raft protocol version]: /docs/configuration/server#raft_protocol
[`raft_protocol`]: /docs/configuration/server#raft_protocol
[reserved]: /docs/configuration/client#reserved-parameters
[task-config]: /docs/job-specification/task#config
[tls-guide]: https://learn.hashicorp.com/tutorials/nomad/security-enable-tls
@@ -1248,3 +1228,4 @@ deleted and then Nomad 0.3.0 can be launched.
[cap_add_exec]: /docs/drivers/exec#cap_add
[cap_drop_exec]: /docs/drivers/exec#cap_drop
[`log_file`]: /docs/configuration#log_file
[Upgrading to Raft Protocol 3]: /docs/upgrade#upgrading-to-raft-protocol-3