open-vault/vendor/github.com/hashicorp/raft
Vishal Nayak 3e55e79a3f
Autopilot: Server Stabilization, State and Dead Server Cleanup (#10856)
* k8s doc: update for 0.9.1 and 0.8.0 releases (#10825)

* k8s doc: update for 0.9.1 and 0.8.0 releases

* Update website/content/docs/platform/k8s/helm/configuration.mdx

Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

* Autopilot initial commit

* Move autopilot related backend implementations to its own file

* Abstract promoter creation

* Add nil check for health

* Add server state oss no-ops

* Config ext stub for oss

* Make way for non-voters

* s/health/state

* s/ReadReplica/NonVoter

* Add synopsis and description

* Remove struct tags from AutopilotConfig

* Use var for config storage path

* Handle nil config when reading

* Enable testing autopilot by using inmem cluster

* First passing test

* Only report the server as known if it is present in raft config

* Autopilot defaults to on for all existing and new clusters

* Add locking to some functions

* Persist initial config

* Clarify the command usage doc

* Add health metric for each node

* Fix audit logging issue

* Don't set DisablePerformanceStandby to true in test

* Use node id label for health metric

* Log updates to autopilot config

* Less aggressively consume config loading failures

* Return a mutable config

* Return early from known servers if raft config is unable to be pulled

* Update metrics name

* Reduce log level for potentially noisy log

* Add knob to disable autopilot

* Don't persist if default config is in use

* Autopilot: Dead server cleanup (#10857)

* Dead server cleanup

* Initialize channel in any case

* Fix a bunch of tests

* Fix panic

* Add follower locking in heartbeat tracker

* Add LastContactFailureThreshold to config

* Add log when marking node as dead

* Update follower state locking in heartbeat tracker

* Avoid follower states being nil

* Pull test to its own file

* Add execution status to state response

* Optionally enable autopilot in some tests

* Updates

* Added API function to fetch autopilot configuration

* Add test for default autopilot configuration

* Configuration tests

* Add State API test

* Update test

* Added TestClusterOptions.PhysicalFactoryConfig

* Update locking

* Adjust locking in heartbeat tracker

* s/last_contact_failure_threshold/left_server_last_contact_threshold

* Add disabling autopilot as a core config option

* Disable autopilot in some tests

* s/left_server_last_contact_threshold/dead_server_last_contact_threshold

* Set the lastheartbeat of followers to now when setting up active node

* Don't use config defaults from CLI command

* Remove config file support

* Remove HCL test as well

* Persist only supplied config; merge supplied config with default to operate

* Use pointer to structs for storing follower information

* Test update

* Retrieve non voter status from configbucket and set it up when a node comes up

* Manage desired suffrage

* Consider bucket being created already

* Move desired suffrage to its own entry

* s/DesiredSuffrageKey/LocalNodeConfigKey

* s/witnessSuffrage/recordSuffrage

* Fix test compilation

* Handle local node config post a snapshot install

* Commit to storage first; then record suffrage in fsm

* No need to handle a nil local node config post snapshot restore

* Reconcile autopilot config when a new leader takes over duty

* Grab fsm lock when recording suffrage

* s/Suffrage/DesiredSuffrage in FollowerState

* Instantiate autopilot only in leader

* Default to old ways in more scenarios

* Make API gracefully handle 404

* Address some feedback

* Make IsDead an atomic.Value

* Simplify follower heartbeat tracking

* Use uber.atomic

* Don't have multiple causes for having autopilot disabled

* Don't remove node from follower states if we fail to remove the dead server

* Autopilot server removals map (#11019)

* Don't remove node from follower states if we fail to remove the dead server

* Use map to track dead server removals

* Use lock and map

* Use delegate lock

* Adjust when to remove entry from map

* Only hold the lock while accessing map

* Fix race

* Don't set default min_quorum

* Fix test

* Ensure follower states is not nil before starting autopilot

* Fix race

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>
2021-03-03 13:59:50 -05:00
.gitignore Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
.golangci-lint.yml Add password policies to Active Directory secret engine (#9144) 2020-06-15 10:36:17 -06:00
.travis.yml Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
CHANGELOG.md Autopilot: Server Stabilization, State and Dead Server Cleanup (#10856) 2021-03-03 13:59:50 -05:00
LICENSE Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
Makefile Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
README.md raft: Update raft library dependency (#9571) 2020-07-22 14:49:51 -07:00
api.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
commands.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
commitment.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
config.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
configuration.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
discard_snapshot.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
file_snapshot.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
fsm.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
future.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
go.mod Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
go.sum Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
inmem_snapshot.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
inmem_store.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
inmem_transport.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
log.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
log_cache.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
membership.md Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
net_transport.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
observer.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
peersjson.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
raft.go Pull latest raft updates (#10055) 2020-10-05 16:36:48 +02:00
replication.go Pull latest raft updates (#10055) 2020-10-05 16:36:48 +02:00
snapshot.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
stable.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
state.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
tag.sh Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
tcp_transport.go raft: Update raft library dependency (#9571) 2020-07-22 14:49:51 -07:00
testing.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00
testing_batch.go Improve raft write performance by utilizing FSM Batching (#7527) 2019-10-14 09:25:07 -06:00
transport.go Raft Storage Backend (#6888) 2019-06-20 12:14:58 -07:00
util.go Upgrade raft library (#9170) 2020-06-08 16:34:20 -07:00

README.md

raft

raft is a Go library that manages a replicated log and can be used with a finite state machine (FSM) to build replicated state machines. In short, it is a library for providing consensus.

The use cases for such a library are far-reaching: replicated state machines are a key component of many distributed systems. They enable building Consistent, Partition-Tolerant (CP) systems with a limited degree of fault tolerance.
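The FSM is supplied by the application and is only ever handed committed log entries. As a rough sketch (the counter type, its encoding, and the package name are illustrative; only the raft.FSM and raft.FSMSnapshot interfaces come from the library), a replicated counter could look like this:

```go
package example

import (
	"encoding/binary"
	"io"
	"sync/atomic"

	"github.com/hashicorp/raft"
)

// counterFSM is a hypothetical FSM: it applies every committed log entry
// by incrementing a counter.
type counterFSM struct {
	value uint64
}

// Apply is invoked by raft once a log entry is committed.
func (f *counterFSM) Apply(l *raft.Log) interface{} {
	return atomic.AddUint64(&f.value, 1)
}

// Snapshot captures a point-in-time view used for log compaction.
func (f *counterFSM) Snapshot() (raft.FSMSnapshot, error) {
	return &counterSnapshot{value: atomic.LoadUint64(&f.value)}, nil
}

// Restore replaces the FSM state from a previously persisted snapshot.
func (f *counterFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	var buf [8]byte
	if _, err := io.ReadFull(rc, buf[:]); err != nil {
		return err
	}
	atomic.StoreUint64(&f.value, binary.BigEndian.Uint64(buf[:]))
	return nil
}

type counterSnapshot struct{ value uint64 }

// Persist writes the snapshot to the sink provided by the snapshot store.
func (s *counterSnapshot) Persist(sink raft.SnapshotSink) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], s.value)
	if _, err := sink.Write(buf[:]); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

// Release is called when raft is finished with the snapshot.
func (s *counterSnapshot) Release() {}
```

Because Apply only ever sees committed entries, every node that applies the same log converges to the same counter value.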

Building

If you wish to build raft, you'll need Go version 1.2+ installed.

Please check your installation with:

go version

Documentation

For complete documentation, see the associated Godoc.

To prevent complications with cgo, the primary backend MDBStore is in a separate repository, called raft-mdb. That is the recommended implementation for the LogStore and StableStore.

A pure Go backend using BoltDB is also available, called raft-boltdb. It can likewise be used as a LogStore and StableStore.
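Assuming raft-boltdb for the log and stable stores, wiring a single node together might look roughly like the sketch below; the data directory, bind address, and server ID are placeholders, and error handling is kept minimal:

```go
package example

import (
	"net"
	"os"
	"path/filepath"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

// newRaftNode wires up one node. dataDir, bindAddr, and the server ID are
// placeholders supplied by the application.
func newRaftNode(dataDir, bindAddr string, fsm raft.FSM) (*raft.Raft, error) {
	conf := raft.DefaultConfig()
	conf.LocalID = raft.ServerID("node-1") // must be unique and stable per server

	// A single BoltDB file serves as both the LogStore and the StableStore.
	store, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
	if err != nil {
		return nil, err
	}

	// File-based snapshot store, retaining the two most recent snapshots.
	snaps, err := raft.NewFileSnapshotStore(dataDir, 2, os.Stderr)
	if err != nil {
		return nil, err
	}

	addr, err := net.ResolveTCPAddr("tcp", bindAddr)
	if err != nil {
		return nil, err
	}
	trans, err := raft.NewTCPTransport(bindAddr, addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return nil, err
	}

	return raft.NewRaft(conf, fsm, store, store, snaps, trans)
}
```

A brand-new cluster still needs its initial membership bootstrapped before this node can elect a leader; an existing node simply recovers its state from these stores on restart.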

Tagged Releases

As of September 2017, HashiCorp uses tags for this library to clearly indicate major version updates. We recommend you vendor your application's dependency on this library.

  • v0.1.0 is the original stable version of the library that was in master and has been maintained with no breaking API changes. This was in use by Consul prior to version 0.7.0.

  • v1.0.0 takes the changes that were staged in the library-v2-stage-one branch. This version manages server identities using a UUID, so it introduces some breaking API changes. It also versions the Raft protocol, and requires some special steps when interoperating with Raft servers running older versions of the library (see the detailed comment in config.go about version compatibility). You can reference https://github.com/hashicorp/consul/pull/2222 for an idea of what was required to port Consul to these new interfaces.

    This version includes some new features as well, including non-voting servers, a new address provider abstraction in the transport layer, and more resilient snapshots.

Protocol

raft is based on "Raft: In Search of an Understandable Consensus Algorithm".

A high level overview of the Raft protocol is described below, but for details please read the full Raft paper followed by the raft source. Any questions about the raft protocol should be sent to the raft-dev mailing list.

Protocol Description

Raft nodes are always in one of three states: follower, candidate or leader. All nodes initially start out as followers. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to a leader. The leader must accept new log entries and replicate them to all the other followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.
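If stale reads matter, the application typically checks leadership before serving a query. A small sketch, where the query and forward helpers are hypothetical application hooks and only State and VerifyLeader come from the library:

```go
package example

import "github.com/hashicorp/raft"

// handleRead serves a query only when this node is (still) the leader.
// The query and forward callbacks are hypothetical application hooks.
func handleRead(r *raft.Raft, query, forward func() ([]byte, error)) ([]byte, error) {
	if r.State() != raft.Leader {
		return forward() // not the leader: hand the request off
	}
	// Confirm leadership before answering, so a deposed leader does not
	// serve a stale read.
	if err := r.VerifyLeader().Error(); err != nil {
		return forward()
	}
	return query() // read from the local, up-to-date FSM
}
```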

Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry, which is an opaque binary blob to Raft. The leader then writes the entry to durable storage and attempts to replicate it to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific, and is implemented using an interface.
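From a client's perspective, this append-and-commit flow is driven through Apply on the leader. A minimal sketch, assuming the command bytes use an application-defined encoding:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// applyCommand submits an application-encoded command to the leader's log
// and waits for it to be committed and applied.
func applyCommand(r *raft.Raft, cmd []byte) (interface{}, error) {
	future := r.Apply(cmd, 5*time.Second)
	if err := future.Error(); err != nil {
		return nil, err // not the leader, or the entry was not committed in time
	}
	// Response is whatever the FSM's Apply returned for this entry.
	return future.Response(), nil
}
```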

An obvious question relates to the unbounded nature of a replicated log. Raft provides a mechanism by which the current state is snapshotted, and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time, and then remove all the logs that were used to reach that state. This is performed automatically without user intervention, and prevents unbounded disk usage as well as minimizing time spent replaying logs.
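In this library the compaction behaviour is controlled by a few Config fields, and a snapshot can also be requested explicitly. A sketch with illustrative values (the library ships with comparable defaults):

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// snapshotConfig shows the Config fields that drive automatic compaction;
// the configuration is passed to NewRaft when the node is created.
func snapshotConfig() *raft.Config {
	conf := raft.DefaultConfig()
	conf.SnapshotInterval = 120 * time.Second // how often to check whether a snapshot is due
	conf.SnapshotThreshold = 8192             // minimum new log entries before snapshotting
	conf.TrailingLogs = 10240                 // logs kept after a snapshot so slow followers can catch up
	return conf
}

// forceSnapshot requests a snapshot explicitly on a running node.
func forceSnapshot(r *raft.Raft) error {
	return r.Snapshot().Error()
}
```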

Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue, as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add or remove a node, or commit any additional log entries, resulting in unavailability. At this point, manual intervention would be required to remove either A or B and to restart the remaining node in bootstrap mode.
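While a quorum is available, these membership changes go through the Raft API directly. A sketch with placeholder server IDs and addresses:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// reshapeCluster adds one voting member and removes another. IDs and
// addresses below are placeholders.
func reshapeCluster(r *raft.Raft) error {
	// Add a new voting member; prevIndex 0 disables the previous-index check.
	addr := raft.ServerAddress("10.0.0.4:8300")
	if err := r.AddVoter(raft.ServerID("node-4"), addr, 0, 10*time.Second).Error(); err != nil {
		return err
	}
	// Remove a failed or departing member.
	return r.RemoveServer(raft.ServerID("node-2"), 0, 10*time.Second).Error()
}
```

For a brand-new cluster, the very first configuration is normally established by bootstrapping rather than by adding voters one at a time.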

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to run either 3 or 5 raft servers. This maximizes availability without greatly sacrificing performance.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to a quorum (a majority) of the cluster. Thus performance is bound by disk I/O and network latency.