James Phillips bc29610124 Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00

* Updates Raft library to get new snapshot/restore API.
* Basic backup and restore working, but need some cleanup.
* Breaks out a snapshot module and adds a SHA256 integrity check.
* Adds snapshot ACL and fills in some missing comments.
* Require a consistent read for snapshots.
* Make sure snapshot works if ACLs aren't enabled.
* Adds a bit of package documentation.
* Returns an empty response from restore to avoid EOF errors.
* Adds API client support for snapshots.
* Makes internal file names match on-disk file snapshots.
* Adds DC and token coverage for snapshot API test.
* Adds missing documentation.
* Adds a unit test for the snapshot client endpoint.
* Moves the connection pool out of the client for easier testing.
* Fixes an incidental issue in the prepared query unit test. I realized I had two servers in bootstrap mode so this wasn't a good setup.
* Adds a half close to the TCP stream and fixes panic on error.
* Adds client and endpoint tests for snapshots.
* Moves the pool back into the snapshot RPC client.
* Adds a TLS test and fixes half-closes for TLS connections.
* Tweaks some comments.
* Adds a low-level snapshot test. This is independent of Consul so we can pull this out into a library later if we want to.
* Cleans up snapshot and archive and completes archive tests.
* Sends a clear error for snapshot operations in dev mode. Snapshots require the Raft snapshots to be readable, which isn't supported in dev mode. Send a clear error instead of a deep-down Raft one.
* Adds docs for the snapshot endpoint.
* Adds a stale mode and index feedback for snapshot saves. This gives folks a way to extract data even if the cluster has no leader.
* Changes the internal format of a snapshot from zip to tgz.
* Pulls in Raft fix to cancel inflight before a restore.
* Pulls in new Raft restore interface.
* Adds metadata to snapshot saves and a verify function.
* Adds basic save and restore snapshot CLI commands.
* Gets rid of tarball extensions and adds restore message.
* Fixes an incidental bad link in the KV docs.
* Adds documentation for the snapshot CLI commands.
* Scuttle any request body when a snapshot is saved.
* Fixes archive unit test error message check.
* Allows for nil output writers in snapshot RPC handlers.
* Renames hash list Decode to DecodeAndVerify.
* Closes the client connection for snapshot ops.
* Lowers timeout for restore ops.
* Updates Raft vendor to get new Restore signature and integrates with Consul.
* Bounces the leader's internal state when we do a restore.

api.go	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
commands.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
commitment.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
config.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
configuration.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
discard_snapshot.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
file_snapshot.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
fsm.go	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
future.go	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
inmem_store.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
inmem_transport.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
LICENSE	Manage dependencies via Godep	2016-02-12 16:50:37 -08:00
log.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
log_cache.go	Manage dependencies via Godep	2016-02-12 16:50:37 -08:00
Makefile	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
membership.md	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
net_transport.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
observer.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
peersjson.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
raft.go	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
README.md	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
replication.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
snapshot.go	Adds support for snapshots and restores. (#2396 )	2016-10-25 19:20:24 -07:00
stable.go	Manage dependencies via Godep	2016-02-12 16:50:37 -08:00
state.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
tcp_transport.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
transport.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00
util.go	Vendors first stage branch of the v2 Raft library.	2016-08-08 19:19:17 -07:00

README.md

raft

raft is a Go library that manages a replicated log and can be used with an FSM to manage replicated state machines. It is a library for providing consensus.

The use cases for such a library are far-reaching as replicated state machines are a key component of many distributed systems. They enable building Consistent, Partition Tolerant (CP) systems, with limited fault tolerance as well.

Building

If you wish to build raft you'll need Go version 1.2+ installed.

Please check your installation with:

go version

Documentation

For complete documentation, see the associated Godoc.

To prevent complications with cgo, the primary backend MDBStore is in a separate repository, called raft-mdb. That is the recommended implementation for the LogStore and StableStore.

A pure Go backend using BoltDB is also available called raft-boltdb. It can also be used as a LogStore and StableStore.

Protocol

raft is based on "Raft: In Search of an Understandable Consensus Algorithm".

A high-level overview of the Raft protocol is given below, but for details please read the full Raft paper followed by the raft source. Any questions about the Raft protocol should be sent to the raft-dev mailing list.

Protocol Description

Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start out as followers. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to leader. The leader must accept new log entries and replicate them to all of the followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.
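The role changes described above can be sketched as a small transition function. This is an illustrative model only, not code from the raft library: the `State`, `Event`, and `Next` names are hypothetical, and the `HigherTermSeen` step-down rule comes from the Raft paper rather than from the description above.

```go
package main

import "fmt"

// State is a hypothetical enumeration of the three roles described above.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

func (s State) String() string {
	switch s {
	case Follower:
		return "follower"
	case Candidate:
		return "candidate"
	case Leader:
		return "leader"
	}
	return "unknown"
}

// Event is a hypothetical input that can trigger a role change.
type Event int

const (
	HeartbeatTimeout Event = iota // no entries received from a leader for some time
	WonElection                   // votes received from a quorum of peers
	HigherTermSeen                // assumption from the Raft paper: a newer term exists
)

// Next returns the state a node moves to when an event occurs.
// Combinations that are not listed leave the state unchanged.
func Next(s State, e Event) State {
	switch {
	case s == Follower && e == HeartbeatTimeout:
		return Candidate
	case s == Candidate && e == WonElection:
		return Leader
	case e == HigherTermSeen:
		return Follower
	}
	return s
}

func main() {
	s := Follower
	for _, e := range []Event{HeartbeatTimeout, WonElection, HigherTermSeen} {
		s = Next(s, e)
		fmt.Println(s)
	}
}
```

A real implementation also tracks terms, votes, and randomized election timers; the sketch keeps only the role transitions.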

Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry, which is an opaque binary blob to Raft. The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific, and is implemented using an interface.
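The append, replicate, commit, and apply sequence can be illustrated with an in-memory toy. Everything here is hypothetical (`FSM`, `counterFSM`, `follower`, `leader`, `Propose`); the library's actual FSM interface has a different shape, and a real leader writes to durable storage and replicates over the network.

```go
package main

import "fmt"

// FSM is a hypothetical, simplified state machine interface.
type FSM interface {
	Apply(entry []byte)
}

// counterFSM counts applied entries and remembers the last one.
type counterFSM struct {
	applied int
	last    string
}

func (f *counterFSM) Apply(entry []byte) {
	f.applied++
	f.last = string(entry)
}

// follower simulates a peer that may or may not acknowledge an entry.
type follower struct {
	up  bool
	log [][]byte
}

func (f *follower) appendEntry(e []byte) bool {
	if !f.up {
		return false
	}
	f.log = append(f.log, e)
	return true
}

// leader holds its own log, its peers, and the application FSM.
type leader struct {
	log   [][]byte
	peers []*follower
	fsm   FSM
}

// Propose appends an opaque entry, replicates it, and applies it to the FSM
// only if a quorum of the cluster (leader included) has stored it.
func (l *leader) Propose(entry []byte) bool {
	l.log = append(l.log, entry) // stands in for a durable write
	acks := 1                    // the leader counts toward the quorum
	for _, p := range l.peers {
		if p.appendEntry(entry) {
			acks++
		}
	}
	quorum := (len(l.peers)+1)/2 + 1
	if acks < quorum {
		return false // not committed; a real leader would keep retrying
	}
	l.fsm.Apply(entry)
	return true
}

func main() {
	fsm := &counterFSM{}
	peers := []*follower{{up: true}, {up: false}}
	l := &leader{peers: peers, fsm: fsm}

	// One follower is down, but two of three nodes still form a quorum.
	fmt.Println(l.Propose([]byte("set x=1")), fsm.applied)

	// With both followers down, the leader alone cannot commit.
	peers[0].up = false
	fmt.Println(l.Propose([]byte("set x=2")), fsm.applied)
}
```

The entry reaches the FSM only after it is committed, so the state machine never observes an entry that a quorum has not stored.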

An obvious question relates to the unbounded nature of a replicated log. Raft provides a mechanism by which the current state is snapshotted, and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time, and then remove all the logs that were used to reach that state. This is performed automatically without user intervention, and prevents unbounded disk usage as well as minimizing time spent replaying logs.

Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue, as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add or remove a node, or to commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B, and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to run either 3 or 5 raft servers. This maximizes availability without greatly sacrificing performance.
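The figures above follow from majority arithmetic: a quorum is a strict majority of the cluster, and the number of tolerated failures is whatever is left over. A minimal sketch, with hypothetical helper names `Quorum` and `Tolerated`:

```go
package main

import "fmt"

// Quorum returns the number of nodes that form a strict majority of n.
func Quorum(n int) int { return n/2 + 1 }

// Tolerated returns how many nodes can fail while a quorum remains.
func Tolerated(n int) int { return n - Quorum(n) }

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, Quorum(n), Tolerated(n))
	}
}
```

A 4-node cluster tolerates no more failures than a 3-node one, which is one reason odd cluster sizes are the usual choice.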

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to half of the cluster. Thus performance is bounded by disk I/O and network latency.