open-vault

Author	SHA1	Message	Date
Josh Black	e75633eddc	Don't panic on unknown raft ops (#17732) * Don't panic on unknown raft ops * avoid excessive logging * track at the struct level, not the function level * add changelog	2022-11-30 15:37:58 -08:00
Nick Cabatoff	39c7e7c191	Add more raft metrics, emit more metrics on non-perf standbys (#12166) Add some metrics helpful for monitoring raft cluster state. Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics in metricsLoop that would make sense to include in OSS but weren't. We now have an active-node-only func, emitMetricsActiveNode. This runs metricsLoop on the active node. Standbys and perf-standbys run metricsLoop from a goroutine managed by the runStandby rungroup.	2022-10-07 09:09:08 -07:00
Josh Black	04a2396573	Adjust raft transactions to be safer with get operations (#17151)	2022-09-16 09:35:48 -07:00
Josh Black	6d94dd991d	merkle sync undo logs (#17103)	2022-09-13 10:03:19 -07:00
Josh Black	416504d8c3	Add autopilot automated upgrades and redundancy zones (#15521)	2022-05-20 16:49:11 -04:00
Nick Cabatoff	c5928c1d15	Raft: use a larger initial heartbeat/election timeout (#15042)	2022-04-29 08:32:16 -04:00
Nick Cabatoff	4ee4374b3e	Use MAP_POPULATE for our bbolt mmaps (#13573) * Use MAP_POPULATE for our bbolt mmaps, assuming the files fit in memory. This should improve startup times when freelist sync is disabled.	2022-01-11 08:16:53 -05:00
Josh Black	fe0dd6f867	Add InitialMmapSize to bolt options (#13178)	2021-11-22 20:16:57 -08:00
Mayo	0bd0339c0b	cleanup unused code and fix t.Fatal usage in goroutine in testing (#11694)	2021-09-30 07:33:14 -04:00
Jeff Mitchell	f7147025dd	Migrate to sdk/internalshared libs in go-secure-stdlib (#12090) * Swap sdk/helper libs to go-secure-stdlib * Migrate to go-secure-stdlib reloadutil * Migrate to go-secure-stdlib kv-builder * Migrate to go-secure-stdlib gatedwriter	2021-07-15 20:17:31 -04:00
Nick Cabatoff	a3ac49aa05	VAULT-2809: Tweak creation of vault.db file (#12034)	2021-07-09 14:45:50 -04:00
Josh Black	8c069936e9	Add new boltdb options (#11895)	2021-06-21 11:35:40 -07:00
Lars Lehtonen	5ac47a9265	physical: deprecate errwrap.Wrapf() (#11692)	2021-05-31 12:54:05 -04:00
Josh Black	ec105f288f	Switch to shared raft-boltdb library and add metrics (#11269)	2021-04-26 16:01:26 -07:00
Brian Kassouf	303c2aee7c	Run a more strict formatter over the code (#11312) * Update tooling * Run gofumpt * go mod vendor	2021-04-08 09:43:39 -07:00
Vishal Nayak	fb2df6ca73	Fix autopilot fsm race (#11091) * Fix autopilot fsm race * No need to grab backend's lock	2021-03-11 13:14:11 -05:00
Vishal Nayak	3e55e79a3f	Autopilot: Server Stabilization, State and Dead Server Cleanup (#10856) * k8s doc: update for 0.9.1 and 0.8.0 releases (#10825) * k8s doc: update for 0.9.1 and 0.8.0 releases * Update website/content/docs/platform/k8s/helm/configuration.mdx Co-authored-by: Theron Voran <tvoran@users.noreply.github.com> Co-authored-by: Theron Voran <tvoran@users.noreply.github.com> * Autopilot initial commit * Move autopilot related backend implementations to its own file * Abstract promoter creation * Add nil check for health * Add server state oss no-ops * Config ext stub for oss * Make way for non-voters * s/health/state * s/ReadReplica/NonVoter * Add synopsis and description * Remove struct tags from AutopilotConfig * Use var for config storage path * Handle nin-config when reading * Enable testing autopilot by using inmem cluster * First passing test * Only report the server as known if it is present in raft config * Autopilot defaults to on for all existing and new clusters * Add locking to some functions * Persist initial config * Clarify the command usage doc * Add health metric for each node * Fix audit logging issue * Don't set DisablePerformanceStandby to true in test * Use node id label for health metric * Log updates to autopilot config * Less aggressively consume config loading failures * Return a mutable config * Return early from known servers if raft config is unable to be pulled * Update metrics name * Reduce log level for potentially noisy log * Add knob to disable autopilot * Don't persist if default config is in use * Autopilot: Dead server cleanup (#10857) * Dead server cleanup * Initialize channel in any case * Fix a bunch of tests * Fix panic * Add follower locking in heartbeat tracker * Add LastContactFailureThreshold to config * Add log when marking node as dead * Update follower state locking in heartbeat tracker * Avoid follower states being nil * Pull test to its own file * Add execution status to state response * Optionally enable autopilot in some tests * Updates * Added API function to fetch autopilot configuration * Add test for default autopilot configuration * Configuration tests * Add State API test * Update test * Added TestClusterOptions.PhysicalFactoryConfig * Update locking * Adjust locking in heartbeat tracker * s/last_contact_failure_threshold/left_server_last_contact_threshold * Add disabling autopilot as a core config option * Disable autopilot in some tests * s/left_server_last_contact_threshold/dead_server_last_contact_threshold * Set the lastheartbeat of followers to now when setting up active node * Don't use config defaults from CLI command * Remove config file support * Remove HCL test as well * Persist only supplied config; merge supplied config with default to operate * Use pointer to structs for storing follower information * Test update * Retrieve non voter status from configbucket and set it up when a node comes up * Manage desired suffrage * Consider bucket being created already * Move desired suffrage to its own entry * s/DesiredSuffrageKey/LocalNodeConfigKey * s/witnessSuffrage/recordSuffrage * Fix test compilation * Handle local node config post a snapshot install * Commit to storage first; then record suffrage in fsm * No need of local node config being nili case, post snapshot restore * Reconcile autopilot config when a new leader takes over duty * Grab fsm lock when recording suffrage * s/Suffrage/DesiredSuffrage in FollowerState * Instantiate autopilot only in leader * Default to old ways in more scenarios * Make API gracefully handle 404 * Address some feedback * Make IsDead an atomic.Value * Simplify follower hearbeat tracking * Use uber.atomic * Don't have multiple causes for having autopilot disabled * Don't remove node from follower states if we fail to remove the dead server * Autopilot server removals map (#11019) * Don't remove node from follower states if we fail to remove the dead server * Use map to track dead server removals * Use lock and map * Use delegate lock * Adjust when to remove entry from map * Only hold the lock while accessing map * Fix race * Don't set default min_quorum * Fix test * Ensure follower states is not nil before starting autopilot * Fix race Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com> Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>	2021-03-03 13:59:50 -05:00
Nick Cabatoff	c1ddfbb538	OSS parts of the new client controlled consistency feature (#10974)	2021-02-24 06:58:10 -05:00
Brian Kassouf	3f30fc5f4e	Port changes from enterprise lease fix (#10020)	2020-09-22 14:47:13 -07:00
ncabatoff	b2908d1744	Avoid O(n^2) lookup to remove duplicate subfolders in list output. (#9694)	2020-08-31 09:23:34 -04:00
Mike Jarmy	a3ab902e18	set path properly in NewRaftBackend() (#9128) * set path properly in NewRaftBackend() * get rid of storeLatestState	2020-07-21 12:48:24 -04:00
Brian Kassouf	09593283b8	Improve the performance of snapshot installs by using rename (#9247) * initial work on improving snapshot performance * Work on snapshots * rename a few functions * Cleanup the snapshot file * vendor the safeio library * Add a test * Add more tests * Some review comments * Fix comment * Update physical/raft/snapshot.go Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com> * Update physical/raft/snapshot.go Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com> * Review feedback Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>	2020-06-23 11:08:30 -07:00
Jeff Mitchell	6cb26312af	Fix code copied from gogo using a different proto import (#9009)	2020-05-15 13:45:22 -07:00
Jeff Mitchell	b4f5d38916	Update to latest go-kms-wrapping and fix protos/etcd (#8996)	2020-05-14 18:45:10 -04:00
Brian Kassouf	d979279015	storage/raft: Fix memory allocation issue and Metadata tracking issues with snapshots (#8793) * storage/raft: Split snapshot restore disk write into batches * Work on snapshot consistency * make sure tests send a snapshot * Fix comment * Don't remove metrics * Fix comment	2020-04-23 11:11:08 -07:00
Brian Kassouf	1167fad704	Improve raft write performance by utilizing FSM Batching (#7527) * Start benchmark work * Add batching FSM function * dedupe some code * Update dependency on chunking FSM * fix raft external tests * fix go.mod * Add batching test * uncomment test * update raft deps * update vendor * Update physical/raft/fsm.go Co-Authored-By: Michel Vocks <michelvocks@gmail.com> * Update physical/raft/fsm.go	2019-10-14 09:25:07 -06:00
Brian Kassouf	a77995cdb1	Ensure raft configuration properly lists the leader (#7188)	2019-07-25 08:41:14 -04:00
Jeff Mitchell	0425db59ab	Raft chunk snapshotting (#7185) Support chunking, including snapshot handling	2019-07-24 20:44:13 -04:00
Brian Kassouf	4d7d0d729a	storage/raft: When restoring a snapshot preseal first (#7011) * storage/raft: When restoring a snapshot preseal first * best-effort allow standbys to apply the restoreOp before sealing active node * Don't cache the raft tls key * Update physical/raft/raft.go * Move pending raft peers to core * Fix race on close bool * Extend the leaderlease time for tests * Update raft deps * Fix audit hashing * Fix race with auditing	2019-07-03 13:56:30 -07:00
Brian Kassouf	62e14c280d	storage/raft: fix races in tests (#6996) * storage/raft: fix races in tests * Fix another test race	2019-06-27 10:00:03 -07:00
Brian Kassouf	5d0c68ca74	Fix 32-bit builds (#6948)	2019-06-21 09:52:02 -06:00
Brian Kassouf	ed14061578	Raft Storage Backend (#6888) * Work on raft backend * Add logstore locally * Add encryptor and unsealable interfaces * Add clustering support to raft * Remove client and handler * Bootstrap raft on init * Cleanup raft logic a bit * More raft work * Work on TLS config * More work on bootstrapping * Fix build * More work on bootstrapping * More bootstrapping work * fix build * Remove consul dep * Fix build * merged oss/master into raft-storage * Work on bootstrapping * Get bootstrapping to work * Clean up FMS and node-id * Update local node ID logic * Cleanup node-id change * Work on snapshotting * Raft: Add remove peer API (#906) * Add remove peer API * Add some comments * Fix existing snapshotting (#909) * Raft get peers API (#912) * Read raft configuration * address review feedback * Use the Leadership Transfer API to step-down the active node (#918) * Raft join and unseal using Shamir keys (#917) * Raft join using shamir * Store AEAD instead of master key * Split the raft join process to answer the challenge after a successful unseal * get the follower to standby state * Make unseal work * minor changes * Some input checks * reuse the shamir seal access instead of new default seal access * refactor joinRaftSendAnswer function * Synchronously send answer in auto-unseal case * Address review feedback * Raft snapshots (#910) * Fix existing snapshotting * implement the noop snapshotting * Add comments and switch log libraries * add some snapshot tests * add snapshot test file * add TODO * More work on raft snapshotting * progress on the ConfigStore strategy * Don't use two buckets * Update the snapshot store logic to hide the file logic * Add more backend tests * Cleanup code a bit * [WIP] Raft recovery (#938) * Add recovery functionality * remove fmt.Printfs * Fix a few fsm bugs * Add max size value for raft backend (#942) * Add max size value for raft backend * Include physical.ErrValueTooLarge in the message * Raft snapshot Take/Restore API (#926) * Inital work on raft snapshot APIs * Always redirect snapshot install/download requests * More work on the snapshot APIs * Cleanup code a bit * On restore handle special cases * Use the seal to encrypt the sha sum file * Add sealer mechanism and fix some bugs * Call restore while state lock is held * Send restore cb trigger through raft log * Make error messages nicer * Add test helpers * Add snapshot test * Add shamir unseal test * Add more raft snapshot API tests * Fix locking * Change working to initalize * Add underlying raw object to test cluster core * Move leaderUUID to core * Add raft TLS rotation logic (#950) * Add TLS rotation logic * Cleanup logic a bit * Add/Remove from follower state on add/remove peer * add comments * Update more comments * Update request_forwarding_service.proto * Make sure we populate all nodes in the followerstate obj * Update times * Apply review feedback * Add more raft config setting (#947) * Add performance config setting * Add more config options and fix tests * Test Raft Recovery (#944) * Test raft recovery * Leave out a node during recovery * remove unused struct * Update physical/raft/snapshot_test.go * Update physical/raft/snapshot_test.go * fix vendoring * Switch to new raft interface * Remove unused files * Switch a gogo -> proto instance * Remove unneeded vault dep in go.sum * Update helper/testhelpers/testhelpers.go Co-Authored-By: Calvin Leung Huang <cleung2010@gmail.com> * Update vault/cluster/cluster.go * track active key within the keyring itself (#6915) * track active key within the keyring itself * lookup and store using the active key ID * update docstring * minor refactor * Small text fixes (#6912) * Update physical/raft/raft.go Co-Authored-By: Calvin Leung Huang <cleung2010@gmail.com> * review feedback * Move raft logical system into separate file * Update help text a bit * Enforce cluster addr is set and use it for raft bootstrapping * Fix tests * fix http test panic * Pull in latest raft-snapshot library * Add comment	2019-06-20 12:14:58 -07:00

32 commits