Commit Graph

55 Commits

Author SHA1 Message Date
miagilepner e873932bce
VAULT-8436 remove <-time.After statements in for loops (#18818)
* replace time.After with ticker in loops

* add semgrep rule

* update to use timers

* remove stop
2023-02-06 17:49:01 +01:00
Josh Black c9763996d4
Enable undo logs by default (#18692)
* Enable undo logs by default

* add consul test

* update go.mod/sum

* add a better non-existent key
2023-01-17 13:38:18 -08:00
Tom Proctor dc85e37cf4
storage/raft: Add retry_join_as_non_voter config option (#18030) 2022-11-18 17:58:16 +00:00
Josh Black ad1503ebcd
disable undo logs by default for 1.12.0 (#17453) 2022-10-07 08:47:40 -07:00
Josh Black c45c6e51c0
only enable undo logs if all cluster members support it (#17378) 2022-10-06 11:24:16 -07:00
Nick Cabatoff 559754d580
Break grabLockOrStop into two pieces to facilitate investigating deadlocks (#17187)
Break grabLockOrStop into two pieces to facilitate investigating deadlocks.  Without this change, the "grab" goroutine looks the same regardless of who was calling grabLockOrStop, so there's no way to identify one of the deadlock parties.
2022-09-20 11:03:16 -04:00
Nick Cabatoff 3075c5bd65
Do not attempt to write a new TLS keyring at startup if raft is already setup (#17079) 2022-09-09 12:19:57 -04:00
Nick Cabatoff 5db952eada
autopilot: assume nodes we haven't received heartbeats from are running the same version as we are (#17019)
OSS parts of ent PR #3172: assume nodes we haven't received heartbeats from are running the same version as we are.  Failing to provide a version/upgrade_version will result in Autopilot (on ent) demoting those unversioned nodes to non-voters until we receive a heartbeat from them.
2022-09-06 14:49:04 -04:00
Scott Miller 3bd38fd5dc
OSS portion of wrapper-v2 (#16811)
* OSS portion of wrapper-v2

* Prefetch barrier type to avoid encountering an error in the simple BarrierType() getter

* Rename the OveriddenType to WrapperType and use it for the barrier type prefetch

* Fix unit test
2022-08-23 15:37:16 -04:00
Mike Palmiotto cd1157a905
Vault 7338/fix retry join (#16550)
* storage/raft: Fix cluster init with retry_join

Commit 8db66f4853abce3f432adcf1724b1f237b275415 introduced an error
wherein a join() would return nil (no error) with no information on its
channel if a joining node had been initialized. This was not handled
properly by the caller and resulted in a canceled `retry_join`.

Fix this by handling the `nil` channel respone by treating it as an
error and allowing the existing mechanics to work as intended.

* storage/raft: Improve retry_join go test

* storage/raft: Make VerifyRaftPeers pollable

* storage/raft: Add changelog entry for retry_join fix

* storage/raft: Add description to VerifyRaftPeers
2022-08-03 20:44:57 -05:00
Mike Palmiotto 42900b554b
storage/raft: Make raftInfo atomic (#16565)
* storage/raft: Make raftInfo atomic

This fixes some racy behavior discovered in parallel testing. Change the
core struct member to an atomic and update references throughout.
2022-08-03 18:40:49 -04:00
Mike Palmiotto 439e35f50f
Vault 6773/raft rejoin nonvoter (#16324)
* raft: Ensure init before setting suffrage

As reported in https://hashicorp.atlassian.net/browse/VAULT-6773:

	The /sys/storage/raft/join endpoint is intended to be unauthenticated. We rely
	on the seal to manage trust.

	It’s possible to use multiple join requests to switch nodes from voter to
	non-voter. The screenshot shows a 3 node cluster where vault_2 is the leader,
	and vault_3 and vault_4 are followers with non-voters set to false.  sent two
	requests to the raft join endpoint to have vault_3 and vault_4 join the cluster
	with non_voters:true.

This commit fixes the issue by delaying the call to SetDesiredSuffrage until after
the initialization check, preventing unauthenticated mangling of voter status.

Tested locally using
https://github.com/hashicorp/vault-tools/blob/main/users/ncabatoff/cluster/raft.sh
and the reproducer outlined in VAULT-6773.

* raft: Return join err on failure

This is necessary to correctly distinguish errors returned from the Join
workflow. Previously, errors were being masked as timeouts.

* raft: Default autopilot parameters in teststorage

Change some defaults so we don't have to pass in parameters or set them
in the originating tests. These storage types are only used in two
places:

1) Raft HA testing
2) Seal migration testing

Both consumers have been tested and pass with this change.

* changelog: Unauthn voter status change bugfix
2022-07-18 14:37:12 -04:00
Chris Capurso 9501d44ed5
Add endpoints to provide ability to modify logging verbosity (#16111)
* add func to set level for specific logger

* add endpoints to modify log level

* initialize base logger with IndependentLevels

* test to ensure other loggers remain unchanged

* add DELETE loggers endpoints to revert back to config

* add API docs page

* add changelog entry

* remove extraneous line

* add log level field to Core struct

* add godoc for getLogLevel

* add some loggers to c.allLoggers
2022-06-27 11:39:53 -04:00
Josh Black 416504d8c3
Add autopilot automated upgrades and redundancy zones (#15521) 2022-05-20 16:49:11 -04:00
Violet Hynes 6d4497bcbf
VAULT-4306 Ensure /raft/bootstrap/challenge call ignores erroneous namespaces set (#15519)
* VAULT-4306 Ensure /raft/bootstrap/challenge call ignores erroneous namespaces set

* VAULT-4306 Add changelog

* VAULT-4306 Update changelog/15519.txt

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>
2022-05-19 16:27:51 -04:00
Chris Capurso cc531c793d
fix raft tls key rotation panic when rotation time in past (#15156)
* fix raft tls key rotation panic when rotation time in past

* add changelog entry

* push out next raft TLS rotation time in case close to elapsing

* consolidate tls key rotation duration calculation

* reduce raft getNextRotationTime padding to 10 seconds

* move tls rotation ticker reset to where its duration is calculated
2022-04-25 21:48:34 -04:00
hghaf099 aafb5d6427
VAULT-4240 time.After() in a select statement can lead to memory leak (#14814)
* VAULT-4240 time.After() in a select statement can lead to memory leak

* CL
2022-04-01 10:17:11 -04:00
Pratyoy Mukhopadhyay 69c22b8078
Fix raft paralle retry bug (#14303) 2022-02-28 10:38:34 -08:00
Nick Cabatoff 400996ef0d
Parallel retry join (#13606) 2022-01-17 10:33:03 -05:00
Jim Kalafut 22c4ae5933
Rename master key to root key (#13324)
* See what it looks like to replace "master key" with "root key".  There are two places that would require more challenging code changes: the storage path `core/master`, and its contents (the JSON-serialized EncodedKeyringtructure.)

* Restore accidentally deleted line

* Add changelog

* Update root->recovery

* Fix test

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>
2021-12-06 17:12:20 -08:00
Daniel Kimsey b4b61efc75
Auto-join support for IPv6 discovery (#12366)
* Auto-join support for IPv6 discovery

The go-discover library returns IP addresses and not URLs. It just so
happens net.URL parses "127.0.0.1", which isn't a valid URL.

Instead, we construct the URL ourselves. Being careful to check if it's
an ipv6 address and making sure it's in explicit form if so.

Fixes #12323

* feedback: addrs & ipv6 test

Rename addrs to clusterIPs to improve clarity and intent

Tighten up our IPv6 address detection to be more correct and to ensure
it's actually in implicit form
2021-09-07 11:55:07 -07:00
Jeff Mitchell f7147025dd
Migrate to sdk/internalshared libs in go-secure-stdlib (#12090)
* Swap sdk/helper libs to go-secure-stdlib

* Migrate to go-secure-stdlib reloadutil

* Migrate to go-secure-stdlib kv-builder

* Migrate to go-secure-stdlib gatedwriter
2021-07-15 20:17:31 -04:00
Vishal Nayak eecb39a57f
OSS parts of Autopilot in DR secondaries (#12014) 2021-07-08 12:30:01 -04:00
Nick Cabatoff 01f96f18ce
VAULT-2439: OSS parts of #1889 (raft licensing init) (#11665) 2021-05-19 16:07:58 -04:00
Brian Kassouf f498d0d389
Reload raft TLS keys on active startup (#11660) 2021-05-19 10:03:32 -07:00
Lars Lehtonen 53dd619d2f
vault: deprecate errwrap.Wrapf() (#11577) 2021-05-11 13:12:54 -04:00
Josh Black 06809930a3
Add HTTP response headers for hostname and raft node ID (if applicable) (#11289) 2021-04-20 15:25:04 -07:00
Vishal Nayak 4666f40925
Support autopilot when raft is for HA only (#11260) 2021-04-12 09:33:21 -04:00
Nick Cabatoff 44c00cd54f
Fix: leader_tls_servername raft option only worked when used with mTLS and/or an explicit CA cert. (#11252) 2021-04-06 09:16:54 -04:00
Nick Cabatoff 41d9030fbb
Disable autopilot in raft-ha mode. (#11181)
* Disable autopilot in raft-ha mode.

* Also don't run autopilot on DR secondaries.
2021-03-23 14:13:44 -07:00
Vishal Nayak 3e55e79a3f
Autopilot: Server Stabilization, State and Dead Server Cleanup (#10856)
* k8s doc: update for 0.9.1 and 0.8.0 releases (#10825)

* k8s doc: update for 0.9.1 and 0.8.0 releases

* Update website/content/docs/platform/k8s/helm/configuration.mdx

Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

* Autopilot initial commit

* Move autopilot related backend implementations to its own file

* Abstract promoter creation

* Add nil check for health

* Add server state oss no-ops

* Config ext stub for oss

* Make way for non-voters

* s/health/state

* s/ReadReplica/NonVoter

* Add synopsis and description

* Remove struct tags from AutopilotConfig

* Use var for config storage path

* Handle nin-config when reading

* Enable testing autopilot by using inmem cluster

* First passing test

* Only report the server as known if it is present in raft config

* Autopilot defaults to on for all existing and new clusters

* Add locking to some functions

* Persist initial config

* Clarify the command usage doc

* Add health metric for each node

* Fix audit logging issue

* Don't set DisablePerformanceStandby to true in test

* Use node id label for health metric

* Log updates to autopilot config

* Less aggressively consume config loading failures

* Return a mutable config

* Return early from known servers if raft config is unable to be pulled

* Update metrics name

* Reduce log level for potentially noisy log

* Add knob to disable autopilot

* Don't persist if default config is in use

* Autopilot: Dead server cleanup (#10857)

* Dead server cleanup

* Initialize channel in any case

* Fix a bunch of tests

* Fix panic

* Add follower locking in heartbeat tracker

* Add LastContactFailureThreshold to config

* Add log when marking node as dead

* Update follower state locking in heartbeat tracker

* Avoid follower states being nil

* Pull test to its own file

* Add execution status to state response

* Optionally enable autopilot in some tests

* Updates

* Added API function to fetch autopilot configuration

* Add test for default autopilot configuration

* Configuration tests

* Add State API test

* Update test

* Added TestClusterOptions.PhysicalFactoryConfig

* Update locking

* Adjust locking in heartbeat tracker

* s/last_contact_failure_threshold/left_server_last_contact_threshold

* Add disabling autopilot as a core config option

* Disable autopilot in some tests

* s/left_server_last_contact_threshold/dead_server_last_contact_threshold

* Set the lastheartbeat of followers to now when setting up active node

* Don't use config defaults from CLI command

* Remove config file support

* Remove HCL test as well

* Persist only supplied config; merge supplied config with default to operate

* Use pointer to structs for storing follower information

* Test update

* Retrieve non voter status from configbucket and set it up when a node comes up

* Manage desired suffrage

* Consider bucket being created already

* Move desired suffrage to its own entry

* s/DesiredSuffrageKey/LocalNodeConfigKey

* s/witnessSuffrage/recordSuffrage

* Fix test compilation

* Handle local node config post a snapshot install

* Commit to storage first; then record suffrage in fsm

* No need of local node config being nili case, post snapshot restore

* Reconcile autopilot config when a new leader takes over duty

* Grab fsm lock when recording suffrage

* s/Suffrage/DesiredSuffrage in FollowerState

* Instantiate autopilot only in leader

* Default to old ways in more scenarios

* Make API gracefully handle 404

* Address some feedback

* Make IsDead an atomic.Value

* Simplify follower hearbeat tracking

* Use uber.atomic

* Don't have multiple causes for having autopilot disabled

* Don't remove node from follower states if we fail to remove the dead server

* Autopilot server removals map (#11019)

* Don't remove node from follower states if we fail to remove the dead server

* Use map to track dead server removals

* Use lock and map

* Use delegate lock

* Adjust when to remove entry from map

* Only hold the lock while accessing map

* Fix race

* Don't set default min_quorum

* Fix test

* Ensure follower states is not nil before starting autopilot

* Fix race

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>
2021-03-03 13:59:50 -05:00
Vishal Nayak 53cb1deb38
Revert "Read-replica instead of non-voter (#10875)" (#10890)
This reverts commit fc745670cf34821f5834357d9caebc3351dbc1e7.
2021-02-10 16:41:58 -05:00
Vishal Nayak a2394e7353
Read-replica instead of non-voter (#10875) 2021-02-10 09:58:18 -05:00
Nick Cabatoff 8cbc63d572
Add configuration to specify a TLS ServerName to use in the TLS handshake when performing a raft join. (#10698) 2021-01-19 17:54:28 -05:00
Nick Cabatoff 84d566db9e
Be consistent with how we report init status. (#10498)
Also make half-joined raft peers consider storage to be initialized, whether or not they're sealed.
2020-12-08 13:55:34 -05:00
Aleksandr Bezobchuk 95bbd8d920
Merge PR #10192: Auto-Join: Configurable Scheme & Port (and add k8s provider) 2020-10-23 16:13:09 -04:00
Aleksandr Bezobchuk d37be9af6e
Merge PR #10095: Integrated Storage Cloud Auto-Join 2020-10-13 16:26:39 -04:00
Brian Kassouf fd72d92434
raft: Fix some snapshot restore issues (#9533)
* raft: Remove double read lock

* Reload TLS keyring after reloading the barrier keys
2020-07-21 10:59:07 -07:00
ncabatoff d2436a9c56
Make standbyStopCh atomic to avoid data races (#9539) 2020-07-21 08:34:07 -04:00
Calvin Leung Huang c45bdca0b3
raft: add support for using backend for ha_storage (#9193)
* raft: initial work on raft ha storage support

* add note on join

* add todo note

* raft: add support for bootstrapping and joining existing nodes

* raft: gate bootstrap join by reading leader api address from storage

* raft: properly check for raft-only for certain conditionals

* raft: add bootstrap to api and cli

* raft: fix bootstrap cli command

* raft: add test for setting up new cluster with raft HA

* raft: extend TestRaft_HA_NewCluster to include inmem and consul backends

* raft: add test for updating an existing cluster to use raft HA

* raft: remove debug log lines, clean up verifyRaftPeers

* raft: minor cleanup

* raft: minor cleanup

* Update physical/raft/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/ha.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/ha.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/logical_system_raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* address feedback comments

* address feedback comments

* raft: refactor tls keyring logic

* address feedback comments

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* address feedback comments

* testing: fix import ordering

* raft: rename var, cleanup comment line

* docs: remove ha_storage restriction note on raft

* docs: more raft HA interaction updates with migration and recovery mode

* docs: update the raft join command

* raft: update comments

* raft: add missing isRaftHAOnly check for clearing out state set earlier

* raft: update a few ha_storage config checks

* Update command/operator_raft_bootstrap.go

Co-authored-by: Vishal Nayak <vishalnayak@users.noreply.github.com>

* raft: address feedback comments

* raft: fix panic when checking for config.HAStorage.Type

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* Update website/pages/docs/commands/operator/raft.mdx

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* raft: remove bootstrap cli command

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* raft: address review feedback

* raft: revert vendored sdk

* raft: don't send applied index and node ID info if we're HA-only

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>
Co-authored-by: Vishal Nayak <vishalnayak@users.noreply.github.com>
2020-06-23 12:04:13 -07:00
Brian Kassouf c8dde052f2
storage/raft: Advertise the configured cluster address (#9008)
* storage/raft: Advertise the configured cluster address

* Don't allow raft to start with unspecified IP

* Fix concurrent map write panic

* Add test file

* changelog++

* changelog++

* changelog++

* Update tcp_layer.go

* Update tcp_layer.go

* Only set the adverise addr if set
2020-05-18 18:22:25 -07:00
Brian Kassouf 1bb0bd489d
storage/raft: Add committed and applied indexes to the status output (#9011)
* storage/raft: Add committed and applied indexes to the status output

* Update api vendor

* changelog++

* Update http/sys_leader.go

Co-authored-by: Jim Kalafut <jkalafut@hashicorp.com>

Co-authored-by: Jim Kalafut <jkalafut@hashicorp.com>
2020-05-18 16:07:27 -07:00
Brian Kassouf 05eea911bd
storage/raft: Refresh TLS keyring on snapshot restore (#8546) 2020-03-13 13:39:14 -07:00
Jim Kalafut f17fc4e5c1
Run goimports (#8251) 2020-01-27 21:11:00 -08:00
Vishal Nayak 8891f2ba88 Raft retry join (#7856)
* Raft retry join

* update

* Make retry join work with shamir seal

* Return upon context completion

* Update vault/raft.go

Co-Authored-By: Brian Kassouf <briankassouf@users.noreply.github.com>

* Address some review comments

* send leader information slice as a parameter

* Make retry join work properly with Shamir case. This commit has a blocking issue

* Fix join goroutine exiting before the job is done

* Polishing changes

* Don't return after a successful join during unseal

* Added config parsing test

* Add test and fix bugs

* minor changes

* Address review comments

* Fix build error

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
2020-01-13 17:02:16 -08:00
Jeff Mitchell a0694943cc
Migrate built in auto seal to go-kms-wrapping (#8118) 2020-01-10 20:39:52 -05:00
Lexman c86fe212c0
oss changes for entropy augmentation feature (#7670)
* oss changes for entropy augmentation feature

* fix oss command/server/config tests

* update go.sum

* fix logical_system and http/ tests

* adds vendored files

* removes unused variable
2019-10-17 10:33:00 -07:00
Brian Kassouf 024c29c36a
OSS portions of raft non-voters (#7634)
* OSS portions of raft non-voters

* add file

* Update vault/raft.go

Co-Authored-By: Vishal Nayak <vishalnayak@users.noreply.github.com>
2019-10-11 11:56:59 -07:00
isbric e6e20e9eb3 Correct spelling of error message (#7630) 2019-10-11 11:14:41 -04:00
ncabatoff ed147b7ae7
Make clusterListener an atomic.Value to avoid races with getGRPCDialer. (#7408) 2019-09-03 11:59:56 -04:00