Commit Graph

16 Commits

Author SHA1 Message Date
Brian Kassouf 303c2aee7c
Run a more strict formatter over the code (#11312)
* Update tooling

* Run gofumpt

* go mod vendor
2021-04-08 09:43:39 -07:00
Vishal Nayak 3e55e79a3f
Autopilot: Server Stabilization, State and Dead Server Cleanup (#10856)
* k8s doc: update for 0.9.1 and 0.8.0 releases (#10825)

* k8s doc: update for 0.9.1 and 0.8.0 releases

* Update website/content/docs/platform/k8s/helm/configuration.mdx

Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

* Autopilot initial commit

* Move autopilot related backend implementations to its own file

* Abstract promoter creation

* Add nil check for health

* Add server state oss no-ops

* Config ext stub for oss

* Make way for non-voters

* s/health/state

* s/ReadReplica/NonVoter

* Add synopsis and description

* Remove struct tags from AutopilotConfig

* Use var for config storage path

* Handle nin-config when reading

* Enable testing autopilot by using inmem cluster

* First passing test

* Only report the server as known if it is present in raft config

* Autopilot defaults to on for all existing and new clusters

* Add locking to some functions

* Persist initial config

* Clarify the command usage doc

* Add health metric for each node

* Fix audit logging issue

* Don't set DisablePerformanceStandby to true in test

* Use node id label for health metric

* Log updates to autopilot config

* Less aggressively consume config loading failures

* Return a mutable config

* Return early from known servers if raft config is unable to be pulled

* Update metrics name

* Reduce log level for potentially noisy log

* Add knob to disable autopilot

* Don't persist if default config is in use

* Autopilot: Dead server cleanup (#10857)

* Dead server cleanup

* Initialize channel in any case

* Fix a bunch of tests

* Fix panic

* Add follower locking in heartbeat tracker

* Add LastContactFailureThreshold to config

* Add log when marking node as dead

* Update follower state locking in heartbeat tracker

* Avoid follower states being nil

* Pull test to its own file

* Add execution status to state response

* Optionally enable autopilot in some tests

* Updates

* Added API function to fetch autopilot configuration

* Add test for default autopilot configuration

* Configuration tests

* Add State API test

* Update test

* Added TestClusterOptions.PhysicalFactoryConfig

* Update locking

* Adjust locking in heartbeat tracker

* s/last_contact_failure_threshold/left_server_last_contact_threshold

* Add disabling autopilot as a core config option

* Disable autopilot in some tests

* s/left_server_last_contact_threshold/dead_server_last_contact_threshold

* Set the lastheartbeat of followers to now when setting up active node

* Don't use config defaults from CLI command

* Remove config file support

* Remove HCL test as well

* Persist only supplied config; merge supplied config with default to operate

* Use pointer to structs for storing follower information

* Test update

* Retrieve non voter status from configbucket and set it up when a node comes up

* Manage desired suffrage

* Consider bucket being created already

* Move desired suffrage to its own entry

* s/DesiredSuffrageKey/LocalNodeConfigKey

* s/witnessSuffrage/recordSuffrage

* Fix test compilation

* Handle local node config post a snapshot install

* Commit to storage first; then record suffrage in fsm

* No need of local node config being nili case, post snapshot restore

* Reconcile autopilot config when a new leader takes over duty

* Grab fsm lock when recording suffrage

* s/Suffrage/DesiredSuffrage in FollowerState

* Instantiate autopilot only in leader

* Default to old ways in more scenarios

* Make API gracefully handle 404

* Address some feedback

* Make IsDead an atomic.Value

* Simplify follower hearbeat tracking

* Use uber.atomic

* Don't have multiple causes for having autopilot disabled

* Don't remove node from follower states if we fail to remove the dead server

* Autopilot server removals map (#11019)

* Don't remove node from follower states if we fail to remove the dead server

* Use map to track dead server removals

* Use lock and map

* Use delegate lock

* Adjust when to remove entry from map

* Only hold the lock while accessing map

* Fix race

* Don't set default min_quorum

* Fix test

* Ensure follower states is not nil before starting autopilot

* Fix race

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>
2021-03-03 13:59:50 -05:00
Nick Cabatoff 6faef07fd5
Factor out the consul-using sealmigration tests to their own package, so that the remaining tests can run in the CI job that doesn't need docker. (#10342)
Remove the file-storage-backed tests: they don't add anything, and they don't represent a viable cluster storage solution that can be used in prod.
2020-11-20 07:53:31 -05:00
Nick Cabatoff 0d6a929a4c
Same seal migration oss (#10224)
* Refactoring and test improvements.

* Support migrating from a given type of autoseal to that same type but with different parameters.
2020-10-23 14:16:04 -04:00
ncabatoff 0f77d0e282
Move the code that creates Consul containers out of teststorage. This allows importers of teststorage that don't need consul to run as a non-docker test. (#9975) 2020-09-17 15:44:29 -04:00
ncabatoff b615da43d7
Run CI tests in docker instead of a machine. (#8948) 2020-09-15 10:01:26 -04:00
Mark Gritter 4633f5a8fc
Disable flaky test case. (#9926) 2020-09-10 17:54:31 -05:00
ncabatoff 4134ef2e98
Ensure that perf standbys can perform seal migrations. (#9690) 2020-08-10 08:35:57 -04:00
ncabatoff 3fbc0f35c2
Make runTransit tolerate a non-core-0 leader. (#9548) 2020-07-21 15:50:01 -04:00
ncabatoff d777708fde
Improve logging, and add polling to the post-stepdown leader check. (#9530) 2020-07-20 12:44:23 -04:00
ncabatoff 3ddc837ce3
Make sure cluster is stopped before wiping storage. (#9526) 2020-07-20 09:32:38 -04:00
Brian Kassouf f8df68b673
seal: Fix issue migrating from Auto->Shamir and improve tests (#9430)
* Fix issue migrating from Auto->Shamir and improve tests

* Undo newline

* fix panic in test

* Fix test panic
2020-07-09 12:28:17 -07:00
Calvin Leung Huang 67444d85b8
test/migration: ensure that leader client is used for storage read check (#9403) 2020-07-06 16:22:07 -07:00
Calvin Leung Huang c45bdca0b3
raft: add support for using backend for ha_storage (#9193)
* raft: initial work on raft ha storage support

* add note on join

* add todo note

* raft: add support for bootstrapping and joining existing nodes

* raft: gate bootstrap join by reading leader api address from storage

* raft: properly check for raft-only for certain conditionals

* raft: add bootstrap to api and cli

* raft: fix bootstrap cli command

* raft: add test for setting up new cluster with raft HA

* raft: extend TestRaft_HA_NewCluster to include inmem and consul backends

* raft: add test for updating an existing cluster to use raft HA

* raft: remove debug log lines, clean up verifyRaftPeers

* raft: minor cleanup

* raft: minor cleanup

* Update physical/raft/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/ha.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/ha.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/logical_system_raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* address feedback comments

* address feedback comments

* raft: refactor tls keyring logic

* address feedback comments

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* address feedback comments

* testing: fix import ordering

* raft: rename var, cleanup comment line

* docs: remove ha_storage restriction note on raft

* docs: more raft HA interaction updates with migration and recovery mode

* docs: update the raft join command

* raft: update comments

* raft: add missing isRaftHAOnly check for clearing out state set earlier

* raft: update a few ha_storage config checks

* Update command/operator_raft_bootstrap.go

Co-authored-by: Vishal Nayak <vishalnayak@users.noreply.github.com>

* raft: address feedback comments

* raft: fix panic when checking for config.HAStorage.Type

* Update vault/raft.go

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* Update website/pages/docs/commands/operator/raft.mdx

Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>

* raft: remove bootstrap cli command

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* Update vault/raft.go

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* raft: address review feedback

* raft: revert vendored sdk

* raft: don't send applied index and node ID info if we're HA-only

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>
Co-authored-by: Vishal Nayak <vishalnayak@users.noreply.github.com>
2020-06-23 12:04:13 -07:00
Mike Jarmy e608503139
Test Shamir-to-Transit and Transit-to-Shamir Seal Migration for post-1.4 Vault. (#9214)
* move adjustForSealMigration to vault package

* fix adjustForSealMigration

* begin working on new seal migration test

* create shamir seal migration test

* refactor testhelpers

* add VerifyRaftConfiguration to testhelpers

* stub out TestTransit

* Revert "refactor testhelpers"

This reverts commit 39593defd0d4c6fd79aedfd37df6298391abb9db.

* get shamir test working again

* stub out transit join

* work on transit join

* remove debug code

* initTransit now works with raft join

* runTransit works with inmem

* work on runTransit with raft

* runTransit works with raft

* cleanup tests

* TestSealMigration_TransitToShamir_Pre14

* TestSealMigration_ShamirToTransit_Pre14

* split for pre-1.4 testing

* add simple tests for transit and shamir

* fix typo in test suite

* debug wrapper type

* test debug

* test-debug

* refactor core migration

* Revert "refactor core migration"

This reverts commit a776452d32a9dca7a51e3df4a76b9234d8c0c7ce.

* begin refactor of adjustForSealMigration

* fix bug in adjustForSealMigration

* clean up tests

* clean up core refactoring

* fix bug in shamir->transit migration

* stub out test that brings individual nodes up and down

* refactor NewTestCluster

* pass listeners into newCore()

* simplify cluster address setup

* simplify extra test core setup

* refactor TestCluster for readability

* refactor TestCluster for readability

* refactor TestCluster for readability

* add shutdown func to TestCore

* add cleanup func to TestCore

* create RestartCore

* stub out TestSealMigration_ShamirToTransit_Post14

* refactor address handling in NewTestCluster

* fix listener setup in newCore()

* remove unnecessary lock from setSealsForMigration()

* rename sealmigration test package

* use ephemeral ports below 30000

* work on post-1.4 migration testing

* clean up pre-1.4 test

* TestSealMigration_ShamirToTransit_Post14 works for non-raft

* work on raft TestSealMigration_ShamirToTransit_Post14

* clean up test code

* refactor TestClusterCore

* clean up TestClusterCore

* stub out some temporary tests

* use HardcodedServerAddressProvider in seal migration tests

* work on raft for TestSealMigration_ShamirToTransit_Post14

* always use hardcoded raft address provider in seal migration tests

* debug TestSealMigration_ShamirToTransit_Post14

* fix bug in RestartCore

* remove debug code

* TestSealMigration_ShamirToTransit_Post14 works now

* clean up debug code

* clean up tests

* cleanup tests

* refactor test code

* stub out TestSealMigration_TransitToShamir_Post14

* set seals properly for transit->shamir migration

* migrateFromTransitToShamir_Post14 works for inmem

* migrateFromTransitToShamir_Post14 works for raft

* use base ports per-test

* fix seal verification test code

* simplify seal migration test suite

* simplify test suite

* cleanup test suite

* use explicit ports below 30000

* simplify use of numTestCores

* Update vault/external_tests/sealmigration/seal_migration_test.go

Co-authored-by: Calvin Leung Huang <cleung2010@gmail.com>

* Update vault/external_tests/sealmigration/seal_migration_test.go

Co-authored-by: Calvin Leung Huang <cleung2010@gmail.com>

* clean up imports

* rename to StartCore()

* Update vault/testing.go

Co-authored-by: Calvin Leung Huang <cleung2010@gmail.com>

* simplify test suite

* clean up tests

Co-authored-by: Calvin Leung Huang <cleung2010@gmail.com>
2020-06-16 14:12:22 -04:00
Mike Jarmy 4303790aae
Test pre-1.4 seal migration (#9085)
* enable seal wrap in all seal migration tests

* move adjustForSealMigration to vault package

* fix adjustForSealMigration

* begin working on new seal migration test

* create shamir seal migration test

* refactor testhelpers

* add VerifyRaftConfiguration to testhelpers

* stub out TestTransit

* Revert "refactor testhelpers"

This reverts commit 39593defd0d4c6fd79aedfd37df6298391abb9db.

* get shamir test working again

* stub out transit join

* work on transit join

* Revert "move resuable storage test to avoid creating import cycle"

This reverts commit b3ff2317381a5af12a53117f87d1c6fbb093af6b.

* remove debug code

* initTransit now works with raft join

* runTransit works with inmem

* work on runTransit with raft

* runTransit works with raft

* get rid of dis-used test

* cleanup tests

* TestSealMigration_TransitToShamir_Pre14

* TestSealMigration_ShamirToTransit_Pre14

* split for pre-1.4 testing

* add simple tests for transit and shamir

* fix typo in test suite

* debug wrapper type

* test debug

* test-debug

* refactor core migration

* Revert "refactor core migration"

This reverts commit a776452d32a9dca7a51e3df4a76b9234d8c0c7ce.

* begin refactor of adjustForSealMigration

* fix bug in adjustForSealMigration

* clean up tests

* clean up core refactoring

* fix bug in shamir->transit migration

* remove unnecessary lock from setSealsForMigration()

* rename sealmigration test package

* use ephemeral ports below 30000

* simplify use of numTestCores
2020-06-11 15:07:59 -04:00