* raft: use file paths for TLS info in the retry_join stanza
* raft: maintain backward compat for existing tls params
* docs: update raft docs with new file-based TLS params
* Update godoc comment, fix docs
* raft: check for nil on concrete type in SetupCluster
* raft: move check to its own func
* raft: func cleanup
* raft: disallow disable_clustering = true when raft storage is used
* docs: update disable_clustering to mention new behavior
* Disallow non-voter setting from peers.json
* Fix bug that would make the actual fix a no-op
* Change order of evaluation
* Error out instead of resetting the value
* Update physical/raft/raft.go
Co-Authored-By: Calvin Leung Huang <cleung2010@gmail.com>
* Print node ID
Co-authored-by: Calvin Leung Huang <cleung2010@gmail.com>
* adding support for TLS 1.3 for TCP listeners
* removed test as CI uses go 1.12
* removed Cassandra support, added deprecation notice
* re-added TestTCPListener_tls13
* Guard against using Raft as a separate HA Storage
* Document that Raft cannot be used as a separate ha_storage backend at this time
* remove duplicate imports from updating with master
* use observer pattern for service discovery
* update perf standby method
* fix test
* revert usersTags to being called serviceTags
* use previous consul code
* vault isn't a performance standby before starting
* log err
* changes from feedback
* add Run method to interface
* changes from feedback
* fix core test
* update example
* Raft retry join
* update
* Make retry join work with shamir seal
* Return upon context completion
* Update vault/raft.go
Co-Authored-By: Brian Kassouf <briankassouf@users.noreply.github.com>
* Address some review comments
* send leader information slice as a parameter
* Make retry join work properly with Shamir case. This commit has a blocking issue
* Fix join goroutine exiting before the job is done
* Polishing changes
* Don't return after a successful join during unseal
* Added config parsing test
* Add test and fix bugs
* minor changes
* Address review comments
* Fix build error
Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
* move ServiceDiscovery into methods
* add ServiceDiscoveryFactory
* add serviceDiscovery field to vault.Core
* refactor ConsulServiceDiscovery into separate struct
* cleanup
* revert accidental change to go.mod
* cleanup
* get rid of un-needed struct tags in vault.CoreConfig
* add service_discovery parser
* add ServiceDiscovery to config
* cleanup
* cleanup
* add test for ConfigServiceDiscovery to Core
* unit testing for config service_discovery stanza
* cleanup
* get rid of un-needed redirect_addr stuff in service_discovery stanza
* improve test suite
* cleanup
* clean up test a bit
* create docs for service_discovery
* check if service_discovery is configured, but storage does not support HA
* tinker with test
* tinker with test
* tweak docs
* move ServiceDiscovery into its own package
* tweak a variable name
* fix comment
* rename service_discovery to service_registration
* tweak service_registration config
* Revert "tweak service_registration config"
This reverts commit 5509920a8ab4c5a216468f262fc07c98121dce35.
* simplify naming
* refactor into ./serviceregistration/consul
* physical/postgresql: add ability to use CONNECTION_URL environment variable instead of requiring it to be configured in the Vault config file.
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* storage/postgresql: update configuration documentation for postgresql storage backend to include connection_url configuration via the PG_CONNECTION_URL environment variable
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: add a configuration file and tests for getting the connection_url from the config file or environment
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: update postgresql backend to pull the required connection_url from the PG_CONNECTION_URL environment variable if it exists, otherwise, fallback to using the config file
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: remove configure*.go files and prefer the postgresql*.go files
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: move and simplify connectionURL function
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: update connectionURL test to use an unordered map instead of slice to avoid test flakiness
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* physical/postgresql: update config env to be prefixed with VAULT_ - VAULT_PG_CONNECTION_URL
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* docs/web: update postgresql backend docs to use updated, VAULT_ prefixed config env
Signed-off-by: Colton McCurdy <mccurdyc22@gmail.com>
* refactor test code to avoid panic if tests ran multiple times
* cleanup: don't actually send just close
* move comment to a better location
* move error check to a more obvious spot
* Revert "move error check to a more obvious spot"
Reverting because methods like this should only be called on the main
goroutine running the test:
- https://golang.org/pkg/testing/#T
This reverts commit db7641948317785bff15b3d9dbe6fb18a2d19c2c.
* Fix unordered imports
* Allow Raft node ID to be set via the environment variable `VAULT_RAFT_NODE_ID`
* Allow Raft path to be set via the environment variable `VAULT_RAFT_PATH`
* Prioritize the environment when fetching the Raft configuration values
Values in environment variables should override the config as per the
documentation as well as common sense.
* Initial work
* rework
* s/dr/recovery
* Add sys/raw support to recovery mode (#7577)
* Factor the raw paths out so they can be run with a SystemBackend.
# Conflicts:
# vault/logical_system.go
* Add handleLogicalRecovery which is like handleLogical but is only
sufficient for use with the sys-raw endpoint in recovery mode. No
authentication is done yet.
* Integrate with recovery-mode. We now handle unauthenticated sys/raw
requests, albeit on path v1/raw instead of v1/sys/raw.
* Use sys/raw instead of raw during recovery.
* Don't bother persisting the recovery token. Authenticate sys/raw
requests with it.
* RecoveryMode: Support generate-root for autounseals (#7591)
* Recovery: Abstract config creation and log settings
* Recovery mode integration test. (#7600)
* Recovery: Touch up (#7607)
* Recovery: Touch up
* revert the raw backend creation changes
* Added recovery operation token prefix
* Move RawBackend to its own file
* Update API path and hit it using CLI flag on generate-root
* Fix a panic triggered when handling a request that yields a nil response. (#7618)
* Improve integ test to actually make changes while in recovery mode and
verify they're still there after coming back in regular mode.
* Refuse to allow a second recovery token to be generated.
* Resize raft cluster to size 1 and start as leader (#7626)
* RecoveryMode: Setup raft cluster post unseal (#7635)
* Setup raft cluster post unseal in recovery mode
* Remove marking as unsealed as it's not needed
* Address review comments
* Accept only one seal config in recovery mode as there is no scope for migration
* Store less data in Cassandra prefix buckets
The Cassandra physical backend relies on storing data for sys/foo/bar
under sys, sys/foo, and sys/foo/bar. This is necessary so that we
can list the sys bucket, get a list of all child keys, and then trim
this down to find child 'folders' e.g. foo. Right now however, we store
the full value of every storage entry in all three buckets. This is
unnecessary as the value will only ever be read out in the leaf bucket
ie sys/foo/bar. We use the intermediary buckets simply for listing keys.
We have seen some issues around compaction where certain buckets,
particularly intermediary buckets that are exclusively for listing,
get really clogged up with data to the point of not being listable.
Buckets like sys/expire/id are huge, combining lease expiry data for
all auth methods, and need to be listed for vault to successfully
become leader. This PR tries to cut down on the amount of data stored
in intermediary buckets.
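A rough sketch of the idea, assuming one row per prefix bucket. The `row` type and both function names are hypothetical stand-ins and do not reflect the backend's real schema or code:

```go
package main

import (
	"fmt"
	"strings"
)

// bucketsFor returns every prefix bucket for a key, shortest first:
// "sys/foo/bar" -> ["sys", "sys/foo", "sys/foo/bar"].
func bucketsFor(key string) []string {
	parts := strings.Split(key, "/")
	buckets := make([]string, 0, len(parts))
	for i := range parts {
		buckets = append(buckets, strings.Join(parts[:i+1], "/"))
	}
	return buckets
}

// row is a hypothetical stand-in for one stored row.
type row struct {
	Bucket string
	Key    string
	Value  []byte
}

// rowsFor builds one row per bucket. Only the leaf bucket carries the
// value; intermediary buckets keep just the key so they can be listed.
func rowsFor(key string, value []byte) []row {
	buckets := bucketsFor(key)
	rows := make([]row, 0, len(buckets))
	for i, b := range buckets {
		r := row{Bucket: b, Key: key}
		if i == len(buckets)-1 {
			r.Value = value
		}
		rows = append(rows, r)
	}
	return rows
}

func main() {
	for _, r := range rowsFor("sys/foo/bar", []byte("secret")) {
		fmt.Printf("bucket=%q key=%q value_len=%d\n", r.Bucket, r.Key, len(r.Value))
	}
}
```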
* Avoid goroutine leak by buffering results channel up to the bucket count
its timeout from 5s to 15s in the hopes that helps. The theory is that
since I haven't seen this on the OSS side, it's failing because the ent
side is heavier in terms of test load and thus the tests face more
resource contention.
* storage/raft: When restoring a snapshot preseal first
* best-effort allow standbys to apply the restoreOp before sealing active node
* Don't cache the raft tls key
* Update physical/raft/raft.go
* Move pending raft peers to core
* Fix race on close bool
* Extend the leaderlease time for tests
* Update raft deps
* Fix audit hashing
* Fix race with auditing
* Set MaxIdleConns to reduce connection churn (postgresql physical)
* Make new "max_idle_connections" config option for physical postgresql
* Add docs for "max_idle_connections" for postgresql storage
* Add minimum version to docs for max_idle_connections
* Work on raft backend
* Add logstore locally
* Add encryptor and unsealable interfaces
* Add clustering support to raft
* Remove client and handler
* Bootstrap raft on init
* Cleanup raft logic a bit
* More raft work
* Work on TLS config
* More work on bootstrapping
* Fix build
* More work on bootstrapping
* More bootstrapping work
* fix build
* Remove consul dep
* Fix build
* merged oss/master into raft-storage
* Work on bootstrapping
* Get bootstrapping to work
* Clean up FSM and node-id
* Update local node ID logic
* Cleanup node-id change
* Work on snapshotting
* Raft: Add remove peer API (#906)
* Add remove peer API
* Add some comments
* Fix existing snapshotting (#909)
* Raft get peers API (#912)
* Read raft configuration
* address review feedback
* Use the Leadership Transfer API to step-down the active node (#918)
* Raft join and unseal using Shamir keys (#917)
* Raft join using shamir
* Store AEAD instead of master key
* Split the raft join process to answer the challenge after a successful unseal
* get the follower to standby state
* Make unseal work
* minor changes
* Some input checks
* reuse the shamir seal access instead of new default seal access
* refactor joinRaftSendAnswer function
* Synchronously send answer in auto-unseal case
* Address review feedback
* Raft snapshots (#910)
* Fix existing snapshotting
* implement the noop snapshotting
* Add comments and switch log libraries
* add some snapshot tests
* add snapshot test file
* add TODO
* More work on raft snapshotting
* progress on the ConfigStore strategy
* Don't use two buckets
* Update the snapshot store logic to hide the file logic
* Add more backend tests
* Cleanup code a bit
* [WIP] Raft recovery (#938)
* Add recovery functionality
* remove fmt.Printfs
* Fix a few fsm bugs
* Add max size value for raft backend (#942)
* Add max size value for raft backend
* Include physical.ErrValueTooLarge in the message
* Raft snapshot Take/Restore API (#926)
* Initial work on raft snapshot APIs
* Always redirect snapshot install/download requests
* More work on the snapshot APIs
* Cleanup code a bit
* On restore handle special cases
* Use the seal to encrypt the sha sum file
* Add sealer mechanism and fix some bugs
* Call restore while state lock is held
* Send restore cb trigger through raft log
* Make error messages nicer
* Add test helpers
* Add snapshot test
* Add shamir unseal test
* Add more raft snapshot API tests
* Fix locking
* Change working to initialize
* Add underlying raw object to test cluster core
* Move leaderUUID to core
* Add raft TLS rotation logic (#950)
* Add TLS rotation logic
* Cleanup logic a bit
* Add/Remove from follower state on add/remove peer
* add comments
* Update more comments
* Update request_forwarding_service.proto
* Make sure we populate all nodes in the followerstate obj
* Update times
* Apply review feedback
* Add more raft config setting (#947)
* Add performance config setting
* Add more config options and fix tests
* Test Raft Recovery (#944)
* Test raft recovery
* Leave out a node during recovery
* remove unused struct
* Update physical/raft/snapshot_test.go
* Update physical/raft/snapshot_test.go
* fix vendoring
* Switch to new raft interface
* Remove unused files
* Switch a gogo -> proto instance
* Remove unneeded vault dep in go.sum
* Update helper/testhelpers/testhelpers.go
Co-Authored-By: Calvin Leung Huang <cleung2010@gmail.com>
* Update vault/cluster/cluster.go
* track active key within the keyring itself (#6915)
* track active key within the keyring itself
* lookup and store using the active key ID
* update docstring
* minor refactor
* Small text fixes (#6912)
* Update physical/raft/raft.go
Co-Authored-By: Calvin Leung Huang <cleung2010@gmail.com>
* review feedback
* Move raft logical system into separate file
* Update help text a bit
* Enforce cluster addr is set and use it for raft bootstrapping
* Fix tests
* fix http test panic
* Pull in latest raft-snapshot library
* Add comment
Make lock2's retryInterval smaller so it grabs the lock as soon as lock1's renewer fails to renew in time. Fix the logic to test if lock1's leader channel gets closed: we don't need a goroutine, and
the logic was broken in that if we timed out we'd never write to the blocking channel we then try to read from. Moreover the timeout was wrong.
* Port over some SP v2 bits
Specifically:
* Add too-large handling to Physical (Consul only for now)
* Contextify some identity funcs
* Update SP protos
* Add size limiting to inmem storage
* Exit DynamoDB tryToLock when stop channel is closed
If the stop channel is closed (e.g. an error is returned which triggers
close(stop) in Lock), this loop will spin and use 100% CPU.
* Ensure ticker is stopped
Merge both functions for creating mongodb containers into one.
Add retries to docker container cleanups.
Require $VAULT_ACC be set to enable AWS tests.
* Configurable lock and request etcd timeouts.
If the etcd cluster runs on slow servers, requests may take much longer than the hardcoded default timeouts allow.
In such a setup, a longer lock timeout may also be needed.
* Configurable lock and request etcd timeouts.
Docs.
* Use user friendly timeout syntax.
To allow specifying more readable time values.
The result will still pass gofmtcheck and won't trigger additional
changes if someone isn't using goimports, but it will avoid the
piecemeal imports changes we've been seeing.
* Fix typo in documentation
* Update fdb-go-install.sh for new release tags
* Exclude FoundationDB bindings from vendoring, delete vendored copy
FoundationDB bindings are tightly coupled to the server version and
client library version used in a specific deployment. Bindings need
to be installed using the fdb-go-install.sh script, as documented in
the foundationdb backend documentation.
* Add TLS support to FoundationDB backend
TLS support appeared in FoundationDB 5.2.4, raising the minimum API version
for TLS-aware FoundationDB code to 520.
* Update documentation for FoundationDB TLS support
* Docker support for postgres backend testing
* Bug in handling of postgres connection url for non docker testing
* Test should fail if it cannot retrieve pg version
* internal helperfunctions pascalCasing
We're having issues with leases in the GCS backend storage being
corrupted and failing MAC checking. When that happens, we need to know
the lease ID so we can address the corruption by hand and take
appropriate action.
This will hopefully prevent any instances of incomplete data being sent
to GSS
* The added method customTLSDial() creates a tls connection to the zookeeper backend when 'tls_enabled' is set to true in config
* Update to the document for TLS configuration that is required to enable TLS connection to Zookeeper backend
* Minor formatting update
* Minor update to the description for example config
* As per review comments from @kenbreeman, additional property description indicating support for multiple Root CAs in a single file has been added
* minor formatting
* Slight cleanup around mysql ha lock implementation
* Removes some duplication around lock table naming
* Escapes lock table name with backticks to handle weird characters
* Lock table defaults to regular table name + "_lock"
* Drop lock table after tests run
* Add `ha_enabled` option for mysql storage
It defaults to false, and we gate a few things like creating the lock
table and preparing lock related statements on it
* storage/gcs: fix race condition in releasing lock
Previously we were deleting a lock without first checking if the lock we were deleting was our own. There existed a small period of time where vault-0 would lose leadership and vault-1 would get leadership. vault-0 would delete the lock key while vault-1 would write it. If vault-0 won, there'd be another leader election, etc.
This fixes the race by using a CAS operation instead.
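A sketch of the compare-and-swap idea using an in-memory stand-in. The `lockStore` type and its methods are hypothetical and do not use or represent the GCS client API:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("lock changed since it was read")

// lockStore is a hypothetical in-memory stand-in for a store that supports
// conditional deletes keyed on a generation number.
type lockStore struct {
	mu         sync.Mutex
	holder     string
	generation int64
}

// write records holder as the lock owner and returns the new generation.
func (s *lockStore) write(holder string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.holder = holder
	s.generation++
	return s.generation
}

// deleteIfGeneration removes the lock only if nobody has rewritten it
// since the caller observed generation gen.
func (s *lockStore) deleteIfGeneration(gen int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.generation != gen {
		return errConflict
	}
	s.holder = ""
	s.generation++
	return nil
}

func main() {
	s := &lockStore{}
	gen0 := s.write("vault-0")
	s.write("vault-1") // vault-1 takes over before vault-0 releases

	// vault-0's stale release is rejected, so vault-1 keeps the lock.
	fmt.Println(s.deleteIfGeneration(gen0), s.holder)
}
```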
* storage/gcs: properly break out of loop during stop
* storage/spanner: properly break out of loop during stop
When using MySQL storage with `database = "dev-dassets-bc"`, creating the database and the table fails with the following errors:
Error initializing storage of type mysql: failed to create mysql database: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '-dassets-bc' at line 1
Error initializing storage of type mysql: failed to create mysql table: Error 1046: No database selected
This happens because `-` is an operator in MySQL, so wrap the database and table names in backticks in the create database, create table, and DML statements.
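A small sketch of backtick quoting for identifiers. The helper name `quoteIdent` is hypothetical, and doubling embedded backticks is an extra precaution based on MySQL's identifier quoting rules rather than something stated in this commit:

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps a MySQL identifier in backticks and doubles any
// backtick inside it, so names like dev-dassets-bc are not parsed as
// expressions.
func quoteIdent(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}

func main() {
	db := "dev-dassets-bc"
	fmt.Println("CREATE DATABASE IF NOT EXISTS " + quoteIdent(db))
}
```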
* Add request timeouts in normal request path and to expirations
* Add ability to adjust default max request duration
* Some test fixes
* Ensure tests have defaults set for max request duration
* Add context cancel checking to inmem/file
* Fix tests
* Fix tests
* Set default max request duration to basically infinity for this release for BC
* Address feedback
* Add an idle timeout for the server
Because tidy operations can be long-running, this also changes all tidy
operations to behave the same operationally (kick off the process, get a
warning back, log errors to server log) and makes them all run in a
goroutine.
This could mean a sort of hard stop if Vault gets sealed because the
function won't have the read lock. This should generally be okay
(running tidy again should pick back up where it left off), but future
work could use cleanup funcs to trigger the functions to stop.
* Fix up tidy test
* Add deadline to cluster connections and an idle timeout to the cluster server, plus add readheader/read timeout to api server
If we have a panic, deferred functions are run but unlocks aren't. Since we
can't really trust plugins and storage, this backs out the changes for
those parts of the request path.
* Remove a lot of deferred functions in the request path.
There is an interesting benchmark at https://www.reddit.com/r/golang/comments/3h21nk/simple_micro_benchmark_to_measure_the_overhead_of/
It shows that defer actually adds quite a lot of overhead -- maybe 100ns
per call but we defer a *lot* of functions in the request path. So this
removes some of the ones in request handling, ha, barrier, router, and
physical cache.
One meta-note: nearly every metrics function is in a defer which means
every metrics call we add could add a non-trivial amount of time, e.g.
for every 10 extra metrics statements we add 1ms to a request. I don't
know how to solve this right now without doing what I did in some of
these cases and putting that call into a simple function call that then
goes before each return.
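The pattern described above can be sketched as follows. The `samples` slice, `measureSince`, and both `get` functions are hypothetical stand-ins and are not Vault's metrics code. Note that Go 1.14 and later made most defers considerably cheaper, so the overhead discussed here may be smaller on current toolchains:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errEmptyKey = errors.New("empty key")

// samples collects recorded durations; a stand-in for a metrics sink.
var samples []time.Duration

func measureSince(start time.Time) {
	samples = append(samples, time.Since(start))
}

// getDeferred records its timing through defer: one call covers every
// return path, at the cost of the defer itself.
func getDeferred(key string) (string, error) {
	defer measureSince(time.Now())
	if key == "" {
		return "", errEmptyKey
	}
	return "value:" + key, nil
}

// getExplicit makes the same measurement with a plain call placed before
// each return, which avoids the defer but must be repeated per path.
func getExplicit(key string) (string, error) {
	start := time.Now()
	if key == "" {
		measureSince(start)
		return "", errEmptyKey
	}
	v := "value:" + key
	measureSince(start)
	return v, nil
}

func main() {
	getDeferred("a")
	getExplicit("")
	fmt.Println(len(samples))
}
```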
* Simplify barrier defer cleanup
Taking inspiration from
https://github.com/golang/go/issues/17604#issuecomment-256384471
suggests that taking the address of a stack variable for use in atomics
works (at least, the race detector doesn't complain) but is doing it
wrong.
The only other change is a change in Leader() detecting if HA is enabled
to fast-path out. This value never changes after NewCore, so we don't
need to grab the read lock to check it.