diff --git a/changelog/diagnose.txt b/changelog/diagnose.txt
new file mode 100644
index 000000000..335a70ca9
--- /dev/null
+++ b/changelog/diagnose.txt
@@ -0,0 +1,3 @@
+```release-note:feature
+operator diagnose: a new vault operator command to detect common issues with vault server setups.
+```
\ No newline at end of file
diff --git a/website/content/docs/commands/operator/diagnose.mdx b/website/content/docs/commands/operator/diagnose.mdx
new file mode 100644
index 000000000..4c6c8dd00
--- /dev/null
+++ b/website/content/docs/commands/operator/diagnose.mdx
@@ -0,0 +1,228 @@
+---
+layout: docs
+page_title: operator diagnose - Command
+description: |-
+ "vault operator diagnose" is a new operator-centric command, focused on providing a clear description
+ of what is working in Vault, and what is not working. The command focuses on why Vault cannot serve requests,
+ but will also warn on configurations or statuses that it deems to be unsafe in some way.
+
+---
+
+# operator diagnose
+
+The operator diagnose command should be used primarily when Vault is down or
+partially inoperable. The command can be used safely regardless of the state
+Vault is in, but may return meaningless results for some of the test cases if the
+Vault server is already running.
+
+Note: if you run the diagnose command proactively, either before a server
+starts or while a server is operational, please consult the documentation
+on the individual checks below to see which checks may return false error
+messages or warnings.
+
+## Usage
+
+The following flags are available in addition to the [standard set of
+flags](/docs/commands) included on all commands.
+
+### Output Options
+
+- `-format` `(string: "table")` - Print the output in the given format. Valid
+ formats are "table", "json", or "yaml". This can also be specified via the
+ `VAULT_FORMAT` environment variable.
+
+#### Output Layout
+
+The operator diagnose command will output a set of lines in the CLI.
+Each line will begin with a prefix in square brackets. These are:
+
+- `[ success ]` - Denotes that the check was successful.
+- `[ warning ]` - Denotes that the check has passed, but that there may be potential
+issues to look into that may relate to the issues Vault is experiencing. Diagnose warns
+frequently. These warnings are meant to serve as starting points in the debugging process.
+- `[ failure ]` - Denotes that the check has failed. Failures are critical issues in the eyes
+of the diagnose command.
+
+In addition to these prefixed lines, there may be output lines that are not prefixed, but are
+color-coded purple. These are advice lines from Diagnose, and are meant to offer general guidance
+on how to go about fixing potential warnings or failures that may arise.
+
+Warn or fail prefixes in nested checks will bubble up to the parent if the prefix supersedes the
+parent prefix. Fail supersedes warn, and warn supersedes success. For example, if the TLS checks under
+the Storage check fail, the `[ failure ]` prefix will bubble up to the Storage check.
+
+### Command Options
+
+- `-config` `(string: "")` - The path to the Vault configuration file used by
+the Vault server on startup.
+
+### Diagnose Checks
+
+The following section details the various checks that Diagnose runs. Check names in documentation
+will be separated by slashes to denote that they are nested, when applicable. For example, a check
+documented as `A / B` will show up as `B` in the `operator diagnose` output, and will be nested
+(indented) under `A`.
+
+#### Vault Diagnose
+
+`Vault Diagnose` is the top-level check that contains the rest of the checks. It will report the status
+of the check.
+
+#### Check Operating System / Check Open File Limit
+
+`Check Open File Limit` verifies that the open file limit value is set high enough for Vault
+to run effectively. We recommend setting these limits to at least 1024768.
+
+This check will be skipped on openbsd, arm, and windows.
+
+#### Check Operating System / Check Disk Usage
+
+`Check Disk Usage` will report disk usage for each partition. For each partition on a production host,
+we recommend having at least 5% of the partition free to use, and at least 1 GB of space.
+
+This check will be skipped on openbsd and arm.
+
+#### Parse Configuration
+
+`Parse Configuration` will check the Vault server config file for syntax errors. It will check
+for extra values in the configuration file, repeated stanzas, and stanzas that do not belong
+in the configuration file (for example a "tcpp" listener as opposed to a tcp listener).
+
+Currently, the `storage` stanza is not checked.
+
+#### Check Storage / Create Storage Backend
+
+`Create Storage Backend` ensures that the storage stanza configured in the Vault server config
+has enough information to create a storage object internally. Common errors will have to do
+with misconfigured fields in the storage stanza.
+
+#### Check Storage / Check Consul TLS
+
+`Check Consul TLS` verifies TLS information included in the storage stanza if the storage type
+is consul. If a certificate chain is provided, Diagnose parses the root, intermediate, and leaf
+certificates, and checks each one for correctness.
+
+#### Check Storage / Check Consul Direct Storage Access
+
+`Check Consul Direct Storage Access` is a Consul-specific check that ensures Vault is not accessing
+the Consul server directly, but rather through a local agent.
+
+#### Check Storage / Check Raft Folder Permissions
+
+`Check Raft Folder Permissions` computes the permissions on the raft folder, checks that a boltDB file
+has been initialized within the folder previously, and ensures that the folder is not too permissive, but
+at the same time has enough permissions to be used. The raft folder should not have `other` permissions, but
+should have `group rw` or `owner rw`, depending on the setup. This check also warns if it detects a
+symlink being used.
+
+Note that this check will warn that a raft file has not been created if diagnose is run before
+the server has been run for the first time.
+
+This check will be skipped on windows.
+
+#### Check Storage / Check Raft Folder Ownership
+
+`Check Raft Folder Ownership` ensures that Vault does not need to run as root to access the boltDB folder.
+
+Note that this check will warn that a raft file has not been created if diagnose is run before
+the server has been run for the first time.
+
+This check will be skipped on windows.
+
+#### Check Storage / Check For Raft Quorum
+
+`Check For Raft Quorum` uses the FSM to ensure that there was an odd number of voters in the raft quorum when
+Vault was last running.
+
+Note that this check will warn that there are 0 voters if diagnose is run before the server has been run for the first time.
+
+#### Check Storage / Check Storage Access
+
+`Check Storage Access` will try to write a dud value, named `diagnose/latency/`, to storage.
+Ensure that there is no important data at this location before running diagnose, as this check
+will overwrite that data. This check will then try to list and read the value it wrote to ensure
+the name and value are as expected.
+
+`Check Storage Access` will warn if any operation takes longer than 100ms, and error out if the
+entire check takes longer than 30s.
+
+#### Check Service Discovery / Check Consul Service Discovery TLS
+
+`Check Consul Service Discovery TLS` verifies TLS information included in the service discovery
+ stanza if the storage type is consul. If a certificate chain is provided, Diagnose parses
+ the root, intermediate, and leaf certificates, and checks each one for correctness.
+
+#### Check Service Discovery / Check Consul Direct Service Discovery
+
+`Check Consul Direct Service Discovery` is a Consul-specific check that ensures Vault
+is not accessing the Consul server directly, but rather through a local agent.
+
+#### Create Vault Server Configuration Seals
+
+`Create Vault Server Configuration Seals` creates seals from the Vault configuration
+stanza and verifies they can be initialized and finalized.
+
+#### Check Transit Seal TLS
+
+`Check Transit Seal TLS` checks the TLS client certificate, key, and CA certificate
+provided in a transit seal stanza (if one exists) for correctness.
+
+#### Create Core Configuration / Initialize Randomness for Core
+
+`Initialize Randomness for Core` ensures that Vault has access to the randReader that
+the Vault core uses.
+
+#### HA Storage
+
+This check and any nested checks will be the same as the `Check Storage` checks.
+The only difference is that the checks here will be run on whatever is specified in the
+`ha_storage` section of the Vault configuration, as opposed to the `storage` section.
+
+#### Determine Redirect Address
+
+Ensures that one of the `VAULT_API_ADDR`, `VAULT_REDIRECT_ADDR`, or `VAULT_ADVERTISE_ADDR`
+environment variables is set, or that the redirect address is specified in the Vault
+configuration.
+
+#### Check Cluster Address
+
+Parses the cluster address from the `VAULT_CLUSTER_ADDR` environment variable, or from the
+redirect address or cluster address specified in the Vault configuration, and checks that
+the address is of the form `host:port`.
+
+#### Check Core Creation
+
+`Check Core Creation` performs the same logical configuration checks that Vault runs when it
+creates a core object. These are runtime checks, meaning any errors thrown by this diagnose
+test will also be thrown by the Vault server itself when it is run.
+
+#### Check For Autoloaded License
+
+`Check For Autoloaded License` is an enterprise diagnose check, which verifies that Vault
+has access to a valid autoloaded license that will not expire in the next 30 days.
+
+#### Start Listeners / Check Listener TLS
+
+`Check Listener TLS` verifies the server certificate file and key are valid and matching.
+It also checks the client CA file, if one is provided, for a valid certificate, and performs
+the standard runtime listener checks on the listener configuration stanza, such as verifying
+that the minimum and maximum TLS versions are within the bounds of what vault supports.
+
+Like all the other Diagnose TLS checks, it will warn if any of the certificates provided are
+set to expire within the next month.
+
+#### Start Listeners / Create Listeners
+
+`Create Listeners` uses the listener configuration to initialize the listeners, erroring with
+a server error if anything goes wrong.
+
+#### Check Autounseal Encryption
+
+`Check Autounseal Encryption` will initialize the barrier using the seal stanza, if the seal
+type is not a shamir seal, and use it to encrypt and decrypt a dud value.
+
+#### Check Server Before Runtime
+
+`Check Server Before Runtime` achieves parity with the server run command, running through
+the runtime code checks before the server is initialized to ensure that nothing fails.
+This check will never fail without another diagnose check failing.
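
The bubbling rule documented in the Output Layout section (a failure outranks a warning, and a warning outranks a success) can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the diagnose implementation: the `Status` and `Check` types and the `Effective` and `Print` methods are hypothetical names introduced here, and the printed layout only approximates the nested, indented CLI output described above.

```go
package main

import "fmt"

// Status is a hypothetical severity level for a diagnose check.
// The ordering matters: a higher value outranks a lower one.
type Status int

const (
	Success Status = iota
	Warning
	Failure
)

// String renders the status using the prefixes listed in the docs.
func (s Status) String() string {
	switch s {
	case Warning:
		return "[ warning ]"
	case Failure:
		return "[ failure ]"
	default:
		return "[ success ]"
	}
}

// Check is a hypothetical node in the tree of nested checks.
type Check struct {
	Name     string
	Own      Status // status of this check before bubbling
	Children []*Check
}

// Effective returns the status to display for c: its own status or the
// most severe status among its descendants, whichever is higher.
func (c *Check) Effective() Status {
	result := c.Own
	for _, child := range c.Children {
		if s := child.Effective(); s > result {
			result = s
		}
	}
	return result
}

// Print writes c and its children, indenting nested checks.
func (c *Check) Print(indent string) {
	fmt.Printf("%s%s %s\n", indent, c.Effective(), c.Name)
	for _, child := range c.Children {
		child.Print(indent + "  ")
	}
}

func main() {
	root := &Check{Name: "Vault Diagnose", Children: []*Check{
		{Name: "Parse Configuration", Own: Warning},
		{Name: "Check Storage", Children: []*Check{
			{Name: "Create Storage Backend"},
			{Name: "Check Consul TLS", Own: Failure},
		}},
	}}
	root.Print("")
}
```

In this sketch the failing `Check Consul TLS` node raises both `Check Storage` and the top-level `Vault Diagnose` node to `[ failure ]`, while `Create Storage Backend` keeps its own `[ success ]` prefix.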