* keyring: update handle to state inside replication loop
When keyring replication starts, we take a handle to the state store. But
whenever a snapshot is restored, this handle is invalidated and no longer points
to a state store that is receiving new keys. This leaks a bunch of memory too!
In addition to operator-initiated restores, when fresh servers are added to
existing clusters with large-enough state, the keyring replication can get
started quickly enough that it's running before the snapshot from the existing
clusters have been restored.
Fix this by updating the handle to the state store on each pass.
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.
Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
that returns an error, to make the workflow more clear and reduce nesting. Log
the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
on initializing the keyring, so there's a race condition in the keyring tests
where we can test for the existence of the root key before the keyring has
been initialize. Change this to an "eventually" test.
But these fixes aren't enough to fix#14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
* keyring: don't unblock early if rate limit burst exceeded
The rate limiter returns an error and unblocks early if its burst limit is
exceeded (unless the burst limit is Inf). Ensure we're not unblocking early,
otherwise we'll only slow down the cases where we're already pausing to make
external RPC requests.
* keyring: set MinQueryIndex on stale queries
When keyring replication makes a stale query to non-leader peers to find a key
the leader doesn't have, we need to make sure the peer we're querying has had a
chance to catch up to the most current index for that key. Otherwise it's
possible for newly-added servers to query another newly-added server and get a
non-error nil response for that key ID.
Ensure that we're setting the correct reply index in the blocking query.
Note that the "not found" case does not return an error, just an empty key. So
as a belt-and-suspenders, update the handling of empty responses so that we
don't break the loop early if we hit a server that doesn't have the key.
* test for adding new servers to keyring
* leader: initialize keyring after we have consistent reads
Wait until we're sure the FSM is current before we try to initialize the
keyring.
Also, if a key is rotated immediately following a leader election, plans that
are in-flight may get signed before the new leader has the key. Allow for a
short timeout-and-retry to avoid rejecting plans
Update the on-disk format for the root key so that it's wrapped with a unique
per-key/per-server key encryption key. This is a bit of security theatre for the
current implementation, but it uses `go-kms-wrapping` as the interface for
wrapping the key. This provides a shim for future support of external KMS such
as cloud provider APIs or Vault transit encryption.
* Removes the JSON serialization extension we had on the `RootKey` struct; this
struct is now only used for key replication and not for disk serialization, so
we don't need this helper.
* Creates a helper for generating cryptographically random slices of bytes that
properly accounts for short reads from the source.
* No observable functional changes outside of the on-disk format, so there are
no test updates.
Workload identities grant implicit access to policies, and operators
will not want to craft separate policies for each invocation of a
periodic or dispatch job. Use the parent job's ID as the JobID claim.
The test for simulating a key rotation across leader elections was
flaky because we weren't waiting for a leader election and was
checking the server configs rather than raft for which server was
currently the leader. Fixing the flake revealed a bug in the test that
we weren't ensuring the new leader was running its own replication, so
it wouldn't pick up the key material from the previous follower.
The `Encrypt` method generates an appropriately-sized nonce and uses
that buffer as the prefix for the ciphertext. This keeps the
ciphertext and nonce together for decryption, and reuses the buffer as
much as possible without presenting the temptation to reuse the
cleartext buffer owned by the caller.
We include the key ID as the "additional data" field that's used as an
extra input to the authentication signature, to provide additional
protection that a ciphertext originated with that key.
Refactors the locking for the keyring so that the public methods are
generally (with one commented exception) responsible for taking the
lock and then inner methods are assumed locked.
In order to support implicit ACL policies for tasks to get their own
secrets, each task would need to have its own ACL token. This would
add extra raft overhead as well as new garbage collection jobs for
cleaning up task-specific ACL tokens. Instead, Nomad will create a
workload Identity Claim for each task.
An Identity Claim is a JSON Web Token (JWT) signed by the server’s
private key and attached to an Allocation at the time a plan is
applied. The encoded JWT can be submitted as the X-Nomad-Token header
to replace ACL token secret IDs for the RPCs that support identity
claims.
Whenever a key is is added to a server’s keyring, it will use the key
as the seed for a Ed25519 public-private private keypair. That keypair
will be used for signing the JWT and for verifying the JWT.
This implementation is a ruthlessly minimal approach to support the
secure variables feature. When a JWT is verified, the allocation ID
will be checked against the Nomad state store, and non-existent or
terminal allocation IDs will cause the validation to be rejected. This
is sufficient to support the secure variables feature at launch
without requiring implementation of a background process to renew
soon-to-expire tokens.
Replication for the secure variables keyring. Because only key
metadata is stored in raft, we need to distribute key material
out-of-band from raft replication. A goroutine runs on each server and
watches for changes to the `RootKeyMeta`. When a new key is received,
attempt to fetch the key from the leader. If the leader doesn't have
the key (which may happen if a key is rotated right before a leader
transition), try to get the key from any peer.
After internal design review, we decided to remove exposing algorithm
choice to the end-user for the initial release. We'll solve nonce
rotation by forcing rotations automatically on key GC (in a core job,
not included in this changeset). Default to AES-256 GCM for the
following criteria:
* faster implementation when hardware acceleration is available
* FIPS compliant
* implementation in pure go
* post-quantum resistance
Also fixed a bug in the decoding from keystore and switched to a
harder-to-misuse encoding method.
When a server becomes leader, it will check if there are any keys in
the state store, and create one if there is not. The key metadata will
be replicated via raft to all followers, who will then get the key
material via key replication (not implemented in this changeset).
This changeset implements the keystore serialization/deserialization:
* Adds a JSON serialization extension for the `RootKey` struct, along with a metadata stub. When we serialize RootKey to the on-disk keystore, we want to base64 encode the key material but also exclude any frequently-changing fields which are stored in raft.
* Implements methods for loading/saving keys to the keystore.
* Implements methods for restoring the whole keystore from disk.
* Wires it all up with the `Keyring` RPC handlers and fixes up any fallout on tests.