Move conflict resolution implementation into the state store with a new Apply RPC.
This also makes the RPC for secure variables much more similar to Consul's KV,
which will help us support soft deletes in a post-1.4.0 version of Nomad.
Reimplement quotas in the state store functions.
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
This PR changes the use of structs.ConsulMeshGateway to value types
instead of via pointers. This will help in a follow up PR where we
cleanup a lot of custom comparison code with helper functions instead.
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
Move the secure variables quota enforcement calls into the state store to ensure
quota checks are atomic with quota updates (in the same transaction).
Switch to a machine-size int instead of a uint64 for quota tracking. The
ENT-side quota spec is described as int, and negative values have a meaning as
"not permitted at all". Using the same type for tracking will make it easier to
the math around checks, and uint64 is infeasibly large anyways.
Add secure vars to quota HTTP API and CLI outputs and API docs.
This test is a fairly trivial test of the agent RPC, but the test setup waits
for a short fixed window after the node starts to send the RPC. After looking at
detailed logs for recent test failures, it looks like the node registration for
the first node doesn't get a chance to happen before we make the RPC call. Use
`WaitForResultUntil` to give the test more time to run in slower test
environments, while allowing it to finish quickly if possible.
The search RPC used a placeholder policy for searching within the secure
variables context. Now that we have ACL policies built for secure variables, we
can use them for search. Requires a new loose policy for checking if a token has
any secure variables access within a namespace, so that we can filter on
specific paths in the iterator.
Most of our objects use int64 timestamps derived from `UnixNano()` instead of
`time.Time` objects. Switch the keyring metadata to use `UnixNano()` for
consistency across the API.
Document the secure variables keyring commands, document the aliased
gossip keyring commands, and note that the old gossip keyring commands
are deprecated.
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.
Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.
Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections
exceeding concurrency limit.
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.
As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.
In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.
While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.
To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.
Currently only HTTP and TCP checks are supported, though more types
could be added later.
Fixes#13505
This fixes#13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).
As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:
Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
This commit adds configuration parameters to control ACL token
expirations. This includes both limits on the min and max TTL
expiration values, as well as a GC threshold for expired tokens.
When the `Full` flag is passed for key rotation, we kick off a core
job to decrypt and re-encrypt all the secure variables so that they
use the new key.
* SV: CAS
* Implement Check and Set for Delete and Upsert
* Reading the conflict from the state store
* Update endpoint for new error text
* Updated HTTP api tests
* Conflicts to the HTTP api
* SV: structs: Update SV time to UnixNanos
* update mock to UnixNano; refactor
* SV: encrypter: quote KeyID in error
* SV: mock: add mock for namespace w/ SV
Move all the gossip keyring and key generation commands under
`operator gossip keyring` subcommands to align with the new `operator
secure-variables keyring` subcommands. Deprecate the `operator keyring`
and `operator keygen` commands.
* Add Path only index for SecureVariables
* Add GetSecureVariablesByPrefix; refactor tests
* Add search for SecureVariables
* Add prefix search for secure variables
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.
* Make Encrypt function return KeyID
* Split SecureVariable
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Split the flag options for the `secure-variables keyring` into their
own subcommands. The gossip keyring CLI will be similarly refactored
and the old version will be deprecated.
After internal design review, we decided to remove exposing algorithm
choice to the end-user for the initial release. We'll solve nonce
rotation by forcing rotations automatically on key GC (in a core job,
not included in this changeset). Default to AES-256 GCM for the
following criteria:
* faster implementation when hardware acceleration is available
* FIPS compliant
* implementation in pure go
* post-quantum resistance
Also fixed a bug in the decoding from keystore and switched to a
harder-to-misuse encoding method.
When a server becomes leader, it will check if there are any keys in
the state store, and create one if there is not. The key metadata will
be replicated via raft to all followers, who will then get the key
material via key replication (not implemented in this changeset).
This changeset implements the keystore serialization/deserialization:
* Adds a JSON serialization extension for the `RootKey` struct, along with a metadata stub. When we serialize RootKey to the on-disk keystore, we want to base64 encode the key material but also exclude any frequently-changing fields which are stored in raft.
* Implements methods for loading/saving keys to the keystore.
* Implements methods for restoring the whole keystore from disk.
* Wires it all up with the `Keyring` RPC handlers and fixes up any fallout on tests.
Stream snapshot to FSM when restoring from archive
The `RestoreFromArchive` helper decompresses the snapshot archive to a
temporary file before reading it into the FSM. For large snapshots
this performs a lot of disk IO. Stream decompress the snapshot as we
read it, without first writing to a temporary file.
Add bexpr filters to the `RestoreFromArchive` helper.
The operator can pass these as `-filter` arguments to `nomad operator
snapshot state` (and other commands in the future) to include only
desired data when reading the snapshot.
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.
Include the eval namespace in all output formats and always include the
job ID in `eval status` since, even `node-update` evals are related to a
job.
Add Node ID to the evals table output to help differentiate
`node-update` evals.
Co-authored-by: James Rasell <jrasell@hashicorp.com>
The `operator debug` command doesn't output the leader anywhere in the
output, which adds extra burden to offline debugging (away from an
ongoing incident where you can simply check manually). Query the
`/v1/status/leader` API but degrade gracefully.
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.
The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).
Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.
Example,
curl 'localhost:4646/v1/service/redis?choose=2|abc123'
Note: consul-templte v0.29.1 includes the necessary nomadServices functionality.
Fix numerous go-getter security issues:
- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
retries and DoS opportunities.
**Approach**
Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.
The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.
In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
Closes#12927Closes#12958
This PR updates the version of redis used in our examples from 3.2 to 7.
The old version is very not supported anymore, and we should be setting
a good example by using a supported version.
The long-form example job is now fixed so that the service stanza uses
nomad as the service discovery provider, and so now the job runs without
a requirement of having Consul running and configured.
* test: use `T.TempDir` to create temporary test directory
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix TestLogmon_Start_restart on Windows
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestConsul_Integration
t.TempDir fails to perform the cleanup properly because the folder is
still in use
testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.
The address can be an IP address or domain name. If the `address` field is
set, the `service.address_mode` must be set in `auto` mode.
* cli: add -json flag to support job commands
While the CLI has always supported running JSON jobs, its support has
been via HCLv2's JSON parsing. I have no idea what format it expects the
job to be in, but it's absolutely not in the same format as the API
expects.
So I ignored that and added a new -json flag to explicitly support *API*
style JSON jobspecs.
The jobspecs can even have the wrapping {"Job": {...}} envelope or not!
* docs: fix example for `nomad job validate`
We haven't been able to validate inside driver config stanzas ever since
the move to task driver plugins. 😭
The new `namespace apply` feature that allows for passing a namespace
specification file detects the difference between an empty namespace
and a namespace specification by checking if the file exists. For most
cases, the file will have an extension like `.hcl` and so there's
little danger that a user will apply a file spec when they intended to
apply a file name.
But because directory names typically don't include an extension,
you're much more likely to collide when trying to `namespace apply` by
name only, and then you get a confusing error message of the form:
Failed to read file: read $namespace: is a directory
Detect the case where the namespace name collides with a directory in
the current working directory, and skip trying to load the directory.
* Add os to NodeListStub struct.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
* Add test: os as a query param to /v1/nodes.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
The CSI HTTP API has to transform the CSI volume to redact secrets,
remove the claims fields, and to consolidate the allocation stubs into
a single slice of alloc stubs. This was done manually in #8590 but
this is a large amount of code and has proven both very bug prone
(see #8659, #8666, #8699, #8735, and #12150) and requires updating
lots of code every time we add a field to volumes or plugins.
In #10202 we introduce encoding improvements for the `Node` struct
that allow a more minimal transformation. Apply this same approach to
serializing `structs.CSIVolume` to API responses.
Also, the original reasoning behind #8590 for plugins no longer holds
because the counts are now denormalized within the state store, so we
can simply remove this transformation entirely.
The API for `CSIVolume.List` sorts by created index and not by ID,
which breaks the logic for prefix matching in the `volume status`
output when the prefix is also an exact match. Ensure that we're
handling this case correctly.
The Nomad client's `csi_hook` interpolates the alloc suffix with the
volume request's name for CSI volumes with `per_alloc = true`, turning
`example` into `example[1]`. We need to do this same behavior in the
`alloc status` output so that we show the correct volume.
This PR expands on the work done in #12543 to
- prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags
- merge into pre-existing envoy_stats_tags fields
- update the upgrade guide docs
- update changelog