Commit graph

3317 commits

Author SHA1 Message Date
Seth Hoenig 88a1353149 cli: display nomad service check status output in CLI commands
This PR adds some NSD check status output to the CLI.

1. The 'nomad alloc status' command produces nsd check summary output (if present)
2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)
2022-08-19 09:18:29 -05:00
Michael Schurter 3b57df33e3
client: fix data races in config handling (#14139)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.

This change takes the following approach to safely handling configs in the client:

1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
2022-08-18 16:32:04 -07:00
Seth Hoenig c5d36eaa2f cleanup: fixing warnings and refactoring of command package, part 2
This PR continues the cleanup of the command package, removing linter
warnings, refactoring to use helpers, making tests easier to read, etc.
2022-08-18 09:43:39 -05:00
Seth Hoenig 4c1a0d4907 cleanup: first pass at fixing command package warnings
This PR is the first of several for cleaning up warnings, and refactoring
bits of code in the command package. First pass is over acl_ files and
gets some helpers in place.
2022-08-17 15:33:37 -05:00
Piotr Kazmierczak b63944b5c1
cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig 7728cf5a9a
Merge pull request #14132 from hashicorp/build-update-go1.19
build: update to go1.19
2022-08-16 11:20:27 -05:00
Seth Hoenig b3ea68948b build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Seth Hoenig 56b0b456dc
Merge pull request #14102 from hashicorp/cleanup-mesh-gateway-value
cleanup: consul mesh gateway type need not be pointer
2022-08-16 10:07:16 -05:00
Charlie Voiselle dba6b39815
SV CLI: var init (#13820)
* Nomad dep: add museli/reflow
* SV CLI: var init
2022-08-15 13:43:29 -04:00
Tim Gross 4005759d28
move secure variable conflict resolution to state store (#13922)
Move conflict resolution implementation into the state store with a new Apply RPC. 
This also makes the RPC for secure variables much more similar to Consul's KV, 
which will help us support soft deletes in a post-1.4.0 version of Nomad.

Reimplement quotas in the state store functions.

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2022-08-15 11:19:53 -04:00
Seth Hoenig f9355c29fb cleanup: consul mesh gateway type need not be pointer
This PR changes the use of structs.ConsulMeshGateway to value types
instead of via pointers. This will help in a follow up PR where we
cleanup a lot of custom comparison code with helper functions instead.
2022-08-13 11:26:58 -05:00
Seth Hoenig ba5c45ab93 cli: respect vault token in plan command
This PR fixes a regression where the 'job plan' command would not respect
a Vault token if set via --vault-token or $VAULT_TOKEN.

Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062

Fixes https://github.com/hashicorp/nomad/issues/13939
2022-08-11 08:54:08 -05:00
Seth Hoenig 1901cfaba8
Merge pull request #14069 from brian-athinkingape/cli-fix-memstats-cgroupsv2
cli: for systems with cgroups v2, fix alloation resource utilization showing 0 memory used
2022-08-11 07:27:48 -05:00
Luiz Aoqui 815adbada5
Post 1.3.3 release (#14064)
* Generate files for 1.3.3 release

* Prepare for next release

* Merge release 1.3.3 files

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2022-08-09 17:27:29 -04:00
Brian Chau 6621bb9db5 cli: for systems with cgroups v2, fix alloation resource utilization showing 0 memory used 2022-08-09 14:09:14 -07:00
Derek Strickland 77df9c133b
Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
Piotr Kazmierczak 530280505f
client: enable specifying user/group permissions in the template stanza (#13755)
* Adds Uid/Gid parameters to template.

* Updated diff_test

* fixed order

* update jobspec and api

* removed obsolete code

* helper functions for jobspec parse test

* updated documentation

* adjusted API jobs test.

* propagate uid/gid setting to job_endpoint

* adjusted job_endpoint tests

* making uid/gid into pointers

* refactor

* updated documentation

* updated documentation

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update website/content/api-docs/json-jobs.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* propagating documentation change from Luiz

* formatting

* changelog entry

* changed changelog entry

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-08-02 22:15:38 +02:00
James Rasell bb5b510c9d
cli: do not import structs, use API package only. (#13938) 2022-08-02 16:33:08 +02:00
Eric Weber cbce13c1ac
Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919)
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
2022-08-02 09:42:44 -04:00
Tim Gross e5ac6464f6
secure vars: enforce ENT quotas (OSS work) (#13951)
Move the secure variables quota enforcement calls into the state store to ensure
quota checks are atomic with quota updates (in the same transaction).

Switch to a machine-size int instead of a uint64 for quota tracking. The
ENT-side quota spec is described as int, and negative values have a meaning as
"not permitted at all". Using the same type for tracking will make it easier to
the math around checks, and uint64 is infeasibly large anyways.

Add secure vars to quota HTTP API and CLI outputs and API docs.
2022-08-02 09:32:09 -04:00
Tim Gross 8404f998f7
fix flaky TestAgent_ProxyRPC_Dev test (#13925)
This test is a fairly trivial test of the agent RPC, but the test setup waits
for a short fixed window after the node starts to send the RPC. After looking at
detailed logs for recent test failures, it looks like the node registration for
the first node doesn't get a chance to happen before we make the RPC call. Use
`WaitForResultUntil` to give the test more time to run in slower test
environments, while allowing it to finish quickly if possible.
2022-07-28 14:47:15 -04:00
Lars Lehtonen a80df0480e
testing: fix dropped test errors in command/agent (#13926) 2022-07-28 11:04:31 -04:00
Seth Hoenig d8fe1d10ba cleanup: use constants for on_update values 2022-07-21 13:09:47 -05:00
Seth Hoenig c61e779b48
Merge pull request #13715 from hashicorp/dev-nsd-checks
client: add support for checks in nomad services
2022-07-21 10:22:57 -05:00
Seth Hoenig 67c6336c67
Merge pull request #13870 from hashicorp/exp-fp-optimization
client: use test timeouts for network fingerprinters in dev mode
2022-07-21 08:18:02 -05:00
Tim Gross 9c43c28575
search: use secure vars ACL policy for secure vars context (#13788)
The search RPC used a placeholder policy for searching within the secure
variables context. Now that we have ACL policies built for secure variables, we
can use them for search. Requires a new loose policy for checking if a token has
any secure variables access within a namespace, so that we can filter on
specific paths in the iterator.
2022-07-21 08:39:36 -04:00
Seth Hoenig 6f93aca63e devmode: use minimal network timeouts for network fingerprinters in dev mode 2022-07-20 15:13:14 -05:00
Tim Gross 97a6346da0
keyring: use nanos for CreateTime in key metadata (#13849)
Most of our objects use int64 timestamps derived from `UnixNano()` instead of
`time.Time` objects. Switch the keyring metadata to use `UnixNano()` for
consistency across the API.
2022-07-20 14:46:57 -04:00
Tim Gross 96aea74b4b
docs: keyring commands (#13690)
Document the secure variables keyring commands, document the aliased
gossip keyring commands, and note that the old gossip keyring commands
are deprecated.
2022-07-20 14:14:10 -04:00
Will Jordan 5354409b1a
Return 429 response on HTTP max connection limit (#13621)
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.

Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.

Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections
exceeding concurrency limit.
2022-07-20 14:12:21 -04:00
hc-github-team-nomad-core fa09c13016
Generate files for 1.3.2 release 2022-07-13 19:33:41 -04:00
Michael Schurter d54d90edfa
http: only log alloc/exec errors when non-nil (#13730) 2022-07-13 09:44:51 -07:00
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Seth Hoenig 297d386bdc client: add support for checks in nomad services
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.

Currently only HTTP and TCP checks are supported, though more types
could be added later.
2022-07-12 17:09:50 -05:00
Michael Schurter 3e50f72fad
core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Charlie Voiselle 6be7a41351
SV: CLI: var list command (#13707)
* SV CLI: var list
* Fix wildcard prefix filtering

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-12 12:49:39 -04:00
Tim Gross a5a9eedc81 core job for secure variables re-key (#13440)
When the `Full` flag is passed for key rotation, we kick off a core
job to decrypt and re-encrypt all the secure variables so that they
use the new key.
2022-07-11 13:34:06 -04:00
Charlie Voiselle 555ac432cd SV: CAS: Implement Check and Set for Delete and Upsert (#13429)
* SV: CAS
    * Implement Check and Set for Delete and Upsert
    * Reading the conflict from the state store
    * Update endpoint for new error text
    * Updated HTTP api tests
    * Conflicts to the HTTP api

* SV: structs: Update SV time to UnixNanos
    * update mock to UnixNano; refactor

* SV: encrypter: quote KeyID in error
* SV: mock: add mock for namespace w/ SV
2022-07-11 13:34:06 -04:00
Tim Gross eaf430bfd5 secure variable server configuration (#13307)
Add fields for configuring root key garbage collection and automatic
rotation. Fix the keystore path so that we write to a tempdir when in
dev mode.
2022-07-11 13:34:06 -04:00
Tim Gross 4d011d4c53 move gossip keyring command to their own subcommands (#13383)
Move all the gossip keyring and key generation commands under
`operator gossip keyring` subcommands to align with the new `operator
secure-variables keyring` subcommands. Deprecate the `operator keyring`
and `operator keygen` commands.
2022-07-11 13:34:06 -04:00
Charlie Voiselle 1fe080c6de Implement HTTP search API for Variables (#13257)
* Add Path only index for SecureVariables
* Add GetSecureVariablesByPrefix; refactor tests
* Add search for SecureVariables
* Add prefix search for secure variables
2022-07-11 13:34:05 -04:00
Charlie Voiselle 06c6a950c4 Secure Variables: Seperate Encrypted and Decrypted structs (#13355)
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.

* Make Encrypt function return KeyID
* Split SecureVariable

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:05 -04:00
Tim Gross 56deb6f8cc keyring CLI: refactor to use subcommands (#13351)
Split the flag options for the `secure-variables keyring` into their
own subcommands. The gossip keyring CLI will be similarly refactored
and the old version will be deprecated.
2022-07-11 13:34:05 -04:00
Tim Gross 81b0c4fd36 keyring command line (#13169)
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2022-07-11 13:34:04 -04:00
Charlie Voiselle 619e0cbafd Don't write a SecureVariable with no Items (#13258) 2022-07-11 13:34:04 -04:00
Tim Gross 5a85d96322 remove end-user algorithm selection (#13190)
After internal design review, we decided to remove exposing algorithm
choice to the end-user for the initial release. We'll solve nonce
rotation by forcing rotations automatically on key GC (in a core job,
not included in this changeset). Default to AES-256 GCM for the
following criteria:

* faster implementation when hardware acceleration is available
* FIPS compliant
* implementation in pure go
* post-quantum resistance

Also fixed a bug in the decoding from keystore and switched to a 
harder-to-misuse encoding method.
2022-07-11 13:34:04 -04:00
Tim Gross f2ee585830 bootstrap keyring (#13124)
When a server becomes leader, it will check if there are any keys in
the state store, and create one if there is not. The key metadata will
be replicated via raft to all followers, who will then get the key
material via key replication (not implemented in this changeset).
2022-07-11 13:34:04 -04:00
Charlie Voiselle 3717688f3e Secure Variables: Variables - State store, FSM, RPC (#13098)
* Secure Variables: State Store
* Secure Variables: FSM
* Secure Variables: RPC
* Secure Variables: HTTP API

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:04 -04:00
Tim Gross 05eef2b95c keystore serialization (#13106)
This changeset implements the keystore serialization/deserialization:

* Adds a JSON serialization extension for the `RootKey` struct, along with a metadata stub. When we serialize RootKey to the on-disk keystore, we want to base64 encode the key material but also exclude any frequently-changing fields which are stored in raft.
* Implements methods for loading/saving keys to the keystore.
* Implements methods for restoring the whole keystore from disk.
* Wires it all up with the `Keyring` RPC handlers and fixes up any fallout on tests.
2022-07-11 13:34:04 -04:00
Tim Gross c6929a6c1e keyring HTTP API (#13077) 2022-07-11 13:34:04 -04:00