open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	e8a361310f	node pools: replicate from authoritative region (#17456 ) Upserts and deletes of node pools are forwarded to the authoritative region, just like we do for namespaces, quotas, ACL policies, etc. Replicate node pools from the authoritative region.	2023-06-12 13:24:24 -04:00
Tim Gross	bb7f0edd6a	node pools: prevent panic on upsert during upgrades (#17474 ) Whenever we write a Raft log entry for node pools, we need to first make sure that all servers can safely apply the log without panicking. Gate upsert and delete RPCs on all servers being upgraded to the minimum version.	2023-06-12 09:01:30 -04:00
Tim Gross	e3a37c0b97	replication: fix potential panic during upgrades (#17476 ) If the authoritative region has been upgraded to a version of Nomad that has new replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the non-authoritative regions will start replicating those objects as soon as their leader is upgraded. If a server in the non-authoritative region is upgraded and then becomes the leader before all the other servers in the region have been upgraded, then it will attempt to write a Raft log entry that the followers don't understand. The followers will then panic. Add the same minimum version checks that we do for RPC writes to the leader's replication loop.	2023-06-12 08:53:56 -04:00
Piotr Kazmierczak	dea8b1a093	acl: bump JWT auth gate to 1.5.4 (#16838 )	2023-04-11 10:07:45 +02:00
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Piotr Kazmierczak	1a5eba24a6	acl: set minACLJWTAuthMethodVersion to 1.5.3 and adjust code comment	2023-03-30 15:30:42 +02:00
Piotr Kazmierczak	d98c8f6759	acl: rebased on main and changed the gate to 1.5.3-dev	2023-03-30 09:40:12 +02:00
Piotr Kazmierczak	2b353902a1	acl: HTTP endpoints for JWT auth (#16519 )	2023-03-30 09:39:56 +02:00
Piotr Kazmierczak	e48c48e89b	acl: RPC endpoints for JWT auth (#15918 )	2023-03-30 09:39:56 +02:00
Juana De La Cuesta	320884b8ee	Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583 ) Fixes #11052 When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children. * style: refactor force run function * fix: remove defer and inline unlock for speed optimization * Update nomad/leader.go * Update nomad/leader_test.go * style: refactor tests to use must * fix: move back from defer to calling unlock before returning. createEval can't be called with the lock on * style: refactor test to use must * added new entry to changelog and update comments --------- Co-authored-by: James Rasell <jrasell@hashicorp.com> Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-03-27 17:25:05 +02:00
Juana De La Cuesta	21b675244e	style: rename ForceRun to ForceEval, for clarity (#16617 )	2023-03-27 15:38:48 +02:00
Tim Gross	3c0eaba9db	remove backcompat support for non-atomic job registration (#16305 ) In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the new eval was written in the same raft entry. Backwards-compatibility checks were supposed to have been removed in Nomad 1.1.0, but we missed that. This is long safe to remove.	2023-03-03 15:52:22 -05:00
Tim Gross	65c7e149d3	eval broker: use write lock when reaping cancelable evals (#16112 ) The eval broker's `Cancelable` method used by the cancelable eval reaper mutates the slice of cancelable evals by removing a batch at a time from the slice. But this method unsafely uses a read lock despite this mutation. Under normal workloads this is likely to be safe but when the eval broker is under the heavy load this feature is intended to fix, we're likely to have a race condition. Switch this to a write lock, like the other locks that mutate the eval broker state. This changeset also adjusts the timeout to allow poorly-sized Actions runners more time to schedule the appropriate goroutines. The test has also been updated to use `shoenig/test/wait` so we can have sensible reporting of the results rather than just a timeout error when things go wrong.	2023-02-10 10:40:41 -05:00
James Rasell	53e0f424e9	rpc: bump SSO feature gate version. (#16080 )	2023-02-07 15:45:07 -08:00
James Rasell	b8aa53d09f	core: add ACL binding rule to replication system. (#15555 ) ACL binding rule create and deletes are always forwarded to the authoritative region. In order to make these available in federated regions, the leaders in these regions need to replicate from the authoritative.	2022-12-16 09:08:00 +01:00
James Rasell	95c9ffa505	ACL: add ACL binding rule RPC and HTTP API handlers. (#15529 ) This change adds the RPC ACL binding rule handlers. These handlers are responsible for the creation, updating, reading, and deletion of binding rules. The write handlers are feature gated so that they can only be used when all federated servers are running the required version. The HTTP API handlers and API SDK have also been added where required. This allows the endpoints to be called from the API by users and clients.	2022-12-15 09:18:55 +01:00
James Rasell	726d419da1	acl: replicate auth-methods from federated cluster leaders. (#15366 )	2022-11-28 09:20:24 +01:00
Piotr Kazmierczak	bb66b5e770	acl: sso auth method RPC endpoints (#15221 ) This PR implements RPC endpoints for SSO auth methods. This PR is part of the SSO work captured under ☂️ ticket #13120.	2022-11-21 10:15:39 +01:00
Tim Gross	6415fb4284	eval broker: shed all but one blocked eval per job after ack (#14621 ) When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the `waitIndex` set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index. Move all but the highest priority / highest `ModifyIndex` blocked eval into a canceled set. When the `Eval.Ack` RPC returns from the eval broker it will signal a reap of a batch of cancelable evals to write to raft. This paces the cancelations limited by how frequently the schedulers are acknowledging evals; this should reduce the risk of cancelations from overwhelming raft relative to scheduler progress. In order to avoid straggling batches when the cluster is quiet, we also include a periodic sweep through the cancelable list.	2022-11-16 16:10:11 -05:00
Tim Gross	3a811ac5e7	keyring: fixes for keyring replication on cluster join (#14987 ) * keyring: don't unblock early if rate limit burst exceeded The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests. * keyring: set MinQueryIndex on stale queries When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID. Ensure that we're setting the correct reply index in the blocking query. Note that the "not found" case does not return an error, just an empty key. So as a belt-and-suspenders, update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. * test for adding new servers to keyring * leader: initialize keyring after we have consistent reads Wait until we're sure the FSM is current before we try to initialize the keyring. Also, if a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans	2022-10-21 12:33:16 -04:00
James Rasell	8e25048f3d	acl: gate ACL role write and delete RPC usage on v1.4.0 or greater. (#14908 )	2022-10-18 16:46:11 +02:00
James Rasell	9923f9e6f3	nsd: gate registration write & delete RPC use on v1.3.0 or greater. (#14924 )	2022-10-18 15:30:28 +02:00
Tim Gross	3c78980b78	make version checks specific to region (1.4.x) (#14912 ) * One-time tokens are not replicated between regions, so we don't want to enforce that the version check across all of serf, just members in the same region. * Scheduler: Disconnected clients handling is specific to a single region, so we don't want to enforce that the version check across all of serf, just members in the same region. * Variables: enforce version check in Apply RPC * Cleans up a bunch of legacy checks. This changeset is specific to 1.4.x and the changes for previous versions of Nomad will be manually backported in a separate PR.	2022-10-17 16:23:51 -04:00
Tim Gross	c721ce618e	keyring: filter by region before checking version (#14901 ) In #14821 we fixed a panic that can happen if a leadership election happens in the middle of an upgrade. That fix checks that all servers are at the minimum version before initializing the keyring (which blocks evaluation processing during the upgrade). But the check we implemented is over the serf membership, which includes servers in any federated regions, which don't necessarily have the same upgrade cycle. Filter the version check by the leader's region. Also bump up log levels of major keyring operations	2022-10-17 13:21:16 -04:00
Tim Gross	80ec5e1346	fix panic from keyring raft entries being written during upgrade (#14821 ) During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader before one of the 1.3.x servers, the old server will crash because the keyring is initialized and writes a raft entry. Wait until all members are on a version that supports the keyring before initializing it.	2022-10-06 12:47:02 -04:00
Tim Gross	7921f044e5	migrate autopilot implementation to raft-autopilot (#14441 ) Nomad's original autopilot was importing from a private package in Consul. It has been moved out to a shared library. Switch Nomad to use this library so that we can eliminate the import of Consul, which is necessary to build Nomad ENT with the current version of the Consul SDK. This also will let us pick up autopilot improvements shared with Consul more easily.	2022-09-01 14:27:10 -04:00
James Rasell	755b4745ed	Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main	2022-08-30 08:59:13 +01:00
Tim Gross	1dc053b917	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
James Rasell	2736cf0dfa	acl: make listing RPC and HTTP API a stub return object. (#14211 ) Making the ACL Role listing return object a stub future-proofs the endpoint. In the event the role object grows, we are not bound by having to return all fields within the list endpoint or change the signature of the endpoint to reduce the list return size.	2022-08-22 17:20:23 +02:00
James Rasell	802d005ef5	acl: add replication to ACL Roles from authoritative region. (#14176 ) ACL Roles along with policies and global token will be replicated from the authoritative region to all federated regions. This involves a new replication loop running on the federated leader. Policies and roles may be replicated at different times, meaning the policies and role references may not be present within the local state upon replication upsert. In order to bypass the RPC and state check, a new RPC request parameter has been added. This is used by the replication process; all other callers will trigger the ACL role policy validation check. There is a new ACL RPC endpoint to allow the reading of a set of ACL Roles which is required by the replication process and matches ACL Policies and Tokens. A bug within the ACL Role listing RPC has also been fixed which returned incorrect data during blocking queries where a deletion had occurred.	2022-08-22 08:54:07 +02:00
Seth Hoenig	b3ea68948b	build: run gofmt on all go source files Go 1.19 will forcefully format all your doc strings. To get this out of the way, here is one big commit with all the changes gofmt wants to make.	2022-08-16 11:14:11 -05:00
James Rasell	663aa92b7a	Merge branch 'main' into f-gh-13120-sso-umbrella	2022-08-02 08:30:03 +01:00
James Rasell	9264f07cc1	core: add expired token garbage collection periodic jobs. (#13805 ) Two new periodic core jobs have been added which handle removing expired local and global tokens from state. The local core job is run on every leader; the global core job is only run on the leader within the authoritative region.	2022-07-19 15:37:46 +02:00
Tim Gross	a5a9eedc81	core job for secure variables re-key (#13440 ) When the `Full` flag is passed for key rotation, we kick off a core job to decrypt and re-encrypt all the secure variables so that they use the new key.	2022-07-11 13:34:06 -04:00
Tim Gross	6300427228	core job for key rotation (#13309 ) Extend the GC job to support periodic key rotation. Update the GC process to safely support signed workload identity. We can't GC any key used to sign a workload identity. Finding which key was used to sign every allocation will be expensive, but there are not that many keys. This lets us take a conservative approach: find the oldest live allocation and ensure that we don't GC any key older than that key.	2022-07-11 13:34:06 -04:00
Tim Gross	7055ce89b1	keyring replication (#13167 ) Replication for the secure variables keyring. Because only key metadata is stored in raft, we need to distribute key material out-of-band from raft replication. A goroutine runs on each server and watches for changes to the `RootKeyMeta`. When a new key is received, attempt to fetch the key from the leader. If the leader doesn't have the key (which may happen if a key is rotated right before a leader transition), try to get the key from any peer.	2022-07-11 13:34:04 -04:00
Tim Gross	d5a214484c	core job for root key GC (#13199 ) Inactive and unused keys older than a threshold will be periodically garbage collected.	2022-07-11 13:34:04 -04:00
Tim Gross	5a85d96322	remove end-user algorithm selection (#13190 ) After internal design review, we decided to remove exposing algorithm choice to the end-user for the initial release. We'll solve nonce rotation by forcing rotations automatically on key GC (in a core job, not included in this changeset). Default to AES-256 GCM for the following criteria: * faster implementation when hardware acceleration is available * FIPS compliant * implementation in pure go * post-quantum resistance Also fixed a bug in the decoding from keystore and switched to a harder-to-misuse encoding method.	2022-07-11 13:34:04 -04:00
Tim Gross	f2ee585830	bootstrap keyring (#13124 ) When a server becomes leader, it will check if there are any keys in the state store, and create one if there is not. The key metadata will be replicated via raft to all followers, who will then get the key material via key replication (not implemented in this changeset).	2022-07-11 13:34:04 -04:00
James Rasell	181b247384	core: allow pausing and un-pausing of leader broker routine (#13045 ) * core: allow pause/un-pause of eval broker on region leader. * agent: add ability to pause eval broker via scheduler config. * cli: add operator scheduler commands to interact with config. * api: add ability to pause eval broker via scheduler config * e2e: add operator scheduler test for eval broker pause. * docs: include new operator scheduler CLI and pause eval API info.	2022-07-06 16:13:48 +02:00
Derek Strickland	d7f44448e1	disconnected clients: Observability plumbing (#12141 ) * Add disconnects/reconnect to log output and emit reschedule metrics * TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics	2022-04-05 17:12:23 -04:00
Seth Hoenig	9670adb6c6	cleanup: purge github.com/pkg/errors	2022-04-01 19:24:02 -05:00
Luiz Aoqui	f8973d364e	core: use the new Raft API when removing peers (#12340 ) Raft v3 introduced a new API for adding and removing peers that takes the peer ID instead of the address. Prior to this change, Nomad would use the remote peer Raft version for deciding which API to use, but this would not work in the scenario where a Raft v3 server tries to remove a Raft v2 server; the code running uses v3 so it's unable to call the v2 API. This change uses the Raft version of the server running the code to decide which API to use. If the remote peer is a Raft v2, it uses the server address as the ID.	2022-03-22 15:07:31 -04:00
Luiz Aoqui	8db12c2a17	server: transfer leadership in case of error (#12293 ) When a Nomad server becomes the Raft leader, it must perform several actions defined in the establishLeadership function. If any of these actions fail, Raft will think the node is the leader, but it will not actually be able to act as a Nomad leader. In this scenario, leadership must be revoked and transferred to another server if possible, or the node should retry the establishLeadership steps.	2022-03-17 11:10:57 -04:00
Luiz Aoqui	2876739a51	api: apply consistent behaviour of the reverse query parameter (#12244 )	2022-03-11 19:44:52 -05:00
Seth Hoenig	40c714a681	api: return sorted results in certain list endpoints These API endpoints now return results in chronological order. They can return results in reverse chronological order by setting the query parameter ascending=true. - Eval.List - Deployment.List	2022-02-15 13:48:28 -06:00
Tim Gross	04977525dd	csi: update leader's ACL in volumewatcher (#11891 ) The volumewatcher that runs on the leader needs to make RPC calls rather than writing to raft (as we do in the deploymentwatcher) because the unpublish workflow needs to make RPC calls to the clients. This requires that the volumewatcher has access to the leader's ACL token. But when leadership transitions, the new leader creates a new leader ACL token. This ACL token needs to be passed into the volumewatcher when we enable it, otherwise the volumewatcher can find itself with a stale token.	2022-01-24 11:49:50 -05:00
Charlie Voiselle	98a240cd99	Make number of scheduler workers reloadable (#11593 ) ## Development Environment Changes * Added stringer to build deps ## New HTTP APIs * Added scheduler worker config API * Added scheduler worker info API ## New Internals * (Scheduler)Worker API refactor—Start(), Stop(), Pause(), Resume() * Update shutdown to use context * Add mutex for contended server data - `workerLock` for the `workers` slice - `workerConfigLock` for the `Server.Config.NumSchedulers` and `Server.Config.EnabledSchedulers` values ## Other * Adding docs for scheduler worker api * Add changelog message Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>	2022-01-06 11:56:13 -05:00
Mahmood Ali	ac3cf10849	nomad: only activate one-time auth tokens with 1.1.0 (#10952 ) Fix a panic in handling one-time auth tokens, used to support `nomad ui --authenticate`. If the nomad leader is a 1.1.x with some servers running as 1.0.x, the pre-1.1.0 servers risk crashing and the cluster may lose quorum. That can happen when `nomad authenticate -ui` command is issued, or when the leader scans for expired tokens every 10 minutes. Fixed #10943 .	2021-07-27 13:17:55 -04:00
Tim Gross	7a55a6af16	leader: call eval log formatting lazily Arguments to our logger's various write methods are evaluated eagerly, so method calls in log parameters will always be called, regardless of log level. Move some logger messages to the logger's `Fmt` method so that `GoString` is evaluated lazily instead.	2021-06-02 09:59:55 -04:00

207 Commits