open-nomad

Commit Graph

Author	SHA1	Message	Date
hc-github-team-nomad-core	0951fe1c50	backport of commit 0a5e90120b18ff450457463d6bcee68ec6804bb0 (#17900 ) This pull request was automerged via backport-assistant	2023-07-11 10:00:05 -05:00
nicoche	649831c1d3	deploymentwatcher: fail early whenever possible (#17341 ) Given a deployment that has a `progress_deadline`, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the `progress_deadline` is reached. Fixes: #17260	2023-06-26 14:01:03 -04:00
grembo	7936c1e33f	Add `disable_file` parameter to job's `vault` stanza (#13343 ) This complements the `env` parameter, so that the operator can author tasks that don't share their Vault token with the workload when using `image` filesystem isolation. As a result, more powerful tokens can be used in a job definition, allowing it to use template stanzas to issue all kinds of secrets (database secrets, Vault tokens with very specific policies, etc.), without sharing that issuing power with the task itself. This is accomplished by creating a directory called `private` within the task's working directory, which shares many properties of the `secrets` directory (tmpfs where possible, not accessible by `nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the container. If the `disable_file` parameter is set to `false` (its default), the Vault token is also written to the NOMAD_SECRETS_DIR, so the default behavior is backwards compatible. Even if the operator never changes the default, they will still benefit from the improved behavior of Nomad never reading the token back in from that - potentially altered - location.	2023-06-23 15:15:04 -04:00
Luiz Aoqui	d5aa72190f	node pools: namespace integration (#17562 ) Add structs and fields to support the Nomad Pools Governance Enterprise feature of controlling node pool access via namespaces. Nomad Enterprise allows users to specify a default node pool to be used by jobs that don't specify one. In order to accomplish this, it's necessary to distinguish between a job that explicitly uses the `default` node pool and one that did not specify any. If the `default` node pool is set during job canonicalization it's impossible to do this, so this commit allows a job to have an empty node pool value during registration but sets to `default` at the admission controller mutator. In order to guarantee state consistency the state store validates that the job node pool is set and exists before inserting it.	2023-06-16 16:30:22 -04:00
Luiz Aoqui	bc17cffaef	node pool: node pool upsert on multiregion node register (#17503 ) When registering a node with a new node pool in a non-authoritative region we can't create the node pool because this new pool will not be replicated to other regions. This commit modifies the node registration logic to only allow automatic node pool creation in the authoritative region. In non-authoritative regions, the client is registered, but the node pool is not created. The client is kept in the `initialing` status until its node pool is created in the authoritative region and replicated to the client's region.	2023-06-13 11:28:28 -04:00
Tim Gross	fbaf4c8b69	node pools: implement support in scheduler (#17443 ) Implement scheduler support for node pool: * When a scheduler is invoked, we get a set of the ready nodes in the DCs that are allowed for that job. Extend the filter to include the node pool. * Ensure that changes to a job's node pool are picked up as destructive allocation updates. * Add `NodesInPool` as a metric to all reporting done by the scheduler. * Add the node-in-pool the filter to the `Node.Register` RPC so that we don't generate spurious evals for nodes in the wrong pool.	2023-06-07 10:39:03 -04:00
Luiz Aoqui	aa1b33d157	node pools: add event stream support (#17412 )	2023-06-06 10:14:47 -04:00
Luiz Aoqui	6039c18ab6	node pools: register a node in a node pool (#17405 )	2023-06-02 17:50:50 -04:00
Luiz Aoqui	f755b9469f	core: refactor task validation (#17344 ) Move all validations related to task fields to Task.Validate(). Prior to this, some task validations were being done inside TaskGroup.Validate() because they required access to some group values. But similarly to how TaskGroup.Validate() tasks the job as parameter, it's fair to expect the task to receive its group.	2023-06-01 19:26:42 -04:00
Luiz Aoqui	4be8d7c049	core: fix kill_timeout validation when progress_deadline is 0 (#17342 )	2023-06-01 19:01:32 -04:00
Tim Gross	4f14fa0518	node pools: add `node_pool` field to job spec (#17379 ) This changeset only adds the `node_pool` field to the jobspec, and ensures that it gets picked up correctly as a change. Without the rest of the implementation landed yet, the field will be ignored.	2023-06-01 16:08:55 -04:00
Phil Renaud	7e56ca62d1	[ui] Adds a "Scheduling" filter to the job.allocations page (#17227 ) * Basic filter concept * Make sure NextAllocation gets sent up with allocation stub	2023-05-18 16:24:41 -04:00
Luiz Aoqui	389212bfda	node pool: initial base work (#17163 ) Implementation of the base work for the new node pools feature. It includes a new `NodePool` struct and its corresponding state store table. Upon start the state store is populated with two built-in node pools that cannot be modified nor deleted: * `all` is a node pool that always includes all nodes in the cluster. * `default` is the node pool where nodes that don't specify a node pool in their configuration are placed.	2023-05-15 10:49:08 -04:00
Seth Hoenig	81e36b3650	core: eliminate second index on job_submissions table (#17146 ) * core: eliminate second index on job_submissions table This PR refactors the job_submissions state store code to eliminate the use of a second index formerly used for purging all versions of a given job. In practice we ended up with duplicate entries on the table. Instead, use index prefix scanning on the primary index and tidy up any potential for creating (or removing) duplicates. * core: pr comments followup	2023-05-11 09:51:08 -05:00
Tim Gross	9ed75e1f72	client: de-duplicate alloc updates and gate during restore (#17074 ) When client nodes are restarted, all allocations that have been scheduled on the node have their modify index updated, including terminal allocations. There are several contributing factors: * The `allocSync` method that updates the servers isn't gated on first contact with the servers. This means that if a server updates the desired state while the client is down, the `allocSync` races with the `Node.ClientGetAlloc` RPC. This will typically result in the client updating the server with "running" and then immediately thereafter "complete". * The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even if it's possible to assert that the server has definitely seen the client state. The allocrunner may queue-up updates even if we gate sending them. So then we end up with a race between the allocrunner updating its internal state to overwrite the previous update and `allocSync` sending the bogus or duplicate update. This changeset adds tracking of server-acknowledged state to the allocrunner. This state gets checked in the `allocSync` before adding the update to the batch, and updated when `Node.UpdateAlloc` returns successfully. To implement this we need to be able to equality-check the updates against the last acknowledged state. We also need to add the last acknowledged state to the client state DB, otherwise we'd drop unacknowledged updates across restarts. The client restart test has been expanded to cover a variety of allocation states, including allocs stopped before shutdown, allocs stopped by the server while the client is down, and allocs that have been completely GC'd on the server while the client is down. I've also bench tested scenarios where the task workload is killed while the client is down, resulting in a failed restore. Fixes #16381	2023-05-11 09:05:24 -04:00
Tim Gross	17bd930ca9	logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087 ) When the server restarts for the upgrade, it loads the `structs.Job` from the Raft snapshot/logs. The jobspec has long since been parsed, so none of the guards around the default value are in play. The empty field value for `Enabled` is the zero value, which is false. This doesn't impact any running allocation because we don't replace running allocations when either the client or server restart. But as soon as any allocation gets rescheduled (ex. you drain all your clients during upgrades), it'll be using the `structs.Job` that the server has, which has `Enabled = false`, and logs will not be collected. This changeset fixes the bug by adding a new field `Disabled` which defaults to false (so that the zero value works), and deprecates the old field. Fixes #17076	2023-05-04 16:01:18 -04:00
Michael Schurter	3b3b02b741	dep: update from jwt/v4 to jwt/v5 (#17062 ) Their release notes are here: https://github.com/golang-jwt/jwt/releases Seemed wise to upgrade before we do even more with JWTs. For example this upgrade would have mattered if we already implemented common JWT claims such as expiration. Since we didn't rely on any claim verification this upgrade is a noop... ...except for 1 test that called `Claims.Valid()`! Removing that assertion seems scary, but it didn't actually do anything because we didn't implement any of the standard claims it validated: https://github.com/golang-jwt/jwt/blob/v4.5.0/map_claims.go#L120-L151 So functionally this major upgrade is a noop.	2023-05-03 11:17:38 -07:00
Luiz Aoqui	7b5a8f1fb0	Revert "hashicorp/go-msgpack v2 (#16810 )" (#17047 ) This reverts commit 8a98520d56eed3848096734487d8bd3eb9162a65.	2023-05-01 17:18:34 -04:00
Tim Gross	72cbe53f19	logs: allow disabling log collection in jobspec (#16962 ) Some Nomad users ship application logs out-of-band via syslog. For these users having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow disabling the logmon and pointing the task's stdout/stderr to /dev/null. This changeset is the first of several incremental improvements to log collection short of full-on logging plugins. The next step will likely be to extend the internal-only task driver configuration so that cluster administrators can turn off log collection for the entire driver. --- Fixes: #11175 Co-authored-by: Thomas Weber <towe75@googlemail.com>	2023-04-24 10:00:27 -04:00
James Rasell	367cfa6d93	rpc: use "+" concatination in hot path RPC rate limit metrics. (#16923 )	2023-04-18 13:41:34 +01:00
Ian Fijolek	619f49afcf	hashicorp/go-msgpack v2 (#16810 ) * Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0 Fixes #16808 * Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack * deps: use go-msgpack v2.0.0 go-msgpack v2.1.0 includes some code changes that we will need to investigate furthere to assess its impact on Nomad, so keeping this dependency on v2.0.0 for now since it's no-op. --------- Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-04-17 17:02:05 -04:00
Juana De La Cuesta	8302085384	Deployment Status Command Does Not Respect -namespace Wildcard (#16792 ) * func: add namespace support for list deployment * func: add wildcard to namespace filter for deployments * Update deployment_endpoint.go * style: use must instead of require or asseert * style: rename paginator to avoid clash with import * style: add changelog entry * fix: add missing parameter for upsert jobs	2023-04-12 11:02:14 +02:00
Seth Hoenig	ba728f8f97	api: enable support for setting original job source (#16763 ) * api: enable support for setting original source alongside job This PR adds support for setting job source material along with the registration of a job. This includes a new HTTP endpoint and a new RPC endpoint for making queries for the original source of a job. The HTTP endpoint is /v1/job/<id>/submission?version=<version> and the RPC method is Job.GetJobSubmission. The job source (if submitted, and doing so is always optional), is stored in the job_submission memdb table, separately from the actual job. This way we do not incur overhead of reading the large string field throughout normal job operations. The server config now includes job_max_source_size for configuring the maximum size the job source may be, before the server simply drops the source material. This should help prevent Bad Things from happening when huge jobs are submitted. If the value is set to 0, all job source material will be dropped. * api: avoid writing var content to disk for parsing * api: move submission validation into RPC layer * api: return an error if updating a job submission without namespace or job id * api: be exact about the job index we associate a submission with (modify) * api: reword api docs scheduling * api: prune all but the last 6 job submissions * api: protect against nil job submission in job validation * api: set max job source size in test server * api: fixups from pr	2023-04-11 08:45:08 -05:00
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	1335543731	ephemeral disk: `migrate` should imply `sticky` (#16826 ) The `ephemeral_disk` block's `migrate` field allows for best-effort migration of the ephemeral disk data to new nodes. The documentation says the `migrate` field is only respected if `sticky=true`, but in fact if client ACLs are not set the data is migrated even if `sticky=false`. The existing behavior when client ACLs are disabled has existed since the early implementation, so "fixing" that case now would silently break backwards compatibility. Additionally, having `migrate` not imply `sticky` seems nonsensical: it suggests that if we place on a new node we migrate the data but if we place on the same node, we throw the data away! Update so that `migrate=true` implies `sticky=true` as follows: * The failure mode when client ACLs are enabled comes from the server not passing along a migration token. Update the server so that the server provides a migration token whenever `migrate=true` and not just when `sticky=true` too. * Update the scheduler so that `migrate` implies `sticky`. * Update the client so that we check for `migrate \|\| sticky` where appropriate. * Refactor the E2E tests to move them off the old framework and make the intention of the test more clear.	2023-04-07 16:33:45 -04:00
Juana De La Cuesta	9b4871fece	Prevent kill_timeout greater than progress_deadline (#16761 ) * func: add validation for kill timeout smaller than progress dealine * style: add changelog * style: typo in changelog * style: remove refactored test * Update .changelog/16761.txt Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * Update nomad/structs/structs.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com> --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-04-04 18:17:10 +02:00
Seth Hoenig	ed7177de76	scheduler: annotate tasksUpdated with reason and purge DeepEquals (#16421 ) * scheduler: annotate tasksUpdated with reason and purge DeepEquals * cr: move opaque into helper * cr: swap affinity/spread hashing for slice equal * contributing: update checklist-jobspec with notes about struct methods * cr: add more cases to wait config equal method * cr: use reflect when comparing envoy config blocks * cl: add cl	2023-03-14 09:46:00 -05:00
Tim Gross	9dfb51579c	scheduler: refactor system util tests (#16416 ) The tests for the system allocs reconciling code path (`diffSystemAllocs`) include many impossible test environments, such as passing allocs for the wrong node into the function. This makes the test assertions nonsensible for use in walking yourself through the correct behavior. I've pulled this changeset out of PR #16097 so that we can merge these improvements and revisit the right approach to fix the problem in #16097 with less urgency now that the PFNR bug fix has been merged. This changeset breaks up a couple of tests, expands test coverage, and makes test assertions more clear. It also corrects one bit of production code that behaves fine in production because of canonicalization, but forces us to remember to set values in tests to compensate.	2023-03-13 11:59:31 -04:00
Tim Gross	a2ceab3d8c	scheduler: correctly detect inplace update with wildcard datacenters (#16362 ) Wildcard datacenters introduced a bug where a job with any wildcard datacenters will always be treated as a destructive update when we check whether a datacenter has been removed from the jobspec. Includes updating the helper so that callers don't have to loop over the job's datacenters.	2023-03-07 10:05:59 -05:00
Luiz Aoqui	2a1a790820	client: don't emit task shutdown delay event if not waiting (#16281 )	2023-03-03 18:22:06 -05:00
Michael Schurter	bd7b60712e	Accept Workload Identities for Client RPCs (#16254 ) This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs. Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled. --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-02-27 10:17:47 -08:00
Alessio Perugini	4e9ec24b22	Allow configurable range of Job priorities (#16084 )	2023-02-17 09:23:13 -05:00
Seth Hoenig	df3e8f82da	deps: update go-set, go-landlock (#16146 ) Made a breaking change in go-set (String() signature), need to update both these dependencies together and also fix a thing in structs.go	2023-02-13 08:26:30 -06:00
Michael Schurter	0a496c845e	Task API via Unix Domain Socket (#15864 ) This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled. This PR contains the core feature and tests but followup work is required for the following TODO items: - Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink - Unit tests for auth middleware - Caching for auth middleware - Rate limiting on negative lookups for auth middleware --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-02-06 11:31:22 -08:00
Tim Gross	19a2c065f4	System and sysbatch jobs always have zero index (#16030 ) Service jobs should have unique allocation Names, derived from the Job.ID. System jobs do not have unique allocation Names because the index is intended to indicated the instance out of a desired count size. Because system jobs do not have an explicit count but the results are based on the targeted nodes, the index is less informative and this was intentionally omitted from the original design. Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for system/sysbatch jobs Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs. System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So interpolation via `per_alloc` will not work as soon as there's more than one allocation placed. Validate against this on job submission.	2023-02-02 16:18:01 -05:00
Daniel Bennett	335f0a5371	docs: how to troubleshoot consul connect envoy (#15908 ) * largely a doc-ification of this commit message: d47678074bf8ae9ff2da3c91d0729bf03aee8446 this doesn't spell out all the possible failure modes, but should be a good starting point for folks. * connect: add doc link to envoy bootstrap error * add Unwrap() to RecoverableError mainly for easier testing	2023-02-02 14:20:26 -06:00
Charlie Voiselle	cc6f4719f1	Add option to expose workload token to task (#15755 ) Add `identity` jobspec block to expose workload identity tokens to tasks. --------- Co-authored-by: Anders <mail@anars.dk> Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2023-02-02 10:59:14 -08:00
jmwilkinson	37834dffda	Allow wildcard datacenters to be specified in job file (#11170 ) Also allows for default value of `datacenters = ["*"]`	2023-02-02 09:57:45 -05:00
Piotr Kazmierczak	14b53df3b6	renamed stanza to block for consistency with other projects (#15941 )	2023-01-30 15:48:43 +01:00
Luiz Aoqui	3479e2231f	core: enforce strict steps for clients reconnect (#15808 ) When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat the Nomad server will update its status to `disconnected`. Upon reconnecting, the client will make three main RPC calls: - `Node.UpdateStatus` is used to set the client status to `ready`. - `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states etc. - `Node.Register` is used to upsert the entire node information, including its status. These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations. For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources. When this client comes back the order of events may be: 1. Client calls `Node.UpdateStatus` and is now `ready`. 2. Scheduler reconciles allocations and places the replacement alloc to the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`. 3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`. 4. Scheduler notices too many allocs and stops the replacement. This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled. To avoid problems like this clients must update _all_ of its relevant information before they can be considered `ready` and available for scheduling. To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting: - `Node.Register` does not set the client status anymore. - `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`. These changes are done server-side to avoid the need of additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would. The verification of whether allocations have been updates is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.	2023-01-25 15:53:59 -05:00
Tim Gross	055434cca9	add metric for count of RPC requests (#15515 ) Implement a metric for RPC requests with labels on the identity, so that administrators can monitor the source of requests within the cluster. This changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire up the remaining RPCs once we've threaded the new pre-forwarding authentication through the all. Note that metrics are measured after we forward but before we return any authentication error. This ensures that we only emit metrics on the server that actually serves the request. We'll perform rate limiting at the same place. Includes telemetry configuration to omit identity labels.	2023-01-24 11:54:20 -05:00
Seth Hoenig	d2d8ebbeba	consul: correctly interpret missing consul checks as unhealthy (#15822 ) * consul: correctly understand missing consul checks as unhealthy This PR fixes a bug where Nomad assumed any registered Checks would exist in the service registration coming back from Consul. In some cases, the Consul may be slow in processing the check registration, and the response object would not contain checks. Nomad would then scan the empty response looking for Checks with failing health status, finding none, and then marking a task/alloc as healthy. In reality, we must always use Nomad's view of what checks should exist as the source of truth, and compare that with the response Consul gives us, making sure they match, before scanning the Consul response for failing check statuses. Fixes #15536 * consul: minor CR refactor using maps not sets * consul: observe transition from healthy to unhealthy checks * consul: spell healthy correctly	2023-01-19 14:01:12 -06:00
James Rasell	fad9b40e53	Merge branch 'main' into sso/gh-13120-oidc-login	2023-01-18 10:05:31 +00:00
Piotr Kazmierczak	be36a1924f	acl: binding rules evaluation (#15697 ) Binder provides an interface for binding claims and ACL roles/policies of Nomad.	2023-01-10 16:08:08 +01:00
Tim Gross	32f6ce1c54	Authenticate method improvements (#15734 ) This changeset covers a sidebar discussion that @schmichael and I had around the design for pre-forwarding auth. This includes some changes extracted out of #15513 to make it easier to review both and leave a clean history. * Remove fast path for NodeID. Previously-connected clients will have a NodeID set on the context, and because this is a large portion of the RPCs sent we fast-pathed it at the top of the `Authenticate` method. But the context is shared for all yamux streams over the same yamux session (and TCP connection). This lets an authenticated HTTP request to a client use the NodeID for authentication, which is a privilege escalation. Remove the fast path and annotate it so that we don't break it again. * Add context to decisions around AuthenticatedIdentity. The `Authenticate` method taken on its own looks like it wants to return an `acl.ACL` that folds over all the various identity types (creating an ephemeral ACL on the fly if neccessary). But keeping these fields idependent allows RPC handlers to differentiate between internal and external origins so we most likely want to avoid this. Leave some docstrings as a warning as to why this is built the way it is. * Mutate the request rather than returning. When reviewing #15513 we decided that forcing the request handler to call `SetIdentity` was repetitive and error prone. Instead, the `Authenticate` method mutates the request by setting its `AuthenticatedIdentity`.	2023-01-10 09:46:38 -05:00
James Rasell	3c941c6bc3	acl: add binding rule object state schema and functionality. (#15511 ) This change adds a new table that will store ACL binding rule objects. The two indexes allow fast lookups by their ID, or by which auth method they are linked to. Snapshot persist and restore functionality ensures this table can be saved and restored from snapshots. In order to write and delete the object to state, new Raft messages have been added. All RPC request and response structs, along with object functions such as diff and canonicalize have been included within this work as it is nicely separated from the other areas of work.	2022-12-14 08:48:18 +01:00
Tim Gross	e0fddee386	Pre forwarding authentication (#15417 ) Upcoming work to instrument the rate of RPC requests by consumer (and eventually rate limit) require that we authenticate a RPC request before forwarding. Add a new top-level `Authenticate` method to the server and have it return an `AuthenticatedIdentity` struct. RPC handlers will use the relevant fields of this identity for performing authorization. This changeset includes: * The main implementation of `Authenticate` * Provide a new RPC `ACL.WhoAmI` for debugging authentication. This endpoint returns the same `AuthenticatedIdentity` that will be used by RPC handlers. At some point we might want to give this an equivalent HTTP endpoint but I didn't want to add that to our public API until some of the other Workload Identity work is solidified, especially if we don't need it yet. * A full coverage test of the `Authenticate` method. This sets up two server nodes with mTLS and ACLs, some tokens, and some allocations with workload identities. * Wire up an example of using `Authenticate` in the `Namespace.Upsert` RPC and see how authorization happens after forwarding. * A new semgrep rule for `Authenticate`, which we'll need to update once we're ready to wire up more RPC endpoints with authorization steps.	2022-12-06 14:44:03 -05:00
Luiz Aoqui	8f91be26ab	scheduler: create placements for non-register MRD (#15325 ) * scheduler: create placements for non-register MRD For multiregion jobs, the scheduler does not create placements on registration because the deployment must wait for the other regions. Once of these regions will then trigger the deployment to run. Currently, this is done in the scheduler by considering any eval for a multiregion job as "paused" since it's expected that another region will eventually unpause it. This becomes a problem where evals not triggered by a job registration happen, such as on a node update. These types of regional changes do not have other regions waiting to progress the deployment, and so they were never resulting in placements. The fix is to create a deployment at job registration time. This additional piece of state allows the scheduler to differentiate between a multiregion change, where there are other regions engaged in the deployment so no placements are required, from a regional change, where the scheduler does need to create placements. This deployment starts in the new "initializing" status to signal to the scheduler that it needs to compute the initial deployment state. The multiregion deployment will wait until this deployment state is persisted and its starts is set to "pending". Without this state transition it's possible to hit a race condition where the plan applier and the deployment watcher may step of each other and overwrite their changes. * changelog: add entry for #15325	2022-11-25 12:45:34 -05:00
Piotr Kazmierczak	bb66b5e770	acl: sso auth method RPC endpoints (#15221 ) This PR implements RPC endpoints for SSO auth methods. This PR is part of the SSO work captured under ☂️ ticket #13120.	2022-11-21 10:15:39 +01:00
Tim Gross	37134a4a37	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending larges batches of IDs. Although the command's batch size was carefully tuned, we still need to be JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look a the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only abut 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00

1 2 3 4 5 ...

1439 Commits