open-nomad

Author	SHA1	Message	Date
James Rasell	ac98c2ed40	vars: ensure struct receiver names are consistent. (#16995 )	2023-04-27 13:51:11 +01:00
James Rasell	4d2c1403c2	scale: do not allow scaling of jobs with type system. (#16969 )	2023-04-25 15:47:44 +01:00
Tim Gross	72cbe53f19	logs: allow disabling log collection in jobspec (#16962 ) Some Nomad users ship application logs out-of-band via syslog. For these users having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow disabling the logmon and pointing the task's stdout/stderr to /dev/null. This changeset is the first of several incremental improvements to log collection short of full-on logging plugins. The next step will likely be to extend the internal-only task driver configuration so that cluster administrators can turn off log collection for the entire driver. --- Fixes: #11175 Co-authored-by: Thomas Weber <towe75@googlemail.com>	2023-04-24 10:00:27 -04:00
valodzka	379497a484	fix host port handling for ipv6 (#16723 )	2023-04-20 19:53:20 -07:00
James Rasell	367cfa6d93	rpc: use "+" concatenation in hot path RPC rate limit metrics. (#16923 )	2023-04-18 13:41:34 +01:00
Ian Fijolek	619f49afcf	hashicorp/go-msgpack v2 (#16810 ) * Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0 Fixes #16808 * Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack * deps: use go-msgpack v2.0.0 go-msgpack v2.1.0 includes some code changes that we will need to investigate further to assess their impact on Nomad, so keeping this dependency on v2.0.0 for now since it's a no-op. --------- Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-04-17 17:02:05 -04:00
Tim Gross	62548616d4	client: allow `drain_on_shutdown` configuration (#16827 ) Adds a new configuration to clients to optionally allow them to drain their workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting itself and then monitors the drain state as seen by the server until the drain is complete or the deadline expires. If it loses connection with the server, it will monitor local client status instead to ensure allocations are stopped before exiting.	2023-04-14 15:35:32 -04:00
Tim Gross	5a9abdc469	drain: use client status to determine drain is complete (#14348 ) If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`, the node drain is marked as complete prematurely, even though drain monitoring will continue to report allocation migrations. This impacts the UI or API clients that monitor node draining to shut down nodes. This changeset updates the behavior to wait until the client status of all drained allocs are terminal before marking the node as done draining.	2023-04-13 08:55:28 -04:00
James Rasell	b7a41fe48d	core: ensure all Server receiver names are consistent. (#16859 )	2023-04-12 14:03:07 +01:00
Juana De La Cuesta	8302085384	Deployment Status Command Does Not Respect -namespace Wildcard (#16792 ) * func: add namespace support for list deployment * func: add wildcard to namespace filter for deployments * Update deployment_endpoint.go * style: use must instead of require or assert * style: rename paginator to avoid clash with import * style: add changelog entry * fix: add missing parameter for upsert jobs	2023-04-12 11:02:14 +02:00
Tim Gross	a9a350cfdb	drainer: fix codec race condition in integration test (#16845 ) msgpackrpc codec handles are specific to a connection and cannot be shared between goroutines; this can cause corrupted decoding. Fix the drainer integration test so that we create separate codecs for the goroutines that the test helper spins up to simulate client updates. This changeset also refactors the drainer integration test to bring it up to current idioms and library usages, make assertions more clear, and reduce duplication.	2023-04-11 14:31:13 -04:00
Seth Hoenig	ba728f8f97	api: enable support for setting original job source (#16763 ) * api: enable support for setting original source alongside job This PR adds support for setting job source material along with the registration of a job. This includes a new HTTP endpoint and a new RPC endpoint for making queries for the original source of a job. The HTTP endpoint is /v1/job/<id>/submission?version=<version> and the RPC method is Job.GetJobSubmission. The job source (if submitted, and doing so is always optional), is stored in the job_submission memdb table, separately from the actual job. This way we do not incur overhead of reading the large string field throughout normal job operations. The server config now includes job_max_source_size for configuring the maximum size the job source may be, before the server simply drops the source material. This should help prevent Bad Things from happening when huge jobs are submitted. If the value is set to 0, all job source material will be dropped. * api: avoid writing var content to disk for parsing * api: move submission validation into RPC layer * api: return an error if updating a job submission without namespace or job id * api: be exact about the job index we associate a submission with (modify) * api: reword api docs scheduling * api: prune all but the last 6 job submissions * api: protect against nil job submission in job validation * api: set max job source size in test server * api: fixups from pr	2023-04-11 08:45:08 -05:00
Piotr Kazmierczak	dea8b1a093	acl: bump JWT auth gate to 1.5.4 (#16838 )	2023-04-11 10:07:45 +02:00
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	1335543731	ephemeral disk: `migrate` should imply `sticky` (#16826 ) The `ephemeral_disk` block's `migrate` field allows for best-effort migration of the ephemeral disk data to new nodes. The documentation says the `migrate` field is only respected if `sticky=true`, but in fact if client ACLs are not set the data is migrated even if `sticky=false`. The existing behavior when client ACLs are disabled has existed since the early implementation, so "fixing" that case now would silently break backwards compatibility. Additionally, having `migrate` not imply `sticky` seems nonsensical: it suggests that if we place on a new node we migrate the data but if we place on the same node, we throw the data away! Update so that `migrate=true` implies `sticky=true` as follows: * The failure mode when client ACLs are enabled comes from the server not passing along a migration token. Update the server so that the server provides a migration token whenever `migrate=true` and not just when `sticky=true` too. * Update the scheduler so that `migrate` implies `sticky`. * Update the client so that we check for `migrate || sticky` where appropriate. * Refactor the E2E tests to move them off the old framework and make the intention of the test more clear.	2023-04-07 16:33:45 -04:00
hc-github-team-nomad-core	3578078caf	Prepare for next release	2023-04-05 12:31:42 -04:00
hc-github-team-nomad-core	b64ee2726d	Generate files for 1.5.3 release	2023-04-05 12:31:30 -04:00
Tim Gross	8278f23042	acl: fix ACL bypass for anon requests that pass thru client HTTP Requests without an ACL token that pass thru the client's HTTP API are treated as though they come from the client itself. This allows bypass of ACLs on RPC requests where ACL permissions are checked (like `Job.Register`). Invalid tokens are correctly rejected. Fix the bypass by only setting a client ID on the identity if we have a valid node secret. Note that this changeset will break rate metrics for RPCs sent by clients without a client secret such as `Node.GetClientAllocs`; these requests will be recorded as anonymous. Future work should: * Ensure the node secret is sent with all client-driven RPCs except `Node.Register` which is TOFU. * Create a new `acl.ACL` object from client requests so that we can enforce ACLs for all endpoints in a uniform way that's less error-prone.	2023-04-05 12:17:51 -04:00
Juana De La Cuesta	9b4871fece	Prevent kill_timeout greater than progress_deadline (#16761 ) * func: add validation for kill timeout smaller than progress deadline * style: add changelog * style: typo in changelog * style: remove refactored test * Update .changelog/16761.txt Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * Update nomad/structs/structs.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com> --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-04-04 18:17:10 +02:00
Juana De La Cuesta	1fc13b83d8	style: update documentation (#16729 )	2023-03-31 16:38:16 +02:00
Daniel Bennett	c42950e342	ent: move all license info into LicenseConfig{} (#16738 ) and add new TestConfigForServer() to get a valid nomad.Config to use in tests Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-03-30 16:15:05 -05:00
Horacio Monsalvo	20372b1721	connect: add meta on ConsulSidecarService (#16705 ) Co-authored-by: Sol-Stiep <sol.stiep@southworks.com>	2023-03-30 16:09:28 -04:00
Piotr Kazmierczak	1a5eba24a6	acl: set minACLJWTAuthMethodVersion to 1.5.3 and adjust code comment	2023-03-30 15:30:42 +02:00
Piotr Kazmierczak	d98c8f6759	acl: rebased on main and changed the gate to 1.5.3-dev	2023-03-30 09:40:12 +02:00
Piotr Kazmierczak	16b6bd9ff2	acl: fix canonicalization of JWT auth method mock (#16531 )	2023-03-30 09:39:56 +02:00
Piotr Kazmierczak	2b353902a1	acl: HTTP endpoints for JWT auth (#16519 )	2023-03-30 09:39:56 +02:00
Piotr Kazmierczak	e48c48e89b	acl: RPC endpoints for JWT auth (#15918 )	2023-03-30 09:39:56 +02:00
Piotr Kazmierczak	a9230fb0b7	acl: JWT auth method	2023-03-30 09:39:56 +02:00
Juana De La Cuesta	320884b8ee	Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583 ) * Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true Fixes #11052 When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children. * style: refactor force run function * fix: remove defer and inline unlock for speed optimization * Update nomad/leader.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * Update nomad/leader_test.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * style: refactor tests to use must * Update nomad/leader_test.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * fix: move back from defer to calling unlock before returning. createEval can't be called with the lock on * style: refactor test to use must * added new entry to changelog and update comments --------- Co-authored-by: James Rasell <jrasell@hashicorp.com> Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-03-27 17:25:05 +02:00
Juana De La Cuesta	21b675244e	style: rename ForceRun to ForceEval, for clarity (#16617 )	2023-03-27 15:38:48 +02:00
Tim Gross	977c88dcea	drainer: test refactoring to clarify behavior around delete/down nodes (#16612 ) This changeset refactors the tests of the draining node watcher so that we don't mock the node watcher's `Remove` and `Update` methods for its own tests. Instead we'll mock the node watcher's dependencies (the job watcher and deadline notifier) and now unit tests can cover the real code. This allows us to remove a bunch of TODOs in `watch_nodes.go` around testing and clarify some important behaviors: * Nodes that are down or disconnected will still be watched until the scheduler decides what to do with their allocations. This will drive the job watcher but not the node watcher, and that lets the node watcher gracefully handle cases where a heartbeat fails but the node heartbeats again before its allocs can be evicted. * Stop watching nodes that have been deleted. The blocking query for nodes set the maximum index to the highest index of a node it found, rather than the index of the nodes table. This misses updates to the index from deleting nodes. This was done as a performance optimization to avoid excessive unblocking, but because the query is over all nodes anyways there's no optimization to be had here. Remove the optimization so we can detect deleted nodes without having to wait for an update to an unrelated node.	2023-03-23 14:07:09 -04:00
Michael Schurter	f8884d8b52	client/metadata: fix crasher caused by AllowStale = false (#16549 ) Fixes #16517 Given a 3 Server cluster with at least 1 Client connected to Follower 1: If a NodeMeta.{Apply,Read} for the Client request is received by Follower 1 with `AllowStale = false` the Follower will forward the request to the Leader. The Leader, not being connected to the target Client, will forward the RPC to Follower 1. Follower 1, seeing AllowStale=false, will forward the request to the Leader. The Leader, not being connected to... well hopefully you get the picture: an infinite loop occurs.	2023-03-20 16:32:32 -07:00
Seth Hoenig	d6dcc53c0a	tls enforcement flaky tests (#16543 ) * tests: add WaitForLeaders helpers using must/wait timings * tests: start servers for mtls tests together Fixes #16253 (hopefully)	2023-03-17 14:11:13 -05:00
Piotr Kazmierczak	14927e93bc	acl: fix canonicalization of OIDC auth method mock (#16534 )	2023-03-17 15:37:54 +01:00
Seth Hoenig	ed7177de76	scheduler: annotate tasksUpdated with reason and purge DeepEquals (#16421 ) * scheduler: annotate tasksUpdated with reason and purge DeepEquals * cr: move opaque into helper * cr: swap affinity/spread hashing for slice equal * contributing: update checklist-jobspec with notes about struct methods * cr: add more cases to wait config equal method * cr: use reflect when comparing envoy config blocks * cl: add cl	2023-03-14 09:46:00 -05:00
Luiz Aoqui	adf147cb36	acl: update job eval requirement to `submit-job` (#16463 ) The job evaluate endpoint creates a new evaluation for the job which is a write operation. This change modifies the necessary capability from `read-job` to `submit-job` to better reflect this.	2023-03-13 17:13:54 -04:00
Tim Gross	9dfb51579c	scheduler: refactor system util tests (#16416 ) The tests for the system allocs reconciling code path (`diffSystemAllocs`) include many impossible test environments, such as passing allocs for the wrong node into the function. This makes the test assertions nonsensible for use in walking yourself through the correct behavior. I've pulled this changeset out of PR #16097 so that we can merge these improvements and revisit the right approach to fix the problem in #16097 with less urgency now that the PFNR bug fix has been merged. This changeset breaks up a couple of tests, expands test coverage, and makes test assertions more clear. It also corrects one bit of production code that behaves fine in production because of canonicalization, but forces us to remember to set values in tests to compensate.	2023-03-13 11:59:31 -04:00
Seth Hoenig	630bd8eb68	scheduler: add simple benchmark for tasksUpdated (#16422 ) In preparation for some refactoring to tasksUpdated, add a benchmark to the old code so it's easy to compare with the changes, making sure nothing goes off the rails for performance.	2023-03-13 10:44:14 -05:00
hc-github-team-nomad-core	2d1a4d90e9	Prepare for next release	2023-03-13 11:13:27 -04:00
hc-github-team-nomad-core	35167e692a	Generate files for 1.5.1 release	2023-03-13 11:13:27 -04:00
Tim Gross	1cf28996e7	acl: prevent privilege escalation via workload identity ACL policies can be associated with a job so that the job's Workload Identity can have expanded access to other policy objects, including other variables. Policies set on the variables the job automatically has access to were ignored, but this includes policies with `deny` capabilities. Additionally, when resolving claims for a workload identity without any attached policies, the `ResolveClaims` method returned a `nil` ACL object, which is treated similarly to a management token. While this was safe in Nomad 1.4.x, when the workload identity token was exposed to the task via the `identity` block, this allows a user with `submit-job` capabilities to escalate their privileges. We originally implemented automatic workload access to Variables as a separate code path in the Variables RPC endpoint so that we don't have to generate on-the-fly policies that blow up the ACL policy cache. This is fairly brittle but also the behavior around wildcard paths in policies differs from the rest of our ACL policies, which is hard to reason about. Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we can push all this logic into the `acl` package and the behavior can be consistent. This will allow a `deny` policy to override automatic access (and probably speed up checks of non-automatic variable access).	2023-03-13 11:13:27 -04:00
Luiz Aoqui	7305a374e3	allocrunner: fix health check monitoring for Consul services (#16402 ) Services must be interpolated to replace runtime variables before they can be compared against the values returned by Consul.	2023-03-10 14:43:31 -05:00
Tim Gross	99d46e5a49	scheduling: prevent self-collision in dynamic port network offerings (#16401 ) When the scheduler tries to find a placement for a new allocation, it iterates over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with the ports of all existing allocations and any other allocations already proposed as part of this same evaluation via its `SetAllocs` method. Then we make an "ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive an "offer" in return. The offer will include both static ports and any dynamic port assignments. The `AssignPorts` method was written to support group networks, and it shares code that selects dynamic ports with the original `AssignTaskNetwork` code. `AssignTaskNetwork` can request multiple ports from the bitmap at a time. But `AssignPorts` requests them one at a time and does not account for possible collisions, and doesn't return an error in that case. What happens next varies: 1. If the scheduler doesn't place the allocation on that node, the port conflict is thrown away and there's no problem. 2. If the node is picked and this is the only allocation (or last allocation), the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect. 3. If the node is picked and there are additional allocations in the same eval that iterate over the same node, their call to `SetAllocs` will detect the impossible state and the node will be rejected. This can have the puzzling behavior where a second task group for the job without any networking at all can hit a port collision error! It looks like this bug has existed since we implemented group networks, but there are several factors that add up to making the issue rare for many users yet frustratingly frequent for others: * You're more likely to hit this bug the more tightly packed your range for dynamic ports is. With 12000 ports in the range by default, many clusters can avoid this for a long time. * You're more likely to hit case (3) for jobs with lots of allocations or if a scheduler has to iterate over a large number of nodes, such as with system jobs, jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints. For unlucky combinations of these factors, it's possible that case (3) happens repeatedly, preventing scheduling of a given job until a client state change (ex. restarting the agent so all its allocations are rescheduled elsewhere) re-opens the range of dynamic ports available. This changeset: * Fixes the bug by accounting for collisions in dynamic port selection in `AssignPorts`. * Adds test coverage for `AssignPorts`, expands coverage of this case for the deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a scheduler test for spread scheduling to more easily detect this kind of problem in the future. * Adds a `String()` method to `Bitmap` so that any future "screaming" log lines have a human-readable list of used ports.	2023-03-09 10:09:54 -05:00
Lance Haig	e89c3d3b36	Update ioutil library references to os and io respectively for e2e helper nomad (#16332 ) No user facing changes so I assume no change log is required	2023-03-08 09:39:03 -06:00
Tim Gross	a2ceab3d8c	scheduler: correctly detect inplace update with wildcard datacenters (#16362 ) Wildcard datacenters introduced a bug where a job with any wildcard datacenters will always be treated as a destructive update when we check whether a datacenter has been removed from the jobspec. Includes updating the helper so that callers don't have to loop over the job's datacenters.	2023-03-07 10:05:59 -05:00
Luiz Aoqui	2a1a790820	client: don't emit task shutdown delay event if not waiting (#16281 )	2023-03-03 18:22:06 -05:00
Tim Gross	3c0eaba9db	remove backcompat support for non-atomic job registration (#16305 ) In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the new eval was written in the same raft entry. Backwards-compatibility checks were supposed to have been removed in Nomad 1.1.0, but we missed that. This is long safe to remove.	2023-03-03 15:52:22 -05:00
Tim Gross	8747059b86	service: fix regression in task access to list/read endpoint (#16316 ) When native service discovery was added, we used the node secret as the auth token. Once Workload Identity was added in Nomad 1.4.x we needed to use the claim token for `template` blocks, and so we allowed valid claims to bypass the ACL policy check to preserve the existing behavior. (Invalid claims are still rejected, so this didn't widen any security boundary.) In reworking authentication for 1.5.0, we unintentionally removed this bypass. For WIs without a policy attached to their job, everything works as expected because the resulting `acl.ACL` is nil. But once a policy is attached to the job the `acl.ACL` is no longer nil and this causes permissions errors. Fix the regression by adding back the bypass for valid claims. In future work, we should strongly consider turning the implicit policies into real `ACLPolicy` objects (even if not stored in state) so that we don't have these kinds of brittle exceptions to the auth code.	2023-03-03 11:41:19 -05:00
Tim Gross	0e1b554299	handle `FSM.Apply` errors in `raftApply` (#16287 ) The signature of the `raftApply` function requires that the caller unwrap the first returned value (the response from `FSM.Apply`) to see if it's an error. This puts the burden on the caller to remember to check two different places for errors, and we've done so inconsistently. Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error as the error value. Similar work was done in Consul in https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate and surfaces a few minor bugs in the process: * job deregistrations of already-GC'd jobs were still emitting evals * reconcile job summaries does not return scheduler errors * node updates did not report errors associated with inconsistent service discovery or CSI plugin states Note that although _most_ of the `FSM.Apply` functions return only errors (which makes it tempting to remove the first return value entirely), there are few that return `bool` for some reason and Variables relies on the response value for proper CAS checking.	2023-03-02 13:51:09 -05:00
Michael Schurter	bd7b60712e	Accept Workload Identities for Client RPCs (#16254 ) This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs. Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled. --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-02-27 10:17:47 -08:00
Daniel Bennett	39e3a1ac3e	build/cli: Add BuildDate (#16216 ) * build: add BuildDate to version info will be used in enterprise to compare to license expiration time * cli: multi-line version output, add BuildDate before: $ nomad version Nomad v1.4.3 (coolfakecommithashomgoshsuchacoolonewoww) after: $ nomad version Nomad v1.5.0-dev BuildDate 2023-02-17T19:29:26Z Revision coolfakecommithashomgoshsuchacoolonewoww compare consul: $ consul version Consul v1.14.4 Revision dae670fe Build Date 2023-01-26T15:47:10Z Protocol 2 spoken by default, blah blah blah... and vault: $ vault version Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z * docs: update version command output	2023-02-27 11:27:40 -06:00
Tim Gross	4c9688271a	CSI: fix potential state store corruptions (#16256 ) The `CSIVolume` struct has references to allocations that are "denormalized"; we don't store them on the `CSIVolume` struct but hydrate them on read. Tests detecting potential state store corruptions found two locations where we're not copying the volume before denormalizing: * When garbage collecting CSI volume claims. * When checking if it's safe to force-deregister the volume. There are no known user-visible problems associated with these bugs but both have the potential of mutating volume claims outside of a FSM transaction. This changeset also cleans up state mutations in some CSI tests so as to avoid having working tests cover up potential future bugs.	2023-02-27 08:47:08 -05:00
James Rasell	8295d0e516	acl: add validation to binding rule selector on upsert. (#16210 ) * acl: add validation to binding rule selector on upsert. * docs: add more information on binding rule selector escaping.	2023-02-17 15:38:55 +01:00
Alessio Perugini	4e9ec24b22	Allow configurable range of Job priorities (#16084 )	2023-02-17 09:23:13 -05:00
Michael Schurter	671d9f64ec	Minor post-1.5-beta1 API, code, and docs cleanups (#16193 ) * api: return error on parse failure * docs: clarify anonymous policy with task api	2023-02-16 10:32:21 -08:00
Tim Gross	27cc6a2ff9	fix test flake for RPC TLS enforcement test (#16199 ) The RPC TLS enforcement test was frequently failing with broken connections. The most likely cause was that the tests started to run before the server had started its RPC server. Wait until it self-elects to ensure that the RPC server is up. This seems to have corrected the error; I ran this 3 times without a failure (even accounting for `gotestsum` retries). Also, fix a minor test bug that didn't impact the test but showed an incorrect usage for `Status.Ping.`	2023-02-16 11:50:40 -05:00
Will Nicholson	4dc83757a6	eventstream: Handle missing policy documents in event streams (#15495 ) Fixes https://github.com/hashicorp/nomad/issues/15493 Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-02-14 11:27:39 -05:00
Seth Hoenig	165791dd89	artifact: protect against unbounded artifact decompression (1.5.0) (#16151 ) * artifact: protect against unbounded artifact decompression Starting with 1.5.0, set default values for artifact decompression limits. artifact.decompression_size_limit (default "100GB") - the maximum amount of data that will be decompressed before triggering an error and cancelling the operation artifact.decompression_file_count_limit (default 4096) - the maximum number of files that will be decompressed before triggering an error and cancelling the operation. * artifact: assert limits cannot be nil in validation	2023-02-14 09:28:39 -06:00
Charlie Voiselle	d93ba0cf32	Add warnings to `var put` for non-alphanumeric keys. (#15933 ) * Warn when Items key isn't directly accessible Go template requires that map keys are alphanumeric for direct access using the dotted reference syntax. This warns users when they create keys that run afoul of this requirement. - cli: use regex to detect invalid identifiers in var keys - test: fix slash in escape test case - api: share warning formatting function between API and CLI - ui: warn if var key has characters other than _, letter, or number --------- Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com> Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-02-13 16:14:59 -05:00
Seth Hoenig	df3e8f82da	deps: update go-set, go-landlock (#16146 ) Made a breaking change in go-set (String() signature), need to update both these dependencies together and also fix a thing in structs.go	2023-02-13 08:26:30 -06:00
Charlie Voiselle	65ce3ec8de	[core] Do not start the plugin loader on non-clients (#16111 ) The plugin loader loads task and device driver plugins which are not used on server nodes.	2023-02-10 15:33:16 -05:00
Tim Gross	65c7e149d3	eval broker: use write lock when reaping cancelable evals (#16112 ) The eval broker's `Cancelable` method used by the cancelable eval reaper mutates the slice of cancelable evals by removing a batch at a time from the slice. But this method unsafely uses a read lock despite this mutation. Under normal workloads this is likely to be safe but when the eval broker is under the heavy load this feature is intended to fix, we're likely to have a race condition. Switch this to a write lock, like the other locks that mutate the eval broker state. This changeset also adjusts the timeout to allow poorly-sized Actions runners more time to schedule the appropriate goroutines. The test has also been updated to use `shoenig/test/wait` so we can have sensible reporting of the results rather than just a timeout error when things go wrong.	2023-02-10 10:40:41 -05:00
Tim Gross	c2bd829fe2	tests: don't mutate global structs in core scheduler tests (#16120 ) Some of the core scheduler tests need the maximum batch size for writes to be smaller than the usual `structs.MaxUUIDsPerWriteRequest`. But they do so by unsafely modifying the global struct, which creates test flakes in other tests. Modify the functions under test to take a batch size parameter. Production code will pass the global while the tests can inject smaller values. Turn the `structs.MaxUUIDsPerWriteRequest` into a constant, and add a semgrep rule for avoiding this kind of thing in the future.	2023-02-10 09:26:00 -05:00
Tim Gross	69a2040e82	acl: never return auth errors for `ACL.Bootstrap` RPC (#16108 ) In #15901 we introduced pre-forwarding authentication for RPCs so that we can grab the identity for rate metrics. The `ACL.Bootstrap` RPC is an unauthenticated endpoint, so any error message from authentication is not particularly useful. This would be harmless, but if you try to bootstrap with your `NOMAD_TOKEN` already set (perhaps because you were talking to another cluster previously from the same shell session), you'll get an authentication error instead of just having the token be ignored. This is a regression from the existing behavior, so have this endpoint ignore auth errors the same way we do for every other unauthenticated endpoint (ex `Status.Peers`)	2023-02-09 10:02:56 -05:00
Seth Hoenig	0e7bf87ee1	deps: upgrade to hashicorp/golang-lru/v2 (#16085 )	2023-02-08 15:20:33 -06:00
James Rasell	53e0f424e9	rpc: bump SSO feature gate version. (#16080 )	2023-02-07 15:45:07 -08:00
Michael Schurter	35d65c7c7e	Dynamic Node Metadata (#15844 ) Fixes #14617 Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted. Includes new UI for editing metadata as well. --------- Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>	2023-02-07 14:42:25 -08:00
Seth Hoenig	590ae08752	main: remove deprecated uses of rand.Seed (#16074 ) * main: remove deprecated uses of rand.Seed go1.20 deprecates rand.Seed, and seeds the rand package automatically. Remove cases where we seed the random package, and cleanup the one case where we intentionally create a known random source. * cl: update cl * mod: update go mod	2023-02-07 09:19:38 -06:00
Michael Schurter	0a496c845e	Task API via Unix Domain Socket (#15864 ) This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled. This PR contains the core feature and tests but followup work is required for the following TODO items: - Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink - Unit tests for auth middleware - Caching for auth middleware - Rate limiting on negative lookups for auth middleware --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-02-06 11:31:22 -08:00
Phil Renaud	d3c351d2d2	Label for the Web UI (#16006 ) * Demoable state * Demo mirage color * Label as a block with foreground and background colours * Test mock updates * Go test updated * Documentation update for label support	2023-02-02 16:29:04 -05:00
Tim Gross	19a2c065f4	System and sysbatch jobs always have zero index (#16030 ) Service jobs should have unique allocation Names, derived from the Job.ID. System jobs do not have unique allocation Names because the index is intended to indicate the instance out of a desired count size. Because system jobs do not have an explicit count but the results are based on the targeted nodes, the index is less informative and this was intentionally omitted from the original design. Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for system/sysbatch jobs Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs. System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So interpolation via `per_alloc` will not work as soon as there's more than one allocation placed. Validate against this on job submission.	2023-02-02 16:18:01 -05:00
Daniel Bennett	335f0a5371	docs: how to troubleshoot consul connect envoy (#15908 ) * largely a doc-ification of this commit message: d47678074bf8ae9ff2da3c91d0729bf03aee8446 this doesn't spell out all the possible failure modes, but should be a good starting point for folks. * connect: add doc link to envoy bootstrap error * add Unwrap() to RecoverableError mainly for easier testing	2023-02-02 14:20:26 -06:00
Charlie Voiselle	cc6f4719f1	Add option to expose workload token to task (#15755 ) Add `identity` jobspec block to expose workload identity tokens to tasks. --------- Co-authored-by: Anders <mail@anars.dk> Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2023-02-02 10:59:14 -08:00
Phil Renaud	3db9f11c37	[feat] Nomad Job Templates (#15746 ) * Extend variables under the nomad path prefix to allow for job-templates (#15570) * Extend variables under the nomad path prefix to allow for job-templates * Add job-templates to error message hinting * RadioCard component for Job Templates (#15582) * chore: add * test: component API * ui: component template * refact: remove bc naming collision * styles: remove SASS var causing conflicts * Disallow specific variable at nomad/job-templates (#15681) * Disallows variables at exactly nomad/job-templates * idiomatic refactor * Expanding nomad job init to accept a template flag (#15571) * Adding a string flag for templates on job init * data-down actions-up version of a custom template editor within variable * Dont force grid on job template editor * list-templates flag started * Correctly slice from end of path name * Pre-review cleanup * Variable form acceptance test for job template editing * Some review cleanup * List Job templates test * Example from template test * Using must.assertions instead of require etc * ui: add choose template button (#15596) * ui: add new routes * chore: update file directory * ui: add choose template button * test: button and page navigation * refact: update var name * ui: use `Button` component from `HDS` (#15607) * ui: integrate buttons * refact: remove helper * ui: remove icons on non-tertiary buttons * refact: update normalize method for key/value pairs (#15612) * `revert`: `onCancel` for `JobDefinition` The `onCancel` method isn't included in the component API for `JobEditor` and the primary cancel behavior exists outside of the component. With the exception of the `JobDefinition` page where we include this button in the top right of the component instead of next to the `Plan` button.
* style: increase button size * style: keep lime green * ui: select template (#15613) * ui: deprecate unused component * ui: deprecate tests * ui: jobs.run.templates.index * ui: update logic to handle templates * refact: revert key/value changes * style: padding for cards + buttons * temp: fixtures for mirage testing * Revert "refact: revert key/value changes" This reverts commit 124e95d12140be38fc921f7e15243034092c4063. * ui: guard template for unsaved job * ui: handle reading template variable * Revert "refact: update normalize method for key/value pairs (#15612)" This reverts commit 6f5ffc9b610702aee7c47fbff742cc81f819ab74. * revert: remove test fixtures * revert: prettier problems * refact: test doesnt need filter expression * styling: button sizes and responsive cards * refact: remove route guarding * ui: update variable adapter * refact: remove model editing behavior * refact: model should query variables to populate editor * ui: clear qp on exit * refact: cleanup deprecated API * refact: query all namespaces * refact: deprecate action * ui: rely on collection * refact: patch deprecate transition API * refact: patch test to expect namespace qp * styling: padding, conditionals * ui: flashMessage on 404 * test: update for o(n+1) query * ui: create new job template (#15744) * refact: remove unused code * refact: add type safety * test: select template flow * test: add data-test attrs * chore: remove dead code * test: create new job flow * ui: add create button * ui: create job template * refact: no need for wildcard * refact: record instead of delete * styling: spacing * ui: add error handling and form validation to job create template (#15767) * ui: handle server side errors * ui: show error to prevent duplicate * refact: conditional namespace * ui: save as template flow (#15787) * bug: patches failing tests associated with `pretender` (#15812) * refact: update assertion * refact: test set-up * ui: job templates manager view (#15815) * ui: manager list view * 
test: edit flow * refact: deprecate column-helper * ui: template edit and delete flow (#15823) * ui: manager list view * refact: update title * refact: update permissions * ui: template edit page * bug: typo * refact: update toast messages * bug: clear selections on exit (#15827) * bug: clear controllers on exit * test: mirage config changes (#15828) * refact: deprecate column-helper * style: update z-index for HDS * Revert "style: update z-index for HDS" This reverts commit d3d87ceab6d083f7164941587448607838944fc1. * refact: update delete button * refact: edit redirect * refact: patch reactivity issues * styling: fixed width * refact: override defaults * styling: edit text causing overflow * styling: add inline text Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> * bug: edit `text` to `template` Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com> * test: delete flow job templates (#15896) * refact: edit names * bug: set correct ref to store * chore: trim whitespace: * test: delete flow * bug: reactively update view (#15904) * Initialized default jobs (#15856) * Initialized default jobs * More jobs scaffolded * Better commenting on a couple example job specs * Adapter doing the work * fall back to epic config * Label format helper and custom serialization logic * Test updates to account for a never-empty state * Test suite uses settled and maintain RecordArray in adapter return * Updates to hello-world and variables example jobspecs * Parameterized job gets optional payload output * Formatting changes for param and service discovery job templates * Multi-group service discovery job * Basic test for default templates (#15965) * Basic test for default templates * Percy snapshot for manage page * Some late-breaking design changes * Some copy edits to the header paragraphs for job templates (#15967) * Added some init options for job templates (#15994) * Async method for populating default job templates 
from the variable adapter --------- Co-authored-by: Jai <41024828+ChaiWithJai@users.noreply.github.com>	2023-02-02 10:37:40 -05:00
jmwilkinson	37834dffda	Allow wildcard datacenters to be specified in job file (#11170 ) Also allows for default value of `datacenters = ["*"]`	2023-02-02 09:57:45 -05:00
Seth Hoenig	ca7ead191e	consul: restore consul token when reverting a job (#15996 ) * consul: reset consul token on job during registration of a reversion * e2e: add test for reverting a job with a consul service * cl: fixup cl entry	2023-02-01 14:02:45 -06:00
James Rasell	9e8325d63c	acl: fix a bug in token creation when parsing expiration TTLs. (#15999 ) The ACL token decoding was not correctly handling time duration syntax such as "1h" which forced people to use the nanosecond representation via the HTTP API. The change adds an unmarshal function which allows this syntax to be used, along with other styles correctly.	2023-02-01 17:43:41 +01:00
James Rasell	67acfd9f6b	acl: return 400 not 404 code when creating an invalid policy. (#16000 )	2023-02-01 17:40:15 +01:00
Mike Nomitch	80848b202e	Increases max variable size to 64KiB from 16KiB (#15983 )	2023-01-31 13:32:36 -05:00
stswidwinski	16eefbbf4d	GC: ensure no leakage of evaluations for batch jobs. (#15097 ) Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory. Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.	2023-01-31 13:32:14 -05:00
Jorge Marey	d1c9aad762	Rename fields on proxyConfig (#15541 ) * Change api Fields for expose and paths * Add changelog entry * changelog: add deprecation notes about connect fields * api: minor style tweaks --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-01-30 09:31:16 -06:00
Piotr Kazmierczak	14b53df3b6	renamed stanza to block for consistency with other projects (#15941 )	2023-01-30 15:48:43 +01:00
Seth Hoenig	074b76e3bf	consul: check for acceptable service identity on consul tokens (#15928 ) When registering a job with a service and 'consul.allow_unauthenticated=false', we scan the given Consul token for an acceptable policy or role with an acceptable policy, but did not scan for an acceptable service identity (which is backed by an acceptable virtual policy). This PR updates our consul token validation to also accept a matching service identity when registering a service into Consul. Fixes #15902	2023-01-27 18:15:51 -06:00
Tim Gross	881a4cfaff	metrics: Add remaining server RPC rate metrics (#15901 )	2023-01-27 08:29:53 -05:00
Tim Gross	ce3eef8037	metrics: Add rate metrics to Client CSI endpoints (#15905 ) Also tightens up authentication for these endpoints by enforcing the server certificate name is valid. We protect these endpoints currently by mTLS and can't use an auth token because these endpoints are (uniquely) called by the leader and followers for a given node won't have the leader's ephemeral ACL token. Add a certificate name check that requests come from a server and not a client, because no client should ever send these RPCs directly.	2023-01-26 16:40:58 -05:00
Tim Gross	bed8716e44	metrics: Add metrics to unauthenticated endpoints (#15899 )	2023-01-26 15:05:51 -05:00
Tim Gross	5e75ea9fb3	metrics: Add RPC rate metrics to endpoints that validate TLS names (#15900 )	2023-01-26 15:04:25 -05:00
Yorick Gersie	2a5c423ae0	Allow per_alloc to be used with host volumes (#15780 ) Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder. When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across restarts we need to make sure allocation placement is consistent. With CSI volumes we can use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes. Ensure host volumes understand the concept of per_alloc	2023-01-26 09:14:47 -05:00
Piotr Kazmierczak	f4d6efe69f	acl: make auth method default across all types (#15869 )	2023-01-26 14:17:11 +01:00
James Rasell	5d33891910	sso: allow binding rules to create management ACL tokens. (#15860 ) * sso: allow binding rules to create management ACL tokens. * docs: update binding rule docs to detail management type addition.	2023-01-26 09:57:44 +01:00
Tim Gross	6677a103c2	metrics: measure rate of RPC requests that serve API (#15876 ) This changeset configures the RPC rate metrics that were added in #15515 to all the RPCs that support authenticated HTTP API requests. These endpoints were already configured with pre-forwarding authentication in #15870, and a handful of others were done already as part of the proof-of-concept work. So this changeset is entirely copy-and-pasting one method call into a whole mess of handlers. Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining set of RPCs that have no API consumers or aren't authenticated, in smaller chunks that can be more thoughtfully reviewed.	2023-01-25 16:37:24 -05:00
Luiz Aoqui	3479e2231f	core: enforce strict steps for clients reconnect (#15808 ) When a Nomad client that is running an allocation with `max_client_disconnect` set misses a heartbeat the Nomad server will update its status to `disconnected`. Upon reconnecting, the client will make three main RPC calls: - `Node.UpdateStatus` is used to set the client status to `ready`. - `Node.UpdateAlloc` is used to update the client-side information about allocations, such as their `ClientStatus`, task states etc. - `Node.Register` is used to upsert the entire node information, including its status. These calls are made concurrently and are also running in parallel with the scheduler. Depending on the order they run the scheduler may end up with incomplete data when reconciling allocations. For example, a client disconnects and its replacement allocation cannot be placed anywhere else, so there's a pending eval waiting for resources. When this client comes back the order of events may be: 1. Client calls `Node.UpdateStatus` and is now `ready`. 2. Scheduler reconciles allocations and places the replacement alloc to the client. The client is now assigned two allocations: the original alloc that is still `unknown` and the replacement that is `pending`. 3. Client calls `Node.UpdateAlloc` and updates the original alloc to `running`. 4. Scheduler notices too many allocs and stops the replacement. This creates unnecessary placements or, in a different order of events, may leave the job without any allocations running until the whole state is updated and reconciled. To avoid problems like this clients must update _all_ of their relevant information before they can be considered `ready` and available for scheduling. To achieve this goal the RPC endpoints mentioned above have been modified to enforce strict steps for nodes reconnecting: - `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing` status until it successfully calls `Node.UpdateAlloc`. These changes are done server-side to avoid the need of additional coordination between clients and servers. Clients are kept oblivious of these changes and will keep making these calls as they normally would. The verification of whether allocations have been updated is done by storing and comparing the Raft index of the last time the client missed a heartbeat and the last time it updated its allocations.	2023-01-25 15:53:59 -05:00
Tim Gross	f3f64af821	WI: allow workloads to use RPCs associated with HTTP API (#15870 ) This changeset allows Workload Identities to authenticate to all the RPCs that support HTTP API endpoints, for use with PR #15864. * Extends the work done for pre-forwarding authentication to all RPCs that support an HTTP API endpoint. * Consolidates the auth helpers used by the CSI, Service Registration, and Node endpoints that are currently used to support both tokens and client secrets. Intentionally excluded from this changeset: * The Variables endpoint still has custom handling because of the implicit policies. Ideally we'll figure out an efficient way to resolve those into real policies and then we can get rid of that custom handling. * The RPCs that don't currently support auth tokens (i.e. those that don't support HTTP endpoints) have not been updated with the new pre-forwarding auth. We'll be doing this under a separate PR to support RPC rate metrics.	2023-01-25 14:33:06 -05:00
Tim Gross	cf9e5f3327	acl: Fix panic when bogus token is passed (#15863 ) If a consumer of the new `Authenticate` method gets passed a bogus token that's a correctly-shaped UUID, it will correctly get an identity without an ACL token. But most consumers will then panic when they consume this nil `ACLToken` for authorization. Because no API client should ever send a bogus auth token, update the `Authenticate` method to create the identity with remote IP (for metrics tracking) but also return an `ErrPermissionDenied`.	2023-01-25 10:03:17 -05:00
Tim Gross	055434cca9	add metric for count of RPC requests (#15515 ) Implement a metric for RPC requests with labels on the identity, so that administrators can monitor the source of requests within the cluster. This changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire up the remaining RPCs once we've threaded the new pre-forwarding authentication through them all. Note that metrics are measured after we forward but before we return any authentication error. This ensures that we only emit metrics on the server that actually serves the request. We'll perform rate limiting at the same place. Includes telemetry configuration to omit identity labels.	2023-01-24 11:54:20 -05:00
Tim Gross	2030d62920	implement pre-forwarding auth on select RPCs (#15513 ) In #15417 we added a new `Authenticate` method to the server that returns an `AuthenticatedIdentity` struct. This changeset implements this method for a small number of RPC endpoints that together represent all the various ways in which RPCs are sent, so that we can validate that we're happy with this approach.	2023-01-24 10:52:07 -05:00
Michael Schurter	ace5faf948	core: backoff considerably when worker is behind raft (#15523 ) Upon dequeuing an evaluation workers snapshot their state store at the eval's wait index or later. This ensures we process an eval at a point in time after it was created or updated. Processing an eval on an old snapshot could cause any number of problems such as: 1. Since job registration atomically updates an eval and job in a single raft entry, scheduling against indexes before that may not have the eval's job or may have an older version. 2. The older the scheduler's snapshot, the higher the likelihood something has changed in the cluster state which will cause the plan applier to reject the scheduler's plan. This could waste work or even cause evals to be failed needlessly. However, the workers run in parallel with a new server pulling the cluster state from a peer. During this time, which may be many minutes long, the state store is likely far behind the minimum index required to process evaluations. This PR addresses this by adding an additional long backoff period after an eval is nacked. If the scheduler's indexes catch up within the additional backoff, it will unblock early to dequeue the next eval. When the server shuts down we'll get a `context.Canceled` error from the state store method. We need to bubble this error up so that other callers can detect it. Handle this case separately when waiting after dequeue so that we can warn on shutdown instead of throwing an ambiguous error message with just the text "canceled." While there may be more precise ways to block scheduling until the server catches up, this approach adds little risk and covers additional cases where a server may be temporarily behind due to a spike in load or a saturated network. For testing, we make the `raftSyncLimit` into a parameter on the worker's `run` method so that we can run backoff tests without waiting 30+ seconds.
We haven't followed thru and made all the worker globals into worker parameters, because there isn't much use outside of testing, but we can consider that in the future. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-01-24 08:56:35 -05:00
Tim Gross	a51149736d	Rename `nomad.broker.total_blocked` metric (#15835 ) This changeset fixes a long-standing point of confusion in metrics emitted by the eval broker. The eval broker has a queue of "blocked" evals that are waiting for an in-flight ("unacked") eval of the same job to be completed. But this "blocked" state is not the same as the `blocked` status that we write to raft and expose in the Nomad API to end users. There's a second metric `nomad.blocked_eval.total_blocked` that refers to evaluations in that state. This has caused ongoing confusion in major customer incidents and even in our own documentation! (Fixed in this PR.) There's little functional change in this PR aside from the name of the metric emitted, but there's a bit of refactoring to clean up the names in `eval_broker.go` so that there aren't name collisions and multiple names for the same state. Changes included are: * Everything that was previously called "pending" referred to entities that were associated with the "ready" metric. These are all now called "ready" to match the metric. * Everything named "blocked" in `eval_broker.go` is now named "pending", except for a couple of comments that actually refer to blocked RPCs. * Added a note to the upgrade guide docs for 1.5.0. * Fixed the scheduling performance metrics docs because the description for `nomad.broker.total_blocked` was actually the description for `nomad.blocked_eval.total_blocked`.	2023-01-20 14:23:56 -05:00
Charlie Voiselle	5ea1d8a970	Add raft snapshot configuration options (#15522 ) * Add config elements * Wire in snapshot configuration to raft * Add hot reload of raft config * Add documentation for new raft settings * Add changelog	2023-01-20 14:21:51 -05:00
Seth Hoenig	d2d8ebbeba	consul: correctly interpret missing consul checks as unhealthy (#15822 ) * consul: correctly understand missing consul checks as unhealthy This PR fixes a bug where Nomad assumed any registered Checks would exist in the service registration coming back from Consul. In some cases, Consul may be slow in processing the check registration, and the response object would not contain checks. Nomad would then scan the empty response looking for Checks with failing health status, finding none, and then marking a task/alloc as healthy. In reality, we must always use Nomad's view of what checks should exist as the source of truth, and compare that with the response Consul gives us, making sure they match, before scanning the Consul response for failing check statuses. Fixes #15536 * consul: minor CR refactor using maps not sets * consul: observe transition from healthy to unhealthy checks * consul: spell healthy correctly	2023-01-19 14:01:12 -06:00

4370 commits