open-nomad

Author	SHA1	Message	Date
Tim Gross	9dfb51579c	scheduler: refactor system util tests (#16416 ) The tests for the system allocs reconciling code path (`diffSystemAllocs`) include many impossible test environments, such as passing allocs for the wrong node into the function. This makes the test assertions nonsensical for use in walking yourself through the correct behavior. I've pulled this changeset out of PR #16097 so that we can merge these improvements and revisit the right approach to fix the problem in #16097 with less urgency now that the PFNR bug fix has been merged. This changeset breaks up a couple of tests, expands test coverage, and makes test assertions more clear. It also corrects one bit of production code that behaves fine in production because of canonicalization, but forces us to remember to set values in tests to compensate.	2023-03-13 11:59:31 -04:00
Seth Hoenig	630bd8eb68	scheduler: add simple benchmark for tasksUpdated (#16422 ) In preparation for some refactoring to tasksUpdated, add a benchmark to the old code so it's easy to compare with the changes, making sure nothing goes off the rails for performance.	2023-03-13 10:44:14 -05:00
Seth Hoenig	b3cec771d6	deps: remove replace statement for go-discover (#16304 ) Which we no longer need since we no longer have consul as a dependency	2023-03-13 10:40:35 -05:00
Tim Gross	c156640e84	Merge pull request #16445 from hashicorp/post-1.5.1-release Post 1.5.1 release	2023-03-13 11:29:49 -04:00
Tim Gross	d0aa105087	Merge release 1.5.1 files	2023-03-13 11:15:04 -04:00
hc-github-team-nomad-core	2d1a4d90e9	Prepare for next release	2023-03-13 11:13:27 -04:00
hc-github-team-nomad-core	35167e692a	Generate files for 1.5.1 release	2023-03-13 11:13:27 -04:00
Tim Gross	1cf28996e7	acl: prevent privilege escalation via workload identity ACL policies can be associated with a job so that the job's Workload Identity can have expanded access to other policy objects, including other variables. Policies set on the variables the job automatically has access to were ignored, but this includes policies with `deny` capabilities. Additionally, when resolving claims for a workload identity without any attached policies, the `ResolveClaims` method returned a `nil` ACL object, which is treated similarly to a management token. While this was safe in Nomad 1.4.x, when the workload identity token was exposed to the task via the `identity` block, this allows a user with `submit-job` capabilities to escalate their privileges. We originally implemented automatic workload access to Variables as a separate code path in the Variables RPC endpoint so that we don't have to generate on-the-fly policies that blow up the ACL policy cache. This is fairly brittle, and the behavior around wildcard paths in policies is also different from the rest of our ACL policies, which is hard to reason about. Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we can push all this logic into the `acl` package and the behavior can be consistent. This will allow a `deny` policy to override automatic access (and probably speed up checks of non-automatic variable access).	2023-03-13 11:13:27 -04:00
Michael Schurter	832bca91a1	e2e fixes: cli output, timing issue, and some cleanups (#16418 ) * e2e: job expects alloc to run until stopped * e2e: fix case changed by #16306 * e2e: couldn't find a bug but improved test+jobspecs	2023-03-10 13:14:51 -08:00
Luiz Aoqui	7305a374e3	allocrunner: fix health check monitoring for Consul services (#16402 ) Services must be interpolated to replace runtime variables before they can be compared against the values returned by Consul.	2023-03-10 14:43:31 -05:00
Juana De La Cuesta	5089f13f1d	cli: add `-json` and `-t` flag for `alloc checks` command (#16405 ) * cli: add -json flag to alloc checks for completion * CLI: Expand test to include testing the json flag for allocation checks * Documentation: Add the checks command * Documentation: Add example for alloc check command * Update website/content/docs/commands/alloc/checks.mdx Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * CLI: Add template flag to alloc checks command * Update website/content/docs/commands/alloc/checks.mdx Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * CLI: Extend test to include -t flag for alloc checks * func: add changelog for added flags to alloc checks * cli[doc]: Make usage section on alloc checks clearer * Update website/content/docs/commands/alloc/checks.mdx Co-authored-by: James Rasell <jrasell@users.noreply.github.com> * Delete modd.conf * cli[doc]: add -t flag to command description for alloc checks --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com> Co-authored-by: Juanita De La Cuesta Morales <juanita.delacuestamorales@juanita.delacuestamorales-LHQ7X0QG9X>	2023-03-10 16:58:53 +01:00
Michael Schurter	0021b282ef	env/aws: update ec2 cpu info data (#16417 ) Update AWS EC2 CPU tables using `make ec2info`	2023-03-09 14:33:21 -08:00
Luiz Aoqui	1aceff7806	cli: remove hard requirement on `list-jobs` (#16380 ) Most job subcommands allow for job ID prefix match as a convenience functionality so users don't have to type the full job ID. But this introduces a hard ACL requirement that the token used to run these commands have the `list-jobs` permission, even if the token has enough permission to execute the basic command action and the user passed an exact job ID. This change softens this requirement by not failing the prefix match in case the request results in a permission denied error and instead using the information passed by the user directly.	2023-03-09 15:00:04 -05:00
Bryce Kalow	3239539526	docs: update content-conformance package (#16412 )	2023-03-09 12:47:46 -06:00
James Rasell	001bca34a6	cli: fix help output format on `job init` command. (#16407 )	2023-03-09 18:17:15 +01:00
Tim Gross	99d46e5a49	scheduling: prevent self-collision in dynamic port network offerings (#16401 ) When the scheduler tries to find a placement for a new allocation, it iterates over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with the ports of all existing allocations and any other allocations already proposed as part of this same evaluation via its `SetAllocs` method. Then we make an "ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive an "offer" in return. The offer will include both static ports and any dynamic port assignments. The `AssignPorts` method was written to support group networks, and it shares code that selects dynamic ports with the original `AssignTaskNetwork` code. `AssignTaskNetwork` can request multiple ports from the bitmap at a time. But `AssignPorts` requests them one at a time and does not account for possible collisions, and doesn't return an error in that case. What happens next varies: 1. If the scheduler doesn't place the allocation on that node, the port conflict is thrown away and there's no problem. 2. If the node is picked and this is the only allocation (or last allocation), the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect. 3. If the node is picked and there are additional allocations in the same eval that iterate over the same node, their call to `SetAllocs` will detect the impossible state and the node will be rejected. This can have the puzzling behavior where a second task group for the job without any networking at all can hit a port collision error! It looks like this bug has existed since we implemented group networks, but there are several factors that add up to making the issue rare for many users yet frustratingly frequent for others: * You're more likely to hit this bug the more tightly packed your range for dynamic ports is. With 12000 ports in the range by default, many clusters can avoid this for a long time. * You're more likely to hit case (3) for jobs with lots of allocations or if a scheduler has to iterate over a large number of nodes, such as with system jobs, jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints. For unlucky combinations of these factors, it's possible that case (3) happens repeatedly, preventing scheduling of a given job until a client state change (ex. restarting the agent so all its allocations are rescheduled elsewhere) re-opens the range of dynamic ports available. This changeset: * Fixes the bug by accounting for collisions in dynamic port selection in `AssignPorts`. * Adds test coverage for `AssignPorts`, expands coverage of this case for the deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a scheduler test for spread scheduling to more easily detect this kind of problem in the future. * Adds a `String()` method to `Bitmap` so that any future "screaming" log lines have a human-readable list of used ports.	2023-03-09 10:09:54 -05:00
Proskurin Kirill	f3ecd1db7c	Updated who-uses-nomad to add Behavox (#16339 )	2023-03-08 19:43:12 -05:00
Seth Hoenig	ff4503aac6	client: disable running artifact downloader as nobody (#16375 ) * client: disable running artifact downloader as nobody This PR reverts a change from Nomad 1.5 where artifact downloads were executed as the nobody user on Linux systems. This was done as an attempt to improve the security model of artifact downloading where third party tools such as git or mercurial would be run as the root user with all the security implications thereof. However, doing so conflicts with Nomad's own advice for securing the Client data directory - which when set up with the recommended directory permissions structure prevents artifact downloads from working as intended. Artifact downloads are at least still now executed as a child process of the Nomad agent, and on modern Linux systems make use of the kernel Landlock feature for limiting filesystem access of the child process. * docs: update upgrade guide for 1.5.1 sandboxing * docs: add cl * docs: add title to upgrade guide fix	2023-03-08 15:58:43 -06:00
Seth Hoenig	2b5efeac04	e2e: setup nomad permissions correctly (client vs. server) (#16399 ) This PR configures - server nodes with a systemd unit running the agent as the nomad service user - client nodes with a root owned nomad data directory	2023-03-08 14:41:08 -06:00
Phil Renaud	b0124ee683	[ui] Fix: New toast notifications no longer last forever (#16384 ) * Removes an errant console.log and corrects a default sticky=true on toast notifications * Default so no need to refault	2023-03-08 14:50:18 -05:00
Lance Haig	35c17b2e56	deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318 ) * Update ioutil deprecated library references to os and io respectively * Deal with the errors produced. Add error handling to filEntry info Add error handling to info	2023-03-08 13:25:10 -06:00
Lance Haig	2332d694bb	deps: Update ioutil library references to os and io respectively for drivers package (#16331 ) * Update ioutil library references to os and io respectively for drivers package No user facing changes so I assume no change log is required * Fix failing tests	2023-03-08 10:31:09 -06:00
Lance Haig	ae256e28d8	Update ioutil library references to os and io respectively for API and Plugins package (#16330 ) No user facing changes so I assume no change log is required	2023-03-08 10:25:09 -06:00
Lance Haig	e89c3d3b36	Update ioutil library references to os and io respectively for e2e helper nomad (#16332 ) No user facing changes so I assume no change log is required	2023-03-08 09:39:03 -06:00
Lance Haig	d9e585b965	Update ioutil library references to os and io respectively for command (#16329 ) No user facing changes so I assume no change log is required	2023-03-08 09:20:04 -06:00
dependabot[bot]	de766a4239	build(deps): bump golang.org/x/crypto from 0.5.0 to 0.7.0 (#16337 ) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.5.0 to 0.7.0. - [Release notes](https://github.com/golang/crypto/releases) - [Commits](https://github.com/golang/crypto/compare/v0.5.0...v0.7.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-03-08 09:14:49 -06:00
Phil Renaud	54bb97f299	Outage recovery link fix (#16365 )	2023-03-07 15:52:26 -05:00
Seth Hoenig	32f8ca6ce3	e2e: fix permissions on nomad data directory (#16376 ) This PR updates the provisioning step where we create /opt/nomad/data, such that it is with 0700 permissions in line with our security guidance.	2023-03-07 14:41:54 -06:00
Seth Hoenig	835365d2a4	docker: fix bug where network pause containers would be erroneously reconciled (#16352 ) * docker: fix bug where network pause containers would be erroneously gc'd * docker: cl: thread context from driver into pause container restoration	2023-03-07 12:17:32 -06:00
James Rasell	05fff34fc8	docs: add 1.5.0, 1.4.5, and 1.3.10 pause regression upgrade note. (#16358 )	2023-03-07 18:29:03 +01:00
James Rasell	7507c92139	cli: support `json` and `t` on `acl binding-rule info` command. (#16357 )	2023-03-07 18:27:02 +01:00
Tim Gross	966c4b1a2d	docs: note that secrets dir is usually mounted `noexec` (#16363 )	2023-03-07 11:57:15 -05:00
Tim Gross	a2ceab3d8c	scheduler: correctly detect inplace update with wildcard datacenters (#16362 ) Wildcard datacenters introduced a bug where a job with any wildcard datacenters will always be treated as a destructive update when we check whether a datacenter has been removed from the jobspec. Includes updating the helper so that callers don't have to loop over the job's datacenters.	2023-03-07 10:05:59 -05:00
Ashlee M Boyer	9af02f3f4a	CI: delete test-link-rewrites.yml (#16354 )	2023-03-06 15:41:01 -05:00
Seth Hoenig	1c8b408a81	deps: update test to 0.6.2 for new functions (#16326 )	2023-03-06 09:24:45 -06:00
Phil Renaud	edf59597d2	[ui] Fix: Wildcard-datacenter system/sysbatch jobs stopped showing client links/chart (#16274 ) * Fix for wildcard DC sys/sysbatch jobs * A few extra modules for wildcard DC in systemish jobs * doesMatchPattern moved to its own util as match-glob * DC glob lookup using matchGlob * PR feedback	2023-03-06 10:06:31 -05:00
Luiz Aoqui	2a1a790820	client: don't emit task shutdown delay event if not waiting (#16281 )	2023-03-03 18:22:06 -05:00
Luiz Aoqui	3f1ea9da4b	api: set last index and request time on alloc stop (#16319 ) Some of the methods in `Allocations()` incorrectly use the `putQuery` in API calls where `put` is more appropriate since they are not reading information back. These methods are also not returning request metadata such as `LastIndex` back to callers, which can be useful to have in some scenarios. They also provide poor developer experience as they take an `api.Allocation` struct when only the allocation ID is necessary. This can lead consumers to make unnecessary API calls to fetch the full allocation. Fixing these problems requires updating the methods' signatures so they take `WriteOptions` instead of `QueryOptions` and return `WriteMeta`, but this is a breaking change that requires advance notice to consumers. This commit adds a future breaking change notice and also fixes the `Stop` method so it properly returns request metadata in a backwards compatible way.	2023-03-03 15:52:41 -05:00
Tim Gross	3c0eaba9db	remove backcompat support for non-atomic job registration (#16305 ) In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the new eval was written in the same raft entry. Backwards-compatibility checks were supposed to have been removed in Nomad 1.1.0, but we missed that. This is long safe to remove.	2023-03-03 15:52:22 -05:00
Luiz Aoqui	40494e64a9	docs: fix alloc stop `no_shutdown_delay` (#16282 )	2023-03-03 14:44:49 -05:00
Luiz Aoqui	1d051d834d	cli: use shared logic for resolving job prefix (#16306 ) Several `nomad job` subcommands had duplicate or slightly similar logic for resolving a job ID from a CLI argument prefix, while others did not have this functionality at all. This commit pulls the shared logic to the command Meta and updates all `nomad job` subcommands to use it.	2023-03-03 14:43:20 -05:00
Tim Gross	8747059b86	service: fix regression in task access to list/read endpoint (#16316 ) When native service discovery was added, we used the node secret as the auth token. Once Workload Identity was added in Nomad 1.4.x we needed to use the claim token for `template` blocks, and so we allowed valid claims to bypass the ACL policy check to preserve the existing behavior. (Invalid claims are still rejected, so this didn't widen any security boundary.) In reworking authentication for 1.5.0, we unintentionally removed this bypass. For WIs without a policy attached to their job, everything works as expected because the resulting `acl.ACL` is nil. But once a policy is attached to the job the `acl.ACL` is no longer nil and this causes permissions errors. Fix the regression by adding back the bypass for valid claims. In future work, we should strongly consider turning the implicit policies into real `ACLPolicy` objects (even if not stored in state) so that we don't have these kinds of brittle exceptions to the auth code.	2023-03-03 11:41:19 -05:00
Dao Thanh Tung	62a69552c1	api: add new test case for force-leave (#16260 ) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-03-03 10:38:40 -05:00
Aofei Sheng	e81fecdd1f	docs: fix typos in task-api.mdx and workload-identity.mdx (#16309 )	2023-03-03 08:37:59 -05:00
Valentino	1f9d11feff	Add namespace argument to the job verification help text (#16243 )	2023-03-02 16:42:14 -05:00
Dao Thanh Tung	ed31e0a5f5	cli: sort Node value in `nomad operator raft list-peers` command (#16221 ) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-03-02 16:16:30 -05:00
Michael Schurter	4b01df1787	Merge pull request #16293 from hashicorp/post-1.5.0-release admin: Post 1.5.0 release	2023-03-02 12:44:49 -08:00
Phil Renaud	93574ce085	[ui, helios] Toast Component (#16099 ) * Template and styles * @type to @color on flash messages * Notifications service as wrapper * Test cases updated for new notifs	2023-03-02 13:52:16 -05:00
Tim Gross	0e1b554299	handle `FSM.Apply` errors in `raftApply` (#16287 ) The signature of the `raftApply` function requires that the caller unwrap the first returned value (the response from `FSM.Apply`) to see if it's an error. This puts the burden on the caller to remember to check two different places for errors, and we've done so inconsistently. Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error as the error value. Similar work was done in Consul in https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate and surfaces a few minor bugs in the process: * job deregistrations of already-GC'd jobs were still emitting evals * reconcile job summaries does not return scheduler errors * node updates did not report errors associated with inconsistent service discovery or CSI plugin states Note that although _most_ of the `FSM.Apply` functions return only errors (which makes it tempting to remove the first return value entirely), there are a few that return `bool` for some reason and Variables relies on the response value for proper CAS checking.	2023-03-02 13:51:09 -05:00
Tim Gross	f3b5952c3e	deps: update go-plugin to 1.4.9 (#16292 ) Fixes #16288. An earlier version of `go-plugin` introduced a warning log if `SecureConfig` is unset. For Nomad and other applications that have "internal" `go-plugin` consumers where the application runs itself as a plugin, this causes spurious warn-level logs. For Nomad in particular this means every task driver and logmon invocation emits the log, which is our primary operation. The change was reverted upstream, so this changeset picks up the reverted version.	2023-03-02 13:39:57 -05:00
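
The unwrapping described in commit 0e1b554299 above (returning an `FSM.Apply` error as the error value rather than leaving it in the response) might look roughly like the sketch below. This is an illustration only, not Nomad's code: `applyFuture`, `unwrapApply`, and `errFSM` are hypothetical names, and the struct merely stands in for whatever the Raft layer returns.

```go
package main

import (
	"errors"
	"fmt"
)

// errFSM is a hypothetical sentinel standing in for an error returned by FSM.Apply.
var errFSM = errors.New("fsm: apply failed")

// applyFuture is a hypothetical stand-in for the result of a Raft apply:
// raftErr is an error from the Raft layer itself, and response is the value
// returned by FSM.Apply, which may itself be an error.
type applyFuture struct {
	raftErr  error
	response interface{}
}

// unwrapApply checks both places an error can hide and reports either one
// as the error value, so callers only have to check err.
func unwrapApply(f applyFuture) (interface{}, error) {
	if f.raftErr != nil {
		return nil, f.raftErr
	}
	if err, ok := f.response.(error); ok && err != nil {
		return nil, err
	}
	// Non-error responses (for example a bool) are passed through unchanged.
	return f.response, nil
}

func main() {
	// An FSM error is surfaced as err; the response is dropped.
	resp, err := unwrapApply(applyFuture{response: errFSM})
	fmt.Println(resp, err)

	// A non-error response is passed through with a nil error.
	resp, err = unwrapApply(applyFuture{response: true})
	fmt.Println(resp, err)
}
```

Keeping the first return value, rather than returning only an error, matches the note in the commit message that some apply functions return a `bool` and that Variables relies on the response value.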

24392 commits