open-nomad

Commit Graph

Author	SHA1	Message	Date
hc-github-team-nomad-core	4f087674f4	backport of commit 7fe432042eaa0a97c0aaa40d302055eb18e8a9b0 (#18040 ) This pull request was automerged via backport-assistant	2023-07-24 02:28:28 -05:00
hc-github-team-nomad-core	30260f06e8	Backport of state: canonicalize namespace on restore into release/1.6.x (#18018 ) This pull request was automerged via backport-assistant	2023-07-20 15:05:16 -05:00
hc-github-team-nomad-core	e891026755	Backport of CSI: improve controller RPC reliability into release/1.6.x (#18015 ) This pull request was automerged via backport-assistant	2023-07-20 13:52:27 -05:00
hc-github-team-nomad-core	c67a225882	Prepare for next release	2023-07-18 18:51:15 +00:00
hc-github-team-nomad-core	609a97cfab	Generate files for 1.6.0 release	2023-07-18 18:51:11 +00:00
Tim Gross	e8bfef8148	search: fix ACL filtering for plugins and variables ACL permissions for the search endpoints are done in three passes. The first (the `sufficientSearchPerms` method) is for performance and coarsely rejects requests based on the passed-in context parameter if the user has no permissions to any object in that context. The second (the `filteredSearchContexts` method) filters out contexts based on whether the user has permissions either to the requested namespace or again by context (to catch the "all" context). Finally, when iterating over the objects available, we do the usual filtering in the iterator. Internal testing found several bugs in this filtering: * CSI plugins can be searched by any authenticated user. * Variables can be searched if the user has `job:read` permissions to the variable's namespace instead of `variable:list`. * Variables cannot be searched by wildcard namespace. This is an information leak of the plugin names and variable paths, which we don't consider to be privileged information but intended to protect anyways. This changeset fixes these bugs by ensuring CSI plugins are filtered in the 1st and 2nd pass ACL filters, and changes variables to check `variable:list` in the 2nd pass filter unless the wildcard namespace is passed (at which point we'll fall back to filtering in the iterator). Fixes: CVE-2023-3300 Fixes: #17906	2023-07-18 12:09:55 -04:00
hc-github-team-nomad-core	0951fe1c50	backport of commit 0a5e90120b18ff450457463d6bcee68ec6804bb0 (#17900 ) This pull request was automerged via backport-assistant	2023-07-11 10:00:05 -05:00
Lance Haig	0455389534	Add the ability to customise the details of the CA (#17309 ) Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-07-11 08:53:09 +01:00
Tim Gross	ad7355e58b	CSI: persist previous mounts on client to restore during restart (#17840 ) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028	2023-07-10 13:20:15 -04:00
Tim Gross	5025731ebe	consul: handle "not found" errors from Consul when deleting tokens (#17847 ) In Consul 1.15.0, the Delete Token API was changed so as to return an error when deleting a non-existent ACL token. This means that if Nomad successfully deletes the token but fails to persist that fact, it will get stuck trying to delete a non-existent token forever. Update the token deletion function to ignore "not found" errors and treat them as successful deletions. Fixes: #17833	2023-07-07 16:22:13 -04:00
James Rasell	45073e8a05	job: ensure node pool is canonicalized for state restores. (#17765 )	2023-06-30 07:37:22 +01:00
nicoche	649831c1d3	deploymentwatcher: fail early whenever possible (#17341 ) Given a deployment that has a `progress_deadline`, if a task group runs out of reschedule attempts, allow it to fail at this time instead of waiting until the `progress_deadline` is reached. Fixes: #17260	2023-06-26 14:01:03 -04:00
James Rasell	74ab0badb4	test: add drain config tests. (#17724 )	2023-06-26 16:23:13 +01:00
Luiz Aoqui	66962b2b28	np: fix list of jobs for node pool `all` (#17705 ) Unlike nodes, jobs are allowed to be registered in the node pool `all`, in which case all nodes are used for evaluating placements. When listing jobs for the `all` node pool only those that are explicitly in this node pool should be returned.	2023-06-23 15:47:53 -04:00
grembo	7936c1e33f	Add `disable_file` parameter to job's `vault` stanza (#13343 ) This complements the `env` parameter, so that the operator can author tasks that don't share their Vault token with the workload when using `image` filesystem isolation. As a result, more powerful tokens can be used in a job definition, allowing it to use template stanzas to issue all kinds of secrets (database secrets, Vault tokens with very specific policies, etc.), without sharing that issuing power with the task itself. This is accomplished by creating a directory called `private` within the task's working directory, which shares many properties of the `secrets` directory (tmpfs where possible, not accessible by `nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the container. If the `disable_file` parameter is set to `false` (its default), the Vault token is also written to the NOMAD_SECRETS_DIR, so the default behavior is backwards compatible. Even if the operator never changes the default, they will still benefit from the improved behavior of Nomad never reading the token back in from that - potentially altered - location.	2023-06-23 15:15:04 -04:00
James Rasell	78cdf0d0d8	server: remove unused endpoints struct. (#17665 )	2023-06-23 08:20:33 +01:00
Luiz Aoqui	8f05eaaa68	np: check for license on RPC endpoints (#17656 )	2023-06-22 12:52:20 -04:00
Tim Gross	11216d09af	client: send node secret with every client-to-server RPC (#16799 ) In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the request came thru a client node first. But this fix broke (knowingly) the identification of many client-to-server RPCs. These will be now measured as if they were anonymous. The reason for this is that many client-to-server RPCs do not send the node secret and instead rely on the protection of mTLS. This changeset ensures that the node secret is being sent with every client-to-server RPC request. In a future version of Nomad we can add enforcement on the server side, but this was left out of this changeset to reduce risks to the safe upgrade path. Sending the node secret as an auth token introduces a new problem during initial introduction of a client. Clients send many RPCs concurrently with `Node.Register`, but until the node is registered the node secret is unknown to the server and will be rejected as invalid. This causes permission denied errors. To fix that, this changeset introduces a gate on having successfully made a `Node.Register` RPC before any other RPCs can be sent (except for `Status.Ping`, which we need earlier but which also ignores the error because that handler doesn't do an authorization check). This ensures that we only send requests with a node secret already known to the server. This also makes client startup a little easier to reason about because we know `Node.Register` must succeed first, and it should make for a good place to hook in future plans for secure introduction of nodes. The tradeoff is that an existing client that has running allocs will take slightly longer (a second or two) to transition to ready after a restart, because the transition in `Node.UpdateStatus` is gated at the server by first submitting `Node.UpdateAlloc` with client alloc updates.	2023-06-22 11:06:49 -04:00
James Rasell	4e2d019639	variables: remove unused state store functions. (#17660 )	2023-06-22 13:54:58 +01:00
James Rasell	71fdd7e891	core: use faster concatenation for alloc name generation. (#17591 )	2023-06-22 07:46:28 +01:00
Luiz Aoqui	ac08fc751b	node pools: apply node pool scheduler configuration (#17598 )	2023-06-21 20:31:50 -04:00
Tim Gross	ff9ba8ff73	scheduler: tolerate having only one dynamic port available (#17619 ) If the dynamic port range for a node is set so that the min is equal to the max, there's only one port available and this passes config validation. But the scheduler panics when it tries to pick a random port. Only add the randomness when there's more than one to pick from. Adds a test for the behavior but also adjusts the commentary on a couple of the existing tests that made it seem like this case was already covered if you didn't look too closely. Fixes: #17585	2023-06-20 13:29:25 -04:00
Luiz Aoqui	2f5df1d8a4	test: add MultiregionMinJob mock (#17614 )	2023-06-20 10:57:02 -04:00
James Rasell	86e4c6cb9d	state: move variables tests to use must library. (#17609 )	2023-06-20 15:46:16 +01:00
James Rasell	68df578c73	state: remove vague scaling event schema todo item. (#17610 )	2023-06-20 15:22:11 +01:00
Luiz Aoqui	a56b10e857	chore: fix typo and copyright header (#17605 )	2023-06-20 10:09:47 -04:00
Luiz Aoqui	cfb3bb517f	np: scheduler configuration updates (#17575 ) * jobspec: rename node pool scheduler_configuration In HCL specifications we usually call configuration blocks `config` instead of `configuration`. * np: add memory oversubscription config * np: make scheduler config ENT	2023-06-19 11:41:46 -04:00
Luiz Aoqui	d5aa72190f	node pools: namespace integration (#17562 ) Add structs and fields to support the Nomad Pools Governance Enterprise feature of controlling node pool access via namespaces. Nomad Enterprise allows users to specify a default node pool to be used by jobs that don't specify one. In order to accomplish this, it's necessary to distinguish between a job that explicitly uses the `default` node pool and one that did not specify any. If the `default` node pool is set during job canonicalization it's impossible to do this, so this commit allows a job to have an empty node pool value during registration but sets to `default` at the admission controller mutator. In order to guarantee state consistency the state store validates that the job node pool is set and exists before inserting it.	2023-06-16 16:30:22 -04:00
Luiz Aoqui	bdc7f3305f	rpc: fix log message in Node.UpdateStatus (#17537 )	2023-06-14 16:51:46 -04:00
Luiz Aoqui	bc17cffaef	node pool: node pool upsert on multiregion node register (#17503 ) When registering a node with a new node pool in a non-authoritative region we can't create the node pool because this new pool will not be replicated to other regions. This commit modifies the node registration logic to only allow automatic node pool creation in the authoritative region. In non-authoritative regions, the client is registered, but the node pool is not created. The client is kept in the `initializing` status until its node pool is created in the authoritative region and replicated to the client's region.	2023-06-13 11:28:28 -04:00
Tim Gross	952eb2713e	node pools: protect against deleting occupied pools (#17457 ) We don't want to delete node pools that have nodes or non-terminal jobs. Add a check in the `DeleteNodePools` RPC to check locally and in federated regions, similar to how we check that it's safe to delete namespaces.	2023-06-13 09:57:42 -04:00
Tim Gross	e8a361310f	node pools: replicate from authoritative region (#17456 ) Upserts and deletes of node pools are forwarded to the authoritative region, just like we do for namespaces, quotas, ACL policies, etc. Replicate node pools from the authoritative region.	2023-06-12 13:24:24 -04:00
Tim Gross	bb7f0edd6a	node pools: prevent panic on upsert during upgrades (#17474 ) Whenever we write a Raft log entry for node pools, we need to first make sure that all servers can safely apply the log without panicking. Gate upsert and delete RPCs on all servers being upgraded to the minimum version.	2023-06-12 09:01:30 -04:00
Tim Gross	e3a37c0b97	replication: fix potential panic during upgrades (#17476 ) If the authoritative region has been upgraded to a version of Nomad that has new replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the non-authoritative regions will start replicating those objects as soon as their leader is upgraded. If a server in the non-authoritative region is upgraded and then becomes the leader before all the other servers in the region have been upgraded, then it will attempt to write a Raft log entry that the followers don't understand. The followers will then panic. Add the same minimum version checks that we do for RPC writes to the leader's replication loop.	2023-06-12 08:53:56 -04:00
Tim Gross	fbaf4c8b69	node pools: implement support in scheduler (#17443 ) Implement scheduler support for node pool: * When a scheduler is invoked, we get a set of the ready nodes in the DCs that are allowed for that job. Extend the filter to include the node pool. * Ensure that changes to a job's node pool are picked up as destructive allocation updates. * Add `NodesInPool` as a metric to all reporting done by the scheduler. * Add the node-in-pool the filter to the `Node.Register` RPC so that we don't generate spurious evals for nodes in the wrong pool.	2023-06-07 10:39:03 -04:00
Tim Gross	c0f2295510	node pools: implement HTTP API to list jobs in pool (#17431 ) Implements the HTTP API associated with the `NodePool.ListJobs` RPC, including the `api` package for the public API and documentation. Update the `NodePool.ListJobs` RPC to fix the missing handling of the special "all" pool.	2023-06-06 11:40:13 -04:00
Luiz Aoqui	2420c93179	node pools: list nodes in pool (#17413 )	2023-06-06 10:43:43 -04:00
Luiz Aoqui	aa1b33d157	node pools: add event stream support (#17412 )	2023-06-06 10:14:47 -04:00
Tim Gross	2d16ec6c6f	node pools: implement RPC to list jobs in a given node pool (#17396 ) Implements the `NodePool.ListJobs` RPC, with pagination and filtering based on the existing `Job.List` RPC.	2023-06-05 15:36:52 -04:00
Luiz Aoqui	700168e136	node pools: fix node upsert and state mutation tests (#17430 )	2023-06-05 14:58:32 -04:00
Luiz Aoqui	6039c18ab6	node pools: register a node in a node pool (#17405 )	2023-06-02 17:50:50 -04:00
Luiz Aoqui	3a962d07f8	np: fix node pool search permission check (#17400 ) When checking if a token is allowed to query the search endpoints we need to return an error if the search context includes `node_pool` and the token doesn't have access to _any_ pool. This prevents returning an empty list instead of a permission denied error.	2023-06-02 12:22:47 -04:00
Samantha	b92a782b6e	check: Add support for Consul field tls_server_name (#17334 )	2023-06-02 10:19:12 -04:00
Tim Gross	56e9b944e8	node pools: validate pool exists on job registration (#17386 ) Add a new job admission hook for node pools that enforces the pool exists on registration. Also provide the skeleton function we need for Enterprise enforcement functions we'll implement later.	2023-06-02 09:32:07 -04:00
Luiz Aoqui	f755b9469f	core: refactor task validation (#17344 ) Move all validations related to task fields to Task.Validate(). Prior to this, some task validations were being done inside TaskGroup.Validate() because they required access to some group values. But similarly to how TaskGroup.Validate() takes the job as a parameter, it's fair to expect the task to receive its group.	2023-06-01 19:26:42 -04:00
Luiz Aoqui	4be8d7c049	core: fix kill_timeout validation when progress_deadline is 0 (#17342 )	2023-06-01 19:01:32 -04:00
Luiz Aoqui	9bb57c08e3	node pool: add search support (#17385 )	2023-06-01 17:48:14 -04:00
Tim Gross	4f14fa0518	node pools: add `node_pool` field to job spec (#17379 ) This changeset only adds the `node_pool` field to the jobspec, and ensures that it gets picked up correctly as a change. Without the rest of the implementation landed yet, the field will be ignored.	2023-06-01 16:08:55 -04:00
Luiz Aoqui	c61e75f302	node pools: add CRUD API (#17384 )	2023-06-01 15:55:49 -04:00
Seth Hoenig	acfdf0f479	compliance: add headers with fixed copywrite tool (#17353 ) Closes #17117	2023-05-30 09:20:32 -05:00
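
Commit ff9ba8ff73 above describes a scheduler panic when a node's dynamic port range has its min equal to its max, and fixes it by adding randomness only when there is more than one port to pick from. A minimal Go sketch of that guard follows. The function name `pickDynamicPort` and its signature are hypothetical and are not taken from the Nomad source; the only library fact relied on is that `rand.Intn` panics when its argument is less than or equal to zero, which is what a draw over a zero-width range would hit.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickDynamicPort is a hypothetical helper (not Nomad's actual function).
// It returns a port in the inclusive range [lo, hi].
func pickDynamicPort(lo, hi int) (int, error) {
	if lo > hi {
		return 0, fmt.Errorf("invalid dynamic port range %d-%d", lo, hi)
	}
	if lo == hi {
		// Exactly one candidate port. A draw such as rand.Intn(hi-lo)
		// would panic here because rand.Intn requires n > 0, so the
		// random draw is skipped entirely.
		return lo, nil
	}
	// More than one candidate: pick uniformly from the inclusive range.
	return lo + rand.Intn(hi-lo+1), nil
}

func main() {
	// A range whose min equals its max is valid and yields that one port.
	port, err := pickDynamicPort(20000, 20000)
	fmt.Println(port, err)
}
```

The explicit `lo == hi` branch is redundant with the `+1` in the inclusive draw, but it keeps the single-port case visible, which is the case the commit message says the earlier tests only appeared to cover.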


4388 Commits