Commit graph

87 commits

Author SHA1 Message Date
Luiz Aoqui 2420c93179
node pools: list nodes in pool (#17413) 2023-06-06 10:43:43 -04:00
Tim Gross 2d16ec6c6f
node pools: implement RPC to list jobs in a given node pool (#17396)
Implements the `NodePool.ListJobs` RPC, with pagination and filtering based on
the existing `Job.List` RPC.
2023-06-05 15:36:52 -04:00
Luiz Aoqui 389212bfda
node pool: initial base work (#17163)
Implementation of the base work for the new node pools feature. It includes a new `NodePool` struct and its corresponding state store table.

Upon start the state store is populated with two built-in node pools that cannot be modified nor deleted:

  * `all` is a node pool that always includes all nodes in the cluster.
  * `default` is the node pool where nodes that don't specify a node pool in their configuration are placed.
2023-05-15 10:49:08 -04:00
Seth Hoenig 81e36b3650
core: eliminate second index on job_submissions table (#17146)
* core: eliminate second index on job_submissions table

This PR refactors the job_submissions state store code to eliminate the
use of a second index formerly used for purging all versions of a given
job. In practice we ended up with duplicate entries on the table. Instead,
use index prefix scanning on the primary index and tidy up any potential
for creating (or removing) duplicates.

* core: pr comments followup
2023-05-11 09:51:08 -05:00
Seth Hoenig ba728f8f97
api: enable support for setting original job source (#16763)
* api: enable support for setting original source alongside job

This PR adds support for setting job source material along with
the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.

The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.

The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.

* api: avoid writing var content to disk for parsing

* api: move submission validation into RPC layer

* api: return an error if updating a job submission without namespace or job id

* api: be exact about the job index we associate a submission with (modify)

* api: reword api docs scheduling

* api: prune all but the last 6 job submissions

* api: protect against nil job submission in job validation

* api: set max job source size in test server

* api: fixups from pr
2023-04-11 08:45:08 -05:00
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
James Rasell 3c941c6bc3
acl: add binding rule object state schema and functionality. (#15511)
This change adds a new table that will store ACL binding rule
objects. The two indexes allow fast lookups by their ID, or by
which auth method they are linked to. Snapshot persist and
restore functionality ensures this table can be saved and
restored from snapshots.

In order to write and delete the object to state, new Raft messages
have been added.

All RPC request and response structs, along with object functions
such as diff and canonicalize have been included within this work
as it is nicely separated from the other areas of work.
2022-12-14 08:48:18 +01:00
Piotr Kazmierczak 4851f9e68a
acl: sso auth method schema and store functions (#15191)
This PR implements ACLAuthMethod type, acl_auth_methods table schema and crud state store methods. It also updates nomadSnapshot.Persist and nomadSnapshot.Restore methods in order for them to work with the new table, and adds two new Raft messages: ACLAuthMethodsUpsertRequestType and ACLAuthMethodsDeleteRequestType

This PR is part of the SSO work captured under ☂️ ticket #13120.
2022-11-10 19:42:41 +01:00
Tim Gross 903b5baaa4
keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
  that returns an error, to make the workflow more clear and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialize. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
2022-11-01 15:00:50 -04:00
James Rasell 755b4745ed
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-30 08:59:13 +01:00
Tim Gross 1dc053b917 rename SecureVariables to Variables throughout 2022-08-26 16:06:24 -04:00
James Rasell 601588df6b
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-25 12:14:29 +01:00
Tim Gross bf57d76ec7
allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
James Rasell e660c9a908
core: add ACL role state schema and functionality. (#13955)
This commit includes the new state schema for ACL roles along with
state interaction functions for CRUD actions.

The change also includes snapshot persist and restore
functionality and the addition of FSM messages for Raft updates
which will come via RPC endpoints.
2022-08-09 09:33:41 +02:00
James Rasell 663aa92b7a
Merge branch 'main' into f-gh-13120-sso-umbrella 2022-08-02 08:30:03 +01:00
James Rasell 0cde3182eb
core: add ACL token expiry state, struct, and RPC handling. (#13718)
The ACL token state schema has been updated to utilise two new
indexes which track expiration of tokens that are configured with
an expiration TTL or time. A new state function allows listing
ACL expired tokens which will be used by internal garbage
collection.

The ACL endpoint has been modified so that all validation happens
within a single function call. This is easier to understand and
see at a glance. The ACL token validation now also includes logic
for expiry TTL and times. The ACL endpoint upsert tests have been
condensed into a single, table driven test.

There is a new token canonicalize which provides a single place
for token canonicalization, rather than logic spread in the RPC
handler.
2022-07-13 15:40:34 +02:00
Charlie Voiselle 1fe080c6de Implement HTTP search API for Variables (#13257)
* Add Path only index for SecureVariables
* Add GetSecureVariablesByPrefix; refactor tests
* Add search for SecureVariables
* Add prefix search for secure variables
2022-07-11 13:34:05 -04:00
Charlie Voiselle 06c6a950c4 Secure Variables: Seperate Encrypted and Decrypted structs (#13355)
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.

* Make Encrypt function return KeyID
* Split SecureVariable

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:05 -04:00
Tim Gross 973b474b3c provide state store query for variables by key ID (#13195)
The core jobs to garbage collect unused keys and perform full key
rotations will need to be able to query secure variables by key ID for
efficiency. Add an index to the state store and associated query
function and test.
2022-07-11 13:34:04 -04:00
Tim Gross d29e85d150 secure variables: initial state store (#12932)
Implement the core SecureVariable and RootKey structs in memdb,
provide the minimal skeleton for FSM, and a dummy storage and keyring
RPC endpoint.
2022-07-11 13:34:01 -04:00
James Rasell a646333263
Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
Luiz Aoqui ab8ce87bba
Add pagination, filtering and sort to more API endpoints (#12186) 2022-03-08 20:54:17 -05:00
Luiz Aoqui 01931587ba
api: paginated results with different ordering (#12128)
The paginator logic was built when go-memdb iterators would return items
ordered lexicographically by their ID prefixes, but #12054 added the
option for some tables to return results ordered by their `CreateIndex`
instead, which invalidated the previous paginator assumption.

The iterator used for pagination must still return results in some order
so that the paginator can properly handle requests where the next_token
value is not present in the results anymore (e.g., the eval was GC'ed).

In these situations, the paginator will start the returned page in the
first element right after where the requested token should've been.

This commit moves the logic to generate pagination tokens from the
elements being paginated to the iterator itself so that callers can have
more control over the token format to make sure they are properly
ordered and stable.

It also allows configuring the paginator as being ordered in ascending
or descending order, which is relevant when looking for a token that may
not be present anymore.
2022-03-01 15:36:49 -05:00
James Rasell cf0b63d561
state: add the table schema for the service_registrations table. 2022-02-28 10:14:10 +01:00
Seth Hoenig 40c714a681 api: return sorted results in certain list endpoints
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.

- Eval.List
- Deployment.List
2022-02-15 13:48:28 -06:00
Seth Hoenig 3371214431 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Tim Gross 6b2c4b56d0 state store updates for one-time tokens
The `OneTimeToken` struct is to support the `nomad ui -login` command. This
changeset adds the struct to the Nomad state store.
2021-03-10 08:17:56 -05:00
Kris Hicks 0a3a748053
Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Drew Bailey a0b7f05a7b
Remove Managed Sinks from Nomad (#9470)
* Remove Managed Sinks from Nomad

Managed Sinks were a beta feature in Nomad 1.0-beta2. During the beta
period it was determined that this was not a scalable approach to
support community and third party sinks.

* update comment

* changelog
2020-11-30 14:00:31 -05:00
Drew Bailey 1ae39a9ed9
event sink crud operation api (#9155)
* network sink rpc/api plumbing

state store methods and restore

upsert sink test

get sink

delete sink

event sink list and tests

go generate new msg types

validate sink on upsert

* go generate
2020-10-23 14:23:00 -04:00
Michael Schurter c2dd9bc996 core: open source namespaces 2020-10-22 15:26:32 -07:00
Drew Bailey f3dcefe5a9
remove event durability (#9147)
* remove event durability

temporarily removing go-memdb event durability until a new strategy is developed on how to best handled increased durability needs

* drop events table schema and state store methods

* fix neweventbuffer invocations
2020-10-22 12:21:03 -04:00
Drew Bailey 9d48818eb8
writetxn can return error, add alloc and job generic events. Add events
table for durability
2020-10-14 12:44:39 -04:00
Luiz Aoqui 88d4eecfd0
add scaling policy type 2020-09-29 17:57:46 -04:00
Chris Baker b2ab42afbb scaling api: more testing around the scaling events api 2020-04-01 16:39:23 +00:00
Chris Baker 40d6b3bbd1 adding raft and state_store support to track job scaling events
updated ScalingEvent API to record "message string,error bool" instead
of confusing "reason,error *string"
2020-04-01 16:15:14 +00:00
Chris Baker abc7a52f56 finished refactoring state store, schema, etc 2020-03-24 13:57:14 +00:00
Chris Baker 6665d0bfb0 wip: added policy get endpoint, added UUID to policy 2020-03-24 13:55:20 +00:00
Chris Baker 65d92f1fbf WIP: adding ScalingPolicy to api/structs and state store 2020-03-24 13:55:18 +00:00
Lang Martin 3621df1dbf csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
2020-03-23 13:59:25 -04:00
Lang Martin 88316208a0 csi: server-side plugin state tracking and api (#6966)
* structs: CSIPlugin indexes jobs acting as plugins and node updates

* schema: csi_plugins table for CSIPlugin

* nomad: csi_endpoint use vol.Denormalize, plugin requests

* nomad: csi_volume_endpoint: rename to csi_endpoint

* agent: add CSI plugin endpoints

* state_store_test: use generated ids to avoid t.Parallel conflicts

* contributing: add note about registering new RPC structs

* command: agent http register plugin lists

* api: CSI plugin queries, ControllerHealthy -> ControllersHealthy

* state_store: copy on write for volumes and plugins

* structs: copy on write for volumes and plugins

* state_store: CSIVolumeByID returns an unhealthy volume, denormalize

* nomad: csi_endpoint use CSIVolumeDenormalizePlugins

* structs: remove struct errors for missing objects

* nomad: csi_endpoint return nil for missing objects, not errors

* api: return meta from Register to avoid EOF error

* state_store: CSIVolumeDenormalize keep allocs in their own maps

* state_store: CSIVolumeDeregister error on missing volume

* state_store: CSIVolumeRegister set indexes

* nomad: csi_endpoint use CSIVolumeDenormalizePlugins tests
2020-03-23 13:58:29 -04:00
Lang Martin 5b31b140c3 csi: do not use namespace specific identifiers 2020-03-23 13:58:29 -04:00
Lang Martin 0422b967db schema: csi_volumes schema 2020-03-23 13:58:29 -04:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig 2b66ce93bb nomad: ensure a unique ClusterID exists when leader (gh-6702)
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
2020-01-31 19:03:26 -06:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Preetha Appan 8f7eb61823
Introduce a response object for scheduler configuration 2018-10-30 11:06:32 -05:00
Preetha Appan bd34cbb1f7
Support for new scheduler config API, first use case is to disable preemption 2018-10-30 11:06:32 -05:00
Josh Soref 64546b4f99 spelling: referenced 2018-03-11 18:40:45 +00:00
Josh Soref 7ab998803b spelling: periodic 2018-03-11 18:37:05 +00:00