Commit Graph

190 Commits

Author SHA1 Message Date
Piotr Kazmierczak e48c48e89b acl: RPC endpoints for JWT auth (#15918) 2023-03-30 09:39:56 +02:00
Luiz Aoqui 3479e2231f
core: enforce strict steps for clients reconnect (#15808)
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat the Nomad server will
update its status to `disconnected`.

Upon reconnecting, the client will make three main RPC calls:

- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
  allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
  including its status.

These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run the scheduler may end up
with incomplete data when reconciling allocations.

For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.

When this client comes back the order of events may be:

1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
   the client. The client is now assigned two allocations: the original
   alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
   `running`.
4. Scheduler notices too many allocs and stops the replacement.

This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.

To avoid problems like this clients must update _all_ of its relevant
information before they can be considered `ready` and available for
scheduling.

To achieve this goal the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:

- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
  status until it successfully calls `Node.UpdateAlloc`.

These changes are done server-side to avoid the need of additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.

The verification of whether allocations have been updates is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
2023-01-25 15:53:59 -05:00
James Rasell 3c941c6bc3
acl: add binding rule object state schema and functionality. (#15511)
This change adds a new table that will store ACL binding rule
objects. The two indexes allow fast lookups by their ID, or by
which auth method they are linked to. Snapshot persist and
restore functionality ensures this table can be saved and
restored from snapshots.

In order to write and delete the object to state, new Raft messages
have been added.

All RPC request and response structs, along with object functions
such as diff and canonicalize have been included within this work
as it is nicely separated from the other areas of work.
2022-12-14 08:48:18 +01:00
Piotr Kazmierczak d3aac1fe08
acl: sso auth method snapshot restore test (#15482) 2022-12-06 15:15:29 +01:00
Piotr Kazmierczak bb66b5e770
acl: sso auth method RPC endpoints (#15221)
This PR implements RPC endpoints for SSO auth methods.

This PR is part of the SSO work captured under ☂️ ticket #13120.
2022-11-21 10:15:39 +01:00
Tim Gross 510eb435dc
remove deprecated `AllocUpdateRequestType` raft entry (#15285)
After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft
log entry was no longer in use. Mark this as deprecated, remove the associated
dead code, and remove references to the metrics it emits from the docs. We'll
leave the entry itself just in case we encounter old raft logs that we need to
be able to safely load.
2022-11-17 12:08:04 -05:00
James Rasell 755b4745ed
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-30 08:59:13 +01:00
Tim Gross 1dc053b917 rename SecureVariables to Variables throughout 2022-08-26 16:06:24 -04:00
James Rasell 601588df6b
Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-25 12:14:29 +01:00
James Rasell 802d005ef5
acl: add replication to ACL Roles from authoritative region. (#14176)
ACL Roles along with policies and global token will be replicated
from the authoritative region to all federated regions. This
involves a new replication loop running on the federated leader.

Policies and roles may be replicated at different times, meaning
the policies and role references may not be present within the
local state upon replication upsert. In order to bypass the RPC
and state check, a new RPC request parameter has been added. This
is used by the replication process; all other callers will trigger
the ACL role policy validation check.

There is a new ACL RPC endpoint to allow the reading of a set of
ACL Roles which is required by the replication process and matches
ACL Policies and Tokens. A bug within the ACL Role listing RPC has
also been fixed which returned incorrect data during blocking
queries where a deletion had occurred.
2022-08-22 08:54:07 +02:00
Piotr Kazmierczak b63944b5c1
cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Tim Gross 4005759d28
move secure variable conflict resolution to state store (#13922)
Move conflict resolution implementation into the state store with a new Apply RPC. 
This also makes the RPC for secure variables much more similar to Consul's KV, 
which will help us support soft deletes in a post-1.4.0 version of Nomad.

Reimplement quotas in the state store functions.

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2022-08-15 11:19:53 -04:00
James Rasell e660c9a908
core: add ACL role state schema and functionality. (#13955)
This commit includes the new state schema for ACL roles along with
state interaction functions for CRUD actions.

The change also includes snapshot persist and restore
functionality and the addition of FSM messages for Raft updates
which will come via RPC endpoints.
2022-08-09 09:33:41 +02:00
Charlie Voiselle 06c6a950c4 Secure Variables: Seperate Encrypted and Decrypted structs (#13355)
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.

* Make Encrypt function return KeyID
* Split SecureVariable

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:05 -04:00
Charlie Voiselle 3717688f3e Secure Variables: Variables - State store, FSM, RPC (#13098)
* Secure Variables: State Store
* Secure Variables: FSM
* Secure Variables: RPC
* Secure Variables: HTTP API

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:04 -04:00
James Rasell 0c0b028a59
core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell 9ea1a6faf6
fsm: add service registration snapshot persistence. (#12896) 2022-05-06 15:53:27 +02:00
James Rasell 96d8512c85
test: move remaining tests to use ci.Parallel. 2022-03-24 08:45:13 +01:00
James Rasell a646333263
Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
James Rasell 52283f057f
fsm: add FSM functionality for service registration endpoints. 2022-03-03 11:24:29 +01:00
Isabel Suchanek ab51050ce8
events: fix wildcard namespace handling (#10935)
This fixes a bug in the event stream API where it currently interprets
namespace=* as an actual namespace, not a wildcard. When Nomad parses
incoming requests, it sets namespace to default if not specified, which
means the request namespace will never be an empty string, which is what
the event subscription was checking for. This changes the conditional
logic to check for a wildcard namespace instead of an empty one.

It also updates some event tests to include the default namespace in the
subscription to match current behavior.

Fixes #10903
2021-09-02 09:36:55 -07:00
Chris Baker 436d46bd19
Merge branch 'main' into f-node-drain-api 2021-04-01 15:22:57 -05:00
Mahmood Ali 0c2551270a oversubscription: Add MemoryMaxMB to internal structs
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use in the client. This allows tasks to specify a memory reservation (to be
used by scheduler when placing the task) but use excess memory used on the
client if the client has any.

This commit adds the server tracking for the value, and ensures that allocations
AllocatedResource fields include the value.
2021-03-30 16:55:58 -04:00
Chris Baker 99ef00d5da reinserted/expanded fsm node.canonicalize test that was still needed 2021-03-26 17:10:39 +00:00
Chris Baker dd291e69f4 removed deprecated fields from Drain structs and API
node drain: use msgtype on txn so that events are emitted
wip: encoding extension to add Node.Drain field back to API responses

new approach for hiding Node.SecretID in the API, using `json` tag
documented this approach in the contributing guide
refactored the JSON handlers with extensions
modified event stream encoding to use the go-msgpack encoders with the extensions
2021-03-21 15:30:11 +00:00
Charlie Voiselle 0473f35003
Fixup uses of `sanity` (#10187)
* Fixup uses of `sanity`
* Remove unnecessary comments.

These checks are better explained by earlier comments about
the context of the test. Per @tgross, moved the tests together
to better reinforce the overall shared context.

* Update nomad/fsm_test.go
2021-03-16 18:05:08 -04:00
Drew Bailey 630babb886
prevent double job status update (#9768)
* Prevent Job Statuses from being calculated twice

https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion iwth job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job regisration eval insetion is
atomic with the registration this check is no longer necessary to set
the job statuses correctly.

* test to ensure only single job event for job register

* periodic e2e

* separate job update summary step

* fix updatejobstability to use copy instead of modified reference of job

* update envoygatewaybindaddresses copy to prevent job diff on null vs empty

* set ConsulGatewayBindAddress to empty map instead of nil

fix nil assertions for empty map

rm unnecessary guard
2021-01-22 09:18:17 -05:00
Drew Bailey 54becaab7d
Events/acl events (#9595)
* fix acl event creation

* allow way to access secretID without exposing it to stream

test that values are omitted

test event creation

test acl events

payloads are pointers

fix failing tests, do all security steps inside constructor

* increase time

* ignore empty tokens

* uncomment line

* changelog
2020-12-11 10:40:50 -05:00
Drew Bailey 9adca240f8
Event Stream: Track ACL changes, unsubscribe on invalidating changes (#9447)
* upsertaclpolicies

* delete acl policies msgtype

* upsert acl policies msgtype

* delete acl tokens msgtype

* acl bootstrap msgtype

wip unsubscribe on token delete

test that subscriptions are closed after an ACL token has been deleted

Start writing policyupdated test

* update test to use before/after policy

* add SubscribeWithACLCheck to run acl checks on subscribe

* update rpc endpoint to use broker acl check

* Add and use subscriptions.closeSubscriptionFunc

This fixes the issue of not being able to defer unlocking the mutex on
the event broker in the for loop.

handle acl policy updates

* rpc endpoint test for terminating acl change

* add comments

Co-authored-by: Kris Hicks <khicks@hashicorp.com>
2020-12-01 11:11:34 -05:00
Michael Schurter c2dd9bc996 core: open source namespaces 2020-10-22 15:26:32 -07:00
Drew Bailey f3dcefe5a9
remove event durability (#9147)
* remove event durability

temporarily removing go-memdb event durability until a new strategy is developed on how to best handled increased durability needs

* drop events table schema and state store methods

* fix neweventbuffer invocations
2020-10-22 12:21:03 -04:00
Drew Bailey 6c788fdccd
Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default  to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Drew Bailey c463479848
filter on additional filter keys, remove switch statement duplication
properly wire up durable event count

move newline responsibility

moves newline creation from NDJson to the http handler, json stream only encodes and sends now

ignore snapshot restore if broker is disabled

enable dev mode to access event steam without acl

use mapping instead of switch

use pointers for config sizes, remove unused ttl, simplify closed conn logic
2020-10-14 14:14:33 -04:00
Drew Bailey df96b89958
Add EvictCallbackFn to handle removing entries from go-memdb when they
are removed from the event buffer.

Wire up event buffer size config, use pointers for structs.Events
instead of copying.
2020-10-14 12:44:42 -04:00
Drew Bailey 315f77a301
rehydrate event publisher on snapshot restore
address pr feedback
2020-10-14 12:44:41 -04:00
Drew Bailey d793529d61
event durability count and cfg 2020-10-14 12:44:40 -04:00
Drew Bailey b4c135358d
use Events to wrap index and events, store in events table 2020-10-14 12:44:39 -04:00
Michael Schurter 9dd59ceaa7 core: improve job deregister error logging
Noticed this error in some production logs, and they were far from
helpful. Changes:

1. Include job ID in logs
2. Wrap errors and log once instead of double log lines
3. Test fsm error handling behavior
2020-09-21 08:59:03 -07:00
Drew Bailey 0af749c92e
Transaction change tracking
This commit wraps memdb.DB with a changeTrackerDB, which is a thin
wrapper around memdb.DB which enables go-memdb's TrackChanges on all write
transactions. When the transaction is comitted the changes are sent to
an eventPublisher which will be used to create and emit change events.

debugging TestFSM_ReconcileSummaries

wip

revert back rebase

revert back rebase

fix snapshot to actually use a snapshot
2020-09-01 10:27:20 -04:00
Mahmood Ali 78568b8e63 Remove unused state.TestInitState 2020-07-20 09:55:55 -04:00
Drew Bailey 01e2cc5054
allow ClusterMetadata to accept a watchset (#8299)
* allow ClusterMetadata to accept a watchset

* use nil instead of empty watchset
2020-06-26 13:23:32 -04:00
Mahmood Ali 2c963885b0 handle upgrade path and defaults
Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on
upgrades or on API handling when user misses the value.

The scheduler already treats `""` value as binpack.  This PR merely
ensures that the operator API returns the effective value.
2020-05-09 12:34:08 -04:00
Mahmood Ali f492ab6d9e implement MinQuorum 2020-02-16 16:04:59 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig 2b66ce93bb nomad: ensure a unique ClusterID exists when leader (gh-6702)
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
2020-01-31 19:03:26 -06:00
Mahmood Ali d740d347ce Migrate old alloc structs on read
This commit ensures that Alloc.AllocatedResources is properly populated
when read from persistence stores (namely Raft and client state store).
The alloc struct may have been written previously by an arbitrary old
version that may only populate Alloc.TaskResources.
2020-01-09 08:46:50 -05:00
Lang Martin a95225d754 NodeDeregisterBatch -> NodeBatchDeregister match JobBatch pattern 2019-07-10 13:56:20 -04:00
Lang Martin ce0f03651a fsm support new NodeDeregisterBatchRequest 2019-07-10 13:56:20 -04:00
Lang Martin a97407e030 fsm NodeDeregisterRequest is now a batch 2019-07-10 13:56:19 -04:00