111 Commits

Author SHA1 Message Date
Tim Gross 903b5baaa4
keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key is replicated in a function call
  that returns an error, to make the workflow clearer and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialized. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because affected clusters will end up
seeing an error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
2022-11-01 15:00:50 -04:00
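The "log the error and continue" loop described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual Nomad code: `fetchKey`, `replicateOne`, `replicateAll`, and `errMissingKey` are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sort"
)

// errMissingKey is a hypothetical error for a key no peer can supply.
var errMissingKey = errors.New("key not found on any peer")

// fetchKey is a hypothetical stand-in for fetching one key from a peer.
type fetchKey func(keyID string) error

// replicateOne handles a single key and returns an error instead of
// breaking out of the caller's loop.
func replicateOne(keyID string, fetch fetchKey) error {
	if err := fetch(keyID); err != nil {
		return fmt.Errorf("replicating key %q: %w", keyID, err)
	}
	return nil
}

// replicateAll attempts every key in sorted order. A failure on one key is
// logged and recorded, and the loop moves on, so keys that sort later are
// still attempted. It returns the IDs that failed.
func replicateAll(keyIDs []string, fetch fetchKey) []string {
	sorted := append([]string(nil), keyIDs...)
	sort.Strings(sorted)

	var failed []string
	for _, id := range sorted {
		if err := replicateOne(id, fetch); err != nil {
			log.Printf("keyring replication: %v", err)
			failed = append(failed, id)
		}
	}
	return failed
}

func main() {
	fetch := func(id string) error {
		if id == "b" {
			return errMissingKey
		}
		return nil
	}
	fmt.Println("failed keys:", replicateAll([]string{"c", "a", "b"}, fetch))
}
```

With this shape, a permanently missing key produces a log line on each pass but no longer starves the keys behind it.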
Tim Gross 3c78980b78
make version checks specific to region (1.4.x) (#14912)
* One-time tokens are not replicated between regions, so we don't want to enforce
  the version check across all of serf, just members in the same region.
* Scheduler: Disconnected clients handling is specific to a single region, so we
  don't want to enforce the version check across all of serf, just members in
  the same region.
* Variables: enforce version check in Apply RPC
* Cleans up a bunch of legacy checks.

This changeset is specific to 1.4.x and the changes for previous versions of
Nomad will be manually backported in a separate PR.
2022-10-17 16:23:51 -04:00
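A region-scoped version check along the lines described above might look like the sketch below. The `member` type, the `[3]int` version encoding, and both function names are assumptions made for this example; they are not Nomad's or serf's real types.

```go
package main

import "fmt"

// member is a hypothetical, trimmed-down view of a gossip member.
type member struct {
	Region  string
	Version [3]int // major, minor, patch
}

// less reports whether version a is strictly older than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// regionMeetsMinVersion checks only the members of the given region.
// Members of other regions are ignored, so an older remote region does
// not block a feature that is local to this one.
func regionMeetsMinVersion(members []member, region string, min [3]int) bool {
	for _, m := range members {
		if m.Region != region {
			continue
		}
		if less(m.Version, min) {
			return false
		}
	}
	return true
}

func main() {
	members := []member{
		{Region: "global", Version: [3]int{1, 4, 0}},
		{Region: "eu", Version: [3]int{1, 3, 5}},
	}
	min := [3]int{1, 4, 0}
	fmt.Println(regionMeetsMinVersion(members, "global", min))
	fmt.Println(regionMeetsMinVersion(members, "eu", min))
}
```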
Seth Hoenig b3ea68948b build: run gofmt on all go source files
Go 1.19 will forcefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Tim Gross d11da1df5c
workload identity: use parent ID for dispatch/periodic jobs (#13748)
Workload identities grant implicit access to policies, and operators
will not want to craft separate policies for each invocation of a
periodic or dispatch job. Use the parent job's ID as the JobID claim.
2022-07-21 09:05:54 -04:00
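The rule for choosing the JobID claim can be illustrated with a short sketch. The `job` struct and `claimJobID` function are hypothetical names for this example, and the job IDs in `main` are made up.

```go
package main

import "fmt"

// job is a hypothetical, trimmed-down job record.
type job struct {
	ID       string
	ParentID string // set on children of periodic or dispatch jobs
}

// claimJobID returns the ID to place in the workload identity's JobID
// claim: the parent's ID for periodic/dispatch children, otherwise the
// job's own ID. One policy on the parent then covers every invocation.
func claimJobID(j job) string {
	if j.ParentID != "" {
		return j.ParentID
	}
	return j.ID
}

func main() {
	child := job{ID: "backup/periodic-1", ParentID: "backup"}
	fmt.Println(claimJobID(child))
	fmt.Println(claimJobID(job{ID: "web"}))
}
```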
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at once, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
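The per-node rejection history with a time window and threshold could be sketched as below. This is an illustration under stated assumptions rather than the real tracker: the type and method names are hypothetical, "crosses the threshold" is read here as "reaches or exceeds it", and the rate limiting mentioned above is left out.

```go
package main

import (
	"fmt"
	"time"
)

// rejectionTracker is a hypothetical per-node sliding-window counter.
type rejectionTracker struct {
	window    time.Duration
	threshold int
	history   map[string][]time.Time
}

func newRejectionTracker(window time.Duration, threshold int) *rejectionTracker {
	return &rejectionTracker{
		window:    window,
		threshold: threshold,
		history:   make(map[string][]time.Time),
	}
}

// Add records a rejection for nodeID at time now, drops entries older than
// the window, and reports whether the node has reached the threshold.
func (t *rejectionTracker) Add(nodeID string, now time.Time) bool {
	cutoff := now.Add(-t.window)
	kept := t.history[nodeID][:0] // filter in place, reusing the backing array
	for _, ts := range t.history[nodeID] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, now)
	t.history[nodeID] = kept
	return len(kept) >= t.threshold
}

// demo replays two scenarios: three rejections within the window, and
// three rejections spaced further apart than the window.
func demo() (frequent, sparse []bool) {
	t := newRejectionTracker(5*time.Minute, 3)
	start := time.Date(2022, 7, 12, 0, 0, 0, 0, time.UTC)
	for _, m := range []int{0, 1, 2} {
		at := start.Add(time.Duration(m) * time.Minute)
		frequent = append(frequent, t.Add("node-a", at))
	}
	for _, m := range []int{0, 10, 20} {
		at := start.Add(time.Duration(m) * time.Minute)
		sparse = append(sparse, t.Add("node-b", at))
	}
	return frequent, sparse
}

func main() {
	frequent, sparse := demo()
	fmt.Println("node-a:", frequent)
	fmt.Println("node-b:", sparse)
}
```

Only the node whose rejections cluster inside the window is flagged; rejections spread over a longer period age out before they accumulate.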
Tim Gross bfcbc00f4e workload identity (#13223)
In order to support implicit ACL policies for tasks to get their own
secrets, each task would need to have its own ACL token. This would
add extra raft overhead as well as new garbage collection jobs for
cleaning up task-specific ACL tokens. Instead, Nomad will create a
workload Identity Claim for each task.

An Identity Claim is a JSON Web Token (JWT) signed by the server’s
private key and attached to an Allocation at the time a plan is
applied. The encoded JWT can be submitted as the X-Nomad-Token header
to replace ACL token secret IDs for the RPCs that support identity
claims.

Whenever a key is added to a server’s keyring, it will use the key
as the seed for an Ed25519 public-private keypair. That keypair
will be used for signing the JWT and for verifying the JWT.

This implementation is a ruthlessly minimal approach to support the
secure variables feature. When a JWT is verified, the allocation ID
will be checked against the Nomad state store, and non-existent or
terminal allocation IDs will cause the validation to be rejected. This
is sufficient to support the secure variables feature at launch
without requiring implementation of a background process to renew
soon-to-expire tokens.
2022-07-11 13:34:05 -04:00
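Deriving a signing keypair from a key used as a seed can be shown with Go's standard `crypto/ed25519` package. The sketch below signs and verifies raw bytes only; it does not build a real JWT, and the function names and payload are assumptions for illustration.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// keypairFromSeed derives a deterministic Ed25519 keypair from a seed.
// ed25519.NewKeyFromSeed panics on a wrong-sized seed, so the length is
// checked first.
func keypairFromSeed(seed []byte) (ed25519.PublicKey, ed25519.PrivateKey, error) {
	if len(seed) != ed25519.SeedSize {
		return nil, nil, fmt.Errorf("seed must be %d bytes, got %d",
			ed25519.SeedSize, len(seed))
	}
	priv := ed25519.NewKeyFromSeed(seed)
	return priv.Public().(ed25519.PublicKey), priv, nil
}

// signAndVerify signs payload with the private key and verifies the
// signature with the matching public key.
func signAndVerify(seed, payload []byte) (bool, error) {
	pub, priv, err := keypairFromSeed(seed)
	if err != nil {
		return false, err
	}
	sig := ed25519.Sign(priv, payload)
	return ed25519.Verify(pub, payload, sig), nil
}

func main() {
	// An all-zero seed is for illustration only; never use one in practice.
	seed := make([]byte, ed25519.SeedSize)
	ok, err := signAndVerify(seed, []byte(`{"alloc_id":"example"}`))
	fmt.Println(ok, err)
}
```

Because the keypair is a pure function of the seed, every server holding the same key derives the same keypair and can verify signatures made by any other.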
Tim Gross 4368dcc02f
fix deadlock in plan_apply (#13407)
The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.

Increase `snapshotMinIndex` timeout to 10s.
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
2022-06-23 12:06:27 -04:00
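The channel-closing fix can be illustrated in isolation. In this sketch the function names, the `apply` callback, and `errApplyTimeout` are hypothetical; only `indexCh` and the general idea come from the message above.

```go
package main

import (
	"errors"
	"fmt"
)

// errApplyTimeout is a hypothetical error standing in for a raft timeout.
var errApplyTimeout = errors.New("timed out waiting for raft apply")

// asyncWait waits for apply to finish and reports the resulting index.
// Closing indexCh on every exit path guarantees that a receiver is
// released even when apply fails and no index is ever sent.
func asyncWait(apply func() (uint64, error), indexCh chan<- uint64) {
	defer close(indexCh)
	idx, err := apply()
	if err != nil {
		return
	}
	indexCh <- idx
}

// nextIndex blocks until the waiter either sends an index or closes the
// channel. On close without a value, the previous index is kept.
func nextIndex(prev uint64, indexCh <-chan uint64) uint64 {
	if idx, ok := <-indexCh; ok {
		return idx
	}
	return prev
}

func main() {
	failed := make(chan uint64, 1)
	go asyncWait(func() (uint64, error) { return 0, errApplyTimeout }, failed)
	fmt.Println("after failure:", nextIndex(7, failed))

	succeeded := make(chan uint64, 1)
	go asyncWait(func() (uint64, error) { return 42, nil }, succeeded)
	fmt.Println("after success:", nextIndex(7, succeeded))
}
```

Without the `defer close`, the first call to `nextIndex` would block forever, which mirrors the deadlock described above.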
James Rasell a05114fdac
core: add namespace to plan for node rejected log line. (#12868) 2022-05-05 10:56:40 +02:00
Derek Strickland 5e309f3f33
reconciler: Handle canaries when client disconnects (#12539)
* plan_apply: Allow node updates in disconnected node plans
* plan: Keep the job when persisting unknown allocs
* reconciler: stop unknown allocs when stopping all
* reconcile_util: reorder filtering to handle canaries; skip rescheduling unknown
* heartbeat: Fix bug in node heartbeating
2022-04-21 10:05:58 -04:00
Jorge Marey 96dd3f53c6
Fix in-place updates over ineligible nodes (#12264) 2022-04-06 11:30:40 -04:00
DerekStrickland da9bc350c8 evaluateNodePlan: validate plans for disconnected nodes 2022-04-05 17:12:22 -04:00
Luiz Aoqui b1753d0568
scheduler: detect and log unexpected scheduling collisions (#11793) 2022-01-14 20:09:14 -05:00
Luiz Aoqui 5be6710216
add `/s/port-plan-failure` redirect and link to it in plan reject log message 2021-11-02 20:43:54 -04:00
Michael Schurter efe5714840 core: bump rejected plans from debug -> info
As we have continued to see reports of #9506 we need to elevate this log
line as it is the only way to detect when plans are being *erroneously*
rejected.

Users who see this log line repeatedly should drain and restart the node
in the log line. This seems to work around the issue.

Please post any details on #9506!
2021-10-31 12:51:42 -07:00
Tim Gross e9777a88ce plan applier: add trace-level log of plan
The plans generated by the scheduler produce high-level output of counts on each
evaluation, but when debugging scheduler issues it'd be nice to have a more
detailed view of the resulting plan. Emitting this log at trace minimizes the
overhead, and producing it in the plan applier makes it easier to find as it
will always be on the leader.
2021-06-02 10:25:23 -04:00
Chris Baker 770c9cecb5 restored Node.Sanitize() for RPC endpoints
multiple other updates from code review
2021-03-26 17:03:15 +00:00
Chris Baker dd291e69f4 removed deprecated fields from Drain structs and API
node drain: use msgtype on txn so that events are emitted
wip: encoding extension to add Node.Drain field back to API responses

new approach for hiding Node.SecretID in the API, using `json` tag
documented this approach in the contributing guide
refactored the JSON handlers with extensions
modified event stream encoding to use the go-msgpack encoders with the extensions
2021-03-21 15:30:11 +00:00
Drew Bailey 9d48818eb8
writetxn can return error, add alloc and job generic events. Add events
table for durability
2020-10-14 12:44:39 -04:00
Drew Bailey 4793bb4e01
Events/deployment events (#9004)
* Node Drain events and Node Events (#8980)

Deployment status updates

handle deployment status updates (paused, failed, resume)

deployment alloc health

generate events from apply plan result

txn err check, slim down deployment event

one ndjson line per index

* consolidate down to node event + type

* fix UpdateDeploymentAllocHealth test invocations

* fix test
2020-10-14 12:44:37 -04:00
Lang Martin 069840bef8
scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread follupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Lang Martin ac7c39d3d3
Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099)
* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
2020-06-03 09:48:38 -04:00
Jasmine Dahilig 8d980edd2e
add create and modify timestamps to evaluations (#5881) 2019-08-07 09:50:35 -07:00
Michael Schurter 69ba495f0c nomad: expand comments on subtle plan apply behaviors 2019-06-26 08:49:24 -07:00
Michael Schurter e4bc943a68 nomad: SnapshotAfter -> SnapshotMinIndex
Rename SnapshotAfter to SnapshotMinIndex. The old name was not
technically accurate. SnapshotAtOrAfter is more accurate, but wordy and
still lacks context about what precisely it is at or after (the index).

SnapshotMinIndex was chosen as it describes the action (snapshot), a
constraint (minimum), and the object of the constraint (index).
2019-06-24 12:16:46 -07:00
Michael Schurter 0f8164b2f1 nomad: evaluate plans after previous plan index
The previous commit prevented evaluating plans against a state snapshot
which is older than the snapshot at which the plan was created.  This is
correct and prevents failures trying to retrieve referenced objects that
may not exist until the plan's snapshot. However, this is insufficient
to guarantee consistency if the following events occur:

1. P1, P2, and P3 are enqueued with snapshot @ 100
2. Leader evaluates and applies Plan P1 with snapshot @ 100
3. Leader evaluates Plan P2 with snapshot+P1 @ 100
4. P1 commits @ 101
5. Leader evaluates and applies Plan P3 with snapshot+P2 @ 100

Since only the previous plan is optimistically applied to the state
store, the snapshot used to evaluate a plan may not contain the N-2
plan!

To ensure plans are evaluated and applied serially we must consider all
previous plans' committed indexes when evaluating further plans.

Therefore combined with the last PR, the minimum index at which to
evaluate a plan is:

    max(previousPlanResultIndex, plan.SnapshotIndex)
2019-06-24 12:16:46 -07:00
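Read as a lower bound on the snapshot index, both constraints (at or after the previous plan's result, and at or after the plan's own snapshot) hold only for a snapshot at or after the larger of the two indexes. A minimal sketch, with a hypothetical function name:

```go
package main

import "fmt"

// minSnapshotIndex returns the lowest state store index at which a plan
// may be evaluated: the snapshot must include the previous plan's result
// and must not be older than the snapshot the plan was created from.
func minSnapshotIndex(prevPlanResultIndex, planSnapshotIndex uint64) uint64 {
	if prevPlanResultIndex > planSnapshotIndex {
		return prevPlanResultIndex
	}
	return planSnapshotIndex
}

func main() {
	// P1 committed at 101; P3 was created from a snapshot at 100.
	fmt.Println(minSnapshotIndex(101, 100))
	// No previous plan yet (for example, right after a leader election).
	fmt.Println(minSnapshotIndex(0, 100))
}
```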
Michael Schurter e10fea1d7a nomad: include snapshot index when submitting plans
Plan application should use a state snapshot at or after the Raft index
at which the plan was created otherwise it risks being rejected based on
stale data.

This commit adds a Plan.SnapshotIndex which is set by workers when
submitting a plan. SnapshotIndex is set to the Raft index of the snapshot
the worker used to generate the plan.

Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex.
While RefreshIndex informs workers their StateStore is behind the
leader's, SnapshotIndex is a way to prevent the leader from using a
StateStore behind the worker's.

Plan.SnapshotIndex should be considered the *lower bound* index for
consistently handling plan application.

Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans *after* the
first plan after a leader election.

The Raft barrier on leader election ensures the leader's statestore has
caught up to the log index at which it was elected. This guarantees its
StateStore is at an index > lastPlanIndex.
2019-06-24 12:16:46 -07:00
Mahmood Ali 392f5bac44 Stop updating allocs.Job on stopping or preemption 2019-06-10 18:30:20 -04:00
Michael Schurter a8fc50cc1b nomad: revert use of SnapshotAfter in planApply
Revert plan_apply.go changes from #5411

Since non-Command Raft messages do not update the StateStore index,
SnapshotAfter may unnecessarily block and needlessly fail in idle
clusters where the last Raft message is a non-Command message.

This is trivially reproducible with the dev agent and a job that has 2
tasks, 1 of which fails.

The correct logic would be to SnapshotAfter the previous plan's index to
ensure consistency. New clusters or newly elected leaders will not have
a previous plan, so the index at which the leader was elected should be used
instead.
2019-06-03 15:34:21 -07:00
Michael Schurter b80a7e0feb nomad: wait for state store to sync in plan apply
Wait for state store to catch up with raft when applying plans.
2019-05-17 14:37:12 -07:00
Arshneet Singh d4e7a5c005 Add comments to functions, and use require instead of assert 2019-04-23 09:57:21 -07:00
Arshneet Singh 0dd4c109e8 Compat tags 2019-04-23 09:18:01 -07:00
Arshneet Singh b977748a4b Add code for plan normalization 2019-04-23 09:18:01 -07:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Alex Dadgar e3cbb2c82e allocs fit checks if devices get oversubscribed 2018-11-07 10:33:22 -08:00
Preetha Appan c33469157d
unit test plan apply with preemptions 2018-11-01 20:06:32 -05:00
Preetha Appan 1415032c13
More review comments 2018-10-30 11:06:32 -05:00
Preetha Appan 3190a2c29b
Fix linting 2018-10-30 11:06:32 -05:00
Preetha Appan 9e4a35fff0
Fix comment 2018-10-30 11:06:32 -05:00
Preetha Appan cc295b90de
Implement preemption for system jobs.
This commit implements an allocation selection algorithm for finding
allocations to preempt. It currently special cases network resource asks
from others (cpu/memory/disk/iops).
2018-10-30 11:06:32 -05:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Michael Schurter c0542474db drain: initial drainv2 structs and impl 2018-03-21 16:49:48 -07:00
Josh Soref 6aa0a05998 spelling: scheduling 2018-03-11 18:53:11 +00:00
Josh Soref 7da15e6ee5 spelling: minimum 2018-03-11 18:26:53 +00:00
Josh Soref a0452dd7f5 spelling: evaluate 2018-03-11 18:01:29 +00:00
Preetha Appan 2567b51c58
Edge trigger evaluation when allocations client status is failed 2018-01-31 09:56:53 -06:00
Preetha Appan 3c36abfe14
Update eval modify index as part of plan apply. 2017-12-18 10:03:55 -06:00
Preetha Appan d340c3adb1 Always set modify time on allocations, and other changes addressing review comments 2017-11-01 15:13:48 -05:00
Preetha Appan 39d70be009 Add ModifyTime to Allocation and update it both on plan applies and client initiated updates 2017-11-01 15:13:48 -05:00
Alex Dadgar c1cc51dbee sync 2017-10-13 14:36:02 -07:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00