Commit Graph

2829 Commits

Author SHA1 Message Date
Michael Schurter 81b4b6f19b
Merge pull request #5791 from hashicorp/b-plan-snapshotindex
nomad: include snapshot index when submitting plans
2019-07-17 09:25:00 -07:00
Mahmood Ali ad39bcef60 rpc: use tls wrapped connection for streaming rpc
This ensures that server-to-server streaming RPC calls use the tls
wrapped connections.

Prior to this, `streamingRpcImpl` function uses tls for setting header
and invoking the rpc method, but returns unwrapped tls connection.
Thus, streaming writes fail with tls errors.

This tls streaming bug existed since 0.8.0[1], but PR #5654[2]
exacerbated it in 0.9.2.  Prior to PR #5654, nomad client used to
shuffle servers at every heartbeat -- `servers.Manager.setServers`[3]
always shuffled servers and was called by heartbeat code[4].  Shuffling
servers meant that a nomad client would heartbeat and establish a
connection against all nomad servers eventually.  When handling
streaming RPC calls, nomad servers used these local connection to
communicate directly to the client.  The server-to-server forwarding
logic was left mostly unexercised.

PR #5654 means that a nomad client may connect to a single server only
and caused the server-to-server forward streaming RPC code to get
exercised more and unearthed the problem.

[1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515
[2] https://github.com/hashicorp/nomad/pull/5654
[3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216
[4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603
2019-07-12 14:41:44 +08:00
Mahmood Ali 9c9bec62fd rpc: add positive tests for server streaming RPC 2019-07-12 14:32:52 +08:00
Lang Martin 0b97175a16 node_endpoint preserve both messages as rpcs and in raft 2019-07-10 13:56:20 -04:00
Lang Martin ee4848167c core_sched add compat comment for later removal 2019-07-10 13:56:20 -04:00
Lang Martin c13c97c6c2 structs drop deprecation warning, revert unnecessary comment change 2019-07-10 13:56:20 -04:00
Lang Martin a95225d754 NodeDeregisterBatch -> NodeBatchDeregister match JobBatch pattern 2019-07-10 13:56:20 -04:00
Lang Martin a8e72a5b68 state_store error if called without node_ids 2019-07-10 13:56:20 -04:00
Lang Martin 44cbca9b98 fsm new NodeDeregisterBatchRequestType sorted at the end of the case 2019-07-10 13:56:20 -04:00
Lang Martin 91e139dcb5 structs NodeDeregisterBatchRequestType must go at the end 2019-07-10 13:56:20 -04:00
Lang Martin 1cc6b4062c fsm label batch_deregister_node metrics explicitly
Co-Authored-By: Mahmood Ali <mahmood@notnoop.com>
2019-07-10 13:56:20 -04:00
Lang Martin ad3549f906 core_sched use the new rpc names 2019-07-10 13:56:20 -04:00
Lang Martin ce0f03651a fsm support new NodeDeregisterBatchRequest 2019-07-10 13:56:20 -04:00
Lang Martin fa5649998e node endpoint support new NodeDeregisterBatchRequest 2019-07-10 13:56:19 -04:00
Lang Martin 683ab8d1d2 structs add NodeDeregisterBatchRequest 2019-07-10 13:56:19 -04:00
Lang Martin 82349aba5d node_endpoint argument setup 2019-07-10 13:56:19 -04:00
Lang Martin 6dbf5d7d13 fsm return an error on both NodeDeregisterRequest fields set 2019-07-10 13:56:19 -04:00
Lang Martin fbc78ba96c fsm variable names for consistency 2019-07-10 13:56:19 -04:00
Lang Martin 09fd05bd8f node_endpoint raft store then shutdown, test deprecation 2019-07-10 13:56:19 -04:00
Lang Martin 4610c70777 util simplify partitionAll 2019-07-10 13:56:19 -04:00
Lang Martin d22d9fb5b2 core_sched check ServersMeetMinimumVersion 2019-07-10 13:56:19 -04:00
Lang Martin 3bf41211fb fsm honor new and old style NodeDeregisterRequests 2019-07-10 13:56:19 -04:00
Lang Martin 3fb82e83a5 structs add back NodeDeregisterRequest.NodeID, compatibility 2019-07-10 13:56:19 -04:00
Lang Martin a4472e3d34 core_sched check ServersMeetMinimumVersion, send old node deregister 2019-07-10 13:56:19 -04:00
Lang Martin 8e53c105fc state_store just one index update, test deletion 2019-07-10 13:56:19 -04:00
Lang Martin 3e2d1f0338 node_endpoint improve error messages 2019-07-10 13:56:19 -04:00
Lang Martin 5a6a947e98 state_store improve error messages 2019-07-10 13:56:19 -04:00
Lang Martin fd14cedf95 drainer watch_nodes_test batch of 1 2019-07-10 13:56:19 -04:00
Lang Martin b176066d42 node_endpoint deregister the batch of nodes 2019-07-10 13:56:19 -04:00
Lang Martin a97407e030 fsm NodeDeregisterRequest is now a batch 2019-07-10 13:56:19 -04:00
Lang Martin d5ff2834ca core_sched batch node deregistration requests 2019-07-10 13:56:19 -04:00
Lang Martin 10848841be util partitionAll for paging 2019-07-10 13:56:19 -04:00
Lang Martin be2d6853cb state_store DeleteNode operates on a batch of ids 2019-07-10 13:56:19 -04:00
Lang Martin 77cf037bff struct NodeDeregisterRequest has a batch of NodeIDs 2019-07-10 13:56:19 -04:00
Preetha Appan 3cb798235d
Missed one revert of backwards compatibility for node drain 2019-07-01 16:46:05 -05:00
Preetha Appan aa2b4b4e00
Undo removal of node drain compat changes
Decided to remove that in 0.10
2019-07-01 15:12:01 -05:00
Preetha Appan 3484f18984
Fix more tests 2019-06-26 16:30:53 -05:00
Preetha Appan ff1b80dba6
Fix node drain test 2019-06-26 16:12:07 -05:00
Preetha Appan 23319e04d6
Restore accidentally deleted block 2019-06-26 13:59:14 -05:00
Michael Schurter 69ba495f0c nomad: expand comments on subtle plan apply behaviors 2019-06-26 08:49:24 -07:00
Preetha Appan 66fa6a67ec
newline 2019-06-25 19:41:09 -05:00
Preetha Appan 10e7d6df6d
Remove compat code associated with many previous versions of nomad
This removes compat code for namespaces (0.7), Drain(0.8) and other
older features from releases older than Nomad 0.7
2019-06-25 19:05:25 -05:00
Michael Schurter e4bc943a68 nomad: SnapshotAfter -> SnapshotMinIndex
Rename SnapshotAfter to SnapshotMinIndex. The old name was not
technically accurate. SnapshotAtOrAfter is more accurate, but wordy and
still lacks context about what precisely it is at or after (the index).

SnapshotMinIndex was chosen as it describes the action (snapshot), a
constraint (minimum), and the object of the constraint (index).
2019-06-24 12:16:46 -07:00
Michael Schurter 0f8164b2f1 nomad: evaluate plans after previous plan index
The previous commit prevented evaluating plans against a state snapshot
which is older than the snapshot at which the plan was created.  This is
correct and prevents failures trying to retrieve referenced objects that
may not exist until the plan's snapshot. However, this is insufficient
to guarantee consistency if the following events occur:

1. P1, P2, and P3 are enqueued with snapshot @ 100
2. Leader evaluates and applies Plan P1 with snapshot @ 100
3. Leader evaluates Plan P2 with snapshot+P1 @ 100
4. P1 commits @ 101
4. Leader evaluates applies Plan P3 with snapshot+P2 @ 100

Since only the previous plan is optimistically applied to the state
store, the snapshot used to evaluate a plan may not contain the N-2
plan!

To ensure plans are evaluated and applied serially we must consider all
previous plan's committed indexes when evaluating further plans.

Therefore combined with the last PR, the minimum index at which to
evaluate a plan is:

    min(previousPlanResultIndex, plan.SnapshotIndex)
2019-06-24 12:16:46 -07:00
Michael Schurter e10fea1d7a nomad: include snapshot index when submitting plans
Plan application should use a state snapshot at or after the Raft index
at which the plan was created otherwise it risks being rejected based on
stale data.

This commit adds a Plan.SnapshotIndex which is set by workers when
submitting plan. SnapshotIndex is set to the Raft index of the snapshot
the worker used to generate the plan.

Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex.
While RefreshIndex informs workers their StateStore is behind the
leader's, SnapshotIndex is a way to prevent the leader from using a
StateStore behind the worker's.

Plan.SnapshotIndex should be considered the *lower bound* index for
consistently handling plan application.

Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans *after* the
first plan after a leader election.

The Raft barrier on leader election ensures the leader's statestore has
caught up to the log index at which it was elected. This guarantees its
StateStore is at an index > lastPlanIndex.
2019-06-24 12:16:46 -07:00
Chris Baker 59fac48d92 alloc lifecycle: 404 when attempting to stop non-existent allocation 2019-06-20 21:27:22 +00:00
Preetha 586e50d1a4
Merge pull request #5841 from hashicorp/f-raft-snapshot-metrics
Raft and state store indexes as metrics
2019-06-19 12:01:03 -05:00
Preetha Appan dc0ac81609
Change interval of raft stats collection to 10s 2019-06-19 11:58:46 -05:00
Preetha Appan 104d66f10c
Changed name of metric 2019-06-17 15:51:31 -05:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00