open-nomad

Author	SHA1	Message	Date
Mahmood Ali	9c9bec62fd	rpc: add positive tests for server streaming RPC	2019-07-12 14:32:52 +08:00
Lang Martin	0b97175a16	node_endpoint preserve both messages as rpcs and in raft	2019-07-10 13:56:20 -04:00
Lang Martin	ee4848167c	core_sched add compat comment for later removal	2019-07-10 13:56:20 -04:00
Lang Martin	c13c97c6c2	structs drop deprecation warning, revert unnecessary comment change	2019-07-10 13:56:20 -04:00
Lang Martin	a95225d754	NodeDeregisterBatch -> NodeBatchDeregister match JobBatch pattern	2019-07-10 13:56:20 -04:00
Lang Martin	a8e72a5b68	state_store error if called without node_ids	2019-07-10 13:56:20 -04:00
Lang Martin	44cbca9b98	fsm new NodeDeregisterBatchRequestType sorted at the end of the case	2019-07-10 13:56:20 -04:00
Lang Martin	91e139dcb5	structs NodeDeregisterBatchRequestType must go at the end	2019-07-10 13:56:20 -04:00
Lang Martin	1cc6b4062c	fsm label batch_deregister_node metrics explicitly Co-Authored-By: Mahmood Ali <mahmood@notnoop.com>	2019-07-10 13:56:20 -04:00
Lang Martin	ad3549f906	core_sched use the new rpc names	2019-07-10 13:56:20 -04:00
Lang Martin	ce0f03651a	fsm support new NodeDeregisterBatchRequest	2019-07-10 13:56:20 -04:00
Lang Martin	fa5649998e	node endpoint support new NodeDeregisterBatchRequest	2019-07-10 13:56:19 -04:00
Lang Martin	683ab8d1d2	structs add NodeDeregisterBatchRequest	2019-07-10 13:56:19 -04:00
Lang Martin	82349aba5d	node_endpoint argument setup	2019-07-10 13:56:19 -04:00
Lang Martin	6dbf5d7d13	fsm return an error on both NodeDeregisterRequest fields set	2019-07-10 13:56:19 -04:00
Lang Martin	fbc78ba96c	fsm variable names for consistency	2019-07-10 13:56:19 -04:00
Lang Martin	09fd05bd8f	node_endpoint raft store then shutdown, test deprecation	2019-07-10 13:56:19 -04:00
Lang Martin	4610c70777	util simplify partitionAll	2019-07-10 13:56:19 -04:00
Lang Martin	d22d9fb5b2	core_sched check ServersMeetMinimumVersion	2019-07-10 13:56:19 -04:00
Lang Martin	3bf41211fb	fsm honor new and old style NodeDeregisterRequests	2019-07-10 13:56:19 -04:00
Lang Martin	3fb82e83a5	structs add back NodeDeregisterRequest.NodeID, compatibility	2019-07-10 13:56:19 -04:00
Lang Martin	a4472e3d34	core_sched check ServersMeetMinimumVersion, send old node deregister	2019-07-10 13:56:19 -04:00
Lang Martin	8e53c105fc	state_store just one index update, test deletion	2019-07-10 13:56:19 -04:00
Lang Martin	3e2d1f0338	node_endpoint improve error messages	2019-07-10 13:56:19 -04:00
Lang Martin	5a6a947e98	state_store improve error messages	2019-07-10 13:56:19 -04:00
Lang Martin	fd14cedf95	drainer watch_nodes_test batch of 1	2019-07-10 13:56:19 -04:00
Lang Martin	b176066d42	node_endpoint deregister the batch of nodes	2019-07-10 13:56:19 -04:00
Lang Martin	a97407e030	fsm NodeDeregisterRequest is now a batch	2019-07-10 13:56:19 -04:00
Lang Martin	d5ff2834ca	core_sched batch node deregistration requests	2019-07-10 13:56:19 -04:00
Lang Martin	10848841be	util partitionAll for paging	2019-07-10 13:56:19 -04:00
Lang Martin	be2d6853cb	state_store DeleteNode operates on a batch of ids	2019-07-10 13:56:19 -04:00
Lang Martin	77cf037bff	struct NodeDeregisterRequest has a batch of NodeIDs	2019-07-10 13:56:19 -04:00
Mahmood Ali	ea3a98357f	Block rpc handling until state store is caught up Here, we ensure that the leader only responds to RPC calls when the state store is up to date. At leadership transition or launch with restored state, the server's local store might not be caught up with the latest raft logs and may return a stale read. The solution here is to have an RPC consistency read gate, enabled when `establishLeadership` completes, before we respond to RPC calls. `establishLeadership` is gated by a `raft.Barrier` which ensures that all prior raft logs have been applied. Conversely, the gate is disabled when leadership is lost. This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files	2019-07-02 16:07:37 +08:00
Preetha Appan	3cb798235d	Missed one revert of backwards compatibility for node drain	2019-07-01 16:46:05 -05:00
Preetha Appan	aa2b4b4e00	Undo removal of node drain compat changes Decided to remove that in 0.10	2019-07-01 15:12:01 -05:00
Preetha Appan	3484f18984	Fix more tests	2019-06-26 16:30:53 -05:00
Preetha Appan	ff1b80dba6	Fix node drain test	2019-06-26 16:12:07 -05:00
Preetha Appan	23319e04d6	Restore accidentally deleted block	2019-06-26 13:59:14 -05:00
Michael Schurter	69ba495f0c	nomad: expand comments on subtle plan apply behaviors	2019-06-26 08:49:24 -07:00
Preetha Appan	66fa6a67ec	newline	2019-06-25 19:41:09 -05:00
Preetha Appan	10e7d6df6d	Remove compat code associated with many previous versions of nomad This removes compat code for namespaces (0.7), Drain(0.8) and other older features from releases older than Nomad 0.7	2019-06-25 19:05:25 -05:00
Michael Schurter	e4bc943a68	nomad: SnapshotAfter -> SnapshotMinIndex Rename SnapshotAfter to SnapshotMinIndex. The old name was not technically accurate. SnapshotAtOrAfter is more accurate, but wordy and still lacks context about what precisely it is at or after (the index). SnapshotMinIndex was chosen as it describes the action (snapshot), a constraint (minimum), and the object of the constraint (index).	2019-06-24 12:16:46 -07:00
Michael Schurter	0f8164b2f1	nomad: evaluate plans after previous plan index The previous commit prevented evaluating plans against a state snapshot which is older than the snapshot at which the plan was created. This is correct and prevents failures trying to retrieve referenced objects that may not exist until the plan's snapshot. However, this is insufficient to guarantee consistency if the following events occur: 1. P1, P2, and P3 are enqueued with snapshot @ 100 2. Leader evaluates and applies Plan P1 with snapshot @ 100 3. Leader evaluates Plan P2 with snapshot+P1 @ 100 4. P1 commits @ 101 5. Leader evaluates and applies Plan P3 with snapshot+P2 @ 100 Since only the previous plan is optimistically applied to the state store, the snapshot used to evaluate a plan may not contain the N-2 plan! To ensure plans are evaluated and applied serially we must consider all previous plans' committed indexes when evaluating further plans. Therefore combined with the last PR, the minimum index at which to evaluate a plan is: max(previousPlanResultIndex, plan.SnapshotIndex)	2019-06-24 12:16:46 -07:00
Michael Schurter	e10fea1d7a	nomad: include snapshot index when submitting plans Plan application should use a state snapshot at or after the Raft index at which the plan was created; otherwise it risks being rejected based on stale data. This commit adds a Plan.SnapshotIndex which is set by workers when submitting a plan. SnapshotIndex is set to the Raft index of the snapshot the worker used to generate the plan. Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex. While RefreshIndex informs workers their StateStore is behind the leader's, SnapshotIndex is a way to prevent the leader from using a StateStore behind the worker's. Plan.SnapshotIndex should be considered the lower bound index for consistently handling plan application. Plans must also be committed serially, so Plan N+1 should use a state snapshot containing Plan N. This is guaranteed for plans after the first plan after a leader election. The Raft barrier on leader election ensures the leader's StateStore has caught up to the log index at which it was elected. This guarantees its StateStore is at an index > lastPlanIndex.	2019-06-24 12:16:46 -07:00
Chris Baker	59fac48d92	alloc lifecycle: 404 when attempting to stop non-existent allocation	2019-06-20 21:27:22 +00:00
Preetha	586e50d1a4	Merge pull request #5841 from hashicorp/f-raft-snapshot-metrics Raft and state store indexes as metrics	2019-06-19 12:01:03 -05:00
Preetha Appan	dc0ac81609	Change interval of raft stats collection to 10s	2019-06-19 11:58:46 -05:00
Preetha Appan	104d66f10c	Changed name of metric	2019-06-17 15:51:31 -05:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Preetha Appan	c54b4a5b17	Emit metrics with raft commit and apply index and statestore latest index	2019-06-14 16:30:27 -05:00
Jasmine Dahilig	ed9740db10	Merge pull request #5664 from hashicorp/f-http-hcl-region backfill region from hcl for jobUpdate and jobPlan	2019-06-13 12:25:01 -07:00
Jasmine Dahilig	51e141be7a	backfill region from job hcl in jobUpdate and jobPlan endpoints - updated region in job metadata that gets persisted to nomad datastore - fixed many unrelated unit tests that used an invalid region value (they previously passed because hcl wasn't getting picked up and the job would default to global region)	2019-06-13 08:03:16 -07:00
Nick Ethier	1b7fa4fe29	Optional Consul service tags for nomad server and agent services (#5706)	2019-06-13 09:00:35 -04:00
Mahmood Ali	e31159bf1f	Prepare for 0.9.4 dev cycle	2019-06-12 18:47:50 +00:00
Nomad Release bot	4803215109	Generate files for 0.9.3 release	2019-06-12 16:11:16 +00:00
Mahmood Ali	07f2c77c44	comment DenormalizeAllocationDiffSlice applies to terminal allocs only	2019-06-12 08:28:43 -04:00
Lang Martin	fe8a4781d8	config merge maintains *HCL string fields used for duration conversion	2019-06-11 16:34:04 -04:00
Mahmood Ali	392f5bac44	Stop updating allocs.Job on stopping or preemption	2019-06-10 18:30:20 -04:00
Mahmood Ali	6c8e329819	test that stopped alloc jobs aren't modified When an alloc is stopped, test that we don't update the job found in the alloc with a new job that is no longer relevant for this alloc.	2019-06-10 17:14:26 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	87173111de	Merge pull request #5746 from hashicorp/b-no-updating-inmem-node set node.StatusUpdatedAt in raft	2019-06-05 19:05:21 -04:00
Mahmood Ali	97957fbf75	Prepare for 0.9.3 dev cycle	2019-06-05 14:54:00 +00:00
Nomad Release bot	43bfbf3fcc	Generate files for 0.9.2 release	2019-06-05 11:59:27 +00:00
Michael Schurter	073893f529	nomad: disable service+batch preemption by default Enterprise only. Disable preemption for service and batch jobs by default. Maintain backward compatibility in a x.y.Z release. Consider switching the default for new clusters in the future.	2019-06-04 15:54:50 -07:00
Michael Schurter	a8fc50cc1b	nomad: revert use of SnapshotAfter in planApply Revert plan_apply.go changes from #5411 Since non-Command Raft messages do not update the StateStore index, SnapshotAfter may unnecessarily block and needlessly fail in idle clusters where the last Raft message is a non-Command message. This is trivially reproducible with the dev agent and a job that has 2 tasks, 1 of which fails. The correct logic would be to SnapshotAfter the previous plan's index to ensure consistency. New clusters or newly elected leaders will not have a previous plan, so the index at which the leader was elected should be used instead.	2019-06-03 15:34:21 -07:00
Mahmood Ali	a4ead8ff79	remove 0.9.2-rc1 generated code	2019-05-23 11:14:24 -04:00
Nomad Release bot	6d6bc59732	Generate files for 0.9.2-rc1 release	2019-05-22 19:29:30 +00:00
Lang Martin	d46613ff44	structs check TaskGroup.Update for nil	2019-05-22 12:34:57 -04:00
Lang Martin	10a3fd61b0	comment replace COMPAT 0.7.0 for job.Update with more current info	2019-05-22 12:34:57 -04:00
Lang Martin	67ebcc47dd	structs comment todo DeploymentStatus & DeploymentStatusDescription	2019-05-22 12:34:57 -04:00
Lang Martin	21bf9fdf90	structs job warnings for taskgroup with mixed auto_promote settings	2019-05-22 12:34:57 -04:00
Lang Martin	0f6f543a5f	deployment_watcher auto promote iff every task group is auto promotable	2019-05-22 12:34:57 -04:00
Lang Martin	d27d6f8ede	structs validate requires Canary for AutoPromote	2019-05-22 12:32:08 -04:00
Lang Martin	0c668ecc7a	log error on autoPromoteDeployment failure	2019-05-22 12:32:08 -04:00
Lang Martin	f23f9fd99e	describe a pending deployment without auto_promote more explicitly	2019-05-22 12:32:08 -04:00
Lang Martin	34230577df	describe a pending deployment with auto_promote accurately	2019-05-22 12:32:08 -04:00
Lang Martin	b5fd735960	add update AutoPromote bool	2019-05-22 12:32:08 -04:00
Lang Martin	3c5a9fed22	deployments_watcher_test new TestWatcher_AutoPromoteDeployment	2019-05-22 12:32:08 -04:00
Lang Martin	0bebf5d7f8	deployment_watcher when it's ok to autopromote, do so	2019-05-22 12:32:08 -04:00
Lang Martin	0cf4168ed9	deployments_watcher comments	2019-05-22 12:32:08 -04:00
Lang Martin	0c403eafde	state_store typo in a comment	2019-05-22 12:32:08 -04:00
Lang Martin	e1e28307be	new deploymentwatcher/doc.go for package level documentation	2019-05-22 12:32:08 -04:00
Mahmood Ali	9ff5f163b5	update callers in tests	2019-05-21 21:10:17 -04:00
Mahmood Ali	6bdbeed319	set node.StatusUpdatedAt in raft Fix a case where `node.StatusUpdatedAt` was manipulated directly in memory. This ensures that StatusUpdatedAt is set in raft layer, and ensures that the field is updated when node drain/eligibility is updated too.	2019-05-21 16:13:32 -04:00
Mahmood Ali	2159d0f3ac	tests: fix some nomad/drainer test data races	2019-05-21 14:40:58 -04:00
Mahmood Ali	3b0152d778	tests: fix deploymentwatcher tests data races	2019-05-21 14:29:45 -04:00
Michael Schurter	689794e08d	nomad: fix deadlock in UnblockClassAndQuota Previous commit could introduce a deadlock if the capacityChangeCh was full and the receiving side exited before freeing a slot so the sending side could send. Flush would then block forever waiting to acquire the lock just to throw the pending update away. The race is around getting/setting the chan field, not chan operations, so only lock around getting the chan field.	2019-05-20 15:41:52 -07:00
Michael Schurter	8c99214f69	nomad: fix race in BlockedEvals I assume the mutex was being released before sending on capacityChangeCh to avoid blocking in the critical section, but: 1. This is a race. 2. capacityChangeCh has a huge buffer (8096). If it's full things already seem Very Bad, and a little backpressure seems appropriate.	2019-05-20 15:26:20 -07:00
Michael Schurter	05a9c6aedb	Merge pull request #5411 from hashicorp/b-snapshotafter Block plan application until state store has caught up to raft	2019-05-20 14:03:10 -07:00
Mahmood Ali	cd64ada95d	Run TestClientAllocations_Restart_ACL test	2019-05-17 20:30:23 -04:00
Michael Schurter	0e39927782	nomad: emit more detailed error Avoid returning context.DeadlineExceeded as it lacks helpful information and is often ignored or handled specially by callers.	2019-05-17 14:37:42 -07:00
Michael Schurter	b80a7e0feb	nomad: wait for state store to sync in plan apply Wait for state store to catch up with raft when applying plans.	2019-05-17 14:37:12 -07:00
Michael Schurter	1bc731da47	nomad: remove unused NotifyGroup struct I don't think it's been used for a long time.	2019-05-17 13:30:23 -07:00
Michael Schurter	9732bc37ff	nomad: refactor waitForIndex into SnapshotAfter Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.	2019-05-17 13:30:23 -07:00
Preetha	c8fdf20c66	Merge pull request #5717 from hashicorp/b-plan-apply-preemptions Fix bug in plan applier introduced in PR-5602	2019-05-16 11:01:05 -05:00
Preetha	2dcd4291f8	Merge pull request #5702 from hashicorp/f-filter-by-create-index Filter deployments by create index	2019-05-15 21:50:41 -05:00
Preetha	555dd23c2c	remove stray newline Co-Authored-By: Danielle <dani@builds.terrible.systems>	2019-05-15 21:11:52 -05:00
Preetha Appan	2b787aad7e	Fix bug in plan applier introduced in PR-5602 This fixes a bug in the state store during plan apply. When denormalizing preempted allocations it incorrectly set the preemptor's job during the update. This eventually causes a panic downstream in the client. Added a test assertion that failed before and passes after this fix	2019-05-15 20:34:06 -05:00
Danielle	d202582502	Merge pull request #5699 from hashicorp/dani/b-eval-broker-lifetime Eval Broker: Prevent redundant enqueues when a node is not a leader	2019-05-15 23:30:52 +01:00
Danielle Lancashire	2fb93a6229	evalbroker: test for no enqueue on disabled	2019-05-15 11:02:21 +02:00

2878 commits