Rename SnapshotAfter to SnapshotMinIndex. The old name was not
technically accurate. SnapshotAtOrAfter is more accurate, but wordy and
still lacks context about what precisely it is at or after (the index).
SnapshotMinIndex was chosen as it describes the action (snapshot), a
constraint (minimum), and the object of the constraint (index).
The previous commit prevented evaluating plans against a state snapshot
which is older than the snapshot at which the plan was created. This is
correct and prevents failures trying to retrieve referenced objects that
may not exist until the plan's snapshot. However, this is insufficient
to guarantee consistency if the following events occur:
1. P1, P2, and P3 are enqueued with snapshot @ 100
2. Leader evaluates and applies Plan P1 with snapshot @ 100
3. Leader evaluates Plan P2 with snapshot+P1 @ 100
4. P1 commits @ 101
5. Leader evaluates and applies Plan P3 with snapshot+P2 @ 100
Since only the previous plan is optimistically applied to the state
store, the snapshot used to evaluate a plan may not contain the N-2
plan!
To ensure plans are evaluated and applied serially we must consider all
previous plans' committed indexes when evaluating further plans.
Therefore, combined with the last PR, the minimum index at which to
evaluate a plan is:
max(previousPlanResultIndex, plan.SnapshotIndex)
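As a rough sketch in Go (the names here are illustrative, not the real
planner fields):

    // The snapshot used to evaluate a plan must include both the previous
    // plan's committed result and the snapshot the worker built the plan
    // from, so the minimum acceptable index is the larger of the two
    // lower bounds.
    func minIndexForPlan(prevPlanResultIndex, planSnapshotIndex uint64) uint64 {
        if prevPlanResultIndex > planSnapshotIndex {
            return prevPlanResultIndex
        }
        return planSnapshotIndex
    }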
Plan application should use a state snapshot at or after the Raft index
at which the plan was created; otherwise it risks being rejected based on
stale data.
This commit adds Plan.SnapshotIndex, which is set by workers when
submitting a plan. SnapshotIndex is set to the Raft index of the snapshot
the worker used to generate the plan.
Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex.
While RefreshIndex informs workers their StateStore is behind the
leader's, SnapshotIndex is a way to prevent the leader from using a
StateStore behind the worker's.
Plan.SnapshotIndex should be considered the *lower bound* index for
consistently handling plan application.
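A minimal sketch of the worker-side step, assuming a SnapshotIndex field
on the plan and a snapshot that can report its latest Raft index (the
names are illustrative):

    // StateSnapshot stands in for the worker's state store snapshot.
    type StateSnapshot interface {
        LatestIndex() (uint64, error)
    }

    // Plan carries the index of the snapshot it was generated from.
    type Plan struct {
        EvalID        string
        SnapshotIndex uint64
    }

    // stampPlan records the snapshot's Raft index on the plan before it
    // is submitted; the leader treats it as the lower bound for
    // evaluating the plan.
    func stampPlan(plan *Plan, snap StateSnapshot) error {
        idx, err := snap.LatestIndex()
        if err != nil {
            return err
        }
        plan.SnapshotIndex = idx
        return nil
    }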
Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans *after* the
first plan after a leader election.
The Raft barrier on leader election ensures the leader's StateStore has
caught up to the log index at which it was elected. This guarantees its
StateStore is at an index > lastPlanIndex.
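Roughly, the barrier step looks like this (Barrier and AppliedIndex are
real hashicorp/raft calls; how the resulting index feeds into the planner
here is an assumption):

    import (
        "time"

        "github.com/hashicorp/raft"
    )

    // electionIndex blocks until the new leader's FSM has applied every
    // log entry up to its election, so the StateStore is at an index >=
    // lastPlanIndex before the first plan is evaluated.
    func electionIndex(r *raft.Raft, timeout time.Duration) (uint64, error) {
        if err := r.Barrier(timeout).Error(); err != nil {
            return 0, err
        }
        return r.AppliedIndex(), nil
    }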
- updated region in job metadata that gets persisted to the Nomad datastore
- fixed many unrelated unit tests that used an invalid region value
(they previously passed because the HCL region wasn't getting picked up
and the job would default to the global region)
Enterprise only.
Disable preemption for service and batch jobs by default.
Maintain backward compatibility in an x.y.Z release. Consider switching
the default for new clusters in the future.
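A hedged sketch of what the default could look like, loosely modeled on
the scheduler's preemption configuration (field names are assumptions):

    // System preemption keeps its existing default, while service and
    // batch preemption start disabled so behavior does not change within
    // an x.y.Z release.
    type PreemptionDefaults struct {
        SystemSchedulerEnabled  bool
        ServiceSchedulerEnabled bool
        BatchSchedulerEnabled   bool
    }

    func defaultPreemption() PreemptionDefaults {
        return PreemptionDefaults{
            SystemSchedulerEnabled:  true,  // unchanged
            ServiceSchedulerEnabled: false, // disabled by default
            BatchSchedulerEnabled:   false, // disabled by default
        }
    }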
Revert plan_apply.go changes from #5411
Since non-Command Raft messages do not update the StateStore index,
SnapshotAfter may unnecessarily block and needlessly fail in idle
clusters where the last Raft message is a non-Command message.
This is trivially reproducible with the dev agent and a job that has 2
tasks, 1 of which fails.
The correct logic would be to SnapshotAfter the previous plan's index to
ensure consistency. New clusters or newly elected leaders will not have
a previous plan, so the index at which the leader was elected should be
used instead.
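A sketch of that rule (prevPlanIndex and electionIndex are assumed
inputs, not the real field names):

    // snapshotTarget picks the index to wait for before evaluating the
    // next plan: the previous plan's index keeps plan application serial,
    // and a freshly elected leader with no previous plan falls back to
    // the index at which it was elected.
    func snapshotTarget(prevPlanIndex, electionIndex uint64) uint64 {
        if prevPlanIndex == 0 {
            return electionIndex
        }
        return prevPlanIndex
    }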
Fix a case where `node.StatusUpdatedAt` was manipulated directly in
memory.
This ensures that StatusUpdatedAt is set in the Raft layer, and that the
field is also updated when node drain/eligibility is updated.
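A sketch of the shape of the fix, with illustrative request and handler
names:

    import "time"

    // NodeStatusUpdateRequest stands in for the Raft message carrying a
    // node status (or drain/eligibility) change.
    type NodeStatusUpdateRequest struct {
        NodeID    string
        Status    string
        UpdatedAt int64
    }

    // updateNodeStatus stamps UpdatedAt once, before the request enters
    // the Raft log, so every server's FSM persists the same value instead
    // of each mutating node.StatusUpdatedAt in memory.
    func updateNodeStatus(nodeID, status string, raftApply func(*NodeStatusUpdateRequest) error) error {
        return raftApply(&NodeStatusUpdateRequest{
            NodeID:    nodeID,
            Status:    status,
            UpdatedAt: time.Now().Unix(),
        })
    }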
The previous commit could introduce a deadlock if capacityChangeCh was
full and the receiving side exited before freeing a slot for the sending
side to send on. Flush would then block forever waiting to acquire the
lock just to throw the pending update away.
The race is around getting/setting the chan field, not chan operations,
so only lock around getting the chan field.
I assume the mutex was being released before sending on capacityChangeCh
to avoid blocking in the critical section, but:
1. This is a race.
2. capacityChangeCh has a *huge* buffer (8096). If it's full things
already seem Very Bad, and a little backpressure seems appropriate.
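A sketch of the locking change (the struct and field names approximate
the real batcher):

    import "sync"

    type capacityUpdate struct {
        // ...fields elided...
    }

    type blockedEvals struct {
        l sync.Mutex
        // capacityChangeCh may be swapped out, so reading or writing the
        // field needs the lock; sending on the channel does not.
        capacityChangeCh chan *capacityUpdate
    }

    // capacityCh copies the channel field under the lock; callers send
    // outside the critical section, so a full channel applies
    // backpressure without holding the mutex.
    func (b *blockedEvals) capacityCh() chan *capacityUpdate {
        b.l.Lock()
        defer b.l.Unlock()
        return b.capacityChangeCh
    }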
This fixes a bug in the state store during plan apply. When
denormalizing preempted allocations it incorrectly set the preemptor's
job during the update. This eventually causes a panic downstream in the
client. Added a test assertion that failed before and passes after this fix.
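A hedged sketch of the corrected denormalization step (types and fields
are simplified stand-ins for the real structs):

    type Job struct{ ID string }

    type Allocation struct {
        ID                    string
        Job                   *Job
        DesiredStatus         string
        DesiredDescription    string
        PreemptedByAllocation string
    }

    // denormalizePreempted updates only the preemption bookkeeping on the
    // existing allocation; it deliberately keeps the allocation's own Job
    // rather than attaching the preemptor's job, which is what previously
    // led to the client panic.
    func denormalizePreempted(existing *Allocation, preemptorID, desc string) *Allocation {
        updated := *existing // shallow copy; existing.Job is preserved
        updated.DesiredStatus = "evict"
        updated.DesiredDescription = desc
        updated.PreemptedByAllocation = preemptorID
        return &updated
    }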
* master: (912 commits)
Update redirects.txt
Added redirect for Spark guide link
client: log when server list changes
docs: mention regression in task config validation
fix update to changelog
update CHANGELOG with datacenter config validation https://github.com/hashicorp/nomad/pull/5665
typo: "atleast" -> "at least"
implement nomad exec for rkt
docs: fixed typo
use pty/tty terminology similar to github.com/kr/pty
vendor github.com/kr/pty
drivers: implement streaming exec for executor based drivers
executors: implement streaming exec
executor: scaffolding for executor grpc handling
client: expose allocated memory per task
client improve a comment in updateNetworks
stalebot: Add 'thinking' as an exempt label (#5684)
Added Sparrow link
update links to use new canonical location
Add redirects for restructing done in GH-5667
...
Fixes #1795
Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
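A sketch of the gating logic (channel names are illustrative):

    // waitForServerContact blocks a task that failed to reattach on
    // restore until the client has talked to a server (or is shutting
    // down), so a rebooted client does not restart allocations the server
    // has already rescheduled.
    func waitForServerContact(serverContactCh, shutdownCh <-chan struct{}) bool {
        select {
        case <-serverContactCh:
            return true
        case <-shutdownCh:
            return false
        }
    }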
Currently when an evalbroker is disabled, it still receives delayed
enqueues via log application in the FSM. This causes an ever-growing
heap of evaluations that will never be drained, and can cause memory
issues in larger clusters, or when left running for an extended period
of time without a leader election.
This commit prevents the enqueuing of evaluations while we are
disabled, and relies on the leader restoreEvals routine to handle
reconciling state during a leadership transition.
Enqueues that arrive during an Enabled->Disabled broker state transition
are handled by the enqueueLocked function dropping evals.
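A sketch of that guard in the enqueue path (simplified; the real broker
keeps ready queues and a delay heap behind this check):

    import "sync"

    type Evaluation struct{ ID string }

    type EvalBroker struct {
        l       sync.RWMutex
        enabled bool
        // ...ready queues, delay heap, stats...
    }

    // enqueueLocked is called with b.l held. While the broker is disabled
    // (non-leader or mid-transition), evaluations applied through the FSM
    // are dropped here instead of piling up in the delay heap; the
    // leader's restoreEvals pass reconciles state after the next
    // transition.
    func (b *EvalBroker) enqueueLocked(eval *Evaluation) {
        if !b.enabled {
            return
        }
        // ...place eval on the appropriate ready queue or delay heap...
    }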
Primarily a cleanup commit; however, there is currently a potential race
condition (that I'm not sure we've ever actually hit) during a flapping
SetEnabled/Disabled state where we may never correctly restart the eval
broker if it is being called from multiple routines.