open-nomad

Author	SHA1	Message	Date
Tim Gross	5c4d0a73f4	start all but first region deployment in paused state	2020-06-17 11:05:34 -04:00
Tim Gross	48e9f75c1e	multiregion: deploymentwatcher hooks This changeset establishes hooks in deploymentwatcher for multiregion deployments (for the enterprise version of Nomad).	2020-06-17 11:05:18 -04:00
Tim Gross	b09b7a2475	Multiregion job registration Integration points for multiregion jobs to be registered in the enterprise version of Nomad: * hook in `Job.Register` for enterprise to send job to peer regions * remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs	2020-06-17 11:04:58 -04:00
Drew Bailey	9263fcb0d3	Multiregion deploy status and job status CLI	2020-06-17 11:03:34 -04:00
Tim Gross	473a0f1d44	multiregion: unblock and cancel RPCs	2020-06-17 11:02:26 -04:00
Tim Gross	ede3a4f1c4	multiregion: request structs	2020-06-17 11:00:34 -04:00
Tim Gross	6851024925	Multiregion structs Initial struct definitions, jobspec parsing, validation, and conversion between Nomad structs and API structs for multi-region deployments.	2020-06-17 11:00:14 -04:00
Chris Baker	9fc66bc1aa	support in API client and Job.Register RPC for PreserveCounts	2020-06-16 18:45:28 +00:00
Chris Baker	1e3563e08c	wip: added PreserveCounts to struct.JobRegisterRequest, development test for Job.Register	2020-06-16 18:45:17 +00:00
Chris Baker	7ed06cced0	core: update Job.Scale to save the previous job count in the ScalingEvent	2020-06-15 19:49:22 +00:00
Chris Baker	aeb3ed449e	wip: added .PreviousCount to api.ScalingEvent and structs.ScalingEvent, with developmental tests	2020-06-15 19:40:21 +00:00
Mahmood Ali	c17ffb2d35	Merge pull request #8131 from hashicorp/f-snapshot-restore Implement snapshot restore	2020-06-15 08:32:34 -04:00
Mahmood Ali	9bfc3e28d9	Apply suggestions from code review Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2020-06-15 08:32:16 -04:00
Lang Martin	069840bef8	scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138) * scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect * scheduler/reconcile: thread follupEvalIDs through to results.stop * scheduler/reconcile: comment typo * nomad/_test: correct arguments for plan.AppendStoppedAlloc * scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)	2020-06-09 17:13:53 -04:00
Mahmood Ali	63e048e972	clarify comments, esp related to leadership code	2020-06-09 12:01:31 -04:00
Mahmood Ali	b543460e0a	loosen raft timeout	2020-06-07 16:38:11 -04:00
Mahmood Ali	69bb42acf8	tests: prefix agent logs to identify agent sources	2020-06-07 16:38:11 -04:00
Mahmood Ali	47a163b63f	reassert leadership	2020-06-07 15:47:06 -04:00
Mahmood Ali	9eb13ae144	basic snapshot restore	2020-06-07 15:46:23 -04:00
Mahmood Ali	bf7a3583e5	Merge pull request #8089 from hashicorp/b-leader-worker-count leadership: pause and unpause workers consistently	2020-06-04 12:01:01 -04:00
Mahmood Ali	cd8e1b4d62	stop periodic dispatch at end of tests (#8111)	2020-06-04 09:15:00 -04:00
Lang Martin	ac7c39d3d3	Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099) * client/heartbeatstop: reversed time condition for startup grace * scheduler/generic_sched: use `delayInstead` to avoid a loop Without protecting the loop that creates followUpEvals, a delayed eval is allowed to create an immediate subsequent delayed eval. For both `stop_after_client_disconnect` and the `reschedule` block, a delayed eval should always produce some immediate result (running or blocked) and then only after the outcome of that eval produce a second delayed eval. * scheduler/reconcile: lostLater are different than delayedReschedules Just slightly. `lostLater` allocs should be used to create batched evaluations, but `handleDelayedReschedules` assumes that the allocations are in the untainted set. When it creates the in-place updates to those allocations at the end, it causes the allocation to be treated as running over in the planner, which causes the initial `stop_after_client_disconnect` evaluation to be retried by the worker.	2020-06-03 09:48:38 -04:00
Mahmood Ali	70fbcb99c2	leadership: pause and unpause workers consistently This fixes a bug where leadership establishment pauses 3/4 of workers but stepping down unpauses only 1/2!	2020-06-01 10:57:53 -04:00
Mahmood Ali	891fb3f8a9	test for paused workers upon leadership revocation	2020-06-01 10:48:42 -04:00
Mahmood Ali	de44d9641b	Merge pull request #8047 from hashicorp/f-snapshot-save API for atomic snapshot backups	2020-06-01 07:55:16 -04:00
Mahmood Ali	e37a3312d5	If leadership fails, consider it handled The callers of `forward` and the old implementation expect failures to be accompanied by a true value! This fixes the issue and gets the tests passing!	2020-05-31 22:06:17 -04:00
Mahmood Ali	30ab9c84e5	more review feedback	2020-05-31 21:39:09 -04:00
Mahmood Ali	a73cd01a00	Merge pull request #8001 from hashicorp/f-jobs-list-across-nses endpoint to expose all jobs across all namespaces	2020-05-31 21:28:03 -04:00
Mahmood Ali	082c085068	Merge pull request #8036 from hashicorp/f-background-vault-revoke-on-restore Speed up leadership establishment	2020-05-31 21:27:16 -04:00
Mahmood Ali	1af32e65bc	clarify rpc consistency readiness comment	2020-05-31 21:26:41 -04:00
Mahmood Ali	0819ea60ea	Apply suggestions from code review Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2020-05-31 21:04:39 -04:00
Mahmood Ali	37c6160b96	Handle nil/empty cluster metadata Handle case where a snapshot is made before cluster metadata is created. This fixes a bug where a server may have empty cluster metadata if it created and installed a Raft snapshot before a new cluster metadata ID is generated. This case is very unlikely to arise. Most likely reason is when upgrading from an old version slowly where servers may use snapshots before all servers upgrade. This happened for a user with a log line like: ``` 2020-05-21T15:21:56.996Z [ERROR] nomad.fsm: ClusterSetMetadata failed: error="set cluster metadata failed: refusing to set new cluster id, previous: , new: <<redacted> ```	2020-05-29 13:34:21 -04:00
Drew Bailey	23d24c7a7f	removes pro tags (#8014)	2020-05-28 15:40:17 -04:00
Mahmood Ali	475b3b77ad	Merge pull request #8060 from hashicorp/tests-deflake-20200526 Deflake some tests - 2020-05-27 edition	2020-05-27 15:24:31 -04:00
Drew Bailey	34871f89be	Oss license support for ent builds (#8054) * changes necessary to support oss licensing shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Mahmood Ali	61e4f5aaf9	tests: use GreaterOrEqual and apply change to other tests	2020-05-27 11:22:48 -04:00
Mahmood Ali	6dfe0f5d3b	tests: use t.Fatalf when it's clearer	2020-05-27 10:09:56 -04:00
Mahmood Ali	ec1fcedb93	tests: node drain events may be duplicated	2020-05-27 08:59:06 -04:00
Mahmood Ali	c3c2a85314	tests: wait until clients are in the state store	2020-05-26 18:53:24 -04:00
Mahmood Ali	5d80d2a511	tests: eval may be processed quickly	2020-05-26 18:53:24 -04:00
Mahmood Ali	19141f8103	{volume|deployment}watcher: check for nil batcher	2020-05-26 14:54:27 -04:00
Mahmood Ali	81ac098a22	deploymentwatcher: no batcher when disabling When disabling deploymentwatcher (at the end of a test), avoid starting a new update batcher with its new goroutine.	2020-05-26 14:44:47 -04:00
Mahmood Ali	ccc89f940a	terminate leader goroutines on shutdown Ensure that nomad steps down (and terminates leader goroutines) on shutdown, when the server is the leader. Without this change, `monitorLeadership` may handle the `shutdownCh` event and exit early before handling the raft `leaderCh` event, and end up leaking leadership goroutines.	2020-05-26 10:18:10 -04:00
Mahmood Ali	e671913e56	fix a trace logline	2020-05-26 10:18:09 -04:00
Mahmood Ali	1c79c3b93d	refactor: context is first parameter By convention, go functions take `context.Context` as the first argument.	2020-05-26 10:18:09 -04:00
Mahmood Ali	1eff8b0ed8	volumewatcher: no batcher when disabling When disabling volumewatcher (at the end of a test), avoid starting a new update batcher with its new goroutine.	2020-05-26 10:18:09 -04:00
Mahmood Ali	b895cef622	always set purgeFunc purgeFunc cannot be nil, so ensure it's set to a no-op function in tests.	2020-05-21 21:05:53 -04:00
Mahmood Ali	2108681c1d	Endpoint for snapshotting server state	2020-05-21 20:04:38 -04:00
Mahmood Ali	fbe140b26c	vault: ensure ttl expired tokens are purged If a token that is scheduled for revocation expires before we revoke it, ensure that it is marked as purged in raft and is only removed from local vault state if the purge operation succeeds. Prior to this change, we may remove the accessor from local state but not purge it from Raft. This causes unnecessary churn in the next leadership elections (and until 0.11.2 resulted in indefinite retries).	2020-05-21 19:54:50 -04:00
Mahmood Ali	aa8e79e55b	Reorder leadership handling Start serving RPC immediately after leader components are enabled, and move clean up to the bottom as they don't block leadership responsibilities.	2020-05-21 08:30:31 -04:00


3340 commits