open-nomad

Author	SHA1	Message	Date
Mahmood Ali	8d4f914be9	Merge pull request #5790 from hashicorp/b-reschedule-desired-state Mark rescheduled allocs as stopped.	2019-06-13 17:28:59 -04:00
Mahmood Ali	5e6327b6a1	Test behavior no reschedule for service/batch jobs	2019-06-13 16:41:19 -04:00
Mahmood Ali	faf643a375	Don't stop rescheduleLater allocations When an alloc is due to be rescheduleLater, it goes through the reconciler twice: once to be ignored with a follow up evals, and once again when processing the follow up eval where they appear as rescheduleNow. Here, we ignore them in the first run and mark them as stopped in second iteration; rather than stop them twice.	2019-06-13 09:44:41 -04:00
Mahmood Ali	5dc404ecab	Only preempt for network when there is a network When examining preemption for networks, only consider allocs that have networks. Fixes https://github.com/hashicorp/nomad/issues/5793	2019-06-07 18:55:55 -04:00
Mahmood Ali	98575f5788	test: add tests for network devices and preemption	2019-06-07 18:55:02 -04:00
Mahmood Ali	fd8fb8c22b	Stop allocs to be rescheduled Currently, when an alloc fails and is rescheduled, the alloc desired state remains as "run" and the nomad client may not free the resources. Here, we ensure that an alloc is marked as stopped when it's rescheduled. Notice the Desired Status and Description before and after this change: Before: ``` mars-2:nomad notnoop$ nomad alloc status 02aba49e ID = 02aba49e Eval ID = bb9ed1d2 Name = example-reschedule.nodes[0] Node ID = 5853d547 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = run Desired Description = <none> Created = 10s ago Modified = 5s ago Replacement Alloc ID = d6bf872b Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 0/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:12:45Z Finished At = 2019-06-06T21:12:50Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:12:50-04:00 Not Restarting Policy allows no restarts 2019-06-06T17:12:50-04:00 Terminated Exit Code: 1 2019-06-06T17:12:45-04:00 Started Task started by client 2019-06-06T17:12:45-04:00 Task Setup Building Task Directory 2019-06-06T17:12:45-04:00 Received Task received by client ``` After: ``` ID = 5001ccd1 Eval ID = 53507a02 Name = example-reschedule.nodes[0] Node ID = a3b04364 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = stop Desired Description = alloc was rescheduled because it failed Created = 13s ago Modified = 3s ago Replacement Alloc ID = 7ba7ac20 Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 21/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:22:50Z Finished At = 2019-06-06T21:22:55Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:22:55-04:00 Not Restarting Policy allows no restarts 2019-06-06T17:22:55-04:00 Terminated Exit Code: 1 2019-06-06T17:22:50-04:00 Started Task started by client 2019-06-06T17:22:50-04:00 Task Setup Building Task Directory 2019-06-06T17:22:50-04:00 Received Task received by client ```	2019-06-06 17:27:12 -04:00
Mahmood Ali	3eda42d027	tests: Migrated allocs aren't lost Fix `TestServiceSched_NodeDown` for checking that the migrated allocs are actually marked to be stopped. The boolean logic in test made it skip actually checking client status as long as desired status was stop. Here, we mark some jobs for migration while leaving others as running, and we check that lost flag is only set for non-migrated allocs.	2019-06-06 16:05:07 -04:00
Lang Martin	34230577df	describe a pending deployment with auto_promote accurately	2019-05-22 12:32:08 -04:00
Lang Martin	d462639cc9	sched reconcile copy AutoPromote to DeploymentState	2019-05-22 12:32:08 -04:00
Preetha Appan	374eee421f	Fix comment and assert score in test case	2019-05-15 12:35:57 -05:00
Nick Ethier	f0b9f8e37a	fix missing brace	2019-05-15 13:02:04 -04:00
Nick Ethier	0d851b5d11	scheduler: add check to prohibit returning inf during spread boost calculation	2019-05-15 13:00:24 -04:00
Lang Martin	29ea112586	system_sched & test cleanup comments	2019-05-01 12:25:26 -04:00
Lang Martin	c490dacf76	system_sched_test extend the test to check ineligible nodes	2019-05-01 12:25:26 -04:00
Lang Martin	c43bcbd35e	system_sched when a node is filtered, don't mark failure	2019-05-01 12:25:26 -04:00
Lang Martin	aecec5df1b	system_sched_test create partially constrained job	2019-05-01 12:25:26 -04:00
Arshneet Singh	d4e7a5c005	Add comments to functions, and use require instead of assert	2019-04-23 09:57:21 -07:00
Arshneet Singh	4cf4324b8f	Remove allowPlanOptimization from schedulers	2019-04-23 09:18:02 -07:00
Arshneet Singh	0dd4c109e8	Compat tags	2019-04-23 09:18:01 -07:00
Arshneet Singh	65f5fab131	Add tests for plan normalization	2019-04-23 09:18:01 -07:00
Arshneet Singh	b977748a4b	Add code for plan normalization	2019-04-23 09:18:01 -07:00
Danielle	198a838b61	Merge pull request #5512 from hashicorp/dani/f-alloc-stop alloc-lifecycle: nomad alloc stop	2019-04-23 13:05:08 +02:00
Danielle Lancashire	832f607433	allocs: Add nomad alloc stop This adds a `nomad alloc stop` command that can be used to stop and force migrate an allocation to a different node. This is built on top of the AllocUpdateDesiredTransitionRequest and explicitly limits the scope of access to that transition to expose it under the alloc-lifecycle ACL. The API returns the follow up eval that can be used as part of monitoring in the CLI or parsed and used in an external tool.	2019-04-23 12:50:23 +02:00
Preetha Appan	bcb5c8c70d	remove stray new line	2019-04-12 10:32:48 -05:00
Preetha Appan	8ddc076c1d	Refactor scheduler package to enable preemption for batch/service jobs	2019-04-10 20:24:01 -05:00
James Rasell	9470507cf4	Add NodeName to the alloc/job status outputs. Currently when operators need to log onto a machine where an alloc is running they will need to perform both an alloc/job status call and then a call to discover the node name from the node list. This updates both the job status and alloc status output to include the node name within the information to make operator use easier. Closes #2359 Closes #1180	2019-04-10 10:34:10 -05:00
Preetha Appan	da1ce9bcea	Fix bug where scoring metadata would be overridden during an inplace upgrade.	2019-03-12 23:36:46 -05:00
Alex Dadgar	41265d4d61	Change types of weights on spread/affinity	2019-01-30 12:20:38 -08:00
Nick Ethier	24cbf42798	scheduler: fix NPE when deployment is nil, but placement is a canary	2019-01-28 20:22:59 -06:00
Alex Dadgar	5198ff05c3	convert driver to device for device constraint/attributes	2019-01-23 10:58:45 -08:00
Alex Dadgar	4bdccab550	goimports	2019-01-22 15:44:31 -08:00
Preetha Appan	3b054d6135	Remove unnecessary usage of alloc.Resource	2019-01-10 16:36:47 -06:00
Mahmood Ali	0dfa93a3c1	appease linter	2019-01-08 10:58:49 -05:00
Alex Dadgar	8a35d7b1dd	Test recovery	2019-01-07 14:49:41 -08:00
Preetha	f406e66ab8	Merge pull request #4881 from hashicorp/f-device-preemption Device preemption	2018-12-11 18:34:19 -06:00
Preetha Appan	977a4a540d	Early continue after meeting needed count Also adds another optimization that filters out un-needed allocations as a final filtering step	2018-12-11 10:12:18 -06:00
Preetha Appan	f60c52c8ba	Score combinations of allocs from multiple devices for preemption	2018-12-07 18:35:47 -06:00
Alex Dadgar	1e3c3cb287	Deprecate IOPS IOPS has been modelled as a resource since Nomad 0.1 but has never actually been detected and there is no plan in the short term to add detection. This is because IOPS is a bit simplistic of a unit to define the performance requirements from the underlying storage system. In its current state it adds unnecessary confusion and can be removed without impacting any users. This PR leaves IOPS defined at the jobspec parsing level and in the api/ resources since these are the two public uses of the field. These should be considered deprecated and only exist to allow users to stop using them during the Nomad 0.9.x release. In the future, there should be no expectation that the field will exist.	2018-12-06 15:09:26 -08:00
Preetha Appan	63681fac0c	use structured logging everywhere consistently	2018-12-03 08:31:41 -06:00
Preetha Appan	766820def3	addresses some code clarity review comments	2018-11-27 11:02:06 -06:00
Mahmood Ali	96ffe044e7	Simplify map count update logic Co-Authored-By: preetapan <preetha@hashicorp.com>	2018-11-27 10:03:11 -06:00
Mahmood Ali	57b94c2d50	code review suggestion Co-Authored-By: preetapan <preetha@hashicorp.com>	2018-11-27 09:59:57 -06:00
Preetha Appan	86f416a984	Fix formatting	2018-11-16 20:45:52 -06:00
Preetha Appan	8efe6171e4	Fix preemption logic bug, need to group allocations by device first. This ensures that the set of allocations chosen for preemption all share the same device where ID is <vendor/type/device>	2018-11-16 20:32:10 -06:00
Danielle Tomlinson	9c72dafc95	scheduler: Add is_set/is_not_set constraints This adds constraints for asserting that a given attribute or value exists, or does not exist. This acts as a companion to =, or != operators, e.g: ```hcl constraint { attribute = "${attrs.type}" operator = "!=" value = "database" } constraint { attribute = "${attrs.type}" operator = "is_set" } ```	2018-11-15 11:00:32 -08:00
Preetha Appan	998968f57a	fix linting	2018-11-15 12:27:32 -06:00
Preetha Appan	e5de50fba8	Initial implementation of device preemption	2018-11-15 11:09:26 -06:00
Danielle Tomlinson	e5c641daa9	scheduler: Allow comparisons of nil values This commit allows the ConstraintChecker to test values that do not exist. This is useful when wanting to _exclude_ given nodes from executing a job, for example, if you wanted to give canary nodes an attribute, and not run critical services on them, you may specify something like the below, but not want to tag all other nodes with the inverse. ```hcl constraint { attribute = "${node.attr.canary}" operator = "!=" value = "1" } ``` This also requires all constraint checkers to allow for nil target values, as they will no longer be short circuited by resolving a target.	2018-11-13 13:36:51 -08:00
Alex Dadgar	08dc2ea702	Merge pull request #4867 from hashicorp/b-deployment-progress-deadline Blocked evaluation fixes	2018-11-13 10:29:03 -08:00
Preetha Appan	de890b9d5c	blank line	2018-11-12 15:50:14 -06:00
Preetha Appan	20af09a1ef	Fix logic bug in tracking sum of matched affinity weights We need to track the sum of matching weights per device, but only change the final return value if it's the highest scoring choice	2018-11-12 15:06:45 -06:00
Preetha Appan	285b9b6001	Normalize scores correctly	2018-11-08 17:01:58 -06:00
Preetha Appan	f20f2ca8e9	Fixes device scheduling unit tests Also changes the logic for score when there is more than one task requesting a device. Since inter task affinities are already normalized, we take the average of the scores across tasks.	2018-11-08 10:31:19 -06:00
Alex Dadgar	dbb05357bc	fix test	2018-11-07 11:59:24 -08:00
Alex Dadgar	a7ca737fb6	review comments	2018-11-07 11:31:52 -08:00
Alex Dadgar	36abd3a3d8	review comments	2018-11-07 10:33:22 -08:00
Alex Dadgar	e3cbb2c82e	allocs fit checks if devices get oversubscribed	2018-11-07 10:33:22 -08:00
Alex Dadgar	4f9b3ede87	Split device accounter and allocator	2018-11-07 10:32:03 -08:00
Alex Dadgar	6fa893c801	affinities	2018-11-07 10:32:03 -08:00
Alex Dadgar	feb83a2be3	assign devices	2018-11-07 10:32:03 -08:00
Alex Dadgar	6d8bb3a7bd	Duplicate blocked evals cancelling improved The old logic for cancelling duplicate blocked evaluations by job id had the issue where the newer evaluation could have additional node classes that it is (in)eligible for that we would not capture. This could make it such that cluster state could change such that the job would make progress but no evaluation was unblocked.	2018-11-07 10:08:23 -08:00
Preetha Appan	a6b714b81c	update preemption tests to use new node resource structs also includes a fix to remove unnecessary subtraction of network mbits	2018-11-02 17:59:53 -05:00
Preetha	b2b52b1ada	Merge pull request #4794 from hashicorp/f-preemption-systemjobs Preemption for system jobs	2018-11-02 16:28:06 -05:00
Preetha Appan	56de32f363	Address more minor code review feedback	2018-11-02 16:26:34 -05:00
Preetha Appan	253a351532	Fix test setup	2018-11-02 16:06:25 -05:00
Preetha Appan	fba24e5a8a	dereference safely	2018-11-02 15:58:59 -05:00
Preetha Appan	d061678df7	Fix static port preemption to be device aware	2018-11-02 13:07:24 -05:00
Preetha Appan	4182444937	Handle static port preemption when there are multiple devices Also added test case	2018-11-02 09:09:50 -05:00
Preetha Appan	fd60e66f86	Plumb alloc resource cache in a few more places. also removed now unused method	2018-11-01 16:44:43 -05:00
Preetha Appan	78d635edca	More review comments	2018-11-01 16:36:11 -05:00
Preetha Appan	6e1023ba08	Cleaner way to exit early, and fixed a couple more places reading from alloc.Resources	2018-11-01 16:15:58 -05:00
Preetha Appan	b4dd26247f	review comments	2018-11-01 12:01:59 -05:00
Preetha Appan	d03201adf8	Fix formatting of allocation score metrics	2018-10-30 12:03:23 -05:00
Preetha Appan	f1c3eb2792	Introduce interface with multiple implementations for resource distance	2018-10-30 11:06:32 -05:00
Preetha Appan	047af5141e	refactor preemption code to use method receivers and setters for common fields	2018-10-30 11:06:32 -05:00
Preetha Appan	1a5421f5d7	more minor cleanup	2018-10-30 11:06:32 -05:00
Preetha Appan	0494a098ce	More style and readability fixes from review	2018-10-30 11:06:32 -05:00
Preetha Appan	3910ba9bbd	Preempted allocations should be removed from proposed allocations	2018-10-30 11:06:32 -05:00
Preetha Appan	9dd76d83dc	comments	2018-10-30 11:06:32 -05:00
Preetha Appan	e6234e3cc5	fix end to end scheduler test to use new resource structs correctly	2018-10-30 11:06:32 -05:00
Preetha Appan	8807c25b11	Modify preemption code to use new style of resource structs	2018-10-30 11:06:32 -05:00
Preetha Appan	c1c1c230e4	Make preemption config a struct to allow for enabling based on scheduler type	2018-10-30 11:06:32 -05:00
Preetha Appan	25a047267f	Use scheduler config from state store to enable/disable preemption	2018-10-30 11:06:32 -05:00
Preetha Appan	1805032e69	Fix linting and better comments	2018-10-30 11:06:32 -05:00
Preetha Appan	cc295b90de	Implement preemption for system jobs. This commit implements an allocation selection algorithm for finding allocations to preempt. It currently special cases network resource asks from others (cpu/memory/disk/iops).	2018-10-30 11:06:32 -05:00
Preetha Appan	22aee7294e	Merge branch 'f-fix-resource-type' of github.com:hashicorp/nomad into f-fix-resource-type	2018-10-16 18:30:12 -05:00
Preetha Appan	53c3f8151b	fix linting	2018-10-16 18:29:49 -05:00
Alex Dadgar	a78cefec18	use int64	2018-10-16 15:34:32 -07:00
Preetha Appan	7c0d8c646c	Change CPU/Disk/MemoryMB to int everywhere in new resource structs	2018-10-16 16:21:42 -05:00
Alex Dadgar	f5a76d8411	review comments	2018-10-15 15:31:13 -07:00
Alex Dadgar	7ecd65109a	Check constraints on devices	2018-10-14 13:35:47 -07:00
Alex Dadgar	5284554fcc	rework device checker	2018-10-13 16:47:53 -07:00
Alex Dadgar	1089e13b14	add to stack	2018-10-13 12:27:49 -07:00
Alex Dadgar	9b5aaac410	Device feasibility checker	2018-10-13 12:27:49 -07:00
Preetha Appan	1574e898af	Fix bug in reconciler where terminal allocs on a job already stopped were unnecessarily updated	2018-10-08 21:03:49 -05:00
Alex Dadgar	01f8e5b95f	renames	2018-10-04 14:57:25 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	bac5cb1e8b	Scheduler uses allocated resources	2018-10-02 17:08:25 -07:00
Preetha Appan	a10118c461	Add failed follow up to the list of allowed eval trigger reasons needs unit test	2018-09-25 10:49:55 -07:00
Alex Dadgar	6a21f9fe96	Unique TriggerBy for blocked evals Give blocked evals a unique triggerby reason to make debugging a chain of evaluations easier.	2018-09-24 14:47:49 -07:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Alex Dadgar	3ba62efd5e	Failed/paused deployments do not block migrations This PR changes behavior of the scheduler such that a task group with a deployment that is failed or paused will not cause the scheduler to skip migrations. The reason for this change is that it causes a bad UX when draining nodes with allocations that are part of a failed/paused deployment. These operations should not be coupled in any way and this remedies that. Prior behavior was still correct, but required either jobs to transition to a healthy state or for the node to hit its drain deadline.	2018-09-10 15:28:45 -07:00
Alex Dadgar	cc92cd92cd	Merge pull request #4642 from hashicorp/b-vet Fix vet errors and use newer go version in travis	2018-09-04 17:04:02 -07:00
Alex Dadgar	c6576ddac1	Fix make check errors	2018-09-04 16:03:52 -07:00
Preetha Appan	751c0eb5a5	code review feedback	2018-09-04 16:10:11 -05:00
Preetha Appan	9bc0962527	Track top k nodes by norm score rather than top k nodes per scorer	2018-09-04 16:10:11 -05:00
Preetha Appan	6ed527c636	Use heap to store top K scoring nodes. Scoring metadata is now aggregated by scorer type to make it easier to parse when reading it in the CLI.	2018-09-04 16:10:11 -05:00
Preetha Appan	65cf4373b3	fix linting error	2018-09-04 16:10:11 -05:00
Preetha Appan	dd5fe6373f	Fix scoring logic for uneven spread to incorporate current alloc count Also addressed other small code review comments	2018-09-04 16:10:11 -05:00
Preetha Appan	e72c0fe527	more cleanup	2018-09-04 16:10:11 -05:00
Preetha Appan	4c624424e6	added some unit tests for -1 spread score	2018-09-04 16:10:11 -05:00
Preetha Appan	92d37acc2a	comment and formatting cleanup	2018-09-04 16:10:11 -05:00
Preetha Appan	7b0a27cad6	fix scoring algorithm when min count == current count	2018-09-04 16:10:11 -05:00
Preetha Appan	bad075f640	Remove hardcoded boosts for even spread. instead, calculate them based on delta between current and minimum value	2018-09-04 16:10:11 -05:00
Preetha Appan	c56873ff37	Implement support for even spread across datacenters, with unit test	2018-09-04 16:10:11 -05:00
Preetha Appan	d091c00dd3	Support implicit spread target to account for remaining desired counts	2018-09-04 16:10:11 -05:00
Preetha Appan	33779abe5f	fix comments	2018-09-04 16:10:11 -05:00
Preetha Appan	5812f906c8	Allow empty spread targets, and validate target percentages.	2018-09-04 16:10:11 -05:00
Preetha Appan	55f276c189	Include spreads configured at job level when precomputing weights/desired counts.	2018-09-04 16:10:11 -05:00
Preetha Appan	fbd0004707	Fix warnings	2018-09-04 16:10:11 -05:00
Preetha Appan	db0d95b09c	Implement spread iterator that scores according to percentage of desired count in each target. Added this as a new step in the stack and some unit tests	2018-09-04 16:10:11 -05:00
Preetha Appan	eccf128c5c	Some minor changes from code review	2018-09-04 16:10:11 -05:00
Preetha Appan	038ed52877	Fix after rename to ConstraintSetContainsAny	2018-09-04 16:10:11 -05:00
Preetha Appan	3a39db3902	Fix linting	2018-09-04 16:10:11 -05:00
Preetha Appan	d5cd2bbddb	Remove unnecessary reset	2018-09-04 16:10:11 -05:00
Preetha Appan	dccb693221	test for setcontainsany, and treat set_contains same as set_contains_all	2018-09-04 16:10:11 -05:00
Preetha Appan	70bfd0c0cb	Address some review feedback	2018-09-04 16:10:11 -05:00
Preetha Appan	8685593ec0	Back out changes to propertyset that were not necessary for affinities	2018-09-04 16:10:11 -05:00
Preetha Appan	5eacd6ada4	Implement affinity support in generic scheduler	2018-09-04 16:10:11 -05:00
Alex Dadgar	e1c239daae	Merge pull request #4414 from hashicorp/b-stop-summary Reset Queued allocs to zero when job stopped	2018-07-16 14:32:55 -07:00
Nick Ethier	6b6777359b	scheduler: fix missing err assignment	2018-07-11 14:27:10 -04:00
Nick Ethier	5f6def5b04	scheduler: better error handling	2018-07-05 11:00:03 -04:00
Nick Ethier	030e650e78	scheduler: fix nil pointer exception	2018-07-02 16:05:38 -04:00
Alex Dadgar	300b1a7a15	Tests only use testlog package logger	2018-06-13 15:40:56 -07:00
Alex Dadgar	c3c79c408e	Reset Queued allocs to zero when job stopped When a job is stopped but not purged, we should set the Queued count to be zero.	2018-06-13 10:46:39 -07:00
Preetha Appan	b64788043e	make test create index clearer	2018-06-05 17:29:59 -05:00
Preetha Appan	3e264dcb79	Fix reconciler bug with deployment not being created if job create index is different This fixes an issue where if a job is purged and resubmitted Nomad does not create a new deployment. Adds unit test that failed before this fix	2018-06-05 13:58:53 -05:00
Preetha Appan	f8a23bc54a	fix test comment	2018-05-09 16:01:34 -05:00
Preetha Appan	ef531b0f34	Add unit tests for forced rescheduling	2018-05-09 11:30:42 -05:00
Preetha Appan	c1b92c284e	Work in progress - force rescheduling of failed allocs	2018-05-08 17:26:57 -05:00
Alex Dadgar	555d14fd92	Add test	2018-05-07 14:55:01 -05:00
Preetha Appan	cf44670d56	Make sure that task group has a deployment state before using it	2018-05-07 14:55:01 -05:00
Alex Dadgar	c6478d9469	clarify comment	2018-05-07 14:55:01 -05:00
Alex Dadgar	768fec8505	Allow healthy canary deployment to skip progress deadline	2018-05-07 14:55:01 -05:00
Alex Dadgar	8626c1b94a	Reschedule when we have canaries properly	2018-05-07 14:55:01 -05:00
Alex Dadgar	8dee3ab068	canary reschedule test	2018-05-07 14:50:01 -05:00
Alex Dadgar	deb93dc7b7	Test for rescheduling when there are canaries	2018-05-07 14:50:01 -05:00
Alex Dadgar	550f5e31f8	Allow canary count greater than desired	2018-05-07 14:50:01 -05:00
Alex Dadgar	f95ab4ade8	Mark canaries on creation, and unmark on promotion	2018-05-07 14:50:01 -05:00
Preetha Appan	5329900f6d	Only use DesiredTransition.Reschedule in reconciler when it's an active deployment	2018-05-07 14:50:01 -05:00
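
The constraint commits in the log above (9c72dafc95, `is_set`/`is_not_set`, and e5c641daa9, comparisons of nil values) describe checks that still behave sensibly when a node does not define the attribute at all. The following is a minimal Go sketch of that behavior, not Nomad's actual ConstraintChecker: the function name `checkAttribute`, its parameters, and the decision that `=` fails on a missing attribute are assumptions made for illustration only.

```go
package main

import "fmt"

// checkAttribute is a hypothetical helper (not Nomad's real API).
// value is the resolved attribute, found reports whether the node
// defines the attribute at all, and target is the constraint value.
func checkAttribute(operator, value string, found bool, target string) bool {
	switch operator {
	case "is_set":
		return found
	case "is_not_set":
		return !found
	case "=":
		// Assumption: a missing attribute never equals a target.
		return found && value == target
	case "!=":
		// A missing attribute differs from any target, so untagged
		// nodes pass without needing an inverse tag.
		return !found || value != target
	default:
		// Unknown operators are rejected in this sketch.
		return false
	}
}

func main() {
	// A node that was never tagged as a canary.
	node := map[string]string{"type": "web"}

	canary, hasCanary := node["canary"]
	fmt.Println("canary != 1:", checkAttribute("!=", canary, hasCanary, "1"))

	nodeType, hasType := node["type"]
	fmt.Println("type is_set:", checkAttribute("is_set", nodeType, hasType, ""))
}
```

With this shape, the canary example from e5c641daa9 works as the message describes: only nodes explicitly tagged `canary = "1"` are excluded, and all other nodes stay eligible.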

696 commits
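
Commits 6ed527c636 ("Use heap to store top K scoring nodes") and 9bc0962527 ("Track top k nodes by norm score") mention keeping only the K best-scoring nodes in a heap. A common way to do this is a bounded min-heap built on Go's standard `container/heap` package, sketched below. The type and function names (`nodeScore`, `minHeap`, `topK`) are hypothetical and this is not the scheduler's actual code.

```go
package main

import (
	"container/heap"
	"fmt"
)

// nodeScore is a hypothetical pairing of a node ID and its normalized score.
type nodeScore struct {
	NodeID string
	Score  float64
}

// minHeap keeps the lowest score at the root, so the weakest of the
// retained K entries can be replaced cheaply.
type minHeap []nodeScore

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(nodeScore)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// topK returns up to k entries with the highest scores, best first.
func topK(scores []nodeScore, k int) []nodeScore {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for _, s := range scores {
		if h.Len() < k {
			heap.Push(h, s)
		} else if s.Score > (*h)[0].Score {
			// Replace the current minimum and restore heap order.
			(*h)[0] = s
			heap.Fix(h, 0)
		}
	}
	// Popping yields ascending scores, so fill the result from the back.
	out := make([]nodeScore, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(nodeScore)
	}
	return out
}

func main() {
	scores := []nodeScore{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}, {"d", 0.7}}
	for _, s := range topK(scores, 2) {
		fmt.Println(s.NodeID, s.Score)
	}
}
```

The heap never grows beyond K entries, so memory stays bounded regardless of how many nodes are scored, and each candidate costs at most O(log K).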