Commit graph

2649 commits

Author SHA1 Message Date
Alex Dadgar 08dc2ea702
Merge pull request #4867 from hashicorp/b-deployment-progress-deadline
Blocked evaluation fixes
2018-11-13 10:29:03 -08:00
Mahmood Ali 865419e756 convert all config durations to strings in tests 2018-11-13 10:21:40 -05:00
Mahmood Ali 4e18846fd9 Adjust streaming duration
This test expects 11 repeats of the same message emitted at intervals of
200ms; so we need more than 2 seconds to adjust for time sleep
variations and the like.  So raising it to 3s here that should be
enough.
2018-11-13 10:21:40 -05:00
Mahmood Ali 1403ad21b9 Changelog job re-run fix 2018-11-13 07:52:51 -05:00
Mahmood Ali e2d668f21c
Merge pull request #4861 from hashicorp/b-batch-deregister-transaction
Run job deregistering in a single transaction
2018-11-12 20:59:44 -05:00
Alex Dadgar a90dc978e1 Handle new eval being the duplicate properly 2018-11-12 16:02:23 -08:00
Mahmood Ali 8513b3cccb Comment public functions and batch write txn 2018-11-12 16:09:39 -05:00
Preetha Appan 7ef126a027
Smaller methods, and added tests for RPC layer 2018-11-10 17:37:33 -06:00
Preetha Appan 75662b50d1
Use response object/querymeta/writemeta in scheduler config API 2018-11-10 10:31:10 -06:00
Mahmood Ali 9c0a15f3ce Run job deregistering in a single transaction
Fixes https://github.com/hashicorp/nomad/issues/4299

Upon investigating this case further, we determined the issue to be a race between applying `JobBatchDeregisterRequest` fsm operation and processing job-deregister evals.

Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index.  However, with `JobBatchDeregister`, any single individual job deregistering was applied accidentally incremented the snapshot index and resulted into processing job-deregister evals.  When a Nomad server receives an eval for a job in the batch that is yet to be deleted, we accidentally re-run it depending on the state of allocation.

This change ensures that we delete deregister all of the jobs and inserts all evals in a single transactions, thus blocking processing related evals until deregistering complete.
2018-11-09 22:35:26 -05:00
Preetha 3739713ce1
Merge pull request #4839 from hashicorp/b-gc-alloc-jobversion
Remove terminal allocations associated with older job modify index
2018-11-09 12:21:42 -06:00
Preetha Appan 39072977d6
Use create index as trigger condition to gc old terminal allocs 2018-11-09 11:44:21 -06:00
Alex Dadgar 2f06d88f47
Merge pull request #4847 from hashicorp/b-blocked-eval
Blocked evaluation fixes
2018-11-08 13:40:01 -08:00
Alex Dadgar 98398a8a44
Merge pull request #4842 from hashicorp/b-deployment-progress-deadline
Fix multiple bugs with progress deadline handling
2018-11-08 13:31:54 -08:00
Alex Dadgar 991791a513 typo fix 2018-11-08 13:28:27 -08:00
Alex Dadgar be54e56570 review fixes 2018-11-08 09:48:36 -08:00
Preetha Appan 5f0a9d2cfd
Show preemption output in plan CLI 2018-11-08 09:48:43 -06:00
Alex Dadgar dbb05357bc fix test 2018-11-07 11:59:24 -08:00
Alex Dadgar 36abd3a3d8 review comments 2018-11-07 10:33:22 -08:00
Alex Dadgar e3cbb2c82e allocs fit checks if devices get oversubscribed 2018-11-07 10:33:22 -08:00
Alex Dadgar 4f9b3ede87 Split device accounter and allocator 2018-11-07 10:32:03 -08:00
Alex Dadgar 6fa893c801 affinities 2018-11-07 10:32:03 -08:00
Alex Dadgar feb83a2be3 assign devices 2018-11-07 10:32:03 -08:00
Alex Dadgar 2d2248e209 Add devices to allocated resources 2018-11-07 10:32:03 -08:00
Alex Dadgar b1c5d52817 Track jobs by namespace 2018-11-07 10:22:08 -08:00
Alex Dadgar 6d8bb3a7bd Duplicate blocked evals cancelling improved
The old logic for cancelling duplicate blocked evaluations by job id had
the issue where the newer evaluation could have additional node classes
that it is (in)eligible for that we would not capture. This could make
it such that cluster state could change such that the job would make
progress but no evaluation was unblocked.
2018-11-07 10:08:23 -08:00
Preetha Appan a9aec7e628
Fix failing resource subtraction test 2018-11-06 12:26:26 -06:00
Alex Dadgar 261aae32b1 more robust merging of the deployment status when getting updates from the client 2018-11-05 16:39:09 -08:00
Alex Dadgar 1c31970464 Fix multiple tgs with progress deadline handling
Fix an issue in which the deployment watcher would fail the deployment
based on the earliest progress deadline of the deployment regardless of
if the task group has finished.

Further fix an issue where the blocked eval optimization would make it
so no evals were created to progress the deployment. To reproduce this
issue, prior to this commit, you can create a job with two task groups.
The first group has count 1 and resources such that it can not be
placed. The second group has count 3, max_parallel=1, and can be placed.
Run this first and then update the second group to do a deployment. It
will place the first of three, but never progress since there exists a
blocked eval. However, that doesn't capture the fact that there are two
groups being deployed.
2018-11-05 16:06:17 -08:00
Preetha Appan 6fdc84cce3
add comment 2018-11-02 18:11:36 -05:00
Preetha Appan a6b714b81c
update preemption tests to use new node resource structs
also includes a fix to remove unnecessary subtraction of network mbits
2018-11-02 17:59:53 -05:00
Preetha b2b52b1ada
Merge pull request #4794 from hashicorp/f-preemption-systemjobs
Preemption for system jobs
2018-11-02 16:28:06 -05:00
Preetha Appan c33469157d
unit test plan apply with preemptions 2018-11-01 20:06:32 -05:00
Preetha Appan 57fe5050f0
more minor review feedback 2018-11-01 17:05:17 -05:00
Preetha Appan fd60e66f86
Plumb alloc resource cache in a few more places.
also removed now unused method
2018-11-01 16:44:43 -05:00
Preetha Appan e586817ce7
batch jobs GC removes terminal allocs if job modifyindex is older than running job 2018-11-01 00:05:31 -05:00
Mahmood Ali 9da19c6450 address review comments 2018-10-30 13:58:52 -04:00
Mahmood Ali 4937095389 Allow artifacts checksum interpolation
Fixes https://github.com/hashicorp/nomad/issues/4814
2018-10-30 13:24:30 -04:00
Preetha Appan f1c3eb2792
Introduce interface with multiple implementations for resource distance 2018-10-30 11:06:32 -05:00
Preetha Appan 8f7eb61823
Introduce a response object for scheduler configuration 2018-10-30 11:06:32 -05:00
Preetha Appan 1a5421f5d7
more minor cleanup 2018-10-30 11:06:32 -05:00
Preetha Appan 0494a098ce
More style and readablity fixes from review 2018-10-30 11:06:32 -05:00
Preetha Appan 1415032c13
More review comments 2018-10-30 11:06:32 -05:00
Preetha Appan b97f85e3e0
style fixes 2018-10-30 11:06:32 -05:00
Preetha Appan 12278527c7
make default config a variable 2018-10-30 11:06:32 -05:00
Preetha Appan 32cc764072
Add fsm layer tests 2018-10-30 11:06:32 -05:00
Preetha Appan 7b8156fc47
Restore/Snapshot plus unit tests for scheduler configuration 2018-10-30 11:06:32 -05:00
Preetha Appan 8807c25b11
Modify preemption code to use new style of resource structs 2018-10-30 11:06:32 -05:00
Preetha Appan c1c1c230e4
Make preemption config a struct to allow for enabling based on scheduler type 2018-10-30 11:06:32 -05:00
Preetha Appan bd34cbb1f7
Support for new scheduler config API, first use case is to disable preemption 2018-10-30 11:06:32 -05:00
Preetha Appan 3190a2c29b
Fix linting 2018-10-30 11:06:32 -05:00
Preetha Appan eb38488d08
Fix logic bug, unit test for plan apply method in state store 2018-10-30 11:06:32 -05:00
Preetha Appan 9e4a35fff0
Fix comment 2018-10-30 11:06:32 -05:00
Preetha Appan cc295b90de
Implement preemption for system jobs.
This commit implements an allocation selection algorithm for finding
allocations to preempt. It currently special cases network resource asks
from others (cpu/memory/disk/iops).
2018-10-30 11:06:32 -05:00
Preetha Appan d11064d6ba
structs and API changes to plan and alloc structs needed for preemption 2018-10-30 11:06:32 -05:00
Preetha Appan 9257387a69
Add number of evictions to DesiredUpdates struct to use in CLI/API 2018-10-30 11:06:32 -05:00
Preetha Appan 5ff4b8e36f
REview feedback 2018-10-30 11:06:32 -05:00
Preetha Appan 5b3bfb63eb
structs and API changes to plan and alloc structs needed for preemption 2018-10-30 11:06:32 -05:00
Michael Schurter 5d49832de4 tests: fix usages of TestClient cleanup and mock driver 2018-10-29 14:21:05 -07:00
Michael Schurter e060174130 ar: fix leader handling, state restoring, and destroying unrun ARs
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
  tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
2018-10-19 09:45:45 -07:00
Alex Dadgar 6f0ed6184b Fix client reloading and pass the plugin loaders to server and client 2018-10-16 16:56:55 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Alex Dadgar e401c660e7 Implement lifecycle hooks on the task runner 2018-10-16 16:53:29 -07:00
Alex Dadgar a78cefec18 use int64 2018-10-16 15:34:32 -07:00
Preetha Appan 7c0d8c646c
Change CPU/Disk/MemoryMB to int everywhere in new resource structs 2018-10-16 16:21:42 -05:00
Alex Dadgar f5a76d8411 review comments 2018-10-15 15:31:13 -07:00
Alex Dadgar f9b056e1d1 Replace attributes map with new Attribute object 2018-10-13 14:08:58 -07:00
Alex Dadgar 04ba425dd5 validate constraints/affinities 2018-10-13 12:27:49 -07:00
Alex Dadgar 9b5aaac410 Device feasability checker 2018-10-13 12:27:49 -07:00
Alex Dadgar bfb4caa2e7 node devices 2018-10-13 12:27:49 -07:00
Alex Dadgar 5a07f9f96e parse affinities and constraints on devices 2018-10-11 14:05:19 -07:00
Alex Dadgar a2a56a930c Diff 2018-10-08 17:02:58 -07:00
Alex Dadgar 6b08b9d6b6 Define device request structs 2018-10-08 15:38:03 -07:00
Alex Dadgar 01f8e5b95f renames 2018-10-04 14:57:25 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar bac5cb1e8b Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Alex Dadgar 147d2430a1 allocated resources structs 2018-09-29 18:47:28 -07:00
Alex Dadgar 5c8697667e Node reserved resources 2018-09-29 18:44:55 -07:00
Alex Dadgar 3183153315 Node resources on client 2018-09-29 17:23:41 -07:00
Alex Dadgar 9b793531d6
Merge pull request #4720 from hashicorp/b-jet-fixes
Series of scheduler fixes / debugging enhancements
2018-09-25 13:25:11 -07:00
Alex Dadgar bd420692f3 fix logging 2018-09-25 10:49:55 -07:00
Preetha Appan 86e725e84c Added logging around nacked evals in the scheduler worker 2018-09-25 10:49:02 -07:00
Alex Dadgar 6a21f9fe96 Unique TriggerBy for blocked evals
Give blocked evals a unique triggerby reason to make debugging a chain
of evaluations easier.
2018-09-24 14:47:49 -07:00
Alex Dadgar e1a102f58c test allocs fit 2018-09-24 13:59:01 -07:00
Alex Dadgar d7f5be9148 Better comment on snapshotindex 2018-09-24 13:53:43 -07:00
Alex Dadgar 99498da6ed Denormalize jobs in plan and ignore resources of terminal allocs
Denormalize jobs in AppendAllocs:
AppendAlloc was originally only ever called for inplace upgrades and new
allocations. Both these code paths would remove the job from the
allocation. Now we use this to also add fields such as FollowupEvalID
which did not normalize the job. This is only a performance enhancement.

Ignore terminal allocs:
Failed allocations are annotated with the followup Eval ID when one is
created to replace the failed allocation. However, in the plan applier,
when we check if allocations fit, these terminal allocations were not
filtered. This could result in the plan being rejected if the node would
be overcommited if the terminal allocations resources were considered.
2018-09-24 13:53:43 -07:00
Alex Dadgar de442226ae Fix other instances of blocking queries 2018-09-24 13:52:39 -07:00
Alex Dadgar 7f0d241ef4 always handle failed allocation 2018-09-21 15:13:54 -07:00
Alex Dadgar b2449ae1ce Fix deployment watcher index usage
Fixes three issues:
1. Retrieving the latest evaluation index was not properly selecting the
greatest index. This would undermine checks we had to reduce the number
of evaluations created when the latest eval index was greater than any
alloc change
2. Fix an issue where the blocking query code was using the incorrect
index such that the index was higher than necassary.
3. Special case handling of blocked evaluation since the create/snapshot
index is no particularly useful since they can be reblocked.
2018-09-21 13:59:11 -07:00
Alex Dadgar 5009566503 do not bootstrap with non voters 2018-09-19 17:17:39 -07:00
Alex Dadgar e8f89597f5 fix rpc test 2018-09-19 10:17:54 -07:00
Alex Dadgar 9971b3393f yamux 2018-09-17 14:22:40 -07:00
Alex Dadgar b2f500b48c Serf/Raft/Memberlist logger 2018-09-17 13:57:52 -07:00
Alex Dadgar ca28afa3b2 small fixes 2018-09-15 16:42:38 -07:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Preetha Appan 996484981c
Fix panic when reschedule policy for allocation can't be looked up
because its task group changed
2018-09-05 17:01:02 -05:00
Alex Dadgar 4f89cabd34
Merge pull request #4631 from hashicorp/f-plugin-config
Parse plugin configs
2018-09-04 17:04:13 -07:00
Alex Dadgar cc92cd92cd
Merge pull request #4642 from hashicorp/b-vet
Fix vet errors and use newer go version in travis
2018-09-04 17:04:02 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Preetha Appan 26288b9522
Fix more review feedback 2018-09-04 16:10:11 -05:00
Preetha Appan 751c0eb5a5
code review feedback 2018-09-04 16:10:11 -05:00
Preetha Appan 4f8e925b54
Move topk and delay heap to separate packages under lib 2018-09-04 16:10:11 -05:00
Preetha Appan 9bc0962527
Track top k nodes by norm score rather than top k nodes per scorer 2018-09-04 16:10:11 -05:00
Preetha Appan 6ed527c636
Use heap to store top K scoring nodes.
Scoring metadata is now aggregated by scorer type to make it easier
to parse when reading it in the CLI.
2018-09-04 16:10:11 -05:00
Preetha Appan dd5fe6373f
Fix scoring logic for uneven spread to incorporate current alloc count
Also addressed other small code review comments
2018-09-04 16:10:11 -05:00
Preetha Appan e72c0fe527
more cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 92d37acc2a
comment and formatting cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 5812f906c8
Allow empty spread targets, and validate target percentages. 2018-09-04 16:10:11 -05:00
Preetha Appan 71bff00326
validate spread from job/task group validate methods 2018-09-04 16:10:11 -05:00
Preetha Appan fbd0004707
Fix warnings 2018-09-04 16:10:11 -05:00
Preetha Appan 5eb82b6260
Validate method, and rename ratio field to percent 2018-09-04 16:10:11 -05:00
Preetha Appan 0037d72fa8
Structs and validation for spread 2018-09-04 16:10:11 -05:00
Preetha Appan c407e3626f
More review comments 2018-09-04 16:10:11 -05:00
Preetha Appan dbbb4a957a
Fail validation if system job has affinities 2018-09-04 16:10:11 -05:00
Preetha Appan 0bc030c6fb
Treat set_contains as a synonym of set_contains_all 2018-09-04 16:10:11 -05:00
Preetha Appan e85a721cfb
Include affinities in job and task diff, and more test cases 2018-09-04 16:10:11 -05:00
Preetha Appan f06c7ab2ad
Fix Copy method for job and task to include affinities 2018-09-04 16:10:11 -05:00
Preetha Appan 9f0caa9c3d
Affinity parsing, api and structs 2018-09-04 16:10:11 -05:00
Preetha Appan 9e29cfee76
Use readlock 2018-09-04 11:45:05 -05:00
Preetha Appan 062e5f1898
Use eval broker lock when reading/modifying delay heap 2018-08-31 10:59:48 -05:00
Alex Dadgar bff1669ee4 Plugin config parsing 2018-08-29 17:06:01 -07:00
Chelsea Komlo 0a69cdb304
Merge pull request #4565 from hashicorp/b-compare-cert-alg
Error if TLS Certificate signature algorithm isn't supported in cipher suites
2018-08-15 16:09:46 -04:00
Xopherus 8d747578e8 Close multiplexer when context is cancelled
Multiplexer continues to create rpc connections even when
the context which is passed to the underlying rpc connections
is cancelled by the server.

This was causing #4413 - when a SIGHUP causes everything to reload,
it uses context to cancel the underlying http/rpc connections
so that they may come up with the new configuration.
The multiplexer was not being cancelled properly so it would
continue to create rpc connections and constantly fail,
causing communication issues with other nomad agents.

Fixes #4413
2018-08-13 19:32:49 -04:00
Chelsea Holland Komlo 31d6d00381 add simple getter for certificate 2018-08-10 12:37:21 -04:00
Andrei Burd 444ee45aff Parametrized/periodic jobs per child tagged metric emmision 2018-06-21 10:40:56 +03:00
Alex Dadgar b61051b3cd
Merge pull request #4409 from hashicorp/r-client-packages
Refactor client packages
2018-06-13 17:32:25 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Chelsea Komlo 03075b603a
Merge pull request #4399 from hashicorp/r-reload-refactor
Refactor logic for dynamic reloading
2018-06-13 13:35:12 -04:00
Alex Dadgar d0043691fb remove structs + bump version 2018-06-11 13:52:19 -07:00
Alex Dadgar af5753d2cd bump version + generated files 2018-06-11 13:39:42 -07:00
Nick Ethier e75e3ae665
nomad: use require pkg for tests 2018-06-11 13:50:50 -04:00
Nick Ethier 50c72adbd7
nomad: code review comments 2018-06-11 13:27:48 -04:00
Nick Ethier a581cc9c01
nomad/structs: fix job diff test 2018-06-11 13:06:49 -04:00
Nick Ethier 41e010cdc2
nomad: add 'Dispatch' field to Job
New -bash: Dispatch: command not found field is used to denote if the Job is a child dispatched job of
a parameterized job.
2018-06-11 11:59:03 -04:00
Chelsea Holland Komlo de03ce8070 move logic to determine whether to reload tls configuration to tlsutil helper 2018-06-08 14:33:58 -04:00
Chelsea Komlo d738976234
Merge pull request #4395 from hashicorp/b-vault-second
Fix for dynamically reloading vault
2018-06-07 18:03:00 -04:00
Chelsea Holland Komlo dcc9cdfeb7 fixup! comment and move to always log server reload operation 2018-06-07 17:12:36 -04:00
Chelsea Holland Komlo 41e35edf0c fix test that now requires different config for test assertions 2018-06-07 17:07:06 -04:00
Chelsea Holland Komlo 9f6bd7bf3a move logic for testing equality for vault config 2018-06-07 16:23:50 -04:00
Chelsea Holland Komlo 282f37b1ee fix for dynamically reloading vault 2018-06-07 15:34:18 -04:00
Nick Ethier 2555bff4f5
nomad: add error check in test 2018-06-06 14:08:42 -04:00
Nick Ethier d35bf6d184
nomad: handle edge case where node drain event shouldn't be emitted 2018-06-06 14:02:10 -04:00
Alex Dadgar 23cd56dc78 remove generated structs 2018-06-01 16:11:28 -07:00
Alex Dadgar c0386819b3 bump version/lint/generated files 2018-06-01 15:23:10 -07:00
Preetha Appan 4134fcd2c7
Fix test setup for FSMSnapshotRestore_Deployments to use a valid job that exists 2018-05-31 14:39:39 -05:00
Alex Dadgar 7f25fcc1bd
Merge pull request #4354 from hashicorp/b-job-modify
Deployment adds JobSpecModifyIndex
2018-05-31 17:57:38 +00:00
Alex Dadgar f2b2e0482b code review fixes 2018-05-31 10:57:08 -07:00
Preetha c2291a4cf7
Merge pull request #4349 from hashicorp/b-reconcile-raft-upgrade
Remove an unnecessary check in nomad member reconciliation
2018-05-30 13:17:38 -07:00
Alex Dadgar 92777a8018
Merge pull request #4329 from hashicorp/b-leaked-deployments
Clean up leaked deployments on restoration
2018-05-30 20:06:03 +00:00