Mahmood Ali
d202924a93
include test and address review comments
2020-01-28 09:06:52 -05:00
Mahmood Ali
e436d2701a
Handle Nomad leadership flapping
...
Fixes a deadlock in leadership handling if leadership flapped.
Raft propagates leadership transitions to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so the channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[2] and does not move on to executing follower
related logic (in `raft.runFollower`).
While the Raft `runLeader` defer function blocks, Raft cannot process any other
Raft operations. For example, the `run{Leader|Follower}` methods consume
`raft.applyCh`, and while the `runLeader` defer is blocked, all Raft log applications
and config lookups will block indefinitely.
Sadly, `leaderLoop` and `establishLeadership` make a few Raft calls!
`establishLeadership` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4]. All of these calls occur
without a timeout.
Thus, if leadership flaps quickly while `leaderLoop/establishLeadership` is
running and hits any of these Raft calls, the Raft handler _deadlocks_ forever.
Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to end up in the following state:
* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
(e.g. node/alloc updates) and these hang as well
As we create goroutines per RPC call, the number of goroutines grows over time
and may trigger out-of-memory errors in addition to missed updates.
[1] d90d6d6bda/config.go (L190-L193)
[2] d90d6d6bda/raft.go (L425-L436)
[3] 2a89e47746/nomad/leader.go (L198-L202)
[4] 2a89e47746/nomad/leader.go (L877)
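To illustrate the "buffered or aggressively consumed" requirement described above, here is a minimal sketch of one way a consumer could collapse a burst of leadership flaps into the most recent state. It is not the code from this commit: `drainToLatest` and `notifyCh` are hypothetical names, and the buffer size of 8 is an arbitrary assumption.

```go
package main

import "fmt"

// drainToLatest (hypothetical helper) reads every value currently queued on
// ch without blocking and returns the most recent one. ok is false when
// nothing was queued or the channel is closed and empty.
func drainToLatest(ch <-chan bool) (latest bool, ok bool) {
	for {
		select {
		case v, open := <-ch:
			if !open {
				// Closed channel: stop instead of spinning on zero values.
				return latest, ok
			}
			latest, ok = v, true
		default:
			// Nothing left in the buffer.
			return latest, ok
		}
	}
}

func main() {
	// Buffered so that the producer (Raft, in the scenario above) can write
	// several transitions without waiting for the consumer.
	notifyCh := make(chan bool, 8)

	// Simulate flapping: gained, lost, gained, lost.
	for _, isLeader := range []bool{true, false, true, false} {
		notifyCh <- isLeader
	}

	// The consumer only acts on the final state of the burst.
	latest, ok := drainToLatest(notifyCh)
	fmt.Println(latest, ok) // prints "false true"
}
```

Draining to the latest value keeps the channel empty for the producer, at the cost of skipping intermediate transitions; whether skipping them is acceptable depends on what the leader setup and teardown steps need to observe.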
2020-01-22 13:08:34 -05:00
Mahmood Ali
129c884105
extract leader step function
2020-01-22 10:55:48 -05:00
Mahmood Ali
1ab682f622
scheduler: allow configuring default preemption for system scheduler
...
Some operators want greater control over when preemption is enabled,
especially during an upgrade, to limit potential side effects.
2020-01-13 08:30:49 -05:00
Mahmood Ali
d699a70875
Merge pull request #5911 from hashicorp/b-rpc-consistent-reads
...
Block rpc handling until state store is caught up
2019-08-20 09:29:37 -04:00
Jasmine Dahilig
8d980edd2e
add create and modify timestamps to evaluations (#5881)
2019-08-07 09:50:35 -07:00
Pete Woods
9096aa3d23
Add job status metrics
...
This avoids having to write services to repeatedly hit the jobs API
2019-07-26 10:12:49 +01:00
Mahmood Ali
ea3a98357f
Block rpc handling until state store is caught up
...
Here, we ensure that the leader only responds to RPC calls when the state
store is up to date. At leadership transition or launch with restored
state, the server's local store might not be caught up with the latest raft
logs and may return a stale read.
The solution here is to have an RPC consistency read gate, enabled when
`establishLeadership` completes before we respond to RPC calls.
`establishLeadership` is gated by a `raft.Barrier` which ensures that
all prior raft logs have been applied.
Conversely, the gate is disabled when leadership is lost.
This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files
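The gate described above can be sketched as an atomic flag that is set once leadership is established and cleared when it is lost. This is only an illustration under stated assumptions: `readGate`, `enable`, `disable`, `check`, and `errNotReady` are hypothetical names, and the sketch returns an error instead of blocking the caller.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNotReady is a hypothetical error returned while the gate is closed.
var errNotReady = errors.New("leader state store is not yet caught up")

// readGate is a hypothetical stand-in for the RPC consistency read gate.
type readGate struct {
	ready int32 // 1 when consistent reads may be served, 0 otherwise
}

// enable opens the gate, e.g. after the barrier and leadership setup finish.
func (g *readGate) enable() { atomic.StoreInt32(&g.ready, 1) }

// disable closes the gate, e.g. when leadership is lost.
func (g *readGate) disable() { atomic.StoreInt32(&g.ready, 0) }

// check reports whether an RPC that needs a consistent read may proceed.
func (g *readGate) check() error {
	if atomic.LoadInt32(&g.ready) == 1 {
		return nil
	}
	return errNotReady
}

func main() {
	g := &readGate{}

	// Before leadership is established the gate rejects reads.
	fmt.Println(g.check() != nil) // prints "true"

	// Leadership established: reads are allowed.
	g.enable()
	fmt.Println(g.check() == nil) // prints "true"

	// Leadership lost: the gate closes again.
	g.disable()
	fmt.Println(g.check() != nil) // prints "true"
}
```

A caller that must wait rather than fail could retry `check` with a backoff until it succeeds or a deadline passes.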
2019-07-02 16:07:37 +08:00
Preetha Appan
10e7d6df6d
Remove compat code associated with many previous versions of nomad
...
This removes compat code for namespaces (0.7), Drain (0.8) and other
older features from releases older than Nomad 0.7
2019-06-25 19:05:25 -05:00
Chris Baker
e0170e1c67
metrics: add namespace label to allocation metrics
2019-06-17 20:50:26 +00:00
Michael Schurter
073893f529
nomad: disable service+batch preemption by default
...
Enterprise only.
Disable preemption for service and batch jobs by default.
Maintain backward compatibility in an x.y.Z release. Consider switching
the default for new clusters in the future.
2019-06-04 15:54:50 -07:00
Preetha Appan
ad3c263d3f
Rename to match system scheduler config.
...
Also added docs
2019-05-03 14:06:12 -05:00
Preetha Appan
6615d5c868
Add config to disable preemption for batch/service jobs
2019-04-29 18:48:07 -05:00
Arshneet Singh
b977748a4b
Add code for plan normalization
2019-04-23 09:18:01 -07:00
Charlie Voiselle
c28c195f42
Set NextEval when making failed-follow-up evals
...
This allows users to locate failed-follow-up evals more easily
2019-02-20 16:07:11 -08:00
Preetha Appan
7578522f58
variable name fix
2019-01-29 13:48:45 -06:00
Preetha Appan
a6cebbbf9e
Make sure that all servers are 0.9 before applying scheduler config entry
2019-01-29 12:47:42 -06:00
Alex Dadgar
4bdccab550
goimports
2019-01-22 15:44:31 -08:00
Nick Ethier
b1484aec33
nomad: fix hclog usage
2018-11-29 22:27:39 -05:00
Nick Ethier
5c5cae79ab
nomad: only lookup job if disable_dispatched_job_summary_metrics is set
2018-11-19 23:22:23 -05:00
Nick Ethier
8ac69f440d
nomad: lookup job instead of adding Dispatched to summary
2018-11-19 23:22:02 -05:00
Nick Ethier
85b221a1d6
nomad: add flag to disable publishing of job_summary metrics for dispatched jobs
2018-11-19 23:21:19 -05:00
Preetha Appan
57fe5050f0
more minor review feedback
2018-11-01 17:05:17 -05:00
Preetha Appan
12278527c7
make default config a variable
2018-10-30 11:06:32 -05:00
Preetha Appan
c1c1c230e4
Make preemption config a struct to allow for enabling based on scheduler type
2018-10-30 11:06:32 -05:00
Preetha Appan
bd34cbb1f7
Support for new scheduler config API, first use case is to disable preemption
2018-10-30 11:06:32 -05:00
Alex Dadgar
ca28afa3b2
small fixes
2018-09-15 16:42:38 -07:00
Alex Dadgar
3c19d01d7a
server
2018-09-15 16:23:13 -07:00
Andrei Burd
444ee45aff
Parametrized/periodic jobs per child tagged metric emission
2018-06-21 10:40:56 +03:00
Preetha Appan
2fd20310ea
Remove checks in member reconcile that were causing servers in protocol 3 to not change their ID in raft forever
2018-05-30 11:34:45 -05:00
Alex Dadgar
ea24513d38
Allow nomad to restore bad periodic job
2018-04-26 15:51:47 -07:00
Alex Dadgar
d0f237086b
UX touchups
2018-04-26 15:24:27 -07:00
Chelsea Holland Komlo
fca0169dbc
handle potential panic in cron parsing
2018-04-26 16:57:45 -04:00
Michael Schurter
959d447d38
Remove unused context
2018-03-21 16:51:44 -07:00
Michael Schurter
0a17076ad2
refactor drainer into a subpkg
2018-03-21 16:51:44 -07:00
Michael Schurter
c0542474db
drain: initial drainv2 structs and impl
2018-03-21 16:49:48 -07:00
Alex Dadgar
4844317cc2
Merge pull request #3890 from hashicorp/b-heartbeat
...
Heartbeat improvements and handling failures during establishing leadership
2018-03-12 14:41:59 -07:00
Josh Soref
2c79e590ec
spelling: maintenance
2018-03-11 18:26:20 +00:00
Alex Dadgar
64a45a1603
Need to revoke leadership to clean up in case there was a failure during leadership establishment
2018-02-20 12:52:00 -08:00
Alex Dadgar
9a54abd3a8
timers
2018-02-20 10:23:11 -08:00
Alex Dadgar
601177c250
Add escape hatches when non-leader
2018-02-20 10:22:15 -08:00
Kyle Havlovitz
2ccf565bf6
Refactor redundancy_zone/upgrade_version out of client meta
2018-01-29 20:03:38 -08:00
Kyle Havlovitz
a162b9ce14
Move server health loop into autopilot leader actions
2018-01-23 12:57:02 -08:00
Kyle Havlovitz
1c07066064
Add autopilot functionality based on Consul's autopilot
2017-12-18 14:29:41 -08:00
Kyle Havlovitz
045f346293
Use region instead of datacenter for version checking
2017-12-12 10:17:16 -06:00
Kyle Havlovitz
b775fc7b33
Added support for v2 raft APIs and -raft-protocol option
2017-12-12 10:17:16 -06:00
Alex Dadgar
86608124ca
Fix followers not creating periodic launch
...
Fix an issue in which periodic launches wouldn't be made on followers.
2017-12-11 13:55:17 -08:00
Alex Dadgar
2c587fd67b
Merge pull request #3402 from hashicorp/leader-loop
...
Applies leader loop fixes from Consul.
2017-11-03 13:40:59 -07:00
Diptanu Choudhury
5a0edf646b
Resetting the timer at the beginning of the loop
2017-11-01 13:15:06 -07:00
Diptanu Choudhury
46bc4280b2
Adding support for tagged metrics
2017-11-01 13:15:06 -07:00
Diptanu Choudhury
524a1f0712
Publishing metrics for job summary
2017-11-01 13:15:06 -07:00
Alex Dadgar
794daefa5e
clear the token
2017-10-23 15:11:13 -07:00
Alex Dadgar
d3e119f4d0
thread leader token through core gc and test
2017-10-23 15:04:00 -07:00
Alex Dadgar
5c34af1ee1
leader acl token
2017-10-23 14:10:14 -07:00
James Phillips
9a5651e83a
Applies leader loop fixes from Consul.
...
There was a deadlock issue we fixed under https://github.com/hashicorp/consul/issues/3230,
and then discovered another issue under https://github.com/hashicorp/consul/issues/3545. This
PR ports over those fixes, and also makes the revoke actions happen only if leadership was
established. This brings the Nomad leader loop in line with Consul's.
2017-10-16 22:01:49 -07:00
Alex Dadgar
c1cc51dbee
sync
2017-10-13 14:36:02 -07:00
Michael Schurter
84d8a51be1
SecretID -> AuthToken
2017-10-12 15:16:33 -07:00
Michael Schurter
a66c53d45a
Remove structs import from api
...
Goes a step further and removes structs import from api's tests as well
by moving GenerateUUID to its own package.
2017-09-29 10:36:08 -07:00
Alex Dadgar
73b7466a6e
Run deployment garbage collector on an interval
...
Fixes https://github.com/hashicorp/nomad/issues/3244
2017-09-25 11:04:40 -07:00
Alex Dadgar
54e04b5c0e
Merge pull request #3201 from hashicorp/b-periodic-restore
...
Fix restoration of stopped periodic jobs
2017-09-13 11:42:29 -07:00
Alex Dadgar
a2363e7583
sync acls
2017-09-13 11:38:29 -07:00
Alex Dadgar
e3dbcdcb44
Fix restoration of stopped periodic jobs
...
This PR fixes an issue in which we would add a stopped periodic job to
the periodic launcher.
2017-09-12 14:25:40 -07:00
Alex Dadgar
84d06f6abe
Sync namespace changes
2017-09-07 17:04:21 -07:00
Armon Dadgar
e74ea8a152
nomad: use hashes for efficient token/policy diffing
2017-09-04 13:09:34 -07:00
Armon Dadgar
99c1001b2c
nomad: avoid replication consistency issues by setting MinQueryIndex
2017-09-04 13:07:44 -07:00
Armon Dadgar
b8bf35f087
ACL RPCs allow stale reads for scalability
2017-09-04 13:07:44 -07:00
Armon Dadgar
3e46094cee
Passthrough replication token for token/policy replication
2017-09-04 13:05:53 -07:00
Armon Dadgar
459c2b6fa7
nomad: switch policy/token replication to use batch endpoints
2017-09-04 13:05:36 -07:00
Armon Dadgar
018973aea8
Address @dadgar feedback
2017-09-04 13:04:45 -07:00
Armon Dadgar
5a3a931ec5
nomad: adding global token replication
2017-09-04 13:04:45 -07:00
Armon Dadgar
cb827b6696
nomad: adding policy replication support
2017-09-04 13:04:45 -07:00
Alex Dadgar
590ff91bf3
Deployment watcher takes state store
2017-08-30 18:51:59 -07:00
Alex Dadgar
2284e59b57
Fix double close and cleanup code
2017-08-03 13:40:34 -07:00
Alex Dadgar
146f3f5cb2
Don't restore parameterized periodic jobs
2017-08-03 12:37:58 -07:00
Alex Dadgar
d9b8fd126f
When restoring periodic jobs, take into consideration launch time zone
...
Fixes https://github.com/hashicorp/nomad/issues/2721
2017-07-07 16:18:56 -07:00
Alex Dadgar
7af65aa3d7
Add watcher to server
2017-07-07 12:03:11 -07:00
Alex Dadgar
a9c8b09da8
Push to configs
2017-04-14 15:24:55 -07:00
Alex Dadgar
8aec604e3f
Easy feedback fixes
2017-04-14 13:19:14 -07:00
Alex Dadgar
df7d59051f
Reaping failed evaluations creates follow up eval
...
Create a follow up evaluation when reaping failed evaluations. This
ensures that a job will still make eventual progress.
2017-04-12 14:47:59 -07:00
Alex Dadgar
5be806a3df
Fix vet script and fix vet problems
...
This PR fixes our vet script and fixes all the missed vet changes.
It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.
2017-02-27 16:00:19 -08:00
Alex Dadgar
dea460281d
Merge pull request #2282 from hashicorp/f-raft-v2-stage-one
...
Update to Raft V2 stage one
2017-02-08 15:26:16 -08:00
Alex Dadgar
b69b357c7f
Nomad builds
2017-02-07 20:31:23 -08:00
Alex Dadgar
ee368762ae
It builds
2017-02-02 16:07:15 -08:00
Alex Dadgar
26db1bd12c
Join + Leave peer
2017-02-02 15:49:06 -08:00
Alex Dadgar
48696ba0cc
Use tomb to shutdown
...
Token revocation
Remove from the statestore
Revoke tokens
Don't error when Vault is disabled as this could cause issues if the operator ever goes from enabled to disabled
update server interface to allow enable/disable and config loading
test the new functions
Leader revoke
Use active
2016-08-28 14:06:25 -07:00
Diptanu Choudhury
c63a78b9a3
Removing the check related to checking version of server before reconciling in leader
2016-08-05 17:48:37 -07:00
Diptanu Choudhury
1518f23d0a
Making servers reconcile job summaries when they acquire leadership
2016-08-05 16:47:36 -07:00
Alex Dadgar
51ae7ace25
initial tail impl
2016-07-10 13:57:04 -04:00
Alex Dadgar
8ceb7ead20
Do not use snapshot
2016-06-22 09:33:15 -07:00
Alex Dadgar
91f6976423
tighter index bound when creating GC evals
2016-06-22 09:11:25 -07:00
Alex Dadgar
25decca3ca
Worker waitForIndex uses StateStore index, not Raft Applied Index
2016-06-22 09:04:22 -07:00
Alex Dadgar
6a236872b4
address comment
2016-05-25 10:30:47 -07:00
Alex Dadgar
3fd51ecece
Periodically unblock failed evaluations
2016-05-24 20:10:56 -07:00
Alex Dadgar
045f7807e0
eval_broker.Enqueue no longer returns an error
2016-05-18 11:35:15 -07:00
Sean Chittenden
dc28ab0cb5
Speling police
2016-05-15 09:41:34 -07:00
Alex Dadgar
ca938f205c
Force GC garbage collects nodes last and fix eval GC to cleanup deregistered batch jobs
2016-04-08 11:42:02 -07:00
Alex Dadgar
a3ac4bbc5a
Merge pull request #828 from hashicorp/f-gc-endpoint
...
Job GC endpoint
2016-02-20 16:03:39 -08:00
Alex Dadgar
143972b6d9
Job GC endpoint
2016-02-20 15:50:41 -08:00
Armon Dadgar
3746bf7cd3
nomad: use CPU count to determine pool size
2016-02-20 13:42:13 -08:00
Alex Dadgar
e2a4c4ccc5
Client stores when it receives a task
2016-02-19 14:49:43 -08:00