open-nomad

Author	SHA1	Message	Date
Mahmood Ali	e436d2701a	Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transitions to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so the channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[2] and does not move on to executing follower-related logic (in `raft.runFollower`). While the Raft `runLeader` defer function blocks, Raft cannot process any other Raft operations. For example, the `run{Leader|Follower}` methods consume `raft.applyCh`, and while the runLeader defer is blocked, all Raft log applications or config lookups block indefinitely. Sadly, `leaderLoop` and `establishLeader` make a few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flaps quickly while `leaderLoop/establishLeadership` is invoked and hits any of these Raft calls, the Raft handler _deadlocks_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get into the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and the server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grows over time and may trigger out-of-memory errors in addition to missed updates. [1] `d90d6d6bda/config.go (L190-L193)` [2] `d90d6d6bda/raft.go (L425-L436)` [3] `2a89e47746/nomad/leader.go (L198-L202)` [4] `2a89e47746/nomad/leader.go (L877)`	2020-01-22 13:08:34 -05:00
Mahmood Ali	129c884105	extract leader step function	2020-01-22 10:55:48 -05:00
Mahmood Ali	d699a70875	Merge pull request #5911 from hashicorp/b-rpc-consistent-reads Block rpc handling until state store is caught up	2019-08-20 09:29:37 -04:00
Jasmine Dahilig	8d980edd2e	add create and modify timestamps to evaluations (#5881)	2019-08-07 09:50:35 -07:00
Pete Woods	9096aa3d23	Add job status metrics This avoids having to write services to repeatedly hit the jobs API	2019-07-26 10:12:49 +01:00
Mahmood Ali	ea3a98357f	Block rpc handling until state store is caught up Here, we ensure that the leader only responds to RPC calls when the state store is up to date. At leadership transition or launch with restored state, the server's local store might not be caught up with the latest raft logs and may return a stale read. The solution here is to have an RPC consistency read gate, enabled when `establishLeadership` completes before we respond to RPC calls. `establishLeadership` is gated by a `raft.Barrier` which ensures that all prior raft logs have been applied. Conversely, the gate is disabled when leadership is lost. This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files	2019-07-02 16:07:37 +08:00
Preetha Appan	10e7d6df6d	Remove compat code associated with many previous versions of nomad This removes compat code for namespaces (0.7), Drain (0.8) and other older features from releases older than Nomad 0.7	2019-06-25 19:05:25 -05:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Michael Schurter	073893f529	nomad: disable service+batch preemption by default Enterprise only. Disable preemption for service and batch jobs by default. Maintain backward compatibility in a x.y.Z release. Consider switching the default for new clusters in the future.	2019-06-04 15:54:50 -07:00
Preetha Appan	ad3c263d3f	Rename to match system scheduler config. Also added docs	2019-05-03 14:06:12 -05:00
Preetha Appan	6615d5c868	Add config to disable preemption for batch/service jobs	2019-04-29 18:48:07 -05:00
Arshneet Singh	b977748a4b	Add code for plan normalization	2019-04-23 09:18:01 -07:00
Charlie Voiselle	c28c195f42	Set NextEval when making `failed-follow-up` evals This allows users to locate failed-follow-up evals more easily	2019-02-20 16:07:11 -08:00
Preetha Appan	7578522f58	variable name fix	2019-01-29 13:48:45 -06:00
Preetha Appan	a6cebbbf9e	Make sure that all servers are 0.9 before applying scheduler config entry	2019-01-29 12:47:42 -06:00
Alex Dadgar	4bdccab550	goimports	2019-01-22 15:44:31 -08:00
Nick Ethier	b1484aec33	nomad: fix hclog usage	2018-11-29 22:27:39 -05:00
Nick Ethier	5c5cae79ab	nomad: only lookup job if disable_dispatched_job_summary_metrics is set	2018-11-19 23:22:23 -05:00
Nick Ethier	8ac69f440d	nomad: lookup job instead of adding Dispatched to summary	2018-11-19 23:22:02 -05:00
Nick Ethier	85b221a1d6	nomad: add flag to disable publishing of job_summary metrics for dispatched jobs	2018-11-19 23:21:19 -05:00
Preetha Appan	57fe5050f0	more minor review feedback	2018-11-01 17:05:17 -05:00
Preetha Appan	12278527c7	make default config a variable	2018-10-30 11:06:32 -05:00
Preetha Appan	c1c1c230e4	Make preemption config a struct to allow for enabling based on scheduler type	2018-10-30 11:06:32 -05:00
Preetha Appan	bd34cbb1f7	Support for new scheduler config API, first use case is to disable preemption	2018-10-30 11:06:32 -05:00
Alex Dadgar	ca28afa3b2	small fixes	2018-09-15 16:42:38 -07:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Andrei Burd	444ee45aff	Parametrized/periodic jobs per child tagged metric emission	2018-06-21 10:40:56 +03:00
Preetha Appan	2fd20310ea	Remove checks in member reconcile that were causing servers in protocol 3 to not change their ID in raft forever	2018-05-30 11:34:45 -05:00
Alex Dadgar	ea24513d38	Allow nomad to restore bad periodic job	2018-04-26 15:51:47 -07:00
Alex Dadgar	d0f237086b	UX touchups	2018-04-26 15:24:27 -07:00
Chelsea Holland Komlo	fca0169dbc	handle potential panic in cron parsing	2018-04-26 16:57:45 -04:00
Michael Schurter	959d447d38	Remove unused context	2018-03-21 16:51:44 -07:00
Michael Schurter	0a17076ad2	refactor drainer into a subpkg	2018-03-21 16:51:44 -07:00
Michael Schurter	c0542474db	drain: initial drainv2 structs and impl	2018-03-21 16:49:48 -07:00
Alex Dadgar	4844317cc2	Merge pull request #3890 from hashicorp/b-heartbeat Heartbeat improvements and handling failures during establishing leadership	2018-03-12 14:41:59 -07:00
Josh Soref	2c79e590ec	spelling: maintenance	2018-03-11 18:26:20 +00:00
Alex Dadgar	64a45a1603	Need to revoke leadership to clean up in case there was a failure during leadership establishment	2018-02-20 12:52:00 -08:00
Alex Dadgar	9a54abd3a8	timers	2018-02-20 10:23:11 -08:00
Alex Dadgar	601177c250	Add escape hatches when non-leader	2018-02-20 10:22:15 -08:00
Kyle Havlovitz	2ccf565bf6	Refactor redundancy_zone/upgrade_version out of client meta	2018-01-29 20:03:38 -08:00
Kyle Havlovitz	a162b9ce14	Move server health loop into autopilot leader actions	2018-01-23 12:57:02 -08:00
Kyle Havlovitz	1c07066064	Add autopilot functionality based on Consul's autopilot	2017-12-18 14:29:41 -08:00
Kyle Havlovitz	045f346293	Use region instead of datacenter for version checking	2017-12-12 10:17:16 -06:00
Kyle Havlovitz	b775fc7b33	Added support for v2 raft APIs and -raft-protocol option	2017-12-12 10:17:16 -06:00
Alex Dadgar	86608124ca	Fix followers not creating periodic launch Fix an issue in which periodic launches wouldn't be made on followers.	2017-12-11 13:55:17 -08:00
Alex Dadgar	2c587fd67b	Merge pull request #3402 from hashicorp/leader-loop Applies leader loop fixes from Consul.	2017-11-03 13:40:59 -07:00
Diptanu Choudhury	5a0edf646b	Resetting the timer at the beginning of the loop	2017-11-01 13:15:06 -07:00
Diptanu Choudhury	46bc4280b2	Adding support for tagged metrics	2017-11-01 13:15:06 -07:00
Diptanu Choudhury	524a1f0712	Publishing metrics for job summary	2017-11-01 13:15:06 -07:00
Alex Dadgar	794daefa5e	clear the token	2017-10-23 15:11:13 -07:00

124 commits