open-nomad

Commit Graph

Author	SHA1	Message	Date
Mahmood Ali	a9f551542d	Merge pull request #160 from hashicorp/b-mtls-hostname server: validate role and region for RPC w/ mTLS	2020-01-30 12:59:17 -06:00
Mahmood Ali	e436d2701a	Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[1] and does not move on to executing follower related logic (in `raft.runFollower`). While Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader|Follower}` methods consume `raft.applyCh`, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely. Sadly, `leaderLoop` and `establishLeader` make a few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hits any of these Raft calls, the Raft handler _deadlocks_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grows over time and may trigger out of memory errors in addition to missed updates. [1] `d90d6d6bda/config.go (L190-L193)` [2] `d90d6d6bda/raft.go (L425-L436)` [3] `2a89e47746/nomad/leader.go (L198-L202)` [4] `2a89e47746/nomad/leader.go (L877)`	2020-01-22 13:08:34 -05:00
Drew Bailey	50288461c9	Server request forwarding for Agent.Profile Return rpc errors for profile requests, set up remote forwarding to target leader or server id for profile requests. server forwarding, endpoint tests	2020-01-09 15:15:03 -05:00
Drew Bailey	8178beecf0	address feedback, use agent_endpoint instead of monitor	2019-11-05 09:51:53 -05:00
Drew Bailey	4bc68855d0	use intercepting loggers for rpchandlers	2019-11-05 09:51:50 -05:00
Drew Bailey	3b9c33a5f0	new hclog with standardlogger intercept	2019-11-05 09:51:49 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
Lang Martin	31d7f116dd	nomad/server comments	2019-09-24 14:36:18 -04:00
Mahmood Ali	d699a70875	Merge pull request #5911 from hashicorp/b-rpc-consistent-reads Block rpc handling until state store is caught up	2019-08-20 09:29:37 -04:00
Nick Ethier	965f00b2fc	Builtin Admission Controller Framework (#6116) * nomad: add admission controller framework * nomad: add admission controller framework and Consul Connect hooks * run admission controllers before checking permissions * client: add default node meta for connect configurables * nomad: remove validateJob func since it has been moved to admission controller * nomad: use new TaskKind type * client: use consts for connect sidecar image and log level * Apply suggestions from code review Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad: add job register test with connect sidecar * Update nomad/job_endpoint_hooks.go Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-08-15 11:22:37 -04:00
Mahmood Ali	ea3a98357f	Block rpc handling until state store is caught up Here, we ensure that the leader only responds to RPC calls when the state store is up to date. At leadership transition or launch with restored state, the server local store might not be caught up with latest raft logs and may return a stale read. The solution here is to have an RPC consistency read gate, enabled when `establishLeadership` completes before we respond to RPC calls. `establishLeadership` is gated by a `raft.Barrier` which ensures that all prior raft logs have been applied. Conversely, the gate is disabled when leadership is lost. This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files	2019-07-02 16:07:37 +08:00
Preetha Appan	dc0ac81609	Change interval of raft stats collection to 10s	2019-06-19 11:58:46 -05:00
Preetha Appan	104d66f10c	Changed name of metric	2019-06-17 15:51:31 -05:00
Preetha Appan	c54b4a5b17	Emit metrics with raft commit and apply index and statestore latest index	2019-06-14 16:30:27 -05:00
Michael Schurter	9732bc37ff	nomad: refactor waitForIndex into SnapshotAfter Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.	2019-05-17 13:30:23 -07:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	3c668732af	server: server forwarding logic for nomad exec endpoint	2019-05-09 16:49:08 -04:00
Mahmood Ali	92c133b905	Update peers info with new raft config details	2019-05-03 16:55:53 -04:00
Hemanth Basappa	3fef02aa93	Add support in nomad for supporting raft 3 protocol peers.json	2019-05-02 09:11:23 -07:00
HashedDan	caad68e799	server: inconsistent receiver notation corrected Signed-off-by: HashedDan <georgedanielmangum@gmail.com>	2019-03-16 17:53:53 -05:00
Mahmood Ali	6efea6d8fc	Populate agent-info with vault Return Vault TTL info to /agent/self API and `nomad agent-info` command.	2018-11-20 17:10:55 -05:00
Alex Dadgar	6d8bb3a7bd	Duplicate blocked evals cancelling improved The old logic for cancelling duplicate blocked evaluations by job id had the issue where the newer evaluation could have additional node classes that it is (in)eligible for that we would not capture. This could make it such that cluster state could change such that the job would make progress but no evaluation was unblocked.	2018-11-07 10:08:23 -08:00
Alex Dadgar	9971b3393f	yamux	2018-09-17 14:22:40 -07:00
Alex Dadgar	b2f500b48c	Serf/Raft/Memberlist logger	2018-09-17 13:57:52 -07:00
Alex Dadgar	ca28afa3b2	small fixes	2018-09-15 16:42:38 -07:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Chelsea Holland Komlo	de03ce8070	move logic to determine whether to reload tls configuration to tlsutil helper	2018-06-08 14:33:58 -04:00
Chelsea Holland Komlo	38f611a7f2	refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing add missing fields to TLS merge method	2018-05-23 18:35:30 -04:00
Chelsea Komlo	687c26093c	Merge pull request #4269 from hashicorp/f-tls-remove-weak-standards Configurable TLS cipher suites and versions; disallow weak ciphers	2018-05-11 08:11:46 -04:00
Preetha Appan	ca5758741b	Update serf to pick up graceful leave fix	2018-05-10 11:16:24 -05:00
Chelsea Holland Komlo	620558c107	log error if unable to create TLS configuration	2018-05-10 11:51:54 -04:00
Chelsea Holland Komlo	796bae6f1b	allow configurable cipher suites disallow 3DES and RC4 ciphers add documentation for tls_cipher_suites	2018-05-09 17:15:31 -04:00
Alex Dadgar	a510774451	Use UpdateAllocDesiredTransistion instead of UpsertEval but no transitions yet	2018-05-07 14:50:01 -05:00
Alex Dadgar	4a23307baf	Track all client connections	2018-04-26 13:22:09 -07:00
Alex Dadgar	7f28cfcdfe	small cleanup	2018-03-30 15:49:56 -07:00
Chelsea Holland Komlo	a77dd08dd9	prevent double close due to error in creating listener	2018-03-30 17:15:56 -04:00
Chelsea Holland Komlo	402a026c88	add further error handling for rpc connection handling	2018-03-30 17:03:36 -04:00
Chelsea Holland Komlo	58ada9bc42	return error when setting checksum; don't reload	2018-03-28 18:15:50 -04:00
Chelsea Holland Komlo	2d5af7ff4d	set TLS checksum when parsing config Refactor checksum comparison, always set checksum if it is empty	2018-03-28 09:56:11 -04:00
Chelsea Holland Komlo	dd5f627feb	set server configuration checksum on reload	2018-03-27 18:03:52 -04:00
Chelsea Komlo	57e2cd04bd	Merge pull request #4025 from hashicorp/reload-http-tls Allow TLS configurations for HTTP and RPC connections to be reloaded …	2018-03-26 18:00:30 -04:00
Michael Schurter	9898edfa90	Switch to drainerv2 impl	2018-03-21 16:51:44 -07:00
Alex Dadgar	e63bcb474d	Drainer	2018-03-21 16:51:44 -07:00
Michael Schurter	8b41e9b2e1	drainer: drainer should shutdown with server	2018-03-21 16:51:44 -07:00
Michael Schurter	0a17076ad2	refactor drainer into a subpkg	2018-03-21 16:51:44 -07:00
Chelsea Holland Komlo	66e44cdb73	Allow TLS configurations for HTTP and RPC connections to be reloaded separately	2018-03-21 17:51:08 -04:00
Alex Dadgar	b8607ad6d6	Heartbeat uses client rpc advertise and server defaults server rpc advertise addr	2018-03-16 16:47:08 -07:00
Alex Dadgar	52b7fb5361	Separate client and server rpc advertise addresses	2018-03-16 16:47:08 -07:00
Alex Dadgar	92cb552ff6	Always add core scheduler and detect invalid schedulers	2018-03-14 10:53:27 -07:00
Alex Dadgar	55e4f5cdc4	Require core scheduler	2018-03-14 10:37:49 -07:00
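
Commit e436d2701a above describes a producer (Raft) that blocks while writing leadership transitions to a notification channel unless the channel is buffered or aggressively consumed. The Go sketch below illustrates the "aggressively consume" idea only; it is not the code from that commit. The helper name `latestOnly` and the `notify`/`state` channels are hypothetical, introduced purely for illustration.

```go
package main

import "fmt"

// latestOnly is a hypothetical helper (not part of Nomad or hashicorp/raft).
// It drains `in` as soon as values arrive and offers only the most recent
// value on the returned channel, so a slow consumer never blocks the producer.
func latestOnly(in <-chan bool) <-chan bool {
	out := make(chan bool)
	go func() {
		defer close(out)
		var latest, pending bool
		for {
			// A nil channel disables the send case while nothing is pending.
			var send chan<- bool
			if pending {
				send = out
			}
			select {
			case v, ok := <-in:
				if !ok {
					if pending {
						out <- latest // flush the final state before closing
					}
					return
				}
				latest, pending = v, true // overwrite any stale, undelivered value
			case send <- latest:
				pending = false
			}
		}
	}()
	return out
}

func main() {
	notify := make(chan bool) // unbuffered: a bare send blocks until someone receives
	state := latestOnly(notify)

	// Simulate leadership flapping while the consumer is busy elsewhere.
	for _, isLeader := range []bool{true, false, true, false} {
		notify <- isLeader // completes promptly; the helper is always receiving
	}
	close(notify)

	// The consumer only sees the final state, not every intermediate flap.
	for isLeader := range state {
		fmt.Println("observed leadership:", isLeader)
	}
}
```

The trade-off in this sketch is that intermediate transitions are dropped on purpose: the consumer acts on the current leadership state rather than replaying every flap.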


218 Commits