open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	7d53ed88d6	csi: client RPCs should return wrapped errors for checking (#8605) When the client-side actions of a CSI client RPC succeed but we get disconnected during the RPC or we fail to checkpoint the claim state, we want to be able to retry the client RPC without getting blocked by the client-side state (ex. mount points) already having been cleaned up in previous calls.	2020-08-07 11:01:36 -04:00
Mahmood Ali	cf53ee57cd	remove unused dropButLastChannel	2020-02-13 18:56:53 -05:00
Mahmood Ali	79823ae07d	handle channel close signal Always deliver last value then send close signal.	2020-01-28 09:44:34 -05:00
Mahmood Ali	e436d2701a	Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so the channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[2] and does not move on to executing follower related logic (in `raft.runFollower`). While the Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader|Follower}` methods consume `raft.applyCh`, and while the runLeader defer is blocked, all raft log applications or config lookups will block indefinitely. Sadly, `leaderLoop` and `establishLeader` make a few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hit any of these Raft calls, the Raft handlers _deadlock_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grows over time and may trigger out-of-memory errors in addition to missed updates. [1] `d90d6d6bda/config.go (L190-L193)` [2] `d90d6d6bda/raft.go (L425-L436)` [3] `2a89e47746/nomad/leader.go (L198-L202)` [4] `2a89e47746/nomad/leader.go (L877)`	2020-01-22 13:08:34 -05:00
Mahmood Ali	4b2ba62e35	acl: check ACL against object namespace Fix a bug where a malicious user can access or manipulate an alloc in a namespace they don't have access to. The allocation endpoints perform ACL checks against the request namespace, not the allocation namespace, and perform the allocation lookup independently from namespaces. Here, we check that the requester can access the alloc namespace regardless of the declared request namespace. Ideally, we'd enforce that the declared request namespace matches the actual allocation namespace. Unfortunately, we haven't documented alloc endpoints as namespaced functions; we suspect starting to enforce this will be very disruptive and inappropriate for a nomad point release. As such, we maintain current behavior that doesn't require passing the proper namespace in request. A future major release may start enforcing checking declared namespace.	2019-10-08 12:59:22 -04:00
Lang Martin	4610c70777	util simplify partitionAll	2019-07-10 13:56:19 -04:00
Lang Martin	10848841be	util partitionAll for paging	2019-07-10 13:56:19 -04:00
Arshneet Singh	b7b050cdd1	Change min version required for plan optimization	2019-04-24 12:36:07 -07:00
Arshneet Singh	d4e7a5c005	Add comments to functions, and use require instead of assert	2019-04-23 09:57:21 -07:00
Arshneet Singh	0dd4c109e8	Compat tags	2019-04-23 09:18:01 -07:00
Arshneet Singh	b977748a4b	Add code for plan normalization	2019-04-23 09:18:01 -07:00
Alex Dadgar	5009566503	do not bootstrap with non voters	2018-09-19 17:17:39 -07:00
Michael Schurter	e1cbcf0b3c	rpc: give min rpc version variable a better name	2018-04-09 11:09:05 -07:00
Michael Schurter	88a9409f8e	rpc: only attempt NodeRpc for nodes>=0.8 Attempting NodeRpc (or streaming node rpc) for clients that do not support it causes it to hang indefinitely because while the TCP connection exists, the client will never respond.	2018-04-09 11:08:06 -07:00
Alex Dadgar	6c1fa878ea	Forwarding	2018-02-15 13:59:02 -08:00
Alex Dadgar	6dd1c9f49d	Refactor	2018-02-15 13:59:00 -08:00
Kyle Havlovitz	2ccf565bf6	Refactor redundancy_zone/upgrade_version out of client meta	2018-01-29 20:03:38 -08:00
Kyle Havlovitz	7b980c42d8	Add raft remove by id endpoint/command	2018-01-16 13:35:32 -08:00
Kyle Havlovitz	1c07066064	Add autopilot functionality based on Consul's autopilot	2017-12-18 14:29:41 -08:00
Kyle Havlovitz	045f346293	Use region instead of datacenter for version checking	2017-12-12 10:17:16 -06:00
Kyle Havlovitz	f088446d48	Add missing exist checks and doc line	2017-12-12 10:17:16 -06:00
Kyle Havlovitz	b775fc7b33	Added support for v2 raft APIs and -raft-protocol option	2017-12-12 10:17:16 -06:00
Alex Dadgar	c1cc51dbee	sync	2017-10-13 14:36:02 -07:00
Alex Dadgar	84d06f6abe	Sync namespace changes	2017-09-07 17:04:21 -07:00
Sean Chittenden	bff57a0dce	Reconcile, clean up, and centralize API version numbers (major and minor). Reduce future confusion by introducing a minor version that is gossiped out via the `mvn` Serf tag (Minor Version Number, `vsn` is already being used to communicate `Major Version Number`). Background: hashicorp/consul/issues/1346#issuecomment-151663152	2016-06-10 15:50:11 -04:00
Sean Chittenden	49deaae2ae	Seed random once in main	2016-06-10 15:48:36 -04:00
Sean Chittenden	e36686a17d	Use consul/lib's RandomStagger Removes four redundant copies of the method in the process.	2016-06-10 15:48:36 -04:00
Sean Chittenden	e0e7d94450	Use consul/lib's RateScaledInterval	2016-06-10 15:48:36 -04:00
Alex Dadgar	e1dc47de91	Remove blank line	2016-02-17 11:48:52 -08:00
Alex Dadgar	25c5e543f4	Use crypto random seed	2016-02-17 11:47:02 -08:00
Armon Dadgar	ea0795995d	Use a single implementation of GenerateUUID	2015-09-07 15:23:03 -07:00
Chris Bednarski	96cb220ff4	Update references to "os" to use "kernel.name" This brings test code and mocks up to date with the fingerprinter. This was a slightly larger change than I anticipated, but I think it's good for two reasons: 1. More semantically correct. `os.name` is something like "Windows 10 Pro" or "Ubuntu", while `kernel.name` is "windows" or "linux". `os.version` and `kernel.version` match these semantics. 2. `kernel.name` is much easier to grep for than `os`, which is helpful because oracle can't help us with strings.	2015-08-28 01:30:47 -07:00
Armon Dadgar	e489ee8ebd	nomad: add rate based scaling util methods	2015-08-22 17:12:24 -07:00
Armon Dadgar	40def1a187	nomad: expose RuntimeStats	2015-08-20 15:29:30 -07:00
Armon Dadgar	8913a42674	nomad: move and test max function	2015-08-04 17:13:40 -07:00
Armon Dadgar	890db2d2b7	nomad: adding utility shuffle	2015-07-23 17:30:07 -07:00
Armon Dadgar	e8964a4975	nomad: adding utility methods	2015-06-06 00:14:08 +02:00
Armon Dadgar	490fa1b7db	nomad: testing serf join	2015-06-04 12:33:12 +02:00
Armon Dadgar	4f6ecce727	nomad: working on serf member parsing	2015-06-03 13:35:48 +02:00
Armon Dadgar	d52122f041	nomad: more skeleton	2015-06-03 12:26:50 +02:00
Armon Dadgar	1e7f84f3e6	nomad: adding basic structure for raft	2015-06-01 17:49:10 +02:00
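
Note on e436d2701a and 79823ae07d: the leadership-flapping commits above describe a notification channel that must be "aggressively consumed" so the sender never blocks, with the last value always delivered before the close signal. The following is a minimal Go sketch of that general pattern, not the code from those commits. The name `latestOnly`, its signature, and the use of a `done` channel for shutdown are assumptions made for illustration.

```go
package main

// latestOnly is a hypothetical helper (not Nomad's implementation). It drains
// src eagerly so that a sender on src is never blocked by a slow consumer.
// If several values arrive before the consumer reads, the older ones are
// dropped and only the most recent value is kept for delivery.
// When src is closed, any pending value is delivered first, then the
// returned channel is closed. Closing done stops the goroutine early.
func latestOnly(src <-chan bool, done <-chan struct{}) <-chan bool {
	dst := make(chan bool)
	go func() {
		defer close(dst)
		for {
			var v, ok bool
			// Wait for the next value to forward.
			select {
			case v, ok = <-src:
				if !ok {
					return // nothing pending, propagate the close
				}
			case <-done:
				return
			}
		deliver:
			for {
				select {
				case dst <- v:
					// The consumer took the value; go wait for the next one.
					break deliver
				case nv, ok := <-src:
					if !ok {
						// Source closed: deliver the last value, then close.
						select {
						case dst <- v:
						case <-done:
						}
						return
					}
					// A newer value supersedes the undelivered one.
					v = nv
				case <-done:
					return
				}
			}
		}
	}()
	return dst
}
```

With this sketch, a caller that sends true, false, true on `src` before anything reads from the returned channel would then read only the final true; the two earlier transitions are dropped instead of stalling the sender.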

41 Commits