open-nomad

Commit Graph

Author	SHA1	Message	Date
Yoan Blanc	5e8254beda	feat: remove dependency to consul/lib Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2022-04-09 13:22:44 +02:00
Seth Hoenig	1274aa690f	tests: deflake test that joins a server with non-voting servers to form quorum This PR - upgrades the serf library - has the test start the join process using the un-joined server first - disables schedulers on the servers - uses the WaitForLeader and wantPeers helpers Not sure which, if any, of these actually improves the flakiness of this test.	2022-02-24 17:02:58 -06:00
Michael Schurter	7494a0c4fd	core: remove all traces of unused protocol version Nomad inherited protocol version numbering configuration from Consul and Serf, but unlike those projects Nomad has never used it. Nomad's `protocol_version` has always been `1`. While the code is effectively unused and therefore poses no runtime risks to leave, I felt like removing it was best because: 1. Nomad's RPC subsystem has been able to evolve extensively without needing to increment the version number. 2. Nomad's HTTP API has evolved extensively without incrementing `API{Major,Minor}Version`. If we want to version the HTTP API in the future, I doubt this is the mechanism we would choose. 3. The presence of the `server.protocol_version` configuration parameter is confusing since `server.raft_protocol` is an important parameter for operators to consider. Even more confusing is that there is a distinct Serf protocol version which is included in `nomad server members` output under the heading `Protocol`. `raft_protocol` is the only protocol version relevant to Nomad developers and operators. The other protocol versions are either dead code or have never changed (Serf). 4. If we were to need to version the RPC, HTTP API, or Serf protocols, I don't think these configuration parameters and variables are the best choice. If we come to that point we should choose a versioning scheme based on the use case and modern best practices -- not this 6+ year old dead code.	2022-02-18 16:12:36 -08:00
Luiz Aoqui	0e09b120e4	fix mTLS certificate check on agent to agent RPCs (#11998) PR #11956 implemented a new mTLS RPC check to validate the role of the certificate used in the request, but further testing revealed two flaws: 1. client-only endpoints did not accept server certificates so the request would fail when forwarded from one server to another. 2. the certificate was being checked after the request was forwarded, so the check would happen over the server certificate, not the actual source. This commit checks for the desired mTLS level, where the client level accepts either a server or a client certificate. It also validates the certificate before the request is forwarded.	2022-02-04 20:35:20 -05:00
Luiz Aoqui	c4cff5359f	Verify TLS certificate on endpoints that are used between agents only (#11956)	2022-02-02 15:03:18 -05:00
Mahmood Ali	ce43a7a852	update tests to make an actual RaftRPC	2021-08-27 10:37:30 -04:00
Mahmood Ali	ff7c1ca79b	Apply authZ for nomad Raft RPC layer When mTLS is enabled, only nomad servers of the region should access the Raft RPC layer. Clients and servers in other regions should only use the Nomad RPC endpoints. Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@hashicorp.com>	2021-08-26 15:10:07 -04:00
Kris Hicks	0a3a748053	Add gosimple linter (#9590)	2020-12-09 11:05:18 -08:00
Benjamin Buzbee	e0acbbfcc6	Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266)	2020-11-30 15:11:10 -05:00
Pierre Cauchois	13218dc345	Enforce bounds on MaxQueryTime (#9064) The MaxQueryTime value used in QueryOptions.HasTimedOut() can be set to an invalid value that would throw off how RPC requests are retried. This fix uses the same logic that enforces the MaxQueryTime bounds in the blockingRPC() call.	2020-10-15 08:43:06 -04:00
Mahmood Ali	e37a3312d5	If leadership fails, consider it handled The callers for `forward` and old implementation expect failures to be accompanied with a true value! This fixes the issue and has tests passing!	2020-05-31 22:06:17 -04:00
Mahmood Ali	30ab9c84e5	more review feedback	2020-05-31 21:39:09 -04:00
Mahmood Ali	2108681c1d	Endpoint for snapshotting server state	2020-05-21 20:04:38 -04:00
Yoan Blanc	225c9c1215	fixup! vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:48:07 -04:00
Yoan Blanc	761d014071	vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:45:21 -04:00
Lang Martin	334979a754	nomad/rpc: indicate missing region in error message	2020-03-23 13:58:29 -04:00
Mahmood Ali	e106d373b2	rpc: Use MultiplexV2 for connections MultiplexV2 is a new connection multiplex header that supports multiplexing both RPC and streaming requests over the same Yamux connection. MultiplexV2 was added in 0.8.0 as part of https://github.com/hashicorp/nomad/pull/3892. So Nomad 0.11 can expect it to be supported. Though, some more rigorous testing is required before merging this. I want to call out some implementation details: First, the current connection pool reuses the Yamux stream for multiple RPC calls, and doesn't close them until an error is encountered. This commit doesn't change it, and sets the `RpcNomad` byte only at stream creation. Second, the StreamingRPC session gets closed by callers and cannot be reused. Every StreamingRPC opens a new Yamux session.	2020-02-03 19:31:39 -05:00
Michael Schurter	c82b14b0c4	core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to first byte being read so RPC connections with TLS enabled have `rpc_handshake_time * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion.	2020-01-30 10:38:25 -08:00
Drew Bailey	a61bf32314	Allow nomad monitor command to lookup server UUID Allows addressing servers with nomad monitor using the server's name or ID. Also unifies logic for addressing servers for client_agent_endpoint commands and makes addressing logic region aware. rpc getServer test	2020-01-29 13:55:29 -05:00
Drew Bailey	4bc68855d0	use intercepting loggers for rpchandlers	2019-11-05 09:51:50 -05:00
Mahmood Ali	d699a70875	Merge pull request #5911 from hashicorp/b-rpc-consistent-reads Block rpc handling until state store is caught up	2019-08-20 09:29:37 -04:00
Mahmood Ali	ad39bcef60	rpc: use tls wrapped connection for streaming rpc This ensures that server-to-server streaming RPC calls use the tls wrapped connections. Prior to this, `streamingRpcImpl` function uses tls for setting header and invoking the rpc method, but returns unwrapped tls connection. Thus, streaming writes fail with tls errors. This tls streaming bug existed since 0.8.0[1], but PR #5654[2] exacerbated it in 0.9.2. Prior to PR #5654, nomad client used to shuffle servers at every heartbeat -- `servers.Manager.setServers`[3] always shuffled servers and was called by heartbeat code[4]. Shuffling servers meant that a nomad client would heartbeat and establish a connection against all nomad servers eventually. When handling streaming RPC calls, nomad servers used these local connections to communicate directly to the client. The server-to-server forwarding logic was left mostly unexercised. PR #5654 means that a nomad client may connect to a single server only and caused the server-to-server forward streaming RPC code to get exercised more and unearthed the problem. [1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515 [2] https://github.com/hashicorp/nomad/pull/5654 [3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216 [4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603	2019-07-12 14:41:44 +08:00
Mahmood Ali	ea3a98357f	Block rpc handling until state store is caught up Here, we ensure that the leader only responds to RPC calls when the state store is up to date. At leadership transition or launch with restored state, the server local store might not be caught up with latest raft logs and may return a stale read. The solution here is to have an RPC consistency read gate, enabled when `establishLeadership` completes before we respond to RPC calls. `establishLeadership` is gated by a `raft.Barrier` which ensures that all prior raft logs have been applied. Conversely, the gate is disabled when leadership is lost. This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files	2019-07-02 16:07:37 +08:00
Chris Baker	121a9eb8cb	some changes for more idiomatic code	2018-12-12 23:11:17 +00:00
Chris Baker	34600f8b75	fixed bug in loop delay	2018-12-12 19:16:41 +00:00
Chris Baker	89c64932c1	gofmt	2018-12-12 19:09:06 +00:00
Chris Baker	22c11d8799	improved code for readability	2018-12-12 18:52:06 +00:00
Chris Baker	59beae35df	nomad/rpc listener: modified to throttle logging on "permanent" Accept() errors as well (with a higher delay cap)	2018-12-07 22:14:15 +00:00
Chris Baker	707bac0a7b	rpc accept loop: added backoff on logging for failed connections, in case there is a fast fail loop (NMD-1173)	2018-12-07 20:12:55 +00:00
Alex Dadgar	9971b3393f	yamux	2018-09-17 14:22:40 -07:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Xopherus	8d747578e8	Close multiplexer when context is cancelled Multiplexer continues to create rpc connections even when the context which is passed to the underlying rpc connections is cancelled by the server. This was causing #4413 - when a SIGHUP causes everything to reload, it uses context to cancel the underlying http/rpc connections so that they may come up with the new configuration. The multiplexer was not being cancelled properly so it would continue to create rpc connections and constantly fail, causing communication issues with other nomad agents. Fixes #4413	2018-08-13 19:32:49 -04:00
Alex Dadgar	7f28cfcdfe	small cleanup	2018-03-30 15:49:56 -07:00
Alex Dadgar	5dacb057b7	Only track nodes if the conn is from the node Fixes a bug in which a connection to a Nomad server was treated as a connection to a node because the server forwarded a node specific RPC.	2018-03-27 09:59:31 -07:00
Alex Dadgar	a1faab0e58	Server TLS	2018-02-15 15:03:12 -08:00
Alex Dadgar	5b9806590b	add logging	2018-02-15 13:59:03 -08:00
Alex Dadgar	64ad3119d0	Implement MultiplexV2 RPC handling Implements and tests the V2 multiplexer. This will not be used until several versions of Nomad have been released to mitigate upgrade concerns.	2018-02-15 13:59:02 -08:00
Alex Dadgar	cea77df6a7	Add Streaming RPC ack This PR introduces an ack allowing the receiving end of the streaming RPC to return any error that may have occurred during the establishment of the streaming RPC.	2018-02-15 13:59:02 -08:00
Alex Dadgar	6c1fa878ea	Forwarding	2018-02-15 13:59:02 -08:00
Alex Dadgar	2c0ad26374	New RPC Modes and basic setup for streaming RPC handlers	2018-02-15 13:59:01 -08:00
Alex Dadgar	b5037f20db	Remove circular dependency	2018-02-15 13:59:01 -08:00
Alex Dadgar	3f786b904b	use server manager	2018-02-15 13:59:01 -08:00
Alex Dadgar	46770d57e5	Forwarding	2018-02-15 13:59:01 -08:00
Alex Dadgar	6dd1c9f49d	Refactor	2018-02-15 13:59:00 -08:00
Alex Dadgar	8058ab039f	Store the whole verified certificate chain	2018-02-15 13:59:00 -08:00
Alex Dadgar	13bbf3fbbb	Track client connections	2018-02-15 13:59:00 -08:00
Alex Dadgar	4243438661	Improve TLS cluster testing	2018-02-15 13:59:00 -08:00
Alex Dadgar	ba5ecb8c1a	Dynamic RPC servers with context	2018-02-15 13:59:00 -08:00
Chelsea Holland Komlo	3f34b59ee6	remove unnecessary nil checks; default case add tests for TLSConfig object	2018-01-08 09:24:28 -05:00
Chelsea Holland Komlo	d9ec538d6a	don't ignore error in http reloading code review feedback	2018-01-08 09:21:06 -05:00


95 Commits