open-nomad

Author	SHA1	Message	Date
Luiz Aoqui	b656981cf0	Track plan rejection history and automatically mark clients as ineligible (#13421 ) Plan rejections occur when the scheduler work and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at one, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
James Rasell	353323d171	agent: test full object when performing test config parse. (#13668 )	2022-07-11 16:21:36 +02:00
Michael Schurter	7494a0c4fd	core: remove all traces of unused protocol version Nomad inherited protocol version numbering configuration from Consul and Serf, but unlike those projects Nomad has never used it. Nomad's `protocol_version` has always been `1`. While the code is effectively unused and therefore poses no runtime risks to leave, I felt like removing it was best because: 1. Nomad's RPC subsystem has been able to evolve extensively without needing to increment the version number. 2. Nomad's HTTP API has evolved extensively without increment `API{Major,Minor}Version`. If we want to version the HTTP API in the future, I doubt this is the mechanism we would choose. 3. The presence of the `server.protocol_version` configuration parameter is confusing since `server.raft_protocol` is an important parameter for operators to consider. Even more confusing is that there is a distinct Serf protocol version which is included in `nomad server members` output under the heading `Protocol`. `raft_protocol` is the only protocol version relevant to Nomad developers and operators. The other protocol versions are either deadcode or have never changed (Serf). 4. If we were to need to version the RPC, HTTP API, or Serf protocols, I don't think these configuration parameters and variables are the best choice. If we come to that point we should choose a versioning scheme based on the use case and modern best practices -- not this 6+ year old dead code.	2022-02-18 16:12:36 -08:00
Matt Mukerjee	b56432e645	Add FailoverHeartbeatTTL to config (#11127 ) FailoverHeartbeatTTL is the amount of time to wait after a server leader failure before considering reallocating client tasks. This TTL should be fairly long as the new server leader needs to rebuild the entire heartbeat map for the cluster. In deployments with a small number of machines, the default TTL (5m) may be unnecessary long. Let's allow operators to configure this value in their config files.	2021-10-06 18:48:12 -04:00
Drew Bailey	74836b95b2	configuration and oss components for licensing (#10216 ) * configuration and oss components for licensing * vendor sync	2021-03-23 09:08:14 -04:00
Drew Bailey	f3dcefe5a9	remove event durability (#9147 ) * remove event durability temporarily removing go-memdb event durability until a new strategy is developed on how to best handled increased durability needs * drop events table schema and state store methods * fix neweventbuffer invocations	2020-10-22 12:21:03 -04:00
Drew Bailey	c463479848	filter on additional filter keys, remove switch statement duplication properly wire up durable event count move newline responsibility moves newline creation from NDJson to the http handler, json stream only encodes and sends now ignore snapshot restore if broker is disabled enable dev mode to access event steam without acl use mapping instead of switch use pointers for config sizes, remove unused ttl, simplify closed conn logic	2020-10-14 14:14:33 -04:00
Drew Bailey	df96b89958	Add EvictCallbackFn to handle removing entries from go-memdb when they are removed from the event buffer. Wire up event buffer size config, use pointers for structs.Events instead of copying.	2020-10-14 12:44:42 -04:00
Drew Bailey	d793529d61	event durability count and cfg	2020-10-14 12:44:40 -04:00
Drew Bailey	559517455a	wire up enable_event_publisher	2020-10-14 12:44:38 -04:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Yoan Blanc	891accb89a	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Mahmood Ali	0e8fafd739	implement raft multiplier	2020-05-31 12:24:27 -04:00
Tim Gross	3aa761b151	Periodic GC for volume claims (#7881 ) This changeset implements a periodic garbage collection of CSI volumes with missing allocations. This can happen in a scenario where a node update fails partially and the allocation updates are written to raft but the evaluations to GC the volumes are dropped. This feature will cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1 get any stray claims cleaned up.	2020-05-11 08:20:50 -04:00
Tim Gross	801ebcfe8d	periodic GC for CSI plugins (#7878 ) This changeset implements a periodic garbage collection of unused CSI plugins. Plugins are self-cleaning when the last allocation for a plugin is stopped, but this feature will cover any missing edge cases and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins cleaned up.	2020-05-06 16:49:12 -04:00
Mahmood Ali	b9e3cde865	tests and some clean up	2020-05-01 13:13:30 -04:00
Mahmood Ali	990cfb6fef	agent config parsing tests for scheduler config	2020-04-03 07:54:32 -04:00
Drew Bailey	a98dc8c768	update audit examples to an endpoint that is audited	2020-03-30 10:03:11 -04:00
Drew Bailey	b09abef332	Audit config, seams for enterprise audit features allow oss to parse sink duration clean up audit sink parsing ent eventer config reload fix typo SetEnabled to eventer interface client acl test rm dead code fix failing test	2020-03-23 13:47:42 -04:00
Mahmood Ali	437d03779c	tests: add tests for parsing cni fields	2020-02-28 14:18:45 -05:00
Mahmood Ali	f492ab6d9e	implement MinQuorum	2020-02-16 16:04:59 -06:00
Seth Hoenig	f030a22c7c	command, docs: create and document consul token configuration for connect acls (gh-6716) This change provides an initial pass at setting up the configuration necessary to enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert` commands (similar to Vault tokens). These values are not actually used yet in this changeset.	2020-01-31 19:02:53 -06:00
Danielle Lancashire	d3614ea0a8	config: Add required configuration for logging to a file	2019-10-07 14:16:59 +02:00
Danielle Lancashire	861caa9564	HostVolumeConfig: Source -> Path	2019-08-12 15:39:08 +02:00
Danielle Lancashire	063e4240c1	client: Add parsing and registration of HostVolume configuration	2019-08-12 15:39:08 +02:00
Pete Woods	b47c5ca467	Allow the job GC interval to be configured from default of 5 minutes	2019-07-26 10:11:25 +01:00
Danielle Lancashire	d454dab39b	chore: Format hcl configurations	2019-07-20 16:55:07 +02:00
Lang Martin	1e2f87a11e	agent/testdata add a configuration directory for testing	2019-06-11 16:34:04 -04:00
Lang Martin	28cf8eddfe	config parse_test check for string coercion in client.meta	2019-06-10 13:12:38 -04:00
Mahmood Ali	a9f81f2daa	client config flag to disable remote exec This exposes a client flag to disable nomad remote exec support in environments where access to tasks ought to be restricted. I used `disable_remote_exec` client flag that defaults to allowing remote exec. Opted for a client config that can be used to disable remote exec globally, or to a subset of the cluster if necessary.	2019-06-03 15:31:39 -04:00
Lang Martin	3ba6095fe3	config_parse_test additional config confirmation w/ sample json	2019-04-30 10:30:53 -04:00
Lang Martin	1f86770456	config_parse_test test direct hcl parsing	2019-04-30 10:29:14 -04:00
Preetha Appan	71e6550f81	Address review comments	2019-03-29 08:57:49 -05:00

33 commits