open-nomad

Commit Graph

Author	SHA1	Message	Date
Luiz Aoqui	b5a42cd55d	set raft v3 as the default config (#12341)	2022-03-22 15:06:25 -04:00
Seth Hoenig	de95998faa	core: switch to go.etcd.io/bbolt This PR swaps the underlying BoltDB implementation from boltdb/bolt to go.etcd.io/bbolt. In addition, the Server has a new configuration option for disabling NoFreelistSync on the underlying database. Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81 Consul equivalent PR: https://github.com/hashicorp/consul/pull/11720	2022-02-23 14:26:41 -06:00
Michael Schurter	7494a0c4fd	core: remove all traces of unused protocol version Nomad inherited protocol version numbering configuration from Consul and Serf, but unlike those projects Nomad has never used it. Nomad's `protocol_version` has always been `1`. While the code is effectively unused and therefore poses no runtime risks to leave, I felt like removing it was best because: 1. Nomad's RPC subsystem has been able to evolve extensively without needing to increment the version number. 2. Nomad's HTTP API has evolved extensively without incrementing `API{Major,Minor}Version`. If we want to version the HTTP API in the future, I doubt this is the mechanism we would choose. 3. The presence of the `server.protocol_version` configuration parameter is confusing since `server.raft_protocol` is an important parameter for operators to consider. Even more confusing is that there is a distinct Serf protocol version which is included in `nomad server members` output under the heading `Protocol`. `raft_protocol` is the only protocol version relevant to Nomad developers and operators. The other protocol versions are either dead code or have never changed (Serf). 4. If we were to need to version the RPC, HTTP API, or Serf protocols, I don't think these configuration parameters and variables are the best choice. If we come to that point we should choose a versioning scheme based on the use case and modern best practices -- not this 6+ year old dead code.	2022-02-18 16:12:36 -08:00
Seth Hoenig	3371214431	core: implement system batch scheduler This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. Like the name implies, sysbatch can be thought of as a hybrid between system and batch jobs - it is for running short lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is still limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527	2021-08-03 10:30:47 -04:00
Jasmine Dahilig	ca4be6857e	deployment query rate limit (#10706)	2021-06-04 12:38:46 -07:00
Seth Hoenig	1ee8d5ffc5	api: implement fuzzy search API This PR introduces the /v1/search/fuzzy API endpoint, used for fuzzy searching objects in Nomad. The fuzzy search endpoint routes requests to the Nomad Server leader, which implements the Search.FuzzySearch RPC method. Requests to the fuzzy search API are based on the api.FuzzySearchRequest object, e.g. { "Text": "ed", "Context": "all" } Responses from the fuzzy search API are based on the api.FuzzySearchResponse object, e.g. { "Index": 27, "KnownLeader": true, "LastContact": 0, "Matches": { "tasks": [ { "ID": "redis", "Scope": [ "default", "example", "cache" ] } ], "evals": [], "deployment": [], "volumes": [], "scaling_policy": [], "images": [ { "ID": "redis:3.2", "Scope": [ "default", "example", "cache", "redis" ] } ] }, "Truncations": { "volumes": false, "scaling_policy": false, "evals": false, "deployment": false } } The API is tunable using the new server.search stanza, e.g. server { search { fuzzy_enabled = true limit_query = 200 limit_results = 1000 min_term_length = 5 } } These values can be increased or decreased, so as to provide more search results or to reduce load on the Nomad Server. The fuzzy search API can be disabled entirely by setting `fuzzy_enabled` to `false`.	2021-04-16 16:36:07 -06:00
Drew Bailey	74836b95b2	configuration and oss components for licensing (#10216) * configuration and oss components for licensing * vendor sync	2021-03-23 09:08:14 -04:00
Tim Gross	97b0e26d1f	RPC endpoints to support 'nomad ui -login' RPC endpoints for the user-driven APIs (`UpsertOneTimeToken` and `ExchangeOneTimeToken`) and token expiration (`ExpireOneTimeTokens`). Includes adding expiration to the periodic core GC job.	2021-03-10 08:17:56 -05:00
Drew Bailey	f3dcefe5a9	remove event durability (#9147) * remove event durability temporarily removing go-memdb event durability until a new strategy is developed on how to best handle increased durability needs * drop events table schema and state store methods * fix neweventbuffer invocations	2020-10-22 12:21:03 -04:00
Drew Bailey	c463479848	filter on additional filter keys, remove switch statement duplication properly wire up durable event count move newline responsibility moves newline creation from NDJson to the http handler, json stream only encodes and sends now ignore snapshot restore if broker is disabled enable dev mode to access event stream without acl use mapping instead of switch use pointers for config sizes, remove unused ttl, simplify closed conn logic	2020-10-14 14:14:33 -04:00
Drew Bailey	df96b89958	Add EvictCallbackFn to handle removing entries from go-memdb when they are removed from the event buffer. Wire up event buffer size config, use pointers for structs.Events instead of copying.	2020-10-14 12:44:42 -04:00
Drew Bailey	a4a2975edf	Event Stream API/RPC (#8947) This commit adds a /v1/events/stream endpoint to stream events from. The stream framer has been updated to include a SendFull method which does not fragment the data between multiple frames. This essentially treats the stream framer as an envelope to adhere to the stream framer interface in the UI. If the `encode` query parameter is omitted events will be streamed as newline delimited JSON.	2020-10-14 12:44:36 -04:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Tim Gross	2854298089	csi: release claims via csi_hook postrun unpublish RPC (#8580) Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This may forward client RPCs to the node plugins or to the controller plugins, depending on whether other allocations on this node have claims on this volume. By making clients responsible for running the `CSIVolume.Unpublish` RPC (and making the RPC available to a `nomad volume detach` command), the volumewatcher becomes only used by the core GC job and we no longer need async volume GC from job deregister and node update.	2020-08-06 14:51:46 -04:00
Mahmood Ali	ad2d484974	Set AgentShutdown	2020-07-17 11:04:57 -04:00
Drew Bailey	34871f89be	Oss license support for ent builds (#8054) * changes necessary to support oss licensing shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Tim Gross	3aa761b151	Periodic GC for volume claims (#7881) This changeset implements a periodic garbage collection of CSI volumes with missing allocations. This can happen in a scenario where a node update fails partially and the allocation updates are written to raft but the evaluations to GC the volumes are dropped. This feature will cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1 get any stray claims cleaned up.	2020-05-11 08:20:50 -04:00
Tim Gross	801ebcfe8d	periodic GC for CSI plugins (#7878) This changeset implements a periodic garbage collection of unused CSI plugins. Plugins are self-cleaning when the last allocation for a plugin is stopped, but this feature will cover any missing edge cases and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins cleaned up.	2020-05-06 16:49:12 -04:00
Charlie Voiselle	663fb677cf	Add SchedulerAlgorithm to SchedulerConfig	2020-05-01 13:13:29 -04:00
Mahmood Ali	17ee94b52b	fix typo	2020-03-03 16:55:54 -05:00
Mahmood Ali	acbfeb5815	Simplify Bootstrap logic in tests This change updates tests to honor `BootstrapExpect` exclusively when forming test clusters and removes test only knobs, e.g. `config.DevDisableBootstrap`. Background: Test cluster creation is fragile. Test servers don't follow the BootstrapExpected route like production clusters. Instead they start as single node clusters and then get rejoined, which risks causing split brain or other test flakiness. The test framework exposes a few knobs to control those (e.g. `config.DevDisableBootstrap` and `config.Bootstrap`) that control whether a server should bootstrap the cluster. These flags are confusing and it's unclear when to use them: their usage in multi-node clusters isn't properly documented. Furthermore, they have some bad side-effects as they don't control the Raft library: If `config.DevDisableBootstrap` is true, the test server may not immediately attempt to bootstrap a cluster, but after an election timeout (~50ms), Raft may force a leadership election and win it (with only one vote) and cause a split brain. The knobs are also confusing as Bootstrap is an overloaded term. In BootstrapExpect, we refer to bootstrapping the cluster only after N servers are connected. But in tests and the knobs above, it refers to whether the server is a single node cluster and shouldn't wait for any other server. Changes: This commit makes two changes: First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or `DevMode` flags. This change is relatively trivial. Second, it introduces a `Bootstrapped` flag to track if the cluster is bootstrapped. This allows us to keep `BootstrapExpected` immutable. Previously, the flag was a config value but it gets set to 0 after cluster bootstrap completes.	2020-03-02 13:47:43 -05:00
Michael Schurter	14c5ef3a8d	test: fix race around reused default rpc addr The default RPC addr was a global which is fine for normal runtime use when it only has a single user. However many tests modify it and cause races. Follow our convention of returning defaults from funcs instead of using globals.	2020-02-07 15:50:53 -08:00
Michael Schurter	c82b14b0c4	core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to the first byte being read so RPC connections with TLS enabled have `rpc_handshake_timeout * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion.	2020-01-30 10:38:25 -08:00
Mahmood Ali	af17b4afc7	Support customizing full scheduler config	2020-01-28 14:51:42 -05:00
Mahmood Ali	1ab682f622	scheduler: allow configuring default preemption for system scheduler Some operators want a greater control over when preemption is enabled, especially during an upgrade to limit potential side-effects.	2020-01-13 08:30:49 -05:00
Drew Bailey	fd42020ad6	RPC server EnableDebug option Passes in agent enable_debug config to nomad server and client configs. This allows for rpc endpoints to have more granular control if they should be enabled or not in combination with ACLs. enable debug on client test	2020-01-09 15:15:07 -05:00
Drew Bailey	786989dbe3	New monitor pkg for shared monitor functionality Adds new package that can be used by client and server RPC endpoints to facilitate monitoring based off of a logger clean up old code small comment about write rm old comment about minsize rename to Monitor Removes connection logic from monitor command Keep connection logic in endpoints, use a channel to send results from monitoring use new multisink logger and interfaces small test for dropped messages update go-hclogger and update sink/intercept logger interfaces	2019-11-05 09:51:49 -05:00
Lang Martin	fb41dd86ba	default raft protocol v2	2019-09-24 14:37:55 -04:00
Mahmood Ali	3e2ab6e2a3	address review feedback	2019-09-03 21:44:39 -04:00
Mahmood Ali	6eabf53b91	Default raft protocol to version 3	2019-08-28 15:56:59 -04:00
Michael Schurter	32daa7b47b	goimports until make check is happy	2019-01-23 06:27:14 -08:00
Michael Schurter	be0bab7c3f	move pluginutils -> helper/pluginutils I wanted a different color bikeshed, so I get to paint it	2019-01-22 15:50:08 -08:00
Alex Dadgar	cdcd3c929c	loader and singleton	2019-01-22 15:11:57 -08:00
Nick Ethier	85b221a1d6	nomad: add flag to disable publishing of job_summary metrics for dispatched jobs	2018-11-19 23:21:19 -05:00
Alex Dadgar	6f0ed6184b	Fix client reloading and pass the plugin loaders to server and client	2018-10-16 16:56:55 -07:00
Alex Dadgar	3c19d01d7a	server	2018-09-15 16:23:13 -07:00
Chelsea Holland Komlo	38f611a7f2	refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing add missing fields to TLS merge method	2018-05-23 18:35:30 -04:00
Alex Dadgar	2baa1c38f2	clarify comment	2018-03-16 16:47:08 -07:00
Alex Dadgar	52b7fb5361	Separate client and server rpc advertise addresses	2018-03-16 16:47:08 -07:00
Josh Soref	18c5659474	spelling: version	2018-03-11 19:13:25 +00:00
Josh Soref	d300623abe	spelling: evaluation	2018-03-11 18:01:35 +00:00
Josh Soref	24c76dbcca	spelling: cores	2018-03-11 17:51:20 +00:00
Kyle Havlovitz	2ccf565bf6	Refactor redundancy_zone/upgrade_version out of client meta	2018-01-29 20:03:38 -08:00
Kyle Havlovitz	1c07066064	Add autopilot functionality based on Consul's autopilot	2017-12-18 14:29:41 -08:00
Kyle Havlovitz	b775fc7b33	Added support for v2 raft APIs and -raft-protocol option	2017-12-12 10:17:16 -06:00
Chelsea Komlo	2dfda33703	Nomad agent reload TLS configuration on SIGHUP (#3479) * Allow server TLS configuration to be reloaded via SIGHUP * dynamic tls reloading for nomad agents * code cleanup and refactoring * ensure keyloader is initialized, add comments * allow downgrading from TLS * initialize keyloader if necessary * integration test for tls reload * fix up test to assert success on reloaded TLS configuration * failure in loading a new TLS config should remain at current Reload only the config if agent is already using TLS * reload agent configuration before specific server/client lock keyloader before loading/caching a new certificate * introduce a get-or-set method for keyloader * fixups from code review * fix up linting errors * fixups from code review * add lock for config updates; improve copy of tls config * GetCertificate only reloads certificates dynamically for the server * config updates/copies should be on agent * improve http integration test * simplify agent reloading storing a local copy of config * reuse the same keyloader when reloading * Test that server and client get reloaded but keep keyloader * Keyloader exposes GetClientCertificate as well for outgoing connections * Fix spelling * correct changelog style	2017-11-14 17:53:23 -08:00
Diptanu Choudhury	46bc4280b2	Adding support for tagged metrics	2017-11-01 13:15:06 -07:00
Diptanu Choudhury	d4128f0e5a	Setting the default stats collection interval	2017-11-01 13:15:06 -07:00
Diptanu Choudhury	524a1f0712	Publishing metrics for job summary	2017-11-01 13:15:06 -07:00
Alex Dadgar	e5ec915ac3	sync	2017-09-19 10:08:23 -05:00

96 Commits