open-nomad

Author	SHA1	Message	Date
Charlie Voiselle	9a19279f59	Sweep of docs for repeated words; minor edits (#14032)	2022-08-05 16:45:30 -04:00
Tim Gross	e025afdf87	docs: concepts for secure variables and workload identity (#13764) Includes concept docs for secure variables, concept docs for workload identity, and an operations doc for keyring management.	2022-08-02 10:06:26 -04:00
Will Jordan	5354409b1a	Return 429 response on HTTP max connection limit (#13621) Instead of silently closing the connection, return a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached. Set a 10-millisecond write timeout and rate limiter for the connection-limit 429 response to prevent writing the HTTP response from consuming too many server resources. Add `nomad.agent.http.exceeded` metric counting the number of HTTP connections exceeding the concurrency limit.	2022-07-20 14:12:21 -04:00
Michael Schurter	e97548b5f8	Improve metrics reference documentation (#13769) * docs: tighten up parameterized job metrics docs * docs: improve alloc status descriptions Remove `nomad.client.allocations.start` as it doesn't exist.	2022-07-15 14:22:39 -07:00
Michael Schurter	5414f49821	docs: clarify blocked_evals metrics (#13751) Related to #13740 - blocked_evals.total_blocked is the number of evals blocked for any reason - blocked_evals.total_quota_limit is the number of evals blocked by quota limits, but critically: their resources are not counted in the cpu/memory	2022-07-14 11:32:33 -07:00
Luiz Aoqui	b656981cf0	Track plan rejection history and automatically mark clients as ineligible (#13421) Plan rejections occur when the scheduler worker and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown, this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at once, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
Tim Gross	fc4cd53cfb	docs: rename Internals to Concepts (#13696)	2022-07-11 16:55:33 -04:00
Michael Schurter	7b7c72b21d	docs: clarify total_escaped is just an optimization (#13460)	2022-06-22 11:39:56 -07:00
Michael Schurter	70a04dd106	docs: add plan for node rejected details and more (#12564) - Moved federation docs to the bottom since everyone is potentially affected by the other sections on the page, but only users of federation are affected by it. - Added section on the plan for node rejected bug since it is fairly easy to diagnose and removing affected nodes is a fairly reliable workaround. - Mention 5s cliff for wait_for_index. - Remove the lie that we do not have job status metrics! How old was that?! - Reinforce the importance of monitoring basic system resources	2022-04-14 16:09:33 -07:00
Jasmine Dahilig	386f2fac3a	docs: add token_last_renewal and token_next_renewal to server metrics and key metrics #12435 (#12505)	2022-04-07 15:12:41 -07:00
Derek Strickland	d7f44448e1	disconnected clients: Observability plumbing (#12141) * Add disconnects/reconnect to log output and emit reschedule metrics * TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics	2022-04-05 17:12:23 -04:00
Seth Hoenig	de95998faa	core: switch to go.etcd.io/bbolt This PR swaps the underlying BoltDB implementation from boltdb/bolt to go.etcd.io/bbolt. In addition, the Server has a new configuration option for disabling NoFreelistSync on the underlying database. Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81 Consul equivalent PR: https://github.com/hashicorp/consul/pull/11720	2022-02-23 14:26:41 -06:00
Luiz Aoqui	626e633b41	docs: add `nomad.plan.node_rejected` metric (#11860)	2022-01-18 13:47:20 -05:00
Tim Gross	32f150d469	docs: new scheduler metrics (#11790) * Fixed name of `nomad.scheduler.allocs.reschedule` metric * Added new metrics to metrics reference documentation * Expanded definitions of "waiting" metrics * Changelog entry for #10236 and #10237	2022-01-07 09:51:15 -05:00
Tim Gross	348f482c94	docs: improve docs for troubleshooting and monitoring scheduler (#11623) This changeset adds more specific recommendations as to what metrics to monitor, and what resources should be examined during incident response. It also renames the "Telemetry" section to "Monitoring Nomad" to surface the material better and distinguish it from the "Metric Reference". Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>	2021-12-07 15:52:13 -05:00
James Rasell	d44e5620dd	docs: add license expiry metric to metrics website doc.	2021-12-07 10:31:51 +00:00
kfenech1	26a0158ead	docs: `nomad.client.unallocated.memory` is in megabytes, not bytes (#11468)	2021-11-08 11:05:11 -05:00
Michael Schurter	fff95b0697	docs: improve wait_for_index metrics description (#10717) Old description of `{plan,worker}.wait_for_index` described the metric in terms of waiting for a snapshot which has two problems: 1. "Snapshot" is an overloaded term in Nomad and operators can't be expected to know which use we're referring to here. 2. The most important thing about the metric is what we're waiting on before taking a snapshot: the raft index of the object to be processed (plan or eval). The new description tries to cram all of that context into the tiny space provided. See #5791 for details about the `wait_for_index` mechanism in general.	2021-06-09 08:53:06 -04:00
Luiz Aoqui	f1b9055d21	Add metrics for blocked eval resources (#10454) * add metrics for blocked eval resources * docs: add new blocked_evals metrics * fix to call `pruneStats` instead of `stats.prune` directly	2021-04-29 15:03:45 -04:00
Bryce Kalow	a6ca40fa4e	feat(website): migrates to new nav data format (#10264)	2021-03-31 08:43:17 -05:00
Tim Gross	cf052cfee5	docs: add metrics from raft leadership transitions	2021-01-27 11:50:11 -05:00
Jeff Escalante	eaaafd9dd4	implement mdx remote	2021-01-05 19:02:39 -05:00

22 commits