* Fixed name of `nomad.scheduler.allocs.reschedule` metric
* Added new metrics to metrics reference documentation
* Expanded definitions of "waiting" metrics
* Changelog entry for #10236 and #10237
This changeset adds more specific recommendations about which metrics
to monitor and which resources to examine during incident
response.
It also renames the "Telemetry" section to "Monitoring Nomad" to
surface the material better and distinguish it from the "Metric
Reference".
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
The old description of `{plan,worker}.wait_for_index` described the metric
in terms of waiting for a snapshot, which has two problems:
1. "Snapshot" is an overloaded term in Nomad and operators can't be
expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on*
before taking a snapshot: the Raft index of the object to be
processed (plan or eval).
The new description tries to cram all of that context into the tiny
space provided.
See #5791 for details about the `wait_for_index` mechanism in general.