The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.
Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.
This changeset includes the a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.
Increase `snapshotMinIndex` timeout to 10s.
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
The changelog entry for #13340 indicated it was an improvement. But on
discussion, it was determined that this was a workaround for a
regression. Update the changelog to make this clear.
In addition to jobs, there are other objects in Nomad that have a
specific format and can be provided to commands and API endpoints.
This commit creates a new menu section to hold the specification for
volumes and update the command pages to point to the new centralized
definition.
Redirecting the previous entries is not possible with `redirect.js`
because they are done server-side and URL fragments are not accessible
to detect a match. So we provide hidden anchors with a link to the new
page to guide users towards the new documentation.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
These flags were not being used because GNUmakefile overwrites them with
another value. We also don't want to set `-s -w` since they remove
information that is important for production debug.
In other projects this variable is used to override the default `-dev`
prerelease that is set even if `VersionPrerelease` is empty, but in
Nomad this check is never actually done because this conditional in
`version/version.go` is always false:
```go
func GetVersion() *VersionInfo {
// ...
rel := VersionPrerelease
// ...
if GitDescribe == "" && rel == "" && VersionPrerelease != "" {
rel = "dev"
}
// ...
}
```
This seems like some leftover from a previous release process, but I
decided the leave the code as is.
* website: fix redirects with fragments
Vercel redirects don't support fragments in relative destination paths,
so an absolute URL must be specified instead.
* website: fix Vercel redirect documentation link
If PATH comes first, an older version of Go is used that cannot install
dependencies that use features of newer versions of Go, which we just
installed.
This PR deprecates some functions in favor of generic alternatives.
The new functions are compatible only with Nomad v1.4+.
The old functions (nor their use) should not be removed until Nomad v1.6+.
If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.
The `csi_hook` on the client will unpublish if the allocation has
stopped and if the host is terminated there's no mount for the volume
anyways. So we'll now assume that the node has unpublished at its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.
We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.
Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.