Metrics state is local to the server and needs to use time, which is normally
forbidden in the FSM code. We have a bypass for this rule for
`metrics.MeasureSince` but needed one for `metrics.MeasureSinceWithLabels` as well.
In #14742 we introduced a cached lookup of the `nobody` user, which is only ever
called on Unixish machines. But the initial caching was being done in an `init`
block, which meant it was being run on Windows as well. This prevents the Nomad
agent from starting on Windows.
An alternative fix here would be to have a separate `init` block for Windows and
Unix, but this potentially masks incorrect behavior if we accidentally added a
call to the `Nobody()` method on Windows later. This way we're forced to handle
the error in the caller.
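As a rough illustration of the trade-off described above, here is a minimal sketch of a lookup that is deferred until first use and surfaces its error to the caller instead of failing inside `init`. The `cachedLookup` type, the `User` stand-in, and the injected `lookup` function are all hypothetical names chosen for this example; this is not the actual Nomad implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// User is a hypothetical stand-in for os/user.User with only the
// fields this sketch needs.
type User struct {
	Username string
	Uid      string
}

// cachedLookup (hypothetical) runs lookup at most once, on first use,
// and caches both the result and the error.
type cachedLookup struct {
	once   sync.Once
	lookup func() (*User, error)
	user   *User
	err    error
}

// Get returns the cached user. Because the error is returned rather
// than raised at package init, a platform where the lookup cannot
// succeed only fails if some caller actually asks for the user.
func (c *cachedLookup) Get() (*User, error) {
	c.once.Do(func() {
		c.user, c.err = c.lookup()
	})
	return c.user, c.err
}

func main() {
	calls := 0
	// Simulate a platform where the lookup always fails.
	nobody := &cachedLookup{lookup: func() (*User, error) {
		calls++
		return nil, errors.New("user: unknown user nobody")
	}}

	_, err := nobody.Get()
	_, _ = nobody.Get() // second call reuses the cached result
	fmt.Println("error:", err, "lookups performed:", calls)
}
```

With this shape, a stray call on an unsupported platform shows up as an error the caller has to handle, rather than being hidden behind a platform-specific `init`.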
The `hc-install` tool we're using needed a patch for a specific bug, but that's
since been merged. We definitely want to switch to using a standard release from
that project once one is shipped with the CLI, but pinning to HEAD should keep
us going for now.
This is probably undocumented for a reason, but the `enabled` toggle in the
`periodic` stanza is very useful so I figured I'd try adding it to the docs.
The feature has been secretly available since #9142 and was called out in that
PR as being a dubious addition, only added to avoid regressions.
The use case for disabling a periodic job in this way is to prevent it from
running without modifying the schedule. Ideally Nomad would make it more clear
that this was the case, and allow you to force a run of the job, but even with
those rough edges I think users would benefit from knowing about this toggle.
This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website because
the material is not required for operators. It also includes a lot of diagrams
that we can cheaply maintain with mermaid syntax. On the main site they would
need art assets, which would quickly go out of date as the code changes and be
extremely expensive to maintain. However, these should be suitable to use
as points of conversation with expert end users.
Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
broken into digestible chunks and without multi-region deployments, which I'd
like to cover in a future doc.
Also includes adding Deployments to our public-facing glossary.
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
Previously, the splay timeout was only applied if a template re-render
caused a restart or a signal action. The `change_mode = "script"` handling
ran after the `if restart || len(signals) != 0` check rather than inside it,
so scripts were always invoked without waiting for the splay.
This change refactors the logic so it's easier to notice that new
`change_mode` options should start only after `splay` is applied.
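The intended ordering can be sketched roughly as follows. This is a simplified illustration, not the actual template-runner code: the `changeActions` struct, the `applyChanges` function, and the `recordingSleeper` helper are hypothetical names, the splay is treated as a fixed delay, and the order in which scripts, signals, and restarts run after the delay is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// changeActions (hypothetical) collects everything a template
// re-render asked for, one field per change_mode.
type changeActions struct {
	restart bool
	signals []string
	scripts []string
}

// empty reports whether the re-render requires no action at all.
func (a changeActions) empty() bool {
	return !a.restart && len(a.signals) == 0 && len(a.scripts) == 0
}

// applyChanges waits for the splay once and only then runs actions.
// Every change_mode goes through the same gate, so a newly added mode
// cannot accidentally skip the delay.
func applyChanges(a changeActions, splay time.Duration, sleep func(time.Duration)) []string {
	if a.empty() {
		return nil
	}
	if splay > 0 {
		sleep(splay)
	}

	var performed []string
	for _, s := range a.scripts {
		performed = append(performed, "script:"+s)
	}
	for _, s := range a.signals {
		performed = append(performed, "signal:"+s)
	}
	if a.restart {
		performed = append(performed, "restart")
	}
	return performed
}

// recordingSleeper (hypothetical) records requested delays instead of
// actually sleeping, which keeps the example fast.
type recordingSleeper struct {
	slept []time.Duration
}

func (r *recordingSleeper) Sleep(d time.Duration) {
	r.slept = append(r.slept, d)
}

func main() {
	clock := &recordingSleeper{}
	actions := changeActions{scripts: []string{"reload.sh"}}
	performed := applyChanges(actions, 5*time.Second, clock.Sleep)
	fmt.Println("delays:", clock.slept, "actions:", performed)
}
```

Passing the sleep function in as a parameter is a choice made for this sketch so that the delay can be observed without waiting.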
This reverts PR #12416 and commit 6668ce022ac561f75ad113cc838b1fb786f11f79.
While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
* When we isolated the variable form path within its own component, we lost the model-level checks for related entities at type-time
* Be a little more functionally pure
* Use Ember.set to appease mirage
* client: protect user lookups with global lock
This PR updates the Nomad client to always do user lookups while holding
a global process lock. This guards against concurrency-unsafe implementations
of NSS while still enabling NSS lookups of users (i.e. we cannot use osusergo).
* cl: add cl
* cleanup: fixup linter warnings in scheduler/feasible.go
* core: numeric operands comparisons in constraints
This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.
Inspiration #4856
Closes #4729
Closes #14719
* fix: always parse as int64
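
The numeric constraint comparison described above could look roughly like the sketch below. The `compareOperands` function and its helpers are hypothetical names, and the exact fallback order (try `int64` for both operands, then `float64` for both, then lexical) is an assumption for illustration rather than a copy of the scheduler code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareInts returns -1, 0, or 1 for a < b, a == b, a > b.
func compareInts(a, b int64) int {
	switch {
	case a < b:
		return -1
	case a > b:
		return 1
	}
	return 0
}

// compareFloats returns -1, 0, or 1 for a < b, a == b, a > b.
func compareFloats(a, b float64) int {
	switch {
	case a < b:
		return -1
	case a > b:
		return 1
	}
	return 0
}

// compareOperands (hypothetical) compares numerically when both
// operands parse as numbers and lexically otherwise.
func compareOperands(l, r string) int {
	// Integers are always parsed as int64, so the result does not
	// depend on the platform's native int size.
	if li, err := strconv.ParseInt(l, 10, 64); err == nil {
		if ri, err := strconv.ParseInt(r, 10, 64); err == nil {
			return compareInts(li, ri)
		}
	}
	// Fall back to floats; this also covers a mixed pair such as
	// "10" and "9.5".
	if lf, err := strconv.ParseFloat(l, 64); err == nil {
		if rf, err := strconv.ParseFloat(r, 64); err == nil {
			return compareFloats(lf, rf)
		}
	}
	// At least one operand is not numeric: keep the lexical ordering.
	return strings.Compare(l, r)
}

func main() {
	fmt.Println(compareOperands("9", "10"))      // numeric comparison
	fmt.Println(strings.Compare("9", "10"))      // lexical comparison, for contrast
	fmt.Println(compareOperands("v1.2", "v1.10")) // not numeric, stays lexical
}
```

The contrast in `main` is the motivating case: as strings, `"9"` sorts after `"10"`, while as integers it sorts before.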
A test flake revealed a bug in the CSI unpublish workflow, where an unpublish
that comes from a client that's successfully done the node-unpublish step will
not have the claim checkpointed if the controller-unpublish step fails. This
will result in a delay in releasing the volume claim until the next GC.
This changeset also ensures we're using a new snapshot after each write to raft,
and fixes two timing issues in the test, where either the volume watcher can
unpublish before the unpublish RPC is sent or we don't wait long enough in
resource-restricted environments like GHA.