If a core job fails more than the delivery limit, the leader will create a new
eval with the TriggeredBy field set to `failed-follow-up`.
Evaluations for core jobs have the leader's ACL, which is not valid on another
leader after an election. The `failed-follow-up` evals do not have ACLs, so
core job evals that fail more than the delivery limit or core job evals that
span leader elections will never succeed and will be re-enqueued forever. So
we should not retry with a `failed-follow-up`.
The soundness guarantees of the CSI specification leave a little to be desired
in our ability to provide a 100% reliable automated solution for managing
volumes. This changeset provides a new command to bridge this gap by providing
the operator the ability to intervene.
The command doesn't take an allocation ID so that the operator doesn't have to
keep track of alloc IDs that may have been GC'd. Handle this case in the
unpublish RPC by sending the client RPC for all the terminal/nil allocs on the
selected node.
The CSI client RPC uses error wrapping to detect the type of error bubbling up
from plugins, but if the errors we get aren't wrapped at each layer, we can't
unwrap the inner error.
Also eliminates some unused args.
This change adds the ability to set the fields `success_before_passing` and
`failures_before_critical` on Consul service check definitions. This is a
feature added to Consul v1.7.0 and later.
https://www.consul.io/docs/agent/checks#success-failures-before-passing-critical
Nomad doesn't do much besides pass the fields through to Consul.
Fixes#6913
When deregistering a client, CSI plugins running on that client may not get a
chance to fingerprint before being stopped. Account for the case where a
plugin allocation is the last instance of the plugin and has been deleted from
the state store to avoid errors during node deregistration.
When the client-side actions of a CSI client RPC succeed but we get
disconnected during the RPC or we fail to checkpoint the claim state, we want
to be able to retry the client RPC without getting blocked by the client-side
state (ex. mount points) already having been cleaned up in previous calls.
Using the count of node claims from earlier in the `CSIVolume.Unpublish RPC
doesn't correctly account for cases where the RPC was interrupted but
checkpointed. Instead, we'll check the current allocation count and status to
determine whether we need to send a controller unpublish.
Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This
may forward client RPCs to the node plugins or to the controller plugins,
depending on whether other allocations on this node have claims on this
volume.
By making clients responsible for running the `CSIVolume.Unpublish` RPC (and
making the RPC available to a `nomad volume detach` command), the
volumewatcher becomes only used by the core GC job and we no longer need
async volume GC from job deregister and node update.
This changeset updates `nomad/volumewatcher` to take advantage of the
`CSIVolume.Unpublish` RPC. This lets us eliminate a bunch of code and
associated tests. The raft batching code can be safely dropped, as the
characteristic times of the CSI RPCs are on the order of seconds or even
minutes, so batching up raft RPCs added complexity without any real world
performance wins.
Includes refactor w/ test cleanup and dead code elimination in volumewatcher
The documentation encourages operators to run multiple controller plugin
instances for HA, but the client RPCs don't take advantage of this by retrying
when the RPC fails in cases when the plugin is unavailable (because the node
has drained or the alloc has failed but we haven't received an updated
fingerprint yet).
This changeset tries all known controllers on ready nodes before giving up,
and adds tests that exercise the client RPC routing and retries.
This log line should be rare since:
1. Most tokens should be logged synchronously, not via this async
batched method. Async revocation only takes place when Vault
connectivity is lost and after leader election so no revocations are
missed.
2. There should rarely be >1 batch (1,000) tokens to revoke since the
above conditions should be brief and infrequent.
3. Interval is 5 minutes, so this log line will be emitted at *most*
once every 5 minutes.
What makes this log line rare is also what makes it interesting: due to
a bug prior to Nomad 0.11.2 some tokens may never get revoked. Therefore
Nomad tries to re-revoke them on every leader election. This caused a
massive buildup of old tokens that would never be properly revoked and
purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging
revoked tokens via Raft (fixed in #8553).
The nomad.vault.distributed_tokens_revoked metric is only ticked upon
successful revocation and purging, making any bugs or slowness in the
process difficult to detect.
Logging before a potentially slow revocation+purge operation is
performed will give users much better indications of what activity is
going on should the process fail to make it to the metric.
Fixes https://github.com/hashicorp/nomad/issues/8544
This PR fixes a bug where using `nomad job plan ...` always report no change if the submitted job contain scaling.
The issue has three contributing factors:
1. The plan endpoint doesn't populate the required scaling policy ID; unlike the job register endpoint
2. The plan endpoint suppresses errors on job insertion - the job insertion fails here, because the scaling policy is missing the required ID
3. The scheduler reports no update necessary when the relevant job isn't in store (because the insertion failed)
This PR fixes the first two factors. Changing the scheduler to be more strict might make sense, but may violate some idempotency invariant or make the scheduler more brittle.
Before, Connect Native Tasks needed one of these to work:
- To be run in host networking mode
- To have the Consul agent configured to listen to a unix socket
- To have the Consul agent configured to listen to a public interface
None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.
Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.
Fixes#8290
As of 0.11.3 Vault token revocation and purging was done in batches.
However the batch size was only limited by the number of *non-expired*
tokens being revoked.
Due to bugs prior to 0.11.3, *expired* tokens were not properly purged.
Long-lived clusters could have thousands to *millions* of very old
expired tokens that never got purged from the state store.
Since these expired tokens did not count against the batch limit, very
large batches could be created and overwhelm servers.
This commit ensures expired tokens count toward the batch limit with
this one line change:
```
- if len(revoking) >= toRevoke {
+ if len(revoking)+len(ttlExpired) >= toRevoke {
```
However, this code was difficult to test due to being in a periodically
executing loop. Most of the changes are to make this one line change
testable and test it.
adds in oss components to support enterprise multi-vault namespace feature
upgrade specific doc on vault multi-namespaces
vault docs
update test to reflect new error
The field name `Deployment.TaskGroups` contains a map of `DeploymentState`,
which makes it a little harder to follow state updates when combined with
inconsistent naming conventions, particularly when we also have the state
store or actual `TaskGroup`s in scope. This changeset changes all uses to
`dstate` so as not to be confused with actual TaskGroups.
* nomad/structs/structs: add to Job.Validate
* Update nomad/structs/structs.go
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
* nomad/structs/structs: match error strings to the config file
* nomad/structs/structs_test: clarify the test a bit
* nomad/structs/structs_test: typo in the test error comparison
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
This fixes a bug where jobs may get "stuck" unprocessed that
dispropotionately affect periodic jobs around leadership transitions.
When registering a job, the job registration and the eval to process it
get applied to raft as two separate transactions; if the job
registration succeeds but eval application fails, the job may remain
unprocessed. Operators may detect such failure, when submitting a job
update and get a 500 error code, and they could retry; periodic jobs
failures are more likely to go unnoticed, and no further periodic
invocations will be processed until an operator force evaluation.
This fixes the issue by ensuring that the job registration and eval
application get persisted and processed atomically in the same raft log
entry.
Also, applies the same change to ensure atomicity in job deregistration.
Backward Compatibility
We must maintain compatibility in two scenarios: mixed clusters where a
leader can handle atomic updates but followers cannot, and a recent
cluster processes old log entries from legacy or mixed cluster mode.
To handle this constraints: ensure that the leader continue to emit the
Evaluation log entry until all servers have upgraded; also, when
processing raft logs, the servers honor evaluations found in both spots,
the Eval in job (de-)registration and the eval update entries.
When an updated server sees mix-mode behavior where an eval is inserted
into the raft log twice, it ignores the second instance.
I made one compromise in consistency in the mixed-mode scenario: servers
may disagree on the eval.CreateIndex value: the leader and updated
servers will report the job registration index while old servers will
report the index of the eval update log entry. This discripency doesn't
seem to be material - it's the eval.JobModifyIndex that matters.
Deployments should wait until kicked off by `Job.Register` so that we can
assert that all regions have a scheduled deployment before starting any
region. This changeset includes the OSS fixes to support the ENT work.
`IsMultiregionStarter` has no more callers in OSS, so remove it here.
It's supposed to be possible for a region not to have `datacenters` set so
that it can use the job's `datacenters` field. This requires that operators
use the same DC name across multiple regions, but that's the default client
configuration.