Add a scatter-gather for multiregion job plans. Each region's servers
interpolate the plan locally in `Job.Plan` but don't distribute the plan as
done in `Job.Run`.
Note that it's not possible to return a usable modify index from a multiregion
plan for use with `-check-index`. Even if we were to force the modify index to
be the same at the start of `Job.Run` the index immediately drifts during each
region's deployments, depending on events local to each region. So we omit
this section of a multiregion plan.
The scheduler returns a very strange error if it detects a port number
out of range. If these would somehow make it to the client they would
overflow when converted to an int32 and could cause conflicts.
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.
The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.
There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.
If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.
Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.
Fixes#6083
If `max_parallel` is not set, all regions should begin in a `running` state
rather than a `pending` state. Otherwise the first region is set to `running`
and then all the remaining regions once it enters `blocked. That behavior is
technically correct in that we have at most `max_parallel` regions running,
but definitely not what a user expects.
In multiregion deployments when ACLs are enabled, the deploymentwatcher needs
an appropriately scoped ACL token with the same `submit-job` rights as the
user who submitted it. The token will already be replicated, so store the
accessor ID so that it can be retrieved by the leader.
The volumewatcher restores itself on notification, but detecting this is racy
because it may reap any claim (or find there are no claims to reap) and
shutdown before we can test whether it's running. This appears to have become
flaky with a new version of golang. The other cases in this test case
sufficiently exercise the start/stop behavior of the volumewatcher, so remove
the flaky section.
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.
This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect
* scheduler/reconcile: thread follupEvalIDs through to results.stop
* scheduler/reconcile: comment typo
* nomad/_test: correct arguments for plan.AppendStoppedAlloc
* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
* client/heartbeatstop: reversed time condition for startup grace
* scheduler/generic_sched: use `delayInstead` to avoid a loop
Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.
* scheduler/reconcile: lostLater are different than delayedReschedules
Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
Handle case where a snapshot is made before cluster metadata is created.
This fixes a bug where a server may have empty cluster metadata if it
created and installed a Raft snapshot before a new cluster metadata ID is
generated.
This case is very unlikely to arise. Most likely reason is when
upgrading from an old version slowly where servers may use snapshots
before all servers upgrade. This happened for a user with a log line
like:
```
2020-05-21T15:21:56.996Z [ERROR] nomad.fsm: ClusterSetMetadata failed: error=""set cluster metadata failed: refusing to set new cluster id, previous: , new: <<redacted>
```
* changes necessary to support oss licesning shims
revert nomad fmt changes
update test to work with enterprise changes
update tests to work with new ent enforcements
make check
update cas test to use scheduler algorithm
back out preemption changes
add comments
* remove unused method
Ensure that nomad steps down (and terminate leader goroutines) on
shutdown, when the server is the leader.
Without this change, `monitorLeadership` may handle `shutdownCh` event
and exit early before handling the raft `leaderCh` event and end up
leaking leadership goroutines.
If a token is scheduled for revocation expires before we revoke it,
ensure that it is marked as purged in raft and is only removed from
local vault state if the purge operation succeeds.
Prior to this change, we may remove the accessor from local state but
not purge it from Raft. This causes unnecessary and churn in the next
leadership elections (and until 0.11.2 result in indefinite retries).
Establishing leadership should be very fast and never make external API
calls.
This fixes a situation where there is a long backlog of Vault tokens to
be revoked on when leadership is gained. In such case, revoking the
tokens will significantly slow down leadership establishment and slow
down processing. Worse, the revocation call does not honor leadership
`stopCh` signals, so it will not stop when the leader loses leadership.
Following the new volumewatcher in #7794 and performance improvements
to it that landed afterwards, there's no particular reason we should
be threading claim releases through the GC eval rather than writing an
empty `CSIVolumeClaimRequest` with the mode set to
`CSIVolumeClaimRelease`, just as the GC evaluation would do.
Also, by batching up these raft messages, we can reduce the amount of
raft writes by 1 and cross-server RPCs by 1 per volume we release
claims on.
Some of the CSI RPC endpoints were missing validation that the ID or
the Volume definition was present. This could result in nonsense
`CSIVolume` structs being written to raft during registration. This
changeset corrects that bug and adds validation checks to present
nicer error messages to operators in some other cases.
Fixes#8000
When requesting a Service Identity token from Consul, use the TaskKind
of the Task to get at the service name associated with the task. In
the past using the TaskName worked because it was generated as a sidecar
task with a name that included the service. In the Native context, we
need to get at the service name in a more correct way, i.e. using the
TaskKind which is defined to include the service name.
Allow a `/v1/jobs?all_namespaces=true` to list all jobs across all
namespaces. The returned list is to contain a `Namespace` field
indicating the job namespace.
If ACL is enabled, the request token needs to be a management token or
have `namespace:list-jobs` capability on all existing namespaces.
The MVP for CSI in the 0.11.0 release of Nomad did not include support
for opaque volume parameters or volume context. This changeset adds
support for both.
This also moves args for ControllerValidateCapabilities into a struct.
The CSI plugin `ControllerValidateCapabilities` struct that we turn
into a CSI RPC is accumulating arguments, so moving it into a request
struct will reduce the churn of this internal API, make the plugin
code more readable, and make this method consistent with the other
plugin methods in that package.
This ensures that token revocation is idempotent and can handle when
tokens are revoked out of band.
Idempotency is important to handle some transient failures and retries.
Consider when a single token of a batch fails to be revoked, nomad would
retry revoking the entire batch; tokens already revoked should be
gracefully handled, otherwise, nomad may retry revoking the same
tokens forever.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.