This fixes a bug where a batch allocation fails to complete if it has
sidecars.
If the only remaining running tasks in an allocations are sidecars - we
must kill them and mark the allocation as complete.
Add a scatter-gather for multiregion job plans. Each region's servers
interpolate the plan locally in `Job.Plan` but don't distribute the plan as
done in `Job.Run`.
Note that it's not possible to return a usable modify index from a multiregion
plan for use with `-check-index`. Even if we were to force the modify index to
be the same at the start of `Job.Run` the index immediately drifts during each
region's deployments, depending on events local to each region. So we omit
this section of a multiregion plan.
The scheduler returns a very strange error if it detects a port number
out of range. If these would somehow make it to the client they would
overflow when converted to an int32 and could cause conflicts.
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.
The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.
There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.
If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.
Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.
Fixes#6083
If `max_parallel` is not set, all regions should begin in a `running` state
rather than a `pending` state. Otherwise the first region is set to `running`
and then all the remaining regions once it enters `blocked. That behavior is
technically correct in that we have at most `max_parallel` regions running,
but definitely not what a user expects.
In multiregion deployments when ACLs are enabled, the deploymentwatcher needs
an appropriately scoped ACL token with the same `submit-job` rights as the
user who submitted it. The token will already be replicated, so store the
accessor ID so that it can be retrieved by the leader.
The volumewatcher restores itself on notification, but detecting this is racy
because it may reap any claim (or find there are no claims to reap) and
shutdown before we can test whether it's running. This appears to have become
flaky with a new version of golang. The other cases in this test case
sufficiently exercise the start/stop behavior of the volumewatcher, so remove
the flaky section.
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.
This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect
* scheduler/reconcile: thread follupEvalIDs through to results.stop
* scheduler/reconcile: comment typo
* nomad/_test: correct arguments for plan.AppendStoppedAlloc
* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)