If alloc exec fails to connect to the nomad client associated with the
alloc, fail over to using a server.
The code attempted to special case `net.Error` for failover to rule out
other permanent non-networking errors, by reusing a pattern in the
logging handling.
But this pattern does not apply here. `net/http.Http` wraps all errors
as `*url.Error` that is net.Error. The websocket doesn't, and instead
returns the raw error. If the raw error isn't a `net.Error`, like in
the case of TLS handshake errors, the api package would fail immediately
rather than failover.
This fixes a bug in the CLI handling of node lookup failures when
querying allocation and FS endpoints.
Allocation and FS endpoint are handled by the client; one can query the
relevant client directly, or query a server to have it forwarded
transparently to relevant client. Querying the client directly is
benefecial to avoid loading servers with IO.
As an optimization, the CLI attempts to query the client directly, but
then falls back to using server forwarding path if it encounters network
or connection errors (e.g. clients are locked down or in a separate
inaccessible network).
Here, we fix a bug where if the CLI fails to find to lookup the client
details because it lacks ACL capability or other unexpected reasons, the
CLI will not go through fallback path.
Adds nomad exec support in our API, by hitting the websocket endpoint.
We introduce API structs that correspond to the drivers streaming exec structs.
For creating the websocket connection, we reuse the transport setting from api
http client.
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.
Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
This adds a `nomad alloc stop` command that can be used to stop and
force migrate an allocation to a different node.
This is built on top of the AllocUpdateDesiredTransitionRequest and
explicitly limits the scope of access to that transition to expose it
under the alloc-lifecycle ACL.
The API returns the follow up eval that can be used as part of
monitoring in the CLI or parsed and used in an external tool.
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
Currently when operators need to log onto a machine where an alloc
is running they will need to perform both an alloc/job status
call and then a call to discover the node name from the node list.
This updates both the job status and alloc status output to include
the node name within the information to make operator use easier.
Closes#2359
Cloess #1180
Given that the values will rarely change, specially considering that any
changes would be backward incompatible change. As such, it's simpler to
keep syncing manually in the rare occasion and avoid the syncing code
overhead.
This PR enhances the API package by having client only RPCs route
through the server when they are low cost and for filesystem access to
first attempt a direct connection to the node and then falling back to
a server routed request.
Fixes#3013
It's a little weird that Client now has a method for returning a
NewClient, but it's a convenient way to dedupe the logic to
connect-directly-to-a-node which is nontrivial and had sutble
differences between locations.