The call to render the output diff swapped the `diff` and `verbose` bool
parameters, which dropped the diff output in multi-region plans but not
single-region plans.
* ar: support opting into binding host ports to default network IP
* fix config plumbing
* plumb node address into network resource
* struct: only handle network resource upgrade path once
* made api.Scaling.Max a pointer, so we can detect (and complain) when it is neglected
* added checks to HCL parsing that it is present (see the sketch after this list)
* when Scaling.Max is absent/invalid, don't return extraneous error messages during validation
* tweak to multiregion handling to ensure that the count is valid on the interpolated regional jobs
resolves #8355
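A minimal sketch of the group-level `scaling` stanza this affects; the group
name and values here are hypothetical, not taken from the change:

```hcl
group "cache" {
  scaling {
    enabled = true
    min     = 1
    max     = 10   # now required; omitting it fails HCL validation
  }
}
```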
* command/agent/host: collect host data, multi platform
* nomad/structs/structs: new HostDataRequest/Response
* client/agent_endpoint: add RPC endpoint
* command/agent/agent_endpoint: add Host
* api/agent: add the Host endpoint
* nomad/client_agent_endpoint: add Agent Host with forwarding
* nomad/client_agent_endpoint: use findClientConn
This changes forwardMonitorClient and forwardProfileClient to use
findClientConn, which was cribbed from the common parts of those
funcs.
* command/debug: call agent hosts
* command/agent/host: eliminate calling external programs
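As a usage sketch, the new host data should be reachable through the agent
Host endpoint added above (e.g. something like
`curl "$NOMAD_ADDR/v1/agent/host"`), and the debug command gathers it into its
archive alongside the other captures.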
The `nomad volume deregister` command currently returns an error if the volume
has any claims, but in cases where the claims can't be dropped because of
plugin errors, providing a `-force` flag gives the operator an escape hatch.
If the volume has no allocations or if they are all terminal, this flag
deletes the volume from the state store, immediately and implicitly dropping
all claims without further CSI RPCs. Note that this will not also
unmount/detach the volume, which we'll make the responsibility of a separate
`nomad volume detach` command.
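As a usage sketch with a hypothetical volume ID,
`nomad volume deregister -force csi-vol0` would drop the volume and its
remaining claims from the state store even when the plugin can't release them,
provided any allocations that used the volume are terminal.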
* command/debug: build a local archive of debug data
* command/debug: query consul and vault directly
* command/debug: include pprof CPUProfile Trace and goroutine
* command/debug: trap signals and close the monitor requests
Having an active check in the sample job causes issues with testing
deployments in environments that are not integrated with Consul. This
negatively impacts some of the getting-started experiences. Commenting
out the check allows deployments to proceed successfully but leaves it
in the sample job for convenience.
Also made a drive-by fix to all of the URLs in the job file.
Add a scatter-gather for multiregion job plans. Each region's servers
interpolate the plan locally in `Job.Plan` but don't distribute the plan as
done in `Job.Run`.
Note that it's not possible to return a usable modify index from a multiregion
plan for use with `-check-index`. Even if we were to force the modify index to
be the same at the start of `Job.Run`, the index immediately drifts during each
region's deployments, depending on events local to each region. So we omit
this section of a multiregion plan.
Adds a `-global` flag for stopping multiregion jobs in all regions at
once. Warn the user if they attempt to stop a multiregion job in a single
region.
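For example, `nomad job stop -global example` (job name hypothetical) stops the
job in every region at once, while omitting `-global` stops only the local
region's copy and triggers the warning above.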
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.
The `connect` stanza now includes a `native` parameter, which can be
set to the name of the task that backs the Connect Native Consul service.
There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated`, the default value is
true, but it is recommended to disable it in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.
If Consul ACLs are enabled, Service Identity Tokens are automatically
injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.
Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.
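A rough sketch of the shapes described above (service name, task name, and
values are hypothetical, not taken from this change):

```hcl
# Jobspec: the connect stanza names the task that backs the native service.
service {
  name = "cn-demo"

  connect {
    native = "generate"
  }
}
```

and on the Nomad Client agent:

```hcl
# Agent configuration: share_ssl defaults to true; disabling it in
# production is the recommendation above.
consul {
  share_ssl = true
}
```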
Fixes #6083
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
This PR switches the Nomad repository from using govendor to Go modules
for managing dependencies. Aspects of the Nomad workflow remain pretty
much the same. The usual Makefile targets should continue to work as
they always did. The API submodule simply defers to the parent Nomad
version in the repository, keeping the semantics of API versioning that
currently exist.
* changes necessary to support oss licensing shims
revert nomad fmt changes
update test to work with enterprise changes
update tests to work with new ent enforcements
make check
update cas test to use scheduler algorithm
back out preemption changes
add comments
* remove unused method
We have been using fatih/hclfmt, which has long been abandoned. Instead, switch
to HashiCorp's own hclfmt implementation. There are some trivial changes in
behavior around whitespace.
Allow `/v1/jobs?all_namespaces=true` to list all jobs across all
namespaces. Each job in the returned list contains a `Namespace` field
indicating the job's namespace.
If ACL is enabled, the request token needs to be a management token or
have `namespace:list-jobs` capability on all existing namespaces.
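As a usage sketch, something like
`curl "$NOMAD_ADDR/v1/jobs?all_namespaces=true"` should return the combined
list, with each job entry carrying its `Namespace`.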
* jobspec, api: add stop_after_client_disconnect (see the jobspec sketch after this list)
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was set up to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, as not all lost allocs
have been marked stop or evict by this point.
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
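A rough jobspec sketch of the group-level setting referenced at the top of
this list (group name and duration are hypothetical):

```hcl
group "web" {
  # If the client running this group's allocs has been disconnected for
  # longer than this duration, stop the allocs rather than leaving them
  # running indefinitely.
  stop_after_client_disconnect = "6h"
}
```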
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.
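A hypothetical volume registration file showing where such secrets would be
configured (all IDs and values are made up):

```hcl
id          = "csi-vol0"
name        = "database"
type        = "csi"
plugin_id   = "example-plugin"
external_id = "vol-0123abcd"

secrets {
  # Opaque key/value pairs passed through to the plugin's publish and
  # unpublish RPCs.
  username = "admin"
  password = "supersecret"
}
```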
This changeset implements a periodic garbage collection of CSI volumes
with missing allocations. This can happen in a scenario where a node
update fails partially and the allocation updates are written to raft
but the evaluations to GC the volumes are dropped. This feature will
cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1
get any stray claims cleaned up.
Ensure that an empty (`""`) Scheduler Algorithm gets explicitly set to binpack
on upgrades or in API handling when the user omits the value.
The scheduler already treats the `""` value as binpack. This PR merely
ensures that the operator API returns the effective value.
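For example, reading the scheduler configuration through the operator API
should now report `binpack` even when the value was never set explicitly,
rather than an empty string.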
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
Failed requests due to API client errors are to be marked as DEBUG.
The Error log level should be reserved for signaling problems with the
cluster that are actionable by nomad system operators. Logs due to
misbehaving API clients don't represent a system-level problem and seem
spurious to nomad maintainers at best. These log messages can also be
attack vectors for denial-of-service attacks by filling servers' disk
space with spurious log messages.
Pipe the http server log to hclog, so that it uses the same logging format
as the rest of nomad logs. Also, support emitting them as json logs when
json formatting is set.
The http server logs are emitted at Trace level, as they typically
represent HTTP client errors (e.g. failed tls handshakes, invalid headers,
etc).
Panic logs, though, represent server errors and are relayed at Error
level.
Shut down the http server last, after nomad client/server components
terminate.
Before this change, if the agent is taking an unexpectedly long time to
shut down, the operator cannot query the http server directly: they
cannot access agent-specific http endpoints and need to query another
agent about the troublesome agent.
Unexpectedly long shutdowns can happen in normal cases, e.g. a client
might hang if one of the allocs it is running has a long
shutdown_delay.
Here, we switch to ensuring that the http server is shut down last.
I believe this doesn't require extra care in the agent shutdown logic,
even though operators may be able to submit write http requests. We already
need to cope with operators submitting these http requests to another
agent, or with servers updating the client allocations.