Log the failure error when the agent fails to start. Previously, the
agent startup failure error would be emitted to the command UI but not
logged. So it doesn't get emitted to syslog or `log_file` if they are
set, and it makes debugging much harder. Also, logging the error again
before exit makes the error more visible: previously, the operator
needed to scroll to the top to find the error.
On a sample failure, the output will look like:
```
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from sample-configs/config-bad
==> Starting Nomad agent...
==> Error starting agent: setting up server node ID failed: mkdir /path-without-permission: read-only file system
2021-10-20T14:38:51.179-0400 [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [ERROR] agent: error starting agent: error="setting up server node ID failed: mkdir /path-without-permission: read-only file system"
```
This change adds the final `ERROR` message. It's easy to miss the `==>
Error starting agent` above.
Fixes#2522
Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.
A bad agent configuration will look something like this:
```hcl
data_dir = "/tmp/nomad-badagent"
client {
enabled = true
chroot_env {
# Note that the source matches the data_dir
"/tmp/nomad-badagent" = "/ohno"
# ...
}
}
```
Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).
While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.
Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.
When running tests as root in a vm without the fix, the following error
occurs:
```
=== RUN TestAllocDir_SkipAllocDir
alloc_dir_test.go:520:
Error Trace: alloc_dir_test.go:520
Error: Received unexpected error:
Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
Test: TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```
Also removed unused Copy methods on AllocDir and TaskDir structs.
Thanks to @eveld for not letting me forget about this!
Fix a test corruption issue, where a test accidentally unsets
the `NOMAD_LICENSE` environment variable, that's relied on by some
tests.
As a habit, tests should always restore the environment variable value
on test completion. Golang 1.17 introduced
[`t.Setenv`](https://pkg.go.dev/testing#T.Setenv) to address this issue.
However, as 1.0.x and 1.1.x branches target golang 1.15 and 1.16, I
opted to use a helper function to ease backports.
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
FailoverHeartbeatTTL is the amount of time to wait after a server leader failure
before considering reallocating client tasks. This TTL should be fairly long as
the new server leader needs to rebuild the entire heartbeat map for the
cluster. In deployments with a small number of machines, the default TTL (5m)
may be unnecessary long. Let's allow operators to configure this value in their
config files.
By default we should not expose the NOMAD_LICENSE environment variable
to tasks.
Also refactor where the DefaultEnvDenyList lives so we don't have to
maintain 2 copies of it. Since client/config is the most obvious
location, keep a reference there to its unfortunate home buried deep
in command/agent/host. Since the agent uses this list as well for the
/agent/host endpoint the list must be accessible from both command/agent
and client.
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.
In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.
The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
* don't timestamp active log file
* website: update log_file default value
* changelog: add entry for #11070
* website: add upgrade instructions for log_file in v1.14 and v1.2.0
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
Use glint to determine if os.Stdout is a terminal.
glint Terminal renderer expects os.Stdout [not only to be a terminal, but also to have non-zero size](b492b545f6/renderer_term.go (L39-L46)). It's unclear how this condition arises, but this additional check causes Nomad to render deployments progress through glint when glint cannot support it.
By using golint to perform the check, we eliminate the risk of mis-judgement.
Otherwise the spinner would just end, which felt a bit awkward.
I wanted to see a "✓" to know that everything was ok, and a "!" (maybe something else?) if something went wrong.
This PR will have Nomad de-register a sidecar proxy service before
attempting to de-register the parent service. Otherwise, Consul will
emit a warning and an error.
Fixes#10845
When a jobspec doesn't include a namespace, we provide it with the default
namespace, but this ends up overriding the explicit `-namespace` flag. This
changeset uses the same logic as region parsing to create an order of
precedence: the query string parameter (the `-namespace` flag) overrides the
API request body which overrides the jobspec.
This PR uses regex-based matching for sidecar proxy services and checks when syncing
with Consul. Previously we would check if the parent of the sidecar was still being
tracked in Nomad. This is a false invariant - one which we must not depend when we
make #10845 work.
Fixes#10843
The default agent configuration values were not set, which meant they were not
being set in the client configuration and this results in fingerprints failing
unless the values were set explicitly.
Alloc exec only works when task is passed as a flag and not an arg.
Alloc logs currently accepts either, but alloc signal and restart only
accept task as an arg. This adds -task as a flag to the other alloc
commands to make the cli UX consistent. If task is passed as a flag and
an arg, it ignores the arg.
This PR makes it so the Consul sync logic will ignore operations that
do not specify an action to take (i.e. [de-]register [services|checks]).
Ideally such noops would be discarded at the callsites (i.e. users
of [Create|Update|Remove]Workload], but we can also be defensive
at the commit point.
Also adds 2 trace logging statements which are helpful for diagnosing
sync operations with Consul - when they happen and why.
Fixes#10797
When the `-verbose` flag is passed to the `nomad volume status` command, we
hit a code path where the rows of text to be formatted were not initialized
correctly, resulting in a panic in the CLI.
This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy
would not result Consul applying the changes, unless an additional change to
the job would trigger a task replacement (thus replacing the service definition).
The fix is to check if upstreams have been modified between Nomad's view of the
sidecar service definition, and the service definition for the sidecar that is
actually registered in Consul.
Fixes#8754
This PR fixes some job submission plumbing to make sure the Consul Check parameters
- failure_before_critical
- success_before_passing
work with group-level services. They already work with task-level services.
System and batch jobs don't create deployments, which means nomad tries
to monitor a non-existent deployment when it runs a job and outputs an
error message. This adds a check to make sure a deployment exists before
monitoring. Also fixes some formatting.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
Adding '-verbose' will print out the allocation information for the
deployment. This also changes the job run command so that it now blocks
until deployment is complete and adds timestamps to the output so that
it's more in line with the output of node drain.
This uses glint to print in place in running in a tty. Because glint
doesn't yet support cmd/powershell, Windows workflows use a different
library to print in place, which results in slightly different
formatting: 1) different margins, and 2) no spinner indicating
deployment in progress.
(cherry-pick ent back to oss)
This PR moves a lot of Consul ACL token validation tests into ent files,
so that we can verify correct behavior difference between OSS and ENT
Nomad versions.
This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship
check when using Consul OSS (or Consul ENT without namespace support).
Nomad v1.1.0 introduced a regression where Nomad would fail the validation
when submitting Connect jobs and allow_unauthenticated set to true, with
Consul OSS - because it would do the namespace check against the Consul ACL
token assuming the "default" namespace, which does not work because Consul OSS
does not have namespaces.
Instead of making the bad assumption, expand the namespace check to handle
each special case explicitly.
Fixes#10718
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.
Previously Nomad would check for a 404 status code when makeing a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.
Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.
Fixes https://github.com/hashicorp/nomad-enterprise/issues/575
Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-egg situation between
bootstrapping the agent and setting up the consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is no
[currently] no way to pass the agent's client through the fingerprint factory.
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.
Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway
The following group level service block can be used to establish
a Connect mesh gateway.
service {
connect {
gateway {
mesh {
// no configuration
}
}
}
}
Services can make use of a mesh gateway by configuring so in their
upstream blocks, e.g.
service {
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "<service>"
local_bind_port = <port>
datacenter = "<datacenter>"
mesh_gateway {
mode = "<mode>"
}
}
}
}
}
}
Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.
client {
host_network "public" {
interface = "eth1"
}
}
Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.
network {
mode = "bridge"
port "mesh_wan" {
host_network = "public"
}
}
Use this port label for the service.port of the mesh gateway, e.g.
service {
name = "mesh-gateway"
port = "mesh_wan"
connect {
gateway {
mesh {}
}
}
}
Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.
Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.
Closes#9446
In this loop, we ought to close the websocket connection gracefully when
the StreamErrWrapper reaches EOF.
Previously, it's possible that that we drop the last few events or skip sending
the websocket closure. If `handler(handlerPipe)` returns and `cancel` is called,
before the loop here completes processing streaming events, the loop exits
prematurely without propagating the last few events.
Instead here, the loop continues until we hit `httpPipe` EOF (through
`decoder.Decode`), to ensure we process the events to completion.
When a wildcard namespace is used for `nomad job` commands that support prefix
matching, avoid asking the user for input if a prefix is an unambiguous exact
match so that the behavior is similar to the commands using a specific or
unset namespace.
The websocket interface used for `alloc exec` has to silently drop client send
errors because otherwise those errors would interleave with the streamed
output. But we may be able to surface errors that cause terminated websockets
a little better in the HTTP server logs.
The `Terminates At` field can't be removed from the struct for backwards
compatibility reasons, but there's no purpose to it anymore so we shouldn't be
showing it to end users of the command.