Fix a test corruption issue, where a test accidentally unsets
the `NOMAD_LICENSE` environment variable, that's relied on by some
tests.
As a habit, tests should always restore the environment variable value
on test completion. Golang 1.17 introduced
[`t.Setenv`](https://pkg.go.dev/testing#T.Setenv) to address this issue.
However, as 1.0.x and 1.1.x branches target golang 1.15 and 1.16, I
opted to use a helper function to ease backports.
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
FailoverHeartbeatTTL is the amount of time to wait after a server leader failure
before considering reallocating client tasks. This TTL should be fairly long as
the new server leader needs to rebuild the entire heartbeat map for the
cluster. In deployments with a small number of machines, the default TTL (5m)
may be unnecessary long. Let's allow operators to configure this value in their
config files.
By default we should not expose the NOMAD_LICENSE environment variable
to tasks.
Also refactor where the DefaultEnvDenyList lives so we don't have to
maintain 2 copies of it. Since client/config is the most obvious
location, keep a reference there to its unfortunate home buried deep
in command/agent/host. Since the agent uses this list as well for the
/agent/host endpoint the list must be accessible from both command/agent
and client.
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.
In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.
The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
* don't timestamp active log file
* website: update log_file default value
* changelog: add entry for #11070
* website: add upgrade instructions for log_file in v1.14 and v1.2.0
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
Use glint to determine if os.Stdout is a terminal.
glint Terminal renderer expects os.Stdout [not only to be a terminal, but also to have non-zero size](b492b545f6/renderer_term.go (L39-L46)). It's unclear how this condition arises, but this additional check causes Nomad to render deployments progress through glint when glint cannot support it.
By using golint to perform the check, we eliminate the risk of mis-judgement.
Otherwise the spinner would just end, which felt a bit awkward.
I wanted to see a "✓" to know that everything was ok, and a "!" (maybe something else?) if something went wrong.
This PR will have Nomad de-register a sidecar proxy service before
attempting to de-register the parent service. Otherwise, Consul will
emit a warning and an error.
Fixes#10845
When a jobspec doesn't include a namespace, we provide it with the default
namespace, but this ends up overriding the explicit `-namespace` flag. This
changeset uses the same logic as region parsing to create an order of
precedence: the query string parameter (the `-namespace` flag) overrides the
API request body which overrides the jobspec.
This PR uses regex-based matching for sidecar proxy services and checks when syncing
with Consul. Previously we would check if the parent of the sidecar was still being
tracked in Nomad. This is a false invariant - one which we must not depend when we
make #10845 work.
Fixes#10843
The default agent configuration values were not set, which meant they were not
being set in the client configuration and this results in fingerprints failing
unless the values were set explicitly.
Alloc exec only works when task is passed as a flag and not an arg.
Alloc logs currently accepts either, but alloc signal and restart only
accept task as an arg. This adds -task as a flag to the other alloc
commands to make the cli UX consistent. If task is passed as a flag and
an arg, it ignores the arg.
This PR makes it so the Consul sync logic will ignore operations that
do not specify an action to take (i.e. [de-]register [services|checks]).
Ideally such noops would be discarded at the callsites (i.e. users
of [Create|Update|Remove]Workload], but we can also be defensive
at the commit point.
Also adds 2 trace logging statements which are helpful for diagnosing
sync operations with Consul - when they happen and why.
Fixes#10797
When the `-verbose` flag is passed to the `nomad volume status` command, we
hit a code path where the rows of text to be formatted were not initialized
correctly, resulting in a panic in the CLI.
This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy
would not result Consul applying the changes, unless an additional change to
the job would trigger a task replacement (thus replacing the service definition).
The fix is to check if upstreams have been modified between Nomad's view of the
sidecar service definition, and the service definition for the sidecar that is
actually registered in Consul.
Fixes#8754
This PR fixes some job submission plumbing to make sure the Consul Check parameters
- failure_before_critical
- success_before_passing
work with group-level services. They already work with task-level services.
System and batch jobs don't create deployments, which means nomad tries
to monitor a non-existent deployment when it runs a job and outputs an
error message. This adds a check to make sure a deployment exists before
monitoring. Also fixes some formatting.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
Adding '-verbose' will print out the allocation information for the
deployment. This also changes the job run command so that it now blocks
until deployment is complete and adds timestamps to the output so that
it's more in line with the output of node drain.
This uses glint to print in place in running in a tty. Because glint
doesn't yet support cmd/powershell, Windows workflows use a different
library to print in place, which results in slightly different
formatting: 1) different margins, and 2) no spinner indicating
deployment in progress.
(cherry-pick ent back to oss)
This PR moves a lot of Consul ACL token validation tests into ent files,
so that we can verify correct behavior difference between OSS and ENT
Nomad versions.
This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship
check when using Consul OSS (or Consul ENT without namespace support).
Nomad v1.1.0 introduced a regression where Nomad would fail the validation
when submitting Connect jobs and allow_unauthenticated set to true, with
Consul OSS - because it would do the namespace check against the Consul ACL
token assuming the "default" namespace, which does not work because Consul OSS
does not have namespaces.
Instead of making the bad assumption, expand the namespace check to handle
each special case explicitly.
Fixes#10718
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.
Previously Nomad would check for a 404 status code when makeing a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.
Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.
Fixes https://github.com/hashicorp/nomad-enterprise/issues/575
Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-egg situation between
bootstrapping the agent and setting up the consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is no
[currently] no way to pass the agent's client through the fingerprint factory.
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.
Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway
The following group level service block can be used to establish
a Connect mesh gateway.
service {
connect {
gateway {
mesh {
// no configuration
}
}
}
}
Services can make use of a mesh gateway by configuring so in their
upstream blocks, e.g.
service {
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "<service>"
local_bind_port = <port>
datacenter = "<datacenter>"
mesh_gateway {
mode = "<mode>"
}
}
}
}
}
}
Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.
client {
host_network "public" {
interface = "eth1"
}
}
Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.
network {
mode = "bridge"
port "mesh_wan" {
host_network = "public"
}
}
Use this port label for the service.port of the mesh gateway, e.g.
service {
name = "mesh-gateway"
port = "mesh_wan"
connect {
gateway {
mesh {}
}
}
}
Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.
Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.
Closes#9446
In this loop, we ought to close the websocket connection gracefully when
the StreamErrWrapper reaches EOF.
Previously, it's possible that that we drop the last few events or skip sending
the websocket closure. If `handler(handlerPipe)` returns and `cancel` is called,
before the loop here completes processing streaming events, the loop exits
prematurely without propagating the last few events.
Instead here, the loop continues until we hit `httpPipe` EOF (through
`decoder.Decode`), to ensure we process the events to completion.
When a wildcard namespace is used for `nomad job` commands that support prefix
matching, avoid asking the user for input if a prefix is an unambiguous exact
match so that the behavior is similar to the commands using a specific or
unset namespace.
The websocket interface used for `alloc exec` has to silently drop client send
errors because otherwise those errors would interleave with the streamed
output. But we may be able to surface errors that cause terminated websockets
a little better in the HTTP server logs.
The `Terminates At` field can't be removed from the struct for backwards
compatibility reasons, but there's no purpose to it anymore so we shouldn't be
showing it to end users of the command.
This PR adds e2e tests for Consul Namespaces for Nomad Enterprise
with Consul ACLs enabled.
Needed to add support for Consul ACL tokens with `namespace` and
`namespace_prefix` blocks, which Nomad parses and validates before
tossing the token. These bits will need to be picked back to OSS.
ParentID is an internal field that Nomad sets for dispatched or parameterized jobs. Job submitters should not be able to set it directly, as that messes up children tracking.
Fixes#10422 . It specifically stops the scheduler from honoring the ParentID. The reason failure and why the scheduler didn't schedule that job once it was created is very interesting and requires follow up with a more technical issue.
Add templating to `network-interface` option.
This PR also adds a fast-fail to in the case where an invalid interface is set or produced by the template
* add tests and check for valid interface
* Add documentation
* Incorporate suggestions from code review
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
(cherry-picked from ent without _ent things)
This is part 2/4 of e2e tests for Consul Namespaces. Took a
first pass at what the parameterized tests can look like, but
only on the ENT side for this PR. Will continue to refactor
in the next PRs.
Also fixes 2 bugs:
- Config Entries registered by Nomad Server on job registration
were not getting Namespace set
- Group level script checks were not getting Namespace set
Those changes will need to be copied back to Nomad OSS.
Nomad OSS + no ACLs (previously, needs refactor)
Nomad ENT + no ACLs (this)
Nomad OSS + ACLs (todo)
Nomad ENT + ALCs (todo)