The goal is to always find an interface with an address, preferring
sandbox interfaces, but falling back to the first address found.
A test was added against a known CNI plugin output that was not handled
correctly before.
Registration of Nomad volumes previously allowed for a single volume
capability (access mode + attachment mode pair). The recent `volume create`
command requires that we pass a list of requested capabilities, but the
existing workflow for claiming volumes and attaching them on the client
assumed that the volume's single capability was correct and unchanging.
Add `AccessMode` and `AttachmentMode` to `CSIVolumeClaim`, use these fields to
set the initial claim value, and add backwards compatibility logic to handle
the existing volumes that already have claims without these fields.
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
In order to support new node RPCs, we need to fingerprint plugin capabilities
in more detail. This changeset mirrors recent work to fingerprint controller
capabilities, but is not yet in use by any Nomad RPC.
In order to support new controller RPCs, we need to fingerprint volume
capabilities in more detail and perform controller RPCs only when the specific
capability is present. This fixes a bug in Ceph support where the plugin can
only suport create/delete but we assume that it also supports attach/detach.
This commit includes a new test client that allows overriding the RPC
protocols. Only the RPCs that are passed in are registered, which lets you
implement a mock RPC in the server tests. This commit includes an example of
this for the ClientCSI RPC server.
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease
drivers implementation and adoption of the features: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.
Drivers that use NomadResources will need to updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
Allow for readiness type checks by configuring nomad to ignore warnings
or errors reported by a service check. This allows the deployment to
progress and while Consul handles introducing the sercive into a
resource pool once the check passes.
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.
https://www.consul.io/docs/connect/gateways/terminating-gateway
These gateways are declared as part of a task group level service
definition within the connect stanza.
service {
connect {
gateway {
proxy {
// envoy proxy configuration
}
terminating {
// terminating-gateway configuration entry
}
}
}
}
Currently Envoy is the only supported gateway implementation in
Consul. The gateay task can be customized by configuring the
connect.sidecar_task block.
When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.
Gateways require Consul 1.8.0+, checked by a node constraint.
Closes#9445
Most allocation hooks don't need to know when a single task within the
allocation is restarted. The check watcher for group services triggers the
alloc runner to restart all tasks, but the alloc runner's `Restart` method
doesn't trigger any of the alloc hooks, including the group service hook. The
result is that after the first time a check triggers a restart, we'll never
restart the tasks of an allocation again.
This commit adds a `RunnerTaskRestartHook` interface so that alloc runner
hooks can act if a task within the alloc is restarted. The only implementation
is in the group service hook, which will force a re-registration of the
alloc's services and fix check restarts.
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.
Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.
Fixes#9834
* Throw away result of multierror.Append
When given a *multierror.Error, it is mutated, therefore the return
value is not needed.
* Simplify MergeMultierrorWarnings, use StringBuilder
* Hash.Write() never returns an error
* Remove error that was always nil
* Remove error from Resources.Add signature
When this was originally written it could return an error, but that was
refactored away, and callers of it as of today never handle the error.
* Throw away results of io.Copy during Bridge
* Handle errors when computing node class in test
In 492d62d we prevented poststop tasks from contributing to allocation health
status, which fixed a bug where poststop tasks would prevent a deployment from
ever being marked successful. The patch introduced a regression where prestart
tasks that complete are causing the allocation to be marked unhealthy. This
changeset restores the previous behavior for prestart tasks.
* investigating where to ignore poststop task in alloc health tracker
* ignore poststop when setting latest start time for allocation
* clean up logic
* lifecycle: isolate mocks for poststop deployment test
* lifecycle: update comments in tracker
Co-authored-by: Jasmine Dahilig <jasmine@dahilig.com>
When a client restarts, the network_hook's prerun will call
`CreateNetwork`. Drivers that don't implement their own network manager will
fall back to the default network manager, which doesn't handle the case where
the network namespace is being recreated safely. This results in an error and
the task being restarted for `exec` tasks with `network` blocks (this also
impacts the community `containerd` and probably other community task drivers).
If we get an error when attempting to create the namespace and that error is
because the file already exists and is locked by its process, then we'll
return a `nil` error with the `created` flag set to false, just as we do with
the `docker` driver.
When upgrading from Nomad v0.12.x to v1.0.x, Nomad client will panic on
startup if the node is running Connect enabled jobs. This is caused by
a missing piece of plumbing of the Consul Proxies API interface during the
client restore process.
Fixes#9738
This PR deflakes TestTaskRunner_StatsHook_Periodic tests and adds backoff when the driver closes the channel.
TestTaskRunner_StatsHook_Periodic is currently the most flaky test - failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .
This change has two components:
First, it updates the StatsHook so that it backs off when stats channel is closed. In the context of the test where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands update during the period. In real context, if a driver doesn't implement the stats handler properly or when a task finishes, we may generate way too many Stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .
Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting into a spurious update.
Previously, Nomad would optimize out the services task runner
hook for tasks which were initially submitted with no services
defined. This causes a problem when the job is later updated to
include service(s) on that task, which will result in nothing
happening because the hook is not present to handle the service
registration in the .Update.
Instead, always enable the services hook. The group services
alloc runner hook is already always enabled.
Fixes#9707
The client allocation GC API returns a misleading error message when the
allocation exists but is not yet eligible for GC. Make this clear in the error
response.
Note in the docs that the allocation will still show on the server responses.
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.
This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template is changed, as it will be only with
dynamic secrets.
When we iterate over the interfaces returned from CNI setup, we filter for one
with the `Sandbox` field set. Ensure that if none of the interfaces has that
field set that we still return an available interface.
CNI network configuration is currently only supported on Linux.
For now, add the linux build tag so that the deadcode linter does
not trip over unused CNI stuff on macOS.
Nomad v1.0.0 introduced a regression where the client configurations
for `connect.sidecar_image` and `connect.gateway_image` would be
ignored despite being set. This PR restores that functionality.
There was a missing layer of interpolation that needs to occur for
these parameters. Since Nomad 1.0 now supports dynamic envoy versioning
through the ${NOMAD_envoy_version} psuedo variable, we basically need
to first interpolate
${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version}
then use Consul at runtime to resolve to a real image, e.g.
envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0
Of course, if the version of Consul is too old to provide an envoy
version preference, we then need to know to fallback to the old
version of envoy that we used before.
envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09
Beyond that, we also need to continue to support jobs that set the
sidecar task themselves, e.g.
sidecar_task { config { image: "custom/envoy" } }
which itself could include teh pseudo envoy version variable.
Previously, Nomad would fail to startup if the CPU fingerprinter could
not detect the cpu total compute (i.e. cores * mhz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.
Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the useless cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting the setting of
cpu_total_compute in client config.
Fixes#7989
This PR enables job submitters to use interpolation in the connect
block of jobs making use of consul connect. Before, only the name of
the connect service would be interpolated, and only for a few select
identifiers related to the job itself (#6853). Now, all connect fields
can be interpolated using the full spectrum of runtime parameters.
Note that the service name is interpolated at job-submission time,
and cannot make use of values known only at runtime.
Fixes#7221
Previously, every Envoy Connect sidecar would spawn as many worker
threads as logical CPU cores. That is Envoy's default behavior when
`--concurrency` is not explicitly set. Nomad now sets the concurrency
flag to 1, which is sensible for the default cpu = 250 Mhz resources
allocated for sidecar proxies. The concurrency value can be configured
in Client configuration by setting `meta.connect.proxy_concurrency`.
Closes#9341
* upsertaclpolicies
* delete acl policies msgtype
* upsert acl policies msgtype
* delete acl tokens msgtype
* acl bootstrap msgtype
wip unsubscribe on token delete
test that subscriptions are closed after an ACL token has been deleted
Start writing policyupdated test
* update test to use before/after policy
* add SubscribeWithACLCheck to run acl checks on subscribe
* update rpc endpoint to use broker acl check
* Add and use subscriptions.closeSubscriptionFunc
This fixes the issue of not being able to defer unlocking the mutex on
the event broker in the for loop.
handle acl policy updates
* rpc endpoint test for terminating acl change
* add comments
Co-authored-by: Kris Hicks <khicks@hashicorp.com>
Always wait 200ms before calling the Node.UpdateAlloc RPC to send
allocation updates to servers.
Prior to this change we only reset the update ticker when an error was
encountered. This meant the 200ms ticker was running while the RPC was
being performed. If the RPC was slow due to network latency or server
load and took >=200ms, the ticker would tick during the RPC.
Then on the next loop only the select would randomly choose between the
two viable cases: receive an update or fire the RPC again.
If the RPC case won it would immediately loop again due to there being
no updates to send.
When the update chan receive is selected a single update is added to the
slice. The odds are then 50/50 that the subsequent loop will send the
single update instead of receiving any more updates.
This could cause a couple of problems:
1. Since only a small number of updates are sent, the chan buffer may
fill, applying backpressure, and slowing down other client
operations.
2. The small number of updates sent may already be stale and not
represent the current state of the allocation locally.
A risk here is that it's hard to reason about how this will interact
with the 50ms batches on servers when the servers under load.
A further improvement would be to completely remove the alloc update
chan and instead use a mutex to build a map of alloc updates. I wanted
to test the lowest risk possible change on loaded servers first before
making more drastic changes.
While Nomad v0.12.8 fixed `NOMAD_{ALLOC,TASK,SECRETS}_DIR` use in
`template.destination`, interpolating these variables in
`template.source` caused a path escape error.
**Why not apply the destination fix to source?**
The destination fix forces destination to always be relative to the task
directory. This makes sense for the destination as a destination outside
the task directory would be unreachable by the task. There's no reason
to ever render a template outside the task directory. (Using `..` does
allow destinations to escape the task directory if
`template.disable_file_sandbox = true`. That's just awkward and unsafe
enough I hope no one uses it.)
There is a reason to source a template outside a task
directory. At least if there weren't then I can't think of why we
implemented `template.disable_file_sandbox`. So v0.12.8 left the
behavior of `template.source` the more straightforward "Interpolate and
validate."
However, since outside of `raw_exec` every other driver uses absolute
paths for `NOMAD_*_DIR` interpolation, this means those variables are
unusable unless `disable_file_sandbox` is set.
**The Fix**
The variables are now interpolated as relative paths *only for the
purpose of rendering templates.* This is an unfortunate special case,
but reflects the fact that the templates view of the filesystem is
completely different (unconstrainted) vs the task's view (chrooted).
Arguably the values of these variables *should be context-specific.*
I think it's more reasonable to think of the "hack" as templating
running uncontainerized than that giving templates different paths is a
hack.
**TODO**
- [ ] E2E tests
- [ ] Job validation may still be broken and prevent my fix from
working?
**raw_exec**
`raw_exec` is actually broken _a different way_ as exercised by tests in
this commit. I think we should probably remove these tests and fix that
in a followup PR/release, but I wanted to leave them in for the initial
review and discussion. Since non-containerized source paths are broken
anyway, perhaps there's another solution to this entire problem I'm
overlooking?
This PR adds the ability to set HTTP headers when downloading
an artifact from an `http` or `https` resource.
The implementation in `go-getter` is such that a new `HTTPGetter`
must be created for each artifact that sets headers (as opposed
to conveniently setting headers per-request). This PR maintains
the memoization of the default Getter objects, creating new ones
only for artifacts where headers are set.
Closes#9306
The unpublish workflow requires that we know the mode (RW vs RO) if we want to
unpublish the node. Update the hook and the Unpublish RPC so that we mark the
claim for release in a new state but leave the mode alone. This fixes a bug
where RO claims were failing node unpublish.
The core job GC doesn't know the mode, but we don't need it for that workflow,
so add a mode specifically for GC; the volumewatcher uses this as a sentinel
to check whether claims (with their specific RW vs RO modes) need to be claimed.
Even if a plugin sends back an empty `[]*device.DeviceGroup`, it's transformed to `nil` during the RPC. Our custom device plugin is returning empty `FingerprintResponse.Devices` very often. Our temporary fix is to send a dummy `*DeviceGroup` if the slice is empty. This has the effect of never triggering the "first fingerprint" and therefore timing out after 50s.
In turn, this made our node exceed its hearbeat grace period when restarting it, revoking all vault tokens for its allocations, causing a restart of all our allocations because the token couldn't be renewed.
Removing the logic for `f.Devices == nil` does not appear to affect the functionality of the function.
In Nomad v0.12.0, the client added additional fingerprinting around the
presense of the bridge kernel module. The fingerprinter only checked in
`/proc/modules` which is a list of loaded modules. In some cases, the
bridge kernel module is builtin rather than dynamically loaded. The fix
for that case is in #8721. However we were still missing the case where
the bridge module is dynamically loaded, but not yet loaded during the
startup of the Nomad agent. In this case the fingerprinter would believe
the bridge module was unavailable when really it gets loaded on demand.
This PR now has the fingerprinter scan the kernel module dependency file,
which will contain an entry for the bridge module even if it is not yet
loaded.
In summary, the client now looks for the bridge kernel module in
- /proc/modules
- /lib/modules/<kernel>/modules.builtin
- /lib/modules/<kernel>/modules.dep
Closes#8423
Beforehand tasks and field replacements did not have access to the
unique ID of their job or its parent. This adds this information as
new environment variables.
Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as
the `artifact.destination` and `template.destination` because we would always
append the destination to the task working directory. In the recent security
patch we treated the `destination` absolute path as valid if it didn't escape
the working directory, but this breaks backwards compatibility and
interpolation of `destination` fields.
This changeset partially reverts the behavior so that we always append the
destination, but we also perform the escape check on that new destination
after interpolation so the security hole is closed.
Also, ConsulTemplate test should exercise interpolation
Ensure that the client honors the client configuration for the
`template.disable_file_sandbox` field when validating the jobspec's
`template.source` parameter, and not just with consul-template's own `file`
function.
Prevent interpolated `template.source`, `template.destination`, and
`artifact.destination` fields from escaping file sandbox.
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match
* consul: advertise cni and multi host interface addresses
* structs: add service/check address_mode validation
* ar/groupservices: fetch networkstatus at hook runtime
* ar/groupservice: nil check network status getter before calling
* consul: comment network status can be nil
As newer versions of Consul are released, the minimum version of Envoy
it supports as a sidecar proxy also gets bumped. Starting with the upcoming
Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current
versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the
default implementation of Connect sidecar proxy.
This PR introduces a change such that each Nomad Client will query its
local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545)
and then launch the Connect sidecar proxy task using the latest supported version
of Envoy. If the `SupportedProxies` API component is not available from
Consul, Nomad will fallback to the old version of Envoy supported by old
versions of Consul.
Setting the meta configuration option `meta.connect.sidecar_image` or
setting the `connect.sidecar_task` stanza will take precedence as is
the current behavior for sidecar proxies.
Setting the meta configuration option `meta.connect.gateway_image`
will take precedence as is the current behavior for connect gateways.
`meta.connect.sidecar_image` and `meta.connect.gateway_image` may make
use of the special `${NOMAD_envoy_version}` variable interpolation, which
resolves to the newest version of Envoy supported by the Consul agent.
Addresses #8585#7665
Previously, Nomad was using a hand-made lookup table for looking
up EC2 CPU performance characteristics (core count + speed = ticks).
This data was incomplete and incorrect depending on region. The AWS
API has the correct data but requires API keys to use (i.e. should not
be queried directly from Nomad).
This change introduces a lookup table generated by a small command line
tool in Nomad's tools module which uses the Amazon AWS API.
Running the tool requires AWS_* environment variables set.
$ # in nomad/tools/cpuinfo
$ go run .
Going forward, Nomad can incorporate regeneration of the lookup table
somewhere in the CI pipeline so that we remain up-to-date on the latest
offerings from EC2.
Fixes#7830
Host with systemd-resolved have `/etc/resolv.conf` is a symlink
to `/run/systemd/resolve/stub-resolv.conf`. By bind-mounting
/etc/resolv.conf only, the exec container DNS resolution fail very badly.
This change fixes DNS resolution by binding /run/systemd/resolve as
well.
Note that this assumes that the systemd resolver (default to 127.0.0.53) is
accessible within the container. This is the case here because exec
containers share the same network namespace by default.
Jobs with custom network dns configurations are not affected, and Nomad
will continue to use the job dns settings rather than host one.
When defining a script-check in a group-level service, Nomad needs to
know which task is associated with the check so that it can use the
correct task driver to execute the check.
This PR fixes two bugs:
1) validate service.task or service.check.task is configured
2) make service.check.task inherit service.task if it is itself unset
Fixes#8952
- We previously added these to the client host metrics, but it's useful to have them on all client metrics.
- e.g. so you can exclude draining nodes from charts showing your fleet size.
The current implementation measures RPC request timeout only against
config.RPCHoldTimeout, which is fine for non-blocking requests but will
almost surely be exceeded by long-poll requests that block for minutes
at a time.
This adds an HasTimedOut method on the RPCInfo interface that takes into
account whether the request is blocking, its maximum wait time, and the
RPCHoldTimeout.
This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza.
```hcl
service {
connect {
gateway {
proxy {
// envoy proxy configuration
}
ingress {
// ingress-gateway configuration entry
}
}
}
}
```
A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value.
Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver.
Aims to address #8294 and tangentially #8647
* docker: support group allocated ports
* docker: add new ports driver config to specify which group ports are mapped
* docker: update port mapping docs
When the client-side actions of a CSI client RPC succeed but we get
disconnected during the RPC or we fail to checkpoint the claim state, we want
to be able to retry the client RPC without getting blocked by the client-side
state (ex. mount points) already having been cleaned up in previous calls.
Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This
may forward client RPCs to the node plugins or to the controller plugins,
depending on whether other allocations on this node have claims on this
volume.
By making clients responsible for running the `CSIVolume.Unpublish` RPC (and
making the RPC available to a `nomad volume detach` command), the
volumewatcher becomes only used by the core GC job and we no longer need
async volume GC from job deregister and node update.
Before, Connect Native Tasks needed one of these to work:
- To be run in host networking mode
- To have the Consul agent configured to listen to a unix socket
- To have the Consul agent configured to listen to a public interface
None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.
Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.
Fixes#8290
adds in oss components to support enterprise multi-vault namespace feature
upgrade specific doc on vault multi-namespaces
vault docs
update test to reflect new error
The NodePublish workflow currently creates the target path and its parent
directory. However, the CSI specification says that the CO shall ensure the
parent directory of the target path exists, and that the SP shall place the
block device or mounted directory at the target path. Much of our testing has
been with CSI plugins that are more forgiving, but our behavior breaks
spec-compliant CSI plugins.
This changeset ensures we only create the parent directory.
Also fixed the same typo in a test. Fixing the typo fixes the link, but
the link was still broken when running the website locally due to the
trailing slash. It would have worked in prod thanks to redirects, but
using the canonical URL seems ideal.
* ar: support opting into binding host ports to default network IP
* fix config plumbing
* plumb node address into network resource
* struct: only handle network resource upgrade path once
* command/agent/host: collect host data, multi platform
* nomad/structs/structs: new HostDataRequest/Response
* client/agent_endpoint: add RPC endpoint
* command/agent/agent_endpoint: add Host
* api/agent: add the Host endpoint
* nomad/client_agent_endpoint: add Agent Host with forwarding
* nomad/client_agent_endpoint: use findClientConn
This changes forwardMonitorClient and forwardProfileClient to use
findClientConn, which was cribbed from the common parts of those
funcs.
* command/debug: call agent hosts
* command/agent/host: eliminate calling external programs
This fixes a bug where a batch allocation fails to complete if it has
sidecars.
If the only remaining running tasks in an allocations are sidecars - we
must kill them and mark the allocation as complete.
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.
The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.
There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.
If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.
Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.
Fixes#6083
In #7957 we added support for passing a volume context to the controller RPCs.
This is an opaque map that's created by `CreateVolume` or, in Nomad's case,
in the volume registration spec.
However, we missed passing this field to the `NodeStage` and `NodePublish` RPC,
which prevents certain plugins (such as MooseFS) from making node RPCs.
* client/heartbeatstop: reversed time condition for startup grace
* scheduler/generic_sched: use `delayInstead` to avoid a loop
Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.
* scheduler/reconcile: lostLater are different than delayedReschedules
Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
* changes necessary to support oss licesning shims
revert nomad fmt changes
update test to work with enterprise changes
update tests to work with new ent enforcements
make check
update cas test to use scheduler algorithm
back out preemption changes
add comments
* remove unused method
This fixes few cases where driver eventor goroutines are leaked during
normal operations, but especially so in tests.
This change makes few modifications:
First, it switches drivers to use `Context`s to manage shutdown events.
Previously, it relied on callers invoking `.Shutdown()` function that is
specific to internal drivers only and require casting. Using `Contexts`
provide a consistent idiomatic way to manage lifecycle for both internal
and external drivers.
Also, I discovered few places where we don't clean up a temporary driver
instance in the plugin catalog code, where we dispense a driver to
inspect and validate the schema config without properly cleaning it up.
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.
This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.
Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.
Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
The `stats_hook` writes an Error log every time an allocation becomes
terminal. This is a normal condition, not an error. A real error
condition like a failure to collect the stats is logged later. It just
creates log noise, and this is a particularly bad operator experience
for heavy batch workloads.
The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).
This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
The MVP for CSI in the 0.11.0 release of Nomad did not include support
for opaque volume parameters or volume context. This changeset adds
support for both.
This also moves args for ControllerValidateCapabilities into a struct.
The CSI plugin `ControllerValidateCapabilities` struct that we turn
into a CSI RPC is accumulating arguments, so moving it into a request
struct will reduce the churn of this internal API, make the plugin
code more readable, and make this method consistent with the other
plugin methods in that package.
The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).
This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
The CSI plugins RPCs require the use of the storage provider's volume
ID, rather than the user-defined volume ID. Although changing the RPCs
to use the field name `ExternalID` risks breaking backwards
compatibility, we can use the `ExternalID` name internally for the
client and only use `VolumeID` at the RPC boundaries.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.
When serializing structs with msgpack, only consider type tags of
`codec`.
Hashicorp/go-msgpack (based on ugorji/go) defaults to interpretting
`codec` tag if it's available, but falls to using `json` if `codec`
isn't present.
This behavior is surprising in cases where we want to serialize json
differently from msgpack, e.g. serializing `ConsulExposeConfig`.
This change deflakes TestTaskTemplateManager_BlockedEvents test, because
it is expecting a number of events without accounting for transitional
state.
The test TestTaskTemplateManager_BlockedEvents attempts to ensure that a
template rendering emits blocked events for missing template ksys.
It works by setting a template that requires keys 0,1,2,3,4 and then
eventually sets keys 0,1,2,3 and ensures that we get a final event indicating
that keys 3 and 4 are still missing.
The test waits to get a blocked event for the final state, but it can
fail if receives a blocked event for a transitional state (e.g. one
reporting 2,3,4,5 are missing).
This fixes the test by ensuring that it waits until the final message
before assertion.
Also, it clarifies the intent of the test with stricter assertions and
additional comments.
Makes it possible to run Linux Containers On Windows with Nomad alongside Windows Containers. Fingerprint prevents only to run Nomad in Windows 10 with Linux Containers
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the intial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.
The more complex state behavior might be justified later, but we
should learn about failure modes first.
- track lastHeartbeat, the client local time of the last successful
heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
servers are still partioned
During MVP development, we reduced the timeout for controller plugins
to avoid long hangs in GC workers. But now that this work has been
moved to the volume watcher, we can restore the original timeout which
is better suited for the characteristic timescales of some cloud
provider APIs and better matches the behavior of k8s.
Fixes#7681
The current behavior of the CPU fingerprinter in AWS is that it
reads the **current** speed from `/proc/cpuinfo` (`CPU MHz` field).
This is because the max CPU frequency is not available by reading
anything on the EC2 instance itself. Normally on Linux one would
look at e.g. `sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq`
or perhaps parse the values from the `CPU max MHz` field in
`/proc/cpuinfo`, but those values are not available.
Furthermore, no metadata about the CPU is made available in the
EC2 metadata service.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html
Since `go-psutil` cannot determine the max CPU speed it defaults to
the current CPU speed, which could be basically any number between
0 and the true max. This is particularly bad on large, powerful
reserved instances which often idle at ~800 MHz while Nomad does
its fingerprinting (typically IO bound), which Nomad then uses as
the max, which results in severe loss of available resources.
Since the CPU specification is unavailable programmatically (at least
not without sudo) use a best-effort lookup table. This table was
generated by going through every instance type in AWS documentation
and copy-pasting the numbers.
https://aws.amazon.com/ec2/instance-types/
This approach obviously is not ideal as future instance types will
need to be added as they are introduced to AWS. However, using the
table should only be an improvement over the status quo since right
now Nomad miscalculates available CPU resources on all instance types.
Use v1.1.5 of go-msgpack/codec/codecgen, so go-msgpack codecgen matches
the library version.
We branched off earlier to pick up
f51b518921
, but apparently that's not needed as we could customize the package via
`-c` argument.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
task shutdown_delay will currently only run if there are registered
services for the task. This implementation detail isn't explicity stated
anywhere and is defined outside of the service stanza.
This change moves shutdown_delay to be evaluated after prekill hooks are
run, outside of any task runner hooks.
just use time.sleep
The `Job.Deregister` call will block on the client CSI controller RPCs
while the alloc still exists on the Nomad client node. So we need to
make the volume claim reaping async from the `Job.Deregister`. This
allows `nomad job stop` to return immediately. In order to make this
work, this changeset changes the volume GC so that the GC jobs are on a
by-volume basis rather than a by-job basis; we won't have to query
the (possibly deleted) job at the time of volume GC. We smuggle the
volume ID and whether it's a purge into the GC eval ID the same way we
smuggled the job ID previously.
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.
The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.
This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
Fixes#6594#6711#6714#7567
e2e testing is still TBD in #6502
Before, we only passed the Nomad agent's configured Consul HTTP
address onto the `consul connect envoy ...` bootstrap command.
This meant any Consul setup with TLS enabled would not work with
Nomad's Connect integration.
This change now sets CLI args and Environment Variables for
configuring TLS options for communicating with Consul when doing
the envoy bootstrap, as described in
https://www.consul.io/docs/commands/connect/envoy.html#usage
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the
connect {
sidecar_service {
proxy {
expose {
paths = [{
path = <exposed endpoint>
protocol = <http or grpc>
local_path_port = <local endpoint port>
listener_port = <inbound mesh port>
}, ... ]
}
}
}
stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference
Combined with a task-group level network port-mapping in the form:
port "exposeExample" { to = -1 }
it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.
A future PR may introduce more automagic behavior, where we can do things like
1) auto-fill the 'expose.path.local_path_port' with the default value of the
'service.port' value for task-group level connect-enabled services.
2) automatically generate a port-mapping
3) enable an 'expose.checks' flag which automatically creates exposed endpoints
for every compatible consul service check (http/grpc checks on connect
enabled services).
* nomad/structs/structs: new NodeEventSubsystemCSI
* client/client: pass triggerNodeEvent in the CSIConfig
* client/pluginmanager/csimanager/instance: add eventer to instanceManager
* client/pluginmanager/csimanager/manager: pass triggerNodeEvent
* client/pluginmanager/csimanager/volume: node event on [un]mount
* nomad/structs/structs: use storage, not CSI
* client/pluginmanager/csimanager/volume: use storage, not CSI
* client/pluginmanager/csimanager/volume_test: eventer
* client/pluginmanager/csimanager/volume: event on error
* client/pluginmanager/csimanager/volume_test: check event on error
* command/node_status: remove an extra space in event detail format
* client/pluginmanager/csimanager/volume: use snake_case for details
* client/pluginmanager/csimanager/volume_test: snake_case details
The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset:
* adds retries and backoffs to the where they were needed but not implemented
* annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally
* added a timeout and cancellation context to the `Probe` call, which didn't have one.
The test inserts an alloc in the server state, but expect the client to
start the alloc runner for it almost immediately.
Here, we add a retry loop to check that the client start all expected
alloc runners eventually.
Fix a regression where we accidentally started treating non-AWS
environments as AWS environments, resulting in bad networking settings.
Two factors some at play:
First, in [1], we accidentally switched the ultimate AWS test from
checking `ami-id` to `instance-id`. This means that nomad started
treating more environments as AWS; e.g. Hetzner implements `instance-id`
but not `ami-id`.
Second, some of these environments return empty values instead of
errors! Hetzner returns empty 200 response for `local-ipv4`, resulting
into bad networking configuration.
This change fix the situation by restoring the check to `ami-id` and
ensuring that we only set network configuration when the ip address is
not-empty. Also, be more defensive around response whitespace input.
[1] https://github.com/hashicorp/nomad/pull/6779
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.
Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.
closes#7007
* nomad/structs/volumes: add MountOptions to volume request
* jobspec/test-fixtures/basic.hcl: add mount_options to volume block
* jobspec/parse_test: add expected MountOptions
* api/tasks: add mount_options
* jobspec/parse_group: use hcl decode not mapstructure, mount_options
* client/allocrunner/csi_hook: pass MountOptions through
client/allocrunner/csi_hook: add a VolumeMountOptions
client/allocrunner/csi_hook: drop Options
client/allocrunner/csi_hook: use the structs options
* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions
* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities
* plugins/csi/plugin: remove todo 7007 comment
* nomad/structs/csi: MountOptions
* api/csi: add options to the api for parsing, match structs
* plugins/csi/plugin: move VolumeMountOptions to structs
* api/csi: use specific type for mount_options
* client/allocrunner/csi_hook: merge MountOptions here
* rename CSIOptions to CSIMountOptions
* client/allocrunner/csi_hook
* client/pluginmanager/csimanager/volume
* nomad/structs/csi
* plugins/csi/fake/client: add PrevVolumeCapability
* plugins/csi/plugin
* client/pluginmanager/csimanager/volume_test: remove debugging
* client/pluginmanager/csimanager/volume: fix odd merging logic
* api: rename CSIOptions -> CSIMountOptions
* nomad/csi_endpoint: remove a 7007 comment
* command/alloc_status: show mount options in the volume list
* nomad/structs/csi: include MountOptions in the volume stub
* api/csi: add MountOptions to stub
* command/volume_status_csi: clean up csiVolMountOption, add it
* command/alloc_status: csiVolMountOption lives in volume_csi_status
* command/node_status: display mount flags
* nomad/structs/volumes: npe
* plugins/csi/plugin: npe in ToCSIRepresentation
* jobspec/parse_test: expand volume parse test cases
* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions
* command/volume_status_csi: copy paste error
* jobspec/test-fixtures/basic: hclfmt
* command/volume_status_csi: clean up csiVolMountOption
Run the plugin fingerprint one last time with a closed client during
instance manager shutdown. This will return quickly and will give us a
correctly-populated `PluginInfo` marked as unhealthy so the Nomad
client can update the server about plugin health.
Allow for faster updates to plugin status when allocations become
terminal by listening for register/deregister events from the dynamic
plugin registry (which in turn are triggered by the plugin supervisor
hook).
The deregistration function closures that we pass up to the CSI plugin
manager don't properly close over the name and type of the
registration, causing monolith-type plugins to deregister only one of
their two plugins on alloc shutdown. Rebind plugin supervisor
deregistration targets to fix that.
Includes log message and comment improvements