* Node Drain events and Node Events (#8980)
Deployment status updates
handle deployment status updates (paused, failed, resume)
deployment alloc health
generate events from apply plan result
txn err check, slim down deployment event
one ndjson line per index
* consolidate down to node event + type
* fix UpdateDeploymentAllocHealth test invocations
* fix test
This Commit adds an /v1/events/stream endpoint to stream events from.
The stream framer has been updated to include a SendFull method which
does not fragment the data between multiple frames. This essentially
treats the stream framer as a envelope to adhere to the stream framer
interface in the UI.
If the `encode` query parameter is omitted events will be streamed as
newline delimted JSON.
As newer versions of Consul are released, the minimum version of Envoy
it supports as a sidecar proxy also gets bumped. Starting with the upcoming
Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current
versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the
default implementation of Connect sidecar proxy.
This PR introduces a change such that each Nomad Client will query its
local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545)
and then launch the Connect sidecar proxy task using the latest supported version
of Envoy. If the `SupportedProxies` API component is not available from
Consul, Nomad will fallback to the old version of Envoy supported by old
versions of Consul.
Setting the meta configuration option `meta.connect.sidecar_image` or
setting the `connect.sidecar_task` stanza will take precedence as is
the current behavior for sidecar proxies.
Setting the meta configuration option `meta.connect.gateway_image`
will take precedence as is the current behavior for connect gateways.
`meta.connect.sidecar_image` and `meta.connect.gateway_image` may make
use of the special `${NOMAD_envoy_version}` variable interpolation, which
resolves to the newest version of Envoy supported by the Consul agent.
Addresses #8585#7665
Volumes using attachment mode `file-system` use the CSI filesystem API when
they're mounted, and can be passed mount options. But `block-device` mode
volumes don't have this option. When RPCs are made to plugins, we are silently
dropping the mount options we don't expect to see, but this results in a poor
operator experience when the mount options aren't honored. This changeset
makes passing mount options to a `block-device` volume a validation error.
Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode
were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether
the volume had writer claims available for scheduling.
Extends CSI claim endpoint test to exercise multi-reader and make sure `WriteFreeClaims`
is exercised for multi-writer in feasibility test.
This PR fixes a long standing bug where submitting jobs with changes
to connect services would not trigger updates as expected. Previously,
service blocks were not considered as sources of destructive updates
since they could be synced with consul non-destructively. With Connect,
task group services that have changes to their connect block or to
the service port should be destructive, since the network plumbing of
the alloc is going to need updating.
Fixes#8596#7991
Non-destructive half in #7192
Since CPU resources are usually a soft limit it is desirable to allow
setting it as low as possible to allow tasks to run only in "idle" time.
Setting it to 0 is still not allowed to avoid potential unintentional
side effects with allowing a zero value. While there may not be any side
effects this commit attempts to minimize risk by avoiding the issue.
This does *not* change the defaults.
When defining a script-check in a group-level service, Nomad needs to
know which task is associated with the check so that it can use the
correct task driver to execute the check.
This PR fixes two bugs:
1) validate service.task or service.check.task is configured
2) make service.check.task inherit service.task if it is itself unset
Fixes#8952
The current implementation measures RPC request timeout only against
config.RPCHoldTimeout, which is fine for non-blocking requests but will
almost surely be exceeded by long-poll requests that block for minutes
at a time.
This adds an HasTimedOut method on the RPCInfo interface that takes into
account whether the request is blocking, its maximum wait time, and the
RPCHoldTimeout.
Fix CSIMountOptions.Copy() and VolumeRequest.Copy() where they
accidentally returned a reference to self rather than a deep copy.
`&(*ref)` in Golang apparently equivalent to plain `&ref`.
The initial implementation of global job stop for MRD looped over all the
regions in the CLI for expedience. This changeset includes the OSS parts of
moving this into the RPC layer so that API consumers don't have to implement
this logic themselves.
Multiregion deployments use the `NomadTokenID` to allow the deploymentwatcher
to send RPCs between regions with the original submitter's ACL token. This ID
should be filtered from diffs so that it doesn't cause a difference for
purposes of job plans.
When consul.allow_unauthenticated is set to false, the job_endpoint hook validates
that a `-consul-token` is provided and validates the token against the privileges
inherent to a Consul Service Identity policy for all the Connect enabled services
defined in the job.
Before, the check was assuming the service was of type sidecar-proxy. This fixes the
check to use the type of the task so we can distinguish between the different connect
types.
This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza.
```hcl
service {
connect {
gateway {
proxy {
// envoy proxy configuration
}
ingress {
// ingress-gateway configuration entry
}
}
}
}
```
A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value.
Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver.
Aims to address #8294 and tangentially #8647
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest versions, even if the version hasn't been
promoted yet.
Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in Canary stage, and if so, it
ensures that any lost/failed allocations is replaced one with the latest
promoted version instead.
This change adds the ability to set the fields `success_before_passing` and
`failures_before_critical` on Consul service check definitions. This is a
feature added to Consul v1.7.0 and later.
https://www.consul.io/docs/agent/checks#success-failures-before-passing-critical
Nomad doesn't do much besides pass the fields through to Consul.
Fixes#6913
When the client-side actions of a CSI client RPC succeed but we get
disconnected during the RPC or we fail to checkpoint the claim state, we want
to be able to retry the client RPC without getting blocked by the client-side
state (ex. mount points) already having been cleaned up in previous calls.
adds in oss components to support enterprise multi-vault namespace feature
upgrade specific doc on vault multi-namespaces
vault docs
update test to reflect new error
* nomad/structs/structs: add to Job.Validate
* Update nomad/structs/structs.go
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
* nomad/structs/structs: match error strings to the config file
* nomad/structs/structs_test: clarify the test a bit
* nomad/structs/structs_test: typo in the test error comparison
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
This fixes a bug where jobs may get "stuck" unprocessed that
dispropotionately affect periodic jobs around leadership transitions.
When registering a job, the job registration and the eval to process it
get applied to raft as two separate transactions; if the job
registration succeeds but eval application fails, the job may remain
unprocessed. Operators may detect such failure, when submitting a job
update and get a 500 error code, and they could retry; periodic jobs
failures are more likely to go unnoticed, and no further periodic
invocations will be processed until an operator force evaluation.
This fixes the issue by ensuring that the job registration and eval
application get persisted and processed atomically in the same raft log
entry.
Also, applies the same change to ensure atomicity in job deregistration.
Backward Compatibility
We must maintain compatibility in two scenarios: mixed clusters where a
leader can handle atomic updates but followers cannot, and a recent
cluster processes old log entries from legacy or mixed cluster mode.
To handle this constraints: ensure that the leader continue to emit the
Evaluation log entry until all servers have upgraded; also, when
processing raft logs, the servers honor evaluations found in both spots,
the Eval in job (de-)registration and the eval update entries.
When an updated server sees mix-mode behavior where an eval is inserted
into the raft log twice, it ignores the second instance.
I made one compromise in consistency in the mixed-mode scenario: servers
may disagree on the eval.CreateIndex value: the leader and updated
servers will report the job registration index while old servers will
report the index of the eval update log entry. This discripency doesn't
seem to be material - it's the eval.JobModifyIndex that matters.
Deployments should wait until kicked off by `Job.Register` so that we can
assert that all regions have a scheduled deployment before starting any
region. This changeset includes the OSS fixes to support the ENT work.
`IsMultiregionStarter` has no more callers in OSS, so remove it here.
It's supposed to be possible for a region not to have `datacenters` set so
that it can use the job's `datacenters` field. This requires that operators
use the same DC name across multiple regions, but that's the default client
configuration.
Before, the service definition for a Connect Native service would always
require setting the `service.task` parameter. Now, that parameter is
automatically inferred when there is only one task in the task group.
Fixes#8274
The multiregion plan diffs swap the old and new versions for each region when
they're edited (rather than added/removed). The `multiregionRegionDiff`
function call incorrectly reversed its arguments for existing regions.
* ar: support opting into binding host ports to default network IP
* fix config plumbing
* plumb node address into network resource
* struct: only handle network resource upgrade path once
* made api.Scaling.Max a pointer, so we can detect (and complain) when it is neglected
* added checks to HCL parsing that it is present
* when Scaling.Max is absent/invalid, don't return extraneous error messages during validation
* tweak to multiregion handling to ensure that the count is valid on the interpolated regional jobs
resolves#8355
* command/agent/host: collect host data, multi platform
* nomad/structs/structs: new HostDataRequest/Response
* client/agent_endpoint: add RPC endpoint
* command/agent/agent_endpoint: add Host
* api/agent: add the Host endpoint
* nomad/client_agent_endpoint: add Agent Host with forwarding
* nomad/client_agent_endpoint: use findClientConn
This changes forwardMonitorClient and forwardProfileClient to use
findClientConn, which was cribbed from the common parts of those
funcs.
* command/debug: call agent hosts
* command/agent/host: eliminate calling external programs
The `nomad volume deregister` command currently returns an error if the volume
has any claims, but in cases where the claims can't be dropped because of
plugin errors, providing a `-force` flag gives the operator an escape hatch.
If the volume has no allocations or if they are all terminal, this flag
deletes the volume from the state store, immediately and implicitly dropping
all claims without further CSI RPCs. Note that this will not also
unmount/detach the volume, which we'll make the responsibility of a separate
`nomad volume detach` command.
This fixes a bug where a batch allocation fails to complete if it has
sidecars.
If the only remaining running tasks in an allocations are sidecars - we
must kill them and mark the allocation as complete.
The scheduler returns a very strange error if it detects a port number
out of range. If these would somehow make it to the client they would
overflow when converted to an int32 and could cause conflicts.
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.
The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.
There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.
If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.
Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.
Fixes#6083
If `max_parallel` is not set, all regions should begin in a `running` state
rather than a `pending` state. Otherwise the first region is set to `running`
and then all the remaining regions once it enters `blocked. That behavior is
technically correct in that we have at most `max_parallel` regions running,
but definitely not what a user expects.
In multiregion deployments when ACLs are enabled, the deploymentwatcher needs
an appropriately scoped ACL token with the same `submit-job` rights as the
user who submitted it. The token will already be replicated, so store the
accessor ID so that it can be retrieved by the leader.
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.
This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect
* scheduler/reconcile: thread follupEvalIDs through to results.stop
* scheduler/reconcile: comment typo
* nomad/_test: correct arguments for plan.AppendStoppedAlloc
* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
* changes necessary to support oss licesning shims
revert nomad fmt changes
update test to work with enterprise changes
update tests to work with new ent enforcements
make check
update cas test to use scheduler algorithm
back out preemption changes
add comments
* remove unused method
Fixes#8000
When requesting a Service Identity token from Consul, use the TaskKind
of the Task to get at the service name associated with the task. In
the past using the TaskName worked because it was generated as a sidecar
task with a name that included the service. In the Native context, we
need to get at the service name in a more correct way, i.e. using the
TaskKind which is defined to include the service name.
Allow a `/v1/jobs?all_namespaces=true` to list all jobs across all
namespaces. The returned list is to contain a `Namespace` field
indicating the job namespace.
If ACL is enabled, the request token needs to be a management token or
have `namespace:list-jobs` capability on all existing namespaces.
The MVP for CSI in the 0.11.0 release of Nomad did not include support
for opaque volume parameters or volume context. This changeset adds
support for both.
This also moves args for ControllerValidateCapabilities into a struct.
The CSI plugin `ControllerValidateCapabilities` struct that we turn
into a CSI RPC is accumulating arguments, so moving it into a request
struct will reduce the churn of this internal API, make the plugin
code more readable, and make this method consistent with the other
plugin methods in that package.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.
When serializing structs with msgpack, only consider type tags of
`codec`.
Hashicorp/go-msgpack (based on ugorji/go) defaults to interpretting
`codec` tag if it's available, but falls to using `json` if `codec`
isn't present.
This behavior is surprising in cases where we want to serialize json
differently from msgpack, e.g. serializing `ConsulExposeConfig`.
Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on
upgrades or on API handling when user misses the value.
The scheduler already treats `""` value as binpack. This PR merely
ensures that the operator API returns the effective value.
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
In this changeset:
* If a Nomad client node is running both a controller and a node
plugin (which is a common case), then if only the controller or the
node is removed, the plugin was not being updated with the correct
counts.
* The existing test for plugin cleanup didn't go back to the state
store, which normally is ok but is complicated in this case by
denormalization which changes the behavior. This commit makes the
test more comprehensive.
* Set "controller required" when plugin has `PUBLISH_READONLY`. All
known controllers that support `PUBLISH_READONLY` also support
`PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
* Only create plugins when the allocs for those plugins are
healthy. If we allow a plugin to be created for the first time when
the alloc is not healthy, then we'll recreate deleted plugins when
the job's allocs all get marked terminal.
* Terminal plugin alloc updates should cleanup the plugin. The client
fingerprint can't tell if the plugin is unhealthy intentionally (for
the case of updates or job stop). Allocations that are
server-terminal should delete themselves from the plugin and trigger
a plugin self-GC, the same as an unused node.
- track lastHeartbeat, the client local time of the last successful
heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
servers are still partioned
We should only remove the `ReadAllocs`/`WriteAllocs` values for a
volume after the claim has entered the "ready to free"
state. The volume will eventually be released as expected. But
querying the volume API will show the volume is released before the
controller unpublish has finished and this can cause a race with
starting new jobs.
Test updates are to cover cases where we're dropping claims but not
running through the whole reaping process.
This changeset adds a subsystem to run on the leader, similar to the
deployment watcher or node drainer. The `Watcher` performs a blocking
query on updates to the `CSIVolumes` table and triggers reaping of
volume claims.
This will avoid tying up scheduling workers by immediately sending
volume claim workloads into their own loop, rather than blocking the
scheduling workers in the core GC job doing things like talking to CSI
controllers
The volume watcher is enabled on leader step-up and disabled on leader
step-down.
The volume claim GC mechanism now makes an empty claim RPC for the
volume to trigger an index bump. That in turn unblocks the blocking
query in the volume watcher so it can assess which claims can be
released for a volume.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
The BinPackIter accounted for node reservations twice when scoring nodes
which could bias scores toward nodes with reservations.
Pseudo-code for previous algorithm:
```
proposed = reservedResources + sum(allocsResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are added to the total resources used by
allocations, and then the node's reserved resources are later
substracted from the node's overall resources.
The new algorithm is:
```
proposed = sum(allocResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are no longer added to the total resources
used by allocations.
My guess as to how this bug happened is that the resource utilization
variable (`util`) is calculated and returned by the `AllocsFit` function
which needs to take reserved resources into account as a basic
feasibility check.
To avoid re-calculating alloc resource usage (because there may be a
large number of allocs), we reused `util` in the `ScoreFit` function.
`ScoreFit` properly accounts for reserved resources by subtracting them
from the node's overall resources. However since `util` _also_ took
reserved resources into account the score would be incorrect.
Prior to the fix the added test output:
```
Node: reserved Score: 1.0000
Node: reserved2 Score: 1.0000
Node: no-reserved Score: 0.9741
```
The scores being 1.0 for *both* nodes with reserved resources is a good
hint something is wrong as they should receive different scores. Upon
further inspection the double accounting of reserved resources caused
their scores to be >1.0 and clamped.
After the fix the added test outputs:
```
Node: no-reserved Score: 0.9741
Node: reserved Score: 0.9480
Node: reserved2 Score: 0.8717
```
The field names within the structs representing the Connect proxy
definition were not the same (nomad/structs/ vs api/), causing the
values to be lost in translation for the 'nomad job inspect' command.
Since the field names already shipped in v0.11.0 we cannot simply
fix the names. Instead, use the json struct tag on the structs/ structs
to remap the name to match the publicly expose api/ package on json
encoding.
This means existing jobs from v0.11.0 will continue to work, and
the JSON API for job submission will remain backwards compatible.
Before, if the sidecar_service stanza of a connect enabled service
was missing, the job submission would cause a panic in the nomad
agent. Since the panic was happening in the API handler the agent
itself continued running, but this change will the condition more
gracefully.
By fixing the `Copy` method, the API handler now returns the proper
error.
$ nomad job run foo.nomad
Error submitting job: Unexpected response code: 500 (1 error occurred:
* Task group api validation failed: 2 errors occurred:
* Missing tasks for task group
* Task group service validation failed: 1 error occurred:
* Service[0] count-api validation failed: 1 error occurred:
* Consul Connect must be native or use a sidecar service
* nomad/state/state_store: error message copy/paste error
* nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse
* nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister
* nomad/core_sched: make volumeClaimReap available without a CoreSched
* nomad/job_endpoint: Deregister return early if the job is missing
* nomad/job_endpoint_test: job Deregistion is idempotent
* nomad/core_sched: conditionally ignore alloc status in volumeClaimReap
* nomad/job_endpoint: volumeClaimReap all allocations, even running
* nomad/core_sched_test: extra argument to collectClaimsToGCImpl
* nomad/job_endpoint: job deregistration is not idempotent
Part of #6120
Building on the support for enabling connect proxy paths in #7323, this change
adds the ability to configure the 'service.check.expose' flag on group-level
service check definitions for services that are connect-enabled. This is a slight
deviation from the "magic" that Consul provides. With Consul, the 'expose' flag
exists on the connect.proxy stanza, which will then auto-generate expose paths
for every HTTP and gRPC service check associated with that connect-enabled
service.
A first attempt at providing similar magic for Nomad's Consul Connect integration
followed that pattern exactly, as seen in #7396. However, on reviewing the PR
we realized having the `expose` flag on the proxy stanza inseperably ties together
the automatic path generation with every HTTP/gRPC defined on the service. This
makes sense in Consul's context, because a service definition is reasonably
associated with a single "task". With Nomad's group level service definitions
however, there is a reasonable expectation that a service definition is more
abstractly representative of multiple services within the task group. In this
case, one would want to define checks of that service which concretely make HTTP
or gRPC requests to different underlying tasks. Such a model is not possible
with the course `proxy.expose` flag.
Instead, we now have the flag made available within the check definitions themselves.
By making the expose feature resolute to each check, it is possible to have
some HTTP/gRPC checks which make use of the envoy exposed paths, as well as
some HTTP/gRPC checks which make use of some orthongonal port-mapping to do
checks on some other task (or even some other bound port of the same task)
within the task group.
Given this example,
group "server-group" {
network {
mode = "bridge"
port "forchecks" {
to = -1
}
}
service {
name = "myserver"
port = 2000
connect {
sidecar_service {
}
}
check {
name = "mycheck-myserver"
type = "http"
port = "forchecks"
interval = "3s"
timeout = "2s"
method = "GET"
path = "/classic/responder/health"
expose = true
}
}
}
Nomad will automatically inject (via job endpoint mutator) the
extrapolated expose path configuration, i.e.
expose {
path {
path = "/classic/responder/health"
protocol = "http"
local_path_port = 2000
listener_port = "forchecks"
}
}
Documentation is coming in #7440 (needs updating, doing next)
Modifications to the `countdash` examples in https://github.com/hashicorp/demo-consul-101/pull/6
which will make the examples in the documentation actually runnable.
Will add some e2e tests based on the above when it becomes available.
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the
connect {
sidecar_service {
proxy {
expose {
paths = [{
path = <exposed endpoint>
protocol = <http or grpc>
local_path_port = <local endpoint port>
listener_port = <inbound mesh port>
}, ... ]
}
}
}
stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference
Combined with a task-group level network port-mapping in the form:
port "exposeExample" { to = -1 }
it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.
A future PR may introduce more automagic behavior, where we can do things like
1) auto-fill the 'expose.path.local_path_port' with the default value of the
'service.port' value for task-group level connect-enabled services.
2) automatically generate a port-mapping
3) enable an 'expose.checks' flag which automatically creates exposed endpoints
for every compatible consul service check (http/grpc checks on connect
enabled services).
* nomad/structs/structs: new NodeEventSubsystemCSI
* client/client: pass triggerNodeEvent in the CSIConfig
* client/pluginmanager/csimanager/instance: add eventer to instanceManager
* client/pluginmanager/csimanager/manager: pass triggerNodeEvent
* client/pluginmanager/csimanager/volume: node event on [un]mount
* nomad/structs/structs: use storage, not CSI
* client/pluginmanager/csimanager/volume: use storage, not CSI
* client/pluginmanager/csimanager/volume_test: eventer
* client/pluginmanager/csimanager/volume: event on error
* client/pluginmanager/csimanager/volume_test: check event on error
* command/node_status: remove an extra space in event detail format
* client/pluginmanager/csimanager/volume: use snake_case for details
* client/pluginmanager/csimanager/volume_test: snake_case details
* nomad/structs/csi: delete just one plugin type from a node
* nomad/structs/csi: add DeleteAlloc
* nomad/state/state_store: add deleteJobFromPlugin
* nomad/state/state_store: use DeleteAlloc not DeleteNodeType
* move CreateTestCSIPlugin to state to avoid an import cycle
* nomad/state/state_store_test: delete a plugin by deleting its jobs
* nomad/*_test: move CreateTestCSIPlugin to state
* nomad/state/state_store: update one plugin per transaction
* command/plugin_status_test: move CreateTestCSIPlugin
* nomad: csi: handle nils CSIPlugin methods, clarity
- tg.Count defaults to tg.Scaling.Min if present (falls back on previous default of 1 if Scaling is absent)
- Validate() enforces tg.Scaling.Min <= tg.Count <= tg.Scaling.Max
modification in ApiScalingPolicyToStructs, api.TaskGroup.Validate so that defaults are handled for TaskGroup.Count and
* nomad/structs/csi: split CanWrite into health, in use
* scheduler/scheduler: expose AllocByID in the state interface
* nomad/state/state_store_test
* scheduler/stack: SetJobID on the matcher
* scheduler/feasible: when a volume writer is in use, check if it's us
* scheduler/feasible: remove SetJob
* nomad/state/state_store: denormalize allocs before Claim
* nomad/structs/csi: return errors on claim, with context
* nomad/csi_endpoint_test: new alloc doesn't look like an update
* nomad/state/state_store_test: change test reference to CanWrite
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.
Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.
closes#7007
* nomad/structs/volumes: add MountOptions to volume request
* jobspec/test-fixtures/basic.hcl: add mount_options to volume block
* jobspec/parse_test: add expected MountOptions
* api/tasks: add mount_options
* jobspec/parse_group: use hcl decode not mapstructure, mount_options
* client/allocrunner/csi_hook: pass MountOptions through
client/allocrunner/csi_hook: add a VolumeMountOptions
client/allocrunner/csi_hook: drop Options
client/allocrunner/csi_hook: use the structs options
* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions
* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities
* plugins/csi/plugin: remove todo 7007 comment
* nomad/structs/csi: MountOptions
* api/csi: add options to the api for parsing, match structs
* plugins/csi/plugin: move VolumeMountOptions to structs
* api/csi: use specific type for mount_options
* client/allocrunner/csi_hook: merge MountOptions here
* rename CSIOptions to CSIMountOptions
* client/allocrunner/csi_hook
* client/pluginmanager/csimanager/volume
* nomad/structs/csi
* plugins/csi/fake/client: add PrevVolumeCapability
* plugins/csi/plugin
* client/pluginmanager/csimanager/volume_test: remove debugging
* client/pluginmanager/csimanager/volume: fix odd merging logic
* api: rename CSIOptions -> CSIMountOptions
* nomad/csi_endpoint: remove a 7007 comment
* command/alloc_status: show mount options in the volume list
* nomad/structs/csi: include MountOptions in the volume stub
* api/csi: add MountOptions to stub
* command/volume_status_csi: clean up csiVolMountOption, add it
* command/alloc_status: csiVolMountOption lives in volume_csi_status
* command/node_status: display mount flags
* nomad/structs/volumes: npe
* plugins/csi/plugin: npe in ToCSIRepresentation
* jobspec/parse_test: expand volume parse test cases
* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions
* command/volume_status_csi: copy paste error
* jobspec/test-fixtures/basic: hclfmt
* command/volume_status_csi: clean up csiVolMountOption
Nomad clients will push node updates during client restart which can
cause an extra claim for a volume by the same alloc. If an alloc
already claims a volume, we can allow it to be treated as a valid
claim and continue.
* nomad/structs/csi: new RemoteID() uses the ExternalID if set
* nomad/csi_endpoint: pass RemoteID to volume request types
* client/pluginmanager/csimanager/volume: pass RemoteID to NodePublishVolume
* api/allocations: GetTaskGroup finds the taskgroup struct
* command/node_status: display CSI volume names
* nomad/state/state_store: new CSIVolumesByNodeID
* nomad/state/iterator: new SliceIterator type implements memdb.ResultIterator
* nomad/csi_endpoint: deal with a slice of volumes
* nomad/state/state_store: CSIVolumesByNodeID return a SliceIterator
* nomad/structs/csi: CSIVolumeListRequest takes a NodeID
* nomad/csi_endpoint: use the return iterator
* command/agent/csi_endpoint: parse query params for CSIVolumes.List
* api/nodes: new CSIVolumes to list volumes by node
* command/node_status: use the new list endpoint to print volumes
* nomad/state/state_store: error messages consider the operator
* command/node_status: include the Provider
* client/allocrunner/csi_hook: tag errors
* nomad/client_csi_endpoint: tag errors
* nomad/client_rpc: remove an unnecessary error tag
* nomad/state/state_store: ControllerRequired fix intent
We use ControllerRequired to indicate that a volume should use the
publish/unpublish workflow, rather than that it has a controller. We
need to check both RequiresControllerPlugin and SupportsAttachDetach
from the fingerprint to check that.
* nomad/csi_endpoint: tag errors
* nomad/csi_endpoint_test: longer error messages, mock fingerprints
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
* command/csi: csi, csi_plugin, csi_volume
* helper/funcs: move ExtraKeys from parse_config to UnusedKeys
* command/agent/config_parse: use helper.UnusedKeys
* api/csi: annotate CSIVolumes with hcl fields
* command/csi_plugin: add Synopsis
* command/csi_volume_register: use hcl.Decode style parsing
* command/csi_volume_list
* command/csi_volume_status: list format, cleanup
* command/csi_plugin_list
* command/csi_plugin_status
* command/csi_volume_deregister
* command/csi_volume: add Synopsis
* api/contexts/contexts: add csi search contexts to the constants
* command/commands: register csi commands
* api/csi: fix struct tag for linter
* command/csi_plugin_list: unused struct vars
* command/csi_plugin_status: unused struct vars
* command/csi_volume_list: unused struct vars
* api/csi: add allocs to CSIPlugin
* command/csi_plugin_status: format the allocs
* api/allocations: copy Allocation.Stub in from structs
* nomad/client_rpc: add some error context with Errorf
* api/csi: collapse read & write alloc maps to a stub list
* command/csi_volume_status: cleanup allocation display
* command/csi_volume_list: use Schedulable instead of Healthy
* command/csi_volume_status: use Schedulable instead of Healthy
* command/csi_volume_list: sprintf string
* command/csi: delete csi.go, csi_plugin.go
* command/plugin: refactor csi components to sub-command plugin status
* command/plugin: remove csi
* command/plugin_status: remove csi
* command/volume: remove csi
* command/volume_status: split out csi specific
* helper/funcs: add RemoveEqualFold
* command/agent/config_parse: use helper.RemoveEqualFold
* api/csi: do ,unusedKeys right
* command/volume: refactor csi components to `nomad volume`
* command/volume_register: split out csi specific
* command/commands: use the new top level commands
* command/volume_deregister: hardwired type csi for now
* command/volume_status: csiFormatVolumes rescued from volume_list
* command/plugin_status: avoid a panic on no args
* command/volume_status: avoid a panic on no args
* command/plugin_status: predictVolumeType
* command/volume_status: predictVolumeType
* nomad/csi_endpoint_test: move CreateTestPlugin to testing
* command/plugin_status_test: use CreateTestCSIPlugin
* nomad/structs/structs: add CSIPlugins and CSIVolumes search consts
* nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix
* nomad/search_endpoint: add CSIPlugins and CSIVolumes
* command/plugin_status: move the header to the csi specific
* command/volume_status: move the header to the csi specific
* nomad/state/state_store: CSIPluginByID prefix
* command/status: rename the search context to just Plugins/Volumes
* command/plugin,volume_status: test return ids now
* command/status: rename the search context to just Plugins/Volumes
* command/plugin_status: support -json and -t
* command/volume_status: support -json and -t
* command/plugin_status_csi: comments
* command/*_status: clean up text
* api/csi: fix stale comments
* command/volume: make deregister sound less fearsome
* command/plugin_status: set the id length
* command/plugin_status_csi: more compact plugin health
* command/volume: better error message, comment
* structs: add ControllerRequired, volume.Name, no plug.Type
* structs: Healthy -> Schedulable
* state_store: Healthy -> Schedulable
* api: add ControllerRequired to api data types
* api: copy csi structs changes
* nomad/structs/csi: include name and external id
* api/csi: include Name and ExternalID
* nomad/structs/csi: comments for the 3 ids
* structs: CSIInfo include AllocID, CSIPlugins no Jobs
* state_store: eliminate plugin Jobs, delete an empty plugin
* nomad/structs/csi: detect empty plugins correctly
* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID
* client/pluginmanager/csimanager/instance: allocID
* client/pluginmanager/csimanager/fingerprint: set AllocID
* client/node_updater: split controller and node plugins
* api/csi: remove Jobs
The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.
Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.
* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint
* client/dynamicplugins: lift AllocID into the struct from Options
* api/csi_test: remove Jobs test
* nomad/structs/csi: CSIPlugins has an array of allocs
* nomad/state/state_store: implement CSIPluginDenormalize
* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc
* nomad/csi_endpoint_test: defer deleteNodes for clarity
* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.
For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.
The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.
This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
Currently, the client has to ship an entire allocation to the server as
part of performing a VolumeClaim, this has a few problems:
Firstly, it means the client is sending significantly more data than is
required (an allocation contains the entire contents of a Nomad job,
alongside other irrelevant state) which has a non-zero (de)serialization
cost.
Secondly, because the allocation was never re-fetched from the state
store, it means that we were potentially open to issues caused by stale
state on a misbehaving or malicious client.
The change removes both of those issues at the cost of a couple of more
state store lookups, but they should be relatively cheap.
We also now provide the CSIVolume in the response for a claim, so the
client can perform a Claim without first going ahead and fetching all of
the volumes.
Nomad servers need to make requests to CSI controller plugins running
on a client for publish/unpublish. The RPC needs to look up the client
node based on the plugin, load balancing across controllers, and then
perform the required client RPC to that node (via server forwarding if
neccessary).
The `ControllerPublishVolumeResponse` CSI RPC includes the publish
context intended to be passed by the orchestrator as an opaque value
to the node plugins. This changeset adds it to our response to a
volume claim request to proxy the controller's response back to the
client node.
When the client receives an allocation which includes a CSI volume,
the alloc runner will block its main `Run` loop. The alloc runner will
issue a `VolumeClaim` RPC to the Nomad servers. This changeset
implements the portions of the `VolumeClaim` RPC endpoint that have
not been previously completed.
* state_store: csi volumes/plugins store the index in the txn
* nomad: csi_endpoint_test require index checks need uint64()
* nomad: other tests using int 0 not uint64(0)
* structs: pass index into New, but not other struct methods
* state_store: csi plugin indexes, use new struct interface
* nomad: csi_endpoint_test check index/query meta (on explicit 0)
* structs: NewCSIVolume takes an index arg now
* scheduler/test: NewCSIVolume takes an index arg now
This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to
the leader's loop for scheduling core job evals. Right now this is an
empty method body without even a config file stanza. Later changesets
will implement the logic of volume publication GC.
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:
* A `csi_plugin` stanza as part of a Nomad task configuration, to
allow a task to expose that it is a plugin.
* A new task runner hook: `csi_plugin_supervisor`. This hook does two
things. When the `csi_plugin` stanza is detected, it will
automatically configure the plugin task to receive bidirectional
mounts to the CSI intermediary directory. At runtime, it will then
perform an initial heartbeat of the plugin and handle submitting it to
the new `dynamicplugins.Registry` for further use by the client, and
then run a lightweight heartbeat loop that will emit task events
when health changes.
* The `dynamicplugins.Registry` for handling plugins that run
as Nomad tasks, in contrast to the existing catalog that requires
`go-plugin` type plugins and to know the plugin configuration in
advance.
* The `csimanager` which fingerprints CSI plugins, in a similar way to
`drivermanager` and `devicemanager`. It currently only fingerprints
the NodeID from the plugin, and assumes that all plugins are
monolithic.
Missing features
* We do not use the live updates of the `dynamicplugin` registry in
the `csimanager` yet.
* We do not deregister the plugins from the client when they shutdown
yet, they just become indefinitely marked as unhealthy. This is
deliberate until we figure out how we should manage deploying new
versions of plugins/transitioning them.
allow oss to parse sink duration
clean up audit sink parsing
ent eventer config reload
fix typo
SetEnabled to eventer interface
client acl test
rm dead code
fix failing test
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is related bug where destructive changes are required
(see #6877) which will be fixed in another PR.
The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.
Fixes#6459
Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.
To enable this feature, the flag enable_tag_override must be configured
in the service definition.
Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.
Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.
Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.
In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).
Fixes#2057
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.
The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).
These values are not actually used yet in this changeset.
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:
* `{https,rpc}_handshake_timeout`
* `{http,rpc}_max_conns_per_client`
The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.
The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.
All limits are configurable and may be disabled by setting them to `0`.
This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
This commit ensures that Alloc.AllocatedResources is properly populated
when read from persistence stores (namely Raft and client state store).
The alloc struct may have been written previously by an arbitrary old
version that may only populate Alloc.TaskResources.
When parsing a config file which had the consul.timeout param set,
Nomad was reporting an error causing startup to fail. This seems
to be caused by the HCL decoder interpreting the timeout type as
an int rather than a string. This is caused by the struct
TimeoutHCL param having a hcl key of timeout alongside a Timeout
struct param of type time.Duration (int). Ensuring the decoder
ignores the Timeout struct param ensure the decoder runs
correctly.
copy struct values
ensure groupserviceHook implements RunnerPreKillhook
run deregister first
test that shutdown times are delayed
move magic number into variable
Fixes#6853
Canonicalize jobs first before adding any sidecars. This fixes a bug
where sidecar tasks were added without interpolated names and broke
validation. Sidecar tasks must be canonicalized independently.
Also adds a group network to the mock connect job because it wasn't a
valid connect job before!
Noticed that ACL endpoints return 500 status code for user errors. This
is confusing and can lead to false monitoring alerts.
Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message. Codes for now match
HTTP codes to ease reasoning.
```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:
- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"
This is due to package managers wishing to favor final releases over
prereleases.
In a scheduler versions more often represent the earliest release all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated are prereleases has no impact on
ordering.
This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:
- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`
Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
* client: improve group service stanza interpolation and check_restart support
Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.
The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
This commit introduces support for configuring mount propagation when
mounting volumes with the `volume_mount` stanza on Linux targets.
Similar to Kubernetes, we expose 3 options for configuring mount
propagation:
- private, which is equivalent to `rprivate` on Linux, which does not allow the
container to see any new nested mounts after the chroot was created.
- host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts
that have been created _outside of the container_ to be visible
inside the container after the chroot is created.
- bidirectional, which is equivalent to `rshared` on Linux, which allows both
the container to see new mounts created on the host, but
importantly _allows the container to create mounts that are
visible in other containers an don the host_
private and host-to-task are safe, but bidirectional mounts can be
dangerous, as if the code inside a container creates a mount, and does
not clean it up before tearing down the container, it can cause bad
things to happen inside the kernel.
To add a layer of safety here, we require that the user has ReadWrite
permissions on the volume before allowing bidirectional mounts, as a
defense in depth / validation case, although creating mounts should also require
a priviliged execution environment inside the container.
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.
Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release. As such, we maintain current behavior that doesn't require
passing the proper namespace in request. A future major release may
start enforcing checking declared namespace.
Without a `LocalServicePort`, Connect services will try to use the
mapped port even when delivering traffic locally. A user can override
this behavior by pinning the port value in the `service` stanza but
this prevents us from using the Consul service name to reach the
service.
This commits configures the Consul proxy with its `LocalServicePort`
and `LocalServiceAddress` fields.
Currently, using a Volume in a job uses the following configuration:
```
volume "alias-name" {
type = "volume-type"
read_only = true
config {
source = "host_volume_name"
}
}
```
This commit migrates to the following:
```
volume "alias-name" {
type = "volume-type"
source = "host_volume_name"
read_only = true
}
```
The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.
However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation
As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:
- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
to the plugin name, and a `config` block will
allow you to pass meta to the plugin. Or will
point to a pre-configured ephemeral config.
*If implemented
The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
This is an attempt to ease dependency management for external driver
plugins, by avoiding requiring them to compile ugorji/go generated
files. Plugin developers reported some pain with the brittleness of
ugorji/go dependency in particular, specially when using go mod, the
default go mod manager in golang 1.13.
Context
--------
Nomad uses msgpack to persist and serialize internal structs, using
ugorji/go library. As an optimization, we use ugorji/go code generation
to speedup process and aovid the relection-based slow path.
We commit these generated files in repository when we cut and tag the
release to ease reproducability and debugging old releases. Thus,
downstream projects that depend on release tag, indirectly depends on
ugorji/go generated code.
Sadly, the generated code is brittle and specific to the version of
ugorji/go being used. When go mod picks another version of ugorji/go
then nomad (go mod by default uses release according to semver),
downstream projects face compilation errors.
Interestingly, downstream projects don't commonly serialize nomad
internal structs. Drivers and device plugins use grpc instead of
msgpack for the most part. In the few cases where they use msgpag (e.g.
decoding task config), they do without codegen path as they run on
driver specific structs not the nomad internal structs. Also, the
ugorji/go serialization through reflection is generally backward
compatible (mod some ugorji/go regression bugs that get introduced every
now and then :( ).
Proposal
---------
The proposal here is to keep committing ugorji/go codec generated files
for releases but to use a go tag for them.
All nomad development through the makefile, including releasing, CI and
dev flow, has the tag enabled.
Downstream plugin projects, by default, will skip these files and life
proceed as normal for them.
The downside is that nomad developers who use generated code but avoid
using make must start passing additional go tag argument. Though this
is not a blessed configuration.
* connect: add unix socket to proxy grpc for envoy
Fixes#6124
Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.
* connect: pointer receiver on structs with mutexes
* connect: warn on all proxy errors
* taskenv: add connect upstream env vars + test
* set taskenv upstreams instead of appending
* Update client/taskenv/env.go
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
* adds meta object to service in job spec, sends it to consul
* adds tests for service meta
* fix tests
* adds docs
* better hashing for service meta, use helper for copying meta when registering service
* tried to be DRY, but looks like it would be more work to use the
helper function
Fixes#6041
Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
Adds a check for differences in `job.Diff` so that task group networks
and services, including new Consul connect stanzas, show up in the job
plan outputs.
* nomad: add admission controller framework
* nomad: add admission controller framework and Consul Connect hooks
* run admission controllers before checking permissions
* client: add default node meta for connect configurables
* nomad: remove validateJob func since it has been moved to admission controller
* nomad: use new TaskKind type
* client: use consts for connect sidecar image and log level
* Apply suggestions from code review
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
* nomad: add job register test with connect sidecar
* Update nomad/job_endpoint_hooks.go
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
* jobspec: breakup parse.go into smaller files
* add sidecar_task parsing to jobspec and api
* jobspec: combine service parsing logic for task and group service stanzas
* api: use slice of ConsulUpstream values instead of pointers
Also includes unit tests for binpacker and preemption.
The tests verify that network resources specified at the
task group level are properly accounted for
Here, we ensure that when leader only responds to RPC calls when state
store is up to date. At leadership transition or launch with restored
state, the server local store might not be caught up with latest raft
logs and may return a stale read.
The solution here is to have an RPC consistency read gate, enabled when
`establishLeadership` completes before we respond to RPC calls.
`establishLeadership` is gated by a `raft.Barrier` which ensures that
all prior raft logs have been applied.
Conversely, the gate is disabled when leadership is lost.
This is very much inspired by https://github.com/hashicorp/consul/pull/3154/files
Plan application should use a state snapshot at or after the Raft index
at which the plan was created otherwise it risks being rejected based on
stale data.
This commit adds a Plan.SnapshotIndex which is set by workers when
submitting plan. SnapshotIndex is set to the Raft index of the snapshot
the worker used to generate the plan.
Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex.
While RefreshIndex informs workers their StateStore is behind the
leader's, SnapshotIndex is a way to prevent the leader from using a
StateStore behind the worker's.
Plan.SnapshotIndex should be considered the *lower bound* index for
consistently handling plan application.
Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans *after* the
first plan after a leader election.
The Raft barrier on leader election ensures the leader's statestore has
caught up to the log index at which it was elected. This guarantees its
StateStore is at an index > lastPlanIndex.
Fix a case where `node.StatusUpdatedAt` was manipulated directly in
memory.
This ensures that StatusUpdatedAt is set in raft layer, and ensures that
the field is updated when node drain/eligibility is updated too.
* master: (912 commits)
Update redirects.txt
Added redirect for Spark guide link
client: log when server list changes
docs: mention regression in task config validation
fix update to changelog
update CHANGELOG with datacenter config validation https://github.com/hashicorp/nomad/pull/5665
typo: "atleast" -> "at least"
implement nomad exec for rkt
docs: fixed typo
use pty/tty terminology similar to github.com/kr/pty
vendor github.com/kr/pty
drivers: implement streaming exec for executor based drivers
executors: implement streaming exec
executor: scaffolding for executor grpc handling
client: expose allocated memory per task
client improve a comment in updateNetworks
stalebot: Add 'thinking' as an exempt label (#5684)
Added Sparrow link
update links to use new canonical location
Add redirects for restructing done in GH-5667
...
Fixes#1795
Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
Our testing so far indicates that ugorji/go/codec maintains backward
compatiblity with the version we are using now, for purposes of Nomad
serialization.
Using latest ugorji/go allows us to get back to using upstream library,
get get the optimizations benefits in RPC paths (including code
generation optimizations).
ugorji/go introduced two significant changes:
* time binary format in debb8e2d2e. Setting `h.BasicHandle.TimeNotBuiltin = true` restores old behavior
* ugorji/go started honoring `json` tag as well:
v1.1.4 is the latest but has a bug in handling RawString that's fixed in
d09a80c1e0
.
`Index` is already included as part of `WriteMeta` embedding.
This is a backward compatible change: Clients never read the field; and
Server refernces to `EmitNodeEventsResponse.Index` would be using the
value in `WriteMeta`, which is consistent with other response structs.
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.
Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
This adds a `nomad alloc stop` command that can be used to stop and
force migrate an allocation to a different node.
This is built on top of the AllocUpdateDesiredTransitionRequest and
explicitly limits the scope of access to that transition to expose it
under the alloc-lifecycle ACL.
The API returns the follow up eval that can be used as part of
monitoring in the CLI or parsed and used in an external tool.
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f1a4a6766a57b3311af806d22382609.
Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993a578474d0570819f99b7b6e027de40.
Revert "structs missed applying a style change from the review"
This reverts commit 658916e3274efa438beadc2535f47109d0c2f0f2.
Revert "client, structs comments"
This reverts commit be2838d6baa9d382a5013fa80ea016856f28ade2.
Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430e62d8e66267a724f006ae9abe1c63c.
Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9114e699bc662f50c8f12319b6b3445.
Revert "client Networks Equals is set equality"
This reverts commit f8d432345b54b1953a4a4c719b9269f845e3e573.
Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411cab328215def6508955b160a53452da3.
Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665ed30ab8d9586a59a74db75d51fd9226.
Revert "fix client-test, avoid hardwired platform dependecy on lo0"
This reverts commit e89dbb2ab182b6368507dbcd33c3342223eb0ae7.
Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6e0264d42a58410d840adde780a30f5.
Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e1d030193e054ec23474380ff3b9032.
Revert "refactor struts.RequestedDevice to have its own Equals"
This reverts commit 689782524090e51183474516715aa2f34908b8e6.
Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77bb3eaa4577772b36c62205061c92fa1.
Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb971ae42cc203560ed0029897aec2c9.
Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace5298d5ccf031eb7ebec93906e1d468b5.
Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6450e4462a1d02741571b5efcb45f0b.
Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23bafc87b9fce52b60c6026335bb722d.
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
Currently when operators need to log onto a machine where an alloc
is running they will need to perform both an alloc/job status
call and then a call to discover the node name from the node list.
This updates both the job status and alloc status output to include
the node name within the information to make operator use easier.
Closes#2359
Cloess #1180
Track the download status of each artifact independently so that if only
one of many artifacts fails to download, completed artifacts aren't
downloaded again.
Given that the values will rarely change, specially considering that any
changes would be backward incompatible change. As such, it's simpler to
keep syncing manually in the rare occasion and avoid the syncing code
overhead.
nomad/structs is an internal package and imports many libraries (e.g.
raft, codec) that are not relevant to api clients, and may cause
unnecessary dependency pain (e.g. `github.com/ugorji/go/codec`
version is very old now).
Here, we add a code generator that imports the relevant constants from
`nomad/structs`.
I considered using this approach for other structs, but didn't find a
quick viable way to reduce duplication. `nomad/structs` use values as
struct fields (e.g. `string`), while `api` uses value pointer (e.g.
`*string`) instead. Also, sometimes, `api` structs contain deprecated
fields or additional documentation, so simple copy-paste doesn't work.
For these reasons, I opt to keep the status quo.
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.
TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
This adds constraints for asserting that a given attribute or value
exists, or does not exist. This acts as a companion to =, or !=
operators, e.g:
```hcl
constraint {
attribute = "${attrs.type}"
operator = "!="
value = "database"
}
constraint {
attribute = "${attrs.type}"
operator = "is_set"
}
```
This commit implements an allocation selection algorithm for finding
allocations to preempt. It currently special cases network resource asks
from others (cpu/memory/disk/iops).