Fixes#8000
When requesting a Service Identity token from Consul, use the TaskKind
of the Task to get at the service name associated with the task. In
the past using the TaskName worked because it was generated as a sidecar
task with a name that included the service. In the Native context, we
need to get at the service name in a more correct way, i.e. using the
TaskKind which is defined to include the service name.
Allow a `/v1/jobs?all_namespaces=true` to list all jobs across all
namespaces. The returned list is to contain a `Namespace` field
indicating the job namespace.
If ACL is enabled, the request token needs to be a management token or
have `namespace:list-jobs` capability on all existing namespaces.
The MVP for CSI in the 0.11.0 release of Nomad did not include support
for opaque volume parameters or volume context. This changeset adds
support for both.
This also moves args for ControllerValidateCapabilities into a struct.
The CSI plugin `ControllerValidateCapabilities` struct that we turn
into a CSI RPC is accumulating arguments, so moving it into a request
struct will reduce the churn of this internal API, make the plugin
code more readable, and make this method consistent with the other
plugin methods in that package.
This ensures that token revocation is idempotent and can handle when
tokens are revoked out of band.
Idempotency is important to handle some transient failures and retries.
Consider when a single token of a batch fails to be revoked, nomad would
retry revoking the entire batch; tokens already revoked should be
gracefully handled, otherwise, nomad may retry revoking the same
tokens forever.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.
When serializing structs with msgpack, only consider type tags of
`codec`.
Hashicorp/go-msgpack (based on ugorji/go) defaults to interpretting
`codec` tag if it's available, but falls to using `json` if `codec`
isn't present.
This behavior is surprising in cases where we want to serialize json
differently from msgpack, e.g. serializing `ConsulExposeConfig`.
When a CSI client RPC is given a specific node for a controller but
the lookup fails (because the node is gone or is an older version), we
fallthrough to select a node from all those available. This adds
logging to this case to aid in diagnostics.
The watcher goroutines will be automatically started if a volume has
updates, but when idle we shouldn't keep a goroutine running and
taking up memory.
This changeset implements a periodic garbage collection of CSI volumes
with missing allocations. This can happen in a scenario where a node
update fails partially and the allocation updates are written to raft
but the evaluations to GC the volumes are dropped. This feature will
cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1
get any stray claims cleaned up.
TestClientEndpoint_CreateNodeEvals flakes a bit but its output is very
confusing, as `structs.Evaluations` overrides GoString.
Here, we emit the entire struct of the evaluation, and hopefully we'll
figure out the problem the next time it happens
Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on
upgrades or on API handling when user misses the value.
The scheduler already treats `""` value as binpack. This PR merely
ensures that the operator API returns the effective value.
The `volumewatcher` has a 250ms batch window so claim updates will not
typically be large enough to risk exceeding the maximum raft message
size. But large jobs might have enough volume claims that this could
be a danger. Set a maximum batch size of 100 messages per
batch (roughly 33K), as a very conservative safety/robustness guard.
Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>
An operator could submit a scale request including a negative count
value. This negative value caused the Nomad server to panic. The
fix adds validation to the submitted count, returning an error to
the caller if it is negative.
* volumewatcher: remove redundant log fields
The constructor for `volumeWatcher` already sets a `logger.With` that
includes the volume ID and namespace fields. Remove them from the
various trace logs.
* volumewatcher: advance state for controller already released
One way of bypassing client RPCs in testing is to set a claim status
to controller-detached, but this results in an incorrect claim state
when we checkpoint.
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
In this changeset:
* If a Nomad client node is running both a controller and a node
plugin (which is a common case), then if only the controller or the
node is removed, the plugin was not being updated with the correct
counts.
* The existing test for plugin cleanup didn't go back to the state
store, which normally is ok but is complicated in this case by
denormalization which changes the behavior. This commit makes the
test more comprehensive.
* Set "controller required" when plugin has `PUBLISH_READONLY`. All
known controllers that support `PUBLISH_READONLY` also support
`PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
* Only create plugins when the allocs for those plugins are
healthy. If we allow a plugin to be created for the first time when
the alloc is not healthy, then we'll recreate deleted plugins when
the job's allocs all get marked terminal.
* Terminal plugin alloc updates should cleanup the plugin. The client
fingerprint can't tell if the plugin is unhealthy intentionally (for
the case of updates or job stop). Allocations that are
server-terminal should delete themselves from the plugin and trigger
a plugin self-GC, the same as an unused node.
The nil-check here is left-over from an earlier approach that didn't
get merged. It doesn't do anything for us now as we can't ever pass it
`nil` and if we leave it in the `getVolume` call it guards will panic
anyways.
- track lastHeartbeat, the client local time of the last successful
heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
servers are still partioned
We should only remove the `ReadAllocs`/`WriteAllocs` values for a
volume after the claim has entered the "ready to free"
state. The volume will eventually be released as expected. But
querying the volume API will show the volume is released before the
controller unpublish has finished and this can cause a race with
starting new jobs.
Test updates are to cover cases where we're dropping claims but not
running through the whole reaping process.
This changeset adds a subsystem to run on the leader, similar to the
deployment watcher or node drainer. The `Watcher` performs a blocking
query on updates to the `CSIVolumes` table and triggers reaping of
volume claims.
This will avoid tying up scheduling workers by immediately sending
volume claim workloads into their own loop, rather than blocking the
scheduling workers in the core GC job doing things like talking to CSI
controllers
The volume watcher is enabled on leader step-up and disabled on leader
step-down.
The volume claim GC mechanism now makes an empty claim RPC for the
volume to trigger an index bump. That in turn unblocks the blocking
query in the volume watcher so it can assess which claims can be
released for a volume.
The `CSIVolumeClaim` fields were added after 0.11.1, so claims made
before that may be missing the value. Repair this when we read the
volume out of the state store.
The `NodeID` field was added after 0.11.0, so we need to ensure it's
been populated during upgrades from 0.11.0.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
The BinPackIter accounted for node reservations twice when scoring nodes
which could bias scores toward nodes with reservations.
Pseudo-code for previous algorithm:
```
proposed = reservedResources + sum(allocsResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are added to the total resources used by
allocations, and then the node's reserved resources are later
substracted from the node's overall resources.
The new algorithm is:
```
proposed = sum(allocResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are no longer added to the total resources
used by allocations.
My guess as to how this bug happened is that the resource utilization
variable (`util`) is calculated and returned by the `AllocsFit` function
which needs to take reserved resources into account as a basic
feasibility check.
To avoid re-calculating alloc resource usage (because there may be a
large number of allocs), we reused `util` in the `ScoreFit` function.
`ScoreFit` properly accounts for reserved resources by subtracting them
from the node's overall resources. However since `util` _also_ took
reserved resources into account the score would be incorrect.
Prior to the fix the added test output:
```
Node: reserved Score: 1.0000
Node: reserved2 Score: 1.0000
Node: no-reserved Score: 0.9741
```
The scores being 1.0 for *both* nodes with reserved resources is a good
hint something is wrong as they should receive different scores. Upon
further inspection the double accounting of reserved resources caused
their scores to be >1.0 and clamped.
After the fix the added test outputs:
```
Node: no-reserved Score: 0.9741
Node: reserved Score: 0.9480
Node: reserved2 Score: 0.8717
```
The field names within the structs representing the Connect proxy
definition were not the same (nomad/structs/ vs api/), causing the
values to be lost in translation for the 'nomad job inspect' command.
Since the field names already shipped in v0.11.0 we cannot simply
fix the names. Instead, use the json struct tag on the structs/ structs
to remap the name to match the publicly expose api/ package on json
encoding.
This means existing jobs from v0.11.0 will continue to work, and
the JSON API for job submission will remain backwards compatible.
Follow-up for a method missed in the refactor for #7688. The
`volAndPluginLookup` method is only ever called from the server's `CSI`
RPC and never the `ClientCSI` RPC, so move it into that scope.
The current design of `ClientCSI` RPC requires that callers in the
server know about the free-standing `nodeForControllerPlugin`
function. This makes it difficult to send `ClientCSI` RPC messages
from subpackages of `nomad` and adds a bunch of boilerplate to every
server-side caller of a controller RPC.
This changeset makes it so that the `ClientCSI` RPCs will populate and
validate the controller's client node ID if it hasn't been passed by
the caller, centralizing the logic of picking and validating
controller targets into the `nomad.ClientCSI` struct.
Before, if the sidecar_service stanza of a connect enabled service
was missing, the job submission would cause a panic in the nomad
agent. Since the panic was happening in the API handler the agent
itself continued running, but this change will the condition more
gracefully.
By fixing the `Copy` method, the API handler now returns the proper
error.
$ nomad job run foo.nomad
Error submitting job: Unexpected response code: 500 (1 error occurred:
* Task group api validation failed: 2 errors occurred:
* Missing tasks for task group
* Task group service validation failed: 1 error occurred:
* Service[0] count-api validation failed: 1 error occurred:
* Consul Connect must be native or use a sidecar service
The `Job.Deregister` call will block on the client CSI controller RPCs
while the alloc still exists on the Nomad client node. So we need to
make the volume claim reaping async from the `Job.Deregister`. This
allows `nomad job stop` to return immediately. In order to make this
work, this changeset changes the volume GC so that the GC jobs are on a
by-volume basis rather than a by-job basis; we won't have to query
the (possibly deleted) job at the time of volume GC. We smuggle the
volume ID and whether it's a purge into the GC eval ID the same way we
smuggled the job ID previously.
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.
The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
* nomad/state/state_store: error message copy/paste error
* nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse
* nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister
* nomad/core_sched: make volumeClaimReap available without a CoreSched
* nomad/job_endpoint: Deregister return early if the job is missing
* nomad/job_endpoint_test: job Deregistion is idempotent
* nomad/core_sched: conditionally ignore alloc status in volumeClaimReap
* nomad/job_endpoint: volumeClaimReap all allocations, even running
* nomad/core_sched_test: extra argument to collectClaimsToGCImpl
* nomad/job_endpoint: job deregistration is not idempotent
I hypothesize that the flakiness in rolling update is due to shutting
down s3 server before s4 is properly added as a voter.
The chain of the flakiness is as follows:
1. Bootstrap with s1, s2, s3
2. Add s4
3. Wait for servers to register with 3 voting peers
* But we already have 3 voters (s1, s2, and s3)
* s4 is added as a non-voter in Raft v3 and must wait until autopilot promots it
4. Test proceeds without s4 being a voter
5. s3 shutdown
6. cluster changes stall due to leader election and too many pending configuration
changes (e.g. removing s3 from raft, promoting s4).
Here, I have the test wait until s4 is marked as a voter before shutting
down s3, so we don't have too many configuration changes at once.
In https://circleci.com/gh/hashicorp/nomad/57092, I noticed the
following events:
```
TestAutopilot_RollingUpdate: autopilot_test.go:204: adding server s4
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] nomad/serf.go:60: nomad: adding server: server="nomad-137.global (Addr: 127.0.0.1:9177) (DC: dc1)"
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] raft/raft.go:1018: nomad.raft: updating configuration: command=AddNonvoter server-id=c54b5bf4-1159-34f6-032d-56aefeb08425 server-addr=127.0.0.1:9177 servers="[{Suffrage:Voter ID:df01ba65-d1b2-17a9-f792-a4459b3a7c09 Address:127.0.0.1:9171} {Suffrage:Voter ID:c3337778-811e-2675-87f5-006309888387 Address:127.0.0.1:9173} {Suffrage:Voter ID:186d5e15-c473-e2b3-b5a4-3259a84e10ef Address:127.0.0.1:9169} {Suffrage:Nonvoter ID:c54b5bf4-1159-34f6-032d-56aefeb08425 Address:127.0.0.1:9177}]"
TestAutopilot_RollingUpdate: autopilot_test.go:218: shutting down server s3
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.797Z [INFO] raft/replication.go:456: nomad.raft: aborting pipeline replication: peer="{Nonvoter c54b5bf4-1159-34f6-032d-56aefeb08425 127.0.0.1:9177}"
TestAutopilot_RollingUpdate: autopilot_test.go:235: waiting for s4 to stabalize and be promoted
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.975Z [ERROR] raft/raft.go:1656: nomad.raft: failed to make requestVote RPC: target="{Voter c3337778-811e-2675-87f5-006309888387 127.0.0.1:9173}" error="dial tcp 127.0.0.1:9173: connect: connection refused"
TestAutopilot_RollingUpdate: retry.go:121: autopilot_test.go:241: don't want "c3337778-811e-2675-87f5-006309888387"
autopilot_test.go:241: didn't find map[c54b5bf4-1159-34f6-032d-56aefeb08425:true] in []raft.ServerID{"df01ba65-d1b2-17a9-f792-a4459b3a7c09", "186d5e15-c473-e2b3-b5a4-3259a84e10ef"}
```
Note how s3, c3337778, is present in the peers list in the final
failure, but s4, c54b5bf4, is added as a Nonvoter and isn't present in
the final peers list.
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.
This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
* nomad/state/state_store: enforce that the volume namespace exists
* nomad/csi_endpoint_test: a couple of broken namespaces now
* nomad/csi_endpoint_test: one more test
* nomad/node_endpoint_test: use structs.DefaultNamespace
* nomad/state/state_store_test: use DefaultNamespace
Part of #6120
Building on the support for enabling connect proxy paths in #7323, this change
adds the ability to configure the 'service.check.expose' flag on group-level
service check definitions for services that are connect-enabled. This is a slight
deviation from the "magic" that Consul provides. With Consul, the 'expose' flag
exists on the connect.proxy stanza, which will then auto-generate expose paths
for every HTTP and gRPC service check associated with that connect-enabled
service.
A first attempt at providing similar magic for Nomad's Consul Connect integration
followed that pattern exactly, as seen in #7396. However, on reviewing the PR
we realized having the `expose` flag on the proxy stanza inseperably ties together
the automatic path generation with every HTTP/gRPC defined on the service. This
makes sense in Consul's context, because a service definition is reasonably
associated with a single "task". With Nomad's group level service definitions
however, there is a reasonable expectation that a service definition is more
abstractly representative of multiple services within the task group. In this
case, one would want to define checks of that service which concretely make HTTP
or gRPC requests to different underlying tasks. Such a model is not possible
with the course `proxy.expose` flag.
Instead, we now have the flag made available within the check definitions themselves.
By making the expose feature resolute to each check, it is possible to have
some HTTP/gRPC checks which make use of the envoy exposed paths, as well as
some HTTP/gRPC checks which make use of some orthongonal port-mapping to do
checks on some other task (or even some other bound port of the same task)
within the task group.
Given this example,
group "server-group" {
network {
mode = "bridge"
port "forchecks" {
to = -1
}
}
service {
name = "myserver"
port = 2000
connect {
sidecar_service {
}
}
check {
name = "mycheck-myserver"
type = "http"
port = "forchecks"
interval = "3s"
timeout = "2s"
method = "GET"
path = "/classic/responder/health"
expose = true
}
}
}
Nomad will automatically inject (via job endpoint mutator) the
extrapolated expose path configuration, i.e.
expose {
path {
path = "/classic/responder/health"
protocol = "http"
local_path_port = 2000
listener_port = "forchecks"
}
}
Documentation is coming in #7440 (needs updating, doing next)
Modifications to the `countdash` examples in https://github.com/hashicorp/demo-consul-101/pull/6
which will make the examples in the documentation actually runnable.
Will add some e2e tests based on the above when it becomes available.
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the
connect {
sidecar_service {
proxy {
expose {
paths = [{
path = <exposed endpoint>
protocol = <http or grpc>
local_path_port = <local endpoint port>
listener_port = <inbound mesh port>
}, ... ]
}
}
}
stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference
Combined with a task-group level network port-mapping in the form:
port "exposeExample" { to = -1 }
it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.
A future PR may introduce more automagic behavior, where we can do things like
1) auto-fill the 'expose.path.local_path_port' with the default value of the
'service.port' value for task-group level connect-enabled services.
2) automatically generate a port-mapping
3) enable an 'expose.checks' flag which automatically creates exposed endpoints
for every compatible consul service check (http/grpc checks on connect
enabled services).
* nomad/structs/structs: new NodeEventSubsystemCSI
* client/client: pass triggerNodeEvent in the CSIConfig
* client/pluginmanager/csimanager/instance: add eventer to instanceManager
* client/pluginmanager/csimanager/manager: pass triggerNodeEvent
* client/pluginmanager/csimanager/volume: node event on [un]mount
* nomad/structs/structs: use storage, not CSI
* client/pluginmanager/csimanager/volume: use storage, not CSI
* client/pluginmanager/csimanager/volume_test: eventer
* client/pluginmanager/csimanager/volume: event on error
* client/pluginmanager/csimanager/volume_test: check event on error
* command/node_status: remove an extra space in event detail format
* client/pluginmanager/csimanager/volume: use snake_case for details
* client/pluginmanager/csimanager/volume_test: snake_case details
The `csi_plugins` and `csi_volumes` tables were missing support for
snapshot persist and restore. This means restoring a snapshot would
result in missing information for CSI.
In tests where the logger is a test logger, emitting a trace log in a
background thread while it's shutting down may trigger a panic. Thus
avoid logging Trace if err != nil. Note that we already log an error
when err isn't a trace.
This fixes cases where tests panic with a trace like:
```
panic: Log in goroutine after TestAllocGarbageCollector_MakeRoomFor_MaxAllocs has completed
goroutine 30 [running]:
testing.(*common).logDepth(0xc000aa9e60, 0xc000c4a000, 0xab, 0x3)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:680 +0x4d3
testing.(*common).log(...)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:662
testing.(*common).Logf(0xc000aa9e60, 0x690b941, 0x4, 0xc001366c00, 0x2, 0x2)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:701 +0x7e
github.com/hashicorp/nomad/helper/testlog.(*writer).Write(0xc000a82a60, 0xc0000b48c0, 0xab, 0x13f, 0x0, 0x0, 0x0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/helper/testlog/testlog.go:34 +0x106
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*writer).Flush(0xc000a80900, 0xbf9870f000000001, 0x20a87556e, 0x8b12bc0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/writer.go:29 +0x14f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).log(0xc000e2c180, 0xc0003b6880, 0x17, 0x1, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:139 +0x15d
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).Trace(0xc000e2c180, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:446 +0x7a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*interceptLogger).Trace(0xc0002f1ad0, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/interceptlogger.go:48 +0x9c
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc0002f2380)
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:147 +0x1125
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:89 +0x1e3
FAIL github.com/hashicorp/nomad/client 10.605s
FAIL
```
* added method to retrieve all scaling policies for use in snapshotting, plus test
* better testing for ScalingPoliciesByNamespace
* added scaling policy snapshot persist and restore (and test of restore)
manually tested snapshot restore.
resolves#7539
* nomad/structs/csi: delete just one plugin type from a node
* nomad/structs/csi: add DeleteAlloc
* nomad/state/state_store: add deleteJobFromPlugin
* nomad/state/state_store: use DeleteAlloc not DeleteNodeType
* move CreateTestCSIPlugin to state to avoid an import cycle
* nomad/state/state_store_test: delete a plugin by deleting its jobs
* nomad/*_test: move CreateTestCSIPlugin to state
* nomad/state/state_store: update one plugin per transaction
* command/plugin_status_test: move CreateTestCSIPlugin
* nomad: csi: handle nils CSIPlugin methods, clarity
If not in use and not changing external ids, it should not be an error to register a volume again.
* nomad/state/state_store: make volume registration idempotent
TestStateStore_Indexes specifically tests for `nodes` index, but asserts
on the exact number of indexes present in the state. This is fragile
and will break almost everytime we add a state index.
Plugins track the client nodes where they are placed. On client
updates, remove the client from the plugin tracking if the client is
no longer running an instance of that controller/node plugin.
Extends the state store tests to ensure deregistration works as
expected and that controllers and nodes are being tracked
independently.
- tg.Count defaults to tg.Scaling.Min if present (falls back on previous default of 1 if Scaling is absent)
- Validate() enforces tg.Scaling.Min <= tg.Count <= tg.Scaling.Max
modification in ApiScalingPolicyToStructs, api.TaskGroup.Validate so that defaults are handled for TaskGroup.Count and
* nomad/structs/csi: split CanWrite into health, in use
* scheduler/scheduler: expose AllocByID in the state interface
* nomad/state/state_store_test
* scheduler/stack: SetJobID on the matcher
* scheduler/feasible: when a volume writer is in use, check if it's us
* scheduler/feasible: remove SetJob
* nomad/state/state_store: denormalize allocs before Claim
* nomad/structs/csi: return errors on claim, with context
* nomad/csi_endpoint_test: new alloc doesn't look like an update
* nomad/state/state_store_test: change test reference to CanWrite
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.
Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.
closes#7007
* nomad/structs/volumes: add MountOptions to volume request
* jobspec/test-fixtures/basic.hcl: add mount_options to volume block
* jobspec/parse_test: add expected MountOptions
* api/tasks: add mount_options
* jobspec/parse_group: use hcl decode not mapstructure, mount_options
* client/allocrunner/csi_hook: pass MountOptions through
client/allocrunner/csi_hook: add a VolumeMountOptions
client/allocrunner/csi_hook: drop Options
client/allocrunner/csi_hook: use the structs options
* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions
* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities
* plugins/csi/plugin: remove todo 7007 comment
* nomad/structs/csi: MountOptions
* api/csi: add options to the api for parsing, match structs
* plugins/csi/plugin: move VolumeMountOptions to structs
* api/csi: use specific type for mount_options
* client/allocrunner/csi_hook: merge MountOptions here
* rename CSIOptions to CSIMountOptions
* client/allocrunner/csi_hook
* client/pluginmanager/csimanager/volume
* nomad/structs/csi
* plugins/csi/fake/client: add PrevVolumeCapability
* plugins/csi/plugin
* client/pluginmanager/csimanager/volume_test: remove debugging
* client/pluginmanager/csimanager/volume: fix odd merging logic
* api: rename CSIOptions -> CSIMountOptions
* nomad/csi_endpoint: remove a 7007 comment
* command/alloc_status: show mount options in the volume list
* nomad/structs/csi: include MountOptions in the volume stub
* api/csi: add MountOptions to stub
* command/volume_status_csi: clean up csiVolMountOption, add it
* command/alloc_status: csiVolMountOption lives in volume_csi_status
* command/node_status: display mount flags
* nomad/structs/volumes: npe
* plugins/csi/plugin: npe in ToCSIRepresentation
* jobspec/parse_test: expand volume parse test cases
* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions
* command/volume_status_csi: copy paste error
* jobspec/test-fixtures/basic: hclfmt
* command/volume_status_csi: clean up csiVolMountOption
* acl/policy: add the volume ACL policies
* nomad/csi_endpoint: enforce ACLs for volume access
* nomad/search_endpoint_oss: volume acls
* acl/acl: add plugin read as a global policy
* acl/policy: add PluginPolicy global cap type
* nomad/csi_endpoint: check the global plugin ACL policy
* nomad/mock/acl: PluginPolicy
* nomad/csi_endpoint: fix list rebase
* nomad/core_sched_test: new test since #7358
* nomad/csi_endpoint_test: use correct permissions for list
* nomad/csi_endpoint: allowCSIMount keeps ACL checks together
* nomad/job_endpoint: check mount permission for jobs
* nomad/job_endpoint_test: need plugin read, too
* nomad/state/schema: use the namespace compound index
* scheduler/scheduler: CSIVolumeByID interface signature namespace
* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace
* scheduler/feasible: pass the captured namespace to CSIVolumeByID
* nomad/state/state_store: use namespace in csi_volume index
* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim
* nomad/core_sched: pass the namespace in volumeClaimReap
* nomad/node_endpoint_test: namespaces in Claim testing
* nomad/csi_endpoint: pass RequestNamespace to state.*
* nomad/csi_endpoint_test: appropriately failed test
* command/alloc_status_test: appropriately failed test
* node_endpoint_test: avoid notTheNamespace for the job
* scheduler/feasible_test: call SetJob to capture the namespace
* nomad/csi_endpoint: ACL check the req namespace, query by namespace
* nomad/state/state_store: remove deregister namespace check
* nomad/state/state_store: remove unused CSIVolumes
* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace
* nomad/csi_endpoint: ACL check
* nomad/state/state_store_test: remove call to state.CSIVolumes
* nomad/core_sched_test: job namespace match so claim gc works
This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.
Nomad clients will push node updates during client restart which can
cause an extra claim for a volume by the same alloc. If an alloc
already claims a volume, we can allow it to be treated as a valid
claim and continue.
* nomad/structs/csi: new RemoteID() uses the ExternalID if set
* nomad/csi_endpoint: pass RemoteID to volume request types
* client/pluginmanager/csimanager/volume: pass RemoteID to NodePublishVolume
* api/allocations: GetTaskGroup finds the taskgroup struct
* command/node_status: display CSI volume names
* nomad/state/state_store: new CSIVolumesByNodeID
* nomad/state/iterator: new SliceIterator type implements memdb.ResultIterator
* nomad/csi_endpoint: deal with a slice of volumes
* nomad/state/state_store: CSIVolumesByNodeID return a SliceIterator
* nomad/structs/csi: CSIVolumeListRequest takes a NodeID
* nomad/csi_endpoint: use the return iterator
* command/agent/csi_endpoint: parse query params for CSIVolumes.List
* api/nodes: new CSIVolumes to list volumes by node
* command/node_status: use the new list endpoint to print volumes
* nomad/state/state_store: error messages consider the operator
* command/node_status: include the Provider
* client/allocrunner/csi_hook: tag errors
* nomad/client_csi_endpoint: tag errors
* nomad/client_rpc: remove an unnecessary error tag
* nomad/state/state_store: ControllerRequired fix intent
We use ControllerRequired to indicate that a volume should use the
publish/unpublish workflow, rather than that it has a controller. We
need to check both RequiresControllerPlugin and SupportsAttachDetach
from the fingerprint to check that.
* nomad/csi_endpoint: tag errors
* nomad/csi_endpoint_test: longer error messages, mock fingerprints
We denormalize the `CSIVolume` struct when we query it from the state
store by getting the plugin and its health. But unless we copy the
volume, this denormalization gets synced back to the state store
without passing through the fsm (which is invalid).
We denormalize the `CSIPlugin` struct when we query it from the state
store by getting the current set of allocations that provide the
plugin. But unless we copy the plugin, this denormalization gets
synced back to the state store and each time we query we'll add
another copy of the current allocations.
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
* command/csi: csi, csi_plugin, csi_volume
* helper/funcs: move ExtraKeys from parse_config to UnusedKeys
* command/agent/config_parse: use helper.UnusedKeys
* api/csi: annotate CSIVolumes with hcl fields
* command/csi_plugin: add Synopsis
* command/csi_volume_register: use hcl.Decode style parsing
* command/csi_volume_list
* command/csi_volume_status: list format, cleanup
* command/csi_plugin_list
* command/csi_plugin_status
* command/csi_volume_deregister
* command/csi_volume: add Synopsis
* api/contexts/contexts: add csi search contexts to the constants
* command/commands: register csi commands
* api/csi: fix struct tag for linter
* command/csi_plugin_list: unused struct vars
* command/csi_plugin_status: unused struct vars
* command/csi_volume_list: unused struct vars
* api/csi: add allocs to CSIPlugin
* command/csi_plugin_status: format the allocs
* api/allocations: copy Allocation.Stub in from structs
* nomad/client_rpc: add some error context with Errorf
* api/csi: collapse read & write alloc maps to a stub list
* command/csi_volume_status: cleanup allocation display
* command/csi_volume_list: use Schedulable instead of Healthy
* command/csi_volume_status: use Schedulable instead of Healthy
* command/csi_volume_list: sprintf string
* command/csi: delete csi.go, csi_plugin.go
* command/plugin: refactor csi components to sub-command plugin status
* command/plugin: remove csi
* command/plugin_status: remove csi
* command/volume: remove csi
* command/volume_status: split out csi specific
* helper/funcs: add RemoveEqualFold
* command/agent/config_parse: use helper.RemoveEqualFold
* api/csi: do ,unusedKeys right
* command/volume: refactor csi components to `nomad volume`
* command/volume_register: split out csi specific
* command/commands: use the new top level commands
* command/volume_deregister: hardwired type csi for now
* command/volume_status: csiFormatVolumes rescued from volume_list
* command/plugin_status: avoid a panic on no args
* command/volume_status: avoid a panic on no args
* command/plugin_status: predictVolumeType
* command/volume_status: predictVolumeType
* nomad/csi_endpoint_test: move CreateTestPlugin to testing
* command/plugin_status_test: use CreateTestCSIPlugin
* nomad/structs/structs: add CSIPlugins and CSIVolumes search consts
* nomad/state/state_store: add CSIPlugins and CSIVolumesByIDPrefix
* nomad/search_endpoint: add CSIPlugins and CSIVolumes
* command/plugin_status: move the header to the csi specific
* command/volume_status: move the header to the csi specific
* nomad/state/state_store: CSIPluginByID prefix
* command/status: rename the search context to just Plugins/Volumes
* command/plugin,volume_status: test return ids now
* command/status: rename the search context to just Plugins/Volumes
* command/plugin_status: support -json and -t
* command/volume_status: support -json and -t
* command/plugin_status_csi: comments
* command/*_status: clean up text
* api/csi: fix stale comments
* command/volume: make deregister sound less fearsome
* command/plugin_status: set the id length
* command/plugin_status_csi: more compact plugin health
* command/volume: better error message, comment
In #7252 we removed the `DevDisableBootstrap` flag to require tests to
honor only `BootstrapExpect`, in order to reduce a source of test
flakiness. This changeset applies the same fix to the CSI tests.
* structs: add ControllerRequired, volume.Name, no plug.Type
* structs: Healthy -> Schedulable
* state_store: Healthy -> Schedulable
* api: add ControllerRequired to api data types
* api: copy csi structs changes
* nomad/structs/csi: include name and external id
* api/csi: include Name and ExternalID
* nomad/structs/csi: comments for the 3 ids
* structs: CSIInfo include AllocID, CSIPlugins no Jobs
* state_store: eliminate plugin Jobs, delete an empty plugin
* nomad/structs/csi: detect empty plugins correctly
* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID
* client/pluginmanager/csimanager/instance: allocID
* client/pluginmanager/csimanager/fingerprint: set AllocID
* client/node_updater: split controller and node plugins
* api/csi: remove Jobs
The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.
Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.
* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint
* client/dynamicplugins: lift AllocID into the struct from Options
* api/csi_test: remove Jobs test
* nomad/structs/csi: CSIPlugins has an array of allocs
* nomad/state/state_store: implement CSIPluginDenormalize
* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc
* nomad/csi_endpoint_test: defer deleteNodes for clarity
* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
Currently the handling of CSINode RPCs does not correctly handle
forwarding RPCs to Nodes.
This commit fixes this by introducing a shim RPC
(nomad/client_csi_enpdoint) that will correctly forward the request to
the owning node, or submit the RPC to the client.
In the process it also cleans up handling a little bit by adding the
`CSIControllerQuery` embeded struct for required forwarding state.
The CSIControllerQuery embeding the requirement of a `PluginID` also
means we could move node targetting into the shim RPC if wanted in the
future.
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.
For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.
The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.
This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
Currently, the client has to ship an entire allocation to the server as
part of performing a VolumeClaim, this has a few problems:
Firstly, it means the client is sending significantly more data than is
required (an allocation contains the entire contents of a Nomad job,
alongside other irrelevant state) which has a non-zero (de)serialization
cost.
Secondly, because the allocation was never re-fetched from the state
store, it means that we were potentially open to issues caused by stale
state on a misbehaving or malicious client.
The change removes both of those issues at the cost of a couple of more
state store lookups, but they should be relatively cheap.
We also now provide the CSIVolume in the response for a claim, so the
client can perform a Claim without first going ahead and fetching all of
the volumes.
This PR implements some intitial support for doing deeper validation of
a volume during its registration with the server. This allows us to
validate the capabilities before users attempt to use the volumes during
most cases, and also prevents registering volumes without first setting
up a plugin, which should help to catch typos and the like during
registration.
This does have the downside of requiring users to wait for (1) instance
of a plugin to be running in their cluster before they can register
volumes.
Nomad servers need to make requests to CSI controller plugins running
on a client for publish/unpublish. The RPC needs to look up the client
node based on the plugin, load balancing across controllers, and then
perform the required client RPC to that node (via server forwarding if
neccessary).
This commit refactors the ACL code for the CSI endpoint to support
environments that run without acls enabled (e.g developer environments)
and also provides an easy way to restrict which endpoints may be
accessed with a client's SecretID to limit the blast radius of a
malicious client on the state of the environment.
The `ControllerPublishVolumeResponse` CSI RPC includes the publish
context intended to be passed by the orchestrator as an opaque value
to the node plugins. This changeset adds it to our response to a
volume claim request to proxy the controller's response back to the
client node.
When an alloc is marked terminal, and after node unstage/unpublish
have been called, the client will sync the terminal alloc state with
the server via `Node.UpdateAlloc` RPC.
This changeset implements releasing the volume claim for each volume
associated with the terminal alloc. It doesn't yet implement the RPC
call we need to make to the `ControllerUnpublishVolume` CSI RPC.
When the client receives an allocation which includes a CSI volume,
the alloc runner will block its main `Run` loop. The alloc runner will
issue a `VolumeClaim` RPC to the Nomad servers. This changeset
implements the portions of the `VolumeClaim` RPC endpoint that have
not been previously completed.
* state_store: csi volumes/plugins store the index in the txn
* nomad: csi_endpoint_test require index checks need uint64()
* nomad: other tests using int 0 not uint64(0)
* structs: pass index into New, but not other struct methods
* state_store: csi plugin indexes, use new struct interface
* nomad: csi_endpoint_test check index/query meta (on explicit 0)
* structs: NewCSIVolume takes an index arg now
* scheduler/test: NewCSIVolume takes an index arg now
This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to
the leader's loop for scheduling core job evals. Right now this is an
empty method body without even a config file stanza. Later changesets
will implement the logic of volume publication GC.
state_store: change claim counts
state_store: get volumes by all, by driver
state_store: process volume claims
state_store: csi volume register error on update
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:
* A `csi_plugin` stanza as part of a Nomad task configuration, to
allow a task to expose that it is a plugin.
* A new task runner hook: `csi_plugin_supervisor`. This hook does two
things. When the `csi_plugin` stanza is detected, it will
automatically configure the plugin task to receive bidirectional
mounts to the CSI intermediary directory. At runtime, it will then
perform an initial heartbeat of the plugin and handle submitting it to
the new `dynamicplugins.Registry` for further use by the client, and
then run a lightweight heartbeat loop that will emit task events
when health changes.
* The `dynamicplugins.Registry` for handling plugins that run
as Nomad tasks, in contrast to the existing catalog that requires
`go-plugin` type plugins and to know the plugin configuration in
advance.
* The `csimanager` which fingerprints CSI plugins, in a similar way to
`drivermanager` and `devicemanager`. It currently only fingerprints
the NodeID from the plugin, and assumes that all plugins are
monolithic.
Missing features
* We do not use the live updates of the `dynamicplugin` registry in
the `csimanager` yet.
* We do not deregister the plugins from the client when they shutdown
yet, they just become indefinitely marked as unhealthy. This is
deliberate until we figure out how we should manage deploying new
versions of plugins/transitioning them.
allow oss to parse sink duration
clean up audit sink parsing
ent eventer config reload
fix typo
SetEnabled to eventer interface
client acl test
rm dead code
fix failing test