When an allocation runs for a task driver that can't support volume
mounts, the mount fails in a way that can be hard to understand. With
host volumes this usually means failing silently, whereas with CSI the
operator gets inscrutable internals exposed in the `nomad alloc status`
output.
This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.
Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.
Also note that the MountConfig enum starts with "supports all" in order
to support community plugins in a backwards-compatible way, rather than
cutting them off from volume mounting unexpectedly.
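As a rough sketch of the shape of this change (field and constant names
here are illustrative and may differ from the actual driver plugin
API):

```go
package main

import "fmt"

// MountConfigSupport reports which mount configurations a task driver
// supports. The zero value means "supports all", so community plugins
// built before this field existed keep working unchanged.
type MountConfigSupport int

const (
	MountConfigSupportAll MountConfigSupport = iota
	MountConfigSupportNone
)

// Capabilities stands in for the driver capabilities response,
// extended with the new field.
type Capabilities struct {
	// ... existing capability fields ...
	MountConfigs MountConfigSupport
}

// validateMountSupport is roughly what csi_hook/volume_hook do before
// attempting a mount: return a user-friendly error instead of letting
// the mount fail silently or inscrutably.
func validateMountSupport(caps *Capabilities, driver string) error {
	if caps.MountConfigs == MountConfigSupportNone {
		return fmt.Errorf("task driver %q does not support volume mounts", driver)
	}
	return nil
}

func main() {
	caps := &Capabilities{MountConfigs: MountConfigSupportNone}
	fmt.Println(validateMountSupport(caps, "exec"))
}
```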
Following the new volumewatcher in #7794 and performance improvements
to it that landed afterwards, there's no particular reason we should
be threading claim releases through the GC eval rather than writing an
empty `CSIVolumeClaimRequest` with the mode set to
`CSIVolumeClaimRelease`, just as the GC evaluation would do.
Also, by batching up these raft messages, we can reduce the number of
raft writes by one and the number of cross-server RPCs by one per
volume we release claims on.
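A rough sketch of the batching, with stand-in types for the claim
request and the raft apply:

```go
package main

import "fmt"

type CSIVolumeClaimMode int

const CSIVolumeClaimRelease CSIVolumeClaimMode = iota

// CSIVolumeClaimRequest stands in for the real raft message; an empty
// claim with the mode set to release is exactly what the GC eval
// would have written.
type CSIVolumeClaimRequest struct {
	VolumeID  string
	Namespace string
	Claim     CSIVolumeClaimMode
}

// releaseClaims batches one claim-release message per volume into a
// single apply, instead of threading each release through a GC eval.
func releaseClaims(apply func([]*CSIVolumeClaimRequest) error,
	namespace string, volIDs []string) error {

	batch := make([]*CSIVolumeClaimRequest, 0, len(volIDs))
	for _, id := range volIDs {
		batch = append(batch, &CSIVolumeClaimRequest{
			VolumeID:  id,
			Namespace: namespace,
			Claim:     CSIVolumeClaimRelease, // empty claim, release mode
		})
	}
	return apply(batch) // one raft write for the whole batch
}

func main() {
	apply := func(reqs []*CSIVolumeClaimRequest) error {
		fmt.Printf("applying %d claim releases in one batch\n", len(reqs))
		return nil
	}
	_ = releaseClaims(apply, "default", []string{"vol-1", "vol-2"})
}
```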
The `stats_hook` writes an Error log every time an allocation becomes
terminal. This is a normal condition, not an error; a real error
condition, like a failure to collect the stats, is logged later. The
message only creates log noise, which makes for a particularly bad
operator experience with heavy batch workloads.
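In code, the change presumably amounts to something like the following
sketch (stand-in names; the real hook lives in the client's task
runner): log the terminal-task case at Debug instead of Error.

```go
package main

import (
	"context"
	"errors"

	hclog "github.com/hashicorp/go-hclog"
)

type statsHook struct {
	logger hclog.Logger
}

func (h *statsHook) collectOnce(ctx context.Context,
	collect func(context.Context) error) {

	if err := collect(ctx); err != nil {
		// The task becoming terminal cancels the context; that is a
		// normal condition, not an error, so don't log at Error level.
		if errors.Is(ctx.Err(), context.Canceled) {
			h.logger.Debug("stats collection stopped: task is terminal")
			return
		}
		// Genuine collection failures are still logged as errors.
		h.logger.Error("failed to collect stats", "error", err)
	}
}

func main() {
	h := &statsHook{logger: hclog.Default()}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the task becoming terminal
	h.collectOnce(ctx, func(ctx context.Context) error { return ctx.Err() })
}
```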
The tasklet passes the timeout for the script check into the task
driver's `Exec`, and it's up to the task driver to enforce it via a Go
`context.WithDeadline`. In practice, this deadline starts before the
task driver begins setting up the execution environment (because we
need it to do things like time out Docker API calls).
Under even moderate load, the time it takes to set up the execution
context for the script check regularly exceeds a second or two. This
can cause script checks to time out unexpectedly, or even never execute
if the context expires before the task driver ever gets a chance to
`execve`.
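To make the race concrete, here's a self-contained sketch (the driver
`Exec` and the setup delay are stand-ins, not the real interface):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func runScriptCheck(driverExec func(context.Context, []string) error,
	timeout time.Duration, cmd []string) error {
	// The deadline starts ticking here, before the task driver has
	// set up the execution environment.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return driverExec(ctx, cmd)
}

func main() {
	slowExec := func(ctx context.Context, cmd []string) error {
		// Simulate 2s of execution-environment setup (e.g. Docker API
		// calls) before the command would run.
		time.Sleep(2 * time.Second)
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("%v never ran: %w", cmd, err)
		}
		return nil
	}
	// A 1s script check timeout expires during setup, so the check
	// never gets a chance to execve.
	fmt.Println(runScriptCheck(slowExec, time.Second, []string{"check.sh"}))
}
```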
This changeset adds a notice to operators about setting script check
timeouts with plenty of padding and what to monitor for problems.
Some of the CSI RPC endpoints were missing validation that the ID or
the Volume definition was present. This could result in nonsense
`CSIVolume` structs being written to raft during registration. This
changeset corrects that bug and adds validation checks to present
nicer error messages to operators in some other cases.
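A sketch of the added checks, with stand-in types in place of the real
`nomad/structs` definitions:

```go
package main

import (
	"errors"
	"fmt"
)

type CSIVolume struct {
	ID       string
	PluginID string
}

func (v *CSIVolume) Validate() error {
	if v.PluginID == "" {
		return errors.New("missing plugin ID")
	}
	return nil
}

type CSIVolumeRegisterRequest struct {
	Volumes []*CSIVolume
}

// validateRegisterRequest rejects nonsense requests before anything
// can be written to raft, so operators get a clear error instead of a
// garbage CSIVolume entry.
func validateRegisterRequest(args *CSIVolumeRegisterRequest) error {
	if len(args.Volumes) == 0 {
		return errors.New("missing volume definition")
	}
	for _, vol := range args.Volumes {
		if vol.ID == "" {
			return errors.New("missing volume ID")
		}
		if err := vol.Validate(); err != nil {
			return fmt.Errorf("volume validation failed: %v", err)
		}
	}
	return nil
}

func main() {
	err := validateRegisterRequest(&CSIVolumeRegisterRequest{
		Volumes: []*CSIVolume{{ID: ""}},
	})
	fmt.Println(err) // missing volume ID
}
```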
This creates a FAQ entry to provide a home for additional context
around bootstrapping, links from the API page to the
`default_server_config` attribute, and adds a sample API response to
discuss "Updated: false".

Fixes #8000
When requesting a Service Identity token from Consul, use the TaskKind
of the Task to get at the service name associated with the task. In the
past, using the TaskName worked because the task was generated as a
sidecar task with a name that included the service. In the Native
context, we need to get at the service name in a more correct way, i.e.
by using the TaskKind, which is defined to include the service name.
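For illustration, a sketch of pulling the service name out of a kind
string; the "<kind>:<service>" format matches how kinds are described
here, but the `Value` helper is a stand-in for Nomad's own definition:

```go
package main

import (
	"fmt"
	"strings"
)

type TaskKind string

// Value returns the service-name portion of the kind, i.e. everything
// after the first colon.
func (k TaskKind) Value() string {
	parts := strings.SplitN(string(k), ":", 2)
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

func main() {
	kind := TaskKind("connect-native:web")
	// Request the SI token for the service named by the kind, not by
	// the task's name.
	fmt.Println("derive SI token for service:", kind.Value())
}
```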
Some CSI plugins don't return much for errors over the gRPC socket
beyond the bare-minimum error codes. This changeset improves the
operator experience by unpacking the error codes when available and
wrapping the error with some user-friendly direction.
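A minimal sketch of the unpacking using grpc-go's `status` and `codes`
packages; the specific guidance strings are illustrative:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// friendlyCSIError unpacks a gRPC status code when one is available
// and wraps the error with operator-facing direction.
func friendlyCSIError(err error) error {
	if err == nil {
		return nil
	}
	st, ok := status.FromError(err)
	if !ok {
		return err // not a gRPC status error; pass it through
	}
	switch st.Code() {
	case codes.NotFound:
		return fmt.Errorf(
			"volume not found: check the volume ID and plugin health: %v", err)
	case codes.InvalidArgument:
		return fmt.Errorf(
			"validation error: check the volume's capabilities and mount options: %v", err)
	default:
		return err
	}
}

func main() {
	err := status.Error(codes.NotFound, "rpc error")
	fmt.Println(friendlyCSIError(err))
}
```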
Improving these errors also revealed a bad comparison using
`require.Error` where `require.EqualError` should have been used in the
test code for plugin errors. This defect in turn was hiding a bug in
volume validation, where we were overly permissive in allowing mount
flags; that bug is now fixed.
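To illustrate the test defect with a hypothetical validator (not the
actual Nomad test):

```go
package example

import (
	"fmt"
	"testing"

	"github.com/stretchr/testify/require"
)

// validateMountFlags is a stand-in for the real volume validation.
func validateMountFlags(flags []string) error {
	for _, f := range flags {
		if f != "ro" && f != "rw" {
			return fmt.Errorf("invalid mount flag: %q", f)
		}
	}
	return nil
}

func TestMountFlagValidation(t *testing.T) {
	err := validateMountFlags([]string{"ro", "bogus"})

	// Too permissive: this passes as long as *any* error is returned,
	// which can hide validation returning the wrong error entirely.
	require.Error(t, err)

	// Better: asserts the exact error message, so an overly
	// permissive or plain wrong validator fails the test.
	require.EqualError(t, err, `invalid mount flag: "bogus"`)
}
```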
The presence of Storybook’s preview-head.html file in the repository
is a constant annoyance: it’s only needed for Storybook and it changes
all the time, producing a lot of Git noise. By making it a separate
step to have the Ember CLI server running before starting Storybook, we
no longer need preview-head in the repository. It had to be checked in
because of a race condition: it was sometimes not generated in time for
Storybook’s parallel startup.