Fixes#7681
The current behavior of the CPU fingerprinter in AWS is that it
reads the **current** speed from `/proc/cpuinfo` (`CPU MHz` field).
This is because the max CPU frequency is not available by reading
anything on the EC2 instance itself. Normally on Linux one would
look at e.g. `sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq`
or perhaps parse the values from the `CPU max MHz` field in
`/proc/cpuinfo`, but those values are not available.
Furthermore, no metadata about the CPU is made available in the
EC2 metadata service.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html
Since `go-psutil` cannot determine the max CPU speed it defaults to
the current CPU speed, which could be basically any number between
0 and the true max. This is particularly bad on large, powerful
reserved instances which often idle at ~800 MHz while Nomad does
its fingerprinting (typically IO bound), which Nomad then uses as
the max, which results in severe loss of available resources.
Since the CPU specification is unavailable programmatically (at least
not without sudo) use a best-effort lookup table. This table was
generated by going through every instance type in AWS documentation
and copy-pasting the numbers.
https://aws.amazon.com/ec2/instance-types/
This approach obviously is not ideal as future instance types will
need to be added as they are introduced to AWS. However, using the
table should only be an improvement over the status quo since right
now Nomad miscalculates available CPU resources on all instance types.
Use v1.1.5 of go-msgpack/codec/codecgen, so go-msgpack codecgen matches
the library version.
We branched off earlier to pick up
f51b518921
, but apparently that's not needed as we could customize the package via
`-c` argument.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
task shutdown_delay will currently only run if there are registered
services for the task. This implementation detail isn't explicity stated
anywhere and is defined outside of the service stanza.
This change moves shutdown_delay to be evaluated after prekill hooks are
run, outside of any task runner hooks.
just use time.sleep
The `Job.Deregister` call will block on the client CSI controller RPCs
while the alloc still exists on the Nomad client node. So we need to
make the volume claim reaping async from the `Job.Deregister`. This
allows `nomad job stop` to return immediately. In order to make this
work, this changeset changes the volume GC so that the GC jobs are on a
by-volume basis rather than a by-job basis; we won't have to query
the (possibly deleted) job at the time of volume GC. We smuggle the
volume ID and whether it's a purge into the GC eval ID the same way we
smuggled the job ID previously.
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.
The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.
This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
Fixes#6594#6711#6714#7567
e2e testing is still TBD in #6502
Before, we only passed the Nomad agent's configured Consul HTTP
address onto the `consul connect envoy ...` bootstrap command.
This meant any Consul setup with TLS enabled would not work with
Nomad's Connect integration.
This change now sets CLI args and Environment Variables for
configuring TLS options for communicating with Consul when doing
the envoy bootstrap, as described in
https://www.consul.io/docs/commands/connect/envoy.html#usage
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the
connect {
sidecar_service {
proxy {
expose {
paths = [{
path = <exposed endpoint>
protocol = <http or grpc>
local_path_port = <local endpoint port>
listener_port = <inbound mesh port>
}, ... ]
}
}
}
stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference
Combined with a task-group level network port-mapping in the form:
port "exposeExample" { to = -1 }
it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.
A future PR may introduce more automagic behavior, where we can do things like
1) auto-fill the 'expose.path.local_path_port' with the default value of the
'service.port' value for task-group level connect-enabled services.
2) automatically generate a port-mapping
3) enable an 'expose.checks' flag which automatically creates exposed endpoints
for every compatible consul service check (http/grpc checks on connect
enabled services).
* nomad/structs/structs: new NodeEventSubsystemCSI
* client/client: pass triggerNodeEvent in the CSIConfig
* client/pluginmanager/csimanager/instance: add eventer to instanceManager
* client/pluginmanager/csimanager/manager: pass triggerNodeEvent
* client/pluginmanager/csimanager/volume: node event on [un]mount
* nomad/structs/structs: use storage, not CSI
* client/pluginmanager/csimanager/volume: use storage, not CSI
* client/pluginmanager/csimanager/volume_test: eventer
* client/pluginmanager/csimanager/volume: event on error
* client/pluginmanager/csimanager/volume_test: check event on error
* command/node_status: remove an extra space in event detail format
* client/pluginmanager/csimanager/volume: use snake_case for details
* client/pluginmanager/csimanager/volume_test: snake_case details
The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset:
* adds retries and backoffs to the where they were needed but not implemented
* annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally
* added a timeout and cancellation context to the `Probe` call, which didn't have one.
The test inserts an alloc in the server state, but expect the client to
start the alloc runner for it almost immediately.
Here, we add a retry loop to check that the client start all expected
alloc runners eventually.
Fix a regression where we accidentally started treating non-AWS
environments as AWS environments, resulting in bad networking settings.
Two factors some at play:
First, in [1], we accidentally switched the ultimate AWS test from
checking `ami-id` to `instance-id`. This means that nomad started
treating more environments as AWS; e.g. Hetzner implements `instance-id`
but not `ami-id`.
Second, some of these environments return empty values instead of
errors! Hetzner returns empty 200 response for `local-ipv4`, resulting
into bad networking configuration.
This change fix the situation by restoring the check to `ami-id` and
ensuring that we only set network configuration when the ip address is
not-empty. Also, be more defensive around response whitespace input.
[1] https://github.com/hashicorp/nomad/pull/6779
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.
Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.
closes#7007
* nomad/structs/volumes: add MountOptions to volume request
* jobspec/test-fixtures/basic.hcl: add mount_options to volume block
* jobspec/parse_test: add expected MountOptions
* api/tasks: add mount_options
* jobspec/parse_group: use hcl decode not mapstructure, mount_options
* client/allocrunner/csi_hook: pass MountOptions through
client/allocrunner/csi_hook: add a VolumeMountOptions
client/allocrunner/csi_hook: drop Options
client/allocrunner/csi_hook: use the structs options
* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions
* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities
* plugins/csi/plugin: remove todo 7007 comment
* nomad/structs/csi: MountOptions
* api/csi: add options to the api for parsing, match structs
* plugins/csi/plugin: move VolumeMountOptions to structs
* api/csi: use specific type for mount_options
* client/allocrunner/csi_hook: merge MountOptions here
* rename CSIOptions to CSIMountOptions
* client/allocrunner/csi_hook
* client/pluginmanager/csimanager/volume
* nomad/structs/csi
* plugins/csi/fake/client: add PrevVolumeCapability
* plugins/csi/plugin
* client/pluginmanager/csimanager/volume_test: remove debugging
* client/pluginmanager/csimanager/volume: fix odd merging logic
* api: rename CSIOptions -> CSIMountOptions
* nomad/csi_endpoint: remove a 7007 comment
* command/alloc_status: show mount options in the volume list
* nomad/structs/csi: include MountOptions in the volume stub
* api/csi: add MountOptions to stub
* command/volume_status_csi: clean up csiVolMountOption, add it
* command/alloc_status: csiVolMountOption lives in volume_csi_status
* command/node_status: display mount flags
* nomad/structs/volumes: npe
* plugins/csi/plugin: npe in ToCSIRepresentation
* jobspec/parse_test: expand volume parse test cases
* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions
* command/volume_status_csi: copy paste error
* jobspec/test-fixtures/basic: hclfmt
* command/volume_status_csi: clean up csiVolMountOption