This fixes a few cases where driver eventor goroutines are leaked during normal operations, but especially so in tests. This change makes a few modifications: First, it switches drivers to use `Context`s to manage shutdown events. Previously, it relied on callers invoking a `.Shutdown()` function that is specific to internal drivers only and requires casting. Using `Context`s provides a consistent, idiomatic way to manage the lifecycle of both internal and external drivers. Also, I discovered a few places in the plugin catalog code where we dispense a temporary driver instance to inspect and validate the schema config without properly cleaning it up.
This package provides an implementation of the Nvidia device plugin.
Behavior
The Nvidia device plugin uses NVML bindings to query the available Nvidia devices and exposes them via the Fingerprint RPC. GPUs can be excluded from fingerprinting by setting the `ignored_gpu_ids` field. The plugin sends statistics for fingerprinted devices every `stats_period`.
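The exclusion step described above can be sketched as a simple set lookup. This is a minimal illustration, not the plugin's actual code: the function name `filterIgnored` and the modelling of devices as plain UUID strings are assumptions made for the example.

```go
package main

import "fmt"

// filterIgnored returns the UUIDs from detected that do not appear in ignored.
// Hypothetical helper: the real plugin works with NVML device structures,
// which are reduced to plain UUID strings here for illustration.
func filterIgnored(detected, ignored []string) []string {
	// Build a set so that each lookup is O(1).
	skip := make(map[string]struct{}, len(ignored))
	for _, id := range ignored {
		skip[id] = struct{}{}
	}

	kept := make([]string, 0, len(detected))
	for _, id := range detected {
		if _, found := skip[id]; !found {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	detected := []string{"uuid1", "uuid2", "uuid3"}
	ignored := []string{"uuid1", "uuid2"}
	fmt.Println(filterIgnored(detected, ignored)) // prints [uuid3]
}
```

Only the devices that survive this filter would then be reported through fingerprinting and included in the periodic statistics.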
Config
The configuration should be passed via an HCL file that begins with a top-level `config` stanza:

    config {
      ignored_gpu_ids    = ["uuid1", "uuid2"]
      fingerprint_period = "5s"
    }

The valid configuration options are:

- `ignored_gpu_ids` (`list(string)`: `[]`): list of GPU UUID strings that should not be exposed to Nomad.
- `fingerprint_period` (`string`: `"1m"`): interval at which the fingerprint process is repeated to identify possible changes.