---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

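| Attribute          | Unit     |
| ------------------ | -------- |
| `memory`           | MiB      |
| `power`            | W (Watt) |
| `bar1`             | MiB      |
| `driver_version`   | string   |
| `cores_clock`      | MHz      |
| `memory_clock`     | MHz      |
| `pci_bandwidth`    | MB/s     |
| `display_state`    | string   |
| `persistence_mode` | string   |
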
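The fingerprinted attributes above can be used to filter which GPUs a task may be placed on. The following is a minimal sketch, assuming the attributes are exposed to the scheduler under `${device.attr.<name>}` as in Nomad's `device` block; the 4 GiB threshold is an illustrative value, not a recommendation.

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Illustrative threshold: only consider GPUs whose fingerprinted
    # `memory` attribute is at least 4 GiB.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```

This fragment belongs inside a `task` block. Because the attribute carries a unit, the comparison value is written with a unit as well.
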
## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

To use the `nvidia-gpu` device plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

To use the Nvidia device plugin with the Docker driver, follow the installation
instructions for [`nvidia-docker`]().

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and running.

- `ignored_gpu_ids` `(array: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The interval at which to fingerprint
  for device changes.

## Restrictions

The Nvidia integration only works with task drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

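As a sketch of the additional task configuration described under Runtime Environment, a task can set Nvidia container runtime variables in its `env` block. This assumes the `docker` driver and uses `NVIDIA_DRIVER_CAPABILITIES`, one of the variables listed in Nvidia's documentation; the task name `cuda-task` and the chosen value are illustrative only.

```hcl
# Hypothetical task name; the image is the one used elsewhere on this page.
task "cuda-task" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  # Read by the Nvidia container runtime, not by Nomad itself.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```

The plugin sets `NVIDIA_VISIBLE_DEVICES` for the task, so the sketch does not set it.
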
## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f

ID            = 4d46e59f
Name          = nomad
Class         =
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f

ID            = 4d46e59f
Name          = nomad
Class         =
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to verify that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed

ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description =
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed

Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'