---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

~> **Note**: The Nvidia device plugin setup has changed in Nomad 1.2. You must add a [`plugin`] block to your clients' configuration and install the [external Nvidia device plugin][nvidia_plugin_download] into their [`plugin_dir`] prior to upgrading. See the plugin options below for an example. Note that the job specification remains the same.
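A minimal sketch of that installation step (the release version, archive name, and destination directory here are illustrative; check [the releases page][nvidia_plugin_download] for current builds):

```shell-session
$ curl -fsSL -o nomad-device-nvidia.zip \
    https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip
$ unzip nomad-device-nvidia.zip -d /opt/nomad/plugins
```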
## Fingerprinted Attributes

| Attribute          | Unit     |
| ------------------ | -------- |
| `memory`           | MiB      |
| `power`            | W (Watt) |
| `bar1`             | MiB      |
| `driver_version`   | string   |
| `cores_clock`      | MHz      |
| `memory_clock`     | MHz      |
| `pci_bandwidth`    | MB/s     |
| `display_state`    | string   |
| `persistence_mode` | string   |
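These attributes can be used to steer placement from the job specification's `device` block. A minimal sketch, assuming a task that should only be placed on GPUs with sufficient memory (the `10 GiB` threshold is illustrative):

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Restrict placement to GPUs whose fingerprinted
    # memory attribute is at least 10 GiB.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "10 GiB"
    }
  }
}
```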
## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime environment. See [Nvidia's documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

The Nvidia device plugin currently only supports the older v1.0 version of the Docker driver provided by Nvidia. In order to use the Nvidia device plugin with the Docker driver, please follow the installation instructions for [`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for device changes.
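Because the plugin is distributed separately from the Nomad binary, the client agent also needs to know where to find it. A minimal client configuration sketch combining [`plugin_dir`] with the block above (the directory path is illustrative):

```hcl
# Directory the agent scans for external plugin binaries.
# Illustrative path; defaults to <data_dir>/plugins.
plugin_dir = "/opt/nomad/plugins"

plugin "nvidia-gpu" {
  config {
    enabled = true
  }
}
```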
## Restrictions

The Nvidia integration only works with drivers that natively integrate with Nvidia's [container runtime library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver] drivers. Support for [`lxc`][lxc-driver] should be possible by installing the [Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         =
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         =
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description =
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/