295 lines
8.0 KiB
Plaintext
295 lines
8.0 KiB
Plaintext
---
|
|
layout: docs
|
|
page_title: 'Device Plugins: Nvidia'
|
|
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
|
|
---
|
|
|
|
# Nvidia GPU Device Plugin
|
|
|
|
Name: `nomad-device-nvidia`
|
|
|
|
The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.
|
|
|
|
~> **Note**: The Nvidia device plugin setup has changed in Nomad 1.2. You must
|
|
add a [`plugin`] block to your clients configuration and install the
|
|
[external Nvidia device plugin][nvidia_plugin_download] into their
|
|
[`plugin_dir`] prior to upgrading. See plugin options below for an example.
|
|
Note the job specification remains the same.
|
|
|
|
## Fingerprinted Attributes
|
|
|
|
<table>
|
|
<thead>
|
|
<tr>
|
|
<th>Attribute</th>
|
|
<th>Unit</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
<tr>
|
|
<td>
|
|
<tt>memory</tt>
|
|
</td>
|
|
<td>MiB</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>power</tt>
|
|
</td>
|
|
<td>W (Watt)</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>bar1</tt>
|
|
</td>
|
|
<td>MiB</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>driver_version</tt>
|
|
</td>
|
|
<td>string</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>cores_clock</tt>
|
|
</td>
|
|
<td>MHz</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>memory_clock</tt>
|
|
</td>
|
|
<td>MHz</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>pci_bandwidth</tt>
|
|
</td>
|
|
<td>MB/s</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>display_state</tt>
|
|
</td>
|
|
<td>string</td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<tt>persistence_mode</tt>
|
|
</td>
|
|
<td>string</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
## Runtime Environment
|
|
|
|
The `nvidia-gpu` device plugin exposes the following environment variables:
|
|
|
|
- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.
|
|
|
|
### Additional Task Configurations
|
|
|
|
Additional environment variables can be set by the task to influence the runtime
|
|
environment. See [Nvidia's
|
|
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
|
|
|
|
## Installation Requirements
|
|
|
|
In order to use the `nomad-device-nvidia` device driver the following prerequisites must be met:
|
|
|
|
1. GNU/Linux x86_64 with kernel version > 3.10
|
|
2. NVIDIA GPU with Architecture > Fermi (2.1)
|
|
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
|
|
4. Docker v19.03+
|
|
|
|
### Container Toolkit Installation
|
|
|
|
Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
|
|
from Nvidia to prepare a machine to use docker containers with Nvidia GPUs. You should
|
|
be able to run this simple command to test your environment and produce meaningful
|
|
output.
|
|
|
|
```shell
|
|
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
|
|
```
|
|
|
|
|
|
## Plugin Configuration
|
|
|
|
```hcl
|
|
plugin "nomad-device-nvidia" {
|
|
config {
|
|
enabled = true
|
|
ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
|
|
fingerprint_period = "1m"
|
|
}
|
|
}
|
|
```
|
|
|
|
The `nomad-device-nvidia` device plugin supports the following configuration in the agent
|
|
config:
|
|
|
|
- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.
|
|
|
|
- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
|
|
should be ignored when fingerprinting.
|
|
|
|
- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
|
|
device changes.
|
|
|
|
## Limitations
|
|
|
|
The Nvidia integration only works with drivers who natively integrate with
|
|
Nvidia's [container runtime
|
|
library](https://github.com/NVIDIA/libnvidia-container).
|
|
|
|
Nomad has tested support with the [`docker` driver][docker-driver]. Support for
|
|
[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
|
|
but is not tested or documented by Nomad.
|
|
|
|
## Source Code & Compiled Binaries
|
|
|
|
The source code for this plugin can be found at hashicorp/nomad-device-nvidia. You
|
|
can also find pre-built binaries on the [releases page][nvidia_plugin_download].
|
|
|
|
## Examples
|
|
|
|
Inspect a node with a GPU:
|
|
|
|
```shell-session
|
|
$ nomad node status 4d46e59f
|
|
|
|
// ...TRUNCATED...
|
|
|
|
Device Resource Utilization
|
|
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
|
|
```
|
|
|
|
Display detailed statistics on a node with a GPU:
|
|
|
|
```shell-session
|
|
$ nomad node status -stats 4d46e59f
|
|
|
|
// ...TRUNCATED...
|
|
|
|
Device Resource Utilization
|
|
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
|
|
|
|
// ...TRUNCATED...
|
|
|
|
Device Stats
|
|
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
|
|
BAR1 buffer state = 2 / 16384 MiB
|
|
Decoder utilization = 0 %
|
|
ECC L1 errors = 0
|
|
ECC L2 errors = 0
|
|
ECC memory errors = 0
|
|
Encoder utilization = 0 %
|
|
GPU utilization = 0 %
|
|
Memory state = 0 / 11441 MiB
|
|
Memory utilization = 0 %
|
|
Power usage = 37 / 149 W
|
|
Temperature = 34 C
|
|
```
|
|
|
|
Run the following example job to see that the GPU was mounted in the
|
|
container:
|
|
|
|
```hcl
|
|
job "gpu-test" {
|
|
datacenters = ["dc1"]
|
|
type = "batch"
|
|
|
|
group "smi" {
|
|
task "smi" {
|
|
driver = "docker"
|
|
|
|
config {
|
|
image = "nvidia/cuda:11.0-base"
|
|
command = "nvidia-smi"
|
|
}
|
|
|
|
resources {
|
|
device "nvidia/gpu" {
|
|
count = 1
|
|
|
|
# Add an affinity for a particular model
|
|
affinity {
|
|
attribute = "${device.model}"
|
|
value = "Tesla K80"
|
|
weight = 50
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
```shell-session
|
|
$ nomad run example.nomad
|
|
==> Monitoring evaluation "21bd7584"
|
|
Evaluation triggered by job "gpu-test"
|
|
Allocation "d250baed" created: node "4d46e59f", group "smi"
|
|
Evaluation status changed: "pending" -> "complete"
|
|
==> Evaluation "21bd7584" finished with status "complete"
|
|
|
|
$ nomad alloc status d250baed
|
|
|
|
// ...TRUNCATED...
|
|
|
|
Task "smi" is "dead"
|
|
Task Resources
|
|
CPU Memory Disk Addresses
|
|
0/100 MHz 0 B/300 MiB 300 MiB
|
|
|
|
Device Stats
|
|
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
|
|
|
|
Task Events:
|
|
Started At = 2019-01-23T18:25:32Z
|
|
Finished At = 2019-01-23T18:25:34Z
|
|
Total Restarts = 0
|
|
Last Restart = N/A
|
|
|
|
Recent Events:
|
|
Time Type Description
|
|
2019-01-23T18:25:34Z Terminated Exit Code: 0
|
|
2019-01-23T18:25:32Z Started Task started by client
|
|
2019-01-23T18:25:29Z Task Setup Building Task Directory
|
|
2019-01-23T18:25:29Z Received Task received by client
|
|
|
|
$ nomad alloc logs d250baed
|
|
Wed Jan 23 18:25:32 2019
|
|
+-----------------------------------------------------------------------------+
|
|
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|
|
|-------------------------------+----------------------+----------------------+
|
|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|
|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|
|
|===============================+======================+======================|
|
|
| 0 Tesla K80 On | 00004477:00:00.0 Off | 0 |
|
|
| N/A 33C P8 37W / 149W | 0MiB / 11441MiB | 0% Default |
|
|
+-------------------------------+----------------------+----------------------+
|
|
|
|
+-----------------------------------------------------------------------------+
|
|
| Processes: GPU Memory |
|
|
| GPU PID Type Process name Usage |
|
|
|=============================================================================|
|
|
| No running processes found |
|
|
+-----------------------------------------------------------------------------+
|
|
```
|
|
|
|
|
|
[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
|
|
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
|
|
[java-driver]: /docs/drivers/java 'Nomad java Driver'
|
|
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
|
|
[`plugin`]: /docs/configuration/plugin
|
|
[`plugin_dir`]: /docs/configuration#plugin_dir
|
|
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
|
|
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
|
|
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
|
|
[source]: https://github.com/hashicorp/nomad-device-nvidia
|