---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
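
For example, a task can narrow what the Nvidia runtime mounts into its
container by setting one of these variables in its `env` stanza. The snippet
below is only a sketch: the `NVIDIA_DRIVER_CAPABILITIES` variable and its
`utility` value come from the Nvidia documentation linked above and may differ
across runtime versions.

```hcl
task "smi" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:9.0-base"
    command = "nvidia-smi"
  }

  # Illustrative: expose only the `utility` capability (enough for
  # `nvidia-smi`) rather than the runtime's default capability set.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "utility"
  }

  resources {
    device "nvidia/gpu" {}
  }
}
```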

## Installation Requirements

In order to use the `nvidia-gpu` plugin, the following prerequisites must be
met (a quick check follows the list):

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
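
One quick way to check the last requirement on a client node is to ask
`nvidia-smi` for the GPU name and driver version (shown here without output,
which varies by host):

```shell-session
$ nvidia-smi --query-gpu=name,driver_version --format=csv
```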

### Docker Driver Requirements

The Nvidia device plugin currently only supports the older v1.0 version of the
Docker runtime provided by Nvidia. In order to use the Nvidia device plugin with
the Docker driver, please follow the installation instructions for
[`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).
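
After installation, the Nvidia runtime should show up in the Docker daemon's
runtime list. One hedged way to confirm this (exact output differs across
Docker versions):

```shell-session
$ docker info | grep -i runtime
```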

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.
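
As a minimal sketch of where this stanza lives (the file name and the
surrounding `client` block are illustrative), the `plugin` block sits at the
top level of the client agent's configuration:

```hcl
# client.hcl (illustrative) -- the plugin stanza is a top-level block in the
# agent configuration, alongside blocks such as `client`.
client {
  enabled = true
}

plugin "nvidia-gpu" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
```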

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
datacenters = ["dc1"]
type = "batch"
group "smi" {
task "smi" {
driver = "docker"
config {
image = "nvidia/cuda:9.0-base"
command = "nvidia-smi"
}
resources {
2019-01-23 18:56:22 +00:00
device "nvidia/gpu" {
count = 1
# Add an affinity for a particular model
affinity {
attribute = "${device.model}"
value = "Tesla K80"
2019-01-30 13:23:03 +00:00
weight = 50
2019-01-23 18:56:22 +00:00
}
}
2019-01-23 18:43:13 +00:00
}
}
2019-01-23 00:30:57 +00:00
}
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00004477:00:00.0 Off | 0 |
| N/A 33C P8 37W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
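
In addition to the `affinity` shown above, the `device` stanza accepts
`constraint` blocks over the fingerprinted attributes listed earlier. The
fragment below is a sketch (it goes inside the task's `resources` stanza, and
the 10 GiB threshold is only an example):

```hcl
device "nvidia/gpu" {
  count = 1

  # Only match GPUs whose fingerprinted memory is at least 10 GiB.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "10 GiB"
  }
}
```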
[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'