---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
<thead>
<tr>
<th>Attribute</th>
<th>Unit</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<tt>memory</tt>
</td>
<td>MiB</td>
</tr>
<tr>
<td>
<tt>power</tt>
</td>
<td>W (Watt)</td>
</tr>
<tr>
<td>
<tt>bar1</tt>
</td>
<td>MiB</td>
</tr>
<tr>
<td>
<tt>driver_version</tt>
</td>
<td>string</td>
</tr>
<tr>
<td>
<tt>cores_clock</tt>
</td>
<td>MHz</td>
</tr>
<tr>
<td>
<tt>memory_clock</tt>
</td>
<td>MHz</td>
</tr>
<tr>
<td>
<tt>pci_bandwidth</tt>
</td>
<td>MB/s</td>
</tr>
<tr>
<td>
<tt>display_state</tt>
</td>
<td>string</td>
</tr>
<tr>
<td>
<tt>persistence_mode</tt>
</td>
<td>string</td>
</tr>
</tbody>
</table>
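
The attributes above can be referenced when a job requests a device. The following is a minimal sketch, assuming the `${device.attr.<name>}` interpolation syntax of Nomad's `device` block; the 4 GiB threshold is an arbitrary value chosen for illustration:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Hypothetical threshold: only place the task on GPUs that fingerprint
    # with at least 4 GiB of memory.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```

A constraint filters out devices that do not match, whereas an `affinity` only expresses a preference, so a constraint is the stricter choice when the workload cannot run on a smaller GPU.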

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.
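
One way to observe this variable is to run a task that prints it. The sketch below is illustrative only: the task name `show-gpus` is hypothetical, and it assumes the chosen image ships a `printenv` binary.

```hcl
task "show-gpus" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:9.0-base"
    # Assumes `printenv` is available in the image; prints the GPU IDs that
    # the plugin made visible to this task.
    command = "printenv"
    args    = ["NVIDIA_VISIBLE_DEVICES"]
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```

The printed value can then be read with `nomad alloc logs` for the resulting allocation.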

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
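
As a rough sketch of how a task might set such a variable, the example below uses the task's `env` block. `NVIDIA_DRIVER_CAPABILITIES` and its value are taken as an assumption from Nvidia's container runtime documentation, and the task name `cuda` is hypothetical; confirm the variables your runtime version supports before relying on them.

```hcl
task "cuda" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  env {
    # Assumed variable from Nvidia's container runtime documentation:
    # restricts the driver capabilities mounted into the container.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```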

## Installation Requirements

To use the `nvidia-gpu` device plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

To use the Nvidia device plugin with the Docker driver, follow the installation
instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The interval at which to fingerprint for
device changes.
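
As an illustration of how these options combine, the sketch below hides one GPU from Nomad and fingerprints less frequently. The UUID and the `"5m"` period are placeholder values chosen for this example, not recommendations.

```hcl
plugin "nvidia-gpu" {
  config {
    enabled = true

    # Placeholder UUID: this GPU is skipped during fingerprinting, so Nomad
    # never offers it to tasks.
    ignored_gpu_ids = ["GPU-fef8089b"]

    # Placeholder period: check for device changes every five minutes.
    fingerprint_period = "5m"
  }
}
```

Ignoring a GPU can be useful when a device on the host is reserved for work that runs outside Nomad.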

## Restrictions

The Nvidia integration only works with task drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'