open-nomad/website/pages/docs/internals/filesystem.mdx
Tim Gross c15a16301e
docs: internals documentation for alloc filesystem (#9195)
We recently added documentation disambiguating the terminology of the
allocation/task working directories. This changeset adds an internals document
that describes in more detail exactly what goes into the allocation working
directory, how this interacts with the filesystem isolation provided by task
drivers, and how this interacts with features like `artifact` and `template`.

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2020-11-04 09:59:19 -05:00


---
layout: docs
page_title: Filesystem
sidebar_title: Filesystem
description: |-
  Nomad creates an allocation working directory for every allocation. Learn what
  goes into the working directory and how it interacts with Nomad task drivers.
---
# Filesystem
Nomad creates a working directory for each allocation on a client. This
directory can be found in the Nomad [`data_dir`] at
`./alloc/«alloc_id»`. The allocation working directory is where Nomad
creates task directories and directories shared between tasks, writes logs for
tasks, downloads artifacts, and renders templates.
An allocation with two tasks (named `task1` and `task2`) will have an
allocation directory like the one below.
```shell-session
.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task1.stderr.0
│   │   ├── task1.stdout.0
│   │   ├── task2.stderr.0
│   │   └── task2.stdout.0
│   └── tmp
├── task1
│   ├── local
│   ├── secrets
│   └── tmp
└── task2
    ├── local
    ├── secrets
    └── tmp
```
- **alloc/**: This directory is shared across all tasks in an allocation and
  can be used to store data that needs to be used by multiple tasks, such as a
  log shipper. This is the directory that's provided to the task as the
  `NOMAD_ALLOC_DIR`. Note that this `alloc/` directory is not the same as the
  "allocation working directory", which is the top-level directory. All tasks
  in a task group can read and write to the `alloc/` directory. Within the
  `alloc/` directory are three standard directories:
  - **alloc/data/**: This directory is the location used by the
    [`ephemeral_disk`] stanza for shared data.
  - **alloc/logs/**: This directory is the location of the log files for every
    task within an allocation. The `nomad alloc logs` command streams these
    files to your terminal.
  - **alloc/tmp/**: A temporary directory used as scratch space by task drivers.
- **«taskname»**: Each task has a **task working directory** with the same name as
  the task. Tasks in a task group can't read each other's task working
  directory. Depending on the task driver's [filesystem isolation mode], a
  task may not be able to access the task working directory. Within the
  task working directory are three standard directories:
  - **«taskname»/local/**: This directory is the location provided to the task as the
    `NOMAD_TASK_DIR`. Note this is not the same as the "task working
    directory". This directory is private to the task.
  - **«taskname»/secrets/**: This directory is the location provided to the task as
    `NOMAD_SECRETS_DIR`. The contents of files in this directory cannot be read
    by the `nomad alloc fs` command. It can be used to store secret data that
    should not be visible outside the task.
  - **«taskname»/tmp/**: A temporary directory used as scratch space by task drivers.
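
As a rough illustration of how tasks share the `alloc/` directory, the sketch
below runs two tasks in one group: one appends to a file under
`${NOMAD_ALLOC_DIR}/data` and the other follows the same file. The job, group,
and task names, the `busybox:1.36` image, and the `heartbeat` file are all
assumptions made for this example, not part of Nomad itself.

```hcl
job "shared-alloc-dir" {
  datacenters = ["dc1"]

  # Both tasks must be in the same group to land in the same allocation.
  group "app" {
    # Writer: appends a timestamp to a file in the shared alloc/data directory.
    task "writer" {
      driver = "docker"

      config {
        image   = "busybox:1.36" # hypothetical image choice
        command = "/bin/sh"
        args    = ["-c", "while true; do date >> ${NOMAD_ALLOC_DIR}/data/heartbeat; sleep 5; done"]
      }
    }

    # Reader: follows the same file through its own view of NOMAD_ALLOC_DIR.
    task "reader" {
      driver = "docker"

      config {
        image   = "busybox:1.36" # hypothetical image choice
        command = "/bin/sh"
        args    = ["-c", "touch ${NOMAD_ALLOC_DIR}/data/heartbeat; tail -f ${NOMAD_ALLOC_DIR}/data/heartbeat"]
      }
    }
  }
}
```

If this behaves as described above, `nomad alloc logs «alloc_id» reader`
should stream the timestamps written by the `writer` task, while neither task
can see the other's `local/` or `secrets/` directory.
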
The allocation working directory is the directory you see when using the
`nomad alloc fs` command. If you were to run `nomad alloc fs` against the
allocation that made the working directory shown above, you'd see the
following:
```shell-session
$ nomad alloc fs c0b2245f
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:39Z alloc/
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:32Z task1/
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:39Z task2/
$ nomad alloc fs c0b2245f alloc/
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:32Z data/
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:39Z logs/
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:32Z tmp/
$ nomad alloc fs c0b2245f task1/
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T18:00:33Z local/
drwxrwxrwx 60 B 2020-10-27T18:00:32Z secrets/
dtrwxrwxrwx 4.0 KiB 2020-10-27T18:00:32Z tmp/
```
## Task Drivers and Filesystem Isolation Modes
Depending on the task driver, the task's working directory may also be the
root directory for the running task. This is determined by the task driver's
[filesystem isolation capability].
### `image` isolation
Task drivers like `docker` or `qemu` use `image` isolation, where the task
driver isolates task filesystems as machine images. These filesystems are
owned by the task driver's external process and not by Nomad itself. These
filesystems will not typically be found anywhere in the allocation working
directory. For example, Docker containers will have their overlay filesystem
unpacked to `/var/run/docker/containerd/«container_id»` by default.
Nomad will provide the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
`NOMAD_SECRETS_DIR` to tasks with `image` isolation, typically by
bind-mounting them to the task driver's filesystem.
You can see an example of `image` isolation by running the following minimal
job:
```hcl
job "example" {
  datacenters = ["dc1"]

  task "task1" {
    driver = "docker"

    config {
      image = "redis:6.0"
    }
  }
}
```
If you look at the allocation working directory from the host, you'll see a
minimal filesystem tree:
```shell-session
.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task1.stderr.0
│   │   └── task1.stdout.0
│   └── tmp
└── task1
    ├── local
    ├── secrets
    └── tmp
```
The `nomad alloc fs` command shows the same bare directory tree:
```shell-session
$ nomad alloc fs b0686b27
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T18:51:54Z alloc/
drwxrwxrwx 4.0 KiB 2020-10-27T18:51:54Z task1/
$ nomad alloc fs b0686b27 task1
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T18:51:54Z local/
drwxrwxrwx 60 B 2020-10-27T18:51:54Z secrets/
dtrwxrwxrwx 4.0 KiB 2020-10-27T18:51:54Z tmp/
$ nomad alloc fs b0686b27 task1/local
Mode Size Modified Time Name
```
If you inspect the Docker container that's created, you'll see three
directories bind-mounted into the container:
```shell-session
$ docker inspect 32e | jq '.[0].HostConfig.Binds'
[
"/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/alloc:/alloc",
"/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/local:/local",
"/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/secrets:/secrets"
]
```
The root filesystem inside the container can see these three mounts, along
with the rest of the container filesystem:
```shell-session
$ docker exec -it 32e /bin/sh
# ls /
alloc boot dev home lib64 media opt root sbin srv tmp var
bin data etc lib local mnt proc run secrets sys usr
```
Note that because only these three directories are bind-mounted into the
container filesystem, nothing written elsewhere in the allocation working
directory is accessible inside the container. This
means templates, artifacts, and dispatch payloads for tasks with `image`
isolation must be written into the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, or
`NOMAD_SECRETS_DIR`.
To work around this limitation, you can use the task driver's mounting
capabilities to mount one of the three directories to another location in the
task. For example, with the Docker driver you can use the driver's `mounts`
block to bind a secret written by a `template` block to the
`NOMAD_SECRETS_DIR` into a configuration directory elsewhere in the task:
```hcl
job "example" {
  datacenters = ["dc1"]

  task "task1" {
    driver = "docker"

    config {
      image = "redis:6.0"

      # Bind the task's secrets directory into the container at /etc/redis.d.
      mounts = [{
        type     = "bind"
        source   = "secrets"
        target   = "/etc/redis.d"
        readonly = true
      }]
    }

    # template is a task-level block, not part of the driver config.
    template {
      destination = "${NOMAD_SECRETS_DIR}/redis.conf"
      data        = <<EOT
{{ with secret "secrets/data/redispass" }}
requirepass {{ .Data.data.passwd }}{{ end }}
EOT
    }
  }
}
```
### `chroot` isolation
Task drivers like `exec` or `java` (on Linux) use `chroot` isolation, where
the task driver isolates task filesystems with `chroot` or `pivot_root`. These
isolated filesystems will be built inside the task working directory.
You can see an example of `chroot` isolation by running the following minimal
job on Linux:
```hcl
job "example" {
  datacenters = ["dc1"]

  task "task2" {
    driver = "exec"

    config {
      command = "/bin/sh"
      args    = ["-c", "sleep 600"]
    }
  }
}
```
If you look at the allocation working directory from the host, you'll see a
filesystem tree that has been populated with the task driver's [chroot
contents], in addition to the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
`NOMAD_SECRETS_DIR`:
```shell-session
.
├── alloc
│   ├── container
│   ├── data
│   ├── logs
│   └── tmp
└── task2
    ├── alloc
    ├── bin
    ├── dev
    ├── etc
    ├── executor.out
    ├── lib
    ├── lib32
    ├── lib64
    ├── local
    ├── proc
    ├── run
    ├── sbin
    ├── secrets
    ├── sys
    ├── tmp
    └── usr
```
Likewise, the root directory of the task is now available in the `nomad alloc
fs` command output:
```shell-session
$ nomad alloc fs eebd13a7
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T19:05:24Z alloc/
drwxrwxrwx 4.0 KiB 2020-10-27T19:05:24Z task2/
$ nomad alloc fs eebd13a7 task2
Mode Size Modified Time Name
drwxrwxrwx 4.0 KiB 2020-10-27T19:05:24Z alloc/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z bin/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:24Z dev/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z etc/
-rw-r--r-- 297 B 2020-10-27T19:05:24Z executor.out
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z lib/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z lib32/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z lib64/
drwxrwxrwx 4.0 KiB 2020-10-27T19:05:22Z local/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:24Z proc/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z run/
drwxr-xr-x 12 KiB 2020-10-27T19:05:22Z sbin/
drwxrwxrwx 60 B 2020-10-27T19:05:22Z secrets/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:24Z sys/
dtrwxrwxrwx 4.0 KiB 2020-10-27T19:05:22Z tmp/
drwxr-xr-x 4.0 KiB 2020-10-27T19:05:22Z usr/
```
Nomad will provide the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
`NOMAD_SECRETS_DIR` to tasks with `chroot` isolation. But unlike with `image`
isolation, Nomad does not need to bind-mount the `NOMAD_TASK_DIR` directory
because it can be directly created inside the chroot.
```shell-session
$ nomad alloc exec eebd13a7 /bin/sh
$ mount
...
/dev/mapper/root on /alloc type ext4 (rw,relatime,errors=remount-ro,data=ordered)
tmpfs on /secrets type tmpfs (rw,noexec,relatime,size=1024k)
...
```
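
The [chroot contents] copied into the task working directory are controlled by
the client agent configuration. As a minimal sketch, assuming you want a
smaller chroot than the default, a `chroot_env` block maps paths on the host to
paths inside the chroot. The particular paths listed here are illustrative
assumptions; a real task may need additional libraries or configuration files.

```hcl
client {
  # Each entry maps a source path on the host (left) to a destination path
  # inside the chroot built in the task working directory (right).
  chroot_env {
    "/bin"   = "/bin"
    "/etc"   = "/etc"
    "/lib"   = "/lib"
    "/lib64" = "/lib64"
    "/usr"   = "/usr"
  }
}
```

Keeping this list short reduces how much is copied into each task working
directory, at the cost of tasks failing if a path they need is missing.
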
### `none` isolation
The `raw_exec` task driver (or the `java` task driver on Windows) uses the
`none` filesystem isolation mode. This means the task driver does not isolate
the filesystem for the task, and the task can read and write anywhere the
user that's running Nomad can.
You can see an example of `none` isolation by running the following minimal
`raw_exec` job on Linux or Unix:
```hcl
job "example" {
  datacenters = ["dc1"]

  task "task3" {
    driver = "raw_exec"

    config {
      command = "/bin/sh"
      args    = ["-c", "sleep 600"]
    }
  }
}
```
If you look at the allocation working directory from the host, you'll see a
minimal filesystem tree:
```shell-session
.
├── alloc
│   ├── data
│   ├── logs
│   │   ├── task3.stderr.0
│   │   └── task3.stdout.0
│   └── tmp
└── task3
    ├── executor.out
    ├── local
    ├── secrets
    └── tmp
```
The `nomad alloc fs` command shows the same bare directory tree:
```shell-session
$ nomad alloc fs 87ec7d12 task3
Mode Size Modified Time Name
-rw-r--r-- 140 B 2020-10-27T19:15:33Z executor.out
drwxrwxrwx 4.0 KiB 2020-10-27T19:15:33Z local/
drwxrwxrwx 60 B 2020-10-27T19:15:33Z secrets/
dtrwxrwxrwx 4.0 KiB 2020-10-27T19:15:33Z tmp/
```
But if you use `nomad alloc exec` to view the filesystem from inside the
task, you'll see that the task has access to the entire root
filesystem. The `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and `NOMAD_SECRETS_DIR`
point to the filepath on the host, not a path anchored in the task working
directory. And the task is running as `root`, because the Nomad client agent
is running as `root`. This is why the `raw_exec` driver is disabled by
default.
```shell-session
$ nomad alloc exec 87ec7d12 /bin/sh
# ls /
bin dev home lib lib64 lost+found mnt proc run snap sys usr vmlinuz
boot etc initrd.img lib32 libx32 media opt root sbin srv tmp var
# echo $NOMAD_SECRETS_DIR
/var/nomad/alloc/87ec7d12-5e35-8fba-96cc-09e5376be15a/task3/secrets
# whoami
root
```
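
Because `raw_exec` is disabled by default, running the job above requires
opting in on the client. A minimal sketch of the client agent configuration
that does this is shown below; where the file lives on your clients is up to
your own deployment.

```hcl
plugin "raw_exec" {
  config {
    # raw_exec provides no filesystem isolation and runs tasks as the same
    # user as the Nomad client agent, so it must be enabled explicitly.
    enabled = true
  }
}
```
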
## Templates, Artifacts, and Dispatch Payloads
The other contents of the allocation working directory depend on what features
the job specification uses. The allocation working directory is populated by
other features in a specific order:
* The allocation working directory is created.
* The ephemeral disk data is [migrated] from any previous allocation.
* [CSI volumes] are staged.
* Then, for each task:
  * Task working directories are created.
  * [Dispatch payloads] are written.
  * [Artifacts] are downloaded.
  * [Templates] are rendered.
  * The task is started by the task driver, which includes all bind mounts and
    [volume mounts].
Dispatch payloads, artifacts, and templates are written to the task working
directory before a task can start because the resulting files may be a binary
or image run by the task. For example, an `artifact` can be used to download a
Docker image or .jar file, or a `template` can be used to render a shell
script that's run by `exec`.
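
The shell-script case can be sketched as follows. The `template` block renders
`run.sh` into the task's `local/` directory before the `exec` driver starts the
task, so the script already exists when `/bin/sh` is asked to run it. The
group name, script name, and script contents are assumptions made for this
example.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "group1" {
    task "task1" {
      driver = "exec"

      config {
        # Run the rendered script through the shell, so the file does not
        # need to be marked executable.
        command = "/bin/sh"
        args    = ["${NOMAD_TASK_DIR}/run.sh"]
      }

      # Rendered before the task starts; "local/" is relative to the task
      # working directory and is what the task sees as NOMAD_TASK_DIR.
      template {
        destination = "local/run.sh"
        data        = <<EOT
#!/bin/sh
echo "started in allocation {{ env "NOMAD_ALLOC_ID" }}"
sleep 600
EOT
      }
    }
  }
}
```
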
The `artifact` and `template` blocks write their data to a destination
relative to the task working directory, not the `NOMAD_TASK_DIR`. For task
drivers with `image` filesystem isolation, this means the `destination` field
path should be prefixed with either `NOMAD_TASK_DIR` or
`NOMAD_SECRETS_DIR`. Otherwise, the file will not be visible from inside the
resulting container. (The `dispatch_payload` block always writes its data to
the `NOMAD_TASK_DIR`.)
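
As a sketch of how destinations resolve for a task with `image` isolation, the
job below writes an artifact under `local/` and a template under the secrets
directory, so both are visible inside the container. The artifact URL, the
group name, and the file contents are hypothetical.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "group1" {
    task "task1" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }

      # Relative to the task working directory: the file lands under
      # task1/local/conf, which the container sees through NOMAD_TASK_DIR.
      artifact {
        source      = "https://example.com/redis/extra.conf" # hypothetical URL
        destination = "local/conf"
      }

      # Written into the task's secrets directory (NOMAD_SECRETS_DIR).
      template {
        destination = "${NOMAD_SECRETS_DIR}/redis.conf"
        data        = "maxmemory 64mb\n"
      }

      # By contrast, destination = "conf" would resolve to task1/conf in the
      # task working directory, outside all three bind-mounted directories.
    }
  }
}
```
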
For [CSI volumes], the client will stage the volume before setting up the task
working directory. Staging typically involves mounting the volume into the CSI
plugin's task directory, sending commands to the plugin to format the volume
as required, and making a volume claim to the Nomad server.
The behavior of the `volume_mount` block is controlled by the task driver. The
client builds a mount configuration describing the host volume or CSI volume
and passes it to the task driver to execute. Because the task driver mounts
the volume, it is not possible to have `artifact`, `template`, or
`dispatch_payload` blocks write to a volume.
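
For reference, a `volume_mount` is declared roughly as in the sketch below.
It assumes the client agent already defines a host volume named `redis-data`;
that name, the group name, and the `/data` destination are assumptions made
for this example.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "group1" {
    # Requests the host volume; "redis-data" must be defined on the client.
    volume "data" {
      type   = "host"
      source = "redis-data"
    }

    task "task1" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }

      # The task driver performs this mount when it starts the task, after
      # artifacts and templates have already been written, so those blocks
      # cannot target paths under /data.
      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}
```
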
[Artifacts]: /docs/job-specification/artifact
[CSI volumes]: /docs/internals/plugins/csi
[Dispatch payloads]: /docs/job-specification/dispatch_payload
[Templates]: /docs/job-specification/template
[`data_dir`]: /docs/configuration#data_dir
[`ephemeral_disk`]: /docs/job-specification/ephemeral_disk
[artifact]: /docs/job-specification/artifact
[chroot contents]: /docs/drivers/exec#chroot
[filesystem isolation capability]: /docs/internals/plugins/task-drivers#capabilities-capabilities-error
[filesystem isolation mode]: #task-drivers-and-filesystem-isolation-modes
[migrated]: /docs/job-specification/ephemeral_disk#migrate
[template]: /docs/job-specification/template
[volume mounts]: /docs/job-specification/volume_mount