docs: internals documentation for alloc filesystem (#9195)

We recently added documentation disambiguating the terminology of the allocation/task working directories. This changeset adds an internals document that describes in more detail exactly what does into the allocation working directory, how this interacts with the filesystem isolation provided by task drivers, and how this interacts with features like `artifact` and `template`. Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2020-11-04 09:59:19 -05:00 · 2020-11-04 09:59:19 -05:00 · c15a16301e
parent 0e373f70c1
commit c15a16301e
5 changed files with 481 additions and 7 deletions
--- a/website/data/docs-navigation.js
+++ b/website/data/docs-navigation.js
@ -49,8 +49,9 @@ export default [
        content: ['scheduling', 'preemption'],
      },
      'consensus',
+      'filesystem',
      'gossip',
-      'security',
+      'security'
    ],
  },
  {
--- a/website/pages/docs/internals/filesystem.mdx
+++ b/website/pages/docs/internals/filesystem.mdx
@ -0,0 +1,465 @@
+---
+layout: docs
+page_title: Filesystem
+sidebar_title: Filesystem
+description: |-
+  Nomad creates an allocation working directory for every allocation. Learn what
+  goes into the working directory and how it interacts with Nomad task drivers.
+---
+
+# Filesystem
+
+Nomad creates a working directory for each allocation on a client. This
+directory can be found in the Nomad [`data_dir`] at
+`./allocs/«alloc_id»`. The allocation working directory is where Nomad
+creates task directories and directories shared between tasks, write logs for
+tasks, and downloads artifacts or templates.
+
+An allocation with two tasks (named `task1` and `task2`) will have an
+allocation directory like the one below.
+
+```shell-session
+.
+├── alloc
+│   ├── data
+│   ├── logs
+│   │   ├── task1.stderr.0
+│   │   ├── task1.stdout.0
+│   │   ├── task2.stderr.0
+│   │   └── task2.stdout.0
+│   └── tmp
+├── task1
+│   ├── local
+│   ├── secrets
+│   └── tmp
+└── task2
+    ├── local
+    ├── secrets
+    └── tmp
+```
+
+-  **alloc/**: This directory is shared across all tasks in an allocation and
+  can be used to store data that needs to be used by multiple tasks, such as a
+  log shipper. This is the directory that's provided to the task as the
+  `NOMAD_ALLOC_DIR`. Note that this `alloc/` directory is not the same as the
+  "allocation working directory", which is the top-level directory. All tasks
+  in a task group can read and write to the `alloc/` directory. Within the
+  `alloc/` directory are three standard directories:
+
+  - **alloc/data/**: This directory is the location used by the
+    [`ephemeral_disk`] stanza for shared data.
+
+  - **alloc/logs/**: This directory is the location of the log files for every
+    task within an allocation. The `nomad alloc logs` command streams these
+    files to your terminal.
+
+  - **alloc/tmp/**: A temporary directory used as scratch space by task drivers.
+
+- **«taskname»**: Each task has a **task working directory** with the same name as
+  the task. Tasks in a task group can't read each other's task working
+  directory. Depending on the task driver's [filesystem isolation mode], a
+  task may not be able to access the task working directory. Within the
+  `task/` directory are three standard directories:
+
+  - **«taskname»/local/**: This directory is the location provided to the task as the
+    `NOMAD_TASK_DIR`. Note this is not the same as the "task working
+    directory". This directory is private to the task.
+
+  - **«taskname»/secrets/**: This directory is the location provided to the task as
+  `NOMAD_SECRETS_DIR`. The contents of files in this directory cannot be read
+  the the `nomad alloc fs` command. It can be used to store secret data that
+  should not be visible outside the task.
+
+  - **«taskname»/tmp/**: A temporary directory used as scratch space by task drivers.
+
+The allocation working directory is the directory you see when using the
+`nomad alloc fs` command. If you were to run `nomad alloc fs` against the
+allocation that made the working directory shown above, you'd see the
+following:
+
+```shell-session
+$ nomad alloc fs c0b2245f
+Mode        Size     Modified Time         Name
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  alloc/
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  task1/
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  task2/
+
+$ nomad alloc fs c0b2245f alloc/
+Mode        Size     Modified Time         Name
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  data/
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:39Z  logs/
+drwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  tmp/
+
+$ nomad alloc fs c0b2245f task1/
+Mode         Size     Modified Time         Name
+drwxrwxrwx   4.0 KiB  2020-10-27T18:00:33Z  local/
+drwxrwxrwx   60 B     2020-10-27T18:00:32Z  secrets/
+dtrwxrwxrwx  4.0 KiB  2020-10-27T18:00:32Z  tmp/
+```
+
+## Task Drivers and Filesystem Isolation Modes
+
+Depending on the task driver, the task's working directory may also be the
+root directory for the running task. This is determined by the task driver's
+[filesystem isolation capability].
+
+### `image` isolation
+
+Task drivers like `docker` or `qemu` use `image` isolation, where the task
+driver isolates task filesystems as machine images. These filesystems are
+owned by the task driver's external process and not by Nomad itself. These
+filesystems will not typically be found anywhere in the allocation working
+directory. For example, Docker containers will have their overlay filesystem
+unpacked to `/var/run/docker/containerd/«container_id»` by default.
+
+Nomad will provide the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
+`NOMAD_SECRETS_DIR` to tasks with `image` isolation, typically by
+bind-mounting them to the task driver's filesystem.
+
+You can see an example of `image` isolation by running the following minimal
+job:
+
+```hcl
+job "example" {
+  datacenters = ["dc1"]
+
+  task "task1" {
+    driver = "docker"
+
+    config {
+      image = "redis:6.0"
+    }
+  }
+}
+```
+
+If you look at the allocation working directory from the host, you'll see a
+minimal filesystem tree:
+
+```shell-session
+.
+├── alloc
+│   ├── data
+│   ├── logs
+│   │   ├── task1.stderr.0
+│   │   └── task1.stdout.0
+│   └── tmp
+└── task1
+    ├── local
+    ├── secrets
+    └── tmp
+```
+
+The `nomad alloc fs` command shows the same bare directory tree:
+
+```shell-session
+$ nomad alloc fs b0686b27
+Mode        Size     Modified Time         Name
+drwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  alloc/
+drwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  task1/
+
+$ nomad alloc fs b0686b27 task1
+Mode         Size     Modified Time         Name
+drwxrwxrwx   4.0 KiB  2020-10-27T18:51:54Z  local/
+drwxrwxrwx   60 B     2020-10-27T18:51:54Z  secrets/
+dtrwxrwxrwx  4.0 KiB  2020-10-27T18:51:54Z  tmp/
+
+$ nomad alloc fs b0686b27 task1/local
+Mode  Size  Modified Time  Name
+```
+
+If you inspect the Docker container that's created, you'll see three
+directories bind-mounted into the container:
+
+```shell-session
+$ docker inspect 32e | jq '.[0].HostConfig.Binds'
+[
+  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/alloc:/alloc",
+  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/local:/local",
+  "/var/nomad/alloc/b0686b27-8af3-8252-028f-af485c81a8b3/task1/secrets:/secrets"
+]
+```
+
+The root filesystem inside the container can see these three mounts, along
+with the rest of the container filesystem:
+
+```shell-session
+$ docker exec -it 32e /bin/sh
+# ls /
+alloc  boot  dev  home  lib64  media  opt   root  sbin     srv  tmp  var
+bin    data  etc  lib   local  mnt    proc  run   secrets  sys  usr
+```
+
+Note that because the three directories are bind-mounted into the container
+filesystem, nothing written outside those three directories elsewhere in the
+allocation working directory will be accessible inside the container. This
+means templates, artifacts, and dispatch payloads for tasks with `image`
+isolation must be written into the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, or
+`NOMAD_SECRETS_DIR`.
+
+To work around this limitation, you can use the task driver's mounting
+capabilities to mount one of the three directories to another location in the
+task. For example, with the Docker driver you can use the driver's `mounts`
+block to bind a secret written by a `template` block to the
+`NOMAD_SECRETS_DIR` into a configuration directory elsewhere in the task:
+
+```hcl
+job "example" {
+  datacenters = ["dc1"]
+
+  task "task1" {
+    driver = "docker"
+
+    config {
+      image = "redis:6.0"
+      mounts = [{
+        type     = "bind"
+        source   = "secrets"
+        target   = "/etc/redis.d"
+        readonly = true
+      }]
+
+      template {
+        destination = "${NOMAD_SECRETS_DIR}/redis.conf"
+        data        = <<EOT
+{{ with secret "secrets/data/redispass" }}
+requirepass {{- .Data.data.passwd -}}{{end}}
+EOT
+
+      }
+    }
+  }
+}
+```
+
+
+### `chroot` isolation
+
+Task drivers like `exec` or `java` (on Linux) use `chroot` isolation, where
+the task driver isolates task filesystems with `chroot` or `pivot_root`. These
+isolated filesystems will be built inside the task working directory.
+
+You can see an example of `chroot` isolation by running the following minimal
+job on Linux:
+
+```hcl
+job "example" {
+  datacenters = ["dc1"]
+
+  task "task2" {
+    driver = "exec"
+
+    config {
+      command = "/bin/sh"
+      args = ["-c", "sleep 600"]
+    }
+  }
+}
+```
+
+If you look at the allocation working directory from the host, you'll see a
+filesystem tree that has been populated with the task driver's [chroot
+contents], in addition to the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
+`NOMAD_SECRETS_DIR`:
+
+```shell-session
+.
+├── alloc
+│   ├── container
+│   ├── data
+│   ├── logs
+│   └── tmp
+└── task2
+    ├── alloc
+    ├── bin
+    ├── dev
+    ├── etc
+    ├── executor.out
+    ├── lib
+    ├── lib32
+    ├── lib64
+    ├── local
+    ├── proc
+    ├── run
+    ├── sbin
+    ├── secrets
+    ├── sys
+    ├── tmp
+    └── usr
+```
+
+Likewise, the root directory of the task is now available in the `nomad alloc
+fs` command output:
+
+```shell-session
+$ nomad alloc fs eebd13a7
+Mode        Size     Modified Time         Name
+drwxrwxrwx  4.0 KiB  2020-10-27T19:05:24Z  alloc/
+drwxrwxrwx  4.0 KiB  2020-10-27T19:05:24Z  task2/
+
+$ nomad alloc fs eebd13a7 task2
+Mode         Size     Modified Time         Name
+drwxrwxrwx   4.0 KiB  2020-10-27T19:05:24Z  alloc/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  bin/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  dev/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  etc/
+-rw-r--r--   297 B    2020-10-27T19:05:24Z  executor.out
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib32/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  lib64/
+drwxrwxrwx   4.0 KiB  2020-10-27T19:05:22Z  local/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  proc/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  run/
+drwxr-xr-x   12 KiB   2020-10-27T19:05:22Z  sbin/
+drwxrwxrwx   60 B     2020-10-27T19:05:22Z  secrets/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:24Z  sys/
+dtrwxrwxrwx  4.0 KiB  2020-10-27T19:05:22Z  tmp/
+drwxr-xr-x   4.0 KiB  2020-10-27T19:05:22Z  usr/
+```
+
+Nomad will provide the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
+`NOMAD_SECRETS_DIR` to tasks with `chroot` isolation. But unlike with `image`
+isolation, Nomad does not need to bind-mount the `NOMAD_TASK_DIR` directory
+because it can be directly created inside the chroot.
+
+```shell-session
+$ nomad alloc exec eebd13a7 /bin/sh
+$ mount
+...
+/dev/mapper/root on /alloc type ext4 (rw,relatime,errors=remount-ro,data=ordered)
+tmpfs on /secrets type tmpfs (rw,noexec,relatime,size=1024k)
+...
+```
+
+### `none` isolation
+
+The `raw_exec` task driver (or the `java` task driver on Windows) uses the
+`none` filesystem isolation mode. This means the task driver does not isolate
+the filesystem for the task, and the task can read and write anywhere the
+user that's running Nomad can.
+
+You can see an example of `none` isolation by running the following minimal
+`raw_exec` job on Linux or Unix.
+
+
+```hcl
+job "example" {
+  datacenters = ["dc1"]
+
+  task "task3" {
+    driver = "raw_exec"
+
+    config {
+      command = "/bin/sh"
+      args = ["-c", "sleep 600"]
+    }
+  }
+}
+```
+
+If you look at the allocation working directory from the host, you'll see a
+minimal filesystem tree:
+
+```shell-session
+.
+├── alloc
+│   ├── data
+│   ├── logs
+│   │   ├── task3.stderr.0
+│   │   └── task3.stdout.0
+│   └── tmp
+└── task3
+    ├── executor.out
+    ├── local
+    ├── secrets
+    └── tmp
+```
+
+The `nomad alloc fs` command shows the same bare directory tree:
+
+```shell-session
+$ nomad alloc fs 87ec7d12 task3
+Mode         Size     Modified Time         Name
+-rw-r--r--   140 B    2020-10-27T19:15:33Z  executor.out
+drwxrwxrwx   4.0 KiB  2020-10-27T19:15:33Z  local/
+drwxrwxrwx   60 B     2020-10-27T19:15:33Z  secrets/
+dtrwxrwxrwx  4.0 KiB  2020-10-27T19:15:33Z  tmp/
+```
+
+But if you use `nomad alloc exec` to view the filesystem from inside the
+container, you'll see that the task has access to the entire root
+filesystem. The `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and `NOMAD_SECRETS_DIR`
+point to the filepath on the host, not a path anchored in the task working
+directory. And the task is running as `root`, because the Nomad client agent
+is running as `root`. This is why the `raw_exec` driver is disabled by
+default.
+
+```shell-session
+$ nomad alloc exec 87ec7d12 /bin/sh
+# ls /
+bin   dev  home        lib    lib64   lost+found  mnt  proc  run   snap  sys  usr  vmlinuz
+boot  etc  initrd.img  lib32  libx32  media       opt  root  sbin  srv   tmp  var
+
+# echo $NOMAD_SECRETS_DIR
+/var/nomad/alloc/87ec7d12-5e35-8fba-96cc-09e5376be15a/task3/secrets
+
+# whoami
+root
+```
+
+## Templates, Artifacts, and Dispatch Payloads
+
+The other contents of the allocation working directory depend on what features
+the job specification uses. The allocation working directory is populated by
+other features in a specific order:
+
+* The allocation working directory is created.
+* The ephemeral disk data is [migrated] from any previous allocation.
+* [CSI volumes] are staged.
+* Then, for each task:
+  * Task working directories are created.
+  * [Dispatch payloads] are written.
+  * [Artifacts] are downloaded.
+  * [Templates] are rendered.
+  * The task is started by the task driver, which includes all bind mounts and
+    [volume mounts].
+
+Dispatch payloads, artifacts, and templates are written to the task working
+directory before a task can start because the resulting files may be binary or
+image run by the task. For example, an `artifact` can be used to download a
+Docker image or .jar file, or a `template` can be used to render a shell
+script that's run by `exec`.
+
+The `artifact` and `template` blocks write their data to a destination
+relative to the task working directory, not the `NOMAD_TASK_DIR`. For task
+drivers with `image` filesystem isolation, this means the `destination` field
+path should be prefixed with either `NOMAD_TASK_DIR` or
+`NOMAD_SECRETS_DIR`. Otherwise, the file will not be visible from inside the
+resulting container. (The `dispatch_payload` block always writes its data to
+the `NOMAD_TASK_DIR`.)
+
+For [CSI volumes], the client will stage the volume before setting up the task
+working directory. Staging typically involves mounting the volume into the CSI
+plugin's task directory, sending commands to the plugin to format the volume
+as required, and making a volume claim to the Nomad server.
+
+The behavior of the `volume_mount` block is controlled by the task driver. The
+client builds a mount configuration describing the host volume or CSI volume
+and passes it to the task driver to execute. Because the task driver mounts
+the volume, it is not possible to have `artifact`, `template`, or
+`dispatch_payload` blocks write to a volume.
+
+
+[Artifacts]: /docs/job-specification/artifact
+[CSI volumes]: /docs/internals/plugins/csi
+[Dispatch payloads]: /docs/job-specification/dispatch_payload
+[Templates]: /docs/job-specification/template
+[`data_dir`]: /docs/configuration#data_dir
+[`ephemeral_disk`]: /docs/job-specification/ephemeral_disk
+[artifact]: /docs/job-specification/artifact
+[chroot contents]: /docs/drivers/exec#chroot
+[filesystem isolation capability]: /docs/internals/plugins/task-drivers#capabilities-capabilities-error
+[filesystem isolation mode]: #task-drivers-and-filesystem-isolation-modes
+[migrated]: /docs/job-specification/ephemeral_disk#migrate
+[template]: /docs/job-specification/template
+[volume mounts]: /docs/job-specification/volume_mount
--- a/website/pages/docs/job-specification/artifact.mdx
+++ b/website/pages/docs/job-specification/artifact.mdx
@ -43,7 +43,9 @@ automatically unarchived before the starting the task.
  download the artifact, relative to the root of the [task's working
  directory]. If omitted, the default value is to place the artifact in
  `local/`. The destination is treated as a directory unless `mode` is set to
-  `file`. Source files will be downloaded into that directory path.
+  `file`. Source files will be downloaded into that directory path. For more
+  details on how the `destination` interacts with task drivers, see the
+  [Filesystem internals] documentation.

 - `mode` `(string: "any")` - One of `any`, `file`, or `dir`. If set to `file`
  the `destination` must be a file, not a directory. By default the
@ -189,3 +191,4 @@ artifact {
 [s3-region-endpoints]: http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region 'Amazon S3 Region Endpoints'
 [iam-instance-profiles]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html 'EC2 IAM instance profiles'
 [task's working directory]: /docs/runtime/environment#task-directories 'Task Directories'
+[Filesystem internals]: /docs/internals/filesystem#templates-artifacts-and-dispatch-payloads
--- a/website/pages/docs/job-specification/template.mdx
+++ b/website/pages/docs/job-specification/template.mdx
@ -61,9 +61,10 @@ README][ct]. Since Nomad v0.6.0, templates can be read as environment variables.
 - `destination` `(string: <required>)` - Specifies the location where the
  resulting template should be rendered, relative to the [task working
  directory]. Only drivers without filesystem isolation (ex. `raw_exec`) or
-  that buiold a chroot in the task working directory (ex. `exec`) can render
+  that build a chroot in the task working directory (ex. `exec`) can render
  templates outside of the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, or
-  `NOMAD_SECRETS_DIR`.
+  `NOMAD_SECRETS_DIR`. For more details on how `destination` interacts with
+  task drivers, see the [Filesystem internals] documentation.

 - `env` `(bool: false)` - Specifies the template should be read back in as
  environment variables for the task. ([See below](#environment-variables))
@ -385,3 +386,4 @@ options](/docs/configuration/client#options):
 [nodevars]: /docs/runtime/interpolation#interpreted_node_vars 'Nomad Node Variables'
 [go-envparse]: https://github.com/hashicorp/go-envparse#readme 'The go-envparse Readme'
 [task working directory]: /docs/runtime/environment#task-directories 'Task Directories'
+[Filesystem internals]: /docs/internals/filesystem#templates-artifacts-and-dispatch-payloads
--- a/website/pages/docs/runtime/environment.mdx
+++ b/website/pages/docs/runtime/environment.mdx
@ -25,9 +25,9 @@ environment variable names such as `NOMAD_ADDR_<task>_<label>`.
 Nomad will pass both the allocation ID and name, the deployment ID that created
 the allocation, the job ID and name, the parent job ID as well as
 the task and group's names. These are given as `NOMAD_ALLOC_ID`, `NOMAD_ALLOC_NAME`,
-`NOMAD_ALLOC_INDEX`, `NOMAD_JOB_NAME`, `NOMAD_JOB_ID`, 
-`NOMAD_JOB_PARENT_ID`, `NOMAD_GROUP_NAME` and `NOMAD_TASK_NAME`. The allocation ID 
-and index can be useful when the task being run needs a unique identifier or to 
+`NOMAD_ALLOC_INDEX`, `NOMAD_JOB_NAME`, `NOMAD_JOB_ID`,
+`NOMAD_JOB_PARENT_ID`, `NOMAD_GROUP_NAME` and `NOMAD_TASK_NAME`. The allocation ID
+and index can be useful when the task being run needs a unique identifier or to
 know its instance count.

 ## Resources
@ -90,6 +90,8 @@ chroot. Regardless of how the directories are made available, the path to the
 directories can be read through the `NOMAD_ALLOC_DIR`, `NOMAD_TASK_DIR`, and
 `NOMAD_SECRETS_DIR` environment variables.

+For more details on the task directories, see the [Filesystem internals].
+
 ## Meta

 The job specification also allows you to specify a `meta` block to supply arbitrary
@ -106,3 +108,4 @@ behavior.

 [jobspec]: /docs/job-specification 'Nomad Job Specification'
 [vault]: /docs/vault-integration 'Nomad Vault Integration'
+[Filesystem internals]: /docs/internals/filesystem