---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {

    driver = "docker"

    config {
      image   = "rcgenova/hadoop-2.7.3"
      command = "bash"
      args    = [ "-c", "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {

  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {

    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args = [ "hdfs", "datanode"
        , "-D", "fs.defaultFS=hdfs://hdfs.service.consul/"
        , "-D", "dfs.permissions.enabled=false"
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This will deploy a DataNode to every client node in the system, which may or may
not be desirable depending on your use case, as shown in the sketch below.

The HDFS job can be deployed using the `nomad run` command:

```shell
$ nomad run hdfs.nomad
```
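
Once the job has been submitted, you can sanity-check the deployment. The
commands below are illustrative; they assume the Nomad CLI is pointed at your
cluster and that Consul DNS resolution is available on the host:

```shell
# Verify that the NameNode and DataNode allocations are running
$ nomad status hdfs

# Confirm the NameNode service is registered in Consul
$ dig +short hdfs.service.consul
```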

## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.