---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {
    driver = "docker"

    config {
      image        = "rcgenova/hadoop-2.7.3"
      command      = "bash"
      args         = [ "-c", "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {
  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {
    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args = [ "hdfs", "datanode",
        "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
        "-D", "dfs.permissions.enabled=false"
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This deploys a DataNode to every client node in the cluster, which may or may
not be desirable depending on your use case.
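
As a rough, hypothetical sketch (the `dc1` datacenter and the job name are
placeholders), the system-scheduled variant mostly amounts to setting the job
`type` and dropping the `count` and `distinct_hosts` constraint, since the
system scheduler already places one instance on each eligible client:

```hcl
# Hypothetical system job variant: Nomad runs one DataNode per eligible client.
job "hdfs-datanodes" {
  datacenters = ["dc1"]  # placeholder datacenter
  type        = "system"

  group "DataNode" {
    task "DataNode" {
      driver = "docker"
      # ... same config, resources and ports as the DataNode task shown above ...
    }
  }
}
```
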
The HDFS job can be deployed using the `nomad run` command:

```shell
$ nomad run hdfs.nomad
```
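
Once the job is running, the NameNode can be reached through its Consul DNS
name from anywhere in the cluster. As a minimal, hypothetical illustration (the
`dc1` datacenter and the `/spark-events` directory are placeholders), a one-off
batch job could create a directory over HDFS without knowing where the NameNode
was placed:

```hcl
# Hypothetical batch job: addresses HDFS through Consul DNS rather than a fixed host.
job "hdfs-mkdir" {
  datacenters = ["dc1"]  # placeholder datacenter
  type        = "batch"

  group "client" {
    task "mkdir" {
      driver = "docker"

      config {
        image        = "rcgenova/hadoop-2.7.3"
        network_mode = "host"
        # Create a placeholder directory via the NameNode's Consul DNS name.
        args = [ "hdfs", "dfs", "-mkdir", "-p", "hdfs://hdfs.service.consul/spark-events" ]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```
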

## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.