---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {
    driver = "docker"

    config {
      image        = "rcgenova/hadoop-2.7.3"
      command      = "bash"
      args         = [ "-c", "hdfs namenode -format && exec hdfs namenode -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {
  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {
    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args = [ "hdfs", "datanode",
        "-D", "fs.defaultFS=hdfs://hdfs.service.consul/",
        "-D", "dfs.permissions.enabled=false"
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This deploys a DataNode to every client node in the cluster, which may or may
not be desirable depending on your use case.
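
As a rough, hypothetical sketch (the `dc1` datacenter and the job name are
placeholders), the system-scheduled variant mostly amounts to setting the job
`type` and dropping the `count` and `distinct_hosts` constraint, since the
system scheduler already places one instance on each eligible client:

```hcl
# Hypothetical system job variant: Nomad runs one DataNode per eligible client.
job "hdfs-datanodes" {
  datacenters = ["dc1"]  # placeholder datacenter
  type        = "system"

  group "DataNode" {
    task "DataNode" {
      driver = "docker"
      # ... same config, resources and ports as the DataNode task shown above ...
    }
  }
}
```
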
The HDFS job can be deployed using the `nomad run` command:

```shell
$ nomad run hdfs.nomad
```
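
Once the job is running, the NameNode can be reached through its Consul DNS
name from anywhere in the cluster. As a minimal, hypothetical illustration (the
`dc1` datacenter and the `/spark-events` directory are placeholders), a one-off
batch job could create a directory over HDFS without knowing where the NameNode
was placed:

```hcl
# Hypothetical batch job: addresses HDFS through Consul DNS rather than a fixed host.
job "hdfs-mkdir" {
  datacenters = ["dc1"]  # placeholder datacenter
  type        = "batch"

  group "client" {
    task "mkdir" {
      driver = "docker"

      config {
        image        = "rcgenova/hadoop-2.7.3"
        network_mode = "host"
        # Create a placeholder directory via the NameNode's Consul DNS name.
        args = [ "hdfs", "dfs", "-mkdir", "-p", "hdfs://hdfs.service.consul/spark-events" ]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```
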

## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.