---
layout: "guides"
page_title: "Apache Spark Integration - Using HDFS"
sidebar_current: "guides-spark-hdfs"
description: |-
  Learn how to deploy HDFS on Nomad and integrate it with Spark.
---

# Using HDFS

[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
is a distributed, replicated, and scalable file system written for the Hadoop
framework. Spark was designed to read from and write to HDFS, since it is
common for Spark applications to perform data-intensive processing over large
datasets. HDFS can be deployed as its own Nomad job.

## Running HDFS on Nomad

A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/master/terraform/examples/spark/hdfs.nomad).
It has two task groups, one for the HDFS NameNode and one for the
DataNodes. Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/master/terraform/examples/spark/docker/hdfs) that includes Hadoop:

```hcl
group "NameNode" {

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "NameNode" {

    driver = "docker"

    config {
      image   = "rcgenova/hadoop-2.7.3"
      command = "bash"
      args    = [ "-c", "hdfs namenode -format && exec hdfs namenode
        -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ]
      network_mode = "host"
      port_map {
        ipc = 8020
        ui  = 50070
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "ipc" {
          static = "8020"
        }
        port "ui" {
          static = "50070"
        }
      }
    }

    service {
      name = "hdfs"
      port = "ipc"
    }
  }
}
```

The NameNode task registers itself in Consul as `hdfs`. This enables the
DataNodes to generically reference the NameNode:

```hcl
group "DataNode" {

  count = 3

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  task "DataNode" {

    driver = "docker"

    config {
      network_mode = "host"
      image        = "rcgenova/hadoop-2.7.3"
      args         = [ "hdfs", "datanode"
        , "-D", "fs.defaultFS=hdfs://hdfs.service.consul/"
        , "-D", "dfs.permissions.enabled=false"
      ]
      port_map {
        data = 50010
        ipc  = 50020
        ui   = 50075
      }
    }

    resources {
      cpu    = 1000
      memory = 1024
      network {
        port "data" {
          static = "50010"
        }
        port "ipc" {
          static = "50020"
        }
        port "ui" {
          static = "50075"
        }
      }
    }
  }
}
```

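Because the NameNode advertises itself through Consul, the DataNodes find it
with an ordinary Consul service lookup rather than a hard-coded address. If
Consul DNS forwarding is configured on your client nodes (an assumption about
your cluster setup), you can check the resolution yourself:

```shell
$ dig +short hdfs.service.consul
```
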
Another viable option for the DataNode task group is to use a dedicated
[system](https://www.nomadproject.io/docs/runtime/schedulers.html#system) job.
This will deploy a DataNode to every client node in the system, which may or may
not be desirable depending on your use case.

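If you do take the system job route, a minimal sketch might look like the
following. The job name `hdfs-datanode` and the `dc1` datacenter are
placeholders, and the DataNode task body is the one shown above:

```hcl
job "hdfs-datanode" {

  # Placeholder datacenter; match this to your cluster.
  datacenters = ["dc1"]

  # A system job places one allocation on every eligible client node, so the
  # count and distinct_hosts constraint from the version above are not needed.
  type = "system"

  group "DataNode" {

    task "DataNode" {
      driver = "docker"

      # ... same config, resources, and ports as the DataNode task above
    }
  }
}
```
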
The HDFS job can be deployed using the `nomad job run` command:

```shell
$ nomad job run hdfs.nomad
```

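Once the job has been submitted, a quick way to confirm that the NameNode and
DataNode allocations are running is the standard job status command (assuming
the job in `hdfs.nomad` is named `hdfs`):

```shell
$ nomad job status hdfs
```
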
## Production Deployment Considerations

A production deployment will typically have redundant NameNodes in an
active/passive configuration (which requires ZooKeeper). See [HDFS High
Availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

## Next Steps

Learn how to [monitor the output](/guides/spark/monitoring.html) of your
Spark applications.