History

Rob Genova 9fd45265f2 Replace references to personal S3 bucket		2017-07-07 22:29:21 -07:00
..
docker	add job files and Dockerfiles	2017-06-24 16:28:12 -07:00
hdfs.nomad	cleanup Vagrantfile, Terraform configs and Nomad job files	2017-06-25 11:10:14 -07:00
README.md	Replace references to personal S3 bucket	2017-07-07 22:29:21 -07:00
RunningSparkOnNomad.pdf	Initial commit for Terraform/AWS support	2017-05-15 11:56:41 -07:00
spark-history-server-hdfs.nomad	cleanup Vagrantfile, Terraform configs and Nomad job files	2017-06-25 11:10:14 -07:00

README.md

Nomad / Spark integration

The Nomad ecosystem includes a fork of Apache Spark that natively supports using a Nomad cluster to run Spark applications. When running on Nomad, the Spark executors that run Spark tasks for your application, and optionally the application driver itself, run as Nomad tasks in a Nomad job. See the usage guide for more details.

Clusters provisioned with Nomad's Terraform templates are automatically configured to run the Spark integration. The sample job files found here are also provisioned onto every client and server.

Setup

To give the Spark integration a test drive, provision a cluster and SSH to any one of the clients or servers (the public IPs are displayed when the Terraform provisioning process completes):

$ ssh -i /path/to/key ubuntu@PUBLIC_IP

The Spark history server and several of the sample Spark jobs below require HDFS. Using the included job file, deploy an HDFS cluster on Nomad:

$ cd $HOME/examples/spark
$ nomad run hdfs.nomad
$ nomad status hdfs

When the allocations are all in the running state (as shown by nomad status hdfs), query Consul to verify that the HDFS service has been registered:

$ dig hdfs.service.consul

Next, create directories and files in HDFS for use by the history server and the sample Spark jobs:

$ hdfs dfs -mkdir /foo
$ hdfs dfs -put /var/log/apt/history.log /foo
$ hdfs dfs -mkdir /spark-events
$ hdfs dfs -ls /

Finally, deploy the Spark history server:

$ nomad run spark-history-server-hdfs.nomad

You can get the private IP for the history server with a Consul DNS lookup:

$ dig spark-history.service.consul

Cross-reference the private IP with the terraforom apply output to get the corresponding public IP. You can access the history server at http://PUBLIC_IP:18080.

Sample Spark jobs

The sample spark-submit commands listed below demonstrate several of the official Spark examples. Features like spark-sql, spark-shell and pyspark are included. The commands can be executed from any client or server.

You can monitor the status of a Spark job in a second terminal session with:

$ nomad status
$ nomad status JOB_ID
$ nomad alloc-status DRIVER_ALLOC_ID
$ nomad logs DRIVER_ALLOC_ID

To view the output of the job, run nomad logs for the driver's Allocation ID.

SparkPi (Java)

spark-submit \
  --class org.apache.spark.examples.JavaSparkPi \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz \
  https://s3.amazonaws.com/nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100

Word count (Java)

spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz \
  https://s3.amazonaws.com/nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  hdfs://hdfs.service.consul/foo/history.log

DFSReadWriteTest (Scala)

spark-submit \
  --class org.apache.spark.examples.DFSReadWriteTest \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz \
  https://s3.amazonaws.com/nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  /home/ubuntu/.bashrc hdfs://hdfs.service.consul/foo

spark-shell

Start the shell:

spark-shell \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz

Run a few commands:

$ spark.version

$ val data = 1 to 10000
$ val distData = sc.parallelize(data)
$ distData.filter(_ < 10).collect()

sql-shell

Start the shell:

spark-sql \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz jars/spark-sql_2.11-2.1.0-SNAPSHOT.jar

Run a few commands:

$ CREATE TEMPORARY VIEW usersTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/usr/local/bin/spark/examples/src/main/resources/users.parquet"
);

$ SELECT * FROM usersTable;

pyspark

Start the shell:

pyspark \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz

Run a few commands:

$ df = spark.read.json("/usr/local/bin/spark/examples/src/main/resources/people.json")
$ df.show()
$ df.printSchema()
$ df.createOrReplaceTempView("people")
$ sqlDF = spark.sql("SELECT * FROM people")
$ sqlDF.show()