# Nomad / Spark integration
The Nomad ecosystem includes a fork of Apache Spark that natively supports using a Nomad cluster to run Spark applications. When running on Nomad, the Spark executors that run Spark tasks for your application, and optionally the application driver itself, run as Nomad tasks in a Nomad job. See the [usage guide](./RunningSparkOnNomad.pdf) for more details.
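
In practice, targeting Nomad comes down to passing `nomad` as the master URL to `spark-submit`. A minimal sketch, with angle-bracket values as placeholders (the full commands in the examples below add event logging and monitoring configuration):

```bash
# Sketch only: submit an application to a Nomad cluster.
# <spark-dist-url> must point to a Nomad-enabled Spark distribution (see below).
spark-submit \
  --class <main-class> \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.nomad.sparkDistribution=<spark-dist-url> \
  <application-jar> [args]
```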
Clusters provisioned with Nomad's Terraform templates are automatically configured to run the Spark integration. The sample job files found here are also provisioned onto every client and server.
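
Provisioning itself is standard Terraform. A hedged sketch, assuming your working directory contains the cluster templates and your cloud credentials are configured (the exact directory depends on where you obtained the templates):

```bash
# Sketch only: provision the cluster from the Terraform template directory.
$ terraform init    # initialize the working directory
$ terraform apply   # public IPs are displayed when this completes
```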
## Setup
To give the Spark integration a test drive, provision a cluster and SSH to any one of the clients or servers (the public IPs are displayed when the Terraform provisioning process completes):
```bash
$ ssh -i /path/to/key ubuntu@PUBLIC_IP
```
The Spark history server and several of the sample Spark jobs below require HDFS. Using the included job file, deploy an HDFS cluster on Nomad:
```bash
$ cd $HOME/examples/spark
$ nomad run hdfs.nomad
$ nomad status hdfs
```
When the allocations are all in the `running` state (as shown by `nomad status hdfs`), query Consul to verify that the HDFS service has been registered:
```bash
$ dig hdfs.service.consul
```
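
If you prefer Consul's HTTP API to DNS, the same check can be made against the local agent (assuming Consul's default HTTP port, 8500):

```bash
# List the nodes that have registered the hdfs service with Consul.
$ curl -s http://localhost:8500/v1/catalog/service/hdfs
```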
Next, create directories and files in HDFS for use by the history server and the sample Spark jobs:
```bash
$ hdfs dfs -mkdir /foo
$ hdfs dfs -put /var/log/apt/history.log /foo
$ hdfs dfs -mkdir /spark-events
$ hdfs dfs -ls /
```
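
Optionally, confirm that the log file made it into HDFS before moving on:

```bash
$ hdfs dfs -cat /foo/history.log | head
```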
Finally, deploy the Spark history server:
```bash
$ nomad run spark-history-server-hdfs.nomad
```
You can get the private IP for the history server with a Consul DNS lookup:
```bash
$ dig spark-history.service.consul
```
Cross-reference the private IP with the `terraform apply` output to get the corresponding public IP. You can access the history server at `http://PUBLIC_IP:18080`.
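
To quickly confirm that the history server is reachable before opening a browser (substituting the public IP as above):

```bash
$ curl -I http://PUBLIC_IP:18080
```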
## Sample Spark jobs
The sample `spark-submit` commands listed below demonstrate several of the official Spark examples. Interactive sessions with `spark-sql`, `spark-shell`, and `pyspark` are also demonstrated. The commands can be executed from any client or server.

You can monitor the status of a Spark job in a second terminal session with:
```bash
$ nomad status
$ nomad status JOB_ID
$ nomad alloc-status DRIVER_ALLOC_ID
$ nomad logs DRIVER_ALLOC_ID
```
To view the output of the job, run `nomad logs` for the driver's Allocation ID.
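
The examples below repeat the same set of Nomad-related flags. Purely as an illustrative convenience (this variable is not part of the provisioned files), the shared configuration could be factored out once per shell session:

```bash
# Sketch only: flags shared by the spark-submit examples below.
SPARK_DIST=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
COMMON_CONF="--master nomad --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=$SPARK_DIST"
```

Each example would then reduce to `spark-submit --class <main-class> $COMMON_CONF <jar> [args]`; the full commands are spelled out below for copy-paste convenience.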
### SparkPi (Java)
```bash
spark-submit \
  --class org.apache.spark.examples.JavaSparkPi \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
  https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  100
```
### Word count (Java)
```bash
spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
  https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  hdfs://hdfs.service.consul/foo/history.log
```
### DFSReadWriteTest (Scala)
```bash
spark-submit \
  --class org.apache.spark.examples.DFSReadWriteTest \
  --master nomad \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.cluster.monitorUntil=complete \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
  https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar \
  /home/ubuntu/.bashrc hdfs://hdfs.service.consul/foo
```
### spark-shell
Start the shell:
```bash
spark-shell \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
```
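
While the shell is running, its executors run as a Nomad job. In a second terminal session you can watch them (job and allocation IDs will vary):

```bash
$ nomad status
```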
Run a few commands:
```scala
scala> spark.version
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
```
### spark-sql
Start the shell:
```bash
spark-sql \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
  jars/spark-sql_2.11-2.1.0-SNAPSHOT.jar
```
Run a few commands:
```sql
CREATE TEMPORARY VIEW usersTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/usr/local/bin/spark/examples/src/main/resources/users.parquet"
);

SELECT * FROM usersTable;
```
### pyspark
Start the shell:
```bash
pyspark \
  --master nomad \
  --conf spark.executor.instances=4 \
  --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
```
Run a few commands:
```python
>>> df = spark.read.json("/usr/local/bin/spark/examples/src/main/resources/people.json")
>>> df.show()
>>> df.printSchema()
>>> df.createOrReplaceTempView("people")
>>> sqlDF = spark.sql("SELECT * FROM people")
>>> sqlDF.show()
```