# Nomad / Spark integration
We maintain a fork of Apache Spark that natively supports using a Nomad cluster to run Spark applications. When running on Nomad, the Spark executors that run tasks for your application, and optionally the application driver itself, run as Nomad tasks in a Nomad job. See the [usage guide](./RunningSparkOnNomad.pdf) for more details.
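In client mode only the executors run on the Nomad cluster; adding `--deploy-mode cluster` submits the driver as a Nomad task as well. As a minimal sketch of what this looks like (the full commands below also set `spark.nomad.sparkDistribution`, which points the Nomad tasks at a downloadable Spark distribution):

```bash
# SparkPi with executors scheduled as Nomad tasks; the driver stays
# on the submitting machine because this is client mode.
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master nomad examples/jars/spark-examples*.jar 100
```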
To give the Spark integration a test drive, `cd` to `examples/spark/spark` on one of the servers (this subdirectory is created when the cluster is provisioned).
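For example:

```bash
$ cd examples/spark/spark
```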
A number of sample Spark commands are listed below. They demonstrate some of the official Spark examples as well as features like `spark-sql`, `spark-shell`, and DataFrames.
You can monitor Nomad status simultaneously with:
```bash
$ nomad status
$ nomad status [JOB_ID]
$ nomad alloc-status [ALLOC_ID]
```
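Once an allocation ID is known, you can also tail task output directly (a sketch using the standard Nomad CLI; the task name argument can usually be omitted when the allocation runs a single task):

```bash
$ nomad logs -f [ALLOC_ID]
```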
## Sample Spark commands
### SparkPi
Java (client mode)
```bash
$ ./bin/spark-submit --class org.apache.spark.examples.JavaSparkPi --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/jars/spark-examples*.jar 100
```
Java (cluster mode)
```bash
$ ./bin/spark-submit --class org.apache.spark.examples.JavaSparkPi --master nomad --deploy-mode cluster --conf spark.executor.instances=4 --conf spark.nomad.cluster.monitorUntil=complete --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
```
Python (client mode)
```bash
$ ./bin/spark-submit --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/src/main/python/pi.py 100
```
Python (cluster mode)
```bash
$ ./bin/spark-submit --master nomad --deploy-mode cluster --conf spark.executor.instances=4 --conf spark.nomad.cluster.monitorUntil=complete --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/src/main/python/pi.py 100
```
Scala (client mode)
```bash
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/jars/spark-examples*.jar 100
```
### Machine Learning
Python (client mode)
```bash
$ ./bin/spark-submit --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/src/main/python/ml/logistic_regression_with_elastic_net.py
```
Scala (client mode)
```bash
$ ./bin/spark-submit --class org.apache.spark.examples.SparkLR --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz examples/jars/spark-examples*.jar
```
### Streaming
Run the following two commands simultaneously (e.g. in separate terminals). The first generates a stream of page views on port 44444; the second consumes that stream and computes a metric over it:
```bash
$ bin/spark-submit --class org.apache.spark.examples.streaming.clickstream.PageViewGenerator --master nomad --deploy-mode cluster --conf spark.executor.instances=4 --conf spark.nomad.cluster.monitorUntil=complete --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 44444 10
```
```bash
$ bin/spark-submit --class org.apache.spark.examples.streaming.clickstream.PageViewStream --master nomad --deploy-mode cluster --conf spark.executor.instances=4 --conf spark.nomad.cluster.monitorUntil=complete --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar errorRatePerZipCode localhost 44444
```
### pyspark
```bash
$ ./bin/pyspark --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
```
From pyspark:
```python
df = spark.read.json("examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
```
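Assuming the stock `examples/src/main/resources/people.json` that ships with Spark, `df.show()` should print something like:

```
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```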
### spark-shell
```bash
$ ./bin/spark-shell --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
```
From spark-shell:
```scala
:type spark
spark.version

val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```
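The final `collect()` gathers the filtered results back to the driver and should return `Array(1, 2, 3, 4, 5, 6, 7, 8, 9)`, computed across the Nomad-scheduled executors.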
### spark-sql
```bash
$ bin/spark-sql --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz jars/spark-sql_2.11-2.1.0-SNAPSHOT.jar
```
From spark-sql:
```sql
CREATE TEMPORARY VIEW usersTable
USING org.apache.spark.sql.parquet
OPTIONS (
path "examples/src/main/resources/users.parquet"
);
SELECT * FROM usersTable;
```
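With the stock `examples/src/main/resources/users.parquet` that ships with Spark, the `SELECT` should return two rows (Alyssa and Ben) along with their favorite colors and numbers.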
### DataFrames
```bash
$ bin/spark-shell --master nomad --conf spark.executor.instances=8 --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz
```
From spark-shell:
```scala
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("/tmp/namesAndFavColors.parquet")
```
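To verify the write, the saved file can be read back with the standard DataFrame API (Parquet is the default format, so none needs to be specified):

```scala
// Reads the Parquet file written above and prints its rows.
val saved = spark.read.load("/tmp/namesAndFavColors.parquet")
saved.show()
```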