Python job submission to spark from remotely - apache-spark

I have a python script with pyspark code on my local system. I am trying to submit a pyspark job from my local machine to remote spark cluster.
Please let me know how it can be done.
Do I need spark locally installed to submit spark job.

You need to have the spark master URL in the spark conf like below
SparkSession spark = SparkSession.builder().appName("CDX JSON Merge Job").master("spark://ip-address:7077")
.getOrCreate();
You have to install the spark client in your localhost and then execute the jar using the spark-submit
spark-submit --num-executors 50 --executor-memory 4G --executor-cores 4 --master spark://ip-address:7077 --deploy-mode client --class fully-qualified-class-name artifact.jar
You can also have the master as YARN if you are running Spark on YARN and deploy-mode as cluster.

Related

SnappyData Smart Connector - how to run jobs

I'm reading the documentation and I would like to ask you to help me understand the SnappyData Smart Connector point.
There is a few different examples in documentation how should I use spark-submit e.g:
example 1
./bin/spark-submit --deploy-mode cluster --class somePackage.someClass
--master spark://localhost:7077 --conf spark.snappydata.connection=localhost:1527
--packages "SnappyDataInc:snappydata:1.0.0-s_2.11"
example 2
// Start the Spark standalone cluster from SnappyData base directory
$ sbin/start-all.sh
// Submit AirlineDataSparkApp to Spark Cluster with snappydata's
locator host port.
$ bin/spark-submit --class io.snappydata.examples.AirlineDataSparkApp --master spark://masterhost:7077 --conf spark.snappydata.connection=locatorhost:clientPort --conf spark.ui.port=4041 $SNAPPY_HOME/examples/jars/quickstart.jar
example 3
$ <Spark_Product_Home>/bin/spark-submit --master local[*] --conf
spark.snappydata.connection=localhost:1527 --class
org.apache.spark.examples.snappydata.SmartConnectorExample --
packages SnappyDataInc:snappydata:1.0.0-s_2.11
<SnappyData_Product_Home>/examples/jars/quickstart.jar
Let say I have Spark cluster on 3 hosts : 1 master and 3 workers
I would like to use SnappyData cluster as datasource for my current spark environment.
Should I use command from example 1 or 2 or 3?
Could you also explain to me what is --deploy-mode argument in spark-submit - http://snappydatainc.github.io/snappydata/affinity_modes/connector_mode/
what is different between cluster mode and client mode for spark-submit?
Thank you in advance for any help.
Regards,
Deploy-mode is explained here. No different when using SnappyData. When running your own Spark cluster (any Spark distro compatible with Spark 2.1) then working with SnappyData only requires you to configure the Snappy locator (e.g. localhost:1527).

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on cluster: log in to master node and include:
--master spark://XXXX:7077
You can find it always in spark ui under port 8080
Also check your spark builder config if you have set master already as it takes priority when launching eg:
val spark = SparkSession
.builder
.appName("myapp")
.master("local[*]")

Dependency is not distributed to Spark cluster

I'm trying to execute Spark job on Mesos cluster that depends on spark-cassandra-connector library, but it keeps failing with
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/package$
As I understand from spark documentation
JARs and files are copied to the working directory for each SparkContext on the executor nodes.
...
Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages.
But it seems that only pucker-assembly-1.0.jar task jar is distributed.
I'm running spark 1.6.1 with scala 2.10.6.
And here's spark-submit command I'm executing:
spark-submit --deploy-mode cluster
--master mesos://localhost:57811
--conf spark.ssl.noCertVerification=true
--packages datastax:spark-cassandra-connector:1.5.1-s_2.10
--conf spark.cassandra.connection.host=10.0.1.83,10.0.1.86,10.0.1.85
--driver-cores 3
--driver-memory 4000M
--class SimpleApp
https://dripit-spark.s3.amazonaws.com/pucker-assembly-1.0.jar
s3n://logs/E1SR85P3DEM3LU.2016-05-05-11.ceaeb015.gz
So why isn't spark-cassandra-connector distributed to all my spark executers?
You should use the correct Maven coordinate syntax:
--packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0
See
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
http://spark.apache.org/docs/latest/submitting-applications.html
http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell

How to submit pyspark job in yarn cluster mode from code

Can we submit a pyspark job in yarn cluster mode from Python code.
spark-submit is the command for submit the pyspark job on spark and we have to mention yarn cluster mode for deploy the job on cluster.
spark-submit --master yarn --deploy-mode cluster py_files.py

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run succesfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simply Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe #MrChristine is correct -- the option flags you specify are being passed to your python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors since by default it will run on a single core and use two executors.
Its not true that python script doesn't run in cluster mode. I am not sure about previous versions but this is executing in spark 2.2 version on Hortonworks cluster.
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setMaster("yarn")
.setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output : Its big so i am not pasting. But it runs perfect.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.

Resources