Spark pyspark vs spark-submit - apache-spark

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding pyspark, it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?

There is no practical difference between these two. If not configured otherwise, both will execute code in local mode. If a master is configured (either with the --master command line parameter or the spark.master configuration property), the corresponding cluster will be used to execute the program.
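As a quick, hypothetical illustration (the app name is made up), the same PySpark code picks up whatever master is in effect; when launched through bin/pyspark or spark-submit with no --master and no spark.master in spark-defaults.conf, it falls back to local mode:
from pyspark.sql import SparkSession

# Launched via `pyspark` or `spark-submit my_script.py` with no --master flag,
# this session ends up in local mode; with `--master yarn` (or spark.master=yarn)
# the same code runs against the cluster instead.
spark = SparkSession.builder.appName("master-resolution-demo").getOrCreate()
print(spark.sparkContext.master)   # e.g. "local[*]" or "yarn"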

If you are using EMR, there are three ways to run an application:
using pyspark (or spark-shell)
using spark-submit without the --master and --deploy-mode options
using spark-submit with --master and --deploy-mode
Although all three run the application on the Spark cluster, there is a difference in how the driver program works.
In the 1st and 2nd cases the driver runs in client mode, whereas in the 3rd the driver also runs inside the cluster.
In the 1st and 2nd cases you will have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel (the two submit forms are sketched below).
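A rough sketch of the difference (script name and options are illustrative only): the application code is identical, and only the submit command decides whether the driver stays on the submitting node (client mode) or is shipped into the cluster (cluster mode).
# my_job.py -- hypothetical PySpark application
#
# 1st/2nd way (driver in client mode, on the node you submit from):
#   pyspark                      # interactive
#   spark-submit my_job.py       # no --master / --deploy-mode
#
# 3rd way (driver also runs inside the cluster, so several jobs can run in parallel):
#   spark-submit --master yarn --deploy-mode cluster my_job.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_job").getOrCreate()
spark.range(1000).selectExpr("sum(id) AS total").show()
spark.stop()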

Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
..when I am running the commands through pyspark they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that are registered on and executed by the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but the driver is a different machine from the one you call spark-submit from. A small example of this split follows.
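A small, hypothetical example of that split: the plain Python statements below run only on the driver, while the DataFrame transformations are just registered until an action ships tasks out to the executors on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-cluster").getOrCreate()

greeting = "hello from the driver"            # plain Python: driver only
print(greeting)

df = spark.range(1000000)                     # transformation: registered, nothing runs yet
total = df.selectExpr("sum(id)").collect()    # action: tasks execute on the cluster
print(total)                                  # result collected back to the driver

spark.stop()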

PySpark, compared to Scala Spark and Java Spark, has significant differences; for Python Spark, only YARN is supported for scheduling the cluster.
If you are running Python Spark on a local machine, you can use pyspark. If on a cluster, use spark-submit.
If your Python Spark job has any dependencies, you need to bundle them into a zip file for submission, as sketched below.
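For the dependency point, a hedged sketch (deps.zip and mymodule are hypothetical): bundle your own modules into a zip and either pass it with spark-submit's --py-files option or add it from the code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-with-deps").getOrCreate()

# deps.zip is assumed to contain your own Python packages/modules;
# the command-line equivalent would be:
#   spark-submit --py-files deps.zip main.py
spark.sparkContext.addPyFile("deps.zip")

import mymodule   # hypothetical module shipped inside deps.zip
print(mymodule.__name__)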

Related

Is there a way to submit spark job on different server running master

We have a requirement to schedule Spark jobs; since we are familiar with apache-airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to schedule Spark jobs on Airflow, or an option to run them on a different server running the master.
Answer to this will be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, Spark is running locally to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath.
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (over the SSH protocol via the paramiko library), such as spark-submit. The benefit of this approach is that you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHTTPOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
I personally prefer SSHOperator :)
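A minimal sketch of option (1), assuming the Spark provider package is installed and a spark_default connection pointing at your cluster exists (the DAG id, file path and options are made up, and exact import paths vary between Airflow versions):
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="spark_submit_example",
         start_date=datetime(2023, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        application="/path/to/my_job.py",   # hypothetical PySpark script on the Airflow host
        conn_id="spark_default",            # assumed Spark connection (yarn master, client deploy mode)
        verbose=True,                       # stream spark-submit stdout/stderr into the task log
    )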

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? It looks like it's a little complicated to deploy Spark remotely from another cluster, and that will create some configuration file duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an Application Master is deployed within YARN, Spark is running locally to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task I perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
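For reference, a minimal sketch of that SSHOperator approach, assuming an ssh_spark_edge connection to a node where spark-submit is on the PATH (connection name, script path and options are made up; import paths differ between Airflow versions):
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(dag_id="spark_submit_over_ssh",
         start_date=datetime(2023, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    run_spark_job = SSHOperator(
        task_id="run_spark_job",
        ssh_conn_id="ssh_spark_edge",    # assumed SSH connection to the Spark/Hadoop edge node
        command="spark-submit --master yarn --deploy-mode cluster /path/to/my_job.py",
    )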
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster ?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.getBuilder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to YARN happens before SparkConf.setMaster is executed.
When you use --master yarn --deploy-mode cluster, Spark will run its main method on your local machine and upload the jar to run on YARN. YARN will allocate a container as the application master to run the Spark driver, a.k.a. your code. SparkConf.setMaster("local") then runs inside that YARN container, creates a SparkContext in local mode, and doesn't use the YARN cluster resources.
I recommend not setting the master in your code. Just use the --master command line option or the MASTER environment variable to specify the Spark master, as in the sketch below.
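A minimal sketch of that recommendation in PySpark terms (the script name is made up): leave the master out of the code entirely and let the submit command or environment decide.
# my_app.py -- no setMaster anywhere in the code; e.g.
#   spark-submit --master yarn --deploy-mode cluster my_app.py   # runs on the cluster
#   spark-submit my_app.py                                       # defaults to local[*]
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()
spark.range(100).selectExpr("count(*) AS n").show()
spark.stop()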
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It even got improved in recent versions of Spark, where it adds a nice-looking application name (aka appName) so you don't have to set it (or even... please don't... hardcode it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of when you could have setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.
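A rough PySpark analogue of that workaround, purely illustrative (my_app and its run() entry point are hypothetical): the real application never calls setMaster, and a separate launcher used only from the IDE forces a local master.
# run_in_ide.py -- hypothetical launcher kept out of the production build
from pyspark.sql import SparkSession

import my_app   # the real application module; it never sets a master itself

spark = (SparkSession.builder
         .master("local[*]")    # forced only here, for runs inside the IDE
         .appName("my_app-ide")
         .getOrCreate())
my_app.run(spark)               # hypothetical entry point that takes the session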

Specify spark driver for spark-submit

I'm submitting a spark job from a shell script that has a bunch of env vars and parameters to pass to spark. Strangely, the driver host is not one of these parameters (there are driver cores and memory however). So if I have 3 machines in the cluster, a driver will be chosen randomly. I don't want this behaviour since 1) the jar I'm submitting is only on one of the machines and 2) the driver machine should often be smaller than the other machines which is not the case if it's random choice.
So far, I found no way to specify this param on the command line to spark-submit. I've tried --conf SPARK_DRIVER_HOST="172.30.1.123", --conf spark.driver.host="172.30.1.123" and many other things but nothing has any effect. I'm using Spark 2.1.0. Thanks.
I assume you are running on a YARN cluster. In brief, YARN uses containers to launch and run tasks, and the ResourceManager decides where to run each container based on resource availability. In Spark's case, the driver and executors are also launched as containers with separate JVMs. The driver is dedicated to splitting tasks among the executors and collecting the results from them. If the node from which you launch your application is part of the cluster, it will also be used as a shared resource for launching the driver/executors.
From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
When running the cluster in standalone mode or on Mesos, the driver host (this is the master) can be specified with:
--master <master-url> #e.g. spark://23.195.26.187:7077
When using YARN it works a little differently. Here the parameter is simply yarn:
--master yarn
Where yarn points to is specified in Hadoop's configuration for the ResourceManager. For how to do this see this interesting guide https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/ . Basically, for HDFS it is hdfs-site.xml and for YARN it is yarn-site.xml.

Pyspark on yarn-cluster mode

Is there any way to run pyspark scripts in yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a django web app.
When I try to run any script in yarn-cluster mode I get the following error:
org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
I'm creating the SparkContext in the following way:
conf = (SparkConf()
.setMaster("yarn-cluster")
.setAppName("DataFrameTest"))
sc = SparkContext(conf = conf)
#Dataframe code ....
Thanks
The reason yarn-cluster mode isn't supported is that yarn-cluster means bootstrapping the driver program itself (e.g. the program that creates and uses a SparkContext) onto a YARN container. Guessing from your statement about submitting from a django web app, it sounds like you want the Python code that contains the SparkContext to be embedded in the web app itself, rather than shipping the driver code onto a YARN container which then handles a separate Spark job.
This means your case most closely fits with yarn-client mode instead of yarn-cluster; in yarn-client mode, you can run your SparkContext code anywhere (like inside your web app), while it talks to YARN for the actual mechanics of running jobs.
Fundamentally, if you're sharing any in-memory state between your web app and your Spark code, that means you won't be able to chop off the Spark portion to run inside a YARN container, which is what yarn-cluster tries to do. If you're not sharing state, then you can simply invoke a subprocess which actually does call spark-submit to bundle an independent PySpark job to run in yarn-cluster mode.
To summarize (a sketch of both options follows):
If you want to embed your Spark code directly in your web app, you need to use yarn-client mode instead: SparkConf().setMaster("yarn-client")
If the Spark code is loosely coupled enough that yarn-cluster is actually viable, you can issue a Python subprocess to actually invoke spark-submit in yarn-cluster mode.
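A hedged sketch of both options (paths are made up, and the yarn-client / yarn-cluster master strings follow the Spark version used in the question; newer versions spell them --master yarn with --deploy-mode client or cluster):
import subprocess
from pyspark import SparkConf, SparkContext

# Option 1: embed the driver in the web app itself -> yarn-client mode
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("DataFrameTest"))
sc = SparkContext(conf=conf)
# ... DataFrame code runs here, inside the web process ...
sc.stop()

# Option 2: keep the Spark job self-contained and ship it in yarn-cluster mode
subprocess.check_call([
    "spark-submit",
    "--master", "yarn-cluster",        # or: --master yarn --deploy-mode cluster
    "/path/to/standalone_job.py",      # hypothetical, independent PySpark script
])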
