Pyspark on yarn-cluster mode - apache-spark

Is there any way to run pyspark scripts in yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a Django web app.
When I try to run any script in yarn-cluster mode I get the following error:
org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
I'm creating the SparkContext in the following way:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-cluster")
        .setAppName("DataFrameTest"))
sc = SparkContext(conf=conf)
# DataFrame code ....
Thanks

The reason yarn-cluster mode isn't supported is that yarn-cluster means bootstrapping the driver program itself (i.e. the program that creates and uses a SparkContext) onto a YARN container. Guessing from your statement about submitting from a Django web app, it sounds like you want the Python code that contains the SparkContext to be embedded in the web app itself, rather than shipping the driver code onto a YARN container which then handles a separate Spark job.
This means your case most closely fits with yarn-client mode instead of yarn-cluster; in yarn-client mode, you can run your SparkContext code anywhere (like inside your web app), while it talks to YARN for the actual mechanics of running jobs.
Fundamentally, if you're sharing any in-memory state between your web app and your Spark code, that means you won't be able to chop off the Spark portion to run inside a YARN container, which is what yarn-cluster tries to do. If you're not sharing state, then you can simply invoke a subprocess which actually does call spark-submit to bundle an independent PySpark job to run in yarn-cluster mode.
To summarize:
If you want to embed your Spark code directly in your web app, you need to use yarn-client mode instead: SparkConf().setMaster("yarn-client")
If the Spark code is loosely coupled enough that yarn-cluster is actually viable, you can issue a Python subprocess to invoke spark-submit in yarn-cluster mode (see the sketch below).
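A rough sketch of both alternatives, using the pre-2.0 master strings from the question; the standalone script path is a placeholder, not something from your setup:

import subprocess

from pyspark import SparkConf, SparkContext


def run_embedded_yarn_client():
    # Alternative 1: embed the driver in the web app and let it talk to YARN directly.
    conf = SparkConf().setMaster("yarn-client").setAppName("DataFrameTest")
    sc = SparkContext(conf=conf)
    # ... DataFrame code ...
    sc.stop()


def run_via_spark_submit():
    # Alternative 2: keep the Spark code in a separate, self-contained script and
    # submit it to YARN; the driver then runs inside a YARN container.
    # "/path/to/standalone_job.py" is a hypothetical standalone PySpark driver.
    subprocess.check_call([
        "spark-submit",
        "--master", "yarn-cluster",
        "/path/to/standalone_job.py",
    ])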

Related

Kafka Spark Streaming

I was trying to build a Kafka and Spark Streaming use case in which Spark Streaming consumes a stream from Kafka; we then enrich the stream and store the enriched stream in a target system.
My question here is: does it make sense to run the Spark Streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here.)
I think the Spark Streaming job should run only in local mode, but another question is how to improve the performance of the Spark Streaming job.
Thanks,
local[*]
This runs the job in local mode.
Usually we use this to perform POCs and on very small data.
You can debug the job to understand how each line of code is working.
But you need to be aware that since the job is running on your local machine, you cannot get the most out of Spark's distributed architecture.
yarn-client
Your driver program runs on the YARN client host where you type the command to submit the Spark application, but the tasks are still executed on the executors.
yarn-cluster
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is the best way to run the Spark job and benefit from the advantages provided by a cluster manager.
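Putting those three modes together in PySpark terms, a small sketch (app names and the script name are placeholders):

from pyspark import SparkConf, SparkContext

# local[*]: driver and executors all run inside a single local JVM -- fine for POCs.
sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("poc"))
sc.stop()

# yarn-client: the driver stays in this process, while executors run in YARN containers.
sc = SparkContext(conf=SparkConf().setMaster("yarn-client").setAppName("client-mode"))
sc.stop()

# yarn-cluster: the driver itself runs inside a YARN container, so it cannot be chosen
# from within a running SparkContext; submit the script instead, e.g.
#   spark-submit --master yarn-cluster streaming_job.py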
I hope this gives you clarity on how you may want to deploy your Spark job.
In fact, Spark provides very clear documentation explaining the various deployment strategies with examples.
https://spark.apache.org/docs/latest/running-on-yarn.html
The difference is that with yarn-client you force the Spark job to use the host where you run spark-submit as the driver, whereas with yarn-cluster the chosen host won't be the same every time you run it.
So the best choice is to always use yarn-cluster, to avoid overloading the same host if you are going to submit multiple jobs from the same host with yarn-client.

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local"), what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.getBuilder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to Yarn happens before SparkConf.setMaster.
When you use --master yarn --deploy-mode cluster, Spark runs its own main method (the spark-submit launcher) on your local machine and uploads the jar to run on YARN. YARN allocates a container as the application master to run the Spark driver, i.e. your code. SparkConf.setMaster("local") is then executed inside a YARN container, where it creates a SparkContext running in local mode that doesn't use the YARN cluster resources.
I recommend not setting the master in your code. Just use the command-line --master option or the MASTER environment variable to specify the Spark master.
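A minimal PySpark sketch of that recommendation, with no master set in code so that spark-submit's --master flag (or the MASTER environment variable) decides where it runs; the app name is a placeholder:

from pyspark.sql import SparkSession

# No .master(...) here: the master comes from spark-submit (--master) or the environment,
# so the same code runs locally or on YARN without being edited.
spark = (SparkSession.builder
         .appName("NoHardcodedMaster")
         .getOrCreate())

spark.range(10).show()  # trivial action just to exercise the session
spark.stop()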
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It even got improved in recent versions of Spark, where it adds a nice-looking application name (aka appName) so you don't have to set it (or even... please don't... hardcode it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of when you could have setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.
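An analogous pattern in PySpark (a sketch of my own, not from the answer above) is to hardcode a master only for local IDE runs, guarded by an opt-in flag; the environment-variable name below is hypothetical:

import os

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("IdeFriendlyApp")  # placeholder name

# Only set a master for local IDE runs; spark-submit provides spark.master itself.
if os.environ.get("RUN_LOCAL") == "1":  # hypothetical opt-in flag
    builder = builder.master("local[*]")

spark = builder.getOrCreate()
spark.stop()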

Spark pyspark vs spark-submit

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running the commands through pyspark, do they also run on the "cluster"? They do not run on the master node only, right?
There is no practical difference between these two. If not configured otherwise, both will execute code in local mode. If a master is configured (either with the --master command-line parameter or the spark.master configuration), the corresponding cluster will be used to execute the program.
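One quick, hedged way to confirm which case you are in is to print the active master from either the pyspark shell or a submitted script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints e.g. "local[*]" when no master was configured, or "yarn" when launched
# with --master yarn; the same check works in the pyspark shell and in spark-submit jobs.
print(spark.sparkContext.master)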
If you are using EMR, there are three ways to run it:
using pyspark (or spark-shell)
using spark-submit without the --master and --deploy-mode options
using spark-submit with the --master and --deploy-mode options
Although all three of the above will run the application on the Spark cluster, there is a difference in how the driver program works:
In the 1st and 2nd, the driver runs in client mode, whereas in the 3rd the driver also runs in the cluster.
In the 1st and 2nd, you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
...when I am running the commands through pyspark they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but the driver is a different machine from the one you call spark-submit from.
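A small sketch of that split between driver-side Python and cluster execution (the dataset here is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DriverVsCluster").getOrCreate()  # placeholder name

total = 0                  # plain Python: lives only in the driver process
df = spark.range(1000)     # lazily defines a distributed dataset

# The transformation and action below are what actually run on the executors;
# only the final count is brought back to the driver.
total = df.filter(df.id % 2 == 0).count()
print(total)               # plain Python again, executed on the driver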
PySpark, compared to Scala Spark and Java Spark, has significant differences; Python Spark only supports YARN for scheduling on a cluster.
If you are running Python Spark on a local machine, you can use pyspark. If on a cluster, use spark-submit.
If you have any dependencies in your Python Spark job, you need a zip file for submission.
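For example (a sketch; the archive name deps.zip and the script name job.py are hypothetical), the zip can be passed at submit time with --py-files, or attached from the driver with SparkContext.addPyFile:

from pyspark import SparkConf, SparkContext

# Equivalent submit-time form: spark-submit --py-files deps.zip job.py
sc = SparkContext(conf=SparkConf().setAppName("JobWithDeps"))
sc.addPyFile("deps.zip")  # ships the archive so imports resolve on the executors too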

Spark submit does automatically upload the jar to cluster?

I'm trying to submit a Spark app from my local machine's terminal to my cluster.
I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine.
When I provide the path to the application jar, which is on my local machine, will spark-submit automatically upload it to my cluster?
I'm using
bin/spark-submit \
  --class com.my.application.XApp \
  --master yarn-cluster --executor-memory 100m \
  --num-executors 50 \
  /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar \
  1000
and am getting the error:
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit, it says:
Advanced Dependency Management: When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But it seems like it does not!
I see you are quoting the spark-submit page from Spark Docs but I would spend a lot more time on the Running Spark on YARN page. Bottom-line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR; if there's a need to add other JARs, there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or make them "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.
Try adding the --jars option before your /path/to/jar/file:
spark-submit --jars /tmp/test.jar

What is the entry point of Spark container in YARN cluster mode?

What is the main entry point of a Spark executor when a Spark job is being run in Yarn cluster mode (for Spark 1.2.0+)?
What I am looking for is the Scala class name for the entry point of an Executor (which will be the process executing one of the tasks on a slave machine).
I think what you're asking about is org.apache.spark.executor.Executor or perhaps org.apache.spark.executor.Executor$TaskRunner. It is TaskRunner that will ultimately run a task.
This is regardless of the deploy mode (client vs. cluster) or the cluster manager used, i.e. Hadoop YARN, Spark Standalone, or Apache Mesos.
spark-submit --class [FULLY QUALIFIED CLASS NAME] \
  --master yarn-cluster \
  [JAR_TO_USE]
So, given the above, the class to be used is the one specified with --class; it is loaded from the given jar, and spark-submit looks inside that class for a static main method.
From SparkSubmit.scala:
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
