Control which worker node gets the driver program in deploy mode cluster - apache-spark

I am running multiple spark-submit applications using --deploy-mode cluster. However, it seems that only one node in the cluster gets the driver program for every application. As a result, this node's memory fills up pretty fast.
Is there a way to specify which node in the cluster is assigned the driver program for each application launched using spark-submit?

Related

Understanding Spark Submit Yarn Client vs Cluster mode [duplicate]

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:
A master machine, which also is where our application is run using spark-submit
2 identical worker machines
From the Spark Documentation, I read:
(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.
Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?
2) How do I choose which one my application is going to be running on, using spark-submit?
What are the practical differences between Spark Standalone client
deploy mode and cluster deploy mode? What are the pros and cons of
using each one?
Let's try to look at the differences between client and cluster mode.
Client:
Driver runs on the machine where spark-submit is invoked (the Master node in your setup), inside a dedicated process. This means it has all of that machine's available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to restart it.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The driver program can be monitored from the Master node and restarted in case it dies, if you submit it with the --supervise flag.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
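The cluster-mode points above map onto a handful of spark-submit options. The following is a minimal sketch that only assembles and prints a command line; it does not run Spark. The master URL, class name, and JAR path are hypothetical placeholders, and the memory and core values are arbitrary. `--driver-memory`, `--driver-cores`, and `--supervise` are existing spark-submit options.

```python
import shlex

# Hypothetical placeholders: replace with your own master URL, class, and JAR.
MASTER_URL = "spark://master-host:7077"
MAIN_CLASS = "com.example.Main"
APP_JAR = "/shared/jars/app.jar"  # must be readable from every worker in cluster mode

cluster_cmd = [
    "./bin/spark-submit",
    "--class", MAIN_CLASS,
    "--master", MASTER_URL,
    "--deploy-mode", "cluster",
    "--driver-memory", "2g",  # memory taken from the worker that hosts the driver
    "--driver-cores", "1",    # cores taken from that same worker
    "--supervise",            # restart the driver if it exits with a non-zero code
    APP_JAR,
]

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cluster_cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.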
Which one is better? Not sure, that's actually for you to experiment and decide. There is no better decision here: you gain things from both the former and the latter, and it's up to you to see which one works better for your use case.
How do I choose which one my application is going to be running on,
using spark-submit
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Submitting Applications page:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
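To make the template concrete, here is a small sketch that fills it in for both deploy modes. `spark_submit_command` is an illustrative helper written for this example, not part of Spark, and the master URL, class name, and JAR name are assumed placeholders.

```python
import shlex


def spark_submit_command(master, deploy_mode, main_class, app_jar,
                         app_args=(), conf=None):
    """Assemble a spark-submit argument list following the template above."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(
            f"deploy mode must be 'client' or 'cluster', got {deploy_mode!r}"
        )
    cmd = [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    # Each configuration property becomes its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    cmd.extend(app_args)
    return cmd


# Same application, two deploy modes; only the --deploy-mode value differs.
client_cmd = spark_submit_command(
    "spark://master-host:7077", "client", "com.example.Main", "app.jar")
cluster_cmd = spark_submit_command(
    "spark://master-host:7077", "cluster", "com.example.Main", "app.jar")

print(shlex.join(client_cmd))
print(shlex.join(cluster_cmd))
```

The two printed commands are identical apart from the value after `--deploy-mode`, which is the only switch needed to move the driver between the submitting machine and a worker.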
Let's say you run spark-submit on EMR after connecting to the master node over SSH.
If you provide the option --deploy-mode cluster, the following things will happen:
You won't be able to see the detailed logs in the terminal.
Since the driver is not created on the Master itself, you won't be able to terminate the job from the terminal.
But in case of --deploy-mode client:
You will be able to see the detailed logs in the terminal.
You will be able to terminate the job from the terminal itself.
These are the basic things that I have noticed so far.
I have the same scenario: I submit from the master node of a standalone EC2 cluster. In this setup, client mode is appropriate. The driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console, so this mode is especially suitable for applications that involve a REPL.
Otherwise, if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize the network latency between the driver and the executors.

Can I specify a certain machine to be the driver in spark on yarn?

The question is exactly what is specified in the title.
I want to start my driver program on 192.168.1.1, but the fact is when I submit my spark application to yarn, yarn will choose a random machine to be the driver of my application.
Can I choose the driver manually in yarn cluster mode?
The answer in the duplicate question won't work on YARN.
Like Yaron replied before, with YARN as master you have two options:
client
cluster
If you select cluster mode then you let yarn manage where the driver is spawned, based on resource availability in Yarn. If you select client mode then the driver is spawned in the client process, on the server where you ran the spark-submit.
So, a solution for your problem should be to run the command
spark-submit --master yarn --deploy-mode client ...
on the machine you want the driver to be on.
Make sure that:
the machine has the resources to host the driver,
the resources you want to give to the driver are not committed to Yarn as well,
there is a Spark gateway (for CM) role on that machine
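Since client mode places the driver wherever spark-submit runs, a launcher script can guard against being started on the wrong machine. The sketch below is one possible way to do that. `yarn_client_command` is a hypothetical helper, the host name `edge-node-1` is a placeholder, and comparing plain host names is a simplification (an IP address such as 192.168.1.1 would need to be resolved first).

```python
import socket


def yarn_client_command(driver_host, app_jar, main_class, current_host=None):
    """Return a YARN client-mode spark-submit argv, but only on the intended host.

    Illustrative helper only; it is not part of Spark.
    """
    current_host = current_host or socket.gethostname()
    if current_host != driver_host:
        raise RuntimeError(
            f"run this on {driver_host}: in client mode the driver starts on "
            f"the submitting machine, which is currently {current_host}"
        )
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--class", main_class,
        app_jar,
    ]


# The current host is passed explicitly here so the example is deterministic.
cmd = yarn_client_command(
    "edge-node-1", "app.jar", "com.example.Main", current_host="edge-node-1")
print(" ".join(cmd))
```

In real use you would omit `current_host` so that the helper checks the machine it is actually running on.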
If you want to use a specific machine as the driver, you should use YARN Client mode
SPARK docs - launching spark on yarn:
There are two deploy modes that can be used to launch Spark
applications on YARN. In cluster mode, the Spark driver runs inside an
application master process which is managed by YARN on the cluster,
and the client can go away after initiating the application. In client
mode, the driver runs in the client process, and the application
master is only used for requesting resources from YARN.
In YARN Client mode - the driver runs in the client process (you can choose the driver machine: it is the machine that executes the spark-submit command)
In YARN Cluster mode - the Spark driver runs inside an application master process which is managed by YARN on the cluster.

Setting Driver manually in Spark Submit over Yarn Cluster

I noticed that when I start a job in spark submit using yarn, the driver and executor nodes get set randomly. Is it possible to set this manually, so that when I collect the data and write it to file, it can be written on the same node every single time?
As of right now, the parameter I tried playing around with are:
spark.yarn.am.port <driver-ip-address>
and
spark.driver.hostname <driver-ip-address>
Thanks!
If you submit to Yarn with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
Also you can configure node labels for executors using property: spark.yarn.executor.nodeLabelExpression
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
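As a sketch of how node labels could be passed on the command line: besides the executor property quoted above, Spark on YARN also documents `spark.yarn.am.nodeLabelExpression`, which restricts where the application master is scheduled. Because the driver runs inside the application master in cluster mode, that property indirectly constrains the driver's node. The label names `driver_nodes` and `worker_nodes` are hypothetical and assume node labels have already been set up in YARN; `label_conf_flags` is an illustrative helper, not a Spark API.

```python
def label_conf_flags(am_label=None, executor_label=None):
    """Build --conf flags for YARN node label expressions (illustrative helper)."""
    flags = []
    if am_label:
        # Restricts the application master, which hosts the driver in cluster mode.
        flags += ["--conf", f"spark.yarn.am.nodeLabelExpression={am_label}"]
    if executor_label:
        # Restricts the nodes that executors may be scheduled on.
        flags += ["--conf", f"spark.yarn.executor.nodeLabelExpression={executor_label}"]
    return flags


cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
cmd += label_conf_flags(am_label="driver_nodes", executor_label="worker_nodes")
cmd += ["--class", "com.example.Main", "app.jar"]  # placeholder class and JAR

print(" ".join(cmd))
```

This narrows the driver to a group of labeled nodes rather than one exact machine, unless only a single node carries the label.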
Docs - Running Spark on YARN - Latest Documentation
A Spark application on YARN can run in either yarn-cluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client
process, and the application master is only used for requesting resources from YARN.
So, as you can see, depending on the mode, Spark picks the Application Master. Nothing is chosen at random up to this stage. However, the worker nodes that the Application Master requests from the Resource Manager to perform tasks are picked based on their availability.

Can driver process run outside of the Spark cluster?

I read an answer from What conditions should cluster deploy mode be used instead of client?,
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark Doc says,
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit Spark tasks from any machine, as long as it is reachable from the master and has a Spark environment?
Or in other words, can driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a lot of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop. Make sure it can be resolved and all ports accessed from the cluster environment.
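The two points above can be sketched as an environment plus a command line. This example only builds and prints them; nothing is launched. `laptop_client_submit` is a hypothetical helper, and the configuration directory, host name, and port numbers are assumed placeholders. `spark.driver.host`, `spark.driver.port`, and `spark.blockManager.port` are existing Spark properties that let you pin the advertised address and the ports you need to open.

```python
import os
import shlex


def laptop_client_submit(hadoop_conf_dir, driver_host, driver_port,
                         block_manager_port):
    """Return (env, argv) for a client-mode submit from outside the cluster."""
    # Point Spark at a copy of the cluster's Hadoop configuration.
    env = dict(os.environ, HADOOP_CONF_DIR=hadoop_conf_dir)
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        # Address the executors use to reach the driver; must resolve from the cluster.
        "--conf", f"spark.driver.host={driver_host}",
        # Fixed ports are easier to allow through a firewall than random ones.
        "--conf", f"spark.driver.port={driver_port}",
        # Block manager port; this setting applies to the driver and the executors.
        "--conf", f"spark.blockManager.port={block_manager_port}",
        "--class", "com.example.Main",  # placeholder class
        "app.jar",                      # placeholder JAR
    ]
    return env, cmd


env, cmd = laptop_client_submit(
    "/home/me/cluster-conf", "laptop.example.com", 40000, 40001)
print("HADOOP_CONF_DIR=" + env["HADOOP_CONF_DIR"])
print(shlex.join(cmd))
```

To actually launch the job, the pair could be handed to `subprocess.run(cmd, env=env)` on a machine with Spark installed.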
Yes, I'm running spark-submit jobs over the LAN using option --deploy-mode cluster. Currently running into this issue however: the server response (json object) isn't very descriptive.

What conditions should cluster deploy mode be used instead of client?

The doc https://spark.apache.org/docs/1.1.0/submitting-applications.html
describes deploy-mode as :
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Using this diagram fig1 as a guide (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) :
If I kick off a Spark job :
./bin/spark-submit \
--class com.driver \
--master spark://MY_MASTER:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/Driver.jar
Then the Driver Program will run on MY_MASTER, as specified in fig1.
If instead I use --deploy-mode cluster, will the Driver Program be shared among the Worker Nodes? If this is true, does this mean that the Driver Program box in fig1 can be dropped (as it is no longer utilized), as the SparkContext will also be shared among the worker nodes?
Under what conditions should cluster be used instead of client?
No, when deploy-mode is client, the Driver Program is not necessarily the master node. You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
On the contrary, when deploy-mode is cluster, the cluster manager (master node) is used to find a slave having enough available resources to execute the Driver Program. As a result, the Driver Program would run on one of the slave nodes. As its execution is delegated, you cannot get the result from the Driver Program directly; it must store its results in a file, database, etc.
Client mode
Want to get a job result (dynamic analysis)
Easier for developing/debugging
Control where your Driver Program is running
Always up application: expose your Spark job launcher as REST service or a Web UI
Cluster mode
Easier for resource allocation (let the master decide): Fire and forget
Monitor your Driver Program from Master Web UI like other workers
Stop at the end: one job is finished, allocated resources are freed
I think this may help you understand. In the document https://spark.apache.org/docs/latest/submitting-applications.html
it says: "A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters or Python applications."
What about HADR?
In cluster mode, YARN restarts the driver without killing the executors.
In client mode, YARN automatically kills all executors if your driver is killed.
