Setting Driver manually in Spark Submit over Yarn Cluster - apache-spark

I noticed that when I start a job in spark submit using yarn, the driver and executor nodes get set randomly. Is it possible to set this manually, so that when I collect the data and write it to file, it can be written on the same node every single time?
As of right now, the parameter I tried playing around with are:
spark.yarn.am.port <driver-ip-address>
and
spark.driver.hostname <driver-ip-address>
Thanks!

If you submit to Yarn with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
Also you can configure node labels for executors using property: spark.yarn.executor.nodeLabelExpression
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
Docs - Running Spark on YARN - Latest Documentation

A spark cluster can run in either yarncluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client
process, and the application master is only used for requesting resources from YARN.
So as you see, depending upon the mode, the spark picks up the Application Master. Its not happened randomly until this stage. However, the worker nodes which the application master requests the resource manager to perform tasks will be randomly picked based on the availability of the worker nodes.

Related

Better understanding of communication between YARN and Spark

I would like to get a better understanding of the communication exchange between YARN and Spark.
For example:
What happens from the moment a Spark job is triggered until the allocation of the resources by YARN?
What happens when the Spark job requests for resources more than that are available with YARN?
What happens when the Spark job requests for resources more than the cluster capacity?
Steps done when we run spark-submit on Yarn client mode -
Spark driver internally invokes Client class submitApplication method. This submits a Spark application to a YARN cluster (i.e. to the YARN ResourceManager) and returns the application’s ApplicationId.
After this, spark uses the application_id generated in step 1 and calls createContainerLaunchContext method. This method creates a YARN ContainerLaunchContext request for YARN NodeManager to launch ApplicationMaster (in a container).
Step 2 is responsible for launching an ApplicationMaster for the application. If the cluster dont have resources to start an AM, then it will fail and driver will shut down with an exception. Once the AM is up and running, it contacts the driver and that it is up. At this point the spark yarn application is UP and running.
After this driver asks for resources (executors) to AM which then asks the same to Yarn ResourceManager.
If the yarn doesn't have that much capacity, it will give whatever is possible to the Spark Application. If it has capacity, it will give whatever is asked for.
More details here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-client.html

Can I specify a certain machine to be the driver in spark on yarn?

The question is exactly what is specified in the title.
I want to start my driver program on 192.168.1.1, but the fact is when I submit my spark application to yarn, yarn will choose a random machine to be the driver of my application.
Can I choose the driver manually in yarn cluster mode?
the dupilicated question won't work on yarn.
Like Yaron replied before, with YARN as master you have two options:
client
cluster
If you select cluster mode then you let yarn manage where the driver is spawned, based on resource availability in Yarn. If you select client mode then the driver is spawned in the client process, on the server where you ran the spark-submit.
So, a solution for your problem should be to run the command
spark-submit --master yarn --deploy-mode client ...
on the machine you want the driver to be on.
Make sure that:
the machine has the resources to host the driver,
the resources you want to give to the driver are not committed to Yarn as well
there is a Spark gateway (for CM) role on that machine
If you want to use a specific machine as the driver, you should use YARN Client mode
SPARK docs - launching spark on yarn:
There are two deploy modes that can be used to launch Spark
applications on YARN. In cluster mode, the Spark driver runs inside an
application master process which is managed by YARN on the cluster,
and the client can go away after initiating the application. In client
mode, the driver runs in the client process, and the application
master is only used for requesting resources from YARN.
In YARN Client mode - the driver runs in the client process (you can choose the driver machine, it is the machine which execute the spark-submit command)
In YARN Cluster mode - the Spark driver runs inside an application master process which is managed by YARN on the cluster.

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster ?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.getBuilder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to Yarn happens before SparkConf.setMaster.
When you use --master yarn --deploy-mode cluster, Spark will run its main method in your local machine and upload the jar to run on Yarn. Yarn will allocate a container as the application master to run the Spark driver, a.k.a, your codes. SparkConf.setMaster("local") runs inside a Yarn container, and then it creates SparkContext running in the local mode, and doesn't use the Yarn cluster resources.
I recommend that not setting master in your codes. Just use the command line --master or the MASTER env to specify the Spark master.
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It even got improved in the recent versions of Spark where it adds a nice-looking application name (aka appName) so you don't have to set it (or even...please don't...hardcore it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of when you could have setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.

How to config a specific node as a driver node in spark-submit with cluster mode and standalone mode

I have to problem in spark-submit with cluster deploy mode and standalone mode:
How to specify a node as a driver node in spark cluster
in my case, the driver node was assigned dynamically by spark
How to distribute the app automatic from local
in my case, i must deploy the jar of app to every node,because i don't know which node will be the driver node .
PS : My submit command is :
spark-submit --master spark://master_ip:6066 --class appMainClass --deploy-mode cluster file:///tmp/spark_app/sparkrun
The --deploy-mode flag determines if the job will be submitted in cluster or client mode.
In cluster mode all the nodes will act as executors. One node will submit the JAR and then you can track the execution using web UI. That particular node will also act as an executor.
In client mode, the node where the spark-submit is invoked will act as the driver. Note that this node will not execute the DAG as this it is designated as a driver for your cluster. All the other nodes will be executors. Again, Web UI will help to see the execution of jobs and other useful information like RDD partitions, cached RDDs size etc.

Specify spark driver for spark-submit

I'm submitting a spark job from a shell script that has a bunch of env vars and parameters to pass to spark. Strangely, the driver host is not one of these parameters (there are driver cores and memory however). So if I have 3 machines in the cluster, a driver will be chosen randomly. I don't want this behaviour since 1) the jar I'm submitting is only on one of the machines and 2) the driver machine should often be smaller than the other machines which is not the case if it's random choice.
So far, I found no way to specify this param on the command line to spark-submit. I've tried --conf SPARK_DRIVER_HOST="172.30.1.123, --conf spark.driver.host="172.30.1.123 and many other things but nothing has any effect. I'm using spark 2.1.0. Thanks.
I assume you are running with Yarn cluster. In brief yarn uses containers to launch and implement tasks. And resource manager decides where to run which container based on availability of resources. In spark case drivers and executors also launched as containers with separate jvms. Driver dedicated to splitting tasks among executors and collect the results from them. If your node from where you launch your application included in cluster then it will be also used as shared resource for launching driver/executor.
From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
When running the cluster in standalone or in Mesos the driver host (this is the master) can be launched with:
--master <master-url> #e.g. spark://23.195.26.187:7077
When using YARN it works a little different. Here the parameter is yarn
--master yarn
The yarn is specified in Hadoop its configuration for the ResourceManager. For how to do this see this interesting guide https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/ . Basically in the hdfs the hdfs-site.xml and in yarn the yarn-site.xml

Resources