How does Spark prepare executors on Hadoop YARN? - apache-spark

I'm trying to understand the details of how Spark prepares the executors. To do this, I debugged org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is: who delivers these files to each executor and then runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all auto-generated by YARN?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.

Ouch... you're asking for quite a lot of detail on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (there are others that Spark supports, like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes, each with its own way of dealing with Spark deployments).
By default, while submitting a Spark application using spark-submit, the Spark application (i.e. the SparkContext it uses) requests three YARN containers. One container is for the Spark application's ApplicationMaster, which knows how to talk to YARN and requests two other YARN containers for the two Spark executors.
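That default corresponds to what you could also request explicitly on the command line. The flags below are standard spark-submit options for YARN; the jar name is made up:
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 2g --executor-cores 1 my-app.jar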
You could review the official YARN documentation, Apache Hadoop YARN and Hadoop: Writing YARN Applications, to dig deeper into YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
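If you want to see what that protocol looks like without reading Spark's sources, here is a minimal sketch of what any YARN client (Spark's Client included) does, written against the plain YARN client API. All names and values are illustrative, not Spark's actual code:
import java.util.Collections

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records

object MinimalYarnSubmit {
  def main(args: Array[String]): Unit = {
    // Connect to the YARN ResourceManager (config comes from HADOOP_CONF_DIR).
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the ResourceManager for a new application id.
    val app = yarnClient.createApplication()
    val appContext = app.getApplicationSubmissionContext
    appContext.setApplicationName("minimal-yarn-app")

    // The launch context for container 0 (the ApplicationMaster): the command
    // to run, plus (not shown here) environment and "local resources" -- the
    // files the NodeManager downloads into the container's working directory,
    // which is exactly where the files from the question come from.
    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setCommands(Collections.singletonList(
      "{{JAVA_HOME}}/bin/java my.fake.ApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"))
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(Resource.newInstance(1024, 1)) // 1 GB, 1 vcore

    // The counterpart of Spark's Client.submitApplication.
    val appId = yarnClient.submitApplication(appContext)
    println(s"Submitted as $appId")
  }
}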
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
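If you open launch_container.sh in an executor's container directory, the command at the bottom looks roughly like the following. This is a reconstruction from a Spark 2.x on YARN setup with the classpath options omitted; the exact flags vary by Spark version:
{{JAVA_HOME}}/bin/java -server -Xmx2048m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.0.0.1:36768 --executor-id 1 --hostname node1.example.com --cores 2 --app-id application_1507907717252_15771 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr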
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
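For example, with the default log4j setup you can turn on DEBUG for the YARN submission client alone by adding this line to conf/log4j.properties on the machine you run spark-submit from (the logger name is the Client class mentioned above):
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG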
You can find more information in the official Spark documentation's Running Spark on YARN and in my Mastering Apache Spark 2 gitbook.

Related

Better understanding of communication between YARN and Spark

I would like to get a better understanding of the communication exchange between YARN and Spark.
For example:
What happens from the moment a Spark job is triggered until the allocation of the resources by YARN?
What happens when the Spark job requests more resources than are currently available in YARN?
What happens when the Spark job requests more resources than the total cluster capacity?
Steps performed when we run spark-submit on YARN in client mode:
1. The Spark driver internally invokes the submitApplication method of the Client class. This submits the Spark application to the YARN cluster (i.e. to the YARN ResourceManager) and returns the application's ApplicationId.
2. Spark then uses the ApplicationId generated in step 1 and calls the createContainerLaunchContext method. This creates a YARN ContainerLaunchContext request for the YARN NodeManager to launch the ApplicationMaster (in a container).
3. Step 2 launches an ApplicationMaster for the application. If the cluster doesn't have the resources to start an AM, the request fails and the driver shuts down with an exception. Once the AM is up and running, it contacts the driver to report that it is up. At this point the Spark YARN application is up and running.
4. The driver then asks the AM for resources (executors), and the AM in turn asks the YARN ResourceManager, as sketched below.
5. If YARN doesn't have that much capacity, it gives the Spark application whatever is possible. If it has the capacity, it gives whatever is asked for.
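A minimal sketch of where those executor numbers come from on the driver side; the values are illustrative and can equally be passed as spark-submit flags:
import org.apache.spark.{SparkConf, SparkContext}

object ResourceRequestDemo {
  def main(args: Array[String]): Unit = {
    // These settings drive the driver's ask to the AM (step 4), which the AM
    // forwards to the YARN ResourceManager (step 5).
    val conf = new SparkConf()
      .setAppName("resource-request-demo")
      .set("spark.executor.instances", "4") // how many executors to request
      .set("spark.executor.memory", "2g")   // memory per executor container
      .set("spark.executor.cores", "2")     // cores per executor
    val sc = new SparkContext(conf)
    // ... job code would go here ...
    sc.stop()
  }
}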
More details here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-client.html

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured on it by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my local machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. Check under the Executors tab; it shows the node IP addresses as well.
Could you please share your spark-submit configuration?
Your command spark-submit run.py doesn't seem to send your job to YARN. To do that, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, the execution will be straightforward. Anyway, check this link for the other parameters that spark-submit accepts.
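Once the job is actually on YARN, a quick way to confirm from the Scala spark-shell that the workers registered as executors (this reads the same data the Executors tab shows; the list also includes the driver itself):
sc.statusTracker.getExecutorInfos.foreach { e =>
  println(s"executor at ${e.host()}:${e.port()}, running tasks: ${e.numRunningTasks()}")
}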
For a Dataproc cluster (a Hadoop cluster on Google Cloud) you have two options to check the job history, including the jobs that are running:
From the command line on the master: yarn application -list. This option sometimes needs additional configuration; if you have trouble, this link will be useful.
Through the UI. Dataproc gives you access to the Spark web UI, which makes monitoring tasks easier. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use a SOCKS proxy.
Hope the information above helps you.

How to setup YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7, with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks.
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the page on the Hadoop Cluster Setup.
Then you can use this documentation to learn about Spark on YARN.
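The one thing that documentation insists on: the machine you submit from must be able to find the Hadoop/YARN configuration, typically via HADOOP_CONF_DIR. A minimal sketch, with an illustrative path and jar name:
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn --deploy-mode cluster my-app.jar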
Another method to quickly learn about Hadoop, YARN and Spark is to utilize Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using a similar setup in AWS. AWS EMR is costly, hence we set up our own cluster using EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; however, we chose HDP.
The setup includes the following.
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on installing manually can be found in the HDP documentation.
You can see part of that automation here.

Apache Spark and Mesos running on a single node

I am interested in testing Spark running on Mesos. I created a Hadoop 2.6.0 single-node cluster in my Virtualbox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed Mesos Master and Slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following error with Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
spark-shell is successfully registered as a framework in Mesos. Is there anything wrong with using a single-node setup, or do I need to add more Spark worker nodes?
I am very new to Spark and my aim is to just test Spark, HDFS, and Mesos.
If you have allocated enough resources for the Spark slaves, the cause might be a firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources

YARN vs Spark processing engine based on real time application?

I understand YARN and Spark, but I want to know when I need to use YARN and when the Spark processing engine. What case studies would help me identify the difference between YARN and Spark?
You cannot compare YARN and Spark directly per se. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce ships as a feature of YARN, while Spark does not.
If you mean comparing MapReduce and Spark, I suggest reading this other answer.
Apache Spark can run on YARN, on Mesos, or in Standalone mode.
Spark in Standalone mode - all resource management and job scheduling are handled by Spark itself.
Spark on YARN - YARN is the resource manager introduced in MRv2, which supports not only native Hadoop applications but also Spark, Kafka, Elasticsearch and other custom applications.
Spark on Mesos - Spark also supports Mesos, which is one more type of resource manager.
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs. Only with YARN can Spark run against Kerberized Hadoop clusters, using secure authentication between its processes.
See the links for more documentation on YARN and Spark.
To conclude: if you want to build a small and simple cluster independent of everything, go for Standalone mode. If you want to use an existing Hadoop cluster, go for YARN or Mesos.
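Related to that dynamic sharing: Spark on YARN can also grow and shrink its own executor count at runtime via dynamic allocation, switched on with two settings, typically in spark-defaults.conf (the external shuffle service also has to be set up on each NodeManager):
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true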
