YARN vs Spark processing engine based on real time application? - apache-spark

I understand what YARN and Spark are, but I want to know when I should use YARN and when I should use the Spark processing engine. Are there case studies or use cases that show the difference between YARN and Spark?

You cannot compare YARN and Spark directly. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce ships with YARN, whereas Spark does not.
If you meant to compare MapReduce and Spark, I suggest reading this other answer.

Apache Spark can run on YARN, on Mesos, or in standalone mode.
Spark in standalone mode - all resource management and job scheduling are handled by Spark's built-in cluster manager.
Spark on YARN - YARN is the resource manager introduced with MRv2. It supports not only native Hadoop workloads but also Spark, Kafka, Elasticsearch and other custom applications.
Spark on Mesos - Spark also supports Mesos, which is another resource manager.
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs. At the time of writing, only with YARN could Spark run against Kerberized Hadoop clusters and use secure authentication between its processes.
Link for more documentation on YARN, Spark.
In conclusion: if you want to build a small, simple cluster that is independent of everything else, go for standalone. If you want to use an existing Hadoop cluster, go for YARN/Mesos.
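As a rough illustration of the three options above, the choice of cluster manager mostly shows up as the value passed to spark-submit's --master option. The sketch below only builds the command line and does not run anything. The host name and application file are hypothetical, and the default ports (7077 for standalone, 5050 for Mesos) are the usual defaults, assumed here.

```python
# Usual default master ports; adjust if your cluster uses different ones.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}
SCHEMES = {"standalone": "spark", "mesos": "mesos"}


def master_url(manager, host=None, port=None):
    """Return the --master value for the given cluster manager."""
    if manager == "yarn":
        # YARN takes no host here: the ResourceManager address is read
        # from the Hadoop configuration files, not from the master URL.
        return "yarn"
    if manager not in SCHEMES:
        raise ValueError(f"unknown cluster manager: {manager}")
    if host is None:
        raise ValueError(f"{manager} needs a master host")
    return f"{SCHEMES[manager]}://{host}:{port or DEFAULT_PORTS[manager]}"


def submit_command(manager, app, host=None, port=None):
    """Build (but do not execute) a spark-submit command line."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


# Hypothetical host and application names, for illustration only.
for manager in ("standalone", "yarn", "mesos"):
    print(" ".join(submit_command(manager, "my_app.py", host="master.example.com")))
```

The application code itself stays the same across the three managers; only the submission target changes.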

Related

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications (besides, perhaps, authentication)?
There are a lot of answers on Google, but many of them sound wrong to me, so I'm not sure where the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require agents running on the cluster nodes; for example, YARN's are called NodeManagers. They may consume even more memory than Spark's worker daemons (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
This contradicts "Spark Standalone # executor/cores control", which shows how you can specify the amount of resources consumed in Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This is at odds with the fact that Standalone mode can use Dynamic Allocation, where you can specify spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. I also haven't found any note saying that Standalone doesn't support the FairScheduler.
This answer
YARN directly handles rack and machine locality
How can YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore), and inside the Spark job I'm querying some-db.some-table. How can YARN know which executor is better for job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.
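To make the Standalone resource controls mentioned in the question concrete, here is a minimal sketch of the relevant properties. The property names are real Spark settings, but every value is a hypothetical example. The executor count is only an upper bound, assuming the workers have enough free cores and memory.

```python
# Hypothetical values for illustration; tune them for your own cluster.
standalone_conf = {
    "spark.cores.max": 12,           # total cores one application may take
    "spark.executor.cores": 4,       # cores per executor
    "spark.executor.memory": "4g",   # memory per executor
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": 1,
    "spark.dynamicAllocation.maxExecutors": 3,
    # Dynamic allocation normally relies on the external shuffle service,
    # which also has to be enabled on the workers.
    "spark.shuffle.service.enabled": "true",
}


def max_executors(conf):
    """Upper bound on executors for one application in Standalone mode."""
    return int(conf["spark.cores.max"]) // int(conf["spark.executor.cores"])


def to_submit_args(conf):
    """Turn a property dict into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(max_executors(standalone_conf))
print(to_submit_args(standalone_conf)[:2])
```

With these example numbers an application is capped at 12 cores, so it can hold at most three 4-core executors, which is why maxExecutors is set to 3 here.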

How to setup YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7 with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks.
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the Hadoop Cluster Setup page.
Then you can use this documentation to learn about Spark on YARN.
Another way to quickly learn about Hadoop, YARN and Spark is to use the Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
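Once Hadoop is installed, Spark finds YARN through the client-side Hadoop configuration: HADOOP_CONF_DIR or YARN_CONF_DIR must point to the directory holding those files. The sketch below checks for that before building a cluster-mode submission. The function names, the example path and the application file are hypothetical, and the command is only built, not executed.

```python
import os


def hadoop_conf_dir(env=None):
    """Return the Hadoop client configuration directory Spark would use."""
    env = os.environ if env is None else env
    for var in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        if env.get(var):
            return env[var]
    raise RuntimeError("set HADOOP_CONF_DIR or YARN_CONF_DIR before submitting to YARN")


def yarn_cluster_submit(app, env=None):
    """Build a spark-submit command for YARN cluster mode."""
    hadoop_conf_dir(env)  # fail early if the Hadoop config cannot be located
    return ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", app]


# Hypothetical environment and application, for illustration only.
example_env = {"HADOOP_CONF_DIR": "/etc/hadoop/conf"}
print(hadoop_conf_dir(example_env))
print(" ".join(yarn_cluster_submit("my_app.py", example_env)))
```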
We are currently using a similar setup in AWS. AWS EMR is costly, hence we set up our own cluster on EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; we chose HDP.
The setup included the following.
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.
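With a Livy server on the master node, as in the setup above, jobs can be submitted over REST by posting to Livy's /batches endpoint. The sketch below only constructs the request using the standard library. The host name, the HDFS path and the helper name are hypothetical, and 8998 is assumed to be the Livy port (its usual default).

```python
import json
from urllib.request import Request


def livy_batch_request(livy_url, app_file, args=(), conf=None):
    """Build a POST request that asks Livy to run a batch job."""
    payload = {
        "file": app_file,      # must be readable by the cluster, e.g. on HDFS
        "args": list(args),
        "conf": conf or {},
    }
    return Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server and application, for illustration only.
req = livy_batch_request(
    "http://master.example.com:8998",
    "hdfs:///apps/my_app.py",
    args=["2024-01-01"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the example has no side effects.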

How does Spark prepare executors on Hadoop YARN?

I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Ouch...you're asking for quite a lot of details on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (as there are others that Spark supports like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes that have their own ways to deal with Spark deployments).
By default, when you submit a Spark application using spark-submit, the application (actually, the SparkContext it uses) requests three YARN containers. One container is for the application's ApplicationMaster, which knows how to talk to YARN and requests two more YARN containers for two Spark executors.
You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext holding all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
You can find more information in the Spark official documentation's Running Spark on YARN and the Mastering Apache Spark 2 gitbook of mine.
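The working directory from the question already tells you which application, attempt and container you are looking at. A small sketch that pulls those parts out of the path follows. It assumes the usual YARN naming of container_<clusterTimestamp>_<appSequence>_<attempt>_<containerNumber>, with an optional e<epoch>_ prefix; the function name is made up for this example.

```python
import re

# Assumed layout of a YARN container ID, with an optional epoch prefix.
CONTAINER_RE = re.compile(
    r"container_(?:e(?P<epoch>\d+)_)?"
    r"(?P<cluster_ts>\d+)_(?P<app_seq>\d+)_(?P<attempt>\d+)_(?P<container>\d+)"
)


def parse_container_dir(path):
    """Extract application ID, attempt and container number from a path."""
    match = CONTAINER_RE.search(path)
    if match is None:
        raise ValueError(f"no YARN container ID found in {path!r}")
    return {
        "application_id": f"application_{match['cluster_ts']}_{match['app_seq']}",
        "attempt": int(match["attempt"]),
        "container": int(match["container"]),
    }


# The directory reported in the question.
path = (
    "/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/"
    "application_1507907717252_15771/container_1507907717252_15771_01_000002/"
)
print(parse_container_dir(path))
```

For the path in the question this yields application_1507907717252_15771, attempt 1, container 2.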

What's the difference between HDInsight Hadoop cluster & HDInsight Spark cluster?

What's the difference between an HDInsight Hadoop cluster and an HDInsight Spark cluster? I have seen that pyspark is available even in a Hadoop cluster. Is the difference in the cluster type, i.e. does a Hadoop cluster imply YARN as the cluster management layer, and a Spark cluster imply Spark Standalone (or Mesos?) as the cluster management layer?
If that is the case, I believe we can still run Spark in a Hadoop cluster, so Spark will run on top of YARN.
HDInsight Spark uses YARN as its cluster management layer, just as Hadoop does. The binaries on the cluster are the same.
The differences between HDInsight Spark and Hadoop clusters are the following:
1) Optimal configurations:
Spark clusters are tuned and configured for Spark workloads. For example, we have pre-configured Spark clusters to use SSDs and to adjust executor memory size based on machine resources, so customers get a better out-of-the-box experience than with the default Spark configuration.
2) Service setup:
Spark clusters also run Spark-related services, including Livy, Jupyter, and the Spark Thrift Server.
3) Workload quality: We test Spark workloads on Spark clusters prior to every release to ensure quality of service.
The bits are the same, as you noticed. The difference is the set of services and Ambari components that run by default (on Spark you will additionally have Spark Thrift, Livy and Jupyter) and the set of configurations for those services. So while you technically can run Spark jobs on YARN on a Hadoop cluster, it's not recommended: some configs may not be set to optimal values. The other way around would be more reliable: create a Spark cluster and run Hadoop jobs on it.
Maxim (HDInsight Spark PM)

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code using a Jupyter notebook that calls pyspark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say YES, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
In any case, Spark standalone mode also requires worker daemons, which cannot run non-Spark applications, whereas with YARN everything is isolated in containers, so adopting another compute framework should be a code change instead of an infrastructure + code change. The cluster can therefore be shared among different frameworks.
YARN is the only cluster manager for Spark that supports security. With
YARN, Spark can run against Kerberized Hadoop clusters and uses
secure authentication between its processes.
YARN allows you to dynamically share and centrally configure the same
pool of cluster resources between all frameworks that run on YARN.
You can throw your entire cluster at a MapReduce job, then use some
of it on an Impala query and the rest on Spark application, without
any changes in configuration.
I would say points 1, 2 and 3 apply to the scenario you describe, but not point 4, since we assumed no other frameworks are going to use the cluster.
source
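As a sketch of what the scheduler and security points look like in practice, spark-submit accepts a YARN queue and Kerberos credentials on the command line. The --queue, --principal and --keytab options are used below; the queue name, principal, keytab path and application file are all hypothetical, and the command is only built, not run.

```python
def secure_yarn_submit(app, queue, principal, keytab):
    """Build a spark-submit command for a Kerberized YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,          # YARN scheduler queue to submit into
        "--principal", principal,  # Kerberos principal to log in as
        "--keytab", keytab,        # keytab file holding that principal's key
        app,
    ]


# Hypothetical queue and credentials, for illustration only.
cmd = secure_yarn_submit(
    "my_app.py",
    queue="analytics",
    principal="john@EXAMPLE.COM",
    keytab="/etc/security/keytabs/john.keytab",
)
print(" ".join(cmd))
```

Queue capacities and priorities themselves are defined on the YARN side, in the scheduler configuration, not in the Spark application.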

Resources