With how many Spark nodes should I use Mesos or YARN?

I currently run a cluster with 4 Spark nodes and 1 Solr node. I want to expand the cluster quickly to 20 nodes and afterwards to around 100. I am just not sure at what cluster size it would make sense to use Mesos or YARN. Does it make sense to add YARN or Mesos when I have fewer than 100 nodes?
Thanks

Mesos and YARN can both scale up to thousands of nodes without any issue.
It is the workload that decides which one to use: if your workload consists only of Spark or Hadoop jobs/tasks, YARN would be a better choice; if you also have Docker containers or something else to run, Mesos would be a better choice.
There are many other advantages and disadvantages to using Mesos; please find them in the comparison here.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
If you have fewer than 100 nodes and are not going to run any other applications alongside Spark, a Spark standalone cluster would be the better choice, as anything more would be overkill.
Again, it depends on the capabilities you would like to use: if you need queues or schedulers such as the Fair Scheduler, then YARN/Mesos would make sense.
(Whether you need these features depends on what you do with the Spark cluster, your workload, and how busy the cluster is.)
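The rule of thumb in this answer can be written down as a small decision helper. This is only a sketch of the reasoning above, not an official recommendation: the function name and its three boolean parameters are made up for illustration, and where the answer says "YARN/Mesos" the sketch simply picks YARN.

```python
def choose_cluster_manager(only_spark: bool,
                           needs_queues: bool,
                           runs_containers: bool) -> str:
    """Encode the answer's rule of thumb (hypothetical helper, not a Spark API)."""
    # Docker containers or other non-Hadoop workloads: the answer suggests Mesos.
    if runs_containers:
        return "mesos"
    # Spark-only cluster with no need for queues or richer scheduling:
    # the standalone manager is enough.
    if only_spark and not needs_queues:
        return "standalone"
    # Spark/Hadoop workloads, or richer scheduling wanted: YARN
    # (the answer notes that Mesos would also qualify here).
    return "yarn"


if __name__ == "__main__":
    # The asker's case: Spark only, no queues, no containers.
    print(choose_cluster_manager(only_spark=True,
                                 needs_queues=False,
                                 runs_containers=False))
```

The node count is deliberately not a parameter: the answer treats the workload, not the cluster size, as the deciding factor.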

Related

Does Spark Streaming need HDFS with Kafka?

I have to design a setup to read incoming streaming data from Twitter. I have decided to use Apache Kafka with Spark Streaming for real-time processing. The analytics must be shown in a dashboard.
I am a newbie in this domain. My assumed data rate is 10 Mb/sec maximum. I have decided to use one machine with 12 cores and 16 GB of memory for Kafka; ZooKeeper will run on the same machine. I am confused about Spark, which only has to perform the streaming job analysis. The analyzed output is then pushed to a DB and the dashboard.
My questions:
Should I run Spark on a Hadoop cluster or on the local file system?
Can Spark's standalone mode fulfill my requirements?
Is my approach appropriate, or what would be best in this case?
My attempt at an answer:
Should I run Spark on a Hadoop cluster or on the local file system?
I recommend using HDFS; it can store more data and ensures high availability.
Can Spark's standalone mode fulfill my requirements?
Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN doesn’t need to run a separate ZooKeeper Failover Controller.
YARN will likely be preinstalled in many Hadoop distributions, such as CDH Hadoop.
So I recommend YARN.
Useful links:
spark yarn doc
spark standalone doc
other wonderful answer
Is my approach appropriate, or what would be best in this case?
If your data does not exceed 10 million, I think you can use a local cluster.
Local mode avoids shuffling across many nodes; shuffles between processes are faster than shuffles between nodes.
Otherwise, I recommend using at least 3 nodes, which makes a real Hadoop cluster.
I am a Spark beginner, so this is just my understanding. I hope an expert will correct me.
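For reference, the deployment modes discussed in this answer differ mainly in the master URL handed to Spark. A minimal sketch, assuming a hypothetical host name `spark-master-01`; the `master_url` helper is illustrative and not part of Spark, while the URL formats themselves (`local[*]`, `spark://host:port`, `yarn`) are the ones Spark accepts.

```python
from typing import Optional


def master_url(mode: str, host: Optional[str] = None, port: int = 7077) -> str:
    """Return the value to pass as --master for a given deployment mode."""
    if mode == "local":
        # Single machine, one worker thread per available core.
        return "local[*]"
    if mode == "standalone":
        if host is None:
            raise ValueError("standalone mode needs the master host name")
        # 7077 is the default port of the standalone master.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration
        # (HADOOP_CONF_DIR / YARN_CONF_DIR), not from the URL.
        return "yarn"
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("local", "standalone", "yarn"):
        print(mode, "->", master_url(mode, host="spark-master-01"))
```

Switching from local mode to a cluster later is then a matter of changing this one value, provided the job does not depend on local file paths.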

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Perhaps besides authentication.
Google turns up a lot of answers, but most of them sound wrong to me, so I'm not sure where the truth lies.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on the cluster nodes; YARN's, for example, are called NodeManagers. They may consume even more memory than Spark's slaves (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
This goes against the "Spark Standalone # executor/cores control" question, which shows how to specify the amount of resources consumed in Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This goes against the fact that Standalone mode can use Dynamic Allocation, where you can specify spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. I also haven't found any note saying that Standalone doesn't support FairScheduler.
This answer
YARN directly handles rack and machine locality
How can YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside the Spark job I'm querying some-db.some-table. How can YARN know which executor is better for job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.
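The two Standalone-mode points raised in this question (per-application resource caps and Dynamic Allocation) can be put together in one set of properties. This is a sketch under assumptions: the numbers are arbitrary, `to_submit_args` and `max_executors_by_cores` are hypothetical helpers, and only the property names are standard Spark configuration keys. Whether the external shuffle service is required for dynamic allocation depends on the Spark version, so check the documentation for yours.

```python
from typing import Dict, List

# Example values only; tune them for the actual cluster.
standalone_conf: Dict[str, str] = {
    # Cap on the total cores one application may take in standalone mode.
    "spark.cores.max": "16",
    # Cores and memory per executor.
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    # Dynamic allocation bounds mentioned in the question.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "4",
    # External shuffle service, commonly enabled with dynamic allocation.
    "spark.shuffle.service.enabled": "true",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as --conf key=value arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


def max_executors_by_cores(conf: Dict[str, str]) -> int:
    """Upper bound on executors implied by the core settings alone."""
    return int(conf["spark.cores.max"]) // int(conf["spark.executor.cores"])


if __name__ == "__main__":
    print(" ".join(to_submit_args(standalone_conf)))
    print("executors by cores:", max_executors_by_cores(standalone_conf))
```

With these settings an application is not forced to run an executor on every node: the core cap and per-executor size bound how much of the cluster it can occupy.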

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code from a Jupyter notebook calling pyspark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say YES, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
Spark standalone mode also requires workers on the nodes, and these cannot run non-Spark applications, whereas with YARN the work is isolated in containers, so adopting another compute framework should only require a code change instead of infrastructure plus code changes. The cluster can therefore be shared among different frameworks.
YARN is the only cluster manager for Spark that supports security. With
YARN, Spark can run against Kerberized Hadoop clusters and uses
secure authentication between its processes.
YARN allows you to dynamically share and centrally configure the same
pool of cluster resources between all frameworks that run on YARN.
You can throw your entire cluster at a MapReduce job, then use some
of it on an Impala query and the rest on Spark application, without
any changes in configuration.
I would say points 1, 2 and 3 are relevant to the scenario you describe, but not point 4, as we assumed no other frameworks are going to use the cluster.
source
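In practice, point 1 means submitting each application to a named YARN queue. A hedged sketch: the queue name `analytics` and the script name `job.py` are hypothetical, the queue must already exist in the YARN scheduler configuration, and `build_submit_command` is just an illustrative helper that assembles a spark-submit command line from real spark-submit options.

```python
import shlex
from typing import List


def build_submit_command(app: str, queue: str, num_executors: int) -> List[str]:
    """Assemble a spark-submit invocation that targets a specific YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        # The YARN queue decides scheduling priority and resource share.
        "--queue", queue,
        "--num-executors", str(num_executors),
        app,
    ]


if __name__ == "__main__":
    print(shlex.join(build_submit_command("job.py", "analytics", 4)))
```

From a notebook, the same queue can be selected with the `spark.yarn.queue` property when the session is created.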

How to best manage all my nodes' CPU, memory and storage with DataStax Spark?

I now have a cluster of 4 Spark nodes and 1 Solr node and use Cassandra as my database. I want to increase the number of nodes to 20 in the medium term and to 100 in the long term. But DataStax doesn't seem to support Mesos or YARN. How would I best manage the CPU, memory and storage of all these nodes? Is Mesos even necessary with 20 or 100 nodes? So far I couldn't find any example of this using DataStax. I usually do not have jobs that need to be completed; instead I am running a continuous stream of data. That's why I am even thinking of dropping DataStax, because in my opinion I couldn't manage this many nodes efficiently without YARN or Mesos, but maybe there is a better solution I haven't thought of? Also, I am using Python, so apparently YARN is my only option.
If you have any suggestions or best practice examples let me know.
Thanks!
If you want to run DSE with a supported Hadoop/YARN environment you need to use BYOH; read about it HERE. In BYOH you can either run the internal Hadoop platform in DSE, or you can run a Cloudera or HDP platform with YARN and anything else that is available.

Which cluster type should I choose for Spark?

I am new to Apache Spark, and I just learned that Spark supports three types of cluster:
Standalone - meaning Spark will manage its own cluster
YARN - using Hadoop's YARN resource manager
Mesos - Apache's dedicated resource manager project
I think I should try Standalone first. In the future, I need to build a large cluster (hundreds of instances).
Which cluster type should I choose?
Spark Standalone Manager : A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
A few benefits of YARN over Standalone & Mesos:
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use
YARN directly handles rack and machine locality in your requests, which is convenient.
The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos you get resource "offers" and choose to accept or reject those based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.
If you have a big Hadoop cluster already in place, YARN is a better choice.
The Standalone manager requires the user to configure each of the nodes with the shared secret. Mesos’ default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service level authorization, authentication for Web consoles and data confidentiality. Hadoop authentication uses Kerberos to verify each user and service.
High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.
Useful links:
spark documentation page
agildata article
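The request-versus-offer difference described in this answer can be illustrated with a toy model. This is not the YARN or Mesos API: every name below (`Resources`, `request_model`, `offer_model`, the node names) is invented for the illustration, and real schedulers are far more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int


def fits(need: Resources, have: Resources) -> bool:
    """True if the needed resources fit into what is available."""
    return need.cpus <= have.cpus and need.mem_gb <= have.mem_gb


def request_model(free: Dict[str, Resources], need: Resources,
                  preferred: List[str]) -> Optional[str]:
    """YARN-like: the framework states what it needs plus locality
    preferences, and the manager picks the node."""
    ordered = preferred + [n for n in free if n not in preferred]
    for node in ordered:
        if node in free and fits(need, free[node]):
            return node
    return None


def offer_model(offers: List[Tuple[str, Resources]],
                accept: Callable[[str, Resources], bool]) -> Optional[str]:
    """Mesos-like: the manager offers resources and the framework's own
    policy accepts or rejects each offer."""
    for node, offered in offers:
        if accept(node, offered):
            return node
    return None


if __name__ == "__main__":
    free = {"node-a": Resources(2, 4), "node-b": Resources(8, 32)}
    need = Resources(4, 8)
    print(request_model(free, need, preferred=["node-a"]))
    print(offer_model(list(free.items()), lambda node, r: fits(need, r)))
```

In the first function the placement logic lives in the manager; in the second it lives in the `accept` callback supplied by the framework, which is the extra work the answer refers to.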
I think the people best placed to answer this are those who work on Spark. So, from Learning Spark:
Start with a standalone cluster if this is a new deployment.
Standalone mode is the easiest to set up and will provide almost all
the same features as the other cluster managers if you are only
running Spark.
If you would like to run Spark alongside other applications, or to use
richer resource scheduling capabilities (e.g. queues), both YARN and
Mesos provide these features. Of these, YARN will likely be
preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its
fine-grained sharing option, which lets interactive applications such
as the Spark shell scale down their CPU allocation between commands.
This makes it attractive in environments where multiple users are
running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for
fast access to storage. You can install Mesos or the standalone
cluster manager on the same nodes manually, or most Hadoop
distributions already install YARN and HDFS together.
Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.
Between YARN and Mesos, one thing to consider is that, unlike MapReduce, a Spark job grabs executors and holds them for the entire lifetime of the job, whereas a MapReduce job can acquire and release mappers and reducers over its lifetime.
If you have long-running Spark jobs that do not fully utilize the resources they acquired at the start, you may want to share those resources with other apps, and you can only do that via Mesos or Spark dynamic scheduling. https://spark.apache.org/docs/2.0.2/job-scheduling.html#scheduling-across-applications
So with YARN, the only way to get dynamic allocation for Spark is to use Spark's own dynamic allocation. YARN won't interfere in that, while Mesos will. Again, this whole point only matters if you have a long-running Spark application and would like to scale it up and down dynamically.
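To make the scale-up and scale-down idea concrete, here is a simplified model of how a target executor count can follow the task backlog. It is a sketch of the idea only, not Spark's actual allocation code: real Spark also waits for backlog and idle timeouts before adding or removing executors, and the function name and parameters are invented for this example.

```python
import math


def target_executors(pending_tasks: int, running_tasks: int,
                     tasks_per_executor: int,
                     min_executors: int, max_executors: int) -> int:
    """Executors needed for the current load, clamped to the configured bounds."""
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


if __name__ == "__main__":
    # Load over time as (pending, running) task counts: busy, quieter, idle.
    for pending, running in [(100, 0), (10, 2), (0, 0)]:
        print(pending, running,
              target_executors(pending, running, tasks_per_executor=4,
                               min_executors=1, max_executors=10))
```

When the load drops, the executors above the target are the ones a long-running application could hand back to the cluster for other applications to use.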
In this case and similar dilemmas in data engineering, there are many side questions to be answered before choosing one distribution method over another.
For example, if you are not running your processing engine on more than 3 nodes, you are usually not facing a problem too big to handle, so (in my experience) the performance-tuning margin between YARN and Spark Standalone will not settle your decision. Usually you will try to keep your pipeline simple, especially when your services are not managed for you by a cloud provider and bugs and failures happen often.
I choose standalone for relatively small or simple pipelines, but if I'm feeling confident and have a Hadoop cluster already in place, I prefer to take advantage of all the extra configuration that Hadoop (YARN) can give me.
Mesos has a more sophisticated scheduling design, allowing applications like Spark to negotiate with it. It's more suitable for the diversity of applications today. I found this site really insightful:
https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
"... YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm..."

Resources