I'm spinning up an EMR 5.4.0 cluster with Spark installed. I have a job whose performance really degrades if it's scheduled on executors that aren't available (e.g. on a cluster with 2 m3.xlarge core nodes there are about 16 executors available).
Is there any way for my app to discover this number? I can discover the hosts by doing this:
sc.range(1,100,1,100).pipe("hostname").distinct().count(), but I'm hoping there's a better way of getting an understanding of the cluster that Spark is running on.
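One possible starting point, assuming PySpark: `SparkContext.defaultParallelism` on YARN normally reflects the total cores of the executors registered so far (with a minimum of 2), unless `spark.default.parallelism` has been set explicitly. The helper name `suggested_partitions` and the `tasks_per_core` parameter below are hypothetical, and the value can be too low early in the application's life or under dynamic allocation, so treat this as a rough sketch rather than a reliable executor count.

```python
def suggested_partitions(sc, tasks_per_core=2):
    """Return a partition count sized to the cores Spark currently reports.

    `sc` is assumed to be a live pyspark.SparkContext; only its
    `defaultParallelism` attribute is read here.
    """
    # On YARN this is usually the total core count of registered executors,
    # unless spark.default.parallelism overrides it.
    total_cores = sc.defaultParallelism
    # Never return fewer than one partition.
    return max(1, total_cores * tasks_per_core)
```

It could then be used as `rdd.repartition(suggested_partitions(sc))` so the job is not split into far more tasks than the cluster can run at once.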
Related
What is the difference between using 1 and 2 driver cores in Spark YARN cluster mode? If I use 2 driver cores in YARN cluster mode, will the Spark driver be relaunched in case of failure? If so, how many retries will it make before failing?
I would appreciate it if anyone could share an article on this.
When you launch an application in YARN cluster mode, YARN creates a container for your driver.
This container, depending on your application, might need multiple cores and multiple gigabytes of memory. It all depends on how many sessions will connect to your Spark application at the same time and on the complexity of your queries.
If your queries compile slowly or your Spark Web UI/app hangs, it might be worth increasing the core count.
From YARN's point of view, there is still only one driver container.
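As a rough sketch of how these settings are passed, the helper below builds a `spark-submit` command line for YARN cluster mode. The function name, the defaults, and the `my_job.py` path are hypothetical. `--driver-cores` and `--driver-memory` size the driver container, while relaunching after a failure is governed separately by `spark.yarn.maxAppAttempts`, which YARN caps with its own `yarn.resourcemanager.am.max-attempts` setting; the driver core count does not change the retry behaviour.

```python
def yarn_cluster_submit_args(app, driver_cores=1, driver_memory="2g",
                             max_app_attempts=2):
    """Build a spark-submit argument list for YARN cluster mode.

    All defaults are illustrative assumptions, not recommendations.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Size of the single driver container.
        "--driver-cores", str(driver_cores),
        "--driver-memory", driver_memory,
        # How many times YARN may attempt to run the application.
        "--conf", "spark.yarn.maxAppAttempts={}".format(max_app_attempts),
        app,
    ]
```

The resulting list can be handed to `subprocess.run` or joined into a shell command.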
What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Besides authentication, perhaps.
There are a lot of answers on Google, but most of them sound wrong to me, so I'm not sure what the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. For example, YARN's slaves are called node managers. They may consume even more memory than Spark's slaves (Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
This contradicts "Spark Standalone # executor/cores control", which shows how you can specify the amount of resources consumed in Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This is contradicted by the fact that Standalone mode can use Dynamic Allocation, and you can specify spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying that Standalone doesn't support FairScheduler.
This answer
YARN directly handles rack and machine locality
How can YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside a Spark job I'm querying some-db.some-table. How can YARN know which executor is better for job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. Still, it doesn't matter in the case of S3, for example.
I have been prototyping Spark Streaming 1.6.1 using a Kafka receiver on a Mesos 0.28 cluster running in coarse-grained mode.
I have 6 Mesos slaves, each with 64GB RAM and 16 cores.
My Kafka topic has 3 partitions.
My goal is to launch 3 executors in all (each on a different Mesos slave), with each executor having one Kafka receiver reading from one Kafka partition.
When I launch my Spark application with spark.cores.max set to 24 and spark.executor.memory set to 8GB, I get two executors: one with 16 cores on one slave and one with 8 cores on another slave.
I am looking to get 3 executors with 8 cores each on three different slaves. Is that possible with Mesos through resource reservation / isolation, constraints, etc.?
The only workaround that works for me now is to scale down each Mesos slave node to have 8 cores at most. I don't want to use Mesos in fine-grained mode for performance reasons, and its support is going away soon anyway.
Mesosphere has contributed the following patch to Spark: https://github.com/apache/spark/commit/80cb963ad963e26c3a7f8388bdd4ffd5e99aad1a. This improvement will land in Spark 2.0. Mesosphere has backported this and other improvements to Spark 1.6.1 and made it available in DC/OS (http://dcos.io).
This patch introduces a new "spark.executor.cores" config variable in coarse-grained mode. When the "spark.executor.cores" config variable is set, executors will be sized with the specified number of cores.
If an offer arrives with a multiple of (spark.executor.memory, spark.executor.cores), multiple executors will be launched on that offer. This means there could be multiple, but separate, Spark executors running on the same Mesos agent node.
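To make the sizing arithmetic concrete, here is a simplified sketch of how many executors fit into a single offer. It is an illustration only, not Spark's actual scheduler code: the function name and parameters are hypothetical, and per-executor memory overhead and placement across several offers are ignored.

```python
def executors_for_offer(offer_cores, offer_mem_mb,
                        executor_cores, executor_mem_mb,
                        cores_remaining):
    """Estimate how many executors one Mesos offer can hold.

    `cores_remaining` is how much of spark.cores.max is still unused.
    """
    if executor_cores <= 0 or executor_mem_mb <= 0:
        raise ValueError("executor size must be positive")
    by_cores = offer_cores // executor_cores        # limited by offered CPUs
    by_memory = offer_mem_mb // executor_mem_mb     # limited by offered memory
    by_cap = cores_remaining // executor_cores      # limited by spark.cores.max
    return max(0, min(by_cores, by_memory, by_cap))
```

For the cluster in the question (16 cores and 64GB per agent, 8-core executors with 8GB each, spark.cores.max of 24), a single full offer would hold two executors under these assumptions, which is why executors can end up sharing an agent.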
There is no way (currently) to spread the executors across N Mesos agents. We briefly discussed adding the ability to spread Spark executors across N Mesos agents but concluded it doesn't buy much in terms of improved availability.
Can you help us understand your motivations for spreading Spark executors across 3 Mesos agents? It's likely we haven't considered all possible use cases and advantages.
Keith
I currently run a cluster with 4 Spark nodes and 1 Solr node. I want to expand the cluster quickly to 20 nodes and afterwards to around 100. I am just not sure at what cluster size it would make sense to use Mesos or YARN. Does it make sense to add YARN or Mesos when I have fewer than 100 nodes?
Thanks
Mesos and YARN can scale up to thousands of nodes without any issue.
It is the workload that decides which to use: if your workload consists only of Spark or Hadoop jobs/tasks, YARN would be a better choice; if you also have Docker containers or something else to run, Mesos would be a better choice.
There are many other advantages and disadvantages using Mesos, please find them in the comparison here.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
If you have fewer than 100 nodes and you are not going to run any other applications alongside Spark, then a Spark standalone cluster would be a better choice, since YARN or Mesos would be overkill.
Again, it depends on the capabilities you would like to use: if you need queues or schedulers such as the Fair Scheduler, then YARN/Mesos would make sense.
(To use these features or not to use them depends on what you do with the spark cluster, workload and how busy your cluster is.)
I have a Mesos cluster with 1 master and 3 slaves (with 2 cores and 4GB RAM each) that has a Spark application already up and running. I wanted to run another application on the same cluster, as the CPU and memory utilization isn't high. However, when I try to run the new application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new process is not getting any CPU as the old one occupies all 6.
I have tried enabling dynamic allocation, making the Spark app fine-grained, and assigning numerous combinations of executor cores and number of executors. What am I missing here? Is it possible to run a Mesos cluster with multiple Spark frameworks at all?
You can try setting spark.cores.max to limit the number of CPUs used by each Spark driver, which will free up some resources.
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling
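As a small illustration of the suggestion above, the sketch below splits a cluster's cores evenly between applications and returns the matching `spark.cores.max` property. The helper name and the even split are assumptions for the example; a real deployment may want uneven shares.

```python
def cores_max_conf(cluster_cores, concurrent_apps):
    """Return a spark.cores.max setting that gives each app an equal share."""
    if concurrent_apps < 1:
        raise ValueError("need at least one application")
    share = cluster_cores // concurrent_apps
    if share < 1:
        raise ValueError("not enough cores for that many applications")
    # Spark properties are passed as strings, e.g. via --conf key=value.
    return {"spark.cores.max": str(share)}
```

For the 3-slave, 6-core cluster in the question, `cores_max_conf(6, 2)` yields `{"spark.cores.max": "3"}`, which each application could pass as `--conf spark.cores.max=3`. The setting only takes effect for an application when it starts, so the already running application would need to be restarted with the cap.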