What is the benefit of using more than 1 driver core in Spark YARN cluster mode? - apache-spark

What is the difference between using 1 vs 2 driver cores in Spark YARN cluster mode? If I use 2 driver cores in YARN cluster mode, will the Spark driver be relaunched in case of failure? If so, how many retries would it make before failing?
I'd appreciate it if anyone could share an article on this.

When you launch an application in YARN cluster mode, YARN creates a container for your driver.
This container - depending on your application - might need multiple cores and multiple gigs of memory. It all depends on how many sessions connect to your Spark application at the same time and on the complexity of your queries.
If your queries seem to compile slowly or the Spark Web UI/app hangs, it might be worth increasing the driver core count.
From YARN's point of view, there is still only one driver container.
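A minimal sketch of how the driver container's resources could be requested at submit time, assuming submission through Spark's SparkLauncher API; the jar path, main class, and attempt count are illustrative assumptions. As a side note on the retry part of the question: spark.yarn.maxAppAttempts (capped by YARN's yarn.resourcemanager.am.max-attempts) is the setting that typically governs driver relaunches in cluster mode, and it is independent of the driver core count.

    import org.apache.spark.launcher.SparkLauncher

    object SubmitWithDriverCores {
      def main(args: Array[String]): Unit = {
        val submit = new SparkLauncher()
          .setAppResource("/path/to/my-app.jar")   // hypothetical jar
          .setMainClass("com.example.MyApp")       // hypothetical main class
          .setMaster("yarn")
          .setDeployMode("cluster")
          // Cores and memory requested for the driver container YARN starts.
          .setConf("spark.driver.cores", "2")
          .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
          // In cluster mode the driver lives inside the ApplicationMaster, so
          // relaunch-on-failure is bounded by the app attempt limit (and by
          // YARN's own yarn.resourcemanager.am.max-attempts).
          .setConf("spark.yarn.maxAppAttempts", "2")
          .launch()
        submit.waitFor()
      }
    }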

Related

Spark client memory configuration

I'm trying to run multiple Spark clients on Airflow (an ETL scheduler).
I'm running in cluster mode on YARN, so the ApplicationMaster, executors, and driver all run inside YARN containers.
However, my Spark client, which polls the process and monitors its state, runs on the Airflow worker.
The problem is that the Spark client takes a lot of memory, ~500 MB per job. That may not sound like much for an executor or a driver, but for the role of a Spark client it sounds crazy.
My question is: how can I configure/limit the Spark client's memory/CPU requirements? Can I limit its polling interval? Can I limit its memory with flags?
In the Spark code there is a distinction between running in standalone mode and cluster mode. For standalone it sets a default of -Xmx 1G, while in cluster mode there is no default; instead it tries to read Java options from an environment variable called SPARK_SUBMIT_OPTS.
So if you want to set any Java options for the client Java process only, use SPARK_SUBMIT_OPTS.
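A minimal sketch of that, assuming the job is launched programmatically through Spark's SparkLauncher (which spawns spark-submit as a child process and hands it the given environment); the heap size, jar path, and main class are illustrative assumptions:

    import java.util.{HashMap => JHashMap}
    import org.apache.spark.launcher.SparkLauncher

    object SmallClientSubmit {
      def main(args: Array[String]): Unit = {
        // Environment passed only to the spawned spark-submit (client) JVM;
        // the driver and executors keep their own memory settings.
        val env = new JHashMap[String, String]()
        env.put("SPARK_SUBMIT_OPTS", "-Xmx512m")

        val submit = new SparkLauncher(env)
          .setAppResource("/path/to/etl-job.jar")  // hypothetical jar
          .setMainClass("com.example.EtlJob")      // hypothetical main class
          .setMaster("yarn")
          .setDeployMode("cluster")
          .launch()
        submit.waitFor()
      }
    }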

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications, aside from authentication?
There are a lot of answers on Google; most of them sound wrong to me, so I'm not sure where the truth lies.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for bigger clusters (there is an overhead of running Spark daemons — master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. For example, YARN's slaves are called NodeManagers, and they may consume even more memory than Spark's slaves (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use
against Spark Standalone # executor/cores control, which shows how you can specify the number of consumed resources in Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO scheduler across applications.
Against the fact that Standalone mode can use Dynamic Allocation, where you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors (see the sketch after this question). Also, I haven't found any note saying that Standalone doesn't support the FairScheduler.
This answer
YARN directly handles rack and machine locality
How would YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore). Inside the Spark job I'm querying some-db.some-table. How would YARN know which executor is better suited for task assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.
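For reference, a minimal sketch of the Dynamic Allocation settings mentioned above against a Standalone master; the master URL, executor bounds, and the shuffle-tracking option (available on newer Spark releases) are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    object StandaloneDynamicAllocation {
      def main(args: Array[String]): Unit = {
        // Master URL and executor bounds are illustrative assumptions.
        val spark = SparkSession.builder()
          .appName("standalone-dynamic-allocation")
          .master("spark://standalone-master:7077")
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.dynamicAllocation.minExecutors", "1")
          .config("spark.dynamicAllocation.maxExecutors", "8")
          // Dynamic allocation also needs either the external shuffle service
          // (spark.shuffle.service.enabled=true on the workers) or, on newer
          // Spark releases, executor-side shuffle tracking:
          .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
          .getOrCreate()

        spark.range(1000).count()  // trivial action so executors get requested
        spark.stop()
      }
    }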

Avoid CPU pegging on Spark Standalone

I have a daily pipeline running on Spark Standalone 2.1. It's deployed to and runs on AWS EC2 and uses S3 for its persistence layer. For the most part the pipeline runs without a hitch, but occasionally the job hangs on a single worker node during a reduceByKey operation. When I log into the worker, I notice that the CPU (as seen via top) is pegged at 100%. My remedy so far is to reboot the worker node so that Spark reassigns the task, and the job proceeds fine from there.
I would like to be able to mitigate this issue. I gather that I could prevent CPU pegging by switching to YARN as my cluster manager, but I wonder whether I could configure Spark Standalone to prevent CPU pegging, maybe by limiting the number of cores that get assigned to the Spark job? Any suggestions would be greatly appreciated.
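One knob that exists for this in Standalone, sketched under assumptions (the master URL and the numbers are made up): spark.cores.max caps the total cores the application takes across the cluster and spark.executor.cores caps each executor, which keeps one job from saturating a whole worker, though it won't by itself unstick an already pegged task:

    import org.apache.spark.sql.SparkSession

    object CappedCoresPipeline {
      def main(args: Array[String]): Unit = {
        // Master URL and limits are illustrative assumptions.
        val spark = SparkSession.builder()
          .appName("capped-cores-pipeline")
          .master("spark://standalone-master:7077")
          // Total cores this application may take across the whole cluster.
          .config("spark.cores.max", "8")
          // Cores per executor, so no single worker gets fully saturated.
          .config("spark.executor.cores", "2")
          .getOrCreate()

        // ... pipeline logic, e.g. the reduceByKey stage mentioned above ...
        spark.stop()
      }
    }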

Do any of the executors run on the driver node in cluster deploy mode?

While running a program in cluster mode, does any executor also run on the node on which the driver program is running?
The following text explains cluster mode:
https://spark.apache.org/docs/latest/cluster-overview.html
but it doesn't answer this question.
Thanks
Anuj
This depends on the cluster manager implementation, configuration, and requested resources. In general, the cluster manager is free to start multiple containers on the same physical node.
So, without additional assumptions, the driver can be, but doesn't have to be, colocated with one or more executors.
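A rough way to check what happened in a particular run, assuming a Scala job: sc.getExecutorMemoryStatus lists one block manager per executor plus the driver's own, so counting entries per host shows whether an executor landed on the driver's node. A minimal sketch:

    import java.net.InetAddress
    import org.apache.spark.sql.SparkSession

    object DriverColocationCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("driver-colocation-check").getOrCreate()
        val sc = spark.sparkContext

        // One "host:port" entry per registered block manager, the driver's included.
        val hosts = sc.getExecutorMemoryStatus.keys.toSeq.map(_.split(":")(0))
        val driverHost = InetAddress.getLocalHost.getHostName

        println(s"Driver host (as this JVM sees it): $driverHost")
        println(s"Block managers per host: ${hosts.groupBy(identity).mapValues(_.size).toMap}")
        // If the driver's host appears more than once, at least one executor
        // shares the driver's node. Hostname vs. IP mismatches can make the
        // comparison imprecise, so treat this as a rough check.
        spark.stop()
      }
    }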

Get number of available executors

I'm spinning up an EMR 5.4.0 cluster with Spark installed. I have a job whose performance really degrades if it's scheduled on executors that aren't available (e.g., on a cluster with 2 m3.xlarge core nodes there are about 16 executors available).
Is there any way for my app to discover this number? I can discover the hosts by doing this:
sc.range(1,100,1,100).pipe("hostname").distinct().count(), but I'm hoping there's a better way of getting an understanding of the cluster that Spark is running on.
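A couple of driver-side ways to read this off a running SparkContext, sketched for a Scala app; getExecutorMemoryStatus counts the driver's block manager too, hence the minus one:

    import org.apache.spark.sql.SparkSession

    object ExecutorCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("executor-count").getOrCreate()
        val sc = spark.sparkContext

        // Block managers registered with the driver, minus the driver's own.
        val viaMemoryStatus = sc.getExecutorMemoryStatus.size - 1

        // Status-tracker view; depending on the Spark version this list may
        // also include the driver.
        val viaStatusTracker = sc.statusTracker.getExecutorInfos.length

        println(s"Executors via getExecutorMemoryStatus: $viaMemoryStatus")
        println(s"Executors via statusTracker: $viaStatusTracker")

        // Note: right after startup the counts can lag while YARN/EMR is still
        // granting containers, so poll or wait briefly before relying on them.
        spark.stop()
      }
    }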
