GCP Dataproc : CPUs and Memory for Spark Job - apache-spark

I am completely new to GCP. Is it the user who must manage the amount of allocated memory for the driver and workers and the number of CPUs to run the Spark job in a Dataproc cluster? If yes, what are the aspects of Elasticity for Dataproc usage?
Thank you.

Usually you don't, resources of a Dataproc cluster are managed by YARN, Spark jobs are automatically configured to make use of them. In particular, Spark dynamic allocation is enabled by default. But your application code still matters, e.g., you need to specify the appropriate number of partitions.

Related

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers at Google, pretty much of them sounds wrong to me, so I'm not sure where is the truth.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. I.e. YARN's slaves are called node managers. They may consume even more memory than Spark's slaves (Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
agains Spark Standalone # executor/cores control, that shows how you can specify number of consumed resources at Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
Against the fact Standalone mode can use Dynamic Allocation, and you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors. Also I haven't found a note about Standalone doesn't support FairScheduler.
This answer
YARN directly handles rack and machine locality
How does YARN may know anything about data locality in my job? Suppose, I'm storing file locations at AWS Glue (used by EMR as Hive metastore). Inside Spark job I'm querying some-db.some-table. How YARN may know what executor is better for job assignment?
UPD: found another mention about YARN and data locality https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. Still doesn't matter in case of S3 for example.

How does spark choose nodes to run executors?(spark on yarn)

How does spark choose nodes to run executors?(spark on yarn)
We use spark on yarn mode, with a cluster of 120 nodes.
Yesterday one spark job create 200 executors, while 11 executors on node1,
10 executors on node2, and other executors distributed equally on the other nodes.
Since there are so many executors on node1 and node2, the job run slowly.
How does spark select the node to run executors?
according to yarn resourceManager?
As you mentioned Spark on Yarn:
Yarn Services choose executor nodes for spark job based on the availability of the cluster resource. Please check queue system and dynamic allocation of Yarn. the best documentation https://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
Cluster Manager allocates resources across the other applications.
I think the issue is with bad optimized configuration. You need to configure Spark on the Dynamic Allocation. In this case Spark will analyze cluster resources and add changes to optimize work.
You can find all information about Spark resource allocation and how to configure it here: http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Are all 120 nodes having identical capacity?
Moreover the jobs will be submitted to a suitable node manager based on the health and resource availability of the node manager.
To optimise spark job, You can use dynamic resource allocation, where you do not need to define the number of executors required for running a job. By default it runs the application with the configured minimum cpu and memory. Later it acquires resource from the cluster for executing tasks. It will release the resources to the cluster manager once the job has completed and if the job is idle up to the configured idle timeout value. It reclaims the resources from the cluster once it starts again.

Spark Submit Configuration while running parallel jobs in EMR

We are currently running parallel Spark jobs on an EMR cluster using HadoopActivity task from Datapipeline. By default, the newer versions of EMR clusters sets spark dynamic allocation to true which will increase/ reduce the number of executors required based on the load. So do we need to set any other property along with spark-submit e.g. number of cores, executor memory etc. or its best to have EMR cluster handle it dynamically?
This always depends of how you application is working. I can give you an good example of how I work here. For the Data Scientists in general they use the default configuration and it works pretty well due to they use Jupyter here to run their models. The only thing that we setup that can be useful for you is the conf spark.dynamicAllocation.minExecutors this allow to setup at least two or one worker for the job. To not be without any executor. That is what we do with the Data Scientists.
But, EMR has one specific type of configuration for each type of machine you choose. So in general it is optimized for the most common activities. But sometimes you need to change according your request, if you need more memory and less cores for skewed data that is better to change.

How does spark dynamic resource allocation work on YARN (with regards to NodeManagers)?

Let's assume that I have 4 NM and I have configured spark in yarn-client mode. Then, I set dynamic allocation to true to automatically add or remove a executor based on workload. If I understand correctly, each Spark executor runs as a Yarn container.
So, if I add more NM will the number of executors increase ?
If I remove a NM while a Spark application is running, something will happen to that application?
Can I add/remove executors based on other metrics ? If the answer is yes, there is a function, preferably in python,that does that ?
If I understand correctly, each Spark executor runs as a Yarn container.
Yes. That's how it happens for any application deployed to YARN, Spark including. Spark is not in any way special to YARN.
So, if I add more NM will the number of executors increase ?
No. There's no relationship between the number of YARN NodeManagers and Spark's executors.
From Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand.
As you may have guessed correctly by now, it is irrelevant how many NMs you have in your cluster and it's by the workload when Spark decides whether to request new executors or remove some.
If I remove a NM while a Spark application is running, something will happen to that application?
Yes, but only when Spark uses that NM for executors. After all, NodeManager gives resources (CPU and memory) to a YARN cluster manager that will in turn give them to applications like Spark applications. If you take them back, say by shutting the node down, the resource won't be available anymore and the process of a Spark executor simply dies (as any other process with no resources to run).
Can I add/remove executors based on other metrics ?
Yes, but usually it's Spark job (no pun intended) to do the calculation and requesting new executors.
You can use SparkContext to manage executors using killExecutors, requestExecutors and requestTotalExecutors methods.
killExecutor(executorId: String): Boolean Request that the cluster manager kill the specified executor.
requestExecutors(numAdditionalExecutors: Int): Boolean Request an additional number of executors from the cluster manager.
requestTotalExecutors(numExecutors: Int, localityAwareTasks: Int, hostToLocalTaskCount: Map[String, Int]): Boolean Update the cluster manager on our scheduling needs.

Resource allocation with Apache Spark and Mesos

So I've deployed a cluster with Apache Mesos and Apache Spark and I've several jobs that I need to execute on the cluster. I would like to be sure that a job has enough resources to be successfully executed, and if it is not the case, it must return an error.
Spark provides several settings like spark.cores.max and spark.executor.memory in order to limit the resources used by the job, but there is no settings for the lower bound (e.g. set the minimal number of core to 8 for the job).
I'm looking for a way to be sure that a job has enough resources before it is executed (during the resource allocation for instance), do you know if it is possible to get this information with Apache Spark on Mesos?

Resources