Apache Spark: Can you start a worker in standalone mode with more cores or memory than physically available?

I am talking about the standalone mode of Spark.
Let's say SPARK_WORKER_CORES=200 and there are only 4 cores available on the node where I am trying to start the worker. Will the worker get 4 cores and continue, or will the worker not start at all?
A similar case: what if I set SPARK_WORKER_MEMORY=32g and there is only 2g of memory actually available on that node?

"Cores" in Spark is something of a misnomer. "Cores" actually corresponds to the number of threads created to process data, so for both workers and executors you can set an arbitrarily large number of cores without an explicit failure. That being said, overcommitting by 50x will likely lead to incredibly poor performance due to context switching and overhead costs. In practice, in Spark Standalone, I've generally seen this overcommitted by no more than 2-3x the number of logical cores.
When it comes to specifying worker memory, once again, you can in theory increase it to an arbitrarily large number. This is because, for a worker, the memory amount specifies how much it is allowed to allocate for executors, but it doesn't explicitly allocate that amount when you start the worker. Therefore you can make this value much larger than physical memory.
The caveat here is that when you start up an executor, if you set the executor memory to be greater than the amount of physical memory, your executors will fail to start. This is because executor memory directly corresponds to the -Xmx setting of the Java process for that executor.
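To put a number on the overcommit discussed above, here is a minimal sketch that computes how many worker "cores" (task threads) are declared per logical core. The function name overcommit_ratio is hypothetical and is not part of Spark; it only illustrates the arithmetic behind the "50x" and "2-3x" figures.

```python
def overcommit_ratio(configured_cores: int, logical_cores: int) -> float:
    """Return declared worker cores (task threads) per logical core on the node."""
    if logical_cores <= 0:
        raise ValueError("logical_cores must be positive")
    return configured_cores / logical_cores


# The case from the question: SPARK_WORKER_CORES=200 on a 4-core node.
print(overcommit_ratio(200, 4))

# A more typical overcommit, within the 2-3x range mentioned in the answer.
print(overcommit_ratio(8, 4))
```

On a real node, the second argument could be taken from os.cpu_count(), which reports logical cores.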

Related

How does Spark manage I/O performance if we reduce the number of cores per executor and increase the number of executors?

As per my research, whenever we run a Spark job we should not run executors with more than 5 cores; if we increase the cores beyond that limit, the job will suffer from bad I/O throughput.
My doubt is this: if we increase the number of executors and reduce the cores, these executors will still end up on the same physical machine, and they will be reading from and writing to the same disk. Why will this not cause an I/O throughput issue?
For reference, consider the use case in "Apache Spark: The number of cores vs. the number of executors".
The cores within an executor are like threads. More work gets done as we increase parallelism, but we should always keep in mind that there is a limit to it, because we have to gather the results from those parallel tasks.

What happens if I allocate all the available cores on the server to a Spark cluster?

As is well known, it is possible to increase the number of cores when submitting an application. I'm trying to allocate all available cores on the server to the Spark application, and I'm wondering what will happen to the performance. Will it get worse or be better than usual?
The first thing that comes to mind about allocating cores (--executor-cores) is that more cores per executor means more parallelism, more tasks executed concurrently, and therefore better performance. But that is not true for the Spark ecosystem. After leaving 1 core for the OS and other applications running on the worker, studies have shown that it is optimal to allocate 5 cores to each executor.
For example, if you have a worker node with 16 cores, the optimal total executors and cores per executor will be --num-executors 3 and --executor-cores 5 (as 5*3=15) respectively.
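The arithmetic in the example above can be sketched as follows. This is only an illustration of the rule of thumb from the answer (reserve 1 core, then 5 cores per executor); the function name executor_layout and its parameters are hypothetical, not a Spark API.

```python
def executor_layout(node_cores: int, reserved_cores: int = 1, cores_per_executor: int = 5):
    """Return (num_executors, cores_per_executor, idle_cores) for one worker node."""
    usable = node_cores - reserved_cores
    if usable < 1:
        raise ValueError("no cores left for executors after reserving for the OS")
    # On small nodes, fall back to a single executor that uses whatever is left.
    cores = min(cores_per_executor, usable)
    num_executors = usable // cores
    idle = usable - num_executors * cores
    return num_executors, cores, idle


# The 16-core node from the answer: 15 usable cores, 3 executors of 5 cores each.
print(executor_layout(16))  # (3, 5, 0)
```

The first two values map to --num-executors and --executor-cores for a single node. On a multi-node cluster the executor count would be multiplied by the number of worker nodes.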
Performance depends not only on optimal resource allocation but also on how transformations and actions are performed on DataFrames. More shuffling of data between executors hurts performance.
Your operating system always needs resources for its basic needs.
It is good to keep 1 core and 1 GB of memory for the operating system and for other applications.
If you allocate all resources to Spark, it will not improve your performance, and your other applications will starve for resources.
I don't think it is a good idea to allocate all resources to Spark alone.
See the post below if you want to tune your Spark cluster:
How to tune spark executor number, cores and executor memory?

Is there such a thing as too many executors in Spark?

I'm working with a Spark/YARN cluster that limits the resources I can allocate to 8GB memory and 1 core per container, but I can allocate hundreds, perhaps even thousands of executors to run my application on.
However since the driver has similar resource limitations (8GB memory, 4 cores), I'm concerned that too many executors may overwhelm the driver and cause timeouts.
Is there a rule of thumb for sizing the driver memory and cores to handle large numbers of executors?
There are rules on how to size your executors.
A driver with 8 GB and 4 cores should be able to handle thousands of executors easily, as it only maintains bookkeeping metadata for the executors.
This assumes you are not calling functions like collect() in your Spark code.
A Spark code analysis will help you understand which actions in Spark are performed where: http://bytepadding.com/big-data/spark/spark-code-analysis/

What kind of data is spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead used to store?

I was wondering:
What kind of data does Spark store in spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead?
And in which cases should I increase the value of spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead?
In YARN terminology, executors and application masters run inside “containers”. Spark offers YARN-specific properties so you can run your application:
spark.yarn.executor.memoryOverhead is the amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead is the amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode, with the same memory properties as the executor's memoryOverhead.
So it's not about storing data, it's just the resources needed for YARN to run properly.
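To make the sizing concrete, the sketch below estimates the YARN container request for one executor as heap plus overhead. The defaults used here (a 10% factor with a 384 MB floor) are an assumption based on the defaults documented for recent Spark releases; older releases used a different factor, and the answer above only says "typically 6-10%". The function name yarn_container_mb is hypothetical.

```python
def yarn_container_mb(executor_memory_mb: int,
                      overhead_factor: float = 0.10,
                      min_overhead_mb: int = 384) -> int:
    """Estimate the container size requested from YARN: heap plus memory overhead.

    overhead_factor and min_overhead_mb are assumed defaults, not read from Spark.
    """
    overhead = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    return executor_memory_mb + overhead


# An 8 GB executor: the 10% overhead (819 MB) exceeds the floor.
print(yarn_container_mb(8192))  # 9011

# A 2 GB executor: 10% would be 204 MB, so the 384 MB floor applies.
print(yarn_container_mb(2048))  # 2432
```

This is why a container limit of exactly 8 GB cannot hold an executor started with --executor-memory 8g: the overhead has to fit inside the container as well.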
In some cases, e.g. if you enable dynamic allocation, you might want to set these properties explicitly, along with the maximum number of executors (spark.dynamicAllocation.maxExecutors) that can be created during the process. Otherwise the application can easily overwhelm YARN by asking for thousands of executors, and thus lose the executors that are already running.
spark.dynamicAllocation.maxExecutors sets the upper bound for the number of executors when dynamic allocation is enabled, and it defaults to infinity. [ref.http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation]
According to the code documentation : [ref.https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L43]
Increasing the target number of executors happens in response to backlogged tasks waiting to be scheduled. If the scheduler queue is not drained in N seconds, then new executors are added. If the queue persists for another M seconds, then more executors are added and so on. The number added in each round increases exponentially from the previous round until an upper bound has been reached. The upper bound is based both on a configured property and on the current number of running and pending tasks, as described above.
This can lead to an exponential increase in the number of executors in some cases, which can break the YARN resource manager. In my case:
16/03/31 07:15:44 INFO ExecutorAllocationManager: Requesting 8000 new executors because tasks are backlogged (new desired total will be 40000)
This doesn't cover every use case for these properties, but it gives a general idea of how they are used.
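The ramp-up described in the quoted code documentation can be illustrated with a small simulation. This is a simplified sketch, not Spark's implementation: it assumes the batch size doubles every round (one reading of "increases exponentially") and that the only limit is a fixed cap, whereas the real allocation manager also bounds requests by the number of running and pending tasks. The function name ramp_up is hypothetical.

```python
def ramp_up(max_executors: int, first_batch: int = 1):
    """Simulate rounds of executor requests while tasks stay backlogged.

    Returns a list of (requested_this_round, new_total) tuples. The batch size
    doubles each round (an assumption) until max_executors is reached.
    """
    total = 0
    batch = first_batch
    rounds = []
    while total < max_executors:
        add = min(batch, max_executors - total)  # never exceed the cap
        total += add
        rounds.append((add, total))
        batch *= 2
    return rounds


# With a cap of 40 executors, the cap is reached in six rounds.
print(ramp_up(40))  # [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31), (9, 40)]

# With a cap of 40000 (the desired total in the log line), it takes 16 rounds.
print(len(ramp_up(40000)))  # 16
```

Under these assumptions a very large or infinite cap is reached in few rounds, which is why setting spark.dynamicAllocation.maxExecutors explicitly matters.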

Partitioning the RDD for Spark Jobs

When I submit a Spark job on a YARN cluster, the Spark UI shows 4 stages for the job, but memory usage is very low on all nodes: it says 0 out of 4 GB used. I guess that might be because I left the default partitioning.
File sizes range between 1 MB and 100 MB in S3. There are around 2700 files with a total size of 26 GB, and exactly 2700 jobs were running in stage 2.
Is it worth repartitioning to something around 640 partitions? Would it improve the performance? Or
Does it not matter if the partitioning is more granular than actually required? Or
Do my submit parameters need to be addressed?
Cluster details,
Cluster with 10 nodes
Overall memory 500 GB
Overall vCores 64
--executor-memory 16g
--num-executors 16
--executor-cores 1
Actually it runs on 17 cores out of 64. I don't want to increase the number of cores since others might use the cluster.
You partition and repartition for the following reasons:
To make sure we have enough work to distribute to the distinct cores in our cluster (nodes * cores_per_node). Obviously we need to tune the number of executors, cores per executor, and memory per executor to make that happen as intended.
To make sure we distribute work evenly: the smaller the partitions, the smaller the chance that one core has much more work to do than all the other cores. Skewed distribution can have a huge effect on total elapsed time if the partitions are too big.
To keep partitions at manageable sizes: not too big and not too small, so we don't overtax GC. Bigger partitions might also cause issues when an operation has non-linear complexity (big O).
Partitions that are too small will create too much processing overhead.
As you might have noticed, there is a Goldilocks zone. Testing will help you determine the ideal partition size.
Note that it is OK to have many more partitions than cores. Queuing partitions to be assigned a task is something that I design for.
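As a starting point for the testing suggested above, the trade-offs can be sketched as a simple heuristic: enough partitions to keep each one near a target size, and at least a few per core so that work can be queued. The 128 MB target and the 2-partitions-per-core floor are hypothetical starting values, not figures from this thread, and the function name suggest_partitions is likewise only illustrative.

```python
import math


def suggest_partitions(total_bytes: int,
                       target_partition_bytes: int,
                       total_cores: int,
                       min_per_core: int = 2) -> int:
    """Suggest a partition count from data size and available cores.

    Takes the larger of a size-based count (keeps partitions near the target
    size) and a core-based count (keeps every core supplied with work).
    """
    by_size = math.ceil(total_bytes / target_partition_bytes)
    by_cores = total_cores * min_per_core
    return max(by_size, by_cores)


GB = 1024 ** 3
MB = 1024 ** 2

# The case from the question: 26 GB of input, 16 executors with 1 core each.
print(suggest_partitions(26 * GB, 128 * MB, 16))  # 208
```

The result would be passed to something like rdd.repartition(n) or df.repartition(n), and then adjusted based on measured run times.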
Also make sure you configure your Spark job properly:
Make sure you do not have too many executors. One or very few executors per node is more than enough. Fewer executors will have less overhead, as they work in a shared memory space and individual tasks are handled by threads instead of processes. There is a huge amount of overhead in starting up a process, but threads are pretty lightweight.
Tasks need to talk to each other. If they are in the same executor, they can do that in memory. If they are in different executors (processes), it happens over a socket (overhead). If it is across multiple nodes, it happens over a traditional network connection (more overhead).
Assign enough memory to your executors. When using YARN as the scheduler, it will by default fit the executors by their memory, not by the CPU you declare to use.
I do not know what your situation is (you made the node names invisible), but if you only have a single node with 15 cores, then 16 executors do not make sense. Instead, set it up with one executor and 16 cores per executor.

Resources