Max Executor failures in Spark dynamic allocation - apache-spark

I am using dynamic allocation feature of spark to run my spark job. It allocates around 50-100 executors. For some reason few executors are lost resulting in shutting down the job. Log shows that this happened due to max executor failures reached. It is set to 3 by default. Hence when 3 executors are lost the job gets killed even if other 40-50 executors are running.
I know that I can change the max executor failure limit but this seems like a workaround. Is there something else that I can try. All suggestions are welcome.

Related

Spark UI Executor

In Spark UI, there are 18 executors are added and 6 executors are removed. When I checked the executor tabs, I've seen many dead and excluded executors. Currently, dynamic allocation is used in EMR.
I've looked up some postings about dead executors but these mostly related with job failure. For my case, it seems that the job itself is not failed but can see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) then what would be good way to improve the performance?
With dynamic alocation enabled spark is trying to adjust number of executors to number of tasks in active stages. Lets take a look at this example:
Job started, first stage is read from huge source which is taking some time. Lets say that this source is partitioned and Spark generated 100 task to get the data. If your executor has 5 cores, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel)
Lets say that on next step you are doing repartitioning or sor merge join, with shuffle partitions set to 200 spark is going to generated 200 tasks. He is smart enough to figure out that he has currently only 100 cores avilable so if new resources are avilable he will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel)
Now the join is done, in next stage you have only 50 partitions, to calculate this in parallel you dont need 40 executors, 10 is ok (10 executors x 5 cores = 50 tasks in paralell). Right now if process is taking enough of time Spark can free some resources and you are going to see deleted executors.
Now we have next stage which involves repartitioning. Number of partitions equals to 200. Withs 10 executors you can process in paralell only 50 partitions. Spark will try to get new executors...
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
Here you will find some hints, from my experience it is worth to set maxExecutors if you are going to run few jobs in parallel in the same cluster as most of the time it is not worth to starve other jobs just to get 100% efficiency from one job

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded i

I am executing a Spark job in Databricks cluster. I am triggering the job via a Azure Data Factory pipeline and it execute at 15 minute interval so after the successful execution of three or four times it is getting failed and throwing with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded".
Though there are many answer with for the above said question but in most of the cases their jobs are not running but in my cases it is getting failed after successful execution of some previous jobs.
My data size is less than 20 MB only.
My cluster configuration is:
So the my question is what changes I should make in the server configuration. If the issue is coming from my code then why it is getting succeeded most of the time. Please advise and suggest me the solution.
This is most probably related to executor memory being bit low .Not sure what is current setting and if its default what is the default value in this particular databrics distribution. Even though it passes but there would lot of GCs happening because of low memory hence it would keep failing once in a while . Under spark configuration please provide spark.executor.memory and also some other params related to num of executors and cores per executor . In spark-submit the config would be provided as spark-submit --conf spark.executor.memory=1g
You may try increasing memory of driver node.
Sometimes the Garbage Collector is not releasing all the loaded objects in the driver's memory.
What you can try is to force the GC to do that. You can do that by executing the following:
spark.catalog.clearCache()
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
rdd.unpersist()
print("Unpersisted {} rdd".format(id))

Spark Streaming memory usage doubts

I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the rest endpoint for Spark AllExecutors
and just summing up the metrics totalDuration * spark.executor.memory for every executor and then emitting the final sum as the memory usage.
But this is coming out to be very small for application which ran whole day , is something wrong with the logic.Also I am using dynamic allocation and executorIdleTimeout is 5 seconds.
Also I am also assuming that if some executor was removed for due to idle timeout and then was allocated to some other task then its totalDuration will be increased by the amount of time took by the executor to execute this new task.
But what is unusual is that as the duration for execution of increases the memory usage metrics is decreasing.

how spark dynamic allocation clears queued task

I can open new vm on the fly based on that I am asking this question.
I am using spark dynamic allocation when I used
spark.dynamicAllocation.minExecutors=10 on sudden burst of data spark opens new executors very slowly results in long queues
when I changed spark.dynamicAllocation.minExecutors=200 to larger number on sudden burst it opens new executors very fast and the queue can clear up.
My question is do we have to set this to high value for such situation.
Vipin,
When you set the Dynamic Allocation in spark, as I can see you enable it and set the min of executors. But, when you need 200 executor to be faster the allocation has one configuration called spark.dynamicAllocation.schedulerBacklogTimeout this for default has 1s of timeout.
This mean that after 1s if you task didn't finished a task it will allocate more executors.
According to the documentation in spark, that says:
Spark requests executors in rounds. The actual request is triggered when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter if the queue of pending tasks persists. Additionally, the number of executors requested in each round increases exponentially from the previous round. For instance, an application will add 1 executor in the first round, and then 2, 4, 8 and so on executors in the subsequent rounds.
So for every seccond Spark allocate 2^n for n secconds of delay. To reach the 200 executor you need to wait at least 8 secconds to request the executors to Yarn. And few more secconds to solve that.
Maybe if you raise the number of cores it will help you. But if you are using the full cores of each node... Well thre is no solution.

SPARK: YARN kills containers for exceeding memory limits

We're currently encountering an issue where Spark jobs are seeing a number of containers being killed for exceeding memory limits when running on YARN.
16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
The following arguments are being passed via spark-submit:
--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"`
I am using Spark 2.0.1.
We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).
Given my parameters and the log message it does seem that "Yarn kills executors when its memory usage is larger than (executor-memory + executor.memoryOverhead)".
It is not practical to continue increasing this overhead in the hope that eventually we find a value at which these errors do not occur. We are seeing this issue on several different jobs. I would appreciate any suggestions as to parameters I should change, things I should check, where I should start looking to debug this etc. Am able to provide further config options etc.
You can reduce the memory usage with the following configurations in spark-defaults.conf:
spark.default.parallelism
spark.sql.shuffle.partitions
And there is a difference when you use more than 2000 partitions for spark.sql.shuffle.partitions. You can see it in the code of spark on Github:
private[spark] object MapStatus {
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}
I recommend to try to use more than 2000 Partitions for a test. It could be faster some times, when you use very huge datasets. And according to this your tasks can be short as 200 ms. The correct configuration is not easy to find, but depending on your workload it can make a difference of hours.

Resources